Hypothesis Space - Albert Masoliver's learning site

## Definition The **hypothesis space** $\mathcal{H}$ of a learning algorithm is the set of all functions $h: X \to Y$ that the algorithm can possibly output. Choosing a learning algorithm is fundamentally choosing a hypothesis space. ## Examples - **Linear regression** — $\mathcal{H} = \{h(x) = w^\top x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$. Constrained: cannot represent non-linear relationships. - **Polynomial regression of degree $k$** — all polynomials up to degree $k$. - **Decision trees of depth $d$** — all trees with depth $\leq d$. - **Neural networks with architecture $A$** — all parameter settings of $A$. Very large; capable of approximating arbitrary continuous functions (see Universal Approximation Theorem). ## The Capacity-Generalisation Trade-off - **Too small a hypothesis space:** cannot capture the true relationship — *underfitting*. - **Too large:** can fit the training data perfectly but generalises poorly — *overfitting*. This is the [[Bias-Variance Tradeoff]] from a different angle. ## VC Dimension The **Vapnik-Chervonenkis (VC) dimension** of $\mathcal{H}$ is the size of the largest set of points that $\mathcal{H}$ can *shatter* — i.e., realise every possible labelling of those points. Higher VC dim → richer hypothesis space → more samples needed to bound generalisation error. For example: - Linear classifiers in $\mathbb{R}^d$ have VC dim $d + 1$. - Decision trees of depth $d$ have VC dim $\Theta(2^d)$. VC dim provides PAC-learning bounds: with $n$ samples and confidence $1 - \delta$, the generalisation error is at most $ \hat\epsilon + O\left(\sqrt{\frac{d_{VC} \log(n) + \log(1/\delta)}{n}}\right) $ ## Relation to Inductive Bias The hypothesis space *is* the structural part of the algorithm's [[Inductive Bias]]: the constraint that some hypotheses are even possible. The other half of the bias — preference among hypotheses in $\mathcal{H}$ — comes from regularisation and prior knowledge. ## Modern Caveats For very large hypothesis spaces (modern deep networks), classical capacity measures like VC dim are loose to the point of uselessness. Networks with millions of parameters can shatter most reasonable datasets, yet generalise. Why this happens is an active research area (the "double-descent" curve, implicit regularisation of SGD, neural tangent kernels). For practical purposes, the hypothesis-space framing still guides intuition. ## Related - [[Machine Learning]] - [[Bias-Variance Tradeoff]] - [[Overfitting and Underfitting]] - [[Inductive Bias]] - [[Generalization]]