Model selection

A key problem in the design of learning algorithms is the choice of the hypothesis set $\mathcal H$; this is the model selection problem. How should $\mathcal H$ be chosen? A sufficiently rich or complex hypothesis set could contain the ideal Bayes classifier. On the other hand, learning with such a complex family is a very difficult task. More generally, the choice of $\mathcal H$ is subject to a trade-off that can be analyzed in terms of estimation and approximation errors.

Our discussion will focus on the special case of binary classification, but much of what is discussed can be directly extended to other tasks and loss functions.

4.1 Estimation and approximation

Let $\mathcal H$ be a family of functions mapping $\mathcal X$ to $\{-1, +1\}$. The excess error of a hypothesis $h$ chosen from $\mathcal H$, that is, the difference between its error $R(h)$ and the Bayes error $R^*$, can be decomposed as follows:


$$R(h)-R^*=\left(R(h)-\underset{h\in\mathcal H}{\inf}\ R(h)\right)+\left(\underset{h\in\mathcal H}{\inf}\ R(h)-R^*\right). \tag{4.1}$$

The first term is called the estimation error and the second the approximation error. The estimation error depends on the chosen hypothesis $h$. It measures the error of $h$ relative to the infimum of the errors achieved by hypotheses in $\mathcal H$, or, when that infimum is attained, relative to the error of the best-in-class hypothesis $h^*$. Note that the definition of agnostic PAC-learning is based on the estimation error.

The approximation error measures how well the Bayes error can be approximated using $\mathcal H$. It is a property of the hypothesis set $\mathcal H$ and a measure of its richness. For a richer or more complex hypothesis set $\mathcal H$, the approximation error tends to be smaller, at the expense of a larger estimation error. See Figure 4.1.

Figure 4.1

Illustration of the estimation error (in green) and the approximation error (in orange). Here, it is assumed that a best-in-class hypothesis exists, that is, some $h^*$ with $R(h^*)=\inf_{h\in\mathcal H} R(h)$.

Model selection consists of choosing $\mathcal H$ with a favorable trade-off between the approximation and estimation errors. Note, however, that the approximation error is not accessible, since the underlying distribution $\mathcal D$ needed to determine $R^*$ is generally unknown. Even estimating the approximation error is difficult, including under various noise assumptions. In contrast, the estimation error of an algorithm $\mathcal A$, that is, the estimation error of the hypothesis $h_S$ it returns after training on a sample $S$, can sometimes be bounded using generalization bounds, as shown in the next section.
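
To make decomposition (4.1) concrete, here is a small worked example (an illustration added here, not from the original text) in which the distribution is known, so all three quantities can be computed exactly. Let $X$ be uniform on $[0,1]$, let $\mathbb P[Y=+1\mid X=x]$ be $0.1$ for $x<1/2$ and $0.9$ for $x\ge 1/2$, and take the hypothesis set of threshold classifiers $\mathcal H=\{h_t: t\in[0.7,1]\}$ with $h_t(x)=+1$ iff $x\ge t$. A direct computation gives $R^*=0.1$, $\inf_{h\in\mathcal H}R(h)=R(h_{0.7})=0.26$, and, for the particular choice $h=h_{0.85}$, $R(h)=0.38$, so that

$$\underbrace{R(h)-R^*}_{0.28}=\underbrace{R(h)-\inf_{h'\in\mathcal H}R(h')}_{\text{estimation error}\;=\;0.12}+\underbrace{\inf_{h'\in\mathcal H}R(h')-R^*}_{\text{approximation error}\;=\;0.16}.$$

Enriching the class to all thresholds $t\in[0,1]$ would include $h_{0.5}$, which coincides with the Bayes classifier, and would drive the approximation error to $0$, illustrating the trade-off discussed above.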

4.2 Empirical Risk Minimization (ERM)

Empirical risk minimization (ERM) is a standard algorithm for which the estimation error can be bounded. ERM selects the hypothesis that minimizes the error on the training sample:


$$h_S^{ERM}=\underset{h\in\mathcal H}{\text{argmin}}\ \widehat{R}_S(h). \tag{4.2}$$
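
As a minimal sketch of how (4.2) can be carried out for a finite hypothesis set, the snippet below performs ERM over a grid of threshold classifiers on synthetic data; the function names (`empirical_risk`, `erm_stump`), the threshold grid, and the data-generating distribution are illustrative assumptions, not part of the original text.

```python
import numpy as np

def empirical_risk(h, X, y):
    """Average zero-one loss of hypothesis h on the labeled sample (X, y)."""
    return np.mean(h(X) != y)

def erm_stump(X, y, thresholds):
    """ERM over the finite set of threshold classifiers h_t(x) = +1 iff x >= t:
    return the threshold whose classifier minimizes the empirical risk (4.2)."""
    best_t, best_risk = None, np.inf
    for t in thresholds:
        h = lambda x, t=t: np.where(x >= t, 1, -1)
        risk = empirical_risk(h, X, y)
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

# Illustrative usage on synthetic data: X uniform on [0, 1],
# P[Y = +1 | X = x] = 0.9 if x >= 0.5 and 0.1 otherwise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=500)
y = np.where(rng.uniform(size=500) < np.where(X >= 0.5, 0.9, 0.1), 1, -1)

t_hat, risk_hat = erm_stump(X, y, thresholds=np.linspace(0, 1, 101))
print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risk_hat:.3f}")
```

Because $\mathcal H$ is finite here, the argmin in (4.2) can be found by exhaustive search; for richer hypothesis sets, ERM generally requires problem-specific optimization.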

Proposition 4.1 For any sample $S$, the hypothesis $h_S^{ERM}$ returned by ERM satisfies the following inequality:


$$\mathbb P\left[R(h_S^{ERM})-\underset{h\in\mathcal H}{\inf}\ R(h)>\epsilon\right]\le \mathbb P\left[\underset{h\in\mathcal H}{\sup}\ |R(h)-\widehat{R}_S(h)|>\frac{\epsilon}{2}\right]. \tag{4.3}$$

Proof: By definition of $\inf_{h\in\mathcal H} R(h)$, for any $\epsilon>0$ there exists a hypothesis $h_\epsilon\in\mathcal H$ such that $R(h_\epsilon)\le \inf_{h\in\mathcal H} R(h)+\epsilon$. Using this and the inequality $\widehat{R}_S(h_S^{ERM})\le \widehat{R}_S(h_\epsilon)$, which holds by definition of the algorithm, we can write:


$$\begin{aligned} R(h^{ERM}_S)-\underset{h\in\mathcal H}{\inf}\ R(h)&=R(h^{ERM}_S)-R(h_\epsilon)+R(h_\epsilon)-\underset{h\in\mathcal H}{\inf}\ R(h)\\ &\le R(h^{ERM}_S)-R(h_\epsilon)+\epsilon\\ &=R(h^{ERM}_S)-\widehat{R}_S(h^{ERM}_S)+\widehat{R}_S(h^{ERM}_S)-R(h_\epsilon)+\epsilon\\ &\le R(h^{ERM}_S)-\widehat{R}_S(h^{ERM}_S)+\widehat{R}_S(h_\epsilon)-R(h_\epsilon)+\epsilon\\ &\le 2\,\underset{h\in\mathcal H}{\sup}\ |R(h)-\widehat{R}_S(h)|+\epsilon. \end{aligned}$$

Since this inequality holds for every $\epsilon>0$, letting $\epsilon\to 0$ yields:


$$R(h^{ERM}_S)-\underset{h\in\mathcal H}{\inf}\ R(h)\le 2\,\underset{h\in\mathcal H}{\sup}\ |R(h)-\widehat{R}_S(h)|,$$

so the event $\{R(h^{ERM}_S)-\inf_{h\in\mathcal H} R(h)>\epsilon\}$ is contained in the event $\{\sup_{h\in\mathcal H} |R(h)-\widehat{R}_S(h)|>\epsilon/2\}$, which implies (4.3) and concludes the proof.
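
Proposition 4.1 can also be checked numerically. The sketch below (an added illustration under the same synthetic distribution assumed in the ERM example above, not a device from the original text) draws many training samples, runs ERM over a finite grid of thresholds, and compares the empirical frequencies of the two events in (4.3); the helper `true_risk` and the chosen constants are illustrative.

```python
import numpy as np

def true_risk(t):
    """Exact risk of the threshold classifier h_t(x) = +1 iff x >= t under the
    synthetic distribution X ~ U[0, 1], P[Y=+1 | X=x] = 0.9 if x >= 0.5 else 0.1."""
    t = np.asarray(t, dtype=float)
    return np.where(t < 0.5, 0.5 - 0.8 * t, 0.8 * t - 0.3)

rng = np.random.default_rng(0)
thresholds = np.linspace(0.0, 1.0, 21)      # finite hypothesis set H
R = true_risk(thresholds)                   # true risks R(h) for all h in H
R_inf = R.min()                             # inf over H of R(h)

m, eps, n_trials = 50, 0.1, 2000
lhs_count = rhs_count = 0
for _ in range(n_trials):
    X = rng.uniform(0.0, 1.0, size=m)
    y = np.where(rng.uniform(size=m) < np.where(X >= 0.5, 0.9, 0.1), 1, -1)
    preds = np.where(X[:, None] >= thresholds[None, :], 1, -1)   # (m, |H|)
    R_hat = (preds != y[:, None]).mean(axis=0)                   # empirical risks
    erm = R_hat.argmin()                                         # ERM hypothesis (4.2)
    lhs_count += R[erm] - R_inf > eps                            # event on the left of (4.3)
    rhs_count += np.abs(R - R_hat).max() > eps / 2               # event on the right of (4.3)

print(f"frequency of [estimation error > eps]   : {lhs_count / n_trials:.3f}")
print(f"frequency of [sup deviation > eps / 2]  : {rhs_count / n_trials:.3f}")
```

Since the proof gives the pathwise inequality $R(h^{ERM}_S)-\inf_{h\in\mathcal H}R(h)\le 2\sup_{h\in\mathcal H}|R(h)-\widehat{R}_S(h)|$, the first frequency is never larger than the second on any run.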