2.3 Guarantees for finite hypothesis sets — inconsistent case
In the most general case, there may be no hypothesis in $H$ consistent with the labeled training sample. This is in fact the typical case in practice, where the learning problem may be somewhat more difficult, or the concept class may be more complex than the hypothesis set used by the learning algorithm. However, inconsistent hypotheses with a small number of errors on the training sample can be useful and, as we shall see, can benefit from favorable guarantees under certain assumptions. This section presents learning guarantees precisely for this inconsistent case and for finite hypothesis sets. To derive learning guarantees in this more general setting, we will use Hoeffding's inequality (theorem D.1) or the following corollary, which relates the generalization error and empirical error of a single hypothesis.
Corollary 2.1
Fix $\epsilon > 0$ and let $S$ denote an i.i.d. sample of size $m$. Then, for any hypothesis $h\colon X \to \{0,1\}$, the following inequalities hold:

$$\Pr_{S \sim D^m}\big[\widehat{R}(h) - R(h) \ge \epsilon\big] \le \exp(-2m\epsilon^2),$$
$$\Pr_{S \sim D^m}\big[\widehat{R}(h) - R(h) \le -\epsilon\big] \le \exp(-2m\epsilon^2).$$
By the union bound, this implies the following two-sided inequality:

$$\Pr_{S \sim D^m}\big[\,|\widehat{R}(h) - R(h)| \ge \epsilon\,\big] \le 2\exp(-2m\epsilon^2). \tag{2.16}$$
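As a quick sanity check (an illustrative addition, not part of the original text), the two-sided bound (2.16) can be verified numerically: the sketch below, with an arbitrarily chosen true error rate, sample size, and $\epsilon$, estimates the left-hand side of (2.16) by simulation and compares it with the right-hand side.

```python
import numpy as np

# Minimal Monte Carlo sketch checking the two-sided Hoeffding bound (2.16).
# The true error rate, sample size m, and epsilon below are arbitrary choices
# made purely for illustration.
rng = np.random.default_rng(0)
true_error = 0.3   # R(h): probability that the fixed hypothesis h errs
m = 200            # sample size
eps = 0.05
trials = 100_000

# Each empirical error is the mean of m i.i.d. Bernoulli(true_error) indicators.
empirical_errors = rng.binomial(m, true_error, size=trials) / m
lhs = np.mean(np.abs(empirical_errors - true_error) >= eps)
rhs = 2 * np.exp(-2 * m * eps**2)
print(f"empirical P[|R_hat - R| >= eps] ~ {lhs:.4f}")
print(f"Hoeffding bound 2*exp(-2*m*eps^2) = {rhs:.4f}")  # lhs should not exceed rhs
```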
The proof follows immediately from theorem D.1. Setting the right-hand side of (2.16) equal to $\delta$ and solving for $\epsilon$ immediately yields the following bound for a single hypothesis.
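For concreteness, this solving step can be written out as

$$2\exp(-2m\epsilon^2) = \delta \;\Longleftrightarrow\; \epsilon^2 = \frac{\log\frac{2}{\delta}}{2m} \;\Longleftrightarrow\; \epsilon = \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$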
Corollary 2.2 Generalization bound — single hypothesis
Fix a hypothesis $h\colon X \to \{0,1\}$. Then, for any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$:

$$\big|R(h) - \widehat{R}(h)\big| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
The following example illustrates this corollary in a simple case.
Example 2.6 Flipping a coin
Imagine flipping a biased coin that lands heads with probability $p$, and let our hypothesis be the one that always guesses tails. Then the true error rate is $R(h) = p$ and the empirical error rate is $\widehat{R}(h) = \widehat{p}$, where $\widehat{p}$ is the empirical probability of heads on the training sample drawn i.i.d. Thus, corollary 2.2 guarantees with probability at least $1 - \delta$ that

$$|p - \widehat{p}| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Therefore, if we choose $\delta = 0.02$ and use a sample of size $500$, then with probability at least $98\%$, the following approximation quality is guaranteed for $\widehat{p}$:

$$|p - \widehat{p}| \le \sqrt{\frac{\log 100}{1000}} \approx 0.068.$$
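The numerical value above can be reproduced in a few lines; the snippet below (an illustrative addition, not from the original text) simply evaluates the bound of corollary 2.2 for these parameters.

```python
import math

# Evaluate the single-hypothesis bound of corollary 2.2 for the coin example.
delta = 0.02   # confidence parameter: the bound holds with probability >= 1 - delta
m = 500        # sample size

bound = math.sqrt(math.log(2 / delta) / (2 * m))
print(f"|p_hat - p| <= {bound:.3f} with probability at least {1 - delta:.0%}")
# prints roughly: |p_hat - p| <= 0.068 with probability at least 98%
```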
When training on a sample $S$, can we simply apply corollary 2.2 to bound the generalization error of the hypothesis $h_S$ returned by the learning algorithm? No, because $h_S$ is not a fixed hypothesis but a random variable that depends on the training sample drawn. Note also that, unlike the case of a fixed hypothesis, for which the expected value of the empirical error is the generalization error (equation 2.3), the generalization error $R(h_S)$ is a random variable and in general differs from the expectation $E[\widehat{R}(h_S)]$, which is a constant. Therefore, as in the proof for the consistent case, we need to derive a uniform convergence bound, that is, a bound that holds with high probability simultaneously for all hypotheses $h \in H$.
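The need for a uniform bound can also be seen empirically. The sketch below (an illustrative addition under assumed parameters, not from the original text) considers a finite set of hypotheses that all have true error exactly $0.5$, selects the one with the smallest empirical error on a training sample, and shows that its empirical error systematically underestimates its true error, which is exactly the situation corollary 2.2 does not cover.

```python
import numpy as np

# Illustration of why corollary 2.2 cannot be applied directly to h_S:
# all |H| = 1000 hypotheses below have true error exactly 0.5, yet the one
# selected for having the smallest empirical error looks much better than 0.5.
rng = np.random.default_rng(0)
num_hypotheses = 1000
m = 100
trials = 200

gaps = []
for _ in range(trials):
    # Row i holds the 0/1 losses of hypothesis i on the m training points;
    # every hypothesis errs independently with probability 0.5.
    losses = rng.integers(0, 2, size=(num_hypotheses, m))
    empirical_errors = losses.mean(axis=1)
    best = empirical_errors.min()   # empirical error of the selected h_S
    gaps.append(0.5 - best)         # true error minus empirical error

print(f"average R(h_S) - R_hat(h_S) over trials: {np.mean(gaps):.3f}")
# A fixed hypothesis would show a gap close to 0 on average; the selected h_S
# shows a systematic positive gap, hence the need for a bound holding
# uniformly over all h in H.
```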