This series is a review of the course "Machine Learning Foundations" by Professor Hsuan-Tien Lin of the Department of Computer Science and Information Engineering, National Taiwan University. The focus is on organizing the main ideas rather than taking detailed notes, so some details may be omitted.

The course consists of 16 lectures divided into 4 parts:

  1. When Can Machines Learn?
  2. Why Can Machines Learn?
  3. How Can Machines Learn?
  4. How Can Machines Learn Better?

This article is part 2, corresponding to lectures 4-8 of the original course.

Main contents of this part:

  • Uses an example to motivate the question of whether learning is feasible;
  • Introduces VC dimension theory in detail, which guarantees the reliability of machine learning;
  • Introduces error measures and how to handle errors with different weights.

1 The feasibility of learning

Here is a question in the style of a primary-school math olympiad or a civil-service exam:

In fact, this question has no single right answer; both of the following answers can be justified:

  • Symmetric patterns are labeled $+1$ and asymmetric ones $-1$, so the answer is $+1$;
  • If the top-left cell is white the label is $+1$, and if it is black the label is $-1$, so the answer is $-1$.

Different rules therefore lead to different answers. So if a machine learns some rule from historical data, does it face the same problem: can the learned rule be trusted on data it has not seen?

2 Reliability guarantees for machine learning

2.1 Hoeffding Inequality

Let’s look at another problem: if you have a jar filled with lots of yellow and green balls, how do you estimate the proportion of yellow balls?

It’s simple: just sample. Draw a handful of balls, compute the proportion of yellow balls in the sample, $\nu$, and use it as an estimate of the proportion of yellow balls in the jar, $\mu$. How accurate is such an estimate? In statistics, Hoeffding's inequality gives a bound on the accuracy:


$$\mathbb{P}[\vert\nu-\mu\vert>\epsilon]\le 2\exp{(-2\epsilon^2 N)}$$

Here, $N$ is the number of samples drawn. The formula says that the probability of $\nu$ and $\mu$ being far apart has an upper bound, which becomes smaller for large samples, so that the statement $\nu=\mu$ is "probably approximately correct" (PAC).
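As a quick check of the bound (not part of the course material), here is a minimal simulation sketch: it assumes a jar whose true yellow fraction is $\mu=0.6$, draws many samples of size $N$, and compares the observed tail probability with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, N, eps = 0.6, 100, 0.1      # assumed true fraction, sample size, tolerance
trials = 100_000

# Each row is one sample of N balls; True means "yellow".
samples = rng.random((trials, N)) < mu
nu = samples.mean(axis=1)                       # sample fractions

print(np.mean(np.abs(nu - mu) > eps))           # observed P[|nu - mu| > eps]
print(2 * np.exp(-2 * eps**2 * N))              # Hoeffding bound: ~0.27
```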

2.2 Hoeffding Inequality in machine learning

Now map this process onto machine learning. A ball in the jar corresponds to a single data point $\mathbf{x}$ in the input space $\mathcal{X}$. Fix one hypothesis $h$ in the hypothesis set; the proportion of yellow balls in the jar then corresponds to the proportion of $\mathbf{x}\in\mathcal{X}$ for which $h(\mathbf{x})=f(\mathbf{x})$. Drawing a sample corresponds to the existing data set $\mathcal{D}$: for every point $(\mathbf{x}_n,y_n)$ in $\mathcal{D}$ we can easily check whether $h(\mathbf{x}_n)=y_n$; if they are equal, the corresponding ball is yellow, otherwise it is green. What we really want to know is the proportion of $\mathbf{x}$ in the whole $\mathcal{X}$ for which $h(\mathbf{x})=f(\mathbf{x})$.

If $N$ is large enough and the $\mathbf{x}_n$ are i.i.d., then for a fixed $h$ we can use the known in-sample error
$$E_{\text{in}}(h)=\dfrac{1}{N}\sum\limits_{n=1}^{N}\mathbf{1}_{[h(\mathbf{x}_n)\ne y_n]}$$
to infer the out-of-sample error
$$E_{\text{out}}(h)=\mathop{\mathcal{E}}\limits_{\mathbf{x}\sim P}\mathbf{1}_{[h(\mathbf{x})\ne f(\mathbf{x})]}$$
and thus judge how well $h$ performs, as shown in the diagram below:

According to Hoeffding's inequality,


$$\mathbb{P}[\vert E_{\text{in}}(h)-E_{\text{out}}(h)\vert>\epsilon]\le 2\exp{(-2\epsilon^2 N)}$$

If $E_{\text{in}}(h)$ and $E_{\text{out}}(h)$ are close enough, and $E_{\text{in}}(h)$ is small enough, then $E_{\text{out}}(h)$ is guaranteed to be small, which means $h\approx f$ with respect to the sampling distribution $P$.
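To make the quantities concrete, here is a minimal sketch (my own, not from the course) that computes $E_{\text{in}}$ for one fixed hypothesis on a toy data set; the threshold hypothesis and the data are made up purely for illustration.

```python
import numpy as np

def E_in(h, X, y):
    """In-sample error: fraction of points where the fixed hypothesis h disagrees with the labels."""
    return np.mean(h(X) != y)

# toy example: a fixed 1-D threshold hypothesis h(x) = sign(x - 0.3)
h = lambda X: np.sign(X - 0.3)
X = np.array([-0.5, 0.1, 0.4, 0.9])
y = np.array([-1.0, -1.0, -1.0, 1.0])
print(E_in(h, X, y))   # 0.25: h misclassifies one of the four points
```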

However, this only tells us whether a single, fixed $h$ is good enough. If we now let the algorithm $\mathcal{A}$ select $h$ from the hypothesis set $\mathcal{H}$ and then apply the above inequality, there is a problem. If 150 people each flip a coin five times, there is a roughly 99% chance that at least one of them gets heads all five times. Does that mean this person is better at flipping coins than everyone else? If we pick them as our "$g$", does that guarantee they will keep getting more heads in the future?
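The 99% figure is easy to sanity-check with a line of arithmetic (my own check, not part of the course):

```python
# Probability that at least one of 150 people gets 5 heads in 5 fair flips.
p_five_heads = 0.5 ** 5                        # one person: (1/2)^5 = 1/32
p_at_least_one = 1 - (1 - p_five_heads) ** 150
print(p_at_least_one)                          # ≈ 0.99
```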

Similarly, if we select from $\mathcal{H}$ the $g$ with the smallest in-sample error on $\mathcal{D}$, can we guarantee that it will also do well outside of $\mathcal{D}$? To get such a guarantee, we need to make some modifications to the inequality.

For each $h$, there may be some data sets $\mathcal{D}$ on which $E_{\text{in}}(h)$ differs greatly from the true $E_{\text{out}}(h)$; call such a $\mathcal{D}$ "bad" for $h$. Hoeffding's inequality essentially guarantees an upper bound on the probability of drawing a bad $\mathcal{D}$ for one fixed $h$. With $\vert\mathcal{H}\vert=M$, we want the probability of $\mathcal{D}$ being bad to have a small upper bound no matter which $h$ the algorithm $\mathcal{A}$ picks, so we should bound the probability that $\mathcal{D}$ is bad for at least one $h$:


$$\begin{aligned} &\mathbb{P}_{\mathcal{D}}[(\textbf{BAD } \mathcal{D} \text{ for } h_1) \textbf{ or } (\textbf{BAD } \mathcal{D} \text{ for } h_2) \textbf{ or } \ldots \textbf{ or } (\textbf{BAD } \mathcal{D} \text{ for } h_M) ]\\ \le& \mathbb{P}_{\mathcal{D}}[\textbf{BAD } \mathcal{D} \text{ for } h_1] + \mathbb{P}_{\mathcal{D}}[\textbf{BAD } \mathcal{D} \text{ for } h_2] +\ldots+\mathbb{P}_{\mathcal{D}}[\textbf{BAD } \mathcal{D} \text{ for } h_M]\\ \le& 2\exp{(-2\epsilon^2 N)}+2\exp{(-2\epsilon^2 N)}+\ldots+2\exp{(-2\epsilon^2 N)}\\ =& 2M\exp{(-2\epsilon^2 N)} \end{aligned}$$

This gives an upper bound on the probability that $E_{\text{in}}(h)$ and $E_{\text{out}}(h)$ are far apart for some $h$. But because the union of events was bounded by simply adding their probabilities, this upper bound is too large: the "bad" $\mathcal{D}$ corresponding to different hypotheses are likely to overlap heavily, so the true bound should be much smaller, as shown in the figure:

Moreover, if $M$ is finite, the bound shows that we can still increase $N$ to make $E_{\text{in}}(h)$ and $E_{\text{out}}(h)$ close enough. But what if $M$ is not finite? In PLA, for example, the weight coefficients can take infinitely many values, so the $M$ of PLA is infinite.
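For finite $M$, the bound can be inverted to see how large $N$ must be. The sketch below (my own, with made-up numbers) solves $2M\exp(-2\epsilon^2 N)\le\delta$ for $N$.

```python
import math

def sample_size(M, eps, delta):
    """Smallest N with 2*M*exp(-2*eps^2*N) <= delta (finite-M union bound)."""
    return math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))

# e.g. 1000 hypotheses, want |Ein - Eout| <= 0.05 with probability >= 95%
print(sample_size(M=1000, eps=0.05, delta=0.05))   # 2120 samples
```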

2.3 VC Dimension

Even when $M$ is infinite, there is a way out. Although the $M$ of PLA is infinite, as long as the number of samples is finite we can group the elements of its $\mathcal{H}$ into finitely many classes. For example, with only one sample, the elements of the two-dimensional PLA's $\mathcal{H}$ (that is, all lines in the plane) fall into just two classes: those that label the sample point positive and those that label it negative:

With two samples, the elements of $\mathcal{H}$ can be divided into four classes:

With three samples, the elements of $\mathcal{H}$ can be divided into eight classes:

But if the 3 points are collinear, there are only 6 classes:

When there are 4 samples, the elements of $\mathcal{H}$ can be divided into at most 14 classes:

This indicates that in PLA, with $N$ samples, the effective $M$ is at most $2^N$.
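This can be checked empirically. The sketch below (not from the course) samples many random lines $(w_0,w_1,w_2)$ and counts how many distinct labelings they produce on 4 fixed points; the points and the number of sampled lines are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def count_dichotomies(points, num_lines=200_000):
    """Estimate how many distinct labelings random 2-D perceptrons realize on `points`."""
    X = np.hstack([np.ones((len(points), 1)), points])   # add bias coordinate x0 = 1
    W = rng.normal(size=(num_lines, 3))                  # random (w0, w1, w2)
    labels = np.sign(X @ W.T).T                          # each row: one line's labeling
    labels[labels == 0] = 1                              # break rare ties arbitrarily
    return len({tuple(row) for row in labels.astype(int)})

pts = rng.normal(size=(4, 2))      # 4 points in general position
print(count_dichotomies(pts))
```

Being a Monte Carlo count it can in principle miss a rarely-hit labeling, but with this many random lines it almost always reports 14 for points in general position, and never more than 14 out of the $2^4=16$ conceivable labelings.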

Next, introduce a few concepts:

  • Dichotomy: for $N$ samples, each sample can be labeled positive or negative; one complete assignment of labels to all $N$ samples is called a dichotomy. The set of dichotomies that $\mathcal{H}$ can realize is denoted $\mathcal{H}(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)$; clearly the number of elements in this set is at most $2^N$;
  • Growth function: $m_{\mathcal{H}}(N)=\max\limits_{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N \in \mathcal{X}} \vert \mathcal{H}(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)\vert$, again with an upper bound of $2^N$. For the $\mathcal{H}$ of most models, such as the two-dimensional perceptron, $m_{\mathcal{H}}(N)$ is smaller than $2^N$ and only polynomial in $N$;
  • Shatter: if $\mathcal{H}$ can realize all $2^N$ dichotomies of $N$ samples, we say the $N$ points are shattered by $\mathcal{H}$;
  • Break point: if no set of $k$ points can be shattered by $\mathcal{H}$, then $k$ is called a break point of $\mathcal{H}$. By definition, every integer larger than $k$ is also a break point. For the two-dimensional perceptron, the break points start at 4.

The next step is to find the relationship between the break point and $m_{\mathcal{H}}(N)$.

We next introduce the bounding function $B(N,k)$: the maximum possible $m_{\mathcal{H}}(N)$ when the smallest break point is $k$. So, how do we calculate it, or at least its upper bound?

First, $k=2$ means that no two points can be shattered. Therefore, when $N=2$, $B(2,2)=3$: at most three dichotomies can be listed (a fourth would shatter the two points). When $N=3$, $B(3,2)=4$. When $k=1$, no single point may be shattered, so only one dichotomy is allowed, that is, $B(N,1)=1$. In addition, if $k>N$, then $B(N,k)=2^N$, since fewer than $k$ sample points may be shattered freely. If $N=k$, only one dichotomy needs to be removed from the $2^N$ that would shatter the points in order to keep the $N$ points from being shattered, so $B(N,k)=2^N-1$.

So far, there is still a part of the function table that has not been calculated:

Let's see how to calculate $B(4,3)$. It turns out that $B(4,3)=11$:

Looking at these 11 dichotomies, we find that they can be divided into two groups. In one group, the patterns of the first three points appear in pairs, differing only in $\mathbf{x}_4$; in the other group, the patterns of the first three points are not repeated.

Denote the number of dichotomies whose first three points are repeated as $2\alpha$ (looking at the first three points alone, there are $\alpha$ patterns) and the number of the remaining ones as $\beta$, so $2\alpha+\beta=11$. $B(4,3)$ has one more point than $B(3,\cdot)$. Suppose we now remove the last point: the first 3 points can realize only $\alpha+\beta$ dichotomies (each of the $2\alpha$ patterns appears twice on the first 3 points, so half of them must be discarded). Since no 3 points in the $B(4,3)$ construction may be shattered, the first three points must not be shattered either, so $\alpha+\beta\le B(3,3)$.

On the other hand, in the $2\alpha$ group each pattern of the first three points is paired with both labels of $\mathbf{x}_4$. If any two of the first three points could be shattered within this group, then together with $\mathbf{x}_4$ three points would be shattered, which is not allowed. Therefore no two of the first three points may be shattered, and we must have $\alpha\le B(3,2)$.

Therefore $B(4,3)=2\alpha+\beta\le B(3,3)+B(3,2)$, and in general $B(N,k)\le B(N-1,k)+B(N-1,k-1)$. The completed table is shown in the figure:

Using mathematical induction one can prove $B(N,k)\le\sum\limits_{i=0}^{k-1}\binom{N}{i}$; the details are skipped here. In fact, equality holds: $B(N,k)=\sum\limits_{i=0}^{k-1}\binom{N}{i}$, but the proof is more involved and the course omits it. The fastest-growing term in $B(N,k)$ is at most of order $N^{k-1}$.
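The recursion and the closed form are easy to cross-check in a few lines of code. This sketch (mine, not from the course) builds the table by treating the recursion as an equality and compares it with $\sum_{i=0}^{k-1}\binom{N}{i}$.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bound(N, k):
    """Upper-bound table for B(N, k) built from the recursion in the text."""
    if k == 1:
        return 1            # no single point may be shattered: one dichotomy only
    if N < k:
        return 2 ** N       # fewer than k points: all dichotomies allowed
    if N == k:
        return 2 ** N - 1   # remove one dichotomy to avoid shattering
    return bound(N - 1, k) + bound(N - 1, k - 1)

def closed_form(N, k):
    return sum(comb(N, i) for i in range(k))

print(bound(4, 3))                      # 11
print(all(bound(N, k) == closed_form(N, k)
          for N in range(1, 12) for k in range(1, N + 1)))   # True
```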

By the definition of $B(N,k)$, as long as a break point $k$ exists, $m_{\mathcal{H}}(N)$ is upper-bounded by $B(N,k)$; therefore the fastest-growing term of $m_{\mathcal{H}}(N)$ is also at most of order $N^{k-1}$.

Once we have $m_{\mathcal{H}}(N)$, a few more technical steps are needed to replace $M$ with it; they are omitted here. The final result is the Vapnik-Chervonenkis (VC) bound:


$$\mathbb{P}[\exists h \in \mathcal{H} \text{ s.t. }\vert E_{\text{in}}(h)-E_{\text{out}}(h)\vert>\epsilon]\le 4 m_{\mathcal{H}}(2N)\exp{(-\dfrac{1}{8}\epsilon^2 N)}$$

Define the VC dimension $d_{\text{vc}}(\mathcal{H})$ as the largest $N$ such that $m_{\mathcal{H}}(N)=2^N$, that is, the largest number of points $\mathcal{H}$ can shatter, or equivalently the smallest break point minus 1. When $N\ge 2$ and $d_{\text{vc}}\ge 2$, we have $m_{\mathcal{H}}(N)\le N^{d_{\text{vc}}}$.

For the $d$-dimensional perceptron model, $d_{\text{vc}}=d+1$. As long as $d_{\text{vc}}$ is finite, generalization is possible. $d_{\text{vc}}(\mathcal{H})$ can be viewed as a measure of how powerful $\mathcal{H}$ is.

2.4 VC Bound and model complexity penalty

For $g=\mathcal{A}(\mathcal{D})\in\mathcal{H}$, if $\mathcal{D}$ is statistically large enough, we have


$$\mathbb{P}[\vert E_{\text{in}}(g)-E_{\text{out}}(g)\vert>\epsilon]\le 4 (2N)^{d_{\text{vc}}} \exp{(-\dfrac{1}{8}\epsilon^2 N)}$$

The left-hand side of the inequality is the probability of a "bad" event. If we set the right-hand side equal to $\delta$, then $\epsilon$ can be solved for as
$$\epsilon=\sqrt{\dfrac{8}{N}\ln{\dfrac{4(2N)^{d_{\text{vc}}}}{\delta}}}=\Omega(N,\mathcal{H},\delta)$$
where $\Omega(N,\mathcal{H},\delta)$ represents the penalty for model complexity.

It follows that, with probability at least $1-\delta$,


$$E_{\text{out}}(g)\le E_{\text{in}}(g)+\Omega(N,\mathcal{H},\delta)$$

The relationship between $d_{\text{vc}}$ and the errors is shown below:

The goal is to find the $d_{\text{vc}}$ that minimizes the total error.
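To get a feel for the penalty term, the sketch below (my own; $N$ and $\delta$ are arbitrary choices) evaluates $\Omega(N,\mathcal{H},\delta)$ for several values of $d_{\text{vc}}$.

```python
import math

def omega(N, d_vc, delta):
    """Model-complexity penalty sqrt((8/N) * ln(4 * (2N)^d_vc / delta))."""
    return math.sqrt(8.0 / N * math.log(4 * (2 * N) ** d_vc / delta))

N, delta = 10_000, 0.1
for d_vc in (1, 3, 5, 10, 20):
    print(d_vc, round(omega(N, d_vc, delta), 3))
# The penalty grows with d_vc: a more powerful H pays a larger generalization price.
```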

The VC bound is only a very loose theoretical bound. For example, setting $\epsilon=0.1$, $\delta=0.1$, and $d_{\text{vc}}=3$, the formula above gives $N\approx 10{,}000\,d_{\text{vc}}$, whereas in practice $N\approx 10\,d_{\text{vc}}$ is often enough.
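The $N\approx 10{,}000\,d_{\text{vc}}$ figure can be reproduced numerically; the sketch below (not from the course) searches for the smallest $N$ that drives the VC bound below $\delta$.

```python
import math

def vc_bound(N, d_vc, eps):
    return 4 * (2 * N) ** d_vc * math.exp(-eps ** 2 * N / 8)

def required_N(d_vc, eps, delta, step=100):
    """Smallest N (in steps of 100) with 4*(2N)^d_vc * exp(-eps^2*N/8) <= delta."""
    N = step
    while vc_bound(N, d_vc, eps) > delta:
        N += step
    return N

print(required_N(d_vc=3, eps=0.1, delta=0.1))   # 29300 -- roughly 10,000 * d_vc
```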

2.5 VC Bound with noise

Noise can arise when a label is entered incorrectly, when the same input is given conflicting labels, or when the $\mathbf{x}$ information itself is inaccurate. Does the VC bound still hold when there is noise?

Going back to the jar example: before, the color of each ball was fixed; this setting is called "deterministic". In the noisy case, the color of each ball can be thought of as following a probability distribution, namely $y\sim P(y\vert\mathbf{x})$; this is called "probabilistic". It can be proved that if $(\mathbf{x},y)\overset{\text{i.i.d.}}{\sim}P(\mathbf{x},y)$, VC theory still holds.
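A probabilistic target is easy to picture in code. The sketch below is entirely made up for illustration: a 1-D input, an ideal rule $f(x)=\text{sign}(x)$, and a 10% label-flip probability, so that $(\mathbf{x},y)$ is drawn i.i.d. with $y\sim P(y\vert x)$.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_noisy_data(n, flip_prob=0.1):
    """Draw (x, y) i.i.d.: x ~ uniform, ideal label f(x) = sign(x), flipped with prob 0.1."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sign(x)
    flip = rng.random(n) < flip_prob          # y ~ P(y | x): correct label w.p. 0.9
    y[flip] *= -1
    return x, y

x, y = sample_noisy_data(1000)
print(np.mean(np.sign(x) != y))               # empirical noise rate, close to 0.1
```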

When there is noise, the learning target becomes: for inputs $\mathbf{x}$ that are common under $P(\mathbf{x})$, learn $P(y\vert\mathbf{x})$ well. The new learning flow is as follows:

VC theory thus remains effective; the pocket algorithm is a good example.

3 Measurement of error

A pointwise error measure can be expressed as $\text{err}(g(\mathbf{x}), f(\mathbf{x}))$, where $g(\mathbf{x})$ can be abbreviated as $\tilde{y}$ and $f(\mathbf{x})$ as $y$.

There are two important pointwise error measures, illustrated by a short sketch after the list:

  • $\text{err}(\tilde{y},y)=\mathbb{1}_{[\tilde{y}\ne y]}$, the 0/1 error, generally used in classification problems;
  • $\text{err}(\tilde{y},y)=(\tilde{y}-y)^2$, the squared error, generally used in regression problems.
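Here is a minimal sketch of the two measures (the toy labels and predictions are made up):

```python
import numpy as np

def err_01(y_hat, y):
    """0/1 error: 1 if the prediction differs from the label, else 0."""
    return (y_hat != y).astype(float)

def err_sq(y_hat, y):
    """Squared error, the usual choice for regression."""
    return (y_hat - y) ** 2

y     = np.array([ 1, -1,  1,  1])
y_hat = np.array([ 1,  1,  1, -1])
print(err_01(y_hat, y).mean())    # in-sample 0/1 error: 0.5
print(err_sq(y_hat, y).mean())    # mean squared error: 2.0
```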

With the error measure, the learning process is as follows:

In classification problems, errors can be divided into two types, as shown in the following figure:

The two types of errors can be given different weights according to their importance, so different applications can use different $\text{err}$. The error measure actually used inside the algorithm is denoted $\widehat{\text{err}}$. The best case, of course, is to make $\widehat{\text{err}}=\text{err}$, but $\text{err}$ is often hard to optimize directly, so $\widehat{\text{err}}$ is usually designed to be a friendlier surrogate, for example one with a closed-form solution or a convex objective function.

After adding the error measure design to $\mathcal{A}$, the learning flow is as follows:

If the two types of errors have different weights, a strategy of "virtual copying" can be used. Take the pocket algorithm as an example: suppose the false reject error has weight 1 and the false accept error has weight 1000. There is no need to attach a weight to every sample point during computation; instead, we can "virtually" make 1000 copies of each $y=-1$ point. In practice no actual copying is needed: when the algorithm selects sample points at random, it simply makes points with $y=-1$ 1000 times more likely to be chosen.
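A minimal sketch of the weighted error and the "virtual copying" selection rule, assuming the 1 : 1000 weights from the example above (the toy arrays and helper names are my own):

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_err(y_hat, y, w_fa=1000.0, w_fr=1.0):
    """Weighted 0/1 error: a false accept (y = -1 predicted +1) costs w_fa, a false reject costs w_fr."""
    fa = (y == -1) & (y_hat == +1)
    fr = (y == +1) & (y_hat == -1)
    return float(np.mean(w_fa * fa + w_fr * fr))

def pick_point(y, w_fa=1000.0, w_fr=1.0):
    """Pick an index with probability proportional to its class weight ("virtual copying")."""
    weights = np.where(y == -1, w_fa, w_fr)
    return rng.choice(len(y), p=weights / weights.sum())

y     = np.array([+1, -1, +1, -1, +1])
y_hat = np.array([+1, +1, +1, -1, -1])
print(weighted_err(y_hat, y))   # (1000 + 1) / 5 = 200.2
print(pick_point(y))            # y = -1 indices (1, 3) are chosen ~1000x more often
```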