This article is a brief note compiled by the author after taking the course Machine Learning Foundations by Professor Hsuan-Tien Lin of National Taiwan University. Most of the content comes from Professor Lin's course materials; other relevant reference links are noted where used.

All rights reserved: CSDN blog "Self-healing Tips for Procrastinators"



Feasibility



A model is learned from a training sample and used to estimate unknown data. Picture drawing a handful of marbles from a bin and estimating the proportions of the different colors in the whole bin from the proportions in the handful. This is a classic case of inferring the whole from a part, and we need to examine how machine learning guarantees that this inference is feasible. Establishing feasibility amounts to proving that Ein, the error rate of a hypothesis h on the sample, is close to Eout, its error rate on unknown data. Then, by driving Ein down, we also obtain a small Eout, and the chosen hypothesis g is close to the target function f, which is what a good learning outcome means.
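The marble analogy can be sketched as a quick simulation. The bin composition, sample size, and function name below are illustrative choices, not anything from the course:

```python
import random

def estimate_proportion(bin_marbles, sample_size, seed=0):
    """Estimate the fraction of orange marbles in the bin from a small
    random sample (sampling with replacement)."""
    rng = random.Random(seed)
    sample = [rng.choice(bin_marbles) for _ in range(sample_size)]
    return sample.count("orange") / sample_size

# A bin that is 70% orange, 30% green.
bin_marbles = ["orange"] * 7000 + ["green"] * 3000

in_sample = estimate_proportion(bin_marbles, sample_size=1000)
true_fraction = bin_marbles.count("orange") / len(bin_marbles)
print(f"sample estimate: {in_sample:.3f}, true fraction: {true_fraction:.3f}")
```

With 1000 marbles drawn, the estimate lands close to the true 0.7; shrinking the sample makes the gap, and its variance, noticeably larger.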

Step by step (all of the following use binary classification as the running example):

  • Hoeffding

Hoeffding's inequality is a proven mathematical fact: when the sample is large enough, the sample proportion is, with high probability, very close to the true proportion in the whole population. For a sample of size N, tolerance ε, sample mean ν and true mean μ: P[|ν − μ| > ε] ≤ 2·exp(−2ε²N).



For a single hypothesis h, Hoeffding's inequality justifies judging whether h is good: if the error rate of h on the sample is small, then with high probability the error rate of h on the full, real data is also small.

  • Multi-bin Hoeffding

For a hypothesis set H containing many hypotheses, the Hoeffding conclusion cannot be applied directly. An unlikely event, repeated many times, becomes likely (if 150 people each flip a fair coin five times, there is a 99.15% chance that at least one of them gets heads on all five flips, versus a 1/32 chance for any single person). This makes the following failure possible: based on sample D, we pick the h ∈ H with the lowest in-sample error rate, yet that h actually has a large error rate on the whole population; that is, D is a BAD sample for that h.
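The 99.15% figure quoted above can be checked directly; this is just arithmetic on the numbers in the text:

```python
# Chance that at least one of 150 people, each flipping a fair coin
# five times, gets heads on all five flips.
p_single = (1 / 2) ** 5                        # one person: 1/32
p_at_least_one = 1 - (1 - p_single) ** 150     # complement of "nobody does"
print(f"at least one person: {p_at_least_one:.4f}")  # about 0.9915
```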

For any given D, let P be the probability that it is a BAD sample for some h ∈ H. The derivation (using Hoeffding's inequality and the union bound) shows that P shrinks as the sample size N grows and grows with the number M of hypotheses in H. In other words, the smaller M is and the larger N is, the more confidently we can pick the lowest-error g from H.

(A)  P[∃h ∈ H such that |Ein(h) − Eout(h)| > ε] ≤ 2·M·exp(−2ε²N)

So far, we have shown that for every h in H, Ein ≈ Eout, under the condition that M is finite and N is sufficiently large.



Reference: my.oschina.net/findbill/bl…

  • VC bound

The next question is how to replace M with a finite, slowly growing polynomial so that the right-hand side of the inequality stays small. The key lies in the union bound used in the proof above. The formula shows that this upper bound is very loose: it is attained only when the M events are disjoint, whereas here the events overlap heavily. The bound declares D BAD as soon as "there exists one h for which D is a BAD sample" (similar to a one-vote veto), so the resulting upper bound sits far above the true probability.
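The looseness of the union bound is easy to see on a toy example; the die-roll events below are invented purely for illustration:

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die.
p = Fraction(1, 6)  # probability of each outcome

# Three heavily overlapping "bad" events.
events = [
    {1, 2, 3},   # roll <= 3
    {2, 3, 4},   # 2 <= roll <= 4
    {1, 3, 5},   # roll is odd
]

union = set().union(*events)
p_union = len(union) * p                        # exact probability of the union
union_bound = sum(len(e) * p for e in events)   # what the union bound charges

print(f"P(union)    = {p_union}")      # 5/6
print(f"union bound = {union_bound}")  # 9/6 = 3/2, already above 1
```

Because the events share most of their outcomes, summing their probabilities triple-counts the overlap; with M hypotheses whose BAD-sample events overlap in the same way, the factor M in (A) is a similarly pessimistic overcharge.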




Dichotomy: given a sample D, label its points as o or x (i.e., +1 or −1) according to some classification. If a labeling can be produced by some h (for a perceptron, if the points are linearly separable under that labeling), the labeling is called a dichotomy. Equivalently, a dichotomy groups together all h ∈ H whose classification results on D coincide. All the dichotomies realizable on D form the dichotomy set.


Growth function: for a given N, one can draw many different samples of size N, each with its own dichotomy set. The maximum size of these sets is defined as the growth function m(N).
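As a concrete check, the growth function of the 1-D "positive ray" hypothesis set (h(x) = +1 iff x > a) can be enumerated by brute force; the function name and sample points below are made up for this sketch. For positive rays, m(N) = N + 1, far below 2^N:

```python
def dichotomies_of_positive_rays(points):
    """All distinct labelings of `points` achievable by h(x) = +1 iff x > a."""
    points = sorted(points)
    # Candidate thresholds: below all points, between each pair, above all.
    thresholds = [points[0] - 1]
    thresholds += [(a + b) / 2 for a, b in zip(points, points[1:])]
    thresholds += [points[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

points = [0.5, 1.7, 3.2, 4.4]
dich = dichotomies_of_positive_rays(points)
print(f"m(N) for N={len(points)}: {len(dich)} (vs 2^N = {2 ** len(points)})")
# positive rays give N + 1 dichotomies: 5 here, versus 2^4 = 16
```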

At this point, the M → ∞ problem seems to improve by a small step: we want to replace M with m(N), turning (A) into (B).

(B)  P[∃h ∈ H such that |Ein(h) − Eout(h)| > ε] ≤ 2·m(N)·exp(−2ε²N)

VC bound: in fact m(N) cannot simply be substituted for M, so (B) does not hold as written; after a few technical adjustments we obtain (B'), the VC bound.

(B')  P[∃h ∈ H such that |Ein(h) − Eout(h)| > ε] ≤ 4·m(2N)·exp(−(1/8)·ε²·N)

Bounding function: an upper bound on the growth function when the break point is k, denoted B(N, k).



The upper limit of the bounding function: B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i), a polynomial in N of degree at most k − 1, i.e., O(N^(k−1)).

With this, (B') can be developed further, and the VC bound can be rewritten as

(C)  P[∃h ∈ H such that |Ein(h) − Eout(h)| > ε] ≤ 4·(2N)^(k−1)·exp(−(1/8)·ε²·N)

So far, we have shown that for every h in H, Ein ≈ Eout, independent of M (provided H has a break point and N is large enough).



Introducing the VC dimension

Shatter: if m(N) = 2^N, we say H has "shattered" these N points (for some particular D of size N).

Break point: for H, the minimum N such that m(N) ≠ 2^N, denoted k (no D of size k can be shattered).

To grasp the concepts above, picture a monster-shooting game: m(N) is the ammunition and H is the player. At level N, the player has m(N) bullets and faces 2^N monsters. As N increases from 1, the first level at which m(N) falls below 2^N is where the player can no longer shoot down ("shatter") all the monsters; the game breaks there, and that level is called the break point.

VC dimension: for H, the largest N such that m(N) = 2^N, denoted dVC (i.e., the largest N for which some D of size N can be shattered).

(1) VC dimension = k − 1

(2) When N ≤ dVC, at least one D of size N can be shattered; when N > dVC, no D of size N can be shattered.
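Property (2) can be verified computationally for a simple hypothesis set. Sticking with 1-D positive rays (for which dVC = 1), a brute-force shattering test is a sketch like this; the function name and test points are illustrative:

```python
def shatters(points):
    """Can positive rays h(x) = +1 iff x > a produce all 2^N labelings?"""
    points = sorted(points)
    thresholds = [points[0] - 1]
    thresholds += [(a + b) / 2 for a, b in zip(points, points[1:])]
    thresholds += [points[-1] + 1]
    labelings = {tuple(+1 if x > a else -1 for x in points) for a in thresholds}
    return len(labelings) == 2 ** len(points)

print(shatters([2.0]))       # True:  one point can be shattered, so dVC >= 1
print(shatters([1.0, 3.0]))  # False: (+1, -1) is unreachable, so dVC = 1
```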

Interpretation (not easy to grasp at first):

(1) For the d-dimensional perceptron, the hypothesis has d + 1 parameters (the threshold counts as dimension 0, with x0 = 1, giving the weight vector w = (w0, w1, w2, …, wd)). This is like a hypothesis set with d + 1 knobs, representing its degrees of freedom. For the d-dimensional PLA, it can be proven that dVC = d + 1; that is, dVC corresponds to the hypothesis set's effective "binary" degrees of freedom. One can also say that dVC reflects the power of the hypothesis set: the larger dVC is, the more powerful H is, i.e., the stronger its ability to separate the data.

More generally, dVC is approximately equal to the number of free parameters w in the hypothesis.

(2) Substituting dVC for k − 1 in (C), the derivation yields: with probability at least 1 − δ, Eout(g) ≤ Ein(g) + sqrt((8/N)·ln(4·(2N)^dVC / δ)).



Ω(N, H, δ) = sqrt((8/N)·ln(4·(2N)^dVC / δ)) is called the penalty for model complexity of the hypothesis set. (A bigger dVC is not simply better: the more powerful H is, the higher the complexity penalty.)



(3) Sample complexity: given ε, δ, and dVC, how large does N need to be?



In theory the VC bound demands roughly N ≈ 10,000·dVC, but in practice N ≈ 10·dVC is usually sufficient.
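To see where the large theoretical figure comes from, one can solve the VC bound numerically for N. This sketch assumes the bound in the form 4·(2N)^dVC·exp(−ε²N/8) ≤ δ; the function name and parameter values are illustrative:

```python
import math

def sample_complexity(epsilon, delta, d_vc):
    """Smallest N with 4 (2N)^d_vc exp(-epsilon^2 N / 8) <= delta,
    found by doubling the range and then binary searching."""
    def bad_prob(N):
        return 4 * (2 * N) ** d_vc * math.exp(-epsilon ** 2 * N / 8)

    lo, hi = 1, 1
    while bad_prob(hi) > delta:   # grow until the bound is satisfied
        hi *= 2
    while lo < hi:                # binary search for the smallest such N
        mid = (lo + hi) // 2
        if bad_prob(mid) <= delta:
            hi = mid
        else:
            lo = mid + 1
    return lo

# With epsilon = 0.1, delta = 0.1, d_vc = 3 the bound demands roughly
# 10,000 samples per unit of d_vc -- far above the practical 10 * d_vc.
N = sample_complexity(epsilon=0.1, delta=0.1, d_vc=3)
print(f"theoretical N: {N} (about {N // 3} per unit of d_vc)")
```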