1. Introduction

The fourth article in this series opened our discussion of the feasibility of machine learning. There, we worked through the “No Free Lunch” theorem, understood the assumptions behind it, and learned that a learning algorithm matched to the problem at hand can still achieve good results on specific practical problems.

Now we can move on to the ultimate puzzle that started this topic in the first place:

Would you dare to invest based on machine learning?

Will a model learned from training data, that is, from historical records, really perform just as well when predicting the future?

If the previous article was about clearing away the fog of No Free Lunch, this one carries out a genuinely quantitative analysis of the feasibility of learning.

This analysis belongs to a field of machine learning research called Computational Learning Theory.

It may be too dry for some readers and too theoretical for others. However, I would argue that computational learning theory answers the ultimate question of machine learning: why is learning possible at all? It is not only the theoretical basis of machine learning, but also the cornerstone on which everything else is built. Without understanding this part of the theory, every learning algorithm and model is reduced to moves without a foundation, and the practitioner remains a “layman” stuck outside a half-closed door.

But before we get into the subject, I must apologize for yet another delay. I completely miscalculated the size and difficulty of this topic. The original plan was to finish it in a single article; last week I split it into two parts; and now, with only the first part written, the word count has already passed 5,000 and the research and writing time has exceeded 20 hours, yet I have covered only half of what was originally planned for the second part. I am very sorry to say that the schedule for this topic will be extended again, and it will be split into three parts: part one, part two, and part three.

Of course, the flip side of the delay is persistence in the original intention of this series: the pursuit of making complex things simple, concise, and straightforward.

Standing on the shoulders of giants, I will strive to make this topic the best explanation of computational learning theory ever written. Yes, I said “the best”, not “one of the best”.

The whole discussion takes only three steps, and after we have worked through all three cases, we will be able to answer the ultimate question:

Suppose that, in the stock market, we learn a model from historical data and its training accuracy is some value $\nu$. Can this model really predict the future? What is the relationship between its future prediction accuracy $\mu$ and its training accuracy $\nu$?

2. The first case — the case with only one Hypothesis in the Hypothesis Set

In the first case, let’s look at what happens when there is only one Hypothesis in the Hypothesis Set.

When there is only one Hypothesis in the Hypothesis Set, the relationship between the model’s prediction accuracy and its training accuracy can be understood through a very familiar analogy: a population survey.

Figure 1:

As shown in Figure 1, if you want to know the proportion of males in the total population of a province, it is obviously impractical to count the whole population and the total number of males. The common method is scientific sampling: the proportion of males in the sample can be used to approximate the proportion of males in the total population. The soundness of doing so is guaranteed by the famous Hoeffding’s Inequality.

Hoeffding’s Inequality guarantees that “the proportion of males in the total population” (denoted $\mu$) and “the proportion of males in the sample” (denoted $\nu$) satisfy the following relationship:

$$P(|\nu - \mu| > \epsilon) \le 2\exp(-2\epsilon^2 N)$$

Here $\epsilon$ is the margin of error that you can set, and $N$ is your sample size.

The Hoeffding inequality tells us that when $N$ gets big enough, the probability that “the proportion of males in the total population” $\mu$ and “the proportion of males in the sample” $\nu$ differ by more than the margin of error $\epsilon$ is very small. Say you want the margin of error $\epsilon$ to be 0.01: substituting into the Hoeffding inequality shows that sampling 50,000 people guarantees that the probability that $\nu$ and $\mu$ differ by more than 0.01 is at most 0.01%.
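To make the numbers above concrete, here is a minimal sketch (my own illustration, not from the original article) that plugs the survey example into the two-sided Hoeffding bound $2\exp(-2\epsilon^2 N)$:

```python
import math

def hoeffding_bound(epsilon: float, n: int) -> float:
    """Two-sided Hoeffding bound on P(|nu - mu| > epsilon) for sample size n."""
    return 2 * math.exp(-2 * epsilon ** 2 * n)

# The survey example from the text: tolerance epsilon = 0.01, sample size N = 50,000.
print(hoeffding_bound(0.01, 50_000))  # ~9.1e-05, i.e. about 0.01%
```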

We can then say that the proportion of males in the general population can be “predicted” with a high degree of confidence by sampling statistics.

OK, we have understood the population survey example. The case of only one Hypothesis in the Hypothesis Set is in fact completely equivalent to it:

  • We want to know the predictive accuracy of the model, just as we want to know the percentage of men in the population.
  • Since we can, with the help of scientific sampling and the Hoeffding inequality, “approximate” the male proportion of the population by the male proportion of the sample, of course we can “approximate” the prediction accuracy by the training accuracy of the model.

Specifically, denote the only Hypothesis in the Hypothesis Set as $h$ and the corresponding Ground Truth as $f$, and then make two simple substitutions:

  1. Replace “male” in the survey with “correctly judged”, i.e. $h(x) = f(x)$
  2. Replace “female” in the survey with “misjudged”, i.e. $h(x) \ne f(x)$

Figure 2:

As shown in Figure 2, just as in the population survey, we hope to learn the model’s prediction accuracy on unknown data by sampling scientifically and measuring its accuracy on the sample data. This of course also satisfies the Hoeffding inequality: “the proportion predicted correctly on unknown data” (the prediction accuracy, denoted $\mu$) and “the proportion judged correctly on the training sample” (the training accuracy, denoted $\nu$) satisfy the following relationship:

$$P(|\nu - \mu| > \epsilon) \le 2\exp(-2\epsilon^2 N)$$

Here $\epsilon$ is the margin of error that you can set, and $N$ is your sample size. Then we say: $\nu \approx \mu$ is Probably Approximately Correct (PAC).

In other words, when $N$ gets big enough, the probability that the difference between the training accuracy $\nu$ and the prediction accuracy $\mu$ falls outside the margin of error $\epsilon$ is very small. So when we find that the model $h$ does well on the training sample, it is fairly safe to assume that it will do just as well at predicting the future!

Yes, there is a God, if you find her shadow.
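As a sanity check of this single-hypothesis guarantee, here is a small simulation (an illustrative sketch under my own assumed numbers, not part of the original article): we fix one hypothesis $h$ whose true prediction accuracy is assumed to be $\mu = 0.9$, draw many i.i.d. training samples, and count how often the training accuracy $\nu$ strays from $\mu$ by more than $\epsilon$:

```python
import math
import random

random.seed(0)

mu = 0.9          # assumed true prediction accuracy of the single hypothesis h
epsilon = 0.05    # tolerance
n = 500           # training-sample size
trials = 20_000   # number of independent samplings

bad = 0
for _ in range(trials):
    # Each sample point is "correctly judged" (h(x) == f(x)) with probability mu.
    nu = sum(random.random() < mu for _ in range(n)) / n
    if abs(nu - mu) > epsilon:
        bad += 1

print("observed bad-sample rate:", bad / trials)                         # very small
print("Hoeffding upper bound   :", 2 * math.exp(-2 * epsilon ** 2 * n))  # ~0.164
```

The observed rate comes out far below the bound; Hoeffding only promises a ceiling, not a tight estimate, but the ceiling is enough to trust the training accuracy.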

Of course, we all know that scientific sampling is crucial in a population survey. If the sampling is not scientific, say you only sample from a single region or a single age group, then the Hoeffding inequality will no longer protect you.

The same is true for sampling in machine learning. The premise of everything discussed in the rest of this article is that the samples are independent and identically distributed (I.I.D.). On the I.I.D. assumption, the Watermelon Book (Zhou Zhihua’s Machine Learning textbook) explains:

All samples in the input space $\mathcal{X}$ obey an implicit, unknown distribution, and all samples in the training data are sampled independently from this distribution.

Remember the name I.I.D., because it will come up again and again throughout machine learning.
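To see why the I.I.D. premise matters, here is a deliberately broken version of the survey (again just an illustrative sketch with made-up numbers): the true proportion of males in the population is 50%, but we only ever sample from a sub-group that is 80% male. No sample size can rescue the estimate, and the Hoeffding guarantee simply does not apply:

```python
import random

random.seed(0)

true_mu = 0.5   # assumed true proportion of males in the whole population
n = 50_000      # a large sample does not help if the sampling is biased

# Non-I.I.D. sampling: every respondent comes from a sub-group that is 80% male.
biased_sample = [random.random() < 0.8 for _ in range(n)]
nu = sum(biased_sample) / n

print(nu)       # ~0.8, nowhere near the true proportion of 0.5
```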

3. The second case — Hypothesis Set contains a finite number of hypotheses

In the first case, we have already established that $\nu \approx \mu$ is Probably Approximately Correct (PAC).

But because there is only one Hypothesis $h$ in the Hypothesis Set, the learning algorithm has no choice to make: the model it learns must be $h$. When the learning algorithm has no room to choose, it is almost impossible to obtain a model with high training accuracy. Take the PLA algorithm: if there is only one line to choose from, that line will probably perform very poorly on the training data. In that case, “prediction performance close to training performance” is more likely to mean “equally bad” than “equally good”.

Therefore, only when there are many hypotheses in the Hypothesis Set can the learning algorithm choose a Hypothesis with high training accuracy, so that the prediction performance, which stays close to the training performance, is also high.

That being the case, let us now discuss the second case and see what happens when there are a finite number of hypotheses in the Hypothesis Set. For ease of expression, let’s call this situation the “finite hypothesis space”, and assume that the Hypothesis Set contains $M$ hypotheses.

Let’s first review what is guaranteed in the single-hypothesis case, namely that “$\nu \approx \mu$ is PAC”, that is,

$$P(|\nu - \mu| > \epsilon) \le 2\exp(-2\epsilon^2 N)$$
As we have said, this formula means that the probability of the training accuracy deviating from the prediction accuracy by more than the tolerance is very small. But looked at the other way around, it also says: sample often enough and, on average, we will occasionally hit a sample that deviates beyond the tolerance. Take the population survey example: if the survey were repeated once a year for 50,000 years in a row, then on average about five of those surveys (50,000 × 0.01%) would deviate from the truth by more than the tolerance. If we call such a deviation “the hypothesis $h$ hits the wall”, and denote the probability of hitting the wall as $P(h \text{ hits the wall})$, then we have

$$P(h \text{ hits the wall}) \le 2\exp(-2\epsilon^2 N)$$

When the Hypothesis Set contains $M$ hypotheses, we do not want any of them to hit the wall, because any one of them may be selected by the learning algorithm as the final model. So let’s look at the “probability that any hypothesis hits the wall”.

Figure 3:

As shown in Figure 3, the “probability that any hypothesis hits the wall” is denoted $P(\text{any hypothesis hits the wall})$.

  • Step 1: This step simply spells out “any hypothesis” as the individual hypotheses $h_1, h_2, \ldots, h_M$.
  • Step 2: By the Union Bound, which is intuitive: the probability that any one hypothesis hits the wall is less than or equal to the sum of the probabilities that each hypothesis hits the wall.
  • Step 3: The probability of a single hypothesis hitting the wall is already known: $P(h_m \text{ hits the wall}) \le 2\exp(-2\epsilon^2 N)$ for every $m$, so substitute this into Step 2.
  • Step 4: Adding up the $M$ terms finally gives that the probability that any hypothesis hits the wall is less than or equal to $2M\exp(-2\epsilon^2 N)$. This means that when the Hypothesis Set contains $M$ hypotheses, the bound on the probability of hitting the wall becomes $M$ times what it was with only one hypothesis, as the numerical sketch after this list shows.
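Here is a small numerical sketch of that $M$-fold growth (my own illustration, reusing the survey numbers $\epsilon = 0.01$ and $N = 50{,}000$):

```python
import math

def union_bound(m: int, epsilon: float, n: int) -> float:
    """Bound from the union step: P(any of M hypotheses hits the wall) <= 2*M*exp(-2*eps^2*N)."""
    return 2 * m * math.exp(-2 * epsilon ** 2 * n)

for m in (1, 100, 10_000):
    print(m, union_bound(m, 0.01, 50_000))
# M = 1      -> ~9.1e-05  (the single-hypothesis bound, about 0.01%)
# M = 100    -> ~9.1e-03  (100 times larger)
# M = 10,000 -> ~0.91     (at this N, almost no guarantee is left)
```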

This is still good news for machine learning, because no matter which hypothesis the learning algorithm selects as the final model, as long as the sample size $N$ is big enough it is still guaranteed that the probability of the difference between the “training accuracy” $\nu$ and the “prediction accuracy” $\mu$ falling outside the margin of error $\epsilon$ is very small. Then we say:


For the final model selected by the learning algorithm, $\nu \approx \mu$ is Probably Approximately Correct (PAC).

So in this case, the future is still in our hands!
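How big does $N$ have to be? Setting $2M\exp(-2\epsilon^2 N) \le \delta$ and solving for $N$ gives $N \ge \ln(2M/\delta)\,/\,(2\epsilon^2)$. A minimal sketch (my own illustration, with the same $\epsilon = 0.01$ and a target failure probability $\delta = 0.01\%$):

```python
import math

def samples_needed(m: int, epsilon: float, delta: float) -> int:
    """Smallest N with 2*M*exp(-2*eps^2*N) <= delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * epsilon ** 2))

for m in (1, 100, 10_000):
    print(m, samples_needed(m, epsilon=0.01, delta=0.0001))
# M = 1      -> about 50,000 samples
# M = 100    -> about 73,000 samples
# M = 10,000 -> about 96,000 samples
```

Because $M$ sits inside a logarithm, the required sample size grows only slowly with the number of hypotheses, which is exactly why any finite Hypothesis Set keeps the future within reach.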

4. There are infinite hypotheses in the Hypothesis Set

We have now solved the single hypothesis and the case of a finite number of hypotheses, with only one final point missing — the case of an infinite number of hypotheses in the Set.

First of all, let’s review the conclusion obtained in the case of the finite hypothesis space:

$$P(|\nu - \mu| > \epsilon) \le 2M\exp(-2\epsilon^2 N)$$

This formula tells us that the probability of the difference between training accuracy and prediction accuracy exceeding the tolerance is at most $2M\exp(-2\epsilon^2 N)$. That is, no matter how large the size $M$ of the Hypothesis Set is, as long as it is finite, we can always take enough samples to keep the “probability of hitting the wall” under a small ceiling.

However, once there are infinitely many hypotheses in the Hypothesis Set, i.e. $M = \infty$, the ceiling on the right-hand side of the formula breaks down. The tie between prediction accuracy and training accuracy is torn apart by this infinite $M$, and predictions of the future are once again shrouded in fog. What we want to do in the next chapter is to try once more to find a ceiling within this infinity, the case of the “infinite hypothesis space”, and hold on to the hope of predicting the future... (To be continued)

5. Afterword

Thank you for reading. This is The Machine Learning Book for everyone. I’m Baxing. If you’d like to receive updates on future articles, consider following me. Or follow this column of the same name and it will be updated in your notification center.

Have fun 🙂