

Machine learning has become so popular in recent years that almost every project team wonders whether its projects could be improved with machine learning methods. For front-end engineers the main difficulty is the large gap between the front-end technology stack and the basic skills machine learning requires: most machine learning tutorials on the market assume a fairly high level of mathematics, while the most basic principles, the underlying world view, are skipped over as if they were obvious, which naturally makes the material hard to understand. This article tries to fill the gap between high-school mathematics and a typical introduction to machine learning.


A refresher on probability theory



Random event: in a random experiment, an event that may or may not occur, but that shows some statistical regularity over a large number of repeated trials, is called a random event (event for short).

Sample space: the set of all possible outcomes of a random experiment.

Random variable: a number assigned to the outcome of a random experiment. A random variable is essentially a mapping from the sample space to the real numbers.


For example, suppose we flip a coin twice in a row, writing H for heads and T for tails. The sample space is then {HH, HT, TH, TT}. Define a random variable X as the number of heads among the two tosses; this gives the following table:


Experimental result    Random variable X
HH                     2
HT                     1
TH                     1
TT                     0


Assuming each of these outcomes is equally likely, i.e. each has probability 1/4, we can then compute the probability of each value of the random variable:



Random variable X    Corresponding experimental results    Probability P
2                    HH                                    1/4
1                    HT, TH                                1/2
0                    TT                                    1/4
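To make these tables concrete, here is a tiny sketch in plain Python (the names are my own) that enumerates the four outcomes and recovers the same distribution for X:

```python
from itertools import product
from collections import Counter

# All outcomes of tossing a coin twice, each assumed equally likely (p = 1/4).
outcomes = ["".join(t) for t in product("HT", repeat=2)]  # ['HH', 'HT', 'TH', 'TT']

# Random variable X: the number of heads in an outcome.
X = {o: o.count("H") for o in outcomes}

# P(X = k) = (number of outcomes with k heads) / (total number of outcomes)
counts = Counter(X.values())
distribution = {k: counts[k] / len(outcomes) for k in sorted(counts)}

print(distribution)  # {0: 0.25, 1: 0.5, 2: 0.25}
```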


Real world examples


MNIST


In the real world, however, you cannot assume that every outcome is equally likely. And even under that assumption, the set of possible outcomes can be so large that it is impossible to enumerate or calculate as in the example above.


Suppose we now have a handwritten digit recognition task (MNIST).


Each picture is 28 × 28 = 784 pixels, and each pixel takes a value from 0 to 255, so there are 256^784 different possible samples in total. The brute-force approach would be to generate all of these samples, build a dictionary, label each sample with its digit, and the problem would be solved.
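Just to get a feel for that count, a throwaway calculation (not part of any real pipeline):

```python
# Number of distinct 28x28 grayscale images with 256 possible values per pixel.
num_images = 256 ** (28 * 28)

# The count is astronomical: it has 1889 decimal digits,
# i.e. it is on the order of 10^1888.
print(len(str(num_images)))  # 1889
```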



However, this solution exists only in theory: storing, retrieving and labeling data at this scale is an impossible task. Fortunately, it does not have to be done. In fact, if every pixel were random, the image we would most likely get would look like this:




In this huge sample space, only a vanishingly small fraction of images are digits. Let's define a random variable X; what is P(X = 0)? In theory, you would have to count the number of images that would be recognized as a 0 and divide by the total number of images, but neither count can actually be obtained. So the questions become: if we randomly generate a sample like this, what is the probability that it is a digit at all? What is the probability that it is a 0? Given such a picture, how do you tell whether it is a digit? And going further, whether it is a 0?


We might try to write some descriptive rules, for example: a ring of white pixels with black in the middle means the image is a 0. In practice this is very hard to program. How do we define a ring? If one pixel is missing and the ring is no longer closed, is it no longer a 0? Exactly where is the boundary between the pixels of a 0 and those of a non-0? If you extend the lower right of a 0 you get a 9, so after how many added pixels does it cross the boundary between 0 and 9?


In fact, this is what the probability problem looks like in most real situations: we face a huge sample space, and there is no way to work out the probability of each value of the random variable from a simple verbal description or a formula. Imagine you want to teach a blind person who has never seen written characters what a 0 is and what a 1 is. How would you teach them? You could only use the kind of descriptions given above, and how well would that work?


We were never taught digits this way in elementary school, nor were we given an exact mathematical description (in fact there is not even a rough verbal one) of how to write a 0 and how to write a 1. Instead, the teacher writes a few examples and tells you: this is a 0, this is a 1. Then you practice at home, write them yourself, and remember them. In this case we are essentially drawing a sample from the whole population of images of the digit 0, and then using that sample to represent the probability distribution of images of the digit 0.


This idea of viewing data as samples from some unknown probability distribution is a great help in understanding machine learning tasks. Take face detection: given a random picture, what is the probability that it is a face? Which pictures count as faces? That is far harder to describe in words and impossible to express as a formula; instead, you are simply handed a pile of samples and told that pictures like these are all human faces. This is what representing a probability distribution by samples means.



This is, in fact, the statistical sampling taught in high school: take a sample from a normal distribution, compute its mean and variance, and use them to estimate the parameters of that normal distribution. In high school it is hard to see why you would do this or what the point is. With examples like the ones above, you can see that most probability distributions, unlike the normal or binomial distributions, cannot be written down as parametric formulas at all. You can only draw samples and use those samples to represent the distribution.
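That high-school exercise fits in a few lines. A minimal sketch with numpy, using made-up "true" parameters purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the "true" distribution is unknown to us; we only ever see samples.
true_mean, true_std = 170.0, 7.5                 # hypothetical heights in cm
sample = rng.normal(true_mean, true_std, size=1000)

# Estimate the parameters from the sample alone.
est_mean = sample.mean()
est_std = sample.std(ddof=1)                     # unbiased sample standard deviation

print(est_mean, est_std)  # close to 170 and 7.5, and closer as the sample grows
```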


Getting back to digit recognition, one possible idea is this: instead of recognizing handwritten digits right away, suppose we only had to recognize pictures like this:



For such images we could simply write down a rule: if the middle half of the image is white, it is a 0, otherwise it is not. You could even work out the exact probability involved, which goes only slightly beyond high-school probability.
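That one-rule "recognizer" really is a single line of logic. A toy sketch, assuming the image is a 28×28 numpy array with values 0–255 where white is 255 (the function name and threshold are made up):

```python
import numpy as np

def middle_is_white(img: np.ndarray, thresh: int = 200) -> bool:
    """Hand-written rule: the middle half of the image is (almost all) white."""
    h, w = img.shape
    middle = img[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
    return bool((middle > thresh).mean() > 0.9)  # 90% of the middle pixels are white
```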


Descriptions like the ones above, say "a ring of white pixels with black in the middle" or "black at the top and bottom with white in the middle", are what machine learning calls features. Some features are difficult, or even impossible, to describe in a programming language; others can be described in a single line of code. The next step, which is the core of an ordinary machine learning course, is to apply a nonlinear mapping that takes sample features which are hard to separate and maps them into a space where they can easily be separated by drawing a line, as in this classic network:
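Setting the figure aside, the "map nonlinearly, then draw a line" step can be sketched as a tiny two-layer network in numpy (random weights, shapes chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """Nonlinear mapping of the features, then a linear separator in the new space."""
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU: the nonlinear feature transformation
    return h @ W2 + b2                 # "drawing a line" (a hyperplane) in that space

# Toy shapes: 784 raw pixels -> 128 transformed features -> 10 digit scores.
W1, b1 = 0.01 * rng.normal(size=(784, 128)), np.zeros(128)
W2, b2 = 0.01 * rng.normal(size=(128, 10)), np.zeros(10)

scores = mlp_forward(rng.random((1, 784)), W1, b1, W2, b2)  # shape (1, 10)
```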



So-called deep learning, in essence, moves the feature transformation that used to be designed by hand, from human observation, into the model itself: the first part of the model learns to describe the features of the sample. The rest, CNNs, pooling, activation functions and so on, is the "technique" part of a basic machine learning course that you can study on your own.
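For a flavor of those building blocks, here is a deliberately naive sketch of one convolution, one activation and one max-pooling step in plain numpy; real frameworks implement these very differently and far more efficiently:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D convolution (strictly, cross-correlation, as in most ML libraries)."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)             # the activation function

def max_pool(x, size=2):
    h, w = x.shape[0] - x.shape[0] % size, x.shape[1] - x.shape[1] % size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))         # keep the strongest response per block

img = np.random.rand(28, 28)
feature_map = max_pool(relu(conv2d(img, np.random.randn(3, 3))))  # shape (13, 13)
```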


Statistical language model


Liu Cixin once wrote a science fiction story called "The Poetry Cloud", in which an alien with super-advanced technology falls in love with Li Bai's poetry and builds a huge database storing every possible combination of Chinese characters, turning the artistic creation of the liberal-arts student into big-data retrieval for the engineer. It is theoretically possible, and following this idea we could use similar techniques to recreate the paintings of Cezanne and Van Gogh or the music of Beethoven and Mozart.


Of course, with the technology we earthlings currently have, we cannot carry out such a scheme; we can only use an idea similar to the handwritten-digit example above and treat it as a probability problem to be solved by sampling. For natural language, one of the questions is: what is the probability that a given sequence of ten characters (or words) is a sentence a human can understand?


If you just want to count, it is not hard: say the language is Chinese with roughly 3,000 commonly used characters, so there are about 3000^10 ≈ 6 × 10^34 possible ten-character sequences. How many of these are sentences? There is no way to know. As for describing what a sentence is with rules, humans spent a great deal of time and energy trying to define sentences that way, in other words, trying to deconstruct natural language into subject, verb, object, attributive, adverbial, complement and so on.



Linguists and computer scientists eventually produced thousands of rules and built incredibly complex systems, but they never worked well; things only changed when statistics-based language models emerged.



From the perspective of the statistical language model, we no longer bother trying to describe what a sentence is. Instead we collect a batch of sample sentences (conveniently, human-written articles are everywhere), and use that batch of samples to represent the probability distribution of human natural language rather than trying to describe it with rules. This opened a new chapter for natural language understanding.



Suppose S is a meaningful sentence composed of a sequence of words w1, w2, ..., wn in a particular order, where n is the length of the sentence. We want to know the probability that S appears in text, P(S).


Using the formula for conditional probability (the chain rule):

P(S) = P(w1, w2, ..., wn) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · ... · P(wn | w1, w2, ..., wn-1)


In this way the probability becomes something we can estimate by counting. However, the conditional probabilities toward the end of the chain are still far too hard to estimate, so we make an assumption: each word depends only on the few words immediately before it. This is the N-gram model. With N = 2 (the bigram model), each word depends only on the single word before it:

P(S) ≈ P(w1) · P(w2 | w1) · P(w3 | w2) · ... · P(wn | wn-1)
So, given a body of text, some simple counting yields all of these probabilities. This output is the statistical feature of the sample; you can then apply nonlinear mappings to that feature and plug it into different natural language tasks, much as with MNIST above, so we won't go into further detail.
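As a concrete toy illustration of "do some simple counting and get these probabilities", here is a bigram count over a tiny made-up corpus; a real model would need a huge corpus and smoothing for word pairs it has never seen:

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(prev, word):
    """Estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # "the cat" occurs 2 times, "the" occurs 6 times -> 0.33
```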


Of course, the latest language models are far more complicated than this. The N-gram assumption itself has a problem: a word may be strongly correlated with another word far away from it, and handling that requires more sophisticated models. But the idea is the same: use sample sentences to represent the probability distribution, apply various transformations to the sample features, and finally plug the result into different natural language understanding tasks.


Conclusion


There are many other applications of this perspective. Take Mozart's music: with the same idea we can collect all of Mozart's works, apply feature transformations, and then do something genuinely interesting, such as the classification task of deciding whether a piece is by Mozart or not. You can also use a generative model such as a GAN, which essentially performs a probability distribution transformation: it converts samples from a uniform or normal distribution into samples from the distribution of Mozart's music, so that feeding in random noise generates music in Mozart's style.
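The "distribution transformation" idea can be shown in miniature with a much simpler target than Mozart: turning uniform noise into normally distributed samples, here with the classic Box-Muller transform. A GAN learns a far more complicated mapping, but its role is analogous:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_to_normal(n):
    """Box-Muller: map samples from Uniform(0, 1) to samples from Normal(0, 1)."""
    u1 = 1.0 - rng.random(n)   # shift into (0, 1] so log(u1) is always defined
    u2 = rng.random(n)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

z = uniform_to_normal(100_000)
print(z.mean(), z.std())  # approximately 0 and 1
```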


Similar ideas can be used to write poems and couplets, to paint, and so on. All of these look alike once you understand the probability distribution over the sample space; this can be thought of as the "tao" of machine learning. How exactly to do it, how to design the model, and which parameters to tune are merely matters of technique.