Preface

As an amateur machine learning enthusiast, I can only scratch the surface of a field as broad and deep as neural networks. This article is based on material shared by many experts, with some of my own understanding mixed in. Please forgive any omissions, inaccuracies, or errors.

Artificial intelligence (AI)

Artificial intelligence is widely used in modern life, and there are many application scenarios inside Douyin as well. For example, in the "TikTok Nature" campaign, TikTok uses AI technology to quickly and accurately identify the animals and plants that appear in a video; using artificial intelligence to recognize objects is currently a good solution for this kind of task. So what is the relationship between neural networks and artificial intelligence? A neural network is one kind of learning method within machine learning, and machine learning is a branch of artificial intelligence. Then what is artificial intelligence? Artificial intelligence (AI) is the practice of creating and using algorithms to build dynamic computing environments that mimic human intelligence. The definition can be split into two parts, "artificial" and "intelligent". "Artificial" means designed, created, and manufactured by people; what counts as "intelligent" is controversial.

  • "Artificial stupidity" (a joke):
function ArtificialIntelligenceAdd(a, b) {
    return a + b
}
  • Generating code from natural-language descriptions:
    • GPT-3 writing front-end code
    • Test site

Experts define AI as "the ability of a system to correctly interpret external data, to learn from that data, and to use that knowledge to flexibly achieve specific goals and tasks". Put simply, the goal of artificial intelligence work is to make computers think and act like humans. So far, machine learning has done a good job in scenarios that involve a lot of repetitive work.

Machine learning

Machine learning is a class of methods that automatically find patterns in data and use those patterns to make predictions about unknown data. For machine learning, then, the two most important ingredients are data and the learning method. As shown in the figure below, the dark part represents a heavier reliance on the learning method and the light part a heavier reliance on data; the combination of the two is machine learning, and deep learning is a subset of machine learning. There are four types of machine learning methods: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning

Supervised learning learns a function from a given training data set; when new data arrives, the function can be used to predict the result. The training set for supervised learning must include both inputs and outputs, also known as features and targets. In supervised learning, the targets in the training set are labeled by humans: before the machine can learn, a person needs to label each sample with the correct result, and the criteria for what counts as correct must be defined before training.

  • For example: I want to build an addition model. I give the model three inputs (x1, x2, x3) and it gives me an output y. I now need to prepare training data for the model, which might look like the table below, and the machine uses some method (such as a neural network) to find the pattern between (x1, x2, x3) and y.
x1  x2  x3 (interference)  y
1   3   11                 4
2   6   12                 8
7   1   13                 8
3   3   14                 6

In supervised learning, we need to prepare accurate data before the model learns and know the ground truth corresponding to each sample. Model learning is about figuring out a method (function) that turns the input into the output. Supervised learning therefore relies heavily on the reliability of the data. It is by far the most widely used approach in industry, because its goal is clear and there is a large workforce of people cleaning and labeling data.
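To make the shape of such labeled data concrete, here is a minimal sketch in JavaScript for the addition example above; the field names features and label are my own choice, not from any particular framework.

// Hand-labeled training data for the addition example.
// Each sample has three features (x3 is deliberately irrelevant noise)
// and a human-provided label y, the ground truth.
const trainingSet = [
  { features: [1, 3, 11], label: 4 },
  { features: [2, 6, 12], label: 8 },
  { features: [7, 1, 13], label: 8 },
  { features: [3, 3, 14], label: 6 },
];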

Unsupervised learning

Similar to supervised learning, a function is learned from a given training data set, but the training set has no human-labeled results, i.e. no ground truth. We instead find a way for the machine to explore the relationships within the data on its own. In short, it looks for similarities.

  • For example, one of the most common scenarios for unsupervised learning is clustering: finding the similarities within the data and gathering similar items together. For example, here is some animal data:
Eyes  Feet  Hands  Ears  Uses tools  Uses Douyin  Animal
2     2     2      2     1           1            Human
2     2     2      2     1           0            Chimpanzee
2     4     0      2     0           0            Cat
2     4     0      2     0           0            Dog
2     0     0      2     0           0            Carp
2     0     0      2     0           0            Crucian carp

If we classify according to the data above, it is obvious that the first and second rows are similar, the third and fourth are similar, and the fifth and sixth are similar. Clustering in unsupervised learning works on exactly this kind of data: through some method, it groups and classifies the samples without anyone labeling a ground truth. For example, suppose the machine decides that the sum of each row's feature values determines its class, and observes that rows whose sums differ by more than 2 basically do not belong to the same class (a small code sketch of this idea appears at the end of this subsection). If we feed in the data of Xiao Ming (a human) and his dog Wangcai under this rule, the machine will conclude that the two do not belong to the same category, while a chimpanzee sits somewhere between humans and cats or dogs. Supervised learning is more like setting the range of the final goal in advance and having the machine produce answers within that range from the input data, whereas unsupervised learning is more like using the machine to unearth some information and then processing that information further. Put loosely, supervised learning is a human teaching a machine to do something, and unsupervised learning is a machine teaching a human something.
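As a toy illustration of that sum-and-threshold idea (not a real clustering algorithm such as k-means), the sketch below groups the animal rows by their feature sums. I treat a difference of 2 or more as "not the same class", a slightly stricter cut-off than the text's "greater than 2", so that the three groups described above fall out.

// Toy clustering: sum each row's features and treat rows whose sums
// differ by 2 or more as belonging to different groups.
const animals = [
  { name: 'human',       features: [2, 2, 2, 2, 1, 1] },
  { name: 'chimpanzee',  features: [2, 2, 2, 2, 1, 0] },
  { name: 'cat',         features: [2, 4, 0, 2, 0, 0] },
  { name: 'dog',         features: [2, 4, 0, 2, 0, 0] },
  { name: 'carp',        features: [2, 0, 0, 2, 0, 0] },
  { name: 'crucianCarp', features: [2, 0, 0, 2, 0, 0] },
];

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

const groups = [];
for (const animal of animals) {
  const s = sum(animal.features);
  // Join the first group whose representative sum is within 2, otherwise start a new group.
  const group = groups.find((g) => Math.abs(g.sum - s) < 2);
  if (group) {
    group.members.push(animal.name);
  } else {
    groups.push({ sum: s, members: [animal.name] });
  }
}

console.log(groups);
// [{ sum: 10, members: ['human', 'chimpanzee'] },
//  { sum: 8,  members: ['cat', 'dog'] },
//  { sum: 4,  members: ['carp', 'crucianCarp'] }]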

Semi-supervised learning

Semi-supervised learning is a combination of supervised and unsupervised learning: only part of the training data is labeled. The data is labeled, but not fully labeled. After all, clean and accurate labeling is time-consuming and labor-intensive, so combining a small amount of labeled data with a large amount of unlabeled data for model learning can, in many cases, save a great deal of manpower while still letting the machine learn effectively.
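Purely as an illustration of the data shape (the field names follow the earlier supervised sketch and are my own), semi-supervised training data mixes a small labeled portion with a large unlabeled one:

// Shape of semi-supervised data: a few samples carry human labels,
// most do not (label: null).
const semiSupervisedSet = [
  { features: [1, 3, 11], label: 4 },    // labeled by a person
  { features: [2, 6, 12], label: 8 },    // labeled by a person
  { features: [7, 1, 13], label: null }, // unlabeled
  { features: [3, 3, 14], label: null }, // unlabeled
  // ...typically many more unlabeled samples than labeled ones
];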

Reinforcement learning

In reinforcement learning, in order to achieve a goal, the agent gradually adjusts its behavior as the environment changes and evaluates whether the feedback received after each action is positive or negative. In contrast to supervised/unsupervised learning, which behaves like a one-shot input/output mapping, reinforcement learning responds to a whole series of changes in the environment. The feedback usually takes the form of rewards or punishments.

  • For example: there is a joke about teaching an AI wolf to catch sheep in which the wolf eventually chooses to kill itself; the training behind that joke is a reinforcement learning process. Suppose the scene is a 2D map and the wolf must choose to move up, down, left, or right every second. If the wolf catches a sheep, it gets a reward (positive feedback); for every second in which it fails to catch a sheep, it receives a punishment (negative feedback). Finding a pattern of behavior that earns the wolf the highest score is the reinforcement learning process.
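To make the feedback idea concrete, here is a hypothetical reward function for this wolf-and-sheep setup; the values +10 and -1 are invented for illustration and are not from the original joke.

// Hypothetical reward signal for the 2D wolf-catches-sheep game:
// +10 for catching a sheep, -1 for every second spent without one.
function reward(caughtSheep) {
  return caughtSheep ? 10 : -1;
}

// The wolf's goal is to maximise the total reward over an episode.
let totalReward = 0;
for (const caughtThisSecond of [false, false, false, true]) {
  totalReward += reward(caughtThisSecond);
}
console.log(totalReward); // -1 - 1 - 1 + 10 = 7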

The neural network

Whether it is supervised learning, unsupervised learning, or reinforcement learning, there has to be a concrete way to learn. The artificial neural network (ANN) is currently the most popular and relatively effective learning method, and it is the vitality and flexibility of neural networks that gave rise to deep learning.

The development history

Let’s review the history of neural networks.

  • As early as 1943, Warren McCulloch and Walter Pitts created a computational model of neural networks based on mathematics and an algorithm called threshold logic.
  • In 1975, Paul Werbos made a major improvement to neural networks with the back propagation algorithm.
  • In the 1980s, the concept of the convolutional neural network (CNN) was introduced. However, the development and research of neural networks remained slow for a long time because of the limited computing power of hardware.
  • In 1993, a company was founded that would change the industry: NVIDIA.
  • In 1999, NVIDIA introduced the GPU with the GeForce 256. The ability of GPUs to run large amounts of computation in parallel brought a different experience and new ideas.
  • In 2006, the revolutionary framework CUDA was unveiled. CUDA itself is a general-purpose parallel computing architecture launched by NVIDIA. CUDA-based programs can use the parallel computing engine of GPUs to solve complex computing problems more efficiently, so the GPU is no longer limited to the field of image rendering.
  • Brute force can work miracles: as long as the machine computes (enumerates) fast enough, there is no answer it cannot fit. CNNs, the architecture closest to images, broke out around 2010. A variety of image-processing schemes based on machine learning mushroomed, and the results were very good. Because of CNN's excellent performance, more and more pioneers in academia and industry combined it with GPUs and began large-scale research on neural networks.
  • As GPU computing power grew stronger, the size (depth) of models also grew. Around 2015, the residual neural network (ResNet) appeared, which effectively solved the problem that very deep networks were hard to train and greatly relaxed the depth limit of neural networks; with this, the concept of deep learning took off.

The working principle

Here is an example of supervised learning with a neural network. Suppose I have recorded my weight and the number of cups of milk tea I drank each week for 4 weeks, and I want to find a function that represents the relationship between cups of milk tea and my weight. It is too hard to find such a function by hand, so I hope to train a machine to find a formula of the form f(x) = wx + b that makes a simple prediction. x is the number of cups of milk tea per week; it is the input during training, i.e. the feature. The ground truth is the weekly weight, and f(x) is the weight the machine predicts each time. The values of w and b are random at the beginning; over several iterations, the machine is allowed to find the most suitable values of w and b.

  • w is usually called the weight, or parameter.
  • b is commonly referred to as the bias.

There are two problems in this process: 1. How do we know whether the training result is good? 2. How do we correct the model according to that result?
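Before answering these two questions, here is a minimal sketch of the model being trained, with w and b initialised randomly as described above (the function name predict is my own):

// The model to be trained: predicted weight as a linear function of
// cups of milk tea. w and b start out random; training adjusts them.
let w = Math.random();
let b = Math.random();

function predict(x) {
  return w * x + b; // f(x) = wx + b
}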

Loss Function

The loss function is how we measure the model's training effect. It shows the learning result by calculating, in each round, the difference between the predicted result and the real result. The loss function can be simple or complex, as long as it effectively reflects the difference between the predicted value and the ground truth. A relatively simple loss function looks like this:

function simpleLossFun(prediction, groundtruth) {
    return Math.abs(prediction - groundtruth);
}

The only purpose of the machine (model) in the learning process is to minimize loss.
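In practice the loss is usually evaluated over the whole training set, for example by averaging the per-sample losses. A sketch using the simpleLossFun and predict functions above; the milk-tea numbers here are invented for illustration.

// Average the per-sample loss over a small (made-up) data set of
// (cups of milk tea per week, weight in kg) pairs.
const weeks = [
  { cups: 3, weight: 60.5 },
  { cups: 5, weight: 61.0 },
  { cups: 2, weight: 60.2 },
  { cups: 7, weight: 61.8 },
];

function averageLoss() {
  const total = weeks.reduce(
    (sum, week) => sum + simpleLossFun(predict(week.cups), week.weight),
    0
  );
  return total / weeks.length;
}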

Back Propagation

Correcting the model requires back propagation. Back propagation is used to calculate the gradients, and essentially all optimization algorithms improve the model after the gradients have been computed by back propagation. In our case, correcting the model means correcting the values of w and b, and w in particular matters most for f(x). An effective way to express the relationship between w and f(x) is the derivative of the function, ∂f(x)/∂x = w'. The derivative of f(x) is defined through the concept of a limit, as a local linear approximation of f(x); w' reflects the rate of change (the gradient) of f(x) at the point x. To borrow a line from the experts:

The derivative of a function with respect to each variable indicates how sensitive the entire expression is to that variable.

We can also view the derivative from a vector angle: the sign of the derivative (gradient) w' represents the direction of change, and its magnitude represents the size of the change. By adjusting w we affect w', which changes the magnitude and direction of the vector at this point and thus affects the whole curve. The same holds from a physical angle: the derivative is the instantaneous velocity, and the magnitude and direction of that velocity also change as w changes. Therefore, instead of modifying x, we obtain the gradient w' and then adjust w according to w' to affect the predicted value, so that the loss keeps decreasing.
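For the one-feature model and the absolute-difference loss above, the gradients with respect to w and b can be written down by hand. The sketch below (reusing the predict function from earlier) illustrates the idea rather than implementing general back propagation.

// The derivative of |p - y| with respect to p is sign(p - y)
// (ignoring the kink at p === y), and by the chain rule
// dp/dw = x and dp/db = 1.
function gradients(x, y) {
  const p = predict(x);       // forward pass
  const s = Math.sign(p - y); // dLoss/dPrediction
  return {
    dw: s * x,                // dLoss/dw
    db: s,                    // dLoss/db
  };
}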

Multi-feature functions

It is difficult for the model to predict the correct result from a single feature, so multi-dimensional feature information is generally needed. Continuing the earlier example of milk tea and weight: I still want to find a function that predicts how my weight changes. I notice that sometimes I drink plenty of milk tea but the weight gain is not obvious, and sometimes I barely drink any yet my weight still goes up. I suspect it is also related to things such as how much I exercise, the cup size, how much cola I drink, and so on, none of which the previous model knew about, so predicting weight change from the amount of milk tea alone is inaccurate. The earlier f(x) = wx + b considers only one feature, the number of cups of milk tea; drawn as a neural network diagram, it is a single input connected to a single output.

If we give all this new information to the machine, each item becomes a new feature of the model. Building the function the same way as before gives:

f(x1, x2, x3, ..., xn) = w1x1 + b1 + w2x2 + b2 + w3x3 + b3 + ... + wnxn + bn

The derivatives of such a multivariable function can be handled together with the Gaussian elimination (least-squares) method mentioned in the references. The corresponding neural network is a single layer of inputs feeding one output.

But what if the relationship is more complicated than that (which is likely), and a single combination (milk tea, cola, exercise, and so on) is not enough? Perhaps the machine needs to arrange and combine the features in different ways, say milk tea + cola as one group and exercise + writing code as another. How can different combinations be learned together? The plan is simple: nesting. Another layer of parameters is set on the outside, and those outer parameters combine (fuse) the various cases. The model's function then might be:

f(x1, x2, x3, ..., xn) = w11(w1x1 + b1 + w2x2 + b2 + w3x3 + b3 + ... + wnxn + bn)
                       + w12(w1x1 + b1 + w2x2 + b2 + w3x3 + b3 + ... + wnxn + bn)
                       + ...
                       + w1n(w1x1 + b1 + w2x2 + b2 + w3x3 + b3 + ... + wnxn + bn)

The corresponding neural network simply gains an extra layer. When taking derivatives, because of the nesting, we can use the chain rule for composite functions: introduce an intermediate function g1(x2, x3, ..., xn) so that f(x1, x2, x3, ..., xn) becomes f(x1, g1), and so on down to g_{n-1}(xn, gn); the derivatives of all the nodes in the chain can then be worked out step by step (see the computational-graph examples in CS231n).
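As a minimal sketch of this nesting idea, the following two-layer forward pass gives each hidden "combination" its own inner weights and, for simplicity, leaves out the non-linear activation function that real networks insert between layers. All the concrete numbers are arbitrary.

// Two-layer forward pass: each hidden unit is a weighted sum of all
// inputs plus a bias, and the output is a weighted sum of the hidden units.
function forward(inputs, hiddenLayer, outputWeights) {
  const hidden = hiddenLayer.map(
    (unit) =>
      unit.weights.reduce((sum, wi, i) => sum + wi * inputs[i], 0) + unit.bias
  );
  // outputWeights plays the role of the outer w11..w1n above.
  return outputWeights.reduce((sum, wo, i) => sum + wo * hidden[i], 0);
}

// Example: 3 features (milk tea, cola, exercise) fused by 2 hidden combinations.
const fused = forward(
  [3, 2, 1],
  [
    { weights: [0.5, 0.1, -0.3], bias: 0.2 },
    { weights: [0.2, 0.4, 0.0], bias: -0.1 },
  ],
  [0.7, 0.3]
);
console.log(fused);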

Updating the parameters

After calculating the gradient of each weight, we need to use the gradients to update the parameters. The new question is: how much should each update be? A new parameter is introduced here: the learning rate. (learning rate × gradient) is the size of each weight update. Then another problem appears: how should the learning rate be set? If each update is tiny, say one millionth of the gradient, it works, but it wastes a lot of machine time. If each update is very large, the model keeps overshooting the best answer and gets further and further off. A common middle ground is to let the learning rate decrease with each round of training, which introduces yet another parameter: the decay parameter. The decay parameter can be a fixed fraction, such as 1/2; the learning rate of each round is multiplied by the decay parameter, giving large changes at the beginning and small changes at the end. So the real weight update might be w - (w' * lr * decay).
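Written as code, the update might look like the following, reusing the gradients sketch from the back propagation section; the concrete lr and decay values are arbitrary.

let lr = 0.1;      // learning rate (arbitrary starting value)
const decay = 0.5; // decay parameter

function updateParameters(x, y) {
  const { dw, db } = gradients(x, y);
  w = w - dw * lr; // step against the gradient
  b = b - db * lr;
}

// At the end of each training round, the learning rate shrinks,
// so early rounds take big steps and later rounds take small ones.
function decayLearningRate() {
  lr = lr * decay;
}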

The overall process

The training process of a model is divided into four steps:

  1. Forward propagation: the model accepts the inputs and predicts a result
  2. Loss calculation: compute the loss between the predicted and true values
  3. Back propagation: calculate the gradient of each weight
  4. Parameter update: update the parameters according to the gradient, learning rate, and decay parameter

Generally speaking, tuning a model also revolves around these four steps: the forward propagation process is designed, the loss function is designed, and the learning rate, decay parameter, initial weights, and so on are adjusted according to the accuracy achieved. By repeating the four steps above n times, the model finally fits a function, and that function is the model we want.
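Putting the earlier sketches together, a minimal self-contained training loop for the one-feature milk-tea model might look like this. The data (which follows y = 2x + 1), the learning rate, the decay value, and the number of rounds are all invented for illustration.

// Self-contained version of the four steps above for f(x) = wx + b.
function trainMilkTeaModel() {
  const samples = [
    { x: 1, y: 3 },
    { x: 2, y: 5 },
    { x: 3, y: 7 },
    { x: 4, y: 9 },
  ];

  let w = Math.random();
  let b = Math.random();
  let lr = 0.01;
  const decay = 0.995;

  for (let round = 0; round < 1000; round++) {
    for (const { x, y } of samples) {
      const prediction = w * x + b;  // 1. forward propagation
      const error = prediction - y;  // 2. loss = Math.abs(error)
      const sign = Math.sign(error); // 3. back propagation (gradient of |error|)
      w -= sign * x * lr;            // 4. update parameters
      b -= sign * lr;
    }
    lr *= decay;                     // decay the learning rate each round
  }
  return { w, b };
}

console.log(trainMilkTeaModel()); // w and b should drift toward 2 and 1, though only roughly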

Are neural networks all-powerful?

www.zhihu.com/question/26…

References

CS231n: Convolutional Neural Networks for Visual Recognition
Machine Learning: Back Propagation Notes
How Gradient Descent Is Applied to Neural Networks
Convolutional Neural Networks: CNN Complete Guide, Ultimate Edition (1)
Front-end Engineers Learn Artificial Intelligence
Artificial Neural Network, Wikipedia
Back Propagation Algorithm #03: A Detailed Explanation of the Principles and Code of Least-Squares Elimination
