A note up front: I will describe the algorithms commonly used in machine learning from a beginner's perspective. My level is really limited, so if there are any mistakes I hope you will point them out, to avoid misleading you. There are many things I want to say at the start, so I will use a separate chapter to introduce my personal understanding of machine learning and some basic knowledge. If you have more time, you can read my preface below.

0 Foreword

Introduce personal ideas

0.1 About machine learning

There is no doubt that machine learning is getting hotter and hotter, and demand will continue to grow as the AI boom continues. Whether you're in the programming field or not, it's important to understand how machine learning works, at least rationally rather than mystically. The mathematics behind the algorithms can be kept light: it's enough to know that a certain goal can be achieved by some mathematical means. For example, the linear regression I'm going to introduce here updates its weights with the gradient descent algorithm; you don't have to know exactly what gradient descent is, as long as you know there is a mathematical way to do the optimization. Because of the nature of machine learning, rapid progress is almost impossible, and since the field itself is developing quickly, you must keep up your enthusiasm and continue learning. In addition, the early learning cost is relatively high: before writing your own code, be prepared to have at least your own understanding of how calculus, linear algebra, and probability theory are applied. That is also a big advantage, because it makes you not so easy to replace.

0.2 About linear regression

The quickest way to understand machine learning rationally is to understand the entire flow of the linear regression algorithm; if you really understand linear regression through this article, you will understand this sentence. Of course, most of you reading this probably already have some understanding of linear regression. But I should mention that even if you're not going to get into machine learning, reading about linear regression is still a good way to expand your knowledge, because that's where most people start.

0.3 About Fundamentals

First, you need a little Python foundation. If you don't have one, it's easy to teach yourself. Then learn to use the following tools: NumPy, Matplotlib, sklearn, and TensorFlow. Some of them are covered in my previous articles:

  • Introduction to Machine learning from TensorFlow
  • Official NumPy Quickstart tutorial
  • Matplotlib Basic operation
  • Matplotlib saves the GIF

Beyond that, you also need some basic mathematics. Hoping to avoid the math entirely is not realistic; you can look up the mathematical proof of any algorithm you encounter while writing code, provided your prior knowledge is enough to follow most of it. Some of the math popular-science posts on Zhihu are actually quite good for this. After all, we are not in a mathematics department, so I think it's fine to understand the main logic and keep an attitude of putting it to use.

  • Machine learning: Reintroducing probability theory

0.4 About Reference

Thanks to the dedication of teachers like Andrew Ng and the MOOC platforms, as well as the many bloggers who share their knowledge on the Internet, the programming community has a good learning atmosphere, and many excellent learning resources circulate online. I have put the reference materials from my own learning process on GitHub; I believe everyone shares them so that others can learn better, and if there is any infringement I will delete them immediately. In this series of articles I will mainly record and practice from a beginner's perspective, on the basis of Andrew Ng's course.

  • Study reference books

0.5 About me

The knowledge of an undergraduate about to graduate is limited, so this article certainly has many shortcomings; I hope everybody will correct them. The code for the experiments is right here.

1 Background

I won't retell the story from the course; you can refer to the video. So what problem are we trying to solve? There are many phenomena in life, and we can draw certain conclusions from them, but when we see a phenomenon we have never seen before, we can't draw a conclusion. Some experienced people, however, can make predictions based on their experience. For example, a child who sees a dark cloud cannot predict rain, but an adult will know and remind the child to take an umbrella when going out… okay, I'm still telling a story, so let's move on 😂😂😂. The difference between the adult and the child in this example is that they have different ages, different experiences, different levels of knowledge, and therefore different judgments. So in order to predict from a phenomenon, we must learn, just as a child slowly grows up and learns to take an umbrella out after seeing ants moving their nest.

OK, to summarize the story above: a phenomenon can be treated as data, and the behavior caused by the conclusion is also data. The only difference is that the conclusion is drawn from the phenomenon, so the data of the phenomenon can be regarded as the input $x$ and the data of the conclusion as the output $y$. Machine learning, then, is about feeding input data into a model, comparing the output of the model with the previously known "true" results, and slowly changing the model until it predicts well; then the model is ready to use. The abstraction is: find a mapping $f$ between the input $x$ and the output $y$, so that for the vast majority of inputs $x$, the mapping gives a value $f(x)$ close to the true $y$.
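To make that loop concrete, here is a deliberately tiny sketch of my own (the data and the 0.05 nudge factor are arbitrary values chosen purely for illustration, not the method developed later in this article): the "model" is just a function with a parameter, and "learning" means nudging the parameter whenever the prediction disagrees with the known answer.

# A toy illustration of "feed input in, compare with the true answer, adjust the model".
def model(x, w):
    return w * x

data = [(1, 2), (2, 4), (3, 6)]   # (input, true output) pairs hiding the rule y = 2x
w = 0.0
for _ in range(100):
    for x, y_true in data:
        y_pred = model(x, w)
        w += 0.05 * (y_true - y_pred) * x   # nudge w to reduce the error

print(w)   # ends up close to 2, the mapping hidden in the data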

I'm not going to redo the housing-price example; just look at the scatter plot produced by the code below and see if we can solve it.

# coding: utf-8

import numpy as np
import matplotlib.pyplot as plt


def draw():
    X = np.linspace(0, 10, 50)
    noise = np.random.normal(0, 0.5, X.shape)
    Y = X * 0.5 + 3 + noise
    plt.scatter(X, Y)
    plt.savefig("scattergraph.png")
    plt.show()

if __name__ == "__main__":
    draw()

Well, you might find that if you want to give the $y$ for a new $x$, you have to rely on intuition, so everyone's answer will be different, and in fields where standards must be uniform (such as military projects), this kind of guesswork can never be tolerated. Imagine trying to succeed by intuition on the modern information battlefield.

So we need a model. You can think of it as a black box: feed $x$ in and pull $y$ out, and no matter who uses it, as long as the black box is the same, the same input $x$ produces the same output $y$. So what is the approximate shape of this model? At the moment this is our choice to make. Intuitively, you can view this distribution as a very thick line, so let's choose a linear model to simulate it. How do we determine this linear model? Here comes today's hero: linear regression.

2 Unary linear regression

The analysis starts from a single-variable linear function. Since we assume the relationship is unary and linear, our fitting function can be defined as $h(x) = Wx + b$. Taking the previous plot as an example again, our goal now is to figure out $W$ and $b$; once we have them, for any given $x$ we can find the corresponding $y$.

2.1 Define the evaluation function – loss function

So here is the whole thought process for figuring out $W$ and $b$.

Let's pick a point $(x_i, y_i)$. Because of the error, $y_i$ is not exactly equal to $Wx_i + b$; there is a residual $y_i - (Wx_i + b)$. Similarly, if there are many points, each one has such a residual, and for convenience we evaluate them all together.

Naturally, our goal is to make the error as small as possible, so we use the sum of squares to express it:

$$L(W, b) = \sum_{i=1}^{n} \bigl(y_i - (Wx_i + b)\bigr)^2$$
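To make the formula concrete, here is a quick check in code. It assumes $X$ and $Y$ are generated the same way as in the scatter-plot script above (true parameters $W = 0.5$, $b = 3$), so a guess near the true parameters should give a small loss and a bad guess a much larger one.

# Evaluating the sum-of-squares loss for two candidate (W, b) pairs, assuming
# the same data generation as the scatter-plot script above.
import numpy as np

X = np.linspace(0, 10, 50)
Y = X * 0.5 + 3 + np.random.normal(0, 0.5, X.shape)

def loss(W, b):
    return np.sum(np.square(Y - (X * W + b)))

print(loss(0.5, 3))   # small: only the noise contributes
print(loss(2.0, 0))   # much larger: the line misses the data badly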
Now let's see what this loss looks like as a surface over $W$ and $b$.

PS: the plot is a little ugly; there was no way around it. With the noise added, the function values span a very wide range, so a small change in the parameters distorts the picture. I could only use a 3D scatter plot to make do; if you know a better way, please let me know. I won't waste more time on this, so let's look at the code.

# coding: utf-8

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)
ax.set_xlabel(r'$Weight$')
ax.set_ylabel(r'$bias$')
ax.set_zlabel(r'$loss\ value$')
N = 200

X = np.linspace(0, 10, N)
noise = np.random.normal(0, 0.1, X.shape)
Y = X * 0.5 + 3 + noise

W = np.linspace(-10, 10, N)
b = np.linspace(-2, 8, N)

p = 0.6
plt.xlim(W.min() * p, W.max() * p)
plt.ylim(b.min() * p, b.max() * p)
ax.set_zlim(0, 1000)

h = []
for i in W:
    for j in b:
        h.append(np.sum(np.square(Y - (X * i + j))))

print(np.min(h))
# the loop above varies W in the outer loop and b in the inner loop, so h is
# transposed to match the (b, W) ordering that np.meshgrid produces
h = np.asarray(h).reshape(N, N).T
W, b = np.meshgrid(W, b)

ax.scatter(W, b, h, c='b')
plt.savefig("lossgraph.png")
plt.show()

So now we have a visual idea of roughly where the minimum is, but we're not there yet: we still don't have the specific values of $W$ and $b$ that we want.

2.2 Solving for the minimum – gradient descent

The problem is now further refined: find where the following function of two variables takes its minimum:

$$L(W, b) = \sum_{i=1}^{n} \bigl(y_i - (Wx_i + b)\bigr)^2$$
That's where calculus comes in. I found a graph to illustrate it.

Then the problem becomes very simple: just set the partial derivatives with respect to $W$ and $b$ equal to 0 and solve the resulting system of equations. But that is a mathematician's method, not a computer's way of thinking. It's like solving an equation in one variable: the "normal" computer method is to try values in the domain one by one until the result meets expectations. A slightly more advanced approach, of course, is to write an interpreter that programmatically carries out the human method, but that is obviously unrealistic, and the interpreter may not apply if the type of equation changes. So we still use the kind of "stupid" method a machine is good at. Remember how the teacher described computers at the beginning of the introductory course: fast but stupid. In this case we accept using speed to compensate for the other shortcomings; as long as the computed value gets closer to the final result step by step, it meets the demand. In fact, this speed of computers has a lot to do with the current rise of machine learning and artificial intelligence.
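Before making that step-by-step idea smarter, here is a minimal sketch of the naive "try values one by one" approach: a brute-force grid search over $W$ and $b$ (the grid ranges and resolution are arbitrary choices for illustration). It works for this tiny problem, but the number of combinations explodes as parameters multiply, which is why we want something better.

# Brute-force grid search: try many (W, b) pairs and keep the one with the
# smallest loss. Grid ranges and step are arbitrary choices for this sketch.
import numpy as np

X = np.linspace(0, 10, 50)
Y = X * 0.5 + 3 + np.random.normal(0, 0.5, X.shape)

best_W, best_b, best_loss = None, None, float("inf")
for W in np.linspace(-10, 10, 201):
    for b in np.linspace(-10, 10, 201):
        current = np.sum(np.square(Y - (X * W + b)))
        if current < best_loss:
            best_W, best_b, best_loss = W, b, current

print(best_W, best_b)   # close to the true 0.5 and 3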

All right, so let's start thinking about a way to make the computation approach the extremum.

So the way to think about it is, if you put a random ball on this plane, when the ball stops, it’s going to be where the extremum is. So how do we mimic this process?

Step 1: Release the ball

This is easy to do: just randomly pick a point $(W, b)$ anywhere in our domain.

Step 2: Roll

How to simulate the rolling is the difficult part, but it makes sense if you analyze it carefully. Imagine yourself as the ball: you start standing somewhere very unsteady, so your natural reaction is to move toward flatter ground; and comparing positions at the same altitude, the one on the steeper slope has a better chance of reaching the bottom faster. The analogy isn't perfect (think about what happens for a simple quadratic function); the point is mainly to get a feel for this way of thinking, and we can stay flexible. Once you have this slope-based intuition, the rest is easy. Here's the summary: given the step size, we always want to move toward the lowest point we can reach. Choosing that step size is another difficult problem, because computers have no intuition: if the value is not set properly, the training may converge too slowly or diverge outright, which is exactly what machine learning should avoid.
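That last point, that a badly chosen step size makes the process crawl or blow up, is easy to see on a one-variable toy problem. The sketch below minimizes $f(w) = w^2$, whose slope at $w$ is $2w$; the three step sizes are arbitrary values picked to show the three behaviours.

# Stepping against the slope of f(w) = w**2 with different step sizes.
def descend(step, times=50):
    w = 5.0
    for _ in range(times):
        w = w - step * 2 * w   # the slope of w**2 at w is 2*w
    return w

print(descend(0.1))     # converges quickly toward the minimum at 0
print(descend(0.001))   # barely moves: converges far too slowly
print(descend(1.1))     # overshoots more each step: diverges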

To analyze the problem, we use the control-variable method and hold one variable fixed. There are two variables, $W$ and $b$, in the loss function; when $W$ is held at a fixed value, the loss is a quadratic function of $b$.

# coding: utf-8

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()

N = 200

X = np.linspace(0, 10, N * 2)
noise = np.random.normal(0, 0.1, X.shape)
Y = X * 0.5 + 3 + noise

W = 1
b = np.linspace(-5, 6, N)

h = []
for i in b:
    h.append(np.sum(np.square(Y - (X * W + i))))

plt.plot(b, h)

plt.savefig("quadraticFunction.png")

plt.show()

Let's look at Mr. Ng's course slides directly; I don't want to redraw the pictures.

Now we can introduce the update rule. To find the minimum we look at the derivative (for a function of one variable) or the partial derivatives (for a function of several variables). Obviously, when the slope is positive we should move in the negative direction, and vice versa, so each update moves against the slope. How strong the update is, that is, how far we go each time, is set by the learning rate $\alpha$. And the farther we are from the extremum, the larger the absolute value of the slope and therefore the larger the step, which matches the update logic. With $n$ training points, the update used in the code below is:

$$W := W + \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (Wx_i + b)\bigr)\,x_i, \qquad b := b + \alpha \cdot \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (Wx_i + b)\bigr)$$
There might be a small mental leap here, namely the extra factor of $\frac{1}{n}$. You can think about it in terms of the amount of data: when updating the weights you're not computing the slope at just one point, and if you simply summed the contributions of all points the step would be far too large and the training would diverge, so we average instead; you can change the code and try it for yourself. Now that the basic idea is out, let's write code to implement it.

# coding: utf-8

import matplotlib.pyplot as plt
import numpy as np

N = 200

X = np.linspace(0, 10, N * 2)
noise = np.random.normal(0, 0.5, X.shape)
Y = X * 0.5 + 3 + noise


def calcLoss(train_X, train_Y, W, b):
    return np.sum(np.square(train_Y - (train_X * W + b)))

def gradientDescent(train_X, train_Y, W, b, learningrate=0.001, trainingtimes=500):
    global loss
    global W_trace
    global b_trace
    size = train_Y.size
    for _ in range(trainingtimes):
        prediction = W * train_X + b
        tempW = W + learningrate * np.sum(train_X * (train_Y - prediction)) / size
        tempb = b + learningrate * np.sum(train_Y - prediction) / size
        W = tempW
        b = tempb
        loss.append(calcLoss(train_X, train_Y, W, b))
        W_trace.append(W)
        b_trace.append(b)


Training_Times = 100
Learning_Rate = 0.002

loss = []
W_trace = [-1]
b_trace = [1]
gradientDescent(X, Y, W_trace[0], b_trace[0], learningrate=Learning_Rate, trainingtimes=Training_Times)
print(W_trace[-1], b_trace[-1])

fig = plt.figure()
plt.title(r'$loss\ function\ change\ tendency$')
plt.xlabel(r'$learning\ times$')
plt.ylabel(r'$loss\ value$')
plt.plot(np.linspace(1, Training_Times, Training_Times), loss)
plt.savefig("gradientDescentLR.png")
plt.show()

So far we have implemented unary linear regression by hand. For convenience, only 100 training iterations were used here, which is far from enough; in my tests, training for many more iterations gives a noticeably better result. Of course, you can also adjust the learning rate, the initial values, and so on; readers can try it for themselves.
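If you want to eyeball the result, a small addition like the sketch below can be appended to the script above; it reuses the names X, Y, W_trace, b_trace and Training_Times defined there and draws the fitted line over the noisy data (the file name is just an example). Rerun with a larger Training_Times to watch the fit improve.

# Appended to the script above: plot the fitted line from the final W and b.
plt.figure()
plt.scatter(X, Y, s=10)
plt.plot(X, W_trace[-1] * X + b_trace[-1], color="red")
plt.title(r'$fit\ after\ %d\ iterations$' % Training_Times)
plt.savefig("fitLR.png")
plt.show()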

3 Multiple linear regression

Let's first break the name apart: "multiple" means multiple independent variables, "linear" means the relationship satisfies homogeneity and additivity, and "regression" describes the kind of method. The unary case was easy because we could visualize everything with various tools, but expressing the multivariate case is harder, especially since we cannot picture high-dimensional linear spaces. So we have to abstract it and use mathematical notation in place of the tangle of variables.

You probably already guessed it: yes, we're going to use linear algebra. If you don't have the basics, you can spend a little time on Zhihu to understand what linear algebra is for; only basic applications of linear algebra are involved here.

Suppose we have three independent variables $x_1, x_2, x_3$ and one dependent variable $y$, and they satisfy a linear relationship. Then the following equation holds:

$$y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$

Suppose we also have four sets of data; then:

$$y^{(i)} = w_1 x_1^{(i)} + w_2 x_2^{(i)} + w_3 x_3^{(i)} + b, \qquad i = 1, 2, 3, 4$$

We can abstract this and write it as a matrix equation:

$$\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ y^{(4)} \end{bmatrix} = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & 1 \\ x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & 1 \\ x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & 1 \\ x_1^{(4)} & x_2^{(4)} & x_3^{(4)} & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ b \end{bmatrix}$$

Abstracted:

$$Y = XW$$

Among them, $Y$ is the column vector of outputs, $X$ is the data matrix (with a final column of ones to absorb the bias), and $W$ is the column vector of weights we want.

Our goal is to solve for $W$, that is, to solve the equation $Y = XW$ for $W$. We can solve the equation directly to get the result we want. Great, let's get started: multiply both sides on the left by $X^T$ and then by $(X^TX)^{-1}$, which gives

$$W = (X^T X)^{-1} X^T Y$$
If that doesn't make sense, here is a very simple way to see part of it: for the inverse to exist we need $AA^{-1} = A^{-1}A = I$, so the matrix has to be square, with as many rows as columns. Why full rank? My understanding is that a matrix is a mapping from one linear space to another. If the rank of the matrix is less than the number of rows, i.e. some of its vectors are linearly dependent, then the transformation collapses at least one dimension; that collapse is irreversible, so no matrix can undo it and the inverse does not exist. For a vector to be mapped to another vector and back again, the matrix must have an inverse, which means it must be full rank. There is also such a thing as a singular matrix, if you're interested.
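A small numerical illustration of the rank argument (my own example, not from the course): the matrix below has linearly dependent rows, so it collapses a dimension and NumPy refuses to invert it.

# A rank-deficient (singular) matrix has no inverse.
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])              # second row = 2 * first row
print(np.linalg.matrix_rank(A))         # 1, not full rank

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("no inverse:", err)           # raises "Singular matrix"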

So let’s go back to the main line and solve the equation.

# coding: utf-8

import numpy as np

X1 = np.asarray([2104, 1416, 1534, 852]).reshape(4, 1)
X2 = np.asarray([5, 3, 3, 2]).reshape(4, 1)
X3 = np.asarray([1, 2, 2, 1]).reshape(4, 1)

X = np.mat(np.column_stack((X1, X2, X3, np.ones(shape=(4, 1)))))
noise = np.random.normal(0, 0.1, X1.shape)
Y = np.mat(2.5 * X1 - X2 + 2 * X3 + 4 + noise)
YTwin = np.mat(2.5 * X1 - X2 + 2 * X3 + 4)

W = (X.T * X).I * X.T * Y
WTWin = (X.T * X).I * X.T * YTwin
print(W, "\n", WTWin)

# output:
# [[2.50043958]
# [1.16868808]
# [1.79213736]
# [4.27637958]]
# [[2.5]
# [-1.]
# [2]
# [4]]

Here we used the housing feature values from Andrew Ng's course, but the rows of data are too similar to one another, so the result computed from the noisy data carries a large error.
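One way to put a number on that "too similar" suspicion (a sketch of my own, not a definitive diagnosis) is the condition number of $X^TX$: the larger it is, the more a small amount of noise in $Y$ gets amplified into the solved weights.

# Condition number of X.T @ X for the four sample rows used above; a huge value
# means the normal-equation solution is very sensitive to noise in Y.
import numpy as np

X1 = np.asarray([2104, 1416, 1534, 852]).reshape(4, 1)
X2 = np.asarray([5, 3, 3, 2]).reshape(4, 1)
X3 = np.asarray([1, 2, 2, 1]).reshape(4, 1)
X = np.column_stack((X1, X2, X3, np.ones(shape=(4, 1))))

print(np.linalg.cond(X.T @ X))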

We can now compute multiple linear regression, but there is no "learning" in it at all in the machine-learning sense. We can also solve it with the gradient descent approach introduced earlier; here we'll let TensorFlow help us finish it.

# coding: utf-8

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

N = 1000
train_X1 = np.linspace(1, 10, N).reshape(N, 1)
train_X2 = np.linspace(1, 10, N).reshape(N, 1)
train_X3 = np.linspace(1, 10, N).reshape(N, 1)
train_X4 = np.linspace(1, 10, N).reshape(N, 1)

# train_X = np.column_stack((train_X1, np.ones(shape=(N, 1))))
train_X = np.column_stack((train_X1, train_X2, train_X3, train_X4, np.ones(shape=(N, 1))))

noise = np.random.normal(0, 0.5, train_X1.shape)
# train_Y = 3 * train_X1 + 4
train_Y = train_X1 + train_X2 + train_X3 + train_X4 + 4 + noise

length = len(train_X[0])

X = tf.placeholder(tf.float32, [None, length], name="X")
Y = tf.placeholder(tf.float32, [None, 1], name="Y")

W = tf.Variable(np.random.random(size=length).reshape(length, 1), dtype=tf.float32, name="weight")

activation = tf.matmul(X, W)
learning_rate = 0.006

loss = tf.reduce_mean(tf.reduce_sum(tf.pow(activation - Y, 2), reduction_indices=[1]))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

training_epochs = 2000
display_step = 100

loss_trace = []

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        sess.run(optimizer, feed_dict={X: train_X, Y: train_Y})
        temp_loss = sess.run(loss, feed_dict={X: train_X, Y: train_Y})
        loss_trace.append(temp_loss)
        if 1 == epoch % display_step:
            print('epoch: %4s'%epoch, '\tloss: %s'%temp_loss)
    print("\nOptimization Finished!")
    print("\nloss = ", loss_trace[- 1]."\nWeight =\n", sess.run(W, feed_dict={X: train_X, Y: train_Y}))


plt.plot(np.linspace(0, 100, 100), loss_trace[:100])
plt.savefig("tensorflowLR.png")
plt.show()

# output:
# epoch: 1 Loss: 118.413925
# epoch: 101 Loss: 1.4500043
# epoch: 201 Loss: 1.0270562
# epoch: 301 Loss: 0.75373846
# epoch: 401 Loss: 0.5771168
# epoch: 501 Loss: 0.46298113
# epoch: 601 Loss: 0.38922414
# epoch: 701 Loss: 0.34156123
# epoch: 801 Loss: 0.31076077
# epoch: 901 Loss: 0.29085675
# epoch: 1001 Loss: 0.27799463
# epoch: 1101 Loss: 0.26968285
# epoch: 1201 Loss: 0.2643118
# epoch: 1301 Loss: 0.26084095
# epoch: 1401 Loss: 0.2585978
# epoch: 1501 Loss: 0.25714833
# epoch: 1601 Loss: 0.25621164
# epoch: 1701 Loss: 0.2556064
# epoch: 1801 Loss: 0.2552152
# epoch: 1901 Loss: 0.2549625
# Optimization Finished!
# loss = 0.25480175
# Weight =
# [[1.0982682]
# [0.9760315]
# [1.0619627]
# [0.87049955]
# [3.9700394]]

The fit here is not very good, and I'm not sure whether it's a problem with the data: the four features grow in exactly the same way, so their similarity is very high, and the behaviour feels a bit like over-fitting. If you know the cause, please let me know. As you can see in the plot below, the loss stops decreasing early on.
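One check consistent with that suspicion (again just my guess, not a confirmed diagnosis): the four features in the script above are produced by identical np.linspace calls, so they are perfectly correlated, and the data can only pin down the sum of their weights, not each weight individually. That would explain why the learned weights wander around 1 each while the bias and the overall fit still look reasonable.

# The four features are generated identically, so they are perfectly correlated.
import numpy as np

N = 1000
train_X1 = np.linspace(1, 10, N)
train_X2 = np.linspace(1, 10, N)
print(np.corrcoef(train_X1, train_X2)[0, 1])   # exactly 1.0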

4 Summary

The solving process of linear regression is very intuitive and matches our understanding. One preconception built into it is that the data must be roughly linear, since that is what "linear" regression assumes; if the data distribution is not linear, the method of course does not apply. By the same token, if we can't see the distribution of the data, we can't guess the model and can only test step by step. So machine learning is about choosing a model suited to the data at hand, which can be understood as finding a reasonable mapping between input and output; then we feed in the data, construct a loss function that correctly assesses how well the model fits, and finally optimize that loss function. One problem I didn't mention before is that if we just search blindly, we can't actually guarantee finding the global optimum; under normal circumstances, though, a model at a local optimum already predicts the output well. This is just my personal understanding, and everyone is welcome to discuss it. The code for the experiments is on my GitHub, take what you need. This series may continue to be updated, but it may take some time until the next post, because it's Chinese New Year ^_^. The reference materials were given earlier, so I won't list the thanks one by one.