Regression and gradient descent

Mathematically, regression means being given a set of points and fitting them with a curve. If the curve is a straight line, it is called linear regression; if it is a quadratic curve, it is called quadratic regression. There are many varieties of regression, such as locally weighted regression, logistic regression, etc., which we will get to later.

To illustrate regression, here is a very simple example that appears in many places, including open-source tools such as Weka: predicting the value of a house. The value of a house depends on many factors, such as area, number of rooms (bedrooms and living rooms), floor, orientation, etc. These variables that affect the value of a house are called features. Features are a very important concept in machine learning, and many papers are devoted to them. Here, for simplicity, let's assume our house has a single feature that affects its value: the area of the house.

Suppose we have housing sales data as follows:

Area (m²)    Selling price (10,000 yuan)
123          250
150          320
 87          160
102          220
…            …

This table resembles housing prices around Beijing's Fifth Ring Road. We can plot it with the area of the house on the x-axis and the selling price on the y-axis, as follows:

 

 

 

If a new house area comes in that is not in the sales price records, what do we do?

We can fit the data as accurately as possible with a curve; then, when a new input comes in, we return the value at the corresponding point on the curve. If a straight line is used to fit it, it might look something like this:

 

 

 

The green dots are the ones we want to predict.
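As a concrete sketch of such a straight-line fit, here is a minimal closed-form least-squares fit in Python on the four table rows above (the query area of 110 m² is made up for illustration, not from the table):

```python
# Toy housing data from the table: area (m^2) -> price (10k yuan)
areas = [123, 150, 87, 102]
prices = [250, 320, 160, 220]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a + b*x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(areas, prices)
# Predict the price of an area that is not in the sales records
print(b, a + b * 110)
```

Gradient descent, introduced below, reaches essentially the same line iteratively instead of in closed form.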

First, some concepts and common symbols, which may differ between machine learning books:

Housing sales records – Training set or training data, which is the input data in our process, commonly referred to as X

House sale price – output data, usually called Y

The fitting function (or hypothesis or model) is usually written as y = h(x).

Each example in the training set consists of a pair of input data and output data

Dimension of input data (number of features, #features), n
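In code, this notation is nothing more than pairs of inputs and outputs; the numbers below reuse the toy housing table from above:

```python
# Training set: each example is a pair (x, y) of input features and output
training_set = [
    ([123], 250),   # x: [area], y: selling price
    ([150], 320),
    ([87], 160),
    ([102], 220),
]
m = len(training_set)          # number of training examples
n = len(training_set[0][0])    # number of features per example
print(m, n)
```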

The following is a typical machine learning process. Given input data, the algorithm goes through a series of steps to obtain an estimation function, which can then produce estimates for new, previously unseen data; this is also known as building a model, just like the linear regression function above.

 

 

 

We use x1, x2, …, xn to denote the components of the feature, e.g. x1 = area of the room, x2 = orientation of the room, etc. We can then define an estimation function:

h(x) = hθ(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn

θ is called a parameter here; it adjusts the influence of each feature component, that is, whether the area of the house or the lot of the house matters more. If we set x0 = 1, we can write this in vector form:

h(x) = θᵀx = θ0·x0 + θ1·x1 + … + θn·xn    (with x0 = 1)

Our program also needs a mechanism to evaluate whether a given θ is good, so we need a way to evaluate the h function we build. Generally, this function is called the loss function or error function, and it describes how bad the h function is. Below we call this function the J function.

Here we can define the following error function:

J(θ) = (1/2) Σ_{i=1..m} (hθ(x(i)) − y(i))²

The error is the sum of the squares of the differences between the estimate hθ(x(i)) and the true value y(i); we multiply by 1/2 so that the coefficient cancels when we take the derivative.
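A direct transcription of h and J into Python, on a small made-up data set where y = 2x holds exactly:

```python
def h(theta, x):
    """Hypothesis h_theta(x); x[0] == 1 is the prepended bias feature."""
    return sum(t * xi for t, xi in zip(theta, x))

def J(theta, X, Y):
    """Squared-error cost: (1/2) * sum over examples of (h(x) - y)^2."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in zip(X, Y))

X = [[1, 1], [1, 2], [1, 3]]   # x0 = 1 plus one feature
Y = [2, 4, 6]
print(J([0, 2], X, Y))   # theta = [0, 2] fits y = 2x exactly, so the cost is 0
```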

There are a number of ways to adjust θ so that J(θ) is minimized. One of them is least squares, a purely algebraic method; the Stanford open machine learning course derives where the least-squares formula comes from, and it can be found in many machine learning and mathematics books. I won't cover least squares here, but rather the gradient descent method.

Gradient descent is carried out as follows:

1) First assign a value to θ, either randomly, or let θ be an all zero vector.

2) Change the value of θ so that J(θ) decreases in the direction of gradient descent.
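The two steps above can be sketched on a one-dimensional convex toy function J(θ) = (θ − 3)² (an assumed example, not the housing cost):

```python
def J(theta):
    return (theta - 3) ** 2

def grad_J(theta):
    return 2 * (theta - 3)

theta = 0.0            # step 1): assign an initial value (zero here)
alpha = 0.1            # step size
for _ in range(100):   # step 2): move against the gradient so J decreases
    theta -= alpha * grad_J(theta)
print(theta)           # approaches the minimizer, 3
```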

For clarity, the following diagram is given:

This is a graph of the error function J(θ): the red region is where J(θ) is high, and what we want is to get it as low as possible, into the dark blue region. θ0 and θ1 represent the two dimensions of the θ vector.

 

 

The first step of the gradient descent method mentioned above is to give θ an initial value; suppose the randomly chosen initial value is the point marked by a cross on the graph.

Then we adjust θ along the gradient descent direction, which moves J(θ) lower; as shown in the figure, the algorithm ends when θ descends to a point from which it can fall no further.

Of course, the final point reached by gradient descent may not be the global minimum but a local minimum, as in the following:

 

 

 

 

 

The figure above shows a local minimum reached by re-selecting the initial point. It appears that which local minimum our algorithm falls into is largely determined by the choice of initial point.
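This sensitivity to the initial point is easy to reproduce on a small non-convex function; f(θ) = θ⁴ − 3θ² + θ below is an arbitrary choice with two minima, not anything from the housing example:

```python
def grad(theta):
    # derivative of f(theta) = theta**4 - 3*theta**2 + theta
    return 4 * theta ** 3 - 6 * theta + 1

def descend(theta, alpha=0.01, steps=2000):
    """Plain gradient descent from a given initial point."""
    for _ in range(steps):
        theta -= alpha * grad(theta)
    return theta

# Two different initial points end up in two different minima:
# from +1.0 we reach the local minimum near 1.13,
# from -1.0 we reach the (global) minimum near -1.30.
print(descend(1.0), descend(-1.0))
```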

Now I will use an example to describe gradient descent, by taking the partial derivative of our function J(θ) (if the steps are unclear, review calculus). For a single training example,

∂J(θ)/∂θi = (hθ(x) − y) · xi

Here is the update rule, in which θi moves along the direction of steepest descent:

θi := θi − α · (hθ(x) − y) · xi

The θi on the right is the value before the update, the subtracted term is the step along the gradient direction, and α is the step size (learning rate), which controls how far each update moves.

One point worth noting is what the gradient direction is: for the vector θ, each component θi yields a partial derivative, and together these form the gradient direction of the whole vector. Changing θ against this direction gives the steepest decrease and leads to a minimum point, whether local or global.

In more compact mathematical language, step 2) looks like this:

θ := θ − α · ∇J(θ)

The inverted triangle (∇) denotes the gradient; written this way, the component index i disappears.

Batch gradient descent (BGD)

Batch Gradient Descent (BGD) is the most basic form of gradient descent. Its idea is to use all samples to update each parameter. Its mathematical form is as follows:

(1) Take the partial derivative of the above cost function (written here with a 1/(2m) factor, J(θ) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²):

∂J(θ)/∂θj = (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) · xj(i)

(2) Since this is a risk-minimization problem, each θj is updated along the negative gradient:

θj := θj − α · (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) · xj(i)

The pseudocode is:

repeat {
    θj := θj − α · (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) · xj(i)    (for every j = 0, …, n)
}

It can be seen from the formula above that each step obtains a globally optimal descent direction, but every iteration uses all the data in the training set; if the number of samples m is large, you can imagine how slow the iterations become. This motivates another method: stochastic gradient descent.
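A sketch of one BGD update in Python makes the cost visible: every single step loops over all m samples. The toy data below assumes y = 2x and is not the housing example:

```python
def bgd_step(theta, X, Y, alpha):
    """One batch update: average the gradient over ALL m samples."""
    m, n = len(X), len(theta)
    grad = [0.0] * n
    for x, y in zip(X, Y):                 # every sample, every step
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j in range(n):
            grad[j] += err * x[j] / m
    return [t - alpha * g for t, g in zip(theta, grad)]

X = [[1, 1], [1, 2], [1, 3]]   # x0 = 1 plus one feature
Y = [2, 4, 6]
theta = [0.0, 0.0]
for _ in range(500):
    theta = bgd_step(theta, X, Y, alpha=0.1)
print(theta)   # approaches y = 2x, i.e. theta close to [0, 2]
```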

Advantages: converges to the global optimum (for a convex cost); easy to implement in parallel.

Disadvantages: The training process is slow when the sample size is large.

In terms of iteration count, BGD needs relatively few iterations, and its convergence curve descends smoothly.

Stochastic gradient descent (SGD)

Batch gradient descent requires all training samples for every parameter update, so training becomes unusually slow as the number of samples grows. Stochastic Gradient Descent (SGD) was proposed to address this drawback of batch gradient descent.

Write the above cost function as an average of per-sample losses:

J(θ) = (1/m) Σ_{i=1..m} (1/2) (hθ(x(i)) − y(i))²

Taking the partial derivative of a single sample's loss with respect to θ gives the gradient used to update θ:

θj := θj − α · (hθ(x(i)) − y(i)) · xj(i)

The pseudocode is:

repeat {
    for i = 1, …, m {
        θj := θj − α · (hθ(x(i)) − y(i)) · xj(i)    (for j = 0, …, n)
    }
}

Stochastic gradient descent updates iteratively, one sample at a time. If the sample size is very large (say, hundreds of thousands), it may need only tens of thousands of samples, or fewer, to iterate θ to a good solution, whereas batch gradient descent uses the entire training set for every single update: ten iterations mean ten full passes over the training set. The price is that SGD is noisier than BGD, so an individual update does not always move toward the global optimum.
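The same kind of toy problem with per-sample updates; note that each update touches one example only. The data below assumes the exact relation y = 1 + 2x:

```python
import random

def sgd_epoch(theta, X, Y, alpha):
    """One pass over the data: update theta after EACH sample."""
    order = list(range(len(X)))
    random.shuffle(order)                 # visit samples in random order
    for i in order:
        x, y = X[i], Y[i]
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        theta = [t - alpha * err * xj for t, xj in zip(theta, x)]
    return theta

random.seed(0)
X = [[1.0, float(i)] for i in range(1, 7)]
Y = [1 + 2 * x[1] for x in X]             # exact relation y = 1 + 2x
theta = [0.0, 0.0]
for _ in range(500):
    theta = sgd_epoch(theta, X, Y, alpha=0.01)
print(theta)   # noisy path, but close to [1, 2]
```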

Advantages: fast training speed;

Disadvantages: reduced accuracy, not globally optimal; not easy to implement in parallel.

In terms of iteration count, SGD needs many iterations, and its search path through the solution space looks blind.

Mini-batch gradient descent (MBGD)

As seen above, each of the two gradient descent methods has advantages and drawbacks. Can we strike a compromise between them, so that training is reasonably fast while the accuracy of the final parameters is still guaranteed? That is exactly the intent of mini-batch Gradient Descent (MBGD).

MBGD uses b samples for each parameter update (b is usually around 10). In pseudocode, say b = 10 and m = 1000:

repeat {
    for i = 1, 11, 21, 31, …, 991 {
        θj := θj − α · (1/10) Σ_{k=i..i+9} (hθ(x(k)) − y(k)) · xj(k)    (for every j = 0, …, n)
    }
}
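The same toy problem (assumed data with y = 1 + 2x) in mini-batch form, with b = 2 instead of 10 only because the toy set has just six samples:

```python
def mbgd_epoch(theta, X, Y, alpha, b):
    """One pass over the data: update theta once per mini-batch of b samples."""
    for start in range(0, len(X), b):
        xs, ys = X[start:start + b], Y[start:start + b]
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):          # average gradient over the batch
            err = sum(t * xi for t, xi in zip(theta, x)) - y
            for j in range(len(theta)):
                grad[j] += err * x[j] / len(xs)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

X = [[1.0, float(i)] for i in range(1, 7)]
Y = [1 + 2 * x[1] for x in X]             # exact relation y = 1 + 2x
theta = [0.0, 0.0]
for _ in range(1000):
    theta = mbgd_epoch(theta, X, Y, alpha=0.02, b=2)
print(theta)   # close to [1, 2]
```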

 

Here’s an example:

Suppose we know the following store sales data:

 

Number of stores X    Actual sales Y
1                     13
2                     14
3                     20
4                     21
5                     25
6                     30

How do we model the relationship between the number of stores X and the sales Y? Suppose it is linear: Y = a0 + a1·X

 

Now how do we estimate a0 and a1 (written theta0 and theta1 below) from what we know? Here we use gradient descent:


The core of gradient descent is the repeated update

θj := θj − α · ∂J(θ)/∂θj

applied to the hypothesis function

hθ(x) = θ0 + θ1·x

with the loss function

J(θ) = (1/2m) Σ_{i=1..m} (hθ(x(i)) − y(i))²

where θj denotes the coefficients of the hypothesis function and α is the learning rate.

Applying gradient descent to our linear regression problem, the key is the derivative of the cost function:

∂J(θ)/∂θ0 = (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i))
∂J(θ)/∂θ1 = (1/m) Σ_{i=1..m} (hθ(x(i)) − y(i)) · x(i)

Python code implementation:

# training data: x is the number of stores, y the actual sales
# model: y = theta0 + theta1 * x
x = [1, 2, 3, 4, 5, 6]
y = [13, 14, 20, 21, 25, 30]

epsilon = 1          # convergence threshold
alpha = 0.01         # learning rate
diff = [0, 0]
max_itor = 20
error1 = 0
error0 = 0
cnt = 0
m = len(x)

# init the parameters to zero
theta0 = 0
theta1 = 0

while 1:
    cnt = cnt + 1
    diff = [0, 0]
    # batch gradient: accumulate the error over all samples
    for i in range(m):
        diff[0] += theta0 + theta1 * x[i] - y[i]
        diff[1] += (theta0 + theta1 * x[i] - y[i]) * x[i]
    theta0 = theta0 - alpha / m * diff[0]
    theta1 = theta1 - alpha / m * diff[1]
    error1 = 0
    for i in range(m):
        error1 += (theta0 + theta1 * x[i] - y[i]) ** 2
    # note: error0 is never updated, so in practice the loop
    # stops via the iteration cap below
    if abs(error1 - error0) < epsilon:
        break
    print('theta0 : %f, theta1 : %f, error : %f' % (theta0, theta1, error1))
    if cnt > max_itor:
        print('cnt > 20')
        break

print('theta0 : %f, theta1 : %f, error : %f' % (theta0, theta1, error1))

The results are as follows:

theta0 : 0.205000, theta1 : 0.816667, error : 1948.212261
theta0 : 0.379367, theta1 : 1.502297, error : 1395.602361
theta0 : 0.527993, theta1 : 2.077838, error : 1005.467313
theta0 : 0.654988, theta1 : 2.560886, error : 730.017909
theta0 : 0.763807, theta1 : 2.966227, error : 535.521394
theta0 : 0.857351, theta1 : 3.306283, error : 398.166976
theta0 : 0.938058, theta1 : 3.591489, error : 301.147437
theta0 : 1.007975, theta1 : 3.830615, error : 232.599138
theta0 : 1.068824, theta1 : 4.031026, error : 184.147948
theta0 : 1.122050, theta1 : 4.198911, error : 149.882851
theta0 : 1.168868, theta1 : 4.339471, error : 125.631467
theta0 : 1.210297, theta1 : 4.457074, error : 108.448654
theta0 : 1.247197, theta1 : 4.555391, error : 96.255537
theta0 : 1.280286, theta1 : 4.637505, error : 87.584709
theta0 : 1.310171, theta1 : 4.706007, error : 81.400378
theta0 : 1.337359, theta1 : 4.763073, error : 76.971413
theta0 : 1.362278, theta1 : 4.810533, error : 73.781731
theta0 : 1.385286, theta1 : 4.849922, error : 71.467048
theta0 : 1.406686, theta1 : 4.882532, error : 69.770228
theta0 : 1.426731, theta1 : 4.909448, error : 68.509764
theta0 : 1.445633, theta1 : 4.931579, error : 67.557539
cnt > 20
theta0 : 1.445633, theta1 : 4.931579, error : 67.557539
[Finished in 0.2s]

It can be seen that the error decreases steadily when the learning rate is 0.01. The plots look as follows (the first graph corresponds to a small learning rate, the second to a large one):

 




Now let's increase the learning rate and see whether we reproduce the second graph:

When we adjust the learning rate to 0.3, we get the following results:

theta0 : 6.150000, theta1 : 24.500000, error : 38386.135000
theta0 : -15.270000, theta1 : -68.932500, error : 552053.226569
theta0 : 67.840125, theta1 : 285.243875, error : 7950988.401277
theta0 : -245.867981, theta1 : -1059.347887, error : 114525223.507401
theta0 : 946.357695, theta1 : 4043.346381, error : 1649619133.261223
theta0 : -3576.913313, theta1 : -15323.055232, error : 23761091159.680252
theta0 : 13591.518674, theta1 : 58177.105053, error : 342254436006.869995
theta0 : -51565.747234, theta1 : -220775.317546, error : 4929828278909.234375
theta0 : 195724.210360, theta1 : 837920.911885, error : 71009180027939.656250
theta0 : -742803.860227, theta1 : -3180105.158068, error : 1022815271242165.875000
theta0 : 2819153.863813, theta1 : 12069341.864380, error : 14732617369683060.000000
theta0 : -10699395.102930, theta1 : -45806250.675551, error : 212208421856953728.000000
theta0 : 40606992.787278, theta1 : 173846579.256281, error : 3056647245837464576.000000
theta0 : -154114007.118001, theta1 : -659792674.286440, error : 44027905696333684736.000000
theta0 : 584902509.168162, theta1 : 2504083725.690765, error : 634177359734604038144.000000
theta0 : -2219856149.407590, theta1 : -9503644836.328783, error : 9134682134868024885248.000000
theta0 : 8424927779.709908, theta1 : 36068788150.345154, error : 131575838248146814631936.000000
theta0 : -31974778105.915466, theta1 : -136890372077.920685, error : 1895216599231190653730816.000000
theta0 : 121352546013.825867, theta1 : 519534337912.329712, error : 27298674329760760684609536.000000
theta0 : -460564272592.117981, theta1 : -1971767072878.787598, error : 393209736799816196514906112.000000
theta0 : 1747960435714.394287, theta1 : 7483365594965.919922, error : 5663787744653302294061776896.000000
cnt > 20
theta0 : 1747960435714.394287, theta1 : 7483365594965.919922, error : 5663787744653302294061776896.000000
[Finished in 0.2s]

You can see that both theta0 and theta1 oscillate with growing magnitude, as expected.

The code above uses batch gradient descent, which is very slow on large data sets because every iteration must go through the entire data set. Stochastic gradient descent was introduced for this, which is essentially the idea of learning from sampled examples.
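As an illustration (this variant is not from the original post), the batch loop in the store example can be turned into per-sample updates like this:

```python
# same store data as the batch example
x = [1, 2, 3, 4, 5, 6]
y = [13, 14, 20, 21, 25, 30]
alpha = 0.01
theta0, theta1 = 0.0, 0.0

for epoch in range(1000):
    for i in range(len(x)):               # one sample per parameter update
        err = theta0 + theta1 * x[i] - y[i]
        theta0 -= alpha * err
        theta1 -= alpha * err * x[i]

error = sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y))
print(theta0, theta1, error)
```

With this small, consistent data set the result hovers near the least-squares line (roughly theta0 ≈ 8.6, theta1 ≈ 3.4); on large data sets the per-sample updates are what make SGD cheap.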
