This is the 14th day of my participation in the August More Text Challenge.

1. Foreword

Machine learning is a very hot term these days, and many things in daily life involve it. But do you know what machine learning actually is? Many readers will have already encountered some machine learning without using the term.

Today we'll take a look at machine learning and use Java to implement a very common machine learning algorithm: linear regression.

2. Machine learning

Machine learning is a subset of artificial intelligence. Its purpose is to build a model and continuously optimize the model's parameters by learning from existing experience, then use the optimized model to predict things that haven't happened yet. Four things are involved here:

  1. Model
  2. Experience
  3. Learning
  4. Prediction

A model is what we call a machine learning algorithm, and usually we choose from models that already exist, such as linear regression, logistic regression, or support vector machines. We need to choose the model according to the type of problem; this article introduces the linear regression model.

Experience is what we often call data. In the case of weather forecasting, the experience is the weather data of the last few hours or days.

Learning is a very important step in machine learning. But machines don't learn on their own; they need to be told how to learn. Therefore, we need to define a special function (the loss function) to help the machine learn the parameters.

Prediction is the process of actually using the model, and it's the easiest step.

Now that we’ve covered some of the basics, let’s look at some of the details of linear regression.

3. Linear regression

3.1. Find rules

I'm sure you've all done a kind of puzzle where you're given the following numbers and asked to guess the next one:

4
7
10
13
16

It would have been difficult to guess the result if we had guessed blindly. Now let’s assume that the above numbers satisfy the equation:


$$y = kx + b$$

We take the position of each number as x and the number itself as y. Then we get the following coordinates:

(1, 4)
(2, 7)
(3, 10)
(4, 13)
(5, 16)

Then substitute two coordinates into the equation, such as (1,4) and (3,10), and we can obtain the following system:


$$\begin{cases} k + b = 4 \\ 3k + b = 10 \end{cases}$$

We can solve for


$$\begin{cases} k = 3 \\ b = 1 \end{cases}$$

We can then substitute the other coordinates to verify that this equation holds for all of them. Now we can guess the next number: the next x is 6, so the next number should be 3 × 6 + 1 = 19.
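To double-check, the fitted rule can be verified in a few lines of Java (a standalone sketch; the class name is arbitrary):

```java
public class GuessNext {
    public static void main(String[] args) {
        // Solved from (1, 4) and (3, 10): k = 3, b = 1
        double k = 3, b = 1;
        int[] xs = {1, 2, 3, 4, 5};
        int[] ys = {4, 7, 10, 13, 16};
        // Verify that y = 3x + 1 reproduces every known pair
        for (int i = 0; i < xs.length; i++) {
            System.out.println(xs[i] + " -> " + (k * xs[i] + b) + " (expected " + ys[i] + ")");
        }
        // Predict the next number (x = 6)
        System.out.println("next = " + (k * 6 + b));  // next = 19.0
    }
}
```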

3.2. Linear regression

This is actually linear regression, and our goal is to find the optimal pair of k and b. However, we usually write the linear regression equation like this:


$$y = wx + b$$

This equation is the same as the previous one, just written with different letters: w stands for weight and b stands for bias.

Now the way we solve for the parameters is different: instead of just plugging in coordinates and solving the equations, we adjust the parameters using an algorithm called gradient descent. We'll cover that in detail in a later section.

In machine learning, we refer to the data used to adjust the parameters as training data, where x is called the feature value and y is called the target value.

3.3. Loss function

Before learning begins, we usually give the model initial parameters. But the initial parameters are usually not very good, and we get poor results with them. How bad are they, exactly? We need to define a loss function to assess how bad a set of parameters is.

The loss function evaluates the quality of the parameters, and there are usually many candidate functions to choose from. Here, we choose the mean squared error as our loss function:


$$L = [y - (wx + b)]^2$$

Here y and x are data we already know, while w and b are the variables of this function. We use wx + b to compute the prediction, then subtract the predicted y from the actual y; that difference is our prediction error. To make sure the error is positive, we square it.

Note that we may or may not take the square root of this quantity; when we do take the square root, the result is called the root mean square error.

Above, we only calculated the loss value for one of the numbers, but we usually have many training data, so we need to ask for the average of all the errors. So the complete loss value is calculated as follows:


$$L = \frac{1}{n}\sum_{i=1}^{n}[y_i - (wx_i + b)]^2$$

It’s just an average operation.
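As a sketch, the mean squared error formula above can be computed directly (the `mse` helper name here is illustrative, not part of the model class built later):

```java
import java.util.List;

public class MseDemo {
    // Mean squared error of y = wx + b over the given data
    static double mse(List<Double> xs, List<Double> ys, double w, double b) {
        double sum = 0;
        for (int i = 0; i < xs.size(); i++) {
            double err = ys.get(i) - (w * xs.get(i) + b);
            sum += err * err;  // square each prediction error
        }
        return sum / xs.size();  // average over all n samples
    }

    public static void main(String[] args) {
        List<Double> xs = List.of(1.0, 2.0, 3.0, 4.0, 5.0);
        List<Double> ys = List.of(4.0, 7.0, 10.0, 13.0, 16.0);
        // The exact parameters (w = 3, b = 1) give zero loss
        System.out.println(mse(xs, ys, 3, 1));  // 0.0
        // Worse parameters give a larger loss
        System.out.println(mse(xs, ys, 2, 1));  // 11.0
    }
}
```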

4. Gradient descent

4.1 Gradient Descent

We can first look at the graph of the loss as a function of the parameters. To keep the picture simple, we only consider the parameter w; the graph is as follows:

The x-axis is the parameter w, and the y-axis is the loss value. From the figure, point B is the point with the minimum loss (ignoring the area outside the figure).

There are two points A and C to the left and right of point B. Let’s analyze these two points respectively.

Point A is to the left of point B. To make the loss smaller, we need to move the parameter to the right, that is, increase it. From the graph, the slope of the tangent line at A is less than 0, so the derivative at A is negative. This is an important observation, and we'll return to it after analyzing point C.

Point C is to the right of point B, so we need to move the parameter to the left, that is, decrease it. The slope of the tangent line at C is greater than 0, so the derivative at C is positive.

After analyzing points A and C, we find that when the derivative is positive we need to decrease the parameter, and when the derivative is negative we need to increase it. Therefore, we can update the parameter as follows:


$$w' = w - \frac{dL}{dw}$$

That is, the updated parameter equals the original parameter minus the derivative of the loss function with respect to w.

4.2 Learning rate

Updating the parameters this way is actually problematic, because the derivative we get is often a relatively large number. If we subtract it directly, we get the situation shown in the picture on the left:

You can see that the parameter oscillates back and forth in the valley, and it takes a long time to reach the optimal value, or it is never reached at all.

So we multiply the derivative by a very small number called the learning rate. With a learning rate, each update moves the parameter a little closer to the optimum, as in the picture on the right.
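To see the effect of the learning rate, here is a minimal one-dimensional sketch. The toy loss L(w) = (w − 3)², with derivative 2(w − 3), is an assumption made purely for illustration:

```java
public class LearningRateDemo {
    public static void main(String[] args) {
        // Toy loss L(w) = (w - 3)^2 with its minimum at w = 3
        double w = 0;
        double learningRate = 0.1;  // small factor scaling the derivative
        for (int i = 0; i < 100; i++) {
            double grad = 2 * (w - 3);      // dL/dw at the current w
            w -= learningRate * grad;       // scaled gradient descent step
        }
        // With learningRate = 1.0 the update would overshoot and oscillate
        // between 0 and 6 forever; with 0.1 it converges smoothly.
        System.out.println(w);  // very close to 3.0
    }
}
```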

4.3 Updating multiple parameters

Now that we know how to update one parameter, how do we update multiple parameters? The linear regression equation has two parameters, w and b. What difference does that make?

Not much, actually: instead of the ordinary derivative, we take partial derivatives, and then update w and b separately.

The partial derivative of the loss function with respect to w is:


$$\frac{\partial L}{\partial w} = -2x[y - (wx + b)]$$

The partial derivative of the loss function with respect to b is:


$$\frac{\partial L}{\partial b} = -2[y - (wx + b)]$$

Our parameter update function is as follows:


$$w' = w - \eta \frac{\partial L}{\partial w} \qquad b' = b - \eta \frac{\partial L}{\partial b}$$

Here η is the learning rate. Even if you don't remember the derivation, you can simply apply these formulas to update the parameters.
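A single update step can be worked through numerically. The starting values x = 1, y = 4, w = 0, b = 0, and η = 0.1 below are arbitrary illustrative choices:

```java
public class SingleStepDemo {
    public static void main(String[] args) {
        // One training pair and initial parameters (illustrative values)
        double x = 1, y = 4;
        double w = 0, b = 0;
        double eta = 0.1;  // learning rate

        // Partial derivatives from the formulas above
        double dw = -2 * x * (y - (w * x + b));  // -2 * 1 * 4 = -8.0
        double db = -2 * (y - (w * x + b));      // -2 * 4 = -8.0

        // Update rule: parameter minus learning rate times partial derivative
        w -= eta * dw;
        b -= eta * db;
        System.out.println("w = " + w + ", b = " + b);  // w = 0.8, b = 0.8
    }
}
```

Both gradients are negative here, so both parameters increase, moving the prediction wx + b closer to the target y = 4.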

5. Implementing linear regression

Let’s implement a linear regression model in Java.

5.1 Creating the class and initializing parameters

First we create a class called LinearRegression, which looks like this:

package com.zack.lr;

import java.util.ArrayList;

public class LinearRegression {
    // weight
    private double weight;
    // bias
    private double bias;
    // feature values
    private ArrayList<Double> features;
    // target values
    private ArrayList<Double> targets;

    /**
     * Construct a linear regression model
     * @param features the feature values of the training data
     * @param targets the target values of the training data
     */
    public LinearRegression(ArrayList<Double> features, ArrayList<Double> targets) {
        this.features = features;
        this.targets = targets;
        initParameter();
    }

    /**
     * Initialize the weight and bias randomly
     */
    public void initParameter() {
        this.weight = Math.random();
        this.bias = Math.random();
    }

    public double getWeight() {
        return this.weight;
    }

    public double getBias() {
        return this.bias;
    }
}

In the class, we define four member variables: the two model parameters (weight and bias) and the feature values and target values of the training data.

When creating the model, we pass in the feature values and target values of the training data and call the initParameter function to randomly initialize the model's parameters.

5.2 Gradient descent

Let’s write the gradient descent algorithm as follows:

package com.zack.lr;

import java.util.ArrayList;

public class LinearRegression {
    // ...

    /**
     * Update the parameters with one step of gradient descent
     * @param learning_rate the learning rate
     * @return the loss value
     */
    public double gradientDecent(double learning_rate) {
        double w_ = 0;
        double b_ = 0;
        double totalLoss = 0;
        double n = this.features.size();
        for (int i = 0; i < this.features.size(); i++) {
            double yPredict = this.features.get(i) * this.weight + this.bias;
            // The partial derivative of the loss with respect to w
            w_ += -2 * learning_rate * this.features.get(i) * (this.targets.get(i) - yPredict) / n;
            // The partial derivative of the loss with respect to b
            b_ += -2 * learning_rate * (this.targets.get(i) - yPredict) / n;

            // Accumulate the loss for output
            totalLoss += Math.pow(this.targets.get(i) - yPredict, 2) / n;
        }
        // Update the parameters
        this.weight -= w_;
        this.bias -= b_;
        return totalLoss;
    }
}

To this function we pass in a learning rate and then update the parameters.

Here, w_ and b_ represent the changes to the two parameters, and totalLoss is the loss under the current parameters. Let's focus on the following code:

double yPredict = this.features.get(i) * this.weight + this.bias;
// The partial derivative of the loss with respect to w
w_ += -2 * learning_rate * this.features.get(i) * (this.targets.get(i) - yPredict) / n;
// The partial derivative of the loss with respect to b
b_ += -2 * learning_rate * (this.targets.get(i) - yPredict) / n;

// Accumulate the loss for output
totalLoss += Math.pow(this.targets.get(i) - yPredict, 2) / n;

As you can see, we're simply plugging into the partial-derivative formulas. Since we iterate over all n training samples (n is the number of training samples), we divide by n to take the average.

Then we just update the parameters.

5.3 Predicting results

The purpose of training the model is to predict outcomes, and this is a very simple step: we just plug in x and compute y. The code is as follows:

package com.zack.lr;

import java.util.ArrayList;

public class LinearRegression {
    // ...

    /**
     * Predict results
     * @param features the feature values
     * @return the predicted results
     */
    public ArrayList<Double> predict(ArrayList<Double> features) {
        // Holds the prediction results
        ArrayList<Double> yPredict = new ArrayList<>();
        for (Double feature : features) {
            // Make a prediction for each x
            yPredict.add(feature * this.weight + this.bias);
        }
        return yPredict;
    }
}

This part of the code is very simple: we use the trained parameters to compute y for each feature, add it to the result list, and return it.

5.4 Test program

Let’s write a test program to see how well this model works. The code is as follows:

package com.zack.lr;

import java.util.ArrayList;

public class LrDemo {

    public static void main(String[] args) {
        ArrayList<Double> features = new ArrayList<>();
        ArrayList<Double> targets = new ArrayList<>();
        // Prepare the feature values
        for (int i = 0; i < 200; i++) {
            features.add((double) i);
        }
        // Generate the target values with y = 3x + 1
        for (Double feature : features) {
            // Generate the target value and add some random noise
            double target = feature * 3 + 1 + Math.random() * 3;
            targets.add(target);
        }

        // Create a linear regression model
        LinearRegression linearRegression = new LinearRegression(features, targets);

        for (long i = 1; i <= 300; i++) {
            double loss = linearRegression.gradientDecent(1e-6);
            if (i % 100 == 0) {
                System.out.println("After update " + i + ":");
                System.out.println("weight = " + linearRegression.getWeight());
                System.out.println("bias = " + linearRegression.getBias());
                System.out.println("loss = " + loss);
            }
        }

        // Prepare data for testing
        ArrayList<Double> testList = new ArrayList<>();
        testList.add(100.0);
        testList.add(27.0);
        ArrayList<Double> testPredict = linearRegression.predict(testList);
        System.out.println("Real results");
        for (Double testX : testList) {
            System.out.println(testX * 3 + 1);
        }
        System.out.println("Predicted results");
        for (Double predict : testPredict) {
            System.out.println(predict);
        }
    }
}

We generated some data for training, then set a relatively small learning rate for gradient descent, which is the usual practice.

We updated the parameters 300 times and got a good result. Here is the output:

After update 100:
weight = 2.824277848777937
bias = 0.5474555396904982
loss = 511.60209782851484
After update 200:
weight = 3.0023165757117525
bias = 0.5488865404002248
loss = 3.9653518084868575
After update 300:
weight = 3.0144922328217816
bias = 0.5490704219725514
loss = 1.5908606096719522
Real results
301.0
82.0
Predicted results
301.9982937041507
81.94036070816065

You can see how close the predictions are to the real results. With that, we have a simple linear regression implementation.

This article only considers univariate linear regression; interested readers can try extending it to multiple linear regression. Thanks for reading.