As for my career planning, I not only want to build high-quality applications (the software engineering side), but also some exciting ones, so I hope to move in the direction of machine learning. Although I picked up a superficial bit of it in university, that is far from enough if I want to make machine learning my professional direction. So I started this series of articles.

Through this series I hope to summarize the main knowledge points in the field of Machine Learning, and my goal is to write about them in a way a high school student could understand.

Preface

The protagonist of this article is Linear Regression, also abbreviated LR. But a high school student probably doesn't know what "regression" means, right? Let me give a brief introduction in this preface.

Regression

Regression problems are a large part of machine learning.

In statistics, regression analysis refers to a statistical method for determining the quantitative relationship between two or more interdependent variables. By the number of independent variables involved, regression analysis divides into simple (one-variable) regression and multiple regression; by the number of dependent variables, into univariate and multivariate regression analysis; and by the type of relationship between the independent and dependent variables, into linear regression analysis and nonlinear regression analysis.

Instead of playing with definitions, imagine this scenario: suppose I have a data set of (height, weight) pairs; I can plot height x against weight y on a piece of paper.

In this example I assume the relationship between height and weight is linear, so I take the model to be the first-degree function y = kx + b. To determine the parameters k and b, I need to learn them from the existing data.
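To make this concrete, here is a minimal sketch (with made-up height/weight numbers, used purely for illustration) of learning k and b by least squares, using numpy's polyfit:

```python
import numpy as np

# Hypothetical (height in cm, weight in kg) pairs, invented for illustration
heights = np.array([150.0, 160.0, 165.0, 170.0, 180.0])
weights = np.array([45.0, 52.0, 56.0, 62.0, 70.0])

# np.polyfit with degree 1 fits y = k*x + b by least squares
k, b = np.polyfit(heights, weights, 1)

# Predict the weight of a 175 cm person with the learned parameters
print(k * 175 + b)
```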

Admittedly, the example above is a big simplification: weight is certainly not determined by the single feature of height, and even with height as the only feature, the best function is probably not a first-degree one.

Concepts related to regression problems

Typically, such prediction problems can be solved with regression models, which define the relationship between the inputs (our existing knowledge) and the outputs (the predictions).

The steps for solving a prediction problem with a regression model are as follows:

  1. Accumulate knowledge: we call the data we already have a Training Set. After all, forecasting is based on past data, which makes sense.
  2. Learning: once we have data, we need to learn from it. Why is machine learning intelligent? Because when I tell the computer that my model is linear (a first-degree function) or some other type of function and pour the data in, it can hand me back the final function with all the parameters trained.
  3. Prediction: after learning is complete, when new data (input) arrives, we can predict the output from the correspondence obtained during the learning phase. (A minimal sketch of all three steps follows below.)
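As a sketch of these three steps, here is a toy example using scikit-learn (a library this post otherwise doesn't use; the data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1. Accumulate knowledge: a toy training set of inputs X and outputs y
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 7.8])

# 2. Learning: declare the model family (linear) and pour in the data
model = LinearRegression().fit(X_train, y_train)

# 3. Prediction: apply the learned correspondence to new input
print(model.predict(np.array([[5.0]])))  # roughly 10
```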

There is a practice competition for newcomers on Kaggle about the Titanic.

The gist: you are given thousands of people's personal information (gender, age, cabin class, port of embarkation, etc.) along with whether they survived; then you get test data containing the same kind of personal information and must predict whether those people survived.

If you are interested, check it out: Kaggle-Titanic

Coursera – Stanford Machine Learning

Main text

This post is mainly about Linear Regression. From the introduction you know that many kinds of functions can be used in regression, and the developer has to choose among them; this post starts with the simplest, linear regression.

Linear Regression (LR)

Mathematically, given an instance $x = (x_1; x_2; \dots; x_d)$, where $x_i$ is the value of $x$ on the $i$-th attribute, a linear model tries to learn a prediction function that is a linear combination of the attributes, i.e.

$$f(x) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b,$$

or in vector form,

$$f(x) = w^T x + b.$$

What, you say you're in high school and don't know what the T means? Go look up the transpose of a matrix or vector. And of course you should also look up matrix multiplication, because here a matrix with one row and d columns (w^T) is multiplied by a matrix with d rows and one column (x).
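A quick numpy sketch of that multiplication, with shapes matching the description above (the numbers are arbitrary):

```python
import numpy as np

w_T = np.array([[0.5, 1.2, -0.3]])   # w^T has shape (1, d), here d = 3
x = np.array([[2.0], [1.0], [4.0]])  # x has shape (d, 1)
b = 0.7

# (1, d) @ (d, 1) gives a (1, 1) result; then add the scalar bias b
f_x = w_T @ x + b
print(f_x)  # [[1.7]]
```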

OK, now that we know the basic form of the linear regression model, our task is to learn the values of the vector w and the parameter b; with them, we can make predictions.

In general we give w and b initial values, then modify those values to bring them in line with our expectations. So how do we modify them? We need a loss function to indicate the gap between my predicted values and the actual values in the training data.

What exactly is this loss function, and how do we use it to correct our parameters w and b? See below.

Gradient descent method

I will only cover as much of gradient descent as this post needs, and come back to it later if necessary.

Have you heard of heuristic search?

Heuristic Search, also known as Informed Search, uses heuristic information provided by the problem to guide the search, thereby reducing the search scope and complexity. (Baidu Encyclopedia)

For example, when I was a freshman and sophomore, after evening classes I often forgot where in the parking lot I had left my bike, and it took me a long time to find it. That is blind search; breadth-first search (BFS) and depth-first search (DFS) are both blind searches.

But if I had some gadget that told me how far I was from my bike, couldn't I use that as a basis and keep searching in the direction of decreasing distance? That is heuristic search; path-finding algorithms such as A* (A-star) are heuristic searches.

Heuristic search and machine learning share this idea: in machine learning I also need an index like "how far am I from my bike" to judge how far my current parameters are from the optimal parameters. We uniformly call this kind of thing a loss function.

Loss Function

In the earlier example we already gave our function a name: the hypothesis function, i.e. the estimating function. The loss function measures how good the hypothesis function is, and there are many concrete metrics; here, as in Ng's course, we use the squared error.

$$h_\theta(x) = \theta_0 + \theta_1 x$$

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2$$

The 2 in the denominator is only there so that it cancels when we take partial derivatives; it doesn't affect anything essential.
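Here is a tiny numeric sketch of this loss (toy numbers invented for illustration; the exercise code below implements the same formula with matrices):

```python
import numpy as np

# Toy data: three samples, hypothesis h(x) = theta0 + theta1 * x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
theta0, theta1 = 1.0, 0.5

predictions = theta0 + theta1 * x         # h(x_i) for every sample
squared_errors = (predictions - y) ** 2   # (h(x_i) - y_i)^2
J = squared_errors.sum() / (2 * len(x))   # the 1/(2m) factor from the formula
print(J)  # 0.25
```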

Gradient Descent Algorithm

Now that we know how to evaluate the current parameters, how do we modify them to do better (i.e. minimize the loss function)?

High school students know that for a function of one variable, the derivative describes how fast the function changes at a point. The gradient is the analogue for several variables: roughly speaking, it is a vector pointing in the direction in which the function increases fastest.

For background on gradients and partial derivatives, consult your textbooks or material online.
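For our purposes, the only fact we need is that the gradient is simply the vector of partial derivatives; for our loss $J$ this reads

$$\nabla J(\theta_0, \theta_1) = \left( \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1} \right),$$

and "going downhill" on the loss means stepping against this vector.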

Then we get an update formula, which we iterate many times to correct the parameters:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

(updating all the $\theta_j$ simultaneously).

Here $\alpha$ is the learning rate. The larger it is, the bigger the correction on each iteration; but larger is not always better, since too large a step can overshoot the lowest point. Some people use a variable learning rate, setting it high at the beginning and decreasing it as the loss nears its minimum.

Now let's see what we get when we actually take the derivative:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

so each iteration performs $\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$, with the convention $x_0^{(i)} = 1$.

One thing worth noting: in each iteration of this version of gradient descent we use all m training samples, which is why it is called Batch Gradient Descent (BGD).

Computing over all m samples on every iteration is expensive, so there are optimization schemes; take a look if you are interested.

Now we know how to correct the parameters, but what we actually reach is a local minimum of the loss function, not necessarily the global minimum.

Depending on the starting point (the initial parameters), the final solution may not be the global optimum (the smallest possible loss). Here are a few remedies from the watermelon book (Zhou Zhihua's Machine Learning):

  1. Initialize with several different sets of parameter values; in other words, try multiple starting points and pick the best result.
  2. Use a technique called Simulated Annealing, which at each step accepts, with some probability, a result worse than the current solution, thus helping to "jump out" of a local minimum. Over the iterations, the probability of accepting a suboptimal solution gradually decreases, which keeps the algorithm stable.
  3. Use stochastic gradient descent: it adds a random element to the gradient calculation, so even at a local minimum the computed gradient may be nonzero, giving it a chance to escape the local minimum and keep searching. (A minimal sketch follows this list.)
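To make idea 3 concrete, here is a minimal sketch of a single stochastic-gradient step for our linear model (toy numbers invented for illustration; this is not the batch code used in the exercise below):

```python
import numpy as np

def sgd_step(theta, x_i, y_i, alpha):
    """One stochastic gradient descent step using a single sample (x_i, y_i)."""
    error = x_i @ theta - y_i           # h(x_i) - y_i for this one sample
    return theta - alpha * error * x_i  # step against this sample's gradient

# theta holds (intercept, slope); the leading 1 in x_i plays the 'Ones' role
theta = np.zeros(2)
x_i = np.array([1.0, 3.0])
y_i = 6.5
theta = sgd_step(theta, x_i, y_i, alpha=0.01)
print(theta)  # [0.065 0.195]
```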

Practice

With all that theory, it's time to write some code. I will use Python to do the linear regression exercise from Stanford Machine Learning. Both the PDF and the data are available in my GitHub repository.

Environment

I really recommend installing Anaconda if you don't want to be bothered with environment setup; other than that, I use Python 3.x.

Background

In this exercise we use simple linear regression to predict a food truck's profit. We have many data pairs of (city population, profit), and the task is to train the parameters of a linear regression model so that, given a new city's population, we can predict the truck's profit there.

Code and comments

```python
# Reference: http://www.johnwittenauer.net/machine-learning-exercises-in-python-part-1/
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Compute the loss; vectorized operations are used instead of a for loop
def computeLoss(X, y, theta):  
    inner = np.power(((X * theta.T) - y), 2)
    return np.sum(inner) / (2 * len(X))

# Gradient descent part
def gradientDescent(X, y, theta, alpha, iters):  
    temp = np.matrix(np.zeros(theta.shape))
    parameters = int(theta.ravel().shape[1])
    cost = np.zeros(iters)

    for i in range(iters):
        
        error = (X * theta.T) - y
        for j in range(parameters):
            term = np.multiply(error, X[:,j])
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeLoss(X, y, theta)

    return theta, cost

# Read the training data
# (Windows users may need to adjust the path)
def loadData(path):
    trainingData = pd.read_csv(path, header=None, names=['Population', 'Profit'])

    trainingData.head()

    trainingData.describe()

    trainingData.plot(kind='scatter', x='Population', y='Profit', figsize=(12, 8))
    plt.show()
    return trainingData

trainingData = loadData(os.getcwd() + '/../data/ex1data1.txt')

# Insert a column of ones at the front of the dataset as the constant term in y = k*x + b*1
trainingData.insert(0, 'Ones', 1)

# Split the inputs X and outputs y out of the dataset
cols = trainingData.shape[1]
X = trainingData.iloc[:, 0:cols-1]
y = trainingData.iloc[:, cols-1:cols]

# Convert the pandas DataFrames to numpy matrices
X = np.matrix(X.values)
y = np.matrix(y.values)
# Initialize theta with both parameters set to 0
theta = np.matrix(np.array([0, 0]))

# Dimensions of each vector
X.shape, theta.shape, y.shape  

# Initial loss function value
computeLoss(X, y, theta)   # about 32.07; compare with the loss after training below

# Set the learning rate and number of iterations
alpha = 0.01  
iters = 2000

# Use gradient descent to get model parameters
theta_fin, loss = gradientDescent(X, y, theta, alpha, iters)  
theta_fin

# Calculate the loss value of the parameters after training
computeLoss(X, y, theta_fin)  # 4.47

# Plot the fitted line over the training data
x = np.linspace(trainingData.Population.min(), trainingData.Population.max(), 100)
f = theta_fin[0, 0] + (theta_fin[0, 1] * x)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(trainingData.Population, trainingData.Profit, label='Training Data')
ax.legend(loc=2)  
ax.set_xlabel('Population')  
ax.set_ylabel('Profit')  
ax.set_title('Predicted Profit vs. Population Size')  
plt.show()

# Loss changes with the number of iterations
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.arange(iters), loss, 'r')  
ax.set_xlabel('Iterations')  
ax.set_ylabel('Loss')  
ax.set_title('Error vs. Training Epoch')  
plt.show()
```

To explain: the comments in the code are fairly clear already, so I won't repeat them here.
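One step the exercise stops short of is step 3, prediction. With the trained parameters it is a one-liner; for example, for a hypothetical city of 35,000 people (the dataset measures population and profit in units of 10,000):

```python
# Predict the profit for a hypothetical city of 35,000 people
population = 3.5
profit = theta_fin[0, 0] + theta_fin[0, 1] * population
print(profit)  # profit in units of $10,000
```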

Results

The data set

The training results

The error decreases with the number of iterations

This article is by Liang Wang (LWIO, LWYJ123).

Main references: Stanford's ML course and johnwittenauer's posts.

Links

  • Table of contents
  • Previous chapter: none
  • Next chapter: Re: Machine Learning from Scratch – Machine Learning(2) Logistic regression LR