Today we're going to get started on machine learning and introduce linear regression models. I've been getting a lot of questions about data fitting and predictive models, so I thought it was time to kick off a column I've been working on: machine learning.
Linear regression is a linear model, that is, one that assumes a linear relationship between the input variables (x) and a single output variable (y).
More specifically, the output variable (y) can be computed as a linear combination of the input variables (x).
So we want the algorithm to learn the parameters of the hypothesis, so that we can build a prediction equation that takes features and parameters as inputs and outputs a predicted value.
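To make that concrete, here is a minimal sketch in plain numpy (with made-up numbers) of what a linear hypothesis computes:

import numpy as np

# hypothetical parameters: an intercept theta_0 plus one weight per feature
theta = np.array([1.0, 0.5, 2.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 3.0, 4.0])       # leading 1 so theta_0 acts as the intercept

y_hat = theta @ x                   # theta_0*1 + theta_1*3 + theta_2*4 = 10.5
print(y_hat)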
The best way to understand linear regression is to start with the mathematics behind it.
1. Mathematical principles
We have all been working with equations since junior high school. At first they were the simple y = kx + b, and later they became more and more complicated. Remember this kind of problem from junior high? You are given a store's daily sales and profits and asked to find a function that estimates the profit for some future number of sales. Or in college physics labs, the teacher always said that when tracing points and drawing the fitted line, the line should be drawn so that the points are evenly distributed on both sides of it and their distances to the line are as small as possible. So what is the best line for predicting data with these characteristics? That's where linear models come in.
The first thing is the most basic form of the model: h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ = θᵀx, where x is each attribute (feature) of the independent variable and θ is the parameter vector. The most important question is then how to optimize the parameters, which means minimizing the loss. That is where the loss function comes in; the loss function chosen here is the mean squared error, J(θ) = (1/2m) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)², where m is the number of training examples. With the loss function in place, we have to figure out how to minimize the error. To minimize a function we need another very important algorithm, called gradient descent.
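Before moving on, here is a minimal sketch (made-up numbers) of how the mean squared error is computed from predictions and true values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

# mean squared error: the average of the squared differences
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # 0.5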
Gradient descent: gradient descent is an iterative optimization algorithm used to find the minimum of the loss function. To find a local minimum, the algorithm repeatedly takes steps proportional to the negative of the gradient of the function at the current point.
The figure below shows the steps taken to descend from a high point of the function toward a local minimum. The negative gradient can be understood as the direction opposite to the derivative of the loss function, and the step size of each update is what we call the learning rate. In this way, the minimum of the loss function can be reached through iteration, and the optimal parameters can be obtained.
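To show the idea in code before we get to the real examples, here is a minimal gradient-descent sketch on toy data (not the house-price or happiness data used later) that fits y = kx + b by repeatedly stepping against the gradient of the mean squared error:

import numpy as np

# toy data roughly following y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 9.0])

k, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05     # the step size of each update
for _ in range(2000):
    y_pred = k * x + b
    # gradients of the MSE loss with respect to k and b
    grad_k = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))
    # move against the gradient
    k -= learning_rate * grad_k
    b -= learning_rate * grad_b

print(k, b)  # should end up close to 2 and 1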
The problem here is that gradient descent tends to fall into local minima rather than the global minimum, which is where optimization algorithms such as stochastic gradient descent come in; there is also the problem of over-fitting, where the model loses its ability to generalize.
2. Code implementation
The first is the most basic linear fitting method. Take, for example, a project I once did for someone else: predicting house prices from square footage.
import pandas as pd
import numpy as np
import math
from scipy.optimize import leastsq
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']   # font settings for the plots
plt.rcParams['axes.unicode_minus'] = False

# import data
# square feet
x = [150, 200, 250, 300, 350, 400, 600]
# prices
y = [6450, 7450, 8450, 9450, 11450, 15450, 18450]

# training set and testing set
x_train = np.array(x[0:5])
y_train = np.array(y[0:5])
x_test = np.array(x[5:7])
y_test = np.array(y[5:7])

# model: y = k*x + b
def fun(p, x):
    k, b = p
    return k * x + b

# residuals between the model and the observed data
def error(p, x, y):
    return fun(p, x) - y

p0 = [100, 2]   # initial guess for k and b
para = leastsq(error, p0, args=(x_train, y_train))
k, b = para[0]
print("fitting equation is: y = {}x + {}".format(k, b))

# draw the fitted line
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color="red", label="point", linewidth=3)
X = np.linspace(100, 700, 100)
Y = k * X + b
plt.plot(X, Y, color="orange", label="fitted line", linewidth=2)
plt.xlabel('square feet')
plt.ylabel('price')
plt.legend()
plt.show()
# plot the fitting errors on the training data
err1 = []
for i in range(len(x_train)):
    err1.append(k * x_train[i] + b - y_train[i])

plt.figure(figsize=(10, 7))
plt.title('error figure')
plt.plot(x_train, err1, 'r', label='error', linewidth=3)
plt.xlabel('x')
plt.ylabel('error')
plt.show()
Then let's write some code that imitates sklearn's linear model and build our own linear model trainer. The data set used here is the 2017 World Happiness Index and GDP data, which can be downloaded from Kaggle at www.kaggle.com/unsdsn/worl… We then use the GDP index to predict how happy people are.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# import data
data = pd.read_csv('2017.csv')
data.info()
data.head(10)
# histograms of each numeric column
histograms = data.hist(grid=False, figsize=(10, 10))
# split the training set and the test set
train_data = data.sample(frac=0.8)
test_data = data.drop(train_data.index)

x_train = train_data[['Economy..GDP.per.Capita.']].values
y_train = train_data[['Happiness.Score']].values
x_test = test_data[['Economy..GDP.per.Capita.']].values
y_test = test_data[['Happiness.Score']].values

plt.scatter(x_train, y_train, label='Training Dataset')
plt.scatter(x_test, y_test, label='Test Dataset')
plt.xlabel('Economy..GDP.per.Capita.')
plt.ylabel('Happiness.Score')
plt.title('Countries Happines')
plt.legend()
plt.show()
Then there are the regularization functions that prevent over-fitting and the normalization functions, which I won't post here.
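If you want to run the class below on its own, here is a minimal stand-in for that helper, prepare_for_training. It is only an assumption about what the helper needs to do (normalize the features and prepend a bias column of ones, returning the processed data plus the feature means and deviations), and it simply ignores the polynomial and sinusoid options:

import numpy as np

def prepare_for_training(data, polynomial_degree=0, sinusoid_degree=0, normalize_data=True):
    # minimal stand-in: normalize each feature and prepend a column of ones (the bias term);
    # polynomial_degree and sinusoid_degree are accepted but ignored in this sketch
    data = np.copy(data).astype(float)
    num_examples = data.shape[0]

    features_mean = np.mean(data, axis=0)
    features_deviation = np.std(data, axis=0)
    if normalize_data:
        features_deviation[features_deviation == 0] = 1   # avoid dividing by zero
        data = (data - features_mean) / features_deviation

    data_processed = np.hstack((np.ones((num_examples, 1)), data))
    return data_processed, features_mean, features_deviation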
class self_LinearRegression:
    def __init__(self, data, labels, polynomial_degree=0, sinusoid_degree=0, normalize_data=True):
        # preprocess the data (normalization, extra features, bias column)
        (data_processed,
         features_mean,
         features_deviation) = prepare_for_training(data, polynomial_degree, sinusoid_degree, normalize_data)
        self.data = data_processed
        self.labels = labels
        self.features_mean = features_mean
        self.features_deviation = features_deviation
        self.polynomial_degree = polynomial_degree
        self.sinusoid_degree = sinusoid_degree
        self.normalize_data = normalize_data
        num_features = self.data.shape[1]
        self.theta = np.zeros((num_features, 1))   # create the parameter vector

    def train(self, alpha, lambda_param=0, num_iterations=500):
        # run gradient descent and return the learned parameters and loss history
        cost_history = self.gradient_descent(alpha, lambda_param, num_iterations)
        return self.theta, cost_history

    def gradient_descent(self, alpha, lambda_param, num_iterations):
        cost_history = []
        for _ in range(num_iterations):
            self.gradient_step(alpha, lambda_param)
            cost_history.append(self.cost_function(self.data, self.labels, lambda_param))
        return cost_history

    def gradient_step(self, alpha, lambda_param):
        # one update of theta in the direction of the negative gradient
        num_examples = self.data.shape[0]
        predictions = self_LinearRegression.hypothesis(self.data, self.theta)
        delta = predictions - self.labels
        reg_param = 1 - alpha * lambda_param / num_examples
        theta = self.theta
        theta = theta * reg_param - alpha * (1 / num_examples) * (delta.T @ self.data).T
        # additional update for the bias parameter theta[0]
        theta[0] = theta[0] - alpha * (1 / num_examples) * (self.data[:, 0].T @ delta).T
        self.theta = theta

    def get_cost(self, data, labels, lambda_param):
        data_processed = prepare_for_training(
            data,
            self.polynomial_degree,
            self.sinusoid_degree,
            self.normalize_data,
        )[0]
        return self.cost_function(data_processed, labels, lambda_param)

    def cost_function(self, data, labels, lambda_param):
        # mean squared error plus the regularization term
        num_examples = data.shape[0]
        delta = self_LinearRegression.hypothesis(data, self.theta) - labels
        theta_cut = self.theta[1:, 0]
        reg_param = lambda_param * (theta_cut.T @ theta_cut)
        cost = (1 / (2 * num_examples)) * (delta.T @ delta + reg_param)
        return cost[0][0]

    def predict(self, data):
        # preprocess new data the same way and apply the learned parameters
        data_processed = prepare_for_training(
            data,
            self.polynomial_degree,
            self.sinusoid_degree,
            self.normalize_data,
        )[0]
        predictions = self_LinearRegression.hypothesis(data_processed, self.theta)
        return predictions

    @staticmethod
    def hypothesis(data, theta):
        predictions = data @ theta
        return predictions
# now train our own linear regression model
num_iterations = 500          # number of iterations
learning_rate = 0.01          # learning rate
regularization_param = 0      # regularization parameter
polynomial_degree = 0
sinusoid_degree = 0

linear_regression = self_LinearRegression(x_train, y_train, polynomial_degree, sinusoid_degree)
(theta, cost_history) = linear_regression.train(learning_rate, regularization_param, num_iterations)

print('Initial loss: {:.2f}'.format(cost_history[0]))
print('Final loss: {:.2f}'.format(cost_history[-1]))

# output the learned parameters
theta_table = pd.DataFrame({'Model parameters': theta.flatten()})
theta_table.head()
The linear regression equation obtained from our own trained model is y = 5.383319x + 0.892002.
# plot how the loss drops during gradient descent
plt.plot(range(num_iterations), cost_history)
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.title('Gradient descent of the loss')
plt.show()
# use our own model to make predictions
predictions_num = 100
x_predictions = np.linspace(x_train.min(), x_train.max(), predictions_num).reshape(predictions_num, 1)
y_predictions = linear_regression.predict(x_predictions)

plt.scatter(x_train, y_train, label='Training data')
plt.scatter(x_test, y_test, label='Test data')
plt.plot(x_predictions, y_predictions, 'r', label='Prediction')
plt.xlabel('Economy..GDP.per.Capita.')
plt.ylabel('Happiness.Score')
plt.title('Countries Happines')
plt.legend()
plt.show()
Then we use sklearn's LinearRegression to train and predict in the same way.
from sklearn.linear_model import LinearRegression

# build sklearn's linear regression model
liner = LinearRegression()
# train
liner.fit(x_train, y_train)
# predict over the same range as before
y_predictions = liner.predict(x_predictions)

plt.scatter(x_train, y_train, label='Training data')
plt.scatter(x_test, y_test, label='Test data')
plt.plot(x_predictions, y_predictions, 'r', label='Prediction')
plt.xlabel('Economy..GDP.per.Capita.')
plt.ylabel('Happiness.Score')
plt.title('Countries Happines')
plt.legend()
plt.show()

print("slope =", liner.coef_)
print("intercept =", liner.intercept_)
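Finally, since we already split off a test set, we can compare the two models' errors on it. This is just a quick sanity-check sketch and assumes the x_test, y_test, linear_regression and liner variables from the blocks above are still in scope:

# mean squared error of both models on the held-out test data
own_pred = linear_regression.predict(x_test)
sk_pred = liner.predict(x_test)

own_mse = np.mean((own_pred - y_test) ** 2)
sk_mse = np.mean((sk_pred - y_test) ** 2)
print("own model test MSE:", own_mse)
print("sklearn test MSE:", sk_mse)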