5.4 Practice of multiple linear regression
5.4.1 Prediction of housing price by multiple linear regression
Earlier in this chapter the author presented house price prediction as a simple (univariate) linear regression. Here we return to house price prediction, but now the price depends not only on the floor area but also on the number of rooms, which makes it a multivariate problem. The theory was covered in Section 5.2 and is not repeated here; a quick sketch of the hypothesis follows, and then the full program.
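As a refresher on the model from Section 5.2, the hypothesis is h_θ(x) = θ0 + θ1·area + θ2·bedrooms. A minimal sketch of this vectorized form, with purely hypothetical parameter values:
import numpy as np

theta = np.array([[1.0], [0.8], [0.3]])  # hypothetical parameters [theta0, theta1, theta2]
x = np.array([[1.0, 70.0, 2.0]])         # one sample: bias term x0 = 1, area = 70, bedrooms = 2
print(np.dot(x, theta))                  # h_theta(x) = x . theta = theta0 + theta1*area + theta2*bedrooms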
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import axes3d
import math
from matplotlib import cm
import matplotlib.ticker as mtick
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False # display the minus sign correctly
# loadDataSet: load the data. Returns: xArr - x dataset, yArr - y dataset
def loadDataSet(filename):
X = []
Y = []
with open(filename, 'rb') as f:
for idx, line in enumerate(f):
line = line.decode('utf-8').strip()
if not line:
continue
eles = line.split()
if idx == 0:
numFea = len(eles)
eles = list(map(float, eles))
X.append(eles[:-1])
Y.append([eles[-1]])
# Convert X,Y lists into matrices
xArr = np.array(X)
yArr = np.array(Y)
return xArr,yArr
# standarize: feature standardization. Parameters: X - sample set. Returns: X, values - the standardized samples and the per-column mean/std
def standarize(X):
m, n = X.shape
values = {} # Save mean and STD for each column to facilitate standardization of predicted data
for j in range(n):
features = X[:,j]
meanVal = features.mean(axis=0)
stdVal = features.std(axis=0)
values[j] = [meanVal, stdVal]
if stdVal != 0:
X[:,j] = (features - meanVal) / stdVal
else:
X[:,j] = 0
return X, values
# J: cost function. Parameters: theta, X, Y. Returns: the value of the cost function.
def J(theta, X, Y):
m = len(X)
return np.sum(np.dot((predict(theta, X) - Y).T, (predict(theta, X) - Y)) / (2 * m))
# bgd: batch gradient descent. Parameters: alpha, X, Y, maxloop, epsilon. Returns: theta, thetas, costs
def bgd(alpha, X, Y, maxloop, epsilon):
m, n = X.shape
theta = np.zeros((n,1)) # initialize the parameters to 0
count = 0 # number of iterations
converged = False # whether convergence has occurred
cost = np.inf # initialize the cost to infinity
costs = [J(theta, X, Y),] # record the cost of each iteration
thetas = {} # record every parameter update
for i in range(n):
thetas[i] = [theta[i,0]]
while count <= maxloop:
if converged:
break
count += 1
# Option 1: compute the n parameters one at a time and store them in thetas
#for j in range(n):
# deriv = np.sum(np.dot(X[:,j].T, (predict(theta, X) - Y))) / m
# thetas[j].append(theta[j,0] - alpha*deriv)
# then write the n updated parameters back into the current theta
#for j in range(n):
# theta[j,0] = thetas[j][-1]
# Option 2 (used here): update all n parameters simultaneously, in vectorized form
theta = theta - alpha * 1.0 / m * np.dot(X.T, (predict(theta, X) - Y))
# append the new values to thetas
for j in range(n):
thetas[j].append(theta[j,0])
# record the cost for the current parameters and append it to costs
cost = J(theta, X, Y)
costs.append(cost)
if abs(costs[-1] - costs[-2]) < epsilon:
converged = True
return theta, thetas, costs
# predict: compute the model prediction. Parameters: theta, X. Returns: the predicted values.
def predict(theta, X):
return np.dot(X, theta) # X here is the preprocessed X (with the bias column added)
# plotRegression: plot the fitted plane. Parameters: X, Y, theta. Returns: none.
def plotRegression(X, Y, theta):
# Print the fit plane
fittingFig = plt.figure(figsize=(16, 12))
##title = 'bgd: rate=%.3f, maxloop=%d, epsilon=%.3f \n'%(alpha,maxloop,epsilon)
ax = fittingFig.gca(projection='3d')
xx = np.linspace(0, 200, 25)
yy = np.linspace(0, 5, 25)
zz = np.zeros((25, 25))
for i in range(25):
for j in range(25):
normalizedSize = (xx[i] - values[0][0]) / values[0][1]
normalizedBr = (yy[j] - values[1][0]) / values[1][1]
x = np.matrix([[1, normalizedSize, normalizedBr]])
zz[i,j] = predict(theta, x)
xx, yy = np.meshgrid(xx,yy)
ax.zaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap=cm.rainbow, alpha=0.1, antialiased=True)
xs = X[:, 0].flatten()
ys = X[:, 1].flatten()
zs = Y[:, 0].flatten()
ax.scatter(xs, ys, zs, c='b', marker='o')
ax.set_xlabel(u'area')
ax.set_ylabel(u'number of bedrooms')
ax.set_zlabel(u'price')
# plotinline: plot the cost curve. Parameters: costs. Returns: none.
def plotinline(costs):
errorsFig = plt.figure()
ax = errorsFig.add_subplot(111)
ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
ax.plot(range(len(costs)), costs)
ax.set_xlabel(u'number of iterations')
ax.set_ylabel(u'cost function')
# computeCorrelation: Pearson correlation coefficient. Parameters: X, Y. Returns: the correlation.
def computeCorrelation(X, Y):
xBar = np.mean(X)
yBar = np.mean(Y)
SSR = 0
varX = 0
varY = 0
for i in range(0 , len(X)):
diffXXBar = X[i] - xBar
diffYYBar = Y[i] - yBar
SSR += (diffXXBar * diffYYBar)
varX += diffXXBar**2
varY += diffYYBar**2
SST = math.sqrt(varX * varY)
return SSR / SST
# test
if __name__ == '__main__':
## Step 1: load data...
print("Step 1: load data...")
originX, Y = loadDataSet('houses.txt')
#print(originX.shape, Y.shape)
# add x0 columns to feature X
m,n = originX.shape
X, values = standarize(originX.copy())
X = np.concatenate((np.ones((m,1)), X), axis=1)
#print(X.shape, Y.shape)
#print(X[:3],values)
## Step 2: training...
print("Step 2: training...")
alpha = 1 # learning rate
maxloop = 5000 # Maximum number of iterations
epsilon = 0.000001
# theta holds the optimal parameters, costs holds the cost of each iteration, and thetas holds the theta values from every update
theta , thetas, costs= bgd(alpha, X, Y, maxloop, epsilon)
#print(theta , thetas, costs)
print(theta)
## Step 4: testing...
print("Step 4: testing...")
normalizedSize = (70-values[0][0])/values[0][1]
normalizedBr = (2-values[1][0])/values[1][1]
predicateX = np.matrix([[1, normalizedSize, normalizedBr]])
# At this point, the parameters are learned and the model is set. To predict new instances, the following can be done
price = predict(theta, predicateX)
## Step 5: show the result...
print("Step 5: show the result...")
print(u'70㎡ two-bedroom estimated price: ¥%.4f (10k yuan)' % price)
## Step 6: show Regression plot...
print("Step 6: show Regression plot...")
plotRegression(X, Y, theta)
## Step 7: show plotinline...
print("Step 7: show plotinline...")
plotinline(costs)
## Step 8: math Pearson...
print("Step 8: math Pearson...")
xMat = np.mat(X) # create the xMat matrix
yMat = np.mat(Y) # create the yMat matrix
yHat = predict(theta, xMat)
Pearson = computeCorrelation(yHat, yMat)
print(Pearson)
The results are shown below.
[Note] This code differs slightly from the univariate regression example; readers are encouraged to compare the two.
Refer to Attachment 3.MLR\MLR_v1\mlr_v1.0.py for the complete code.
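As a quick sanity check, batch gradient descent on a problem this small should converge to the same fit as the closed-form normal equation θ = (XᵀX)⁻¹Xᵀy. The following is a minimal sketch on synthetic data (the data, learning rate, and iteration count are illustrative, not taken from the text):
import numpy as np

rng = np.random.RandomState(0)
m = 100
area = rng.uniform(30, 150, (m, 1))
rooms = rng.randint(1, 5, (m, 1)).astype(float)
Y = 10 + 0.9 * area + 5 * rooms + rng.randn(m, 1)      # synthetic prices with noise
X = np.hstack([np.ones((m, 1)), area, rooms])          # bias column + features

theta_ne = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(Y)  # closed-form (normal equation) solution

Xs = X.copy()                                          # standardize the features, as standarize() does
Xs[:, 1:] = (Xs[:, 1:] - Xs[:, 1:].mean(axis=0)) / Xs[:, 1:].std(axis=0)
theta = np.zeros((3, 1))
for _ in range(5000):                                  # batch gradient descent, alpha = 1
    theta -= (1.0 / m) * Xs.T.dot(Xs.dot(theta) - Y)

print(np.allclose(Xs.dot(theta), X.dot(theta_ne)))     # the fitted values agree -> True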
Next, we call the sklearn library to implement multiple linear regression directly. Here is the code.
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import math
from matplotlib import cm
import matplotlib.ticker as mtick
from sklearn import linear_model
from mpl_toolkits.mplot3d import axes3d
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False # display the minus sign correctly
# loadDataSet: load the data. Returns: xArr - x dataset, yArr - y dataset
def loadDataSet(filename):
X = []
Y = []
with open(filename, 'rb') as f:
for idx, line in enumerate(f):
line = line.decode('utf-8').strip()
if not line:
continue
eles = line.split()
if idx == 0:
numFea = len(eles)
eles = list(map(float, eles))
X.append(eles[:-1])
Y.append([eles[-1]])
# Convert X,Y lists into matrices
xArr = np.array(X)
yArr = np.array(Y)
return xArr,yArr
# standarize: feature standardization. Parameters: X - sample set. Returns: X, values - the standardized samples and the per-column mean/std
def standarize(X):
m, n = X.shape
values = {} # Save mean and STD for each column to facilitate standardization of predicted data
for j in range(n):
features = X[:,j]
meanVal = features.mean(axis=0)
stdVal = features.std(axis=0)
values[j] = [meanVal, stdVal]
if stdVal != 0:
X[:,j] = (features - meanVal) / stdVal
else:
X[:,j] = 0
return X, values
# J: cost function. Parameters: theta, X, Y. Returns: the value of the cost function.
def J(theta, X, Y):
m = len(X)
return np.sum(np.dot((predict(theta, X) - Y).T, (predict(theta, X) - Y)) / (2 * m))
# bgd: batch gradient descent. Parameters: alpha, X, Y, maxloop, epsilon. Returns: theta, thetas, costs
def bgd(alpha, X, Y, maxloop, epsilon):
m, n = X.shape
theta = np.zeros((n,1)) # initialize the parameters to 0
count = 0 # number of iterations
converged = False # whether convergence has occurred
cost = np.inf # initialize the cost to infinity
costs = [J(theta, X, Y),] # record the cost of each iteration
thetas = {} # record every parameter update
for i in range(n):
thetas[i] = [theta[i,0]]
while count <= maxloop:
if converged:
break
count += 1
# Option 1: compute the n parameters one at a time and store them in thetas
#for j in range(n):
# deriv = np.sum(np.dot(X[:,j].T, (predict(theta, X) - Y))) / m
# thetas[j].append(theta[j,0] - alpha*deriv)
# then write the n updated parameters back into the current theta
#for j in range(n):
# theta[j,0] = thetas[j][-1]
# Option 2 (used here): update all n parameters simultaneously, in vectorized form
theta = theta - alpha * 1.0 / m * np.dot(X.T, (predict(theta, X) - Y))
# append the new values to thetas
for j in range(n):
thetas[j].append(theta[j,0])
# record the cost for the current parameters and append it to costs
cost = J(theta, X, Y)
costs.append(cost)
if abs(costs[-1] - costs[-2]) < epsilon:
converged = True
return theta, thetas, costs
# predict: compute the model prediction. Parameters: theta, X. Returns: the predicted values.
def predict(theta, X):
return np.dot(X, theta) # X here is the preprocessed X (with the bias column added)
# plotRegression: plot the fitted plane. Parameters: X, Y, theta. Returns: none.
def plotRegression(X, Y, theta):
# Print the fit plane
fittingFig = plt.figure(figsize=(16, 12))
##title = 'bgd: rate=%.3f, maxloop=%d, epsilon=%.3f \n'%(alpha,maxloop,epsilon)
ax = fittingFig.gca(projection='3d')
xx = np.linspace(0, 200, 25)
yy = np.linspace(0, 5, 25)
zz = np.zeros((25, 25))
xx, yy = np.meshgrid(xx,yy)
ax.zaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap=cm.rainbow, alpha=0.1, antialiased=True)
xs = X[:, 0].flatten()
ys = X[:, 1].flatten()
zs = Y[:, 0].flatten()
ax.scatter(xs, ys, zs, c='b', marker='o')
ax.set_xlabel(u'area')
ax.set_ylabel(u'number of bedrooms')
ax.set_zlabel(u'price')
# plotinline: plot the cost curve. Parameters: costs. Returns: none.
def plotinline(costs):
errorsFig = plt.figure()
ax = errorsFig.add_subplot(111)
ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
ax.plot(range(len(costs)), costs)
ax.set_xlabel(u'number of iterations')
ax.set_ylabel(u'cost function')
# computeCorrelation: Pearson correlation coefficient. Parameters: X, Y. Returns: the correlation.
def computeCorrelation(X, Y):
xBar = np.mean(X)
yBar = np.mean(Y)
SSR = 0
varX = 0
varY = 0
for i in range(0 , len(X)):
diffXXBar = X[i] - xBar
diffYYBar = Y[i] - yBar
SSR += (diffXXBar * diffYYBar)
varX += diffXXBar**2
varY += diffYYBar**2
SST = math.sqrt(varX * varY)
return SSR / SST
# test
if __name__ == '__main__':
## Step 1: load data...
print("Step 1: load data...")
originX, Y = loadDataSet('houses.txt')
#print(originX.shape, Y.shape)
# add x0 columns to feature X
m,n = originX.shape
X = np.concatenate((np.ones((m,1)), originX), axis=1)
#print(X.shape, Y.shape)
#print(X[:3],values)
## Step 2: Create linear regression object...
print("Step 2: Create linear regression object...")
regr = linear_model.LinearRegression()
## Step 3: training...
print("Step 3: training...")
regr.fit(X, Y)
## Step 4: testing...
print("Step 4: testing...")
X_predict = [1, 70, 2]
# At this point, the parameters are learned and the model is set. To predict new instances, the following can be done
Y_predict = regr.predict([X_predict])
## Step 5: show the result...
print("Step 5: show the result...")
# sklearn stores the weight vector w = (w_1, ..., w_p) in coef_ and the bias w_0 in intercept_
print("coef_:"+str(regr.coef_))
print("intercept_:"+str(regr.intercept_))
print("Y_predict:"+str(Y_predict))
## Step 6: show the result...
print("Step 6: show the result...")
print(u'70㎡ two-bedroom estimated price: ¥%.4f (10k yuan)' % Y_predict)
## Step 7: show Regression plot...
print("Step 7: show Regression plot...")
w0 = np.array(regr.intercept_)
w1 = np.array(regr.coef_.T[1:3])
print(w0)
print(w1)
# merge into a matrix
theta = np.vstack((w0,w1))
print(theta)
plotRegression(X, Y, theta)
## Step 8: math Pearson...
print("Step 8: math Pearson...")
xMat = np.mat(X) # create the xMat matrix
yMat = np.mat(Y) # create the yMat matrix
yHat = predict(theta, xMat)
Pearson = computeCorrelation(yHat, yMat)
print(Pearson)
Refer to Attachment MLR\MLR-sklearn_v2\MLR-sklearn_v2.0.py for the complete code.
[Note] Unlike the previous example, the output is not shown here.
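One design note on the sklearn version: because LinearRegression fits the intercept by default (fit_intercept=True), the manually added column of ones is not strictly necessary; its coefficient comes out as zero and the bias is reported in intercept_, which is why the code above reads w0 from intercept_ and skips the first entry of coef_. A minimal sketch on hypothetical synthetic data, without the ones column:
import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
area = rng.uniform(30, 150, (100, 1))
rooms = rng.randint(1, 5, (100, 1)).astype(float)
X_raw = np.hstack([area, rooms])                     # no manual column of ones
y = 10 + 0.9 * area + 5 * rooms + rng.randn(100, 1)  # synthetic prices with noise

regr = linear_model.LinearRegression()               # fit_intercept=True is the default
regr.fit(X_raw, y)
print(regr.intercept_, regr.coef_)                   # w0 in intercept_, (w1, w2) in coef_
print(regr.predict([[70, 2]]))                       # predict a 70㎡ two-bedroom directly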
5.4.2 Prediction of abalone age by multiple linear regression
Next, we apply regression to real data. The file abalone.txt, from the UCI dataset repository, records measurements of abalone (an aquatic creature); the age of an abalone can be inferred from the number of rings in its shell. Let's take a look at the data set.
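A minimal sketch for peeking at the first few rows (assuming abalone.txt sits in the working directory and is tab-separated, with the ring count in the last column, as loadDataSet below expects):
import itertools

with open('abalone.txt') as f:
    for line in itertools.islice(f, 3):
        print(line.strip().split('\t'))  # feature columns followed by the ring count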
The raw numbers may look a little intimidating, but this is a good example. Let's go straight to the code.
# -*- coding:utf-8 -*-
import numpy as np
# loadDataSet: load the data. Returns: xArr - x dataset, yArr - y dataset
def loadDataSet(fileName):
numFeat = len(open(fileName).readline().split('\t')) - 1
xArr = []; yArr = []
fr = open(fileName)
for line in fr.readlines():
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr
# lwlr: locally weighted linear regression for a single point.
# Parameters: testPoint - the point to predict; xArr - x dataset; yArr - y dataset; k - Gaussian kernel bandwidth (user-defined)
# Returns: the prediction at testPoint (testPoint * ws, where ws are the local regression coefficients)
def lwlr(testPoint, xArr, yArr, k = 1.0):
xMat = np.mat(xArr); yMat = np.mat(yArr).T
m = np.shape(xMat)[0]
weights = np.mat(np.eye((m))) # create a diagonal weight matrix
for j in range(m): # Traverse the dataset to calculate the weight of each sample
diffMat = testPoint - xMat[j, :]
weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
xTx = xMat.T * (weights * xMat)
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular and cannot be inverted.")
return
ws = xTx.I * (xMat.T * (weights * yMat)) # Calculate the regression coefficient
return testPoint * ws
# lwlrTest: run lwlr on every point of a test set.
# Parameters: testArr - test dataset; xArr - x dataset (training set); yArr - y dataset (training set); k - Gaussian kernel bandwidth (user-defined)
# Returns: yHat - the predictions for testArr
def lwlrTest(testArr, xArr, yArr, k=1.0):
m = np.shape(testArr)[0] # size of the test dataset
yHat = np.zeros(m)
for i in range(m): # Make predictions for each sample point
yHat[i] = lwlr(testArr[i],xArr,yArr,k)
return yHat
# standRegres: ordinary least-squares linear regression.
# Parameters: xArr - x dataset, yArr - y dataset. Returns: ws - the regression coefficients
def standRegres(xArr,yArr):
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat # compute according to the formula derived earlier in the text
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular and cannot be inverted.")
return
ws = xTx.I * (xMat.T*yMat)
return ws
# rssError: error evaluation. Parameters: yArr - true values, yHatArr - predicted values. Returns: the residual sum of squares
def rssError(yArr, yHatArr):
return ((yArr - yHatArr) ** 2).sum()

if __name__ == '__main__':
abX, abY = loadDataSet('abalone.txt')
print('The training set is the same as the test set: locally weighted linear regression, the effect of the size of the kernel K on the prediction :')
yHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1)
yHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1)
yHat10 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 10)
print('when k=0.1, the error is :',rssError(abY[0:99], yHat01.T))
print('when k=1, the error is :',rssError(abY[0:99], yHat1.T))
print('when k=10, the error is :',rssError(abY[0:99], yHat10.T))
print(' ')
print('Training set differs from test set: for locally weighted linear regression, is a smaller kernel k always better? Switching to a different data slice, the test results are as follows:')
yHat01 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 0.1)
yHat1 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 1)
yHat10 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 10)
print('when k=0.1, the error is :',rssError(abY[100:199], yHat01.T))
print('when k=1, the error is :',rssError(abY[100:199], yHat1.T))
print('when k=10, the error is :',rssError(abY[100:199], yHat10.T))
print(' ')
print('Training set differs from test set: simple linear regression compared with locally weighted linear regression at k=1 :')
print('when k=1, the error is :', rssError(abY[100:199], yHat1.T))
ws = standRegres(abX[0:99], abY[0:99])
yHat = np.mat(abX[100:199]) * ws
print('Simple linear regression error size :', rssError(abY[100:199], yHat.T.A))
The results are shown below.
It can be seen that with k=0.1 the error on the training set is small, but the error becomes much larger when the model is applied to a new data set; this is what is commonly called overfitting. A trained model must also achieve good accuracy on the test set, so that it generalizes to new data. The results also show that with k=1 locally weighted linear regression and simple linear regression give similar errors on the test data, which confirms that the best model can only be selected by comparing results on unseen data. So is a kernel size of 10 optimal? Perhaps, but to get a more reliable answer you should repeat the comparison on ten different sample sets and compare the results (a sketch of this idea follows).
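The "ten tests on ten different sample sets" suggestion can be sketched as a small helper that repeats the train/test comparison over random splits and averages the errors. This is only an illustration: compare_kernels is a hypothetical helper built on the lwlrTest and rssError functions defined above, and the split sizes simply mirror the 99/100 split used in the text:
import numpy as np

def compare_kernels(xArr, yArr, ks=(0.1, 1, 10), trials=10, seed=0):
    # average test error of LWLR for several kernel sizes over random train/test splits
    rng = np.random.RandomState(seed)
    xArr, yArr = np.array(xArr), np.array(yArr)
    errors = {k: [] for k in ks}
    for _ in range(trials):
        idx = rng.permutation(len(xArr))
        train, test = idx[:99], idx[99:199]
        for k in ks:
            yHat = lwlrTest(xArr[test], xArr[train], yArr[train], k)
            errors[k].append(rssError(yArr[test], yHat))
    return {k: np.mean(v) for k, v in errors.items()}

# print(compare_kernels(abX, abY))  # e.g. {0.1: ..., 1: ..., 10: ...}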
This example shows that locally weighted linear regression can produce better results than ordinary linear regression. Its drawback is that every prediction requires a pass over the entire training set, which means all of the training data must be kept available.
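To make that last point concrete, here is a minimal (hypothetical) wrapper showing what "all training data must be saved" means in practice: the model object is nothing more than the stored training set plus the kernel size, and every prediction re-runs lwlr (defined above) over that whole set:
class LWLRModel:
    def __init__(self, xArr, yArr, k=1.0):
        self.xArr, self.yArr, self.k = xArr, yArr, k  # the entire training set stays in memory

    def predict(self, testPoint):
        return lwlr(testPoint, self.xArr, self.yArr, self.k)  # full pass over the training set per point

# model = LWLRModel(abX[0:99], abY[0:99], k=1.0)
# print(model.predict(abX[100]))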