5.4 Practice of multiple linear regression

5.4.1 Prediction of housing price by multiple linear regression

Earlier in this chapter the author presented a housing price forecast based on simple (univariate) linear regression. What follows is again a housing price forecast, but now the price depends not only on the floor area but also on the number of rooms, so the problem becomes multivariate. The underlying theory was covered in section 5.2 and is not repeated here. Let's get right to the code.
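To make the setting concrete, this is just the section 5.2 model written out for this example, with x1 the floor area and x2 the number of bedrooms: the hypothesis is hθ(x) = θ0 + θ1·x1 + θ2·x2, and the cost function is J(θ) = (1/2m)·Σ(hθ(x^(i)) − y^(i))², summed over the m training samples. Batch gradient descent updates all parameters simultaneously via θ := θ − (α/m)·Xᵀ(Xθ − y), which is exactly the vectorized update used in the bgd function in the code below.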

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from mpl_toolkits.mplot3d import axes3d
import math
from matplotlib import cm
import matplotlib.ticker as mtick

matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False # is used to display the minus sign normally

def loadDataSet(filename):
    """
    Load the data.
    Returns:
        xArr - x data set
        yArr - y data set
    """
    X = []
    Y = []
    with open(filename, 'rb') as f:
        for idx, line in enumerate(f):
            line = line.decode('utf-8').strip()
            if not line:
                continue
                
            eles = line.split()
            if idx == 0:
                numFea = len(eles)
            eles = list(map(float, eles))
            
            X.append(eles[:-1])
            Y.append([eles[-1]])
     
    # Convert X,Y lists into matrices
    xArr = np.array(X)
    yArr = np.array(Y)
    
    return xArr,yArr

def standarize(X):
    """
    Standardize the sample set column by column.
    Parameters:
        X - sample set
    Returns:
        X, values - the standardized sample set and the per-column mean/std
    """
    m, n = X.shape
    values = {}  # save the mean and std of each column, so new samples can be standardized the same way
    for j in range(n):
        features = X[:,j]
        meanVal = features.mean(axis=0)
        stdVal = features.std(axis=0)
        values[j] = [meanVal, stdVal]
        if stdVal != 0:
            X[:,j] = (features - meanVal) / stdVal
        else:
            X[:,j] = 0
    return X, values

def J(theta, X, Y):
    """
    Compute the cost function.
    Parameters: theta, X, Y
    Returns: the cost value
    """
    m = len(X)
    return np.sum(np.dot((predict(theta, X) - Y).T, (predict(theta, X) - Y)) / (2 * m))

def bgd(alpha, X, Y, maxloop, epsilon):
    """
    Batch gradient descent.
    Parameters: alpha, X, Y, maxloop, epsilon
    Returns: theta, thetas, costs
    """
    m, n = X.shape
    theta = np.zeros((n,1))  # initialize the parameters to 0

    count = 0  # number of iterations
    converged = False  # whether convergence has occurred
    cost = np.inf  # the initial cost is infinite
    costs = [J(theta, X, Y),]  # record the cost of every iteration

    thetas = {}  # record every parameter update
    for i in range(n):
        thetas[i] = [theta[i,0]]

    while count <= maxloop:
        if converged:
            break
        count += 1
        
        # Update the n parameters one by one and store them in thetas (separate calculation):
        # for j in range(n):
        #     deriv = np.sum(np.dot(X[:,j].T, (predict(theta, X) - Y))) / m
        #     thetas[j].append(theta[j,0] - alpha*deriv)
        # for j in range(n):
        #     theta[j,0] = thetas[j][-1]

        # Update all n parameters simultaneously (vectorized)
        theta = theta - alpha * 1.0 / m * np.dot(X.T, (predict(theta, X) - Y))
        # append the new values to thetas
        for j in range(n):
            thetas[j].append(theta[j,0])
            
        # record the cost for the current parameters
        cost = J(theta, X, Y)
        costs.append(cost)
            
        if abs(costs[-1] - costs[-2]) < epsilon:
            converged = True
    
    return theta, thetas, costs

def predict(theta, X):
    """
    Parameters: theta, X
    Returns: the predicted values
    """
    return np.dot(X, theta)  # here X is the processed (standardized, bias-augmented) design matrix

def plotRegression(X, Y, theta):
    """
    Parameters: X, Y, theta
    Returns: none
    """
    # Plot the fitted plane
    fittingFig = plt.figure(figsize=(16, 12))
    ##title = 'bgd: rate=%.3f, maxloop=%d, epsilon=%.3f \n'%(alpha,maxloop,epsilon)
    ax = fittingFig.gca(projection='3d')

    xx = np.linspace(0, 200, 25)
    yy = np.linspace(0, 5, 25)
    zz = np.zeros((25, 25))
    for i in range(25):
        for j in range(25):
            normalizedSize = (xx[i] - values[0][0]) / values[0][1]
            normalizedBr = (yy[j] - values[1][0]) / values[1][1]
            x = np.matrix([[1, normalizedSize, normalizedBr]])
            zz[i,j] = predict(theta, x)
    xx, yy = np.meshgrid(xx,yy)
    ax.zaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap=cm.rainbow, alpha=0.1, antialiased=True)
    
    xs = X[:, 0].flatten()
    ys = X[:, 1].flatten()
    zs = Y[:, 0].flatten()
    ax.scatter(xs, ys, zs, c='b', marker='o')
    
    ax.set_xlabel(u'area')
    ax.set_ylabel(u'number of bedrooms')
    ax.set_zlabel(u'value')

def plotinline(costs):
    """
    Parameters: costs
    Returns: none
    """
    
    errorsFig = plt.figure()
    ax = errorsFig.add_subplot(111)
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    
    ax.plot(range(len(costs)), costs)
    ax.set_xlabel(u'number of iterations')
    ax.set_ylabel(u'cost function')

def computeCorrelation(X, Y):
    """
    Compute the Pearson correlation coefficient.
    Parameters: X, Y
    Returns: the correlation coefficient
    """
    xBar = np.mean(X)
    yBar = np.mean(Y)
    SSR = 0
    varX = 0
    varY = 0
    for i in range(0 , len(X)):
        diffXXBar = X[i] - xBar
        diffYYBar = Y[i] - yBar
        SSR += (diffXXBar * diffYYBar)
        varX +=  diffXXBar**2
        varY += diffYYBar**2
    
    SST = math.sqrt(varX * varY)
    return SSR / SST


# test
if __name__ == '__main__':
    ## Step 1: load data...
    print("Step 1: load data...")
    originX, Y = loadDataSet('houses.txt')
    #print(originX.shape, Y.shape)
  
    # add x0 columns to feature X
    m,n = originX.shape
    X, values = standarize(originX.copy())
    X = np.concatenate((np.ones((m,1)), X), axis=1)
    #print(X.shape, Y.shape)
    #print(X[:3],values)
    
    ## Step 2: training...
    print("Step 2: training...")
    alpha = 1  # learning rate
    maxloop = 5000 # Maximum number of iterations
    epsilon = 0.000001

    # the learned parameters are stored in theta; costs holds the cost of each iteration; thetas holds the theta values from every update
    theta , thetas, costs= bgd(alpha, X, Y, maxloop, epsilon)
    #print(theta , thetas, costs)
    
    print(theta)

    ## Step 4: testing...
    print("Step 4: testing...")
    normalizedSize = (70 - values[0][0]) / values[0][1]
    normalizedBr = (2 - values[1][0]) / values[1][1]
    predicateX = np.matrix([[1, normalizedSize, normalizedBr]])
    
    # At this point, the parameters are learned and the model is set. To predict new instances, the following can be done
    price = predict(theta, predicateX)
    
    ## Step 5: show the result...
    print("Step 5: show the result...")
    print(u'Estimated price of a 70㎡ two-bedroom home: %.4f (ten thousand yuan)' % price)
    
    ## Step 6: show Regression plot...
    print("Step 6: show Regression plot...")
    plotRegression(X, Y, theta)

    ## Step 7: show plotinline...
    print("Step 7: show plotinline...")
    plotinline(costs)
    
    ## Step 8: math Pearson...
    print("Step 8: math Pearson...")
    xMat = np.mat(X)      # create the xMat matrix
    yMat = np.mat(Y)      # create the yMat matrix
    yHat = predict(theta, xMat)
    
    Pearson = computeCorrelation(yHat, yMat)
    print(Pearson)

The results are shown below.



[Note] This code differs slightly from the univariate regression example; readers are encouraged to compare the two.

Refer to the attachment 3.MLR\MLR_v1\mlr_v1.0.py for the complete code.
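As a quick sanity check on the gradient-descent result, the same parameters can also be obtained in closed form from the normal equation θ = (XᵀX)⁻¹XᵀY. The snippet below is only an illustrative sketch, not part of the attached program; it assumes the standardized, bias-augmented X and the Y produced by the code above are still in memory. If bgd has converged, the two sets of parameters should agree closely.

import numpy as np

# Normal equation: solve for theta directly instead of iterating.
# pinv is used so a (near-)singular X^T X does not raise an error.
theta_ne = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(Y)
print(theta_ne)   # compare with the theta returned by bgd(...)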

Next, we call the sklearn library to implement multiple linear regression directly. Here is the code.

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import math
from matplotlib import cm
import matplotlib.ticker as mtick
from sklearn import linear_model
from mpl_toolkits.mplot3d import axes3d

matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False # is used to display the minus sign normally

def loadDataSet(filename):
    """
    Load the data.
    Returns:
        xArr - x data set
        yArr - y data set
    """
    X = []
    Y = []
    with open(filename, 'rb') as f:
        for idx, line in enumerate(f):
            line = line.decode('utf-8').strip()
            if not line:
                continue
                
            eles = line.split()
            if idx == 0:
                numFea = len(eles)
            eles = list(map(float, eles))
            
            X.append(eles[:-1])
            Y.append([eles[-1]])
     
    # Convert X,Y lists into matrices
    xArr = np.array(X)
    yArr = np.array(Y)
    
    return xArr,yArr

def standarize(X):
    """
    Standardize the sample set column by column.
    Parameters:
        X - sample set
    Returns:
        X, values - the standardized sample set and the per-column mean/std
    """
    m, n = X.shape
    values = {}  # save the mean and std of each column, so new samples can be standardized the same way
    for j in range(n):
        features = X[:,j]
        meanVal = features.mean(axis=0)
        stdVal = features.std(axis=0)
        values[j] = [meanVal, stdVal]
        if stdVal != 0:
            X[:,j] = (features - meanVal) / stdVal
        else:
            X[:,j] = 0
    return X, values

def J(theta, X, Y):
    """
    Compute the cost function.
    Parameters: theta, X, Y
    Returns: the cost value
    """
    m = len(X)
    return np.sum(np.dot((predict(theta, X) - Y).T, (predict(theta, X) - Y)) / (2 * m))

def bgd(alpha, X, Y, maxloop, epsilon):
    """
    Batch gradient descent.
    Parameters: alpha, X, Y, maxloop, epsilon
    Returns: theta, thetas, costs
    """
    m, n = X.shape
    theta = np.zeros((n,1))  # initialize the parameters to 0

    count = 0  # number of iterations
    converged = False  # whether convergence has occurred
    cost = np.inf  # the initial cost is infinite
    costs = [J(theta, X, Y),]  # record the cost of every iteration

    thetas = {}  # record every parameter update
    for i in range(n):
        thetas[i] = [theta[i,0]]

    while count <= maxloop:
        if converged:
            break
        count += 1
        
        # Update the n parameters one by one and store them in thetas (separate calculation):
        # for j in range(n):
        #     deriv = np.sum(np.dot(X[:,j].T, (predict(theta, X) - Y))) / m
        #     thetas[j].append(theta[j,0] - alpha*deriv)
        # for j in range(n):
        #     theta[j,0] = thetas[j][-1]

        # Update all n parameters simultaneously (vectorized)
        theta = theta - alpha * 1.0 / m * np.dot(X.T, (predict(theta, X) - Y))
        # append the new values to thetas
        for j in range(n):
            thetas[j].append(theta[j,0])
            
        # record the cost for the current parameters
        cost = J(theta, X, Y)
        costs.append(cost)
            
        if abs(costs[-1] - costs[-2]) < epsilon:
            converged = True
    
    return theta, thetas, costs

def predict(theta, X):
    """
    Parameters: theta, X
    Returns: the predicted values
    """
    return np.dot(X, theta)  # here X is the processed (bias-augmented) design matrix

def plotRegression(X, Y, theta):
    """
    Parameters: X, Y, theta
    Returns: none
    """
    # Plot the fitted plane
    fittingFig = plt.figure(figsize=(16, 12))
    ##title = 'bgd: rate=%.3f, maxloop=%d, epsilon=%.3f \n'%(alpha,maxloop,epsilon)
    ax = fittingFig.gca(projection='3d')

    xx = np.linspace(0, 200, 25)
    yy = np.linspace(0, 5, 25)
    zz = np.zeros((25, 25))
    # the features are not standardized in this version, so the plane is evaluated on the raw grid values
    for i in range(25):
        for j in range(25):
            x = np.matrix([[1, xx[i], yy[j]]])
            zz[i, j] = predict(theta, x)
    
    xx, yy = np.meshgrid(xx,yy)
    ax.zaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    ax.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap=cm.rainbow, alpha=0.1, antialiased=True)
    
    xs = X[:, 0].flatten()
    ys = X[:, 1].flatten()
    zs = Y[:, 0].flatten()
    ax.scatter(xs, ys, zs, c='b', marker='o')
    
    ax.set_xlabel(u'area')
    ax.set_ylabel(u'number of bedrooms')
    ax.set_zlabel(u'value')

def plotinline(costs):
    """
    Parameters: costs
    Returns: none
    """
    
    errorsFig = plt.figure()
    ax = errorsFig.add_subplot(111)
    ax.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    
    ax.plot(range(len(costs)), costs)
    ax.set_xlabel(u'number of iterations')
    ax.set_ylabel(u'cost function')

def computeCorrelation(X, Y):
    """
    Compute the Pearson correlation coefficient.
    Parameters: X, Y
    Returns: the correlation coefficient
    """
    xBar = np.mean(X)
    yBar = np.mean(Y)
    SSR = 0
    varX = 0
    varY = 0
    for i in range(0 , len(X)):
        diffXXBar = X[i] - xBar
        diffYYBar = Y[i] - yBar
        SSR += (diffXXBar * diffYYBar)
        varX +=  diffXXBar**2
        varY += diffYYBar**2
    
    SST = math.sqrt(varX * varY)
    return SSR / SST


# test
if __name__ == '__main__':
    ## Step 1: load data...
    print("Step 1: load data...")
    originX, Y = loadDataSet('houses.txt')
    #print(originX.shape, Y.shape)
  
    # add x0 columns to feature X
    m,n = originX.shape
    X = np.concatenate((np.ones((m,1)), originX), axis=1)
    #print(X.shape, Y.shape)
    #print(X[:3],values)
    
    ## Step 2: Create linear regression object...
    print("Step 2: Create linear regression object...")
    regr = linear_model.LinearRegression()
    
    ## Step 3: training...
    print("Step 3: training...")
    regr.fit(X, Y)
    
    ## Step 4: testing...
    print("Step 4: testing...")
    X_predict = [1, 70, 2]
    # At this point, the parameters are learned and the model is set. To predict new instances, the following can be done
    Y_predict = regr.predict([X_predict])
    
    ## Step 5: show the result...
    print("Step 5: show the result...")
    # sklearn stores the weight vector w = (w_1, ..., w_p) in coef_ and the bias w_0 in intercept_
    print("coef_:"+str(regr.coef_))
    print("intercept_:"+str(regr.intercept_)) 
    print("Y_predict:"+str(Y_predict))
    
    ## Step 6: show the result...
    print("Step 6: show the result...")
    print(u'Estimated price of a 70㎡ two-bedroom home: %.4f (ten thousand yuan)' % Y_predict)
    
    
    ## Step 6: show Regression plot...
    print("Step 6: show Regression plot...")
    
    w0 = np.array(regr.intercept_)
    w1 = np.array(regr.coef_.T[1:3])
    print(w0)
    print(w1)
    # merge into a matrix
    theta = np.vstack((w0,w1))
    print(theta)

    plotRegression(X, Y, theta)
    
    ## Step 7: math Pearson...
    print("Step 7: math Pearson...")
    xMat = np.mat(X)      # create the xMat matrix
    yMat = np.mat(Y)      # create the yMat matrix
    yHat = predict(theta, xMat)
    
    Pearson = computeCorrelation(yHat, yMat)
    print(Pearson)

Refer to the attachment MLR\MLR-sklearn_v2\MLR-sklearn_v2.0.py for the complete code.

[Note] The output is not shown here, unlike for the previous code.

5.4.2 Prediction of abalone age by multiple linear regression

Next, we apply regression to real data. The file abalone.txt, from the UCI data set, records the age of abalone (a kind of shellfish); the age of an abalone can be determined from the number of rings in its shell. Let's take a look at the data set.



The data may look a little intimidating, but it makes a good example. Let's go straight to the code.
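Before diving into the code, it is worth stating what locally weighted linear regression (LWLR) computes, since the functions below implement it directly. For each query point x, every training sample x^(j) is assigned a weight by a Gaussian kernel, w(j,j) = exp(−‖x^(j) − x‖² / (2k²)), and the weights are collected in a diagonal matrix W. The regression coefficients for that query point are then ŵ = (XᵀWX)⁻¹XᵀWy, and the prediction is x·ŵ. The bandwidth k controls how quickly the weights decay with distance: a small k fits each prediction mainly to its nearest neighbours, while a large k approaches ordinary least squares. These two formulas are exactly what the lwlr function below implements.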

# -*- coding:utf-8 -*-
import numpy as np

# Load the data.
# Returns:
#     xArr - x data set
#     yArr - y data set
def loadDataSet(fileName):

	numFeat = len(open(fileName).readline().split('\t')) - 1
	xArr = []; yArr = []
	fr = open(fileName)
	for line in fr.readlines():
		lineArr =[]
		curLine = line.strip().split('\t')
		for i in range(numFeat):
			lineArr.append(float(curLine[i]))
		xArr.append(lineArr)
		yArr.append(float(curLine[-1]))
	return xArr, yArr


# Locally weighted linear regression: make a prediction for a single test point.
# Parameters:
#     testPoint - the test sample
#     xArr - x data set
#     yArr - y data set
#     k - Gaussian kernel bandwidth, user defined
# Returns:
#     testPoint * ws - the prediction, where ws is the regression coefficient vector
def lwlr(testPoint, xArr, yArr, k = 1.0):

	xMat = np.mat(xArr); yMat = np.mat(yArr).T
	m = np.shape(xMat)[0]
	weights = np.mat(np.eye((m)))										# create a diagonal weight matrix
	for j in range(m):                      							# Traverse the dataset to calculate the weight of each sample
		diffMat = testPoint - xMat[j, :]     							
		weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
	xTx = xMat.T * (weights * xMat)										
	if np.linalg.det(xTx) == 0.0:
		print("The matrix is singular and cannot be inverted.")
		return
	ws = xTx.I * (xMat.T * (weights * yMat))							# Calculate the regression coefficient
	return testPoint * ws

# Locally weighted linear regression test: make a prediction for every point in a test set.
# Parameters:
#     testArr - test data set
#     xArr - x data set (training set)
#     yArr - y data set (training set)
#     k - Gaussian kernel bandwidth, user defined
# Returns:
#     yHat - the predictions
def lwlrTest(testArr, xArr, yArr, k=1.0):

	m = np.shape(testArr)[0]											# size of the test data set
	yHat = np.zeros(m)	
	for i in range(m):													# Make predictions for each sample point
		yHat[i] = lwlr(testArr[i],xArr,yArr,k)
	return yHat


# Ordinary least squares linear regression.
# Parameters:
#     xArr - x data set
#     yArr - y data set
# Returns:
#     ws - the regression coefficients
def standRegres(xArr, yArr):

	xMat = np.mat(xArr); yMat = np.mat(yArr).T
	xTx = xMat.T * xMat							# Calculate the regression coefficient according to the formula derived in this paper
	if np.linalg.det(xTx) == 0.0:
		print("The matrix is singular and cannot be inverted.")
		return
	ws = xTx.I * (xMat.T*yMat)
	return ws

# Error evaluation function: residual sum of squares.
# Parameters:
#     yArr - true values
#     yHatArr - predicted values
# Returns:
#     the size of the error
def rssError(yArr, yHatArr):
    return ((yArr - yHatArr) ** 2).sum()


if __name__ == '__main__':

    abX, abY = loadDataSet('abalone.txt')
    print('Training set same as test set: the effect of the kernel size k on locally weighted linear regression predictions:')
    yHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1)
    yHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1)
    yHat10 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 10)
    
    print('when k=0.1, the error is :',rssError(abY[0:99], yHat01.T))
    print('when k=1, the error is :',rssError(abY[0:99], yHat1.T))
    print('when k=10, the error is :',rssError(abY[0:99], yHat10.T))

    print(' ')
    print('Training set different from test set: for locally weighted linear regression, is a smaller kernel k always better? Testing on a different slice of the data gives the following results:')
    yHat01 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 0.1)
    yHat1 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 1)
    yHat10 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 10)
	
    print('when k=0.1, the error is :',rssError(abY[100:199], yHat01.T))
    print('when k=1, the error is :',rssError(abY[100:199], yHat1.T))
    print('when k=10, the error is :',rssError(abY[100:199], yHat10.T))

	
    print(' ')
    print('Training set differs from test set: simple linear regression compared with locally weighted linear regression at k=1 :')
    print('when k=1, the error is :', rssError(abY[100:199], yHat1.T))
    ws = standRegres(abX[0:99], abY[0:99])
    yHat = np.mat(abX[100:199]) * ws
    print('Simple linear regression error size :', rssError(abY[100:199], yHat.T.A))

The results are shown below.



It can be seen that when k=0.1 the error on the training set is smallest, but when the model is applied to a new data set the error becomes much larger. This is what is commonly called overfitting. A trained model must maintain good accuracy on the test set so that it can be applied to new data; in other words, the model has to generalize. As the output also shows, locally weighted linear regression with k=1 and simple linear regression give similar results, which again indicates that the best model can only be chosen by comparing results on unseen data. So is a kernel size of 10 optimal? Perhaps, but to get a more reliable answer you would need to run about ten trials on ten different samples of the data and compare the results, as sketched below.
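The "ten trials on different samples" idea can be sketched with the functions already defined above. The snippet below is only an illustrative sketch, not part of the original program: it assumes abX, abY, lwlrTest and rssError from the code above are available (the UCI abalone file has a few thousand rows, so the slices used here exist). It takes several windows of 99 consecutive samples as training data and the next 99 as test data, then averages the test error for each kernel size; note that, as above, a very small k can be numerically fragile.

import numpy as np

# Illustrative sketch: compare kernel sizes over several train/test splits.
ks = [0.1, 1, 10]
errors = {k: [] for k in ks}
for start in range(0, 990, 198):                      # a few non-overlapping windows
    trainX, trainY = abX[start:start + 99], abY[start:start + 99]
    testX, testY = abX[start + 99:start + 198], abY[start + 99:start + 198]
    for k in ks:
        yHat = lwlrTest(testX, trainX, trainY, k)
        errors[k].append(rssError(np.array(testY), yHat.T))
for k in ks:
    print('k = %s, mean test error over %d splits: %.2f' % (k, len(errors[k]), np.mean(errors[k])))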

This example shows how locally weighted linear regression can be used to build a model that performs better than ordinary linear regression. The drawback of locally weighted linear regression is that it has to run over the entire training set for every single prediction, which means all of the training data must be kept available.


The complete code for this chapter can be found in the attachments referenced above.