5.4 Multivariate logistic regression in practice

5.4.1 Predicting the mortality of sick horses with multivariate logistic regression

In this exercise, Logistic regression is used to predict the survival of horses suffering from colic. The raw data set can be downloaded from: archive.ics.uci.edu/ml/datasets…

The data set contains 368 samples and 28 features. I am not an expert in horse medicine, but from the literature I know that colic is a term used to describe gastrointestinal pain in horses. However, the condition does not necessarily originate in the horse's gastrointestinal tract; other problems can also cause colic. The data set includes the indicators hospitals use to detect equine colic, some of which are subjective and others difficult to measure, such as the horse's pain level.

In addition to some indicators being subjective and hard to measure, there is another problem with this data: 30% of the values in the data set are missing. The following first introduces how to deal with missing values in a data set, and then uses Logistic regression with stochastic gradient ascent to predict the survival of the sick horses.

Missing values in data are a thorny problem, and there is a lot of literature devoted to solving it. So what exactly is the problem with missing data? Suppose you have 100 samples and 20 features, all collected by machine. What if one feature is invalid because one of the sensors was damaged? Should you throw away the entire sample? And what about the other 19 features? Are they still usable? The answer is yes. Because data is sometimes so expensive that throwing it away or collecting it again is undesirable, some method must be adopted to solve the problem.

Some alternatives are given below (a minimal sketch of the first option follows this list):

- Fill in missing values with the mean of the available values of that feature;
- Fill in missing values with a special value, such as -1;
- Ignore samples with missing values;
- Fill in missing values with the mean of similar samples;
- Use an additional machine learning algorithm to predict the missing values.
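As an illustration of the first strategy, here is a minimal sketch of mean imputation, assuming missing entries are encoded as np.nan in a 2-D float array (the function name fill_with_feature_means is made up for this example):

import numpy as np

def fill_with_feature_means(X):
    # Replace every NaN entry with the mean of its column (feature),
    # computed over the non-missing values only
    X = X.copy()
    col_means = np.nanmean(X, axis=0)            # per-feature mean, ignoring NaNs
    nan_rows, nan_cols = np.where(np.isnan(X))   # positions of the missing entries
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X

# Tiny example: the two missing entries are replaced by their column means
X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(fill_with_feature_means(X))    # first column mean is 2.0, second is 3.0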

The preprocessed data set is saved in two files: horseColicTest.txt and horseColicTraining.txt. See the examples. Now that we have a "clean" working data set and a good optimization algorithm, we can combine these parts to train a classifier that can then be used to predict whether a sick horse will live or die.
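As a reminder of what the hand-written trainer below computes: the J function implements the regularized cross-entropy cost, and gradient ascent on the corresponding log-likelihood updates the parameters as follows (the bias term is excluded from the penalty, matching the theta[1:] slice in the code):

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

$$\theta \leftarrow \theta + \alpha\left[\frac{1}{m}X^\top(y-h) - \frac{\lambda}{m}\begin{pmatrix}0\\ \theta_{1:}\end{pmatrix}\right], \qquad h_\theta(x) = \frac{1}{1+e^{-\theta^\top x}}$$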

With these data sets, all we need is a Logistic regression classifier that can be used to predict the survival of the sick horses. Look at the code.

# -*- coding:utf-8 -*-
import numpy as np
import time
import matplotlib.pyplot as plt

import matplotlib
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False

"""
Function description: load the data set (tab-separated version)
Parameters: filename - file name
Returns: dataSet_Arr - data set, Labels_Arr - labels
"""
def loadDataSet_old(filename):
    fr = open(filename)  # Open the training set

    dataSet = []
    Labels = []
    for line in fr.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(len(currLine)-1):
            lineArr.append(float(currLine[i]))
        
        dataSet.append(lineArr)
        Labels.append(float(currLine[-1]))
    
    # Convert the X, Y lists into arrays
    dataSet_Arr = np.array(dataSet)
    Labels_Arr = np.array(Labels)
    
    return dataSet_Arr,Labels_Arr

"""
Function description: load the data set
Parameters: filename - file name
Returns: xArr - x data set, yArr - y data set
"""
def loadDataSet(filename):
    X = []
    Y = []
    
    with open(filename, 'rb') as f:
        for idx, line in enumerate(f):
            line = line.decode('utf-8').strip()
            if not line:
                continue
            eles = line.split()
            eles = list(map(float, eles))
            
            if idx == 0:
                numFea = len(eles)
            # All but the last element of each row go into X
            X.append(eles[:-1])
            # The last element of each row is the label
            Y.append([eles[-1]])
     
    # Convert the X, Y lists into arrays
    xArr = np.array(X)
    yArr = np.array(Y)
    
    return xArr,yArr

"""
Function description: sigmoid function
Parameters: z
Returns: the sigmoid of z
"""
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

"""
Function description: compute the cost
Parameters: theta, X, Y, theLambda
Returns: the value of the cost function
"""
def J(theta, X, Y, theLambda=0):
    m, n = X.shape
    h = sigmoid(np.dot(X, theta))
    # Regularized cross-entropy cost (the bias theta[0] is not penalized)
    J = (-1.0/m)*(np.dot(np.log(h).T, Y) + np.dot(np.log(1-h).T, 1-Y)) \
        + (theLambda/(2.0*m))*np.sum(np.square(theta[1:]))
    if np.isnan(J[0]):
        return np.inf
    # J is a 1x1 array; the single value needs to be extracted
    return J.flatten()[0]

"""
Function description: train the model with gradient ascent
Parameters: X, Y, options
    options.alpha - learning rate
    options.theLambda - regularization parameter lambda
    options.maxloop - maximum number of iterations
    options.epsilon - convergence threshold
    options.method - 'sgd' stochastic gradient ascent / 'bgd' batch gradient ascent
Returns: thetas - parameter history, costs - cost history, count - iterations used
"""
def gradient(X, Y, options):
    m, n = X.shape
    
    # Initialize the model parameters: n features need n parameters
    theta = np.zeros((n,1))
    
    cost = J(theta, X, Y)  # Current error
    costs = [cost]
    thetas = [theta]
    
    # dict.get(key, default=None) returns the value for key, or the default if key is not in the dictionary
    alpha = options.get('alpha', 0.01)
    epsilon = options.get('epsilon', 0.00001)
    maxloop = options.get('maxloop', 1000)
    theLambda = float(options.get('theLambda', 0))  # without the float conversion, theLambda/m would be integer 0 in Python 2
    method = options.get('method', 'bgd')
    
    # Define stochastic gradient ascent
    def _sgd(theta):
        count = 0
        converged = False
        while count < maxloop:
            if converged :
                break
            # Stochastic gradient ascent: the parameters are updated on every single sample
            for i in range(m):
                h = sigmoid(np.dot(X[i].reshape((1,n)), theta))
                
                # The bias theta[0] is excluded from regularization; the penalty
                # term is subtracted when ascending the log-likelihood
                theta = theta + alpha*((1.0/m)*X[i].reshape((n,1))*(Y[i]-h) - (theLambda/m)*np.r_[[[0]], theta[1:]])
                
                thetas.append(theta)
                cost = J(theta, X, Y, theLambda)
                costs.append(cost)
                if abs(costs[-1] - costs[-2]) < epsilon:
                    converged = True
                    break
            count += 1
        return thetas, costs, count
    
    # Define batch gradient ascent
    def _bgd(theta):
        count = 0
        converged = False
        while count < maxloop:
            if converged :
                break
            
            h = sigmoid(np.dot(X, theta))
            # As in _sgd, the penalty on theta[1:] is subtracted when ascending
            theta = theta + alpha*((1.0/m)*np.dot(X.T, (Y-h)) - (theLambda/m)*np.r_[[[0]],theta[1:]])
            
            thetas.append(theta)
            cost = J(theta, X, Y, theLambda)
            costs.append(cost)
            if abs(costs[-1] - costs[-2]) < epsilon:
                converged = True
                break
            count += 1
        return thetas, costs, count
    
    methods = {'sgd': _sgd, 'bgd': _bgd}
    return methods[method](theta)  

"""
Function description: plot the cost curve
Parameters: costs
Returns: none
"""
def plotCost(costs):
    # Plot the error curve
    plt.figure(figsize=(8, 4))
    plt.plot(range(len(costs)), costs)
    plt.xlabel(u'Number of iterations')
    plt.ylabel(u'Cost J')

"""
Function description: plot how the parameters change over the iterations
Parameters: thetas - the parameter history
Returns: none
"""
def plotthetas(thetas):
    # Plot the theta trajectories
    thetasFig, ax = plt.subplots(len(thetas[0]))
    thetas = np.asarray(thetas)
    for idx, sp in enumerate(ax):
        thetaList = thetas[:, idx]
        sp.plot(range(len(thetaList)), thetaList)
        sp.set_xlabel('Number of iterations')
        sp.set_ylabel(r'$\theta_%d$'%idx)

"""
Function description: classify a sample with the trained weights
Parameters: inX - feature vector, weights - regression coefficients
Returns: the classification result
"""
def classifyVector(inX, weights):
    # np.dot works whether weights has shape (n,) or (n,1);
    # the original sum(inX*weights) broadcasts a (n,1) weights incorrectly
    prob = sigmoid(np.dot(inX, weights))
    
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

"""
Function description: evaluate the classifier
Parameters: Weights - regression coefficients, dataSet - data set, Labels - labels
Returns: predict - predicted results, errorRate - error rate
"""
def test(Weights, dataSet, Labels):
    
    errorCount = 0
    numTestVec = 0.0
    predict = []
    # Iterate over the passed-in labels, not the global test_Y
    for i in range(len(Labels)):
        numTestVec += 1.0
        result = classifyVector(dataSet[i], Weights)
        
        predict.append(result)
        
        if int(result) != int(Labels[i]):
            errorCount += 1
    
    errorRate = (float(errorCount)/numTestVec) * 100 	# Error rate calculation
    
    return predict , errorRate


"""
Function description: plot the predicted values against the true values
Parameters: X_test, Y_test, y_predict
Returns: none
"""
def plotfit(X_test, Y_test, y_predict):
    # Compare the predicted values with the actual values
    p_x = range(len(X_test))
    plt.figure(figsize=(18, 4), facecolor="w")
    plt.ylim(0, 6)
    plt.plot(p_x, Y_test, "ro", markersize=8, zorder=3, label=u"True value")
    plt.plot(p_x, y_predict, "go", markersize=14, zorder=2, label=u"Predicted value")
    plt.legend(loc="upper left")
    plt.xlabel(u"Number", fontsize=12)
    plt.ylabel(u"Type", fontsize=12)
    plt.title(u"Classification of data by the Logistic algorithm", fontsize=12)
    #plt.savefig("logistic_classification.png")
    plt.show()

if __name__ == '__main__':
    ## Step 1: load data...
    print("Step 1: load data...")

    origin_train_X, train_Y = loadDataSet('horseColicTraining.txt')
    origin_test_X, test_Y = loadDataSet('horseColicTest.txt')
    
    train_m,train_n = origin_train_X.shape
    test_m,test_n = origin_test_X.shape
    
    train_X = np.concatenate((np.ones((train_m,1)), origin_train_X), axis=1)
    test_X = np.concatenate((np.ones((test_m,1)), origin_test_X), axis=1)
    
    #print(train_X)
    #print(train_Y)
        
    ## Step 2: training sgd...
    print("Step 2: training sgd...")
    
    # SGD: stochastic gradient ascent
    options = {
        'alpha': 0.5, 'epsilon': 0.00000001, 'maxloop': 100000, 'method': 'sgd'
    }
    
    # Training model
    start = time.time()
    sgd_thetas, sgd_costs, sgd_iterationCount = gradient(train_X, train_Y, options)
    print("sgd_thetas:")
    print(sgd_thetas[-1])
    end = time.time()
    sgd_time = end - start
    print("sgd_time:"+str(sgd_time))
    
    ## Step 3: testing...
    print("Step 3: testing...")
    # Pass the final weight vector itself (the original .all() collapsed it to a boolean)
    predict_Y, errorRate = test(sgd_thetas[-1], test_X, test_Y)

    ## Step 4: show the result...
    print("Step 4: show the result...")
    print('Predicted value :',predict_Y)
    print("Test set error rate: %.2f%%" % errorRate)
    
    ## Step 5: show the plotfit...
    print("Step 5: show the plotfit...")
    plotfit(test_X,test_Y,predict_Y)
    
    ## Step 6: show costs sgd...
    print("Step 6: show costs sgd...")
    plotCost(sgd_costs)

    ## Step 7: show thetas sgd...
    print("Step 7: show thetas sgd...")
    plotthetas(sgd_thetas)

The results are shown below.

(Figures omitted: the predicted vs. true values on the test set, the cost curve J, and the theta trajectories.)

【Code: Regression_classify/Multi_Logistic_Regression_Classify_v1.0/Multi_Logistic_Regression_Classify_v1.0.py】
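Since gradient() also implements batch gradient ascent, only the options dictionary needs to change to compare the two methods; a minimal sketch reusing the variables from the listing above:

    # Rerun training with batch gradient ascent for comparison;
    # every other line of the main program stays the same
    options = {
        'alpha': 0.5, 'epsilon': 0.00000001, 'maxloop': 100000, 'method': 'bgd'
    }
    bgd_thetas, bgd_costs, bgd_iterationCount = gradient(train_X, train_Y, options)
    print(bgd_thetas[-1])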

5.4.2 Predicting the mortality of sick horses with multivariate logistic regression in sklearn

As before, sklearn will be used for the implementation, so go straight to the code and note the comparison with the previous example. The Python code is as follows.
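In essence, everything the hand-written gradient(), classifyVector() and test() machinery did in 5.4.1 collapses to a few sklearn calls; a minimal sketch (using the same variable names as the listing below):

from sklearn.linear_model import LogisticRegression

mlr = LogisticRegression(solver='sag', max_iter=5000)  # 'sag' is itself a stochastic gradient method
mlr.fit(train_X, train_Y.ravel())                      # training replaces the hand-written ascent loop
predict_Y = mlr.predict(test_X)                        # prediction replaces classifyVector()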

# -*- coding:UTF-8 -*-
# @Language : Python3.6
import numpy as np
import random
import time
import matplotlib.pyplot as plt
import matplotlib
from sklearn.linear_model import LogisticRegression

matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False

"""
Function description: load the data set (tab-separated version)
Parameters: filename - file name
Returns: dataSet_Arr - data set, Labels_Arr - labels
"""
def loadDataSet_old(filename):
    fr = open(filename)  # Open the training set
    dataSet = []
    Labels = []
    for line in fr.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(len(currLine)-1):
            lineArr.append(float(currLine[i]))
        dataSet.append(lineArr)
        Labels.append(float(currLine[-1]))
    # Convert the X, Y lists into arrays
    dataSet_Arr = np.array(dataSet)
    Labels_Arr = np.array(Labels)
    return dataSet_Arr, Labels_Arr

"""
Function description: load the data set
Returns: xArr - x data set, yArr - y data set
"""
def loadDataSet(filename):
    X = []
    Y = []
    with open(filename, 'rb') as f:
        for idx, line in enumerate(f):
            line = line.decode('utf-8').strip()
            if not line:
                continue
            eles = list(map(float, line.split()))
            if idx == 0:
                numFea = len(eles)
            # All but the last element of each row go into X
            X.append(eles[:-1])
            # The last element of each row is the label
            Y.append([eles[-1]])
    # Convert the X, Y lists into arrays
    xArr = np.array(X)
    yArr = np.array(Y)
    return xArr, yArr

"""
Function description: plot the predicted values against the true values
Parameters: X_test, Y_test, y_predict
Returns: none
"""
def plotfit(X_test, Y_test, y_predict):
    # Compare the predicted values with the actual values
    p_x = range(len(X_test))
    plt.figure(figsize=(18, 4), facecolor="w")
    plt.ylim(0, 6)
    plt.plot(p_x, Y_test, "ro", markersize=8, zorder=3, label=u"True value")
    plt.plot(p_x, y_predict, "go", markersize=14, zorder=2, label=u"Predicted value")
    plt.legend(loc="upper left")
    plt.xlabel(u"Number", fontsize=12)
    plt.ylabel(u"Type", fontsize=12)
    plt.title(u"Classification of data by the Logistic algorithm", fontsize=12)
    plt.show()

if __name__ == '__main__':
    ## Step 1: load data...
    print("Step 1: load data...")
    origin_train_X, train_Y = loadDataSet('horseColicTraining.txt')
    origin_test_X, test_Y = loadDataSet('horseColicTest.txt')
    
    train_m, train_n = origin_train_X.shape
    test_m, test_n = origin_test_X.shape
    
    train_X = np.concatenate((np.ones((train_m, 1)), origin_train_X), axis=1)
    test_X = np.concatenate((np.ones((test_m, 1)), origin_test_X), axis=1)
    
    ## Step 2: init lr...
    print("Step 2: init lr...")
    mlr = LogisticRegression(solver='sag', max_iter=5000)
    
    ## Step 3: training...
    print("Step 3: training...")
    start = time.time()
    # y must be flattened with ravel() before being passed to sklearn
    mlr.fit(train_X, train_Y.ravel())
    end = time.time()
    mlr_time = end - start
    print('mlr_time:', mlr_time)
    
    ## Step 4: show weights...
    print("Step 4: show weights...")
    # Model quality: mean accuracy on the training set
    r = mlr.score(train_X, train_Y)
    print("r (accuracy):", r)
    #print("coefficients:", mlr.coef_)
    #print("intercept:", mlr.intercept_)
    # The first column of train_X is the manually added bias, so its
    # coefficient is dropped and replaced by intercept_
    w0 = np.array(mlr.intercept_)
    w1 = np.array(mlr.coef_.T[1:22])
    weights = np.vstack((w0, w1))
    #print('weights:')
    #print(weights)
    
    ## Step 5: predict...
    print("Step 5: predict...")
    predict_Y = mlr.predict(test_X)
    print('predict_Y:', predict_Y)
    
    ## Step 6: show the plotfit...
    print("Step 6: show the plotfit...")
    plotfit(test_X, test_Y, predict_Y)

The results are shown below.

(Figure omitted: the predicted vs. true values on the test set.)

【Code: Regression_classify/Multi_Logistic_Regression_Classify-sklearn_v2.0/Multi_Logistic_Regression_Classify-sklearn_v2.0.py】
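For a direct comparison with the error rate printed by the hand-written version, the sklearn model's test error can be computed the same way; a minimal sketch reusing predict_Y and test_Y from the listing above:

# Test-set error rate, mirroring the hand-written test() function
errorRate = 100.0 * np.sum(predict_Y.reshape(-1, 1) != test_Y) / len(test_Y)
print("Test set error rate: %.2f%%" % errorRate)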
