Colloquial Machine Learning – Logistic Regression – Theory + Practice
[toc]
An overview
We talked about linear regression, whose model is $y = w^T x + b$: the model's prediction approximates the real label y. Can the model's prediction instead approximate some derived form of the real label y? For example, it could approximate the logarithm of the real label. This is what leads us to logistic regression.
The transformation function
We need a monotone, differentiable function to link the true label y of the classification task to the predicted value of the linear regression model; in other words, we need a transformation function that converts the output of the linear model into the actual predicted label. Consider binary classification, where the output label y belongs to {0, 1}, while the prediction produced by the linear model, $z = w^T x + b$, is a real value. We therefore need to convert this real value into a 0/1 value, and the ideal function for doing so is the unit step function.
Unit step function
The unit step function is shown below. If the predicted value z is greater than zero, the example is judged positive; if it is less than zero, it is judged negative; if it is exactly zero, either class may be assigned arbitrarily.
\begin{equation} y = \begin{cases} 0 & \mbox{if } z < 0 \\ 0.5 & \mbox{if } z = 0 \\ 1 & \mbox{if } z > 0 \end{cases} \end{equation}
sigmoid function
As can be seen, the unit step function is discontinuous, so it cannot be used directly as the transformation function. Here we introduce the sigmoid function instead: $y = \dfrac{1}{1 + e^{-z}}$.
It converts the value z into a y value close to 0 or 1, and its output changes steeply near z = 0. Substituting $z = w^T x + b$, our model becomes $y = \dfrac{1}{1 + e^{-(w^T x + b)}}$.
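As a quick numerical illustration (a minimal sketch, not part of the original derivation; it only assumes NumPy), the squashing behaviour is easy to check:

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# values far from 0 saturate towards 0 or 1; near z = 0 the output changes steeply
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [4.5e-05, 0.269, 0.5, 0.731, 0.99995]
```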
Odds and log-odds
Odds: if y is the probability of a positive example and 1 − y is the probability of a negative example, then the ratio $\dfrac{y}{1-y}$ is called the odds, reflecting the relative likelihood that x is a positive example. It can be obtained from the sigmoid function.
$\ln\dfrac{y}{1-y}$ is called the log-odds (logit).
**It can be seen that $y = \dfrac{1}{1 + e^{-(w^T x + b)}}$ is in fact using the prediction of the linear model to approximate the log-odds of the real label. The corresponding model is therefore called "log-odds regression", i.e. logistic regression.**
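The connection can be made explicit by inverting the sigmoid (a short derivation in the same notation as above):

$$y = \dfrac{1}{1 + e^{-(w^T x + b)}} \;\Rightarrow\; \dfrac{y}{1-y} = e^{w^T x + b} \;\Rightarrow\; \ln\dfrac{y}{1-y} = w^T x + b$$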
The loss function and calculation method are introduced below.
Loss function
$\ln\dfrac{y}{1-y} = w^T x + b$, so
We use maximum likelihood estimation to solve for the parameters. Since this is a binary classification problem, the label follows a 0-1 (Bernoulli) distribution, so $p(y=1|x) = \dfrac{e^{w^T x + b}}{1 + e^{w^T x + b}} = f(x)$ and $p(y=0|x) = \dfrac{1}{1 + e^{w^T x + b}} = 1 - f(x)$.
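Combining the two cases, the likelihood of a single sample can be written compactly as $p(y_i|x_i) = f(x_i)^{y_i}(1-f(x_i))^{1-y_i}$, so the likelihood of the whole training set is:

$$L(w) = \prod_{i=1}^n f(x_i)^{y_i} \, (1 - f(x_i))^{1 - y_i}$$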
The log-likelihood function is:
$$l(w) = \ln L(w) = \sum_{i=1}^n [y_i \ln f(x_i) + (1 - y_i)\ln(1 - f(x_i))]$$
$$= \sum_{i=1}^n \left[y_i \ln\dfrac{f(x_i)}{1 - f(x_i)} + \ln(1 - f(x_i))\right]$$
$$= \sum_{i=1}^n [y_i(w x_i) - \ln(1 + e^{w x_i})]$$
(in the last line the bias b is absorbed into w for brevity).
To maximize this function, we can equivalently negate it and minimize. Both gradient descent and Newton's method, introduced in previous chapters, can be used to solve the problem, so the details are not repeated here.
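Since the code below updates the weights by gradient ascent, it helps to write out the gradient of the log-likelihood explicitly (a short derivation, with b again absorbed into w):

$$\dfrac{\partial l(w)}{\partial w} = \sum_{i=1}^n \left[y_i x_i - \dfrac{e^{w x_i}}{1 + e^{w x_i}} x_i\right] = \sum_{i=1}^n x_i\,(y_i - f(x_i))$$

Each update therefore moves the weights in the direction of the features, weighted by the prediction error $y_i - f(x_i)$; this is exactly what the `error` term in the code computes.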
The code section
Here we talk about how to solve the problem by gradient ascent. First of all, we need to define and analyze the problem. So what kind of information do we need?
- Variables for the input information:
- Samples: including feature and classification markers, positive and negative cases
- Initialization of regression coefficient;
- Step size calculation;
- The loss function has been determined;
- Calculation of gradient of loss function;
- Updating the weights in each iteration using the gradient of the loss function and the step size;
- Iteration stop conditions;
- Input data (identify the above variables)
- Sample information: including features and classification labels (taken from Machine Learning in Action; the data is posted below);
- The coefficients of regression functions are initialized to 1.0;
- For simplicity, set the step size alpha = 0.001;
- The loss function, as described above: $l(w) = \sum_{i=1}^n [y_i(w x_i) - \ln(1 + e^{w x_i})]$;
- Gradient of the loss function: $\dfrac{\partial l(w)}{\partial w} = \sum_{i=1}^n x_i\,(y_i - f(x_i))$;
- Iteration stop condition: maxCycles = 500, 500 iterations.
- code
```python
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # x0 = 1.0 for the bias term
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to NumPy matrix
    labelMat = mat(classLabels).transpose()  # convert to NumPy matrix
    m, n = shape(dataMatrix)
    alpha = 0.001                            # step size
    maxCycles = 500                          # iteration stop condition
    weights = ones((n, 1))                   # coefficients initialized to 1.0
    for k in range(maxCycles):               # heavy on matrix operations
        h = sigmoid(dataMatrix * weights)    # matrix mult
        error = (labelMat - h)               # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
```
- Sample data
```
-0.017612	14.053064	0
-1.395634	4.662541	1
-0.752157	6.538620	0
-1.322371	7.152853	0
0.423363	11.054677	0
0.406704	7.067335	1
0.667394	12.741452	0
-2.460150	6.866805	1
0.569411	9.548755	0
-0.026632	10.427743	0
0.850433	6.920334	1
1.347183	13.175500	0
1.176813	3.167020	1
-1.781871	9.097953	0
-0.566606	5.749003	1
0.931635	1.589505	1
-0.024205	6.151823	1
-0.036453	2.690988	1
-0.196949	0.444165	1
1.014459	5.754399	1
1.985298	3.230619	1
-1.693453	-0.557540	1
0.576525	11.778922	0
-0.346811	-1.678730	1
-2.124484	2.672471	1
1.217916	9.597015	0
-0.733928	9.098687	0
-3.642001	-1.618087	1
0.315985	3.523953	1
1.416614	9.619232	0
-0.386323	3.989286	1
0.556921	8.294984	1
1.224863	11.587360	0
-1.347803	-2.406051	1
1.196604	4.951851	1
0.275221	9.543647	0
0.470575	9.332488	0
-1.889567	9.542662	0
-1.527893	12.150579	0
-1.185247	11.309318	0
-0.445678	3.297303	1
1.042222	6.105155	1
-0.618787	10.320986	0
1.152083	0.548467	1
0.828534	2.676045	1
-1.237728	10.549033	0
-0.683565	-2.166125	1
0.229456	5.921938	1
-0.959885	11.555336	0
0.492911	10.993324	0
0.184992	8.721488	0
0.355715	10.325976	0
0.397822	8.058397	0
0.824839	13.730343	0
1.507278	5.027866	1
0.099671	6.835839	1
-0.344008	10.717485	0
1.785928	7.718645	1
-0.918801	11.560217	0
-0.364009	4.747300	1
-0.841722	4.119083	1
0.490426	1.960539	1
-0.007194	9.075792	0
0.356107	12.447863	0
0.342578	12.281162	0
-0.810823	-1.466018	1
2.530777	6.476801	1
1.296683	11.607559	0
0.475487	12.040035	0
-0.783277	11.009725	0
0.074798	11.023650	0
-1.337472	0.468339	1
-0.102781	13.763651	0
-0.147324	2.874846	1
0.518389	9.887035	0
1.015399	7.571882	0
-1.658086	-0.027255	1
1.319944	2.171228	1
2.056216	5.019981	1
-0.851633	4.375691	1
-1.510047	6.061992	0
-1.076637	-3.181888	1
1.821096	10.283990	0
3.010150	8.401766	1
-1.099458	1.688274	1
-0.834872	-1.733869	1
-0.846637	3.849075	1
1.400102	12.628781	0
1.752842	5.468166	1
0.078557	0.059736	1
0.089392	-0.715300	1
1.825662	12.693808	0
0.197445	9.744638	0
0.126117	0.922311	1
-0.679797	1.220530	1
0.677983	2.556666	1
0.761349	10.693862	0
-2.168791	0.143632	1
1.388610	9.341997	0
0.317029	14.739025	0
```
- Get the results
```
>>> import logRegres
>>> param_mat, label_mat = logRegres.loadDataSet()
>>> logRegres.gradAscent(param_mat, label_mat)
matrix([[ 4.12414349],
        [ 0.48007329],
        [-0.6168482 ]])
>>>
```
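With the fitted weights, classification simply applies the 0.5 threshold from the sigmoid section. A minimal sketch (the `classify` helper below is hypothetical and not part of the original listing):

```python
from numpy import mat, exp

def classify(inX, weights):
    # inX: feature vector [1.0, x1, x2]; weights: column vector returned by gradAscent
    z = float(mat(inX) * weights)   # w^T x
    prob = 1.0 / (1 + exp(-z))      # sigmoid
    return 1 if prob > 0.5 else 0
```

For example, with the weights above, `classify([1.0, -0.017612, 14.053064], weights)` should return 0, which matches the label of the first sample.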
In gradAscent above, the full set of samples is used in every iteration, which is computationally expensive on large data sets. Instead, a randomly selected sample can be used for each update, which is the idea behind stochastic gradient ascent, sketched below.
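One possible sketch of that idea (following the same NumPy style as the listing above; the function name, the decaying step size, and the single-pass sampling are one reasonable set of choices, not the only ones):

```python
from numpy import array, exp, ones, random

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def stocGradAscent(dataMatIn, classLabels, numIter=150):
    dataArr = array(dataMatIn)
    m, n = dataArr.shape
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01                    # step size shrinks as training proceeds
            randIndex = int(random.uniform(0, len(dataIndex)))  # pick one sample at random
            h = sigmoid(sum(dataArr[dataIndex[randIndex]] * weights))
            error = classLabels[dataIndex[randIndex]] - h
            weights = weights + alpha * error * dataArr[dataIndex[randIndex]]
            del dataIndex[randIndex]                            # use each sample once per pass
    return weights
```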
Reference
- Machine learning