1. Linear Regression
- All the code
1. Cost function
- The cost function is $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
- where the hypothesis is $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (with $x_0 = 1$)
- The next step is to find the $\theta$ that minimizes the cost, which means the equation we fit is closest to the true values
- There are m training examples; $\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ is the squared distance between the fitted equation and the true value. We square it because the residuals may be negative, and positive and negative terms could otherwise cancel out
- The factor 2 in the denominator is there because when we take the gradient with respect to each variable, the 2 coming from the square cancels it
- Implementation code:
```python
# Compute the cost function
def computerCost(X,y,theta):
    m = len(y)
    J = 0
    J = (np.transpose(X*theta-y))*(X*theta-y)/(2*m)   # X and theta are np.matrix objects, so * is matrix multiplication
    return J
```
- Note that X here is the real data with a column of ones prepended, because of the intercept term theta(0).
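As a minimal sketch of that setup (the data values here are made up for illustration, not taken from the exercise), the bias column can be prepended like this:

```python
import numpy as np

X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0]])              # illustrative raw feature matrix (m x n)
m = X_raw.shape[0]
X = np.hstack((np.ones((m, 1)), X_raw))        # prepend a column of ones so theta[0] acts as the intercept
```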
2. Gradient descent algorithm
- Taking the partial derivative of the cost function gives: $\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- So the update to theta can be written as: $\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- where $\alpha$ is the learning rate, which controls the speed of gradient descent; typical values are 0.01, 0.03, 0.1, 0.3, ...
- Why does gradient descent gradually reduce the cost function?
- Consider a function $f(x)$ and its Taylor expansion: $f(x+\Delta x) = f(x) + f'(x)\,\Delta x + o(\Delta x)$
- Let $\Delta x = -\alpha f'(x)$, i.e. step in the direction of the negative gradient with a very small step size $\alpha$
- Substituting $\Delta x$ into the Taylor expansion gives $f(x+\Delta x) = f(x) - \alpha\,[f'(x)]^2 + o(\Delta x)$
- Since $\alpha$ is a small positive number and $[f'(x)]^2$ is non-negative, it follows that $f(x+\Delta x) \le f(x)$
- So moving along the negative gradient makes the function decrease; the same holds in multiple dimensions.
- Implementation code:

```python
# Gradient descent algorithm
def gradientDescent(X,y,theta,alpha,num_iters):
    m = len(y)
    n = len(theta)
    temp = np.matrix(np.zeros((n,num_iters)))   # temporarily store theta from each iteration
    J_history = np.zeros((num_iters,1))         # record the cost of each iteration
    for i in range(num_iters):                  # loop over the iterations
        h = np.dot(X,theta)                     # inner product; matrices can be multiplied directly
        temp[:,i] = theta - ((alpha/m)*(np.dot(np.transpose(X),h-y)))   # gradient step
        theta = temp[:,i]
        J_history[i] = computerCost(X,y,theta)  # record the cost of this iteration
        # print('.')
    return theta,J_history
```
3. Mean normalization
- The goal is to scale the data into a comparable range so that the gradient descent algorithm works well: $x_i := \dfrac{x_i - \mu_i}{s_i}$
- where $\mu_i$ is the mean of all the data for this feature
- and $s_i$ can be the maximum minus the minimum, or the standard deviation of the data for this feature
- Implementation code:
```python
# Normalize the features
def featureNormaliza(X):
    X_norm = np.array(X)            # convert X to a numpy array so matrix operations work
    mu = np.zeros((1,X.shape[1]))
    sigma = np.zeros((1,X.shape[1]))
    mu = np.mean(X_norm,0)          # mean of each column (0 = columns, 1 = rows)
    sigma = np.std(X_norm,0)        # standard deviation of each column
    for i in range(X.shape[1]):     # loop over the columns
        X_norm[:,i] = (X_norm[:,i]-mu[i])/sigma[i]  # normalize
    return X_norm,mu,sigma
```
- Note that data used for prediction must also be mean-normalized, using the same mu and sigma as the training data.
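A minimal sketch of that step: mu, sigma and theta are assumed to come from featureNormaliza and gradientDescent above, and the values below are stand-ins so the sketch runs on its own.

```python
import numpy as np

# stand-in values; in practice reuse the mu, sigma and theta produced above
mu, sigma = np.array([2000.0, 3.0]), np.array([500.0, 1.0])
theta = np.matrix([[340000.0], [110000.0], [-6000.0]])

x_new = np.array([1650.0, 3.0])                    # hypothetical new example
x_new_norm = (x_new - mu) / sigma                  # normalize with the training-set statistics
x_new_norm = np.hstack((np.ones(1), x_new_norm))   # prepend the bias term
prediction = np.dot(x_new_norm, theta)             # predict with the learned theta
```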
4. Final running results
- The cost changes with the number of iterations
5. Using the linear model in the scikit-learn library
- Import packages
```python
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler  # for mean normalization
```
- Normalization
```python
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
x_test = scaler.transform(np.array([1650]).reshape(1, -1))  # new example to predict (must have the same number of features as X)
```
- Linear model fitting
```python
model = linear_model.LinearRegression()
model.fit(x_train, y)
```
- Prediction

```python
result = model.predict(x_test)
```
2. Logistic Regression
- All the code
1. Cost function
- The cost function is $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$
- The two cases can be summed up as: $\mathrm{cost}(h_\theta(x),y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$
- Why not use the cost function of linear regression? Because with the sigmoid hypothesis it may be non-convex, and for a classification problem gradient descent would then struggle to reach the minimum; the cost function above is convex
- The plot of $-\log(h_\theta(x))$ (the y=1 case) is shown below. You can see that when $h_\theta(x)$ tends to 1 and y=1, the prediction matches the label and the cost tends to 0; if $h_\theta(x)$ tends to 0 while y=1, the cost is very large. Our ultimate goal is to minimize the cost
- Similarly, the plot of $-\log(1-h_\theta(x))$, shown below, covers the y=0 case
2. Gradient
- Taking the partial derivative of the cost function gives $\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$; it can be seen that this has the same form as the partial derivative in linear regression
- Derivation:
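The derivation image is not reproduced here; as a sketch, the standard chain-rule derivation, using the sigmoid property $g'(z) = g(z)(1-g(z))$, goes:

$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_\theta(x^{(i)})}-\frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right]h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(1-h_\theta(x^{(i)})\right)-(1-y^{(i)})\,h_\theta(x^{(i)})\right]x_j^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}
\end{aligned}
$$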
3. Regularization
- The goal is to prevent overfitting
- Add a regularization term to the cost function: $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
- Note that j starts from 1, because theta(0) is the intercept term: a column of ones is prepended to X, so the corresponding product is just theta(0). It is not tied to any feature, so there is no need to regularize it
- Cost after regularization:
```python
# Regularized cost function
def costFunction(initial_theta,X,y,inital_lambda):
    m = len(y)
    J = 0

    h = sigmoid(np.dot(X,initial_theta))    # compute h(z)
    theta1 = initial_theta.copy()           # copy theta because regularization starts at j=1; set theta(0) to 0
    theta1[0] = 0

    temp = np.dot(np.transpose(theta1),theta1)
    J = (-np.dot(np.transpose(y),np.log(h))-np.dot(np.transpose(1-y),np.log(1-h))+temp*inital_lambda/2)/m   # regularized cost
    return J
```
- The regularized gradient of the cost
```python
# Regularized gradient of the cost
def gradient(initial_theta,X,y,inital_lambda):
    m = len(y)
    grad = np.zeros((initial_theta.shape[0]))

    h = sigmoid(np.dot(X,initial_theta))    # compute h(z)
    theta1 = initial_theta.copy()
    theta1[0] = 0

    grad = np.dot(np.transpose(X),h-y)/m+inital_lambda/m*theta1  # regularized gradient
    return grad
```
4. Sigmoid function, i.e. $g(z) = \frac{1}{1+e^{-z}}$
- Implementation code:
```python
# Sigmoid function
def sigmoid(z):
    h = np.zeros((len(z),1))    # initialize with the same length as z
    h = 1.0/(1.0+np.exp(-z))
    return h
```
5. Mapping features to polynomial terms
- Because the data may have few features, which leads to high bias, some feature combinations are created
- E.g. mapping to second-degree form: $1 + x_1 + x_2 + x_1^2 + x_1x_2 + x_2^2$
- Implementation code:
```python
# Map the features to polynomial terms
def mapFeature(X1,X2):
    degree = 3                      # highest power of the mapping
    out = np.ones((X1.shape[0],1))  # array holding the mapped result (it replaces X)
    for i in np.arange(1,degree+1):
        for j in range(i+1):
            temp = X1**(i-j)*(X2**j)    # element-wise multiplication, like .* in MATLAB
            out = np.hstack((out, temp.reshape(-1,1)))
    return out
```
6. Using the optimization methods in scipy
- For gradient descent we use the fmin_bfgs function from scipy.optimize
- It calls scipy's optimization algorithm fmin_bfgs (the quasi-Newton method Broyden-Fletcher-Goldfarb-Shanno)
- costFunction is the self-implemented cost function
- initial_theta is the initial value of theta
- fprime specifies the gradient of costFunction
- args holds the remaining parameters, passed in as a tuple; the call finally returns the theta that minimizes costFunction
```python
result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,y,initial_lambda))
```
7. Running results
- Data1 decision boundaries and accuracy
- Data2 decision boundaries and accuracy
8. Using the logistic regression model in the scikit-learn library
- Import packages
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split  # in newer scikit-learn versions: from sklearn.model_selection import train_test_split
import numpy as np
```
- Divide training set and test set
```python
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
```
- Normalization
```python
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)   # note: scaler.transform(x_test) would reuse the training-set statistics instead of refitting
```
- Logistic regression
```python
model = LogisticRegression()
model.fit(x_train,y_train)
```
- Prediction

```python
# prediction
predict = model.predict(x_test)
right = sum(predict == y_test)

predict = np.hstack((predict.reshape(-1,1),y_test.reshape(-1,1)))   # put predictions and true labels side by side for inspection
print(predict)
print('Test set accuracy: %f%%'%(right*100.0/predict.shape[0]))     # accuracy on the test set
```
Logistic Regression: Handwritten Digit Recognition (OneVsAll)
- All the code
1. Randomly display 100 digits
- The data set from scikit-learn is not used here; each digit is a 20*20px image. The color image and the grayscale image are shown below:
- Implementation code:
```python
# Display 100 digits
def display_data(imgData):
    sum = 0
    '''
    Display 100 digits (drawing them one by one would be very slow; instead arrange the
    digits to be drawn into one matrix and display that matrix)
    - initialize a two-dimensional array
    - reshape each row of data into the image matrix and place it into the array
    - display the array
    '''
    pad = 1                                       # padding between digits (assumed to be 1, as in the original project)
    display_array = -np.ones((pad+10*(20+pad),pad+10*(20+pad)))
    for i in range(10):
        for j in range(10):
            display_array[pad+i*(20+pad):pad+i*(20+pad)+20,pad+j*(20+pad):pad+j*(20+pad)+20] = (imgData[sum,:].reshape(20,20,order="F"))    # order="F" means column-major, as in matlab; python defaults to row-major
            sum += 1
    plt.imshow(display_array,cmap='gray')   # show the grayscale image
    plt.axis('off')
    plt.show()
```
2. OneVsAll
- How can logistic regression solve a multi-class problem? OneVsAll treats one category as the positive class and all the other categories together as the negative class, which turns it into a binary classification problem
- As shown in the figure below, the data is divided into three categories: first treat the red points as one class and everything else as the other class and run logistic regression; then treat the blue points as one class and everything else as the other class, and so on…
- So when the number of classes is greater than 2, logistic regression has to be run as many times as there are classes
3. Handwritten digit recognition
- There are 10 digits, 0-9, so 10 rounds of classification are needed
- The data set's y is given as the digits 0...9, while logistic regression needs 0/1 labels, so y has to be processed
- About the data set: the first 500 examples are 0, examples 500-1000 are 1, and so on. After processing, as shown below, the first 500 rows of y have a 1 in the first column and 0 elsewhere, rows 500-1000 have a 1 in the second column and 0 elsewhere, and so on
- Then the gradient descent algorithm is called to solve for theta
- Implementation code:
```python
# Compute theta for each class; return all_theta containing all of them
def oneVsAll(X,y,num_labels,Lambda):
    # initialize variables
    m,n = X.shape
    all_theta = np.zeros((n+1,num_labels))  # each column holds the theta of one class (10 columns)
    X = np.hstack((np.ones((m,1)),X))       # prepend a column of ones (bias) to X
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    initial_theta = np.zeros((n+1,1))       # initialize theta for one class

    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    #np.savetxt("class_y.csv", class_y[0:600,:], delimiter=',')

    '''loop over the classes and compute the corresponding theta'''
    for i in range(num_labels):
        result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,class_y[:,i],Lambda))  # call the gradient-based optimizer
        all_theta[:,i] = result.reshape(1,-1)   # store the result in all_theta

    all_theta = np.transpose(all_theta)
    return all_theta
```
4. Prediction
- As mentioned before, the predicted result is a probability value. Plugging the learned theta into the sigmoid function, the maximum value in each row is the highest probability over the digits, and the column where it occurs is the predicted digit. This is because during classification all examples labeled 0 were mapped to the first column of y, those labeled 1 to the second column, and so on
- Implementation code:
```python
# Prediction
def predict_oneVsAll(all_theta,X):
    m = X.shape[0]
    num_labels = all_theta.shape[0]
    p = np.zeros((m,1))
    X = np.hstack((np.ones((m,1)),X))   # prepend a column of ones to X

    h = sigmoid(np.dot(X,np.transpose(all_theta)))  # prediction

    '''
    Return the column index of the maximum value in each row of h
    - np.max(h, axis=1) returns the maximum of each row (the highest probability over the digits)
    - np.where then finds the column of that maximum (the column index is the digit)
    '''
    p = np.array(np.where(h[0,:] == np.max(h, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h[i,:] == np.max(h, axis=1)[i]))
        p = np.vstack((p,t))
    return p
```
5. Running results
- After 10 rounds of classification, the accuracy on the training set:
6. Using the logistic regression model in the scikit-learn library
- 1. Import the package
```python
from scipy import io as spio
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
```
- 2. Load data
```python
data = loadmat_data("data_digits.mat")  # helper that loads the .mat file
X = data['X']   # each row is one 20x20px digit
y = data['y']   # shape=(5000, 1) when read from the mat file
y = np.ravel(y) # scikit-learn expects a one-dimensional (5000,) array
```
- 3. Fitting model
```python
model = LogisticRegression()
model.fit(X, y)  # fit the model
```
- 4. Prediction

```python
predict = model.predict(X)  # prediction
print("Prediction accuracy: %f%%"%np.mean(np.float64(predict == y)*100))
```
- 5. Output results (accuracy on training set)
3. BP Neural Network
1. Neural network Model
- First, consider a three-layer neural network, as shown in the figure below
- The input layer has three units ($x_0$ is the padded bias, usually set to 1)
- $a_i^{(j)}$ denotes the i-th activation (also called a unit) of layer j
- $\Theta^{(j)}$ is the weight matrix mapping layer j to layer j+1, i.e. the weight of each edge
- So we can get:
- Hidden layer: $a^{(2)} = g(\Theta^{(1)} x)$
- Output layer: $h_\Theta(x) = g(\Theta^{(2)} a^{(2)})$, where the sigmoid function $g$ is also called the activation function
- It can be seen that $\Theta^{(1)}$ is a 3x4 matrix and $\Theta^{(2)}$ is a 1x4 matrix
- In general, $\Theta^{(j)}$ has size (number of units in layer j+1) x (number of units in layer j, plus 1)
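As a quick sanity check on those dimensions, here is a small sketch with made-up layer sizes (not code from the original project):

```python
import numpy as np

n_input, n_hidden, n_output = 3, 3, 1        # hypothetical 3-layer network

Theta1 = np.zeros((n_hidden, n_input + 1))   # maps layer 1 -> layer 2, shape (3, 4)
Theta2 = np.zeros((n_output, n_hidden + 1))  # maps layer 2 -> layer 3, shape (1, 4)

x = np.ones(n_input + 1)                     # one example with the bias x0 = 1 prepended
a2 = 1.0 / (1.0 + np.exp(-Theta1.dot(x)))    # hidden-layer activations
a2 = np.hstack(([1.0], a2))                  # prepend the bias unit
h = 1.0 / (1.0 + np.exp(-Theta2.dot(a2)))    # output h_theta(x)
print(Theta1.shape, Theta2.shape, h.shape)   # (3, 4) (1, 4) (1,)
```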
2. Cost function
- Assume the final output $h_\Theta(x)\in\mathbb{R}^K$, i.e. the output layer has K units
- where $(h_\Theta(x))_i$ denotes the output of the i-th unit
- The cost function is similar to that of logistic regression, except that it sums over each of the K outputs: $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + (1-y_k^{(i)})\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right]$
3. Regularization
- L --> the total number of layers
- $s_l$ --> the number of units in layer l
- The regularized cost function adds the term $\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$
- There are L-1 theta matrices in total; we sum the squared entries of the theta matrix of each layer, not including the theta terms corresponding to the bias units
- Regularized cost function implementation code:
```python
# Cost function
def nnCostFunction(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):
    length = nn_params.shape[0]
    # restore Theta1 and Theta2 from the unrolled parameter vector
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1)
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1)

    # np.savetxt("Theta1.csv",Theta1,delimiter=',')

    m = X.shape[0]
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, since regularization starts at 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:,1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:,1:Theta2_colCount]
    # regularization term theta^2
    term = np.dot(np.transpose(np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1)))),np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1))))

    '''forward propagation; a bias column of ones is added at each layer'''
    a1 = np.hstack((np.ones((m,1)),X))
    z2 = np.dot(a1,np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m,1)),a2))
    z3 = np.dot(a2,np.transpose(Theta2))
    h  = sigmoid(z3)
    '''cost'''
    J = -(np.dot(np.transpose(class_y.reshape(-1,1)),np.log(h.reshape(-1,1)))+np.dot(np.transpose(1-class_y.reshape(-1,1)),np.log(1-h.reshape(-1,1)))-Lambda*term/2)/m
    return np.ravel(J)
```
4. Back propagation (BP)
- The forward propagation above gives $J(\Theta)$; to use gradient descent we still need its gradient
- The purpose of BP (back propagation) is to compute the gradient of the cost function
- Suppose a four-layer neural network, and write $\delta_j^{(l)}$ --> the error of the j-th unit in layer l
- $\delta^{(4)} = a^{(4)} - y$, i.e. $\delta_j^{(4)} = a_j^{(4)} - y_j$ (vectorized)
- $\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \odot g'(z^{(3)})$, $\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \odot g'(z^{(2)})$ (element-wise product)
- There is no $\delta^{(1)}$, because there is no error on the input
- Since the derivative of the sigmoid function is $g'(z) = g(z)\,(1-g(z))$, the $g'(z^{(3)})$ and $g'(z^{(2)})$ above can already be computed during forward propagation
- The process of computing the gradient by back propagation is:
- $\Delta_{ij}^{(l)} = 0$ ($\Delta$ is the capital of $\delta$)
- for i = 1..m:
  - forward propagation to compute $a^{(l)}$ for l = 2, 3, ..., L
  - backward computation of $\delta^{(L)}, \delta^{(L-1)}, \dots, \delta^{(2)}$
  - $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\,\delta_i^{(l+1)}$
- $D_{ij}^{(l)} = \frac{1}{m}\left(\Delta_{ij}^{(l)} + \lambda\,\Theta_{ij}^{(l)}\right)$ for $j \neq 0$, and $D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}$ for $j = 0$
- Finally $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = D_{ij}^{(l)}$, i.e. the gradient of the cost function is obtained
- Implementation code:
```python
# Gradient
def nnGradient(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):
    length = nn_params.shape[0]
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1).copy()   # copy() is used here, otherwise modifying Theta below would also modify nn_params
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1).copy()
    m = X.shape[0]
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, since regularization starts at 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:,1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:,1:Theta2_colCount]

    Theta1_grad = np.zeros((Theta1.shape))  # gradient of the weights from layer 1 to layer 2
    Theta2_grad = np.zeros((Theta2.shape))  # gradient of the weights from layer 2 to layer 3

    '''forward propagation; a bias column of ones is added at each layer'''
    a1 = np.hstack((np.ones((m,1)),X))
    z2 = np.dot(a1,np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m,1)),a2))
    z3 = np.dot(a2,np.transpose(Theta2))
    h  = sigmoid(z3)

    '''back propagation; delta is the error'''
    delta3 = np.zeros((m,num_labels))
    delta2 = np.zeros((m,hidden_layer_size))
    for i in range(m):
        #delta3[i,:] = (h[i,:]-class_y[i,:])*sigmoidGradient(z3[i,:])  # error for a mean-squared-error cost
        delta3[i,:] = h[i,:]-class_y[i,:]                              # error for the cross-entropy cost
        Theta2_grad = Theta2_grad+np.dot(np.transpose(delta3[i,:].reshape(1,-1)),a2[i,:].reshape(1,-1))
        delta2[i,:] = np.dot(delta3[i,:].reshape(1,-1),Theta2_x)*sigmoidGradient(z2[i,:])
        Theta1_grad = Theta1_grad+np.dot(np.transpose(delta2[i,:].reshape(1,-1)),a1[i,:].reshape(1,-1))

    Theta1[:,0] = 0
    Theta2[:,0] = 0
    '''gradient'''
    grad = (np.vstack((Theta1_grad.reshape(-1,1),Theta2_grad.reshape(-1,1)))+Lambda*np.vstack((Theta1.reshape(-1,1),Theta2.reshape(-1,1))))/m
    return np.ravel(grad)
```
5. Why BP can compute the gradient
- It actually uses the chain rule of differentiation
- Because the units of the next layer are computed using the units of the previous layer as input
- The general idea of the derivation is as follows: ultimately we want the prediction function to be very close to the known y, and moving along the gradient of the error minimizes the cost function. Compare with the gradient computation above.
- A more detailed derivation of the error term:
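The detailed derivation image is not included here; a sketch of the chain-rule argument in the notation above is:

$$\delta_j^{(l)} \equiv \frac{\partial J}{\partial z_j^{(l)}} = \left(\sum_k \Theta_{kj}^{(l)}\,\delta_k^{(l+1)}\right) g'\!\left(z_j^{(l)}\right), \qquad \frac{\partial J}{\partial \Theta_{ij}^{(l)}} = \frac{\partial J}{\partial z_i^{(l+1)}}\,\frac{\partial z_i^{(l+1)}}{\partial \Theta_{ij}^{(l)}} = \delta_i^{(l+1)}\,a_j^{(l)},$$

since $z_i^{(l+1)} = \sum_j \Theta_{ij}^{(l)} a_j^{(l)}$; this is exactly the accumulation step $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ used above.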
6. Gradient check
- Check whether the gradient obtained with BP is correct
- Verify using the definition of the derivative: $\dfrac{\partial J(\theta)}{\partial\theta_j} \approx \dfrac{J(\theta_1,\dots,\theta_j+\varepsilon,\dots) - J(\theta_1,\dots,\theta_j-\varepsilon,\dots)}{2\varepsilon}$
- The numerical gradient computed this way should be very close to the gradient from BP
- Once BP has been verified to be correct, there is no need to run the gradient-check algorithm again
- Implementation code:
```python
# Check whether the gradient is computed correctly
def checkGradient(Lambda = 0):
    '''Build a small neural network for the check, because numerical gradients are slow
    to compute, and once BP is verified the check is no longer needed'''
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    initial_Theta1 = debugInitializeWeights(input_layer_size,hidden_layer_size)
    initial_Theta2 = debugInitializeWeights(hidden_layer_size,num_labels)
    X = debugInitializeWeights(input_layer_size-1,m)
    y = 1+np.transpose(np.mod(np.arange(1,m+1), num_labels))   # initialize y
    y = y.reshape(-1,1)
    nn_params = np.vstack((initial_Theta1.reshape(-1,1),initial_Theta2.reshape(-1,1)))  # unroll theta

    '''gradient from BP'''
    grad = nnGradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)

    '''gradient from the numerical method'''
    num_grad = np.zeros((nn_params.shape[0]))
    step = np.zeros((nn_params.shape[0]))
    e = 1e-4
    for i in range(nn_params.shape[0]):
        step[i] = e
        loss1 = nnCostFunction(nn_params-step.reshape(-1,1), input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)
        loss2 = nnCostFunction(nn_params+step.reshape(-1,1), input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)
        num_grad[i] = (loss2-loss1)/(2*e)
        step[i] = 0
    # show the two gradients side by side for comparison
    res = np.hstack((num_grad.reshape(-1,1),grad.reshape(-1,1)))
    print(res)
```
7. Random initialization of weights
- A neural network cannot initialize theta to 0 the way logistic regression does, because if the weight of every edge is 0, every neuron in a layer produces the same output and receives the same gradient during back propagation, so the network can only ever predict one result
- So theta should be initialized to small random values close to zero
- The implementation code
```python
# Randomly initialize the weights theta
def randInitializeWeights(L_in,L_out):
    W = np.zeros((L_out,1+L_in))    # weights corresponding to theta
    epsilon_init = (6.0/(L_out+L_in))**0.5
    W = np.random.rand(L_out,1+L_in)*2*epsilon_init-epsilon_init  # np.random.rand(L_out,1+L_in) produces a random matrix of size L_out x (1+L_in)
    return W
```
8. Prediction
- The result is predicted by forward propagation
- The implementation code
```python
# Prediction
def predict(Theta1,Theta2,X):
    m = X.shape[0]
    num_labels = Theta2.shape[0]
    #p = np.zeros((m,1))
    '''forward propagation to predict the result'''
    X = np.hstack((np.ones((m,1)),X))
    h1 = sigmoid(np.dot(X,np.transpose(Theta1)))
    h1 = np.hstack((np.ones((m,1)),h1))
    h2 = sigmoid(np.dot(h1,np.transpose(Theta2)))

    '''
    Return the column index of the maximum value of each row in h2
    - np.max(h2, axis=1) returns the maximum of each row (the highest probability over the digits)
    - np.where then finds the column of that maximum (the column index is the digit)
    '''
    #np.savetxt("h2.csv",h2,delimiter=',')
    p = np.array(np.where(h2[0,:] == np.max(h2, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h2[i,:] == np.max(h2, axis=1)[i]))
        p = np.vstack((p,t))
    return p
```
9. Output results
- Gradient check:
- Random display of 100 handwritten digits
- Show theta1 weight
- Training set prediction accuracy
- Prediction accuracy of training set after normalization
4. SVM (Support Vector Machine)
1. Cost function
- In logistic regression, the cost for one example is $-y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$, where $h_\theta(x) = \frac{1}{1+e^{-z}}$ and $z = \theta^T x$
- As shown in the figure, if y=1 the cost is $-\log\frac{1}{1+e^{-z}}$. We want $z = \theta^T x \gg 0$ so that the cost tends to its minimum (which is what we want), so we replace the logistic regression cost with the red piecewise-linear function $\mathrm{cost}_1(z)$
- When y=0 we similarly replace it with $\mathrm{cost}_0(z)$
- The resulting cost function is $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$, and finally we want $\min_\theta J(\theta)$
- Compared with the regularized cost function of logistic regression, you can think of C here as playing the role of $\frac{1}{\lambda}$; it is only a difference of form. The larger the value of C, the larger the margin of the SVM decision boundary, as explained below
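A minimal sketch of those two surrogate costs in the usual hinge form (the exact slope is a modeling choice; this is not code from the original project):

```python
import numpy as np

def cost1(z):
    # surrogate cost used when y = 1: zero once z >= 1, linear penalty otherwise
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # surrogate cost used when y = 0: zero once z <= -1, linear penalty otherwise
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost1(z))   # falls to 0 as z grows past 1
print(cost0(z))   # falls to 0 as z drops below -1
```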
2. Large Margin
- As shown in the figure below, SVM classification separates the classes with the largest possible margin
- First, a word about the dot product: for $u = \begin{bmatrix}u_1\\u_2\end{bmatrix}$ and $v = \begin{bmatrix}v_1\\v_2\end{bmatrix}$, $\lVert u\rVert$ denotes the Euclidean norm of u, $\lVert u\rVert = \sqrt{u_1^2+u_2^2}$
- Let p be the length of the projection of vector v onto vector u; then the inner product is $u^T v = p\,\lVert u\rVert = u_1v_1 + u_2v_2$, which follows from the vector angle formula
- As mentioned before, the larger C is, the larger the margin. Our goal is to minimize the cost function $J(\theta)$; when the margin is largest, the term multiplied by C must be very small (approximately 0), so approximately $J(\theta) = C\cdot 0 + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\lVert\theta\rVert^2$, and our final goal is to find the $\theta$ that minimizes this cost
- From $\theta^T x^{(i)} = p^{(i)}\,\lVert\theta\rVert \ge 1$ (or $\le -1$), where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$
- As shown in the figure below, suppose the decision boundary is as drawn. Take one of the points and let p be its projection onto $\theta$, so that $p\,\lVert\theta\rVert \ge 1$ or $p\,\lVert\theta\rVert \le -1$. If p is small, $\lVert\theta\rVert$ has to be large, which conflicts with the $\theta$ we seek (the one minimizing $\frac{1}{2}\lVert\theta\rVert^2$); so what we end up with is a large margin
3. SVM Kernel
- For linearly separable problems, the linear kernel is enough
- For problems that are not linearly separable: in logistic regression we mapped the features to polynomial form; SVM also has a polynomial kernel, but more commonly the Gaussian kernel (also called the RBF kernel) is used
- The Gaussian kernel is $f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\left(-\dfrac{\lVert x - l^{(i)}\rVert^2}{2\sigma^2}\right)$, where the $l^{(i)}$ are landmark points
- You can see that if $x$ is close to $l^{(i)}$, then $f_i \approx e^{0} = 1$ (the similarity is large); if $x$ is far from $l^{(i)}$, then $f_i \approx e^{-\infty} = 0$ (the similarity is low)
- The smaller the $\sigma$ of the Gaussian kernel, the faster $f$ falls off
- How do you choose the initial landmarks?
- Training set: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(m)},y^{(m)})$
- Choose: $l^{(1)} = x^{(1)},\; l^{(2)} = x^{(2)},\; \dots,\; l^{(m)} = x^{(m)}$
- For a given $x$, compute the feature vector $f = (f_0, f_1, \dots, f_m)$ with $f_0 = 1$
- Minimize $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^T f^{(i)})\right] + \frac{1}{2}\sum_{j}\theta_j^2$ to obtain $\theta$
- If $\theta^T f \ge 0$, predict y=1
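A small sketch of the Gaussian-kernel similarity described above (illustrative values, not the project's code):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # similarity between a sample x and a landmark l under the RBF kernel
    return np.exp(-np.sum((x - l)**2) / (2.0 * sigma**2))

x = np.array([1.0, 2.0])
landmark = np.array([1.0, 2.5])
print(gaussian_kernel(x, landmark, sigma=1.0))   # close to 1: x is near the landmark
print(gaussian_kernel(x, landmark, sigma=0.1))   # same distance, smaller sigma: similarity falls off much faster
```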
4. Using the SVM model in scikit-learn
- All the code
- Linearly separable case, specifying the kernel function as linear:
```python
data1 = spio.loadmat('data1.mat')
X = data1['X']
y = data1['y']
y = np.ravel(y)
plot_data(X,y)

model = svm.SVC(C=1.0,kernel='linear').fit(X,y)  # specify the linear kernel
```
- Non-linearly separable case; the default kernel function is rbf:
```python
data2 = spio.loadmat('data2.mat')
X = data2['X']
y = data2['y']
y = np.ravel(y)
plt = plot_data(X,y)
plt.show()

model = svm.SVC(gamma=100).fit(X,y)     # gamma is the kernel coefficient; the larger it is, the tighter the fit
```
5. Running results
- Linearly separable decision boundary:
- Non-linearly separable decision boundary:
5. K-means clustering algorithm
- All the code
1. Clustering process
- Clustering is unsupervised learning: the y labels are unknown, and the data is to be divided into K classes
- The K-means algorithm has two steps
- Step 1, cluster assignment: randomly select K points as centers, compute the distance of each point to these K centers, and assign the points to K clusters
- Step 2, move the cluster centers: recompute the center of each cluster, move the centers, and repeat the steps above
- As shown below:
- Recompute the cluster centers and move them once
- The cluster centers after the last 10 steps
- Implementation code for finding which center each data point is closest to:
```python
# Find which class center each data point is closest to
def findClosestCentroids(X,initial_centroids):
    m = X.shape[0]                  # number of data points
    K = initial_centroids.shape[0]  # number of classes
    dis = np.zeros((m,K))           # distance from each point to each of the K centers
    idx = np.zeros((m,1))           # which class each data point belongs to (returned)

    '''compute the distance from every point to every class center'''
    for i in range(m):
        for j in range(K):
            dis[i,j] = np.dot((X[i,:]-initial_centroids[j,:]).reshape(1,-1),(X[i,:]-initial_centroids[j,:]).reshape(-1,1))

    '''
    Return the column index of the minimum of each row of dis, i.e. the assigned class
    - np.min(dis, axis=1) returns the minimum of each row
    - np.where(dis == np.min(dis, axis=1).reshape(-1,1)) returns the coordinates of those minima
    - note: a minimum may occur at several coordinates and where finds them all, so only the
      first m entries are returned (for ties, any of the classes is acceptable)
    '''
    dummy,idx = np.where(dis == np.min(dis, axis=1).reshape(-1,1))
    return idx[0:dis.shape[0]]  # truncate to the first m entries
```
- Computing class center implementation code:
```python
# Compute the class centers
def computerCentroids(X,idx,K):
    n = X.shape[1]
    centroids = np.zeros((K,n))
    for i in range(K):
        centroids[i,:] = np.mean(X[np.ravel(idx==i),:], axis=0).reshape(1,-1)   # the index must be one-dimensional; axis=0 averages each column over the points assigned to class i
    return centroids
```
2. Objective function
- Also called the distortion cost function
- Ultimately we want to minimize $J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2$
- where $c^{(i)}$ indicates which class center the i-th data point is closest to
- and $\mu_k$ are the cluster centers
3. Selection of clustering centers
- Random initialization: randomly pick K points from the given data as the cluster centers
- This can be done multiple times, keeping the initialization that minimizes the cost function
- Implementation code (random selection):
```python
# Initialize the class centers -- randomly pick K points as cluster centers
def kMeansInitCentroids(X,K):
    m = X.shape[0]
    m_arr = np.arange(0,m)      # generate 0..m-1
    centroids = np.zeros((K,X.shape[1]))
    np.random.shuffle(m_arr)    # shuffle m_arr
    rand_indices = m_arr[:K]    # take the first K indices
    centroids = X[rand_indices,:]
    return centroids
```
4. Selection of clustering number K
- Clustering does not know the y labels, so the true number of clusters is unknown
- The Elbow method
- Plot the cost function J against K; if there is an elbow (inflection point), as shown in the figure below, take K at the elbow, here K=3 (a small sketch of this follows below)
- If the curve is very smooth and the elbow is unclear, choose K manually
- The second way is to choose based on human observation and the purpose of the clustering
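A sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the total squared distance to the closest centers (the data here is random and only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                       # illustrative data

distortions = []
for k in range(1, 11):
    model = KMeans(n_clusters=k).fit(X)
    distortions.append(model.inertia_)           # sum of squared distances to the closest center

for k, j in zip(range(1, 11), distortions):
    print(k, j)                                  # look for the "elbow" where the decrease levels off
```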
5. Application — image compression
- Group the pixels of the image into several classes, then replace each original pixel value with its class center
- Algorithm code to perform clustering:
```python
# Clustering algorithm
def runKMeans(X,initial_centroids,max_iters,plot_process):
    m,n = X.shape                   # number of data points and their dimension
    K = initial_centroids.shape[0]  # number of classes
    centroids = initial_centroids   # current class centers
    previous_centroids = centroids  # class centers of the previous step
    idx = np.zeros((m,1))           # which class each data point belongs to

    for i in range(max_iters):      # iterations
        print('Iteration %d'%(i+1))
        idx = findClosestCentroids(X, centroids)
        if plot_process:            # if the process should be plotted
            plt = plotProcessKMeans(X,centroids,previous_centroids)  # draw how the cluster centers move
            previous_centroids = centroids  # reset
        centroids = computerCentroids(X, idx, K)    # recompute the class centers
    if plot_process:                # show the final plot
        plt.show()
    return centroids,idx            # return the cluster centers and the assignment of each point
```
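The replacement step itself is not shown above; a sketch, assuming the image has already been loaded into an array and reusing the functions defined above, might be:

```python
import numpy as np

img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)   # stand-in for the loaded image
H, W, _ = img.shape

X = img.reshape(-1, 3).astype(float)                   # one row per pixel
K = 16                                                 # illustrative number of colors
initial_centroids = kMeansInitCentroids(X, K)
centroids, idx = runKMeans(X, initial_centroids, max_iters=10, plot_process=False)

X_compressed = centroids[np.ravel(idx).astype(int), :] # replace each pixel with its cluster center
img_compressed = X_compressed.reshape(H, W, 3).astype(np.uint8)
```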
6. Clustering with the KMeans model in the scikit-learn library
- Import packages
from sklearn.cluster import KMeans
Copy the code
- Use models to fit data
```python
model = KMeans(n_clusters=3).fit(X)  # n_clusters specifies 3 classes; fit the data
```
- The clustering center
```python
centroids = model.cluster_centers_  # cluster centers
```
7. Running results
- Movement of 2d data class center
- Image compression
6. PCA: Principal Component Analysis (Dimensionality Reduction)
- All the code
1. Uses
- Data compression, which makes programs run faster
- Visualizing data, e.g. 3D --> 2D, etc.
2. 2D --> 1D, nD --> kD
- As shown in the figure below, all data points can be projected onto one straight line so that the sum of squared projection distances (the projection error) is minimized
- Note that the data needs to be normalized first
- The idea is to find a single vector u such that projecting all the data onto it minimizes the projection distance
- For nD --> kD, find k vectors such that projecting all the data onto them minimizes the projection error
- E.g. 3D --> 2D: two vectors span a plane, and the projection error of all points onto this plane is minimized
3. The difference between principal component analysis and linear regression
- Linear regression looks for the relationship between x and y and is then used to predict y
- PCA looks for a projection surface that minimizes the projection error of the data onto that surface
4. PCA dimension reduction process
- Data preprocessing (mean normalization)
- Formula: $x_j^{(i)} := \dfrac{x_j^{(i)} - \mu_j}{s_j}$
- That is, subtract the mean of the corresponding feature and divide by the standard deviation of that feature (the maximum minus the minimum also works)
- Implementation code:
```python
# Normalize the data
def featureNormalize(X):
    '''(each value - the mean of its column) / the standard deviation of its column'''
    n = X.shape[1]
    mu = np.zeros((1,n))
    sigma = np.zeros((1,n))

    mu = np.mean(X,axis=0)
    sigma = np.std(X,axis=0)
    for i in range(n):
        X[:,i] = (X[:,i]-mu[i])/sigma[i]
    return X,mu,sigma
```
- Compute the covariance matrix: $\Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)\left(x^{(i)}\right)^T$
- Note that this $\Sigma$ is the covariance matrix, not the summation symbol
- The covariance matrix is symmetric positive semi-definite (if positive definiteness is unfamiliar, review linear algebra)
- Its size is nxn, where n is the dimension of the features
- Implementation code:
```python
Sigma = np.dot(np.transpose(X_norm),X_norm)/m  # compute Sigma
```
- Compute the eigenvalues and eigenvectors of $\Sigma$
- The singular value decomposition function can be used: U, S, V = svd($\Sigma$)
- It returns S, a diagonal matrix of the same size as $\Sigma$ made up of the eigenvalues of $\Sigma$ [note: in matlab the function returns a diagonal matrix, while in python it returns a vector, which saves space]
- It also returns two unitary matrices U and V, with $\Sigma = U S V^T$
- Note: the S produced by svd is sorted in descending order of the eigenvalues; if svd is not used, U needs to be re-ordered by eigenvalue size
- Dimensionality reduction
- Select the first K columns of U (assuming we reduce to K dimensions) to form $U_{reduce}$
- $Z = X\,U_{reduce}$ is the corresponding data after dimensionality reduction
- Implementation code:
```python
# Project the data
def projectData(X_norm,U,K):
    Z = np.zeros((X_norm.shape[0],K))
    U_reduce = U[:,0:K]         # take the first K columns
    Z = np.dot(X_norm,U_reduce)
    return Z
```
- Process summary:
```
Sigma = X'*X/m
U,S,V = svd(Sigma)
Ureduce = U[:,0:k]
Z = Ureduce'*x
```
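For reference, a NumPy version of that summary might look like this (a sketch with stand-in data in place of the mean-normalized X_norm):

```python
import numpy as np

X_norm = np.random.randn(100, 5)                   # stand-in for the mean-normalized data
m = X_norm.shape[0]
Sigma = np.dot(np.transpose(X_norm), X_norm) / m   # covariance matrix
U, S, V = np.linalg.svd(Sigma)                     # S comes back as a vector of singular values
K = 2                                              # illustrative target dimension
U_reduce = U[:, 0:K]
Z = np.dot(X_norm, U_reduce)                       # reduced data
```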
5. Data recovery
- Because: $z^{(i)} = U_{reduce}^T\,x^{(i)}$
- So: $x_{approx}^{(i)} = (U_{reduce}^T)^{-1} z^{(i)}$ (note that this is an approximation of X)
- And because $U_{reduce}$ has orthonormal columns (satisfying $A^T A = E$, so the transpose plays the role of the inverse), here $x_{approx}^{(i)} = U_{reduce}\,z^{(i)}$
- Implementation code:
```python
# Recover the data
def recoverData(Z,U,K):
    X_rec = np.zeros((Z.shape[0],U.shape[0]))
    U_recude = U[:,0:K]
    X_rec = np.dot(Z,np.transpose(U_recude))  # recover the data (approximately)
    return X_rec
```
6. Selection of the number of principal components (i.e. the dimension to be reduced)
- How to choose K
- Projection error: $\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - x_{approx}^{(i)}\rVert^2$
- Total variation: $\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}\rVert^2$
- If the error ratio $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - x_{approx}^{(i)}\rVert^2}{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}\rVert^2} \le 0.01$, we say that 99% of the variance is retained
- The error ratio is generally taken to be 1%, 5%, 10%, etc.
- How to implement it
- Trying each K one by one from scratch is too expensive
- But we already obtained S from U,S,V = svd(Sigma), and the error ratio can be computed as $1 - \dfrac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$
- So K can be increased step by step until this condition is satisfied (a small sketch follows below)
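A sketch of that selection rule, using the S vector returned by svd (the example values are made up):

```python
import numpy as np

def choose_k(S, keep=0.99):
    # smallest K whose retained variance sum(S[:K])/sum(S) reaches the target
    retained = np.cumsum(S) / np.sum(S)
    return int(np.argmax(retained >= keep)) + 1

S = np.array([6.1, 2.3, 0.4, 0.1, 0.05])   # illustrative singular values
print(choose_k(S, keep=0.99))              # prints 4 for this example
```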
7. Usage suggestions
- Do not use PCA to solve overfitting problems; use regularization instead (though if a very high proportion of the variance is retained it can be acceptable)
- Only consider using PCA when you already get good results on the raw data but the program runs too slowly
8. Running results
- The 2-d data has been reduced to 1-d
- The direction of the projection
- 2D reduced to 1D and the corresponding relationship
- Face data dimensionality reduction
- The original data
- Visualization of part of the U matrix
- Recovered data
9. Dimensionality reduction with PCA in the scikit-learn library
- Import required packages:
```python
# -*- coding: utf-8 -*-
# Author: Bob
# Date: 2016.12.22
import numpy as np
from matplotlib import pyplot as plt
from scipy import io as spio
from sklearn.decomposition import pca   # in newer scikit-learn versions: from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```
- Normalized data
```python
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
```
- Fit the data with the PCA model and reduce the dimensionality; n_components specifies the target dimension
```python
model = pca.PCA(n_components=K).fit(x_train)   # n_components specifies the target dimension
Z = model.transform(x_train)                   # transform performs the dimensionality reduction
```
- Data recovery: model.components_ holds the $U_{reduce}$ matrix used for the dimensionality reduction
```python
# data recovery
Ureduce = model.components_     # get Ureduce
x_rec = np.dot(Z,Ureduce)       # recover the data
```
7. Anomaly Detection
1. Gaussian distribution (normal distribution)
- Distribution function: $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
- where $\mu$ is the mean of the data and $\sigma$ is its standard deviation
- The smaller $\sigma$ is, the sharper the corresponding curve
- Parameter estimation: $\mu = \dfrac{1}{m}\sum_{i=1}^{m}x^{(i)}$, $\sigma^2 = \dfrac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$
2. Anomaly detection algorithm
- Example:
- Training set: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$
- Assume the features are independent of each other and build the model: $p(x) = \prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$
- Process:
- Choose features $x_i$ that are indicative of anomalous examples
- Parameter estimation: $\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$, $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)}-\mu_j\right)^2$
- Compute p(x); if $p(x) < \varepsilon$ the example is considered anomalous, where $\varepsilon$ is the probability threshold we need to determine
- This is only the univariate Gaussian model, which assumes the features are independent of each other; the multivariate Gaussian distribution discussed below automatically captures relationships between features
- Parameter estimation implementation code:
```python
# Parameter estimation (compute the mean and variance)
def estimateGaussian(X):
    m,n = X.shape
    mu = np.zeros((n,1))
    sigma2 = np.zeros((n,1))

    mu = np.mean(X, axis=0)     # axis=0: mean of each column
    sigma2 = np.var(X, axis=0)  # variance of each column
    return mu,sigma2
```
3. Evaluating p(x) and selecting $\varepsilon$
- Error metrics for skewed data
- Because the data can be very skewed (the number of y=1 examples is very small, where y=1 indicates an anomaly), Precision/Recall and the F1 score are computed on the CV cross-validation set
- For example, in predicting cancer, suppose the model achieves 99% correct predictions and a 1% error rate. But the actual probability of cancer is very small, only 0.5%, so always predicting "no cancer" (y=0) gives an even smaller error rate; using the error rate for evaluation is therefore not sound
- As shown below:
- Precision = TP/(TP+FP), that is: correctly predicted positive samples / all samples predicted as positive
- Recall = TP/(TP+FN), that is: correctly predicted positive samples / all samples whose true value is positive
- Always let y=1 denote the rarer class, and compute Precision and Recall for it
- Taking cancer prediction as an example, if everything is predicted as no-cancer then TN=199, FN=1, TP=0, FP=0, so Precision=0/0 and Recall=0/1=0; although the accuracy is 199/200=99.5%, the classifier is not credible
- Selecting $\varepsilon$
- Try multiple values of $\varepsilon$ and keep the one that makes the F1 score, $F_1 = \frac{2\,P\,R}{P+R}$, highest
- Implementation code:
```python
# Select the best epsilon, i.e. the one that maximizes the F1 score
def selectThreshold(yval,pval):
    '''initialize the required variables'''
    bestEpsilon = 0.
    bestF1 = 0.
    F1 = 0.
    step = (np.max(pval)-np.min(pval))/1000
    '''search'''
    for epsilon in np.arange(np.min(pval),np.max(pval),step):
        cvPrecision = pval<epsilon
        tp = np.sum((cvPrecision == 1) & (yval == 1)).astype(float)  # sum returns an int, convert to float
        fp = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)
        fn = np.sum((cvPrecision == 0) & (yval == 1)).astype(float)
        precision = tp/(tp+fp)  # precision
        recall = tp/(tp+fn)     # recall
        F1 = (2*precision*recall)/(precision+recall)    # F1 score
        if F1 > bestF1:         # keep the best F1 score
            bestF1 = F1
            bestEpsilon = epsilon
    return bestEpsilon,bestF1
```
4. How to choose the features (univariate Gaussian model)
- If some data does not follow a Gaussian distribution, you can transform it, for example with log(x+C), x^(1/2), etc. (a small sketch follows below)
- If p(x) is large regardless of whether an example is anomalous, you can try combining several features into new ones (because there may be relationships between the features)
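A small sketch of such a transformation (the constant C and the data are illustrative):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)   # skewed, clearly non-Gaussian data
C = 1.0                                           # illustrative constant
x_log = np.log(x + C)                             # log transform
x_sqrt = np.sqrt(x)                               # square-root transform
# the transformed features are usually much closer to a Gaussian shape
```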
5. Multivariate Gaussian distribution
- Problems with the univariate Gaussian model
- In the figure, the red point is an anomaly and the rest are normal (e.g. CPU load vs. memory use)
- The Gaussian distribution of x1 is shown below:
- The Gaussian distribution of x2 is shown below:
- You can see that the corresponding values of p(x1) and p(x2) are not unusually small, so the point is not flagged as anomalous
- Because the model treats the features as independent of each other, the density in the figure above spreads out in axis-aligned circles
- Multivariate Gaussian distribution
- Instead of modeling p(x1), p(x2), ..., p(xn) separately, it builds a single unified p(x)
- Parameters: $\mu \in \mathbb{R}^n$ and the covariance matrix $\Sigma \in \mathbb{R}^{n\times n}$, with $p(x) = \dfrac{1}{(2\pi)^{n/2}\,\lvert\Sigma\rvert^{1/2}}\exp\left(-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
- As before, the smaller $\lvert\Sigma\rvert$ is, the sharper p(x) is
- For example, a $\Sigma$ with positive off-diagonal entries represents a positive correlation between x1 and x2, i.e. the larger x1 is, the larger x2 tends to be; as shown in the figure below, the red anomalous point can then be detected. If the off-diagonal entries are negative, x1 and x2 are negatively correlated
- Implementation code:
```python
# Multivariate Gaussian distribution function
def multivariateGaussian(X,mu,Sigma2):
    k = len(mu)
    if (Sigma2.shape[0]>1):
        Sigma2 = np.diag(Sigma2)    # if sigma2 is given per feature, turn it into a diagonal covariance matrix
    '''multivariate Gaussian density'''
    X = X-mu
    argu = (2*np.pi)**(-k/2)*np.linalg.det(Sigma2)**(-0.5)
    p = argu*np.exp(-0.5*np.sum(np.dot(X,np.linalg.inv(Sigma2))*X,axis=1))  # axis=1: one value per row
    return p
```
6. Characteristics of the univariate and multivariate Gaussian models
- Univariate Gaussian model
- Can be used when the relationships between features can be captured manually (by constructing new features)
- Computationally cheap
- Multivariate Gaussian distribution
- Automatically captures the correlations between features
- Computationally expensive, because the nxn matrix $\Sigma$ must be inverted
- Requires m > n, or more precisely that $\Sigma$ is invertible (if it is not invertible, there may be redundant, linearly dependent features, or m < n)
7. Program running results
- Display the data
- contour
- Outlier labeling