1. Linear Regression
- All the code
1. Cost function
- The cost function is $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
- where the hypothesis is $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$ (with $x_0 = 1$)
- The next step is to find the $\theta$ that minimizes the cost, which means the equation we fit is closest to the true values
- There are m training examples; $\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$ is the squared distance between the fitted equation and the true value. We square it because the residuals may be negative, and positive and negative terms could otherwise cancel out
- The factor 2 in the denominator is there because when we take the gradient with respect to each variable, the 2 coming from the square cancels it
- Implementation code:
```python
# Compute the cost function
def computerCost(X,y,theta):
    m = len(y)
    J = 0
    J = (np.transpose(X*theta-y))*(X*theta-y)/(2*m)   # X and theta are np.matrix objects, so * is matrix multiplication
    return J
```
- Note that X here is the real data with a column of ones prepended, because of the intercept term theta(0).
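As a minimal sketch of that setup (the data values here are made up for illustration, not taken from the exercise), the bias column can be prepended like this:

```python
import numpy as np

X_raw = np.array([[2104.0, 3.0],
                  [1600.0, 3.0],
                  [2400.0, 4.0]])              # illustrative raw feature matrix (m x n)
m = X_raw.shape[0]
X = np.hstack((np.ones((m, 1)), X_raw))        # prepend a column of ones so theta[0] acts as the intercept
```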
2. Gradient descent algorithm
- Taking the partial derivative of the cost function gives: $\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- So the update to theta can be written as: $\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
- where $\alpha$ is the learning rate, which controls the speed of gradient descent; typical values are 0.01, 0.03, 0.1, 0.3, ...
- Why does gradient descent gradually reduce the cost function?
- Consider a function $f(x)$ and its Taylor expansion: $f(x+\Delta x) = f(x) + f'(x)\,\Delta x + o(\Delta x)$
- Let $\Delta x = -\alpha f'(x)$, i.e. step in the direction of the negative gradient with a very small step size $\alpha$
- Substituting $\Delta x$ into the Taylor expansion gives $f(x+\Delta x) = f(x) - \alpha\,[f'(x)]^2 + o(\Delta x)$
- Since $\alpha$ is a small positive number and $[f'(x)]^2$ is non-negative, it follows that $f(x+\Delta x) \le f(x)$
- So moving along the negative gradient makes the function decrease; the same holds in multiple dimensions.
- Implementation code:

```python
# Gradient descent algorithm
def gradientDescent(X,y,theta,alpha,num_iters):
    m = len(y)
    n = len(theta)
    temp = np.matrix(np.zeros((n,num_iters)))   # temporarily store theta from each iteration
    J_history = np.zeros((num_iters,1))         # record the cost of each iteration
    for i in range(num_iters):                  # loop over the iterations
        h = np.dot(X,theta)                     # inner product; matrices can be multiplied directly
        temp[:,i] = theta - ((alpha/m)*(np.dot(np.transpose(X),h-y)))   # gradient step
        theta = temp[:,i]
        J_history[i] = computerCost(X,y,theta)  # record the cost of this iteration
        # print('.')
    return theta,J_history
```
3. Mean normalization
- The goal is to scale the data into a comparable range so that the gradient descent algorithm works well: $x_i := \dfrac{x_i - \mu_i}{s_i}$
- where $\mu_i$ is the mean of all the data for this feature
- and $s_i$ can be the maximum minus the minimum, or the standard deviation of the data for this feature
- Implementation code:
```python
# Normalize the features
def featureNormaliza(X):
    X_norm = np.array(X)            # convert X to a numpy array so matrix operations work
    mu = np.zeros((1,X.shape[1]))
    sigma = np.zeros((1,X.shape[1]))
    mu = np.mean(X_norm,0)          # mean of each column (0 = columns, 1 = rows)
    sigma = np.std(X_norm,0)        # standard deviation of each column
    for i in range(X.shape[1]):     # loop over the columns
        X_norm[:,i] = (X_norm[:,i]-mu[i])/sigma[i]  # normalize
    return X_norm,mu,sigma
```
- Note that data used for prediction must also be mean-normalized, using the same mu and sigma as the training data.
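A minimal sketch of that step: mu, sigma and theta are assumed to come from featureNormaliza and gradientDescent above, and the values below are stand-ins so the sketch runs on its own.

```python
import numpy as np

# stand-in values; in practice reuse the mu, sigma and theta produced above
mu, sigma = np.array([2000.0, 3.0]), np.array([500.0, 1.0])
theta = np.matrix([[340000.0], [110000.0], [-6000.0]])

x_new = np.array([1650.0, 3.0])                    # hypothetical new example
x_new_norm = (x_new - mu) / sigma                  # normalize with the training-set statistics
x_new_norm = np.hstack((np.ones(1), x_new_norm))   # prepend the bias term
prediction = np.dot(x_new_norm, theta)             # predict with the learned theta
```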
4. Final running results
- The cost changes with the number of iterations
5. Using the linear model in the scikit-learn library
- Import packages
```python
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler  # for mean normalization
```
- Normalization
```python
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
x_test = scaler.transform(np.array([1650]).reshape(1, -1))  # new example to predict (must have the same number of features as X)
```
- Linear model fitting
```python
model = linear_model.LinearRegression()
model.fit(x_train, y)
```
- Prediction

```python
result = model.predict(x_test)
```
2. Logistic Regression
- All the code
1. Cost function
- The cost function is $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]$
- The two cases can be summed up as: $\mathrm{cost}(h_\theta(x),y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$
- Why not use the cost function of linear regression? Because with the sigmoid hypothesis it may be non-convex, and for a classification problem gradient descent would then struggle to reach the minimum; the cost function above is convex
- The plot of $-\log(h_\theta(x))$ (the y=1 case) is shown below. You can see that when $h_\theta(x)$ tends to 1 and y=1, the prediction matches the label and the cost tends to 0; if $h_\theta(x)$ tends to 0 while y=1, the cost is very large. Our ultimate goal is to minimize the cost
- Similarly, the plot of $-\log(1-h_\theta(x))$, shown below, covers the y=0 case
2. Gradient
- Taking the partial derivative of the cost function gives $\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$; it can be seen that this has the same form as the partial derivative in linear regression
- Derivation:
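The derivation image is not reproduced here; as a sketch, the standard chain-rule derivation, using the sigmoid property $g'(z) = g(z)(1-g(z))$, goes:

$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j}
&= -\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_\theta(x^{(i)})}-\frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right]h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)x_j^{(i)} \\
&= -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\left(1-h_\theta(x^{(i)})\right)-(1-y^{(i)})\,h_\theta(x^{(i)})\right]x_j^{(i)} \\
&= \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}
\end{aligned}
$$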
3. Regularization
- The goal is to prevent overfitting
- Add a regularization term to the cost function: $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
- Note that j starts from 1, because theta(0) is the intercept term: a column of ones is prepended to X, so the corresponding product is just theta(0). It is not tied to any feature, so there is no need to regularize it
- Cost after regularization:
```python
# Regularized cost function
def costFunction(initial_theta,X,y,inital_lambda):
    m = len(y)
    J = 0

    h = sigmoid(np.dot(X,initial_theta))    # compute h(z)
    theta1 = initial_theta.copy()           # copy theta because regularization starts at j=1; set theta(0) to 0
    theta1[0] = 0

    temp = np.dot(np.transpose(theta1),theta1)
    J = (-np.dot(np.transpose(y),np.log(h))-np.dot(np.transpose(1-y),np.log(1-h))+temp*inital_lambda/2)/m   # regularized cost
    return J
```
- The regularized gradient of the cost
```python
# Regularized gradient of the cost
def gradient(initial_theta,X,y,inital_lambda):
    m = len(y)
    grad = np.zeros((initial_theta.shape[0]))

    h = sigmoid(np.dot(X,initial_theta))    # compute h(z)
    theta1 = initial_theta.copy()
    theta1[0] = 0

    grad = np.dot(np.transpose(X),h-y)/m+inital_lambda/m*theta1  # regularized gradient
    return grad
```
4. Sigmoid function, i.e. $g(z) = \frac{1}{1+e^{-z}}$
- Implementation code:
```python
# Sigmoid function
def sigmoid(z):
    h = np.zeros((len(z),1))    # initialize with the same length as z
    h = 1.0/(1.0+np.exp(-z))
    return h
```
5. Mapping features to polynomial terms
- Because the data may have few features, which leads to high bias, some feature combinations are created
- E.g. mapping to second-degree form: $1 + x_1 + x_2 + x_1^2 + x_1x_2 + x_2^2$
- Implementation code:
```python
# Map the features to polynomial terms
def mapFeature(X1,X2):
    degree = 3                      # highest power of the mapping
    out = np.ones((X1.shape[0],1))  # array holding the mapped result (it replaces X)
    for i in np.arange(1,degree+1):
        for j in range(i+1):
            temp = X1**(i-j)*(X2**j)    # element-wise multiplication, like .* in MATLAB
            out = np.hstack((out, temp.reshape(-1,1)))
    return out
```
6. Using the optimization methods in scipy
- For gradient descent we use the fmin_bfgs function from scipy.optimize
- It calls scipy's optimization algorithm fmin_bfgs (the quasi-Newton method Broyden-Fletcher-Goldfarb-Shanno)
- costFunction is the self-implemented cost function
- initial_theta is the initial value of theta
- fprime specifies the gradient of costFunction
- args holds the remaining parameters, passed in as a tuple; the call finally returns the theta that minimizes costFunction
```python
result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,y,initial_lambda))
```
7. Running results
- Data1 decision boundaries and accuracy
- Data2 decision boundaries and accuracy
8. Using the logistic regression model in the scikit-learn library
- Import packages
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split  # in newer scikit-learn versions: from sklearn.model_selection import train_test_split
import numpy as np
```
- Divide training set and test set
```python
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
```
- Normalization
```python
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.fit_transform(x_train)
x_test = scaler.fit_transform(x_test)   # note: scaler.transform(x_test) would reuse the training-set statistics instead of refitting
```
- Logistic regression
```python
model = LogisticRegression()
model.fit(x_train,y_train)
```
- Prediction

```python
# prediction
predict = model.predict(x_test)
right = sum(predict == y_test)

predict = np.hstack((predict.reshape(-1,1),y_test.reshape(-1,1)))   # put predictions and true labels side by side for inspection
print(predict)
print('Test set accuracy: %f%%'%(right*100.0/predict.shape[0]))     # accuracy on the test set
```
Logistic Regression: Handwritten Digit Recognition (OneVsAll)
- All the code
1. Randomly display 100 digits
- The data set from scikit-learn is not used here; each digit is a 20*20px image. The color image and the grayscale image are shown below:
- Implementation code:
```python
# Display 100 digits
def display_data(imgData):
    sum = 0
    '''
    Display 100 digits (drawing them one by one would be very slow; instead arrange the
    digits to be drawn into one matrix and display that matrix)
    - initialize a two-dimensional array
    - reshape each row of data into the image matrix and place it into the array
    - display the array
    '''
    pad = 1                                       # padding between digits (assumed to be 1, as in the original project)
    display_array = -np.ones((pad+10*(20+pad),pad+10*(20+pad)))
    for i in range(10):
        for j in range(10):
            display_array[pad+i*(20+pad):pad+i*(20+pad)+20,pad+j*(20+pad):pad+j*(20+pad)+20] = (imgData[sum,:].reshape(20,20,order="F"))    # order="F" means column-major, as in matlab; python defaults to row-major
            sum += 1
    plt.imshow(display_array,cmap='gray')   # show the grayscale image
    plt.axis('off')
    plt.show()
```
2. OneVsAll
- How can logistic regression solve a multi-class problem? OneVsAll treats one category as the positive class and all the other categories together as the negative class, which turns it into a binary classification problem
- As shown in the figure below, the data is divided into three categories: first treat the red points as one class and everything else as the other class and run logistic regression; then treat the blue points as one class and everything else as the other class, and so on…
- So when the number of classes is greater than 2, logistic regression has to be run as many times as there are classes
3. Handwritten digit recognition
- There are 10 digits, 0-9, so 10 rounds of classification are needed
- The data set's y is given as the digits 0...9, while logistic regression needs 0/1 labels, so y has to be processed
- About the data set: the first 500 examples are 0, examples 500-1000 are 1, and so on. After processing, as shown below, the first 500 rows of y have a 1 in the first column and 0 elsewhere, rows 500-1000 have a 1 in the second column and 0 elsewhere, and so on
- Then the gradient descent algorithm is called to solve for theta
- Implementation code:
```python
# Compute theta for each class; return all_theta containing all of them
def oneVsAll(X,y,num_labels,Lambda):
    # initialize variables
    m,n = X.shape
    all_theta = np.zeros((n+1,num_labels))  # each column holds the theta of one class (10 columns)
    X = np.hstack((np.ones((m,1)),X))       # prepend a column of ones (bias) to X
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    initial_theta = np.zeros((n+1,1))       # initialize theta for one class

    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    #np.savetxt("class_y.csv", class_y[0:600,:], delimiter=',')

    '''loop over the classes and compute the corresponding theta'''
    for i in range(num_labels):
        result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,class_y[:,i],Lambda))  # call the gradient-based optimizer
        all_theta[:,i] = result.reshape(1,-1)   # store the result in all_theta

    all_theta = np.transpose(all_theta)
    return all_theta
```
4. Prediction
- As mentioned before, the predicted result is a probability value. Plugging the learned theta into the sigmoid function, the maximum value in each row is the highest probability over the digits, and the column where it occurs is the predicted digit. This is because during classification all examples labeled 0 were mapped to the first column of y, those labeled 1 to the second column, and so on
- Implementation code:
```python
# Prediction
def predict_oneVsAll(all_theta,X):
    m = X.shape[0]
    num_labels = all_theta.shape[0]
    p = np.zeros((m,1))
    X = np.hstack((np.ones((m,1)),X))   # prepend a column of ones to X

    h = sigmoid(np.dot(X,np.transpose(all_theta)))  # prediction

    '''
    Return the column index of the maximum value in each row of h
    - np.max(h, axis=1) returns the maximum of each row (the highest probability over the digits)
    - np.where then finds the column of that maximum (the column index is the digit)
    '''
    p = np.array(np.where(h[0,:] == np.max(h, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h[i,:] == np.max(h, axis=1)[i]))
        p = np.vstack((p,t))
    return p
```
5. Running results
- After 10 rounds of classification, the accuracy on the training set:
6. Using the logistic regression model in the scikit-learn library
- 1. Import the package
```python
from scipy import io as spio
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
```
- 2. Load data
```python
data = loadmat_data("data_digits.mat")  # helper that loads the .mat file
X = data['X']   # each row is one 20x20px digit
y = data['y']   # shape=(5000, 1) when read from the mat file
y = np.ravel(y) # scikit-learn expects a one-dimensional (5000,) array
```
- 3. Fitting model
```python
model = LogisticRegression()
model.fit(X, y)  # fit the model
```
- 4. Prediction

```python
predict = model.predict(X)  # prediction
print("Prediction accuracy: %f%%"%np.mean(np.float64(predict == y)*100))
```
- 5. Output results (accuracy on training set)
3. BP Neural Network
1. Neural network Model
- First, consider a three-layer neural network, as shown in the figure below
- The input layer has three units ($x_0$ is the padded bias, usually set to 1)
- $a_i^{(j)}$ denotes the i-th activation (also called a unit) of layer j
- $\Theta^{(j)}$ is the weight matrix mapping layer j to layer j+1, i.e. the weight of each edge
- So we can get:
- Hidden layer: $a^{(2)} = g(\Theta^{(1)} x)$
- Output layer: $h_\Theta(x) = g(\Theta^{(2)} a^{(2)})$, where the sigmoid function $g$ is also called the activation function
- It can be seen that $\Theta^{(1)}$ is a 3x4 matrix and $\Theta^{(2)}$ is a 1x4 matrix
- In general, $\Theta^{(j)}$ has size (number of units in layer j+1) x (number of units in layer j, plus 1)
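As a quick sanity check on those dimensions, here is a small sketch with made-up layer sizes (not code from the original project):

```python
import numpy as np

n_input, n_hidden, n_output = 3, 3, 1        # hypothetical 3-layer network

Theta1 = np.zeros((n_hidden, n_input + 1))   # maps layer 1 -> layer 2, shape (3, 4)
Theta2 = np.zeros((n_output, n_hidden + 1))  # maps layer 2 -> layer 3, shape (1, 4)

x = np.ones(n_input + 1)                     # one example with the bias x0 = 1 prepended
a2 = 1.0 / (1.0 + np.exp(-Theta1.dot(x)))    # hidden-layer activations
a2 = np.hstack(([1.0], a2))                  # prepend the bias unit
h = 1.0 / (1.0 + np.exp(-Theta2.dot(a2)))    # output h_theta(x)
print(Theta1.shape, Theta2.shape, h.shape)   # (3, 4) (1, 4) (1,)
```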
2. Cost function
- Assume the final output $h_\Theta(x)\in\mathbb{R}^K$, i.e. the output layer has K units
- where $(h_\Theta(x))_i$ denotes the output of the i-th unit
- The cost function is similar to that of logistic regression, except that it sums over each of the K outputs: $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + (1-y_k^{(i)})\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right]$
3. Regularization
- L --> the total number of layers
- $s_l$ --> the number of units in layer l
- The regularized cost function adds the term $\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$
- There are L-1 theta matrices in total; we sum the squared entries of the theta matrix of each layer, not including the theta terms corresponding to the bias units
- Regularized cost function implementation code:
```python
# Cost function
def nnCostFunction(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):
    length = nn_params.shape[0]
    # restore Theta1 and Theta2 from the unrolled parameter vector
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1)
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1)

    # np.savetxt("Theta1.csv",Theta1,delimiter=',')

    m = X.shape[0]
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, since regularization starts at 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:,1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:,1:Theta2_colCount]
    # regularization term theta^2
    term = np.dot(np.transpose(np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1)))),np.vstack((Theta1_x.reshape(-1,1),Theta2_x.reshape(-1,1))))

    '''forward propagation; a bias column of ones is added at each layer'''
    a1 = np.hstack((np.ones((m,1)),X))
    z2 = np.dot(a1,np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m,1)),a2))
    z3 = np.dot(a2,np.transpose(Theta2))
    h  = sigmoid(z3)
    '''cost'''
    J = -(np.dot(np.transpose(class_y.reshape(-1,1)),np.log(h.reshape(-1,1)))+np.dot(np.transpose(1-class_y.reshape(-1,1)),np.log(1-h.reshape(-1,1)))-Lambda*term/2)/m
    return np.ravel(J)
```
4. Back propagation (BP)
- The forward propagation above gives $J(\Theta)$; to use gradient descent we still need its gradient
- The purpose of BP (back propagation) is to compute the gradient of the cost function
- Suppose a four-layer neural network, and write $\delta_j^{(l)}$ --> the error of the j-th unit in layer l
- $\delta^{(4)} = a^{(4)} - y$, i.e. $\delta_j^{(4)} = a_j^{(4)} - y_j$ (vectorized)
- $\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \odot g'(z^{(3)})$, $\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \odot g'(z^{(2)})$ (element-wise product)
- There is no $\delta^{(1)}$, because there is no error on the input
- Since the derivative of the sigmoid function is $g'(z) = g(z)\,(1-g(z))$, the $g'(z^{(3)})$ and $g'(z^{(2)})$ above can already be computed during forward propagation
- The process of computing the gradient by back propagation is:
- $\Delta_{ij}^{(l)} = 0$ ($\Delta$ is the capital of $\delta$)
- for i = 1..m:
  - forward propagation to compute $a^{(l)}$ for l = 2, 3, ..., L
  - backward computation of $\delta^{(L)}, \delta^{(L-1)}, \dots, \delta^{(2)}$
  - $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\,\delta_i^{(l+1)}$
- $D_{ij}^{(l)} = \frac{1}{m}\left(\Delta_{ij}^{(l)} + \lambda\,\Theta_{ij}^{(l)}\right)$ for $j \neq 0$, and $D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}$ for $j = 0$
- Finally $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = D_{ij}^{(l)}$, i.e. the gradient of the cost function is obtained
- Implementation code:
```python
# Gradient
def nnGradient(nn_params,input_layer_size,hidden_layer_size,num_labels,X,y,Lambda):
    length = nn_params.shape[0]
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size,input_layer_size+1).copy()   # copy() is used here, otherwise modifying Theta below would also modify nn_params
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels,hidden_layer_size+1).copy()
    m = X.shape[0]
    class_y = np.zeros((m,num_labels))      # y holds 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:,i] = np.int32(y==i).reshape(1,-1) # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, since regularization starts at 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:,1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:,1:Theta2_colCount]

    Theta1_grad = np.zeros((Theta1.shape))  # gradient of the weights from layer 1 to layer 2
    Theta2_grad = np.zeros((Theta2.shape))  # gradient of the weights from layer 2 to layer 3

    '''forward propagation; a bias column of ones is added at each layer'''
    a1 = np.hstack((np.ones((m,1)),X))
    z2 = np.dot(a1,np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m,1)),a2))
    z3 = np.dot(a2,np.transpose(Theta2))
    h  = sigmoid(z3)

    '''back propagation; delta is the error'''
    delta3 = np.zeros((m,num_labels))
    delta2 = np.zeros((m,hidden_layer_size))
    for i in range(m):
        #delta3[i,:] = (h[i,:]-class_y[i,:])*sigmoidGradient(z3[i,:])  # error for a mean-squared-error cost
        delta3[i,:] = h[i,:]-class_y[i,:]                              # error for the cross-entropy cost
        Theta2_grad = Theta2_grad+np.dot(np.transpose(delta3[i,:].reshape(1,-1)),a2[i,:].reshape(1,-1))
        delta2[i,:] = np.dot(delta3[i,:].reshape(1,-1),Theta2_x)*sigmoidGradient(z2[i,:])
        Theta1_grad = Theta1_grad+np.dot(np.transpose(delta2[i,:].reshape(1,-1)),a1[i,:].reshape(1,-1))

    Theta1[:,0] = 0
    Theta2[:,0] = 0
    '''gradient'''
    grad = (np.vstack((Theta1_grad.reshape(-1,1),Theta2_grad.reshape(-1,1)))+Lambda*np.vstack((Theta1.reshape(-1,1),Theta2.reshape(-1,1))))/m
    return np.ravel(grad)
```
5. Why BP can compute the gradient
- It actually uses the chain rule of differentiation
- Because the units of the next layer are computed using the units of the previous layer as input
- The general idea of the derivation is as follows: ultimately we want the prediction function to be very close to the known y, and moving along the gradient of the error minimizes the cost function. Compare with the gradient computation above.
- A more detailed derivation of the error term:
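The detailed derivation image is not included here; a sketch of the chain-rule argument in the notation above is:

$$\delta_j^{(l)} \equiv \frac{\partial J}{\partial z_j^{(l)}} = \left(\sum_k \Theta_{kj}^{(l)}\,\delta_k^{(l+1)}\right) g'\!\left(z_j^{(l)}\right), \qquad \frac{\partial J}{\partial \Theta_{ij}^{(l)}} = \frac{\partial J}{\partial z_i^{(l+1)}}\,\frac{\partial z_i^{(l+1)}}{\partial \Theta_{ij}^{(l)}} = \delta_i^{(l+1)}\,a_j^{(l)},$$

since $z_i^{(l+1)} = \sum_j \Theta_{ij}^{(l)} a_j^{(l)}$; this is exactly the accumulation step $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ used above.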
6. Gradient check
- Check whether the gradient obtained with BP is correct
- Verify using the definition of the derivative: $\dfrac{\partial J(\theta)}{\partial\theta_j} \approx \dfrac{J(\theta_1,\dots,\theta_j+\varepsilon,\dots) - J(\theta_1,\dots,\theta_j-\varepsilon,\dots)}{2\varepsilon}$
- The numerical gradient computed this way should be very close to the gradient from BP
- Once BP has been verified to be correct, there is no need to run the gradient-check algorithm again
- Implementation code:
```python
# Check whether the gradient is computed correctly
def checkGradient(Lambda = 0):
    '''Build a small neural network for the check, because numerical gradients are slow
    to compute, and once BP is verified the check is no longer needed'''
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    initial_Theta1 = debugInitializeWeights(input_layer_size,hidden_layer_size)
    initial_Theta2 = debugInitializeWeights(hidden_layer_size,num_labels)
    X = debugInitializeWeights(input_layer_size-1,m)
    y = 1+np.transpose(np.mod(np.arange(1,m+1), num_labels))   # initialize y
    y = y.reshape(-1,1)
    nn_params = np.vstack((initial_Theta1.reshape(-1,1),initial_Theta2.reshape(-1,1)))  # unroll theta

    '''gradient from BP'''
    grad = nnGradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)

    '''gradient from the numerical method'''
    num_grad = np.zeros((nn_params.shape[0]))
    step = np.zeros((nn_params.shape[0]))
    e = 1e-4
    for i in range(nn_params.shape[0]):
        step[i] = e
        loss1 = nnCostFunction(nn_params-step.reshape(-1,1), input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)
        loss2 = nnCostFunction(nn_params+step.reshape(-1,1), input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)
        num_grad[i] = (loss2-loss1)/(2*e)
        step[i] = 0
    # show the two gradients side by side for comparison
    res = np.hstack((num_grad.reshape(-1,1),grad.reshape(-1,1)))
    print(res)
```
7. Random initialization of weights
- A neural network cannot initialize theta to 0 the way logistic regression does, because if the weight of every edge is 0, every neuron in a layer produces the same output and receives the same gradient during back propagation, so the network can only ever predict one result
- So theta should be initialized to small random values close to zero
- The implementation code
```python
# Randomly initialize the weights theta
def randInitializeWeights(L_in,L_out):
    W = np.zeros((L_out,1+L_in))    # weights corresponding to theta
    epsilon_init = (6.0/(L_out+L_in))**0.5
    W = np.random.rand(L_out,1+L_in)*2*epsilon_init-epsilon_init  # np.random.rand(L_out,1+L_in) produces a random matrix of size L_out x (1+L_in)
    return W
```
8. Prediction
- The result is predicted by forward propagation
- The implementation code
```python
# Prediction
def predict(Theta1,Theta2,X):
    m = X.shape[0]
    num_labels = Theta2.shape[0]
    #p = np.zeros((m,1))
    '''forward propagation to predict the result'''
    X = np.hstack((np.ones((m,1)),X))
    h1 = sigmoid(np.dot(X,np.transpose(Theta1)))
    h1 = np.hstack((np.ones((m,1)),h1))
    h2 = sigmoid(np.dot(h1,np.transpose(Theta2)))

    '''
    Return the column index of the maximum value of each row in h2
    - np.max(h2, axis=1) returns the maximum of each row (the highest probability over the digits)
    - np.where then finds the column of that maximum (the column index is the digit)
    '''
    #np.savetxt("h2.csv",h2,delimiter=',')
    p = np.array(np.where(h2[0,:] == np.max(h2, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h2[i,:] == np.max(h2, axis=1)[i]))
        p = np.vstack((p,t))
    return p
```
9. Output results
- Gradient check:
- Random display of 100 handwritten digits
- Show theta1 weight
- Training set prediction accuracy
- Prediction accuracy of training set after normalization
4. SVM (Support Vector Machine)
1. Cost function
- In logistic regression, the cost for one example is $-y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$, where $h_\theta(x) = \frac{1}{1+e^{-z}}$ and $z = \theta^T x$
- As shown in the figure, if y=1 the cost is $-\log\frac{1}{1+e^{-z}}$. We want $z = \theta^T x \gg 0$ so that the cost tends to its minimum (which is what we want), so we replace the logistic regression cost with the red piecewise-linear function $\mathrm{cost}_1(z)$
- When y=0 we similarly replace it with $\mathrm{cost}_0(z)$
- The resulting cost function is $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$, and finally we want $\min_\theta J(\theta)$
- Compared with the regularized cost function of logistic regression, you can think of C here as playing the role of $\frac{1}{\lambda}$; it is only a difference of form. The larger the value of C, the larger the margin of the SVM decision boundary, as explained below
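A minimal sketch of those two surrogate costs in the usual hinge form (the exact slope is a modeling choice; this is not code from the original project):

```python
import numpy as np

def cost1(z):
    # surrogate cost used when y = 1: zero once z >= 1, linear penalty otherwise
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # surrogate cost used when y = 0: zero once z <= -1, linear penalty otherwise
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3, 3, 7)
print(cost1(z))   # falls to 0 as z grows past 1
print(cost0(z))   # falls to 0 as z drops below -1
```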
2. Large Margin
- As shown in the figure below, SVM classification separates the classes with the largest possible margin
- First, a word about the dot product: for $u = \begin{bmatrix}u_1\\u_2\end{bmatrix}$ and $v = \begin{bmatrix}v_1\\v_2\end{bmatrix}$, $\lVert u\rVert$ denotes the Euclidean norm of u, $\lVert u\rVert = \sqrt{u_1^2+u_2^2}$
- Let p be the length of the projection of vector v onto vector u; then the inner product is $u^T v = p\,\lVert u\rVert = u_1v_1 + u_2v_2$, which follows from the vector angle formula
- As mentioned before, the larger C is, the larger the margin. Our goal is to minimize the cost function $J(\theta)$; when the margin is largest, the term multiplied by C must be very small (approximately 0), so approximately $J(\theta) = C\cdot 0 + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\lVert\theta\rVert^2$, and our final goal is to find the $\theta$ that minimizes this cost
- From $\theta^T x^{(i)} = p^{(i)}\,\lVert\theta\rVert \ge 1$ (or $\le -1$), where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$
- As shown in the figure below, suppose the decision boundary is as drawn. Take one of the points and let p be its projection onto $\theta$, so that $p\,\lVert\theta\rVert \ge 1$ or $p\,\lVert\theta\rVert \le -1$. If p is small, $\lVert\theta\rVert$ has to be large, which conflicts with the $\theta$ we seek (the one minimizing $\frac{1}{2}\lVert\theta\rVert^2$); so what we end up with is a large margin
3. SVM Kernel
- For linearly separable problems, the linear kernel is enough
- For problems that are not linearly separable: in logistic regression we mapped the features to polynomial form; SVM also has a polynomial kernel, but more commonly the Gaussian kernel (also called the RBF kernel) is used
- The Gaussian kernel is $f_i = \mathrm{similarity}(x, l^{(i)}) = \exp\left(-\dfrac{\lVert x - l^{(i)}\rVert^2}{2\sigma^2}\right)$, where the $l^{(i)}$ are landmark points
- You can see that if $x$ is close to $l^{(i)}$, then $f_i \approx e^{0} = 1$ (the similarity is large); if $x$ is far from $l^{(i)}$, then $f_i \approx e^{-\infty} = 0$ (the similarity is low)
- The smaller the $\sigma$ of the Gaussian kernel, the faster $f$ falls off
- How do you choose the initial landmarks?
- Training set: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \dots, (x^{(m)},y^{(m)})$
- Choose: $l^{(1)} = x^{(1)},\; l^{(2)} = x^{(2)},\; \dots,\; l^{(m)} = x^{(m)}$
- For a given $x$, compute the feature vector $f = (f_0, f_1, \dots, f_m)$ with $f_0 = 1$
- Minimize $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\mathrm{cost}_0(\theta^T f^{(i)})\right] + \frac{1}{2}\sum_{j}\theta_j^2$ to obtain $\theta$
- If $\theta^T f \ge 0$, predict y=1
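A small sketch of the Gaussian-kernel similarity described above (illustrative values, not the project's code):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # similarity between a sample x and a landmark l under the RBF kernel
    return np.exp(-np.sum((x - l)**2) / (2.0 * sigma**2))

x = np.array([1.0, 2.0])
landmark = np.array([1.0, 2.5])
print(gaussian_kernel(x, landmark, sigma=1.0))   # close to 1: x is near the landmark
print(gaussian_kernel(x, landmark, sigma=0.1))   # same distance, smaller sigma: similarity falls off much faster
```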
4. Using the SVM model in scikit-learn
- All the code
- Linearly separable case, specifying the kernel function as linear:
```python
data1 = spio.loadmat('data1.mat')
X = data1['X']
y = data1['y']
y = np.ravel(y)
plot_data(X,y)

model = svm.SVC(C=1.0,kernel='linear').fit(X,y)  # specify the linear kernel
```
- Non-linearly separable case; the default kernel function is rbf:
```python
data2 = spio.loadmat('data2.mat')
X = data2['X']
y = data2['y']
y = np.ravel(y)
plt = plot_data(X,y)
plt.show()

model = svm.SVC(gamma=100).fit(X,y)     # gamma is the kernel coefficient; the larger it is, the tighter the fit
```
5. Running results
- Linearly separable decision boundary:
- Non-linearly separable decision boundary:
5. K-means clustering algorithm
- All the code
1. Clustering process
- Clustering is unsupervised learning: the y labels are unknown, and the data is to be divided into K classes
- The K-means algorithm has two steps
- Step 1, cluster assignment: randomly select K points as centers, compute the distance of each point to these K centers, and assign the points to K clusters
- Step 2, move the cluster centers: recompute the center of each cluster, move the centers, and repeat the steps above
- As shown below:
- Recompute the cluster centers and move them once
- The cluster centers after the last 10 steps
- Implementation code for finding which center each data point is closest to:
```python
# Find which class center each data point is closest to
def findClosestCentroids(X,initial_centroids):
    m = X.shape[0]                  # number of data points
    K = initial_centroids.shape[0]  # number of classes
    dis = np.zeros((m,K))           # distance from each point to each of the K centers
    idx = np.zeros((m,1))           # which class each data point belongs to (returned)

    '''compute the distance from every point to every class center'''
    for i in range(m):
        for j in range(K):
            dis[i,j] = np.dot((X[i,:]-initial_centroids[j,:]).reshape(1,-1),(X[i,:]-initial_centroids[j,:]).reshape(-1,1))

    '''
    Return the column index of the minimum of each row of dis, i.e. the assigned class
    - np.min(dis, axis=1) returns the minimum of each row
    - np.where(dis == np.min(dis, axis=1).reshape(-1,1)) returns the coordinates of those minima
    - note: a minimum may occur at several coordinates and where finds them all, so only the
      first m entries are returned (for ties, any of the classes is acceptable)
    '''
    dummy,idx = np.where(dis == np.min(dis, axis=1).reshape(-1,1))
    return idx[0:dis.shape[0]]  # truncate to the first m entries
```
- Computing class center implementation code:
```python
# Compute the class centers
def computerCentroids(X,idx,K):
    n = X.shape[1]
    centroids = np.zeros((K,n))
    for i in range(K):
        centroids[i,:] = np.mean(X[np.ravel(idx==i),:], axis=0).reshape(1,-1)   # the index must be one-dimensional; axis=0 averages each column over the points assigned to class i
    return centroids
```
2. Objective function
- Also called the distortion cost function
- Ultimately we want to minimize $J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - \mu_{c^{(i)}}\rVert^2$
- where $c^{(i)}$ indicates which class center the i-th data point is closest to
- and $\mu_k$ are the cluster centers
3. Selection of clustering centers
- Random initialization: randomly pick K points from the given data as the cluster centers
- This can be done multiple times, keeping the initialization that minimizes the cost function
- Implementation code (random selection):
```python
# Initialize the class centers -- randomly pick K points as cluster centers
def kMeansInitCentroids(X,K):
    m = X.shape[0]
    m_arr = np.arange(0,m)      # generate 0..m-1
    centroids = np.zeros((K,X.shape[1]))
    np.random.shuffle(m_arr)    # shuffle m_arr
    rand_indices = m_arr[:K]    # take the first K indices
    centroids = X[rand_indices,:]
    return centroids
```
4. Selection of clustering number K
- Clustering does not know the y labels, so the true number of clusters is unknown
- The Elbow method
- Plot the cost function J against K; if there is an elbow (inflection point), as shown in the figure below, take K at the elbow, here K=3 (a small sketch of this follows below)
- If the curve is very smooth and the elbow is unclear, choose K manually
- The second way is to choose based on human observation and the purpose of the clustering
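A sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the total squared distance to the closest centers (the data here is random and only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                       # illustrative data

distortions = []
for k in range(1, 11):
    model = KMeans(n_clusters=k).fit(X)
    distortions.append(model.inertia_)           # sum of squared distances to the closest center

for k, j in zip(range(1, 11), distortions):
    print(k, j)                                  # look for the "elbow" where the decrease levels off
```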
5. Application — image compression
- Group the pixels of the image into several classes, then replace each original pixel value with its class center
- Algorithm code to perform clustering:
```python
# Clustering algorithm
def runKMeans(X,initial_centroids,max_iters,plot_process):
    m,n = X.shape                   # number of data points and their dimension
    K = initial_centroids.shape[0]  # number of classes
    centroids = initial_centroids   # current class centers
    previous_centroids = centroids  # class centers of the previous step
    idx = np.zeros((m,1))           # which class each data point belongs to

    for i in range(max_iters):      # iterations
        print('Iteration %d'%(i+1))
        idx = findClosestCentroids(X, centroids)
        if plot_process:            # if the process should be plotted
            plt = plotProcessKMeans(X,centroids,previous_centroids)  # draw how the cluster centers move
            previous_centroids = centroids  # reset
        centroids = computerCentroids(X, idx, K)    # recompute the class centers
    if plot_process:                # show the final plot
        plt.show()
    return centroids,idx            # return the cluster centers and the assignment of each point
```
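The replacement step itself is not shown above; a sketch, assuming the image has already been loaded into an array and reusing the functions defined above, might be:

```python
import numpy as np

img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)   # stand-in for the loaded image
H, W, _ = img.shape

X = img.reshape(-1, 3).astype(float)                   # one row per pixel
K = 16                                                 # illustrative number of colors
initial_centroids = kMeansInitCentroids(X, K)
centroids, idx = runKMeans(X, initial_centroids, max_iters=10, plot_process=False)

X_compressed = centroids[np.ravel(idx).astype(int), :] # replace each pixel with its cluster center
img_compressed = X_compressed.reshape(H, W, 3).astype(np.uint8)
```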
6. Clustering with the KMeans model in the scikit-learn library
- Import packages
from sklearn.cluster import KMeans
Copy the code
- Use models to fit data
```python
model = KMeans(n_clusters=3).fit(X)  # n_clusters specifies 3 classes; fit the data
```
- The clustering center
```python
centroids = model.cluster_centers_  # cluster centers
```
7. Running results
- Movement of 2d data class center
- Image compression
6. PCA: Principal Component Analysis (Dimensionality Reduction)
- All the code
1. Uses
- Data compression, which makes programs run faster
- Visualizing data, e.g. 3D --> 2D, etc.
2. 2D --> 1D, nD --> kD
- As shown in the figure below, all data points can be projected onto one straight line so that the sum of squared projection distances (the projection error) is minimized
- Note that the data needs to be normalized first
- The idea is to find a single vector u such that projecting all the data onto it minimizes the projection distance
- For nD --> kD, find k vectors such that projecting all the data onto them minimizes the projection error
- E.g. 3D --> 2D: two vectors span a plane, and the projection error of all points onto this plane is minimized
3. The difference between principal component analysis and linear regression
- Linear regression looks for the relationship between x and y and is then used to predict y
- PCA looks for a projection surface that minimizes the projection error of the data onto that surface
4. PCA dimension reduction process
- Data preprocessing (mean normalization)
- Formula: $x_j^{(i)} := \dfrac{x_j^{(i)} - \mu_j}{s_j}$
- That is, subtract the mean of the corresponding feature and divide by the standard deviation of that feature (the maximum minus the minimum also works)
- Implementation code:
```python
# Normalize the data
def featureNormalize(X):
    '''(each value - the mean of its column) / the standard deviation of its column'''
    n = X.shape[1]
    mu = np.zeros((1,n))
    sigma = np.zeros((1,n))

    mu = np.mean(X,axis=0)
    sigma = np.std(X,axis=0)
    for i in range(n):
        X[:,i] = (X[:,i]-mu[i])/sigma[i]
    return X,mu,sigma
```
- Compute the covariance matrix: $\Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}\right)\left(x^{(i)}\right)^T$
- Note that this $\Sigma$ is the covariance matrix, not the summation symbol
- The covariance matrix is symmetric positive semi-definite (if positive definiteness is unfamiliar, review linear algebra)
- Its size is nxn, where n is the dimension of the features
- Implementation code:
```python
Sigma = np.dot(np.transpose(X_norm),X_norm)/m  # compute Sigma
```
- Compute the eigenvalues and eigenvectors of $\Sigma$
- The singular value decomposition function can be used: U, S, V = svd($\Sigma$)
- It returns S, a diagonal matrix of the same size as $\Sigma$ made up of the eigenvalues of $\Sigma$ [note: in matlab the function returns a diagonal matrix, while in python it returns a vector, which saves space]
- It also returns two unitary matrices U and V, with $\Sigma = U S V^T$
- Note: the S produced by svd is sorted in descending order of the eigenvalues; if svd is not used, U needs to be re-ordered by eigenvalue size
- Dimensionality reduction
- Select the first K columns of U (assuming we reduce to K dimensions) to form $U_{reduce}$
- $Z = X\,U_{reduce}$ is the corresponding data after dimensionality reduction
- Implementation code:
```python
# Project the data
def projectData(X_norm,U,K):
    Z = np.zeros((X_norm.shape[0],K))
    U_reduce = U[:,0:K]         # take the first K columns
    Z = np.dot(X_norm,U_reduce)
    return Z
```
- Process summary:
```
Sigma = X'*X/m
U,S,V = svd(Sigma)
Ureduce = U[:,0:k]
Z = Ureduce'*x
```
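For reference, a NumPy version of that summary might look like this (a sketch with stand-in data in place of the mean-normalized X_norm):

```python
import numpy as np

X_norm = np.random.randn(100, 5)                   # stand-in for the mean-normalized data
m = X_norm.shape[0]
Sigma = np.dot(np.transpose(X_norm), X_norm) / m   # covariance matrix
U, S, V = np.linalg.svd(Sigma)                     # S comes back as a vector of singular values
K = 2                                              # illustrative target dimension
U_reduce = U[:, 0:K]
Z = np.dot(X_norm, U_reduce)                       # reduced data
```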
5. Data recovery
- Because: $z^{(i)} = U_{reduce}^T\,x^{(i)}$
- So: $x_{approx}^{(i)} = (U_{reduce}^T)^{-1} z^{(i)}$ (note that this is an approximation of X)
- And because $U_{reduce}$ has orthonormal columns (satisfying $A^T A = E$, so the transpose plays the role of the inverse), here $x_{approx}^{(i)} = U_{reduce}\,z^{(i)}$
- Implementation code:
```python
# Recover the data
def recoverData(Z,U,K):
    X_rec = np.zeros((Z.shape[0],U.shape[0]))
    U_recude = U[:,0:K]
    X_rec = np.dot(Z,np.transpose(U_recude))  # recover the data (approximately)
    return X_rec
```
6. Selection of the number of principal components (i.e. the dimension to be reduced)
- How to choose K
- Projection error: $\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - x_{approx}^{(i)}\rVert^2$
- Total variation: $\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}\rVert^2$
- If the error ratio $\dfrac{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - x_{approx}^{(i)}\rVert^2}{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}\rVert^2} \le 0.01$, we say that 99% of the variance is retained
- The error ratio is generally taken to be 1%, 5%, 10%, etc.
- How to implement it
- Trying each K one by one from scratch is too expensive
- But we already obtained S from U,S,V = svd(Sigma), and the error ratio can be computed as $1 - \dfrac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$
- So K can be increased step by step until this condition is satisfied (a small sketch follows below)
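A sketch of that selection rule, using the S vector returned by svd (the example values are made up):

```python
import numpy as np

def choose_k(S, keep=0.99):
    # smallest K whose retained variance sum(S[:K])/sum(S) reaches the target
    retained = np.cumsum(S) / np.sum(S)
    return int(np.argmax(retained >= keep)) + 1

S = np.array([6.1, 2.3, 0.4, 0.1, 0.05])   # illustrative singular values
print(choose_k(S, keep=0.99))              # prints 4 for this example
```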
7. Usage suggestions
- Do not use PCA to solve overfitting problems; use regularization instead (though if a very high proportion of the variance is retained it can be acceptable)
- Only consider using PCA when you already get good results on the raw data but the program runs too slowly
8. Running results
- The 2-d data has been reduced to 1-d
- The direction of the projection
- 2D reduced to 1D and the corresponding relationship
- Face data dimensionality reduction
- The original data
- Visualization of part of the U matrix
- Recovered data
9. Dimensionality reduction with PCA in the scikit-learn library
- Import required packages:
```python
# -*- coding: utf-8 -*-
# Author: Bob
# Date: 2016.12.22
import numpy as np
from matplotlib import pyplot as plt
from scipy import io as spio
from sklearn.decomposition import pca   # in newer scikit-learn versions: from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
```
- Normalized data
```python
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
```
- Fit the data with the PCA model and reduce the dimensionality; n_components specifies the target dimension
```python
model = pca.PCA(n_components=K).fit(x_train)   # n_components specifies the target dimension
Z = model.transform(x_train)                   # transform performs the dimensionality reduction
```
- Data recovery: model.components_ holds the $U_{reduce}$ matrix used for the dimensionality reduction
```python
# data recovery
Ureduce = model.components_     # get Ureduce
x_rec = np.dot(Z,Ureduce)       # recover the data
```
7. Anomaly Detection
1. Gaussian distribution (normal distribution)
- Distribution function: $p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
- where $\mu$ is the mean of the data and $\sigma$ is its standard deviation
- The smaller $\sigma$ is, the sharper the corresponding curve
- Parameter estimation: $\mu = \dfrac{1}{m}\sum_{i=1}^{m}x^{(i)}$, $\sigma^2 = \dfrac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$
2. Anomaly detection algorithm
- Example:
- Training set: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, where $x^{(i)} \in \mathbb{R}^n$
- Assume the features are independent of each other and build the model: $p(x) = \prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$
- Process:
- Choose features $x_i$ that are indicative of anomalous examples
- Parameter estimation: $\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$, $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)}-\mu_j\right)^2$
- Compute p(x); if $p(x) < \varepsilon$ the example is considered anomalous, where $\varepsilon$ is the probability threshold we need to determine
- This is only the univariate Gaussian model, which assumes the features are independent of each other; the multivariate Gaussian distribution discussed below automatically captures relationships between features
- Parameter estimation implementation code:
```python
# Parameter estimation (compute the mean and variance)
def estimateGaussian(X):
    m,n = X.shape
    mu = np.zeros((n,1))
    sigma2 = np.zeros((n,1))

    mu = np.mean(X, axis=0)     # axis=0: mean of each column
    sigma2 = np.var(X, axis=0)  # variance of each column
    return mu,sigma2
```
3. Evaluating p(x) and selecting $\varepsilon$
- Error metrics for skewed data
- Because the data can be very skewed (the number of y=1 examples is very small, where y=1 indicates an anomaly), Precision/Recall and the F1 score are computed on the CV cross-validation set
- For example, in predicting cancer, suppose the model achieves 99% correct predictions and a 1% error rate. But the actual probability of cancer is very small, only 0.5%, so always predicting "no cancer" (y=0) gives an even smaller error rate; using the error rate for evaluation is therefore not sound
- As shown below:
- Precision = TP/(TP+FP), that is: correctly predicted positive samples / all samples predicted as positive
- Recall = TP/(TP+FN), that is: correctly predicted positive samples / all samples whose true value is positive
- Always let y=1 denote the rarer class, and compute Precision and Recall for it
- Taking cancer prediction as an example, if everything is predicted as no-cancer then TN=199, FN=1, TP=0, FP=0, so Precision=0/0 and Recall=0/1=0; although the accuracy is 199/200=99.5%, the classifier is not credible
- Selecting $\varepsilon$
- Try multiple values of $\varepsilon$ and keep the one that makes the F1 score, $F_1 = \frac{2\,P\,R}{P+R}$, highest
- Implementation code:
```python
# Select the best epsilon, i.e. the one that maximizes the F1 score
def selectThreshold(yval,pval):
    '''initialize the required variables'''
    bestEpsilon = 0.
    bestF1 = 0.
    F1 = 0.
    step = (np.max(pval)-np.min(pval))/1000
    '''search'''
    for epsilon in np.arange(np.min(pval),np.max(pval),step):
        cvPrecision = pval<epsilon
        tp = np.sum((cvPrecision == 1) & (yval == 1)).astype(float)  # sum returns an int, convert to float
        fp = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)
        fn = np.sum((cvPrecision == 0) & (yval == 1)).astype(float)
        precision = tp/(tp+fp)  # precision
        recall = tp/(tp+fn)     # recall
        F1 = (2*precision*recall)/(precision+recall)    # F1 score
        if F1 > bestF1:         # keep the best F1 score
            bestF1 = F1
            bestEpsilon = epsilon
    return bestEpsilon,bestF1
```
4. How to choose the features (univariate Gaussian model)
- If some data does not follow a Gaussian distribution, you can transform it, for example with log(x+C), x^(1/2), etc. (a small sketch follows below)
- If p(x) is large regardless of whether an example is anomalous, you can try combining several features into new ones (because there may be relationships between the features)
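A small sketch of such a transformation (the constant C and the data are illustrative):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=1000)   # skewed, clearly non-Gaussian data
C = 1.0                                           # illustrative constant
x_log = np.log(x + C)                             # log transform
x_sqrt = np.sqrt(x)                               # square-root transform
# the transformed features are usually much closer to a Gaussian shape
```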
5. Multivariate Gaussian distribution
- Problems with the univariate Gaussian model
- In the figure, the red point is an anomaly and the rest are normal (e.g. CPU load vs. memory use)
- The Gaussian distribution of x1 is shown below:
- The Gaussian distribution of x2 is shown below:
- You can see that the corresponding values of p(x1) and p(x2) are not unusually small, so the point is not flagged as anomalous
- Because the model treats the features as independent of each other, the density in the figure above spreads out in axis-aligned circles
- Multivariate Gaussian distribution
- Instead of modeling p(x1), p(x2), ..., p(xn) separately, it builds a single unified p(x)
- Parameters: $\mu \in \mathbb{R}^n$ and the covariance matrix $\Sigma \in \mathbb{R}^{n\times n}$, with $p(x) = \dfrac{1}{(2\pi)^{n/2}\,\lvert\Sigma\rvert^{1/2}}\exp\left(-\dfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$
- As before, the smaller $\lvert\Sigma\rvert$ is, the sharper p(x) is
- For example, a $\Sigma$ with positive off-diagonal entries represents a positive correlation between x1 and x2, i.e. the larger x1 is, the larger x2 tends to be; as shown in the figure below, the red anomalous point can then be detected. If the off-diagonal entries are negative, x1 and x2 are negatively correlated
- Implementation code:
```python
# Multivariate Gaussian distribution function
def multivariateGaussian(X,mu,Sigma2):
    k = len(mu)
    if (Sigma2.shape[0]>1):
        Sigma2 = np.diag(Sigma2)    # if sigma2 is given per feature, turn it into a diagonal covariance matrix
    '''multivariate Gaussian density'''
    X = X-mu
    argu = (2*np.pi)**(-k/2)*np.linalg.det(Sigma2)**(-0.5)
    p = argu*np.exp(-0.5*np.sum(np.dot(X,np.linalg.inv(Sigma2))*X,axis=1))  # axis=1: one value per row
    return p
```
6. Characteristics of the univariate and multivariate Gaussian models
- Univariate Gaussian model
- Can be used when the relationships between features can be captured manually (by constructing new features)
- Computationally cheap
- Multivariate Gaussian distribution
- Automatically captures the correlations between features
- Computationally expensive, because the nxn matrix $\Sigma$ must be inverted
- Requires m > n, or more precisely that $\Sigma$ is invertible (if it is not invertible, there may be redundant, linearly dependent features, or m < n)
7. Program running results
- Display the data
- contour
- Outlier labeling