Contents

1. Linear regression

  • All the code

1. Cost function

def computerCost(X, y, theta):
    m = len(y)
    J = 0
    J = (np.transpose(X*theta - y))*(X*theta - y)/(2*m)   # compute the cost J
    return J
  • Note that X has a column of 1s prepended to it, corresponding to the intercept term theta(0).
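  • In symbols, the cost computed above is the standard squared-error cost (written here for reference; it matches the vectorized expression in computerCost):

    J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2} = \frac{1}{2m}(X\theta - y)^{T}(X\theta - y)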

2. Gradient descent algorithm

  • Taking the partial derivative of the cost function gives: ∂J(θ)/∂θ_j = (1/m)·Σ_i (h(x^(i)) − y^(i))·x_j^(i)
  • So the update to theta can be written as: θ_j := θ_j − α·(1/m)·Σ_i (h(x^(i)) − y^(i))·x_j^(i)
  • Here α is the learning rate, which controls the speed of gradient descent; typical values are 0.01, 0.03, 0.1, 0.3, ...
  • Why gradient descent gradually reduces the cost function
  • Suppose a function f(x)
  • Taylor expansion: f(x + Δx) = f(x) + f′(x)·Δx + o(Δx)
  • Let: Δx = −α·f′(x), i.e. the direction of the negative gradient times a very small step size α
  • Substituting Δx into the Taylor expansion: f(x + Δx) = f(x) − α·[f′(x)]² + o(Δx)
  • As you can see, α is a very small positive number and [f′(x)]² is also positive, so we can conclude: f(x + Δx) ≤ f(x)
  • So moving along the negative gradient direction decreases the function value; the same holds in multiple dimensions.
  • The implementation code
def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    n = len(theta)
    temp = np.matrix(np.zeros((n, num_iters)))   # stores theta after each iteration
    J_history = np.zeros((num_iters, 1))         # records the cost of each iteration
    for i in range(num_iters):
        h = np.dot(X, theta)                     # hypothesis values (matrix multiplication)
        temp[:, i] = theta - ((alpha/m)*(np.dot(np.transpose(X), h - y)))   # gradient step
        theta = temp[:, i]
        J_history[i] = computerCost(X, y, theta) # record the cost
        #print('.')
    return theta, J_history

3. Mean normalization

  • The goal is to scale the data into a comparable range so that the gradient descent algorithm converges easily
  • Each feature is replaced by (x − μ)/s, where μ is the mean of all the data for that feature
  • s can be the max−min range, or the standard deviation of the data for that feature
  • Implementation code:
def featureNormaliza(X):
    X_norm = np.array(X)               # convert to a numpy array
    mu = np.zeros((1, X.shape[1]))
    sigma = np.zeros((1, X.shape[1]))
    mu = np.mean(X_norm, 0)            # mean of each column
    sigma = np.std(X_norm, 0)          # standard deviation of each column
    for i in range(X.shape[1]):        # iterate over the columns
        X_norm[:, i] = (X_norm[:, i] - mu[i])/sigma[i]   # normalize
    return X_norm, mu, sigma
  • Note that data used for prediction must also be mean-normalized, with the same mu and sigma obtained from the training data (see the sketch below)
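  • A minimal sketch of that point, reusing the mu, sigma and theta obtained above (the raw feature values here are placeholders):

    # mu and sigma come from featureNormaliza on the training data;
    # a new example must be scaled with those same values before predicting
    x_new = np.array([2104.0, 3.0])          # hypothetical raw feature values
    x_new = (x_new - mu)/sigma               # same mean normalization as the training set
    x_new = np.hstack((np.ones(1), x_new))   # prepend the bias 1
    prediction = np.dot(x_new, theta)        # theta learned by gradientDescent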

4. Final running results

  • The cost changes with the number of iterations

5. Implementation with the scikit-learn linear model

  • Import packages
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler   # for mean normalization
  • The normalized
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
x_test = scaler.transform(np.array([1650]))
  • Linear model fitting
model = linear_model.LinearRegression()
model.fit(x_train, y)
  • To predict
result = model.predict(x_test)

2. Logistic regression

  • All the code

1. Cost function

As can be seen, when h(x) tends to 1 and y=1, the prediction agrees with the label and the cost tends to 0; when h(x) tends to 0 but y=1, the cost is a very large value. Our ultimate goal is to minimize this cost.

  • Similarly, the curve for the y=0 case is shown below:
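  • For reference, the (unregularized) cross-entropy cost being described is, in its standard form, with h the sigmoid hypothesis:

    J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_{\theta}(x^{(i)})\right)\right]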

2, gradient

  • Taking the partial derivative of the cost function: it can be seen that it has the same form as the partial derivative in linear regression
  • Derivation process:
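  • The result of that derivation, written out (it matches the unregularized part of the gradient code in section 3 below):

    \frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)x_{j}^{(i)}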

3. Regularization

  • The goal is to prevent overfitting
  • Add a term to the cost function
  • Note that the regularization sum starts at j=1, because theta(0) is the bias term: the first column of X is all 1s, so its product contributes only theta(0); it is not tied to any feature, and there is no need to regularize it
  • Cost after regularization:
def costFunction(initial_theta, X, y, inital_lambda):
    m = len(y)
    J = 0
    h = sigmoid(np.dot(X, initial_theta))   # compute h(z)
    theta1 = initial_theta.copy()           # copy theta and zero out theta(0), which is not regularized
    theta1[0] = 0
    temp = np.dot(np.transpose(theta1), theta1)
    J = (-np.dot(np.transpose(y), np.log(h)) - np.dot(np.transpose(1-y), np.log(1-h)) + temp*inital_lambda/2)/m   # regularized cost
    return J
  • The regularized gradient of the cost
def gradient(initial_theta, X, y, inital_lambda):
    m = len(y)
    grad = np.zeros((initial_theta.shape[0]))
    h = sigmoid(np.dot(X, initial_theta))   # compute h(z)
    theta1 = initial_theta.copy()           # theta(0) is not regularized
    theta1[0] = 0
    grad = np.dot(np.transpose(X), h - y)/m + inital_lambda/m*theta1   # regularized gradient
    return grad

4. Sigmoid function g(z)

  • Implementation code:
def sigmoid(z):
    h = np.zeros((len(z), 1))
    h = 1.0/(1.0 + np.exp(-z))
    return h

5. Mapping to polynomial features

  • Because the data may have too few features, leading to large bias, we create some combinations of features
  • e.g. mapping the two features into polynomial terms up to a given power:
  • Implementation code:
def mapFeature(X1, X2):
    degree = 3                        # highest power of the mapping
    out = np.ones((X1.shape[0], 1))   # column of ones for the bias term
    '''
    generate every term X1^(i-j) * X2^j for i = 1..degree, j = 0..i
    '''
    for i in np.arange(1, degree+1):
        for j in range(i+1):
            temp = X1**(i-j)*(X2**j)             # elementwise product (like .* in MATLAB)
            out = np.hstack((out, temp.reshape(-1, 1)))
    return out

6. Using the optimization method in scipy

  • Gradient descent uses the fmin_bfgs function from scipy's optimize module
  • It calls scipy's fmin_bfgs optimization algorithm (the quasi-Newton method Broyden-Fletcher-Goldfarb-Shanno)
  • costFunction is the self-implemented cost function
  • initial_theta is the initialized parameter value
  • fprime specifies the gradient of costFunction
  • args are the remaining parameters, passed in as a tuple; the call finally returns the theta that minimizes costFunction
    result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X,y,initial_lambda))    

7. Running results

  • Data1 decision boundaries and accuracy
  • Data2 decision boundaries and accuracy

8. Using the logistic regression model in the scikit-learn library

  • Import packages
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
import numpy as np
  • Divide training set and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
  • The normalized
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)   # use the scaler fitted on the training set for the test set as well
  • Logistic regression
model = LogisticRegression()
model.fit(x_train, y_train)
  • To predict
# predict
predict = model.predict(x_test)
right = sum(predict == y_test)

predict = np.hstack((predict.reshape(-1, 1), y_test.reshape(-1, 1)))      # put predictions and true values side by side for inspection
print('accuracy on the test set: %f%%' % (right*100.0/predict.shape[0]))  # accuracy on the test set

Logistic regression: handwritten digit recognition (OneVsAll)

  • All the code

1. Randomly display 100 numbers

  • The dataset is not the one from scikit-learn; each image is 20*20 px, displayed here as a grayscale image:
  • Implementation code:
def display_data(imgData):
    sum = 0
    '''
    Display 100 digits (drawing them one by one would be very slow, so arrange
    the digits to be drawn into one big matrix and display that matrix)
    - initialize a two-dimensional array
    - reshape each row of data into the image matrix and place it into the array
    - display the array
    '''
    pad = 1
    display_array = -np.ones((pad+10*(20+pad), pad+10*(20+pad)))
    for i in range(10):
        for j in range(10):
            display_array[pad+i*(20+pad):pad+i*(20+pad)+20, pad+j*(20+pad):pad+j*(20+pad)+20] = (
                imgData[sum, :].reshape(20, 20, order="F"))   # order="F" fills column by column; the default fills row by row
            sum += 1
    plt.imshow(display_array, cmap='gray')   # show the grayscale image
    plt.axis('off')
    plt.show()

2, OneVsAll

  • How to solve a multi-class problem with logistic regression: OneVsAll treats one class as the positive class and all other classes together as the negative class, which gives a binary classification problem
  • As shown in the figure below, the data are divided into three classes: first treat the red points as one class and everything else as the other class and run logistic regression, then treat the blue points as one class and the rest as the other, and so on...
  • It can be seen that when the number of classes is greater than 2, logistic regression is run once per class

3. Handwritten digit recognition

  • There are 10 digits, 0-9, so 10 classifiers are needed
  • Since the dataset gives y as 0...9 while logistic regression needs 0/1 labels, y has to be processed
  • A note on the dataset: the first 500 examples are 0, examples 500-1000 are 1, and so on. After processing, as shown below, the first 500 rows of y have a 1 in the first column and 0 elsewhere, rows 500-1000 have a 1 in the second column and 0 elsewhere, ...
  • Then call the gradient descent algorithm to solve for theta
  • Implementation code:
def oneVsAll(X, y, num_labels, Lambda):
    # initialize the variables
    m, n = X.shape
    all_theta = np.zeros((n+1, num_labels))   # one column of theta per class
    X = np.hstack((np.ones((m, 1)), X))       # prepend a column of ones (the bias)
    class_y = np.zeros((m, num_labels))       # y is 0-9 and must be mapped to 0/1 indicator columns
    initial_theta = np.zeros((n+1, 1))        # initialize theta for one class

    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)   # note: reshape(1,-1) is needed for the assignment
    #np.savetxt("class_y.csv", class_y[0:600, :], delimiter=',')

    '''iterate over the classes and compute the corresponding theta for each'''
    for i in range(num_labels):
        result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient,
                                    args=(X, class_y[:, i], Lambda))   # call the optimization method
        all_theta[:, i] = result.reshape(1, -1)   # put the result into all_theta
    all_theta = np.transpose(all_theta)
    return all_theta

4, forecasting

  • As mentioned, the predicted result is a probability value: plug the learned theta and the input into the sigmoid function; the column index of the maximum of each row (the digit with the highest probability) is the predicted value, because during classification the y of every 0 was mapped to the first column, every 1 to the second column, and so on
  • Implementation code:
def predict_oneVsAll(all_theta, X):
    m = X.shape[0]
    num_labels = all_theta.shape[0]
    p = np.zeros((m, 1))
    X = np.hstack((np.ones((m, 1)), X))               # prepend a column of ones
    h = sigmoid(np.dot(X, np.transpose(all_theta)))   # predictions

    '''
    Return the column index of the maximum value of each row of h
    - np.max(h, axis=1) returns the maximum of each row
    - np.where finds the coordinates of that maximum; there may be several equal
      maxima, so only the first m indices are kept (one per row)
    '''
    p = np.array(np.where(h[0, :] == np.max(h, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h[i, :] == np.max(h, axis=1)[i]))
        p = np.vstack((p, t))
    return p

5. Running results

  • 10 times classification, accuracy on training set:

6. Using the logistic regression model in the scikit-learn library

  • 1. Import the package
from scipy import io as spio
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
  • 2. Load data
data = loadmat_data("data_digits.mat")
X = data['X']       # each example is 20x20 px
y = data['y']       # shape=(5000, 1)
y = np.ravel(y)
  • 3. Fitting model
model = LogisticRegression()
model.fit(X, y)     # fit
  • 4, forecasting
predict = model.predict(X)   # predict on the training set
  • 5. Output results (accuracy on training set)

BP neural network

1. Neural network Model

2. Cost function

3. Regularization

def nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    length = nn_params.shape[0]
    # restore Theta1 and Theta2 from the unrolled parameter vector
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size, input_layer_size+1)
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels, hidden_layer_size+1)
    #np.savetxt("Theta1.csv", Theta1, delimiter=',')

    m = X.shape[0]
    class_y = np.zeros((m, num_labels))   # y is 0-9 and must be mapped to 0/1 indicator columns
    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)   # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of each Theta, because the bias terms are not regularized'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:, 1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:, 1:Theta2_colCount]
    # regularization term: sum of theta^2
    term = np.dot(np.transpose(np.vstack((Theta1_x.reshape(-1, 1), Theta2_x.reshape(-1, 1)))),
                  np.vstack((Theta1_x.reshape(-1, 1), Theta2_x.reshape(-1, 1))))

    '''forward propagation; a column of 1s (the bias) is added before each layer'''
    a1 = np.hstack((np.ones((m, 1)), X))
    z2 = np.dot(a1, np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m, 1)), a2))
    z3 = np.dot(a2, np.transpose(Theta2))
    h = sigmoid(z3)

    '''cross-entropy cost with the regularization term'''
    J = -(np.dot(np.transpose(class_y.reshape(-1, 1)), np.log(h.reshape(-1, 1)))
          + np.dot(np.transpose(1 - class_y.reshape(-1, 1)), np.log(1 - h.reshape(-1, 1)))
          - Lambda*term/2)/m
    return np.ravel(J)

4. Back propagation (BP) to obtain the gradient

def nnGradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    length = nn_params.shape[0]
    # restore Theta1 and Theta2; use .copy(), otherwise modifying Theta would also modify nn_params
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size, input_layer_size+1).copy()
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels, hidden_layer_size+1).copy()
    m = X.shape[0]
    class_y = np.zeros((m, num_labels))   # y is 0-9 and must be mapped to 0/1 indicator columns
    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)   # note: reshape(1,-1) is needed for the assignment

    '''drop the first column of each Theta, because the bias terms are not regularized'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:, 1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:, 1:Theta2_colCount]

    Theta1_grad = np.zeros((Theta1.shape))   # gradient of the first layer
    Theta2_grad = np.zeros((Theta2.shape))   # gradient of the second layer

    '''forward propagation; a column of 1s (the bias) is added before each layer'''
    a1 = np.hstack((np.ones((m, 1)), X))
    z2 = np.dot(a1, np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m, 1)), a2))
    z3 = np.dot(a2, np.transpose(Theta2))
    h = sigmoid(z3)

    '''back propagation: delta is the error of each layer'''
    delta3 = np.zeros((m, num_labels))
    delta2 = np.zeros((m, hidden_layer_size))
    for i in range(m):
        #delta3[i,:] = (h[i,:]-class_y[i,:])*sigmoidGradient(z3[i,:])  # error for a squared-error cost
        delta3[i, :] = h[i, :] - class_y[i, :]                          # error for the cross-entropy cost
        Theta2_grad = Theta2_grad + np.dot(np.transpose(delta3[i, :].reshape(1, -1)), a2[i, :].reshape(1, -1))
        delta2[i, :] = np.dot(delta3[i, :].reshape(1, -1), Theta2_x)*sigmoidGradient(z2[i, :])
        Theta1_grad = Theta1_grad + np.dot(np.transpose(delta2[i, :].reshape(1, -1)), a1[i, :].reshape(1, -1))

    Theta1[:, 0] = 0
    Theta2[:, 0] = 0
    '''the gradient, with regularization (the bias columns, set to 0 above, are not regularized)'''
    grad = (np.vstack((Theta1_grad.reshape(-1, 1), Theta2_grad.reshape(-1, 1)))
            + Lambda*np.vstack((Theta1.reshape(-1, 1), Theta2.reshape(-1, 1))))/m
    return np.ravel(grad)

5. The reason why BP can calculate the gradient

  • It is essentially an application of the chain rule for derivatives
  • Because each unit in a layer is computed using the units of the previous layer as input
  • The general idea of the derivation: ultimately we want the predicted output to be very close to the known y; taking the gradient of the cost and moving along the negative gradient direction minimizes the cost function. Refer to the gradient computation above.
  • A more detailed derivation of the error terms:
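  • A compact way to state the chain-rule recursion that the nnGradient code above implements (g is the sigmoid, the output error is the cross-entropy form, and the bias columns of Θ are not regularized):

    \delta^{(3)} = a^{(3)} - y
    \delta^{(2)} = \left(\delta^{(3)}\,\Theta_{2}^{\text{(no bias)}}\right)\odot g'(z^{(2)})
    \Theta_{2}^{\text{grad}} = \frac{1}{m}\sum_{i=1}^{m}\delta_{i}^{(3)T}a_{i}^{(2)} + \frac{\lambda}{m}\Theta_{2},\qquad
    \Theta_{1}^{\text{grad}} = \frac{1}{m}\sum_{i=1}^{m}\delta_{i}^{(2)T}a_{i}^{(1)} + \frac{\lambda}{m}\Theta_{1}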

6. Gradient check

  • Check that the gradient obtained with BP is correct
  • Verify using the definition of the derivative: the numerical gradient is approximately (J(θ+ε) − J(θ−ε)) / (2ε)
  • The numerical gradient should be very close to the gradient computed by BP
  • Once BP has been verified to be correct, there is no need to keep running the gradient-check algorithm
  • Implementation code:
def checkGradient(Lambda=0):
    '''construct a small neural network for the check, because computing the numerical gradient is expensive'''
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    initial_Theta1 = debugInitializeWeights(input_layer_size, hidden_layer_size)
    initial_Theta2 = debugInitializeWeights(hidden_layer_size, num_labels)
    X = debugInitializeWeights(input_layer_size-1, m)
    y = 1 + np.transpose(np.mod(np.arange(1, m+1), num_labels))   # initialize y
    y = y.reshape(-1, 1)
    nn_params = np.vstack((initial_Theta1.reshape(-1, 1), initial_Theta2.reshape(-1, 1)))   # unroll the parameters

    # gradient from back propagation
    grad = nnGradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda)

    '''compute the gradient numerically'''
    num_grad = np.zeros((nn_params.shape[0]))
    step = np.zeros((nn_params.shape[0]))
    e = 1e-4
    for i in range(nn_params.shape[0]):
        step[i] = e
        loss1 = nnCostFunction(nn_params - step.reshape(-1, 1), input_layer_size, hidden_layer_size,
                               num_labels, X, y, Lambda)
        loss2 = nnCostFunction(nn_params + step.reshape(-1, 1), input_layer_size, hidden_layer_size,
                               num_labels, X, y, Lambda)
        num_grad[i] = (loss2 - loss1)/(2*e)
        step[i] = 0
    # show the numerical gradient and the BP gradient side by side
    res = np.hstack((num_grad.reshape(-1, 1), grad.reshape(-1, 1)))
    print(res)

7. Random initialization of weights

  • A neural network cannot initialize theta to 0 the way logistic regression does: if the weight of every edge is 0, every neuron has the same output, they all receive the same gradient during back propagation, and only one result is ever predicted.
  • So it should be initialized to something close to zero
  • The implementation code
def randInitializeWeights(L_in, L_out):
    W = np.zeros((L_out, 1+L_in))            # the weights, corresponding to theta
    epsilon_init = (6.0/(L_out + L_in))**0.5
    W = np.random.rand(L_out, 1+L_in)*2*epsilon_init - epsilon_init   # np.random.rand(L_out,1+L_in) generates a random L_out x (1+L_in) matrix
    return W

8, forecasting

  • Forward propagation of predicted results
  • The implementation code
def predict(Theta1, Theta2, X):
    m = X.shape[0]
    num_labels = Theta2.shape[0]
    #p = np.zeros((m,1))
    '''forward propagation to predict the results'''
    X = np.hstack((np.ones((m, 1)), X))
    h1 = sigmoid(np.dot(X, np.transpose(Theta1)))
    h1 = np.hstack((np.ones((m, 1)), h1))
    h2 = sigmoid(np.dot(h1, np.transpose(Theta2)))

    '''
    Return the column index of the maximum of each row of h2 (the digit with the highest
    probability); that column index is the predicted digit
    - np.max(h2, axis=1) returns the maximum of each row
    - np.where finds the column where that maximum occurs
    '''
    #np.savetxt("h2.csv", h2, delimiter=',')
    p = np.array(np.where(h2[0, :] == np.max(h2, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h2[i, :] == np.max(h2, axis=1)[i]))
        p = np.vstack((p, t))
    return p

9. Output results

  • Gradient check:
  • Random display of 100 handwritten digits
  • Show theta1 weight
  • Training set prediction accuracy
  • Prediction accuracy of training set after normalization

4. SVM support vector machine

1. Cost function

  • In logistic regression, the per-example cost is: cost = −y·log(h(x)) − (1−y)·log(1−h(x)), where h(x) = g(θᵀx) = 1/(1+e^(−θᵀx)) and z = θᵀx
  • As shown in the figure, when y=1 the cost curve is as drawn



    We want z = θᵀx >> 0 in this case, so that the cost tends to its minimum (which is what we want), so the red piecewise-linear function cost1(z) is used to replace the cost term of logistic regression
  • When y=0, cost0(z) is used instead in the same way
  • The resulting cost function is the SVM objective (see below); in the end we want to minimize it
  • The cost function in our logistic regression was the regularized cross-entropy cost from section 2 above



    You can think of C here as playing the role of 1/λ: it is only a difference in form. The larger the value of C, the larger the margin of the SVM decision boundary, as explained below
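
  • In symbols, the SVM objective being described (with cost1 and cost0 the piecewise-linear replacements for the two logistic cost curves) is:

    \min_{\theta}\; C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^{T}x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^{T}x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_{j}^{2}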

2, Large Margin

3. SVM Kernel

4. Code using the SVM model in scikit-learn

data1 = spio.loadmat('data1.mat')
X = data1['X']
y = data1['y']
y = np.ravel(y)
plot_data(X, y)

model = svm.SVC(C=1.0, kernel='linear').fit(X, y)   # specify kernel='linear' for a linear kernel
  • Non-linearly separable data; the default kernel function is rbf
data2 = spio.loadmat('data2.mat')
X = data2['X']
y = data2['y']
y = np.ravel(y)
plt = plot_data(X, y)
plt.show()

model = svm.SVC(gamma=100).fit(X, y)   # gamma is the kernel coefficient; the default kernel is rbf

5. Running results

  • Linearly separable decision boundary:
  • Linearly indivisible decision boundary:

5. K-means clustering algorithm

  • All the code

1. Clustering process

def findClosestCentroids(X, initial_centroids):
    m = X.shape[0]                      # number of data points
    K = initial_centroids.shape[0]      # number of classes
    dis = np.zeros((m, K))              # stores the distance from each point to each of the K centroids
    idx = np.zeros((m, 1))              # the class whose centroid is closest to each point

    '''compute the distance from every point to every centroid'''
    for i in range(m):
        for j in range(K):
            dis[i, j] = np.dot((X[i, :] - initial_centroids[j, :]).reshape(1, -1),
                               (X[i, :] - initial_centroids[j, :]).reshape(-1, 1))

    '''
    Return the column index corresponding to the minimum value of each row of dis
    - np.min(dis, axis=1) returns the minimum of each row
    - np.where(dis == np.min(dis, axis=1).reshape(-1,1)) returns its coordinates; there may be
      several coordinates for equal minima, so only the first m are returned (one per row)
    '''
    dummy, idx = np.where(dis == np.min(dis, axis=1).reshape(-1, 1))
    return idx[0:dis.shape[0]]   # keep the first m
  • Computing class center implementation code:
def computerCentroids(X, idx, K):
    n = X.shape[1]
    centroids = np.zeros((K, n))
    for i in range(K):
        # find the points assigned to class i (idx==i) and take the column-wise mean (axis=0)
        centroids[i, :] = np.mean(X[np.ravel(idx == i), :], axis=0).reshape(1, -1)
    return centroids

2. Objective function

3. Selection of clustering centers

  • Random initialization: randomly pick K points from the given data as the cluster centers
  • This can be run several times, keeping the result that minimizes the cost function as the final centers
  • Implementation code (the random initialization):
def kMeansInitCentroids(X, K):
    m = X.shape[0]
    m_arr = np.arange(0, m)                 # indices 0..m-1
    centroids = np.zeros((K, X.shape[1]))
    np.random.shuffle(m_arr)                # shuffle the indices
    rand_indices = m_arr[:K]                # take the first K
    centroids = X[rand_indices, :]
    return centroids

4. Selection of clustering number K

  • For clustering the labels y are not known, so the true number of clusters is unknown
  • The Elbow Method
  • Plot the cost function J against K; if there is a clear inflection point ("elbow"), as shown below, take K at that point (here K=3)
  • If the curve is very smooth the elbow is not clear, and K must be chosen manually
  • The second approach is to choose K by human observation; a sketch of the elbow method follows this list
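  • A minimal sketch of the elbow method using scikit-learn's KMeans (its inertia_ attribute is the within-cluster sum of squared distances, i.e. the cost J here; the range of K tried is arbitrary):

    from sklearn.cluster import KMeans
    from matplotlib import pyplot as plt

    costs = []
    K_range = range(1, 11)                   # candidate numbers of clusters
    for k in K_range:
        model = KMeans(n_clusters=k).fit(X)  # run K-means with k clusters
        costs.append(model.inertia_)         # within-cluster sum of squared distances
    plt.plot(list(K_range), costs, 'o-')     # look for the "elbow" in this curve
    plt.xlabel('K')
    plt.ylabel('cost J')
    plt.show()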

5. Application — image compression

  • Cluster the pixel values of the image into several classes, then replace each original pixel value by the center of its class
  • Algorithm code to perform clustering:
def runKMeans(X, initial_centroids, max_iters, plot_process):
    m, n = X.shape                          # number of data points and features
    K = initial_centroids.shape[0]          # number of classes
    centroids = initial_centroids           # record the current cluster centers
    previous_centroids = centroids          # record the previous cluster centers
    idx = np.zeros((m, 1))                  # which class each data point belongs to

    for i in range(max_iters):              # iterate
        print('iteration %d' % (i+1))
        idx = findClosestCentroids(X, centroids)
        if plot_process:                    # if the process should be plotted
            plt = plotProcessKMeans(X, centroids, previous_centroids)   # draw how the cluster centers move
            previous_centroids = centroids  # reset
        centroids = computerCentroids(X, idx, K)   # recompute the cluster centers

    if plot_process:                        # show the final plot
        plt.show()
    return centroids, idx                   # return the cluster centers and the class of each data point
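  • A minimal sketch of the compression step described above, built on kMeansInitCentroids and runKMeans (the image array img and the choice K=16 are assumptions):

    K = 16                                          # number of colours to keep (assumption)
    X = img.reshape(-1, 3)/255.0                    # every pixel becomes one 3-D (RGB) example
    initial_centroids = kMeansInitCentroids(X, K)
    centroids, idx = runKMeans(X, initial_centroids, 10, False)   # 10 iterations, no plotting
    X_compressed = centroids[np.ravel(idx).astype(int), :]        # replace each pixel by its cluster center
    img_compressed = X_compressed.reshape(img.shape)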

6. Clustering with the KMeans model in the scikit-learn library

  • Import packages
    from sklearn.cluster import KMeans
Copy the code
  • Use models to fit data
model = KMeans(n_clusters=3).fit(X)   # n_clusters specifies 3 classes; fit the data
  • The clustering center
centroids = model.cluster_centers_    # the cluster centers

7. Running results

  • Movement of 2d data class center
  • Image compression

6. PCA Principal Component Analysis (Dimension reduction)

  • All the code

1, use

  • Data Compression makes the program run faster
  • Visualizing data, for example3D-->2DEtc.
  • .

2, 2D–>1D, nD–>kD

  • As shown in the figure below, all data points are projected onto a straight line so that the sum of squared projection distances (the projection error) is minimized
  • Note that the data needs to be normalized first
  • The idea is to find 1 vector u such that projecting all the data onto it minimizes the projection distance
  • Then nD-->kD means finding k vectors such that projecting all the data onto them minimizes the projection error
  • e.g. 3D-->2D: 2 vectors span a plane, and the projection error of all points onto this plane is minimized

3. The difference between principal component analysis and linear regression

  • Linear regression looks for the relationship between x and y, which is then used to predict y
  • PCA looks for a projection surface that minimizes the projection error of the data onto that surface

4. PCA dimension reduction process

  • Data preprocessing (mean normalization)
  • Formula:
  • That is, subtract the mean of the corresponding feature and divide by its standard deviation (the max−min range can also be used)
  • Implementation code:
def featureNormalize(X):
    '''normalize the data (the data should be normalized before PCA)'''
    n = X.shape[1]          # number of features
    mu = np.zeros((1, n))
    sigma = np.zeros((1, n))
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    for i in range(n):
        X[:, i] = (X[:, i] - mu[i])/sigma[i]
    return X, mu, sigma
  • Compute the covariance matrix Σ (Covariance Matrix)
  • Note that Σ here is not the summation symbol
  • The covariance matrix is symmetric positive definite (if you are not familiar with positive definiteness, see a linear algebra text)
  • Its size is n x n, where n is the dimension of the features
  • Implementation code:
Sigma = np.dot(np.transpose(X_norm), X_norm)/m   # compute the covariance matrix Sigma
  • Compute the eigenvalues and eigenvectors of Σ
  • The svd singular value decomposition can be used: U, S, V = svd(Σ)
  • It returns a diagonal matrix S of the same size as Σ (made of the eigenvalues of Σ) [note: the MATLAB function returns a diagonal matrix, while in Python it is a vector, which saves space]
  • It also returns two unitary matrices U and V, with Σ = U·S·Vᵀ
  • Note: the S returned by svd is sorted by decreasing eigenvalue; if svd is not used, U must be re-ordered by eigenvalue size
  • Dimensionality reduction
  • Select the first K columns of U (assuming we reduce to K dimensions)
  • Z is the data after dimensionality reduction
  • Implementation code:
def projectData(X_norm, U, K):
    Z = np.zeros((X_norm.shape[0], K))
    U_reduce = U[:, 0:K]          # take the first K columns
    Z = np.dot(X_norm, U_reduce)
    return Z
  • Process summary:
  • Sigma = X'*X/m
  • U,S,V = svd(Sigma)
  • Ureduce = U[:,0:k]
  • Z = Ureduce'*x
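  • Putting the summary above together as a minimal numpy sketch (np.linalg.svd returns S as a vector, as noted earlier; K=2 is just an assumed target dimension):

    import numpy as np

    X_norm, mu, sigma = featureNormalize(X)             # mean normalization (defined above)
    m = X_norm.shape[0]
    Sigma = np.dot(np.transpose(X_norm), X_norm)/m      # n x n covariance matrix
    U, S, V = np.linalg.svd(Sigma)                      # eigenvectors in the columns of U, eigenvalues in S (descending)
    K = 2                                               # assumed target dimension
    Z = projectData(X_norm, U, K)                       # reduced data: X_norm dot U[:, 0:K]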

5. Data recovery

  • Because: z = Ureduceᵀ·x
  • So: x_approx = Ureduce·z (note that this is only an approximation of x)
  • And because Ureduce satisfies AᵀA = E (its columns are orthonormal), its transpose acts as its inverse here, which is why multiplying z by Ureduce recovers the data; in the row-data form used in the code, X_rec = Z·Ureduceᵀ
  • Implementation code:
def recoverData(Z, U, K):
    X_rec = np.zeros((Z.shape[0], U.shape[0]))
    U_recude = U[:, 0:K]
    X_rec = np.dot(Z, np.transpose(U_recude))   # recover the data (approximately)
    return X_rec

6. Selection of the number of principal components (i.e. the dimension to be reduced)

7. Usage suggestions

  • Do not use PCA to solve the overfitting problem; use regularization instead (it is acceptable if a high proportion of the variance is retained)
  • Consider using PCA only if you have good results on raw data but are slow to run

8. Running results

  • The 2-d data has been reduced to 1-d
  • The direction of the projection
  • 2D reduced to 1D and the corresponding relationship
  • Face data dimensionality reduction
  • The original data
  • Visualization of part of the U matrix

  • Restore data

9. Dimensionality reduction with PCA from the scikit-learn library

  • Import required packages:
#-*- coding: utf-8 -*-
# Author: Bob
# Date: 2016.12.22
import numpy as np
from matplotlib import pyplot as plt
from scipy import io as spio
from sklearn.decomposition import pca
from sklearn.preprocessing import StandardScaler
  • Normalized data
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
  • PCA model was used to fit data and reduce dimension
  • n_components corresponds to the dimension to reduce to
model = pca.PCA(n_components=K).fit(x_train)   # n_components sets the dimension to reduce to
Z = model.transform(x_train)                   # transform performs the dimensionality reduction
  • Data recovery
  • model.components_ is the Ureduce matrix used for the dimensionality reduction
# data recovery and plotting
Ureduce = model.components_        # get Ureduce
x_rec = np.dot(Z, Ureduce)         # recover the data

7. Anomaly Detection

1. Gaussian distribution (normal distribution)

  • Probability density function:
  • where u is the mean of the data and sigma is its standard deviation
  • The smaller sigma is, the sharper (more peaked) the corresponding curve
  • Parameter estimation (parameter estimation):
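  • For reference, the density and the parameter estimates being described are:

    p(x;\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),\qquad
    \mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)},\qquad
    \sigma^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^{2}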

2. Anomaly detection algorithm

  • Example
  • Training set: x^(1), ..., x^(m), where each x ∈ R^n
  • Assume the features are independent of each other and build the model: p(x) = p(x1)·p(x2)·...·p(xn)
  • Process
  • Select features xi that are representative of anomalies
  • Estimate the parameters u and sigma for each feature
  • Compute p(x); if p(x) < ε the example is considered anomalous, where ε is the probability threshold we choose (threshold)
  • This is only the univariate Gaussian distribution, which assumes the features are independent of each other; the multivariate Gaussian distribution discussed below automatically captures the relationships between features
  • Parameter estimation implementation code
def estimateGaussian(X):
    m, n = X.shape
    mu = np.zeros((n, 1))
    sigma2 = np.zeros((n, 1))
    mu = np.mean(X, axis=0)       # axis=0 computes the mean of each column
    sigma2 = np.var(X, axis=0)    # variance of each column
    return mu, sigma2

3. Evaluating how good p(x) is, and the choice of ε

  • Error measures for skewed data

  • Because the data can be very skewed (the number of y=1 examples is very small, where y=1 indicates an anomaly), Precision/Recall and the F1 score can be used for evaluation (on the CV cross-validation set)

  • For example, when predicting cancer: suppose a model achieves 99% correct predictions and a 1% error rate, but the actual probability of cancer is very small, only 0.5%. Then always predicting "no cancer" (y=0) gives an even smaller error rate, yet the model is useless, so using the error rate for evaluation is not scientific.

  • As shown below:

  • Precision = TP/(TP+FP), that is: correctly predicted positive samples / all samples predicted as positive

  • Recall = TP/(TP+FN), that is: correctly predicted positive samples / all samples whose true value is positive

  • Always let y=1 denote the rarer class, and compute Precision and Recall for it

  • Taking cancer prediction as an example: suppose everything is predicted as no-cancer, so TN=199, FN=1, TP=0, FP=0. Then Precision=0/0 and Recall=0/1=0; although accuracy=199/200=99.5%, the classifier is not credible.

  • The choice of ε

  • Try multiple values of ε and pick the one that gives the highest F1 score

  • The implementation code

def selectThreshold(yval, pval):
    '''initialize the required variables'''
    bestEpsilon = 0.
    bestF1 = 0.
    F1 = 0.
    step = (np.max(pval) - np.min(pval))/1000

    '''try each epsilon and compute the corresponding F1 score'''
    for epsilon in np.arange(np.min(pval), np.max(pval), step):
        cvPrecision = pval < epsilon
        tp = np.sum((cvPrecision == 1) & (yval == 1)).astype(float)   # sum gives an int; convert to float
        fp = np.sum((cvPrecision == 1) & (yval == 0)).astype(float)
        fn = np.sum((cvPrecision == 0) & (yval == 1)).astype(float)
        precision = tp/(tp+fp)                               # precision
        recision = tp/(tp+fn)                                # recall
        F1 = (2*precision*recision)/(precision+recision)     # F1 score
        if F1 > bestF1:
            bestF1 = F1
            bestEpsilon = epsilon
    return bestEpsilon, bestF1

4. Which features to use (univariate Gaussian distribution)

  • If some data does not follow a Gaussian distribution, it can be transformed, e.g. log(x+C), x^(1/2), etc.
  • If p(x) is large whether or not the example is anomalous, try combining several features (because there may be relationships between the features)

5. Multivariate Gaussian distribution

  • Problems with the univariate Gaussian distribution
  • The red points below are anomalies and the rest are normal (e.g. changes in CPU and memory)
  • The Gaussian distribution of x1 is as follows:
  • The Gaussian distribution of x2 is as follows:
  • It can be seen that the corresponding values of p(x1) and p(x2) are not especially small, so the point is not flagged as anomalous
  • Because we assume the features are independent of each other, the contours in the figure above spread out in a circular fashion
  • Multivariate Gaussian distribution
  • Instead of building p(x1), p(x2), ..., p(xn) separately, build a single unified p(x)
  • Parameters: the mean vector u and the covariance matrix Σ
  • Similarly, the smaller |Σ| is, the sharper p(x) is
  • For example, a Σ with positive off-diagonal entries represents a positive correlation between x1 and x2, i.e. the larger x1 is, the larger x2 tends to be; as shown in the figure below, the red anomalies can then be detected. If the off-diagonal entries are negative, x1 and x2 are negatively correlated
  • Implementation code:
def multivariateGaussian(X, mu, Sigma2):
    k = len(mu)
    if (Sigma2.shape[0] > 1):
        Sigma2 = np.diag(Sigma2)          # turn the vector of variances into a diagonal covariance matrix
    '''multivariate Gaussian distribution function'''
    X = X - mu
    argu = (2*np.pi)**(-k/2)*np.linalg.det(Sigma2)**(-0.5)
    p = argu*np.exp(-0.5*np.sum(np.dot(X, np.linalg.inv(Sigma2))*X, axis=1))   # axis=1 sums over each row
    return p
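  • For reference, the density that multivariateGaussian computes is:

    p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)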

6. Characteristics of the univariate and multivariate Gaussian distributions

  • Univariate Gaussian distribution
  • Can be used when the relationships between features can be captured manually (by constructing combined features)
  • Cheap to compute
  • Multivariate Gaussian distribution
  • Automatically captures the relationships between related features
  • Expensive to compute, because the n x n covariance matrix Σ has to be inverted
  • Requires m > n, or Σ is not invertible (if it is not invertible there may be redundant features, which are linearly dependent, or m < n)

7. Program running results

  • Display the data
  • contour
  • Outlier labeling