1 PCA
Principal Component Analysis (PCA) is a common data analysis method. PCA transforms the original data, via a linear transformation, into a representation whose dimensions are linearly independent. It can be used to extract the main feature components of the data and is often used for dimensionality reduction of high-dimensional data.
1.1 Dimension reduction
In data mining and machine learning, data is represented as vectors. For example, the traffic and transaction figures of a Taobao store in 2012 can be regarded as a set of records, in which the data of each day is one record in the following format:
(date, number of page views, number of visitors, number of orders, number of transactions, amount of transactions)
Here, “date” is a record label rather than a metric, and data mining is mostly concerned with metrics. If we ignore the date field, we get a group of records, each of which can be represented as a five-dimensional vector of (page views, visitors, orders, transactions, transaction amount).
It is customary to use a column vector to represent a record, and this convention is followed in the rest of this article.
The complexity of many machine learning algorithms is closely related to, or even grows exponentially with, the dimensionality of the data. Five dimensions may not matter, but in practice it is not uncommon to deal with thousands or even hundreds of thousands of dimensions, in which case the resource consumption of machine learning is unacceptable, so the data must be reduced in dimension. Dimensionality reduction inevitably loses information, but because real-world features are often correlated with each other, the loss can be kept small.
For example, in the Taobao store data above, we know from experience that there is a strong correlation between “page views” and “number of visitors”, and between “number of orders” and “number of transactions”. When page views are high (or low) on a particular day, we can largely assume that the number of visitors is also high (or low) that day. Therefore, if we delete either page views or visitors, we do not lose much information while the dimensionality of the data is reduced; this is a so-called dimensionality reduction operation. Analyzing and discussing this idea with mathematical rigor leads to PCA, a dimensionality reduction method with a strict mathematical basis that is widely used.
1.2 Vectors and basis transformations
1.2.1 Inner product and projection
The inner product of two vectors of the same size is defined as follows:
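In standard notation,

$$(a_1, a_2, \ldots, a_n)^\mathsf{T} \cdot (b_1, b_2, \ldots, b_n)^\mathsf{T} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n ,$$

i.e. the sum of the element-wise products, which is a scalar. Geometrically, $A \cdot B = |A|\,|B|\cos\theta$, where $\theta$ is the angle between the two vectors; when $|B| = 1$, the inner product equals the length of the projection of $A$ onto the direction of $B$, which is the interpretation used in the rest of this section.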
1.2.2 Basis
In algebra, a vector is often represented by the coordinates of the end point of a directed line segment. Say some vector has coordinates (3, 2); the 3 here means that the projection of the vector onto the x axis is 3, and the 2 means that its projection onto the y axis is 2. That is, we implicitly introduce two vectors of length 1 in the positive x and y directions. So the vector (3, 2) actually means a projection of 3 onto the x axis and 2 onto the y axis. Note that a projection is a vector and can be negative. More generally, a vector (x, y) represents the linear combination $x(1,0)^\mathsf{T} + y(0,1)^\mathsf{T}$.
From this representation it follows that every two-dimensional vector can be written as such a linear combination. Here (1, 0) and (0, 1) are called a basis in two dimensions.
The default basis of (1, 0) and (0, 1) is chosen for convenience because they are the unit vectors in the positive directions of the x and y axes, which makes point coordinates on the two-dimensional plane correspond one-to-one with vectors. In fact, however, any two linearly independent two-dimensional vectors can form a basis; intuitively, two linearly independent vectors in the plane are simply two vectors that do not lie on the same straight line.
Also note that the basis in this example is orthogonal. The only requirement for a set of vectors to be a basis is linear independence, so a non-orthogonal basis is also acceptable; but because orthogonal bases have nice properties, the bases used in practice are generally orthogonal.
1.2.3 Matrix of basis transformation
The basis transformation in the above example can be represented by matrix multiplication: the new basis vectors are stacked as the rows of a matrix, which then multiplies the original column vector.
More generally, suppose we have M N-dimensional vectors and want to transform them into a new space spanned by R N-dimensional basis vectors. We first arrange the R basis vectors as the rows of a matrix A, then arrange the original vectors as the columns of a matrix B; the product AB is the transformed result, where the m-th column of AB is the transformation of the m-th column of B. In matrix form this is expressed as:
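In standard notation, with row basis vectors $p_1, \ldots, p_R$ and column data vectors $a_1, \ldots, a_M$,

$$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_R \end{pmatrix} \begin{pmatrix} a_1 & a_2 & \cdots & a_M \end{pmatrix} = \begin{pmatrix} p_1 a_1 & p_1 a_2 & \cdots & p_1 a_M \\ p_2 a_1 & p_2 a_2 & \cdots & p_2 a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_R a_1 & p_R a_2 & \cdots & p_R a_M \end{pmatrix},$$

where each entry $p_i a_j$ is the inner product of the i-th basis vector with the j-th data vector, i.e. the coordinate of the j-th vector along the i-th basis direction. When $R < N$, this transformation also reduces the dimensionality of the data.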
1.3 Covariance matrix and optimization objectives
In data dimensionality reduction, the key problem is how to judge whether the chosen basis is optimal, that is, how to choose the basis that best preserves the characteristics of the original data. Suppose, for example, that we have five two-dimensional data records.
Take the mean of each row and subtract it from that row, so that every field has zero mean.
Expressed in coordinate form, the centered matrix can be plotted as the following graph:
Now the question is: how do we represent these records with one-dimensional values while retaining as much of the original information as possible? This amounts to choosing a direction in the two-dimensional plane, projecting all the data points onto that line, and representing each original record by its projection value, i.e. the problem of reducing two dimensions to one. How should we choose this direction (or basis) so as to retain as much original information as possible? An intuitive answer is that we want the projected values to be as spread out as possible.
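As an illustration of this idea (a minimal MATLAB sketch, not taken from the original post; the data matrix below is made up), the following snippet projects centered two-dimensional records onto a candidate unit direction and measures how spread out the projections are:

% Project centered 2-D records onto a unit direction and measure the spread.
X = [-1 -1  0  2  0;           % made-up centered data, one record per column
     -2  0  0  1  1];
theta = pi/4;                  % candidate direction (45 degrees)
u = [cos(theta); sin(theta)];  % unit basis vector
proj = u' * X;                 % 1 x m projection values (inner products)
spread = var(proj, 1);         % population variance of the projections
fprintf('variance along this direction: %.4f\n', spread);

Sweeping theta over [0, pi) and keeping the direction with the largest variance is exactly the one-dimensional version of the optimization problem formalized below.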
1.3.1 Variance
The problem above is that we want the projected values to be as spread out as possible along one direction, and this degree of spread can be expressed mathematically by the variance, namely:
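In standard notation, for a field $a$ with $m$ samples,

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}\left(a_i - \mu\right)^2 ,$$

and since every field has already been centered to zero mean, this simplifies to

$$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m} a_i^2 .$$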
Thus, the problem above is formalized as: find a one-dimensional basis such that the variance of the data, after being transformed into coordinates on this basis, is maximized.
1.3.2 Covariance
Mathematically, the correlation between two fields can be expressed by their covariance, namely:
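In standard notation, for two centered fields $a$ and $b$ with $m$ samples each,

$$\mathrm{Cov}(a, b) = \frac{1}{m}\sum_{i=1}^{m} a_i b_i$$

(the general definition subtracts each field's mean, but the means are already zero here).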
When the covariance is 0, the two fields are (linearly) uncorrelated. To make the covariance zero, the second basis vector can only be chosen in a direction orthogonal to the first; therefore the two directions finally chosen must be orthogonal.
We can now state the optimization objective of the dimensionality reduction problem: to reduce a set of N-dimensional vectors to K dimensions (K < N), choose K unit (norm-1) orthogonal basis vectors such that, after the original data is transformed onto this basis, the covariance between every pair of fields is 0 and the variance of each field is as large as possible (under the orthogonality constraint, take the K largest variances).
1.3.3 Covariance matrix
Assume there are only two fields, a and b, and arrange them by rows into a matrix X, where X is the centered matrix, i.e. each field has had its mean subtracted:
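In standard notation,

$$X = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}, \qquad \frac{1}{m} X X^\mathsf{T} = \begin{pmatrix} \frac{1}{m}\sum_{i} a_i^2 & \frac{1}{m}\sum_{i} a_i b_i \\ \frac{1}{m}\sum_{i} a_i b_i & \frac{1}{m}\sum_{i} b_i^2 \end{pmatrix}.$$

The diagonal entries of this matrix are the variances of the two fields, and the off-diagonal entries are their covariance; the same construction generalizes to any number of fields, giving the covariance matrix $C = \frac{1}{m} X X^\mathsf{T}$.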
1.3.4 Diagonalization of the covariance matrix
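A brief sketch of the standard argument: let $C = \frac{1}{m} X X^\mathsf{T}$ be the covariance matrix of the original (centered) data and let $P$ be a matrix whose rows are the new basis vectors, so that the transformed data is $Y = P X$. The covariance matrix of the transformed data is then

$$D = \frac{1}{m} Y Y^\mathsf{T} = \frac{1}{m} (P X)(P X)^\mathsf{T} = P \left(\frac{1}{m} X X^\mathsf{T}\right) P^\mathsf{T} = P C P^\mathsf{T}.$$

The optimization goal of the previous section is therefore to find a $P$ that makes $D$ diagonal, with the diagonal entries (variances) arranged from largest to smallest. Because $C$ is a real symmetric matrix, it has $N$ orthonormal eigenvectors; taking these eigenvectors as the rows of $P$, ordered by decreasing eigenvalue, diagonalizes $D$, and the first $K$ rows of $P$ are exactly the basis sought for reducing the data to $K$ dimensions.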
1.4 Algorithm and examples
1.4.1 PCA algorithm
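As a minimal MATLAB sketch of the standard procedure (center the data, form the covariance matrix, eigendecompose it, keep the eigenvectors with the largest eigenvalues, and project), one could write the following. The function name simplePCA and its interface are illustrative; this is not the fastPCA routine used in the source code below.

function [Y, W] = simplePCA(X, k)
% X : m x n data matrix, one sample per row
% k : target dimensionality
% Y : m x k projected data;  W : n x k projection basis
mu = mean(X, 1);                     % per-feature mean
Xc = X - repmat(mu, size(X, 1), 1);  % center the data
C  = (Xc' * Xc) / size(X, 1);        % n x n covariance matrix
[V, D] = eig(C);                     % eigendecomposition (C is symmetric)
[~, order] = sort(diag(D), 'descend');
W = V(:, order(1:k));                % top-k eigenvectors as columns
Y = Xc * W;                          % coordinates in the principal subspace
end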
1.4.2 Example
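As a hypothetical usage example of the simplePCA sketch above (random data, purely illustrative):

% Reduce 200 random 50-dimensional samples to 5 principal components
rng(0);
X = rand(200, 50);
[Y, W] = simplePCA(X, 5);
size(Y)     % -> 200 x 5

For image data such as the 112*92-pixel faces used later, the covariance matrix would be 10304 x 10304, so in practice one eigendecomposes the much smaller m x m Gram matrix (or uses an SVD) instead; presumably this is the kind of shortcut the fastPCA routine in the source code takes.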
1.5 Discussion
Based on the above explanation of the mathematical principles of PCA, we can understand some of the capabilities and limitations of PCA. In essence, PCA takes the directions of largest variance as the main features and "de-correlates" the data, i.e. makes the different orthogonal directions uncorrelated with each other.
Therefore, PCA also has limitations. For example, it can remove linear correlation well, but it cannot deal with higher-order correlation. For data with higher-order correlation, Kernel PCA can be considered, converting nonlinear correlation into linear correlation through a kernel function. In addition, PCA assumes that the main features of the data are distributed along orthogonal directions; if there are several directions with large variance that are not orthogonal to each other, the effectiveness of PCA is greatly reduced.
Finally, it should be noted that PCA is a parameter-free technique: given the same data, if preprocessing is not considered, the result is always the same, with no subjective parameters to tune. This makes PCA convenient as a general-purpose method, but it also means it cannot be tailored or optimized for a particular dataset.
2 Basic concepts of SVM
The basic model of the Support Vector Machine (SVM) is to find the separating hyperplane in feature space that maximizes the margin between positive and negative samples on the training set. SVM is a supervised learning algorithm for binary classification; after the kernel method is introduced, it can also be used to solve nonlinear problems. Generally, SVMs come in the following three types:
- Hard-margin support vector machine (linearly separable support vector machine): when the training data is linearly separable, a linearly separable support vector machine can be learned by hard-margin maximization.
- Soft-margin support vector machine: when the training data is approximately linearly separable, a linear support vector machine can be learned by soft-margin maximization.
- Nonlinear support vector machine: when the training data is not linearly separable, a nonlinear support vector machine can be learned with the kernel method and soft-margin maximization.
2.2 Hard-margin support vector machines
For the hyperplanes A, B and C in Figure 2, hyperplane C should be selected, because classification with hyperplane C has the best "tolerance" of local perturbations of the training samples, i.e. the strongest robustness. For example, because of limitations of the training set or noise, samples outside the training set may lie closer to the boundary between the two classes than the training samples in Figure 2; such samples would cause decision errors for the other hyperplanes, while hyperplane C is affected the least. In other words, the classification results produced by hyperplane C are the most robust and reliable, and its generalization to unseen samples is the strongest.
The best hyperplane can be derived from the examples in the following figure.
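In standard form, the resulting optimization problem (maximize the margin $2/\lVert \mathbf{w} \rVert$, or equivalently minimize $\tfrac{1}{2}\lVert \mathbf{w} \rVert^2$, subject to every training sample lying on the correct side with functional margin at least 1) is

$$\begin{aligned} \min_{\mathbf{w},\, b} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^2 \\ \text{s.t.} \quad & y_i\left(\mathbf{w}^\mathsf{T}\mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \ldots, m. \end{aligned}$$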
This is the basic form of SVM.
2.2.1 Lagrange duality problem
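In its standard form, introducing Lagrange multipliers $\alpha_i \ge 0$ for the constraints of the primal problem above yields the dual problem

$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \,\mathbf{x}_i^\mathsf{T}\mathbf{x}_j \\ \text{s.t.} \quad & \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \alpha_i \ge 0,\ i = 1, \ldots, m, \end{aligned}$$

with $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ recovered from the optimal multipliers; only the samples with $\alpha_i > 0$ (the support vectors) contribute to the solution.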
2.2.2 KKT conditions of the SVM problem
2.3 Soft-margin support vector machines
2.4 Nonlinear support vector machines
2.4.1 Support vector regression
2.4.2 Common kernel functions
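For reference, the kernels most commonly used are (standard definitions):

- Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\mathsf{T}\mathbf{x}_j$
- Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^\mathsf{T}\mathbf{x}_j + c\right)^d$
- Gaussian (RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$
- Sigmoid kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\beta\, \mathbf{x}_i^\mathsf{T}\mathbf{x}_j + \theta\right)$

The gamma parameter set in the source code below (gamma = 0.0078) suggests an RBF kernel is being used for the face recognition task.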
2.5 Advantages and disadvantages of SVM
Advantages: when the sample size is small or medium, SVM can readily capture nonlinear relationships between the data and the features; it avoids the structure-selection and local-minimum problems of neural networks, offers interpretability, and can handle high-dimensional problems. Disadvantages: SVM is sensitive to missing data; there is no general recipe for nonlinear problems, and choosing an appropriate kernel function is not easy; and the computational complexity is high: mainstream algorithms have roughly O(n^2) complexity, which is prohibitive for large-scale data.
II. Source code
function varargout = test(varargin)
gui_Singleton = 1;
gui_State = struct('gui_Name',       mfilename, ...
                   'gui_Singleton',  gui_Singleton, ...
                   'gui_OpeningFcn', @test_OpeningFcn, ...
                   'gui_OutputFcn',  @test_OutputFcn, ...
                   'gui_LayoutFcn',  [], ...
                   'gui_Callback',   []);
if nargin && ischar(varargin{1})
    gui_State.gui_Callback = str2func(varargin{1});
end
if nargout
    [varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
    gui_mainfcn(gui_State, varargin{:});
end

function test_OpeningFcn(hObject, eventdata, handles, varargin)
global imgrow imgcol V pcaface accuracy
imgrow = 112;
imgcol = 92;
npersons = 41;                              % use the faces of 41 persons
disp('Reading training data...');
f_matrix = ReadFace(npersons, 0);           % read training data
nfaces = size(f_matrix, 1);                 % number of sample faces
% The low-dimensional data is an (npersons*5) x k matrix; each row is one
% face described by a k = 20 dimensional principal-component feature vector
disp('Extracting PCA features from the training data...');
mA = mean(f_matrix);                        % mean of each attribute
k = 20;                                     % reduce to 20 dimensions
[pcaface, V] = fastPCA(f_matrix, k, mA);    % PCA feature extraction; pcaface is 200*20
disp('Normalizing the training feature data...');
lowvec = min(pcaface);
upvec = max(pcaface);
scaledface = scaling(pcaface, lowvec, upvec);
disp('Training the SVM samples...');
gamma = 0.0078;
c = 128;
multiSVMstruct = multiSVMtrain(scaledface, npersons, gamma, c);
save('recognize.mat', 'multiSVMstruct', 'npersons', 'k', 'mA', 'V', 'lowvec', 'upvec');
disp('Reading test data...');
[testface, realclass] = ReadFace(npersons, 1);
disp('Reducing the dimensionality of the test data...');
m = size(testface, 1);
for i = 1:m
    testface(i, :) = testface(i, :) - mA;
end
pcatestface = testface * V;
disp('Normalizing the test feature data...');
scaledtestface = scaling(pcatestface, lowvec, upvec);
disp('Classifying the samples...');
class = multiSVM(scaledtestface, multiSVMstruct, npersons);
disp('Testing completed!');
accuracy = sum(class == realclass) / length(class);
set(handles.change_font, 'string', 'Please select a photo first......');
handles.output = hObject;
guidata(hObject, handles);

function [f_matrix, realclass] = ReadFace(npersons, flag)
% imgrow - row pixels of the image (global variable, 112)
% imgcol - column pixels of the image (global variable, 92)
% flag   - 0: read training samples, 1: read test samples
global imgrow;
global imgcol;
realclass = zeros(npersons*5, 1);
f_matrix = zeros(npersons*5, imgrow*imgcol);
for i = 1:npersons
    facepath = './orl_faces/s';
    facepath = strcat(facepath, num2str(i));
    facepath = strcat(facepath, '/');
    cachepath = facepath;
    for j = 1:5
        % ... (remainder of ReadFace truncated in the original listing)

function [scaledface] = scaling(faceMat, lowvec, upvec)
% Feature data normalization
% Input:  faceMat - image data to be normalized
%         lowvec  - original minimum
%         upvec   - original maximum
upnew = 1;
lownew = -1;
[m, n] = size(faceMat);
scaledface = zeros(m, n);
for i = 1:m
    scaledface(i, :) = lownew + (faceMat(i, :) - lowvec) ./ (upvec - lowvec) * (upnew - lownew);
end
end

% One-vs-one voting (fragment, presumably from multiSVM):
voting = zeros(m, nclass);
for i = 1:nclass-1
    for j = i+1:nclass
        class = svmclassify(multiSVMstruct{i}{j}, testface);
        voting(:, i) = voting(:, i) + (class == 1);
        voting(:, j) = voting(:, j) + (class == 0);
    end
end
[~, class] = max(voting, [], 2);
end
III. Operation results