Basic form of linear model

Given an example $\boldsymbol{x} = (x_1; x_2; \dots; x_d)$ described by $d$ attributes, where $x_i$ is the value of $\boldsymbol{x}$ on the $i$-th attribute, a linear model tries to learn a function that makes predictions through a linear combination of the attributes, i.e.

$$f(\boldsymbol{x}) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b.$$

It is usually written in vector form as

$$f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b,$$

where $\boldsymbol{w} = (w_1; w_2; \dots; w_d)$.
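As a quick illustration (a minimal numpy sketch with made-up weights and inputs, not taken from the original text), the vector form is simply a dot product plus a bias:

```python
import numpy as np

# Hypothetical weights and bias for a model with d = 3 attributes.
w = np.array([0.2, -0.5, 1.0])
b = 0.3

def linear_model(x):
    """f(x) = w^T x + b, the basic linear model."""
    return w @ x + b

x = np.array([1.0, 2.0, 0.5])   # one example with 3 attribute values
print(linear_model(x))          # 0.2*1.0 - 0.5*2.0 + 1.0*0.5 + 0.3 = 0.0
```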

Linear regression

Given a data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$, and considering first the case of a single attribute, linear regression tries to learn

$$f(x_i) = w x_i + b \quad \text{such that} \quad f(x_i) \simeq y_i.$$

The mean squared error is used as the performance measure, and we choose $w$ and $b$ to minimize it, i.e.

$$(w^*, b^*) = \arg\min_{(w,b)} \sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 = \arg\min_{(w,b)} \sum_{i=1}^{m} \big(y_i - w x_i - b\big)^2.$$

The mean squared error corresponds to the commonly used Euclidean distance. The method of solving a model by minimizing the mean squared error is called the "least squares method". In linear regression, the least squares method tries to find a line that minimizes the sum of the Euclidean distances from all samples to the line.

The process of solving for the $w$ and $b$ that minimize the mean squared error is called the least-squares "parameter estimation" of the linear regression model. Taking the derivatives of the mean squared error $E_{(w,b)} = \sum_{i=1}^{m} (y_i - w x_i - b)^2$ with respect to $w$ and $b$ gives

$$\frac{\partial E_{(w,b)}}{\partial w} = 2\left(w \sum_{i=1}^{m} x_i^2 - \sum_{i=1}^{m} (y_i - b)\, x_i\right), \qquad \frac{\partial E_{(w,b)}}{\partial b} = 2\left(m b - \sum_{i=1}^{m} (y_i - w x_i)\right).$$

Setting the two equations above to zero yields the closed-form solution for the optimal $w$ and $b$:

$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}, \qquad b = \frac{1}{m} \sum_{i=1}^{m} (y_i - w x_i),$$

where $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$.
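Below is a minimal numpy sketch of this closed-form solution on toy data; the function name and the data are illustrative assumptions, not from the original text.

```python
import numpy as np

def least_squares_1d(x, y):
    """Closed-form least squares for a single attribute:
    w = sum_i y_i (x_i - x_bar) / (sum_i x_i^2 - (sum_i x_i)^2 / m)
    b = (1/m) sum_i (y_i - w * x_i)
    """
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return w, b

# Toy data generated from y = 2x + 1 (assumed for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
print(least_squares_1d(x, y))   # approximately (2.0, 1.0)
```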

 

The more general case is a data set $D$ in which each sample is described by $d$ attributes. Now we try to learn

$$f(\boldsymbol{x}_i) = \boldsymbol{w}^\top \boldsymbol{x}_i + b \quad \text{such that} \quad f(\boldsymbol{x}_i) \simeq y_i.$$

This is called multivariate (multiple) linear regression. Similarly, $\boldsymbol{w}$ and $b$ can be estimated by least squares. For convenience of discussion, absorb $b$ into the weight vector, $\hat{\boldsymbol{w}} = (\boldsymbol{w}; b)$. Accordingly, the data set $D$ is represented as an $m \times (d+1)$ matrix $\mathbf{X}$, where each row corresponds to one example: the first $d$ elements of a row are the $d$ attribute values of that example, and the last element is a constant 1, i.e.

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{pmatrix} = \begin{pmatrix} \boldsymbol{x}_1^\top & 1 \\ \boldsymbol{x}_2^\top & 1 \\ \vdots & \vdots \\ \boldsymbol{x}_m^\top & 1 \end{pmatrix}.$$

Writing the labels as a vector $\boldsymbol{y} = (y_1; y_2; \dots; y_m)$, we have, analogously to the single-attribute case,

$$\hat{\boldsymbol{w}}^* = \arg\min_{\hat{\boldsymbol{w}}} (\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{w}})^\top (\boldsymbol{y} - \mathbf{X}\hat{\boldsymbol{w}}).$$

When $\mathbf{X}^\top\mathbf{X}$ is a full-rank or positive definite matrix, setting the derivative of the above objective with respect to $\hat{\boldsymbol{w}}$ to zero gives

$$\hat{\boldsymbol{w}}^* = (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{y}.$$

However, $\mathbf{X}^\top\mathbf{X}$ is often not full rank in real tasks. For example, the number of attributes frequently exceeds the number of samples, so $\mathbf{X}$ has more columns than rows and $\mathbf{X}^\top\mathbf{X}$ is clearly not full rank. In this case multiple solutions $\hat{\boldsymbol{w}}$ can be found, all of which minimize the mean squared error. Which one is chosen as the output is determined by the inductive bias of the learning algorithm; a common practice is to introduce a regularization term.
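A sketch of the multivariate solution, using the pseudo-inverse so that it also works in the rank-deficient case, with an optional ridge term shown as one example of a regularizer (data and names are illustrative assumptions):

```python
import numpy as np

def multivariate_least_squares(X, y, ridge=0.0):
    """Solve w_hat = argmin ||y - X_hat w_hat||^2 with X_hat = [X, 1].
    With ridge > 0 this becomes (X^T X + ridge*I)^{-1} X^T y, one common
    regularized choice when X^T X is not full rank."""
    m = X.shape[0]
    X_hat = np.hstack([X, np.ones((m, 1))])        # append the constant-1 column
    A = X_hat.T @ X_hat + ridge * np.eye(X_hat.shape[1])
    w_hat = np.linalg.pinv(A) @ X_hat.T @ y        # pinv also covers the ridge = 0, rank-deficient case
    return w_hat[:-1], w_hat[-1]                   # (w, b)

# Toy data from y = x1 + 2*x2 + 3 (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 2.0]) + 3.0
w, b = multivariate_least_squares(X, y)
print(w, b)   # approximately [1. 2.] and 3.0
```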

Suppose the output label corresponding to an example varies on an exponential scale; then the logarithm of the output label can be used as the target that the linear model approximates, i.e.

$$\ln y = \boldsymbol{w}^\top \boldsymbol{x} + b.$$

This is log-linear regression. It is actually trying to make $e^{\boldsymbol{w}^\top \boldsymbol{x} + b}$ approach $y$: although the model is still formally linear, in essence it performs a nonlinear mapping from the input space to the output space.

More generally, consider a monotonically differentiable function $g(\cdot)$ and let

$$y = g^{-1}(\boldsymbol{w}^\top \boldsymbol{x} + b).$$

The model thus obtained is called a generalized linear model, in which the function $g(\cdot)$ is called the "link function". Obviously, log-linear regression is the special case of the generalized linear model with $g(\cdot) = \ln(\cdot)$.
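As a small illustration of the $g(\cdot) = \ln(\cdot)$ special case, one can fit ordinary least squares to $\ln y$ and map predictions back through the inverse link; the synthetic data below is an assumption for demonstration, not from the original text.

```python
import numpy as np

# Synthetic data: y varies exponentially with x, y = exp(0.8*x + 0.5) (assumed).
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=100)
y = np.exp(0.8 * x + 0.5)

# Log-linear regression: fit ln y = w*x + b by ordinary least squares.
X_hat = np.column_stack([x, np.ones_like(x)])
w, b = np.linalg.lstsq(X_hat, np.log(y), rcond=None)[0]

def predict(x_new):
    """Predictions come back through the inverse link: y = exp(w*x + b)."""
    return np.exp(w * x_new + b)

print(w, b)            # approximately 0.8 and 0.5
print(predict(1.0))    # approximately exp(1.3) ≈ 3.67
```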

Logistic regression (log-odds regression)

For a classification task, we only need a monotonically differentiable function that links the true label $y$ of the classification task with the predicted value of the linear regression model. Consider a binary classification task with output label $y \in \{0, 1\}$. The value $z = \boldsymbol{w}^\top \boldsymbol{x} + b$ produced by the linear regression model is real-valued, so it must be converted into a 0/1 value. The ideal choice is the unit step function

$$y = \begin{cases} 0, & z < 0 \\ 0.5, & z = 0 \\ 1, & z > 0. \end{cases}$$

That is, if the predicted value $z$ is greater than zero, the example is judged positive; if $z$ is less than zero, it is judged negative; if $z$ equals the critical value zero, either class may be assigned arbitrarily.

But the unit step function is not continuous, so we look for a "surrogate function" that approximates it to some extent while being monotonically differentiable. The logistic function (log-odds function) is such a surrogate:

$$y = \frac{1}{1 + e^{-z}}.$$

The logistic function is a sigmoid function: it converts the value $z$ into a value $y$ close to 0 or 1, and its output changes steeply near $z = 0$. Substituting $z = \boldsymbol{w}^\top \boldsymbol{x} + b$ gives

$$y = \frac{1}{1 + e^{-(\boldsymbol{w}^\top \boldsymbol{x} + b)}}, \quad \text{which can be rewritten as} \quad \ln \frac{y}{1 - y} = \boldsymbol{w}^\top \boldsymbol{x} + b.$$

If $y$ is regarded as the likelihood that sample $\boldsymbol{x}$ is a positive example, then $1 - y$ is the likelihood that it is a negative example, and the ratio of the two is

$$\frac{y}{1 - y}.$$

This ratio is called the "odds", reflecting the relative likelihood of $\boldsymbol{x}$ being a positive example. Taking the logarithm of the odds gives the log odds (logit), $\ln \frac{y}{1-y}$.

It can be seen from the above that the prediction of the linear regression model is in fact used to approximate the log odds of the true label; hence the corresponding model is called "logistic regression" (log-odds regression).

We now estimate $\boldsymbol{w}$ and $b$ of logistic regression. If $y$ is regarded as the class-posterior probability estimate $p(y = 1 \mid \boldsymbol{x})$, then the previous equation can be rewritten as

$$\ln \frac{p(y=1 \mid \boldsymbol{x})}{p(y=0 \mid \boldsymbol{x})} = \boldsymbol{w}^\top \boldsymbol{x} + b.$$

Clearly,

$$p(y=1 \mid \boldsymbol{x}) = \frac{e^{\boldsymbol{w}^\top \boldsymbol{x} + b}}{1 + e^{\boldsymbol{w}^\top \boldsymbol{x} + b}}, \qquad p(y=0 \mid \boldsymbol{x}) = \frac{1}{1 + e^{\boldsymbol{w}^\top \boldsymbol{x} + b}}.$$

The maximum likelihood method is used to estimate $\boldsymbol{w}$ and $b$: given a data set $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{m}$, the logistic regression model maximizes the "log-likelihood"

$$\ell(\boldsymbol{w}, b) = \sum_{i=1}^{m} \ln p(y_i \mid \boldsymbol{x}_i; \boldsymbol{w}, b).$$
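A minimal sketch of maximizing this log-likelihood by batch gradient ascent in plain numpy; the hyperparameters and toy data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize sum_i [y_i ln p_i + (1-y_i) ln(1-p_i)] by gradient ascent;
    beta stacks (w, b) as in the text."""
    m = X.shape[0]
    X_hat = np.hstack([X, np.ones((m, 1))])
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X_hat @ beta)              # p(y=1 | x) for every example
        grad = X_hat.T @ (y - p)               # gradient of the log-likelihood
        beta += lr * grad / m
    return beta[:-1], beta[-1]                 # (w, b)

# Toy, roughly linearly separable data (assumed for illustration).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w, b = fit_logistic(X, y)
print(w, b)
```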

Linear discriminant analysis (Fisher)

Linear Discriminant Analysis (LDA) is a classical linear learning method, also known as Fisher Discriminant Analysis.

The idea of LDA: given a training sample set, try to project the samples onto a line so that the projection points of samples of the same class are as close as possible and those of different classes are as far apart as possible; when classifying a new sample, project it onto the same line and determine its class according to the position of its projection point.

Given a data set $D = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{m}$, $y_i \in \{0, 1\}$, let $X_i$, $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$ denote the example set, mean vector, and covariance matrix of the $i$-th class ($i \in \{0, 1\}$), respectively. If the data are projected onto a line $\boldsymbol{w}$, the projections of the two class centers onto the line are $\boldsymbol{w}^\top \boldsymbol{\mu}_0$ and $\boldsymbol{w}^\top \boldsymbol{\mu}_1$; if all sample points are projected onto the line, the covariances of the two classes of samples are $\boldsymbol{w}^\top \boldsymbol{\Sigma}_0 \boldsymbol{w}$ and $\boldsymbol{w}^\top \boldsymbol{\Sigma}_1 \boldsymbol{w}$.

To make the projection points of samples of the same class as close as possible, the covariance of those projection points should be as small as possible, i.e. $\boldsymbol{w}^\top \boldsymbol{\Sigma}_0 \boldsymbol{w} + \boldsymbol{w}^\top \boldsymbol{\Sigma}_1 \boldsymbol{w}$ as small as possible; to make the projection points of different classes as far apart as possible, the distance between the class centers should be as large as possible, i.e. $\|\boldsymbol{w}^\top \boldsymbol{\mu}_0 - \boldsymbol{w}^\top \boldsymbol{\mu}_1\|_2^2$ as large as possible. Considering both at once, the objective to maximize is

$$J = \frac{\|\boldsymbol{w}^\top \boldsymbol{\mu}_0 - \boldsymbol{w}^\top \boldsymbol{\mu}_1\|_2^2}{\boldsymbol{w}^\top \boldsymbol{\Sigma}_0 \boldsymbol{w} + \boldsymbol{w}^\top \boldsymbol{\Sigma}_1 \boldsymbol{w}}.$$
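The maximizer of $J$ has the well-known closed form $\boldsymbol{w} \propto (\boldsymbol{\Sigma}_0 + \boldsymbol{\Sigma}_1)^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$; the sketch below computes it on two toy Gaussian clusters (the data are illustrative assumptions).

```python
import numpy as np

def lda_direction(X0, X1):
    """Fisher LDA direction w ∝ (Sigma_0 + Sigma_1)^{-1} (mu_0 - mu_1),
    which maximizes the objective J defined above."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # denominator of J
    w = np.linalg.solve(S_w, mu0 - mu1)
    return w / np.linalg.norm(w)

# Two toy Gaussian clusters (assumed for illustration).
rng = np.random.default_rng(3)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
w = lda_direction(X0, X1)
print(w)   # roughly proportional to mu0 - mu1, i.e. the (-1, -1) direction
```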

Multi-class learning

Multi-class learning tasks are often encountered in practice. Some binary learning methods can be directly extended to the multi-class case, and multi-class problems can also be solved using binary learners.

Consider $N$ classes $C_1, C_2, \dots, C_N$. The basic idea of multi-class learning is "decomposition": the multi-class task is split into several binary classification tasks, each solved by a binary learner.

There are three classic decomposition strategies: one-vs-one (OvO), one-vs-rest (OvR), and many-vs-many (MvM).

OvO pairs the $N$ classes to produce $N(N-1)/2$ binary classification tasks;

OvR trains $N$ classifiers, each time treating the samples of one class as positive examples and the samples of all other classes as negative examples.

OvR only needs to train $N$ classifiers, whereas OvO needs to train $N(N-1)/2$ classifiers, so OvO's storage overhead and test-time overhead are typically higher than OvR's. However, during training each of OvR's classifiers uses all the samples, whereas each of OvO's classifiers uses only the samples of two classes, so OvO's training-time cost is generally lower than OvR's when there are many classes. As for predictive performance, it depends on the specific data distribution, and in most cases the two are about the same.

MvM treats several classes as positive and several other classes as negative at a time. Obviously, OvO and OvR are special cases of MvM.
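As an illustration, scikit-learn (if it is available in your environment) provides ready-made OvR and OvO wrappers around any binary learner; the sketch below uses the iris data set, where $N = 3$.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes, so OvO builds 3*(3-1)/2 = 3 binary tasks

base = LogisticRegression(max_iter=1000)
ovr = OneVsRestClassifier(base).fit(X, y)   # N = 3 binary classifiers
ovo = OneVsOneClassifier(base).fit(X, y)    # N(N-1)/2 = 3 binary classifiers

print(len(ovr.estimators_), len(ovo.estimators_))   # 3, 3
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```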

Class imbalance problem

Class imbalance refers to the situation where the numbers of training samples of different classes differ greatly in a classification task. Even if the numbers of training samples of the different classes are equal in the original problem, class imbalance may still arise after applying the OvR or MvM strategies.

In the following, assume that there are fewer positive examples and more negative examples.

Remedies based on "rescaling": undersampling, oversampling, and threshold moving.

Undersampling: directly "undersample" the negative-class samples in the training set, i.e. remove some negative examples so that the numbers of positive and negative examples become close, and then learn;

Oversampling: add some positive examples so that the numbers of positive and negative examples become close, and then learn;

Threshold moving: learn directly from the original training set, but when the trained classifier is used for prediction, embed the rescaling rule

$$\frac{y'}{1 - y'} = \frac{y}{1 - y} \times \frac{m^-}{m^+}$$

into its decision process.

In the above equation, $y$ expresses the likelihood that a sample is positive, and the odds $\frac{y}{1-y}$ reflect the ratio of positive to negative likelihood; $m^+$ denotes the number of positive examples and $m^-$ the number of negative examples, so the observed odds in the training set are $\frac{m^+}{m^-}$.
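A sketch of threshold moving under these assumptions: keep the classifier trained on the imbalanced data, rescale its predicted odds by $m^-/m^+$, and predict positive when the rescaled odds exceed 1 (the probabilities and counts below are made up for illustration).

```python
import numpy as np

def threshold_moving_predict(y_prob, m_pos, m_neg):
    """Rescale predicted odds: y'/(1-y') = y/(1-y) * (m^- / m^+),
    then predict positive when the rescaled odds exceed 1."""
    odds = y_prob / (1.0 - y_prob)
    rescaled = odds * (m_neg / m_pos)
    return (rescaled > 1.0).astype(int)

# Suppose the training set had 100 positives and 900 negatives (assumed counts).
y_prob = np.array([0.05, 0.15, 0.4, 0.7])   # classifier's predicted P(y=1 | x)
print(threshold_moving_predict(y_prob, m_pos=100, m_neg=900))
# -> [0 1 1 1]: anything with odds above m^+/m^- = 1/9 (probability > 0.1) is positive
```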

The time cost of undersampling is usually far smaller than that of oversampling, because the former discards many negative examples, making the classifier's training set much smaller than the initial training set.

Rescaling is also the basis of cost-sensitive learning. In cost-sensitive learning, $\frac{m^-}{m^+}$ is replaced by $\frac{\text{cost}^+}{\text{cost}^-}$, where $\text{cost}^+$ is the cost of misclassifying a positive example as negative and $\text{cost}^-$ is the cost of misclassifying a negative example as positive.