Classification algorithm
When it comes to classification algorithms, we can't help but mention Logistic Regression.
Personally, I think logistic regression plays a very important role among classification algorithms, so I will spend a good deal of time summarizing the logistic regression model.
Model construction of logistic regression
1. Generalized linear model
To overcome the limitations of the purely linear structure of linear regression, people add a function to the left or right side of the regression equation, so that the model can better capture the general patterns in the data. The resulting model is called a generalized linear model, and the added function is called the link function.
Suppose we have a set of data, and the true relationship within the data is
If the prediction is made with a linear equation, i.e.
The fit between the linear model and the data is shown in the figure below:
It can be found that there is a big gap between the predictions of the linear model and the true values. In this case, however, if we add a natural exponential (base $e$) to the right-hand side of the equation, that is, if we take the output of the linear equation and run it through the exponential function to predict $y$, we can rewrite the equation as
which is equivalent to
In other words, the output of the linear equation is used to predict the natural logarithm of $y$.
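As a quick sketch of this idea (writing the linear part as $w \cdot x + b$, an assumption about notation rather than the exact formula used above):

$$y = e^{w \cdot x + b} \iff \ln y = w \cdot x + b$$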
From the above process, it is not hard to see that by adding a suitable function to the left or right side of the model, a linear model can be made to capture nonlinear patterns. In the example above, the essence of capturing the nonlinear pattern is that, after the logarithmic function is added to the equation, the input space (feature space) of the model is mapped to the output space (label space) through a nonlinear function. The function that connects the two sides of a linear equation and thereby extends the expressive power of the model is called the link function, and a model equipped with a link function is called a generalized linear model. The general form of the generalized linear model can be expressed as follows:
Is equivalent to
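For reference, a common way to write this general form, with a link function $g(\cdot)$:

$$g(y) = w^T x + b \iff y = g^{-1}(w^T x + b)$$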
2. The log-odds model
- Odds and log odds
Odds are not the same thing as probability: the odds of an event are the ratio of the probability that the event occurs to the probability that it does not. If the probability of an event occurring is $P$, then the probability of it not occurring is $1-P$, and the odds of the event are:
The log odds of the event are obtained by taking the natural logarithm of the odds.
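In symbols (a standard definition, with $P$ as above):

$$\text{odds} = \frac{P}{1-P}, \qquad \text{log odds} = \ln\frac{P}{1-P}$$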
- The log-odds model
If we treat the log odds as a function of $y$ and use it as the link function, i.e. $g(y)=\ln\frac{y}{1-y}$, the generalized linear model becomes:
The resulting model is called log-odds regression, better known as logistic regression.
Further, if we want to "reverse-solve" the log-odds model above, we rearrange it into the form $y = f(x)$, i.e.
The original formula:
One step transformation:
Through a series of transformations:
Finally, the logistic regression model is as follows:
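Putting these steps together, a reference derivation (assuming the linear part is written as $w^T x + b$) is:

$$\ln\frac{y}{1-y} = w^T x + b \;\Rightarrow\; \frac{y}{1-y} = e^{w^T x + b} \;\Rightarrow\; y = \frac{e^{w^T x + b}}{1 + e^{w^T x + b}} = \frac{1}{1 + e^{-(w^T x + b)}}$$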
It can also be seen that the inverse of the log-odds function is:
and $f(x)$ is also known as the sigmoid function.
3. Output of the logistic regression model and model interpretability
Overall, after the output of the linear equation is passed through the sigmoid function, logistic regression compresses it into the interval (0, 1). Using this result directly for continuous numerical prediction, as in regression, is clearly inappropriate; in practical applications, logistic regression is mainly used for binary classification problems.
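A minimal sketch of this squeezing effect (assuming NumPy; the sample values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear output z into the (0, 1) interval."""
    return 1 / (1 + np.exp(-z))

# Hypothetical outputs of the linear part w^T x + b
linear_output = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(linear_output))
# -> approximately [4.5e-05, 0.269, 0.5, 0.731, 0.99995]
```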
The business problem
Is the logistic regression output $y$ a probability?
The core factor that determines whether y is a probability is not the model itself, but the modeling process.
The logistic function itself also has a corresponding probability distribution, so the independent variables fed into the model can in fact be regarded as random variables, provided that their distributions satisfy certain requirements.
If the modeling process of logistic regression follows the standard modeling process of mathematical statistics, that is, the distributions of the independent variables (or their transformed distributions) are verified to meet certain requirements, then the final model output is a probability in the strict sense.
However, if the model is built following the machine learning workflow, without hypothesis tests on the independent variables, their distributions may not meet these conditions, and the output can no longer be interpreted as a probability in the strict sense.
Alternatively, starting from the logistic regression equation
It is further deduced that:
It can be read as: **for every 1-unit increase in $x$, the log odds that the sample belongs to class 1 decreases by 1.**
This interpretability based on the coefficients of the independent variables can be used not only to explain the relationship between the independent variables and the dependent variable, but also to gauge the relative importance of the independent variables. For example, suppose the logistic regression equation is as follows:
We can read off that $x_2$ is three times as important as $x_1$.
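As a general reference for reading coefficients (a standard property of the log-odds form, not specific to the equation above): since

$$\ln\frac{p(y=1|x)}{p(y=0|x)} = w^T x + b,$$

a one-unit increase in $x_j$ changes the log odds by $w_j$, which multiplies the odds by $e^{w_j}$.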
4. Multiclass logistic regression
The previous discussion was based on binary (0-1) classification problems; using logistic regression to solve multiclass problems requires some additional techniques.
If logistic regression is used to solve multiclass problems, there are generally two approaches:
- One is to modify the logistic regression model itself into a multiclass model
- The other is to use general multiclass learning strategies to adapt the modeling process
Rewriting the logistic regression model itself into a multiclass form is uncommon, and the resulting solving process is very complicated. Mainstream implementations of multiclass logistic regression, including scikit-learn, mostly rely on multiclass learning strategies. A multiclass learning strategy extends a binary classifier to multiclass scenarios; it is a general method that can be applied to any binary classifier, logistic regression included.
General solutions to the multiclass problem
The basic idea of solving a multiclass problem with binary classifiers is to split first and then integrate:
- Start by breaking up the data set
- Multiple data sets can then train multiple models
- Finally, the multiple models are integrated. By integration, I mean the use of these multiple models to predict subsequent incoming data.
Example: a four-class classification problem
Specifically, there are three main strategies:
- One vs. One (OvO)
OvO's splitting strategy is relatively simple. The basic process is to separate the samples of each category into their own subsets, and then combine the subsets in pairs for model training. For the four-class data set above, this gives four single-class subsets, which can be paired into six combinations in total. The splitting process is as follows:
And then on these six newly combined data sets, we can train six classifiers.
After training is completed, when predicting new data, the final decision can be obtained by majority voting over the outputs of the six classifiers.
Majority voting means that a new piece of data should end up in category 1.
- One vs. Rest (OvR)
Unlike OvO's pairwise combinations, the OvR strategy splits the data set by taking the samples of one class as positive examples and all remaining samples as negative examples. For the four-class data set above, the OvR strategy eventually produces four data sets, and the basic splitting process is as follows:
These four data sets will train four classifiers.
The integration strategy is closely tied to the splitting strategy. With OvR, when predicting new data, if only one classifier marks it as a positive example, the new data belongs to that class; if several classifiers mark it as positive, the result of the classifier with the higher accuracy is taken as the prediction.
OvO vs. OvR:
Comparing the two strategies: although OvO needs to train more base classifiers, each split data set in OvO is smaller, so each base classifier trains faster. Overall, OvO often costs less training time than OvR. In terms of performance, the two are similar most of the time.
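As a minimal sketch of both strategies (assuming scikit-learn is available; the synthetic four-class data set stands in for the example above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic 4-class data set, standing in for the example above
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, random_state=42)

ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)   # trains C(4,2) = 6 base models
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)  # trains 4 base models

print(len(ovo.estimators_), len(ovr.estimators_))  # 6 4
print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```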
- Many vs. Many (MvM)
MvM is a more complex strategy than OvO and OvR.
MvM requires several classes to be treated as positive and the remaining classes as negative at the same time, and it requires multiple rounds of splitting before integration.
A technique called Error Correcting Output Codes (ECOC) is commonly used to implement the MvM process.
For the four-class data set above, the splitting process now becomes more complicated: we can take one class as positive and the rest as negative, or take two classes as positive and the rest as negative, and so on. This yields many split data sets, which in turn train many base classifiers.
According to the above partitioning scheme, the data set will be split as follows:
We can then construct 10 classifiers accordingly. In general, however, ECOC does not split the data set this exhaustively; instead, some of the partitions above are selected for modeling. For example, the four data sets explicitly shown above can be selected to build four classifiers.
As you can see, OvR is actually a special case of MvM.
Next comes model integration. It is worth noting that, if we split the four data sets in the manner above, the array of positive/negative labels assigned to each class across the splits can be treated as that class's code. As follows:
Meanwhile, using the four trained base classifiers to predict a new piece of data also yields four results, and these four results form a four-digit code for the new data.
Next, we can calculate the distance between the code of the new data and the codes of the different categories above to determine which category the new data should fall into.
We can see that, if the predictions are accurate enough, the codes correspond exactly to the categories. If the base classifiers do not predict accurately, however, the codes and the categories no longer correspond one to one. There is also a ternary coding scheme, which in such cases sets a particular code position to 0 (hence "error-correcting output codes"), meaning that the corresponding class is not used by that base classifier.
In practice there are many ways to calculate this distance; common choices are Euclidean distance, Manhattan (city-block) distance, and the more general Minkowski distance.
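A minimal sketch of this decoding step (NumPy assumed; the class codes are hypothetical and only illustrate nearest-code matching, here with Hamming distance):

```python
import numpy as np

# Hypothetical 4-bit codes for 4 classes (one column per base classifier)
class_codes = np.array([
    [1, 0, 0, 1],   # class 1
    [0, 1, 0, 1],   # class 2
    [0, 0, 1, 0],   # class 3
    [1, 1, 1, 0],   # class 4
])

# Suppose the 4 trained base classifiers output this code for a new sample
predicted_code = np.array([1, 0, 1, 1])

# Hamming distance to every class code; the closest code wins
distances = (class_codes != predicted_code).sum(axis=1)
print(distances)                 # [1 3 2 2]
print(np.argmin(distances) + 1)  # -> class 1
```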
- Evaluation of the ECOC method
For the ECOC method, the longer the code, the more accurate the prediction tends to be; however, longer codes also consume more computing resources. In addition, since the number of classes is limited, the number of possible data set splits is limited, and so is the code length. Even so, the MvM approach usually performs better than OvR.
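For reference, scikit-learn ships an ECOC implementation, OutputCodeClassifier; a minimal sketch (the data set here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_classes=4, random_state=42)

# code_size=2 means the code length is 2 * n_classes positions per class
ecoc = OutputCodeClassifier(LogisticRegression(), code_size=2, random_state=42)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))
```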
5. Loss function of logistic regression
Generally speaking, there are two ways to construct the loss function of logistic regression: the cross-entropy loss function can be derived either from maximum likelihood estimation or from relative entropy (KL divergence).
The basic idea of constructing the loss function
| Length | Species |
| --- | --- |
| 1 | 0 |
| 3 | 1 |
Since there is only one feature Length, the logistic regression model is constructed as follows:
Here, the model's outputs are regarded as probabilities; substituting the data into the model then gives:
Here $p(y=1|x=1)$ denotes the conditional probability that $y$ equals 1 given that $x$ equals 1. We know that in the actual data the first sample has $y=0$ and the second sample has $y=1$, so we can calculate $p(y=0|x=1)$ as follows:
This gives:
| Length | Species | 1-predict | 0-predict |
| --- | --- | --- | --- |
| 1 | 1 | | |
| 3 | 0 | | |
In general, the objective used to construct the loss function is consistent with the model evaluation metric (just as SSELoss corresponds to the SSE evaluation metric). For most classification models, prediction accuracy is the most basic evaluation criterion. Here, if we want the model's predictions to be as accurate as possible, that is equivalent to wanting the probabilities $p(y=0|x=1)$ and $p(y=1|x=3)$ to be as large as possible. This objective can be unified as maximizing the following expression:
In addition, since loss functions are conventionally minimized, maximizing the expression above can be converted into minimizing its negative, and the product can be converted into a sum of logarithms. The maximization above is therefore equivalent to minimizing the following expression:
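A sketch of the resulting two-sample objective, assuming the two samples are $(x=1, y=0)$ and $(x=3, y=1)$ as in the table above:

$$\max\; p(y=0|x=1)\cdot p(y=1|x=3) \;\iff\; \min\; -\big[\ln p(y=0|x=1) + \ln p(y=1|x=3)\big]$$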
Thus, a logistic regression loss function composed of two data points is constructed.
Why not use SSE
The SSE operation is as follows:
The key point is why SSE is not used here: at the mathematical level, it can be proved that for logistic regression, when $y$ is a 0-1 categorical variable, the loss function $\|y - \hat{y}\|_2^2$ is not convex. A non-convex loss function makes finding the optimal parameters very troublesome. By contrast, the loss function constructed by multiplying probabilities is convex, so the global minimum can be found quickly.
Why convert the product above into a sum of logarithms?
The reason is that in actual modeling, especially when constructing the loss function over a large amount of data, the product involves many factors, each lying in (0, 1), so the cumulative product is very likely to be an extremely small number. Since numerical computing frameworks have limited precision, much of that precision can be lost during the multiplication; summing logarithms avoids this problem.
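A minimal sketch of this precision problem (NumPy assumed; 1000 factors of 0.3 are just an illustrative choice):

```python
import numpy as np

probs = np.full(1000, 0.3)      # 1000 likelihood factors, each in (0, 1)

print(np.prod(probs))           # 0.0 -> underflow, the true value is lost
print(np.sum(np.log(probs)))    # about -1203.97, perfectly representable
```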
Solving the loss function
From a mathematical point of view, it can be proved that the logistic regression loss function constructed in this way is still convex. We can therefore still solve for the parameters by taking the partial derivatives of $LogitLoss(\omega, b)$, setting them equal to 0, and solving the resulting system of equations.
6. Constructing the loss function via maximum likelihood estimation
Key points of maximum likelihood estimation
Logistic regression model:
where:
The solving process is divided into the following four steps:
- Determine the likelihood term
We know that for logistic regression, once $\omega$ and $x$ are given, the model outputs a probability prediction, namely:
and the corresponding probability of the class being 0 is:
We can therefore let:
Therefore, the likelihood term corresponding to the $i$-th data point can be written as:
$y_i$ denotes the class label of the $i$-th data point. It is not hard to see that when $y_i=0$, i.e. the label of the $i$-th data point is 0, the term that should enter the likelihood function is $p_0(x,\omega)$; conversely, when $y_i=1$, i.e. the label of the $i$-th data point is 1, the term that should enter the likelihood function is $p_1(x,\omega)$. The likelihood term above covers both cases at once.
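For reference, this likelihood term is conventionally written as (using $p_1(x_i;\omega) = p(y=1|x_i;\omega)$ and $p_0 = 1 - p_1$):

$$p(y_i \mid x_i; \omega) = p_1(x_i;\omega)^{y_i}\, p_0(x_i;\omega)^{1-y_i}$$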
- Construction of the likelihood function
The likelihood function is obtained by taking the product of the likelihood terms over all samples:
- Logarithmic transformation
Subsequently, we will use this formula to solve the loss function.
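For reference, taking the logarithm of the likelihood and negating it gives the familiar cross-entropy form (a standard result, written here for $n$ samples):

$$-\ln L(\omega) = -\sum_{i=1}^{n}\Big[y_i \ln p_1(x_i;\omega) + (1-y_i)\ln p_0(x_i;\omega)\Big]$$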
- Solution of the log-likelihood function
Through a series of mathematical steps, it can be proved that the loss function constructed by maximum likelihood estimation is convex, so in principle we can solve it by setting the derivatives to zero and solving the resulting system of equations.
However, this approach involves a large amount of differentiation and equation solving, and is not suitable for large-scale, let alone very large-scale, numerical computation. In machine learning, therefore, more general optimization methods are usually used to solve the logistic regression loss function, typically Newton's method or gradient descent.
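A minimal sketch of the gradient descent route (NumPy assumed; the learning rate and iteration count are illustrative choices, and the tiny two-row data set mirrors the Length/Species table above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Minimise the negative log-likelihood with plain batch gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a column of 1s for the bias b
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)              # current p(y=1|x) for every sample
        grad = Xb.T @ (p - y) / len(y)   # gradient of the averaged loss w.r.t. w
        w -= lr * grad
    return w

# Tiny data set: x=1 -> y=0, x=3 -> y=1
X = np.array([[1.0], [3.0]])
y = np.array([0.0, 1.0])
w = fit_logistic(X, y)
print(w)                                             # learned [weight, bias]
print(sigmoid(np.hstack([X, np.ones((2, 1))]) @ w))  # fitted probabilities
```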
7. Constructing the cross-entropy loss function from relative entropy
The cross-entropy loss function can also be constructed from relative entropy (KL divergence).