This is the sixth day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

Linear Regression

Before we get to logistic regression, let's briefly introduce linear regression. Linear regression estimates a continuous value (such as a housing price) by finding the linear relationship between the independent variables and the dependent variable and determining the best-fit line, called the regression line. We express this regression relationship as


$$Y = aX + b$$
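To make the regression line concrete, here is a minimal sketch of fitting a and b by ordinary least squares for a single independent variable. The function name `fit_line` and the area/price numbers are made up for illustration; the post itself does not say how the line is fitted, so least squares is an assumption here.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared errors for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: house area (x) and price (y), lying exactly on y = 2x + 1.
areas = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(areas, prices)
print(a, b)
```

Because the toy points lie exactly on a line, the fit recovers that line; with noisy data the same formulas give the line with the smallest squared error.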

Regularization

Occam's Razor Principle: of all possible models, we should choose one that explains the data well and is as simple as possible. From a Bayesian perspective, the regularization term corresponds to the prior probability of the model: a complex model is assumed to have a small prior probability and a simple model a large one. Minimizing the L2 norm $\|W\|_2$ makes every element of W small, so all elements of W end up close to zero (the L1 norm, by contrast, makes elements of W exactly equal to 0). Smaller parameters indicate a simpler model, and a simpler model is less likely to overfit.
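As a rough illustration of how an L2 penalty pulls a weight toward zero, the sketch below fits a single weight w (no intercept) by minimizing the squared error plus lam * w^2. The names `ridge_weight` and `lam` and the data are assumptions chosen for this example, not something taken from the post.

```python
def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    return sxy / (sxx + lam)


# Hypothetical data lying on y = 2x, so the unregularized weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
weights = [ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(weights)
```

The weight shrinks as the penalty grows but stays nonzero, which matches the "close to zero" behavior described for the L2 norm.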

Why does the L1 norm make weights sparse?

Any regularization term can produce sparse weights if it is non-differentiable at $W_i = 0$ and can be decomposed into a sum over the individual weights.
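One way to see this is a one-dimensional toy problem, sketched below under assumed notation: `a` is the weight that would be optimal without regularization and `lam` is the penalty strength. With an L1 penalty the minimizer is the soft-thresholding rule, which returns exactly zero for small `a`; with an L2 penalty the minimizer is only scaled down.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    # The kink of abs(w) at w = 0 makes zero optimal whenever |a| <= lam.
    return 0.0


def l2_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2: plain shrinkage."""
    return a / (1.0 + lam)


# Hypothetical unregularized weights.
unregularized = [3.0, 0.5, -0.2, -2.0]
lam = 1.0
l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]
print(l1_weights)
print(l2_weights)
```

The small weights become exactly zero under L1, while under L2 every weight stays nonzero.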

Logistic Regression

Logistic regression overview

Unlike linear regression, logistic regression is a classification algorithm rather than a regression algorithm. It fits a logistic function to predict a discrete dependent variable (the output is a probability between 0 and 1) and describes how strongly the independent variables influence the dependent variable. There can be one or more independent variables: with one it is called univariate logistic regression, and with several it is called multiple logistic regression. For example, logistic regression can predict the probability that an email is spam. Because the result is a probability, it can also be used to rank candidates, for example by predicted click-through rate.

Logistic regression steps

First, find a suitable prediction function h(x) that predicts the class of the input data. Second, construct a loss function (cost) that measures the deviation between the predicted value h(x) and the training-set label y, for example as a difference. The losses over all training data are averaged or summed to give J(θ), the deviation between the predicted values and the actual categories over the whole training set. The smaller J(θ) is, the more accurate the prediction, so our goal is to find the minimum of the function J(θ). There are many ways to find the minimum; here we use gradient descent. Next, we go through these steps in detail.

Prediction function:


$$h(x)=\frac{1}{1+e^{-\left(\omega^{T} x+b\right)}}$$
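The prediction function can be written as a few lines of code. This is only a sketch: the name `predict_proba` and the example weights are made up, and plain lists stand in for vectors.

```python
import math


def predict_proba(w, b, x):
    """Evaluate h(x) = 1 / (1 + exp(-(w^T x + b))) for list-valued w and x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights, bias, and input: z = 0.5*2.0 - 0.25*4.0 + 0.1 = 0.1.
p = predict_proba([0.5, -0.25], 0.1, [2.0, 4.0])
print(p)
```

When z is 0 the output is exactly 0.5, and it approaches 1 or 0 as z becomes large and positive or large and negative. For very negative z, `math.exp(-z)` can overflow, so a production version would need a numerically safer form.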

Loss function


$$L(\omega)=\sum_{i=1}^{n}\left(y_{i}\, \omega^{T} x_{i}-\log \left(1+e^{\omega^{T} x_{i}}\right)\right)$$

The loss function is derived from the prediction function of logistic regression. Note that ω^T does not mean ω raised to the power T; it means the transpose of ω. This should be easy to understand if you have studied linear algebra, and if you haven't, it doesn't matter: just think of ω^T x as ω multiplied by x. Strictly speaking, the expression above is the log-likelihood of the training data, so the quantity to minimize is its negative; the bias b from the prediction function appears to be absorbed into ω here. The detailed derivation is not required and will be included in the appendix. If you are interested, you can check it out later.
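As a sanity check on the compact formula, the sketch below compares it numerically with the more familiar form, the sum of y log h(x) + (1 - y) log(1 - h(x)). It assumes labels in {0, 1} and a bias folded into ω through a constant first feature of 1; the function names and the data are invented for this example.

```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def log_likelihood_compact(w, xs, ys):
    """Sum of y_i * w^T x_i - log(1 + exp(w^T x_i))."""
    return sum(y * dot(w, x) - math.log(1.0 + math.exp(dot(w, x)))
               for x, y in zip(xs, ys))


def log_likelihood_expanded(w, xs, ys):
    """Sum of y_i * log(h_i) + (1 - y_i) * log(1 - h_i), h_i = sigmoid(w^T x_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1.0 / (1.0 + math.exp(-dot(w, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total


# Hypothetical data; the leading 1.0 in each row plays the role of the bias.
xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
ys = [1, 0, 1]
w = [0.1, 0.8]
compact = log_likelihood_compact(w, xs, ys)
expanded = log_likelihood_expanded(w, xs, ys)
print(compact, expanded)
```

Both functions return the same value up to rounding. The value is negative, and its negation is the cross-entropy loss that gradient descent would minimize.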

What is gradient descent

We can look at the simplest function

$$y = x^2$$

1. First, take any value of x, such as -0.8, and compute the corresponding value of y.
2. Second, find the direction of the update. For example, if we update in the positive direction, x gets closer and closer to the final result (zero) with each update. The interval between updates (0.1 here) is called the learning rate in machine learning. When the learning rate is too high, x may fail to converge; when it is too low, x may converge too slowly.
3. Repeat steps 1 and 2 until x converges.

The above is the main idea of gradient descent, from which we get the descent formula, written here in its standard form:

$$x \leftarrow x - \alpha \frac{dy}{dx}$$

where α denotes the learning rate.
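The steps above can be sketched in a few lines of code. One assumption to flag: in this sketch each update moves x by the learning rate times the derivative 2x (the usual gradient descent rule) rather than by a fixed interval, and the function name and the number of steps are chosen only for illustration.

```python
def gradient_descent(x0, learning_rate, steps):
    """Minimize y = x**2 starting from x0; return every visited x."""
    x = x0
    history = [x]
    for _ in range(steps):
        grad = 2.0 * x                  # derivative of y = x**2
        x = x - learning_rate * grad    # step against the gradient
        history.append(x)
    return history


# Start at x = -0.8 with learning rate 0.1, as in the example above.
history = gradient_descent(x0=-0.8, learning_rate=0.1, steps=50)
print(history[-1])

# A learning rate that is too high overshoots further on every step.
diverged = gradient_descent(x0=-0.8, learning_rate=1.1, steps=50)
print(diverged[-1])
```

With a learning rate of 0.1, x is multiplied by 0.8 at every step, so it moves steadily from -0.8 toward 0. With 1.1 it is multiplied by -1.2, so it jumps across the minimum and grows in magnitude, which is the non-convergence described above.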

In practice

Practical code: logistic regression
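The original code for this section is not available here, so what follows is only a minimal sketch of the steps described earlier: a sigmoid prediction function, the log-loss gradient, and batch gradient descent. The function names, hyperparameters, and toy data are all assumptions made for illustration, and only the standard library is used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, learning_rate=0.1, epochs=1000):
    """Fit weights w and bias b by batch gradient descent on the mean log-loss."""
    n = len(xs)
    n_features = len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradients and step against them.
        w = [wi - learning_rate * g / n for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


def predict(w, b, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0


# Hypothetical one-feature data, centered around zero: negative -> 0, positive -> 1.
xs = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(w, b, predictions)
```

On this separable toy data the learned weight is positive and every training point is classified correctly. Real data would call for feature scaling, a held-out test set, and usually a library implementation rather than hand-written loops.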
