One way to avoid overfitting is to use regularization to limit the complexity of the model. Regularization means adding an extra term, the regular term, to the original problem being solved, so that the model does not overfit by becoming too complex.

Ridge Regression

Many machine learning models solve for the model parameters $w$ that minimize a loss function. For example, given a training set of $m$ samples, the loss function of linear regression is:

$$L(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(w^\top x^{(i)} - y^{(i)}\right)^2$$

On the basis of the above formula, we add a regular term to obtain a new loss function:

$$L(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(w^\top x^{(i)} - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n} w_j^2$$

Notice that the model parameter $w$ has $n$ dimensions; the newly added regular term simply squares each component $w_j$.

Intuitively, when we minimize the new loss function, on the one hand we want the squared error term of the linear regression itself to be as small as possible, and on the other hand each $w_j$ cannot be too large, otherwise the regular term $\lambda\sum_{j} w_j^2$ will be very large. The regular term, also known as the penalty term, punishes the situation where some $w_j$ is too large and the model therefore becomes too complex. In the regular term, $\lambda$ is the coefficient that balances the loss function against the regular term; it is known as the regularization coefficient. The larger this coefficient, the stronger the penalty effect of the regular term. The regularization coefficient will be discussed in more detail later.

For the new loss function just obtained, we can take the derivative to obtain the gradient, and then use gradient descent to solve for $w$:

$$\frac{\partial L(w)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(w^\top x^{(i)} - y^{(i)}\right)x_j^{(i)} + 2\lambda w_j$$

Linear regression with a quadratic regular term penalizing the parameters $w$ is called Ridge Regression; we say that L2 regularization is applied. Other machine learning models, such as logistic regression and neural networks, can also use L2 regularization.
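To make this concrete, here is a minimal NumPy sketch of ridge regression solved with gradient descent, following the gradient formula above. The function name, learning rate, iteration count, and synthetic data are illustrative choices, not anything from the original text.

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, lr=0.01, n_iters=1000):
    """Minimize the L2-regularized squared error by plain gradient descent.

    X: (m, n) feature matrix, y: (m,) targets, lam: regularization coefficient.
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        error = X @ w - y                        # residuals of the current fit
        grad = (X.T @ error) / m + 2 * lam * w   # squared-error gradient + L2 term
        w -= lr * grad
    return w

# Toy usage: recover weights from noisy synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=100)
print(ridge_gradient_descent(X, y))
```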

Lasso Regression

If instead we use the absolute values of the parameters as the regular term for linear regression, the model is known as Lasso (Least Absolute Shrinkage and Selection Operator) Regression; we say that L1 regularization is applied:

$$L(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(w^\top x^{(i)} - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n} \left|w_j\right|$$

As can be seen, Lasso regression uses the absolute value as the penalty term. The absolute value has a kink at zero and is not differentiable there, which makes the problem a bit harder to solve. Solving Lasso regression requires the subgradient method or the Proximal Gradient Descent (PGD) method, which will not be described in detail here.
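As a rough illustration of how the proximal gradient method handles the non-differentiable absolute value, here is a sketch of the ISTA variant of PGD for the Lasso loss above. The soft-thresholding function is the proximal operator of the L1 term; the step size, iteration count, and function names are arbitrary assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink toward zero, zeroing small values."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_pgd(X, y, lam=0.1, lr=0.01, n_iters=1000):
    """Proximal gradient descent (ISTA) for the L1-regularized squared error."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m                  # gradient of the smooth squared-error part
        w = soft_threshold(w - lr * grad, lr * lam)   # proximal step handles the |w| term
    return w
```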

General regular term

The regular term derives from the concept of a norm in linear algebra. A norm is a function $\|\cdot\|: V \rightarrow [0, +\infty)$, where $V$ is a vector space; that is, a norm maps a vector to a non-negative scalar. A commonly used family is the L$p$ norm:

$$\|w\|_p = \left(\sum_{j=1}^{n} \left|w_j\right|^p\right)^{1/p}$$

With $p = 1$ we get the L1 norm used by Lasso, and with $p = 2$ (squared) we get the L2 penalty used by ridge regression.
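As a quick, purely illustrative check, NumPy can evaluate these norms directly:

```python
import numpy as np

w = np.array([3.0, -4.0, 0.0, 1.0])
print(np.linalg.norm(w, 1))   # L1 norm: sum of absolute values -> 8.0
print(np.linalg.norm(w, 2))   # L2 norm: Euclidean length -> about 5.099
print(np.count_nonzero(w))    # "L0": number of non-zero components -> 3
```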

Sparse solutions and L1 regularization

If the training data have high-dimensional sparse features, for example only 1,000 dimensions out of 100,000,000 feature dimensions are non-zero and the rest are 0 or empty, then many of the parameters $w$ in the trained model will probably be close to 0. Take house price prediction: suppose we want to use the latitude and longitude of a house as features. Specifically, suppose the globe is divided into 10,000 latitude bins and 10,000 longitude bins, and the location of a house is represented by the latitude-longitude cross, such as “longitude 1001_latitude 999”: if the house is at that location the feature is 1, otherwise it is 0. The latitude-longitude features alone then amount to 100,000,000 dimensions. In reality we know that most areas outside of cities, such as mountains and oceans, are uninhabited, so most of these location parameters in the model should actually be zero. If we save every parameter of $w$, the model becomes very large and takes up a lot of memory, yet there is no need to store all the useless parameters. If we set the useless parameters to exactly zero and record only the useful ones, the space taken up by the model is greatly reduced. A solution in which very many components of the model parameters are zero is called a sparse solution.

Regularization can solve this problem. One approach is to use a penalty term that counts the number of non-zero parameters in the model, hoping that $w$ has as many zero components and as few non-zero components as possible. This is the most intuitive approach and corresponds to the L0 norm, but it turns a convex optimization problem into a non-convex one, which is inconvenient to solve. L2 regularization adds a squared penalty, which keeps the parameters as small as possible but does not force them to zero. L1 regularization also penalizes non-zero parameters and can, to some extent, drive parameters that are close to zero all the way to zero, approximately playing the role of L0. From the point of view of gradient descent: the L2 penalty is the squared term $w_j^2$, whose derivative $2w_j$ shrinks as $w_j$ approaches zero, so following the gradient the parameter may never reach exactly zero; the L1 penalty is the absolute value $|w_j|$, whose (sub)gradient has constant magnitude near zero, so it can force parameters that are already close to zero to become exactly zero.

The figure above shows an 8-dimensional parameter model after training; it can be seen that L1 regularization makes parameters that are close to zero much more likely to end up exactly at zero.
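The same effect can be reproduced with a small scikit-learn experiment (a sketch with arbitrary synthetic data and alpha values, not the setup behind the figure): Lasso (L1) zeroes out most of the irrelevant weights, while Ridge (L2) only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where only 2 of 10 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[0], true_w[5] = 5.0, -3.0
y = X @ true_w + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks the irrelevant weights toward zero; Lasso sets many exactly to zero.
print("ridge non-zero weights:", np.count_nonzero(np.abs(ridge.coef_) > 1e-6))
print("lasso non-zero weights:", np.count_nonzero(np.abs(lasso.coef_) > 1e-6))
```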

Let’s look at L1 and L2 regularization from a visual perspective. Assume that our linear regression model has only two dimensions, that is, $w$ has only two components $w_1$ and $w_2$. Taking them as the two coordinate axes, we can draw the contour lines of the squared error term and of the regular term, as shown in the figure below.

Here, the contours toward the upper right are the contour lines of the squared error term, i.e. lines along which the squared error takes the same value. The contours around the origin are the contour lines of the regular term, i.e. lines along which the regular term takes the same value in $(w_1, w_2)$ space. Without the regular term, the optimal solution would be the center of the squared-error contours, the point where the squared error is smallest. With the regular term, the optimal solution is a compromise between the squared error term and the regular term, namely the intersection point in the figure. It can be seen that the intersection point for the L1 regular term lies on a coordinate axis, i.e. $w_1$ or $w_2$ is 0, whereas the intersection point for the L2 regular term is unlikely to lie on a coordinate axis.
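If you want to reproduce this kind of figure yourself, the following matplotlib sketch draws contour lines of a squared error term centered at a hypothetical unregularized optimum together with the L1 and L2 regular-term contours; the center point and the scaling are arbitrary illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over the (w1, w2) plane.
w1, w2 = np.meshgrid(np.linspace(-2, 4, 400), np.linspace(-2, 4, 400))

# Squared error contours centered at a hypothetical unregularized optimum (2, 2.5).
sq_err = (w1 - 2.0) ** 2 + 2.0 * (w2 - 2.5) ** 2

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, title in zip(axes, ["L1: |w1| + |w2|", "L2: w1^2 + w2^2"]):
    ax.contour(w1, w2, sq_err, levels=8, colors="tab:blue")     # squared error term
    reg = np.abs(w1) + np.abs(w2) if title.startswith("L1") else w1 ** 2 + w2 ** 2
    ax.contour(w1, w2, reg, levels=8, colors="tab:orange")      # regular term
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_title(title)
    ax.set_xlabel("w1")
    ax.set_ylabel("w2")
plt.show()
```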

Regularization coefficient

The following formula gives a more general definition of regularization: the objective is the original loss plus the regularization coefficient times a regular term $\Omega(w)$:

$$\min_{w}\; L(w) + \lambda\,\Omega(w)$$

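In code, this general pattern is simply “original loss plus $\lambda$ times a pluggable regular term”; the sketch below is illustrative and not tied to any particular library.

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty):
    """General form: original loss plus lambda times a regular term Omega(w)."""
    squared_error = np.mean((X @ w - y) ** 2) / 2    # original loss L(w)
    return squared_error + lam * penalty(w)          # lam is the regularization coefficient

l1_penalty = lambda w: np.sum(np.abs(w))   # Lasso-style regular term
l2_penalty = lambda w: np.sum(w ** 2)      # Ridge-style regular term
```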
The regularization coefficient $\lambda$ strives to balance how well the model fits the training data against the complexity of the model itself:

  • If the regularization coefficient is too large, the model may be relatively simple, but there is a risk of underfitting. The model may not learn some characteristics of the training data and may be inaccurate in its prediction.
  • If the regularization coefficient is too small, the model will be complicated and there is a risk of over-fitting. The model strives to learn every characteristic of the training data, but its ability to generalize to new data may be poor.
  • The ideal regularization coefficient gives the model good generalization ability. However, the best coefficient depends on the specific problem, such as the training data and the business scenario, so it has to be found through tuning, for example with the cross-validation sketch below.
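As one possible way to tune it, the sketch below uses scikit-learn's RidgeCV to pick a regularization coefficient by cross-validation over a log-spaced grid of candidate values; the grid and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Candidate regularization coefficients, spaced on a log scale (illustrative values).
alphas = np.logspace(-4, 2, 13)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] * 2.0 - X[:, 3] + 0.1 * rng.normal(size=200)

# 5-fold cross-validation picks the coefficient with the best validation score.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print("selected regularization coefficient:", model.alpha_)
```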

Practice and consolidate

For the different options and parameters of regularization, we can experiment in the TensorFlow Playground and observe how the different choices change the results. Address: playground.tensorflow.org/
