Prerequisites for this article: ridge regression, Lasso regression, and a little programming knowledge

1. Introduction

We have already learned two regularization methods: ridge regression and Lasso regression. When several features are correlated, Lasso regression may arbitrarily select only one of them, while ridge regression keeps all of them. It is natural to think that combining the two regularization methods would combine their advantages. The resulting algorithm is called elastic net regression [1].

2. Model introduction

The cost function of the elastic net regression algorithm combines the regularization terms of Lasso regression and ridge regression, using two parameters λ and ρ to control the strength of the penalty terms.


\operatorname{Cost}(w)=\sum_{i=1}^{N}\left(y_{i}-w^{T} x_{i}\right)^{2}+\lambda \rho\|w\|_{1}+\frac{\lambda(1-\rho)}{2}\|w\|_{2}^{2}
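For illustration, this cost can be computed directly with NumPy. The sketch below is an addition to the article; the function name elastic_net_cost is made up for this example and is not part of the article's code.

import numpy as np

def elastic_net_cost(X, y, w, lambdas=0.1, rhos=0.5):
    # Squared-error term: sum_i (y_i - w^T x_i)^2
    residuals = y - X.dot(w)
    loss = np.sum(residuals ** 2)
    # L1 penalty weighted by lambda * rho
    l1 = lambdas * rhos * np.sum(np.abs(w))
    # L2 penalty weighted by lambda * (1 - rho) / 2
    l2 = lambdas * (1 - rhos) / 2 * np.sum(w ** 2)
    return loss + l1 + l2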

As before, we seek the w that minimizes the cost function:


w=\underset{w}{\operatorname{argmin}}\left(\sum_{i=1}^{N}\left(y_{i}-w^{T} x_{i}\right)^{2}+\lambda \rho\|w\|_{1}+\frac{\lambda(1-\rho)}{2}\|w\|_{2}^{2}\right)

It can be seen that when ρ = 0 the cost function is equivalent to that of ridge regression, and when ρ = 1 it is equivalent to that of Lasso regression. As with Lasso regression, the cost function contains an absolute value and is therefore not differentiable everywhere, so an analytical solution for w cannot be obtained by direct differentiation; however, w can still be solved for with coordinate descent [2].
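Substituting the two endpoint values of ρ makes this explicit (at ρ = 0 the surviving L2 term carries the factor 1/2 from the definition above, so it matches the ridge regression cost up to a rescaling of λ):

\rho=0: \quad \operatorname{Cost}(w)=\sum_{i=1}^{N}\left(y_{i}-w^{T} x_{i}\right)^{2}+\frac{\lambda}{2}\|w\|_{2}^{2}

\rho=1: \quad \operatorname{Cost}(w)=\sum_{i=1}^{N}\left(y_{i}-w^{T} x_{i}\right)^{2}+\lambda\|w\|_{1}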

3. Algorithm steps

Coordinate descent: the solution procedure is the same as the one used for Lasso regression; the only difference is the cost function. Specific steps: (1) Initialize the weight coefficients w, for example to the zero vector. (2) Traverse all the weight coefficients, taking each one in turn as the variable while the other weight coefficients are fixed at the results of the previous calculation, and find the optimal solution of the cost function with respect to that single weight coefficient. In the k-th iteration, the weight coefficients are updated as follows:


where w_m^k denotes the m-th weight coefficient in the k-th iteration:

\begin{matrix} w_1^k = \underset{w_1}{\operatorname{argmin}} \left( \operatorname{Cost}(w_1, w_2^{k-1}, \dots, w_{m-1}^{k-1}, w_m^{k-1}) \right) \\ w_2^k = \underset{w_2}{\operatorname{argmin}} \left( \operatorname{Cost}(w_1^{k}, w_2, \dots, w_{m-1}^{k-1}, w_m^{k-1}) \right) \\ \vdots \\ w_m^k = \underset{w_m}{\operatorname{argmin}} \left( \operatorname{Cost}(w_1^{k}, w_2^{k}, \dots, w_{m-1}^{k}, w_m) \right) \end{matrix}

(3) Step (2) counts as one complete iteration. When all the weight coefficients change very little, or the maximum number of iterations is reached, the iteration ends.
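Because all coefficients except one are held fixed, each single-coordinate subproblem is a one-dimensional quadratic plus an absolute-value term and has a closed-form minimizer. The derivation below is a sketch added here; it follows the convention of the code in the next section, where the squared loss is normalized by 2N, and the symbols aa = \sum_i x_{ij}^2 and ab = \sum_i x_{ij} b_i (with b_i the prediction for sample i excluding the contribution of w_j, minus y_i) match the variable names used in that code:

\operatorname{Cost}(w_j)=\left(\frac{aa}{2N}+\frac{\lambda(1-\rho)}{2}\right) w_j^{2}+\frac{ab}{N} w_j+\lambda\rho\left|w_j\right|+c

w_j=\begin{cases}-\dfrac{ab/N+\lambda\rho}{aa/N+\lambda(1-\rho)} & ab/N<-\lambda\rho \\ -\dfrac{ab/N-\lambda\rho}{aa/N+\lambda(1-\rho)} & ab/N>\lambda\rho \\ 0 & \text{otherwise}\end{cases}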

4. Code implementation

Implementing the elastic net regression algorithm (coordinate descent method) in Python:

import numpy as np

def elasticNet(X, y, lambdas=0.1, rhos=0.5, max_iter=1000, tol=1e-4):
    """
    Elastic net regression, solved with coordinate descent
    args:
        X - training dataset
        y - target label values
        lambdas - penalty term coefficient
        rhos - mixing parameter, in the range [0, 1]
        max_iter - maximum number of iterations
        tol - tolerance for the change of the coefficients
    return:
        w - weight coefficients
    """
    # Initialize w as the zero vector
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        done = True
        # Loop over all weight coefficients
        for i in range(len(w)):
            # Record the coefficient from the previous round
            weight = w[i]
            # Find the best value of this coefficient with all others fixed
            w[i] = down(X, y, w, i, lambdas, rhos)
            # If any coefficient changes by more than the tolerance, keep iterating
            if np.abs(weight - w[i]) > tol:
                done = False
        # End the loop when all coefficients change little
        if done:
            break
    return w

def down(X, y, w, index, lambdas=0.1, rhos=0.5):
    """
    cost(w) = (x1 * w1 + x2 * w2 + ... - y)^2 / 2n
              + lambda * rho * (|w1| + |w2| + ...)
              + [lambda * (1 - rho) / 2] * (w1^2 + w2^2 + ...)
    Treat w1 as the variable and everything else as constant; after substitution the
    cost is a univariate quadratic function of w1 (plus an absolute-value term):
    cost(w1) = (a * w1 + b)^2 / 2n + lambda * rho * |w1|
               + [lambda * (1 - rho) / 2] * w1^2 + c      (a, b, c, lambda are constants)
    => expanded:
    cost(w1) = [aa / 2n + lambda * (1 - rho) / 2] * w1^2 + (ab / n) * w1
               + lambda * rho * |w1| + c                  (aa, ab, c, lambda are constants)
    """
    # Sum of the coefficients of the quadratic term after expansion
    aa = 0
    # Sum of the coefficients of the linear term after expansion
    ab = 0
    for i in range(X.shape[0]):
        # Coefficient of the linear term inside the parentheses
        a = X[i][index]
        # Constant term inside the parentheses (prediction without this coefficient, minus y)
        b = X[i][:].dot(w) - a * w[index] - y[i]
        # The expanded quadratic coefficient is the sum of the squared linear coefficients
        aa = aa + a * a
        # The expanded linear coefficient is the sum of the linear coefficient times the constant term
        ab = ab + a * b
    # The minimum of the univariate quadratic is where its derivative is zero;
    # only the quadratic coefficient, the linear coefficient and lambda matter
    return det(aa, ab, X.shape[0], lambdas, rhos)

def det(aa, ab, n, lambdas=0.1, rhos=0.5):
    """
    Solve for w by setting the derivative of the cost function to zero (not differentiable at w = 0):
    det(w) = [aa / n + lambda * (1 - rho)] * w + ab / n + lambda * rho = 0   (w > 0)
        => w = - (ab / n + lambda * rho) / [aa / n + lambda * (1 - rho)]
    det(w) = [aa / n + lambda * (1 - rho)] * w + ab / n - lambda * rho = 0   (w < 0)
        => w = - (ab / n - lambda * rho) / [aa / n + lambda * (1 - rho)]
    det(w) = NaN   (w = 0)
        => w = 0
    """
    # Try the solution for w > 0 first
    w = - (ab / n + lambdas * rhos) / (aa / n + lambdas * (1 - rhos))
    if w < 0:
        # Contradiction, so try the solution for w < 0
        w = - (ab / n - lambdas * rhos) / (aa / n + lambdas * (1 - rhos))
        if w > 0:
            # Contradiction again, so the minimum is at w = 0
            w = 0
    return w
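A quick usage sketch (an addition, not part of the original article; the synthetic data and the expected output are hypothetical), just to show how elasticNet is called:

# Hypothetical example: y depends on two features, y = 3 * x1 - 2 * x2 + noise
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = X.dot(np.array([3.0, -2.0])) + 0.1 * rng.randn(100)

w = elasticNet(X, y, lambdas=0.1, rhos=0.5)
print(w)  # expected to be roughly [3, -2], shrunk slightly toward zero by the penalties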

5. Third-party library implementation

scikit-learn [3] implementation:

from sklearn.linear_model import ElasticNet

# Initialize the elastic net regressor
reg = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False)
# Fit the linear model
reg.fit(X, y)
# Weight coefficient
w = reg.coef_
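As a sanity check (an addition to the original article): the coordinate-descent code above normalizes the squared loss by 2n (see the docstring of down), which is also scikit-learn's convention, so alpha corresponds to lambdas and l1_ratio to rhos. On the same data the two implementations should therefore produce very similar coefficients; the data below is hypothetical.

import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data for the comparison
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X.dot(np.array([1.5, 0.0, -2.0])) + 0.1 * rng.randn(100)

w_manual = elasticNet(X, y, lambdas=0.1, rhos=0.5)
reg = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print(w_manual)
print(reg.coef_)  # should agree with w_manual up to the solvers' tolerances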

6. Animation demonstration

The following animation shows the influence of different values of ρ on elastic net regression. As ρ increases, the L1 regularization term dominates and the cost function gets closer to that of Lasso regression; as ρ decreases, the L2 regularization term dominates and the cost function gets closer to that of ridge regression.

The following animation shows a comparison between Lasso regression and elastic net regression. The dashed lines represent the ten features of Lasso regression, the solid lines represent the ten features of elastic net regression, and each color represents the weight coefficient of one independent variable (the training data comes from scikit-learn's diabetes dataset).

Comparison of Lasso regression and elastic net regression

It can be seen that, compared with Lasso regression, elastic net regression retains the feature-selection property of Lasso regression while also gaining some of the stability of ridge regression.
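This behaviour can be checked with a short script. The sketch below is an addition (assuming the scikit-learn diabetes dataset mentioned above); it fits Lasso and elastic net over a few regularization strengths and counts the non-zero coefficients:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, ElasticNet

X, y = load_diabetes(return_X_y=True)

for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X, y)
    # Lasso tends to zero out more coefficients; elastic net tends to keep groups of correlated features
    print(alpha, np.count_nonzero(lasso.coef_), np.count_nonzero(enet.coef_))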

7. Mind map

8. References

  1. en.wikipedia.org/wiki/Elasti…
  2. en.wikipedia.org/wiki/Coordi…
  3. scikit-learn.org/stable/modu…

The full demo can be found here

Note: this article strives to be accurate and easy to understand, but as the author is a beginner with limited experience, there may be errors or omissions; readers are kindly asked to point them out in the comments.