Author: He Congqing

In machine learning today, the three most common tasks are regression analysis, classification analysis, and cluster analysis. In a previous article, I wrote a 15-minute introduction to sklearn and machine learning: classification algorithms. So what is regression? Regression analysis is a predictive modeling technique that studies the relationship between dependent variables (targets) and independent variables (predictors). It is widely used in machine learning, for example to predict product sales or traffic flow. So how do you choose the most appropriate machine learning algorithm for these regression problems? This article covers the following aspects:

1. Commonly used regression algorithms

2. Regression competition problems and solutions

3. Ongoing regression competitions

Commonly used regression algorithms

Here are some commonly used machine learning methods for regression problems. As a powerful machine learning algorithm package, sklearn has many classic regression algorithms built in. Each algorithm is introduced below:

Note: The code for these regression algorithms has been uploaded to a network drive. If you are interested, follow the author's official account "Heart of AI Algorithm" and reply "regression algorithm" to get it.

1. Linear regression

Linear regression fits a linear model with coefficients chosen to minimize the residual sum of squares between the observed targets in the data and the targets predicted by the linear approximation.

sklearn provides an interface to the linear regression algorithm, as shown in the following code example:

# Load the linear model algorithm library
from sklearn import linear_model
# Create a linear regression model object
regr = linear_model.LinearRegression()
# Train the model
regr.fit(X_train, y_train)
# Predict on the test set
y_pred = regr.predict(X_test)
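
The snippet above assumes that X_train, y_train, and X_test already exist. As a self-contained sketch (the dataset choice is mine, not from the original), the same model can be fit on sklearn's built-in diabetes dataset:

from sklearn import linear_model
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a small regression dataset and split off a test set
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print(mean_squared_error(y_test, y_pred))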

2. Ridge regression

Linear regression optimizes each coefficient by ordinary least squares. Ridge regression adds an L2-norm penalty on the coefficients to address some of the problems of ordinary least squares, for example when the features are perfectly collinear (the solution is not unique) or highly correlated; in such cases ridge regression is the appropriate choice.

# Load the linear model algorithm library
from sklearn.linear_model import Ridge
# Create a ridge regression model object
reg = Ridge(alpha=.5)
# Train the model
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
# Output the coefficients and the intercept
reg.coef_
reg.intercept_
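
In practice the penalty strength alpha has to be tuned. A minimal sketch using sklearn's RidgeCV, which picks alpha by built-in cross-validation (the candidate grid here is my own choice):

from sklearn.linear_model import RidgeCV

# Try several penalty strengths and keep the one that cross-validates best
reg = RidgeCV(alphas=[0.1, 1.0, 10.0])
reg.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])
print(reg.alpha_)  # the alpha selected by cross-validation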

3. Lasso regression

Lasso is a linear model that estimates sparse coefficients. It is useful in some settings because it tends to prefer solutions with fewer nonzero coefficients, effectively reducing the number of features a given solution depends on. The Lasso model adds an L1-norm penalty term to the least squares objective.

# Load the Lasso model algorithm library
from sklearn.linear_model import Lasso
# Create a Lasso regression model object
reg = Lasso(alpha=0.1)
# Train the Lasso regression model
reg.fit([[0, 0], [1, 1]], [0, 1])
"""
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)
"""
reg.predict([[1, 1]])
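
To see the sparsity described above, here is a small sketch (the toy data is my own) fitting Lasso on a dataset where only two of ten features are informative; most coefficients come out exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Ten features, but only two actually drive the target
X, y = make_regression(n_features=10, n_informative=2, random_state=0)
reg = Lasso(alpha=1.0)
reg.fit(X, y)
# Most coefficients are exactly zero: Lasso performs feature selection
print(reg.coef_)
print(np.sum(reg.coef_ != 0), "nonzero coefficients")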

4. Elastic Net regression

Elastic Net is a linear model that uses both the L1 and L2 norms as penalty terms. This combination allows it to learn a sparse model while keeping the regularization properties of ridge regression.

# Load the ElasticNet model algorithm library
from sklearn.linear_model import ElasticNet
# Load the dataset generator
from sklearn.datasets import make_regression
# Generate a toy regression dataset
X, y = make_regression(n_features=2, random_state=0)
# Create an elastic net regression model object and train it
regr = ElasticNet(random_state=0)
regr.fit(X, y)
print(regr.coef_)
print(regr.intercept_)
print(regr.predict([[0, 0]]))
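
The balance between the L1 and L2 penalties is controlled by the l1_ratio parameter (1.0 is pure Lasso, values near 0 approach ridge). A brief sketch with values of my own choosing:

# l1_ratio close to 1 behaves like Lasso, close to 0 like ridge
regr = ElasticNet(alpha=0.1, l1_ratio=0.7, random_state=0)
regr.fit(X, y)
print(regr.coef_)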

5. Bayesian ridge regression

Bayesian ridge regression is similar to ridge regression. Bayesian regression estimates the parameters by maximizing the marginal log-likelihood.

from sklearn.linear_model import BayesianRidge
X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
Y = [0., 1., 2., 3.]
reg = BayesianRidge()
reg.fit(X, Y)
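
Because the model is Bayesian, its predictions come with uncertainty estimates. Continuing the example above, predict can return a standard deviation along with the mean (return_std is part of BayesianRidge's API):

# The Bayesian model also provides a standard deviation for each prediction
y_mean, y_std = reg.predict([[1., 1.]], return_std=True)
print(y_mean, y_std)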

6. SGD regression

The linear models above optimize their loss functions by least squares. SGD regression is also a linear regression; the difference is that it minimizes a regularized empirical loss by stochastic gradient descent.

import numpy as np
from sklearn import linear_model
n_samples, n_features = 10, 5
np.random.seed(0)
y = np.random.randn(n_samples)
X = np.random.randn(n_samples, n_features)
clf = linear_model.SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(X, y)
"""SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1, eTA0 =0.01, fit_intercept=True, L1_ratio =0.15, learning_rate='invscaling', loss='squared_loss', max_iter=1000, n_iter=None, n_iter_no_change=5, Penalty ='l2', power_t=0.25, random_state=None, shuffle=True, TOL =0.001, validation_fraction=0.1, verbose=0, warm_start=False) """
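
SGD-based models are sensitive to feature scaling, so it is common to standardize the inputs first. A minimal sketch that wraps the regressor in a sklearn pipeline (the pipeline is my addition, not part of the original):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardizing the features first usually helps SGD converge
clf = make_pipeline(StandardScaler(),
                    linear_model.SGDRegressor(max_iter=1000, tol=1e-3))
clf.fit(X, y)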

7. SVR

As is well known, support vector machines are widely used for classification, and the approach can be extended to regression problems, where it is called support vector regression. Like support vector classification, the model produced by support vector regression depends on only a subset of the training data.

# Load the SVR model algorithm library
from sklearn.svm import SVR
# Training set
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
# Create an SVR model object and train it
clf = SVR()
clf.fit(X, y)
"""
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
"""
clf.predict([[1, 1]])
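
SVR's behavior is governed mainly by the kernel, the regularization parameter C, and the width epsilon of the insensitive tube. A quick sketch with alternative settings (the specific values are purely illustrative):

# The kernel, C, and epsilon are the main knobs of SVR
clf_lin = SVR(kernel='linear', C=10.0, epsilon=0.2)
clf_lin.fit(X, y)
print(clf_lin.predict([[1, 1]]))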

8. KNN regression

KNN regression can be used when the data labels are continuous rather than discrete variables. The label assigned to a query point is computed as the mean of the labels of its nearest neighbors.

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y) 
print(neigh.predict([[1.5]]))
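
Instead of a plain average, the neighbors can be weighted by the inverse of their distance via weights='distance', so that closer neighbors count more (a small variation added for illustration):

# Weight neighbors by inverse distance instead of averaging them equally
neigh_w = KNeighborsRegressor(n_neighbors=2, weights='distance')
neigh_w.fit(X, y)
print(neigh_w.predict([[1.5]]))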

9. Decision tree regression

Decision trees can also be applied to regression problems using SkLearn’s DecisionTreeRegressor class.

from sklearn.tree import DecisionTreeRegressor
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = DecisionTreeRegressor()
clf = clf.fit(X, y)
clf.predict([[1, 1]])
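
An unconstrained tree can memorize its training data, so the depth is usually limited. A short sketch (the noisy toy data is my own) comparing a shallow and a deep tree:

import numpy as np

# A noisy sine curve: a deeper tree fits the noise as well as the signal
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
deep = DecisionTreeRegressor(max_depth=10).fit(X, y)
print(shallow.predict([[2.5]]), deep.predict([[2.5]]))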

10. Neural network

sklearn implements a multi-layer perceptron (MLP) for regression in the MLPRegressor class. It is trained with backpropagation and uses no activation function in the output layer (equivalently, the identity function as the activation). It therefore uses the squared error as the loss function, and its output is a set of continuous values.

from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor()
mlp.fit(X_train, y_train)
"""
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)
"""
y_pred = mlp.predict(X_test)
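
The snippet above assumes X_train and the related arrays already exist, and like SGD, an MLP benefits from standardized inputs. A self-contained sketch (the network size and toy data are my own choices):

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; scaling the inputs helps the optimizer converge
mlp = make_pipeline(StandardScaler(),
                    MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500,
                                 random_state=0))
mlp.fit(X_train, y_train)
print(mlp.predict(X_test[:3]))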

11. RandomForest regression

RandomForest regression is also one of the classic ensemble algorithms.

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
regr = RandomForestRegressor(max_depth=2, random_state=0,
                             n_estimators=100)
regr.fit(X, y)
print(regr.feature_importances_)
print(regr.predict([[0, 0, 0, 0]]))
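
Random forests also offer a built-in validation estimate: with oob_score=True, each tree is scored on the samples it never saw during bootstrapping. A brief sketch reusing X and y from above:

# The out-of-bag (OOB) score estimates generalization without a hold-out set
regr_oob = RandomForestRegressor(n_estimators=100, oob_score=True,
                                 random_state=0)
regr_oob.fit(X, y)
print(regr_oob.oob_score_)  # R^2 estimated on out-of-bag samples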

12. XGBoost regression

XGBoost has achieved a great deal in recent years: essentially every champion solution in machine learning competitions uses it. XGBoost provides two kinds of interfaces, and here I only introduce its sklearn interface. For more information:

xgboost.readthedocs.io/en/latest/p…

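The XGBoost and LightGBM snippets below assume that X_train, X_test, y_train, and y_test already exist. A toy split such as the following (my own setup, not from the original) makes them runnable end to end:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Toy data so the boosting examples below can run as written
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
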
import xgboost as xgb
from sklearn.metrics import mean_squared_error
xgb_model = xgb.XGBRegressor(max_depth=3,
                             learning_rate=0.1,
                             n_estimators=100,
                             objective='reg:linear',
                             n_jobs=-1)
xgb_model.fit(X_train, y_train,
              eval_set=[(X_train, y_train)],
              eval_metric='rmse',  # 'logloss' is a classification metric; use a regression metric here
              verbose=100)
y_pred = xgb_model.predict(X_test)
print(mean_squared_error(y_test, y_pred))

13. LightGBM regression

LightGBM is another gradient boosting framework that uses tree-based learning algorithms. It shows up in virtually every algorithm competition, and for achieving good results it is an indispensable tool. Compared with XGBoost, LightGBM offers faster training with higher efficiency and lower memory usage. LightGBM also provides two kinds of interfaces; here I likewise introduce only its sklearn interface. For more information: lightgbm.readthedocs.io/en/latest/

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
gbm = lgb.LGBMRegressor(num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)
gbm.fit(X_train, y_train,
        eval_set=[(X_train, y_train)], 
        eval_metric='l2',  # 'logloss' is a classification metric; 'l2' (MSE) suits regression
        verbose=100)
y_pred = gbm.predict(X_test)
print(mean_squared_error(y_test, y_pred))

Regression competition problems and solutions

To make it easier to practice machine learning on real projects, here are some regression competitions that can help beginners gain a deeper grasp of regression problems in machine learning.

Entry-level competition:

Kaggle — House price forecast

As one of the most basic regression problems, this competition is well suited to machine learning beginners; a minimal baseline sketch follows the solution links below.

Website: www.kaggle.com/c/house-pri…

Classic solutions:

XGBoost solution: www.kaggle.com/dansbecker/…

Lasso solution: www.kaggle.com/mymkyt/simp…
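
For reference, a minimal baseline for this competition might look like the sketch below. The train.csv/test.csv file names and the Id and SalePrice columns follow the competition's format, but treat the details as illustrative rather than a tested solution:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative baseline: numeric features only, missing values filled with medians
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

y = train['SalePrice']
X = train.drop(columns=['Id', 'SalePrice']).select_dtypes('number')
X = X.fillna(X.median())
X_test = test[X.columns].fillna(X.median())

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Kaggle expects a CSV with Id and SalePrice columns
submission = pd.DataFrame({'Id': test['Id'], 'SalePrice': model.predict(X_test)})
submission.to_csv('submission.csv', index=False)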

Advanced competition:

Kaggle — Sales forecast

One of the classic time-series problems, this competition asks you to predict the total sales of every product and store for the next month.

Website: www.kaggle.com/c/competiti…

Classic solutions:

LightGBM: www.kaggle.com/sanket30/pr…

XGBoost: www.kaggle.com/fabianabold…

The first solution: www.kaggle.com/c/competiti…

Top competition solutions:

Kaggle — Restaurant visitor forecast

Website: www.kaggle.com/c/recruit-r…

Solutions:

1st place solution: www.kaggle.com/plantsgo/so…

7th place solution: www.kaggle.com/c/recruit-r…

8th place solution: github.com/MaxHalford/…

12th place solution: www.kaggle.com/c/recruit-r…

Kaggle — Corporación Favorita Grocery Sales Forecasting

Website: www.kaggle.com/c/favorita-…

Solutions:

1st place solution: www.kaggle.com/c/favorita-…

2nd place solution: www.kaggle.com/c/favorita-…

3rd place solution: www.kaggle.com/c/favorita-…

4th place solution: www.kaggle.com/c/favorita-…

5th place solution: www.kaggle.com/c/favorita-…

6th place solution: www.kaggle.com/c/favorita-…

Ongoing regression competitions

Seeing the solutions above, aren't you eager to try one yourself? Several regression competitions are currently running in China, so strike while the iron is hot and learn from them!

2019 Tencent Advertising Contest — Advertising exposure estimation

Address: algo.qq.com/application…

Dataset download:

Follow the author's official account "Heart of AI Algorithm" and reply "regression competition dataset" to download the competition dataset above.
