Yong Wang is a columnist for the Python Chinese community, currently interested in business analysis, Python, machine learning, and Kaggle. He has 17 years of project management experience: 11 years in communications project and contract delivery, and 6 years in manufacturing project management (PMO, change management, production transfer, liquidation and asset disposal). MBA, PMI-PBA, PMP.




2017 is coming to an end. I spent much of my spare time this year learning Python and machine learning, mainly by entering various competitions on Kaggle. Let this be my farewell article on machine learning for 2017.

Kaggle House Price: statistical tests in the feature engineering section \

Kaggle building blocks: the feature engineering section \

Kaggle building-block score: LB 0.11666 (top 15%) \

In the opening article, Principles, I talked about:

1. A summary of the building-block approach to boosting scores (that is, combining Pandas's pipe with Sklearn's Pipeline).

Pandas' pipe was described in the second article. A Pipeline combined with GridSearch or RandomizedSearch can chain multiple transformations with a prediction algorithm, tune the hyperparameters, and select between algorithms. This is explained below. Note: Gridsearch in this article refers to either grid search or randomized search.

2. My own understanding of common feature engineering practices (for example: why the log transform, normalization, etc.).

The log transform was introduced in the second article. Normalization is covered in this article. There are many ways to normalize: StandardScaler (mean 0, standard deviation 1), RobustScaler (robust to outliers), and so on.
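As a minimal sketch of the difference between the two scalers, on toy data rather than the competition data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy data: the last value is an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler centers on the mean and scales by the standard deviation,
# so a single outlier stretches the scale for every other point.
print(StandardScaler().fit_transform(X).ravel())

# RobustScaler centers on the median and scales by the IQR,
# so the bulk of the data keeps a sensible scale despite the outlier.
print(RobustScaler().fit_transform(X).ravel())
```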

3. Sharing the problems I encountered (over-fitting caused by feature engineering) and how I resolved them.

The problem was essentially data leakage. I built a new feature (unit price per square meter) in the training set; the local CV RMSE looked very good (around 0.05), but the public leaderboard score was very poor (about 0.3). The reason is simple: using the sale price to generate new features usually destroys generalization and is not advisable.
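A minimal sketch of how that leak arises, using the usual House Price column names (SalePrice, GrLivArea) on made-up values:

```python
import pandas as pd

# A hypothetical miniature of the training set (real column names, toy values).
train = pd.DataFrame({"GrLivArea": [1500, 2000, 1200],
                      "SalePrice": [200000, 300000, 150000]})

# The new feature is derived from the target itself -> data leakage.
train["price_per_sqft"] = train["SalePrice"] / train["GrLivArea"]

# Local CV looks great: the model can recover SalePrice almost exactly from
# price_per_sqft * GrLivArea. But the feature cannot be built for the test set,
# where SalePrice is unknown, so the public leaderboard score collapses.
```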

Introduction to Sklearn Pipeline

If Pandas’ Pipe is a steam train, Sklearn’s Pipeline is an electric train with a dispatch center. Trains are algorithms, and the dispatch center makes the algorithms run in sequence, no more, no less, no faster, no slower.

Once the data is ready and training begins, the most basic task is tuning hyperparameters, which is time-consuming and labor-intensive, and prone to errors and omissions.

Common training mistakes, from my own work and from Stack Overflow, are:

1. The algorithm's predictions differ wildly from expectations. One possibility is that a standardization step used during training was skipped at prediction time.

2. Tuning results for the same algorithm differ greatly between runs (some give 0.01, some give 10). One possibility is that different runs used different standardization algorithms (for example, StandardScaler in one and RobustScaler in another).

3. There are many hyperparameters, and adjusting them by hand is cumbersome; it is easy to make mistakes or mistype.
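To make the first mistake concrete, here is a minimal sketch on synthetic data (in practice the prepared House Price features would take the place of the generated X and y):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Synthetic stand-in data.
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Error-prone manual version: the scaler must also be applied to X_test,
# and that step is easy to forget at prediction time.
scaler = StandardScaler().fit(X_train)
model = Lasso().fit(scaler.transform(X_train), y_train)
bad_pred = model.predict(X_test)                      # forgot scaler.transform -> wrong scale
good_pred = model.predict(scaler.transform(X_test))   # correct, but relies on remembering

# Pipeline version: the same scaler is applied automatically in both fit and predict.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", Lasso())]).fit(X_train, y_train)
pred = pipe.predict(X_test)
```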

My solution: Pipeline + Gridsearch + parameter dictionary + container.

Example of using Pipeline

For linear regression problems, Sklearn provides more than 15 regression algorithms. With the Pipeline approach, all of them can be tested systematically to find the best-suited one. The specific steps are as follows:

1. Initialize all the desired linear regression algorithms.

2. Create a dictionary container: {"algorithm name": [initial algorithm object, parameter dictionary, trained Pipeline model object, CV result]}.

3. In the tuning step, wrap the initial algorithm in a Pipeline and tune it with Gridsearch. After tuning, the final model object and the corresponding CV result can be retrieved. For the Lasso algorithm, for example, the steps are as follows (a cleaned-up, runnable sketch follows after this list):

  • pipe = Pipeline([("scaler", None), ("selector", None), ("clf", Lasso())]). Using the pipe to process both the training set and the test set avoids errors and omissions and improves efficiency. However, with the default parameters, the directly trained model's RMSE is not ideal.
  • Prepare the parameter dictionary and tune with grid search: params_lasso = {"scaler": [RobustScaler(), StandardScaler()], "clf__alpha": np.logspace(-5, -1, 10)} (10 candidate alphas), then grid_search = GridSearchCV(pipe, param_grid=params_lasso, scoring='neg_mean_squared_error', cv=10, refit=True).
    - Gridsearch is brute-force tuning: it iterates over every parameter combination. RandomizedSearch instead samples combinations at random, cutting tuning time while reaching roughly the same performance.
    - pipe is the algorithm just wrapped above; GridSearch feeds the candidate parameters (and the alternative scalers) into it to find the better combinations.
    - The tuning criterion 'neg_mean_squared_error' is the negative of the MSE, so maximizing it means minimizing the MSE. Apply np.sqrt to the negated result once to get the RMSE.
    - cv=10: cross-validation with a 9:1 split. On a small dataset such as House Price, the gap between 3-fold and 10-fold results can be larger than the gap between parameter settings.
    - refit=True: after tuning, fit the best combination on the whole training set again, producing a complete trained model.
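Putting the two bullet points above together, here is a minimal runnable sketch of the Lasso tuning step; X_train and y_train are assumed to be the already-prepared House Price features and the log-transformed SalePrice:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Wrap scaling and the estimator in one Pipeline; the "scaler" step will be
# swapped by the grid search between RobustScaler and StandardScaler.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", Lasso()),
])

params_lasso = {
    "scaler": [RobustScaler(), StandardScaler()],
    "clf__alpha": np.logspace(-5, -1, 10),   # 10 candidate alpha values
}

grid = GridSearchCV(
    pipe,
    param_grid=params_lasso,
    scoring="neg_mean_squared_error",   # negative MSE: maximizing it minimizes the MSE
    cv=10,
    refit=True,                         # refit the best combination on the full training set
)
grid.fit(X_train, y_train)

best_rmse = np.sqrt(-grid.best_score_)  # convert negative MSE back into an RMSE
print(grid.best_params_, best_rmse)
```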

Comparison of House Price linear regression algorithms

Although I spent a lot of time trying all of Sklearn's regression algorithms myself, Lasso, Ridge, ElasticNet, SVM and GradientBoost turned out to give the best RMSE. In fact, most contestants on Kaggle also use these algorithms, and the Sklearn algorithm cheat-sheet gives exactly the same advice. Consulting that chart next time will save a lot of time and effort.

As the cheat-sheet shows: the House Price data has fewer than 5,000 samples, so SGD is not needed. If only a few features should matter, use Lasso or ElasticNet. If in doubt, use Ridge or linear-kernel SVR; if those do not work well, add SVR with a Gaussian kernel (RBF) or an ensemble method (Random Forest, XGBoost or LightGBM).

For all regression problems, Lasso, ElasticNet, Ridge and SVR (linear kernel) are the preferred starting points. Sklearn offers no explanation for this. Recently, I came across the following reasoning in a book on the core predictive algorithms of machine learning in Python:

  • Business needs: in quantitative trading and online advertising, linear regression algorithms deliver high speed together with near-optimal solutions. In businesses measured in seconds, linear regression is a must.
  • Iteration needs: typically 100 to 200 models are built for one business problem. With hundreds of thousands of rows, a linear algorithm reaches an approximately optimal solution in a few minutes, while an ensemble algorithm often takes hours or even days. Linear algorithms can therefore be used to quickly screen out most of the under-performing models.

In addition, Sklearn's linear algorithms make use of the BLAS library and are more than ten times as efficient as the ensemble algorithms. For example, one cross-validation run with Lasso takes about 20 seconds, while GradientBoost and other ensemble algorithms take about 200 to 300 seconds.
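Below is a rough sketch of that kind of timing comparison; the exact numbers depend on the machine, and X_train and y_train are again assumed to be the prepared training data:

```python
import time
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor

# Time a 10-fold cross-validation for a linear model and an ensemble model.
for name, model in [("Lasso", Lasso(alpha=0.001)),
                    ("GradientBoosting", GradientBoostingRegressor())]:
    start = time.time()
    cross_val_score(model, X_train, y_train,
                    scoring="neg_mean_squared_error", cv=10)
    print(name, "CV time: %.1f s" % (time.time() - start))
```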

As for House Price itself, I saw two pieces of information on Kaggle that were very enlightening:

1. "Lasso model for regression problem": after the feature engineering is done, a score of about 0.117 is obtained with Lasso alone.

2. Multivariate Data Analysis (7th edition): the House Price winner said this book is a great reference. It is a statistics-based business analysis book. Compared with ordinary machine learning books it goes deeper, while not dwelling on theoretical derivations. For data, distributions and outliers it provides not only the underlying reasoning but also a variety of engineering methods for forming and verifying assumptions. Best of all are the rules of thumb, such as:

  • How should missing data be handled when more than 50% is missing? When less than 10% is missing? Is the missing data random or systematic, and how can that be detected?
  • Outlier detection with statistical and engineering methods.
  • Distribution checks, and so on.
