R vs Python, statistics vs machine learning
- Many people who have studied R, or who came to Python from R, have asked me all kinds of statistics questions, because R ships with a very full set of statistical tools while Python's statistics coverage is not as comprehensive. I have also been puzzled by the "R-style Python code" my friends send me. R code may want to look fancy, but the beauty of Python is simplicity. In Jupyter, run `import this` to see the hidden Easter egg: "The Zen of Python", a poem by Tim Peters that extols simple, clear, readable Python code (see the one-line example at the end of this section). If you are interested, you can search for a translation.
- Getting back to the point, why do Python and R behave so differently when it comes to statistics? You may have heard that R was developed by people who studied statistics, so its whole mindset is statistical, while Python was developed by people who studied computer science, so its whole mindset is computational. It is no surprise, then, that R handles statistics far better than Python. The different approaches of these two disciplines show up strongly in the modeling workflows of statistics and machine learning.
- Statistics takes a "prior" view: whatever we do, we must first run checks and make sure the preconditions are met, and then apply various tests to guarantee the mathematical assumptions hold; otherwise, the theory says we cannot trust the result. Machine learning takes a "posterior" view: whatever the situation, run the model first. If the results are poor, look for a fix; if the results are good, who cares about collinearity, residuals that are not normally distributed, missing dummy variables, or other such details? The model works, and that is what matters.
- As someone with no math or statistics background who came from finance to writing Python code, I fully appreciate the "posterior" approach of machine learning: we pursue results, not preconditions that must be met. For me, statistics is the savior for problems that machine learning fails to solve; when that happens I turn to statistics for help, but I never set out to satisfy statistical requirements first. And of course, if you are a statistician or an R user, you can just as well think of machine learning as the salvation of statistics. Statistics and machine learning go hand in hand, and you need to understand the difference between the two mindsets so that when you hit a dead end in one, you can find a way out in the other. As long as it solves the problem, it is a good line of thinking!
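For the curious, the Easter egg mentioned above is literally one line:

```python
# Prints "The Zen of Python" by Tim Peters to the console
import this
```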
2. The efficient embedded method
But the more effective approach is undoubtedly the embedded method. We have already shown that L1 regularization can be used for feature selection, because it drives the parameters of some features to 0. Combined with the SelectFromModel module of the embedded method, we can easily pick out the features that make the model highly effective. Note that our goal here is to retain as much of the information in the original data as possible, so that the model still fits well after dimensionality reduction. Therefore, we do not worry about splitting into training and test sets, and we feed all the data into the model for dimensionality reduction.
```python
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
data.data.shape

LR_ = LR(solver="liblinear", C=0.9, random_state=420)
cross_val_score(LR_, data.data, data.target, cv=10).mean()

X_embedded = SelectFromModel(LR_, norm_order=1).fit_transform(data.data, data.target)
X_embedded.shape                                              # (569, 9)
cross_val_score(LR_, X_embedded, data.target, cv=10).mean()   # 0.9368323826808401
```
Looking at the result, the number of features has dropped to single digits while the model's performance has barely decreased; if we are not too demanding, we could actually stop here. But can we make the model fit even better? Here we have two ways to tune things.

(1) Adjust the parameter threshold in the SelectFromModel class. This is the threshold of the embedded method: every feature whose importance (the absolute value of its parameter) falls below this value is dropped. By default, threshold is None, so SelectFromModel selects features purely from the result of L1 regularization, i.e. it keeps the features whose parameters are not 0 after L1 regularization. Once we adjust threshold (by drawing a learning curve over threshold), we can observe how the model's performance changes under different thresholds. When threshold is set, features are no longer selected by L1 regularization alone but by the coefficients stored in the model attribute .coef_. Although coef_ returns the coefficients of the features, their magnitudes play the same role as feature_importances_ in decision trees and explained_variance_ in dimensionality reduction algorithms: they measure a feature's importance and contribution. Thus, the parameter threshold in SelectFromModel can be set as a cutoff on coef_, removing every feature whose coefficient is smaller than the value passed to threshold.
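Below is a minimal sketch of that learning curve. It reuses the breast cancer data and the liblinear logistic regression from the snippet above; the 20-point threshold grid and the variable names are illustrative choices, not anything fixed by SelectFromModel.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression as LR
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
LR_ = LR(solver="liblinear", C=0.9, random_state=420)

# Baseline: cross-validated score with all 30 features
full_score = cross_val_score(LR_, data.data, data.target, cv=5).mean()

# Sweep threshold from 0 up to the largest absolute coefficient of the fitted model
thresholds = np.linspace(0, abs(LR_.fit(data.data, data.target).coef_).max(), 20)
fs_scores, n_features = [], []

for t in thresholds:
    X_embedded = SelectFromModel(LR_, threshold=t).fit_transform(data.data, data.target)
    fs_scores.append(cross_val_score(LR_, X_embedded, data.target, cv=5).mean())
    n_features.append(X_embedded.shape[1])
    print(round(t, 4), X_embedded.shape[1])  # threshold value and number of features kept

plt.figure(figsize=(12, 4))
plt.plot(thresholds, [full_score] * len(thresholds), label="all features")
plt.plot(thresholds, fs_scores, label="after selection")
plt.xlabel("threshold")
plt.ylabel("mean CV accuracy")
plt.legend()
plt.show()
```

Printing the number of surviving features alongside the scores makes it easy to see how much information is traded away as the threshold rises.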