Data preprocessing is one of the most daunting tasks for a data scientist
I once read a question on Zhihu: why did you leave the data science industry? A kindred spirit replied: ten hours of data cleaning, two minutes of model fitting, repeated on a 996 schedule.
The comments below that answer were even better — the work really can drive you up the wall…
I previously published an article on data processing that laid out detailed preprocessing methods.
One thing it did not clarify, and one that many blogs and posts online treat in a muddled way, is the relationship between Normalization and Standardization.
This article focuses on the following four points:
- The relationship between normalization and standardization
- Why normalization and standardization
- Which machine learning models need to be normalized
- How to do normalization and standardization
The relationship between normalization and standardization
This is by far the most confusing part.
In statistics there is no Standardization, only Normalization, and Normalization covers many things — whether transforming data to a normal distribution with mean 0 and variance 1, or mapping data to [0,1].
In the field of machine learning, however, Normalization has two common meanings: min-max normalization and mean normalization. Standardization in machine learning refers specifically to transforming data to follow a normal distribution.
In sklearn.preprocessing, scaling data to a distribution with mean 0 and variance 1 is called Standardization, and so is scaling data to [0,1] (or, of course, to [-1,1]). What preprocessing calls Normalization is something else entirely: it divides each sample X by its L1 or L2 norm.
To sum up: transforming data toward a normal distribution with mean 0 and variance 1 is standardization, and scaling data to [0,1] is normalization.
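The three operations above can be written out directly. Below is a minimal NumPy sketch of the formulas (mirroring what sklearn's StandardScaler, MinMaxScaler, and Normalizer compute); the sample matrix is invented for illustration:

```python
import numpy as np

# Toy data: each row is a sample, each column a feature (values are made up).
X = np.array([[1.0, 2000.0],
              [2.0,  300.0],
              [3.0,  600.0],
              [4.0, 1100.0]])

# Standardization (z-score): per column, subtract the mean and divide by the std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: per column, scale to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# sklearn-style "Normalization": divide each row (sample) by its L2 norm.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)
```

After these transforms, each column of `X_std` has mean ≈ 0 and variance ≈ 1, each column of `X_minmax` spans exactly [0, 1], and each row of `X_l2` has unit L2 norm — which is precisely the terminological split described above.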
Why normalization/standardization
1) Normalization speeds up the convergence of gradient descent
This diagram, from Andrew Ng's machine learning course, has been cited countless times. The blue circles are the contour lines of the loss over two features. In the left picture the two features have very different ranges — X1 lies in [0, 2000] while X2 lies in [1, 5] — producing very narrow, elongated contours. When gradient descent is used to find the optimal solution, it is likely to take a zigzag route (perpendicular to the contour lines), requiring many iterations before it converges. In the right picture the two features have been normalized; the corresponding contours are nearly round, and gradient descent converges quickly. Therefore, if a model is trained by gradient descent, normalization is usually necessary — without it, convergence is slow or may fail altogether.
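The "elongated contours" effect can be quantified: for a least-squares loss the contours are ellipses whose elongation is governed by the condition number of X^T X (the Hessian). A small sketch with synthetic data on the same [0, 2000] vs. [1, 5] scales as the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on the scales used in the figure: x1 in [0, 2000], x2 in [1, 5].
x1 = rng.uniform(0, 2000, size=200)
x2 = rng.uniform(1, 5, size=200)
X = np.column_stack([x1, x2])

def contour_elongation(X):
    # For least squares, the Hessian is X^T X; its condition number measures
    # how stretched the loss contours are (1 means perfectly round).
    return np.linalg.cond(X.T @ X)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cond_raw = contour_elongation(X)        # enormous: sharp, elongated contours
cond_scaled = contour_elongation(X_scaled)  # near 1: nearly round contours
```

On the raw features the condition number is many orders of magnitude larger than after standardization, which is exactly why gradient descent zigzags in the left picture and converges quickly in the right one.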
2) Normalization may improve accuracy
Some classifiers, such as KNN, need to compute distances between samples (e.g., the Euclidean distance). If one feature has a much larger range than the others, the distance calculation is dominated by that feature, which may contradict the actual situation (for example, when the small-range feature is in fact the more important one).
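A small sketch of this domination effect, using two invented features on very different scales (height in cm and income in yuan; the ranges used for scaling are assumptions for illustration):

```python
import numpy as np

# Two samples: [height in cm, income in yuan] — values are made up.
a = np.array([170.0, 30000.0])
b = np.array([160.0, 31000.0])

# Raw Euclidean distance: the 10 cm height gap contributes 100 to the squared
# distance, while the 1000 yuan income gap contributes 1,000,000 — income
# completely dominates, even if height is the feature that actually matters.
d_raw = np.linalg.norm(a - b)

# After min-max scaling with assumed ranges height [150, 190], income [0, 100000]:
lo = np.array([150.0, 0.0])
hi = np.array([190.0, 100000.0])
a_s = (a - lo) / (hi - lo)
b_s = (b - lo) / (hi - lo)
d_scaled = np.linalg.norm(a_s - b_s)  # now the height difference dominates
```

Before scaling, over 99% of the squared distance comes from the income feature; after scaling, the height difference (0.25 in scaled units) dwarfs the income difference (0.01), so both features get a comparable say in the distance.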
Which machine learning algorithms need to be normalized
1) Models trained by gradient descent, and models that compute distances, need normalization. Without normalization the convergence path zigzags, convergence is slow, and the optimal solution is hard to reach; with it, gradient descent finds the optimum faster and accuracy may improve. Examples include linear regression, logistic regression, AdaBoost, XGBoost, GBDT, SVM, and neural networks. Models that compute distances, such as KNN and KMeans, also need normalization.
2) Probabilistic models and tree-structured models do not need normalization, because they do not care about the values of the variables, only about the distribution of the variables and the conditional probabilities between them — for example, decision trees and random forests.
How are normalization/standardization implemented?
There is not enough space to explain this part in detail; the best way to learn it is to read the documentation and practice!
Official documentation for the sklearn.preprocessing module:
Scikit-learn.org/stable/modu…
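As a starting point for that practice, here is a minimal sketch of the three sklearn.preprocessing transformers discussed above, assuming scikit-learn is installed (the sample matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Toy data: rows are samples, columns are features (values are invented).
X = np.array([[1.0, 2000.0],
              [2.0,  300.0],
              [3.0, 1100.0]])

X_std = StandardScaler().fit_transform(X)       # per-column mean 0, variance 1
X_mm = MinMaxScaler().fit_transform(X)          # per-column range [0, 1]
X_l2 = Normalizer(norm="l2").fit_transform(X)   # per-row unit L2 norm
```

In a real pipeline the scaler is fitted on the training set only and then applied to the test set with `transform`, so that test statistics do not leak into the preprocessing.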
The ApacheCN team has translated the sklearn documentation into Chinese; see:
Sklearn.apachecn.org/#/docs/40?i…
About this account
The public account "Beginner Machine Learning" was founded by Dr. Huang Haiguang, who has more than 22,000 followers on Zhihu and ranks in the top 110 on GitHub (32,000+). The account is dedicated to popular-science articles on artificial intelligence, providing learning roadmaps and foundational material for beginners. Original works include Personal Notes on Machine Learning, notes on deep learning, and more.