Data preprocessing is one of the most daunting tasks for a data scientist
I once read a question on Zhihu: why did you leave the data science industry? A kindred spirit replied: ten hours of data cleaning, two minutes of model fitting, repeated on a 996 schedule.
The comments below that answer were even better — the work really can drive you up the wall…
I previously published an article on data processing that laid out detailed preprocessing methods.
One thing it did not clarify, and one that many blogs and posts online treat in a muddled way, is the relationship between Normalization and Standardization.
This article focuses on the following four points:
- The relationship between normalization and standardization
- Why normalization and standardization
- Which machine learning models need to be normalized
- How to do normalization and standardization
The relationship between normalization and standardization
This is by far the most confusing part.
In statistics there is no Standardization, only Normalization, and Normalization covers many things — whether transforming data to a normal distribution with mean 0 and variance 1, or mapping data to [0,1].
In the field of machine learning, however, Normalization has two common meanings: min-max normalization and mean normalization. Standardization in machine learning refers specifically to transforming data to follow a normal distribution.
In sklearn.preprocessing, scaling data to a distribution with mean 0 and variance 1 is called Standardization, and so is scaling data to [0,1] (or, of course, to [-1,1]). What preprocessing calls Normalization is something else entirely: it divides each sample X by its L1 or L2 norm.
To sum up: transforming data toward a normal distribution with mean 0 and variance 1 is standardization, and scaling data to [0,1] is normalization.
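The three operations above can be written out directly. Below is a minimal NumPy sketch of the formulas (mirroring what sklearn's StandardScaler, MinMaxScaler, and Normalizer compute); the sample matrix is invented for illustration:

```python
import numpy as np

# Toy data: each row is a sample, each column a feature (values are made up).
X = np.array([[1.0, 2000.0],
              [2.0,  300.0],
              [3.0,  600.0],
              [4.0, 1100.0]])

# Standardization (z-score): per column, subtract the mean and divide by the std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: per column, scale to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# sklearn-style "Normalization": divide each row (sample) by its L2 norm.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)
```

After these transforms, each column of `X_std` has mean ≈ 0 and variance ≈ 1, each column of `X_minmax` spans exactly [0, 1], and each row of `X_l2` has unit L2 norm — which is precisely the terminological split described above.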
Why normalization/standardization
1) Normalization speeds up the convergence of gradient descent
This diagram, from Andrew Ng's machine learning course, has been cited countless times. The blue circles are the contour lines of the loss over two features. In the left picture the two features have very different ranges — X1 lies in [0, 2000] while X2 lies in [1, 5] — producing very narrow, elongated contours. When gradient descent is used to find the optimal solution, it is likely to take a zigzag route (perpendicular to the contour lines), requiring many iterations before it converges. In the right picture the two features have been normalized; the corresponding contours are nearly round, and gradient descent converges quickly. Therefore, if a model is trained by gradient descent, normalization is usually necessary — without it, convergence is slow or may fail altogether.
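The "elongated contours" effect can be quantified: for a least-squares loss the contours are ellipses whose elongation is governed by the condition number of X^T X (the Hessian). A small sketch with synthetic data on the same [0, 2000] vs. [1, 5] scales as the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on the scales used in the figure: x1 in [0, 2000], x2 in [1, 5].
x1 = rng.uniform(0, 2000, size=200)
x2 = rng.uniform(1, 5, size=200)
X = np.column_stack([x1, x2])

def contour_elongation(X):
    # For least squares, the Hessian is X^T X; its condition number measures
    # how stretched the loss contours are (1 means perfectly round).
    return np.linalg.cond(X.T @ X)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cond_raw = contour_elongation(X)        # enormous: sharp, elongated contours
cond_scaled = contour_elongation(X_scaled)  # near 1: nearly round contours
```

On the raw features the condition number is many orders of magnitude larger than after standardization, which is exactly why gradient descent zigzags in the left picture and converges quickly in the right one.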
2) Normalization may improve accuracy
Some classifiers, such as KNN, need to compute distances between samples (e.g., the Euclidean distance). If one feature has a much larger range than the others, the distance calculation is dominated by that feature, which may contradict the actual situation (for example, when the small-range feature is in fact the more important one).
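A small sketch of this domination effect, using two invented features on very different scales (height in cm and income in yuan; the ranges used for scaling are assumptions for illustration):

```python
import numpy as np

# Two samples: [height in cm, income in yuan] — values are made up.
a = np.array([170.0, 30000.0])
b = np.array([160.0, 31000.0])

# Raw Euclidean distance: the 10 cm height gap contributes 100 to the squared
# distance, while the 1000 yuan income gap contributes 1,000,000 — income
# completely dominates, even if height is the feature that actually matters.
d_raw = np.linalg.norm(a - b)

# After min-max scaling with assumed ranges height [150, 190], income [0, 100000]:
lo = np.array([150.0, 0.0])
hi = np.array([190.0, 100000.0])
a_s = (a - lo) / (hi - lo)
b_s = (b - lo) / (hi - lo)
d_scaled = np.linalg.norm(a_s - b_s)  # now the height difference dominates
```

Before scaling, over 99% of the squared distance comes from the income feature; after scaling, the height difference (0.25 in scaled units) dwarfs the income difference (0.01), so both features get a comparable say in the distance.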
Which machine learning algorithms need to be normalized
1) Models trained by gradient descent, and models that compute distances, need normalization. Without normalization the convergence path zigzags, convergence is slow, and the optimal solution is hard to reach; with it, gradient descent finds the optimum faster and accuracy may improve. Examples include linear regression, logistic regression, AdaBoost, XGBoost, GBDT, SVM, and neural networks. Models that compute distances, such as KNN and KMeans, also need normalization.
2) Probabilistic models and tree-structured models do not need normalization, because they do not care about the values of the variables, only about the distribution of the variables and the conditional probabilities between them — for example, decision trees and random forests.
How are normalization/standardization implemented?
There is not enough space to explain this part in detail; the best way to learn it is to read the documentation and practice!
Official documentation for the sklearn.preprocessing module:
Scikit-learn.org/stable/modu…
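As a starting point for that practice, here is a minimal sketch of the three sklearn.preprocessing transformers discussed above, assuming scikit-learn is installed (the sample matrix is made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Toy data: rows are samples, columns are features (values are invented).
X = np.array([[1.0, 2000.0],
              [2.0,  300.0],
              [3.0, 1100.0]])

X_std = StandardScaler().fit_transform(X)       # per-column mean 0, variance 1
X_mm = MinMaxScaler().fit_transform(X)          # per-column range [0, 1]
X_l2 = Normalizer(norm="l2").fit_transform(X)   # per-row unit L2 norm
```

In a real pipeline the scaler is fitted on the training set only and then applied to the test set with `transform`, so that test statistics do not leak into the preprocessing.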
The ApacheCN team has translated the sklearn documentation into Chinese; see:
Sklearn.apachecn.org/#/docs/40?i…
About this account
The public account "Beginner Machine Learning" was founded by Dr. Huang Haiguang, who has more than 22,000 followers on Zhihu and ranks in the top 110 on GitHub (32,000+). The account is dedicated to popular-science articles on artificial intelligence, providing learning roadmaps and foundational material for beginners. Original works include Personal Notes on Machine Learning, notes on deep learning, and more.