Understanding Feature Engineering

 

Author: out

Note: These are the lecture notes for Lecture 5, Feature Engineering, of the 9th session of the July Online machine learning course. The lecturer is Zhang Yushi (Johnson), and the notes have been proofread by Han Xiaoyang and other teachers.

Date: July 31, 2018.

 

0 Foreword

Every month, July Online launches a variety of machine learning, deep learning, and artificial intelligence courses. After three and a half years of polishing, the content quality has become excellent. Along the way, I have constantly received news of students who successfully moved from traditional IT into AI and then landed annual salaries of 300,000 to 500,000 yuan.

Besides, I am fond of research and good at explaining difficult, obscure things in an accessible way, so over the next month I am preparing notes for 30 ML courses, roughly one per day, similar to my earlier notes on KMP, SVM, CNN, and object detection, and publishing them on the blog, the public account, the question bank, and the community. The company's lecturer team will review them for professional accuracy, and the goal is for each topic/model note to become the first article every ML beginner must read.

In addition, each note is basically taken while listening to the course on the July Online app with Beats headphones (conveniently, the app supports 1.5x or 2x playback speed), so the notes stay accessible while the course lecturers keep the content professional. If you have any questions, please leave a comment.

 

1 What Is Feature Engineering

It is widely said in the industry that data and features determine the upper limit of machine learning, and models and algorithms only approximate this limit.

Yet feature engineering is rarely covered in machine-learning books, or in many online courses; the July Online course was the first machine-learning course to include a dedicated feature engineering lesson. Even now, many machine learning courses do not cover feature engineering, which in my opinion is unprofessional (I believe this article will help improve that).

So what is feature engineering, and is it really that important?

As the name implies, feature engineering is essentially an engineering activity: it aims to extract, as fully as possible, features from raw data for use by algorithms and models.

However, most of the machine-learning time in a company is not spent studying algorithms, designing advanced models, researching applications of deep learning, or designing n-layer neural networks. In fact, 70-80% of the time goes into handling data and features.

That is because most of the refinement of complex models and algorithms is done by a small number of data scientists. What is everyone else doing? Running data: MapReduce jobs, Hive SQL, laying bricks in the data warehouse, then cleaning data, analyzing the business and concrete cases, and looking for features. Many large companies do not even choose complex models first; sometimes a single LR carries the whole show.

In one Kaggle binary-classification contest, extracting effective features improved AUC by 2% (AUC is a common metric for evaluating model quality), while optimizing a seemingly superior model improved AUC by only about 0.5%. Likewise, the team that won first prize in an e-commerce product-recommendation competition relied mainly on feature engineering, achieving recommendation accuracy 16% higher than the engineers' baseline.

To sum up, feature engineering is generally considered to include the following aspects (the figure below comes from a summary of the first Feature Engineering lesson by ML9 student Hai Broad Sky):

2 Data and Feature Processing

2.1 Data Collection

Suppose I want to predict whether a user will place an order for a product (buy or not buy). I can collect information such as store reputation, product reviews, and user behavior, and I can also cross these signals to build combination features.

2.2 Data Formatting

First, determine the storage format. For example, should time be stored as year-month-day, as a timestamp, or only as the day? Should each action be a single record, or should a whole day's actions be aggregated into one row? In most cases you will also need to join a very large number of Hive tables and cluster files (e.g., on HDFS).

2.3 Data Cleaning

Since the collected data may contain noise, the goal of data cleaning is to remove dirty data; otherwise it is garbage in, garbage out.

An algorithm is largely a processing machine: the quality of the final product depends heavily on the quality of the raw materials. Selecting and improving those raw materials takes a lot of time, and how efficiently you do it depends on how well you understand the business.

2.4 Data Sampling

In many cases, positive and negative samples are imbalanced. On an e-commerce platform, for example, there is a big gap between the number of products a user has bought and the number they have merely viewed: buying costs money, so most users just browse without purchasing. Many models (e.g., LR) are sensitive to the positive/negative sample ratio.

So how do we deal with the imbalance of positive and negative samples? In order to make the samples more balanced, we usually use random sampling and stratified sampling.

For example, if positive samples outnumber negative samples and the overall volume is large, you can downsample the majority class. If positive samples outnumber negative samples but the volume is not large, you can oversample the minority class (e.g., via mirroring and rotation in image recognition) or modify the loss function to handle the imbalance; a minimal sketch follows.
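As a rough illustration (not from the original notes), here is a minimal sketch assuming pandas and scikit-learn; the column names and data are invented:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset: 'label' is 1 (bought) or 0 (not bought).
df = pd.DataFrame({
    "feature": range(1000),
    "label": [1] * 50 + [0] * 950,   # 50 positives vs. 950 negatives
})

# Option 1: downsample the majority class to match the minority class.
pos = df[df.label == 1]
neg = df[df.label == 0].sample(n=len(pos), random_state=42)
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=42)  # shuffle

# Option 2: keep all data but re-weight the loss function instead.
clf = LogisticRegression(class_weight="balanced")
clf.fit(df[["feature"]], df["label"])
```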

2.5 Feature Processing

During feature processing we face various types of data: numerical (e.g., age), categorical (e.g., a brand of lipstick that comes in 18 shades, clothing sizes L / XL / XXL, the day of the week), time, text, statistical, combination features, and so on.

2.5.1 Numerical Data

There are several ways to process numerical features: amplitude scaling, transformations of the value domain such as log, statistical values (max, min, mean, std), discretization, hash bucketing, per-category statistics such as histograms (distributions), and converting numerical => categorical.

Next, I will focus on a few of these methods. For example, scaling the data into the range [0,1], as shown in the code below.
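The original code screenshot is not reproduced here; a minimal sketch of the same operation, assuming scikit-learn and made-up data, looks like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[10.0], [25.0], [40.0], [80.0]])  # toy data

scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(ages))
# Each value is mapped to (x - min) / (max - min), i.e. into [0, 1].
```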

This operation is called normalization (min-max scaling).

Why normalize? Many students have not quite figured this out. The explanation given by Wikipedia: 1) normalization speeds up gradient descent's convergence to the optimal solution; 2) normalization may improve accuracy.

Let's expand on these two points a little. 1) Why does normalization speed up gradient descent's search for the optimal solution? See the following two figures (source: Stanford machine learning videos).


The blue ellipses are the contour lines over the two features. In the figure on the left, the ranges of the two features X1 and X2 are very different: X1 lies in [0, 2000] while X2 lies in [1, 5].

When gradient descent searches for the optimal solution on such a surface, it tends to take a zigzag path (perpendicular to the contour lines), so many iterations are needed before convergence. In the figure on the right, the two features have been normalized, the contour lines become nearly circular, and gradient descent converges quickly.

Therefore, if a model is fitted with gradient descent, normalization is usually necessary; without it, convergence is slow or may not happen at all.

2) Normalization may improve accuracy. Some classifiers, such as KNN, compute distances between samples (e.g., Euclidean distance). If one feature has a much larger range than the others, the distance is dominated by that feature, which may contradict reality (for example, the feature with the small range may actually be the more important one).

1) Linear normalization (min-max scaling)

x' = (x - min) / (max - min)

This method suits data whose values are relatively concentrated. One drawback is that if max and min are unstable, the normalized results, and whatever uses them downstream, become unstable too. In practice, empirical constants are sometimes used in place of max and min.

2) Standard deviation standardization (z-score)

The processed data follow a standard normal distribution, with mean 0 and standard deviation 1. The transformation is

x' = (x - μ) / σ

where μ is the mean of all sample values and σ is their standard deviation.

3) Nonlinear normalization

This is often used when the data span very different magnitudes: some values are huge, others tiny. The original values are mapped through a mathematical function such as log, exponential, or tangent. The curve of the nonlinear function (for example log base 2 versus log base 10) should be chosen according to the distribution of the data.

In practice, models solved by gradient descent, or that rely on distances between samples, generally need normalization: linear regression, logistic regression, KNN, SVM, neural networks, and so on.

Tree models, such as decision trees and random forests, do not need normalization, because they do not care about the absolute values of variables, only about the distribution of each variable and the conditional relationships between variables.

Next, let’s look at standardizing the data
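The original screenshot is not shown; an equivalent sketch with scikit-learn's StandardScaler on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy data

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # each column now has mean 0, std 1
print(X_std.mean(axis=0), X_std.std(axis=0))
```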

Another example is computing statistics over the data.
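For illustration (not from the original notes), a minimal pandas sketch of the common summary statistics, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 40, 62]})  # toy data

# Common statistical summaries used as features or for sanity checks.
print(df["age"].min(), df["age"].max(), df["age"].mean(), df["age"].std())
print(df["age"].quantile([0.25, 0.5, 0.75]))  # quartiles
print(df["age"].describe())                   # all of the above at once
```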



 

We can also discretize the data. For example, a person's age is a continuous value, but raw continuous values are not always suitable for models such as logistic regression, so the data needs to be discretized: intuitively, the continuous range is cut into segments (bins), and each segment becomes its own feature for piecewise treatment.
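A minimal sketch, assuming pandas and made-up age boundaries, of how such binning might look:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 62])

# Fixed bin edges: each age falls into one segment (bin).
bins = pd.cut(ages, bins=[0, 18, 30, 45, 100],
              labels=["minor", "young", "middle", "senior"])

# Each bin can then be expanded into its own 0/1 feature.
print(pd.get_dummies(bins, prefix="age"))
```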



We can also look at the columnar (histogram) distribution of the data.
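For illustration (not from the original notes), a tiny pandas sketch of a columnar distribution:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 62, 22, 25, 25])

# Counts per bin give the columnar (histogram) distribution of the data.
print(pd.cut(ages, bins=[0, 20, 30, 50, 100]).value_counts().sort_index())

# With matplotlib installed, ages.hist(bins=4) would plot the same thing.
```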



2.5.2 Categorical Data

When we face categorical data, such as a brand of lipstick that comes in multiple shades, a human can tell the shades apart at a glance. A computer, unlike the human eye, only reads numbers (strictly speaking, binary 0s and 1s), so for a computer to recognize the colors, they must first be encoded.

One-hot encoding / dummy variables can be used for categorical features such as lipstick color.

One-hot encodings are generally stored as sparse vectors to save space: exactly one dimension is 1 and all other positions are 0. They also need to be paired with feature selection to reduce dimensionality, since high dimensionality brings several problems (a small encoding sketch follows the list):

  1. For example, in k-nearest neighbor algorithm, the distance between two points in high-dimensional space is difficult to measure;
  2. For example, in logistic regression, the number of parameters will increase with the increase of dimension, which is easy to cause overfitting problems.
  3. For example, in some classification scenarios, only some dimensions are useful.
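As promised above, a minimal one-hot encoding sketch, assuming pandas and scikit-learn with made-up color values:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"color": ["red", "pink", "orange", "red"]})

# pandas dummy variables: one 0/1 column per color.
print(pd.get_dummies(colors, columns=["color"]))

# scikit-learn equivalent; internally it keeps a sparse matrix to save space.
enc = OneHotEncoder()
print(enc.fit_transform(colors[["color"]]).toarray())
print(enc.categories_)
```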



For some text data, the hashing trick can also be used to compute word-frequency statistics.
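A minimal sketch of the hashing trick with scikit-learn's HashingVectorizer (the documents are made up):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["I like this lipstick", "this lipstick color is nice"]

# The hashing trick maps words to a fixed number of buckets, so no
# vocabulary has to be stored; collisions are possible but rare.
vec = HashingVectorizer(n_features=16, alternate_sign=False)
print(vec.transform(docs).toarray())
```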



 

Finally, some categorical data, such as the behavioral differences between men and women in various areas, can be encoded with the histogram mapping method, which represents each group (e.g., male and female students) by its distribution of preferences.



Man: [1/3, 2/3, 0]; Woman: [0, 1/3, 2/3]. 21: [0, 1]; 22: [0, 1]…

2.5.3 Time-Based Data

Time-based data can be treated either as continuous values or as discrete values. For example:

  • Treated as continuous values: duration (how long a user stayed on a page), interval (how long since the user last purchased/clicked).
  • Treated as discrete values: hour of the day (hour_0-23), day of the week (week_Monday…), week of the year, quarter of the year, weekday/weekend, and so on.

2.5.4 Text Data

For text data such as a bag of words, stop words can be removed after preprocessing, and the list of remaining words is then mapped to a sparse vector over the vocabulary.



The operation below shows how a bag of words is handled in Python.
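The original code screenshot is not reproduced; a minimal bag-of-words sketch with scikit-learn's CountVectorizer (assuming a recent scikit-learn) might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Li Lei likes Han Meimei", "Han Meimei likes Li Lei"]

vec = CountVectorizer()
X = vec.fit_transform(docs)          # sparse word-count vectors
print(vec.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                   # note: both rows are identical
```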



A problem arises here: "Li Lei likes Han Meimei" and "Han Meimei likes Li Lei" mean different things, but the bag-of-words model above cannot tell who likes whom, because it ignores word order.

To deal with this, the bag of words can be extended to n-grams.
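Continuing the sketch above, simply widening ngram_range adds bigram features, so the two sentences now receive different vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Li Lei likes Han Meimei", "Han Meimei likes Li Lei"]

# Adding bigrams (pairs of adjacent words) preserves some word order.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(X.toarray())
```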



2.5.5 Text Data (TF-IDF and Word2Vec)

For text data, the TF-IDF feature is also widely used. It answers the question of how to judge a word's importance within a document. Is raw frequency enough? Apparently not: very common function words appear frequently in all kinds of documents, yet that does not make them important.

To evaluate more accurately how important a word is to one document within a collection or corpus, we use the statistical measure TF-IDF: a word's importance increases with the number of times it appears in the document, but decreases with how often it appears across the corpus.

TF: Term Frequency.

TF(t) = (number of occurrences of the word t in the current document) / (total number of words in the current document)

IDF: Inverse Document Frequency.

IDF(t) = ln(total number of documents / number of documents containing t)

In practice we use the product of the two as the weight: TF-IDF = TF(t) * IDF(t)
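A minimal sketch with scikit-learn's TfidfVectorizer (note that scikit-learn's exact IDF smoothing differs slightly from the formula above; the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feature engineering determines the upper limit",
    "models and algorithms approach that limit",
    "the the the",            # a document full of a very common word
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # rows are TF-IDF weighted vectors
print(vec.get_feature_names_out())
print(X.toarray().round(2))
```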

Beyond the bag of words, Google proposed the word2vec model in 2013, and it has become one of the most commonly used word embedding models.

There are various open-source implementations of this model, such as Google's word2vec, Gensim, and Facebook's fastText.
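A minimal Gensim sketch on a toy corpus (parameter names follow Gensim 4.x, where `vector_size` used to be called `size`):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; a real corpus would have millions of sentences.
sentences = [
    ["li", "lei", "likes", "han", "meimei"],
    ["han", "meimei", "likes", "li", "lei"],
    ["users", "like", "this", "lipstick"],
]

# vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["lipstick"][:5])               # first 5 dims of one embedding
print(model.wv.most_similar("likes", topn=3))  # nearest words in the space
```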



2.5.6 Statistical Features

Statistical features have been widely used in past Kaggle/Tianchi competitions and in the ranking and recommendation models of business lines such as Tmall and JD.com, for example:

  • Relative to the average: how much an item's price exceeds the average price, how much more a user spends in a category than the average user, how many more consecutive login days than the average user…
  • Quantile: at which quantile of sold-item prices an item's price falls
  • Rank/order: at which position something ranks
  • Proportion: the proportion of good/medium/bad reviews in e-commerce
  • Percentile: what percentage of other users/students you exceed…

2.5.7 Combination Features

Combination features are best illustrated with an example.

There was a mobile recommendation algorithm contest on Alibaba Cloud Tianchi (Tianchi is now a long-term partner of July Online). The goal of the contest was to recommend the right goods/content to mobile users at the right time and place, so as to improve their browsing and purchasing experience.

Address: tianchi.aliyun.com/competition…

In real business scenarios, we often need to build personalized recommendation models for a subset of all products. To do so, we need to leverage not only users' behavior on this subset of goods, but usually a much richer set of user behavior data. Define the following symbols:
  • U: the set of users
  • I: the full set of goods
  • P: a subset of goods, P ⊆ I
  • D: the users' behavior data over the full set of goods
Our goal is to use D to build a model that recommends items in P to users in U.

Data description: the contest data consists of two parts. The first part is the users' mobile behavior data (D) over the full set of goods, in a table named tianchi_mobile_recommend_train_user with the following fields:





For this problem, we need to do a lot of data processing, such as:

  1. One day’s cart items are more likely to be purchased the next day => rule
  2. Eliminate people who never buy anything in 30 days => data cleansing
  3. If a user adds N items to the cart but buys only one, the rest will not be bought => rule
  4. Shopping cart purchase conversion rate => statistical characteristics of user dimension
  5. Commodity popularity => Statistical characteristics of commodity dimensions
  6. Total clicks/favorites/shopping cart/purchases for different items => statistical characteristics of item dimensions
  7. Brands/products that become popular => Statistical characteristics of product dimensions (difference type)
  8. Count of average users per click/favorites/shopping cart/purchase for different items => statistical characteristics of user dimension
  9. Ratio of the number of behaviors in the last 1/2/3/7 days to the average number of behaviors => Statistical Characteristics of user dimensions (proportional type)
  10. Order of goods in category => Statistical characteristics of goods dimension (Order type)
  11. Total number of people interacting with goods => Statistical characteristics of goods dimensions (summation)
  12. The purchase conversion rate of goods and the ratio of the conversion rate to the average conversion rate of categories => Product dimension calculation characteristics (proportion type)
  13. Mean value of commodity behavior/similar behavior => Statistical characteristics of commodity dimension (proportion type)
  14. Behavior counts by category over the last 1/2/3 days => time type + statistical characteristics of user dimension
  15. The closest interaction time from now => time type
  16. Total interaction days => Time type
  17. Total number of purchases/number of collections/number of shopping carts of user A to brand B => statistical characteristics of user dimension
  18. The square of user A’s clicks on brand B => statistical characteristics of user dimension
  19. The square of user A’s purchases of brand B => statistical characteristics of user dimension
  20. Click to buy ratio of user A to brand B => Statistical characteristics of user dimension (proportional type)
  21. Number of items the user interacted with before/after a given item => time type + statistical characteristics of user dimension
  22. Last interaction time of the user on the last day => Time type
  23. Time of purchase (average, earliest, latest)=> time type

Some features can also be combined, for example the following simple concatenated (spliced) features:

  • Simple combination features: splicing/concatenation
  • user_id && category: 10001 && women's skirts, 10002 && men's denim
  • user_id && style: 10001 && lace, 10002 && 100% cotton

In real e-commerce click-through-rate estimation, a similar combination is positive/negative weight crossed with likes && dislikes of a certain category.

There are also model-based feature combinations:

  • Use GBDT to generate feature-combination paths
  • Feed the combined features, together with the original features, into LR for training
  • This approach was first used at Facebook and is now used by many Internet companies

Another tree-based combination approach is GBDT+LR: each root-to-leaf path of a tree can be treated as one feature, so the model learns a whole set of combination features; a sketch follows.
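A minimal GBDT+LR sketch using scikit-learn (the course's original code is not reproduced here; dataset and parameter values are chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 1) Train a GBDT; each tree's leaf (root-to-leaf path) is a combination feature.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
gbdt.fit(X_train, y_train)

# 2) Map each sample to its leaf indices, then one-hot encode them.
enc = OneHotEncoder(handle_unknown="ignore")
leaves_train = gbdt.apply(X_train)[:, :, 0]   # shape: (n_samples, n_estimators)
enc.fit(leaves_train)

# 3) Feed the one-hot leaf features into LR.
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(leaves_train), y_train)

leaves_test = gbdt.apply(X_test)[:, :, 0]
print("GBDT+LR test accuracy:", lr.score(enc.transform(leaves_test), y_test))
```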



 

3 Feature Selection

With the various feature processing methods in the previous section, many features can be generated, but there are some problems, such as

  • Redundancy: some features are so highly correlated with each other that they waste computing resources
  • Noise: some features actually hurt the prediction results

To address these two problems, we perform feature selection, and sometimes dimensionality reduction:

  • Feature selection is the removal of features that have little bearing on the outcome prediction
  • SVD or PCA can indeed solve certain high-dimensional problems

Next, let’s look at the ways in which features are selected.

3.1 Filter Type

Evaluate the correlation between each individual feature and the target value, and keep the top-correlated features. Correlation can be measured with the Pearson correlation coefficient, mutual information, or distance correlation.

The drawback of this method is that it ignores correlations between features, so it may mistakenly discard features that are useful in combination with others.



Filter-type feature selection with a Python package:
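A minimal sketch with scikit-learn's SelectKBest on the iris dataset (chosen here only for illustration; any scoring function such as chi2 or mutual information works similarly):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently against the target and keep the top 2.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)        # per-feature scores
print(X_new.shape)             # (150, 2)
```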



3.2 Wrapper Type

The wrapper approach treats feature selection as a search over feature subsets: candidate subsets are screened and their quality is evaluated by training a model on each. A typical wrapper algorithm is recursive feature elimination (RFE).

For example, with logistic regression, how do you do that?

  1. Train a model on the full set of features
  2. Based on the coefficients of the linear model (which reflect feature importance), delete the weakest 5-10% of features and observe the change in accuracy/AUC
  3. Repeat step by step until accuracy/AUC drops significantly

Wrapper-type feature selection with a Python package:
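A minimal RFE sketch with scikit-learn (the dataset and parameter values are chosen just for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put features on a comparable scale

# Recursively fit LR, drop the weakest ~10% of features, and repeat.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=10, step=0.1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the 10 selected features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```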







3.3 Embedded Type

Embedded feature selection analyzes feature importance from inside the model itself (unlike the approaches above, it relies on quantities produced by training the model, such as the learned weights). The most common way is to use regularization for feature selection.

Wait, what is regularization? There are two kinds of regularization: L1 regularization and L2 regularization.

In ordinary regression analysis, w denotes the vector of feature coefficients, and the regularization term constrains (penalizes) those coefficients. L1 and L2 regularization are defined as follows:
  • L1 regularization is the sum of the absolute values of the elements of the weight vector w, i.e. ||w||_1 = Σ|w_i|.
  • L2 regularization is the square root of the sum of the squares of the elements of w, i.e. ||w||_2 = sqrt(Σ w_i²) (in ridge regression the L2 penalty is usually written with the square).

The regularization term is usually preceded by a coefficient, which is denoted by alpha in Python and λ in some articles. This coefficient needs to be specified by the user.

So what are L1 and L2 regularization good for? L1 regularization can produce a sparse weight matrix, i.e. a sparse model, which can be used for feature selection. L2 regularization can prevent the model from overfitting; of course, L1 also helps prevent overfitting to some extent.

Sparse models and feature selection. As mentioned above, L1 regularization helps produce a sparse weight matrix, which can then be used for feature selection. Why is a sparse matrix desirable?

A sparse matrix is one in which most elements are 0 and only a few are non-zero; in other words, most coefficients of the resulting linear model are 0. The number of features in machine learning is usually very large: in text processing, for example, if every phrase (term) is taken as a feature, the feature count can reach the tens of thousands (with bigrams).

With so many features, it is obviously hard to choose among them when predicting or classifying. But if the fitted model is sparse, only a few features contribute to it; most contribute nothing or almost nothing (their coefficients are zero or very small, so dropping them barely affects the model). We can then focus only on the features with non-zero coefficients. That is the relationship between sparse models and feature selection.

For example, early click-through-rate estimation in e-commerce used LR: with an L1-regularized LR model over sparse features of 300-500 million dimensions, only 20-30 million features remained, meaning the other features were not important enough.

Embedded feature selection with a Python package:
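A minimal sketch using scikit-learn's SelectFromModel with an L1-regularized logistic regression (the dataset and the C value are chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # L1-regularized LR prefers scaled inputs

# L1 regularization drives many coefficients to exactly zero;
# SelectFromModel keeps only the features with non-zero coefficients.
l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_lr)
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)
```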



As for the two practical examples in the course, such as the Kaggle bike-rental prediction contest, I will not cover them here. If you are interested, see the video for "July Online Machine Learning Session 9, Lecture 5: Feature Engineering"; this article is essentially the notes for that lecture.

 

4 Afterword

This article is still being revised and improved. July, August 2, 2018, 6 pm, July Online office.