Introduction to Machine Learning — How to Build a Complete Machine Learning Project, Part 6!

The first five articles in this series:

  • How to Build a Complete Machine Learning Project
  • Machine Learning Dataset Acquisition and Test Set Construction
  • Data Preprocessing for Feature Engineering (Part 1)
  • Data Preprocessing for Feature Engineering (Part 2)
  • Feature Scaling & Feature Encoding for Feature Engineering

This is the last article in the feature engineering series; it covers feature selection, feature extraction, and feature construction. Feature engineering is usually considered to consist of these three aspects, although I have also included the earlier data and feature preprocessing articles in the series.

In fact, feature engineering is a skill that is best mastered through practice. Reading the theory alone only gives a shallow understanding; applying it to a real project or a competition is what deepens it.


3.4 Feature Selection

Definition: the process of selecting a subset of relevant features from a given feature set is called feature selection.

1. For a given learning task and a given set of attributes, some attributes may be critical for learning while others are not significant.

  • Features that are useful to the current learning task are called relevant features.
  • Attributes or features that are irrelevant to the current learning task are called irrelevant features.

2. Feature selection may reduce the predictive power of the model, because the excluded features may still contain useful information, and discarding that information degrades performance to some extent. It is therefore a trade-off between computational complexity and model performance:

  • Keeping as many features as possible tends to improve model performance, but the model becomes more complex and so does the computation.
  • Eliminating as many features as possible makes the model simpler and computationally cheaper, but its performance may decrease.

3. Common feature selection methods are divided into three categories:

  • Filter
  • Wrapper
  • Embedded

3.4.1 Principles of Feature Selection

1. Reasons for adopting feature selection:

  • The curse of dimensionality. When there are too many attributes or features, selecting only the important ones means a model can be built from just a subset of the features, which greatly alleviates the curse of dimensionality. In this sense feature selection has a motivation similar to dimensionality reduction; in fact, the two are the mainstream techniques for handling high-dimensional data.
  • Removing irrelevant features reduces the difficulty of the learning task, simplifies the model, and lowers the computational complexity.

2. The most important thing in feature selection is to ensure that important features are not lost; otherwise, a model with good performance cannot be obtained due to the lack of important information.

  • Given a data set, the relevant features are likely to differ depending on the learning task, so the irrelevant features in feature selection are those irrelevant to the current learning task.
  • There is also a class of features called redundant features, whose information can be inferred from other features.
    • Redundant features are usually unhelpful, so removing them reduces the burden of model training.
    • However, if a redundant feature happens to correspond to an intermediate concept needed to complete the learning task, it can be beneficial and reduce the difficulty of the task.

3. Without any prior knowledge (i.e. domain knowledge), the only way to select a feature subset containing all the important information from the initial feature set is to traverse all possible feature combinations.

That is neither practical nor feasible, because the number of combinations explodes; it is only manageable when the number of features is very small.

An alternative is:

  • Generate a subset of candidates and evaluate how good or bad it is.
  • The next candidate subset is generated based on the evaluation results, and then its quality is evaluated.
  • This process continues until no better subsequent subset can be found.

This raises two questions: how do we generate the next candidate subset based on the evaluation results, and how do we evaluate how good a candidate feature subset is?

3.4.1.1 Subset Search

1. The subset search method works as follows:

  • Given the feature set A = {A1, A2, …, Ad}, each feature is first treated as a candidate subset on its own (i.e. each subset contains a single element), and these d candidate subsets are evaluated.

    Suppose {A2} is optimal; it becomes the selected subset of the first round.

  • Then one more feature is added to the subset selected in the previous round, forming candidate subsets with two features.

    Suppose {A2, A5} is optimal in this round and better than {A2}; it becomes the selected subset of the second round.

  • …

  • Suppose that in round k+1 the best candidate subset of that round is worse than the best subset of the previous round. The generation of candidate subsets then stops, and the subset selected in the previous round is returned as the result of feature selection.

2. This strategy of gradually adding relevant features is called forward search.

Similarly, if you start from the complete feature set and try to remove one irrelevant feature at a time, the strategy of gradually shrinking the feature set is called backward search.

3. Forward and backward search can also be combined: in each round, relevant features are added to the selected set (and are guaranteed never to be removed in later rounds) while irrelevant features are removed. This strategy is called bidirectional search.

4. All of these strategies are greedy, because they only aim to make the current round's subset optimal. Unless an exhaustive search is performed, such problems cannot be avoided. A greedy forward search is sketched below.
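
The loop of generating candidate subsets and evaluating them can be written in a few lines of Python. This is a minimal sketch, assuming scikit-learn is available and using the cross-validated accuracy of a logistic regression model on the iris dataset as the evaluation criterion; both the dataset and the model are placeholders.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def evaluate(feature_idx):
    # Evaluate a candidate subset by cross-validated accuracy.
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, feature_idx], y, cv=5)
    return score.mean()

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

while remaining:
    # Try adding each remaining feature to the current subset.
    candidates = [(evaluate(selected + [f]), f) for f in remaining]
    round_score, round_feature = max(candidates)
    if round_score <= best_score:   # no candidate beats the previous round
        break                       # stop and keep the last round's subset
    best_score = round_score
    selected.append(round_feature)
    remaining.remove(round_feature)

print("selected features:", selected, "score:", round(best_score, 3))
```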

3.4.1.2 Subset evaluation

1. The subset evaluation method works as follows:

Given a dataset D, assume all attributes are discrete. For an attribute subset A, suppose the values taken on A split D into V subsets {D^1, D^2, …, D^V}:

The information gain of attribute subset A can then be calculated as

$$\mathrm{Gain}(A) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

where $|\cdot|$ denotes the size of a set and $\mathrm{Ent}(\cdot)$ is the information entropy.

The larger the information gain, the more information helpful for classification the feature subset A contains. Therefore, for each candidate feature subset, its information gain on the training set D can be computed and used as the evaluation criterion.
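
As a concrete illustration, the information gain of a discrete attribute subset can be computed with a short numpy helper; this is a minimal sketch on toy data, grouping the samples that share the same value combination on A:

```python
import numpy as np

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, subset):
    """Gain(A) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v) for a discrete attribute subset."""
    gain = entropy(y)
    # Group samples by their value combination on the attribute subset A.
    keys = [tuple(row) for row in X[:, subset]]
    for key in set(keys):
        mask = np.array([k == key for k in keys])
        gain -= mask.mean() * entropy(y[mask])
    return gain

# Toy example with two discrete attributes and binary labels.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1], [0, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
print(information_gain(X, y, subset=[0]))     # attribute 0 alone
print(information_gain(X, y, subset=[0, 1]))  # attributes 0 and 1 together
```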

2. More generally, a feature subset A actually determines a partitioning of the dataset D.

  • Each block of the partition corresponds to one value combination taken on A, while the label information Y corresponds to the true partitioning of D.
  • A can therefore be evaluated by estimating the difference between these two partitionings: the smaller the difference from the partitioning induced by Y, the better A is.
  • Information entropy is only one way to measure this difference; any other mechanism that can measure the difference between the two partitionings can be used to evaluate feature subsets.

3. The feature selection method can be obtained by combining the feature subset search mechanism with the subset evaluation mechanism.

  • In fact, a decision tree can be used for feature selection: the set of splitting attributes used at the tree nodes is the selected feature subset.
  • Other feature selection methods are essentially explicit or implicit combinations of some subset search mechanism with some subset evaluation mechanism.

4. Common feature selection methods fall into the following three types; the main difference is whether the feature selection step uses the subsequent learner.

  • Filter: feature selection is performed on the data set first, and the process is independent of the subsequent learner. In other words, some statistic is designed to filter the features without considering the learner that follows.
  • Wrapper: essentially uses the performance of the subsequent learner (a classifier) as the evaluation of a feature subset.
  • Embedded: the learner itself selects features automatically during training.

5. The simplest feature selection method is to remove features whose values barely change.

For example, if a feature takes only the values 0 and 1, and in 95% of all input samples its value is 1, the feature is considered uninformative.

Of course, a premise of this method is that the feature values are discrete; continuous features have to be discretized first, and in practice it is rare for a feature to take a single value in more than 95% of the samples.

Therefore, this method is simple but not very powerful. It can be used as a pre-processing step for feature selection: first remove the features that barely change, and then apply one of the three types of feature selection methods above.

3.4.2 Filter Selection

In this approach, feature selection is applied to the data set first and the learner is trained afterwards; the feature selection process is independent of the subsequent learner.

That is, the initial features are filtered by feature selection, and then the filtered features are used to train the model.

  • The advantages are efficiency in computation time and robustness to overfitting.
  • The disadvantage is a tendency to select redundant features, because correlations between features are not taken into account.

3.4.2.1 Relief method

1. Relief (Relevant Features) is a well-known filter feature selection method. It designs a relevance statistic to measure the importance of features.

  • The statistic is a vector in which each component corresponds to one initial feature. The importance of a feature subset is determined by the sum of the relevance statistic components of the features it contains.

  • In the end you only need to specify a threshold k, and the features whose relevance component is larger than k are selected.

    Alternatively, you can specify the number of features m to keep and select the m features with the largest relevance components.

2. Relief was designed for binary classification problems; its extended variant Relief-F can handle multi-class problems. A simplified sketch of the Relief statistic is shown below.
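
This is a minimal sketch of the Relief relevance statistic for binary labels (a simplified toy implementation, not a library API): for each sampled instance, find its nearest neighbor of the same class (near-hit) and of the other class (near-miss), and reward features on which the near-miss is far away while the near-hit stays close.

```python
import numpy as np

def relief(X, y, n_iter=None, random_state=0):
    """Simplified Relief for binary labels; returns one relevance weight per feature."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    idx = rng.choice(n_samples, size=n_iter or n_samples, replace=False)
    weights = np.zeros(n_features)

    for i in idx:
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                      # exclude the sample itself
        same, diff = (y == y[i]), (y != y[i])
        near_hit = X[np.where(same)[0][np.argmin(dist[same])]]
        near_miss = X[np.where(diff)[0][np.argmin(dist[diff])]]
        # Larger weight when the feature separates classes (far from the near-miss)
        # and stays stable within a class (close to the near-hit).
        weights += (X[i] - near_miss) ** 2 - (X[i] - near_hit) ** 2
    return weights

X = np.array([[1.0, 0.1], [0.9, 0.5], [0.2, 0.4], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])
print(relief(X, y))   # feature 0 should get the larger weight here
```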

3.4.2.2 Variance selection method

To use the variance selection method, first compute the variance of each feature, and then keep the features whose variance exceeds a given threshold.
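
In scikit-learn this corresponds to VarianceThreshold; a minimal sketch on made-up Boolean data. For a Boolean feature the variance is p(1-p), so a threshold of 0.8*(1-0.8) removes features that take the same value in more than 80% of the samples.

```python
from sklearn.feature_selection import VarianceThreshold

# Toy Boolean data: the first column is 0 in 5 of 6 samples, so its variance is low.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0],
     [0, 1, 1], [0, 1, 0], [0, 1, 1]]

selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))   # keep features with Var > 0.16
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # the low-variance first column has been dropped
```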

3.4.2.3 Correlation coefficient method

To use the correlation coefficient method, first compute the correlation coefficient of each feature with the target value, together with the p-value of that correlation coefficient, and keep the features that score highest.
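
A minimal sketch assuming a regression target: scipy's pearsonr returns the correlation coefficient and its p-value for each feature, and scikit-learn's SelectKBest with f_regression can then keep the top-scoring features.

```python
from scipy.stats import pearsonr
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)

# Correlation coefficient and p-value of each feature against the target.
for j in range(X.shape[1]):
    r, p = pearsonr(X[:, j], y)
    print(f"feature {j}: r = {r:+.3f}, p = {p:.3g}")

# Keep the k features whose (squared) correlation with the target is highest.
X_top = SelectKBest(f_regression, k=3).fit_transform(X, y)
print(X_top.shape)
```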

3.4.2.4 Chi-square test

The classical chi-square test measures the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable has N possible values and the dependent variable has M possible values. Considering the difference between the observed and expected frequencies of the samples whose independent variable equals i and whose dependent variable equals j, the following statistic is constructed:

$$\chi^2 = \sum_{i=1}^{N}\sum_{j=1}^{M} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$

where $A_{ij}$ is the observed frequency and $E_{ij}$ the expected frequency of that combination. It is not hard to see that this statistic simply measures the correlation of the independent variable with the dependent variable.
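
A minimal sketch with scikit-learn's chi2 scorer, which expects non-negative feature values (e.g. counts or frequencies); the iris data satisfies this:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)      # all feature values are non-negative
selector = SelectKBest(chi2, k=2)      # keep the 2 features with the highest chi2 score
X_new = selector.fit_transform(X, y)

print(selector.scores_)    # chi-square statistic of each feature
print(selector.pvalues_)   # corresponding p-values
print(X_new.shape)         # (150, 2)
```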

3.4.2.5 Mutual information Method

Classical mutual information is also used to evaluate the correlation between a qualitative independent variable and a qualitative dependent variable. It is computed as:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$

To handle quantitative (continuous) data, the maximal information coefficient (MIC) was proposed.
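
A minimal sketch with scikit-learn: mutual_info_classif estimates the mutual information between each feature and a discrete target (continuous features are handled by a nearest-neighbor estimator), and it can also be plugged into SelectKBest as the scoring function.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimated mutual information between each feature and the class label.
print(mutual_info_classif(X, y, random_state=0))

# Use it as the scoring function to keep the two most informative features.
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
print(X_new.shape)
```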

3.4.3 Wrapper Selection

1. Unlike filter feature selection, which ignores the subsequent learner, wrapper feature selection directly uses the performance of the learner that will ultimately be used as the evaluation criterion for feature subsets. The goal is to select the feature subset that is tailored to give the best performance for that given learner.

  • The advantage is that it optimizes directly for a specific learner. Because the correlations between features are taken into account, wrapper feature selection can usually produce a better learner than filter feature selection.
  • The disadvantage is that the computational cost is much higher than filter feature selection, because the feature selection process requires training the learner many times.

2. LVW (Las Vegas Wrapper) is a typical wrapper feature selection method. It searches feature subsets with a random strategy under the framework of the Las Vegas method, and uses the error of the final classifier as the evaluation criterion for feature subsets.

3. Since every evaluation of a feature subset in LVW requires training a learner, the computational cost is very high, so a stopping-condition control parameter T is introduced.

However, if the initial number of features is large, T is set large, and each round of training takes a long time, the algorithm may run for a very long time without stopping. In other words, under a running-time constraint LVW may fail to produce a solution.

4. Recursive feature elimination (RFE): a base model is trained for multiple rounds; after each round, the features with the smallest weight coefficients are eliminated, and the next round of training is performed on the remaining feature set. A minimal example is shown below.
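
A minimal sketch with scikit-learn's RFE, using a logistic regression as the base model whose coefficients rank the features, and standardizing the data first so that the coefficients are comparable:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Base model whose coefficients are used to rank features in each round.
base = LogisticRegression(max_iter=5000)
rfe = RFE(estimator=base, n_features_to_select=5, step=1)  # drop 1 feature per round
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks the selected features
```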

3.4.4 Embedded selection

1. In filter and wrapper feature selection methods, the feature selection process is clearly separated from the learner training process.

Embedded feature selection combines feature selection with learner training: both are completed in the same optimization process. That is, features are selected automatically while the learner is trained.

Common methods include:

  • Regularization, e.g. the L_1 and L_2 norms, mainly applied to algorithms such as linear regression, logistic regression, and support vector machines (SVM).
  • Tree-based methods, including decision trees, random forests, gradient boosting, etc.

2. Besides reducing the risk of overfitting, introducing the L_1 norm has another advantage: more components of the learned weight vector w will be exactly zero, i.e. it is easier to obtain a sparse solution.

A learning method based on L_1 regularization is therefore an embedded feature selection method, in which feature selection is integrated with the training process of the learner.

3. Common embedded selection models (a code sketch follows the list):

  • In Lasso, the parameter λ controls the sparsity:
    • The smaller λ is, the lower the sparsity and the more features are selected.
    • Conversely, the larger λ is, the higher the sparsity and the fewer features are selected.
  • In SVM and logistic regression, the parameter C controls the sparsity:
    • The smaller C is, the higher the sparsity and the fewer features are selected.
    • The larger C is, the lower the sparsity and the more features are selected.
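
A minimal sketch of embedded selection with scikit-learn: an L1-penalized logistic regression is trained and SelectFromModel keeps only the features with non-zero coefficients; a tree-based model exposing feature_importances_ could be plugged in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty drives many coefficients to exactly zero; a smaller C gives a sparser model.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
selector = SelectFromModel(l1_model).fit(X, y)

X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)   # features with zero coefficients are dropped
```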

3.5 Feature Extraction

Feature extraction generally comes before feature selection. Its input is the raw data, and its purpose is to automatically build new features, converting the raw data into a set of features with clear physical meaning (such as Gabor, geometric, or texture features) or statistical meaning.

Commonly used methods include dimensionality reduction (PCA, ICA, LDA, etc.), image features such as SIFT, Gabor, and HOG, and, for text, the bag-of-words model and word embedding models. The basic concepts of these methods are briefly introduced below.

3.5.1 Dimensionality Reduction

1. Principal Component Analysis (PCA)

PCA is the most classical dimensionality reduction method. It aims to find the principal components of the data and use them to represent the original data, thereby achieving dimensionality reduction.

The idea of PCA is to find, through a transformation of the coordinate axes, the subspace that best captures the distribution of the data.

For example, consider a set of points in three-dimensional space that all lie on a plane through the origin. In the natural x, y, z coordinate system, three coordinates are needed to represent them; but since the points actually lie on a single two-dimensional plane, a coordinate transformation that aligns this plane with the x-y plane lets us represent the original data with two new axes x' and y' without any loss of information. This completes the dimensionality reduction, and the two new axes are exactly the principal components we are looking for.

Therefore, the PCA solution generally consists of the following steps:

  1. Center the sample data;
  2. Compute the sample covariance matrix;
  3. Perform an eigenvalue decomposition of the covariance matrix and sort the eigenvalues from largest to smallest;
  4. Take the eigenvectors W1, W2, ..., Wn corresponding to the n largest eigenvalues, thereby reducing the original m-dimensional samples to n dimensions.

Through PCA, directions with small variance can be discarded. Here each eigenvector can be understood as the direction of a new coordinate axis after the transformation, and the corresponding eigenvalue is the variance of the data along that direction: the larger the eigenvalue, the larger the variance and the more information that direction carries. This is why the eigenvectors of the n largest eigenvalues are chosen, as these directions contain the most important information.
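
The four steps above can be written directly in numpy; this is a minimal sketch on random data (in practice one would normally use sklearn.decomposition.PCA instead):

```python
import numpy as np

def pca(X, n_components):
    # 1. Center the samples.
    X_centered = X - X.mean(axis=0)
    # 2. Sample covariance matrix (features in columns).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
    # 4. Keep the eigenvectors of the n largest eigenvalues and project.
    W = eigvecs[:, order[:n_components]]
    return X_centered @ W

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, n_components=2).shape)   # (100, 2)
```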

PCA is a linear dimensionality reduction method, which is also one of its limitations. There are many ways around this, for example extending PCA with a kernel mapping to obtain kernel PCA (KPCA), or using manifold-based dimensionality reduction methods such as isometric mapping (Isomap), locally linear embedding (LLE), and Laplacian eigenmaps to perform nonlinear dimensionality reduction on complex data sets where PCA works poorly.

2. Linear Discriminant Analysis (LDA)

LDA is a supervised learning algorithm. Unlike PCA, it takes the class information of the data into account, whereas PCA simply maps the data onto the directions of largest variance.

Because it uses class information, the goal of LDA is not only to reduce dimensionality but to find a projection direction such that the projected samples are separated by their original classes as well as possible, i.e. a direction that maximizes the between-class distance while minimizing the within-class distance.

The advantages of LDA are as follows:

  • Compared with PCA, LDA is better suited to data that carries class information.
  • The linear model is fairly robust to noise, and LDA is an effective dimensionality reduction method.

Correspondingly, it also has the following disadvantages:

  • LDA makes strong assumptions about the data distribution, for example that each class is Gaussian and that all classes share the same covariance. These assumptions may not hold exactly in practice.
  • The LDA model is simple and therefore somewhat limited in expressive power. However, kernel functions can be introduced to extend LDA so that it can handle data with more complex distributions.
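
A minimal sketch with scikit-learn's LinearDiscriminantAnalysis; note that LDA can project onto at most (number of classes - 1) dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 3 classes -> at most 2 LDA components

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # supervised: uses the labels y

print(X_lda.shape)                       # (150, 2)
print(lda.explained_variance_ratio_)     # share of between-class variance per component
```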

3. ICA (Independent Component Analysis)

PCA extracts uncorrelated components through feature transformation and dimensionality reduction, whereas ICA (independent component analysis) obtains mutually independent components. ICA essentially seeks a linear transformation z = Wx that maximizes the independence of the individual components of z.

PCA is often used first to reduce the dimensionality of the data, and ICA is then used to separate the useful signals from the multi-dimensional data; in this sense PCA acts as a preprocessing step for ICA.
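
A minimal sketch with scikit-learn's FastICA on a toy blind-source-separation problem, recovering two independent source signals from their linear mixtures:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent source signals
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X = S @ A.T                                        # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X)                 # estimated independent components
print(S_estimated.shape)                           # (2000, 2)
```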

For more details, see the reference "What is the difference between independent component analysis (ICA) and principal component analysis (PCA)?" at the end of this article.

3.5.2 Image feature extraction

Before deep learning became popular, many traditional methods were used for image feature extraction; the more common ones include the following.

1. SIFT features

SIFT is a very widely used feature in image feature extraction. Its advantages include:

  • Invariance to rotation, scale, translation, viewpoint, and brightness, which helps it express target feature information effectively.
  • Robustness to parameter adjustment: the number of feature points can be tuned to the needs of the scene, which makes feature analysis easier.

SIFT extraction of local feature points mainly includes four steps:

  1. Detect candidate feature points;
  2. Remove spurious feature points;
  3. Assign gradient-based orientations to the feature points;
  4. Generate the feature descriptor vectors.

The downside of SIFT is that it is hard to run in real time without hardware acceleration or a dedicated image processing chip.
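
A minimal sketch with OpenCV, assuming opencv-python 4.4 or later (where SIFT is included in the main package) and a placeholder image file named 'image.jpg':

```python
import cv2

img = cv2.imread("image.jpg")                      # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()                           # SIFT detector/descriptor
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(len(keypoints))        # number of detected feature points
print(descriptors.shape)     # (num_keypoints, 128): 128-dimensional descriptors
```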

2. SURF features

SURF is an improvement on SIFT that reduces the time complexity and improves robustness.

It mainly simplifies some of SIFT's operations; for example, the Gaussian second-order differential model in SIFT is simplified so that the convolution and smoothing steps reduce to additions and subtractions, and the dimensionality of the final feature vector is reduced from 128 to 64.

3. HOG features

The histogram of oriented gradients (HOG) feature was proposed in 2005 for pedestrian detection. It describes an image by computing and accumulating histograms of gradient orientations over local regions of the image.

HOG feature extraction steps are as follows:

  1. Normalization: first convert the image to grayscale, then apply gamma correction. This step makes the feature description more robust to illumination and environmental changes, reduces local shadows, local overexposure, and texture distortion, and suppresses noise as much as possible;
  2. Compute the image gradients;
  3. Accumulate the gradient orientations into histograms;
  4. Normalize the feature vectors within each block, in order to cope with uneven illumination and contrast differences between foreground and background;
  5. Generate the final feature vector.
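
A minimal sketch with scikit-image's hog function, which performs essentially these steps internally, on a built-in sample image:

```python
from skimage import data, color
from skimage.feature import hog

image = color.rgb2gray(data.astronaut())           # built-in sample image, grayscale

features, hog_image = hog(
    image,
    orientations=9,              # number of gradient orientation bins
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",         # per-block normalization of the histograms
    visualize=True,
)
print(features.shape)            # flattened HOG feature vector
```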

4. LBP features

The local binary pattern (LBP) is a feature operator that describes the local texture of an image; its advantages include rotation invariance and gray-scale invariance.

LBP is a gray-scale image processing operation; it takes an 8-bit or 16-bit grayscale image as input.

By comparing each window's center pixel with its neighborhood pixels, LBP re-encodes the image into new features, which removes much of the influence of external conditions on the image and thus, to some extent, solves the feature description problem in complex scenes (e.g. lighting changes).

Depending on the neighborhood window used, LBP comes in two variants: classical LBP, which uses a 3×3 square window, and circular LBP, which extends the window from a square to an arbitrary circular region.
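
A minimal sketch of circular LBP with scikit-image, computing the LBP code image on a built-in sample image and then the histogram of codes, which serves as the texture descriptor:

```python
import numpy as np
from skimage import data
from skimage.feature import local_binary_pattern

image = data.camera()                      # built-in 8-bit grayscale sample image
P, R = 8, 1                                # 8 neighbors on a circle of radius 1

lbp = local_binary_pattern(image, P, R, method="uniform")
# The histogram of LBP codes is the actual texture descriptor.
hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
print(hist)
```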

For more details, see the reference "Image feature detection and description (1): SIFT, SURF, ORB, HOG, LBP feature principles and OpenCV implementations" at the end of this article.

Of course, the methods above are traditional image feature extraction methods; nowadays images are usually fed directly into a CNN (convolutional neural network) for feature extraction and classification.

3.5.3 Text feature extraction

1. Bag-of-words model

The most basic text representation model is the bag-of-words model.

Concretely, the text is split into individual words, and each article is then represented as a long vector in which each dimension corresponds to a word, and the weight of that dimension reflects the importance of the word in the original article.

TF-IDF is usually used to compute the weights: TF-IDF(t, d) = TF(t, d) × IDF(t).

Here TF(t, d) is the frequency of word t in document d, and IDF(t) is the inverse document frequency, which measures how important word t is for distinguishing documents. It can be written as:

$$\mathrm{IDF}(t) = \log\frac{\text{total number of documents}}{\text{number of documents containing word } t + 1}$$

The intuition is that if a word appears in many articles, it is probably a common word that contributes little to distinguishing one article from another, so its weight, i.e. IDF(t), should be small.
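
A minimal sketch with scikit-learn's TfidfVectorizer (assuming scikit-learn 1.0 or later for get_feature_names_out; note that its IDF formula differs slightly from the one above in its smoothing terms):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning is fun",
    "feature engineering for machine learning",
    "deep learning extracts features automatically",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term TF-IDF matrix

print(vectorizer.get_feature_names_out())   # the vocabulary (one dimension per word)
print(X.shape)                              # (3 documents, vocabulary size)
```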

2. N-gram model

The bag-of-words model splits text into single words, but word-level splitting is not always ideal, since some concepts are only expressed by combinations of words, e.g. "natural language processing" or "computer vision".

Therefore, phrases of up to n consecutive words (n-grams) can be added to the vector representation as individual features, forming the N-gram model.

In addition, the same word may appear in several inflected forms with the same meaning, so word stemming is applied in practice, i.e. words with different inflections are reduced to a common stem.
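
With scikit-learn, switching from a plain bag of words to an N-gram representation is a matter of setting ngram_range; a minimal sketch with unigrams plus bigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["natural language processing is interesting",
        "computer vision and natural language processing"]

# ngram_range=(1, 2): keep single words and two-word phrases as features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # includes bigrams such as 'natural language'
print(X.toarray())
```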

3. Word embedding model

Word embedding is the general term for a class of models that map words to vectors. The core idea is to map each word to a dense vector in a low-dimensional space (typically K = 50–300 dimensions).

A common word embedding model is Word2Vec. It is a shallow neural network model with two architectures: CBOW (Continuous Bag of Words) and Skip-gram.

CBOW predicts the probability of the current word from the words appearing in its context, while Skip-gram predicts the probability of each context word from the current word.
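
A minimal sketch with gensim (assuming gensim 4.x, where the dimensionality parameter is called vector_size); sg=0 selects CBOW and sg=1 selects Skip-gram:

```python
from gensim.models import Word2Vec

# Tokenized toy corpus: one list of words per sentence.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["feature", "engineering", "helps", "machine", "learning"],
    ["deep", "learning", "learns", "features"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # Skip-gram

vec = model.wv["learning"]          # 50-dimensional dense vector for one word
print(vec.shape)
print(model.wv.most_similar("learning", topn=2))
```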

The word embedding model maps each word to a K-dimensional vector. If a document has N words, it can be represented by an N×K matrix, but such a representation is too low-level. If this matrix is fed directly into a model as the representation of the original text, it is usually difficult to obtain satisfactory results; in general, higher-level features need to be extracted and constructed from it.

The emergence of deep learning models provides exactly this kind of automatic feature engineering, where each hidden layer corresponds to features at a different level of abstraction. Convolutional neural networks (CNN) and recurrent neural networks (RNN) have achieved good results in text representation because they model text well and extract high-level semantic features.

3.5.4 Difference between feature extraction and feature selection

Feature extraction and feature selection both aim to find the most effective features among the original ones.

The difference is that feature extraction emphasizes obtaining a set of features with clear physical or statistical meaning through a feature transformation,

whereas feature selection picks a subset of features with clear physical or statistical meaning out of the existing feature set.

Both help reduce the feature dimensionality and the data redundancy. Feature extraction can sometimes uncover more meaningful feature attributes, while the feature selection process can often indicate how important each feature is for model construction.

3.6 Feature Construction

Feature construction refers to manually building new features from the raw data. It takes time to examine the raw data and think about the underlying form of the problem and the structure of the data; data intuition and machine learning experience both help.

Feature construction requires strong insight and analytical ability, because we need to find features with physical meaning in the raw data. If the raw data is tabular, new features can generally be created by mixing or combining attributes, or by splitting or slicing existing features.

Feature construction depends heavily on domain knowledge and practical experience: only with these can genuinely useful new features be built. Compared with feature extraction, which transforms the raw data with some extraction method, feature construction requires us to build features by hand, for example combining two features into one, or decomposing one feature into several new ones.


Summary

The feature engineering topic spans four articles in total, covering data preprocessing (handling missing values, outliers, class imbalance, and data augmentation), feature scaling and feature encoding, and, in this article, feature selection, feature extraction, and feature construction. That basically covers what feature engineering involves, although a few topics may still be missing.

In fact, the chapter summary of "" happened to involve feature engineering, and I planned to write a proper summary of it; unexpectedly, this part turned out to contain a great deal of material. My own experience with feature engineering is not very rich, so the content of these articles is mostly collected from online articles, plus a small amount of personal experience. It was genuinely difficult to write: for much of the material I could only briefly summarize the basic concepts and usage steps rather than go deeper.

Therefore, I recommend that after reading you find some practical projects or join a competition, and apply the theory and methods of feature engineering in practice; you will gain a much deeper understanding.

The most famous competition platform is Kaggle; in China there are also Tianchi and DataFountain.

Next, the "How to Build a Complete Machine Learning Project" series will move on to algorithm and model selection and evaluation, where I also plan to briefly summarize the classic algorithms commonly used in machine learning.


References:

  • "Hundred-Face Machine Learning" (百面机器学习), Chapter 1: Feature Engineering
  • blog.csdn.net/dream_angel…
  • www.cnblogs.com/sherial/arc…
  • gofisher.github.io/2018/06/22/…
  • gofisher.github.io/2018/06/20/…
  • juejin.cn/post/684490…
  • www.zhihu.com/question/47…
  • www.huaxiaozhuan.com/Statistical Learning/Chapte…
  • Scikit-learn: several common feature selection methods (dataunion.org/14072.html)
  • cnblogs: Feature Engineering for Machine Learning
  • Mathematics in Machine Learning (4): Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA)
  • What is the difference between independent component analysis (ICA) and principal component analysis (PCA)?
  • Image feature detection and description (1): SIFT, SURF, ORB, HOG, LBP feature principles and OpenCV implementations
