How to Build a Complete Machine Learning Project, Part 7!

The first six articles in this series:

  • How to Build a Complete Machine Learning Project
  • Machine Learning Dataset Acquisition and Test Set Construction Methods
  • Data Preprocessing for Feature Engineering (Part 1)
  • Data Preprocessing for Feature Engineering (Part 2)
  • Feature Scaling & Feature Encoding for Feature Engineering
  • Feature Engineering (Final Part)

The first six articles covered the path from defining the ultimate goal of a project, through finding and acquiring data, to data preprocessing and feature engineering. The next step is to choose an appropriate algorithm model and then train, evaluate, and test it.

Therefore, a summary and comparison of commonly used machine learning algorithms follows, covering:

  1. Linear regression
  2. Logistic regression
  3. Decision tree
  4. Random forest
  5. Support vector machine
  6. Naive Bayes
  7. KNN algorithm
  8. K-means algorithm
  9. Boosting methods
  10. GBDT
  11. Optimization algorithms
  12. Convolutional neural networks

Because of length constraints, only the basic principles, advantages, and disadvantages of each algorithm are introduced briefly, and to keep each article from running too long, the material may be split across two or three parts.


1. Linear regression

Overview

Linear regression is a regression analysis that models the relationship between one or more independent variables and a dependent variable using a least-squares fit of a function called the linear regression equation.

This function is a linear combination of one or more model parameters called regression coefficients (the variables are all raised to the first power). The case with only one independent variable is called simple regression, and the case with more than one independent variable is called multiple regression.

The model function of linear regression is as follows:

h(x) = w^T x + b
Its loss function (the mean squared error) is as follows:

J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2
Training on the data set finds the optimal parameters, that is, it solves for the parameter vector; this parameter vector can also be split into the weight w and the bias b.

The least squares method (normal equation) and the gradient descent method are both used to find this optimal solution.
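To make the two approaches concrete, here is a minimal NumPy sketch (not from the original article); the synthetic data, learning rate, and iteration count are assumptions chosen for illustration:

import numpy as np

# Synthetic data (an assumption for illustration): y is roughly 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2 * x[:, 0] + 1 + rng.normal(0, 0.5, size=100)

# Add a column of ones so the parameter vector is [bias b, weight w]
X = np.hstack([np.ones((x.shape[0], 1)), x])

# 1) Least squares via the normal equation: theta = (X^T X)^{-1} X^T y
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Batch gradient descent on the mean squared error loss
theta_gd = np.zeros(2)
learning_rate = 0.01
for _ in range(5000):
    gradient = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= learning_rate * gradient

print("normal equation:", theta_ls)    # both should be close to [1, 2]
print("gradient descent:", theta_gd)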

The advantages and disadvantages

Advantages: the results are easy to interpret and the computation is not complex. Disadvantages: fits nonlinear data poorly. Applicable data types: numerical and nominal data. Algorithm type: regression algorithm.

Code implementation

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
#Load Train and Test datasets
#Identify feature and response variable(s) and values must be numeric and numpy arrays

x_train=input_variables_values_training_datasets
y_train=target_variables_values_training_datasets
x_test=input_variables_values_test_datasets

# Create linear regression object
linear = linear_model.LinearRegression()

# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)

#Equation coefficient and Intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

#Predict Output
predicted= linear.predict(x_test)

2. Logistic regression

Overview

The logistic regression algorithm is based on the Sigmoid function; in other words, the Sigmoid function is the logistic regression function. The Sigmoid function is defined as g(z) = \frac{1}{1 + e^{-z}}, and its range is (0, 1).

So the logistic regression hypothesis can be expressed as follows:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
The cost function of logistic regression is as follows:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
The gradient descent algorithm can be used to find the parameters that minimize the cost function. The gradient descent update rule is:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

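To make the update rule concrete, here is a minimal NumPy sketch of the sigmoid and the batch gradient-descent loop for the cross-entropy cost; the toy data, learning rate, and iteration count are assumptions chosen for illustration:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), with values in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (x[:, 0] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), x])   # bias column plus one feature

theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):
    h = sigmoid(X @ theta)              # predicted probabilities h_theta(x)
    gradient = X.T @ (h - y) / len(y)   # gradient of the cross-entropy cost J(theta)
    theta -= alpha * gradient           # theta_j := theta_j - alpha * dJ/dtheta_j

print("learned parameters:", theta)
print("training accuracy:", ((sigmoid(X @ theta) > 0.5) == y).mean())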
The advantages and disadvantages

Advantages
  1. Simple to implement and widely used in industry;
  2. Classification requires very little computation, is fast, and uses few storage resources;
  3. The probability score of a sample is easy to observe;
  4. Multicollinearity is not a serious problem for logistic regression and can be handled by combining it with L2 regularization.
Disadvantages
  1. Prone to underfitting; accuracy is generally not very high;
  2. It only handles binary classification directly (softmax, derived from it, can handle multiple classes) and the data must be linearly separable;
  3. When the feature space is very large, logistic regression does not perform well;
  4. It does not handle large numbers of multi-valued categorical features or variables well;
  5. Nonlinear features must be transformed first.

Applicable data types: numerical and nominal data. Category: classification algorithm. Applicable scenario: binary classification problems.

Code implementation

#Import Library
from sklearn.linear_model import LogisticRegression
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create logistic regression object

model = LogisticRegression()

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Equation coefficient and Intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)

#Predict Output
predicted= model.predict(x_test)

3. Decision tree

Overview

Definition: a classification decision tree model is a tree structure that describes the classification of instances. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class.

In essence, decision tree learning induces a set of classification rules from the training data set, or, equivalently, estimates a conditional probability model from the training data. The loss function is usually a regularized maximum-likelihood function, and the learning strategy is to minimize this loss function as the objective.

Decision tree learning is usually a process of recursively selecting the optimal feature and splitting the training data on that feature, so that each resulting subset is classified as well as possible.

Tree generation corresponds to local selection of the model and considers only the local optimum, whereas tree pruning corresponds to global selection of the model and considers the global optimum.

Decision tree learning usually includes three steps: feature selection, decision tree generation and decision tree pruning.

Feature selection

The criterion of feature selection is usually information gain or information gain ratio.

The information gain is defined as follows:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where Entropy(S) is the information entropy of the sample set S, A is an attribute, Values(A) is the set of values A can take, and S_v is the subset of S on which A takes the value v. Gain(S, A) represents the amount of information about the target obtained once the value of attribute A is known; the larger Gain(S, A) is, the more information is gained.

The disadvantage of information gain is that it tends to favor features with many distinct values. To address this problem, the information gain ratio can be used.

Therefore, the information gain ratio of feature A with respect to training data set D is defined as follows:

g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}

where g(D, A) is the information gain of A on D, and H_A(D) is the entropy of D with respect to the values of feature A (D_i being the subset of D on which A takes its i-th value).
The information gain ratio also has a preference for attributes with fewer values.
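As a minimal sketch of how entropy, information gain, and the gain ratio can be computed for a discrete feature (the toy feature values and labels below are assumptions for illustration):

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_k p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    cond = sum((feature == v).mean() * entropy(labels[feature == v])
               for v in np.unique(feature))
    return entropy(labels) - cond

def gain_ratio(feature, labels):
    # GainRatio = Gain(S, A) / H_A(S), where H_A(S) is the entropy of A's own values
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Toy data (assumed): does a customer buy, given their age group?
age  = np.array(["youth", "youth", "middle", "old", "old", "middle"])
buys = np.array([0, 0, 1, 1, 0, 1])

print("entropy:", entropy(buys))
print("information gain:", information_gain(age, buys))
print("gain ratio:", gain_ratio(age, buys))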

Decision tree generation

This section briefly introduces decision tree generation algorithms, including ID3 and C4.5.

ID3

The core of the ID3 algorithm is to construct the decision tree recursively by selecting, at each node, the feature given by the information gain criterion.

The idea of ID3 algorithm is as follows:

  1. First, compute the information gain of each feature on the current data set;
  2. Then select the feature with the maximum information gain as the decision feature of the current node;
  3. Split the data into different child nodes according to the values of that feature (for example, an age feature with the values youth, middle age, and old age produces three subtrees);
  4. Continue recursing on the child nodes until all features have been used or each subset is pure.
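A compact, simplified sketch of this recursion is shown below; it assumes discrete features and labels encoded as small non-negative integers, and it omits the stopping thresholds and pruning a real implementation would need:

import numpy as np
from collections import Counter

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(x, y):
    return entropy(y) - sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))

def id3(X, y, features):
    # Stop when the subset is pure or no features remain; return the majority class
    if len(set(y.tolist())) == 1 or not features:
        return int(Counter(y.tolist()).most_common(1)[0][0])
    # Steps 1-2: pick the feature with the maximum information gain
    best = max(features, key=lambda f: info_gain(X[:, f], y))
    remaining = [f for f in features if f != best]
    # Steps 3-4: split on each value of the chosen feature and recurse on the subsets
    return {best: {int(v): id3(X[X[:, best] == v], y[X[:, best] == v], remaining)
                   for v in np.unique(X[:, best])}}

# Toy data (assumed): columns are [age group, has income], label is "buys"
X = np.array([[0, 0], [0, 1], [1, 1], [2, 1], [2, 0], [1, 0]])
y = np.array([0, 1, 1, 1, 0, 1])
print(id3(X, y, features=[0, 1]))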

The disadvantages of ID3 are:

  1. It is prone to overfitting;
  2. It can only handle nominal (discrete) data;
  3. The information gain calculation is biased toward features with many distinct values, and the attribute with the most values is not necessarily the best choice;
  4. It is not robust to noise, and it is hard to control the proportion of positive and negative examples in the training set.

C4.5

C4.5 algorithm inherits the advantages of ID3 algorithm and improves ID3 algorithm in the following aspects:

  • Using information gain rate to select attributes overcomes the deficiency of favoring the attributes with many values when using information gain to select attributes.
  • Pruning during tree construction;
  • Can complete the discrete processing of continuous attributes;
  • Ability to process incomplete data.

The C4.5 algorithm has the following advantages: The generated classification rules are easy to understand and accurate.

Its disadvantages are:

  1. The algorithm is inefficient: during tree construction, the data set has to be scanned and sorted sequentially many times;
  2. Memory limitations: it is only suitable for data sets that fit in memory; the program cannot run when the training set is too large to hold in memory.

In fact, because of this shortcoming of the information gain ratio, the C4.5 algorithm does not directly select the candidate splitting attribute with the largest information gain ratio. Instead, it first finds the candidate attributes whose information gain is above the average level, and then selects from them the one with the highest information gain ratio.

Both ID3 and C4.5 are best used on small data sets; decision tree classification is generally only suitable for small data. When attributes take very many values, it is best to choose the C4.5 algorithm, as ID3 will give very poor results in that case.

Pruning

When the tree is being generated, if there is no pruning, it keeps growing until every leaf belongs to a single category. This fits the training set perfectly, but it is very unfriendly to the test set and the generalization ability is poor. Therefore, some branches are pruned away to make the model generalize better. According to when pruning happens, it can be divided into pre-pruning and post-pruning. Pre-pruning is carried out while the decision tree is being generated; post-pruning is done after the decision tree has been generated.

Pruning of a decision tree is usually achieved by minimizing the loss (or cost) function of the whole tree. Put simply, whether a subtree should be pruned is decided by comparing the loss function or accuracy of the whole tree before and after pruning.

There are many decision tree pruning algorithms; see the decision tree pruning article in the references for the specific algorithms.
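As one concrete option, scikit-learn implements a form of post-pruning through cost-complexity pruning (the ccp_alpha parameter); here is a minimal sketch, where the dataset, split, and alpha value are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: fits the training set almost perfectly but may overfit
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Post-pruned tree: a larger ccp_alpha removes subtrees whose complexity
# is not justified by the reduction in training error
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, "train:", model.score(X_train, y_train), "test:", model.score(X_test, y_test))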

The advantages and disadvantages

Advantages
  1. With simple computation and strong interpretability, it is suitable for processing samples with missing attribute values and can process irrelevant features.

  2. With high efficiency, the decision tree only needs to be constructed once and used repeatedly.

  3. Training time complexity is low and prediction is fast; each prediction requires at most as many comparisons as the depth of the tree. For N samples with M attributes each, and ignoring the cost of discretizing continuous attributes, the average time complexity of building the decision tree is only O(N · M · log N), and the tree depth generally grows logarithmically with the number of samples.

Disadvantages
  1. The classification ability of a single decision tree is weak, and continuous-valued variables are difficult to handle;
  2. It overfits easily (random forests appeared later and reduce this overfitting);
  3. It may fall into local optima;
  4. It does not support online learning.

How decision tree overfitting is addressed

1. Pruning

  • Pre-pruning: when splitting a node, impose strict conditions and stop splitting immediately if they are not met (the drawback is that this greedy stopping may keep the decision tree from being optimized and from reaching good results); a brief scikit-learn sketch follows this list.
  • Post-pruning: after the tree has been built, replace a subtree with a single node whose class is the majority class within that subtree (the drawback is that part of the earlier construction work is wasted).

2. Cross validation

3. Random forests
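In scikit-learn, pre-pruning amounts to constraining the tree while it grows; below is a brief sketch combining such constraints with cross-validation (the dataset and parameter grid are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pre-pruning: constrain the tree while it grows (depth and leaf-size limits),
# and use cross-validation to choose the constraints
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best pre-pruning parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)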

Code implementation

#Import Library
#Import other necessary libraries like pandas, numpy...

from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create tree object 
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; the split criterion can be 'gini' or 'entropy' (information gain); the default is 'gini'

# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted= model.predict(x_test)

4. Random forest

Overview

Random forest refers to a classifier that uses multiple trees to train and predict samples.

It is composed of multiple CART (Classification And Regression Tree) trees. For each tree, the training set used is sampled with replacement from the full training set, which means that some samples in the full training set may appear several times in a given tree's training set while others may never appear in it. When training the nodes of each tree, the features considered are a random subset drawn without replacement from all the features; if the total number of features is M, a common choice for the size of this subset is \sqrt{M} (\log_2 M is also used).
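Here is a minimal sketch of these two sources of randomness, bootstrap sampling of rows and a random feature subset per split; the sample and feature counts are assumptions, and in scikit-learn's RandomForestClassifier the subset size is controlled by the max_features parameter:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 150, 16   # assumed sizes for illustration

# Bootstrap sample for one tree: drawn *with* replacement, so some rows
# repeat and roughly one third never appear in that tree's training set
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
print("unique samples in one bootstrap:", len(np.unique(bootstrap_idx)))

# Feature subset for one split: drawn *without* replacement from all M features;
# a common subset size is sqrt(M)
k = int(np.sqrt(n_features))
feature_subset = rng.choice(n_features, size=k, replace=False)
print("features considered at this split:", feature_subset)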

The advantages and disadvantages

Advantages
  • It performs well on many data sets and has a clear advantage over other algorithms on a large number of current data sets;
  • It can handle very high-dimensional data (many features) without feature selection;
  • It can assess the importance of features;
  • When building the random forest, an unbiased estimate of the generalization error is obtained;
  • Training is fast and the method is easy to parallelize;
  • During training, interactions between features can be detected;
  • It is easy to implement;
  • For unbalanced data sets, it can balance the errors;
  • It can be applied to data sets with missing features and still maintain good accuracy.
Disadvantages
  1. Random forests have been shown to overfit on some noisy classification or regression problems;
  2. For data whose attributes take different numbers of values, attributes with more values have a larger influence on the random forest, so the attribute weights it produces on such data are not reliable.

Code implementation

A simple example of using the random forest algorithm in Sklearn:

#Import Library
from sklearn.ensemble import RandomForestClassifier
#Assumed you have, X (predictor) and Y (target) for training data set and x_test(predictor) of test_dataset

# Create Random Forest object
model= RandomForestClassifier()

# Train the model using the training sets and check score
model.fit(X, y)

#Predict Output
predicted= model.predict(x_test)

Summary

This article introduced four algorithms: linear regression, logistic regression, decision tree, and random forest. The last three are commonly used, logistic regression especially so; in many projects a logistic regression model is implemented first to run through the whole pipeline, and more complex models are then substituted as needed. The random forest is an upgraded version of the decision tree with better performance, and it can also be used to evaluate feature importance and perform feature selection; it belongs to the last of the three feature selection approaches, the embedded method, where the learner selects features automatically.
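As a brief illustration of that last point, a trained random forest exposes feature_importances_, which can drive embedded feature selection, for example via scikit-learn's SelectFromModel; the dataset below is just an illustrative choice:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit a forest and look at the learned feature importances
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("top importances:", sorted(forest.feature_importances_, reverse=True)[:5])

# Embedded feature selection: keep only features whose importance is above the mean
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
X_selected = selector.fit_transform(X, y)
print("features kept:", X_selected.shape[1], "out of", X.shape[1])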


Reference:

  • Statistical Learning Methods
  • All regression solutions: traditional regression, logistic regression, weighted regression/kernel regression, ridge regression, generalized linear model/exponential family
  • Decision tree pruning algorithm
  • Decision tree series (5) — CART
  • Summary of RandomForest
  • Machine Learning Common algorithms personal Summary (for interview)

Welcome to follow my WeChat official account, Machine Learning and Computer Vision, so that we can communicate, learn, and make progress together!