Why we recommend it
Inventories of machine learning algorithms are easy to find on the Internet, but few of them tie each algorithm to concrete usage scenarios, which is exactly what this article sets out to do.
Drawing on hands-on experience, the author analyzes the practical strengths and weaknesses of each algorithm.
The purpose of this article is to take a pragmatic, concise inventory of current machine learning algorithms. Plenty of such inventories exist, but they rarely spell out each algorithm's true strengths and weaknesses. Here, we discuss them in detail based on our practical experience.
Categorizing machine learning algorithms has always been tricky. Common criteria include generative vs. discriminative, parametric vs. non-parametric, supervised vs. unsupervised, and so on.
For example, scikit-learn groups algorithms by their learning mechanism, resulting in categories such as:
Generalized linear models
Support vector machines
Nearest neighbors
Decision trees
Neural networks
…
However, in our experience, this is not the most practical way to classify algorithms. That’s because when you’re using machine learning, you don’t think, “I want to train a support vector machine today!”
Instead, you’re usually thinking about the end goal, like predicting an outcome or categorizing your observations.
So we want to classify machine learning algorithms by the task you are trying to accomplish.
There is no free lunch
In machine learning, a fundamental principle is that there is no free lunch. In other words, no algorithm can solve all problems perfectly, especially for supervised learning (e.g., predictive modeling).
For example, you can’t say that neural networks are better than decision trees in any situation, or vice versa. They are influenced by many factors, such as the size or structure of your data set.
As a result, you should try several different algorithms for your specific problem and use a hold-out test set to evaluate their performance and pick the winner.
Of course, the algorithm you choose must be applicable to your own problem, which requires choosing the right machine learning task. By analogy, if you need to clean your house, you might use a vacuum cleaner, broom or mop, but you should never get out your shovel and dig.
Machine learning tasks
Here, we will first discuss the three most common machine learning tasks:
1. Regression
2. Classification
3. Clustering
And two dimensionality reduction tasks:
4. Feature selection
5. Feature extraction
In subsequent articles, we will also discuss density estimation and anomaly detection.
Note: This article will not cover specialized subfields, such as natural language processing.
Nor will it cover every individual algorithm; there are countless algorithms out there, and new ones appear all the time. Nevertheless, it presents the most representative algorithms for each task.
▌ 1. Regression
Regression is a supervised learning task for modeling and predicting continuous numerical variables, with use cases such as predicting real estate prices, stock price movements, or student grades.
Regression tasks are characterized by labeled data sets with a numerical target variable. In other words, every observation used to supervise the algorithm comes with a numerical ground-truth value.
Linear regression
1.1 (Regularized) linear regression
Linear regression is the most commonly used algorithm for regression tasks. In its simplest form, it fits the data set with a continuous hyperplane (i.e., a straight line when there are only two variables). If the relationships among the variables in the data set are linear, the fit can be quite good.
In practice, simple linear regression is often replaced by its regularized variants (LASSO, Ridge, and Elastic Net). Regularization penalizes overly large regression coefficients to avoid overfitting, and the strength of the penalty needs to be tuned.
- Advantages: Linear regression is straightforward to understand and interpret, and it can be regularized to avoid overfitting. In addition, linear models are easy to update with new data via stochastic gradient descent.
- Disadvantages: Linear regression handles nonlinear relationships poorly, is not flexible enough to capture complex patterns, and adding the right interaction terms or polynomial features is tricky and time-consuming.
- Implementation:
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
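For a concrete sense of what this looks like, here is a minimal scikit-learn sketch of regularized linear regression; the synthetic data and the alpha values are placeholders, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic regression data: 500 samples, 10 features (placeholder example)
rng = np.random.RandomState(0)
X = rng.randn(500, 10)
y = X @ rng.randn(10) + 0.1 * rng.randn(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge uses an L2 penalty, Lasso an L1 penalty; alpha controls the strength
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

print("Ridge R^2:", ridge.score(X_test, y_test))
print("Lasso R^2:", lasso.score(X_test, y_test))
```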
1.2 Regression trees (ensembles)
Regression trees, a kind of decision tree, learn nonlinear relationships naturally by repeatedly splitting the data set into separate branches so as to maximize the information gain of each split.
Ensemble methods, such as random forests (RF) and gradient boosted trees (GBM), combine the predictions of many individually trained trees. We won't go into the underlying mechanics here, but in practice random forests usually perform well out of the box, while gradient boosted trees are harder to tune but tend to have a higher performance ceiling.
- Advantages: Decision trees can learn nonlinear relationships and are fairly robust to outliers. Tree ensembles perform very well in practice and regularly win classical (non-deep-learning) machine learning competitions.
- Disadvantages: Left unconstrained, a single tree is prone to overfitting, because it can keep splitting until it has memorized the training data. However, ensembles mitigate this weakness.
- Implementation: random forests
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
- Implementation: gradient boosted trees
Python – scikitlearn.org/stable/modu…
R – cran.r-project.org/web/package…
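Below is a minimal sketch of both ensemble regressors in scikit-learn, fit on a small synthetic nonlinear data set; the hyperparameters shown are illustrative defaults rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic nonlinear regression data (placeholder example)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.randn(500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: many deep trees averaged together; strong out of the box
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Gradient boosted trees: shallow trees added sequentially; more knobs to tune
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X_train, y_train)

print("RF  R^2:", rf.score(X_test, y_test))
print("GBM R^2:", gbm.score(X_test, y_test))
```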
1.3 Deep learning
Deep learning refers to multi-layer neural networks that can learn extremely complex patterns. They use hidden layers between the input layer and the output layer to model intermediate representations of the data, something other algorithms struggle to do.
Deep learning has several other important mechanisms, such as convolutions and dropout, that allow it to learn efficiently from high-dimensional data. However, compared with other algorithms, deep learning needs far more data for training, because the models have orders of magnitude more parameters to estimate.
- Advantages: Deep learning is the current state of the art in certain domains, such as computer vision and speech recognition. Deep neural networks perform very well on image, audio, and text data, and they are easy to update with new data via backpropagation. Their architectures (i.e., the number and structure of layers) can be adapted to many kinds of problems, and their hidden layers reduce the reliance on feature engineering.
- Disadvantages: Deep learning algorithms are usually not suitable as general-purpose algorithms because they require very large amounts of data. In fact, on classical machine learning problems they do not outperform tree ensembles. In addition, they are computationally intensive to train and require much more expertise to tune (i.e., to set the architecture and hyperparameters).
- Implementation:
Python – keras.io/
R – mxnet.io/
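As a rough illustration, here is a minimal regression network using the Keras API bundled with TensorFlow (an assumption; the article only links to keras.io). The layer sizes and training settings are arbitrary placeholders.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic regression data (placeholder): 500 samples, 10 features
rng = np.random.RandomState(0)
X = rng.randn(500, 10).astype("float32")
y = (X @ rng.randn(10) + 0.1 * rng.randn(500)).astype("float32")

# A small fully connected network with one linear output for a continuous target
model = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
```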
1.4 Honorable mention: nearest neighbors
Nearest neighbor algorithms are "instance-based", which means they must retain every training observation. They predict the value of a new sample by searching for the most similar training samples.
They are memory-intensive, perform poorly on high-dimensional data, and need a meaningful distance function to measure similarity. In practice, regularized regression or tree ensembles are almost always a better choice.
▌2. Classification
Classification is a supervised learning task for modeling and predicting categorical variables, with use cases including employee churn prediction, email filtering, and financial fraud detection.
As you will see, many regression algorithms have classification counterparts; classification algorithms simply predict a category (or its probability) rather than a numerical value.
Logistic regression
2.1 (Regularized) logistic regression
Logistic regression is the classification counterpart of linear regression, and its basic concepts derive from it. Logistic regression maps its predictions to the interval between 0 and 1 through the logistic function, so the predicted values can be interpreted as the probability of belonging to a class.
The model is still linear, so the algorithm works well only when the data are linearly separable (i.e., the classes can be completely separated by a single decision surface). Logistic regression can also be regularized by penalizing the model coefficients.
- Advantages: The outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models are easy to update with new data via stochastic gradient descent.
- Disadvantages: Logistic regression performs poorly when there are multiple or nonlinear decision boundaries.
- Implementation:
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
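A minimal scikit-learn sketch of regularized logistic regression follows; the breast-cancer toy data set and the C value (inverse regularization strength) are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2-regularized logistic regression; smaller C means a stronger penalty
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_train, y_train)

print("Class probabilities for 3 samples:\n", clf.predict_proba(X_test[:3]))
print("Accuracy:", clf.score(X_test, y_test))
```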
2.2 Classification trees (ensembles)
The classification counterpart of the regression tree is the classification tree. Together they are usually referred to as decision trees or, more rigorously, as "classification and regression trees" (CART), the name of the classic CART algorithm.
- Advantages: As with regression, classification tree ensembles perform very well in practice. They are robust to outliers and scalable, and thanks to their hierarchical structure they can naturally model nonlinear decision boundaries.
- Disadvantages: Unconstrained, a single tree is prone to overfitting, but ensembles reduce this effect.
- Implementation: random forests
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
- Implementation: gradient boosted trees
Python – scikitlearn.org/stable/modu…
R – cran.r-project.org/web/package…
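A minimal sketch of both classification ensembles in scikit-learn, again on a toy data set with illustrative hyperparameters:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest of 200 trees vs. 200 sequentially boosted shallow trees
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 random_state=0).fit(X_train, y_train)

print("RF  accuracy:", rf.score(X_test, y_test))
print("GBM accuracy:", gbm.score(X_test, y_test))
```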
2.3 Deep learning
Deep learning is also easily adapted to classification problems. In fact, classification is where deep learning is used most often, for example in image classification.
- Advantages: Deep learning is ideal for classifying audio, text, and image data.
- Disadvantages: As with regression, deep neural networks require very large amounts of training data, so they are not general-purpose algorithms.
- Implementation:
Python – keras.io/
R – mxnet.io/
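For completeness, here is the same kind of Keras sketch adapted to binary classification (sigmoid output and cross-entropy loss); the toy data set and layer sizes are placeholders.

```python
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X).astype("float32")

# Same style of network as the regression sketch, but with a sigmoid output
# so predictions can be read as class probabilities
model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=20, batch_size=32, verbose=0)
```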
2.4 Support vector machines
Support vector machines (SVMs) use a mechanism called the kernel trick, which essentially computes the distance between two observations, to turn a nonlinear problem into a linear one. The SVM algorithm looks for the decision boundary that maximizes the margin around the samples, which is why it is also called a large-margin classifier.
For example, an SVM with a linear kernel is similar to logistic regression, but more robust. In practice, therefore, SVMs are most useful when paired with nonlinear kernels to model nonlinear decision boundaries.
- Advantages: Support vector machines can model nonlinear decision boundaries, and there are many kernels to choose from. They are fairly robust against overfitting, especially in high-dimensional space.
- Disadvantages: SVMs are memory-intensive, require considerable skill to choose the right kernel, and do not scale well to large data sets. In industry today, random forests usually outperform support vector machines.
- Implementation:
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
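A minimal scikit-learn SVM sketch with an RBF kernel; scaling the features first matters because SVMs are distance-based, and the C and gamma values shown are placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel models a nonlinear boundary; C and gamma control
# regularization strength and kernel width respectively
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))
```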
2.5 Naive Bayes
Naive Bayes (NB) is a very simple algorithm based on conditional probabilities and counting. In essence, the model is a probability table that gets updated with the training data. To predict a new observation, it looks up the most likely class in that table based on the sample's feature values.
It is called "naive" because its core assumption of conditional feature independence (i.e., that every input feature is independent of the others) rarely holds in the real world.
- Advantages: Even though the conditional independence assumption rarely holds, naive Bayes performs surprisingly well in practice. It is easy to implement and can be updated as the data set grows.
- Disadvantages: Because it is so simple, naive Bayes is often beaten by the properly trained and tuned classification algorithms above.
- Implementation:
Python – scikit-learn.org/stable/modu…
R – cran.r-project.org/web/package…
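A minimal Gaussian naive Bayes sketch in scikit-learn (Gaussian NB is one of several NB variants; the toy data set is a placeholder):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian naive Bayes models each feature as an independent normal per class
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes accuracy:", nb.score(X_test, y_test))
```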
▌3. Clustering
Clustering is an unsupervised learning task that finds natural groupings of samples (clusters) based on the internal structure of the data. Use cases include customer segmentation, grouping similar items in e-commerce, and social network analysis.
Because clustering is unsupervised, there is no "correct answer" to output, and data visualization is often used to evaluate the results. If you do have a "correct answer", i.e., pre-labeled clusters in your training set, then a classification algorithm is more appropriate.
k-means
3.1 k-means
K-means is a general-purpose clustering algorithm based on the geometric distances between points. Because the clusters are organized around centroids, they tend to come out roughly spherical and of similar size.
We recommend this algorithm to beginners because it is simple, yet flexible enough to produce reasonable results for most problems.
- Advantages: K-means is the most popular clustering algorithm because it is fast and simple, and surprisingly flexible if you preprocess your data and engineer features effectively.
- Disadvantages: The algorithm requires you to specify the number of clusters, and the right K is not always easy to determine. In addition, if the true clusters in the data are not spherical, k-means will produce poor clusters.
- Implementation:
Python – scikit-learn.org/stable/modu…
R – stat.ethz.ch/R-manual/R-…
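A minimal k-means sketch in scikit-learn on synthetic blob data; the number of clusters is known here only because the data are synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 roughly spherical clusters (placeholder example)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
```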
3.2 Affinity propagation
Affinity propagation is a relatively new clustering algorithm that determines clusters based on graph distances between pairs of points. The resulting clusters tend to be smaller and of uneven size.
- Advantages: Affinity propagation does not require you to specify the number of clusters, but you do need to set hyperparameters such as the "sample preference" and "damping".
- Disadvantages: Its main drawbacks are slow training and heavy memory use, which make it hard to scale to large data sets. It also assumes that the true underlying clusters are roughly spherical.
- Implementation:
Python – http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
R – https://cran.r-project.org/web/packages/apcluster/index.html
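A minimal affinity propagation sketch in scikit-learn; the damping value is a placeholder, and the synthetic blobs are reused purely for illustration.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# "damping" (and optionally "preference") controls how many exemplars emerge
ap = AffinityPropagation(damping=0.9, random_state=0)
labels = ap.fit_predict(X)
print("Clusters found:", len(ap.cluster_centers_indices_))
```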
3.3 Hierarchical / agglomerative clustering
Hierarchical clustering, also known as agglomerative clustering, is a family of algorithms built on the same idea:
1) Start with each data point as its own cluster;
2) Merge clusters according to a chosen criterion;
3) Repeat until only one cluster remains, which leaves you with a hierarchy of clusters.
- Advantages: The main advantage of hierarchical clustering is that the clusters are no longer assumed to be spherical. It also scales well to large data sets.
- Disadvantages: As with k-means, you must choose the number of clusters, i.e., the level of the hierarchy to keep once the algorithm has finished.
- Implementation:
Python – http://scikitlearn.org/stable/modules/clustering.html#hierarchical-clustering
R – https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html
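A minimal agglomerative clustering sketch in scikit-learn, cutting the hierarchy at three clusters for illustration:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Bottom-up merging with Ward linkage; n_clusters picks the level to cut at
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```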
3.4 DBSCAN
DBSCAN is a density-based clustering algorithm that groups dense regions of points into clusters. A recent development is HDBSCAN, which allows clusters of varying density.
- Advantages: DBSCAN does not assume spherical clusters, and it scales well. In addition, it does not require every point to be assigned to a cluster, which reduces noise in the clusters.
- Disadvantages: You must tune the hyperparameters "epsilon" and "min_samples", which define the cluster density, and DBSCAN is quite sensitive to them.
- Implementation:
Python – http://scikit-learn.org/stable/modules/clustering.html#dbscan
R – https://cran.r-project.org/web/packages/dbscan/index.html
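A minimal DBSCAN sketch in scikit-learn; the eps and min_samples values are placeholders that would need tuning on real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# eps sets the neighborhood radius, min_samples the density threshold;
# points labeled -1 are treated as noise rather than forced into a cluster
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)
print("Noise points:", (labels == -1).sum())
```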
The curse of dimensionality
In machine learning, "dimensionality" usually refers to the number of features in a data set (i.e., the number of input variables).
When the number of features is very large relative to the number of observations, training an effective model becomes especially difficult for existing algorithms. This is known as the "curse of dimensionality", and it is a particular problem for clustering algorithms that rely on distance calculations.
One Quora user gave a great analogy for the curse of dimensionality:
Suppose you have a straight line 100 yards long and you drop a coin somewhere along it. Finding the coin isn't hard: you just walk along the line, and it takes a couple of minutes at most.
Now suppose you have a square 100 yards long and 100 yards wide and you drop the coin somewhere inside it. Finding it is no longer easy; it's like searching for a needle across two football fields placed side by side, and it could take days.
Now imagine a cube 100 yards long, wide, and high; that's like looking for a needle in a 30-story stadium...
As the number of dimensions grows, searching the space becomes harder and harder.
Source link:
www.quora.com/What-is-the…
This is where dimensionality reduction comes in, through two families of methods: feature selection and feature extraction.
▌4. Feature selection
Feature selection means filtering out irrelevant or redundant features from your data set. The key difference between feature selection and feature extraction is that feature selection keeps a subset of the original features, while feature extraction constructs entirely new features (one or more) from the original ones.
Note that some supervised machine learning algorithms already have built-in feature selection, such as regularized regression and random forests. In general, we recommend starting with those algorithms if they fit your problem, as discussed above.
As a standalone task, feature selection can be unsupervised (e.g., variance thresholds) or supervised (e.g., genetic algorithms). If necessary, you can also combine several methods in a sensible way.
4.1 Variance threshold
A variance threshold discards features whose values change little from one observation to another (i.e., whose variance falls below a set threshold). Such features provide little value.
For example, if you have public health data in which 96 percent of the observations are 35-year-old men, then removing the "age" and "gender" features will not lose much important information.
Because variance thresholds depend on the scale of the feature values, you should normalize the features first.
- Advantages: Applying a variance threshold rests on solid intuition: features whose values barely change carry little useful information. It is a relatively safe way to reduce dimensionality early in the modeling process.
- Disadvantages: If your problem does not require dimensionality reduction, variance thresholds add little value. In addition, you must set and tune the threshold manually, which is a fiddly process; we recommend starting with a conservative (i.e., low) threshold.
- Implementation:
Python – http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
R – https://www.rdocumentation.org/packages/caret/versions/6.0-76/topics/nearZeroVar
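A minimal variance-threshold sketch in scikit-learn; the constant column and the 0.01 threshold are placeholders for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
X[:, 0] = 1.0  # a constant feature with zero variance (placeholder example)

# Scale first so the threshold is comparable across features, then filter
X_scaled = MinMaxScaler().fit_transform(X)
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_scaled)
print("Kept features:", selector.get_support())
```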
4.2 Correlation threshold
A correlation threshold removes features that are highly correlated with others (i.e., whose values change in nearly the same way as another feature's). Such features provide redundant information.
For example, if you have real estate data with the two features "house size in square feet" and "house size in square meters", you can safely drop one of them without hurting your model.
The question is which one to drop. First, compute the correlation coefficients of all feature pairs. Then, for every pair whose correlation exceeds your threshold, remove the feature with the higher mean absolute correlation with the remaining features.
- Advantages: Correlation thresholds also rest on solid intuition: similar features provide redundant information. Some algorithms are not robust to strongly correlated features, so removing them can improve the overall model (training speed, accuracy, robustness, and so on).
- Disadvantages: Again, you must set and tune the correlation threshold manually, which is a tricky process, and if you set it too low, you throw away useful information. In any case, we tend to prefer algorithms with built-in feature selection; for algorithms without it, principal component analysis is a good alternative.
- Implementation:
Python – https://gist.github.com/Swarchal/881976176aaeb21e8e8df486903e99d6
R – https://www.rdocumentation.org/packages/caret/versions/6.0-73/topics/findCorrelation
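Here is an independent, simplified sketch using pandas. Note that it drops the later feature of each highly correlated pair rather than applying the mean-absolute-correlation rule described above, and the helper name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one feature from each pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Placeholder data: 'sqm' is just a rescaled copy of 'sqft'
df = pd.DataFrame({"sqft": np.arange(100.0),
                   "sqm": np.arange(100.0) * 0.0929,
                   "price": np.random.RandomState(0).randn(100)})
print(drop_correlated(df).columns.tolist())  # 'sqm' is removed
```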
4.3 Genetic Algorithm
Genetic algorithms (GA) are a broad class of algorithms that can be applied to many tasks. Inspired by evolutionary biology and natural selection, they combine mutation and crossover to search the solution space efficiently. Here is an excellent introduction: "Introduction of principles behind Genetic Algorithms".
In machine learning, genetic algorithms have two main uses.
The first is optimization, such as finding the best weights for a neural network.
The second is supervised feature selection. In this use case, a "gene" represents an individual feature and an "organism" represents a candidate set of features. Each organism in the population is scored by its fitness, just as model performance is measured on a hold-out test set. The fittest organisms survive and reproduce, and the process iterates until the population converges on a good solution.
- Advantages: Genetic algorithms can be effective for searching very high-dimensional feature spaces where an exhaustive search is infeasible. When your algorithm needs preprocessed data but has no built-in feature selection (such as a nearest neighbor classifier) and you must keep the original features (i.e., principal component analysis is not an option), genetic algorithms are your best choice. This situation often arises in business settings that require transparent, interpretable solutions.
- Disadvantages: Genetic algorithms add considerable complexity to your implementation, and in most cases they are unnecessary. When possible, principal component analysis or an algorithm with built-in feature selection will be faster and simpler.
- Implementation:
Python – https://pypi.python.org/pypi/deap
R – https://cran.r-project.org/web/packages/GA/vignettes/GA.html
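A rough sketch of GA-based feature selection with the DEAP library linked above, assuming a scikit-learn estimator as the fitness function; the population size, number of generations, and probabilities are arbitrary placeholders.

```python
import random
from deap import base, creator, tools, algorithms
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

# Each individual is a 0/1 mask over the features; fitness = CV accuracy
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

def evaluate(mask):
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:  # an empty feature set gets the worst possible score
        return (0.0,)
    score = cross_val_score(LogisticRegression(max_iter=1000),
                            X[:, cols], y, cv=3).mean()
    return (score,)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.attr_bool, n=n_features)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)        # crossover
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)  # mutation
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=20)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=5, verbose=False)
best = tools.selBest(pop, k=1)[0]
print("Selected features:", [i for i, bit in enumerate(best) if bit])
```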
4.4 Honorable mention: stepwise search
Stepwise search is a supervised feature selection method based on sequential search, and it comes in two flavors: forward and backward.
In forward stepwise search, you start with no features. Then you train one model for each candidate feature and keep the feature that gives the best performance. You keep adding features to the model one at a time until performance stops improving.
Backward stepwise search is the same process in reverse: start by training the model with all features, then remove one feature at a time until performance drops sharply.
We mention this algorithm purely for historical reasons. Although many textbooks still present stepwise search as a valid option, it almost always underperforms other supervised methods such as regularization. Its most fatal flaw is that it is a greedy algorithm that cannot account for the future effects of each choice. We do not recommend it.
▌5. Feature extraction
Feature extraction creates a new, smaller set of features that still captures most of the useful information. To repeat the distinction: feature selection keeps a subset of the original features, while feature extraction creates brand-new ones.
As with feature selection, some algorithms already have feature extraction built in. The best example is deep learning, where each hidden layer extracts increasingly useful representations of the raw data, as covered in the "Deep learning" sections above.
As a standalone task, feature extraction can be unsupervised (e.g., principal component analysis) or supervised (e.g., linear discriminant analysis).
5.1 Principal component analysis
Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features. The new features are orthogonal to each other, i.e., uncorrelated, and they are ranked by the amount of variance they explain: the first principal component captures the most variance in your data set, the second principal component captures the second most, and so on.
You can therefore reduce dimensionality by limiting the number of principal components you keep, for example keeping only as many components as needed to reach 90 percent cumulative explained variance.
You should normalize the data before applying PCA. Otherwise, the features with the largest scales in the original data will dominate the new principal components.
- Advantages: PCA is a versatile technique that works well in practice. It is fast and simple to implement, which means you can easily test your algorithm's performance with and without PCA. In addition, PCA has several variants and extensions (kernel PCA, sparse PCA, and so on) for specific problems.
- Disadvantages: The new principal components are not interpretable, so in some settings it is hard to connect them to the actual application. In addition, you still need to manually set or tune the cumulative explained-variance threshold.
- Implementation:
Python – http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
R – https://stat.ethz.ch/R-manual/R-devel/library/stats/html/prcomp.html
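A minimal PCA sketch in scikit-learn that keeps just enough components to reach 90 percent cumulative explained variance:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # normalize first

# A float n_components keeps as many components as needed for that variance share
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)
print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", pca.explained_variance_ratio_.sum())
```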
5.2 Linear discriminant analysis
Linear discriminant analysis (LDA) is not latent Dirichlet allocation. Like PCA, it constructs linear combinations of the original features, but unlike PCA it does not maximize explained variance; instead, it maximizes the separability between classes.
LDA is therefore a supervised method that requires labeled data. So which is better, LDA or PCA? It depends on the situation; the "no free lunch" principle applies here as well.
LDA also depends on the scale of the feature values, so, again, you should normalize the features first.
- Advantages: LDA is supervised, so the features it produces can (but do not always) improve model performance. There are also variants, such as quadratic discriminant analysis, for specific problems.
- Disadvantages: As with PCA, the new features are not interpretable, and you still need to manually set or tune the number of components to keep. LDA also requires labeled data, which makes it more situational.
- Implementation:
Python – http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
R – https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/lda.html
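A minimal LDA sketch in scikit-learn on the iris toy data set (with three classes, LDA can produce at most two components):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Supervised: the class labels are required to find the discriminant directions
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
print("Explained variance ratio:", lda.explained_variance_ratio_)
```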
5.3 Autoencoders
An autoencoder is an artificial neural network trained to reconstruct its original input. An image autoencoder, for example, is trained to re-represent the raw images rather than to distinguish cats from dogs.
So how is that useful? The key is to give the hidden layer fewer neurons than the input and output layers. Squeezed through this bottleneck, the hidden layer has to learn how to represent the original image with fewer features.
Because the input itself is used as the target output, autoencoders are considered unsupervised. They can be used directly (e.g., for image compression) or stacked sequentially (e.g., in deep learning).
- Advantages: Autoencoders are artificial neural networks, which means they perform very well on certain types of data, such as image and audio data.
- Disadvantages: Autoencoders are artificial neural networks, which also means they need a lot of data to train. They are not a general-purpose dimensionality reduction algorithm.
- Implementation:
Python – https://keras.io/
R – http://mxnet.io/api/r/index.html
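A minimal Keras sketch of a single-bottleneck autoencoder, again assuming the TensorFlow-bundled Keras API; the input and code sizes are placeholders, and the training call is left as a comment since no data set is specified here.

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32  # e.g. flattened 28x28 images; sizes are placeholders

# Encoder compresses the input into a small code; decoder reconstructs it
inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu")(inputs)
outputs = layers.Dense(input_dim, activation="sigmoid")(code)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Train with the inputs as their own targets, e.g.:
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=128)
```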
Summary
Based on our experience, here are some helpful tips:
- Practice, practice, practice. Find some data sets and strike while the iron is hot. Knowing these algorithms is only a starting point; mastering them takes practice.
- Master the fundamentals. These algorithms give you a solid foundation for applied machine learning, because most other algorithms are variations on the ones above. For example, it is more useful to first understand the difference between principal component analysis and linear discriminant analysis than the finer distinction between linear and quadratic discriminant analysis.
- Remember the iron law: good data beats fancy algorithms. In applied machine learning, algorithms are always interchangeable, while effective exploratory analysis, data cleaning, and feature engineering reliably improve your results. We will keep repeating this, because it is the plain truth!
Original links:
elitedatascience.com/machine-lea…
elitedatascience.com/dimensional…