As machine learning becomes ever more popular, the number of algorithms that can do a given task well keeps growing. But you cannot know in advance which algorithm will be best suited to your problem. If you have enough time, you can simply try them all and pick the best one. This article describes how to rely on established methods, namely model selection and hyperparameter tuning, to guide you toward a better choice of algorithm. Michael Beyeler is a postdoctoral fellow in data science at the eScience Institute and the Institute for Neuroengineering at the University of Washington.
Step 0: Understand the basics
Before we dive in, let’s go over the basics. Specifically, there are three main categories of machine learning: supervised learning, unsupervised learning and reinforcement learning.
- In supervised learning, each data point is labeled, either with a category or with a numerical value. An example of a category label: classifying an image as a "cat" or a "dog". An example of a numerical label: the sale price of a used car. The goal of supervised learning is to study many labeled examples like these and then make predictions about new data, for example identifying the animal in a new photo (classification) or predicting the sale price of another used car (regression).
- In unsupervised learning, data points have no associated labels. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its internal structure. This can mean clustering the data into groups, or finding simpler representations of otherwise complex data.
- In reinforcement learning, the algorithm chooses an action in response to each data point. This technique is common in robotics, where the sensor readings at one point in time form a data point and the algorithm must decide what the robot should do next; it is also a natural fit for Internet of Things applications. The learning algorithm then receives a reward signal indicating how good the decision was, and it modifies its strategy in order to maximize the total reward.
Step 1: Categorize the problem
Next, we categorize the problem. This involves two steps:
- Categorize by input: If the data is labeled, it is a supervised learning problem. If the data is unlabeled and we want to find its internal structure, it is an unsupervised learning problem. If we want to optimize an objective function by interacting with an environment, it is a reinforcement learning problem.
- Categorize by output: If the output of the model is a number, it is a regression problem. If the output is a class, it is a classification problem. If the output is a set of groups of the input data, it is a clustering problem.
It’s that simple!
More generally, we can ask ourselves what our algorithm is trying to accomplish to find the right class of algorithms.
The description above includes a few technical terms that we haven’t mentioned yet:
- Classification: When the data is used to predict a category, supervised learning is also called classification. Assigning an image the label "cat" or "dog" is a case of two-class, or binomial, classification. When there are more categories, as in predicting who will win the next Nobel Prize in physics, the problem is known as multi-class classification.
- Regression: When a numerical value is being predicted, such as a stock price, supervised learning is also called regression.
- Clustering: Clustering, or cluster analysis, is one of the most common approaches in unsupervised learning. The data is grouped so that objects in the same group are more similar to each other than to objects in different groups.
- Anomaly detection: Sometimes the goal is to find data points that are out of the ordinary. In fraud detection, for example, any highly unusual credit card purchase is suspect. The possible variations of fraud are so numerous, and the training examples so few, that it is not feasible to fully learn what fraudulent activity looks like. The approach anomaly detection takes is to learn what normal behavior looks like (using historical data from non-fraudulent transactions) and to flag anything that is significantly different.
Step 2: Find available algorithms
Now that we have classified the problem, we can use the tools at our disposal to identify appropriate and useful algorithms.
Microsoft Azure has created a handy cheat sheet that shows which algorithms can be used for which class of problem. Although it is tailored to Azure's software, the advice is generally applicable (a PDF version is available at suo.im/3Ss2zW):
Some notable algorithms are described below; short scikit-learn sketches illustrating each category follow the list:
- Classification:
  - Support vector machines (SVM) find the boundary that separates the classes as widely as possible. When the two classes cannot be cleanly separated, the algorithm finds the best boundary it can. Where it really shines is on feature-intensive data such as text or genomic data (feature count > 100). In these cases, SVMs often classify faster and with less overfitting than most other algorithms, while requiring only a modest amount of memory.
  - Artificial neural networks (ANNs) are brain-inspired learning algorithms that cover binary classification, multi-class classification, and regression problems. They come in an infinite variety, including perceptrons and deep learning. They take a long time to train, but are known to achieve state-of-the-art performance in a wide range of applications.
  - Logistic regression: Despite having "regression" in its name, logistic regression is actually a powerful tool for two-class and multi-class classification problems. It is fast and simple. Because it uses an S-shaped curve rather than a straight line, it is a natural fit for dividing data into groups. Logistic regression yields linear classification boundaries, so if you use it, make sure you can live with a linear approximation.
  - Decision trees and random forests: Decision forests (regression, two-class, and multi-class), decision jungles (two-class and multi-class), and boosted decision trees (regression and two-class) are all based on decision trees, a fundamental machine learning concept. There are many variants of decision trees, but they all do the same thing: subdivide the feature space into regions with mostly the same label. These can be regions of a consistent category or of a constant value, depending on whether you are doing classification or regression.
- Regression:
  - Linear regression fits a line (or plane, or hyperplane) to the data set. It is the workhorse of regression: simple and fast, but for some problems it can be overly simplistic.
  - Bayesian linear regression has a very desirable property: it largely avoids overfitting. It does so by making some assumptions beforehand about the likely distribution of the answers. A by-product of this approach is that it has very few free parameters.
  - Boosted decision tree regression: As described above, boosted decision trees (regression and two-class classification) are based on decision trees and work by subdividing the feature space into regions with mostly the same label. Boosted decision trees avoid overfitting by limiting how many times they may subdivide and how few data points are allowed in each region. The algorithm builds a sequence of trees in which each tree learns to compensate for the error left by the previous one. The result is a very accurate learner that tends to use a lot of memory.
- Clustering:
  - Hierarchical clustering aims to build a hierarchy of clusters, and it comes in two flavors. Agglomerative clustering is a "bottom-up" approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Divisive clustering is a "top-down" approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined greedily. The results of hierarchical clustering are usually presented in a dendrogram.
  - K-means clustering aims to partition n observations into k clusters, with each observation belonging to the cluster with the nearest mean, which serves as a prototype for the cluster. This partitions the data space into Voronoi cells.
- Anomaly detection:
  - K-nearest neighbors (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. In k-NN classification, the output is a class membership: an object is assigned by majority vote to the class most common among its k nearest neighbors (k is a positive integer and is typically small). In k-NN regression, the output is a property value for the object, namely the average of the values of its k nearest neighbors.
  - One-class SVM: Using a clever extension of nonlinear support vector machines, a one-class SVM can draw a tight boundary around the entire data set. Any new data point that falls far outside that boundary is anomalous and worth a closer look.
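To make the classification entries concrete, here is a minimal sketch using scikit-learn (the synthetic data and parameters are placeholders, not recommendations from the cheat sheet) that fits three of the classifiers above and predicts a few points:

```python
# Minimal sketch: three of the classifiers above on a synthetic dataset.
# The dataset and all parameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

classifiers = [
    LogisticRegression(max_iter=1000),       # linear decision boundary
    SVC(kernel="rbf"),                       # non-linear boundary via a kernel
    RandomForestClassifier(random_state=0),  # ensemble of decision trees
]
for clf in classifiers:
    clf.fit(X, y)
    print(type(clf).__name__, "predicts", clf.predict(X[:3]))
```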
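For the regression category, a similar sketch: ordinary linear regression, Bayesian ridge regression, and gradient-boosted trees, compared on synthetic data with a cross-validated R² score (again, the data and settings are assumptions for illustration):

```python
# Minimal sketch: the regression approaches above on synthetic data,
# compared with 5-fold cross-validated R^2 (placeholder data and parameters).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for name, reg in [
    ("linear regression", LinearRegression()),
    ("Bayesian ridge", BayesianRidge()),
    ("boosted decision trees", GradientBoostingRegressor(random_state=0)),
]:
    scores = cross_val_score(reg, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```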
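For the clustering category, a sketch of k-means and agglomerative (hierarchical) clustering on toy blobs; note that the number of clusters is assumed to be known here, which is rarely the case in practice:

```python
# Minimal sketch: k-means and agglomerative (hierarchical) clustering on toy blobs.
# n_clusters=3 is an assumption that matches how the toy data was generated.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agglom_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("k-means:      ", kmeans_labels[:10])
print("agglomerative:", agglom_labels[:10])
```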
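And for anomaly detection, a sketch of the one-class SVM idea: train only on "normal" points, then flag new points that fall outside the learned boundary (the nu value and the toy data are assumptions chosen for illustration):

```python
# Minimal sketch: one-class SVM trained on "normal" points only; predict() then
# returns -1 for points it considers outliers and +1 for points that look normal.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
normal = 0.5 * rng.randn(200, 2)                  # training data: normal behaviour only
new_points = np.array([[0.1, -0.2], [4.0, 4.0]])  # one typical point, one outlier

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(normal)
print(detector.predict(new_points))               # expected: [ 1 -1 ]
```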
Step 3: Implement all applicable algorithms
For any given problem, there are usually multiple candidate algorithms that can do the job. So how do we know which one to choose? Often, the answer to this question is not simple, so we have to experiment.
Prototyping is best done in two steps. In the first step, we want to implement some algorithms quickly and crudely with minimal feature engineering. At this stage, our main goal is to get a rough idea of which algorithm performs better. This step is a bit like hiring: we look for reasons to shorten our list of candidate algorithms wherever possible.
Once we've narrowed the list down to a few candidate algorithms, the real prototyping begins. Ideally, we would set up a machine learning pipeline that compares the performance of each algorithm on the data set using a carefully selected set of evaluation criteria. At this stage, we are only dealing with a handful of algorithms, so we can turn our attention to where the real magic lies: feature engineering.
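As a rough illustration of such a comparison (the dataset, the candidate models, and the F1 metric are assumptions for the sketch, not prescriptions), every candidate gets the same preprocessing and the same cross-validated criterion, so the scores are directly comparable:

```python
# Minimal sketch: compare several candidate algorithms with one shared pipeline
# and one shared evaluation criterion (5-fold cross-validated F1 score).
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

candidates = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(),
    RandomForestClassifier(random_state=0),
]
for model in candidates:
    pipe = make_pipeline(StandardScaler(), model)  # identical preprocessing for all
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{type(model).__name__}: F1 = {scores.mean():.3f}")
```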
Step 4: Feature engineering
Perhaps even more important than choosing the algorithm is choosing the right features to represent the data. Selecting a suitable algorithm from the list above is relatively straightforward, whereas feature engineering is more of an art form.
The main problem is that the raw data we are trying to classify is often not very descriptive in feature space. For example, using the gray value of every pixel to predict what is in an image is usually a poor choice. Instead, we need to find data transformations that improve the signal-to-noise ratio; without such transformations, the task may be effectively unsolvable. For example, before the advent of the histogram of oriented gradients (HOG), complex visual tasks such as pedestrian detection or face detection were very difficult to accomplish.
Although the validity of most features needs to be evaluated experimentally, it is helpful to know the common methods for selecting data features. Here are a few good ones:
- Principal component analysis (PCA): A linear dimensionality reduction method that finds the principal components carrying the most information, i.e. the directions that explain most of the variance in the data (see the sketch after this list).
- Scale-invariant feature transform (SIFT): A patented computer vision algorithm for detecting and describing local features in images. It has an open-source alternative, ORB (Oriented FAST and Rotated BRIEF).
- Speeded-up robust features (SURF): A more robust version of SIFT, also patented.
- Histogram of oriented gradients (HOG): A feature description method used in computer vision that counts occurrences of gradient orientations in localized portions of an image.
- For more algorithms, see: Visual descriptor.
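As a small example of the first item, here is a sketch of PCA used as a feature transform with scikit-learn; keeping enough components to explain about 95% of the variance is an illustrative choice, not a rule:

```python
# Minimal sketch: PCA as a dimensionality-reducing feature transform.
# n_components=0.95 asks scikit-learn to keep enough components to explain
# roughly 95% of the variance (an illustrative threshold).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64 raw pixel intensities per image
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained by the first 5 components:",
      pca.explained_variance_ratio_[:5].round(3))
```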
Of course, you can also come up with your own feature descriptors. If you have several candidate methods, you can use a wrapper method to perform intelligent feature selection.
- Forward search:
  - Start with no features at all.
  - Then greedily add one feature at a time: for each remaining candidate feature, add it to the currently selected set, train the model, and compute its cross-validation error; keep the candidate that yields the lowest cross-validation error and place it among the selected features.
  - Repeat until the desired number of features is reached!
- Reverse search:
  - Start with all the features.
  - Then greedily remove one feature at a time: for each remaining candidate feature, tentatively remove it, train the model, and compute its cross-validation error; permanently remove the feature whose removal yields the lowest cross-validation error.
  - Repeat until the desired number of features is reached!
In both cases, the cross-validation error is the criterion used to decide which features to add or remove!
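A minimal sketch of this wrapper approach, assuming scikit-learn's SequentialFeatureSelector (the estimator, the dataset, and the target of 10 features are illustrative choices); direction="forward" corresponds to forward search and direction="backward" to reverse search, both scored by cross-validation:

```python
# Minimal sketch: cross-validated forward feature search with scikit-learn.
# Switch direction="backward" to get the reverse-search variant.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),   # the model being wrapped
    n_features_to_select=10,             # stop once 10 features are chosen
    direction="forward",                 # add features one at a time
    cv=5,                                # cross-validation as the criterion
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```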
Step 5: Hyperparameter optimization
Finally, you may want to optimize the algorithm's hyperparameters, for example the number of principal components in a principal component analysis, the parameter k of the k-nearest neighbors algorithm, or the number of layers and the learning rate in a neural network. The best way to do this is with cross-validation.
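For example, a cross-validated grid search over the k of a k-nearest-neighbors classifier might look like the sketch below (the dataset and the grid values are placeholders, not recommendations):

```python
# Minimal sketch: cross-validated hyperparameter search with GridSearchCV.
# The grid values are illustrative, not tuned recommendations.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_neighbors": [3, 5, 7, 11], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```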
Once you apply all of the above, you have a good chance of creating powerful machine learning systems. But, as you might have guessed, the devil is in the details, and you may have to experiment a lot to get there.
From Ask a Swiss