Machine learning and deep learning algorithm flow

  • I can’t really tell the difference between machine learning and deep learning; it feels like everything is deep learning now.
  • I heard that a senior colleague has been tuning parameters for 10 months and is about to release the T9 Thunderbolt model with 200 billion parameters. I want to tune parameters too, ship a T10, and take the Best Paper award.

At present, traditional machine learning no longer accounts for a large share of research papers, and some people even mock deep learning as systems engineering with no mathematical value.

However, there is no denying that deep learning is extremely easy to use. It greatly simplifies the overall analysis and learning process of traditional machine learning algorithms and, more importantly, it has refreshed the precision and accuracy records of traditional machine learning algorithms in several general domains.

Deep learning has been very hot in recent years, much like big data five years ago. But deep learning is still mainly a branch of machine learning, so in this article we will compare the algorithm flows of machine learning and deep learning.

1. Algorithm flow of machine learning

Machine learning is essentially the science of studying data (which sounds a little dry). Its workflow mainly consists of: 1) dataset preparation, 2) exploratory data analysis, 3) data preprocessing, 4) data segmentation, 5) machine learning algorithm modeling, and 6) machine learning task selection; finally, the trained algorithm is evaluated and applied to real data.

1.1 The Dataset

Datasets are the starting point for building a machine learning model. In simple terms, a dataset is essentially an M × N matrix, where M is the number of rows (samples) and N is the number of columns (features).

The columns can be split into X and Y: X refers to the features, also called independent variables or input variables, while Y refers to the class labels, also called dependent variables or output variables.
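As a minimal sketch, assuming the data lives in a CSV file with a column named label (both the file name and the column name are made up for illustration), the X/Y split can look like this in pandas:

```python
import pandas as pd

# Load an M x N table: rows are samples, columns are features plus a label.
df = pd.read_csv("data.csv")    # hypothetical file

X = df.drop(columns=["label"])  # feature matrix (independent variables)
y = df["label"]                 # target vector (dependent variable / class label)

print(X.shape, y.shape)
```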

1.2 Data Analysis

Exploratory Data Analysis (EDA) is conducted to gain a preliminary understanding of the data. The main work of EDA is cleaning the data, describing it (descriptive statistics, charts), examining its distribution, comparing relationships between variables, building intuition about the data, summarizing it, and so on.

Put simply, exploratory data analysis is about understanding the data: analyzing it and figuring out how it is distributed. It focuses on the true distribution of the data and emphasizes visualization, so that analysts can see the patterns hidden in the data at a glance, draw inspiration from them, and find models that suit the data.

One of the first things I do in a typical machine learning workflow or data science project is to get a better feel for the data by "staring at the data." The three main EDA methods I commonly use are listed below, with a short sketch after the list:

  • Descriptive statistics: mean, median, mode, standard deviation.

  • Data visualization: heat maps (identifying correlations among features), box plots (visualizing group differences), scatter plots (visualizing correlations between features), principal component analysis (visualizing the cluster structure of the dataset), etc.

  • Data reshaping: pivoting, grouping, filtering, etc.
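A minimal EDA sketch along these lines, reusing the X and y frames from above (picking the first feature column is just an illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics: mean, std, quartiles (median) for every feature.
print(X.describe())

# Heat map of pairwise feature correlations.
sns.heatmap(X.corr(), cmap="coolwarm")
plt.show()

# Box plot of one feature grouped by class label (group differences).
sns.boxplot(x=y, y=X.iloc[:, 0])
plt.show()

# Data reshaping: group the samples by label and summarize each feature.
print(X.groupby(y).mean())
```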

1.3 Data Preprocessing

Data preprocessing is, in practice, data cleaning, data wrangling, or general data handling: the various steps of checking and correcting the data, such as filling in missing values, fixing spelling errors, normalizing/standardizing values so that they are comparable, transforming the data (e.g., log transformation), and so on.

For example, resizing all images to a uniform size or resolution.
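A minimal preprocessing sketch with scikit-learn, assuming purely numeric features with some missing values (choosing the first column for the log transform is arbitrary and assumes non-negative values):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Fill missing values with the per-column median.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Log-transform a (hypothetically) heavily skewed, non-negative column.
X_imputed[:, 0] = np.log1p(X_imputed[:, 0])

# Standardize all features to zero mean and unit variance so they are comparable.
X_prep = StandardScaler().fit_transform(X_imputed)
```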

The quality of the data has a great influence on the quality of the machine learning model. Therefore, to obtain the best possible model, a large part of the traditional machine learning workflow is actually spent on data analysis and processing.

Typically, data preprocessing can easily take up 80% of the time of a machine learning project, with the actual modeling phase and subsequent model analysis accounting for perhaps only the remaining 20%.

1.4 Data Segmentation

Training set & Test set

In the machine learning model development process, it is expected that the trained model will perform well on new, unseen data. In order to simulate new and unseen data, the available data is segmented into two parts: training set and test set.

The first is a larger subset used as the training set (e.g., 80% of the original data); the second is usually a smaller subset used as the test set (the remaining 20% of the data).

Next, a prediction model is built from the training set, and this trained model is then applied to the test set (that is, as new, unseen data) for prediction. The optimal model is selected according to the performance of the model on the test set. In order to obtain the optimal model, hyperparameter optimization can also be carried out.
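A minimal sketch of this split with scikit-learn, continuing from the preprocessed X_prep above (the 80/20 ratio follows the text; the random forest is just an illustrative model choice):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_prep, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)                            # build the model on the training set
print("test accuracy:", model.score(X_test, y_test))   # evaluate on unseen data
```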

Training set & Validation set & Test set

Another common data segmentation method is to split the data into 3 parts: 1) training set, 2) validation set and 3) test set.

The training set is used to build the prediction model, which is then evaluated on the validation set. Based on these predictions, the model can be tuned (e.g., via hyperparameter optimization), and the best-performing model is selected according to the validation results.

The validation set is handled in much the same way as the training set. It is worth noting, however, that it does not directly participate in building the model: it is a sample set held out during training, used to adjust the model's hyperparameters and to give a preliminary assessment of the model's capability. In general, training is accompanied by validation, which here means checking the model's interim performance on the validation set. The test set, by contrast, is kept entirely out of model building and preparation.

Cross validation

In fact, data is the most valuable part of the machine learning process. To make more economical use of the available data, N-fold cross-validation is often used: the dataset is split into N folds, one fold is held out as test data, and the remaining folds are used as training data to build the model. By rotating through the folds and repeating this procedure, the machine learning process is validated on every part of the data.

This cross-validation method is widely used in machine learning processes, but rarely used in deep learning.
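A minimal N-fold cross-validation sketch with scikit-learn (5 folds is an arbitrary choice):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold is held out once as the test data.
scores = cross_val_score(RandomForestClassifier(random_state=42), X_prep, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```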

1.5 Machine learning algorithm modeling

Now comes the most interesting part. Data filtering and processing are tedious, but now you can model with the carefully prepared data. Depending on the data type of the target variable (often called the Y variable), you build either a classification model or a regression model.

Machine learning algorithm

Machine learning algorithms broadly fall into one of three types:

  • Supervised learning: a machine learning task that learns the mathematical (mapping) relationship between input variables X and output variables Y. Such (X, Y) pairs form the labeled data used to model how to predict the output from the input.
  • Unsupervised learning: a machine learning task that uses only the input variables X. The X variables are unlabeled data, and the learning algorithm exploits the inherent structure of the data when modeling.
  • Reinforcement learning: a machine learning task that decides on a course of action, learning by trial and error in an effort to maximize reward.

Parameter tuning

This is the main job of the legendary "parameter-tuning knights." Hyperparameters are essentially the tunable settings of a machine learning algorithm, and they directly affect both the learning process and the prediction performance. Since no universal hyperparameter setting works for all datasets, hyperparameter optimization is required.

Take random forests as an example. When using randomForest (or scikit-learn's RandomForestClassifier), two hyperparameters are usually optimized: mtry and ntree. mtry (max_features in scikit-learn) is the number of variables randomly sampled as split candidates at each split, while ntree (n_estimators) is the number of trees to grow.
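A minimal sketch of tuning these two hyperparameters with a grid search in scikit-learn (the candidate values are arbitrary):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid over n_estimators (ntree) and max_features (mtry); values are illustrative.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```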

Another machine learning algorithm that was still very mainstream ten years ago is the support vector machine (SVM). With the radial basis function (RBF) kernel, the hyperparameters to optimize are the C parameter and the gamma parameter: C is a penalty term that limits overfitting, while gamma controls the width of the RBF kernel.
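An analogous sketch for the SVM's C and gamma (again, the grid values are arbitrary):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Log-spaced grids are a common choice for C and gamma of the RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```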

Tuning usually aims at finding a better set of hyperparameter values; most of the time there is no need to chase the single optimal value, and "finding the optimum" is really only said half in jest.

Feature selection

Feature selection is, literally, the process of selecting a subset of features from an initially large set of features. Besides building high-accuracy models, one of the most important aspects of machine learning is gaining actionable insights, and for that it is important to be able to pick out the important subset of features from the many available.

Feature selection is a research field in its own right, with a great deal of effort devoted to designing novel algorithms and methods. Among the many feature selection algorithms available, some classical methods are based on simulated annealing and genetic algorithms; there are also many methods based on evolutionary algorithms (e.g., particle swarm optimization, ant colony optimization) and stochastic methods (e.g., Monte Carlo).
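Those methods can be elaborate; as a much simpler baseline, here is a minimal sketch of model-based feature selection in scikit-learn, using a random forest's feature importances (a substitute illustration, not one of the algorithms named above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep only the features whose importance exceeds the mean importance.
selector = SelectFromModel(RandomForestClassifier(random_state=42))
X_selected = selector.fit_transform(X_train, y_train)
print("kept", X_selected.shape[1], "of", X_train.shape[1], "features")
```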

1.6 Machine learning tasks

Two common machine learning tasks in supervised learning include classification and regression.

Classification

A trained classification model takes a set of variables as input and predicts the class label as output. Imagine, for example, three classes represented by different colors and labels, where each small colored sphere is one data sample. To visualize the three classes of samples in two dimensions, you can run a PCA and plot the first two principal components (PCs), or simply make a scatter plot of two chosen variables.
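A minimal sketch of that PCA-based visualization (assuming numeric class labels for the color mapping):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the feature matrix onto its first two principal components.
pcs = PCA(n_components=2).fit_transform(X_prep)

plt.scatter(pcs[:, 0], pcs[:, 1], c=y, cmap="viridis", s=15)  # color = class label
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```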

Performance indicators

How do you know whether a trained classification model performs well or badly? Common metrics for evaluating classification performance include accuracy (AC), sensitivity (SN), specificity (SP), and the Matthews correlation coefficient (MCC).
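A minimal sketch of computing these metrics with scikit-learn, assuming a binary classification problem so that sensitivity and specificity can be read off the confusion matrix:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # binary case only
print("accuracy   :", accuracy_score(y_test, y_pred))
print("sensitivity:", tp / (tp + fn))   # true positive rate
print("specificity:", tn / (tn + fp))   # true negative rate
print("MCC        :", matthews_corrcoef(y_test, y_pred))
```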

Regression

The simplest regression model is summarized by the equation Y = f(X), where Y is the quantitative output variable, X the input variables, and f the mapping function (learned by the machine learning model) that computes the output value from the input features. The essence of this formula is that if you know X, you can derive Y. Once Y has been calculated (predicted), a popular visualization is a simple scatter plot of the actual versus the predicted values.

Evaluating a regression model means assessing how accurately the fitted model can predict values for the input data. A common metric for regression performance is the coefficient of determination (R²); mean squared error (MSE) and root mean squared error (RMSE) are also common measures of the residuals, i.e., the prediction errors.
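A minimal regression sketch, assuming a continuous target already split into y_train_reg and y_test_reg (made-up names; the linear model is just an illustrative choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

reg = LinearRegression().fit(X_train, y_train_reg)   # hypothetical continuous target
y_pred = reg.predict(X_test)

print("R^2 :", r2_score(y_test_reg, y_pred))
mse = mean_squared_error(y_test_reg, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))

# Actual vs. predicted scatter plot.
plt.scatter(y_test_reg, y_pred, s=15)
plt.xlabel("actual Y")
plt.ylabel("predicted Y")
plt.show()
```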

2. Deep learning algorithm flow

Deep learning is really a paradigm within machine learning, so the overall flow is largely the same. Deep learning streamlines data analysis, shortens the modeling process, and unifies the formerly separate machine learning algorithms under neural networks.

Before deep learning was used at scale, the machine learning workflow spent a great deal of time collecting data, then filtering it, trying various feature-extraction methods, or combining different kinds of features to classify and regress the data.

The main flow of a deep learning algorithm is: 1) dataset preparation, 2) data preprocessing, 3) data segmentation, 4) defining the neural network model, and 5) training the network.

Deep learning does not require us to extract features by hand; instead, the neural network automatically learns high-level abstract representations of the data, which reduces the feature engineering work and saves a great deal of time.

At the same time, however, tuning becomes more onerous because deeper and more complex network structures are introduced: defining the architecture of the neural network, choosing the loss function, choosing the optimizer, and then repeatedly adjusting the model parameters.
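A minimal PyTorch sketch of steps 4) and 5), assuming the preprocessed numeric features and integer class labels from the earlier split (the layer sizes and epoch count are arbitrary choices):

```python
import numpy as np
import torch
import torch.nn as nn

num_classes = len(np.unique(np.asarray(y_train)))   # number of distinct class labels

# 4) Define the neural network model (layer sizes are arbitrary choices).
model = nn.Sequential(
    nn.Linear(X_train.shape[1], 64),
    nn.ReLU(),
    nn.Linear(64, num_classes),
)

criterion = nn.CrossEntropyLoss()                           # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimizer

X_t = torch.tensor(np.asarray(X_train), dtype=torch.float32)
y_t = torch.tensor(np.asarray(y_train), dtype=torch.long)

# 5) Train the network by repeatedly minimizing the loss.
for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(X_t), y_t)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```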
