Preface
Machine learning, a core component of artificial intelligence, is the process by which computer programs learn from data (experience) to improve their algorithms and produce “intelligent” recommendations and decisions.
A classic definition of machine learning is:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
1 Introduction to machine learning
In machine learning, a computer learns the distribution of the data, builds a probabilistic or statistical model from it, and then uses that model to analyze and predict data. Depending on the kind of data being learned from, it can be divided mainly into supervised learning and unsupervised learning:
1.1 Supervised Learning
Supervised learning works on labeled data (x is the feature space, y is the label): an optimal model is learned through the selected model family and a fixed learning strategy, computed with an appropriate algorithm, and then used to make predictions.
Depending on whether the predicted value y is discrete (a finite set of classes) or continuous, the model is either a classification model or a regression model.
1.2 Unsupervised Learning
Unsupervised learning works on unlabeled data (x is the feature space): an optimal model is learned through the selected model family and a fixed learning strategy, computed with an appropriate algorithm, and then used to discover statistical regularities or the internal structure of the data.
By application scenario, it can be divided into clustering, dimensionality reduction, and association analysis models.
2 Machine learning modeling process
2.1 Identifying the business problem
Identifying the business problem is a prerequisite for machine learning. It means abstracting a solution to the real business problem: which data to learn from as input, and which kind of model to build so that its output supports decisions.
(For example, in a simple news classification scenario, existing news articles and their category labels are used to train a text classification model, which then predicts the category of each day's new articles so that they can be assigned to the right news channel.)
2.2 Data selection: Collect and input data
The data sets an upper limit on what machine learning can achieve, and the algorithm only tries to get as close to that limit as possible. In other words, the quality of the data determines the final performance of the model. In practical industrial applications, the algorithm is usually a small part of the work; most of an engineer's time goes into finding, cleaning, and analyzing data. Data selection needs to focus on:
(1) Representativeness of data: poorly representative data leads to a poorly fitting model;
(2) Data time range: if the feature variables X and the label Y in supervised learning are time-dependent, the time window of the data must be made explicit; otherwise data leakage may occur, that is, features that are only known after the label (cause and effect reversed) are present and used. (For example, when predicting whether it will rain tomorrow, the training data includes tomorrow's temperature and humidity.);
(3) Data business scope: make clear which data tables are relevant to the task, to avoid missing representative data or introducing large amounts of irrelevant data as noise.
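The time-window point in (2) can be sketched as follows. This is a minimal illustration, assuming a hypothetical daily weather table; the column names (`date`, `humidity`, `rain`) and all values are made up. The label is tomorrow's rain, the feature is today's humidity only, and the split is made by date rather than at random.

```python
import pandas as pd

# Hypothetical daily weather log; names and values are illustrative only.
weather = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=6, freq="D"),
    "humidity": [60, 72, 85, 90, 55, 65],
    "rain": [0, 0, 1, 1, 0, 0],
})

# Label: does it rain on the NEXT day? shift(-1) aligns tomorrow's value with today.
weather["rain_tomorrow"] = weather["rain"].shift(-1)

# The last day has no known "tomorrow", so it cannot be a training sample.
samples = weather.dropna(subset=["rain_tomorrow"])

# Split by time: everything before the cutoff trains, everything after tests.
cutoff = pd.Timestamp("2021-01-04")
train = samples[samples["date"] < cutoff]
test = samples[samples["date"] >= cutoff]

# Features use only what is known today; tomorrow's humidity would be leakage.
X_train, y_train = train[["humidity"]], train["rain_tomorrow"]
X_test, y_test = test[["humidity"]], test["rain_tomorrow"]
```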
2.3 Feature engineering: data preprocessing and feature extraction
Feature engineering processes the raw data into features that the model can use. It can generally be divided into:
(1) Data preprocessing: missing value and outlier handling, data discretization, data standardization, etc.
(2) Feature extraction: feature representation, feature derivation, feature selection, feature dimension reduction, etc.
2.3.1 Data preprocessing
- Outlier handling
The collected data may contain outliers (noise) introduced by human or natural factors, which can interfere with model learning.
Outliers usually need to be handled explicitly: identify them using business knowledge or technical methods (such as the 3σ criterion), filter them out with Python, for example by regular expression matching, and delete or replace them depending on the business context.
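As a rough sketch of the 3σ criterion mentioned above, the helper below flags values more than three standard deviations from the mean. The function name `three_sigma_mask` and the synthetic readings are assumptions made for illustration; whether flagged values should be deleted or replaced still depends on the business context.

```python
import numpy as np

def three_sigma_mask(values):
    """Return a boolean mask that is True where a value lies outside mean ± 3·std."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return np.abs(values - mean) > 3 * std

# Synthetic sensor readings around 20, plus one obviously wrong value.
rng = np.random.default_rng(0)
readings = np.append(rng.normal(loc=20.0, scale=1.0, size=200), 95.0)

mask = three_sigma_mask(readings)
cleaned = readings[~mask]  # here the flagged values are simply dropped
```

Note that the mean and standard deviation are themselves affected by the outliers, so on very small samples this rule can miss extreme values.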
- Missing value processing
Missing data can be filled in, left as is, or deleted, depending on the business context. By missing rate and treatment, the cases are:
(1) High missing rate: based on business understanding, the feature variable can be deleted directly. In practice it often helps to add a boolean feature that records whether the field was missing, with missing denoted as 1 and non-missing as 0;
(2) Low missing rate: the missing values can be filled in, for example with the fillna method of pandas, or predicted by training a random forest model;
(3) No processing: some models, such as XGBoost, LightGBM, and some random forest implementations, can handle missing values natively without any preprocessing.
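Cases (1) and (2) might look like this in pandas. This is a sketch only: the toy table, the 0.5 missing-rate threshold, and the choice of the median as fill value are all assumptions, not recommendations from the text.

```python
import numpy as np
import pandas as pd

# Toy data: "income" is mostly missing, "age" has a single gap.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [np.nan, np.nan, np.nan, 5000.0],
})

missing_rate = df.isna().mean()  # fraction of missing values per column

# (1) High missing rate (assumed threshold 0.5): keep a 0/1 missing indicator,
#     then drop the original column.
for col in missing_rate[missing_rate > 0.5].index:
    df[col + "_missing"] = df[col].isna().astype(int)
    df = df.drop(columns=col)

# (2) Low missing rate: fill the gap, here with the column median.
df["age"] = df["age"].fillna(df["age"].median())
```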
- Data discretization
Data discretization can reduce the time and space overhead of the algorithm (how much depends on the algorithm) and make the features easier to interpret in business terms.
Discretization segments continuous data into discrete intervals. Common segmentation rules include equal-width and equal-frequency binning.
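The two binning rules can be compared with pandas, as a small sketch on made-up ages. `pd.cut` splits the value range into intervals of equal width, while `pd.qcut` chooses the boundaries so that each interval holds roughly the same number of samples.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 30, 38, 45, 52, 70])

# Equal-width: 4 intervals, each covering the same span of values.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal-frequency: 4 intervals, each containing the same number of samples.
equal_freq = pd.qcut(ages, q=4, labels=False)

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```

With skewed data the equal-width bins end up unevenly filled, which is often the reason to prefer equal-frequency binning.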
- Data standardization
The scales (units) of the feature variables often differ greatly, so standardization can be used to remove the influence of scale differences between features and to speed up model convergence. Common methods are:
① Min-max normalization:
Scale the value range to [0, 1] without changing the shape of the data distribution: x' = (x - min) / (max - min), where max is the maximum value of the sample and min is the minimum value of the sample.
② Z-score standardization:
Center the values on 0 and rescale them: x' = (x - μ) / σ, where μ is the mean and σ is the standard deviation. The processed data has mean 0 and standard deviation 1; it follows the standard normal distribution only if the original data was normally distributed.
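Both methods are short enough to write out directly. The sketch below uses NumPy on a made-up list of heights and assumes the column is not constant (otherwise the denominators are zero).

```python
import numpy as np

def min_max_scale(x):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score_scale(x):
    """Shift to mean 0 and rescale to standard deviation 1: (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = min_max_scale(heights_cm)        # 0, 0.25, 0.5, 0.75, 1
standardized = z_score_scale(heights_cm)  # mean 0, standard deviation 1
```

In a real pipeline the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused for the test set.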
2.3.2 Feature extraction
- Feature representation
The data needs to be converted into a numerical form that a computer can process. If the data is an image, it needs to be converted into a three-dimensional RGB matrix.
Text data can be represented as multi-dimensional arrays, for example with one-hot encoding, word2vec distributed representations, or BERT dynamic (contextual) encoding.
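One-hot encoding is the simplest of these representations. A minimal sketch with pandas, using a made-up `color` column: each distinct category becomes its own 0/1 column, and exactly one of them is 1 per row.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)
print(one_hot)
```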
- Feature derivation
Basic features carry limited information about a sample and can be supplemented with new, meaningful features derived from them. Feature derivation applies some processing (aggregation, transformation, etc.) to the existing basic features. The common methods are as follows:
(1) Derivation based on business understanding. Aggregation computes statistics such as the mean, count, or maximum of a field over a group. For example, the average monthly salary and the maximum salary can be derived from 12 months of salary records.
Transformation combines fields by adding, subtracting, multiplying, or dividing them. For example, the ratio of or the difference between monthly income and expenditure can be derived from 12 months of records;
(2) Automated derivation with a tool such as Featuretools.
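The aggregation and transformation steps in (1) can be sketched with a pandas group-by. The table, customer identifiers, and column names are hypothetical, and only three months per customer are shown to keep the example short.

```python
import pandas as pd

# Hypothetical monthly records for two customers (values are made up).
records = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b", "b"],
    "salary":   [3000, 3200, 3400, 5000, 5000, 5600],
    "expense":  [1500, 1600, 1700, 4000, 4500, 4200],
})

# Aggregation: mean, maximum, and count per customer.
derived = records.groupby("customer").agg(
    salary_mean=("salary", "mean"),
    salary_max=("salary", "max"),
    n_months=("salary", "count"),
    expense_mean=("expense", "mean"),
)

# Transformation: combine fields by division and subtraction.
derived["expense_to_salary"] = derived["expense_mean"] / derived["salary_mean"]
derived["monthly_surplus"] = derived["salary_mean"] - derived["expense_mean"]
```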
- Feature selection
Feature selection keeps the salient features and discards the non-salient ones. Feature selection methods generally fall into three categories:
(1) Filter method: each feature is scored by its divergence or by a correlation measure, using methods such as variance threshold, correlation coefficient, IV value, chi-square test, and information gain.
(2) Wrapper method: in each iteration a subset of the features is used to train the model, and features are selected according to the model's prediction score.
(3) Embedded method: a model is trained to obtain a weight coefficient for each feature, and features are selected in descending order of weight, for example by using XGBoost feature importance.
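The three categories map onto scikit-learn roughly as shown below. This is a sketch on the Iris data, not the author's setup: the chi-square test stands in for the filter method, recursive feature elimination for the wrapper method, and a random forest (swapped in here for XGBoost to avoid an extra dependency) supplies the importances for the embedded method.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# (1) Filter: score each feature with the chi-square test, keep the best two.
filter_sel = SelectKBest(chi2, k=2).fit(X, y)

# (2) Wrapper: recursive feature elimination retrains the model repeatedly
#     and drops the weakest feature each round until two remain.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

# (3) Embedded: keep features whose importance in the fitted forest is above
#     the mean importance (the default threshold for this kind of estimator).
embedded_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, sel.get_support())
```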
- Feature dimension reduction
If the number of features is still too large after feature selection, problems such as sparse samples and hard-to-compute distances (the “curse of dimensionality”) often arise, and these can be mitigated by feature dimension reduction. Common dimension reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA).
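As an illustration of PCA, the sketch below projects the four Iris features onto two principal components with scikit-learn. Standardizing first is an assumption made here so that no single feature dominates because of its scale.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# PCA is sensitive to scale, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Keep the two directions that explain the most variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Share of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(explained, 3))
```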
2.4 Model training
Model training is the process in which the selected model learns the distribution of the data. During this process, the (hyper)parameters of the algorithm need to be adjusted according to the training results to improve them.
2.4.1 Data set division
Before training the model, the data set is generally divided into a training set and a test set, and the training set can be further subdivided into a training set and a validation set, so that the generalization ability of the model can be evaluated.
① Training set: used to run the learning algorithm.
② Validation (development) set: used to tune parameters, select features, and otherwise optimize the algorithm. Common validation methods include cross-validation and the leave-one-out method.
③ Test set: used to evaluate the performance of the algorithm; the learning algorithm and its parameters are not changed based on it.
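The three roles can be sketched with scikit-learn on the Iris data. The 80/20 split, the decision tree, and the 5-fold setting are assumptions for illustration; cross-validation on the training portion plays the role of the validation set, and the test set is scored once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set; stratify keeps the class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation on the training data stands in for a validation set.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Only after all tuning decisions are made is the test set used, once.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```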
2.4.2 Model selection
Common machine learning algorithms are as follows:
The choice of model depends on the data and on the prediction target. Several models can be trained, and the best one, or an ensemble of several, is then chosen according to actual performance.
2.4.3 Model training
The training process can be optimized by parameter tuning, which is an empirical process that depends on the data set, the model, and the details of training. Hyperparameter optimization draws on an understanding of the algorithm's principles and on experience; it can also be automated with techniques such as grid search, random search, and Bayesian optimization.
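Grid search, the simplest of the automatic techniques, can be sketched with scikit-learn's `GridSearchCV`. The decision tree and the small parameter grid are assumptions chosen so that the example runs quickly; `RandomizedSearchCV` offers random search with a similar interface.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Every combination in the grid (3 x 2 = 6) is scored by 5-fold cross-validation.
param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
# By default the best combination is refit on the whole training set.
print(search.score(X_test, y_test))
```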
2.5 Model Evaluation
Criteria for model evaluation: the goal of learning is for the learned model to predict new data well (generalization ability). In practice, how well the model has fit the training data and how well it generalizes are usually measured by the training error and the test error, respectively.
2.5.1 Evaluation Indicators
(1) Evaluating classification models: commonly used metrics include precision P, recall R, and their harmonic mean, the F1-score. The values are computed from the counts in the confusion matrix:
Precision is the proportion of samples correctly classified as positive (TP) among all samples the classifier predicted as positive (TP+FP): P = TP / (TP + FP).
Recall is the proportion of samples correctly classified as positive (TP) among all actual positive samples (TP+FN): R = TP / (TP + FN).
The F1-score is the harmonic mean of precision P and recall R: F1 = 2PR / (P + R).
(2) Evaluating regression models: commonly used metrics include the root mean square error (RMSE), which reflects how closely the predicted values fit the actual values.
(3) Evaluating clustering models: there are two types of methods. One compares the clustering result with the result of a “reference model” and is called an “external index”, such as the Rand index and the FM (Fowlkes-Mallows) index. The other examines the clustering result directly without any reference model and is called an “internal index”, such as compactness and separation.
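The classification and regression metrics above reduce to a few lines of arithmetic. The sketch below uses plain Python with made-up counts and predictions, and assumes the denominators are non-zero.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)  # precision 0.8, recall 0.5, F1 = 8/13

error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(error)     # sqrt(5/3)
```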
2.5.2 Model evaluation and optimization
Based on the metrics on the training set and the test set, the causes of poor performance are analyzed and the model is optimized accordingly. The common methods are:
2.6 Model Decision
Decision making is the ultimate goal of machine learning: the model's predictions are analyzed, interpreted, and applied in practice.
Note that engineering is result-oriented. How the model performs once it is running online directly determines its success or failure; this includes not only its accuracy and error, but also its running speed (time complexity), resource consumption (space complexity), and stability.
3 Python in practice
# This is a simple demo: train an iris classification model on the Iris plants data set
import pandas as pd
from sklearn.datasets import load_iris

# Load the data into a DataFrame and attach the label column
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['class'] = data.target
df.head()

# Exploratory report with pandas_profiling (newer releases are published as ydata-profiling)
import pandas_profiling
df.profile_report(title='iris')

# Separate features and label, then split into training and test sets
y = df['class']
x = df.drop('class', axis=1)
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y)

# Select the xgboost model
# (parameters that were None or deprecated in the original listing, such as silent, nthread and seed, are omitted)
from xgboost import XGBClassifier
xgb = XGBClassifier(base_score=0.5, booster='gbtree',
                    colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
                    gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=1,
                    min_child_weight=1, n_estimators=1, n_jobs=1,
                    objective='multi:softprob', random_state=0,
                    reg_alpha=0, reg_lambda=1, subsample=1)
xgb.fit(train_x, train_y)

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def model_metrics(model, x, y):
    """Return accuracy and macro-averaged F1, precision and recall of model on (x, y)."""
    yhat = model.predict(x)
    result = {'accuracy_score': accuracy_score(y, yhat),
              'f1_score_macro': f1_score(y, yhat, average="macro"),
              'precision': precision_score(y, yhat, average="macro"),
              'recall': recall_score(y, yhat, average="macro")}
    return result

# Compare the metrics on the training set and the test set
print("TRAIN")
print(model_metrics(xgb, train_x, train_y))
print("TEST")
print(model_metrics(xgb, test_x, test_y))