Machine learning process

Artificial intelligence is sweeping the world, and three terms come up constantly: artificial intelligence, machine learning, and deep learning. Artificial intelligence is the goal we want to achieve, and machine learning is a means of achieving it: the hope is that machines, by learning, can become as capable as people. Deep learning is one of the methods of machine learning. In short, machine learning is a way to achieve artificial intelligence, and deep learning is a technique for implementing machine learning.

Machine learning can build models quickly and automatically, analyze larger and more complex data, and deliver faster, more accurate results, even at very large scales. High-value predictions can lead to better decisions and smarter actions in the real world without human intervention. This article mainly introduces traditional machine learning and briefly walks through the basic machine learning process.

1. Machine learning

At its most basic, machine learning uses algorithms to parse data, learn from it, and then make decisions and predictions about real-world events. Unlike traditional software programs that are hard-coded to solve specific tasks, machine learning “trains” a model on large amounts of data, using algorithms to learn from the data how to perform a task.

Here are three important points:

Machine learning is a way to simulate and extend human intelligence, so it is a subset of artificial intelligence;
“Machine learning” is based on large amounts of data; that is, its “intelligence” is fed by data;
Big data technology is especially important because of the huge amount of data to be processed; “machine learning” is just one application of big data technology.

Ten commonly used machine learning algorithms are: decision trees, random forests, logistic regression, support vector machines (SVM), naive Bayes, K-nearest neighbors, K-means, AdaBoost, neural networks, and Markov models.

2. Machine learning related technologies

Machine learning can be divided into:

Supervised learning
Semi-supervised learning
Unsupervised learning
Reinforcement learning

2.1. Supervised learning

Definition: Supervised learning learns the relationship between inputs and outputs from an existing data set and uses that relationship to train an optimal model. In other words, the training data has both features and labels. Through training, the machine finds the connection between features and labels by itself, and can then predict labels for data that has features but no labels. Put simply, you can think of supervised learning as teaching the machine how to do things.

Supervised learning is divided into:

Regression problem
Classification problem

Classical algorithms: support vector machines, linear discriminant analysis, decision trees, naive Bayes

2.1.1. Regression problems

Regression problems deal with continuous variables. Put simply, regression analyzes the existing points (the training data) and fits an appropriate function model y = f(x), where y is the label of the data; for a new independent variable x, the label y is obtained through this function model.
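As a minimal sketch of fitting y = f(x), the following assumes the simplest possible model, a straight line fitted by ordinary least squares. The function name `fit_line` and the training points are made up for illustration; the article does not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Fit y = a * x + b by ordinary least squares and return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical training data: the labels follow y = 2x + 1 exactly.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(train_x, train_y)


def f(x):
    """The fitted model: predicts a label y for a new input x."""
    return a * x + b


print(f(5.0))  # prints 11.0
```

Real data is noisy, so the fitted line will usually not pass through every training point as it does here.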

2.1.2. Classification problems

The difference between regression problems and classification problems is the type of output we want from the machine. In a regression problem the machine outputs a number, while in a classification problem it outputs a category. In other words, classification deals with discrete values, and the set of possible outputs is finite.

Classification problems are divided into two types:

Binary classification: the output is yes or no, such as determining whether a tumor is benign or malignant.
Multi-class classification: the correct category is selected from several options, such as taking an image as input and determining whether it shows a cat, a dog, or a pig.

Put simply, classification analyzes the input feature vectors and then predicts the label for a new vector.
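To make "predict the label for a new vector" concrete, here is a small sketch using K-nearest neighbors, one of the algorithms listed earlier. The two-number feature vectors (weight, height) and the function name `knn_predict` are assumptions made for this example; a real image classifier would work on far richer features.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        (sum((p - q) ** 2 for p, q in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical feature vectors (weight in kg, height in cm) with class labels.
points = [(4.0, 25.0), (5.0, 23.0), (30.0, 60.0), (25.0, 55.0), (90.0, 80.0), (110.0, 85.0)]
labels = ["cat", "cat", "dog", "dog", "pig", "pig"]

print(knn_predict(points, labels, (4.5, 24.0)))    # prints cat
print(knn_predict(points, labels, (100.0, 82.0)))  # prints pig
```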

2.2. Semi-supervised learning

Traditional machine learning techniques fall into two categories: unsupervised learning and supervised learning.

Unsupervised learning uses only an unlabeled sample set, while supervised learning uses only a labeled sample set.

However, in many practical problems only a small amount of labeled data is available, because labeling data can be very expensive. For example, in biology, the structural analysis or functional identification of a single protein may take many years of work, while large amounts of unlabeled data are easy to obtain. This has led to the rapid development of semi-supervised learning techniques, which can use labeled and unlabeled samples together. In short, semi-supervised learning reduces the number of labels that are needed.

Semi-supervised learning is inductive, and the resulting model can be applied to a wider range of samples.
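One simple way to combine labeled and unlabeled samples is self-training (pseudo-labeling): train on the few labeled samples, label the unlabeled ones with the current model, and retrain on everything. The sketch below assumes this approach with a toy nearest-centroid classifier on one-dimensional data; it is only one of several semi-supervised techniques, and all names and data are illustrative.

```python
def centroids(points, labels):
    """Mean of the points for each class label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents, x):
    """Label of the centroid closest to x."""
    return min(cents, key=lambda y: abs(x - cents[y]))


def self_train(labeled_x, labeled_y, unlabeled_x, rounds=3):
    """Repeatedly pseudo-label the unlabeled data and refit the centroids."""
    xs, ys = list(labeled_x), list(labeled_y)
    for _ in range(rounds):
        cents = centroids(xs, ys)
        pseudo_labels = [predict(cents, x) for x in unlabeled_x]
        xs = list(labeled_x) + list(unlabeled_x)
        ys = list(labeled_y) + pseudo_labels
    return centroids(xs, ys)


# Only two labeled samples, plus six unlabeled ones (hypothetical data).
model = self_train([1.0, 9.0], ["low", "high"], [2.0, 3.0, 4.0, 6.0, 7.0, 8.0])
print(model)  # prints {'low': 2.5, 'high': 7.5}
```

The unlabeled samples pull the class centers from 1.0 and 9.0 toward 2.5 and 7.5, which describes the data better than the two labeled points alone.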

2.3. Unsupervised learning

Definition: We do not know in advance the relationship between the data and the features in the data set; instead, we discover relationships in the data through clustering or other models. Unsupervised learning is more like self-study than supervised learning is: there are no labels, and the machine learns to do things by itself.

Unsupervised learning enables us to solve problems with little or no knowledge of what the outcome should look like. We can derive structures from data where we do not need to know the effects of variables. We can obtain this structure by clustering the data according to the relationship between variables in the data. In unsupervised learning, there is no feedback based on the predicted results.

Classical algorithms: K-means clustering, principal component analysis
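The K-means algorithm mentioned above alternates two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. The following is a minimal sketch for one-dimensional points with hand-picked starting centroids; the function name and data are assumptions, and practical implementations add random initialization and a convergence check.

```python
def kmeans(points, initial_centroids, iterations=10):
    """K-means (Lloyd's algorithm) for 1-D points; returns (centroids, assignments)."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


# Hypothetical unlabeled data with two obvious groups.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, final_assignments = kmeans(data, initial_centroids=[1.0, 2.0])
print(final_centroids)    # prints [2.0, 11.0]
print(final_assignments)  # prints [0, 0, 0, 1, 1, 1]
```

No labels are supplied anywhere: the two groups emerge purely from the distances between the points.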

2.4. Reinforcement learning

Definition: Reinforcement learning is an important branch of machine learning that draws on many disciplines and fields. Its essence is to solve decision-making problems, that is, to make decisions automatically and continuously.

It mainly contains four elements: Agent, environment state, action and reward

The goal of reinforcement learning is to get the maximum cumulative reward.

The difference between reinforcement learning and supervised learning

Supervised learning is like having a teacher who tells the learner what is right and wrong during learning. However, in many practical problems, such as checkers and Go, there are tens of millions of possible lines of play, and no teacher can know all the possible outcomes. Reinforcement learning, by contrast, works without any labels: the algorithm first tries some behavior, receives feedback on whether the result was good or bad, and adjusts its earlier behavior accordingly. By adjusting constantly in this way, it learns which behavior to choose in which circumstances to get the best result.
Both learning methods learn a mapping from input to output. Supervised learning tells the algorithm what kind of input corresponds to what kind of output. Reinforcement learning gives the machine feedback, which is used to judge whether a behavior is good or bad.
In addition, feedback in reinforcement learning is delayed. Sometimes many steps must be taken before you know whether an earlier step was good or bad, whereas supervised learning gives the algorithm feedback immediately when it makes a bad choice. The input of reinforcement learning is also always changing: every action the algorithm takes affects the input of its next decision, whereas the inputs in supervised learning are assumed to be independent and identically distributed.
Through reinforcement learning, an agent can make a trade-off between exploration and exploitation and choose the behavior with the greatest reward. Exploration tries many different things to see whether they are better than what has been tried before. Exploitation chooses the behavior that past experience has shown to be most effective. General supervised learning algorithms do not consider this balance.

The difference between reinforcement learning and unsupervised learning

Unsupervised learning does not learn an input-to-output mapping; it learns patterns. For example, in the task of recommending news articles to a user, unsupervised learning finds articles similar to those the user has read before and recommends one of them. Reinforcement learning, on the other hand, first recommends a small number of articles to the user, continuously collects the user’s feedback, and eventually builds a “knowledge graph” of the articles the user is likely to enjoy.
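The recommend-and-collect-feedback loop, together with the exploration/exploitation trade-off described earlier, can be sketched with an epsilon-greedy strategy. This is a deliberately simplified, assumed setup: three hypothetical news categories, each returning a fixed reward when recommended, and no environment state. It is not the knowledge-graph approach the text alludes to.

```python
import random


def epsilon_greedy(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Choose among actions, balancing exploration and exploitation.

    true_rewards holds the (hidden) reward of each action. The agent only
    sees the reward of the action it actually tries.
    """
    rng = random.Random(seed)
    n_actions = len(true_rewards)
    estimates = [0.0] * n_actions  # running average reward per action
    counts = [0] * n_actions
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # exploration: try anything
        else:
            # Exploitation: pick the action that looks best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = true_rewards[action]  # feedback from the environment
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward


# Hypothetical average user feedback for three news categories.
categories = ["sports", "technology", "finance"]
estimates, total = epsilon_greedy([0.2, 0.5, 0.8])
best = max(range(len(categories)), key=lambda a: estimates[a])
print(categories[best])  # prints finance
```

With `epsilon=0.0` the agent would never explore and would keep recommending the first category it tried, which is why some exploration is needed.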

Main algorithms and classification

From the perspective of several elements of reinforcement learning, methods can be divided into the following categories:

Policy-based: the focus is on finding the optimal policy.
Value-based: the focus is on finding the optimal sum of rewards.
Action-based: the focus is on the best action at each step.
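As an illustration of the value-based category, the sketch below uses tabular Q-learning, a well-known value-based method chosen here as an example rather than taken from the article. The environment is a hypothetical chain of five states in which the agent earns a reward of 1 for reaching the rightmost state; all names and parameter values are assumptions.

```python
import random

N_STATES = 5  # states 0..4; state 4 is the goal
ACTIONS = ["left", "right"]


def step(state, action):
    """Hypothetical chain environment: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values Q(state, action) from randomly explored episodes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # purely exploratory behavior
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q):
    """Best action in every non-terminal state according to the learned values."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


policy = greedy_policy(q_learning())
print(policy)  # prints ['right', 'right', 'right', 'right']
```

The reward arrives only at the end of an episode, yet the learned values propagate backwards so that even state 0 prefers moving right. This is the delayed feedback discussed above.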

3. Basic process of machine learning

3.1. A brief description of the basic process

A basic machine learning process can be briefly divided into five steps: problem transformation, data collection and processing, model training and tuning, model deployment, and monitoring.

Problem transformation: turn the practical problem into a machine learning problem
Data collection and processing: collect the data we need, and carry out data cleaning and other processing
Model training and tuning: select an appropriate model and train it on the processed data
Model deployment: deploy the trained model online
Monitoring: monitor the performance of the model, obtain new data, then collect and process the data again and retrain and tune the model to cope with changes in user behavior

3.2. A detailed description of the basic process

Data sources:

The first step in machine learning is to collect data. This step matters because the quality and quantity of the collected data directly determine how good a prediction model can be built. We can deduplicate, standardize, and correct the collected data and save it to database files or CSV files, ready to be loaded in the next step.
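A small sketch of this cleaning step, assuming records arrive as dictionaries with the made-up fields `id`, `city`, and `price`. The cleaning rules are illustrative, and the CSV is written to an in-memory buffer so the example has no side effects.

```python
import csv
import io


def clean_records(records):
    """Standardize the fields of each record and drop duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        # Standardize: trim and lower-case the city, parse the price as a number.
        normalized = (row["id"], row["city"].strip().lower(), float(row["price"]))
        if normalized in seen:
            continue  # duplicate after standardization
        seen.add(normalized)
        cleaned.append({"id": normalized[0], "city": normalized[1], "price": normalized[2]})
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "city", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = [
    {"id": 1, "city": " Beijing ", "price": "10.5"},
    {"id": 1, "city": "beijing", "price": "10.50"},  # same record, different formatting
    {"id": 2, "city": "Shanghai", "price": "8"},
]
rows = clean_records(raw)
print(to_csv(rows))
```

To write a real file instead, the same `csv.DictWriter` can be given a file opened with `open(path, "w", newline="")`.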

Analysis:

This step is mainly data exploration, such as finding the maximum, minimum, mean, variance, median, tertiles, and quartiles of each column, the proportion of specific values (such as zero), or the distribution of each column. The best way to understand these things is to visualize them, which can be done easily with Facets, Google’s open source tool. In addition, for the independent variables (x_1, …, x_n) and the dependent variable y, find the correlation between the dependent variable and each independent variable and determine the correlation coefficients.
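The per-column statistics and the correlation coefficient can be computed with the standard library alone. The sketch below assumes the Pearson correlation coefficient, since the text does not say which one is meant, and uses made-up columns `x` and `y`.

```python
import statistics


def summarize(column):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(column),
        "max": max(column),
        "mean": statistics.mean(column),
        "median": statistics.median(column),
        "variance": statistics.pvariance(column),  # population variance
        "zero_ratio": sum(1 for v in column if v == 0) / len(column),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: y depends linearly on x, so the correlation is 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

summary = summarize(x)
r = pearson(x, y)
print(summary["mean"], summary["variance"], r)
```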

Feature selection:

The quality of the features largely determines how well the classifier performs. Filter the independent variables identified in the previous step, either manually or with a model, to select the appropriate features, and then name the variables so they are easier to identify. The file of variable names should be saved, because it is used again during the prediction phase.

Vectorization:

Vectorization is the reprocessing of feature extraction results to strengthen the feature representation and to keep the model from becoming too complex and difficult to learn. For example, continuous feature values are discretized, and label values are mapped to enumeration values identified by numbers. This phase produces an important file: the mapping between labels and enumeration values, which is also used in the prediction phase.
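The two transformations mentioned above, discretizing a continuous feature and mapping labels to numeric codes, might look like the following. The bucket boundaries, the label set, and the function names are assumptions for illustration; the mapping is serialized to JSON to stand in for the file that is kept for the prediction phase.

```python
import json


def discretize(value, boundaries):
    """Map a continuous value to the index of the bucket it falls into."""
    for index, upper in enumerate(boundaries):
        if value < upper:
            return index
    return len(boundaries)  # above the last boundary


def build_label_mapping(labels):
    """Assign each distinct label an integer code (sorted so the result is stable)."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


# Hypothetical continuous feature: age, split into <18, 18-64, and 65+.
ages = [5, 17, 30, 64, 80]
age_buckets = [discretize(age, boundaries=[18, 65]) for age in ages]

# Hypothetical label column mapped to enumeration values.
labels = ["dog", "cat", "pig", "cat", "dog"]
mapping = build_label_mapping(labels)
encoded = [mapping[label] for label in labels]

# The label-to-code mapping is what needs to be saved for prediction time.
mapping_json = json.dumps(mapping)

print(age_buckets)   # prints [0, 0, 1, 1, 2]
print(encoded)       # prints [1, 0, 2, 0, 1]
print(mapping_json)  # prints {"cat": 0, "dog": 1, "pig": 2}
```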

Split data set:

You need to split the data into two parts. The first part, used to train the model, will be the bulk of the data set. The second part will be used to evaluate the performance of the trained model. Data is usually split 8:2 or 7:3. The training data cannot be used directly for evaluation, because the model may simply have memorized the “questions” it was trained on.
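A minimal sketch of an 8:2 split, assuming the rows are shuffled first so that any ordering in the original data does not leak into the split. The function name `split_dataset` and the fixed seed are illustrative choices.

```python
import random


def split_dataset(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and return (train_rows, test_rows)."""
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for ten samples
train, test = split_dataset(data)
print(len(train), len(test))  # prints 8 2
```

Passing `test_ratio=0.3` gives the 7:3 split instead.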

Training:

Before model training, an appropriate algorithm should be chosen, such as linear regression, decision trees, random forests, logistic regression, gradient boosting, or SVM. The best way to choose an algorithm is to test several different algorithms and pick the best one by cross-validation. However, if you are just looking for a “good enough” algorithm, or a starting point, there are some good general guidelines. For example, if the training set is small, high-bias/low-variance classifiers (such as naive Bayes) have an advantage over low-bias/high-variance classifiers (such as k-nearest neighbors), which overfit easily. However, as the training set grows, low-bias/high-variance classifiers start to win out (they have lower asymptotic error), because high-bias classifiers are not powerful enough to provide accurate models.
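Comparing candidate algorithms by cross-validation can be sketched as follows. The two "algorithms" here are toy stand-ins chosen to keep the example short (a majority-class baseline and a 1-nearest-neighbor classifier on one-dimensional data); the fold layout and all names are assumptions.

```python
from collections import Counter


def majority_classifier(train_x, train_y):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def nearest_neighbor_classifier(train_x, train_y):
    """Predict the label of the single closest training point."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def cross_validate(make_model, xs, ys, k=3):
    """Average accuracy over k folds; fold i holds out every k-th sample starting at i."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        model = make_model([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(model(xs[i]) == ys[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Hypothetical data set with two well-separated classes.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["a", "a", "a", "b", "b", "b"]

baseline_score = cross_validate(majority_classifier, xs, ys)
neighbor_score = cross_validate(nearest_neighbor_classifier, xs, ys)
print(baseline_score, neighbor_score)  # prints 0.5 1.0
```

Each sample is used for evaluation exactly once, and never by a model that was trained on it.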

Evaluation:

After training is completed, the model is evaluated on the held-out test data from the split, and the quality of the model is judged by comparing the predicted values with the true values. The appropriate evaluation metrics depend on the type of task.

After the evaluation, if we want to improve the model further, we can adjust its parameters and then repeat the process of training and evaluation.
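As examples of task-specific metrics, the sketch below computes accuracy, precision, and recall for a classification task and mean squared error for a regression task. These are common choices picked for illustration, not a list taken from the article, and the true and predicted values are made up.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical classification results (1 = positive class).
true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(true_labels, pred_labels))          # prints 0.75
print(precision_recall(true_labels, pred_labels))  # prints (0.75, 0.75)

# Hypothetical regression results.
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # prints 2.5
```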

Filing:

After model training, four types of files should be organized to ensure that the model can run correctly: the model file, the label encoding file, the metadata file (algorithm, parameters, and results), and the variable file (list of independent variable names and list of dependent variable names).
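One possible way to write out these four files is sketched below. The file names, the use of pickle for the model and JSON for the rest, and the contents passed in are all assumptions; the article does not specify formats.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifacts(directory, model, label_codes, metadata, variables):
    """Write the four artifact files and return their names, sorted."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "model.pkl", "wb") as handle:
        pickle.dump(model, handle)  # model file
    (directory / "labels.json").write_text(json.dumps(label_codes))   # label encoding file
    (directory / "metadata.json").write_text(json.dumps(metadata))    # algorithm, parameters, results
    (directory / "variables.json").write_text(json.dumps(variables))  # variable name lists
    return sorted(path.name for path in directory.iterdir())


# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_artifacts(
        tmp,
        model={"slope": 2.0, "intercept": 1.0},  # stand-in for a trained model
        label_codes={"cat": 0, "dog": 1},
        metadata={"algorithm": "linear regression", "parameters": {}, "results": {"mse": 0.0}},
        variables={"independent": ["x"], "dependent": ["y"]},
    )

print(saved)  # prints ['labels.json', 'metadata.json', 'model.pkl', 'variables.json']
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.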

Interface encapsulation:

Encapsulate the model behind a service interface, so that callers can invoke the model and get predicted results back.
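A minimal sketch of such an interface, assuming requests and responses are JSON strings. The class `PredictionService`, the feature name `weight_kg`, and the threshold rule standing in for a trained model are all hypothetical; a real service would sit behind a web framework or an RPC layer.

```python
import json


class PredictionService:
    """Wraps a trained model behind a single JSON-in, JSON-out call."""

    def __init__(self, model, feature_names, code_to_label):
        self.model = model                  # callable: list of floats -> label code
        self.feature_names = feature_names  # order in which features are passed to the model
        self.code_to_label = code_to_label  # inverse of the label encoding

    def handle(self, request_body):
        payload = json.loads(request_body)
        missing = [name for name in self.feature_names if name not in payload]
        if missing:
            return json.dumps({"error": "missing features: " + ", ".join(missing)})
        features = [float(payload[name]) for name in self.feature_names]
        code = self.model(features)
        return json.dumps({"label": self.code_to_label[code]})


def toy_model(features):
    """Hypothetical stand-in for a trained model: a threshold on one feature."""
    return 1 if features[0] > 50.0 else 0


service = PredictionService(toy_model, ["weight_kg"], {0: "cat", 1: "pig"})
print(service.handle('{"weight_kg": 90}'))  # prints {"label": "pig"}
print(service.handle("{}"))                 # prints {"error": "missing features: weight_kg"}
```

The feature names and the label mapping are exactly the files saved after training, which is why they need to be kept alongside the model.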

Online:

Deploy the trained model online.
