This is the fifth day of my participation in the August Update Challenge. More challenges in August.

What is machine learning

AI and big data get more press coverage than machine learning. I had never heard of machine learning before I went to graduate school, and only now do I realize that the tasks I thought were done by machine learning were all attributed to big data. In my humble opinion, big data is concerned with collecting and storing data, while machine learning is about analyzing it. As for artificial intelligence, it is mostly a marketing buzzword. That is just my view, of course; the generally accepted picture is shown in the figure below.

Problems machine learning can solve

Machine learning can solve problems in many fields, especially NLP and CV:

  • Natural language processing for speech recognition applications

  • Image processing and computer vision for face recognition, motion detection and object detection

  • Computational biology, used in tumor detection, drug discovery and DNA sequence analysis

  • Energy production, used to predict price and load

  • Automotive, aerospace, and manufacturing, for predictive maintenance

  • Computational finance, used in credit assessment and algorithmic trading

Machine learning methods

Machine learning falls into four broad categories: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Yes, AlphaGo, which defeated Lee Sedol, is an example of reinforcement learning. To my limited knowledge, reinforcement learning still gets more talk than action and is not yet widely or effectively used in industry.

Problems facing machine learning

1. Insufficient training data

To teach a toddler what an apple is, all you have to do is point to an apple and say “apple” (this process may need to be repeated several times), and the child will be able to identify apples of all colors and shapes. It’s genius!

Machine learning is not there yet, and most machine learning algorithms need a lot of data to work properly. Thousands of examples are likely to be needed for even the simplest problems, and millions for complex problems such as image or speech recognition (unless you can reuse parts of an existing model).

2. Non-representative training data

In order to generalize well, it is critical that the training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the country data set we used to train the linear life-satisfaction model earlier was not fully representative: data for some countries was missing. The figure below shows the data after the missing countries are added.

A more representative training sample

If you train a linear model on this data set, you get the solid line in the figure, while the dashed line represents the old model. As you can see, adding the missing countries not only changes the model dramatically, it also makes it clear that such a simple linear model will probably never be very accurate. It seems that some very rich countries are no happier than moderately rich ones (in fact, they seem even less happy), and conversely, some poor countries seem happier than many rich ones.

Models trained using unrepresentative training sets are unlikely to produce accurate forecasts, especially for countries that are particularly poor or rich.

It is important to use a training set that is representative of the cases you want to generalize to. But this is easier said than done: if the sample is too small, you get sampling noise (i.e., non-representative data selected by chance); and even a very large sample can produce an unrepresentative data set if the sampling method is flawed. This is called sampling bias.

An example of sampling bias

The most famous example of sampling bias occurred during the 1936 U.S. presidential election, when Landon faced Roosevelt. Literary Digest conducted an extensive poll, mailing questionnaires to around 10 million people. It received 2.4 million responses and confidently predicted that Landon would win with 57% of the vote. Instead, Roosevelt won with 62% of the vote. The problem lay in Literary Digest's sampling method:

  • For one thing, Literary Digest used telephone directories, magazine subscription lists, club membership lists, and similar sources to obtain the addresses to which the polls were sent. All of these lists skew toward wealthier people, who were more likely to vote Republican (i.e., for Landon).

  • Second, fewer than 25 percent of those who received the poll responded. Again, this introduced sampling bias by excluding people who did not care much about politics, people who did not like Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.

As another example, suppose you want to build a system to recognize funk music videos. One way to build a training set is to simply search for “funk music” on YouTube and use the resulting videos. However, this assumes that YouTube’s search engine returns a set of videos representative of all funk music videos. In practice, the results are likely to be biased toward popular artists of the moment (and if you live in Brazil, you will get a lot of “funk carioca” videos, which sound nothing like James Brown). On the other hand, how else could you get a large training set?

3. Low-quality data

Obviously, if the training set is full of errors, outliers, and noise (for example, data from low-quality measurements), it will be harder for the system to detect the underlying patterns, and the system is less likely to perform well. So time spent cleaning up training data is well worth the investment. In fact, most data scientists spend a significant portion of their time doing exactly this. For example:

  • If some instances are clearly outliers, it may help to simply discard them or to try to fix the errors manually.

  • If some instances are missing a few features (for example, 5% of customers did not specify their age), you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values (for example, with the median age), or train one model with the feature and another without it. A small sketch of these options follows the list.
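
The following is a minimal sketch of the first three options using pandas; the `customers` DataFrame and its columns are hypothetical, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with some missing ages
customers = pd.DataFrame({
    "age": [25, 41, np.nan, 33, np.nan, 58],
    "income": [32_000, 54_000, 41_000, 38_000, 29_000, 61_000],
})

drop_feature = customers.drop(columns=["age"])        # option 1: ignore the attribute altogether
drop_rows = customers.dropna(subset=["age"])          # option 2: ignore the incomplete instances
median_age = customers["age"].median()
fill_median = customers.fillna({"age": median_age})   # option 3: fill missing values with the median

print(fill_median)
```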

4. Irrelevant features

As the saying goes: garbage in, garbage out. The system will only be able to learn if the training data contains enough relevant features and not too many irrelevant ones. A key part of a successful machine learning project is coming up with a good set of features to train on. This process is called feature engineering and includes the following (a small feature-selection sketch follows the list):

  • Feature selection (select the most useful feature from existing features for training).

  • Feature extraction (combining existing features to produce more useful features – as mentioned earlier, dimensionality reduction algorithms can help).

  • Create new features by collecting new data.
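
As an illustration of the feature-selection step, here is a minimal sketch using scikit-learn's SelectKBest on a synthetic regression dataset; the data and the choice of k are purely illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 10 features, only 3 of which actually carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=42)

# Keep the 3 features with the strongest univariate relationship to the target
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("selected shape:", X_selected.shape)
```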

Now that we’ve looked at some examples of “bad data,” let’s look at some examples of “bad algorithms.”

5. Overfitting the training data

If you were traveling abroad and got overcharged by a taxi driver, you might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something we humans do all the time, and unfortunately, machines can easily fall into the same trap if we are not careful. In machine learning this is called overfitting: the model performs well on the training data but generalizes poorly. The figure below shows a high-degree polynomial life satisfaction model that overfits the training data. It does much better on the training data than the simple linear model, but can you really trust its predictions?

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or too small (which introduces sampling noise), the model is likely to detect patterns in the noise itself. Obviously, these patterns will not generalize to new instances. For example, suppose you feed the life satisfaction model many more attributes, including uninformative ones such as the country's name. In that case, a complex model might detect a pattern like this: in the training data, every country with a “w” in its name, such as New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5), has a life satisfaction greater than 7. How confident are you that this w-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or merely the result of noise.

Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. Possible solutions include:

  • Simplify the model: select a model with fewer parameters (for example, a linear model rather than a high-degree polynomial model), reduce the number of attributes in the training data, or constrain the model.
  • Collect more training data.
  • Reduce noise in training data (for example, fix data errors and eliminate outliers).

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, the linear model we defined earlier has two parameters, θ0 and θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If we forced θ1 = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data: all it could do is move the line up or down to get as close as possible to the training instances, so it would most likely end up around the mean. A very simple model indeed! If we allow the algorithm to modify θ1 but force it to keep it small, then the algorithm effectively has somewhere between one and two degrees of freedom, and the model will be somewhat simpler than one with two degrees of freedom and somewhat more complex than one with just one. You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to generalize well.

The following figure shows three models. The dotted line represents the original model trained on the countries shown as circles (without the countries shown as squares), the dashed line is a second model trained on all countries (circles and squares), and the solid line is a model trained on the same data as the first model but with a regularization constraint. As you can see, regularization forces the model to have a smaller slope: the model does not fit the training data (circles) as well as the first model, but it actually generalizes better to the new instances (squares) that it did not see during training.

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of the learning algorithm, not of the model, so it is not affected by the learning algorithm itself; it must be set before training and remains constant during training. If you set the regularization hyperparameter to a very large value, you get an almost flat model (a slope close to zero); the learning algorithm is then almost certain not to overfit the training data, but it is also less likely to find a good solution. Tuning hyperparameters is an important part of building a machine learning system. A minimal regularization sketch follows the figure below.

Regularization reduces the risk of overfitting
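
Here is a minimal sketch of regularization in code, assuming scikit-learn: ridge regression adds a penalty on the slope, and its `alpha` hyperparameter controls how strong the constraint is. The data is synthetic and only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.rand(20, 1) * 10                      # a single GDP-like feature
y = 4 + 0.5 * X.ravel() + rng.randn(20)       # noisy linear target

models = [
    ("no regularization", LinearRegression()),
    ("ridge, alpha=1", Ridge(alpha=1.0)),
    ("ridge, alpha=1e5", Ridge(alpha=1e5)),   # huge alpha forces the slope toward 0 (almost flat model)
]
for name, model in models:
    model.fit(X, y)
    print(f"{name:>18}: slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}")
```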

6. Underfitting the training data

As you might have guessed, underfitting is the opposite of overfitting. It usually happens when your model is too simple for the underlying structure of the data. For example, a linear model of life satisfaction is underfitting: reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the examples used for training.

The main ways to fix this problem are as follows (a small sketch follows the list):

  • Choose a more powerful model with more parameters.
  • Provide better feature sets for learning algorithms (feature engineering).
  • Reduce the constraints on the model (for example, reduce the regularization hyperparameter).
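
As a sketch of the first remedy (choosing a more powerful model), the snippet below fits a plain linear model and a quadratic model to synthetic quadratic data; it assumes scikit-learn and purely illustrative data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = np.sort(rng.rand(100, 1) * 6 - 3, axis=0)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.randn(100)   # quadratic ground truth + noise

linear = LinearRegression().fit(X, y)                       # too simple: underfits
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LinearRegression()).fit(X, y)     # more powerful model

print("linear    R^2:", round(linear.score(X, y), 3))
print("quadratic R^2:", round(quadratic.score(X, y), 3))
```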

7. Summary

Now we know a little bit about machine learning. Let’s step back for a moment and see the big picture:

  • Machine learning is about making machines get better at some task by learning from data, rather than having to code rules explicitly.
  • There are many types of machine learning systems: supervised and unsupervised, batch and online, instance-based and model-based, and so on.
  • In a machine learning project, you gather data in a training set and feed it to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself) and can then hopefully make good predictions on new cases as well. If the algorithm is instance-based, it learns the examples by heart and generalizes to new instances by using a similarity measure to compare them to the learned instances. (A small sketch contrasting the two follows this list.)
  • If the training set has too little data or is not representative enough, contains too much noise or is contaminated by irrelevant features (garbage in, garbage out), then the system will not work well. Finally, your model should neither be too simple (which leads to an underfit) nor too complex (which leads to an overfit).
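
The following is a minimal sketch, assuming scikit-learn, that contrasts a model-based learner (linear regression, which learns parameters) with an instance-based learner (k-nearest neighbors, which memorizes the training examples); the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 1) * 10
y = 3 * X.ravel() + rng.randn(50)

model_based = LinearRegression().fit(X, y)                      # learns slope and intercept
instance_based = KNeighborsRegressor(n_neighbors=3).fit(X, y)   # remembers the training instances

X_new = np.array([[4.2]])
print("model-based prediction:   ", model_based.predict(X_new))
print("instance-based prediction:", instance_based.predict(X_new))
```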

The complete machine learning workflow

This section introduces the complete machine learning workflow from both a theoretical and a practical perspective.

Theory

Abstracting the task into a mathematical problem

Identifying the problem is the first step in machine learning. Training is usually very time-consuming, so the cost of random trial and error is high. Abstracting the task into a mathematical problem means being clear about what data we can obtain and whether the goal is a classification, regression, or clustering problem, and if it is none of these, whether it can be reduced to one of them.

Getting the data

The data determines the upper bound of what machine learning can achieve; the algorithm only approaches that bound as closely as possible. The data must be representative, otherwise the model will inevitably overfit. For classification problems, the class skew should not be too severe: the number of examples in different classes should not differ by several orders of magnitude. You should also assess the scale of the data, such as how many samples and how many features there are, to estimate memory consumption and judge whether the data will fit in memory during training. If it will not, consider improving the algorithm or using dimensionality-reduction tricks; if the data volume is still too large, consider a distributed approach. A minimal sketch of these checks is shown below.
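
A minimal sketch of those checks with pandas; the file name "train.csv" and the "label" column are hypothetical placeholders, not part of this article.

```python
import pandas as pd

df = pd.read_csv("train.csv")                      # hypothetical training file

print(df.shape)                                    # number of samples and features
print(df["label"].value_counts(normalize=True))    # class skew (hypothetical "label" column)
print(f"memory: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")
```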

Feature preprocessing and feature selection

Good data is only useful if good features can be extracted from it. Feature preprocessing and data cleaning are crucial steps that can significantly improve an algorithm's performance: normalization, discretization, factorization, missing-value handling, removing collinearity, and so on take up a great deal of time in a data mining project. These tasks are simple and reproducible, their returns are stable and predictable, and they are essential steps in machine learning. Selecting salient features and discarding non-salient ones requires the machine learning engineer to revisit the business domain again and again, and it has a decisive effect on the outcome: with good feature selection, even very simple algorithms can produce good, stable results. This relies on techniques for analyzing feature validity, such as correlation coefficients, the chi-square test, mutual information, conditional entropy, posterior probabilities, logistic regression weights, and so on. A minimal sketch follows.
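
Here is a minimal sketch of a couple of the steps named above (scaling, a correlation coefficient, and a chi-square test), assuming scikit-learn and pandas; the DataFrame, its column names, and the synthetic target are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
df = pd.DataFrame({
    "income": rng.rand(200) * 1e5,
    "n_purchases": rng.randint(0, 50, 200),
})
y = (df["income"] + rng.randn(200) * 1e4 > 5e4).astype(int)   # synthetic binary target

X_scaled = StandardScaler().fit_transform(df)                 # normalization
print("scaled means:", X_scaled.mean(axis=0).round(3))

print(df.corrwith(y))                                         # correlation of each feature with the target
chi2_scores, p_values = chi2(df[["n_purchases"]], y)          # chi-square test (needs non-negative features)
print("chi2:", chi2_scores, "p:", p_values)
```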

Training models and tuning

Until this step, we didn’t train with the algorithm we talked about above. Many algorithms can now be packaged into black boxes for human use. But the real test is to tweak the parameters of these algorithms to make the results better. This requires a deep understanding of how algorithms work. The deeper you understand it, the better you can identify the root of the problem and come up with good tuning solutions.
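
A minimal tuning sketch, assuming scikit-learn: a small grid search with cross-validation over a couple of random-forest hyperparameters on a built-in dataset. The grid values are arbitrary examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("best params:  ", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```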

Model diagnosis

How do you determine the direction of model tuning? That requires techniques for diagnosing the model. Judging overfitting versus underfitting is a crucial part of model diagnosis; common methods include cross-validation and plotting learning curves. The basic remedy for overfitting is to increase the amount of data and reduce model complexity; the basic remedy for underfitting is to improve the quantity and quality of features and increase model complexity. Error analysis is another crucial step: by inspecting the misclassified or high-error samples, you can analyze the causes of errors, whether they come from the choice of parameters or algorithm, from the features, or from the data itself, and so on. A diagnosed model needs to be tuned, and the newly tuned model needs to be diagnosed again; this is an iterative process of successive approximation that requires continual experimentation to reach an optimal state. A minimal diagnosis sketch follows.
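
A minimal diagnosis sketch, assuming scikit-learn: cross-validation scores plus a learning curve, whose train/validation gap helps distinguish overfitting from underfitting. The dataset and model are only examples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, learning_curve

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

print(f"CV accuracy: {cross_val_score(model, X, y, cv=5).mean():.3f}")

sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5)
)
# A large gap between training and validation scores suggests overfitting;
# low scores on both suggest underfitting.
print("train scores:", train_scores.mean(axis=1).round(3))
print("valid scores:", val_scores.mean(axis=1).round(3))
```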

Model fusion

Generally speaking, fusing several models improves the results, and it works well. In engineering, the main ways to improve an algorithm's accuracy are to work on the front end of the model (feature cleaning and preprocessing, different sampling schemes) and on the back end (model fusion), because these steps are more standardized and reproducible and their effect is more stable. Directly tweaking a single model yields less benefit, because training on large amounts of data is too slow and the results are hard to guarantee. A minimal fusion sketch follows.
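
A minimal model-fusion sketch, assuming scikit-learn: three different classifiers combined by soft voting and evaluated with cross-validation. The constituent models and dataset are only examples.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=5000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",        # average the predicted probabilities of the base models
)
print(f"ensemble CV accuracy: {cross_val_score(ensemble, X, y, cv=5).mean():.3f}")
```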

Deployment and running

This part is mainly about putting the project into production. Engineering is result-oriented, and how the model performs online directly determines its success or failure. This includes not only its accuracy and error, but also its running speed (time complexity), resource consumption (space complexity), and whether its stability is acceptable. This workflow is mainly a summary of engineering practice; not every project contains the complete process, and the steps here are only a guide. Only through more practice and more project experience can you gain a deeper understanding.

Practice

The code can be found in: Summary of the UnionPay “Credit User Overdue Prediction” Algorithm Contest.

Resources

blog.csdn.net/hzbooks/art…

ww2.mathworks.cn/discovery/m…

blog.csdn.net/tMb8Z9Vdm66…

datawhalechina.github.io/leeml-notes…

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (book): www.huaweicloud.com/articles/c9…