
Author: Li Chunxiao

Introduction:

"From entry to first model" very nearly became "from entry to giving up". This article describes an attempt to apply machine learning to an operations scenario: using a single model both to mine business rules and to detect anomalies. It is only a trial run, and whether it will hold up remains to be seen. I have tried it on several sets of business data and it appears effective, but I still worry about mistakes or pitfalls I have not accounted for. I introduce the model below; please point out any problems you see, and I welcome the discussion.

Background:

Business operations teams are responsible for core user-experience indicators. Our previous analysis was based on big-data statistics: we measured how key indicators performed across each dimension and combination of dimensions. For example, we can measure the speed (time consumption) of opening an APP under different network types, or the success rate of different command words. For a mobile APP business, experience tells us to consider these factors when analyzing an indicator: the APP version; dimensions specific to the indicator (for image download, say, file size and image type; for VOD, the video type, player type, and so on); and user attributes (network type, province, carrier, city). These dimensions jointly affect the key indicators, so which combinations of dimensions are necessarily good and which necessarily bad? Time-consumption indicators tend to follow a roughly normal distribution, and the long tail always exists and cannot be eliminated; should we pay attention to it? For command success rates, it is normal for some command words to have low success rates; should those raise alarms? In the past we suppressed such alarms by configuring exceptions in our monitoring. Is there a way to automatically tell the normal from the abnormal? With machine learning in full swing, it seemed worth a try.

Goal:

  1. Explore the underlying business rules (for continuous-value indicators such as time consumption, find the factors that cause the long tail)

  2. When monitoring service indicators, recognize and ignore known-normal conditions, generate alarms only for unexpected anomalies, and report the root cause of each anomaly.

What followed was a hard slog, from getting started to nearly giving up, before the first model finally came together. The biggest difficulty was that I had never written code, so I had to learn Python and machine-learning theory at the same time. On top of that weak foundation, I was too ambitious at the start: I wanted a general model that could serve different businesses and different indicators and solve both goals at once. Without a step-by-step learning path, I inevitably hit walls everywhere, solving one problem only to have to go back and relearn for the next. The good news is that it worked out in the end, and there is a lesson in it: when you have a big goal, set small goals first; with a clear head and steady steps, things go much more smoothly.

Let us go straight to the model, skipping all the detours along the way (there were too many, and some of the theory only made sense after I ran into the corresponding problems).

Basic idea:

1. Automatically capture business rules by learning and predicting business performance with the ET algorithm. Samples the model predicts correctly conform to the learned business rules; samples it mispredicts may be anomalies (note: may be, not necessarily);

2. Feed the results of step 1 into a decision tree (DT) for visual display: build a view of the latent business patterns from the correctly predicted samples, and detect anomalies from the mispredicted ones, showing the root cause.

Steps (using time consumption as an example):

1. Prepare two disjoint data sets, one for training and one for prediction

For example, for a video playback service: the dimensions (existing features such as version, device model, video source, video encoding type) together with the time-consumption data

2. Turn the target problem into a classification problem, either binary or multi-class as the case requires

Discretize the continuous indicator, aiming for three classes: "very good / 0", "average / 1", and "very bad / 2". Split the values at the deciles: take the lowest one (or two) deciles as the "very good" samples, the middle ones as "average", and the top one (or two) as the "very bad" samples. The "very bad" class here is the long tail of the quasi-normal distribution. In the figure below, the first column is the time interval (no manually defined threshold; the boundaries are obtained automatically) and the second column is the sample count.
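The decile split described above can be sketched with `pandas.qcut`; the latency values here are made-up placeholders for real business logs.

```python
import pandas as pd

# Hypothetical latency samples in ms; in practice these come from business logs.
latency = pd.Series([80, 95, 110, 120, 130, 140, 150, 160, 300, 2500,
                     90, 105, 115, 125, 135, 145, 155, 170, 400, 3000])

# Split into 10 quantile bins, then map the bottom decile to "very good / 0",
# the top decile to "very bad / 2" (the long tail), the rest to "average / 1".
deciles = pd.qcut(latency, 10, labels=False, duplicates="drop")
label = deciles.map(lambda d: 0 if d == 0 else (2 if d == deciles.max() else 1))

print(label.value_counts().sort_index())
```

No threshold is hand-picked; the bin boundaries fall out of the data automatically, which matches the "no manually defined threshold" note above.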

3. Feature processing

3.1 Numerical characteristics

Two kinds of problems arise here, but they are handled the same way:

(1) Text must be converted to numbers

(2) Unordered values must have their numeric ordering stripped away. Appid, for example, is inherently unordered, and the algorithm should not be allowed to conclude that 65538 > 65537.

Method: one-hot encoding. If sex has three values (boy, girl, unknown), convert it into three features: sex==boy, sex==girl, sex==unknown; set each to 1 when the condition holds, otherwise 0.

There are three ways to implement this: roll your own; sklearn's OneHotEncoder; or pandas' get_dummies method.

After one-hot encoding, the number of features explodes. For device models alone, thousands of dimensions are added after processing. It is worth considering, case by case, whether a feature really needs such fine-grained treatment.
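A minimal sketch of the pandas route, using the sex example from the text:

```python
import pandas as pd

# Toy sample: "sex" is categorical; get_dummies expands it into one 0/1
# column per value, so the model sees no spurious ordering between values.
df = pd.DataFrame({"sex": ["boy", "girl", "unknown", "boy"]})
encoded = pd.get_dummies(df, columns=["sex"])

print(encoded.columns.tolist())  # ['sex_boy', 'sex_girl', 'sex_unknown']
```

The same applies to Appid or device model, which is exactly how a few raw columns blow up into thousands of dimensions.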

3.2 Feature dimension reduction

Whether you need dimensionality reduction depends on the situation. I reduced dimensions here because there were too many features; without reduction, the resulting tree is too large to highlight the key factors.

Dimensionality reduction means extracting the features that have a key influence on the result and discarding unimportant and redundant information. For the theory, see: http://sklearn.lzjqsdd.com/modules/feature_selection.html

Here, ET's feature_importances_ is used for the reduction, bringing the data down from 5000+ dimensions to about 300.
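One way to do this selection, sketched on synthetic data standing in for the one-hot matrix (the sizes here are illustrative, not the article's real 5000+ dimensions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the wide one-hot-encoded matrix.
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=10, random_state=0)

# Fit ET, then keep only the top-30 features by feature_importances_.
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
selector = SelectFromModel(et, prefit=True, max_features=30, threshold=-1.0)
X_reduced = selector.transform(X)

print(X_reduced.shape)  # (500, 30)
```

`threshold=-1.0` disables the importance cutoff so that `max_features` alone decides how many features survive.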

4. Train a classification model with the ET algorithm (ExtraTreesClassifier, a variant of random forest)

4.1 Choosing metrics to evaluate the model

For a classification algorithm, the first metric that comes to mind is accuracy, but it is useless when the classes are imbalanced. Suppose we have a binary (success/failure) scenario in which 98% of samples succeed. Feeding such samples directly into training is bound to overfit: the model simply ignores the failures and predicts everything as success. Accuracy is 98%, yet the model is worthless. So what should we use instead?

roc_auc_score works for binary classification; confusion_matrix and classification_report work for multi-class.
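A small sketch of why accuracy misleads and what the alternatives report, using the made-up 98%-success scenario from above:

```python
from sklearn.metrics import roc_auc_score, confusion_matrix

# 98 successes (0) and 2 failures (1); a model that always predicts "success".
y_true = [0] * 98 + [1] * 2
y_all0 = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_all0)) / len(y_true)
print(accuracy)  # 0.98, yet the model detects no failures at all

# roc_auc_score takes scores; a constant score gives AUC 0.5, i.e. no better
# than chance, which exposes the degenerate model that accuracy hides.
print(roc_auc_score(y_true, [0.5] * 100))  # 0.5

# For multi-class, confusion_matrix shows exactly which class leaks where.
cm = confusion_matrix([0, 1, 2, 2, 1], [0, 1, 1, 2, 1])
print(cm)
```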

4.2 Treatment of sample imbalance

In this example, the counts of class 0 and class 2 are clearly small, while class 1 is the big head. To avoid overfitting toward class 1, classes 0 and 2 can be over-sampled.

Two approaches are commonly used:

(1) Directly duplicate the minority samples

(2) SMOTE over-sampling algorithm (details omitted)

I tried both methods and finally chose SMOTE, although on this data the difference was not clear-cut.

Over-sampling the minority classes solves the overfitting toward the majority class, but introduces overfitting toward the minority classes. Here, though, that is exactly what the model needs: we just want to find the "very good" and "very bad" parts, and pay extra attention to anomaly detection among the middling ones. Overfitting need not always be feared; it can be put to use. In a "sick" vs. "not sick" classification, for example, we would rather detect more of the "sick". In the classification report shown below, what we exploit is the high recall on the minority classes (0 and 2), i.e., we just need to find them all; for the majority class, we rely on high precision for anomaly detection.

4.3 Model parameter selection

Sklearn provides a ready-made GridSearchCV that shows how the model performs under different parameter combinations. For tree algorithms, the commonly tuned parameters are the depth and the number of features; forest algorithms add the number of trees.

The max_depth parameter deserves special attention: large depths overfit easily. A common rule of thumb is to keep it within 15.
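A sketch of that search over the parameters just mentioned (the grid values here are illustrative choices, with max_depth capped at 15 per the rule of thumb):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Depth, features per split, and (forest-specific) number of trees.
param_grid = {
    "max_depth": [5, 10, 15],
    "max_features": ["sqrt", 0.5],
    "n_estimators": [50, 100],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```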

4.4 After training, predict on the test data and extract the correctly and incorrectly predicted samples of each class:

Correctly predicted part: take the samples predicted as 0 or 2 that are actually 0 or 2;

Incorrectly predicted part: take the samples predicted as 0 or 1 that are actually 2 (adjust as the situation requires)
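The two extractions above can be sketched as boolean masks; the labels below are hypothetical placeholders for real test data and model output:

```python
import numpy as np

# Hypothetical true labels and model predictions over the three classes.
y_true = np.array([0, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 0, 2, 1, 1, 1, 0])

# Correctly predicted extremes: predicted 0 or 2 AND prediction matches truth.
hit_mask = np.isin(y_pred, [0, 2]) & (y_pred == y_true)
# Misses worth alarming on: predicted 0 or 1 but actually 2.
miss_mask = np.isin(y_pred, [0, 1]) & (y_true == 2)

print(np.where(hit_mask)[0])   # indices of rule-conforming samples
print(np.where(miss_mask)[0])  # indices of candidate anomalies
```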

5. Feed the results into a decision tree for visual display, for business rule mining and anomaly detection respectively

Here the DT algorithm is used only for presentation, to separate the different kinds of data. If necessary, parameters such as min_samples_leaf and min_impurity_decrease should still be set to highlight the key information.

You can also print the desired decision paths via the tree_ object built into DecisionTreeClassifier.
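A sketch of the display step, using the iris dataset as a stand-in for the hit/miss samples:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A small tree for display only: min_samples_leaf / min_impurity_decrease
# prune noise so the printed rules highlight the key splits.
dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                            min_impurity_decrease=0.01,
                            random_state=0).fit(X, y)
print(export_text(dt))

# The fitted tree_ object exposes the raw structure for custom path printing.
print(dt.tree_.node_count, dt.tree_.feature[0], dt.tree_.threshold[0])
```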

Examples follow:

5.1 Mining business rules

In the VOD scenario, the correctly predicted 0 and 2 samples are fed into the DT, as shown in the figure below. The latent business rules are found automatically, and cross-checking against big-data statistics confirms the conclusions. The tree's data is relatively pure, because the samples fed into it can be assumed to follow definite rules.

5.2 Anomaly detection

The model in this paper is still at the research stage. Rather than using real anomaly data from production, we manually inject anomalies into one dimension (or combination of dimensions) of the test data to verify the effect.

A success-rate indicator can be split into two or three classes depending on tolerance.

Binary: pick a threshold, say 99%. Below 99% is class 2 (abnormal); otherwise class 0 (normal). The drawback is that if a dimension's success rate sits below 99% for a long time, say at 98%, a sudden further drop is still labeled the same way, treated as its known state, and no alarm is generated.

Three classes: above 99% is 0, 96-99% is 1, below 96% is 2. This is more flexible, and the three classes map to levels of importance: ignore, watch, focus.
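The three-way split can be sketched as a simple binning function; the thresholds come from the text, but should be tuned per business:

```python
def rate_label(success_rate: float) -> int:
    """Tolerance bands for a success-rate indicator:
    >= 99% -> 0 (normal, ignore), 96-99% -> 1 (degraded, watch),
    < 96% -> 2 (abnormal, focus)."""
    if success_rate >= 0.99:
        return 0
    elif success_rate >= 0.96:
        return 1
    return 2

print([rate_label(r) for r in (0.995, 0.97, 0.95)])  # [0, 1, 2]
```

Unlike the binary scheme, a dimension that habitually sits at 98% lands in class 1, so a further drop below 96% moves it to class 2 and can still trigger an alarm.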

Below is a binary-classification example (with the IPH client manually set as the anomaly):

Finally: this is only a small attempt. To put it online as a platform, many factors still need consideration, the first being model updates (update on a schedule? avoid training on periods when anomalies occurred?). That will be tried in the next stage.

Further reading

Automatic extension and combination of discrete features in machine learning


Published by Tencent Cloud Community with the author's authorization. Please cite the source when reproducing. Original link: https://cloud.tencent.com/community/article/477670