This article describes how to improve the performance of machine learning models through iterative active learning. The technique can be applied to any model, but here it is illustrated with a binary text classifier. All of the content below is based on the Microsoft Strata Data Conference 2018 tutorial, Using R and Python for Extensible Data Science, Machine Learning, and Artificial Intelligence.
Code: https://github.com/hsm207/Strata2018/tree/blog
Methods
The data set
The concept of active learning is illustrated by building a binary text classifier on the Wikipedia Detox dataset to detect whether a comment constitutes a personal attack.
Dataset: https://meta.m.wikimedia.org/wiki/Research:Detox/Data_Release
The training set has 115,374 labeled examples. This training set is divided into three parts: an initial training set, an "unlabeled" training set, and a test set.
In addition, the labels are evenly balanced in the initial training set, whereas only 13% of the labels in the test set are 1.
The training set is split this way to simulate a realistic situation: you have 10,285 high-quality labeled examples and must decide which of the 105,089 "unlabeled" examples to annotate in order to obtain more training data for the classifier. Because annotating data is expensive, the challenge is to identify the examples that are most useful for model performance.
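To make the setup concrete, here is one plausible way to carve such a split out of a labeled pool, assuming the data sits in a pandas DataFrame with a binary label column; the exact split used in the tutorial repository may differ.

```python
# Illustrative sketch only: a class-balanced "initial" set plus an "unlabeled"
# pool, roughly matching the sizes described above. Column and variable names
# are assumptions, not the tutorial's actual code.
import pandas as pd

def make_splits(df: pd.DataFrame, initial_size: int = 10_285, seed: int = 42):
    per_class = initial_size // 2
    # Sample an (approximately) label-balanced initial training set.
    initial = (
        df.groupby("label", group_keys=False)
          .apply(lambda g: g.sample(per_class, random_state=seed))
    )
    # Everything else becomes the "unlabeled" pool: its labels are hidden from
    # the model and only revealed when an example gets "annotated".
    unlabeled = df.drop(initial.index)
    return initial, unlabeled
```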
For the unlabeled training set, active learning is a better sampling method than random sampling.
Finally, GloVe word embeddings are used to convert each comment into a 50-dimensional embedding.
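As an illustration of this featurization step, here is a minimal sketch that averages pre-trained 50-dimensional GloVe vectors over the words of a comment; the file name and tokenization are assumptions, and the tutorial's actual preprocessing may differ.

```python
# Minimal sketch: represent a comment as the mean of its words' GloVe vectors.
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    # Each line of a GloVe text file is: word followed by its vector components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed_comment(comment, vectors, dim=50):
    words = [w for w in comment.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[w] for w in words], axis=0)
```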
Sampling method
The sampling method used is a combination of uncertainty sampling and cluster-based (diversity) sampling. It works as follows:
1. Randomly select 1000 samples from the unlabeled training set.
2. Cluster these 1,000 samples, using Euclidean distance as the distance measure (this is the clustering part).
3. Cut the clustering output into 20 groups.
4. From each group, select the sample with the maximum entropy (http://www.di.fc.ul.pt/~jpn/r/maxent/maxent.html), i.e., the observation the model is most uncertain about (this is the uncertainty-sampling part).
This simulates a situation where only 20 high-quality labels can be obtained at a time; for example, a radiologist may only be able to review 20 medical images a day. The entire unlabeled training set is not scored because computing entropy requires running the model for inference, which can take a long time on a large dataset.
The samples are clustered in order to maximize the diversity of the labeled examples. For instance, if you simply picked the 20 highest-entropy samples out of the 1,000, you could end up with very similar examples if they sit close together. In that case, it is better to choose only one example from that group and the rest from other groups, because diverse examples help the model learn more.
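Putting the four steps together, here is a sketch of one sampling iteration, using scikit-learn's agglomerative clustering for the grouping step; the function and parameter names are illustrative rather than the tutorial's actual code.

```python
# Sketch of the sampling step: score 1,000 random unlabeled examples by entropy,
# cluster them into 20 groups for diversity, and take the most uncertain example
# from each group.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_batch(model, X_unlabeled, pool_size=1000, n_groups=20, seed=0):
    rng = np.random.default_rng(seed)
    candidate_idx = rng.choice(len(X_unlabeled), size=pool_size, replace=False)
    X_pool = X_unlabeled[candidate_idx]

    # Binary entropy of the model's predicted probabilities (uncertainty part).
    p = model.predict_proba(X_pool)[:, 1]
    eps = 1e-12
    entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

    # Ward-linkage clustering (Euclidean distance), cut into 20 groups (diversity part).
    groups = AgglomerativeClustering(n_clusters=n_groups).fit_predict(X_pool)

    # From each group, keep the single most uncertain example.
    chosen = [
        candidate_idx[np.where(groups == g)[0][np.argmax(entropy[groups == g])]]
        for g in range(n_groups)
    ]
    return np.array(chosen)
```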
Model
The classifier is built with FastTrees, using the 50-dimensional comment embeddings as input. FastTrees is an implementation of FastRank, a variant of the gradient boosting algorithm.
More details: https://docs.microsoft.com/en-us/machine-learning-server/r-reference/microsoftml/rxfasttrees
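The experiment itself uses MicrosoftML's FastTrees; for readers without that package, the sketch below shows the equivalent training step with scikit-learn's gradient boosting as a stand-in, not the actual rxFastTrees call.

```python
# Stand-in training step: a gradient boosted tree classifier on the
# 50-dimensional GloVe comment embeddings.
from sklearn.ensemble import GradientBoostingClassifier

def train_model(X_train, y_train):
    # X_train: (n_samples, 50) comment embeddings; y_train: 0/1 attack labels.
    model = GradientBoostingClassifier()
    return model.fit(X_train, y_train)
```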
Evaluation metric
Because the test set is imbalanced, AUC is used as the main evaluation metric.
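For reference, this is the corresponding evaluation step with scikit-learn's roc_auc_score, an illustrative stand-in for however the tutorial computes AUC.

```python
# Evaluate the classifier with AUC on the (imbalanced) test set.
from sklearn.metrics import roc_auc_score

def evaluate(model, X_test, y_test):
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```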
Implementation details
The following chart illustrates the role of active learning in this experiment:
First, a model is trained on the initial training set. Then the model, together with the sampling method described above, is used to identify the 20 comments in the unlabeled training set whose classification the model is least confident about, and those 20 comments are manually annotated. The initial training set is then extended with the newly annotated examples and the model is retrained from scratch. This is the active learning part of the experiment. This extend-and-retrain step is repeated for 20 iterations, and the model's performance on the test set is evaluated at the end of each iteration.
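The loop below sketches this procedure end to end, reusing the helper functions sketched earlier (train_model, select_batch, evaluate); in this simulation, "manual annotation" simply reveals the held-back labels of the selected examples.

```python
# Sketch of the active learning loop described above.
import numpy as np

def active_learning(X_init, y_init, X_unlab, y_unlab, X_test, y_test,
                    n_iterations=20, batch_size=20):
    X_train, y_train = X_init.copy(), y_init.copy()
    model = train_model(X_train, y_train)              # train on the initial set
    aucs = []
    for _ in range(n_iterations):
        # Pick the 20 most informative unlabeled examples with the current model.
        idx = select_batch(model, X_unlab, n_groups=batch_size)
        # "Manually annotate" them (here: reveal held-back labels) and extend the training set.
        X_train = np.vstack([X_train, X_unlab[idx]])
        y_train = np.concatenate([y_train, y_unlab[idx]])
        X_unlab = np.delete(X_unlab, idx, axis=0)
        y_unlab = np.delete(y_unlab, idx)
        model = train_model(X_train, y_train)           # retrain from scratch
        aucs.append(evaluate(model, X_test, y_test))    # evaluate at the end of each iteration
    return aucs
```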
Results
For comparison, the initial training set is also iteratively extended by randomly selecting 20 examples from the unlabeled training set at each iteration. The figure below compares the active learning method (active) and random sampling (random) on several metrics as a function of training set size (tss).
Random sampling appears to do better at first. However, once the training set size reaches about 300, the active learning method begins to substantially exceed random sampling in AUC.
In practice, the initial training set may continue to be extended until the ratio of model improvements (such as increases in AUC) to the cost of annotation falls below a pre-determined threshold.
The verification results
To check that the results are not a coincidence, the random sampling procedure is simulated 100 times (each run for 20 iterations), and the number of runs in which random sampling produces a higher AUC than active learning is counted. Only one of the 100 simulations beat active learning, which suggests that the advantage of active learning is statistically significant at the 5% level. Finally, the average AUC difference between random sampling and active learning is -0.03, i.e., random sampling is on average 0.03 AUC worse.
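Assuming the final AUC of each of the 100 random-sampling runs has been collected, the significance check described above reduces to counting how often random sampling beats active learning; a minimal sketch:

```python
import numpy as np

def empirical_exceedance_rate(active_auc, random_final_aucs):
    # Fraction of random-sampling simulations whose final AUC exceeds the
    # active-learning AUC; the post reports 1 out of 100, i.e. 0.01.
    return float(np.mean(np.asarray(random_final_aucs) > active_auc))
```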
Conclusion
When there is a large amount of unlabeled data and a limited annotation budget, active learning can be used to decide which unlabeled examples to annotate manually, so that model performance is maximized within the constraints of the given budget.