Diverse Mini-Batch Active Learning: A Reproduction Exercise
Alexandre Abraham
In our previous post, Active Learning Toolkit at a Glance, we gave an overview of the most basic active learning methods, the common Python libraries associated with them, and the more advanced methods they implement.
In this post, we introduce some more recent approaches and take the opportunity to reproduce the results of the 2019 paper Diverse Mini-Batch Active Learning. Readers can refer directly to that paper for a complete and detailed description of all the methods.
The importance of diversity
In the previous article, I introduced informativeness-based approaches, which use class prediction probabilities to select the most ambiguous samples, and density-based approaches, which use the distribution of the data to select the most representative samples.
Uncertainty and representativeness are two of the three criteria that active learning seeks to optimize:
- Diversity. Diversity is a pure exploration criterion. A method that optimizes only diversity is likely to pick samples across the whole feature space, and is therefore also more likely to pick outliers.
- Representativeness. A sample is said to be representative if many other samples are similar to it under a given similarity measure. A typical representative sample is the center of a large cluster of samples.
- Uncertainty. Samples that the model finds hard to classify.
Diversity and representativeness are often optimized together because both make use of the distribution of the unlabeled data, and a clustering method can optimize both at once. This is why the author chose k-means clustering for his diverse query sampling method. Another strategy, proposed in [1], is based on a similarity matrix and gathers all the criteria in a single loss function.
Diversity is necessary to compensate for the lack of exploration of uncertainty-based approaches, which focus on the areas close to the decision boundary.
Figure 1. In this small example, the colored samples are labeled and the current classifier boundary is shown in purple. Uncertainty-based methods tend to select samples in the red region, close to the classifier boundary. Representativeness-based methods tend to select samples in the purple region, where the sample density is higher. Only diversity-based approaches tend to explore the blue region.
Figure 1 shows a simple active learning experiment on binary classification (squares vs. triangles). It illustrates how methods based only on uncertainty or representativeness can miss entire regions of the feature space, and thus fail to build a model that generalizes well to unseen data.
To sum up, clustering can be seen as a natural partition of the feature space: each cluster is made of similar samples, so a clustering method not only separates distinct regions of the feature space but also identifies a representative sample within each cluster.
The diverse mini-batch active learning strategy
The diverse mini-batch active learning method combines uncertainty and diversity when selecting the next K samples to label:
- First, β * K samples are pre-selected with a smallest-margin sampler [2], β being the only parameter of the method.
- Then K samples are chosen among the pre-selected ones, using either a submodular function optimizer (Submodular(β)), k-means clustering (Clustered(β)), or informativeness-weighted k-means clustering (WClustered(β)). The paper's experiments use β = 10 and β = 50.
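To make the procedure concrete, here is a minimal sketch of the weighted variant WClustered(β) on top of a scikit-learn style classifier, assuming a dense feature matrix. The helper names and the choice of keeping, in each cluster, the candidate closest to the centroid are our own illustration, not the paper's reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def margin_informativeness(probas):
    """1 - (p_top1 - p_top2): higher values mean the model is less certain."""
    p = np.sort(probas, axis=1)
    return 1.0 - (p[:, -1] - p[:, -2])

def diverse_mini_batch(clf, X_pool, k, beta=10, seed=0):
    """Pre-select beta*k samples by smallest margin, then run
    informativeness-weighted k-means and keep one sample per cluster."""
    info = margin_informativeness(clf.predict_proba(X_pool))
    candidates = np.argsort(info)[-beta * k:]           # beta*k most uncertain samples
    X_cand, w_cand = X_pool[candidates], info[candidates]
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(X_cand, sample_weight=w_cand)                # informativeness-weighted k-means
    picked = []
    for c in range(k):                                  # closest candidate to each centroid
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X_cand[members] - km.cluster_centers_[c], axis=1)
        picked.append(candidates[members[np.argmin(dists)]])
    return np.array(picked)
```

For Clustered(β), the sample_weight argument is simply dropped; Submodular(β) replaces the k-means step with a submodular optimizer, as sketched below.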
Note: The paper mentions that submodular function optimization is the principled way to address the diversity problem, and that k-means clustering is used as a proxy for it. The topic is too complex for a simple blog post, so we refer curious readers to the original paper and the Wikipedia page.
We decided to compare the submodular function optimizer from the apricot Python library with the k-means clustering implementation from scikit-learn.
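Continuing the sketch above, the two optimizers can be swapped in roughly as follows. The apricot class name, arguments, and the ranking attribute follow its documented API as we understand it, so treat this as an assumption rather than a verified snippet:

```python
from apricot import FacilityLocationSelection
from sklearn.cluster import KMeans

# Submodular(beta): facility-location selection of k samples among the
# beta*k candidates (X_cand and candidates come from the sketch above).
selector = FacilityLocationSelection(k, metric='euclidean')
selector.fit(X_cand)
submodular_picks = candidates[selector.ranking]

# Clustered(beta): the same k-means step as above, without sample weights,
# again keeping the candidate closest to each centroid.
km_plain = KMeans(n_clusters=k, n_init=10).fit(X_cand)
```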
Why did we choose this paper?
There is a large selection of active learning papers to choose from. We selected this one based on the following criteria:
Adequate validation. We noticed that many active learning papers only test their methods on small OpenML datasets that do not contain enough samples for a realistic active learning scenario; typically, these papers reach 0.95 accuracy with just 150 samples. We therefore wanted to reproduce a paper that uses larger datasets, and this one, which uses 20 Newsgroups, MNIST and CIFAR10, seemed complex enough while remaining reproducible without a lot of resources.
A simple and elegant method. Most recent papers focus on active learning for domain-specific deep models; some are even tied to a particular model architecture. We wanted an approach simple enough to be understandable and reproducible. The method proposed in this paper is based on k-means clustering, a widely used clustering algorithm with an intuitive principle.
Optimizes both uncertainty and diversity. Most modern active learning methods optimize for uncertainty together with exploration of the feature space, that is, diversity. The method proposed in this paper uses uncertainty both to pre-select samples and to weight the k-means clustering, which naturally combines the two criteria.
Note: In the paper, the author compares his approach to a framework called FASS. For simplicity, we do not reproduce those results in this blog post. We also decided to use Keras instead of MXNet, used in the paper, to see whether the results could be reproduced with another framework.
Erratum: Calculation of uncertainty
In this paper, the author defines the informativeness of a sample as the following quantity, to be maximized.
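Following the margin definition of [2], this is the gap between the two highest predicted class probabilities (our reconstruction of the formula; the paper's exact notation may differ):

$$\mathrm{margin}(x) = P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x)$$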
Here, ŷ₁ is the most probable label and ŷ₂ the second most probable one. As recommended in [2], this margin should be minimized, not maximized; we experimentally confirmed that maximizing it does not perform better than random sampling. We assume the author actually uses the usual complementary form:
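With the same notation, this amounts to scoring each sample by

$$\mathrm{info}(x) = 1 - \bigl(P(\hat{y}_1 \mid x) - P(\hat{y}_2 \mid x)\bigr),$$

which is largest when the two top class probabilities are closest; this complementary form is our assumption, not a formula taken from the paper.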
20 Newsgroups dataset experiment
The 20 Newsgroups task is to determine which of 20 discussion groups an article comes from. For this experiment, we used exactly the same configuration as described in the paper:
- Preprocessing uses CountVectorizer(ngram_range=(1, 2), max_features=250000) followed by TfidfTransformer(use_idf=True), which is equivalent to TfidfVectorizer(ngram_range=(1, 2), max_features=250000, use_idf=True, dtype=np.int64), and the classifier is a multinomial logistic regression (a runnable sketch follows this list).
- We started with 100 labeled samples and added batches of 100 up to 1,000 samples.
- One difference: because of the high dimensionality of the data, we used MiniBatchKMeans for Clustered(50) and WClustered(50) to speed up the experiment.
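For reference, a minimal version of this preprocessing and classifier setup; max_iter is our own choice since the paper does not specify it:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Vectorizer and classifier as described above.
train = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=250000, use_idf=True)
X_train = vectorizer.fit_transform(train.data)
y_train = np.array(train.target)

clf = LogisticRegression(max_iter=1000)  # multinomial by default with the lbfgs solver
# clf.fit(X_train[labeled_idx], y_train[labeled_idx])  # fit on the labeled subset only
```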
Conclusion: In this experiment we did not manage to reproduce the results of the original paper, but we did not obtain results that contradict it either. In particular, random sampling seemed to reach an accuracy of 0.68 on 1,000 samples in the original experiment, while the other methods were around 0.70. In the end, given the high variability of the results, we cannot assert that one method is better than another.
MNIST dataset experiment
The paper presents two experiments on the MNIST dataset; we reproduce the first one here, an active learning experiment on digit recognition. The active learning loop starts with 100 samples and grows in increments of 100 up to 1,000 samples (out of 60,000 in total), with a best accuracy of about 0.93. The paper comes with a Python notebook containing the model setup, but we failed to reproduce the same results with that setup:
- We used Keras instead of MXNet to build the neural network, but the architecture is the same: dense layers of 128 and 64 units with ReLU activations, followed by a final layer of 10 neurons with softmax (see the sketch after this list).
- With the settings given in the paper, an SGD optimizer with a learning rate of 0.10, we could not reach an accuracy above 0.90. We switched to Adam with a learning rate of 0.01 and obtained comparable performance.
- We used 60,000 samples as the training set and 10,000 as the test set, as described in the paper, but ran 20 repetitions instead of 16.
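A minimal Keras version of this network; the flattened 784-dimensional input and the loss are our assumptions about details not spelled out above, and the Adam learning rate of 0.01 is the value we settled on, not the paper's:

```python
from tensorflow import keras

# The network described above: 128 and 64 ReLU units, then a 10-way softmax.
model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```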
The most surprising result of this experiment is that our confidence intervals are significantly larger than those of the original paper, even though we used the same confidence interval computation and ran more repetitions (20 instead of 16).
At the same time, we observed that random sampling was significantly worse than all active learning methods. The purely uncertainty-based approach is in places worse than the clustering-based approaches, but not by much; since it is easier to set up, it may still be a worthwhile option. Finally, the figure in the original paper shows only slight differences between the clustering methods at around 400 samples; we were not able to replicate this difference, nor did we observe similar behavior at that sample size.
Among the three experiments, we paid particular attention to this one because MNIST is the experiment where the differences between sampling methods are the most visible. The biggest difference between our setup and the original was the model implementation, so we decided to replicate the experiment with something else: scikit-learn's multilayer perceptron. We also wanted to check that the uncertainty measure we chose, smallest margin, would perform better than least confidence, as the paper claims. A sketch of this second setup follows, and the results are shown in Figure 4.
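A minimal sketch of this replacement setup; the hidden layer sizes mirror the Keras network and every other hyperparameter is left at the scikit-learn default:

```python
from sklearn.neural_network import MLPClassifier

# Replacement model for the second MNIST run.
mlp = MLPClassifier(hidden_layer_sizes=(128, 64))

def least_confidence(probas):
    """Higher = more informative: the model's top predicted probability is low.
    The smallest-margin score is margin_informativeness() from the first sketch."""
    return 1.0 - probas.max(axis=1)
```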
Figure 4 shows results much closer to the original paper. Smallest-margin sampling performs significantly better than least-confidence sampling. When the labeled set contains between 300 and 800 samples, the weighted clustering method is significantly better than the other methods. The confidence intervals are also much smaller, although not as narrow as in the paper. One exciting finding in this figure is that the weighted clustering method seems to behave consistently regardless of the informativeness measure used for pre-selection and weighting.
Conclusion: Despite some minor differences, both implementations confirm the original claim: the diversity-based approaches are more effective than the purely uncertainty-based one, and, interestingly, both are more effective than random sampling.
CIFAR 10 dataset experiment
This is an image classification experiment on the CIFAR 10 dataset. Since this task requires a larger training set, active learning starts with 1,000 samples and grows by increments of 1,000 up to 10,000 samples (out of 50,000 in total), reaching an accuracy of about 0.60. Our configuration for this experiment was:
- We used Keras instead of MXNet, and ResNet 50 V2 instead of ResNet 34 V2, because the latter is not provided natively by Keras (a sketch follows this list). Since the paper does not specify an optimizer, we trained for 3 epochs with RMSprop.
- We used 50,000 training samples and 10,000 test samples, consistent with the paper.
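A sketch of the model we used, assuming ResNet50V2 from keras.applications trained from scratch; the classification head and the default RMSprop settings are our reading of the setup, not the paper's code:

```python
from tensorflow import keras

# ResNet50V2 trained from scratch on CIFAR 10 with a small softmax head.
backbone = keras.applications.ResNet50V2(include_top=False, weights=None,
                                         input_shape=(32, 32, 3), pooling='avg')
model = keras.Sequential([backbone, keras.layers.Dense(10, activation='softmax')])
model.compile(optimizer=keras.optimizers.RMSprop(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_labeled, y_labeled, epochs=3)  # 3 epochs per active learning iteration
```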
Conclusion: Figure 5 shows that our confidence intervals are once again slightly larger than the original ones. The most striking result is that our accuracy is much higher than in the original paper: 0.8 instead of 0.6. This difference is probably caused by the different network architectures used in the two experiments. Once again, however, we observed that active learning performs better than random sampling, although we did not observe better performance from the diversity-based sampling methods.
Diverse mini-batch active learning: summary
Experiments to reproduce a research article are always complicated, especially when the parameters of the original experiment are not fully shared publicly.
One thing we cannot explain is the higher variability of our results compared to the original paper. Even though we used slightly different techniques and settings, we managed to replicate the main findings of the paper, demonstrating that active learning really does work on complex problems.
References
Zhdanov, Fedor. “Diverse mini-batch Active Learning.” arXiv preprint arXiv:1901.05954 (2019).
[1] Du, Bo, et al. “Exploring representativeness and informativeness for active learning.” IEEE Transactions on Cybernetics 47.1 (2015): 14-26.
[2] Settles, Burr. Active learning literature survey. University of Wisconsin-Madison Department of Computer Sciences, 2009.