• Hunting for the Optimal AutoML Library
  • Aimee Coelho
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: zhusimaji
  • Proofreader: Regon-CAO, LSVIh


Over the past few years, there has been a lot of research into automating the building of machine learning models, and many open source libraries now provide AutoML functionality. How do these libraries relate to each other, and more importantly, how do they compare? Answering that requires an independent team to test all the packages under controlled conditions on a broad set of datasets.

Marc-André Zöller and Marco F. Huber's paper, Benchmark and Survey of Automated Machine Learning Frameworks, carries out exactly such a benchmark. We found their testing scheme and results very interesting, and that is what we want to share here. The headline finding:

In general, all algorithms except grid search produce models with similar performance.

AutoML definition

First, let's pin down exactly what we mean by AutoML here. It is a very broad term that can cover everything in a machine learning project, from data cleaning and feature engineering, through hyperparameter search, to deployment to production and the subsequent incremental retraining of models.

In this article, we'll discuss AutoML tools that can build workflows covering data preprocessing, model selection, and hyperparameter tuning.

We can break this process down into two parts: structure search, and combined algorithm selection and hyperparameter optimization (CASH), as shown in Figure 1.

In the paper, the tools are split into two groups: those that perform CASH and keep a fixed linear pipeline, as shown in Figure 2, and those that can also search over more complex pipelines with parallel branches and recombination of the data, as shown in Figure 3; the latter corresponds to the structure search in Figure 1.

In their experiments they found that the CASH tools performed better, so those are the tools we highlight here.

Description of the algorithms

The algorithms available for hyperparameter search are summarized as follows:

Grid search and random search

First, the simplest method: grid search, which evaluates every combination of a predefined grid of hyperparameter values. It is simple to set up and can quickly find a good set of parameters.

Next comes random search, which samples random combinations of hyperparameters from specified ranges and distributions.

As shown in Figure 4, and unlike grid search, random search has the advantage of not missing the optimum when it lies between grid points, and, if run long enough, it can get arbitrarily close to the optimum.

For grid search and random search, the authors chose the scikit-learn implementations.
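As a concrete illustration, here is a minimal sketch of both approaches using scikit-learn's GridSearchCV and RandomizedSearchCV. The dataset and parameter ranges are placeholders chosen for the example, not the grids used in the paper.

```python
# Minimal sketch: grid search vs. random search with scikit-learn.
# Dataset and parameter ranges are illustrative only.
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Grid search: every combination of the listed values is evaluated.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    cv=4,
)
grid.fit(X, y)

# Random search: combinations are sampled from distributions and ranges.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 15)},
    n_iter=20,
    cv=4,
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```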

Sequential model-based optimization

Then there is the family of sequential model-based optimization (SMBO) methods. These algorithms aim to learn from evaluations as they complete, much as a data scientist would steer a hyperparameter search by hand. After each evaluation they update a surrogate model, and then query that surrogate with an acquisition function to decide which point to evaluate next. Querying the surrogate model is much faster than training a model to obtain its true performance.
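To make the loop concrete, here is a toy sketch of the SMBO idea, using a random forest regressor as the surrogate and a simple lower-confidence-bound rule over random candidates as the acquisition function. This is illustrative only; it is not the implementation used by any of the benchmarked libraries.

```python
# Toy sketch of sequential model-based optimization (SMBO).
# Surrogate: random forest regressor; acquisition: lower confidence bound.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def objective(x):
    # Stands in for an expensive model training; here a cheap quadratic.
    return (x[0] - 50) ** 2 + (x[1] + 50) ** 2

rng = np.random.default_rng(0)
X_obs = rng.uniform(-100, 100, size=(10, 2))           # initial random configurations
y_obs = np.array([objective(x) for x in X_obs])

for _ in range(40):
    # Fit the surrogate on everything evaluated so far.
    surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_obs, y_obs)
    candidates = rng.uniform(-100, 100, size=(1000, 2))
    # Per-tree predictions give a mean and a spread; pick the most promising candidate.
    per_tree = np.stack([t.predict(candidates) for t in surrogate.estimators_])
    lcb = per_tree.mean(axis=0) - per_tree.std(axis=0)
    x_next = candidates[np.argmin(lcb)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next))

print("best found:", X_obs[np.argmin(y_obs)], y_obs.min())
```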

Model-based optimization techniques aim to learn which regions of the search space are promising and to explore them to find the best model.

The example assumes a simple two-dimensional quadratic function whose minimum lies at X_0 = 50 and X_1 = -50. Using Bayesian optimization, after first evaluating 10 random configurations and fitting the surrogate model, the subsequent evaluations begin to converge toward the minimum. The objective function is plotted on the left of Figure 5; on the right, the sequence of points chosen by Bayesian optimization with a Gaussian process in scikit-optimize approaches the minimum as the iterations progress from purple to red.
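A minimal sketch of reproducing this kind of experiment with scikit-optimize might look like the following; the search bounds are an assumption made for the example.

```python
# Bayesian optimization with a Gaussian process surrogate (scikit-optimize)
# on the simple 2D quadratic with its minimum at x0 = 50, x1 = -50.
from skopt import gp_minimize

def objective(x):
    return (x[0] - 50) ** 2 + (x[1] + 50) ** 2

result = gp_minimize(
    objective,
    dimensions=[(-100.0, 100.0), (-100.0, 100.0)],  # assumed search bounds
    n_initial_points=10,   # 10 random configurations first, as described above
    n_calls=60,
    random_state=0,
)

print("best point:", result.x, "best value:", result.fun)
```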

There are several choices for the surrogate model, including Gaussian processes, random forests, and Tree Parzen Estimators. To test SMBO, the authors selected RoBO with Gaussian processes, SMAC with random forests, and Hyperopt with Tree Parzen Estimators.
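For example, a Tree Parzen Estimator search with Hyperopt on the same toy quadratic could be sketched like this.

```python
# Tree Parzen Estimator search with Hyperopt on the toy quadratic.
from hyperopt import Trials, fmin, hp, tpe

def objective(params):
    return (params["x0"] - 50) ** 2 + (params["x1"] + 50) ** 2

space = {
    "x0": hp.uniform("x0", -100, 100),
    "x1": hp.uniform("x1", -100, 100),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=60, trials=trials)
print("best parameters found:", best)
```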

Other search methods

Multi-armed bandit learning can also be combined with Bayesian optimization. The bandit handles the selection of the categorical parameters by dividing the space into subspaces called hyperpartitions, and the remaining parameters within each hyperpartition are then tuned with Bayesian optimization. BTB (Bayesian Tuning and Bandits) was selected for this benchmark.
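The toy sketch below illustrates only the hyperpartition idea, not the BTB API: an epsilon-greedy bandit picks which categorical choice (partition) to work on, and a numeric search runs inside each partition (plain random sampling here, Bayesian optimization in BTB). All names and values are hypothetical.

```python
# Toy illustration of hyperpartitions: a bandit chooses the categorical part of the
# configuration, and the numeric hyperparameters are searched within that partition.
# This mimics the idea only; it is not the BTB implementation.
import random

random.seed(0)

# Two hyperpartitions, one per value of a hypothetical categorical "kernel" parameter.
partitions = ["rbf", "linear"]
scores = {p: [] for p in partitions}

def evaluate(kernel, c):
    # Stand-in for training and scoring a model; "rbf" is made artificially better.
    return (0.9 if kernel == "rbf" else 0.8) - abs(c - 1.0) * 0.05 + random.gauss(0, 0.01)

best = (None, -1.0)
for step in range(60):
    # Epsilon-greedy bandit over partitions.
    if random.random() < 0.2 or not all(scores[p] for p in partitions):
        partition = random.choice(partitions)
    else:
        partition = max(partitions, key=lambda p: sum(scores[p]) / len(scores[p]))
    # Inside the chosen partition, search the numeric parameter (random here, BO in BTB).
    c = random.uniform(0.1, 10.0)
    score = evaluate(partition, c)
    scores[partition].append(score)
    if score > best[1]:
        best = ({"kernel": partition, "C": c}, score)

print("best configuration:", best)
```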

Another approach is to use evolutionary algorithms. A variety of population-based algorithms are in use, such as particle swarm optimization; the Optunity library is used in this benchmark.
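As a rough illustration of the population-based idea, here is a bare-bones particle swarm optimizer on the same toy quadratic; Optunity's own implementation is of course more complete, and the coefficients here are conventional defaults, not Optunity's.

```python
# Bare-bones particle swarm optimization on the toy quadratic (illustrative only).
import numpy as np

def objective(x):
    return (x[0] - 50) ** 2 + (x[1] + 50) ** 2

rng = np.random.default_rng(0)
n_particles, n_iters, dim = 20, 50, 2
pos = rng.uniform(-100, 100, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([objective(p) for p in pos])
gbest = pbest[np.argmin(pbest_val)].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    # Velocity update: inertia + pull toward personal best + pull toward global best.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, -100, 100)
    vals = np.array([objective(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmin(pbest_val)].copy()

print("best found:", gbest, objective(gbest))
```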

Performance improvements

In addition to these core algorithms, there are many proposals for enhancing these techniques, mainly aimed at reducing the time needed to find the best model.

Low-fidelity evaluations

One popular method is to use low-fidelity evaluations. This can mean training on a sample of the data or training a model for fewer iterations. Because this gives a quick, rough estimate of each configuration's performance, many more configurations can be tested early on, and only the most promising ones are trained fully.

A good example is Hyperband. BOHB, a tool that combines Hyperband with Bayesian optimization, is included in this benchmark.
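The core of Hyperband is successive halving: evaluate many configurations on a small budget, keep the best fraction, and re-evaluate the survivors on a larger budget. The simplified sketch below shows that building block only; it is not the full Hyperband bracket schedule, nor BOHB, and the evaluation function is a made-up stand-in.

```python
# Simplified successive halving: start many configurations on a small budget and
# only the best survive to larger budgets. Hyperband additionally runs several such
# brackets, and BOHB replaces the random sampling with a Bayesian model.
import random

random.seed(0)

def evaluate(config, budget):
    # Stand-in for training `config` with a limited budget (a data sample or few epochs).
    # Higher budgets give a less noisy estimate of the true quality.
    true_quality = 1.0 - abs(config["lr"] - 0.1)
    noise = random.gauss(0, 0.2 / budget)
    return true_quality + noise

configs = [{"lr": random.uniform(0.001, 1.0)} for _ in range(27)]
budget = 1
while len(configs) > 1:
    scored = sorted(configs, key=lambda c: evaluate(c, budget), reverse=True)
    configs = scored[: max(1, len(scored) // 3)]   # keep the top third
    budget *= 3                                    # give survivors 3x the budget

print("winning configuration:", configs[0])
```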

Meta-learning

If you already have an idea of which areas of the search space are most promising, you can speed up optimization by narrowing the search from the outset rather than starting with purely random exploration. One approach is to study, across many datasets, which hyperparameters are usually the most important and which values generally work well for a given algorithm.

Another approach is to reuse the results of hyperparameter searches previously run on other datasets. If the datasets can be determined to be similar, the search can be initialized with the best hyperparameters from those tasks; this is called warm-starting the search.

Auto-sklearn implements this scheme; see their paper for more detail on how the similarity between datasets is evaluated.
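A minimal sketch of the warm-starting idea (not auto-sklearn's actual meta-learning code): evaluate a handful of configurations that worked well on datasets judged to be similar before falling back to random sampling. All configurations and scores here are hypothetical.

```python
# Toy sketch of warm-starting a search: configurations that performed well on
# similar datasets are evaluated first, then the search continues randomly.
import random

random.seed(0)

def evaluate(config):
    # Stand-in for training and cross-validating a model with this configuration.
    return 1.0 - abs(config["max_depth"] - 8) * 0.05 - abs(config["n_estimators"] - 200) * 0.001

# Best configurations found previously on similar datasets (hypothetical values).
warm_start_configs = [
    {"max_depth": 7, "n_estimators": 180},
    {"max_depth": 10, "n_estimators": 250},
]

def random_config():
    return {"max_depth": random.randint(2, 20), "n_estimators": random.randint(50, 500)}

candidates = warm_start_configs + [random_config() for _ in range(18)]
best = max(candidates, key=evaluate)
print("best configuration:", best, "score:", evaluate(best))
```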

The benchmark

The benchmark data combines OpenML100 [3], OpenML-CC18 [4] and the AutoML Benchmark [5], for a total of 137 classification datasets.

All libraries were run for 325 iterations, which was determined to be enough for all the algorithms to converge, and any single model training was limited to 10 minutes.

The libraries searched over 13 different algorithms with a total of 58 tunable hyperparameters. The performance of each configuration was evaluated with 4-fold cross-validation, and each experiment was repeated 10 times with different random seeds.
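In scikit-learn terms, scoring a single configuration under this protocol would look roughly like the sketch below; the model and dataset are placeholders, not those from the benchmark.

```python
# Rough sketch of the evaluation protocol: score one configuration with 4-fold
# cross-validation, and repeat the experiment with 10 different random seeds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

scores_per_seed = []
for seed in range(10):
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    cv_scores = cross_val_score(model, X, y, cv=4)
    scores_per_seed.append(cv_scores.mean())

print("mean accuracy:", np.mean(scores_per_seed), "std across seeds:", np.std(scores_per_seed))
```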

Results

To compare accuracy across datasets of different difficulty, two baseline models were used: a dummy classifier and a simple untuned random forest. The results of these baselines are then used to normalize the scores so that the dummy classifier maps to zero and the untuned random forest maps to one. Any library scoring above 1 is therefore performing better than an untuned random forest.
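That normalization is just a linear rescaling; written out as a small helper (with variable names of my choosing):

```python
def normalize(score, dummy_score, untuned_rf_score):
    """Rescale so the dummy classifier maps to 0 and the untuned random forest maps to 1."""
    return (score - dummy_score) / (untuned_rf_score - dummy_score)

# Example: a library scoring 0.93 when the dummy gets 0.60 and the untuned forest gets 0.90
print(normalize(0.93, 0.60, 0.90))  # 1.1 -> better than the untuned random forest
```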

Surprisingly, they found that all the tools performed similarly except for grid search!

After only ten iterations, all algorithms except grid search beat the random forest baseline; Figure 6 shows the final performance of all the algorithms after 325 iterations.

Curiously, grid search performed worse than the untuned random forest: the grid is defined automatically, and the default random forest parameters do not appear in it. The full grid is included in the appendix of the paper.

Surprising variance between runs

The authors also highlight that some datasets show considerable variance when only the random seed is changed. This is something we have also observed in our own benchmarks at Dataiku, and it demonstrates the importance of repeating experiments.

You can see this clearly in Figure 7, where the raw and averaged accuracies are plotted together. For the dataset with the largest variance, diabetes, the results of random search varied by more than 20%!

The authors' summary

The authors conclude that

In general, all algorithms except grid search produce models with similar performance.

They also note that the absolute difference in accuracy was less than 1% on most datasets, making it unreasonable to rank these algorithms on performance alone; future rankings should take other criteria, such as model overhead or scalability, into account.

Forget accuracy: speed of convergence as a selection criterion

Although the ultimate goal is to obtain the best model, in practice there are constraints on computing resources and time, especially in industry, where datasets can be large and training hundreds of models may not be feasible.

In this case, the important consideration is how fast the algorithm can approach the optimal model. Criteria such as model overhead are important when there are enough resources to train hundreds of models, but if you can only afford to train, say, 20 models, then speed of convergence becomes a very important criterion.

To better illustrate this, refer to the curve in Figure 8 below.

On the OpenML Higgs dataset, four search algorithms are used to optimize four hyperparameters of a single algorithm, a random forest. The best loss found so far is plotted at each iteration. The first vertical line marks iteration 20.

At this point, all the model-based optimization algorithms stop choosing points at random and start selecting them with their surrogate models. In the 50 iterations between this line and the second vertical line, the differences in the performance of the models found by the algorithms are much larger than after all of them have converged at 325 iterations.

So if you can only afford to train 20 models, you will end up with a better model by choosing an algorithm based on the Tree Parzen Estimator, such as the one implemented in Optuna.
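For reference, running a TPE search with Optuna under a small trial budget is straightforward; the search space and dataset below are placeholders for the example.

```python
# A small-budget TPE search with Optuna (search space and dataset are placeholders).
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 15),
        random_state=0,
    )
    return cross_val_score(model, X, y, cv=4).mean()

study = optuna.create_study(direction="maximize")  # TPE is Optuna's default sampler
study.optimize(objective, n_trials=20)             # only 20 models, as in the scenario above
print(study.best_params, study.best_value)
```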

References

  1. Zöller, M.-A. & Huber, M. F. Benchmark and Survey of Automated Machine Learning Frameworks. (2019).
  2. Bergstra, J. & Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 13, 281–305 (2012).
  3. Casalicchio, G., Hutter, F., Mantovani, R. G. & Vanschoren, J. OpenML Benchmarking Suites and the OpenML100. 1–6, arXiv:1708.03731v1.
  4. Casalicchio, G., Hutter, F. & Mantovani, R. G. OpenML Benchmarking Suites. 1–6, arXiv:1708.03731v2.
  5. Gijsbers, P. et al. An Open Source AutoML Benchmark. 1–8 (2019).
