Neural architecture search (NAS) automates the "second-order" work of hand-tuning network designs, letting us search for an optimal neural network as a black-box optimization problem. This approach is certainly attractive if it can be used cheaply and effectively, considering that "800 GPUs for 28 days of training" is hardly affordable for individuals. In this article, the author describes the evolution of NAS and how successive improvements have reduced the cost of a search to a level that is "within the reach of mere mortals."

Selected from Medium, author Erik Lybecker, compiled by Heart of the Machine, with participation from NeuR.



Neural architecture search (NAS) has changed the process of constructing new neural network architectures. This technique can automatically find the optimal neural network architecture for a particular problem, where "optimal" is understood as a tradeoff among multiple features, such as the size and accuracy of the network [1]. Even more impressively, NAS can now run in 4 hours on a single GPU, compared with 28 days on 800 GPUs. It took only two years to make that leap, and now we don't need to be Googlers to use NAS.


But how did researchers achieve this leap in performance? This article traces the development of NAS.


Catalyst


The history of NAS can be traced back to the idea of self-organizing networks in 1988 [2], but it wasn't until 2017 that NAS made its first major breakthrough. At that time, the idea emerged of training a recurrent neural network (RNN) to generate neural network architectures.

Figure 1: The iterative process of training a NAS controller: the controller (RNN) samples an architecture A with probability p, a child network with architecture A is trained to obtain accuracy R, and the gradient of p scaled by R is used to update the controller.

In a nutshell, the process resembles a human searching for the best architecture by hand: the controller tries different neural network configurations drawn from a predefined search space of operations and hyperparameters. Here, "trying" a configuration means assembling, training, and evaluating the corresponding network to observe its performance.

After many iterations, the controller learns which configurations make up the best neural network in the search space. Unfortunately, the number of iterations required to find the optimal architecture in the search space is very large, so the process is slow.
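To make the loop in Figure 1 concrete, here is a minimal sketch in PyTorch (my own illustration, not the authors' code): a toy controller samples a configuration, a placeholder `evaluate_architecture` function stands in for training the child network and returning its accuracy R, and R is used as the reward in a REINFORCE-style update of the controller. The tiny search space and all names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy search space: for each of 4 decisions, the controller picks one of 3 options.
NUM_DECISIONS, NUM_OPTIONS = 4, 3

class Controller(nn.Module):
    """Tiny RNN controller that emits one categorical choice per decision."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.GRUCell(NUM_OPTIONS, hidden)
        self.head = nn.Linear(hidden, NUM_OPTIONS)

    def sample(self):
        h = torch.zeros(1, self.rnn.hidden_size)
        x = torch.zeros(1, NUM_OPTIONS)
        choices, log_probs = [], []
        for _ in range(NUM_DECISIONS):
            h = self.rnn(x, h)
            dist = torch.distributions.Categorical(logits=self.head(h))
            c = dist.sample()
            choices.append(c.item())
            log_probs.append(dist.log_prob(c))
            x = nn.functional.one_hot(c, NUM_OPTIONS).float()
        return choices, torch.stack(log_probs).sum()

def evaluate_architecture(choices):
    """Placeholder: in real NAS this trains the child network and returns its accuracy R."""
    return torch.rand(1).item()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0
for step in range(100):
    arch, log_prob = controller.sample()
    reward = evaluate_architecture(arch)          # accuracy R of the child network
    baseline = 0.9 * baseline + 0.1 * reward      # moving-average baseline to reduce variance
    loss = -(reward - baseline) * log_prob        # REINFORCE: grad log P(arch) * (R - b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, `evaluate_architecture` is by far the dominant cost, which is exactly the problem the rest of the article addresses.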


Part of the reason is that the search space suffers from a combinatorial explosion: the number of possible networks grows rapidly as more components are added to the search space. Nevertheless, this approach did find the then state-of-the-art (SOTA) network, now known as NASNet [3], but it required 28 days of training on 800 GPUs. Such high computational cost makes the search algorithm impractical for most people.
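For a rough sense of how quickly this blows up (the numbers below are illustrative, not taken from [3]):

```python
# Illustrative only: the exact count depends on the chosen search space.
layers = 12               # decisions the controller has to make
options_per_layer = 6     # e.g. filter sizes, strides, activation choices
print(options_per_layer ** layers)  # 2,176,782,336 possible networks from these choices alone
```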


So how can this idea be improved to make it easier to use? During NAS training, most of the time is spent training and evaluating the networks proposed by the controller. Multiple GPUs can train models in parallel, but each individual training run still takes quite a long time. Reducing the computational cost of training and evaluating neural networks therefore has a significant impact on the total search time of NAS.


This raises the question: how can the computational cost of training and evaluating neural networks be reduced without adversely affecting the NAS algorithm?


Lower-fidelity estimates


It is well known that smaller neural networks train faster than larger ones. The reason is simple: smaller networks have lower computational cost. However, smaller neural networks generally achieve lower accuracy than larger ones. The goal of NAS is to find SOTA network architectures, so is there a way to use smaller models during the search without sacrificing final performance?

Figure 2: An example ResNet architecture, where the repeated residual blocks ("ResNet blocks") are marked.



The answer can be found in one of the most famous computer vision architectures, ResNet [4]. In the ResNet architecture, the same set of operations is repeated over and over again. These operations form residual blocks, the building blocks of ResNet. This design pattern lets researchers create deeper or shallower variants of the same model simply by varying the number of stacked residual blocks.


Implicit in this architectural design is the assumption that a high-performance, larger network can be created by repeatedly stacking a well-structured building block, which suits NAS perfectly. In the context of NAS, this means first training and evaluating small models, then scaling the resulting network up. For example, run NAS on a ResNet18-sized model and then build a ResNet50-sized model by repeating the discovered building blocks.
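A minimal PyTorch sketch of this "search small, scale up" idea (my own illustration, not code from [5]); the searched cell is stood in for by a fixed residual-style block, and all names here are hypothetical:

```python
import torch
import torch.nn as nn

class SearchedCell(nn.Module):
    """Stand-in for a cell found by NAS; here it is just a fixed residual-style block."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual connection

def build_network(num_cells, channels=64, num_classes=10):
    """Stack the same cell num_cells times: few cells for the search, many for the final model."""
    stem = nn.Conv2d(3, channels, 3, padding=1)
    cells = [SearchedCell(channels) for _ in range(num_cells)]
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_classes))
    return nn.Sequential(stem, *cells, head)

proxy_model = build_network(num_cells=4)    # cheap model used during the search
final_model = build_network(num_cells=20)   # scaled-up model trained afterwards
```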


Replacing the search over entire architectures with a search over building blocks, together with training and evaluating smaller models, greatly improves speed: the researchers achieved a search time of only 3-4 days on 450 GPUs [5]. Moreover, this technique still finds SOTA architectures even though only the building blocks are searched.
However, while this is a huge improvement, the whole process is still fairly slow, and the number of GPUs required has to come down for it to be practical. Regardless of model size, training a neural network from scratch is always time-consuming. Is there a way to reuse the weights of previously trained networks?


Weight inheritance


How do you avoid training a neural network from scratch? The answer is weight inheritance: borrowing weights from another network that has already been trained. In NAS, the search is performed on a specific target dataset, and many architectures are trained along the way. Why not reuse the weights and change only the architecture? After all, the purpose of the search is to find the architecture, not the weights. To make weight reuse possible, the search space has to be restricted with a stricter structural definition.



Figure 3: A NAS cell modeled as a directed acyclic graph (DAG), where edges represent operations and nodes represent computations that transform and combine previous nodes to create new hidden states.



By fixing the number of hidden states allowed inside the searched building block, the search space becomes very restricted: the number of possible combinations of operations within a block is large, but not infinite. If the hidden states are ordered and their topology is predefined as a directed acyclic graph (DAG), the search space looks like Figure 3.
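One simple way to encode a cell from such a restricted space (a hypothetical illustration, not the encoding used in any specific paper): each intermediate hidden state picks one earlier node and one operation, so a sampled cell is just a short list of (destination, source, operation) tuples, and the ordering keeps the graph acyclic by construction.

```python
import random

CANDIDATE_OPS = ["3x3_conv", "5x5_conv", "max_pool", "identity"]  # illustrative op set

def sample_cell(num_nodes=4):
    """Sample a cell as a DAG: node 0 is the input; every later node consumes one earlier node."""
    cell = []
    for node in range(1, num_nodes):
        src = random.randrange(node)        # only earlier nodes allowed, so the graph stays acyclic
        op = random.choice(CANDIDATE_OPS)
        cell.append((node, src, op))
    return cell

print(sample_cell())  # e.g. [(1, 0, 'max_pool'), (2, 1, '3x3_conv'), (3, 0, 'identity')]
```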


Using this search space, we can think of each architecture proposed by the controller as a subnetwork of one larger network, where the larger network and the subnetworks share the same hidden states (nodes).


When the controller proposes an architecture, it is effectively selecting a subset of the connections (edges) and assigning operations to the hidden states (nodes). With this formulation, it is easy to attach weights to the operations at each node, which enables weight inheritance: in the NAS setting, the weights of previously trained architectures can be used to initialize the next sampled network [6]. Such warm-started initializations are known to work well even across different tasks or operations [7], and the sampled networks train faster because they are not trained from scratch.
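A rough sketch of the weight-sharing idea behind [6] (my own simplification in PyTorch, not the ENAS implementation): a single shared pool holds one set of weights per (edge, operation) pair, and every sampled subnetwork pulls its weights from this pool, so no architecture starts from scratch. The op set, channel count, and cell encoding are illustrative assumptions matching the sketch above.

```python
import torch
import torch.nn as nn

NUM_NODES = 4
CANDIDATE_OPS = {
    "3x3_conv": lambda c: nn.Conv2d(c, c, 3, padding=1),
    "5x5_conv": lambda c: nn.Conv2d(c, c, 5, padding=2),
    "identity": lambda c: nn.Identity(),
}

class SharedPool(nn.Module):
    """One module per (destination node, source node, op); all sampled subnetworks reuse these."""
    def __init__(self, channels=16):
        super().__init__()
        self.ops = nn.ModuleDict()
        for dst in range(1, NUM_NODES):
            for src in range(dst):
                for name, make in CANDIDATE_OPS.items():
                    self.ops[f"{dst}-{src}-{name}"] = make(channels)

    def forward(self, x, cell):
        """cell: list of (dst, src, op_name) tuples describing one sampled subnetwork."""
        states = {0: x}
        for dst, src, name in cell:
            states[dst] = self.ops[f"{dst}-{src}-{name}"](states[src])
        return states[max(states)]

pool = SharedPool()
x = torch.randn(1, 16, 8, 8)
cell_a = [(1, 0, "3x3_conv"), (2, 1, "identity"), (3, 2, "5x5_conv")]
cell_b = [(1, 0, "5x5_conv"), (2, 0, "3x3_conv"), (3, 1, "identity")]
# Both architectures reuse the same underlying weights, so training one warm-starts the other.
out_a, out_b = pool(x, cell_a), pool(x, cell_b)
```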


Since models no longer have to be trained from scratch, training and evaluating each network becomes much faster. On a single GPU, the search now takes only 0.45 days, roughly 1000 times faster than before [6]. Together, these optimizations greatly speed up reinforcement learning-based NAS.


These improvements focus on faster evaluation of individual architectures. However, reinforcement learning is not the fastest way to learn. Is there an alternative search process that traverses the search space more efficiently?


In reinforcement learning-based NAS, many models still need to be trained in order to find the best one among them. Is there a way to avoid training them all and train just one model instead?


Differentiability


In the DAG form of search space, the trained network is a subnetwork of the larger network. So is it possible to train the larger network directly and somehow learn which operations contribute the most? The answer is yes.

Figure 4: a) The operations on the edges are initially unknown. b) The search space is relaxed to be continuous by placing a mixture of candidate operations on each edge. c) During bilevel optimization, some mixing weights grow while others shrink. d) The final architecture is constructed by keeping, between each pair of nodes, the edge with the largest weight [8].



If the controller is removed and each edge is changed to represent all candidate operations at once, the search space becomes differentiable. In this dense architecture, the candidate operations are combined as a weighted sum at each node, and the mixing weights are learnable parameters that let the network scale the different operations up or down. Operations that hurt performance can thus be down-weighted while operations that help are amplified. After training this larger network, all that remains is to inspect the weights and keep the operations with the largest ones.
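The core trick fits in a few lines. The sketch below (my own PyTorch illustration following the idea in [8], not the DARTS code) replaces the discrete choice on one edge with a softmax-weighted sum over candidate operations; the architecture logits `alpha` are ordinary learnable parameters, and the candidate op list is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the supernetwork: a weighted sum over all candidate operations."""
    def __init__(self, channels):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters: one logit per candidate op, learned by gradient descent.
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

    def discretize(self):
        """After the search: keep only the operation with the largest architecture weight."""
        return self.candidates[int(self.alpha.argmax())]

edge = MixedOp(channels=16)
y = edge(torch.randn(1, 16, 8, 8))   # differentiable w.r.t. both the op weights and alpha
chosen = edge.discretize()           # e.g. the 3x3 convolution
```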


By making the search space differentiable and training this larger network (often called a "supernetwork"), we no longer need to train multiple architectures and can rely on a standard gradient-based optimizer. The differentiability of NAS also opens up many possibilities for further development. One example is differentiable sampling in NAS [9], which reduces the search time to just 4 hours because fewer operations are executed in each forward and backward pass during the search.
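For completeness, here is a simplified sketch of how such a supernetwork could be trained (a single-level approximation, not the full bilevel formulation of [8]): gradient steps alternate between the network weights on training batches and the architecture parameters `alpha` on validation batches. The parameter-name filter assumes the `MixedOp` sketch above; everything else is hypothetical.

```python
import torch

def search(supernet, train_loader, val_loader, epochs=10):
    """Alternate updates: weights on the training set, architecture parameters on the validation set."""
    arch_params = [p for n, p in supernet.named_parameters() if "alpha" in n]
    weight_params = [p for n, p in supernet.named_parameters() if "alpha" not in n]
    w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
    a_opt = torch.optim.Adam(arch_params, lr=3e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # 1) Architecture step on validation data.
            a_opt.zero_grad()
            loss_fn(supernet(x_val), y_val).backward()
            a_opt.step()
            # 2) Weight step on training data.
            w_opt.zero_grad()
            loss_fn(supernet(x_tr), y_tr).backward()
            w_opt.step()
    return supernet
```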


Conclusion


That is the story of how NAS search time was cut from days to hours. In this article, I have tried to outline the most important ideas driving the development of NAS. Now that the technology is efficient enough for anyone with a GPU to use it, what are you waiting for?


References:



[1] https://arxiv.org/pdf/1807.11626.pdf

[2] Self Organizing Neural Networks for the Identification Problem (https://papers.nips.cc/paper/149-self-organizing-neural-networks-for-the-identification-problem.pdf)

[3] https://arxiv.org/pdf/1611.01578.pdf

[4] https://arxiv.org/pdf/1512.03385.pdf

[5] https://arxiv.org/pdf/1707.07012.pdf

[6] https://arxiv.org/pdf/1802.03268.pdf

[7] https://arxiv.org/pdf/1604.02201.pdf

[8] https://arxiv.org/pdf/1806.09055.pdf

[9] https://arxiv.org/pdf/1910.04465.pdf



The original link: https://medium.com/peltarion/how-nas-was-improved-from-days-to-hours-in-search-time-a238c330cd49