Selected from arXiv. Authors: Chenxi Liu et al. Compiled by Heart of Machine.
In the past, neural network architectures were largely designed by hand, a time-consuming and error-prone process. Neural architecture search (NAS) automates this work and improves model efficiency; in large-scale image classification, automatically found models have already surpassed those designed by humans.
Auto-DeepLab, proposed by Fei-Fei Li and her colleagues, outperforms many of the industry's best models in semantic image segmentation, and even matches the performance of pre-trained models without any pre-training. Auto-DeepLab develops a continuous relaxation of discrete architectures that exactly matches the hierarchical architecture search space, significantly improving the efficiency of architecture search and reducing the required computation.
Deep neural networks have been successful in many artificial intelligence tasks, including image recognition, speech recognition, and machine translation. Although better optimizers [36] and normalization techniques [32, 79] have played an important role, much of the progress is due to the design of neural network architectures. In computer vision, this holds for both image classification and dense image prediction.
More recently, with the democratization of AutoML and AI, there has been a great deal of interest in the automated design of neural network architectures, which does not rely heavily on expert experience and knowledge. More importantly, in the past year, neural architecture search (NAS) has succeeded in finding network architectures that surpass human-designed architectures in large-scale image classification tasks [92, 47, 61].
Image classification is a good starting point for NAS because it is the most fundamental and well-studied high-level recognition task. In addition, relatively small benchmark datasets (e.g., CIFAR-10) exist for this area, which reduces computation and speeds up training. However, image classification should not be the end point of NAS; the current success suggests that it can be extended to more demanding domains. In this paper, we study neural architecture search for semantic image segmentation, an important computer vision task that assigns a label, such as "person" or "bicycle," to each pixel of an input image.
Simply porting NAS from image classification is not enough for semantic segmentation. In image classification, NAS usually searches on low-resolution images and transfers to high-resolution images [92], whereas an optimal architecture for semantic segmentation must operate on high-resolution images natively. This suggests that this study needs: (1) a more relaxed and general search space to capture the architectural variants brought about by higher resolution; and (2) a more efficient architecture search technique, because higher resolution requires more computation.
The authors note that modern CNN designs generally follow a two-level hierarchy, in which the outer network level controls changes in spatial resolution and the inner cell level governs the specific layer-wise computation. The vast majority of current NAS research follows this two-level hierarchy but only automates the search at the inner cell level while designing the outer network level by hand. This limited search space is a problem for dense image prediction, which is sensitive to changes in spatial resolution. Therefore, this study proposes a lattice-like network-level search space that augments the commonly used cell-level search space first introduced in [92], forming a hierarchical architecture search space. The goal is to jointly learn a good combination of a repeatable cell structure and a network structure for semantic image segmentation; a small sketch of the network-level search space follows.
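To make the network-level search space concrete, here is a minimal sketch (an illustration, not the authors' code). It assumes 12 layers and spatial downsampling rates of {4, 8, 16, 32}, with each layer allowed to keep, halve, or double the resolution of the previous layer; starting every path at the finest rate is an assumption made only for illustration.

```python
RATES = [4, 8, 16, 32]   # spatial downsampling factors (levels of the lattice)
NUM_LAYERS = 12          # depth of the searched network

def allowed_next_levels(level):
    """Levels reachable at the next layer: stay, go one level finer, or one level coarser."""
    return [l for l in (level - 1, level, level + 1) if 0 <= l < len(RATES)]

def count_resolution_paths():
    """Count the distinct resolution trajectories admitted by the lattice."""
    counts = [1] + [0] * (len(RATES) - 1)       # assume all paths start at downsample 4
    for _ in range(NUM_LAYERS - 1):
        new_counts = [0] * len(RATES)
        for level, c in enumerate(counts):
            for nxt in allowed_next_levels(level):
                new_counts[nxt] += c
        counts = new_counts
    return sum(counts)

print("network-level paths in the lattice:", count_resolution_paths())
```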
In terms of architecture search methods, reinforcement learning and evolutionary algorithms tend to be computationally intensive even on low-resolution datasets such as CIFAR-10, so they are not well suited to semantic image segmentation. Inspired by differentiable formulations of NAS [68, 49], this study develops a continuous relaxation of discrete architectures that exactly matches the hierarchical architecture search space. The hierarchical architecture search is then carried out by stochastic gradient descent. When the search terminates, the best cell architecture is decoded greedily, and the best network architecture is decoded efficiently with the Viterbi algorithm. The authors search for the architecture directly on 321×321 image crops from the Cityscapes dataset. The search is very efficient, taking only 3 days on one P100 GPU.
Experiments were performed on several semantic segmentation benchmarks, including Cityscapes, PASCAL VOC 2012, and ADE20K. Without ImageNet pre-training, the best Auto-DeepLab model outperformed FRRN-B by 8.6% and GridNet by 10.9% on Cityscapes. In experiments using the Cityscapes coarse annotations, Auto-DeepLab performed comparably to some of the current best ImageNet-pretrained models. Notably, the best model in this study (without pre-training) performed similarly to DeepLab v3+ (with pre-training) while being 2.23 times faster in Multi-Adds. In addition, the lightweight Auto-DeepLab model was only 1.2% lower in performance than DeepLab v3+, while requiring 76.7% fewer parameters and being 4.65 times faster in Multi-Adds. On PASCAL VOC 2012 and ADE20K, the best Auto-DeepLab model outperformed several current best models while using less pre-training data.
The main contributions of this paper are as follows:
- This is one of the first attempts to extend NAS from image classification to dense image prediction tasks.
- This study proposes a network-level architecture search space that augments and complements the already well-studied cell-level search space, yielding a more challenging joint search over network-level and cell-level architectures.
- This study develops a differentiable, continuous formulation that runs the two-level hierarchical architecture search efficiently, in only 3 days on one P100 GPU.
- Without ImageNet pre-training, the Auto-DeepLab model significantly outperforms FRRN-B and GridNet on Cityscapes and is comparable to the current best ImageNet-pretrained models. On PASCAL VOC 2012 and ADE20K, the best Auto-DeepLab model outperforms several current best models.
Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation
Address: arxiv.org/pdf/1901.02…
Abstract: Recently, neural network architectures found by neural architecture search (NAS) have outperformed human-designed networks in image classification. This paper studies NAS for semantic image segmentation, an important computer vision task that assigns a semantic label to each pixel in an image. Existing research usually focuses on searching for a repeatable cell structure while hand-designing the outer network structure that controls spatial resolution changes. This simplifies the search space, but becomes problematic for dense image prediction, which admits many network-level architectural variations. Therefore, this study searches the network-level architecture in addition to the cell-level structure, forming a hierarchical architecture search space. The proposed network-level search space incorporates multiple popular network designs, and an efficient gradient-based formulation of the architecture search is developed (3 days on 1 P100 GPU for Cityscapes images). The study demonstrates the effectiveness of the method on the challenging Cityscapes, PASCAL VOC 2012, and ADE20K datasets. Without any ImageNet pre-training, the proposed architecture for semantic image segmentation achieves the current best performance.
4 Methods
This section first introduces the continuous relaxation of discrete architectures that exactly matches the hierarchical architecture search space described above, then discusses how to perform the architecture search via optimization, and how to decode a discrete architecture after the search terminates.
4.2 Optimization
The scalars that control the strength of the connections between different hidden states are now also part of the differentiable computational graph, so they can be optimized efficiently with gradient descent. The authors adopt the first-order approximation of [49] and split the training data into two disjoint sets, trainA and trainB. Optimization alternates between:
1. Update the network weights w by ∇_w L_trainA(w, α, β);
2. Update the architecture parameters α, β by ∇_{α,β} L_trainB(w, α, β).
The loss function L is the pixel-wise cross entropy computed on a mini-batch for semantic segmentation; a minimal sketch of this alternating scheme follows.
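The following is a minimal PyTorch-style sketch of the alternating first-order update described above, not the authors' implementation. It assumes the model exposes its weights and its architecture parameters α, β through two separate optimizers, that trainA_batch and trainB_batch come from the two disjoint training splits, and that the ignore index of 255 for unlabeled pixels is a common segmentation convention rather than something specified in the paper.

```python
import torch.nn.functional as F

def search_step(model, trainA_batch, trainB_batch, w_optimizer, arch_optimizer):
    """One alternating step: update weights w on trainA, then architecture α, β on trainB."""
    # 1) Update the network weights w with the gradient of L_trainA(w, α, β).
    imagesA, labelsA = trainA_batch
    w_optimizer.zero_grad()
    lossA = F.cross_entropy(model(imagesA), labelsA, ignore_index=255)
    lossA.backward()
    w_optimizer.step()

    # 2) Update the architecture parameters α, β with the gradient of
    #    L_trainB(w, α, β), using the first-order approximation of [49].
    imagesB, labelsB = trainB_batch
    arch_optimizer.zero_grad()
    lossB = F.cross_entropy(model(imagesB), labelsB, ignore_index=255)
    lossB.backward()
    arch_optimizer.step()
    return lossA.item(), lossB.item()
```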
4.3 Decoding discrete architecture
Cell architecture
Following [49], this study decodes the discrete cell architecture by first retaining the two strongest predecessors of each block and then selecting the most likely operator with an argmax, as sketched below.
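Here is a minimal sketch of that decoding rule (an illustration, not the authors' code). It assumes alpha is a dictionary mapping each edge (block j, predecessor i) to a vector of operator weights; the names decode_cell, num_blocks, and num_inputs are hypothetical.

```python
import numpy as np

def decode_cell(alpha, num_blocks, num_inputs=2):
    """Decode a discrete cell: keep the two strongest predecessors per block,
    then the argmax operator on each kept edge."""
    cell = []
    for j in range(num_blocks):
        # Candidate predecessors: the two cell inputs plus all earlier blocks.
        candidates = list(range(num_inputs + j))
        # Edge strength = weight of its most likely operator.
        strength = {i: alpha[(j, i)].max() for i in candidates}
        top2 = sorted(candidates, key=lambda i: strength[i], reverse=True)[:2]
        # For each kept edge, choose the most likely operator by argmax.
        cell.append([(i, int(np.argmax(alpha[(j, i)]))) for i in top2])
    return cell
```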
Network architecture
Equation 7 essentially says that the "outgoing probabilities" at each blue node in Figure 1 sum to 1. In fact, β can be interpreted as the "transition probabilities" between different "states" (spatial resolutions) at different "time steps" (layers). The goal is to find the path with the maximum probability from start to end, which can be decoded efficiently with the classic Viterbi algorithm, as sketched below.
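Below is a minimal sketch of that Viterbi decoding (an illustration, not the authors' code). It assumes beta is a list where beta[l] is a matrix of transition probabilities from each resolution level at layer l to each level at layer l+1, that moves of at most one level are allowed, and that the path starts at the finest level; these details are assumptions for illustration.

```python
import numpy as np

def viterbi_decode(beta):
    """Return the maximum-probability sequence of resolution levels (one per layer)."""
    num_layers = len(beta) + 1
    num_levels = beta[0].shape[0]
    log_prob = np.full((num_layers, num_levels), -np.inf)
    back = np.zeros((num_layers, num_levels), dtype=int)
    log_prob[0, 0] = 0.0  # assume the path starts at the finest resolution level
    for l in range(1, num_layers):
        for s in range(num_levels):
            # Only transitions from the same level or an adjacent level are allowed.
            for prev in range(max(0, s - 1), min(num_levels, s + 2)):
                score = log_prob[l - 1, prev] + np.log(beta[l - 1][prev, s] + 1e-12)
                if score > log_prob[l, s]:
                    log_prob[l, s] = score
                    back[l, s] = prev
    # Trace the best path backwards from the final layer.
    path = [int(np.argmax(log_prob[-1]))]
    for l in range(num_layers - 1, 0, -1):
        path.append(int(back[l, path[-1]]))
    return path[::-1]
```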
5 Experimental Results