The FBNet series is a family of lightweight networks built entirely with NAS methods. Each generation analyzes the shortcomings of the previous search method and adds incremental improvements: FBNet combines DNAS with resource constraints, FBNetV2 adds channel and input-resolution search, and FBNetV3 uses an accuracy predictor for fast network structure search.
Source: WeChat public account [Algorithm Engineering Notes of Xiaofei]
FBNet
Paper: FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search | CVPR 2019
- Paper address: Arxiv.org/abs/1812.03…
- Paper code: Github.com/facebookres…
Introduction
Recent convolutional network design focuses not only on accuracy but also on runtime performance, especially on mobile devices, which makes designing convolutional networks harder. The main difficulties are:
- Intractable design space. Convolutional networks have many parameters, so the design space is enormous. Many methods have been proposed to automate the search, which saves manual design effort, but such methods generally require a large amount of compute.
- Nontransferable optimality. The performance of a convolutional network depends on many factors, such as the input resolution and the target device. Different resolutions call for different network parameters, and the efficiency of the same block can differ greatly across devices, so a network has to be tuned specifically for each deployment condition.
- Inconsistent efficiency metrics. Most efficiency metrics depend not only on the network architecture but also on the hardware and software configuration of the target device. To simplify things, many studies use hardware-agnostic metrics such as FLOPs to express the cost of a convolution. However, FLOPs do not always correlate with actual performance, which also depends on how the blocks are implemented, making network design even harder.
To solve the above problems, this paper proposes FBNet, which uses differentiable neural architecture search (DNAS) to discover hardware-aware lightweight convolutional networks, as shown in Figure 1. DNAS represents the whole search space as a hypernet, turns the problem of finding the optimal network structure into finding the optimal distribution over candidate blocks, trains the block distribution with gradient descent, and lets each layer of the network choose a different block. To estimate network latency accurately, the actual latency of every candidate block is measured and recorded in advance, so the latency of any network structure can be obtained by simply summing the recorded values.
Method
DNAS formulates the network structure search problem as:
Given the structure space $\mathcal{A}$, the goal is to find the optimal structure $a \in \mathcal{A}$ that, after its weights $w_a$ are trained, minimizes the loss $\mathcal{L}(a, w_a)$. The paper focuses on three factors: the search space $\mathcal{A}$, a loss function $\mathcal{L}(a, w_a)$ that accounts for actual latency, and an efficient search algorithm.
The Search Space
Most previous methods search for a cell structure and stack it into a complete network, but in fact the same cell structure has very different effects on accuracy and latency depending on which layer it sits in. Therefore, this paper constructs a layer-wise search space with a fixed macro-architecture, where each layer can select a block with a different structure. The overall network structure is shown in Table 1: the first and last three layers are fixed, while the remaining layers are searched. Because the feature resolution in the early layers is large, a small number of filters is set manually there to keep the network lightweight.
The layer-wise search space is shown in Figure 3. Based on the classical designs of MobileNetV2 and ShuffleNet, different candidate blocks are constructed by varying the convolution kernel size $K$ (3 or 5), the expansion rate $e$, and the number of groups. If the input and output resolutions of a block match, an element-wise shortcut is added. If grouped convolution is used, channel shuffle is applied to the convolution output.
The experiments use 9 candidate blocks, whose hyperparameters are shown in Table 2. In addition, there is a skip block that maps the input directly to the output, which shortens the depth of the overall network. In total, the network has 22 layers to be searched, each choosing from 9 candidate blocks, giving $9^{22}$ possible structures.
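As a concrete illustration of such a candidate block, here is a minimal PyTorch sketch (layer ordering, names, and the placement of the channel shuffle are assumptions for illustration, not the released implementation):

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Interleave channels so information flows across groups after a grouped conv.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class CandidateBlock(nn.Module):
    """MobileNetV2-style inverted residual with searchable kernel size K,
    expansion rate e and group count g (in the spirit of Table 2)."""
    def __init__(self, c_in, c_out, stride, k=3, e=6, g=1):
        super().__init__()
        c_mid = c_in * e
        self.g = g
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, groups=g, bias=False),   # 1x1 expand (possibly grouped)
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, k, stride, k // 2,
                      groups=c_mid, bias=False),               # KxK depthwise conv
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, groups=g, bias=False),  # 1x1 project (possibly grouped)
            nn.BatchNorm2d(c_out),
        )
        # Element-wise shortcut only when input/output resolutions and widths match.
        self.use_residual = (stride == 1 and c_in == c_out)

    def forward(self, x):
        out = self.block(x)
        if self.g > 1:
            out = channel_shuffle(out, self.g)  # shuffle after grouped convs
        return x + out if self.use_residual else out
```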
Latency-Aware Loss Function
The loss function in Formula 1 should reflect not only accuracy but also the latency on the target hardware. Therefore, the following loss function is defined:
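In the paper's formulation, the latency term enters multiplicatively:

$$\mathcal{L}(a, w_a) = CE(a, w_a) \cdot \alpha \log\big(LAT(a)\big)^{\beta}$$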
$CE(a, w_a)$ is the cross-entropy loss, $LAT(a)$ is the latency of the current structure on the target hardware, $\alpha$ controls the overall magnitude of the loss function, and $\beta$ adjusts the magnitude of the latency term. Computing the latency directly can be time-consuming, so the paper uses a per-block latency lookup table to estimate the latency of the whole network:
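That is, the per-block latencies are summed over all layers:

$$LAT(a) = \sum_{l} LAT\big(b_l^{(a)}\big)$$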
$b_l^{(a)}$ is the block at layer $l$ of structure $a$. This estimate assumes that the blocks execute independently of one another, which holds for serial computing devices such as CPUs and DSPs. With this method, the actual latency of $10^{21}$ networks can be estimated quickly.
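A minimal sketch of this lookup-table estimate (block names and latency values below are invented for illustration, not measurements):

```python
# Pre-measured per-block latencies on the target device, in milliseconds.
# In practice each candidate block of each layer is benchmarked once on the
# target hardware and recorded; the numbers here are hypothetical.
latency_lut = {
    (0, "k3_e1"): 1.2,
    (0, "k3_e6"): 3.4,
    (1, "k5_e6_g2"): 2.8,
    # ... one entry per (layer, candidate block)
}

def estimate_latency(architecture):
    """architecture: list of chosen block names, one per searchable layer.
    Assuming blocks run sequentially, the network latency is just the sum."""
    return sum(latency_lut[(layer, block)] for layer, block in enumerate(architecture))

print(estimate_latency(["k3_e6", "k5_e6_g2"]))  # 3.4 + 2.8 = 6.2 ms
```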
The Search Algorithm
In this paper, the search space is represented as a stochastic hypernet whose overall structure is that of Table 1, with each layer containing the 9 parallel blocks of Table 2. During inference, the probability that a candidate block of a layer is executed is:
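This is a softmax over the layer's architecture parameters $\theta_l$:

$$P_{\theta_l}(b_l = b_{l,i}) = \mathrm{softmax}(\theta_{l,i};\, \theta_l) = \frac{\exp(\theta_{l,i})}{\sum_{i}\exp(\theta_{l,i})}$$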
$\theta_l$ contains the parameters that determine the sampling probability of each candidate block at layer $l$, and the output of layer $l$ can be expressed as:
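That is, each candidate block's output is weighted by its mask variable:

$$x_{l+1} = \sum_{i} m_{l,i} \cdot b_{l,i}(x_l)$$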
$m_{l,i}$ is a random variable in $\{0, 1\}$ sampled according to the sampling probability, so the layer output is the sum of the block outputs weighted by $m_{l,i}$. The sampling probability of a complete network structure $a$ can therefore be expressed as:
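Namely the product of the per-layer sampling probabilities:

$$P_{\theta}(a) = \prod_{l} P_{\theta_l}\big(b_l = b^{(a)}_{l,i}\big)$$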
$\theta$ comprises all the block parameters $\theta_{l,i}$. Based on the above definitions, the discrete optimization problem of Formula 1 can be transformed into:
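That is, minimizing the expected loss under the architecture distribution $P_{\theta}$:

$$\min_{\theta}\; \min_{w_a}\; \mathbb{E}_{a \sim P_{\theta}}\big[\mathcal{L}(a, w_a)\big]$$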
With this formulation the weights $w_a$ are differentiable, but $\theta$ still is not, because $m_{l,i}$ is defined discretely. For this purpose, the sampling of $m_{l,i}$ is replaced with the Gumbel Softmax:
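Which replaces hard sampling with a soft, differentiable relaxation:

$$m_{l,i} = \mathrm{GumbelSoftmax}(\theta_{l,i} \mid \theta_l) = \frac{\exp\big[(\theta_{l,i} + g_{l,i})/\tau\big]}{\sum_{i}\exp\big[(\theta_{l,i} + g_{l,i})/\tau\big]}$$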
$g_{l,i} \sim Gumbel(0, 1)$ is random noise drawn from the Gumbel distribution and $\tau$ is a temperature parameter. As $\tau$ approaches 0, $m_{l,i}$ approaches a one-hot vector; as $\tau$ increases, $m_{l,i}$ approaches a continuous random variable. In this way, the cross-entropy term of Formula 2 is differentiable with respect to both $w_a$ and $\theta$, and the latency term $LAT$ can also be written as:
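That is, the expected latency of the block mixture at every layer:

$$LAT(a) = \sum_{l}\sum_{i} m_{l,i} \cdot LAT(b_{l,i})$$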
Because a lookup table is used, $LAT(b_{l,i})$ is a constant factor, so the overall latency of network $a$ is differentiable with respect to $m_{l,i}$ and hence $\theta_{l,i}$. At this point, the loss function is differentiable with respect to both the weights $w_a$ and the structure variables $\theta$, and SGD can be used to optimize it efficiently. The search process is thus equivalent to training the stochastic hypernet. During training, $\partial \mathcal{L} / \partial w_a$ is computed to update the block weights of the hypernet; once the blocks are trained, each block contributes differently to accuracy and latency, and $\partial \mathcal{L} / \partial \theta$ is computed to update the sampling probability $P_{\theta}$ of each block. After training, the optimal network structure is obtained by sampling networks from the distribution $P_{\theta}$.
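Putting these pieces together, one searchable layer of the stochastic hypernet and the latency-aware loss could be sketched as follows (a simplified illustration, not the released FBNet code; `CandidateBlock` refers to the hypothetical sketch above, and the $\alpha$, $\beta$ values are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """One layer of the stochastic hypernet: parallel candidate blocks mixed by
    Gumbel-Softmax weights m_{l,i} derived from architecture parameters theta_l."""
    def __init__(self, candidates, candidate_latencies):
        super().__init__()
        self.blocks = nn.ModuleList(candidates)                  # candidate blocks b_{l,i}
        self.theta = nn.Parameter(torch.zeros(len(candidates)))  # sampling parameters theta_{l,i}
        self.register_buffer("lat", torch.tensor(candidate_latencies))  # LUT entries LAT(b_{l,i})

    def forward(self, x, tau):
        # Differentiable relaxed one-hot weights m_{l,i}; nearly one-hot as tau -> 0.
        m = F.gumbel_softmax(self.theta, tau=tau, hard=False)
        out = sum(w * blk(x) for w, blk in zip(m, self.blocks))  # soft mixture of block outputs
        layer_latency = (m * self.lat).sum()                     # differentiable latency estimate
        return out, layer_latency

def latency_aware_loss(logits, targets, total_latency, alpha=0.2, beta=0.6):
    """total_latency: sum of layer_latency tensors over all searchable layers.
    alpha/beta here are placeholders, not the paper's tuned values."""
    ce = F.cross_entropy(logits, targets)
    return ce * alpha * torch.log(total_latency) ** beta
```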
Experiments
ImageNet performance comparison with other lightweight networks.
Performance comparisons under specific resource and device conditions.
Conclusion
This paper proposes a differentiable neural architecture search method that turns the discrete selection of block structures into a continuous probability distribution over candidate blocks. In addition, the target-device latency is added to the optimization objective, and combined with weight sharing in the hypernet, a high-performance lightweight network can be generated quickly for a specific deployment condition. However, the block framework is based on the design of the current mainstream MobileNetV2 and ShuffleNet, and the search is mostly over their structural parameters, so the resulting network structures are somewhat constrained.
FBNetV2
Paper: FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions | CVPR 2020
- Paper address: Arxiv.org/abs/2004.05…
- Paper code: Github.com/facebookres…
Introduction
DNAS samples the optimal subnet by training a hypernet that contains all candidate networks. Although the search is fast, it requires a large amount of memory, so the search space is generally smaller than that of other methods, and the memory and computation costs grow linearly with the number of search dimensions.
To solve this problem, this paper proposes DMaskingNAS, which adds the number of channels and the input resolution to the hypernet in the form of masks and sampling respectively, enlarging the search space by up to $10^{14}\times$ at the cost of only a small amount of extra memory and computation.
Channel Search
DNAS generally instantiates every candidate block in the hypernet and selects among them during training, so adding the channel dimension directly to the search space would greatly increase memory and computation.
A conventional implementation, Step A, instantiates convolutions with different output widths. To allow outputs of different widths to be fused, the narrower features are zero-padded (Step B). Step B is equivalent to producing full-width features and multiplying them by three different masks (blue is 0, white is 1), which gives Step C. Since the three convolutions in Step C now have identical input and output sizes, they can be replaced by a single weight-shared convolution, giving Step D. Finally, the masks of Step D are merged first and then multiplied with the convolution output, which saves computation and memory; the result is Step E, which needs only one convolution and one feature map.
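A minimal sketch of the Step E trick, assuming Gumbel-Softmax weights over the candidate channel counts (class and variable names are illustrative, not the official DMaskingNAS code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedChannelSearch(nn.Module):
    """Search over output channel counts with a single weight-shared convolution:
    build one binary mask per candidate width, merge the masks with the
    Gumbel-Softmax weights, and multiply the merged mask with the conv output."""
    def __init__(self, c_in, c_max, candidate_widths, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_max, k, padding=k // 2, bias=False)  # shared, max width
        self.theta = nn.Parameter(torch.zeros(len(candidate_widths)))
        # One 0/1 mask per candidate width: first `w` channels are 1, the rest 0.
        masks = torch.zeros(len(candidate_widths), c_max)
        for i, w in enumerate(candidate_widths):
            masks[i, :w] = 1.0
        self.register_buffer("masks", masks)

    def forward(self, x, tau=1.0):
        g = F.gumbel_softmax(self.theta, tau=tau)        # weights over candidate widths
        mask = (g[:, None] * self.masks).sum(0)          # merge masks first (Step E)
        return self.conv(x) * mask[None, :, None, None]  # single conv, single feature map

layer = MaskedChannelSearch(c_in=16, c_max=32, candidate_widths=[8, 16, 24, 32])
y = layer(torch.randn(2, 16, 56, 56))
```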
Input Resolution Search
Similar to the channel search, a naive DNAS implementation would instantiate a separate set of layers for every input resolution, which increases computation and memory dramatically. There are also some problems that cannot be avoided:
- Feature outputs cannot be fused. Blocks with different input resolutions produce outputs of different sizes, as shown in Figure A, so they cannot be fused directly. Zero padding as in Figure B makes the sizes consistent, but it causes pixel misalignment. Therefore, the interspersed zero-padding sampling shown in Figure C (nearest-neighbor placement plus zero padding) is adopted, which avoids both pixel misalignment and pixel contamination.
- Subsampled feature maps shrink the receptive field. As shown in Figure D, if F is a $3\times 3$ convolution, a single convolution on the interspersed feature map covers only $2\times 2$ valid input pixels. Therefore, the subsampled feature map should be compressed before the convolution and expanded back afterwards, as in Figure E. In practice, Figure E can be realized with a dilated convolution, which avoids extra memory allocation and modification of the convolution kernel.
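A rough sketch of the two tricks above, assuming the low-resolution feature map is exactly half the full resolution (illustrative only, not the paper's code):

```python
import torch
import torch.nn as nn

def intersperse_zeros(x, stride=2):
    """Nearest-neighbour placement + zero padding: keep the original pixels on a
    regular grid of the full-resolution canvas, so outputs of different
    resolutions can be summed without pixel misalignment."""
    b, c, h, w = x.shape
    out = x.new_zeros(b, c, h * stride, w * stride)
    out[:, :, ::stride, ::stride] = x
    return out

# At the positions of original pixels, a 3x3 conv with dilation equal to the
# subsampling stride reads only other original pixels, so its effective
# receptive field on the low-resolution map stays 3x3 (the "compress -> conv
# -> expand" of Figure E) without allocating extra memory.
x_low = torch.randn(1, 8, 14, 14)
x_full = intersperse_zeros(x_low)                        # 1 x 8 x 28 x 28, zeros in between
conv = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)
y = conv(x_full)
```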
The experiments do not describe the configuration or the search procedure for the input resolution; only the results are shown, and the authors released only the searched networks, not the search code. Presumably, during the search the same hypernet is used to extract features from inputs of different resolutions, the outputs are then combined for training, and the resolution with the largest weight is finally selected, as sketched in Figure 2, where $F$ is the shared hypernet.
Experiments
The overall network structure and the candidate blocks used during the search, giving a total of $10^{35}$ candidate networks.
Multiple FBNetV2 networks are searched, and each network has different resource requirements.
Performance comparison with other networks.
Conclusion
As mentioned before, the block framework of FBNet is based on the design of the current mainstream MobileNetV2 and ShuffleNet; it mainly searches their structural parameters, so the resulting network structures are constrained. FBNetV2 directly expands the search space by $10^{14}\times$, and its results are better than most current networks in every respect; overall, however, the improvement feels more quantitative than qualitative, because the base is still fixed to existing network structure designs.
FBNetV3
Paper: FBNetV3: Joint Architecture-Recipe Search using Neural Acquisition Function
- Paper address: Arxiv.org/abs/2006.02…
Introduction
FBNetV3 is currently available only on arXiv. The paper argues that most current NAS methods handle only the search for the network structure and pay no attention to whether the training hyperparameters are set properly when verifying network performance, which can understate a model's real performance. Therefore, this paper proposes JointNAS to search, under resource constraints, for the most accurate combination of network structure and training recipe.
JointNAS
The JointNAS optimization objective can be formulated as:
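Roughly, a constrained maximization (here $C_i$ denotes the budget of resource type $i$; the budget symbol is ours, not necessarily the paper's notation):

$$\max_{(A, h) \in \Omega} \; acc(A, h) \quad \text{s.t.} \quad g_i(A) \le C_i,\; i = 1, \dots, \gamma$$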
$A$, $h$, and $\Omega$ denote the network architecture embedding, the training-recipe embedding, and the search space respectively; $acc$ computes the accuracy of the current architecture and training recipe; $g_i$ and $\gamma$ denote the consumption function for resource type $i$ and the number of resource types respectively.
The JointNAS search process is shown in Alg. 1; it divides the search into two stages:
- Coarse-grained stage: iteratively search for high-performance candidate architecture-recipe pairs and train an accuracy predictor.
- Fine-grained stage: use the predictor from the coarse-grained stage to quickly search candidate networks, together with AutoTrain, the hyperparameter optimizer proposed in this paper.
Coarse-grained search: Constrained iterative optimization
Coarse-grained search generates an accuracy predictor and a set of high-performance candidate networks.
Neural Acquisition Function
The structure of the predictor is shown in Figure 4. It consists of an architecture encoder and two heads: an auxiliary proxy head and an accuracy head. The proxy head predicts network attributes (FLOPs, number of parameters, etc.) and is mainly used to pre-train the encoder. The accuracy head takes the training recipe and the architecture encoding and predicts accuracy; it is fine-tuned during the iterative optimization on top of the encoder pre-trained with the proxy head.
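A schematic of such a two-head predictor might look like the following (layer sizes and names are invented; this sketches the idea, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AccuracyPredictor(nn.Module):
    """Architecture encoder shared by two heads: an auxiliary proxy head that
    predicts cheap network attributes (used to pre-train the encoder) and an
    accuracy head that takes the encoding plus the training recipe."""
    def __init__(self, arch_dim, recipe_dim, n_proxy=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(arch_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.proxy_head = nn.Linear(hidden, n_proxy)  # e.g. FLOPs, #params
        self.accuracy_head = nn.Sequential(
            nn.Linear(hidden + recipe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, arch, recipe=None):
        z = self.encoder(arch)
        if recipe is None:                            # pre-training phase (Step 1)
            return self.proxy_head(z)
        return self.accuracy_head(torch.cat([z, recipe], dim=-1))  # fine-tuning (Step 2)

# The accuracy head is fine-tuned with a robust loss (Huber), as described in Step 2 below.
criterion = nn.HuberLoss()
```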
Step 1. Pre-train embedding layer
The predictor is trained in two steps. First, the model is trained to take a network architecture as input and predict its attributes (FLOPs, number of parameters, etc.); such training data is easy to obtain, since a large number of architectures can be generated randomly and their attributes computed directly. The pre-trained encoder is then shared with the accuracy head for the subsequent architecture search. Pre-training the encoder significantly improves the accuracy and stability of the predictor, as shown in Figure 5.
Step 2. Constrained iterative optimization
First, quasi-Monte Carlo sampling is used to draw architecture-recipe pairs from the search space, and then the predictor is trained iteratively:
- Select a batch of architecture-recipe pairs that satisfy the constraints, based on the predictor's output.
- Train and evaluate the selected architecture-recipe pairs, using an early-stopping strategy: in the first iteration, the per-epoch accuracies and the final accuracy of each network are used to plot the correlation between each epoch's ranking of the networks and the final ranking, as shown in Figure 3, and the epoch at which the correlation reaches 0.92 is taken as the training length.
- Update the predictor: for the first 50 epochs the encoder parameters are frozen, after which the whole predictor is fine-tuned with a gradually decaying learning rate. The accuracy head is trained with the Huber loss, which is robust to the influence of outliers.
This iterative process reduces the number of candidates, avoids unnecessary validation, and improves exploration efficiency.
Fine-grained search: Predictor-based evolutionary search
In the second stage, an adaptive genetic algorithm is used: the best architecture-recipe pairs from the first stage form the initial population. In each iteration, the population is mutated to produce new candidates that satisfy the constraints, the predictor trained in the coarse-grained stage scores each individual quickly, and the top $K$ architecture-recipe pairs are kept as the next generation. The maximum score improvement of the current iteration relative to the previous one is computed, and the search exits when the improvement is too small, yielding the final high-accuracy network structure and its training recipe. Notably, when the resource constraints change, the predictor can be reused, and the fine-grained stage alone can quickly find a suitable architecture and training recipe.
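A condensed sketch of this fine-grained stage, assuming a trained `predictor` plus user-supplied `mutate` and `satisfies_constraints` callables (all hypothetical helpers, not from the released code):

```python
def fine_grained_search(predictor, mutate, satisfies_constraints,
                        first_generation, k=20, eps=1e-3, max_iters=100):
    """Adaptive genetic search scored by the predictor from the coarse-grained stage.
    `mutate` and `satisfies_constraints` are user-supplied callables."""
    population = list(first_generation)          # best architecture-recipe pairs from stage 1
    best_score = max(predictor(c) for c in population)
    for _ in range(max_iters):
        # Mutate the population, keeping only children that respect the resource budget.
        children = [mutate(p) for p in population for _ in range(4)]
        children = [c for c in children if satisfies_constraints(c)]
        # Score everything with the predictor (no training needed) and keep the top k.
        population = sorted(population + children, key=predictor, reverse=True)[:k]
        new_best = predictor(population[0])
        if new_best - best_score < eps:          # improvement too small: stop early
            break
        best_score = new_best
    return population[0]                         # best architecture-recipe pair found
```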
Search space
The search space is shown in Table 1; it contains $10^{17}$ network structures and $10^{7}$ training-hyperparameter combinations.
Experiments
The network structure is fixed to test the effectiveness of the training-recipe search.
Compare ImageNet performance with other networks.
Conclusion
FBNetV3 departs completely from the designs of FBNet and FBNetV2. The accuracy predictor and genetic algorithm it uses have appeared in many NAS works before; the main highlight is that the training recipe is added to the search, which turns out to be very important for performance.
Conclusion
The FBNet series is a family of lightweight networks built entirely with NAS methods. Each generation analyzes the shortcomings of the previous search method and adds incremental improvements: FBNet combines DNAS with resource constraints, FBNetV2 adds channel and input-resolution search, and FBNetV3 uses an accuracy predictor for fast network structure search. We look forward to the complete code being open-sourced.
If this article was helpful to you, please give it a thumbs up or check it out
For more information, please follow the WeChat official account [Algorithm Engineering Notes of Xiaofei]