Wu Binghong is an engineer at a major Chinese Internet company and a computer vision enthusiast. His research interests are object detection and medical imaging.

Abstract

EfficientDet is the new SOTA algorithm for object detection proposed by Google Brain at the end of 2019 and accepted to CVPR 2020. This project analyzes the EfficientDet algorithm in detail and introduces the details of reproducing the model with PaddleDetection, the official object detection development kit.

EfficientDet comes from a CVPR 2020 paper:

Arxiv.org/abs/1911.09…

Github.com/google/auto…

Its core idea is to take EfficientNet, obtained through neural architecture search, as the backbone, perform further multi-scale feature fusion with the newly designed BiFPN, and finally generate detection boxes through the classification/regression branches, thereby extending an efficient classifier into an efficient detector. In terms of overall structure, EfficientDet does not differ significantly from anchor-based one-stage detectors such as RetinaNet, but within each individual module, EfficientDet squeezes the most model performance out of limited computing/memory resources.

Performance comparison of EfficientDet with other mainstream models:

The EfficientDet network structure:

As shown in the figure, from left to right EfficientDet is divided into three parts: the backbone EfficientNet, the neck BiFPN, and the head (the class/box prediction nets). The model structure is defined in the following code:

class EfficientDet(object):
    """
    EfficientDet architecture, see https://arxiv.org/abs/1911.09070

    Args:
        backbone (object): backbone instance
        fpn (object): feature pyramid network instance
        retina_head (object): `RetinaHead` instance
    """
    __category__ = 'architecture'
    __inject__ = ['backbone', 'fpn', 'efficient_head', 'anchor_grid']

    def __init__(self, backbone, fpn, efficient_head, anchor_grid, box_loss_weight=50.):
        super(EfficientDet, self).__init__()
        self.backbone = backbone
        self.fpn = fpn
        self.efficient_head = efficient_head
        self.anchor_grid = anchor_grid
        self.box_loss_weight = box_loss_weight

    def build(self, feed_vars, mode='train'):
        im = feed_vars['image']
        if mode == 'train':
            gt_labels = feed_vars['gt_label']
            gt_targets = feed_vars['gt_target']
            fg_num = feed_vars['fg_num']
        else:
            im_info = feed_vars['im_info']
        mixed_precision_enabled = mixed_precision_global_state() is not None
        if mixed_precision_enabled:
            im = fluid.layers.cast(im, 'float16')
        body_feats = self.backbone(im)
        if mixed_precision_enabled:
            body_feats = [fluid.layers.cast(f, 'float32') for f in body_feats]
        body_feats = self.fpn(body_feats)
        anchors = self.anchor_grid()

        if mode == 'train':
            loss = self.efficient_head.get_loss(body_feats, gt_labels, gt_targets, fg_num)
            loss_cls = loss['loss_cls']
            loss_bbox = loss['loss_bbox']
            total_loss = loss_cls + self.box_loss_weight * loss_bbox
            loss.update({'loss': total_loss})
            return loss
        else:
            pred = self.efficient_head.get_prediction(body_feats, anchors, im_info)
            return pred

In terms of overall structure, EfficientDet provides a series of configurations, from simple to complex, to trade off speed against accuracy. Taking EfficientDet-D0 as an example, the model networking & training configuration parameters are given in the following YML file:

architecture: EfficientDet
...
pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/EfficientNetB0_pretrained.tar
weights: output/efficientdet_d0/model_final
...

EfficientDet:
  backbone: EfficientNet
  fpn: BiFPN
  efficient_head: EfficientHead
  anchor_grid: AnchorGrid
  box_loss_weight: 50.

EfficientNet:
  norm_type: sync_bn
  scale: b0
  use_se: true

BiFPN:
  num_chan: 64
  repeat: 3
  levels: 5

EfficientHead:
  repeat: 3
  num_chan: 64
  prior_prob: 0.01
  num_anchors: 9
  gamma: 1.5
  alpha: 0.25
  delta: 0.1
  output_decoder:
    score_thresh: 0.05   # originally 0.
    nms_thresh: 0.5
    pre_nms_top_n: 1000  # originally 5000
    detections_per_im: 100
    nms_eta: 1.0

AnchorGrid:
  anchor_base_scale: 4
  num_scales: 3
  aspect_ratios: [[1, 1], [1.4, 0.7], [0.7, 1.4]]
...

Backbone: EfficientNet

EfficientNet is a classification network published at ICML 2019 by Mingxing Tan, also the first author of EfficientDet. It focuses on how to assemble network structures more efficiently under limited computing resources so that the model achieves higher classification accuracy. The network design of EfficientNet considers three dimensions: network depth, network width, and input image resolution. In the setting of network architecture search, the optimization objective defined by the authors is as follows:
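With d, w, r denoting the depth, width, and resolution coefficients and N(d, w, r) the scaled network, the objective in the EfficientNet paper is:

$$
\max_{d,\,w,\,r}\ \ \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big)
$$
$$
\text{s.t.}\quad \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\Big(X_{\langle r \cdot \hat{H}_i,\ r \cdot \hat{W}_i,\ w \cdot \hat{C}_i \rangle}\Big)
$$
$$
\mathrm{Memory}(\mathcal{N}) \le \text{target\_memory}, \qquad \mathrm{FLOPS}(\mathcal{N}) \le \text{target\_flops}
$$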

In the architecture search, network depth, network width, and input image resolution are all variables. To model the relationship among the three under a fixed computing budget, the authors also propose the following formulation to express the constraints among them:
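With a compound coefficient phi and constants alpha, beta, gamma determined by grid search, the constraint in the paper reads:

$$
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}
$$
$$
\text{s.t.}\quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
$$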

Different from other network design papers, EfficientNet proposes the compound scaling method in its architecture search, which can be divided into two steps:

1. With a fixed computing budget, the depth/width/resolution of the baseline network are obtained through grid search;

2. A family of network structures, EfficientNet-B0 to B7, is obtained by scaling up depth/width/resolution simultaneously with the compound coefficient (a quick numeric sketch follows).
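A quick numeric sketch of step 2 (illustration only, not PaddleDetection code): using alpha = 1.2, beta = 1.1, gamma = 1.15 as reported in the EfficientNet paper for B0, the multipliers for a given compound coefficient phi are:

# Illustration of compound scaling: depth/width/resolution multipliers for a given phi
alpha, beta, gamma = 1.2, 1.1, 1.15  # grid-searched bases reported for EfficientNet-B0

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print('phi={}: depth x{:.2f}, width x{:.2f}, resolution x{:.2f}'.format(phi, d, w, r))

Note that the width/depth coefficients hard-coded in params_dict in the implementation below are the released values, which are close to, but not exactly, these powers.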

The EfficientNet implementation in PaddleDetection is as follows:

from __future__ import absolute_import
from __future__ import division

import collections
import math
import re

from paddle import fluid
from paddle.fluid.regularizer import L2Decay

from ppdet.core.workspace import register

__all__ = ['EfficientNet']

GlobalParams = collections.namedtuple('GlobalParams', [
    'batch_norm_momentum', 'batch_norm_epsilon', 'width_coefficient', 'depth_coefficient', 'depth_divisor'
])

BlockArgs = collections.namedtuple('BlockArgs', [
    'kernel_size', 'num_repeat', 'input_filters', 'output_filters', 'expand_ratio', 'stride', 'se_ratio'
])

GlobalParams.__new__.__defaults__ = (None, ) * len(GlobalParams._fields)
BlockArgs.__new__.__defaults__ = (None, ) * len(BlockArgs._fields)

def _decode_block_string(block_string):
    assert isinstance(block_string, str)
    ops = block_string.split('_')
    options = {}
    for op in ops:
        splits = re.split(r'(\d.*)', op)
        if len(splits) >= 2:
            key, value = splits[:2]
            options[key] = value

    assert (('s' in options and len(options['s']) == 1) or (len(options['s']) == 2 and options['s'][0] == options['s'][1]))

    return BlockArgs(
        kernel_size=int(options['k']),
        num_repeat=int(options['r']),
        input_filters=int(options['i']),
        output_filters=int(options['o']),
        expand_ratio=int(options['e']),
        se_ratio=float(options['se']) if 'se' in options else None,
        stride=int(options['s'][0]))

def get_model_params(scale):
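    # Block string encoding (decoded by _decode_block_string):
    # r = num_repeat, k = kernel_size, s = stride (two equal digits),
    # e = expand_ratio, i = input_filters, o = output_filters, se = se_ratio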
    block_strings = [
        'r1_k3_s11_e1_i32_o16_se0.25',
        'r2_k3_s22_e6_i16_o24_se0.25',
        'r2_k5_s22_e6_i24_o40_se0.25',
        'r3_k3_s22_e6_i40_o80_se0.25',
        'r3_k5_s11_e6_i80_o112_se0.25',
        'r4_k5_s22_e6_i112_o192_se0.25',
        'r1_k3_s11_e6_i192_o320_se0.25',
    ]
    block_args = []
    for block_string in block_strings:
        block_args.append(_decode_block_string(block_string))

    params_dict = {
        # width, depth
        'b0': (1.0, 1.0),
        'b1': (1.0, 1.1),
        'b2': (1.1, 1.2),
        'b3': (1.2, 1.4),
        'b4': (1.4, 1.8),
        'b5': (1.6, 2.2),
        'b6': (1.8, 2.6),
        'b7': (2.0, 3.1),
    }

    w, d = params_dict[scale]

    global_params = GlobalParams(
        batch_norm_momentum=0.99,
        batch_norm_epsilon=1e-3,
        width_coefficient=w,
        depth_coefficient=d,
        depth_divisor=8)

    return block_args, global_params

def round_filters(filters, global_params):
    multiplier = global_params.width_coefficient
    if not multiplier:
        return filters
    divisor = global_params.depth_divisor
    filters *= multiplier
    min_depth = divisor
    new_filters = max(min_depth, int(filters + divisor / 2) // divisor * divisor)
    if new_filters < 0.9 * filters:  # prevent rounding by more than 10%
        new_filters += divisor
    return int(new_filters)

def round_repeats(repeats, global_params):
    multiplier = global_params.depth_coefficient
    if not multiplier:
        return repeats
    return int(math.ceil(multiplier * repeats))

def conv2d(inputs, num_filters, filter_size, stride=1, padding='SAME', groups=1, use_bias=False, name='conv2d'):
    param_attr = fluid.ParamAttr(name=name + '_weights')
    bias_attr = False
    if use_bias:
        bias_attr = fluid.ParamAttr(name=name + '_offset', regularizer=L2Decay(0.))
    feats = fluid.layers.conv2d(inputs, num_filters, filter_size, groups=groups, name=name, stride=stride, padding=padding, param_attr=param_attr, bias_attr=bias_attr)
    return feats

def batch_norm(inputs, momentum, eps, name=None):
    param_attr = fluid.ParamAttr(name=name + '_scale', regularizer=L2Decay(0.))
    bias_attr = fluid.ParamAttr(name=name + '_offset', regularizer=L2Decay(0.))
    return fluid.layers.batch_norm(input=inputs, momentum=momentum, epsilon=eps, name=name, moving_mean_name=name + '_mean', moving_variance_name=name + '_variance', param_attr=param_attr, bias_attr=bias_attr)

def mb_conv_block(inputs, input_filters, output_filters, expand_ratio, kernel_size, stride, momentum, eps, se_ratio=None, name=None):
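    # MBConv block: 1x1 expand -> depthwise conv -> squeeze-and-excite -> 1x1 project,
    # with a skip connection when stride == 1 and input/output channels match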
    feats = inputs
    num_filters = input_filters * expand_ratio

    if expand_ratio != 1:
        feats = conv2d(feats, num_filters, 1, name=name + '_expand_conv')
        feats = batch_norm(feats, momentum, eps, name=name + '_bn0')
        feats = fluid.layers.swish(feats)

    feats = conv2d(feats, num_filters, kernel_size, stride, groups=num_filters, name=name + '_depthwise_conv')
    feats = batch_norm(feats, momentum, eps, name=name + '_bn1')
    feats = fluid.layers.swish(feats)

    if se_ratio is not None:
        filter_squeezed = max(1, int(input_filters * se_ratio))
        squeezed = fluid.layers.pool2d(feats, pool_type='avg', global_pooling=True)
        squeezed = conv2d(squeezed, filter_squeezed, 1, use_bias=True, name=name + '_se_reduce')
        squeezed = fluid.layers.swish(squeezed)
        squeezed = conv2d(squeezed, num_filters, 1, use_bias=True, name=name + '_se_expand')
        feats = feats * fluid.layers.sigmoid(squeezed)

    feats = conv2d(feats, output_filters, 1, name=name + '_project_conv')
    feats = batch_norm(feats, momentum, eps, name=name + '_bn2')

    if stride == 1 and input_filters == output_filters:
        feats = fluid.layers.elementwise_add(feats, inputs)

    return feats

@register
class EfficientNet(object):
    """
    EfficientNet, see https://arxiv.org/abs/1905.11946
    Args:
        scale (str): compounding scale factor, 'b0' - 'b7'.
        use_se (bool): use squeeze and excite module.
        norm_type (str): normalization type, 'bn' and 'sync_bn' are supported
    """
    __shared__ = ['norm_type']

    def __init__(self, scale='b0', use_se=True, norm_type='bn'):
        assert scale in ['b' + str(i) for i in range(8)], "valid scales are b0 - b7"
        assert norm_type in ['bn', 'sync_bn'], "only 'bn' and 'sync_bn' are supported"

        super(EfficientNet, self).__init__()
        self.norm_type = norm_type
        self.scale = scale
        self.use_se = use_se

    def __call__(self, inputs):
        blocks_args, global_params = get_model_params(self.scale)
        momentum = global_params.batch_norm_momentum
        eps = global_params.batch_norm_epsilon

        num_filters = round_filters(32, global_params)
        feats = conv2d(inputs, num_filters=num_filters, filter_size=3, stride=2, name='_conv_stem')
        feats = batch_norm(feats, momentum=momentum, eps=eps, name='_bn0')
        feats = fluid.layers.swish(feats)

        layer_count = 0
        feature_maps = []

        for b, block_arg in enumerate(blocks_args):
            for r in range(block_arg.num_repeat):
                input_filters = round_filters(block_arg.input_filters, global_params)
                output_filters = round_filters(block_arg.output_filters, global_params)
                kernel_size = block_arg.kernel_size
                stride = block_arg.stride
                se_ratio = None
                if self.use_se:
                    se_ratio = block_arg.se_ratio
                if r > 0:
                    input_filters = output_filters
                    stride = 1
                feats = mb_conv_block(feats, input_filters, output_filters, block_arg.expand_ratio, kernel_size, stride, momentum, eps, se_ratio=se_ratio, name='_blocks.{}.'.format(layer_count))
                layer_count += 1
            feature_maps.append(feats)

        return list(feature_maps[i] for i in [2, 4, 6])

The EfficientNet parameter scale corresponds to the compound coefficient in the paper, with options b0 - b7. During model training/inference, the feature maps of three blocks at different depths (indices 2, 4, and 6 in the code, i.e. strides 8/16/32) are returned and fed into BiFPN for further multi-scale feature fusion.
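A minimal usage sketch under the static-graph fluid API, assuming the EfficientNet class above is importable (the input size of 512 and the variable names are illustrative assumptions, not taken from PaddleDetection):

import paddle.fluid as fluid

# build EfficientNet-b0 and inspect the three feature maps that are fed to BiFPN
image = fluid.data(name='image', shape=[None, 3, 512, 512], dtype='float32')
backbone = EfficientNet(scale='b0', use_se=True, norm_type='bn')
c3, c4, c5 = backbone(image)  # strides 8 / 16 / 32 relative to the input
print(c3.shape, c4.shape, c5.shape)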

Neck: BiFPN

As the key innovation in EfficientDet, BiFPN fuses features of different scales by stacking multiple "BiFPN layers". The figure below compares the BiFPN layer with other FPN structures. Compared with the basic FPN structure, BiFPN adds a second, bottom-up fusion path on top of the top-down path. Compared with PANet, which also adds a second bottom-up path, BiFPN further introduces skip connections between features of the same scale (the purple arrows), and every edge in each BiFPN layer carries its own attention weight. When computing the feature map of a node, the feature maps at the tails of the arrows pointing to that node are first resized and then combined with normalized weights, yielding the feature map of that node.

Taking node P6 as an example, the feature fusion is computed as follows, where the first equation gives the intermediate node in the middle column and the second gives the output node in the last column:
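From the EfficientDet paper, with w denoting the learned fusion weights and epsilon a small constant for numerical stability:

$$
P_6^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)
$$
$$
P_6^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)
$$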

The implementation of the BiFPN layer is shown below (see the BiFPNCell class):

from __future__ import absolute_import
from __future__ import division

from paddle import fluid
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.regularizer import L2Decay
from paddle.fluid.initializer import Constant, Xavier

from ppdet.core.workspace import register

__all__ = ['BiFPN']

class FusionConv(object):
    def __init__(self, num_chan):
        super(FusionConv, self).__init__()
        self.num_chan = num_chan

    def __call__(self, inputs, name=''):
        x = fluid.layers.swish(inputs)
        # depthwise
        x = fluid.layers.conv2d(x, self.num_chan, filter_size=3, padding='SAME', groups=self.num_chan, param_attr=ParamAttr(initializer=Xavier(), name=name + '_dw_w'), bias_attr=False)
        # pointwise
        x = fluid.layers.conv2d(x, self.num_chan, filter_size=1, param_attr=ParamAttr(initializer=Xavier(), name=name + '_pw_w'), bias_attr=ParamAttr(regularizer=L2Decay(0.), name=name + '_pw_b'))
        # bn + act
        x = fluid.layers.batch_norm(x, momentum=0.997, epsilon=1e-04, param_attr=ParamAttr(initializer=Constant(1.0), regularizer=L2Decay(0.), name=name + '_bn_w'), bias_attr=ParamAttr(regularizer=L2Decay(0.), name=name + '_bn_b'))
        return x

class BiFPNCell(object):
    def __init__(self, num_chan, levels=5):
        super(BiFPNCell, self).__init__()
        self.levels = levels
        self.num_chan = num_chan
        num_trigates = levels - 2
        num_bigates = levels
        self.trigates = fluid.layers.create_parameter(shape=[num_trigates, 3], dtype='float32', default_initializer=fluid.initializer.Constant(1.))
        self.bigates = fluid.layers.create_parameter(shape=[num_bigates, 2], dtype='float32', default_initializer=fluid.initializer.Constant(1.))
        self.eps = 1e-4

    def __call__(self, inputs, cell_name=''):
        assert len(inputs) == self.levels

        def upsample(feat):
            return fluid.layers.resize_nearest(feat, scale=2.)

        def downsample(feat):
            return fluid.layers.pool2d(feat, pool_type='max', pool_size=3, pool_stride=2, pool_padding='SAME')

        fuse_conv = FusionConv(self.num_chan)

        # normalize weight
        trigates = fluid.layers.relu(self.trigates)
        bigates = fluid.layers.relu(self.bigates)
        trigates /= fluid.layers.reduce_sum(trigates, dim=1, keep_dim=True) + self.eps
        bigates /= fluid.layers.reduce_sum(bigates, dim=1, keep_dim=True) + self.eps

        feature_maps = list(inputs)  # make a copy
        # top down path
        for l in range(self.levels - 1):
            p = self.levels - l - 2
            w1 = fluid.layers.slice(bigates, axes=[0, 1], starts=[l, 0], ends=[l + 1, 1])
            w2 = fluid.layers.slice(bigates, axes=[0, 1], starts=[l, 1], ends=[l + 1, 2])
            above = upsample(feature_maps[p + 1])
            feature_maps[p] = fuse_conv(w1 * above + w2 * inputs[p], name='{}_tb_{}'.format(cell_name, l))
        # bottom up path
        for l in range(1, self.levels):
            p = l
            name = '{}_bt_{}'.format(cell_name, l)
            below = downsample(feature_maps[p - 1])
            if p == self.levels - 1:
                # handle P7
                w1 = fluid.layers.slice(bigates, axes=[0, 1], starts=[p, 0], ends=[p + 1, 1])
                w2 = fluid.layers.slice(bigates, axes=[0, 1], starts=[p, 1], ends=[p + 1, 2])
                feature_maps[p] = fuse_conv(w1 * below + w2 * inputs[p], name=name)
            else:
                w1 = fluid.layers.slice(trigates, axes=[0, 1], starts=[p - 1, 0], ends=[p, 1])
                w2 = fluid.layers.slice(trigates, axes=[0, 1], starts=[p - 1, 1], ends=[p, 2])
                w3 = fluid.layers.slice(trigates, axes=[0, 1], starts=[p - 1, 2], ends=[p, 3])
                feature_maps[p] = fuse_conv(w1 * feature_maps[p] + w2 * below + w3 * inputs[p], name=name)
        return feature_maps

@register
class BiFPN(object):
    """
    Bidirectional Feature Pyramid Network, see https://arxiv.org/abs/1911.09070

    Args:
        num_chan (int): number of feature channels
        repeat (int): number of repeats of the BiFPN module
        levels (int): number of FPN levels, default: 5
    """

    def __init__(self, num_chan, repeat=3, levels=5):
        super(BiFPN, self).__init__()
        self.num_chan = num_chan
        self.repeat = repeat
        self.levels = levels

    def __call__(self, inputs):
        feats = []
        # NOTE add two extra levels
        for idx in range(self.levels):
            if idx <= len(inputs):
                if idx == len(inputs):
                    feat = inputs[-1]
                else:
                    feat = inputs[idx]
                if feat.shape[1] != self.num_chan:
                    feat = fluid.layers.conv2d(feat, self.num_chan, filter_size=1, padding='SAME', param_attr=ParamAttr(initializer=Xavier()), bias_attr=ParamAttr(regularizer=L2Decay(0.)))
                    feat = fluid.layers.batch_norm(feat, momentum=0.997, epsilon=1e-04, param_attr=ParamAttr(initializer=Constant(1.0), regularizer=L2Decay(0.)), bias_attr=ParamAttr(regularizer=L2Decay(0.)))
            if idx >= len(inputs):
                feat = fluid.layers.pool2d(feat, pool_type='max', pool_size=3, pool_stride=2, pool_padding='SAME')
            feats.append(feat)
        biFPN = BiFPNCell(self.num_chan, self.levels)
        for r in range(self.repeat):
            feats = biFPN(feats, 'bifpn_{}'.format(r))
        return feats

On the basis of this BiFPN layer, the complete BiFPN is built by stacking different numbers of BiFPN layers. To match EfficientNet backbones of different complexity, BiFPN follows the same design idea: a similar compound coefficient controls the width and depth of BiFPN. The specific formula is as follows:
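From the EfficientDet paper, with compound coefficient phi, the BiFPN width (number of channels) and depth (number of layers) scale as:

$$
W_{bifpn} = 64 \cdot \left(1.35^{\phi}\right), \qquad D_{bifpn} = 3 + \phi
$$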

Under this compound coefficient control, analogous to EfficientNet's, the scaling of the backbone and BiFPN across configurations is as follows:

In PaddleDetection, the BiFPN implementation is as shown above, and its parameters are set in the BiFPN section of the configuration file efficientdet_d0.yml, where the repeat parameter corresponds to "#layers" of BiFPN and num_chan corresponds to "#channels" of BiFPN. Overall, as the EfficientDet configuration grows in complexity, both the number of BiFPN layers and the number of channels increase.
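As a rough sketch of how this scaling would map onto the configuration file, using the D1 settings from the paper's scaling table (backbone b1, 88 BiFPN channels, 4 BiFPN layers) — a hypothetical edit of efficientdet_d0.yml, not an official PaddleDetection config:

EfficientNet:
  scale: b1

BiFPN:
  num_chan: 88
  repeat: 4
  levels: 5

EfficientHead:
  repeat: 3
  num_chan: 88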

Head: Class Prediction Net & Box Prediction Net

As an anchor-based object detection algorithm, the head of EfficientDet is basically consistent with other existing SOTA algorithms: it performs classification and box regression on the 5 feature maps of different scales produced by BiFPN. In terms of implementation, both the class prediction net and the box prediction net use depthwise separable convolution layers, and backbones of different complexity correspond to different numbers of stacked layers. Unlike other anchor-based detectors, EfficientDet adopts parameter sharing in the classification and regression convolution layers to reduce the number of model parameters: the convolution kernels are shared across FPN levels. Note, however, that the BN layers of different levels remain independent of each other. As can be seen below, the name of a convolution layer does not depend on the level parameter, while the BN layer name does:

def subnet(inputs, prefix, level):
    feat = inputs
    for i in range(self.repeat):
        # NOTE share weight across FPN levels
        conv_name = '{}_pred_conv_{}'.format(prefix, i)
        feat = separable_conv(feat, self.num_chan, name=conv_name)
        # NOTE batch norm params are not shared
        bn_name = '{}_pred_bn_{}_{}'.format(prefix, level, i)
        feat = fluid.layers.batch_norm(input=feat, act='swish', momentum=0.997, epsilon=1e-4, moving_mean_name=bn_name + '_mean', moving_variance_name=bn_name + '_variance', param_attr=ParamAttr(name=bn_name + '_w', initializer=Constant(value=1.), regularizer=L2Decay(0.)), bias_attr=ParamAttr(name=bn_name + '_b', regularizer=L2Decay(0.)))
    return feat

Training methods

In the original implementation, EfficientDet D0 - D7 were trained on 32 TPUv3 cores with a batch size of 128 for 300 epochs (600 epochs for D7/D7x). Training with the EfficientDet-D0 configuration reproduced on PaddleDetection and evaluating the checkpoint from the 216th epoch, the performance on COCO minival is as follows, fluctuating within 0.02 mAP of the originally reported result:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.341
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.523
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.360
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.134
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.401
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.289
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.445
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.196
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.690

In summary, the reproduced EfficientDet-D0 matches the performance described in the original paper. All reproduction code and related models will be merged into the official PaddleDetection code base in the near future, and pretrained models for higher configurations on COCO will be added gradually. Everyone is welcome to try it out and give feedback:

Github.com/PaddlePaddl…