Abstract: This paper proposes an end-to-end, Transformer-based line segment detection model. A multi-scale Encoder/Decoder strategy yields more accurate line endpoint coordinates, and the authors use the distance between the predicted endpoints of a line segment and the endpoints of the ground truth directly as the objective function, which improves the regression of endpoint coordinates.
1. Paper summary
Traditional morphology-based line segment detection first performs edge detection on the image and then post-processes the edges to obtain line segments. Typical deep learning methods first predict heatmap features for line endpoints and lines, and then fuse them to produce the final detections. The authors propose a new Transformer-based method that directly outputs line segment detections end to end, i.e., the endpoint coordinates of each segment, without edge detection or endpoint/line heatmap features.
Line segment detection belongs to the category of object detection. The LETR model proposed in this paper is an extension of DETR (End-to-End Object Detection with Transformers). The difference lies in what the Decoder regresses in the final prediction: DETR regresses the center point, width, and height of a box, while LETR regresses the endpoint coordinates of a line segment.
So, let's first look at how DETR uses the Transformer for object detection, and then focus on the parts unique to LETR.
2. How to use the Transformer for object detection (DETR)
Figure 1. DETR model structure
Above is the model structure of DETR. DETR first uses a CNN backbone to extract image features, which are fed into a Transformer to produce N predicted boxes; an FFN then performs classification and coordinate regression, much as in traditional object detection. The N predicted boxes are then bipartite-matched against the M ground-truth boxes (N > M; the unmatched predictions are assigned the "no object" class, and their coordinate values are set directly to 0). The matching result and the matching loss are used to update the weight parameters, yielding the final box detections and categories. Here are a few key points:
- The first is the serialization and encoding of image features.
The CNN backbone outputs features of dimension C×H×W. First, a 1×1 conv reduces the channel dimension from C to d, giving a d×H×W feature map. Then H and W are merged (flattened), so the feature map becomes d×HW. This serialized feature map loses the positional information of the original image, so a positional encoding is added to obtain the final serialized encoder input, as sketched below.
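A minimal sketch of this serialization in PyTorch, using a stand-in tensor in place of the real backbone output (all shapes here are illustrative):

```python
import torch
import torch.nn as nn

B, C, H, W = 2, 2048, 25, 25           # illustrative shapes (e.g. ResNet conv5)
feat = torch.randn(B, C, H, W)         # stand-in for the CNN backbone output

d = 256                                # compressed channel dimension
proj = nn.Conv2d(C, d, kernel_size=1)  # 1x1 conv: C -> d channels

x = proj(feat)                         # (B, d, H, W)
x = x.flatten(2)                       # (B, d, H*W): merge H and W
x = x.permute(2, 0, 1)                 # (H*W, B, d): sequence-first layout

# Positional encoding restores the spatial information lost by flattening;
# a learned embedding is used here, but a fixed sinusoidal one also works.
pos_embed = nn.Parameter(torch.randn(H * W, 1, d))
x = x + pos_embed                      # final serialized, position-aware input
```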
- Then there is the Transformer Decoder.
Unlike the original Transformer, which decodes autoregressively from left to right, the Decoder in DETR processes all of its inputs, known as object queries, in parallel at once.
Another point is that the Decoder input (the object queries) is randomly initialized and is trained and updated like any other parameter, as in the sketch below.
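A minimal sketch of parallel decoding with trainable object queries, using PyTorch's stock Transformer modules (the real DETR implementation differs in detail, e.g. it re-injects the query embeddings at every layer):

```python
import torch
import torch.nn as nn

d, N, B = 256, 100, 2                  # hidden dim, num queries, batch size
memory = torch.randn(25 * 25, B, d)    # stand-in encoder output, (HW, B, d)

query_embed = nn.Embedding(N, d)       # randomly initialized, updated by training
tgt = query_embed.weight.unsqueeze(1).repeat(1, B, 1)   # (N, B, d)

decoder_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# No causal mask: all N queries attend to each other and to the image
# features simultaneously, instead of decoding one output at a time.
out = decoder(tgt, memory)             # (N, B, d): N object proposals
```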
- Bipartite matching
The Transformer Decoder outputs N object proposals, but we do not know their correspondence to the ground truth, so we match them via a bipartite graph, using the Hungarian algorithm to find the assignment that minimizes the matching loss. The matching loss is as follows:
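Reproducing the matching objective as given in the DETR paper: it searches all permutations $\sigma$ of the $N$ predictions for the assignment with the lowest total cost,

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big), \qquad \mathcal{L}_{\mathrm{match}} = -\mathbb{1}_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big)$$

where $c_i$ is the target class of ground truth $y_i$, $\hat{p}_{\sigma(i)}(c_i)$ is the predicted probability of that class, and $\mathcal{L}_{\mathrm{box}}$ is the box regression cost (L1 + GIoU in DETR).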
After the final match is obtained, the regression loss and classification loss are used to update the parameters.
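A minimal sketch of the matching step itself, using SciPy's Hungarian solver on a hypothetical cost matrix:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

N, M = 100, 12                  # N predictions, M ground-truth objects
cost = np.random.rand(N, M)     # stand-in for the pair-wise matching costs

# Hungarian algorithm: finds the assignment minimizing the total cost.
pred_idx, gt_idx = linear_sum_assignment(cost)
# pred_idx[k] is matched to gt_idx[k]; the remaining N - M predictions
# are assigned the "no object" class.
```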
3. LETR model structure
Figure 2. LETR model structure
The Transformer structure mainly consists of an Encoder, a Decoder, and an FFN. Each Encoder layer contains two sub-layers, self-attention and feed-forward; each Decoder layer additionally contains a cross-attention sub-layer. The attention mechanism itself is the same as in the original Transformer; the only difference is the Decoder's cross-attention behavior, described above.
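A sketch of one Decoder layer's three sub-layers, written out explicitly with PyTorch primitives (residual connections and layer norms omitted for brevity):

```python
import torch
import torch.nn as nn

d, nhead = 256, 8
self_attn = nn.MultiheadAttention(d, nhead)    # queries attend to each other
cross_attn = nn.MultiheadAttention(d, nhead)   # queries attend to image features
ffn = nn.Sequential(nn.Linear(d, 2048), nn.ReLU(), nn.Linear(2048, d))

queries = torch.randn(100, 2, d)   # (N, B, d) object queries
memory = torch.randn(625, 2, d)    # (HW, B, d) encoder output

q, _ = self_attn(queries, queries, queries)   # self-attention sub-layer
q, _ = cross_attn(q, memory, memory)          # cross-attention sub-layer
q = ffn(q)                                    # feed-forward sub-layer
```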
- Coarse-to-fine strategy
As the figure above shows, LETR contains two Transformers. The authors call this a multi-scale Encoder/Decoder strategy, and the two Transformers are called the Coarse Encoder/Decoder and the Fine Encoder/Decoder. First, a Transformer (the Coarse Encoder/Decoder) is trained on the deep, small-scale feature map of the CNN backbone (conv5 of ResNet; the feature map is 1/32 the size of the original image, with 2048 channels) to obtain coarse-grained line segment features; during this stage the Fine Encoder/Decoder is frozen and only the parameters of the Coarse Encoder/Decoder are updated. Then the output of the Coarse Decoder is taken as the input of the Fine Decoder, and the second Transformer (the Fine Encoder/Decoder) is trained. The input of the Fine Encoder is the shallow feature map of the CNN backbone (conv4 of ResNet; the feature map is 1/16 the size of the original image, with 1024 channels), which is larger than the deep feature map and can better exploit the high-resolution information of the image.
Note: both the deep and shallow feature maps of the CNN backbone must first have their channels reduced to 256 with a 1×1 convolution before being used as Transformer input; the sketch below shows the overall coarse-to-fine flow.
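A hedged sketch of the coarse-to-fine data flow (module names are illustrative; the encoders and positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

d, B = 256, 2
reduce_c5 = nn.Conv2d(2048, d, kernel_size=1)   # conv5 (1/32 scale) -> 256 ch
reduce_c4 = nn.Conv2d(1024, d, kernel_size=1)   # conv4 (1/16 scale) -> 256 ch

def serialize(x):                               # (B, d, H, W) -> (HW, B, d)
    return x.flatten(2).permute(2, 0, 1)

c5 = torch.randn(B, 2048, 16, 16)   # stand-in deep feature map
c4 = torch.randn(B, 1024, 32, 32)   # stand-in shallow feature map
coarse_mem = serialize(reduce_c5(c5))
fine_mem = serialize(reduce_c4(c4))

coarse_decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 8), 6)
fine_decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, 8), 6)

queries = torch.randn(100, B, d)                  # line-segment queries
coarse_out = coarse_decoder(queries, coarse_mem)  # coarse line features
fine_out = fine_decoder(coarse_out, fine_mem)     # refined at 1/16 scale
```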
- Bipartite matching
As in DETR, the N outputs of the Fine Decoder are used for classification and regression, yielding N line segment predictions. But we do not know the correspondence between the N predictions and the M ground-truth segments (with N greater than M), so bipartite matching is performed: finding the assignment that minimizes the matching loss. The matching loss has the same form as in DETR above, except for the regression term: DETR uses GIoU (plus box L1), whereas LETR uses the distance between the endpoints of the line segments, as sketched below.
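A minimal sketch of that endpoint regression term (shapes and tensors are illustrative):

```python
import torch
import torch.nn.functional as F

pred = torch.rand(12, 4)   # matched predictions: (x1, y1, x2, y2), normalized
gt = torch.rand(12, 4)     # corresponding ground-truth endpoints

# L1 distance between predicted and ground-truth endpoints replaces
# DETR's box L1 + GIoU regression terms.
loss = F.l1_loss(pred, gt)
```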
4. Model test results
The model achieves state-of-the-art results on the Wireframe and YorkUrban datasets.
Figure 3. Comparison of line segment detection methods
Figure 4. Performance comparison of line segment detection methods on the two datasets (Table 1 in the paper) and PR curves of the methods (Figure 6 in the paper)