Abstract: In this solution, table recognition is divided into four parts: table structure sequence recognition, text detection, text recognition, and alignment of cells with text boxes. The model used for table structure sequence recognition is a modified version of Master, the text detection model is PSENet, and the text recognition model is Master.

This article is shared from the Huawei Cloud community post “Paper Interpretation 28: Table Recognition Model TableMaster”, by CVER.

1. Overview

In table recognition, a model typically first regresses the coordinates of each cell and then derives the row and column structure of the table from those coordinates. For tables with visible ruling lines, cell coordinates can be obtained accurately, and the row and column information can be recovered by post-processing them (a simple version of this post-processing is sketched below). For tables without ruling lines, it is usually difficult to obtain cell positions or line information directly, so a trained model is needed to infer the spatial layout of the text blocks. In graph-based approaches, for example, the common pipeline is to obtain the coordinates and text content of each text box from an OCR model, combine multi-modal information such as visual, positional, and semantic features, use a graph network to predict the row and column attributes of each text node, and finally recover the table structure.
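
For a table with visible lines, that post-processing can be as simple as clustering cell coordinates into rows and columns. The sketch below only illustrates the idea (it is not code from the paper); `row_tol` and `col_tol` are assumed pixel tolerances.

```python
# Hypothetical post-processing sketch (not part of TableMaster itself):
# group detected cell boxes into rows and columns by their center coordinates.

def rows_and_columns(cell_boxes, row_tol=10, col_tol=10):
    """cell_boxes: list of (x1, y1, x2, y2). Returns a (row, column) index per box."""
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in cell_boxes]

    def cluster(values, tol):
        # 1-D greedy clustering: sorted values closer than `tol` share a cluster.
        order = sorted(range(len(values)), key=lambda i: values[i])
        labels, current = [0] * len(values), 0
        for prev, idx in zip(order, order[1:]):
            if values[idx] - values[prev] > tol:
                current += 1
            labels[idx] = current
        return labels

    row_ids = cluster([cy for _, cy in centers], row_tol)
    col_ids = cluster([cx for cx, _ in centers], col_tol)
    return list(zip(row_ids, col_ids))


print(rows_and_columns([(0, 0, 50, 20), (60, 0, 110, 20), (0, 30, 50, 50)]))
# -> [(0, 0), (0, 1), (1, 0)]
```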

The table recognition model TableMaster, released by Ping An Technology, proposes another solution: the positions of the text blocks inside cells and the table structure are learned at the same time. This relies on another form of table representation, the one commonly used on web pages, where tables are defined in HyperText Markup Language (HTML) (Figure 1).

Figure 1. HTML tags of a table and the corresponding rendered table

According to the syntax rules of HyperText Markup Language, a table is defined by the <table></table> tags. Inside the <table> tag there are structural tags such as <thead> and <tbody>, as well as <tr> (a table-row tag) and <td> (a cell tag).
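
As Figure 1 suggests, such a table can be treated as a flat sequence of HTML structure tokens, which is exactly the kind of target a sequence model can predict. The snippet below is only an illustrative sketch; the exact token set is defined in [1].

```python
# Illustrative only: a table expressed as a flat sequence of HTML structure tokens,
# the kind of target sequence a seq2seq/Transformer model can be trained to predict.
structure_tokens = [
    "<thead>", "<tr>", "<td></td>", "<td></td>", "</tr>", "</thead>",
    "<tbody>", "<tr>", "<td></td>", "<td></td>", "</tr>", "</tbody>",
]

# Joining the tokens (and later filling in cell text) yields ordinary HTML.
html = "<table>" + "".join(structure_tokens) + "</table>"
print(html)
```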

The following figure shows the table structure sequence and cell coordinates recognized by TableMaster:

Figure 6. TableMaster prediction results. (a) Original image; (b) Predicted text boxes; (c) Predicted table structure sequence

2.2.3 Text box detection and recognition

The text detection model used in the text detection and recognition stage is the classical PSENet [3]. The model used for text recognition is the Master model mentioned above. With the PSENet + Master combination, the end-to-end text recognition accuracy reaches 0.9885.
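
Chaining the two stages is straightforward: the detector returns text boxes, and the recognizer reads each cropped box. The glue code below is a hypothetical sketch; `detect_text_boxes` and `recognize_text` are placeholder callables, not real PSENet or Master APIs.

```python
# Hypothetical glue code for the two-stage OCR used here; `detect_text_boxes`
# (a PSENet-style detector) and `recognize_text` (a Master-style recognizer)
# are placeholder functions, not real library APIs.

def ocr_table_image(image, detect_text_boxes, recognize_text):
    """Return a list of (box, text) pairs for one table image."""
    results = []
    for box in detect_text_boxes(image):              # box: (x1, y1, x2, y2)
        x1, y1, x2, y2 = box
        crop = image[y1:y2, x1:x2]                     # crop the detected text region
        results.append((box, recognize_text(crop)))   # recognize the cropped text
    return results
```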

2.2.4 Restoring the complete HTML

The table structure sequence output by the TableMaster network is not the final HTML sequence. To obtain the final HTML of the table, the corresponding text content has to be filled into the table structure tokens. The process is shown in Figure 7, and a simplified sketch of the filling step is given after the figure.

Figure 7. From the recognition results to the final HTML sequence. (a) Flow chart; (b) The final HTML sequence; (c) Visualization of the HTML sequence
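
Once each cell token has its matched text, the filling step itself is mechanical: walk the predicted structure sequence and replace each cell token with a cell containing the matched text. A minimal sketch (illustrative only, not the authors' code):

```python
# Illustrative sketch of the filling step: insert matched cell texts into the
# predicted structure token sequence to obtain the final HTML (not the authors' code).

def fill_structure(structure_tokens, cell_texts):
    """structure_tokens: predicted tokens such as '<tr>', '<td></td>', '</tr>'.
    cell_texts: texts matched to the cell tokens, in reading order."""
    texts = iter(cell_texts)
    pieces = []
    for token in structure_tokens:
        if token == "<td></td>":
            pieces.append("<td>{}</td>".format(next(texts, "")))
        else:
            pieces.append(token)
    return "<table>" + "".join(pieces) + "</table>"


print(fill_structure(["<tr>", "<td></td>", "<td></td>", "</tr>"], ["Name", "Score"]))
# -> <table><tr><td>Name</td><td>Score</td></tr></table>
```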

One important step is cell matching: text box coordinates are aligned with cell coordinates, and the recognized content of each text box can then be filled into the corresponding cell token in the structure sequence to obtain the final HTML. The alignment is mainly based on three rules. 1. The center point rule: if the center point of a text box falls inside a cell box, the text content of that text box is filled into the corresponding cell. 2. The IOU rule: if no cell contains the center point, the text box is assigned to the cell with which it has the largest overlap (IOU). 3. The distance rule: otherwise, the text box is assigned to the nearest cell (see [1] for details). A simplified sketch of this matching is given below.
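
The following sketch illustrates that matching logic only; the box format is an assumption and the authors' implementation in [1] differs in detail.

```python
# Simplified sketch of text-box / cell-box matching (center point rule, with
# IOU and distance fallbacks); illustrative only, not the authors' code.

def center(box):
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def contains(cell, point):
    x1, y1, x2, y2 = cell
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_text_box(text_box, cell_boxes):
    """Return the index of the cell box that the text box is assigned to."""
    c = center(text_box)
    for i, cell in enumerate(cell_boxes):
        if contains(cell, c):                          # rule 1: center point rule
            return i
    best = max(range(len(cell_boxes)), key=lambda i: iou(text_box, cell_boxes[i]))
    if iou(text_box, cell_boxes[best]) > 0:            # rule 2: largest IOU
        return best
    # rule 3: nearest cell center
    return min(range(len(cell_boxes)),
               key=lambda i: (center(cell_boxes[i])[0] - c[0]) ** 2
                             + (center(cell_boxes[i])[1] - c[1]) ** 2)
```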

Restoring the table structure with a sequence model is an effective approach to table structure recognition; Baidu's RARE-based table recognition follows a similar idea, except that it replaces the Transformer used in TableMaster with a GRU. In addition, this method only uses the visual information of the image; combining it with multi-modal features in future work could yield better results.

References

[1] Jiaquan Ye, Xianbiao Qi, Yelin He, Yihao Chen, Dengyi Gu, Peng Gao, and Rong Xiao. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML. arXiv:2105.01848, 2021.

[2] Ning Lu, Wenwen Yu, Xianbiao Qi, Yihao Chen, Ping Gong, Rong Xiao, and Xiang Bai. Master: Multi-aspect non-local network for scene text recognition. Pattern Recognition, 2021.

[3] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9336-9345, 2019.


As can be seen from Figure 1, a table can be represented as a sequence of text tokens, so a sequence model (seq2seq or Transformer) can be used for table structure prediction.

2. TableMaster

2.1 Table structure recognition process

TableMaster adopts a multi-task learning scheme. It has two branches: one for table structure sequence prediction and the other for cell position regression. After TableMaster finishes recognition, the results are post-processed by a matching algorithm, and the table structure sequence and the cell text content are fused to obtain the final HTML of the table (see Figure 2).

Figure 2 TableMaster table recognition process

2.2 TableMaster principle

2.2.1 Network Architecture

TableMaster is a modification of the Master [2] model. Master is a text recognition model developed by Ping An, and its network is divided into an encoding part and a decoding part. The encoding network follows the residual connection structure of ResNet; unlike ResNet, Master inserts a multi-aspect Global Context Attention (GCAttention) module after each residual block, where h denotes the number of attention aspects (heads).
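
The exact formulation of the multi-aspect GCAttention module is given in [2]. As a rough illustration only, a global-context-style attention block with h aspects can be sketched as follows; this assumes a GCNet-style formulation and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiAspectGCAttention(nn.Module):
    """Rough sketch of a multi-aspect global-context attention block (see [2] for
    the exact formulation); h is the number of aspects. Not the authors' code."""

    def __init__(self, channels, h=8, ratio=0.25):
        super().__init__()
        assert channels % h == 0
        self.h = h
        self.context_mask = nn.Conv2d(channels, h, kernel_size=1)  # one spatial mask per aspect
        hidden = int(channels * ratio)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, hgt, wid = x.shape
        # h attention maps, softmax-normalised over all spatial positions
        masks = self.context_mask(x).view(b, self.h, hgt * wid).softmax(dim=-1)
        feats = x.view(b, self.h, c // self.h, hgt * wid)
        # per-aspect global context: weighted sum over spatial positions
        context = torch.einsum('bhcn,bhn->bhc', feats, masks).reshape(b, c, 1, 1)
        return x + self.transform(context)              # residual addition, as in GCNet


block = MultiAspectGCAttention(channels=256, h=8)
print(block(torch.randn(2, 256, 12, 40)).shape)          # torch.Size([2, 256, 12, 40])
```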

The encoding stage is key to the entire Master network: it converts an image into a sequence that can be decoded by a Transformer. In the encoding stage the input image has dimensions 48*160*1 and the output feature map has dimensions 6*40*512; the feature map is flattened into a sequence of 240 (6*40) positions, each a 512-dimensional feature vector. The sequence features output by the encoder are passed, together with positional encoding, to the decoding stage. The decoding part consists of three conventional Transformer decoder layers (see Figure 3).

Figure 3 Master model structure, image source [2]
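
A minimal sketch of this encode-flatten-decode pattern is shown below. It is illustrative only: the toy sizes, the learned positional encoding, and the plain PyTorch Transformer layers are assumptions, not the configuration from [2].

```python
import torch
import torch.nn as nn

# Illustrative encode -> flatten -> decode skeleton (not the actual Master code):
# a CNN feature map is flattened into a sequence and decoded by Transformer layers.

d_model, vocab_size, max_len = 512, 100, 60            # assumed toy sizes

feature_map = torch.randn(2, d_model, 6, 40)           # encoder output: B x 512 x 6 x 40
memory = feature_map.flatten(2).permute(0, 2, 1)       # B x 240 x 512 (240 = 6 * 40)

pos = nn.Parameter(torch.zeros(1, memory.size(1), d_model))  # learned positional encoding (assumption)
memory = memory + pos

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)  # three decoder layers, as described above

tgt = torch.randn(2, max_len, d_model)                  # embedded target tokens (teacher forcing)
out = decoder(tgt, memory)                              # B x max_len x 512
logits = nn.Linear(d_model, vocab_size)(out)            # per-step token scores
print(logits.shape)                                     # torch.Size([2, 60, 100])
```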

TableMaster's feature extraction model, i.e. its encoding structure, is the same as Master's; the difference lies in the decoding part. Compared with Master, TableMaster's decoder adds a branch: after one shared Transformer decoder layer, it splits into two branches, each followed by two more Transformer layers, handling the two learning tasks of cell text box regression and table structure sequence prediction.

Figure 4. Comparison of the TableMaster and Master model structures, image source [1]
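
As a rough illustration of the branch split described above, the decoder could be sketched as follows. The layer types, the (x, y, w, h) box format, and the head sizes are assumptions based on this article (the 41-way classifier matches the label count given in Section 2.2.2 below); this is not the authors' code.

```python
import torch
import torch.nn as nn

class TwoBranchDecoder(nn.Module):
    """Sketch of a TableMaster-style decoder split: one shared Transformer decoder
    layer, then two branches of two layers each, one predicting structure tokens
    and one regressing cell box coordinates. Illustrative, not the authors' code."""

    def __init__(self, d_model=512, nhead=8, num_classes=41):
        super().__init__()
        layer = lambda: nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.shared = nn.TransformerDecoder(layer(), num_layers=1)
        self.structure_branch = nn.TransformerDecoder(layer(), num_layers=2)
        self.box_branch = nn.TransformerDecoder(layer(), num_layers=2)
        self.cls_head = nn.Linear(d_model, num_classes)  # structure token scores per step
        self.box_head = nn.Linear(d_model, 4)             # assumed (x, y, w, h) in [0, 1]

    def forward(self, tgt, memory):
        h = self.shared(tgt, memory)
        structure_logits = self.cls_head(self.structure_branch(h, memory))
        boxes = torch.sigmoid(self.box_head(self.box_branch(h, memory)))
        return structure_logits, boxes


dec = TwoBranchDecoder()
logits, boxes = dec(torch.randn(2, 500, 512), torch.randn(2, 49, 512))
print(logits.shape, boxes.shape)   # torch.Size([2, 500, 41]) torch.Size([2, 500, 4])
```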

2.2.2 Input and output

In TableMaster, the input image has dimensions 480*480*3 and the encoder output has dimensions 7*7*500, which is flattened into a sequence of 49 (7*7) positions, each with a 500-dimensional feature vector. There are 38 category labels for the table structure (as shown in Figure 5); together with special labels such as the start and end tokens, the model uses 41 category labels in total.

Figure 5. The 38 types of table structure labels in the TableMaster model, image source [1]

Two of these labels, <td></td> and <td (the form used when a cell carries span attributes), represent table cells; only these cell labels correspond to cell box coordinates.