Abstract: The LayoutLM model uses large-scale unlabeled document datasets for joint pre-training of text and layout, and achieves leading results on multiple downstream document understanding tasks.

This article is shared from the Huawei Cloud community post "Paper Interpretation Series 25: LayoutLM: Text and Layout Pre-training for Document Understanding" by Song Xuan.

1. Introduction

Document understanding, or document intelligence, has a wide range of uses in today's society. Business documents such as those shown in Figure 1 contain rich, domain-specific information in complex and varied layouts, which makes accurately understanding them a challenging task. Before this paper, model-based document understanding had two main shortcomings: (1) models were trained end-to-end with manually annotated data for a specific scenario, without exploiting large-scale unannotated data, so they generalized poorly to other formats or scenarios; (2) features were extracted with pre-trained models from the CV or NLP fields alone, without jointly modeling text and layout information.

Figure 1. Scanned images of business documents in different layouts and formats

To address these shortcomings, researchers at Microsoft Research Asia proposed the LayoutLM model [1], shown in Figure 2, which uses large-scale unlabeled document datasets for joint pre-training of text and layout and achieves leading results on multiple downstream document understanding tasks. LayoutLM largely follows the BERT model [2]. At the input level, LayoutLM adds two new features to the text and position features used by BERT: (1) 2-D position features, i.e., document layout features; (2) image features, namely a global feature of the document image and word-level features, taken as ROI features from Faster R-CNN [3]. At the learning objective level, a Masked Visual-Language Model (MVLM) loss and a Multi-label Document Classification (MDC) loss are combined for multi-task learning. At the training data level, LayoutLM is pre-trained on approximately 11 million scanned document images from the IIT-CDIP Test Collection 1.0 [4], which covers document types such as letters, memos, emails, forms, and bills. The text content and position information of each document image are obtained with the open-source Tesseract [5] OCR engine.

Figure 2. Schematic diagram of the LayoutLM model structure

2. LayoutLM

2.1 Model Structure

Building on the BERT model structure, LayoutLM adds two input features: a 2-D position feature and an image feature.

2-D position feature: The 2-D position feature encodes the relative spatial positions of words within a document. A document is treated as a coordinate system with the origin (0, 0) at its upper-left corner. The bounding box of a word is written as (x_0, y_0, x_1, y_1), where (x_0, y_0) is the upper-left corner and (x_1, y_1) is the lower-right corner. x_0 and x_1 share an embedding table X, and y_0 and y_1 share an embedding table Y. In particular, the bounding box of the entire document image is (0, 0, W, H), where W and H denote the width and height of the document image, respectively.
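As a hedged illustration, the shared coordinate embedding tables can be sketched in PyTorch as below; the table size of 1024 and the assumption that coordinates are normalized integers in [0, 1000] follow common open-source implementations, and all names are placeholders rather than the paper's code.

```python
import torch
import torch.nn as nn

class TwoDPositionEmbedding(nn.Module):
    """Sketch of LayoutLM's 2-D position embedding.

    x_0 and x_1 share one table (X); y_0 and y_1 share another (Y).
    Coordinates are assumed to be integers normalized to [0, 1000].
    """
    def __init__(self, max_position: int = 1024, hidden_size: int = 768):
        super().__init__()
        self.x_table = nn.Embedding(max_position, hidden_size)  # shared by x_0 and x_1
        self.y_table = nn.Embedding(max_position, hidden_size)  # shared by y_0 and y_1

    def forward(self, bbox: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, seq_len, 4) with columns (x_0, y_0, x_1, y_1)
        left  = self.x_table(bbox[..., 0])
        upper = self.y_table(bbox[..., 1])
        right = self.x_table(bbox[..., 2])
        lower = self.y_table(bbox[..., 3])
        # The four embeddings are summed into each token's representation,
        # alongside BERT's word, 1-D position, and segment embeddings.
        return left + upper + right + lower
```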

Image features: Using each word's bounding box, LayoutLM applies an ROI operation to the output feature map of Faster R-CNN to generate one image region feature per word. For the special [CLS] token (whose output feeds the classification layer in document classification tasks; see the BERT model for details), the average feature of the whole image is used as its image feature. Note that LayoutLM does not use image features in the pre-training stage; they can be optionally added in the downstream task stage, and the weights of the Faster R-CNN model that produces them come from a pre-trained model and are not adjusted.
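A minimal sketch of this step, assuming a single backbone feature map from Faster R-CNN and word boxes in pixel coordinates; `torchvision.ops.roi_align` stands in for the ROI operation, and the projection layer and names are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

def word_image_features(feat: torch.Tensor,
                        word_boxes: torch.Tensor,
                        proj: nn.Linear,
                        spatial_scale: float) -> torch.Tensor:
    """feat: (1, C, H', W') backbone feature map for one document image.
    word_boxes: (num_words, 4) float boxes (x_0, y_0, x_1, y_1) in pixels.
    spatial_scale: ratio of feature-map resolution to image resolution.
    Returns one image feature per token, aligned with the text sequence.
    """
    # One pooled C-channel feature per word bounding box.
    pooled = roi_align(feat, [word_boxes], output_size=(1, 1),
                       spatial_scale=spatial_scale)   # (num_words, C, 1, 1)
    word_feats = proj(pooled.flatten(1))              # (num_words, hidden)

    # For [CLS], the paper uses the average feature of the whole image.
    cls_feat = proj(feat.mean(dim=(2, 3)))            # (1, hidden)
    return torch.cat([cls_feat, word_feats], dim=0)
```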

2.2 LayoutLM pre-training

Pre-training Task #1: Masked Visual-Language Model (MVLM). In the pre-training stage, the text of some randomly selected words is masked while their position information is retained, and the model is trained to predict the masked words from the context. Through this task, the model learns to understand context and to use the 2-D position information as a bridge between the visual and language modalities.
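A minimal sketch of the masking step, as a simplified version of BERT-style masking (the 15% rate mirrors BERT; the 80/10/10 replacement details are omitted for brevity). The key point is that only token ids are hidden; the bbox tensor passes through unchanged.

```python
import torch

def mvlm_mask(input_ids: torch.Tensor,
              mask_token_id: int,
              mask_prob: float = 0.15,
              ignore_index: int = -100):
    """Randomly mask token ids while keeping their 2-D positions intact.

    Returns masked ids and MLM labels; unmasked positions get ignore_index
    so the cross-entropy loss skips them.
    """
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = ignore_index        # only predict the masked positions
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id    # the text is hidden ...
    # ... but the bbox tensor is fed to the model unchanged, so it must
    # use 2-D position (and context) to recover the masked word.
    return masked_ids, labels
```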

Pre-training Task #2: Multi-label Document Classification (MDC). Many document understanding tasks require a document-level representation. Because each document image in the IIT-CDIP collection carries multiple tags, LayoutLM uses these tags for a supervised document classification task, so that the [CLS] token outputs a more effective document-level representation. However, such labels are not always available for larger datasets, so this task is optional; it was in fact dropped in the subsequent LayoutLMv2.
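Since a document image may carry several tags at once, MDC is a multi-label problem. Below is a sketch of a standard sigmoid head over the [CLS] output; the head and loss shown are a common choice for multi-label classification, assumed here rather than taken from the paper, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

hidden_size, num_tags = 768, 16          # illustrative sizes
mdc_head = nn.Linear(hidden_size, num_tags)
mdc_loss = nn.BCEWithLogitsLoss()        # independent sigmoid per tag

def mdc_step(cls_output: torch.Tensor, tag_targets: torch.Tensor) -> torch.Tensor:
    """cls_output: (batch, hidden) [CLS] feature.
    tag_targets: (batch, num_tags) multi-hot vector in {0, 1}."""
    logits = mdc_head(cls_output)
    return mdc_loss(logits, tag_targets.float())
```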

2.3 Fine-tuning of the LayoutLM Model

The pre-trained LayoutLM model is fine-tuned on three document understanding tasks: form understanding, receipt understanding, and document classification, using the FUNSD, SROIE, and RVL-CDIP datasets, respectively. For the form and receipt understanding tasks, the model predicts a {B, I, E, S, O} sequence label at each input position to detect entities of each category. For the document classification task, the model predicts the category from the output feature of the [CLS] token.
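To illustrate the sequence-labeling setup, here is a sketch using the Hugging Face port of LayoutLM. The checkpoint name is the public `microsoft/layoutlm-base-uncased`; the label count, toy words, and boxes are made-up examples, and boxes must already be scaled to the 0-1000 range the model expects.

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
# BIESO tagging: e.g. 3 entity types x {B, I, E, S} + O = 13 labels.
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased", num_labels=13)

words = ["Date:", "06/01/2020"]                        # toy OCR output
word_boxes = [[60, 40, 120, 60], [130, 40, 260, 60]]   # scaled to [0, 1000]

tokens, bbox = ["[CLS]"], [[0, 0, 0, 0]]
for word, box in zip(words, word_boxes):
    pieces = tokenizer.tokenize(word)
    tokens += pieces
    bbox += [box] * len(pieces)          # each word piece inherits its word's box
tokens += ["[SEP]"]
bbox += [[1000, 1000, 1000, 1000]]

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
bbox = torch.tensor([bbox])

outputs = model(input_ids=input_ids, bbox=bbox)
pred = outputs.logits.argmax(-1)         # one BIESO tag id per token
```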

3. Experiments

The LayoutLM model shares the same Transformer [6] network structure as BERT, so BERT weights are used for initialization. Specifically, the BASE model is a 12-layer Transformer with 768 hidden units and 12 attention heads per layer, totaling 113M parameters; the LARGE model is a 24-layer Transformer with 1024 hidden units and 16 attention heads per layer, totaling 343M parameters. For training details and parameter settings, please refer to the paper.
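For reference, the two sizes correspond to the following standard Transformer hyperparameters; this is a sketch with the Hugging Face `LayoutLMConfig`, and the quoted parameter counts also include the extra 2-D position embedding tables on top of BERT's.

```python
from transformers import LayoutLMConfig, LayoutLMModel

# BASE: 12 layers, 768 hidden units, 12 attention heads (~113M parameters).
base = LayoutLMModel(LayoutLMConfig(
    hidden_size=768, num_hidden_layers=12,
    num_attention_heads=12, intermediate_size=3072))

# LARGE: 24 layers, 1024 hidden units, 16 attention heads (~343M parameters).
large = LayoutLMModel(LayoutLMConfig(
    hidden_size=1024, num_hidden_layers=24,
    num_attention_heads=16, intermediate_size=4096))

print(f"BASE:  {sum(p.numel() for p in base.parameters()) / 1e6:.0f}M")
print(f"LARGE: {sum(p.numel() for p in large.parameters()) / 1e6:.0f}M")
```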

Form understanding. Tables 1 and 2 show the experimental results of LayoutLM on the form understanding dataset FUNSD under various settings, including different models, amounts of training data, training durations, and pre-training tasks. First, the LayoutLM variant that incorporates visual information achieves a significant accuracy improvement. Second, more training data, longer training, and larger models all effectively improve accuracy. Finally, the MDC pre-training task has opposite effects at the 1M and 11M data scales, and MVLM alone works better when the data volume is large.

In addition, the authors of the original paper compared how different initialization schemes for the LayoutLM model affect downstream tasks, as shown in Table 3. Initializing with the parameters of RoBERTa (A Robustly Optimized BERT) improves the accuracy of LayoutLM on downstream tasks to a certain extent compared with initializing with the original BERT parameters.

Table 1. Accuracy on the FUNSD dataset

Table 2. Accuracy of the LayoutLM BASE model (Text + Layout, MVLM) on the FUNSD dataset with different training data volumes and training durations

Table 3. Accuracy of the LayoutLM model (Text + Layout, MVLM) on the FUNSD dataset with different initialization schemes

Receipt understanding. Table 4 shows the experimental results of LayoutLM on the receipt understanding dataset SROIE. The LayoutLM LARGE model outperforms the first-place entry in the SROIE competition at the time.

Table 4. Accuracy on the SROIE dataset

Document image classification. Table 5 shows the experimental results of LayoutLM on the document image classification dataset RVL-CDIP. Again, LayoutLM achieves the leading result.

Table 5. Classification accuracy on the RVL-CDIP dataset

4. Summary

The LayoutLM model introduced in this paper uses large-scale unlabeled document datasets for joint pre-training of text and layout, and achieves leading results on multiple downstream document understanding tasks. The authors point to larger datasets and models, and to incorporating image features in the pre-training stage, as the next research directions.

[1] Xu Y, Li M, Cui L, et al. LayoutLM: Pre-training of text and layout for document image understanding. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.

[2] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT. 2019: 4171-4186.

[3] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 2015, 28: 91-99.

[4] Lewis D, Agam G, Argamon S, et al. Building a test collection for complex document information processing. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 2006: 665-666.

[5] github.com/tesseract-o…

[6] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in neural information processing systems. 2017: 5998-6008.
