STRUCTEXTV2: MASKED VISUAL-TEXTUAL PREDICTION FOR DOCUMENT IMAGE PRE-TRAINING

Abstract

In this paper, we present StrucTexTv2, an effective document image pre-training framework that performs masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of text words, and the pre-training objectives are to reconstruct the pixels of the masked image regions and the corresponding masked tokens simultaneously. Hence, the pre-trained encoder captures more textual semantics than masked image modeling approaches that only predict masked image patches. Compared to masked multi-modal modeling methods for document image understanding, which rely on both the image and text modalities, StrucTexTv2 models image-only input and can potentially handle more application scenarios free from OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2: it achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.

1. INTRODUCTION

In Document Artificial Intelligence, understanding visually-rich document images and extracting structured information from them has gradually become a popular research topic. Its main associated tasks include document image classification Harley et al. (2015), layout analysis Zhong et al. (2019), form understanding Jaume et al. (2019), document OCR (also called text spotting) Li et al. (2017); Liao et al. (2021), and end-to-end information extraction (usually composed of OCR and entity labelling phases) Wang et al. (2021), etc. To solve these tasks well, it is necessary to fully exploit both visual and textual cues. Meanwhile, large-scale self-supervised pre-training Li et al. (2021a); Appalaraju et al. (2021); Xu et al. (2020; 2021); Huang et al. (2022); Gu et al. (2021) is a recently rising technology for enhancing multi-modal knowledge learning on document images. There are two mainstream self-supervised pre-training frameworks for document image understanding, as illustrated in Fig. 1: (a) The first category is masked multi-modal modeling, such as the pre-training tasks MLM Devlin et al. (2019), MVLM Xu et al. (2021), MM-MLM Appalaraju et al. (2021), and MSM Gu et al. (2021), whose inputs mainly consist of OCR-extracted texts and image embeddings. These methods collect semantic information from both text and image, but depend heavily on front-end OCR engines with non-trivial computing costs. Additionally, the OCR engine and the document understanding module are optimized separately, which makes it hard to guarantee the performance of the whole system. (b) The second category is masked image modeling (MIM), which inherits the concept of vision-based self-supervised learning, such as BEiT Bao et al. (2022), SimMIM Xie et al. (2022), MAE He et al. (2022), CAE Chen et al. (2022), and DiT Li et al. (2022), etc. MIM is a powerful image-only pre-training technique that learns visual contextualized representations of document images.
Due to the great disparity between vision and language, existing document understanding methods either consider a single modality or introduce an OCR engine to capture textual content in advance: researchers use text tokens as the input for language modeling, or select fixed-size image patches as the granularity of vision pre-training tasks. However, the textual content of a document is visually situated and extracted from the image. Thus, we propose a text region-level image masking scheme, aligned with document content, to bridge vision modeling and language modeling through shared representations. This paper proposes StrucTexTv2, a novel multi-modal knowledge learning framework for document image understanding, which performs text region-level image masking with two parallel self-supervised tasks: image reconstruction and language modeling (as shown in Fig. 2). First, we adopt an off-the-shelf OCR toolkit to perform word-level text detection and recognition on the pre-training dataset (IIT-CDIP Test Collection Lewis et al. (2006)). Next, we randomly mask some text word regions in the input images and feed them into the encoder. The major contributions of our work can be summarized as follows: • A novel self-supervised pre-training framework, named StrucTexTv2, which performs text region-level document image masking to learn visual-textual representations in an end-to-end manner. • Superior performance in five downstream tasks, demonstrating the effectiveness of the StrucTexTv2 pre-trained model in document image understanding.
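The text region-level masking step can be sketched in a few lines of NumPy. The function name, the zero-fill masking value, and the default ratio below are illustrative assumptions, not the authors' exact implementation; only the idea of masking whole OCR word boxes comes from the paper:

```python
import random

import numpy as np

def mask_text_regions(image, word_boxes, mask_ratio=0.3, seed=0):
    """Randomly mask word-level text regions of a document image.

    `word_boxes` holds (x1, y1, x2, y2) pixel coordinates from an OCR
    toolkit. Masked regions are filled with zeros here; the real
    pipeline may use a different fill value.
    """
    rng = random.Random(seed)
    masked = image.copy()
    n_mask = max(1, int(len(word_boxes) * mask_ratio))
    chosen = rng.sample(range(len(word_boxes)), n_mask)
    for i in chosen:
        x1, y1, x2, y2 = word_boxes[i]
        masked[y1:y2, x1:x2] = 0  # blank out the whole word region
    return masked, chosen
```

Masking whole word boxes, rather than fixed-size patches, is what ties the image-reconstruction target to a concrete text token.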

2. RELATED WORK

Self-supervised Learning Thanks to the development of self-supervised tasks and Transformer architectures, computer vision (CV) and natural language processing (NLP) have achieved breakthroughs in learning knowledge from large-scale unlabeled data over the past few years. In the NLP domain, the Masked Language Modeling (MLM) task has been widely used in pre-trained models. Some recent OCR-free approaches directly generate textual output from documents and achieve competitive performance on downstream tasks. In this paper, the proposed StrucTexTv2 is a new solution that integrates the advantages of CV and NLP pre-training methods in an end-to-end manner. Benefiting from the image-only input of the encoder, our framework avoids the interference of false OCR results that affects OCR-based pre-trained models. Although the supervision labels for our pre-training partially come from OCR results, only high-confidence words are randomly selected, so the impact of OCR quality on our pre-trained model is alleviated to a certain extent.

3.1. MODEL ARCHITECTURE

As illustrated in Fig. 2, StrucTexTv2 has two main components: an encoder network that uses an FPN to integrate visual and semantic features, and a pre-training framework with two objectives: Masked Language Modeling and Masked Image Modeling. The proposed encoder consists of a visual extractor (CNN) and a semantic module (Transformer). Given an input document image, StrucTexTv2 extracts visual-textual representations through this backbone network. Specifically, the visual extractor provides the features of the last four down-sampled stages of the CNN. In the semantic module, following ViT Dosovitskiy et al. (2021) for handling 2D feature maps, the features of the last CNN stage are flattened into patches and linearly projected into a 1D sequence of patch token embeddings, which serves as the input to the Transformer. A relative position embedding representing the token index is added to the token embeddings. The standard Transformer then receives the input token embeddings and outputs semantically enhanced features. We reshape the output features back into context feature maps in the 2D visual space and up-sample the feature maps by a factor of 8. We adopt the FPN strategy Lin et al. (2017) to merge visual features of different resolutions from the CNN and then concatenate the context feature maps with them, deriving a feature map at 1/4 the size of the input image. Finally, a fusion network consisting of two successive 1×1 convolutional layers performs further multi-modal fusion.
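The patch-tokenization step of the semantic module can be sketched as follows. The plain-NumPy projection, the shape names, and the zero position embeddings in the test are illustrative stand-ins for the learned parameters, not the actual implementation:

```python
import numpy as np

def featmap_to_tokens(feat, proj, pos_emb):
    """Flatten a CNN feature map (C, H, W) into an H*W-long sequence of
    token embeddings, as done before the Transformer.

    `proj` (C, D) stands in for the learned linear projection and
    `pos_emb` (H*W, D) for the (relative) position embeddings.
    """
    c, h, w = feat.shape
    tokens = feat.reshape(c, h * w).T  # one token per spatial cell: (H*W, C)
    return tokens @ proj + pos_emb     # project to D dims, add positions
```

After the Transformer, the inverse reshape maps the token sequence back to an (D, H, W) context feature map for the FPN fusion described above.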

3.2.1. TASK #1: MASKED LANGUAGE MODELING

P_i^mlm = MLP(ROI-Align(F_fuse, b_i)), where b_i is the bounding box of the i-th text region, and P_i^mlm is optimized by a cross-entropy loss over 30,522 token categories.
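As a rough sketch of this MLM branch, the following replaces ROI-Align with simple average pooling over the region crop and uses a 2-layer MLP with hypothetical weight shapes; only the 30,522-way output (the BERT WordPiece vocabulary size) comes from the formula above:

```python
import numpy as np

VOCAB_SIZE = 30522  # matches the BERT WordPiece vocabulary

def mlm_head(f_fuse, box, w1, b1, w2, b2):
    """Approximate P_i^mlm: pool the fused feature map F_fuse (C, H, W)
    over a masked text region (a crude stand-in for ROI-Align), then map
    the pooled feature to token logits with a 2-layer MLP."""
    x1, y1, x2, y2 = box
    region = f_fuse[:, y1:y2, x1:x2].mean(axis=(1, 2))  # (C,) pooled feature
    hidden = np.maximum(region @ w1 + b1, 0.0)          # ReLU hidden layer
    return hidden @ w2 + b2                             # (VOCAB_SIZE,) logits
```

In training, the logits would be compared against the OCR-recognized token of the masked word with cross-entropy.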

3.2.2. TASK #2: MASKED IMAGE MODELING

In MAE He et al. (2022), BEiT Bao et al. (2022), and SimMIM Xie et al. (2022), patch-level Masked Image Modeling has shown strong potential for representation learning. However, in the document understanding domain, patch-level feature learning is too coarse to represent the details of a text or word region. Therefore, we introduce a text region-level visual representation learning task, Masked Image Modeling, to enhance document representation and understanding. Instead of classifying classes defined by tokenization as in LayoutLMv3 and BEiT, we regress the raw pixel values of the masked text region with a mean square error loss, following SimMIM and MAE. Specifically, we mask rectangular text regions and predict the RGB values of the missing pixels, which leads to a significant improvement in representation learning.

Decoder for Task #2. We develop a Fully Convolutional Network (FCN) with transposed convolutions to reconstruct the masked text regions of the document image. Specifically, we apply global average pooling to aggregate each text region's feature and generate the embedding Emb_style, which mainly represents the visual "style" of each masked text region. To strengthen its text information, we encode the MLM prediction P_i^mlm into Emb_content via an embedding layer, denoting the "content" knowledge. Finally, we concatenate Emb_style and Emb_content and feed the result to an FCN, generating the restored image prediction P_i^mim. The procedure of Masked Image Modeling can be formulated as follows:

Emb_style = GAP(ROI-Align(F_fuse, b_i)),
Emb_content = EmbeddingLayer(P_i^mlm),
P_i^mim = FCN(Concat(Emb_style, Emb_content)),   (4)

where GAP is the global average pooling operator. In MIM, we follow MAE and predict the missing pixels of the masked text regions. Concretely, we resize each masked text region to a fixed 64×64 spatial resolution, so each region's regression target is 12,288 = 64×64×3 RGB values. P_i^mim is optimized by an MSE loss in the pre-training phase.
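A minimal sketch of how the MIM regression target can be built; nearest-neighbor resizing stands in for whatever interpolation the authors actually use:

```python
import numpy as np

def mim_target(image, box, size=64):
    """Build the MIM regression target: resize the masked text region to
    size x size (nearest-neighbor here, for simplicity) and flatten its
    RGB values into a 64*64*3 = 12,288-dim vector."""
    x1, y1, x2, y2 = box
    region = image[y1:y2, x1:x2].astype(np.float32)
    h, w = region.shape[:2]
    ys = (np.arange(size) * h // size).clip(0, h - 1)  # nearest source rows
    xs = (np.arange(size) * w // size).clip(0, w - 1)  # nearest source cols
    resized = region[ys][:, xs]                        # (size, size, 3)
    return resized.reshape(-1)

def mse_loss(pred, target):
    """Mean square error used to optimize P_i^mim."""
    return float(((pred - target) ** 2).mean())
```

Fixing the target resolution per region lets every masked word, regardless of its original size, contribute a regression vector of the same dimensionality.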

3.3. DOWNSTREAM TASKS

The StrucTexTv2 pre-training scheme produces a visual-textual representation from image-only input. This multi-modal representation is available for model fine-tuning and benefits numerous downstream tasks.

3.3.1. TASK #1: DOCUMENT IMAGE CLASSIFICATION

Document image classification aims to predict the category of each document image and is one of the fundamental tasks in office automation. For this task, we downsample the output feature maps of the backbone network with four 3×3 convolutional layers of stride 2. The resulting image representation is then fed into a final linear layer with softmax to predict the class label.
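The resulting feature-map size can be checked with a little arithmetic. Assuming 3×3 stride-2 convolutions with padding 1 (the padding is our assumption) and the 960×960 input used for RVL-CDIP fine-tuning, the backbone's 1/4-resolution map shrinks to 15×15 before the linear classifier:

```python
def head_spatial_size(input_size=960, backbone_stride=4, n_convs=4):
    """Spatial size after the classification head's stride-2 convs.

    The backbone yields a 1/4-resolution feature map; each 3x3 stride-2
    conv with padding 1 halves the size, rounding up."""
    s = input_size // backbone_stride
    for _ in range(n_convs):
        s = (s + 1) // 2  # stride-2, padding-1 conv: ceil(s / 2)
    return s
```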

3.3.2. TASK #2: DOCUMENT LAYOUT ANALYSIS

Document layout analysis aims to identify the layout components of document images via object detection. Following DiT, we leverage Cascade R-CNN Cai & Vasconcelos (2018) as the detection framework for layout element detection and replace its backbone with StrucTexTv2. Thanks to the multi-scale context design of the backbone network, the four multi-resolution features (P2∼P5) from its FPN fusion layers are sent into the iterative detection heads of the detector.

3.3.3. TASK #3 TABLE STRUCTURE RECOGNITION

Table structure recognition aims to recognize the internal structure of a table, which is critical for document understanding. Specifically, we employ Cascade R-CNN for cell detection in our table structure recognition framework, replacing its feature encoder with our backbone network. Since some table images are captured by cameras and many cells are deformed, we modify the final output of Cascade R-CNN to regress the coordinates of the four vertices of each cell.
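For illustration, an axis-aligned cell box expands into the 4-vertex (8-value) representation that the modified regression head predicts; the clockwise ordering from the top-left is our assumption:

```python
def box_to_vertices(x1, y1, x2, y2):
    """Expand an axis-aligned cell box (4 values) into 4 vertices
    (8 values), clockwise from the top-left. For deformed cells the
    regression head can move each vertex independently, which a plain
    (x1, y1, x2, y2) box cannot express."""
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
```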

3.3.4. TASK #4: DOCUMENT OCR

We read text in an end-to-end manner based on StrucTexTv2. Our OCR method consists of word-level text detection and recognition modules, which share the features of the backbone network and are connected by ROI-Align operations. The text detection module adopts the standard DB Liao et al. (2023) algorithm, which predicts a binarization mask for word-level bounding boxes. Similar to NRTR Sheng et al. (2019), the text recognition module is composed of multi-layer Transformer decoders that predict the character sequence for each word.
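The DB detector's core step is its approximate (differentiable) binarization, an amplified sigmoid of the gap between the predicted probability map and a learned threshold map. This sketch restates the formula from the original DB paper, with its default amplification factor k=50; it is not re-derived from StrucTexTv2:

```python
import numpy as np

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binarization from DB: B = 1 / (1 + exp(-k (P - T))).

    Pixels where P exceeds the learned threshold T saturate toward 1,
    others toward 0, while the whole map stays differentiable."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))
```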

3.3.5. TASK #5: END-TO-END INFORMATION EXTRACTION

The aim of this task is to extract entity-level content of key fields from given documents without predefined OCR information. We evaluate the StrucTexTv2 model based on the document OCR architecture and devise a new branch for semantic entity extraction. Concretely, another DB detection head is developed to identify entity bounding boxes, and an additional MLP block is applied to the ROI features to classify entity labels. These bounding boxes are used for word grouping to merge the text content from Task #4. Finally, the key information is obtained by grouping words according to reading order.

FUNSD Jaume et al. (2019) is a form understanding dataset that contains 199 forms and involves extracting four predefined semantic entities (questions, answers, headers, and others) and their linkings presented in each form. We focus on two tasks on FUNSD: document OCR and end-to-end information extraction. For evaluation, we compute the normalized Levenshtein similarity (1-NED) between the predictions and the ground truth.

Fine-tuning on RVL-CDIP We evaluate StrucTexTv2 for document image classification. We fine-tune the model on RVL-CDIP for 20 epochs with a cross-entropy loss. The learning rate is set to 3e-4 and the batch size is 28. The input images are resized to 960×960 while maintaining their aspect ratio. We use label smoothing of 0.1 in the loss function. In addition, data augmentation methods such as CutMix and MixUp are applied with probability 0.3 during training.
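The normalized Levenshtein similarity (1-NED) used for FUNSD evaluation can be computed as below. Normalizing by the longer of the two strings is one common convention and may differ from the authors' exact protocol:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def one_minus_ned(pred, gt):
    """1-NED: 1 - edit_distance / max(len(pred), len(gt))."""
    if not pred and not gt:
        return 1.0
    return 1.0 - levenshtein(pred, gt) / max(len(pred), len(gt))
```

A score of 1.0 means an exact match; partial recognition degrades the score in proportion to the edit distance.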

Fine-tuning on PubLayNet

We evaluate document layout analysis on the validation set of PubLayNet. We fine-tune Cascade R-CNN and initialize the backbone with our pre-trained model. The detector is trained for 8 epochs with a Momentum optimizer and a batch size of 8. The learning rate is set to 1e-2 and decays to 1e-3 at epoch 3 and to 1e-4 at epoch 6. We use random resized cropping to augment the training images, keeping the short edge within 800 pixels.

Fine-tuning on WTW We conduct experiments on WTW for table structure recognition. We again employ Cascade R-CNN to detect table cells, with its backbone replaced by pre-trained StrucTexTv2. We fine-tune our model end-to-end using the ADAM Kingma & Ba (2015) optimizer for 20 epochs with a batch size of 16 and a learning rate of 1e-4. The input images are randomly scaled and then resized to 640×640, with the long side at 640 pixels.

Fine-tuning on FUNSD Owing to its full annotations, both the document OCR and end-to-end information extraction tasks are measured on FUNSD. We set the text recognition network to a 6-layer Transformer and fine-tune the whole model for 1200 epochs with a batch size of 32. We follow a cosine learning rate policy with an initial learning rate of 5e-4. Extra position embeddings are appended to the ROI features, which are passed to each decoder layer. All training losses except the DB detector's are cross-entropy; additionally, the same loss is computed for each decoder layer in the text recognition module for better convergence.
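The PubLayNet step schedule described above amounts to a simple piecewise-constant function; whether the decay takes effect at the start or the end of the named epoch is our assumption (epochs counted from 0 here):

```python
def publaynet_lr(epoch):
    """Step learning-rate schedule for PubLayNet fine-tuning:
    1e-2 initially, 1e-3 from epoch 3, and 1e-4 from epoch 6."""
    if epoch >= 6:
        return 1e-4
    if epoch >= 3:
        return 1e-3
    return 1e-2
```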

4.3. COMPARISONS WITH THE STATE-OF-THE-ART

To investigate the effect of visual-textual representations, we benchmark StrucTexTv2 against several state-of-the-art techniques on different downstream tasks. Because FUNSD is small (149 training documents) and WTW contains a tremendous number of table cells that would exceed GPU memory, we only evaluate StrucTexTv2 Small on these two datasets.

Masking Ratios We investigate the effect of training with different masking ratios. As shown in Tab. 6, with masking ratios of 0.15, 0.30, 0.45, and 0.60, the accuracy on RVL-CDIP is 92.1%, 92.5%, 91.7%, and 92.4%, respectively, while the mAP on PubLayNet is 94.7%, 94.9%, 94.8%, and 94.8%, respectively. This suggests that the best masking ratio for our pre-training tasks is 0.30, and also that downstream performance is not very sensitive to the choice of masking ratio.


Consumption Analysis As shown in Tab. 7, StrucTexTv2 Small consumes 56 ms and 2,276 MB of GPU memory to infer one image on RVL-CDIP, while LayoutLMv3 Base spends more GPU memory or time with different OCR engines. The OCR process of the two-stage method accounts for the vast majority of the computation overhead. Thus, our OCR-free framework achieves a better trade-off between performance and efficiency.

Masking Strategies

The impact of switching from text region-level masking to patch-level masking is evaluated in Tab. 8. Performance drops by 4.2% accuracy on RVL-CDIP and 1.0% mAP on PubLayNet, which demonstrates the effectiveness of the proposed text region-level masking strategy.



Figure 1: Comparisons with the mainstream pre-training models of document image understanding. (a) Masked multi-modal modeling methods, which take both OCR results and image embeddings as input. (b) Frameworks with image-only input embeddings, suitable for vision-dominated tasks such as document image classification and layout analysis. (c) StrucTexTv2 learns visual-textual representations using only the information from images in the pre-training step and then optimizes various downstream document image understanding tasks end-to-end.

Figure 2: The overview of StrucTexTv2. Its encoder network consists of a visual extractor (CNN) and a semantic module (Transformer). Given a document image, the encoder extracts the visual feature of the whole image by CNN and obtains the semantic enhanced feature through a Transformer. Subsequently, a lightweight fusion network is utilized to generate the final representation of the image. With the help of ROI Alignment, the multi-modal feature of each masked text region is processed by the MIM branch and the MLM branch to reconstruct the pixels and text, respectively.

4.2 IMPLEMENTATION DETAILS

Pre-training on IIT-CDIP The proposed encoder network of StrucTexTv2 is composed mainly of a CNN and a Transformer. To balance efficiency and effectiveness, StrucTexTv2 Small consists of a ResNet-50 and a 12-layer Transformer (128 hidden size, 8 attention heads), introducing only 28M parameters. The larger StrucTexTv2 Large uses a ResNeXt-101 Xie et al. (2017) and a 24-layer Transformer (768 hidden size, 8 attention heads), for 238M total parameters. We initialize the CNNs with networks trained on ImageNet Deng et al. (2009), and the Transformers from language models Sun et al. (2020). StrucTexTv2 Small and StrucTexTv2 Large take 17 hours and 52 hours, respectively, to train one epoch on the IIT-CDIP data. The whole pre-training phase takes nearly a week on 32 NVIDIA Tesla 80G A100 GPUs.

Finally, the pre-training objectives of StrucTexTv2 learn to reconstruct the image pixels and text content of the masked words. In support of the proposed pre-training tasks, we introduce a new backbone network for StrucTexTv2. In particular, a CNN-based network with the RoI-Align He et al. (2017) operation produces visual features for the masked regions. Inspired by ViBERTGrid Lin et al. (2021), the backbone uses an FPN Lin et al. (2017) to integrate the CNN features. The subsequent Transformer captures semantic and contextualized representations from the visual features. We evaluate and verify our pre-trained model on five tasks, including document image classification, layout analysis, table structure recognition, document OCR, and end-to-end information extraction, all of which achieve significant gains. The experimental results also confirm that the StrucTexTv2 framework can serve as a fundamental pre-trained model for document image understanding.

Pre-training Data Following DiT Li et al. (2022), we pre-train StrucTexTv2 on the IIT-CDIP Test Collection 1.0 dataset Lewis et al. (2006), whose 11 million multi-page documents are split into single pages, giving 42 million document images in total.

RVL-CDIP Harley et al. (2015) contains 400,000 grayscale document images in 16 classes, with 25,000 images per class. We adopt RVL-CDIP as the benchmark for the document classification task; average classification accuracy is used to evaluate model performance.

PubLayNet Zhong et al. (2019) consists of more than 360,000 paper images built by automatically parsing PubMed XML files. Five typical document layout elements (text, title, list, figure, and table) are annotated with bounding boxes. Mean average precision (mAP) @ intersection over union (IOU) is used as the evaluation metric for document layout analysis.

WTW Long et al. (2021) covers unconstrained tables in natural scenes, requiring a table structure recognizer to have both discriminative and generative capabilities. It has a total of 14,581 images from a wide range of real business scenarios, with full annotations of tables (including cell coordinates and row/column information).

Performance comparisons on the RVL-CDIP dataset. We report classification accuracy on the test set. T and I denote the text and image modality of input. The proposed StrucTexTv2 achieves a comparable accuracy to the state-of-the-art models with image-only input.

As Tab. 1 shows, StrucTexTv2 achieves accuracy comparable to OCR-based approaches such as DocFormer Appalaraju et al. (2021) while using image-only input.

PubLayNet The experimental results on PubLayNet are presented in Tab. 2. StrucTexTv2 achieves new state-of-the-art performance of 95.4% and 95.5% mAP for the small and large settings, respectively. StrucTexTv2 Small beats even LayoutLMv3 Base Huang et al. (2022) (the result of LayoutLMv3 Large is not released in that paper), which takes multi-modal inputs, by 0.3%. We suggest that our dual-modal pre-training tasks learn rich visual-textual representations of document images and perform excellently in confusing situations. Notably, StrucTexTv2 Large gains a further 0.1% mAP.

Performance comparisons on the PubLayNet validation set. The mAP @ IOU [0.50:0.95] is used as the metric.

Performance comparisons on the WTW dataset. The F1-score is used to measure the accuracy of cell coordinates at IOU=0.9.

WTW Tab. 3 shows the quantitative results of table structure recognition on the WTW dataset. StrucTexTv2 achieves a 78.9% F1-score, the best among all published methods. We reconstruct the table structure based on the detection results of table cells. The superior performance of StrucTexTv2 is largely due to the proposed pre-training framework.

FUNSD We evaluate StrucTexTv2 Small on both the document OCR and end-to-end information extraction tasks. As shown in Tab. 4, StrucTexTv2 Small achieves outstanding performance: 84.1% 1-NED for document OCR and 55.0% 1-NED for information extraction. Significantly, the whole network is end-to-end trainable. Compared to StrucTexT Li et al. (2021c) and LayoutLMv3, which require separate stage-wise training strategies, our model alleviates error propagation in a document system with key information parsing.

Performance comparisons on FUNSD. We present the normalized edit distance (1-NED) for word-level document OCR and entity-level information extraction. The * denotes a multi-stage process in which the methods are applied using our OCR results and entity boxes for word grouping in information extraction.

To further examine the contributions of StrucTexTv2, we conduct several ablation experiments: document layout analysis on PubLayNet, document image classification on RVL-CDIP, and end-to-end information extraction on FUNSD. All models in the ablation study are pre-trained for 1 epoch on only 1M documents sampled from the IIT-CDIP dataset.

The ablation study on pre-training tasks and different encoding structures.

Encoding Structures In this study, we evaluate the impact of encoding structures by replacing the backbone of StrucTexTv2 with ViT Dosovitskiy et al. (2021) and Swin Transformer Liu et al. (2021). As shown in Tab. 5, the proposed network of StrucTexTv2 Small achieves the best results: 92.5% accuracy on RVL-CDIP and 94.9% mAP on PubLayNet. With ViT Base, performance on the two benchmarks drops by 3.9% accuracy and 1.7% mAP; with Swin Transformer Base, the degradation is more obvious. In addition, StrucTexTv2 Large improves performance by 1.6% on RVL-CDIP and 0.7% on PubLayNet.

Pre-training Tasks In this study, we identify the contributions of the different pre-training tasks. As shown at the bottom of Tab. 5, the MIM-only pre-trained model achieves 91.8% accuracy on RVL-CDIP and 94.1% mAP on PubLayNet, while the MLM-only pre-trained model achieves 92.0% and 94.5% on the two datasets. MLM and MIM jointly exploit the multi-modal feature representations in StrucTexTv2: combining both pre-training tasks improves accuracy to 92.5% on RVL-CDIP and mAP to 94.9% on PubLayNet.

The ablation study on the influence of masking ratios (MR.) with StrucTexTv2 Small on RVL-CDIP and PubLayNet.

Consumption analysis on RVL-CDIP. We reimplement LayoutLMv3 Base* with open-source OCR engines to provide text. † denotes the cost of the OCR process. All the models are run on an NVIDIA Tesla 80G A100.

Comparison of different masking strategies on RVL-CDIP and PubLayNet. The model is pre-trained with the MIM task only.

This work explores a novel pre-training framework named StrucTexTv2 that learns visual-textual representations for document image understanding from image-only input. By performing text region-based image masking and then predicting both the corresponding visual and textual content, the proposed encoder benefits efficiently from large-scale document images. Extensive experiments on five document understanding tasks demonstrate the superiority of StrucTexTv2 over state-of-the-art methods, with improvements in both efficiency and effectiveness.

