STRUCTEXTV2: MASKED VISUAL-TEXTUAL PREDICTION FOR DOCUMENT IMAGE PRE-TRAINING

Abstract

In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking. The proposed method randomly masks image regions according to the bounding box coordinates of text words. The pre-training objectives are to reconstruct the pixels of the masked image regions and the corresponding masked text tokens simultaneously. Hence the pre-trained encoder captures more textual semantics than masked image modeling approaches that predict only masked image patches. Compared to masked multi-modal modeling methods for document image understanding, which rely on both image and text modalities, StrucTexTv2 models image-only input and can therefore handle more application scenarios without OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance on various downstream tasks, including image classification, layout analysis, table structure recognition, document OCR, and end-to-end information extraction.

1. INTRODUCTION

In Document Artificial Intelligence, understanding visually rich document images and extracting structured information from them has gradually become a popular research topic. Its main associated tasks include document image classification Harley et al. (2015).

Due to the great disparities between vision and language, existing document understanding methods either consider a single modality or introduce an OCR engine to capture textual content in advance. Researchers have used text tokens as the input for language modeling, or selected fixed-size image patches as the granularity of vision pre-training tasks. However, the textual content of a document is visually situated and extracted from the image. We therefore propose a text region-level image masking scheme, aligned with the document content, to bridge vision modeling and language modeling through shared representations.

This paper proposes StrucTexTv2, a novel multi-modal knowledge learning framework for document image understanding that performs text region-level image masking with two parallel self-supervised tasks, image reconstruction and language modeling (as shown in Fig. 2). First, we adopt an off-the-shelf OCR toolkit to perform word-level text detection and recognition on the pre-training dataset (the IIT-CDIP Test Collection Lewis et al. (2006)). Next, we randomly mask some text word regions in the input images and feed them into the encoder. Finally, the pre-training objectives of StrucTexTv2 learn to reconstruct the image pixels and the text content of the masked words.

In support of the proposed pre-training tasks, we introduce a new backbone network for StrucTexTv2. In particular, a CNN-based network with the RoI-Align He et al. (2017) operation produces visual features for the masked regions. Inspired by ViBERTGrid Lin et al. (2021), the backbone uses an FPN Lin et al. (2017) to integrate the CNN features. A subsequent transformer captures semantic and contextualized representations from these visual features.

We evaluate our pre-trained model on five tasks, including document image classification, layout analysis, table structure recognition, document OCR, and end-to-end information extraction, all of which show significant gains. The experimental results also confirm that the StrucTexTv2 framework can serve as a fundamental pre-trained model for document image understanding.

The major contributions of our work can be summarized as follows:

• A novel self-supervised pre-training framework, named StrucTexTv2, which performs text region-level document image masking to learn visual-textual representations in an end-to-end manner.

† Equal contribution. Correspondence to: Chengquan Zhang <zhangchengquan@baidu.com>.

Figure 1: Comparisons with the mainstream pre-training models for document image understanding. (a) Masked multi-modal modeling methods that input both OCR results and image embeddings. (b) A framework with image-only embeddings, suitable for vision-dominated tasks such as document image classification and layout analysis. (c) StrucTexTv2 learns visual-textual representations using only image information during pre-training and then optimizes various downstream document understanding tasks end-to-end.
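To make the text region-level masking scheme concrete, the following is a minimal numpy sketch of the idea (not the authors' implementation): given word-level OCR bounding boxes, a random fraction of word regions is selected and their pixels are blanked out; the pre-training targets are then the original pixels and the text tokens of those masked words. The `(x1, y1, x2, y2)` box format, the `mask_ratio` value, and the zero fill (a stand-in for a learned mask value) are all illustrative assumptions.

```python
import numpy as np

def mask_text_regions(image, word_boxes, mask_ratio=0.3, seed=0):
    """Text region-level image masking (illustrative sketch).

    image:      H x W x C float array
    word_boxes: list of (x1, y1, x2, y2) word-level OCR boxes
    Returns the masked image and the indices of the masked words.
    """
    rng = np.random.default_rng(seed)
    n_mask = max(1, int(len(word_boxes) * mask_ratio))
    masked_ids = rng.choice(len(word_boxes), size=n_mask, replace=False)

    masked = image.copy()
    for i in masked_ids:
        x1, y1, x2, y2 = word_boxes[i]
        masked[y1:y2, x1:x2, :] = 0.0  # stand-in for a learned mask value
    return masked, sorted(masked_ids.tolist())

# Toy usage: three word boxes on a blank page, roughly one third masked.
img = np.ones((32, 64, 3), dtype=np.float32)
boxes = [(0, 0, 10, 8), (12, 0, 30, 8), (32, 0, 60, 8)]
masked_img, ids = mask_text_regions(img, boxes, mask_ratio=0.34)
```

During pre-training, the pixel-reconstruction head would be supervised with `img` inside the masked boxes, while the language-modeling head predicts the tokens of the words at `ids`.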
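The backbone's region feature extraction can be sketched as follows. This is a crude stand-in for the RoI-Align step, not the paper's code: it average-pools the feature map inside each (downscaled) word box to yield one vector per masked region for the dual pixel/token prediction heads, whereas real RoI-Align uses bilinear sampling to a fixed output grid. The `stride` value and the plain averaging are assumptions for illustration.

```python
import numpy as np

def pool_region_features(feat, boxes, stride=4):
    """Average-pool a backbone feature map inside each word box.

    feat:  H x W x C feature map (e.g. an FPN output at 1/stride scale)
    boxes: (x1, y1, x2, y2) boxes in input-image coordinates
    Returns a (num_boxes, C) array, one feature vector per region.
    """
    out = []
    for x1, y1, x2, y2 in boxes:
        # Project the box into feature-map coordinates, keeping at
        # least one cell so tiny words still produce a feature.
        fx1, fy1 = x1 // stride, y1 // stride
        fx2 = max(fx1 + 1, x2 // stride)
        fy2 = max(fy1 + 1, y2 // stride)
        out.append(feat[fy1:fy2, fx1:fx2].mean(axis=(0, 1)))
    return np.stack(out)

# Toy usage: a uniform 8 x 16 x 5 feature map and one full-image box.
feat = np.ones((8, 16, 5), dtype=np.float32)
region_feats = pool_region_features(feat, [(0, 0, 64, 32)], stride=4)
```

In the paper's setting, the resulting region vectors would be fed to the transformer, which contextualizes them before the reconstruction and language-modeling heads.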

