STRUCTEXTV2: MASKED VISUAL-TEXTUAL PREDICTION FOR DOCUMENT IMAGE PRE-TRAINING

Abstract

In this paper, we present StrucTexTv2, an effective document image pre-training framework based on masked visual-textual prediction. It consists of two self-supervised pre-training tasks, masked image modeling and masked language modeling, both built on text region-level image masking: image regions are randomly masked according to the bounding-box coordinates of text words. The pre-training objectives are to simultaneously reconstruct the pixels of the masked image regions and predict the corresponding masked text tokens. The pre-trained encoder therefore captures more textual semantics than standard masked image modeling, which typically predicts only masked image patches. Unlike masked multi-modal modeling methods for document image understanding, which rely on both image and text modalities, StrucTexTv2 takes image-only input and can thus handle more application scenarios without OCR pre-processing. Extensive experiments on mainstream document image understanding benchmarks demonstrate the effectiveness of StrucTexTv2: it achieves competitive or even new state-of-the-art performance on various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and end-to-end information extraction.

1. INTRODUCTION

In Document Artificial Intelligence, understanding visually-rich document images and extracting structured information from them has gradually become a popular research topic.



Its main associated tasks include document image classification Harley et al. (2015), layout analysis Zhong et al. (2019), form understanding Jaume et al. (2019), document OCR (also called text spotting) Li et al. (2017); Liao et al. (2021), and end-to-end information extraction (usually composed of an OCR phase and an entity labeling phase) Wang et al. (2021). Solving these tasks well requires fully exploiting both visual and textual cues. Meanwhile, large-scale self-supervised pre-training Li et al. (2021a); Appalaraju et al. (2021); Xu et al. (2020; 2021); Huang et al. (2022); Gu et al. (2021) has recently emerged as a technique for enhancing multi-modal knowledge learning on document images.

There are two mainstream self-supervised pre-training frameworks for document image understanding, as illustrated in Fig. 1. (a) The first category is masked multi-modal modeling, including pre-training tasks such as MLM Devlin et al. (2019), MVLM Xu et al. (2021), MM-MLM Appalaraju et al. (2021), and MSM Gu et al. (2021), whose inputs mainly consist of OCR-extracted texts and image embeddings. These methods collect semantic information from both text and image, but depend heavily on front-end OCR engines with non-trivial computing costs. Additionally, the OCR engine and the document understanding module are optimized separately, which makes it hard to guarantee the performance of the whole system. (b) The second category is masked image modeling (MIM), which inherits the concept of vision-based self-supervised learning, e.g., BEiT Bao et al. (2022), SimMIM Xie et al. (2022), MAE He et al. (2022), CAE Chen et al. (2022), and DiT Li et al. (2022). MIM is a powerful image-only pre-training technique for learning visual contextualized representations of document images. Because of the insufficient consideration of textual

† Equal contribution. Correspondence to: Chengquan Zhang <zhangchengquan@baidu.com>.
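The text region-level masking that distinguishes this approach from patch-level MIM can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the 30% masking ratio, and the choice of zero-filling the masked pixels are assumptions made for the example.

```python
import numpy as np

def mask_text_regions(image, word_boxes, mask_ratio=0.3, seed=0):
    """Randomly mask a fraction of word-level text regions in a document image.

    image: H x W x C uint8 array.
    word_boxes: list of (x0, y0, x1, y1) word bounding boxes from an OCR pass.
    Returns the masked image and the indices of the masked boxes
    (the targets that the MIM and MLM heads would reconstruct).
    """
    rng = np.random.default_rng(seed)
    n_masked = max(1, int(len(word_boxes) * mask_ratio))
    masked_idx = rng.choice(len(word_boxes), size=n_masked, replace=False)
    masked = image.copy()
    for i in masked_idx:
        x0, y0, x1, y1 = word_boxes[i]
        masked[y0:y1, x0:x1] = 0  # blank out the pixels of this word region
    return masked, sorted(int(i) for i in masked_idx)
```

During pre-training, the pixel values inside the masked boxes would serve as regression targets for masked image modeling, while the tokens of the corresponding words would serve as classification targets for masked language modeling.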

