BROS: A PRE-TRAINED LANGUAGE MODEL FOR UNDERSTANDING TEXTS IN DOCUMENT

Abstract

Understanding documents from their visual snapshots is an emerging and challenging problem that requires both advanced computer vision and NLP methods. Although recent advances in OCR enable the accurate extraction of text segments, it is still challenging to extract key information from documents due to the diversity of layouts. To address these difficulties, this paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), that represents and understands the semantics of spatially distributed texts. Unlike previous pre-training methods on 1D text, BROS is pre-trained on large-scale semi-structured documents with a novel area-masking strategy while efficiently including the spatial layout information of input documents. Also, to generate structured outputs in various document understanding tasks, BROS utilizes a powerful graph-based decoder that can capture the relations between text segments. BROS achieves state-of-the-art results on four benchmark tasks: FUNSD, SROIE*, CORD, and SciTSR. Our experimental settings and implementation code will be made publicly available.

1. INTRODUCTION

Document intelligence (DI)¹, which understands industrial documents from their visual appearance, is a critical application of AI in business. One of the important challenges of DI is the key information extraction (KIE) task (Huang et al., 2019; Jaume et al., 2019; Park et al., 2019), which extracts structured information from documents such as financial reports, invoices, business emails, insurance quotes, and many others. KIE requires a multi-disciplinary perspective, spanning from computer vision for extracting text from document images to natural language processing for parsing key information from the identified texts.

Optical character recognition (OCR) is a key component for extracting text from document images. As OCR provides a set of text blocks, each consisting of a text and its location, key information in documents can be represented as a single text block or a sequence of text blocks (Schuster et al., 2013; Qian et al., 2019; Hwang et al., 2019; 2020). Although OCR alleviates the burden of processing images, understanding semantic relations between text blocks on diverse layouts remains a challenging problem.

To solve this problem, existing works use a pre-trained language model to exploit its effective representation of text. Hwang et al. (2019) fine-tune BERT by regarding KIE tasks as sequence tagging problems. Denk & Reisswig (2019) use BERT to incorporate textual information into image pixels during their image segmentation tasks. However, since BERT is designed for text sequences, these methods artificially convert text blocks distributed in two dimensions into a single text sequence, losing spatial layout information. Recently, Xu et al. (2020) proposed LayoutLM, which is pre-trained on large-scale documents by utilizing spatial information of text blocks. They show the effectiveness of the pre-training approach by achieving high performance on several downstream tasks. Despite this success, LayoutLM has three limitations.
First, LayoutLM embeds the x- and y-axes individually using trainable parameters, like the position embedding of BERT, ignoring the gap between positions in a sequence and positions in 2D space. Second, its pre-training method is essentially identical to that of BERT, which does not explicitly consider spatial relations between text blocks. Finally, in its downstream tasks, LayoutLM only conducts sequential tagging tasks (e.g., BIO tagging) that require serialization of text blocks.
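To make the serialization issue concrete, the following is a minimal sketch of how 2D text blocks might be flattened into a 1D sequence and then BIO-tagged. It is an illustration only, not the pipeline of LayoutLM or of any cited work: the `TextBlock` schema, the `serialize` and `bio_tags` helpers, the `line_tolerance` band heuristic, and the toy coordinates and labels are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:
    """One OCR unit: text, top-left corner, and an optional entity label.

    This schema is hypothetical and chosen only for illustration.
    """
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    label: Optional[str] = None


def serialize(blocks: List[TextBlock], line_tolerance: int = 10) -> List[TextBlock]:
    """Flatten blocks into reading order: top-to-bottom, then left-to-right.

    Blocks whose y values fall into the same `line_tolerance`-pixel band are
    treated as one text line. This banding heuristic is an assumption.
    """
    return sorted(blocks, key=lambda b: (b.y // line_tolerance, b.x))


def bio_tags(ordered: List[TextBlock]) -> List[str]:
    """Assign BIO tags over the serialized sequence.

    A block starts a new entity (B-) unless the immediately preceding block
    in the sequence carries the same label, in which case it continues it (I-).
    """
    tags: List[str] = []
    prev: Optional[str] = None
    for block in ordered:
        if block.label is None:
            tags.append("O")
        elif block.label == prev:
            tags.append("I-" + block.label)
        else:
            tags.append("B-" + block.label)
        prev = block.label
    return tags


# Toy two-column layout: a two-line address on the left, a date on the right.
blocks = [
    TextBlock("Springfield", x=10, y=32, label="address"),
    TextBlock("2020-01-01", x=300, y=20, label="date"),
    TextBlock("123", x=10, y=20, label="address"),
    TextBlock("Main St", x=50, y=20, label="address"),
]

ordered = serialize(blocks)
tags = bio_tags(ordered)
for block, tag in zip(ordered, tags):
    print(f"{block.text:12s} {tag}")
```

In this toy layout the date block lands between the two address lines after serialization, so the single address entity is tagged as two separate spans. This is the kind of information loss that tagging over a serialized sequence can incur when the layout is not a single column.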



¹ https://sites.google.com/view/di2019

