PIX2STRUCT: SCREENSHOT PARSING AS PRETRAINING FOR VISUAL LANGUAGE UNDERSTANDING

Anonymous authors
Paper under double-blind review

Abstract

Visually-situated language is ubiquitous: sources range from textbooks with diagrams, to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.

1. INTRODUCTION

Research on the interaction between language and vision has traditionally focused on tasks where images and text can be separated into distinct channels, e.g. visual question answering or image captioning. However, visually-situated language is a far more pervasive way in which these modalities interact and blend together. For example, documents, tables, infographics, and user interfaces (UIs) are intended to be consumed holistically, without clear boundaries between textual and visual elements (Figure 1). Comprehensive understanding of this information requires a deep set of skills, including the ability to recognize text, understand language, and incorporate diverse visual context.

Previous work on understanding visually-situated language is scattered. The focus is typically on complex, task-specific combinations of available inputs and tools. For example, document-understanding models rely on external OCR systems (Huang et al., 2022), UI-understanding models rely on platform-specific structural metadata such as the Android view hierarchy (Bai et al., 2021), and diagram-understanding models rely on diagram parses (Kembhavi et al., 2016).

Domain-specific engineering can be effective for high-resource settings such as documents, where there is an abundance of tools and data available. However, these pipelined models lack sharing of the underlying data, model architectures, and objectives across domains, limiting their general applicability. Moreover, relying on external systems like OCR increases engineering complexity, limits adaptability, and can increase overall computational cost. Recent work on OCR-free, end-to-end document understanding from images (Kim et al., 2022; Davis et al., 2022) has attempted to remove such task-specific engineering and reliance on external components during inference by learning to decode OCR outputs during pretraining, a significant step towards more general-purpose models.
However, the focus on just text at the surface level limits the depth of knowledge transferred from unsupervised data, and effective use of pixel-only models remains an open challenge.

We present Pix2Struct, a pretrained model that combines the simplicity of purely pixel-level inputs with the generality and scalability provided by self-supervised pretraining from diverse and abundant web data. Specifically, we propose a screenshot parsing objective that requires predicting an HTML-based parse from a masked screenshot of a web page. HTML provides clean, vital signals about text, images, and layouts, while the masked inputs encourage joint reasoning about their co-occurrence. With the diversity and complexity of textual and visual elements found on the web, Pix2Struct learns rich representations of the underlying structure of web pages, which we show can effectively transfer to a variety of downstream visual language understanding tasks.

A key ingredient that enables this transfer is processing inputs visually and holistically, as they are intended for human readers. We introduce variable-resolution inputs for vision transformers that prevent distortion of the original aspect ratio, which can vary greatly across documents, figures, and UIs. During finetuning, we render other inputs (e.g., questions in VQA and bounding boxes in UI tasks) onto the image input for the task. In effect, we consume all our inputs through a single modality, simplifying the modality-combination problem of previous work.

We train two variants with 282M and 1.3B parameters, which we refer to as Pix2Struct-Base and Pix2Struct-Large respectively, on 80M screenshots of web pages from the C4 corpus (Raffel et al., 2020). Experiments on four domains and nine tasks show that our finetuned models strongly outperform Donut (by 9 to 53 points), the strongest existing baseline without pipelines.
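The variable-resolution idea can be illustrated with a short sketch: scale the image to the largest aspect-ratio-preserving grid of fixed-size patches that fits within a sequence-length budget, then flatten the grid into a patch sequence. The function name, default budget, and nearest-neighbour resize below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import math
import numpy as np

def extract_variable_resolution_patches(image, patch_size=16, max_patches=2048):
    """Split an (H, W, C) uint8 image into a flat sequence of patches,
    preserving the original aspect ratio.

    The image is rescaled so that at most `max_patches` square patches of
    side `patch_size` fit, keeping H/W as close to the original as the
    patch grid allows. Returns an array of shape (num_patches, patch_dim)."""
    h, w, c = image.shape
    # Largest scale s such that (s*h/p) * (s*w/p) <= max_patches.
    scale = math.sqrt(max_patches * (patch_size / h) * (patch_size / w))
    rows = max(min(int(scale * h / patch_size), max_patches), 1)
    cols = max(min(int(scale * w / patch_size), max_patches), 1)
    # Nearest-neighbour resize to (rows*p, cols*p); a real implementation
    # would typically use bilinear interpolation instead.
    resized_h, resized_w = rows * patch_size, cols * patch_size
    ys = np.arange(resized_h) * h // resized_h
    xs = np.arange(resized_w) * w // resized_w
    resized = image[ys][:, xs]
    # Cut into patches and flatten each one; in the full model, the (row,
    # column) index of each patch would be fed in as 2-D positional
    # information so the decoder knows the layout.
    patches = resized.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)
    return patches
```

Because the grid shape adapts to the input, a wide UI screenshot and a tall document page spend the same patch budget without either being squashed to a fixed square resolution.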
Compared with baselines that use domain-specific pipelines, we lag behind the state of the art in high-resource domains such as documents and natural images, but we observe significant improvements (ranging from 1 to 44 points) in low-resource domains such as illustrations and UIs. We hope that these results encourage the community to continue developing such general-purpose methods and further enable new applications in this currently fragmented intersection of language and vision.

To summarize, our major contributions are as follows:

• We introduce the area of general-purpose visually-situated language understanding, which consists of diverse tasks that share common challenges.

• We propose a screenshot parsing pretraining objective based on the HTML source of web pages. We show that our objective is more effective than previous attempts at enabling an elegant pixel-to-text design for general-purpose visually-situated language understanding.

• We introduce variable-resolution input representations for the Vision Transformer and new finetuning strategies that seamlessly integrate language and vision inputs by rendering any language prompts directly on top of the input image.

• The pretrained checkpoints and code for reproducing results on all nine tasks are available at https://github.com/anonymized/pix2struct.
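To make the screenshot parsing target more concrete, the toy sketch below reduces raw HTML to a simplified parse string that keeps visible text, a small whitelist of structural tags, and image alt text. The whitelist and output format here are illustrative assumptions; they do not reproduce the paper's exact simplification rules.

```python
from html.parser import HTMLParser

# Hypothetical whitelist of tags kept in the simplified parse; the real
# pretraining target may keep a different set of structural signals.
KEPT_TAGS = {"img", "a", "b", "i", "h1", "h2", "h3", "p", "li"}

class SimplifiedParse(HTMLParser):
    """Collects a linearized, simplified view of an HTML document."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KEPT_TAGS:
            self.out.append(f"<{tag}>")
        if tag == "img":
            # Alt text grounds images in language, giving the model a
            # captioning-like signal alongside OCR-like text recovery.
            alt = dict(attrs).get("alt")
            if alt:
                self.out.append(f"alt={alt}")

    def handle_endtag(self, tag):
        if tag in KEPT_TAGS and tag != "img":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(text)

def simplify(html):
    """Return the simplified parse string for a raw HTML snippet."""
    parser = SimplifiedParse()
    parser.feed(html)
    return " ".join(parser.out)
```

For example, `simplify('<div><p>Hi <b>there</b></p><img alt="logo"></div>')` drops the bare `<div>` wrapper but keeps the paragraph, emphasis, text, and alt attribute. During pretraining, the model must produce such a string from pixels alone, including for regions that were masked in the screenshot.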



Figure 1: Examples of visually-situated language understanding tasks, including diagram QA (AI2D), app captioning (Screen2Words), and document QA (DocVQA). We also include an example of our proposed pretraining task (screenshot parsing) on the left. Pix2Struct directly encodes the pixels from the input image (above) and decodes the output text (below).

