TEST-TIME ADAPTATION FOR VISUAL DOCUMENT UNDERSTANDING

Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We also introduce new benchmarks, built from existing public datasets, for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, on which DocTTA improves the source model performance by up to 1.79% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively.
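The adaptation recipe described above combines two ingredients that can be sketched independently. Below is a minimal, illustrative sketch (NumPy only; the function names, the confidence threshold, and the masking rate are our assumptions for exposition, not the paper's implementation) of (i) confidence-based pseudo-label selection over model predictions on the target domain and (ii) random token masking of the kind used as input to a masked visual language modeling objective.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only target predictions whose maximum softmax probability
    exceeds the confidence threshold; return (kept indices, labels)."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)          # per-example confidence
    labels = probs.argmax(axis=1)           # hard pseudo-labels
    keep = confidence >= threshold
    return np.flatnonzero(keep), labels[keep]

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Randomly replace a fraction of text tokens with a mask token,
    as done when forming inputs for masked (visual) language modeling;
    return (masked token ids, boolean mask of replaced positions)."""
    rng = rng or np.random.default_rng(0)
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    masked = np.where(mask, mask_id, token_ids)
    return masked, mask
```

In an actual adaptation loop, the model would be updated on the unlabeled target documents with a cross-entropy loss on the selected pseudo-labels plus the reconstruction loss on the masked tokens; this sketch only shows the data-side mechanics of those two objectives.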

1. INTRODUCTION

Visual document understanding (VDU) is the task of extracting structured information from document pages represented in various visual formats. It has a wide range of applications, including tax/invoice/mortgage/claims processing, identity/risk/vaccine verification, medical records understanding, and compliance management. These applications affect the operations of businesses across major industries and the lives of the general populace; overall, it is estimated that there are trillions of documents in the world. Machine learning solutions for VDU should rely on overall comprehension of the document content, extracting information from the text, image, and layout modalities. Most VDU tasks, including key-value extraction, form understanding, and document visual question answering (VQA), are often tackled by self-supervised pretraining followed by supervised fine-tuning on human-labeled data (Appalaraju et al., 2021; Gu et al., 2021; Xu et al., 2020b;a; Lee et al., 2022; Huang et al., 2022). This paradigm uses unlabeled data in a task-agnostic way during the pretraining stage and aims to achieve better generalization on various downstream tasks. However, once the pretrained model is fine-tuned with labeled data on a source domain, a significant performance drop might occur if it is directly applied to a new, unseen target domain, a phenomenon known as domain shift (Quiñonero-Candela et al., 2008a;b; Moreno-Torres et al., 2012). Domain shift is commonly encountered in real-world VDU, where training and test-time distributions differ due to the tremendous diversity observed in document data. Fig. 1 exemplifies this: for the key-value extraction task, across visually different document templates; and for the visual question answering task, on documents with different contents (figures, tables, letters, etc.). The performance degradation due to this domain shift might reduce the stability and reliability of VDU models.
This is highly undesirable for widespread adoption of VDU, especially given that common use cases are high-stakes applications in finance, insurance, healthcare, or legal services. Thus, methods that robustly maintain high accuracy in the presence of distribution shifts would be of significant impact. Despite being a critical issue, to the best of our knowledge, no prior work has studied post-training domain adaptation for VDU. Unsupervised domain adaptation (UDA) methods attempt to mitigate the adverse effect of data shifts, often by training a joint model on the labeled source and unlabeled target domains that maps both domains into a common feature space. However, simultaneous access to data from source and target domains may not be feasible for VDU due to privacy concerns associated with source data access,

