TEST-TIME ADAPTATION FOR VISUAL DOCUMENT UNDERSTANDING

Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo labeling, to adapt models learned on a source domain to an unlabeled target domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, on which DocTTA improves source model performance by up to 1.79% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively.

1. INTRODUCTION

Visual document understanding (VDU) focuses on extracting structured information from document pages represented in various visual formats. It has a wide range of applications, including tax/invoice/mortgage/claims processing, identity/risk/vaccine verification, medical records understanding, and compliance management. These applications affect the operations of businesses across major industries and the lives of the general populace. Overall, it is estimated that there are trillions of documents in the world. Machine learning solutions for VDU should rely on an overall comprehension of document content, extracting information from the text, image, and layout modalities. Most VDU tasks, including key-value extraction, form understanding, and document visual question answering (VQA), are tackled by self-supervised pretraining followed by supervised fine-tuning with human-labeled data (Appalaraju et al., 2021; Gu et al., 2021; Xu et al., 2020b;a; Lee et al., 2022; Huang et al., 2022). This paradigm uses unlabeled data in a task-agnostic way during the pretraining stage and aims to achieve better generalization on various downstream tasks. However, once the pretrained model is fine-tuned with labeled data on a source domain, a significant performance drop can occur if the model is directly applied to a new, unseen target domain, a phenomenon known as domain shift (Quiñonero-Candela et al., 2008a;b; Moreno-Torres et al., 2012). Domain shift is commonly encountered in real-world VDU scenarios, where training and test-time distributions differ due to the tremendous diversity of document data. Fig. 1 exemplifies this for the key-value extraction task across visually different document templates, and for the visual question answering task on documents with different content types (figures, tables, letters, etc.). The performance drop caused by domain shift reduces the stability and reliability of VDU models. This is highly undesirable for widespread adoption of VDU, especially given that common use cases are high-stakes applications in finance, insurance, healthcare, and legal services. Thus, methods that robustly maintain high accuracy in the presence of distribution shifts would be of significant impact.

Despite being a critical issue, to the best of our knowledge, no prior work has studied post-training domain adaptation for VDU. Unsupervised domain adaptation (UDA) methods attempt to mitigate the adverse effects of data shift, often by training a joint model on the labeled source and unlabeled target domains that maps both domains into a common feature space. However, simultaneous access to data from the source and target domains may not be feasible for VDU due to privacy concerns associated with source data access, given legal, technical, and contractual constraints. In addition, training and serving may be done in different computational environments, so the expensive computational resources used for training may not be available at serving time. Test-time adaptation (TTA), also known as source-free domain adaptation, has been introduced to adapt a model trained on the source domain to unseen target data without using any source data (Liang et al., 2020; Wang et al., 2021b; Sun et al., 2020; Wang et al., 2021a; Chen et al., 2022; Huang et al., 2021).
Existing TTA methods have mainly focused on image classification and semantic segmentation tasks, while VDU remains unexplored, despite the clear motivation from distribution shift and the obstacles to employing standard UDA. Since VDU differs significantly from other computer vision (CV) tasks, applying existing TTA methods in a straightforward manner is suboptimal. First, in VDU, information is extracted from multiple modalities (including image, text, and layout), unlike in other CV tasks; a TTA approach for VDU should therefore leverage cross-modal information for better adaptation. Second, multiple outputs (e.g., entities or answers to questions) are obtained from the same document, creating the scenario that their similarity in some aspects (e.g., in format or context) can be used. However, this may not be utilized beneficially by directly applying popular pseudo labeling or self-training TTA approaches (Lee et al., 2013), which have gained considerable attention in CV (Liang et al., 2020; 2021; Chen et al., 2022; Wang et al., 2021a). Pseudo labeling trains on the model's own predictions for unlabeled target data; in VDU, however, naive pseudo labeling can cause errors to accumulate, since many outputs, possibly wrong at the beginning of adaptation, are generated at the same time and each sample can contain a long sequence of words. Third, the self-supervised contrastive TTA methods commonly used in CV (He et al., 2020; Chen et al., 2020b;a; Tian et al., 2020), which are known to improve generalization, rely on a rich set of image augmentation techniques, whereas designing data augmentations for general VDU is much more challenging.

In this paper, we propose DocTTA, a novel TTA method for VDU that applies self-supervised learning to the text and layout modalities via masked visual language modeling (MVLM) while jointly optimizing with pseudo labeling. We introduce a new uncertainty-aware per-batch pseudo-label selection mechanism, which yields more accurate predictions than the pseudo labeling techniques commonly used in CV, which either apply no selection mechanism in TTA (Liang et al., 2020) or select pseudo labels based on both uncertainty and confidence in semi-supervised settings (Rizve et al., 2021). To the best of our knowledge, this is the first method to employ a self-supervised objective that combines visual and language representation learning, a key differentiating factor compared to TTA methods proposed for image or text data. While our main focus is the TTA setting, we also showcase a special form of DocTTA where access to source data is permitted, which corresponds to the conventional UDA setting.
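To make these two ingredients concrete, the sketch below illustrates, in PyTorch, what one adaptation step combining an MVLM loss with uncertainty-aware per-batch pseudo labeling could look like. The model interface (a multimodal encoder with a task head and an `mvlm` head), the mask token id, the entropy-quantile selection rule, and the equal loss weighting are all illustrative assumptions on our part, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a DocTTA-style adaptation step (our assumptions,
# not the paper's reference implementation). `model` is a multimodal
# encoder fine-tuned on the source domain; we assume it consumes token
# ids, 2-D layout boxes, and image features, and exposes two heads:
#   model(tokens, boxes, image)             -> task logits, shape (B, T, C)
#   model.mvlm(tokens_masked, boxes, image) -> vocabulary logits, (B, T, V)

MASK_ID = 103           # assumed [MASK] token id
MASK_PROB = 0.15        # fraction of text tokens masked for MVLM
ENTROPY_QUANTILE = 0.2  # keep the lowest-entropy predictions per batch

def adaptation_step(model, optimizer, tokens, boxes, image):
    # --- Masked visual language modeling on the unlabeled target batch ---
    mask = torch.rand_like(tokens, dtype=torch.float) < MASK_PROB
    tokens_masked = tokens.masked_fill(mask, MASK_ID)
    vocab_logits = model.mvlm(tokens_masked, boxes, image)
    # Reconstruct masked tokens from the remaining text, layout, and image
    # context; this adapts the representation without any labels.
    mvlm_loss = F.cross_entropy(vocab_logits[mask], tokens[mask])

    # --- Uncertainty-aware per-batch pseudo labeling ---
    task_logits = model(tokens, boxes, image)
    probs = task_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)  # (B, T)
    pseudo_labels = probs.argmax(dim=-1)
    # Per-batch selection: train only on the least-uncertain fraction of
    # predictions (guards for empty selections omitted for brevity).
    threshold = torch.quantile(entropy.flatten(), ENTROPY_QUANTILE)
    keep = entropy <= threshold
    pl_loss = F.cross_entropy(task_logits[keep], pseudo_labels[keep])

    loss = mvlm_loss + pl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a full implementation the two loss terms would typically be weighted against each other and the selection quantile tuned per task; the point of the sketch is only the joint optimization of a cross-modal self-supervised objective with selectively trusted pseudo labels.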



Figure 1: Distribution shift examples for document samples from the proposed benchmark, DocVQA-TTA. Top row: documents from four domains: (i) Emails & Letters, (ii) Figures & Diagrams, (iii) Layout, (iv) Tables & Lists, from our VQA benchmark derived from the DocVQA dataset (Mathew et al., 2021). Bottom left: documents from source and target domains for the key-value information extraction task from the SROIE receipt dataset (Huang et al., 2019). Bottom right: documents from source and target domains for the named entity recognition task from the FUNSD dataset (Jaume et al., 2019).

