TEST-TIME ADAPTATION FOR VISUAL DOCUMENT UNDERSTANDING

Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling as well as pseudo labeling to adapt models learned on a source domain to an unlabeled target domain at test time. We also introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering, on which DocTTA improves the source model performance by up to 1.79% (F1 score), 3.43% (F1 score), and 17.68% (ANLS score), respectively.

1. INTRODUCTION

Visual document understanding (VDU) focuses on extracting structured information from document pages represented in various visual formats. It has a wide range of applications, including tax/invoice/mortgage/claims processing, identity/risk/vaccine verification, medical records understanding, and compliance management. These applications affect the operations of businesses across major industries and the lives of the general populace. Overall, it is estimated that there are trillions of documents in the world. Machine learning solutions for VDU should rely on overall comprehension of the document content, extracting information from the text, image, and layout modalities. Most VDU tasks, including key-value extraction, form understanding, and document visual question answering (VQA), are often tackled by self-supervised pretraining followed by supervised fine-tuning using human-labeled data (Appalaraju et al., 2021; Gu et al., 2021; Xu et al., 2020b;a; Lee et al., 2022; Huang et al., 2022). This paradigm uses unlabeled data in a task-agnostic way during the pretraining stage and aims to achieve better generalization on various downstream tasks. However, once the pretrained model is fine-tuned with labeled data on a source domain, a significant performance drop might occur if the model is directly applied to a new, unseen target domain, a phenomenon known as domain shift (Quiñonero-Candela et al., 2008a;b; Moreno-Torres et al., 2012). The domain shift problem is commonly encountered in real-world VDU scenarios where the training and test-time distributions differ, a common situation given the tremendous diversity observed in document data. Fig. 1 exemplifies this for the key-value extraction task across visually different document templates, and for the visual question answering task on documents with different contents (figures, tables, letters, etc.). The performance difference due to this domain shift might reduce the stability and reliability of VDU models.
This is highly undesirable for the widespread adoption of VDU, especially given that common use cases are high-stakes applications in finance, insurance, healthcare, or legal services. Thus, methods that robustly guarantee high accuracy in the presence of distribution shifts would have significant impact. Despite being a critical issue, to the best of our knowledge, no prior work has studied post-training domain adaptation for VDU. Unsupervised domain adaptation (UDA) methods attempt to mitigate the adverse effect of data shifts, often by training a joint model on the labeled source and unlabeled target domains that maps both domains into a common feature space. However, simultaneous access to data from the source and target domains may not be feasible for VDU due to privacy concerns associated with source data access, given legal, technical, and contractual constraints. In addition, training and serving may be done in different computational environments, so the expensive computational resources used for training may not be available at serving time. Test-time adaptation (TTA), or source-free domain adaptation, has been introduced to adapt a model trained on the source to unseen target data without using any source data (Liang et al., 2020; Wang et al., 2021b; Sun et al., 2020; Wang et al., 2021a; Chen et al., 2022; Huang et al., 2021). Existing TTA methods have mainly focused on image classification and semantic segmentation tasks, while VDU remains unexplored, despite the clear motivation of distribution shift and the challenges of employing standard UDA.
Since VDU significantly differs from other computer vision (CV) tasks, applying existing TTA methods in a straightforward manner is suboptimal. First, in VDU, information is extracted from multiple modalities (image, text, and layout), unlike other CV tasks. Therefore, a TTA approach for VDU should leverage cross-modal information for better adaptation. Second, multiple outputs (e.g., entities or answers to questions) are obtained from the same document, creating a scenario in which their similarity in some aspects (e.g., format or context) can be exploited. However, this may not be utilized beneficially by directly applying popular pseudo-labeling or self-training-based TTA approaches (Lee et al., 2013), which have gained much attention in CV (Liang et al., 2020; 2021; Chen et al., 2022; Wang et al., 2021a). Pseudo labeling uses predictions on unlabeled target data for training. However, in VDU, naive pseudo labeling can result in error accumulation, because multiple outputs, possibly wrong at the beginning of adaptation, are generated at the same time, and each sample can contain a long sequence of words. Third, commonly-used self-supervised contrastive TTA methods in CV (He et al., 2020; Chen et al., 2020b;a; Tian et al., 2020), known to increase generalization, employ a rich set of image augmentation techniques, while proposing data augmentations is much more challenging for general VDU. In this paper, we propose DocTTA, a novel TTA method for VDU that utilizes self-supervised learning on the text and layout modalities using masked visual language modeling (MVLM) while jointly optimizing with pseudo labeling.
We introduce a new uncertainty-aware per-batch pseudo-label selection mechanism, which makes more accurate predictions than commonly-used pseudo-labeling techniques that either use no selection mechanism in TTA (Liang et al., 2020) or select pseudo labels based on both uncertainty and confidence in semi-supervised learning settings (Rizve et al., 2021). To the best of our knowledge, this is the first method that employs a self-supervised objective combining visual and language representation learning, a key differentiating factor compared to TTA methods proposed for image or text data. While our main focus is the TTA setting, we also showcase a special form of DocTTA where access to source data is granted at test time, extending our approach to unsupervised domain adaptation; we name this variant DocUDA. Moreover, to evaluate DocTTA diligently and facilitate future research in this direction, we introduce new benchmarks for various VDU tasks, including key-value extraction, entity recognition, and document visual question answering (DocVQA), using publicly available datasets modified to mimic real-world adaptation scenarios. We show that DocTTA significantly improves source model performance at test time on all VDU tasks without any supervision. To our knowledge, our paper is the first to demonstrate TTA and UDA for VDU applications, showing the significant accuracy gains attainable via adaptation. We expect our work to open new horizons for future research in VDU and for real-world deployment.

2. RELATED WORK

Unsupervised domain adaptation aims to improve performance on a different target domain for a model trained on the source domain. UDA approaches for closed-set adaptation (where classes fully overlap between the source and target domains) can be grouped into four categories: (i) distribution alignment-based, (ii) reconstruction-based, (iii) adversarial-based, and (iv) pseudo-labeling-based. Distribution alignment-based approaches feature aligning mechanisms such as moment matching (Peng et al., 2019) or maximum mean discrepancy (Long et al., 2015; Tzeng et al., 2014). Reconstruction-based approaches reconstruct source and target data with a shared encoder while performing supervised classification on labeled data (Ghifary et al., 2016), or use cycle consistency to further improve domain-specific reconstruction (Murez et al., 2018; Hoffman et al., 2018). Inspired by GANs, adversarial-learning-based UDA approaches use two-player games to disentangle domain-invariant and domain-specific features (Ganin & Lempitsky, 2015; Long et al., 2018; Shu et al., 2018). Pseudo-labeling (or self-training) approaches jointly optimize a model on the labeled source and pseudo-labeled target domains for adaptation (Kumar et al., 2020; Liu et al., 2021; French et al., 2017). Overall, all UDA approaches need access to both labeled source data and unlabeled target data during adaptation, which makes UDA a special case of the more challenging TTA setting; we show how our approach can be modified for UDA. Test-time adaptation corresponds to source-free domain adaptation, focusing on the more challenging setting where only the source model and unlabeled target data are available. These methods often employ an unsupervised or self-supervised cost function. TENT (Wang et al., 2021b) utilizes entropy minimization for fully test-time adaptation, which encourages the model to become more "certain" about target predictions regardless of their correctness.
At the beginning of training, when predictions tend to be inaccurate, entropy minimization can lead to error accumulation, since VDU models produce a long sequence of outputs for every document, resulting in noisy training. SHOT (Liang et al., 2020) combines mutual information maximization with offline clustering-based pseudo labeling. However, similar to TENT, simple offline pseudo labeling can lead to noisy training and poor performance when distribution shifts are large (Chen et al., 2022; Liu et al., 2021; Rizve et al., 2021; Mukherjee & Awadallah, 2020). We also use pseudo labeling in DocTTA, but we propose online per-batch updates of the pseudo labels as the model adapts to test data. Besides, we equip our method with an uncertainty-based pseudo-label rejection mechanism to mitigate the negative effects of predictions that are likely to be inaccurate. Most recent TTA approaches in image classification use contrastive learning combined with extra supervision (Xia et al., 2021; Huang et al., 2021; Wang et al., 2021a; Chen et al., 2022). In contrastive learning, the idea is to jointly maximize the similarity between representations of augmented views of the same image while minimizing the similarity to representations of other samples. All these methods rely on self-supervised learning that utilizes data augmentation techniques, which are popular in CV but not yet as effective for VDU. While we advocate using SSL during TTA, we propose, for the first time, to employ multimodal SSL together with pseudo labeling, which is more effective for VDU. Self-supervised pretraining for VDU aims to learn generalizable representations on large-scale unlabeled data to improve downstream VDU accuracy (Appalaraju et al., 2021; Gu et al., 2021; Xu et al., 2020b;a; Lee et al., 2022; Huang et al., 2022).
LayoutLM (Xu et al., 2020b) jointly models interactions between text and layout information using a masked visual-language modeling objective and performs supervised multi-label document classification on the IIT-CDIP dataset (Lewis et al., 2006). LayoutLMv2 (Xu et al., 2020a) extends it by training on the image modality as well, optimizing text-image alignment and text-image matching objective functions. DocFormer (Appalaraju et al., 2021) is another multi-modal transformer-based architecture that uses text, vision, and spatial features and combines them using multi-modal self-attention with a multi-modal masked language modeling (MM-MLM) objective (a modified version of the MLM objective in BERT (Devlin et al., 2018)), an image reconstruction loss, and a text-describes-image loss formulated as a binary cross-entropy to predict whether the cut-out text and image are paired. FormNet (Lee et al., 2022) is a structure-aware sequence model that combines a transformer with graph convolutions and proposes rich attention that uses spatial relationships between tokens. UniDoc (Gu et al., 2021) is another multi-modal transformer-based pretraining method that uses masked sentence modeling, visual contrastive learning, and visual language alignment objectives, and unlike other methods does not rely on a fixed document object detector (Li et al., 2021; Xu et al., 2020a).

Figure 2: Illustration of how our approach, DocTTA, leverages unlabeled target data at test time to i) learn how to predict masked language given visual cues, ii) generate pseudo labels to supervise the learning, and iii) maximize the diversity of predictions to generate enough labels from all classes.
In this work, we focus on a novel TTA approach for VDU, that can be integrated with any pre-training method. We demonstrate DocTTA using the publicly available LayoutLMv2 architecture pretrained on IIT-CDIP dataset.

3. DOCTTA: TEST-TIME ADAPTATION FOR DOCUMENTS

In this section, we introduce DocTTA, a test-time adaptation framework for VDU tasks including key-value extraction, entity recognition, and document visual question answering (VQA).

3.1. DOCTTA FRAMEWORK

We define a domain as a pair of a distribution D on inputs X and a labeling function l: X → Y. We consider source and target domains. In the source domain, denoted as ⟨D_s, l_s⟩, we assume a model f_s, parameterized by θ_s, trained on source data {(x_s^(i), y_s^(i))}_{i=1}^{n_s}, where x_s^(i) ∈ X_s and y_s^(i) ∈ Y_s are document inputs and corresponding labels, respectively, and n_s is the number of documents in the source domain. Given the trained source model f_s and leaving X_s behind, the goal of TTA is to train f_t on the target domain, denoted as ⟨D_t, l_t⟩, where f_t is parameterized by θ_t, initialized with θ_s, and D_t is defined over {x_t^(i)}_{i=1}^{n_t} ∈ X_t without any ground-truth labels. Algorithm 1 overviews our proposed DocTTA procedure. Unlike the single-modality inputs commonly used in computer vision, documents are images with rich textual information. To extract the text from the image, we assume optical character recognition (OCR) is performed and use its outputs: characters and their corresponding bounding boxes (details are provided in the Appendix). We construct our input X in either domain from three components: a text input sequence X_T of length n, denoted as (x_{T_1}, ..., x_{T_n}) ∈ R^{n×d}; an image X_I ∈ R^{3×W×H}; and a layout X_B given as a 6-dimensional vector (x_min, x_max, y_min, y_max, w, h) representing the bounding box associated with each word in the text input sequence. For the entity recognition task, labels correspond to the set of classes that describe the extracted text; for the key-value extraction task, labels are values for predefined keys; and for the VQA task, labels are the starting and ending positions of the answer presented in the document for the given question. We consider the closed-set assumption: the source and target domains share the same class labels, Y_s = Y_t = Y, with |Y| = C being the total number of classes.

Algorithm 1 DocTTA for closed-set TTA in VDU
1: Input: Source model weights θ_s, target documents {x_t^(i)}_{i=1}^{n_t}, number of test-time training epochs n_e, test-time training learning rate α, uncertainty threshold γ
2: Initialization: Initialize the target model f_{θ_t} with weights θ_s.
3: for epoch = 1 to n_e do
4:    Perform masked visual-language modeling with Eq. 1
5:    Generate pseudo labels, accept a subset using the criterion in Eq. 3, and fine-tune with Eq. 2
6:    Maximize diversity in pseudo-label predictions with Eq. 4
7:    θ_t ← θ_t − α∇L_DocTTA    ▷ Update θ_t via the total loss in Eq. 5
8: end for
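To make the layout component concrete, here is a minimal sketch (our own illustration, not code from the paper) of turning an OCR word box into the 6-dimensional layout feature; the 0-1000 normalization grid follows the LayoutLM convention and is an assumption here:

```python
def layout_vector(box, page_w, page_h, scale=1000):
    """Convert an OCR word box (x_min, y_min, x_max, y_max) in pixels
    into the 6-dim layout feature (x_min, x_max, y_min, y_max, w, h),
    normalized to a 0..scale grid as is common for LayoutLM-style models."""
    x0, y0, x1, y1 = box
    nx0 = int(scale * x0 / page_w)
    nx1 = int(scale * x1 / page_w)
    ny0 = int(scale * y0 / page_h)
    ny1 = int(scale * y1 / page_h)
    return (nx0, nx1, ny0, ny1, nx1 - nx0, ny1 - ny0)
```

One such vector is produced per word in the text input sequence and paired with that word's token embedding.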

3.2. DOCTTA OBJECTIVE FUNCTIONS

In order to adapt f_t, DocTTA optimizes three objectives on the unlabeled target data.

Objective I: masked visual language modeling (MVLM). Inspired by masked language modeling in BERT (Devlin et al., 2018) and the MVLM objective used for self-supervised pretraining in (Xu et al., 2020a), we propose to employ MVLM at test time to encourage the model to better learn the text representation of the test data given the 2D positions and the other text tokens. The intuition behind using this objective for TTA is to let the target model learn the language modality of the new data given visual cues, thereby bridging the gap between the modalities on the target domain. We randomly mask 15% of the input text tokens, among which 80% are replaced by a special [MASK] token and the remaining tokens are replaced by a random word from the vocabulary. The model is then trained to recover the masked tokens while the layout information remains fixed. To do so, the output representations of the masked tokens from the encoder are fed into a classifier that outputs logits over the whole vocabulary, minimizing the negative log-likelihood of correctly recovering the masked text tokens x_{t_m}^T given the image tokens x_t^I and layout x_t^B:

L_MVLM(θ_t) = −E_{x_t ∈ X_t} Σ_m log p_{θ_t}(x_{t_m}^T | x_t^I, x_t^B).    (1)

Objective II: self-training with pseudo labels. While optimizing the MVLM loss during adaptation, we also generate pseudo labels for the unlabeled target data in an online manner and treat them as ground-truth labels to perform supervised learning on the target domain. Unlike previous pseudo-labeling-based TTA approaches in image classification, which update pseudo labels only after each epoch (Liang et al., 2020; 2021; Wang et al., 2021a), we generate hard pseudo labels per batch, aiming to use the latest version of the model for predictions.
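The token-masking scheme of Objective I can be sketched as follows; this is an illustrative token-level simplification (real implementations operate on tokenizer ids and batched tensors), and the helper name is ours:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Select roughly mask_prob of the tokens as MVLM targets; 80% of the
    selected tokens are replaced by [MASK], the rest by a random vocab word.
    Returns the corrupted sequence and the indices the model must recover.
    The layout (bounding-box) inputs are left untouched by this step."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            targets.append(i)
            if rng.random() < 0.8:
                corrupted[i] = mask_token
            else:
                corrupted[i] = rng.choice(vocab)
    return corrupted, targets
```

The loss in Eq. 1 is then the cross-entropy of the model's vocabulary logits at the returned target positions.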
In addition, unlike prior works, we do not use a clustering mechanism to generate pseudo labels, as clustering would be computationally expensive for documents. Instead, we directly use the model's predictions. However, simply using all predictions would lead to noisy pseudo labels. Inspired by (Rizve et al., 2021), to prevent noisy pseudo labels we employ an uncertainty-aware selection mechanism that keeps only the subset of pseudo labels with low uncertainty. Note that in (Rizve et al., 2021), pseudo labeling is used as a semi-supervised learning approach and the selection criterion thresholds both confidence and uncertainty, with MC-Dropout (Gal & Ghahramani, 2016) as the measure of uncertainty. We empirically observe that raw confidence values (taken as the posterior probability output by the model) are overconfident whether right or wrong; thresholding pseudo-label confidence only introduces a new hyperparameter without a performance gain (see Sec. 5.1). Instead, we propose to select predictions using uncertainty alone, in the form of Shannon's entropy (Shannon, 2001). We also expect this selection mechanism to reduce miscalibration, due to the direct relationship between the expected calibration error (ECE) (Naeini et al., 2015), a metric measuring the calibration of a model, and prediction uncertainty: when more certain predictions are selected, the ECE of the selected subset of pseudo labels is expected to decrease. Let p^(i) be the output probability vector of the target sample x_t^(i), such that p_c^(i) denotes the probability that class c is the correct class. We update the weights θ_t with a cross-entropy loss on the accepted pseudo labels:

L_CE(θ_t) = −E_{x_t ∈ X_t} Σ_{c=1}^{C} ỹ_c log σ(f_t(x_t)),    (2)

where σ(·) is the softmax function, and a pseudo label ỹ_c^(i) for x_t^(i) is accepted only if the uncertainty of the prediction, u(p_c^(i)), measured with Shannon's entropy, is below a threshold γ:

ỹ_c^(i) = 1[u(p_c^(i)) ≤ γ].    (3)
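The uncertainty-based selection above amounts to thresholding the Shannon entropy of each prediction; a minimal sketch (function names and the threshold value are illustrative):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(prob_batch, threshold):
    """Accept a hard pseudo label argmax(p) only when the prediction's
    entropy is below `threshold`; return (index, label) pairs for the
    accepted samples. `threshold` plays the role of the uncertainty
    threshold in the selection criterion above."""
    accepted = []
    for i, probs in enumerate(prob_batch):
        if shannon_entropy(probs) < threshold:
            label = max(range(len(probs)), key=lambda c: probs[c])
            accepted.append((i, label))
    return accepted
```

A sharply peaked prediction passes the filter, while a near-uniform one is rejected regardless of how high its top confidence value is.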
Objective III: diversity objective. To prevent the model from being indiscriminately dominated by the most probable class in the pseudo labels, we encourage class diversification in the predictions by minimizing the following objective:

L_DIV = Σ_{c=1}^{C} p̄_c log p̄_c,    (4)

where p̄ = E_{x_t ∈ X_t} σ(f_t(x_t)) is the output probability of the target model averaged over the target data. By combining Eqs. 1, 2, and 4, we obtain the full DocTTA objective:

L_DocTTA = L_MVLM + L_CE + L_DIV.    (5)
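The diversity term can be sketched on plain probability vectors as follows (an illustration of Eq. 4 on a batch, not the paper's implementation):

```python
import math

def diversity_loss(prob_batch):
    """L_DIV: negative entropy of the batch-mean predicted distribution.
    Minimizing it maximizes the entropy of the mean prediction, spreading
    predictions across classes and discouraging collapse onto the single
    most probable pseudo-label class."""
    n = len(prob_batch)
    num_classes = len(prob_batch[0])
    mean_p = [sum(row[c] for row in prob_batch) / n for c in range(num_classes)]
    return sum(p * math.log(p) for p in mean_p if p > 0)
```

A batch collapsed onto one class gives the maximal value 0, while a batch spread evenly over classes gives the minimal value, so gradient descent on this term pushes toward diverse predictions.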

3.3. DOCTTA VS. DOCUDA

The proposed DocTTA framework can be extended into a UDA approach, which we refer to as DocUDA (see the Appendix for the algorithm and details), by enabling access to source data during adaptation to the target. In principle, the availability of this extra source information can provide an advantage over TTA. However, as we show in our experiments, the difference is small in most cases, and TTA can even be superior when the source domain is significantly smaller than the target domain and the distribution gap is large, highlighting the efficacy of DocTTA in adapting without relying on already-seen source data. The UDA version also comes with fundamental drawbacks. From the privacy perspective, there are concerns associated with accessing or storing source data in deployment environments, especially given that VDU applications often come from privacy-sensitive domains such as legal or finance. From the computational perspective, UDA yields longer convergence times and higher memory requirements due to joint learning from source data; given that state-of-the-art VDU models are large, this may become a major consideration.

4. DOCTTA BENCHMARKS

To better highlight the impact of distribution shifts and to study the methods that are robust against them, we introduce new benchmarks for VDU. Our benchmark datasets are constructed from existing popular and publicly-available VDU data to mimic real-world challenges. We have attached the training and test splits for all our benchmark datasets in the supplementary materials.

4.1. FUNSD-TTA: ENTITY RECOGNITION ADAPTATION BENCHMARK

We consider FUNSD (Jaume et al., 2019), a form understanding dataset, for this benchmark. We first combine the original training and test splits and then manually divide the documents into two groups. We set aside 149 forms that are densely filled with text for the source domain and 50 forms that are sparsely filled for the target domain. We randomly choose 10 of the 149 source documents for validation and keep the remaining 139 for training. Fig. 1 (bottom row on the right) shows examples from the source and target domains.

4.2. SROIE-TTA: KEY-VALUE EXTRACTION ADAPTATION BENCHMARK

We use the SROIE (Huang et al., 2019) dataset with 9 classes in total. Similar to FUNSD, we first combine the original training and test splits. Then, we manually divide them into two groups based on their visual appearance: the source domain, with 600 documents, contains standard-looking receipts with a proper angle of view and clear black ink. We use 37 documents from this split for validation, which we use to tune adaptation hyperparameters. Note that the validation split does not overlap with the target domain, which has 347 receipts with a slightly blurry look, rotated views, colored ink, and large empty margins. Fig. 1 (bottom row on the left) exemplifies documents from the source and target domains.

4.3. DOCVQA-TTA: DOCUMENT VQA ADAPTATION BENCHMARK

We use DocVQA (Mathew et al., 2021), a large-scale VQA dataset with nearly 20 different types of documents, including scientific reports, letters, notes, invoices, publications, and tables. The original training and validation splits contain questions from all of these document types. However, to create an adaptation benchmark, we select 4 domains of documents: i) Emails & Letters (E), ii) Tables & Lists (T), iii) Figures & Diagrams (F), and iv) Layout (L). Since DocVQA does not provide public metadata for sorting all documents with their questions, we use a simple keyword search to find our desired categories of questions and their matching documents. We use the words in the domains' names to search among questions (e.g., we search for the words "email" and "letter" for the Emails & Letters domain). For the Layout domain, however, our list of keywords is ["top", "bottom", "right", "left", "header", "page number"], which identifies questions querying information from a specific location in the document. Among the four domains, L and E have the smallest gap, because emails and letters have structured layouts and extracting information from them requires understanding relative positions. For example, the name and signature of the sender are usually at the bottom, while the date usually appears at the top left. In contrast, the F and T domains have larger gaps with the other domains, which we attribute to the fact that answering questions on figures or tables requires understanding local information within the list or table. Fig. 1 (top row) exemplifies documents with their questions from each domain. Document counts for each domain are provided in the Appendix.
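The keyword-based domain assignment described above can be sketched as follows; the keyword lists mirror the text (domain-name words, plus the given Layout list), while the first-match ordering is our assumption:

```python
DOMAIN_KEYWORDS = {
    "emails_letters": ["email", "letter"],
    "tables_lists": ["table", "list"],
    "figures_diagrams": ["figure", "diagram"],
    "layout": ["top", "bottom", "right", "left", "header", "page number"],
}

def assign_domain(question):
    """Assign a DocVQA question to the first domain whose keyword appears
    in it (lowercased substring match); return None when no keyword matches,
    in which case the question is left out of the benchmark."""
    q = question.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(kw in q for kw in keywords):
            return domain
    return None
```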

5. EXPERIMENTS

Evaluation metrics: For the entity recognition and key-value extraction tasks, we use the entity-level F1 score as the evaluation metric, whereas for the document VQA task we use Average Normalized Levenshtein Similarity (ANLS), introduced by (Biten et al., 2019), as it is recognized as a better measure than accuracy since it does not penalize minor text mismatches due to OCR errors.
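For reference, ANLS for a single question scores the best normalized edit similarity against any ground-truth answer and zeroes out scores below a threshold (commonly 0.5); a minimal sketch:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(prediction, ground_truths, tau=0.5):
    """Per-question ANLS: best normalized similarity against any
    ground-truth answer, zeroed out below threshold tau so that
    OCR-level noise is tolerated but wrong answers score 0."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.lower().strip(), gt.lower().strip()
        denom = max(len(p), len(g))
        sim = 1.0 if denom == 0 else 1.0 - levenshtein(p, g) / denom
        best = max(best, sim)
    return best if best >= tau else 0.0
```

The dataset-level score reported in our tables is the mean of this per-question value over all questions.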

Model architecture:

In all experiments, we use the LayoutLMv2-BASE architecture, which has a 12-layer, 12-head transformer encoder with a hidden size of 768. Its visual backbone is based on ResNeXt101-FPN, similar to that of Mask R-CNN (He et al., 2017). Overall, it has ~200M parameters. We note that our approach is architecture-independent and hence applicable to any attention-based VDU model. Details on training and hyperparameter tuning are provided in the Appendix. Baselines: As our method is the first TTA approach proposed for VDU tasks, there is no baseline to compare against directly. Thus, we adapt TTA and UDA approaches from image classification that are applicable to VDU, i.e., those that do not depend on augmentation techniques, contrastive learning, or generative modeling. For UDA, we use DANN (Ganin & Lempitsky, 2015) and CDAN (Long et al., 2018) as baselines, and for TTA, we use batch normalization statistics adaptation (BN) (Ioffe & Szegedy, 2015; Nado et al., 2020), TENT (Wang et al., 2021b), and SHOT (Liang et al., 2020). We also report source-only, where the model is trained on source and evaluated on target without any adaptation mechanism, and train-on-target, where the model is trained (and tested) on the target domain using the exact same hyperparameters as for TTA (which are tuned on the validation set and might not be optimal for the target data). While these two baselines do not adhere to any domain adaptation setting, they can be regarded as the ultimate lower and upper bounds on performance.

5.1. RESULTS AND DISCUSSIONS

FUNSD-TTA: Table 1 compares DocTTA and DocUDA with their corresponding TTA and UDA baselines. For UDA, DocUDA outperforms all other UDA baselines by a large margin and improves over the source-only model by 8.96%. In the more challenging TTA setting, DocTTA improves the F1 score of the source-only model by 3.43%, whereas the performance gain of all the TTA baselines is less than 0.5%. We also observe that DocTTA performs slightly better than the UDA baselines DANN and CDAN, which is remarkable given that, unlike those, DocTTA does not access the source data at test time.

SROIE-TTA:

Table 1 shows the comparison between the UDA and TTA baselines and DocUDA and DocTTA on the SROIE-TTA benchmark. Similar to our findings on FUNSD-TTA, DocUDA and DocTTA outperform their corresponding baselines, and DocTTA even surpasses DANN and CDAN (which use source data at test time). Comparing DocUDA and DocTTA shows that for small distribution shifts, the UDA version of our framework yields better performance. DocVQA-TTA: Table 2 shows the results on our DocVQA-TTA benchmark, where ANLS scores are obtained by adapting each domain to all the remaining ones. The distribution gap between domains on this benchmark is larger than for the FUNSD-TTA and SROIE-TTA benchmarks. Hence, we also see greater performance improvements from TTA/UDA across all domains and methods. In the UDA setting, DocUDA consistently outperforms adversarial-based UDA methods by a large margin, underlining the superiority of self-supervised learning and pseudo labeling in leveraging labeled and unlabeled data at test time. Also, in the more challenging TTA setting, DocTTA consistently achieves the highest gain in ANLS score, from at least 2.57% on E → F up to 17.68% on F → E. Moreover, DocTTA significantly outperforms DANN on all domains and CDAN on 11 out of 12 adaptation scenarios, even though it does not utilize source data at test time. This demonstrates the efficacy of joint online pseudo labeling with diversity maximization and masked visual-language modeling. Between DocUDA and DocTTA, DocUDA is expected to perform better due to its extra access to source domain data. However, we observe three exceptions, where DocTTA surpasses DocUDA by 1.13%, 0.79%, and 2.16% ANLS on E → F, T → F, and L → F, respectively. We attribute this to: i) the target domain (F) dataset size being relatively small, and ii) the large domain gap between the source and target domains.
The former can create an imbalanced distribution of source and target data, where the larger split (the source data) dominates the learned representation. This effect is amplified by (ii), because the two domains are not closely related, so the joint representation is biased in favor of the labeled source data. Another finding on this benchmark is that a source model trained on a domain with a small dataset generalizes worse than one trained on a sufficiently large dataset with a larger domain gap to the target. Results for train-on-target on each domain shed light on this. When we use the domain with the smallest dataset (F) as the source, each domain achieves only its lowest ANLS score (39.70% on E, 24.77% on T, and 38.59% on L), whereas with T, the second smallest domain in our benchmark (657 training documents), the train-on-target scores on E and L increase to 84.59% and 83.73%, respectively. Thus, even with access to all labeled target data, the limitation of the source domain dataset size persists. Ablation studies: We compare the impact of the different components of our method on the DocVQA-TTA benchmark, using a model trained on the Emails & Letters domain and adapted to the other three domains. Table 3 shows that the pseudo-label selection mechanism plays an important role: using confidence scores to accept pseudo labels results in the poorest performance, far below the source-only ANLS values and even worse than not using pseudo labeling at all. On the other hand, using uncertainty and raw confidence together to select pseudo labels yields performance closest to that of the full (best) method (details are provided in the Appendix). The MVLM loss and the diversity maximization criterion have a similar impact on DocTTA's performance.

6. CONCLUSIONS

We introduce TTA for VDU for the first time with our novel approach, DocTTA, along with new realistic adaptation benchmarks for common VDU tasks such as entity recognition, key-value extraction, and document VQA. DocTTA starts from a model pretrained on the source domain and uses online pseudo labeling along with masked visual language modeling and diversity maximization on the unlabeled target domain. We propose an uncertainty-based online pseudo-labeling mechanism that generates significantly more accurate pseudo labels on a per-batch basis. Overall, DocTTA surpasses state-of-the-art TTA approaches adapted from computer vision. We have included the details of our experimental setup, such as the compute resources, pretrained model, optimizer, learning rate, batch size, and number of epochs, in Section A.2 of the Appendix. We provide the details of our hyperparameter tuning and search space in Section A.2.2 of the Appendix. For our introduced benchmark datasets, per-dataset statistics are detailed in Section A.1.2. The training and validation splits for our proposed benchmarks are provided as JSON files at Supplemental/TTA_Benchmarks. To ensure full reproducibility, we will release our code upon acceptance.






Figure 1: Distribution shift examples for document samples from the proposed benchmark, DocVQA-TTA. Top row: documents from four domains: (i) Emails & Letters, (ii) Figures & Diagrams, (iii) Layout, (iv) Tables & Lists, from our VQA benchmark derived from the DocVQA dataset (Mathew et al., 2021). Bottom left: documents from source and target domains for the key-value information extraction task from the SROIE (Huang et al., 2019) receipt dataset. Bottom right: documents from source and target domains for the named entity recognition task from the FUNSD (Jaume et al., 2019) dataset.




Table 1: F1 score results for adapting source to target on the FUNSD-TTA and SROIE-TTA benchmarks. The availability of labeled/unlabeled data from the source/target domains during adaptation (UDA and TTA settings) and during training (source-only and train-on-target settings) is marked. Standard deviations are in parentheses.

Table 2: ANLS scores for adapting between domains on the DocVQA-TTA benchmark. Standard deviations are shown in the Appendix.

Table 3: Ablation analysis on adapting from E to F, T, and L in our DocVQA-TTA benchmark with different components, including pseudo labeling, L_MVLM, L_DIV, and the pseudo-label selection mechanism using confidence only or together with uncertainty. Standard deviations are in parentheses.

                              E → F          E → T          E → L
DocTTA, no L_MVLM             35.66 (0.46)   25.72 (0.55)   45.88 (0.34)
DocTTA, no L_DIV              34.32 (0.53)   25.17 (0.49)   46.36 (0.21)
DocTTA, no pseudo labeling    33.61 (1.65)   23.43 (0.87)   15.89 (1.35)

