A CRITICAL ANALYSIS OF OUT-OF-DISTRIBUTION DETECTION FOR DOCUMENT UNDERSTANDING

Anonymous authors. Paper under double-blind review.

Abstract

Large-scale pretraining is widely used in recent document understanding models. During deployment, one may expect that large-scale pretrained models should trigger a conservative fallback policy when encountering out-of-distribution (OOD) samples, which suggests the importance of OOD detection. However, most existing OOD detection methods focus on single-modal inputs such as images or texts. While documents are multi-modal in nature, it remains underexplored whether and how the multi-modal information in documents can be exploited for OOD detection. In this work, we first provide a systematic and in-depth analysis of OOD detection for document understanding models. We study the effects of model modality, pretraining, and finetuning across various types of OOD inputs. In particular, we find that spatial information is critical for document OOD detection. To better exploit spatial information, we propose a simple yet effective spatial-aware adapter, which serves as an add-on module to adapt transformer-based language models to the document domain. Extensive experiments show that our method consistently improves ID accuracy and OOD detection performance compared to baselines. We hope our findings can inspire future work on understanding OOD robustness for documents.

1. INTRODUCTION

The recent success of large-scale pretrained models has led to the widespread deployment of deep models in various applications. In the document domain, model predictions are increasingly used to help humans make decisions in important applications ranging from tax form processing to machine-assisted medical report analysis and financial form analysis. However, in most cases, models are pretrained on collected data but are then deployed in an environment with a different distribution over the observed data (Cui et al., 2021). For example, with the outbreak of COVID-19 (Velavan & Meyer, 2020), machine-assisted medical document analysis systems have to face continually changing data distributions. This motivates the need for reliable methods in the document domain to detect out-of-distribution (OOD) inputs. The goal of OOD detection is to categorize in-distribution (ID) test samples into one of the known categories and to detect instances that do not belong to any known class (Huang & Li, 2021; Bendale & Boult, 2016). Generally, a model is optimized on a particular task (e.g., image classification (Deng et al., 2009)), and a companion OOD detector is built as a safeguard for the classifier. Recently, large-scale pretrained models have demonstrated promising results in multiple domains (Dosovitskiy et al., 2021; Hendrycks et al., 2020), as pretraining enables models to learn powerful and transferable feature representations (Radford et al., 2021). In particular, models obtained by finetuning large-scale pretrained models are significantly better at OOD detection, even with a simple distance metric (Lee et al., 2018; Radford et al., 2021). It remains underexplored whether existing OOD detection methods that demonstrate success on images or text can be naturally extended to documents.
The main challenges in document OOD detection stem from the fact that document understanding is inherently multi-modal; thus, it is suboptimal to rely on a single modality.

2. RELATED WORK

2.1. DOCUMENT PRETRAINED MODELS

LayoutLM (Xu et al., 2020) takes words and word bounding boxes as inputs during pretraining and finetuning. LayoutLMv2 (Xu et al., 2021b) improves on the prior work by including an image encoder in pretraining and training the modalities jointly. Like LayoutLMv2, DocFormer (Appalaraju et al., 2021) also adopts a CNN model to extract image grid features. It fuses the spatial information as an inductive bias for the self-attention module. The latest version, LayoutLMv3 (Huang et al., 2022), shares similar ideas with LayoutLMv2 and further enhances the visual and spatial characteristics by introducing two additional tasks: masked image modeling and word-patch alignment. Another line of work on document pretraining focuses on different granularities of document images and takes region-level text blocks as the basic input elements, such as SelfDoc (Li et al., 2021) and UDoc (Gu et al., 2021). The pretraining tasks of SelfDoc and UDoc are based on the feature space. They adopt a cross-modal encoder to model the relationship between visual and textual features. Instead of using the spatial information at the input layer, SelfDoc and UDoc encode the 2D spatial information with a linear mapping and fuse the position embeddings at the output layer of the image encoder and sentence encoder. Despite the promising performance of these pretrained models on downstream applications, it remains largely underexplored whether recent document pretraining models are robust to various types of OOD data, what roles pretraining and finetuning play, and what the key factors for document OOD detection are.

2.2. OUT-OF-DISTRIBUTION DETECTION

Many OOD detection methods have been proposed for deep models, including generative model-based methods (Ge et al., 2017; Oza & Patel, 2019; Nalisnick et al., 2019; Ren et al., 2019; Xiao et al., 2020; Morteza & Li, 2022) and discriminative model-based methods. For the latter category, an OOD scoring function can be derived from the softmax output or logit space (Liu et al., 2020; Hsu et al., 2020; Huang & Li, 2021; Liang et al., 2018; Sun et al., 2021), gradient information (Huang et al., 2021), or the feature space (Sastry & Oore, 2020; Sehwag et al., 2021; Winkens et al., 2020; Sun et al., 2022) of a classifier. Despite their impressive performance, most of these scores are developed for natural images and text inputs. A recent work (Larson et al., 2022) studies OOD detection performance for documents, but only explores a limited number of models and OOD detection methods. Furthermore, the relationship between pretraining, finetuning, and spatial information is underexplored. In this work, we provide a finer-grained and comprehensive analysis and hope to shed light on the key factors of OOD robustness for documents.

Notations We denote the input and label space by $\mathcal{X}^{in}$ and $\mathcal{Y}^{in} = \{1, \ldots, K\}$, respectively. Let $\mathcal{D}^{in} = \{(x_i^{in}, y_i^{in})\}_{i=1}^{N}$ denote an ID dataset, where $x^{in} \in \mathcal{X}^{in}$ is the input feature vector and $y^{in} \in \mathcal{Y}^{in}$ is the semantic label for $K$-way classification. Let $\mathcal{D}^{out} = \{(x_i^{out}, y_i^{out})\}_{i=1}^{M}$ denote an OOD test set, where $y^{out} \in \mathcal{Y}^{out}$ and $\mathcal{Y}^{out} \cap \mathcal{Y}^{in} = \emptyset$. OOD detection can be formulated as a binary classification problem that aims to distinguish between ID and OOD data. We express the neural network model $f := g \circ h$ as the composition of a feature extractor $h : \mathcal{X}^{in} \to \mathbb{R}^d$ and a classifier $g : \mathbb{R}^d \to \mathbb{R}^K$, which maps the feature embedding of an input to $K$ real-valued numbers known as logits.
During inference, OOD detection can be performed with a thresholding mechanism $G_\gamma(x) = \mathbb{1}\{S(x) \geq \gamma\}$, where by convention samples with higher scores $S(x)$ are classified as ID and vice versa. The threshold $\gamma$ is typically chosen so that a high fraction of ID data (e.g., 95%) is correctly classified. We group OOD detection methods into two major categories: logit-based scores are derived from the logit layer of the model, while distance-based methods operate directly on the feature embedding layer, as shown in Fig. 1. We describe a few popular OOD detection methods for each category as follows.
• Logit-based: The Maximum Softmax Probability (MSP) score (Hendrycks & Gimpel, 2017), $S_{\mathrm{MSP}} = \max_{i \in [K]} e^{f_i(x)} / \sum_{j=1}^{K} e^{f_j(x)}$, arises naturally as a classic baseline since logits can be converted to a categorical distribution $p(y|x)$. The Energy score (Liu et al., 2020), $S_{\mathrm{Energy}} = \log \sum_{i \in [K]} e^{f_i(x)}$, utilizes the Helmholtz free energy of the data and theoretically aligns with the logarithm of the ID density. The MaxLogit score (Hendrycks et al., 2022), $S_{\mathrm{MaxLogit}} = \max_{i \in [K]} f_i(x)$, removes the softmax in MSP and has recently demonstrated promising performance on large-scale natural image datasets.
• Distance-based: Distance-based methods directly leverage feature embeddings $h(x)$, based on the idea that OOD inputs lie relatively far from ID centroids or prototypes. Depending on the distributional assumption on the feature embeddings, such methods are either 1) parametric, such as the Mahalanobis score (Lee et al., 2018; Sehwag et al., 2021), which assumes ID embeddings follow class-conditional Gaussian distributions and uses the Mahalanobis distance from the ID centroids as the distance metric; or 2) non-parametric, such as KNN+ (Sun et al., 2022), which uses cosine similarity as the distance metric.
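As a concrete illustration, the scores above can be sketched in a few lines of plain Python (a minimal single-example sketch; practical implementations operate on batched tensors, and the KNN score below uses the negative cosine distance to the k-th nearest ID training embedding, as in KNN+):

```python
import math

def msp_score(logits):
    """Maximum Softmax Probability: the largest softmax probability over K classes."""
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def energy_score(logits):
    """Energy score: log-sum-exp of the logits (higher means more ID-like)."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def maxlogit_score(logits):
    """MaxLogit score: the largest raw logit, i.e. MSP without the softmax."""
    return max(logits)

def knn_score(feature, id_features, k=5):
    """Non-parametric distance score: negative cosine distance to the k-th
    nearest ID embedding, so that higher scores again indicate ID."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    distances = sorted(1.0 - cosine(feature, f) for f in id_features)
    return -distances[k - 1]
```

With any of these scores, the detector is simply the thresholding rule $G_\gamma(x) = \mathbb{1}\{S(x) \geq \gamma\}$ described above.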
Evaluation Metrics To evaluate OOD detection performance, we adopt two commonly used metrics (Hendrycks & Gimpel, 2017) : Area Under the Receiver Operating Characteristic (AUROC) and False Positive Rate at 95% Recall (FPR95). For ID test sets, we report Accuracy (Acc), F1 score, and Mean Average Precision (mAP).
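Both detection metrics can be computed directly from the score lists of ID and OOD test samples. A minimal sketch, assuming the convention above that higher scores indicate ID:

```python
def auroc(id_scores, ood_scores):
    """AUROC: probability that a random ID sample scores higher than a random
    OOD sample (ties count one half). O(n*m) pairwise version, for clarity."""
    wins = sum((i > o) + 0.5 * (i == o) for i in id_scores for o in ood_scores)
    return wins / (len(id_scores) * len(ood_scores))

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples still scored as ID when the threshold
    is set so that 95% of ID samples are correctly classified."""
    threshold = sorted(id_scores, reverse=True)[int(0.95 * len(id_scores)) - 1]
    return sum(o >= threshold for o in ood_scores) / len(ood_scores)
```

Library implementations (e.g., scikit-learn's `roc_auc_score`) compute AUROC from sorted ranks in O((n+m) log(n+m)); the pairwise form above is only meant to make the definition explicit.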

3. ANALYZING OOD ROBUSTNESS FOR DOCUMENT MODELS

In this section, we consider the task of document classification, where models are expected to classify documents into categories such as scientific papers, resumes, etc. However, it is underexplored whether models are robust to OOD samples at test time. Most document classification datasets exist in the form of images (Harley et al., 2015). Usually, the first step is to pass the image through an OCR system to obtain a set of text blocks along with their coordinates in the image. Given the input image, extracted words, and coordinates, models can utilize single-modal or multi-modal information to classify the document. Models Fig. 2(a) shows common structures for document image pretraining and classification models. According to the input modalities, we categorize them into the following groups: (1) Vision-based: Since current document datasets exist as images, we can treat document classification as a standard image classification problem. In our experiments, we consider ResNet-50 (He et al., 2016) and ViT (Fort et al., 2021) as exemplar document image classification models. As for pretrained weights, we consider two settings: pretrained on ImageNet (Deng et al., 2009), and further pretrained on IIT-CDIP (Lewis et al., 2006). We adopt masked image modeling (MIM) for image pretraining with a mask ratio of 0.6. Note that the document classification dataset we use in this paper, RVL-CDIP, is a subset of IIT-CDIP. Hence, unless otherwise specified, the IIT-CDIP pretraining data used in this paper excludes RVL-CDIP.


(2) Text-based: Alternatively, we can cast the task as a text classification problem, since documents typically contain words. In our experiments, we consider RoBERTa (Liu et al., 2019) as the backbone and append a classifier for finetuning. Since some documents, such as scientific papers, contain more than 512 tokens, we also consider Longformer (Beltagy et al., 2020), which can handle a maximum of 4,096 input tokens. Similar to the vision-based models, we further pretrain the language models with masked language modeling (MLM) on the IIT-CDIP extracted text corpus. (3) Text+Spatial: Layout information plays a crucial role in the document domain. As shown in Fig. 3, a document is composed of words or images with some specific layout. To investigate the effect of layout information, we adopt LayoutLM as a baseline. It is specifically designed for documents and trained on the full IIT-CDIP data. Inspired by the promising OOD detection performance of spatial-aware models (Sec. 3.3) and the recent advances in adapter-based transformers (Pfeiffer et al., 2020), we propose a new spatial-aware adapter, a small learned module that can be inserted within a pretrained language model. Besides its simplicity, our adapter is competitive for both ID classification and OOD detection (Sec. 3.4). (4) Visual+Textual+Spatial: Current state-of-the-art methods tailored to documents consider various input granularities and modalities, and utilize textual, visual, and spatial information for document tasks. Despite their promising performance, such models are large in size and computationally heavy. We select two representative models to evaluate: LayoutLMv3 and UDoc. For a fair comparison, both models are pretrained on the full IIT-CDIP. Constructing ID and OOD Datasets We construct ID datasets from RVL-CDIP (Harley et al., 2015). Specifically, we designate 12 out of the 16 classes as ID classes.
For OOD datasets, we consider two scenarios: (1) In-domain OOD: To determine the OOD categories, we extensively analyze the performance of recent document classification models. Fig. 2(b) shows a detailed comparison of per-category accuracy on the RVL-CDIP test set. Naturally, for the classes a model performs poorly on, we may expect it to detect such inputs as OOD instead of assigning a specific ID class with low confidence. We observe that the 4 categories letter, form, scientific report, and presentation result in the worst performance across most models regardless of modality, so we use them as OOD categories and construct the OOD datasets accordingly. The ID dataset is constructed from the remaining 12 categories. We refer to these OOD datasets as in-domain, as they are also constructed from RVL-CDIP. (2) Out-domain OOD: In the open-world setting, test inputs can have significantly different color schemes and layouts compared to ID samples. To mimic such scenarios, we use two public datasets as the out-domain OOD test sets. Specifically, the NJU-Fudan Paper-Poster Dataset (Qiang et al., 2019) contains scientific posters in digital PDF format, from which we extract the document contents. CORD (Park et al., 2019) is a receipt understanding dataset whose inputs differ significantly from those in RVL-CDIP. As shown in Fig. 3, document images in CORD are receipt images without creases or warping, which requires the model to handle not only text but also visual and spatial information. In the following sections, we provide a detailed analysis and share insights on various aspects of OOD detection performance for document understanding models under different OOD detection methods. Further details on the setup are provided in Appendix A.
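Concretely, the in-domain split amounts to simple label filtering over RVL-CDIP. A minimal sketch (category names follow the public RVL-CDIP release; the helper function is ours):

```python
# The RVL-CDIP label set (16 categories), as in the public release.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement",
    "scientific report", "scientific publication", "specification",
    "file folder", "news article", "budget", "invoice",
    "presentation", "questionnaire", "resume", "memo",
]

# The four hardest categories (Sec. 3) are held out as in-domain OOD.
OOD_CLASSES = {"letter", "form", "scientific report", "presentation"}
ID_CLASSES = [c for c in RVL_CDIP_CLASSES if c not in OOD_CLASSES]

def split_id_ood(samples):
    """Split (document, label) pairs into an ID subset and an in-domain OOD
    subset, remapping ID labels to contiguous indices for 12-way classification."""
    id_label = {c: i for i, c in enumerate(ID_CLASSES)}
    id_set = [(doc, id_label[lab]) for doc, lab in samples if lab in id_label]
    ood_set = [doc for doc, lab in samples if lab in OOD_CLASSES]
    return id_set, ood_set
```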

3.1. ARE PRETRAINED MODELS SUFFICIENT FOR OOD DETECTION?

As shown in Sec. 2.1, most document processing models deployed in the real world are pretrained on a large-scale dataset. Naturally, one may expect pretrained models to be robust to OOD data when equipped with competitive OOD detection methods. To better understand the role of pretraining, we first provide more nuanced discussions on the following questions: 1) Are models equally robust to in-domain and out-domain OOD inputs? 2) How does model modality impact OOD detection performance? We consider a wide range of models pretrained on pure-text/image data (e.g., ImageNet, Wikipedia, etc.). A detailed description of these models can be found in Appendix A.1.2. During finetuning, we combine the pretrained model with a classifier and finetune on RVL-CDIP (ID). For models before and after finetuning, we extract the final feature outputs as the feature embeddings and use the same KNN+ score (Sun et al., 2022) for OOD detection. The results are shown in Figure 4. We observe the following trends. First, finetuning largely improves OOD detection performance for both in-domain and out-domain OOD data. Pretrained models, despite having "seen" a diverse collection of data during pretraining, do not yield sufficient OOD robustness. The same trend holds broadly across models with different modalities. Second, the improvement from finetuning is less significant for out-domain OOD data. For example, the AUROC on Receipt (out-domain OOD) for the pretrained ViT model is 97.13, and finetuning improves it by only 0.79%. This suggests that pretrained models do have the potential to separate data from different domains due to the diversity of data used for pretraining, while it remains hard for pretrained models to perform finer-grained separation for in-domain OOD inputs. Therefore, finetuning is beneficial for improving both types of OOD detection performance as a consequence of improved feature representations.
Figure 5: The impact of pretraining data on zero-shot OOD detection performance. IIT-CDIP⁻ denotes the filtered pretraining data after removing the "OOD" categories. To make the analysis more thorough, we add two additional in-domain OOD settings: (1) selecting the classes the model performs well on as in-domain OOD categories; (2) randomly selecting classes as OOD categories. As shown in the Appendix (Table 10 and Table 11), finetuning also improves both types of OOD detection, which further reaffirms our conclusion. We also visualize the optimal transport dataset distance (OTDD) (Alvarez-Melis & Fusi, 2020) between in-domain and out-domain OOD datasets in the Appendix (Fig. 10(b) and Fig. 10(c)). Please refer to the Appendix for more details.

3.2. THE IMPACT OF PRETRAINING DATA ON ZERO-SHOT OOD DETECTION

In the previous section, we analyzed the impact of finetuning on OOD detection with the pretraining dataset fixed and unrelated to documents. Next, we dive deeper and study the impact of the pretraining dataset on zero-shot OOD detection. For each model, we adopt the same pretraining objective while adjusting the amount of pretraining data. Specifically, we increase the data diversity by appending 10%, 20%, 40%, and 100% of randomly sampled data from the IIT-CDIP dataset (around 11M documents) and pretrain each model. After pretraining, we measure OOD detection performance with the KNN+ score based on feature embeddings. For out-domain OOD data (Fig. 5, right), increasing the amount of pretraining data significantly improves zero-shot OOD detection performance (without finetuning) for models across different modalities. This further verifies our previous hypothesis that pretraining with diverse data is beneficial for coarse-grained OOD detection, such as inputs from different domains (e.g., color schemes). On the other hand, for in-domain OOD inputs, even increasing the amount of pretraining data by over 40% provides negligible improvements (Fig. 5, left). This also suggests the necessity of finetuning for improving in-domain OOD detection. We further explore zero-shot OOD detection by removing the potential OOD categories from IIT-CDIP. In practice, we first adopt LayoutLMv1 finetuned on RVL-CDIP as the classifier to predict labels for all IIT-CDIP document images. Fig. 5(b) shows the distribution of the predicted classes on IIT-CDIP. Next, we remove the "OOD" categories from the IIT-CDIP data and pretrain two models (RoBERTa and LayoutLMv1) with 10%, 20%, 40%, and 100% of randomly sampled data from the filtered IIT-CDIP. Fig. 5(c) shows the zero-shot OOD performance. Note that we do not show 0% in Fig. 5(c) since we pretrain LayoutLMv1 from scratch. For RoBERTa, we start from the public pretrained model and see a similar trend in Fig. 5(c): the influence of pretraining on these well-pretrained language models is minor for in-domain OOD detection, since there is a considerable gap between OCR words and pure-text data; e.g., words in a document are spatially arranged, while words in a text corpus are arranged sequentially. In contrast, pretraining data has a bigger impact on models trained from scratch: the zero-shot performance of LayoutLMv1 increases as more pretraining data is added. We provide more details in the Appendix (Table 4 and Table 5).
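The filtering step above can be sketched as follows (a simplified sketch: `predicted_labels` stands in for the per-document predictions of the RVL-CDIP-finetuned classifier, and the function name is ours):

```python
import random

def filter_and_sample(docs, predicted_labels, ood_classes, fraction, seed=0):
    """Remove documents whose *predicted* class is one of the held-out OOD
    categories, then sample a fraction of the remainder for pretraining.
    `predicted_labels` is a parallel list of class names, one per document."""
    kept = [d for d, y in zip(docs, predicted_labels) if y not in ood_classes]
    rng = random.Random(seed)            # fixed seed for reproducible subsets
    return rng.sample(kept, int(fraction * len(kept)))
```

Running this with fractions 0.1, 0.2, 0.4, and 1.0 reproduces the 10/20/40/100% pretraining subsets used in Fig. 5.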

3.3. INVESTIGATING SPATIAL-AWARE MODELS FOR OOD DETECTION

In previous sections, we mainly focused on mainstream text-based and vision-based models to analyze the effects of finetuning and pretraining on in-domain and out-domain OOD detection. Next, we largely expand the scope of our study by incorporating models tailored to document processing, which we refer to as spatial-aware models, such as LayoutLM, LayoutLMv3, and UDoc. Moreover, given finetuned models, we are able to compare the performance of logit-based scores and distance-based scores. Some key comparisons are shown in Figure 6. Please refer to the Appendix for full results. Spatial-aware models demonstrate stronger OOD detection performance for both in-domain and out-domain OOD. For example, with the best scoring function (KNN+), LayoutLMv3 improves the average AUROC over RoBERTa by 7.09% for out-domain OOD and 7.54% for in-domain OOD. This significant improvement suggests the value of spatial and visual information in improving OOD robustness for document data. Note that although this paper mainly compares logit-based and distance-based scores, gradient-based scores have also been proposed for OOD detection. We report the GradNorm (Huang et al., 2021) OOD detection score in Appendix A.3, which achieves performance similar to logit-based scores.

3.4. TOWARDS SIMPLE AND EFFECTIVE SPATIAL-AWARE ADAPTORS

Spatial-aware models tailored for documents, such as LayoutLM, rely on spatial information and demonstrate superior OOD detection performance. This raises a question: given the abundance of large-scale language models well-pretrained on text data, such as RoBERTa, is there a simple and effective method that allows us to adapt a pretrained language model to document inputs for effective OOD detection? Next, we show that by enhancing transformer-based pretrained models with a spatial-aware adapter module, we can achieve good performance with minimal code edits. Spatial-aware adapter Given a public pretrained RoBERTa model, depending on the position of the adapter, we consider two architectures: 1) appending the adapter to the word embedding layer, denoted as Spatial-RoBERTa (pre); it requires pretraining and finetuning, as illustrated in the top row of Fig. 7; 2) appending the adapter to the final layer, denoted as Spatial-RoBERTa (post); as this model can reuse the pretrained textual encoder, it only requires finetuning (illustrated in the bottom row). In the following, we only discuss Spatial-RoBERTa (pre); full results for both Spatial-RoBERTa variants are in the Appendix. We freeze the word embedding layer during pretraining for the following reasons: 1) word embeddings learned from a large-scale corpus already cover most of the words in documents; 2) pretraining on documents without strong language dependency may not help improve word embeddings. For example, in semi-structured documents (e.g., forms, receipts), language dependencies may not be as strong as in rich-text documents (e.g., letters, resumes), which could degrade the learned word representations. In practice, each word has a normalized bounding box (x0, y0, x1, y1), where (x0, y0) / (x1, y1) corresponds to the upper-left / lower-right corner of the bounding box. Each coordinate is fed into an embedding layer to produce a position embedding.
All position embeddings are added to the initial word embedding to form a new spatial-aware embedding.
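A minimal sketch of this spatial-aware embedding, using toy dimensions and plain-Python lookup tables in place of learned embedding layers (the bucketed discretization of normalized coordinates is our illustrative assumption):

```python
import random

DIM, BUCKETS = 8, 16          # toy sizes; real models use e.g. 768-dim embeddings
rng = random.Random(0)

# One embedding table per coordinate (x0, y0, x1, y1); in the real adapter
# these are learned during pretraining while word embeddings stay frozen.
coord_tables = [
    [[rng.gauss(0.0, 0.02) for _ in range(DIM)] for _ in range(BUCKETS)]
    for _ in range(4)
]

def spatial_embedding(word_emb, bbox):
    """Add the four per-coordinate position embeddings to a (frozen) word
    embedding. `bbox` = (x0, y0, x1, y1), each coordinate normalized to [0, 1]."""
    out = list(word_emb)                      # copy: do not mutate the input
    for table, coord in zip(coord_tables, bbox):
        bucket = min(int(coord * BUCKETS), BUCKETS - 1)   # discretize coordinate
        for i, v in enumerate(table[bucket]):
            out[i] += v
    return out
```

Because the output has the same dimensionality as the word embedding, the module can be inserted before the transformer layers without any other architectural change, which is what keeps the adapter a minimal code edit.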

Spatial-RoBERTa significantly outperforms RoBERTa

To verify the effectiveness of Spatial-RoBERTa, we compare the OOD detection performance of pretrained and finetuned models. The results are shown in Fig. 8. Spatial-RoBERTa significantly improves OOD detection performance, especially after finetuning. Specifically, compared to RoBERTa, Spatial-RoBERTa improves AUROC by 9.20% for in-domain OOD and 4.95% for out-domain OOD data. This further verifies the importance of spatial awareness for OOD detection in the document domain.


Figure 8: OOD detection performance of Spatial-RoBERTa and RoBERTa, pretrained and finetuned. All models are initialized with public checkpoints pretrained on pure-text data and further pretrained on IIT-CDIP with the same pretraining tasks. The only difference is that Spatial-RoBERTa has an additional spatial-aware adapter and takes word bounding boxes as additional inputs. Spatial-RoBERTa is competitive for both ID classification and OOD detection Beyond OOD detection performance, we also examine ID classification accuracy and plot the two metrics for all models with different modalities in Fig. 9. We find a positive correlation between ID accuracy and OOD detection performance (measured by AUROC) for both in-domain and out-domain OOD. Moreover, spatial-aware models display superior ID accuracy and OOD robustness compared to text-only and vision-only models. Finally, our Spatial-RoBERTa provides a simple and effective solution that greatly improves upon RoBERTa and matches the performance of models with specialized architectures such as LayoutLM. Specifically, Spatial-RoBERTa Large achieves 97.37 ID accuracy, which is even higher than LayoutLM (97.28) and UDoc (97.36). Furthermore, since Spatial-RoBERTa Base and Spatial-RoBERTa Large freeze the word embedding layer during pretraining, they can learn spatial-aware features targeting document data while keeping the word embeddings fixed, thus reducing the number of trainable parameters and the training cost.

4. CONCLUSION AND OUTLOOK

This paper presents a large-scale study of methods for quantifying OOD robustness across different data modalities and models in the document domain. Our key contributions include a large-scale investigation of OOD robustness in the document domain and a simple yet powerful spatial-aware adapter for transformer-based language models. We start from document classification and explore pretrained models for document OOD robustness. Substantial experiments across different settings demonstrate that pretraining datasets and tasks greatly impact OOD detection performance. Notably, OOD samples in the document domain are easier to identify in the feature space than in the logit space. Investigations from various perspectives explain certain intriguing phenomena and, we hope, inspire more research on evaluating OOD robustness towards more reliable document understanding models.

A APPENDIX

A.1 DOCUMENT CLASSIFICATION

A.1.1 DATASETS

RVL-CDIP consists of 320K/40K/40K training/validation/testing images under 16 categories. We select 12 categories and treat them as ID (in-distribution) data. We extract the text and layout information with the Google OCR engine, which provides both tokens and text blocks along with their corresponding bounding boxes. Most recent models take the full IIT-CDIP as pretraining data and finetune on RVL-CDIP. However, this is not appropriate for the OOD setting, since RVL-CDIP is a subset of IIT-CDIP. To make the OOD results more reliable, in our experiments we exclude RVL-CDIP from IIT-CDIP during pretraining. We measure the distance between in-domain and out-domain datasets via OTDD. We first visualize the OTDD distance between the ID data and the OOD data (the in-domain and out-domain datasets in our main paper) in Fig. 10(a). During analysis, we sample at most 1,000 images from each dataset and calculate the distance between datasets. It can be seen that there is a clear gap between in-domain data and out-domain data. To make the analysis more thorough, we add two additional in-domain OOD settings: (1) selecting the classes the model performs well on as OOD data; (2) randomly selecting classes as OOD data. Fig. 10(a) and Fig. 10(c) show the dataset distances. For these two other selection strategies, we can see that the domain gap is not as clear as for the subset we selected for the main experiments. Interestingly, for rare-word documents, such as file folders and advertisements, the dataset distances are larger than for documents with rich text; the background colors and layouts may account for the large distinction for those documents. Models are finetuned for 30 epochs with a batch size of 50 and a learning rate of 2 × 10- on 8 A100 GPUs. We list the hyperparameters of the models used in our paper as follows. We also visualize score distributions for inputs derived from finetuned ResNet-50, RoBERTa, and LayoutLMv3.
The KNN scores calculated from both vision and language models naturally form smooth distributions. In contrast, the MSP and Mahalanobis scores for both in- and out-of-distribution data concentrate on high values. Overall, our experiments show that using the feature space makes the scores more distinguishable between in- and out-of-distribution data and, as a result, enables more effective OOD detection.

Language-only:

(1) BERT and RoBERTa. We adopt RoBERTa Base (12 layers, 768 hidden size) and BERT Base (12 layers, 768 hidden size) as backbones and set the maximum sequence length to 512. For RoBERTa, the classifier is composed of two linear layers followed by a tanh activation function. (2) Longformer. We also adopt Longformer Base (12 layers, 768 hidden size) as the backbone and set the maximum sequence length to 4,096.

A.2.1 DATASETS

Document Entity Recognition The original FUNSD (Jaume et al., 2019) contains entities labeled as question, answer, header, and other. The original PubLayNet contains the categories (text, title, list, figure, and table). The original IIIT-AR-13K (Mondal et al., 2020) contains (table, figure, natural image, logo, and signature). In our paper, considering the overlap between IIIT-AR-13K and PubLayNet, we select the images that contain natural image annotations as the OOD test set. After filtering, we have 2,880 OOD entities across 1,837 document images. We consider three ID datasets in this experiment. (1) PubLayNet: This is the original PubLayNet dataset; we treat all the entities in training/validation images as ID entities. (2) PubLayNet + IIIT-AR-13K: Considering the domain shift between the ID data (PubLayNet) and the OOD data (IIIT-AR-13K), we combine the PubLayNet training data with the images from IIIT-AR-13K with overlapping annotations (table and figure) and train the object detection model.

A.2.2 MODELS

Document Entity Recognition Fig. 13 illustrates the entity recognition models used in this paper. We model entities at the region level instead of the token level, since regions contain richer semantic information. As the pretrained model, we adopt UDoc (pretrained on IIT-CDIP), since it models inputs at the region level. Based on the UDoc framework, we develop the following models. Vision+Language+Layout: (1) ResNet-50 + Sentence BERT: This model follows the same framework as UDoc but replaces the sentence encoder in the original design with a smaller sentence encoder (all-MiniLM-L6-v2). (2) SwinT + Sentence BERT: This model replaces the ResNet-50 visual backbone with a pretrained Swin tiny model (swin-tiny-patch4-window7-224) from Huggingface. All the models are finetuned with cross-entropy loss for 100 epochs with a learning rate of 10⁻⁵ and a batch size of 8 on one A100 GPU. Document Object Detection Two object detection models are considered in this paper: (1) Vanilla Faster R-CNN: Faster R-CNN with a ResNet-50 visual backbone. (2) Faster R-CNN with VOS: This model enhances the above model with VOS. Following the VOS paper, we use 1,000 samples for each ID class to estimate the class-conditional Gaussians. We train detection models with the Detectron2 framework (Wu et al., 2019). Models are trained for 180k iterations with a base learning rate of 0.01 and a batch size of 8. Mean average precision (mAP) at intersection over union (IoU) [0.50:0.95] of bounding boxes is used to measure performance.

A.2.3 EXPERIMENTAL SETUP

For document entity recognition, we construct ID and OOD datasets from FUNSD. Each semantic entity includes a list of words, a label, and a bounding box. The standard label set for this dataset contains four categories: question, answer, header, and other. In this paper, we select the entities labeled as other or as header as OOD; the entities belonging to the remaining three categories are ID. Instead of treating entity recognition as a named-entity recognition problem, we follow UDoc and solve it at the semantic region level. We replace the sentence encoder in UDoc with a smaller sentence encoder (all-MiniLM-L6-v2) from Huggingface (Wolf et al., 2019). We also train the following model variants to verify the effectiveness of each modality combination: textual-only, visual-only, textual+spatial, visual+spatial, and visual+textual+spatial. For document object detection, we use PubLayNet as the ID dataset and construct the OOD dataset from IIIT-AR-13K. Unlike PubLayNet, where the documents are scientific articles, IIIT-AR-13K is a dataset for graphical object detection in business documents (e.g., annual reports); hence there is an obvious domain gap between the two datasets. We select natural images as the OOD entity and keep only the images that contain this OOD entity. We first adopt vanilla Faster RCNN with a ResNet-50 backbone as the baseline document object detection model. We also enhance Faster RCNN with VOS (Du et al., 2022), a recent unknown-aware learning framework that improves OOD detection performance on natural images.
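The FUNSD ID/OOD split described above amounts to partitioning entities by label. A minimal sketch, assuming entities have already been parsed into records with `words`, `label`, and `bbox` fields (the record layout and function name are ours):

```python
# Labels for the setting where "other" is treated as OOD;
# the ID pool then holds question, answer, and header entities.
ID_LABELS = {"question", "answer", "header"}
OOD_LABELS = {"other"}

def split_entities(entities):
    """Partition FUNSD semantic entities into ID and OOD pools by label."""
    id_pool = [e for e in entities if e["label"] in ID_LABELS]
    ood_pool = [e for e in entities if e["label"] in OOD_LABELS]
    return id_pool, ood_pool

# Hypothetical parsed entities for illustration.
entities = [
    {"words": ["Name:"], "label": "question", "bbox": [10, 10, 80, 24]},
    {"words": ["John"], "label": "answer", "bbox": [90, 10, 140, 24]},
    {"words": ["Note"], "label": "other", "bbox": [10, 40, 60, 54]},
]
id_pool, ood_pool = split_entities(entities)
```

Swapping `header` into `OOD_LABELS` (and `other` into `ID_LABELS`) yields the alternative split used in the paper.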

A.2.4 OBSERVATIONS

To identify the entity type, models should not only understand the words but also require spatial and visual reasoning ability. We summarize our findings on document entity recognition in Fig. 14(a) and describe them in more detail in Table 1. We can see that models predict the entity type better with the help of spatial position information and also achieve better OOD robustness. Considering the weak language dependency between entities, it is not surprising that vision-based models achieve better performance than text-based models. UDoc with ResNet-50 achieves the best performance on both OOD test sets, illustrating that visual information plays a major role in discriminating entities with similar semantics. We summarize our findings on document object detection in Fig. 14(b) and describe them in more detail in Table 2. We can see that the OOD detection performance is further improved by introducing document images from IIIT-AR-13K with the same ID annotations as training data. In Fig. 15, we visualize some document entity recognition OOD detection results. In Fig. 16, we visualize the predictions on sample OOD images, using object detection models trained without VOS (top) and with VOS (bottom), respectively. There is a clear difference between PubLayNet and IIIT-AR-13K: natural image annotations and entities rarely exist in PubLayNet. We can see that vanilla Faster RCNN trained on PubLayNet produces false positives when applied to the OOD document images from IIIT-AR-13K. After introducing the unknown-aware learning method optimized for both ID and OOD, as shown in Table 2, the FPR95 decreases while the mAP on the ID data is preserved. This experiment indicates that bringing uncertainty estimation into the entity detection training procedure can improve the reliability of the document object detection system.
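For reference, the FPR95 metric reported throughout these tables can be computed as follows. This is a standard sketch (higher score = more ID-like), not code from the paper:

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD samples scored above the threshold
    that retains 95% of ID samples (true positive rate = 0.95)."""
    id_scores = np.asarray(id_scores)
    ood_scores = np.asarray(ood_scores)
    thresh = np.percentile(id_scores, 5)   # 95% of ID scores lie above it
    return float(np.mean(ood_scores >= thresh))

# Well-separated scores give a perfect FPR95 of 0.0.
fpr = fpr_at_95_tpr(np.linspace(1, 2, 100), np.linspace(-1, 0, 100))  # → 0.0
```

A lower FPR95 means fewer OOD samples are mistaken for ID at the 95%-TPR operating point, matching the "lower is better" convention in Tables 1 and 2.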

A.3 ADDITIONAL EXPERIMENTAL RESULTS

• Table 1 corresponds to the results shown in Fig. 15 and Fig. 14(a).
• Table 2 corresponds to the results shown in Fig. 16 and Fig. 14(b).
• Table 3 and Table 7 correspond to the results shown in Fig. 5(a).
• Table 4 and Table 5 correspond to the results shown in Fig. 5(c).
• Table 6 corresponds to the results shown in Fig. 8 and Fig. 9.
• Table 9 and Table 8 correspond to the results shown in Fig. 4 and Fig. 9.
• Table 10 and Table 11 correspond to the analysis in Sec. 3.1.
• Table 12 corresponds to the results shown in Fig. 9.



See Appendix A.1.2 for further details about the models and hyperparameters.

Footnotes:
1. https://github.com/pymupdf/PyMuPDF
2. https://cloud.google.com/vision/docs/ocr
3. https://github.com/microsoft/otdd
4. https://huggingface.co/models
5. https://github.com/deeplearning-wisc/vos
6. https://huggingface.co/sentence-transformers



Figure 1: Schematic description of OOD detection for document classification. The left part shows the pretraining and finetuning pipelines. At inference time, for a given input document image, we calculate the OOD detection score according to different methods (logit-based or distance-based). The OOD detector identifies the input document as OOD if the OOD score is smaller than the threshold value γ.
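The score-and-threshold rule in Figure 1 can be sketched for both families of methods. Below is a minimal illustration: MSP as a representative logit-based score and a KNN distance as a representative distance-based score (function names and the toy threshold are ours):

```python
import numpy as np

def msp_score(logits):
    """Logit-based score: maximum softmax probability (MSP)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def knn_score(feature, train_feats, k=5):
    """Distance-based score: negative distance to the k-th nearest
    neighbour among L2-normalized ID training features."""
    f = feature / np.linalg.norm(feature)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(t - f, axis=1)
    return -np.sort(dists)[k - 1]

def is_ood(score, gamma):
    """Flag the input as OOD when its score falls below threshold γ."""
    return score < gamma

s = msp_score(np.array([[10.0, 0.0, 0.0]]))  # confident ID-like prediction
```

In practice γ is chosen on ID validation data (e.g., so that 95% of ID samples pass), which is exactly the operating point that FPR95 evaluates.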

Figure 2: (Left) Illustration of models for document pretraining and classification. Our proposed spatial-aware pretraining and finetuning models are the network architectures in green blocks. We also show the modality information on top of each architecture. (Right) Evaluating finetuning performance for document classification of pretrained models. We group models into three groups (from left to right): language-only, vision-only, and multimodal. For each group, we also present the performance of a model from another group (shown in grey) for better reference. The average accuracy for each model is shown in parentheses.

Figure 3: (Top) Examples of ID inputs sampled from RVL-CDIP. (Bottom) In-domain OOD from RVL-CDIP, and out-domain OOD from Scientific Poster and Receipts.

Figure 6: Comparison between representative feature-based scores and logit-based scores for spatial-aware and non-spatial-aware models. Spatial-aware models are colored in blue.

Figure 9: Correlation between ID accuracy and OOD detection performance. For most models, ID accuracy is positively correlated with OOD detection performance. Spatial-aware models display both higher ID accuracy and stronger OOD robustness (in AUROC).

Figure 10: Visualization of optimal transport dataset distance for ID and OOD (in-domain and out-domain) datasets. We highlight the in-domain data in blue and the out-domain in green.

Figure 11: Feature visualization for pretrained (with different numbers of pretraining data) and finetuned models. We show both In-Domain (RVL-CDIP) and Out-Domain (CORD) OOD datasets.

Figure 13: The network architectures in green blocks are our proposed models. We also show the modality information on top of each architecture.

Going beyond document classification, we explore OOD detection for two entity-level tasks: document entity recognition and document object detection. Basic units such as text, tables, and figures in a document are the objects that need to be detected and recognized. Document entity recognition aims to predict the label of each semantic entity given its bounding box, while document object detection is an object detection task on document images. Specifically, we denote the input as x and the bounding box coordinates associated with object instances in the image as b ∈ R^4, and use a model with parameters θ to model the bounding box regression p_θ(b|x) and the label classification p_θ(y|x, b). Given a test input x, the OOD detection scoring function for entity detection and recognition can be unified as S(x, b̂), where b̂ denotes an object instance predicted by the object detector. In particular, for document entity recognition, since the bounding boxes are provided, the OOD score simplifies to S(x, b), where b is the given object instance.
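The unified score S(x, b̂) is computed once per predicted (or given) box. As one concrete instantiation, here is a minimal sketch that scores each box with the energy score over its class logits; this is an illustrative choice of scoring function, not the only one the paper considers:

```python
import numpy as np

def box_ood_scores(class_logits):
    """Per-instance OOD score S(x, b̂) from each box's class logits,
    using the energy score (logsumexp over classes) as an example.
    class_logits: (num_boxes, num_classes). Higher = more ID-like."""
    m = class_logits.max(axis=-1, keepdims=True)          # stability shift
    return (m + np.log(np.exp(class_logits - m)
                       .sum(axis=-1, keepdims=True))).squeeze(-1)

# Hypothetical logits for two predicted boxes: one confident, one flat.
energy = box_ood_scores(np.array([[2.0, 1.0, 0.0],
                                  [0.0, 0.0, 0.0]]))
```

For document entity recognition the same function applies, except the boxes are the ground-truth entity regions rather than detector predictions.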

Vision/Vision+Layout: (1) ResNet-50: This model is composed of the ResNet-50 from pretrained UDoc. It adopts RoI pooling followed by a classifier to extract the entity features. (2) ResNet-50+Position: This model also adopts the UDoc-pretrained ResNet-50. It further makes the RoI features spatial-aware by adding position embeddings, which are mapped from bounding boxes via a linear mapping layer. Language/Language+Layout: (1) Sentence BERT: This model adopts the language branch of UDoc and appends the classifier to the output of the sentence encoder. (2) Sentence BERT+Position: This model is close to the above model but adds position embeddings to the sentence embeddings.
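The position-embedding mechanism shared by the +Position variants above can be sketched in a few lines: a linear layer maps each 4-d bounding box to the feature dimension and the result is added to the entity features (RoI or sentence embeddings). The module name and toy sizes are ours:

```python
import torch
import torch.nn as nn

class PositionAwareFeatures(nn.Module):
    """Add a learned position embedding, mapped from the 4-d bounding
    box via a linear layer, to per-entity features."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.box_proj = nn.Linear(4, hidden_size)

    def forward(self, entity_feats, boxes):
        # entity_feats: (N, hidden) RoI or sentence embeddings
        # boxes: (N, 4) normalized [x0, y0, x1, y1] coordinates
        return entity_feats + self.box_proj(boxes)

# Toy usage with a small hidden size for illustration.
model = PositionAwareFeatures(hidden_size=8)
feats = torch.zeros(3, 8)
boxes = torch.rand(3, 4)
out = model(feats, boxes)   # (3, 8) spatial-aware entity features
```

Because the position embedding is simply added, the same module can sit on top of either the visual or the language branch without changing the rest of the architecture.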

Figure 14: Ablation on document entity recognition and object detection. Numbers are reported in FPR95.

Figure 15: Visualization of detected OOD entities on the form images. In the top part, the entities shown in blue are annotated as other. The bottom part shows the detected OOD entities (green). We also show failure cases on the right.







The FUNSD dataset contains 149/50 training/testing images. We treat entities with category other or header as the OOD entities. After the split, if we treat other as OOD, we have 8,330/1,019 ID/OOD entities in total; if we treat header as OOD, we have 8,981/368 ID/OOD entities in total.

Comparison with different models on the FUNSD OOD setting. All models are initialized with UDoc pretrained on IIT-CDIP and finetuned on FUNSD data with ID entities. All values are percentages. A lower FPR95 or higher AUROC value indicates better performance.

Comparison with different training and detection methods.

OOD detection performance for document classification with different amounts of pretraining data from IIT-CDIP. ID (Acc) denotes the ID accuracy obtained by testing on ID test data. We report the KNN-based scores for both pretrained and finetuned models. Sci. Poster denotes the document images converted from the NJU-Fudan Paper-Poster Dataset. Receipt denotes the receipt images collected from the CORD receipt understanding dataset. For in-domain OOD test data, we also report the averaged scores.

OOD detection performance for document classification with different amounts of pretraining data from IIT-CDIP (remove pseudo-OOD categories).

OOD detection performance for document classification with different amounts of pretraining data from IIT-CDIP (remove pseudo-OOD categories).

OOD detection performance for document classification. Spatial-RoBERTa Base (Pre) denotes applying the spatial-aware adapter at the word embedding layer. Spatial-RoBERTa Base (Post) denotes applying the spatial-aware adapter at the output layer.

OOD detection performance for document classification with different amounts of pretraining data from IIT-CDIP.

OOD detection performance for document classification. Longformer 4096 denotes the original model adopted from the Huggingface model hub. Longformer 4096 (+) denotes the additional pretraining on IIT-CDIP.

OOD detection performance for document classification. All models are pretrained on ImageNet.

OOD detection performance for document classification (OOD categories selected to achieve the best performance across most of the models with different modalities).

OOD detection performance for document classification (randomly select four categories as OOD). Models are pretrained on pure-text data and finetuned on RVL-CDIP (ID).

OOD detection performance for document classification. All models are pretrained on IIT-CDIP. For the LayoutLM models, we adopt the checkpoints from the Huggingface model hub. For UDoc, we pretrain the model ourselves. All models are finetuned on RVL-CDIP ID data.

