A CRITICAL ANALYSIS OF OUT-OF-DISTRIBUTION DETECTION FOR DOCUMENT UNDERSTANDING

Anonymous authors
Paper under double-blind review

Abstract

Large-scale pretraining is widely used in recent document understanding models. During deployment, one may expect a large-scale pretrained model to trigger a conservative fallback policy when it encounters out-of-distribution (OOD) samples, which makes OOD detection important. However, most existing OOD detection methods focus on single-modal inputs such as images or text. Although documents are multi-modal in nature, it remains underexplored whether and how the multi-modal information in documents can be exploited for OOD detection. In this work, we first provide a systematic and in-depth analysis of OOD detection for document understanding models. We study the effects of model modality, pretraining, and finetuning across various types of OOD inputs. In particular, we find that spatial information is critical for document OOD detection. To better exploit spatial information, we propose a simple yet effective spatial-aware adapter, an add-on module that adapts transformer-based language models to the document domain. Extensive experiments show that our method consistently improves ID accuracy and OOD detection performance over baselines. We hope our findings inspire future work on understanding OOD robustness for documents.

1. INTRODUCTION

The recent success of large-scale pretrained models has led to the widespread deployment of deep models in various applications. In the document domain, model predictions are increasingly used to help humans make decisions in important applications, ranging from tax form processing to machine-assisted medical report analysis and financial form analytics. However, in most cases, models are pretrained on collected data but are then deployed in an environment with a different distribution over the observed data (Cui et al., 2021). For example, with the outbreak of COVID-19 (Velavan & Meyer, 2020), machine-assisted medical document analysis systems must cope with continually changing data distributions. This motivates the need for reliable methods in the document domain to detect out-of-distribution (OOD) inputs.

The goal of OOD detection is to categorize in-distribution (ID) test samples into one of the known categories and to detect instances that do not belong to any known class (Huang & Li, 2021; Bendale & Boult, 2016). Generally, a model is optimized for a particular task (e.g., image classification (Deng et al., 2009)), and a companion OOD detector is built as a safeguard for the classifier. Recently, large-scale pretrained models have demonstrated promising results in multiple domains (Dosovitskiy et al., 2021; Hendrycks et al., 2020), as pretraining enables models to learn powerful and transferable feature representations (Radford et al., 2021). In particular, models obtained by finetuning large-scale pretrained models are significantly better at OOD detection, even with a simple distance metric (Lee et al., 2018; Radford et al., 2021). It remains underexplored whether existing OOD detection methods that succeed on images or text can be naturally extended to documents.
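To make the classifier-plus-detector paradigm above concrete, here is a minimal sketch of the widely used maximum-softmax-probability (MSP) baseline: the classifier's confidence serves as the OOD score. Function names and the threshold are illustrative choices of ours, not details from this paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    # Maximum softmax probability: high for confident (likely ID) inputs.
    return softmax(logits).max(axis=-1)

def detect_ood(logits, threshold=0.7):
    # Flag inputs whose confidence falls below the threshold as OOD.
    return msp_score(logits) < threshold

# A confident ID-like prediction vs. a flat, OOD-like one.
logits = np.array([[8.0, 0.5, 0.2],
                   [1.1, 1.0, 0.9]])
print(detect_ood(logits))  # → [False  True]
```

In practice the threshold is calibrated on held-out ID data (e.g., to fix the ID false-positive rate at 5%), and the same recipe applies to any score, not just MSP.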
The main challenge of document OOD detection stems from the fact that document understanding is inherently multi-modal; relying on a single modality is therefore suboptimal. The majority of recent OOD detection approaches focus on single-modal learning (Hsu et al., 2020; Zhou et al., 2021; Xu et al., 2021a; Jin et al., 2022) and are not compatible with document understanding tasks, which require multi-modal learning. The spatial relationships among text blocks further differentiate document multi-modal learning from multi-modal learning in the vision-language domain (Lu et al., 2019; Li et al., 2020). In addition, recent document pretraining methods have demonstrated remarkable performance on various downstream document understanding tasks (Xu et al., 2020; 2021b; Huang et al., 2022; Li et al., 2021; Cui et al., 2021; Hong et al., 2022; Gu et al., 2022; Wang et al., 2022). However, existing pretraining datasets for documents are limited and lack diversity, in sharp contrast to common pretraining datasets for natural images. Therefore, it is not obvious which OOD detection methods are reliable in the document domain, or how pretraining impacts OOD robustness.

This paper investigates OOD robustness in the document domain through the following questions: (1) Are pretrained models robust to OOD examples? Is further pretraining beneficial? How do the pretraining data and tasks affect performance? (2) How does multimodality (textual, visual, and spatial) affect OOD robustness? (3) Do existing OOD detection methods developed for natural images and text transfer to documents? We present a large-scale evaluation of recent approaches, focusing on models pretrained on different data types and evaluating them on a diverse range of document understanding benchmarks across visual, textual, and spatial modalities.
Our key contributions are summarized as follows:

• We show that pretraining datasets and tasks significantly impact OOD detection performance. Through extensive pretraining and finetuning experiments, we find that higher finetuning performance on ID data does not usually translate into better performance on OOD data. This observation emphasizes the importance of considering metrics beyond ID performance when measuring model reliability.

• We propose a spatial-aware adapter, which serves as an add-on module for transformer-based models and learns spatial-aware representations. Our method easily transfers pretrained language models to the document domain. Extensive experiments show that our method consistently improves ID accuracy and OOD detection performance across a broad spectrum of datasets.

• We show that recent conclusions drawn from OOD detection methods are valid for images and text but do not always transfer to documents. For a wide range of document models, we observe that OOD samples are easier to identify in the feature space than in the logit space.

The rest of the paper is organized as follows. Sec. 2 provides the preliminaries and related works. Sec. 3 presents a comprehensive analysis of OOD robustness for document models, and Sec. 4 concludes.
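The feature-space detection mentioned in the last contribution can be illustrated with a distance-to-nearest-class-mean score — a simplified Euclidean stand-in for the Mahalanobis-style detector of Lee et al. (2018). All names and the synthetic data below are ours, for illustration only.

```python
import numpy as np

def class_means(features, labels):
    # Mean feature vector per known class, estimated on ID training data.
    return np.stack([features[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def feature_ood_score(x, means):
    # Distance to the nearest class mean; larger = more OOD-like.
    return np.linalg.norm(means - x, axis=1).min()

# Two well-separated ID classes in a toy 4-d feature space.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0.0, 0.1, (50, 4)),
                        rng.normal(3.0, 0.1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
means = class_means(feats, labels)

id_sample = np.full(4, 3.0)    # lies near class 1
ood_sample = np.full(4, 10.0)  # far from both classes
print(feature_ood_score(id_sample, means)
      < feature_ood_score(ood_sample, means))  # → True
```

A logit-space score such as MSP and a feature-space score such as this one can be computed from the same forward pass, which is what makes the comparison in the third bullet cheap to run.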

2. PRELIMINARIES AND RELATED WORKS

2.1 DOCUMENT MODELS AND PRETRAINING

Large-scale pretrained models have attracted a lot of attention in the document domain. In vision and natural language processing (NLP) tasks, pretraining has shown great success in producing generic representations learned from large-scale unlabeled corpora (Devlin et al., 2018; Lu et al., 2019; Su et al., 2019; He et al., 2020). Document pretraining likewise seeks universal representations suitable for any downstream task. However, the unique characteristics of document images distinguish document pretraining from prior work in the vision and language domains. In documents, content is spatially distributed, and visual and textual information co-occur within semantic regions. In contrast, inputs in the language domain are pure text, and inputs in the vision-language domain are image-text pairs. Recent document pretraining models differ in architecture and objectives, as depicted in Fig. 2. LayoutLM (Xu et al., 2020) extends BERT to learn contextualized word representations for document images through multi-task learning. It takes a sequence of Optical Character Recognition (OCR) (Smith, 2007)
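As a rough illustration of how LayoutLM-style models inject spatial information, the sketch below adds learned 2D position embeddings, looked up from normalized bounding-box coordinates of each OCR token, to the word embeddings. The table sizes, dimensions, and sharing of one table per axis are simplifying assumptions of ours, not LayoutLM's exact configuration.

```python
import numpy as np

VOCAB, COORD_BINS, DIM = 100, 1001, 16  # coordinates normalized to [0, 1000]
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(VOCAB, DIM))
x_emb = rng.normal(size=(COORD_BINS, DIM))  # shared table for x0 and x1
y_emb = rng.normal(size=(COORD_BINS, DIM))  # shared table for y0 and y1

def embed(token_ids, boxes):
    # boxes: (seq_len, 4) integer (x0, y0, x1, y1) coordinates in [0, 1000].
    x0, y0, x1, y1 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    # Word embedding plus the four 2D position embeddings, summed per token.
    return (word_emb[token_ids]
            + x_emb[x0] + y_emb[y0] + x_emb[x1] + y_emb[y1])

tokens = np.array([5, 7])
boxes = np.array([[10, 20, 110, 40],     # token at the top-left
                  [500, 20, 610, 40]])   # token further right on the same line
out = embed(tokens, boxes)
print(out.shape)  # → (2, 16)
```

The key point is that two occurrences of the same word receive different representations when they sit at different positions on the page, which is precisely the spatial signal that single-modal text models discard.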

