A CRITICAL ANALYSIS OF OUT-OF-DISTRIBUTION DETECTION FOR DOCUMENT UNDERSTANDING

Anonymous authors
Paper under double-blind review

Abstract

Large-scale pretraining is widely used in recent document understanding models. During deployment, one may expect a large-scale pretrained model to trigger a conservative fallback policy when encountering out-of-distribution (OOD) samples, which underscores the importance of OOD detection. However, most existing OOD detection methods focus on single-modal inputs such as images or text. While documents are multi-modal in nature, it remains underexplored whether and how the multi-modal information in documents can be exploited for OOD detection. In this work, we first provide a systematic and in-depth analysis of OOD detection for document understanding models. We study the effects of model modality, pretraining, and finetuning across various types of OOD inputs. In particular, we find that spatial information is critical for document OOD detection. To better exploit spatial information, we propose a simple yet effective spatial-aware adapter, which serves as an add-on module to adapt transformer-based language models to the document domain. Extensive experiments show that our method consistently improves ID accuracy and OOD detection performance compared to baselines. We hope our findings can inspire future work on understanding OOD robustness for documents.

1. INTRODUCTION

The recent success of large-scale pretrained models has led to the widespread deployment of deep models in various applications. In the document domain, model predictions are increasingly used to help humans make decisions in important applications ranging from tax form processing and machine-assisted medical report analysis to the analysis of financial forms. However, in most cases, models are pretrained on collected data but are then deployed in an environment with a different distribution over the observed data (Cui et al., 2021). For example, with the outbreak of COVID-19 (Velavan & Meyer, 2020), machine-assisted medical document analysis systems have to face continually changing data distributions. This motivates the need for reliable methods in the document domain to detect out-of-distribution (OOD) inputs.

The goal of OOD detection is to categorize in-distribution (ID) test samples into one of the known categories and to detect instances that do not belong to any known class (Huang & Li, 2021; Bendale & Boult, 2016). Generally, a model is optimized for a particular task (e.g., image classification (Deng et al., 2009)), and a companion OOD detector is built as a safeguard for the classifier. Recently, large-scale pretrained models have demonstrated promising results in multiple domains (Dosovitskiy et al., 2021; Hendrycks et al., 2020), as pretraining enables models to learn powerful and transferable feature representations (Radford et al., 2021). In particular, models obtained by finetuning large-scale pretrained models are significantly better at OOD detection, even with a simple distance metric (Lee et al., 2018; Radford et al., 2021). It is underexplored whether existing OOD detection methods that demonstrate success for images or text can be naturally extended to documents.
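To make the distance-metric idea concrete, the following is a minimal sketch (not the paper's method) of the class-conditional Gaussian scoring of Lee et al. (2018): penultimate-layer features are modeled with per-class means and a shared covariance, and an input is scored by its negative minimum Mahalanobis distance to any class mean. The feature extraction step is assumed to have already happened; only the scoring is shown.

```python
import numpy as np

def fit_gaussians(features, labels):
    """Fit class-conditional means and a shared (tied) covariance over
    penultimate-layer features, as in the Mahalanobis detector of
    Lee et al. (2018)."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - means[c] for c in classes])
    shared_cov = centered.T @ centered / len(features)
    precision = np.linalg.pinv(shared_cov)  # pseudo-inverse for stability
    return means, precision

def mahalanobis_score(x, means, precision):
    """OOD score for a single feature vector x: the negative minimum
    Mahalanobis distance to any class mean. Lower scores indicate
    more OOD-like inputs."""
    dists = [(x - mu) @ precision @ (x - mu) for mu in means.values()]
    return -min(dists)
```

A detector then thresholds this score on held-out ID data; inputs scoring below the threshold are flagged as OOD rather than classified.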
The main challenges in document OOD detection stem from the fact that document understanding is inherently multi-modal; thus, it is suboptimal to rely on a single modality.

