DECOMPOSING TEXTURE AND SEMANTIC FOR OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Out-of-distribution (OOD) detection has made significant progress recently, since a distribution mismatch between training and testing can severely deteriorate the reliability of AI systems. Nevertheless, the lack of a precise interpretation of the in-distribution (ID) limits the application of OOD detection methods to real-world systems. To tackle this, we decompose the definition of the ID into texture and semantics, motivated by the demands of real-world scenarios. We also design new benchmarks to measure the robustness that OOD detection methods should possess. To achieve a good balance between OOD detection performance and robustness, our method takes a divide-and-conquer approach. Specifically, the proposed model first handles the texture and semantic components separately and then fuses them. This philosophy is empirically validated by a series of benchmarks, including both the proposed scenarios and their conventional counterparts.

1. INTRODUCTION

Out-of-distribution (OOD) detection is the task of recognizing whether given data comes from the distribution of the training samples, also known as the in-distribution (ID), or not. Any machine learning-based system can receive input samples whose distribution is completely disparate from the training environment (e.g., dataset). Since distribution shift can severely degrade model performance (Amodei et al., 2016), it is a potential threat to reliable real-world AI systems. However, the ambiguous definition of the ID limits the feasibility of OOD detection methods in real-world applications, considering the variety of OOD scenarios. For example, subtle corruption is a clear signal of OOD in the machine vision field, while a change in semantic information might not be. On the other hand, an autonomous driving system may assume the ID from a semantic-oriented perspective; e.g., an unseen traffic sign is OOD. Unfortunately, most conventional OOD detection methods and benchmarks (Zhang et al., 2021; Tack et al., 2020; Ren et al., 2019; Chan et al., 2021) assume the ID as a single mode and thus cannot handle other aspects of OOD properly (Figure 1a). To tackle this, we revisit the definition of the ID by decomposing it into two factors: texture and semantics (Figure 1b). For the texture factor, we define OOD as the textural difference between the ID and OOD datasets. In contrast, semantic OOD focuses on class labels that do not exist in the ID environment. Note that the two aspects have a trade-off relationship, so detecting both problems with a single model is challenging under the (conventional) entangled OOD perspective.

Geirhos et al. (2018) investigated the texture-shape cue conflict in deep networks, and a series of subsequent studies (Hermann et al., 2019; Li et al., 2020; Ahmed & Courville, 2020) explored how to achieve a balance between these perspectives. However, these works only analyzed the texture-shape bias inherent in deep networks. Instead, we focus on analyzing the texture and semantic characteristics underlying the ID to build a more practically applicable OOD detection method.

Unfortunately, to the best of our knowledge, no study on OOD detection benchmarks has thoroughly analyzed the definition of the ID. This can be problematic when a method judges an image corrupted by negligible distortion as OOD, even though the environment can tolerate small changes in texture. Because of such complicated scenarios, it is crucial to evaluate OOD detection methods in a comprehensive way that goes beyond the simple benchmark. Thus, in this study, we propose a new approach to measuring the performance of a method according to the decomposed definition of the ID. One notable observation in our benchmark is that most previous OOD detection methods are highly biased toward texture information and ignore semantic clues.

To mitigate the aforementioned issue, our proposed method tackles texture and semantic information separately and aggregates them in the final module (Figure 2). To effectively extract texture information, we use a 2D Fourier transform, motivated by the recent frequency domain-driven deep method (Xu et al., 2020). For the semantic feature, we design an extraction module upon deep support vector data description (Deep-SVDD) (Ruff et al., 2018) with a novel angular distance-based initialization strategy. We then combine the two features using a normalizing flow-based method (Dinh et al., 2016), followed by a factor control mechanism. The control system provides the flexibility to handle different OOD scenarios by choosing which decomposed feature is more important in the given surrounding OOD circumstances.

The main contributions of this work are as follows:

• We decompose the "unclear" definition of the ID into texture and semantics. To the best of our knowledge, this is the first attempt to clarify OOD itself in this field.

• Motivated by real-world problems, we create new OOD detection benchmark scenarios.
• We propose a novel OOD detection method that is effective on both the texture and semantic benchmarks as well as the conventional one. Furthermore, our method does not require any auxiliary datasets or labels, unlike previous models.
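The divide-and-conquer pipeline described above can be sketched as a toy score combination. This is only an illustration of the idea, not the paper's implementation: the actual model uses learned CNN features, Deep-SVDD training, and a normalizing flow, whereas the radial-spectrum descriptor, the function names, and the weighted sum below are our own simplifications.

```python
import numpy as np

def texture_score(image, n_bins=8):
    """Toy texture descriptor: radially averaged log power spectrum of the
    2D Fourier transform (a stand-in for the paper's texture branch)."""
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    bins = np.minimum((r / (r.max() + 1e-8) * n_bins).astype(int), n_bins - 1)
    return np.log1p([spectrum[bins == b].mean() for b in range(n_bins)])

def semantic_score(feature, center):
    """Deep-SVDD-style score: squared distance of a semantic embedding to the
    hypersphere center (smaller = more in-distribution)."""
    return np.sum((feature - center) ** 2)

def ood_score(texture_dist, semantic_dist, alpha=0.5):
    """Factor control (simplified): alpha weights texture vs. semantic
    evidence, mimicking the choice of which view of the ID matters."""
    return alpha * texture_dist + (1.0 - alpha) * semantic_dist

rng = np.random.default_rng(0)
ref = texture_score(rng.random((32, 32)))          # "ID" texture reference
t = np.linalg.norm(texture_score(rng.random((32, 32))) - ref)
s = semantic_score(rng.random(16), np.zeros(16))   # toy semantic embedding
print(ood_score(t, s, alpha=0.7))                  # higher = more OOD-like
```

Setting `alpha` close to 1 emphasizes textural deviation (e.g., machine vision inspection), while `alpha` close to 0 emphasizes semantic novelty (e.g., unseen traffic signs), mirroring the scenario-dependent control discussed above.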

2. RELATED WORK

Figure 1: How to define the ID? (a) Traditional OOD detection studies manage the ID in an entangled view. However, this could be naïve considering the complex nature of real environments. (b) We decompose the definition of the ID into texture and semantics; this provides the flexibility to handle complicated scenarios by determining which view of the ID is suitable for a given scenario.

Class labels of the ID. Early studies on deep OOD detection rely on class supervision. ODIN and Generalized ODIN (Liang et al., 2017; Hsu et al., 2020) use an uncertainty measure derived from the Softmax output: a given sample is determined to be OOD when the output probability of every class is less than a predefined threshold. Other works (Sastry & Oore, 2020; Lee et al., 2018) utilize feature maps (e.g., Gram matrices) extracted from pre-trained networks to calculate the OOD score. Zhang et al. (2020) employ a flow-based model comparable to ours, but they require class information during training and pay attention solely to semantics.

Auxiliary distribution. Outlier exposure (OE) (Hendrycks et al., 2018) exploits additional datasets that are disjoint from the test dataset to guide the network toward better representations for OOD detection. Papadopoulos et al. (2021) further improve the performance of OE by regularizing the network with the total variation distance of the Softmax output.

Data augmentation. Recently, contrastive learning-based methods have shown remarkable success on tasks related to visual representation (He et al., 2020; Chen et al., 2020). Motivated by this, several studies employ data augmentation methods, such as image transformations or additional noise, in the OOD detection task or model (Hendrycks et al., 2019; Tack et al., 2020; Kirichenko et al., 2020).
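The Softmax-thresholding rule used by the class-supervised baselines above can be sketched as a minimal maximum-softmax-probability (MSP) score in NumPy. The threshold value is illustrative, and ODIN's additional temperature tuning and input perturbation are omitted for brevity.

```python
import numpy as np

def max_softmax_score(logits, temperature=1.0):
    """Maximum softmax probability (MSP); ODIN further tunes the temperature
    and perturbs the input, which is omitted in this sketch."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def is_ood(logits, threshold=0.5):
    """Flag a sample as OOD when no class probability exceeds the threshold."""
    return max_softmax_score(logits) < threshold

confident = np.array([8.0, 0.1, 0.2])   # peaked logits -> in-distribution
uniformish = np.array([0.5, 0.4, 0.6])  # flat logits -> likely OOD
print(is_ood(confident), is_ood(uniformish))  # prints: False True
```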

