CONTRASTIVE VISION TRANSFORMER FOR SELF-SUPERVISED OUT-OF-DISTRIBUTION DETECTION

Anonymous

Abstract

Out-of-distribution (OOD) detection aims to detect abnormal samples that do not belong to the distribution of the training data (i.e., in-distribution (ID) data). The technique has been applied to various image classification tasks to identify abnormal image samples whose abnormality is caused by semantic shift (samples from different classes) or covariate shift (samples from different domains). However, disentangling OOD samples caused by different shifts remains a challenge in image OOD detection. This paper proposes the Contrastive Vision Transformer (CVT), an attention-based contrastive learning model, for self-supervised OOD detection in image classification tasks. Specifically, a vision transformer architecture is integrated as the feature-extracting module under a contrastive learning framework. An empirical ensemble module is developed to extract representative ensemble features, through which a balance can be achieved between semantic and covariate OOD samples. The proposed CVT model is tested on various self-supervised OOD detection tasks, and our approach outperforms state-of-the-art methods by 5.12% AUROC on CIFAR-10 (ID) vs. CIFAR-100 (OOD), and by 9.77% AUROC on CIFAR-100 (ID) vs. CIFAR-10 (OOD).

1. INTRODUCTION

As deep neural networks (DNNs) are increasingly deployed in real-world applications, the safety and robustness of these models receive growing attention. Most existing DNNs are trained under the closed-world assumption, i.e., the test data are assumed to be drawn i.i.d. from the same distribution as the training data (Yang et al., 2021). Although deployed DNNs can deal well with such ID samples, they will blindly classify data coming from other classes or domains (i.e., OOD samples) into existing classes in an open-world scenario. Nguyen et al. (2015) discovered that neural networks can be easily fooled by unrecognizable images, which means that most DNNs are unreliable when encountering unknown or unseen samples. A few such mistakes may be tolerable in some scenarios (e.g., chatbots, interactive entertainment), but they can bring catastrophic damage in applications that require great safety, such as automated vehicles, medical imaging and biometric security systems. Therefore, it is essential to equip a model with the ability to detect out-of-distribution data, making it more robust and reliable. Generally, outliers arise from mechanical failure, fraudulent behaviour, human error, instrument error or natural deviations in populations (Hodge & Austin, 2004). In machine learning, OOD samples are regarded as outliers relative to ID samples due to distributional shifts. These shifts can be caused by semantic shift (i.e., OOD samples from different classes) or covariate shift (i.e., OOD samples from different domains) (Yang et al., 2021). Meanwhile, OOD samples that are semantically and stylistically very different from ID samples are referred to as far-OOD samples, while those that are semantically similar to ID samples but come from different domains are referred to as near-OOD samples (Ren et al., 2021).
Out-of-distribution detection, also known as outlier detection or novelty detection, aims to identify whether a new input belongs to the same distribution as the training data. A natural idea is to build a classifier that separates ID from OOD data, using models such as a Deep Neural Network (DNN) or a Support Vector Machine (SVM). However, the sample space of OOD data is almost infinite, as the OOD dataset is the complement of the ID dataset, which makes creating a representative OOD dataset impracticable. Moreover, OOD samples are scarce and costly to obtain in some industries (e.g., medical imaging, fraud prevention). These are the main issues in research on OOD detection. To address them, researchers focus on the latent features of ID data, assuming distinguishable distributional shifts exist between ID and OOD samples in the latent feature space. Some researchers (Nalisnick et al., 2019; Serrà et al., 2019; Xiao et al., 2020) use generative models, such as Variational Auto-encoders (VAEs), to extract latent features for both ID and OOD samples, with specifically designed OOD scores used as the metric. As an alternative, contrastive learning models can be employed to learn the latent features, such as Self-Supervised Outlier Detection (SSD) (Sehwag et al., 2020) and Contrasting Shifted Instances (CSI) (Tack et al., 2020). However, in contrastive learning, researchers usually adopt a standard convolutional neural network (CNN) or its variants like ResNet (He et al., 2016) as the encoder.

In this paper, a Contrastive Vision Transformer (CVT) model is proposed for OOD detection under a self-supervised regime for image classification tasks. The framework of contrastive learning, including data augmentation and a contrastive loss, is adopted to learn representations for all inputs, which has been shown to be reasonably effective for detecting OOD samples (Tack et al., 2020).
On this basis, four extra modules are introduced into this framework: (i) to improve the distinguishability between ID and OOD samples in the latent space, a vision transformer architecture rather than a CNN is embedded as the feature-extracting module; (ii) since representation collapse is a noteworthy problem in self-supervised and unsupervised scenarios, an additional predictor structure (inspired by BYOL (Grill et al., 2020)) is employed to avoid collapsed solutions; (iii) considering that the number of negative samples plays an important role in contrastive learning, a memory queue scheme from MoCo (He et al., 2020) is integrated to maintain the model's performance, especially when the batch size is extremely small; (iv) an ensemble module is developed to build representative ensemble features that balance semantic and covariate OOD detection, as we observe in our experiments that latent features from the encoder perform better on semantic OOD samples, whereas latent features from the predictor perform better on covariate OOD samples. To further improve performance, a Mahalanobis distance-based OOD score function is utilised for OOD detection, the effectiveness of which has been shown in recent papers (Sehwag et al., 2020; Ren et al., 2021).
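A Mahalanobis distance-based OOD score of the kind mentioned above can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: it presumes ID features (e.g., the ensemble features) have already been extracted, fits per-cluster means and a shared covariance, and scores a test feature by its minimum Mahalanobis distance to any cluster mean (a larger score indicates a more likely OOD sample). The function names and the ridge term `1e-6` are illustrative choices.

```python
import numpy as np

def fit_mahalanobis(train_feats, train_labels=None):
    """Fit per-cluster means and a shared inverse covariance on ID features.

    With no labels (the fully self-supervised case), all features form a
    single cluster; with k-means pseudo-labels, one mean per cluster.
    """
    if train_labels is None:
        train_labels = np.zeros(len(train_feats), dtype=int)
    means, centered = [], []
    for c in np.unique(train_labels):
        feats_c = train_feats[train_labels == c]
        mu = feats_c.mean(axis=0)
        means.append(mu)
        centered.append(feats_c - mu)
    centered = np.concatenate(centered, axis=0)
    # Shared covariance with a small ridge term for numerical stability.
    cov = centered.T @ centered / len(centered)
    cov += 1e-6 * np.eye(cov.shape[0])
    return np.stack(means), np.linalg.inv(cov)

def ood_score(x_feat, means, cov_inv):
    """OOD score = minimum Mahalanobis distance to any ID cluster mean."""
    diffs = means - x_feat  # (num_clusters, dim)
    # Quadratic form diffs @ cov_inv @ diffs.T, one value per cluster.
    d2 = np.einsum('cd,de,ce->c', diffs, cov_inv, diffs)
    return float(d2.min())
```

At test time, a threshold on this score (chosen on held-out ID data) decides whether an input is flagged as OOD.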
To conclude, the key contributions of this paper are as follows:
• We integrate a vision transformer architecture into a contrastive learning framework and develop a new paradigm specifically for self-supervised OOD detection in image classification tasks, with results outperforming state-of-the-art algorithms.
• We develop an ensemble module to compute representative features that balance OOD samples from different types of data shift.
• We conduct extensive ablation studies to report the influence of various hyper-parameters on OOD detection tasks, and benchmark the performance of CVT with different backbone modules, including ViT, ResNet50, and the Swin Transformer.
In the rest of the paper, related work is described in Section 2 and the main CVT model is introduced in Section 3, followed by numerical results in Section 4; the paper is concluded in Section 5.

2. RELATED WORK

Contrastive learning is a self-supervised technique that has seen fast development in recent years. Chen et al. (2020) proposed a contrastive learning framework consisting of four components: a data augmentation module, a neural network base encoder, an MLP (multilayer perceptron) projection head, and a contrastive loss. It incorporated a strong inductive bias by gathering samples from the same class and repelling others, and achieved promising results in visual representation learning. Under a similar paradigm, many influential variants were developed in recent years, such as SimCLR (Chen et al., 2020), MoCo, SwAV (Caron et al., 2020), BYOL and MoCo-v3 (Chen et al., 2021). MoCo introduced a queue module to store the key representations of negative samples, since a larger number of negative samples can effectively improve performance. To maintain the consistency of the keys in the queue, a momentum strategy was developed in MoCo to update the parameters of the key encoder.
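The three MoCo-style ingredients described above (momentum update of the key encoder, a queue of negatives, and the contrastive loss) can be written as a framework-agnostic NumPy sketch. Dictionaries of arrays stand in for network weights, and all names here are ours, not MoCo's API; the loss is the standard InfoNCE form.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style EMA update: the key encoder slowly tracks the query
    encoder, keeping the queued keys consistent over training."""
    for name in key_params:
        key_params[name] = m * key_params[name] + (1.0 - m) * query_params[name]

class FeatureQueue:
    """Fixed-size FIFO queue of past (normalised) key features, reused as
    negatives so the effective negative count exceeds the batch size."""
    def __init__(self, dim, size=4096, seed=0):
        rng = np.random.default_rng(seed)
        q = rng.normal(size=(size, dim))
        self.queue = q / np.linalg.norm(q, axis=1, keepdims=True)
        self.ptr = 0

    def enqueue(self, keys):
        n = len(keys)
        idx = (self.ptr + np.arange(n)) % len(self.queue)
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % len(self.queue)

def info_nce(q, k_pos, negatives, tau=0.07):
    """InfoNCE loss: each query should match its own key (placed at logit
    index 0) against all queued negatives; tau is the temperature."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)  # (B, 1)
    l_neg = q @ negatives.T                           # (B, K)
    logits = np.concatenate([l_pos, l_neg], axis=1) / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[:, 0].mean()
```

In MoCo proper, each step's key batch is enqueued after computing the loss and only the query encoder receives gradients; the arrays above are placeholders for real network outputs.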



In contrastive learning frameworks, CNNs and their variants are commonly adopted as the encoder. By contrast, transformer-based architectures (such as the earliest Vision Transformer (ViT) (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021) and the Swin Transformer (Liu et al., 2021)) gradually outperform CNNs in extracting robust latent features, as they can learn global long-range relationships for visual representation learning, which facilitates distinguishing ID from OOD samples.

