PARETO OPTIMIZATION FOR ACTIVE LEARNING UNDER OUT-OF-DISTRIBUTION DATA SCENARIOS

Anonymous

Abstract

Pool-based Active Learning (AL) has achieved great success in minimizing labeling costs by sequentially selecting the most informative unlabeled samples from a large unlabeled data pool and querying their labels from oracles/annotators. However, existing AL sampling schemes might not work well under out-of-distribution (OOD) data scenarios, where the unlabeled data pool contains data samples that do not belong to the pre-defined categories of the target task. Achieving good AL performance under OOD data scenarios is challenging due to a natural conflict between AL sampling strategies and OOD sample detection: both the more informative in-distribution (ID) data and the OOD data in the unlabeled pool may be assigned high informativeness scores (e.g., high entropy) during AL processes. In this paper, we propose a Monte-Carlo Pareto Optimization for Active Learning (POAL) sampling scheme, which selects an optimal subset of unlabeled samples with a fixed batch size from the unlabeled data pool. We cast the AL sampling task as a multi-objective optimization problem and utilize Pareto optimization based on two conflicting objectives: (1) the typical AL sampling score (e.g., maximum entropy) and (2) the confidence of not being an OOD sample. Experimental results show the effectiveness of POAL on classical Machine Learning (ML) and Deep Learning (DL) tasks.

1. INTRODUCTION

In real-life applications, huge amounts of unlabeled data are easily obtained, but labeling them is expensive (Shen et al., 2004). AL addresses this problem: it achieves greater accuracy with less training data by sequentially selecting the most informative instances and querying their labels from oracles/annotators (Zhan et al., 2021b). Current AL methods have been tested on well-studied datasets (Kothawade et al., 2021) like MNIST (Deng, 2012) and CIFAR10 (Krizhevsky et al., 2009). These datasets are simple and clean. However, in realistic scenarios, when collecting unlabeled data, unrelated data (i.e., out-of-domain data) might be mixed in with the task-related data, e.g., images of letters when the task is to classify images of digits (Du et al., 2021). Most AL methods are not robust to OOD data scenarios. For instance, Karamcheti et al. (2021) demonstrated empirically that collective outliers hurt AL performance on Visual Question Answering (VQA) tasks. Moreover, selecting and querying OOD samples that are invalid for the target model wastes labeling cost (Du et al., 2021) and makes the AL sampling process less effective. There is a natural conflict between AL and OOD data detection. Most AL methods, especially uncertainty-based measures, prefer data that are hardest for the current basic classifier to classify (e.g., data with high entropy of predicted class probabilities). However, a basic learner (e.g., a neural network with softmax output) that performs poorly on ID data is also likely to produce non-informative predicted probabilities (i.e., close to uniform) on OOD data (Vaze et al., 2021). During AL processes, the basic learner is not well trained due to insufficient labeled data (and, in the case of deep AL, insufficient training epochs). Therefore, the samples selected by AL may contain both highly informative ID samples and OOD samples.
For example, consider the Maximum Entropy (ENT) approach for AL, a classic uncertainty-based method (Lewis & Catlett, 1994; Shannon, 2001) that selects data whose predicted class probabilities have the largest entropy. Meanwhile, ENT is also a typical OOD detection method: high entropy of the predicted class distribution suggests that the input may be OOD (Ren et al., 2019). Fig. 1 (top) shows an example on the EX8 dataset (Ng, 2008) that further illustrates this conflict: a large percentage of the data with high entropy scores is OOD, and thus ENT-based AL will likely select OOD data for labeling. Additional measures are therefore needed to detect OOD samples so that they are not selected for AL. For example, in Fig. 1 (bottom), the negative Mahalanobis distance (Lee et al., 2018) shows a clear negative correlation with entropy and can thus be used as an ID confidence score. Although the OOD problem has been shown to affect AL in real-life applications (Karamcheti et al., 2021), there are only a few studies on this topic (Kothawade et al., 2021; Du et al., 2021). SIMILAR (Submodular Information Measures Based Active Learning) (Kothawade et al., 2021) adopts the submodular conditional mutual information (SCMI) function as its acquisition function, jointly modeling the similarity between the unlabeled and labeled ID sets and their dissimilarity with the labeled OOD set. This estimation may be inaccurate in early AL rounds, when both the labeled ID and OOD sets are small. CCAL (Contrastive Coding AL) (Du et al., 2021) requires pre-training additional self-supervised models such as SimCLR (Chen et al., 2020), and introduces hyper-parameters to trade off semantic and distinctive scores, whose values affect the final performance (see Section 4.3 in (Du et al., 2021)). These two factors limit its applicability. We compare our work with SIMILAR and CCAL in detail and show the superiority of our method in Appendices C and F.3.
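As a minimal illustrative sketch (not the paper's implementation), the two scores discussed above could be computed as follows. The function names are ours, and the Mahalanobis score assumes class-conditional Gaussian features with a shared covariance, as in Lee et al. (2018):

```python
import numpy as np

def entropy_scores(probs):
    """AL informativeness: Shannon entropy of predicted class probabilities.

    probs: (n, K) array of softmax outputs; higher entropy = more uncertain.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def mahalanobis_confidence(features, class_means, shared_cov):
    """ID confidence: negative Mahalanobis distance to the closest class mean.

    Values near zero indicate the sample sits close to some class centroid
    (likely ID); large negative values indicate a likely OOD sample.
    """
    inv_cov = np.linalg.inv(shared_cov)
    dists = []
    for mu in class_means:
        diff = features - mu  # (n, d)
        # Squared Mahalanobis distance for each sample to this class mean.
        dists.append(np.einsum('nd,de,ne->n', diff, inv_cov, diff))
    return -np.min(np.stack(dists, axis=0), axis=0)
```

A conventional single-criterion AL strategy would rank by one of these scores alone; the conflict in Fig. 1 arises because the two rankings disagree on OOD samples.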
In this paper, we advocate simultaneously considering the AL criterion and ID confidence when designing AL sampling strategies to address the above issues. Since the two objectives conflict, we cast the AL sampling process under OOD data scenarios as a multi-objective optimization problem (Seferlis & Georgiadis, 2004). Unlike traditional methods for handling multiple-criteria AL, such as weighted-sum optimization (Zhan et al., 2022a) or two-stage optimization (Shen et al., 2004; Zhan et al., 2022a), we propose a novel and flexible batch-mode Pareto Optimization Active Learning (POAL) framework. The contributions of this paper are as follows:
1. We formulate AL under OOD data scenarios within a multi-objective optimization framework.
2. Our framework is flexible and can accommodate different combinations of AL and OOD detection methods according to various target tasks. In our experiments, we use ENT as the AL objective and the Mahalanobis distance as the ID confidence score.
3. Naively applying Pareto optimization to AL yields a Pareto Front of non-fixed size, which can incur high computational cost. To enable efficient Pareto optimization, we propose a Monte-Carlo (MC) Pareto optimization algorithm for fixed-size batch-mode AL.
4. Our framework works well on both classical ML and DL tasks, and we propose pre-selection and early-stopping techniques to reduce the computational cost on large-scale datasets.
5. Our framework has no trade-off hyper-parameter for balancing the AL and OOD objectives. This matters because: (i) AL is data-insufficient, so there may be no validation set for tuning parameters; (ii) hyper-parameter tuning in AL can be label-expensive, since every change of a hyper-parameter causes AL to label new data, provoking substantial labeling inefficiency (Ash et al., 2020).
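The building block of any Pareto-based selection is the dominance test over the two objectives. The sketch below (our own illustrative code, not the POAL algorithm itself, which additionally uses Monte-Carlo subset sampling to obtain a fixed batch size) extracts the non-dominated set when both the AL score and the ID confidence score are to be maximized:

```python
import numpy as np

def pareto_front(scores):
    """Return indices of non-dominated points; both objectives are maximized.

    scores: (n, 2) array; column 0 = AL informativeness (e.g., entropy),
    column 1 = ID confidence (e.g., negative Mahalanobis distance).
    """
    n = scores.shape[0]
    dominated = np.zeros(n, dtype=bool)
    for i in range(n):
        # Point j dominates point i if j is >= i on both objectives
        # and strictly > on at least one.
        better_eq = np.all(scores >= scores[i], axis=1)
        strictly = np.any(scores > scores[i], axis=1)
        if np.any(better_eq & strictly):
            dominated[i] = True
    return np.where(~dominated)[0]
```

Note that the size of this front depends on the data, which is exactly the non-fixed-size issue that motivates the Monte-Carlo fixed-batch formulation.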

2. RELATED WORK

Pool-based Active Learning. Pool-based AL has been well studied in recent years (Settles, 2009; Zhan et al., 2021b; Ren et al., 2021) and widely adopted in various tasks (Duong et al., 2018; Yoo & Kweon, 2019; Dor et al., 2020; Haussmann et al., 2020). Most AL methods rely on fixed heuristic sampling strategies, which follow two main branches: uncertainty-based and representative/diversity-based measures (Ren et al., 2021; Zhan et al., 2022a). Uncertainty-based approaches select data that maximally reduce the uncertainty of the target basic learner (Ash et al., 2020). Typical uncertainty-based measures that perform well on classical ML tasks, like Query-by-Committee (QBC) (Seung et al., 1992) and Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011), have also been generalized to DL tasks (Wang & Shang, 2014; Gal et al., 2017; Beluch et al., 2018; Zhan et al., 2022a). Representative/diversity-based methods like k-Means (Zhan et al., 2022a) and the Core-Set approach (Sener & Savarese, 2018) select a batch of unlabeled data most representative of the set. Uncertainty- and representative-based measures can be combined via weighted-sum or multi-stage optimization (Zhan et al., 2022a). Weighted-sum optimization combines multiple objectives



Figure 1: (top) Distribution of entropy for ID and OOD data during AL on EX8. (bottom) Scatter plot of the AL score (entropy) versus the ID confidence score (negative Mahalanobis distance) for unlabeled data. A larger ID confidence score indicates that a sample is more likely to be ID.

