SIMILARITY SEARCH FOR EFFICIENT ACTIVE LEARNING AND SEARCH OF RARE CONCEPTS
Anonymous

Abstract

Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. However, in practice, data is often heavily skewed; only a small fraction of collected data will be relevant for a given learning task. For example, when identifying rare classes, detecting malicious content, or debugging model performance, positive examples can appear in less than 1% of the data. In this work, we exploit this skew in large training datasets to reduce the number of unlabeled examples considered in each selection round by only looking at the nearest neighbors to the labeled examples. Empirically, we observe that learned representations can effectively cluster unseen concepts, making active learning very effective and substantially reducing the number of viable unlabeled examples. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a proprietary dataset of 10 billion images from a large internet company. For rare classes, active learning methods need as little as 0.31% of the labeled data to match the average precision of full supervision. By limiting the selection strategies to the immediate neighbors of the labeled data as candidates for labeling, we process as little as 0.1% of the unlabeled data while achieving reductions in labeling costs similar to those of the traditional global approach. This process of expanding the candidate pool with the nearest neighbors of the labeled set can be done efficiently and reduces the computational complexity of selection by orders of magnitude.

1. INTRODUCTION

Large-scale unlabeled datasets contain millions or billions of examples spread over a wide variety of underlying concepts (Chelba et al., 2013; Zhu et al., 2015; Zhang et al., 2015; Wan et al., 2019; Russakovsky et al., 2015; Kuznetsova et al., 2020; Thomee et al., 2016; Abu-El-Haija et al., 2016; Caesar et al., 2019; Lee et al., 2019). Often, these massive datasets skew towards a relatively small number of common concepts, such as cats, dogs, and people (Liu et al., 2019; Zhang et al., 2017; Wang et al., 2017; Van Horn & Perona, 2017). Rare concepts, such as harbor seals, may only appear in a small fraction of the data (less than 1%). However, in many settings, performance on these rare concepts is critical. For example, harmful or malicious content may comprise a small percentage of user-generated content, but it can have an outsize impact on the overall user experience (Wan et al., 2019). Similarly, when debugging model behavior for safety-critical applications like autonomous vehicles, or when dealing with representational biases in models, obtaining data that captures rare concepts allows machine learning practitioners to combat blind spots in model performance (Karpathy, 2018; Holstein et al., 2019; Ashmawy et al., 2019; Karpathy, 2020). Even a simple prediction task like stop sign detection can be challenging given the diversity of real-world data. Stop signs may appear in a variety of conditions (e.g., on a wall or held by a person), be heavily occluded, or have modifiers (e.g., "Except Right Turns") (Karpathy, 2020). While large-scale datasets are core to addressing these issues, finding the relevant examples for these long-tail tasks is challenging. Active learning and search have the potential to automate much of the process of identifying these rare, high-value data points, but existing methods become intractable at this scale. Specifically, the goal of active learning is to reduce the cost of labeling (Settles, 2012).
To this end, the learning algorithm is allowed to choose which data to label based on uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics (Settles, 2011; 2012; Lewis & Gale, 1994). Active search is a sub-area focused on finding positive examples in skewed distributions (Garnett et al., 2012). Because of a concentrated focus on labeling costs, existing techniques, such as uncertainty sampling (Lewis & Gale, 1994) or information density (Settles & Craven, 2008), perform multiple selection rounds and iterate over the entire unlabeled dataset to identify the optimal example or batch of examples to label, scaling linearly or even quadratically with the size of the unlabeled data. Computational efficiency is becoming an impediment as dataset sizes and model complexities have increased (Amodei & Hernandez, 2018). Recent work has tried to address this problem with sophisticated methods that select larger and more diverse batches of examples in each selection round, reducing the total number of rounds needed to reach the target labeling budget (Sener & Savarese, 2018; Kirsch et al., 2019; Coleman et al., 2020; Pinsler et al., 2019; Jiang et al., 2018). Nevertheless, these approaches still scan over all of the examples to find the optimal examples to label in each round and can be intractable for large-scale unlabeled datasets. For example, running a single inference pass over 10 billion images with ResNet-50 (He et al., 2016) would take 38 exaFLOPs. In this work, we propose Similarity search for Efficient Active Learning and Search (SEALS), which restricts the candidates considered in each selection round and vastly reduces the computational complexity of active learning and search methods. Empirically, we find that learned representations from pre-trained models can effectively cluster many unseen and rare concepts.
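As a concrete illustration of the max entropy heuristic mentioned above, the following sketch (function names are our own, not from this work) scores each unlabeled example by the entropy of its predicted class distribution and selects the k most uncertain examples to label:

```python
import numpy as np

def max_entropy_selection(probs, k):
    """Select the k examples whose predicted class distributions
    have the highest entropy (i.e., are most uncertain)."""
    probs = np.asarray(probs, dtype=float)
    # Entropy of each row; clip probabilities to avoid log(0).
    ent = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=1)
    # Indices of the k highest-entropy examples.
    return np.argsort(-ent)[:k]

# Example: three unlabeled examples with binary class probabilities.
probs = [[0.5, 0.5],    # maximally uncertain
         [0.9, 0.1],
         [0.99, 0.01]]  # nearly certain
print(max_entropy_selection(probs, 2))  # → [0 1]
```

Note that this scores every unlabeled example, so each selection round is linear in the size of the unlabeled pool, which is exactly the cost the global baselines pay.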
We exploit this latent structure to improve the computational efficiency of active learning and search methods by only considering the nearest neighbors of the currently labeled examples in each selection round. This can be done transparently for many selection strategies, making SEALS widely applicable. Finding the nearest neighbors of each labeled example in the unlabeled data can be performed efficiently, with sublinear retrieval times (Charikar, 2002) and sub-second latency on billion-scale datasets (Johnson et al., 2017) for approximate approaches. While constructing the index for similarity search requires at least a linear pass over the unlabeled data, this computational cost is effectively amortized over many selection rounds or other applications. As a result, SEALS enables selection to scale with the size of the labeled data rather than the size of the unlabeled data, making active learning and search tractable on datasets with billions of unlabeled examples. We empirically evaluated SEALS for both active learning and search on three large-scale computer vision datasets: ImageNet (Russakovsky et al., 2015), OpenImages (Kuznetsova et al., 2020), and a proprietary dataset of 10 billion images from a large internet company. We selected 611 concepts spread across these datasets that range in prevalence from 0.203% to 0.002% (1 in 50,000) of the training examples. We evaluated three selection strategies for each concept: max entropy uncertainty sampling (Lewis & Gale, 1994), information density (Settles & Craven, 2008), and most-likely positive (Warmuth et al., 2002; 2003; Jiang et al., 2018). Across datasets, selection strategies, and concepts, SEALS achieved similar model quality and nearly the same recall of the positive examples as the baseline approaches, while improving the computational complexity by orders of magnitude.
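The candidate-restriction step described above can be sketched in a few lines. This is a simplified illustration with names of our own choosing: exact brute-force k-NN over stand-in embeddings plays the role of the approximate similarity-search index used at billion scale, and the candidate pool is simply the union of the labeled examples' neighbors:

```python
import numpy as np

def seals_candidate_pool(embeddings, labeled_idx, k=10):
    """Restrict selection candidates to the union of the k nearest
    neighbors of each labeled example (exact k-NN here; at scale this
    would be an approximate index with sublinear query time)."""
    pool = set()
    for i in labeled_idx:
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        # k nearest neighbors of labeled example i (includes i itself).
        pool.update(np.argsort(dists)[:k].tolist())
    # Exclude already-labeled examples from the candidate pool.
    return sorted(pool - set(labeled_idx))

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))  # stand-in for pre-trained embeddings
pool = seals_candidate_pool(emb, labeled_idx=[0, 1], k=5)
print(len(pool))  # at most 8 candidates out of 1,000 unlabeled examples
```

Any selection strategy (max entropy, information density, etc.) can then be run over `pool` instead of the full unlabeled set, so per-round cost grows with the labeled set (times k) rather than with the unlabeled data.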
On ImageNet with a budget of 2,000 binary labels per concept (~0.31% of the unlabeled data), all baseline and SEALS approaches were within 0.011 mAP of full supervision and recalled over 50% of the positive examples. On OpenImages, SEALS reduced the candidate pool to 1% of the unlabeled data on average while remaining within 0.013 mAP and 0.1% recall of the baseline approaches. On the proprietary dataset with 10 billion images, SEALS needed an even smaller fraction of the data, about 0.1%, to match the baseline, which allowed SEALS to run on a single machine rather than a cluster. To the best of our knowledge, no other work has performed active learning at this scale. We also applied SEALS to the Goodreads spoiler detection dataset for NLP (Wan et al., 2019), where it achieved the same recall as the baseline approaches while considering less than 1% of the unlabeled data. Together, these results demonstrate that SEALS' improvements to computational efficiency make active learning and search tractable for even billion-scale datasets.

2. RELATED WORK

Active learning's iterative retraining, combined with the high computational complexity of deep learning models, has led to significant work on computational efficiency (Sener & Savarese, 2018; Kirsch et al., 2019; Pinsler et al., 2019; Coleman et al., 2020; Yoo & Kweon, 2019; Mayer & Timofte, 2020; Zhu & Bento, 2017). One branch of recent work has focused on selecting large batches of data to minimize the amount of retraining and reduce the number of selection rounds necessary to reach a target budget (Sener & Savarese, 2018; Kirsch et al., 2019; Pinsler et al., 2019). These

