SIMILARITY SEARCH FOR EFFICIENT ACTIVE LEARNING AND SEARCH OF RARE CONCEPTS

Anonymous

Abstract

Many active learning and search approaches are intractable for industrial settings with billions of unlabeled examples. Existing approaches, such as uncertainty sampling or information density, search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. However, in practice, data is often heavily skewed; only a small fraction of collected data will be relevant for a given learning task. For example, when identifying rare classes, detecting malicious content, or debugging model performance, positive examples can appear in less than 1% of the data. In this work, we exploit this skew in large training datasets to reduce the number of unlabeled examples considered in each selection round by only looking at the nearest neighbors to the labeled examples. Empirically, we observe that learned representations can effectively cluster unseen concepts, making active learning very effective and substantially reducing the number of viable unlabeled examples. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a proprietary dataset of 10 billion images from a large internet company. For rare classes, active learning methods need as little as 0.31% of the labeled data to match the average precision of full supervision. By limiting the selection strategies to the immediate neighbors of the labeled data as candidates for labeling, we process as little as 0.1% of the unlabeled data while achieving similar reductions in labeling costs as the traditional global approach. This process of expanding the candidate pool with the nearest neighbors of the labeled set can be done efficiently and reduces the computational complexity of selection by orders of magnitude.

1. INTRODUCTION

Large-scale unlabeled datasets contain millions or billions of examples spread over a wide variety of underlying concepts (Chelba et al., 2013; Zhu et al., 2015; Zhang et al., 2015; Wan et al., 2019; Russakovsky et al., 2015; Kuznetsova et al., 2020; Thomee et al., 2016; Abu-El-Haija et al., 2016; Caesar et al., 2019; Lee et al., 2019). Often, these massive datasets skew towards a relatively small number of common concepts, such as cats, dogs, and people (Liu et al., 2019; Zhang et al., 2017; Wang et al., 2017; Van Horn & Perona, 2017). Rare concepts, such as harbor seals, may appear in only a small fraction of the data (less than 1%). However, in many settings, performance on these rare concepts is critical. For example, harmful or malicious content may comprise a small percentage of user-generated content, but it can have an outsize impact on the overall user experience (Wan et al., 2019). Similarly, when debugging model behavior for safety-critical applications like autonomous vehicles, or when dealing with representational biases in models, obtaining data that captures rare concepts allows machine learning practitioners to combat blind spots in model performance (Karpathy, 2018; Holstein et al., 2019; Ashmawy et al., 2019; Karpathy, 2020). Even a simple prediction task like stop sign detection can be challenging given the diversity of real-world data. Stop signs may appear in a variety of conditions (e.g., on a wall or held by a person), be heavily occluded, or have modifiers (e.g., "Except Right Turns") (Karpathy, 2020). While large-scale datasets are core to addressing these issues, finding the relevant examples for these long-tail tasks is challenging. Active learning and search have the potential to significantly automate the process of identifying these rare, high-value data points, but existing methods become intractable at this scale. Specifically, the goal of active learning is to reduce the cost of labeling (Settles, 2012).
To this end, the learning algorithm is allowed to choose which data to label based on uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics (Settles, 2011; 2012; Lewis & Gale, 1994). Active search is a sub-area focused on finding positive examples in skewed distributions (Garnett et al., 2012). Because of a concentrated focus on labeling costs, existing techniques, such as uncertainty sampling (Lewis & Gale, 1994) or information density (Settles & Craven, 2008), perform multiple selection rounds that iterate over the entire unlabeled dataset to identify the optimal example or batch of examples to label, scaling linearly or even quadratically with the size of the unlabeled data. Computational efficiency is becoming an impediment as dataset sizes and model complexity have increased (Amodei & Hernandez, 2018). Recent work has tried to address this problem with sophisticated methods that select larger and more diverse batches of examples in each selection round, reducing the total number of rounds needed to reach the target labeling budget (Sener & Savarese, 2018; Kirsch et al., 2019; Coleman et al., 2020; Pinsler et al., 2019; Jiang et al., 2018). Nevertheless, these approaches still scan over all of the examples to find the optimal examples to label in each round and can be intractable for large-scale unlabeled datasets. For example, running a single inference pass over 10 billion images with ResNet-50 (He et al., 2016) would take 38 exaFLOPs (about 3.8 GFLOPs per forward pass times 10^10 images). In this work, we propose Similarity search for Efficient Active Learning and Search (SEALS) to restrict the candidates considered in each selection round and vastly reduce the computational complexity of active learning and search methods. Empirically, we find that learned representations from pre-trained models can effectively cluster many unseen and rare concepts.
We exploit this latent structure to improve the computational efficiency of active learning and search methods by only considering the nearest neighbors of the currently labeled examples in each selection round. This can be done transparently for many selection strategies, making SEALS widely applicable. Finding the nearest neighbors of each labeled example in the unlabeled data can be performed efficiently, with sublinear retrieval times (Charikar, 2002) and sub-second latency on billion-scale datasets (Johnson et al., 2017) for approximate approaches. While constructing the index for similarity search requires at least a linear pass over the unlabeled data, this computational cost is effectively amortized over many selection rounds or other applications. As a result, our SEALS approach enables selection to scale with the size of the labeled data rather than the size of the unlabeled data, making active learning and search tractable on datasets with billions of unlabeled examples. We empirically evaluated SEALS for both active learning and search on three large-scale computer vision datasets: ImageNet (Russakovsky et al., 2015), OpenImages (Kuznetsova et al., 2020), and a proprietary dataset of 10 billion images from a large internet company. We selected 611 concepts spread across these datasets that range in prevalence from 0.203% to 0.002% (1 in 50,000) of the training examples. We evaluated three selection strategies for each concept: max entropy uncertainty sampling (Lewis & Gale, 1994), information density (Settles & Craven, 2008), and most-likely positive (Warmuth et al., 2002; 2003; Jiang et al., 2018). Across datasets, selection strategies, and concepts, SEALS achieved similar model quality and nearly the same recall of positive examples as the baseline approaches, while improving the computational complexity by orders of magnitude.
On ImageNet with a budget of 2,000 binary labels per concept (~0.31% of the unlabeled data), all baseline and SEALS approaches were within 0.011 mAP of full supervision and recalled over 50% of the positive examples. On OpenImages, SEALS reduced the candidate pool to 1% of the unlabeled data on average while remaining within 0.013 mAP and 0.1% recall of the baseline approaches. On the proprietary dataset with 10 billion images, SEALS needed an even smaller fraction of the data, about 0.1%, to match the baseline, which allowed SEALS to run on a single machine rather than a cluster. To the best of our knowledge, no other work has performed active learning at this scale. We also applied SEALS to the NLP spoiler detection dataset Goodreads (Wan et al., 2019), where it achieved the same recall as the baseline approaches while considering less than 1% of the unlabeled data. Together, these results demonstrate that SEALS' improvements to computational efficiency make active learning and search tractable even for billion-scale datasets.

2. RELATED WORK

Active learning's iterative retraining, combined with the high computational complexity of deep learning models, has led to significant work on computational efficiency (Sener & Savarese, 2018; Kirsch et al., 2019; Pinsler et al., 2019; Coleman et al., 2020; Yoo & Kweon, 2019; Mayer & Timofte, 2020; Zhu & Bento, 2017). One branch of recent work has focused on selecting large batches of data to minimize the amount of retraining and reduce the number of selection rounds necessary to reach a target budget (Sener & Savarese, 2018; Kirsch et al., 2019; Pinsler et al., 2019). These approaches introduce novel techniques to avoid selecting highly similar or redundant examples and ensure the batches are both informative and diverse. In comparison, our work aims to reduce the number of examples considered in each selection round and complements existing work on batch active learning. Many of these approaches sacrifice computational complexity to ensure diversity, and their selection methods can scale quadratically with the size of the unlabeled data. Combined with our method, these selection methods scale with the size of the labeled data rather than the unlabeled data. Outside of batch active learning, other work has tried to improve computational efficiency either by using much smaller models as cheap proxies during selection (Yoo & Kweon, 2019; Coleman et al., 2020) or by generating examples (Mayer & Timofte, 2020; Zhu & Bento, 2017). Using a smaller model reduces the amount of computation per example, but unlike our approach, it still requires making multiple passes over the entire unlabeled pool of examples. The generative approaches (Mayer & Timofte, 2020; Zhu & Bento, 2017), however, enable sub-linear runtime complexity like our approach. Unfortunately, they struggle to match the label efficiency of traditional approaches because the quality of the generated examples is highly variable.
Active search is a sub-area of active learning that focuses on highly skewed class distributions (Garnett et al., 2012; Jiang et al., 2017; 2018; 2019). Rather than optimizing for model quality, active search aims to find as many examples from the minority class as possible. Prior work has focused on applications such as drug discovery, where dataset sizes are limited and labeling costs are exceptionally high. Our work similarly focuses on skewed distributions. However, we consider novel active search settings in image and text where the available unlabeled datasets are much larger and computational efficiency is a significant bottleneck.

k-nearest-neighbor (k-NN) classifiers are popular models in active learning and search because they do not require an explicit training phase (Joshi et al., 2012; Wei et al., 2015; Garnett et al., 2012; Jiang et al., 2017; 2018). The prediction and score for each unlabeled example can be updated immediately after each new batch of labels. In comparison, our SEALS approach uses k-NN algorithms for similarity search, to create and expand the candidate pool, and not as a classifier. This is an important but subtle difference. While prior work avoids expensive training by using k-NN classifiers, these approaches still require evaluating all of the unlabeled examples, which can be prohibitively expensive on large-scale datasets like the ones we consider here. SEALS targets the selection phase rather than training, presenting a novel and complementary approach.

3. METHODS

In this section, we outline the problems of active learning (Section 3.1) and search (Section 3.2) formally as well as the selection methods we accelerate using SEALS. For both, we examine the pool-based setting, where all of the unlabeled data is available at once, and examples are selected in batches to improve computational efficiency, as mentioned above. Then in Section 3.3, we describe our SEALS approach and how it further improves computational efficiency in both settings.

3.1. ACTIVE LEARNING

Pool-based active learning is an iterative process that begins with a large pool of unlabeled data U = {x_1, ..., x_n}. Each example (x_i, y_i) is sampled from the space X with an unknown label from the label space Y = {1, ..., C}. We additionally assume a feature extraction function G_z that embeds each x_i as a latent variable G_z(x_i) = z_i, and that the C concepts are unequally distributed. Specifically, there are one or more valuable rare concepts R ⊂ C that appear in less than 1% of the unlabeled data. For simplicity, we frame this as |R| binary classification problems solved independently rather than one multi-class classification problem with |R| concepts. Initially, each rare concept has a small number of positive examples and several negative examples that serve as a labeled seed set L^0_r. The goal of active learning is to take this seed set and select up to a budget of T examples to label that produce a model A^T_r that achieves low error. For each round t in pool-based active learning, the most informative examples are selected in batches of size b according to the selection strategy φ from a pool of candidate examples P_r and labeled, as shown in Algorithm 1. For the baseline approach, P_r = {G_z(x) | x ∈ U}, meaning that all the unlabeled examples are considered to find the global optimum according to φ. Between rounds, the model A^t_r is trained on all of the labeled data L^t_r, allowing the selection process to adapt. In this paper, we considered max entropy (MaxEnt) uncertainty sampling (Lewis & Gale, 1994):

φ_MaxEnt(z) = −Σ_ŷ P(ŷ | z; A_r) log P(ŷ | z; A_r)

and information density (ID) (Settles & Craven, 2008):

φ_ID(z) = φ_MaxEnt(z) × ( (1/|P_r|) Σ_{z_p ∈ P_r} sim(z, z_p) )^β

where sim(z, z_p) is the cosine similarity of the embedded examples and β = 1. Note that for binary classification, max entropy is equivalent to least confidence and margin sampling, which are also popular criteria for uncertainty sampling (Settles, 2009).
While max entropy uncertainty sampling only requires a linear pass over the unlabeled data, ID scales quadratically with |U| because it weights each example's informativeness by its similarity to all other examples. To improve computational performance, the average similarity score for each example can be cached after the first selection round, so subsequent rounds scale linearly. This optimization only works when G_z is fixed and would not apply to dynamic similarity calculations like those in Sener & Savarese (2018). We explored the greedy k-centers approach from Sener & Savarese (2018) but found that it never outperformed random sampling in our experimental setup. Unlike MaxEnt and ID, k-centers does not consider the predicted labels. It tries to achieve high coverage over the entire candidate pool, of which rare concepts make up a small fraction by definition, making it ineffective for our setting.

Algorithm 1 BASELINE APPROACH
Input: unlabeled data U, labeled seed set L^0_r, feature extractor G_z, selection strategy φ(·), batch size b, labeling budget T
1: L_r = {(G_z(x), y) | (x, y) ∈ L^0_r}
2: P_r = {G_z(x) | x ∈ U and (x, ·) ∉ L^0_r}
3: repeat
4:   A_r = train(L_r)
5:   for i = 1 to b do
6:     z* = argmax_{z ∈ P_r} φ(z)
7:     L_r = L_r ∪ {(z*, label(x*))}
8:     P_r = P_r \ {z*}
9:   end for
10: until |L_r| = T

3.2. ACTIVE SEARCH

Active search is closely related to active learning, so much of the formalism from Section 3.1 carries over. The critical difference is that rather than selecting examples to label that minimize error, the goal of active search is to maximize the number of examples from the target concept r, expressed with the natural utility function u(L_r) = Σ_{(x,y) ∈ L_r} 1{y = r}. As a result, different selection strategies are favored, but the overall algorithm is the same as Algorithm 1.
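To make Algorithm 1 concrete, here is a minimal, self-contained sketch of the baseline selection loop. It substitutes a nearest-centroid scorer for the logistic regression used in our experiments, and all names (`train_centroid`, `p_positive`, `baseline_select`) are illustrative rather than from any released implementation:

```python
import numpy as np

def train_centroid(Z, y):
    # Stand-in for training A_r (the paper uses logistic regression on
    # embeddings): summarize each class by its centroid.
    return Z[y == 1].mean(axis=0), Z[y == 0].mean(axis=0)

def p_positive(model, Z):
    # Softmax over negated distances to the two centroids -> P(y=1 | z).
    pos, neg = model
    d_pos = np.linalg.norm(Z - pos, axis=1)
    d_neg = np.linalg.norm(Z - neg, axis=1)
    return np.exp(-d_pos) / (np.exp(-d_pos) + np.exp(-d_neg))

def phi_maxent(p):
    # Binary max-entropy score: highest when p is close to 0.5.
    eps = 1e-12
    return -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

def baseline_select(Z, y, seed, phi, b, T):
    # Algorithm 1: every round retrains A_r on L_r, then scans the ENTIRE
    # candidate pool P_r (all unlabeled embeddings) for the top-b examples.
    labeled = list(seed)
    pool = [i for i in range(len(Z)) if i not in set(seed)]
    while len(labeled) < T and pool:
        model = train_centroid(Z[labeled], y[labeled])
        scores = phi(p_positive(model, Z[pool]))       # O(|U|) scan per round
        chosen = [pool[j] for j in np.argsort(-scores)[:b]]
        labeled.extend(chosen)                         # oracle labels y[chosen]
        pool = [i for i in pool if i not in set(chosen)]
    return labeled
```

With φ_MaxEnt plugged in, each round scans the entire pool; this per-round cost in |U| is exactly what SEALS removes in Section 3.3.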
In this paper, we consider an additional selection strategy to target the active search setting, most-likely positive (MLP) (Warmuth et al., 2002; 2003; Jiang et al., 2018):

φ_MLP(z) = P(r | z; A_r)

Because active learning and search are similar, we evaluate all the selection criteria from Sections 3.1 and 3.2 in terms of both the error the model achieves and the number of positive examples found.
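For reference, the three selection strategies reduce to a few lines given the predicted class probabilities and l2-normalized embeddings (a sketch with our own function names; the naive φ_ID below is the quadratic-cost form discussed in Section 3.1):

```python
import numpy as np

def phi_maxent(P):
    # Entropy of each row of class probabilities P (shape: n x C).
    eps = 1e-12  # guard against log(0)
    return -np.sum(P * np.log(P + eps), axis=1)

def phi_id(P, Z, beta=1.0):
    # Information density: entropy weighted by mean cosine similarity to the
    # pool. Z holds l2-normalized embeddings, so Z @ Z.T gives cosine
    # similarities; computing it in full is the O(|P_r|^2) step.
    avg_sim = (Z @ Z.T).mean(axis=1)
    return phi_maxent(P) * avg_sim ** beta

def phi_mlp(P, r):
    # Most-likely positive: the predicted probability of the target concept r.
    return P[:, r]
```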

3.3. SIMILARITY SEARCH FOR EFFICIENT ACTIVE LEARNING AND SEARCH (SEALS)

In this work, we propose SEALS to accelerate the inner loop of active learning and search by restricting the candidate pool of unlabeled examples. To apply SEALS, we use an efficient method for similarity search over the embedded examples (Charikar, 2002; Johnson et al., 2017) and make two modifications to the baseline approach, as shown in Algorithm 2:

1. The candidate pool P_r is restricted to the nearest neighbors of the labeled examples.
2. After every example is selected, we find its k nearest neighbors and update P_r.

Computational savings. Restricting the candidate pool P_r to the k nearest neighbors of the labeled examples means we only apply the selection strategy to at most k|L_r| examples. This can be done transparently for many selection strategies, making it applicable to a wide range of active learning and search methods, even beyond the ones considered here. Finding the k nearest neighbors for each newly labeled example adds overhead, but this can be done efficiently, with sublinear retrieval times (Charikar, 2002) and sub-second latency on billion-scale datasets (Johnson et al., 2017) for approximate approaches. As a result, the computational complexity of each selection round scales with the size of the labeled dataset rather than the unlabeled dataset. Excluding the retrieval times for the k nearest neighbors, the computational savings from SEALS are directly proportional to the pool size reduction for φ_MaxEnt and φ_MLP, which is lower bounded by |U| / (k|L_r|). For φ_ID, the average similarity score for each example only needs to be computed once, when the example is first selected. This caching means the first round scales quadratically with |U| and subsequent rounds scale linearly for the baseline approach. With SEALS, each selection round scales according to O((1 + bk)|P_r|) because the similarity scores are calculated as examples are selected rather than all at once.
The resulting computational savings of SEALS vary with the labeling budget T as the upfront cost of the baseline amortizes. Nevertheless, for large-scale datasets with millions or billions of examples, performing that first quadratic round of the baseline is prohibitively expensive.

Index construction. Generating the embeddings and indexing the data can be expensive and slow, requiring at least a linear pass over the unlabeled data. However, this cost is effectively amortized over many selection rounds, concepts, or other applications. Similarity search is a critical workload for information retrieval and powers many applications, including recommendation. Increasingly, embeddings from deep learning models are being used (Babenko et al., 2014; Babenko & Lempitsky, 2016; Johnson et al., 2017). As a result, the embeddings and index can be generated once, using a generic model trained with weak supervision or self-supervision, and then reused, making our approach just one of many applications sharing the index. Alternatively, if the data has already been passed through a predictive system (for example, to tag or classify uploaded images), the embeddings could be captured and indexed at inference time to avoid additional costs.
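The two modifications can be sketched end-to-end as below, with a brute-force neighbor lookup standing in for the approximate index and a most-likely-positive-style score; the function names are ours, not the paper's:

```python
import numpy as np

def knn(Z, z, k):
    # Brute-force stand-in for the approximate similarity index: in practice
    # this lookup is sublinear (LSH) and sub-second even at billion scale.
    d = np.linalg.norm(Z - z, axis=1)
    return np.argsort(d)[:k]

def seals_select(Z, y, seed, b, T, k=10):
    # Algorithm 2 sketch: P_r holds only the k-NN of labeled examples, so each
    # round scores at most k * |L_r| candidates instead of all of U.
    labeled = list(seed)
    pool = set()
    for i in labeled:                                # modification 1: P_r = N(L_r, k)
        pool |= set(knn(Z, Z[i], k).tolist())
    pool -= set(labeled)
    while len(labeled) < T and pool:
        pos = Z[[i for i in labeled if y[i] == 1]].mean(axis=0)
        cand = sorted(pool)
        scores = -np.linalg.norm(Z[cand] - pos, axis=1)  # MLP-style score
        for j in np.argsort(-scores)[:b]:
            z_star = cand[j]
            pool.discard(z_star)
            labeled.append(z_star)                   # oracle labels x*
            # modification 2: expand P_r with the neighbors of each new label
            pool |= set(knn(Z, Z[z_star], k).tolist()) - set(labeled)
    return labeled
```

Because `pool` only ever contains neighbors of labeled examples, each round touches at most k|L_r| candidates, matching the complexity argument above.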

4. RESULTS

We applied SEALS to three selection strategies and performed active learning and search on three datasets: ImageNet (Russakovsky et al., 2015), OpenImages (Kuznetsova et al., 2020), and a proprietary dataset of 10 billion images. Section 4.1 details the experimental setup for each dataset and the inputs used for both the baseline approach (Algorithm 1) and our proposed method, SEALS (Algorithm 2). Sections 4.2 and 4.3 present the empirical results for active learning and search. Section 4.4 explores the structure of the concepts through the nearest neighbor graphs and embeddings. Across selection strategies, datasets, and concepts, SEALS using ResNet-50 (He et al., 2016) embeddings performed similarly to the baseline while only considering a fraction of the unlabeled data U in the candidate pool P_r for each concept. For MLP and MaxEnt, the smaller candidate pool from SEALS sped up the selection runtime by over 180× on OpenImages. This allowed us to run active learning and search efficiently on an industrial-scale dataset with 10 billion images. The improvements were even larger for information density. On ImageNet, SEALS dropped the time for the first selection round from over 75 minutes to 1.5 seconds, over a 3000× improvement. On OpenImages, the baseline for information density ran for over 24 hours without completing a single round, while SEALS took less than 3 minutes to perform 19 rounds. We observed similar results with self-supervised embeddings using SimCLR (Chen et al., 2020) for ImageNet (Appendix A.3) and with SentenceBERT (Reimers & Gurevych, 2019) for Goodreads spoiler detection (Appendix A.8). Across all datasets and selection strategies, we followed the same general procedure for both active learning and search. Because we are interested in rare concepts, we kept the number of initial positive examples small. We evaluated three settings, with 5, 20, and 50 positives, but only include the results with the smallest size in this section.
The others are shown in Appendix A.1. For each setting, negative examples were randomly selected at a ratio of 19 negative examples for every positive example to form the seed set L^0_r. The slightly higher number of negatives in the initial seed set improved average precision on the validation set across all three datasets. The batch size b for each selection round was the same as the size of the initial seed set. For the seed set of 5 positive and 95 negative examples shown below, b was 100 and the labeling budget T was 2,000 examples. As the binary classifier A_r for each concept, we used logistic regression trained on the embedded examples. For active learning, we calculated average precision on the test data for each binary concept classifier after each selection round. For active search, we counted the number of positive examples labeled so far. We take the mean average precision (mAP) and the mean number of positives across concepts, run each experiment 5 times, and report the mean and standard deviation. For similarity search, we used locality-sensitive hashing (LSH) (Charikar, 2002) implemented in Faiss (Johnson et al., 2017) with Euclidean distance for all datasets aside from the 10 billion images dataset. This simplified our implementation, so the index could be created quickly and independently for each concept and configuration, allowing experiments to run trivially in parallel. However, retrieval times for this approach were not as fast as in Johnson et al. (2017) and made up a larger part of the overall active learning loop. In practice, the search index can be heavily optimized and tuned for the specific data distribution, leading to computational savings closer to the improvements described in Section 3.3 and differences in the "Selection" portion of the runtimes in Table 2. We split the data, selected concepts, and created embeddings as detailed below and summarized in Table 1.
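To illustrate the hashing scheme behind these sublinear retrieval times (Charikar, 2002), the toy sketch below hashes l2-normalized embeddings with random hyperplanes so that the Hamming distance between codes approximates angular distance. A production index (e.g., the tuned Faiss indexes mentioned above) is far more sophisticated; the function names here are ours:

```python
import numpy as np

def lsh_codes(Z, n_bits, seed=0):
    # Each bit is the sign of a random projection: vectors separated by a
    # small angle agree on most bits (Charikar, 2002).
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((Z.shape[1], n_bits))
    return (Z @ H > 0).astype(np.uint8)

def hamming_topk(codes, query_code, k):
    # Approximate nearest neighbors: smallest Hamming distance to the query.
    d = (codes != query_code).sum(axis=1)
    return np.argsort(d)[:k]
```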
Note that our approach does not constrain the choice of G_z, which allows for many network architectures. As representations continue to improve with new self-supervision, generative, or transfer learning techniques, SEALS will remain applicable and its performance will likely improve as well.

ImageNet (Russakovsky et al., 2015) has 1.28 million training images spread over 1,000 classes. To simulate rare concepts, we split the data in half, using 500 classes to train the feature extractor G_z and treating the other 500 classes as unseen concepts. For G_z, we used ResNet-50 but added a bottleneck layer before the final output to reduce the dimension of the embeddings to 256. We kept all of the other training hyperparameters the same as in He et al. (2016). We extracted features from the bottleneck layer and applied l2 normalization. In total, the 500 unseen concepts had 639,906 training examples that served as the unlabeled pool. We used 50 concepts for validation, leaving the remaining 450 concepts for our final experiments. The number of examples for each concept varied slightly, ranging from 0.114% to 0.203% of |U|. The 50,000 validation images were used as the test set.

OpenImages (Kuznetsova et al., 2020) has 7.34 million images with human-verified labels spread over 19,958 classes, taken as an unbiased sample from Flickr. However, only 6.82 million images were still available in the training set at the time of writing. As a feature extractor, we took ResNet-50 pre-trained on all of ImageNet and used the l2-normalized output from the bottleneck layer. As rare concepts, we randomly selected 200 classes with between 100 and 6,817 positive training examples. We reviewed the selected classes and removed 47 that overlapped with ImageNet. The remaining 153 classes appeared in 0.002% to 0.088% of the data. We used the same hyperparameters as the ImageNet experiments and the predefined OpenImages test split for evaluation.
10 billion (10B) images from a large internet company were used to test SEALS' scalability. For the feature extractor, we used the same pre-trained ResNet-50 model as the OpenImages experiments. We also selected 8 additional classes from OpenImages as rare concepts: rat, sushi, bowling, beach, hawk, cupcake, and crowd. This allowed us to use the predefined test split from OpenImages for evaluation. Unlike the other datasets, we hired annotators to label images as they were selected and used a proprietary index to achieve low latency retrieval times to capture a real-world setting.
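Sections 4.2 and 4.3 report mean average precision and recall per concept; for clarity, average precision for a single binary concept can be computed as below (a standard formulation, with our own function name):

```python
import numpy as np

def average_precision(y_true, scores):
    # AP = mean of precision@i over the ranks i where a positive is retrieved,
    # with candidates sorted by decreasing classifier score.
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)
    precision_at = hits / np.arange(1, len(y) + 1)
    return float(precision_at[y == 1].mean())
```

Averaging this quantity over the evaluated concepts gives the mAP values reported in the figures and tables.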

4.2. ACTIVE LEARNING

Across datasets and selection strategies, SEALS performed similarly to the baseline approaches that considered all of the unlabeled data in the candidate pool, as shown in Figures 1 and 2.

ImageNet. With a labeling budget of 2,000 examples per concept (~0.31% of |U|), all baseline and SEALS approaches (k = 100) were within 0.011 mAP of the 0.699 mAP achieved with full supervision. In contrast, random sampling (Random-All) only achieved 0.436 mAP. MLP-All, MaxEnt-All, and ID-All achieved mAPs of 0.693, 0.695, and 0.688, respectively, while the SEALS equivalents were all within 0.001 mAP at 0.692, 0.695, and 0.688, respectively, and considered less than 7% of the unlabeled data. The resulting selection runtime for MLP-SEALS and MaxEnt-SEALS dropped by over 25×, leading to a 3.6× speed-up overall (Table 2). The speed-up was even larger for ID-SEALS, ranging from about 45× at 2,000 labels to 3000× at 200 labels. Even at a per-class level, the results were highly correlated, with Pearson correlation coefficients of 0.9998 or more (Figure 10a in the Appendix). The reduced skew from the nearest neighbor expansion of the initial seed set only accounted for a small part of the improvement, as Random-SEALS achieved an mAP of 0.498.

OpenImages. The gap between the baseline approaches and SEALS widened slightly for OpenImages. At 2,000 labels per concept (~0.029% of |U|), MaxEnt-All and MLP-All achieved 0.399 and 0.398 mAP, respectively, while MaxEnt-SEALS and MLP-SEALS both achieved 0.386 mAP and considered less than 1% of the data. This sped up selection by over 180× and the total time by over 3×. Increasing k to 1,000 significantly narrowed this gap for MaxEnt-SEALS and MLP-SEALS, improving mAP to 0.395, as shown in the Appendix (Figure 7). Moreover, SEALS made ID tractable on OpenImages by reducing the candidate pool to 1% of the unlabeled data, whereas ID-All ran for over 24 hours in wall-clock time without completing a single round (Table 2).

4.3. ACTIVE SEARCH

As shown in Figures 1 and 2, SEALS recalled nearly the same number of positive examples as the baseline approaches for all of the considered concepts, datasets, and selection strategies.

ImageNet. Unsurprisingly, MLP-All and MLP-SEALS significantly outperformed all of the other selection strategies for active search. At 2,000 labeled examples per concept, both approaches recalled over 74% of the positive examples for each concept, at 74.5% and 74.2% recall, respectively. MaxEnt-All and MaxEnt-SEALS had a similar gap of 0.3%, labeling 57.2% and 56.9% of positive examples, while ID-All and ID-SEALS were even closer, with a gap of only 0.1% (50.8% vs. 50.9%). Nearly all of the gains in recall are due to the selection strategies rather than the reduced skew in the initial seed, as Random-SEALS increased recall by less than 1.0% over Random-All.

OpenImages. The gap between the baseline approaches and SEALS was even smaller on OpenImages, despite SEALS considering a much smaller fraction of the overall unlabeled pool. MLP-All, MLP-SEALS, MaxEnt-SEALS, and MaxEnt-All were all within 0.1%, with ~35% recall at 2,000 labels per concept. ID-SEALS had a recall of 29.3% but scaled nearly as well as the linear approaches.

10B images. SEALS performed as well as the baseline approach despite considering less than 0.1% of the data and collected 2 orders of magnitude more positive examples than random sampling.

4.4. LATENT STRUCTURE OF UNSEEN CONCEPTS

To better understand why and when SEALS works, we analyzed the nearest neighbor graph across concepts and values of k. Figure 3 shows the cumulative distribution functions (CDFs) of the size of the largest connected component within each concept and of the average shortest path between examples in that component. The 10B images dataset was excluded because only a few thousand examples were labeled. The largest connected component gives a sense of how much of the concept SEALS can reach, while the average shortest path serves as a proxy for how long exploration will take. In general, SEALS performed better for concepts that formed larger connected components and had shorter paths between examples (Figure 11). For OpenImages, rare concepts were more fragmented, but each component was fairly tight, leading to short paths between examples. On a per-class level, concepts like "monster truck" and "blackberry" performed much better than generic concepts like "electric blue" and "meal" that were more scattered (Appendix A.6 and A.7). This fragmentation partly explains the gap between SEALS and the baselines in Section 4.2, and why increasing k closed it. However, even for small values of k, there were significant gains over random sampling, as shown in Figures 6 and 7 in the Appendix.
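The two graph statistics used in this analysis can be computed directly from a k-NN graph over a concept's positive examples. A pure-Python sketch (brute-force graph construction plus BFS flood fill; the function names are ours):

```python
import numpy as np
from collections import deque

def knn_graph(Z, k):
    # Undirected k-NN graph: connect each example to its k nearest neighbors.
    adj = [set() for _ in range(len(Z))]
    for i in range(len(Z)):
        d = np.linalg.norm(Z - Z[i], axis=1)
        for j in np.argsort(d)[1:k + 1]:  # skip self at position 0
            adj[i].add(int(j))
            adj[int(j)].add(i)
    return adj

def largest_component(adj):
    # BFS flood fill; returns the node set of the largest connected component.
    seen, best = set(), set()
    for s in range(len(adj)):
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    q.append(v)
        if len(comp) > len(best):
            best = comp
    return best
```

Running BFS from each node of the largest component also yields the average shortest path; at the scale of our datasets, both statistics are computed on the same approximate index used for selection rather than by brute force.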

5. CONCLUSION

In this work, we introduced Similarity search for Efficient Active Learning and Search (SEALS) as a simple approach to accelerate active learning and search that can be applied to a wide range of existing algorithms. SEALS restricted the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. Across three large datasets, three selection strategies, and 611 concepts, we found that SEALS achieved similar model quality and recall of positive examples while improving computational efficiency by orders of magnitude.

A APPENDIX

Figure 6: Impact of increasing k on ImageNet (|U |=…,906). Larger values of k help to close the gap between SEALS and the baseline approach that considers all of the unlabeled data for both active learning (top) and active search (middle). However, increasing k also increases the candidate pool size (bottom), presenting a trade-off between labeling efficiency and computational efficiency.


Figure 8: Active learning and search on ImageNet with self-supervised embeddings from SimCLR (Chen et al., 2020). Because the self-supervised training for the embeddings did not use the labels, results are averaged across all 1,000 classes and |U |=1,281,167. To compensate for the larger unlabeled pool, we extended the total labeling budget to 4,000, compared to the 2,000 used in Figure 1. Across strategies, SEALS with k = 100 substantially outperforms random sampling in terms of both the mAP the model achieves for active learning (left) and the recall of positive examples for active search (right), while only considering a fraction of the data U (middle). For active learning, the gap between the baseline and SEALS approaches is slightly larger than in Figure 1, which is likely due to the larger pool size and increased average shortest paths (see Figure 9).

Figure 10: The per-class APs of SEALS were highly correlated with the baseline approaches (*-All) for active learning on ImageNet (right) and OpenImages (left). On OpenImages with k = 100 and a budget of 2,000 labels, the Pearson's correlation (ρ) between the baseline and SEALS for the average precision of individual classes was 0.986 for MaxEnt and 0.987 for MLP. The least-squares fit had a slope of 0.99 and a y-intercept of -0.01. On ImageNet, the correlations were even higher.

Figure 14: ID-SEALS (k = 100) versus ID applied to a candidate pool of randomly selected examples (RandPool). Because the concepts we considered were so rare, as is often the case in practice, randomly chosen examples are unlikely to be close to the decision boundary, and a much larger pool is required to match SEALS. On ImageNet (top), ID-SEALS outperformed ID-RandPool in terms of both the error the model achieves for active learning (left) and the recall of positive examples for active search (right), even with a pool containing 10% of the data (middle).
On OpenImages (bottom), ID-RandPool needed at least 5× as much data to match ID-SEALS for active learning and failed to achieve similar recall even with 10× the data.

A.6 ACTIVE LEARNING ON EACH SELECTED CLASS FROM OPENIMAGES

We followed the same general procedure described in Section 4.1, aside from the dataset-specific details below. Goodreads spoiler detection (Wan et al., 2019) had 17.67 million sentences with binary spoiler annotations. Spoilers made up 3.224% of the data, making them much more common than the rare concepts we evaluated in the other datasets. Following Wan et al. (2019), we used 3.53 million sentences for testing (20%), 10,000 sentences as the validation set, and the remaining 14.13 million sentences as the unlabeled pool. We also switched to the area under the ROC curve (AUC) as our primary evaluation metric for active learning to be consistent with Wan et al. (2019). For G_z, we used a pre-trained Sentence-BERT model (SBERT-NLI-base) (Reimers & Gurevych, 2019), applied PCA whitening to reduce the dimension to 256, and performed l2 normalization.

A.8.1 ACTIVE SEARCH

SEALS achieved the same recall as the baseline approaches but considered less than 1% of the unlabeled data in the candidate pool, as shown in Figure 15. At a labeling budget of 2,000, MLP-All and MLP-SEALS recalled 0.15 ± 0.02% and 0.17 ± 0.05% of positives, respectively, while MaxEnt-All and MaxEnt-SEALS achieved 0.14 ± 0.04% and 0.11 ± 0.06% recall, respectively. Increasing the labeling budget to 50,000 examples increased recall to ~3.7% for MaxEnt and MLP but maintained a similar relative improvement over random sampling, as shown in Figure 16. ID-SEALS performed worse than the other strategies. However, all of the active selection strategies outperformed random sampling by up to an order of magnitude.
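The embedding post-processing described above for Goodreads (PCA whitening to 256 dimensions followed by l2 normalization) can be sketched as below. This is a minimal illustration using scikit-learn's PCA rather than the authors' exact pipeline; the function name is ours, and the input is assumed to be an already-computed matrix of raw Sentence-BERT embeddings.

```python
import numpy as np
from sklearn.decomposition import PCA

def postprocess_embeddings(E, dim=256):
    """PCA-whiten raw sentence embeddings down to `dim` dimensions,
    then l2-normalize each row so cosine similarity reduces to a dot
    product. E is the (n, d) matrix of raw embeddings."""
    Z = PCA(n_components=dim, whiten=True).fit_transform(E)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return Z
```

Normalizing the rows means the k-NN index can use inner-product search, which is how cosine similarity is typically served at scale.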

A.8.3 LATENT STRUCTURE

The large number of positive examples in the Goodreads dataset limited the analysis we could perform. We could only calculate the size of the largest connected component in the nearest neighbor graph (Figure 17). For k = 10, only 28.4% of the positive examples could be reached directly, but increasing k to 100 improved that dramatically to 96.7%. For such a large connected component, one might have expected active learning to perform better in Section A.8.2. By analyzing the embeddings, however, we found that examples were spread almost uniformly across the space, with an average cosine similarity of 0.004. For comparison, the average cosine similarity for concepts in ImageNet and OpenImages was 0.453 ± 0.077 and 0.361 ± 0.105, respectively. This uniformity was likely due to the higher fraction of positive examples and to spoilers being book-specific while Sentence-BERT is trained on generic data. As a result, even if spoilers were tightly clustered within each book, the books were spread across a range of topics and consequently across the embedding space, illustrating a limitation and an opportunity for future work.
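The average cosine similarity reported above need not be computed with an O(n²) pairwise pass: for l2-normalized embeddings, the sum of all off-diagonal dot products equals ‖Σᵢ zᵢ‖² − n, since each ‖zᵢ‖² = 1. A minimal sketch (function name ours):

```python
import numpy as np

def mean_cosine_similarity(Z):
    """Mean pairwise cosine similarity of the rows of Z, computed from
    the sum vector: sum_{i != j} z_i . z_j = ||sum_i z_i||^2 - n,
    after l2-normalizing each row."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    n = len(Z)
    s = Z.sum(axis=0)
    return (s @ s - n) / (n * (n - 1))
```

Identical rows give a value of 1.0, while embeddings spread uniformly over the sphere give a value near 0, matching the contrast drawn above between ImageNet/OpenImages concepts and Goodreads spoilers.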



6:     z* = arg max_{z ∈ P_r} φ(z)
7:     L_r = L_r ∪ {(z*, label(x*))}
8:     P_r = P_r \ {z*}
9:   end for
10: until |L_r| = T

Algorithm 2 SEALS
Input: unlabeled data U, labeled seed set L^0_r, feature extractor G_z, selection strategy φ(·), batch size b, labeling budget T, k-nearest neighbors implementation N(·, ·)
1: L_r = {(G_z(x), y) | (x, y) ∈ L^0_r}
2: P_r = ∪_{(z,y) ∈ L_r} N(z, k)
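Putting the pseudocode fragments above together, the SEALS loop might be sketched as follows. This is an illustrative reconstruction, not the authors' code: it uses scikit-learn's brute-force NearestNeighbors in place of an approximate index, a MaxEnt-style uncertainty score for φ, and hypothetical names throughout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def seals(Z, labels, seed_idx, k=10, batch=5, budget=30):
    """Minimal SEALS sketch: restrict the candidate pool P_r to the
    k-nearest neighbors of the labeled set L_r instead of scanning all
    of U. Z holds precomputed embeddings G_z(x); `labels` stands in
    for the labeling oracle."""
    nn = NearestNeighbors(n_neighbors=k).fit(Z)  # stand-in for an ANN index
    labeled = list(seed_idx)
    # P_r: union of neighbors of each labeled example, minus L_r itself
    pool = set(nn.kneighbors(Z[labeled], return_distance=False).ravel()) - set(labeled)
    while len(labeled) < budget and pool:
        clf = LogisticRegression(max_iter=1000).fit(Z[labeled], labels[labeled])
        cand = np.fromiter(pool, dtype=int)
        # MaxEnt-style selection: candidates whose probability is closest to 0.5
        scores = np.abs(clf.predict_proba(Z[cand])[:, 1] - 0.5)
        take = min(batch, budget - len(labeled))
        for z in cand[np.argsort(scores)[:take]]:
            labeled.append(int(z))  # query the oracle for label(x*)
            # expand P_r with the new example's neighbors
            pool |= set(nn.kneighbors(Z[[z]], return_distance=False).ravel())
        pool -= set(labeled)  # keep P_r disjoint from L_r
    return labeled
```

Because the pool only grows by at most b·k candidates per round, each selection step scales with the labeled set rather than with |U|.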

Figure 1: Active learning and search on ImageNet (top) and OpenImages (bottom). Across datasets and strategies, SEALS with k = 100 performed similarly to the baseline approach in terms of both the error the model achieved for active learning (left) and the recall of positive examples for active search (right), while only considering a fraction of the data U (middle).

Figure 2: Active learning and search on a proprietary dataset of 10 billion images. Across strategies, SEALS with k = 10, 000 performed similarly to the baseline approach in terms of both the error the model achieved for active learning (left) and the recall of positive examples for active search (right), while only considering a fraction of the data U (middle).

Figure 3: Measurements of the latent structure of unseen concepts in ImageNet (left) and OpenImages (right). Across datasets, the k-nearest neighbor graph of unseen concepts was well connected, forming large connected components (top) for even moderate values of k. The components were tightly packed, leading to short paths between examples (bottom).

Figure 4: Active learning and search with 20 positive seed examples and a labeling budget of 10,000 examples on ImageNet (top) and OpenImages (bottom). Across datasets and strategies, SEALS with k = 100 performs similarly to the baseline approach in terms of both the error the model achieves for active learning (left) and the recall of positive examples for active search (right), while only considering a fraction of the unlabeled data U (middle).

Figure 7: Impact of increasing k on OpenImages (|U |=6,816,296). Larger values of k help to close the gap between SEALS and the baseline approach that considers all of the unlabeled data for both active learning (top) and active search (middle). However, increasing k also increases the candidate pool size (bottom), presenting a trade-off between labeling efficiency and computational efficiency.

Figure 9: Measurements of the latent structure of unseen concepts in ImageNet with self-supervised embeddings from SimCLR (Chen et al., 2020). In comparison to Figure 3a, the k-nearest neighbor graph for unseen concepts was still well connected, forming large connected components (left) for even moderate values of k, but the average shortest path between examples was slightly longer (right). The increased path length is not too surprising considering the fully supervised model still outperformed the linear evaluation of the self-supervised embeddings in Chen et al. (2020).

Figure 11: SEALS achieved higher APs for classes that formed larger connected components (left) and had shorter paths between examples (right) in ImageNet (top) and OpenImages (bottom).

Figure 13: MLP-SEALS (k = 100) versus MLP applied to a candidate pool of randomly selected examples (RandPool). Because the concepts we considered were so rare, as is often the case in practice, randomly chosen examples are unlikely to be close to the decision boundary, and a much larger pool is required to match SEALS. On ImageNet (top), MLP-SEALS outperformed MLP-RandPool in terms of both the error the model achieves for active learning (left) and the recall of positive examples for active search (right), even with a pool containing 10% of the data (middle). On OpenImages (bottom), MLP-RandPool needed at least 5× as much data to match MLP-SEALS for active learning and failed to achieve similar recall even with 10× the data.

Figure 15: Active learning and search on Goodreads with Sentence-BERT embeddings. Across strategies, SEALS with k = 100 performs similarly to the baseline approach in terms of both the error the model achieves for active learning (left) and the recall of positive examples for active search (right), while only considering a fraction of the data U (middle).

Figure 17: Cumulative distribution function (CDF) for the largest connected component in the Goodreads dataset with varying values of k.

Summary of datasets

Wall clock runtimes for varying selection strategies on ImageNet and OpenImages. The last 3 columns break the total time down into 1) the time to apply the selection strategy to the candidate pool, 2) the time to find the k nearest neighbors (k-NN) for the newly labeled examples, and 3) the time to train logistic regression on the currently labeled examples. Despite using a simple LSH index for similarity search, SEALS substantially improved runtimes across datasets and strategies.
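The "simple LSH index" mentioned above refers to locality-sensitive hashing; for cosine similarity, the classic construction is random-hyperplane hashing. A toy sketch (our own illustration, not the paper's implementation) looks like:

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Toy random-hyperplane LSH for cosine similarity: hash each vector
    by the sign pattern of `nbits` random projections, then restrict
    neighbor candidates to vectors that share a hash bucket."""
    def __init__(self, dim, nbits=16, seed=0):
        self.planes = np.random.default_rng(seed).normal(size=(dim, nbits))
        self.buckets = defaultdict(list)
        self.data = None

    def _hash(self, Z):
        bits = (Z @ self.planes) > 0
        return [tuple(row) for row in bits]

    def add(self, Z):
        self.data = Z
        for i, h in enumerate(self._hash(Z)):
            self.buckets[h].append(i)

    def query(self, z, k=10):
        # only score candidates in the query's bucket, not all of the data
        cand = self.buckets.get(self._hash(z[None, :])[0], [])
        if not cand:
            return []
        sims = self.data[cand] @ z
        return [cand[i] for i in np.argsort(-sims)[:k]]
```

Production systems would use multiple hash tables or a multi-probe strategy to improve recall, but even this single-table version shows why the k-NN step in the runtimes table stays cheap: each query touches one bucket rather than the full pool.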

For most concepts in ImageNet, the largest connected component contained the majority of examples, and the paths between examples were very short (Figure 11 in the Appendix). These tight clusters explain why so few examples were needed to learn accurate binary concept classifiers, as shown in Section 4.2, and why SEALS recovered ~74% of positive examples on average while only labeling ~0.31% of the data. If we had constructed the candidate pool by randomly selecting examples instead, mAP and recall would have dropped for all strategies (Appendix A.5); the concepts were so rare that randomly chosen examples were rarely close to the decision boundary.

A.1 NUMBER OF INITIAL POSITIVES

A.3 SELF-SUPERVISED EMBEDDINGS (SIMCLR) ON IMAGENET

Top 1/3 of classes from OpenImages for active learning (1 of 3). Average precision and measurements of the largest component (LC) for each selected class (153 total) from OpenImages with a labeling budget of 2,000 examples. Classes are ordered based on MaxEnt-SEALS.

Middle 1/3 of classes from OpenImages for active learning (2 of 3). Average precision and measurements of the largest component (LC) for each selected class (153 total) from OpenImages with a labeling budget of 2,000 examples. Classes are ordered based on MaxEnt-SEALS.

Bottom 1/3 of classes from OpenImages for active learning (3 of 3). Average precision and measurements of the largest component (LC) for each selected class (153 total) from OpenImages with a labeling budget of 2,000 examples. Classes are ordered based on MaxEnt-SEALS.

Top 1/3 of classes from OpenImages for active search (1 of 3). Recall (%) of positives and measurements of the largest component (LC) for each selected class (153 total) from OpenImages with a labeling budget of 2,000 examples. Classes are ordered based on MLP-SEALS.

Bottom 1/3 of classes from OpenImages for active search (3 of 3). Recall (%) of positives and measurements of the largest component (LC) for each selected class (153 total) from OpenImages with a labeling budget of 2,000 examples. Classes are ordered based on MLP-SEALS.

A.8 SELF-SUPERVISED EMBEDDING (SENTENCE-BERT) ON GOODREADS


MaxEnt-SEALS (k = 100) versus MaxEnt applied to a candidate pool of randomly selected examples (RandPool). Because the concepts we considered were so rare, as is often the case in practice, randomly chosen examples are unlikely to be close to the decision boundary, and a much larger pool is required to match SEALS. On ImageNet (top), MaxEnt-SEALS outperformed MaxEnt-RandPool in terms of both the error the model achieves for active learning (left) and the recall of positive examples for active search (right), even with a pool containing 10% of the data (middle). On OpenImages (bottom), MaxEnt-RandPool needed at least 5× as much data to match MaxEnt-SEALS for active learning and failed to achieve similar recall even with 10× the data.

Under review as a conference paper at ICLR 2021

