REDUCING CLASS COLLAPSE IN METRIC LEARNING WITH EASY POSITIVE SAMPLING

Anonymous

Abstract

Metric learning seeks perceptual embeddings in which visually similar instances are close and dissimilar instances are far apart, but the learned representation can be sub-optimal when the distribution of intra-class samples is diverse and distinct sub-clusters are present. We theoretically prove and empirically show that, under reasonable noise assumptions, prevalent embedding losses in metric learning, e.g., the triplet loss, tend to project all samples of a multi-modal class onto a single point in the embedding space, resulting in a class collapse that usually renders the space ill-suited for classification or retrieval. To address this problem, we propose a simple modification to the embedding losses such that each sample selects its nearest same-class counterpart in a batch as the positive element in the tuple. This allows each class to form multiple sub-clusters in the embedding space. The adaptation can be integrated into a wide range of metric learning losses. Our method demonstrates clear benefits on several fine-grained image retrieval datasets over a variety of existing losses; qualitative retrieval results show that samples with similar visual patterns are indeed closer in the embedding space.

1. INTRODUCTION

Metric learning aims to learn an embedding function to a lower-dimensional space in which semantic similarity translates to neighborhood relations (Lowe, 1995). Deep metric learning approaches achieve promising results in a large variety of tasks such as face identification (Chopra et al., 2005; Taigman et al., 2014; Sun et al., 2014), zero-shot learning (Frome et al., 2013), image retrieval (Hoffer & Ailon, 2015; Gordo et al., 2016) and fine-grained recognition (Wang et al., 2014). In this work we investigate the family of losses that optimize for an embedding in which all modes of intra-class appearance variation project to a single point in the embedding space. Learning such an embedding is very challenging when classes have a diverse appearance, as is common in real-world scenarios where a class consists of multiple modes with distinct visual appearance. Pushing all these modes to a single point in the embedding space requires the network to memorize the relations between the different class modes, which can reduce the generalization capability of the network and result in sub-par performance. Recently, researchers observed that this phenomenon, where all modes of class appearance "collapse" to the same center, occurs in the case of the classification SoftMax loss (Qian et al.). They proposed a multi-center approach, in which multiple centers per class are used with the SoftMax loss to capture the hidden distribution of the data. Instead of using SoftMax, it was shown that the triplet loss may offer some relief from class collapsing (Wang et al., 2014), and this is certainly true in noise-free environments. However, in this paper we show that under real-world conditions with modest noise assumptions, the triplet loss and other metric learning losses still suffer from class collapse. Rather than refine the loss, we argue that the key lies in an improved strategy for sampling and selecting examples.
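For concreteness, the triplet loss discussed above is commonly written as follows (notation ours, not taken from this paper: f is the embedding function, a, p, n the anchor, positive and negative samples, and α the margin):

```latex
\mathcal{L}_{\mathrm{triplet}}(a, p, n) =
\Big[\, \lVert f(a) - f(p) \rVert_2^2 \;-\; \lVert f(a) - f(n) \rVert_2^2 \;+\; \alpha \,\Big]_+
```

Mapping every sample of a class to one point drives the anchor–positive term to zero for all positive pairs, which illustrates why such collapsed solutions are attractive to the optimizer.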
Early work (Malisiewicz & Efros, 2008) proposed a per-exemplar distance representation as a means to overcome class collapsing; inspired by this, we introduce a simple sampling method for selecting positive pairs of training examples. Our method can be combined naturally with other popular sampling methods. In each training iteration, given an anchor and a batch of samples from the same category, our method selects the sample closest to the anchor in the current embedding space as the positive sample. The metric learning loss is then computed on the anchor and its positive paired sample. We demonstrate the class-collapsing phenomenon on a real-world dataset, and show that our method creates a more diverse embedding, which results in better generalization performance. We evaluate our method on three standard zero-shot benchmarks: CARS196 (Krause et al., 2013), CUB200-2011 (Wah et al., 2011) and Omniglot (Lake et al., 2015). Our method achieves a consistent performance improvement over various baseline combinations of sampling methods and embedding losses.
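The positive-selection step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names and the choice of squared Euclidean distance are our assumptions.

```python
import numpy as np

def easy_positive_index(embeddings, labels, anchor_idx):
    """Return the index of the same-class sample closest to the anchor
    in the current embedding space (the 'easy positive')."""
    anchor = embeddings[anchor_idx]
    # Squared Euclidean distance from the anchor to every sample in the batch.
    dists = np.sum((embeddings - anchor) ** 2, axis=1)
    # Mask out the anchor itself and all samples from other classes.
    same_class = labels == labels[anchor_idx]
    same_class[anchor_idx] = False
    dists[~same_class] = np.inf
    return int(np.argmin(dists))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on a single (anchor, positive, negative) tuple."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)
```

Because the anchor is pulled only toward its nearest same-class neighbor, distinct sub-clusters of one class are not forced to merge, which is the behavior the method aims for.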

2. RELATED WORK

Sampling methods. Designing a good sampling strategy is a key element of deep metric learning. Researchers have proposed strategies for sampling both negative examples and positive pairs. For negative samples, studies have focused on sampling hard negatives to make training more efficient (Simo-Serra et al., 2015; Schroff et al., 2015; Wang & Gupta, 2015; Oh Song et al., 2016; Parkhi et al., 2015). Recently, it has been shown that increasing the number of negative examples during training can significantly help unsupervised representation learning with contrastive losses (He et al., 2020; Wu et al., 2018; Chen et al., 2020). Besides negative examples, methods for sampling hard positive examples have been developed for classification and detection tasks (Loshchilov & Hutter, 2015; Shrivastava et al., 2016; Arandjelovic et al., 2016; Cubuk et al., 2019; Singh & Lee, 2017; Wang et al., 2017). The central idea is to perform better augmentation to improve generalization at test time (Cubuk et al., 2019). Apart from learning with SoftMax classification, Arandjelovic et al. (2016) propose to perform metric learning by assigning the nearest instance from the same class as the positive instance. As the positive training set is noisy in their setting, this method leads to features that are invariant to different perspectives. In contrast, we use this method in a clean setting, where the purpose is the opposite: maintaining the intra-class modalities in the embedding space. Xuan et al. (2020) also propose to use this positive sampling method with the N-pair loss (Sohn, 2016) in order to relax the constraints the loss places on intra-class relations. From a theoretical perspective, we prove that in a clean setting this relaxation is redundant for other popular metric losses such as the triplet loss (Chechik et al., 2010) and the margin loss (Wu et al., 2017).
We formulate the noisy-environment setting and prove that in this case the triplet and margin losses also suffer from class collapsing, and that our proposed positive sampling method optimizes for solutions without class collapsing. We also provide an empirical study that supports the theoretical analysis.

Noisy label problem. Learning with noisy labels is a practical problem in real-world applications (Scott et al., 2013; Natarajan et al., 2013; Shen & Sanghavi, 2019; Reed et al., 2014; Jiang et al., 2017; Khetan et al., 2017; Malach & Shalev-Shwartz, 2017), especially when training with large-scale data (Sun et al., 2017). One line of work applies a data-driven curriculum learning approach, where the data most likely to be labeled correctly are used for learning at the beginning, and harder data are introduced in a later phase (Jiang et al., 2017). Researchers have also tried to apply the loss only to the easiest top-k elements in the batch, determined by the lowest current loss (Shen
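The top-k trick mentioned above, keeping only the lowest-loss samples in a batch, can be sketched as follows. This is a generic illustration of the idea; the function name is ours, not from the cited work.

```python
import numpy as np

def topk_easiest_loss(per_sample_losses, k):
    """Average the k smallest per-sample losses in the batch, treating
    low-loss samples as the ones most likely to be labeled correctly."""
    losses = np.asarray(per_sample_losses, dtype=float)
    easiest = np.sort(losses)[:k]  # keep only the k easiest samples
    return float(easiest.mean())
```

High-loss samples, which are more likely mislabeled, are simply excluded from the batch objective.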



Figure 1: Given an anchor (circle with a dark ring), our approach samples the closest positive example in the embedding space as the positive element. This pushes the anchor only in the direction of its closest same-class element (green arrow), which allows the embedding to contain multiple clusters per class.

