REDUCING CLASS COLLAPSE IN METRIC LEARNING WITH EASY POSITIVE SAMPLING

Anonymous

Abstract

Metric learning seeks perceptual embeddings in which visually similar instances are close and dissimilar instances are far apart, but the learned representation can be sub-optimal when the distribution of intra-class samples is diverse and distinct sub-clusters are present. We theoretically prove and empirically show that, under reasonable noise assumptions, prevalent embedding losses in metric learning, e.g., triplet loss, tend to project all samples of a class with various modes onto a single point in the embedding space, resulting in a class collapse that usually renders the space ill-suited for classification or retrieval. To address this problem, we propose a simple modification to the embedding losses such that each sample selects its nearest same-class counterpart in a batch as the positive element in the tuple. This allows for the presence of multiple sub-clusters within each class. The adaptation can be integrated into a wide range of metric learning losses. Our method demonstrates clear benefits on various fine-grained image retrieval datasets over a variety of existing losses; qualitative retrieval results show that samples with similar visual patterns are indeed closer in the embedding space.

1. INTRODUCTION

Metric learning aims to learn an embedding function into a lower-dimensional space, in which semantic similarity translates to neighborhood relations in the embedding space (Lowe, 1995). Deep metric learning approaches achieve promising results in a large variety of tasks such as face identification (Chopra et al., 2005; Taigman et al., 2014; Sun et al., 2014), zero-shot learning (Frome et al., 2013), image retrieval (Hoffer & Ailon, 2015; Gordo et al., 2016) and fine-grained recognition (Wang et al., 2014). In this work we investigate the family of losses that optimize for an embedding representation which enforces that all modes of intra-class appearance variation project to a single point in embedding space. Learning such an embedding is very challenging when classes have a diverse appearance. This happens especially in real-world scenarios, where a class consists of multiple modes with diverse visual appearance. Pushing all these modes to a single point in the embedding space requires the network to memorize the relations between the different class modes, which can reduce the generalization capabilities of the network and result in sub-par performance. Researchers recently observed that this phenomenon, where all modes of class appearance "collapse" to the same center, occurs in the case of the classification SoftMax loss (Qian et al.). They proposed a multi-center approach, in which multiple centers per class are used with the SoftMax loss to capture the hidden distribution of the data. Instead of using SoftMax, it was shown that triplet loss may offer some relief from class collapse (Wang et al., 2014), and this is certainly true in noise-free environments. However, in this paper we show that under real-world conditions with modest noise assumptions, triplet and other metric learning losses still suffer from class collapse. Rather than refine the loss, we argue the key lies in an improved strategy for sampling and selecting the examples.
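As a concrete sketch of the objects discussed above, the standard triplet loss and the easy-positive selection (choosing the nearest same-class sample in a batch as the positive) might look as follows. This is an illustrative NumPy implementation under our own naming conventions, not the paper's actual code:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def easy_positive(anchor, same_class_batch):
    """Easy-positive selection: return the same-class sample closest to
    the anchor in the current embedding space, so that distant
    sub-clusters of the class are not forced to merge."""
    dists = np.linalg.norm(same_class_batch - anchor, axis=1)
    return same_class_batch[np.argmin(dists)]
```

With this selection rule, the gradient only pulls together samples that are already nearby, leaving the class free to occupy multiple sub-clusters in the embedding space.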
Early work (Malisiewicz & Efros, 2008) proposed a per-exemplar distance representation as a means to overcome class collapse; inspired by this, we introduce a simple sampling method for selecting positive pairs of training examples. Our method can be combined naturally with other popular sampling methods. In each training iteration, given an anchor and a batch of samples in the same category, our method selects the closest sample to the anchor in the current embedding space as the

