REDUCING CLASS COLLAPSE IN METRIC LEARNING WITH EASY POSITIVE SAMPLING Anonymous

Abstract

Metric learning seeks perceptual embeddings where visually similar instances are close and dissimilar instances are apart, but the learned representation can be sub-optimal when the distribution of intra-class samples is diverse and distinct sub-clusters are present. We theoretically prove and empirically show that, under reasonable noise assumptions, prevalent embedding losses in metric learning, e.g., triplet loss, tend to project all samples of a class with various modes onto a single point in the embedding space, resulting in a class collapse that usually renders the space ill-sorted for classification or retrieval. To address this problem, we propose a simple modification to the embedding losses such that each sample selects its nearest same-class counterpart in a batch as the positive element in the tuple. This allows for the presence of multiple sub-clusters within each class. The adaptation can be integrated into a wide range of metric learning losses. Our method demonstrates clear benefits on various fine-grained image retrieval datasets over a variety of existing losses; qualitative retrieval results show that samples with similar visual patterns are indeed closer in the embedding space.

1. INTRODUCTION

Metric learning aims to learn an embedding function to a lower-dimensional space, in which semantic similarity translates to neighborhood relations in the embedding space (Lowe, 1995). Deep metric learning approaches achieve promising results in a large variety of tasks such as face identification (Chopra et al., 2005; Taigman et al., 2014; Sun et al., 2014), zero-shot learning (Frome et al., 2013), image retrieval (Hoffer & Ailon, 2015; Gordo et al., 2016) and fine-grained recognition (Wang et al., 2014). In this work we investigate the family of losses which optimize for an embedding representation that forces all modes of intra-class appearance variation to project to a single point in embedding space. Learning such an embedding is very challenging when classes have a diverse appearance. This happens especially in real-world scenarios, where a class consists of multiple modes with diverse visual appearance. Pushing all these modes to a single point in the embedding space requires the network to memorize the relations between the different class modes, which can reduce the generalization capabilities of the network and result in sub-par performance. Recently, researchers observed that this phenomenon, where all modes of class appearance "collapse" to the same center, occurs in the case of the classification SoftMax loss (Qian et al.). They proposed a multi-center approach, where multiple centers per class are used with the SoftMax loss to capture the hidden distribution of the data. Instead of using SoftMax, it was shown that the triplet loss may offer some relief from class collapsing (Wang et al., 2014), and this is certainly true in noise-free environments. However, in this paper we show that in real-world conditions, under modest noise assumptions, the triplet and other metric learning losses still suffer from class collapse. Rather than refine the loss, we argue the key lies in an improved strategy for sampling and selecting the examples.
Early work (Malisiewicz & Efros, 2008) proposed a per-exemplar distance representation as a means to overcome class collapsing; inspired by this, we introduce a simple sampling method to select positive pairs of training examples. Our method can be combined naturally with other popular sampling methods. In each training iteration, given an anchor and a batch of samples in the same category, our method selects the closest sample to the anchor in the current embedding space as the positive sample. The metric learning loss is then computed based on the anchor and its positive paired sample. We demonstrate the class-collapsing phenomenon on a real-world dataset, and show that our method creates a more diverse embedding, which results in better generalization performance. We evaluate our method on three standard zero-shot benchmarks: CARS196 (Krause et al., 2013), CUB200-2011 (Wah et al., 2011) and Omniglot (Lake et al., 2015). Our method achieves a consistent performance enhancement with respect to various baseline combinations of sampling methods and embedding losses.

Figure 1: Given an anchor (circle with dark ring), our approach samples the closest positive example in the embedding space as the positive element. This results in pushing the anchor only towards the closest element direction (green arrow), which allows the embedding to have multiple clusters for each class.

2. RELATED WORK

Sampling methods. Designing a good sampling strategy is a key element in deep metric learning. Researchers have proposed sampling methods for both negative examples and positive pairs. For negative samples, studies have focused on sampling hard negatives to make training more efficient (Simo-Serra et al., 2015; Schroff et al., 2015; Wang & Gupta, 2015; Oh Song et al., 2016; Parkhi et al., 2015). Recently, it has been shown that increasing the number of negative examples in training can significantly help unsupervised representation learning with contrastive losses (He et al., 2020; Wu et al., 2018; Chen et al., 2020). Besides negative examples, methods for sampling hard positive examples have been developed for classification and detection tasks (Loshchilov & Hutter, 2015; Shrivastava et al., 2016; Arandjelovic et al., 2016; Cubuk et al., 2019; Singh & Lee, 2017; Wang et al., 2017). The central idea is to perform better augmentation to improve generalization at test time (Cubuk et al., 2019). Apart from learning with SoftMax classification, Arandjelovic et al. (2016) propose to perform metric learning by assigning the nearest instance from the same class as the positive instance. As the positive training set is noisy in their setting, this method leads to features invariant to different perspectives. Different from this approach, we use this method in a clean setting, where the purpose is to obtain the opposite result of maintaining the intra-class modalities in the embedding space. Xuan et al. (2020) also propose to use this positive sampling method with the N-pair loss (Sohn, 2016) in order to relax the constraints of the loss on the intra-class relations. From a theoretical perspective, we prove that in a clean setting this relaxation is redundant for other popular metric losses such as the triplet loss (Chechik et al., 2010) and the margin loss (Wu et al., 2017).
We formulate the noisy-environment setting and prove that in this case the triplet and margin losses also suffer from class collapsing, and that using our proposed positive sampling method optimizes for solutions without class collapsing. We also provide an empirical study that supports the theoretical analysis.

Noisy label problem. Learning with noisy labels is a practical problem in real-world applications (Scott et al., 2013; Natarajan et al., 2013; Shen & Sanghavi, 2019; Reed et al., 2014; Jiang et al., 2017; Khetan et al., 2017; Malach & Shalev-Shwartz, 2017), especially when training with large-scale data (Sun et al., 2017). One line of work applies a data-driven curriculum learning approach, where the data most likely to be labeled correctly are used for learning in the beginning, and harder data are introduced during a later phase (Jiang et al., 2017). Researchers have also tried applying the loss only to the easiest top-k elements in the batch, determined by the lowest current loss (Shen & Sanghavi, 2019). Inspired by these works, our method focuses on selecting only the easiest positive relations in the batch.

Beyond memorization. Deep networks have been shown to easily memorize and over-fit the training data (Zhang et al., 2016; Recht et al., 2018; 2019). For example, a network can be trained with randomly assigned labels on the ImageNet data and obtain 100% training accuracy if augmentations are not adopted. Moreover, even when a CIFAR-10 classifier performs well on the validation set, it has been shown that it does not truly generalize to newly collected data which is visually similar to the training and validation sets (Recht et al., 2018). In this paper, we show that when the network is given the freedom not to learn intra-class relations between different class modes, we achieve much better generalization, and the representation can be applied in a zero-shot setting.

3. PRELIMINARIES

Let $X = \{x_1, .., x_n\}$ be a set of samples with labels $y_i \in \{1, .., m\}$. The objective of metric learning is to learn an embedding $f(\cdot, \theta) : X \to \mathbb{R}^k$, in which the neighbourhood of each sample in the embedding space contains samples only from the same class. One of the common approaches to metric learning is using embedding losses, in which at each iteration samples from the same class and samples from different classes are chosen according to some sampling heuristic. The objective of the loss is to push apart projections of samples from different classes, and pull together projections of samples from the same class. In this section we introduce a few popular embedding losses.

Notation: For $x_i, x_j \in X$, define $D^f_{x_i,x_j} = \|f(x_i) - f(x_j)\|_2$. In cases where there is no ambiguity we omit $f$ and simply write $D_{x_i,x_j}$. We also define the function
$$\delta_{x_i,x_j} = \begin{cases} 1 & y_i = y_j \\ 0 & \text{otherwise.} \end{cases}$$
Lastly, for every $a \in \mathbb{R}$, denote $(a)_+ := \max(a, 0)$.

The Contrastive loss (Hadsell et al.) takes sample embeddings and pushes samples from different classes apart while pulling samples from the same class together:
$$L^f_{con}(x_i, x_j) = \delta_{x_i,x_j} \cdot D^f_{x_i,x_j} + (1 - \delta_{x_i,x_j}) \cdot \big(\alpha - D^f_{x_i,x_j}\big)_+$$
Here $\alpha$ is the margin parameter, which defines the desired minimal distance between samples from different classes. While the Contrastive loss imposes a constraint on a pair of samples, the Triplet loss (Chechik et al., 2010) operates on a triplet of samples. Given a triplet $x_a, x_p, x_n \in X$, the Triplet loss is defined by
$$L^f_{trip}(x_a, x_p, x_n) = \delta_{x_a,x_p} \cdot (1 - \delta_{x_a,x_n}) \cdot \big(D^f_{x_a,x_p} - D^f_{x_a,x_n} + \alpha\big)_+$$
The Margin loss (Wu et al., 2017) aims to exploit the flexibility of the Triplet loss while maintaining the computational efficiency of the Contrastive loss.
This is done by adding a variable which determines the boundary between positive and negative pairs; given an anchor $x_a \in X$, the loss is defined by
$$L^{f,\beta}_{margin}(x_a, x) = \delta_{x_a,x} \cdot \big(D^f_{x_a,x} - \beta_{x_a} + \alpha\big)_+ + (1 - \delta_{x_a,x}) \cdot \big(\beta_{x_a} - D^f_{x_a,x} + \alpha\big)_+$$
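For concreteness, the pairwise distance and the Triplet and Margin losses defined above can be written out directly. The following is a minimal NumPy sketch of our own (not the authors' implementation), with the margin parameters exposed as plain keyword arguments:

```python
import numpy as np

def pairwise_dist(f):
    # D[i, j] = ||f(x_i) - f(x_j)||_2 for a batch of embeddings f of shape (n, d)
    diff = f[:, None, :] - f[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def triplet_loss(f, y, a, p, n, alpha=0.2):
    # delta(a,p) * (1 - delta(a,n)) * (D_ap - D_an + alpha)_+
    D = pairwise_dist(f)
    if y[a] != y[p] or y[a] == y[n]:
        return 0.0  # the indicator factors zero out invalid triplets
    return max(D[a, p] - D[a, n] + alpha, 0.0)

def margin_loss(f, y, a, x, beta=1.2, alpha=0.2):
    # delta * (D - beta + alpha)_+ + (1 - delta) * (beta - D + alpha)_+
    D = pairwise_dist(f)
    if y[a] == y[x]:
        return max(D[a, x] - beta + alpha, 0.0)
    return max(beta - D[a, x] + alpha, 0.0)
```

For example, with a same-class pair at distance 1, a negative at distance 3, and alpha = 0.2, the triplet term is (1 - 3 + 0.2)_+ = 0, so the triplet is inactive; raising alpha to 2.5 activates it.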

4. CLASS-COLLAPSING

The Contrastive loss objective is to pull all the samples of the same class to a single point in the embedding space. We call this the class-collapsing property. Formally, an embedding $f : X \to \mathbb{R}^m$ has the class-collapsing property if there exists a label $y$ and a point $p \in \mathbb{R}^m$ such that $\{f(x_i)\ |\ y_i = y\} = \{p\}$.

4.1. EMBEDDING LOSSES OPTIMAL SOLUTION

It is easy to see that an embedding function $f$ that minimizes
$$O_{con}(f) = \frac{1}{n^2} \sum_{x_i,x_j \in X} L^f_{con}(x_i, x_j)$$
has the class-collapsing property with respect to all classes. However, this is not necessarily true for the Triplet loss and the Margin loss. For simplicity, for the rest of this subsection we assume there are only two classes. Let $A \subset X$ be a subset of elements such that all the elements in $A$ belong to one class and all the elements in $A^c$ belong to the other class. Recall some basic set definitions.

Definition 1. For all sets $Y, Z \subset \mathbb{R}^m$ define:
1. The diameter of $Y$: $\operatorname{diam}(Y) = \sup\{\|y - z\|\ |\ y, z \in Y\}$
2. The distance between $Y$ and $Z$: $\|Y - Z\| = \inf\{\|y - z\|\ |\ y \in Y, z \in Z\}$

It is easy to see that if $f : X \to \mathbb{R}^m$ is an embedding such that $\operatorname{diam}(f(A)) + 2\alpha < \|f(A) - f(A^c)\|$ (and symmetrically for $A^c$), then
$$O_{trip}(f) = \frac{1}{n^3} \sum_{x_i,x_j,x_k \in X} L^f_{trip}(x_i, x_j, x_k) = 0.$$
Moreover, fixing $\beta_{x_i} = \operatorname{diam}(f(A)) + \alpha$ for every $x_i \in A$ (and analogously for $x_i \in A^c$),
$$O_{margin}(f, \beta) = \frac{1}{n^2} \sum_{x_i,x_j \in X} L^{f,\beta}_{margin}(x_i, x_j) = 0.$$
Indeed, the family of embeddings which attain the global minimum with respect to the Triplet loss and the Margin loss is rich and diverse. However, as we will prove in the next subsection, this does not remain true in a noisy environment.
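As a concrete instance of the zero-loss observation for the Triplet loss, here is a small worked example of our own (the numbers are illustrative, not from the paper):

```latex
% Class A embedded in two separate modes, yet the triplet objective is zero.
\[
f(A) = \{0,\, 1\} \subset \mathbb{R}, \qquad f(A^c) = \{10\}, \qquad \alpha = 1
\]
\[
\operatorname{diam}(f(A)) + 2\alpha = 1 + 2 = 3 \;<\; 9 = \|f(A) - f(A^c)\|
\]
% For every valid triplet, the positive distance is at most
% diam(f(A)) = 1 and the negative distance is at least 9, hence
\[
D_{x_a,x_p} - D_{x_a,x_n} + \alpha \;\le\; 1 - 9 + 1 \;<\; 0
\quad\Longrightarrow\quad O_{trip}(f) = 0,
\]
% even though class A occupies two distinct locations in the embedding.
```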

4.2. NOISY ENVIRONMENT ANALYSIS

For simplicity we again discuss the binary case of two labels in this section; the analysis extends easily to the multi-label case. The noisy-environment scenario can be formulated by adding uncertainty to the class label. More formally, let $Y = \{Y_1, .., Y_n\}$ be a set of independent random variables, and let $A_1, .., A_t \subset X$ and $0.5 < p < 1$ be such that $|A_j| = \frac{n}{t}$ and
$$P(Y_i = k) = \begin{cases} p & x_i \in A_k \\ q := \frac{1-p}{t-1} & x_i \notin A_k. \end{cases}$$
We can also reformulate $\delta$ as a binary random variable: $\delta_{Y_i,Y_j} := \mathbb{1}_{Y_i=Y_j}$. For the Triplet loss define
$$EL^f_{trip}(x_i, x_j, x_k) = \mathbb{E}\Big[\delta_{Y_i,Y_j} \cdot (1 - \delta_{Y_i,Y_k}) \cdot \big(D^f_{x_i,x_j} - D^f_{x_i,x_k} + \alpha\big)_+\Big].$$
We are searching for an embedding function which minimizes
$$EO_{trip}(f) = \frac{1}{n^3} \sum_{x_i,x_j,x_k \in X} EL^f_{trip}(x_i, x_j, x_k).$$

Theorem 1. Let $f : X \to \mathbb{R}^m$ be an embedding which minimizes $EO_{trip}(f)$. Then $f$ has the class-collapsing property with respect to all classes.

Similarly, we can define
$$EL^f_{margin}(x_i, x_j) = \mathbb{E}\big[\delta_{Y_i,Y_j}\big] \cdot \big(D^f_{x_i,x_j} - \beta_{x_i} + \alpha\big)_+ + \mathbb{E}\big[1 - \delta_{Y_i,Y_j}\big] \cdot \big(\beta_{x_i} - D^f_{x_i,x_j} + \alpha\big)_+$$

Theorem 2. Let $f : X \to \mathbb{R}^m$ be an embedding which minimizes $EO_{margin}(f, \beta) = \frac{1}{n^2} \sum_{x_i,x_j \in X} EL^f_{margin}(x_i, x_j)$. Then $f$ has the class-collapsing property with respect to all classes.

The proofs of these two theorems can be found in Appendix A. In conclusion, although theoretically in clean environments the Triplet loss and the Margin loss allow more flexible embedding solutions, this does not remain true when noise is considered. On real-world data, where mislabeling and ambiguity can usually be found, the optimal solution with respect to both losses becomes degenerate.

4.3. EASY POSITIVE SAMPLING (EPS)

Using standard embedding losses for metric learning can result in an embedding space in which visually diverse samples from the same class are all concentrated in a single location. Since the standard evaluation and prediction methods for image retrieval tasks are typically based on properties of the K-nearest neighbours in the embedding space, the class-collapsing property is a side-effect which is not necessary for optimal results. In the next section we show experimental results which support the assumption that complete class collapsing can hurt the generalization capability of the network.

To address the class-collapsing issue we propose a simple sampling method, which weakens the objective's penalty on the intra-class relations by applying the loss only on the closest positive sample. Formally, we define EPS sampling in the following way: given a mini-batch with $N$ samples, for each sample $a$, let $C_a$ be the set of elements from the same class as $a$ in the mini-batch; we choose the positive sample
$$p_a = \arg\min_{t \in C_a} \|f(t) - f(a)\|.$$
The negative sample $n_a$ can be chosen according to various options. In this paper we use the following methods: (a) choosing randomly from all the elements which are not in $C_a$; (b) distance sampling (Wu et al., 2017); (c) semi-hard sampling (Schroff et al., 2015); (d) MS hard-mining sampling (Wang et al., 2019). We then apply the loss on the triplets $(a, p_a, n_a)$. Such sampling changes the loss objective: instead of pulling all same-class samples in the mini-batch close to the anchor, it only pulls the sample closest to the anchor (with respect to the embedding space) in the mini-batch; see Figure 1. In Appendix B, we formalize this method in the noisy-environment framework.
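The EPS positive-selection rule amounts to a nearest-neighbour lookup restricted to the anchor's class within the mini-batch. A minimal NumPy sketch of our own (the real training code operates on learned embeddings; here `f` is any batch of embedding vectors):

```python
import numpy as np

def easy_positive_indices(f, y):
    # For each anchor i, return the index of the CLOSEST same-class sample
    # in the mini-batch (excluding i itself), i.e. argmin_{t in C_i} ||f(t) - f(i)||.
    # Returns -1 for anchors with no same-class element in the batch.
    D = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    pos = np.full(len(y), -1)
    for i in range(len(y)):
        mask = (y == y[i])
        mask[i] = False  # an anchor cannot be its own positive
        if mask.any():
            candidates = np.where(mask)[0]
            pos[i] = candidates[np.argmin(D[i, candidates])]
    return pos
```

The negative for each anchor is then drawn by any of the strategies (a)-(d) above, and the loss is applied to the resulting triplets.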
We prove (Claims 1 and 2) that an embedding with the class-collapsing property is not a minimal solution with respect to either the margin or the triplet loss with easy positive sampling. Furthermore, in Claims 3 and 4 we prove that the objective of the losses with EPS on tuples/triplets is to push away every element (including positive elements) that is not among the k closest elements to the anchor, where k is determined by the noise level p. Therefore, if we apply the EPS method on a mini-batch which has a small number of positive elements from each modality, adding EPS to the losses not only relaxes the constraints on the embedding, allowing the embedding to have multiple inner clusters; it also optimizes the embedding to have this form.

5. EXPERIMENTS

We test our EPS method on image retrieval and clustering datasets. We evaluate image retrieval quality with the recall@k metric (Jégou et al., 2011), and clustering quality with the normalized mutual information score (NMI) (Manning et al., 2008). The NMI measures the quality of the alignment between the clusters induced by the ground-truth labels and the clusters induced by applying a clustering algorithm to the embedding space. The common practice is to choose the NMI clusters by running the K-means algorithm on the embedding space with K equal to the number of classes. However, this prevents the measure from capturing more diverse solutions, in which homogeneous clusters appear only when a larger number of clusters is used; regular NMI prefers solutions with class collapsing. Therefore, we also increase the number of clusters in the NMI evaluation (denoted NMI+), and report the regular NMI score as well.
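For reference, NMI between a ground-truth labelling and a predicted clustering can be computed directly from the contingency table. The sketch below is our own plain-NumPy version, using the symmetric 2I/(H(A)+H(B)) normalization (one common convention), independent of any particular clustering library:

```python
import numpy as np

def nmi(labels_a, labels_b):
    # Normalized mutual information between two hard clusterings:
    # NMI = 2 * I(A; B) / (H(A) + H(B)), from the joint contingency table.
    a = np.unique(labels_a, return_inverse=True)[1]
    b = np.unique(labels_b, return_inverse=True)[1]
    n = len(a)
    cont = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        cont[i, j] += 1
    p = cont / n                           # joint distribution
    pa, pb = p.sum(axis=1), p.sum(axis=0)  # marginal distributions
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return 2 * mi / (ha + hb) if ha + hb > 0 else 1.0
```

For NMI+, the predicted clustering is simply produced with K larger than the number of classes before calling such a function.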

5.1. MNIST EVEN/ODD EXAMPLE

To demonstrate the class-collapsing phenomenon, we take the MNIST dataset (Lecun et al., 1998) and split the digits into odd and even. From a visual perspective this is an arbitrary separation. We took the first 6 digits for training and left the remaining 4 digits for testing. We used a simple shallow architecture which results in an embedding function from the image space to $\mathbb{R}^2$ (for implementation details see Appendix C). We train the network using the triplet loss, and compare our sampling method to random sampling of positive examples (the regular loss). As can be seen in Figure 2, regular training without EPS suffers from class collapsing. Training with EPS creates a richer embedding in which there is a clear separation not only between the two classes, but also between different digits of the same class. As expected, the class-collapsed embedding performs worse on the test data with the unseen digits; see Table 1.
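The even/odd protocol can be stated precisely. Below is a tiny helper of our own reflecting our reading of the split described above (binary class = digit parity; digits 0-5 for training, 6-9 for testing):

```python
import numpy as np

def even_odd_split(digit_labels):
    # Map MNIST digit labels to the binary even/odd task.
    # The visually distinct sub-clusters of each binary class are the digits
    # themselves; digits 0-5 form the train split, 6-9 the unseen test split.
    digits = np.asarray(digit_labels)
    binary = digits % 2          # 0 = even, 1 = odd
    train_mask = digits <= 5     # first six digits are used for training
    return binary, train_mask
```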

5.2. FINE-GRAINED RECOGNITION EVALUATION

We compare our approach to previous popular sampling methods and losses. The evaluation is conducted on standard benchmarks for zero-shot learning and image retrieval, following the common splitting and evaluation practice (Wu et al., 2017; Movshovitz-Attias et al., 2017; Brattoli et al., 2019). We build our implementation on top of the framework of Roth et al., which allows us to have a fair comparison between all the tested methods with an embedding of fixed size (128). For more implementation details and consistency of the results, see Appendix C.

5.2.1. DATASETS

We evaluate our model on the following datasets.
• CUB200-2011 (Wah et al., 2011), which contains 11,788 images of 200 bird species. Following Wu et al. (2017), we use 100 classes for training and 100 for testing.
• Omniglot (Lake et al., 2015), which contains 1,623 handwritten characters from 50 alphabets. In our experiments we only use the alphabet labels during the training process, i.e., all the characters from the same alphabet share the same class. We follow the split in Lake et al. (2015), using 30 alphabets for training and 20 for testing.

5.2.2. RESULTS

We tested our sampling method with 3 different losses: Triplet (Chechik et al., 2010), Margin (Wu et al., 2017) and Multi-Similarity (MS) (Wang et al., 2019). For the Margin loss experiment, we combine our sampling method with distance sampling (Wu et al., 2017); this is possible because distance sampling only constrains the negative samples, whereas our method only constrains the positive samples. We set the margin α = 0.2 and initialized β = 1.2 as in Wu et al. (2017). For the Triplet loss we combine our method with semi-hard sampling (Schroff et al., 2015) by fixing the positive according to EPS and then using semi-hard sampling to choose the negative examples. For the MS loss we replace the positive hard-mining method with EPS and use the same hard-negative method. We use the same hyper-parameters as in Wang et al. (2019): α = 2, λ = 1, β = 50. Results are summarized in Tables 2 and 3. Our method achieves the best performance on all tested datasets. It is important to note that in the baseline models, when using semi-hard sampling, the sampling strategy was applied to the positive part as well, as suggested in the original papers. We see that replacing semi-hard positive sampling with easy positive sampling improves results in all the experiments. The improvement becomes larger when the dataset classes can be partitioned more naturally into a small number of visually homogeneous sub-clusters: in the CARS196 dataset these are car viewpoints, while in Omniglot they are the letters of each alphabet. As can be seen in Table 3, using EPS on the Omniglot dataset results in an embedding in which, in most cases, the nearest neighbors consist of elements of the same letter, although the network was trained without these labels. In Figure 4 we show a qualitative comparison of CARS196 models; EPS creates more homogeneous neighbourhood relationships with respect to the viewpoint of the car.
More results and comparisons can be found in Appendix C.

5.2.3. POSITIVE BATCH SIZE EFFECT

An important hyperparameter in our sampling method is the number of positive batch samples, from which we select the one closest in the embedding space to the anchor. If the class is visually diverse and the number of positive samples in the batch is low, then with high probability the set of positive samples will not contain any image visually similar to the anchor. In the Omniglot experiment the effect of this hyperparameter is clear: it determines the probability that the set of positive samples includes a sample of the same letter as the anchor. As can be seen in Figure 3(b), the performance of the model increases as the probability of having another sample with the same letter as the anchor increases.
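Under a simplifying assumption of our own, that the letters of the positive samples are drawn uniformly and independently within the anchor's alphabet, the probability that the positive set contains the anchor's letter has a closed form, which makes the monotone effect of the positive batch size explicit:

```python
def prob_same_letter(num_letters, num_positives):
    # P(at least one of the positives shares the anchor's letter)
    # = 1 - ((L - 1) / L) ** k under the uniform-and-independent assumption;
    # this is an illustrative model, not the exact sampler used in training.
    return 1.0 - ((num_letters - 1) / num_letters) ** num_positives
```

For an alphabet of 20 letters, a single positive gives probability 0.05, while 32 positives give roughly 0.81, mirroring the trend in Figure 3(b).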

6. CONCLUSION

In this work we demonstrate the importance of positive sampling strategies when using embedding losses for metric learning. We investigate the class-collapsing phenomenon with respect to popular embedding losses such as the Triplet loss and the Margin loss. While in clean environments there is a diverse and rich family of optimal solutions, when noise is present the optimal solution collapses to a degenerate embedding. We propose a simple solution to this issue based on 'easy' positive sampling, and prove that adding this sampling results in non-degenerate embeddings. We also compare and evaluate our method on standard image retrieval datasets, and demonstrate a consistent performance boost on all of them. While our method and results have been limited to metric learning frameworks, we believe that our sampling scheme will also be useful in other related settings, including supervised contrastive learning, which we leave to future work.

A: PROOFS OF THEOREMS 1 AND 2

Proof of Theorem 1. For every $1 \le r_1, r_2 \le t$ define the random variables
$$h_{r_1,r_2}(Y, Z) = \begin{cases} 1 & Y = r_1 \wedge Z = r_2 \\ 0 & \text{else} \end{cases}$$
and observe that
$$\delta_{Y_1,Y_2} \cdot (1 - \delta_{Y_1,Y_3}) = \sum_{\substack{1\le r_1,r_2\le t \\ r_1 \ne r_2}} \mathbb{1}_{Y_1=r_1} \cdot h_{r_1,r_2}(Y_2, Y_3) = \sum_{\substack{1\le r_1,r_2\le t \\ r_1 \ne r_2}} \mathbb{1}_{Y_1=r_2} \cdot h_{r_1,r_2}(Y_3, Y_2).$$
Since the variables are independent,
$$\mathbb{E}\big[\delta_{Y_1,Y_2} \cdot (1 - \delta_{Y_1,Y_3})\big] = \frac{1}{2} \sum_{\substack{1\le r_1,r_2\le t \\ r_1 \ne r_2}} \mathbb{E}\big[\mathbb{1}_{Y_1=r_1}\big] \cdot \mathbb{E}\big[h_{r_1,r_2}(Y_2, Y_3)\big] + \mathbb{E}\big[\mathbb{1}_{Y_1=r_2}\big] \cdot \mathbb{E}\big[h_{r_1,r_2}(Y_3, Y_2)\big].$$
Define $\bar D(x_1, x_2, x_3) := (D_{x_1,x_2} - D_{x_1,x_3} + \alpha)_+$. Rearranging the terms, we get
$$n^3 \cdot EO_{trip}(f) = \sum_{x_1,x_2,x_3 \in X} \mathbb{E}\big[\delta_{Y_1,Y_2} (1 - \delta_{Y_1,Y_3})\big] \cdot \bar D(x_1, x_2, x_3)$$
$$= \sum_{x_1,x_2,x_3 \in X}\ \sum_{1 \le r_1 \ne r_2 \le t} \frac{1}{2}\, \mathbb{E}\big[h_{r_1,r_2}(Y_2, Y_3)\big] \Big(\mathbb{E}\big[\mathbb{1}_{Y_1=r_1}\big]\, \bar D(x_1, x_2, x_3) + \mathbb{E}\big[\mathbb{1}_{Y_1=r_2}\big]\, \bar D(x_1, x_3, x_2)\Big).$$
Therefore, setting
$$K(i, j, k, r_1, r_2) = \frac{1}{2} \Big(\mathbb{E}\big[\mathbb{1}_{Y_i=r_1}\big]\, \bar D(x_i, x_j, x_k) + \mathbb{E}\big[\mathbb{1}_{Y_i=r_2}\big]\, \bar D(x_i, x_k, x_j)\Big),$$
$EO_{trip}(f)$ can be written as
$$EO_{trip}(f) = \frac{1}{n^3} \sum_{1\le i,j,k\le n}\ \sum_{1 \le r_1 \ne r_2 \le t} \mathbb{E}\big[h_{r_1,r_2}(Y_j, Y_k)\big] \cdot K(i, j, k, r_1, r_2).$$
For every $x_i \in X$, define
$$(EO_{trip}(f))_{x_i} = \frac{1}{n^2} \sum_{1\le j,k\le n}\ \sum_{1 \le r_1 \ne r_2 \le t} \mathbb{E}\big[h_{r_1,r_2}(Y_j, Y_k)\big] \cdot K(i, j, k, r_1, r_2).$$
Let $f : X \to \mathbb{R}^m$ be an embedding. Fix $1 \le r \le t$, $x_i \in A_r$ and $x_j, x_k \in X$ with $\|f(x_i) - f(x_j)\| = w$, $\|f(x_i) - f(x_k)\| = h$. By definition,
$$K(i, j, k, r_1, r_2) = \frac{1}{2} \cdot \begin{cases} p\,(w - h + \alpha)_+ + q\,(h - w + \alpha)_+ & r_1 = r \wedge r_2 \ne r \\ q\,(w - h + \alpha)_+ + p\,(h - w + \alpha)_+ & r_2 = r \wedge r_1 \ne r \\ q\,(w - h + \alpha)_+ + q\,(h - w + \alpha)_+ & r_1 \ne r \wedge r_2 \ne r. \end{cases}$$
Since $0 < q < p < 1$, to minimize $K(i, j, k, r_1, r_2)$, $h$ and $w$ must satisfy $|h - w| \le \alpha$, in which case
$$K(i, j, k, r_1, r_2) = \frac{1}{2} \cdot \begin{cases} (p + q)\,\alpha + (w - h)(p - q) & r_1 = r \wedge r_2 \ne r \\ (p + q)\,\alpha + (h - w)(p - q) & r_2 = r \wedge r_1 \ne r \\ 2q\,\alpha & r_1 \ne r \wedge r_2 \ne r. \end{cases}$$
Therefore
$$\sum_{r_2 \ne r} \mathbb{E}\big[h_{r,r_2}(Y_j, Y_k)\big] \cdot K(i, j, k, r, r_2) + \mathbb{E}\big[h_{r_2,r}(Y_j, Y_k)\big] \cdot K(i, j, k, r_2, r)$$
equals a term independent of $w, h$ plus
$$\frac{(w - h)(p - q)}{2} \sum_{r_2 \ne r} \Big(\mathbb{E}\big[h_{r,r_2}(Y_j, Y_k)\big] - \mathbb{E}\big[h_{r_2,r}(Y_j, Y_k)\big]\Big).$$
We split into three cases:
1. If $x_j, x_k \in A_r$ or $x_j, x_k \notin A_r$, then $\mathbb{E}[h_{r,r_2}(Y_j, Y_k)] = \mathbb{E}[h_{r_2,r}(Y_j, Y_k)]$ after summing over $r_2 \ne r$, hence the $(w - h)$ term vanishes.
2. If $x_j \in A_r$ and $x_k \notin A_r$, then $\sum_{r_2 \ne r} \mathbb{E}[h_{r,r_2}(Y_j, Y_k)] > \sum_{r_2 \ne r} \mathbb{E}[h_{r_2,r}(Y_j, Y_k)]$. Since $p > 0.5$ and $|h - w| \le \alpha$, the minimal value is achieved whenever $w = 0$ and $h = \alpha$.
3. Symmetrically, if $x_k \in A_r$ and $x_j \notin A_r$, the minimal value is achieved whenever $h = 0$ and $w = \alpha$.

In conclusion, if $x_i \in A_r$, an embedding $f^*$ satisfies
$$(EO_{trip}(f^*))_{x_i} = \min\big\{(EO_{trip}(f))_{x_i}\ \big|\ f : X \to \mathbb{R}^m\big\}$$
iff $f^*(x_j) = f^*(x_i)$ for every $x_j \in A_r$; that is, $f^*$ has the class-collapsing property with respect to all classes. $\square$

We now prove the analogous statement for the Margin loss.

Theorem 2. Let $f : X \to \mathbb{R}^m$ be an embedding which minimizes $EO_{margin}(f, \beta) = \frac{1}{n^2} \sum_{x_i,x_j \in X} EL^f_{margin}(x_i, x_j)$. Then $f$ has the class-collapsing property with respect to all classes.

Proof. Observe that if $x_i, x_j \in A_r$, then
$$EL^f_{margin}(x_i, x_j) = p \cdot \big(D_{x_i,x_j} - \beta_{x_i} + \alpha\big)_+ + (1 - p) \cdot \big(\beta_{x_i} - D_{x_i,x_j} + \alpha\big)_+.$$
Since $0 < p < 1$, the minimal value is achieved whenever $|D_{x_i,x_j} - \beta_{x_i}| \le \alpha$, in which case
$$EL^f_{margin}(x_i, x_j) = \alpha + (2p - 1) \cdot \big(D_{x_i,x_j} - \beta_{x_i}\big).$$
In the same way, if $x_i \in A_r$ and $x_j \notin A_r$, then
$$EL^f_{margin}(x_i, x_j) = \alpha + (2p - 1) \cdot \big(\beta_{x_i} - D_{x_i,x_j}\big).$$
Combining both directions, up to terms independent of the distances,
$$\sum_{x_j \in X} EL^f_{margin}(x_i, x_j) = (2p - 1) \cdot \Big(\sum_{x_j \in A_r} D_{x_i,x_j} - \sum_{x_j \notin A_r} D_{x_i,x_j}\Big) + \text{const}.$$
Since $p > 0.5$ and $|D_{x_i,x_j} - \beta_{x_i}| \le \alpha$, the minimal value is achieved whenever $D_{x_i,x_j} = 0$, $D_{x_i,x_k} = 2\alpha$ and $\beta_{x_i} = \alpha$ for every $x_i, x_j \in A_r$, $x_k \notin A_r$; that is, $f$ has the class-collapsing property. $\square$

B: EASY POSITIVE SAMPLING IN NOISY ENVIRONMENT

In this subsection we analyse the EPS method from a theoretical perspective, using the framework defined in Section 4. We use the same notions as in Sections 3 and 4. Define
$$\Phi(y_i, y_j) = \begin{cases} 1 & y_i = y_j \wedge D_{x_i,x_j} = \min\{D_{x_i,x_k}\ |\ y_k = y_i\} \\ 0 & \text{else.} \end{cases}$$
Then the easy-positive-sampling loss can be defined by
$$\frac{1}{n} \sum_{1\le i,j,k\le n} \Phi(y_i, y_j) \cdot L^f_{trip}(x_i, x_j, x_k)$$
for the triplet loss, and
$$\frac{1}{n} \sum_{1\le i,j\le n} \Phi(y_i, y_j) \cdot L^{f,\beta}_{margin}(x_i, x_j) + \mathbb{1}_{y_i \ne y_j} \cdot L^{f,\beta}_{margin}(x_i, x_j)$$
for the margin loss. In the noisy-environment stochastic case, using the notions of Section 4, $\Phi$ becomes a random variable:
$$\tilde\Phi(Y_i, Y_j) = \begin{cases} 1 & Y_i = Y_j \wedge \forall t\ \big((D_{x_i,x_t} < D_{x_i,x_j}) \to Y_t \ne Y_i\big) \\ 0 & \text{else.} \end{cases}$$
Therefore, the triplet loss with EPS in the noisy-environment case becomes
$$EL^f_{EPStrip}(x_i, x_j, x_k) = \mathbb{E}\Big[\tilde\Phi(Y_i, Y_j) \cdot \delta_{Y_i,Y_j} \cdot (1 - \delta_{Y_i,Y_k}) \cdot \big(D^f_{x_i,x_j} - D^f_{x_i,x_k} + \alpha\big)_+\Big],$$
and for the margin loss with EPS we have
$$EL^f_{EPSmargin}(x_i, x_j) = \mathbb{E}\big[\tilde\Phi(Y_i, Y_j) \cdot \delta_{Y_i,Y_j}\big] \cdot \big(D^f_{x_i,x_j} - \beta_{x_i} + \alpha\big)_+ + \mathbb{E}\big[1 - \delta_{Y_i,Y_j}\big] \cdot \big(\beta_{x_i} - D^f_{x_i,x_j} + \alpha\big)_+.$$
As in Section 4.2, we assume that $Y = \{Y_1, .., Y_n\}$ is a set of independent random variables, and that $A_1, .., A_t \subset X$ and $0.5 < p < 1$ satisfy $|A_j| = \frac{n}{t}$ and
$$P(Y_i = k) = \begin{cases} p & x_i \in A_k \\ q := \frac{1-p}{t-1} & x_i \notin A_k. \end{cases}$$
For simplicity we assume the samples are ordered so that $x_{(i-1)\frac{n}{t}+1}, .., x_{i\frac{n}{t}} \in A_i$ for every $1 \le i \le t$.

We first prove that the minimal embedding with respect to both losses does not satisfy the class-collapsing property. Let $f_1$ be an embedding such that
$$D^{f_1}_{x_i,x_j} = \begin{cases} 0 & \exists r\ (x_i, x_j \in A_r) \\ \alpha & \text{else} \end{cases}$$
and $f_2$ an embedding such that
$$D^{f_2}_{x_i,x_j} = \begin{cases} 0 & \exists r\ (x_i, x_j \in A_r) \wedge \neg\big((i \le \tfrac{n}{2t} \wedge j > \tfrac{n}{2t}) \vee (i > \tfrac{n}{2t} \wedge j \le \tfrac{n}{2t})\big) \\ \alpha & \text{else.} \end{cases}$$
$f_1$ represents the case of class collapsing, while $f_2$ represents the case where the first class has two modalities.
In order to show that the minimal embedding does not satisfy the class-collapsing property, it suffices to prove that
$$\frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_2}_{EPStrip}(x_i, x_j, x_k) < \frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_1}_{EPStrip}(x_i, x_j, x_k)$$
and
$$\frac{1}{n} \sum_{1\le i,j\le n} EL^{f_2}_{EPSmargin}(x_i, x_j) < \frac{1}{n} \sum_{1\le i,j\le n} EL^{f_1}_{EPSmargin}(x_i, x_j).$$
Remark: for both losses the definition requires a strict order between the elements; therefore, by distance zero we mean infinitesimally close. The order between the elements inside the sub-clusters is random, and elements of the set $A_1$ are closer than those of $A_1^c$ in both embeddings. For simplicity we neglect these infinitesimal constants in the proofs.

Claim 1. There exists $M$ such that if $n \ge M$, then
$$\frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_2}_{EPStrip}(x_i, x_j, x_k) < \frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_1}_{EPStrip}(x_i, x_j, x_k).$$
Proof. Fix $x_1$; WLOG we may assume in both embeddings that $D^{f_l}_{x_1,x_i} < D^{f_l}_{x_1,x_k}$ for every $l \in \{1, 2\}$ and $1 \le i < k \le n$. It suffices to prove that
$$\frac{1}{n} \sum_{1\le j,k\le n} \big(EL^{f_1}_{EPStrip}(x_1, x_j, x_k) - EL^{f_2}_{EPStrip}(x_1, x_j, x_k)\big) > 0.$$
Let $q = 1 - p$ and observe that
$$P\Big(\bigwedge_{1\le s<j} Y_1 \ne Y_s\Big) = p^{m+1} q^{j-2-m} + p^{j-2-m} q^{m+1} \le 2 p^{j-1},$$
where $m = |\{s\ |\ s \le j,\ x_s \in A_1\}|$. Thus if $j \ge \frac{n}{2t}$, we have $EL^{f_2}_{EPStrip}(x_1, x_j, x_k) \le P(\bigwedge_{1\le s<j} Y_1 \ne Y_s) \cdot 2\alpha \le 4\alpha p^{j-1}$. Therefore
$$\frac{1}{n} \sum_{j > \frac{n}{2t},\ 1\le k\le n} EL^{f_2}_{EPStrip}(x_1, x_j, x_k) \le \sum_{j > \frac{n}{2t}} 4\alpha p^{j-1} \le \frac{4\alpha p^{\frac{n}{2t}}}{1 - p} \xrightarrow{n \to \infty} 0.$$
For $j \le \frac{n}{2t}$ and either $k \le \frac{n}{2t}$ or $k > \frac{n}{t}$, we have $EL^{f_1}_{EPStrip}(x_1, x_j, x_k) = EL^{f_2}_{EPStrip}(x_1, x_j, x_k)$. Hence the only case left is $j \le \frac{n}{2t}$ and $\frac{n}{2t} < k \le \frac{n}{t}$.
In this case $EL^{f_2}_{EPStrip}(x_1, x_j, x_k) = 0$, whereas
$$EL^{f_1}_{EPStrip}(x_1, x_j, x_k) = \big(p^2 q^{j-1} + q^2 p^{j-1}\big) \cdot \alpha \ge q^{j+1} \alpha,$$
and we get
$$\frac{1}{n} \sum_{j \le \frac{n}{2t},\ \frac{n}{2t} < k \le \frac{n}{t}} \big(EL^{f_1}_{EPStrip}(x_1, x_j, x_k) - EL^{f_2}_{EPStrip}(x_1, x_j, x_k)\big) \ge \alpha q^2 \sum_{j=0}^{\frac{n}{2t}} q^j = \alpha q^2 \cdot \frac{1 - q^{n/2t}}{1 - q} \xrightarrow{n \to \infty} \frac{\alpha q^2}{1 - q}.$$
Choosing $M$ such that
$$\alpha q^2 \cdot \frac{1 - q^{M/2t}}{1 - q} > \frac{4\alpha p^{M/2t}}{1 - p}$$
ensures that for every $n > M$,
$$\frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_2}_{EPStrip}(x_i, x_j, x_k) < \frac{1}{n} \sum_{1\le i,j,k\le n} EL^{f_1}_{EPStrip}(x_i, x_j, x_k). \qquad \square$$

Claim 2. There exists $M$ such that if $n \ge M$, then
$$\frac{1}{n} \sum_{1\le i,j\le n} EL^{f_2}_{EPSmargin}(x_i, x_j) < \frac{1}{n} \sum_{1\le i,j\le n} EL^{f_1}_{EPSmargin}(x_i, x_j).$$
Proof. For every $1 \le j \le \frac{n}{2t}$ or $\frac{n}{t} < j \le n$ we have $EL^{f_1}_{EPSmargin}(x_i, x_j) = EL^{f_2}_{EPSmargin}(x_i, x_j)$. For $\frac{n}{2t} < j \le \frac{n}{t}$:
$$EL^{f_2}_{EPSmargin}(x_i, x_j) = 2pq \cdot \beta_{x_i} + \big(p^2 q^{j-2} + q^2 p^{j-2}\big) \cdot (2\alpha - \beta_{x_i}),$$
while
$$EL^{f_1}_{EPSmargin}(x_i, x_j) = 2pq \cdot (\beta_{x_i} + \alpha) + \big(p^2 q^{j-2} + q^2 p^{j-2}\big) \cdot (\alpha - \beta_{x_i}).$$
Since $j > \frac{n}{2t}$, the second term tends to zero. Therefore, taking $M$ such that $2pq > p^2 q^{\frac{M}{2t}-2} + q^2 p^{\frac{M}{2t}-2}$ ensures that for each $n \ge M$,
$$\frac{1}{n} \sum_{1\le i,j\le n} EL^{f_2}_{EPSmargin}(x_i, x_j) < \frac{1}{n} \sum_{1\le i,j\le n} EL^{f_1}_{EPSmargin}(x_i, x_j). \qquad \square$$

In the previous two claims we proved that the class-collapsing solution is not minimal with respect to either the EPSmargin or the EPStrip loss. In the following claims we prove more: looking locally at the direct effect of the EPS losses on a sample which is not one of the closest elements to the anchor, the optimal solution is an embedding in which the distance between that sample and the anchor equals the margin hyperparameter.

Claim 3. Let $f$ be an embedding. For every $i$, let $i_1, .., i_n$ be such that $D^f_{x_i,x_{i_1}} < D^f_{x_i,x_{i_2}} < ... < D^f_{x_i,x_{i_n}}$. Then there exists $M$ such that for every $j > M$, the minimum of $EL^f_{EPSmargin}(x_i, x_{i_j})$ is achieved whenever $D^f_{x_i,x_{i_j}} = \beta_{x_i} + \alpha$.

Proof.
Fix $x_1$; as in the previous claims we assume $D^f_{x_1,x_2} < \dots < D^f_{x_1,x_n}$. As was proven in Claim 1, $P\big(\bigcap_{1\le s<j} Y_1 = Y_s\big) \le 2p^{\,j-1}$, and thus
$$E\big(\Phi(Y_1,Y_j)\,\delta_{Y_1,Y_j}\big) \le P\Big(\bigcap_{1\le s<j} Y_1 = Y_s\Big) \le 2p^{\,j-1} \xrightarrow{\,j\to\infty\,} 0.$$
Since the minimal solution for
$$E\big(\Phi(Y_1,Y_j)\,\delta_{Y_1,Y_j}\big)\cdot\big(D^f_{x_1,x_j} - \beta_{x_1} + \alpha\big)_+ + E\big(1-\delta_{Y_1,Y_j}\big)\cdot\big(\beta_{x_1} - D^f_{x_1,x_j} + \alpha\big)_+$$
satisfies $|\beta_{x_1} - D^f_{x_1,x_j}| \le \alpha$, we have
$$L^f_{EPSmargin}(x_1,x_j) = \alpha\Big(E\big(\Phi(Y_1,Y_j)\,\delta_{Y_1,Y_j}\big) + E\big(1-\delta_{Y_1,Y_j}\big)\Big) + \big(D^f_{x_1,x_j} - \beta_{x_1}\big)\Big(E\big(\Phi(Y_1,Y_j)\,\delta_{Y_1,Y_j}\big) - E\big(1-\delta_{Y_1,Y_j}\big)\Big).$$
Since $E\big(1-\delta_{Y_1,Y_j}\big) \ge 2pq$, there exists $M$ such that every $j > M$ satisfies
$$E\big(\Phi(Y_1,Y_j)\,\delta_{Y_1,Y_j}\big) - E\big(1-\delta_{Y_1,Y_j}\big) < 0.$$
Therefore, the minimal value is achieved whenever $D^f_{x_1,x_j} = \beta_{x_1} + \alpha$. The proof for the EPStriplet loss case is similar.
Claim 4. Let $f$ be an embedding. For every $i$, let $i_1,\dots,i_n$ be such that $D^f_{x_i,x_{i_1}} < D^f_{x_i,x_{i_2}} < \dots < D^f_{x_i,x_{i_n}}$. Then there exists $M$ such that for every $j > M$ the minimal embedding for
$$L^f_{EPStrip}(x_i,x_t,x_{t+j}) + L^f_{EPStrip}(x_i,x_{t+j},x_t)$$
is achieved whenever $D^f_{x_i,x_{t+j}} = D^f_{x_i,x_t} + \alpha$.
Proof. Define $K(Y_i,Y_j,Y_k) := E\big(\Phi(Y_i,Y_j)\,\delta_{Y_i,Y_j}\,(1-\delta_{Y_i,Y_k})\big)$. Fixing $x_1$ and assuming $D^f_{x_1,x_2} < \dots < D^f_{x_1,x_n}$, we have
$$L^f_{EPStrip}(x_1,x_t,x_{t+j}) + L^f_{EPStrip}(x_1,x_{t+j},x_t) = K(Y_1,Y_t,Y_{t+j})\big(D^f_{x_1,x_t} - D^f_{x_1,x_{t+j}} + \alpha\big)_+ + K(Y_1,Y_{t+j},Y_t)\big(D^f_{x_1,x_{t+j}} - D^f_{x_1,x_t} + \alpha\big)_+.$$
As in the previous claim, the minimal value satisfies $|D^f_{x_1,x_{t+j}} - D^f_{x_1,x_t}| \le \alpha$; in this case
$$L^f_{EPStrip}(x_1,x_t,x_{t+j}) + L^f_{EPStrip}(x_1,x_{t+j},x_t) = \alpha\big(K(Y_1,Y_t,Y_{t+j}) + K(Y_1,Y_{t+j},Y_t)\big) + \big(D^f_{x_1,x_t} - D^f_{x_1,x_{t+j}}\big)\big(K(Y_1,Y_t,Y_{t+j}) - K(Y_1,Y_{t+j},Y_t)\big).$$
On the one hand,
$$K(Y_1,Y_t,Y_{t+j}) = \prod_{i\in\{1,2,\dots,t,t+j\}} p^{t_i}q^{1-t_i} + \prod_{i\in\{1,2,\dots,t,t+j\}} p^{1-t_i}q^{t_i} \ge q^{\,t+1},$$
where $t_k = 1$ if $Y_k \notin A$ and $0$ otherwise for $k \in \{2,\dots,t-1,t+j\}$, and $t_k = 1$ if $Y_k \in A$ and $0$ otherwise for $k \in \{1,t\}$. On the other hand,
$$K(Y_1,Y_{t+j},Y_t) \le P\Big(\bigcap_{1\le k<t+j} Y_1 = Y_k\Big) \le 4p^{\,t+j-1}.$$
Taking $j$ large enough such that $q^{\,t+1} > 4p^{\,t+j-1}$, we have
$$E\big(\Phi(Y_1,Y_t)\,\delta_{Y_1,Y_t}\,(1-\delta_{Y_1,Y_{t+j}})\big) - E\big(\Phi(Y_1,Y_{t+j})\,\delta_{Y_1,Y_{t+j}}\,(1-\delta_{Y_1,Y_t})\big) > 0;$$
therefore, in this case the minimum is achieved whenever $D^f_{x_1,x_{t+j}} = D^f_{x_1,x_t} + \alpha$.

C: MORE EXPERIMENTS AND IMPLEMENTATION DETAILS

EMBEDDING BEHAVIOR ON TRAINING SETS

The class-collapsing phenomenon also occurs during training on the image retrieval datasets. This can also be measured quantitatively on the Omniglot dataset: although training without EPS results in a better fit to the training samples, the results on the letters fine-grained task are significantly inferior compared to training with EPS (Table 4). It is also important to note the low NMI score when using EPS with the number of clusters equal to the number of languages, and the increase of this score when the number of clusters is increased.

MNIST ARCHITECTURE DETAILS

For the MNIST even/odd experiment we use a model consisting of two consecutive convolution layers with (3,3) kernels and 32 and 64 filters, respectively. Each of the two layers is followed by a ReLU activation and a batch-normalization layer; then a (2,2) max-pooling is followed by two dense layers with 128 and 2 neurons, respectively.
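A minimal PyTorch sketch of this network (a sketch under our reading of the description: we assume each convolution gets its own ReLU and batch normalization, and that the input is flattened before the dense layers; the class name is ours):

```python
import torch
import torch.nn as nn

class MNISTEmbedder(nn.Module):
    """Small CNN producing a 2-D embedding for the MNIST even/odd experiment."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3),   # first (3,3) conv, 32 filters
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=3),  # second (3,3) conv, 64 filters
            nn.ReLU(),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(2),                   # (2,2) max-pooling
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128),      # first dense layer, 128 neurons
            nn.Linear(128, 2),                 # final 2-D embedding
        )

    def forward(self, x):
        return self.head(self.features(x))

emb = MNISTEmbedder()(torch.randn(4, 1, 28, 28))
print(emb.shape)  # torch.Size([4, 2])
```

The 2-D output makes the embedding directly plottable, which is how Figure 2 visualizes the even/odd experiment.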

RECOGNITION DATASETS ARCHITECTURE DETAILS

We use an embedding of size 128, and an input size of 224×224 for the first two datasets and 80×80 for the Omniglot dataset. For all experiments we used the original images without cropping around the object bounding box. As a backbone for the embedding, we use ResNet50 (He et al., 2016) with weights pretrained on ImageNet. The backbone is followed by global average pooling and a linear layer that reduces the dimension to the embedding size. Optimization is performed using Adam with a learning rate of 10^-5; the other parameters are set to the default values from Kingma & Ba (2014).

STABILITY ANALYSIS

Following Musgrave et al. (2020) and Fehervari et al. (2019), it was important to us to have a fair comparison between all tested models. Therefore, for all experiments we use the same framework (Roth et al.), with the same architecture and embedding size (128). We also did not change the default hyper-parameters of any tested method. We ran each experiment 8 times with different random seeds; the reported results are the mean over all runs. The standard deviation of the Recall@1 results of all experiments can be seen in Table 5. In all cases, the differences between the results with and without EPS are significant.
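For reference, the easy positive sampling (EPS) modification evaluated throughout these comparisons can be sketched at the batch level as follows (a minimal sketch, not the framework's actual code; the function and variable names are ours):

```python
import torch

def easy_positive_indices(embeddings, labels):
    """For each sample, pick the *nearest* same-class sample in the batch
    (the 'easy positive'), rather than a random or hard positive."""
    dist = torch.cdist(embeddings, embeddings)          # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class mask
    same.fill_diagonal_(False)                          # exclude the anchor itself
    dist = dist.masked_fill(~same, float("inf"))        # ignore negatives
    return dist.argmin(dim=1)                           # nearest same-class sample

# toy batch: two sub-clusters, labels [0, 0, 1, 1]
emb = torch.tensor([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [2.1, 0.0]])
labels = torch.tensor([0, 0, 1, 1])
print(easy_positive_indices(emb, labels))  # tensor([1, 0, 3, 2])
```

The returned indices are then used as the positive element of each tuple, leaving the negative-sampling strategy of the underlying loss unchanged.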

MULTI-SIMILARITY COMPARISON

In our experiments, the Multi-similarity loss is highly affected by the batch size. Using a ResNet50 backbone, we restrict the batch size to 160 for all tested models, which causes the inferior results of the Multi-similarity loss compared to the other methods. For the sake of completeness, we also provide results with an Inception backbone, an embedding size of 512 as in Wang et al. (2019), and a batch size of 260. As can be seen in Table 6, in these cases as well the results improve when using EPS instead of semi-hard sampling for the positive samples.

TRIMMED LOSS COMPARISON

The situation where a class consists of multiple modes can also be seen as a noisy-data scenario with respect to the embedding loss, where positive tuples consisting of examples from different modes are considered 'bad' labelling. One approach to addressing noisy labels is to back-propagate the loss only on the k elements in the batch with the lowest current loss (Shen & Sanghavi, 2019). Although this approach resembles Malisiewicz & Efros (2008), the difference is that Malisiewicz & Efros (2008) apply the trimming only on the positive tuples. We test the effect of using the Trimmed Loss on randomly sampled triplets with different levels of trimming percentage. As can be seen in Figure 6, there is only a minor improvement when applying the loss on top of the distance-margin loss on the Omniglot-letters dataset. This emphasizes the importance of constraining the trimming to the positive sampling only.
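A minimal sketch of the trimming step in this baseline, assuming per-sample losses for the batch have already been computed (the function name and `trim_frac` interface are ours):

```python
import torch

def trimmed_loss(per_sample_losses, trim_frac=0.2):
    """Back-propagate only through the (1 - trim_frac) fraction of the batch
    with the lowest current loss, discarding the highest-loss samples."""
    n = per_sample_losses.numel()
    k = max(1, int(n * (1.0 - trim_frac)))                 # number of samples kept
    kept, _ = torch.topk(per_sample_losses, k, largest=False)
    return kept.mean()

losses = torch.tensor([0.1, 0.2, 5.0, 0.3, 4.0])  # two likely-noisy samples
out = trimmed_loss(losses, trim_frac=0.4)          # mean of the 3 smallest, ~0.2
print(out)
```

In our EPS formulation, by contrast, the "trimming" acts only on which positive enters each tuple, not on which tuples contribute to the loss.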



Figure 2: Embedding examples from the MNIST validation set, after training using only even/odd labels. Different colors indicate different digits. Left: Using triplet loss, class collapsing pushes all intra-class digits to overlapping clusters. Right: With EPS, different digits form separate clusters. Retrieval or classification using the odd-vs-even task/metric is more effectively implemented using the embedding on the right, even though the embedding on the left is learned with a loss that more strictly optimizes for the task.

Figure 3: Results on Omniglot-letters. (a) Recall@1 performance of each model per epoch. (b) Performance of the EPS + distance-margin model on the Omniglot dataset, as a function of the number of positive samples in the batch (where zero is equivalent to using only distance sampling). Increasing the number of positive samples enhances the model performance.

Figure 4: Retrieval results for randomly chosen query images in Cars196 dataset. Using EPS creates more homogeneous neighbourhood relationships with respect to the car viewpoint.

PROOFS FOR THE THEOREMS IN SUBSECTION 4.2

Theorem 1. Let $f : O \to \mathbb{R}^m$ be an embedding which minimizes $EO_{trip}(f)$; then $f$ has the class-collapsing property with respect to all classes.

for every $x_j \in A_r$, and $\|f^*(x_j) - f^*(x_i)\| = \alpha$ for every $x_j \notin A_r$.

Figure 5: t-SNE visualization of Cars196 training classes (each class has a different color). Training with EPS results in more diverse class appearance.

Figure 5 visualizes the t-SNE embedding (van der Maaten & Hinton, 2008) of the Cars196 training classes. As can be seen, when training without EPS each class fits well to a bivariate normal distribution with small variance and a distinct mean. Training with EPS results in more diverse distributions, and some of the classes fit better to a mixture of multiple different distributions.
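The projection behind Figure 5 can be reproduced with off-the-shelf t-SNE; a minimal scikit-learn sketch on stand-in random data (the learned 128-D embeddings of the training samples would be substituted for `embeddings`):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# stand-in for learned 128-D embeddings of the training samples
embeddings = rng.normal(size=(200, 128)).astype(np.float32)

# project to 2-D for plotting; one color per class in the actual figure
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)
print(coords.shape)  # (200, 2)
```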

Figure 6: Recall@1 performance with the Trimmed loss across varying trimming percentages. Except for a small improvement in the distance-margin case on the Omniglot dataset, in all other cases there is no improvement when applying the Trimmed loss.



Recall@k and NMI performance on Cars196 and CUB200-2011. NMI+ indicates the NMI measurement when using 10 × (number of classes) clusters. Our EPS method improves in all cases. †: Our re-implemented version with the same embedding dimension.

Recall@k and NMI performance on the Omniglot dataset. In both cases the training was done with only language labels. Right: evaluation on language labels. Left: evaluation on letter labels. NMI+ indicates the NMI measurement when using 30 × (number of classes) clusters. Our EPS method improves in both cases.

Results of semi-hard sampling with/without EPS on the Omniglot training dataset. Without EPS the network fits almost perfectly to the training set. However, using EPS results in better performance on the letters fine-grained task.

Standard deviation of the Recall@1 results. Each model was trained 8 times with different random seeds.

Results of the Multi-similarity loss with embedding size 512 (as in Wang et al. (2019)). Using EPS improves results in both cases.

