TRIPLET SIMILARITY LEARNING ON CONCORDANCE CONSTRAINT

Abstract

Triplet-based loss functions have been the paradigm of choice for robust deep metric learning (DML). However, conventional triplet-based losses require carefully tuning a decision boundary, i.e., the violation margin. When performing online triplet mining on each mini-batch, choosing a good global, constant prior value for the violation margin is challenging and often unprincipled. To circumvent this issue, we propose a novel yet efficient concordance-induced triplet (CIT) loss as an objective function for training DML models. We formulate the similarity of triplet samples as a concordance constraint problem and directly optimize concordance during DML model learning. Triplet concordance means that the predicted ordering of intra-class and inter-class similarities is correct, a property that is invariant to any monotone transformation of the decision boundary of triplet samples. Hence, our CIT loss is free from adopting the violation margin as a prior constraint. In addition, because triplet-based losses incur high training complexity, we introduce a partial likelihood term into CIT loss that imposes additional penalties on hard triplet samples, thus enforcing fast convergence. We experiment extensively on a variety of DML tasks to demonstrate the elegance and simplicity of our CIT loss against its counterparts. In particular, on face recognition, person re-identification, and image retrieval datasets, our method achieves performance comparable to the state of the art without laborious hyper-parameter tuning.

1. INTRODUCTION

Deep metric learning (DML) for visual understanding tasks, e.g., face recognition Schroff et al. (2015); Taigman et al. (2014), person re-identification (ReID) Shi et al. (2016); Ustinova & Lempitsky (2016), and image retrieval Fang et al. (2021); Revaud et al. (2019), aims at learning embedding representations of images with class-level labels via a ranking loss function Kaya & Bilge (2019); Sohn (2016); Wang et al. (2017). Two representative families of ranking loss functions have been developed for DML to minimize between-class similarity and maximize within-class similarity: pair-based losses Sun et al. (2014) and triplet-based losses Zhao et al. (2019). Compared to pairwise constraints, the optimization pattern of triplet-based losses additionally captures relative similarity information, thus yielding impressive performance Liang et al. (2021); Zhuang et al. (2016). Under triplet constraints, images from the same class are projected into neighboring regions of the embedding space, and images with different semantic contexts are mapped apart. However, under such an optimization objective, triplet-based losses suffer from the following two problems when training DML models with the stochastic gradient descent (SGD) algorithm and sampling triplets within a mini-batch.

• It is irrational to set an absolute margin. The triplet constraint relies on a decision boundary, i.e., the violation margin, to partition the embedding space into intra-class and inter-class regions and reinforce optimization Wang et al. (2018a;b). However, the violation margin is sensitive to scale change, and choosing an identical absolute value for clusters with different scales of intra-class variation is inappropriate Wang et al. (2017). Hence, triplet-based losses need to regulate this hyper-parameter attentively to impose appropriate penalty strength Qian et al. (2019); Sun et al. (2020). The behavior of Circle loss Sun et al. (2020) under varying circular decision boundaries supports this claim.
On the same task, performance differs significantly under different violation margins, and Circle loss with one fixed violation margin varies from superior to inferior across tasks. To circumvent this issue, Angular loss pushes the negative point away from the center of the positive cluster and drags the positive points closer to each other by constraining the upper bound of the angle at the negative point Wang et al. (2017). In hierarchical triplet loss (HTL) Ge (2018), the violation margin is automatically updated over a constructed hierarchical tree to identify a margin that generates gradients for violated triplets. However, existing methods that mitigate this issue still depend on a hand-set decision boundary and merely substitute one hyper-parameter for another: Angular loss must specify the angle in degrees, and HTL must design a hierarchical class tree. Since choosing a global, constant prior value for the decision boundary is irrational, we instead formulate triplet similarity learning as a concordance constraint problem without an assumed decision boundary.

• Suffering from slow convergence. Triplet-based losses can provide a strong supervisory signal for training DML models by mining rich and fine-grained inter-sample relations. However, since the number of tuples (each containing an anchor sample and its positive and negative samples) grows polynomially with the number of training samples, they incur prohibitively high training complexity and thus converge significantly slowly Ebrahimpour et al. (2022); Kim et al. (2020). Another potential issue for triplet-based losses is that a large fraction of tuples contribute little to the learning algorithm and sometimes even diminish the quality of the learned embedding space Wu et al. (2017).
Many works have studied effective triplet sampling strategies within a mini-batch to exploit hard triplet samples and improve convergence speed or final discriminative performance Hermans et al. (2017); Oh Song et al. (2016); Sohn (2016); Wu et al. (2017). For example, HTL Ge (2018) automatically collects informative training triplets via an adaptively learned hierarchical class structure. However, these hard-triplet mining techniques involve tuning hyper-parameters and risk overfitting when performing online triplet mining within a mini-batch Ebrahimpour et al. (2022); Kim et al. (2020). Given three tuple types (hard, semi-hard, and easy triplet samples), we must consider how to trade them off during DML optimization: leveraging hard triplet samples alone may lead to bad local minima Do et al. (2019), while overwhelming easy triplet samples hurt training efficiency Schroff et al. (2015). Inspired by SoftTriple loss Qian et al. (2019), which learns embeddings without triplet sampling, we explore laying more emphasis on hard triplet samples by relaxing concordance constraints, thus accelerating convergence. In each triplet sample, the intra-class similarity is naturally higher than the inter-class similarity, so the predicted ordering of intra-class and inter-class similarities should agree with the observed ordering. Such ordering concordance holds not only within a mini-batch but also over the whole sample set, and this intrinsic concordance constraint is invariant to any monotone transformation of the decision boundary of triplet samples. Hence, we develop a novel concordance-induced triplet (CIT) loss function to optimize triplet similarity. Existing triplet-based losses explicitly impose a global, constant violation margin as a decision boundary based on a priori knowledge.
Unlike them, our CIT loss exploits the concordance constraint of triplet similarity to avoid the plague of tuning the violation margin. It is an elegant, simple, and efficient way to learn the intrinsic similarity between all samples and is insensitive to triplet sampling within a mini-batch. We further introduce a partial likelihood term to enforce different penalty strengths on different tuple types, primarily laying more penalty on hard triplet samples. This term mainly helps improve convergence speed and has only a slight impact on performance, thus avoiding elaborate tuning. Under thorough, random mini-batch and triplet sampling, this term regulates the penalty strength consistently with the degree of discordance or concordance of triplet similarity: higher discordance of hard triplet samples brings greater penalty strength and hence larger contributions to the gradients. The main contributions of this work are summarized as follows:

• We propose a novel, simple, and elegant concordance-induced triplet (CIT) loss function for deep metric learning (DML). Our CIT loss frees DML training from tuning the decision boundary by directly maximizing the concordance of triplet similarity.

• We introduce a partial likelihood term that imposes loose concordance constraints to focus on the informativeness of hard triplet samples, thus helping speed up convergence.

• Using two popular backbones, we conduct extensive experiments on various DML tasks, including face recognition, person re-identification (ReID), and image retrieval. On all tasks, we demonstrate the effectiveness and elegance of our CIT loss, which attains performance on par with the state of the art.

2. METHODOLOGY

2.1 WARM-UP

Given a training set D = {(x_i, y_i)}_{i=1}^N with K classes, x_i denotes the feature embedding projected from the i-th image by a DML model, with label y_i ∈ {0, 1, ..., K-1}. Let the feature embedding x_i ∈ R^D be represented as F_θ(I_i), where I_i is the i-th input image, θ indicates the learnable parameters of a differentiable DML model F, and D is the dimension of the feature embedding. With the DML model F, we map the image I_i into a low-dimensional embedding x_i used for similarity measurement; x_i is usually normalized to unit length for training stability and simplicity of comparison. DML models primarily utilize ranking loss functions to learn the embedding space from the image space in a supervised way: apart from a suitable neural network, it is essential to design a ranking loss function that optimizes the embedding space. Our work aims to improve triplet-based losses so that DML models can be trained elegantly and efficiently, avoiding tuning of the violation margin and improving convergence speed while retaining impressive performance. DML models are commonly trained with online SGD algorithms, where the gradients for optimizing network parameters are computed locally on mini-batches; hence, triplet samples are selected and formed within a mini-batch at each training iteration. Each triplet sample T = (x_a, x_p, x_n) consists of an anchor sample x_a, a positive sample x_p, and a negative sample x_n, whose labels satisfy y_a = y_p ≠ y_n. The goal of triplet-based losses is to push the negative sample x_n away from the anchor sample x_a by a violation margin m > 0 relative to the positive sample x_p:

S_an + m ≤ S_ap,    (1)

where S_ap ∈ [0, 1] is the intra-class similarity of x_a and x_p, and S_an ∈ [0, 1] is the inter-class similarity of x_a and x_n. We seek to minimize S_an and maximize S_ap.
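As a minimal sketch (not the authors' released code; the margin value used below is illustrative), the similarity computation behind Equation 1 can be written with cosine similarity of L2-normalized embeddings:

```python
import numpy as np

def triplet_similarities(x_a, x_p, x_n):
    """Cosine similarities (S_ap, S_an) for a batch of triplets.

    Each argument has shape (batch, D); embeddings are L2-normalized
    first, as the text assumes unit-length features.
    """
    norm = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
    x_a, x_p, x_n = norm(x_a), norm(x_p), norm(x_n)
    s_ap = np.sum(x_a * x_p, axis=-1)  # intra-class similarity
    s_an = np.sum(x_a * x_n, axis=-1)  # inter-class similarity
    return s_ap, s_an

def violates_margin(s_ap, s_an, m=0.2):
    # A triplet still generates loss when Equation 1, S_an + m <= S_ap,
    # is NOT satisfied; m = 0.2 here is only an example value.
    return s_an + m > s_ap
```

In practice such losses run on GPU tensors (the paper uses PyTorch platforms); the numpy form above is just the arithmetic.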
To enforce this constraint in the embedding space, the optimization target of the standard triplet loss is defined as:

L_st = (1/N_T) Σ_T [S_an - S_ap + m]_+,    (2)

where the operator [·]_+ = max(0, ·) is the hinge function and N_T denotes the number of triplet samples in a mini-batch. According to Equation 2, given a globally constant value for the violation margin m, all triplet samples fall into three categories:

• Hard triplet samples: S_an > S_ap;
• Semi-hard triplet samples: S_ap - m < S_an < S_ap;
• Easy triplet samples: S_an + m < S_ap.

Among these three tuple types, easy triplet samples generate zero loss, while hard triplet samples contribute the largest losses. An effective sampling strategy combined with an appropriate violation margin can help mine hard triplet samples; in fact, the violation margin in triplet-based losses plays a key role in sample selection during model training Ge (2018). However, the violation margin must be carefully tuned, and an absolute violation margin is an irrational decision boundary for multiple classes with different cluster centroids.
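The loss of Equation 2 and the three tuple types above can be sketched as follows (a hedged illustration; again, m = 0.2 is only an example value):

```python
import numpy as np

def standard_triplet_loss(s_ap, s_an, m=0.2):
    # Equation 2: L_st = (1/N_T) * sum_T [S_an - S_ap + m]_+
    return float(np.mean(np.maximum(s_an - s_ap + m, 0.0)))

def categorize_triplets(s_ap, s_an, m=0.2):
    """Split a batch of triplets into the three tuple types of the text."""
    hard = s_an > s_ap                 # wrong ordering: largest losses
    easy = s_an + m < s_ap             # margin satisfied: zero loss
    semi_hard = ~(hard | easy)         # correct ordering, margin violated
    return hard, semi_hard, easy
```

For example, with s_ap = [0.9, 0.9, 0.5] and s_an = [0.1, 0.8, 0.7] the three triplets are easy, semi-hard, and hard respectively, and only the latter two contribute to the loss.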

2.2. CIT LOSS

Put simply, we explore learning hard triplet samples effectively without considering the violation margin at all. Since concordance is invariant to any monotone transformation of the decision boundary of triplet samples, it is natural to formulate the metric learning problem as maximizing concordance. We turn the predicted similarities of each triplet sample T into a comparable pair, i.e., the intra-class similarity S_ap and the inter-class similarity S_an. The set of comparable pairs mined from the whole training set D is E_T := {(S_ap, S_an)}, with cardinality N_T. A comparable pair (S_ap, S_an) ∈ E_T is concordant if S_ap > S_an; otherwise, the pair is discordant with the ground truth. We calculate the average ratio over all pairs by taking the exponential form of the intra-class and inter-class similarities:

R = (1/N_T) Σ_{(S_ap, S_an) ∈ E_T} e^{S_an} / e^{S_ap}.    (3)

Obviously, the average ratio R lies in the range [e^{-1}, e]; the lower and upper ends correspond to complete concordance and complete discordance of all comparable pairs, respectively. With such bounds, we define our CIT loss in an exponential lower-bound form:

L_e = (1/N_T) Σ_{(S_ap, S_an) ∈ E_T} [1 - e^{-(S_an - S_ap)}]_+.    (4)

By minimizing this loss, we address metric learning as concordance optimization between the predicted and observed similarities of comparable pairs. Similarity concordance over the whole set of triplet samples is invariant to any monotone transformation of the decision boundary. In other words, concordance optimization emphasizes the ordering of distances without regard to their magnitudes, thus avoiding margin constraints. The empirical error induced by pairwise discordance with respect to the DML model F_θ, denoted G(F_θ), is defined by:

G(F_θ) = (1/N_T) Σ_{(S_ap, S_an) ∈ E_T} I_{S_ap < S_an} ≥ L_e,    (5)

where the indicator function I = 1 if S_ap < S_an (discordance) and 0 otherwise (concordance).
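A minimal numeric sketch of Equations 3 and 4 (an illustration under the definitions above, not the paper's implementation):

```python
import numpy as np

def cit_exponential_loss(s_ap, s_an):
    # Equation 4: L_e = (1/N_T) * sum [1 - exp(-(S_an - S_ap))]_+
    # Concordant pairs (S_ap > S_an) are hinged to zero loss.
    return float(np.mean(np.maximum(1.0 - np.exp(-(s_an - s_ap)), 0.0)))

def average_ratio(s_ap, s_an):
    # Equation 3: R = (1/N_T) * sum exp(S_an)/exp(S_ap),
    # bounded in [1/e, e] when similarities lie in [0, 1].
    return float(np.mean(np.exp(s_an - s_ap)))
```

Note that no margin hyper-parameter appears: only the sign of S_an - S_ap (the ordering) decides whether a pair is penalized.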
According to Equation 5, we can estimate the learnable parameters θ of the DML model F by minimizing the CIT loss L_e, which uses a concordance-induced penalty to optimize the target pairwise similarity. Assuming the violation margin m is 0 in our CIT loss, there are only two tuple types: hard triplet samples if S_ap < S_an and easy triplet samples if S_ap > S_an. For easy triplet samples, the DML model F faultlessly predicts the target pairwise ranking and thus incurs no penalty, while hard triplet samples contribute the informativeness for gradient-based optimization. To speed up convergence, we further introduce a partial likelihood term into our CIT loss to focus on the discordant penalty of hard triplet samples. From the standpoint of concordance, the pairwise similarities within a triplet sample T are not always transitive: the DML model F_θ may predict S_ap > S_an and yet S_ap < S_pn, where S_pn denotes the inter-class similarity between x_p and x_n. Considering the transitivity of the triangle edges, we define the partial likelihood form as:

L_p = Π_{(S_ap, S_an) ∈ E_T} e^{S_ap} / (e^{S_an} + e^{S_pn}).    (6)

Theoretically, the higher the proportion of the edge S_ap among the three sides, the better the concordance and transitivity; hence the product of ratios in L_p indicates the degree of concordance or discordance. We quantize this degree into a loose concordance constraint term by taking the negative log partial likelihood of Equation 6:

L_p = -(1/N_T) Σ_{(S_ap, S_an) ∈ E_T} [S_ap - log(e^{S_an} + e^{S_pn})].    (7)

With Equation 7, the penalty on prediction error for hard triplet samples is boosted, which helps gradient optimization. The best predictive outcome for easy triplet samples is S_ap ≥ log(e^{S_an} + e^{S_pn}), which incurs no penalty; otherwise, a slight penalty on easy triplet samples helps enhance the clustering effect within the class.
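The partial likelihood term of Equation 7, and its weighted combination with the strict term of Equation 4 into the full CIT objective (Equation 8), can be sketched as follows (a hedged illustration; the gamma = 0.5 default follows the trade-off suggested in the text):

```python
import numpy as np

def cit_partial_likelihood_loss(s_ap, s_an, s_pn):
    # Equation 7: L_p = -(1/N_T) * sum [S_ap - log(exp(S_an) + exp(S_pn))]
    return float(-np.mean(s_ap - np.log(np.exp(s_an) + np.exp(s_pn))))

def cit_loss(s_ap, s_an, s_pn, gamma=0.5):
    # Equation 8: L_cit = gamma * L_e + (1 - gamma) * L_p
    l_e = np.mean(np.maximum(1.0 - np.exp(-(s_an - s_ap)), 0.0))
    l_p = -np.mean(s_ap - np.log(np.exp(s_an) + np.exp(s_pn)))
    return float(gamma * l_e + (1.0 - gamma) * l_p)
```

As a sanity check, a discordant triplet (S_an > S_ap) incurs a strictly larger combined loss than a concordant one, which is the behavior that focuses gradients on hard triplet samples.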
It can be seen that this term obeys the principles of maximizing intra-class similarity and minimizing inter-class similarity. Combining Equations 4 and 7, we present our CIT loss as:

L_cit = γ L_e + (1 - γ) L_p,    (8)

where the hyper-parameter γ ∈ [0, 1] balances the two loss terms. Specifically, γ keeps the magnitudes of the two losses at the same level to stabilize model training. Comparing the two terms of our CIT loss, the former requires a more rigorous concordance; however, since complete transitivity and concordance cannot be ensured, we cannot fully replace the former term with the latter. Generally, γ = 0.5 trades off the discordant penalty and fast convergence. Schematically, Figure 1 (a) and (b) show that our CIT loss brings fast convergence by utilizing the hyper-parameter γ to boost the contribution of hard triplet samples to the gradient update of network parameters, and it reduces the training consumption spent on easy triplet samples, thus reaching the final performance earlier.

3. EXPERIMENTS

3.1. EXPERIMENTAL SETUP

Among the compared methods, both SoftTriple (ST) Qian et al. (2019) and our CIT are exempt from the violation margin setting. Our CIT leverages the intrinsic concordance between the predicted and observed similarities of triplet samples, while ST extends the softmax loss with multiple centers for each class. Implementation details. We implement all loss functions on the pytorch-metric-learning Musgrave et al. (2020) platform and experiment with two different network structures: a convolutional neural network (CNN), ResNet50 He et al. (2016), and a vision transformer (ViT) Dosovitskiy et al. (2020). We set the hyper-parameters of the compared methods to the defaults reported in their papers, and the default value of the hyper-parameter γ in our CIT loss is 1.0 unless otherwise specified. We extract 512-D feature embeddings and use Euclidean distance as the metric during inference. We adopt the FastReID He et al.
(2020) platform to train the two networks with different loss functions on three DML tasks. The CNN is optimized with Adam Kingma & Ba (2014) at a learning rate of 3.5e-4, while the ViT uses SGD with a learning rate of 8e-3.

3.2. PERSON RE-IDENTIFICATION

We evaluate the compared loss functions on the ReID task in Table 1 and make three observations from the reported performance. First, our CIT achieves competitive performance against the state of the art: CIT obtains the best R@1 on CNN (95.81) and the second-highest on ViT (93.88), and achieves the best mAP on ViT (86.23) and the second-highest on CNN (89.41). Moreover, the mINP of our CIT is on par with the second-highest methods (underlined), showing that it stays in the first tier among all methods. Second, we report performance under varying violation margins for those loss functions that rely on a decision boundary (see the Margin column in Table 1). Different violation margins bring significant performance differences, most prominently for Angular and Circle: Angular achieves the best mAP and mINP on CNN when the angle in degrees (α) is 20 (its default is 40), but its performance is pretty poor when α = 60. We conducted repeated experiments to verify the reliability of these results, which confirm that choosing an appropriate violation margin is challenging. Third, both ST and our CIT are free from setting the violation margin, yet in terms of the three key metrics R@1, mAP, and mINP on CNN and ViT, our CIT exhibits significant advantages over ST. In fact, there remains a crucial hyper-parameter in ST, the number of centers for each class, which still requires tuning.

(Table 2: R@1, R@5, R@10, and mAP for each method and network on the Cars196 and In-shop Clothes datasets.)

3.3. IMAGE RETRIEVAL

We evaluate all compared loss functions with their default hyper-parameter settings on two image retrieval datasets, Cars196 and In-shop Clothes, and compare our CIT against the state-of-the-art methods in Table 2. Over the metrics R@1 and mAP on the two datasets and two networks, CIT takes four first places (bold) and four second places (underlined) out of eight. Such results again demonstrate the superiority of our CIT without careful tuning of decision boundaries or other hyper-parameters. Table 2 implies two further interesting points. First, for the compared methods that use a violation margin, Table 1 illustrates that various violation margins bring significant performance differences on the same dataset, whereas Table 2 indicates that different datasets with the same violation margin exhibit significant differences in gains. This is evidenced by Circle, which attains the best R@1 of 92.73 and mAP of 76.97 on ViT for the In-shop Clothes dataset but performs worst on ViT for the Cars196 dataset. Second, on the Cars196 dataset, our CIT lags well behind Triplet in mAP on CNN. The metric R@1 only measures how many items are hit, while mAP also considers the rank of the hit items and thus reflects ranking quality. From the standpoint of ranking quality, our CIT does not quantize the degree of concordance between the predicted and observed similarities: CIT implicitly models the relation between S_ap and S_an as an inequality, while Triplet explicitly formulates their relation as an equation via the violation margin. When the chosen violation margin fits the task, such a prior constraint can boost performance, and the result is superior to our CIT; however, choosing an appropriate task-specific violation margin is challenging and laborious.
Overall, our CIT marginally outperforms its counterparts and exhibits an elegant training manner that requires no elaborate adjustment of the violation margin.

3.4. FACE RECOGNITION

For the face recognition task, based on the ROC curve of TPR (True Positive Rate) at FPR (False Positive Rate) from 1e-6 to 1, we report the AUC performance on four verification datasets on CNN and ViT in Figure 2. Our CIT attains the highest AUC on four of the eight sub-tasks: AgeDB on CNN (98.33), CFP-FP on CNN (97.39), IJB-C on ViT (99.32), and CFP-FP on ViT (96.85). The overall performance on this task again affirms the utility of our CIT, which directly optimizes the similarity concordance of triplet samples and is free from the annoyance of introducing a prior decision boundary. For the loss functions with a violation margin, by contrast, we must diligently seek good violation margins for specific datasets and backbones to obtain comparable performance. And for ST, which has no violation margin, we still need to find an appropriate number of centers for each class; an overly large number of centers raises an efficiency problem. These hyper-parameters are task-specific, so regulating them is indispensable and de facto laborious and time-consuming. Based on the concordance relations of similarity, our CIT exhibits more flexibility in modeling triplet constraints than existing triplet-based losses: the violation margin enforces a constant restriction on triplet samples, while our CIT only enforces an ordering relationship. Without any bells and whistles in hyper-parameter tuning, CIT yields advanced performance, and its only hyper-parameter γ is task-agnostic.

3.5. ABLATION STUDY

Here we analyze the impact of the hyper-parameter γ in Equation 8. Table 3 shows that γ has little effect on the performance of our CIT loss: both accuracy and mAP are consistent and coherent across different values of γ, and for the two datasets on CNN and ViT, the best R@1 and mAP (bold) are scattered over varying values of γ. Since Equations 4 and 7 are interchangeable in this respect, differing only in that Equation 7 lays more penalty on hard triplet samples, the reported stable performance demonstrates that γ has little impact on learning the concordance of triplet similarity under mini-batch sampling. We can therefore claim that this hyper-parameter is task-agnostic, and from the performance perspective our CIT need pay no attention to it. From the convergence viewpoint, on the other hand, introducing γ helps DML models speed up convergence by laying emphasis on hard triplet samples. As Figure 3 shows, CIT with smaller γ converges faster, reaching its final performance in fewer epochs. A smaller γ entails that Equation 7, with its looser concordance constraint, contributes more to the loss value: our CIT then pays more penalty on hard triplet samples and little penalty on easy ones, thus advancing the convergence speed. This ablation study demonstrates that our CIT with γ can alleviate the training complexity of triplet-based loss functions.

4. CONCLUSIONS AND DISCUSSIONS

Building on the concordance constraint of triplet similarity, we propose a novel and elegant concordance-induced triplet (CIT) loss function that simplifies the optimization process for deep metric learning (DML). Our CIT loss frees DML training from the laborious tuning of the violation margin in conventional triplet-based loss functions and encourages fast convergence. The violation margin is task-specific, while the concordance constraint is task-agnostic and invariant to monotone transformations; hence, the concordance between the predicted and observed similarities lets our CIT loss escape the plague of imposing prior constraints as decision boundaries. We further utilize the degree of concordance of triplet samples to place more penalty on hard triplet samples and speed up gradient optimization. Extensive experiments on three popular DML tasks with two networks demonstrate the elegance and utility of our proposed CIT, yielding performance on par with other triplet-based loss functions. It is worth emphasizing that our CIT loss intends to achieve performance comparable to its counterparts rather than to pursue superior performance: CIT favors simple and elegant DML training by modeling the concordance of triplet similarity. When a DML task makes it hard to choose a violation margin and needs its training complexity alleviated, our CIT loss is a reliable alternative to conventional triplet-based loss functions. In the future, it is interesting to explore which degree of concordance of triplet samples yields the best performance: if we can design a method to measure this degree, we can avoid the excessive constraints of the partial likelihood term in our CIT loss.

A GENERALIZATION BOUND ANALYSIS

Theorem 1. Let L represent a family of loss functions associated with L_cit given the DML model F_θ. Then for ∀δ > 0 over E_T, the following holds for any L_cit ∈ L with probability at least 1 - δ:
E[L_cit] ≤ Ẽ_D[L_cit] + R_D(L) + sqrt(ln(1/δ) / (2 N_T)),    (9)

where Ẽ_D[·] denotes the empirical expectation over the training comparable pairs E_T and R_D(L) denotes the empirical Rademacher complexity of L.

Proof Sketch. The generalization error bound based on the McDiarmid inequality and Rademacher complexity for concordance learning can be proved via Theorem 3.5 of Mohri et al. (2018).

Remark 1. As shown in Equation 9, the supremum of the generalization bound consists of three terms. The first term is the empirical error on the training set; a lower empirical error yields a smaller supremum of the generalization bound. In the third term, a larger number of triplet samples reduces the upper bound of the generalization error. The generalization bound of our loss function L_cit given the DML model F_θ is thus largely determined by the second term, the Rademacher complexity of L.

Theorem 2. For the hypothesis space L = {L : T → {0, 1}}, (L_cit(T_1), L_cit(T_2), ..., L_cit(T_N)) ∈ {0, 1}^N is a dichotomy. We further define the growth function of L as M_L(N_T) = max_{T_1, ..., T_{N_T}} |{(L(T_1), ..., L(T_{N_T})) : L ∈ L}|. Then the following holds:

R_D(L) ≤ sqrt(2 ln M_L(N_T) / N_T).    (10)

Proof Sketch. Relating the Rademacher complexity to the growth function, Equation 10 can be derived from Theorem 3.7 (Massart's lemma) and Corollary 3.8 in Mohri et al. (2018).

Remark 2. The Rademacher complexity can be bounded in terms of the growth function, which is distribution-independent and purely combinatorial. The growth function M_L(N_T) is the maximum number of distinct ways in which N_T triplet samples can be predicted as concordant (0) or discordant (1) by hypotheses in L. It counts the dichotomies realized by the hypothesis space, with trivial upper bound 2^{N_T}, and thus reflects the representation power, i.e., the complexity, of the hypothesis space L. Combining Equations 9 and 10, we rewrite the supremum of the generalization error as:

E[L_cit] = sup_D { Ẽ_D[L_cit] + sqrt(2 ln M_L(N_T) / N_T) + sqrt(ln(1/δ) / (2 N_T)) }.    (11)
Based on the above theorems, we compare the supremum of the generalization bound for the two terms of the loss function L_cit. The relation of triplets is modeled as S_ap - S_an in Equation 4, and we reorganize the modeling relation of Equation 7 with a violation margin as follows:

S_ap - S_an + [S_an - log(e^{S_an} + e^{S_pn})]_{margin}.    (12)

Similar to the standard triplet loss in Equation 2, the loss function L_p utilizes a margin to constrain triplet similarities, as Equation 12 shows. Obviously, from the standpoint of modeling complexity, the representation power of L_p is stronger than that of L_e.

Corollary 1. Given the same DML model F_θ, let L_e represent a family of loss functions associated with L_e and L_p a family associated with L_p. Then the following holds:

M_{L_e}(N_T) < M_{L_p}(N_T) ≤ 2^{N_T}.    (13)

For any hypothesis set L, the trivial bound is 2^{N_T}. The growth function measures the richness or complexity of the hypothesis set; hence the growth function M_{L_e}(N_T) of the hypothesis set L_e is smaller than the growth function M_{L_p}(N_T) of the hypothesis set L_p.

Corollary 2. By virtue of Theorems 1 and 2, we further bound the difference between the empirical error and the generalization error for the loss functions L_e and L_p associated with the DML model F_θ:

E[L_e] - Ẽ_D[L_e] ≤ sqrt(2 ln M_{L_e}(N_T) / N_T) + sqrt(ln(1/δ) / (2 N_T)),    (14)

E[L_p] - Ẽ_D[L_p] ≤ sqrt(2 ln M_{L_p}(N_T) / N_T) + sqrt(ln(1/δ) / (2 N_T)).    (15)

With Equation 13, we can obtain:

E[L_e] - Ẽ_D[L_e] ≤ E[L_p] - Ẽ_D[L_p].    (16)

Supposing the number of triplet samples is large enough that both expected errors become theoretically close, we can further derive Ẽ_D[L_p] ≤ Ẽ_D[L_e]. However, we need to trade off the training error against the generalization error via the complexity of the hypothesis space: a larger hypothesis-space complexity easily leads to higher generalization error. For example, one popular strategy for controlling complexity to avoid over-fitting is introducing regularization terms. Therefore, higher complexity and smaller training error do not surely bring better generalization. Table 3 accounts for the necessity of this trade-off between training error and generalization error: increasing complexity by setting γ < 1.0, i.e., introducing L_p into L_cit, favors the In-shop Clothes dataset with 11,735 classes, whereas the best performance on the Market1501 dataset, containing 1,501 identities, is achieved by setting γ = 1.0, i.e., without L_p. Through the generalization-bound and experimental analysis, we can conclude:

• To free DML training from tuning the decision boundary, we present a tight constraint on triplet similarities and achieve performance comparable to its counterparts.
• We further introduce a loose strategy that increases the complexity of the hypothesis space to speed up convergence.
• Increasing complexity is favorable for large training sets: when the number of triplet samples is sufficient, we can set a relatively small γ to enjoy two benefits simultaneously, faster convergence and better generalization.

B GRADIENTS ON SIMILARITIES OF COMPARABLE PAIRS

The partial derivatives of our CIT loss in Equation 8 with respect to S_ap and S_an can be written as Equations 17 and 18, and schematics of both gradients are shown in Figure 4:

∂L_cit / ∂S_ap = -γ e^{-(S_an - S_ap)} + (1 - γ) ,  (17)

∂L_cit / ∂S_an = γ e^{-(S_an - S_ap)} - (1 - γ) e^{S_an} / (e^{S_an} + e^{S_pn}) .  (18)
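A finite-difference check makes these closed forms concrete. The snippet below uses a hypothetical reconstruction of L_cit, namely L_cit = γ e^{-(S_ap - S_an)} + (1 - γ)(S_ap - log(e^{S_an} + e^{S_pn})), chosen so that its exact partial derivatives reproduce the sign pattern of Equations 17 and 18 (we read the exponent as -(S_ap - S_an)); this is our reading of the equations, not code released with the paper:

```python
import numpy as np

def l_cit(s_ap, s_an, s_pn, gamma=0.5):
    # Hypothetical reconstruction: exponential concordance term plus
    # a partial-likelihood-style term over the set {an, pn}.
    l_e = np.exp(-(s_ap - s_an))
    l_p = s_ap - np.log(np.exp(s_an) + np.exp(s_pn))
    return gamma * l_e + (1.0 - gamma) * l_p

def analytic_grads(s_ap, s_an, s_pn, gamma=0.5):
    e = np.exp(-(s_ap - s_an))
    w = np.exp(s_an) / (np.exp(s_an) + np.exp(s_pn))  # softmax weight
    d_ap = -gamma * e + (1.0 - gamma)        # sign pattern of Eq. 17
    d_an = gamma * e - (1.0 - gamma) * w     # sign pattern of Eq. 18
    return d_ap, d_an

# Central finite differences agree with the closed forms.
s_ap, s_an, s_pn, eps = 0.7, 0.4, 0.3, 1e-6
d_ap, d_an = analytic_grads(s_ap, s_an, s_pn)
num_ap = (l_cit(s_ap + eps, s_an, s_pn) - l_cit(s_ap - eps, s_an, s_pn)) / (2 * eps)
num_an = (l_cit(s_ap, s_an + eps, s_pn) - l_cit(s_ap, s_an - eps, s_pn)) / (2 * eps)
assert abs(num_ap - d_ap) < 1e-5 and abs(num_an - d_an) < 1e-5
```

If Equation 8 differs from this reconstruction in detail, the same three-line central-difference check still applies to any differentiable variant of the loss.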

C PERFORMANCE OF FACE RECOGNITION

In Figure 2, we show the full ROC curves for comparable loss functions. Here, we additionally report their true accept rate (TAR) at false accept rates (FAR) of 1e-5 and 1e-3 in Tables 4 and 5. Besides, we also compare their face verification accuracy in Table 6.

D EMBEDDING SPACE METRIC

An image can be encoded into an embedding feature by a differentiable DML model trained with a triplet loss function. In Tables 7 and 8, we report the performances of Triplet and CT on the Market1501 and In-shop Clothes datasets under a wider range of margins. Besides, we also provide two metrics to measure the embedding space: normalized mutual information (NMI) and spectral variance (SV) Roth et al. (2020). Lower SV values indicate more directions of significant variance and suggest more discriminative embedding features.
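For reference, NMI can be computed directly from a contingency table of cluster assignments versus ground-truth labels. A small self-contained sketch follows (equivalent in spirit to sklearn.metrics.normalized_mutual_info_score; normalizing by the arithmetic mean of the entropies is one common convention, and we do not claim it is the exact variant used in the tables):

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings,
    normalized by the arithmetic mean of the two entropies."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    n = a.size
    # Joint distribution via a contingency table.
    ua, ia = np.unique(a, return_inverse=True)
    ub, ib = np.unique(b, return_inverse=True)
    joint = np.zeros((ua.size, ub.size))
    np.add.at(joint, (ia, ib), 1.0)
    joint /= n
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))
    hb = -np.sum(pb * np.log(pb))
    if ha == 0.0 and hb == 0.0:
        return 1.0  # both labelings are constant
    return mi / ((ha + hb) / 2.0)

# Identical labelings give NMI = 1; an uninformative clustering scores lower.
y = [0, 0, 1, 1, 2, 2]
assert abs(nmi(y, y) - 1.0) < 1e-12
assert nmi(y, [0, 1, 0, 1, 0, 1]) < 1.0
```

In practice, the cluster assignments would come from, e.g., k-means over the learned embeddings, with the class labels as the second labeling.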

F QUALITATIVE RESULTS

In Figure 7 , we exhibit some retrieval cases for our CIT loss on different datasets.

G GRADIENTS ON FEATURES OF TRIPLET SAMPLES

In Figure 8, we visualize the gradients of the three features x_a, x_p, x_n. Under random mini-batch sampling, the three features may belong to different classes and are thus incomparable. Hence, from this visualization we cannot discern the contributions of individual features to the DML optimization.
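The feature-level gradients visualized here follow from the chain rule through the similarity function. Assuming cosine similarity (the paper does not restate the similarity choice at this point), the gradient of S(u, v) with respect to u has a closed form, which a finite-difference check confirms:

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def grad_u(u, v):
    """d cos(u, v) / du: v rescaled, minus the component along u."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    s = u @ v / (nu * nv)
    return v / (nu * nv) - s * u / nu**2

rng = np.random.default_rng(1)
u, v = rng.normal(size=4), rng.normal(size=4)
g = grad_u(u, v)
eps = 1e-6
for i in range(4):
    d = np.zeros(4); d[i] = eps
    num = (cos_sim(u + d, v) - cos_sim(u - d, v)) / (2 * eps)
    assert abs(num - g[i]) < 1e-6
```

The anchor's total gradient is then ∂L_cit/∂S_ap · ∂S_ap/∂x_a + ∂L_cit/∂S_an · ∂S_an/∂x_a, so anchor, positive, and negative features receive gradients of different directions and magnitudes, which is why comparing them across classes within a randomly sampled mini-batch is uninformative.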



Figure 1: (a) For hard triplet samples, their number decreases significantly as the epoch increases, and the average loss at γ = 0.5 (heavier penalties) gradually approaches that at γ = 1.0. (b) Convergence at γ = 0.5 is faster than at γ = 1.0 due to imposing more penalties on hard triplet samples.

Figure 2: Comparison of different loss functions for the LFW Huang et al. (2008), AgeDB Moschoglou et al. (2017), IJB-C Maze et al. (2018), and CFP-FP Sengupta et al. (2016) datasets on CNN (upper row) and ViT (lower row) networks in terms of AUC (in %).

Figure 3: Rank-1 accuracy (R@1) (Market1501 dataset) and mAP (In-shop Clothes dataset) versus epochs of our CIT with different values of the hyper-parameter γ on CNN and ViT.

Ẽ_D[L_cit] is the empirical error of G(F_θ) in Equation 5 over the training set D (in-sample), and E[L_cit] denotes the generalization (expected) error on out-of-sample data. E_T with length N_T is sampled from the training set D. R_D(L) signifies the Rademacher complexity of L with respect to the training set D.

Because L_p introduces an extra decision boundary, its greater complexity in modeling triplet similarities can better fit the training set, thus helping speed up training convergence. Compared to the loose constraint of L_p, L_e enforces a tight constraint on triplet similarities. As shown in Figure 1(b), our loss function L_cit reaches a lower loss by setting γ = 0.5.

Figure 4: Gradients of the CIT loss function with respect to S_ap and S_an. CIT loss lays emphasis on hard samples by setting γ = 0.5 compared to γ = 1.0.

Figures 5 and 6 help analyze the convergence of the loss functions.

Figure 5: Convergence curves of CNN for different loss functions on the Market1501 dataset. The left plot shows that our CIT achieves the same convergence effect as Triplet and CT, and is smoother than Circle, ST, and Angular, which need more hyper-parameter tuning. The right plot again proves that the hyper-parameter γ can help speed up the convergence of our CIT. k signifies the slope of a fitted straight line and is used to roughly measure the descent rate of the loss curves.

Figure 6: Convergence curves of our CIT and Triplet loss functions on CNN for the Market1501 and In-shop Clothes datasets. Triplet with m = 0.0 falls into a local optimum prematurely, as observed from the epochs at which the loss reaches zero. Triplet with m = 0.0 has the same mining strategy, but its ℓ1 form is weaker than the exponential form of CIT in driving DML model optimization due to insufficient exploitation of hard triplet samples. By setting γ < 1.0, our CIT, built on similarity concordance, can approach the convergence of Triplet with m = 0.05.

Figure 7: Qualitative results of our CIT loss for (a) Market1501 dataset on CNN, (b) Market1501 dataset on ViT, (c) In-shop Clothes dataset on CNN, and (d) In-shop Clothes dataset on ViT. For each query image (leftmost), top 4 retrievals are exhibited. The results with red boundaries are false cases but they are substantially similar to the query images in terms of appearance.

Comparison of different loss functions for person ReID task on CNN and ViT networks in terms of rank-k (k=1,5,10, in %) accuracy, mean average precision (mAP, in %), and mean inverse negative penalty (mINP, in %).

Comparison of different loss functions for image retrieval tasks on CNN and ViT networks in terms of rank-k (k=1,5,10, in %) accuracy and mAP (in %).

An additional parameter needs to be tuned carefully for ST, i.e., the number of weight vectors per class. By contrast, our CIT avoids attentively regulating any hyper-parameters. The above three observations demonstrate the elegance and effectiveness of our CIT, which achieves performance comparable with the state of the art without careful hyper-parameter tuning.

Mikołaj Wieczorek, Barbara Rychalska, and Jacek Dąbrowski. On the unreasonable effectiveness of centroids in image retrieval. In International Conference on Neural Information Processing, pp. 212-223. Springer, 2021.

Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840-2848, 2017.

Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.

Xiaonan Zhao, Huan Qi, Rui Luo, and Larry Davis. A weakly supervised adaptive triplet loss for deep metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0-0, 2019.

Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1116-1124, 2015.

Bohan Zhuang, Guosheng Lin, Chunhua Shen, and Ian Reid. Fast training of triplet-based deep binary embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955-5964, 2016.

Comparison of different loss functions for LFW Huang et al. (2008), AgeDB Moschoglou et al. (2017), IJB-C Maze et al. (2018), and CFP-FP Sengupta et al. (2016) datasets on CNN in terms of TAR@FAR (in %).

Comparison of different loss functions for LFW Huang et al. (2008), AgeDB Moschoglou et al. (2017), IJB-C Maze et al. (2018), and CFP-FP Sengupta et al. (2016) datasets on ViT in terms of TAR@FAR (in %).

Comparison of different loss functions for LFW Huang et al. (2008), AgeDB Moschoglou et al. (2017), IJB-C Maze et al. (2018), and CFP-FP Sengupta et al. (2016) datasets on CNN and ViT networks in terms of face verification accuracy (in %).

Comparison of Triplet and CT with different margins for the Market1501 dataset on CNN and ViT networks in terms of rank-1 (in %) accuracy, normalized mutual information (NMI, in %), spectral variance (SV, in %), mean average precision (mAP, in %), and mean inverse negative penalty (mINP, in %).

Comparison of Triplet and CT with different margins for the In-shop Clothes dataset on CNN and ViT networks in terms of rank-1 (in %) accuracy, normalized mutual information (NMI, in %), spectral variance (SV, in %), mean average precision (mAP, in %), and mean inverse negative penalty (mINP, in %).

