DETERMINANT REGULARIZATION FOR DEEP METRIC LEARNING

Abstract

Distance Metric Learning (DML) aims to learn a distance metric that better reflects the semantic similarities in the data. Current pair-based and proxy-based DML methods focus on reducing the distance between similar samples while expanding the distance between dissimilar ones. However, we reveal that shrinking the distance between similar samples may distort the feature space, increasing the distance between points within the same class region and therefore harming the generalization of the model. Standard regularization terms (such as the L2-norm on weights) cannot be adopted to solve this issue, as they are based on linear projections. To alleviate this issue, we adopt the structure of a normalizing flow as the deep metric layer and use the determinant of its Jacobian matrix as a regularization term that helps reduce the Lipschitz constant. Finally, we conduct experiments on several pair-based and proxy-based algorithms that demonstrate the benefits of our method.

1. INTRODUCTION

Deep metric learning (DML) is a branch of learning algorithms that parameterizes a deep neural network to capture highly non-linear similarities between images according to a given semantic relationship. Because the learned similarity function can measure the similarity between samples that do not appear in the training data set, the learning paradigm of DML is widely used in many applications such as image classification and clustering, face re-identification, or general supervised and unsupervised contrastive representation learning Chuang et al. (2020). Commonly, DML aims to optimize a deep neural network to span a projection space on the surface of a hyper-sphere, in which semantically similar samples have small distances and semantically dissimilar samples have large distances. This goal can be formulated as the discriminant criterion (and its many variants that appear in the literature), which we summarize as follows:

max{d_θ(x_i, x_j) | j ∈ S_i} < δ_1 < δ_2 < min{d_θ(x_i, x_l) | l ∈ D_i}    (1)

where θ are the parameters of the deep metric model, δ_1 and δ_2 are two tunable hyperparameters, and S_i and D_i are the sets of similar and dissimilar samples of the query x_i, respectively. Commonly, the log-exp function q_λ(θ) = log(Σ_{j=1}^{n} e^{λ a_j(θ)}) Oh Song et al. (2016) is used to define the objective function in DML. Besides the definition of the objective function, many works point out that the performance of DML crucially depends on the hard sample mining (HSM) procedure and therefore focus their research on improving HSM. Unfortunately, an explicit definition of informative samples is still unclear, and the problem seems unsolved. This leads us to the following question: what is the real reason that makes DML models so crucially dependent on hard sample mining? In this paper, we try to answer this question by studying the local Lipschitz constant of the learned projection f_θ(x). Current methods present good generalization, i.e., the learned metric can work well on unseen classes.
We attribute the good generalization performance of DML models to the fact that the learned function f_θ(x) extends in a continuous manner over the projected region while presenting a small Lipschitz constant w.r.t. the original sample space. As we know, DML reduces the distances between similar samples, which also reduces the local Lipschitz constant of f_θ(x) in the region surrounded by training samples. The continuity of f_θ(x) prevents the Lipschitz constant from changing quickly, so f_θ(x) also presents a small Lipschitz constant at points close to the region where training samples are located. A learned projection with a small Lipschitz constant induces a smaller upper bound on the empirical loss, which results in better generalization performance. This interpretation explains why the projection f_θ(x) learned by distance metric learning generalizes well even to unseen classes.
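The local Lipschitz behavior discussed above can be probed empirically: the largest ratio ||f(x_a) - f(x_b)|| / ||x_a - x_b|| over sampled pairs lower-bounds the Lipschitz constant of f on that region. A sketch under the assumption that f is available as a plain function; the names are illustrative:

```python
import numpy as np

def empirical_lipschitz(f, xs, eps=1e-12):
    """Lower bound on the Lipschitz constant of f over the samples xs:
    max over pairs of ||f(x_a) - f(x_b)|| / ||x_a - x_b||."""
    best = 0.0
    for a in range(len(xs)):
        for b in range(a + 1, len(xs)):
            num = np.linalg.norm(f(xs[a]) - f(xs[b]))
            den = np.linalg.norm(xs[a] - xs[b]) + eps
            best = max(best, num / den)
    return best
```

For a linear map f(x) = 3x this estimate recovers the true constant 3 exactly; for a learned f_θ it only gives a lower bound determined by the sampled pairs.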

However, DML presents some drawbacks. Shrinking the cluster of each class increases the local Lipschitz constant of f_θ(x). This phenomenon can be divided into two effects. The first corresponds to the increase of the Lipschitz constant caused by enlarging the distances between dissimilar samples, which was found by Song et al. (2021). This effect increases the local Lipschitz constant of f_θ(x) in the region between classes. The author of Song et al. (2021) commented that the failure of training the triplet loss without semi-hard sample mining can be attributed to it. In the second effect, the regions with a large Lipschitz constant are a priori unknown: they may lie in unknown regions or within the cluster of a class. In the first case, the negative effect can be alleviated by reducing the distance between dissimilar samples. Based on this strategy, Song et al. (2021) designs a KNN decision boundary to formulate a general framework of distance metric learning and proves that current state-of-the-art algorithms such as the lifted structure loss, multi-similarity loss, circle loss, and N-pair loss are special cases of it. This implicitly means that sample mining strategies are designed to reduce the Lipschitz constant of the learned projection. Regarding the second case, only a few works address this issue. A common strategy is to assign the positions of the centers of each class by minimizing designed energy functions Duan et al. (2019); Liu et al. (2018). The main assumption of this method is that if the distances between centers are large, there is no need to shrink each class too much while still preserving the gap between classes. However, this method only has positive results if the Lipschitz constant increases in the region of unknown cases; if the increase occurs within each class, this routine fails to work.
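The log-exp function q_λ used to build DML objectives acts as a smooth surrogate for the hard max/min in the discriminant criterion: as λ → +∞, (1/λ) log Σ_j e^{λ a_j} approaches max_j a_j, and as λ → -∞ it approaches min_j a_j. A minimal, numerically stable sketch (the 1/λ normalization is a common variant of the formula in the text):

```python
import numpy as np

def log_exp_mean(a, lam):
    """Smooth max/min: (1/lam) * log(sum_j exp(lam * a_j)).
    Approaches max(a) as lam -> +inf and min(a) as lam -> -inf.
    Shifting by the max before exponentiating avoids overflow."""
    z = lam * np.asarray(a, dtype=float)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / lam
```

This is why such losses remain differentiable everywhere while still focusing the gradient on the hardest (largest or smallest) pairwise terms.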
In summary, we claim that current methods in distance metric learning can improve their discriminant ability and reduce the Lipschitz constant of the learned projection at the same time by applying the right regularization. In this paper, we design a framework to demonstrate this. Our contributions include the following aspects: (1) We give a unified framework of proxy-based distance metric learning. In our framework, we bring traditional distance metric learning algorithms and classification-based distance metric learning algorithms together. We thereby present a mathematical framework that proves the connections between these methods and gives us a theoretical base to support our hypothesis on the effects of the Lipschitz constant on distance metric learning.



(2) We reveal that potential-energy-based methods are not very effective at pushing centers away from each other, since they only consider local information of the data. To alleviate this problem, we adopt the log-exp mean function and the power mean function to design a loss term that pulls the centers of the classes away from each other. Because we prove that potential-energy methods are a special realization of our algorithm, ours also has the power to separate the centers of different classes. (3) To further address the Lipschitz constant problem in distance metric learning, we design a deep neural network structure that allows us to minimize the Lipschitz constant of the deep neural network directly. This structure contains two parts: the first part extracts features using traditional backbone networks, such as ResNet, VGG, and Inception; the second part adopts the structure of a normalizing flow as the deep metric layer, whose Jacobian determinant serves as a regularization term.
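The Jacobian-determinant regularization can be made concrete with a standard normalizing-flow building block. For a RealNVP-style affine coupling layer, log|det J| is simply the sum of the predicted log-scales, so it can be computed (and penalized) at almost no cost. This is a minimal sketch, not the paper's actual architecture; the tiny linear maps W_s and W_t are illustrative stand-ins for the coupling layer's scale and shift networks:

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """RealNVP-style coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1).
    The Jacobian is block-triangular, so log|det J| = sum(s(x1))."""

    def __init__(self, dim):
        half = dim // 2
        self.W_s = rng.normal(scale=0.1, size=(half, half))  # toy log-scale net
        self.W_t = rng.normal(scale=0.1, size=(half, half))  # toy shift net

    def forward(self, x):
        x1, x2 = np.split(x, 2)
        s = np.tanh(self.W_s @ x1)      # bounded log-scales for stability
        t = self.W_t @ x1
        y = np.concatenate([x1, x2 * np.exp(s) + t])
        log_det = s.sum()               # exact log|det J| of this layer
        return y, log_det
```

Adding |log_det| (or its square) to the metric-learning loss discourages the layer from expanding or contracting volume, which is how a penalty of this form can help keep the Lipschitz constant of the metric layer under control.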



Figure 1: Illustration of the feature space spanned by f_θ(x) learned by deep metric learning. r_i is the radius of the i-th class of samples in the training dataset and r* is the radius of an unknown class. δ_2 - δ_1 reflects the distance between the two closest samples in class 1 and class 2.

Many recent advances on DML have been presented in excellent new works Wang et al. (2019); Kim et al. (2020); Roth et al. (2022); Wang & Liu (2021); Schroff et al. (2015); Sun et al. (2020); Deng et al. (2019); Wang et al. (2018) since the first DML model was proposed in 2015 Schroff et al. (2015).

