DETERMINANT REGULARIZATION FOR DEEP METRIC LEARNING

Abstract

Distance Metric Learning (DML) aims to learn a distance metric that better reflects the semantic similarities in the data. Current pair-based and proxy-based DML methods focus on reducing the distance between similar samples while expanding the distance between dissimilar ones. However, we reveal that shrinking the distance between similar samples may distort the feature space, increasing the distance between points within the same class region and therefore harming the generalization of the model. Standard regularization terms (such as the L2-norm on weights) cannot be adopted to solve this issue, as they are based on linear projections. To alleviate this issue, we adopt the structure of normalizing flows as the deep metric layer and use the determinant of the Jacobian matrix as a regularization term that helps reduce the Lipschitz constant. Finally, we conduct experiments on several pair-based and proxy-based algorithms that demonstrate the benefits of our method.

1. INTRODUCTION

Deep metric learning (DML) is a branch of learning algorithms that parameterizes a deep neural network to capture highly non-linear similarities between images according to a given semantic relationship. Because the learned similarity function can measure the similarity between samples that do not appear in the training dataset, the learning paradigm of DML is widely used in many applications such as image classification and clustering, face re-identification, and general supervised and unsupervised contrastive representation learning Chuang et al. (2020). Commonly, DML optimizes a deep neural network to span a projection space on the surface of a hypersphere in which semantically similar samples have small distances and semantically dissimilar samples have large distances. This goal can be formulated as the discriminant criterion (and its many variants in the literature), which we summarize as follows:

max{d_θ(x_i, x_j) | j ∈ S_i} < δ_1 < δ_2 < min{d_θ(x_i, x_l) | l ∈ D_i}   (1)

where θ are the parameters of the deep metric model, δ_1 and δ_2 are two tunable hyperparameters, and S_i and D_i are the sets of similar and dissimilar samples of the query x_i, respectively. Commonly, the log-exp function q_λ(θ) = log(Σ_{j=1}^n e^{λ a_j(θ)}) Oh Song et al. (2016) is used to define the objective function in DML. Besides the definition of the objective function, many works point out that the performance of DML crucially depends on the hard sample mining (HSM) procedure, and therefore focus their research on improving HSM. Unfortunately, an explicit definition of informative samples is still unclear, and the problem remains unsolved. This leads us to the following question: what is the real reason that makes DML models depend so crucially on hard sample mining? In this paper, we try to answer this question by studying the local Lipschitz constant of the learned projection f_θ(x).
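The log-exp function above acts as a smooth surrogate for the max operations in Eq. (1). A small numpy sketch (with hypothetical similarity values, and here with the 1/λ normalization so the limit is the hard max) illustrates how it upper-bounds and approaches the hard max as λ grows:

```python
import numpy as np

def log_exp_max(a, lam):
    # Smooth surrogate for max(a): (1/lam) * log(sum(exp(lam * a_j))).
    # Uses the standard max-shift trick for numerical stability.
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(lam * (a - m)).sum()) / lam

sims = [0.2, 0.5, 0.9]            # hypothetical similarity scores a_j(theta)
soft = log_exp_max(sims, lam=10.0)
hard = max(sims)
# The surrogate upper-bounds the hard max and tightens as lam grows.
assert soft >= hard
assert abs(log_exp_max(sims, lam=1000.0) - hard) < 1e-2
```

The smoothness of this surrogate is what makes the criterion differentiable everywhere, which is why variants of it appear in most DML objectives.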
Many recent advances on DML have been presented in excellent works Wang et al. (2019); Kim et al. (2020); Roth et al. (2022); Wang & Liu (2021); Schroff et al. (2015); Sun et al. (2020); Deng et al. (2019); Wang et al. (2018) since the first DML model was proposed in 2015 Schroff et al. (2015). Current methods present good generalization, i.e., the learned metric works well on unseen classes. We attribute the good generalization performance of DML models to the fact that the learned function f_θ(x) extends in a continuous manner over the projecting region, presenting a small Lipschitz constant w.r.t. the original sample space. As we know, DML reduces the distances between similar samples, which also reduces the local Lipschitz constant of f_θ(x) in the region surrounded by training samples. The continuity of f_θ(x) prevents the Lipschitz constant from changing fast, so f_θ(x) presents a small Lipschitz constant at points close to the region where the training samples are located. A learned projection with a small Lipschitz constant induces a smaller upper bound on the empirical loss, which results in better generalization performance. This interpretation explains why the projection f_θ(x) learned by distance metric learning generalizes well even to unseen classes. However, DML presents some drawbacks. Shrinking the cluster of each class increases the local Lipschitz constant of f_θ(x). This phenomenon can be divided into two effects. The first one corresponds to the increase of the Lipschitz constant caused by enlarging the distances between dissimilar samples, which was found by Song et al. (2021). This effect increases the local Lipschitz constant of f_θ(x) in the region between classes. The authors of Song et al. (2021) commented that the failure of training the triplet loss without semi-hard sample mining can be attributed to it. In the second case, the regions with large Lipschitz constant are a priori unknown.
This effect may occur in unknown regions or in regions that belong to the cluster of a class. In the first case, the negative effect can be alleviated by reducing the distance between dissimilar samples. Based on this strategy, Song et al. (2021) designed a kNN decision boundary to formulate a general framework of distance metric learning and proved that current state-of-the-art algorithms such as the lifted structure loss, multi-similarity loss, circle loss, and N-pair loss are special cases of it. This implicitly means that sample mining strategies are designed to reduce the Lipschitz constant of the learned projection. Regarding the second case, few works address this issue. A common strategy is to assign the position of the center of each class by minimizing designed energy functions Duan et al. (2019); Liu et al. (2018). The main assumption of this method is that if the distances between centers are large, there is no need to shrink each class too much while still preserving the gap between classes. However, this method only yields positive results if it increases the Lipschitz constant in the unknown regions; if the increase occurs within each class, the routine fails. In summary, we claim that current methods in distance metric learning can improve their discriminant ability and reduce the Lipschitz constant of the learned projection at the same time by applying the right regularization factor. In this paper, we design a framework to demonstrate this. The contributions include the following aspects: (1) We give a unified framework of proxy-based distance metric learning, in which we bring traditional distance metric learning algorithms and classification-based distance metric learning algorithms together.
(2) We present the mathematical framework that proves the connections between these methods, which gives us a theoretical basis to support our hypothesis on the effects of the Lipschitz constant on distance metric learning. (3) We design a deep metric structure whose second part learns the non-linear metric through an invertible deep neural layer used in Normalizing Flows, whose gradients with respect to the input are easy to compute. (4) We conduct extensive experiments on challenging datasets such as CUB-200-2011, Cars196, Stanford Online Products, and In-Shop Clothes Retrieval to illustrate the effectiveness of our algorithm.

Notation. We denote by X_o = {(x_i, y_i)}_{i=1}^{N_1} the C-class dataset, where x_i ∈ R^{d_1} is the i-th sample and y_i ∈ {1, …, C} is the label of x_i. z_i = f_θ(x_i): R^{d_1} → R^{d_2} is a deep neural network parameterized by θ. The similarity between x_i and x_j is denoted by A_ij = cos(f_θ(x_i), f_θ(x_j)). The set of proxies is denoted by X_p = {(w_k, y_k)}_{k=1}^{N_2}, where w_k ∈ R^{d_2} and y_k ∈ {1, …, C} is the corresponding label of w_k. The similarity between x_i and w_j is denoted by B_ij = cos(f_θ(x_i), w_j). Because proxy-based DML does not calculate similarities between samples within X_o or within X_p, the similarity relationships between samples in X_o ∪ X_p can be depicted by a bipartite graph. For x_i ∈ X_o, its similar samples lie only in X_p and are denoted by S_i^1. For w_i ∈ X_p, its similar samples lie only in X_o and are denoted by S_i^2. Likewise, the dissimilar sample sets of x_i ∈ X_o and w_i ∈ X_p are denoted by D_i^1 and D_i^2, respectively.
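The notation above can be made concrete with a small numpy sketch (dimensions and values hypothetical) computing the sample-to-proxy cosine similarities B_ij of the bipartite graph:

```python
import numpy as np

rng = np.random.default_rng(0)
d2, N1, N2 = 8, 5, 3                      # embedding dim, samples, proxies
Z = rng.normal(size=(N1, d2))             # z_i = f_theta(x_i), backbone outputs
W = rng.normal(size=(N2, d2))             # proxies w_k

def cosine_matrix(U, V):
    # Row-wise cosine similarities: entry (i, j) = cos(U_i, V_j).
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ V.T

B = cosine_matrix(Z, W)                   # sample-to-proxy similarities B_ij
assert B.shape == (N1, N2)
# Cosine similarities are bounded in [-1, 1].
assert np.all(B <= 1 + 1e-9) and np.all(B >= -1 - 1e-9)
```

Only the N1 × N2 matrix B is needed in proxy-based losses, which is why they scale better than pair-based methods that require the full N1 × N1 matrix A.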

2. MOTIVATION

Let us consider a deep neural network z = f_θ(x): R^{d_1} → R^d, where θ are the learnable parameters, x is the image vector, and z is the feature vector of x. Normally, in DML the norm of z equals 1 because L_2-normalization is applied to the last layer of f_θ(x). Therefore, for a training dataset X_tr = {(x_i, y_i)}_{i=1}^N and a testing set X_te = {x_i}_{i=1}^T, all samples are projected onto the surface of a d-dimensional sphere centered at the origin of the feature space. Let us denote this surface by S. Because in classification-based tasks the feature vectors of different classes should be separated from each other, the features from different classes are located in different clusters on S. Without loss of generality, we suppose that each class of samples belongs to exactly one cluster, and the cluster region occupied by the k-th class is denoted by S_k. Therefore, S is divided into C + 1 parts: besides the C regions {S_k}_{k=1}^C, there is one region without any samples, which we call the blank region B. In the open set problem, the blank region corresponds to the unknown classes. In distance metric learning, the blank region of the training features may be located within the region of the testing features, because there is no overlap between the classes of the training and testing sets. Therefore, S = B ∪ (∪_{k=1}^C S_k). We would now like to demonstrate why shrinking the samples increases the Lipschitz constant of the learned projection. Before doing this, we introduce the definition of the Lipschitz constant.

Definition. Let (X, d_X) and (Y, d_Y) be two metric spaces; the Lipschitz constant of a function f is defined as

Lip(f) = max_{x_1, x_2 ∈ X, x_1 ≠ x_2} d_Y(f(x_1), f(x_2)) / d_X(x_1, x_2)   (2)

Let us consider two projections f_1(x) and f_2(x), whose projecting regions on S are {S_k^1}_{k=1}^C ∪ B_1 and {S_k^2}_{k=1}^C ∪ B_2, respectively.
Thus, if the area of {S_k^1}_{k=1}^C is larger than the area of {S_k^2}_{k=1}^C, then the area of B_1 is smaller than that of B_2. Therefore, we can find two samples x_a and x_b whose projections span the blank regions B_1 and B_2, and the following constraint holds:

d_Y(f_1(x_a), f_1(x_b)) / d_X(x_a, x_b) < d_Y(f_2(x_a), f_2(x_b)) / d_X(x_a, x_b)   (3)

Eq. (3) indicates that the Lipschitz constant of f_1 is smaller than that of f_2. However, in distance metric learning tasks the number of training samples is very small relative to the volume of S. For example, the CUB-200-2011 dataset has 11,788 samples in 200 classes, i.e., fewer than 60 images per class on average. Thus S_k, the region of the k-th class, may not be a single connected region (otherwise, we would consider two connected neighborhoods to belong to the same region). As seen in Figure 1, between two similar samples there can be a gap belonging to the blank region. Thus, when we shrink the distances between samples of each class, the regions with increasing Lipschitz constant may well lie within the blank areas between two similar samples. When this happens, the generalization ability of the learned projection is harmed. For a good training of the deep neural network, we want to learn the distance metric without increasing the Lipschitz constant within the cluster of each class.
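The comparison in Eq. (3) can be checked directly on a toy pair of maps (both maps and samples are hypothetical; the point is only the ordering of the two quotients):

```python
import numpy as np

def lipschitz_quotient(f, xa, xb):
    # d_Y(f(xa), f(xb)) / d_X(xa, xb) with Euclidean distances.
    return np.linalg.norm(f(xa) - f(xb)) / np.linalg.norm(xa - xb)

f1 = lambda x: 0.5 * x        # spreads points less: smaller quotient
f2 = lambda x: 2.0 * x        # spreads points more: larger quotient

xa, xb = np.array([0.0, 1.0]), np.array([1.0, 0.0])
q1 = lipschitz_quotient(f1, xa, xb)
q2 = lipschitz_quotient(f2, xa, xb)
assert q1 < q2                # the ordering claimed in Eq. (3)
```

In practice such quotients can only be sampled at finitely many pairs, which is why the next section bounds the Lipschitz constant through the Jacobian instead.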

3. ANALYSIS ON THE LIPSCHITZ CONSTANT OF NEURAL NETWORK

In this section, we introduce how to control the Lipschitz constant of our deep network. Before doing this, we introduce a useful lemma about the Lipschitz constant of a layered deep projection.

Lemma 1. [Weaver (2018)] Given a T-layer deep projection f_θ(x_0) = f_T(f_{T-1}(… f_1(x_0))) parameterized by θ, where the i-th layer is x_{i+1} = f_i(x_i), and letting L_{f_i} and L_{f_θ} denote the Lipschitz constants of f_i(x_i) and f_θ(x_0), we have

L_{f_θ} ≤ ∏_{i=1}^T L_{f_i}   (4)

According to Eq. (4), the Lipschitz constant of f_θ(x) can be constrained by controlling the individual contribution of each layer {L_{f_i} | i = 1, …, T}. Let us introduce another lemma that bounds the Lipschitz constant of a single layer L_{f_i}.

Lemma 2. Given a deep neural network, the Lipschitz constant of its i-th layer satisfies

L_{f_i}² = max_{x_1, x_2 ∈ X} ‖f_i(x_1) − f_i(x_2)‖²_2 / ‖x_1 − x_2‖²_2 ≤ max_{x' ∈ X} ‖(∂f_i/∂x)|_{x=x'}‖²_F = max_{x' ∈ X} Σ_{j=1}^d (λ_j^{x'})²   (5)

where λ_j^{x'} is the j-th singular value of the matrix (∂f_i/∂x)|_{x=x'}.

Proof. Suppose f_i is continuously differentiable; by the mean value theorem there is an x' such that f_i(x_2) = f_i(x_1) + A_{x'}(x_2 − x_1), where A_{x'} = (∂f_i(x)/∂x)|_{x=x'}. Thus max_{x_1,x_2} ‖f_i(x_1) − f_i(x_2)‖²_2 / ‖x_1 − x_2‖²_2 = max_{x', x_1, x_2} (x_2 − x_1)^T A_{x'}^T A_{x'} (x_2 − x_1) / ‖x_1 − x_2‖²_2 ≤ max_{x'} ‖A_{x'}‖²_F. Denoting the singular values of A_{x'} by {λ_j^{x'}}_{j=1}^d, we have ‖A_{x'}‖²_F = Tr(A_{x'}^T A_{x'}) = Σ_{j=1}^d (λ_j^{x'})². Thus, L_{f_i}² ≤ max_{x' ∈ X} Σ_{j=1}^d (λ_j^{x'})². □

Remark 1. The above lemma bounds the Lipschitz constant of a projection through the partial gradient of f_i(x) with respect to x. Thus, if we reduce the Frobenius norm of the partial gradient matrix at all samples in the training dataset, the learned projection has a smaller Lipschitz constant. Suppose f_i represents a linear projection: y = L^T x.
Because ∂f_i/∂x = L^T, the bound becomes L_{f_i}² ≤ ‖L‖²_F, which is the widely-used weight-decay regularization term for improving the generalization ability of deep learning models. Distance metric learning learns a representation in which samples of the same class have small distances and samples from different classes have large distances. Thus, for a deep neural network f_θ trained with this metric, the Lipschitz constant of f_θ naturally increases in the blank region and decreases within clusters. However, if we used the term in Eq. (5) to minimize the Lipschitz constant of f_θ, we would reduce the Lipschitz constant of the overall problem. Such a result contradicts the goal of distance metric learning, because it harms the discriminant ability of the model in return for better generalization. Thus, instead of directly penalizing Σ_{j=1}^d (λ_j^{x'})², we introduce the following term to regularize the Lipschitz constant:

R_{x'} = log(∏_{j=1}^d (λ_j^{x'})²) = 2 Σ_{j=1}^d log(λ_j^{x'})   (6)

Geometric meaning. Suppose A_{x'} is the Jacobian matrix of z = f_i(x) at x'. O(x', r) = {x | ‖x' − x‖ < r} is a neighborhood of x', so the volume of O(x', r) is proportional to ∏_{j=1}^d dx_j as r → 0. Suppose the singular values of A_{x'} are λ_1^{x'} > λ_2^{x'} > … > λ_d^{x'}. The volume of f_i(O(x', r)) is proportional to ∏_{j=1}^d λ_j^{x'} dx_j; thus ∏_{j=1}^d λ_j^{x'} measures the volume change after projection, and reducing R_{x'} reduces this volume change. The difference from minimizing ‖A_{x'}‖²_F is that the logarithmic form penalizes large singular values far less steeply, so reducing R_{x'} mainly limits the overall volume expansion rather than uniformly shrinking all singular values. Therefore, we can propose a regularization term that minimizes the Lipschitz constant according to the metric learning requirement:

R(x_i) = (1/d) Σ_{j=1}^d log(λ_j^{x_i} + 1)   (7)
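A hedged numpy reading of the regularizer above: compute R from the singular values of a Jacobian and compare its behaviour with the raw Frobenius penalty of Eq. (5) (the diagonal Jacobians are hypothetical stand-ins):

```python
import numpy as np

def frob_penalty(J):
    # Eq. (5)-style penalty: sum of squared singular values = ||J||_F^2.
    s = np.linalg.svd(J, compute_uv=False)
    assert np.isclose(np.sum(s ** 2), np.linalg.norm(J, ord="fro") ** 2)
    return np.sum(s ** 2)

def log_penalty(J):
    # Eq. (7)-style penalty: (1/d) * sum_j log(lambda_j + 1).
    s = np.linalg.svd(J, compute_uv=False)
    return np.mean(np.log(s + 1.0))

J_big = np.diag([10.0, 10.0])     # strong local expansion
J_small = np.diag([0.1, 0.1])     # strong local contraction

# Both penalties prefer the contracting Jacobian...
assert log_penalty(J_small) < log_penalty(J_big)
# ...but the log form saturates: it punishes large singular values far
# less steeply than the quadratic Frobenius penalty does.
assert (log_penalty(J_big) / log_penalty(J_small)
        < frob_penalty(J_big) / frob_penalty(J_small))
```

The inline check inside frob_penalty verifies the identity ‖J‖²_F = Σ_j λ_j² used in the proof of Lemma 2.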

4. DEEP METRIC LAYER

From an implementation point of view, computing det(∂f_θ(x)/∂x) is intractable for traditional neural networks. To alleviate this problem, we design a non-linear projection layer for which the determinant of the Jacobian matrix is easy to compute. Here, we adopt the deep neural network used in Normalizing Flows Rezende & Mohamed (2015), in which det(∂f_θ(x)/∂x) can be computed efficiently. Let h(·; θ): R → R be a bijection parameterized by θ. The desired projection g: R^D → R^D maps each sample x ∈ R^D to y = g(x). Let the t-th entries of x and y be x_t and y_t; the projection is defined as

y_t = h(x_t; Θ_t(x_{1:t−1}))   (8)

where x_{1:t} = (x_1, …, x_t). For t = 2, …, D we can choose arbitrary functions Θ_t(·) mapping R^{t−1} to the set of all parameters, and Θ_1 is a constant. The Jacobian matrix of Eq. (8) is triangular: each output y_t depends only on x_{1:t}, so the determinant is just the product of its diagonal entries,

det(Dg) = ∏_{t=1}^D ∂y_t/∂x_t   (9)

The structure of Normalizing Flows is an invertible projection, so it cannot reduce the dimension of the input samples. Hence the input of the Normalizing Flow cannot be raw images; we employ a convolutional backbone to extract image features and then feed those features into the Normalizing Flow layers. Consequently, the regularization presented in the previous section cannot directly constrain the Lipschitz constant of the convolutional backbone. Considering Eq. (5), if we only constrain the Lipschitz constant of the Normalizing Flow, two outcomes are possible: (1) the Lipschitz constant of the whole network is reduced; (2) the Lipschitz constant of the convolutional backbone becomes larger. To avoid the second situation, we connect a dueling pair of Normalizing Flows to the backbone.
For the first flow, the regularization term is minimized, while for the second one it is maximized. During inference, only the minimizing deep metric layer is used. By doing this, the Lipschitz constant of the backbone stays stable while the Lipschitz constant of the combined network is reduced. The structure of the proposed method is depicted in Figure 2. Owing to the invertible structure of the normalizing flow, the Jacobian matrix ∂y/∂x is triangular, so its singular values can be replaced by the absolute values of the diagonal entries ∂y_t/∂x_t, and Eq. (7) can be evaluated as

R(x_i) = (1/d) Σ_{t=1}^d log(|∂y_t/∂x_t| + 1)   (10)

R(x_i) is a function of x_i; given a sample set Z = {z_1, …, z_n}, we use R(Z) = (1/n) Σ_{i=1}^n R(z_i). Eq. (10) keeps the Lipschitz regularizer small at each training sample, but it does not reduce the Lipschitz constant between dissimilar samples. To solve this problem, we use sample augmentation to reduce the Lipschitz constant in those regions.

Sample augmentation. Let z_i = f_θ(x_i) be the output feature embedding of the backbone. For a point z_i, we can compute the direction e_ij = z_j − z_i towards a neighbor z_j. A new sample is then generated as z_ij = z_i + ε e_ij. By selecting an appropriate value of ε, the new sample z_ij lies in the blank region between z_i and z_j. In this way, minimizing the Lipschitz constant on the generated samples reduces the Lipschitz constant at points within the cluster of each class. To efficiently limit the number of generated samples, we only select samples x_j in the neighborhood of x_i. This produces extra training samples for the batch. Because the augmented samples are located between two similar samples, if the distance between those samples is reduced, the distance between the augmented sample and each of the two similar ones is also reduced.
Therefore, there is no need to perform distance metric learning on the augmented samples. The objective function of the proposed distance metric learning is

min_{(θ_1, θ_2, θ_3)} Σ_{i=1}^N loss((x_i, S_i, D_i); θ) + λ_1 R_{θ_2}(Z_1) − λ_2 R_{θ_3}(Z_2)   (11)

where θ_1 are the learnable parameters of the backbone, θ_2 are the parameters of the first deep metric layer, and θ_3 are those of the second deep metric layer. λ_1 > 0 and λ_2 > 0 are the coefficients of the regularization terms. Z_1 contains the training samples and their augmented samples for the first invertible neural network, and Z_2 the training samples and augmented samples for the second invertible neural network. After training, the outputs of the first invertible network are used as the image features.
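Returning to the coupling construction of Eqs. (8)-(9), a minimal numpy sketch of a Real-NVP-style affine coupling (the conditioning networks s and t are stand-in linear maps, a simplification) shows how the triangular Jacobian makes the log-determinant a simple sum:

```python
import numpy as np

def affine_coupling(x, W_s, W_t):
    # Split x; the second half is transformed conditioned on the first:
    #   y1 = x1,  y2 = x2 * exp(s(x1)) + t(x1)   (an invertible bijection).
    d = x.size // 2
    x1, x2 = x[:d], x[d:]
    s, t = W_s @ x1, W_t @ x1             # stand-ins for small neural nets
    y = np.concatenate([x1, x2 * np.exp(s) + t])
    # Triangular Jacobian with diagonal (1, ..., 1, exp(s)), so
    # log|det(Dg)| = sum(s)  -- Eq. (9) with no matrix factorization needed.
    log_det = np.sum(s)
    return y, log_det

rng = np.random.default_rng(2)
D = 4
W_s = rng.normal(size=(D // 2, D // 2))
W_t = rng.normal(size=(D // 2, D // 2))
x = rng.normal(size=D)
y, log_det = affine_coupling(x, W_s, W_t)

# Check the closed form against a numerical Jacobian determinant.
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    e = np.zeros(D); e[j] = eps
    J[:, j] = (affine_coupling(x + e, W_s, W_t)[0] - y) / eps
assert np.isclose(np.log(abs(np.linalg.det(J))), log_det, atol=1e-3)
```

Stacking such couplings (with the roles of x1 and x2 alternating) gives a flow whose total log-determinant is the sum over layers, which is what makes the regularizers R(Z_1) and R(Z_2) in Eq. (11) cheap to evaluate.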

5. EXPERIMENTS

We evaluate the effectiveness of the proposed method on four datasets for fine-grained image retrieval. In the conducted experiments, we compare the performance of the proposed regularization factor when applied to current state-of-the-art models in DML.

5.1. SETTINGS

Fine-grained image retrieval. We benchmark our model on four datasets for fine-grained image retrieval: Cars196, CUB-200-2011, Stanford Online Products (SOP), and In-Shop Clothes Retrieval. We use Inception as the backbone of our model. Similar to previous works in the literature, we use a model pre-trained on ImageNet classification and select a final 512-D embedding layer, which corresponds to the dimension of the hidden layers in the deep metric model. For the backbone, we freeze the batch normalization layers during training and add an activation layer connecting to the deep metric layers. The deep metric layers we use are invertible normalizing flow layers, as the determinant of the Jacobian matrix required in the regularization factor is then efficient to compute. In particular, we rely on Real-NVP Dinh et al. (2016), a model that implements normalizing flows using affine coupling layers that combine a scaling term with a shift term in the transformation. Despite its simplicity, Real-NVP has proven effective at estimating complex density distributions without requiring a large number of layers. In our experiments, we use 12 layers for each of the dueling deep metric modules to capture the radial distribution produced by the cosine similarity in DML. Finally, we apply L_2-normalization to the final output of the normalizing flows. Regarding the loss function, we rely on the state-of-the-art Proxy Anchor loss Kim et al. (2020) for our experimentation. Proxy Anchor is a proxy-based anchor method that associates all the data in a batch with proxies for each class. This method has shown advantages over previous proxy-based methods, which do not exploit data-to-data relations, and it has also proven more efficient than pair-based methods. In particular, we define the same number of proxies as classes in the dataset. The loss function is completed with the determinant-based regularization factor presented in Eq. (11).
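Schematically, the full objective of Eq. (11) combines the metric loss with the dueling regularizers; the numpy sketch below uses toy stand-in values (not the real Proxy Anchor computation) only to illustrate the sign structure:

```python
import numpy as np

def objective(pair_losses, R_Z1, R_Z2, lam1=0.05, lam2=0.05):
    # Eq. (11): metric loss + lam1 * R(Z1) - lam2 * R(Z2).
    # The first flow's regularizer is minimized; the second's is maximized
    # (hence the minus sign), keeping the backbone's Lipschitz constant stable.
    return np.sum(pair_losses) + lam1 * R_Z1 - lam2 * R_Z2

losses = np.array([0.3, 0.1])             # hypothetical per-sample DML losses
base = objective(losses, R_Z1=2.0, R_Z2=2.0)
# Decreasing R(Z1) or increasing R(Z2) both decrease the objective,
# which is what gradient descent on Eq. (11) encourages.
assert objective(losses, R_Z1=1.0, R_Z2=2.0) < base
assert objective(losses, R_Z1=2.0, R_Z2=3.0) < base
```

The coefficients lam1 = lam2 = 0.05 match the values reported in the training settings below.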
Training: To train our model we rely on the AdamW optimizer with weight decay Loshchilov & Hutter (2017). The initial learning rate is 10^-4 for the backbone network and 10^-3 for the metric learning layers and the proxies. We use a linear decay in both cases and train the model for 30 epochs. Similarly to Kim et al. (2020), we initialize the proxies with a normal distribution and use a larger learning rate on them for faster convergence. We keep the Proxy Anchor hyperparameters, margin δ = 0.1 and scaling factor α = 32, fixed in all experiments. We set the coefficients λ_1 and λ_2 of the dueling deep metric layers to 0.05. The results for the CUB-200-2011 and Cars-196 datasets are summarized in Table 1. Our method achieves competitive results on both datasets, improving the recall@k of the baseline Proxy Anchor implementation. Significant results are obtained in both cases: we improve recall@1 by a margin of 0.7% on CUB-200-2011 and by a margin of 1.1% on Cars-196. We reproduce Table 1 of Kim et al. (2020) to put the obtained results in perspective. On the SOP dataset, the model shows slightly worse performance than the baseline; the results are summarized in Table 2. We attribute this to the fact that SOP is the largest of the four datasets; despite this, we expect that with further tuning of the parameters this result can also be improved. Finally, the results on the In-shop dataset are also modest in comparison to the baseline (see Table 3), and we cannot attribute a significant benefit to applying the regularization factor in that case.

5.4. ABLATION STUDIES

Deep metric layers. In the ablation studies, we test two types of Normalizing Flows for the distance metric layers: Real-NVP Dinh et al. (2016) and NICE (Non-linear Independent Component Estimation) Dinh et al. (2014), considered the predecessor of Real-NVP. We observe a slight benefit from using Real-NVP, and therefore this is the variant we report in the experimentation. Coefficients of regularization. Regarding the regularization coefficients, we test the values λ ∈ {0.01, 0.05, 0.1, 0.2, 0.5}. In particular, we observe better results when the coefficients λ_1 and λ_2 of the dueling deep metric layers take the same value. The reported results correspond to λ_1 = λ_2 = 0.05, which achieved the overall best performance. Data augmentation. Finally, we also experiment with different values of the ϵ-coefficient in the generation of new samples for data augmentation. The ϵ-coefficient controls how far from the current sample the augmented one is created. Because the last layer applies L_2-normalization, the output samples are restricted to the unit hypersphere and the data augmentation occurs in its tangent hyperplane. We test several coefficients ϵ ∈ {0.01, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8}, ranging from the close proximity of the original sample to the neighbouring sample. The value of ϵ must be selected carefully to further increase generalization; in our experimentation, ϵ = 0.4 achieved the best performance.
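The ϵ-augmentation discussed above can be sketched in numpy; the direction convention (moving from the sample toward its neighbour) follows the "blank region between" description in Section 4, and the renormalization onto the unit sphere is an assumption matching the L2-normalized embeddings:

```python
import numpy as np

def augment_between(z_i, z_j, eps):
    # Generate z_ij = z_i + eps * (z_j - z_i), a point in the blank
    # region between two embeddings, then project back onto the sphere.
    z = z_i + eps * (z_j - z_i)
    return z / np.linalg.norm(z)

z_i = np.array([1.0, 0.0])        # two hypothetical unit embeddings
z_j = np.array([0.0, 1.0])
z_ij = augment_between(z_i, z_j, eps=0.4)

assert np.isclose(np.linalg.norm(z_ij), 1.0)          # stays on the sphere
# The new point lies between the endpoints: it is closer to each of them
# than they are to each other.
gap = np.linalg.norm(z_i - z_j)
assert np.linalg.norm(z_ij - z_i) < gap
assert np.linalg.norm(z_ij - z_j) < gap
```

Small ϵ keeps the augmented point near z_i; ϵ near 1 pushes it toward the neighbour, which matches the swept range {0.01, …, 0.8}.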



We reveal that potential energy-based methods are not very effective at pushing class centers away from each other, since they only consider local information of the data. To alleviate this problem, we adopt the log-exp mean function and the power mean function to design a loss term that pulls the centers of the classes apart. Because we prove that potential energy methods are a special realization of our algorithm, ours also has the power to separate the centers of different classes. To further address the Lipschitz constant problem in distance metric learning, we design a deep neural network structure that allows us to minimize the Lipschitz constant of the deep neural network directly. This structure contains two parts: the first part extracts features using a traditional backbone network, such as ResNet, VGG, or Inception; the second part learns the non-linear metric with an invertible neural layer.



Figure 1: The illustration represents the feature space spanned by f_θ(x) learned by deep metric learning. r_i is the radius of the i-th class of samples in the training dataset and r* the radius of an unknown class. δ_2 − δ_1 reflects the distance between the two closest samples in class 1 and class 2.

Figure 2: Structure of the dueling deep metric neural network used to reduce the overall Lipschitz constant of the network.

Cars196 contains 16,183 images from 196 classes of cars; the first 98 classes are used for training and the last 98 classes for testing. CUB-200-2011 contains 200 different classes of birds; we use the first 100 classes with 5,864 images for training and the last 100 classes with 5,924 images for testing. SOP is the largest dataset and consists of 120,053 images belonging to 22,634 classes of online products; the training set contains 11,318 classes with 59,551 images, and the remaining 11,316 classes with 60,499 images are used for testing. Lastly, the In-shop dataset consists of a total of 54,642 images, divided into 25,882 images from 3,997 classes for training and 28,760 images for testing.

Table 1: Comparison of the Recall@k (in percent) on the CUB-200-2011 and Cars-196 fine-grained image datasets. The backbone network is denoted by: G for GoogleNet, R50 for ResNet50, and BN for Inception with Batch Normalization. The superscript indicates the size of the final embedding layer used in the backbone network. Source: Kim et al. (2020).

Table 2: Comparison of the Recall@k (in percent) on the Stanford Online Products (SOP) dataset.

Table 3: Comparison of the Recall@k (in percent) on the In-Shop Clothes Retrieval dataset.


We keep the ablation studies concise because the aim of the experimentation is to determine whether the regularization benefits the Proxy Anchor algorithm used as the baseline. Therefore, for the sake of a fair comparison, we maintain the same 512-D embedding dimension as the original work, and we use the same values for the margin δ = 0.1 and scaling factor α = 32 of Proxy Anchor.

6. CONCLUSIONS

This paper presents a novel learning paradigm for Distance Metric Learning (DML). Differently from other DML methods that focus only on designing different loss functions, our work focuses on regularizing the Lipschitz constant as a way to improve the generalization capabilities of DML models. We adopt invertible layers from Normalizing Flows to construct a deep metric model in which the computation of the Jacobian matrix is efficient. Finally, we minimize the determinant of the Jacobian matrix to reduce the Lipschitz constant of the deep neural network. Experiments conducted on the fine-grained Cars196, CUB-200-2011, Stanford Online Products (SOP), and In-Shop Clothes Retrieval datasets show that the proposed architecture helps the baseline proxy-based architecture achieve better generalization.

