COMPACT BILINEAR POOLING VIA GENERAL BILINEAR PROJECTION

Abstract

Deep metric learning (DML) aims to learn a deep neural network under which similar samples have small distances and dissimilar samples have large distances. To achieve this goal, current DML algorithms mainly focus on pulling the similar samples of each class as close together as possible. However, pulling similar samples together only considers the local distribution of the data. It ignores the global distribution of the data set, i.e., the center positions of the different classes. The global distribution also helps distance metric learning: for example, enlarging the distance between class centers increases the discriminative ability of the extracted features. Increasing the distance between centers is, however, a challenging task. In this paper, we design a function named the skewed mean function, which only considers the largest distances in a set of samples, so maximizing its value makes the largest distances larger. We also prove that current energy functions used for uniformity regularization on centers are special cases of our skewed mean function. Finally, we conduct extensive experiments to illustrate the superiority of our methods.

1. INTRODUCTION

Deep metric learning (DML) is a branch of supervised feature extraction that constrains the learned features so that similar samples have a small distance and dissimilar samples have a large distance. Because DML can learn a deep neural network that generalizes to unseen classes, i.e., the classes in the test set do not appear in the training set, it is widely used in image classification and clustering, face re-identification, and general supervised and unsupervised contrastive representation learning Chuang et al. (2020). The goal of DML is to optimize a deep neural network so that its projection space lies on the surface of a hypersphere on which semantically similar samples have small distances and semantically dissimilar samples have large distances. This goal can be formulated as a set of triplets. However, because the number of such triplets is extremely large, distance metric learning needs an additional procedure called informative sample selection, such as hard sample mining and semi-hard sample mining. With the advance of mining techniques, state-of-the-art metric learning algorithms Wang et al. (2019b); Kim et al. (2020); Roth et al. (2022); Wang & Liu (2021); Schroff et al. (2015); Sun et al. (2020); Deng et al. (2019); Wang et al. (2018) use the log-exp function $q_\lambda(\theta) = \log\left(\sum_{j=1}^{n} e^{\lambda a_j(\theta)}\right)$ Oh Song et al. (2016) to combine distance metric learning and informative sample selection. Here $\theta$ is the parameter of the deep metric, $\delta_1$ and $\delta_2$ are two tuned parameters, and $S_i$ and $D_i$ are the sets of similar and dissimilar samples of the query $x_i$, respectively. Although they achieve excellent performance, these log-exp-based algorithms fail to assign the centers of the different classes. Assigning the locations of the class centers makes it easier to distinguish features from different classes.
For example, if the centers of two classes are far apart in the feature space, the distance between two samples from those classes also increases, which makes them easier to distinguish. Several methods have been designed to enlarge the distance between centers, the most representative being potential energy functions. However, those functions only consider the nearest centers of the query sample. Because the nearest centers are distributed around the query, their pushing actions can leave the query stuck, i.e., its position hardly moves during training. As shown in Figure 1, take $c_1$ as the query center; the six centers in the circle are its nearest centers. Because current energy functions only let the nearest centers push the query away, and the directions of those pushes contradict each other, the six nearest centers leave the query stuck. As a result, potential energy functions fail to assign the locations of the centers. In this paper, we propose a family of functions named skewed mean functions, which focus on the largest values in a set of numbers. In this way, we can push the sample pairs with the largest distances away from each other. Because those samples lie on the boundary of the cluster, they are less likely to get stuck. Using this finding, we design a regularization term to assign the centers of different classes. Our contributions are as follows:

(1) We give a unified framework for proxy-based distance metric learning. Within this framework, we summarize the traditional distance metric learning algorithms and the classification-based distance metric learning algorithms together.
Therefore, the mathematical proof of their connections gives us a theoretical basis for performing ablation experiments to support our claim that all problems of distance metric learning are about the Lipschitz constant.

(2) We reveal that potential energy-based methods have limited power to push centers away from each other, since they only consider local data information. To alleviate this problem, we adopt the log-exp mean function and the power mean function to design the term that pushes apart the centers of the classes. Because we prove that the potential energy methods are a special case of ours, our algorithms retain the power to push apart the centers of different classes.

(3) We conduct extensive experiments on challenging data sets such as CUB-200-2011, Cars-196, Aircraft, and In-Shop to illustrate the effectiveness of our algorithms.

Notation. $X_o = \{(x_i, y_i)\}_{i=1}^{N_1}$ is a $C$-class dataset, where $x_i \in \mathbb{R}^{d_1}$ is the $i$-th sample and $y_i \in \{1, \cdots, C\}$ is the label of $x_i$. $z_i = f_\theta(x_i)$, with $f_\theta : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$, is a deep neural network parameterized by $\theta$. The similarity between $x_i$ and $x_j$ is denoted by $A_{ij} = \cos(f_\theta(x_i), f_\theta(x_j))$. The set of proxies is denoted by $X_p = \{(w_k, y_k)\}_{k=1}^{N_2}$, where $w_k \in \mathbb{R}^{d_2}$ and $y_k \in \{1, \cdots, C\}$ is the corresponding label of $w_k$. The similarity between $x_i$ and $w_j$ is denoted by $B_{ij} = \cos(f_\theta(x_i), w_j)$. Because proxy-based DML does not calculate the similarity between samples within $X_o$ or within $X_p$, the similarity relationships among the samples in $X_o \cup X_p$ can be depicted by a bipartite graph. For $x_i \in X_o$, its similar samples lie only in $X_p$ and are denoted by $S^1_i$. For $w_i \in X_p$, its similar samples lie only in $X_o$ and are denoted by $S^2_i$. Likewise, the dissimilar sample sets of $x_i \in X_o$ and $w_i \in X_p$ are denoted by $D^1_i$ and $D^2_i$, respectively.
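To make the notation concrete, the following is a minimal pure-Python sketch of the sample-to-proxy similarities $B_{ij}$ and of the index sets $S^1_i$ and $D^1_i$. The function names (`cosine`, `proxy_similarities`, `split_proxies`) are illustrative assumptions and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def proxy_similarities(embeddings, proxies):
    """B[i][j] = cos(z_i, w_j) for every embedding z_i and proxy w_j."""
    return [[cosine(z, w) for w in proxies] for z in embeddings]

def split_proxies(label, proxy_labels):
    """Return (S1_i, D1_i): indices of proxies sharing / not sharing the label."""
    similar = [k for k, y in enumerate(proxy_labels) if y == label]
    dissimilar = [k for k, y in enumerate(proxy_labels) if y != label]
    return similar, dissimilar
```

Only sample-to-proxy pairs are evaluated here, which mirrors the bipartite structure described in the notation: no similarity within $X_o$ or within $X_p$ is computed.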

2. DISTANCE METRIC LEARNING REVISITED

3. SHORTCOMINGS OF DISTANCE METRIC LEARNING

In this section, we adopt the proxy anchor loss as the baseline to analyze current distance metric learning models. Its objective function is

$$J = \frac{1}{|P^+|}\sum_{p \in P^+} \log\Big(1 + \sum_{x \in X_p^+} e^{-\alpha (s(x,p) - \delta)}\Big) + \frac{1}{|P|}\sum_{p \in P} \log\Big(1 + \sum_{x \in X_p^-} e^{\alpha (s(x,p) + \delta)}\Big) \quad (1)$$

where $\delta > 0$ is a margin, $\alpha > 0$ is a scaling factor, $P$ is the set of all proxies, and $P^+$ is the set of positive proxies of the data in the batch. For each proxy $p$, the batch of embedding vectors $X$ is divided into two sets: $X_p^+$, the set of positive embedding vectors of $p$, and $X_p^- = X - X_p^+$. The gradient of the loss function with respect to $s(x, p)$ is given by

$$\frac{\partial J}{\partial s(x,p)} = \begin{cases} \dfrac{1}{|P^+|}\, \dfrac{-\alpha\, e^{-\alpha (s(x,p) - \delta)}}{1 + \sum_{x' \in X_p^+} e^{-\alpha (s(x',p) - \delta)}}, & \forall x \in X_p^+ \\[2ex] \dfrac{1}{|P|}\, \dfrac{\alpha\, e^{\alpha (s(x,p) + \delta)}}{1 + \sum_{x' \in X_p^-} e^{\alpha (s(x',p) + \delta)}}, & \forall x \in X_p^- \end{cases} \quad (2)$$

In practice, the best-performing distance metric learning algorithms set $\alpha$ to a large value, normally $\alpha > 32$. According to Eq. (2), a large $\alpha$ makes the optimization focus only on the farthest similar samples and the nearest dissimilar samples. Considering only the nearest dissimilar samples means that distance metric learning uses only the local distribution of the training data and does not consider the global information of the training set. As a consequence, the movement of the closest dissimilar sample can be blocked by the other, ignored dissimilar samples, because each of those nearest samples also receives a pushing force from its anchor. Therefore, we can claim that the goal of distance metric learning is achieved mainly by shrinking the similar samples of each class, so distance metric learning has no power to assign the centers of the classes. Geometrically, pushing centers away from each other helps distinguish samples from different classes. For example, suppose the radius of the cluster region of each class is fixed at $r$ and the gap between two classes is $\delta$.
If we push the centers of the classes away from each other, the gap between two classes is enlarged to $\delta + \epsilon$, where $\epsilon$ is the increase produced by pushing the centers. The features extracted by the neural network then become easier to distinguish. If, instead, the centers are not assigned by the algorithm and we still want the gap between two classes to be $\delta + \epsilon$, we must shrink the samples in the cluster of each class significantly. However, the training sets used in distance metric learning are not large. For example, the widely used CUB-200-2011 dataset has 200 classes with about 59 samples per class on average (11,788 images in total). Compared with the dimension of the features extracted by the neural network, normally 512 or 1024, this number is very small. It is therefore hard to shrink so few samples in the high-dimensional feature space without overfitting, and when overfitting happens, the performance of distance metric learning is hurt. Therefore, how to assign the centers of each class is a very important issue.
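The claim that a large $\alpha$ concentrates the optimization on the hardest samples can be checked numerically. The sketch below computes, for the positive samples of a single proxy, each sample's share of the total gradient magnitude in Eq. (2); the common denominator and the factor $\alpha / |P^+|$ cancel in the ratio. The similarity values and the margin `delta=0.1` are illustrative assumptions, and `gradient_shares` is a hypothetical helper name.

```python
import math

def gradient_shares(sims, alpha, delta=0.1):
    """Share of the total Eq. (2) gradient magnitude received by each positive
    sample of one proxy. sims holds the similarities s(x, p)."""
    h = [math.exp(-alpha * (s - delta)) for s in sims]
    total = sum(h)
    return [v / total for v in h]

# Hypothetical cosine similarities of three positives to their proxy;
# the last one (0.3) is the farthest similar sample.
sims = [0.9, 0.7, 0.3]
for alpha in (1.0, 32.0):
    print(alpha, [round(w, 4) for w in gradient_shares(sims, alpha)])
```

With a small $\alpha$ the three positives receive comparable shares, while with $\alpha = 32$ almost all of the gradient goes to the farthest positive, which matches the local behaviour discussed above.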

3.1. SHORTCOMINGS OF ENERGY FUNCTION

Several works have been proposed to assign the centers of different classes for classification problems. The best known are based on energy functions, formulated as

$$E_{s,d}(w_i|_{i=1}^{C}) = \sum_{i=1}^{C} \sum_{j=1, j \neq i}^{C} f_s(S(w_i, w_j)) = \begin{cases} \sum_{i \neq j} S(w_i, w_j)^{-s}, & s > 0 \\ -\sum_{i \neq j} \log S(w_i, w_j), & s = 0 \end{cases} \quad (3)$$

where $S(w_i, w_j)$ is a similarity function between $w_i$ and $w_j$; commonly, $S(w_i, w_j) = \|w_i - w_j\|_2^2$. The gradient of the energy function with respect to $S(w_i, w_j)$ for a pair $i \neq j$ is

$$\frac{\partial E_{s,d}(w_i|_{i=1}^{C})}{\partial S(w_i, w_j)} = \begin{cases} -s\, S(w_i, w_j)^{-s-1}, & s > 0 \\ -S(w_i, w_j)^{-1}, & s = 0 \end{cases} \quad (4)$$

As seen from Eq. (4), if $S(w_i, w_j)$ is small, the magnitude of the gradient is large. This means the algorithm gives a large weight to the sample pairs with smaller distances. Thus, the energy function only considers the closest samples of each query and ignores the farther ones. This has two shortcomings. First, if only the closest samples of each query are considered, the pushing action on the query is easily cancelled by the samples around it; since there are hundreds of classes in each distance metric learning task, this phenomenon is easily encountered. Second, the centers cannot be spread over the whole surface of the hypersphere in the feature space. In the following, we design a new mechanism to solve this problem: we let the farthest samples of each query push the query. Because the farthest samples are always located on the boundary of the feature region, enlarging the distance between them and the query is unlikely to get stuck.
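As a small numerical illustration of Eq. (4), the sketch below evaluates the per-pair gradient magnitude and the share of the total that each pair receives. The pairwise values in `dists` are made up for illustration, and `energy_gradient_magnitude` is a hypothetical helper name.

```python
def energy_gradient_magnitude(dist, s):
    """|dE/dS| for one pair, following Eq. (4):
    s * S^(-s-1) for s > 0 and 1 / S for s = 0."""
    if s < 0:
        raise ValueError("s must be non-negative")
    if dist <= 0:
        raise ValueError("dist must be positive")
    if s == 0:
        return 1.0 / dist
    return s * dist ** (-s - 1)

# Hypothetical pairwise values S(w_i, w_j), from very close to far apart.
dists = [0.1, 0.5, 1.0, 2.0]
mags = [energy_gradient_magnitude(d, s=1) for d in dists]
shares = [m / sum(mags) for m in mags]
print([round(x, 4) for x in shares])
```

In this toy setting the closest pair receives the overwhelming majority of the gradient, so the farther pairs are effectively ignored.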

3.2. DISTANCE METRIC LEARNING SURVEY BY SKEWED MEAN FUNCTIONS

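Definition 1. Given a set of numbers $S = \{s_1, s_2, \cdots, s_N\}$, and assuming without loss of generality that $0 < s_1 < s_2 < \cdots < s_N$, we define the $K$ skewed mean of $S$ as

$$M_{[K]}(S) = \begin{cases} \dfrac{1}{|K|} \sum_{i=1}^{|K|} s_i, & K < 0 \\[1.5ex] \dfrac{1}{K} \sum_{i=1}^{K} s_{N-i+1}, & K > 0 \end{cases} \quad (5)$$

where $K \in \{\pm 1, \pm 2, \cdots, \pm N\}$. Obviously, $M_{[-1]}(S) = s_1$, $M_{[1]}(S) = s_N$, and $M_{[N]}(S) = M_{[-N]}(S) = \frac{1}{N} \sum_{i=1}^{N} s_i$.

By definition, the skewed mean with $K = 1$ recovers the largest value of a set of numbers. If those numbers are the distances between pairs of centers, we can enlarge the skewed mean to assign the positions of the centers. However, the skewed mean function involves ranking the numbers, which makes it not a continuous function with respect to the distances $S(w_i, w_j)$. To solve this problem, we design a series of continuous skewed mean functions by introducing the following theorem.

Theorem 1. Given a monotonically increasing continuous function $y = f_\lambda(x) : \mathbb{R} \to \mathbb{R}$ with $\lambda \in \mathbb{R}$, and its inverse function $x = f_\lambda^{-1}(y) : \mathbb{R} \to \mathbb{R}$, we define

$$b_S(\lambda) = f_\lambda^{-1}\Big(\frac{1}{N} \sum_{i=1}^{N} f_\lambda(s_i)\Big) \quad (6)$$

We can calculate the $K$ skewed mean of the numbers $S = \{s_1, s_2, \cdots, s_N\}$ defined in Eq. (5) by using Eq. (6) with an appropriately selected $\lambda$, if $b_S(\lambda)$ satisfies the following rules: (1) $b_S(\lambda)$ is a monotonically increasing function with respect to $\lambda$; (2) $\lim_{\lambda \to +\infty} b_S(\lambda) = \max\{s_i\}_{i=1}^{N}$ and $\lim_{\lambda \to -\infty} b_S(\lambda) = \min\{s_i\}_{i=1}^{N}$.

For Theorem 1, we can give two examples of $f_\lambda(x)$, i.e., $f_\lambda(x) = e^{\lambda x}$ and $f_\lambda(x) = x^{\lambda}$, which respectively correspond to

$$b_S(\lambda, a) = \frac{1}{\lambda} \log_a\Big(\frac{1}{N} \sum_{i=1}^{N} a^{\lambda s_i}\Big), \qquad b_S(\lambda) = \Big(\frac{1}{N} \sum_{i=1}^{N} (s_i)^{\lambda}\Big)^{1/\lambda} \quad (7)$$

Property 1. Let $b(\lambda)$ denote the log-exp mean and $g(\lambda)$ the power mean of a set $\{a_i\}_{i=1}^{n}$. The functions $b(\lambda)$ and $g(\lambda)$ have the following properties: (1) both $b(\lambda)$ and $g(\lambda)$ are monotonically increasing functions with respect to $\lambda$; (2) $\lim_{\lambda \to +\infty} b(\lambda) = a_n$ and $\lim_{\lambda \to +\infty} g(\lambda) = a_n$; (3) $\lim_{\lambda \to -\infty} b(\lambda) = a_1$ and $\lim_{\lambda \to -\infty} g(\lambda) = a_1$; thus, there is an appropriate number $\lambda^*$ such that $b(\lambda^*) = a_k$ or $g(\lambda^*) = a_k$, where $a_k$ is the $k$-th largest number in $\{a_i\}_{i=1}^{n}$.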
(4) Let $a_i = (x_i - x)^T M (x_i - x)$, where $M \in \mathbb{R}^{d \times d}$, $M \succeq 0$, is a distance metric; then $b(\lambda)$ and $-b(-\lambda)$ are convex with respect to the matrix $M$ when $\lambda > 0$.

Remark 1. By using the $K$ skewed mean function, we can automatically select the $k$ largest or the $k$ smallest values of a set of numbers. If we want to assign the centers of different classes in the feature space, we should select the center pairs whose distances are large and push those pairs away from each other. Because all of those samples lie on the surface of a sphere, the distances between the selected pairs are bounded above, so the algorithm converges.
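The behaviour described in Definition 1, Theorem 1 and Property 1 can be illustrated with a short pure-Python sketch. It implements the ranking-based skewed mean of Eq. (5) and the two continuous surrogates of Eq. (7), taking the natural base $a = e$ for the log-exp mean. The function names and the max-shift trick used for numerical stability are assumptions of this sketch, not details given in the paper.

```python
import math

def skewed_mean(values, k):
    """Eq. (5): mean of the |k| smallest values (k < 0) or the k largest (k > 0)."""
    n = len(values)
    if k == 0 or abs(k) > n:
        raise ValueError("k must be a non-zero integer with |k| <= N")
    ordered = sorted(values)
    chosen = ordered[:-k] if k < 0 else ordered[n - k:]
    return sum(chosen) / len(chosen)

def log_exp_mean(values, lam):
    """(1/lam) * log((1/N) * sum(exp(lam * s_i))), shifted by the max for stability."""
    if lam == 0:
        raise ValueError("lam must be non-zero")
    scaled = [lam * v for v in values]
    m = max(scaled)
    mean_exp = sum(math.exp(x - m) for x in scaled) / len(values)
    return (m + math.log(mean_exp)) / lam

def power_mean(values, lam):
    """((1/N) * sum(s_i ** lam)) ** (1/lam) for positive values and lam != 0."""
    if lam == 0:
        raise ValueError("lam must be non-zero")
    return (sum(v ** lam for v in values) / len(values)) ** (1.0 / lam)

values = [1.0, 2.0, 3.0, 4.0]
for lam in (-200.0, -1.0, 1.0, 200.0):
    print(lam, round(log_exp_mean(values, lam), 4))
```

As $\lambda$ grows, both surrogates approach `skewed_mean(values, 1)` (the maximum), and as $\lambda$ decreases they approach `skewed_mean(values, -1)` (the minimum), so $\lambda$ plays the role of a continuous counterpart of $K$.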

3.3. REGULARIZATION PUNISHING LARGE SIMILARITY BETWEEN CLASS CENTERS

In this section, we design a term to push the centers of classes away from each other. Suppose $\{p_i\}_{i=1}^{C}$ are the centers of the classes and the similarity between $p_i$ and $p_j$ is denoted by $s(p_i, p_j)$. We collect all similarities related to $p_c$ into a set $M_c = \{s(p_c, p_j) \mid j \neq c\}$. The $r$-th largest element of $M_c$ is denoted by $v^{(r)}_{M_c}$. To fulfill the above goal, we impose the constraint $v^{(1)}_{M_c} < \delta_3$, whose continuous versions are

$$R_1 = \frac{1}{\gamma_3} \log\Big(\sum_{j=1, j \neq i}^{C} e^{\gamma_3 (s(p_i, p_j) - \delta_3)}\Big), \qquad R_2 = \Big(\frac{1}{C(C-1)} \sum_{i=1}^{C} \sum_{j \neq i} s(p_i, p_j)^{\lambda}\Big)^{\frac{1}{\lambda}} \quad (8)$$

Thus, by adding the above regularization to a metric learning algorithm, we obtain a new optimization problem that is able to constrain the locations of the centers of different classes.
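A minimal sketch of the two regularizers in Eq. (8) is given below, operating on a precomputed $C \times C$ similarity matrix between centers stored as a list of lists. The default values of `gamma`, `delta` and `lam` are placeholders chosen for illustration, not values reported in the paper, and the power-mean term assumes positive similarities.

```python
import math

def r1(sim, i, gamma=10.0, delta=0.0):
    """Log-exp term of Eq. (8) for center i:
    (1/gamma) * log(sum_{j != i} exp(gamma * (s_ij - delta)))."""
    terms = [gamma * (sim[i][j] - delta) for j in range(len(sim)) if j != i]
    m = max(terms)  # max-shift keeps the exponentials from overflowing
    return (m + math.log(sum(math.exp(t - m) for t in terms))) / gamma

def r2(sim, lam=4.0):
    """Power-mean term of Eq. (8) over all ordered pairs i != j."""
    c = len(sim)
    total = sum(sim[i][j] ** lam for i in range(c) for j in range(c) if i != j)
    return (total / (c * (c - 1))) ** (1.0 / lam)

# Hypothetical similarities between three class centers.
sim = [
    [1.0, 0.2, 0.8],
    [0.2, 1.0, 0.5],
    [0.8, 0.5, 1.0],
]
print(round(r1(sim, 0, gamma=50.0), 4), round(r2(sim, lam=50.0), 4))
```

For a large `gamma` or `lam`, both terms are dominated by the most similar pair of centers, so minimizing them mainly pushes apart the centers that violate the constraint $v^{(1)}_{M_c} < \delta_3$ the most.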

3.4. THE RELATIONSHIP BETWEEN EXISTING METHODS

Let us introduce a relaxation of the constraint $v^{(1)}_{M_c} < \delta_3$. Combining all $\{M_c\}_{c=1}^{C}$ into one large set $M = \bigcup_{c=1}^{C} M_c$, the set $\{v^{(1)}_{M_c} \mid c = 1, \cdots, C\}$ is a subset of $M$. Therefore, the constraint $v^{(1)}_{M_c} < \delta_3$ can be relaxed to $\frac{1}{C} \sum_{c=1}^{C} v^{(1)}_{M_c} < \delta_3$. By constructing a continuous version of it, we obtain a new regularization. Besides the log-exp mean function, another skewed mean function can be used, i.e., $g(\lambda) = \big(\frac{1}{n} \sum_{i=1}^{n} a_i^{\lambda}\big)^{\frac{1}{\lambda}}$. Thus, the new continuous constraint is

$$\Big(\frac{1}{C(C-1)} \sum_{i=1}^{C} \sum_{j \neq i} s(p_i, p_j)^{\lambda}\Big)^{\frac{1}{\lambda}} < \delta_3 \quad (9)$$

If we set the similarity function to the negative distance, the constraint in Eq. (9) becomes

$$\Big(\frac{1}{C(C-1)} \sum_{i=1}^{C} \sum_{j \neq i} d(p_i, p_j)^{-\lambda}\Big)^{-\frac{1}{\lambda}} > \delta_3 \quad (10)$$

If we set $\lambda = -1$, the left-hand side of Eq. (10) is the energy-based regularization proposed in UniformFace Duan et al. (2019). If we apply the operation $(x)^{-\lambda}$ to the left-hand side of Eq. (10), the minimum hyperspherical energy Liu et al. (2018), $\frac{1}{C(C-1)} \sum_{i=1}^{C} \sum_{j \neq i} d(p_i, p_j)^{-\lambda}$, is obtained. Because $(x)^{-\lambda}$ is a monotonically decreasing function of $x$ for $\lambda > 0$, the minimum hyperspherical energy has the same goal as Eq. (10).

For the regularization term used in UniformFace, fixing $\lambda = -1$ reduces the flexibility of the algorithm to suit different types of data, because $\lambda$ is a parameter related to the class number $C$. Different from UniformFace, the minimum hyperspherical energy term has a parameter $\lambda$ on $s(p_i, p_j)$. However, $s^{-\lambda}(p_i, p_j)$ becomes very large even for a relatively small $\lambda$ if $s(p_i, p_j)$ is small. Such a large value makes its coefficient in the objective function hard to tune. Thus, in practice, $\lambda$ cannot be chosen too large; in fact, $\lambda$ is set to 0, 1, 2 in Liu et al. (2018). This means the minimum hyperspherical energy term also lacks enough flexibility to deal with different types of samples.
Besides the lack of flexibility, the two methods mentioned above must calculate $C(C-1)/2$ similarities, which is extremely large when the class number of the task is large. So many calculations make the gradient update very slow. In face recognition, for example, the class number can exceed 690K, so the terms used in UniformFace and minimum hyperspherical energy consume substantial computational resources. In our algorithm, however, we consider $\{p_i\}_{i=1}^{C}$ as nodes in the bipartite graph. As with the samples in $X$, we can select only a small part of $\{p_i\}_{i=1}^{C}$ to construct the objective function. In this way, our algorithm saves a lot of computational resources.
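The computational argument can be made concrete with a back-of-the-envelope sketch. The paper does not state how the subset of centers is chosen, so the uniform random sampling below, the subset size of 256, and the helper names are all assumptions made for illustration.

```python
import random

def num_pairs(c):
    """Number of unordered center pairs, C(C-1)/2."""
    return c * (c - 1) // 2

def sample_center_indices(num_classes, subset_size, seed=None):
    """Uniformly sample distinct center indices (an assumed selection rule)."""
    rng = random.Random(seed)
    return rng.sample(range(num_classes), subset_size)

full = num_pairs(690_000)   # all centers, as in the face recognition example
subset = num_pairs(256)     # hypothetical per-step subset of centers
print(full, subset, full // subset)
```

Under these assumptions, each update evaluates a few tens of thousands of pairs instead of hundreds of billions, which is the kind of saving the paragraph above refers to.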

4. EXPERIMENTAL RESULTS

In this section, our method is evaluated and compared to current state-of-the-art methods on the four benchmark datasets for deep metric learning. We also investigate the effect of hyperparameters and embedding dimensionality of our loss to demonstrate its robustness.

4.1. DATASETS

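We employ CUB-200-2011 Wah et al. (2011), Cars-196 Krause et al. (2013), Stanford Online Products (SOP) Oh Song et al. (2016), and In-shop Clothes Retrieval (In-Shop) Liu et al. (2016) for evaluation.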

4.2. IMPLEMENTATION DETAILS

Embedding network: For a fair comparison with previous work, the Inception network with batch normalization Ioffe & Szegedy (2015), pre-trained for ImageNet classification, is adopted as our embedding network. We change the size of its last fully connected layer according to the dimensionality of the embedding vectors, and $L_2$-normalize the final output.

Training: In every experiment, we employ the AdamW optimizer Loshchilov & Hutter (2017), which has the same update step as Adam Kingma & Ba (2014) yet decays the weights separately. Our model is trained for 40 epochs with initial learning rate $10^{-4}$ on CUB-200-2011 and Cars-196, and for 60 epochs with initial learning rate $6 \cdot 10^{-4}$ on SOP and In-Shop. The learning rate for proxies is scaled up 100 times for faster convergence. Input batches are randomly sampled during training.

Image setting: Input images are augmented by random cropping and horizontal flipping during training, while they are center-cropped in testing. The default size of cropped images is 224 × 224 as in most previous work, but for comparison with HORDE Jacob et al. (2019), we also implement models trained and tested with 256 × 256 cropped images.

4.3. ABLATION EXPERIMENT ON DIFFERENT METHODS

To demonstrate the importance of neighborhood parameter learning, we conduct an ablation study on CUB-200-2011. Since the outer objective is a standard metric learning objective based on the log-exp function, we can replace it with the objective function of other metric learning algorithms, such as multi-similarity loss Wang et al. (2019b), N-pair loss Sohn (2016), lifted-structure loss Oh Song et al. (2016), Proxy-NCA loss Movshovitz-Attias et al. (2017), and adaptive neighborhood metric learning Song et al. (2021). We select these methods for the ablation experiment because they are special cases of adaptive neighborhood metric learning Song et al. (2021). We adopt the reformulated metric learning methods as the outer objective function of our method, so that our bi-level learning framework can solve for the neighborhood parameters and the metric parameters according to Algorithm ??. We train those methods on CUB-200-2011 and evaluate them with Recall@1. The results are shown in Table 1. As seen from Table 1, with the help of neighborhood parameter learning, those well-known metric learning algorithms can be further improved in terms of performance.

To further compare our method with state-of-the-art techniques on the image retrieval task, we evaluate the proposed method on the CUB-200-2011, Cars-196, Stanford Online Products (SOP) and In-shop Clothes Retrieval (In-Shop) datasets. We adopt Recall@K as the metric to evaluate the performance of the related metric learning methods. The results are shown in Tables 2-4. As shown in Table 2, our DANML improves Recall@1 by 1.9% on CUB-200-2011 and by 1.5% on Cars-196 over the recent state-of-the-art multi-similarity loss. This may be because the logistic loss function is more powerful than the linear function for generalization.
Meanwhile, our DANML outperforms the recently proposed Circle Loss by about 0.9% on CUB-200-2011 and 2.2% on Cars-196. Compared with ABE, which is an ensemble method with a much heavier model, our method achieves a higher Recall@1, with improvements of 7.0% on CUB-200-2011 and 0.4% on Cars-196. For the Stanford Online Products (SOP) and In-Shop Clothes Retrieval (In-Shop) datasets, as seen from Tables 4 and 3, our method outperforms multi-similarity loss by 1.7% on In-Shop and by 0.4% on SOP, respectively. Furthermore, compared with ABE, our method increases Recall@1 by 3.6% and 2.8% on the In-Shop and SOP datasets, respectively. Compared with Circle Loss, a recent state-of-the-art method on the SOP dataset, our DANML improves performance by about 1.6%.

Embedding dimension: The dimension of the embedding vectors is a crucial factor that controls the trade-off between speed and accuracy in image retrieval systems. We thus investigate the effect of the embedding dimension on the retrieval accuracy of our bi-level metric learning framework. We test our loss with embedding dimensions varying from 64 to 1,024, following the experiment in Wang et al. (2019b), and additionally examine an embedding dimension of 32. The results are shown in Figure 2, in which the retrieval performance of our loss is compared with that of MS loss Wang et al. (2019b). The performance of our loss is fairly stable when the dimension is equal to or larger than 128. Moreover, our loss outperforms MS loss at all embedding dimensions and, more importantly, unlike MS loss, its accuracy does not degrade even with very high-dimensional embeddings.

Parameter r in our method: We also investigate the effect of the hyperparameter $r$ of our method.
The results of our analysis are summarized in Figure 3, in which we examine Recall@1 of the proposed bi-level metric learning loss by varying $r$ over the set {…, 0.6, 0.66, 0.7}. For CUB-200-2011 and Cars-196, the results suggest that the proposed bi-level metric learning achieves its best performance when $r$ is near 0.55 and 0.45, respectively. The results indicate that the performance of the proposed bi-level method is sensitive to the parameter $r$, so $r$ should be chosen carefully. The parameter $r$ is the gap between two negative classes, which determines the lower bound of the Lipschitz constant of the learned deep neural network. That is why the performance of the proposed method is sensitive to the value of $r$.
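Since all results in this section are reported as Recall@K, a minimal sketch of the metric is included here for reference. It assumes a separate query set and gallery set (as in the In-Shop protocol) and a precomputed query-by-gallery similarity matrix; for datasets where queries are drawn from the gallery itself, each query would additionally have to be excluded from its own ranking. The function name and argument layout are assumptions of this sketch.

```python
def recall_at_k(similarity, query_labels, gallery_labels, k):
    """Fraction of queries that have at least one same-label item among
    their k most similar gallery items.

    similarity[q][g] is the similarity between query q and gallery item g.
    """
    hits = 0
    for row, label in zip(similarity, query_labels):
        ranked = sorted(range(len(row)), key=lambda g: row[g], reverse=True)[:k]
        if any(gallery_labels[g] == label for g in ranked):
            hits += 1
    return hits / len(query_labels)

# Toy example: two queries, three gallery items.
similarity = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
]
query_labels = [0, 1]
gallery_labels = [1, 0, 0]
for k in (1, 2, 3):
    print(k, recall_at_k(similarity, query_labels, gallery_labels, k))
```

Sorting every row is quadratic-logarithmic in the gallery size and is only meant to keep the sketch short; a partial top-k selection would be preferable for large galleries.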



CONCLUSION

In this paper, we reveal that learning the positions of the class centers is very important to metric learning. However, current potential energy-based regularization has limited ability to constrain the positions of the centers because it only considers the nearest centers of each query center, and the pushing actions exerted by those nearest centers on the query center contradict each other. To overcome this shortcoming, we design a function named the skewed mean function, which can be used to calculate the largest values in a set of numbers. Using the skewed mean function, we give a new center regularization that considers the center pairs with the largest distances. The conducted experiments illustrate the effectiveness of our proposed method.



Figure 1: Illustration of assigning the locations of centers. c1 is only pushed away by its six nearest centers. Because the pushing directions oppose each other, c1 easily gets stuck, and the location assignment fails.

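Conditions of Theorem 1: (1) $b_S(\lambda)$ is a monotonically increasing function with respect to $\lambda$; (2) $\lim_{\lambda \to +\infty} b_S(\lambda) = \max\{s_i\}_{i=1}^{N}$ and $\lim_{\lambda \to -\infty} b_S(\lambda) = \min\{s_i\}_{i=1}^{N}$. For Theorem 1, we can give two examples of $f_\lambda(x)$, i.e., $f_\lambda(x) = e^{\lambda x}$ and $f_\lambda(x) = x^{\lambda}$, which respectively correspond to the two functions in Eq. (7).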

Stanford Online Products (SOP) Oh Song et al. (2016) and In-shop Clothes Retrieval (In-Shop) Liu et al. (2016) are also used for evaluation. For CUB-200-2011, we use 5,864 images of its first 100 classes for training and 5,924 images of the other classes for testing. For Cars-196, 8,054 images of its first 98 classes are used for training and 8,131 images of the other classes are kept for testing. For SOP, we follow the standard dataset split in Oh Song et al. (2016), using 59,551 images of 11,318 classes for training and 60,502 images of the remaining classes for testing. For In-Shop, we follow the setting in Oh Song et al. (2016), using 25,882 images of the first 3,997 classes for training and 28,760 images of the other classes for testing; the test set is further partitioned into a query set with 14,218 images of 3,985 classes and a gallery set with 12,612 images of 3,985 classes.

Figure 2: Recall@1 for different embedding dimensions.

Figure 3: Recall@1 for different values of r.

Table 1: Performance on CUB-200-2011 of the three state-of-the-art methods and their improved versions with 512-dimensional embeddings.


Table 2: Recall@K (%) performance on the CUB-200-2011 and Cars-196 datasets. Superscript denotes embedding size.

