COMPACT BILINEAR POOLING VIA GENERAL BILINEAR PROJECTION

Abstract

Deep metric learning (DML) aims to learn a deep neural network such that similar samples have small distances while dissimilar samples have large distances. To achieve this goal, current DML algorithms mainly focus on pulling the similar samples within each class as close together as possible. However, pulling similar samples together only considers the local distribution of the data samples; it ignores the global distribution of the data set, i.e., the positions of the centers of different classes. The global distribution is also helpful for distance metric learning: for example, expanding the distances between centers increases the discriminative ability of the extracted features. How to increase the distances between centers, however, is a challenging task. In this paper, we design a function named the skewed mean function, which emphasizes only the largest distances within a set of samples, so that maximizing its value makes the largest distances even larger. We also prove that the energy functions currently used for uniformity regularization on centers are special cases of our skewed mean function. Finally, we conduct extensive experiments to illustrate the superiority of our method.

1. INTRODUCTION

Deep metric learning (DML) is a branch of supervised feature extraction algorithms that constrains the learned features such that similar samples have small distances and dissimilar samples have large distances. Because DML can learn a deep neural network that generalizes to unseen classes, i.e., the testing classes do not appear in the training data set, it is widely used in image classification and clustering, face re-identification, and general supervised and unsupervised contrastive representation learning Chuang et al. (2020). The goal of DML is to optimize a deep neural network so that its projection space lies on the surface of a hypersphere, on which semantically similar samples have small distances and semantically dissimilar samples have large distances. A typical margin-based objective takes the form

$$\min_\theta \sum_i \Big[ \sum_{x_j \in S_i} \max\big(0, d_\theta(x_i, x_j) - \delta_1\big) + \sum_{x_j \in D_i} \max\big(0, \delta_2 - d_\theta(x_i, x_j)\big) \Big],$$

where $\theta$ is the parameter of the deep metric, $\delta_1$ and $\delta_2$ are two tuned parameters, and $S_i$ and $D_i$ are the sets of similar and dissimilar samples of the query $x_i$, respectively. Although achieving excellent performance, these log-exp-based algorithms fail to assign the locations of different classes' centers. Assigning the locations of class centers facilitates distinguishing the features of different classes: for example, if two class centers have a large distance in the feature space, the distance between two samples drawn from those two classes is also increased, making them more easily distinguished. Several methods have been designed to enlarge the distances between centers, the most representative being potential energy functions. However, those functions only consider the nearest centers of a query sample. Because the nearest centers are distributed around the query, their pushing actions contradict one another, and the query sample gets stuck, i.e., its position is hard to move during the training stage.
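To make the role of the tuned parameters $\delta_1$, $\delta_2$ and the sets $S_i$, $D_i$ concrete, here is a minimal NumPy sketch of a hinge-style contrastive objective for a single query; the function name and the specific hinge form are our illustration, not the paper's exact loss.

```python
import numpy as np

def contrastive_margin_loss(z_query, z_sim, z_dis, delta1=0.1, delta2=0.9):
    """Hinge-style contrastive objective for one query embedding.

    Pulls similar samples within distance delta1 and pushes dissimilar
    samples beyond distance delta2 (a common DML formulation; shown for
    illustration only).
    """
    d_sim = np.linalg.norm(z_sim - z_query, axis=1)   # distances to S_i
    d_dis = np.linalg.norm(z_dis - z_query, axis=1)   # distances to D_i
    pull = np.maximum(0.0, d_sim - delta1).sum()      # penalize far positives
    push = np.maximum(0.0, delta2 - d_dis).sum()      # penalize near negatives
    return pull + push
```

The loss is zero exactly when every similar sample is within $\delta_1$ of the query and every dissimilar sample is farther than $\delta_2$.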
As shown in Figure 1, consider $c_1$ as the query center; the six centers inside the circle are the nearest centers of $c_1$. Because current energy functions only let the nearest samples push the query away, and the directions of these pushing actions contradict each other, the six nearest centers leave the query sample stuck. This makes potential energy functions fail to assign the locations of centers. In this paper, we propose a set of functions, named skewed mean functions, that emphasize the largest values within a set of samples. In this way, we can push the pairs with the largest distances further away from each other. Because those samples lie on the boundary of each cluster, they are less likely to be stuck. Using this finding, we design a regularization term to assign the centers of different classes. Our contributions include the following aspects:

1. We give a unified framework of proxy-based distance metric learning. Within this framework, we summarize the traditional distance metric learning algorithms and the classification-based distance metric learning algorithms together. The mathematical proof of their connections gives us a theoretical basis for performing ablation experiments to support our claim that all problems of distance metric learning are about the Lipschitz constant.

2. We reveal that potential energy-based methods have little power to push centers away from each other, since they only consider local data information. To alleviate this problem, we adopt the log-exp mean function and the power mean function to design a term that separates the centers of the classes. Because we prove that the potential energy methods are a special case of ours, our algorithms have the power to push apart the centers of different classes.

3. We conduct extensive experiments on challenging data sets such as CUB-200-2011, Cars196, Aircraft, and In-Shop to illustrate the effectiveness of our algorithms.

Notation.
$X_o = \{(x_i, y_i)\}_{i=1}^{N_1}$ is a $C$-class dataset, where $x_i \in \mathbb{R}^{d_1}$ is the $i$-th sample and $y_i \in \{1, \dots, C\}$ is the label of $x_i$. $f_\theta : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ is a deep neural network parameterized by $\theta$, and $z_i = f_\theta(x_i)$. The similarity between $x_i$ and $x_j$ is denoted by $A_{ij} = \cos(f_\theta(x_i), f_\theta(x_j))$. The set of proxies is denoted by $X_p = \{(w_k, y_k)\}_{k=1}^{N_2}$, where $w_k \in \mathbb{R}^{d_2}$ and $y_k \in \{1, \dots, C\}$ is the corresponding label of $w_k$. The similarity between $x_i$ and $w_j$ is denoted by $B_{ij} = \cos(f_\theta(x_i), w_j)$. Because proxy-based DML does not calculate similarities between samples within $X_o$ or within $X_p$, the similarity relationships among the samples of $X_o \cup X_p$ can be depicted by a bipartite graph. For $x_i \in X_o$, its similar samples are only in $X_p$ and are denoted by $S_i^1$. For $w_i \in X_p$, its similar samples are only in $X_o$ and are denoted by $S_i^2$. Likewise, the dissimilar sample sets of $x_i \in X_o$ and $w_i \in X_p$ are denoted by $D_i^1$ and $D_i^2$, respectively.
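Under this notation, both A and B are cosine-similarity matrices, i.e., row-normalized inner products. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def cosine_similarity_matrix(Z, W):
    """Pairwise cosine similarities between rows of Z and rows of W.

    With Z the embedded samples f_theta(x_i) and W the proxies w_j, this
    yields B with B[i, j] = cos(f_theta(x_i), w_j). Calling it as
    cosine_similarity_matrix(Z, Z) gives the sample-sample matrix A.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit-norm rows
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Zn @ Wn.T
```

Because the rows are normalized first, the entries lie in $[-1, 1]$ and depend only on the directions of the embeddings, consistent with projecting onto a hypersphere.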



Figure 1: Illustration of assigning the locations of centers. $c_1$ is pushed away only by its six nearest centers. Because the pushing directions are contrary, $c_1$ easily gets stuck, and the location assignment therefore fails.
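As an illustration of the skewed-mean idea introduced above (a mean that weights the largest elements of a set most heavily), the following sketch shows two standard smooth-maximum constructions, a log-exp mean and a power mean. These exact forms are our assumption for illustration; the paper's own definitions may differ.

```python
import numpy as np

def log_exp_mean(values, alpha):
    """Smooth skewed mean: (1/alpha) * log(mean(exp(alpha * v))).

    As alpha -> 0 it tends to the arithmetic mean; for large alpha it
    approaches max(values), so maximizing it stretches the largest
    pairwise distances.
    """
    m = alpha * np.asarray(values, dtype=float)
    mx = m.max()  # subtract the max for numerical stability
    return (mx + np.log(np.mean(np.exp(m - mx)))) / alpha

def power_mean(values, p):
    """Power (generalized) mean; p -> infinity gives the maximum."""
    v = np.asarray(values, dtype=float)
    return np.mean(v ** p) ** (1.0 / p)
```

Applied to the pairwise distances between class centers, maximizing either function concentrates the gradient on the largest distances, which is exactly the behavior the skewed mean function is designed to have.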

2. DISTANCE METRIC LEARNING REVISITED

3. SHORTCOMINGS OF DISTANCE METRIC LEARNING

In this section, we adopt the proxy anchor loss as the baseline to analyze the current distance metric learning model. The objective function of the proxy anchor loss is presented as follows.
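The equation itself is not shown in this excerpt. As a reference point, the commonly published form of the proxy anchor loss (Kim et al., 2020) can be sketched as follows; the paper's exact variant may differ in detail.

```python
import numpy as np

def proxy_anchor_loss(B, pos_mask, alpha=32.0, delta=0.1):
    """Standard proxy anchor loss on a sample-proxy similarity matrix.

    B[i, j]  : cosine similarity between sample i and proxy j.
    pos_mask : boolean mask, True where sample i and proxy j share a label.
    alpha    : scaling factor; delta : margin.
    """
    pos_mask = np.asarray(pos_mask, dtype=bool)
    neg_mask = ~pos_mask
    n_proxies = B.shape[1]

    pos_term, n_pos_proxies = 0.0, 0
    for j in range(n_proxies):
        if pos_mask[:, j].any():                   # proxies with positives
            s = B[pos_mask[:, j], j]
            pos_term += np.log1p(np.exp(-alpha * (s - delta)).sum())
            n_pos_proxies += 1

    neg_term = 0.0
    for j in range(n_proxies):
        if neg_mask[:, j].any():
            s = B[neg_mask[:, j], j]
            neg_term += np.log1p(np.exp(alpha * (s + delta)).sum())

    # average the positive term over proxies that have positives,
    # and the negative term over all proxies
    return pos_term / max(n_pos_proxies, 1) + neg_term / n_proxies
```

The loss is near zero when every sample is highly similar to the proxies of its own class and dissimilar to all others, and grows steeply, because of the log-exp structure, when that ordering is violated.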

