GENERALIZED SUM POOLING FOR METRIC LEARNING

Abstract

A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) the ability to learn the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem recovers GAP. We show that this optimization problem enjoys analytical gradients, enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework

1. INTRODUCTION

Distance metric learning (DML) addresses the problem of finding an embedding function such that semantically similar samples are embedded close to each other while dissimilar ones are placed relatively apart in the Euclidean sense. Although the prolific and diverse literature of DML includes various architectural designs (Kim et al., 2018; Lin et al., 2018; Ermolov et al., 2022), loss functions (Musgrave et al., 2020), and data-augmentation techniques (Roth et al., 2020; Venkataramanan et al., 2022), many of these methods have a shared component: a convolutional neural network (CNN) followed by a global pooling layer, mostly global average pooling (GAP) (Musgrave et al., 2020). A common folklore explanation for the effectiveness of GAP considers each pixel of the CNN feature map as corresponding to a separate semantic entity. For example, the spatial extent of one pixel can correspond to a "tire" object, making the resulting feature a representation of the "tireness" of the image. If this explanation is correct, the representation space defined via the output of GAP is a convex combination of semantically independent representations defined by each pixel in the feature map. Although this folklore was later empirically studied in (Zeiler & Fergus, 2014; Zhou et al., 2016; 2018, and references therein) and further verified for classification in (Xu et al., 2020), its algorithmic implications are not clear. If each feature truly represents a different semantic entity, should we really average over all of them? Surely, some classes belong to the background and should be discarded as nuisance variables. Moreover, is a uniform average of them the best choice? Aren't some classes more important than others? In this paper, we try to answer these questions within the context of metric learning. We propose a learnable and generalized version of GAP which learns both which subset of semantic entities to utilize and the weights to assign them while averaging.
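To make the convex-combination reading of GAP concrete, the following minimal NumPy sketch (variable names are ours, for illustration only) treats each spatial location of a CNN feature map as one semantic entity and pools them with uniform convex weights:

```python
import numpy as np

def gap(feature_map):
    """GAP viewed as a convex combination: every one of the H*W local
    feature vectors (each putatively a semantic entity) receives the
    same weight 1/(H*W)."""
    h, w, c = feature_map.shape
    entities = feature_map.reshape(h * w, c)   # one row per local entity
    weights = np.full(h * w, 1.0 / (h * w))    # uniform convex weights
    return weights @ entities                  # weighted sum == spatial mean

# Coincides with the usual spatial mean.
fmap = np.random.default_rng(0).normal(size=(7, 7, 512))
assert np.allclose(gap(fmap), fmap.mean(axis=(0, 1)))
```

The questions above amount to replacing the fixed uniform weights with learned ones that may be exactly zero for nuisance entities.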
In order to generalize the GAP operator to be learnable, we re-define it as the solution of an optimization problem. We let the solution space include zero weights, effectively enabling us to choose a subset of the features, and carefully regularize the problem to discourage the degenerate solution of using all the features. Crucially, we rigorously show that the original GAP is a specific case of our proposed optimization problem for a certain realization. Our proposed optimization problem closely follows optimal transport based top-k operators (Cuturi et al., 2019) and we utilize that literature to solve it. Moreover, we present an algorithm for efficient computation of the gradients over this optimization problem, enabling learning. A critical desideratum of such an operator is choosing the subset of features which are discriminative and ignoring the background classes corresponding to nuisance variables. Although supervised metric learning losses provide guidance for seen classes, they carry no such information to generalize the behavior to unseen classes. To enable such a behavior, we adopt a zero-shot prediction loss as a regularization term which is built on expressing the class label embeddings as a convex combination of attribute embeddings (Demirel et al., 2017; Xu et al., 2020). In order to validate the theoretical claims, we design a synthetic empirical study. The results confirm that our pooling method chooses better subsets and improves generalization ability. Moreover, our method can be applied with any DML loss, since GAP is a component shared by all of them. We apply our method to 6 DML losses and test it on 4 datasets. Results show consistent improvements over the direct application of GAP as well as other pooling alternatives.
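While the exact GSP formulation appears later in the paper, the flavor of entropy-smoothed transport-based selection can be illustrated with a small Sinkhorn sketch, in the spirit of the top-k operators of Cuturi et al. (2019). This is our simplification, not the paper's exact problem: each local feature sends its mass either to a "keep" or a "drop" bin according to a relevance score, and the smoothed transport plan yields differentiable convex pooling weights.

```python
import numpy as np

def soft_select_weights(scores, keep_ratio=0.5, eps=0.5, iters=200):
    """Soft subset selection via entropy-regularized optimal transport
    (illustrative sketch, not the paper's exact GSP). Each of the n
    features holds mass 1/n and transports it to a 'keep' or 'drop' bin;
    keeping is cheap when the feature's score is high. The renormalized
    'keep' column of the plan gives convex pooling weights that
    downweight low-score (nuisance) features."""
    n = len(scores)
    C = np.stack([-scores, scores], axis=1)        # columns: keep, drop
    K = np.exp(-C / eps)                           # entropy-smoothed kernel
    a = np.full(n, 1.0 / n)                        # source marginal (features)
    b = np.array([keep_ratio, 1.0 - keep_ratio])   # target marginal (bins)
    u, v = np.ones(n), np.ones(2)
    for _ in range(iters):                         # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                # transport plan
    keep = P[:, 0]
    return keep / keep.sum()                       # convex pooling weights

scores = np.array([2.0, 1.5, 0.5, -1.0])           # high = discriminative
w = soft_select_weights(scores)
assert np.isclose(w.sum(), 1.0) and w[0] > w[3]
```

In GSP the relevance scores themselves come from a learnable component, and the analytical gradients of the smoothed problem allow end-to-end training; the sketch above only shows the forward selection.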

2. RELATED WORK

We discuss the works most related to ours. Briefly, our contributions include that i) we introduce a general formulation for weighted sum pooling, ii) we formulate local feature selection as an optimization problem which admits a closed-form gradient expression without matrix inversion, and iii) we propose a meta-learning based zero-shot regularization term to explicitly impose unseen-class generalization on the DML problem.

DML. Primary thrusts in DML include i) tailoring pairwise loss terms (Musgrave et al., 2020) that penalize violations of the desired intra- and inter-class proximity constraints, ii) pair mining (Roth et al., 2020), iii) generating informative samples (Ko & Gu, 2020; Liu et al., 2021; Gu et al., 2021; Venkataramanan et al., 2022), and iv) augmenting the mini-batches with virtual embeddings called proxies (Wang et al., 2020; Teh et al., 2020). To improve generalization, learning-theoretic ideas (Dong et al., 2020; Lei et al., 2021; Gurbuz et al., 2022), disentangling class-discriminative and class-shared features (Lin et al., 2018; Roth et al., 2019), intra-batch feature aggregation (Seidenschwarz et al., 2021), and further regularization terms (Jacob et al., 2019; Zhang et al., 2020; Kim & Park, 2021; Roth et al., 2022) are utilized. To go beyond a single model, ensemble (Xuan et al., 2018; Kim et al., 2018; Sanakoyeu et al., 2019; Zheng et al., 2021a;b) and multi-task based approaches (Milbich et al., 2020; Roth et al., 2021) are also used. Different from them, we propose a learnable pooling method for global feature extraction that generalizes GAP, a component shared by all of the mentioned works. Hence, our work is orthogonal to all of these and can be used jointly with any of them.

Prototype-based pooling. Most related to ours are trainable VLAD (Arandjelovic et al., 2016) and optimal transport based aggregation (Mialon et al., 2021).
Such methods employ similarities to the prototypes to form a vector of aggregated local features for each prototype and build an ensemble of representations. Similar to us, Mialon et al. (2021) uses an optimal transport formulation to select the local features to be pooled for each prototype. That said, such methods map a set of features to another set of features without discarding any, and do not provide a natural way to aggregate the class-discriminative subset of the features. On the contrary, our pooling machinery effectively enables learning to select discriminative features and maps a set of features to a single feature that is distilled of nuisance information.

Attention-based pooling. Among the methods that reweight the CNN features before pooling, CroW (Kalantidis et al., 2016), Trainable-SMK (Tolias et al., 2020), and CBAM (Woo et al., 2018) build on feature-magnitude based saliency, assuming that the backbone must be able to zero out nuisance information. Yet, such a requirement restricts the parameter space, and annihilating the non-discriminative information might not be feasible in some problems. Similarly, the attention-based weighting methods DeLF (Noh et al., 2017) and GSoP (Gao et al., 2019) do not have explicit control over the feature selection behavior and might result in poor models when jointly trained with the feature extractor (Noh et al., 2017). Differently, our method unifies attention-based feature masking practices (e.g. convolution, correlation) within an efficient-to-solve optimization framework and lets us do away with engineered heuristics in obtaining the masking weights (e.g. normalization, sigmoid, soft-plus) without restricting the solution space, unlike magnitude-based methods.

Optimal transport based operators. Optimal transport (OT) distance (Cuturi, 2013) to match local features is used as the DML distance metric instead of ℓ2 in (Zhao et al., 2021).
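As a point of comparison, the feature-magnitude saliency weighting used by the CroW family of methods discussed above can be sketched as follows (our simplification; CroW itself also applies channel weighting and other steps we omit):

```python
import numpy as np

def magnitude_weighted_pool(feature_map):
    """Simplified CroW-style pooling: each spatial location is weighted by
    its feature norm (normalized to sum to one), so the pooled vector is
    dominated by high-activation regions. Nuisance regions are suppressed
    only if the backbone already assigns them low activation."""
    h, w, c = feature_map.shape
    feats = feature_map.reshape(h * w, c)
    saliency = np.linalg.norm(feats, axis=1)   # per-location magnitude
    weights = saliency / saliency.sum()        # convex combination weights
    return weights @ feats
```

Because these weights are a fixed function of feature magnitude, suppressing a nuisance region requires the backbone to drive its activations toward zero, which is exactly the restriction noted above that our method avoids by making the selection itself an optimization variable.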
Despite being effective, replacing ℓ2 with OT increases the memory cost of image representation as well as the computation cost for

