Published as a conference paper at ICLR 2023

NEURAL BREGMAN DIVERGENCES FOR DISTANCE LEARNING

Abstract

Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. Non-Euclidean geometries are rarely explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Recent work has shown that Bregman divergences can be learned from data, opening a promising approach to learning asymmetric distances. We propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks and show that it overcomes significant limitations of previous works. We also demonstrate that our method more faithfully learns divergences over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering. Our tests further extend to known asymmetric, but non-Bregman, tasks, where our method still performs competitively despite misspecification, showing the general utility of our approach for asymmetric learning. Focusing on the hypothesis space of Bregman divergences is valuable because many core machine learning measures, including the squared Euclidean, Kullback-Leibler, and Itakura-Saito divergences, are special cases of Bregman divergences. While special cases of the Bregman divergence are used today, and many general results have been proven over the space of Bregman measures, less progress has been made in learning Bregman divergences.

1. INTRODUCTION

Learning a task-relevant metric among samples is a common application of machine learning, with uses in retrieval, clustering, and ranking. A classic example of retrieval is visual recognition, where, given an object image, the system tries to identify the class based on an existing labeled dataset. To do this, the model can learn a measure of similarity between pairs of images, assigning small distances between images of the same object type. Given the broad successes of deep learning, there has been a recent surge of interest in deep metric learning: using neural networks to automatically learn these similarities (Hoffer & Ailon, 2015; Huang et al., 2016; Zhang et al., 2020). The traditional approach to deep metric learning learns an embedding function over the input space so that a simple distance measure between pairs of embeddings corresponds to task-relevant spatial relations between the inputs. The embedding function f is computed by a neural network, which is learned to encode those spatial relations. For example, we can use the basic Euclidean metric to measure the distance between two samples x and y as ‖f(x) − f(y)‖₂. This distance is critical in two ways. First, it is used to define loss functions, such as the triplet or contrastive loss, which dictate how the distance should capture task-relevant properties of the input space. Second, since f is trained to optimize the loss function, the distance influences the learned embedding f. This approach has limitations. When the underlying reference distance is asymmetric or does not follow the triangle inequality, a standard metric cannot accurately capture the data. An important example is clustering over probability distributions, where the standard k-means approach with Euclidean distance is sub-optimal, leading to alternatives such as the KL divergence being used (Banerjee et al., 2005).
Other cases include textual entailment and learning graph distances, which disobey the triangle inequality. Recent work has shown interest in learning an appropriate distance from the data instead of predetermining the final metric between embeddings (Cilingir et al., 2020; Pitis et al., 2020). A natural class of distances that includes common measures such as the squared Euclidean distance is the class of Bregman divergences (Bregman, 1967). They are parametrized by a strictly convex function φ and measure the distance between two points x and y as the first-order Taylor approximation error of the function originating from y at x. The current best approach in Bregman learning approximates φ using the maximum of affine hyperplanes (Siahkamari et al., 2020; Cilingir et al., 2020).

Method     φ representation               Learning Approach    Complexity    Joint Learning
NBD        φ(x) = ICNN(x)                 Gradient Descent     O(|θ|)        Yes
PBDL       φ(x) = max_i(b_i⊤x + z_i)      Linear Programming   O(n³)         No
Deep-div   φ(x) = max_i(b_i⊤x + z_i)      Gradient Descent     O(|θ| + K)    Yes

Table 1: Comparison of our Bregman learning approach NBD with prior methods. NBD simultaneously has better representational power and computational efficiency.

In this work we address significant limitations of previous approaches and present our solution, Neural Bregman Divergences (NBD). NBD is the first non-max-affine approach to learn a deep Bregman divergence, avoiding key limitations of prior works. We instead directly model the generating function φ(x), and then use φ(x) to implement the full divergence D_φ. To demonstrate efficacy, we leverage prior benchmarks and propose several new benchmarks of asymmetry, organized into three types of information they provide: 1) quality of learning a Bregman divergence directly, 2) ability to learn a Bregman divergence and a feature extractor jointly, and 3) effectiveness in asymmetric tasks where the ground truth is known to be non-Bregman.
This set of tests shows how NBD is far more efficacious in representing actual Bregman divergences than prior works, while simultaneously performing better in non-Bregman learning tasks. The rest of our paper is organized as follows. In §2 we show how to implement NBD using an Input Convex Neural Network and compare to related work, with further related work in §3. In §4 we demonstrate experiments where the underlying goal is to learn a known Bregman measure, and then to jointly learn a Bregman measure with an embedding of the data from which the measure is computed. Then §5 studies the performance of our method on asymmetric tasks where the underlying metric is not Bregman, to show more general utility where prior Bregman methods fail. Finally we conclude in §6.

2. NEURAL BREGMAN DIVERGENCE LEARNING

A Bregman divergence computes the divergence between two points x and y from a space X by taking the first-order Taylor approximation of a generating function φ. This generating function is defined over X and can be thought of as (re-)encoding points from X. A proper and informative φ is incredibly important: different φ can capture different properties of the spaces over which they are defined. Our aim in this paper is to learn Bregman divergences by providing a neural method for learning informative functions φ.

Definition 2.1. Let x, y ∈ X, where X ⊆ R^d. Given a continuously differentiable, strictly convex φ : X → R, the Bregman divergence parametrized by φ is

D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩,   (1)

where ⟨•, •⟩ represents the dot product and ∇φ(y) is the gradient of φ evaluated at y.

A properly defined φ can capture critical, inherent properties of the underlying space. By learning φ via Eq. (1), we aim to automatically learn these properties. For example, Bregman divergences can capture asymmetric relations: if X is the D-dimensional simplex representing D-dimensional discrete probability distributions, then φ(x) = ⟨x, log x⟩ yields the KL divergence, D_φ(x, y) = Σ_i x_i log(x_i / y_i).

2.1. PRIOR MAX-AFFINE APPROACHES

PBDL. Recent works have proposed ways to empirically learn the Bregman divergence that best represents a dataset by focusing on a max-affine representation of φ (Siahkamari et al., 2020; 2022). Given a set of K affine hyperplanes of the form b_i⊤x + z_i, the convex function φ can be given a lower-bound approximation as φ(x) = max_i(b_i⊤x + z_i). Given a dataset of m distance constraints over triplets of samples, D(x_a, x_p) ≤ D(x_a, x_n), the parameters {b_i, z_i}_{i=1}^K can be learned using convex optimization to minimize an objective function, e.g. a variant of the triplet loss L_tr = Σ_{j=1}^m max{0, 1 + D(x_{a_j}, x_{p_j}) − D(x_{a_j}, x_{n_j})}. With some rewriting of the objective, the problem can be solved with techniques such as ADMM. We have modified some of the presentation for simplicity; refer to the original works for detail.
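As a concrete numerical check of Definition 2.1, the following minimal NumPy sketch (our own; function names are illustrative) verifies that the generating function φ(x) = ⟨x, log x⟩ on the simplex recovers the KL divergence:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>   (Eq. 1)
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

phi = lambda p: np.sum(p * np.log(p))    # negative entropy, strictly convex
grad_phi = lambda p: np.log(p) + 1.0     # its elementwise gradient

x = np.array([0.2, 0.3, 0.5])            # points on the probability simplex
y = np.array([0.4, 0.4, 0.2])

d_xy = bregman(phi, grad_phi, x, y)
kl_xy = np.sum(x * np.log(x / y))        # KL(x || y)
assert np.isclose(d_xy, kl_xy)           # the two quantities coincide
```

Note the asymmetry: exchanging x and y yields KL(y ‖ x), which is generally a different value.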
While effective on smaller datasets at learning a suitable divergence for ranking and clustering, this approach has limitations when scaling up to the dataset sizes used in deep learning. It requires a fixed set of triplets for training, which limits the data size (whereas on-demand mining can generate O(n³) triplets). Even though ADMM can be distributed (and the 2-block ADMM from Siahkamari et al. (2022) is an order of magnitude faster than the original), it does not scale as well as batch gradient descent methods to large data. Moreover, it cannot be directly used for deep learning tasks, as we cannot simultaneously learn an embedding with the divergence.

Deep-div. Following the max-affine approach, Cilingir et al. (2020) proposed the first deep learning approach to learn φ, where the base neural network is followed by a layer of K max-affine components. Each max-affine component is a linear layer (or a shallow network), trained via backpropagation. Recall that φ(x) := max_i(b_i⊤x + z_i), and let φ_i(x) = b_i⊤x + z_i for each i. To simplify the computation of D_φ(x, y), they make use of the following property.

Fact 2.1.1. Let i and j be the active max-affine components for φ(x) and φ(y), respectively. Then the Bregman divergence between two points x and y is D_φ(x, y) = φ_i(x) − φ_j(x).

Thus for a given x, y, only the components i, j are updated by backpropagation. Observe that no matter how large K is, φ is neither continuously differentiable nor strictly convex. As a result, this structure is not reliably learned using gradient descent. We provide an example to illustrate this, which is confirmed in our experiments (e.g. Table 4). Consider a regression loss L(x, y, d; φ) = (D_φ(x, y) − d)², whose gradient with respect to the slopes is

∇_b L(x, y, d; φ) = 2(D_φ(x, y) − d) • ∇_b D_φ(x, y).

For the active max-affine slopes b_i and b_j, ∇_{b_i} D_φ(x, y) = x and ∇_{b_j} D_φ(x, y) = −x, while the gradients with respect to all other slopes are 0. If we are also learning an embedding f(x), then replace x with f(x) above; furthermore, for any embedding parameter θ, we have ∇_θ D_φ(x, y) = (b_i − b_j) • ∇_θ f(x).
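Fact 2.1.1 can be verified numerically; the following NumPy sketch (random hyperplanes of our choosing) checks the identity and illustrates why pairs sharing an active component contribute no gradient signal:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 4
B = rng.normal(size=(K, D))            # max-affine slopes b_i
z = rng.normal(size=K)                 # offsets z_i

phi_parts = lambda x: B @ x + z        # phi_i(x) = b_i . x + z_i
phi = lambda x: phi_parts(x).max()     # phi(x) = max_i phi_i(x)

x, y = rng.normal(size=D), rng.normal(size=D)
i, j = phi_parts(x).argmax(), phi_parts(y).argmax()

# Bregman divergence using the subgradient b_j of phi at y
d = phi(x) - phi(y) - B[j] @ (x - y)
# Fact 2.1.1: the expression collapses to phi_i(x) - phi_j(x)
assert np.isclose(d, phi_parts(x)[i] - phi_parts(x)[j])

# If x and y share the active component (i == j), the divergence is
# phi_i(x) - phi_i(x) = 0 and the slope gradients x and -x cancel,
# so gradient descent receives no signal from such pairs.
```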
From this we infer that when i = j, no learning occurs. Furthermore, if any max-affine hyperplane l is (nearly) dominated by the others over the domain of the inputs (that is, φ_l(x) < φ(x) for (almost) all x), that component never (or rarely) gets updated, resulting in essentially 'dead' components. Another implication is that D_φ(x, y) = 0 for any x, y located on the same maximal component. By linearity there are at most K such regions, so in classification or clustering the number of components K, and hence the runtime of Deep-div, is forced to grow with the number of classes. Even so, this resolution is low for more fine-grained tasks such as regression unless K is very large. In practice, training this method on deep metric learning or regression shows issues which corroborate our analysis.

2.2. REPRESENTING D_φ VIA φ DIRECTLY

Next we describe our new method, where we directly learn φ with a continuous convex neural network, which has a number of advantages over the piecewise approach. Whereas each pass of Deep-div requires an O(K) loop over the max-affine components, slowing training and inference, our approach does not incur additional complexity beyond an additional backpropagation step. Our approach is suited for both classification and regression tasks because it gives much finer resolution to φ, while max-affine approaches incur irreducible error when the target is smooth. This is further improved by selecting an activation function that makes the learned φ strictly convex and continuously differentiable, unlike the max-affine approximation. Empirically, we demonstrate that our method converges consistently and more efficiently to lower-error solutions.

To represent φ, we adopt the Input Convex Neural Network (ICNN) (Amos et al., 2017). The ICNN composes linear layers with non-negative weights W⁺ and affine functions with unconstrained weights U with convex, non-decreasing activation functions g(•). The composition of these three components for the i-th layer of an ICNN is given by Eq.
(2), with z_i the i-th layer's input and z_{i+1} its output:

z_{i+1} = g(W_i⁺ z_i + U_i z_0 + b_i).   (2)

By construction, the resulting neural network satisfies convexity. Chen et al. (2019) and Pitis et al. (2020) have shown, under specific conditions, that ICNNs universally approximate convex functions. Furthermore, with hidden layers, the representational power of an ICNN is exponentially greater than that of a max-affine function (Chen et al. (2019), Thm. 2), giving much finer resolution over φ. Prior works on the ICNN have only tried piecewise-linear activation functions such as the ReLU variants, g(x) = max(x, 0); we instead use the Softplus activation g(x) = log(1 + exp(x)), which lends the network smoothness and strict convexity. This is an important design choice, as learning through ∇φ(y) involves second derivatives, which for any piecewise-linear activation are zero almost everywhere. This causes vanishing gradients in the ⟨∇φ(y), x − y⟩ term, restricting the capacity to learn. We further discuss its representational capacity in Appendix F.1.

Efficient Computation. In order to backpropagate a loss through the ∇φ(y) term in Eq. (1), we use double backpropagation as in Drucker & Le Cun (1991). Naively, computing the gradient of ∇φ(y) would involve constructing the Hessian, with a resulting quadratic increase in computation and memory use. Double backpropagation instead uses automatic differentiation to efficiently compute gradients with respect to the inputs via the "Jacobian-vector product" (Frostig et al., 2021), so that ⟨∇φ(y), x − y⟩ can be computed at the cost of evaluating φ(y) one additional time. Since there are already three calls to φ, this is only a 25% increase in computational overhead to backpropagate through Eq. (1), provided we have a learnable representation of φ. This functionality has been implemented in the PyTorch API (Paszke et al., 2017).
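A minimal PyTorch sketch of this construction (layer sizes and names are ours, not the paper's exact architecture): a one-hidden-layer ICNN with Softplus activations and non-negative W⁺, where ⟨∇φ(y), x − y⟩ is obtained by differentiating through φ(y) with create_graph=True so the divergence itself remains trainable:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """One-hidden-layer input-convex network: Softplus activations,
    non-negative W+ weights, unconstrained skip connection U from the input."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.U0 = nn.Linear(dim, hidden)
        self.W1 = nn.Linear(hidden, 1, bias=False)   # constrained to be >= 0
        self.U1 = nn.Linear(dim, 1)

    def forward(self, x):
        z1 = F.softplus(self.U0(x))
        # clamping enforces the non-negativity constraint on W+ at evaluation
        z2 = F.softplus(F.linear(z1, self.W1.weight.clamp(min=0)) + self.U1(x))
        return z2.squeeze(-1)

def bregman(phi, x, y):
    y = y.requires_grad_(True)
    phi_y = phi(y)
    # double backprop: create_graph=True keeps grad_y differentiable
    (grad_y,) = torch.autograd.grad(phi_y.sum(), y, create_graph=True)
    return phi(x) - phi_y - (grad_y * (x - y)).sum(-1)

phi = ICNN(dim=5)
x, y = torch.randn(8, 5), torch.randn(8, 5)
d = bregman(phi, x, y)
d.sum().backward()            # gradients flow through grad_y to phi's weights
assert (d > -1e-4).all()      # convexity of phi implies D_phi(x, y) >= 0
```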
Furthermore, we can vectorize the Bregman divergence over a pairwise matrix, making our method significantly more computationally efficient than prior approaches. In particular, avoiding the O(K) max-affine computation of Deep-div cuts our training time in half (Table 8 ). We show empirical timings, along with discussion, in Appendix A.

2.3. JOINT TRAINING

The original feature space is rarely ideal for computing distance measures between samples. Classical metric learning generally applies a linear transformation to the feature space in order to apply a fixed distance function D(•, •) such as the Euclidean distance (Jain et al., 2012; Kulis et al., 2012). In deep metric learning, a neural network f_θ is used to embed the samples into a latent space where the distance function is more useful (Musgrave et al., 2020). In our approach, instead of fixing the distance function, we also learn a Bregman divergence as the measure:

D_φ(f_θ(x), f_θ(y)) = φ(f_θ(x)) − φ(f_θ(y)) − ⟨∇φ(ỹ), f_θ(x) − f_θ(y)⟩,   (3)

with ∇φ evaluated at ỹ = f_θ(y).

Algorithm 1 Neural Bregman Divergence (NBD). Given data pairs (a_i, b_i), our approach learns (1) f_θ to featurize a_i and b_i; (2) φ to compute a Bregman divergence value ŷ between the featurized data points. The computed Bregman divergence is trained via a loss function to be close to a target divergence value y_i. If a target divergence value isn't available, an implicit loss function can be used.

Require: Dataset of pairs and target distances; loss function ℓ(•, •) : R × R → R
1: f_θ ← arbitrary neural network as a feature extractor
2: φ ← an ICNN parameterized as by Eq. (2)
3: for each data tuple (a_i, b_i) with label y_i in dataset do
4:   x ← f_θ(a_i)                 ▷ Perform feature extraction
5:   y ← f_θ(b_i)
6:   rhs ← ⟨∇φ(y), x − y⟩         ▷ Use double backprop
7:   ŷ ← φ(x) − φ(y) − rhs        ▷ Empirical Bregman divergence
8:   ℓ(ŷ, y_i).backward()         ▷ Compute gradients
9:   update parameters of φ and θ
10: return Jointly trained feature extractor f_θ and learned Bregman divergence φ

Note we now have two sets of parameters to learn: those associated with φ and those of the encoder (θ). During training, they are simultaneously learned through gradient descent, which involves double backpropagation as described earlier. We summarize this process in Alg. 1.
The metric model accepts two samples as input and estimates the divergence between them. When the target divergence value is available, the metric can be trained using a regression loss function such as mean square error. Otherwise, an implicit comparison such as triplet or contrastive loss can be used.
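The training loop of Alg. 1 can be sketched compactly in PyTorch (a toy setup of ours: for brevity the ICNN head is replaced by a learnable convex quadratic φ(z) = ‖Az‖², and the target is a toy squared-Euclidean distance; the paper's experiments use a full ICNN):

```python
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))  # f_theta
A = nn.Parameter(0.1 * torch.randn(4, 4))
phi = lambda z: (z @ A.T).pow(2).sum(-1)     # convex stand-in for the ICNN

def nbd(a, b):
    x, y = f(a), f(b)                        # lines 4-5 of Alg. 1
    phi_y = phi(y)
    # line 6: <grad phi(y), x - y> via double backprop (create_graph=True)
    (gy,) = torch.autograd.grad(phi_y.sum(), y, create_graph=True)
    return phi(x) - phi_y - (gy * (x - y)).sum(-1)     # line 7

opt = torch.optim.Adam(list(f.parameters()) + [A], lr=1e-2)
a, b = torch.randn(256, 10), torch.randn(256, 10)
target = (a - b).pow(2).sum(-1)              # toy regression target
losses = []
for _ in range(200):
    opt.zero_grad()
    loss = (nbd(a, b) - target).pow(2).mean()   # line 8
    loss.backward()
    opt.step()                                  # line 9
    losses.append(loss.item())
```

Note that the single backward call updates both parameter sets at once, which is exactly the joint learning that the ADMM-based PBDL cannot do.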

3. OTHER RELATED WORK

In classic metric learning methods, a linear or kernel transform on the ambient feature space is used, combined with a standard distance function such as Euclidean or cosine distance. The linear case is equivalent to Mahalanobis distance learning. Information on such approaches can be found in (Xing et al., 2002; Kulis et al., 2012; Jain et al., 2012; Kulis et al., 2009).

Table 2: We cluster data generated from a mixture of exponential, Gaussian, and multinomial distributions. Learning the metric from data is superior to using a standard metric such as Euclidean. Our approach NBD furthermore outperforms all other divergence learning methods. Means and standard deviations are reported over 10 runs.

Bregman divergences generalize many standard distance measures and can further introduce useful properties such as asymmetry. They have classically been used in machine learning for clustering, by modifying the distance metric used in common algorithms such as k-means (Banerjee et al., 2005; Wu et al., 2009). One of the first methods to learn a Bregman divergence fits a non-parametric kernel to give a local Mahalanobis metric; the coefficients for the data points are fitted using subgradient descent (Wu et al., 2009). We described the more recent approaches earlier. Recently, Pitis et al. (2020) approached asymmetric distance learning by fitting a norm N with modified neural networks that satisfy norm properties and using the induced distance metric N(x − y). They introduce two versions that we include as baselines: the first (Deepnorm) parametrizes N with a modified ICNN that satisfies properties such as non-negativity and subadditivity; the second (Widenorm) computes a nonlinear transformation of a set of Mahalanobis norms. By construction, these metrics allow for asymmetry but still satisfy the triangle inequality. On the other hand, the Bregman divergence does not necessarily obey the triangle inequality.
This is appealing for situations where the triangle inequality may be too restrictive.

We first evaluate on the distributional clustering benchmark of prior work (2020). The datasets consist of mixtures of N = 1000 points in R¹⁰ from five clusters, where the multivariate distribution given cluster identity is non-isotropic Gaussian, exponential, or multinomial. Given a distance metric, we apply a generalized k-means algorithm to cluster the data points. While standard metrics, such as the L2 distance and KL divergence, may be ideal for specific forms of data (e.g. isotropic Gaussian and simplex data, respectively), our goal is to learn an appropriate metric directly from a separate labeled training set. In particular, because Bregman divergences are uniquely associated with each member of the exponential family (Banerjee et al., 2005), our method is especially suited for clustering data from a wide range of distributions which may not be known ahead of time. To learn the metric from data, we apply triplet mining, including all triplets with non-zero loss (Hoffer & Ailon, 2015). We use the same method to train all models except for the Euclidean baseline, which requires no training, and PBDL, where we directly use the authors' code.
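The generalized k-means we use can be sketched as follows (a NumPy illustration of ours with a fixed KL divergence standing in for a learned one): points are assigned to the cluster minimizing D_φ(x, μ), and, for any Bregman divergence, the loss-minimizing center is the arithmetic mean (Banerjee et al., 2005):

```python
import numpy as np

def bregman_kmeans(X, k, div, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center under the divergence D(x, mu)
        D = np.stack([div(X, c) for c in centers], axis=1)
        labels = D.argmin(axis=1)
        # update step: the Bregman centroid is the arithmetic mean
        centers = np.stack([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# KL divergence of each row of X from a center mu (all on the simplex)
kl = lambda X, mu: np.sum(X * np.log(X / mu), axis=1)

rng = np.random.default_rng(1)
X = rng.dirichlet([2.0] * 5, size=300)     # toy simplex-valued data
labels, centers = bregman_kmeans(X, 3, kl)
```

The mean-as-centroid property is what makes swapping in a learned divergence straightforward: only the assignment step changes.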

4. BREGMAN EXPERIMENTS

4.1. CLUSTERING AND RANKING

As shown in Table 2, our method NBD gives improved clustering over all distributions compared to all baselines. In particular, standard k-means with Euclidean distance is clearly inadequate. While the Mahalanobis baseline shows significant improvement, it is only comparable to NBD in the Gaussian case, where a matrix can be learned to scale the clusters to be isotropic. This task indicates the importance of learning flexible divergences from data.

After demonstrating success in distributional clustering, we now apply our method to ranking and clustering real data (Table 3), as first shown in Siahkamari et al. (2020). For the ranking tasks, the test set is treated as queries for which the learned model retrieves items from the training set in order of increasing divergence. The ranking is scored using mean average precision (MAP) and area under the ROC curve (AUC). Our method again outperforms the other Bregman learning methods in the large majority of datasets and metrics. We emphasize that these are standard experiments from recent work, on which our method proves superior.

As a confirmation that our method can faithfully represent Bregman divergences, we use simulated data to demonstrate that our method efficiently learns divergences between pairs of inputs. We generate pairs of 20-dimensional vectors from a normal distribution, with 10 informative features used to compute the target divergence and 10 distractors. To be more challenging and realistic, we add various levels of correlation among all features to make the informative features harder to separate.

4.2. DIVERGENCE REGRESSION

The following target divergences are used: (1) squared Euclidean distance (symmetric); (2) squared Mahalanobis distance (symmetric); (3) the divergence generated by φ(x) = x log x (asymmetric); (4) KL divergence (asymmetric). In this task we compare our NBD with Deep-div and Mahalanobis; we did not find a regression option for PBDL in the authors' code. Instead, we add the Deepnorm and Widenorm metrics from Pitis et al. (2020) as alternative baselines which do not learn Bregman divergences. The results of these experiments are in Table 4, with loss curves shown in Appendix Fig. 4. In the symmetric cases of the Euclidean and Mahalanobis ground truth, our NBD method performs nearly as well as using a Mahalanobis distance itself. This shows that our method does not lose any representational capacity in representing these standard measures. This is notably not true for the prior approaches for asymmetric learning: Deepnorm, Widenorm, and Deep-div. Notably, unlike in the clustering tasks, the piecewise representation of φ in Deep-div is unable to accurately represent Bregman regression targets, as discussed earlier in §2.1. In Fig. 4c and Fig. 4d two asymmetric divergences are used, and our NBD approach performs better than all existing options. Because these experiments isolate purely the issue of learning the divergence itself, we have strong evidence that our approach is the most faithful at learning a known divergence from a supervisory signal. Note that the Mahalanobis distance performed second best under all noise levels, meaning the prior asymmetric methods were in fact less accurate at learning asymmetric measures than a purely symmetric model.
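For reference, the four regression targets can be computed as below (a sketch of ours; the correlated data generation of the actual benchmark is omitted, and the vectors here are arbitrary positive examples):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 2.0, size=10)
y = rng.uniform(0.1, 2.0, size=10)

# (1) squared Euclidean distance (symmetric)
d_euc = np.sum((x - y) ** 2)
# (2) squared Mahalanobis distance with a PSD matrix M = A A^T (symmetric)
A = rng.normal(size=(10, 10))
M = A @ A.T
d_mah = (x - y) @ M @ (x - y)
# (3) divergence generated by phi(x) = sum x log x on positive vectors:
#     the generalized KL divergence (asymmetric)
d_gkl = np.sum(x * np.log(x / y) - x + y)
# (4) KL divergence after normalizing onto the simplex (asymmetric)
p, q = x / x.sum(), y / y.sum()
d_kl = np.sum(p * np.log(p / q))
```

All four targets are non-negative; (3) and (4) change value when their arguments are swapped.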

4.3. CO-LEARNING AN EMBEDDING WITH A DIVERGENCE

Having shown that our method outperforms the prior Bregman learning approaches on shallow clustering, classification, and regression, we introduce a more challenging task, BregMNIST, where a neural embedding must be learned along with the divergence. The dataset consists of paired MNIST images, with the target distance being a Bregman divergence between the digits shown in the images. Example pairs are displayed in Fig. 1 for the asymmetric Bregman divergence parametrized by φ(x) = (x + 1) log(x + 1). We also make a harder version by substituting MNIST with CIFAR10, with the same divergence labels. In both cases the relation of features to class label is arbitrary (that is, we impose an ordinal relation among labels that does not exist in the data), meaning that the embedding function must learn to effectively map image classes to the correct number used to compute the divergence, while the metric head must also learn to compute the target Bregman divergence. The results of the experiments (Fig. 2) mirror our results in §4.2. For both BregMNIST and BregCIFAR, NBD performs best, while prior methods of learning asymmetric measures perform worse than the Mahalanobis distance.
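The label-level target used for BregMNIST and BregCIFAR can be reproduced with a scalar form of Definition 2.1 (a small sketch of ours; function names are illustrative):

```python
import math

def phi(t):
    # generating function for the BregMNIST target divergence
    return (t + 1) * math.log(t + 1)

def dphi(t):
    # derivative of phi
    return math.log(t + 1) + 1

def label_divergence(a, b):
    # scalar Bregman divergence between digit labels a and b
    return phi(a) - phi(b) - dphi(b) * (a - b)

print(label_divergence(0, 9), label_divergence(9, 0))  # asymmetric, both > 0
```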

5. NON-BREGMAN LEARNING

We have shown that our NBD method is the most effective among all available options when the underlying ground truth is from the class of Bregman divergences. In this section we explore the effectiveness of our approach on tasks that are known to be non-Euclidean, but not necessarily representable by a Bregman divergence. The purpose of these experiments is to show that NBD does not depend on the underlying representation being a proper divergence in order to still be reasonably effective, and that it is still more effective than the prior Deep-div approach to Bregman learning. This is also of practical relevance to applications: just as the Euclidean metric has been used for its convenient properties and simplicity, without belief that the underlying system was truly Euclidean, our NBD may be valuable for developing more flexible methods that inherit the mathematical convenience of Bregman divergences. These tasks probe the efficacy of the closest Bregman approximation of the underlying divergence, so we expect that our method will not surpass the state of the art when the task is sufficiently non-Bregman.

5.1. DEEP METRIC LEARNING

We extend the ranking experiments from §4.1 and the architecture of Fig. 1 to the deep metric learning setting where an embedding is learned alongside the divergence. We use a ResNet-18 as the base feature extractor and apply batch triplet mining to learn Eq. (3) by minimizing the triplet loss. Most metrics perform comparably in this experiment, although Deep-div is consistently outperformed by the others. We observed that Deep-div has higher variance and, depending on the initialization, could learn well or not at all. This may be due to the 'dead' affine components discussed earlier. Fixing the Euclidean distance still appears as effective as learning the final metric here. We hypothesize this is because the embedding space of image datasets is well-behaved enough for a standard distance to accurately cluster images.
In the following experiments we will investigate tasks where, even after applying an arbitrary feature extractor, a standard distance measure is no longer sufficient.

5.2. APPROXIMATE SEMANTIC DISTANCE

The next task involves learning symmetric distances that do not follow the triangle inequality. We group the CIFAR10 classes into two categories: man-made and natural. Within each category we select an arbitrary exemplar class (car and deer in our experiment). We then assign proxy distances between classes to reflect semantic similarity: 0.5 within the same class, 2 between any non-exemplar class and its category exemplar, and 8 between non-exemplar classes within a category. Like BregCIFAR, we present pairs of images to the model, which simultaneously adjusts a neural embedding and learns a divergence function such that inter-class distances in the embedding space match the target values. This task is harder than the previous ones because it is not sufficient to learn a separable embedding for each class; the embeddings must additionally be arranged appropriately in a non-Euclidean space. The results in Table 6 indicate our method effectively learns distances that do not follow the triangle inequality. The Deep-div approach does second best here due to the small space of valid outputs. The other approaches by construction adhere to the triangle inequality and do not perform as well.

5.3. OVERLAP DISTANCE

The overlap distance task presents pairs of crops taken from the same image or from different images. A horizontal and a vertical cut are chosen uniformly at random from each image. When the crops are based on the same image, the asymmetric divergence measure between images X and Y is one minus the fraction of X covered by the intersection: D(X, Y) = 1 − |X ∩ Y| / |X|. Otherwise the divergence is 1. We use the INRIA Holidays dataset (see Appendix G). The results can be found in Fig. 3, where we see NBD performs the second best of all options.

Table 7: Results of estimating distances on the shortest-path task. Triangle-inequality-preserving Deepnorm and Widenorm are expected to perform best.
Our NBD performs significantly better than the previous Bregman learning approach, Deep-div, and can be competitive with the triangle-inequality-preserving methods. The gap between train and test loss shows the impact of the triangle inequality in helping avoid over-fitting the observed sub-graph used for training. Dataset and experiment details are in the Appendix.

We observe that Widenorm performs better on this task, especially during the initial learning process, because it is permitted to violate the positive-definiteness property of norms, i.e., it can output D(x, x) > 0. Thus the method learns to map a difference of zero between embeddings to some intermediate distance with lower MSE. This can be problematic in use cases where definiteness is important.
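The crop-overlap target above reduces to rectangle intersection; a small sketch (our own rectangle convention of (left, top, right, bottom)) shows why the measure is asymmetric when crop areas differ:

```python
def crop_divergence(a, b):
    """Asymmetric overlap divergence D(a, b) = 1 - |a ∩ b| / |a| between two
    axis-aligned crops of the same image; each crop is (left, top, right, bottom)."""
    def area(r):
        return max(0, r[2] - r[0]) * max(0, r[3] - r[1])
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return 1.0 - area(inter) / area(a)

a = (0, 0, 100, 100)
b = (50, 0, 150, 200)            # larger crop
print(crop_divergence(a, b))     # 0.5:  half of a lies outside b
print(crop_divergence(b, a))     # 0.75: three quarters of b lie outside a
```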

5.4. SHORTEST PATH LENGTH

Our final task involves estimating the shortest path on a graph from one embedded node to another based on their distances to and from a set of landmark nodes. This task inherently favors the Widenorm and Deepnorm methods because they maintain the triangle inequality (i.e., no shortcuts are allowed in a shortest path), and so they are expected to perform better than NBD. We reproduce the experimental setup of Pitis et al. (2020) closely, with details in Appendix E. The results for each method are shown in Table 7, and largely match our expectations. The triangle-inequality-preserving measures usually perform best, given the nature of the problem: any violation of the triangle inequality means the distance measure is "taking a shortcut" through the graph search space, and thus must be under-estimating the true distance to the target node. NBD and Deep-div, being restricted to the space of Bregman divergences, have no constraint that prevents violating the triangle inequality, and thus often under-estimate the true distances. Comparing the train and test losses helps to further assess this behavior, as the training pairs can be overfit to an extent. We see that NBD effectively learns the asymmetric relationships between seen points despite underestimating the distance to new points.

To further explore the degree to which the properties underlying Bregman divergences affect shortest-path-length learning, we introduce two extensions to NBD. The first is a soft modification encouraging the triangle inequality to be obeyed (Bregman-sqrt). The second has a hard constraint guaranteeing the triangle inequality (Bregman-GS) and is defined below (Acharyya et al., 2013). Results are included in Table 7, with detail on the extensions in the Appendix. There is an inherent tradeoff between the two extensions: Bregman-sqrt can be asymmetric but does not guarantee the triangle inequality, while Bregman-GS is symmetric but always satisfies the triangle inequality.
We see that these two modifications to NBD are highly competitive with Deepnorm and Widenorm. Furthermore, the relative performance of each provides an indication of whether asymmetry or the triangle inequality is more crucial to modeling a given dataset. These methods highlight that even when a given task is highly non-Bregman, NBD can be readily extended to relax or strengthen various assumptions to better model the data. The Bregman-GS divergence is defined as

D_φ^gsb(x, y) = D_φ(x, y) + D_φ(y, x) + ½ ‖x − y‖₂² + ½ ‖∇φ(x) − ∇φ(y)‖₂²

A COMPUTATIONAL DISCUSSION

We measure timing information for NBD as well as the benchmarks used in our experiments on a divergence learning task as in §4.2. The training time (forward and backward passes), inference time (forward pass), and pairwise distance matrix computation are collected. While the pairwise distance is not directly used in our experiments, it is commonly computed in metric learning applications such as clustering and classification. The data dimension N × D used here is characteristic of the size of many metric learning experiments, where the N × N pairwise distance matrix can be stored in GPU memory, but the N × N × D tensor with embedding dimension D does not fit: the fast but naive approach of flattening the matrix and passing it as an N × N-length batch does not work. The Mahalanobis method is naturally the fastest and serves as a reasonable runtime lower bound, because the distance can be expressed as a simple composition of a norm with a linear layer. The squared Euclidean norm can be simplified as D(x, y) = ‖x‖₂² + ‖y‖₂² − 2⟨x, y⟩, which can be computed efficiently. We can similarly compute the Bregman divergences. For example, the squared Euclidean distance is equivalently written as D(x, y) = ‖x‖₂² − ‖y‖₂² − ⟨2y, x − y⟩, which is its Bregman divergence formulation. Though not necessary for our current experiments, the pairwise distances for larger batch sizes for Mahalanobis and NBD can be readily implemented in PyKeOps (Charlier et al., 2021). Thus the longer computation time for NBD can likely be attributed to the increased cost of double backpropagation and the convex metric architecture. On the other hand, Deepnorm cannot be vectorized in this manner, so pairwise distances need to be computed with O(N²) network evaluations, for example by further splitting the tensor into smaller mini-batches.
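The vectorized pairwise computation described above works for any Bregman divergence by expanding ⟨∇φ(y), x − y⟩; a NumPy sketch (ours), checked against the naive O(N·M·D) computation in the squared-Euclidean case:

```python
import numpy as np

def pairwise_bregman(X, Y, phi, grad_phi):
    """N x M matrix of D_phi(x_i, y_j) without building an N x M x D tensor."""
    G = grad_phi(Y)                                   # (M, D)
    # D(x, y) = phi(x) - phi(y) - <grad phi(y), x> + <grad phi(y), y>
    return (phi(X)[:, None] - phi(Y)[None, :]
            - X @ G.T + np.sum(G * Y, axis=1)[None, :])

# sanity check with phi(x) = ||x||^2, whose divergence is squared Euclidean
phi = lambda Z: np.sum(Z ** 2, axis=1)
grad_phi = lambda Z: 2 * Z

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 3)), rng.normal(size=(5, 3))
D = pairwise_bregman(X, Y, phi, grad_phi)
ref = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # naive broadcast version
assert np.allclose(D, ref)
```

The same expansion is what allows the NBD pairwise matrix to be computed with a single matrix multiply plus per-row terms.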
While Widenorm is composed of Mahalanobis metrics (set to 32 in our experiments) that can be vectorized, in our experiments the memory requirement was still too high, also requiring looping over sub-batches; we use a sub-batch size of 200 in this analysis. The loop can alternatively be performed over the Mahalanobis components in Widenorm, but this would still be slower than the standard Mahalanobis and NBD methods. Finally, the Deep-div method computes pairwise distances efficiently, but its forward pass requires the input to be passed through a set of K affine sub-networks via looping, increasing the computational time.

Table 9: Details of the shortest-path datasets, the number of dimensions in the graph, and how the edge weights in the graph are computed. These tasks were originally proposed by Pitis et al. (2020) and favor asymmetric methods that maintain the triangle inequality.

B FIGURES FOR BREGMAN REGRESSION TASK

The learning curves in the figures show the mean and 95% confidence interval of the loss over each epoch. We used Quadro RTX 6000 GPUs to train our models.

Deep metric learning. This refers to the triplet loss experiment with co-learning of the embedding and the metric. For this we use the following settings: learning rate 1e-4, Adam optimizer, batch size 64, embedding dimension 32, 30 epochs. For all 32x32 datasets, we resized images to 224x224 and used a pretrained ResNet18 as the base embedding model, which is then finetuned during training. We used a smaller learning rate and smaller embedding/batch dimensions in keeping with the standard metric learning protocol of Musgrave et al. (2020), which we found gave more stable results. We note that Cilingir et al. (2020) ran their Deep-div on a similar task and reported some results better and some worse than our results with Deep-div. There are differences in protocol: they extensively tuned the training procedure with hyperparameter optimization, whereas we selected default values to ensure robustness; on the other hand, we used a larger base extractor with pretrained weights whereas they used a custom CNN. As a result, differences in performance are expected. While our reported metric is MAP@10 and theirs was nearest neighbor accuracy, we found both measures to be very close across our experiments.

E SHORTEST PATH DETAILS

Pitis et al. (2020) introduced learning shortest-path lengths in graphs as a task. The problem consists of learning d(x, y), where x, y ∈ V are a pair of nodes in a large weighted graph G = (V, E). For each node, the predictive features consist of the shortest distances from the node to a set of 32 landmark nodes (and vice versa for asymmetric graphs). As this task requires predicting the distance from one node to another, maintaining the triangle inequality has inherent advantages and is the correct inductive bias for the task.
We still find the task useful to elucidate the differences between NBD and the prior Deep-div. We reproduce the experimental setup of Pitis et al. (2020) closely, collecting a random subset of 150K node pairs from each graph as the dataset, with true distances computed using A* search. While the original work normalized distances to mean 50, we found that such large regression targets were difficult to learn; instead we normalize to mean 1, which results in faster convergence for all methods. A 50K/10K train-test split was used. The features are standardized, with added noise sampled from N(0, 0.2), and 96 normally distributed distractor features are included. For additional details refer to Appendix E of Pitis et al. (2020). Their experimental details and code were sufficient to reproduce only three of the graph datasets (3d, taxi, 3dd), so we develop two additional asymmetric graphs (traffic and octagon). The details of all the shortest-path graphs we use are provided in Table 9. Models were trained for 50 epochs at learning rate 5e-5 and 50 epochs at 5e-6.
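A minimal sketch of how such landmark features could be constructed for a directed weighted graph (function names and the adjacency format are our own illustration, not the code of Pitis et al.). One Dijkstra run per landmark on the reversed graph yields all node-to-landmark distances, and one run on the original graph yields the reverse direction needed for asymmetric graphs:

```python
import heapq

def dijkstra(adj, src):
    # adj: {node: [(neighbor, weight), ...]}, directed; shortest distances from src
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def landmark_features(adj, nodes, landmarks):
    # For each node: [dist(node -> lm), dist(lm -> node)] for every landmark.
    rev = {u: [] for u in adj}
    for u in adj:
        for v, w in adj[u]:
            rev.setdefault(v, []).append((u, w))
    feats = {u: [] for u in nodes}
    for lm in landmarks:
        d_to = dijkstra(rev, lm)    # dist(u -> lm) in the original graph
        d_from = dijkstra(adj, lm)  # dist(lm -> u)
        for u in nodes:
            feats[u] += [d_to.get(u, float("inf")), d_from.get(u, float("inf"))]
    return feats
```

On an asymmetric graph the two entries per landmark differ, which is exactly the signal that favors asymmetric distance models on this task.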

E.1 EXTENSIONS OF BREGMAN DIVERGENCE FOR THE TRIANGLE INEQUALITY

For the first, we draw on mathematical literature demonstrating that metrics can be induced from the square root of certain symmetrized Bregman divergences, depending on constraints on φ (Chen et al., 2008). We learn the square root of the Bregman divergence to provide a soft inductive bias (as an illustrative example, Euclidean distance is a metric but squared Euclidean distance is not). For the second, we introduce a modification of the Bregman divergence known as the Generalized Symmetrized Bregman divergence. As shown by Acharyya et al. (2013), the square root of this divergence is guaranteed to satisfy the triangle inequality. It is defined as $D^{\mathrm{gsb}}_{\varphi}(x, y) = D_{\varphi}(x, y) + D_{\varphi}(y, x) + \frac{1}{2}\|x - y\|_2^2 + \frac{1}{2}\|\nabla\varphi(x) - \nabla\varphi(y)\|_2^2$. There is an inherent tradeoff between the two extensions: Bregman-sqrt can be asymmetric but is still not guaranteed to satisfy the triangle inequality, while Bregman-GS always satisfies the triangle inequality but is symmetric.
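For concreteness, the generalized symmetrized divergence above can be written out directly. This is a sketch with φ and ∇φ supplied as plain functions (not the learned FICNN); per Acharyya et al. (2013), the square root of the returned value satisfies the triangle inequality for convex φ:

```python
import numpy as np

def bregman(x, y, phi, grad_phi):
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

def gsb(x, y, phi, grad_phi):
    # Generalized Symmetrized Bregman divergence: symmetrized Bregman terms
    # plus squared gaps in the inputs and in the gradients
    return (bregman(x, y, phi, grad_phi) + bregman(y, x, phi, grad_phi)
            + 0.5 * ((x - y) ** 2).sum()
            + 0.5 * ((grad_phi(x) - grad_phi(y)) ** 2).sum())
```

As a sanity check, with φ(x) = ‖x‖² each Bregman term is ‖x − y‖² and the gradient gap is 4‖x − y‖², so the whole expression reduces to 4.5‖x − y‖², whose square root is a scaled Euclidean metric.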

F ADDITIONAL HYPERPARAMETER DETAILS

We followed the hyperparameter specifications for the Deepnorm and Widenorm results as stated in Pitis et al. (2020). The Widenorm used 32 components of size 32, concave activation size 5, and max-average reduction. For the Deepnorm we used the neural metric version that gave the strongest performance in their paper: 3 layers with dimension 128, MaxReLU pairwise activations, concave activation size 5, and max-average reduction. We adapt their PyTorch code from https://github.com/spitis/deepnorms. In the graph distance task, our results show the same learning pattern (and relative performance between models), but the overall error magnitudes we obtain differ noticeably from their reported results. We re-implemented the Deep-div method following the description in Cilingir et al. (2020). The number of affine sub-networks stated in their paper varied but was generally set to low values such as 10, to match the number of classes in their classification tasks. In their appendix they experiment with increasing the number of sub-networks and find the best results at 50; for this reason we use 50 in our experiments. Following their paper, we use small FNNs for the max-affine components. Our NBD uses a 2-hidden-layer FICNN of width 128 for φ. We found that our results are robust to the depth and width of the FICNN.

F.1 ACTIVATION FOR FICNN

While developing theory using the softplus directly is harder because of the non-linearity, we believe that the approximation error using softplus activations is upper bounded by the ReLU approximation error. This is because softplus can be made arbitrarily close to ReLU by taking α large in $g_\alpha(x) = \log(1 + \exp(\alpha x))/\alpha$. As the value of α could be learned in the affine components of the neural network, an ICNN with ReLU activations can be closely expressed as an ICNN with softplus activations.
So we can simply substitute $g_\alpha(x)$ for ReLU(x) in the original proofs, and therefore our optimal error is less than or equal to the ReLU error. (In particular, when the true function has curvature, softplus will be strictly better than ReLU.)
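The uniform gap between g_α and ReLU can be checked numerically: it is at most log(2)/α, attained at x = 0. This sketch uses `logaddexp` for numerical stability, an implementation detail not stated in the text:

```python
import numpy as np

def softplus_alpha(x, alpha):
    # g_alpha(x) = log(1 + exp(alpha * x)) / alpha; logaddexp avoids overflow
    return np.logaddexp(0.0, alpha * x) / alpha

def relu(x):
    return np.maximum(x, 0.0)
```

Increasing α shrinks the gap linearly, which is the sense in which an ICNN with softplus can emulate one with ReLU.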

G OVERLAP DETAILS

We compute overlap as the percentage of non-intersecting area between the crops. We show this in Figure 6, where NBD performs best in the symmetric (left) and asymmetric (center, right) Bregman learning tasks. We see Mahalanobis performs well in the symmetric task (left), since it is correctly specified for the ground truth, but performs relatively poorly in the asymmetric case (right). Deep-div is unable to learn effectively in either (not shown on the left because the error was too high).
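As an illustration of why this target is a natural divergence, here is one plausible reading of the measure (our reconstruction; the crop format and helper name are assumptions). The per-crop non-intersecting fraction is asymmetric in general, since it is normalized by the first crop's area:

```python
def crop_divergence(a, b):
    # a, b: axis-aligned crops as (x0, y0, x1, y1);
    # returns the fraction of a's area NOT covered by b.
    # Asymmetric in general: crop_divergence(a, b) != crop_divergence(b, a).
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))  # intersection width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))  # intersection height
    area_a = (ax1 - ax0) * (ay1 - ay0)
    return 1.0 - (iw * ih) / area_a
```

For example, a small crop entirely inside a large one has divergence 0 in one direction but a large divergence in the other, which symmetric metrics cannot represent.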



CONCLUSION

To enable future asymmetric modeling research, we have developed the Neural Bregman Divergence (NBD). NBD jointly learns a Bregman measure and a feature-extracting neural network. We show that NBD learns divergences directly or indirectly when trained jointly with a network, and that NBD still learns effectively when the underlying measure is not a divergence, allowing effective use of our tool across a wide spectrum of tasks while retaining the nice properties of Bregman divergences.



Consider a Bregman regression problem learning D(x, y) from known targets d, with loss function $L(x, y, d; \varphi) = (D_\varphi(x, y) - d)^2$. By the chain rule, the gradient with respect to each parameter b is $\partial L / \partial b = 2\,(D_\varphi(x, y) - d)\,\partial D_\varphi(x, y) / \partial b$.
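As a sanity check of this chain rule, consider the illustrative one-parameter family φ_b(x) = b‖x‖² (our example, not the NBD parameterization), for which D_b(x, y) = b‖x − y‖² and ∂D_b/∂b = ‖x − y‖². The analytic gradient then matches a finite-difference estimate:

```python
import numpy as np

def D_b(b, x, y):
    # Bregman divergence of phi_b(x) = b * ||x||^2 is b * ||x - y||^2
    return b * ((x - y) ** 2).sum()

def loss(b, x, y, d):
    return (D_b(b, x, y) - d) ** 2

def grad_loss_b(b, x, y, d):
    # chain rule: dL/db = 2 (D - d) * dD/db, with dD/db = ||x - y||^2
    return 2.0 * (D_b(b, x, y) - d) * ((x - y) ** 2).sum()
```

In NBD itself, b ranges over all FICNN parameters and ∂D/∂b is computed by backpropagation, which is where the double-backpropagation cost noted in Appendix A arises.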

Figure 2: MSE (y-axis) after epochs of training (x-axis) on asymmetric BregMNIST (left) and BregCIFAR (right) with true φ(x) = x log x. NBD performs best in both tasks.

Figure 1: Demonstration of the BregMNIST task. Nodes with the same color indicate weight sharing. Each image is embedded by a CNN, and the ground-truth divergence is computed from the digit values of the input images. The embeddings of each image are given to NBD, and the loss is computed from NBD's output and the true divergence. The CNN and NBD are learned jointly.

Figure 3: MSE (y-axis) for predicting the overlap D(X, Y ) between two image embeddings learned jointly with the underlying CNN.

Figure 4: Results when vectors with 10 real features and 10 distractor features are used to compute different specific Bregman divergences. Mean absolute error is on the y-axis and number of training epochs on the x-axis. The shaded region around each plot shows the standard deviation of the results. Note in all cases our NBD has very low variance while effectively learning the target divergence.



Figure 6: MSE (y-axis) after epochs of training (x-axis), where NBD performs best in the symmetric (left) and asymmetric (center, right) Bregman learning tasks. We see Mahalanobis performs well in the symmetric task (left) since it is correctly specified for the ground truth, but performs relatively poorly in the asymmetric case (right). Deep-div is unable to learn effectively in either (not shown on left because the error was too high).


BREGMAN RANKING AND CLUSTERING

Our first task expands the distributional clustering experiments in Banerjee et al. (2005); Cilingir et al. (2020).

Regression test MAE when unused distractor features are correlated (None/Med/High) with the true/used features. Best results in bold, second best in italics. NBD performs best on asymmetric regression, and second-best to Mahalanobis on symmetric regression, where a Mahalanobis distance is expected to fit perfectly.

Timing information for a divergence learning task as in §4.2, with embedding dimension 20 and a batch size of 1000, comparing the methods used in our experiments. We compute the per-epoch training time (forward and backward passes), inference time (forward pass), and pairwise distance matrix computation. Results are averaged over 30 runs, with standard deviation in parentheses.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their comments, questions, and suggestions. This material is based in part upon work supported by the National Science Foundation under Grant No. IIS-2024878, with some computation provided by the UMBC HPCF, supported by the National Science Foundation under Grant No. CNS-1920079. This material is also based on research that is in part supported by the Army Research Laboratory, Grant No. W911NF2120076, and by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.

C DATA GENERATION DETAILS

Distributional clustering. We sample 1000 points uniformly into 5 clusters, each with 10 feature dimensions. To generate non-isotropic Gaussians, we sample means uniformly in the hyper-box [-4, 4] for each coordinate. The covariances are random PSD matrices (scikit-learn make_spd_matrix) plus σ²I with σ² = 5. We chose these values so that each mixture task is similarly difficult (clusters not perfectly separable, but also not too challenging). For the multinomial task, we sampled 100 counts into the 10 feature dimensions; each cluster's underlying probability distribution was sampled from Dirichlet([10, 10, . . . , 10]). Finally, the exponential case uses i.i.d. samples for each feature, with the underlying cluster rates sampled uniformly from [0.1, 10].

Regression noise features. To add correlation among features (both informative and distractors), we generate covariance matrices with controlled condition number κ while keeping the marginal distribution of each feature as x_i ∼ N(0, 1). For the medium correlation task κ < 100, while for high correlation κ is between 250 and 500. 50000 pairs were generated with 20 features.
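A sketch of the non-isotropic Gaussian generation described above (A Aᵀ + σ²I stands in for scikit-learn's make_spd_matrix, and the function name is ours):

```python
import numpy as np

def sample_gaussian_clusters(n_points=1000, n_clusters=5, dim=10, sigma2=5.0, seed=0):
    # Means uniform in [-4, 4]^dim; covariance = random PSD matrix + sigma2 * I
    rng = np.random.default_rng(seed)
    per = n_points // n_clusters
    X, y = [], []
    for c in range(n_clusters):
        mean = rng.uniform(-4.0, 4.0, size=dim)
        A = rng.standard_normal((dim, dim))
        cov = A @ A.T + sigma2 * np.eye(dim)   # PSD by construction
        X.append(rng.multivariate_normal(mean, cov, size=per))
        y.append(np.full(per, c))
    return np.concatenate(X), np.concatenate(y)
```

The σ²I term keeps every cluster's covariance well-conditioned, so no cluster collapses onto a low-dimensional subspace.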

D EXPERIMENTAL PROCEDURE

Overview. Our experiments fall into two categories: regression (e.g., sections 4.2, 4.3, 5.2, 5.3, 5.4) and clustering/ranking (e.g., sections 4.1, 5.1). For the first category, a scalar divergence target is assumed. Squared error loss is used, and an input pair of samples is drawn from an underlying dataset or randomly generated; the true divergence target is then computed from the input pair. If the dataset has N samples, there are O(N²) possible pairs. Pairs are drawn simultaneously as batches with a specified batch size, and an epoch (unless otherwise stated) is defined as a number of pairs equal to the size of the original dataset (so an epoch is the usual length, but not all data has necessarily been seen). For the second category, a divergence target does not exist, only the relative ranking that the anchor-positive distance be smaller than the anchor-negative distance. The triplet loss is used here. In all such metric learning tasks, we fit models using triplet mining, with margin 0.2 and the Adam optimizer. All triplets with non-zero loss are mined from a batch, so for a batch size of B there are up to O(B³) triplets; for more detail on triplet loss and mining see Musgrave et al. (2020). We iterate over the dataset in batches. The exception is PBDL, which uses the authors' original code with pre-generated tuples.

Distributional clustering. We used batch size 128, 200 epochs, and learning rate 1e-3 for all models. Here, and in all subsequent experiments, to train PBDL we used the authors' provided Python code, which uses the alternating direction method of multipliers (ADMM) technique. (They also provide Matlab code using Gurobi.)

Bregman ranking. Since Deep-div and NBD are deep learning approaches, we use Adam to optimize this problem instead of convex optimization solvers. To ensure convergence, we tune the learning rate and number of epochs via grid search over a validation set held out from the training data; we do the same for the Mahalanobis approach.
A typical example of the parameters is batch size 256, 250 epochs, learning rate 1e-3.

Regression. We used 100 epochs of training with learning rate 1e-3 and batch size 1000.

Deep regression experiments. For the remaining experiments, which involve co-learning an embedding, we use default hyperparameter settings to keep methods comparable: Adam optimizer, learning rate 1e-3, batch size 128, embedding dimension 128, and 200 epochs. By deep regression, we refer to tasks with a continuous target, such as BregMNIST, overlap distance, and shortest path. For the MNIST/CIFAR tasks, the embedding network consists of two/four convolutional layers respectively, followed by two fully-connected layers (more specific details follow). For the semantic distance CIFAR task, we used a pretrained ResNet20 (He et al., 2016) as the embedding, without freezing any layers, for faster learning; results were robust to the embedding model chosen. We replicated each training run and report means and standard deviations: for the Bregman benchmark tasks we trained 20 times, while for the deep learning/graph learning tasks we trained 5 times.
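The batch-all mining rule described above (every valid triplet with non-zero loss contributes) can be sketched as follows; the triple loop is for clarity, and in practice the same computation is vectorized over the pairwise distance matrix:

```python
import numpy as np

def batch_all_triplet_loss(dist, labels, margin=0.2):
    # dist: (B, B) pairwise distance matrix for a batch; labels: (B,) class ids.
    # Mine all (anchor, positive, negative) triplets whose hinge loss
    # max(0, d(a, p) - d(a, n) + margin) is non-zero, and average them.
    B = dist.shape[0]
    same = labels[:, None] == labels[None, :]
    losses = []
    for a in range(B):
        for p in range(B):
            if a == p or not same[a, p]:
                continue  # positive must share the anchor's label
            for n in range(B):
                if same[a, n]:
                    continue  # negative must have a different label
                l = dist[a, p] - dist[a, n] + margin
                if l > 0:
                    losses.append(l)
    return float(np.mean(losses)) if losses else 0.0
```

Because `dist` is just a matrix, the same mining code works whether the distances come from a Euclidean embedding, a learned Bregman divergence, or any of the baselines.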

