NEURAL BREGMAN DIVERGENCES FOR DISTANCE LEARNING

Abstract

Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks: the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. Non-Euclidean geometries are rarely explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Recent work has shown that Bregman divergences can be learned from data, opening a promising approach to learning asymmetric distances. We propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks and show that it overcomes significant limitations of previous works. We also demonstrate that our method more faithfully learns divergences over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering. Our tests further extend to known asymmetric, but non-Bregman, tasks, where our method still performs competitively despite misspecification, showing the general utility of our approach for asymmetric learning.

1. INTRODUCTION

Learning a task-relevant metric among samples is a common application of machine learning, with uses in retrieval, clustering, and ranking. A classic example of retrieval is visual recognition, where, given an object image, the system tries to identify the class based on an existing labeled dataset. To do this, the model can learn a measure of similarity between pairs of images, assigning small distances to images of the same object type. Given the broad successes of deep learning, there has been a recent surge of interest in deep metric learning, i.e., using neural networks to automatically learn these similarities (Hoffer & Ailon, 2015; Huang et al., 2016; Zhang et al., 2020). The traditional approach to deep metric learning learns an embedding function over the input space so that a simple distance measure between pairs of embeddings corresponds to task-relevant spatial relations between the inputs. The embedding function $f$ is computed by a neural network, which is trained to encode those spatial relations. For example, we can use the basic Euclidean distance metric to measure the distance between two samples $x$ and $y$ as $\|f(x) - f(y)\|_2$. This distance is critical in two ways. First, it is used to define loss functions, such as the triplet or contrastive loss, which dictate how the distance should capture task-relevant properties of the input space. Second, since $f$ is trained to optimize the loss function, the distance influences the learned embedding $f$.

This approach has limitations. When the underlying reference distance is asymmetric or does not obey the triangle inequality, a standard metric cannot accurately capture the data. An important example is clustering over probability distributions, where the standard k-means approach with Euclidean distance is sub-optimal, leading to alternatives such as the KL divergence (Banerjee et al., 2005). Other cases include textual entailment and learning graph distances that disobey the triangle inequality. Recent work has shown interest in learning an appropriate distance from the data instead of predetermining the final metric between embeddings (Cilingir et al., 2020; Pitis et al., 2020). A natural class of distances that includes common measures such as the squared Euclidean distance is the Bregman divergences (Bregman, 1967). They are parametrized by a strictly convex function $\phi$ and measure the distance between two points $x$ and $y$ as the error of the first-order Taylor approximation of $\phi$, taken at $y$ and evaluated at $x$.

Method     $\phi$ representation                              Learning Approach    Complexity          Joint Learning
NBD        $\phi(x) = \mathrm{ICNN}(x)$                       Gradient Descent     $O(|\theta|)$       Yes
PBDL       $\phi(x) = \max_i(\langle b_i, x\rangle + z_i)$    Linear Programming   $O(n^3)$            No
Deep-div   $\phi(x) = \max_i(\langle b_i, x\rangle + z_i)$    Gradient Descent     $O(|\theta| + K)$   Yes

Table 1: Comparison of our Bregman learning approach NBD with prior methods. NBD simultaneously has better representational power and computational efficiency.

The current best approach in Bregman learning approximates $\phi$ using the maximum of affine hyperplanes (Siahkamari et al., 2020; Cilingir et al., 2020). In this work we address significant limitations of previous approaches and present our solution, Neural Bregman Divergences (NBD). NBD is the first non-max-affine approach to learning a deep Bregman divergence, avoiding key limitations of prior works. We instead directly model the generating function $\phi(x)$, and then use $\phi(x)$ to implement the full divergence $D_\phi$.
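For intuition, a minimal sketch of how $\phi$ can be modeled by an input convex neural network is given below; the layer sizes, softplus activations, and weight-clamping scheme are illustrative assumptions, not the exact architecture described in §2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Minimal input convex network: the output is convex in x because
    hidden-to-hidden weights are kept non-negative and the activation
    (softplus) is convex and non-decreasing. Illustrative sketch only."""
    def __init__(self, dim, hidden=64, layers=3):
        super().__init__()
        # Unconstrained "passthrough" connections from the input x.
        self.Wx = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(layers)])
        # Hidden-to-hidden connections, clamped non-negative in forward().
        self.Wz = nn.ModuleList([nn.Linear(hidden, hidden, bias=False)
                                 for _ in range(layers - 1)])
        self.out = nn.Linear(hidden, 1, bias=False)

    def forward(self, x):
        z = F.softplus(self.Wx[0](x))
        for wx, wz in zip(self.Wx[1:], self.Wz):
            # Non-negative mixing of convex functions preserves convexity.
            z = F.softplus(wx(x) + F.linear(z, wz.weight.clamp(min=0)))
        return F.linear(z, self.out.weight.clamp(min=0)).squeeze(-1)
```

This construction guarantees convexity in $x$; strict convexity, as required of a Bregman generating function (Definition 2.1 below), can be encouraged by, e.g., adding a small quadratic term $\epsilon \|x\|_2^2$ to the output.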
To demonstrate efficacy, we leverage prior benchmarks and propose several new benchmarks of asymmetry, organized by the three types of information they provide: 1) quality of learning a Bregman divergence directly, 2) ability to learn a Bregman divergence and a feature extractor jointly, and 3) effectiveness in asymmetric tasks where the ground truth is known to be non-Bregman. This set of tests shows that NBD represents actual Bregman divergences far more faithfully than prior works, while simultaneously performing better in non-Bregman learning tasks. The rest of our paper is organized as follows. In §2 we show how to implement NBD using an Input Convex Neural Network and compare it to related work, with further related work in §3. In §4 we present experiments where the underlying goal is to learn a known Bregman measure, and then to jointly learn a Bregman measure with an embedding of the data from which the measure is computed. Then §5 studies the performance of our method on asymmetric tasks where the underlying metric is not Bregman, showing more general utility where prior Bregman methods fail. Finally, we conclude in §6.

2. NEURAL BREGMAN DIVERGENCE LEARNING

A Bregman divergence computes the divergence between two points $x$ and $y$ from a space $\mathcal{X}$ by taking the first-order Taylor approximation of a generating function $\phi$. This generating function is defined over $\mathcal{X}$ and can be thought of as (re-)encoding points from $\mathcal{X}$. A proper and informative $\phi$ is incredibly important: different $\phi$ can capture different properties of the spaces over which they are defined. Our aim in this paper is to learn Bregman divergences by providing a neural method for learning informative functions $\phi$.

Definition 2.1. Let $x, y \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{R}^d$. Given a continuously differentiable, strictly convex $\phi : \mathcal{X} \to \mathbb{R}$, the Bregman divergence parametrized by $\phi$ is

$$D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle, \tag{1}$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\nabla\phi(y)$ is the gradient of $\phi$ evaluated at $y$.

A properly defined $\phi$ can capture critical, inherent properties of the underlying space. By learning $\phi$ via Eq. (1), we aim to automatically learn these properties. For example, Bregman divergences can capture asymmetric relations: if $\mathcal{X}$ is the $D$-dimensional simplex representing $D$-dimensional discrete probability distributions, then $\phi(x) = \langle x, \log x \rangle$ yields the KL divergence, $D_\phi(x, y) = \sum_d x_d \log \frac{x_d}{y_d}$. On the other hand, if $\mathcal{X} = \mathbb{R}^d$ and $\phi$ is the squared $L_2$ norm ($\phi(y) = \|y\|_2^2$), then $D_\phi(x, y) = \|x - y\|_2^2$. Focusing on the hypothesis space of Bregman divergences is valuable because many core machine learning measures, including the squared Euclidean, Kullback-Leibler, and Itakura-Saito divergences, are special cases of Bregman divergences. While special cases of the Bregman divergence are used today, and many general results have been proven over the space of Bregman measures, less progress has been made in learning Bregman divergences.
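Definition 2.1 translates directly into a differentiable computation: given any once-differentiable $\phi$, the gradient term $\nabla\phi(y)$ can be obtained by automatic differentiation rather than derived by hand. The sketch below is our own illustration of Eq. (1), not the authors' code; the sanity check uses the squared-$L_2$ special case discussed above.

```python
import torch

def bregman_divergence(phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>   (Eq. 1),
    with the gradient term computed by automatic differentiation."""
    y = y.detach().requires_grad_(True)
    phi_y = phi(y)
    grad_y = torch.autograd.grad(phi_y.sum(), y, create_graph=True)[0]
    return phi(x) - phi_y - ((x - y) * grad_y).sum(-1)

# Sanity check: phi(y) = ||y||_2^2 recovers the squared Euclidean distance.
phi = lambda v: (v ** 2).sum(-1)
x, y = torch.randn(5, 3), torch.randn(5, 3)
assert torch.allclose(bregman_divergence(phi, x, y),
                      ((x - y) ** 2).sum(-1), atol=1e-5)
```

Here `create_graph=True` keeps the gradient computation differentiable, so a learnable $\phi$ (e.g., an ICNN) can be trained end-to-end through the divergence.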

2.1. EXISTING BREGMAN LEARNING APPROACHES

PBDL. Recent works have proposed ways to empirically learn the Bregman divergence that best represents a dataset by focusing on a max-affine representation of $\phi$ (Siahkamari et al., 2020; Cilingir et al., 2020).
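For reference, the max-affine form $\phi(x) = \max_i(\langle b_i, x\rangle + z_i)$ from Table 1 can be sketched as follows; this shows only the functional form with an illustrative initialization, since PBDL fits the hyperplanes by solving a linear program rather than by gradient descent.

```python
import torch
import torch.nn as nn

class MaxAffinePhi(nn.Module):
    """Max-affine generating function phi(x) = max_i(<b_i, x> + z_i);
    K, the number of hyperplanes, is a capacity hyperparameter."""
    def __init__(self, dim, K=32):
        super().__init__()
        self.b = nn.Parameter(0.1 * torch.randn(K, dim))  # hyperplane slopes
        self.z = nn.Parameter(torch.zeros(K))             # intercepts

    def forward(self, x):
        # Pointwise max of affine maps: convex, but only piecewise linear.
        return (x @ self.b.T + self.z).max(dim=-1).values
```

Because this $\phi$ is piecewise linear, its gradient is piecewise constant, which limits the divergences the representation can express smoothly; this is one motivation for NBD's smooth ICNN parameterization.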

