NEURAL BREGMAN DIVERGENCES FOR DISTANCE LEARNING

Abstract

Many metric learning tasks, such as triplet learning, nearest neighbor retrieval, and visualization, are treated primarily as embedding tasks where the ultimate metric is some variant of the Euclidean distance (e.g., cosine or Mahalanobis), and the algorithm must learn to embed points into the pre-chosen space. Non-Euclidean geometries are rarely explored, which we believe is due to a lack of tools for learning non-Euclidean measures of distance. Recent work has shown that Bregman divergences can be learned from data, opening a promising approach to learning asymmetric distances. We propose a new approach to learning arbitrary Bregman divergences in a differentiable manner via input convex neural networks and show that it overcomes significant limitations of previous works. We also demonstrate that our method more faithfully learns divergences over a set of both new and previously studied tasks, including asymmetric regression, ranking, and clustering. Our tests further extend to known asymmetric, but non-Bregman, tasks, where our method still performs competitively despite misspecification, showing the general utility of our approach for asymmetric learning.

1. INTRODUCTION

Learning a task-relevant metric among samples is a common application of machine learning, with applications in retrieval, clustering, and ranking. A classic example of retrieval is in visual recognition where, given an object image, the system tries to identify the class based on an existing labeled dataset. To do this, the model can learn a measure of similarity between pairs of images, assigning small distances to images of the same object type. Given the broad successes of deep learning, there has been a recent surge of interest in deep metric learning: using neural networks to automatically learn these similarities (Hoffer & Ailon, 2015; Huang et al., 2016; Zhang et al., 2020). The traditional approach to deep metric learning learns an embedding function over the input space so that a simple distance measure between pairs of embeddings corresponds to task-relevant spatial relations between the inputs. The embedding function f is computed by a neural network trained to encode those spatial relations. For example, we can use the basic Euclidean distance to measure the distance between two samples x and y as ‖f(x) − f(y)‖₂. This distance is critical in two ways. First, it is used to define the loss functions, such as the triplet or contrastive loss, which dictate how this distance should capture task-relevant properties of the input space. Second, since f is trained to optimize the loss function, the distance influences the learned embedding f.

This approach has limitations. When the underlying reference distance is asymmetric or does not satisfy the triangle inequality, a standard metric cannot accurately capture the data. An important example is clustering over probability distributions, where the standard k-means approach with the Euclidean distance is sub-optimal, leading to alternatives such as the KL-divergence being used (Banerjee et al., 2005). Other cases include textual entailment and learning graph distances, both of which disobey the triangle inequality.

Recent work has shown interest in learning an appropriate distance from the data instead of predetermining the final metric between embeddings (Cilingir et al., 2020; Pitis et al., 2020). A natural class of distances that includes common measures such as the squared Euclidean distance is the Bregman divergences (Bregman, 1967). They are parametrized by a strictly convex function φ and measure the distance between two points x and y as the error of the first-order Taylor approximation of φ at y evaluated at x: D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩.
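To make this definition concrete, the short sketch below (ours, not the paper's implementation) computes D_φ via automatic differentiation for any differentiable, strictly convex φ; the function name bregman_divergence and the two example generators are illustrative assumptions. Choosing φ(x) = ‖x‖² recovers the squared Euclidean distance, while the negative entropy recovers the KL-divergence, which is precisely the flexibility that parametrizing φ with an input convex neural network aims to exploit.

    # A minimal sketch of a Bregman divergence
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>, computed with
    # autograd so that any differentiable, strictly convex phi can be
    # plugged in, including a learned input convex neural network.
    import torch

    def bregman_divergence(phi, x, y):
        """Error of the first-order Taylor expansion of phi at y, evaluated at x."""
        # For simplicity we detach y; a full training loop would also
        # propagate gradients into the embedding that produced y.
        y = y.detach().requires_grad_(True)
        grad_phi_y = torch.autograd.grad(phi(y).sum(), y, create_graph=True)[0]
        return phi(x) - phi(y) - ((x - y) * grad_phi_y).sum(dim=-1)

    # phi(v) = ||v||^2 recovers the squared Euclidean distance ...
    phi_sq = lambda v: (v ** 2).sum(dim=-1)
    # ... and the negative entropy recovers the KL-divergence on the simplex.
    phi_negent = lambda v: (v * v.log()).sum(dim=-1)

    x = torch.tensor([[0.2, 0.3, 0.5]])
    y = torch.tensor([[0.1, 0.6, 0.3]])
    print(bregman_divergence(phi_sq, x, y))      # matches ||x - y||^2 = 0.14
    print(bregman_divergence(phi_negent, x, y))  # matches KL(x || y)

In the learned setting, φ would be a trained input convex neural network, and create_graph=True allows gradients to flow through ∇φ(y) when optimizing φ itself.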

