Approximate Birkhoff-von-Neumann decomposition: a differentiable approach

Abstract

The Birkhoff-von-Neumann (BvN) decomposition is a standard tool for drawing permutation matrices from a doubly stochastic (DS) matrix: it represents the DS matrix as a convex combination of several permutation matrices. Most existing algorithms for computing a BvN decomposition employ greedy strategies or custom-made heuristics. In this paper, we present a novel differentiable cost function to approximate the BvN decomposition. Our algorithm builds upon recent advances in Riemannian optimization on Birkhoff polytopes. We offer an empirical evaluation of this approach on fairness of exposure in rankings, where we show that the outcome of our method behaves similarly to that of greedy algorithms. Our approach complements existing methods for sampling from DS matrices, such as sampling from a Gumbel-Sinkhorn distribution, and is better suited to applications where prediction latency is a constraint: an approximate BvN decomposition can generally be precomputed offline, after which a permutation matrix is selected at random with probability proportional to its coefficient. Finally, we provide an implementation of our method.

1. Introduction & Related work

Sampling from a doubly stochastic (DS) matrix is a significant problem that has recently caught the attention of the machine learning community, with applications such as exposure fairness in ranking algorithms (Kahng et al., 2018; Singh & Joachims, 2018), strategies to reduce bribery (Keller et al., 2018; 2019), and learning latent representations (Mena et al., 2018; Grover et al., 2018; Linderman et al., 2018). We consider the Birkhoff-von-Neumann decomposition (BvND) (Birkhoff, 1946), which is deterministic and represents a DS matrix as a convex combination of permutation matrices (or permutation sub-matrices). In general, the BvND of a particular DS matrix is not unique. Sampling from a BvND boils down to selecting a sub-permutation matrix with probability proportional to its coefficient. Current BvND algorithms rely on greedy heuristics (Dufossé & Uçar, 2016), mixed-integer linear programming (Dufossé et al., 2018), or quantization (Liu et al., 2018); hence, these methods are not differentiable. Differentiability matters because reparametrization techniques enable gradient-based algorithms (Grover et al., 2018; Linderman et al., 2018). Recently, Mena et al. (2018) introduced a reparametrization trick to draw samples from a Gumbel-Sinkhorn distribution. However, such methods can underperform in applications where prediction latency is constrained, as they require solving a perturbed Sinkhorn matrix-scaling problem at prediction time. In this work, we propose an alternative to Gumbel-matching-related approaches that is well-suited for applications where permutations need not be sampled online; our method is fast at prediction time because all components are saved in memory. We call our algorithm the differentiable approximate Birkhoff-von-Neumann decomposition: a continuous relaxation of the BvND that relies on the recently proposed Riemannian gradient descent on Birkhoff polytopes. Its main parameter is the number of components of the decomposition.
We enforce an approximate orthogonality constraint on each component of the BvND. To our knowledge, this is the first gradient-based approximation of the BvND.
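To make concrete the object our method relaxes, the following sketch computes a greedy BvN decomposition and then samples from it. This is illustrative code under our own assumptions, not the implementation accompanying the paper: `greedy_bvn` and `sample_permutation` are hypothetical names, and the greedy rule shown (a maximum-weight matching per step) is one of several heuristics from the literature.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def greedy_bvn(X, tol=1e-8):
    """Greedily decompose a doubly stochastic matrix X into
    permutation matrices: X ~ sum_i thetas[i] * perms[i]."""
    R = X.copy()
    n = R.shape[0]
    thetas, perms = [], []
    for _ in range((n - 1) ** 2 + 1):  # Marcus-Ree bound on the number of components
        # A maximum-weight matching on the remaining mass selects a
        # permutation; its smallest selected entry is the coefficient.
        rows, cols = linear_sum_assignment(-R)
        theta = R[rows, cols].min()
        if theta < tol:
            break
        P = np.zeros_like(R)
        P[rows, cols] = 1.0
        thetas.append(theta)
        perms.append(P)
        R = R - theta * P  # subtract the component; R stays (scaled) DS
    return thetas, perms

def sample_permutation(thetas, perms, rng=None):
    """Draw one permutation with probability proportional to its coefficient."""
    rng = rng or np.random.default_rng()
    p = np.asarray(thetas) / np.sum(thetas)
    return perms[rng.choice(len(perms), p=p)]
```

Once the decomposition is precomputed, `sample_permutation` is a constant-time lookup, which is the offline/online split the abstract alludes to.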

2. Preliminaries

We first present some background on the BvND and comment on recent advances in Riemannian optimization on Birkhoff polytopes.

Notation. We write column vectors in bold lower case, e.g., x, and x_i denotes the i-th component of x. We write matrices in bold capital letters, e.g., X, and X_ij denotes the element in the i-th row and j-th column of X. [n] = {1, . . . , n}. Calligraphic letters, e.g., P, denote sets. ‖·‖_F denotes the Frobenius norm of a matrix. I_p is the p × p identity matrix, and 1_n is the all-ones vector of size n. We use a superscript, X^l, to indicate an element of a set; thus, {X^i}_{i=1}^k is short for {X^1, . . . , X^k}. However, X^(t) denotes the matrix X at iteration t. ∆_n denotes the (n − 1)-dimensional probability simplex, and ⊙ is the Hadamard product.

2.1. Technical background.

Definition 1 (Doubly stochastic (DS) matrix). A DS matrix is a non-negative, square matrix whose rows and columns each sum to 1. The set of DS matrices is defined as

    DP_n := { X ∈ R_+^{n×n} : X 1_n = 1_n, 1_n^T X = 1_n^T }.    (1)

Definition 2 (Birkhoff polytope). The multinomial manifold of DS matrices is equivalent to the convex object called the Birkhoff polytope (Birkhoff, 1946), an (n − 1)^2-dimensional convex submanifold of the ambient R^{n×n} with n! vertices. We use DP_n to refer to the Birkhoff polytope.

Theorem 1 (Birkhoff-von Neumann theorem). The convex hull of the set of all permutation matrices is the set of doubly stochastic matrices, and there exists a (potentially non-unique) θ such that any DS matrix can be expressed as a convex combination of k permutation matrices (Birkhoff, 1946; Hurlbert, 2008):

    X = θ_1 P^1 + . . . + θ_k P^k,   θ_i > 0,   θ^T 1_k = 1.    (2)

While finding the minimum k is NP-hard (Dufossé et al., 2018), by the Marcus-Ree theorem we know that there exists a constructible decomposition with k ≤ (n − 1)^2 + 1.

Definition 3 (Permutation matrix). A permutation matrix is a sparse, square, binary matrix in which each column and each row contains exactly one true (1) entry:

    P_n := { P ∈ {0, 1}^{n×n} : P 1_n = 1_n, 1_n^T P = 1_n^T }.    (3)

In particular, the set of permutation matrices P_n is the intersection of the set of DS matrices and the orthogonal group O_n := { X ∈ R^{n×n} : X^T X = I }, i.e., P_n = DP_n ∩ O_n (Goemans, 2015).

Riemannian gradient descent on DP_n. Base Riemannian gradient methods use the update rule X^(t+1) = R_{X^(t)}(−γ H_{X^(t)}) (Absil et al., 2009), where H_{X^(t)} is the Riemannian gradient of the loss L : DP_n → R at X^(t), γ is the learning rate, and R_X is the retraction mapping a tangent-space step back onto DP_n.
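To convey the flavor of descent over DP_n, the toy sketch below takes a multiplicative gradient step (which keeps entries positive) and rebalances with Sinkhorn iterations. This is a crude stand-in for the proper Riemannian retraction referenced above, not the method of the paper; `sinkhorn`, `descent_step`, and the toy objective are our own illustrative choices.

```python
import numpy as np

def sinkhorn(M, n_iter=200):
    # Alternately normalize rows and columns; for a positive matrix this
    # converges toward a doubly stochastic matrix (Sinkhorn balancing).
    X = M.copy()
    for _ in range(n_iter):
        X = X / X.sum(axis=1, keepdims=True)
        X = X / X.sum(axis=0, keepdims=True)
    return X

def descent_step(X, grad, lr=1.0, eps=1e-12):
    # Multiplicative (exponentiated-gradient) step preserves positivity;
    # Sinkhorn balancing then plays the role of the retraction R in
    # X^(t+1) = R_{X^(t)}(-gamma * H_{X^(t)}).
    Y = np.maximum(X * np.exp(-lr * grad), eps)
    return sinkhorn(Y)

# Toy objective: pull X toward a fixed DS target T, L(X) = 0.5 * ||X - T||_F^2,
# whose Euclidean gradient is simply X - T.
rng = np.random.default_rng(0)
T = sinkhorn(rng.uniform(0.1, 1.0, size=(4, 4)))
X = np.full((4, 4), 0.25)  # start at the barycenter of DP_4
for _ in range(300):
    X = descent_step(X, X - T, lr=2.0)
```

The iterate remains (numerically) doubly stochastic throughout, which is the property the Riemannian machinery guarantees exactly rather than approximately.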



Figure 1: Illustration of the Birkhoff-von-Neumann decomposition (BvND): One can represent a doubly stochastic matrix as a convex combination of permutation matrices.

