Approximate Birkhoff-von-Neumann decomposition: a differentiable approach

Abstract

The Birkhoff-von-Neumann (BvN) decomposition is a standard tool for drawing permutation matrices from a doubly stochastic (DS) matrix: it represents such a DS matrix as a convex combination of several permutation matrices. Currently, most algorithms to compute the BvN decomposition employ either greedy strategies or custom-made heuristics. In this paper, we present a novel differentiable cost function to approximate the BvN decomposition. Our algorithm builds upon recent advances in Riemannian optimization on Birkhoff polytopes. We offer an empirical evaluation of this approach on the fairness of exposure in rankings, where we show that the outcome of our method behaves similarly to that of greedy algorithms. Our approach is an excellent addition to existing methods for sampling from DS matrices, such as sampling from a Gumbel-Sinkhorn distribution, and it is better suited for applications where prediction-time latency is a constraint: we can generally precompute an approximate BvN decomposition offline and then select a permutation matrix at random with probability proportional to its coefficient. Finally, we provide an implementation of our method.

1. Introduction & Related work

Sampling from a doubly stochastic (DS) matrix is a significant problem that recently caught the attention of the machine learning community, with applications such as exposure fairness in ranking algorithms (Kahng et al., 2018; Singh & Joachims, 2018), strategies to reduce bribery (Keller et al., 2018; 2019), and learning latent representations (Mena et al., 2018; Grover et al., 2018; Linderman et al., 2018). We consider the Birkhoff-von-Neumann decomposition (BvND) (Birkhoff, 1946), which is deterministic and represents a DS matrix as a convex combination of permutation matrices (or permutation sub-matrices). In general, the BvND of a particular DS matrix is not unique. Sampling from a BvND boils down to selecting a sub-permutation matrix with probability proportional to its coefficient. Current BvND algorithms rely on greedy heuristics (Dufossé & Uçar, 2016), mixed-integer linear programming (Dufossé et al., 2018), or quantization (Liu et al., 2018); hence, these methods are not differentiable. To use gradient-based algorithms, one can instead rely on reparametrization techniques (Grover et al., 2018; Linderman et al., 2018). Recently, Mena et al. (2018) introduced a reparametrization trick to draw samples from a Gumbel-Sinkhorn distribution. However, these methods can underperform in applications with a latency constraint at prediction time, as reparametrization methods require solving a perturbed Sinkhorn matrix-scaling problem for each sample. In this work, we propose an alternative to Gumbel-matching-related approaches that is well-suited for applications where we do not need to sample permutations online; our method is fast at prediction time because it saves all components in memory. We call our algorithm the differentiable approximate Birkhoff-von-Neumann decomposition; it is a continuous relaxation of the BvND. We rely on the recently proposed Riemannian gradient descent on Birkhoff polytopes. The main parameter is the number of components of the decomposition.
We enforce an approximate orthogonality constraint on each component of the BvND. To our knowledge, this is the first gradient-based approximation of the BvND.

2. Preliminaries

We first present some background on the BvND and comment on recent advances in Riemannian optimization on Birkhoff polytopes.

Notation. We write column vectors in bold lower-case, e.g., x, and x_i denotes the i-th component of x. We write matrices in bold capital letters, e.g., X, and X_ij denotes the element in the i-th row and j-th column of X. [n] = {1, . . . , n}. Calligraphic letters, e.g., P, denote sets. ‖·‖_F denotes the Frobenius norm of a matrix. I_p is the p × p identity matrix, and 1_n is the all-ones vector of size n. We use the superscript of a matrix X^l to indicate an element of a set; thus, {X^l}_{l=1}^k is short for the set {X^1, . . . , X^k}. In contrast, X^(t) denotes the matrix X at iteration t. Δ_k denotes the (k−1)-dimensional probability simplex, and ⊙ is the Hadamard (element-wise) product.

2.1. Technical background.

Definition 1 (Doubly stochastic (DS) matrix) A DS matrix is a non-negative, square matrix whose rows and columns sum to 1. The set of DS matrices is defined as:

DP_n := {X ∈ R_+^{n×n} : X 1_n = 1_n, 1_n^T X = 1_n^T}. (1)

Definition 2 (Birkhoff polytope) The multinomial manifold of DS matrices is equivalent to the convex object called the Birkhoff polytope (Birkhoff, 1946), an (n−1)²-dimensional convex submanifold of the ambient R^{n×n} with n! vertices. We use DP_n to refer to the Birkhoff polytope.

Theorem 1 (Birkhoff-von Neumann theorem) The convex hull of the set of all permutation matrices is the set of doubly stochastic matrices, and there exists a potentially non-unique θ such that any DS matrix can be expressed as a convex combination of k permutation matrices (Birkhoff, 1946; Hurlbert, 2008):

X = θ_1 P^1 + . . . + θ_k P^k, θ_i > 0, θ^T 1_k = 1. (2)

While finding the minimum k is NP-hard (Dufossé et al., 2018), by the Marcus-Ree theorem we know that there exists one constructible decomposition with k ≤ (n−1)² + 1.
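To make Theorem 1 concrete, the following self-contained numpy sketch (the permutations and weights are illustrative choices of ours, not from the paper) builds a DS matrix as a convex combination of permutation matrices and checks the row/column-sum conditions of Eq. 1:

```python
import numpy as np

# Three 3x3 permutation matrices (vertices of the Birkhoff polytope).
P1 = np.eye(3)
P2 = np.eye(3)[[1, 2, 0]]   # cyclic shift
P3 = np.eye(3)[[2, 0, 1]]   # inverse cyclic shift

# Convex weights theta on the probability simplex (theta_i > 0, sum to 1).
theta = np.array([0.5, 0.3, 0.2])

# Convex combination of permutations: a doubly stochastic matrix.
X = theta[0] * P1 + theta[1] * P2 + theta[2] * P3

assert np.allclose(X.sum(axis=0), 1.0)  # columns sum to 1
assert np.allclose(X.sum(axis=1), 1.0)  # rows sum to 1
assert np.all(X >= 0)                   # non-negative entries
print(X)
```

Recovering the weights θ from a given X is the (harder) decomposition direction that the rest of the paper addresses.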

Definition 3 (Permutation Matrix)

A permutation matrix is defined as a sparse, square, binary matrix where each column and each row contains exactly one true (1) value:

P_n := {P ∈ {0, 1}^{n×n} : P 1_n = 1_n, 1_n^T P = 1_n^T}. (3)

In particular, the set of permutation matrices P_n is the intersection of the set of DS matrices and the orthogonal group O_n := {X ∈ R^{n×n} : X^T X = I} (Goemans, 2015):

P_n = DP_n ∩ O_n.

Riemannian gradient descent on DP_n. The base Riemannian gradient methods use the following update rule (Absil et al., 2009):

X^(t+1) = R_{X^(t)}(−γ H_{X^(t)}),

where H_{X^(t)} is the Riemannian gradient of the loss L : DP_n → R at X^(t), γ is the learning rate, and R_{X^(t)} : T_{X^(t)} DP_n → DP_n is a retraction that maps the tangent space at X^(t) to the manifold. Douik & Hassibi (2019) defined H_{X^(t)} = Π_{X^(t)}(∇_{X^(t)} L(X^(t))), where Π_{X^(t)} is the projection onto the tangent space of DS matrices at X^(t). The total complexity of an iteration of Riemannian gradient descent on the DS manifold is (16/3)n³ + 7n² + √n log(n) (Douik & Hassibi, 2019). See Appendix B for the closed-form computation of Π_{X^(t)} and R_{X^(t)}.

3. Approximate Birkhoff-von-Neumann decomposition

We propose a differentiable loss function to approximate the BvND of a DS matrix. This approach is valuable when one can save the resulting decomposition in memory.

Plis et al. (2011) showed that, on DP_n, all permutation matrices are located on a hypersphere S^{(n−1)²−1} of radius √(n−1), S^{(n−1)²−1} := {x ∈ R^{(n−1)²} : ‖x‖ = √(n−1)}, centered at the center of mass C_n = (1/n) 1_n 1_n^T. Therefore, we could rely on hypersphere-based relaxations (Plis et al., 2011; Zanfir & Sminchisescu, 2018) to learn the sub-permutation matrices. However, we have two main reasons to avoid hypersphere relaxations in our setting: i) Birdal & Simsekli (2019) showed that the gap, as a ratio, between DP_n and both S^{(n−1)²−1} and O_n grows to infinity as n grows. ii) While there exist polynomial-time projections of the n!-element permutation space onto the continuous hypersphere representation and back, these algorithms are prohibitive in iterative methods, as each operation has a time complexity of O(n⁴).

Another possible direction is to build a convex combination of sub-permutation matrices that is an ε-close matrix of M ∈ DP_n. An n × n matrix M̃ is an ε-close matrix of M ∈ DP_n if |M̃_ij − M_ij| ≤ ε, ∀(i, j) ∈ [n]². Barman (2015) showed that it is possible to build an ε-close representation using at most O(log(n)/ε²) matrices. Kulkarni et al. (2017) further improved this result to at most 1/ε matrices. Therefore, we can build a matrix M̃ that satisfies ‖M̃ − M‖²_F ≤ ε, where M̃ = Σ_{l=1}^k η_l X^l and X^l ∈ P_n for l ∈ [k]. However, it is not currently practical to operate directly on P_n, as we would have to solve a combinatorial optimization problem; additionally, P_n lacks a manifold structure. Thus, we relax the domain of the absolute permutations by assuming that each X^l ∈ DP_n for l ∈ [k] (Linderman et al., 2018; Birdal & Simsekli, 2019; Lyzinski et al., 2015; Yan et al., 2016). We add an orthogonal regularization to ensure that each matrix X^l is approximately orthogonal. We can then use Riemannian gradient descent-based methods on the manifold of DS matrices to minimize the total loss, which has a time complexity of O(n³) per iteration.
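As a small numerical illustration of the ε-close notion (with made-up weights of ours, not the constructions of Barman (2015) or Kulkarni et al. (2017)), one can truncate a known decomposition to its k largest terms, renormalize the weights onto the simplex, and measure the entrywise error:

```python
import numpy as np

P = [np.eye(3), np.eye(3)[[1, 2, 0]], np.eye(3)[[2, 0, 1]]]
theta = np.array([0.6, 0.3, 0.1])
M = sum(t * Pl for t, Pl in zip(theta, P))   # exact 3-term decomposition

def eps_close_error(weights, mats, k):
    """Entrywise error of the k-term truncated, renormalized combination."""
    idx = np.argsort(weights)[::-1][:k]      # keep the k largest coefficients
    w = weights[idx] / weights[idx].sum()    # renormalize onto the simplex
    M_tilde = sum(wi * mats[i] for wi, i in zip(w, idx))
    return np.abs(M_tilde - M).max()

print(eps_close_error(theta, P, k=3))  # full decomposition: error ~ 0
print(eps_close_error(theta, P, k=2))  # truncation: small positive epsilon
```

The trade-off between k and ε mirrors the O(1/ε) bound used in Section 3.2.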

3.2. Optimization problem

We want to solve the following optimization problem:

min_{η ∈ Δ_k} (1/2) ‖M − Σ_{l=1}^k η_l X^l‖²_F, s.t. X^l ∈ DP_n ∩ O_n, for l ∈ [k], (4)

where k is O(1/ε) (by Theorem 9 in Kulkarni et al. (2017)) for an ε-close matrix approximation. Now, we recast Eq. 4 in its Lagrangian form and include additional penalties to leverage Riemannian gradient descent on DP_n.

Reconstruction loss. This loss measures the reconstruction error for a given set of candidate DS matrices {X^l}_{l=1}^k and weight vector η ∈ Δ_k, as follows:

L_recons(η, {X^l}_{l=1}^k) = (1/2) ‖M − Σ_{l=1}^k η_l X^l‖²_F. (5)

Orthogonal regularization. This loss encourages a DS matrix X to be approximately orthogonal by pushing it towards the nearest orthogonal manifold (Brock et al., 2017). We compute it as follows:

L_ortho(X) = Σ_{(i,j) ∈ [n]²} |(X^T X − I)_ij|. (6)

Eq. 6 corresponds to an entrywise matrix norm that promotes sparsity. Nevertheless, we can also use other orthogonality-promoting losses, as in Zavlanos & Pappas (2008) or Bansal et al. (2018), e.g., the soft orthogonality regularization L_ortho(X) = ‖X^T X − I‖²_F. We note that L_ortho : DP_n → R_+ and that it is zero iff X ∈ O_n. However, Eq. 6 is not convex in X, and we are only guaranteed to converge to saddle points.

Optimization objective. We compute the total loss by adding the reconstruction loss and an orthogonal regularization for each X^l, l ∈ [k], as follows:

min_{{X^l ∈ DP_n}_{l=1}^k, η ∈ Δ_k} L(η, {X^l}_{l=1}^k; µ, ω) = L_recons(η, {X^l}_{l=1}^k) + Σ_{l=1}^k µ_l L_ortho(X^l) + ω Ω(η), (7)

where Ω(·) is a regularization function, and ω > 0 and µ_l > 0 for l ∈ [k] are regularization hyper-parameters that control the trade-off between reconstruction and orthogonality. For simplicity, we assume the same value of µ_l for all l ∈ [k]. Algorithm 1 displays a summary of our proposed solution. This algorithm has a per-iteration complexity of O(k n³); thus, we are interested in the k ≪ n regime.

Diversity.
The regularization function Ω(·) is necessary to avoid trivial solutions and to force the use of various reconstruction components: if we do not regularize η, the model sets X^l to M for some l ∈ [k]. However, biasing the contribution of each component {X^l}_{l=1}^k is problem-dependent. We set Ω(·) = ‖·‖²₂. We remark that this regularization does not impose finding different sub-permutation matrices. However, repeated sub-permutation matrices are not an issue for the BvND (see Fig. 3).
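A minimal numpy version of the two orthogonality penalties defined in Eq. 6 above (the entrywise L1 form and the soft Frobenius variant) illustrates that both vanish exactly on permutation matrices and are positive on interior DS matrices; the example matrices are our own:

```python
import numpy as np

def ortho_l1(X):
    """Entrywise penalty of Eq. 6: sum_ij |(X^T X - I)_ij|."""
    n = X.shape[0]
    return np.abs(X.T @ X - np.eye(n)).sum()

def ortho_soft(X):
    """Soft orthogonality regularization: ||X^T X - I||_F^2."""
    n = X.shape[0]
    return np.linalg.norm(X.T @ X - np.eye(n), 'fro') ** 2

P = np.eye(4)[[2, 0, 3, 1]]        # a permutation matrix (orthogonal and DS)
U = np.full((4, 4), 1 / 4)         # uniform DS matrix (far from orthogonal)

print(ortho_l1(P), ortho_soft(P))  # both 0: P lies in O_n
print(ortho_l1(U), ortho_soft(U))  # both > 0: U is not orthogonal
```

Either penalty can be dropped into Eq. 7; the L1 form additionally promotes sparsity of X^T X.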

Orthogonalization cost annealing (optional).

We can use a variable regularization hyper-parameter µ^(t) at training time to improve finding suitable solutions (Bowman et al., 2015). At the start of training, we set µ^(0) = 0, so that the model learns a representation with various components. Then, as training progresses, we gradually increase this parameter, forcing the model to impose the orthogonality constraint. This process boils down to computing µ^(t) = min(1, max(0, (t − t_i)/(t_o − t_i))) µ, where t_i and t_o denote the initial and final iterations of the annealing, respectively. However, we then also need to tune t_i and t_o.

Refinement. The primary use of the BvND is to sample (binary) permutation matrices from a DS matrix. However, the solution of Eq. 7 yields only approximately orthogonal matrices. Thus, we still need to round/refine the solution to return a permutation matrix. We can use two different approaches: deterministic rounding, or rounding by stochastic sampling. Deterministic rounding finds a feasible permutation of a given DS matrix via the Hungarian algorithm (an optimal transport problem) (Peyré et al., 2019; Birdal & Simsekli, 2019), where we set the (transport) cost for each l ∈ [k] to K^l = 1 − X^l. Thus, we solve X̃^l ← argmin_{X̃^l ∈ P_n} (1/n) Σ_{i=1}^n K^l_{i, X̃^l_i}, where X̃^l_i denotes the nonzero column of the i-th row of the matrix X̃^l. However, this matching is not differentiable. In that case, we can instead use the Sinkhorn algorithm (Cuturi, 2013), which solves an entropy-regularized optimal transport problem with cost K^l; we set a small entropic regularization (temperature) parameter. Our experiments showed similar performance for the matching and Sinkhorn variants, which suggests that our approximation can be used in practical end-to-end applications.
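The annealing schedule and the deterministic rounding can be sketched as follows. This is a hedged illustration: `linear_sum_assignment` from scipy plays the role of the Hungarian solver, and the function names and the example matrix are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mu_schedule(t, t_i, t_o, mu):
    """Annealing: mu^(t) = min(1, max(0, (t - t_i) / (t_o - t_i))) * mu."""
    return min(1.0, max(0.0, (t - t_i) / (t_o - t_i))) * mu

def round_to_permutation(X):
    """Deterministic rounding of an (approximately orthogonal) DS matrix
    via the Hungarian algorithm with transport cost K = 1 - X."""
    K = 1.0 - X
    rows, cols = linear_sum_assignment(K)   # minimum-cost perfect matching
    P = np.zeros_like(X)
    P[rows, cols] = 1.0
    return P

# A DS matrix close to the permutation (0 -> 1, 1 -> 0, 2 -> 2).
X = np.array([[0.1, 0.8, 0.1],
              [0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]])
print(round_to_permutation(X))
```

The rounding step is applied once per component X^l after training, so its cubic cost is paid offline.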

Algorithm 1 Approximate BvND with a differentiable cost function

Require: DS matrix M, number k of sub-permutation matrices, learning rate γ, regularization parameters µ and ω
Ensure: η*, {X^{i*}}_{i=1}^k that minimize Eq. 7
1: while not converged do
2:   for i ∈ [k] do
3:     Update using Riemannian gradient descent:
       X^{i,(t+1)} ← R_{X^{i,(t)}}(−γ Π_{X^{i,(t)}}(∇_{X^{i,(t)}} L(softmax(η^{(t)}), {X^{i,(t)}}_{i=1}^k; µ^{(t)}, ω)))
4:   end for
5:   Update using gradient descent:
     η^{(t+1)} ← η^{(t)} − γ ∇_{η^{(t)}} L(softmax(η^{(t)}), {X^{i,(t)}}_{i=1}^k; µ^{(t)}, ω)
6: end while

4. APPLICATION: Fairness of Exposure in Rankings

One recent application of the BvND in machine learning is the reduction of presentation bias in ranking systems (Singh & Joachims, 2018). The aim is to find a DS matrix representing a probabilistic ranking system that satisfies a fairness constraint in expectation; then, we sample a ranking from this fair DS matrix for each user. Sampling for each user with Gumbel-Sinkhorn becomes prohibitive, as it solves a Sinkhorn matrix-scaling problem for each sample. Therefore, we precompute the BvND and propose rankings at random for each user. Here, |·| denotes the cardinality of a set.

Learning-to-rank algorithms tend to display top-ranked results more often given the user feedback (usually clicks), which leads to ignoring other potentially relevant results. In brief, the method finds a probabilistic re-ranking system that satisfies specific presentation-bias constraints. We encode this probabilistic re-ranking in a DS matrix and decompose this DS matrix using the BvN algorithm. Then, one can select a sub-permutation matrix at random with probability proportional to its decomposition coefficient. Thus, this re-ranking approach satisfies group fairness constraints in expectation.

For simplicity, we assume a single query q and consider that we want to present a ranking of a set of n documents/items D = {d_i}_{i=1}^n. We denote by U the set of all users u that issue the identical query q. We write rel(d, q) for the measure of relevance of a document for a given query¹. We assume a full-information setting; thus, relevances are known. We use the relevances to compute the utility, e.g., the discounted cumulative gain (DCG) or the normalized DCG (NDCG, the DCG normalized by the DCG of the optimal ranking) (Järvelin & Kekäläinen, 2002).

4.1. Static case

We can write a utility function U in terms of a probabilistic ranking M for a query q as

U(M | q) = Σ_{d_i ∈ D} Σ_{j=1}^n M_ij u(d_i | q) v(j), (8)

where M ∈ DP_n is the probabilistic ranking. Thus, M_ij represents the probability of placing document d_i (indexed at position i) at rank j. u(d | q) := Σ_{u ∈ U} Pr[u | q] f(rel(d, q)) is the expected utility of document d for query q, where f(·) maps the relevance of the document for a user to its utility. v(j) is the examination propensity: it models how much attention an item at position j receives. We use a logarithmic discount in the position, v_j = v(j) = 1/log₂(1 + j), and f(rel(d, q)) = 2^{rel(d,q)} − 1. Thus, the utility function in Eq. 8 corresponds to the expected DCG. We include the fairness constraint by solving:

M* = argmax_{M ∈ DP_n} U(M | q) subject to M is fair.

Singh & Joachims (2018) define several fairness constraints that are linear in M, and this problem boils down to solving a linear program. However, at prediction time, one needs to sample from M. Thus, we use the BvN algorithm to decompose M into a convex combination of k permutation matrices P^k. One chooses a permutation at random with probability proportional to its coefficient; the resulting model satisfies the fairness constraint in expectation.

To set our fairness constraints, we first need to define the merit, impact, and exposure of an item d. The merit of an item is its expected average relevance. The exposure of item d is the probability P(d) that the user will see d and thus can read that article, buy that product, etc. We note that estimating the position bias is not part of our study; we assume full knowledge of these position-based probabilities. We use the feedback C, e.g., clicks, as a measure of impact. We extend these definitions to a group G ⊆ D by aggregating over the group. Then, for a protected group G, we have

Merit(G) = (1/|G|) Σ_{d ∈ G} u(d | q), Imp(G) = (1/|G|) Σ_{d ∈ G} C(d), and Expo(G) = (1/|G|) Σ_{d ∈ G} P(d).
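Eq. 8 can be evaluated directly as a bilinear form. The sketch below (with hypothetical utilities u(d_i | q) chosen by us) computes the expected DCG of a probabilistic ranking M using v(j) = 1/log₂(1 + j):

```python
import numpy as np

def expected_dcg(M, u):
    """U(M | q) = sum_ij M_ij * u(d_i | q) * v(j), with v(j) = 1/log2(1+j)."""
    n = M.shape[0]
    v = 1.0 / np.log2(1 + np.arange(1, n + 1))   # position discounts v(1..n)
    return u @ M @ v

u = np.array([0.9, 0.5, 0.2])        # hypothetical expected utilities u(d_i | q)
M_det = np.eye(3)                    # deterministic ranking: d_i at rank i
M_uni = np.full((3, 3), 1 / 3)       # uniformly random probabilistic ranking

print(expected_dcg(M_det, u))        # DCG of the identity ranking
print(expected_dcg(M_uni, u))        # expected DCG under the uniform ranking
```

Since the objective is linear in M, maximizing it over DP_n subject to linear fairness constraints is the linear program referenced above.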
(9)

Setting the constraints as functions of the probabilistic ranking M, the estimated exposure of document d_i is Êxpo(d_i | M) = Σ_{j=1}^n M_ij v_j. Assuming the position-based model (PBM) click model, the estimated probability of a click is the exposure times the conditional relevance. Thus, the estimated probability of a click on a document d is Ĉ(d) = Êxpo(d | M) u(d | q). Therefore, we can estimate the average impact and exposure of the items in group G for the rankings defined by M as

Împ(G | M) = (1/|G|) Σ_{d_i ∈ G} u(d_i | q) (Σ_{j=1}^n M_ij v_j) and Êxpo(G | M) = (1/|G|) Σ_{d_i ∈ G} Σ_{j=1}^n M_ij v_j. (10)

Here, we only use disparate exposure and impact constraints as fairness constraints. The disparity constraint D(G_i, G_j) is the difference of the fairness metric between protected groups G_i and G_j, each divided by its respective merit. We denote by D_E(G_i, G_j) and D_I(G_i, G_j) the disparity exposure constraint and the disparity impact constraint, respectively, and let D̂(G_i, G_j) be the difference in estimates.
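Under the PBM assumption above, the group estimates of Eq. 10 reduce to a few matrix products. This sketch uses hypothetical groups and utilities of our own choosing:

```python
import numpy as np

def group_exposure(M, group, v):
    """Expo_hat(G | M) = (1/|G|) * sum_{i in G} sum_j M_ij v_j."""
    return (M[group] @ v).mean()

def group_impact(M, group, v, u):
    """Imp_hat(G | M) = (1/|G|) * sum_{i in G} u_i * (sum_j M_ij v_j)."""
    return (u[group] * (M[group] @ v)).mean()

n = 4
v = 1.0 / np.log2(1 + np.arange(1, n + 1))   # position propensities v(j)
u = np.array([0.9, 0.8, 0.4, 0.3])           # hypothetical utilities u(d_i | q)
M = np.full((n, n), 1 / n)                   # uniform probabilistic ranking

G1, G2 = [0, 1], [2, 3]                      # two hypothetical groups
print(group_exposure(M, G1, v), group_exposure(M, G2, v))  # equal under M
print(group_impact(M, G1, v, u), group_impact(M, G2, v, u))
```

Under the uniform ranking, both groups receive identical exposure, while their impacts still differ through the utilities; dividing each estimate by the group merit yields the disparity constraints.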

4.2. Fairness in Dynamic Learning-to-Rank

Morik et al. (2020) present an extension of the fairness of exposure (see Section 4) to a dynamic setting. They propose a fairness-controller algorithm that ensures notions of group fairness amortized through time. This algorithm dynamically adapts both the utility function and the fairness constraints as more data becomes available. Here, we assume that both the exposure and the impact vary through time, Imp_t(G) and Expo_t(G). Then, we use the cumulative fairness constraint over τ time steps, e.g., (1/τ) Σ_{t=1}^τ Imp_t(G), and we extend it to their estimates, too. The optimization problem is

M* = argmax_{M ∈ DP_n, ζ_ij ≥ 0} U(M | q) − λ Σ_{ij} ζ_ij s.t. ∀G_i, G_j : D̂_τ(G_i, G_j) + D_{τ−1}(G_i, G_j) ≤ ζ_ij, (11)

where λ ≥ 0 controls the trade-off between ranking score and fairness.

5. Experiments

We explore our differentiable approximate BvND's behavior on synthetic data and validate its usefulness in fairness-of-exposure ranking problems. We aim to check its performance compared to the greedy combinatorial construction.

5.1. Synthetic data

Optimization. We build random DS matrices naively to initialize each component X^l. For l ∈ [k], we sample each element of the matrix, M^l_ij, ∀(i, j) ∈ [n]², from a half-normal distribution, i.e., the absolute value of an i.i.d. sample drawn from a Gaussian. Then, we project onto the DS matrices using the Sinkhorn algorithm. We set the maximum number of iterations to 10 000, the learning rate to γ = 10⁻³, and the relative tolerance to 10⁻⁴. After a greedy parameter tuning, we use ω = 1 and µ = 10⁻².

Technical aspects. We give a PyTorch (Paszke et al., 2019) implementation in Appendix A. We use RiemannianAdam as implemented in the geoopt library (Kochurov et al., 2020). We run all the experiments on a single desktop machine with a 4-core Intel Core i7 2.4 GHz.

Data. We generate synthetic DS matrices of sizes [4, 6, 10]. To add sparsity, we mask each matrix by thresholding at [0.1, 0.5, 0.9] divided by n, respectively. Then, we project the masked result onto the DS matrices using the Sinkhorn algorithm.
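The naive initialization described above can be sketched in a few lines of numpy (our own Sinkhorn helper; the paper's implementation instead draws components via geoopt's manifold sampler):

```python
import numpy as np

def sinkhorn(A, n_iter=200):
    """Project a positive matrix onto the DS matrices by alternately
    normalizing rows and columns (Sinkhorn-Knopp)."""
    X = A.copy()
    for _ in range(n_iter):
        X /= X.sum(axis=1, keepdims=True)   # row normalization
        X /= X.sum(axis=0, keepdims=True)   # column normalization
    return X

def init_component(n, rng):
    """Half-normal entries (absolute value of Gaussians), then Sinkhorn."""
    return sinkhorn(np.abs(rng.standard_normal((n, n))))

rng = np.random.default_rng(0)
X0 = init_component(6, rng)
print(X0.sum(axis=0))   # columns approximately sum to 1
print(X0.sum(axis=1))   # rows approximately sum to 1
```

The half-normal draw keeps all entries strictly positive, which guarantees that the Sinkhorn iteration converges to a DS matrix.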
We explore the performance of the differentiable BvND as a function of the number of components k. We measure the computation time, the reconstruction error, and the orthogonalization error.

Results. We observe in Fig. 2 that the reconstruction error is monotonically decreasing. However, the approximate orthogonality constraint becomes more challenging to satisfy as the dimensionality of the input DS matrix increases. Table 1 shows the computation time for each parameter setting.

5.2. Static Fairness of Exposure

Setup. We use the toy example presented in Singh & Joachims (2018). These data represent a web service that connects employers (users) to potential employees (items). The set contains three male and three female applicants. The male applicants have relevances for the employers of 0.80, 0.79, and 0.78, respectively, while the female applicants have relevances of 0.77, 0.76, and 0.75, respectively. Here, we follow the standard probabilistic definition of relevance, where 0.77 means that 77% of all employers issuing the query find that applicant relevant. The probability ranking principle suggests ranking these applicants in decreasing order of relevance, i.e., the three males at the top positions, followed by the females. The task is to re-rank them so that the system satisfies equal opportunity of exposure across groups. Thus, we solve a linear program to maximize Eq. 8 such that D̂_E(G_Male, G_Female) ≤ 10⁻⁶.

Results. Fig. 3 shows a toy example of the fairness of exposure. Fig. 3a presents the original biased ranking, whereas Fig. 3b shows a fair probabilistic ranking, which has a negligible loss in performance. We decompose the fair probabilistic ranking using the approximate, differentiable BvND with three components. Note that solving this problem does not imply that each sub-permutation matrix is unique; in practice, we often find a decomposition with repeated sub-permutations.
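At prediction time, sampling a ranking from a stored decomposition is a single categorical draw. This sketch (with illustrative components and coefficients of ours) shows the selection step:

```python
import numpy as np

def sample_ranking(theta, perms, rng):
    """Draw a permutation matrix with probability proportional to theta."""
    idx = rng.choice(len(perms), p=theta / np.sum(theta))
    return perms[idx]

perms = [np.eye(3), np.eye(3)[[1, 0, 2]], np.eye(3)[[2, 1, 0]]]
theta = np.array([0.5, 0.3, 0.2])      # coefficients of the decomposition
rng = np.random.default_rng(0)

P = sample_ranking(theta, perms, rng)  # one ranking for the incoming user
assert any((P == Q).all() for Q in perms)
print(P)
```

Because the decomposition is precomputed, each user request costs only this draw and a table lookup, which is the latency advantage over per-request Gumbel-Sinkhorn sampling.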

5.3. Dynamic Fairness of Exposure

Setup. We rely on Morik et al. (2020) to simulate an environment based on articles in the Ad Fontes Media Bias dataset, which generates a dynamic ranking over a set of news articles belonging to two groups: left-leaning and right-leaning news articles, G_left and G_right, respectively. See Appendix C for a full description of this simulation. We use the dynamic learning-to-rank setting to minimize amortized fairness disparities; we solve Eq. 11 using linear programming. We evaluate the effectiveness of our differentiable approximate BvND (Diff-BvN) algorithm compared to the standard implementation of the BvND (BvN). We explore the difference between both models over various trade-offs λ between ranking score and fairness; then, we measure their performance for a fixed λ.

Baseline. We use as a baseline the greedy BvN heuristic (Dufossé & Uçar, 2016; Dufossé et al., 2018). For Diff-BvN, we set the number of components to k = 10. We refine the matrices using the Hungarian algorithm (Kuhn, 1955) to ensure returning permutation matrices; we also tried the stabilized Sinkhorn used in Cuturi et al. (2019), with the same performance.

Results. Fig. 4 presents the performance of BvN and Diff-BvN over various values of the fairness regularization parameter λ. Fig. 4a shows the NDCG of both methods, whereas Fig. 4c shows their unfairness of impact. Regarding the NDCG, BvN and Diff-BvN display the same performance across different values of λ. We observe the same pattern in the unfairness of exposure and impact, Fig. 4b and Fig. 4c, respectively. We then set the fairness regularization parameter to λ = 10⁻². We see in Fig. 5 that the performance of the re-ranked system is the same for BvN and Diff-BvN. However, the similar performance between these methods implies that only 10 components represent most of the information, which might not hold in other scenarios.
Nevertheless, fewer components improve the performance in some applications (Porter et al., 2013; Liu et al., 2015) .

6. Discussion and Conclusion

In this paper, we proposed a differentiable cost function to approximate the Birkhoff-von-Neumann decomposition (Diff-BvN). Our algorithm approximates a DS matrix by a convex combination of matrices on the Birkhoff polytope, where each matrix in the decomposition is approximately orthogonal. We can minimize the final loss function using Riemannian gradient descent, and our algorithm is easy to implement in standard auto-diff-based libraries. Experiments on the fairness of exposure in ranking problems show that our algorithm yields results similar to the Birkhoff-von-Neumann decomposition algorithm with a greedy heuristic. Our algorithm provides an alternative to existing approaches for sampling permutation matrices from a DS matrix. In particular, it offers the option to balance the trade-off between memory and time in prediction settings. Fewer assignments lead to improved performance in some applications (Porter et al., 2013; Liu et al., 2015); thus, our algorithm can display better performance than greedy approaches, as one sets the number of assignments a priori.

Potential improvements.

In practice, our algorithm is sensitive to the orthogonality constraints. Therefore, we need to explore how to set the parameters of the cost scheduler/annealing for the orthogonal regularization, e.g., how fast do we have to increase the orthogonal regularization parameter? Additionally, we note that the current implementation of our algorithm is still limited to small matrices. Thus, we need to explore different possible directions to make it scalable, e.g., randomization or quadrature methods (Altschuler et al., 2019).

B Riemannian Optimization

A Riemannian manifold is a smooth manifold M of dimension d that can be locally approximated by a Euclidean space R^d. At each point x ∈ M one can define a d-dimensional vector space, the tangent space T_x M. We characterize the structure of the manifold by a Riemannian metric, which is a collection of scalar products ρ = {ρ(·, ·)_x}_{x∈M}, where ρ(·, ·)_x : T_x M × T_x M → R varies smoothly with x. A Riemannian manifold is the pair (M, ρ) (Sommer et al., 2020). The tangent space linearizes the manifold at a point x ∈ M, making it suitable for practical applications, as it leverages the implementation of algorithms in Euclidean space. We use the Riemannian exponential and logarithmic maps to project samples onto the manifold and back to the tangent space, respectively. The Riemannian exponential map, when well-defined, Exp_x : T_x M → M, realizes a local diffeomorphism from a sufficiently small neighborhood of 0 in T_x M onto a neighborhood of the point x ∈ M.

Riemannian gradient descent. The base Riemannian gradient methods (Bonnabel, 2013; Smith, 1994) use the update rule w_{t+1} = Exp_{w_t}(−γ_t H(w_t)), where H(w_t) is the Riemannian gradient of the loss L : M → R at w_t and γ_t is the learning rate. However, the exponential map is not easy to compute in many cases, as one needs to solve a calculus-of-variations problem or know the Christoffel symbols (Bonnabel, 2013). Thus, it is much easier and faster to use a first-order approximation of the exponential map, called a retraction. A retraction R_w : T_w M → M maps the tangent space at w to the manifold such that d(R_w(tv), Exp_w(tv)) = O(t²); this imposes a local rigidity condition that preserves gradients. Therefore, one can rely on the retraction to compute the alternative update w_{t+1} = R_{w_t}(−γ_t H(w_t)) (Absil et al., 2009).
Douik & Hassibi (2019) introduced the following computations of the retraction mapping and of the projection onto the tangent space of DS matrices; the proofs can be found in (Douik & Hassibi, 2019).

Theorem 2 (Projection) The projection operator Π_X(Y), which maps Y onto the tangent space T_X DP_n at X ∈ DP_n, is written as

Π_X(Y) = Y − (α 1^T + 1 β^T) ⊙ X,

where ⊙ is the Hadamard product, α = (I − XX^T)^+ (Y − XY^T) 1, β = Y^T 1 − X^T α, and (·)^+ denotes the pseudo-inverse.

Theorem 3 (Retraction) For a vector ζ_X ∈ T_X DP_n lying on the tangent space at X ∈ DP_n, the first-order retraction map R_X is given by

R_X(ζ_X) = Π(X ⊙ exp(ζ_X ⊘ X)),

where ⊘ is the Hadamard (element-wise) division and the operator Π denotes the projection onto DP_n, efficiently computed using the Sinkhorn algorithm (Sinkhorn & Knopp, 1967).
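A direct numpy transcription of Theorems 2 and 3 can serve as a sanity check (our own code: `pinv` implements (·)^+, and a short Sinkhorn loop implements Π). Tangent vectors of DP_n have zero row and column sums, and retracting the zero vector must return the base point:

```python
import numpy as np

def sinkhorn(A, n_iter=300):
    """Project a positive matrix onto DP_n (Sinkhorn-Knopp)."""
    X = A.copy()
    for _ in range(n_iter):
        X /= X.sum(axis=1, keepdims=True)
        X /= X.sum(axis=0, keepdims=True)
    return X

def project_tangent(X, Y):
    """Pi_X(Y) = Y - (alpha 1^T + 1 beta^T) . X  (Theorem 2)."""
    n = X.shape[0]
    one = np.ones(n)
    alpha = np.linalg.pinv(np.eye(n) - X @ X.T) @ (Y - X @ Y.T) @ one
    beta = Y.T @ one - X.T @ alpha
    return Y - (np.outer(alpha, one) + np.outer(one, beta)) * X

def retract(X, Z):
    """R_X(zeta) = Pi(X . exp(zeta / X))  (Theorem 3)."""
    return sinkhorn(X * np.exp(Z / X))

rng = np.random.default_rng(0)
X = sinkhorn(np.abs(rng.standard_normal((4, 4))))  # a DS base point
Y = rng.standard_normal((4, 4))                    # arbitrary ambient matrix

Z = project_tangent(X, Y)
print(Z.sum(axis=0))   # ~0: tangent vectors have zero column sums
print(Z.sum(axis=1))   # ~0: and zero row sums
print(np.abs(retract(X, np.zeros_like(X)) - X).max())  # retraction of 0 is X
```

These are the two closed forms used inside each Riemannian gradient step of Algorithm 1; the Sinkhorn projection dominates the cost of the retraction.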



¹We can extend this definition to include user preferences, rel(d, q, u).



Figure 1: Illustration of the Birkhoff-von-Neumann decomposition (BvND): One can represent a doubly stochastic matrix as a convex combination of permutation matrices.

Figure 2: Performance of the approximate BvND on synthetic data for different matrix sizes. (top) Reconstruction error as a function of the number of components. (bottom) Histogram of the orthogonalization error of each component.

Figure 3: Static fairness of exposure on toy data: a) unfair ranking; b) ranking satisfying the disparate exposure constraint. We decompose a fair probabilistic ranking using the approximate, differentiable BvND with three components.

Figure 4: Performance of the fairness controller as a function of the parameter λ: Both methods, BvN and Diff-BvN, display almost the same behavior across values of the regularization parameter. These values correspond to ten trials of the simulated news data with 3000 users.

import torch
import torch.nn as nn
from geoopt import ManifoldParameter
from geoopt.manifolds import BirkhoffPolytope


class BirkhoffVonNeumannDecomposition(nn.Module):
    """Attributes
    ----------
    manifold_ : BirkhoffPolytope
    Matrices : nn.ParameterDict of ManifoldParameter, matrices on the
        Birkhoff polytope, updated with RiemannianAdam.
    weight : nn.Parameter, tensor of coefficients.
    """

    def __init__(self, n, n_components=2):
        super(BirkhoffVonNeumannDecomposition, self).__init__()
        self.n = n
        self.n_components = n_components
        self.manifold_ = BirkhoffPolytope()
        self.Matrices = nn.ParameterDict({
            str(ind): ManifoldParameter(
                data=self.manifold_.random((n, n)), manifold=self.manifold_)
            for ind in range(n_components)})
        self.weight = nn.Parameter(data=torch.rand(1, 1, n_components))

    def forward(self, input=None):
        w = torch.softmax(self.weight, dim=-1)
        Ms = torch.cat([self.Matrices[str(choice)].unsqueeze(2)
                        for choice in range(self.n_components)], dim=2)
        return (Ms * w).sum(-1)

Listing 1: Approximate Birkhoff-von-Neumann decomposition (ApproxBvND)

import torch
import numpy as np


def ortho_error(X):
    n_samples = len(X)
    return (X.T @ X - torch.eye(n_samples)).abs().sum()


def reconstruction_error(A, A_approx):
    return torch.norm(A - A_approx, p='fro')

import geoopt
import numpy as np
import torch
from geoopt.optim import RiemannianAdam

optimizer = RiemannianAdam(list(model.parameters()), lr=lr)
vloss = [stop_thr]
loop = 1 if max_iter > 0 else 0
it = 0
while loop:
    it += 1
    optimizer.zero_grad()
    with torch.enable_grad():
        orth_loss = torch.zeros(1)
        for param in model.parameters():
            if isinstance(param, geoopt.tensor.ManifoldParameter):
                orth_loss += ortho_error(param)
            else:
                weight_reg = model.weight.pow(2).sum()
        A_recons = model()
        reconstruction_loss = reconstruction_error(A, A_recons)
        loss = (reconstruction_loss
                + reg_ortho * orth_loss
                + reg_weights * weight_reg)
    loss.backward()
    optimizer.step()
    vloss.append(loss.item())
    relative_error = (abs(vloss[-1] - vloss[-2]) / abs(vloss[-2])
                      if vloss[-2] != 0 else 0.)
    if ((it >= max_iter) or (np.isnan(vloss[-1]))
            or (relative_error < stop_thr)):
        loop = 0

Table 1: Computation time.

Junchi Yan, Xu-Cheng Yin, Weiyao Lin, Cheng Deng, Hongyuan Zha, and Xiaokang Yang. A short survey of recent advances in graph matching. In ACM ICMR, pp. 167-174, 2016.
Andrei Zanfir and Cristian Sminchisescu. Deep learning of graph matching. In CVPR, pp. 2684-2693, 2018.

C Simulations

News data. Morik et al. (2020) contain the description of this dataset; we add it here for completeness. In each trial, we sample a set of 30 news articles D. For each article d, the dataset contains a polarity value ρ_d that we rescale to the interval [−1, 1]. We simulate the user polarities: we draw the polarity of each user from a mixture of two normal distributions, where p_neg is the probability of the user being left-leaning (mean −0.5). We use p_neg = 0.5. Besides, each user has an openness parameter o_{u_t} ~ U(0.05, 0.55), indicating the breadth of interest outside their polarity. We draw the true relevances r_t(d) from a Bernoulli distribution. We use the position-based click model (PBM (Chuklin et al., 2015)) to model user behavior, where the marginal probability that a user u_t examines an article d depends only on its position. The remainder of the simulation follows the dynamic ranking setup: at each time step t, a user u_t arrives at the system, the algorithm presents an unpersonalized ranking, and the user provides feedback c_t according to p_t and r_t. The algorithm only observes c_t and not r_t.

