EFFICIENT, STABLE, AND ANALYTIC DIFFERENTIATION OF THE SINKHORN LOSS

Abstract

Optimal transport and the Wasserstein distance have become indispensable building blocks of modern deep generative models, but their high computational cost greatly limits their application in statistical machine learning. Recently, the Sinkhorn loss, an approximation to the Wasserstein distance, has gained massive popularity, and much work has been done on its theoretical properties. To embed the Sinkhorn loss into gradient-based learning frameworks, efficient algorithms for both its forward and backward passes are required. In this article, we first demonstrate issues of the widely-used Sinkhorn's algorithm, and show that the L-BFGS algorithm is a potentially better candidate for the forward pass. Then we derive an analytic form of the derivative of the Sinkhorn loss with respect to the input cost matrix, which yields an efficient backward algorithm. We rigorously analyze the convergence and stability properties of the advocated algorithms, and use various numerical experiments to validate the performance of the proposed methods.

1. INTRODUCTION

Optimal transport (OT, Villani, 2009) is a powerful tool to characterize the transformation of probability distributions, and has become an indispensable building block of generative modeling. At the core of OT is the Wasserstein distance, which measures the difference between two distributions. For example, the Wasserstein generative adversarial network (WGAN, Arjovsky et al., 2017) uses the 1-Wasserstein distance as the loss function to minimize the difference between the data distribution and the model distribution, and a huge number of related works have emerged afterwards. Despite its various appealing theoretical properties, one major barrier to the wide application of OT is the difficulty of computing the Wasserstein distance. For two discrete distributions, OT solves a linear programming problem with nm variables, where n and m are the numbers of Diracs that define the two distributions. Assuming n = m, standard linear programming solvers for OT have a complexity of O(n^3 log n) (Pele & Werman, 2009), which quickly becomes formidable as n gets large, except for some special cases (Peyré et al., 2019). To resolve this issue, many approximate solutions to OT have been proposed, among which the Sinkhorn loss has gained massive popularity (Cuturi, 2013). The Sinkhorn loss can be viewed as an entropic-regularized Wasserstein distance, which adds a smooth penalty term to the original objective function of OT. The Sinkhorn loss is attractive as its optimization problem can be efficiently solved, at least in exact arithmetic, via Sinkhorn's algorithm (Sinkhorn, 1964; Sinkhorn & Knopp, 1967), which merely involves matrix-vector multiplications and some minor operations. Therefore, it is especially suited to modern computing hardware such as graphics processing units (GPUs). Recent theoretical results show that Sinkhorn's algorithm has a computational complexity of O(n^2 ε^{-2}) to output an ε-approximation to the unregularized OT (Dvurechensky et al., 2018).
Many existing works on the Sinkhorn loss focus on its theoretical properties, for example Mena & Niles-Weed (2019) and Genevay et al. (2019). In this article, we are mostly concerned with the computational aspect. Since modern deep generative models mostly rely on the gradient-based learning framework, it is crucial to use the Sinkhorn loss with differentiation support. One simple and natural method to enable the Sinkhorn loss in back-propagation is to unroll Sinkhorn's algorithm, adding every iteration to the auto-differentiation computing graph (Genevay et al., 2018; Cuturi et al., 2019). However, this approach is typically costly when the number of iterations is large. Instead, in this article we derive an analytic expression for the derivative of the Sinkhorn loss based on quantities computed in the forward pass, which greatly simplifies the back-propagation of the Sinkhorn loss. More importantly, one critical pain point of the Sinkhorn loss, though typically ignored in theoretical studies, is that Sinkhorn's algorithm is numerically unstable (Peyré et al., 2019). We show in numerical experiments that even for very simple settings, Sinkhorn's algorithm can quickly lose precision. Various stabilized versions of Sinkhorn's algorithm, though showing better stability, still suffer from slow convergence in these cases. In this article, we rigorously analyze the solution to the Sinkhorn optimization problem, and design both forward and backward algorithms that are provably efficient and stable. The main contributions of this article are as follows:
• We derive an analytic expression for the derivative of the Sinkhorn loss, which can be efficiently computed in back-propagation.
• We rigorously analyze the advocated forward and backward algorithms for the Sinkhorn loss, and show that they have desirable efficiency and stability properties.
• We have implemented the Sinkhorn loss as an auto-differentiable function in the PyTorch and JAX frameworks, using the analytic derivative obtained in this article. The code to reproduce the results in this article is available at https://1drv.ms/u/s!ArsORq8a24WmoFjNQtZYE_BERzDQ.

2. THE (SHARP) SINKHORN LOSS AS APPROXIMATE OT

Throughout this article we focus on discrete OT problems. Denote by ∆_n = {w ∈ R_+^n : w^T 1_n = 1} the n-dimensional probability simplex, and let µ and ν be two discrete measures with weight vectors a ∈ ∆_n and b ∈ ∆_m, respectively. Define the transport polytope Π(a, b) = {T ∈ R_+^{n×m} : T 1_m = a, T^T 1_n = b}, and let M ∈ R^{n×m} be a cost matrix with entries M_ij, i = 1, ..., n, j = 1, ..., m. Without loss of generality we assume that n ≥ m, as their roles can be exchanged. Then OT can be characterized by the following optimization problem,

W(M, a, b) = min_{P ∈ Π(a,b)} ⟨P, M⟩,   (1)

where ⟨A, B⟩ = tr(A^T B). An optimal solution to (1), denoted by P*, is typically called an optimal transport plan, and can be viewed as a joint distribution whose marginals coincide with µ and ν. The optimal value W(M, a, b) = ⟨P*, M⟩ is then called the Wasserstein distance between µ and ν if the cost matrix M satisfies some suitable conditions (Proposition 2.2 of Peyré et al., 2019). As introduced in Section 1, solving the optimization problem (1) can be difficult even for moderate n and m. One approach to regularizing the optimization problem is to add an entropic penalty term to the objective function, leading to the entropic-regularized OT problem (Cuturi, 2013):

S̃_λ(M, a, b) = min_{T ∈ Π(a,b)} S_λ(T) := min_{T ∈ Π(a,b)} ⟨T, M⟩ − λ^{-1} h(T),   (2)

where h(T) = Σ_{i=1}^n Σ_{j=1}^m T_ij(1 − log T_ij) is the entropy term. The new objective function S_λ(T) is λ^{-1}-strongly convex on Π(a, b), so (2) has a unique global solution, denoted by T*_λ. In this article, T*_λ is referred to as the Sinkhorn transport plan. The entropic-regularized Wasserstein distance, also known as the Sinkhorn distance or Sinkhorn loss in the literature (Cuturi, 2013), is then defined as S_λ(M, a, b) = ⟨T*_λ, M⟩. To simplify the notation, we omit the subscript λ in T*_λ hereafter when no confusion is caused. It is worth noting that in the literature, S_λ and S̃_λ are sometimes referred to as the sharp and regularized Sinkhorn losses, respectively. The following proposition from Luise et al.
(2018) suggests that S_λ achieves a faster rate of approximating the Wasserstein distance than S̃_λ. For this reason, in this article we focus on the sharp version, and simply call S_λ the Sinkhorn loss for brevity.

Proposition 1 (Luise et al., 2018). There exist constants C_1, C_2 > 0 such that for any λ > 0, |S_λ(M, a, b) − W(M, a, b)| ≤ C_1 e^{−λ} and |S̃_λ(M, a, b) − W(M, a, b)| ≤ C_2/λ. The constants C_1 and C_2 are independent of λ, and depend on µ and ν.
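For small instances, problem (1) can be solved exactly as a linear program, which provides the reference value W(M, a, b) that the Sinkhorn loss approximates. A minimal sketch using SciPy's `linprog` (the solver choice and problem sizes are ours, for illustration only):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 4
M = rng.random((n, m))                 # cost matrix
a = np.full(n, 1.0 / n)                # weights in the simplex
b = np.full(m, 1.0 / m)

# Constraints P 1_m = a and P^T 1_n = b, with P vectorized row-major.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums

res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
P_star = res.x.reshape(n, m)           # optimal transport plan P*
W = float(M.ravel() @ res.x)           # W(M, a, b) = <P*, M>
```

The equality system is rank-deficient (row and column sums are linked), which the HiGHS solver handles without modification; this brute-force formulation is exactly what becomes infeasible for large n, motivating the entropic regularization in (2).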

3. DIFFERENTIATION OF THE SINKHORN LOSS

To use the Sinkhorn loss in deep neural networks or other machine learning tasks, it is also crucial to obtain the derivative of S_λ(M, a, b) with respect to its input parameters. Differentiating the Sinkhorn loss typically involves two stages, the forward and backward passes. In the forward pass, the Sinkhorn loss or the transport plan is computed using some optimization algorithm, and in the backward pass the derivative is computed, using either an analytic expression or the automatic differentiation technique. In this section we analyze both passes in detail. Throughout this article we use the following notations. For x, y ∈ R, x ∧ y means min{x, y}. For a vector v = (v_1, ..., v_k)^T, let v^{-1} = (v_1^{-1}, ..., v_k^{-1})^T and ṽ = (v_1, ..., v_{k-1})^T, and denote by diag(v) the diagonal matrix formed by v. Let u = (u_1, ..., u_l)^T be another vector, and denote by u ⊕ v the l × k matrix with entries (u_i + v_j). For a matrix A = (a_ij) = (A_1, ..., A_k) with column vectors A_1, ..., A_k, let Ã = (A_1, ..., A_{k-1}), and let e^λ[A] be the matrix with entries e^{λ a_ij}. The symbol ⊙ denotes the elementwise multiplication operator between matrices or vectors. ∥·∥ and ∥·∥_F stand for the Euclidean norm of vectors and the Frobenius norm of matrices, respectively. Finally, we globally define η ≡ λ^{-1} for simplicity.
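The two nonstandard operators, u ⊕ v and e^λ[A], vectorize naturally; a small sketch (the helper names are ours, not from the paper):

```python
import numpy as np

def oplus(u, v):
    # u ⊕ v: the l x k matrix with entries u_i + v_j
    return u[:, None] + v[None, :]

def e_lam(lam, A):
    # e^λ[A]: elementwise exponential, entries exp(lam * a_ij)
    return np.exp(lam * A)

def tilde(A):
    # Ã (or ṽ): drop the last column (for vectors, the last entry)
    return A[..., :-1]
```

With these helpers, expressions such as e^λ[α ⊕ β − M], which appear repeatedly below, become one-liners like `e_lam(lam, oplus(alpha, beta) - M)`.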

3.1. ISSUES OF SINKHORN'S ALGORITHM

In the existing literature, one commonly-used method for the forward pass of the Sinkhorn loss is Sinkhorn's algorithm (Sinkhorn, 1964; Sinkhorn & Knopp, 1967). Unlike the original linear programming problem (1), the solution to the Sinkhorn problem has a special structure. Cuturi (2013) shows that the optimal solution T* can be expressed as

T* = diag(u*) M_e diag(v*)   (3)

for some vectors u* and v*, where M_e is the matrix with entries e^{−λ M_ij}. Sinkhorn's algorithm starts from an initial vector v^(0) ∈ R_+^m, and generates iterates u^(k) ∈ R_+^n and v^(k) ∈ R_+^m as follows:

u^(k+1) ← a ⊙ [M_e v^(k)]^{-1},   v^(k+1) ← b ⊙ [M_e^T u^(k+1)]^{-1}.   (4)

It can be proved that u^(k) → u* and v^(k) → v*, and then the Sinkhorn transport plan T* can be recovered by (3). Sinkhorn's algorithm is very efficient, as it only involves matrix-vector multiplications and other inexpensive operations. However, one major issue of Sinkhorn's algorithm is that the entries of M_e may easily underflow when λ is large, making some elements of the vectors M_e v^(k) and M_e^T u^(k+1) close to zero. As a result, some components of u^(k+1) and v^(k+1) would overflow. Therefore, Sinkhorn's algorithm in its original form is unstable, and in practice the iterations (4) are typically carried out in the logarithmic scale, which we call the Sinkhorn-log algorithm for simplicity. Besides, there are some other works attempting to improve the numerical stability of Sinkhorn's algorithm (Schmitzer, 2019; Cuturi et al., 2022). Despite these advancements, one critical issue observed in practice is that Sinkhorn-type algorithms may be slow to converge, especially for small regularization parameters. This severely slows down the computation, and may even give misleading results when the user sets a moderate limit on the total number of iterations. Below we show a motivating example to highlight this issue.
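The underflow failure mode is easy to reproduce. Below is a minimal sketch of the iterations (4) in plain (non-logarithmic) form; the problem sizes and the extreme λ value are ours, chosen only to trigger the instability:

```python
import numpy as np

def sinkhorn_naive(M, a, b, lam, iters=1000):
    # Iterations (4): u <- a / (M_e v), v <- b / (M_e^T u), with (M_e)_ij = exp(-lam * M_ij).
    Me = np.exp(-lam * M)
    v = np.ones_like(b)
    with np.errstate(divide="ignore", invalid="ignore"):
        for _ in range(iters):
            u = a / (Me @ v)
            v = b / (Me.T @ u)
        return u[:, None] * Me * v[None, :]   # recovery (3): T = diag(u) M_e diag(v)

rng = np.random.default_rng(0)
n = 30
M = 0.01 + np.abs(rng.normal(size=(n, n)))    # costs bounded away from zero
a = b = np.full(n, 1.0 / n)

T_ok = sinkhorn_naive(M, a, b, lam=5.0)       # moderate lam: finite, usable plan
T_bad = sinkhorn_naive(M, a, b, lam=1e5)      # huge lam: M_e underflows to exactly 0
```

With λ = 10^5 every entry of M_e underflows to zero in double precision, so the first u-update divides by zero and the recovered plan is entirely NaN, matching the instability described above; the log-domain variant avoids this but, as shown next, can still converge slowly.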
Consider a triplet (M, a, b) for the Sinkhorn problem, and fix the regularization parameter η to 0.001, where the detailed setting is provided in Appendix B.1. The true T* matrix is visualized in Figure 1, along with the solutions given by various widely-used algorithms, including Sinkhorn's algorithm, Sinkhorn-log, the stabilized scaling algorithm (Stabilized, Algorithm 2 of Schmitzer, 2019), and the Greenkhorn algorithm (Altschuler et al., 2017; Lin et al., 2022). The maximum number of iterations is 10000 for Greenkhorn and 1000 for the other algorithms. In Figure 1, it is clear that the plans given by Sinkhorn's algorithm and Greenkhorn are farthest from the true value, and Greenkhorn generates NaN values, reflected by the white stripes in the plot. In contrast, the stable algorithms Sinkhorn-log and Stabilized greatly improve on them. Sinkhorn's algorithm and Sinkhorn-log are equivalent in exact arithmetic, so their numerical differences highlight the need for numerically stable algorithms. However, Sinkhorn-log and Stabilized still show visible inconsistencies with the truth even after 1000 iterations.

3.2. THE ADVOCATED ALTERNATIVE FOR FORWARD PASS

To this end, we advocate an alternative scheme to solve for the optimal plan T*, and we show both theoretically and empirically that this method enjoys great efficiency and stability. Consider the dual problem of (2), which has the following form (Proposition 4.4 of Peyré et al., 2019):

max_{α,β} L(α, β) := max_{α,β} α^T a + β^T b − η Σ_{i=1}^n Σ_{j=1}^m e^{−λ(M_ij − α_i − β_j)},   α ∈ R^n, β ∈ R^m.   (5)

Let α* = (α*_1, ..., α*_n)^T and β* = (β*_1, ..., β*_m)^T be one optimal solution to (5), and then the Sinkhorn transport plan T* can be recovered as T* = e^λ[α* ⊕ β* − M]. Remarkably, (5) is equivalent to an unconstrained convex optimization problem, so a simple gradient ascent method suffices to find its optimal solution. But in practice, quasi-Newton methods such as the limited-memory BFGS method (L-BFGS, Liu & Nocedal, 1989) can significantly accelerate the convergence. Using L-BFGS to solve (5) is a known practice (Cuturi & Peyré, 2018; Flamary et al., 2021), but little is known about its stability in solving the regularized OT problem. We first briefly describe the algorithm below, and in Section 4 we rigorously prove that L-BFGS converges fast and generates stable iterates. It is worth noting that we can reduce the number of variables to be optimized in (5) based on the following two findings. First, as pointed out by Cuturi et al. (2019), the variables (α, β) have one redundant degree of freedom: if (α*, β*) is one solution to (5), then so is (α* + c1_n, β* − c1_m) for any c. Therefore, we globally set β_m = 0 without loss of generality. Second, let α*(β) = arg max_α L(α, β) for a given β = (β̃^T, β_m)^T = (β̃^T, 0)^T, and define f(β) = −L(α*(β), β). Then we only need to minimize f(β) with (m − 1) free variables to get β*, and α* can be recovered as α* = α*(β*).
In the appendix we show that α*(β), f(β), and ∇_β̃ f have simple closed-form expressions:

f(β) = −α*(β)^T a − β^T b + η,   α*(β)_i = η log a_i − η log Σ_{j=1}^m e^{λ(β_j − M_ij)},
∇_β̃ f = T̃(β)^T 1_n − b̃,   T(β) = e^λ[α*(β) ⊕ β − M].   (6)

With f(β) and ∇_β̃ f, the L-BFGS algorithm can be readily used to minimize f(β) and obtain β*. Each gradient evaluation requires O(mn) exponentiation operations, which is comparable to Sinkhorn-log. Although exponentiation is more expensive than the matrix-vector multiplications in Sinkhorn's algorithm, this extra cost can be greatly remedied by modern hardware such as GPUs. On the other hand, we show in Section 4 that L-BFGS has a strong guarantee on numerical stability, which is critical for many scientific computing problems. For the motivating example in Section 3.1, we demonstrate the advantage of L-BFGS by showing its transport plan in the rightmost plot of Figure 1. We limit its maximum number of gradient evaluations to 1000, which makes it comparable to the other methods. Clearly, the L-BFGS solution is visually identical to the ground truth. To study the difference between L-BFGS and Sinkhorn's algorithm in more depth, we compute the objective function value of the dual problem (6) at each iteration for both Sinkhorn-log and L-BFGS. The results are visualized in Figure 2, with three different η values, η = 0.1, 0.01, 0.001. Since each L-BFGS iteration may involve more than one gradient evaluation, for L-BFGS we plot the values against both the outer iteration and gradient evaluation counts. Figure 2 gives a clear clue to the issue of Sinkhorn-log: it has a surprisingly slow convergence speed compared to L-BFGS when η is small. Theoretically, Sinkhorn algorithms will eventually converge with sufficient iterations, but in practice, a moderate limit on the computational budget may prevent them from generating accurate results.
For these reasons, L-BFGS appears to be a better candidate when one needs a small η for a better approximation to the OT problem. In Appendix B.2 we design more experiments to study the forward pass stability and accuracy of different algorithms.
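The reduced dual problem can be handed to any off-the-shelf L-BFGS implementation. The sketch below uses SciPy's L-BFGS-B with the closed-form f(β) and gradient given above (the paper's own implementation targets PyTorch and JAX; this NumPy version is only illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def sinkhorn_dual_lbfgs(M, a, b, eta, gtol=1e-9):
    """Forward pass: minimize f(beta) over the (m-1) free dual variables (beta_m = 0)."""
    lam = 1.0 / eta
    n, m = M.shape

    def alpha_star(beta):
        # alpha*(beta)_i = eta*log a_i - eta*log sum_j exp(lam*(beta_j - M_ij))
        return eta * np.log(a) - eta * logsumexp(lam * (beta[None, :] - M), axis=1)

    def fun_and_grad(beta_free):
        beta = np.append(beta_free, 0.0)
        alpha = alpha_star(beta)
        # T(beta) = e^lam[alpha*(beta) ⊕ beta - M]; rows sum to a_i, so entries never overflow
        T = np.exp(lam * (alpha[:, None] + beta[None, :] - M))
        f = -alpha @ a - beta @ b + eta      # f(beta) = -alpha*(beta)^T a - beta^T b + eta
        grad = T.sum(axis=0) - b             # column-marginal error = dual gradient
        return f, grad[:-1]

    res = minimize(fun_and_grad, np.zeros(m - 1), jac=True, method="L-BFGS-B",
                   options={"gtol": gtol, "ftol": 1e-14, "maxiter": 2000})
    beta = np.append(res.x, 0.0)
    alpha = alpha_star(beta)
    T = np.exp(lam * (alpha[:, None] + beta[None, :] - M))
    return T, alpha, beta
```

Because α*(β) is optimized out in closed form, the row marginal T(β)1_m = a holds exactly at every iterate, while the column-marginal error equals the gradient norm and shrinks as L-BFGS converges.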

3.3. THE ANALYTIC BACKWARD PASS

For the backward pass, one commonly-used method is the unrolled Sinkhorn's algorithm, which is based on the fact that Sinkhorn's forward pass is differentiable with respect to a, b, and M. Therefore, one can use automatic differentiation software to compute the corresponding derivatives in the backward pass. This method is used in Genevay et al. (2018) for learning generative models with the Sinkhorn loss, but in practice it may be extremely slow if the forward pass takes a large number of iterations. To avoid the excessive cost of unrolled algorithms, various implicit differentiation methods have been developed for the Sinkhorn loss (Feydy et al., 2019; Campbell et al., 2020; Xie et al., 2020; Eisenberger et al., 2022), but they still do not provide the most straightforward way to compute the gradient. To this end, we advocate the use of analytic derivatives of the Sinkhorn loss, which solves the optimization problem (2) in the forward pass, and uses the optimal dual variables (α*, β*) for the backward pass. The analytic form of ∇_{a,b} S_λ(M, a, b) has been studied in Luise et al. (2018), but to the best of our knowledge, few results have been presented for ∇_M S_λ(M, a, b). Our first main theorem, Theorem 1, fills this gap by deriving the analytic form of ∇_M S_λ(M, a, b).

Theorem 1. For a fixed λ > 0,

∇_M S_λ(M, a, b) = T* + λ(s_u ⊕ s_v − M) ⊙ T*,

where T* = e^λ[α* ⊕ β* − M], s_u = a^{-1} ⊙ (µ_r − T* s_v), s_v = (s̃_v^T, 0)^T, µ_r = (M ⊙ T*)1_m, µ̃_c = (M̃ ⊙ T̃*)^T 1_n, s̃_v = D^{-1}[µ̃_c − T̃*^T(a^{-1} ⊙ µ_r)], and D = diag(b̃) − T̃*^T diag(a^{-1}) T̃*. In addition, D is positive definite, and hence invertible, for any λ > 0, a ∈ ∆_n, b ∈ ∆_m, and M.

Assuming n ≥ m, the main computational cost is in forming D, which requires O(m^2 n) operations for matrix-matrix multiplication, and in computing s̃_v, which costs O(m^3) operations for solving a positive definite linear system.
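Given a converged plan T* and the problem data, the formula in Theorem 1 is a handful of matrix operations. A direct NumPy transcription (our own sketch of the theorem's formula, not the authors' released code):

```python
import numpy as np

def grad_M_sinkhorn(M, T, a, b, lam):
    # Analytic gradient of the (sharp) Sinkhorn loss w.r.t. the cost matrix,
    # following Theorem 1; T is the converged Sinkhorn plan T*.
    mu_r = (M * T).sum(axis=1)                      # mu_r = (M ⊙ T*) 1_m
    mu_c = (M[:, :-1] * T[:, :-1]).sum(axis=0)      # mu~_c = (M~ ⊙ T~*)^T 1_n
    Tt = T[:, :-1]                                  # T~*: drop the last column
    D = np.diag(b[:-1]) - Tt.T @ (Tt / a[:, None])  # D = diag(b~) - T~*^T diag(a^-1) T~*
    sv = np.append(np.linalg.solve(D, mu_c - Tt.T @ (mu_r / a)), 0.0)
    su = (mu_r - T @ sv) / a                        # s_u = a^-1 ⊙ (mu_r - T* s_v)
    return T + lam * (su[:, None] + sv[None, :] - M) * T
```

A central finite-difference check of S_λ(M) = ⟨T*(M), M⟩ on small random problems is a useful sanity test when porting this formula to another framework.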

4. CONVERGENCE AND STABILITY ANALYSIS

In Sections 3.2 and 3.3, we introduced the advocated forward and backward algorithms for the Sinkhorn loss, respectively. In this section, we focus on the theoretical properties of these algorithms, and show that they enjoy provable efficiency and stability. As a first step, we consider the target of the optimization problem (6), and show that f(β) has a well-behaved minimizer, which neither underflows nor overflows.

Theorem 2. Denote by f* the minimum value of f(β) and by β* an optimal solution, and let α* = α*(β*). Then f* > −∞, β* is unique, ∥α*∥ < ∞, and ∥β*∥ < ∞. In particular, let I = arg max_i T*_im, a_max = max_i a_i, and c = log(n/b_m). Then L̄_αi ≤ L_αi ≤ α*_i ≤ U_αi and L_βj ≤ β*_j ≤ U_βj ≤ Ū_βj for all i = 1, ..., n and j = 1, ..., m, where

U_αi = M_im + η·log(a_i ∧ b_m),   U_βj = M_Ij − M_Im + η[log(a_I ∧ b_j) + c],
L_αi = η·log(a_i/m) − max_j(U_βj − M_ij),   L_βj = η·log(b_j/n) − max_i(U_αi − M_ij),
Ū_βj = max_i{M_ij − M_im} + η[log(a_max ∧ b_j) + c],   L̄_αi = η·log(a_i/m) − max_j(Ū_βj − M_ij).

Theorem 2 shows that the optimal dual variables (α*, β*) are bounded, and more importantly, the bounds are roughly at the scale of the cost matrix entries M_ij and the log-weights log(a_i) and log(b_j). Therefore, at least the target of the optimization problem is well-behaved. Moreover, one useful application of Theorem 2 is to impose a box constraint on the variables, adding further stability to the optimization algorithm. After verifying the properties of the optimal solution, a more interesting and critical problem is to seek a stable algorithm that approaches the optimal solution. Indeed, in Theorem 3 we prove that the L-BFGS algorithm for minimizing (6) is one such method.

Theorem 3. Let {β̃^(k)} be a sequence of iterates generated by the L-BFGS algorithm starting from a fixed initial value β̃^(0), and define β^(k) = (β̃^(k)T, 0)^T, T^(k) = e^λ[α*(β^(k)) ⊕ β^(k) − M], and f^(k) = f(β^(k)).
Then there exist constants 0 ≤ r < 1 and C_1, C_2 > 0 such that for each k > 0:

(a) f^(k) − f* ≤ r^k(f^(0) − f*) := ε^(k). (Linear convergence of the objective value)
(b) ∥β^(k) − β*∥^2 ≤ C_1 ε^(k). (Linear convergence of the iterates)
(c) T^(k) 1_m = a, ∥∇_β̃ f(β^(k))∥^2 = ∥T̃^(k)T 1_n − b̃∥^2 ≤ C_2 ε^(k). (Exponential decay of the gradient)
(d) T^(k)_ij < min{a_i, b_j + C_2 ε^(k)}, 1 ≤ j ≤ m − 1. (T^(k) does not overflow)
(e) max_j T^(k)_ij > a_i/m, max_i T^(k)_ij > (b_j − C_2 ε^(k))/n, 1 ≤ j ≤ m − 1. (T^(k) does not underflow)

The explicit expressions for the constants C_1, C_2, and r are given in Appendix A.

Theorem 3 reveals some important information. First, both the objective function value and the iterates have a linear convergence speed, so the forward pass using L-BFGS takes O(log(1/ε)) iterations to obtain an ε-optimal solution. Second, the marginal error for µ, measured by ∥T^(k) 1_m − a∥, is exactly zero due to the partial optimization over α. The other marginal error ∥T̃^(k)T 1_n − b̃∥, which is equal to the gradient norm, is also bounded at any iteration, and decays exponentially fast to zero. This validates the numerical stability of the L-BFGS algorithm. Third, the estimated transport plan at any iteration k, T^(k), is also bounded and stable. This result can be compared with the formulation in (3): it is not hard to find that u*, v*, and M_e, when computed individually, can all be unstable due to the exponentiation operations, especially when η is small. In contrast, T* and T^(k), thanks to the results in Theorem 3, do not suffer from this issue. We emphasize that Theorem 3 provides novel results that are not direct consequences of the L-BFGS convergence theorem given in Liu & Nocedal (1989). First, classical theorems only guarantee the convergence of objective function values and iterates as in (a) and (b), whereas we provide richer information such as the marginal errors and transport plans specific to OT problems.
More importantly, our results are all nonasymptotic with computable constants. To achieve this, we carefully analyze the eigenvalue structure of the dual Hessian matrix, which is of interest in itself. Likewise, we show that the derivative ∇_M S_λ(M, a, b) in Theorem 1 can also be computed in a numerically stable way. Let ∇̂_M S be the k-step approximation to ∇_M S := ∇_M S_λ(M, a, b) using the L-BFGS algorithm, i.e., the result of replacing every T* in ∇_M S by T^(k). Then we show that the error of the gradient also decays exponentially fast.

Theorem 4. Using the symbols defined in Theorems 1 and 3, let σ = 1/σ_min(D), where σ_min(D) is the smallest eigenvalue of D. Assume that for some k_0, ε^(k_0) < C_1^{-1} min{1, (6σ∥D∥_F)^{-1}}/(4λ^2); then for every k ≥ k_0,

∥∇̂_M S − ∇_M S∥_F ≤ C_S √(ε^(k)) = C_S √(f^(0) − f*) · r^{k/2},

where the explicit expression for C_S is given in Appendix A. Such a k_0 always exists, as ε^(k) decays to zero exponentially fast by Theorem 3(a).

5. APPLICATION: SINKHORN GENERATIVE MODELING

The Sinkhorn loss is useful in unsupervised learning tasks that attempt to match two distributions p_θ and p*, where p* stands for the data distribution, and p_θ is the model distribution. If p_θ and p* can be represented or approximated by two discrete measures µ_θ and ν, respectively, then one would wish to minimize the Sinkhorn loss S_λ(M_θ, a_θ, b_θ) between µ_θ and ν, where the cost matrix M and weights a, b may depend on learnable parameters θ. In gradient-based learning frameworks, the key step of seeking the optimal parameter θ that minimizes S_λ(M_θ, a_θ, b_θ) is to compute the gradient ∇_θ S_λ(M_θ, a_θ, b_θ), which further reduces to evaluating ∇_{a,b} S_λ(M, a, b) and ∇_M S_λ(M, a, b). Luise et al. (2018) assume that M is fixed, and study the gradients ∇_{a,b} S_λ(M, a, b). However, in many generative models it is more important to derive ∇_M S_λ(M, a, b), as the weights a and b are typically fixed, whereas the cost matrix M is computed from the output of some parameterized layers. Consider a data set X_1, ..., X_n that follows the distribution p*, X_i ∈ R^p, and our target is to fit a deep neural network g_θ : R^r → R^p such that g_θ(Z) approximately follows p*, where Z ~ N(0, I_r). To this end, we first generate random variates Z_1, ..., Z_m ~ N(0, I_r), and let Y_j = g_θ(Z_j). The two weight vectors are simply taken to be a = n^{-1} 1_n, b = m^{-1} 1_m, and the cost matrix between X_i and Y_j is given by M_θ ∈ R^{n×m}, where (M_θ)_ij = ∥X_i − g_θ(Z_j)∥^2. Then we learn the network parameter θ by minimizing the Sinkhorn loss ℓ(θ) = S_λ(M_θ, n^{-1} 1_n, m^{-1} 1_m). We show such applications in Section 7.
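In this setting the chain rule from ∇_M S_λ to the generated points is explicit: since (M_θ)_ij = ∥X_i − Y_j∥^2 with Y_j = g_θ(Z_j), we have ∂ℓ/∂Y_j = Σ_i G_ij · 2(Y_j − X_i), where G = ∇_M S_λ, after which ∇_θ ℓ follows by ordinary back-propagation through g_θ. A vectorized sketch of this step (the function name is ours):

```python
import numpy as np

def grad_generated_points(X, Y, G):
    # Chain rule through M_ij = ||X_i - Y_j||^2:
    # dL/dY_j = sum_i G_ij * dM_ij/dY_j = sum_i G_ij * 2 * (Y_j - X_i)
    col = G.sum(axis=0)                  # sum_i G_ij, one scalar per generated point
    return 2.0 * (col[:, None] * Y - G.T @ X)
```

In a PyTorch or JAX implementation this step is handled automatically once ∇_M S_λ is registered as the custom backward of the Sinkhorn loss node.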

6. RELATED WORK

In Table 1 we list some related work on the differentiation of the Sinkhorn loss, and provide a brief summary of the contributions and limitations of each work. The target of our work is closest to that of Genevay et al. (2018), i.e., to differentiate the Sinkhorn loss with respect to the input cost matrix. However, they use the unrolled Sinkhorn's algorithm, so there is no analytic form for ∇_M S_λ. Our work is mostly motivated by Luise et al. (2018), but they consider the derivative with respect to the weights instead of the cost matrix. Similarly, Cuturi et al. (2019) and Cuturi et al. (2020) consider gradients on weights and data points, but not the general cost matrix. Campbell et al. (2020), Xie et al. (2020), and Eisenberger et al. (2022) all consider the derivative of the transport plan T* using implicit differentiation. Although this is a more general problem than computing ∇_M S_λ, it loses the special structure of the Sinkhorn loss. As a result, the compact matrix form of the derivative presented in Theorem 1 is unique to our work. Furthermore, the storage requirement of our result is O(nm), whereas some existing works need to store much larger matrices. Finally, very few works have rigorously analyzed the stability of the algorithms in both the forward and backward passes.

7. NUMERICAL EXPERIMENTS

7.1. RUNNING TIME OF FORWARD AND BACKWARD PASSES

In our subsequent experiments, we compare three algorithms for differentiating the Sinkhorn loss: the proposed method ("Analytic") using L-BFGS for the forward pass and the analytic derivative for the backward pass, the implicit differentiation method ("Implicit"), and the unrolled Sinkhorn algorithm ("Unroll"). Both Implicit and Unroll use Sinkhorn-log for the forward pass, and are implemented in the OTT-JAX library (Cuturi et al., 2022). We simulate cost matrices of different dimensions, and compare the forward time and total time of the algorithms with both η = 0.1 and η = 0.01. The detailed experiment settings and results are given in Appendix B.3, Table 2, and Table 3. The main conclusion is that under the same accuracy level, the forward pass of Analytic is typically faster than those of the other two algorithms, and its backward pass is significantly faster. Overall, the proposed analytic differentiation method demonstrates visible advantages in computational efficiency.

[Table 1 about here. Each row lists the derivative considered by a related work: Genevay et al. (2018): ∇_M S_λ; Luise et al. (2018): ∇_{a,b} S_λ; Cuturi et al. (2019): ∇_{a,x} T*; Cuturi et al. (2020): ∇_{b,x} T*; Campbell et al. (2020), Xie et al. (2020), Eisenberger et al. (2022): ∇_M T*; this article: ∇_M S_λ; together with indicators of the properties discussed in Section 6.]

7.2. GENERATIVE MODELS ON TOY DATA SETS

In this section we apply the Sinkhorn loss to generative modeling, and test the accuracy and efficiency of the proposed algorithm. Following the methods in Section 5, we consider a toy data set with n = 1000 and p = 2 (shown in Figure 3), and we attempt to learn a neural network g_θ that approximately pushes forward Z_1, ..., Z_m ~ N(0, I_r) to the data distribution. In our experiments, g_θ is a fully-connected ReLU neural network with input dimension r = 5 and hidden dimensions 64-128-64. The number of latent data points is m = 1000. In the first setting, we intentionally keep both the observed data and the latent points Z_1, ...
, Z_m fixed, so that the optimization of g_θ is a deterministic process without randomness, and the optimization variable obtained from each forward pass is used as a warm start for the next training iteration. We train g_θ for 2000 iterations using the Adam optimizer with a learning rate of 0.001, and at every ten iterations we compute the 2-Wasserstein distance between the observed data and the pushforward points {g_θ(Z_i)}. This setting, though not common in generative modeling, helps us monitor the computation of gradients without the impact of random sampling. Moreover, the Wasserstein distance has an achievable lower bound of zero if g_θ is sufficiently expressive, so we can study the accuracy and efficiency of gradient computation by plotting the metric against the running time. The comparison results for the three algorithms are shown in the second and third plots of Figure 3, from which we have the following findings. First, when η = 0.5, the 2-Wasserstein distance increases after a certain number of iterations, indicating that the Sinkhorn loss is not an accurate approximation to the Wasserstein distance when η is large. Second, it is clear that the Unroll method has the slowest computation, as it needs to unroll a potentially large computational graph in automatic differentiation. Third, the proposed analytic differentiation shows visible advantages in computational efficiency, thanks to the closed-form expression for the derivative. Finally, to examine the performance of the differentiation methods in a genuine generative modeling setting, we use a regular training scheme that randomly samples the observed data and latent points at each iteration. Due to the first finding above, we choose the smaller η value for the Sinkhorn loss, and use a mini-batch of size n = m = 256 for training. We run 5000 iterations in total, and the metric curves are shown in the last plot of Figure 3.
It can be seen that the performance of the three algorithms is similar to the fixed-Z case: all three methods properly reduce the Wasserstein distance over time, but the proposed algorithm uses less time to accomplish the computation. Additional experiments on simulated data sets are given in Appendix B.4.

7.3. DEEP GENERATIVE MODELS

Finally, we use the Sinkhorn loss to train larger and deeper generative models on the MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) data sets. In this experiment, we do not pursue training the best possible generative model; instead, we primarily validate our claims on the

8. CONCLUSION

In this article we study the differentiation of the Sinkhorn loss with respect to its cost matrix, and derive an analytic form of the derivative, which makes the backward pass of the differentiation easy to implement. Moreover, we study the numerical stability of the forward pass, and rigorously prove that L-BFGS is a stable and efficient algorithm that complements the widely-used Sinkhorn's algorithm and its stabilized versions. In particular, L-BFGS typically converges faster for Sinkhorn problems with a weak regularization. It is worth noting that the proposed analytic differentiation method can be combined with different forward algorithms, and a reasonable scheme is to use L-BFGS for weakly regularized OT problems, and to choose Sinkhorn's algorithm otherwise. The differentiable Sinkhorn loss has many potential applications in generative modeling and permutation learning (Mena et al., 2018), and we anticipate that the techniques developed in this article would boost future research in those directions.

A EXPLICIT EXPRESSIONS FOR CONSTANTS

We first define a few user constants for the L-BFGS algorithm. Let m_0 be the maximum number of correction vectors used to construct the BFGS matrix B_k, and let c_1 ∈ (0, 1/2) and c_2 ∈ (c_1, 1) be two constants related to the Wolfe conditions: we assume the L-BFGS algorithm uses some line search algorithm to select step sizes α_k that satisfy

f(x_k + α_k d_k) ≤ f(x_k) + c_1 α_k g_k^T d_k,   g(x_k + α_k d_k)^T d_k ≥ c_2 g_k^T d_k,

where f(·) and g(·) stand for the objective function and the gradient function, respectively, x_k is the k-th iterate, g_k = g(x_k), d_k = −B_k^{-1} g_k is the search direction, and B_k is the BFGS matrix that approximates the Hessian matrix. m_0, c_1, and c_2 are selected by the user. For Theorem 3, let β̃^(0) be the initial value, and let µ = M^T a and u_i = max_j M_ij, i = 1, ..., n. Then we define the following constants:

U_c = b_m^{-1} [ max_{1≤j≤m-1} µ_j + η Σ_{i=1}^n a_i log a_i − η + f(β^(0)) ],
A_i = η log a_i − U_c − η log[ e^{−λ(M_im + U_c)} + Σ_{j=1}^{m-1} e^{−λ M_ij} ],   i = 1, ..., n,
M_1 = λ · (n − m + 2)/(2(n − m + 1)) · min_{1≤i≤n} e^{λ(A_i − M_im)},   M_2 = λ[1 − Σ_{i=1}^n e^{λ(A_i − M_im)}],
M_3 = M_2 − log M_1 − 1,   M_4 = m − 1 + m_0 M_2 − m_0[log M_1 − log(1 + m_0 M_2)],
C_1 = 2/M_1,   C_2 = 2 M_1^{-1} M_2^2,   r = 1 − c_1(1 − c_2)(M_1/M_2) e^{−(M_3 + M_4)}.

For Theorem 4,

C_S = 4λC_1 [∥∇_M S∥_F + 2λ∥T*∥_F (C_v + C_u)],
C_v = 2σ(∥µ̃_c∥ + 3∥T̃*^T(a^{-1} ⊙ µ_r)∥ + 3∥D∥_F ∥s̃_v∥),
C_u = ∥µ_r∥ + 2C_v ∥diag(a^{-1}) T*∥_F + ∥a^{-1} ⊙ (T* s_v)∥.

B ADDITIONAL EXPERIMENT DETAILS B.1 SETTINGS OF THE MOTIVATING EXAMPLE

Consider a small problem with n = 90 and m = 60. Let $x_i = 5(i-1)/(n-1)$, $i = 1, \ldots, n$, be equally-spaced points on [0, 5], and similarly define $y_j = 5(j-1)/(m-1)$, $j = 1, \ldots, m$. The cost matrix is set to $M_{ij} = (x_i - y_j)^2$, and the weights a and b are specified as follows. Let $f_1$ be the density function of an exponential distribution with mean 1, and $f_2$ be the density function of a mixture of two normal distributions, $0.2 \cdot N(1, 0.04) + 0.8 \cdot N(3, 0.25)$. We then set $\tilde{a}_i = f_1(x_i)$, $\tilde{b}_j = f_2(y_j)$, $a_i = \tilde{a}_i / \sum_{k=1}^n \tilde{a}_k$, and $b_j = \tilde{b}_j / \sum_{k=1}^m \tilde{b}_k$.

We fix the regularization parameter η to 0.001. This value is selected such that the resulting Sinkhorn plan $T^*_\lambda$ is visually close to the OT plan $P^*$. In Figure 5, we show the Wasserstein and Sinkhorn transport plans under different values of η. It can be seen that when η ≤ 0.001, $T^*_\lambda$ is visually indistinguishable from $P^*$. We compute the true $T^*_\lambda$ using the ε-scaling algorithm (Algorithm 3 of Schmitzer, 2019). This algorithm is typically accurate, but it requires solving a sequence of Sinkhorn problems with increasing λ's, where the solution corresponding to the previous λ is used as a warm start for the next one. Its computational cost is therefore typically large, and it does not compare fairly with other methods. For this reason, in this article we mainly use the ε-scaling algorithm to compute high-precision reference values, and do not include it in method comparisons.

The experiments below compare algorithms using two error metrics: the error on the transport plan, $\mathrm{Err}_{plan}(T) = \sum_{i,j} (T_{ij} - T^*_{ij})^2$, and the error on the Sinkhorn loss value, $\mathrm{Err}_{loss}(T) = |\langle T, M \rangle - \langle T^*, M \rangle| = |\langle T - T^*, M \rangle|$. For each configuration of the experiment, we simulate the data 100 times, and visualize the distribution of the errors using boxplots. In our experiment, we fix n = 150, m = 200, and consider varying dimensions p = 1, 10, 50. The Sinkhorn regularization parameters compared are η = 0.01, 0.001, and for each η we set a specific maximum number of iterations for all algorithms.
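The discretized setup of the motivating example can be sketched in a few lines; this is an illustrative reconstruction with our own variable names, not the exact experiment code (note that the mixture components are parameterized by standard deviations 0.2 and 0.5, i.e., variances 0.04 and 0.25):

```python
import numpy as np
from scipy.stats import expon, norm

n, m = 90, 60
x = 5.0 * np.arange(n) / (n - 1)     # equally-spaced points on [0, 5]
y = 5.0 * np.arange(m) / (m - 1)
M = (x[:, None] - y[None, :]) ** 2   # squared-distance cost matrix

f1 = expon.pdf(x, scale=1.0)         # Exp(1) density at the grid points
# mixture 0.2*N(1, 0.04) + 0.8*N(3, 0.25); scale is the standard deviation
f2 = 0.2 * norm.pdf(y, loc=1.0, scale=0.2) + 0.8 * norm.pdf(y, loc=3.0, scale=0.5)
a = f1 / f1.sum()                    # normalize to probability weights
b = f2 / f2.sum()
eta = 0.001                          # regularization parameter
```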
Two data generation models are considered: (a) both $f_1(x)$ and $f_2(y)$ are multivariate normal distributions $N(0, I_p)$; (b) both $f_1(x)$ and $f_2(y)$ have independent components, where each marginal distribution of $f_1$ is an exponential distribution with mean 1, and each marginal distribution of $f_2$ is a mixture of two normal distributions, $0.2 \cdot N(1, 0.04) + 0.8 \cdot N(3, 0.25)$. The final results are shown in Figure 6, where all errors are plotted on the log scale. In Figure 6, many boxplots for the Stabilized and Greenkhorn algorithms are missing, since these algorithms produce all-NaN values in the 100 simulations due to numerical overflows. For Sinkhorn, even though it generates no NaN values explicitly for η = 0.001, it does not give meaningful results either. This implies that numerical stability is a critical issue in computing the Sinkhorn loss. Sinkhorn-log gives reasonably small errors when the regularization is large (η = 0.01) and the number of iterations is sufficient, but its accuracy quickly deteriorates when η decreases to 0.001. Moreover, Sinkhorn-log is sensitive to the limit on the number of iterations. For example, when p = 1 and η = 0.01, the loss value error can be as small as $10^{-6}$ given a maximum of 1000 iterations, but if we restrict the limit to 200, the error can be as large as $10^{-2}$ or even $10^{0}$, depending on the data distribution. In contrast, the maximum number of iterations has only a minor effect on the L-BFGS algorithm, indicating that it converges fast and does not need additional iterations. These findings demonstrate the advantage of the advocated L-BFGS method in both numerical stability and accuracy.
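A sketch of the second data generation model is given below; `simulate_model_b` and the other names are our own, and this is an illustration of the sampling scheme rather than the exact experiment code:

```python
import numpy as np

def simulate_model_b(n, m, p, rng):
    """Draw x_i and y_j with independent coordinates: each coordinate of x
    is Exp(1), each coordinate of y is 0.2*N(1, 0.04) + 0.8*N(3, 0.25)."""
    X = rng.exponential(scale=1.0, size=(n, p))
    comp = rng.random((m, p)) < 0.2               # mixture component indicator
    Y = np.where(comp,
                 rng.normal(1.0, 0.2, size=(m, p)),   # sd 0.2 -> variance 0.04
                 rng.normal(3.0, 0.5, size=(m, p)))   # sd 0.5 -> variance 0.25
    return X, Y

X, Y = simulate_model_b(150, 200, 10, np.random.default_rng(1))
```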

B.3 RUNNING TIME OF FORWARD AND BACKWARD PASSES

In this section we compare the running time of three algorithms for differentiating the Sinkhorn loss. (a) The proposed algorithm "Analytic": L-BFGS in the forward pass, and analytic differentiation in the backward pass. (b) "Implicit", implemented in the OTT-JAX library (Cuturi et al., 2022): Sinkhorn-log in the forward pass, and implicit differentiation in the backward pass. (c) "Unroll", implemented in the OTT-JAX library: Sinkhorn-log in the forward pass, and unrolled automatic differentiation in the backward pass. We use the second data generation model in Section B.2 to simulate data points, and use the three algorithms above to compute the Sinkhorn loss and its derivative with respect to the cost matrix. For each configuration, we randomly generate the data 100 times, and compute the mean forward time and mean total time. The stopping rule implemented in the OTT-JAX library is $\|T 1_m - a\| + \|T^T 1_n - b\| < \varepsilon_{ott}$, where one of the two terms is exactly zero in the last iteration. The stopping rule for L-BFGS is $\|T^T 1_n - b\|_\infty < \varepsilon_{lbfgs}$. To account for this difference, we set $\varepsilon_{lbfgs} = 10^{-6}$, and let $\varepsilon_{ott} = \max\{n, m\} \cdot \varepsilon_{lbfgs}$. In fact, this setting favors the competing methods, as their stopping criterion is strictly weaker than the proposed one. To test whether the algorithms actually converge under the given criteria, we also report the number of converging cases within the 100 repetitions. Results for different data dimensions and regularization parameters are given in Table 2 and Table 3, where the former uses a maximum of 1000 iterations and the latter 10000. We further experiment on more simulated data to compare the computational efficiency of Analytic and Implicit; we do not include Unroll since it is too time-consuming. The results are given in Figure 7, as analogues of Figure 3. Three more simulated data sets are studied, and we also include the Sinkhorn loss as a metric to evaluate model performance over time.
For the deep generative models on MNIST and Fashion-MNIST data in Section 7.4, the architectures of the generators are given in Table 4, where $\mathrm{FC}_d$ stands for a fully-connected layer with d output dimensions, and $\mathrm{Conv}_{c,k,s,p_{in},p_{out}}$ is the transposed convolutional layer with c output channels, kernel size k, stride s, input padding $p_{in}$, and output padding $p_{out}$. Both models are trained using the Adam optimizer with learning rate 0.0001, on mini-batches of size 600. The Sinkhorn regularization parameter is set to η = 0.1, and the training process consists of two stages. In the first stage, we use the squared $L_2$ distance to construct the cost matrix, and in the second stage we switch to the $L_1$ distance. The intuition is that the squared $L_2$ distance has smoother derivatives, thereby making the training more stable in early steps, whereas the $L_1$ distance makes the generated images sharper. The two stages are run for 20 and 30 epochs, respectively. The Wasserstein distance values in Figure 4 are computed in the first stage.

The WAE model solves the following optimization problem:
$$\min_{Q, G} \; \mathbb{E}_{p_X} \|X - G(Q(X))\|^2 + \xi \cdot D(p_Z, p_{Q(X)}), \qquad (7)$$
where $p_X$, $p_{Q(X)}$, and $p_Z$ are the distribution of data points X, the distribution of Q(X), and a pre-specified latent distribution, respectively, and ξ is a regularization parameter that balances the two terms. The first term in (7) is the reconstruction error, and the second term quantifies the divergence of the distribution of Q(X) from the latent distribution $p_Z$.
We simply take $p_Z$ to be a multivariate standard normal distribution, and use the Sinkhorn divergence (Genevay et al., 2018) to define $D(\cdot, \cdot)$:
$$D(\mu, \nu) = S_\lambda(\mu, \nu) - \tfrac{1}{2} S_\lambda(\mu, \mu) - \tfrac{1}{2} S_\lambda(\nu, \nu),$$
where µ and ν are two discrete distributions, and with a slight abuse of notation, $S_\lambda(\mu, \nu)$ is the Sinkhorn loss studied in this article. In the actual implementation, µ and ν are Diracs of data points from $p_Z$ and $p_{Q(X)}$, respectively. To generate new images, we sample latent data points $Z_1, \ldots, Z_n$ from the latent distribution $p_Z$, and then pass them to the generator to obtain images $Y_i = G(Z_i)$, $i = 1, \ldots, n$. Since the focus of this article is on the computation and differentiation of the Sinkhorn loss, we do not attempt to build and train the model with full complexity. Instead, to compare computing time, we only run 10 epochs for illustration purposes. The architectures of the encoder and the decoder are given in Table 5, and we set the latent dimension to r = 64, the WAE regularization parameter to ξ = 1, and the Sinkhorn loss parameter to η = 0.1. We use the Adam optimizer with a learning rate of 0.001 and a mini-batch size of 500. We have found that the Implicit method generates NaNs after 72 mini-batch iterations, so we only show the results for the proposed Analytic algorithm and the existing Unroll method.

For the motivating example in Section 3.1 and the experiments in Section B.2, results for Sinkhorn's algorithm, Sinkhorn-log, Stabilized, and Greenkhorn are computed using the Python Optimal Transport (POT) library (Flamary et al., 2021). For the experiments in Sections 7.2, 7.3, B.3, B.4, and B.6, the Implicit and Unroll algorithms are implemented in the OTT-JAX library (Cuturi et al., 2022).
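The Sinkhorn divergence above can be sketched on top of any solver for $S_\lambda$; the snippet below uses a basic log-domain Sinkhorn fixed-point iteration for brevity (not the L-BFGS forward pass advocated in this article), and all names are our own illustrative choices:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_loss(X, Y, a, b, eta, iters=500):
    """<T*, M> for the squared-L2 cost, via log-domain Sinkhorn iterations."""
    M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    lam = 1.0 / eta
    g = np.zeros(len(b))
    for _ in range(iters):
        f = eta * (np.log(a) - logsumexp(lam * (g[None, :] - M), axis=1))
        g = eta * (np.log(b) - logsumexp(lam * (f[:, None] - M), axis=0))
    T = np.exp(lam * (f[:, None] + g[None, :] - M))
    return float((T * M).sum())

def sinkhorn_divergence(X, Y, a, b, eta):
    """D(mu, nu) = S(mu, nu) - S(mu, mu)/2 - S(nu, nu)/2."""
    return (sinkhorn_loss(X, Y, a, b, eta)
            - 0.5 * sinkhorn_loss(X, X, a, a, eta)
            - 0.5 * sinkhorn_loss(Y, Y, b, b, eta))
```

By construction the divergence of a distribution with itself is exactly zero, which is the debiasing property that motivates its use in the WAE objective.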

C PROOFS OF THEOREMS C.1 TECHNICAL LEMMAS

In this section we state a few technical lemmas that are used to prove our main theorems. Lemma 1 to Lemma 3 below are standard conclusions in vector calculus, Lemma 4 and Lemma 5 are derived from eigenvalue theory, Lemma 6 concerns the log-sum-exp function, and Lemma 7 and Lemma 8 are related to the Sinkhorn problem.

We first introduce the following notation. For any $x \in \mathbb{R}$, let $[x]_+ = \max\{x, 0\}$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$, the vectorization operator $\mathrm{vec}(A)$ creates a vector by stacking the column vectors of A together, i.e., $\mathrm{vec}(A) = (a_{11}, \ldots, a_{n1}, a_{12}, \ldots, a_{n2}, \ldots, a_{1m}, \ldots, a_{nm})^T$. For two matrices $A = (a_{ij}) \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{p \times q}$, the Kronecker product of A and B is defined as
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1m} B \\ \vdots & \ddots & \vdots \\ a_{n1} B & \cdots & a_{nm} B \end{pmatrix}.$$
For a differentiable vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$, the partial derivative of f with respect to x is defined as
$$\frac{\partial f(x)}{\partial x^T} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}.$$
We use $I_n$ to denote the $n \times n$ identity matrix, and $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ stand for the largest and smallest eigenvalues of a symmetric matrix, respectively.

Lemma 1. Given two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times r}$, $\mathrm{vec}(AB) = (I_r \otimes A)\mathrm{vec}(B) = (B^T \otimes I_m)\mathrm{vec}(A)$.

Lemma 2. Let $f: \mathbb{R}^m \to \mathbb{R}^n$ and $g: \mathbb{R}^m \to \mathbb{R}^n$ be two vector-valued differentiable functions of $x \in \mathbb{R}^m$. Then
$$\frac{\partial}{\partial x^T} \left[ f(x)^T g(x) \right] = g(x)^T \frac{\partial f(x)}{\partial x^T} + f(x)^T \frac{\partial g(x)}{\partial x^T}.$$

Lemma 3. Let $f: \mathbb{R}^m \to \mathbb{R}^l$ and $g: \mathbb{R}^m \to \mathbb{R}^r$ be two vector-valued differentiable functions of $x \in \mathbb{R}^m$. Then
$$\frac{\partial}{\partial x^T} \mathrm{vec}(f(x) g(x)^T) = (g(x) \otimes I_l) \frac{\partial f(x)}{\partial x^T} + (I_r \otimes f(x)) \frac{\partial g(x)}{\partial x^T}.$$

Lemma 4. Let A and B be two $n \times n$ positive definite matrices, and let $\alpha_1 \ge \cdots \ge \alpha_n > 0$ and $\beta_1 \ge \cdots \ge \beta_n > 0$ be the ordered eigenvalues of A and B, respectively. Then for any $x \in \mathbb{R}^n$, $x^T A^{1/2} B A^{1/2} x \le \alpha_1 \beta_1 \|x\|^2$.

Proof. Let $U_{1 \times n} = x^T A^{1/2}$, and then $u := U U^T = x^T A x \le \alpha_1 \|x\|^2$, and $U B U^T = x^T A^{1/2} B A^{1/2} x$.
By Theorem A.4 on page 788 of Marshall et al. (2011), we immediately get $U B U^T = \mathrm{tr}(U B U^T) \le \beta_1 u \le \alpha_1 \beta_1 \|x\|^2$.

Lemma 5. Let A and B be two symmetric matrices of the same size. Then
$$\sigma_{\min}(A) + \sigma_{\min}(B) \le \sigma_{\min}(A + B) \le \sigma_{\max}(A + B) \le \sigma_{\max}(A) + \sigma_{\max}(B).$$

Proof. Using the well-known identity $\sigma_{\max}(A) = \max_{\|x\|=1} x^T A x$, we have
$$\sigma_{\max}(A + B) = \max_{\|x\|=1} x^T (A + B) x \le \max_{\|x\|=1} x^T A x + \max_{\|x\|=1} x^T B x = \sigma_{\max}(A) + \sigma_{\max}(B).$$
Applying this inequality to $-(A + B)$ gives the result in the opposite direction.

Lemma 6. Let $x = (x_1, \ldots, x_n)^T$ and $y = (y_1, \ldots, y_n)^T$ be two vectors, and define $\mathrm{LSE}(x) = \log \sum_{i=1}^n e^{x_i}$. Then for any $x, y \in \mathbb{R}^n$,
$$-\|x - y\|_\infty \le \min_i (x_i - y_i) \le \mathrm{LSE}(x) - \mathrm{LSE}(y) \le \max_i (x_i - y_i) \le \|x - y\|_\infty.$$

Proof. It is easy to see that $\nabla_x \mathrm{LSE}(x) = \mathrm{softmax}(x) = (s_1, \ldots, s_n)^T$, where $s_i = e^{x_i} / \sum_{k=1}^n e^{x_k} \in (0, 1)$. By the mean value theorem, $\mathrm{LSE}(x) - \mathrm{LSE}(y) = \mathrm{softmax}(w x + (1 - w) y)^T (x - y)$ for some $0 < w < 1$. Let $z = \mathrm{softmax}(w x + (1 - w) y)$, and then
$$\mathrm{LSE}(x) - \mathrm{LSE}(y) = \sum_{i=1}^n z_i (x_i - y_i) \le \max_i (x_i - y_i) \cdot \sum_{i=1}^n z_i = \max_i (x_i - y_i) \le \|x - y\|_\infty,$$
and similarly $\mathrm{LSE}(x) - \mathrm{LSE}(y) \ge \min_i (x_i - y_i) \ge -\|x - y\|_\infty$.

Lemma 7. Let $f(\beta)$ be defined as in (6), and let $\mu = M^T a$ and $u_i = \max_j M_{ij}$, $i = 1, \ldots, n$. If $f(\beta) \le c$, then we have $\max_j \beta_j \le U_c$ and $\min_j \beta_j \ge L_c$, where
$$U_c = b_m^{-1} \Big[ \max_{1 \le j \le m-1} \mu_j + \eta \sum_{i=1}^n a_i \log a_i - \eta + c \Big]_+ \qquad (8)$$
$$L_c = -\Big( \min_{1 \le j \le m-1} b_j \Big)^{-1} \Big[ \sum_{i=1}^n a_i u_i + \eta \sum_{i=1}^n a_i \log a_i - \eta + c \Big]_+. \qquad (9)$$

Proof. By definition,
$$f(\beta) = \eta \sum_{i=1}^n a_i \log \Big( \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})} \Big) - \eta \sum_{i=1}^n a_i \log a_i - \beta^T b + \eta.$$
If $f(\beta) \le c$, then
$$c_0 := c + \eta \sum_{i=1}^n a_i \log a_i - \eta \ge \eta \sum_{i=1}^n a_i \log \Big( \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})} \Big) - \beta^T b.$$
By definition we have $\beta_m = 0$, and let $J = \arg\max_{1 \le j \le m-1} \beta_j$.
Then
$$c_0 \ge \eta \sum_{i=1}^n a_i \log e^{\lambda (\beta_J - M_{iJ})} - \beta^T b = \sum_{i=1}^n a_i (\beta_J - M_{iJ}) - \sum_{j=1}^{m-1} \beta_j b_j \ge \beta_J - \sum_{i=1}^n a_i M_{iJ} - \beta_J \sum_{j=1}^{m-1} b_j = b_m \beta_J - \mu_J \ge b_m \beta_J - \max_{1 \le j \le m-1} \mu_j,$$
which verifies (8) by noting that $\max_j \beta_j = [\beta_J]_+$. Next, let $K = \arg\min_{1 \le j \le m-1} \beta_j$. We can assume that $\beta_K < 0$, since otherwise the trivial bound $\min_j \beta_j = \beta_m = 0$ is already met. Consider the sets $S_+ = \{j: \beta_j > 0\}$ and $S_- = \{j: \beta_j < 0\}$. Then clearly,
$$\beta^T b = \beta_K b_K + \sum_{j \ne K, \, j \in S_+} \beta_j b_j + \sum_{j \ne K, \, j \in S_-} \beta_j b_j \le \beta_K b_K + [\beta_J]_+ \sum_{j \ne K, \, j \in S_+} b_j + 0 \le \beta_K b_K + [\beta_J]_+ (1 - b_m).$$
Also note that $\log(\sum_{j=1}^m e^{x_j}) \ge \max_j x_j$ for any $x_1, \ldots, x_m \in \mathbb{R}$, so
$$\eta \sum_{i=1}^n a_i \log \Big( \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})} \Big) \ge \eta \max_j (\lambda \beta_j) - \sum_{i=1}^n a_i u_i = \max_j \beta_j - \sum_{i=1}^n a_i u_i = [\beta_J]_+ - \sum_{i=1}^n a_i u_i.$$
As a result,
$$c_0 \ge \eta \sum_{i=1}^n a_i \log \Big( \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})} \Big) - \beta^T b \ge [\beta_J]_+ - \sum_{i=1}^n a_i u_i - \beta_K b_K - [\beta_J]_+ (1 - b_m) = b_m [\beta_J]_+ - \sum_{i=1}^n a_i u_i - \beta_K b_K \ge -\sum_{i=1}^n a_i u_i - \beta_K b_K,$$
and then
$$\beta_K \ge -b_K^{-1} \Big[ \sum_{i=1}^n a_i u_i + c_0 \Big]_+ \ge -\Big( \min_{1 \le j \le m-1} b_j \Big)^{-1} \Big[ \sum_{i=1}^n a_i u_i + c_0 \Big]_+,$$
which verifies (9).

Lemma 8. Let T be an $n \times m$ matrix with strictly positive entries, and suppose that $n \ge m$. Define $\mu = T 1_m$, $\nu = T^T 1_n$, and
$$H = \begin{pmatrix} \mathrm{diag}(\mu) & T \\ T^T & \mathrm{diag}(\nu) \end{pmatrix}, \qquad D = \mathrm{diag}(\nu) - T^T \mathrm{diag}(\mu)^{-1} T,$$
where in H and D the last column of T and the last entry of ν are excluded, so that H is $(n+m-1) \times (n+m-1)$ and D is $(m-1) \times (m-1)$. Then
$$\sigma_{\max}(D) \le \max_{1 \le j \le m-1} \nu_j, \qquad \sigma_{\min}(D) \ge \sigma_{\min}(H) \ge \frac{n-m+2}{2(n-m+1)} \cdot \min_{1 \le i \le n} T_{im}, \qquad \sigma_{\min}(D) \ge \min_{1 \le i \le m-1} \sum_{j=1}^{m-1} D_{ij} = \min_{1 \le j \le m-1} \sum_{i=1}^n \mu_i^{-1} T_{ij} T_{im}.$$

Proof. Consider the matrix $S = H - sJ$, where J is an $(n+m-1) \times (n+m-1)$ matrix of ones, and s is a positive scalar. Let $R_k = \sum_{j \ne k} |S_{kj}|$, $k = 1, \ldots, n+m-1$. Suppose $s \le \min_{1 \le i \le n, \, 1 \le j \le m-1} T_{ij}$. Then for $k = 1, \ldots, n$, it is easy to find that
$$R_k = (n-1)s + \sum_{j=1}^{m-1} (T_{kj} - s) = (n-1)s + \mu_k - T_{km} - (m-1)s = (n-m)s + \mu_k - T_{km},$$
and for $k = n+1, \ldots, n+m-1$,
$$R_k = (m-2)s + \sum_{i=1}^n (T_{i,k-n} - s) = (m-2)s + \nu_{k-n} - ns = (m-n-2)s + \nu_{k-n}.$$
Then it is easy to see that
$$S_{kk} - R_k = \mu_k - R_k = T_{km} - (n-m)s, \quad k = 1, \ldots, n; \qquad S_{kk} - R_k = \nu_{k-n} - R_k = (n+2-m)s, \quad k = n+1, \ldots, n+m-1.$$
Setting $\min_{1 \le i \le n} T_{im} - (n-m)s = (n+2-m)s$ gives $s = \min_{1 \le i \le n} T_{im} / (2n - 2m + 2)$, and
$$S_{kk} - R_k \ge L := \frac{n-m+2}{2(n-m+1)} \cdot \min_{1 \le i \le n} T_{im} > 0$$
for all k. By the Gershgorin circle theorem, every eigenvalue of S must be greater than L. Since $H = S + sJ$ and J is nonnegative definite, we also have $\sigma_{\min}(H) \ge L > 0$, implying that H is positive definite. For the second formula, it is easy to find that the D matrix is the Schur complement of the block $\mathrm{diag}(\mu)$ in the H matrix, so by Theorem 3.1 of Fan (2002) we have $\sigma_{\min}(D) \ge \sigma_{\min}(H)$ and $\sigma_{\max}(D) \le \sigma_{\max}(\mathrm{diag}(\nu)) = \max_{1 \le j \le m-1} \nu_j$. Finally, let $c = \max_{1 \le j \le m-1} \nu_j$; then D can be expressed as $D = c I_{m-1} - B$, where $B = T^T \mathrm{diag}(\mu)^{-1} T + \mathrm{diag}(c 1_{m-1} - \nu)$ has nonnegative entries. In addition, we have proved that D is positive definite, so D is a nonsingular M-matrix by the definition in Tian & Huang (2010). Theorem 3.2 of Tian & Huang (2010) then shows that $\sigma_{\min}(D) \ge \min_{1 \le i \le m-1} \sum_{j=1}^{m-1} D_{ij}$. Let $\delta = D 1_{m-1}$; clearly $\min_{1 \le i \le m-1} \sum_{j=1}^{m-1} D_{ij} = \min_i \delta_i$. Note that
$$\delta = D 1_{m-1} = \nu - T^T \mathrm{diag}(\mu)^{-1} T 1_{m-1} = \nu - T^T \mathrm{diag}(\mu)^{-1} (T 1_m - T_m) = \nu - T^T 1_n + T^T \mathrm{diag}(\mu)^{-1} T_m = T^T \mathrm{diag}(\mu)^{-1} T_m,$$
where $T_m$ stands for the m-th column of T. Therefore, $\min_{1 \le i \le m-1} \delta_i = \min_{1 \le j \le m-1} \sum_{i=1}^n \mu_i^{-1} T_{ij} T_{im}$.

C.2 PROOF OF (6)

Let $T = e^\lambda [\alpha \oplus \beta - M]$, and then it is easy to find that $\nabla_\alpha L(\alpha, \beta) = a - T 1_m$ and $\nabla_\beta L(\alpha, \beta) = b - T^T 1_n$. Since $\alpha^*(\beta) = \arg\max_\alpha L(\alpha, \beta)$, we find that $\alpha_i \equiv \alpha^*(\beta)_i$ is the solution to the equation $a - T 1_m = 0$. By definition, we have
$$a_i = \sum_{j=1}^m e^{\lambda (\alpha_i + \beta_j - M_{ij})} = e^{\lambda \alpha_i} \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})}, \quad i = 1, \ldots, n,$$
so the solution is
$$\alpha_i = \eta \log a_i - \eta \log \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})}.$$
Since $T 1_m = a$, we immediately get $1_n^T T 1_m = 1$, so $L(\alpha^*(\beta), \beta) = \alpha^*(\beta)^T a + \beta^T b - \eta 1_n^T T 1_m = \alpha^*(\beta)^T a + \beta^T b - \eta$, and we get the expression for $f(\beta) = -L(\alpha^*(\beta), \beta)$.
Finally,
$$\nabla_\beta L(\alpha^*(\beta), \beta) = \Big( \frac{\partial \alpha^*(\beta)}{\partial \beta^T} \Big)^T \nabla_\alpha L(\alpha, \beta) \big|_{\alpha = \alpha^*(\beta)} + \nabla_\beta L(\alpha, \beta) \big|_{\alpha = \alpha^*(\beta)}.$$
Since $\alpha^*(\beta) = \arg\max_\alpha L(\alpha, \beta)$ implies $\nabla_\alpha L(\alpha, \beta)|_{\alpha = \alpha^*(\beta)} = 0$, we have $\nabla_\beta L(\alpha^*(\beta), \beta) = \nabla_\beta L(\alpha, \beta)|_{\alpha = \alpha^*(\beta)}$, and hence $\nabla_\beta f(\beta) = -\nabla_\beta L(\alpha^*(\beta), \beta) = T(\beta)^T 1_n - b$.

It is easy to find that $\partial \mathrm{vec}(T^*) / \partial \mathrm{vec}(R)^T$ is an $(nm) \times (nm)$ diagonal matrix with diagonal elements $\mathrm{vec}(\lambda T^*)$, so
$$\mathrm{vec}(M)^T \frac{\partial \mathrm{vec}(T^*)}{\partial \mathrm{vec}(R)^T} = \lambda \, \mathrm{vec}(M \odot T^*)^T. \qquad (12)$$
Furthermore,
$$\frac{\partial \mathrm{vec}(R)}{\partial \mathrm{vec}(M)^T} = \frac{\partial \mathrm{vec}(\alpha^* 1_m^T + 1_n \beta^{*T} - M)}{\partial \mathrm{vec}(M)^T} = (1_m \otimes I_n) \frac{\partial \alpha^*}{\partial \mathrm{vec}(M)^T} + (I_m \otimes 1_n) \frac{\partial \beta^*}{\partial \mathrm{vec}(M)^T} - I_{(nm)}, \qquad (13)$$
where the second identity is an application of Lemma 3. Combining (11), (12), and (13), we get
$$\mathrm{vec}(M)^T \frac{\partial \mathrm{vec}(T^*)}{\partial \mathrm{vec}(M)^T} = \lambda \, \mathrm{vec}(M \odot T^*)^T (1_m \otimes I_n) \frac{\partial \alpha^*}{\partial \mathrm{vec}(M)^T} + \lambda \, \mathrm{vec}(M \odot T^*)^T (I_m \otimes 1_n) \frac{\partial \beta^*}{\partial \mathrm{vec}(M)^T} - \lambda \, \mathrm{vec}(M \odot T^*)^T,$$
which is the first term of (10). Using the identities in Lemma 1, we have
$$[\mathrm{vec}(M \odot T^*)^T (1_m \otimes I_n)]^T = (1_m^T \otimes I_n) \mathrm{vec}(M \odot T^*) = \mathrm{vec}((M \odot T^*) 1_m) := \mu_r,$$
$$[\mathrm{vec}(M \odot T^*)^T (I_m \otimes 1_n)]^T = (I_m \otimes 1_n^T) \mathrm{vec}(M \odot T^*) = \mathrm{vec}(1_n^T (M \odot T^*)) := \mu_c.$$
Since we have set $\beta^*_m = 0$, (10) simplifies to
$$\frac{\partial S_\lambda(M, a, b)}{\partial \mathrm{vec}(M)^T} = \lambda \Big( \mu_r^T \frac{\partial \alpha^*}{\partial \mathrm{vec}(M)^T} + \tilde{\mu}_c^T \frac{\partial \tilde{\beta}^*}{\partial \mathrm{vec}(M)^T} - \mathrm{vec}(M \odot T^*)^T \Big) + \mathrm{vec}(T^*)^T, \qquad (14)$$
where $\tilde{\mu}_c$ and $\tilde{\beta}^*$ remove the last elements of $\mu_c$ and $\beta^*$, respectively. Let $w^* = (\alpha^{*T}, \tilde{\beta}^{*T})^T$; the main challenge is to calculate $\partial w^* / \partial \mathrm{vec}(M)^T$. First, note that the optimality condition for $(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta} L(\alpha, \beta)$ is
$$\nabla_\alpha L(\alpha, \beta) \big|_{(\alpha, \beta) = (\alpha^*, \beta^*)} = 0, \qquad \nabla_\beta L(\alpha, \beta) \big|_{(\alpha, \beta) = (\alpha^*, \beta^*)} = 0.$$
Section C.2 has shown that $\nabla_\alpha L(\alpha, \beta) = a - T 1_m$ and $\nabla_\beta L(\alpha, \beta) = b - T^T 1_n$. Moreover,
$$\nabla^2_\alpha L(\alpha, \beta) = -\lambda \, \mathrm{diag}(T 1_m), \qquad \nabla^2_{\tilde{\beta}} L(\alpha, \beta) = -\lambda \, \mathrm{diag}(\tilde{T}^T 1_n), \qquad \nabla_{\tilde{\beta}} (\nabla_\alpha L(\alpha, \beta)) = -\lambda \tilde{T},$$
where $\tilde{T}$ denotes T with its last column removed.
Define the function
$$F(w, M) = \begin{pmatrix} \nabla_\alpha L(\alpha, \beta) \\ \nabla_{\tilde{\beta}} L(\alpha, \beta) \end{pmatrix} = \begin{pmatrix} a - T 1_m \\ \tilde{b} - \tilde{T}^T 1_n \end{pmatrix},$$
where $w = (\alpha^T, \tilde{\beta}^T)^T$. Then $w^*$ satisfies the equation $F(w^*, M) = 0$, indicating that $w^*$ is implicitly a function of M, written as $w^* = w(M)$. By the implicit function theorem,
$$\frac{\partial w(M)}{\partial \mathrm{vec}(M)^T} = -\Big[ \frac{\partial F(w, M)}{\partial w^T} \Big|_{w = w^*} \Big]^{-1} \frac{\partial F(w, M)}{\partial \mathrm{vec}(M)^T} \Big|_{w = w^*} := -F_w^{-1} F_M.$$
Note that
$$F_w = F_w^T = \begin{pmatrix} \nabla^2_\alpha L(\alpha, \beta) & \nabla_{\tilde{\beta}} (\nabla_\alpha L(\alpha, \beta)) \\ \nabla_{\tilde{\beta}} (\nabla_\alpha L(\alpha, \beta))^T & \nabla^2_{\tilde{\beta}} L(\alpha, \beta) \end{pmatrix} := -\lambda \begin{pmatrix} A & \tilde{B} \\ \tilde{B}^T & \tilde{D} \end{pmatrix}.$$
Then by the inversion formula for block matrices, we have
$$F_w^{-1} = -\lambda^{-1} \begin{pmatrix} A & \tilde{B} \\ \tilde{B}^T & \tilde{D} \end{pmatrix}^{-1} = -\lambda^{-1} \begin{pmatrix} A^{-1} + A^{-1} \tilde{B} \tilde{\Delta}^{-1} \tilde{B}^T A^{-1} & -A^{-1} \tilde{B} \tilde{\Delta}^{-1} \\ -\tilde{\Delta}^{-1} \tilde{B}^T A^{-1} & \tilde{\Delta}^{-1} \end{pmatrix},$$
where $\tilde{\Delta} = \tilde{D} - \tilde{B}^T A^{-1} \tilde{B}$. For $g = (\mu_r^T, \tilde{\mu}_c^T)^T$, the vector $s = (s_u^T, \tilde{s}_v^T)^T = -\lambda F_w^{-1} g$ has the following expression:
$$\tilde{s}_v = -\tilde{\Delta}^{-1} \tilde{B}^T A^{-1} \mu_r + \tilde{\Delta}^{-1} \tilde{\mu}_c, \qquad s_u = A^{-1} \mu_r + A^{-1} \tilde{B} \tilde{\Delta}^{-1} \tilde{B}^T A^{-1} \mu_r - A^{-1} \tilde{B} \tilde{\Delta}^{-1} \tilde{\mu}_c = A^{-1} \mu_r - A^{-1} \tilde{B} \tilde{s}_v.$$
After some simplification, we obtain
$$\tilde{\Delta} = \mathrm{diag}(\tilde{T}^T 1_n) - \tilde{T}^T \mathrm{diag}((T 1_m)^{-1}) \tilde{T}, \qquad \tilde{s}_v = \tilde{\Delta}^{-1} \tilde{\mu}_c - \tilde{\Delta}^{-1} \tilde{T}^T ((T 1_m)^{-1} \odot \mu_r), \qquad s_u = (T 1_m)^{-1} \odot \mu_r - (T 1_m)^{-1} \odot (\tilde{T} \tilde{s}_v).$$
Next, partition $F_M$ as $F_M = (G_M^T, \tilde{H}_M^T)^T$, where $G_M \in \mathbb{R}^{n \times (nm)}$ and $\tilde{H}_M \in \mathbb{R}^{(m-1) \times (nm)}$. By definition, the i-th row of $G_M$ is $\partial G_i / \partial \mathrm{vec}(M)^T$ with $G_i = -\sum_{j=1}^m T_{ij} = -\sum_{j=1}^m e^{\lambda (\alpha_i + \beta_j - M_{ij})}$, so
$$\frac{\partial G_i}{\partial M_{kl}} = \begin{cases} 0, & i \ne k \\ \lambda T_{kl}, & i = k \end{cases}.$$
This indicates that $G_M = \lambda (\mathrm{diag}(T_1), \ldots, \mathrm{diag}(T_m))$, where $T_1, \ldots, T_m$ are the column vectors of T. Similarly, for $H_j = -\sum_{i=1}^n T_{ij}$,
$$\frac{\partial H_j}{\partial M_{kl}} = \begin{cases} 0, & j \ne l \\ \lambda T_{kl}, & j = l \end{cases},$$
implying that the j-th row of $\tilde{H}_M$ places $\lambda T_j^T$ in its j-th column block of size n and zeros elsewhere, with the final (m-th) column block being zero. As a result,
$$\mu_r^T \frac{\partial \alpha^*}{\partial \mathrm{vec}(M)^T} + \tilde{\mu}_c^T \frac{\partial \tilde{\beta}^*}{\partial \mathrm{vec}(M)^T} = -(\mu_r^T, \tilde{\mu}_c^T) F_w^{-1} F_M = (s_u^T, \tilde{s}_v^T) \begin{pmatrix} \lambda^{-1} G_M \\ \lambda^{-1} \tilde{H}_M \end{pmatrix} = \big( (s_u \odot T_1)^T, \ldots, (s_u \odot T_m)^T \big) + \big( \tilde{s}_{v,1} T_1^T, \ldots, \tilde{s}_{v,m-1} T_{m-1}^T, 0_n^T \big) = [\mathrm{vec}(\mathrm{diag}(s_u) T + T \, \mathrm{diag}(s_v))]^T, \qquad (17)$$
where $s_v$ appends a zero to $\tilde{s}_v$.
Finally, substituting (17) into (14), we have
$$\frac{\partial S_\lambda(M, a, b)}{\partial \mathrm{vec}(M)^T} = \lambda \left[ \mathrm{vec}(\mathrm{diag}(s_u) T + T \, \mathrm{diag}(s_v)) - \mathrm{vec}(M \odot T) \right]^T + \mathrm{vec}(T)^T.$$
Transforming back to the matrix form, we obtain
$$\frac{\partial S_\lambda(M, a, b)}{\partial M} = \lambda (s_u \oplus s_v - M) \odot T + T.$$
Replacing T with $T^*$ and noting that $a = T^* 1_m$ and $b = T^{*T} 1_n$, we get the stated result. The positive definiteness of the $\tilde{\Delta}$ matrix is a direct consequence of Lemma 8.
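The analytic derivative of Theorem 1 can be verified numerically against finite differences. The sketch below uses a basic log-domain Sinkhorn fixed-point solver for the forward pass (not the L-BFGS method advocated in the main text); all function and variable names are our own illustrative choices:

```python
import numpy as np
from scipy.special import logsumexp

def solve_plan(M, a, b, eta, iters=1500):
    """High-accuracy log-domain Sinkhorn, returning T* with marginals a, b."""
    lam = 1.0 / eta
    g = np.zeros(len(b))
    for _ in range(iters):
        f = eta * (np.log(a) - logsumexp(lam * (g[None, :] - M), axis=1))
        g = eta * (np.log(b) - logsumexp(lam * (f[:, None] - M), axis=0))
    return np.exp(lam * (f[:, None] + g[None, :] - M))

def grad_sinkhorn_loss(M, a, b, eta):
    """Analytic derivative of S = <T*, M> w.r.t. M (Theorem 1)."""
    lam = 1.0 / eta
    T = solve_plan(M, a, b, eta)
    P = M * T
    mu_r = P.sum(axis=1)               # (M ⊙ T*) 1_m
    mu_c = P.sum(axis=0)[:-1]          # first m-1 entries of (M ⊙ T*)^T 1_n
    Tt = T[:, :-1]                     # drop the last column (beta_m = 0)
    Delta = np.diag(b[:-1]) - Tt.T @ (Tt / a[:, None])
    s_v = np.linalg.solve(Delta, mu_c - Tt.T @ (mu_r / a))
    s_u = (mu_r - Tt @ s_v) / a
    s_v_full = np.append(s_v, 0.0)     # s_v appends a zero
    return lam * (s_u[:, None] + s_v_full[None, :] - M) * T + T
```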

C.4 PROOF OF THEOREM 2

In the proof of Theorem 1 we have already shown that
$$\nabla^2_{\alpha, \tilde{\beta}} L(\alpha, \beta) = -\lambda H := -\lambda \begin{pmatrix} \mathrm{diag}(T 1_m) & \tilde{T} \\ \tilde{T}^T & \mathrm{diag}(\tilde{T}^T 1_n) \end{pmatrix},$$
where $T = e^\lambda [\alpha \oplus \beta - M]$. Plugging $\alpha^*(\beta)$ into $L(\alpha, \beta)$, $\nabla^2_{\tilde{\beta}} L(\alpha^*(\beta), \beta)$ is the Schur complement of the top-left block of $\nabla^2_{\alpha, \tilde{\beta}} L(\alpha, \beta)$, given by
$$\nabla^2_{\tilde{\beta}} L(\alpha^*(\beta), \beta) = -\lambda \big[ \mathrm{diag}(\tilde{T}^T 1_n) - \tilde{T}^T \mathrm{diag}(T 1_m)^{-1} \tilde{T} \big].$$
Since $f(\tilde{\beta}) = -L(\alpha^*(\beta), \beta)$, by Lemma 8 we find that $\nabla^2_{\tilde{\beta}} f(\tilde{\beta})$ is positive definite, so $f(\tilde{\beta})$ is strictly convex in $\tilde{\beta}$, and hence $\tilde{\beta}^*$ is unique.

The optimality conditions for $(\alpha^*, \beta^*)$ are $T^* 1_m = a$ and $T^{*T} 1_n = b$, where $T^* = e^\lambda [\alpha^* \oplus \beta^* - M]$. Since $T^*_{ij} = \exp\{\lambda (\alpha^*_i + \beta^*_j - M_{ij})\} \ge 0$ and $a_i = \sum_j T^*_{ij}$, $b_j = \sum_i T^*_{ij}$, we have $T^*_{ij} \le \min\{a_i, b_j\}$ for all i and j, implying that
$$\alpha^*_i + \beta^*_j \le U_{ij} := M_{ij} + \lambda^{-1} \min\{\log a_i, \log b_j\}.$$
Since $\beta^*_m = 0$ by design, we have $\alpha^*_i \le U_{\alpha_i} < +\infty$, $i = 1, \ldots, n$, where $U_{\alpha_i} = U_{im} = M_{im} + \lambda^{-1} \min\{\log a_i, \log b_m\}$. This shows that $\alpha^*_i$ is bounded from above.

Next, let $I = \arg\max_i T^*_{im}$. Since $T^{*T} 1_n = b$ implies $b_m = \sum_i T^*_{im} \le n T^*_{Im}$, we have $T^*_{Im} = \exp\{\lambda (\alpha^*_I - M_{Im})\} \ge b_m / n$, and hence $\alpha^*_I \ge M_{Im} + \lambda^{-1} \log(b_m / n)$. Again, since $\alpha^*_i + \beta^*_j \le U_{ij}$ for all i and j, it holds that
$$\beta^*_j \le U_{Ij} - \alpha^*_I \le U_{Ij} - M_{Im} - \lambda^{-1} \log(b_m / n) = M_{Ij} - M_{Im} + \lambda^{-1} \min\{\log(n a_I / b_m), \log(n b_j / b_m)\} := U_{\beta_j} < +\infty, \quad j = 1, \ldots, m. \qquad (18)$$
On the other hand, $T^{*T} 1_n = b$ implies that $b_j = \sum_i T^*_{ij} = e^{\lambda \beta^*_j} \sum_i e^{\lambda (\alpha^*_i - M_{ij})}$ for any j, so
$$\log b_j = \lambda \beta^*_j + \log \sum_{i=1}^n e^{\lambda (\alpha^*_i - M_{ij})} \le \lambda \beta^*_j + \log \sum_{i=1}^n e^{\lambda (U_{\alpha_i} - M_{ij})}.$$
It is well known that $\max\{x_1, \ldots, x_n\} \le \log \sum_i e^{x_i} \le \max\{x_1, \ldots, x_n\} + \log n$ for any $x_1, \ldots, x_n \in \mathbb{R}$, so
$$\beta^*_j \ge \lambda^{-1} \log b_j - \lambda^{-1} \log \sum_{i=1}^n e^{\lambda (U_{\alpha_i} - M_{ij})} \ge \lambda^{-1} \log b_j - \max_i (U_{\alpha_i} - M_{ij}) - \lambda^{-1} \log n = \lambda^{-1} \log(b_j / n) - \max_i (U_{\alpha_i} - M_{ij}) := L_{\beta_j} > -\infty, \quad j = 1, \ldots, m. \qquad (19)$$
Then (18) and (19) together show that $|\beta^*_j| < \infty$. Similarly, $T^* 1_m = a$, so
$$\log a_i \le \lambda \alpha^*_i + \log \Big( \sum_{j=1}^m e^{\lambda (U_{\beta_j} - M_{ij})} \Big), \qquad \alpha^*_i \ge \lambda^{-1} \log(a_i / m) - \max_j (U_{\beta_j} - M_{ij}) := L_{\alpha_i} > -\infty, \quad i = 1, \ldots, n.$$
The trivial bounds $L_{\alpha_i}$ and $U_{\beta_j}$ are obtained by removing the unknown index I. The results above verify that $|\alpha^*_i| < \infty$ and $|\beta^*_j| < \infty$, and hence $\|\alpha^*\| < \infty$ and $\|\beta^*\| < \infty$. Finally, plugging $\beta^*$ into the objective function, we immediately get $f^* > -\infty$.
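The elementwise bound $T^*_{ij} \le \min\{a_i, b_j\}$ and the finiteness of the dual potentials established above can be observed numerically. The sketch below uses a basic log-domain Sinkhorn fixed-point iteration (not the L-BFGS solver of the main text, and without the $\beta_m = 0$ normalization); the names are our own:

```python
import numpy as np
from scipy.special import logsumexp

def solve_potentials(M, a, b, eta, iters=2000):
    """Log-domain Sinkhorn; returns dual potentials (alpha, beta) and T*."""
    lam = 1.0 / eta
    g = np.zeros(len(b))
    for _ in range(iters):
        f = eta * (np.log(a) - logsumexp(lam * (g[None, :] - M), axis=1))
        g = eta * (np.log(b) - logsumexp(lam * (f[:, None] - M), axis=0))
    T = np.exp(lam * (f[:, None] + g[None, :] - M))
    return f, g, T
```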

C.5 PROOF OF THEOREM 3

Claims (a) and (b) are direct consequences of the convergence property of the L-BFGS algorithm (Theorem 7.1, Liu & Nocedal, 1989), and we only need to verify its three assumptions. The new results here are explicit expressions for the constants $C_1$, $C_2$, and r. First, f is twice continuously differentiable, so Assumption 7.1(1) of Liu & Nocedal (1989) is verified. Second, f is a closed convex function, and we define the level set of f as $\mathcal{L}_c = \{\tilde{\beta} \in \mathbb{R}^{m-1}: f(\tilde{\beta}) \le c\}$. Theorem 2 has shown that $f^* > -\infty$, and when $c = f^*$, obviously $\mathcal{L}_c = \{\tilde{\beta}^*\}$ is non-empty and bounded. Then Corollary 8.7.1 of Rockafellar (1970) shows that $\mathcal{L}_c$ is bounded for every c. In particular, for a fixed initial value $\tilde{\beta}^{(0)}$, define $\mathcal{L} = \{\tilde{\beta}: f(\tilde{\beta}) \le f(\tilde{\beta}^{(0)})\}$; then $\mathcal{L}$ is a bounded, closed, and convex set, which verifies Assumption 7.1(2) of Liu & Nocedal (1989).

Third, for any $\tilde{\beta} \in \mathcal{L}$ we have $\alpha_i := \alpha^*(\beta)_i \ge A_i$, and hence $T_{im} = e^{\lambda(\alpha_i - M_{im})} \ge e^{\lambda(A_i - M_{im})}$, where $A_i$ is defined in Section A. On the other hand, $T_{ij}$ must satisfy $T_{ij} > 0$ and $\sum_{j=1}^m T_{ij} = a_i$ for any i and j, so for $j = 1, \ldots, m-1$ we have $T_{ij} \le a_i - T_{im}$. Therefore,
$$\nu_j = \sum_{i=1}^n T_{ij} \le \sum_{i=1}^n (a_i - T_{im}) = 1 - \sum_{i=1}^n T_{im} \le 1 - \sum_{i=1}^n e^{\lambda (A_i - M_{im})}, \quad j = 1, \ldots, m-1.$$
This implies that there exist constants $M_1, M_2 > 0$ such that $M_1 \|x\|^2 \le x^T H(\tilde{\beta}) x \le M_2 \|x\|^2$ for all $x \in \mathbb{R}^{m-1}$ and $\tilde{\beta} \in \mathcal{L}$, with
$$M_1 = \lambda \cdot \frac{n-m+2}{2(n-m+1)} \cdot \min_{1 \le i \le n} e^{\lambda (A_i - M_{im})}, \qquad M_2 = \lambda \Big( 1 - \sum_{i=1}^n e^{\lambda (A_i - M_{im})} \Big).$$
This verifies Assumption 7.1(3) of Liu & Nocedal (1989).

Next, we derive the explicit constants in the theorem. Following the notation in equation (7.3) of Liu & Nocedal (1989), the BFGS matrix $B_k$ for the L-BFGS algorithm has the expression $B_k = B^{(\bar{m})}$, where
$$B^{(l+1)} = B^{(l)} - \frac{B^{(l)} s_l s_l^T B^{(l)}}{s_l^T B^{(l)} s_l} + \frac{y_l y_l^T}{y_l^T s_l},$$
$\bar{m} = \min\{k+1, m_0\}$, $m_0$ is a user-defined constant explained in Section A, and $\{y_l\}$ and $\{s_l\}$ are two sequences of vectors. We also choose $B^{(0)} = I$ to be the identity matrix.
Fix $l = \bar{m}$, and let $\cos \theta_k = s_l^T B^{(l)} s_l / (\|s_l\| \cdot \|B^{(l)} s_l\|)$, $\rho_k = y_l^T s_l / \|s_l\|^2$, $\tau_k = \|y_l\|^2 / y_l^T s_l$, and $q_k = s_l^T B^{(l)} s_l / \|s_l\|^2$. Then it can be verified that
$$\mathrm{tr}(B^{(l+1)}) = \mathrm{tr}(B^{(l)}) - \frac{\|B^{(l)} s_l\|^2}{s_l^T B^{(l)} s_l} + \frac{\|y_l\|^2}{y_l^T s_l} = \mathrm{tr}(B^{(l)}) - \frac{q_k}{\cos^2 \theta_k} + \tau_k,$$
and similarly $b_j - \|\nabla_{\tilde{\beta}} f(\tilde{\beta})\| \le \sum_{i=1}^n T_{ij} \le n \cdot \max_i T_{ij}$. Replacing $\|\nabla_{\tilde{\beta}} f(\tilde{\beta})\|$ by its upper bound, claim (e) is verified.
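The semi-dual objective $f(\tilde{\beta})$ and its gradient $\nabla_{\tilde{\beta}} f(\tilde{\beta}) = \tilde{T}(\beta)^T 1_n - \tilde{b}$, which underpin the analysis above, can be validated numerically by central finite differences. This is a self-contained sketch with our own naming:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, m, eta = 7, 5, 0.1
lam = 1.0 / eta
M = rng.random((n, m))
a = rng.random(n); a /= a.sum()
b = rng.random(m); b /= b.sum()

def f_and_grad(beta_free):
    """Semi-dual objective f(beta) and its gradient over the free coordinates."""
    beta = np.append(beta_free, 0.0)                  # beta_m = 0
    s = logsumexp(lam * (beta[None, :] - M), axis=1)
    f = eta * a @ s - eta * a @ np.log(a) - beta @ b + eta
    T = np.exp(lam * (beta[None, :] - M) - s[:, None] + np.log(a)[:, None])
    return f, T.sum(axis=0)[:-1] - b[:-1]             # T^T 1_n - b, free part

beta0 = rng.normal(size=m - 1)
f0, g0 = f_and_grad(beta0)
```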

C.6 PROOF OF THEOREM 4

For a matrix $A = (a_{ij}) \in \mathbb{R}^{n \times m}$, define $\|A\|_\infty = \max_{i,j} |a_{ij}|$, and the notation $A \ge 0$ means $a_{ij} \ge 0$ for all i and j. First, it is easy to show that $\|A \odot B\|_F \le \|A\|_\infty \|B\|_F$, since
$$\|A \odot B\|_F = \sqrt{\sum_{i,j} (a_{ij} b_{ij})^2} \le \sqrt{\sum_{i,j} \|A\|_\infty^2 b_{ij}^2} = \|A\|_\infty \|B\|_F.$$
Next, we show that if $B_{n \times m} \ge 0$, $C_{p \times n} \ge 0$, and $v \ge 0$, where v is a vector, then $\|C (A \odot B) v\| \le \|A\|_\infty \|C B v\|$. To see this, let $u_i$ be the i-th element of $(A \odot B) v$; then $u_i = \sum_j a_{ij} b_{ij} v_j$ and $|u_i| \le \|A\|_\infty \sum_j b_{ij} v_j$, and the claim follows from the nonnegativity of C. Moreover, for matrices $B_{n \times m} \ge 0$ and $C_{m \times p} \ge 0$, let $C_j$ be the j-th column of C; then
$$\|(A \odot B) C\|_F = \sqrt{\sum_{j=1}^p \|(A \odot B) C_j\|^2} \le \|A\|_\infty \sqrt{\sum_{j=1}^p \|B C_j\|^2} = \|A\|_\infty \|B C\|_F.$$
On the other hand, let $\hat{t}_{ij}$ be the (i, j) element of the matrix $\hat{T}^T \mathrm{diag}(a^{-1}) \hat{T}$, and $t_{ij}$ be the (i, j) element of $T^{*T} \mathrm{diag}(a^{-1}) T^*$. One can verify that $|\hat{t}_{ij} - t_{ij}| < 3 \varepsilon t_{ij}$ elementwise, which implies
$$\|\hat{D} - D\|_F = \|\hat{T}^T \mathrm{diag}(a^{-1}) \hat{T} - T^{*T} \mathrm{diag}(a^{-1}) T^*\|_F = \sqrt{\sum_{i,j} |\hat{t}_{ij} - t_{ij}|^2} \le 3 \varepsilon \|D\|_F.$$
Then by Theorem 7.2 of Higham (2002), if $3 \varepsilon \sigma \|D\|_F < 1$, we have
$$\frac{\|\delta_v\|}{\|s_v\|} \le \frac{\varepsilon \sigma}{1 - 3 \varepsilon \sigma \|D\|_F} \Big( \frac{\|\mu_c\| + 3 \|T^{*T} (a^{-1} \odot \mu_r)\|}{\|s_v\|} + 3 \|D\|_F \Big),$$
where $\sigma = \|D^{-1}\|_{op} = 1 / \sigma_{\min}(D)$. Assume that $\varepsilon < \min\{1, 1/(6 \sigma \|D\|_F)\}$; then with slight simplification, we have $\|\delta_v\|_\infty \le \|\delta_v\| \le C_v \varepsilon$, where
$$C_v = 2 \sigma (\|\mu_c\| + 3 \|T^{*T} (a^{-1} \odot \mu_r)\| + 3 \|D\|_F \|s_v\|).$$
On the other hand, the chain of inequalities for $\delta_u$ gives $\|\delta_u\|_\infty \le C_u \varepsilon$. Combining the results together, we get
$$\|\widehat{\nabla_M S} - \nabla_M S\|_F \le \varepsilon \left[ \|\nabla_M S\|_F + 2 \lambda \|T^*\|_F (C_v + C_u) \right],$$
where $C_u = \|\mu_r\| + 2 C_v \|\mathrm{diag}(a^{-1}) T^*\|_F + \|a^{-1} \odot (T^* s_v)\|$. Finally, $\|\alpha^*(\beta^{(k)}) - \alpha^*\|_\infty \le \|\beta^{(k)} - \beta^*\|_\infty = \|g\|_\infty$, implying that $\|f\|_\infty = \|\alpha^*(\beta^{(k)}) - \alpha^*\|_\infty \le \|g\|_\infty \le C_1 \varepsilon^{(k)}$. As a result, $\varepsilon = 2 \lambda (\|f\|_\infty + \|g\|_\infty) \le 4 \lambda C_1 \varepsilon^{(k)}$.
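The two elementary norm inequalities used at the start of this proof can be confirmed numerically; a small sketch with our own variable names:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))
B = rng.normal(size=(6, 4))
C = rng.random((3, 6))            # nonnegative matrix
v = rng.random(4)                 # nonnegative vector
Bp = np.abs(B)                    # the second inequality requires B >= 0

# ||A ⊙ B||_F  <=  ||A||_inf * ||B||_F
lhs1 = np.linalg.norm(A * B)
rhs1 = np.abs(A).max() * np.linalg.norm(B)

# ||C (A ⊙ B) v||  <=  ||A||_inf * ||C B v||   (for B, C, v >= 0)
lhs2 = np.linalg.norm(C @ ((A * Bp) @ v))
rhs2 = np.abs(A).max() * np.linalg.norm(C @ (Bp @ v))
```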



Let $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ be two discrete probability measures supported on data points $\{x_i\}_{i=1}^n$ and $\{y_j\}_{j=1}^m$, respectively, where $a = (a_1, \ldots, a_n)^T \in \Delta^n$, $b = (b_1, \ldots, b_m)^T \in \Delta^m$, and $\delta_x$ is the Dirac measure at position x. Define $\Pi(a, b) = \{T \in \mathbb{R}^{n \times m}_+ : T 1_m = a, \; T^T 1_n = b\}$.

Figure 1: Visualization of Sinkhorn plans computed by different algorithms.

Figure 2: Comparing the convergence speed of Sinkhorn-log and L-BFGS.

Figure 3: From left to right: the true observed data; the measured Wasserstein distance over time in the fixed Z setting with η = 0.5; the similar plot for η = 0.1; the random Z setting with η = 0.1.

Figure 5: Visualization of Sinkhorn plans with different η values.

Figure 6: Comparing different algorithms on the errors of Sinkhorn transport plan and loss value. The missing boxplots indicate that the corresponding results are NaNs.

Figure 7: More experiments on simulated data. The plots are analogues of Figure 3, with three additional data sets and the extra Sinkhorn divergence metric to evaluate model performance.

Figure 8(a) shows the training process of the WAE model based on the Sinkhorn divergence for 10 epochs, and Figure 8(b) shows the randomly generated images using the Analytic method after 50 epochs. It is clear that the proposed Analytic method is more efficient than the Unroll method in training.

Figure 8: (a) Training process of WAE for the CelebA data for 10 epochs, based on the Sinkhorn divergence. (b) Randomly generated images using the Analytic method after 50 epochs of training.

C.3 PROOF OF THEOREM 1

By definition we have $S_\lambda(M, a, b) = \langle T^*, M \rangle = \mathrm{vec}(T^*)^T \mathrm{vec}(M)$, so Lemma 2 gives
$$\frac{\partial S_\lambda(M, a, b)}{\partial \mathrm{vec}(M)^T} = \mathrm{vec}(M)^T \frac{\partial \mathrm{vec}(T^*)}{\partial \mathrm{vec}(M)^T} + \mathrm{vec}(T^*)^T \frac{\partial \mathrm{vec}(M)}{\partial \mathrm{vec}(M)^T}. \qquad (10)$$
Obviously, $\partial \mathrm{vec}(M) / \partial \mathrm{vec}(M)^T$ is the $(nm) \times (nm)$ identity matrix $I_{(nm)}$, so the second term of (10) is simply $\mathrm{vec}(T^*)^T$, and the remaining task is to derive $\partial \mathrm{vec}(T^*) / \partial \mathrm{vec}(M)^T$. Let $R = \alpha^* \oplus \beta^* - M = \alpha^* 1_m^T + 1_n \beta^{*T} - M$, and then $T^* = e^\lambda [R]$. Using the chain rule of derivatives, we have
$$\frac{\partial \mathrm{vec}(T^*)}{\partial \mathrm{vec}(M)^T} = \frac{\partial \mathrm{vec}(T^*)}{\partial \mathrm{vec}(R)^T} \cdot \frac{\partial \mathrm{vec}(R)}{\partial \mathrm{vec}(M)^T}. \qquad (11)$$

Third, let $H(\tilde{\beta}) := \nabla^2_{\tilde{\beta}} f(\tilde{\beta})$; in the proof of Theorem 2 we have already shown that
$$H(\tilde{\beta}) = \lambda \big[ \mathrm{diag}(\tilde{T}^T 1_n) - \tilde{T}^T \mathrm{diag}(T 1_m)^{-1} \tilde{T} \big],$$
where $T = e^\lambda [\alpha^*(\beta) \oplus \beta - M]$. Lemma 8 verifies that
$$\sigma_{\min}(H(\tilde{\beta})) \ge \lambda \cdot \frac{n-m+2}{2(n-m+1)} \cdot \min_{1 \le i \le n} T_{im}, \qquad \sigma_{\max}(H(\tilde{\beta})) \le \lambda \cdot \max_{1 \le j \le m-1} \nu_j,$$
where $\nu = \tilde{T}^T 1_n$. On the set $\mathcal{L}$, Lemma 7 shows that $\max_j \beta_j \le U_c$ and $\min_j \beta_j \ge L_c$, with $c = f(\tilde{\beta}^{(0)})$. Therefore,
$$\alpha_i := \eta \log a_i - \eta \log \Big( \sum_{j=1}^m e^{\lambda (\beta_j - M_{ij})} \Big) \ge \eta \log a_i - U_c - \eta \log \Big( e^{-\lambda (M_{im} + U_c)} + \sum_{j=1}^{m-1} e^{-\lambda M_{ij}} \Big) = A_i,$$
and hence $T_{im} = e^{\lambda (\alpha_i - M_{im})} \ge e^{\lambda (A_i - M_{im})}$.

and
$$\det(B^{(l+1)}) = \det(B^{(l)}) \, \rho_k / q_k, \qquad (21)$$
and claim (c) is proved. Claims (d) and (e) can be verified as follows. For any $\tilde{\beta} \in \mathcal{L}$, recall that $T = e^\lambda [\alpha^*(\beta) \oplus \beta - M]$, and then $T 1_m - a = 0$ and
$$\Big| \sum_{i=1}^n T_{ij} - b_j \Big| \le \|\tilde{T}^T 1_n - \tilde{b}\|_\infty \le \|\tilde{T}^T 1_n - \tilde{b}\| = \|\nabla_{\tilde{\beta}} f(\tilde{\beta})\|, \quad j = 1, \ldots, m-1. \qquad (26)$$
Then $0 < T_{ij} \le \sum_{j=1}^m T_{ij} = a_i$ and $0 < T_{ij} \le \sum_{i=1}^n T_{ij} \le b_j + \|\nabla_{\tilde{\beta}} f(\tilde{\beta})\|$. The gradient norm $\|\nabla_{\tilde{\beta}} f(\tilde{\beta})\|$ can be bounded using claim (c), so (d) is also proved. On the other hand, (26) shows that $a_i = \sum_{j=1}^m T_{ij} \le m \cdot \max_j T_{ij}$, and similarly we have $b_j - \|\nabla_{\tilde{\beta}} f(\tilde{\beta})\| \le \sum_{i=1}^n T_{ij}$.

$$\hat{t}_{ij} = \sum_{k=1}^n [1 + (E_T)_{ki}][1 + (E_T)_{kj}] T^*_{ki} T^*_{kj} / a_k,$$
so
$$|\hat{t}_{ij} - t_{ij}| = \Big| \sum_{k=1}^n [(E_T)_{ki} + (E_T)_{kj} + (E_T)_{ki} (E_T)_{kj}] T^*_{ki} T^*_{kj} / a_k \Big| \le (2 \varepsilon + \varepsilon^2) \sum_{k=1}^n T^*_{ki} T^*_{kj} / a_k < 3 \varepsilon t_{ij}.$$

$$\begin{aligned}
\|\delta_u\|_\infty \le \|\delta_u\| &\le \|a^{-1} \odot \delta_r\| + \|a^{-1} \odot (\hat{T} \hat{s}_v - T^* s_v)\| \\
&\le \varepsilon \|\mu_r\| + \|a^{-1} \odot (T^* \hat{s}_v + (E_T \odot T^*) \hat{s}_v - T^* s_v)\| \\
&\le \varepsilon \|\mu_r\| + \|a^{-1} \odot (T^* \delta_v)\| + \|a^{-1} \odot ((E_T \odot T^*) \hat{s}_v)\| \\
&= \varepsilon \|\mu_r\| + \|a^{-1} \odot (T^* \delta_v)\| + \|\mathrm{diag}(a^{-1}) (E_T \odot T^*) \hat{s}_v\| \\
&\le \varepsilon \|\mu_r\| + \|a^{-1} \odot (T^* \delta_v)\| + \varepsilon \|\mathrm{diag}(a^{-1}) T^* \hat{s}_v\| \\
&= \varepsilon \|\mu_r\| + \|a^{-1} \odot (T^* \delta_v)\| + \varepsilon \|\mathrm{diag}(a^{-1}) T^* (s_v + \delta_v)\| \\
&\le \varepsilon \|\mu_r\| + (1 + \varepsilon) \|a^{-1} \odot (T^* \delta_v)\| + \varepsilon \|\mathrm{diag}(a^{-1}) T^* s_v\| \\
&\le \varepsilon \|\mu_r\| + (1 + \varepsilon) \|\mathrm{diag}(a^{-1}) T^*\|_F \|\delta_v\| + \varepsilon \|a^{-1} \odot (T^* s_v)\| \\
&\le \varepsilon \big( \|\mu_r\| + 2 C_v \|\mathrm{diag}(a^{-1}) T^*\|_F + \|a^{-1} \odot (T^* s_v)\| \big).
\end{aligned}$$

Theorem 3(b) shows that $\|g\|_\infty \le \|g\| = \|\beta^{(k)} - \beta^*\| \le C_1 \varepsilon^{(k)}$. In addition, by (6) we have
$$\alpha^{(k)}_i := \alpha^*(\beta^{(k)})_i = \eta \log a_i - \eta \log \sum_{j=1}^m e^{\lambda (\beta^{(k)}_j - M_{ij})}.$$

Table 1: A brief summary of existing literature on the differentiation of the Sinkhorn loss.

Table 2: Running time of three algorithms for differentiating the Sinkhorn loss, with a maximum of 1000 iterations.

Table 3: Running time of three algorithms for differentiating the Sinkhorn loss, with a maximum of 10000 iterations.

Table 4: Architectures of the neural networks for MNIST and Fashion-MNIST data.

Architectures of the neural networks for CelebA data.

    … + ReLU                → 32 × 32 × 32      Conv 256,4,1,0,0 + ReLU → 256 × 4 × 4
    Conv 64,4,2,1,0 + ReLU  → 64 × 16 × 16      Conv 128,4,2,1,0 + ReLU → 128 × 8 × 8
    Conv 128,4,2,1,0 + ReLU → 128 × 8 × 8       Conv 64,4,2,1,0 + ReLU  → 64 × 16 × 16
    Conv 256,4,2,1,0 + ReLU → 256 × 4 × 4       Conv 32,4,2,1,0 + ReLU  → 32 × 32 × 32

APPENDIX

Define $\psi(B) = \operatorname{tr}(B) - \log\det(B)$; it is known that $\psi(B) > 0$ for any positive definite matrix $B$. Equation (6.50) of Nocedal & Wright (2006) shows that
$$0 < \psi(B^{(l+1)}) = \operatorname{tr}(B^{(l)}) - \frac{q_k}{\cos^2\theta_k} + \tau_k - \log\det(B^{(l)}) - \log\rho_k + \log q_k = \psi(B^{(l)}) + (\tau_k - \log\rho_k - 1) + \left(1 - \frac{q_k}{\cos^2\theta_k} + \log\frac{q_k}{\cos^2\theta_k}\right) + \log\cos^2\theta_k.$$
Under the assumptions verified above, the conclusions of equations (7.8) and (7.9) of Liu & Nocedal (1989) apply. Now we show that $\psi(B_k)$ can be upper bounded. First, Lemma 5 provides the required estimate for $l = 0, \ldots, m-1$, and (21) controls the determinant term. Combining (22) and (23), we have $\cos^2\theta_k > e^{-(M_3 + M_4)}$. Finally, using the argument in Byrd et al. (1987), we obtain the desired bound involving $e^{-(M_3 + M_4)}$, where $c_1$ and $c_2$ are the two constants for the Wolfe condition as explained in Section A. The constant $C_1$ is simply $2/M_1$.

For (c), we follow the analysis in Nocedal et al. (2002). Let $g(\beta) = \nabla_\beta f(\beta)$, and then $g(\beta^*) = 0$. By Taylor's theorem, $g(\beta) = H_1(\beta - \beta^*)$, where $H_1 = H(\xi)$ for some $\xi$ in the line segment connecting $\beta$ and $\beta^*$. Combining (24) and (25) with Lemma 4, and noting that $\beta^{(0)} \in \mathcal{L}$, claim (a) implies $\beta^{(k)} \in \mathcal{L}$ for all $k > 0$.

Let $\alpha = \alpha^* + f$ and $\beta = \beta^* + g$ for some perturbation vectors $f$ and $g$, and define $T = e^{\lambda[\alpha \oplus \beta - M]}$. Then $T_{ij} = T^*_{ij} e^{\lambda(f_i + g_j)}$, which remains within a constant factor of $T^*_{ij}$ as long as $\lambda |f_i + g_j| < 1$.

Consider $\hat s_u = s_u + \delta_u$ and $\hat s_v = s_v + \delta_v$ for some perturbation vectors $\delta_u$ and $\delta_v$. It then suffices to show proper bounds for $\|\delta_u\|_\infty$ and $\|\delta_v\|_\infty$. Consider $\hat\mu_r = (M \odot \hat T)\mathbf{1}_m$ and $\hat\mu_c = (M \odot \hat T)^\top \mathbf{1}_n$. As a result,
$$\begin{aligned}
\|\hat b_v - b_v\| &= \|\delta_c - T^{*\top}(a^- \odot \delta_r) - (E_T \odot T^*)^\top (a^- \odot (\mu_r + \delta_r))\| \\
&\le \|\delta_c\| + \|T^{*\top}(a^- \odot \delta_r)\| + \|E_T\|_\infty \|T^{*\top}(a^- \odot (\mu_r + \delta_r))\| \\
&< \varepsilon \|\mu_c\| + \varepsilon \|T^{*\top}(a^- \odot \mu_r)\| + \varepsilon \|T^{*\top}(a^- \odot \mu_r)\| + \varepsilon \|T^{*\top}(a^- \odot \delta_r)\| \\
&< \varepsilon \|\mu_c\| + \varepsilon \|T^{*\top}(a^- \odot \mu_r)\| + \varepsilon \|T^{*\top}(a^- \odot \mu_r)\| + \varepsilon^2 \|T^{*\top}(a^- \odot \mu_r)\| \\
&< \varepsilon \|\mu_c\| + 3\varepsilon \|T^{*\top}(a^- \odot \mu_r)\|.
\end{aligned}$$
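The opening claim that $\psi(B) = \operatorname{tr}(B) - \log\det(B) > 0$ for positive definite $B$ holds because each eigenvalue $x > 0$ contributes $x - \log x \ge 1$, so in fact $\psi(B) \ge d$ for a $d \times d$ matrix. A quick numerical check (my own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
for _ in range(100):
    A = rng.standard_normal((d, d))
    B = A @ A.T + 1e-3 * np.eye(d)        # random positive definite matrix
    sign, logdet = np.linalg.slogdet(B)   # numerically stable log-determinant
    psi = np.trace(B) - logdet
    # psi = sum_i (x_i - log x_i) >= d > 0 over the eigenvalues x_i of B
    assert sign > 0 and psi >= d
```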

