LEARNING WITH STOCHASTIC ORDERS

Abstract

Learning high-dimensional distributions is often done with explicit likelihood modeling or with implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely the convex or Choquet order between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities, show that they suffer from the curse of dimensionality, and propose surrogates via input convex maxout networks (ICMNs) that enjoy parametric rates. We provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. Finally, the ICMN class of convex functions and its derived Rademacher complexity bounds are of independent interest beyond their application to convex orders. Code to reproduce the experimental results is available.

1. INTRODUCTION

Learning complex high-dimensional distributions with implicit generative models (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2017; Arjovsky et al., 2017) via minimizing integral probability metrics (IPMs) (Müller, 1997a) has led to state-of-the-art generation across many data modalities (Karras et al., 2019; De Cao & Kipf, 2018; Padhi et al., 2020). An IPM compares probability distributions with a witness function belonging to a function class F; e.g., for the class of 1-Lipschitz functions, the IPM corresponds to the 1-Wasserstein distance. While estimating the witness function in such large function classes suffers from the curse of dimensionality, restricting it to a class of neural networks leads to the so-called neural net distance (Arora et al., 2017), which enjoys parametric statistical rates. In probability theory, the question of comparing distributions is not limited to assessing equality between two distributions. Stochastic orders were introduced to capture the notion of dominance between measures. Similar to IPMs, stochastic orders can be defined through the integrals of measures over function classes F (Müller, 1997b). Namely, for µ+, µ− ∈ P1(R^d), µ+ dominates µ−, written µ− ⪯ µ+, if for any function f ∈ F we have ∫_{R^d} f(x) dµ−(x) ≤ ∫_{R^d} f(x) dµ+(x) (see Figure 1a for an example). In the present work, we focus on the Choquet or convex order (Ekeland & Schachermayer, 2014) generated by the space of convex functions (see Sec. 2 for more details). Previous work has focused on learning with stochastic orders in the one-dimensional setting, as it has prominent applications in mathematical finance and distributional reinforcement learning (RL). The survival function gives a characterization of the convex order in one dimension (see Figure 1b and Sec. 2 for more details).
For instance, in portfolio optimization (Xue et al., 2020; Post et al., 2018; Dentcheva & Ruszczynski, 2003) the goal is to find the portfolio that maximizes the expected return under dominance constraints between the return distribution and a benchmark distribution. A similar concept was introduced in distributional RL (Martin et al., 2020) for learning policies with dominance constraints on the distribution of the reward. While these works are limited to the univariate setting, our work is the first, to the best of our knowledge, that provides a computationally tractable characterization of stochastic orders that is sample efficient and scalable to high dimensions.

Figure 1: VDC example in 1D. Figure 1a: µ+ is a mixture of 3 Gaussians; µ− corresponds to a single mode of the mixture; µ+ dominates µ− in the convex order. Figure 1b: univariate characterization of the convex order with survival functions (see Sec. 2 for details). Figure 1c: surrogate VDC computation with an input convex maxout network and gradient descent; the surrogate VDC tends to zero at the end of training and hence certifies the convex dominance of µ+ over µ−.

The paper is organized as follows. In Sec. 3 we introduce the Variational Dominance Criterion (VDC); the VDC between measures µ+ and µ− takes value 0 if and only if µ+ dominates µ− in the convex order, but it suffers from the curse of dimensionality and cannot be estimated efficiently from samples. To remedy this, in Sec. 4 we introduce a VDC surrogate via input convex maxout networks (ICMNs). ICMNs are new variants of input convex neural networks (Amos et al., 2017) that we propose as a proxy for convex functions, and we study their complexity. We show in Sec. 4 that the surrogate VDC has parametric rates and can be efficiently estimated from samples. The surrogate VDC can be computed using (stochastic) gradient descent on the parameters of the ICMN and can characterize convex dominance (see Figure 1c). We then show in Sec. 5 how to use the VDC and its surrogate to define a pseudo-distance on the probability space. Finally, in Sec. 6 we propose penalizing the training losses of generative models with the surrogate VDC to learn implicit generative models that have better coverage and spread than known baselines. This leads to a min-max game similar to GANs. We validate our framework in Sec. 7 with experiments on portfolio optimization and image generation.

2. THE CHOQUET OR CONVEX ORDER

Denote by P(R^d) the set of Borel probability measures on R^d and by P1(R^d) ⊂ P(R^d) the subset of those with finite first moment: µ ∈ P1(R^d) if and only if ∫_{R^d} ∥x∥ dµ(x) < +∞.

Comparing probability distributions. Integral probability metrics (IPMs) are pseudo-distances between probability measures µ, ν defined as d_F(µ, ν) = sup_{f∈F} E_µ f − E_ν f, for a given function class F which is symmetric with respect to sign flips. They are ubiquitous in optimal transport and generative modeling for comparing distributions: if F is the set of functions with Lipschitz constant 1, the resulting IPM is the 1-Wasserstein distance; if F is the unit ball of an RKHS, the IPM is its maximum mean discrepancy. Clearly, d_F(µ, ν) = 0 if and only if E_µ f = E_ν f for all f ∈ F, and when F is large enough, this is equivalent to µ = ν.

The Choquet or convex order. When the class F is not symmetric with respect to sign flips, comparing the expectations E_µ f and E_ν f for f ∈ F does not yield a pseudo-distance. In the case where F is the set of convex functions, the convex order naturally arises instead:

Definition 1 (Choquet order, Ekeland & Schachermayer (2014), Def. 4). For µ−, µ+ ∈ P1(R^d), we say that µ− ⪯ µ+ if for any convex function f : R^d → R, we have ∫_{R^d} f dµ− ≤ ∫_{R^d} f dµ+.

µ− ⪯ µ+ is classically stated as "µ− is a balayée of µ+", or "µ+ dominates µ−". It turns out that ⪯ is a partial order on P1(R^d), meaning that reflexivity (µ ⪯ µ), antisymmetry (if µ ⪯ ν and ν ⪯ µ, then µ = ν), and transitivity (if µ1 ⪯ µ2 and µ2 ⪯ µ3, then µ1 ⪯ µ3) hold. As an example, if µ−, µ+ are Gaussians µ− = N(0, Σ−), µ+ = N(0, Σ+), then µ− ⪯ µ+ if and only if Σ− ⪯ Σ+ in the positive-semidefinite order (Müller, 2001). Also, since linear functions and their negatives are convex, µ− ⪯ µ+ implies that both measures have the same expectation: E_{x∼µ−} x = E_{x∼µ+} x.
In the univariate case, µ− ⪯ µ+ implies that supp(µ−) is contained in the convex hull of supp(µ+) and that Var(µ−) ≤ Var(µ+), and µ− ⪯ µ+ holds if and only if for all x ∈ R, ∫_x^{+∞} F̄µ−(t) dt ≤ ∫_x^{+∞} F̄µ+(t) dt, where F̄• is the survival function (one minus the cumulative distribution function). Note that this characterization can be checked efficiently if one has access to samples of µ− and µ+. In the high-dimensional case, there exists an alternative characterization of the convex order:

Proposition 1 (Ekeland & Schachermayer (2014), Thm. 10). If µ−, µ+ ∈ P1(R^d), we have µ− ⪯ µ+ if and only if there exists a Markov kernel R with ∫_{R^d} y dR_x(y) = x for all x ∈ R^d, such that µ+ = ∫_{R^d} R_x dµ−(x). Equivalently, there exists a coupling (X, Y) such that Law(X) = µ−, Law(Y) = µ+ and X = E(Y | X) almost surely.

Intuitively, this means that µ+ is more spread out than µ−. Remark that this characterization is difficult to check, especially in high dimensions.
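The univariate criterion above is easy to test empirically, since ∫_x^{+∞} F̄(t) dt = E[(X − x)+]. Below is a minimal sketch of our own (not code from the paper) that checks the integrated-survival inequality on a grid of thresholds:

```python
import numpy as np

def integrated_survival(samples, xs):
    # For each threshold x, estimate \int_x^{+inf} (1 - F(t)) dt,
    # which equals E[(X - x)_+] for an integrable random variable X.
    samples = np.asarray(samples, dtype=float)
    return np.maximum(samples[None, :] - xs[:, None], 0.0).mean(axis=1)

def convex_dominates(mu_plus, mu_minus, xs, tol=1e-12):
    # Empirical test of mu_minus ⪯ mu_plus in the convex order (1D):
    # the integrated survival of mu_minus must lie below that of mu_plus.
    return bool(np.all(integrated_survival(mu_minus, xs)
                       <= integrated_survival(mu_plus, xs) + tol))

# A point mass at 0 is dominated by the symmetric two-point law on {-1, 1}.
xs = np.linspace(-2.0, 2.0, 201)
print(convex_dominates([-1.0, 1.0], [0.0], xs))  # True
print(convex_dominates([0.0], [-1.0, 1.0], xs))  # False
```

By Jensen's inequality, E[(X − x)+] ≥ (E[X] − x)+, which is why the point mass at the mean is dominated; reversing the roles fails at x = 0.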

3. THE VARIATIONAL DOMINANCE CRITERION

In this section, we present a quantitative way to deal with convex orders. Given a bounded open convex subset Ω ⊆ R^d and a compact set K ⊆ R^d, let A = {u : Ω → R | u convex and ∇u ∈ K almost everywhere}. We define the Variational Dominance Criterion (VDC) between probability measures µ+ and µ− supported on Ω analogously to IPMs, replacing F by A:

VDC_A(µ+ ∥ µ−) := sup_{u∈A} ∫_Ω u d(µ− − µ+).  (1)

Remark that when 0 ∈ K, VDC_A(µ+ ∥ µ−) ≥ 0 because the zero function belongs to the set A. We re-emphasize that since A is not symmetric with respect to sign flips (f ∈ A does not imply −f ∈ A), the properties of the VDC are very different from those of IPMs. Most importantly, the following proposition, shown in App. A, links the VDC to the Choquet order.

Proposition 2. Let K be compact such that the origin belongs to the interior of K. If µ+, µ− ∈ P(Ω), then VDC_A(µ+ ∥ µ−) = sup_{u∈A} ∫_Ω u d(µ− − µ+) = 0 if and only if ∫_Ω u d(µ− − µ+) ≤ 0 for any convex function u on Ω (i.e., µ− ⪯ µ+ according to the Choquet order).

That is, Proposition 2 states that the VDC between µ+ and µ− takes value 0 if and only if µ+ dominates µ−. Combining this with the interpretation of Proposition 1, we see that intuitively, the quantity VDC_A(µ+ ∥ µ−) is small when µ+ is more spread out than µ−, and large otherwise. Hence, if we want to enforce or induce a Choquet ordering between two measures in an optimization problem, we can include the VDC (or rather, its surrogate introduced in Sec. 4) as a penalization term in the objective. Before this, we explore the connections between the VDC and optimal transport, and study some statistical properties of the VDC.
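For small samples in low dimension, the empirical VDC can be computed exactly as a linear program over function values and subgradients at the sample points, in the spirit of nonparametric convex regression. The sketch below is our own illustration (it uses a box constraint ∥∇u∥∞ ≤ C for K to keep the program linear, rather than the Euclidean ball):

```python
import numpy as np
from scipy.optimize import linprog

def empirical_vdc(x_plus, x_minus, C=1.0):
    """LP estimate of VDC_A(mu+ || mu-) between empirical measures.

    Variables: values u_i and subgradients g_i at each sample point,
    subject to the convexity constraints u_j >= u_i + <g_i, z_j - z_i>
    and the gradient box constraint |g_i|_inf <= C.
    """
    x_plus, x_minus = np.atleast_2d(x_plus), np.atleast_2d(x_minus)
    z = np.vstack([x_plus, x_minus])
    n, d = z.shape
    n_plus = len(x_plus)

    # Objective: maximize mean_{mu-} u - mean_{mu+} u (linprog minimizes).
    c = np.zeros(n + n * d)
    c[:n_plus] = 1.0 / n_plus
    c[n_plus:n] = -1.0 / (n - n_plus)

    # Convexity: -u_j + u_i + <g_i, z_j - z_i> <= 0 for all i != j.
    rows = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            row = np.zeros(n + n * d)
            row[i], row[j] = 1.0, -1.0
            row[n + i * d:n + (i + 1) * d] = z[j] - z[i]
            rows.append(row)

    # u_0 pinned to 0: the objective is invariant to additive constants.
    bounds = [(0, 0)] + [(None, None)] * (n - 1) + [(-C, C)] * (n * d)
    res = linprog(c, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  bounds=bounds, method="highs")
    return -res.fun

# delta_0 is dominated by the two-point law on {-1, 1}: the VDC is 0.
print(empirical_vdc([[-1.0], [1.0]], [[0.0]]))   # ≈ 0.0
# Reversing the roles yields a positive value, witnessed by u(x) = |x|.
print(empirical_vdc([[0.0]], [[-1.0], [1.0]]))   # ≈ 1.0
```

This exact approach scales poorly (O(n²) constraints) and, as Subsec. 3.2 shows, the estimator itself is cursed by dimension, which motivates the neural surrogate of Sec. 4.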

3.1. THE VDC AND OPTIMAL TRANSPORT

Toland duality provides a way to interpret the VDC through the lens of optimal transport. In the following, W2(µ, ν) denotes the 2-Wasserstein distance between µ and ν.

Theorem 1 (Toland duality, adapted from Thm. 1 of Carlier (2008)). For any µ+, µ− ∈ P(Ω), the VDC satisfies:

VDC_A(µ+ ∥ µ−) = sup_{ν∈P(K)} { (1/2) W2²(µ+, ν) − (1/2) W2²(µ−, ν) } − (1/2) ∫_Ω ∥x∥² d(µ+ − µ−)(x).  (2)

The optimal convex function u of the VDC in (1) and the optimal ν in the right-hand side of (2) satisfy (∇u)#µ+ = (∇u)#µ− = ν, where (∇u)#µ+ denotes the pushforward of µ+ by ∇u.

Note that under the assumption 0 ∈ K, Theorem 1 implies that VDC_A(µ+ ∥ µ−) = 0 if and only if (1/2) W2²(µ+, ν) − (1/2) ∫_Ω ∥x∥² dµ+ ≤ (1/2) W2²(µ−, ν) − (1/2) ∫_Ω ∥x∥² dµ− for any ν ∈ P(K). Under the equivalence VDC_A(µ+ ∥ µ−) = 0 ⇐⇒ µ− ⪯ µ+ shown by Proposition 2, this provides yet another characterization of the convex order in arbitrary dimension.

3.2. STATISTICAL RATES FOR VDC ESTIMATION

In this subsection, we present an upper bound on the statistical rate of estimation of VDC_A(µ ∥ ν) using the estimator VDC_A(µn ∥ νn) based on the empirical distributions µn = (1/n) Σ_{i=1}^n δ_{xi}, νn = (1/n) Σ_{i=1}^n δ_{yi} built from i.i.d. samples (xi)_{i=1}^n, (yi)_{i=1}^n from µ and ν, respectively.

Theorem 2. Let Ω = [−1, 1]^d and K = {x ∈ R^d | ∥x∥2 ≤ C} for an arbitrary C > 0. With probability at least 1 − δ,

|VDC_A(µ ∥ ν) − VDC_A(µn ∥ νn)| ≤ √(18 C² d log(4/δ)) · (2/√n) + 8K n^{−2/d},

where K depends on C and d.

The proof of this result is in App. A. The n^{−2/d} term is indicative of the curse of dimensionality: we need a number of samples n exponential in d to control the estimation error. While Theorem 2 only shows an upper bound on the difference between the VDC and its estimator, in Subsec. 5.1 we study a related setting where an Ω(n^{−2/d}) lower bound is available. Hence, we hypothesize that VDC estimation is in fact cursed by dimension in general.

4. A VDC SURROGATE VIA INPUT CONVEX MAXOUT NETWORKS

Given the link between the VDC and the convex order, one is inclined to use the VDC as a quantitative proxy to induce convex-order domination in optimization problems. Estimating the VDC entails solving an optimization problem over convex functions. In practice, we only have access to the empirical versions µn, νn of the probability measures µ, ν; we could compute the VDC between the empirical measures by solving a linear program similar to the ones used in non-parametric convex regression (Hildreth, 1954). However, the statistical rates for VDC estimation from samples are cursed by dimension (Subsec. 3.2), which means that we would need a number of samples exponential in the dimension to get a good estimate. Our approach is to focus on a surrogate problem instead: sup_{u∈Â} ∫_Ω u d(µ− − µ+), where Â is a class of neural network functions included in A over which we can optimize efficiently. In constructing Â, we want to hardcode the constraints "u convex" and "∇u ∈ K almost everywhere" into the neural network architecture. A possible approach would be to use the input convex neural networks (ICNNs) introduced by Amos et al. (2017), which have been used as surrogates of convex functions in generative modeling with normalizing flows (Huang et al., 2021), optimal transport (Korotin et al., 2021a;b; Makkuva et al., 2020), and large-scale Wasserstein flows (Alvarez-Melis et al., 2021; Bunne et al., 2021; Mokrov et al., 2021). However, we found in early experimentation that a superior alternative is to use input convex maxout networks (ICMNs), which are maxout networks (Goodfellow et al., 2013) that are convex with respect to their inputs. Maxout networks and ICMNs are defined as follows:

Definition 2 (Maxout networks). For a depth L ≥ 2, let M = (m1, . . . , mL) be a vector of positive integers such that m1 = d.
Let F_{L,M,k} be the space of k-maxout networks of depth L and widths M, which contains functions of the form

f(x) = (1/√m_L) Σ_{i=1}^{m_L} a_i max_{j∈[k]} ⟨w^{(L−1)}_{i,j}, (x^{(L−1)}, 1)⟩,  a_i ∈ R, w^{(L−1)}_{i,j} ∈ R^{m_{L−1}+1},  (3)

where for any 2 ≤ ℓ ≤ L−1 and any 1 ≤ i ≤ m_ℓ, the i-th component of x^{(ℓ)} = (x^{(ℓ)}_1, . . . , x^{(ℓ)}_{m_ℓ}) is computed recursively as

x^{(ℓ)}_i = (1/√m_ℓ) max_{j∈[k]} ⟨w^{(ℓ−1)}_{i,j}, (x^{(ℓ−1)}, 1)⟩,  with w^{(ℓ)}_{i,j} ∈ R^{m_ℓ+1},  (4)

and x^{(1)} = x.

Definition 3 (Input convex maxout networks or ICMNs). A maxout network f of the form (3)-(4) is an input convex maxout network if (i) for any 1 ≤ i ≤ m_L, a_i ≥ 0, and (ii) for any 2 ≤ ℓ ≤ L−1, 1 ≤ i ≤ m_{ℓ+1}, 1 ≤ j ≤ k, the first m_ℓ components of w^{(ℓ)}_{i,j} are non-negative. We denote the space of ICMNs as F_{L,M,k,+}.

In other words, a maxout network is an ICMN if all the non-bias weights beyond the first layer are constrained to be non-negative. This definition is analogous to that of ICNNs in Amos et al. (2017), which are also defined as neural networks with positivity constraints on non-bias weights beyond the first layer. Proposition 5 in App. B shows that ICMNs are convex with respect to their inputs. It remains to impose the condition ∇u ∈ K almost everywhere, which in practice is enforced by adding the norms of the weights as a regularization term to the loss function. For theoretical purposes, we define F_{L,M,k}(1) (resp. F_{L,M,k,+}(1)) as the subset of F_{L,M,k} (resp. F_{L,M,k,+}) such that for all 1 ≤ ℓ ≤ L−1, 1 ≤ i ≤ m_ℓ, 1 ≤ j ≤ k, ∥w^{(ℓ)}_{i,j}∥2 ≤ 1, and ∥a∥2² = Σ_{i=1}^{m_L} a_i² ≤ 1. The following proposition, proven in App. B, shows simple bounds on the values of the functions in F_{L,M,k}(1) and their derivatives.

Proposition 3. Let f be an ICMN that belongs to F_{L,M,k}(1). For almost every x in B1(R^d), ∥∇f(x)∥ ≤ 1. Moreover, for all x ∈ B1(R^d), |f(x)| ≤ L, and for 1 ≤ ℓ ≤ L, ∥x^{(ℓ)}∥ ≤ ℓ.

When K = B1(R^d), the space of ICMNs F_{L,M,k,+}(1) is included in A.
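To make the construction concrete, here is a minimal NumPy sketch of an ICMN forward pass, our own illustration with hypothetical layer sizes: weight non-negativity beyond the first layer is enforced by clipping, the 1/√m scaling follows Definition 2, and convexity is verified numerically. A maxout unit with non-negative weights is convex and non-decreasing in its inputs, so stacking such layers on top of convex first-layer features preserves convexity.

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_layer(W, b, h, nonneg):
    # W has shape (units, k, in_dim); b has shape (units, k).
    Wp = np.maximum(W, 0.0) if nonneg else W  # projection keeps convexity
    pre = np.einsum("ukd,d->uk", Wp, h) + b
    return pre.max(axis=1) / np.sqrt(W.shape[0])

def icmn(params, x):
    (W1, b1), (W2, b2), a = params
    h1 = maxout_layer(W1, b1, x, nonneg=False)   # first layer: free weights
    h2 = maxout_layer(W2, b2, h1, nonneg=True)   # hidden layer: weights >= 0
    return np.maximum(a, 0.0) @ h2 / np.sqrt(len(a))  # output: a_i >= 0

d, m, k = 3, 8, 2
params = ((rng.normal(size=(m, k, d)), rng.normal(size=(m, k))),
          (rng.normal(size=(m, k, m)), rng.normal(size=(m, k))),
          rng.normal(size=m))

# Midpoint convexity check on random pairs of inputs.
for _ in range(100):
    x, y = rng.normal(size=d), rng.normal(size=d)
    lam = rng.uniform()
    z = lam * x + (1 - lam) * y
    assert icmn(params, z) <= lam * icmn(params, x) + (1 - lam) * icmn(params, y) + 1e-9
```

In an actual training loop, the clipping would be applied as a projection step after each gradient update rather than inside the forward pass.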
Hence, we define the surrogate VDC associated to F_{L,M,k,+}(1) as:

VDC_{F_{L,M,k,+}(1)}(µ+ ∥ µ−) = sup_{u∈F_{L,M,k,+}(1)} ∫_Ω u d(µ− − µ+).  (5)

Theorem 3. Suppose that for all 1 ≤ ℓ ≤ L the widths m_ℓ satisfy m_ℓ ≤ m, and assume that µ, ν are supported on the ball of R^d of radius r. With probability at least 1 − δ,

|VDC_{F_{L,M,k,+}(1)}(µ ∥ ν) − VDC_{F_{L,M,k,+}(1)}(µn ∥ νn)| ≤ √(18 r² log(4/δ)) · (2/√n) + 512 √((L−1) k m (m+1) / n) · ((L+1)√(log 2) + √(log(L²+1))/2 + √π/2).

We see from Theorem 3 that VDC_{F_{L,M,k,+}(1)}, in contrast to VDC_A, has parametric rates and hence favorable properties for estimation from samples. In the following section, we see that the VDC can be used to define a pseudo-distance on the probability space.

5. FROM THE CONVEX ORDER BACK TO A PSEUDO-DISTANCE

We define the Choquet-Toland distance (CT distance) as the map d_{CT,A} : P(Ω) × P(Ω) → R given by d_{CT,A}(µ+, µ−) := VDC_A(µ+ ∥ µ−) + VDC_A(µ− ∥ µ+). That is, the CT distance between µ+ and µ− is simply the sum of the two Variational Dominance Criteria. Applying Theorem 1, we obtain that

d_{CT,A}(µ+, µ−) = (1/2) ( sup_{ν∈P(K)} { W2²(µ+, ν) − W2²(µ−, ν) } + sup_{ν∈P(K)} { W2²(µ−, ν) − W2²(µ+, ν) } ).

The following result, shown in App. C, states that d_{CT,A} is indeed a distance.

Theorem 4. Suppose that the origin belongs to the interior of K. Then d_{CT,A} is a distance, i.e., it fulfills (i) d_{CT,A}(µ+, µ−) ≥ 0 for any µ+, µ− ∈ P(Ω) (non-negativity); (ii) d_{CT,A}(µ+, µ−) = 0 if and only if µ+ = µ− (identity of indiscernibles); (iii) if µ1, µ2, µ3 ∈ P(Ω), then d_{CT,A}(µ1, µ2) ≤ d_{CT,A}(µ1, µ3) + d_{CT,A}(µ3, µ2) (triangle inequality). Symmetry holds by construction.

As in (5), we define the surrogate CT distance as:

d_{CT,F_{L,M,k,+}(1)}(µ+, µ−) = VDC_{F_{L,M,k,+}(1)}(µ+ ∥ µ−) + VDC_{F_{L,M,k,+}(1)}(µ− ∥ µ+).  (6)

5.1. STATISTICAL RATES FOR CT DISTANCE ESTIMATION

We show almost-tight upper and lower bounds on the expectation of the Choquet-Toland distance between a probability measure and its empirical version: namely, E[d_{CT,A}(µ, µn)] = O(n^{−2/d}) and Ω(n^{−2/d}/log n).

Theorem 5. Let C0, C1 be universal constants independent of the dimension d. Let Ω = [−1, 1]^d and K = {x ∈ R^d | ∥x∥2 ≤ C}. Let µn be the n-sample empirical measure corresponding to a probability measure µ over [−1, 1]^d. When µ is the uniform probability measure and n ≥ C1 log(d), we have that

E[d_{CT,A}(µ, µn)] ≥ (C0 C)/(d(1+C) log n) · n^{−2/d}.

For any probability measure µ over [−1, 1]^d,

E[d_{CT,A}(µ, µn)] ≤ K n^{−2/d},

where K is a constant depending on C and d (but not on the measure µ).

Overlooking logarithmic factors, we can summarize Theorem 5 as E[d_{CT,A}(µ, µn)] ≍ n^{−2/d}. The estimation of the CT distance is thus cursed by dimension: one needs a sample size exponential in the dimension d for µn to be at a desired distance from µ. It is also interesting to contrast the rate n^{−2/d} with the rates for similar distances. For example, for the r-Wasserstein distance we have E[W_r^r(µ, µn)] ≍ n^{−r/d} (Singh & Póczos (2018), Section 4). Given the link of the CT distance and the VDC with the squared 2-Wasserstein distance (see Subsec. 3.1), the n^{−2/d} rate is natural.

The proof of Theorem 5, which can be found in App. D, is based on upper- and lower-bounding the metric entropy of the class of bounded Lipschitz convex functions with respect to an appropriate pseudo-norm. We then use Dudley's integral bound and Sudakov minoration to upper- and lower-bound the Rademacher complexity of this class, and we finally show upper and lower bounds on the CT distance in terms of the Rademacher complexity. Bounds on the metric entropy of bounded Lipschitz convex functions have been computed and used before (Balazs et al., 2015; Guntuboyina & Sen, 2013), but in L^p and supremum norms, not in our pseudo-norm.
Next, we see that the surrogate CT distance defined in (6) does enjoy parametric estimation rates.

Theorem 6. Suppose that for all 1 ≤ ℓ ≤ L the widths m_ℓ satisfy m_ℓ ≤ m. We have that

E[d_{CT,F_{L,M,k,+}(1)}(µ, µn)] ≤ 256 √((L−1) k m (m+1) / n) · ((L+1)√(log 2) + √(log(L²+1))/2 + √π/2).

In short, E[d_{CT,F_{L,M,k,+}(1)}(µ, µn)] = O(Lm √(k/n)). Hence, if we take the number of samples n larger than k times the squared product of width and depth, we can make the surrogate CT distance small. Theorem 6, proven in App. D, is based on a Rademacher complexity bound for the space F_{L,M,k}(1) of maxout networks which may be of independent interest; to our knowledge, existing Rademacher complexity bounds for maxout networks are restricted to depth-two networks (Balazs et al., 2015; Kontorovich, 2018).

Theorems 5 and 6 show that the advantages of the surrogate CT distance over the CT distance are not only computational but also statistical: the CT distance is such a strong metric that moderate-size empirical versions of a distribution are always very far from it. Hence, it is not a good criterion to compare how close an empirical distribution is to a population distribution. In contrast, the surrogate CT distance between a distribution and its empirical version is small for samples of moderate size. An analogous observation for the Wasserstein distance versus the neural net distance was made by Arora et al. (2017).

If µn, νn are empirical versions of µ, ν, it is also interesting to bound

|d_{CT,F_{L,M,k,+}(1)}(µn, νn) − d_{CT,A}(µ, ν)| ≤ |d_{CT,F_{L,M,k,+}(1)}(µn, νn) − d_{CT,F_{L,M,k,+}(1)}(µ, ν)| + |d_{CT,F_{L,M,k,+}(1)}(µ, ν) − d_{CT,A}(µ, ν)|.

The first term has an O(√(k/n)) bound following from Theorem 6, while the second term is upper-bounded by 2 sup_{f∈A} inf_{f̂∈F_{L,M,k}(1)} ∥f − f̂∥∞. Balazs et al. (2015) shows that sup_{f∈A} inf_{f̂∈F_{2,(d,1),k}(1)} ∥f − f̂∥∞ = O(d k^{−2/d}).
Hence, we need k exponential in d to make the second term small, and thus n ≫ k exponential in d to make the first term small.

6. LEARNING DISTRIBUTIONS WITH THE SURROGATE VDC AND CT DISTANCE

We provide a min-max framework to learn distributions with stochastic orders. As in the generative adversarial network (GAN; Goodfellow et al. (2014); Arjovsky et al. (2017)) framework, we parametrize probability measures implicitly as the pushforward µ = g#µ0 of a base measure µ0 by a generator function g in a parametric class G, and we optimize over g. The loss functions involve a maximization over ICMNs corresponding to the computation of a surrogate VDC or CT distance (and possibly additional maximization problems), yielding a min-max problem analogous to GANs.

Enforcing dominance constraints with the surrogate VDC. In some applications, we want to optimize a loss L : P(Ω) → R under the constraint that µ = g#µ0 dominates a baseline measure ν. We can enforce, or at least bias, µ towards the dominance constraint by adding a penalization term proportional to the surrogate VDC between µ and ν, in application of Proposition 2.

A first instance of this approach appears in portfolio optimization (Xue et al., 2020; Post et al., 2018). Let ξ = (ξ1, . . . , ξp) be a random vector of return rates of p assets and let Y1 := G1(ξ), Y2 := G2(ξ) be real-valued functions of ξ which represent the return rates of two different asset allocations or portfolios, e.g., Gi(ξ) = ⟨ωi, ξ⟩ with ωi ∈ R^p. The goal is to find a portfolio G2 that enhances a benchmark portfolio G1 in a certain way. For a portfolio G with return rate Y := G(ξ), we let F^{(1)}_Y(x) = P_ξ(Y ≤ x) be the CDF of its return rate, and F^{(2)}_Y(x) = ∫_{−∞}^x F^{(1)}_Y(x′) dx′. If Y1, Y2 are the return rates of G1, G2, we say that Y2 dominates Y1 in second order, written Y2 ⪰2 Y1, if F^{(2)}_{Y2}(x) ≤ F^{(2)}_{Y1}(x) for all x ∈ R, which intuitively means that the return rates Y2 are less spread out than the return rates Y1, i.e., the risk is smaller. Formally, the portfolio optimization problem can be written as:

max_{G2} E[Y2 := G2(ξ)]  s.t.  Y2 ⪰2 Y1 := G1(ξ).  (9)
It turns out that Y2 ⪰2 Y1 if and only if E[(η − Y2)+] ≤ E[(η − Y1)+] for any η ∈ R, or yet equivalently, if E[u(Y2)] ≥ E[u(Y1)] for all concave non-decreasing u : R → R (Dentcheva & Ruszczyński, 2004). Although different, note that the second order is intimately connected to the Choquet order for one-dimensional distributions, and it can be handled with similar tools. Define F_{L,M,k,−+}(1) as the subset of F_{L,M,k,+}(1) such that the first m1 components of the weights w^{(1)}_{i,j} are non-positive for all 1 ≤ i ≤ m2, 1 ≤ j ≤ k. If we set the input width m1 = 1, we can encode the condition Y2 ⪰2 Y1 through VDC_{F_{L,M,k,−+}(1)}(ν ∥ µ), where ν = Law(Y1) and µ = Law(Y2) are the distributions of Y1, Y2, respectively. Hence, with the appropriate Lagrange multiplier, we convert problem (9) into a min-max problem between µ and the potential u of the VDC:

min_{µ : µ = Law(⟨ξ, ω2⟩)} − ∫_R x dµ(x) + λ VDC_{F_{L,M,k,−+}(1)}(ν ∥ µ).  (10)

A second instance of this approach is in GAN training. Assuming that we have a baseline generator g0 that can be obtained via regular training, we consider the problem:

min_{g∈G} max_{f∈F} {E_{X∼νn}[f(X)] − E_{Y∼µ0}[f(g(Y))]} + λ VDC_{F_{L,M,k,+}(1)}(g#µ0 ∥ (g0)#µ0).  (11)

The first term in the objective function is the usual WGAN loss (Arjovsky et al., 2017), although it can be replaced by any other standard GAN loss. The second term, which is proportional to VDC_{F_{L,M,k,+}(1)}(g#µ0 ∥ (g0)#µ0) = max_{u∈F_{L,M,k,+}(1)} {E_{Y∼µ0}[u(g0(Y))] − E_{Y∼µ0}[u(g(Y))]}, enforces that g#µ0 ⪰ (g0)#µ0 in the Choquet order, and thus u acts as a second, 'Choquet' critic. Tuning λ appropriately, the rationale is that we want a generator that optimizes the standard GAN loss under the condition that the generated distribution dominates the baseline distribution.
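The characterization via E[(η − Y)+] is straightforward to test on samples; below is a small sketch of our own (not code from the paper) that checks empirical second-order dominance on a grid of thresholds η:

```python
import numpy as np

def expected_shortfall_curve(y, etas):
    # E[(eta - Y)_+] estimated from samples, for each threshold eta.
    return np.maximum(etas[:, None] - y[None, :], 0.0).mean(axis=1)

def second_order_dominates(y2, y1, etas, tol=1e-12):
    # Y2 >=_2 Y1 iff E[(eta - Y2)_+] <= E[(eta - Y1)_+] for every eta.
    return bool(np.all(expected_shortfall_curve(y2, etas)
                       <= expected_shortfall_curve(y1, etas) + tol))

rng = np.random.default_rng(0)
y1 = rng.normal(size=1000)        # a risky return distribution
y2 = np.full(1000, y1.mean())     # a riskless return with the same mean
etas = np.linspace(-4.0, 4.0, 81)
print(second_order_dominates(y2, y1, etas))  # True: same mean, less spread
```

The riskless portfolio dominates by Jensen's inequality: (η − ·)+ is convex, so E[(η − Y1)+] ≥ (η − E[Y1])+ for every threshold η.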
As stated by Proposition 1, dominance in the Choquet order translates into g#µ0 being more spread out than (g0)#µ0, which should help avoid mode collapse and improve the diversity of generated samples. In practice, this min-max game is solved via Algorithm 1 given in App. F. For the Choquet critic, each update amounts to an SGD step followed by a projection step that imposes non-negativity of the hidden-to-hidden weights.

Generative modeling with the surrogate CT distance. The surrogate Choquet-Toland distance is well suited for generative modeling, as it can be used in GANs in place of the usual discriminator. Namely, if ν is a target distribution, νn is its empirical distribution, and D = {µ = g#µ0 | g ∈ G} is a class of distributions that can be realized as the pushforward of a base measure µ0 ∈ P(R^{d0}) by a function g : R^{d0} → R^d, the problem to solve is

g* = arg min_{g∈G} d_{CT,F_{L,M,k,+}(1)}(g#µ0, νn) = arg min_{g∈G} { max_{u∈F_{L,M,k,+}(1)} {E_{X∼νn}[u(X)] − E_{Y∼µ0}[u(g(Y))]} + max_{u∈F_{L,M,k,+}(1)} {E_{Y∼µ0}[u(g(Y))] − E_{X∼νn}[u(X)]} }.

Algorithm 2, given in App. F, summarizes learning with the surrogate CT distance.
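The projection step for the Choquet critic is simple: after each (stochastic) gradient update, the hidden-to-hidden weights are clipped at zero so that the network stays an ICMN. A minimal sketch under our own (hypothetical) parameter naming:

```python
import numpy as np

def projected_critic_update(weights, grads, lr=1e-3):
    """One ascent step on the surrogate VDC objective for the Choquet critic.

    `weights` / `grads` are dicts of parameter arrays; the names in NONNEG
    are hypothetical and stand for the hidden-to-hidden and output weights,
    which must stay non-negative for the critic to remain input convex.
    """
    NONNEG = {"W2", "a"}
    new = {}
    for name, w in weights.items():
        w = w + lr * grads[name]      # gradient ascent on the VDC objective
        if name in NONNEG:
            w = np.maximum(w, 0.0)    # projection onto the constraint set
        new[name] = w
    return new

weights = {"W1": np.ones((4, 3)), "W2": np.ones((1, 4)), "a": np.ones(1)}
grads = {k: -2e3 * np.ones_like(v) for k, v in weights.items()}
out = projected_critic_update(weights, grads)
print(out["W2"].min() >= 0.0, out["W1"].min() < 0.0)  # True True
```

First-layer weights W1 are left unconstrained, matching Definition 3: only non-bias weights beyond the first layer are projected.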

7. EXPERIMENTS

Portfolio optimization under dominance constraints. In this experiment, we use the VDC to optimize an illustrative example from Xue et al. (2020) (Example 1) that follows the paradigm laid out in Sec. 6. In this example, ξ ∈ R is drawn uniformly from [0, 1], the benchmark portfolio is

G1(ξ) = i/20 for ξ ∈ [0.05 i, 0.05 (i+1)), i = 0, . . . , 19, and G1(1) = 1,

and the optimization is over the parameterized portfolio G2(ξ) := G(ξ; z) = zξ. The constrained optimization problem is thus specified as:

min_z − E_P[G(ξ; z)]  s.t.  G(ξ; z) ⪰2 G1(ξ),  1 ≤ z ≤ 2.

To highlight the representation power of ICMNs, we also replace them in the VDC estimation with the ICNN implementation of Huang et al. (2021). Instead of a maxout activation, Huang et al. (2021) use a Softplus activation, and instead of a projection step they use a Softplus operation on the hidden-to-hidden weights to impose the non-negativity that enforces convexity.

Image generation with the VDC regularizer. When training g* and g0, we used the same WGAN-GP hyperparameter configuration; the gradient regularization of the discriminator is computed with respect to interpolates between real and generated data. With λ set in Equation (11), we report results in Table 1, where we see improved image quality from g* (as measured by lower FID) relative to the pre-trained baseline g0. We therefore find that the VDC surrogate improves upon g0 by providing g* with larger support, preventing mode collapse. Samples generated from g* are displayed in Figure 2a. In a Gaussian-mixture experiment detailed in App. G, we quantify mode collapse with two scores: 1) the entropy of the discrete assignment of generated points to the means of the mixture, and 2) the negative log-likelihood (NLL) of the Gaussian mixture. When training with the VDC regularizer to improve upon the collapsed generator g0 (taken from step 55k of the unregularized GAN training), we observe more stable training and better mode coverage as quantified by our scores.
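The first mode-collapse score, the entropy of the assignment of generated points to mixture means, can be sketched as follows (our own minimal illustration; `samples` and `means` are hypothetical inputs):

```python
import numpy as np

def assignment_entropy(samples, means):
    # Assign each generated point to its nearest mixture mean, then compute
    # the entropy of the resulting assignment histogram. High entropy means
    # good mode coverage; entropy 0 means total mode collapse.
    d2 = ((samples[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(means))
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

means = np.array([[0.0, 0.0], [10.0, 10.0]])
collapsed = np.zeros((100, 2)) + 0.1                   # all near one mode
spread = np.vstack([means[0] + 0.1, means[1] - 0.1] * 50)  # both modes hit
print(assignment_entropy(collapsed, means))            # 0.0
print(round(assignment_entropy(spread, means), 4))     # log(2) ≈ 0.6931
```

The maximal value log(K) is attained when the K modes are covered uniformly, which makes the score easy to compare across runs.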
2D point cloud generation with d_CT. We apply learning with the CT distance in a 2D generative modeling setting. Both the generator and the CT critic architectures are fully-connected neural networks with maxout non-linearities of kernel size 2. The progression of the generated samples can be found in the right-hand panel of Figure 2, where we see the trained generator accurately learn the ground-truth distribution. All experiments were performed in a single-CPU compute environment.

8. CONCLUSION

In this paper, we introduced learning with stochastic orders in high dimensions via the surrogate Variational Dominance Criterion and Choquet-Toland distance. These surrogates leverage input convex maxout networks, a new variant of input convex neural networks. Our surrogates have parametric statistical rates and lead to new learning paradigms by incorporating dominance constraints that improve upon a baseline. Experiments on synthetic and real image generation yield promising results. Finally, our work, although theoretical in nature, can be subject to misuse, similar to any generative method.

A PROOFS OF SEC. 3

Proof of Proposition 2. Since A is included in the set of convex functions, the implication from right to left is straightforward. For the other implication, we prove the contrapositive. Suppose that u is a convex function on Ω such that ∫_Ω u d(µ− − µ+) > 0. Then we show that we can construct ũ ∈ A such that ∫_Ω ũ d(µ− − µ+) > 0, which shows that the supremum over A is strictly positive. Denote by C the set of convex functions f : R^d → R which are the pointwise supremum of finitely many affine functions, i.e., f(x) = max_{i∈I} {⟨yi, x⟩ − ai} for some finite family (yi, ai) ∈ R^d × R. For any convex function g, there is an increasing sequence (gn)n ⊆ C such that g = sup_n gn pointwise (Ekeland & Schachermayer (2014), p. 3). Applying this to u, we know there exists an increasing sequence (un)n ⊆ C such that u = lim_{n→∞} un. By the dominated convergence theorem, lim_{n→∞} ∫_Ω un d(µ− − µ+) = ∫_Ω u d(µ− − µ+) > 0, which means that for some N large enough, ∫_Ω uN d(µ− − µ+) > 0 as well. Since uN admits a representation uN(x) = max_{i∈I} {⟨yi, x⟩ − ai} for some finite family I, it may be trivially extended to a convex function on R^d. Let (ηϵ)ϵ be a family of non-negative radially symmetric functions in C²(R^d) supported on the ball Bϵ(0) of radius ϵ centered at zero, and such that ∫_{R^d} ηϵ(x) dx = 1. Let Ωϵ = {x ∈ Ω | dist(x, ∂Ω) ≥ ϵ}.
For any x ∈ R^d, we have that

(uN ∗ ηϵ)(x) = ∫_{R^d} ηϵ(x − y) uN(y) dy = ∫_{R^d} ηϵ(y′) uN(x − y′) dy′.  (12)

By the convexity of uN, we have that uN(λx + (1−λ)x′ − y′) ≤ λ uN(x − y′) + (1−λ) uN(x′ − y′) for any x, x′, y′ ∈ R^d. Thus, (12) implies that (uN ∗ ηϵ)(λx + (1−λ)x′) ≤ λ(uN ∗ ηϵ)(x) + (1−λ)(uN ∗ ηϵ)(x′), which means that uN ∗ ηϵ is convex. Also, by the dominated convergence theorem,

lim_{ϵ→0} ∫_Ω (uN ∗ ηϵ) d(µ− − µ+) = ∫_Ω uN d(µ− − µ+) > 0.

Hence, there exists ϵ0 > 0 such that ∫_Ω (uN ∗ ηϵ0) d(µ− − µ+) > 0. Since uN ∗ ηϵ0 is in C²(R^d) by the properties of convolutions, its gradient is continuous. Since the closure of Ω is compact because Ω is bounded, by the Weierstrass theorem we have that sup_{x∈Ω} ∥∇(uN ∗ ηϵ0)(x)∥2 < +∞. Let r > 0 be such that the ball Br(0) is included in K. Rescaling uN ∗ ηϵ0 by an appropriate constant, we obtain sup_{x∈Ω} ∥∇(uN ∗ ηϵ0)(x)∥2 < r, which means that ∇(uN ∗ ηϵ0)(x) ∈ K for any x ∈ Ω. Thus, uN ∗ ηϵ0 ∈ A, and hence sup_{u∈A} {∫_Ω u d(µ− − µ+)} > 0, concluding the proof.

Lemma 2 (metric entropy of bounded Lipschitz convex functions; statement partially reconstructed). Let F_d(M, C) be the class of convex functions on [−1, 1]^d bounded in absolute value by M and C-Lipschitz. Then log N(δ; F_d(M, C), ∥•∥∞) ≤ K δ^{−d/2}, for some constant K that depends on C, M and d.

Lemma 3 (Dudley's entropy integral bound, Wainwright (2019), Thm. 5.22; Dudley (1967)). Let {Xθ | θ ∈ T} be a zero-mean sub-Gaussian process with respect to the metric ρX on T. Let D = sup_{θ,θ′∈T} ρX(θ, θ′). Then for any δ ∈ [0, D] such that N(δ; T, ρX) ≥ 10, we have

E[sup_{θ,θ′∈T}(Xθ − Xθ′)] ≤ E[sup_{γ,γ′∈T : ρX(γ,γ′)≤δ}(Xγ − Xγ′)] + 32 ∫_δ^D √(log N(t; T, ρX)) dt.

Proposition 4. E_ϵ[∥Sn∥_{F_d(M,C)}] ≤ K n^{−2/d}, where K is a constant depending on M, C and d.

Proof. We choose Tn = {(f(Xi))_{i=1}^n ∈ R^n | f ∈ F_d(M, C)} and define the Rademacher process X_f = Σ_{i=1}^n ϵi f(Xi), which is sub-Gaussian with respect to the metric ρn(f, f′) = √(Σ_{i=1}^n (f(Xi) − f′(Xi))²). Remark that D ≤ 2M√n.
For any $\delta \in [0, D]$, we apply Lemma 3 setting $f' \equiv 0$ and get
$$\mathbb{E}_\epsilon[\|S_n\|_{F_d(M,C)}] = \frac{1}{n}\, \mathbb{E}\Big[\sup_{f \in \mathbb{T}_n} X_f\Big] \le \frac{1}{n}\Big( \mathbb{E}\Big[\sup_{f, f' \in \mathbb{T}_n,\ \rho_n(f, f') \le \delta} (X_f - X_{f'})\Big] + 32 \int_\delta^D \sqrt{\log N(t; F_d(M, C), \rho_n)}\, dt \Big). \qquad (13)$$
Note that for any $f, f' \in F_d(M, C)$, $\rho_n(f, f') \le \sqrt{n}\, \|f - f'\|_\infty$, which means that $\log N(\delta; F_d(M, C), \rho_n) \le \log N(\delta/\sqrt{n}; F_d(M, C), \|\cdot\|_\infty)$. Thus, Lemma 2 implies that
$$\log N(\delta; F_d(M, C), \rho_n) \le K \big(\delta/\sqrt n\big)^{-d/2} = K \delta^{-d/2} n^{d/4}.$$
Hence, assuming $d > 4$,
$$\int_\delta^D \sqrt{\log N(t; F_d(M,C), \rho_n)}\, dt \le \int_\delta^D \sqrt{K}\, n^{d/8}\, t^{-d/4}\, dt = \sqrt{K}\, n^{d/8}\, \Big[\frac{t^{-d/4+1}}{-d/4+1}\Big]_\delta^D \le \frac{\sqrt{K}\, n^{d/8}}{d/4 - 1} \big(\delta^{-d/4+1} - (2M\sqrt n)^{-d/4+1}\big). \qquad (14)$$
We set $\delta = n^{1/2 - 2/d}$, which gives $\delta^{-d/4+1} = n^{(1/2 - 2/d)(-d/4 + 1)} = n^{-d/8 + 1 - 2/d}$. Hence, the right-hand side of (14) is upper-bounded by $\frac{\sqrt K}{d/4 - 1}\, n^{1 - 2/d}$. And since $\sum_{i=1}^n \epsilon_i (f(X_i) - f'(X_i)) \le \sum_{i=1}^n |f(X_i) - f'(X_i)| \le \sqrt n\, \rho_n(f, f')$, we have
$$\mathbb{E}\Big[\sup_{\gamma, \gamma' \in \mathbb{T}_n,\ \rho_n(\gamma, \gamma') \le \delta} (X_\gamma - X_{\gamma'})\Big] \le \sqrt n\, \delta = \sqrt n\, n^{1/2 - 2/d} = n^{1 - 2/d}.$$
Plugging these bounds back into (13), we obtain $\mathbb{E}_\epsilon[\|S_n\|_{F_d(M,C)}] \le \big(\frac{32\sqrt K}{d/4 - 1} + 1\big) n^{-2/d}$. Since $K$ already depends on $d$, we rename $K \leftarrow \frac{32\sqrt K}{d/4 - 1} + 1$, concluding the proof.

Proof of Theorem 2. Let $F_d(C)$ be the space of convex functions on $[-1, 1]^d$ such that $|f(x)| \le C\sqrt d$ and $|f(x) - f(y)| \le C\|x - y\|$ for any $x, y \in [-1, 1]^d$. Lemma 1 shows that when $\Omega = [-1, 1]^d$ and $K = \{x \in \mathbb{R}^d \mid \|x\|_2 \le C\}$, functions in $A$ belong to $F_d(C)$ up to a constant term, which means that $\mathrm{VDC}_A(\mu^+ \| \mu^-) = \sup_{u \in F_d(C)} \int_\Omega u\, d(\mu^- - \mu^+)$. Theorem 11 of Sriperumbudur et al. (2009) shows that for any function class $F$ on $\Omega$ such that $r := \sup_{f \in F, x \in \Omega} |f(x)| < +\infty$, with probability at least $1 - \delta$ we have
$$\Big| \sup_{f \in F} \{\mathbb{E}_\mu f(x) - \mathbb{E}_\nu f(y)\} - \sup_{f \in F} \{\mathbb{E}_{\mu_n} f(x) - \mathbb{E}_{\nu_m} f(y)\} \Big| \le \sqrt{18 r^2 \log(4/\delta)} \Big(\frac{1}{\sqrt m} + \frac{1}{\sqrt n}\Big) + 2 \hat R_n(F, (x_i)_{i=1}^n) + 2 \hat R_m(F, (y_i)_{i=1}^m),$$
where $\hat R_n$ denotes the empirical Rademacher complexity. Proposition 4 shows that $\hat R_n(F_d(C), (x_i)_{i=1}^n) \le K n^{-2/d}$ for any $(x_i)_{i=1}^n \subseteq \Omega \subseteq [-1, 1]^d$, where $K$ depends on $C$ and $d$.
This concludes the proof.
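The increasing approximation by the class $\mathcal{C}$ used in the proof of Proposition 2 can be illustrated numerically. The sketch below is an illustration we add here, not part of the paper's experiments: the target $u(x) = \frac12\|x\|^2$ and the grid of tangent points are our own choices. It builds finite maxima of tangent affine functions and checks that they increase towards $u$ from below.

```python
import itertools

def u(x):
    # target convex function u(x) = 0.5 * ||x||^2
    return 0.5 * sum(t * t for t in x)

def max_affine(x, anchors):
    # tangent plane of u at anchor y is <y, x> - 0.5 * ||y||^2; it touches u at x = y,
    # so the max over a finite set of anchors is a member of the class C below u
    return max(sum(yi * xi for yi, xi in zip(y, x)) - 0.5 * sum(yi * yi for yi in y)
               for y in anchors)

def grid(n, lo=-1.0, hi=1.0, d=2):
    pts = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return list(itertools.product(pts, repeat=d))

x = (0.3, -0.7)
coarse, fine = max_affine(x, grid(3)), max_affine(x, grid(9))
assert coarse <= fine <= u(x) + 1e-12   # increasing in the family, always below u
assert u(x) - fine < 0.02               # a finer grid approximates u well
```

Refining the anchor grid only adds affine functions to the maximum, which is what makes the sequence $(u_n)_n$ increasing in the proof.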

B PROOFS OF SEC. 4

Proposition 5. Input convex maxout networks (Definition 3) are convex with respect to their input.

Proof. The proof is by finite induction. We show that for any $2 \le \ell \le L - 1$ and $1 \le i \le m_\ell$, the function $x \mapsto x^{(\ell)}_i$ is convex. The base case $\ell = 2$ holds because $x \mapsto x^{(2)}_i = \frac{1}{\sqrt{m_2}} \max_{j \in [k]} \langle w^{(1)}_{i,j}, (x, 1)\rangle$ is a pointwise supremum of convex (affine) functions, which is convex. For the induction case, we have that $x \mapsto x^{(\ell-1)}_i$ is convex for any $1 \le i \le m_{\ell-1}$ by the induction hypothesis. Since a linear combination of convex functions with non-negative coefficients is convex, for any $1 \le i \le m_\ell$, $1 \le j \le k$, $x \mapsto \langle w^{(\ell-1)}_{i,j}, (x^{(\ell-1)}, 1)\rangle$ is convex. Finally, $x \mapsto x^{(\ell)}_i = \frac{1}{\sqrt{m_\ell}} \max_{j \in [k]} \langle w^{(\ell-1)}_{i,j}, (x^{(\ell-1)}, 1)\rangle$ is convex because it is the pointwise supremum of convex functions.

Proof of Proposition 3. We can reexpress $f(x)$ as
$$f(x) = \frac{1}{\sqrt{m_L}} \sum_{i=1}^{m_L} a_i \langle w^{(L-1)}_{i, j^*_{L-1,i}}, (x^{(L-1)}, 1)\rangle, \qquad j^*_{L-1,i} = \arg\max_{j \in [k]} \langle w^{(L-1)}_{i,j}, (x^{(L-1)}, 1)\rangle, \qquad (15)$$
$$x^{(\ell)}_i = \frac{1}{\sqrt{m_\ell}} \langle w^{(\ell-1)}_{i, j^*_{\ell-1,i}}, (x^{(\ell-1)}, 1)\rangle, \qquad j^*_{\ell-1,i} = \arg\max_{j \in [k]} \langle w^{(\ell-1)}_{i,j}, (x^{(\ell-1)}, 1)\rangle.$$
For $1 \le \ell \le L - 1$, we define the matrices $W^*_\ell \in \mathbb{R}^{m_{\ell+1} \times m_\ell}$ such that their $i$-th row is the vector $[\frac{1}{\sqrt{m_{\ell+1}}} w^{(\ell)}_{i, j^*_{\ell,i}}]_{1:m_\ell}$, i.e. the vector containing the first $m_\ell$ components of $\frac{1}{\sqrt{m_{\ell+1}}} w^{(\ell)}_{i, j^*_{\ell,i}}$. Iterating the chain rule, one sees that for almost every $x \in \mathbb{R}^d$,[1]
$$\nabla f(x) = (W^*_1)^\top (W^*_2)^\top \cdots (W^*_{L-1})^\top a.$$
Since the spectral norm $\|\cdot\|_2$ is sub-multiplicative and $\|A\|_2 = \|A^\top\|_2$, we have
$$\|\nabla f(x)\| = \|(W^*_1)^\top (W^*_2)^\top \cdots (W^*_{L-1})^\top a\| \le \|W^*_1\|_2 \|W^*_2\|_2 \cdots \|W^*_{L-1}\|_2 \|a\|.$$
We compute the Frobenius norm of $W^*_\ell$:
$$\|W^*_\ell\|_F^2 = \frac{1}{m_{\ell+1}} \sum_{i=1}^{m_{\ell+1}} \big\| [w^{(\ell)}_{i,j^*_{\ell,i}}]_{1:m_\ell} \big\|^2 \le \frac{1}{m_{\ell+1}} \sum_{i=1}^{m_{\ell+1}} \big\| w^{(\ell)}_{i,j^*_{\ell,i}} \big\|^2 \le 1.$$
Since for any matrix $A$, $\|A\|_2 \le \|A\|_F$, and the vector $a$ satisfies $\|a\| \le 1$, we obtain that $\|\nabla f(x)\| \le 1$. To obtain the bound on $|f(x)|$, we use again the expression (15).
For $1 \le \ell \le L - 1$, we let $b_\ell \in \mathbb{R}^{m_{\ell+1}}$ be the vector whose $i$-th component is $[\frac{1}{\sqrt{m_{\ell+1}}} w^{(\ell)}_{i, j^*_{\ell,i}}]_{m_\ell + 1}$. Since $|[\frac{1}{\sqrt{m_{\ell+1}}} w^{(\ell)}_{i, j^*_{\ell,i}}]_{m_\ell + 1}| \le \frac{1}{\sqrt{m_{\ell+1}}}$, we have $\|b_\ell\| \le 1$. It is easy to see that
$$f(x) = a^\top W^*_{L-1} \cdots W^*_1 x + \sum_{\ell=1}^{L-1} a^\top W^*_{L-1} \cdots W^*_{\ell+1} b_\ell.$$
Thus, $|f(x)| \le \|a\| \big( \|W^*_{L-1} \cdots W^*_1 x\| + \sum_{\ell=1}^{L-1} \|W^*_{L-1} \cdots W^*_{\ell+1} b_\ell\| \big) \le L$. The bound on $\|x^{(\ell)}\|$ follows similarly, as $x^{(\ell)} = W^*_{\ell-1} \cdots W^*_1 x + \sum_{\ell'=1}^{\ell-1} W^*_{\ell-1} \cdots W^*_{\ell'+1} b_{\ell'}$.

Proposition 6. Let $F_{L,M,k}(1)$ be the subset of $F_{L,M,k}$ such that for all $1 \le \ell \le L-1$, $1 \le i \le m_\ell$, $1 \le j \le k$, $\|w^{(\ell)}_{i,j}\|_2 \le 1$, and $\|a\|_2^2 = \sum_{i=1}^{m_L} a_i^2 \le 1$. For any $f \in F_{L,M,k}(1)$ and $x \in B_1(\mathbb{R}^d)$, we have that $|f(x)| \le 1$. The metric entropy of $F_{L,M,k}(1)$ with respect to $\bar\rho_n$ admits the upper bound
$$\log N(\delta; F_{L,M,k}(1), \bar\rho_n) \le \sum_{\ell=2}^L k m_\ell (m_{\ell-1} + 1) \log\Big(1 + \frac{2^{3+L-\ell}\sqrt{(\ell-1)^2+1}}{\delta}\Big) + m_L \log\Big(1 + \frac{4\sqrt{L^2+1}}{\delta}\Big).$$

Proof. We define the function class $G_{L,M,k}$ containing the functions from $\mathbb{R}^d$ to $\mathbb{R}^{m_L}$ of the form
$$g(x) = \Big( \frac{1}{\sqrt{m_L}} \max_{j \in [k]} \langle w^{(L-1)}_{i,j}, (x^{(L-1)}, 1)\rangle \Big)_{i=1}^{m_L},$$
$$\forall 2 \le \ell \le L-1: \quad x^{(\ell)}_i = \frac{1}{\sqrt{m_\ell}} \max_{j \in [k]} \langle w^{(\ell-1)}_{i,j}, (x^{(\ell-1)}, 1)\rangle, \qquad x^{(1)} = x,$$
$$\forall 1 \le \ell \le L-1,\ 1 \le i \le m_\ell,\ 1 \le j \le k: \quad \|w^{(\ell)}_{i,j}\|_2 \le 1.$$
Given $\{X_i\}_{i=1}^n \subseteq B_1(\mathbb{R}^d)$ we define the pseudo-metric $\bar\rho_n$ between functions from $\mathbb{R}^d$ to $\mathbb{R}^{m_L}$ as $\bar\rho_n(f, f') = \sqrt{\frac{1}{n} \sum_{i=1}^{m_L} \sum_{j=1}^n (f_i(X_j) - f'_i(X_j))^2}$. We prove by induction that
$$\log N(\delta; G_{L,M,k}, \bar\rho_n) \le \sum_{\ell=2}^L k m_\ell (m_{\ell-1}+1) \log\Big(1 + \frac{2^{2+L-\ell}\sqrt{(\ell-1)^2+1}}{\delta}\Big).$$
To show the induction case, note that for $L \ge 3$, any $g \in G_{L,M,k}$ can be written as $g(x) = \big(\frac{1}{\sqrt{m_L}} \max_{j\in[k]} \langle w^{(L-1)}_{i,j}, (h(x), 1)\rangle\big)_{i=1}^{m_L}$, where $h \in G_{L-1,M,k}$. Remark that given a $\frac{\delta}{2\sqrt{(L-1)^2+1}}$-cover $C'$ of $B_1(\mathbb{R}^{m_{L-1}+1})$, there exist $\tilde w^{(L-1)}_{i,j} \in C'$ such that $\|\tilde w^{(L-1)}_{i,j} - w^{(L-1)}_{i,j}\| \le \frac{\delta}{2\sqrt{(L-1)^2+1}}$.
Hence, if $\tilde h$ is such that $\bar\rho_n(\tilde h, h) \le \frac{\delta}{2}$, and we define $\tilde g(x) = \big(\frac{1}{\sqrt{m_L}} \max_{j\in[k]} \langle \tilde w^{(L-1)}_{i,j}, (\tilde h(x), 1)\rangle\big)_{i=1}^{m_L}$, we obtain
$$\begin{aligned} \bar\rho_n(\tilde g, g)^2 &= \frac{1}{n m_L} \sum_{i=1}^{m_L} \sum_{s=1}^n \Big( \max_{j\in[k]} \langle \tilde w^{(L-1)}_{i,j}, (\tilde h(X_s), 1)\rangle - \max_{j\in[k]} \langle w^{(L-1)}_{i,j}, (h(X_s), 1)\rangle \Big)^2 \\ &\le \frac{1}{n m_L} \sum_{i=1}^{m_L} \sum_{s=1}^n \max_{j\in[k]} \big( \langle \tilde w^{(L-1)}_{i,j} - w^{(L-1)}_{i,j}, (\tilde h(X_s), 1)\rangle + \langle w^{(L-1)}_{i,j}, (\tilde h(X_s) - h(X_s), 0)\rangle \big)^2 \\ &\le \frac{2}{n m_L} \sum_{i=1}^{m_L} \sum_{s=1}^n \max_{j\in[k]} \big( \langle \tilde w^{(L-1)}_{i,j} - w^{(L-1)}_{i,j}, (\tilde h(X_s), 1)\rangle^2 + \langle w^{(L-1)}_{i,j}, (\tilde h(X_s) - h(X_s), 0)\rangle^2 \big) \\ &\le \frac{2}{n m_L} \sum_{i=1}^{m_L} \sum_{s=1}^n \max_{j\in[k]} \big( \|\tilde w^{(L-1)}_{i,j} - w^{(L-1)}_{i,j}\|^2 \|(\tilde h(X_s), 1)\|^2 + \|w^{(L-1)}_{i,j}\|^2 \|\tilde h(X_s) - h(X_s)\|^2 \big) \\ &\le \frac{2}{n m_L} \sum_{i=1}^{m_L} \sum_{s=1}^n \Big( \frac{\delta^2}{4((L-1)^2+1)}\, \big((L-1)^2+1\big) + \|\tilde h(X_s) - h(X_s)\|^2 \Big) \\ &= \frac{\delta^2}{2} + \frac{2}{n} \sum_{s=1}^n \|\tilde h(X_s) - h(X_s)\|^2 \le \frac{\delta^2}{2} + 2\,\bar\rho_n(\tilde h, h)^2 \le \delta^2. \end{aligned}$$
In the second-to-last inequality we used that if $h \in G_{\ell,M,k}$, $\|h(x)\| \le \ell$; this is equivalent to the bound $\|x^{(\ell)}\| \le \ell$ shown in Proposition 3. Hence, we can build a $\delta$-cover of $G_{L,M,k}$ in the pseudo-metric $\bar\rho_n$ from the Cartesian product of a $\frac{\delta}{2}$-cover of $G_{L-1,M,k}$ in $\bar\rho_n$ and $k m_L$ copies of a $\frac{\delta}{2\sqrt{(L-1)^2+1}}$-cover of $B_1(\mathbb{R}^{m_{L-1}+1})$ in the $\|\cdot\|_2$ norm. Thus,
$$N(\delta; G_{L,M,k}, \bar\rho_n) \le N\Big(\frac{\delta}{2}; G_{L-1,M,k}, \bar\rho_n\Big) \cdot N\Big(\frac{\delta}{2\sqrt{(L-1)^2+1}}; B_1(\mathbb{R}^{m_{L-1}+1}), \|\cdot\|_2\Big)^{k m_L}.$$
The metric entropy of the unit ball admits the upper bound $\log N(\delta; B_1(\mathbb{R}^d), \|\cdot\|_2) \le d \log(1 + \frac{2}{\delta})$ (Wainwright (2019), Example 5.8). Consequently,
$$\begin{aligned} \log N(\delta; G_{L,M,k}, \bar\rho_n) &\le \log N\Big(\frac{\delta}{2}; G_{L-1,M,k}, \bar\rho_n\Big) + k m_L (m_{L-1}+1) \log\Big(1 + \frac{4\sqrt{(L-1)^2+1}}{\delta}\Big) \\ &\le \sum_{\ell=2}^{L-1} k m_\ell (m_{\ell-1}+1) \log\Big(1 + \frac{2^{2+L-\ell}\sqrt{(\ell-1)^2+1}}{\delta}\Big) + k m_L (m_{L-1}+1) \log\Big(1 + \frac{4\sqrt{(L-1)^2+1}}{\delta}\Big) \\ &= \sum_{\ell=2}^{L} k m_\ell (m_{\ell-1}+1) \log\Big(1 + \frac{2^{2+L-\ell}\sqrt{(\ell-1)^2+1}}{\delta}\Big). \end{aligned}$$
In the second inequality we used the induction hypothesis. To conclude the proof, note that an arbitrary function $f \in F_{L,M,k}(1)$ can be written as $f(x) = \langle a, g(x)\rangle$, where $a \in B_1(\mathbb{R}^{m_L})$ and $g \in G_{L,M,k}$.
Applying an analogous argument, we see that a δ 2 √ L 2 +1 -cover of B 1 (R m L ) and a δ 2 -cover of G L,M,k give rise to a δ-cover of F L,M,k (1). Hence, log N δ; F L,M,k (1), ρn ≤ log N δ 2 ; G L,M,k , ρn + log N δ 2 ; B 1 (R m L ), ∥ • ∥ 2 ≤ L ℓ=2 km ℓ (m ℓ-1 + 1) log 1 + 2 3+L-ℓ √ (ℓ-1) 2 +1 δ + m L log 1 + 4 √ L 2 +1 δ . Finally, to show the bound |f (x)| ≤ 1 for all f ∈ F L,M,k (1) and x ∈ B 1 (R d ), we use that |f (x)| ≤ |⟨a, g(x)⟩| ≤ ∥a∥∥g(x)∥ ≤ 1 and that if g ∈ G ℓ,M,k and x ∈ B 1 (R d ), then ∥g(x)∥ ≤ 1, as shown before. Proposition 7. Suppose that for all 1 ≤ ℓ ≤ L, the widths m ℓ satisfy m ℓ ≤ m. Then, the Rademacher complexity of the class F L,M,k (1) satisfies: E ϵ [∥S n ∥ F L,M,k (1) ] ≤ 64 (L-1)km(m+1) n (L + 1) log(2) + 1 2 log(L 2 + 1) + √ π 2 . (16) Proof. We apply Dudley's entropy integral bound (Lemma 3). We choose T n = {(f (X i )) n i=1 ∈ R n | f ∈ F L,M,k (1)}, we define the Rademacher process X f = 1 √ n n i=1 ϵ i f (X i ), which is sub- Gaussian with respect to the metric ρn (f, f ′ ) = 1 n n i=1 (f (X i ) -f ′ (X i )) 2 . Remark that D ≤ 2. Setting f ′ ≡ 0 and δ = 0 in Lemma 3, we obtain that E ϵ [∥S n ∥ F L,M,k (1) ] = 1 √ n E sup f ∈Tn X f ′ ≤ 32 √ n 2 0 log N (t; F L,M,k (1), ρn ) dt. Applying the metric entropy bound from Proposition 6, we get log N (δ; F L,M,k (1), ρn ) ≤ L ℓ=2 km ℓ (m ℓ-1 + 1) log 1 + 2 3+L-ℓ √ (ℓ-1) 2 +1 δ + m L log 1 + 4 √ L 2 +1 δ ≤ (L -1)km(m + 1) log 1 + 2 L+1 √ L 2 +1 δ ≤ (L -1)km(m + 1) log 2 L+2 √ L 2 +1 δ In the last equality we used that for δ ∈ [0, 2], 1 + 2 L+1 √ L 2 +1 δ ≤ 2 L+2 √ L 2 +1 δ . We compute the integral 2 0 log 2 L+2 √ L 2 +1 t dt = 2 L+2 √ L 2 + 1 2 -(L+1) 1 √ L 2 +1 0 -log(t ′ ) dt ′ . Applying Lemma 4 with z = 2 -(L+1) 1 √ L 2 +1 1 √ L 2 +1 , we obtain that 2 -(L+1) 1 √ L 2 +1 0 -log(t ′ ) dt ′ ≤ 2 -(L+1) (L + 1) log(2) + 1 2 log(L 2 + 1) + √ π 2 . Putting everything together yields equation ( 16). Lemma 4. 
For any $z \in (0, 1]$, we have that
$$\int_0^z \sqrt{-\log x}\, dx = z\sqrt{-\log z} + \frac{\sqrt\pi}{2}\, \mathrm{erfc}\big(\sqrt{-\log z}\big) \le z\sqrt{-\log z} + \frac{\sqrt\pi}{2}.$$
Here, $\mathrm{erfc}$ denotes the complementary error function, defined as $\mathrm{erfc}(x) = \frac{2}{\sqrt\pi}\int_x^{+\infty} e^{-t^2}\, dt$.

Proof. We rewrite the integral as:
$$\int_0^z \int_0^{-\log x} \frac{1}{2\sqrt y}\, dy\, dx = \int_0^{-\log z} \frac{1}{2\sqrt y} \int_0^z dx\, dy + \int_{-\log z}^{+\infty} \frac{1}{2\sqrt y} \int_0^{e^{-y}} dx\, dy = z\int_0^{-\log z} \frac{dy}{2\sqrt y} + \int_{-\log z}^{+\infty} \frac{e^{-y}}{2\sqrt y}\, dy = z\sqrt{-\log z} + \int_{\sqrt{-\log z}}^{+\infty} e^{-t^2}\, dt = z\sqrt{-\log z} + \frac{\sqrt\pi}{2}\, \mathrm{erfc}\big(\sqrt{-\log z}\big).$$
The complementary error function satisfies the bound $\mathrm{erfc}(x) \le e^{-x^2}$ for any $x > 0$, which implies the final inequality.

Proof of Theorem 3. Proposition 7 proves that under the condition $m_\ell \le m$, the empirical Rademacher complexity of the class $F_{L,M,k}(1)$ satisfies $\hat R_n(F_{L,M,k}(1)) \le 64\sqrt{\frac{(L-1)km(m+1)}{n}}\big((L+1)\log 2 + \frac12 \log(L^2+1) + \frac{\sqrt\pi}{2}\big)$. Applying Theorem 11 of Sriperumbudur et al. (2009) as in the proof of Theorem 2, and using that for any $f \in F_{L,M,k}(1)$ and $x \in B_r(\mathbb{R}^d)$ we have $|f(x)| \le r$ (see Proposition 6), we obtain the result.

C PROOFS OF SEC. 5

Proof of Theorem 4. To show (i), note that for any $\nu' \in P(K)$,
$$0 = \frac12 W_2^2(\mu^+, \nu') - \frac12 W_2^2(\mu^-, \nu') + \frac12 W_2^2(\mu^-, \nu') - \frac12 W_2^2(\mu^+, \nu') \le \sup_{\nu \in P(K)} \Big\{\frac12 W_2^2(\mu^+, \nu) - \frac12 W_2^2(\mu^-, \nu)\Big\} + \sup_{\nu \in P(K)} \Big\{\frac12 W_2^2(\mu^-, \nu) - \frac12 W_2^2(\mu^+, \nu)\Big\} = d_{CT,A}(\mu^+, \mu^-).$$
The right-to-left implication of (ii) is straightforward. To show the left-to-right one, we use the definition of the CT distance, rewriting it in terms of $\mathrm{VDC}_A(\mu^+\|\mu^-)$ and $\mathrm{VDC}_A(\mu^-\|\mu^+)$:
$$d_{CT,A}(\mu^+, \mu^-) = \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu^+ - \mu^-)\Big\} + \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu^- - \mu^+)\Big\}. \qquad (17)$$
Since the two terms on the right-hand side are non-negative, $d_{CT,A}(\mu^+, \mu^-) = 0$ implies that they are both zero. Then, applying Proposition 2, we obtain that $\mu^- \preceq \mu^+$ and $\mu^+ \preceq \mu^-$ according to the Choquet order. The antisymmetry property of partial orders then implies that $\mu^+ = \mu^-$. To show (iii), we use equation (17) again.
The result follows from
$$\sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_1 - \mu_2)\Big\} \le \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_1 - \mu_3)\Big\} + \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_3 - \mu_2)\Big\},$$
$$\sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_2 - \mu_1)\Big\} \le \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_2 - \mu_3)\Big\} + \sup_{u \in A} \Big\{\int_\Omega u\, d(\mu_3 - \mu_1)\Big\}.$$

D PROOFS OF SUBSEC. 5.1

Lemma 5. For a function class $F$ that contains the zero function, define $\|S_n\|_F = \sup_{f \in F} \big| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(X_i) \big|$, where the $\epsilon_i$ are Rademacher variables. $\mathbb{E}_{X,\epsilon}[\|S_n\|_F]$ is known as the Rademacher complexity. Suppose that $0$ belongs to the compact set $K$. We have that
$$\frac12\, \mathbb{E}_{X,\epsilon}[\|S_n\|_{\bar A}] \le \mathbb{E}[d_{CT,A}(\mu, \mu_n)] \le 4\, \mathbb{E}_{X,\epsilon}[\|S_n\|_A],$$
where $\bar A = \{f - \mathbb{E}_\mu[f] \mid f \in A\}$ is the centered version of the class $A$.

Proof. We use an argument similar to the proof of Prop. 4.11 of Wainwright (2019) (with the appropriate modifications) to obtain the Rademacher complexity upper and lower bounds. We start with the lower bound:
$$\begin{aligned} \mathbb{E}_{X,\epsilon}[\|S_n\|_{\bar A}] &= \mathbb{E}_{X,\epsilon}\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \epsilon_i \big(f(X_i) - \mathbb{E}_{Y_i}[f(Y_i)]\big)\Big|\Big] \le \mathbb{E}_{X,Y,\epsilon}\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \epsilon_i \big(f(X_i) - f(Y_i)\big)\Big|\Big] \\ &= \mathbb{E}_{X,Y}\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \big(f(X_i) - f(Y_i)\big)\Big|\Big] \le \mathbb{E}_X\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f]\big)\Big|\Big] + \mathbb{E}_Y\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \big(f(Y_i) - \mathbb{E}[f]\big)\Big|\Big] \\ &\le 2\, \mathbb{E}_X\Big[\sup_{f \in A} \Big\{\frac1n \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f]\big)\Big\}\Big] + 2\, \mathbb{E}_X\Big[\sup_{f \in A} \Big\{\frac1n \sum_{i=1}^n \big(\mathbb{E}[f] - f(X_i)\big)\Big\}\Big] = 2\, \mathbb{E}\big[\mathrm{VDC}_A(\mu_n \| \mu) + \mathrm{VDC}_A(\mu \| \mu_n)\big]. \end{aligned}$$
The last inequality follows from the fact that
$$\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f]\big)\Big| \le \sup_{f \in A} \Big\{\frac1n \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f]\big)\Big\} + \sup_{f \in A} \Big\{\frac1n \sum_{i=1}^n \big(\mathbb{E}[f] - f(X_i)\big)\Big\},$$
which holds as long as the two terms on the right-hand side are non-negative. This happens when $f \equiv 0$ belongs to $A$, which is a consequence of $0 \in K$. The upper bound follows essentially from the classical symmetrization argument:
$$\mathbb{E}[d_{CT,A}(\mu, \mu_n)] \le 2\, \mathbb{E}_X\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \big(f(X_i) - \mathbb{E}[f]\big)\Big|\Big] \le 4\, \mathbb{E}_{X,\epsilon}\Big[\sup_{f \in A} \Big|\frac1n \sum_{i=1}^n \epsilon_i f(X_i)\Big|\Big].$$

Lemma 6 (Relation between Rademacher and Gaussian complexities, Exercise 5.5, Wainwright (2019)). Let $\|G_n\|_F = \sup_{f \in F} \big|\frac1n \sum_{i=1}^n z_i f(X_i)\big|$, where the $z_i$ are standard Gaussian variables.
$\mathbb{E}_{X,z}[\|G_n\|_F]$ is known as the Gaussian complexity. We have
$$\frac{\mathbb{E}_{X,z}[\|G_n\|_F]}{2\sqrt{\log n}} \le \mathbb{E}_{X,\epsilon}[\|S_n\|_F] \le \sqrt{\frac{\pi}{2}}\, \mathbb{E}_{X,z}[\|G_n\|_F].$$
Given a set $\mathbb{T} \subseteq \mathbb{R}^n$, the family of random variables $\{G_\theta, \theta \in \mathbb{T}\}$, where $G_\theta := \langle \omega, \theta\rangle = \sum_{i=1}^n \omega_i \theta_i$ and $\omega_i \sim N(0,1)$ i.i.d., defines a stochastic process known as the canonical Gaussian process associated with $\mathbb{T}$.

D.1 RESULTS USED IN THE LOWER BOUND OF THEOREM 5

Lemma 7 (Sudakov minoration, Wainwright (2019), Thm. 5.30; Sudakov (1973) ). Let {G θ , θ ∈ T} be a zero-mean Gaussian process defined on the non-empty set T. Then, E sup θ∈T G θ ≥ sup δ>0 δ 2 log M G (δ; T). where M G (δ; T) is the δ-packing number of T in the metric ρ G (θ, θ ′ ) = E[(X θ -X θ ′ ) 2 ]. Proposition 8. Let C 0 , C 1 be universal constants independent of the dimension d, and suppose that n ≥ C 1 log(d). Recall that F d (M, C) is the set of convex functions on [-1, 1] d such that |f (x)| ≤ M and |f (x) -f (y)| ≤ C∥x -y∥ for any x, y ∈ [-1, 1] d , and that Fd (M, C) = {f -E µ [f ] | f ∈ F d (M, C)}. The Gaussian complexity of the set F d (M, C) satisfies E X,z [∥G n ∥ F d (M,C) ] ≥ C 0 C d(1+C) n -2 d . Proof. Given (X i ) n i=1 ⊆ R d , let T n = {(f (X i )) n i=1 ∈ R n | f ∈ F d (M, C)}. Let X i be sampled i.i.d. from the uniform measure on [-1, 1] d . We have that with probability at least 1/2 on the instantiation of ( X i ) n i=1 , E z [∥G n ∥ F d (M,C) ] = 1 n E sup θ∈Tn X θ ≥ 1 2n Cn 128d(1+C) n -2 d log M Cn 128d(1+C) n -2 d ; F d (M, C), ρ n 1/2 ≳ 1 2n Cn 128d(1+C) n -2 d log 2 n 16 32n 2 d √ 2d(1+C)Cn+1 1/2 ≈ 1 2n Cn 128d(1+C) n -2 d n 16 log(2) -log(32 2d(1 + C)C) -1 2 + 2 d log(n) 1/2 ≥ C 0 C d(1+C) n -2 d . The first inequality follows from Sudakov minoration (Lemma 7) by setting δ = Cn 128d(1+C) n -2 d . The second inequality follows from the lower bound on the packing number of Ā given by Corollary 1. In the following approximation we neglected the term 1 in the numerator inside of the logarithm. In the last inequality, C 0 is a universal constant independent of the dimension d. The inequality holds as long as log(32 2d(1 + C)C) + ( 1 2 + 2 d ) log(n) ≤ n 32 log(2), which is true when n ≥ C 1 log(d) for some universal constant C 1 . Proposition 9. With probability at least (around) 1/2 on the instantiation of (X i ) n i=1 as i.i.d. 
samples from the uniform distribution on $S^{d-1}$, the packing number of the set $F_d(M, C)$ with respect to the metric $\rho_n(f, f') = \sqrt{\sum_{i=1}^n (f(X_i) - f'(X_i))^2}$ satisfies
$$M\Big(\sqrt{\tfrac{Cn}{32d(1+C)}}\; n^{-2/d};\ F_d(M, C), \rho_n\Big) \gtrsim 2^{n/16}.$$

Proof. This proof uses the same construction as Thm. 6 of Bronshtein (1976), which shows a lower bound on the metric entropy of $F_d(M, C)$ in the $L^\infty$ norm. His result follows from associating a subset of convex functions in $F_d(M, C)$ to a subset of convex polyhedra, and then lower-bounding the metric entropy of this second set. Note that the pseudo-metric $\rho_n$ is weaker than the $L^\infty$ norm, which means that our result does not follow from his. Instead, we need a more intricate construction for the set of convex polyhedra, and we rely on the Varshamov-Gilbert lemma (Lemma 8).

First, we show that if we sample $n$ points $(X_i)_{i=1}^n$ i.i.d. from the uniform distribution over the unit ball of $\mathbb{R}^d$, then with constant probability (around $\frac12$) there exists a $\delta$-packing of the unit ball of $\mathbb{R}^d$ with $k \approx \delta^{-d}/2$ points. For any $n \in \mathbb{Z}_+$, we define the set-valued random variable $S_n = \{i \in \{1, \dots, n\} \mid \forall 1 \le j < i,\ \|X_i - X_j\|_2 \ge \delta\}$, and the random variable $A_n = |S_n|$. That is, $A_n$ is the number of points $X_i$ that are at least $\delta$ away from every point with a lower index; clearly the set of such points constitutes a $\delta$-packing of the unit ball of $\mathbb{R}^d$. We have that $\mathbb{E}[A_n \mid (X_i)_{i=1}^{n-1}] = A_{n-1} + \Pr(\forall 1 \le i \le n-1,\ \|X_n - X_i\|_2 \ge \delta)$. Since for a fixed $X_i$ and uniformly distributed $X_n$, $\Pr(\|X_n - X_i\|_2 \le \delta) \le \delta^d$, a union bound shows that $\Pr(\forall 1 \le i \le n-1,\ \|X_n - X_i\|_2 \ge \delta) \ge 1 - (n-1)\delta^d$. Thus, by the tower property of conditional expectations, $\mathbb{E}[A_n] \ge \mathbb{E}[A_{n-1}] + 1 - (n-1)\delta^d$. A simple induction shows that $\mathbb{E}[A_n] \ge n - \sum_{i=0}^{n-1} i\delta^d = n - \frac{n(n-1)\delta^d}{2}$. This is a quadratic function of $n$.
For a given δ > 0, we choose n that maximizes this expression, which results in 1 -(2n-1)δ d 2 ∼ 0 =⇒ n ∼ δ -d + 1 2 (18) =⇒ E[A n ] ≳ δ -d + 1 2 + (δ -d + 1 2 )(δ -d -1 2 )δ d 2 = δ -d + 1 2 + δ -d -1 4 δ d 2 = δ -d 2 + 1 2 -δ d 8 . where we used ∼ instead of = to remark that n must be an integer. Since A n is lower-bounded by 1 and upper-bounded by n ∼ δ -d + 1 2 , Markov's inequality shows that P (A n ≥ E[A n ]) ≳ δ -d 2 + 1 2 -δ d 8 -1 δ -d -1 2 ≈ 1 2 , where the approximation works for δ ≪ 1. Taking k = A n , this shows the existence of a δ-packing of the unit ball of size ≈ δ -d/2 2 with probability at least (around) 1 2 . Next, we use a construction similar to the proof of the lower bound in Thm. 6 of Bronshtein (1976) . That is, we consider the set Sn = {(x, g(x)) | x ∈ S n } ⊆ R d+1 , where g : [-1, 1] d → R is the map that parameterizes part of the surface of the (d -1)-dimensional hypersphere S d-1 (R, t) centered at (0, . . . , 0, t) of radius R for well chosen t, R. As in the proof of Thm. 6 of Bronshtein (1976) , if x ∈ Sn ⊆ S d-1 (R, t) and we let S d-1 (R, t) x denote the tangent space at x, we construct a hyperplane P that is parallel to S d-1 (R, t) x at a distance ϵ < R from the latter and that intersects the sphere. A simple trigonometry exercise involving the lengths of the chord and the sagitta of a two-dimensional circular segment shows that any point in S d-1 (R, t) that is at least √ 2Rϵ away from x will be separated from x by the hyperplane P . Thus, the convex hull of Sn \ {x} is at least ϵ away from x. For any subset S ′ ⊆ Sn , we define the function g S ′ as the function whose epigraph (the set of points on or above the graph of a convex function) is equal to the convex hull of S ′ . Note that g S ′ is convex and piecewise affine. Let us set δ = √ 2Rϵ. If x is a point in Sn \ S ′ , by the argument in the previous paragraph the convex hull of S ′ is at least ϵ away from x. Thus, |g S ′ (x) -g S ′ ∪{x} | ≥ ϵ. 
The functions g S ′ and g S ′′ will differ by at least ϵ at each point in the symmetric difference S ′ ∆S ′′ . By the Varshamov-Gilbert lemma (Lemma 8), there is a set of 2 k/8 different subsets S ′ such that g S ′ differ pairwise at at least k/4 points in Sn by at least ϵ. Thus, for any S ′ ̸ = S ′′ in this set, we have that ρ n (g S ′ , g S ′′ ) = n i=1 (g S ′ (X i ) -g S ′ (X i )) 2 ≥ k 4 ϵ. We have to make sure that all the functions g S ′ belong to the set F d (M, C). Since | d dx √ R 2 -x 2 | = |x| √ R 2 -x 2 , the function g will be C-Lipschitz on [-1, 1] d if we take R 2 ≥ d(1+C) C , and in that case g S ′ will be Lipschitz as well for any S ′ ⊆ Sn . To make sure that the uniform bound ∥g S ′ ∥ ∞ ≤ M holds, we adjust the parameter t. To obtain the statement of the proposition we start from a certain n and set δ such that (18) holds: δ = n -1 d , and we set k ≈ n 2 . Since δ = √ 2Rϵ, we have that ϵ = 1 2R δ 2 ≈ 1 2 C d(1+C) n -2 d , which means that 2 k/8 ≈ 2 n/16 , k 4 ϵ ≈ Cn 32d(1+C) n -2 d . Lemma 8 (Varshamov-Gilbert). Let N ≥ 8. There exists a subset Ω ⊆ {0, 1} N such that |Ω| ≥ 2 N/8 and for any x, x ′ ∈ Ω such that x ̸ = x ′ , at least N/4 of the components differ. Lemma 9. For any ϵ > 0, the packing number of the centered function class F d (M, C) = {f - E µ [f ]| f ∈ F d (M, C)} with respect to the metric ρ n fulfills M (ϵ/2; F d (M, C), ρ n ) ≥ M (ϵ; F d (M, C), ρ n )/(4nC/ϵ + 1), Proof. Let S be an ϵ-packing of F d (M, C) with respect to ρ n (i.e. |S| = M (ϵ; F d (M, C), ρ n )). Let F d (M, C, t, δ) = {f ∈ F d (M, C) | |E µ [f ] -t| ≤ δ}. If we let (t i ) m i=1 be a δ packing of [-C, C], we can write F d (M, C) = ∪ m i=1 F d (M, C, t i , δ). By the pigeonhole principle, there exists an index i ∈ {1, . . . , m} such that |S ∩ F d (M, C, t i , δ)| ≥ |S|/m. Let F d (M, C, t i ) = {f ∈ F d (M, C) | E µ [f ] = t i }, and let P : F d (M, C) → F d (M, C, t i ) be the projection operator defined as f → f -E µ [f ] + t i . 
If we take δ = ϵ/(4n), we have that ρ n (P f, P f ′ ) ≥ ϵ/2 for any f ̸ = f ′ ∈ S ∩ F d (M, C, t i , δ), as ρ n (P f, P f ′ ) ≤ ρ n (P f, f ) + ρ n (f, f ′ ) + ρ n (f ′ , P f ′ ), and ρ n (f, P f ) ≤ n|E µ [f ] -t i | ≤ nδ = ϵ/4, while ρ n (f, f ′ ) ≥ ϵ. Thus, the ϵ/2-packing number of F d (M, C, t i ) is lower-bounded by |S ∩ F d (M, C, t i , δ)| ≥ |S|/m. Since m = ⌊(2C + 2δ)/(2δ)⌋ ≤ 4nC/ϵ + 1, this is lower-bounded by |S|/(4nC/ϵ + 1). The proof concludes using the observation that the map F d (M, C, t i ) → Fd (M, C) defined as f → f -t i is an isometric bijection with respect to ρ n . Corollary 1. With probability at least 1/2, the packing number of the set F d (M, C) with respect to the metric ρ n satisfies M Cn 128d(1+C) n -2 d ; F d (M, C), ρ n ≥ 2 n 16 1 32n 2 d √ 2d(1+C)Cn+1 Proof of Theorem 5. Lemma 5 provides upper and lower bounds to E[d CT,A (µ, µ n )] in terms of the Rademacher complexities of A and its centered version, Ā. Lemma 1 in App. A states that A is equal to the space F d (M, C) of convex functions on [-1, 1] d such that |f (x)| ≤ M and |f (x) -f (y)| ≤ C∥x -y∥ for any x, y ∈ [-1, 1] d . Let E X,ϵ [∥S n ∥ F ], E X,ϵ [∥G n ∥ F ] be the Rademacher and Gaussian complexities of a function class F (see definitions in Lemma 5 and Lemma 6). We can write E X,ϵ [∥S n ∥ Ā] ≥ E X,z [∥Gn∥ Ā ] 2 √ log n ≥ C 0 C 2d(1+C) log(n) n -2 d , where the first inequality holds by Lemma 6, and the second inequality follows from Proposition 8. This gives rise to (7) upon redefining C 0 ← C 0 / √ 8. Equation (8) follows from the Rademacher complexity upper bound in Lemma 5, and the upper bound on the empirical Rademacher complexity of F d (M, C) given by Proposition 4. Proof of Theorem 6. We upper-bound E[d CT,F L,M,k,+ (1) (µ, µ n )] as in the upper bound of E[d CT,A (µ, µ n )] in Lemma 5: E[d CT,F L,M,k,+ (1) (µ, µ n )] ≤ 2E X sup f ∈F L,M,k,+ (1) 1 n n i=1 (f (X i ) -E[f ]) ≤ 4E X,ϵ sup f ∈F L,M,k,+ (1) 1 n n i=1 ϵ i f (X i ) = 4E ϵ [∥S n ∥ F L,M,k,+ (1) ]. 
Likewise, we get that $\mathbb{E}[\mathrm{VDC}_{F_{L,M,k,+}(1)}(\mu \| \mu_n)] \le 2\, \mathbb{E}_\epsilon[\|S_n\|_{F_{L,M,k,+}(1)}]$. Since $F_{L,M,k,+}(1) \subset F_{L,M,k}(1)$, we have that $\mathbb{E}_\epsilon[\|S_n\|_{F_{L,M,k,+}(1)}] \le \mathbb{E}_\epsilon[\|S_n\|_{F_{L,M,k}(1)}]$. The result then follows from the Rademacher complexity upper bound of Proposition 7.
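The greedy construction of the packing set $S_n$ used in the proof of Proposition 9 (App. D.1) can be sketched as follows. This is a toy illustration we add here: the dimension, radius $\delta$, and rejection-sampling scheme for the uniform ball are our own choices, not the paper's.

```python
import math
import random

random.seed(1)

def sample_ball(d):
    # rejection sampling from the unit ball of R^d (fine in low dimension)
    while True:
        x = [random.uniform(-1.0, 1.0) for _ in range(d)]
        if sum(t * t for t in x) <= 1.0:
            return x

def greedy_packing(points, delta):
    # keep X_i iff it is at distance >= delta from every kept point of lower
    # index; the kept points form the set S_n, and their count is A_n
    kept = []
    for p in points:
        if all(math.dist(p, q) >= delta for q in kept):
            kept.append(p)
    return kept

pts = [sample_ball(2) for _ in range(400)]
packing = greedy_packing(pts, 0.5)
assert all(math.dist(p, q) >= 0.5
           for i, p in enumerate(packing) for q in packing[:i])
assert len(packing) >= 2
```

By construction the kept points are pairwise at distance at least $\delta$, which is exactly the $\delta$-packing property the proof needs.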

E SIMPLE EXAMPLES IN DIMENSION 1 FOR VDC AND d CT

For simple distributions over compact subsets of $\mathbb{R}$, we can compute the CT discrepancy exactly. Let $\eta : \mathbb{R} \to \mathbb{R}$ be a non-negative bump-like function, supported on $[-1, 1]$, symmetric w.r.t. $0$, increasing on $(-1, 0]$, and such that $\int_{\mathbb{R}} \eta = 1$. Clearly $\eta$ is the density of some probability measure with respect to the Lebesgue measure.

Same variance, different mean. Let $\mu^+ \in P(\mathbb{R})$ with density $\eta(x - a)$ for some $a > 0$, and let $\mu^- \in P(\mathbb{R})$ with density $\eta(x + a)$. We let $K = [-C, C]$ for any $C > 0$.

Proposition 10. Let $F(x) = \int_{-\infty}^x (\eta(y + a) - \eta(y - a))\, dy$, and $G(x) = \int_0^x F(y)\, dy$. We obtain that $\mathrm{VDC}_K(\mu^+ \| \mu^-) = 2C G(+\infty)$. The optimum of $(P_2)$ is equal to the Dirac delta at $-C$, $\nu = \delta_{-C}$, while an optimum of $(P_1)$ is $u = -Cx$. We have that $\mathrm{VDC}_K(\mu^- \| \mu^+) = 2C G(+\infty)$ as well, and thus $d_{CT,A}(\mu^-, \mu^+) = 4C G(+\infty)$.

Proof. Since $a > 0$, we have that $F(-\infty) = F(-a-1) = 0$ and $F(+\infty) = F(a+1) = 0$. $F$ is even, and $\sup_x |F(x)| \le 1$. $G$ is odd (in particular $G(-a-1) = -G(a+1)$) and non-decreasing on $\mathbb{R}$. Also, $0 < G(+\infty) = G(a+1) \le a+1$. Let $K = [-C, C]$ where $C > 0$, and let $\Omega = [-a-1, a+1]$. For any twice-differentiable[2] convex function $u$ on $\Omega$ such that $u' \in K$, integrating by parts twice gives
$$\int_\Omega u(x)\, d(\mu^- - \mu^+)(x) = \int_\Omega u(x)\big(\eta(x+a) - \eta(x-a)\big)\, dx = [u(x)F(x)]_{-a-1}^{a+1} - \int_\Omega u'(x)F(x)\, dx = -\int_\Omega u'(x)F(x)\, dx = [-u'(x)G(x)]_{-a-1}^{a+1} + \int_\Omega u''(x)G(x)\, dx = -G(a+1)\big(u'(a+1) + u'(-a-1)\big) + \int_\Omega u''(x)G(x)\, dx.$$
If we reexpress $u'(a+1) = u'(-a-1) + \int_{-a-1}^{a+1} u''(x)\, dx$, the right-hand side becomes
$$-2G(a+1)\, u'(-a-1) + \int_\Omega u''(x)\big(G(x) - G(a+1)\big)\, dx. \qquad (19)$$
Since $G$ is increasing, $G(x) < G(a+1)$ for any $x \in [-a-1, a+1)$. The convexity assumption on $u$ implies that $u'' \ge 0$ on $\Omega$, and the condition $u' \in K$ means that $u' \in [-C, C]$. Thus, a function $u$ that maximizes (19) fulfills $u' \equiv -C$, $u'' \equiv 0$, so we can take $u = -Cx$. The measure $\nu = (\nabla u)_\# \mu^- = (\nabla u)_\# \mu^+$ is equal to the Dirac delta at $-C$: $\nu = \delta_{-C}$. Hence, $\sup_{u \in A} \int_\Omega u\, d(\mu^- - \mu^+) = 2C G(a+1)$.
Thus, $\mathrm{VDC}_K(\mu^+ \| \mu^-) = 2C G(a+1)$. Reproducing the same argument yields $\mathrm{VDC}_K(\mu^- \| \mu^+) = 2C G(a+1)$, and hence the CT distance is equal to $d_{CT,A}(\mu^-, \mu^+) = 4C G(a+1)$.

Same mean, different variance. Let $\mu^+ \in P(\mathbb{R})$ with density $\frac{1}{a}\eta(\frac{x}{a})$ for some $a > 0$, and let $\mu^- \in P(\mathbb{R})$ with density $\eta(x)$. Note that $\mu^+$ has support $[-a, a]$, and standard deviation equal to $a$ times the standard deviation of $\mu^-$. We let $K = [-C, C]$ for any $C > 0$ as before.

Proposition 11. Let $F(x) = \int_{-\infty}^x (\eta(y) - \frac{1}{a}\eta(\frac{y}{a}))\, dy$, and $G(x) = \int_{-\infty}^x F(y)\, dy$. When $a < 1$, we have that $\mathrm{VDC}_K(\mu^+ \| \mu^-) = 2C G(0)$. The optimum of $(P_2)$ is equal to $\nu = \frac12 \delta_{-C} + \frac12 \delta_C$, while an optimum of $(P_1)$ is $u = C|x|$. When $a > 1$, $\mathrm{VDC}_K(\mu^+ \| \mu^-) = 0$. For any $a > 0$, $d_{CT,A}(\mu^-, \mu^+) = 2C|G(0)|$.

Proof. We define $\Omega = [-\max\{a, 1\}, \max\{a, 1\}]$ and $K = [-C, C]$. We have that $F(-\infty) = F(-\max\{a,1\}) = 0$, $F(+\infty) = F(\max\{a,1\}) = 0$, and $F(0) = 0$. $F$ is odd. When $a < 1$ it is non-positive on $[0, +\infty)$ and non-negative on $(-\infty, 0]$. When $a > 1$, it is non-negative on $[0, +\infty)$ and non-positive on $(-\infty, 0]$. We have that $G(-\infty) = G(-\max\{a,1\}) = 0$ and $G(+\infty) = G(\max\{a,1\}) = 0$. $G$ is even. When $a < 1$, $G$ is non-negative, non-decreasing on $(-\infty, 0]$ and non-increasing on $[0, +\infty)$: it has a global maximum at $0$. When $a > 1$, $G$ is non-positive, non-increasing on $(-\infty, 0]$ and non-decreasing on $[0, +\infty)$: it has a global minimum at $0$. Integrating by parts as before,
$$\int_\Omega u(x)\, d(\mu^- - \mu^+)(x) = \int_\Omega u(x)\Big(\eta(x) - \frac{1}{a}\eta\Big(\frac{x}{a}\Big)\Big)\, dx = -\int_\Omega u'(x)F(x)\, dx = \int_\Omega u''(x)G(x)\, dx. \qquad (20)$$
Thus, when $a < 1$ this expression is maximized when $u'' \propto \delta_0$. Taking into account the constraint $u' \in [-C, C]$, the optimal $u'$ is $u'(x) = C\,\mathrm{sign}(x)$, which means that an optimal $u$ is $u(x) = C|x|$. Thus, the measure $\nu = (\nabla u)_\# \mu^- = (\nabla u)_\# \mu^+$ is equal to the average of Dirac deltas at $-C$ and $C$: $\nu = \frac12 \delta_{-C} + \frac12 \delta_C$. We obtain that $\sup_{u \in A} \int_\Omega u\, d(\mu^- - \mu^+) = 2C G(0)$.
When $a > 1$, the expression (20) is maximized when $u'' \equiv 0$, which means that any constant $u'$ and any affine $u$ work. Any measure $\nu$ concentrated at a point in $[-C, C]$ is optimal. Thus, $\sup_{u \in A} \int_\Omega u\, d(\mu^- - \mu^+) = 0$. We conclude that $d_{CT,A}(\mu^+, \mu^-) = 2C|G(0)|$.
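The quantities $F$, $G$, and the VDC value attained by $u(x) = -Cx$ in Proposition 10 can be checked numerically. The sketch below uses a triangular bump as a concrete choice of $\eta$ and $a = 2$, for which the supports of $\mu^+$ and $\mu^-$ are disjoint and $G(a+1) = a$; these choices are ours, for illustration only.

```python
a, C = 2.0, 1.5
eta = lambda t: max(0.0, 1.0 - abs(t))   # triangular bump density on [-1, 1]

lo, hi, n = -(a + 1.0), a + 1.0, 100000
h = (hi - lo) / n
xs = [lo + (i + 0.5) * h for i in range(n)]  # midpoint grid on [-a-1, a+1]

# F(x) = int_{-inf}^x (eta(y + a) - eta(y - a)) dy, accumulated left to right
F, acc = [], 0.0
for x in xs:
    acc += (eta(x + a) - eta(x - a)) * h
    F.append(acc)
G_end = sum(f * h for f, x in zip(F, xs) if x >= 0.0)  # G(a + 1) = int_0^{a+1} F

# value of int u d(mu^- - mu^+) attained by the optimizer u(x) = -C x
vdc = sum(-C * x * (eta(x + a) - eta(x - a)) * h for x in xs)
assert abs(G_end - a) < 1e-3            # disjoint supports: G(a + 1) = a
assert abs(vdc - 2.0 * C * G_end) < 1e-3  # matches VDC_K = 2 C G(a + 1)
```

For this separated case the VDC value is simply $C$ times the gap between the two means, consistent with $2C G(a+1) = 2Ca$.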

F ALGORITHMS FOR LEARNING DISTRIBUTIONS WITH SURROGATE VDC AND CT DISTANCE

In Algorithms 1 and 2, we present the steps for learning with the VDC and the CT distance, respectively. To enforce convexity of the ICMNs, we project the hidden-layer weights onto the non-negative orthant after each update of the network parameters. Additionally, to regularize the $u$ networks in Algorithm 1, we include a term that penalizes the square of the outputs of the Choquet critic on the baseline and generated samples (Mroueh & Sercu, 2017).

Algorithm 1 Enforcing dominance constraints with the surrogate VDC
Input: Target distribution ν, baseline g_0, latent distribution µ_0, integer maxEpochs, integer discriminatorEpochs, Choquet weight λ, GAN learning rate η, Choquet critic learning rate η_VDC, Choquet critic regularization weight λ_{u,reg}, WGAN gradient penalty regularization weight λ_GP
Initialize: untrained discriminator f_φ, untrained generator g_θ, untrained Choquet ICMN critic u_ψ
ψ ← ProjectHiddenWeightsToNonNegative(ψ)
for i = 1 to maxEpochs do
  L_WGAN(φ, θ) = E_{Y∼µ_0}[f_φ(g_θ(Y))] − E_{X∼ν}[f_φ(X)]
  L_GP = E_{t∼Unif[0,1]} E_{X∼ν, Y∼µ_0}[(∥∇_x f_φ(t g_θ(Y) + (1−t)X)∥ − 1)²]
  L_VDC(ψ, θ) = E_Y
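The ProjectHiddenWeightsToNonNegative step of Algorithm 1 can be sketched as a clamp of the negative hidden-layer weights at zero. This is a minimal pure-Python sketch: the nested-list weight layout (layers → neurons → maxout units → weight vectors, with the bias as the last entry) is an assumption of ours, not the paper's implementation, which would operate on framework tensors.

```python
def project_hidden_weights_nonneg(layers):
    # clamp the weights that multiply previous-layer activations to be >= 0;
    # biases (last entry of each unit's weight vector) keep their sign
    for layer in layers:
        for neuron in layer:
            for w in neuron:
                for t in range(len(w) - 1):
                    if w[t] < 0.0:
                        w[t] = 0.0
    return layers

# one layer, one neuron, two maxout units over a 2-dim input plus bias
layers = [[[[-0.3, 0.7, 0.1], [0.2, -0.4, -0.9]]]]
project_hidden_weights_nonneg(layers)
assert layers == [[[[0.0, 0.7, 0.1], [0.2, 0.0, -0.9]]]]
```

Applying this projection after every gradient step keeps the iterates inside the ICMN parameter set, so the critic stays convex in its input throughout training.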

G PROBING MODE COLLAPSE

As described in Sec. 7, to investigate how training with the surrogate VDC regularizer helps alleviate mode collapse in GAN training, we implemented GANs trained with the IPM objective alone and compared them to training with the surrogate VDC regularizer on a mixture-of-8-Gaussians target distribution. To track mode collapse, we report two metrics. 1) The mode collapse score, defined as follows: each generated point is assigned to the nearest-neighbor cluster in the target mixture, yielding a histogram over the modes computed on all generated points. The closer this histogram is to the uniform distribution, the less mode-collapsed the generator. We quantify this with the KL divergence of this histogram from the uniform distribution on 8 modes. 2) The negative log likelihood (NLL) of the Gaussian mixture. A converged generator needs to have both a low NLL and a low mode collapse score. In Figure 3, we see that in the baseline training (unregularized GAN training) we observe mode collapse and cycling between modes, evidenced by the fluctuating mode collapse score and NLL. In contrast, when training with the VDC regularizer to improve upon the collapsed generator g_0 (taken from step 55k of the unregularized GAN training), we see more stable training and better mode coverage. As the regularization weight λ_VDC for the VDC increases, the dominance constraint is more strongly enforced, resulting in a better NLL and a smaller mode collapse score.
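The mode collapse score described above (nearest-mode assignment followed by the KL divergence of the resulting histogram to the uniform distribution) can be sketched as follows; the helper names are ours.

```python
import math
from collections import Counter

def mode_collapse_score(samples, modes):
    # assign each sample to its nearest mode, then KL(histogram || uniform)
    counts = Counter(min(range(len(modes)), key=lambda k: math.dist(s, modes[k]))
                     for s in samples)
    n, m = len(samples), len(modes)
    return sum((c / n) * math.log((c / n) * m) for c in counts.values())

# 8 modes on the unit circle, as in the mixture-of-8-Gaussians target
modes = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8))
         for k in range(8)]
balanced = [mo for mo in modes for _ in range(10)]   # all 8 modes covered
collapsed = [modes[0]] * 80                          # a single mode
assert mode_collapse_score(balanced, modes) < 1e-9
assert abs(mode_collapse_score(collapsed, modes) - math.log(8)) < 1e-9
```

A perfectly balanced generator scores 0, while full collapse onto one mode scores log 8, the maximum for 8 modes.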

Image generation

When training WGAN-GP and WGAN-GP + VDC, we use a residual convolutional neural network (CNN) for the generator and a CNN for the discriminator (both with ReLU nonlinearities). See Table 3, Table 4, and Table 5 for the full architectural details. Note that in Table 3, PixelShuffle refers to a dimension rearrangement where an input of dimension $Cr^2 \times H \times W$ is rearranged to $C \times Hr \times Wr$.[3] Our latent dimension for µ_0 is 128 and λ_GP = 10. We use ADAM optimizers (Kingma & Ba, 2015) for both networks, learning rates of 1e-4, and a batch size of 64. We use the CIFAR-10 training data and split it as 95% training and 5% validation. FID is calculated using the validation set. The generator was trained every 6th epoch, and training was executed for about 400 epochs in total. When training g* we use a learning rate of 1e-5 for the Choquet critic, λ = 10, and λ_{u,reg} = 10. Training the baseline g_0 and g* with the surrogate VDC was done on a compute environment with 1 CPU and 1 A100 GPU.
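The PixelShuffle rearrangement mentioned above can be sketched in pure Python. This illustrates the indexing convention of the PyTorch operator on a nested-list layout, which is our own simplification for exposition.

```python
def pixel_shuffle(x, r):
    # x: nested list of shape (C*r*r, H, W) -> output of shape (C, H*r, W*r)
    Crr, H, W = len(x), len(x[0]), len(x[0][0])
    C = Crr // (r * r)
    out = [[[0.0] * (W * r) for _ in range(H * r)] for _ in range(C)]
    for c in range(C):
        for i in range(H * r):
            for j in range(W * r):
                # the r*r sub-block at spatial (i//r, j//r) is filled from
                # the r*r consecutive input channels of output channel c
                ch = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[ch][i // r][j // r]
    return out

x = [[[1]], [[2]], [[3]], [[4]]]           # shape (4, 1, 1), r = 2
assert pixel_shuffle(x, 2) == [[[1, 2], [3, 4]]]
```

Trading channels for spatial resolution this way lets the generator upsample feature maps without transposed convolutions.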

2D point cloud generation

When training with the d_CT surrogate for point cloud generation, the generator and Choquet critics are parameterized by residual maxout networks with a maxout kernel size of 2. The critics are ICMNs. Our latent dimension for µ_0 is 32. Both the generator and the critics have a hidden dimension of 32. The generator consists of 10 fully-connected layers and the critics consist of 5. For all networks, we add residual connections from input to hidden layers (as opposed to hidden-to-hidden). The last layer of all networks is fully-connected linear. See Table 6 for full architectural details. We use ADAM optimizers for all networks, a learning rate of 5e-4 for the generator, learning rates of 1e-4 for the Choquet critics, and a batch size of 512. Training was done on a single-CPU environment.
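A minimal sketch of an ICMN critic like the ones described above: maxout layers whose hidden-to-hidden weights are non-negative (and, as we assume here following Definition 3, non-negative output weights $a$), whose output can be checked for midpoint convexity numerically. Layer widths, the random initialization, and the helper names are our own choices, not the paper's architecture.

```python
import random

random.seed(0)

def maxout_layer(h, units):
    # units[i][j]: weight vector over (h, 1); output_i = max_j <w_ij, (h, 1)> / sqrt(m)
    m = len(units)
    return [max(sum(w[t] * v for t, v in enumerate(list(h) + [1.0])) for w in ws)
            / m ** 0.5 for ws in units]

def make_icmn(d, widths, k):
    layers, prev = [], d
    for ell, m in enumerate(widths):
        # first layer: weights of any sign (affine in x); deeper layers:
        # non-negative weights on the previous activations, free-sign bias
        lo = -1.0 if ell == 0 else 0.0
        layers.append([[[random.uniform(lo, 1.0) for _ in range(prev)]
                        + [random.uniform(-1.0, 1.0)]
                        for _ in range(k)] for _ in range(m)])
        prev = m
    a = [random.uniform(0.0, 1.0) for _ in range(prev)]  # non-negative output weights
    return layers, a

def icmn(x, layers, a):
    h = list(x)
    for units in layers:
        h = maxout_layer(h, units)
    return sum(ai * hi for ai, hi in zip(a, h))

layers, a = make_icmn(2, [4, 4], 3)
x, y = (0.2, -0.5), (-0.8, 0.9)
mid = tuple(0.5 * (xi + yi) for xi, yi in zip(x, y))
# midpoint convexity: f((x + y) / 2) <= (f(x) + f(y)) / 2
assert icmn(mid, layers, a) <= 0.5 * (icmn(x, layers, a) + icmn(y, layers, a)) + 1e-9
```

Each layer is a pointwise maximum of non-negative combinations of convex functions plus a bias, so the network is convex in its input by the induction of Proposition 5, which is what the midpoint check confirms.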



The gradient of f is well defined when there exists a neighborhood of x on which f is affine. Using a mollifier sequence, any convex function can be approximated arbitrarily well by a twice-differentiable convex function. See the pytorch documentation for more details.



which is (twice) the approximation error of the class A by the class F_{L,M,k}(1). Such bounds have only been derived for L = 2: Balazs et al.

As stated in Xue et al. (2020), this example has a known solution at z = 2, where E_P[G_2(ξ; 2)] = 1 outperforms the benchmark E_P[G_1(ξ)] = 0.5. We relax the constrained optimization by including the constraint in the objective function, thus creating the min-max game (10) introduced in Sec. 6. We parameterize F_{L,M,k,-}(1) with a 3-layer, fully-connected, decreasing ICMN with hidden dimension 32 and maxout kernel size of 4. After 5000 steps of stochastic gradient descent on z (learning rate 1e-3) and on the parameters of the ICMN (learning rate 1e-3), using a batch size of 512 and λ = 1, we attain accurate approximations of the known solution: z = 2, (1/512) Σ_{j=1}^{512} G_2(ξ_j; z) = 1.042, and (1/512) Σ_{j=1}^{512} G_1(ξ_j) = 0.496.
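A schematic of this penalty-relaxed min-max loop is sketched below. Note the heavy hedging: G1 and G2 are toy stand-ins (the actual payoff functions are defined in Xue et al. (2020)), and a small ReLU network stands in for the decreasing ICMN critic, so this illustrates only the alternating update structure, not a reproduction of the experiment:

```python
import torch
import torch.nn as nn

# TOY stand-ins for the benchmark and portfolio payoffs; the real G1, G2
# are defined in Xue et al. (2020). Only the update structure matters here.
def G1(xi):
    return xi

def G2(xi, z):
    return z * xi - 0.1 * z ** 2

torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
z = torch.tensor(0.5, requires_grad=True)
opt_z = torch.optim.SGD([z], lr=1e-3)
opt_u = torch.optim.SGD(critic.parameters(), lr=1e-3)
lam = 1.0  # penalty weight λ

for step in range(200):
    xi = 1.0 + torch.randn(512, 1)
    # Ascent step: the critic tries to witness a dominance violation,
    # i.e. make E[u(G1(ξ))] - E[u(G2(ξ; z))] large.
    penalty = (critic(G1(xi)) - critic(G2(xi, z.detach()))).mean()
    opt_u.zero_grad()
    (-penalty).backward()
    opt_u.step()
    # Descent step on z: maximize the expected payoff, penalize violations.
    penalty = (critic(G1(xi)) - critic(G2(xi, z))).mean()
    loss = -G2(xi, z).mean() + lam * penalty
    opt_z.zero_grad()
    loss.backward()
    opt_z.step()
```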

Figure 2: Training generative models by enforcing dominance with the surrogate VDC on a pre-trained CIFAR-10 WGAN-GP (Left) and with the surrogate CT distance on 2D point clouds (Right). Ground truth point cloud distributions (blue) consist of a swiss roll (Top), a circle of eight Gaussians (Middle), and the Github icon converted to a point cloud (Bottom).

Image generation with baseline model dominance Another application of learning with the VDC is in the high-dimensional setting of CIFAR-10 (Krizhevsky & Hinton, 2009) image generation. As detailed in Sec. 6, we start by training a baseline generator g_0 using the regularized Wasserstein-GAN paradigm (WGAN-GP) introduced in Arjovsky et al. (2017); Gulrajani et al. (2017), where

Let Ω = [-1, 1]^d and K = {x ∈ R^d | ∥x∥_2 ≤ C}. The function class A = {u : Ω → R, u convex and ∇u ∈ K a.e.} is equal, up to a constant term, to the space F_d(M, C) of convex functions on [-1, 1]^d such that |f(x)| ≤ M and |f(x) - f(y)| ≤ C∥x - y∥ for any x, y ∈ [-1, 1]^d. Here, M = C√d.

Proof. Looking at problem (1), note that adding a constant to a function u ∈ A does not change the value of the objective function. Thus, we can add the restriction that u(0) = 0; since Ω is compact and any u ∈ A has Lipschitz constant upper-bounded by sup_{x∈K} ∥x∥, any such function u must fulfill M := sup_{u∈A} sup_{x∈Ω} |u(x)| < +∞. Thus, up to a constant term, functions in A belong to {u : Ω → R, u convex, ∇u ∈ K a.e., sup_{x∈Ω} |u(x)| ≤ M} for a well-chosen M. Now we use the particular form of Ω and K. First, note that we can take M = C√d without loss of generality. Given u ∈ A, we have ∥∇u(x)∥_2 ≤ C for a.e. x ∈ [-1, 1]^d. By the fundamental theorem of calculus, |u(x) - u(y)| = |∫_0^1 ⟨∇u(tx + (1-t)y), x - y⟩ dt| ≤ C∥x - y∥, implying that u is C-Lipschitz. This shows that A ⊆ F_d(M, C). Conversely, Rademacher's theorem states that C-Lipschitz functions are a.e. differentiable, and gradient norms must be upper-bounded by C wherever gradients exist, as otherwise one reaches a contradiction. Hence, F_d(M, C) ⊆ A, concluding the proof.

Lemma 2 (Metric entropy of convex functions, Bronshtein (1976), Thm. 6). Let F_d(M, C) be the compact space of convex functions on [-1, 1]^d such that |f(x)| ≤ M and |f(x) - f(y)| ≤ C∥x - y∥ for any x, y ∈ [-1, 1]^d. The metric entropy of this space with respect to the uniform norm topology satisfies log

t; T, ρ_X) dt.

Proposition 4. For any family of n points (X_i)_{i=1}^n ⊆ [-1, 1]^d, the empirical Rademacher complexity of the function class F_d(M, C) satisfies

    L_VDC = E_{Y∼µ_0}[u_ψ(g_θ(Y)) − u_ψ(g_0(Y))]
    for j = 1 to discriminatorEpochs do
        φ ← φ − η∇_φ(L_WGAN + λ_GP L_GP)   {ADAM optimizer}
        L_VDCreg = E_{Y∼µ_0}[u_ψ(g_θ(Y))² + u_ψ(g_0(Y))²]
        ψ ← ψ − η_VDC ∇_ψ(L_VDC + λ_u,reg L_VDCreg)   {ADAM optimizer}
        ψ ← ProjectHiddenWeightsToNonNegative()
    end for
    θ ← θ + η∇_θ(L_WGAN + λ L_VDC)   {ADAM optimizer}
end for
Return g_θ
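The ProjectHiddenWeightsToNonNegative() step amounts to clamping the hidden-to-hidden weight matrices of the critic at zero after each unconstrained gradient update, so the network stays convex in its input. A sketch, which assumes (purely as a naming convention of this sketch) that the convexity-critical layers contain 'hidden' in their parameter names:

```python
import torch
import torch.nn as nn

def project_hidden_weights_to_nonnegative(net: nn.Module):
    """In-place projection of hidden-to-hidden weights onto the
    nonnegative orthant, preserving input convexity of the critic."""
    with torch.no_grad():
        for name, param in net.named_parameters():
            if "hidden" in name and name.endswith("weight"):
                param.clamp_(min=0.0)
```

In the pseudocode above, this is called once per critic update, immediately after the gradient step on ψ.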

Figure 4: Trajectory of z parameter from the portfolio optimization example in Sec. 7.

) to 10 (see App. F and App. H for more details).

Training runs for g* and g_0.

FID scores for WGAN-GP and WGAN-GP with the VDC surrogate, for convex functions approximated by either ICNNs with softplus activations or ICMNs. ICMNs improve upon the baseline g_0 and outperform ICNNs with softplus. The FID score for WGAN-GP + VDC includes mean values ± one standard deviation over 5 repeated runs with different random initialization seeds.

Architectural details for the convex decreasing u network used in portfolio optimization.

Architectural details for generator g_θ and Choquet critics u_ψ1, u_ψ2 in the 2D point cloud domain.

Libraries Our experiments rely on various open-source libraries, including pytorch (Paszke et al., 2019) (license: BSD) and pytorch-lightning (Falcon et al., 2019) (license: Apache 2.0).


Code re-use For several of our generator, discriminator, and Choquet critics, we draw inspiration and leverage code from the following public Github repositories: (1) https://github.com/caogang/wgan-gp, (2) https://github.com/ozanciga/ gans-with-pytorch, and (3) https://github.com/CW-Huang/CP-Flow.


Algorithm 2 Generative modeling with the surrogate CT distance
Input: Target distribution ν, latent distribution µ_0, integer maxEpochs, integer criticEpochs, learning rate η
Initialize: untrained generator g_θ, untrained Choquet ICMN critics u_ψ1, u_ψ2
ψ_1 ← ProjectHiddenWeightsToNonNegative()

H ADDITIONAL EXPERIMENTAL DETAILS

Portfolio optimization In Figure 4 we plot the trajectory of the z parameter from the portfolio optimization example solved in Sec. 7. The convex decreasing u network was parameterized with a 3-layer, fully-connected, decreasing ICMN with hidden dimension 32 and maxout kernel size of 4. See Table 2 for full architectural details. Optimization was performed on CPU.

