LEARNING WITH STOCHASTIC ORDERS

Abstract

Learning high-dimensional distributions is often done with explicit likelihood modeling or with implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, which encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities, show that they suffer from the curse of dimensionality, and propose surrogates via input convex maxout networks (ICMNs), which enjoy parametric rates. We provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. Finally, our ICMN class of convex functions and its Rademacher complexity bound are of independent interest beyond their application to convex orders. Code to reproduce experimental results is available here.

1. INTRODUCTION

Learning complex high-dimensional distributions with implicit generative models (Goodfellow et al., 2014; Mohamed & Lakshminarayanan, 2017; Arjovsky et al., 2017) via minimizing integral probability metrics (IPMs) (Müller, 1997a) has led to state-of-the-art generation across many data modalities (Karras et al., 2019; De Cao & Kipf, 2018; Padhi et al., 2020). An IPM compares probability distributions via a witness function belonging to a function class F; e.g., for the class of 1-Lipschitz functions, the IPM corresponds to the 1-Wasserstein distance. While estimating the witness function in such large function classes suffers from the curse of dimensionality, restricting it to a class of neural networks leads to the so-called neural net distance (Arora et al., 2017), which enjoys parametric statistical rates. In probability theory, the question of comparing distributions is not limited to assessing equality between two distributions. Stochastic orders were introduced to capture the notion of dominance between measures. Similar to IPMs, stochastic orders can be defined by looking at integrals of measures over function classes F (Müller, 1997b). Namely, for µ⁺, µ⁻ ∈ P1(R^d), µ⁺ dominates µ⁻, written µ⁻ ⪯ µ⁺, if for any function f ∈ F we have ∫_{R^d} f(x) dµ⁻(x) ≤ ∫_{R^d} f(x) dµ⁺(x) (see Figure 1a for an example). In the present work, we focus on the Choquet or convex order (Ekeland & Schachermayer, 2014) generated by the space of convex functions (see Sec. 2 for more details). Previous work has focused on learning with stochastic orders in the one-dimensional setting, as it has prominent applications in mathematical finance and distributional reinforcement learning (RL); in one dimension, the survival function gives a characterization of the convex order (see Figure 1b and Sec. 2 for more details).
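As a concrete sanity check of this definition, one can compare Monte Carlo estimates of E f under two measures for a few convex test functions f. The sketch below mimics the setup of Figure 1a with hypothetical mixture centers of our choosing; note that convex dominance requires equal means (both f(x) = x and f(x) = -x are convex), so the mixture is centered so that both measures have mean zero. This is only an illustration, not the estimator developed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# mu_minus: a single Gaussian mode; mu_plus: an equal-weight mixture of three
# Gaussians with the same overall mean (hypothetical centers for illustration).
x_minus = rng.normal(0.0, 1.0, size=n)
centers = np.array([-4.0, 0.0, 4.0])
x_plus = rng.normal(centers[rng.integers(0, 3, size=n)], 1.0)

# Empirical gaps E_{mu+} f - E_{mu-} f for a few convex test functions f;
# if mu- is dominated by mu+ in the convex order, each gap is nonnegative.
convex_fns = [np.abs, np.square, lambda x: np.maximum(x - 1.0, 0.0)]
gaps = [f(x_plus).mean() - f(x_minus).mean() for f in convex_fns]
print(gaps)  # all positive for this pair of measures
```

Of course, checking finitely many convex functions can only falsify dominance, never certify it; that gap is what the variational criterion introduced below addresses.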
For instance, in portfolio optimization (Xue et al., 2020; Post et al., 2018; Dentcheva & Ruszczynski, 2003) the goal is to find the portfolio that maximizes the expected return under dominance constraints between the return distribution and a benchmark distribution. A similar concept was introduced in distributional RL (Martin et al., 2020) for learning policies with dominance constraints on the distribution of the reward. While these works are limited to the univariate setting, our work is the first, to the best of our knowledge, that provides a computationally tractable characterization of stochastic orders that is sample efficient and scalable to high dimensions. The paper is organized as follows: in Sec. 3 we introduce the Variational Dominance Criterion (VDC); the VDC between measures µ⁺ and µ⁻ takes value 0 if and only if µ⁺ dominates µ⁻ in the convex order, but it suffers from the curse of dimensionality and cannot be estimated efficiently from samples. To remedy this, in Sec. 4 we introduce a VDC surrogate via Input Convex Maxout Networks (ICMNs). ICMNs are new variants of Input Convex Neural Nets (Amos et al., 2017) that we propose as a proxy for convex functions, and we study their complexity. We show in Sec. 4 that the surrogate VDC enjoys parametric rates and can be efficiently estimated from samples. The surrogate VDC can be computed using (stochastic) gradient descent on the parameters of the ICMN and can characterize convex dominance (see Figure 1c). We then show in Sec. 5 how to use the VDC and its surrogate to define a pseudo-distance on the probability space. Finally, in Sec. 6 we propose penalizing the training losses of generative models with the surrogate VDC to learn implicit generative models that have better coverage and spread than known baselines. This leads to a min-max game similar to GANs. We validate our framework in Sec. 7 with experiments on portfolio optimization and image generation.
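To make the ICMN idea tangible, here is a minimal numpy sketch of a two-layer input convex maxout network. The parameterization (layer sizes, the skip connection, the way nonnegativity is enforced) is a hypothetical choice for illustration and may differ from the architecture used in the paper; what matters is the convexity argument: each maxout unit is a max of affine maps of x (convex), and later layers combine convex features with nonnegative weights plus an affine skip from the input, which preserves convexity in x.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, k = 2, 8, 3  # input dim, hidden width, maxout pieces (illustrative sizes)

# Layer-0 parameters: k affine maps per hidden unit (max of affine maps is convex).
W0 = rng.normal(size=(k, d, h))
b0 = rng.normal(size=(k, h))
# Layer-1 parameters: weights on the convex hidden features are made nonnegative
# via abs(), so the output stays convex in x; the input skip U1 is unconstrained.
W1 = np.abs(rng.normal(size=(k, h, 1)))
U1 = rng.normal(size=(k, d, 1))
b1 = rng.normal(size=(k, 1))

def icmn(x):
    """Scalar-valued input convex maxout network; x has shape (batch, d)."""
    # first maxout layer: elementwise max over k affine maps of x
    z = np.max(np.stack([x @ W0[i] + b0[i] for i in range(k)]), axis=0)
    # second maxout layer: nonnegative mixing of convex features + affine skip
    out = np.max(np.stack([z @ W1[i] + x @ U1[i] + b1[i] for i in range(k)]), axis=0)
    return out[..., 0]

# Numerical midpoint-convexity check: f((x+y)/2) <= (f(x) + f(y)) / 2.
x, y = rng.normal(size=(2, d))
assert icmn(((x + y) / 2)[None]) <= (icmn(x[None]) + icmn(y[None])) / 2 + 1e-9
```

In the surrogate VDC, a network of this form plays the role of the convex witness function and is trained by (stochastic) gradient descent on its parameters.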

2. THE CHOQUET OR CONVEX ORDER

Denote by P(R^d) the set of Borel probability measures on R^d and by P1(R^d) ⊂ P(R^d) the subset of those which have finite first moment: µ ∈ P1(R^d) if and only if ∫_{R^d} ∥x∥ dµ(x) < +∞.

Comparing probability distributions. Integral probability metrics (IPMs) are pseudo-distances between probability measures µ, ν defined as d_F(µ, ν) = sup_{f ∈ F} E_µ f − E_ν f, for a given function class F which is symmetric with respect to sign flips. They are ubiquitous in optimal transport and generative modeling to compare distributions: if F is the set of functions with Lipschitz constant 1, then the resulting IPM is the 1-Wasserstein distance; if F is the unit ball of an RKHS, the IPM is its maximum mean discrepancy. Clearly, d_F(µ, ν) = 0 if and only if E_µ f = E_ν f for all f ∈ F, and when F is large enough, this is equivalent to µ = ν.

The Choquet or convex order. When the class F is not symmetric with respect to sign flips, comparing the expectations E_µ f and E_ν f for f ∈ F does not yield a pseudo-distance. In the case where F is the set of convex functions, the convex order naturally arises instead:

Definition 1 (Choquet order, Ekeland & Schachermayer (2014), Def. 4). For µ⁻, µ⁺ ∈ P1(R^d), we say that µ⁻ ⪯ µ⁺ if for any convex function f : R^d → R, we have ∫_{R^d} f(x) dµ⁻(x) ≤ ∫_{R^d} f(x) dµ⁺(x).
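The RKHS case of the IPM admits a simple closed-form estimator, which makes the definition concrete. The sketch below estimates the squared maximum mean discrepancy with a Gaussian kernel (the standard biased V-statistic estimator, shown here purely as an illustration of an IPM; it is not the metric developed in this paper):

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel: the IPM obtained
    when F is the unit ball of the corresponding RKHS."""
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))  # ~ 0: equal laws
diff = mmd2(rng.normal(0, 1, 2000), rng.normal(2, 1, 2000))  # clearly positive
print(same, diff)
```

For richer classes F, such as 1-Lipschitz or convex functions, no such closed form exists and the supremum over witness functions must be approximated, which is where neural parameterizations enter.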



Figure 1: VDC example in 1D. Figure 1a: µ⁺ is a mixture of 3 Gaussians; µ⁻ corresponds to a single mode of the mixture; µ⁺ dominates µ⁻ in the convex order. Figure 1b: univariate characterization of the convex order with survival functions (see Sec. 2 for details). Figure 1c: surrogate VDC computation with an Input Convex Maxout Network and gradient descent; the surrogate VDC tends to zero at the end of training, hence certifying the convex dominance of µ⁺ over µ⁻.


