FAST NONLINEAR VECTOR QUANTILE REGRESSION

Abstract

Quantile regression (QR) is a powerful tool for estimating one or more conditional quantiles of a target variable Y given explanatory features X. A limitation of QR is that it is only defined for scalar target variables, due to the formulation of its objective function, and since the notion of quantiles has no standard definition for multivariate distributions. Recently, vector quantile regression (VQR) was proposed as an extension of QR for vector-valued target variables, thanks to a meaningful generalization of the notion of quantiles to multivariate distributions via optimal transport. Despite its elegance, VQR is arguably not applicable in practice due to several limitations: (i) it assumes a linear model for the quantiles of the target Y given the features X; (ii) its exact formulation is intractable even for modestly-sized problems in terms of target dimensions, number of regressed quantile levels, or number of features, and its relaxed dual formulation may violate the monotonicity of the estimated quantiles; (iii) no fast or scalable solvers for VQR currently exist. In this work we fully address these limitations, namely: (i) we extend VQR to the non-linear case, showing substantial improvement over linear VQR; (ii) we propose vector monotone rearrangement, a method which ensures the quantile functions estimated by VQR are monotone functions; (iii) we provide fast, GPU-accelerated solvers for linear and nonlinear VQR which maintain a fixed memory footprint, and demonstrate that they scale to millions of samples and thousands of quantile levels; (iv) we release an optimized python package of our solvers, so as to promote the widespread use of VQR in real-world applications.

Nonlinear VQR. To address the limitation of linear specification, in Section 4 we propose nonlinear vector quantile regression (NL-VQR). The key idea is to fit a nonlinear embedding function of the input features jointly with the regression coefficients. This is made possible by leveraging the relaxed dual formulation and solver introduced in Section 3. We demonstrate, through synthetic and real-data experiments, that nonlinear VQR can model complex conditional quantile functions substantially better than linear VQR and separable QR approaches.

Vector monotone rearrangement (VMR). In Section 5 we propose VMR, which resolves the co-monotonicity violations in estimated CVQFs. We solve an optimal transport problem to rearrange the vector quantiles such that they satisfy co-monotonicity.

1. INTRODUCTION

Quantile regression (QR) (Koenker & Bassett, 1978) is a well-known method which estimates a conditional quantile of a target variable Y, given covariates X. A major limitation of QR is that it deals with a scalar-valued target variable, while many important applications require estimation of vector-valued responses. A trivial approach is to estimate conditional quantiles separately for each component of the vector-valued target. However, this assumes statistical independence between the targets, a very strong assumption rarely held in practice. Extending QR to high-dimensional responses is not straightforward because (i) the notion of quantiles is not trivial to define for high-dimensional variables, and in fact multiple definitions of multivariate quantiles exist (Carlier et al., 2016); (ii) quantile regression is performed by minimizing the pinball loss, which is not defined for high-dimensional responses. The seminal works of Carlier et al. (2016) and Chernozhukov et al. (2017) introduced a notion of quantiles for vector-valued random variables, termed vector quantiles. Key to their approach is extending the notions of monotonicity and strong representation of scalar quantile functions to high dimensions, i.e.

Co-monotonicity:

(Q_Y(u) − Q_Y(u′))^⊤(u − u′) ≥ 0,  ∀u, u′ ∈ [0, 1]^d  (1)

Strong representation:

Y = Q_Y(U),  U ∼ U[0, 1]^d  (2)

where Y is a d-dimensional variable, and Q_Y : [0, 1]^d → R^d is its vector quantile function (VQF). Moreover, Carlier et al. (2016) extended QR to vector-valued targets, which leads to vector quantile regression (VQR). VQR estimates the conditional vector quantile function (CVQF) Q_{Y|X} from samples drawn from P_(X,Y), where Y is a d-dimensional target variable and X are k-dimensional covariates. They show that a function Q_{Y|X} which obeys co-monotonicity (1) and strong representation (2) exists and is unique, as a consequence of Brenier's polar factorization theorem (Brenier, 1991). Figure 1 provides a visualization of these notions for a two-dimensional target variable.

[Figure 1: (a) The VQF Q(u) = [Q_1(u), Q_2(u)] is co-monotonic with u = (u_1, u_2); Q_1, Q_2 are depicted as surfaces (left, right) with the corresponding vector quantiles overlaid. On Q_1, increasing u_1 for a fixed u_2 produces a monotonically increasing curve, and vice versa for Q_2. (b) Visualization of conditional vector quantile functions (CVQFs) via α-contours. Data was drawn from a joint distribution of (X, Y) where Y|X = x has a star-shaped distribution rotated by x degrees. The true CVQF Q_{Y|X} changes non-linearly with the covariates X, while E[Y|X] remains the same. This demonstrates the challenge of estimating CVQFs from samples of the joint distribution. Appendix C provides further intuitions regarding VQFs and CVQFs, and details how the α-contours are constructed from them.]

Assuming a linear specification Q_{Y|X}(u; x) = B(u)x + a(u), VQR can be formulated as an optimal transport problem between the measures of Y|X and U, with the additional mean-independence constraint E[X|U] = E[X]. The primal formulation of this problem is a large-scale linear program and is thus intractable even for modestly-sized problems.
A relaxed dual formulation which is amenable to gradient-based solvers exists, but leads to co-monotonicity violations. The first goal of our work is to address the following limitations of Carlier et al. (2016; 2020): (i) the linear specification assumption on the CVQF, and (ii) the violation of co-monotonicity when solving the inexact formulation of the VQR problem. The second goal of this work is to make VQR an accessible tool for off-the-shelf usage on large-scale high-dimensional datasets. Currently there are no available software packages to estimate VQFs and CVQFs that can scale beyond toy problems. We aim to provide accurate, fast and distribution-free estimation of these fundamental statistical quantities. This is relevant for innumerable applications requiring statistical inference, such as distribution-free uncertainty estimation for vector-valued variables (Feldman et al., 2021), hypothesis testing with a vector-valued test statistic (Shi et al., 2022), causal inference with multiple interventions (Williams & Crespi, 2020), outlier detection (Zhao et al., 2019) and others. Below we list our contributions.

Scalable VQR. We introduce a highly-scalable solver for VQR in Section 3. Our approach, inspired by Genevay et al. (2016) and Carlier et al. (2020), relies on solving a new relaxed dual formulation of the VQR problem. We propose custom stochastic-gradient-based solvers which maintain a constant memory footprint regardless of problem size. We demonstrate that our approach scales to millions of samples and thousands of quantile levels and allows for GPU-acceleration.

Vector monotone rearrangement (VMR). By solving an optimal transport problem, VMR rearranges the vector quantiles such that they satisfy co-monotonicity. We show that VMR strictly improves the estimation quality of CVQFs while ensuring zero co-monotonicity violations.

Open-source software package.
We release a feature-rich, well-tested python package, vqr, implementing estimation of vector quantiles, vector ranks, vector quantile contours, linear and nonlinear VQR, and VMR. To the best of our knowledge, this is the first publicly available tool for estimating conditional vector quantile functions at scale.

2. BACKGROUND

Below we introduce quantile regression, its optimal-transport-based formulation, and the extension to vector-valued targets. We aim to provide a brief and self-contained introduction to the key building blocks of VQR. For simplicity, we present the discretized versions of the problems.

Notation. Throughout, Y, X denote random variables and vectors, respectively; deterministic scalars, vectors and matrices are denoted as y, x, and X. P_(X,Y) denotes the joint distribution of X and Y. 1_N denotes an N-dimensional vector of ones, ⊙ denotes the elementwise product, and I{·} is an indicator. We denote by N the number of samples, d the dimension of the target variable, k the dimension of the covariates, and T the number of vector quantile levels per target dimension. Q_{Y|X}(u; x) is the CVQF of the variable Y|X evaluated at the vector quantile level u for X = x.

Quantile Regression. The goal of QR is to estimate a quantile of the variable Y|X. Assuming a linear model of the latter, and given u ∈ (0, 1), QR amounts to solving (b_u, a_u) = argmin_{b,a} E_(X,Y)[ρ_u(Y − b^⊤X − a)], where b_u^⊤x + a_u is the u-th quantile of Y|X = x, and ρ_u(z), known as the pinball loss, is given by ρ_u(z) = max{0, z} + (u − 1)z. Solving this problem produces an estimate of Q_{Y|X} for a single quantile level u. In order to estimate the full conditional quantile function (CQF) Q_{Y|X}(u), the problem must be solved at all levels of u with additional monotonicity constraints, since the quantile function is non-decreasing in u. The CQF discretized at T quantile levels can be estimated from N samples {(x_i, y_i)}_{i=1}^N ∼ P_(X,Y) by solving

min_{B,a} Σ_u Σ_{i=1}^N ρ_u(y_i − b_u^⊤x_i − a_u)  s.t.  ∀i, u ≥ u′ ⟹ b_u^⊤x_i + a_u ≥ b_{u′}^⊤x_i + a_{u′},  (3)

where B and a aggregate all the b_u and a_u, respectively. We refer to eq. (3) as simultaneous linear quantile regression (SLQR).
This problem is undefined for a vector-valued Y, due to the inherently one-dimensional formulation of the monotonicity constraints and the pinball loss ρ_u(z).

Optimal Transport Formulation. Carlier et al. (2016) showed that SLQR (3) can be equivalently written as an optimal transport (OT) problem between the target variable and the quantile levels, with an additional constraint of mean independence. Given N data samples arranged as y ∈ R^N, X ∈ R^{N×k}, and T quantile levels u = (1/T, 2/T, ..., 1), we can write

max_{Π≥0} u^⊤Πy  s.t.  Π^⊤1_T = ν,  Π1_N = µ [ϕ],  ΠX = X̄ [β],  (4)

where Π ∈ R^{T×N} is the transport plan between quantile levels u and samples (x, y), with marginal constraints ν = (1/N)1_N, µ = (1/T)1_T, and mean-independence constraint X̄ = (1/(TN)) 1_T 1_N^⊤ X. The dual variables are ϕ = D^{-1}a and β = D^{-1}B, where D is a first-order finite-differences matrix, and a ∈ R^T, B ∈ R^{T×k} contain the regression coefficients for all quantile levels. Refer to appendices A.1 and A.2 for a full derivation of the connection between SLQR (3) and OT (4).

Vector Quantile Regression. Although the OT formulation of SLQR (4) is specified between one-dimensional measures, it is immediately extensible to higher dimensions. Given vector-valued targets y_i ∈ R^d arranged in Y ∈ R^{N×d}, their vector quantiles are also in R^d. The vector quantile levels are sampled on a uniform grid on [0, 1]^d with T evenly spaced points in each dimension, resulting in T^d d-dimensional vector quantile levels, arranged as U ∈ R^{T^d×d}. The OT objective can be written as Σ_{i=1}^T Σ_{j=1}^N Π_{i,j} u_i y_j = ⟨Π, S⟩, where S ∈ R^{T×N}, and thus can be naturally extended to d > 1 by defining the pairwise inner-product matrix S = UY^⊤ ∈ R^{T^d×N}. The result is a d-dimensional discrete linear estimate of the CVQF, Q_{Y|X}(u; x), which is co-monotonic (1) with u for each x. Appendix A.6 details how the estimated CVQF is obtained from the dual variables in the high-dimensional case.
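To make the scalar building block concrete, here is a minimal numpy sketch (our illustration, not the paper's solver) that estimates a single u-th quantile by directly minimizing the empirical pinball loss. Since the loss is piecewise linear and convex in the intercept a, a minimizer is always attained at one of the sample points, so an exhaustive search over the samples suffices:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)   # samples of a scalar target Y
u = 0.7                         # quantile level to estimate

def pinball(z, u):
    # rho_u(z) = z^+ + (u - 1) z
    return np.maximum(z, 0) + (u - 1) * z

# Evaluate sum_i rho_u(y_i - a) for every candidate a in the sample;
# element [a_idx, i] of the broadcasted matrix is y_i - y_{a_idx}.
losses = pinball(y[None, :] - y[:, None], u).sum(axis=1)
a_star = y[np.argmin(losses)]   # close to the empirical u-th quantile
```

The recovered minimizer matches the u-th sample quantile up to the order-statistic spacing, illustrating why minimizing the pinball loss yields quantiles.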

3. SCALABLE VQR

To motivate our proposed scalable approach to VQR, we first present the exact formulations of VQR and discuss their limitations. The OT-based primal and dual problems are presented below:

max_{Π≥0} Σ_{i=1}^{T^d} Σ_{j=1}^N u_i^⊤y_j Π_{i,j}  s.t.  Π^⊤1_{T^d} = ν [ψ],  Π1_N = µ [ϕ],  ΠX = X̄ [β]  (5)

min_{ψ,ϕ,β} ψ^⊤ν + ϕ^⊤µ + tr(β^⊤X̄)  s.t.  ∀i, j: ϕ_i + β_i^⊤x_j + ψ_j ≥ u_i^⊤y_j [Π]  (6)

Here ψ ∈ R^N, ϕ ∈ R^{T^d}, and β ∈ R^{T^d×k} are the dual variables, µ = (1/T^d)1_{T^d}, ν = (1/N)1_N are the marginals, and X̄ = (1/(T^d N)) 1_{T^d} 1_N^⊤ X encodes the mean-independence constraint. The solution of these linear programs results in the convex potentials ψ(x, y) and β(u)^⊤x + ϕ(u), where the former is discretized over data samples and the latter over quantile levels. The estimated CVQF is then the transport map between the measures of the quantile levels and the data samples, and is given by Q_{Y|X}(u; x) = ∇_u(β(u)^⊤x + ϕ(u)) (see also Appendix A.6). Notably, co-monotonicity of the CVQF arises due to it being the gradient of a convex function.
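The following sketch (our illustration, not the paper's solver) builds the primal LP (5) explicitly for a tiny d = k = 1 problem using scipy's linprog. Even at this scale the plan has T·N variables, which hints at why general-purpose LP solvers cannot scale; the sizes below are arbitrary:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
T, N, k = 5, 8, 1                      # quantile levels, samples, feature dim
u = np.arange(1, T + 1) / T            # quantile levels
y = np.sort(rng.standard_normal(N))    # scalar targets
X = rng.standard_normal((N, k))        # scalar covariates

mu = np.full(T, 1 / T)                 # marginal over levels
nu = np.full(N, 1 / N)                 # marginal over samples
Xbar = np.tile(X.mean(0) / T, (T, 1))  # mean-independence RHS, shape (T, k)

# Variables: Pi flattened row-major, Pi[i, j] -> index i*N + j.
c = -(np.outer(u, y)).ravel()          # maximize sum u_i y_j Pi_ij
A_eq, b_eq = [], []
for j in range(N):                     # Pi^T 1_T = nu
    row = np.zeros(T * N); row[j::N] = 1
    A_eq.append(row); b_eq.append(nu[j])
for i in range(T):                     # Pi 1_N = mu
    row = np.zeros(T * N); row[i * N:(i + 1) * N] = 1
    A_eq.append(row); b_eq.append(mu[i])
for i in range(T):                     # (Pi X)_i = Xbar_i  (k = 1 here)
    row = np.zeros(T * N); row[i * N:(i + 1) * N] = X[:, 0]
    A_eq.append(row); b_eq.append(Xbar[i, 0])

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None), method="highs")
Pi = res.x.reshape(T, N)               # the optimal transport plan
```

The independence coupling Π = µν^⊤ is always feasible, so the LP is solvable; the point is that the variable count T^d·N explodes with T, N, and d.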

Limitations of existing methods.

The primal VQR problem (5) has T^d·N variables and T^d·(k + 1) + N constraints. Solving the above linear programs with general-purpose solvers has cubic complexity in the number of variables and constraints, and is thus intractable even for modestly-sized VQR problems (see fig. A9). Numerous works have proposed fast solvers for variants of OT problems. A common approach for scaling OT to large samples is to introduce an entropic regularization term −ε Σ_{i,j} Π_{i,j} log Π_{i,j} to the primal objective. As shown by Cuturi (2013), this regularized formulation allows using the Sinkhorn algorithm, which provides a significant speedup. Other solvers have been proposed for OT variants with application-specific regularization terms on Π (Ferradans et al., 2013; Solomon et al., 2015; Benamou et al., 2015). However, the addition of the mean-independence constraint on Π renders these solvers inapplicable to VQR. For example, the Sinkhorn algorithm would require three projections per iteration: one for each marginal, and one for the mean-independence constraint. Crucially, for the latter constraint the projection onto the feasible set has no closed-form solution, and thus solving VQR via Sinkhorn is very slow in practice.

Relaxed dual formulation. Another approach for scaling OT is to solve its dual formulation. Genevay et al. (2016) showed that the relaxed dual version of the standard OT problem is amenable to stochastic optimization and can thus scale to large samples. The relaxed dual formulation of VQR is obtained from eq. (6) (see appendix A.4), and can be written as

min_{ψ,β} ψ^⊤ν + tr(β^⊤X̄) + ε Σ_{i=1}^{T^d} µ_i log Σ_{j=1}^N exp((u_i^⊤y_j − β_i^⊤x_j − ψ_j)/ε),  (7)

where ε controls the exactness of the objective; as ε decreases, the relaxed dual more closely approximates the exact dual.
Of note are some important properties of this formulation: (i) it encodes the mean-independence constraint in the objective; (ii) it is equivalent to an entropic-regularized version of the VQR OT primal (5) (see appendix A.5); (iii) it is an unconstrained convex objective amenable to stochastic-gradient-based optimization. The key drawbacks of this approach are the linear specification of the resulting CVQF in x, and the potential violation of co-monotonicity due to the relaxation of the constraints. We address the former limitation in Section 4 and the latter in Section 5.

SGD-based solver. The relaxed dual formulation of VQR (7) involves only T^d·k + N optimization variables, and the objective is amenable to GPU acceleration, as it involves only dense-matrix multiplications and pointwise operations. However, calculating this objective requires materializing two T^d × N inner-product matrices, causing the memory footprint to grow exponentially with d. This problem is exacerbated when X is high-dimensional, as it must also be kept in memory. These memory requirements severely limit the problem size which can be feasibly solved on general-purpose GPUs. Thus, naively applying stochastic gradient descent (SGD) with mini-batches of samples to (7) is insufficient for obtaining a scalable solver. To attain a constant memory footprint w.r.t. T and N, we sample data points from {(x_j, y_j)}_{j=1}^N together with vector quantile levels from {u_i}_{i=1}^{T^d} and evaluate eq. (7) on these samples. Contrary to standard SGD, when sampling quantile levels, we select the corresponding entries from β and µ; likewise, when sampling data points, we select the corresponding entries from ψ. Thus, by setting a fixed batch size for both data points and levels, we solve VQR with a constant memory footprint irrespective of problem size (fig. 2). Moreover, we observe that in practice, using SGD adds only negligible optimization error to the solution, with smaller batches producing more error, as expected (Appendix fig. A10a). Refer to Appendix B for a convergence analysis of this approach and to Appendix G for implementation details.
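As an illustration of the relaxed dual, the following numpy sketch solves eq. (7) by full-batch gradient descent for the covariate-free case (d = 1, no β term), and recovers quantile estimates via the barycentric projection of the induced entropic coupling. This is our simplified stand-in for the paper's solver: in the paper, both levels (rows) and samples (columns) would additionally be mini-batched, and the values of ε, the learning rate, and the iteration count below are arbitrary choices:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
N, T, eps, lr, iters = 100, 50, 0.1, 0.1, 20000
y = rng.standard_normal(N)                  # samples of a scalar target Y
u = (np.arange(1, T + 1) - 0.5) / T         # quantile levels
mu, nu = np.full(T, 1 / T), np.full(N, 1 / N)
S = np.outer(u, y)                          # S[i, j] = u_i * y_j

def objective(psi):
    # Relaxed dual (7) with the covariate term dropped (no X, no beta).
    return psi @ nu + eps * (mu @ logsumexp((S - psi) / eps, axis=1))

psi = np.zeros(N)
f0 = objective(psi)
for _ in range(iters):
    # Softmax over samples for each level: rows of the entropic coupling.
    Z = (S - psi) / eps
    P = np.exp(Z - logsumexp(Z, axis=1, keepdims=True))
    grad = nu - mu @ P                      # gradient of the objective w.r.t. psi
    psi -= lr * grad                        # full-batch descent for clarity

Z = (S - psi) / eps
P = np.exp(Z - logsumexp(Z, axis=1, keepdims=True))
Q = P @ y                                   # barycentric projection: Q[i] ~ Q_Y(u_i)
```

The recovered Q is non-decreasing in u (the barycentric projection's derivative in u is a variance, hence non-negative) and approximates the empirical quantiles of y.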

4. NONLINEAR VQR

In this section we propose an extension to VQR which allows the estimated model to be non-linear in the covariates. Carlier et al. (2016) proved the existence and uniqueness of a CVQF which satisfies strong representation, i.e., Y|(X = x) = Q_{Y|X}(U; x), and is co-monotonic w.r.t. U. In order to estimate this function from finite samples, they further assumed a linear specification in both u and x, given by the model Q^L_{Y|X}(u; x) = B(u)x + a(u). This results in the OT problem with the mean-independence constraint (5). However, in practice and with real-world datasets, there is no reason to assume that such a specification is valid. In cases where the true CVQF is a non-linear function of x, this model is mis-specified and the estimated CVQF does not satisfy strong representation.

Extension to nonlinear specification. We address the aforementioned limitation by modelling the CVQF as a non-linear function of the covariates parametrized by θ, i.e., Q^{NL}_{Y|X}(u; x) = B(u)g_θ(x) + a(u). We fit θ jointly with the regression coefficients B, a. The rationale is to parametrize an embedding g_θ(x) such that the model Q^{NL}_{Y|X} is better specified than Q^L_{Y|X} in the sense of strong representation. Encoding this nonlinear CVQF model into the exact formulations of VQR (eqs. (5) and (6)) would no longer result in a linear program. However, the proposed relaxed dual formulation of VQR (7) can naturally incorporate the aforementioned non-linear transformation of the covariates, as follows:

min_{ψ,β,θ} ψ^⊤ν + tr(β^⊤Ḡ_θ(X)) + ε Σ_{i=1}^{T^d} µ_i log Σ_{j=1}^N exp((u_i^⊤y_j − β_i^⊤g_θ(x_j) − ψ_j)/ε),  (8)

where Ḡ_θ(X) = (1/(T^d N)) 1_{T^d} 1_N^⊤ g_θ(X) is the empirical mean after applying g_θ to each sample. We optimize the above objective with the modified SGD approach described in section 3. The estimated non-linear CVQF can then be recovered as Q^{NL}_{Y|X}(u; x) = ∇_u(β(u)^⊤g_θ(x) + ϕ(u)).
A crucial advantage of our nonlinear CVQF model is that g_θ may embed x ∈ R^k into a different dimension k′ ≠ k. This means that VQR can be performed, e.g., on a lower-dimensional projection of x or on a "lifted" higher-dimensional representation. For a low-dimensional X which has a highly-nonlinear relationship with the target Y, lifting it into a higher dimension could result in the linear regression model (with coefficients B and a) being well-specified. Conversely, for a high-dimensional X which has intrinsic low-dimensionality (e.g., an image), g_θ may encode a projection operator with an inductive bias, which exploits the intrinsic structure of the feature space. This allows one to choose a g_θ suitable for the problem at hand, for example, a convolutional neural network in the case of an image, or a graph neural network if x lives on a graph. In all our experiments, we indeed find that a learned non-linear feature transformation substantially improves the representation power of the estimated CVQF (see Section 6).

Simultaneous Quantile Regression (SQR). SQR is the task of simultaneously resolving multiple conditional quantiles of a scalar variable, i.e., a possibly non-linear version of eq. (3). Nonlinear SQR has been well-studied, and multiple approaches have been proposed (Brando et al., 2022; Tagasovska & Lopez-Paz, 2019). We highlight that SQR can be performed as a special case of NL-VQR (8) when d = 1; see Appendix fig. A12 for an example. Other SQR approaches enforce monotonicity of the CQF via different ad-hoc techniques, while with NL-VQR the monotonicity naturally emerges from the OT formulation, as explained in section 3; thus it is arguably a preferable approach. To our knowledge, OT-based SQR has not been demonstrated before.

Prior attempts at nonlinear high-dimensional quantiles. Feldman et al. (2021) proposed an approach for constructing high-dimensional confidence regions, based on a conditional variational autoencoder (CVAE). A crucial limitation of this work is that it does not infer the CVQF; their resulting confidence regions are therefore not guaranteed to be quantile contours (see fig. 1a), because they do not satisfy the defining properties of vector quantiles (eqs. (1) and (2)).
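The lifting rationale above can be illustrated without the full VQR machinery. In the sketch below (our illustration), a median fit that is linear in the raw covariate x is compared to one that is linear in a fixed random-feature lifting of x, used as a stand-in for a learned g_θ; we use least-squares fits, which for symmetric noise target the conditional median, and all sizes and feature parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x = rng.uniform(-3, 3, N)
y = np.sin(x) + 0.1 * rng.standard_normal(N)   # median nonlinear in x

def pinball(z, u=0.5):
    # Mean pinball loss; for u = 0.5 this is half the mean absolute residual.
    return (np.maximum(z, 0) + (u - 1) * z).mean()

# Linear specification: features (1, x).
A_lin = np.stack([np.ones(N), x], 1)
# "Lifted" specification: random Fourier features as a stand-in for g_theta.
w = rng.normal(0, 2, 30)
b = rng.uniform(0, 2 * np.pi, 30)
A_rff = np.cos(np.outer(x, w) + b)

# Least-squares fits; with symmetric noise the conditional mean is the median.
r_lin = y - A_lin @ np.linalg.lstsq(A_lin, y, rcond=None)[0]
r_rff = y - A_rff @ np.linalg.lstsq(A_rff, y, rcond=None)[0]
```

The lifted model achieves a substantially lower pinball loss, i.e., the model that is linear in g_θ(x) is far better specified than the one linear in x.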

5. VECTOR MONOTONE REARRANGEMENT

Solving the relaxed dual (7) may lead to violations of co-monotonicity in the estimated Q_{Y|X}, since the exact constraints in eq. (6) are not enforced. This is analogous to the quantile crossing problem in the scalar case, which can also manifest when QR is performed separately for each quantile level (Chernozhukov et al., 2010) (Appendix fig. A12d). In what follows, we propose a way to overcome this limitation.

Consider the case of scalar quantiles, i.e., d = 1. Denote by Q̃_{Y|X}(u; x) an estimated CQF, which may be non-monotonic in u due to estimation error and thus may not be a valid CQF. One may convert Q̃_{Y|X}(u; x) into a monotonic Q̂_{Y|X}(u; x) through rearrangement as follows. Consider a random variable defined as Ỹ|X := Q̃_{Y|X}(U; x), where U ∼ U[0, 1]. Its CDF and inverse CDF are given by F_{Ỹ|X}(y; x) = ∫_0^1 I{Q̃_{Y|X}(u; x) ≤ y} du and Q̂_{Y|X}(u; x) = inf{y : F_{Ỹ|X}(y; x) ≥ u}. Q̂_{Y|X}(u; x) is the true CQF of Ỹ|X and is thus necessarily monotonic. It can be shown that Q̂_{Y|X}(u; x) is no worse an estimator for the true CQF Q_{Y|X}(u; x) than Q̃_{Y|X}(u; x) in the L_p-norm sense (Chernozhukov et al., 2010). In practice, this rearrangement is performed in the scalar case by sorting the discrete estimated CQF. Rearrangement has no effect if Q̃_{Y|X}(u; x) is already monotonic.

Here, we extend the notion of rearrangement to the case of d > 1. As before, define Ỹ|X := Q̃_{Y|X}(U; x), where U ∼ U[0, 1]^d and Q̃_{Y|X}(u; x) is the estimated CVQF. If it is not co-monotonic as defined by eq. (1), we can compute a co-monotonic Q̂_{Y|X}(u; x) by calculating the vector quantiles of Ỹ|X = x, separately for each x. We emphasize that this amounts to solving the simpler vector quantile estimation problem for a specific x, as opposed to the vector quantile regression problem. Let Q̃ = [q̃_j]_j ∈ R^{T^d×d} be the estimated CVQF Q̃_{Y|X}(u; x) sampled at T^d levels U = [u_i]_i ∈ R^{T^d×d} by solving VQR. To obtain Q̂_{Y|X}(u; x) sampled at U, we solve

max_{π_{i,j}≥0} Σ_{i,j=1}^{T^d} π_{i,j} u_i^⊤q̃_j  s.t.  Π^⊤1 = Π1 = (1/T^d)1,  (9)

and then compute Q̂ = T^d · ΠQ̃. We define this procedure as vector monotone rearrangement (VMR). The existence and uniqueness of a co-monotonic function mapping between the measures of U and Q̃_{Y|X}(U; x) is due to Brenier's theorem, as explained in section 1 (Brenier, 1991; McCann, 1995; Villani, 2021). VMR can be interpreted as "vector sorting" of the discrete CVQF estimated with VQR, since it effectively permutes the entries of Q̃ such that the resulting Q̂ is co-monotonic with U. Note that eq. (9) is an OT problem with the inner-product matrix as the ground cost, and simple constraints on the marginals of the transport plan Π. Thus, VMR (9) is significantly simpler to solve exactly than VQR (5). We leverage fast, off-the-shelf OT solvers from the POT library (Flamary et al., 2021; Bonneel et al., 2011) and apply VMR as a post-processing step, performed after the estimated CVQF Q̃_{Y|X}(u; x) is evaluated for a specific x. In the 1d case, quantile crossings before and after VMR correction can be readily visualized (Appendix figs. A12d and A12e). For d > 1, monotonicity violations manifest as (u_i − u_j)^⊤(Q̃(u_i) − Q̃(u_j)) < 0 (Appendix fig. A11). Figure 4 demonstrates that co-monotonicity violations are completely eliminated and that strong representation strictly improves by applying VMR.
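Since eq. (9) has uniform marginals of equal size, its exact solution is a permutation, so VMR can be computed with an off-the-shelf assignment solver rather than a general OT solver (the paper itself uses the POT library; this is our minimal stand-in with an arbitrary synthetic CVQF):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
T = 10                                     # levels per dimension, d = 2
g = (np.arange(1, T + 1) - 0.5) / T
U = np.stack(np.meshgrid(g, g), -1).reshape(-1, 2)   # T^2 x 2 grid of levels

# A noisy discrete "CVQF": a monotone linear map plus noise, which
# typically violates co-monotonicity.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
Q_tilde = U @ A + 0.3 * rng.standard_normal(U.shape)

def violations(U, Q):
    # Count pairs with (u_i - u_j)^T (Q(u_i) - Q(u_j)) < 0.
    D = ((U[:, None, :] - U[None, :, :]) * (Q[:, None, :] - Q[None, :, :])).sum(-1)
    return int((D < -1e-9).sum() // 2)

# VMR with uniform marginals reduces to an assignment problem:
# permute the rows of Q_tilde to maximize sum_i u_i . q_{sigma(i)}.
row, col = linear_sum_assignment(U @ Q_tilde.T, maximize=True)
Q_hat = Q_tilde[col]
```

Pairwise exchange optimality of the assignment guarantees zero co-monotonicity violations after rearrangement: if swapping two rows increased some term, the assignment would not be optimal.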

6. EXPERIMENTS

We use four synthetic and four real datasets, detailed in Appendices D.1 and D.2. Except for the MVN dataset, which was used for the scale and optimization experiments, the remaining three synthetic datasets were carefully selected to be challenging, as they exhibit complex nonlinear relationships between X and Y (see e.g. fig. 1b). We evaluate using the following metrics (detailed in Appendix E): (i) KDE-L1, an estimate of the distance between distributions; (ii) QFD, a distance between an estimated CVQF and its ground truth; (iii) inverse CVQF entropy; (iv) monotonicity violations; (v) marginal coverage; (vi) size of the α-confidence set. The first three metrics serve as measures of strong representation (2), and can only be used in the synthetic experiments since they require access to the data-generating process. The last two metrics are used for the real-data experiments, as they only require samples from the joint distribution. Implementation details for all experiments can be found in Appendix G.

6.1. SCALE AND OPTIMIZATION

To demonstrate the scalability of our VQR solver and its applicability to large problems, we solved VQR under multiple increasing values of N and T, while keeping all other data and training settings fixed, and measured both wall time and KDE-L1. We used up to N = 2·10^6 data points, d = 2 dimensions, and T = 100 levels per dimension (T^d = 100^2 levels in total), while sampling both data points and quantile levels stochastically, thus keeping the memory requirement fixed throughout.

Published as a conference paper at ICLR 2023

This enabled us to run the experiment on a commodity 2080Ti GPU with 11GB of memory. Optimization experiments, showing the effects of ε and batch sizes on convergence, are presented in Appendix F. Figure 2 presents these results. Runtime increases linearly with N and quadratically with T, as can be expected for d = 2. KDE-L1 consistently improves when increasing N and T, showcasing improved accuracy in distribution estimation, especially as more quantile levels are estimated. To the best of our knowledge, this is the first time that large-scale VQR has been demonstrated.

6.2. SYNTHETIC DATA EXPERIMENTS

Here our goal is to evaluate the estimation error (w.r.t. the ground-truth CVQF) and the sampling quality (when sampling from the estimated CVQF) of nonlinear VQR. We use the conditional-banana, rotating-stars and synthetic-glasses datasets, for which the assumption of a linear CVQF is violated. Baselines. We use both linear VQR and conditional variational autoencoders (CVAE) (Feldman et al., 2021) as baselines.

6.3. REAL DATA EXPERIMENTS

VQR has numerous potential applications, as detailed in section 1. Here we showcase one immediate and highly useful application, namely distribution-free uncertainty estimation for vector-valued targets. Given P_(X,Y) and a confidence level α ∈ (0, 1), the goal is to construct a conditional α-confidence set C_α(x) with marginal coverage 1 − α, defined as P[Y ∈ C_α(X)] = 1 − α. A key requirement from an uncertainty estimation method is to produce a small C_α(x) which satisfies marginal coverage, without any distributional assumptions (Romano et al., 2019).

Baselines. We compare nonlinear VQR against (i) separable linear QR (Sep-QR); (ii) separable nonlinear QR (Sep-NLQR); (iii) linear VQR. For Sep-QR and Sep-NLQR the estimated CVQF is Q^Sep_{Y|X}(u; x) = [Q^QR_{Y_1|X}(u_1; x), ..., Q^QR_{Y_d|X}(u_d; x)], where Q^QR_{Y_i|X}(u_i; x) is obtained via one-dimensional linear or nonlinear quantile regression of the variable Y_i. These separable baselines represent the basic approaches for distribution-free uncertainty estimation with vector-valued variables. They work well in practice, and the size of the α-confidence sets they produce serves as an upper bound, due to their inherent independence assumption.

Evaluation procedure. The key idea for comparing distribution-free uncertainty estimation approaches is as follows. First, a nominal coverage rate 1 − α* is chosen. An α-confidence set is then constructed for each estimation method, such that it obtains the nominal coverage (in expectation). This is similar to calibration in conformal prediction (Sesia & Romano, 2021), since α is calibrated to control the size of the α-confidence set. Finally, the size of the α-confidence set is measured as the evaluation criterion. Appendix G contains experimental details and calibrated α values.

Results. Across the four datasets, NL-VQR achieves 34-60% smaller α-confidence-set sizes compared to the second-best performing method (section 6.3; fig. A15; table A4). The reason for the superior performance of NL-VQR is its ability to accurately capture the shape of the conditional distribution, leading to small confidence sets with the same coverage (section 6.3 and fig. A16).
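As a toy illustration of how an α-confidence set is obtained from vector quantile levels (our sketch, not the paper's NL-VQR pipeline): for a standard 2D Gaussian target the separable VQF is Q(u) = (Φ⁻¹(u₁), Φ⁻¹(u₂)), and taking the image of the central cube of levels [δ, 1 − δ]² with (1 − 2δ)² = 1 − α yields a set with marginal coverage 1 − α, verified here by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.1
# Choose delta so the central cube of levels has probability 1 - alpha:
# (1 - 2*delta)^2 = 1 - alpha.
delta = (1 - np.sqrt(1 - alpha)) / 2
lo, hi = norm.ppf(delta), norm.ppf(1 - delta)   # image of the cube under Q

# Monte Carlo estimate of the marginal coverage P[Y in C_alpha].
Y = rng.standard_normal((20000, 2))
coverage = np.all((Y >= lo) & (Y <= hi), axis=1).mean()
```

The separable construction above corresponds to the Sep-QR baseline; NL-VQR instead shapes the set to the conditional distribution, which is what yields smaller sets at equal coverage.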

7. SOFTWARE

We provide our VQR solvers as part of a robust, well-tested python package, vqr (available in the supplementary materials). Our package implements fast solvers with support for GPU-acceleration and has a familiar sklearn-style API. It supports any number of dimensions for both input features (k) and target variables (d), and allows for arbitrary neural networks to be used as the learned nonlinear feature transformations, g θ . The package provides tools for estimating vector quantiles, vector ranks, vector quantile contours, and performing VMR as a refinement step after fitting VQR. To the best of our knowledge, this would be the first publicly available tool for estimating conditional vector quantile functions at scale. See Appendix H for further details.

8. CONCLUSION

In this work, we proposed NL-VQR, as well as a scalable approach for performing VQR in general. NL-VQR overcomes a key limitation of VQR, namely the assumption that the CVQF is linear in X. Our approach allows modelling conditional quantiles by embedding X into a space where VQR is better specified, and further exploits the structure of the domain of X (via domain-specific models like CNNs). We proposed a new relaxed dual formulation and a custom solver for the VQR problem, and demonstrated that we can perform VQR with millions of samples. As far as we know, large-scale estimation of CVQFs, or even VQFs, has not been previously shown. Moreover, we resolved the issue of high-dimensional quantile crossings by proposing VMR, a refinement step for estimated CVQFs. We demonstrated, through exhaustive synthetic and real-data experiments, that NL-VQR with VMR is by far the most accurate way to model CVQFs. Finally, based on the real-data results, we argue that NL-VQR should be the primary approach for distribution-free uncertainty estimation, instead of separable approaches which assume independence.

Limitations. As with any multivariate density estimation task which makes no distributional assumptions, our approach suffers from the curse of dimensionality, especially in the target dimension. Overcoming this limitation requires future work, and might entail, e.g., exploiting the structure of the domain of Y. This could be achieved by leveraging recent advances in high-dimensional neural OT (Korotin et al., 2021). Another potential limitation is that the nonlinear transformation g_θ(x) is shared across quantile levels (it is not a function of u), though evaluating whether this is truly a limiting assumption in practice requires further investigation.

In conclusion, although quantile regression is a very popular tool, vector quantile regression is arguably far less known, accessible, and usable in practice due to the lack of adequate tools. This is despite the fact that it is a natural extension of QR, which can be used for general statistical inference tasks. We present the community with an off-the-shelf method for performing VQR in the real world. We believe this will contribute to many existing applications, and inspire a wide range of new applications for which it is currently prohibitive or impossible to estimate CVQFs.

A DERIVATIONS

In this section, we present the following derivations:
• Formulating one-dimensional QR as a correlation-maximization problem (Appendix A.1).
• Rephrasing correlation maximization as one-dimensional optimal transport (Appendix A.2).
• Extending the OT-based formulation of QR to multi-dimensional targets (Appendix A.3).
• Relaxing the dual formulation of the OT-based VQR problem (Appendix A.4).
• The equivalence between the entropic-regularized version of the OT-based primal and the relaxed dual formulation (Appendix A.5).
• Calculating the conditional (vector) quantile functions from the dual variables (Appendix A.6).

The derivations in this section are not rigorous mathematical proofs; rather, they are meant to show an easy-to-follow way to obtain the high-dimensional VQR relaxed dual objective (which we eventually solve), starting from the well-known one-dimensional QR based on the pinball loss.

A.1 QUANTILE REGRESSION AS CORRELATION MAXIMIZATION

The pinball loss can be written as
$$\rho_u(z) = \begin{cases} uz, & z > 0\\ (u-1)z, & z \le 0 \end{cases} = \begin{cases} uz - z + z, & z > 0\\ (u-1)z, & z \le 0 \end{cases} = z^+ + (u-1)z,$$
where $z^+ \triangleq \max\{0, z\}$. We also define $z^- \triangleq \max\{0, -z\}$, which we use later. Given the continuous joint distribution $P_{X,Y}$, and assuming a linear model relates the u-th quantile of Y to X, performing quantile regression amounts to
$$\min_{a_u, b_u}\ \mathbb{E}_{X,Y}\left[\left(Y - b_u^\top X - a_u\right)^+ - (1-u)\left(Y - b_u^\top X - a_u\right)\right].$$
Defining $Z(a_u, b_u) \triangleq Y - b_u^\top X - a_u$, the above problem can be written as
$$\min_{a_u, b_u}\ \mathbb{E}_{X,Y}\left[Z(a_u, b_u)^+ - (1-u)Z(a_u, b_u)\right].$$
Define $P_u \triangleq Z(a_u, b_u)^+$ and $N_u \triangleq Z(a_u, b_u)^-$ as the positive and negative deviations from the true u-th quantile, and notice that (i) $P_u \ge 0$ and $N_u \ge 0$; (ii) $P_u - N_u = Z(a_u, b_u)$. Introducing $P_u$ and $N_u$ as slack variables, we can rewrite the above optimization problem as
$$\min_{a_u, b_u, P_u, N_u}\ \mathbb{E}_{X,Y}\left[P_u - (1-u)Z(a_u, b_u)\right] \quad\text{s.t.}\quad P_u \ge 0,\ N_u \ge 0,\quad P_u - N_u = Z(a_u, b_u)\ \text{with multiplier } [V_u].$$
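As a small numerical illustration of the starting point above, the following numpy sketch (function and variable names are ours, for illustration only) evaluates the pinball loss and verifies that the empirical u-th quantile minimizes its mean over a grid of candidate values:

```python
import numpy as np

def pinball_loss(z, u):
    """rho_u(z) = z^+ + (u - 1) z, i.e. u*z for z > 0 and (u - 1)*z for z <= 0."""
    return np.maximum(z, 0.0) + (u - 1.0) * z

# The u-th sample quantile minimizes the mean pinball loss over candidates q.
rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
u = 0.9
grid = np.linspace(-3.0, 3.0, 601)
q_star = grid[np.argmin([pinball_loss(y - q, u).mean() for q in grid])]
```

Here `q_star` coincides (up to the grid resolution) with `np.quantile(y, u)`, which is the one-dimensional special case that the rest of this appendix generalizes.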
Solving the above problem is equivalent to solving the Lagrangian formulation
$$\min_{P_u, N_u, a_u, b_u}\ \max_{V_u}\ \mathbb{E}_{X,Y}\left[P_u - (1-u)Z(a_u,b_u) - V_u\left(P_u - N_u - Z(a_u,b_u)\right)\right] \quad\text{s.t.}\quad P_u \ge 0,\ N_u \ge 0.$$
Substituting $Z(a_u,b_u) = Y - b_u^\top X - a_u$, we get
$$\min_{P_u, N_u, a_u, b_u}\ \max_{V_u}\ \mathbb{E}_{X,Y}\left[P_u - (1-u)\left(Y - b_u^\top X - a_u\right) - V_u\left(P_u - N_u - Y + b_u^\top X + a_u\right)\right] \quad\text{s.t.}\quad P_u \ge 0,\ N_u \ge 0.$$
In the first term, Y can be omitted because it is independent of the optimization variables, and in the second term Y can be separated; this yields
$$\max_{V_u}\ \mathbb{E}_{X,Y}[V_u Y] + \min_{P_u, N_u, a_u, b_u}\ \mathbb{E}_{X,Y}\left[P_u - (u-1)\left(b_u^\top X + a_u\right) - V_u\left(P_u - N_u + b_u^\top X + a_u\right)\right] \quad\text{s.t.}\quad P_u \ge 0,\ N_u \ge 0.$$
Rewriting the terms by grouping $a_u, b_u, P_u, N_u$ yields
$$\max_{V_u}\ \mathbb{E}_{X,Y}[V_u Y] + \min_{P_u, N_u, a_u, b_u}\ \mathbb{E}_{X,Y}\left[P_u(1 - V_u) + N_u V_u - a_u\left(V_u - (1-u)\right) - b_u^\top\left(V_u X - (1-u)X\right)\right] \quad\text{s.t.}\quad P_u \ge 0,\ N_u \ge 0.$$
By treating $P_u, N_u, a_u, b_u$ as Lagrange multipliers we can obtain the dual formulation. Since $P_u, N_u \ge 0$, they must correspond to inequality constraints, and since $a_u, b_u$ are unconstrained, they correspond to equality constraints. Thus,
$$\max_{V_u}\ \mathbb{E}_{X,Y}[V_u Y] \quad\text{s.t.}\quad V_u \ge 0\ [N_u],\quad V_u \le 1\ [P_u],\quad \mathbb{E}_{X,Y}[V_u] = 1-u\ [a_u],\quad \mathbb{E}_{X,Y}[V_u X] = (1-u)\,\mathbb{E}[X]\ [b_u].$$
Note that (i) feasibility dictates that we always have $P_u, N_u \ge 0$, and (ii) the last two constraints can be seen together as implying the mean-independence condition $\mathbb{E}[X|V] = \mathbb{E}[X]$. Solving for multiple quantiles simultaneously can be achieved by minimizing the sum of the above optimization problems for all $u \in [0,1]$. In addition, we demand monotonicity of the estimated quantiles, such that
$$u \ge u' \implies b_u^\top X + a_u \ge b_{u'}^\top X + a_{u'}.$$
According to the KKT conditions, the solution of the above problem must satisfy complementary slackness, which in this case means
$$(1 - V_u)P_u = 0,\qquad V_u N_u = 0.$$
Following the definitions of $P_u, N_u$, we have the following relation between the estimated quantile $b_u^\top X + a_u$ and $V_u$:
$$\begin{cases} Y > b_u^\top X + a_u \implies P_u > 0 \implies V_u = 1\\ Y < b_u^\top X + a_u \implies N_u > 0 \implies V_u = 0\\ Y = b_u^\top X + a_u \implies P_u = N_u = 0 \implies 0 \le V_u \le 1.\end{cases}$$
Since $P_{X,Y}$ is a continuous distribution, for any $a_u, b_u$ we have $\mathbb{P}\left[Y = b_u^\top X + a_u\right] = 0$. Thus, in practice we can ignore this case and write
$$V_u = \mathbb{I}\left[Y \ge b_u^\top X + a_u\right]. \tag{13}$$
Therefore, the monotonicity constraint translates to $V_u$ as
$$u \ge u' \implies b_u^\top X + a_u \ge b_{u'}^\top X + a_{u'} \iff V_u \le V_{u'}.$$
Adding the monotonicity constraint on $V_u$ gives rise to a new dual problem:
$$\max_{V_u}\ \int_0^1 \mathbb{E}_{X,Y}[V_u Y]\,du \quad\text{s.t.}\quad V_u \ge 0\ [N_u],\quad V_u \le 1\ [P_u],\quad \mathbb{E}_{X,Y}[V_u] = 1-u\ [a_u],\quad \mathbb{E}_{X,Y}[V_u X] = (1-u)\mathbb{E}[X]\ [b_u],\quad u \ge u' \implies V_u \le V_{u'}.$$
Let us now consider a dataset $\{(x_i, y_i)\}_{i=1}^N$ of i.i.d. samples from $P_{X,Y}$, where $x_i \in \mathbb{R}^k$ and $y_i \in \mathbb{R}$. We are interested in calculating T quantiles at levels $u_1 = 0 < u_2 < \dots < u_T \le 1$. Furthermore, denote the sample mean $\bar{x} \in \mathbb{R}^k$, where $\bar{x}_j = \frac{1}{N}\sum_{i=1}^N x_{i,j}$. Discretizing the above problem, we arrive at
$$\max_{V_{\tau,i}}\ \sum_{\tau=1}^T\sum_{i=1}^N V_{\tau,i}\,y_i \quad\text{s.t.}\quad V_{\tau,i} \ge 0\ [N_{\tau,i}],\quad V_{\tau,i} \le 1\ [P_{\tau,i}],\quad \frac{1}{N}\sum_{i=1}^N V_{\tau,i} = 1 - u_\tau\ [a_\tau],\quad \frac{1}{N}\sum_{i=1}^N V_{\tau,i}\,x_{i,j} = (1-u_\tau)\bar{x}_j\ [b_{\tau,j}],\quad V_{T,i} \le V_{T-1,i} \le \dots \le V_{1,i}.$$
Now we can vectorize the above formulation by defining the matrix $V \in \mathbb{R}^{T\times N}$ with elements $V_{\tau,i}$, the first-order finite-differences matrix
$$D = \begin{bmatrix} 1 & 0 & \cdots & 0\\ -1 & 1 & \ddots & \vdots\\ 0 & \ddots & \ddots & 0\\ 0 & \cdots & -1 & 1 \end{bmatrix} \in \mathbb{R}^{T\times T},$$
and the vector of desired quantile levels $u = [u_1, \dots, u_T]^\top$. The monotonicity constraint then becomes $v_i^\top D \ge 0$ for all $i = 1, \dots, N$, where $v_i$ is the i-th column of V. We also denote the covariates matrix $X \in \mathbb{R}^{N\times k}$ and the response vector $y \in \mathbb{R}^N$. Thus, the problem vectorizes as
$$\max_{V}\ \mathbf{1}_T^\top V y \quad\text{s.t.}\quad V_{\tau,i} \ge 0\ [N_{\tau,i}],\quad V_{\tau,i} \le 1\ [P_{\tau,i}],\quad \frac{1}{N}V\mathbf{1}_N = \mathbf{1}_T - u\ [a],\quad \frac{1}{N}VX = (\mathbf{1}_T - u)\,\bar{x}^\top\ [B],\quad V^\top D \ge 0,$$
where $a \in \mathbb{R}^T$ and $B \in \mathbb{R}^{T\times k}$ are the dual variables containing the regression coefficients per quantile level. We can make the following observations about the above problem:
1. For any random variable, its zeroth quantile is smaller than or equal to any value the variable takes. In particular, for $u_1 = 0$ we have $V_{1,i} = 1$ for all i, because from eq. (13) we know that $V_{1,i}$ is the indicator that $y_i$ is greater than the zeroth quantile of Y.
2. The last constraint, $V^\top D \ge 0$, which enforces non-increasing monotonicity along the quantile levels, i.e., $V_{\tau,i} \le V_{\tau',i}$ for all $\tau \ge \tau'$, also enforces $V_{T,i} \ge 0$.
Therefore, the first ($V_{\tau,i} \ge 0$), second ($V_{\tau,i} \le 1$) and last ($V^\top D \ge 0$) constraints are partially redundant and can be condensed into only two vectorized constraints: (1) $V^\top D \ge 0$, which ensures monotonicity and non-negativity of all elements of V; (2) $V^\top D\mathbf{1}_T \le \mathbf{1}_N$, which enforces $V_{1,i} \le 1$ for all i. Condensing the constraints in this manner comes at the cost of no longer being able to interpret the Lagrange multipliers $P_{\tau,i}, N_{\tau,i}$; however, the advantage is that we have fewer constraints in total, and the interpretability of the multipliers $a, B$ is maintained. Thus, we arrive at the following vectorized problem:
$$\max_{V}\ \mathbf{1}_T^\top V y \quad\text{s.t.}\quad \frac{1}{N}V\mathbf{1}_N = \mathbf{1}_T - u\ [a],\quad \frac{1}{N}VX = (\mathbf{1}_T - u)\,\bar{x}^\top\ [B],\quad V^\top D \ge 0,\quad V^\top D\mathbf{1}_T \le \mathbf{1}_N.$$
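The condensed constraints above can be checked numerically on a toy instance. The following numpy sketch (names are ours) builds the indicator matrix V from eq. (13) for increasing thresholds and verifies both vectorized constraints:

```python
import numpy as np

T, N = 5, 100
rng = np.random.default_rng(0)
y = rng.normal(size=N)
q = np.quantile(y, np.linspace(0.0, 0.8, T))   # non-decreasing thresholds; q[0] = min(y)
V = (y[None, :] >= q[:, None]).astype(float)   # V[tau, i] = 1{y_i >= q_tau}

# First-order finite-differences matrix: +1 on the diagonal, -1 on the subdiagonal.
D = np.eye(T) - np.diag(np.ones(T - 1), -1)

mono = V.T @ D                 # rows non-increasing in tau  =>  V^T D >= 0
bound = mono @ np.ones(T)      # telescopes to V[0, :]       =>  V^T D 1_T <= 1_N
```

Since the rows of V are non-increasing in the quantile-level index, every entry of `mono` is non-negative, and `bound` recovers the first row of V, which is at most one.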

A.2 CORRELATION MAXIMIZATION AS OPTIMAL TRANSPORT

Following Carlier et al. (2016), the above problem can be re-formulated as an optimal transport (OT) problem, i.e., the problem of finding a mapping between two probability distributions. Assume we are now interested in estimating the quantiles of Y at T uniformly-spaced levels, $[u_1, \dots, u_T] = \left[\frac{1}{T}, \frac{2}{T}, \dots, \frac{T}{T}\right]$. In the above problem, one can decompose the objective as
$$\mathbf{1}_T^\top V y = \mathbf{1}_T^\top D^{-\top} D^\top V y = \left(D^{-1}\mathbf{1}_T\right)^\top D^\top V y.$$
If we then denote
$$\Pi = \frac{1}{N}D^\top V \in \mathbb{R}^{T\times N},\qquad u = \frac{1}{T}D^{-1}\mathbf{1}_T = \left[\frac{1}{T}, \frac{2}{T}, \dots, \frac{T}{T}\right]^\top \in \mathbb{R}^T,$$
then we can write the objective as $NT \cdot u^\top \Pi y$. In addition, denote
$$\mu = D^\top(\mathbf{1}_T - u) = \frac{1}{T}\mathbf{1}_T \in \mathbb{R}^T,\qquad \nu = \frac{1}{N}\mathbf{1}_N \in \mathbb{R}^N,$$
which represent (respectively) the empirical probability measures of the quantile levels u and of the data points (X, y) (we choose both measures to be uniform). Now, by using the decomposed objective, and by multiplying the first two constraints by $D^\top$, we obtain the following equivalent problem:
$$\max_{\Pi \ge 0}\ u^\top \Pi y \quad\text{s.t.}\quad \Pi\mathbf{1}_N = \mu = \frac{1}{T}\mathbf{1}_T\ [D^{-\top}a],\quad \Pi X = \mu\nu^\top X = \frac{1}{T}\mathbf{1}_T\bar{x}^\top\ [D^{-\top}B],\quad \mathbf{1}_T^\top\Pi \le \nu^\top = \frac{1}{N}\mathbf{1}_N^\top. \tag{14}$$
Note that (i) we ignore normalization constants where they do not affect the solution; (ii) the Lagrange multipliers are scaled by $D^{-\top}$ since the constraints were scaled by $D^\top$. The interpretation of this formulation is that we seek a transport plan $\Pi$ between the measure of the quantile levels U and that of the target Y|X, subject to constraints on the plan which ensure that its marginals are the empirical measures $\mu$ and $\nu$, and that the mean-independence condition $\mathbb{E}[X|U] = \mathbb{E}[X]$ holds. Each individual entry $\Pi_{i,j}$ in this discrete plan is the probability mass attached to $(u_i, x_j, y_j)$ in this optimal joint distribution.
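The algebraic identity behind this reformulation, $\mathbf{1}_T^\top V y = NT \cdot u^\top \Pi y$, can be verified directly (a small numpy sketch with names of our choosing):

```python
import numpy as np

T, N = 6, 50
rng = np.random.default_rng(1)
y = rng.normal(size=N)
V = np.sort(rng.uniform(size=(T, N)), axis=0)[::-1]  # rows non-increasing in tau
D = np.eye(T) - np.diag(np.ones(T - 1), -1)          # finite differences

Pi = D.T @ V / N                        # transport plan Pi = (1/N) D^T V
u = np.linalg.solve(D, np.ones(T)) / T  # u = (1/T) D^{-1} 1_T = [1/T, ..., T/T]

lhs = np.ones(T) @ V @ y                # original objective 1_T^T V y
rhs = N * T * (u @ Pi @ y)              # OT objective, rescaled
```

Both sides agree because $D^{-\top}D^\top$ cancels inside the objective, so the change of variables is exact.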

A.3 EXTENDING THE OPTIMAL TRANSPORT FORMULATION TO VECTOR QUANTILES

We now wish to deal with the case where the target variable can be a vector. Observe that in the scalar case, we can write the OT objective as
$$u^\top \Pi y = \sum_{i=1}^{T}\sum_{j=1}^{N}\Pi_{i,j}\,u_i\,y_j = \langle \Pi, S\rangle,$$
where $S \in \mathbb{R}^{T\times N}$ is the matrix of pairwise products, $S_{i,j} = u_i y_j$, and $\langle\cdot,\cdot\rangle$ denotes the sum of the Hadamard (elementwise) product. Now let us assume that $y_j \in \mathbb{R}^d$ for any $d \ge 1$, so that our target data is now arranged as $Y \in \mathbb{R}^{N\times d}$. The quantile levels must now also be d-dimensional, since we have a quantile-level dimension for each target dimension. We choose a uniform grid on $[0,1]^d$ on which to compute our vector quantiles. Along each dimension of this grid we sample T equally-spaced points, giving us in total $T^d$ points in d dimensions. To keep the formulation of the optimization problem two-dimensional, we arrange the coordinates of these points as the matrix $U \in \mathbb{R}^{T^d\times d}$. Thus, we can naturally extend the pairwise-product matrix S to the multi-dimensional case using a d-dimensional inner product between each point on the quantile-level grid U and each target point in Y. This yields the simple form $S = UY^\top \in \mathbb{R}^{T^d\times N}$, which can be plugged into the above formulation (14) and solved directly. Thus we obtain eq. (5).

A.4 RELAXING THE EXACT OPTIMAL TRANSPORT DUAL FORMULATION

Recall the exact dual formulation of the OT-based primal VQR problem (5):
$$\min_{\psi,\varphi,\beta}\ \psi^\top\nu + \varphi^\top\mu + \mathrm{tr}\left(\beta^\top\bar{X}\right) \quad\text{s.t.}\quad \forall i,j:\ \varphi_i + \beta_i^\top x_j + \psi_j \ge u_i^\top y_j.$$
The above problem has a unique solution (Carlier et al., 2016). Thus, first-order optimality conditions for each $\varphi_i$ yield
$$\varphi_i = \max_j\left\{u_i^\top y_j - \beta_i^\top x_j - \psi_j\right\}.$$
Substituting the optimal $\varphi_i$ into the dual formulation results in an unconstrained but exact min-max problem:
$$\min_{\psi,\beta}\ \psi^\top\nu + \mathrm{tr}\left(\beta^\top\bar{X}\right) + \sum_{i=1}^{T^d}\mu_i\max_j\left\{u_i^\top y_j - \beta_i^\top x_j - \psi_j\right\}. \tag{15}$$
We can relax this problem by using a smooth approximation of the max operator, given by
$$\max_j(x_j) \approx \varepsilon\log\sum_j\exp\left(\frac{x_j}{\varepsilon}\right).$$
Plugging the smooth approximation into eq. (15) yields the relaxed dual in eq. (7).
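The smooth max (log-sum-exp) approximation behaves as described: it upper-bounds the true max and tightens as $\varepsilon \to 0$. A short numpy sketch (the stable formulation and names are ours):

```python
import numpy as np

def smooth_max(x, eps):
    """eps * log(sum(exp(x / eps))): a smooth upper approximation of max(x)."""
    m = x.max()  # subtract the max for numerical stability
    return m + eps * np.log(np.exp((x - m) / eps).sum())

x = np.array([0.3, -1.2, 2.5, 0.9])
approx = [smooth_max(x, eps) for eps in (1.0, 0.1, 0.01)]
```

For any $\varepsilon > 0$ the approximation is at least $\max_j x_j$ (since the sum dominates its largest term), and the gap is at most $\varepsilon\log n$; this is why decreasing $\varepsilon$ makes the relaxed problem more exact, at the cost of a harder optimization landscape.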

A.5 EQUIVALENCE BETWEEN REGULARIZED PRIMAL AND RELAXED DUAL

Adding an entropic regularization term to the OT-based primal formulation of the VQR problem (5), and converting it into a minimization problem, yields
$$\min_{\Pi}\ \langle\Pi, -S\rangle + \varepsilon\langle\Pi, \log\Pi\rangle \quad\text{s.t.}\quad \Pi^\top\mathbf{1}_{T^d} = \nu\ [-\psi],\quad \Pi\mathbf{1}_N = \mu\ [-\varphi],\quad \Pi X = \bar{X}\ [-\beta], \tag{16}$$
where $\nu \in \mathbb{R}^N$, $\mu \in \mathbb{R}^{T^d}$, and $\bar{X} \in \mathbb{R}^{T^d\times k}$ are defined as in section 3. Note that the entropic regularization term can be interpreted as minimizing the KL-divergence between the transport plan $\Pi$ and the product of the marginals, i.e., $D_{\mathrm{KL}}\left(\Pi\,\|\,\mu\nu^\top\right)$. In order to show the equivalence between this regularized problem and the relaxed dual (7), the key idea is to use the Fenchel-Rockafellar duality theorem (Rockafellar, 1974). This allows us to write the dual of an optimization problem in terms of the convex conjugates of its objective terms. To apply this approach, we reformulate eq. (16) into an unconstrained problem of the following form:
$$\min_{W\in\mathcal{W}}\ f^*(A^*W) + g^*(W) = \max_{V\in\mathcal{V}}\ -f(-V) - g(AV),$$
where we define a pair of operators $A : \mathcal{V} \to \mathcal{W}$ and $A^* : \mathcal{W}^* \to \mathcal{V}^*$ adjoint to each other; $f : \mathcal{V} \to \mathbb{R}$, $f^* : \mathcal{V}^* \to \mathbb{R}$ and $g : \mathcal{W} \to \mathbb{R}$, $g^* : \mathcal{W}^* \to \mathbb{R}$ are pairs of convex conjugate functions. In our problem $\mathcal{W} = \mathcal{W}^*$ and $\mathcal{V} = \mathcal{V}^*$ are the vector spaces $\mathcal{W} = \mathbb{R}^{T^d\times N}$ and $\mathcal{V} = \mathbb{R}^N \times \mathbb{R}^{T^d} \times \mathbb{R}^{T^d\times k}$. We define the operator
$$A^*\Pi = \left(\Pi^\top\mathbf{1}_{T^d},\ \Pi\mathbf{1}_N,\ \Pi X\right)$$
and an indicator function
$$i_a(z) = \begin{cases} 0, & z = a,\\ \infty, & z \ne a.\end{cases}$$
We can now represent eq. (16) as an unconstrained problem:
$$\min_{\Pi}\ \langle\Pi, -S\rangle + \varepsilon\langle\Pi, \log\Pi\rangle + i_{(\nu,\mu,\bar{X})}(A^*\Pi). \tag{17}$$
To derive A, the adjoint operator of $A^*$, we can write
$$\left\langle(\psi,\varphi,\beta), A^*\Pi\right\rangle_{\mathcal{V}} = \langle\psi, \Pi^\top\mathbf{1}_{T^d}\rangle + \langle\varphi, \Pi\mathbf{1}_N\rangle + \langle\beta, \Pi X\rangle = \langle\mathbf{1}_{T^d}\psi^\top, \Pi\rangle + \langle\varphi\mathbf{1}_N^\top, \Pi\rangle + \langle\beta X^\top, \Pi\rangle = \left\langle A(\psi,\varphi,\beta), \Pi\right\rangle_{\mathcal{W}}.$$
Thus, $A(\psi,\varphi,\beta) = \mathbf{1}_{T^d}\psi^\top + \varphi\mathbf{1}_N^\top + \beta X^\top$. We then define the functions
$$f^*(A^*\Pi) = i_{(\nu,\mu,\bar{X})}(A^*\Pi),\qquad g^*(\Pi) = \langle\Pi, -S\rangle + \varepsilon\langle\Pi, \log\Pi\rangle,$$
and their corresponding convex conjugates are therefore given by
$$f(\psi,\varphi,\beta) = \left\langle(\psi,\varphi,\beta), (\nu,\mu,\bar{X})\right\rangle_{\mathcal{V}} = \langle\psi,\nu\rangle + \langle\varphi,\mu\rangle + \langle\beta,\bar{X}\rangle,\qquad g(W) = \varepsilon\sum_{ij}\exp\left(\frac{W_{ij} + S_{ij}}{\varepsilon} - 1\right).$$
Using $f^*$ and $g^*$, we can write (17) as $\min_W\{f^*(A^*W) + g^*(W)\}$; then, by the Fenchel-Rockafellar duality theorem, we get the equivalent dual form $\max_V\{-f(-V) - g(AV)\}$. Substituting $V = (-\psi, -\varphi, -\beta)$ (the dual variables of eq. (16)), converting to a minimization problem, and omitting constant factors in the objective, we get
$$\min_{\psi,\varphi,\beta}\ \psi^\top\nu + \varphi^\top\mu + \mathrm{tr}\left(\beta^\top\bar{X}\right) + \varepsilon\sum_{i=1}^{T^d}\sum_{j=1}^{N}\exp\left(\frac{1}{\varepsilon}\left(S_{ij} - \psi_j - \varphi_i - \beta_i^\top x_j\right)\right). \tag{18}$$
Now we write $\varphi$ in terms of $\psi, \beta$ by using a first-order optimality condition of the above problem. Taking the derivative with respect to $\varphi_i$ and setting it to zero, we have:
$$0 = \mu_i - \exp\left(-\frac{\varphi_i}{\varepsilon}\right)\sum_{j=1}^{N}\exp\left(\frac{1}{\varepsilon}\left(S_{ij} - \psi_j - \beta_i^\top x_j\right)\right) \tag{19}$$
$$\exp\left(-\frac{\varphi_i}{\varepsilon}\right) = \frac{\mu_i}{\sum_j\exp\left(\frac{1}{\varepsilon}\left(S_{ij} - \psi_j - \beta_i^\top x_j\right)\right)} \quad\Longrightarrow\quad \varphi_i = \varepsilon\log\left(\frac{1}{\mu_i}\sum_j\exp\left(\frac{1}{\varepsilon}\left(S_{ij} - \psi_j - \beta_i^\top x_j\right)\right)\right). \tag{20}$$
Finally, substituting eqs. (19) and (20) into eq. (18) and omitting constant terms yields
$$\min_{\psi,\beta}\ \psi^\top\nu + \mathrm{tr}\left(\beta^\top\bar{X}\right) + \varepsilon\sum_{i=1}^{T^d}\mu_i\log\left(\sum_{j=1}^{N}\exp\left(\frac{1}{\varepsilon}\left(S_{ij} - \beta_i^\top x_j - \psi_j\right)\right)\right), \tag{21}$$
where $S_{ij} = u_i^\top y_j$. Thus, we obtain that eq. (21) is equal to eq. (7). In summary, we have shown that the relaxed dual formulation of the VQR problem that we solve, eq. (7), is equivalent to an entropic-regularized version of the OT-based primal formulation, eq. (5).
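To make eq. (21) concrete, the following numpy sketch minimizes the relaxed dual by plain gradient descent for unconditional vector quantile *estimation* (no covariates, so the $\beta$ term is dropped); the helper names, step size, and problem sizes are ours, chosen only for illustration:

```python
import numpy as np

def lse(A, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.exp(A - m).sum(axis=axis))

rng = np.random.default_rng(0)
T, N, eps = 10, 200, 0.1
y = rng.normal(size=(N, 1))
u = (np.arange(1, T + 1) / T).reshape(T, 1)
S = u @ y.T                                   # S_ij = u_i * y_j
mu, nu = np.full(T, 1 / T), np.full(N, 1 / N)

def objective(psi):
    return psi @ nu + eps * (mu @ lse((S - psi[None, :]) / eps, axis=1))

def gradient(psi):
    A = (S - psi[None, :]) / eps
    P = np.exp(A - lse(A, axis=1)[:, None])   # row-wise softmax
    return nu - mu @ P

psi = np.zeros(N)
history = [objective(psi)]
for _ in range(100):
    psi -= 0.1 * gradient(psi)                # conservative step for smooth convex f
    history.append(objective(psi))
```

Since the objective is smooth and convex in $\psi$, gradient descent with a small enough step decreases it monotonically, consistent with the convergence discussion in Appendix B.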

A.6 EXTRACTING THE VECTOR QUANTILE REGRESSION COEFFICIENTS

The dual variables obtained from the OT formulation (eq. (4)) are $\varphi \in \mathbb{R}^{T^d}$ and $\beta \in \mathbb{R}^{T^d\times k}$. In the case of scalar quantiles ($d = 1$), we can obtain the conditional quantile function from the dual variables by applying the $T\times T$ finite-differences matrix D defined in appendix A.1:
$$\widehat{Q}_{Y|X}(u; x) = \left[D(\beta x + \varphi)\right]_u.$$
This effectively takes the u-th component of the first-order discrete derivative, where u is one of the T discrete quantile levels. In the vector case, we have $T^d$ discrete d-dimensional quantile levels. Equivalently, the relation between the dual variables and the quantile function is then
$$\widehat{Q}_{Y|X}(u; x) = \left[\nabla_u\{\beta x + \varphi\}\right]_u.$$
Here $\beta x + \varphi$ is in $\mathbb{R}^{T^d}$ and its gradient with respect to u, $\nabla_u\{\beta x + \varphi\}$, is in $\mathbb{R}^{T^d\times d}$. We then evaluate this gradient at one of the discrete levels u to obtain the d-dimensional quantile. As explained in section 3, the expression $\beta x + \varphi$ is convex in u, and thus the co-monotonicity of the estimated CVQF is obtained by virtue of it being the gradient of a convex function.
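The scalar-case extraction is just a discrete differentiation, which can be sketched in numpy (the potential vector `b` stands in for $\beta x + \varphi$ and is constructed here, hypothetically, as the discrete antiderivative of known quantiles):

```python
import numpy as np

T = 8
D = np.eye(T) - np.diag(np.ones(T - 1), -1)   # first-order finite differences

# Hypothetical dual potential b = beta @ x + phi at T discrete levels: if the
# estimate is exact, b is the cumulative sum (discrete antiderivative) of the
# true quantiles, so applying D recovers them.
q_true = np.sort(np.random.default_rng(4).normal(size=T))  # monotone quantiles
b = np.cumsum(q_true)                          # b = D^{-1} q

Q = D @ b                                      # Q_{Y|X}(u_tau; x) = [D b]_tau
```

Because `b` is (discretely) convex whenever its differences are non-decreasing, the recovered `Q` is monotone in the quantile level, mirroring the co-monotonicity argument above.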

B CONVERGENCE OF VQR

Below we provide a few comments regarding the optimization and approximation error of VQR. Linear VQR. The relaxed dual formulation of VQR, presented in eq. (6), is an unconstrained smooth convex minimization problem. Thus, in this case, gradient descent is guaranteed to converge to the optimal solution up to any desired precision. Moreover, given fixed quantile levels, our objective can be written as the minimization of an expectation under the data distribution $P_{X,Y}$. Under these criteria, performing SGD by sampling i.i.d. from the data distribution is known to converge to the global minimum (Hazan, 2019, Chapter 3). We refer to Section 5.4 in (Peyré & Cuturi, 2019) (and references therein) for further analysis and details regarding the convergence rates of SGD and other stochastic optimization methods applied to this problem. Specifically, stochastic averaged gradient (SAG) was shown to have improved convergence rates compared to SGD (Genevay et al., 2016). However, in practice we find that SGD converges well for this problem (see Figure 2, Appendix fig. A10a), and we use it here for simplicity. Nonlinear VQR. In the case of nonlinear VQR, as formulated in eq. (8), the optimization is over a non-convex objective, and thus convergence to a global minimum is not guaranteed. Convergence analyses for non-convex objectives only provide guarantees for weak forms of convergence. For example, the analysis methodology introduced by Ghadimi & Lan (2013) can be used to show that, under the assumption of uniformly bounded gradient estimates, the norm of the gradient of the loss function decreases on average as $O(t^{-1/2})$ as $t \to \infty$, where t is the iteration. This is a weak form of convergence that does not guarantee convergence to a fixed point, not even to a local minimum. Approximation error as $\varepsilon \to 0$. The nonlinear VQR formulation (8) produces an estimate $\widehat{Q}^\varepsilon_{Y|X}$. By decreasing $\varepsilon$, one can make the problem more exact, as we have shown in practice (fig. A10b). Following the approach of Proposition A.1 in Genevay et al. (2016), it can be shown that this estimate approaches the true CVQF as $\varepsilon \to 0$.

C CONDITIONAL VECTOR QUANTILE FUNCTIONS C.1 INTUITIONS ABOUT VECTOR QUANTILES

To interpret the meaning of CVQFs, let us consider the 2-dimensional case, where we assume $Y = (Y_1, Y_2)^\top$. Given a specific covariates vector $x = (x_1, \dots, x_k)^\top$ and level $u = (u_1, u_2)^\top$, we may write the components of the conditional vector quantile function as
$$Q_{Y|X}(u; x) = \begin{bmatrix} Q_1(u; x)\\ Q_2(u; x)\end{bmatrix} = \begin{bmatrix} Q_{Y_1|Y_2,X}\left(u_1;\ Q_{Y_2|X}(u_2; x),\ x\right)\\ Q_{Y_2|Y_1,X}\left(u_2;\ Q_{Y_1|X}(u_1; x),\ x\right)\end{bmatrix},$$
where $Q_{Y_i|Y_j,X}(u; y, x)$ denotes the scalar quantile function of the random variable $Y_i$ at level u, given $Y_j = y$ and $X = x$. Thus, for example, the first component $Q_1(u; x)$ is a 2D surface: moving along $u_1$ for a fixed $u_2$ yields a scalar, non-decreasing function representing the quantiles of $Y_1$ when $Y_2$ is at the value corresponding to level $u_2$ (see Figure A6b). In addition, the vector quantile function is co-monotonic with u in the sense defined by eq. (1). In higher dimensions the picture becomes more involved, as the conditioning in each component is on the vector quantile of all the remaining components of the target Y (Figure A6c).
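The nested structure above can be probed by Monte Carlo on samples (a numpy sketch for the unconditional 2D case, i.e. a VQF rather than a CVQF; the function name, the bandwidth `eps`, and the conditioning-by-selection trick are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.7
Y = rng.multivariate_normal([0.0, 0.0], [[1, rho], [rho, 1]], size=200_000)

def Q1(u1, u2, eps=0.05):
    """Monte-Carlo sketch of Q_1(u) = Q_{Y1|Y2}(u1; Q_{Y2}(u2))."""
    v = np.quantile(Y[:, 1], u2)        # Q_{Y2}(u2)
    sel = np.abs(Y[:, 1] - v) < eps     # condition on Y2 approximately equal to v
    return np.quantile(Y[sel, 0], u1)

# Co-monotonicity: Q_1 is non-decreasing in u1 for a fixed u2.
vals = [Q1(u1, 0.5) for u1 in (0.1, 0.3, 0.5, 0.7, 0.9)]
```

Sweeping `u1` for fixed `u2` traces exactly the monotone curve described in the text; the selection window is a crude surrogate for true conditioning and only serves to build intuition.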

C.2 CONDITIONAL QUANTILE CONTOURS

A useful property of CVQFs is that they allow a natural extension of α-confidence intervals to high-dimensional distributions, which we denote as α-contours. Vector quantile contours. Formally, we define the conditional contour of Y|X = x at confidence level α as
$$\mathcal{Q}^{\alpha}_{Y|X}(x) = \left\{Q_{Y|X}(u; x)\ \middle|\ u \in U_\alpha\right\},\qquad U_\alpha = \bigcup_{i=1}^{d}U^i_\alpha,\qquad U^i_\alpha = \left\{(u_1, \dots, u_d)\ \middle|\ u_i \in [\alpha, 1-\alpha],\ u_{-i} \in \{\alpha, 1-\alpha\}\right\}, \tag{22}$$
where $u_{-i}$ denotes any component of $u = (u_1, \dots, u_d)^\top$ other than $u_i$. Note that the contour can be calculated not only using the true CVQF, $Q_{Y|X}(u; x)$, as above, but also using an estimated CVQF, $\widehat{Q}_{Y|X}(u; x)$. For simplicity, consider the 2d case. By fixing, e.g., $u_1$ to be one of $\{\alpha, 1-\alpha\}$ and then sweeping over $u_2$ (and vice versa), we obtain a set of vector quantile levels $U_\alpha$ corresponding to the confidence level α (see fig. 1a, left and right). Separable quantile contours. In contrast to vector quantile contours obtained by VQR, which can accurately model distributions with arbitrary shapes, using separable quantiles produces trivial box-shaped contours. Consider a CVQF estimated with separable quantiles, as presented in eq. (10). Because each component of $Q^{\mathrm{Sep}}_{Y|X}(u; x)$ depends only on the corresponding component of u, the resulting quantile contour is always box-shaped (see fig. A7, middle). Such trivial contours result in inferior performance in uncertainty-estimation applications, where a confidence region with the smallest possible area for a given confidence level is desired.

D.1 SYNTHETIC DATASETS

MVN. Data was generated from a linear model $y = Ax + \eta$, where $x \sim U[0,1]^k$, $A \in \mathbb{R}^{d\times k}$ is a random projection matrix, and $\eta$ is multivariate Gaussian noise with a random covariance matrix. Conditional-banana. This dataset was introduced by Feldman et al. (2021), and was inspired by a similar dataset in Carlier et al. (2017).
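The level set $U_\alpha$ from eq. (22) can be enumerated directly on the $T^d$ quantile-level grid. A numpy sketch (function name ours), which for $d = 2$ selects the grid points on the boundary of the square $[\alpha, 1-\alpha]^2$:

```python
import numpy as np

def contour_levels(T, alpha, d=2):
    """Grid points u in U_alpha: u_i sweeps [alpha, 1-alpha] while u_{-i} sit on {alpha, 1-alpha}."""
    pts = np.arange(1, T + 1) / T
    grid = np.stack(np.meshgrid(*([pts] * d), indexing="ij"), axis=-1).reshape(-1, d)
    in_band = (grid >= alpha) & (grid <= 1 - alpha)
    on_edge = np.isclose(grid, alpha) | np.isclose(grid, 1 - alpha)
    keep = np.zeros(len(grid), dtype=bool)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        keep |= in_band[:, i] & np.all(on_edge[:, others], axis=1)
    return grid[keep]

U_alpha = contour_levels(T=10, alpha=0.2, d=2)
```

Mapping these levels through a CVQF (true or estimated) yields the corresponding α-contour in the target domain.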
The target variable $Y \in \mathbb{R}^2$ has a banana-shaped conditional distribution, and its shape changes non-trivially when conditioned on a continuous-valued covariate $X \in \mathbb{R}$ (see the left-most column of the panel presented in Figure A13). The data-generating process is defined as follows:
$$X \sim U[0.8, 3.2],\quad Z \sim U[-\pi, \pi],\quad \phi \sim U[0, 2\pi],\quad r \sim U[-0.1, 0.1],\quad \beta \sim U[0,1]^k,\quad \bar\beta = \beta/\|\beta\|_1,$$
$$Y_0 = \tfrac{1}{2}\left(-\cos(Z) + 1\right) + r\sin(\phi) + \sin(X),\qquad Y_1 = Z\,\bar\beta X + r\cos(\phi),$$
and then $Y = [Y_0, Y_1]^\top$. Synthetic glasses. This dataset was introduced by Brando et al. (2022). The target variable $Y \in \mathbb{R}$ has a bimodal conditional distribution. The mode locations shift periodically when conditioned on a continuous-valued covariate $X \sim U[0,1]$ (fig. A12a). The data-generating process is defined as
$$z_1 = 3\pi X,\quad z_2 = \pi(1 + 3X),\quad \epsilon \sim \mathrm{Beta}(\alpha = 0.5, \beta = 1),\quad \gamma \sim \mathrm{Categorical}(\{0, 1\}),$$
$$Y_1 = 5\sin(z_1) + 2.5 + \epsilon,\qquad Y_2 = 5\sin(z_2) + 2.5 - \epsilon,\qquad Y = \gamma Y_1 + (1 - \gamma)Y_2.$$
Rotating star. In this dataset, the target variable $Y \in \mathbb{R}^2$ has a star-shaped conditional distribution, and its shape is rotated by x degrees when conditioned on a discrete-valued covariate X taking values in $\{0, 10, 20, \dots, 60\}$. Data is generated based on a 600 × 600 binary image of a star. See the first column in Figure A14 for a visualization of the conditional distributions as a function of X. Since the conditional distributions differ only by a rotation, $\mathbb{E}[Y|X]$ remains the same for all X. However, the shape of the distribution changes substantially with X, especially in the tails. Thus, this dataset is a challenging candidate for estimating vector quantiles, which must also represent these tails in order to properly recover the shape of the conditional distribution.
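A sampler for the conditional-banana process, following the equations stated above, can be sketched as follows (the function name is ours; for scalar X with k = 1, the normalized $\bar\beta$ reduces to 1):

```python
import numpy as np

def sample_conditional_banana(n, k=1, rng=None):
    """Draw n samples (X, Y) from the conditional-banana generative process."""
    rng = rng or np.random.default_rng()
    X = rng.uniform(0.8, 3.2, size=n)
    Z = rng.uniform(-np.pi, np.pi, size=n)
    phi = rng.uniform(0, 2 * np.pi, size=n)
    r = rng.uniform(-0.1, 0.1, size=n)
    beta = rng.uniform(0, 1, size=k)
    beta_bar = beta / np.abs(beta).sum()       # L1-normalized coefficients
    Y0 = 0.5 * (-np.cos(Z) + 1) + r * np.sin(phi) + np.sin(X)
    Y1 = Z * beta_bar.sum() * X + r * np.cos(phi)  # for k = 1 this is Z * X + noise
    return X, np.stack([Y0, Y1], axis=1)

X, Y = sample_conditional_banana(1000)
```

Plotting `Y` for narrow slices of `X` reproduces the banana shape whose curvature and scale vary with the covariate.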

D.2 REAL DATASETS

We perform real-data experiments on the blog_data, bio, house, and meps_20 datasets obtained from (Feldman et al., 2021). The original real datasets contain one-dimensional targets. Feldman et al. (2021) constructed an additional target variable by selecting a feature that has high correlation with the first target variable and small correlation with the other input features, so that it is hard to predict. A summary of these datasets is presented in Table A1.


E.1 METRICS FOR SYNTHETIC DATA

KDE-L1. This metric is computed as $\frac{1}{L}\sum_{l=1}^{L}\left\|\hat{f}^*_{Y|X=x_l} - \hat{f}_{Y|X=x_l}\right\|_1$, where $\hat{f}^*$ and $\hat{f}$ are kernel density estimates of the ground-truth and estimated conditional distributions, respectively. The KDEs are calculated with 100 bins per dimension. An isotropic Gaussian kernel was used, with σ = 0.1 for the conditional-banana dataset and σ = 0.035 for the star dataset. We used pykeops (Charlier et al., 2021) for a fast implementation of high-dimensional KDEs. We used L = 20 and M = 4000.

Quantile function distance (QFD). This metric measures the distance between a true CVQF, $Q^*_{Y|X}$, and an estimate of it obtained by VQR, $\widehat{Q}_{Y|X}$:
1. Sample L evaluation covariate vectors $\{x_l\}_{l=1}^L$ at random from the ground-truth distribution $P_X$.
2. For each $x_l$:
(a) Sample M points $\{y_{m,l}\}_{m=1}^M$ from the ground-truth conditional distribution $P_{Y|X=x_l}$.
(b) Estimate an unconditional vector quantile function on $\{y_{m,l}\}_{m=1}^M$, i.e., perform vector quantile estimation, not regression. Denote the estimated unconditional vector quantile function as $Q^*_{Y|X=x_l}$. This serves as a proxy for the ground-truth conditional quantile function.
(c) Denote the estimated conditional quantile function evaluated at $x_l$ as $\widehat{Q}_{Y|X=x_l}$.
(d) Compute the normalized difference between them elementwise over each of the $T^d$ discrete quantile levels, i.e., $d_l = \left\|Q^*_{Y|X=x_l} - \widehat{Q}_{Y|X=x_l}\right\|_2 / \left\|Q^*_{Y|X=x_l}\right\|_2$.
3. Calculate the QFD metric as $\frac{1}{L}\sum_{l=1}^{L}d_l$.
We used L = 20 and M = 4000.

Percentage of co-monotonicity violations (MV). This value can be measured directly. Given an estimated vector quantile function $\widehat{Q}(u)$ with T levels per dimension, there are in total $T^{2d}$ quantile-level pairs. Thus, we measure
$$\frac{1}{T^{2d}}\sum_{i,j}^{T^d}\mathbb{I}\left[(u_i - u_j)^\top\left(\widehat{Q}(u_i) - \widehat{Q}(u_j)\right) < 0\right].$$
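The MV metric is straightforward to compute over all pairs of levels. A numpy sketch (function name ours), which also demonstrates that a map which is the gradient of a convex function incurs zero violations:

```python
import numpy as np

def monotonicity_violations(U, Q):
    """Fraction of level pairs (i, j) with (u_i - u_j)^T (Q(u_i) - Q(u_j)) < 0.

    U: (n_levels, d) quantile levels; Q: (n_levels, d) vector quantiles at those levels."""
    dU = U[:, None, :] - U[None, :, :]
    dQ = Q[:, None, :] - Q[None, :, :]
    inner = np.einsum("ijk,ijk->ij", dU, dQ)   # pairwise inner products
    return np.mean(inner < 0)

U = np.random.default_rng(6).uniform(size=(50, 2))
mv_good = monotonicity_violations(U, 2 * U)    # Q = grad of the convex map ||u||^2
mv_bad = monotonicity_violations(U, -U)        # anti-monotone map
```

For `Q = 2 * U` every pairwise inner product equals $2\|u_i - u_j\|^2 \ge 0$, so the metric is exactly zero; negating the map violates co-monotonicity for every distinct pair.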

E.2 METRICS FOR REAL DATA

With real data, only finite samples from the ground-truth distribution are available. In particular, since our real datasets have a continuous X, we never have more than one sample from each conditional distribution Y|X = x. Strong representation can therefore not be quantified directly as with the above metrics. Instead, we opt for metrics which can be evaluated on finite samples and are specifically suitable for our chosen application of distribution-free uncertainty estimation (section 6.3).

Size of a conditional α-confidence set (AC). This metric approximates $|\mathcal{C}_\alpha(x)|$, i.e., the size of an α-confidence set constructed for a specific covariate x. Lower values of this metric are better, as they indicate that the confidence set is a better fit for the shape of the data distribution: intuitively, the same proportion of the data distribution is represented by a smaller region. The metric is computed as follows.
1. Estimate a CVQF $\widehat{Q}_{Y|X}(u; x)$, e.g., via VQR (7), NL-VQR (8), or separable quantiles (10).
2. Choose a test covariate, denoted $x_T$.
3. Construct the corresponding conditional α-contour, $\widehat{\mathcal{Q}}^\alpha_{Y|X}(x_T)$, as defined by eq. (22), using the estimate $\widehat{Q}_{Y|X}(u; x_T)$.
4. Construct the corresponding conditional α-confidence set $\mathcal{C}_\alpha(x_T)$ from the contour by calculating the convex hull of the points within the contour.
5. Calculate the value of the metric as the volume of the resulting convex hull (area for d = 2).
We note that using a convex hull is only a linear approximation of the true $\mathcal{C}_\alpha(x)$ defined by the points in the contour. For reasonable values of T, we find it to be a good approximation, and it is an upper bound on the area/volume of the true $\mathcal{C}_\alpha(x)$. In practice we use scipy with qhull (Virtanen et al., 2020; Barber et al., 1996) to construct d-dimensional convex hulls and measure their volume.

Marginal Coverage (MC).
This metric measures the proportion of unseen data points contained within the conditional α-confidence sets, $\mathcal{C}_\alpha(x)$, obtained from a given estimated CVQF. It is computed as follows.
1. Estimate a CVQF $\widehat{Q}_{Y|X}(u; x)$ using one of the aforementioned methods on a training set sampled from $P_{X,Y}$.
2. Denote a disjoint held-out test set $\{(x_T^j, y_T^j)\}_{j=1}^{N_T} \sim P_{X,Y}$.
3. Measure the marginal coverage as $\frac{1}{N_T}\sum_{j=1}^{N_T}\mathbb{I}\left[y_T^j \in \mathcal{C}_\alpha(x_T^j)\right]$, where $\mathcal{C}_\alpha(x_T^j)$ is constructed as the convex hull of the conditional α-contour $\widehat{\mathcal{Q}}^\alpha_{Y|X}(x_T^j)$, obtained from the estimated $\widehat{Q}_{Y|X}(u; x)$ (eq. (22)).

Comparison to a general-purpose solver. To showcase the scalability of our solver, we compare its runtime against an off-the-shelf linear-program solver, ECOS, available within the CVXPY package (Diamond & Boyd, 2016). These results are shown in Figure A9. As expected, our custom VQR solver approaches the accuracy of the general-purpose solver, which achieves slightly more accurate solutions because it solves the exact problem. However, the results demonstrate that with a linear-program solver, the runtime quickly becomes prohibitive even for small problems. Crucially, linear-program solvers only allow solving the linear VQR problem, while our solver also supports nonlinear VQR.
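Both the AC and MC metrics reduce to convex-hull computations, as described above. A scipy sketch (function names ours; a unit square stands in for an estimated α-contour):

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def confidence_set_size(contour_pts):
    """AC sketch: area/volume of the convex hull of the alpha-contour points."""
    return ConvexHull(contour_pts).volume      # equals the area when d = 2

def marginal_coverage(contour_pts, y_test):
    """MC sketch: fraction of test points inside the hull of the contour."""
    tri = Delaunay(contour_pts)
    return np.mean(tri.find_simplex(y_test) >= 0)

# Toy contour: the corners of the unit square (hull area 1).
square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
pts = np.random.default_rng(7).uniform(-0.5, 1.5, size=(1000, 2))
cov = marginal_coverage(square, pts)           # roughly 1/4 of the sampling box
```

In the real experiments the contour points come from $\widehat{\mathcal{Q}}^\alpha_{Y|X}(x_T)$ rather than a toy square, and the coverage is averaged over held-out covariates.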

F.2 OPTIMIZATION

To further evaluate our proposed VQR solver, we experimented with various values of ε and with different batch sizes of samples and quantile levels, denoted $B_N$ and $B_T$ respectively. Figure A10a shows the effect of the mini-batch size, in both N and T, on the accuracy of the solution, and Figure A10b shows the effect of ε. In both cases we measured the QFD metric, averaged over 100 evaluation x values. As expected, we observe improved accuracy when increasing the batch sizes (both $B_N$ and $B_T$) and when decreasing ε.



can be installed with pip install vqr; source available at https://github.com/vistalab-technion/vqr



Figure 1: (a) Visualization of the vector quantile function (VQF) and its α-contours, a high-dimensional generalization of α-confidence intervals. Data was drawn from a 2d star-shaped distribution. Vector quantiles (colored dots) are overlaid on the data (middle). Different colors correspond to α-contours, each containing 100 · (1 − 2α)² percent of the data. The VQF $Q_Y(u) = [Q_1(u), Q_2(u)]^\top$ is co-monotonic with $u = (u_1, u_2)$; $Q_1, Q_2$ are depicted as surfaces (left, right) with the corresponding vector quantiles overlaid. On $Q_1$, increasing $u_1$ for a fixed $u_2$ produces a monotonically increasing curve, and vice versa for $Q_2$. (b) Visualization of conditional vector quantile functions (CVQFs) via α-contours. Data was drawn from a joint distribution of (X, Y) where Y|X = x has a star-shaped distribution rotated by x degrees. The true CVQF $Q_{Y|X}$ changes non-linearly with the covariates X, while $\mathbb{E}[Y|X]$ remains the same. This demonstrates the challenge of estimating CVQFs from samples of the joint distribution. Appendix C provides further intuitions regarding VQFs and CVQFs, and details how the α-contours are constructed from them.

Figure 2: Proposed VQR solver scales to large N and T with improving accuracy. Computed on the MVN dataset with d = 2 and k = 10. Blue curve shows the KDE-L1 distance, defined in section 6. Left: sweeping N; T = 50 and $B_N$ = 50k. Right: sweeping T; N = 100k and $B_T$ = 2500.

Figure 3: NL-VQR quantitatively and qualitatively outperforms other methods in conditional distribution estimation. Comparison of kernel-density estimates of samples drawn from VQR, CVAE, and NL-VQR, on the conditional-banana dataset. Models were trained with N = 20k samples; for VQR, T = 50 was used. Numbers depict the KDE-L1 metric. Lower is better.

3; Appendix figs. A13 and A14).

Figure 4: VMR improves strong representation (QFD; left), and completely eliminates monotonicity violations (MV; right). VMR allows for smaller ε values, thus improving accuracy (lower QFD) without compromising monotonicity (zero MV). Without VMR, MV increases (right, red) as the problem becomes more exact (smaller ε). MVN dataset; N = 20k, T = 50, d = 2, k = 1.

Figure 5: NL-VQR produces substantially smaller confidence sets than other methods at a given confidence level. Top: Mean C α size; Bottom: Example C α (x) for a specific x. Calculated on house prices test-set. Full details in figs. A15 and A16.


Mapping the set $U_\alpha$ back to the domain of Y by using the values of the CVQF $Q_{Y|X}(u; x)$ along it, we obtain a contour of arbitrary shape, which contains $100 \cdot (1-2\alpha)^d$ percent of the d-dimensional distribution (see fig. 1a, middle, and fig. A7, right).

Figure A6: Visualization of (a) 1d, (b) 2d, and (c) 3d vector quantile functions estimated on the MVN dataset. In each plot, $Q_i(u)$ is the i-th component of the vector quantile $Q_Y(u)$, plotted over all quantile levels u. The number of quantile levels calculated was 50, 25 and 10 for 1d, 2d and 3d quantiles respectively. For 2d quantiles (b), the top plot shows multiple monotonic quantile curves of one variable while keeping the other at a fixed level.

Figure A7: Visualization of α-quantile contours constructed using separable quantiles and vector quantiles. All the contours presented contain ~90% of the points (i.e., α = 0.025). Left (top to bottom): samples drawn from bivariate normal, heart-shaped, and star-shaped densities. Middle & Right: black triangles overlaid on the density constitute the α-quantile contour estimated using separable quantiles and vector quantiles, respectively. The bivariate normal density has zero mean, unit variance and correlation of ρ = −0.7. Vector quantile contours accurately capture the shape of the distribution, whereas the separable quantile contours are always box-shaped due to the assumption of independence.

Scale with respect to d and k. Figure A8 demonstrates the runtime of our solver with respect to the number of target dimensions d and covariate dimensions k.

Figure A8: Effect of d and k on runtime. VQR solver runtime is shown as a function of the number of target dimensions d (left) and covariate dimensions k (right). Calculated on the MVN dataset with N = 10k; for the d experiment T = 10; for the k experiment T = 50 and d = 2. Runtime scales exponentially with d, as expected, due to having $T^d$ quantile levels. For k, we can see that runtime remains relatively constant even for hundreds of dimensions, then increases due to memory constraints.

Figure A9: Our proposed solver is orders of magnitude faster than general-purpose linear program solvers, and approaches the exact solution as the problem size increases. VQR and CVX (using the ECOS LP solver) solver runtimes (solid lines) are shown as a function of the number of samples N (left) and quantile levels per dimension T (right). Dashed lines indicate KDE-L1 as a measure of solution quality. CVX obtains the ideal solution due to solving the exact problem. Calculated on the MVN dataset with d = 2, k = 10. Left: T = 30; Right: N = 2000.

Figure A10: Optimization experiments showing the effect of ε and both batch sizes on convergence. (a) Effect of batch sizes B_N and B_T on optimization. Computed on the MVN dataset with d = 2 and k = 1. The vertical axis is the QFD metric defined in section 6. Left: sweep over B_N; T = 25. Right: sweep over B_T; T = 50 produces 50² total levels. (b) Decreasing ε in the relaxed dual improves strong representation. Y-axis: QFD metric. Computed on the MVN dataset with N = 20k, k = 1. Right: T = 50, d = 2; Left: T = 100, d = 1.

Figure A11: Visualization of monotonicity violations in 2d. Left: co-monotonicity matrix, showing the value of (u_i − u_j)^⊤(Q(u_i) − Q(u_j)) for all i, j, after fitting VQR without VMR with d = 2, T = 10 on the MVN dataset. Right: quantile-level pairs i, j where co-monotonicity is violated. VMR resolves all these violations.

Figure A12: VQR and NL-VQR can be used for Simultaneous Quantile Regression (SQR), and VMR eliminates quantile crossings. SQR on the synthetic glasses dataset (a; gray points show the ground-truth distribution). Linear VQR fails to correctly model the conditional distribution (b, c; gray points sampled from linear VQR), while nonlinear VQR reconstructs the shape exactly (d, e; gray points sampled from nonlinear VQR). In both cases, VMR successfully eliminates any quantile crossings (c, e).

Figure A13: Qualitative evaluation of VQR, CVAE and NL-VQR on the conditional-banana dataset. Depicted are the kernel-density estimates of samples drawn from each of these methods. Models were trained with N = 20k and T = 50 quantile levels per dimension. Numbers depict the KDE-L1 metric. Best viewed as the GIF provided in the supplementary material.

Table A3: Quantitative evaluation of linear VQR, CVAE, and nonlinear VQR on the rotating stars dataset. Evaluated on N = 4000 samples with T = 50 levels per dimension. The KDE-L1, QFD and Entropy metrics are defined in section 6. Arrows ↑/↓ indicate whether higher/lower is better; numbers are mean ± std calculated over the 20 values of x. Reference entropy is obtained by uniformly sampling the 2d quantile grid.

Figure A15: Nonlinear VQR produces substantially smaller α-confidence sets in real data experiments. Comparison of nonlinear VQR (NL-VQR), separable nonlinear QR (Sep-NLQR), linear VQR (VQR), and linear separable QR (Sep-QR) approaches on the four real datasets presented in table A1. The confidence sets are constructed for each method in such a way as to produce the same desired level of marginal coverage (blue dashed line). Error bars are generated from 10 trials, each with a different randomly-sampled train/test split.

Figure A16: Nonlinear VQR produces α-quantile contours which model the shape of the conditional distribution and have significantly smaller area compared to other methods. Each column depicts α-quantile contours for two test covariates (top, bottom) sampled from bio (left), house prices (middle), and blog_data (right) datasets. Separable QR approaches produce box-shaped confidence regions due to the assumption of statistical independence, whereas VQR and nonlinear VQR produce contours with non-trivial shapes that adapt to the test covariate at hand (each column, top vs bottom). The areas of the α-confidence sets are reported relative to NL-VQR in the legend.

as strong baselines for estimating the conditional distribution of Y|X. We emphasize that CVAE only allows sampling from the estimated conditional distribution; it does not estimate quantiles, while VQR allows both. Thus, we could compare VQR with CVAE only on the KDE-L1 metric, which is computed on samples. To the best of our knowledge, besides VQR there is no other generative model capable of estimating CVQFs.

Information about the real data sets.

(e) Calculate the KDE f*_Y|X=x_l from the ground-truth samples y*_{m,l}. 3. Calculate the KDE-L1 metric as (1/L) Σ_l ‖f̂_Y|X=x_l − f*_Y|X=x_l‖_1.
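The KDE-L1 computation can be sketched as follows. This is a minimal illustrative version, not the paper's exact implementation: the evaluation grid, Gaussian kernel, bandwidth, and grid-normalization are assumptions made here for self-containment.

```python
import math

def kde_on_grid(samples, grid, bw=0.2):
    """Gaussian KDE of 2d samples evaluated on grid points, normalized so
    that the values sum to 1 over the grid (a discrete density estimate)."""
    vals = []
    for gx, gy in grid:
        s = sum(math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * bw ** 2))
                for x, y in samples)
        vals.append(s)
    z = sum(vals)
    return [v / z for v in vals]

def kde_l1(samples_est, samples_true, grid, bw=0.2):
    """L1 distance between KDEs of estimated and ground-truth samples,
    both evaluated on the same grid of points."""
    f_est = kde_on_grid(samples_est, grid, bw)
    f_true = kde_on_grid(samples_true, grid, bw)
    return sum(abs(a - b) for a, b in zip(f_est, f_true))
```

Averaging `kde_l1` over the L conditioning values x_l gives the reported metric.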

Quantitative evaluation of linear VQR, CVAE and nonlinear VQR on the conditional-banana dataset. Evaluated on N = 4000 samples with T = 50 levels per dimension. The KDE-L1, QFD and Entropy metrics are defined in section 6. Arrows ↑/↓ indicate whether higher/lower is better; numbers are mean ± std calculated over the 20 values of x. Reference entropy is obtained by uniformly sampling the 2d quantile grid.


ACKNOWLEDGEMENTS Y.R. was supported by the Israel Science Foundation (grant No. 729/21). Y.R. thanks Shai Feldman and Stephen Bates for discussions about vector quantile regression, and the Career Advancement Fellowship, Technion, for providing research support. A.A.R., S.V., and A.M.B. were partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 863839), by the Council for Higher Education – Planning & Budgeting Committee, and by the Israeli Smart Transportation Research Center (ISTRC).

E EVALUATION METRICS

E.1 METRICS FOR SYNTHETIC DATA

The quality of an estimated CVQF can be measured by how well it upholds strong representation and co-monotonicity (eqs. (1) and (2)). However, measuring strong representation requires knowledge of the ground-truth joint distribution P(X,Y) or its conditional quantile function. With synthetic data, adherence to strong representation can be measured via several proxy metrics. The second key property, violation of co-monotonicity, can be measured directly. Below we describe the metrics we use to evaluate these properties on estimated conditional quantile functions.

Entropy of inverse CVQF. If strong representation holds, the inverted CVQF, Q⁻¹_Y|X(Y|X), must result in a uniform distribution when evaluated on samples drawn from the true conditional distribution. As a measure of uniformity, we calculate a normalized entropy of the inverted CVQF. The inversion procedure is done as follows. (c) For i ∈ 1, ..., T^d, denote by c_i the number of times that quantile level i is found in {u_{m,l}}_{m=1}^M, and calculate p_i = c_i/M. (d) Calculate the entropy h_l = −Σ_i p_i log p_i and the normalized entropy h̃_l = (exp(h_l) − 1)/(T^d − 1).

3. Report the CVQF inverse entropy as (1/L) Σ_l h̃_l.

Note that: 1. The entropy metric is normalized such that its values are in [0, 1], where 1 corresponds to a uniform distribution and 0 corresponds to a delta distribution. 2. Some non-uniformity arises due to the quantization of the quantile-level grid into T discrete levels per dimension. We report a reference entropy, calculated on a uniform sample of M quantile-level grid points. 3. We used L = 20 and M = 4000.
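The per-conditioning-value entropy computation above can be sketched in a few lines (an illustrative version; the function name and the representation of the counts as a flat list over the T^d grid levels are assumptions made here):

```python
import math

def normalized_inverse_entropy(level_counts, n_levels):
    """Normalized entropy of inverted-CVQF hits.

    level_counts: how many of the M inverted samples landed on each of the
    n_levels = T**d quantile-level grid points.
    Returns h~ in [0, 1]: 1 <-> uniform distribution, 0 <-> delta."""
    m = sum(level_counts)
    p = [c / m for c in level_counts if c > 0]   # empirical probabilities p_i
    h = -sum(pi * math.log(pi) for pi in p)      # entropy h = -sum p_i log p_i
    return (math.exp(h) - 1) / (n_levels - 1)    # normalized entropy
```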

Distribution distance (KDE-L1). Since the CVQF fully represents the conditional distribution P(Y|X), it can serve as a generative model for it. We used inverse-transform sampling to generate data from the estimated conditional distribution through the fitted VQR model, Q_Y|X(u; x), as follows.
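Inverse-transform sampling from a fitted CVQF can be sketched as follows. The callable `cvqf` and the flat list `levels` of quantile-level grid points are hypothetical stand-ins for the fitted model's interface, not the `vqr` package's actual API:

```python
import random

def sample_from_cvqf(cvqf, x, levels, n_samples):
    """Inverse-transform sampling: draw u uniformly at random from the
    fitted grid of T**d quantile levels and map it through the estimated
    CVQF Q_{Y|X}(u; x) to obtain a sample of Y given X = x."""
    return [cvqf(random.choice(levels), x) for _ in range(n_samples)]
```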

G IMPLEMENTATION DETAILS

Solver implementation details. The log-sum-exp term in eq. (7) becomes numerically unstable as ε decreases; this can be mitigated by implementing it in the standard numerically-stable way, i.e., factoring out the maximal argument before exponentiation.

Scale and optimization experiments. For both the scale and optimization experiments, we run VQR for 10k iterations and use a learning-rate scheduler that decays the learning rate by a factor of 0.9 every 500 iterations if the error does not drop by 0.5%.

Synthetic glasses experiment. We set N = 10k, T = 100, and ε = 0.001. We optimized both VQR and NL-VQR for 40k iterations and used a learning-rate scheduler that decays the learning rate by a factor of 0.9 every 500 iterations if the error does not drop by 0.5%. In NL-VQR, as g_θ we used a 3-layer fully-connected network with each hidden layer of size 1000, with skip-connections, batch-norm, and ReLU nonlinearities between the hidden layers. For NL-VQR and VQR, we set the initial learning rate to 0.4 and 1, respectively.

Conditional banana and rotating star experiments. We draw N = 20k samples from P(X,Y) and fit T = 50 quantile levels per dimension (d = 2) for linear and nonlinear VQR (NL-VQR). Evaluation is performed by measuring all aforementioned metrics on 4000 evaluation samples from 20 true conditional distributions, conditioned on x = [1.1, 1.2, ..., 3.0]. For NL-VQR, as g_θ we use a small MLP with three hidden layers of sizes (2, 10, 20) and ReLU non-linearities; thus g_θ lifts X into 20 dimensions on which VQR is performed. We set ε = 0.005 and optimized both VQR and NL-VQR for 20k iterations, with the same learning rate and schedulers as in the synthetic glasses experiment.

Real data experiments. In all the real data experiments, we randomly split the data into an 80% training set and a 20% held-out test set. Fitting is performed on the train split, and the evaluation metrics, marginal coverage and quantile contour area (calculated as reported in appendix E.2), are measured on the test split. We repeat this procedure 10 times with different random splits.
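The max-factoring stabilization of the ε-scaled log-sum-exp term can be sketched as follows (an illustrative pure-Python version, not the package's actual implementation):

```python
import math

def stable_logsumexp(xs, eps):
    """Computes eps * log(sum_i exp(x_i / eps)) without overflow.

    Uses the identity eps*log(sum_i exp(x_i/eps)) = m + eps*log(sum_i
    exp((x_i - m)/eps)) with m = max(xs), so every exponent is <= 0."""
    m = max(xs)
    return eps * math.log(sum(math.exp((x - m) / eps) for x in xs)) + m
```

A naive evaluation of `exp(x / eps)` overflows already for moderate `x` when ε is small (e.g. x = 1000, ε = 0.001), while the stabilized form stays finite.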
The baselines we evaluate are NL-VQR, VQR, separable nonlinear QR (Sep-NLQR), and separable linear QR (Sep-QR). In the separable baselines, two QR models are fit separately, one for each target variable. For both the nonlinear separable QR and NL-VQR baselines, we choose g_θ to be an MLP with three hidden layers. In the case of NL-VQR, the hidden layer sizes are set to (100, 60, 20), and for nonlinear separable QR they are set to (50, 30, 10). All methods were run for 40k iterations, with the learning rate set to 0.3 and ε = 0.01. We set T = 50 for the NL-VQR and VQR baselines, and T = 100 for the separable linear and nonlinear QR baselines. The α parameter that determines the α-quantile contour construction is calibrated separately for each method and dataset. The goal of the calibration procedure is to achieve consistent coverage across baselines, in order to allow for comparison of the α-quantile contour areas.

Listing 1: Minimal example of using the vqr library. Demonstrates instantiation of the solver and regressor, fitting VQR, sampling from the fitted CVQF, and coverage calculation.

We note that although a reference Matlab implementation of linear VQR by the original authors exists, it relies on solving the exact formulation (5) using a general-purpose linear program solver. We found general-purpose linear program solvers to be prohibitively slow for VQR (fig. A9). For example, we noted a runtime of 4.5 hours for N = 7k, d = 2, k = 10, T = 30, and for N = 10k the solver did not converge even after 8 hours. Thus, this approach is unsuitable even for modestly-sized problems. Moreover, when using such solvers, only linear VQR can be performed. For comparison, our proposed solver converges to a solution of the same quality on problems of these sizes in less than one minute, and it supports nonlinear VQR.
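The α-calibration step described above can be sketched as a simple one-dimensional search. This is an illustrative sketch, not the paper's exact procedure: `coverage(alpha)` is a hypothetical callable returning the empirical marginal coverage of the α-quantile contour on held-out data, assumed monotone-decreasing in α.

```python
def calibrate_alpha(coverage, target, tol=0.005, lo=0.0, hi=0.5):
    """Bisection search for the alpha whose contour attains the target
    marginal coverage. Assumes coverage(alpha) decreases as alpha grows
    (larger alpha -> tighter contour -> lower coverage)."""
    for _ in range(50):
        mid = (lo + hi) / 2
        if abs(coverage(mid) - target) <= tol:
            return mid
        if coverage(mid) > target:
            lo = mid   # contour too wide: increase alpha
        else:
            hi = mid   # contour too tight: decrease alpha
    return (lo + hi) / 2
```

Running this once per method and dataset yields contours with matched coverage, making their areas directly comparable.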

