TANGENTIAL WASSERSTEIN PROJECTIONS

Abstract

We develop a notion of projections between sets of probability measures using the geometric properties of the 2-Wasserstein space. In contrast to existing methods, it is designed for multivariate probability measures that need not be regular, is computationally efficient to implement via a linear regression, and provides a unique solution in general. The idea is to work on tangent cones of the Wasserstein space using generalized geodesics. Its structure and computational properties make the method applicable in a variety of settings where probability measures need not be regular, from causal inference to the analysis of object data. An application to estimating causal effects yields a generalization of the synthetic controls method for systems with general heterogeneity described via multivariate probability measures, something that has been out of reach of existing approaches.

1. INTRODUCTION

The concept of projections, that is, approximating a target quantity of interest by an optimally weighted combination of other quantities, is of fundamental relevance in learning theory and statistics. Projections are generally defined between random variables in appropriately defined linear spaces (e.g. van der Vaart, 2000, chapter 11). In modern statistics and machine learning applications, the objects of interest are often probability measures themselves. Examples range from object and functional data (e.g. Marron & Alonso, 2014) to causal inference with individual heterogeneity (e.g. Athey & Imbens, 2015). A notion of projection between sets of probability measures should be applicable between any set of general probability measures, replicate geometric properties of the target measure, and possess good computational and statistical properties. We introduce such a notion of projection between sets of general probability measures supported on Euclidean spaces. It provides a unique solution to the projection problem under mild conditions. To achieve this, we work in the 2-Wasserstein space, that is, the set of all probability measures with finite second moments equipped with the 2-Wasserstein distance. Importantly, we focus on the multivariate setting, i.e. we consider the Wasserstein space over some Euclidean space $\mathbb{R}^d$, denoted by $\mathcal{W}_2$, where the dimension $d$ can be high. The multivariate setting poses challenges from a mathematical, computational, and statistical perspective. In particular, $\mathcal{W}_2$ is a positively curved metric space for $d > 1$ (e.g. Ambrosio et al., 2008, Kloeckner, 2010). Moreover, the 2-Wasserstein distance between two probability measures is defined as the value function of the Monge-Kantorovich optimal transportation problem (Villani, 2003, chapter 2), which does not have a closed-form solution in multivariate settings.
This is coupled with a well-known statistical curse of dimensionality for general measures (Ajtai et al., 1984 , Dudley, 1969 , Fournier & Guillin, 2015 , Talagrand, 1992; 1994 , Weed & Bach, 2019) .

1.1. EXISTING APPROACHES

These challenges have impeded the development of a method of projections between general and potentially high-dimensional probability measures. The focus so far has been on the univariate and low-dimensional setting. In particular, Chen et al. (2021), Ghodrati & Panaretos (2022), and Pegoraro & Beraha (2021) introduced frameworks for distribution-on-distribution regressions in the univariate setting for object data. Bigot et al. (2014) and Cazelles et al. (2017) developed principal component analyses on the space of univariate probability measures using geodesics on the Wasserstein space. The works most closely related to ours are Bonneel et al. (2016), Mérigot et al. (2020), and Werenski et al. (2022). The first develops a regression approach in barycentric coordinates with applications in computer graphics as well as color and shape transport problems. Their method is defined directly on $\mathcal{W}_2$ and requires solving a computationally costly bilevel optimization problem, which does not necessarily yield global solutions. The second introduces a linearization of the 2-Wasserstein space by lifting it to an $L^2$-space anchored at a measure that is absolutely continuous with respect to Lebesgue measure. This approach relies on the existence of optimal transport maps between this absolutely continuous "anchor" distribution and other distributions and hence only defines tangent spaces at absolutely continuous measures. The third works on a tangential structure based on "Karcher means" (Karcher, 2014, Zemel & Panaretos, 2019), which is more restrictive still: their method requires all involved measures to be absolutely continuous with densities that are bounded away from zero, and the target measure to lie in the convex hull of the control measures.

1.2. OUR CONTRIBUTION

In contrast to the existing approaches, our method is applicable to general probability measures, allows for the target measure to be outside the generalized geodesic convex hull of the control measures, can be implemented by a standard constrained linear regression, and provides a global (and in many cases unique) solution. The proposed method transforms the projection problem on the positively curved Wasserstein space into a linear optimization problem in the geometric tangent cone, which can be implemented via a linear regression. This problem takes the form of a deformable template (Boissard et al., 2015, Yuille, 1991), which connects our approach to this literature. Our method can be implemented in three steps: (i) obtain the general tangent cone structure at the target measure, (ii) construct a tangent space from the tangent cone via barycentric projections if it does not exist, and (iii) perform a linear regression to carry out the projection in the tangent space. This implementation of the projection approach via linear regression is computationally efficient, in particular compared to the existing methods in Bonneel et al. (2016) and Werenski et al. (2022). The challenging part of the implementation is lifting the problem to the tangential structure: this requires computing the corresponding optimal transport plans between the target and each measure used in the projection. Many methods have been developed for this; see for instance Benamou & Brenier (2000), Jacobs & Léger (2020), Makkuva et al. (2020), Peyré & Cuturi (2019), Ruthotto et al. (2020) and references therein. Other alternatives compute approximations of the optimal transport plans via regularized optimal transport problems (Peyré & Cuturi, 2019), such as entropy-regularized optimal transport (Galichon & Salanié, 2010, Cuturi, 2013). The proposed projection approach is compatible with any such method; its complexity therefore scales with that of estimating optimal transport plans.
We provide statistical consistency results for the case where the measures are estimated via their empirical counterparts in practice. To demonstrate the efficiency and utility of the proposed method, we apply it in different settings and compare it to existing benchmarks such as Werenski et al. (2022). Furthermore, we extend the classical synthetic control estimator (Abadie & Gardeazabal, 2003, Abadie et al., 2010) to settings with observed individual heterogeneity in multivariate outcomes. The synthetic controls estimator is a projection approach: one predicts an aggregate outcome of a treated unit by an optimal convex combination of control units and uses the weights of this optimal combination to construct the counterfactual state of the treated unit had it not received treatment. The novelty of our application is that it lets us perform the synthetic control method on the joint distribution of several outcomes, which complements the recently introduced method in Gunsilius (2022) designed for univariate outcomes. The possibility to project entire probability measures allows us to disentangle treatment heterogeneity at the treatment unit level. The possibility of working with general probability measures is key in this setting, as many outcomes of interest are not regular. We illustrate this by applying our method to estimate the effects of a Medicaid expansion policy in Montana, where the outcomes are non-regular probability measures in $d = 28$ dimensions.

2. METHODOLOGY

2.1. THE 2-WASSERSTEIN SPACE $\mathcal{W}_2(\mathbb{R}^d)$

The 2-Wasserstein Distance. For probability measures $P_X, P_Y \in \mathcal{P}(\mathbb{R}^d)$ with supports $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{R}^d$, respectively, the 2-Wasserstein distance $W_2(P_X, P_Y)$ is defined as

$$W_2(P_X, P_Y) := \left( \min_{\gamma \in \Gamma(P_X, P_Y)} \int_{\mathcal{X} \times \mathcal{Y}} |x - y|^2 \, d\gamma(x, y) \right)^{1/2}. \tag{2.1}$$

Here, $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^d$ and

$$\Gamma(P_X, P_Y) := \left\{ \gamma \in \mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d) : (\pi_1)_\# \gamma = P_X, \ (\pi_2)_\# \gamma = P_Y \right\}$$

is the set of all couplings of $P_X$ and $P_Y$. The maps $\pi_1$ and $\pi_2$ are the projections onto the first and second coordinate, respectively, and $T_\# P$ denotes the pushforward measure of $P$ via $T$, i.e. for any measurable $A \subseteq \mathcal{Y}$, $T_\# P(A) \equiv P(T^{-1}(A))$. An optimal coupling $\gamma \in \Gamma(P_X, P_Y)$ solving the optimal transport problem equation 2.1 is an optimal transport plan. By Prokhorov's theorem, a solution always exists in our setting. When $P_X$ is regular, i.e. when it does not give mass to sets of lower Hausdorff dimension in its support, the optimal transport plan $\gamma$ solving equation 2.1 is unique and takes the form $\gamma = (\mathrm{Id} \times \nabla\varphi)_\# P_X$, where $\mathrm{Id}$ is the identity map on $\mathbb{R}^d$ and $\nabla\varphi$ is the gradient of some convex function $\varphi$. This result is known as Brenier's theorem (Brenier, 1991, McCann, 1997, Villani, 2003, Theorem 2.12). By definition, all measures that possess a density with respect to Lebesgue measure are regular. Our main contribution is to allow for general probability measures, where only optimal transport plans but no maps exist.

The 2-Wasserstein Space. The 2-Wasserstein space $\mathcal{W}_2 := (\mathcal{P}_2(\mathbb{R}^d), W_2)$ is the metric space defined on the set $\mathcal{P}_2(\mathbb{R}^d)$ of all probability measures with finite second moments supported on $\mathbb{R}^d$, with the 2-Wasserstein distance as the metric.
It is a geodesically complete space in the sense that between any two measures $P, P' \in \mathcal{W}_2$, one can define a geodesic $P_t : [0, 1] \to \mathcal{W}_2$ via the interpolation $P_t := (\pi_t)_\# \gamma$, where $\gamma$ is an optimal transport plan between $P$ and $P'$ and $\pi_t : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ is defined through $\pi_t(x, y) := (1 - t)x + ty$ (Ambrosio et al., 2008, McCann, 1997). Using this, it can be shown that $\mathcal{W}_2$ is a positively curved metric space for $d > 1$ (Ambrosio et al., 2008, Theorem 7.3.2) and flat for $d = 1$ (Kloeckner, 2010), where curvature is defined in the sense of Aleksandrov (1951). This difference in curvature is the main reason why the multivariate setting requires different approaches compared to the established results for measures on the real line.
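For finitely supported measures, the optimal transport problem in equation 2.1 is a finite linear program. The following is a minimal sketch under that assumption, not part of the original implementation: the function name `ot_plan` and the two toy examples are hypothetical, and the solver is SciPy's general-purpose `linprog` rather than a dedicated optimal transport routine. The second example illustrates a non-regular source (a single atom) whose mass must be split, so that an optimal plan exists but no transport map does.

```python
import numpy as np
from scipy.optimize import linprog


def ot_plan(x, y, a, b):
    """Optimal plan between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j}
    for the squared Euclidean cost, together with the W_2 distance."""
    n, m = len(a), len(b)
    # cost[i, j] = |x_i - y_j|^2
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    # Marginal constraints on the row-major flattened plan:
    # rows sum to a, columns sum to b.
    rows = np.kron(np.eye(n), np.ones((1, m)))
    cols = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(cost.ravel(), A_eq=np.vstack([rows, cols]),
                  b_eq=np.concatenate([a, b]), bounds=(0, None),
                  method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    # Guard against a tiny negative optimal value caused by round-off.
    return res.x.reshape(n, m), np.sqrt(max(res.fun, 0.0))


# Example 1: a translation by (3, 4); the optimal plan is the shift itself.
x = np.array([[0.0, 0.0], [1.0, 0.0]])
a = np.array([0.5, 0.5])
plan_shift, w2_shift = ot_plan(x, x + np.array([3.0, 4.0]), a, a)

# Example 2: one atom at the origin sent to two atoms; the mass is split.
origin = np.zeros((1, 2))
y = np.array([[1.0, 0.0], [-1.0, 0.0]])
plan_split, w2_split = ot_plan(origin, y, np.array([1.0]), np.array([0.5, 0.5]))
```

The linear program has one variable per pair of support points, so this formulation is only practical for small supports; larger problems would call for the specialized or regularized solvers cited in Section 1.2.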

2.2. TANGENT CONE STRUCTURE ON W 2

We exploit a tangential structure that can be defined for general measures on $\mathcal{W}_2$ (Ambrosio et al., 2008, Otto, 2001, Takatsu & Yokota, 2012). In particular, it allows us to circumvent solving a bilevel optimization problem such as the one considered in Bonneel et al. (2016), whose statement we have included in the appendix. The tangential structure relies on the fact that geodesics $P_t$ in $\mathcal{W}_2$ are linear in the transport plans $(\pi_t)_\# \gamma$. This implies a geometric tangent cone structure at each measure $P \in \mathcal{W}_2$ that can be defined as the closure in $\mathcal{P}_2(\mathbb{R}^d)$ of the set

$$G(P) := \left\{ \gamma \in \mathcal{P}_2(\mathbb{R}^d \times \mathbb{R}^d) : (\pi_1)_\# \gamma = P, \ (\pi_1, \pi_1 + \varepsilon \pi_2)_\# \gamma \text{ is optimal for some } \varepsilon > 0 \right\}$$

with respect to the local distance

$$W_P^2(\gamma_{12}, \gamma_{13}) := \min \left\{ \int_{(\mathbb{R}^d)^3} |x_2 - x_3|^2 \, d\gamma_{123} : \gamma_{123} \in \Gamma_1(\gamma_{12}, \gamma_{13}) \right\}, \tag{2.2}$$

where $\gamma_{12}$ and $\gamma_{13}$ are couplings between $P$ and some other measures $P_2$ and $P_3$, respectively, and $\Gamma_1(\gamma_{12}, \gamma_{13})$ is the set of all 3-couplings $\gamma_{123}$ such that the projection of $\gamma_{123}$ onto the first two elements is $\gamma_{12}$ and the projection onto the first and third element is $\gamma_{13}$ (Ambrosio et al., 2008, Appendix 12). We can then define the exponential map at $P$ with respect to some tangent element $\gamma \in G(P)$ by $\exp_P(\gamma) = (\pi_1 + \pi_2)_\# \gamma$. This tangent cone can be constructed at every $P \in \mathcal{W}_2$, irrespective of its support; in particular, we do not assume that the corresponding measures are regular, that is, they may give mass to subsets of $\mathbb{R}^d$ of lower Hausdorff dimension. In the case where $P$ is regular, the tangent cone structure reduces to a tangent space (Ambrosio et al., 2008, Theorem 8.5.1). This tangent space structure has been exploited in Mérigot et al. (2020) and Werenski et al. (2022), and we include the results for our projection approach in this special case in Appendix A.
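As a short consistency check on the exponential map (a worked special case added for illustration, not taken from the original text), let $\gamma_{PQ}$ be an optimal plan between $P$ and some $Q$, and consider the tangent element obtained by recentring its second coordinate. Applying the exponential map returns $Q$, and when the plan is induced by a map $T$ the tangent element is the familiar vector field $T - \mathrm{Id}$.

```latex
% Tangent element built from an optimal plan \gamma_{PQ} \in \Gamma(P, Q):
\gamma := (\pi_1, \pi_2 - \pi_1)_\# \gamma_{PQ},
\qquad
(\pi_1, \pi_1 + \varepsilon \pi_2)_\# \gamma = \gamma_{PQ} \ \text{ for } \varepsilon = 1,
\ \text{ so } \gamma \in G(P).

% The exponential map recovers Q:
\exp_P(\gamma) = (\pi_1 + \pi_2)_\# (\pi_1, \pi_2 - \pi_1)_\# \gamma_{PQ}
               = (\pi_2)_\# \gamma_{PQ} = Q.

% Special case of a plan induced by a map, \gamma_{PQ} = (\mathrm{Id}, T)_\# P:
\gamma = (\mathrm{Id}, T - \mathrm{Id})_\# P,
\qquad
\exp_P(\gamma) = (\mathrm{Id} + (T - \mathrm{Id}))_\# P = T_\# P = Q.
```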

2.3. TANGENTIAL WASSERSTEIN PROJECTIONS

Our main contribution is to define a projection approach between general probability measures, where the target need not be regular. To define this notion of projection, we first need to define an appropriate notion of a geodesic convex hull. The novelty here is that we define this notion via generalized geodesics (Ambrosio et al., 2008, section 9.2) centered at the target measure $P_0$. For this, we extend the definition of $W_P$ to the multimarginal setting by defining, for given couplings $\gamma_{0j} \in \Gamma(P_0, P_j)$, $j \in \llbracket J \rrbracket$,

$$W^2_{P_0;\lambda}(\gamma_{01}, \dots, \gamma_{0J}) := \min \left\{ \int_{(\mathbb{R}^d)^{J+1}} \Big\| \sum_{j=1}^J \lambda_j x_j - x_0 \Big\|^2 d\gamma : \gamma \in \Gamma_1(\gamma_{01}, \dots, \gamma_{0J}) \right\}, \tag{2.3}$$

where $\Gamma_1(\gamma_{01}, \dots, \gamma_{0J}) \subseteq \Gamma(P_0, P_1, \dots, P_J)$ is the set of all $(J+1)$-couplings $\gamma$ such that the projection of $\gamma$ onto the coordinates $(x_0, x_j)$ is $\gamma_{0j}$. Note that this definition is similar to the multimarginal definition of the 2-Wasserstein barycenter (Agueh & Carlier, 2011, Gangbo & Święch, 1998), but "centered" at $P_0$. Based on this, we define the generalized geodesic convex hull of measures $\{P_j\}_{j \in \llbracket J \rrbracket}$ with respect to the measure $P_0$ as

$$\mathrm{Co}_{P_0}\big(\{P_j\}_{j=1}^J\big) := \left\{ P(\lambda) \in \mathcal{P}_2(\mathbb{R}^d) : P(\lambda) = \Big( \sum_{j=1}^J \lambda_j \pi_{j+1} \Big)_\# \gamma, \ \gamma \text{ solves } W^2_{P_0;\lambda}(\gamma_{01}, \dots, \gamma_{0J}), \ \gamma_{0j} \text{ is optimal in } \Gamma(P_0, P_j) \ \forall j \in \llbracket J \rrbracket, \ \lambda \in \Delta^J \right\}. \tag{2.4}$$

A direct application of our tangential projection idea would lead us to solving

$$\lambda^* := \arg\min_{\lambda \in \Delta^J} W^2_{P_0;\lambda}(\gamma_{01}, \dots, \gamma_{0J}), \tag{2.5}$$

which would be a computationally prohibitive bilevel optimization problem similar to the one in Bonneel et al. (2016). We therefore rely on barycentric projections to reduce the general cone structure to a regular tangent space, which we denote by $T_{P_0}\mathcal{W}_2$ (Ambrosio et al., 2008). In this structure the projection problem equation 2.5 is replaced by

$$\lambda^* := \arg\min_{\lambda \in \Delta^J} \Big\| \sum_{j=1}^J \lambda_j b_{\gamma_{0j}} - \mathrm{Id} \Big\|^2_{L^2(P_0)}, \quad \text{with} \quad b_{\gamma_{0j}}(x_1) := \int_{\mathbb{R}^d} x_2 \, d\gamma_{0j,x_1}(x_2) \tag{2.6}$$

denoting the barycentric projections of optimal transport plans $\gamma_{0j}$ between $P_0$ and $P_j$.
Here, $\gamma_{x_1}$ denotes the disintegration of the optimal transport plan $\gamma$ with respect to $P_0$. This approach is a natural extension of the regular setting to general probability measures for two reasons. First, if the optimal transport plans $\gamma_{0j}$ are actually induced by some optimal transport map, then $b_{\gamma_{0j}}$ reduces to this optimal transport map; in this case the general tangent cone $G(P_0)$ reduces to the regular tangent space $T_{P_0}\mathcal{W}_2$ (Ambrosio et al., 2008, Theorem 12.4.4). Second, by the definition of $b_\gamma$ and disintegrations in conjunction with Jensen's inequality it holds for all $\lambda \in \Delta^J$ that

$$\Big\| \sum_{j=1}^J \lambda_j b_{\gamma_{0j}} - \mathrm{Id} \Big\|^2_{L^2(P_0)} \le W^2_{P_0;\lambda}(\gamma_{01}, \dots, \gamma_{0J}). \tag{2.7}$$

Figure 1: Tangential Wasserstein projection for a general target $P_0$. $T_{P_0}\mathcal{W}_2$ is the regular tangent space constructed by applying barycentric projection to $G(P_0)$, the general tangent cone anchored at $P_0$. Thick dashed lines are tangent vectors $(b_j - \mathrm{Id})$ generated by the respective barycentric projections. The gray shaded region is their convex hull in this constructed tangent space and $\pi$ is the projection of $\mathrm{Id}$ onto this convex hull. $P_\pi := \exp_{P_0}(\pi)$ is the projection of $P_0$ onto the generalized geodesic convex hull $\mathrm{Co}_{P_0}\{P_1, P_2, P_3\} \subseteq \mathcal{W}_2$ (blue).

This implies that for general $P_0$ we can also define a convex hull based on barycentric projections, which is of the form

$$\widetilde{\mathrm{Co}}_{P_0}\big(\{P_j\}_{j=1}^J\big) := \left\{ P(\lambda) \in \mathcal{P}_2(\mathbb{R}^d) : P(\lambda) = \Big( \sum_{j=1}^J \lambda_j b_{\gamma_{0j}} \Big)_\# P_0, \ \lambda \in \Delta^J \right\}. \tag{2.8}$$

Furthermore, the contraction property equation 2.7 implies that $\widetilde{\mathrm{Co}}_{P_0} \subseteq \mathrm{Co}_{P_0}$, with equality when all transport plans are achieved via maps $\nabla\varphi_j$. Using these definitions, the following defines our notion of projection for general $P_0$ and shows that it projects onto $\widetilde{\mathrm{Co}}_{P_0}$.

Proposition 2.1. Consider a general target measure $P_0$ and a set $\{P_j\}_{j \in \llbracket J \rrbracket}$ of general control measures.
Construct the measure $P_\pi$ as

$$P_\pi := \exp_{P_0}\Big( \sum_{j=1}^J \lambda^*_j b_{\gamma_{0j}} - \mathrm{Id} \Big),$$

where the optimal weights $\lambda^* \in \Delta^J$ are obtained by solving equation 2.6 and $\gamma_{0j}$ are optimal plans transporting $P_0$ to $P_j$, respectively. Then for given optimal plans $\gamma_{0j}$, $P_\pi$ is the unique metric projection of $P_0$ onto $\widetilde{\mathrm{Co}}_{P_0}\big(\{P_j\}_{j=1}^J\big)$, the convex hull based on barycentric projections defined in equation 2.8.

The optimal plans $\gamma_{0j}$ transporting $P_0$ to $P_j$ need not be unique if $P_j$ lies outside the cut locus of $P_0$, i.e., when there is more than one optimal way to transport $P_0$ onto $P_j$. However, the projection for fixed $\gamma_{0j}$ is always unique by virtue of the linear regression.
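The construction in Proposition 2.1 can be sketched for finitely supported measures as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `ot_plan` and `tangential_projection` are hypothetical, optimal plans are computed with a generic linear program, the atoms of $P_0$ are assumed to have strictly positive mass, and the simplex-constrained regression in equation 2.6 is solved with SciPy's SLSQP routine. In the toy example the controls are translates of $P_0$, so the problem reduces to projecting the origin onto the convex hull of the shift vectors.

```python
import numpy as np
from scipy.optimize import linprog, minimize


def ot_plan(x, y, a, b):
    """Optimal plan (n x m matrix) for the squared Euclidean cost, via an LP."""
    n, m = len(a), len(b)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    A_eq = np.vstack([np.kron(np.eye(n), np.ones((1, m))),
                      np.kron(np.ones((1, n)), np.eye(m))])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(n, m)


def tangential_projection(x0, a0, controls):
    """x0, a0: atoms and (strictly positive) masses of P_0.
    controls: list of (atoms, masses) pairs for P_1, ..., P_J.
    Returns the weights and the atoms of P_pi, which keeps the masses a0."""
    # Optimal plans and their barycentric projections
    # b_j(x_i) = sum_k plan[i, k] * y_k / a0[i].
    b = np.stack([ot_plan(x0, y, a0, w) @ y / a0[:, None] for y, w in controls])
    v = b - x0  # tangent vectors b_j - Id, shape (J, n, d)
    # Because the weights sum to one, the objective in equation 2.6 equals
    # lam' G lam with Gram matrix G[j, k] = <v_j, v_k> in L^2(P_0).
    gram = np.einsum("jid,kid,i->jk", v, v, a0)
    J = len(controls)
    res = minimize(lambda lam: lam @ gram @ lam, np.full(J, 1.0 / J),
                   jac=lambda lam: 2.0 * gram @ lam, method="SLSQP",
                   bounds=[(0.0, 1.0)] * J,
                   constraints=[{"type": "eq",
                                 "fun": lambda lam: lam.sum() - 1.0}])
    lam = res.x
    # exp_{P_0}(sum_j lam_j b_j - Id) = (sum_j lam_j b_j) # P_0
    return lam, np.tensordot(lam, b, axes=1)


# Toy example: three controls that are translates of P_0.
x0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
a0 = np.array([0.2, 0.3, 0.5])
shifts = [(2.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
controls = [(x0 + np.array(s), a0) for s in shifts]
lam, support = tangential_projection(x0, a0, controls)
# The origin lies outside the hull of the shifts; its nearest point is (1, 1),
# so the weights should be close to (0.5, 0.5, 0) and P_pi is P_0 moved by (1, 1).
```

The weights in this example are sparse because the target lies outside the hull of the controls and is projected onto one of its faces, which mirrors the behaviour discussed for the applications in Section 4.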

3. STATISTICAL PROPERTIES OF THE WEIGHTS AND PROJECTION

We now provide statistical consistency results for our method when the corresponding measures $\{P_j\}_{j \in \llbracket J \rrbracket}$ are estimated from data. We consider the case where the measures $P_j$ are replaced by their empirical counterparts

$$\mathbb{P}_{N_j}(A) := N_j^{-1} \sum_{n=1}^{N_j} \delta_{X_{nj}}(A)$$

for every measurable set $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^d$, where $\delta_x(A)$ is the Dirac measure and $X_{1j}, \dots, X_{N_j j}$ is an independent and identically distributed set of random variables whose distribution is $P_j$. We explicitly allow for different sample sizes $N_j$, with $\sum_{j=0}^J N_j = N$, for the different measures. To save on notation we write $\hat\varphi_{N_j} \equiv \hat\varphi_j$, $\hat b_{0j} \equiv b_{\hat\gamma_{0j}, N_j}$ and $\hat\gamma_{0j} \equiv \hat\gamma_{N_j, N_0}$ in the following.

Proposition 3.1 (Consistency of the optimal weights). Let $\{\mathbb{P}_{N_j}\}_{j=0}^J$ be the empirical measures corresponding to the data $\{X_{1j}, \dots, X_{N_j j}\}_{j=0}^J$, which are independent and identical draws from $P_j$, respectively, and are supported on some common latent probability space $(\Omega, \mathcal{A}, P)$. Assume all $P_j$ have finite second moments. As $N_j \to \infty$ for all $j \in \llbracket J \rrbracket$, the corresponding optimal weights $\lambda^*_N = (\lambda^*_{N_1}, \dots, \lambda^*_{N_J}) \in \Delta^J$ obtained via

$$\lambda^*_N := \arg\min_{\lambda \in \Delta^J} \Big\| \sum_{j=1}^J \lambda_j \hat b_{0j} - \mathrm{Id} \Big\|^2_{L^2(\mathbb{P}_{N_0})} \tag{3.1}$$

satisfy

$$P\big( \| \lambda^*_N - \lambda^* \| > \varepsilon \big) \to 0 \quad \text{for all } \varepsilon > 0,$$

where $\lambda^*$ solves equation 2.6.

This consistency result directly implies consistency of the optimal weights in case the optimal transport problems between $\mathbb{P}_{N_0}$ and each $\mathbb{P}_{N_j}$ are achieved by optimal transport maps $\nabla\hat\varphi_{N_j}$. We also have a consistency result for the empirical counterparts $P_{\pi,N}$ of the optimal projection $P_\pi$.

Corollary 3.1 (Consistency of the optimal projections). In the setting of Proposition 3.1, the estimated projections $P_{\pi,N}$ converge weakly in probability to the projection $P_\pi$ as $N_j \to \infty$ for all $j \in \llbracket J \rrbracket$.

Proposition 3.1 and Corollary 3.1 hold in all generality and without any assumptions on the corresponding measures $P_j$, except that they possess finite second moments.
To get stronger results, for instance parametric rates of convergence, one needs to make strong regularity assumptions on the measures $P_j$. Without these, the rate of convergence of optimal transport maps in terms of expected square loss is as slow as $n^{-2/d}$ (Hütter & Rigollet, 2021). Under such additional regularity conditions, the results for the asymptotic properties are standard, because the proposed method reduces to a classical semiparametric estimation problem, as the weights $\lambda_j$ are finite-dimensional.

4.1. MIXTURES OF GAUSSIANS

We consider mixtures of Gaussians in dimension $d = 10$. We draw from the following Gaussians: $X_j \sim \mathcal{N}(\mu_j, \Sigma)$, $j = 0, 1, 2, 3$, where $\mu_0 = [10, 10, \dots, 10]$, $\mu_1 = [50, 50, \dots, 50]$, $\mu_2 = [200, 200, \dots, 200]$, $\mu_3 = [-50, -50, \dots, -50]$ and $\Sigma = \mathrm{Id}_{10} + 0.8\, \mathrm{Id}^-_{10}$, with $\mathrm{Id}^-_{10}$ the $10 \times 10$ matrix with zeros on the main diagonal and ones on all off-diagonal terms. We then define the following mixtures, with $Y_0$ as target and $Y_1$, $Y_2$, and $Y_3$ as controls:

$$Y_0 = 0.7 X_0 + 0.15 X_1 + 0.15 X_2, \quad Y_1 = 0.6 X_0 + 0.3 X_1 + 0.1 X_2, \quad Y_2 = 0.7 X_1 + 0.2 X_2 + 0.1 X_3, \quad Y_3 = 0.3 X_0 + 0.1 X_2 + 0.6 X_3.$$

Each sample consists of 10,000 points. The estimated optimal weights are $\lambda^* = [0.4329, 0.4002, 0.1669]$, which define the corresponding projection of $Y_0$. With only 3 control units, it is not possible to perfectly replicate the entire target distribution. Still, in Figure 2, the optimal projection approximates $Y_0$ reasonably well. Moreover, the weights are non-sparse in this case, indicating that the target $P_{Y_0}$ lies inside the geodesic convex hull of the control measures. In many real-world applications we observe sparse optimal weights; see, for instance, Section 4.3, and our application to synthetic controls in Section 5. This implies that in these settings the target lies outside the geodesic convex hull of the controls and is projected onto one of the faces.

Figure 2: Kernel density estimates of the average of all dimensions comparing target $P_{Y_0}$ (blue) and its projection $P_\pi$ (orange) onto the generalized geodesic convex hull of $\{P_{Y_1}, P_{Y_2}, P_{Y_3}\}$.
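The simulated data for this experiment could be generated along the following lines. This is a sketch under an explicit assumption: the coefficients in the definitions of $Y_0, \dots, Y_3$ are read as mixture probabilities (each draw comes from component $X_j$ with the stated probability), which is our interpretation of "mixtures" rather than something the text spells out. The helper name `sample_mixture` and the random seed are hypothetical.

```python
import numpy as np

d = 10
# Sigma = Id_10 + 0.8 * (matrix with zeros on the diagonal, ones elsewhere)
sigma = np.eye(d) + 0.8 * (np.ones((d, d)) - np.eye(d))
# Component means mu_0, ..., mu_3 (constant vectors)
means = np.array([np.full(d, m) for m in (10.0, 50.0, 200.0, -50.0)])
# Assumed reading: coefficients are mixture probabilities over X_0, ..., X_3.
mixture_weights = {
    "Y0": [0.70, 0.15, 0.15, 0.00],
    "Y1": [0.60, 0.30, 0.10, 0.00],
    "Y2": [0.00, 0.70, 0.20, 0.10],
    "Y3": [0.30, 0.00, 0.10, 0.60],
}


def sample_mixture(weights, n, rng):
    """Draw n points: pick a component per draw, then add N(0, Sigma) noise."""
    component = rng.choice(len(means), size=n, p=weights)
    noise = rng.multivariate_normal(np.zeros(d), sigma, size=n)
    return means[component] + noise


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
samples = {name: sample_mixture(w, 10_000, rng)
           for name, w in mixture_weights.items()}
```

Each entry of `samples` is an array of shape `(10000, 10)` whose rows are the atoms of the corresponding empirical measure, each carrying mass $1/10000$.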

4.2. IMAGE EXPERIMENT: MNIST

We compare our results to those from the experiment in Section 4.3 of Werenski et al. (2022). We follow the experimental procedure described therein, taking as experimental data the MNIST dataset of 28 × 28 pixel images of hand-written digits (LeCun, 1998). We show a comparison for the test case with image occlusion. We treat each normalized image matrix as a probability measure supported on a 28 × 28 grid. Figure 3 shows our results. We are able to replicate the edges and contours of the target image more clearly than both the Euclidean projection and the method described in Werenski et al. (2022). Moreover, our method replicates the overall shape of the specific handwritten number more closely than the other methods; in particular, it is the only method that correctly replicates the horizontal bar at the bottom of this particular handwritten "4".

Figure 3: Panels include the result of the method of Werenski et al. (2022), using optimal weights obtained from their method; the result from our approach, using optimal weights obtained from equation 2.6; and the target image.
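Treating an image as a probability measure on the pixel grid can be sketched as below. The function name `image_to_measure` and the toy image are hypothetical, and the sketch assumes non-negative pixel intensities. The resulting measure is purely atomic, which is one reason why a method that handles non-regular measures is needed here.

```python
import numpy as np


def image_to_measure(img):
    """Turn a non-negative 2-D intensity array into a discrete measure:
    atoms at the (row, column) positions of non-zero pixels, with masses
    proportional to the pixel intensities."""
    img = np.asarray(img, dtype=float)
    if img.ndim != 2 or (img < 0).any() or img.sum() <= 0:
        raise ValueError("expected a non-negative 2-D array with positive mass")
    mass = img / img.sum()           # normalise so that the masses sum to one
    rows, cols = np.nonzero(mass)    # keep only pixels that carry mass
    support = np.column_stack([rows, cols]).astype(float)
    return support, mass[rows, cols]


# Toy 28 x 28 "image": a short horizontal stroke plus one fainter pixel.
toy = np.zeros((28, 28))
toy[10, 5:9] = 255.0
toy[11, 8] = 127.5
support, weights = image_to_measure(toy)
```

Dropping zero-intensity pixels keeps the support small, which matters because the cost of computing optimal transport plans grows with the number of atoms.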

4.3. IMAGE EXPERIMENT: LEGO BRICKS

To examine the general properties of how our method obtains the optimal weights, we provide an application on replicating a target image of an object using images of the same object taken from different angles. We use the Lego Bricks dataset available from Kaggle, which contains approximately 12,700 images of 16 different Lego bricks in RGBA format. Our method manages to replicate the target block rather well, while only using the information of control units that look sufficiently like the target. In particular, our replication does not use information from any image of the underside of the Lego brick. In contrast, the Euclidean projection does not provide the correct rotation in the replication, and suffers from the standard blur induced by using a mixture of images.

Figure 4: In each pair of parentheses, the left entry is the optimal weight from our method and the right entry is the optimal weight from the Euclidean projection. Weights smaller than 1e-6 are reported as zero.

5. APPLICATION TO CAUSAL INFERENCE VIA SYNTHETIC CONTROLS

When analyzing the causal effect of treatment on a unit, such as that of public policies or medical interventions, there is often no comparable control unit that can capture the treated unit's underlying characteristics. The classical synthetic controls method (Abadie & Gardeazabal, 2003, Abadie et al., 2010) aims to create a suitable control unit by replicating the pre-treatment outcome trends of the treated unit, using some optimally chosen set of control units. This is achieved by projecting the observed characteristics of the target unit onto the convex hull defined by the characteristics of control units in the pre-treatment periods. The optimal weights obtained by this projection therefore describe how much each control unit contributes to the target unit's counterfactual outcome in the post-treatment period (Abadie, 2021). We apply our notion of projections to extend the classical synthetic control method to work on joint measures of several outcomes, which allows us to disentangle heterogeneous treatment effects and complements the univariate method introduced in Gunsilius (2022). As a demonstration, we study the effect of health insurance coverage following state-level Medicaid expansion in Montana in 2016. The variables of interest are Medicaid coverage, employment status, log wages, and log hours worked. For control units, we use the twelve states for which such expansion has never occurred: Alabama, Florida, Georgia, Kansas, Mississippi, North Carolina, South Carolina, South Dakota, Tennessee, Texas, Wisconsin, and Wyoming. Additional information can be found in Appendix C. We estimate "synthetic Montana", i.e. Montana had it not adopted Medicaid expansion, by estimating the optimal weights $\lambda^*$: we solve equation 2.6 over the joint distribution of the four outcomes across the period from 2010 to 2016, which generates measures in $d = 28$ dimensions (four outcomes in each of seven years).
We note that we estimate a single set of optimal weights, with one weight for each control state, over the entire time period. We then estimate the counterfactual joint distribution using data from 2017 to 2020, by using the optimal weights $\lambda^*$ and computing the weighted barycenter (Agueh & Carlier, 2011) of the control states using these weights. Details of sample selection and estimating "synthetic Montana" are described in Appendix C. The results of the general causal effect of the Medicaid expansion policy in Montana averaged over the years 2017 to 2020 are illustrated in Figure 5.

6. CONCLUSION

We have developed a projection method between sets of probability measures supported on $\mathbb{R}^d$ based on the tangent cone structure of the 2-Wasserstein space. Our method seeks to best approximate some general target measure using some chosen set of control measures. In particular, it provides a global (and in most cases unique) optimal solution. Our application to evaluating the first- and second-order effects of Medicaid expansion in Montana via an extension of the synthetic controls estimator (Abadie & Gardeazabal, 2003, Abadie et al., 2010) demonstrates the utility of a method that is applicable to general probability measures. The method still works without restricting the optimal weights to lie in the unit simplex, which would allow for extrapolation beyond the convex hull of the control units, providing a notion of tangential regression. It can also be extended to a continuum of measures, using established consistency results for barycenters (e.g. Le Gouic & Loubes, 2017).

APPENDIX A WASSERSTEIN BARYCENTERS AND THE SPECIAL CASE OF A REGULAR TARGET MEASURE

The natural approach to defining projections on $\mathcal{W}_2$ is to work on the manifold directly. As mentioned in the main text, this leads to a bilevel optimization problem, based on the notion of barycenters in Wasserstein space (Agueh & Carlier, 2011, Carlier & Ekeland, 2010):

$$P(\lambda) = \arg\min_{P \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{j=1}^J \frac{\lambda_j}{2} W_2^2(P, P_j).$$

With this definition, and assuming that the barycenter $P(\lambda)$ is unique for given $\lambda$, the bilevel projection problem reads:

$$\lambda^* \in \arg\min_{\lambda \in \Delta^J} W_2(P_0, P(\lambda)), \quad \text{where} \quad P(\lambda) = \arg\min_{P \in \mathcal{P}_2(\mathbb{R}^d)} \sum_{j=1}^J \frac{\lambda_j}{2} W_2^2(P, P_j). \tag{A.1}$$

A version of this approach is used in Bonneel et al. (2016) to define a notion of regression between probability measures in low dimensions. The challenges here are mathematical and computational. Importantly, the optimal weights $\lambda^*$ need not be unique. This is not an issue for the applications considered in Bonneel et al. (2016), like color transport; however, it is important in statistical settings when the weights convey information used in further procedures, like causal inference via synthetic controls, where the optimal weights are used to construct a counterfactual outcome of a treated unit had it not been treated (Abadie & Gardeazabal, 2003, Abadie et al., 2010, Abadie, 2021). Moreover, the bilevel optimization structure makes solving the problem prohibitively costly in higher dimensions. Bonneel et al. (2016) introduce a gradient descent approach based on an entropy-regularized analogue of $W_2$ (Cuturi, 2013, Peyré & Cuturi, 2019) that can be implemented in low-dimensional settings. Other approaches like Werenski et al. (2022) introduce a tangential approach, but under strong assumptions on the involved measures: they need to be absolutely continuous with densities bounded away from zero on their support, and in particular the target measure must be known to lie inside the convex hull of the other measures.
A starting point for this is to consider a characterization of the barycenter $P(\lambda)$ for fixed weights of a set $\{P_j\}_{j \in \llbracket J \rrbracket}$ in regular tangent spaces. Agueh & Carlier (2011, Equation (3.10)) show that if at least one of the measures is absolutely continuous with respect to Lebesgue measure, then $P(\lambda)$ can be characterized via

$$\sum_{j=1}^J \lambda_j \nabla\tilde\varphi_j - \mathrm{Id} = 0, \tag{A.2}$$

where $\{\nabla\tilde\varphi_j\}_{j \in \llbracket J \rrbracket}$ are the optimal transport maps from the barycenter to the respective measures $P_j$, i.e. $(\nabla\tilde\varphi_j)_\# P(\lambda) = P_j$. Each term of the summand in equation A.2 is an element in $T_{P(\lambda)}\mathcal{W}_2(\mathbb{R}^d)$ by construction. More generally, the condition equation A.2 is a sufficient condition for $P(\lambda)$ to be a "Karcher mean" (Karcher, 2014) in $\mathcal{W}_2$ (Zemel & Panaretos, 2019). In fact, a "Karcher mean" of a set of measures $\{P_j\}_{j \in \llbracket J \rrbracket}$ is defined as a stationary point of the Fréchet functional in $\mathcal{W}_2$, that is, a measure at which the gradient of this functional vanishes, and is characterized through equation A.2 holding $P(\lambda)$-almost everywhere. Equation A.2 is a stronger condition because it is assumed to hold at every point in the support of $P(\lambda)$, not just almost every point. Álvarez-Esteban et al. (2016) use this characterization to introduce a fixed-point approach to compute Wasserstein barycenters, and Werenski et al. (2022) use this structure to introduce a replication approach for absolutely continuous measures whose densities are bounded away from zero and whose target measure lies inside the convex hull of the control measures. Related is the recent definition of weak barycenters in Cazelles et al. (2021), where the authors replace the optimal transport maps from the classical optimal transport problem by the weak optimal transport problem introduced in Gozlan et al. (2017). Heuristically, this characterization is that of a deformable template. A measure $P$ is a deformable template if there exists a set of deformations $\{\psi_j\}_{j=1,\dots,J}$ such that $(\psi_j)_\# P = P_j$, in a way that their weighted average is "as close to the identity" as possible.
In our setting $\psi_j \equiv \nabla\varphi_j - \mathrm{Id}$ (Anderes et al., 2015, Boissard et al., 2015, Yuille, 1991). In our setting of interest, our tangential projection reduces to

$$\lambda^* := \arg\min_{\lambda \in \Delta^J} \Big\| \sum_{j=1}^J \lambda_j \nabla\varphi_j - \mathrm{Id} \Big\|^2_{L^2(P_0)}, \tag{A.3}$$

where $\nabla\varphi_j$ are the optimal transport maps between the target $P_0$ and the control measures $P_j$, $j \in \llbracket J \rrbracket$. In contrast to Werenski et al. (2022), the target measure does not need to lie inside the convex hull of the other measures. Based on these definitions we can show that our approach is a projection of the target $P_0$ onto $\mathrm{Co}_{P_0}\big(\{P_j\}_{j=1}^J\big)$ in the case where $P_0$ is regular.

Proposition A.1. Consider a regular target measure $P_0$ and a set $\{P_j\}_{j \in \llbracket J \rrbracket}$ of general control measures. Construct the measure $P_\pi$ as

$$P_\pi := \exp_{P_0}\Big( \sum_{j=1}^J \lambda^*_j (\nabla\varphi_j - \mathrm{Id}) \Big),$$

where the optimal weights $\lambda^* \in \Delta^J$ are obtained by solving equation A.3 and $\nabla\varphi_j$ are the optimal maps transporting $P_0$ to $P_j$, respectively. Then $P_\pi$ is the unique metric projection of $P_0$ onto $\mathrm{Co}_{P_0}\big(\{P_j\}_{j=1}^J\big)$.
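To make explicit why equation A.3 can be solved as a constrained linear regression, the following worked identity (added for illustration; the symbols $v_j$ and $G$ are introduced here and do not appear in the original) rewrites the objective as a convex quadratic form in $\lambda$. It only uses the fact that the weights sum to one on the simplex.

```latex
% Tangent vectors and their Gram matrix in L^2(P_0):
v_j := \nabla\varphi_j - \mathrm{Id},
\qquad
G_{jk} := \int_{\mathbb{R}^d} \langle v_j(x), v_k(x) \rangle \, dP_0(x).

% Since \sum_{j=1}^J \lambda_j = 1 for \lambda \in \Delta^J:
\Big\| \sum_{j=1}^J \lambda_j \nabla\varphi_j - \mathrm{Id} \Big\|^2_{L^2(P_0)}
  = \Big\| \sum_{j=1}^J \lambda_j v_j \Big\|^2_{L^2(P_0)}
  = \sum_{j=1}^J \sum_{k=1}^J \lambda_j \lambda_k G_{jk}
  = \lambda^\top G \lambda .

% Hence (A.3) is the convex quadratic program
\lambda^* \in \arg\min_{\lambda \in \Delta^J} \lambda^\top G \lambda ,
% with G positive semidefinite. The minimizing element
% \sum_j \lambda^*_j \nabla\varphi_j is always unique, and \lambda^* itself
% is unique whenever G is positive definite.
```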

B PROOFS

Proof of Proposition A.1. Define the following closed and convex subset $C \subseteq L^2(P_0)$ for fixed optimal transportation maps between $P_0$ and $P_j$, denoted $\nabla\varphi_j$:

$$C := \Big\{ f \in L^2(P_0) : f = \sum_{j=1}^J \lambda_j \nabla\varphi_j \text{ for some } \lambda \in \Delta^J \Big\}.$$

Recall that the transport maps $\nabla\varphi_j$ exist since $P_0$ is regular. Using $C$, we can rewrite equation A.3 as

$$\arg\min_{\lambda \in \Delta^J} \Big\| \sum_{j=1}^J \lambda_j \nabla\varphi_j - \mathrm{Id} \Big\|^2_{L^2(P_0)} = \arg\min_{f \in C} \| f - \mathrm{Id} \|^2_{L^2(P_0)},$$

which by definition is the metric projection of $\mathrm{Id}$ onto $C$, denoted $\pi_C$. Since $C$ is a non-empty closed and convex subset of the Hilbert space $L^2(P_0)$, this metric projection exists and is unique (Aliprantis & Border, 1999, Theorem 6.53). Moreover, if $\mathrm{Id} \in C$, then $\pi_C = \mathrm{Id}$; otherwise, $\pi_C \in \partial C$, where $\partial C$ is the boundary of $C$ (Aliprantis & Border, 1999, Lemma 6.54). Since $P_0$ is regular, the exponential map is continuous. In fact, for every $j \neq k$,

$$W_2^2(P_j, P_k) = W_2^2\big( (\nabla\varphi_j)_\# P_0, (\nabla\varphi_k)_\# P_0 \big) \le \int_{\mathbb{R}^d} \| \nabla\varphi_j - \nabla\varphi_k \|^2 \, dP_0(x).$$

In other words, the distance between $P_j$ and $P_k$ in $\mathcal{W}_2(\mathbb{R}^d)$ is no larger than that between the corresponding elements $\nabla\varphi_j, \nabla\varphi_k$ in the tangent space. This implies continuity of the exponential map. Furthermore, in this regular setting, the exponential map sends convex sets in $T_{P_0}\mathcal{W}_2$ to generalized geodesically convex sets in $\mathcal{W}_2$. Mechanically, for any two (scaled) elements $t(\nabla\varphi_j - \mathrm{Id})$ and $s(\nabla\varphi_k - \mathrm{Id})$ in $T_{P_0}\mathcal{W}_2$, and any $\rho \in [0, 1]$,

$$\begin{aligned}
\exp_{P_0}\big( \rho t(\nabla\varphi_j - \mathrm{Id}) + (1 - \rho)s(\nabla\varphi_k - \mathrm{Id}) \big)
&= \exp_{P_0}\big( (\rho t \nabla\varphi_j + (1 - \rho)s \nabla\varphi_k) - (\rho t + (1 - \rho)s)\, \mathrm{Id} \big) \\
&= \exp_{P_0}\Big( \bar\rho \Big( \frac{\rho t}{\bar\rho} \nabla\varphi_j + \frac{(1 - \rho)s}{\bar\rho} \nabla\varphi_k - \mathrm{Id} \Big) \Big) \\
&= \big( \rho t \nabla\varphi_j + (1 - \rho)s \nabla\varphi_k + (1 - \bar\rho)\, \mathrm{Id} \big)_\# P_0 \\
&= \big( \rho t(\nabla\varphi_j - \mathrm{Id}) + (1 - \rho)s(\nabla\varphi_k - \mathrm{Id}) + \mathrm{Id} \big)_\# P_0,
\end{aligned}$$

where $\bar\rho := \rho t + (1 - \rho)s$. This is a generalized geodesic connecting $P_j$ and $P_k$, via the optimal transport maps between them and $P_0$ (Ambrosio et al., 2008, section 9.2). The same argument holds when extending generalized geodesics to generalized barycenters by taking convex combinations of more than two measures rather than a binary interpolation with respect to $\rho$.
Mechanically, for any $\lambda \in \Delta^J$ and $t_j > 0$ for all $j = 1, \dots, J$,

$$\begin{aligned}
\exp_{P_0}\Big(\sum_{j=1}^J \lambda_j t_j (\nabla\varphi_j - \mathrm{Id})\Big)
&= \exp_{P_0}\Big(\sum_{j=1}^J \lambda_j t_j \nabla\varphi_j - \sum_{j=1}^J \lambda_j t_j \,\mathrm{Id}\Big) \\
&= \exp_{P_0}\Big(\rho_J \Big(\sum_{j=1}^J \frac{\lambda_j t_j}{\rho_J} \nabla\varphi_j - \mathrm{Id}\Big)\Big) \\
&= \Big(\Big(\sum_{j=1}^J \lambda_j t_j \nabla\varphi_j\Big) + (1 - \rho_J)\,\mathrm{Id}\Big)_\# P_0 \\
&= \Big(\Big(\sum_{j=1}^J \lambda_j t_j (\nabla\varphi_j - \mathrm{Id})\Big) + \mathrm{Id}\Big)_\# P_0,
\end{aligned}$$

where $\rho_J := \sum_{j=1}^J \lambda_j t_j$. This proves the exponential map is generalized geodesically convex. From above it follows that $P_\pi := \exp_{P_0}(\pi_C)$ is either in the interior of $C$, which is the case if $\mathrm{Id} \in C$, or on its boundary: since $\pi_C \in \partial C$, $\exp_{P_0}(\pi_C) \in \exp_{P_0}(\partial C)$. By continuity of the exponential map it follows that $\exp_{P_0}(\partial C) = \partial \exp_{P_0}(C)$. Combining all steps above shows that $P_\pi$ is a geodesic metric projection of $P_0$ onto the geodesic convex hull of $\{P_j\}_{j=1}^J$.

Proof of Proposition 2.1. The result follows from the same argument as the proof of Proposition A.1. Theorem 12.4.4 in Ambrosio et al. (2008) shows that $T_{P_0}\mathcal{W}_2$ is the image of the barycentric projection of measures in the general tangent cone: $b_\gamma(x)$ is an optimal transport map if $\gamma$ is an optimal transport plan. But the exponential map satisfies $\exp_{P_0}(v) = (v + \mathrm{Id})_\# P_0$ for all $v \in T_{P_0}\mathcal{W}_2$. This implies that

$$P_\pi := \exp_{P_0}\Big(\sum_{j=1}^J \lambda_j^* b_{\gamma_{0j}} - \mathrm{Id}\Big) = \Big(\sum_{j=1}^J \lambda_j^* b_{\gamma_{0j}}\Big)_\# P_0 \in \mathrm{Co}_{P_0}\big(\{P_j\}_{j=1}^J\big),$$

since convex combinations of elements in the subgradients of convex functions lie in the subgradient of a convex function (provided the subgradient of each convex function is nonempty, which is the case here). Then the continuity and generalized convexity of the exponential map for elements in the regular tangent space $T_{P_0}\mathcal{W}_2$ imply the result.

Proof of Proposition 3.1. We split the proof into two parts. In the first part we prove the convergence in probability of the family of objective functions in equation 3.1 to their population counterparts in equation 2.6 if the empirical measures $P_{N_j}$ converge weakly in probability to the population measures $P_j$.
In the second step we use the fact that $\lambda^*$ is a classical semiparametric estimator (Andrews, 1994; Newey & McFadden, 1994) to derive the convergence of the weights.

Step 1: Convergence of the objective functions

To show the convergence of the objective functions for obtaining the weights $\lambda^*$, we write

$$\Big\|\sum_{j=1}^J \lambda_j b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_0)} - \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_{N_0})} = \int \Big\|\sum_{j=1}^J \lambda_j b_{0j}(x) - x\Big\|^2 dP_0 - \int \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j}(x) - x\Big\|^2 dP_{N_0}.$$

We hence want to show that

$$\lim_{\wedge_j N_j \to \infty} \Big| \int \Big\|\sum_{j=1}^J \lambda_j b_{0j}(x) - x\Big\|^2 dP_0(x) - \int \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j}(x) - x\Big\|^2 dP_{N_0}(x) \Big| = 0,$$

where $\wedge_j N_j \equiv \min\{N_0, \dots, N_J\}$. We split the result into two parts. The first part shows that

$$\liminf_{\wedge_j N_j \to \infty} \int_{\mathbb{R}^d} \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j}(x_0) - x_0\Big\|^2 dP_{N_0}(x_0) \ge \int_{\mathbb{R}^d} \Big\|\sum_{j=1}^J \lambda_j b_{0j}(x_0) - x_0\Big\|^2 dP_0(x_0).$$

In the second part we use the $L^2(P_0)$ convergence of the barycentric projections to prove that the limit exists and coincides with the limit inferior.

For the first part, we have

$$\liminf_{\wedge_j N_j \to \infty} \int_{\mathbb{R}^d} \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j}(x_0) - x_0\Big\|^2 dP_{N_0}(x_0) = \liminf_{\wedge_j N_j \to \infty} \int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\hat\gamma_N(x_0, x_1, \dots, x_J),$$

where $\hat\gamma_N(x_0, x_1, \dots, x_J)$ is a measure that solves

$$\min\Big\{ \int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\gamma : \gamma \in \Gamma^1(\hat\gamma_{01}, \dots, \hat\gamma_{0J}) \Big\},$$

and $\hat\gamma_{0j}$ are the optimal couplings between $P_{N_0}$ and $P_{N_j} := (\hat b_{0j})_\# P_{N_0}$. Since all measures are defined on the complete and separable space $\mathbb{R}^d$, and by the assumption of finite second moments, i.e.

$$\max_{j} \sup_{N_j} \int \|x_j - x_0\|^2 \, d\hat\gamma_{0j} < +\infty,$$

it holds that each sequence $\hat\gamma_{0j}$ is tight by Ulam's theorem (Dudley, 2018, Theorem 7.1.4). Using the fact that $\lambda \in \Delta^J$ and $\hat\gamma_N \in \Gamma^1(\hat\gamma_{01}, \dots, \hat\gamma_{0J})$, applying Jensen's inequality gives us

$$\max_{j} \sup_{N_j} \int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\hat\gamma_N \le \max_{j} \sup_{N_j} \sum_{j=1}^J \lambda_j \int_{(\mathbb{R}^d)^2} \|x_j - x_0\|^2 \, d\hat\gamma_{0j} < +\infty,$$

which implies that $\hat\gamma_N$ is tight. By Prokhorov's theorem, there exists a subsequence $\hat\gamma_{N_k}$ that converges weakly to a limit measure $\gamma$.
Therefore, by the continuity of the map $(x_0, x_j) \mapsto \big\|\sum_j \lambda_j x_j - x_0\big\|$, it follows from classical convergence results (Ambrosio et al., 2008, Lemma 5.1.12(d)) that

$$\liminf_{\wedge_j N_j \to \infty} \int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\hat\gamma_N(x_0, x_1, \dots, x_J) = \int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\gamma(x_0, \dots, x_J).$$

Furthermore, by the same argument via Jensen's inequality, i.e.,

$$\int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\gamma(x_0, \dots, x_J) \le \sum_{j=1}^J \int_{(\mathbb{R}^d)^2} \lambda_j \|x_j - x_0\|^2 \, d\gamma_{0j}(x_0, x_j) < +\infty,$$

it follows that the limit $\gamma \in \Gamma^1(\gamma_{01}, \dots, \gamma_{0J})$. Now note that by the definition of disintegration it follows that (Ambrosio et al., 2008, Lemma 5.3.2)

$$\gamma \in \Gamma^1(\gamma_{01}, \dots, \gamma_{0J}) \iff \gamma_{x_0} \in \Gamma\big(\gamma_{1|x_0}, \dots, \gamma_{J|x_0}\big),$$

where $\gamma = \int \gamma_{x_0} \, dP_0(x_0)$ and $\gamma_{0j} = \int \gamma_{j|x_0} \, dP_0(x_0)$ are the disintegrations of $\gamma$ and $\gamma_{0j}$ with respect to $P_0$, respectively. Therefore, we have

$$\begin{aligned}
&\int_{(\mathbb{R}^d)^{J+1}} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\gamma(x_0, \dots, x_J) \\
&\quad= \int_{\mathbb{R}^d} \int_{(\mathbb{R}^d)^J} \Big\|\sum_{j=1}^J \lambda_j x_j - x_0\Big\|^2 d\gamma_{x_0}(x_1, \dots, x_J) \, dP_0(x_0) \\
&\quad\ge \int_{\mathbb{R}^d} \Big\| \int_{(\mathbb{R}^d)^J} \Big(\sum_{j=1}^J \lambda_j x_j - x_0\Big) d\gamma_{x_0}(x_1, \dots, x_J) \Big\|^2 dP_0(x_0) \\
&\quad= \int_{\mathbb{R}^d} \Big\| \sum_{j=1}^J \lambda_j \int_{(\mathbb{R}^d)^J} x_j \, d\gamma_{x_0}(x_1, \dots, x_J) - x_0 \Big\|^2 dP_0(x_0) \\
&\quad= \int_{\mathbb{R}^d} \Big\| \sum_{j=1}^J \lambda_j \int_{\mathbb{R}^d} x_j \, d\gamma_{j|x_0}(x_j) - x_0 \Big\|^2 dP_0(x_0) \\
&\quad= \int_{\mathbb{R}^d} \Big\| \sum_{j=1}^J \lambda_j b_{0j}(x_0) - x_0 \Big\|^2 dP_0(x_0),
\end{aligned}$$

where the third line follows from Jensen's inequality and the fifth line from $\gamma_{x_0} \in \Gamma\big(\gamma_{1|x_0}, \dots, \gamma_{J|x_0}\big)$. This shows the first part.

For the second part we use the fact that each barycentric projection $\hat b_{0j}(x_0)$ is an optimal transport map between $P_{N_0}$ and $P_{N_j}$ if $\hat\gamma_{0j}$ is an optimal transport plan between $P_{N_0}$ and $P_{N_j}$, which follows from Theorem 12.4.4 in Ambrosio et al. (2008). As before, we know that $(\hat b_{0j})_\# P_{N_0}$ is a tight sequence that converges to some $P_j$. By definition and the fact that $\hat b_{0j}$ is the gradient of a convex function between $P_{N_0}$ and $P_{N_j}$, $\hat b_{0j}$ is the unique optimal transport map between $P_{N_0}$ and $P_{N_j}$ for all $N_j$ and all $j$.
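For discrete measures, the barycentric projection used throughout this argument is a row-wise conditional mean of the coupling. The sketch below assumes a hypothetical coupling matrix between a two-atom target and a three-atom control in one dimension; the helper name `barycentric_projection` is an assumption of this example. It also evaluates both sides of the Jensen-type inequality between the second moment of the plan and that of the projected map.

```python
def barycentric_projection(coupling, targets):
    """Return b(x_i) = sum_k coupling[i][k] * y_k / sum_k coupling[i][k].

    Each row of `coupling` holds the mass sent from source atom x_i to the
    control atoms y_k in `targets`; rows are assumed to have positive mass.
    """
    projected = []
    for row in coupling:
        mass = sum(row)
        projected.append(sum(g * y for g, y in zip(row, targets)) / mass)
    return projected


# Hypothetical coupling: source atoms carry mass 0.5 each, control atoms
# at 0, 2, 4 carry mass 0.25, 0.5, 0.25.
coupling = [[0.25, 0.25, 0.0],
            [0.0, 0.25, 0.25]]
targets = [0.0, 2.0, 4.0]

b = barycentric_projection(coupling, targets)

# Second moment of the control under the plan, and of the projected map
# under the source marginal; Jensen's inequality orders the two.
second_moment_plan = sum(g * y ** 2 for row in coupling for g, y in zip(row, targets))
second_moment_map = sum(sum(row) * bx ** 2 for row, bx in zip(coupling, b))
```

When the coupling splits mass, as here, the projected map has strictly smaller second moment than the plan; equality would hold if the plan were induced by a map.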
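Since the measures $P_j$ have finite second moments by assumption, we have

$$\begin{aligned}
\limsup_{N_0 \wedge N_j \to \infty} \int_{\mathbb{R}^d} |x_j|^2 \, dP_{N_j}
&= \limsup_{N_0 \wedge N_j \to \infty} \int_{\mathbb{R}^d} \big\|\hat b_{0j}(x_0)\big\|^2 dP_{N_0} \\
&= \limsup_{N_0 \wedge N_j \to \infty} \int_{\mathbb{R}^d} \Big\| \int_{\mathbb{R}^d} x_j \, d\hat\gamma_{j|x_0}(x_j) \Big\|^2 dP_{N_0} \\
&\le \limsup_{N_0 \wedge N_j \to \infty} \int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\hat\gamma_{0j}(x_0, x_j) \\
&= \int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\gamma_{0j}(x_0, x_j) < +\infty,
\end{aligned}$$

where the last equality follows from the tightness of $\hat\gamma_{0j}$, as shown earlier. Therefore, by standard stability results for optimal transport maps (Segers, 2022; Panaretos & Zemel, 2020), it holds that $\hat b_{0j}$ converges uniformly on every compact subset $K \subseteq \mathbb{R}^d$ in the support of the limit measure $P_j$, that is,

$$\lim_{N_0 \wedge N_j \to \infty} \sup_{x_0 \in K} \big\| \hat b_{0j}(x_0) - v_j(x_0) \big\| = 0,$$

where $v_j$ is the optimal transport map between $P_0$ and $P_j$. We now show that $v_j = b_{0j}$ $P_0$-almost everywhere. From the local uniform convergence, we can then derive "strong $L^2$-convergence" (Ambrosio et al., 2008, Definition 5.4.3) of the potentials:

$$\begin{aligned}
\limsup_{N_0 \wedge N_j \to \infty} \Big| \|\hat b_{0j}\|_{L^2(P_{N_0})} - \|v_j\|_{L^2(P_0)} \Big|
&\le \limsup_{N_0 \wedge N_j \to \infty} \Big| \|\hat b_{0j}\|_{L^2(P_{N_0})} - \|v_j\|_{L^2(P_{N_0})} \Big| + \limsup_{N_0 \to \infty} \Big| \|v_j\|_{L^2(P_{N_0})} - \|v_j\|_{L^2(P_0)} \Big| \\
&\le \limsup_{N_0 \wedge N_j \to \infty} \|\hat b_{0j} - v_j\|_{L^2(P_{N_0})} + \limsup_{N_0 \to \infty} \Big| \|v_j\|_{L^2(P_{N_0})} - \|v_j\|_{L^2(P_0)} \Big|.
\end{aligned}$$

Now the first term converges to zero by Hölder's inequality and the local uniform convergence of the optimal transport maps from above. The second term satisfies

$$\begin{aligned}
\limsup_{N_0 \to \infty} \Big| \|v_j\|_{L^2(P_{N_0})} - \|v_j\|_{L^2(P_0)} \Big|
&= \limsup_{N_0 \to \infty} \Big| \Big(\int_{\mathbb{R}^d} \|v_j(x_0)\|^2 dP_{N_0}\Big)^{1/2} - \Big(\int_{\mathbb{R}^d} \|v_j(x_0)\|^2 dP_0\Big)^{1/2} \Big| \\
&\le \limsup_{N_0 \to \infty} \Big| \int_{\mathbb{R}^d} \|v_j(x_0)\|^2 dP_{N_0} - \int_{\mathbb{R}^d} \|v_j(x_0)\|^2 dP_0 \Big|^{1/2}.
\end{aligned}$$

But since $P_0$ has finite second moments, it holds that this term also converges to zero. Based on this we can show that $\hat\gamma_{0j} \equiv (\mathrm{Id}, \hat b_{0j})$ converge weakly to $\gamma_{0j} \equiv (\mathrm{Id}, v_j)$. Indeed, if $\gamma_{0j}$ is a limit point of the sequence $\hat\gamma_{0j}$, it holds that

$$\begin{aligned}
\int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\gamma_{0j}(x_0, x_j)
&\le \liminf_{N_0 \wedge N_j \to \infty} \int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\hat\gamma_{0j}(x_0, x_j) \\
&\le \limsup_{N_0 \wedge N_j \to \infty} \int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\hat\gamma_{0j}(x_0, x_j) \\
&= \limsup_{N_0 \wedge N_j \to \infty} \int_{\mathbb{R}^d} \|\hat b_{0j}(x_0)\|^2 \, dP_{N_0}(x_0) = \int_{\mathbb{R}^d} \|v_j(x_0)\|^2 \, dP_0(x_0).
\end{aligned}$$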
Disintegrating the left-hand side with respect to $P_0$, and applying Jensen's inequality, gives

$$\int_{(\mathbb{R}^d)^2} \|x_j\|^2 \, d\gamma_{0j}(x_0, x_j) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} \|x_j\|^2 \, d\gamma_{j|x_0}(x_j) \, dP_0(x_0) \ge \int_{\mathbb{R}^d} \Big\| \int_{\mathbb{R}^d} x_j \, d\gamma_{j|x_0}(x_j) \Big\|^2 dP_0(x_0) = \int_{\mathbb{R}^d} \|b_{0j}(x_0)\|^2 \, dP_0(x_0),$$

that is,

$$\int_{\mathbb{R}^d} \|b_{0j}(x_0)\|^2 \, dP_0(x_0) \le \int_{\mathbb{R}^d} \|v_j(x_0)\|^2 \, dP_0(x_0).$$

But since $v_j$ is an optimal transport map between $P_0$ and $P_j$ by definition, it holds that

$$\int_{\mathbb{R}^d} \|b_{0j}(x_0)\|^2 \, dP_0(x_0) \ge \int_{\mathbb{R}^d} \|v_j(x_0)\|^2 \, dP_0(x_0),$$

which implies that equality holds and we have that

$$\int_{\mathbb{R}^d} \big( \|b_{0j}(x_0)\|^2 - \|v_j(x_0)\|^2 \big) \, dP_0(x_0) = 0,$$

which implies that $b_{0j} = v_j$ $P_0$-almost everywhere. We have hence shown that $(\mathrm{Id}, \hat b_{0j})_\# P_{N_0}$ converges weakly to $(\mathrm{Id}, b_{0j})_\# P_0$ for all $j$, where the barycentric projection $b_{0j}$ is the optimal transport map between $P_0$ and $P_j$ (e.g. Villani, 2003, Theorem 2.12(iii)). Moreover, we have shown "strong $L^2$-convergence" of the barycentric projections in terms of Definition 5.4.3 in Ambrosio et al. (2008). Since this holds for all $j$, it also holds for their convex combination for fixed weights $\lambda \in \Delta^J$. Putting everything together, we then have that

$$\lim_{\wedge_j N_j \to \infty} \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_{N_0})} = \Big\|\sum_{j=1}^J \lambda_j b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_0)}.$$

Since all observable measures $P_{N_j}$ are empirical measures, they converge weakly in probability (Varadarajan, 1958), which implies that

$$\lim_{\wedge_j N_j \to \infty} P\Big( \Big| \Big\|\sum_{j=1}^J \lambda_j \hat b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_{N_0})} - \Big\|\sum_{j=1}^J \lambda_j b_{0j} - \mathrm{Id}\Big\|^2_{L^2(P_0)} \Big| > \varepsilon \Big) = 0$$

for all $\varepsilon > 0$. This shows convergence in probability of the objective function for fixed $\lambda$.

Step 2: Convergence of the optimal weights $\lambda^*_N$

The convergence of the optimal weights now follows from standard consistency results in semiparametric estimation. In particular, the objective functions are all convex for any $\lambda \in \mathbb{R}^J$, which implies that they converge uniformly on any compact set (Rockafellar, 1970, Theorem 10.8), so the objective function converges uniformly on $\Delta^J$. A standard consistency result like Theorem 2.1 in Newey & McFadden (1994) then implies that

$$\lim_{\wedge_j N_j \to \infty} P\big( \|\lambda^*_N - \lambda^*\| > \varepsilon \big) = 0 \quad \text{for all } \varepsilon > 0,$$

which is what we wanted to show. Note that the result can also be shown if we allow the weights $\lambda$ to be negative, i.e., if we only require that $\sum_{j=1}^J \lambda_j = 1$. In this case, the fact that the objective functions are convex and coercive implies that an optimal $\lambda^*$ will be achieved at the interior of the extended Euclidean space, from which consistency follows by Theorem 2.7 in Newey & McFadden (1994).

Proof of Corollary 3.1. We want to show that

$$\lim_{\wedge_j N_j \to \infty} \int_{\mathbb{R}^d} f\Big(x_0, \sum_{j=1}^J \lambda^*_{N_j} \hat b_{0j}(x_0) - x_0\Big) dP_{N_0}(x_0) = \int_{\mathbb{R}^d} f\Big(x_0, \sum_{j=1}^J \lambda^*_j b_{0j}(x_0) - x_0\Big) dP_0(x_0)$$

for any continuous function such that $|f(x_0)| \le C_1 + C_2 |x_0 - \bar x_0|^2$ for all $x_0$ in the support of $P_0$, where $C_1, C_2 < +\infty$ are some constants and $\bar x_0$ is some element in the support of $P_0$ (Ambrosio et al., 2008, equation (5.1.21)). In particular, this holds for any bounded and continuous function $f$, which implies that

$$\lim_{\wedge_j N_j \to \infty} \int_{\mathbb{R}^d} f\Big(\sum_{j=1}^J \lambda^*_{N_j} \hat b_{0j}(x_0)\Big) dP_{N_0}(x_0) = \int_{\mathbb{R}^d} f\Big(\sum_{j=1}^J \lambda^*_j b_{0j}(x_0)\Big) dP_0(x_0)$$

for any bounded and continuous function, which implies that and analogously for their empirical counterparts $g_N$. Note that $g$ and $g_N$ are non-random functions if the measures $P_j$ and $P_{N_j}$ are non-random themselves for all $j = 1, \dots, J$. Moreover, by definition, $g$ and $g_N$ are continuous maps because $\sum_{j=1}^J \lambda^*_j b_{0j}$ are gradients of convex functions, which are continuous $P_0$-almost everywhere; the same holds for their empirical counterparts.
Now from what we have shown above and in Proposition 3.1, it holds that $g_N(P_{N_0}, \dots, P_{N_J}) \to g(P_0, \dots, P_J)$ as $P_{N_j}$ converge weakly to $P_j$. Since $\{P_{N_j}\}_{j=1}^J$ are the only random elements in $\times_{j=0}^J \big(\mathcal{P}_2(\mathbb{R}^d), d_j\big)$ here, the extended continuous mapping theorem implies that

$$\lim_{\wedge_j N_j \to \infty} P\big( d\big(g_N(P_{N_0}, \dots, P_{N_J}), g(P_0, \dots, P_J)\big) > \varepsilon \big) = 0 \quad \text{for all } \varepsilon > 0,$$

which is what we wanted to show.

C DETAILS OF MEDICAID EXPANSION APPLICATION

We use the ACS data with harmonized variables made available by IPUMS, a unified source of Census and survey data collected around the world. The data are at the household-person-year level. For our application, we select the household head and the spouse as our unit of analysis. The continuous outcomes are adjusted using the person-level sample weights available in the data. We adopt the sample restriction criteria itemized in this appendix. We randomly select N = 1500 observations from each unit for estimating $\lambda^*$. In the Python implementation, we face a numerical challenge: if the entries of the target and control data are large enough, the objective in equation A.3 becomes too large in magnitude for CVXPY to compute an optimal solution. Therefore, we introduce a stabilizing constant to prevent this. This stabilizing constant is determined by the mean value and dimensions of the target distribution, and the number of controls. The weights we obtained are sparse and are displayed in Table 2.

We check whether the obtained weights are fit for creating synthetic Montana by examining whether they approximate actual Montana well in the pre-treatment period. As seen in Figures 6 and 7, our projection is similar to the actual data. Once we obtain the optimal weights $\lambda^*$, we estimate the counterfactual outcomes of interest for the four years after Medicaid expansion in Montana (namely, between 2017 and 2020). This involves solving equation A.1 with $\lambda^*$ obtained from the pre-intervention period. Implementation-wise, we computed the free-support barycenter using the POT package; this does not fix the support of the barycenter a priori, and allows it to be different from those of the control distributions. We plot the densities and distributions of the counterfactual outcomes in Figure 5 of the main text.

To perform inference on the estimated causal effect, we use a placebo permutation test in analogy to Abadie et al. (2010) and Gunsilius (2022).
The idea is to repeatedly apply the procedure described above to each control unit, pretending in turn each control unit is the treated unit. Post-intervention, if an actual treatment effect only appears in the treatment unit (Montana, in this application), then the estimated effect for the actual treatment unit should be among the largest. We plot the 2-Wasserstein distance between the treated, joint distribution of all outcomes and the pre-/post-intervention optimal projection (i.e. equation A.1 with λ * ). We present two sets of results in Figure 8 : in panel (A), the optimal projection is computed using λ * estimated using all years in the pre-intervention period; in panel (B), the λ * used is constructed from taking a simple average of weights estimated in each year of the pre-intervention period. Our results suggest that the estimated causal effect is valid in the post-intervention period, as we consistently observe the largest difference coming from Montana, especially from 2017-2019. The effect is less pronounced in 2020, however.
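On the stabilizing constant mentioned earlier in this appendix for the CVXPY implementation: its exact form is not reproduced here, so the sketch below only illustrates the general principle that dividing the target and all control data by a common positive constant rescales the objective in equation A.3 without changing its minimizer. The constant used (the mean of the target) and the helper `two_control_weight`, a closed-form solution for two one-dimensional controls, are assumptions of this example and not the authors' implementation.

```python
def two_control_weight(x, t1, t2):
    """Minimiser over lam in [0, 1] of sum_i (lam*t1_i + (1-lam)*t2_i - x_i)^2.

    x holds the target atoms; t1 and t2 hold the images of those atoms under
    the two (hypothetical) transport maps.
    """
    num = sum((a - b) * (c - b) for a, b, c in zip(t1, t2, x))
    den = sum((a - b) ** 2 for a, b in zip(t1, t2))
    return min(1.0, max(0.0, num / den))  # clip the unconstrained solution


# Hypothetical data with large entries, e.g. unscaled monetary outcomes.
x = [2e6, 4e6, 6e6]
t1 = [1e6, 3e6, 5e6]
t2 = [5e6, 7e6, 9e6]

raw = two_control_weight(x, t1, t2)

# Rescale everything by one positive constant (here: the target mean).
c = sum(x) / len(x)
scaled = two_control_weight([v / c for v in x],
                            [v / c for v in t1],
                            [v / c for v in t2])
```

Both calls return the same weight, while the scaled problem has objective values smaller by a factor of $c^2$, which is the kind of conditioning improvement a numerical solver benefits from.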



Figure 3: Left to right: occluded image; Euclidean projection; result from Werenski et al. (2022), using optimal weights obtained from their method; result from our approach, using optimal weights obtained from equation 2.6; target image.

Figure 4: Top row: target block, Euclidean projection, and projection from our method. Middle and bottom rows: control units used in simulation. The left entry in parentheses is the optimal weight from our method; the right entry is the optimal weight from the Euclidean projection. Weights are reported as zero if they are less than 1e-6.

Consistent with findings in Courtemanche et al. (2017) and Mazurenko et al. (2018), we find significant first- and second-order effects of Medicaid expansion, which are summarized in the top row and the bottom row of Figure 5, respectively. "Synthetic Montana" has a much lower proportion of individuals insured under Medicaid, suggesting that expanding Medicaid eligibility directly affects the extensive margin of Medicaid enrollment. The disemployment effect is less pronounced in comparison to the enrollment effect we estimated, but nonetheless positive and nontrivial, consistent with the findings in, e.g., Peng et al. (2020), but inconsistent with those in, e.g., Gooptu et al. (2016). We also find positive second-order effects, summarized in the bottom row of Figure 5. Additional details are in Appendix C.

Figure 5: Counterfactual (blue) vs actual (orange) Montana from 2017 to 2020. In the bottom row, histograms of data distributions are shown on the left, and cumulative distribution functions are shown on the right.

$\big(\sum_{j=1}^J \lambda^*_{N_j} \hat b_{0j}\big)_\# P_{N_0}$ converges weakly in probability to $\big(\sum_{j=1}^J \lambda^*_j b_{0j}\big)_\# P_0$, where $\lambda^*_N \equiv (\lambda^*_{N_1}, \dots, \lambda^*_{N_J})$ and $\lambda^*$ are the optimal weights obtained in equation 3.1 and equation 2.6, respectively. The result follows by applying the extended continuous mapping theorem (van der Vaart & Wellner, 2013, Theorem 1.11.1) as follows. As shown in the proof of Proposition 3.1 we have "strong $L^2$-convergence" of the maps $\sum_{j=1}^J \lambda^*_{N_j} \hat b_{0j} - \mathrm{Id}$ to $\sum_{j=1}^J \lambda^*_j b_{0j} - \mathrm{Id}$. Therefore, by Theorem 5.4.4 (iii) in Ambrosio et al. (2008), it holds that

$P_{N_j}$ converge weakly to $P_j$, $j = 1, \dots, J$. Now we apply the extended continuous mapping theorem (van der Vaart & Wellner, 2013, Theorem 1.11.1). Equip $\mathcal{P}_2(\mathbb{R}^d)$ with any metric $d(\cdot, \cdot)$ that metrizes weak convergence. We define the maps $g : \times_{j=0}^J \big(\mathcal{P}_2(\mathbb{R}^d), d_j\big) \to \big(\mathcal{P}_2(\mathbb{R}^d), d\big)$ by $g(P_0, \dots, P_J) =$

Estimated Weights for Control States.

(a) Covered by Medicaid (b) Employment Status

Figure 6: Replicated (blue) vs actual (orange) Montana from 2010 to 2016.

Figure 7: Replicated (blue) vs actual (orange) Montana from 2010 to 2016. In each panel, histograms of data distributions are shown on the left, and cumulative distribution functions are shown on the right.

Figure 8: In orange: Montana. In blue: pretending each control state listed in Table 1 is a treated state.

• of working age, i.e. between ages 18 and 65
• who have no missing outcomes (for those listed in the main text)
• who have no top-coded responses
• who are either household heads or their spouses

The sample size breakdown by state is as follows:

Summary of the full data sample used to obtain $\lambda^*$.


Under review as a conference paper at ICLR 2023

To accompany Figure 8, we also compute p-values, which we denote by $p_t$ and define as $p_t := r(d_{1t}) / (J+1)$, where $d_{1t}$ is the 2-Wasserstein distance from the optimal projection to the actual distribution when the target unit is Montana, $r(d_{1t})$ is the rank of $d_{1t}$ among the $d_{jt}$ at a given time $t$, and $J$ is the number of control units. Results are described in Table 3. A smaller $p_t$ value indicates a larger treatment effect. We observed that $r(d_{1t}) = 1$ for 2018 and 2019, implying a nontrivial effect of the Medicaid expansion in Montana during these years. The values of $p_t$ are comparably higher in 2017 and 2020, which we attribute to 2017 being the first year of the policy implementation and to the COVID-19 pandemic, respectively.
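The rank-based p-value is simple enough to state in a few lines of code. The sketch below assumes that rank 1 corresponds to the largest distance (consistent with a smaller $p_t$ indicating a larger effect) and that ties, which the text does not discuss, are resolved in favor of the treated unit; the function name `placebo_p_value` and the distances are hypothetical.

```python
def placebo_p_value(distances, treated_index=0):
    """Return r(d_1t) / (J + 1) for one time period.

    `distances` holds the 2-Wasserstein distances d_jt for the treated unit
    and the J placebo (control) units, so its length is J + 1.
    Rank 1 means the treated unit has the largest distance.
    """
    d_treated = distances[treated_index]
    rank = 1 + sum(1 for d in distances if d > d_treated)
    return rank / len(distances)


# Hypothetical distances for one year: treated unit first, then three controls.
p_largest = placebo_p_value([0.9, 0.2, 0.5, 0.1])  # treated distance is largest
p_second = placebo_p_value([0.3, 0.2, 0.5, 0.1])   # one control is larger
```

With $J$ control units the smallest attainable value is $1/(J+1)$, so the resolution of the test is limited by the number of control states.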

