MAX-SLICED BURES DISTANCE FOR INTERPRETING DISCREPANCIES

Abstract

We propose the max-sliced Bures distance, a lower bound on the max-sliced Wasserstein-2 distance, to identify the instances associated with the maximum discrepancy between two samples. The max-slicing can be decomposed into two asymmetric divergences, each expressed in terms of an optimal slice or, equivalently, a 'witness' function that has large-magnitude evaluations on a localized subset of instances in one distribution versus the other. We show how witness functions can be used to detect and correct for covariate shift through reweighting and to evaluate generative adversarial networks. Unlike heuristic algorithms for the max-sliced Wasserstein-2 distance that may fail to find the optimal slice, we detail a tractable algorithm that finds the globally optimal slice and scales to large sample sizes. As the Bures distance quantifies differences in covariance, we generalize the max-sliced Bures distance by using non-linear mappings, enabling it to capture changes in higher-order statistics. We explore two types of non-linear mappings: positive semidefinite kernels, where the witness functions belong to a reproducing kernel Hilbert space, and task-relevant mappings corresponding to a neural network. In the context of samples of natural images, our approach provides an interpretation of the Fréchet Inception distance by identifying the synthetic and natural instances that are either over-represented or under-represented with respect to the other sample. We apply the proposed measure to detect imbalances in class distributions in various data sets and to critique generative models.

1. INTRODUCTION

Divergence measures quantify the dissimilarity between probability distributions. They are fundamental to hypothesis testing and to the estimation and criticism of statistical models, and serve as cost functions for optimizing generative adversarial networks (GANs). Although a multitude of divergences exists, not all of them are interpretable. A divergence is interpretable if it can be expressed in terms of a real-valued witness function ω(•) whose level sets identify the specific subsets that are not well matched between the distributions, specifically, subsets which have much higher or much lower probability under one distribution versus the other. Localizing these discrepancies is useful for understanding and compensating for differences between two samples or distributions, for instance to detect covariate shift (Shimodaira, 2000; Quionero-Candela et al., 2009; Lipton et al., 2018) or to evaluate generative models (Heusel et al., 2017). While many divergences can be posed in terms of witness functions, not all witness functions are readily obtained or interpreted. From an information-theoretic perspective, the most natural witness function is the logarithm of the ratio of the densities (Kullback & Leibler, 1951), as in the Kullback-Leibler divergence. Applying other convex functions to the density ratio constitutes the family of f-divergences (Ali & Silvey, 1966; Rényi, 1961), which include the Hellinger, Jensen-Shannon, and others. However, without a parametric model, estimating the densities from samples is challenging (Vapnik, 2013). Following Vapnik's advice to "try to avoid solving a more general problem as an intermediate step," previous work has sought to directly model the density ratio via kernel learning (Nguyen et al., 2008; Kanamori et al., 2009; Yamada et al., 2011; 2013; Saito et al., 2018; Lee et al., 2019) or to estimate an f-divergence by optimizing a function from a suitable family (Nguyen et al., 2010) such as a neural network (Nowozin et al., 2016).
Witness functions need not rely on the density ratio. A wide class of divergences called integral probability metrics (IPMs) (Müller, 1997), which includes total variation, the Wasserstein-1 distance, maximum mean discrepancy (MMD) (Gretton et al., 2007), and others (Mroueh et al., 2017), seeks a witness function that maximizes the distance between the first moments of the witness function evaluations. In these cases the optimal witness function ω*(•) has a greater expectation under one distribution than under the other. An IPM between two measures µ and ν is expressed as sup_{ω∈F} |E_{X∼µ}[ω(X)] − E_{Y∼ν}[ω(Y)]| for a given family of functions F. A class of related divergences are the max-sliced Wasserstein-p distances, which seek a linear (Deshpande et al., 2019) or non-linear slicing function (Kolouri et al., 2019) that maximizes the Wasserstein-p distance between the witness function evaluations for the two distributions. However, there are two difficulties with computing the max-sliced Wasserstein distance for two samples. The first is that it is a saddle-point optimization problem whose objective evaluation requires sorting the samples. Previous work has sought to approximate it using a first-moment approximation (Deshpande et al., 2019) or to use a finite number of steps of a local optimizer (Kolouri et al., 2019), without any guarantee of obtaining an optimal witness function. Another difficulty is in the interpretation of the obtained witness function. Unlike the density ratio, there is no notion of whether the witness function will take higher values for points associated with one distribution versus the other. To address both of these issues we propose a max-sliced distance that replaces the Wasserstein-2 distance with a second-moment approximation based on the Bures distance (Dowson & Landau, 1982; Gelbrich, 1990). The Bures distance (Bures, 1969; Uhlmann, 1976) is a distance metric between positive semidefinite operators.
It is well-known in quantum information theory (Nielsen & Chuang, 2000; Koltchinskii & Xia, 2015) and machine learning (Brockmeier et al., 2017; Muzellec & Cuturi, 2018; Zhang et al., 2020; Oh et al., 2020; De Meulemeester et al., 2020) .

1.1. CONTRIBUTION

We propose a novel IPM-like divergence measure, the "max-sliced Bures distance", to identify localized regions and instances associated with the maximum discrepancy between two samples. The distance is expressed as the maximal difference between the root mean square (RMS) of the witness function evaluations, sup_{ω∈S} |√(E_{X∼µ}[ω²(X)]) − √(E_{Y∼ν}[ω²(Y)])|, where S is an appropriate family of functions. As |∆| = max{∆, −∆}, the max-sliced Bures distance can be expressed as the maximum of two one-sided max-sliced divergences with optimal witness functions ω_{µ>ν} = argmax_{ω∈S} √(E_µ[ω²(X)]) − √(E_ν[ω²(Y)]) and ω_{µ<ν} = argmax_{ω∈S} √(E_ν[ω²(Y)]) − √(E_µ[ω²(X)]). If the distributions are not well matched, then ω_{µ>ν} has large-magnitude function evaluations on a 'localized' subset of µ and smaller-magnitude values for ν, and the opposite holds for ω_{µ<ν}. The two samples {x_i}_{i=1}^m and {y_i}_{i=1}^n can be sorted by the magnitude of the witness function evaluations.¹ Crucially, we detail a tractable optimization procedure that is guaranteed to yield a globally optimal witness function for the one-sided max-sliced Bures divergence. When X = R^d and the first or second moments distinguish the distributions, linear witness functions can be used, S = {ω(•) = ⟨•, w⟩ : w ∈ S^{d−1}}, where S^{d−1} denotes the unit sphere in R^d. The optimal witness function for the one-sided max-sliced Bures divergence, ω_{µ>ν}(•) = ⟨•, w_{µ>ν}⟩, coincides with the subspace with the greatest difference in RMS, w_{µ>ν} = argmax_{w∈S^{d−1}} √(wᵀE[XXᵀ]w) − √(wᵀE[Y Yᵀ]w). This optimization problem depends on the dimension d; after computation of the covariance matrices, it is independent of the sample sizes m ≥ n. In comparison, an optimal slice for the max-sliced Wasserstein distance may not be obtained, and even gradient ascent to a local optimum requires O(m log m) operations at each function/gradient evaluation.
Furthermore, the slice that maximizes the max-sliced Wasserstein distance lacks an intrinsic ordering, and it is left to the user to determine whether instances from µ or ν have high or low values or magnitudes. As second-order moments may be insufficient for distinguishing the distributions, we explore non-linear mappings of the random variables. Firstly, we consider a reproducing kernel Hilbert space (RKHS) H with the family of witness functions S = {ω(•) = ⟨φ(•), ω⟩_H : ω, φ(•) ∈ H, ⟨ω, ω⟩_H = 1}. An example with Gaussian kernels is shown in Figure 1. Secondly, we use a pre-trained neural network to create a task-relevant mapping, computing the second-order statistics of the hidden-layer activations, and apply this in the context of samples of natural images. This enables interpretation of the Fréchet Inception distance (FID) (Heusel et al., 2017) by identifying the subspace and images associated with discrepancies between synthetic and natural images. We prove that the max-sliced Bures distance provides a lower bound on the max-sliced Fréchet distance. Because of their similarity, we develop the max-sliced Bures distance in the context of max-sliced versions of the total variation and Wasserstein-2 distances. The kernel-based versions of these are novel contributions themselves. The max-sliced total variation distance is a special case of the covariance feature matching proposed by Mroueh et al. (2017).

¹ Four groups of 'witness points' (top-K instances) can be inspected to identify any discrepancies: ω²_{µ>ν}(x_{π(1)}) ≥ ··· ≥ ω²_{µ>ν}(x_{π(K)}), where π sorts {x_i}_{i=1}^m to reveal the examples from µ̂ with large ω²_{µ>ν}; ω²_{µ>ν}(y_{σ(1)}) ≥ ··· ≥ ω²_{µ>ν}(y_{σ(K)}), where σ sorts {y_i}_{i=1}^n to find the examples from ν̂ with large ω²_{µ>ν}; and analogously ω²_{µ<ν}(x_{π̃(1)}) ≥ ··· ≥ ω²_{µ<ν}(x_{π̃(K)}) and ω²_{µ<ν}(y_{σ̃(1)}) ≥ ··· ≥ ω²_{µ<ν}(y_{σ̃(K)}), where π̃ and σ̃ are the corresponding permutations.
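Once a slice is in hand, the ranking of witness points described in the footnote reduces to sorting squared projections. The following sketch illustrates this for a linear slice; the sample data, the chosen slice `w`, and `k=5` are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def top_k_witness_points(w, X, Y, k=10):
    """Rank instances by the squared witness evaluations omega^2(.) = (w^T .)^2
    for a linear slice w; returns the top-k indices into X and Y."""
    sx = (X.T @ w) ** 2          # omega^2 evaluated on {x_i}
    sy = (Y.T @ w) ** 2          # omega^2 evaluated on {y_i}
    pi = np.argsort(-sx)[:k]     # permutation sorting {x_i} by descending omega^2
    sigma = np.argsort(-sy)[:k]  # permutation sorting {y_i} by descending omega^2
    return pi, sigma

# toy samples: Y has inflated variance along the first coordinate
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = rng.normal(size=(2, 200))
Y[0] *= 3.0
w = np.array([1.0, 0.0])         # slice along the mismatched axis (assumed known here)
pi, sigma = top_k_witness_points(w, X, Y, k=5)
```

With this slice, the top-ranked witness points from Y are exactly the instances whose projections are over-dispersed relative to X, matching the ν̂-side group in the footnote.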
In experimental results, we show applications of the linear and kernel-based versions to detect imbalances in class distributions of natural images and to critique GANs. We compare to other divergences expressed in terms of witness functions including MMD. Finally, we propose algorithms to reweight an empirical distribution in order to minimize max-sliced divergences (with applications to generating conditional distributions and covariate shift correction).

2. METHODOLOGY

Consider a topological space X, a Borel σ-algebra B_X, and the set Pr(X) of Borel probability measures on X. Let µ, ν ∈ Pr(X) denote two probability measures, and X ∼ µ and Y ∼ ν be two random variables X, Y ∈ X. Let κ denote a positive-definite, bounded kernel function κ : X × X → B ⊂ R. For any κ, there is an implicit mapping φ : X → H that maps any element x ∈ X to an element φ(x) ∈ H of the reproducing kernel Hilbert space (RKHS) such that κ(x, y) = ⟨φ(x), φ(y)⟩_H = ⟨κ(•, x), κ(•, y)⟩_H for x, y ∈ X, and ∀ω ∈ H, ω(x) = ⟨ω, κ(•, x)⟩_H (Aronszajn, 1950). For all ω ∈ H, ‖ω‖² = ⟨ω, ω⟩_H. When clear, we drop the H subscript on the inner product. A rank-1 RKHS operator is denoted ω ⊗ ψ ∈ H × H, with ⟨(ω ⊗ ψ)φ(x), φ(y)⟩ = ⟨ψ, φ(x)⟩⟨ω, φ(y)⟩ = ψ(x)ω(y) for x, y ∈ X. Denote by m_X = E_{X∼µ}[φ(X)] ∈ H and m_Y = E_{Y∼ν}[φ(Y)] ∈ H the first moments of the random variables in the RKHS. The uncentered second moments are ρ_X = E[φ(X) ⊗ φ(X)] ∈ H × H and ρ_Y = E[φ(Y) ⊗ φ(Y)] ∈ H × H. The covariance operators are Σ_X = ρ_X − m_X ⊗ m_X and Σ_Y = ρ_Y − m_Y ⊗ m_Y.

2.1. DIVERGENCES AS DISTANCE METRICS

Let D(µ, ν) denote a divergence D : Pr(X) × Pr(X) → [0, ∞). It is a distance metric between measures (a probability metric) if all of the following statements hold: (i) µ = ν =⇒ D(µ, ν) = 0, (ii) D(µ, ν) = 0 =⇒ µ = ν, (iii) D(µ, ν) = D(ν, µ), (iv) D(µ, ν) ≤ D(µ, ξ) + D(ν, ξ). It is a semi-metric if all properties aside from (ii) hold. Müller (1997) defines the class of integral probability metrics as the supremum of the absolute difference between expectations, D_F(µ, ν) = sup_{ω∈F} |∫_X ω(x)dµ(x) − ∫_X ω(x)dν(x)| = sup_{ω∈F} |E[ω(X)] − E[ω(Y)]|. With an appropriate choice of the family of functions F, this form yields well-known divergences (Sriperumbudur et al., 2010), e.g., when F is the set of functions with Lipschitz constant at most 1, the resulting divergence is the Wasserstein-1 distance metric.
Another example of an IPM arises when F = {ω ∈ H : ‖ω‖₂ ≤ 1}, which yields MMD (Gretton et al., 2007), defined as D^H_MMD(µ, ν) = sup_{ω∈H: ‖ω‖₂≤1} {E[ω(X)] − E[ω(Y)] = E[⟨φ(X) − φ(Y), ω⟩]} = ‖m_X − m_Y‖₂ = √(E_{X∼µ,X′∼µ}[κ(X, X′)] + E_{Y∼ν,Y′∼ν}[κ(Y, Y′)] − 2E_{X∼µ,Y∼ν}[κ(X, Y)]). For characteristic kernels such as the Laplacian and Gaussian kernels, the mean embedding µ ↦ E_{X∼µ}[φ(X)] : Pr(X) → H is an injective function (Sriperumbudur et al., 2008; Fukumizu et al., 2009; Sriperumbudur et al., 2010), capturing the full statistics of µ. In these cases, MMD is a distance metric on Pr(X); likewise, distance metrics between the operators ρ_X = E[φ(X) ⊗ φ(X)] and ρ_Y = E[φ(Y) ⊗ φ(Y)] induce probability metrics for characteristic kernels (Zhang et al., 2020).
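The closed form above makes the sample estimate of MMD a function of three Gram matrices. A minimal sketch of the biased (V-statistic) estimator with a Gaussian kernel follows; the bandwidth `sigma=1.0` and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix of exp(-||a - b||^2 / (2 sigma^2)) for columns of A (d, m) and B (d, n)."""
    sq = (A ** 2).sum(0)[:, None] + (B ** 2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd(X, Y, sigma=1.0):
    """Biased estimate of D_MMD = ||m_X - m_Y||_H via the three Gram-matrix means."""
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return np.sqrt(max(Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean(), 0.0))

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 300))            # sample from mu
Y = rng.normal(loc=2.0, size=(2, 300))   # shifted sample from nu
```

Since the Gaussian kernel is bounded by 1, the estimate always lies in [0, √2], and it vanishes when both arguments are the same sample.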

2.2. OPERATOR DISTANCES FOR DEFINING DIVERGENCES

Total variation (TV) is a well-known probability metric and an integral probability metric (Müller, 1997), taking the form (1/2) Σ_i |p_i − q_i| for discrete measures, where p_i = µ(x_i) and q_i = ν(x_i) with {x_i}_i = X. The TV distance between operators in the RKHS is a divergence D^H_TV(µ, ν) ≜ d_TV(ρ_X, ρ_Y) ≜ (1/2)‖ρ_X − ρ_Y‖₁, where ‖•‖₁ denotes the trace norm (Schatten 1-norm), the sum of the singular values. The Bures distance generalizes the Hellinger distance, whose square is (1/2) Σ_i (√p_i − √q_i)², to positive semidefinite operators (Fuchs & Van De Graaf, 1999; Bromley et al., 2014; Bhatia et al., 2019). The kernel Bures divergence D^H_B(µ, ν) and the Bures distance d_B(ρ_X, ρ_Y) are defined as D^H_B(µ, ν) = d_B(ρ_X, ρ_Y) ≜ √(‖ρ_X‖₁ + ‖ρ_Y‖₁ − 2‖√ρ_X √ρ_Y‖₁). The Bures distance is used to define the Wasserstein-2 (W2) distance between Gaussian measures, i.e., the Fréchet distance (Fréchet, 1957; Dowson & Landau, 1982). The multivariate Fréchet distance provides a lower bound for the W2 distance (Gelbrich, 1990).² The kernel Gauss-Wasserstein distance (Zhang et al., 2020; Oh et al., 2020) is defined as D^H_GW(µ, ν) ≜ √(‖m_X − m_Y‖₂² + d²_B(Σ_X, Σ_Y)) = √([D^H_MMD(µ, ν)]² + d²_B(Σ_X, Σ_Y)). (6) Zhang et al. (2020) also proposed the kernel Wasserstein-p distance between µ and ν, W^H_p(µ, ν) ≜ inf_{γ∈Γ(µ,ν)} (E_{(X,Y)∼γ}[d^p_κ(X, Y)])^{1/p}, p ≥ 1, where Γ(µ, ν) is the set of all joint distributions coupling µ and ν, and d^p_κ(X, Y) = ‖φ(X) − φ(Y)‖₂^p. For p = 2, d²_κ(X, Y) = κ(X, X) + κ(Y, Y) − 2κ(X, Y). When p = 2 and φ(x) → x ∈ R^d such that H = R^d, the standard W2 distance W^{R^d}_2(µ, ν) is obtained.
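For finite-dimensional covariance matrices, the Bures distance above can be evaluated with eigendecompositions, using the identity ‖√A √B‖₁ = tr((√A B √A)^{1/2}). A minimal numpy sketch (the helper names are my own):

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def bures(A, B):
    """d_B(A, B) = sqrt(tr A + tr B - 2 tr((A^{1/2} B A^{1/2})^{1/2}))
    for symmetric PSD matrices A, B."""
    sa = psd_sqrt(A)
    cross = psd_sqrt(sa @ B @ sa)   # its trace equals the trace norm ||sqrt(A) sqrt(B)||_1
    val = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return np.sqrt(max(val, 0.0))

# diagonal example: d_B reduces to sqrt(sum_i (sqrt(a_i) - sqrt(b_i))^2),
# mirroring the Hellinger form for discrete measures
A = np.diag([4.0, 1.0])
B = np.diag([1.0, 1.0])
```

For the diagonal pair above the distance is √((2 − 1)² + (1 − 1)²) = 1, consistent with the Hellinger generalization described in the text.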

2.3. DIVERGENCES BASED ON SLICING HILBERT SPACES

The sliced Wasserstein distance (Wu et al., 2019; Deshpande et al., 2018; Kolouri et al., 2018) and max-sliced Wasserstein distance (Deshpande et al., 2019; Kolouri et al., 2019) evaluate discrepancies in linear or non-linear one-dimensional subspaces. A motivation for this is the analytic solution of the Wasserstein-p distance in one dimension. The max-sliced Wasserstein-p distance takes the form max-W^{R^d}_p(µ, ν) ∝ sup_{w∈S^{d−1}} inf_{γ∈Γ(µ,ν)} (E_{(X,Y)∼γ}[⟨X − Y, w⟩^p])^{1/p}, p ≥ 1. Similarly, we propose the max-sliced Bures, kernel TV, and kernel Wasserstein-p distances using the rank-1 operator Ω = ω ⊗ ω ∈ H × H, which projects (slices) the RKHS along the one-dimensional subspace defined by the ray ω ∈ H, with ⟨ω, φ(X)⟩ = ω(X) due to the reproducing property. In this formulation, ω : X → R is the witness function from the set S = {ω ∈ H : ‖ω‖₂ = 1}. Notably, a linear slice in the RKHS is a possibly non-linear function in the input space. For conciseness, we denote the mean-square witness function evaluations E[ω²(X)] = ⟨ω, ρ_X ω⟩ = ‖√ρ_X ω‖₂² as ‖ω‖²_µ, and E[ω²(Y)] = ⟨ω, ρ_Y ω⟩ = ‖√ρ_Y ω‖₂² as ‖ω‖²_ν. The RMS ‖ω‖_µ is an L² semi-norm on ω induced by the positive semidefinite operator √ρ_X. The max-sliced kernel TV, Bures, and W2 distances, derived in appendix A.1, are expressed as max-D^H_TV(µ, ν) ≜ (1/2) max{sup_{ω∈S}(‖ω‖²_µ − ‖ω‖²_ν), sup_{ω∈S}(‖ω‖²_ν − ‖ω‖²_µ)} (8), max-D^H_B(µ, ν) ≜ max{sup_{ω∈S}(‖ω‖_µ − ‖ω‖_ν), sup_{ω∈S}(‖ω‖_ν − ‖ω‖_µ)} (9), and max-W^H_2(µ, ν) ≜ sup_{ω∈S} √(‖ω‖²_µ + ‖ω‖²_ν − sup_{γ∈Γ(µ,ν)} E_{(X,Y)∼γ}[2ω(X)ω(Y)]), respectively. The inner suprema in equation 8 and equation 9 are the one-sided divergences. The max-sliced TV distance is an IPM with F = {ω²(x) = ⟨φ(x), ω⟩² : ω ∈ S} and is a special case of the IPM_Σ divergence proposed by Mroueh et al. (2017). While not an IPM, the max-sliced kernel Bures distance can be directly related to max-sliced versions of the TV, Fréchet, and W2 distances, as detailed by the following results. Theorem 1.
The square of the max-sliced Bures distance in the RKHS H is less than or equal to twice the max-sliced TV distance, [max-D^H_B(µ, ν)]² ≤ 2 max-D^H_TV(µ, ν). Theorem 2. The max-sliced Bures distance in the RKHS H is a lower bound on the kernel max-sliced Gauss-Wasserstein distance, max-D^H_B(µ, ν) ≤ max_L-D^H_GW(µ, ν) ≤ max_U-D^H_GW(µ, ν). Theorem 3. The max-sliced Bures distance in the RKHS is a lower bound on the kernel max-sliced W2 distance, max-D^H_B(µ, ν) ≤ max-W^H_2(µ, ν). These results translate directly to the linear kernel case H = R^d, S = S^{d−1}, ω(x) = ⟨w, x⟩, for w, x ∈ R^d and X ⊆ R^d. The latter two show that the max-sliced Bures distance is a lower bound on the max-sliced Fréchet distance, which in turn is a lower bound on the max-sliced W2 distance. The proofs and other relationships among the divergences are in Appendix A.2.
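Theorem 1 can be sanity-checked numerically in the linear case: for any unit slice w, (√(wᵀρ_X w) − √(wᵀρ_Y w))² ≤ |wᵀρ_X w − wᵀρ_Y w|, and the right-hand side is at most twice the max-sliced TV, which equals half the largest-magnitude eigenvalue of ρ_X − ρ_Y. A sketch with random PSD matrices standing in for the second-moment operators (a random-slice search gives only a lower bound on the max-sliced Bures, so the inequality must still hold for it):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.normal(size=(d, d)); rho_x = A @ A.T / d   # stand-in for rho_X
B = rng.normal(size=(d, d)); rho_y = B @ B.T / d   # stand-in for rho_Y

# linear max-sliced TV: half the largest-magnitude eigenvalue of the difference
max_tv = 0.5 * np.max(np.abs(np.linalg.eigvalsh(rho_x - rho_y)))

# lower-bound the max-sliced Bures by searching over random unit slices
W = rng.normal(size=(d, 5000))
W /= np.linalg.norm(W, axis=0)
px = np.clip((W * (rho_x @ W)).sum(axis=0), 0.0, None)  # w' rho_X w per slice
py = np.clip((W * (rho_y @ W)).sum(axis=0), 0.0, None)  # w' rho_Y w per slice
bures_lb = np.max(np.abs(np.sqrt(px) - np.sqrt(py)))
```

Because the slice-wise inequality holds pointwise, the squared Bures lower bound can never exceed twice the TV value, matching Theorem 1.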

2.4. COMPUTING THE MAX-SLICED DIVERGENCES

The max-sliced kernel Bures, TV, and W2 distances require solving optimization problems to find the optimal witness function. As noted by others investigating IPMs (Mroueh et al., 2017; Li et al., 2017; Kolouri et al., 2019), the witness function can be defined using a family of functions implemented as a neural network. In this context, Goodfellow et al. (2014) use the divergence D_A(µ, ν) = max_ω E[log ω(X)] + E[log(1 − ω(Y))], where ω : X → (0, 1). In comparison, the Wasserstein-1 distance D_{W1}(µ, ν) = sup_{ω∈Lip₁} E[ω(X)] − E[ω(Y)] requires a Lipschitz constraint (Arjovsky et al., 2017; Gulrajani et al., 2017).³ Table 1 compares the forms and constraints.

Table 1: Divergences written in terms of witness functions. Closed-form solutions are denoted ω*.
D_A(µ, ν) = max_{ω: ω(•)∈(0,1)} E[log ω(X)] + E[log(1 − ω(Y))], with ω*(•) = dµ(•)/(dµ(•) + dν(•)).
D_{W1}(µ, ν) = sup_{ω∈Lip₁} E[ω(X)] − E[ω(Y)].
D^H_MMD(µ, ν) = sup_{ω∈H: ‖ω‖₂≤1} E[ω(X)] − E[ω(Y)], with ω* = (m_X − m_Y)/‖m_X − m_Y‖₂.
max-D^H_TV(µ, ν) = sup_{ω∈H: ‖ω‖₂≤1} (1/2)|E[ω²(X)] − E[ω²(Y)]|.
max-D^H_B(µ, ν) = sup_{ω∈H: ‖ω‖₂≤1} |√(E[ω²(X)]) − √(E[ω²(Y)])|.

2.5. SAMPLE-BASED ESTIMATORS FOR MAX-SLICED DIVERGENCES

We consider the case of finite samples expressed as empirical measures µ̂ = Σ_{i=1}^m µ_i δ_{x_i} and ν̂ = Σ_{i=1}^n ν_i δ_{y_i} for the samples {x_i}_{i=1}^m and {y_i}_{i=1}^n, with discrete probability masses denoted as column vectors [µ_1, …, µ_m]ᵀ = µ ∈ [0, 1]^m, ⟨µ, 1⟩ = 1, and [ν_1, …, ν_n]ᵀ = ν ∈ [0, 1]^n, ⟨ν, 1⟩ = 1. The kernel-based max-sliced divergences optimize the witness function ω(•) = Σ_{i=1}^l α_i κ(•, z_i) in terms of the dual variables α ∈ R^l corresponding to a subset of the pooled sample {x_i}_{i=1}^m ∪ {y_i}_{i=1}^n. The optimization problems and algorithms are detailed in appendix A.4. For clarity, we proceed with the linear kernel case for a finite-dimensional embedding φ(x) = x ∈ R^d. After embedding, kernel evaluations correspond to vector inner products κ(x, y) = ⟨φ(x), φ(y)⟩ = xᵀy. Let X = [x_1, …, x_m] ∈ R^{d×m} and Y = [y_1, …, y_n] ∈ R^{d×n} denote the sample points with corresponding masses µ and ν, respectively. The witness function is the inner product ω(x) = wᵀx, where the variable w defines the slice, with ‖ω‖²_µ̂ = wᵀρ_X w for ρ_X = XD_µXᵀ and ‖ω‖²_ν̂ = wᵀρ_Y w for ρ_Y = YD_νYᵀ, where D_v is diagonal with entries v. For i.i.d. samples, D_µ = (1/m)I and D_ν = (1/n)I. The max-sliced TV, Bures, and W2 divergences are max-D^{R^d}_TV(µ̂, ν̂) = max_{w: ‖w‖₂≤1} (1/2)|wᵀ(ρ_X − ρ_Y)w| = (1/2)|λ_1(ρ_X − ρ_Y)|, max-D^{R^d}_B(µ̂, ν̂) = max_{w: ‖w‖₂≤1} |√(wᵀρ_X w) − √(wᵀρ_Y w)|, (12) and max-W^{R^d}_2(µ̂, ν̂) = max_{w: ‖w‖₂≤1} √(wᵀ(ρ_X + ρ_Y)w − 2 max_{P∈P_{µ̂,ν̂}} wᵀXPYᵀw), where λ_1(•) denotes the largest-magnitude eigenvalue of the argument and P_{µ̂,ν̂} = {P ∈ [0, 1]^{m×n} | P1_n = µ, Pᵀ1_m = ν} is a transportation polytope. These three optimizations differ in difficulty. The first two require only the sample means m_X, m_Y and covariance matrices Σ_X, Σ_Y, since ρ_X = m_X m_Xᵀ + ((m−1)/m)Σ_X and ρ_Y = m_Y m_Yᵀ + ((n−1)/n)Σ_Y (assuming unbiased covariance estimates).
The one-sided max-sliced TV divergences can be solved by finding the eigenvectors associated with the largest eigenvalues of ρ_X − ρ_Y and ρ_Y − ρ_X. Likewise, the optimal slice for each one-sided max-sliced Bures divergence requires solving a series of eigenvector problems. Specifically, if ρ_X and ρ_Y are strictly positive definite, then the optimal witness function is ω_{µ>ν}(•) = ⟨w_{γ*}, •⟩, where γ* ∈ (0, 1] solves the optimization problem γ* = argmax_{0<γ≤1} √(w_γᵀρ_X w_γ) − √(w_γᵀρ_Y w_γ), with w_γ = argmax_{w: ‖w‖₂≤1} wᵀ(γρ_X − ρ_Y)w. The general case involves checking the nullspace of ρ_Y and is given in Appendix A.5. In comparison, the max-sliced W2 distance is a saddle-point optimization problem (Deshpande et al., 2019). Following Kolouri et al. (2019), gradient ascent on w can be performed with first-order solvers. Each gradient evaluation requires solving the transport map by sorting Xᵀw and Yᵀw. For this, we use ADAM (Kingma & Ba, 2015) and quasi-Newton approaches, such as MINFUNC (Schmidt, 2012). The same approaches can be used to approximate max-D_B after smoothing √(•) as √(• + 0.01).
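The eigenvector line search for the one-sided max-sliced Bures slice described above can be sketched as follows; a simple grid over γ ∈ (0, 1] stands in for whatever one-dimensional solver is actually used, and strictly positive-definite inputs are assumed (the general case is in Appendix A.5).

```python
import numpy as np

def one_sided_max_sliced_bures(rho_x, rho_y, grid=200):
    """Approximate the slice maximizing sqrt(w' rho_x w) - sqrt(w' rho_y w)
    over unit vectors w: for each gamma in (0, 1], w_gamma is the top
    eigenvector of gamma*rho_x - rho_y, and the best gamma is kept."""
    best, w_best = -np.inf, None
    for gamma in np.linspace(1e-3, 1.0, grid):
        _, vecs = np.linalg.eigh(gamma * rho_x - rho_y)
        w = vecs[:, -1]  # eigenvector of the largest eigenvalue
        obj = np.sqrt(max(w @ rho_x @ w, 0.0)) - np.sqrt(max(w @ rho_y @ w, 0.0))
        if obj > best:
            best, w_best = obj, w
    return best, w_best
```

On the diagonal pair ρ_X = diag(4, 1), ρ_Y = I the optimum is the first coordinate axis with value √4 − √1 = 1, while the reversed one-sided divergence is zero, illustrating the asymmetry of the two one-sided problems.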

3. EXPERIMENTS

We present various examples of using the proposed max-sliced divergences to identify the discrepancies between two samples. We apply the proposed approach to detect mismatched distributions of natural and fake images using the internal representation of the Inception Network (Szegedy et al., 2016) as in the Fréchet Inception distance (Heusel et al., 2017) and the Inception score (Salimans et al., 2016) . We investigate whether the witness functions detect covariate shift caused by class imbalances. Then, we propose optimizing the weights ν to compensate for covariate shift. Finally, we use the one-sided max-sliced Bures divergence to monitor mode dropping during GAN training.

3.1. INTERPRETING THE FRÉCHET INCEPTION DISTANCE

We use a linear witness function to identify instances that are not well matched between two samples of real or fake images represented by internal activations of the Inception object-classification network (Szegedy et al., 2016). Specifically, we search for witness functions of the form ω(x) = ⟨w, φ(x)⟩, where the vector φ(x) ∈ R^2048 is an Inception code: the internal activations of the penultimate layer of the network after pooling (Heusel et al., 2017). Figure 2 shows the performance of the proposed measure in identifying instances associated with imbalanced representation of particular classes. In particular, µ̂ is a uniform sample from the training set and ν̂ is a sample from the test set with fewer instances from one class. Using the one-sided max-sliced Bures divergence we obtain the optimal slice ω_{µ>ν} and apply it to the imbalanced sample ν̂, identifying the top-10 witness points with the largest-magnitude witness function evaluations ω²_{µ>ν}(y_{σ(1)}) ≥ … ≥ ω²_{µ>ν}(y_{σ(10)}), where σ is a permutation corresponding to sorting {y_i}_{i=1}^n by descending magnitude. (While this may seem counterintuitive, as it is expected that max_{1≤i≤m} ω²_{µ>ν}(x_i) exceeds max_{1≤i≤n} ω²_{µ>ν}(y_i), since ω_{µ>ν} corresponds to a one-dimensional subspace, {y_{σ(i)}}_{i=1}^K are the K instances from ν̂ with the largest norm after projection onto this subspace.) The performance is quantified by the precision of the labels of these instances (ideally, these witness points should be from the underrepresented class). Notably, the mean precision of the top-10 instances is 0.79 or better across the classes, with a mean average precision (MAP) of 0.94, when the class probabilities differ by 2% (10.2% for the majority and 8.2% for the minority). This compares to a MAP of 0.82 for the first-moment-based surrogate of the max-sliced W2 distance (Deshpande et al., 2019). Computing the max-sliced W2 distance takes much longer to run at this sample size.
Next we generate a set of 50,000 synthetic images from an AutoGAN instance pre-trained on CIFAR10 (Gong et al., 2019), which has an Inception score of 8.525 and an FID of 12.41. We applied both one-sided max-sliced Bures divergences to identify the two subspaces that maximize the difference in RMS between fake and real images. Figure 3 details the top-10 images in each subspace and their realism scores R (Kynkäänniemi et al., 2019).⁴ Applying the max-sliced Wasserstein-2 distance yielded almost the same solution as w_{Real<Fake} (a linear correlation of 0.992). We compare the proposed max-sliced Bures distance and the resulting max-sliced Fréchet distance to the max-sliced W2 distance for linear witness functions. Figure 4 compares the divergence estimates on a simple case of covariate shift with the MNIST data set. Notably, the precision of detecting class imbalances is higher for smaller samples using a kernel (Appendix A.8). [Figure 4: (Left) divergence estimates; (Center) computation time; (Right) average precision@10, averaged across the 10 classes. The one-sided max-sliced Bures yields the witness function ω_{µ>ν}(•) = ⟨w, •⟩, which reliably identifies the instances from µ̂ that are from the minority class for m ≥ 1000.]

3.3. COVARIATE SHIFT CORRECTION BY REWEIGHTING

We consider the task of reweighting the instances in one sample to minimize the max-sliced Bures distance. This optimization problem can be expressed as min_{ν∈R^n_{≥0}: Σ_i ν_i = 1} J(ν), where J(ν) ∝ max-D^{R^d}_B(µ̂, ν̂). As shown in appendix A.7, this is a convex minimization problem with a simplex constraint on ν. We apply the Frank-Wolfe algorithm (Jaggi, 2013), which adjusts the weight of one instance at each iteration. The performance is quantified in terms of the Fréchet Inception distance between the real test images of the class and the reweighted sample of fake images. For comparison, we also optimize reweightings that minimize the W2 distance and the max-sliced W2 distance (using 10 mini-batches of size n = m = 100 at each iteration). The average FID across the classes is 49.15 for the max-sliced Bures reweighting, compared to 68.1 and 72.5 for the mini-batch W2 and max-sliced W2, respectively (see Table 4 in the Appendix for full results). Figure 5 shows results of reweighting uniform distributions to match target distributions using either linear slices or random Fourier bases (Rahimi & Recht, 2008) approximating a Gaussian kernel.⁵ For the latter, using the max-sliced Bures as a loss achieves the lowest reweighted W2 distance, computed by solving a discrete transportation problem (Flamary & Courty, 2017).
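The Frank-Wolfe iteration over the simplex is simple to sketch: the linear minimization oracle over the simplex is always a vertex e_i, so each step shifts mass toward a single instance. The sketch below uses a stand-in convex quadratic objective (matching a weighted 1-D sample mean to a target), not the paper's J ∝ max-D_B; the data, target, and step rule 2/(t+2) are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_simplex(grad, nu0, n_iter=400):
    """Frank-Wolfe over the probability simplex: the linearized subproblem is
    minimized at the vertex e_i with i = argmin of the gradient, so each
    iteration reweights toward one instance."""
    nu = nu0.copy()
    for t in range(n_iter):
        i = int(np.argmin(grad(nu)))   # vertex minimizing the linearized objective
        eta = 2.0 / (t + 2.0)          # standard diminishing step size
        nu *= 1.0 - eta
        nu[i] += eta
    return nu

# stand-in convex objective: fit a weighted mean of three 1-D points to 1.5
Y = np.array([[0.0, 1.0, 2.0]])
target = np.array([1.5])
J = lambda nu: 0.5 * float(((Y @ nu - target) ** 2).sum())
grad = lambda nu: Y.T @ (Y @ nu - target)
nu0 = np.full(3, 1.0 / 3.0)
nu = frank_wolfe_simplex(grad, nu0)
```

The convex-combination update keeps the iterate on the simplex by construction, so no projection step is needed, which is the appeal of Frank-Wolfe for this constraint set.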

4. CONCLUSION

We propose the max-sliced Bures distance, a lower bound on the max-sliced W2 distance, which can be computed optimally with a tractable algorithm. We show increased performance with kernel-based witness functions for covariate shift detection and correction, and also highlight its utility in the linear case when the feature space is the internal representation of a pre-trained network. Importantly, the one-sided max-sliced Bures divergences enable direct interpretation of under- and over-representation between two samples, which can be used to identify systematic discrepancies.

A APPENDIX

The appendix details the derivation of the proposed divergences, formal results relating them to existing divergences, algorithms, and additional experimental results. Consider the max-sliced versions of the operator distances d_TV(ρ_X, ρ_Y) and d_B(ρ_X, ρ_Y), defined in equation 4 and equation 5, respectively. In these cases, the expression of the max-sliced distance can be simplified as max-d(ρ_X, ρ_Y) = sup_{Ω∈U₁} d(Ωρ_XΩ, Ωρ_YΩ) = sup_{Ω∈U₁} d(⟨Ω, ρ_X⟩_HS Ω, ⟨Ω, ρ_Y⟩_HS Ω) = sup_{Ω∈U₁} δ(⟨Ω, ρ_X⟩_HS, ⟨Ω, ρ_Y⟩_HS) = sup_{ω∈H: ‖ω‖₂≤1} δ(⟨ω, ρ_X ω⟩_H, ⟨ω, ρ_Y ω⟩_H), where ⟨•, •⟩_HS : (H × H) × (H × H) → R denotes the Hilbert-Schmidt (Schatten-2) inner product, ⟨Ω, ρ⟩_HS = ⟨ω ⊗ ω, ρ⟩_HS = ⟨ω, ρω⟩_H, and δ(p, q) = d(pΩ, qΩ) denotes the distance between scaled versions of Ω, for p ∈ R_{≥0} and q ∈ R_{≥0}. The equalities follow from the fact that ΩρΩ = (ω ⊗ ω)ρ(ω ⊗ ω) = ⟨ω, ρω⟩Ω. The max-sliced TV distance and squared max-sliced Bures distance yield expressions for δ(p, q) that match the form of the underlying TV and Hellinger divergences: d_TV(pΩ, qΩ) yields δ(p, q) = (1/2)|p − q|, and d²_B(pΩ, qΩ) yields δ(p, q) = (√p − √q)²: max-d_TV(ρ_X, ρ_Y) ≜ sup_{ω∈S} (1/2)|⟨ω, ρ_X ω⟩ − ⟨ω, ρ_Y ω⟩|, and max-d_B(ρ_X, ρ_Y) ≜ sup_{ω∈S} |√⟨ω, ρ_X ω⟩ − √⟨ω, ρ_Y ω⟩|, where S = {ω ∈ H : ‖ω‖₂ = 1}. The kernel-based divergences can be expressed in terms of the witness functions, since ⟨ω, ρ_X ω⟩ = E[⟨φ(X), ω⟩⟨φ(X), ω⟩] = E[ω²(X)], and likewise for ρ_Y. The underlying divergences and distances are listed in Table 2. Table 2: Relationship between the scalar discrepancy δ, the divergence D between continuous measures µ, ν and discrete measures µ̄, ν̄, the operator dissimilarity d, the max-sliced dissimilarity, and the kernel max-sliced divergence for the TV and squared Bures divergences (the squared Bures is twice the squared Hellinger).
TV | (Bures)²
δ(p, q): (1/2)|p − q| | (√p − √q)²
D(µ̄, ν̄): (1/2) Σ_i |µ̄_i − ν̄_i| | Σ_i (√µ̄_i − √ν̄_i)²
D(µ, ν): (1/2) ∫_X |dµ(x) − dν(x)| | ∫_X (√dµ(x) − √dν(x))²
d(ρ_X, ρ_Y): (1/2)‖ρ_X − ρ_Y‖₁ | ‖ρ_X‖₁ + ‖ρ_Y‖₁ − 2‖ρ_X^{1/2} ρ_Y^{1/2}‖₁
max-d(ρ_X, ρ_Y): (1/2) sup_{ω∈S} |⟨ω, ρ_X ω⟩ − ⟨ω, ρ_Y ω⟩| | sup_{ω∈S} (√⟨ω, ρ_X ω⟩ − √⟨ω, ρ_Y ω⟩)²
max-D(µ, ν): (1/2) sup_{ω∈S} |E[ω²(X)] − E[ω²(Y)]| | sup_{ω∈S} (√E[ω²(X)] − √E[ω²(Y)])²

Slicing is natural for the operator-based distances, as it is inherent in their definition such that they coincide with the corresponding divergences between discrete probability laws (Fuchs & Van De Graaf, 1999). The scalar discrepancy measure δ(•, •) can be accumulated across a complete set of slices to obtain the original distances. Consider a set (or countably infinite sequence) of orthogonal unit-trace-norm operators O = {Ω₁ = ω₁ ⊗ ω₁, Ω₂ = ω₂ ⊗ ω₂, …, Ω_k = ω_k ⊗ ω_k}. Since these operators are orthogonal and have unit trace norm, ‖Σ_{i=1}^k Ω_i‖_∞ = 1, and d(ρ_X, ρ_Y) ≥ Σ_{i=1}^k d(⟨Ω_i, ρ_X⟩_HS Ω_i, ⟨Ω_i, ρ_Y⟩_HS Ω_i) = Σ_{i=1}^k δ(⟨Ω_i, ρ_X⟩_HS, ⟨Ω_i, ρ_Y⟩_HS) = Σ_{i=1}^k δ(p_i(O), q_i(O)), where p_i(O) = ⟨Ω_i, ρ_X⟩_HS and q_i(O) = ⟨Ω_i, ρ_Y⟩_HS. To obtain equality, one must optimize O over all possible sets of orthogonal unit-trace-norm operators.

A.1.1 MAX-SLICED TOTAL VARIATION DISTANCE

The sliced kernel total variation distance is d_TV(Ωρ_XΩ, Ωρ_YΩ) = (1/2)‖Ω(ρ_X − ρ_Y)Ω‖₁ = (1/2)‖⟨Ω, ρ_X − ρ_Y⟩Ω‖₁ = (1/2)|⟨Ω, ρ_X − ρ_Y⟩| ‖Ω‖₁. Maximizing over slices yields (1/2) sup_{Ω∈U₁} |⟨Ω, ρ_X − ρ_Y⟩| ‖Ω‖₁ = (1/2) sup_{Ω∈U₁} |⟨Ω, ρ_X − ρ_Y⟩| = (1/2) sup_{ω: ‖ω‖₂≤1} |‖ω‖²_µ − ‖ω‖²_ν|, where the first equality follows from the fact that the distance is maximized when ‖Ω‖₁ = 1. This yields the expression max-D^H_TV(µ, ν) ≜ (1/2) sup_{ω: ‖ω‖₂=1} |‖ω‖²_µ − ‖ω‖²_ν|. Notably, the penultimate expression in equation 18 can be related to the operator norm: sup_{Ω∈U₁} |⟨Ω, ρ_X − ρ_Y⟩| ≤ sup_{Ω∈{O∈H×H: ‖O‖₁≤1}} ⟨Ω, ρ_X − ρ_Y⟩ = ‖ρ_X − ρ_Y‖_∞, by the dual-norm definition. Since ρ_X − ρ_Y is symmetric, the equality is achieved. Thus, max-D^H_TV(µ, ν) = (1/2)‖ρ_X − ρ_Y‖_∞. For a linear kernel, this can be computed by finding the largest-magnitude eigenvalue of ρ_X − ρ_Y = E_{X∼µ}[XXᵀ] − E_{Y∼ν}[Y Yᵀ] ∈ R^{d×d}.

A.1.2 MAX-SLICED BURES DISTANCE

The sliced version of the Bures distance is d_B(Ωρ_XΩ, Ωρ_YΩ) = √(‖Ωρ_XΩ‖₁ + ‖Ωρ_YΩ‖₁ − 2‖(Ωρ_XΩ)^{1/2}(Ωρ_YΩ)^{1/2}‖₁), which can be simplified since ‖Ωρ_XΩ‖₁ = tr((ω ⊗ ω)ρ_X(ω ⊗ ω)) = ⟨ω, ρ_X ω⟩‖ω‖₂² = ‖ω‖²_µ ‖ω‖₂², and ‖(Ωρ_XΩ)^{1/2}(Ωρ_YΩ)^{1/2}‖₁ = √(⟨ω, ρ_X ω⟩⟨ω, ρ_Y ω⟩) ‖Ω‖₁ = √(‖ω‖²_µ ‖ω‖²_ν) ‖Ω‖₁ = ‖ω‖_µ ‖ω‖_ν ‖ω‖₂². Using these expressions, the max-sliced Bures distance is sup_{Ω∈U₁} d_B(Ωρ_XΩ, Ωρ_YΩ) = sup_{ω∈H: ‖ω‖₂≤1} ‖ω‖₂ √(‖ω‖²_µ + ‖ω‖²_ν − 2‖ω‖_µ ‖ω‖_ν). The expression is monotonic in the norm of ω, yielding max-D^H_B(µ, ν) ≜ sup_{ω∈H: ‖ω‖₂≤1} √((‖ω‖_µ − ‖ω‖_ν)²) = sup_{ω∈H: ‖ω‖₂≤1} |‖ω‖_µ − ‖ω‖_ν| (21) = sup_{ω∈H: ‖ω‖₂=1} |√(E_{X∼µ}[ω²(X)]) − √(E_{Y∼ν}[ω²(Y)])|.

A.1.3 MAX-SLICED GAUSS-WASSERSTEIN DISTANCES

Two max-sliced versions of the Gauss-Wasserstein or Fréchet distance in the RKHS are
$\text{max}_L\text{-}D^{\mathcal{H}}_{GW}(\mu,\nu) = \sup_{\Omega\in U_1}\big(\|\Omega(m_X-m_Y)\|_2^2 + d_B^2(\Omega\Sigma_X\Omega, \Omega\Sigma_Y\Omega)\big)^{1/2}$
$= \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\|\omega\|_2\big(\langle m_X-m_Y,\omega\rangle^2 + (\sqrt{\langle\omega,\Sigma_X\omega\rangle} - \sqrt{\langle\omega,\Sigma_Y\omega\rangle})^2\big)^{1/2}$
$= \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\langle m_X-m_Y,\omega\rangle^2 + (\sqrt{\langle\omega,\Sigma_X\omega\rangle} - \sqrt{\langle\omega,\Sigma_Y\omega\rangle})^2\big)^{1/2}$
$= \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big((\mathbb{E}_{X\sim\mu}[\omega(X)] - \mathbb{E}_{Y\sim\nu}[\omega(Y)])^2 + (\sigma_{\omega(X)} - \sigma_{\omega(Y)})^2\big)^{1/2}$, (22)
where $\sigma_{\omega(X)} = \sqrt{\mathbb{E}[(\omega(X) - \langle m_X,\omega\rangle)^2]}$ is the standard deviation of $\omega(X)$, and likewise $\sigma_{\omega(Y)} = \sqrt{\mathbb{E}[(\omega(Y) - \langle m_Y,\omega\rangle)^2]}$, and
$\text{max}_U\text{-}D^{\mathcal{H}}_{GW}(\mu,\nu) = \big(\sup_{\Omega\in U_1}\|\Omega(m_X-m_Y)\|_2^2 + \sup_{\Omega\in U_1} d_B^2(\Omega\Sigma_X\Omega, \Omega\Sigma_Y\Omega)\big)^{1/2}$
$= \big(\mathrm{MMD}^2(\mu,\nu) + \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}(\sigma_{\omega(X)} - \sigma_{\omega(Y)})^2\big)^{1/2} = \big(\mathrm{MMD}^2(\mu,\nu) + \text{max-}d_B^2(\Sigma_X, \Sigma_Y)\big)^{1/2}$. (23)
Notably, $\text{max}_L\text{-}D^{\mathcal{H}}_{GW}(\mu,\nu)$ is the supremum of the Fréchet distance over all witness functions.

A.1.4 MAX-SLICED KERNEL WASSERSTEIN

A sliced version of the kernel Wasserstein distance between $\mu$ and $\nu$ relies on the sliced distance $d_{\kappa,\omega}(X,Y) = \|(\omega\otimes\omega)(\varphi(X)-\varphi(Y))\|_2 = |\langle\omega, \varphi(X)-\varphi(Y)\rangle|\,\|\omega\|_2 = |\omega(X)-\omega(Y)|\,\|\omega\|_2$. This distance is monotonic in the norm of $\omega$ and convex with respect to $\omega$. The max-sliced kernel Wasserstein-$p$ distance, $p\ge1$, is
$\text{max-}W^{\mathcal{H}}_p(\mu,\nu) = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\inf_{\gamma\in\Gamma(\mu,\nu)}\big(\mathbb{E}_{(X,Y)\sim\gamma}[(d_{\kappa,\omega}(X,Y))^p]\big)^{1/p} = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\inf_{\gamma\in\Gamma(\mu,\nu)}\big(\mathbb{E}_{(X,Y)\sim\gamma}[|\omega(X)-\omega(Y)|^p]\big)^{1/p}$,
which is a one-dimensional optimal transport problem. Let $\omega_\#\mu$ and $\omega_\#\nu$ denote the pushforward measures; then the divergence can be written as
$\text{max-}W^{\mathcal{H}}_p(\mu,\nu) = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\inf_{\pi\in\Pi(\omega_\#\mu,\omega_\#\nu)}\big(\mathbb{E}_{(S,T)\sim\pi}[|S-T|^p]\big)^{1/p}$,
where $\Pi(\omega_\#\mu,\omega_\#\nu)$ is the set of all joint distributions coupling the pushforward measures. Assuming the measures are absolutely continuous and adopting the notation of Santambrogio (2015), let $F_{\omega,\mu}(w) = \int_{-\infty}^w d\omega_\#\mu = \omega_\#\mu((-\infty,w])$ and $F_{\omega,\nu}(w) = \int_{-\infty}^w d\omega_\#\nu = \omega_\#\nu((-\infty,w])$ denote the cumulative distribution functions of the pushforward measures, with pseudo-inverses $F^{-1}_{\omega,\mu}(q) = \inf\{w\in\mathbb{R}: F_{\omega,\mu}(w)\ge q\}$ and $F^{-1}_{\omega,\nu}(q) = \inf\{w\in\mathbb{R}: F_{\omega,\nu}(w)\ge q\}$. As shown in Lemma 2.8 of Santambrogio (2015), the optimal transport plan $\pi^*$ has cumulative distribution $G_\omega(w_X,w_Y) = \min\{F_{\omega,\mu}(w_X), F_{\omega,\nu}(w_Y)\}$ and the divergence is
$\text{max-}W^{\mathcal{H}}_p(\mu,\nu) = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\int_0^1 |F^{-1}_{\omega,\mu}(q) - F^{-1}_{\omega,\nu}(q)|^p\,dq\big)^{1/p}$,
and for the case $p = 1$ (Santambrogio, 2015, Proposition 2.17),
$\text{max-}W^{\mathcal{H}}_1(\mu,\nu) \triangleq \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\int_{\mathbb{R}} |F_{\omega,\mu}(w) - F_{\omega,\nu}(w)|\,dw$.
The objective in the last quantity is an $L_1$-norm version of the Cramér-von Mises criterion (Schmid & Trede, 1995; Anderson, 1962). That is, the max-sliced kernel Wasserstein-1 distance is equivalent to a max-sliced $L_1$-norm version of the Cramér-von Mises criterion, where the slicing corresponds to a function in the RKHS that witnesses the largest discrepancies between the measures.
The choice of $p = 2$ simplifies, yielding the following optimization problem:
$\text{max-}W^{\mathcal{H}}_2(\mu,\nu) = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\inf_{\gamma\in\Gamma(\mu,\nu)}\big(\mathbb{E}_{(X,Y)\sim\gamma}[\omega^2(X) + \omega^2(Y) - 2\omega(X)\omega(Y)]\big)^{1/2}$
$= \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\|\omega\|^2_\mu + \|\omega\|^2_\nu - \sup_{\gamma\in\Gamma(\mu,\nu)}\mathbb{E}_{(X,Y)\sim\gamma}[2\omega(X)\omega(Y)]\big)^{1/2}$ (28)
$= \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\|\omega\|^2_\mu + \|\omega\|^2_\nu - 2\int_0^1 F^{-1}_{\omega,\mu}(q)\,F^{-1}_{\omega,\nu}(q)\,dq\big)^{1/2}$. (29)
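For a fixed slice and equal sample sizes with uniform weights, the one-dimensional transport problem above reduces to sorting the projections (the empirical quantile functions). A minimal numpy sketch under these assumptions; the function name and toy data are ours:

```python
import numpy as np

def sliced_wp(w, X, Y, p=2):
    """Wasserstein-p between the pushforwards of two equal-size samples
    along a slice w: sort the projections and average |differences|^p."""
    s = np.sort(X @ w)          # empirical quantiles of w# mu
    t = np.sort(Y @ w)          # empirical quantiles of w# nu
    return (np.mean(np.abs(s - t) ** p)) ** (1.0 / p)

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 2)) + np.array([1.0, 0.0])  # shifted along the x-axis
Y = rng.normal(size=(4000, 2))
w = np.array([1.0, 0.0])
d = sliced_wp(w, X, Y, p=2)    # close to 1.0: the mean shift along the slice
```

Maximizing this quantity over unit-norm slices $w$ would give the (linear) max-sliced Wasserstein-$p$ distance.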

A.2 RELATIONSHIP BETWEEN THE KERNEL-BASED MAX-SLICED DIVERGENCES

The Bures distance can be used to lower bound the TV distance (Fuchs & Van De Graaf, 1999). This inequality stems from the inequality between the squared Hellinger and total variation distances for discrete probability laws,
$d_B^2(\rho_X,\rho_Y) \le \|\rho_X - \rho_Y\|_1 = 2\,d_{TV}(\rho_X,\rho_Y)$, since $\sum_i(\sqrt{p_i}-\sqrt{q_i})^2 \le \sum_i |p_i - q_i|$, where the inequality holds for each summand, $(\sqrt{p_i}-\sqrt{q_i})^2 \le |\sqrt{p_i}-\sqrt{q_i}|(\sqrt{p_i}+\sqrt{q_i}) = |p_i - q_i|$. This inequality also holds for the max-sliced divergences, as stated in Theorem 1.

Proof of Theorem 1. $(\|\omega\|_\mu - \|\omega\|_\nu)^2 \le \big|\,\|\omega\|^2_\mu - \|\omega\|^2_\nu\big|$: letting $a = \|\omega\|^2_\mu$ and $b = \|\omega\|^2_\nu$, we have $(\sqrt{a}-\sqrt{b})^2 \le |\sqrt{a}-\sqrt{b}|(\sqrt{a}+\sqrt{b}) = |a-b|$. Taking the supremum over $\omega\in\mathcal{H}$ with the constraint $\|\omega\|_2\le1$ yields the desired result, $\big(\text{max-}D^{\mathcal{H}}_B(\mu,\nu)\big)^2 \le 2\,\text{max-}D^{\mathcal{H}}_{TV}(\mu,\nu)$.

We now consider the relationship between the Gauss-Wasserstein or Fréchet distance, which combines the distances between the first- and second-order moments, and the max-sliced Bures when it is applied directly to the uncentered covariance operators. For this we need the following lemma.

Lemma 4 (Reverse triangle inequality). For two vectors in $\mathbb{R}^2$, the difference between their Euclidean norms is less than or equal to the Euclidean norm of their difference: for $a, b, c, d \in \mathbb{R}$,
$\sqrt{a^2+b^2} - \sqrt{c^2+d^2} \le \sqrt{(a-c)^2 + (b-d)^2}$.

Proof. Let $e = \big(\sqrt{a^2+b^2} - \sqrt{c^2+d^2}\big)^2$ and $f = (a-c)^2 + (b-d)^2$. Then
$e = a^2+b^2+c^2+d^2 - 2\sqrt{(a^2+b^2)(c^2+d^2)} = a^2+b^2+c^2+d^2 - 2\sqrt{a^2c^2 + b^2d^2 + b^2c^2 + a^2d^2}$
$\le a^2+b^2+c^2+d^2 - 2\sqrt{a^2c^2 + b^2d^2 + 2\sqrt{a^2b^2c^2d^2}}$, by the arithmetic-geometric mean inequality ($b^2c^2 + a^2d^2 \ge 2\sqrt{a^2b^2c^2d^2}$),
$= a^2+b^2+c^2+d^2 - 2\big(\sqrt{a^2c^2} + \sqrt{b^2d^2}\big)$
$\le a^2+b^2+c^2+d^2 - 2(ac+bd) = (a-c)^2 + (b-d)^2 = f$.
Taking the square root of each side yields the inequality $\sqrt{e} \le \sqrt{f}$.

Proof of Theorem 2.
To relate the sliced Bures and Gauss-Wasserstein distances, we note that $\|\omega\|^2_\mu = \langle\rho_X, \omega\otimes\omega\rangle_{HS} = \langle\Sigma_X + m_X\otimes m_X,\ \omega\otimes\omega\rangle_{HS} = \sigma^2_{\omega(X)} + \langle m_X,\omega\rangle^2$, and likewise for $\|\omega\|^2_\nu$. Then
$\|\omega\|_\mu - \|\omega\|_\nu = \sqrt{\langle m_X,\omega\rangle^2 + \sigma^2_{\omega(X)}} - \sqrt{\langle m_Y,\omega\rangle^2 + \sigma^2_{\omega(Y)}} \le \sqrt{\langle m_X - m_Y,\omega\rangle^2 + (\sigma_{\omega(X)} - \sigma_{\omega(Y)})^2}$, (30)
where the inequality relies on Lemma 4, with $a = \langle m_X,\omega\rangle$, $b = \sigma_{\omega(X)}$, $c = \langle m_Y,\omega\rangle$, and $d = \sigma_{\omega(Y)}$. Taking the supremum over slices yields the desired inequality. Theorem 2 shows that the max-sliced kernel Bures distance is a lower bound on the max-sliced kernel Wasserstein-2 distance, since the latter is lower bounded by the max-sliced kernel Gauss-Wasserstein distance.

Proof of Theorem 3. Let $\omega(X) = \langle m_X,\omega\rangle + \tilde\omega(X)$ and $\omega(Y) = \langle m_Y,\omega\rangle + \tilde\omega(Y)$, where $\tilde\omega(X) = \langle\varphi(X) - m_X, \omega\rangle$ and $\tilde\omega(Y) = \langle\varphi(Y) - m_Y, \omega\rangle$ are zero mean, with $\mathbb{E}[\tilde\omega^2(X)] = \sigma^2_{\omega(X)}$ and $\mathbb{E}[\tilde\omega^2(Y)] = \sigma^2_{\omega(Y)}$. The squared Fréchet distance between the random variables $\omega(X)$ and $\omega(Y)$ is
$(\mathbb{E}[\omega(X)] - \mathbb{E}[\omega(Y)])^2 + (\sigma_{\omega(X)} - \sigma_{\omega(Y)})^2 = \langle m_X - m_Y, \omega\rangle^2 + (\sigma_{\omega(X)} - \sigma_{\omega(Y)})^2$
$= \langle m_X,\omega\rangle^2 + \sigma^2_{\omega(X)} + \langle m_Y,\omega\rangle^2 + \sigma^2_{\omega(Y)} - 2\big(\langle m_X,\omega\rangle\langle m_Y,\omega\rangle + \sigma_{\omega(X)}\sigma_{\omega(Y)}\big)$
$= \mathbb{E}[\omega^2(X)] + \mathbb{E}[\omega^2(Y)] - 2\big(\mathbb{E}[\omega(X)]\,\mathbb{E}[\omega(Y)] + \sqrt{\mathbb{E}[\tilde\omega^2(X)]\,\mathbb{E}[\tilde\omega^2(Y)]}\big)$.
By Hölder's inequality, $\mathbb{E}[\tilde\omega(X)\tilde\omega(Y)] \le \sigma_{\omega(X)}\sigma_{\omega(Y)} = \sqrt{\mathbb{E}[\tilde\omega^2(X)]\,\mathbb{E}[\tilde\omega^2(Y)]}$. Consequently,
$\mathbb{E}[\omega^2(X)] + \mathbb{E}[\omega^2(Y)] - 2\big(\mathbb{E}[\omega(X)]\,\mathbb{E}[\omega(Y)] + \sqrt{\mathbb{E}[\tilde\omega^2(X)]\,\mathbb{E}[\tilde\omega^2(Y)]}\big)$ (31)
$\le \mathbb{E}[\omega^2(X)] + \mathbb{E}[\omega^2(Y)] - 2\big(\mathbb{E}[\omega(X)]\,\mathbb{E}[\omega(Y)] + \mathbb{E}[\tilde\omega(X)\tilde\omega(Y)]\big) = \mathbb{E}[\omega^2(X)] + \mathbb{E}[\omega^2(Y)] - 2\,\mathbb{E}[\omega(X)\omega(Y)] = \mathbb{E}[(\omega(X) - \omega(Y))^2]$. (32)
Taking the infimum over all joint distributions $(X,Y)\sim\gamma$ within the set of couplings $\gamma\in\Gamma(\mu,\nu)$ yields the sliced Wasserstein-2 distance on the right-hand side. Maximizing over slices yields $\text{max}_U\text{-}D^{\mathcal{H}}_{GW}(\mu,\nu) \le \text{max-}W^{\mathcal{H}}_2(\mu,\nu)$, and combining with Theorem 2 yields the desired result.
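The per-slice inequality used in the proof of Theorem 2 holds exactly for empirical moments, since the empirical second moment decomposes as mean squared plus variance. A small numerical check with linear slices on Gaussian toy data (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=0.5, scale=2.0, size=(10000, 4))
Y = rng.normal(loc=0.0, scale=1.0, size=(10000, 4))

for _ in range(100):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)                       # random unit slice
    px, py = X @ w, Y @ w
    # sliced squared Bures between uncentered second moments
    lhs = (np.sqrt(np.mean(px**2)) - np.sqrt(np.mean(py**2))) ** 2
    # sliced squared Frechet: mean difference plus std-dev difference
    rhs = (px.mean() - py.mean()) ** 2 + (px.std() - py.std()) ** 2
    assert lhs <= rhs + 1e-12                    # Lemma 4, per slice
```

The inequality is deterministic per slice (it does not depend on sampling noise), which is why the assertion never fails.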
A.3 RELATIONSHIP TO OTHER DIVERGENCES

Finally, we note that the terms in the max-sliced Gauss-Wasserstein divergences are related to kernel Fisher discriminant analysis (KFDA) (Mika et al., 1999). The KFDA objective is based on the ratio of the squared difference in means to the pooled variance:
$D^{\mathcal{H}}_{FDA}(\mu,\nu) = \sup_\omega \frac{\langle\omega, m_X - m_Y\rangle^2}{\sigma^2_{\omega(X)} + \sigma^2_{\omega(Y)}} = \sup_\omega \frac{\langle\omega, m_X - m_Y\rangle^2}{\langle\omega, (\Sigma_X + \Sigma_Y)\omega\rangle}$. (33)
KFDA seeks a witness function that has widely separated means for the two measures and minimal variance.

A.4 COMPUTING THE MAX-SLICED KERNEL DIVERGENCES

We assume the witness function $\omega\in\mathcal{H}$ is of the form $\omega = \sum_{i=1}^{n+m}\alpha_i\varphi(z_i)$ with $\omega(\cdot) = \sum_{i=1}^{n+m}\alpha_i\kappa(\cdot, z_i)$, where $z_i = x_i$ for $1\le i\le m$, $z_i = y_{i-m}$ for $m < i \le n+m$, and $\alpha\in\mathbb{R}^{m+n}$. In this case,
$\|\omega\|^2_{\hat\mu} = \langle\omega\otimes\omega, (\varphi\otimes\varphi)_\#\hat\mu\rangle = \sum_{j=1}^m \mu_j\langle\omega, \varphi(x_j)\rangle^2 = \sum_{j=1}^m \mu_j\big(\sum_{i=1}^{n+m}\alpha_i\kappa(z_i, x_j)\big)^2 = \langle\mu, (K_{XZ}\alpha)^{\circ2}\rangle = \alpha^\top K_{XZ}^\top\,\mathrm{diag}(\mu)\,K_{XZ}\alpha = \|D_\mu^{1/2}K_{XZ}\alpha\|_2^2$,
where $K_{ij} = \kappa(z_i, z_j)$, $K = [K_{XX}, K_{XY};\ K_{YX}, K_{YY}]$ with $K_{XZ} = [K_{XX}, K_{XY}]$ and $K_{YZ} = [K_{YX}, K_{YY}]$, $(\cdot)^{\circ2}$ denotes elementwise squaring of a matrix/vector, and $D_v = \mathrm{diag}(v)$ denotes a diagonal matrix whose diagonal entries are the vector $v$. Similarly, $\|\omega\|^2_{\hat\nu} = \langle\nu, (K_{YZ}\alpha)^{\circ2}\rangle = \|D_\nu^{1/2}K_{YZ}\alpha\|_2^2$. In order for the constraint $\|\omega\|_2^2 \le 1 \iff \alpha^\top K\alpha \le 1$ to ensure a bounded solution, we assume $K$ is strictly positive definite; for this purpose, we add a small value to its diagonal, $K + 10^{-9}I$, when necessary in the optimization procedures. For computational purposes when $m$ or $n$ is large, a (possibly random) subset of landmark points of size $l < n+m$ can be used to form the witness function $\omega = \sum_{i=1}^l \alpha_i\varphi(z_{\tau_i})$, where $\{\tau_i\}_{i=1}^l \subset \{1, \ldots, m+n\}$. In this case, $K_{XZ}\in\mathbb{R}^{m\times l}$ and $K_{YZ}\in\mathbb{R}^{n\times l}$ with $[K_{XZ}]_{i,j} = \kappa(x_i, z_{\tau_j})$ and $[K_{YZ}]_{i,j} = \kappa(y_i, z_{\tau_j})$, and the constraint must be adjusted accordingly.
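The identity between the weighted sum of squared witness evaluations and the quadratic form in $\alpha$ can be verified numerically. A sketch assuming a Gaussian kernel and uniform sample weights (the helper names are ours):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(3)
X, Y = rng.normal(size=(40, 2)), rng.normal(size=(30, 2)) + 1.0
Z = np.vstack([X, Y])                      # expansion points z_i
mu = np.full(len(X), 1 / len(X))           # uniform sample weights
nu = np.full(len(Y), 1 / len(Y))
K_XZ = gaussian_kernel(X, Z, sigma=1.0)    # [K_XZ]_{j,i} = k(x_j, z_i)
K_YZ = gaussian_kernel(Y, Z, sigma=1.0)
alpha = rng.normal(size=len(Z))

# ||omega||^2_mu as a weighted sum of squared witness evaluations ...
direct = mu @ (K_XZ @ alpha) ** 2
# ... equals the quadratic form alpha^T K_XZ^T D_mu K_XZ alpha
quad = alpha @ (K_XZ.T @ np.diag(mu) @ K_XZ) @ alpha
assert np.allclose(direct, quad)
```

The same identity holds for $\|\omega\|^2_{\hat\nu}$ with $K_{YZ}$ and $\nu$.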

A.4.1 MAX-SLICED KERNEL TV DISTANCE

Using these expressions, the max-sliced kernel TV distance is
$\text{max-}D^{\mathcal{H}}_{TV}(\hat\mu,\hat\nu) = \max_{\alpha:\alpha^\top K\alpha\le1} \tfrac12\big|\,\|D_\mu^{1/2}K_{XZ}\alpha\|_2^2 - \|D_\nu^{1/2}K_{YZ}\alpha\|_2^2\big|$ (34)
$= \max_{\alpha:\alpha^\top K\alpha\le1} \tfrac12\big|\alpha^\top(K_{XZ}^\top D_\mu K_{XZ} - K_{YZ}^\top D_\nu K_{YZ})\alpha\big|$. (35)
The solution is the generalized eigenvector corresponding to the largest-magnitude eigenvalue of the generalized eigenvalue problem $Av = \lambda Kv$, where $A = K_{XZ}^\top D_\mu K_{XZ} - K_{YZ}^\top D_\nu K_{YZ}$; that is, $\alpha^* = \arg\max_\alpha \frac{|\alpha^\top A\alpha|}{\alpha^\top K\alpha}$. The witness function is $\omega^*(\cdot) = \sum_{i=1}^{m+n}\alpha^*_i\kappa(\cdot, z_i)$.
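A sketch of this generalized eigenvalue computation for equations 34-35, using scipy's symmetric-definite solver; the function name and toy data are ours. Note that `scipy.linalg.eigh(A, K)` returns eigenvectors normalized so that $\alpha^\top K\alpha = 1$, which matches the constraint.

```python
import numpy as np
from scipy.linalg import eigh

def max_sliced_kernel_tv(K_XZ, K_YZ, K, mu, nu, ridge=1e-9):
    """Largest-magnitude generalized eigenpair of A v = lambda K v,
    with A = K_XZ^T D_mu K_XZ - K_YZ^T D_nu K_YZ (eqs. 34-35)."""
    A = K_XZ.T @ np.diag(mu) @ K_XZ - K_YZ.T @ np.diag(nu) @ K_YZ
    Kr = K + ridge * np.eye(len(K))   # small ridge ensures K is positive definite
    evals, evecs = eigh(A, Kr)        # generalized symmetric-definite problem
    i = np.argmax(np.abs(evals))
    return 0.5 * abs(evals[i]), evecs[:, i]

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(25, 2)), rng.normal(size=(20, 2)) + 0.5
Z = np.vstack([X, Y])
k = lambda A, B: np.exp(-((A[:, None] - B[None]) ** 2).sum(-1) / 2)
K, K_XZ, K_YZ = k(Z, Z), k(X, Z), k(Y, Z)
mu, nu = np.full(25, 1 / 25), np.full(20, 1 / 20)
d, alpha = max_sliced_kernel_tv(K_XZ, K_YZ, K, mu, nu)
```

The witness function evaluations at new points are then $\omega^*(x) = \sum_i \alpha_i\,\kappa(x, z_i)$.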

A.4.2 MAX-SLICED KERNEL BURES DISTANCE

The sample-based max-sliced kernel Bures distance is
$\text{max-}D^{\mathcal{H}}_B(\hat\mu,\hat\nu) = \max_{\alpha:\alpha^\top K\alpha\le1}\big|\,\|D_\mu^{1/2}K_{XZ}\alpha\|_2 - \|D_\nu^{1/2}K_{YZ}\alpha\|_2\big|$ (36)
$= \max_{s\in\{-1,+1\}}\ \max_{\alpha:\alpha^\top K\alpha\le1}\ s\,\|D_\mu^{1/2}K_{XZ}\alpha\|_2 - s\,\|D_\nu^{1/2}K_{YZ}\alpha\|_2$. (37)
The last expression shows that the max-sliced Bures distance can be expressed as a bilevel optimization problem, where each inner optimization problem, which we refer to as a one-sided max-sliced Bures divergence, is the minimization of a difference of convex functions:
$\text{max-}D^{\mathcal{H}}_B(\hat\mu,\hat\nu) = \max\big\{-\min_{\alpha:\alpha^\top K\alpha\le1}\big(g(\alpha) - h(\alpha)\big),\ -\min_{\alpha:\alpha^\top K\alpha\le1}\big(h(\alpha) - g(\alpha)\big)\big\}$,
$g(\alpha) = \|D_\nu^{1/2}K_{YZ}\alpha\|_2$, (38)
$h(\alpha) = \|D_\mu^{1/2}K_{XZ}\alpha\|_2$. (39)
Without loss of generality, we consider the first case ($s = 1$),
$\min_{\alpha:\alpha^\top K\alpha\le1}\ g(\alpha) - h(\alpha)$. (P)
Inspired by the approach of Landsman (2008), we relate this problem to a quadratic program, specifically the quadratically constrained quadratic program
$\min_{\alpha:\alpha^\top K\alpha\le1}\ c_1 g^2(\alpha) - c_2 h^2(\alpha) = \min_{\alpha:\alpha^\top K\alpha\le1}\ \alpha^\top(c_1 K_{YZ}^\top D_\nu K_{YZ} - c_2 K_{XZ}^\top D_\mu K_{XZ})\alpha$, (Q)
for $c_1, c_2 \in \mathbb{R}_{\ge0}$, whose solution can be obtained, as for the max-sliced kernel TV distance, by solving a generalized eigenvalue problem. For (P), the Lagrangian function is
$L(\alpha,\lambda) = g(\alpha) - h(\alpha) - \lambda(\alpha^\top K\alpha - 1)$.
Let $G = \{\alpha : g(\alpha) > 0,\ h(\alpha) > 0\}$ denote the set of points where $g$ and $h$ are differentiable. Then for $\alpha\in G$ and $K$ positive definite, $L(\alpha,\lambda)$ is differentiable, and
$\nabla_\alpha L(\alpha,\lambda) = \nabla_\alpha g(\alpha) - \nabla_\alpha h(\alpha) - 2\lambda K\alpha = \tfrac{1}{2g(\alpha)}\nabla_\alpha g^2(\alpha) - \tfrac{1}{2h(\alpha)}\nabla_\alpha h^2(\alpha) - 2\lambda K\alpha$,
where the equality follows from $\nabla_\alpha g^2(\alpha) = 2g(\alpha)\nabla_\alpha g(\alpha)$ and likewise $\nabla_\alpha h^2(\alpha) = 2h(\alpha)\nabla_\alpha h(\alpha)$. For (Q), the Lagrangian function and its gradient are
$\tilde L(\alpha,\lambda) = c_1 g^2(\alpha) - c_2 h^2(\alpha) - \lambda(\alpha^\top K\alpha - 1)$, $\quad\nabla_\alpha\tilde L(\alpha,\lambda) = c_1\nabla_\alpha g^2(\alpha) - c_2\nabla_\alpha h^2(\alpha) - 2\lambda K\alpha$.
If $c_1 = \tfrac{1}{2g(\alpha)}$ and $c_2 = \tfrac{1}{2h(\alpha)}$, then $\nabla_\alpha\tilde L(\alpha,\lambda) = \nabla_\alpha L(\alpha,\lambda)$. Let $\alpha^*$ denote a global optimum of (Q). If $c_1 = \tfrac{1}{2g(\alpha^*)}$ and $c_2 = \tfrac{1}{2h(\alpha^*)}$, then $\nabla_\alpha\tilde L(\alpha,\lambda)\big|_{\alpha=\alpha^*} = \nabla_\alpha L(\alpha,\lambda)\big|_{\alpha=\alpha^*}$. Consequently, $\alpha^*$ is a local optimum of (P).
By the Karush-Kuhn-Tucker conditions, it is a necessary condition for all optima of (P) in $G$ to have this form. Thus, any global optimum of (P) that lies in $G$ corresponds to a global optimum of (Q) for particular values of $c_1, c_2$. The family of solutions to (Q) that includes all local optima of (P) in $G$ is
$\alpha^*_\gamma = \arg\max_{\alpha:\alpha^\top K\alpha\le1}\ \gamma h^2(\alpha) - g^2(\alpha)$, $\quad\gamma = \tfrac{c_2}{c_1}\in(0,1]$,
where the bounds follow because the functions are non-negative and $h^2(\alpha^*_\gamma) \ge g^2(\alpha^*_\gamma) \implies \tfrac{1}{2c_2} \ge \tfrac{1}{2c_1} \implies c_1 \ge c_2 \implies \tfrac{c_2}{c_1}\le1$. Notably, $\gamma = 1 \iff c_1 = c_2$ corresponds to the one-sided max-sliced TV. The global optimum of (P) within $G$ is necessarily within $\{\alpha^*_\gamma\}_{\gamma\in(0,1]}$ and can be found as the solution to the bounded scalar optimization problem $\min_{\gamma\in(0,1]}\ g(\alpha^*_\gamma) - h(\alpha^*_\gamma)$. A remaining case for a global optimum is a non-differentiable point $\alpha'\notin G$, specifically $g(\alpha') = 0$ (the case of $h(\alpha') = 0$ is trivial), which corresponds to $\alpha'\in\mathrm{Null}(D_\nu K_{YZ})$; this case is treated in the annex.

In the case of equal sample sizes with uniform measure, $m = n$, $\mu = \tfrac1m\mathbf{1}_m$ and $\nu = \tfrac1n\mathbf{1}_n$, elements of $\mathcal{P}_{\hat\mu,\hat\nu}$ are scaled elements of the Birkhoff polytope (the set of doubly stochastic matrices), the solution to the linear program is a permutation, and
$\text{max-}W^{\mathcal{H}}_p(\hat\mu,\hat\nu) = \sup_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\sum_{i=1}^m \tfrac1m|\omega(x_{(i)}) - \omega(y_{(i)})|^p\big)^{1/p}$,
i.e., a weighted $\ell_p$ distance between the sorted evaluation vectors $\acute\omega_X$ and $\acute\omega_Y$. As in the continuous case, the discrete transport plan between the sorted measures has the cumulative distribution $\acute G\in[0,1]^{m\times n}$ with $\acute G_{i,j} = \min\{\sum_{k=1}^i\acute\mu_k,\ \sum_{k=1}^j\acute\nu_k\}$. The optimal transport plan between the sorted measures is given by taking the first difference of $\acute G$ over both rows and columns: $\acute P_{1,1} = \acute G_{1,1}$, $\acute P_{1,j} = \acute G_{1,j} - \acute G_{1,j-1}$, $\acute P_{i,1} = \acute G_{i,1} - \acute G_{i-1,1}$, and $\acute P_{i,j} = \acute G_{i,j} - \acute G_{i-1,j} - \acute G_{i,j-1} + \acute G_{i-1,j-1}$. Using the sorted witness function evaluations, the distance can be written as
$\text{max-}W^{\mathcal{H}}_p(\hat\mu,\hat\nu) = \max_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\sum_{i=1}^m\sum_{j=1}^n \acute P_{i,j}\,|\omega(x_{(i)}) - \omega(y_{(j)})|^p\big)^{1/p}$.
We now turn our attention to the optimization of the function $\omega$ parametrized in terms of $\alpha$, $\omega(\cdot) = \sum_{i=1}^{m+n}\alpha_i\kappa(\cdot, z_i)$.
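The cumulative-minimum construction of the sorted transport plan can be written compactly as first differences of a zero-padded $\acute G$. A sketch (function name ours):

```python
import numpy as np

def sorted_transport_plan(mu, nu):
    """Optimal plan between two sorted 1-D discrete measures via the
    joint CDF G_ij = min(cumsum(mu)_i, cumsum(nu)_j) and its first
    differences over rows and columns."""
    G = np.minimum.outer(np.cumsum(mu), np.cumsum(nu))
    Gp = np.pad(G, ((1, 0), (1, 0)))   # prepend a zero row and column
    # P_ij = G_ij - G_{i-1,j} - G_{i,j-1} + G_{i-1,j-1}
    return Gp[1:, 1:] - Gp[:-1, 1:] - Gp[1:, :-1] + Gp[:-1, :-1]

mu = np.array([0.5, 0.5])
nu = np.array([0.25, 0.25, 0.5])
P = sorted_transport_plan(mu, nu)
# The marginals of the plan recover mu (row sums) and nu (column sums).
```

Padding with a zero row and column handles the boundary cases $\acute P_{1,1}$, $\acute P_{1,j}$, and $\acute P_{i,1}$ uniformly.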
For arbitrary sample sizes with $p = 2$, the max-sliced kernel W2 distance is
$\text{max-}W^{\mathcal{H}}_2(\hat\mu,\hat\nu) = \max_{\omega\in\mathcal{H}:\|\omega\|_2\le1}\big(\|\omega\|^2_{\hat\mu} + \|\omega\|^2_{\hat\nu} - 2\max_{P\in\mathcal{P}_{\hat\mu,\hat\nu}}\langle P,\ \omega_X\omega_Y^\top\rangle\big)^{1/2}$,
with unsorted values $\omega_X = [\omega(x_1), \ldots, \omega(x_m)]^\top = K_{XZ}\alpha$ and $\omega_Y = [\omega(y_1), \ldots, \omega(y_n)]^\top = K_{YZ}\alpha$. The max-sliced kernel W2 distance can be expressed in terms of equation 45 or equation 48:
$\text{max-}W^{\mathcal{H}}_2(\hat\mu,\hat\nu)^2 = \max_{\alpha:\alpha^\top K\alpha\le1}\min_{P\in\mathcal{P}_{\hat\mu,\hat\nu}}\sum_{i=1}^m\sum_{j=1}^n P_{ij}\big(\sum_{k=1}^{m+n}(\kappa(x_i,z_k) - \kappa(y_j,z_k))\alpha_k\big)^2$
$= \max_{\alpha:\alpha^\top K\alpha\le1}\ \alpha^\top K_{XZ}^\top D_\mu K_{XZ}\alpha + \alpha^\top K_{YZ}^\top D_\nu K_{YZ}\alpha - 2\max_{P\in\mathcal{P}_{\hat\mu,\hat\nu}}\alpha^\top K_{XZ}^\top P K_{YZ}\alpha$
$= \max_{\alpha:\alpha^\top K\alpha\le1}\min_{P\in\mathcal{P}_{\hat\mu,\hat\nu}}\ \alpha^\top Q_P\alpha = \max_\alpha\min_{P\in\mathcal{P}_{\hat\mu,\hat\nu}}\ \frac{\alpha^\top Q_P\alpha}{\alpha^\top K\alpha}$, (49)
where $Q_P = K_{XZ}^\top D_\mu K_{XZ} + K_{YZ}^\top D_\nu K_{YZ} - K_{XZ}^\top P K_{YZ} - K_{YZ}^\top P^\top K_{XZ}$ is a symmetric positive semidefinite matrix. (This can be seen since the objective is greater than or equal to zero for all choices of $\alpha$.) For fixed $P$, equation 49 is a convex maximization that can be solved as a generalized eigenvalue problem. For fixed $\alpha$, the optimization in terms of $P$ is a linear program with the solution detailed above. However, when maximizing with respect to $\alpha$, $P$ is a matrix-valued function of $\alpha$. To find a local maximum over $\alpha$, we use the unconstrained form in equation 49 and perform gradient ascent with respect to $\alpha$, where in each iteration we compute the optimal transport plan. This approach is also used in computing the generalized max-sliced Wasserstein distance (Kolouri et al., 2019).

A.5 COMPUTING THE MAX-SLICED DIVERGENCES FOR THE LINEAR CASE

The linear case follows from the kernel case with some further simplification. The objectives of the one-sided max-sliced Bures divergences are each a difference of convex functions, whose stationary points are maximum-eigenvalue problems and correspond to reweighted versions of the one-sided max-sliced TV divergence. Assuming the covariance matrices are strictly positive definite,
$\text{max-}D^{\mathbb{R}^d}_B(\hat\mu,\hat\nu) = \max_{0<\gamma<1}\ \sqrt{w_\gamma^\top\rho_X w_\gamma} - \sqrt{w_\gamma^\top\rho_Y w_\gamma}$, where $w_\gamma = \arg\max_{w:\|w\|_2\le1} w^\top(\gamma\rho_X - \rho_Y)w$
for the one-sided case. If the matrices are singular, then cases where $w$ is in the nullspace must be checked. Without loss of generality, the algorithm to obtain the optimal slice for the one-sided max-sliced Bures divergence is described in Algorithm 2.

Figure 8: Success rate of finding optimal slices for the max-sliced Bures and W2 distances across samples of varying sizes (10 random runs per size). The distributions are zero-mean Gaussians, with $\mu = N(0, C)$ and $\nu = N(0, I)$, where $C = ZZ^\top$ and $Z\in\mathbb{R}^{d\times d}$ has entries that are initially standard normal, with rows normalized such that $C$ is a correlation matrix. (Left) In the case of $d = 2$, success is obtained for a distance within 1% of the value obtained by a fine grid search over angles. In this case, the gradient approach for the max-sliced Bures fails more often than the max-sliced W2. (Right) For $d = 1000$ the eigenvalue-based approach (Algorithm A.5) defines the global optimum.
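A sketch of the line search described above, using a simple grid over $\gamma$ in place of a more careful scalar search; the function name and the toy second-moment matrices are ours.

```python
import numpy as np

def one_sided_max_sliced_bures(rho_X, rho_Y, n_grid=200):
    """Line search over gamma in (0, 1]: each candidate slice is the top
    eigenvector of gamma*rho_X - rho_Y (a sketch of Algorithm 2, assuming
    strictly positive definite second-moment matrices)."""
    best, best_w = 0.0, None
    for gamma in np.linspace(1e-3, 1.0, n_grid):
        evals, evecs = np.linalg.eigh(gamma * rho_X - rho_Y)
        w = evecs[:, -1]                   # eigenvector of the largest eigenvalue
        val = np.sqrt(w @ rho_X @ w) - np.sqrt(w @ rho_Y @ w)
        if val > best:
            best, best_w = val, w
    return best, best_w

rho_X = np.diag([4.0, 1.0, 1.0])   # toy uncentered second-moment matrices
rho_Y = np.eye(3)
d, w = one_sided_max_sliced_bures(rho_X, rho_Y)
# Optimal slice is the first axis: d = sqrt(4) - sqrt(1) = 1.
```

For these diagonal matrices every $\gamma$ yields the same top eigenvector, so the grid search recovers the exact optimum; in general the value varies with $\gamma$ and the grid (or a bounded scalar search) selects the best candidate.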
In the larger dimension, the gradient approach for the max-sliced Bures (ADAM) succeeds in almost all cases, coming within 1% of the optimal value obtained by Algorithm A.5 (MSB), whereas the max-sliced W2 distance fails to upper bound the max-sliced Bures on roughly half the trials.

To ease computation for large sample sizes, we let $\tau$ be a random subset of the pooled samples $\{x_i\}_{i=1}^m\cup\{y_i\}_{i=1}^n$ of size $l = \min\{500, m+n\}$. In this case, the max-sliced Fréchet refers to $\text{max}_L\text{-}D^{\mathcal{H}}_{GW}$, which is equal to the square root of the sum of the squared MMD and the squared max-sliced Bures using centered kernels, as in equation 23. The kernel-based max-sliced W2 distance should be an upper bound on the max-sliced Fréchet; in practice, however, the optimal slice (witness function) may not be obtained. We now compare the proposed kernel-based max-sliced divergences to existing baselines. A primary baseline for this task is to train a logistic regression model with kernel basis functions to distinguish the two samples, and then use the probability estimates of the instances as the witness function evaluations $\omega(x) = \Pr(H_0|X = x) = 1 - \Pr(H_1|X = x)$, where $H_0: X\sim\hat\nu$ and $H_1: X\sim\hat\mu$.
As additional baselines we also tested three methods for importance reweighting and density-ratio estimation: kernel mean matching (KMM) (Huang et al., 2007), least-squares importance estimation (uLSIF) (Kanamori et al., 2009), and relative density-ratio estimation (RuLSIF) (Yamada et al., 2011), but all of these methods were outperformed by logistic regression with kernel bases. We also compare with kernel Fisher discriminant analysis (KFDA) (Mika et al., 1999), and the linear cases of the max-sliced Wasserstein-2 distance, its first-moment approximation, the max-sliced Bures, and logistic regression. For all kernel methods, a Gaussian kernel $\kappa_\sigma$ is used with the parameter $\sigma$ set as the median Euclidean distance in the pooled sample. Using the MNIST data set again, we test three scenarios of covariate shift. For each, one sample has a mismatched probability for one class $l\in\{0,\ldots,9\}$ and the other sample is balanced: (Scenario 1) $\hat\mu$ is balanced and $\hat\nu$ is missing $l$; (Scenario 2) $\hat\mu$ is imbalanced with $l$ appearing in only 2% of the cases, compared to 10.8% for the other classes, and $\hat\nu$ is balanced; (Scenario 3) $\hat\mu$ is balanced and $\hat\nu$ consists of only images from $l$. In each case, $\hat\mu$ is a sample of 500 images from the training set and $\hat\nu$ is a sample of 500 images from the test set. A threshold-free way to assess covariate shift detection is the area under the curve (AUC) of the receiver operating characteristic (ROC), where positive instances correspond to class $l$. For some methods (namely, the max-sliced W2), the witness function (or its magnitude) may be ambiguous in sign, i.e., the values may be large for either the under- or over-sampled instances.
To be generous, on each run we choose the ordering with the highest AUC. The results are reported in Table 3. The other baselines KMM, uLSIF, and RuLSIF are not shown (their AUC scores across the scenarios are worse than the kernel logistic regression baseline). In a separate set of runs we also compute the realism scores (Kynkäänniemi et al., 2019) with $k = 3$, where $\hat\nu$ is considered the real set and $\hat\mu$ the synthetic set, to prioritize instances; results for the three scenarios are 0.75±0.11, 0.67±0.12, and 0.94±0.04, which is better than linear logistic regression and KFDA, but far worse than the kernel logistic regression baseline. Figure 12 shows 20 synthetic images, generated by AutoGAN trained on CIFAR10 (n=50,000), with the largest weights after reweighting to minimize the max-sliced Bures distance to the subset of training images for each class separately (m=5,000). Computing the max-sliced Bures distance with the entire training set of 50,000 points is tractable since the computation does not depend on the sample size. The realism scores of the selected images have a median and range of 0.96 (0.63-1.34). Figure 13 shows the same but based on the weights optimized by using the W2 distance with the mini-batch optimization as the cost function; the realism scores of the selected images have a median and range of 1.1 (0.93-1.29). Figure 14 shows the same but based on the weights optimized by using the max-sliced W2 distance with the mini-batch optimization; the realism scores of the selected images have a median and range of 0.97 (0.65-1.25). Finally, Figure 15 shows the synthetic images selected for having the highest realism scores; notably, this set lacks class correspondence. The optimizations in the first three cases use the Frank-Wolfe algorithm (Jaggi, 2013) with simplex constraints. The default step-size schedule $\gamma = \tfrac{2}{k+2}$ is used, along with the stopping criterion $\max_{1\le i\le n} |\nu_i^{(k)} - \nu_i^{(k-1)}| < 10^{-3}$, where $k$ is the iteration index.
This yields roughly the same number of iterations for each method. The optimization starts from a uniform weighting, which means the weights of only ∼2000 instances are individually adjusted (the rest are adjusted by a common scaling). The Fréchet Inception distances after reweighting are detailed in Table 4. Based on the quantitative and qualitative results, it appears that the W2 distance with the mini-batch approximation assigns high weight to high-quality synthetic images, but the diversity of the highly weighted instances may not capture the full distribution for a class. In this regard, the max-sliced Bures better captures the diversity of the class, albeit choosing less realistic images.

Table 4: Fréchet Inception distances (FID) between CIFAR10 test set images in each class and reweighted samples of synthetic images from AutoGAN. The second column shows the FID to the corresponding training set. The third column is a uniform weighting over all 50,000 synthetic images. The reweighting that minimizes the max-sliced Bures distance (MSB) to the subset of training images performs the best on average. Using the W2 distance, estimated through 10 mini-batches of 100 images on each iteration, performs best on only one class. The max-sliced W2 (MSW2) distance also uses mini-batches. The realism scores R of the 20 images with the highest weight for each class (200 images for each method) are summarized by the median and range. Rows (top to bottom) correspond to airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck classes.



In the finite-dimensional case, the multivariate Fréchet distance (squared) is often expressed as $\|m_X - m_Y\|_2^2 + \mathrm{tr}(\Sigma_X + \Sigma_Y - 2(\Sigma_X\Sigma_Y)^{1/2})$; the trace term is the squared Bures distance $d_B^2(\Sigma_X, \Sigma_Y) = \mathrm{tr}(\Sigma_X + \Sigma_Y) - 2\|\Sigma_X^{1/2}\Sigma_Y^{1/2}\|_1$, where $\|\Sigma_X^{1/2}\Sigma_Y^{1/2}\|_1 = \mathrm{tr}((\Sigma_X\Sigma_Y)^{1/2})$ (Dowson & Landau, 1982).

In the context of GANs, the sum of the one-sided max-sliced Bures divergences may prove more appropriate for training, since it allows for separate witness functions for over- and under-representation. However, as its witness functions tend to localize discrepancies, even two witness functions may not make efficient use of the generator's samples. Instead, a distributional version of the sliced Bures distance, akin to the recent distributional sliced Wasserstein proposal (Anonymous, 2021a), or the Bures distance itself (Anonymous, 2021b), could be used. Nonetheless, this localization property is what makes the max-sliced Bures interpretable.

We also compute the realism scores of the entire set of fake images using the set of 10,000 test images and 3-nearest-neighbor distances. The Spearman rank correlation between the realism scores for the full set of fake images and the witness function evaluations is -0.70 for $\omega^2_{\mathrm{Real<Fake}}$ and 0.17 for $\omega^2_{\mathrm{Real>Fake}}$. This means the realism score has a strong inverse correlation with $\omega^2_{\mathrm{Real<Fake}}$ and only a weak correlation with $\omega^2_{\mathrm{Real>Fake}}$.

https://github.com/anon-author-dev/gsw

Similar to kernel PCA, the constraint $\|\omega\|_2\le1$ allows the use of the representer theorem for the RKHS.



Figure 1: The magnitude of the witness functions obtained from max-sliced Bures indicate discrepancies. (Left) A Gaussian kernel is used to construct non-linear witness functions and identify witness points. (Right) Witness points for real and fake images from stacked MNIST and CIFAR10. In each of the 6 frames, instances corresponding to the left hand sides of equation 1 and equation 2 are on the top; the bottom instances correspond to the right hand sides of the expected inequalities.

Figure 2: Max-sliced divergences using the Inception Network representation are applied to samples with mismatched class distributions in CIFAR10. The first sample consists of the training set (balanced classes with m=50,000), and the second sample is an imbalanced subset of the test set with n=10,000. Each curve is the mean precision@10 (averaged across 10 random draws) for test sets where the given class is subsampled at different levels of imbalance and other classes are balanced.

Figure 3: Max-slicing Inception codes to illustrate the AutoGAN discrepancies. One-sided max-sliced Bures is used to identify two witness functions (as linear subspaces of the Inception codes) that differentiate the real from fake samples. (Left: A,C) Images in the subspace under-represented by the fake sample, $w_{\mathrm{Real>Fake}}$. (Right: B,D) Images in the subspace over-represented by the fake sample, $w_{\mathrm{Real<Fake}}$. (Top: A,B) Real CIFAR10 test images. (Bottom: C,D) Fake images. (C) Realism scores (median and range): 0.92 (0.84-1.03). (D) Realism scores (median and range): 0.68 (0.62-0.73).

Legend: $(w^\top X)^2$ with $w$ from max-sliced Wasserstein-2; $w^\top X$ with $w$ from max-sliced Wasserstein-2; $(w^\top X)^2$ with $w = m_X - m_Y$; $(w^\top X)^2$ with $w$ from one-sided max-sliced Bures.

Figure 4: Max-sliced distances applied to two samples from MNIST. The first sample μ is m images drawn uniformly from the training set, and the second sample ν is n images from the test set where one digit is a minority class l ∈ {0, . . . , 9} with prevalence of 5%. (Left) Divergence estimates across sample size with l = 7. For m < 2000, gradient-based approaches for the max-sliced W2 distance fail to obtain the optimal slice as it should upper bound the max-sliced Fréchet distance.(Center) Computation time. (Right) Each curve is the average precision@10 (averaged across the 10 classes). The one-sided max-sliced Bures yields the witness function ω µ>ν (•) = w, • , which is applied to reliably identify the instances from μ that are from the minority class for m ≥ 1000.

Figure 5: Reweighting a uniform distribution to match various target distributions by minimizing the max-sliced W2 distance or the max-sliced Bures distance with either a linear kernel or random Fourier bases (d=2,000, σ ∈ {0.1, 0.2}). The examples follow Kolouri et al. (2019) and use ADAM (Kingma & Ba, 2015) with default settings and a learning rate of 10⁻². A point's size is proportional to its weight after 100 iterations. Learning curves show the weighted W2 distance (log scale).

A.1 DERIVATION OF THE MAX-SLICED KERNEL DIVERGENCES

Let $U_1 = \{\omega\otimes\omega : \|\omega\|_2\le1,\ \omega\in\mathcal{H}\}$ denote the set of rank-1 symmetric operators with bounded trace norm. Let $d$ denote either the TV distance ($d = d_{TV}$) or the squared Bures distance ($d = d_B^2$).

Figure 7: Sliced and max-sliced Bures and Wasserstein-2 distances are compared on population statistics and samples of varying sizes. $\mu = N(0, C)$ and $\nu = N(0, I)$, where $C = ZZ^\top$ and $Z\in\mathbb{R}^{2\times2}$ has entries that are initially standard normal, with rows normalized such that $C$ is a correlation matrix. In the population case and for zero-mean Gaussians, the Bures distance is equivalent to the W2 distance (Gelbrich, 1990). In the sample case, it is a lower bound. At both m = 100 and m = 10⁴ the gradient optimization of the max-sliced W2 distance fails to obtain the global optimal slice (instead obtaining a local optimum).

Figure 9: Performance of gradient algorithms for the max-sliced Bures and max-sliced W2 distances across d = 1000 dimensional samples of varying sizes (10 random runs per size) and number of ADAM iterations (left to right: 50, 100, 200, 500, 1000). The samples are from zero-mean Gaussian distributions, with $\mu = N(0, C)$ and $\nu = N(0, I)$, where $C = ZZ^\top$ and $Z\in\mathbb{R}^{d\times d}$ has entries that are initially standard normal, with rows normalized such that $C$ is a correlation matrix. (Top) For the max-sliced W2, a successful run is obtained when it is greater than or equal to the optimal solution to the max-sliced Bures (blue solid line) or when it is greater than 95% of the max-sliced Bures (red dotted with circles). For the gradient approach to the max-sliced Bures, success is when the difference to the optimal is within 5% of the optimal (yellow solid with diamonds). (Bottom) Divergence values obtained across the 10 trials with increasing sample size.

Figure10: Maximum mean discrepancy (MMD) and max-sliced Bures distance (MSB) applied to two-dimensional samples using a Gaussian kernel. For each data set (shown as a two-by-two subplot), the contour plots indicate the squared magnitude of the witness function evaluations. For MMD, positive witness function values are plotted in the top row and negative evaluations are in the second row. For MSB, the rows correspond to the two one-sided divergences. The witness functions for the one-sided MSB divergences correspond to localized regions.

Figure 11: Kernel-based max-sliced distances are applied to balanced and imbalanced samples from MNIST. The first sample $\hat\mu$ consists of the training set (balanced classes with size m), and the second sample $\hat\nu$ is an n-sized sample from the test set with a minority class $l\in\{0,\ldots,9\}$ with prevalence of 5%. (Left) Divergence estimates for increasing sample size for l = 7. Notably, for m < 2000 the max-sliced Wasserstein-2 distance fails to obtain the optimal slice, as it should upper bound the max-sliced kernel Gauss-Wasserstein (Fréchet) distance. (Center) Corresponding computation time. (Right) Each curve is the average precision@10 (averaged across the 10 classes). The witness function for the one-sided max-sliced Bures $\omega_{\hat\mu>\hat\nu}$ can be used to reliably identify instances from $\hat\mu$ associated with the minority class.

Figure 12: Distribution matching based on minimizing max-sliced Bures distance max-D R d B (μ, ν). Synthetic images shown are those with the highest values of ν, where ν is the ν-weighted distribution over 50,000 synthetic images from AutoGAN and μ consists of CIFAR10 training images for a single class in each row. Rows (top to bottom) correspond to airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck classes.

Figure 13: Distribution matching based on minimizing the Wasserstein-2 distance through mini-batch approximation. Synthetic images shown are those with the highest weights under ν, where ν is the ν-weighted distribution over 50,000 synthetic images from AutoGAN and μ consists of CIFAR10 training images for a single class in each row. Rows (top to bottom) correspond to the airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck classes.

Figure 14: Distribution matching based on minimizing the max-sliced Wasserstein-2 distance max-W2_{R^d}(μ, ν) through mini-batch approximation. Synthetic images shown are those with the highest weights under ν, where ν is the ν-weighted distribution over 50,000 synthetic images from AutoGAN and μ consists of CIFAR10 training images for a single class in each row. Rows (top to bottom) correspond to the airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck classes.

Figure 15: Selecting images directly with the highest realism scores. Synthetic images shown are those with the highest realism values over 50,000 synthetic images generated by AutoGAN, where the realism scores in each row are computed using the CIFAR10 training images for a single class. Rows (top to bottom) correspond to the airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck classes. The realism score correctly identifies "realistic" imagery, but is unable to find samples that cover the real distribution. For example, the top row is missing airplanes, the third row is missing birds, the fourth row is missing cats, etc.

Unsupervised covariate shift outlier detection on MNIST. The goal is to identify instances associated with an over- or under-represented class l ∈ {0, . . . , 9}. Values are AUC, where positives are instances from class l. We report the mean and standard deviation, and the number of times each method achieves the highest AUC (including ties), across 100 trials (10 for each class l ∈ {0, . . . , 9}).

APPENDIX

Null(D_ν K_YZ). In this case, the generalized eigenvalue problem must be restricted to the nullspace: α*_∅ = arg max_{α ∈ Null(D_ν K_YZ): α^⊤Kα ≤ 1} h^2(α). Let V ∈ R^{(m+n)×p} denote a matrix of p orthonormal columns that spans the nullspace; then α*_∅ = Vβ*, where β* = arg max_{β ∈ R^p: β^⊤V^⊤KVβ ≤ 1} ‖D_μ^{1/2} K_XZ Vβ‖²₂. Overall, the one-sided max-sliced kernel Bures distance can be computed by combining a line search over γ ∈ (0, 1] with a check of the solution in the nullspace, as described in Algorithm 1.

Algorithm 1: One-sided max-sliced kernel Bures divergence

The empirical version of the max-sliced kernel Wasserstein divergence can be expressed in terms of either equation 24 or equation 26. Denote the sorted witness-function values as vectors ώ_X and ώ_Y, with [ώ_X]_i = ω(x_(i)) and [ώ_Y]_i = ω(y_(i)), and denote by μ and ν the corresponding permuted versions of µ and ν.

Algorithm 2: One-sided max-sliced Bures divergence

As an alternative to Algorithm 2, first-order algorithms can be applied; however, a global optimum cannot easily be guaranteed in this case. To make the objective differentiable, the square root, which is non-differentiable at 0, should be smoothed (e.g., with ε² = 0.01).
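The nullspace-restricted maximization above is itself a generalized eigenvalue problem, which can be illustrated numerically. In this sketch the small random matrices A, K, and B are stand-ins for D_ν K_YZ, the Gram matrix K, and D_μ^{1/2} K_XZ respectively; the actual matrices come from the kernel construction in the paper:

```python
import numpy as np
from scipy.linalg import null_space, eigh

rng = np.random.default_rng(0)
m = 8
A = rng.standard_normal((3, m))        # stand-in for D_nu K_YZ (rank-deficient)
G = rng.standard_normal((m, m))
K = G @ G.T + 1e-6 * np.eye(m)         # stand-in for the PSD Gram matrix K
B = rng.standard_normal((m, m))        # stand-in for D_mu^{1/2} K_XZ

V = null_space(A)                      # orthonormal basis V of Null(A)
M = V.T @ B.T @ B @ V                  # objective:   ||B V beta||^2 = beta^T M beta
N = V.T @ K @ V                        # constraint:  beta^T V^T K V beta <= 1
evals, evecs = eigh(M, N)              # generalized eigenproblem, ascending order
beta = evecs[:, -1]                    # top generalized eigenvector
alpha = V @ beta                       # alpha lies in Null(A) by construction
alpha /= np.sqrt(alpha @ K @ alpha)    # enforce alpha^T K alpha = 1
```

The constraint is active at the optimum, so the maximal objective value equals the largest generalized eigenvalue.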

A.6 SAMPLE-BASED MAX-SLICED BURES DISTANCE IS A RELAXATION OF THE MAX-SLICED WASSERSTEIN-2 DISTANCE

We show that the max-sliced Bures distance is a relaxation of the max-sliced W2 distance. Let ρ = ZZ^⊤ with Z ∈ R^{d×p} denote a strictly positive definite matrix, so that w^⊤ρw > 0 and √(w^⊤ρw) = ‖Z^⊤w‖₂ = max_{θ ∈ S^{p−1}} ⟨θ, Z^⊤w⟩, where θ* = arg max_{θ ∈ S^{p−1}} ⟨θ, Z^⊤w⟩ = (1/√(w^⊤ZZ^⊤w)) Z^⊤w. Using this form, when ρ_X and ρ_Y are non-singular, the one-sided sliced Bures can be expressed through this inner maximization over slices; squaring this quantity and taking the square root yields the objective of the max-sliced Bures distance in a form similar to the max-sliced W2 distance.
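The identity used above, √(w^⊤ρw) = max_{θ ∈ S^{p−1}} ⟨θ, Z^⊤w⟩ with maximizer θ* = Z^⊤w/‖Z^⊤w‖, can be checked numerically; a minimal sketch with random Z and w (p ≥ d so that ρ = ZZ^⊤ is strictly positive definite almost surely):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 4, 6
Z = rng.standard_normal((d, p))
w = rng.standard_normal(d)

rho = Z @ Z.T                                  # rho = Z Z^T
lhs = np.sqrt(w @ rho @ w)                     # sqrt(w^T rho w)

# The maximizing slice on the unit sphere S^{p-1}:
theta_star = Z.T @ w / np.linalg.norm(Z.T @ w)
rhs = theta_star @ (Z.T @ w)                   # <theta*, Z^T w>
```

Any other unit vector θ yields a strictly smaller inner product, by Cauchy-Schwarz.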

A.7 COVARIATE SHIFT CORRECTION ALGORITHMS

For covariate shift correction, the goal is to minimize the divergence by adjusting the weights ν of the instances in one sample ν. This results in a convex optimization problem. For fixed α, the function J(·, α) is the sum of a linear function and a convex function, and J(ν) is convex since maximizing over α preserves convexity (Nesterov, 2018, Theorem 3.1.8). In the linear kernel case, J(ν) is again convex since J(·, w) is convex for fixed w, and the gradient takes an intuitive form. To solve this convex minimization over the probability simplex we apply the Frank-Wolfe (conditional gradient) algorithm (Jaggi, 2013), which iteratively adjusts the weight of one instance, ν ← (1−γ)ν + γe_i, where e_i is an indicator vector, i = arg min_{1≤j≤n} [∇_ν J(ν)]_j, and γ ∈ [0, 1] is the step size. A benefit of the Frank-Wolfe scheme is that it requires only rank-1 updates of Y D_ν Y^⊤, which are needed to update w.
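The Frank-Wolfe update over the simplex can be sketched as follows. Here a generic convex objective with gradient `grad_J` stands in for the divergence, and the diminishing step size γ_t = 2/(t+2) is the standard choice from Jaggi (2013); the paper's actual J would supply the gradient:

```python
import numpy as np

def frank_wolfe_simplex(grad_J, n, iters=1000):
    """Minimize a convex J over the probability simplex in R^n."""
    nu = np.full(n, 1.0 / n)              # start at the uniform weighting
    for t in range(iters):
        g = grad_J(nu)
        i = int(np.argmin(g))             # best simplex vertex e_i for the linearized problem
        gamma = 2.0 / (t + 2.0)           # standard diminishing step size
        nu = (1.0 - gamma) * nu           # nu <- (1 - gamma) nu + gamma e_i
        nu[i] += gamma
    return nu

# Toy example: J(nu) = ||nu - target||^2 with target inside the simplex.
target = np.array([0.5, 0.3, 0.2])
nu = frank_wolfe_simplex(lambda v: 2.0 * (v - target), n=3)
```

Each iterate stays in the simplex because it is a convex combination of simplex points, which is what makes only a rank-1 update of the weighted Gram matrix necessary per step.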

A.8 ADDITIONAL EXPERIMENTAL RESULTS

We start by comparing the proposed max-sliced Bures distance to the max-sliced W2 distance for two-dimensional data. We compare the fixed-point algorithm for solving each one-sided max-sliced Bures divergence with gradient-based approaches for the max-sliced W2 distance using ADAM with parameters lr = 1e-3, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-08, capping the number of iterations at 1000 or stopping when the change in the slice is minimal, ‖w − w_old‖∞ < 10^{-6}.

In two dimensions, a near-optimal slice can be obtained by a fine grid search over the sliced Bures and sliced W2 distances, as shown in Figure 7 for two zero-mean Gaussian distributions. Figure 8 shows the success rate of the gradient-based optimizations across 10 trials at varying sample sizes for 2- and 1000-dimensional zero-mean Gaussians. The effect of the number of gradient iterations is reported in Figure 9.

For kernel-based divergences, maximum mean discrepancy (MMD) detects differences in the first moments of the distributions in the RKHS. Using uncentered second moments, the kernel-based max-sliced Bures distance (MSB) may detect some of the same differences. Figure 10 details witness function evaluations of each for six data sets generated from two-dimensional distributions, where a Gaussian kernel function is used. Notably, the one-sided MSB divergences correspond to localized regions rather than distributed outliers. This is beneficial for the 'precision' of the witness function, whereas the 'recall' of MMD is better; the benefit of this localization depends on the task.

A.9 COVARIATE SHIFT DETECTION WITH MAX-SLICED KERNEL DIVERGENCES

We proceed to generalize the comparisons in Figure 4 on imbalanced samples from MNIST to the kernel case. We use a Gaussian kernel κ σ with the parameter σ set as the median Euclidean distance
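The median-heuristic bandwidth used for the Gaussian kernel above can be sketched as follows; this is a common convention rather than the paper's exact implementation, with σ taken as the median pairwise Euclidean distance over the pooled sample:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def gaussian_gram(X, Y=None):
    """Gram matrix of a Gaussian kernel with median-heuristic bandwidth."""
    Z = X if Y is None else np.vstack([X, Y])   # pool both samples
    dists = pdist(Z)                            # condensed pairwise Euclidean distances
    sigma = np.median(dists)                    # median heuristic for sigma
    D = squareform(dists)                       # full distance matrix (zero diagonal)
    return np.exp(-D ** 2 / (2.0 * sigma ** 2)), sigma

rng = np.random.default_rng(0)
K, sigma = gaussian_gram(rng.standard_normal((20, 5)), rng.standard_normal((30, 5)))
```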

