DIMENSION REDUCTION AS AN OPTIMIZATION PROBLEM OVER A SET OF GENERALIZED FUNCTIONS

Abstract

We reformulate the unsupervised dimension reduction problem (UDR) in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function p_emp(x) by another tempered distribution q(x) whose support lies in a k-dimensional subspace. Thus, our problem reduces to the minimization of the distance D(q, p_emp) between q and p_emp over a pertinent set of generalized functions. This infinite-dimensional formulation allows us to establish a connection with another classical problem of data science, the sufficient dimension reduction problem (SDR): an algorithm for the first problem induces an algorithm for the second, and vice versa. In order to reduce an optimization problem over distributions to an optimization problem over ordinary functions, we introduce a nonnegative penalty function R(f) that "forces" the support of f to be k-dimensional. We then present an algorithm for the minimization of I(f) + λR(f), based on a two-step iterative computation: a) adaptation to real data and to fake data sampled around the k-dimensional subspace found at the previous iteration; b) calculation of a new k-dimensional subspace. We demonstrate the method on 4 examples (3 UDR and 1 SDR) using synthetic data and standard datasets.

1. INTRODUCTION

Linear dimension reduction (LDR) is a family of problems in data science that includes principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlation analysis, sufficient dimension reduction (SDR), maximum autocorrelation factors, slow feature analysis and more. In unsupervised dimension reduction (UDR) we are given a finite number of points in R^n (sampled according to some unknown distribution) and the goal is to find a "low-dimensional" affine (or linear) subspace that approximates "the support" of the distribution. The field has reached a level of maturity at which unifying frameworks for the problem become of special interest (Cunningham & Ghahramani, 2015). The approach that we present in this paper is based on the theory of generalized functions, or tempered distributions (Soboleff, 1936; Schwartz, 1949). An important generalized function that cannot be represented as an ordinary function is the Dirac delta function, denoted δ; δ_n denotes its n-dimensional version. Any dataset {x_i}_{i=1}^N ⊆ R^n naturally corresponds to the distribution p_emp(x) = (1/N) Σ_{i=1}^N δ_n(x − x_i), which, with some abuse of terminology, can be called the empirical probability density function. Based on that, UDR can be understood as the task of approximating p_emp(x) by q(x), where q(x) is a distribution whose density is supported in a k-dimensional affine subspace A ⊆ R^n. Note that a function whose density is supported in some low-dimensional subset of R^n is not an ordinary function; exact definitions of such distributions can be found in Section 3. To formulate an optimization task we additionally need a loss D(p_emp, q) that measures the distance between the ground truth p_emp and the distribution q that we search for. Thus, in our approach, the UDR problem is defined as:

I(q) = D(p_emp, q) → min_q    (1)

under the condition that q(x) has a k-dimensional support.
The SDR problem is tightly connected with the UDR problem. In SDR, given supervised data, the goal is to find the so-called effective subspace, defined by its basis vectors {w_1, ..., w_k} ⊆ R^n, such that the regression function can be searched for in the form g(w_1^T x, ..., w_k^T x). In Wang et al. (2010) it was shown how a method originally developed for SDR can be turned into a UDR method, i.e. applied to unsupervised data, by simply setting the output to be equal to the input. The key observation of our analysis, stated in Theorem 2, is that the class of functions of the form g(w_1^T x, ..., w_k^T x) can be characterized as the functions whose Fourier transform is supported in the corresponding effective subspace. In Section 4 we give 3 examples of UDR problems that we cast as (1), and in a fourth example we formulate SDR as an optimization task with the search space dual to that of UDR. Thus, all 4 examples can be studied within the same optimization framework. The structure of the paper is as follows: in Section 3 we formally define the search space of Problem (1), denoted 𝒢_k, together with a class F_k of ordinary functions that is the Fourier image of a dense subclass G_k ⊆ 𝒢_k. Instead of searching directly in the set of generalized functions 𝒢_k, in Section 5 we describe how we substitute an ordinary function for a distribution in the optimization task at the expense of adding a new penalty term λR(f) to its objective. Using a Gaussian kernel M(x, y), Theorem 4 characterizes generalized g ∈ 𝒢_k as those g for which the matrix of properly defined integrals

M_g = [ Re ∫_{R^n×R^n} x_i y_j g(x)* M(x, y) g(y) dx dy ]_{i,j=1,...,n}

has rank at most k. We define R(f) as the squared Frobenius distance from √M_f to the closest matrix of rank k. In Section 6 we suggest a method for solving min_φ I(φ) + λR(φ), which we call the alternating scheme. Section 7 is dedicated to experiments with the alternating scheme on synthetic data and standard datasets.

2. PRELIMINARIES AND NOTATIONS

Throughout this paper we use standard terminology and notation from functional analysis. For exact definitions one can consult the textbook on the theory of distributions by Friedlander & Joshi (1998). The Schwartz space of functions and its dual space are denoted by S(R^n) and S′(R^n) correspondingly. For a tempered distribution T ∈ S′(R^n) and φ ∈ S(R^n), ⟨T, φ⟩ denotes T(φ). The Fourier and inverse Fourier transforms are denoted by F, F^{−1} : S′(R^n) → S′(R^n). For brevity, we denote F[f] by f̂. If all required conditions are satisfied, an integrable f : R^n → C (or a Borel measure µ on R^n) is used as the tempered distribution T_f (or T_µ), where ⟨T_f, φ⟩ = ∫_{R^n} f(x)φ(x)dx (or ⟨T_µ, φ⟩ = ∫_{R^n} φ(x)dµ). For Ω ⊆ S′(R^n), Ω* denotes the sequential closure of Ω with respect to the weak topology of S′(R^n). By L_2(R^n) we denote the L_2-space with the inner product ⟨u, v⟩_{L2} = ∫ u(x)* v(x)dx. For φ ∈ S(R^n), ψ ∈ S′(R^n), their convolution and multiplication are denoted by φ * ψ and φψ correspondingly. For g_1 ∈ S′(R^k) and g_2 ∈ S′(R^{n−k}), g_1 ⊗ g_2 ∈ S′(R^n) denotes their tensor product. For a square matrix A, Tr(A) denotes its trace, and for an arbitrary matrix A, ||A||_F := √Tr(A^T A). The identity matrix of size n is denoted by I_n.

3. BASIC FUNCTION CLASSES

An example of a generalized function whose density is concentrated in a k-dimensional subspace is any distribution that can be represented as g ⊗ δ^{n−k} := g ⊗ δ ⊗ ... ⊗ δ (n − k times), where g ∈ S′(R^k). If g = T_f, where f : R^k → R is an ordinary function, then g ⊗ δ^{n−k} can be understood as a generalized function whose density is concentrated in the subspace {x ∈ R^n | x_i = 0, i > k} and equals f(x_{1:k}). It can be shown that this distribution acts on φ ∈ S(R^n) in the following way:

⟨T_f ⊗ δ^{n−k}, φ⟩ = ∫_{R^k} f(x_{1:k}) φ(x_{1:k}, 0_{n−k}) dx_{1:k}

To generalize the latter definition to an arbitrary k-dimensional subspace we have to introduce a change of variables in tempered distributions. Let g ∈ S′(R^n) and let U ∈ R^{n×n} be an orthogonal matrix, i.e. U^T U = I_n. Then g_U ∈ S′(R^n) is defined by the rule ⟨g_U, φ⟩ = ⟨g, ψ⟩, where ψ(x) = φ(U^T x). If g = T_f, the latter definition gives g_U = T_{f′} where f′(x) = f(Ux). Now we define the following classes of tempered distributions:

𝒢_k = {(f ⊗ δ^{n−k})_U | f ∈ S′(R^k), U ∈ R^{n×n}, U^T U = I_n}    (2)

G_k = {(T_f ⊗ δ^{n−k})_U | f ∈ S(R^k), U ∈ R^{n×n}, U^T U = I_n}    (3)

F_k = {T_r | r(x) = f(Ux), f ∈ S(R^k), U ∈ R^{k×n}, rank(U) = k}    (4)

The first two classes are related as follows.

Theorem 1. 𝒢_k = G_k*.

The last two classes are isomorphic under the Fourier transform.

Theorem 2. F[G_k] = F_k and F^{−1}[F_k] = G_k.

For any collection f_1, ..., f_l ∈ S′(R^n), span_R{f_i}_1^l denotes {Σ_{i=1}^l λ_i f_i | λ_i ∈ R} ⊆ S′(R^n), which is a linear space over R. The set 𝒢_k has the following simple characterization.

Theorem 3. For any T ∈ S′(R^n), T ∈ 𝒢_k if and only if dim span_R{x_1 T, x_2 T, ..., x_n T} ≤ k.

Informally, the theorem holds because any linear dependency α_1 x_1 T + ... + α_n x_n T = 0 over R implies that T vanishes wherever α_1 x_1 + ... + α_n x_n ≠ 0, which is equivalent to the statement that the support of T is concentrated on the subspace α_1 x_1 + ... + α_n x_n = 0.
If dim span_R{x_1 T, x_2 T, ..., x_n T} ≤ k, then one can find n − k such dependencies, which means that the support of T is at most k-dimensional.

Let B(R^n) denote the Borel sigma-algebra on R^n and let P denote the set of all Borel probability measures on R^n. Let us now define

P_k = {µ ∈ P | ∃ v_1, ..., v_k ∈ R^n, ∀A ∈ B(R^n) : µ(A) = µ(A ∩ span(v_1, ..., v_k))}    (5)

i.e. P_k is the set of probability measures with all probability mass concentrated in some subspace span(v_1, ..., v_k) whose dimension is not greater than k. It is easy to see that T_µ ∈ 𝒢_k for any µ ∈ P_k.

4. EXAMPLES OF LDR FORMULATIONS

UDR: Maximum mean discrepancy (MMD). Let k(x) = (2πh²)^{−n/2} e^{−|x|²/(2h²)} be the radial Gaussian kernel on R^n. The kernel k(x) defines the so-called kernel embedding of probability measures (Muandet et al., 2017):

µ ∈ P → φ(µ) = k * µ = E_{y∼µ} k(x − y) = ∫ k(x − y) dµ(y)

The Maximum Mean Discrepancy (MMD) distance (Gretton et al., 2012) is defined as the distance induced by the metric of L_2(R^n), i.e. for two probability measures µ, ν ∈ P:

d_MMD(µ, ν) = ||φ(µ) − φ(ν)||_{L_2(R^n)}

Let x_1, ..., x_N ∈ R^n be the dataset of points. This dataset defines the empirical probability measure µ_data that corresponds to the tempered distribution T_{µ_data} = (1/N) Σ_{i=1}^N δ_n(x − x_i). We shall study a method, competing with PCA, that is based on solving the following problem:

I(ν) = d_MMD(µ_data, ν) = ||φ(µ_data) − φ(ν)||_{L_2(R^n)} → min_{ν∈P_k}    (6)

i.e. we shall attempt to approximate the empirical probability measure µ_data by another probability measure ν which is supported in some k-dimensional subspace of R^n.

UDR: Distance based on higher moments (HM). It is well known that maximum mean discrepancy measures the similarity between the characteristic functions of two probability distributions only in an O(1/h)-neighbourhood of the origin.
Another approach to measuring the similarity of two distributions is based on the difference between moments:

d_HM(µ, ν)² = Σ_{s=1}^4 (λ_s / n^s) Σ_{1≤i_1,...,i_s≤n} (m_{i_1...i_s} − n_{i_1...i_s})²

where m_{i_1...i_s} = E_{X∼µ}[X[i_1] ... X[i_s]] and n_{i_1...i_s} = E_{X∼ν}[X[i_1] ... X[i_s]] are the corresponding moments. The positive parameters λ_1, λ_2, λ_3, λ_4 are chosen to fix the relative importance of the mean, the co-variance, the co-skewness and the co-kurtosis. Thus, we will be interested in the following optimization task (analogous to (6)):

d_HM(µ_data, ν) → min_{ν∈P_k}    (7)

UDR: Wasserstein distance (WD). Another important distance between probability measures, with origins in transport theory, is the Wasserstein distance (Villani, 2008). Let (R^n, ||·||) be a Banach space. Between any two Borel probability measures µ, ν on R^n with ∫||x||dµ < ∞ and ∫||x||dν < ∞, the Wasserstein distance is

W(µ, ν) = inf_{π∈Π(µ,ν)} ∫ ||x − y|| dπ

where Π(µ, ν) is the set of all couplings of µ and ν. The Wasserstein distance defines another version of the LDR problem:

W(µ_data, ν) → min_{ν∈P_k}    (8)

In Appendix B one can find proofs that, in the case of the L_1 norm ||x|| = Σ_i |x_i|, task (8) corresponds to the well-studied robust PCA problem (Candès et al., 2011). If, instead of the L_1 norm, we use the L_2 norm, this leads to another well-studied task known as the outlier pursuit problem (Xu et al., 2010).

Sufficient dimension reduction (SDR). Given a labeled dataset {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^n and y_i ∈ C (C is a finite set of classes for classification, or R for a regression problem), the sufficient dimension reduction problem can be informally described as the problem of finding vectors w_1, ..., w_k ∈ R^n such that p(y | w_1^T x, ..., w_k^T x) ≈ p(y|x) (possibly under some additional assumptions on the form of p(y|x)). We formulate the SDR problem as an optimization task:

inf_{f∈F_k} J(f)    (9)

Here f : R^n → R is a smooth real-valued function.
We assume that f is a candidate for the regression function and J(f) is a cost function that measures how well f fits this role. In practice, for the regression case and for the binary classification case with 0-1 outputs, we use the following cost functions correspondingly:

J(f) = (1/N) Σ_{i=1}^N E_{ε∼N(0,υ²I_n)} |y_i − f(x_i + ε)|²

J(f) = (1/N) Σ_{i=1}^N E_{ε∼N(0,υ²I_n)} H(y_i, e^{f(x_i+ε)} / (1 + e^{f(x_i+ε)}))

where H(y, p) = −y log p − (1 − y) log(1 − p) and υ > 0 is a parameter. By requiring f ∈ F_k, we assume that the regression function f satisfies (for k fixed in advance)

f(x) = g(w_1^T x, ..., w_k^T x)

where w_1, ..., w_k ∈ R^n. Thus, given an input x, the output of f depends only on the projection of x onto span(w_1, ..., w_k). The subspace span(w_1, ..., w_k) is called the effective subspace. Note that the way we defined the SDR objective J(f) for the regression and classification cases is not unique: there are definitions of the same form as (9) that deal with the conditional distribution p(y|x) as the argument instead of the regression function.
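The expectations over ε in both costs can be estimated by Monte Carlo. A minimal sketch (the candidate f, sample sizes, and parameter values below are our illustrative choices, not from the paper):

```python
import numpy as np

def smoothed_losses(f, X, y, upsilon, n_mc=200, seed=0):
    """Monte Carlo estimates of J(f): squared loss (regression) and
    cross-entropy of a sigmoid of f (0-1 classification), with inputs
    perturbed by eps ~ N(0, upsilon^2 I_n)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    sq, ce = 0.0, 0.0
    for _ in range(n_mc):
        eps = rng.normal(scale=upsilon, size=(N, n))
        fx = f(X + eps)                    # f applied row-wise
        sq += np.mean((y - fx) ** 2)
        p = 1.0 / (1.0 + np.exp(-fx))      # e^f / (1 + e^f)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        ce += np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
    return sq / n_mc, ce / n_mc

# A candidate of the required form g(w^T x): constant along directions
# orthogonal to w, i.e. its effective subspace is span(w).
w = np.array([1.0, 0.0, 0.0])
f = lambda Z: np.tanh(Z @ w)
X = np.random.default_rng(1).normal(size=(100, 3))
y = (X @ w > 0).astype(float)
print(smoothed_losses(f, X, y, upsilon=0.1))
```

Note that the perturbation ε plays the same smoothing role as the kernel bandwidth elsewhere in the paper: it makes the costs depend on f in a neighbourhood of the data points rather than at the points alone.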

5. REDUCTION OF THE OPTIMIZATION PROBLEM TO ORDINARY FUNCTIONS

The central problem that our paper addresses is how to minimize an objective function over 𝒢_k (or P_k). In this section we describe an approach based on penalty functions and kernels. Let us assume for simplicity that M is the Gaussian kernel, i.e. M(x, y) = G^n_σ(x − y) where G^n_σ(x) = (2πσ²)^{−n/2} e^{−|x|²/(2σ²)}; the construction extends to other kernels, including the Abel kernel (1/σ^n) e^{−|x−y|/σ} and the Fourier transform of the Abel kernel, the Poisson kernel c_n σ / (σ² + |x − y|²)^{(n+1)/2}. For f, g ∈ S(R^n) let us denote

⟨f|M|g⟩ := ∫_{R^n×R^n} f(x)* M(x, y) g(y) dx dy ≤ max_{x,y} M(x, y) ||f||_{L_1} ||g||_{L_1} < ∞

For general f, g ∈ S′(R^n), the expression ⟨f|M|g⟩ is defined if, for the smoothed versions f_ε = f * G^n_ε and g_ε = g * G^n_ε, we have ⟨f_ε|M|g_ε⟩ → A as ε → 0; then ⟨f|M|g⟩ := A. For example, ⟨δ_n|M|δ_n⟩ = M(0, 0). By Theorem 3, f ∈ 𝒢_k implies dim span_R{x_1 f, x_2 f, ..., x_n f} ≤ k. Using the kernel M, one can build the Gram matrix of this collection of distributions, [⟨x_i f|M|x_j f⟩]_{1≤i,j≤n}. For any f ∈ S′(R^n) let us denote the real part of this Gram matrix, [Re⟨x_i f|M|x_j f⟩]_{1≤i,j≤n}, by M_f (if it is defined).

Theorem 4. If f ∈ 𝒢_k, then ⟨x_i f|M|x_j f⟩ is defined and rank M_f ≤ k.

Definition 1. Let A ∈ R^{n×n} be a positive semidefinite matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_n (counting multiplicities). Then the Ky Fan k-anti-norm of A is ||A||_k = Σ_{i=1}^k λ_{n+1−i}.

Let R(f) = ||M_f||_{n−k}, i.e. the sum of the n − k smallest eigenvalues of M_f. Theorem 4 tells us that R(f) = 0 for f ∈ 𝒢_k. For ordinary f, the Eckart-Young-Mirsky theorem gives R(f) = min_{A∈R^{n×n}, rank A≤k} ||√M_f − A||²_F. Thus, by penalizing the value of R(f), we force √M_f to be close to some matrix of rank k. For I : 𝒢_k ∪ S(R^n) → R_+, it is natural to reduce the optimization task over tempered distributions

I(f) → min_{f∈𝒢_k}    (10)

to an optimization task over ordinary functions with a penalty term R:

I(f) + λ||M_f||_{n−k} = I(f) + λR(f) → inf_{f∈S(R^n)}    (11)

Details on the conditions under which this reduction holds can be found in Appendix D.
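For a finite positive semidefinite matrix, the identity between the Ky Fan anti-norm of M_f and the Eckart-Young-Mirsky distance of √M_f to the rank-k matrices is easy to check numerically. A sketch (M below is a random PSD stand-in for M_f; sizes are arbitrary):

```python
import numpy as np

def ky_fan_anti_norm(M, m):
    """Sum of the m smallest eigenvalues of a symmetric PSD matrix M."""
    lam = np.linalg.eigvalsh(M)          # eigenvalues in ascending order
    return lam[:m].sum()

rng = np.random.default_rng(0)
n, k = 6, 2
S = rng.normal(size=(n, n))
M = S @ S.T                              # random PSD matrix (stand-in for M_f)

R = ky_fan_anti_norm(M, n - k)           # the penalty ||M_f||_{n-k}

# Eckart-Young-Mirsky: min over rank-k A of ||sqrt(M) - A||_F^2 equals the
# sum of the n-k smallest squared singular values of sqrt(M), which are
# exactly the n-k smallest eigenvalues of M.
lam, V = np.linalg.eigh(M)
sqrtM = (V * np.sqrt(lam)) @ V.T
U, s, Vt = np.linalg.svd(sqrtM)
A = (U[:, :k] * s[:k]) @ Vt[:k, :]       # best rank-k approximation of sqrt(M)
print(R, np.linalg.norm(sqrtM - A, 'fro') ** 2)  # the two values coincide
```

The penalty is therefore zero exactly when M_f has rank at most k, which by Theorem 4 is the situation R(f) is designed to enforce.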
Let us now concentrate on task (11) and describe the alternating scheme for its solution.

6. THE ALTERNATING SCHEME

We will concentrate on problem (11). It is known (Hiai, 2013) that the Ky Fan anti-norm is a concave function, i.e. R(φ) = ||M_φ||_{n−k} depends on M_φ in a concave way. It can be shown that the dependence of R(φ) on φ is neither convex nor concave, i.e. we deal with a non-convex optimization task. The kernel M(x, y) : R^n × R^n → C induces a linear operator from L_2(R^n) to L_2(R^n): O(M)[f] = ∫_{R^n} M(x, y) f(y) dy. For any operator O between spaces H_1 and H_2, we denote its range by R[O] = {O(x) | x ∈ H_1}. Let B(H_1, H_2) denote the set of bounded linear operators between Hilbert spaces H_1 and H_2. For O ∈ B(H_1, H_2), the rank of O is defined as dim R(O). Let L^r_2(R^n) be the Hilbert space (over R) of real-valued functions from L_2(R^n) and let L*_2(R^n) = L^r_2(R^n) × L^r_2(R^n). The space L*_2(R^n) is equivalent to L_2(R^n) treated as a linear space over R. Below we do not distinguish [φ_1, φ_2] ∈ L*_2(R^n) and φ_1 + iφ_2 ∈ L_2(R^n). It is easy to see that any O ∈ B(L*_2(R^n), R^n) can be given by the formula

O[φ]_i = Re⟨O_i, φ⟩_{L_2(R^n)},  O_i ∈ L_2(R^n),  i = 1, ..., n

i.e. O ∈ B(L*_2(R^n), R^n) can be identified with a vector of functions O = [O_i]_{i=1,...,n}, O_i ∈ L_2(R^n), and the Hilbert-Schmidt norm on B(L*_2(R^n), R^n) (i.e. √Tr O†O) is

||O||_* = √( Σ_{i=1}^n ||O_i||²_{L_2(R^n)} )    (12)

Recall that for the kernel M, the operator O(M) is positive and self-adjoint. Since O(M) is also bounded, the square root √O(M) can be correctly defined (Rudin, 1991). For any complex-valued function f let us introduce a linear operator S_f : L*_2(R^n) → R^n by the following rule:

S_f[φ]_i = Re⟨x_i f(x), √O(M)[φ]⟩_{L_2(R^n)},  i.e.  (S_f)_i = √O(M)[x_i f(x)],  i = 1, ..., n

Theorem 5. If Tr M_f < ∞, then S_f ∈ B(L*_2(R^n), R^n) and S_f S†_f = M_f.
Moreover, R(f) = min_{S∈B(L*_2(R^n),R^n), rank S≤k} ||S_f − S||²_* and the minimum is attained at S = P_f S_f, where P_f = Σ_{i=1}^k u_i u†_i and {u_i}_1^k are unit eigenvectors of M_f corresponding to the k largest eigenvalues (counting multiplicities).

Given the new representation R(f) = min_{S: rank S≤k} ||S_f − S||²_*, it is natural to view task (11) as a minimization of I(φ) + λ||S_φ − S||²_* over two objects: φ and S ∈ B(L*_2(R^n), R^n) with rank S ≤ k. The simplest approach to minimizing a function of two arguments is to optimize alternatingly, i.e. first over φ, then over S with rank S ≤ k, and so on. Theorem 5 shows that the minimization over S is equivalent to the truncation of SVD(S_φ) at the k-th term. This idea, which we dub the alternating scheme, is described in Algorithm 1.

Algorithm 1. The alternating scheme for (11)
  P_0 ← 0, S_{φ_0} ← 0
  for t = 1, ..., T do
    φ_t ← argmin_φ I(φ) + λ||S_φ − P_{t−1} S_{φ_{t−1}}||²_*   (minimizing over φ)
    Calculate M_{φ_t} and find {v_i}_1^n s.t. M_{φ_t} v_i = λ_i v_i, λ_1 ≥ ... ≥ λ_n
    P_t ← Σ_{i=1}^k v_i v_i^T   (the truncated SVD(S_{φ_t}) is P_t S_{φ_t})
  Output: v_1, ..., v_k

The alternating scheme allows for a reformulation in the dual space. By this we mean that in Algorithm 1 we substitute the Fourier transform φ̂_t for the original φ_t. While the primal scheme deals with the operators S_φ, S_{φ_{t−1}}, the dual version deals with the vectors of functions G_σ ∂φ̂/∂x, G_σ ∂φ̂_{t−1}/∂x. Details of the dual algorithm can be found in Appendix F.
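A finite-dimensional caricature of Algorithm 1 may help fix the idea. Below φ is replaced by a matrix variable, I(φ) = ||X − φ||²_F, and S_φ is φ itself, so the φ-step has a closed form and the S-step is a truncated SVD. This only illustrates the alternation between a convex step and an SVD step; the actual algorithm works with functions and the operators S_φ.

```python
import numpy as np

def alternating_scheme(X, k, lam, T=50):
    """Toy analogue of Algorithm 1:
    min over phi of ||X - phi||_F^2 + lam * ||phi - S||_F^2, rank(S) <= k,
    alternating a closed-form phi-step with SVD truncation (S-step)."""
    S = np.zeros_like(X)
    for _ in range(T):
        phi = (X + lam * S) / (1 + lam)          # minimize over phi (convex step)
        U, s, Vt = np.linalg.svd(phi, full_matrices=False)
        S = (U[:, :k] * s[:k]) @ Vt[:k, :]       # truncated SVD at the k-th term
    return phi, U[:, :k]                          # U[:, :k] spans the found subspace

rng = np.random.default_rng(0)
B = rng.normal(size=(20, 3))
A = rng.normal(size=(100, 3))
X = B @ A.T + 0.01 * rng.normal(size=(20, 100))  # approximately rank-3 data
phi, Uk = alternating_scheme(X, k=3, lam=10.0)
print(np.linalg.svd(phi, compute_uv=False)[:5])  # energy concentrates in 3 values
```

As in Algorithm 1, the penalty weight lam controls how strongly the iterate is pulled toward the rank-k set found at the previous step.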

7. EXPERIMENTS

The alternating scheme (Algorithm 1) is a general optimization method which needs to be specialized for every optimization task. We designed numerical specializations of the alternating scheme for all 4 optimization tasks, (6), (7), (8) and (9), and experimented with all of them. Details of the algorithms, i.e. the numerical methods used to minimize over φ and to calculate M_{φ_t}, can be found in the appendix (G, H, I and J). Note that for the Wasserstein distance minimization (8) we exploit the alternating scheme in its initial form (Algorithm 1), while for MMD (6), HM (7) and SDR (9) we use the dual version of the scheme.

Behaviour of MMD for small h. We studied the difference in behaviour between PCA and a solution of (6) obtained by the alternating scheme (MMD) for the case when h is small compared to the standard deviation of the features. Experiments show that they differ sharply when data points are sampled along a low-dimensional manifold M which is bent globally, goes through the origin O and has a large curvature at O. Because PCA is a global method and the points do not lie on an affine subspace, interpreting the principal directions is not straightforward. We select a smooth function f : R^{n−1} → R such that f(0) = 0 and generate points in the following way: points x_1, x_2, ..., x_N are sampled uniformly from [−10, 10]^{n−1}; after calculating y_i = f(x_i) we add some noise: z_i = (x_i, y_i) + ε_i, ε_i ∼ N(0, 0.01 I_n). Both PCA and MMD are applied to this dataset (first 3 pictures in Figure 1a). As we see, MMD, unlike PCA, tries to catch local alignments of points rather than searching for a global alignment (which may be nonexistent). This property of MMD makes it a promising tool for the calculation of the tangent space to a data manifold at a given point. The fourth picture shows that when the data contain 2 equally important directions such that the first principal direction of PCA lies between them (red line), and we set k = 1, then MMD (green line) always chooses one of those directions.
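The data-generating procedure above and the MMD objective (6) on empirical measures can both be sketched concretely. For empirical measures the distance d_MMD has a closed form in pairwise kernel evaluations, since the L_2 inner product of two Gaussian bumps k_h(· − a) and k_h(· − b) is a Gaussian of bandwidth √2·h evaluated at a − b. The bent function f, the bandwidth, and the PCA stand-in for the rank-1 measure below are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bent-manifold data (n = 2): z_i = (x_i, f(x_i)) + eps_i, eps_i ~ N(0, 0.01 I_n).
f = lambda x: 0.1 * x ** 2                     # a smooth bent f with f(0) = 0
x = rng.uniform(-10, 10, size=500)
Z = np.column_stack([x, f(x)]) + rng.normal(scale=0.1, size=(500, 2))

def mmd_gaussian(X, Y, h):
    """Closed-form d_MMD between the empirical measures of samples X and Y."""
    n = X.shape[1]
    c = (4 * np.pi * h ** 2) ** (-n / 2)       # normalization of doubled variance
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return c * np.exp(-d2 / (4 * h ** 2))
    m2 = gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()
    return np.sqrt(max(m2, 0.0))

# Project the data onto the first PCA direction and measure the MMD gap.
Zc = Z - Z.mean(0)
w = np.linalg.svd(Zc, full_matrices=False)[2][0]
Z_pca = Z.mean(0) + np.outer(Zc @ w, w)        # rank-1 (PCA) approximation
print(mmd_gaussian(Z, Z_pca, h=0.5))           # positive: PCA misses the bend
```

With h small relative to the spread of the data, the gap above stays visibly positive for globally bent data, which is the regime in which the MMD solution and PCA diverge.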
Experiments with outlier detection (MMD, HM, Wasserstein distance). Following the experimental setup of Xu et al. (2010), we choose parameters N = n = 400, δ = 0.05 (0.1), k = 10 and generate random matrices A ∈ R^{N(1−δ)×k}, B ∈ R^{n×k} whose entries are iid N(0, 1). Then the columns of the matrix BA^T ∈ R^{n×N(1−δ)} (whose rank is ≤ k) are concatenated with the columns of a matrix C ∈ R^{n×Nδ}: X = concat(BA^T, C) ∈ R^{n×N}. The entries of C are either iid N(0, 1) (case I) or Nδ copies of the same vector whose entries are iid N(0, 1) (case II). Let X = [x_1, ..., x_N], i.e. the columns of X are the data points. Thus, N(1−δ) columns of BA^T lie in a k-dimensional subspace of R^n, Nδ columns of C are outliers, and solutions of tasks (6), (7) or (8) for this dataset are expected to be supported in the column space of BA^T. After every iteration (step t of the alternating scheme) we calculate the Frobenius distance between the projection operator P_t of Algorithm 1 and the projection operator P onto the column space of BA^T, i.e. ||P_t − P||_F. For task (8), the dependence of ||P_t − P||_F on t for different values of the parameters δ and λ is shown in Figure 1b. For tasks (6), (7) the behaviour of the alternating scheme is similar; 7 iterations are enough to approach the optimal subspace. Besides the speed of convergence, we were also interested in how ||P_* − P||_F, where P_* = lim_{t→∞} P_t is the final projection operator (P_20 in practice), depends on the parameter σ of the kernel M = G_σ. It is natural to expect the quality of the solution P_* to degrade as σ → +∞ (this corresponds to M(x, y) → 0), and, less trivially, as σ → 0 (this corresponds to M(x, y) → δ_n(x − y)).

Experiments with sufficient dimension reduction. We made experiments on the standard datasets Heart, Breast Cancer, Ionosphere, Diabetes, Boston house prices and Wine quality.
First we applied the Sliced Inverse Regression algorithm (SIR) (Li, 1991) to the training set and calculated the effective subspace for k = 2, 3. All points were projected onto that subspace, yielding two- or three-dimensional representations of the input points. In the last step we applied the 10 nearest neighbours algorithm (KNN) to predict outputs (based on the reduced inputs) on the test set (for the regression case, 10-KNN regression was used). The same scheme was repeated with PCA, the Kernel Dimensionality Reduction (KDR) algorithm (Fukumizu et al., 2004) and the alternating scheme adapted for SDR. We experimented with the dual version of Algorithm 1, setting (after the data was standardized) the kernel parameter σ = 0.8 and λ = 10.0. Details of its numerical implementation can be found in Appendix J. The code is available on GitHub to facilitate the reproducibility of our results.
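A minimal NumPy sketch of the SIR step of this pipeline (not the authors' implementation; the slicing-and-whitening recipe follows Li (1991), and the synthetic check at the end is ours):

```python
import numpy as np

def sir_directions(X, y, k, n_slices=10):
    """Sliced Inverse Regression: top eigenvectors of the covariance of
    slice-means of the whitened inputs estimate the effective subspace."""
    N, n = X.shape
    mu = X.mean(0)
    cov = np.cov(X, rowvar=False)
    # Whiten the inputs.
    lam, V = np.linalg.eigh(cov)
    W = V @ np.diag(lam ** -0.5) @ V.T
    Z = (X - mu) @ W
    # Slice by sorted response and average Z within each slice.
    order = np.argsort(y)
    M = np.zeros((n, n))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(0)
        M += (len(idx) / N) * np.outer(m, m)
    # Top-k eigenvectors, mapped back through the whitening transform.
    _, E = np.linalg.eigh(M)                     # ascending eigenvalue order
    return W @ E[:, ::-1][:, :k]                 # columns span the effective subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
w_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.sin(X @ w_true) + 0.1 * rng.normal(size=2000)  # regression y = g(w^T x) + noise
w_hat = sir_directions(X, y, k=1)[:, 0]
print(abs(w_true @ w_hat) / np.linalg.norm(w_hat))    # cosine with the true direction
```

Once the effective directions are found, the pipeline described above simply projects train and test inputs onto them and runs 10-KNN on the reduced representation.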

8. RELATED WORK

We present an optimization framework in which the search space is 𝒢_k, or P_k. Another unifying framework for LDR tasks is suggested by Cunningham & Ghahramani (2015), in which the basic search space is the Stiefel manifold S(n, k). The main disadvantage of using 𝒢_k instead of the Stiefel manifold is that its infinite dimensionality requires a special procedure for turning the optimization into a finite-dimensional task. Using the Ky Fan k-anti-norm as a regularizer for the matrix completion problem was suggested by Hu et al. (2013) and further developed in Oh et al. (2016); Liu et al. (2016); Hong et al. (2016). Unlike this chain of works, we formulate an infinite-dimensional task. Also, our regularizer R(f) = ||M_f||_{n−k} is a sum of the n − k smallest squared singular values of the operator S_f, where S_f depends on f linearly. The idea of alternating two basic stages, convex optimization and SVD, is ubiquitous in low-rank optimization, see e.g. Mazumder et al. (2010); Hastie et al. (2015). Zhu & Zeng (2006) applied the Fourier transform to estimating the effective subspace in SDR, implicitly using an analog of Theorem 2.

9. CONCLUSIONS

We develop a new optimization framework for LDR problems. The alternating scheme for the optimization task demonstrates both computational efficiency and applicability to real-world data. The algorithm performs quite stably when we vary most of the hyperparameters, though it crucially depends on two of them: the bandwidth σ of the "smoothing" kernel M and the penalty parameter λ. We believe that the MMD/HM/WD methods for UDR could be used as an alternative to PCA in fields in which data demonstrate "heavy-tailed" and "non-Gaussian" behaviour, such as financial applications. Also, our formulation of SDR is free from any assumptions on the distribution of input-output pairs, which makes it an alternative to other methods of effective subspace estimation. A more detailed report on these topics is a subject of future research.

A PROOFS FOR SECTION 3

A.1 PROOF OF THEOREM 1: GIVEN FOR COMPLETENESS

Proof. The inclusion 𝒢_k ⊆ G_k* follows from the well-known fact that S(R^k) is dense in S′(R^k), i.e. for any f ∈ S′(R^k) one can always find a sequence {f_i} ⊆ S(R^k) such that T_{f_i} →* f. Therefore, for any (f ⊗ δ^{n−k})_U ∈ 𝒢_k there is a sequence {(T_{f_i} ⊗ δ^{n−k})_U} ⊆ G_k such that (T_{f_i} ⊗ δ^{n−k})_U →* (f ⊗ δ^{n−k})_U. Thus, 𝒢_k ⊆ G_k*. Since G_k ⊆ 𝒢_k, to prove 𝒢_k = G_k* it is enough to show that 𝒢_k is sequentially closed. We need a simple fact from the theory of distributions.

Lemma 1. If T_i →* T and φ_i → φ, then ⟨T_i, φ_i⟩ → ⟨T, φ⟩.

Proof of Lemma. The Schwartz space S(R^n) is a Fréchet space, therefore the Banach-Steinhaus theorem applies to S′(R^n). Since T_i →* T, we have sup_i |⟨T_i, φ⟩| < ∞ for any φ ∈ S(R^n). From the Banach-Steinhaus theorem, applied to the set {T_i}_1^∞, we obtain: for any ε > 0 there is a neighbourhood U of 0 ∈ S(R^n) such that |⟨T_i, φ⟩| < ε whenever φ ∈ U. Thus, |⟨T_i, φ_i − φ⟩| < ε for large enough i. From this we conclude that ⟨T_i, φ_i⟩ → ⟨T, φ⟩.
For any T ∈ S′(R^n) and ψ ∈ S(R^{n−k}), let us define T_ψ ∈ S′(R^k) by ⟨T_ψ, φ⟩ = ⟨T, φ ⊗ ψ⟩. Suppose that {f_i}_1^∞ ⊆ S′(R^k) and orthogonal matrices {U_i}_1^∞ are such that (f_i ⊗ δ^{n−k})_{U_i} →* f. We need to prove that f ∈ 𝒢_k. Since the set of orthogonal matrices is compact, one can always find a subsequence {U_{n_i}} such that U_{n_i} → U. Since (f_{n_i} ⊗ δ^{n−k})_{U_{n_i}} →* f and φ(U_{n_i} x) → φ(Ux) (for any fixed φ ∈ S(R^n)), using Lemma 1 we obtain:

⟨f_{n_i} ⊗ δ^{n−k}, φ⟩ = ⟨(f_{n_i} ⊗ δ^{n−k})_{U_{n_i}}, φ(U_{n_i} x)⟩ → ⟨f, φ(Ux)⟩ = ⟨f_{U^T}, φ(x)⟩

Thus, we have f_{n_i} ⊗ δ^{n−k} →* f_{U^T}. From this we see that f_{n_i} →* (f_{U^T})_ψ, where ψ is any fixed function with ψ(0) = 1. Therefore, f_{U^T} = (f_{U^T})_ψ ⊗ δ^{n−k} and f = ((f_{U^T})_ψ ⊗ δ^{n−k})_U ∈ 𝒢_k.

A.2 PROOF OF THEOREM 2

Proof. Let us first prove that if g = T_f ⊗ δ^{n−k}, then F[g] = T_r where r(x) = f̂(x_{1:k}), x ∈ R^n. For this we have to prove that ⟨F[g], φ⟩ = ⟨T_r, φ⟩ for any φ ∈ S(R^n). Indeed,

⟨F[g], φ⟩ = ⟨g, F[φ]⟩ = ⟨T_f ⊗ δ^{n−k}, ∫_{R^n} φ(y) e^{−ix^T y} dy⟩ = ⟨T_f, ∫_{R^n} φ(y) e^{−ix_{1:k}^T y_{1:k}} dy⟩ = ∫_{R^{n+k}} f(x_{1:k}) φ(y) e^{−ix_{1:k}^T y_{1:k}} dy dx_{1:k} = ∫_{R^n} f̂(y_{1:k}) φ(y) dy = ⟨T_r, φ⟩

Let us calculate the image of G_k under the Fourier transform. It is easy to see that for any g ∈ S′(R^n), φ ∈ S(R^n) and orthogonal U ∈ R^{n×n} we have:

⟨F[g_U], φ(x)⟩ = ⟨g_U, F[φ](x)⟩ = ⟨g, F[φ](U^T x)⟩ = ⟨g, F[φ(U^T x)]⟩ = ⟨F[g], φ(U^T x)⟩ = ⟨(F[g])_U, φ(x)⟩

Therefore, F[g_U] = (F[g])_U. Thus, if g = T_f ⊗ δ^{n−k}, then F[g_U] = (T_r)_U = T_{r′} where r′(x) = r(Ux) = f̂(U_k x) and U_k ∈ R^{k×n} is the matrix consisting of the first k rows of U. Thus, T_{r′} ∈ F_k. Let us show that by varying f ∈ S(R^k) and U in the expression f̂(U_k x) we can obtain any function from F_k. For this it is enough to show that F_k is equivalent to the following set of functions:

Q = {g(U_k x) | g ∈ S(R^k), U_k ∈ R^{k×n}, U_k U_k^T = I_k}

The fact that Q ⊆ F_k is obvious. Let us now prove that Q ⊇ {g(Px) | g ∈ S(R^k), P ∈ R^{k×n}, rank P = k} = F_k.
Indeed, if f(x) = g(Px), then f(x) = g′(U_k x) where U_k = (PP^T)^{−1/2} P and g′(y) = g((PP^T)^{1/2} y). By construction, U_k U_k^T = I_k and g′ ∈ S(R^k). Thus, Q = F_k. Therefore, F[G_k] = F_k, and from the bijectivity of the Fourier transform we obtain F^{−1}[F_k] = G_k.

A.3 PROOF OF THEOREM 3

Proof of Theorem 3 (⇒). Let us prove that from T = (f ⊗ δ^{n−k})_U, f ∈ S′(R^k), U^T U = I_n, it follows that dim span_R{x_1 T, x_2 T, ..., x_n T} ≤ k. It is easy to see that x_i[f ⊗ δ^{n−k}] = 0 if i > k. If U = [u_1, ..., u_n]^T, then for i > k we have 0 = (x_i[f ⊗ δ^{n−k}])_U = u_i^T x (f ⊗ δ^{n−k})_U = u_i^T x T. Thus, we have n − k orthogonal vectors u_{k+1}, ..., u_n such that [x_1 T, ..., x_n T] u_i = 0. Using standard linear algebra we obtain that there are at most k distributions x_{i_1}T, ..., x_{i_{k′}}T, k′ ≤ k, that form a basis of span_R{x_i T}_1^n. To prove the second part of the theorem we need the following lemma.

Lemma 2. If T ∈ S′(R^n) is such that y_i T = 0 for any i > k, then T ∈ 𝒢_k.

Proof of Lemma. Recall from functional analysis that, for f ∈ S′(R^n), the tempered distribution ∂f/∂x_i is defined by the condition ⟨∂f/∂x_i, φ⟩ = −⟨f, ∂φ/∂x_i⟩. Once the Fourier transform is applied, the lemma is equivalent to the following dual formulation: if ∂f/∂x_i = 0 for i > k, then f ∈ F_k*. Let us prove it in this formulation. The set of infinitely differentiable functions with compact support is denoted by C_c^∞(R). Suppose φ ∈ S(R^n) and p ∈ C_c^∞(R) are chosen in such a way that ∫_{−∞}^{∞} p(y_i) dy_i = 1 and supp p ⊆ [A, B].
Let us define:

r(x) = ∫_{−∞}^{x_i} φ(x_{−i}, y_i) dy_i − (∫_{−∞}^{x_i} p(y_i) dy_i) ∫_{−∞}^{∞} φ(x_{−i}, y_i) dy_i

It is easy to see that for any α′ ∈ N^{n−1}, α ∈ N, β′ ∈ N^{n−1}, β ∈ N we have (when at least one derivative over x_i is present):

x_{−i}^{α′} x_i^{α} ∂^{β′,1+β} r / (∂x_{−i}^{β′} ∂x_i^{1+β}) = x_{−i}^{α′} x_i^{α} ∂^{β′,β}[φ(x) − p(x_i) ∫_{−∞}^{∞} φ(x_{−i}, y_i) dy_i] / (∂x_{−i}^{β′} ∂x_i^{β}) = x_{−i}^{α′} x_i^{α} ∂^{β′,β} φ(x) / (∂x_{−i}^{β′} ∂x_i^{β}) − x_i^{α} (∂^{β} p(x_i)/∂x_i^{β}) ∫_{−∞}^{∞} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i

The terms x_{−i}^{α′} x_i^{α} ∂^{β′,β} φ(x)/(∂x_{−i}^{β′} ∂x_i^{β}) and x_i^{α} ∂^{β} p(x_i)/∂x_i^{β} are bounded by the definitions of S(R^n) and C_c^∞(R). The boundedness of ∫_{−∞}^{∞} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i is a consequence of the inequality |x_{−i}^{α′} ∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}| ≤ C/(1 + y_i²), which holds because φ ∈ S(R^n). Analogously (when no derivative over x_i is present):

x_{−i}^{α′} x_i^{α} ∂^{β′} r/∂x_{−i}^{β′} = x_i^{α} ∫_{−∞}^{x_i} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i − x_i^{α} ∫_{−∞}^{x_i} p(y_i) dy_i ∫_{−∞}^{∞} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i = x_i^{α} (1 − ∫_{−∞}^{x_i} p(y_i) dy_i) ∫_{−∞}^{x_i} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i − x_i^{α} ∫_{−∞}^{x_i} p(y_i) dy_i ∫_{x_i}^{∞} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i

The second term is 0 when x_i ≤ A. It is also bounded when x_i > A, because |x_{−i}^{α′} ∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}| ≤ C/(1 + y_i²)^{α+1} and

x_i^{α} ∫_{x_i}^{∞} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i ≤ |x_i|^{α} ∫_{x_i}^{∞} C/(1 + y_i²)^{α+1} dy_i

The latter is bounded because lim_{x_i→+∞} |x_i|^{α} ∫_{x_i}^{∞} C/(1 + y_i²)^{α+1} dy_i = 0. The first term is 0 when x_i ≥ B and it is bounded for x_i < B:

x_i^{α} ∫_{−∞}^{x_i} x_{−i}^{α′} (∂^{β′} φ(x_{−i}, y_i)/∂x_{−i}^{β′}) dy_i ≤ |x_i|^{α} ∫_{−∞}^{x_i} C/(1 + y_i²)^{α+1} dy_i

The latter is also bounded, since lim_{x_i→−∞} |x_i|^{α} ∫_{−∞}^{x_i} C/(1 + y_i²)^{α+1} dy_i = 0. Thus, x^{α} ∂^{β} r(x)/∂x^{β} is bounded and r ∈ S(R^n). Therefore ∂f/∂x_i = 0 implies:

⟨f, ∂r/∂x_i⟩ = 0 ⇒ f[φ] = f[p(x_i) ∫_{−∞}^{∞} φ(x_{−i}, y_i) dy_i]

Since this sequence of arguments works for any i > k, we can apply it sequentially to the initial φ ∈ S(R^n) w.r.t.
x k+1 , ..., x n and obtain for any p k+1 , ..., p n ∈ C ∞ c (R) such that ∞ -∞ p i (y i )dy i = 1: f [φ] = f [p k+1 (x k+1 ) • • • p n (x n ) R n-k φ(x 1:k , x k+1:n )dx k+1:n ] Moreover, since C ∞ c (R) is dense in S(R), we can assume that p k+1 , ..., p n ∈ S(R). For the inverse Fourier transform T = F -1 [f ] the latter condition becomes equivalent to: T, φ = T, p k+1 (x k+1 ) • • • p n (x n )φ(x 1:k , 0 k+1:n ) for any p k+1 , ..., p n ∈ S(R) such that p i (0) = 1. Let us define p i (x i ) = e -x 2 i . It is easy to check that T = g ⊗ δ n-k where g ∈ S (R k ), g, ψ = T, e -|x k+1:n | 2 ψ(x 1:k ) for ψ ∈ S(R k ). I.e. T ∈ G k and the lemma is proved. Proof of Theorem 3 (⇐). If dim span R {x 1 T, x 2 T, • • • , x n T } ≤ k, then dim{v ∈ R n |[x 1 T, • • • , x n T ]v = 0} ≥ n -k. Thus, there exist at least n-k orthonormal vectors v k+1 , • • • , v n such that [x 1 T, • • • , x n T ]v i = 0, i.e. (v T i x)T = 0 for i > k. Let us complete v k+1 , • • • , v n to an orthonormal basis of R n : v 1 , • • • , v n , and define the matrix V = [v 1 , • • • , v n ]. It is easy to see that: (v T i x)T V = (v T i V x)T V = x i T V . Since for i > k we have (v T i x)T = 0, then x i T V = 0. Using Lemma 2 we obtain T V ∈ G k . Therefore, (T V ) V T = T ∈ G k . The theorem is proved.

B STRUCTURE OF WD

Recall that (R n , || • ||) is a Banach space. Now, let us consider an optimization problem: for a given X ∈ R n×N solve ||X -L|| → min rank(L)≤k where || • || is extended to R n×N by ||[s 1 , • • • , s N ]|| def = i ||s i ||. The following simple theorem shows that the two tasks are connected, so that the solution of one directly leads to the solution of the other. Theorem 6. Given data points {x 1 , • • • , x N }, let X = [x 1 , • • • , x N ] ∈ R n×N . Then, min ν∈P k W (µ data , ν) = 1 N min Y ∈R n×N ,rank(Y )≤k ||X -Y || Moreover, min ν∈P k W (µ data , ν) is attained on ν * , where ν * is a uniform distribution over {y i } N i=1 and [y 1 , • • • , y N ] ∈ arg min Y ∈R n×N ,rank(Y )≤k ||X -Y ||. Proof. Let us prove first that inf µ∈P k W (µ data , µ) ≤ 1 N ||X -Y * || where Y * = [y 1 , • • • , y N ] ∈ arg min Y ∈R n×N ,rank(Y )≤k ||X -Y ||. Let π be a uniform distribution over {(x i , y i )} N i=1 and µ * be a uniform distribution over {y i } N i=1 . Since π ∈ Π(µ data , µ * ), we obtain W (µ data , µ * ) ≤ 1 N N i=1 ||x i -y i || = 1 N ||X -Y * ||. The support of µ * is k-dimensional, because rank(Y * ) ≤ k. Thus, we have µ * ∈ P k and inf µ∈P k W (µ data , µ) ≤ W (µ data , µ * ) ≤ 1 N ||X -Y * ||. Now, if we prove the inverse inequality, i.e. inf µ∈P k W (µ data , µ) ≥ 1 N ||X -Y * ||, this will imply that inf µ∈P k W (µ data , µ) = 1 N ||X -Y * || and, therefore, inf µ∈P k W (µ data , µ) = W (µ data , µ * ). This will in the end give us µ * ∈ arg inf µ∈P k W (µ data , µ). Let {µ t } ∞ 1 be such that µ t ∈ P k and W (µ data , µ t ) -inf µ∈P k W (µ data , µ) → 0. Let L t denote a k-dimensional support of µ t and P t be the projection operator onto L t . Let µ * t be a uniform distribution over {P t x 1 , • • • , P t x N }, i.e. µ * t (A) = 1 N N i=1 [P t x i ∈ A].
It is easy to see that W (µ * t , µ data ) ≤ W (µ t , µ data ), because µ * t and µ t share the same k-dimensional support L t , and the "transportation of a mass" concentrated in a point x i of the empirical distribution µ data can be most optimally done by just moving it to P t x i (i.e. to the closest point on L t ). Thus, we have inf µ∈P k W (µ data , µ) ≤ W (µ data , µ * t ) ≤ W (µ data , µ t ), and therefore, W (µ data , µ * t ) - inf µ∈P k W (µ data , µ) → 0. Since a set of projection operators is compact, one can always extract a subsequence {P ts } ∞ s=1 , such that P ts → P . It is easy to see that µ * ts → µ * * (i.e. W (µ * ts , µ * * ) → 0) where µ * * is a uniform distribution over {P x 1 , • • • , P x N }. For that distribution we have W (µ data , µ * * ) = lim s→∞ W (µ data , µ * ts ) = inf µ∈P k W (µ data , µ). Thus, the infimum is attained on µ * * . It is easy to see that W (µ data , µ * * ) = W (µ * * , µ data ) = 1 N ||X -P X||. Since rank(P X) ≤ k, we obtain W (µ data , µ * * ) ≥ 1 N min Y ∈R n×N ,rank(Y )≤k ||X -Y ||, which completes the proof.

Recall that for any operator O between spaces H 1 and H 2 we denote its range as R[O] = {O(x)|x ∈ H 1 }. For Ω ⊆ S(R n ), Ω denotes the sequential closure of Ω with respect to the natural topology of S(R n ). A set of continuous functions in R n is denoted by C(R n ). A set of infinitely differentiable functions with compact support in R n is denoted by C ∞ c (R n ). In the main part of the paper we assume M to be a gaussian kernel, though the theory can be applied to the more general case of the so-called proper kernels. Definition 2. The function M (x, y) : R n × R n → C is called a proper kernel if and only if • O(M )[f ] = R n M (x, y)f (y)dy is a linear operator from L 2 (R n ) to L 2 (R n ), • M (y, x) = M (x, y) * , • f, O(M )[f ] L2(R n ) > 0, ∀f ∈ L 2 (R n ), f = 0, • max x,y |M (x, y)| < ∞, • R[O(M )] ∩ S(R n ) = S(R n ).
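The finite-dimensional shadow of Definition 2 can be checked numerically: over a finite sample, the Gram matrix of a gaussian kernel should be Hermitian, bounded, and strictly positive definite. A minimal numpy sketch (the sample size, dimension, and bandwidth are arbitrary choices; a Gram matrix over finitely many points only illustrates, and cannot prove, the L 2 (R n ) conditions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gram matrix of the gaussian kernel M(x, y) = exp(-|x - y|^2 / (2 sigma^2))
# over a finite sample; sigma, the sample size and the dimension are
# arbitrary choices for illustration.
sigma = 1.0
pts = rng.normal(size=(20, 3))
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
G = np.exp(-d2 / (2 * sigma**2))

is_symmetric = bool(np.allclose(G, G.T))       # M(y, x) = M(x, y)* (real kernel)
max_abs = float(np.abs(G).max())               # max |M(x, y)| = 1 < infinity
min_eig = float(np.linalg.eigvalsh(G).min())   # strict positive definiteness
```

For pairwise-distinct points the gaussian Gram matrix is strictly positive definite, so the smallest eigenvalue is positive (though it can be small when points nearly coincide).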
The gaussian kernel M (x, y) = G n σ (x-y), which is of special interest from an application-oriented perspective, is captured by the following lemma: Lemma 3. If ζ, ζ̂ ∈ C(R n ) are bounded and ∀x ζ̂(x) > 0, then M (x, y) = ζ(x -y) is a proper kernel. Proof. Verification of the first four conditions is easy, so we only check the fifth condition. Let us denote the linear operators C ζ [f ] = ζ * f and O g [f ](x) = g(x)f (x). Then we have F[C ζ [L 2 (R n )]] = O ζ̂ [L 2 (R n )] ⊇ C ∞ c (R n ). Therefore, R[O(M )] = C ζ [L 2 (R n )] ⊇ F -1 [C ∞ c (R n )]. Since C ∞ c (R n ) is dense in S(R n ), F -1 [C ∞ c (R n )] also has this property. I.e. R[O(M )] ∩ S(R n ) = S(R n ). Besides the gaussian kernel, the lemma also captures the case of the Abel kernel ζ(x) = e -|x| . It is well known that the Fourier transform of the Abel kernel is the Poisson kernel ζ̂(x) = c n /(1+|x| 2 ) (n+1)/2 (which is also proper).
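The Abel/Poisson pair used in the remark can be verified numerically in the one-dimensional case, where F[e -|x| ](ω) = 2/(1 + ω 2 ) under the convention F[f ](ω) = ∫ f (x)e -iωx dx (the grid and the test frequencies below are arbitrary):

```python
import numpy as np

# 1-D check of the Abel/Poisson Fourier pair, with the convention
# F[f](w) = integral of f(x) exp(-i w x) dx, so F[exp(-|x|)](w) = 2/(1+w^2).
xs = np.linspace(-40.0, 40.0, 400_001)
dx = xs[1] - xs[0]

def fourier(w):
    # simple Riemann sum; exp(-40) makes the truncation error negligible
    return (np.exp(-np.abs(xs)) * np.exp(-1j * w * xs)).sum() * dx

ws = np.array([0.0, 0.5, 1.0, 2.0])
vals = np.array([fourier(w) for w in ws])
err = float(np.abs(vals - 2.0 / (1.0 + ws**2)).max())
```

Note that a different normalization of the Fourier transform only changes the constant c n, not the shape of the Poisson kernel.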

C.2 PROOF OF THEOREM 4

We will prove a more general statement: Theorem 7. Let M (x, y) be a proper kernel and, additionally, a Lipschitz function. If f ∈ G k , then x i f |M |x j f is defined and rank M f ≤ k. Proof. Let us first show that f |M |g is defined for all f, g ∈ G k . Note that for any f = (T a ⊗ δ n-k ) U ∈ G k we have T fσ = (T a ⊗ δ n-k ) U * G n σ = ((T a * G k σ ) ⊗ T G n-k σ ) U . Let us denote a σ = a * G k σ and b σ = b * G k σ . It is easy to see that f σ = (a σ (x 1:k )G n-k σ (x k+1:n )) U ∈ S(R n ). From a well-known property of the Weierstrass transform we have: ||f σ || L1 = ||a σ || L1 • ||G n-k σ || L1 ≤ ||a|| L1 . From this we obtain for any f = (T a ⊗ δ n-k ) U , g = (T b ⊗ δ n-k ) V ∈ G k : | f σ |M |g σ | ≤ max x,y |M (x, y)| ||f σ || L1 ||g σ || L1 ≤ max x,y |M (x, y)| ||a|| L1 ||b|| L1 < ∞. Thus, f σ |M |g σ is defined uniformly in σ, and: f σ |M |g σ = R n ×R n a * σ (x 1:k )G n-k σ (x k+1:n )M (U T x, V T y)b σ (y 1:k )G n-k σ (y k+1:n )dxdy = R k ×R k a * σ (x 1:k )M σ (x 1:k , y 1:k )b σ (y 1:k )dx 1:k dy 1:k where M σ (x 1:k , y 1:k ) = R n-k ×R n-k G n-k σ (x k+1:n )M (U T x, V T y)G n-k σ (y k+1:n )dx k+1:n dy k+1:n . Let U k , V k ∈ R k×n be the submatrices formed by the first k rows of U and V , so that U T k x 1:k = U T (x 1:k , 0 k+1:n ). For the bounded function M (x 1:k , y 1:k ) = M (U T k x 1:k , V T k y 1:k ) we have: |M σ (x 1:k , y 1:k ) -M (x 1:k , y 1:k )| = | R 2n-2k G n-k σ (x k+1:n ) M (U T x, V T y) -M (U T k x 1:k , V T k y 1:k ) G n-k σ (y k+1:n )dx k+1:n dy k+1:n | ≤ L| R 2n-2k G n-k σ (x k+1:n ) |x k+1:n | + |y k+1:n | G n-k σ (y k+1:n )dx k+1:n dy k+1:n | = 2L R n-k |x k+1:n |G n-k σ (x k+1:n )dx k+1:n = 2Lσ R n-k |x k+1:n |G n-k 1 (x k+1:n )dx k+1:n . Thus, M σ → M in L ∞ (R 2k ) as σ → +0. Further we assume that σ > 0 is small enough, so that M σ (x 1:k , y 1:k ) ≤ C = 2 max |M |.
Now we have: | f σ |M |g σ - R k ×R k a * (x 1:k ) M (x 1:k , y 1:k )b(y 1:k )dx 1:k dy 1:k | = | R k ×R k (a * σ (x 1:k )M σ (x 1:k , y 1:k )b σ (y 1:k ) -a * (x 1:k ) M (x 1:k , y 1:k )b(y 1:k ))dx 1:k dy 1:k | = | R k ×R k M σ (x 1:k , y 1:k )a * σ (x 1:k )(b σ (y 1:k ) -b(y 1:k ))dx 1:k dy 1:k + R k ×R k M σ (x 1:k , y 1:k )b(y 1:k )(a * σ (x 1:k ) -a * (x 1:k ))dx 1:k dy 1:k + R k ×R k a * (x 1:k )b(y 1:k )(M σ (x 1:k , y 1:k ) - M (x 1:k , y 1:k ))dx 1:k dy 1:k | ≤ C||a σ || L1 ||b σ -b|| L1 + C||b|| L1 ||a σ -a|| L1 + ||a * (x 1:k )b(y 1:k )|| L1 ||M σ - M || L∞ → 0 as σ → +0, so f |M |g is defined. Let us now prove that rank M f ≤ k. Since f ∈ G k , then f = (T g ⊗ δ n-k ) U where U is an orthogonal matrix and U = [w 1 , • • • , w n ]. It is easy to see that: x i f |M |x j f = (x i f ) U T |M (U T x, U T y)|(x j f ) U T = w T i x T g ⊗ δ n-k |M (U T x, U T y)|w T j x T g ⊗ δ n-k . Let us now denote by V = [u 1 , • • • , u n ] ∈ R k×n the submatrix of U in which only the first k rows of U are present. Then, the latter integral is equal to: R k ×R k u T i x 1:k y T 1:k u j g(x 1:k ) * M (V T x 1:k , V T y 1:k )g(y 1:k )dx 1:k dy 1:k = u T i Bu j where B = [ x i g| M |x j g ] 1≤i,j≤k , M (x 1:k , y 1:k ) = M (V T x 1:k , V T y 1:k ), is the Gram matrix of the collection {x i g(x 1:k )} k i=1 ⊆ S(R k ). Obviously, rank M f = rank Re u T i Bu j 1≤i,j≤n = rank V T (Re B)V ≤ rank V = k.

D GENERAL THEORY OF THE REDUCTION FOR SECTION 5

For a sequence {f s } ∞ s=1 ⊆ S (R n ), Lim s→∞ f s denotes the set of points f ∈ S (R n ) such that there exists an increasing sequence {s i } ⊆ N with lim i→∞ f si = f .

D.1 REGULAR SOLUTIONS AND REDUCTION THEOREMS

For I : G k ∪ S(R n ) → R + , it is natural to reduce the optimization task 10 to an optimization task over ordinary functions with a penalty term 11. To have an equivalence between 10 and 11 we need to assume that I's behaviour when approaching f ∈ G k from the set S(R n ) is continuous, i.e. for any sequence {f i } ⊆ S(R n ) such that T fi → * f ∈ G k , we have lim i→∞ I(T fi ) = I(f ). Let us introduce the notion of a regular solution both for 10 and 11. Let B k = ∪ C>0 {f ∈ G k | Tr(M f ) ≤ C} * . Definition 3. Any f ∈ Arg min f ∈G k I(f ) ∩ B k is called a regular solution of 10. A sequence {f i } ∞ 1 ⊆ S(R n ) is said to solve 11 if I(f i ) + λ i R(f i ) ≤ inf f ∈S(R n ) I(f ) + λ i R(f ) + ε i (∗) where ε i → +0 and λ i → +∞ as i → +∞. If, additionally, Tr(M fi ) is bounded, then {f i } ∞ 1 is said to solve 11 regularly. Let us define rsol (I(f ), R(f )) = ∪ {fi} ∞ 1 r. solves (11) Lim i→∞ T fi . Theorem 8. If M is a proper kernel, then rsol (I(f ), R(f )) ⊆ Arg min f ∈G k I(f ). Theorem 9. If M is a proper kernel and rsol (I(f ), R(f )) = ∅, then Arg min f ∈G k I(f ) ∩ B k ⊆ rsol (I(f ), R(f )). Theorem 10 (Reduction theorem). If M is a proper kernel, Arg min f ∈G k I(f ) ⊆ B k and rsol (I(f ), R(f )) = ∅, then rsol (I(f ), R(f )) = Arg min f ∈G k I(f ). Suppose that we now solve a sequence of problems 11 and find {f s } ∞ 1 . According to Theorems 8 and 9, the following are the potential scenarios: (1) Tr(M fs ) blows up and the convergence is not guaranteed. This situation can be avoided by controlling Tr(M f ) in the optimization process. In practice, when f has a parameterized form, this can be done by bounding parameters. If Tr(M fs ) does not blow up, we still have two subcases: (2.1) Lim s→∞ T fs ≠ ∅. This is the positive outcome: the optimization problem 10 is successfully reduced to 11. (2.2) Lim s→∞ T fs = ∅. This exotic situation can happen only if the sequence T fs leaves every sequentially compact subset of S (R n ). Bounding parameters also tackles this case.

D.2 PROOFS OF THEOREM 8 AND 9

For any f = (T l ⊗ δ n-k ) U ∈ G k and σ > 0, let us define f σ as: T fσ = (T l ⊗ δ n-k ) U * G n σ = (T lσ ⊗ T G n-k σ ) U f σ = (l σ (x 1:k )G n-k σ (x k+1:n )) U l σ = l * G k σ We have T fσ → * (T l ⊗ δ n-k ) U as σ → +0. Lemma 4. For any f ∈ G k , lim σ→+0 x i f σ |M |x j f σ = 0, for any (i, j) / ∈ {1, ..., k} 2 , and sup σ∈[0,1] x i f σ |M |x j f σ < ∞, for any (i, j) ∈ {1, ..., k} 2 . Proof. W.l.o.g. we can assume that f = T l ⊗ δ n-k , l ∈ S(R k ). If i > k, j ≤ k we have: x i f σ |M |x j f σ = 1 (2πσ 2 ) n-k R n ×R n x i y j e -|x k+1:n | 2 2σ 2 l σ (x 1:k )M (x, y)e -|y k+1:n | 2 2σ 2 l σ (y 1:k )dxdy = R n 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 l σ (x 1:k )P (x)dx where P (x) = R n 1 √ 2πσ 2 n-k y j M (x, y)e -|y k+1:n | 2 2σ 2 l σ (y 1:k )dy. Using the Hölder inequality we obtain: | x i f σ |M |x j f σ | ≤ || 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 l σ (x 1:k )|| L1(R n ) ||P || L∞(R n ) = || 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 || L1(R n-k ) ||l σ || L1(R k ) ||P || L∞(R n ) Since |M (x, y)| ≤ γ for some γ, we have: |P (x)| ≤ γ|| 1 √ 2πσ 2 n-k y j e -|y k+1:n | 2 2σ 2 l σ (y 1:k )|| L1(R n ) = γ|| 1 √ 2πσ 2 n-k e -|y k+1:n | 2 2σ 2 || L1(R n-k ) ||y j l σ (y 1:k )|| L1(R k ) = γ||y j l σ (y 1:k )|| L1(R k ) Thus, | x i f σ |M |x j f σ | ≤ || 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 || L1(R n-k ) ||l σ || L1(R k ) γ||y j l σ || L1(R k ) Using ||l σ || L1(R k ) -||l|| L1(R k ) σ→+0 → 0, ||y j l σ || L1(R k ) -||y j l|| L1(R k ) σ→+0 → 0, we see the boundedness of ||l σ || L1(R k ) γ||y j l σ || L1(R k ) and proceed: ≤ C|| 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 || L1(R n-k ) It is easy to see that || 1 √ 2πσ 2 n-k x i e -|x k+1:n | 2 2σ 2 || L1(R n-k ) → 0 as σ → 0, therefore x i f σ |M |x j f σ → 0. Similarly, we can prove that x i f σ |M |x j f σ → 0 if i, j > k.
The entries of the main k × k minor [ x i f σ |M |x j f σ ] 1≤i,j≤k are bounded, because: Tr M fσ = 1 (2πσ 2 ) n-k R n ×R n x • ye -|x k+1:n | 2 2σ 2 l σ (x 1:k )M (x, y)e -|y k+1:n | 2 2σ 2 l σ (y 1:k )dxdy ≤ γ (2πσ 2 ) n-k R n ×R n (|x 1:k •y 1:k |+|x k+1:n •y k+1:n |)e -|x k+1:n | 2 +|y k+1:n | 2 2σ 2 l σ (x 1:k )l σ (y 1:k )dxdy ≤ γ R n ×R n |x 1:k • y 1:k |l σ (x 1:k )l σ (y 1:k )dx 1:k dy 1:k + γ||l σ || 2 L1 (n -k)σ 2 ≤ γ k j=1 ||y j l σ || 2 L1(R k ) + γ||l σ || 2 L1 (n -k)σ 2 Again, using ||l σ || L1(R k ) -||l|| L1(R k ) σ→+0 → 0, ||y j l σ || L1(R k ) -||y j l|| L1(R k ) σ→+0 → 0, we obtain the boundedness of the RHS. Corollary 1. For any f ∈ G k , lim σ→0 R(f σ ) = 0. Proof. W.l.o.g. we can assume that f = T l ⊗δ n-k , l ∈ S(R k ). By Lemma 4, all entries of M fσ except those of the main k × k minor approach 0 as σ → 0. This means that lim σ→+0 Q(f σ ) = 0 where Q(f σ ) = n i=k+1 x i f σ |M |x i f σ . Let v 1 , • • • , v n be unit eigenvectors of M fσ corresponding to the eigenvalues λ 1 ≥ • • • ≥ λ n , P = n i=k+1 e i e T i , then R(f σ ) = n i=k+1 λ i = min pi∈[0,1], n 1 pi=n-k n i=1 λ i p i ≤ n i=1 λ i Tr (P v i v T i P ) = Tr (P M fσ P ) = Q(f σ ) Since R(f σ ) ≤ Q(f σ ), we obtain lim σ→0 R(f σ ) = 0.
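The inequality R(f σ ) ≤ Q(f σ ) used in the corollary is an instance of a general linear-algebra fact: the sum of the n-k smallest eigenvalues of a PSD matrix is bounded by the sum of any n-k diagonal entries. A quick numerical check, with a generic PSD matrix standing in for M fσ :

```python
import numpy as np

rng = np.random.default_rng(1)

# R(f) is the sum of the n-k smallest eigenvalues of the PSD matrix M_f;
# Q(f) = Tr(P M_f P) is the sum of the last n-k diagonal entries when P
# projects onto the last n-k coordinates.  A generic PSD matrix stands
# in for M_f here.
n, k = 6, 2
A = rng.normal(size=(n, n))
M = A @ A.T

eigs = np.sort(np.linalg.eigvalsh(M))
R = float(eigs[: n - k].sum())      # sum of the n-k smallest eigenvalues
Q = float(np.diag(M)[k:].sum())     # Tr(P M P)
```

The same bound holds for any selection of n-k coordinates, which is why the penalty Q can be driven to 0 only by driving R to 0 as well.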

D.2.1 PROOF OF THEOREM 8

Proof. Suppose that a sequence {f i } ∞ i=1 ⊆ S(R n ) regularly solves (11) and T ∈ Lim i→∞ f i . W.l.o.g. we can assume that T fi → * T , Tr(M fi ) is bounded and I(f i ) + λ i R(f i ) ≤ inf f ∈S(R n ) I(f ) + λ i R(f ) + ε i , ε i → 0. Below we use the continuity of I and Corollary 1: inf f ∈S(R n ) I(f ) + λ i R(f ) ≤ inf f ∈G k inf σ>0 I(f σ ) + λ i R(f σ ) ≤ inf f ∈G k lim σ→+0 I(f σ ) + λ i R(f σ ) ≤ inf f ∈G k I(f ) from which we conclude that λ i R(f i ) ≤ inf f ∈G k I(f ) + ε i and, therefore, R(f i ) i→∞ → 0. For each l, let us define P l as the projection operator onto the subspace spanned by the first k principal components of the matrix M f l , i.e. P l = k i=1 v l i v l i T where v l 1 , ..., v l k are orthonormal eigenvectors that correspond to the k largest eigenvalues of M f l . From the Eckart-Young-Mirsky theorem we see that R(f l ) = || √ M f l -P l √ M f l || 2 F . Since the set of all projection operators {P ∈ R n×n |P 2 = P, P T = P } is a compact subset of R n 2 , one can always find a projection operator P = k i=1 v i v T i and a growing subsequence {l s } such that ||P ls -P || F → 0 as s → ∞. Thus, for the subsequence {f ls } we have: || √ M f ls -P √ M f ls || F = || √ M f ls -P ls √ M f ls + P ls √ M f ls -P √ M f ls || F ≤ || √ M f ls -P ls √ M f ls || F + ||P ls -P || F || √ M f ls || F = √ R(f ls ) + ||P ls -P || F √ Tr(M f ls ) and using the boundedness of Tr(M f ls ) we obtain || √ M f ls -P √ M f ls || F → 0. Let us complete v 1 , ..., v k to an orthonormal basis v 1 , ..., v n and make the change of variables y i = v T i x. Let us denote V = [v 1 , ..., v n ] and let V T = [w 1 , ..., w n ]. Then, after that change of variables any function f (x) corresponds to f̃ (y) = f (V y) and the kernel M corresponds to M̃ (y, y ) = M (V y, V y ).
If we apply that change of variables in the integral expression of x i f |M |x j f , we will obtain: x i f |M |x j f = w T i y f̃ |M̃ |w T j y f̃ = w T i [ y i f̃ |M̃ |y j f̃ ] n×n w j ⇒ Re x i f |M |x j f = w T i [Re y i f̃ |M̃ |y j f̃ ] n×n w j . I.e. M f = V M̃ f̃ V T , or M̃ f̃ = V T M f V . Note that P = V I k n V T where I k n is a diagonal matrix whose main k × k minor is the identity matrix, and all other entries are zeros. Using the fact that the Frobenius norms of orthogonally similar matrices are equal and the identity V T √ M f ls V = √ (V T M f ls V ), we obtain: || √ M f ls -P √ M f ls || F = ||V T √ M f ls V -V T P √ M f ls V || F = || √ (V T M f ls V ) -I k n √ (V T M f ls V )|| F = || √ M̃ f̃ ls -I k n √ M̃ f̃ ls || F Thus, the property || √ M f ls -P √ M f ls || F → 0 implies that: Re y i f̃ ls |M̃ |y j f̃ ls → 0 for i > k. Moreover, for i = j we have Re y i f̃ ls |M̃ |y i f̃ ls = y i f̃ ls |M̃ |y i f̃ ls . It is easy to see that after the change of variables we still have f̃ ls → * T V . Since f̃ ls ∈ S(R n ), we have y i f̃ ls ∈ S(R n ) and, therefore, y i f̃ ls ∈ L 2 (R n ). Let us now treat M̃ as an operator O(M̃ ) : L 2 (R n ) → L 2 (R n ), O(M̃ )[f ](x) = R n M̃ (x, y)f (y)dy. Let us take any function φ ∈ L 2 (R n ) such that ψ = O(M̃ )[φ] ∈ S(R n ). Since O(M̃ ) is a strictly positive self-adjoint operator, by the Cauchy-Schwarz inequality we obtain: | y i f̃ ls , O(M̃ )[φ] | ≤ √ y i f̃ ls |M̃ |y i f̃ ls √ φ, O(M̃ )[φ] . Therefore, for any ψ ∈ R[O(M̃ )] ∩ S(R n ) and i > k we have lim s→∞ y i f̃ ls , ψ = lim s→∞ f̃ ls , y i ψ = 0. Since f̃ ls → * T V we obtain T V , y i ψ = y i T V , ψ = 0 for any ψ ∈ R[O(M̃ )] ∩ S(R n ). But the denseness of R[O(M̃ )] ∩ S(R n ) in S(R n ) implies that y i T V = 0. Using Lemma 2 and (T V ) V T = T we obtain T ∈ G k . Thus, we proved that T fi → * T ∈ G k . Since I(f i ) ≤ I(f i ) + λ i R(f i ) ≤ inf f ∈G k I(f ) + ε i and I is continuous, we finally get that I(T ) ≤ inf f ∈G k I(f ), i.e. T ∈ Arg min f ∈G k I(f ).

D.2.2 PROOF OF THEOREM 9

Proof. Suppose f * ∈ Arg min f ∈G k I(f ) ∩ B k , i.e. f * ∈ B k and I(f * ) = min f ∈G k I(f ). Since f * ∈ B k , there exists a sequence {s i } ⊆ G k such that T s i → * f * and sup i Tr M s i < ∞. Let us define s i σ ∈ S(R n ) as T s i σ = T s i * G n σ . Since lim σ→0 R(s i σ ) = 0 (Corollary 1), there exists σ i > 0, such that R(s i σ ) < 1 i whenever 0 < σ ≤ σ i . Also, by definition Tr M s i = lim σ→0 Tr M s i σ . Therefore, there exists σ i > 0, such that Tr M s i σ < Tr M s i + 1 whenever 0 < σ ≤ σ i . If we set σ * i = min{σ i , σ i , 1 i }, then the sequence {s i σ * i } ⊆ S(R n ) satisfies: lim i→∞ R(s i σ * i ) = 0, Tr M s i σ * i < ∞, and (using lemma 1) T s i σ * i → * f * . Due to the continuity of I we have lim i→∞ I(s i σ * i ) = I(f * ). Now we set f i = s i σ * i , λ i = 1/ √ R(fi) and we obtain the needed sequence: lim i→∞ I(f i ) = lim i→∞ I(f i ) + λ i R(f i ) = I(f * ), lim i→∞ λ i = +∞, where Tr M fi is bounded. It remains to check that our sequence regularly solves (11), i.e. lim i→∞ inf f ∈S(R n ) I(f )+λ i R(f ) = I(f * ) (this will imply lim i→∞ I(f i )+λ i R(f i ) -inf f ∈S(R n ) I(f )+ λ i R(f ) = 0). The inequality in one direction is obvious: inf f ∈S(R n ) I(f ) + λ i R(f ) ≤ inf f ∈G k inf σ>0 I(f σ ) + λ i R(f σ ) ≤ inf f ∈G k lim σ→+0 I(f σ ) + λ i R(f σ ) = inf f ∈G k I(f ) = I(f * ). Let us prove the inverse inequality. Since rsol (I(f ), R(f )) = ∅, there exists { f̃i } ⊆ S(R n ) such that: I( f̃i ) + λ̃i R( f̃i ) ≤ inf f ∈S(R n ) I(f ) + λ̃i R(f ) + ε̃ i , lim i→+∞ λ̃i = +∞, lim i→+∞ ε̃ i = 0, Tr M f̃i < ∞ and a = lim i→+∞ T f̃i . From Theorem 8 we obtain a ∈ Arg min f ∈G k I(f ). One can always find a subset { λ̃di } ⊆ { λ̃i } such that λ̃di < λ i , λ̃di → ∞ and obtain: inf f ∈S(R n ) I(f ) + λ i R(f ) ≥ inf f ∈S(R n ) I(f ) + λ̃di R(f ) ≥ I( f̃di ) + λ̃di R( f̃di ) -ε̃ di ≥ I( f̃di ) -ε̃ di . Therefore, lim i→∞ inf f ∈S(R n ) I(f ) + λ i R(f ) ≥ lim i→∞ I( f̃di ) -ε̃ di = I(a) = inf f ∈G k I(f ) = I(f * ). This proves that {f i } regularly solves (11) and lim i→∞ T fi = f * , i.e.
f * ∈ rsol (I(f ), R(f )).

E PROOF OF THEOREM 5

Again we will prove a more general statement. Theorem 11. If M is a proper and a real-valued kernel, O(M ) is bounded and Tr M f < ∞, then S f ∈ B(L * 2 (R n ), R n ) and S f S † f = M f . Moreover, R(f ) = min S∈B(L * 2 (R n ),R n ),rank S≤k ||S f -S|| 2 * and the minimum is attained at S = P f S f where P f = k i=1 u i u † i and {u i } k 1 are unit eigenvectors of M f corresponding to the k largest eigenvalues (counting multiplicities). Proof. The boundedness of S f follows from the Cauchy-Schwarz inequality: |S f [φ] i | 2 = |Re O(M )[x i f ], φ | 2 ≤ O(M )[x i f ], O(M )[x i f ] φ, φ = x i f, O(M )[x i f ] φ, φ and therefore: ||S f [φ]|| 2 = n i=1 |S f [φ] i | 2 ≤ Tr M f ||φ|| 2 L2(R n ) I.e. we have checked that S f is bounded. By definition, S † f : R n → L r 2 (R n ) × L r 2 (R n ) and u, S f [φ 1 , φ 2 ] = S † f [u], [φ 1 , φ 2 ] , u ∈ R n , [φ 1 , φ 2 ] ∈ L r 2 (R n ) × L r 2 (R n ). Let us denote f 1 = Re f, f 2 = Im f . It is easy to see that the following operator satisfies the latter identity: O[u] = O(M )[f 1 (x)x T u], O(M )[f 2 (x)x T u] Since the adjoint is unique, then S † f = O. Let us calculate S f S † f : u S † f --→ O(M )[f 1 (x)x T u], O(M )[f 2 (x)x T u] S f --→   x 1 f 1 (x), O(M )[ O(M )[f 1 (x)x T u]] + x 1 f 2 (x), O(M )[ O(M )[f 2 (x)x T u]] • • • x n f 1 (x), O(M )[ O(M )[f 1 (x)x T u]] + x n f 2 (x), O(M )[ O(M )[f 2 (x)x T u]]   =   2 j=1 x 1 f j (x), O(M )[f j (x)x T u] • • • 2 j=1 x n f j (x), O(M )[f j (x)x T u]   = [Re x i f, M [x j f ] ] 1≤i,j≤n u = M f u Thus, S f S † f = M f . Since Tr S f S † f < ∞ and ||S † f [u]|| 2 ≤ u, M f u , we obtain that S † f is a bounded operator. Let u 1 , • • • , u n be orthonormal eigenvectors of M f = S f S † f and λ 1 ≥ • • • ≥ λ n > 0 be the corresponding nonzero eigenvalues. For σ i = √ λ i let us define v i = S † f [ui] σi .
Vector v i corresponds to a pair of functions: v i = 1 σ i O(M )[f 1 (x)x T u i ], O(M )[f 2 (x)x T u i ] ∈ L r 2 (R n ) × L r 2 (R n ) It is easy to see that v 1 , • • • , v n is an orthonormal basis in Im S † f , and S † f can be expanded in the following way: S † f = n i=1 σ i v i u † i and, therefore, the SVD for S f is: S f = n i=1 σ i u i v † i By the Eckart-Young-Mirsky theorem (see Theorem 4.4.7 from Hsing & Eubank (2015)), an optimal S in min S∈B(L * 2 (R n ),R n ),rank S≤k ||S f -S|| 2 * is defined by a truncation of the SVD for S f at the kth term, i.e.: S = k i=1 σ i u i v † i = P f S f where P f = k i=1 u i u † i is the projection operator onto the first k principal components of M f . Moreover, ||S f -P f S f || 2 * = n i=k+1 σ 2 i = ||M f || n-k = R(f ).

F THE ALTERNATING SCHEME IN THE DUAL SPACE FOR M (x, y) = ζ(x -y)

When M (x, y) = ζ(x -y), the alternating scheme 1 allows for a reformulation in the dual space. By this we mean that in Scheme 1 we substitute φ̂ t for the original φ t . If the primal Scheme 1 deals with the operators S φ , S φt-1 , the dual version deals with the vectors of functions ζ̂ ∂ φ̂/∂x, ζ̂ ∂ φ̂ t-1 /∂x. The substitution is based on the following simple fact: Theorem 12. If M (x, y) = ζ(x -y), ζ, ζ̂ ∈ C(R n ) and ∀x ζ̂(x) > 0, then there exist constants c 1 and c 2 such that ||S φ -P t-1 S φt-1 || 2 * = c 1 || || ∂ φ̂/∂x -P t-1 ∂ φ̂ t-1 /∂x || 2 || 2 L 2, ζ̂ (R n ) and x i f |M |x j f = c 2 ∂ f̂ /∂x i , ∂ f̂ /∂x j L 2, ζ̂ (R n ) Proof. Let f : R n → C be such that ||x i f || L2(R n ) < ∞. O(M )[ψ] = ζ * ψ ⇒ F {O(M )[ψ]} ∼ ζ̂ ψ̂ ⇒ S f [ψ] i = Re x i f, O(M )[ψ] ∼ Re F {x i f } , F O(M )[ψ] ∼ Re i ∂ f̂ /∂x i , ζ̂ ψ̂ = Re i ζ̂ ∂ f̂ /∂x i , ψ̂ Since S f [ψ] i = Re (S f ) i , ψ ∼ Re (Ŝ f ) i , ψ̂ , we obtain (Ŝ f ) i = κ ζ̂ ∂ f̂ /∂x i (17) where κ is a constant. Let us now introduce a vector of functions V f = [(Ŝ f ) 1 , • • • , (Ŝ f ) n ] T ∈ L n 2 (R n ).
Using 17 we obtain (Ŝ f ) i = κ ζ̂ ∂ f̂ /∂x i , and therefore: V f = κ ζ̂ ∂ f̂ /∂x Thus, the expression ||S φ -P t-1 S φt-1 || 2 * in the alternating scheme can be rewritten as: ||V φ -P t-1 V φt-1 || 2 L n 2 (R n ) ∼ ||κ ζ̂ ∂ φ̂/∂x -P t-1 κ ζ̂ ∂ φ̂ t-1 /∂x || 2 L n 2 (R n ) ∼ || || ∂ φ̂/∂x -P t-1 ∂ φ̂ t-1 /∂x || 2 || 2 L 2, ζ̂ (R n ) The matrix M f can also be calculated from f̂ using the following identity: x i f, M [x j f ] = x i f, ζ * (x j f ) ∼ ∂ f̂ /∂x i , ζ̂ ∂ f̂ /∂x j = ∂ f̂ /∂x i , ∂ f̂ /∂x j L 2, ζ̂ (R n ) Let us introduce a function Ĩ such that Ĩ( f̂ ) = I(f ). Then, we see that all steps in Scheme 1 can be performed with φ̂ t rather than with φ t , using Algorithm 2. Informally, the dual algorithm works as follows: at each iteration t we compute a function φ̂ t , adapting it to data (the term Ĩ( φ̂)) and adapting its gradient field to the rank-reduced gradient field of the previous φ̂ t-1 . For a sufficiently large T , it will converge and φ̂ T ≈ φ̂ T -1 . Then, the second term in the last step will be approximately equal to λ|| || ∂ φ̂ T /∂x -P T -1 ∂ φ̂ T /∂x || 2 || 2 L 2, ζ̂ (R n ) , enforcing ∂ φ̂ T /∂x ≈ P T -1 ∂ φ̂ T /∂x for random x ∼ ζ̂/|| ζ̂|| L 1 . Thus, the gradients ∂ φ̂ T /∂x lie in a k-dimensional subspace col P T -1 . This last property is a characteristic property of functions from F k . Algorithm 2 The alternating scheme in the dual space. P 0 ←-0, φ̂ 0 ←-0 for t = 1, • • • , T do φ̂ t ← arg min φ̂ Ĩ( φ̂) + λ|| || ∂ φ̂/∂x -P t-1 ∂ φ̂ t-1 /∂x || 2 || 2 L 2, ζ̂ (R n ) Calculate M t = Re ∂ φ̂ t /∂x i , ∂ φ̂ t /∂x j L 2, ζ̂ (R n ) Find {v i } n 1 s.t. M t v i = λ i v i , λ 1 ≥ • • • ≥ λ n P t ←- k i=1 v i v T i Output: v 1 , • • • , v k

G A NUMERICAL ALTERNATING SCHEME FOR MMD

G.1 STRUCTURE OF F[P k ]

From theorems 1 and 2, F[P k ] ⊆ F k * . In fact, a famous theorem of Bochner (1932) gives us that the Fourier transform of any positive finite Borel measure is a continuous positive definite function.
That is, if f ∈ F[P], then for any distinct y 1 , • • • , y s ∈ R n the matrix [f (y i -y j )] i,j=1,s is positive semidefinite. Since µ(R n ) = 1, we additionally have f (0) = 1. Let PDF denote the set of all continuous positive definite functions on R n and let M k = {f ∈ PDF|∃v 1 , ..., v k ∈ R n , g : R k → C s.t. f (x) = g(v T 1 x, ..., v T k x), f (0) = 1}. Using the isometry property of the Fourier transform for L 2 (R n ) and the convolution theorem, we see that: d MMD (µ, ν) = ||k * µ -k * ν|| L2(R n ) ∝ ||γ(x)(F[µ](x) -F[ν](x))|| L2(R n ) Thus, from Theorem 13 we obtain that the task 6 is equivalent to: ||p data -q|| L 2,γ 2 (R n ) → min q∈M k (19)

G.3 ALGORITHMS FOR MMD

Let Π k : G k → {1, +∞} and M k : F k → {1, +∞} be simple penalty functions: Π k (φ) = 1, if φ ∈ P k and Π k (φ) = ∞, otherwise; M k ( φ̂) = 1, if φ̂ ∈ M k and M k ( φ̂) = ∞, otherwise. Then, the task 6 is equivalent to: I(φ) = d 2 MMD (µ data , φ)Π k (φ) → inf φ∈G k From the result of the previous section we see that if I(φ) = Ĩ( φ̂), then: Ĩ( φ̂) = ||p data -φ̂|| 2 L 2,γ 2 (R n ) M k ( φ̂) Thus, Algorithm 3 is an adaptation of Algorithm 2 to MMD. Algorithm 3 The alternating scheme in the dual space for MMD P 0 ←-0, q 0 ←-0 for t = 1, • • • , T do 1 q t ←-arg min q∈M k R n γ(x) 2 |p data (x) -q(x)| 2 dx + λ R n ζ̂(x)|| ∂q/∂x -P t-1 ∂q t-1 /∂x || 2 2 dx 2 Calculate M t = ∂q t /∂x i , ∂q t /∂x j L 2, ζ̂ (R n ) 3 Find {v i } n 1 s.t. M t v i = λ i v i , λ 1 ≥ • • • ≥ λ n 4 P t ←- k i=1 v i v T i Output: L = span(v 1 , • • • , v k ) If the function p data is real-valued, then only real-valued functions can appear in Algorithm 3. This assumption can be satisfied by adding reflections of the initial points to the dataset (after it was centered). At step 1, we search over q given in the following parameterized form: q θ (x) = nn i=1 α i cos(ω T i x) (20) where α i > 0 and nn i=1 α i = 1. In our implementation, we set [α i ] i=1,nn = softmax([u i ] i=1,nn ) and the u i 's are unconstrained.
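The parameterization just introduced can be sketched directly in numpy. The names (q_theta, omega, nn) follow the text, while the concrete sizes are arbitrary; the sketch checks the two structural properties of q θ (q θ (0) = 1 and |q θ | ≤ 1) and that a Monte Carlo matrix of gradient outer products, of the kind estimated later in the section, is PSD:

```python
import numpy as np

rng = np.random.default_rng(2)

# q_theta(x) = sum_i alpha_i cos(omega_i . x), alpha = softmax(u):
# a convex combination of positive definite functions, hence positive
# definite with q(0) = 1 and |q(x)| <= 1.  Sizes n and nn are arbitrary.
n, nn = 4, 10
omega = rng.normal(size=(nn, n))
u = rng.normal(size=nn)
alpha = np.exp(u - u.max())
alpha /= alpha.sum()                      # numerically stable softmax

def q_theta(x):
    return float(alpha @ np.cos(omega @ x))

def grad_q(x):
    # d/dx sum_i alpha_i cos(omega_i . x) = -sum_i alpha_i sin(omega_i . x) omega_i
    return -(alpha * np.sin(omega @ x)) @ omega

q0 = q_theta(np.zeros(n))
vals = np.array([q_theta(rng.normal(size=n)) for _ in range(200)])

# Monte Carlo matrix of gradient outer products (PSD by construction).
chis = rng.normal(size=(500, n))
grads = np.stack([grad_q(c) for c in chis])
M_t = grads.T @ grads / len(chis)
min_eig = float(np.linalg.eigvalsh(M_t).min())
```

Since every gradient is a combination of the nn frequency vectors ω i, the matrix M_t also has rank at most nn, mirroring how a k-frequency q forces rank M ≤ k.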
The number of neurons in the single-layer neural network with a cosine activation function, nn, is a hyperparameter. Let us denote the parameters {ω i , u i } nn i=1 by θ. It is easy to see that the function q θ is positive definite. Moreover, using Theorem 2 from Barron (1993), it can be shown that the set of all such functions, i.e. the convex hull of {cos(ω T x)|ω ∈ R n }, is dense in the set of real-valued functions from M k . Though this parameterization is quite natural, finding architectures with more expressive power in the space of real-valued positive definite functions is an open problem. Now, to minimize Ψ(θ) = R n γ(x) 2 |p data (x) -q θ (x)| 2 dx + λ R n ζ̂(x)|| ∂q θ /∂x -P t-1 ∂q θt-1 /∂x || 2 2 dx with stochastic gradient descent methods (in our case, the Adam optimizer) we need to have an unbiased estimator of ∇ θ Ψ(θ) ∝ E z∼γ 2 ∇ θ |p data (z) -q θ (z)| 2 + λE z ∼ ζ̂ ∇ θ || ∂q θ /∂x (z ) -P t-1 ∂q θt-1 /∂x (z )|| 2 2 where z ∼ f denotes that the random vector z is sampled according to the probability density function f (x)/ R n f (x)dx. Thus, a natural estimator of the gradient is: 1 m m i=1 ∇ θ |p data (z i ) -q θ (z i )| 2 + λ m m i=1 ∇ θ || ∂q θ (ξ i )/∂x -P t-1 ∂q θt-1 (ξ i )/∂x || 2 2 where {z i } m i=1 ∼ iid γ 2 and {ξ i } m i=1 ∼ iid ζ̂. The last important issue with the practical numerical algorithm is the calculation of M t at step 2. It is easy to see that: M t = E χ∼ ζ̂ ∂q t /∂x (χ) ∂q t /∂x (χ) T In practice we sample χ 1 , • • • , χ l ∼ ζ̂ and estimate M t as follows: M t ≈ 1 l l i=1 ∂q t /∂x (χ i ) ∂q t /∂x (χ i ) T The details of the numerical algorithm are given below in Algorithm 4. In all our experiments with MMD we set ζ̂ = γ 2 . Algorithm 4 The numerical algorithm for MMD. Hyperparameters: λ, h, σ, m, l, α, β 1 , β 2 , nn.
P 0 ←-0, θ 0 ←-0 for t = 1, • • • , T do while θ has not converged do Sample {z i } m i=1 ∼ iid γ 2 Sample {ξ i } m i=1 ∼ iid ζ̂ L ←-1 m m i=1 |p data (z i ) -q θ (z i )| 2 + λ m m i=1 || ∂q θ (ξ i )/∂x -P t-1 ∂q θ t-1 (ξ i )/∂x || 2 2 θ ←-Adam(∇ θ L, θ, α, β 1 , β 2 ) θ t ←-θ Sample {χ i } l i=1 ∼ iid ζ̂ Calculate M t = 1 l l i=1 ∂q θ t (χ i )/∂x ∂q θ t (χ i )/∂x T Find {v i } n 1 s.t. M t v i = λ i v i , λ 1 ≥ • • • ≥ λ n P t ←- k i=1 v i v T i Output: v 1 , • • • , v k

H A NUMERICAL ALTERNATING SCHEME FOR HM

H.1 THE DUAL FORM OF HM

Due to a well-known relationship between the moments of a probability measure µ and its characteristic function p, i.e. i s m i1 •••is = ∂ s p(0)/∂x i1 • • • ∂x is , the task 7 is equivalent to: 4 s=1 λ s n s 1≤i1 ,••• ,is≤n | ∂ s p data (0)/∂x i1 • • • ∂x is - ∂ s q(0)/∂x i1 • • • ∂x is | 2 → min q∈M k (21) Note that the maximum mean discrepancy distance and the distance based on higher moments are substantially different. Indeed, even if we set h to a large value (which makes 1/h ≈ 0), the MMD distance, unlike the HM distance, neglects higher-order derivatives of the characteristic functions in the neighbourhood of the origin. Moreover, from the dual form 21 it is clear that d HM (µ data , ν) is a degenerate case of a weighted Sobolev norm between the characteristic functions of µ data and ν.
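The moment/characteristic-function identity above can be illustrated numerically in one dimension: for the empirical characteristic function p(t) = (1/N ) Σ j e itx j , the first derivative at 0 equals i times the sample mean. A small sketch with a synthetic sample:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D illustration of the moment identity: for the empirical
# characteristic function p(t) = (1/N) sum_j exp(i t x_j), the first
# derivative at 0 equals i * m_1 (i times the sample mean).
x = rng.normal(loc=1.5, scale=0.7, size=50_000)

def char_fn(t):
    return np.exp(1j * t * x).mean()

h = 1e-4
deriv = (char_fn(h) - char_fn(-h)) / (2 * h)   # central finite difference
err = float(abs(deriv - 1j * x.mean()))
```

The identity holds exactly for the empirical measure, so the only discrepancy is the O(h^2) finite-difference error.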

H.2 ALGORITHMS FOR HM

Analogously to the case of MMD we see that the task 7 is equivalent to: I(φ) = d HM (µ data , φ) 2 Π k (φ) → inf φ∈G k and Ĩ( φ) = 4 s=1 λ s n s 1≤i1,••• ,is≤n | ∂ s p data (0) ∂x i1 • • • ∂x is - ∂ s φ(0) ∂x i1 • • • ∂x is | 2 M k ( φ) Thus, the Algorithm 5 is an adaptation of Algorithm 2 to HM. Algorithm 5 The alternating scheme in the dual space for HM P 0 ←-0, q 0 ←-0 for t = 1, • • • , T do 1 q t ←-arg min q∈M k 4 s=1 λs n s 1≤i1,••• ,is≤n | ∂ s p data (0) ∂xi 1 •••∂xi s -∂ s q(0) ∂xi 1 •••∂xi s | 2 + λ R n ζ(x)|| ∂q ∂x - P t-1 ∂qt-1 ∂x || 2 2 dx 2 Calculate M t = ∂qt ∂xi , ∂qt ∂xj L 2, ζ (R n ) 3 Find {v i } n 1 s.t. M t v i = λ i v i , λ 1 ≥ • • • ≥ λ n 4 P t ←- k i=1 v i v T i Output: L = span(v 1 , • • • , v k ) Again, as in a numerical algorithm for MMD, at step 1, we search over q given in the form 20. The objective of step 1 can be represented as: Φ(θ) = 4 s=1 λ s E i1,••• ,is∼ iid U (1,n) | ∂ s (p data -q θ )(0) ∂x i1 • • • ∂x is | 2 + λE z ∼ ζ || ∂q θ ∂x (z ) -P t-1 ∂q θt-1 ∂x (z )|| 2 2 where U(1, n) is the discrete uniform distribution over {1, • • • , n}. To apply the stochastic gradient descent methods we need to have an unbiased estimator of ∇ θ Φ(θ) which is equal to: 4 s=1 λ s E i1,••• ,is∼ iid U (1,n) ∇ θ | ∂ s (p data -q θ )(0) ∂x i1 • • • ∂x is | 2 + λE z ∼ ζ ∇ θ || ∂q θ ∂x (z ) -P t-1 ∂q θt-1 ∂x (z )|| 2 2 Thus, a natural estimator of the gradient is: 4 s=1 λ s m 1 m1 i=1 ∇ θ | ∂ s (p data -q θ )(0) ∂x a[s,i,1] ∂x a[s,i,2] • • • ∂x a[s,i,s] | 2 + λ m 2 m2 i=1 ∇ θ || ∂q θ (ξ i )) ∂x -P t-1 ∂q θt-1 (ξ i )) ∂x || 2 Algorithm 6 The numerical algorithm for HM. Hyperparameters: λ, {λ s } s=1,4 , m 1 , m 2 , l, α, β 1 , β 2 , nn. 
P_0 ← 0, θ_0 ← 0
for t = 1, …, T do
    while θ has not converged do
        Sample {a[s,i,j]}_{s=1..4, i=1..m_1, j=1..s} ∼_iid U(1,n)
        Sample {ξ_i}_{i=1}^{m_2} ∼_iid ζ̃
        L ← Σ_{s=1}^4 (λ_s/m_1) Σ_{i=1}^{m_1} |∂^s(p̂_data − q̂_θ)(0)/(∂x_{a[s,i,1]} ∂x_{a[s,i,2]} ⋯ ∂x_{a[s,i,s]})|² + (λ/m_2) Σ_{i=1}^{m_2} ‖∂q̂_θ(ξ_i)/∂x − P_{t−1} ∂q̂_{θ_{t−1}}(ξ_i)/∂x‖²₂
        θ ← Adam(∇_θ L, θ, α, β_1, β_2)
    θ_t ← θ
    Sample {χ_i}_{i=1}^l ∼_iid ζ̃
    Calculate M_t = (1/l) Σ_{i=1}^l (∂q̂_{θ_t}(χ_i)/∂x)(∂q̂_{θ_t}(χ_i)/∂x)^T
    Find {v_i}_{i=1}^n s.t. M_t v_i = λ_i v_i, λ_1 ≥ ⋯ ≥ λ_n
    P_t ← Σ_{i=1}^k v_i v_i^T
Output: v_1, …, v_k

I A NUMERICAL ALTERNATING SCHEME FOR WD

By Theorem 6, task 13 is equivalent to min_{μ∈P_k} W(μ, μ_data), or to the following task:

I(φ) → inf_{φ∈G_k}

where I(T_μ) = W(μ, μ_data) if μ ∈ P_k, and I(φ) = ∞ otherwise. The alternating scheme 1 is designed to solve the penalty form of the problem, i.e.

I(φ) + λR(φ) → min_{φ∈S(R^n)}

which is equivalent to

W(φ, μ_data) + λR(φ) → min_{φ∈S_p(R^n)}

where S_p(R^n) ⊆ S(R^n) is the set of Schwartz functions that can serve as pdfs: φ(x) ≥ 0, ∫_{R^n} φ(x) dx = 1. A numerical version of the alternating scheme requires additional specifications of: a) how to minimize over φ at step 1, and b) how to estimate M_{φ_t}.
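Stepping back to the index sampling used in Algorithm 6, the Monte Carlo estimator over i_1, …, i_s ∼ U(1,n) can be checked numerically. The array `D2` below is a hypothetical stand-in for the squared second-order derivative mismatches |∂²(p̂_data − q̂)(0)/∂x_{i_1}∂x_{i_2}|², and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m1, lam2 = 8, 200000, 0.5

# Hypothetical squared mismatches of second-order mixed derivatives at 0,
# one entry per index pair (i1, i2).
D2 = rng.uniform(size=(n, n))

# Exact s=2 term of the HM objective: (lambda_2 / n^2) * sum over all pairs.
exact = lam2 / n**2 * D2.sum()

# Unbiased estimate: lambda_2 times the average of D2 at m1 uniformly
# sampled index pairs, since E_{i1,i2 ~ U(1,n)} D2[i1,i2] = (1/n^2) sum D2.
i1 = rng.integers(0, n, size=m1)
i2 = rng.integers(0, n, size=m1)
estimate = lam2 * D2[i1, i2].mean()
```

With a large sample the estimate matches the full double sum closely, which is why Algorithm 6 can avoid enumerating all n^s index tuples at every step.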

I.1 HOW TO MINIMIZE OVER φ?

In the case of WD, the minimization step of the alternating scheme is the following:

φ_t ← arg min_{φ∈S_p(R^n)} W(φ, μ_data) + λ‖S_φ − P_{t−1} S_{φ_{t−1}}‖²   (22)

where S_f = √O(M)[x f(x)]. For a numerical implementation of that step we need to choose some family of functions that is dense in S_p(R^n) (or rich enough to approach the solution μ*). Following the tradition of GAN research, let us assume that the family is given in the following form:

H = {φ_θ | φ_θ(x) is the pdf of the random vector g_θ(z), z ∼ p(z), θ ∈ Θ}

where {g_θ | θ ∈ Θ} is a parameterized family of smooth functions (usually a neural network) and p(z) is some fixed distribution (usually the gaussian distribution). In a numerical algorithm we need access to a procedure that samples according to φ_θ(x), not the function itself. Following Arjovsky et al. (2017), we make Assumption 1.

Assumption 1. ‖g_{θ′}(z′) − g_θ(z)‖ ≤ L(θ, z)(‖θ′ − θ‖ + ‖z′ − z‖), where E_{z∼p(z)} L(θ, z) < +∞.

Thus, instead of solving 22 we solve:

φ_t ← arg min_{φ∈H} W(φ, μ_data) + λ‖S_φ − P_{t−1} S_{φ_{t−1}}‖²   (23)

taking into account that φ_{t−1} ∈ H. The Kantorovich-Rubinstein duality theorem gives us:

W(φ_θ, μ_data) = max_{f: ‖f_x‖≤1} E_{x∼μ_data}[f(x)] − E_{z∼p(z)}[f(g_θ(z))]

which turns 23 into the following minimax task:

φ_t ← arg min_{φ∈H} max_{f: ‖f_x‖≤1} E_{x∼μ_data}[f(x)] − E_{z∼p(z)}[f(g_θ(z))] + λ‖S_φ − P_{t−1} S_{φ_{t−1}}‖²   (24)

In practice, we choose a family of functions L = {f_w | w ∈ W} and the internal maximization is made over w ∈ W, with an additional penalty term that penalizes violations of the Lipschitz condition ∀x: ‖f_x‖ ≤ 1. A family of minimax algorithms for the minimization of W(φ_θ, μ_emp) was developed in a series of papers: Arjovsky et al. (2017); Gulrajani et al. (2017); Wei et al. (2018). The standard minimax scheme that gained popularity in the GAN literature iterates two steps: a) n_iter times, make a gradient ascent step over w ∈ W; b) make a gradient descent step over θ.
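The pushforward family H above can be sketched as follows: samples from φ_θ are obtained by passing gaussian noise through a parameterized smooth map. The tiny two-layer network and its sizes are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_g(n_z, n_x, hidden=32, rng=rng):
    """A tiny two-layer smooth map g_theta: R^{n_z} -> R^{n_x}."""
    W1 = rng.normal(scale=0.5, size=(hidden, n_z))
    W2 = rng.normal(scale=0.5, size=(n_x, hidden))
    def g(z):
        # tanh is smooth and 1-Lipschitz, so g is Lipschitz in z,
        # in line with Assumption 1.
        return np.tanh(z @ W1.T) @ W2.T
    return g

g = make_g(n_z=4, n_x=3)
z = rng.normal(size=(1000, 4))   # z ~ p(z), the standard gaussian
x = g(z)                         # x is distributed according to phi_theta
```

In a numerical algorithm only this sampling procedure is needed, not the density φ_θ itself.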
The task 24 can be viewed as a Wasserstein GAN with an additional regularization term λT(θ), where T(θ) = ‖S_{φ_θ} − P_{t−1} S_{φ_{θ_{t−1}}}‖². To adapt these algorithms to the minimization of our function, we only need an unbiased estimator of the gradient ∂T/∂θ. This estimator is needed for the generator to make its gradient descent step. The discriminator's part of the algorithm (in which we maximize over Lipschitz functions f_w) can be set in a standard fashion: we choose the variant of Petzka et al. (2018). The remaining steps of Algorithm 7 (Ξ is defined in equation 25) are:

        θ ← Adam(∇_θ L, θ, α, β_1, β_2)
    θ_t ← θ
    Realization of step (**):
    Sample {z_i}_{i=1}^l, {z′_i}_{i=1}^l ∼ p(z)
    M_t ← (1/l²) Σ_{i,j} g_{θ_t}(z_i) g_{θ_t}(z′_j)^T M(g_{θ_t}(z_i), g_{θ_t}(z′_j))
    Find {v_i}_{i=1}^n s.t. M_t v_i = λ_i v_i, λ_1 ≥ ⋯ ≥ λ_n
    P_t ← Σ_{i=1}^k v_i v_i^T
Output: v_1, …, v_k

I.2 HOW TO ESTIMATE ∂T/∂θ AND M_{φ_{θ_t}}?

Another important aspect of the numerical algorithm is the complexity of estimating the matrix M_{φ_{θ_t}} at step (**). The following theorem shows that we only need to sample z ∼ p a sufficient number of times to estimate ∂T/∂θ and M_{φ_{θ_t}}.

Theorem 14. If φ_θ is the pdf of the random vector g_θ(z), z ∼ p(z), then

∂T/∂θ = E_{z,z′∼p} ∂Ξ(θ, z, z′)/∂θ

M_{φ_θ} = E_{z,z′∼p} g_θ(z) g_θ(z′)^T M(g_θ(z), g_θ(z′))

where

Ξ(θ, z, z′) = (g_θ(z) • g_θ(z′)) M(g_θ(z), g_θ(z′)) − 2(g_θ(z) • P_{t−1} g_{θ_{t−1}}(z′)) M(g_θ(z), g_{θ_{t−1}}(z′))   (25)

and the RHS is well-defined.
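Theorem 14 licenses a plain Monte Carlo estimate of M_{φ_θ}: sample independent pairs z, z′ ∼ p and average g_θ(z) g_θ(z′)^T M(g_θ(z), g_θ(z′)). A sketch with a linear stand-in for g_θ and a gaussian kernel for M, both of which are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, l = 3, 20000

A = rng.normal(size=(n, n))        # toy linear "generator": g(z) = A z
kernel = lambda x, y: np.exp(-np.sum((x - y) ** 2, axis=-1) / n)

# Independent pairs z, z' ~ p(z) pushed through the generator.
z, z2 = rng.normal(size=(l, n)), rng.normal(size=(l, n))
x, y = z @ A.T, z2 @ A.T

# M_phi ~= (1/l) sum_i x_i y_i^T M(x_i, y_i), vectorized over the l pairs.
w = kernel(x, y)                    # (l,) kernel weights
M_phi = (x * w[:, None]).T @ y / l  # (n, n) Monte Carlo estimate
```

Only forward samples of the generator are required, never the density φ_θ itself, which is what makes step (**) tractable.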

I.2.1 DEFINITION OF H

Specifically, for robust PCA/outlier pursuit applications, we define φ_θ(x) as the probability density function of the random vector a + b, where a, b are independent, a is the i-th column of the matrix θ_1 ∈ R^{n×N} (where i ∼ U(1, N) is sampled uniformly from {1, …, N}), b = g_{θ_2}(c), c ∼ N(0, I_n), and g_{θ_2}: R^n → R^n is a neural network with weights θ_2. Thus, θ = (θ_1, θ_2). It can be checked that H, defined in this way, satisfies Assumption 1. We specifically introduce the random vector a here because, according to Theorem 6, the ultimate solution of the problem corresponds to θ_1 = Y and b = 0. This guarantees that the solution is approachable from within H.
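The a + b construction can be sketched directly: draw a uniformly random column of θ_1 and add a small learned perturbation. The placeholder map standing in for the network g_{θ_2} and all sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 5, 50

theta1 = rng.normal(size=(n, N))        # columns play the role of data points
g_theta2 = lambda c: 0.01 * np.tanh(c)  # placeholder for the network b = g_{theta2}(c)

def sample_phi(m):
    """Draw m samples of a + b: a random column of theta1 plus a perturbation."""
    i = rng.integers(0, N, size=m)          # i ~ U(1, N)
    a = theta1[:, i].T                      # (m, n) selected columns
    b = g_theta2(rng.normal(size=(m, n)))   # c ~ N(0, I_n)
    return a + b

x = sample_phi(1000)
```

When the perturbation network shrinks to b = 0, the sampler reproduces the empirical distribution of the columns of θ_1, which is why the ultimate solution θ_1 = Y is reachable.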

I.3 PROOF OF THEOREM 14

We need the following lemma (Lemma 5).

Proof of Theorem 14. Using Lemma 5 we have:

T(θ) = E_{x,y∼φ_θ}(x • y)M(x, y) + E_{x,y∼φ_{θ_{t−1}}}(x • P_{t−1} y)M(x, y) − 2E_{x∼φ_θ, y∼φ_{θ_{t−1}}}(x • P_{t−1} y)M(x, y)
     = E_{z,z′∼p}(g_θ(z) • g_θ(z′))M(g_θ(z), g_θ(z′)) + E_{z,z′∼p}(g_{θ_{t−1}}(z) • P_{t−1} g_{θ_{t−1}}(z′))M(g_{θ_{t−1}}(z), g_{θ_{t−1}}(z′)) − 2E_{z,z′∼p}(g_θ(z) • P_{t−1} g_{θ_{t−1}}(z′))M(g_θ(z), g_{θ_{t−1}}(z′))

The second term does not depend on θ. Therefore, ∂T/∂θ = (∂/∂θ) E_{z,z′∼p} Ξ(θ, z, z′), where

Ξ(θ, z, z′) = (g_θ(z) • g_θ(z′))M(g_θ(z), g_θ(z′)) − 2(g_θ(z) • P_{t−1} g_{θ_{t−1}}(z′))M(g_θ(z), g_{θ_{t−1}}(z′))

If E_{z,z′∼p} ∂Ξ(θ,z,z′)/∂θ is well-defined (the proof of the sufficiency of that condition is similar to the proof of Theorem 3 from Arjovsky et al. (2017)), then, using the Leibniz integral rule, we obtain:

(∂/∂θ) E_{z,z′∼p} Ξ(θ, z, z′) = E_{z,z′∼p} ∂Ξ(θ, z, z′)/∂θ

The fact that M_{φ_θ} = E_{z,z′∼p} g_θ(z) g_θ(z′)^T M(g_θ(z), g_θ(z′)) is immediate from the definition M_{φ_θ} = E_{x,y∼φ_θ} x y^T M(x, y).

J A NUMERICAL ALTERNATING SCHEME FOR SDR

For the binary classification case, given a labeled dataset {(x_i, y_i)}_{i=1}^N, x_i ∈ R^n, y_i ∈ C, C = {0, 1}, we formulate the sufficient dimension reduction problem as the minimization task:

J(f) = E_{(z,c)∼μ_data, ε∼N(0,υ²I_n)} L(c, f(z + ε)) → min_{f∈F_k}

where L(c, y) = −c log(y) − (1 − c) log(1 − y). We apply the alternating scheme in the dual space (Algorithm 2) to this task. We set M(x, y) = ζ(x − y), where ζ is a strictly positive probability density function. A numerical version of the scheme is given below (Algorithm 8).

At every iteration t = 1, …, T of Algorithm 2 we solve the task (in our case Ĩ = J):

φ̂_t ← arg min_{φ̂} Ĩ(φ̂) + λ‖ ‖∂φ̂/∂x − P_{t−1} ∂φ̂_{t−1}/∂x‖ ‖²_{L_{2,ζ̃}(R^n)}

In the numerical version of the algorithm we assume that φ̂ is given as a neural network f_θ. That is why ∇_θ L (given to the Adam optimizer in the gradient descent loop) in Algorithm 8 is an unbiased estimator of ∂Φ(θ)/∂θ. Thus, in the "while loop" we find the optimal φ̂_t = f_{θ_t}.

According to Algorithm 2, the next goal is to estimate M_t = Re⟨∂φ̂_t/∂x_i, ∂φ̂_t/∂x_j⟩_{L_{2,ζ̃}(R^n)}. It is easy to see that

M_t = E_{χ∼ζ̃} (∂φ̂_t/∂x(χ))(∂φ̂_t/∂x(χ))^T = E_{χ∼ζ̃} (∂f_{θ_t}/∂x(χ))(∂f_{θ_t}/∂x(χ))^T

From the last expression we see that the matrix M_t can be estimated by sampling χ ∼ ζ̃ a sufficient number of times (the parameter l in our algorithm). All the rest is identical to Algorithm 2. The regression version of the algorithm can be obtained by setting L(c, c′) = (c − c′)². Implementations for different databases can be found at github.
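The noise-smoothed cross-entropy objective J(f) can be estimated by sampling ε ∼ N(0, υ²I_n) afresh for each data point. The logistic predictor below is an illustrative stand-in for the network f_θ, and all sizes are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, N, upsilon = 4, 200, 1.0

z = rng.normal(size=(N, n))                   # inputs
c = rng.integers(0, 2, size=N).astype(float)  # binary labels
w = rng.normal(size=n)                        # toy predictor f(x) = sigmoid(w.x)
f = lambda x: 1.0 / (1.0 + np.exp(-(x @ w)))

def J_estimate(m):
    """Monte Carlo estimate of E_{eps ~ N(0, upsilon^2 I)} L(c, f(z + eps))
    with L the binary cross-entropy loss."""
    total = 0.0
    for _ in range(m):
        eps = upsilon * rng.normal(size=(N, n))
        y = np.clip(f(z + eps), 1e-12, 1 - 1e-12)  # guard the logs
        total += -np.mean(c * np.log(y) + (1 - c) * np.log(1 - y))
    return total / m

J = J_estimate(50)
```

The regression version is obtained by swapping the cross-entropy line for the squared loss (c − f(z + ε))².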



(1/√(2πσ²))^n e^{−|x|²/(2σ²)}. Besides the gaussian kernel, our theory also captures many other kernels. Since the role of the parameter σ is similar to that of the bandwidth in kernel density estimation, we use Silverman's rule of thumb to set σ = N^{−1/(n+4)}.

where {a[s,i,j]}_{s=1..4, i=1..m_1, j=1..s} ∼_iid U(1,n) and {ξ_i}_{i=1}^{m_2} ∼_iid ζ̃. Overall, we obtain Algorithm 6.

If H ⊆ S(R^n) is not satisfied, then we can choose H = {φ_θ * G^n_ε | θ ∈ Θ} for a very small ε.
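The bandwidth rule quoted above is a one-liner depending only on the sample size N and the ambient dimension n:

```python
def silverman_sigma(N, n):
    """Rule-of-thumb bandwidth sigma = N^{-1/(n+4)} used to set the kernel width."""
    return N ** (-1.0 / (n + 4))

sigma = silverman_sigma(N=10000, n=2)  # e.g. 10000 points in the plane
```

As expected of a bandwidth, σ shrinks as N grows, and shrinks more slowly in higher dimension.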



(a) Visualization of the outputs of the PCA and MMD methods. MMD (green line) tends to select a subcollection of points that aligns sharply along the main direction, whereas the first principal component (red line) can result from averaging over different directions in the data. Left plot: ‖P_t − P‖_F for δ ∈ {0.05, 0.1} and λ ∈ {20.0, 100.0}, cases I and II. Right plot: ‖P* − P‖_F as a function of ln σ for MMD, HM, and WD.

Optimization over G_k, like optimization over S(n, k), is typically hard: for a final point, at best one can guarantee that it is a local extremum. Promising aspects of G_k are: a) G_k allows one to formulate naturally a new class of objectives on it; b) local extrema on G_k differ substantially from local extrema on S(n, k), because a local search over G_k uses more degrees of freedom.

min_{rank(Y)≤k} ‖X − Y‖. This completes the proof.

Note that the case of the L¹ norm ‖x‖ = Σ_i |x_i| in task 13 corresponds to the well-studied robust PCA problem (Candès et al., 2011). If, instead of the L¹-norm, we use the L²-norm, this leads to another task:

‖X − L‖_{1,2} → min_{rank(L)≤k}   (14)

where ‖S‖_{1,2} = Σ_j √(Σ_i s²_ij), which is known as the outlier pursuit problem (Xu et al., 2010).

C PROPER KERNELS AND PROOF OF THEOREM 4

C.1 PROPER KERNELS
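The ‖·‖_{1,2} norm from the outlier-pursuit objective (Euclidean norm of each column, then summed over columns) is a one-liner in NumPy:

```python
import numpy as np

def norm_1_2(S):
    """||S||_{1,2} = sum_j sqrt(sum_i S_ij^2): L2 norm of each column, summed."""
    return np.linalg.norm(S, axis=0).sum()

S = np.array([[3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0]])
val = norm_1_2(S)  # columns have norms 5, 0, 1
```

Like the L¹-norm in robust PCA, this norm promotes column-wise sparsity of the residual X − L, which matches the outlier-pursuit interpretation of whole corrupted columns.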

n×n be matrices that comprise the first k rows of U, V respectively, with n − k zero rows below. Also, let L denote a Lipschitz constant for M, such that |M(x, y) − M(x′, y′)| ≤ L(|x − x′| + |y − y′|).

(18)

Thus, the following characterization of F[P_k] becomes evident.

Theorem 13. F[P_k] = M_k.

G.2 THE DUAL FORM OF MMD

Let us define another gaussian kernel γ_2(x) = e^{−h²|x|²/2} = F[k]. Let p̂_data(x) denote the characteristic function of the random vector X_data ∼ μ_data. By definition, p̂_data(x) = E[e^{i X_data^T x}] = (1/N) Σ_{i=1}^N e^{i x_i^T x}. Thus, p̂_data ∝ F^{−1}[μ_data] and μ_data ∝ F[p̂_data].
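The empirical characteristic function p̂_data(t) = (1/N) Σ_i e^{i x_i^T t} is cheap to evaluate directly; the data below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))  # data points x_i in R^3

def char_fn(t):
    """p_hat(t) = (1/N) sum_i exp(i <x_i, t>), the empirical characteristic function."""
    return np.exp(1j * X @ t).mean()

p0 = char_fn(np.zeros(3))
```

By construction p̂_data(0) = 1 and |p̂_data(t)| ≤ 1 everywhere, the properties the moment identities in H.1 rely on.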

In Petzka et al. (2018)'s version, the term max{0, ‖∂f_w/∂x(ξx + (1 − ξ)g_θ(z))‖ − 1}² enforces the Lipschitz condition (see step (*) of Algorithm 7).

Algorithm 7 Numerical algorithm for WD. We use M(x, y) = e^{−‖x−y‖²/n} and default values λ = 10, Λ = 100, n_critic = 5, m = 40, l = 10000n, α = 0.00001, β_1 = 0.5, β_2 = 0.9.
P_0 ← 0, θ_0 ← 0
for t = 1, …, T do
    Minimax realization of min_θ W(φ_θ, μ_emp) + λT(θ) (*):
    while θ has not converged do
        for s = 1, …, n_critic do
            Discriminator update of w
        Sample {z_i}_{i=1}^m, {z′_i}_{i=1}^m ∼ p(z)
        L ← −(1/m) Σ_{i=1}^m f_w(g_θ(z_i)) + λ Σ_{i,j} Ξ(θ, z_i, z′_j)/m²

Lemma 5.

‖S_φ − P S_ψ‖² = E_{x,y∼φ}(x • y)M(x, y) + E_{x,y∼ψ}(x • P y)M(x, y) − 2E_{x∼φ, y∼ψ}(x • P y)M(x, y)

Proof of lemma.

‖S_φ − P S_ψ‖² = ‖√O(M)[xφ(x)] − P √O(M)[xψ(x)]‖²
= ‖√O(M)[xφ(x) − P xψ(x)]‖²
= Σ_{i=1}^n ‖√O(M)[x_i φ(x) − (P x)_i ψ(x)]‖²
= Σ_{i=1}^n ⟨x_i φ(x) | O(M)[x_i φ(x)]⟩ + ⟨(P x)_i ψ(x) | O(M)[(P x)_i ψ(x)]⟩ − 2⟨(P x)_i ψ(x) | O(M)[x_i φ(x)]⟩
= E_{x,y∼φ}(x • y)M(x, y) + E_{x,y∼ψ}(x • P y)M(x, y) − 2E_{x∼φ, y∼ψ}(x • P y)M(x, y)

J(f) = E_{(z,c)∼μ_data, ε∼N(0,υ²I_n)} L(c, f(z + ε)) → min_{f∈F_k}, where L(c, y) = −c log(y) − (1 − c) log(1 − y).

In Table 1 one can see the obtained test-set accuracy on the classification tasks and R² on the regression tasks. As we see from Table 1, after reducing the dimension of the input to k = 2, 3, we are still able to obtain good prediction accuracy on a test set. The table reports the cross-validated accuracies/R² of KNN on 2- or 3-dimensional input representations.

called a regular solution of 10. In other words, B_k formalizes the set of distributions from Ḡ_k that can be approached through sequences {f_i} ⊆ G_k for which Tr(M_{f_i}) does not blow up. Obviously, G_k ⊆ B_k ⊆ Ḡ_k. In applications, regular solutions include all Arg min

The gradient of the function Φ(θ) = J(f_θ) + λ E_{ξ∼ζ̃} ‖∂f_θ(ξ)/∂x − P_{t−1} ∂f_{θ_{t−1}}(ξ)/∂x‖²₂


Algorithm 8 The numerical alternating scheme for SDR. We use υ = 1.0, ζ̃(x) = G^n_{0.8}(x) and default values λ = 10, m ≈ 50, m′ = 100, l = 30000, α = 0.0001, β_1 = 0.5, β_2 = 0.9.

