DIMENSION REDUCTION AS AN OPTIMIZATION PROBLEM OVER A SET OF GENERALIZED FUNCTIONS

Abstract

We reformulate the unsupervised dimension reduction (UDR) problem in the language of tempered distributions, i.e. as the problem of approximating an empirical probability density function $p_{\mathrm{emp}}(x)$ by another tempered distribution $q(x)$ whose support lies in a $k$-dimensional subspace. Our problem thus reduces to the minimization of the distance $D(q, p_{\mathrm{emp}})$ between $q$ and $p_{\mathrm{emp}}$ over a pertinent set of generalized functions. This infinite-dimensional formulation allows us to establish a connection with another classical problem of data science, the sufficient dimension reduction (SDR) problem; an algorithm for the first problem induces an algorithm for the second, and vice versa. To reduce an optimization problem over distributions to an optimization problem over ordinary functions, we introduce a nonnegative penalty function $R(f)$ that "forces" the support of $f$ to be $k$-dimensional. We then present an algorithm for the minimization of $I(f) + \lambda R(f)$ based on a two-step iterative computation: a) adaptation to the real data and to fake data sampled around the $k$-dimensional subspace found at the previous iteration, and b) computation of a new $k$-dimensional subspace. We demonstrate the method on four examples (three UDR and one SDR) using synthetic data and standard datasets.

1. INTRODUCTION

Linear dimension reduction (LDR) is a family of problems in data science that includes principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlation analysis, sufficient dimension reduction (SDR), maximum autocorrelation factors, slow feature analysis, and more. In unsupervised dimension reduction (UDR) we are given a finite number of points in $\mathbb{R}^n$ (sampled according to some unknown distribution), and the goal is to find a "low-dimensional" affine (or linear) subspace that approximates "the support" of the distribution. The field has reached a level of maturity at which unifying frameworks for the problem become of special interest (Cunningham & Ghahramani, 2015). The approach we present in this paper is based on the theory of generalized functions, or tempered distributions (Soboleff, 1936; Schwartz, 1949). An important generalized function that cannot be represented as an ordinary function is the Dirac delta function, denoted $\delta$; its $n$-dimensional version is denoted $\delta_n$.

Any dataset $\{x_i\}_{i=1}^N \subseteq \mathbb{R}^n$ naturally corresponds to the distribution $p_{\mathrm{emp}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_n(x - x_i)$ which, with some abuse of terminology, can be called the empirical probability density function. Based on that, UDR can be understood as the task of approximating $p_{\mathrm{emp}}(x)$ by $q(x)$, where $q(x)$ is a distribution whose density is supported in a $k$-dimensional affine subspace $A \subseteq \mathbb{R}^n$. Note that a function whose density is supported in a low-dimensional subset of $\mathbb{R}^n$ is not an ordinary function; exact definitions of such distributions can be found in Section 3. To formulate an optimization task we additionally need a loss $D(p_{\mathrm{emp}}, q)$ that measures the distance between the ground truth $p_{\mathrm{emp}}$ and the distribution $q$ that we search for. Thus, in our approach, the UDR problem is defined as:

$$I(q) = D(p_{\mathrm{emp}}, q) \to \min_q \qquad (1)$$

under the condition that $q(x)$ has a $k$-dimensional support. The SDR problem is tightly connected with the UDR problem. In SDR, given supervised data, the goal is to find the so-called effective subspace, defined by its basis vectors $\{w_1, \cdots, w_k\} \subseteq \mathbb{R}^n$, such that the regression function can be sought in the form $g(w_1^T x, \cdots, w_k^T x)$. In Wang et al. (2010) it was shown how a method originally developed for SDR can be turned into a UDR method, i.e. applied to unsupervised data, by simply setting the output to be equal to the input. The key observation of our analysis, stated in Theorem 2, is that the class of functions of the form $g(w_1^T x, \cdots, w_k^T x)$ can be characterized as the class of functions whose Fourier transform is supported in the corresponding effective subspace. In Section 4 we give three examples of UDR problems that we cast as (1), and in the fourth example we formulate SDR as an optimization task with the search space dual to that of UDR. Thus, all four examples can be studied within the same optimization framework. The structure of the paper is as follows: in Section 3 we formally define the search space in Problem (1), denoted $G_k$, and the image of $G_k$ under the Fourier transform, denoted $F_k$. Instead of searching directly in the set of generalized functions $G_k$, in Section 5 we describe how we substitute an ordinary function for a distribution in the optimization task at the expense of adding a new penalty term, $\lambda R(f)$, to the objective. Using a Gaussian kernel $M(x, y)$, Theorem 4 characterizes distributions $g \in G_k$ as exactly those for which the matrix of properly defined integrals

$$M_g = \left[ \mathrm{Re} \int_{\mathbb{R}^n \times \mathbb{R}^n} x_i\, y_j\, g(x)^*\, M(x, y)\, g(y)\, dx\, dy \right]_{i,j=1}^{n}$$

is of rank $k$. We define $R(f)$ as the squared Frobenius distance from $M_f$ to the closest matrix of rank $k$. In Section 6 we suggest a method for solving $\min_\phi I(\phi) + \lambda R(\phi)$, which we call the alternating scheme. Section 7 is dedicated to experiments with the alternating scheme on synthetic data and standard datasets.
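As a concrete illustration of the penalty: by the Eckart-Young theorem, the squared Frobenius distance from $M_f$ to the nearest rank-$k$ matrix equals the sum of the squared singular values of $M_f$ beyond the $k$ largest ones, so $R(f)$ is computable directly from the spectrum of $M_f$. The sketch below is our own minimal illustration (the function name `rank_k_penalty` is ours, not the paper's), assuming the matrix $M_f$ has already been estimated, e.g. by numerical quadrature.

```python
import numpy as np

def rank_k_penalty(M_f: np.ndarray, k: int) -> float:
    """Squared Frobenius distance from M_f to the nearest rank-k matrix.

    By the Eckart-Young theorem, this equals the sum of the squared
    singular values of M_f beyond the k largest ones.
    """
    s = np.linalg.svd(M_f, compute_uv=False)  # singular values, descending
    return float(np.sum(s[k:] ** 2))

# Toy check: a rank-2 matrix incurs zero penalty for k = 2,
# and a positive penalty for k = 1.
A = (np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
     + np.outer([0.0, 1.0, 1.0], [2.0, 1.0, 0.0]))
print(rank_k_penalty(A, 2))  # ~0.0
print(rank_k_penalty(A, 1))  # > 0
```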

2. PRELIMINARIES AND NOTATIONS

Throughout this paper we use standard terminology and notation from functional analysis. For exact definitions one can consult the textbook on the theory of distributions by Friedlander & Joshi (1998). The Schwartz space of functions and its dual space are denoted by $S(\mathbb{R}^n)$ and $S'(\mathbb{R}^n)$ respectively. For a tempered distribution $T \in S'(\mathbb{R}^n)$ and $\phi \in S(\mathbb{R}^n)$, $\langle T, \phi \rangle$ denotes $T(\phi)$. The Fourier and inverse Fourier transforms are denoted by $F, F^{-1}: S'(\mathbb{R}^n) \to S'(\mathbb{R}^n)$. For brevity, we denote $F[f]$ by $\hat{f}$. If all required conditions are satisfied, an integrable $f: \mathbb{R}^n \to \mathbb{C}$ (or a Borel measure $\mu$ on $\mathbb{R}^n$) is identified with the tempered distribution $T_f$ (or $T_\mu$), where $\langle T_f, \phi \rangle = \int_{\mathbb{R}^n} f(x)\phi(x)\,dx$ (or $\langle T_\mu, \phi \rangle = \int_{\mathbb{R}^n} \phi(x)\,d\mu$). For $\Omega \subseteq S'(\mathbb{R}^n)$, $\Omega^*$ denotes the sequential closure of $\Omega$ with respect to the weak topology of $S'(\mathbb{R}^n)$. By $L_2(\mathbb{R}^n)$ we denote the $L_2$-space with the inner product $\langle u, v \rangle_{L_2} = \int u(x)^* v(x)\,dx$. For $\phi \in S(\mathbb{R}^n)$ and $\psi \in S'(\mathbb{R}^n)$, their convolution and multiplication are denoted by $\phi * \psi$ and $\phi\psi$ respectively. For $g_1 \in S'(\mathbb{R}^k)$ and $g_2 \in S'(\mathbb{R}^{n-k})$, $g_1 \otimes g_2 \in S'(\mathbb{R}^n)$ denotes their tensor product. For a square matrix $A$, $\mathrm{Tr}(A)$ denotes its trace, and for an arbitrary matrix, $\|A\|_F \stackrel{\mathrm{def}}{=} \sqrt{\mathrm{Tr}(A^T A)}$. The identity matrix of size $n$ is denoted by $I_n$.
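To make the pairing notation concrete: when $\mu$ is the empirical measure of a dataset, $\langle T_\mu, \phi \rangle$ is simply the sample average of the test function, since $p_{\mathrm{emp}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_n(x - x_i)$ gives $\langle p_{\mathrm{emp}}, \phi \rangle = \frac{1}{N}\sum_{i=1}^N \phi(x_i)$. The sketch below is our own illustration (the function name `pair_empirical` and the particular Gaussian test function are our choices).

```python
import numpy as np

def pair_empirical(X: np.ndarray, phi) -> float:
    """<T_mu, phi> for the empirical measure of the rows of X.

    Pairing p_emp with a test function phi reduces to averaging
    phi over the sample points.
    """
    return float(np.mean([phi(x) for x in X]))

# Example with a Gaussian test function phi in S(R^n).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # N = 1000 points in R^3
phi = lambda x: np.exp(-np.dot(x, x))     # phi(x) = exp(-|x|^2), Schwartz class
print(pair_empirical(X, phi))
```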

3. BASIC FUNCTION CLASSES

An example of a generalized function whose density is concentrated in a $k$-dimensional subspace is any distribution that can be represented as $g \otimes \delta_{n-k} \stackrel{\mathrm{def}}{=} g \otimes \underbrace{\delta \otimes \cdots \otimes \delta}_{n-k \text{ times}}$, where $g \in S'(\mathbb{R}^k)$. If $g = T_f$, where $f: \mathbb{R}^k \to \mathbb{R}$ is an ordinary function, then $g \otimes \delta_{n-k}$ can be understood as a generalized function whose density is concentrated in the subspace $\{x \in \mathbb{R}^n \mid x_i = 0, i > k\}$ and equals $f(x_{1:k})$. It can be shown that this distribution acts on $\phi \in S(\mathbb{R}^n)$ in the following way:

$$\langle T_f \otimes \delta_{n-k}, \phi \rangle = \int_{\mathbb{R}^k} f(x_{1:k})\, \phi(x_{1:k}, 0_{n-k})\, dx_{1:k}.$$

To generalize the latter definition to an arbitrary $k$-dimensional subspace we have to introduce a change of variables in tempered distributions. Let $g \in S'(\mathbb{R}^n)$ and let $U \in \mathbb{R}^{n \times n}$ be an orthogonal matrix, i.e. $U^T U = I_n$. Then $g_U \in S'(\mathbb{R}^n)$ is defined by the rule $\langle g_U, \phi \rangle = \langle g, \psi \rangle$, where $\psi(x) = \phi(U^T x)$. If $g = T_f$, the latter definition gives $g_U = T_{f'}$, where $f'(x) = f(Ux)$.
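A quick numerical sanity check of this change-of-variables rule (our own illustration; the choices of $f$, $\phi$, and the rotation $U$ are arbitrary): for $g = T_f$, the identity $\langle g_U, \phi \rangle = \langle g, \phi(U^T \cdot) \rangle$ becomes $\int f(Ux)\phi(x)\,dx = \int f(x)\phi(U^T x)\,dx$, which the sketch below confirms by grid quadrature in $\mathbb{R}^2$.

```python
import numpy as np

# Grid quadrature on [-6, 6]^2; the integrands decay fast enough
# that the truncation error is negligible.
s = np.linspace(-6.0, 6.0, 601)
h = s[1] - s[0]
X1, X2 = np.meshgrid(s, s, indexing="ij")
P = np.stack([X1, X2], axis=-1)           # grid points in R^2

f = lambda x: np.exp(-np.sum(x**2, axis=-1)) * x[..., 0]
phi = lambda x: np.exp(-0.5 * np.sum(x**2, axis=-1)) * (1.0 + x[..., 1])

t = 0.7                                   # arbitrary rotation angle
U = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])

lhs = np.sum(f(P @ U.T) * phi(P)) * h**2  # integral of f(Ux) phi(x) dx
rhs = np.sum(f(P) * phi(P @ U)) * h**2    # integral of f(x) phi(U^T x) dx
print(lhs, rhs)                           # the two values should agree
```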
Now, we define classes of tempered distributions:
