ISOMETRIC AUTOENCODERS

Abstract

High-dimensional data is often assumed to be concentrated on or near a low-dimensional manifold. Autoencoders (AEs) are a popular technique for learning representations of such data by pushing it through a neural network with a low-dimensional bottleneck while minimizing a reconstruction error. Using high-capacity AEs often leads to a large collection of minimizers, many of which represent a low-dimensional manifold that fits the data well but generalizes poorly. Two sources of bad generalization are: extrinsic, where the learned manifold possesses extraneous parts that are far from the data; and intrinsic, where the encoder and decoder introduce arbitrary distortion in the low-dimensional parameterization. A common approach to alleviating these issues is to add a regularizer that favors a particular solution; common regularizers promote sparsity, small derivatives, or robustness to noise. In this paper, we advocate an isometry (i.e., local distance preserving) regularizer. Specifically, our regularizer encourages: (i) the decoder to be an isometry; and (ii) the encoder to be the decoder's pseudo-inverse, that is, the encoder extends the inverse of the decoder to the ambient space by orthogonal projection. In a nutshell, (i) and (ii) fix both intrinsic and extrinsic degrees of freedom and provide a non-linear generalization of principal component analysis (PCA). Experiments with the isometry regularizer on dimensionality reduction tasks produce useful low-dimensional data representations.

1. INTRODUCTION

A common assumption is that high-dimensional data X ⊂ R^D is sampled from some distribution p concentrated on, or near, some lower d-dimensional submanifold M ⊂ R^D, where d < D. The task of estimating p can therefore be decomposed into: (i) approximating the manifold M; and (ii) approximating p restricted to, or concentrated near, M. In this paper we focus on task (i), commonly known as manifold learning. A common approach to approximating the d-dimensional manifold M, e.g., in (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2002; Maaten & Hinton, 2008; McQueen et al., 2016; McInnes et al., 2018), is to embed X in R^d. This is often done by first constructing a graph G where nearby samples in X are connected by edges, and second, optimizing the locations of the samples in R^d so as to minimize edge-length distortions in G. Autoencoders (AEs) can also be seen as a method to learn a low-dimensional manifold representation of high-dimensional data X. AEs are designed to reconstruct X as the image of its low-dimensional embedding. When AEs are restricted to linear encoders and decoders, they learn linear subspaces; with a mean squared reconstruction loss they reproduce principal component analysis (PCA). Using higher-capacity neural networks as the encoder and decoder allows complex manifolds to be approximated. To avoid overfitting, different regularizers are added to the AE loss. Popular regularizers include sparsity-promoting (Ranzato et al., 2007; 2008; Glorot et al., 2011), contractive or derivative-penalizing (Rifai et al., 2011a; b), and denoising (Vincent et al., 2010; Poole et al., 2014) regularizers. Recent AE regularizers directly promote distance preservation of the encoder (Pai et al., 2019; Peterfreund et al., 2020). In this paper we advocate a novel AE regularization promoting isometry (i.e., local distance preservation), called Isometric-AE (I-AE). Our key idea is to encourage the decoder to be isometric, and the encoder to be its pseudo-inverse.
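For intuition, the linear/PCA special case mentioned above can be reproduced directly. The NumPy sketch below builds a linear "decoder" from the top-d principal directions of centered data and uses its transpose as the "encoder"; the synthetic data and all variable names are illustrative, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, n = 10, 3, 500

# Synthetic data concentrated near a d-dimensional linear subspace of R^D.
Z = rng.standard_normal((n, d))
W = rng.standard_normal((d, D))
X = Z @ W + 0.01 * rng.standard_normal((n, D))
X -= X.mean(axis=0)  # PCA assumes centered data

# PCA via SVD: the top-d right singular vectors span the best-fit subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt[:d].T        # linear "decoder": D x d, orthonormal columns
B = A.T             # linear "encoder": the pseudo-inverse of A

codes = X @ A       # encode each row: x -> B x
X_rec = codes @ A.T # decode: orthogonal projection onto the subspace
mse = np.mean((X - X_rec) ** 2)  # small: the data lies near the subspace
```

Encoding followed by decoding is exactly the orthogonal projection A A^T, the behavior the paper's regularizer encourages locally for nonlinear AEs.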
Given an isometric decoder R^d → R^D, there is no well-defined inverse R^D → R^d; we define the pseudo-inverse to be a projection onto the image of the decoder, composed with the inverse of the decoder restricted to its image. Locally, the I-AE regularization therefore encourages: (i) the differential of the decoder A ∈ R^{D×d} to be an isometry, i.e., A^T A = I_d, where I_d is the d × d identity matrix; and (ii) the differential of the encoder B ∈ R^{d×D} to be the pseudo-inverse (now in the standard linear-algebra sense) of the differential of the decoder A ∈ R^{D×d}, namely B = A^+. In view of (i), this implies B = A^T. This means that locally our decoder and encoder behave like PCA, where the encoder and decoder are linear transformations satisfying (i) and (ii); that is, the PCA encoder can be seen as a composition of an orthogonal projection onto the linear subspace spanned by the decoder, followed by an orthogonal transformation (isometry) to the low-dimensional space. In a sense, our method can be seen as a version of denoising/contractive AEs (DAE/CAE, respectively). DAEs and CAEs promote a projection from the ambient space onto the data manifold, but can distort distances and be non-injective. Locally, using differentials again, projection onto the learned manifold means (AB)^2 = AB. Indeed, as can be readily checked, conditions (i) and (ii) above imply A(BA)B = AB. This means that I-AE also belongs to the same class as DAE/CAE, capturing the variations in tangent directions of the data manifold M while ignoring orthogonal variations, which often represent noise (Vincent et al., 2010; Alain & Bengio, 2014). The benefit of I-AE is that its projection onto the data manifold is locally an isometry, preserving distances and sampling the learned manifold evenly. That is, I-AE does not shrink or expand the space; locally, it can be imagined as an orthogonal linear transformation.
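Conditions (i) and (ii) and the resulting idempotence of the composition can be checked numerically. A minimal NumPy sketch, using a random matrix with orthonormal columns as a stand-in for the decoder's differential (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 2

# (i) Decoder differential A in R^{D x d} with orthonormal columns (A^T A = I_d);
# the reduced QR factorization of a random matrix yields such an isometry.
A, _ = np.linalg.qr(rng.standard_normal((D, d)))

# (ii) Encoder differential B = A^+; for an isometry the pseudo-inverse is A^T.
B = np.linalg.pinv(A)

# AB is then an orthogonal projection onto the image of A: (AB)^2 = AB.
P = A @ B
```

Here `np.allclose(B, A.T)` and `np.allclose(P @ P, P)` both hold, matching the claims that B = A^T under (i) and that A(BA)B = AB.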
The inset shows the results of a simple experiment comparing a contractive AE (CAE, bottom) and an isometric AE (I-AE, top). Both AEs are trained on the green data points; the red arrows depict the projection of points (in blue) in the vicinity of the data onto the learned manifold (in black), calculated by applying the encoder followed by the decoder. Note that CAE indeed projects onto the learned manifold, but not evenly, tending to shrink space around the data points; in contrast, I-AE provides a more even sampling of the learned manifold. Experiments confirm that optimizing the I-AE loss results in a close-to-isometric encoder/decoder explaining the data. We further demonstrate the efficacy of I-AE for dimensionality reduction on different standard datasets, showing its benefits over manifold learning and other AE baselines.
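In practice, pushing a decoder's differential toward an isometry can be done stochastically, without forming full Jacobians: for random unit directions v in the latent space, J_g(z) v should have unit norm. The NumPy sketch below is one possible Monte Carlo estimator of such a penalty using finite differences; the toy linear decoders, the estimator, and all names are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

rng = np.random.default_rng(2)
D, d = 8, 2

# Toy linear decoders g: R^d -> R^D; a neural network would be used in practice.
G = rng.standard_normal((D, d))  # generic linear map (distorts lengths)
Q, _ = np.linalg.qr(G)           # orthonormal columns (an isometry)

def isometry_penalty(g, z, n_dirs=16, eps=1e-4):
    """Monte Carlo estimate of E_v (||J_g(z) v|| - 1)^2 over random unit
    directions v, with J_g(z) v approximated by finite differences."""
    total = 0.0
    for _ in range(n_dirs):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        Jv = (g(z + eps * v) - g(z)) / eps  # ~ J_g(z) v
        total += (np.linalg.norm(Jv) - 1.0) ** 2
    return total / n_dirs

z = rng.standard_normal(d)
p_generic = isometry_penalty(lambda u: G @ u, z)  # large: lengths distorted
p_iso = isometry_penalty(lambda u: Q @ u, z)      # near zero: lengths preserved
```

The penalty vanishes exactly when the decoder's differential has orthonormal columns, the local condition (i) above; in a training loop it would be added to the reconstruction loss and averaged over sampled latent codes z.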

2. RELATED WORKS

Manifold learning. Manifold learning generalizes classic dimensionality reduction methods such as PCA (F.R.S., 1901) and MDS (Kruskal, 1964; Sammon, 1969) by aiming to preserve the local geometry of the data. Tenenbaum et al. (2000) use the nn-graph to approximate geodesic distances over the manifold, followed by MDS to preserve them in the lower dimension. Roweis & Saul (2000); Belkin & Niyogi (2002); Donoho & Grimes (2003) use spectral methods to minimize different distortion energy functions over the graph matrix. Coifman et al. (2005); Coifman & Lafon (2006) approximate the heat diffusion over the manifold by a random walk over the nn-graph, to gain a robust distance measure on the manifold. Stochastic neighbor embedding algorithms (Hinton & Roweis, 2003; Maaten & Hinton, 2008) capture the local geometry of the data as a mixture of Gaussians around each data point, and try to find a low-dimensional mixture model by minimizing the KL-divergence. In a relatively recent work, McInnes et al. (2018) use iterative spectral and embedding optimization using fuzzy sets. Several works have tried to adapt classic manifold learning ideas to neural networks and autoencoders. Pai et al. (2019) suggest embedding high-dimensional points into a low dimension with a neural network by constructing a metric between pairs of data points and minimizing the metric distortion energy. Kato et al. (2019) suggest learning an isometric decoder by using noisy latent variables; they prove that, under certain conditions, this encourages an isometric decoder. Peterfreund et al. (2020) suggest autoencoders that promote the isometry of the encoder over the data by approximating its differential Gram matrix using the sample covariance matrix. Zhan et al. (2018) encourage distance-preserving autoencoders by minimizing a metric distortion energy in a common feature space.

Figure 1: Top: I-AE; bottom: CAE.

