LOCAL DISTANCE PRESERVING AUTO-ENCODERS USING CONTINUOUS K-NEAREST NEIGHBOURS GRAPHS

Abstract

Auto-encoder models that preserve similarities in the data are a popular tool in representation learning. In this paper we introduce several auto-encoder models that preserve local distances when mapping from the data space to the latent space. We use a local distance-preserving loss that is based on the continuous k-nearest neighbours graph, which is known to capture topological features at all scales simultaneously. To improve training performance, we formulate learning as a constrained optimisation problem with local distance preservation as the main objective and reconstruction accuracy as a constraint. We generalise this approach to hierarchical variational auto-encoders, thus learning generative models with geometrically consistent latent and data spaces. Our method provides state-of-the-art or comparable performance across several standard datasets and evaluation metrics.

1. INTRODUCTION

Auto-encoders and variational auto-encoders (Kingma & Welling, 2014; Rezende et al., 2014) are often used in machine learning to find meaningful latent representations of the data. What constitutes meaningful usually depends on the application and on the downstream tasks: for example, finding representations that capture important factors of variation in the data (disentanglement) (Higgins et al., 2017; Chen et al., 2018), have high mutual information with the data (Chen et al., 2016), or show clustering behaviour w.r.t. some criteria (van der Maaten & Hinton, 2008). These representations are usually incentivised by regularisers or architectural/structural choices. One criterion for finding a meaningful latent representation is geometric faithfulness to the data. This is important for data visualisation and for further downstream tasks that involve geometric algorithms such as clustering or kNN classification. The data often lies on a sparse, low-dimensional manifold in the ambient space, and finding a lower-dimensional projection that is geometrically faithful to it can help not only in visualisation and interpretability but also in predictive performance and robustness (e.g. Karl et al., 2017; Klushyn et al., 2021). There are several approaches that implement such projections: ISOMAP (Tenenbaum et al., 2000), LLE (Roweis & Saul, 2000), SNE/t-SNE (Hinton & Roweis, 2002; van der Maaten & Hinton, 2008; Graving & Couzin, 2020) and UMAP (McInnes et al., 2018; Sainburg et al., 2021) aim to preserve the local neighbourhood structure, while topological auto-encoders (Moor et al., 2020), witness auto-encoders (Schönenberger et al., 2020), and Li et al. (2021) use regularisers in auto-encoder models to learn projections that preserve topological features or local distances. The approach presented in Moor et al. (2020) uses persistent homology computation to define local connectivity graphs over which to preserve local distances.
One can choose the dimensionality of the preserved topological features; however, preserving higher-dimensional topological features comes at additional computational cost. In this paper we propose to use the continuous k-nearest neighbours (CkNN) method (Berry & Sauer, 2019), which is based on consistent homology and results in a significantly simpler graph construction method; it is also known to capture topological features at all scales simultaneously. Since AE and VAE methods are usually hard to train and regularise (Alemi et al., 2018; Higgins et al., 2017; Zhao et al., 2018; Rezende & Viola, 2018), to improve learning we formulate it as a constrained optimisation with the topological loss as the objective and the reconstruction loss as a constraint. In addition, we adapt the proposed methods to VAEs with learned priors. This enables us to learn models that generate data with topologically/geometrically consistent latent and data spaces.
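To make the CkNN construction concrete, the following is a minimal numpy sketch of the Berry & Sauer (2019) rule: points $i$ and $j$ are connected iff $d(x_i, x_j) < \delta \sqrt{d_k(x_i)\, d_k(x_j)}$, where $d_k(x)$ is the distance from $x$ to its $k$-th nearest neighbour. The function name and the default values of `k` and `delta` are illustrative choices, not the paper's settings.

```python
import numpy as np

def cknn_graph(X, k=5, delta=1.0):
    """Continuous k-nearest neighbours (CkNN) graph sketch.

    Connects i and j iff  d(x_i, x_j) < delta * sqrt(d_k(x_i) * d_k(x_j)),
    where d_k(x) is the distance from x to its k-th nearest neighbour.
    Returns the set of undirected edges (i, j) with i < j.
    """
    # full pairwise Euclidean distance matrix via broadcasting
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # distance to the k-th nearest neighbour (index 0 is the point itself)
    d_k = np.sort(D, axis=1)[:, k]
    mask = D < delta * np.sqrt(d_k[:, None] * d_k[None, :])
    np.fill_diagonal(mask, False)
    return {(i, j) for i in range(len(X)) for j in range(i + 1, len(X)) if mask[i, j]}
```

Because the local scale $d_k$ enters multiplicatively, dense and sparse regions of the data are connected at their own natural scales rather than at one global radius.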

2. METHODS

In this paper we address (i) projecting i.i.d. data $X = \{x_i\}_{i=1}^N$ with $x \in \mathbb{R}^n$ into a lower-dimensional representation $z \in \mathbb{R}^m$ ($m < n$) using auto-encoders, and (ii) learning an unsupervised (hierarchical) probabilistic model that can be used not only to encode but also to generate data similar to $X$. Auto-encoder models are typically learned by minimising the average reconstruction loss $\mathcal{L}_{\mathrm{rec}}(\theta, \phi; X) = \mathbb{E}_{p(x)}[\,l(x, g_\theta(f_\phi(x)))\,]$ w.r.t. $(\theta, \phi)$, where $l(\cdot, \cdot)$ is a positive, symmetric, non-decreasing function and the mappings $f_\phi$ and $g_\theta$ are called the encoder and the generator, respectively. For consistency with distance-preserving losses, we only use the Euclidean distance $l(x, x') = \|x - x'\|_2$ as the reconstruction loss. The expectation is taken w.r.t. the empirical distribution $p(x) = (1/N) \sum_i \delta(x - x_i)$ and training is performed via stochastic batch gradient methods. Unsupervised probabilistic models are typically learned by the maximum likelihood method w.r.t. $\theta$ on $p_\theta(X) = \prod_i \int p_\theta(x_i | z_i)\, p_\theta(z_i)\, \mathrm{d}z_i$, where $p_\theta(x|z)$ is the likelihood term corresponding to the generator $g_\theta(z)$ and $p_\theta(z)$ is the prior distribution/density of the latent variables $z$. The distribution $p_\theta(z)$ is either chosen as a product of some standard univariate distributions or learned via empirical Bayes. In practice, learning the prior is often included in the maximum likelihood optimisation. Since the integrals $\int p_\theta(x_i|z)\, p_\theta(z)\, \mathrm{d}z$ are usually intractable, $\log p_\theta(x)$ is often approximated using amortised variational Bayes (Kingma & Welling, 2014; Rezende et al., 2014), resulting in the evidence lower bound (ELBO) approximation $\log p_\theta(x) \geq \max_\phi \mathbb{E}_{q_\phi(z; x)}[\log p_\theta(x|z)] - \mathrm{KL}[q_\phi(z; x)\,\|\,p_\theta(z)]$. The resulting $q_\phi(z; x)$ is an approximation of the posterior distribution $p_\theta(z|x) = p_\theta(x|z)\, p_\theta(z)/p_\theta(x)$ and can be viewed as corresponding to the encoder $f_\phi(x)$. For notational simplicity, we use $\theta$ for all model parameters and $\phi$ for all encoder parameters.
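The reconstruction loss above can be sketched in a few lines of numpy. The linear maps `W_enc` and `W_dec` are hypothetical stand-ins for the paper's encoder $f_\phi$ and generator $g_\theta$ (in practice these would be deep networks), and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy linear encoder/decoder standing in for f_phi and g_theta (hypothetical)
W_enc = rng.normal(size=(3, 2))   # n = 3  ->  m = 2
W_dec = rng.normal(size=(2, 3))   # m = 2  ->  n = 3

def f_phi(x):
    """Encoder: map data to the latent space."""
    return x @ W_enc

def g_theta(z):
    """Generator: map latent codes back to the data space."""
    return z @ W_dec

def reconstruction_loss(X):
    """Average Euclidean reconstruction loss  E_p(x)[ ||x - g(f(x))||_2 ],
    with the expectation over the empirical distribution of X."""
    X_hat = g_theta(f_phi(X))
    return np.mean(np.linalg.norm(X - X_hat, axis=1))
```

Note that, matching the text, the loss uses the Euclidean distance itself rather than its square, which keeps it on the same scale as the distance-preserving loss introduced later.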
In this paper we will deviate slightly from the ELBO approach to fit the parameters θ and φ because of practical considerations but the general modelling ideas will be similar nonetheless.
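One practical deviation mentioned above is the constrained formulation: minimise the topological loss subject to $\mathcal{L}_{\mathrm{rec}} \leq \kappa$. A common way to realise this is a GECO-style Lagrange multiplier update (cf. Rezende & Viola, 2018, cited in the introduction); the sketch below shows one such multiplier step. The function name, the learning rate, and the multiplicative update rule are illustrative assumptions, not necessarily the paper's exact scheme.

```python
import numpy as np

def constrained_update(lmbda, rec_loss, kappa, lr_lambda=0.1):
    """One GECO-style Lagrange multiplier step for
        min  L_topo + lambda * (L_rec - kappa).

    lambda is increased when the reconstruction constraint L_rec <= kappa
    is violated and decreased otherwise; the exponential (log-space)
    update keeps lambda strictly positive.
    """
    constraint = rec_loss - kappa  # > 0 means the constraint is violated
    return lmbda * np.exp(lr_lambda * constraint)
```

In a training loop this update runs alongside the gradient step on $(\theta, \phi)$, so the effective weighting between the two losses is adapted automatically instead of being hand-tuned.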

2.1. LOCAL DISTANCE PRESERVATION

Auto-encoders are popular models for dimensionality reduction and are thus often extended with regularisers or constraints that impose various types of inductive biases required by the task at hand. One such inductive bias is local distance preservation, that is, two data points $x_i$ and $x_j$ close in the data space at distance $d_X(x_i, x_j)$ should be mapped into points $z_i = f_\phi(x_i)$ and $z_j = f_\phi(x_j)$ at distance $\gamma d_Z(z_i, z_j) \approx d_X(x_i, x_j)$. This distance preservation can help to retain the topology of the data $X$ in the encoded data $Z = \{z_i = f_\phi(x_i)\}_{i=1}^N$. Since the data $X$ is often hypothesised to lie on a sub-manifold of $\mathbb{R}^n$, give or take some observation noise (Rifai et al., 2011), one expects that the encoded data $Z$ will be a lower-dimensional, topologically faithful representation of $X$. In this paper we mainly consider local distance preservation where locality or closeness in the data manifold is formulated via (neighbourhood) graph structures constructed based on topological/geometrical considerations. We present the graph construction methods we use in Section 2.3. Let us assume that we have constructed two graphs with the same method, a graph $G_X$ based on the data/batch and another graph $G_Z$ based on the encoding of the data/batch. Given these graphs and the distance measures in both spaces, we define the local distance-preserving loss, similarly to (Sammon, 1969; Lawrence & Quinonero-Candela, 2006; Moor et al., 2020), as

$$\mathcal{L}_{\mathrm{topo}}(\phi; X, Z) = \sum_{(i,j) \in G_X \cup G_Z} |d_X(x_i, x_j) - \gamma d_Z(z_i, z_j)|^2. \tag{1}$$

Here, in the case of auto-encoder models we have $Z = \{z_i = f_\phi(x_i)\}_{i=1}^N$, while in the case of generative models we have $Z = \{z_i \sim q(z; x_i)\}_{i=1}^N$. The scaling factor $\gamma$ is a learned variable and is introduced to help with the scaling issues one might encounter in VAE models. In the case of generative models one can also consider the generative counterpart for $Z' \sim p_\theta(z)$, $X' \sim p_\theta(\cdot \mid Z')$.
For the models and training schedules we consider in this paper this did not bring any additional benefit, because a good auto-encoding and a well-fitted prior already ensure a small value for this additional term. There are several other options for loss functions that are designed to incentivise auto-encoders to preserve locality structures. SNE/t-SNE construct a probability distribution of connectedness for each

