ISOMETRIC AUTOENCODERS

Abstract

High dimensional data is often assumed to be concentrated on or near a low-dimensional manifold. Autoencoders (AE) are a popular technique for learning representations of such data by pushing it through a neural network with a low dimension bottleneck while minimizing a reconstruction error. Using high capacity AE often leads to a large collection of minimizers, many of which represent a low dimensional manifold that fits the data well but generalizes poorly. Two sources of bad generalization are: extrinsic, where the learned manifold possesses extraneous parts that are far from the data; and intrinsic, where the encoder and decoder introduce arbitrary distortion in the low dimensional parameterization. An approach taken to alleviate these issues is to add a regularizer that favors a particular solution; common regularizers promote sparsity, small derivatives, or robustness to noise. In this paper, we advocate an isometry (i.e., local distance preserving) regularizer. Specifically, our regularizer encourages: (i) the decoder to be an isometry; and (ii) the encoder to be the decoder's pseudo-inverse, that is, the encoder extends the inverse of the decoder to the ambient space by orthogonal projection. In a nutshell, (i) and (ii) fix both intrinsic and extrinsic degrees of freedom and provide a non-linear generalization of principal component analysis (PCA). Experimenting with the isometry regularizer on dimensionality reduction tasks produces useful low-dimensional data representations.

1. INTRODUCTION

A common assumption is that high dimensional data X ⊂ R^D is sampled from some distribution p concentrated on, or near, some lower d-dimensional submanifold M ⊂ R^D, where d < D. The task of estimating p can therefore be decomposed into: (i) approximate the manifold M; and (ii) approximate p restricted to, or concentrated near, M. In this paper we focus on task (i), mostly known as manifold learning. A common approach to approximating the d-dimensional manifold M, e.g., in (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2002; Maaten & Hinton, 2008; McQueen et al., 2016; McInnes et al., 2018), is to embed X in R^d. This is often done by first constructing a graph G where nearby samples in X are connected by edges, and second, optimizing for the locations of the samples in R^d striving to minimize edge length distortions in G. Autoencoders (AE) can also be seen as a method to learn a low dimensional manifold representation of high dimensional data X. AE are designed to reconstruct X as the image of its low dimensional embedding. When restricted to linear encoders and decoders, AE learn linear subspaces; with a mean squared reconstruction loss they reproduce principal component analysis (PCA). Using higher capacity neural networks as the encoder and decoder allows complex manifolds to be approximated. To avoid overfitting, different regularizers are added to the AE loss. Popular regularizers include sparsity promoting (Ranzato et al., 2007; 2008; Glorot et al., 2011), contractive or penalizing large derivatives (Rifai et al., 2011a; b), and denoising (Vincent et al., 2010; Poole et al., 2014). Recent AE regularizers directly promote distance preservation of the encoder (Pai et al., 2019; Peterfreund et al., 2020). In this paper we advocate a novel AE regularization promoting isometry (i.e., local distance preservation), called Isometric-AE (I-AE). Our key idea is to promote the decoder to be isometric, and the encoder to be its pseudo-inverse.
Given an isometric decoder R^d → R^D, there is no well-defined inverse R^D → R^d; we define the pseudo-inverse to be a projection on the image of the decoder composed with the inverse of the decoder restricted to its image. Locally, the I-AE regularization therefore encourages: (i) the differential of the decoder A ∈ R^{D×d} to be an isometry, i.e., A^T A = I_d, where I_d is the d × d identity matrix; and (ii) the differential of the encoder, B ∈ R^{d×D}, to be the pseudo-inverse (now in the standard linear algebra sense) of the differential of the decoder A ∈ R^{D×d}, namely, B = A^+. In view of (i) this implies B = A^T. This means that locally our decoder and encoder behave like PCA, where the encoder and decoder are linear transformations satisfying (i) and (ii); that is, the PCA encoder can be seen as a composition of an orthogonal projection on the linear subspace spanned by the decoder, followed by an orthogonal transformation (isometry) to the low dimensional space. In a sense, our method can be seen as a version of denoising/contractive AEs (DAE/CAE, respectively). DAE and CAE promote a projection from the ambient space onto the data manifold, but can distort distances and be non-injective. Locally, using differentials again, projection on the learned manifold means (AB)^2 = AB. Indeed, as can be readily checked, conditions (i) and (ii) above imply A(BA)B = AB. This means that I-AE also belongs to the same class as DAE/CAE, capturing the variations in the tangent directions of the data M while ignoring orthogonal variations, which often represent noise (Vincent et al., 2010; Alain & Bengio, 2014). The benefit of I-AE is that its projection onto the data manifold is locally an isometry, preserving distances and sampling the learned manifold evenly. That is, I-AE does not shrink or expand the space; locally, it can be imagined as an orthogonal linear transformation.
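To make the local picture concrete, the first-order conditions (i) and (ii) can be checked with plain linear algebra; the sketch below uses toy matrices (not taken from the paper) to verify that with B = A^+ = A^T, the composition AB is an orthogonal projection:

```python
import numpy as np

# Toy "decoder differential": a column-orthogonal D x d matrix (D=5, d=2),
# built with a QR decomposition so that A^T A = I_d holds exactly.
rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.normal(size=(5, 2)))

# Condition (ii): the encoder differential is the pseudo-inverse of A.
# For column-orthogonal A, the pseudo-inverse reduces to A^T.
B = np.linalg.pinv(A)
assert np.allclose(B, A.T)

# AB is then an orthogonal projection: idempotent ((AB)^2 = AB) and symmetric.
P = A @ B
print(np.allclose(P @ P, P), np.allclose(P, P.T))  # prints: True True
```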
The inset shows results of a simple experiment comparing contractive AE (CAE, bottom) and isometric AE (I-AE, top). Both AEs are trained on the green data points; the red arrows depict the projection of points (in blue) in the vicinity of the data onto the learned manifold (in black), calculated by applying the encoder followed by the decoder. Note that CAE indeed projects onto the learned manifold but not evenly, tending to shrink space around data points; in contrast, I-AE provides a more even sampling of the learned manifold. Experiments confirm that optimizing the I-AE loss results in a close-to-isometric encoder/decoder explaining the data. We further demonstrate the efficacy of I-AE for dimensionality reduction of different standard datasets, showing its benefits over manifold learning and other AE baselines.

2. RELATED WORKS

Manifold learning. Manifold learning generalizes classic dimensionality reduction methods such as PCA (F.R.S., 1901) and MDS (Kruskal, 1964; Sammon, 1969) by aiming to preserve the local geometry of the data. Tenenbaum et al. (2000) use the nn-graph to approximate the geodesic distances over the manifold, followed by MDS to preserve them in the lower dimension. Roweis & Saul (2000); Belkin & Niyogi (2002); Donoho & Grimes (2003) use spectral methods to minimize different distortion energy functions over the graph matrix. Coifman et al. (2005); Coifman & Lafon (2006) approximate the heat diffusion over the manifold by a random walk over the nn-graph, to gain a robust distance measure on the manifold. Stochastic neighbor embedding algorithms (Hinton & Roweis, 2003; Maaten & Hinton, 2008) capture the local geometry of the data as a mixture of Gaussians around each data point, and try to find a low dimensional mixture model by minimizing the KL-divergence. In a relatively recent work, McInnes et al. (2018) use iterative spectral and embedding optimization using fuzzy sets. Several works have tried to adapt classic manifold learning ideas to neural networks and autoencoders. Pai et al. (2019) suggest embedding high dimensional points into a low dimension with a neural network by constructing a metric between pairs of data points and minimizing the metric distortion energy. Kato et al. (2019) suggest learning an isometric decoder by using noisy latent variables; they prove that, under certain conditions, this encourages an isometric decoder. Peterfreund et al. (2020) suggest autoencoders that promote the isometry of the encoder over the data by approximating its differential Gram matrix using the sample covariance matrix. Zhan et al. (2018) encourage distance preserving autoencoders by minimizing a metric distortion energy in a common feature space.

Modern autoencoders.

There is an extensive literature on extending autoencoders to generative models (task (ii) in section 1), that is, learning a probability distribution in addition to approximating the data manifold M. Variational autoencoders (VAE) (Kingma & Welling, 2014) and their variants (Makhzani et al., 2015; Burda et al., 2016; Sønderby et al., 2016; Higgins et al., 2017; Tolstikhin et al., 2018; Park et al., 2019; Zhao et al., 2019) are examples of such methods. In essence, these methods augment the AE structure with a learned probabilistic model in the low dimensional (latent) space R^d that is used to approximate the probability P that generated the observed data X. More relevant to our work are recent works suggesting regularizers for deterministic autoencoders that, together with ex-post density estimation in the latent space, form a generative model. Ghosh et al. (2020) suggested reducing the decoder degrees of freedom, either by regularizing the norm of the decoder weights or the norm of the decoder differential. Other regularizers of the differential of the decoder, aiming towards a deterministic variant of VAE, were recently suggested in Kumar & Poole (2020); Kumar et al. (2020). In contrast to our method, these methods do not regularize the encoder explicitly.

3. ISOMETRIC AUTOENCODERS

We consider high dimensional data points X = {x_i}_{i=1}^n ⊂ R^D sampled from some probability distribution P(x) in R^D concentrated on or near some d-dimensional submanifold M ⊂ R^D, where d < D. Our goal is to compute an isometric autoencoder (I-AE), defined as follows. Let g : R^D → R^d denote the encoder and f : R^d → R^D the decoder; N is the learned manifold, i.e., the image of the decoder, N = f(R^d). I-AE is defined by the following requirements: (i) the data X is close to N; (ii) f is an isometry; (iii) g is the pseudo-inverse of f. Figure 2 is an illustration of I-AE. Let θ denote the parameters of f, and φ the parameters of g. We enforce requirements (i)-(iii) by prescribing a loss function L(θ, φ) and optimizing it using standard stochastic gradient descent (SGD). We next break down the loss L into its components. Condition (i) is promoted with the standard AE reconstruction loss

L_rec(θ, φ) = (1/n) Σ_{i=1}^n ||f(g(x_i)) - x_i||^2,    (1)

where ||·|| is the 2-norm. Before handling conditions (ii), (iii), let us first define the notions of isometry and pseudo-inverse. A differentiable mapping f between the Euclidean spaces R^d and R^D is a local isometry if it has an orthogonal differential matrix df(z) ∈ R^{D×d},

df(z)^T df(z) = I_d,    (2)

where I_d ∈ R^{d×d} is the identity matrix and df(z)_{ij} = ∂f_i/∂z_j(z). A local isometry which is also a diffeomorphism is a global isometry. Restricting the decoder to be an isometry is beneficial for several reasons. First, the Nash-Kuiper embedding theorem (Nash, 1956) asserts that non-expansive maps can be approximated arbitrarily well by isometries if D ≥ d + 1, hence promoting an isometry does not limit the expressive power of the decoder. Second, the low dimensional representation of the data computed with an isometric encoder preserves the geometric structure of the data; in particular volumes, lengths, angles and probability densities are preserved between the low dimensional representation R^d and the learned manifold N.
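As a concrete (toy) example of the definition, the arc-length parameterization of the unit circle is a local isometry; the sketch below verifies df(z)^T df(z) = I_d numerically with finite differences (the map and tolerances are illustrative, not from the paper):

```python
import numpy as np

# A toy decoder f: R^1 -> R^2 mapping the line onto the unit circle by arc
# length; it is a local isometry since ||df(z)|| = 1 everywhere.
def f(z):
    return np.array([np.cos(z), np.sin(z)])

def differential(f, z, eps=1e-6):
    # Central finite-difference approximation of the D x d differential df(z).
    return ((f(z + eps) - f(z - eps)) / (2 * eps)).reshape(-1, 1)

for z in [0.0, 0.7, 2.5]:
    A = differential(f, z)           # shape (2, 1)
    G = A.T @ A                      # should be the 1 x 1 identity
    assert np.allclose(G, np.eye(1), atol=1e-6)
print("local isometry verified")  # prints: local isometry verified
```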
Lastly, for a fixed manifold N there is a huge space of possible decoders such that N = f(R^d). For isometric f, this space is reduced considerably: indeed, consider two isometries parameterizing N, f_1, f_2 : R^d → N. Then, since a composition of isometries is an isometry, f_2^{-1} ∘ f_1 : R^d → R^d is a dimension-preserving isometry and hence a rigid motion. That is, all isometric decoders of the same manifold are the same up to a rigid motion. For the encoder the situation is different: since D > d, the encoder g cannot be an isometry in the standard sense. Therefore we ask g to be the pseudo-inverse of f. To that end we define the projection operator p on a submanifold N ⊂ R^D as p(x) = argmin_{x' ∈ N} ||x - x'||. Note that the closest point is not unique in general; however, the Tubular Neighborhood Theorem (see, e.g., Theorem 6.24 in Lee (2013)) implies uniqueness for points x sufficiently close to the manifold N.

Definition 1. We say that g is the pseudo-inverse of f if g can be written as g = f^{-1} ∘ p, where p is the projection on N = f(R^d).

Consequently, if g is the pseudo-inverse of an isometry f then it extends the standard notion of isometry by projecting every point onto a submanifold N and then applying an isometry between the d-dimensional manifolds N and R^d. See Figure 2 for an illustration.

First-order characterization. To encourage f, g to satisfy the (local) isometry and pseudo-inverse properties (resp.), we first provide a first-order (necessary) characterization using their differentials:

Theorem 1. Let f be a decoder and g an encoder satisfying conditions (ii), (iii). Then their differentials A = df(z) ∈ R^{D×d}, B = dg(f(z)) ∈ R^{d×D} satisfy

A^T A = I_d,    (3)
B B^T = I_d,    (4)
B = A^T.    (5)

The theorem asserts that the differentials of the encoder and decoder are orthogonal (rectangular) matrices, and that the differential of the encoder is the pseudo-inverse of the differential of the decoder.
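Lemma 3 in Section 3.1 shows that the differential of p at a point of N is the orthogonal projection onto the tangent space. This can be sanity-checked numerically on a toy manifold, the unit circle in R^2, where the projection has the closed form p(x) = x/||x|| (the example is illustrative, not from the paper):

```python
import numpy as np

# On the unit circle N, the projection is p(x) = x / ||x||. At x in N its
# differential should kill the normal direction and fix the tangent one.
def p(x):
    return x / np.linalg.norm(x)

def jacobian(fn, x, eps=1e-6):
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        cols.append((fn(x + e) - fn(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([np.cos(0.3), np.sin(0.3)])   # a point on N
t = np.array([-x[1], x[0]])                # unit tangent at x
n = x                                      # unit normal at x
J = jacobian(p, x)
print(np.allclose(J @ t, t, atol=1e-5), np.allclose(J @ n, 0, atol=1e-5))
# prints: True True
```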
Before proving this theorem, let us first use it to construct the losses promoting the isometry of f and the pseudo-inverse property of g. We need to promote conditions (3), (4), (5). Since we want to avoid computing the full differentials A = df(z), B = dg(f(z)), we replace (3) and (4) with stochastic estimations based on the following lemma; denote the unit (d-1)-sphere by S^{d-1} = {z ∈ R^d | ||z|| = 1}.

Lemma 1. Let A ∈ R^{D×d}, where d ≤ D. If ||Au|| = 1 for all u ∈ S^{d-1}, then A is column-orthogonal, that is, A^T A = I_d.

Therefore, the isometry promoting loss, encouraging (3), is defined by

L_iso(θ) = E_{z,u} ( ||df(z) u|| - 1 )^2,    (6)

where z ∼ P_iso(R^d), a probability measure on R^d, and u ∼ P(S^{d-1}), the standard rotation invariant probability measure on the (d-1)-sphere S^{d-1}. The pseudo-inverse promoting loss, encouraging (4), is

L_piso(φ) = E_{x,u} ( ||u^T dg(x)|| - 1 )^2,    (7)

where x ∼ P(M) and u ∼ P(S^{d-1}). As usual, the expectation with respect to P(M) is computed empirically using the data samples X. Lastly, (5) might seem challenging to enforce with neural networks; however, the orthogonality of A, B can be leveraged to replace this condition with a more tractable one, asking that the encoder merely be the inverse of the decoder over its image:

Lemma 2. Let A ∈ R^{D×d} and B ∈ R^{d×D}. If A^T A = I_d = B B^T and BA = I_d, then B = A^+ = A^T.

Fortunately, this is already taken care of by the reconstruction loss: since a low reconstruction loss in equation 1 forces the encoder and the decoder to be inverses of one another over the data manifold, i.e., g(f(z)) = z, it encourages BA = I_d and therefore, by Lemma 2, automatically encourages equation 5. Note that invertibility also implies bijectivity of the encoder/decoder restricted to the data manifold, pushing for global (rather than local) isometries.
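The stochastic estimation behind equation 6 can be illustrated numerically: sampling unit directions u and averaging (||Au|| - 1)^2 yields a penalty that vanishes exactly for column-orthogonal A, as Lemma 1 predicts. A minimal numpy sketch with toy matrices (not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def unit_sphere_samples(d, n):
    # Rotation-invariant samples on S^{d-1}: normalized Gaussian vectors.
    u = rng.normal(size=(n, d))
    return u / np.linalg.norm(u, axis=1, keepdims=True)

def iso_penalty(A, n=2000):
    # Monte-Carlo estimate of E_u (||A u|| - 1)^2, as in the L_iso loss.
    U = unit_sphere_samples(A.shape[1], n)
    return np.mean((np.linalg.norm(U @ A.T, axis=1) - 1.0) ** 2)

Q, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # column-orthogonal: penalty ~ 0
M = rng.normal(size=(6, 3))                    # generic matrix: penalty > 0
print(iso_penalty(Q) < 1e-12, iso_penalty(M) > 1e-3)  # prints: True True
```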
Summing up, we define the I-AE loss as

L(θ, φ) = L_rec(θ, φ) + λ_iso ( L_iso(θ) + L_piso(φ) ),    (8)

where λ_iso is a parameter controlling the isometry-reconstruction trade-off.

3.1. DETAILS AND PROOFS.

Let us prove Theorem 1, characterizing the differentials of isometries and pseudo-inverses, A = df(z) ∈ R^{D×d}, B = dg(f(z)) ∈ R^{d×D}. First, by the definition of isometry (equation 2), A^T A = I_d. We denote by T_x N the d-dimensional tangent space to N at x ∈ N; accordingly, T_x N^⊥ denotes the normal space.

Lemma 3. The differential dp(x) ∈ R^{D×D} at x ∈ N of the projection operator p : R^D → N satisfies

dp(x) u = u  for u ∈ T_x N,    dp(x) u = 0  for u ∈ T_x N^⊥.    (9)

That is, dp(x) is the orthogonal projection on the tangent space of N at x.

Proof. Consider the squared distance function to N defined by η(x) = (1/2) min_{x' ∈ N} ||x - x'||^2. The envelope theorem implies that ∇η(x) = x - p(x). Differentiating both sides and rearranging, we get dp(x) = I_D - ∇²η(x). As proved in Ambrosio & Soner (1994) (Theorem 3.1), ∇²η(x) is the orthogonal projection on T_x N^⊥.

Let x = f(z) ∈ N. Since x ∈ N we have p(x) = x. Condition (iii) asserts that g(y) = f^{-1}(p(y)); taking the derivative at y = x we get dg(x) = df^{-1}(x) dp(x). Lemma 3 implies that dp(x) = A A^T, since A A^T is the orthogonal projection on T_x N. Furthermore, df^{-1}(x) restricted to Im(A) is A^T. Putting this together we get B = dg(x) = A^T A A^T = A^T. This implies that BB^T = I_d, and that B = A^+ = A^T. This concludes the proof of Theorem 1.

Proof of Lemma 1. Writing the SVD of A = U Σ V^T, where Σ = diag(σ_1, ..., σ_d) holds the singular values of A, and substituting v = V w, we get that ||Av||^2 = Σ_{i=1}^d σ_i^2 w_i^2 = 1 for all w ∈ S^{d-1}. Plugging in w = e_j, j ∈ [d] (the standard basis), we get that σ_i = 1 for all i ∈ [d], and hence A = U V^T is column-orthogonal as claimed.

Proof of Lemma 2. Let U = [A, V], V ∈ R^{D×(D-d)}, be a completion of A to an orthogonal matrix in R^{D×D}. Now, I_d = B U U^T B^T = (BA)(BA)^T + B V V^T B^T = I_d + B V V^T B^T, and since B V V^T B^T is positive semi-definite this means that BV = 0; that is, B takes the orthogonal complement of the column space of A to zero. A direct computation then shows B = B U U^T = (BA) A^T + (BV) V^T = A^T = A^+.

Implementation.
Implementing the losses in equations 6 and 7 requires choosing the probability densities and approximating the expectations. We take P_iso(R^d) to be either uniform or a Gaussian fit to the latent codes g(X); P(M) is approximated as the uniform distribution on X, as mentioned above. The expectations are estimated using Monte-Carlo sampling. That is, at each iteration we draw samples x̂ ∈ X, ẑ ∼ P_iso(R^d), û ∼ P(S^{d-1}) and use the approximations

L_iso(θ) ≈ ( ||df(ẑ) û|| - 1 )^2,    L_piso(φ) ≈ ( ||û^T dg(x̂)|| - 1 )^2.

The right differential multiplication df(ẑ) û and the left differential multiplication û^T dg(x̂) are computed using forward and backward mode automatic differentiation (resp.); their derivatives with respect to the network parameters θ, φ are computed by another backward mode automatic differentiation.
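The paper computes df(ẑ)û and û^T dg(x̂) with forward and backward mode automatic differentiation; the dependency-free sketch below substitutes finite differences for the AD calls and uses toy linear networks in place of f and g, so it illustrates the per-sample Monte-Carlo losses only under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, D = 2, 5

# Toy linear decoder/encoder standing in for the networks f, g; with a
# column-orthogonal W both per-sample penalties below should vanish.
W, _ = np.linalg.qr(rng.normal(size=(D, d)))
f = lambda z: W @ z
g = lambda x: W.T @ x

def L_iso_sample(f, z, u, eps=1e-6):
    jvp = (f(z + eps * u) - f(z - eps * u)) / (2 * eps)   # df(z) u
    return (np.linalg.norm(jvp) - 1.0) ** 2

def L_piso_sample(g, x, u, eps=1e-6):
    # u^T dg(x), one row-combination of the encoder differential, assembled
    # coordinate by coordinate via finite differences.
    vjp = np.array([u @ (g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                    for e in np.eye(x.size)])
    return (np.linalg.norm(vjp) - 1.0) ** 2

z = rng.normal(size=d)
u = rng.normal(size=d); u /= np.linalg.norm(u)
x = f(z)
print(L_iso_sample(f, z, u) < 1e-8, L_piso_sample(g, x, u) < 1e-8)
# prints: True True
```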

4. EXPERIMENTS

4.1. EVALUATION

We start by evaluating the effectiveness of our suggested I-AE regularizer, addressing the following questions: (i) does our suggested loss L(θ, φ) in equation 8 drive I-AE training to converge to an isometry? (ii) What is the effect of the L_piso term? In particular, does it encourage better manifold approximations, as conjectured? To that end, we examined I-AE training on data points X sampled uniformly from 3D surfaces with known global parameterizations. Figure 3 shows the results, compared with the following baselines: contractive AE with decoder weights tied to the encoder weights (TCAE) (Rifai et al., 2011a); gradient penalty on the decoder (RAE-GP) (Ghosh et al., 2020); and denoising autoencoder with Gaussian noise (DAE) (Vincent et al., 2010). The results demonstrate that I-AE is able to learn an isometric embedding, showing some of the advantages of our method: sampling density and distances between input points are preserved in the learned low dimensional space. In addition, for the AE methods, we quantitatively evaluate how close the learned decoder is to an isometry. For this purpose, we triangulate a grid of planar points {z_i} ⊂ R^2 and denote by {e_ij} the triangle edges connecting grid points z_i and z_j. We then measure the edge length ratios l_ij = ||f(z_i) - f(z_j)|| / |e_ij|, expected to be ≈ 1 for all edges e_ij under an isometry. In Table 1 we log the standard deviation (Std) of {l_ij} for I-AE compared with other regularized AEs. For a fair comparison, we scale z_i so that the mean of l_ij is 1 in all experiments. As can be seen in the table, the distribution of {l_ij} for I-AE is significantly more concentrated than for the AE baselines. Finally, although L_iso is already responsible for learning an isometric decoder, the pseudo-inverse encoder (enforced by the loss L_piso) helps it converge to simpler solutions. We ran AE training with and without the L_piso term.
Figure 4 shows in gray the learnt decoder surface, N , without L piso (left), containing extra (unnatural) surface parts compared to the learnt surface with L piso (right). In both cases we expect (and achieve) a decoder approximating an isometry that passes through the input data points. Nevertheless, the pseudo-inverse loss restricts some of the degrees of freedom of the encoder which in turn leads to a simpler solution.
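The edge-length-ratio measure behind Table 1 can be reproduced on toy decoders; the sketch below builds a triangulated grid, computes the ratios l_ij rescaled to mean 1, and reports their standard deviation (the two decoders are illustrative stand-ins, not the trained networks):

```python
import numpy as np

# Std of edge-length ratios l_ij on a triangulated planar grid: ~0 for an
# isometric decoder, clearly positive for a distorting one.
def edge_ratio_std(f, n=10):
    zs = np.stack(np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n)),
                  axis=-1).reshape(-1, 2)
    idx = np.arange(n * n).reshape(n, n)
    edges = [(idx[i, j], idx[i, j + 1]) for i in range(n) for j in range(n - 1)]
    edges += [(idx[i, j], idx[i + 1, j]) for i in range(n - 1) for j in range(n)]
    edges += [(idx[i, j], idx[i + 1, j + 1])
              for i in range(n - 1) for j in range(n - 1)]
    l = np.array([np.linalg.norm(f(zs[a]) - f(zs[b]))
                  / np.linalg.norm(zs[a] - zs[b]) for a, b in edges])
    l /= l.mean()          # fair comparison: rescale so the mean ratio is 1
    return l.std()

iso = lambda z: np.array([z[0], z[1], 0.0])          # isometric embedding in R^3
warp = lambda z: np.array([z[0], 2.0 * z[1], 0.0])   # stretches one axis
print(edge_ratio_std(iso) < 1e-12, edge_ratio_std(warp) > 0.1)
# prints: True True
```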

4.2. DATA VISUALIZATION

In this experiment we evaluate our method on the task of high dimensional data visualization, i.e., reducing high dimensional data to two dimensions. Usually the data is not assumed to lie on a manifold of such low dimension, and it is therefore impossible to preserve all of its geometric properties. A common artifact when squeezing higher dimensional data into the plane is crowding (Maaten & Hinton, 2008), that is, planar embedded points crowd around the origin. We use the same architecture for all autoencoder methods on each dataset. MNIST and FMNIST were evaluated in two scenarios: (i) both encoder and decoder are fully-connected (MLP) networks; and (ii) both encoder and decoder are convolutional neural networks (CNN). For the COIL20 dataset, both encoder and decoder are CNNs. Full implementation details and hyper-parameter values can be found in the Appendix. The results are presented in Figure 5, where each embedded point z is colored by its ground-truth class/label. We make several observations. First, on all the datasets our method is more resilient to crowding than the baseline AEs and provides a more even spread. U-MAP and t-SNE produce better separated clusters. However, this separation can come at a cost: see the COIL20 result (third row) and blow-ups of three of the classes (bottom row). In this dataset we expect evenly spaced points corresponding to the even rotations of the objects in the images. Note (in the blow-ups) that U-MAP maps the three classes on top of each other (non-injectivity of the "encoder"); t-SNE is somewhat better but does not preserve well the distances between pairs of data points (we expect them to be more or less equidistant in this dataset). In I-AE the rings are better separated and the points are more equidistant; the baseline AEs tend to densify the points near the origin.
Lastly, considering the inter and intra-class variations for the MNIST and FMNIST datasets, we are not sure that isometric embeddings are expected to produce strongly separated clusters as in U-MAP and t-SNE (e.g., think about similar digits of different classes and dissimilar digits of the same class with distances measured in euclidean norm).

4.3. DOWNSTREAM CLASSIFICATION

To quantitatively evaluate the unsupervised low-dimensional embeddings computed with I-AE, we performed the following experiment: we trained simple classifiers on the embedded vectors computed by I-AE and baseline AEs and compared their performance (i.e., accuracy). Note that the process of learning the embedding is unsupervised and completely oblivious to the labels, which are used solely for training and testing the classifiers. We evaluate on the same datasets as in Section 4.2: for MNIST and FMNIST we use the standard train-test split, and for COIL20 we split 75%-25% randomly. As AE baselines we take vanilla AE, CAE, DAE and RAE-GP, as described above. We repeat each experiment with 3 different latent dimensions, {16, 64, 256}, and use two simple classification algorithms: a linear support vector machine (SVM) (Cortes & Vapnik, 1995) and K-nearest neighbors (K-NN) with K = 5.
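The downstream K-NN evaluation can be sketched with a plain numpy 5-NN classifier on toy embeddings (the data and helper below are illustrative; the paper's experiments use the actual latent codes):

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_accuracy(train_z, train_y, test_z, test_y, k=5):
    # Plain K-NN classifier on embedded vectors; labels are used only here,
    # never while learning the embedding.
    dists = np.linalg.norm(test_z[:, None, :] - train_z[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, :k]
    pred = np.array([np.bincount(v).argmax() for v in train_y[nn]])
    return (pred == test_y).mean()

# Toy "embedding": two well-separated Gaussian clusters standing in for the
# latent codes of two classes.
z0 = rng.normal(size=(100, 2))
z1 = rng.normal(size=(100, 2)) + 8.0
Z = np.vstack([z0, z1]); y = np.repeat([0, 1], 100)
tr = rng.permutation(200)[:150]; te = np.setdiff1d(np.arange(200), tr)
acc = knn_accuracy(Z[tr], y[tr], Z[te], y[te])
print(acc > 0.95)  # prints: True
```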

4.4. HYPER-PARAMETERS SENSITIVITY

To evaluate the effect of λ_iso on the output, we compared the visualizations and optimized loss values on MNIST and FMNIST, trained with the same CNN architecture as in Section 4.2 with λ_iso ∈ {0, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0}. Figure 6 shows the different visualization results as well as L_rec, L_iso, L_piso as a function of λ_iso. As can be seen, on both datasets the visualizations and losses are stable for λ_iso values between 0.01 and 0.5, with a significant change to the embedding noticeable at 0.75. The trends in the loss values are also rather stable; L_iso and L_piso start very high for the regular AE, i.e., λ_iso = 0, and quickly stabilize. As for L_rec, on FMNIST we see a steady increase, while on MNIST it also starts with a steady increase until λ_iso reaches 0.75 and then becomes noisier, which is also noticeable in the visualizations.

5. CONCLUSIONS

We have introduced I-AE, a regularizer for autoencoders that promotes isometry of the decoder and pseudo-inversion by the encoder. Our goal was two-fold: (i) producing a favorable low dimensional manifold approximation to high dimensional data, isometrically parameterized to preserve, as much as possible, its geometric properties; and (ii) avoiding complex isometric solutions based on the notion of pseudo-inverse. Our regularizers are simple to implement and can easily be incorporated into existing autoencoder architectures. We have tested I-AE on common manifold learning tasks, demonstrating the usefulness of isometric autoencoders. An interesting avenue for future work is to consider task (ii) from section 1, namely incorporating I-AE losses in a probabilistic model and examining the potential benefits of the isometry prior for generative models. One motivation is the fact that isometries push probability distributions forward by a simple change of coordinates, P(z) = P(f(z)).
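The closing observation follows from the standard change-of-variables formula; in the notation of Section 3:

```latex
% Pushing a latent density P through an injective decoder f with
% differential df(z) rescales it by the Jacobian volume factor:
\[
  P\big(f(z)\big) \;=\; \frac{P(z)}{\sqrt{\det\!\big(df(z)^\top df(z)\big)}} .
\]
% For an isometric decoder, df(z)^T df(z) = I_d by equation (2), so the
% volume factor equals 1 and the density is preserved: P(f(z)) = P(z).
```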



Figure 1: Top: I-AE; bottom: CAE.

Figure 2: I-AE.

Figure 3: Evaluation of 3D → 2D embeddings.

For fairness in evaluation, all methods were trained using the same training hyper-parameters. See the Appendix for the complete experiment details, including mathematical formulations of the different AE regularizers. In addition, we compared against popular classic manifold learning techniques: U-MAP (McInnes et al., 2018), t-SNE (Maaten & Hinton, 2008) and LLE (Roweis & Saul, 2000).

Figure 4: Decoder surfaces without L piso (left) and with (right).

Figure 5: Results of the data visualization experiment. Different colors indicate different ground truth labels/classes. Top shows MNIST: FC architecture of the encoder/decoder (top row), and CNN (bottom row); middle shows FMNIST: FC (top row), and CNN (bottom row); bottom shows COIL20 with CNN architecture, where zoom-ins of 3 classes are shown in the bottom row.

Figure 6: Sensitivity to hyper-parameters. Top: visualizations of the MNIST (1st row) and FMNIST (2nd row) datasets trained with different λ_iso values. Bottom: plots of the final train losses as a function of λ_iso; left to right: L_rec (linear scale), L_iso (log scale), and L_piso (log scale).

Figure 7: CelebA reconstructions.

Table 1: Std of {l_ij}.

Table 2 logs the results. For both types of classifiers, I-AE outperforms the baseline AEs in almost all combinations, with the SVM experiments demonstrating larger margins in favor of I-AE. The K-NN results indicate that the Euclidean metric captures similarity in our embedding, and the SVM results, especially on the MNIST and COIL20 datasets, indicate that I-AE is able to embed the data in an arguably simpler, linearly separable manner. The very high classification rates on COIL20 are probably due to the size and structure of this dataset. Nevertheless, with SVM, already in 16 dimensions I-AE provides an accuracy of 95%, a 5% margin over second place.
Table 2: Downstream classification experiment. Both tables indicate accuracy in [0, 1]. Left: results with a linear SVM classifier; right: results of a K-NN classifier with K = 5. The top performance scores are highlighted with colors: first, second and third.

High dimensional generalization experiment architectures.

Manifold approximation quality on test images. We log the L_2 and FID distances (lower is better) from reconstructed images to the input images. The L_2 numbers are reported ×10^3. The top performance scores are highlighted as: first, second.

APPENDIX

Next, we evaluate how well our suggested isometric prior induces manifolds that generalize well to unseen data. We experimented with three different image datasets: MNIST (LeCun, 1998); CIFAR10 (Krizhevsky et al., 2009); and CelebA (Liu et al., 2015). We quantitatively estimate method performance by measuring the L_2 distance and the Fréchet Inception Distance (FID) (Heusel et al., 2017) on a held-out test set. For each dataset, we used the official train-test splits. For comparison against baselines we selected the following relevant existing AE-based methods: vanilla AE (AE); autoencoder trained with weight decay (AEW); contractive autoencoder (CAE); autoencoder with spectral weight normalization (RAE-SN); and autoencoder with L_2 regularization on the decoder weights (RAE-L2). RAE-L2 and RAE-SN were recently successfully applied to this data in (Ghosh et al., 2020), demonstrating state-of-the-art performance on this task. In addition, we compare against the Wasserstein Auto-Encoder (WAE) (Tolstikhin et al., 2018), chosen as a state-of-the-art generative autoencoder. For evaluation fairness, all methods were trained using the same training hyper-parameters: network architecture, optimizer settings, batch size, number of epochs for training and learning rate scheduling. See the appendix for specific hyper-parameter values. In addition, we generated a validation set out of the training set using 10k samples for the MNIST and CIFAR-10 experiments, whereas for the CelebA experiment we used the official validation set.
For each training epoch, we evaluated the reconstruction L_2 loss on the validation set and chose the final network weights to be those achieving the minimum reconstruction. We experimented with two variants of I-AE regularizers: L_piso and L_piso + L_iso. Table 7 logs the results. Note that I-AE produced results competitive with the current SOTA on this task.
Architecture. For all methods, we used an autoencoder with convolutional and convolutional transpose layers. We trained with Adam (Kingma & Ba, 2014), setting a learning rate of 0.0005 and a batch size of 100. The I-AE parameter was set to λ_iso = 0.1.

