AUTOENCODER IMAGE INTERPOLATION BY SHAPING THE LATENT SPACE

Anonymous

Abstract

Autoencoders represent an effective approach for computing the underlying factors characterizing datasets of different types. The latent representations of autoencoders have been studied in the context of enabling interpolation between data points by decoding convex combinations of latent vectors. This interpolation, however, often leads to artifacts or produces unrealistic results during reconstruction. We argue that these incongruities arise from the structure of the latent space: naively interpolated latent vectors deviate from the data manifold. In this paper, we propose a regularization technique that shapes the latent representation to follow a manifold consistent with the training images and that drives the manifold to be smooth and locally convex. This regularization not only enables faithful interpolation between data points, as we show herein, but can also serve as a general regularization technique to avoid overfitting or to produce new samples for data augmentation.

1. INTRODUCTION

Given a set of data points, data interpolation or extrapolation aims at predicting novel data points between given samples (interpolation) or outside the sample range (extrapolation). Faithful interpolation between sampled data can be seen as a measure of the generalization capacity of a learning system (Berthelot et al., 2018). In computer vision and computer graphics, data interpolation may refer to generating novel views of an object between two given views, or to predicting in-between animation frames from key frames. Interpolation that produces novel views of a scene requires input such as the geometric and photometric parameters of the existing objects, the camera parameters, and additional scene components such as lighting and the reflective characteristics of nearby objects. Unfortunately, these characteristics are not always available or are difficult to extract in real-world scenarios. In such cases, we can instead apply data-driven interpolation, deduced from a dataset sampled from the scene under various acquisition parameters. The task of data interpolation is then to extract new samples (possibly continuously) between known data samples. Clearly, linear interpolation between two images in the input (image) domain does not work, as it merely cross-dissolves the intensities of the two images. Adopting the manifold view of data (Goodfellow et al., 2016; Verma et al., 2018; Bengio et al., 2013), this task can be seen as sampling new data points along the geodesic path between the given points. The problem is that this manifold is unknown in advance and must be approximated from the given data. Alternatively, adopting a probabilistic perspective, interpolation can be viewed as drawing samples from highly probable areas of the data space. One fascinating property of unsupervised learning is the network's ability to reveal the underlying factors controlling a given dataset.
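To make the cross-dissolve failure concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper): linearly blending two binary shapes in pixel space halves their intensities instead of producing a shape halfway between them.

```python
import numpy as np

def lerp_images(img_a, img_b, alpha):
    """Linear interpolation in the input (pixel) domain."""
    return (1.0 - alpha) * img_a + alpha * img_b

# Two toy 4x4 "images": a bright left half and a bright right half.
a = np.zeros((4, 4)); a[:, :2] = 1.0
b = np.zeros((4, 4)); b[:, 2:] = 1.0

mid = lerp_images(a, b, 0.5)
# Every pixel that was bright in either image is now at half intensity:
# a ghosted double exposure rather than an in-between shape.
```

The midpoint is a uniform gray overlay of the two inputs, which is precisely the cross-dissolve effect that motivates interpolating in a learned latent space instead.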
Autoencoders (Doersch, 2016; Kingma & Welling, 2013; Goodfellow et al., 2016; Kramer, 1991; Vincent et al., 2010) represent an effective approach for exposing these factors. Researchers have demonstrated the ability to interpolate between data points by decoding a convex sum of latent vectors (Shu et al., 2018; Mathieu et al., 2016); however, this interpolation often introduces visible artifacts during reconstruction. To illustrate the problem, consider the following example: a scene is composed of a vertical pole at the center of a flat plane (Figure 1, left). A single light source illuminates the scene, and accordingly the pole casts a shadow onto the plane. The position of the light source can vary along the upper hemisphere; hence, the underlying parameters controlling the generated scene are (θ, φ), the elevation and azimuth, respectively. The interaction between the light and the pole produces a cast shadow whose direction and length are determined by the light direction. A set of images of this scene is acquired from a fixed viewing position (from above) under various lighting directions. Our goal in this example is to train a model capable of interpolating between two given images. Figure 1, top row, depicts a set of interpolated images between the source image (left) and the target image (right), where the interpolation is performed in the input domain. As illustrated, the interpolation is not natural, as it produces cross-dissolve effects in the image intensities. Training a standard autoencoder and applying linear interpolation in its latent space generates images that are much more realistic (Figure 1, bottom row). Nevertheless, this interpolation is not perfect, as visible artifacts occur in the interpolated images. The source of these artifacts can be investigated by closely inspecting the 2D manifold embedded in the latent space.
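The latent-space interpolation described above amounts to encoding both endpoints, taking a convex combination of their codes, and decoding the result. A hedged sketch follows, with a toy linear encoder/decoder standing in for the trained networks (the names `encode`, `decode`, and the weight matrix `W` are our illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins for a trained encoder/decoder pair; in the paper's
# setting these would be the learned (nonlinear) autoencoder networks.
W = 0.1 * rng.standard_normal((8, 64))

def encode(x):
    return W @ x          # image (64,) -> latent code (8,)

def decode(z):
    return W.T @ z        # latent code (8,) -> reconstructed image (64,)

def decode_convex_sum(x_a, x_b, alpha):
    """Decode a convex combination of the two latent codes."""
    z = (1.0 - alpha) * encode(x_a) + alpha * encode(x_b)
    return decode(z)

x_a = rng.standard_normal(64)
x_b = rng.standard_normal(64)
mid = decode_convex_sum(x_a, x_b, 0.5)
```

With a real nonlinear decoder, the decoded midpoint generally differs from a pixel-space blend; the artifacts discussed next arise when the interpolated code falls off the learned manifold.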
Figure 2 shows two manifolds embedded in latent spaces: one with the data embedded in a 2D latent space (left plot) and one with the data embedded in a 3D latent space (second plot from the left). In both cases, the manifolds are 2D and are generated using vanilla autoencoders. The grid lines represent the (θ, φ) parameterization. It can be seen that the encoders produce non-smooth and non-convex surfaces in 2D as well as in 3D. Thus, linear interpolation between two data points inevitably produces in-between points outside of the manifold. In practice, the decoded images of such points are unpredictable and may exhibit unrealistic artifacts. This issue is demonstrated in the two right images in Figure 2. When the interpolated point is on the manifold (the empty circle denoted 'A'), a faithful image is generated by the decoder (second image from the right). When the interpolated point departs from the manifold (the circle denoted 'B'), the resulting image is unpredictable (right image). In this paper, we argue that the common statistical view of autoencoders is not appropriate when dealing with data generated from continuous factors. Instead, the manifold structure of continuous data must be considered, taking into account the geometry and shape of the manifold. Accordingly, we propose a new interpolation regularization mechanism consisting



Figure 1: Left: A vertical pole casting a shadow. Yellow blocks-top row: Cross-dissolve phenomena as a result of linear interpolation in the input space. Yellow blocks-bottom row: Image reconstruction obtained by a linear latent space interpolation of an autoencoder. Unrealistic artifacts are introduced.

Figure 2: The latent manifold of the data embedded in 2D latent space (leftmost plot) and 3D latent space (second plot from the left) learned by vanilla autoencoders. Gridlines represent the (θ, φ) parameterization. The second image from the right was generated from the latent point denoted 'A'. The rightmost image was generated from the latent point denoted 'B'.
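Why a point like 'B' decodes badly can be illustrated numerically. In the sketch below (our own construction, not the paper's experiment), training codes lie on a unit circle, a curved 1-D manifold in a 2-D latent space; the linear midpoint of two codes falls measurably off that circle, which is exactly the situation of point 'B' in Figure 2:

```python
import numpy as np

# Training embeddings sampled from a curved latent manifold (unit circle).
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
z_train = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def off_manifold_distance(z, z_train):
    """Distance from a latent point to its nearest training embedding --
    a crude proxy for how far an interpolant strays from the manifold."""
    return np.min(np.linalg.norm(z_train - z, axis=1))

z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
z_mid = 0.5 * (z1 + z2)                    # linear interpolant, like point 'B'
d = off_manifold_distance(z_mid, z_train)  # about 0.29: clearly off the circle
```

The endpoints sit on the manifold (distance zero), while their linear midpoint does not; shaping the latent manifold to be smooth and locally convex, as the proposed regularization aims to do, keeps such convex combinations close to it.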

