GUIDING REPRESENTATION LEARNING IN DEEP GENERATIVE MODELS WITH POLICY GRADIENTS
Anonymous

Abstract

Variational Auto Encoders (VAEs) provide an efficient latent space representation of complex data distributions that is learned in an unsupervised fashion. Using such a representation as input to Reinforcement Learning (RL) approaches may reduce learning time, enable domain transfer, or improve the interpretability of the model. However, current state-of-the-art approaches that combine VAEs with RL fail to learn well-performing policies on certain RL domains. Typically, the VAE is pre-trained in isolation and may omit the embedding of task-relevant features due to insufficiencies of its loss. As a result, the RL approach cannot successfully maximize the reward on these domains. This paper therefore investigates the issues of joint training approaches and explores the incorporation of policy gradients from RL into the VAE's latent space to find a task-specific latent space representation. We show that using pre-trained representations can leave policies unable to learn any rewarding behaviour in these environments. Subsequently, we introduce two types of models that overcome this deficiency by using policy gradients to learn the representation. The models are thereby able to embed features into their representations that are crucial for performance on the RL task but would not have been learned with previous methods.



1. INTRODUCTION

The goal of representation learning is to learn a suitable representation for a given application domain. Such a representation should contain useful information for a particular downstream task and capture the distribution of explanatory factors (Bengio et al. (2013)). Typically, the choice of downstream task influences the choice of method for representation learning. While Generative Adversarial Networks (GANs) are frequently used for tasks that require high-fidelity reconstructions or the generation of realistic new data, auto-encoder based methods have been more common in RL. Recently, many such approaches employed the Variational Auto Encoder (VAE) (Kingma & Welling (2013)) framework, which aims to learn a smooth representation of its domain. Most of these approaches follow the same pattern: first, they build a dataset of states from the RL environment; second, they train the VAE on this static dataset; and lastly, they train the RL model using the VAE's representation.

While this procedure generates sufficiently good results in certain scenarios, it has some fundamental issues. It assumes that it is possible to collect enough data and observe all task-relevant states in the environment without knowing how to act in it. As a consequence, when learning to act, the agent only has access to a representation that is optimized for the known and visited states. As soon as the agent becomes more competent, it may experience novel states that have not been visited before and for which there is no good representation (in the sense that the experienced states lie outside the originally learned distribution and the mapping is not appropriate). For example, small objects in pixel space are ignored, as they affect a reconstruction-based loss only marginally. Thus, any downstream task using such a representation has no access to information about such objects. A good example of such a task is Atari Breakout, a common RL benchmark.
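To see why a pixel-wise reconstruction loss can ignore a small object, consider a minimal numerical sketch (the frame size and object size are illustrative choices, not the paper's exact preprocessing): dropping a few bright pixels from the reconstruction changes the mean squared error only by a vanishing fraction of its possible range.

```python
import numpy as np

# Hypothetical 84x84 grayscale frame with a 2x2 "ball" of bright pixels.
frame = np.zeros((84, 84))
frame[70:72, 60:62] = 1.0

recon_no_ball = np.zeros((84, 84))  # reconstruction that omits the ball

def mse(a, b):
    return ((a - b) ** 2).mean()

loss_no_ball = mse(frame, recon_no_ball)
# Only 4 of 84*84 = 7056 pixels differ, so the penalty for dropping the
# ball is 4/7056, i.e. well under 0.1% of the maximum per-pixel error.
print(loss_no_ball)
```

A gradient-based optimizer therefore has almost no incentive to spend latent capacity on the ball, even though it is the single most task-relevant object in Breakout.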
Figures 1a and 1b show an original Breakout frame and its reconstruction. While the original frame contains the ball in the lower right-hand corner, this crucial feature is missing completely from the reconstruction. We approach this issue by learning the representation and the RL task simultaneously, that is, by combining the training of both models. This removes the need to collect data before knowing the environment, as it combines the VAE and RL objectives; in consequence, the VAE has an incentive to represent features that are relevant to the RL model. The main contributions of this paper are as follows: First, we show that combined learning is possible and that it yields well-performing policies. Second, we show that jointly trained representations incorporate additional, task-specific information, which allows an RL agent to achieve higher rewards than if it were trained on a static representation. This is shown indirectly by comparing achieved rewards, as well as directly through an analysis of the trained model and its representation.
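The joint objective can be sketched as follows. This is a minimal, hypothetical PyTorch example, not the paper's exact architecture or loss weighting: a toy VAE and a linear policy share the latent code, and a REINFORCE-style policy-gradient term is added to the ELBO so that gradients from the RL objective flow back into the encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only: 16-dim observations, 4-dim latent, 3 actions.
class TinyVAE(nn.Module):
    def __init__(self, obs_dim=16, latent_dim=4):
        super().__init__()
        self.enc = nn.Linear(obs_dim, 2 * latent_dim)  # outputs mu, log_var
        self.dec = nn.Linear(latent_dim, obs_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize
        return self.dec(z), mu, log_var, z

vae = TinyVAE()
policy = nn.Linear(4, 3)  # discrete policy on top of the shared latent code

obs = torch.randn(8, 16)  # placeholder batch of observations
recon, mu, log_var, z = vae(obs)

# Standard ELBO terms.
recon_loss = F.mse_loss(recon, obs)
kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).mean()

# REINFORCE-style term; `returns` would come from environment rollouts.
dist = torch.distributions.Categorical(logits=policy(z))
actions = dist.sample()
returns = torch.randn(8)  # placeholder returns
pg_loss = -(dist.log_prob(actions) * returns).mean()

loss = recon_loss + kl + pg_loss
loss.backward()
# The encoder now receives gradients from the policy loss as well,
# giving it an incentive to embed task-relevant features such as the ball.
assert vae.enc.weight.grad is not None
```

The key design point is that the encoder parameters sit on the computation path of both losses; with a purely pre-trained VAE, the policy-gradient term would never reach them.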

2. RELATED WORK

Lange & Riedmiller (2010) explored Auto Encoders (AEs) (Lecun (1987); Bourlard & Kamp (1988); Hinton & Zemel (1994)) as a possible pre-processor for RL algorithms. The main focus of their work was finding good representations for high-dimensional state spaces that enable policy learning. As input, rendered images from the commonly used grid-world environment were used. The agent had to manoeuvre through a discretized map using one of four discrete movement actions per timestep. It received a positive reward once reaching the goal tile and negative rewards elsewhere. The AE bottleneck consisted of only two neurons, which corresponds to the dimensionality of the environment's state. Fitted Q-Iteration (FQI) (Ernst et al. (2005)) was used to estimate the Q-function, upon which the agent then acted ε-greedily. Besides RL, they also used the learned representation to classify the agent's position given an encoding, using a Multi-Layer Perceptron (MLP) (Rumelhart et al. (1985)). For these experiments, they found that adapting the encoder using MLP
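For reference, ε-greedy action selection on top of an estimated Q-function can be sketched as follows (the Q-values below are a stand-in; in Lange & Riedmiller's setup they would come from FQI applied to the two-dimensional encoding):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

rng = np.random.default_rng(0)
q = np.array([0.1, 0.9, 0.3, 0.2])  # Q-values for the four movement actions
action = epsilon_greedy(q, epsilon=0.1, rng=rng)
# With epsilon = 0 the greedy action (index 1 here) is always chosen.
```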



Reinforcement Learning (RL) gained much popularity in recent years by outperforming humans in games such as Atari (Mnih et al. (2015)), Go (Silver et al. (2016)) and Starcraft 2 (Vinyals et al. (2017)). These results were facilitated by combining novel machine learning techniques such as deep neural networks (LeCun et al. (2015)) with classical RL methods. The RL framework has proven quite flexible and has been applied successfully in many further domains, for example robotics (Andrychowicz et al. (2020)), resource management (Mao et al. (2016)) or physiologically accurate locomotion (Kidziński et al. (2018)).

Figure 1: A frame from Atari Breakout. The original image (1a) was passed through a pre-trained VAE, yielding the reconstruction (1b). Note the missing ball in the lower right-hand corner.

