GUIDING REPRESENTATION LEARNING IN DEEP GENERATIVE MODELS WITH POLICY GRADIENTS

Anonymous

Abstract

Variational Auto Encoders (VAEs) provide an efficient latent space representation of complex data distributions, learned in an unsupervised fashion. Using such a representation as input to Reinforcement Learning (RL) approaches may reduce learning time, enable domain transfer, or improve the interpretability of the model. However, current state-of-the-art approaches that combine VAEs with RL fail to learn well-performing policies on certain RL domains. Typically, the VAE is pre-trained in isolation and may omit the embedding of task-relevant features due to insufficiencies of its loss. As a result, the RL approach cannot successfully maximize the reward on these domains. Therefore, this paper investigates the issues of joint training approaches and explores the incorporation of policy gradients from RL into the VAE's latent space to find a task-specific latent space representation. We show that using pre-trained representations can leave policies unable to learn any rewarding behaviour in these environments. Subsequently, we introduce two types of models which overcome this deficiency by using policy gradients to learn the representation. Thereby, the models are able to embed features into their representation that are crucial for performance on the RL task but would not have been learned with previous methods.



The goal of representation learning is to learn a suitable representation for a given application domain. Such a representation should contain useful information for a particular downstream task and capture the distribution of explanatory factors (Bengio et al. (2013)). Typically, the choice of a downstream task influences the choice of method for representation learning. While Generative Adversarial Networks (GANs) are frequently used for tasks that require high-fidelity reconstructions or the generation of realistic new data, auto-encoder-based methods have been more common in RL. Recently, many such approaches employed the Variational Auto Encoder (VAE) (Kingma & Welling (2013)) framework, which aims to learn a smooth representation of its domain. Most of these approaches follow the same pattern: first, they build a dataset of states from the RL environment; second, they train the VAE on this static dataset; and lastly, they train the RL model using the VAE's representation. While this procedure generates sufficiently good results for certain scenarios, there are some fundamental issues with this method. Such an approach assumes that it is possible to collect enough data and observe all task-relevant states in the environment without knowing how to act in it. As a consequence, when learning to act, the agent will only have access to a representation that is optimized for the known and visited states. As soon as the agent becomes more competent, it might experience novel states that have not been visited before and for which there is no good representation (in the sense that the experienced states are out of the originally learned distribution and the mapping is not appropriate).
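To make the loss insufficiency concrete, the standard VAE objective (the negative ELBO) is a reconstruction term plus a KL regularizer toward the prior. The following is a minimal NumPy sketch of these per-sample terms, assuming a diagonal Gaussian posterior and a squared-error reconstruction term (up to constants); the function names are illustrative, not from the paper. Nothing in this objective weights features by task relevance, which is the deficiency the text discusses.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    # summed over the latent dimensions.
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)

def vae_loss(x, x_recon, mu, logvar):
    # Negative ELBO: reconstruction error plus KL regularizer.
    # The reconstruction term treats every input dimension equally,
    # so a small but task-critical feature contributes almost nothing
    # to the loss and may be dropped from the representation.
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # Gaussian likelihood up to constants
    return recon + kl_to_standard_normal(mu, logvar)

# A posterior that matches the prior exactly incurs zero KL cost.
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))  # -> 0.0
```

A policy-gradient signal injected into this latent space, as proposed in the paper, effectively adds a task-dependent term that this purely reconstructive objective lacks.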



Reinforcement Learning (RL) gained much popularity in recent years by outperforming humans in games such as Atari (Mnih et al. (2015)), Go (Silver et al. (2016)) and StarCraft II (Vinyals et al. (2017)). These results were facilitated by combining novel machine learning techniques such as deep neural networks (LeCun et al. (2015)) with classical RL methods. The RL framework has shown itself to be quite flexible and has been applied successfully in many further domains, for example, robotics (Andrychowicz et al. (2020)), resource management (Mao et al. (2016)) or physiologically accurate locomotion (Kidziński et al. (2018)).

