GUIDING REPRESENTATION LEARNING IN DEEP GENERATIVE MODELS WITH POLICY GRADIENTS

Anonymous

Abstract

Variational Auto Encoders (VAE) provide an efficient latent space representation of complex data distributions that is learned in an unsupervised fashion. Using such a representation as input to Reinforcement Learning (RL) approaches may reduce learning time, enable domain transfer or improve interpretability of the model. However, current state-of-the-art approaches that combine VAEs with RL fail to learn well-performing policies on certain RL domains. Typically, the VAE is pre-trained in isolation and may omit the embedding of task-relevant features due to insufficiencies of its loss. As a result, the RL approach cannot successfully maximize the reward on these domains. This paper therefore investigates the issues of joint training approaches and explores the incorporation of policy gradients from RL into the VAE's latent space to find a task-specific latent space representation. We show that using pre-trained representations can leave policies unable to learn any rewarding behaviour in these environments. Subsequently, we introduce two types of models which overcome this deficiency by using policy gradients to learn the representation. Thereby the models are able to embed into their representation features that are crucial for performance on the RL task but would not have been learned with previous methods.

1. INTRODUCTION

Reinforcement Learning (RL) gained much popularity in recent years by outperforming humans in games such as Atari (Mnih et al. (2015)), Go (Silver et al. (2016)) and Starcraft 2 (Vinyals et al. (2017)). These results were facilitated by combining novel machine learning techniques such as deep neural networks (LeCun et al. (2015)) with classical RL methods. The RL framework has proven quite flexible and has been applied successfully in many further domains, for example robotics (Andrychowicz et al. (2020)), resource management (Mao et al. (2016)) or physiologically accurate locomotion (Kidziński et al. (2018)).

The goal of representation learning is to learn a suitable representation for a given application domain. Such a representation should contain useful information for a particular downstream task and capture the distribution of explanatory factors (Bengio et al. (2013)). Typically, the choice of a downstream task influences the choice of method for representation learning. While Generative Adversarial Networks (GAN) are frequently used for tasks that require high-fidelity reconstructions or generation of realistic new data, auto-encoder based methods have been more common in RL. Recently, many such approaches employed the Variational Auto Encoder (VAE) (Kingma & Welling (2013)) framework, which aims to learn a smooth representation of its domain. Most of these approaches follow the same pattern: first, they build a dataset of states from the RL environment; second, they train the VAE on this static dataset; and lastly, they train the RL model using the VAE's representation. While this procedure generates sufficiently good results for certain scenarios, there are some fundamental issues with this method. Such an approach assumes that it is possible to collect enough data and observe all task-relevant states in the environment without knowing how to act in it.
As a consequence, when learning to act, the agent will only have access to a representation that is optimized for the known and visited states. As soon as the agent becomes more competent, it might experience novel states that have not been visited before and for which there is no good representation (in the sense that the experienced states are outside the originally learned distribution and the mapping is not appropriate). Another issue arises from the manner in which the representation is learned. Usually, the VAE is trained in isolation, so it decides what features are learned based on its own objective function and not on what is helpful for the downstream task. Mostly, such a model is tuned for good reconstruction. Without the information from the RL model, such a representation does not reflect what is important for the downstream task. As a consequence, the VAE might omit learning features that are crucial for good performance on the task because they appear negligible with respect to reconstruction (Goodfellow et al. (2016), Chapter 15, Figure 15.5). For example, small objects in pixel-space are ignored as they affect a reconstruction-based loss only marginally. Thus, any downstream task using such a representation will have no access to information about such objects. A good example of such a task is Atari Breakout, a common RL benchmark. Figures 1a and 1b show an original Breakout frame and its reconstruction. While the original frame contains the ball in the lower right-hand corner, this crucial feature is missing completely in the reconstruction. We approach this issue by simultaneously learning representation and RL task, that is, by combining the training of both models. As an advantage, this abolishes the need to collect data before knowing the environment, as it combines VAE and RL objectives. In consequence, the VAE has an incentive to represent features that are relevant to the RL model.
The main contributions of this paper are as follows: First, we show that combined learning is possible and that it yields well-performing policies. Second, we show that jointly trained representations incorporate additional, task-specific information which allows an RL agent to achieve higher rewards than if it were trained on a static representation. This will be shown indirectly by comparing achieved rewards as well as directly through an analysis of the trained model and its representation.

2. RELATED WORK

Lange & Riedmiller (2010) explored Auto Encoders (AE) (Lecun (1987); Bourlard & Kamp (1988); Hinton & Zemel (1994)) as a possible pre-processor for RL algorithms. The main focus of their work was finding good representations for high-dimensional state spaces that enable policy learning. As input, rendered images from the commonly used grid-world environment were used. The agent had to manoeuvre through a discretized map using one of four discrete movement actions per timestep. It received a positive reward once reaching the goal tile and negative rewards elsewhere. The AE bottleneck consisted of only two neurons, which corresponds to the dimensionality of the environment's state. Fitted Q-Iteration (FQI) (Ernst et al. (2005)) was used to estimate the Q-function, upon which the agent then acted ε-greedily. Besides RL, they also used the learned representation to classify the agent's position given an encoding, using a Multi-Layer Perceptron (MLP) (Rumelhart et al. (1985)). For these experiments, they found that adapting the encoder using MLP gradients led to an accuracy of 99.46%. However, they did not apply this approach to their RL task. A compelling example for separate training of meaningful representations is provided by Higgins et al. (2017b), who proposed a framework called DARLA. They trained RL agents on the encoding of a β-VAE (Higgins et al. (2016); Higgins et al. (2017a)) with the goal of zero-shot domain transfer. In their approach, β-VAE and agent were trained separately on a source domain and then evaluated in a target domain. Importantly, source and target domain are similar to a certain extent and only differ in some features, e.g. a blue object in the source domain might be red in the target domain. During training of the β-VAE, the pixel-based reconstruction loss was replaced with a loss calculated in the latent space of a Denoising Auto Encoder (DAE) (Vincent et al. (2008)).
Thereby their approach avoids missing task-relevant feature encodings at the cost of training another model. For one of their evaluation models, they allowed the RL gradients to adapt the encoder. Their results show that subsequent encoder learning improves performance of Deep Q-Learning (DQN) but decreases performance of Asynchronous Advantage Actor-Critic (A3C) (Mnih et al. (2016)). Ha & Schmidhuber (2018) proposed a combination of a VAE, Recurrent Neural Networks (RNN) (Hochreiter & Schmidhuber (1997)) and a simple policy as a controller. They hypothesized that by learning a good representation of the environment and having the ability to predict future states, learning the policy itself becomes a trivial task. As in most other models, the VAE was pre-trained on data collected by a random policy; only the RNN and the controller were trained online. The compressed representation from the VAE was passed into an RNN in order to estimate a probability density for the subsequent state. The controller was deliberately chosen as a single linear layer and could thus be optimized with Covariance Matrix Adaptation - Evolution Strategy (CMA-ES) (Hansen (2006)). This work demonstrated how a VAE can provide a versatile representation that can be utilized in reinforcement learning. In addition, such an approach allows predicting the subsequent encoded state. While these findings encourage the usage of VAEs in conjunction with RL, this is only possible in environments where the state space can be explored sufficiently by a random policy. However, if the policy can only discover important features after acquiring a minimal level of skill, sampling the state space using a random policy will not yield high-performing agents. Learning such features would only be possible if the VAE is continuously improved during policy training. Another interesting combination of VAEs and RL was recently proposed by Yang et al.
(2019), with their so-called Action-Conditional Variational Auto-Encoder (AC-VAE). Their motivation for creating this model was to train a transparent, interpretable policy network. Usually, the β-VAE's decoder is trained to reconstruct the input based on the representation the encoder produced. In this work though, the decoder's objective was to predict the subsequent state s_{t+1}. As input it received the latent space vector z combined with an action-mapping vector, which is the action vector a_t zero-padded to match the latent space's dimensionality. Inspecting the decoder's estimates for s_{t+1} when varying one dimension of the latent space showed that each dimension encoded a possible subsequent state that is likely to be encountered if the corresponding action from this dimension was taken. Unfortunately, the authors did not report any rewards they achieved on Breakout, hence it was not possible for us to compare model performances.

3. COMBINATION OF REINFORCEMENT AND REPRESENTATION LEARNING OBJECTIVES

In this section, we will first revisit the fundamentals of RL and VAEs and discuss their different objective functions. Then, we propose a joint objective function that allows for joint training of both models using gradient descent based learning methods.

3.1. REINFORCEMENT LEARNING WITH POLICY OPTIMIZATION

RL tries to optimize a Markov Decision Process (MDP) (Bellman (1957)) that is given by the tuple ⟨S, A, r, p, γ⟩. S denotes the state space, A the action space and p : S × R × S × A → [0, 1] the environment's dynamics function that, provided a state-action pair, gives the distribution over successor states and rewards. r : S × A → R is the reward function and γ ∈ [0, 1) the scalar discount factor. The policy π_θ(a|s) is a stochastic function that gives a probability distribution over actions for state s. θ denotes the policy's parameter vector, which is typically subject to optimization. A trajectory τ = (s_0, a_0, ..., s_T, a_T) consisting of an alternating sequence of states and actions can be sampled in the environment, where T stands for the final timestep of the trajectory and a_i ∼ π_θ(a_i|s_i). The overarching goal of RL is to find a policy that maximizes the average collected reward over all trajectories. This can be expressed as the optimization problem

max_θ E_{τ∼p(τ)} [ Σ_t r(s_t, a_t) ],

which can also be written in terms of an optimal policy parameter vector

θ* = arg max_θ E_{τ∼p(τ)} [ Σ_t r(s_t, a_t) ].

When trying to optimize the policy directly by searching for θ*, policy optimization algorithms like A3C, Actor-Critic with Experience Replay (ACER) (Wang et al. (2016a)), Trust Region Policy Optimization (TRPO) (Schulman et al. (2015a)) or Proximal Policy Optimization (PPO) (Schulman et al. (2017)) are commonly used. The fundamental idea behind policy optimization techniques is to calculate gradients of the RL objective with respect to the policy parameters:

∇_θ J(θ) = E_{τ∼p(τ)} [ ∇_θ log π_θ(τ) r(τ) ]   (1)

where we defined r(τ) = Σ_{t=0}^{T} r(s_t, a_t) for brevity. However, most policy optimization methods introduce heavy modifications to this vanilla gradient in order to achieve more stable policy updates. Throughout our work, we have used PPO as the RL algorithm because it is quite sample efficient and usually produces stable policy updates.
For an in-depth description of PPO, we refer to appendix A.1 or the original work of Schulman et al. (2017).
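The score-function gradient in equation 1 can be made concrete with a minimal sketch. For a softmax policy over discrete actions, ∇ log π_θ(a) with respect to the logits has the closed form onehot(a) − π_θ, so a single-sample REINFORCE estimate is simply that difference scaled by the return. The function names below are our own illustrative choices, not part of any RL library:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def policy_gradient_estimate(logits, action, ret):
    """Single-sample REINFORCE gradient of log pi(a) * R w.r.t. the logits.

    For a softmax policy, d log pi(a) / d logit_j = 1{j == a} - pi(j),
    so the estimate is (onehot(action) - probs) scaled by the return R.
    """
    probs = softmax(logits)
    return [((1.0 if j == action else 0.0) - p) * ret for j, p in enumerate(probs)]

# With a positive return, probability mass is pushed toward the taken action.
grad = policy_gradient_estimate([0.0, 0.0], action=0, ret=2.0)
```

Methods such as PPO replace this raw estimate with clipped surrogate objectives, but the underlying gradient direction is the same.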

3.2. LEARNING REPRESENTATIONS USING VARIATIONAL AUTO-ENCODERS

Kingma & Welling (2013) introduced the VAE as a method to perform Variational Inference (VI) (Jordan et al. (1999)) using function approximators, e.g. deep neural networks. VI tries to approximate a distribution over the generative factors of a dataset, which would otherwise involve calculating an intractable integral. The authors present an algorithm that utilizes the auto-encoder framework, an unsupervised learning method which learns data encodings by reconstructing its input. To this end, the input is first compressed until it reaches a given size and is afterwards decompressed to its original size. When using deep neural networks, these transformations can be achieved by using, for example, fully connected or convolutional layers. In order for the VAE to approximate a distribution over generative factors, the authors used the so-called "reparametrization trick". It allows gradient-based optimization methods to be used in searching for the distribution parameters. For training the VAE, a gradient-based optimizer tries to minimize the following loss:

L_VAE(x, φ, ψ) = D_KL(q_φ(z|x) || p(z)) − E_{q_φ(z|x)} [ log p_ψ(x|z) ]   with z = l(μ, σ, ε) and ε ∼ p(ε)   (2)

where D_KL denotes the Kullback-Leibler divergence (KL) (Kullback & Leibler (1951)) between the approximated distribution over generative factors produced by the encoder q_φ(z|x) and some prior distribution p(z). The expectation is often referred to as the reconstruction loss and is typically calculated on a per-pixel basis. Lastly, l(μ, σ, ε) is a sampling function that is differentiable w.r.t. the distribution parameters, for example z = μ + σ⊙ε.
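For a single latent dimension with a standard normal prior, both terms of equation 2 have simple closed forms, which the following sketch makes explicit. The function names are illustrative, not from the VAE reference implementation:

```python
import math
import random

def reparameterize(mu, sigma, eps=None):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, 1).

    Sampling noise separately keeps z differentiable w.r.t. mu and sigma.
    """
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

def vae_loss(recon_error, mu, sigma):
    """Negative ELBO for one sample: reconstruction term plus KL regularizer."""
    return recon_error + kl_to_standard_normal(mu, sigma)
```

Note that the KL term vanishes exactly when the encoder outputs the prior (μ = 0, σ = 1), so it acts as a regularizer pulling the posterior toward p(z).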

3.3. JOINT OBJECTIVE FUNCTION

Combining both loss functions such that both models can be trained at the same time is rather straightforward. Adding both individual losses and using an optimizer such as ADAM (Kingma & Ba (2014)) to minimize them is sufficient to achieve joint training. During backpropagation, gradients from the policy and the VAE are combined in the latent space. Due to the different topologies of the networks, the gradient magnitudes differ significantly. Therefore, we introduce the hyperparameter κ, which can be used to either amplify or dampen the gradients, and arrive at the following loss:

L_joint = κ L_PG(θ_k, θ_{k−1}, φ_k) + L_VAE(x, φ, ψ, β)   (3)

where L_PG is some policy gradient algorithm's objective function. As mentioned before, we used PPO's loss L_PPO (equation 4 in the appendix).
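The effect of κ can be seen most directly at the latent space, where the two gradient streams meet during backpropagation. The following is a minimal sketch with our own illustrative function names; in practice the summation happens inside the autodiff framework rather than by hand:

```python
def joint_loss(ppo_loss, vae_loss, kappa=20.0):
    """Joint objective from equation 3: kappa rescales the policy-gradient term."""
    return kappa * ppo_loss + vae_loss

def combine_latent_gradients(g_ppo, g_vae, kappa=20.0):
    """Equivalent view at the latent vector U, where both gradient streams meet:
    the (typically much smaller) policy gradients are amplified by kappa before
    being added to the VAE gradients and backpropagated through the encoder."""
    return [kappa * gp + gv for gp, gv in zip(g_ppo, g_vae)]
```

Since scaling a loss scales its gradients by the same factor, tuning κ on the summed loss is equivalent to rescaling the policy-gradient stream at U.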

4. EXPERIMENTS

In order to test our model with the combined objective function given by equation 3, we have used the well-known benchmark of Atari Breakout. This environment has several properties that make it appealing to use: it is easily understandable by humans, it is often used as an RL task, and the conventional pre-trained methods fail at mastering it. The ball, the most important feature that needs to be encoded in order to perform well, is heavily underrepresented (approximately 0.1% of the observation space). Therefore, the VAE's incentive to encode it is very low, whereas our model succeeds in encoding it. In the following, we compare the pre-trained approach to two different continuously trained models that use the loss from equation 3.

4.1. DATA COLLECTION AND PRE-PROCESSING

The raw RGB image data produced by the environment has a dimensionality of 210 × 160 × 3 pixels. We employ a similar pre-processing as Mnih et al. (2015), but instead of cropping the greyscaled frames, we simply resize them to 84 × 84 pixels. As we will first train models similar to those introduced in previous works with a pre-trained VAE, we needed to construct a dataset containing Breakout states. We used an already trained policy to collect a total of 25,000 frames, the approximate equivalent of 50 episodes.
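The pre-processing step can be sketched as follows. The text only specifies "greyscale and resize", so the ITU-R 601 luminance weights and the nearest-neighbour interpolation below are our assumptions, and `preprocess` is our own name:

```python
import numpy as np

def preprocess(frame, out_size=84):
    """Greyscale a 210x160x3 uint8 Atari frame and resize it to 84x84.

    Assumptions: ITU-R 601 luminance weights for greyscaling and
    nearest-neighbour interpolation for the resize (the paper does not
    specify either). Output values are scaled to [0, 1].
    """
    grey = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    h, w = grey.shape
    rows = np.arange(out_size) * h // out_size  # nearest-neighbour index maps
    cols = np.arange(out_size) * w // out_size
    return grey[np.ix_(rows, cols)] / 255.0

obs = preprocess(np.zeros((210, 160, 3), dtype=np.uint8))
```

A production pipeline would typically delegate the resize to an image library, but the index-map version keeps the sketch dependency-free beyond NumPy.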

4.2. PRE-TRAINING THE VARIATIONAL AUTO-ENCODER

Our first model is based on those of the previously introduced works, which involve isolated pre-training of the VAE on a static dataset.


Figure 2: Model combining PPO and a VAE. Depending on the model configuration, the colored parts are trained differently. X is the VAE's input and X̂ the reconstructions. PPO receives the mean vectors U as input and calculates a distribution over actions π. Note that we use capital letters in the VAE to emphasize that we pass n frames at the same time when a policy is trained.

During policy training, we use n instances of the same encoder with shared weights that receive a sequence of the last n frames as input. Stacking allows us to incorporate temporal information and enables the policy to predict the ball's trajectory. By sharing the weights, we ensure that the resulting encodings originate from the same function. U then represents the concatenated encodings of the sequence. This weight-sharing method has proven more time efficient than querying the encoder n times and concatenating the results afterwards. Prior to policy training, we trained the VAE on the dataset we had collected before, with hyperparameters from table 1. Once pre-training was finished, we discarded the decoder weights and used the stacked encoder as input for the policy MLP. The MLP was then trained for 10M steps with hyperparameters from table 2. During this training, the encoder weights were no longer changed by gradient updates but remained fixed. The second model we introduce is called PPO_adapt; it has the same structure and hyperparameters as the first model. For this model, we also train the VAE in isolation first; however, the encoder weights are no longer fixed during policy training. Gradients from the RL objective are backpropagated through the encoder, allowing it to learn throughout policy training. We hypothesize that features that are important for policy performance can be incorporated into an already learned representation. Figure 3 compares the median rewards of three rollouts with different random seeds for all models.
PPO_fixed was never able to achieve a reward of 10 or higher, while PPO_adapt steadily improved its performance, with final rewards well over 50. The learning curve of PPO_adapt shows that the model is able to learn how to act in the environment, whereas PPO_fixed does not. The non-zero rewards of PPO_fixed are similar to those of random agents in Breakout. From these results, we can conclude that training the VAE in isolation on a static dataset for Breakout results in a deficient representation for RL. Therefore, using policy gradients to adapt an already learned representation can be beneficial in environments where the VAE fails to encode task-relevant features.
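The frame-stacking scheme described above can be sketched as follows. The encoder itself is omitted and encodings are plain Python lists; the class name and interface are our own simplification of the setup:

```python
from collections import deque

class FrameStack:
    """Keeps the encodings of the last n frames; their concatenation U
    is what the policy MLP receives as input."""

    def __init__(self, n):
        self.n = n
        self.frames = deque(maxlen=n)  # oldest encoding is evicted automatically

    def reset(self, encoding):
        """At episode start, repeat the first encoding n times."""
        for _ in range(self.n):
            self.frames.append(list(encoding))
        return self.concat()

    def push(self, encoding):
        """Append the newest encoding and return the updated stack U."""
        self.frames.append(list(encoding))
        return self.concat()

    def concat(self):
        """Flatten the n stored encodings into one vector U."""
        return [x for enc in self.frames for x in enc]
```

In the actual model the n encodings are produced by weight-shared encoder instances in a single forward pass; the deque above only illustrates which frames contribute to U at each step.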

4.3. JOINTLY LEARNING REPRESENTATION AND POLICY

The last model we introduce, PPO_VAE, combines a complete VAE with a policy MLP that receives U, the concatenated state encodings, as input. As opposed to the first two models, all weights are initialized randomly before policy training and the VAE is not pre-trained. With this procedure, an already trained agent that gathers a dataset for the VAE beforehand is not necessary.

Figure 3: Reward of the three proposed models across three random seeds each. PPO_fixed is not able to achieve high rewards while the other two models consistently improve their performance.

The decoder is trained exactly as in the isolated setting, meaning its gradients are also only computed using the VAE's loss function. During backpropagation, the gradients coming from Z and h_1 are added together and passed through the encoder. This model has the same network configuration and hyperparameters as the first two, with the only difference that we also evaluated different values for κ from the joint loss equation 3 (see A.3). For the results reported here, we chose κ = 20. All hyperparameters can be found in table 3. By simultaneously training representation and policy, we expect the VAE to learn task-relevant features from the beginning of training. This assumption is supported by the learning curve shown in figure 3, which compares PPO_VAE to the previous two models. The curve shows a steady increase in reward over the course of training, with PPO_VAE achieving slightly higher rewards than PPO_adapt in the beginning. This changes after less than 1M steps, and from that point on PPO_adapt consistently outperforms PPO_VAE. This difference in performance is likely attributable to the fact that in PPO_VAE the decoder is trained throughout the complete training. While for PPO_adapt the latent space can be changed without restrictions, the decoder of PPO_VAE constantly produces gradients that do not contain information about the ball.
Therefore, PPO_VAE has less incentive to encode the features needed for higher rewards, which is reflected in its performance. In figure 4 we illustrate a pre-processed frame to which we added the values of the Jacobian to the blue channel wherever they were greater than the mean value of the Jacobian. Only visualizing above-mean Jacobian values removes some noise in the blue channel, makes the images much easier to interpret, and highlights only regions of high relevance. We can clearly see that the Jacobian has high values at missing blocks as well as around the ball, meaning that these regions are considered to have high impact on future rewards. By visualizing the Jacobian we have confirmed that the policy gradients encourage the VAE to embed task-relevant features.
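The overlay described above can be sketched as follows. We assume frame and Jacobian are same-shaped arrays with values in [0, 1]; the function name and the exact clipping are our own illustrative choices:

```python
import numpy as np

def highlight_jacobian(frame_grey, jacobian):
    """Add above-mean Jacobian values to the blue channel of a grey frame.

    Mirrors the visualization described for figure 4: values below the
    Jacobian's mean are discarded to suppress noise, so only regions of
    high relevance are highlighted in blue.
    """
    rgb = np.stack([frame_grey] * 3, axis=-1).astype(np.float32)
    mask = jacobian > jacobian.mean()
    rgb[..., 2] = np.where(mask, np.clip(frame_grey + jacobian, 0.0, 1.0), frame_grey)
    return rgb
```

Thresholding at the mean rather than at zero is what removes the low-magnitude background noise mentioned in the text.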

5. CONCLUSION

This paper focused on the issue of pre-training VAEs with the purpose of learning a policy for a downstream task based on the VAE's representation. In many environments, the VAE has little to no incentive to learn task-relevant features if they are small in observation space. Another issue arises if the observation of these features depends on policy performance and, as a result, they are underrepresented in a dataset sampled by a random agent. In both cases, fixing encoder weights during policy training prevents the VAE from learning these important features and policy performance will be underwhelming. We carried out experiments on the popular RL benchmark Atari Breakout. The goal was to analyze whether policy gradients guide representation learning towards incorporating performance-critical features that a VAE would not learn on a pre-recorded dataset. First experiments confirmed that the common pre-trained approach did not yield well-performing policies in this environment. Allowing the policy gradients to adapt encoder weights in two different models showed significant improvements in terms of rewards. With policy gradients guiding the learned representation, agents consistently outperformed those that were trained on a fixed representation. Our work verifies that the fundamental issue with pre-trained representations still exists and shows possible solutions in RL scenarios. Nonetheless, future work can still explore a variety of improvements to our models. For one, training not only the encoder but also the decoder with RL gradients could improve interpretability of the VAE and enable it to be used as a generator again, one that also generates task-relevant features. Another direction is to impose further restrictions on the latent space during joint training of VAE and policy. The goal there would be to maintain the desired latent space characteristics of VAEs while still encoding task-relevant features.

A APPENDIX

A.1 STABLE POLICY LEARNING WITH PROXIMAL POLICY OPTIMIZATION

Most actor-critic algorithms successfully reduce the variance of the policy gradient; however, they show high variance in policy performance during learning and are at the same time very sample inefficient. Natural gradient (Amari (1998)) methods such as TRPO from Schulman et al. (2015a) greatly increase sample efficiency and learning robustness. Unfortunately, they are relatively complicated to implement and are computationally expensive as they require some second-order approximations. PPO (Schulman et al. (2017)) is a family of policy gradient methods that form pessimistic estimates of the policy performance. By clipping and therefore restricting the policy updates, PPO prohibits too large a policy change, as such changes have been found to be harmful to policy performance in practice. PPO is often combined with generalized advantage estimation (Schulman et al. (2015b)), which produces high-accuracy advantage function estimates. The PPO-Clip objective is defined as

J_PPO(θ_k, θ_{k−1}) = E[ min( o(θ) A^{π_{θ_k}}(s, a), clip(o(θ), 1−ε, 1+ε) A^{π_{θ_k}}(s, a) ) ]   s.t. δ_MB < δ_target   (4)

where o(θ) = π_{θ_k}(a|s) / π_{θ_{k−1}}(a|s) denotes the probability ratio of two consecutive policies. This objective is motivated by the hard KL constraint that TRPO enforces on policy updates. Should a policy update result in a policy that deviates too much from its predecessor, TRPO performs a line search along the policy gradient direction that decreases the gradient magnitude. If the constraint is satisfied during the line search, the policy is updated using that smaller gradient step. Otherwise the update is rejected after a certain number of steps. This method requires calculating the second-order derivative of the KL divergence, which is computationally costly. PPO uses its clipping objective to implicitly constrain the deviation of consecutive policies.
In some settings, PPO still suffers from diverging policy updates (OpenAI Spinning Up - PPO (2018)), so we included a hard KL constraint on policy updates. The constraint can be checked analytically after each mini-batch update and is therefore not very computationally demanding. PPO extends the policy gradient objective function from Kakade (2002). With the probability ratio o(θ), we utilize importance sampling in order to use samples collected with any policy to update our current one. Thereby we can reuse samples more often than in other algorithms, making PPO more sample efficient. Using importance sampling, we still have a correct gradient estimate. Combining the new objective with actor-critic methods yields algorithm 1. K denotes the number of optimization epochs per set of trajectories and B denotes the mini-batch size. In the original paper, a combined objective function is also given:

L_PPO(θ_k, θ_{k−1}, φ_k) = E[ c_1 J_PPO(θ_k, θ_{k−1}) − c_2 L_V^{π_θ}(φ_k) + H(π_{θ_k}) ]   s.t. δ_MB < δ_target   (5)

where H(π_{θ_k}) denotes the policy entropy. Encouraging the policy entropy not to decrease too much prevents the policy from specializing on one action. As discussed in OpenAI Spinning Up - PPO (2018), there are two cases for J_PPO(θ_k, θ): the advantage function is either positive or negative. In case the advantage is positive, the objective can be written as:

J_PPO(θ_k, θ) = E[ min(o(θ), 1+ε) A^{π_{θ_k}}(s, a) ]   (6)

A^{π_{θ_k}}(s, a) > 0 indicates that the action yields higher reward than other actions in this state, hence we want its probability π_{θ_k}(a|s) to increase. This increase is clipped once π_{θ_k}(a|s) > π_{θ_{k−1}}(a|s)(1+ε). Note, however, that updates that would worsen policy performance are neither clipped nor bounded.
If the advantage is negative, it can be expressed as:

J_PPO(θ_k, θ) = E[ max(o(θ), 1−ε) A^{π_{θ_k}}(s, a) ]   (7)

This equation behaves conversely to equation 6: A^{π_{θ_k}}(s, a) < 0 indicates that we chose a suboptimal action, thus we want to decrease its probability. Once π_{θ_k}(a|s) < π_{θ_{k−1}}(a|s)(1−ε), the max bounds the magnitude by which the action's probability can be decreased.

for each mini-batch of size B in {τ_i} do
7: Update the policy by maximizing the PPO-Clip objective 4
8: Minimize L_V^{π_θ} on the mini-batch

In equation 3, we introduced the hyperparameter κ to balance VAE and PPO gradients. We found empirically that tuning κ is straightforward and requires only a few trials. To simplify the search for κ, one can evaluate the gradient magnitudes of the different losses at the point where they are merged at U. Our experiments showed PPO's gradients to be significantly smaller, thus scaling up the loss function was appropriate. This will likely differ if the networks are configured differently. Increasing κ from 1 to 10 led to considerably higher rewards; however, the difference in performance was small when increasing κ further to 20. Therefore, we chose κ = 20 in our reported model performances.
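The two advantage cases discussed above collapse into a single per-sample expression, which the following sketch illustrates. The function name is ours; the logic is the standard PPO-Clip surrogate without the additional KL constraint:

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    ratio      -- probability ratio o(theta) = pi_new(a|s) / pi_old(a|s)
    advantage  -- advantage estimate A(s, a)
    epsilon    -- clipping range; 0.2 is a common default, not the paper's value

    The outer min makes the estimate pessimistic: improvements beyond the
    clip range are ignored, while worsening updates pass through unclipped.
    """
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)
```

For positive advantages this reduces to min(o(θ), 1+ε)·A, and for negative advantages to max(o(θ), 1−ε)·A, matching equations 6 and 7.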



Figure 1: A frame from Atari Breakout. The original image 1a was passed through a pre-trained VAE yielding the reconstruction 1b. Note the missing ball in the lower right hand corner.

Figure 2 shows the individual parts of the complete training process. For the first model, PPO_fixed, the encoder and decoder (shown in orange and red) are pre-trained before policy training. During this phase, there is no influence from the RL loss. Once the VAE training is finished, the decoder shown in red in Figure 2 is discarded completely.


Figure 4: The Jacobian of PPO's value function. Highlighted areas mean high importance in terms of future rewards. Note the high Jacobian values around the ball and the blocks.

Algorithm 1: Proximal Policy Optimisation with KL constraint
1: Initialize policy parameters θ_0 and value function parameters φ_0
2: for k = 0, 1, 2, ... do
3: Collect set of trajectories D_k = {τ_i} with π_{θ_k} and compute R̂_t

Figure 5: Performance comparison of PPO_VAE with different values for κ

Table 1: Hyperparameter table for VAE training on Breakout

Table 2: Policy hyperparameters of PPO_fixed and PPO_adapt

Table 3: Policy hyperparameter table of PPO_VAE

A.3 CHOOSING APPROPRIATE VALUES FOR κ

