TEMPORAL DISENTANGLEMENT OF REPRESENTATIONS FOR IMPROVED GENERALISATION IN REINFORCEMENT LEARNING

Abstract

Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can change many pixels in the image. The changed pixels can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled image representations exploiting the sequential nature of RL observations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure of the representation, our experiments also show that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).

1. INTRODUCTION

Real-world environments are often not static and deterministic, but are subject to change, whether incremental or sudden (Luck et al., 2017). Reinforcement Learning (RL) algorithms need to be robust to these changes and adapt to them quickly. Moreover, since many real-world robotics applications rely on images as inputs (Vecerik et al., 2021; Hämäläinen et al., 2019; Chebotar et al., 2019), RL agents need to learn robust representations of images that remain useful after a change in the environment. For example, a simple change in lighting conditions can change the perceived colour of an object, but this should not affect the agent's ability to perform a task.

One of the reasons RL agents fail to generalise to unseen values of environment variables, such as colours and object positions, is that they overfit to the variations seen in training (Zhang et al., 2018). The issue is especially problematic for image-based RL, where a change in one environment variable can present the agent with a very different set of pixels, for which trained RL policies are often no longer optimal. This failure to generalise occurs both for variables that are irrelevant to the optimal policy, such as background colours, and for relevant variables, such as goal positions (Kirk et al., 2022). In practice, this often results in agents needing to adapt their policy after a change to only one variable. We show experimentally that agents often cannot recover the optimal policy after the environment changes because it is too difficult to 'undo' the overfitting.

One approach to tackling the generalisation issue is to use domain randomisation during training to maximise the environment variations observed (Cobbe et al., 2019; Chebotar et al., 2019). However, in practice, we may be unaware of what variations an agent might see in the future.
Even if all possible future variations are known, training on the full set is often sample-inefficient and may result in a sub-optimal policy as the agent learns to compensate for many different values. It can also be impractical to generate all variations during training due to limitations of laboratory setups or simulation suites. An alternative to domain randomisation is to learn robust representations that generalise to unseen variations (Kirk et al., 2022), but learning such representations remains an open problem (Lan et al., 2022). Some approaches aim to learn a representation that is invariant to image distractors (Zhang et al., 2020; 2021), improving generalisation to unseen values of variables that are irrelevant to the optimal policy only. These approaches do not enforce any structure on the remaining task-relevant variables in the representation, and so do not support generalisation to unseen values of relevant variables.

A promising direction towards robust RL is to learn a disentangled representation of observations. Disentanglement techniques aim to learn robust representations for both task-relevant and task-irrelevant variables by separating the distinct, informative factors of variation in an image into the unknown ground-truth factors that generated it (Bengio et al., 2013). When one factor of variation changes to a previously unseen value, such as a new colour, many pixels in the image change, but only a subset of features in a disentangled representation will change. The RL agent can still rely on the remaining unchanged features in the representation to adapt quickly, allowing generalisation and performance recovery similar to state-based RL. Higgins et al. (2017b) show that a disentangled representation improves generalisation in RL. However, this approach requires a trained or hard-coded policy to collect independent and identically distributed (i.i.d.)
data to pre-train a β-VAE (Higgins et al., 2017a) offline, and it has since been proven that it is theoretically impossible to learn a disentangled representation from i.i.d. data alone (Locatello et al., 2019).

We introduce a self-supervised auxiliary task for learning disentangled representations for the robust encoding of images, which we call TEmporal Disentanglement (TED). Unlike previous work, TED can be implemented with only minimal changes to existing RL algorithms and allows for lifelong learning of disentangled representations. In contrast to Higgins et al. (2017b), our approach uses the non-i.i.d. temporal data from consecutive timesteps in RL to learn the disentangled representation online. Note that TED does not require a decoder, which reduces computational cost.

We provide experimental results across a variety of tasks from the DeepMind Control Suite (Tunyasuvunakool et al., 2020), Panda Gym (Gallouédec et al., 2021) and Procgen (Cobbe et al., 2020) environments. For our experiments, we train on a subset of values of some of the environment variables (such as colours), then evaluate generalisation on a test environment with unseen values of those variables, and continue training to demonstrate adaptation to the new environment. Our results demonstrate that TED improves the generalisation of a variety of base RL algorithms to unseen values of environment variables, whether relevant or irrelevant to the task, while state-of-the-art baselines that achieve equally good training performance still fail to adapt and, in some cases, are unable to recover after overfitting to the training environment. We also evaluate a disentanglement metric (Higgins et al., 2017a) to demonstrate that our approach increases the extent to which the learned representation disentangles the factors of variation in the image observations, compared to baselines.

2. GENERALISATION IN IMAGE-BASED REINFORCEMENT LEARNING

Image augmentation. Image augmentation artificially increases the size of the dataset by adding image perturbations to improve the robustness of representations. Laskin et al. (2020a) apply a variety of image augmentation techniques such as translation, cutouts and cropping; Yarats et al. (2021) average over multiple augmentations; Hansen & Wang (2021) maximise mutual information between the representations of augmented and non-augmented images; and Hansen et al. (2021) use both augmented and non-augmented images to stabilise Q-value estimation. However, image augmentation approaches to generalisation can still fail when the agent experiences stronger types of variation after training (Kirk et al., 2022). In our experiments, we show that TED can be used alongside image augmentation techniques to further improve generalisation while benefiting from the augmentation.

Learning invariant representations. Invariance techniques aim to learn a representation that ignores distractors in the image. Zhang et al. (2020) use causal inference techniques assuming a block MDP structure; Zhang et al. (2021) use a bisimulation metric; and Li et al. (2021) use domain adversarial optimisation to learn a representation invariant to distractors. These approaches all aim to generalise to unseen values of irrelevant variables, e.g. background colours. In contrast, TED uses disentanglement to enforce a structured representation that applies to both relevant and irrelevant variables.

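The pad-and-crop translation augmentation used by the image-augmentation methods discussed above can be illustrated with a minimal sketch. The function name, padding scheme and array layout here are illustrative assumptions, not any particular paper's implementation:

```python
import numpy as np

def random_crop(imgs, pad=4, rng=None):
    """Pad each image by `pad` pixels on every side (replicating edge
    pixels), then crop back to the original size at a random offset.
    This yields a small random translation of the image content."""
    if rng is None:
        rng = np.random.default_rng()
    n, h, w, c = imgs.shape  # batch of images in (N, H, W, C) layout
    padded = np.pad(imgs, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(imgs)
    for i in range(n):
        # offsets in [0, 2 * pad], sampled independently per image
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        out[i] = padded[i, top:top + h, left:left + w]
    return out

# usage: augment a batch of 84x84 RGB observations
batch = np.zeros((8, 84, 84, 3), dtype=np.float32)
augmented = random_crop(batch, pad=4)
```

Because the crop is re-sampled every time the batch is drawn, the agent never sees exactly the same pixels twice, which is what provides the regularisation effect.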

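The core idea behind a TED-style temporal auxiliary task can be sketched as follows. This is a hedged illustration only: the linear stand-in encoder, the per-dimension pair scorer and the binary cross-entropy objective are assumptions for exposition, not the authors' exact architecture or loss. The ingredient the sketch does take from the text is the use of non-i.i.d. temporal structure: consecutive observation pairs are labelled positive, while pairs built from unrelated observations are labelled negative, and each latent dimension is scored independently to encourage a disentangled, per-dimension temporal structure:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs, W):
    """Stand-in encoder: a fixed linear map (a real agent uses a trained CNN)."""
    return obs @ W

def temporal_logits(z_t, z_next, w):
    """Score 'is this pair temporally consecutive?' per latent dimension.
    w has shape (latent_dim, 2): one tiny linear scorer per dimension,
    so each dimension must carry its own temporal signal."""
    per_dim = z_t * w[:, 0] + z_next * w[:, 1]  # (batch, latent_dim)
    return per_dim.mean(axis=-1)                # aggregate per-dimension evidence

def ted_style_loss(z_t, z_next, z_rand, w):
    """Binary cross-entropy: consecutive pairs -> label 1, unrelated pairs -> 0."""
    pos = temporal_logits(z_t, z_next, w)
    neg = temporal_logits(z_t, z_rand, w)
    logits = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-8
    return -np.mean(labels * np.log(probs + eps)
                    + (1.0 - labels) * np.log(1.0 - probs + eps))

# usage: consecutive frames vs a frame drawn from elsewhere in the buffer
obs_t = rng.normal(size=(32, 64))
obs_next = obs_t + 0.1 * rng.normal(size=(32, 64))  # small temporal change
obs_rand = rng.normal(size=(32, 64))                # unrelated observation
W = rng.normal(size=(64, 8))                        # encoder weights
w = rng.normal(size=(8, 2))                         # per-dimension scorer
loss = ted_style_loss(encode(obs_t, W), encode(obs_next, W),
                      encode(obs_rand, W), w)
```

In an actual agent this loss would be added to the RL objective and its gradient propagated into the encoder, so the auxiliary task shapes the representation without requiring a decoder.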