TEMPORAL DISENTANGLEMENT OF REPRESENTATIONS FOR IMPROVED GENERALISATION IN REINFORCEMENT LEARNING

Abstract

Reinforcement Learning (RL) agents are often unable to generalise well to environment variations in the state space that were not observed during training. This issue is especially problematic for image-based RL, where a change in just one variable, such as the background colour, can alter many pixels in the image. The changed pixels can lead to drastic changes in the agent's latent representation of the image, causing the learned policy to fail. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that exploits the sequential nature of RL observations to learn disentangled image representations. We find empirically that RL algorithms utilising TED as an auxiliary task adapt more quickly to changes in environment variables with continued training, compared to state-of-the-art representation learning methods. Since TED enforces a disentangled structure on the representation, our experiments also show that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).

1. INTRODUCTION

Real-world environments are often not static and deterministic, but can be subject to change, both incremental and sudden (Luck et al., 2017). Reinforcement Learning (RL) algorithms need to be robust to these changes and adapt to them quickly. Moreover, since many real-world robotics applications rely on images as inputs (Vecerik et al., 2021; Hämäläinen et al., 2019; Chebotar et al., 2019), RL agents need to learn robust representations of images that remain useful after a change in the environment. For example, a simple change in lighting conditions can change the perceived colour of an object, but this should not affect the agent's ability to perform a task.

One of the reasons RL agents fail to generalise to unseen values of environment variables, such as colours and object positions, is that they overfit to the variations seen during training (Zhang et al., 2018). The issue is especially problematic for image-based RL, where a change in a single environment variable can present the agent with a very different set of pixels, for which trained RL policies are often no longer optimal. This failure to generalise occurs both for variables that are irrelevant to the optimal policy, such as background colours, and for relevant variables, such as goal positions (Kirk et al., 2022). In practice, this often results in agents needing to adapt their policy after a change to only one variable. We show experimentally that agents often cannot recover the optimal policy after the environment changes because it is too difficult to 'undo' the overfitting.

One approach to tackling the generalisation issue is to use domain randomisation during training to maximise the environment variations observed (Cobbe et al., 2019; Chebotar et al., 2019). However, in practice, we may be unaware of what variations an agent might see in the future. Even if all possible future variations are known, training on this full set is often sample inefficient and may

