ON THE IMPORTANCE OF DISTRACTION-ROBUST REPRESENTATIONS FOR ROBOT LEARNING

Abstract

Representation Learning methods can enable the application of Reinforcement Learning algorithms when the high dimensionality of a robot's perceptions would otherwise prove prohibitive. Consequently, unsupervised Representation Learning components often feature in robot control algorithms that assume high-dimensional camera images as the principal source of information. In their design and performance, these algorithms often benefit from the controlled nature of the simulation or laboratory conditions they are evaluated in. However, these settings fail to acknowledge the stochasticity of most real-world environments. In this work, we introduce the concept of Distraction-Robust Representation Learning. We argue that environment noise and other distractions require learned representations to encode the robot's expected perceptions rather than the observed ones. Our experimental evaluations demonstrate that representations learned with a traditional dimensionality reduction algorithm are strongly susceptible to distractions in a robot's environment. We propose an Encoder-Decoder architecture that produces representations that allow the learning outcomes of robot control tasks to remain unaffected by these distractions.

1. INTRODUCTION

Representation Learning techniques form an integral part of many Reinforcement Learning (RL) robot control applications (Lesort et al., 2018). Utilising low-dimensional representations can allow tasks to be learned faster and more efficiently than when using high-dimensional sensor information directly (Munk et al., 2016). This is particularly useful in vision-based learning, where high-dimensional images of the robot's environment are the principal source of information available to the learning algorithm (Zhu et al., 2020). Most commonly, representations are learned by applying dimensionality reduction techniques such as Autoencoders (AEs) (Hinton & Salakhutdinov, 2006) or Variational Autoencoders (VAEs) (Kingma & Welling, 2014) to the robot's sensory data (Lange et al., 2012; Zhu et al., 2020).

Generally, an AE consists of two Neural Networks, an Encoder E and a Decoder D. The Encoder attempts to condense all available information in the input data x into a latent representation z, from which the Decoder generates a reconstruction of the inputs, D(E(x)). When the dimensionality of the representation is smaller than that of the input data, some information is lost when creating the representations. An AE is typically trained to shrink the magnitude of this information loss by minimising a reconstruction error, commonly given by the squared norm of differences,

    L_AE = ||x - D(E(x))||_2^2 .    (1)

However, optimising the reconstruction error in Eq. 1 does not necessarily produce representations that are optimal for use in robot learning algorithms. For example, an accurate reconstruction of the decorative patterns on a dinner plate is less important than the plate's dimensions to a robot learning to place it into a cabinet. It can therefore be desirable to control which aspects of the information contained in the inputs are most critical to preserve in the representations. For instance, Pathak et al. (2017) design a Neural Network that learns representations from visual inputs by using them to predict the action taken by the RL agent in its state transition. By asking the network to predict the action, the authors eliminate the requirement for the representations to retain any state information that is unrelated to the agent's behaviour.
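The reconstruction objective in Eq. 1 can be sketched in a few lines. The snippet below is a minimal illustration, not a trained model: the linear encoder and decoder weights (`W_enc`, `W_dec`) and all dimensions are hypothetical placeholders, and a real AE would learn them by gradient descent on this very loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 16-dimensional inputs, 4-dimensional latent z.
input_dim, latent_dim = 16, 4

# Placeholder linear Encoder E and Decoder D (randomly initialised, untrained).
W_enc = rng.standard_normal((latent_dim, input_dim)) * 0.1
W_dec = rng.standard_normal((input_dim, latent_dim)) * 0.1

def encode(x):
    """E(x): compress the input into the latent representation z."""
    return W_enc @ x

def decode(z):
    """D(z): reconstruct the input from the latent representation."""
    return W_dec @ z

def reconstruction_error(x):
    """Squared-norm reconstruction error ||x - D(E(x))||_2^2 from Eq. 1."""
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))

x = rng.standard_normal(input_dim)
loss = reconstruction_error(x)
```

Because `latent_dim < input_dim`, the round trip through `z` cannot be lossless in general; training drives the weights to preserve whichever input information most reduces this error, regardless of its relevance to a downstream control task.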

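The action-prediction idea described above can likewise be sketched as a loss computation. The following is a simplified, hypothetical rendering of that objective, not Pathak et al.'s implementation: the feature dimension, action count, and the single linear layer `W` are illustrative assumptions, and `phi_s` stands in for a learned representation of a visual state.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: 8-dimensional state features, 4 discrete actions.
feat_dim, n_actions = 8, 4

# Placeholder linear model predicting the action from two consecutive
# state representations phi(s_t) and phi(s_{t+1}).
W = rng.standard_normal((n_actions, 2 * feat_dim)) * 0.1

def action_prediction_loss(phi_s, phi_s_next, action):
    """Cross-entropy between the predicted action distribution and the action
    actually taken. Minimising this encourages phi to retain only state
    information that is relevant to the agent's behaviour."""
    logits = W @ np.concatenate([phi_s, phi_s_next])
    log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
    return float(-log_probs[action])

phi_s = rng.standard_normal(feat_dim)
phi_s_next = rng.standard_normal(feat_dim)
loss = action_prediction_loss(phi_s, phi_s_next, action=2)
```

In a full system both `W` and the representation network producing `phi` would be trained jointly, so that gradients from the action-prediction error shape the representations themselves.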
