ON THE IMPORTANCE OF DISTRACTION-ROBUST REPRESENTATIONS FOR ROBOT LEARNING

Abstract

Representation Learning methods can allow the application of Reinforcement Learning algorithms when the high dimensionality of a robot's perceptions would otherwise prove prohibitive. Consequently, unsupervised Representation Learning components often feature in robot control algorithms that assume high-dimensional camera images as the principal source of information. In their design and performance, these algorithms often benefit from the controlled nature of the simulation or laboratory conditions they are evaluated in. However, these settings fail to acknowledge the stochasticity of most real-world environments. In this work, we introduce the concept of Distraction-Robust Representation Learning. We argue that environment noise and other distractions require learned representations to encode the robot's expected perceptions rather than the observed ones. Our experimental evaluations demonstrate that representations learned with a traditional dimensionality reduction algorithm are strongly susceptible to distractions in a robot's environment. We propose an Encoder-Decoder architecture that produces representations which allow the learning outcomes of robot control tasks to remain unaffected by these distractions.

1. INTRODUCTION

Representation Learning techniques form an integral part of many Reinforcement Learning (RL) robot control applications (Lesort et al., 2018). Utilising low-dimensional representations can allow tasks to be learned faster and more efficiently than when using high-dimensional sensor information (Munk et al., 2016). This is particularly useful in vision-based learning, when high-dimensional images of the robot's environment are the principal source of information available to the learning algorithm (Zhu et al., 2020). Most commonly, representations are learned by applying dimensionality reduction techniques such as Autoencoders (AEs) (Hinton & Salakhutdinov, 2006) or Variational Autoencoders (VAEs) (Kingma & Welling, 2014) to the robot's sensory data (Lange et al., 2012; Zhu et al., 2020).

Generally, an AE consists of two Neural Networks, an Encoder E and a Decoder D. The Encoder attempts to condense all available information in the input data x into a latent representation z, from which a reconstruction of the inputs D(E(x)) is generated by the Decoder. When the dimensionality of the representation is smaller than that of the input data, some information is lost when creating the representations. An AE is typically trained to shrink the magnitude of this information loss by minimising a reconstruction error, commonly given by the squared norm of differences,

L_AE = ||x - D(E(x))||_2^2.    (1)

However, optimising the reconstruction error in Eq. 1 does not necessarily produce representations that are optimal for use in robot learning algorithms. For example, an accurate reconstruction of the decorative patterns on a dinner plate is less important than the plate's dimensions to a robot learning to place it into a cabinet. It can therefore be desirable to control which aspects of the information contained in the inputs are most critical to preserve in the representations. For instance, Pathak et al.
(2017) design a Neural Network that learns representations from visual inputs by using them to predict the action taken by the RL agent in its state transition. By asking the network to predict the action, the authors eliminate the requirement for the representations to retain any state information that is unrelated to the agent's behaviour.

A focus on the learned representations' preservation of task-relevant information becomes even more crucial in the presence of distracting influences (DIs) in the environment. These DIs can materialise as additional environment objects whose dynamics are not only uncorrelated with the robot's behaviour but also misleading. For instance, a robot that is learning to move objects to different positions in a room can find the observation of a moving autonomous vacuum cleaner misleading. Alternatively, DIs can impact the dynamics of existing objects in the room. For instance, after the robot has moved a box to a certain position, further movements of the box due to external forces can be distracting to the robot's learning process.

In this paper, we introduce the concept of Distraction-Robust Representation Learning. We investigate the learning outcomes of robot control tasks when DIs are present in the environment. We show that in the presence of DIs, representations learned exclusively from environment observations can mislead the robot's perception of its control over the environment. This finding demonstrates that Distraction-Robust Representation Learning needs to be afforded increased attention. In particular, works in the strand of research that aims to make RL algorithms more applicable to real-world scenarios largely concentrate on improving algorithm attributes such as data efficiency (Zhu et al., 2020). However, few works acknowledge the challenges posed by the inherently stochastic nature of real-world environments and the presence of DIs (Forestier et al., 2017).
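To make the components above concrete, the following sketch implements an Encoder E, a Decoder D, and the reconstruction error of Eq. 1 as plain numpy functions. The dimensions, the single linear layer per network, and the tanh nonlinearity are illustrative assumptions for this sketch, not the architectures used in the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 64-dimensional inputs, 8-dimensional latent space.
input_dim, latent_dim = 64, 8
W_enc = rng.normal(scale=0.1, size=(latent_dim, input_dim))   # Encoder E
W_dec = rng.normal(scale=0.1, size=(input_dim, latent_dim))   # Decoder D

def encode(x):
    return np.tanh(W_enc @ x)      # latent representation z = E(x)

def decode(z):
    return W_dec @ z               # reconstruction D(E(x))

def reconstruction_loss(x):
    """Squared L2 reconstruction error of Eq. 1: ||x - D(E(x))||_2^2."""
    diff = x - decode(encode(x))
    return float(diff @ diff)

x = rng.normal(size=input_dim)
loss = reconstruction_loss(x)
```

In practice the loss would be minimised over a dataset with a gradient-based optimiser; because the latent dimension (8) is smaller than the input dimension (64), a non-zero reconstruction error generally remains, reflecting the information loss discussed above.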
Furthermore, we introduce a Robot Action Encoder-Decoder architecture (RAED) which produces representations that are robust to DIs in the environment. RAED follows the simple but effective approach of using only the values that parameterise the robot's actions as the input to the Encoder; such a set of parameters might, for instance, define a robot controller. The representations produced by the Encoder are used by RAED's Decoder to generate predictions of the environment observations. This design allows static environment elements to be learned by the Decoder while concentrating the information in the representations on the observable consequences of the robot's behaviour. Moreover, when environment observations are distorted by the presence of DIs, RAED produces representations that capture the expected consequences of the robot's environment interactions. This is not the case when representation learning methods such as AEs are trained to reconstruct the full content of the robot's visual perceptions. We can therefore draw parallels between RAED's design and the concept of a forward model (Jordan & Rumelhart, 1992). Given the simplicity of the approach, we expect the applicability of RAED to generalise to various learning algorithms.
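The core idea of RAED, an Encoder that sees only the robot's action parameters while the Decoder is trained to predict the resulting environment observations, might be sketched as follows. The linear layers, dimensions, and the use of a decoder bias term to absorb static scene content are illustrative assumptions for this sketch rather than the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions: 4 action parameters (e.g. controller values),
# an 8-dim representation, and a 64-dim flattened observation.
action_dim, repr_dim, obs_dim = 4, 8, 64
W_enc = rng.normal(scale=0.1, size=(repr_dim, action_dim))
W_dec = rng.normal(scale=0.1, size=(obs_dim, repr_dim))
b_dec = np.zeros(obs_dim)  # bias can absorb static environment elements

def raed_forward(action_params):
    z = np.tanh(W_enc @ action_params)  # representation from actions only
    predicted_obs = W_dec @ z + b_dec   # Decoder predicts the observation
    return z, predicted_obs

def raed_loss(action_params, observed_obs):
    """Training target: match the predicted to the observed observation."""
    _, pred = raed_forward(action_params)
    diff = observed_obs - pred
    return float(diff @ diff)

a = rng.normal(size=action_dim)     # the robot's action parameters
obs = rng.normal(size=obs_dim)      # the (possibly distracted) observation
z, pred = raed_forward(a)
```

Because the representation z is computed purely from the action parameters, a DI appearing in the observed image can perturb the training target but cannot enter z directly; over many training samples, the predicted observation tends toward the expected consequences of the action, which is the distraction-robustness property described above.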

2. RELATED WORK

Several works have investigated mechanisms to preserve only task-relevant information in learned representations. Pathak et al. (2017) propose a Neural Network architecture that learns representations of visual inputs by predicting the action taken by the RL agent in its state transition. This design allows the representations to dedicate their information capacity to the observable consequences of the robot's actions. Finn et al. (2016) propose a spatial AE to learn representations that aim to preserve only the configuration of objects in the environment rather than all aspects of the information contained in the camera images. However, in both approaches, the representations are learned from visual inputs, which will be distorted if DIs are present in the environment. Without an explicit correction mechanism, these representation learning techniques therefore remain susceptible to distractions.

The concept of affordance learning formulates a similar goal in discovering the consequences of the robot's actions on its environment (Cakmak et al., 2007; Şahin et al., 2007). However, works in this strand of research rarely consider the problem of DIs in the environment. Instead, they mainly concentrate on the robot's ability to infer how an object in the environment would behave in response to its actions when no prior interaction experience with that particular object is available (Dehban et al., 2016; Mar et al., 2015).

A work that uses learned representations and investigates robot interactions in the presence of DIs is presented in the Intrinsically Motivated Goal Exploration Processes (IMGEP, Laversanne-Finot et al. (2018)). IMGEP aims to enable robots to explore the possible interactions with various tools in an environment that also features distractor objects. These objects either cannot be interacted with or move independently of the robot.
The authors show that a variant of their proposed algorithm remains unaffected by the presence of these distractors. This robustness is demonstrated by the robot's lack of interaction with the distractor objects. However, the DIs we consider in this paper pose an arguably larger challenge for two main reasons. First, we evaluate distractor objects which exhibit dynamics that are not only independent of the robot's behaviour but also misleading to the robot's

