R-LATTE: VISUAL CONTROL VIA DEEP REINFORCEMENT LEARNING WITH ATTENTION NETWORK

Abstract

Attention mechanisms are generic inductive biases that have played a critical role in improving the state of the art in supervised learning, unsupervised pre-training, and generative modeling across multiple domains, including vision, language, and speech. However, they remain relatively under-explored for the neural network architectures typically used in reinforcement learning (RL) from high-dimensional inputs such as pixels. In this paper, we propose and study the effectiveness of augmenting the convolutional encoder of an RL agent with a simple attention module. Through experiments on the widely benchmarked DeepMind Control Suite environments, we demonstrate that our proposed module can (i) extract interpretable task-relevant information, such as agent locations and movements, without the need for data augmentations or contrastive losses; and (ii) significantly improve the sample-efficiency and final performance of the agents. We hope our simple and effective approach will serve as a strong baseline for future research incorporating attention mechanisms in reinforcement learning and control.

1. INTRODUCTION

Attention plays a crucial role in human cognitive processing: a commonly accepted hypothesis is that humans rely on a combination of top-down selection (e.g., zooming into certain parts of our perceptual field) and bottom-up disruption (e.g., getting distracted by a novel stimulus) to prioritize the information most likely to be important for survival or task completion (Corbetta & Shulman, 2002; Kastner & Ungerleider, 2000). In machine learning, attention mechanisms for neural networks have been studied for various purposes in multiple domains, such as computer vision (Xu et al., 2015; Zhang et al., 2019a), natural language processing (Vaswani et al., 2017; Brown et al., 2020), and speech recognition (Chorowski et al., 2015; Bahdanau et al., 2016). For example, Zagoruyko & Komodakis (2017) proposed transferring knowledge between two neural networks by aligning the activation (attention) maps of one network with the other. Springenberg et al. (2014) and Selvaraju et al. (2017) proposed gradient-based methods to extract attention maps from neural networks for interpretability. Attention also stands behind the success of Transformers (Vaswani et al., 2017), which use a self-attention mechanism to model dependencies in long language sequences.

However, attention mechanisms have received relatively little attention in deep reinforcement learning (RL), even though this generic inductive bias has the potential to improve the sample-efficiency of agents, especially in pixel-based environments. More frequently explored directions in RL from pixels are unsupervised/self-supervised learning (Oord et al., 2018; Srinivas et al., 2020; Lee et al., 2020; Stooke et al., 2020; Kipf et al., 2020): Jaderberg et al. (2017) introduced unsupervised auxiliary tasks, such as the Pixel Control task; Srinivas et al. (2020) applied contrastive learning for data-efficient RL; and Stooke et al. (2020) further improved the gains using temporal contrast. Another promising direction has focused on latent variable modeling (Watter et al., 2015; Zhang et al., 2019b; Hafner et al., 2020; Sekar et al., 2020; Watters et al., 2019): Hafner et al. (2019) proposed to leverage world modeling in a latent space for planning, and Hafner et al. (2020) utilized the latent dynamics model to generate synthetic roll-outs. However, using such representation learning methods may incur expensive back-and-forth costs (e.g., hyperparameter tuning). This motivates our search for a more effective neural network architecture for the convolutional encoder of an RL agent.

In this paper, we propose R-LAtte: Reinforcement Learning with Attention module, a simple yet effective architecture for encoding image pixels in vision-based RL. In particular, the major components of R-LAtte are:

• Two-stream encoding: We use two streams of encoders to extract non-attentional and attentional features from the images. We define our attention masks by applying a spatial Softmax to unnormalized saliencies, computed as the element-wise product of the non-attentional and attentional features. We show that the attentional features contain more task-relevant information (e.g., agent movements), while the non-attentional features contain more task-invariant information (e.g., agent shape).

• Adaptive scaling: Once the attention masks are obtained, they are combined with the original non-attentional features and then provided to the RL agent. To balance the trade-off between the original and attentional features, we introduce a learnable scaling parameter that is optimized jointly with the other parameters of the encoder.

We test our architecture on the widely used DeepMind Control Suite environments (Tassa et al., 2018) to demonstrate that our proposed module can (i) extract interpretable task-relevant information, such as agent locations and movements; and (ii) significantly improve the sample-efficiency and final performance of the agents without the need for data augmentations or contrastive losses. We also provide results from a detailed ablation study, which shows the contribution of each component to overall performance.
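To make the two components concrete, the following NumPy sketch shows one way the attention masks and adaptive scaling could interact. It is an illustrative approximation, not the paper's implementation: the exact placement of the residual connection and how the masks are applied to the features are our assumptions, and the names `spatial_softmax`, `r_latte_block`, `f_plain`, `f_attn`, and `alpha` are hypothetical (in the actual agent the two streams would be convolutional feature maps and `alpha` a trained parameter).

```python
import numpy as np

def spatial_softmax(s):
    """Per-channel softmax over the spatial dimensions of a (C, H, W) array."""
    c, h, w = s.shape
    flat = s.reshape(c, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(flat)
    return (e / e.sum(axis=1, keepdims=True)).reshape(c, h, w)

def r_latte_block(f_plain, f_attn, alpha):
    """Combine the two encoder streams into attended features.

    f_plain: (C, H, W) non-attentional features.
    f_attn:  (C, H, W) attentional features.
    alpha:   scalar balancing original vs. attended features (learnable in the agent).
    """
    saliency = f_plain * f_attn          # Hadamard product -> unnormalized saliencies
    mask = spatial_softmax(saliency)     # attention masks, one per channel
    attended = mask * f_plain            # re-weight the non-attentional features
    return f_plain + alpha * attended    # residual connection with adaptive scaling
```

With `alpha = 0` the block reduces to the plain (non-attentional) features, which is one reason a learnable scale makes the attention pathway easy to optimize: the encoder can start close to a standard architecture and grow the attentional contribution as needed.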
To summarize, the main contributions of this paper are as follows:

• We present R-LAtte, a simple yet effective attention-based architecture for encoding image pixels in vision-based RL.

• We show that R-LAtte significantly improves the sample-efficiency and final performance of agents on continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018).

2. RELATED WORK

Reinforcement learning from pixels. RL from image inputs has been shown to benefit from representation learning methods using contrastive losses (Oord et al., 2018; Srinivas et al., 2020; Lee et al., 2020; Stooke et al., 2020; Kipf et al., 2020), self-supervised auxiliary task learning (Jaderberg et al., 2017; Goel et al., 2018; Sekar et al., 2020), and latent variable modeling (Watter et al., 2015; Zhang et al., 2019b; Hafner et al., 2020; Sekar et al., 2020; Watters et al., 2019). Alongside these successes, Laskin et al. (2020) and Kostrikov et al. (2020) recently showed that proper data augmentations alone can achieve competitive performance against previous representation learning methods. Different from existing representation learning and data augmentation methods, we focus on architectural improvements specific to RL from pixels, a direction largely uncharted by previous work.

Attention in machine learning. Humans understand scenes by attending to a local region of the view and aggregating information over time to form an internal scene representation (Kastner & Ungerleider, 2000; Rensink, 2000). Inspired by this mechanism, researchers have developed attention modules for neural networks, which have directly contributed to many recent advances in deep learning, especially in natural language processing and computer vision (Vaswani et al., 2017; Bahdanau et al., 2015; Mnih et al., 2014; Xu et al., 2015). Attention mechanisms in RL from pixels have also been explored in prior work for various purposes. Zambaldi et al. (2018) proposed a self-attention module for tasks that require strong relational reasoning. Different from their work, we study a more computationally efficient attention module on robot control tasks that may or may not require strong relational reasoning. Choi et al. (2019) is the closest study in terms of architectural choice; the main difference lies in the objective of the study. Choi et al. (2019) investigate the effect of attention on Atari games and focus on the exploration aspect, while we focus on improving the learning of robot control using the DeepMind Control Suite environments (Tassa et al., 2018). Levine et al. (2016) also use an attention-like module for robot control tasks. There are several architectural design differences (e.g., our addition of a residual connection and a Hadamard product), which we show are crucial for achieving good performance on control tasks (see Section 5.4).




