D2RL: DEEP DENSE ARCHITECTURES IN REINFORCEMENT LEARNING

Abstract

While improvements in deep learning architectures have played a crucial role in advancing the state of supervised and unsupervised learning in computer vision and natural language processing, neural network architecture choices for reinforcement learning remain relatively under-explored. We take inspiration from successful architectural choices in computer vision and generative modeling, and investigate the use of deeper networks and dense connections for reinforcement learning on a variety of simulated robotic learning benchmark environments. Our findings reveal that current methods benefit significantly from dense connections and deeper networks across a suite of manipulation and locomotion tasks, for both proprioceptive and image-based observations. We hope that our results can serve as a strong baseline and further motivate future research into neural network architectures for reinforcement learning. The project website is at https://sites.google.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) is a general-purpose framework for training goal-directed agents in high-dimensional state and action spaces. DRL has produced notable successes in robotic control, spanning locomotion and navigation tasks both in simulation and in the real world (Schulman et al., 2015; Akkaya et al., 2019; Kalashnikov et al., 2018). While the generality of the DRL framework makes it applicable to a wide variety of tasks, it also raises issues of sample efficiency and generalization. Sample efficiency is fundamentally critical for agents trained in the real world, particularly on robotic control tasks. The same generality, however, makes it difficult to inject inductive biases into DRL algorithms, even though such biases are an effective mechanism for inducing desirable behaviour in learned agents and improving their sample efficiency. Recent work has sought to improve the sample efficiency of DRL from images by adding an inductive bias of invariance, through techniques such as data augmentation (Laskin et al., 2020; Kostrikov et al., 2020) and contrastive losses (Srinivas et al., 2020). Another important inductive bias in DRL is the choice of architecture for the function approximators, i.e., how the neural networks for the policy and value functions are parameterized. However, the problem of architecture design in DRL and robotics, for planning and control, has been largely ignored. Modern computer vision and language processing research has shown the disproportionate advantage of the size and depth of the neural networks used (He et al., 2016b; Radford et al.), wherein very deep neural networks can be trained such that they learn better and more generalizable representations.

Furthermore, recent evidence suggests that deeper neural networks not only can learn more complex functions but also have a smoother loss landscape (Rolnick & Tegmark, 2017). Function approximators that enable better optimization and expressivity are an important inductive bias, one that is heavily exploited in vision and language processing through architectural choices such as residual connections (He et al., 2016a), normalization layers (Santurkar et al., 2018), and gating mechanisms (Hochreiter & Schmidhuber, 1997), to name a few. Incorporating similar inductive biases into modern DRL algorithms would allow for better sample efficiency and thereby significantly aid the deployment of real-world robot learning agents. In this paper, we first highlight the problems that arise when learning policies and value functions with vanilla deep neural networks. We then propose D2RL, an architecture that addresses these problems while benefiting from the inductive biases added by more expressive function approximators. We show that the proposed architecture scales effectively to a wide variety of off-policy RL algorithms, for both proprioceptive-feature and image-based inputs, across a diverse set of challenging robotic control and manipulation environments. Our approach is motivated by a form of dense connections similar to those found in modern deep learning architectures such as DenseNet (Huang et al., 2017), Skip-VAE (Dieng et al., 2019), and U-Nets (Ronneberger et al., 2015). We demonstrate that the proposed parameterization significantly improves the sample efficiency of RL agents (as measured by the number of environment interactions required to reach a given level of performance) on continuous control tasks. Our contributions can be summarized as follows:

1. We investigate the problems that arise when increasing the number of layers used to parameterize policies and value functions.
2. We propose a general solution based on dense connections to overcome these problems.
3. We extensively evaluate our proposed architecture on a diverse set of robotics tasks, from proprioceptive features and images, across multiple standard algorithms.
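To make the dense-connection idea concrete, the following is a minimal sketch of a D2RL-style MLP in which the raw input is concatenated to the activations before every hidden layer after the first. It is written in plain NumPy for self-containment; the layer width, depth, initialization, and dimensions are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Small random weights; illustrative initialization, not the paper's scheme.
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

def relu(h):
    return np.maximum(h, 0.0)

def d2rl_forward(s, layers, head):
    """Forward pass with dense connections: the raw input s is
    concatenated to the hidden activations before every layer
    except the first, so each layer sees [features, raw input]."""
    W, b = layers[0]
    h = relu(s @ W + b)
    for W, b in layers[1:]:
        h = relu(np.concatenate([h, s], axis=-1) @ W + b)
    W_out, b_out = head
    return h @ W_out + b_out

# Hypothetical sizes: an 11-dimensional proprioceptive state, 3-dimensional action.
state_dim, hidden, action_dim, depth = 11, 256, 3, 4
layers = [linear(state_dim, hidden)]
layers += [linear(hidden + state_dim, hidden) for _ in range(depth - 1)]
head = linear(hidden, action_dim)

s = rng.standard_normal((8, state_dim))  # batch of 8 states
a = d2rl_forward(s, layers, head)
print(a.shape)  # (8, 3)
```

The only change relative to a vanilla MLP is the concatenation inside the loop, which enlarges the input dimension of every hidden layer after the first from `hidden` to `hidden + state_dim`.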

2. RELATED WORK

Learning efficient representations for sample-efficient RL. Several recent papers have sought to improve representation learning of observations for control. CURL (Srinivas et al., 2020) augments the usual RL loss with a contrastive loss that enforces the inductive bias that embeddings of augmentations of the same image lie closer in latent space than embeddings of augmentations of different images. RAD (Laskin et al., 2020) and DrQ (Kostrikov et al., 2020) showed that simple data augmentations such as random crop, color jitter, patch cutout, and random convolutions alone can improve the sample efficiency and generalization of RL from pixel inputs. Other algorithms learn latent representations decoupled from policy learning, through a variational autoencoder based loss (Higgins et al., 2017; Hafner et al., 2019; Nair et al., 2018). OFENet (Ota et al., 2020) shows that learning a higher-dimensional feature space helps learn a more informative representation when learning from states.

Inductive biases in deep learning. Inductive biases in deep learning have long been explored in different contexts, such as temporal relations (Hochreiter & Schmidhuber, 1997), spatial relations (LeCun et al., 1998; Krizhevsky et al., 2012), translation invariance (Berthelot et al., 2019; He et al., 2020; Chen et al., 2020; Srinivas et al., 2020; Laskin et al., 2020), and learning contextual representations (Vaswani et al., 2017). These inductive biases are injected either directly through the network parameterization (LeCun et al., 1998; Hochreiter & Schmidhuber, 1997; Vaswani et al., 2017) or implicitly by changing the objective function (Berthelot et al., 2019; Srinivas et al., 2020).
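As an illustration of the augmentation family discussed above, the following is a minimal sketch of batched random crop in the style of RAD/DrQ, where each image in a batch is cropped at an independently sampled location. The image sizes and the 100-to-84 crop are illustrative assumptions, not values taken from any of the cited papers.

```python
import numpy as np

def random_crop(imgs, out_size, rng):
    """Randomly crop each image in a batch independently.
    imgs: array of shape (B, C, H, W); returns (B, C, out_size, out_size)."""
    b, c, h, w = imgs.shape
    tops = rng.integers(0, h - out_size + 1, size=b)
    lefts = rng.integers(0, w - out_size + 1, size=b)
    out = np.empty((b, c, out_size, out_size), dtype=imgs.dtype)
    for i in range(b):
        out[i] = imgs[i, :, tops[i]:tops[i] + out_size,
                      lefts[i]:lefts[i] + out_size]
    return out

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 3, 100, 100))  # hypothetical pixel batch
cropped = random_crop(batch, 84, rng)
print(cropped.shape)  # (4, 3, 84, 84)
```

The key property is that crop locations vary across the batch, so the encoder repeatedly sees spatially shifted views of the same scenes, which is the translation-invariance bias these methods rely on.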



Figure 1: Visual illustrations of the proposed dense-connections based D2RL modification to the policy π φ (•) and Q-value Q θ (•) neural network architectures. The inputs are passed to each layer of the neural network through identity mappings. The forward pass proceeds from left to right in the figure. For state-based environments, s_t is the observed simulator state and there is no convolutional encoder.

