EFFICIENT TRANSFORMERS IN REINFORCEMENT LEARNING USING ACTOR-LEARNER DISTILLATION

Abstract

Many real-world applications such as robotics place hard constraints on the power and compute available to Reinforcement Learning (RL) agents, limiting viable model complexity. Similarly, in many distributed RL settings, acting is done on unaccelerated hardware such as CPUs, which likewise restricts model size to prevent intractable experiment run-times. These "actor-latency"-constrained settings present a major obstacle to the scaling up of model complexity that has recently been so successful in supervised learning. To exploit large model capacity while still operating within the limits imposed by the system during acting, we develop an "Actor-Learner Distillation" (ALD) procedure that leverages a continual form of distillation to transfer learning progress from a large-capacity learner model to a small-capacity actor model. As a case study, we develop this procedure in the context of partially-observable environments, where transformer models have recently shown large improvements over LSTMs, at the cost of significantly higher computational complexity. With a transformer learner and an LSTM actor, we demonstrate on several challenging memory environments that Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model while maintaining the fast inference and reduced total training time of the LSTM actor model.

1. INTRODUCTION

Compared to standard supervised learning domains, reinforcement learning presents a unique challenge: the agent must act while it is learning. In certain application areas, which we term actor-latency-constrained settings, there exist maximum latency constraints on the acting policy that limit its model size. These latency constraints preclude typical approaches to reducing the computational cost of high-capacity models, such as model compression or off-policy reinforcement learning, because the setting strictly requires a low-computational-complexity model to be acting during learning. Here, the major constraint is that the acting policy model must execute a single inference step within a fixed time budget, which we denote T_actor; the amount of compute or resources used during learning is, in contrast, not highly constrained. This setting is ubiquitous in real-world application areas: for example, when learning policies for robotic platforms, inherent limitations in compute ability due to power and weight considerations make it unlikely that a large model could run fast enough to provide actions at the control frequency of the robot's motors. However, many of these strictly actor-latency-constrained settings involve orthogonal challenges that prevent ease of experimentation, such as the requirement to own and maintain real robot hardware. To develop a solution to actor-latency-constrained settings without needing to deal with such substantial externalities, we focus on the related area of distributed on-policy reinforcement learning (Mnih et al., 2016; Schulman et al., 2017; Espeholt et al., 2018). Here a central learner process receives data from a series of parallel actor processes interacting with the environment.
The actor processes run step-wise policy inference to collect trajectories of interaction for the learner, and they can be situated adjacent to the accelerator or distributed across different machines. With a large model capacity, the bottleneck in experiment run-times for distributed learning quickly becomes actor inference, as actors are commonly run on CPUs or devices without significant hardware acceleration. The simplest way around this bottleneck is to increase the number of parallel actors, but this often results in excessive CPU resource usage and limits the total number of experiments that can be run on a compute cluster. Therefore, while not a hard constraint, experiment run-time in this setting is largely dominated by actor inference speed. Distributed RL thus presents an accessible test-bed for solutions to actor-latency-constrained reinforcement learning. Within the domain of distributed RL, an area where reduced actor latency during learning could make a significant impact is the use of large Transformers (Vaswani et al., 2017) to solve partially-observable environments (Parisotto et al., 2019). Transformers have rapidly emerged as the state-of-the-art architecture across a wide variety of sequence modeling tasks (Brown et al., 2020; Radford et al., 2019; Devlin et al., 2019), owing to their ability to arbitrarily and instantly access information across time as well as their superior scaling properties compared to recurrent architectures. Recently, their application to reinforcement learning domains has shown results surpassing previous state-of-the-art architectures while matching standard LSTMs in robustness to hyperparameter settings (Parisotto et al., 2019). However, a downside of the Transformer compared to LSTM models is its significant computational cost.
In this paper, we present a solution to actor-latency-constrained settings, "Actor-Learner Distillation" (ALD), which leverages a continual form of Policy Distillation (Rusu et al., 2015; Parisotto et al., 2015) to compress, online, a larger "learner model" into a tractable "actor model". In particular, we focus on the distributed RL setting applied to partially-observable environments, where we aim to exploit the transformer model's superior sample efficiency while retaining the LSTM model's computational efficiency during acting. On challenging memory environments where the transformer has a clear advantage over the LSTM, we demonstrate that our Actor-Learner Distillation procedure provides substantially improved sample efficiency while keeping experiment run-time comparable to that of the smaller LSTM.
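The core of such a continual distillation procedure can be illustrated with a simple policy-distillation loss. The following is a minimal sketch, not the paper's implementation: it assumes both models output action logits over a discrete action space, and all names (e.g. ald_distill_loss) are illustrative.

```python
# Sketch of a continual policy-distillation objective: the small actor model
# is trained to match the action distribution of the large learner model.
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ald_distill_loss(learner_logits, actor_logits):
    """KL(learner || actor), averaged over a batch of observations.

    Minimizing this loss on collected trajectories keeps the small actor
    policy continually tracking the large learner policy.
    """
    p = softmax(learner_logits)                    # teacher (learner) distribution
    log_p = np.log(p + 1e-8)
    log_q = np.log(softmax(actor_logits) + 1e-8)   # student (actor) log-probs
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))
```

The loss is zero when the two policies agree exactly and strictly positive otherwise, so repeatedly minimizing it over fresh trajectories transfers the learner's progress to the actor.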

2. BACKGROUND

A Markov Decision Process (MDP) (Sutton & Barto, 1998) is a tuple (S, A, T, γ, R), where S is a finite set of states, A is a finite action space, T(s'|s, a) is the transition model, γ ∈ [0, 1] is a discount factor and R is the reward function. A stochastic policy π ∈ Π is a mapping from states to a probability distribution over actions. The value function V^π(s) of a policy π at a particular state s is defined as the expected future discounted return when starting at state s and executing π: V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]. The optimal policy π* is defined as a policy with maximum value at every state, i.e. ∀s ∈ S, π ∈ Π we have V^{π*}(s) ≥ V^π(s). Such an optimal policy is guaranteed to exist (Sutton & Barto, 1998). In this work, we focus on Partially-Observable MDPs (POMDPs), which define environments in which the agent cannot observe the state directly and must instead reason over observations. POMDPs require the agent to reason over histories of observations in order to make an informed decision about which action to choose. This motivates the use of memory models such as LSTMs (Hochreiter & Schmidhuber, 1997) or transformers (Vaswani et al., 2017). In this work, we use V-MPO (Song et al., 2020) as the main reinforcement learning algorithm. V-MPO, an on-policy value-based extension of the Maximum a Posteriori Policy Optimisation algorithm (Abdolmaleki et al., 2018), uses an EM-style policy optimization update along with regularizing KL constraints to obtain state-of-the-art results across a wide variety of environments. Similar to Song et al. (2020), we use the IMPALA (Espeholt et al., 2018) distributed RL framework to parallelize acting and learning. Actor processes step through the environment and collect trajectories of data to send to the learner process, which batches actor trajectories and optimizes reinforcement learning objectives to update the policy parameters.
The updated parameters are then communicated back to actor processes in order to maintain on-policyness. In the following, we first refer to a single observation and associated statistics (reward, etc.) as a step. We refer to an environment step as the acquisition of a step from the environment after an action is taken. The total number of environment steps is a measure of the sample complexity of an RL algorithm. We refer to a step processed by an RL algorithm as an agent step. The total number of agent steps is a measure of the computational complexity of the RL algorithm. We use steps per second (SPS) to measure the speed of an algorithm. We refer to Learner SPS as the total number of agent steps processed by the learning algorithm per second. We refer to Actor SPS as the total number of environment steps acquired by a single actor per second.
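The step-rate metrics defined above can be tracked with straightforward bookkeeping; the sketch below is illustrative only (the StepMeter name is not from the paper). One such meter per actor process yields Actor SPS over environment steps, and one on the learner yields Learner SPS over agent steps.

```python
# Minimal step-rate (SPS) bookkeeping: count steps against wall-clock time.
import time

class StepMeter:
    def __init__(self):
        self.steps = 0
        self.start = time.monotonic()

    def add(self, n=1):
        # Record n steps (environment steps on an actor, agent steps on the learner).
        self.steps += n

    def sps(self):
        # Steps per second since construction; guard against a zero interval.
        elapsed = time.monotonic() - self.start
        return self.steps / max(elapsed, 1e-9)
```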
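The expected discounted return that defines V^π(s) earlier in this section can also be estimated empirically by averaging discounted returns over rollouts sampled by executing π from s. A minimal Monte-Carlo sketch (function names are illustrative):

```python
# Monte-Carlo estimate of V^pi(s): average the discounted return over
# reward sequences obtained by running pi from state s.

def discounted_return(rewards, gamma):
    # Backward accumulation: g_t = r_t + gamma * g_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mc_value_estimate(rollouts, gamma):
    # `rollouts` is a list of reward sequences, one per sampled episode.
    return sum(discounted_return(rs, gamma) for rs in rollouts) / len(rollouts)
```

For example, with γ = 0.5 a reward sequence [1, 1, 1] yields a return of 1 + 0.5·(1 + 0.5·1) = 1.75.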

