LINEAR REPRESENTATION META-REINFORCEMENT LEARNING FOR INSTANT ADAPTATION

Anonymous

Abstract

This paper introduces Fast Linearized Adaptive Policy (FLAP), a new meta-reinforcement learning (meta-RL) method that extrapolates well to out-of-distribution tasks without reusing training data and adapts almost instantaneously, requiring only a few samples at test time. FLAP builds on the idea of learning a shared linear representation of the policy, so that adapting to a new task reduces to predicting a set of linear weights. A separate adapter network is trained simultaneously with the policy; during adaptation, the adapter network directly predicts these linear weights instead of updating a meta-policy via gradient descent (as in prior meta-RL methods such as MAML) to obtain the new policy. This separate feed-forward network not only speeds up adaptation run-time significantly, but also generalizes to tasks far outside the training distribution, where prior meta-RL methods fail. Experiments on standard continuous-control meta-RL benchmarks show that FLAP achieves significantly stronger performance on out-of-distribution tasks, with up to double the average return and up to 8x faster adaptation run-times compared to prior methods.

1. INTRODUCTION

Deep reinforcement learning (DRL) has led to recent advances that allow autonomous agents to solve complex tasks across a wide range of fields (Schulman et al. (2015); Lillicrap et al. (2015a); Levine et al. (2015)). However, traditional DRL approaches learn a separate policy for each unique task, requiring large amounts of samples. Meta-reinforcement learning (meta-RL) algorithms provide a solution by teaching agents to implicitly learn a structure shared among a batch of training tasks, so that policies for similar unseen tasks can be acquired quickly (Finn et al. (2017)). Recent progress in meta-RL has focused on improving the sample complexity of meta-RL algorithms (Rakelly et al. (2019; 2020)). However, most existing meta-RL algorithms prioritize sample efficiency at the cost of computational complexity during adaptation, making them infeasible for fast-changing environments in real-world applications such as robotics.

In this paper, we present Fast Linearized Adaptive Policy (FLAP), an off-policy meta-RL method with strong generalization ability and fast adaptation speeds. FLAP is built on the assumption that similar tasks share a common linear (or low-dimensional) structure in the representation of the agent's policy, which is usually parameterized by a neural network. During training, we learn the shared linear structure among different tasks using an actor-critic algorithm. A separate adapter network is also trained, as a supervised learning problem, to map the agent's environment interactions on each training task to the weights of the policy's output layer. When adapting to a new task, we fix the learned linear representation (the shared model layers) and predict the weights for the new task using the trained adapter network. An illustration of our approach is given in Figure 2. We highlight our main contributions below:

• State-of-the-art performance.
We propose an algorithm based on learning and predicting the shared linear structure within policies, which yields the strongest results among meta-RL algorithms and the fastest adaptation speeds. FLAP sets the state of the art in performance, run-time, and memory usage, as shown by comparisons against prior meta-RL methods (Rothfuss et al. (2018)) and by the out-of-distribution performance of meta-RL algorithms during adaptation (Fakoor et al. (2019); Mendonca et al.).
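To make the architecture described above concrete, the following is a minimal numpy sketch, not the authors' implementation: it assumes a frozen shared feature map standing in for the meta-trained policy trunk, represents each task's policy as a linear head on those features, and uses a least-squares-trained linear map as a toy stand-in for the adapter network that predicts head weights from a summary of a few transitions. All names, dimensions, and the random "meta-training" data are illustrative assumptions.

```python
# Sketch of the FLAP idea: task policies share a representation phi(s) and
# differ only in a final linear layer; an adapter predicts that layer
# directly from a few transitions, with no gradient steps at test time.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM, ACT_DIM = 4, 8, 2

# Shared representation phi: a fixed random feature map standing in for the
# policy trunk learned (then frozen) during meta-training.
W_shared = rng.normal(size=(FEAT_DIM, STATE_DIM))

def phi(state):
    """Shared nonlinear feature map (frozen after meta-training)."""
    return np.tanh(W_shared @ state)

def policy_action(state, head):
    """A task-specific policy is just a linear head on shared features."""
    return head @ phi(state)  # head: (ACT_DIM, FEAT_DIM)

def summarize(transitions):
    """Compress a few transition rows into one fixed-size context vector."""
    return np.concatenate([transitions.mean(axis=0), transitions.std(axis=0)])

# Fake meta-training data: for each training task, the head weights the
# actor-critic would have learned, plus transitions collected on that task.
n_tasks, n_steps, row_dim = 50, 16, 5
task_summaries, task_heads = [], []
for _ in range(n_tasks):
    head = rng.normal(size=(ACT_DIM, FEAT_DIM))
    transitions = rng.normal(size=(n_steps, row_dim))  # toy (s, a, r) rows
    task_summaries.append(summarize(transitions))
    task_heads.append(head.ravel())
X = np.stack(task_summaries)   # (n_tasks, 2 * row_dim)
Y = np.stack(task_heads)       # (n_tasks, ACT_DIM * FEAT_DIM)

# "Train" the adapter: ridge-regularized least squares as a stand-in for
# the supervised learning problem the paper describes.
A = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ Y)

# Adaptation to a new task: one forward pass of the adapter on a handful
# of fresh transitions, then act with the predicted linear head.
new_transitions = rng.normal(size=(n_steps, row_dim))
predicted_head = (summarize(new_transitions) @ A).reshape(ACT_DIM, FEAT_DIM)
action = policy_action(rng.normal(size=STATE_DIM), predicted_head)
print(action.shape)  # (2,)
```

The key design point this sketch mirrors is that adaptation involves no gradient descent at all: producing a new policy costs a single forward pass through the adapter, which is why the method can adapt near-instantly.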

