LINEAR REPRESENTATION META-REINFORCEMENT LEARNING FOR INSTANT ADAPTATION

Anonymous

Abstract

This paper introduces Fast Linearized Adaptive Policy (FLAP), a new meta-reinforcement learning (meta-RL) method that extrapolates well to out-of-distribution tasks without reusing data from training, and adapts almost instantaneously, requiring only a few samples at test time. FLAP builds upon the idea of learning a shared linear representation of the policy, so that adapting to a new task reduces to predicting a set of linear weights. A separate adapter network is trained simultaneously with the policy so that, during adaptation, we can directly use the adapter network to predict these linear weights, instead of updating a meta-policy via gradient descent as in prior meta-RL methods like MAML, to obtain the new policy. The use of a separate feed-forward network not only speeds up the adaptation run-time significantly, but also generalizes extremely well to very different tasks that prior meta-RL methods fail to generalize to. Experiments on standard continuous-control meta-RL benchmarks show that FLAP achieves significantly stronger performance on out-of-distribution tasks, with up to double the average return and up to 8X faster adaptation run-time compared to prior methods.

1. INTRODUCTION

Deep Reinforcement Learning (DRL) has led to recent advances that allow autonomous agents to solve complex tasks in a wide range of fields (Schulman et al. (2015); Lillicrap et al. (2015a); Levine et al. (2015)). However, traditional approaches in DRL learn a separate policy for each unique task, requiring large amounts of samples. Meta-reinforcement learning (meta-RL) algorithms provide a solution by teaching agents to implicitly learn a structure shared among a batch of training tasks, so that policies for unseen but similar tasks can be acquired quickly (Finn et al. (2017)). Recent progress in meta-RL has focused on improving the sample complexity of meta-RL algorithms (Rakelly et al. (2019); Rothfuss et al. (2018)), along with their out-of-distribution performance during adaptation (Fakoor et al. (2019); Mendonca et al. (2020)). However, most existing meta-RL algorithms prioritize sample efficiency at the cost of computational complexity during adaptation, making them infeasible for fast-changing environments in real-world applications such as robotics.

In this paper, we present Fast Linearized Adaptive Policy (FLAP), an off-policy meta-RL method with strong generalization ability and fast adaptation. FLAP is built on the assumption that similar tasks share a common linear (or low-dimensional) structure in the representation of the agent's policy, which is usually parameterized by a neural network. During training, we learn the shared linear structure among different tasks using an actor-critic algorithm. A separate adapter network is also trained, as a supervised learning problem, to predict the weights of the output layer for each training task from the agent's environment interactions. When adapting to a new task, we fix the learned linear representation (the shared model layers) and predict the weights for the new task using the trained adapter network.
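The factorization described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the shapes, the fixed random "body" standing in for the trained shared layers, and the linear adapter over averaged transitions are all assumptions made for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, FEAT_DIM, ACT_DIM = 4, 8, 2  # illustrative sizes

# Shared representation phi: a tiny fixed network standing in for the
# meta-trained shared layers (frozen at adaptation time).
W1 = rng.normal(size=(STATE_DIM, FEAT_DIM))

def phi(s):
    return np.tanh(s @ W1)  # shared features for a state s

def policy(s, w_task):
    # Task-specific linear head on top of the shared features:
    # pi_i(s) = phi(s) @ w_i, so adapting = choosing a new w_i.
    return phi(s) @ w_task  # action mean, shape (ACT_DIM,)

# Adapter network: maps a few transition samples from a new task directly
# to a predicted head. Here a single linear map over the mean context
# vector, purely for illustration.
A = rng.normal(size=(2 * STATE_DIM + ACT_DIM, FEAT_DIM * ACT_DIM)) * 0.1

def adapt(transitions):
    # transitions: (n, |s| + |a| + |s'|) array of concatenated samples
    ctx = transitions.mean(axis=0)  # aggregate the few-shot context
    # One forward pass, no gradient steps, yields the new task's weights.
    return (ctx @ A).reshape(FEAT_DIM, ACT_DIM)

# Few-shot adaptation: a handful of samples -> new head instantly.
samples = rng.normal(size=(5, 2 * STATE_DIM + ACT_DIM))
w_test = adapt(samples)
action = policy(rng.normal(size=STATE_DIM), w_test)
```

The key point the sketch makes concrete is that adaptation is a single forward pass through the adapter (`adapt`), replacing the inner-loop gradient updates used by methods such as MAML.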
An illustration of our approach is given in Figure 2. We highlight our main contributions below:

• State-of-the-art performance. We propose an algorithm based on learning and predicting the shared linear structure within policies, which achieves the strongest results among meta-RL algorithms together with the fastest adaptation speeds; FLAP is the state of the art in performance, run-time, and memory usage. As shown in Figure 1, the FLAP algorithm outperforms most existing meta-RL algorithms, including MAML (Finn et al. (2017)), PEARL (Rakelly et al. (2019)), and MIER (Mendonca et al. (2020)), in terms of both adaptation speed and average return. Further results from our experiments show that FLAP acquires adapted policies that perform much better on out-of-distribution tasks, at a run-time adaptation rate up to 8X faster than prior methods.

• Prediction rather than optimization. We showcase a successful use of prediction via an adapter network, rather than optimization with gradient steps (Finn et al. (2017)) or the use of context encoders (Rakelly et al. (2019)), during adaptation. This ensures that different tasks obtain policies that differ from one another, which boosts out-of-distribution performance, whereas gradient-based and context-based methods tend to produce similar policies for all new tasks. Furthermore, the adapter network learns an efficient exploration strategy, so that only a few samples are needed during adaptation to acquire the new policy. To our knowledge, this is the first meta-RL method that directly learns and predicts a (linearly) shared structure to successfully adapt to new tasks.

The idea of learning shared information (embeddings) across different tasks has been investigated deeply in transfer learning, including, for example, universal value function approximators (Schaul et al. (2015)).

Figure 1: Strong experimental results: We showcase the performance of meta-RL methods on tasks that are very different from the training tasks to assess the generalization ability of the methods. We also analyze the adaptation run-time speed of these methods on tasks that are similar (in-distribution) and tasks that are not very similar (out-of-distribution) to further evaluate these models. FLAP presents significantly stronger results compared to prior meta-RL methods.

Figure 2: Overview of our approach: In training, for different tasks {T_i}, we parametrize their policies as π_i = w_i^T φ, where φ ∈ R^d is the shared linear representation we hope to acquire and w_i are the task-specific output-layer weights. In testing (adaptation), we fix the acquired linear representation φ and directly set the weights w_test using the output of the feed-forward adapter network.

