INTERPRETABLE META-REINFORCEMENT LEARNING WITH ACTOR-CRITIC METHOD

Abstract

Meta-reinforcement learning (meta-RL) algorithms have successfully trained agents to perform well on different tasks within only a few updates. However, in gradient-based meta-RL algorithms, the Q-function at the adaptation step is mainly estimated from the returns of a few trajectories, which can lead to high variance in the Q-value and biased meta-gradient estimation; moreover, the adaptation requires a large number of batched trajectories. To address these challenges, we propose a new meta-RL algorithm that reduces the variance and bias of the meta-gradient estimation and performs few-shot task data sampling, which makes the meta-policy more interpretable. We reformulate the meta-RL objective, introduce a contextual Q-function as a meta-policy critic during the task adaptation step, and learn the Q-function under the soft actor-critic (SAC) framework. Experimental results on a 2D navigation task and meta-RL benchmarks show that our approach learns a more interpretable meta-policy for exploring unknown environments, with performance comparable to previous gradient-based algorithms.

1. INTRODUCTION

Reinforcement learning problems have been studied for a long time, and many impressive works have achieved human-level control in real-world tasks (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019; Schrittwieser et al., 2019). These agents are trained separately on each task and may require a huge amount of sampled data and millions of trials. However, in many real-world tasks the cost of sampling data is not negligible, so we cannot give the agent a large number of trials in the environment. In contrast, humans can leverage past experience and learn new tasks quickly within a few trials, which is very efficient. Many tasks in fact share similar structures that can be extracted as prior knowledge, e.g., shooting games aim to eliminate enemies with weapons in different environments, and such shared structure can help agents generalize quickly across tasks. Meta-learning (Thrun & Pratt, 2012) reinforcement learning tasks is therefore a suitable choice. Meta-reinforcement learning (meta-RL) aims to learn a policy that can adapt to an unknown environment within a few interactions. A meta-policy can be seen as a policy from which a new policy can be derived that maximizes performance in the new environment. Gradient-based meta-RL algorithms (Finn et al., 2017; Stadie et al., 2018; Rothfuss et al., 2018; Liu et al., 2019) showed that a meta-policy can be obtained by reinforcement-learning a policy that is itself adapted by a few reinforcement learning steps. The experimental results suggest that gradient-based methods can learn to sample and utilize sampled data to some extent. Nevertheless, the learning style and the learned meta-policy are still far from human. Humans learn a new task by interacting with it sequentially and efficiently: as environment data is obtained, they gradually understand where to sample data and how to utilize the sampled data to adjust the policy, while gradient-based algorithms use parallel sampling that neglects the relations between data.
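The gradient-based scheme described above can be sketched on a toy problem. The snippet below is purely illustrative and not the paper's method: each "task" is a scalar reward R_c(theta) = -(theta - c)^2 standing in for an expected return, the inner loop is one gradient step of task adaptation, and the outer loop differentiates through that inner step (the MAML-style meta-gradient). All function names and constants are assumptions for this sketch.

```python
# Toy sketch of one-step gradient-based meta-RL (MAML-style).
# Task c has "return" R_c(theta) = -(theta - c)^2; gradients are analytic.

def reward_grad(theta, c):
    # dR/dtheta for R(theta) = -(theta - c)^2
    return -2.0 * (theta - c)

def adapt(theta, c, alpha=0.1):
    # Inner loop: one policy-gradient step on task c
    return theta + alpha * reward_grad(theta, c)

def meta_grad(theta, tasks, alpha=0.1):
    # Outer loop: gradient of sum_c R_c(adapt(theta, c)) w.r.t. theta,
    # differentiating through the inner update (d adapt / d theta = 1 - 2*alpha)
    return sum(reward_grad(adapt(theta, c, alpha), c) * (1.0 - 2.0 * alpha)
               for c in tasks)

tasks = [-1.0, 1.0]      # two tasks with optima at -1 and +1
theta = 0.7              # meta-parameter initialization
for _ in range(200):
    theta += 0.05 * meta_grad(theta, tasks)
# theta converges near 0: the initialization from which one inner
# adaptation step improves on either task equally well
```

The point of the sketch is the nesting: the meta-objective scores the *adapted* parameters, so the meta-gradient must flow through the inner update, which is exactly where the variance and bias issues discussed in the abstract arise when the inner gradient is estimated from a few trajectories.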
Sampling independently is not data-efficient and usually needs a number of stochastic trajectories for policy adaptation. This causes the agent to rely on stochasticity to sample, so it only learns how to utilize data. Inspired by human behavior, we propose a K-shot meta-RL problem that constrains the amount of data accessible to the agent, e.g., adapting the policy within only two trials. A low-resource environment simulates real-world tasks with high data-collection costs and therefore requires the agent to learn a stable strategy to explore the environment. To address the K-shot problem, we also propose a contextual gradient-based algorithm using the actor-critic method. The adaptation step uses a trial buffer
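Since the adaptation critic is learned under the SAC framework (per the abstract), the relevant machinery is the soft Bellman target, sketched below in its standard form. This is a generic SAC target, not the paper's contextual variant: the contextual Q-function would additionally condition on task context, which is omitted here, and all names are illustrative.

```python
# Standard soft Bellman target used in SAC-style critics (generic form;
# the paper's contextual Q-function would also take a task context).

def soft_q_target(reward, done, q_next, logp_next, gamma=0.99, alpha=0.2):
    # y = r + gamma * (1 - done) * (Q(s', a') - alpha * log pi(a'|s'))
    # The -alpha * log pi term is the entropy bonus that keeps the
    # policy stochastic, which matters for exploration under a K-shot
    # data budget.
    return reward + gamma * (1.0 - done) * (q_next - alpha * logp_next)

y = soft_q_target(reward=1.0, done=0.0, q_next=2.0, logp_next=-1.5)
# y = 1.0 + 0.99 * (2.0 + 0.2 * 1.5) = 3.277
```

In practice the critic is regressed toward y over a replay buffer; terminal transitions (done=1) drop the bootstrap term entirely.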

