INTERPRETABLE META-REINFORCEMENT LEARNING WITH ACTOR-CRITIC METHOD

Abstract

Meta-reinforcement learning (meta-RL) algorithms have successfully trained agents to perform well across different tasks within only a few updates. However, in gradient-based meta-RL algorithms, the Q-function at the adaptation step is mainly estimated from the returns of a few trajectories, which can lead to high variance in the Q-value and biased meta-gradient estimation; moreover, adaptation requires a large number of batched trajectories. To address these challenges, we propose a new meta-RL algorithm that reduces the variance and bias of the meta-gradient estimation and performs few-shot task data sampling, which makes the meta-policy more interpretable. We reformulate the meta-RL objective, introduce a contextual Q-function as a meta-policy critic during the task adaptation step, and learn the Q-function under the soft actor-critic (SAC) framework. Experimental results on a 2D navigation task and meta-RL benchmarks show that our approach learns a more interpretable meta-policy for exploring unknown environments, with performance comparable to previous gradient-based algorithms.

1. INTRODUCTION

Reinforcement learning problems have been studied for a long time, and many impressive works have achieved human-level control in real-world tasks (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019; Schrittwieser et al., 2019). These agents are trained separately on each task and may require huge amounts of sampled data and millions of trials. However, in many real-world tasks the cost of sampling data is not negligible, so we cannot give the agent a large number of trials in the environment. In contrast, humans can leverage past experience and learn new tasks quickly within a few trials, which is very efficient. Many tasks in fact share similar structures that can be extracted as prior knowledge; e.g., shooting games all aim to eliminate enemies with weapons in different environments, and such shared structure can help an agent generalize quickly across tasks. Meta-learning (Thrun & Pratt, 2012) reinforcement learning tasks is therefore a suitable choice. Meta-reinforcement learning (meta-RL) aims to learn a policy that can adapt to an unknown environment within few interactions. A meta-policy can be seen as a policy from which a new policy can be derived that maximizes performance in the new environment. Gradient-based meta-RL algorithms (Finn et al., 2017; Stadie et al., 2018; Rothfuss et al., 2018; Liu et al., 2019) showed that a meta-policy can be obtained by reinforcement learning on a policy that has itself been adapted by a few reinforcement learning steps. The experimental results suggest that gradient-based methods can learn to sample and utilize sampled data to some extent. Nevertheless, the learning style and the learned meta-policy are still far from human. Humans learn a new task by interacting with it sequentially and efficiently: as environment data are obtained, a human gradually understands where to sample data and how to utilize the sampled data to adjust the policy, while gradient-based algorithms use parallel sampling that neglects the relations between data.
Sampling independently is not data-efficient and usually requires a number of stochastic trajectories to perform policy adaptation. This makes the agent rely on stochasticity to sample, so it only learns how to utilize data. Inspired by human behavior, we propose a K-shot meta-RL problem that constrains the amount of data accessible to the agent, e.g., adapting the policy within only two trials. A low-resource environment simulates real-world tasks with high data-collection costs and therefore requires the agent to learn a stable strategy for exploring the environment. To address the K-shot problem, we also propose a contextual gradient-based algorithm using the actor-critic method. The adaptation step uses a trial buffer D to store all the transitions from K-shot sampling and optimizes the expected value of the states in D. The meta-learning step optimizes the expected return of the adapted policy while learning the value functions and context encoder with soft actor-critic (Haarnoja et al., 2018) objectives. We learn the policy with a reparameterized objective that yields an unbiased meta-gradient estimate and reduces the variance of the Q-value estimate. Our contributions can be summarized as follows:
• We reformulate and propose the K-shot meta-RL problem to simulate real-world environments.
• We propose a new gradient-based objective to address the K-shot problem.
• We introduce a context-based policy and value functions to perform efficient data sampling.
• We use the actor-critic method to reduce the variance and bias of the Q-value and meta-gradient estimates.
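The adaptation step described above can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the environment step function, the critic, and the single-parameter binary policy are hypothetical stand-ins, and the update is a simplified surrogate for the value-ascent adaptation on the states stored in D.

```python
def k_shot_adapt(env_step, q_value, theta, K=2, horizon=5, lr=0.1):
    """Collect K trajectories sequentially into a trial buffer D, then
    adapt the policy parameter by ascending the critic's preference
    over actions on the states stored in D (toy surrogate objective)."""
    D = []
    for _ in range(K):
        s = 0  # toy initial state
        for _ in range(horizon):
            a = theta > 0.0  # toy deterministic binary policy
            s_next, r = env_step(s, a)
            D.append((s, a, r, s_next))
            s = s_next
    # Simplified "expected value" ascent: push theta toward the action
    # the critic scores higher, averaged over the states in D.
    grad = sum(q_value(s, True) - q_value(s, False) for (s, _, _, _) in D) / len(D)
    return theta + lr * grad, D
```

In the actual algorithm the policy and Q-function are contextual neural networks and the update is a reparameterized gradient step; this sketch only mirrors the data flow: sequential K-shot sampling into the trial buffer D, followed by a single value-based adaptation of the policy.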

2. RELATED WORK

Meta-reinforcement learning algorithms fall mainly into three categories: gradient-based methods (Finn et al., 2017; Stadie et al., 2018; Rothfuss et al., 2018; Liu et al., 2019; Nichol et al., 2018), recurrent meta-learners (Wang et al., 2016; Duan et al., 2016), and multi-task learners (Fakoor et al., 2019; Rakelly et al., 2019). Gradient-based algorithms like MAML (Finn et al., 2017) optimize the policy updated by one step of reinforcement learning, aiming to learn a good initialization of the policy weights. E-MAML (Stadie et al., 2018) considered the impact of the data obtained by the meta-policy on the adapted policy's performance and assigned credit to the meta-policy, while ProMP (Rothfuss et al., 2018) modified the adaptation gradient estimator to have low variance in the second-order gradient. Recurrent meta-learners (Wang et al., 2016; Duan et al., 2016) use an RNN as a meta-learner that can learn a new task from environment data while exploring. The RNN learners are optimized end-to-end over sequentially performed episodes, which is closer to the human learning process and yields a more interpretable meta-policy. Multi-task learners (Fakoor et al., 2019; Rakelly et al., 2019) learn a multi-task objective to solve meta-learning problems. They argue that meta-learning can be done by explicitly reusing learned features through a context variable. MQL (Fakoor et al., 2019) can even perform well without adaptation. PEARL (Rakelly et al., 2019) constructs a context encoder to infer the latent task variable and also learns a multi-task objective.
The trained policy can perform structured exploration by inferring the task while interacting with the environment. Our approach is most closely related to the gradient-based work that also tries to reduce the variance and bias of the second-order gradient estimate; however, we estimate the second-order gradient with value functions, and we additionally aim to perform structured exploration in data-expensive environments.

3. BACKGROUND

This section presents the problem definitions and notation for reinforcement learning and meta-reinforcement learning.

3.1. REINFORCEMENT LEARNING

Reinforcement learning (RL) problems aim to maximize the expected episode return E_{τ∼P(τ|θ)}[R(τ)] = E_{τ∼P(τ|θ)}[Σ_t γ^t r(s_t, a_t)] for a single task and agent, where τ = {s_0, a_0, r_0, ...} is the trajectory performed by the agent, s_0 ∼ ρ_0 is the initial state, a_t ∼ π_θ(a_t|s_t) is the action sampled from the policy π parameterized by θ, s_{t+1} ∼ P(s_{t+1}|s_t, a_t) is the state at the next timestep, and P(s_{t+1}|s_t, a_t) is the transition probability. The problem can be represented by a Markov Decision Process (MDP) with tuple M = (S, A, P, R, ρ_0, γ, H), where S ⊆ R^n is the set of states, A ⊆ R^m is the set of actions, P(s'|s, a) ∈ R_+ is the system transition probability, R(s, a) ∈ R is the reward function of the task, γ is the discount factor, and H is the horizon.
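As a concrete instance of the objective above, the discounted return R(τ) = Σ_t γ^t r(s_t, a_t) of a single episode can be computed from its reward sequence as follows (a standard helper, not code from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, discounted_return([1.0, 1.0, 1.0], gamma=0.5) gives 1 + 0.5 + 0.25 = 1.75; the RL objective is the expectation of this quantity over trajectories drawn from P(τ|θ).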

