CONTEXTUAL SYMBOLIC POLICY FOR META-REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Context-based Meta-Reinforcement Learning (Meta-RL), which conditions the RL agent on context variables, is a powerful approach for learning a generalizable agent. Current context-based Meta-RL methods typically construct their contextual policy with a neural network (NN) and take the context variables directly as part of the input. However, an NN-based policy contains an enormous number of parameters, which can lead to overfitting, make deployment difficult, and hurt interpretability. To improve generalization ability, efficiency and interpretability, we propose a novel Contextual Symbolic Policy (CSP) framework, which generates a policy in symbolic form, conditioned on the context variables, for unseen tasks in Meta-RL. Our key insight is that a symbolic expression can capture complex relationships by composing various operators, and its compact form helps strip out irrelevant information. Thus, CSP learns to produce symbolic policies for Meta-RL tasks and to extract the essential common knowledge, achieving higher generalization ability. Moreover, symbolic policies, with their compact form, are efficient to deploy and easier to understand. In our implementation, we construct CSP as a gradient-based framework that learns the symbolic policy from scratch in an end-to-end, differentiable way. The symbolic policy is represented by a symbolic network composed of various symbolic operators. We also employ a path selector to decide the proper symbolic form of the policy and a parameter generator to produce the coefficients of the symbolic policy. Empirically, we evaluate the proposed CSP method on several Meta-RL tasks and demonstrate that the contextual symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.
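To make the idea of a context-conditioned symbolic policy concrete, the following is a minimal toy sketch (our own illustration, not the paper's implementation; all names and the fixed expression form are assumptions). It fixes the symbolic form to a*sin(s0) + b*s1 and uses a linear "parameter generator" to map context variables to the coefficients (a, b):

```python
import numpy as np

def parameter_generator(context, W):
    """Toy parameter generator (hypothetical): a linear map from the
    task-specific context variables to the symbolic-policy coefficients."""
    return context @ W

def symbolic_policy(state, coeffs):
    """Symbolic policy with a fixed form a*sin(s0) + b*s1; only the
    coefficients depend on the task context."""
    a, b = coeffs
    return a * np.sin(state[0]) + b * state[1]

# Toy generator weights and a task-specific context vector.
W = np.array([[1.0, 0.0],
              [0.0, 0.5]])
context = np.array([2.0, 4.0])

coeffs = parameter_generator(context, W)            # -> array([2.0, 2.0])
action = symbolic_policy(np.array([0.0, 1.0]), coeffs)  # -> 2.0
```

The compact expression makes the per-task difference explicit: two tasks share the same symbolic form and differ only in the generated coefficients, which is exactly the property that makes such policies cheap to deploy and easy to inspect.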

1. INTRODUCTION

Meta-Reinforcement Learning (Meta-RL) is a promising strategy for improving the generalization ability of reinforcement learning on unseen tasks. Meta-RL methods learn the shared internal structure of tasks from experience collected across a distribution of training tasks and then quickly adapt to a new task with a small amount of experience. On this basis, context-based Meta-RL methods (Duan et al., 2016; Rakelly et al., 2019; Fakoor et al., 2019; Huang et al., 2021) have been proposed with the motivation that only part of the model parameters need to be updated in a new environment. They condition their model on a set of task-specific parameters, named context variables, which are formed by aggregating experience. Context-based Meta-RL methods are attractive because of their empirically higher performance and efficiency compared with earlier methods that update the whole model. However, how to incorporate the context variables into the policy remains an open problem. Most current methods construct their contextual policy with a neural network (NN) and take the context variables directly as part of the input. Such an NN-based policy usually involves thousands of parameters, which may cause training difficulties, lead to overfitting, and hurt generalization performance. In addition, deploying a complex NN-based policy is inefficient, or even impossible, under limited computational resources. Worse still, we have to treat the NN-based policy as a black box that is hard to comprehend and interpret; e.g., we cannot understand how the policies of different tasks differ.
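The standard construction criticized above can be sketched as follows (a minimal NumPy illustration of our own; the class name, dimensions, and two-layer architecture are assumptions, not any particular method's design). The context variables simply become extra input dimensions of an NN policy:

```python
import numpy as np

class ContextualNNPolicy:
    """Toy contextual NN policy: the state is concatenated with the
    task-specific context variables and mapped to an action by a small MLP."""

    def __init__(self, state_dim, context_dim, hidden_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.1, (state_dim + context_dim, hidden_dim))
        self.w2 = rng.normal(0.0, 0.1, (hidden_dim, action_dim))

    def act(self, state, context):
        # The context enters only as additional input features.
        x = np.concatenate([state, context])
        h = np.tanh(x @ self.w1)
        return np.tanh(h @ self.w2)

policy = ContextualNNPolicy(state_dim=4, context_dim=2,
                            hidden_dim=32, action_dim=1)
action = policy.act(np.zeros(4), np.ones(2))
```

Even this toy version already has (4+2)*32 + 32*1 = 224 parameters, and realistic hidden widths push that into the thousands; the per-task behavior is buried in the weight matrices, which is the opacity and deployment cost the symbolic alternative aims to avoid.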

