CONTEXTUAL SYMBOLIC POLICY FOR META-REINFORCEMENT LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Context-based Meta-Reinforcement Learning (Meta-RL), which conditions the RL agent on context variables, is a powerful method for learning a generalizable agent. Current context-based Meta-RL methods often construct their contextual policy as a neural network (NN) that directly takes the context variables as part of its input. However, an NN-based policy contains a tremendous number of parameters, which may result in overfitting, deployment difficulties and poor interpretability. To improve generalization ability, efficiency and interpretability, we propose a novel Contextual Symbolic Policy (CSP) framework, which generates a contextual policy in symbolic form, conditioned on the context variables, for unseen tasks in meta-RL. Our key insight is that a symbolic expression is capable of capturing complex relationships by composing various operators, yet has a compact form that helps strip out irrelevant information. Thus, CSP learns to produce symbolic policies for meta-RL tasks and extracts the essential common knowledge across them to achieve higher generalization ability. Moreover, symbolic policies in a compact form are efficient to deploy and easier to understand. In the implementation, we construct CSP as a gradient-based framework that learns the symbolic policy from scratch in an end-to-end, differentiable way. The symbolic policy is represented by a symbolic network composed of various symbolic operators. We also employ a path selector to decide the proper symbolic form of the policy and a parameter generator to produce the coefficients of the symbolic policy. Empirically, we evaluate the proposed CSP method on several Meta-RL tasks and demonstrate that the contextual symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.

1. INTRODUCTION

Meta-Reinforcement Learning (Meta-RL) is a promising strategy for improving the generalization ability of reinforcement learning on unseen tasks. Meta-RL methods learn the shared internal structure of tasks from experience collected across a distribution of training tasks and then quickly adapt to a new task with a small amount of experience. On this basis, context-based Meta-RL methods (Duan et al., 2016; Rakelly et al., 2019; Fakoor et al., 2019; Huang et al., 2021) have been proposed with the motivation that only a part of the model parameters needs to be updated in a new environment. They condition their model on a set of task-specific parameters, named context variables, which are formed by aggregating experience. Context-based Meta-RL methods are attractive because of their empirically higher performance and efficiency compared with previous methods, which update the whole model. However, how to incorporate the context variables into the policy is still an open problem. Most current methods construct their contextual policy as a neural network (NN) that directly takes the context variables as part of its input. This kind of NN-based policy usually involves thousands of parameters, which may bring training difficulties, result in overfitting and hurt generalization performance. In addition, deploying a complex NN-based policy is inefficient, or even impossible, on limited computational resources. Worse, we have to treat the NN-based policy as a black box that is hard to comprehend and interpret; e.g., we cannot understand how the policies of different tasks differ. To address the above issues, in this work we propose a novel Contextual Symbolic Policy (CSP) framework to learn a contextual policy with a compact symbolic form for unseen tasks in meta-RL.
We are inspired by symbolic expressions, which have a compact form yet are capable of capturing complex relationships by composing variables, constants and various mathematical operators. In general, compact and effective representations strip out irrelevant information and capture the essential relationships between variables, which benefits generalization. Therefore, for meta-RL tasks with similar internal structures, CSP produces symbolic policies that model the relationship between the proper action and the state, and extracts essential common knowledge across the tasks. With this common knowledge of a family of tasks, CSP is able to achieve higher generalization ability and quickly adapt to unseen tasks. Moreover, the compact symbolic policies learned by CSP are efficient to deploy and easier to understand. In conclusion, contextual policies produced by CSP achieve higher generalization performance and efficiency, and show the potential to be interpretable when constrained to a compact symbolic form. However, finding the proper forms and constant values of the symbolic policies for a distribution of tasks is challenging. In this paper, we propose an efficient gradient-based learning method for the CSP framework that learns the contextual symbolic policy from scratch in an end-to-end, differentiable way. To express the policy in symbolic form, the proposed CSP consists of a symbolic network, a path selector and a parameter generator. The symbolic network can be considered a superset of the candidate symbolic policies: its activation functions are composed of various symbolic operators, and its parameters can be regarded as the coefficients in the symbolic expression. For a new task, the path selector chooses a compact symbolic form from the symbolic network by adaptively masking out the most irrelevant connections, while the parameters of the chosen symbolic form are produced by the parameter generator.
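To make the interplay of the three modules concrete, the following is a minimal, hypothetical sketch of a single CSP forward pass (all names and shapes are our own illustrative assumptions, not the paper's implementation): a symbolic layer applies candidate operators to masked, weighted combinations of the state; the path-selector mask keeps a sparse subset of connections; and a stand-in linear parameter generator maps the context variables to the expression's coefficients.

```python
import numpy as np

# Candidate symbolic operators that serve as activation functions
# in the symbolic network (illustrative subset).
OPERATORS = [np.sin, np.cos, lambda x: x, lambda x: x * x]

def parameter_generator(context, n_weights, rng):
    # Stand-in for a learned generator: maps the context variables
    # to the coefficients of the symbolic expression.
    W = rng.standard_normal((n_weights, context.size))
    return W @ context

def symbolic_layer(state, weights, mask):
    # Each operator receives one masked, weighted linear
    # combination of the state; masked-out connections vanish,
    # leaving a compact symbolic form.
    outputs = []
    for i, op in enumerate(OPERATORS):
        w = weights[i * state.size:(i + 1) * state.size]
        m = mask[i * state.size:(i + 1) * state.size]
        outputs.append(op(np.dot(w * m, state)))
    return np.array(outputs)

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])        # toy 2-D observation
context = np.array([1.0, 0.3])       # toy context variables
n_weights = len(OPERATORS) * state.size
weights = parameter_generator(context, n_weights, rng)
# Stand-in for the path selector: a hard binary mask over connections.
mask = (rng.random(n_weights) < 0.5).astype(float)
features = symbolic_layer(state, weights, mask)
action = np.tanh(features.sum())     # scalar action from the selected terms
```

In the actual framework both the mask and the generator would be learned end-to-end with gradients; this sketch only illustrates how a sparse operator composition, conditioned on context, yields a compact policy expression.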
We design all of these modules to be differentiable, so the whole framework can be updated efficiently with gradients. We evaluate the proposed CSP on several Meta-RL tasks. The results show that CSP achieves higher generalization performance than previous methods while reducing floating-point operations (FLOPs) by 2-45000×. Moreover, the produced symbolic policies show the potential to be interpretable.

2. RELATED WORKS

2.1 META-REINFORCEMENT LEARNING

Meta-RL extends the notion of meta-learning (Schmidhuber, 1987; Bengio et al., 1991; Thrun & Pratt, 1998) to the context of reinforcement learning. Some works (Li et al., 2017; Young et al., 2018; Kirsch et al., 2019; Zheng et al., 2018; Sung et al., 2017; Houthooft et al., 2018) aim to meta-learn the update rule for reinforcement learning. Here we consider another line of research: methods that meta-train a policy that can be adapted efficiently to a new task. Several works (Finn et al., 2017; Rothfuss et al., 2018; Stadie et al., 2018; Gupta et al., 2018; Liu et al., 2019) learn an initialization and adapt the parameters with policy gradient methods. However, these methods are inefficient because of their on-policy learning process and gradient-based updates during adaptation. Recently, context-based Meta-RL methods have achieved higher efficiency and performance. PEARL (Rakelly et al., 2019) proposes an off-policy Meta-RL method that infers probabilistic context variables from experience in new environments. ADARL (Huang et al., 2021) characterizes a compact representation of environment changes with a structural environment model, which enables efficient adaptation. Hyper (Sarafian et al., 2021) proposes a hypernetwork in which a primary network determines the weights of a conditional network, achieving higher performance. Fu et al. (2020) introduce a contrastive learning method to train a compact context encoder; they also train an exploration policy to maximize information gain. Most existing context-based Meta-RL methods (Fu et al., 2020; Zhang et al., 2021; Zintgraf et al., 2020) attempt to achieve higher performance by improving the context encoder or the exploration strategy. In this paper, by contrast, we aim to improve efficiency, interpretability and performance by replacing the pure neural network policy with a contextual symbolic policy.

