KNOWLEDGE-GROUNDED REINFORCEMENT LEARNING

Abstract

Acquiring knowledge, abiding by laws, and being aware of regulations are common behaviors in human society. Bearing in mind that reinforcement learning (RL) algorithms benefit from mimicking human behaviors, in this work, we propose that an RL agent can act on external guidance in both its learning process and model deployment, making the agent more socially acceptable. We introduce Knowledge-Grounded RL (KGRL), formally defined as a problem in which an agent learns to follow external guidelines while developing its own policy. Moving towards the goal of KGRL, we propose a novel actor model with an embedding-based attention mechanism that can attend to either a learnable internal policy or external knowledge. The proposed method is orthogonal to training algorithms, and the external knowledge can be flexibly recomposed, rearranged, and reused in both the training and inference stages. Through experiments on tasks with discrete and continuous action spaces, our KGRL agent is shown to be more sample-efficient and generalizable, and it has flexibly rearrangeable knowledge embeddings and interpretable behaviors.

1. INTRODUCTION

Incorporating external guidance into learning is a common behavior among humans. We can speed up our learning process by referring to useful suggestions, while remaining mindful of external regulations for safety and ethical reasons. On top of following external guidance, we humans can learn our own strategies to complete a task. We can also arbitrarily recompose, rearrange, and reuse those strategies and external guidelines to solve a new task and adapt to environmental changes. Imitating human behaviors has been shown to benefit reinforcement learning (RL) (Billard et al., 2016; Sutton and Barto, 2018; Zhang et al., 2019). However, how an RL agent can achieve the above capabilities remains challenging.

Different approaches have been proposed to develop some of these abilities. One branch of previous work in RL has studied how an agent can learn from demonstrations provided externally as examples of completing a task (Ross et al., 2011; Rajeswaran et al., 2017; Duan et al., 2017; Nair et al., 2018; Goecks et al., 2019; Ding et al., 2019). Another branch of previous research has investigated how an agent can learn reusable policies. This branch of research includes (1) transferring knowledge among tasks of similar difficulty (Parisotto et al., 2015; Yin and Pan, 2017; Gupta et al., 2018a; Liu et al., 2019; Tao et al., 2021) and (2) learning multiple reusable skills to solve a complex task in a divide-and-conquer manner (Bacon et al., 2017; Frans et al., 2017; Nachum et al., 2018a; Eysenbach et al., 2018; Kim et al., 2021; Tseng et al., 2021). These methods allow an agent to learn a new task with fewer training samples. However, demonstrations are too task-specific to be reused in a new task, and incorporating external guidelines from different sources into existing knowledge-reuse frameworks is not straightforward.
Moreover, current learning-from-demonstration and knowledge-reuse approaches lack the flexibility to rearrange and recompose different demonstrations or knowledge, so they cannot dynamically adapt to environmental changes.

In this work, we introduce Knowledge-Grounded Reinforcement Learning (KGRL), a novel problem with the following goal: an RL agent can learn its own policy (knowledge) while referring to external knowledge, and all knowledge can be arbitrarily recomposed, rearranged, and reused at any time in the learning and inference stages. A KGRL problem simultaneously considers the following three questions: (1) How can an agent follow a set of external knowledge from different sources? (2) How can an agent efficiently learn new knowledge by referring to external knowledge? (3) What is a proper representation of knowledge such that an agent can dynamically recompose, rearrange, and reuse knowledge in both the learning process and model deployment?

To address the challenges in KGRL, we propose a simple yet effective actor model with an embedding-based attention mechanism. This model is orthogonal to training algorithms and easy to implement. In this model, each external-knowledge policy is paired with one learnable embedding. An internal-knowledge policy, which the agent learns by itself, is also paired with one learnable embedding. Another learnable query embedding then attends to each knowledge embedding and decides which knowledge to follow. With this attention mechanism, the agent can learn new skills by following a set of external guidance. In addition, the design of knowledge and query embeddings disentangles each piece of knowledge from the selection mechanism, so a set of knowledge can be dynamically recomposed or rearranged. Finally, all knowledge policies are encoded into a joint embedding space, so they can be reused once their embeddings are appropriately learned.
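The actor described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the policy forms (a linear internal policy, two hand-coded external policies), the state-conditioned query, the embedding dimension, and the soft (weighted-sum) composition of actions are all assumptions made for the sketch; the paper only specifies that learnable embeddings and a query drive an attention mechanism over the knowledge set.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM, EMB_DIM = 4, 2, 8  # hypothetical sizes

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical external-knowledge policies: each maps a state to an action.
def go_left(state): return np.array([-1.0, 0.0])
def go_up(state):   return np.array([0.0, 1.0])
external_policies = [go_left, go_up]

# Internal policy the agent learns by itself (a linear map stands in for a network).
W_internal = 0.1 * rng.normal(size=(ACT_DIM, STATE_DIM))
def internal_policy(state): return W_internal @ state

policies = [internal_policy] + external_policies

# One learnable embedding per policy, plus a learnable query map.
knowledge_emb = rng.normal(size=(len(policies), EMB_DIM))
W_query = 0.1 * rng.normal(size=(EMB_DIM, STATE_DIM))

def act(state):
    query = W_query @ state                          # query embedding
    scores = knowledge_emb @ query / np.sqrt(EMB_DIM)  # scaled dot products
    weights = softmax(scores)                        # attention over knowledge
    # Soft composition: blend all policy outputs by their attention weights.
    action = sum(w * p(state) for w, p in zip(weights, policies))
    return action, weights

action, weights = act(rng.normal(size=STATE_DIM))
```

Because each policy is addressed only through its embedding, swapping, adding, or reordering entries of `policies` (with matching rows of `knowledge_emb`) recomposes the knowledge set without touching the selection mechanism, which is the disentanglement property the text describes.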
We evaluate our proposed KGRL method on grid-navigation and robotic-manipulation tasks. The results demonstrate that our method answers all the questions considered in KGRL. In our analyses, we show that the proposed approach achieves sample efficiency, generalizability, compositionality, and incrementality, which are the four essential components of efficient learning (Kaelbling, 2020). At the same time, our method also enables behavior interpretability because of the embedding-based attention mechanism. Our contributions are:

• We introduce KGRL, a novel RL problem addressing how an agent can repeatedly refer to an arbitrary set of external knowledge for learning new skills.

• We propose a novel actor model that achieves the goals of KGRL and is orthogonal to training algorithms.

• We demonstrate in the experiments that our KGRL method satisfies the four properties of efficient learning (Kaelbling, 2020) and learns interpretable strategies.

2. RELATED WORK

Several lines of research in RL have studied how an agent can incorporate external demonstrations/regulations into learning or learn reusable policies. Here we summarize the differences between their research goals and those of the KGRL formulation.

Learning from demonstrations/external strategies. Providing expert demonstrations or external strategies to an RL agent is a popular way to incorporate external guidance into learning. Demonstrations are (sub-)optimal examples of completing a task, represented as state-action pairs; external strategies are (sub-)optimal policies represented in a specific form, e.g., fuzzy logic. Previous work has leveraged demonstrations or external strategies by introducing them into the policy-update steps of RL (Hester et al., 2017; Rajeswaran et al., 2017; Vecerik et al., 2017; Nair et al., 2018; Pfeiffer et al., 2018; Goecks et al., 2019; Kartal et al., 2019; Zhang et al., 2020; 2021) or by performing imitation learning (IL) (Abbeel and Ng, 2004; Ross et al., 2011; Ho and Ermon, 2016; Duan et al., 2017; Peng et al., 2018a; b; Ding et al., 2019). Learning from demonstrations (LfD) and IL require a sufficient number of optimal demonstrations, which are difficult to collect, to achieve sample-efficient learning. Moreover, those demonstrations are task-dependent and can hardly be reused in other tasks. Compared to LfD and IL, learning from external strategies, including KGRL, can achieve efficient learning by referring to highly imperfect knowledge. Also, KGRL can incorporate general knowledge that is not necessarily task-related. Finally, our KGRL approach learns a uniform representation of knowledge in different forms, so it supports flexible recomposition, rearrangement, and reuse of an arbitrary set of knowledge. These abilities are not all achievable by previous work in this direction.

Safe RL. Other than demonstrations, regulations are also essential guidance that can help an agent achieve effective learning.
Safe RL, which incorporates safety regulations (constraints) into learning and inference stages, has gained more attention over the past decade. Current research in safe RL mainly includes safety in learning through (1) jointly optimizing the expected return and safe-related cost (Ammar et al., 2015; Achiam et al., 2017; Chow et al., 2018; Stooke et al., 2020; Ding et al., 

