GUARDED POLICY OPTIMIZATION WITH IMPERFECT ONLINE DEMONSTRATIONS

Abstract

The Teacher-Student Framework (TSF) is a reinforcement learning setting in which a teacher agent guards the training of a student agent by intervening and providing online demonstrations. Assuming an optimal teacher, the teacher policy has perfect timing and capability to intervene in the learning process of the student agent, providing a safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy in terms of higher accumulated reward in held-out testing environments.

1. INTRODUCTION

In Reinforcement Learning (RL), the Teacher-Student Framework (TSF) (Zimmer et al., 2014; Kelly et al., 2019) incorporates well-performing neural controllers or human experts as teacher policies in the learning process of autonomous agents. At each step, the teacher guards the free exploration of the student by intervening when a specific intervention criterion holds. Online data collected from both the teacher policy and the student policy are saved into the replay buffer and exploited with imitation learning or off-policy RL algorithms. Such a guarded policy optimization pipeline can either provide a safety guarantee (Peng et al., 2021) or facilitate efficient exploration (Torrey & Taylor, 2013). The majority of RL methods in TSF assume the availability of a well-performing teacher policy (Spencer et al., 2020; Torrey & Taylor, 2013) so that the student can properly learn from the teacher's demonstrations how to act in the environment. Teacher intervention is triggered when the student acts differently from the teacher (Peng et al., 2021) or when the teacher finds the current state worth exploring (Chisari et al., 2021). This is similar to imitation learning, where the training outcome is significantly affected by the quality of demonstrations (Kumar et al., 2020; Fujimoto et al., 2019). Thus, with current TSF methods, if the teacher cannot provide high-quality demonstrations, the student will be misguided and its final performance will be upper-bounded by that of the teacher. However, it is time-consuming or even impossible to obtain a well-performing teacher in many real-world applications, such as object manipulation with robot arms (Yu et al., 2020a) and autonomous driving (Li et al., 2022a). As a result, current TSF methods behave poorly with a less capable teacher. In the real world, the coach of Usain Bolt does not need to run faster than Usain Bolt.
Is it possible to develop a new interactive learning scheme in which a student can outperform the teacher while retaining its safety guarantee? In this work, we develop a new guarded policy optimization method called Teacher-Student Shared Control (TS2C). It follows the setting of a teacher policy and a learning student policy but relaxes the requirement of high-quality demonstrations from the teacher. A new intervention mechanism is designed: rather than triggering intervention based on the similarity between the actions of teacher and student, intervention is determined by a trajectory-based value estimator. The student is allowed to take an action that deviates from the teacher's, as long as its expected return is promising. By relaxing the intervention criterion from step-wise action similarity to trajectory-based value estimation, the student has the freedom to act differently when the teacher fails to provide correct demonstrations and thus has the potential to outperform the imperfect teacher. Our theoretical analysis shows that in previous TSF methods the quality of the online data-collecting policy is upper-bounded by the performance of the teacher policy. In contrast, the upper-bound performance of TS2C is not limited by the imperfect teacher, while a lower-bound performance and safety guarantee are still retained. Experiments on various continuous control environments show that under the newly proposed method, the student policy can be optimized efficiently and safely under teachers at different performance levels, while other TSF algorithms are largely bounded by the teacher's performance. Furthermore, student policies trained under the proposed TS2C substantially outperform all baseline methods in terms of higher efficiency and lower test-time cost, supporting our theoretical analysis.
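The contrast between the two intervention criteria can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: the function names, the Euclidean distance metric, and the `threshold`/`margin` hyperparameters are assumptions made for exposition, and `value_fn` stands in for whatever trajectory-based value estimator is learned in practice.

```python
import numpy as np

def action_similarity_intervention(teacher_action, student_action, threshold):
    """Prior TSF-style criterion (sketch): the teacher intervenes whenever the
    student's action deviates too far from the teacher's own action."""
    return np.linalg.norm(teacher_action - student_action) > threshold

def value_based_intervention(value_fn, state, teacher_action, student_action, margin):
    """TS2C-style criterion (sketch): the teacher intervenes only when the
    student's action is expected to return noticeably less than the teacher's,
    so the student may act differently as long as its expected return is
    promising."""
    return value_fn(state, student_action) < value_fn(state, teacher_action) - margin
```

Under the first criterion any deviation from the teacher triggers a takeover, so the student can never exploit actions the imperfect teacher would not choose; under the second, a deviating action is tolerated whenever the value estimator judges its expected return competitive.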

2. RELATED WORK

The Teacher-Student Framework. The idea of transferring knowledge from a teacher policy to a student policy has been explored in reinforcement learning (Zimmer et al., 2014). It improves the learning efficiency of the student policy by leveraging a pretrained teacher policy, usually by adding an auxiliary loss that encourages the student policy to stay close to the teacher policy (Schmitt et al., 2018; Traoré et al., 2019). Though our method follows the teacher-student transfer framework, an optimal teacher is not a necessity. In prior work, agents are fully controlled during training by either the student (Traoré et al., 2019; Schmitt et al., 2018) or the teacher policy (Rusu et al., 2016), while our method follows intervention-based RL, where a mixed policy controls the agent. Other attempts to relax the need for well-performing teacher models include student-student transfer (Lin et al., 2017; Lai et al., 2020), in which heterogeneous agents exchange knowledge through mutual regularization (Zhao & Hospedales, 2021; Peng et al., 2020).

Learning from Demonstrations. Another way to exploit the teacher policy is to collect static demonstration data from it. The learning agent regards the demonstrations as optimal transitions to imitate. If the data is provided without reward signals, the agent can learn by imitating the teacher's policy distribution (Ly & Akhloufi, 2020), matching the trajectory distribution (Ho & Ermon, 2016; Xu et al., 2019), or learning a parameterized reward function with inverse reinforcement learning (Abbeel & Ng, 2004; Fu et al., 2017). With additional reward signals, agents can perform Bellman updates pessimistically, as most offline reinforcement learning algorithms do (Levine et al., 2020). The conservative Bellman update can be performed either by restricting the overestimation of the learned Q-function (Fujimoto et al., 2019; Kumar et al., 2020) or by incorporating model-based uncertainty estimation (Yu et al., 2020b; Chen et al., 2021b).
In contrast to offline learning from demonstrations, in this work we focus on the online deployment of teacher policies with teacher-student shared control and show its superiority in reducing state distributional shift, improving efficiency, and ensuring training-time safety.

Intervention-based Reinforcement Learning. Intervention-based RL enables both the expert and the learning agent to generate online samples in the environment. The switch between policies can be random (Ross et al., 2011), rule-based (Parnichkun et al., 2022), or determined by the expert, either through the manual intervention of human participants (Abel et al., 2017; Chisari et al., 2021; Li et al., 2022b) or by referring to the policy distribution of a parameterized expert (Peng et al., 2021). More delicate switching algorithms include RCMP (da Silva et al., 2020), which asks for expert advice when the learner's action has high estimated uncertainty. RCMP only works for agents with discrete action spaces, while we investigate continuous action spaces in this paper. Also, Ross & Bagnell (2014) and Sun et al. (2017) query the expert to obtain the optimal value function, which is used to guide expert intervention. These switching mechanisms assume the expert policy to be optimal, while our proposed algorithm can make use of a suboptimal expert policy. To exploit

