GUARDED POLICY OPTIMIZATION WITH IMPERFECT ONLINE DEMONSTRATIONS

ABSTRACT

The Teacher-Student Framework (TSF) is a reinforcement learning setting where a teacher agent guards the training of a student agent by intervening and providing online demonstrations. When the teacher policy is assumed to be optimal, it has the perfect timing and capability to intervene in the learning process of the student agent, providing a safety guarantee and exploration guidance. Nevertheless, in many real-world settings it is expensive or even impossible to obtain a well-performing teacher policy. In this work, we relax the assumption of a well-performing teacher and develop a new method that can incorporate arbitrary teacher policies with modest or inferior performance. We instantiate an off-policy reinforcement learning algorithm, termed Teacher-Student Shared Control (TS2C), which incorporates teacher intervention based on trajectory-based value estimation. Theoretical analysis validates that the proposed TS2C algorithm attains efficient exploration and a substantial safety guarantee without being affected by the teacher's own performance. Experiments on various continuous control tasks show that our method can exploit teacher policies at different performance levels while maintaining a low training cost. Moreover, the student policy surpasses the imperfect teacher policy, achieving higher accumulated reward in held-out testing environments. Code is available at https://metadriverse.github.io/TS2C.

1. INTRODUCTION

In Reinforcement Learning (RL), the Teacher-Student Framework (TSF) (Zimmer et al., 2014; Kelly et al., 2019) incorporates well-performing neural controllers or human experts as teacher policies in the learning process of autonomous agents. At each step, the teacher guards the free exploration of the student by intervening when a specific intervention criterion holds. Online data collected from both the teacher policy and the student policy are saved into the replay buffer and exploited with Imitation Learning or Off-Policy RL algorithms. Such a guarded policy optimization pipeline can either provide a safety guarantee (Peng et al., 2021) or facilitate efficient exploration (Torrey & Taylor, 2013).
The majority of RL methods in TSF assume the availability of a well-performing teacher policy (Spencer et al., 2020; Torrey & Taylor, 2013) so that the student can properly learn from the teacher's demonstrations how to act in the environment. The teacher intervention is triggered when the student acts differently from the teacher (Peng et al., 2021) or when the teacher finds the current state worth exploring (Chisari et al., 2021). This reliance on the teacher is similar to imitation learning, where the training outcome is significantly affected by the quality of demonstrations (Kumar et al., 2020; Fujimoto et al., 2019). Thus, with current TSF methods, if the teacher is incapable of providing high-quality demonstrations, the student will be misguided and its final performance will be upper-bounded by the performance of the teacher. However, it is time-consuming or even impossible to obtain a well-performing teacher in many real-world applications such as object manipulation with robot arms (Yu et al., 2020a) and autonomous driving (Li et al., 2022a). As a result, current TSF methods will behave poorly with a less capable teacher. Yet in the real world, the coach of Usain Bolt does not necessarily need to run faster than Usain Bolt.
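
The generic guarded data-collection loop described above can be illustrated with a minimal Python sketch. It assumes the action-difference intervention criterion attributed above to prior TSF work (intervene when the student acts differently from the teacher); it is not the value-based criterion of TS2C. All names (`Transition`, `action_gap_criterion`, `guarded_rollout`, `toy_env_step`) and the one-dimensional toy environment are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transition:
    state: float
    action: float
    reward: float
    next_state: float
    from_teacher: bool  # True if the teacher's action was executed


def action_gap_criterion(student_action: float, teacher_action: float,
                         threshold: float) -> bool:
    # Assumed criterion: intervene when the two actions differ by more
    # than a fixed threshold (hypothetical hyperparameter).
    return abs(student_action - teacher_action) > threshold


def guarded_rollout(student: Callable[[float], float],
                    teacher: Callable[[float], float],
                    env_step: Callable[[float, float], Tuple[float, float]],
                    initial_state: float,
                    horizon: int,
                    threshold: float) -> List[Transition]:
    # Collect one trajectory in which the teacher may override the student.
    buffer: List[Transition] = []
    state = initial_state
    for _ in range(horizon):
        student_action = student(state)
        teacher_action = teacher(state)
        intervene = action_gap_criterion(student_action, teacher_action, threshold)
        # The executed action comes from the teacher only on intervention.
        action = teacher_action if intervene else student_action
        next_state, reward = env_step(state, action)
        # Data from both policies go into the same replay buffer.
        buffer.append(Transition(state, action, reward, next_state, intervene))
        state = next_state
    return buffer


def toy_env_step(state: float, action: float) -> Tuple[float, float]:
    # Hypothetical 1-D task: move along a line, reward is negative distance to 0.
    next_state = state + action
    return next_state, -abs(next_state)
```

An off-policy learner would then sample from the returned buffer, using the `from_teacher` flag to tell demonstrations apart from the student's own exploration. Under this criterion the student is pulled toward the teacher's behavior whenever they disagree, which is why a weak teacher limits the student.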
Is it possible to develop a new interactive learning scheme in which a student can outperform the teacher while retaining the safety guarantee the teacher provides? In this work we develop a new guarded policy optimization

