POLICY LEARNING USING WEAK SUPERVISION

Abstract

Most existing policy learning solutions require the learning agent to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is either infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our way of leveraging the peer agent's information offers us a family of solutions that learn effectively from weak supervision with theoretical guarantees. Extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environment grows.

1. INTRODUCTION

Recent breakthroughs in policy learning (PL) open up the possibility of applying these techniques in real-world applications such as robotics (Mnih et al., 2015; Akkaya et al., 2019) and self-driving (Bojarski et al., 2016a; Codevilla et al., 2018). Nonetheless, most existing works require agents to receive high-quality supervision signals, e.g., rewards or expert demonstrations, which are either infeasible or prohibitively expensive to obtain in practice. For instance, (1) the reward may be collected through sensors and is thus not credible (Everitt et al., 2017; Romoff et al., 2018; Wang et al., 2020); (2) the demonstrations by an expert in behavioral cloning (BC) are often imperfect due to limited resources and environment noise (Laskey et al., 2017; Wu et al., 2019; Reddy et al., 2020). Learning from weak supervision signals such as noisy rewards r̃ (noisy versions of r) (Wang et al., 2020) or low-quality demonstrations D̃_E (noisy versions of D_E) produced by an imperfect expert π_E (Wu et al., 2019) is one of the outstanding challenges that prevent a wider application of PL. Although some recent works have explored these topics separately in their specific domains (Guo et al., 2019; Wang et al., 2020; Lee et al., 2020), there is no unified solution for performing robust policy learning under such imperfect supervision. In this work, we first formulate a meta-framework to study RL/BC with weak supervision signals and call it weakly supervised policy learning. In response, we propose a theoretically principled solution concept, PeerPL, to perform efficient policy learning using the available weak supervision. Our solution concept is inspired by the literature on peer prediction (Miller et al., 2005; Dasgupta & Ghosh, 2013; Shnayder et al., 2016), which concerns verifying information without ground-truth verification.
Instead, a group of agents' reports (none of which is assumed to be high-quality or clean) are used to validate each other's information. We adopt a similar idea and treat the "weak supervision" as information coming from a peer agent, evaluating the learning agent's policy based on a "correlated agreement" (CA) with the peer agent's. Compared to standard reward/loss functions that impose simple agreement with the weak supervision, our approach punishes over-agreement to avoid overfitting to the weak supervision. Our way of leveraging the peer agent's information offers us a family of solutions that 1) do not require prior knowledge of how weak the supervision is, and 2) learn effectively with strong theoretical guarantees. We demonstrate how the proposed PeerPL framework adapts to challenging tasks including RL with noisy rewards and behavioral cloning (BC) from weak demonstrations. Furthermore, we provide
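To make the correlated-agreement idea concrete, the following is a minimal sketch in the style of peer-prediction "peer loss" constructions: the score keeps the usual agreement term with the weak label, but subtracts the loss evaluated against the weak label of an independently paired sample, which penalizes a policy that blindly over-agrees with noisy supervision. All function and variable names here (`peer_loss`, `zero_one`, the uniform random pairing) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def peer_loss(loss_fn, predictions, weak_labels, rng=None):
    """Sketch of a correlated-agreement (CA) style evaluation.

    direct: average loss against the weak supervision itself.
    peer:   average loss against the weak label of an independently
            drawn (randomly paired) sample.
    Subtracting the peer term punishes over-agreement with noisy labels.
    """
    rng = rng or np.random.default_rng()
    n = len(predictions)
    pair = rng.permutation(n)  # independent random pairing of samples
    direct = np.mean([loss_fn(predictions[i], weak_labels[i]) for i in range(n)])
    peer = np.mean([loss_fn(predictions[i], weak_labels[pair[i]]) for i in range(n)])
    return direct - peer

# Illustrative 0-1 loss: 0 on agreement, 1 on disagreement.
zero_one = lambda p, y: float(p != y)
```

Under this construction, simply matching every weak label drives the direct term to zero, but the subtracted peer term keeps the score from rewarding agreement with labels whose frequency is explained by noise alone, which is the sense in which over-agreement is punished.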

