POLICY LEARNING USING WEAK SUPERVISION

Abstract

Most existing policy learning solutions require the learning agent to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is often infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). This way of leveraging the peer agent's information offers a family of solutions that learn effectively from weak supervision with theoretical guarantees. Extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environment grows.

1. INTRODUCTION

Recent breakthroughs in policy learning (PL) open up the possibility of applying these techniques in real-world applications such as robotics (Mnih et al., 2015; Akkaya et al., 2019) and self-driving (Bojarski et al., 2016a; Codevilla et al., 2018). Nonetheless, most existing works require agents to receive high-quality supervision signals, e.g., rewards or expert demonstrations, which are either infeasible or prohibitively expensive to obtain in practice. For instance, (1) the reward may be collected through sensors and thus not be credible (Everitt et al., 2017; Romoff et al., 2018; Wang et al., 2020); (2) the demonstrations by an expert in behavioral cloning (BC) are often imperfect due to limited resources and environment noise (Laskey et al., 2017; Wu et al., 2019; Reddy et al., 2020). Learning from weak supervision signals such as noisy rewards r̃ (noisy versions of r) (Wang et al., 2020) or low-quality demonstrations D̃_E (noisy versions of D_E) produced by a problematic expert π_E (Wu et al., 2019) is one of the outstanding challenges that prevent a wider application of PL. Although some recent works have explored these topics separately in their specific domains (Guo et al., 2019; Wang et al., 2020; Lee et al., 2020), there is no unified solution for performing robust policy learning under such imperfect supervision. In this work, we first formulate a meta-framework to study RL/BC with weak supervision signals and call it weakly supervised policy learning. In response, we then propose a theoretically principled solution concept, PeerPL, to perform efficient policy learning using the available weak supervision. Our solution concept is inspired by the literature on peer prediction (Miller et al., 2005; Dasgupta & Ghosh, 2013; Shnayder et al., 2016), which concerns verifying information without ground-truth verification.
Instead, a group of agents' reports (none of which is assumed to be high-quality or clean) are used to validate each other's information. We adopt a similar idea: we treat the "weak supervision" as information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" (CA) with the peer agent's. Compared to standard reward/loss functions that impose simple agreement with the weak supervision, our approach punishes over-agreement to avoid overfitting to the weak supervision. Our way of leveraging the peer agent's information offers a family of solutions that 1) do not require prior knowledge of the weakness of the supervision, and 2) learn effectively with strong theoretical guarantees. We demonstrate how the proposed PeerPL framework adapts to challenging tasks including RL with noisy rewards and behavioral cloning (BC) from weak demonstrations. Furthermore, we provide an extensive analysis of the convergence behavior and the sample complexity of our solutions. These results jointly demonstrate that our approach enables agents to learn the optimal policy efficiently under weak supervision. Evaluations on these tasks show strong evidence that PeerPL brings significant improvements over state-of-the-art solutions, especially when the complexity or the noise of the learning environment grows.

To summarize, the contributions of this paper are three-fold: (1) we provide a unified formulation of weakly supervised policy learning to model the weak supervision in RL/BC problems; (2) we propose a novel PeerPL solution framework based on computing a correlated agreement with the weak supervision, a novel way of policy evaluation introduced to RL/BC tasks; (3) PeerPL is theoretically guaranteed to recover the optimal policy (as if the supervision were high-quality and clean), and competitive empirical performance is observed in several policy learning tasks.
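The correlated-agreement idea at the core of our evaluation can be sketched in a few lines (a minimal illustration with hypothetical discrete actions; the function name and the chance-agreement estimate are our own simplifications, not the paper's implementation):

```python
from collections import Counter

def ca_score(learner_actions, weak_labels):
    """Score a learner against weak supervision via correlated agreement:
    agreement on matched pairs, minus the agreement expected if the two
    sequences were paired at random. A learner that blindly outputs a
    single action (pure over-agreement with the marginal) scores zero."""
    n = len(learner_actions)
    matched = sum(a == y for a, y in zip(learner_actions, weak_labels)) / n
    learner_freq = Counter(learner_actions)
    label_freq = Counter(weak_labels)
    # Chance agreement if the two sequences were randomly re-paired.
    chance = sum(learner_freq[c] * label_freq[c] / n**2 for c in learner_freq)
    return matched - chance
```

For instance, on weak labels [0, 1, 0, 1] a constant policy that always outputs 0 scores exactly 0.0, while a policy that matches the labels scores 0.5: simple agreement would reward both, but the subtracted chance term removes the credit for indiscriminate agreement.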

2. RELATED WORK

Learning with Noisy Supervision. Learning from noisy labels is widely explored in the supervised learning domain. Beginning with the seminal work of Natarajan et al. (2013), which proposed an unbiased surrogate loss function that recovers the true loss given knowledge of the noise rates, follow-up works focus on how to estimate the noise rates from noisy observations (Scott et al., 2013; Scott, 2015; Sukhbaatar & Fergus, 2014; van Rooyen & Williamson, 2015; Menon et al., 2015). Recent work (Wang et al., 2020) adapts this idea to RL and proposes a statistics-based estimation algorithm. However, the estimation is inefficient, especially when the state-action space is huge. Moreover, in a sequential process, errors in estimating the noise rates can accumulate and amplify when deploying an RL algorithm. In contrast, our solution does not require a priori specification of the noise rates, thus offloading the burden of estimation.

Behavioral Cloning (BC). Standard BC (Pomerleau, 1991; Ross & Bagnell, 2010) tackles the sequential decision-making problem by imitating the expert's actions using supervised learning. Specifically, it aims to minimize the one-step deviation error over the expert's trajectory without reasoning about the sequential consequences of actions. The agent therefore suffers from compounding errors when there is a mismatch between the demonstrations and the real states encountered (Ross & Bagnell, 2010; Ross et al., 2011). Recent works introduce data augmentation (Bojarski et al., 2016b), value-based regularization (Reddy et al., 2019), or inverse dynamics models (Torabi et al., 2018; Monteiro et al., 2020) to encourage learning long-horizon behaviors. While simple and straightforward, BC has been investigated in a wide range of domains (Giusti et al., 2016; Justesen & Risi, 2017) and often yields competitive performance (Farag & Saleh, 2018; Reddy et al., 2019).
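The one-step imitation objective of standard BC can be condensed into a schematic sketch (a tabular maximum-likelihood policy on hypothetical toy data; real implementations fit a neural policy by gradient descent):

```python
from collections import Counter, defaultdict

def behavioral_cloning(demos):
    """Fit a tabular policy by maximum likelihood: for each state, pick
    the action the expert chose most often. This is the one-step
    imitation objective -- no reasoning about sequential consequences."""
    counts = defaultdict(Counter)
    for state, action in demos:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Hypothetical demonstrations: (state, expert action) pairs.
demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "right")]
policy = behavioral_cloning(demos)
```

Note that the fitted policy is simply undefined on any state absent from the demonstrations, which is the tabular analogue of the compounding-error problem: once the agent drifts to unvisited states, the cloned policy offers no guidance.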
Our framework is complementary to the current BC literature: it introduces a strategy for learning from weak demonstrations (e.g., noisy ones, or ones produced by a poorly trained agent) and provides theoretical guarantees on how to recover the clean policy under mild assumptions (Song et al., 2019).

Correlated Agreement. Peer prediction aims to elicit information from self-interested agents without ground-truth verification (Miller et al., 2005; Dasgupta & Ghosh, 2013; Shnayder et al., 2016). The only source of information that can serve as verification is the agents' reports. In particular, Dasgupta & Ghosh (2013) and Shnayder et al. (2016) propose a correlated agreement (CA) type of mechanism, which evaluates the correlations between agents' reports. In addition to encouraging agreement between agents, the CA mechanism also punishes over-agreement, i.e., when two agents always report identically. This property helps reduce the effect of noisy reports by penalizing overfitting to them. Recently, Liu & Guo (2020) adapted a similar idea to learning from noisy labels in supervised learning. We consider the more challenging weakly supervised policy learning setting and study convergence rates in sequential decision-making problems.

3. POLICY LEARNING FROM WEAK SUPERVISION

We begin by introducing a general framework that unifies PL with low-quality supervision signals. We then instantiate the proposed weakly supervised formulation in two applications: (1) RL with noisy rewards and (2) behavioral cloning (BC) from weak expert demonstrations.
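For instantiation (1), a common way to model a weak reward channel is symmetric flipping of a binary reward with some unknown error rate, in the spirit of the noisy-reward literature cited above (a hypothetical sketch; the function name and parameters are ours):

```python
import random

def noisy_reward(r, e, rng):
    """Observe the true reward r through a symmetric noise channel:
    with probability e the sign is flipped, e.g., by a faulty sensor."""
    return -r if rng.random() < e else r

rng = random.Random(0)
samples = [noisy_reward(1.0, 0.2, rng) for _ in range(10000)]
# The empirical mean approaches (1 - 2 * e) * r = 0.6, so an agent that
# learns naively from this channel sees an attenuated reward signal.
```

Under this model, recovering the clean signal from r̃ alone requires knowing or estimating e, which motivates approaches, like ours, that avoid explicit noise-rate estimation.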

