POLICY LEARNING USING WEAK SUPERVISION

Abstract

Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is often infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages weak supervision to perform policy learning efficiently. To handle this problem, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our way of leveraging the peer agent's information offers a family of solutions that learn effectively from weak supervision with theoretical guarantees. Extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environment grows.

1. INTRODUCTION

Recent breakthroughs in policy learning (PL) open up the possibility of applying these techniques in real-world applications such as robotics (Mnih et al., 2015; Akkaya et al., 2019) and self-driving (Bojarski et al., 2016a; Codevilla et al., 2018). Nonetheless, most existing works require agents to receive high-quality supervision signals, e.g., rewards or expert demonstrations, which are either infeasible or prohibitively expensive to obtain in practice. For instance, (1) the reward may be collected through sensors and is thus not credible (Everitt et al., 2017; Romoff et al., 2018; Wang et al., 2020); (2) the demonstrations by an expert in behavioral cloning (BC) are often imperfect due to limited resources and environment noise (Laskey et al., 2017; Wu et al., 2019; Reddy et al., 2020). Learning from weak supervision signals such as noisy rewards $\tilde{r}$ (noisy versions of $r$) (Wang et al., 2020) or low-quality demonstrations $\tilde{D}_E$ (noisy versions of $D_E$) produced by a problematic expert $\tilde{\pi}_E$ (Wu et al., 2019) is one of the outstanding challenges that prevent a wider application of PL. Although some recent works have explored these topics separately in their specific domains (Guo et al., 2019; Wang et al., 2020; Lee et al., 2020), there is no unified solution for performing robust policy learning under such imperfect supervision.

In this work, we first formulate a meta-framework to study RL/BC with weak supervision signals, which we call weakly supervised policy learning. As a response, we then propose a theoretically principled solution concept, PeerPL, to perform efficient policy learning using the available weak supervision. Our solution concept is inspired by the literature on peer prediction (Miller et al., 2005; Dasgupta & Ghosh, 2013; Shnayder et al., 2016), which concerns verifying information without ground-truth verification. Instead, a group of agents' reports (none of which is assumed to be high-quality or clean) are used to validate each other's information. We adopt a similar idea: we treat the "weak supervision" as information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" (CA) with the peer agent's. Compared to standard reward/loss functions that impose simple agreement with the weak supervision, our approach punishes over-agreement to avoid overfitting to the weak supervision. Our way of leveraging the peer agent's information offers a family of solutions that 1) do not require prior knowledge of the weakness of the supervision, and 2) learn effectively with strong theoretical guarantees.

We demonstrate how the proposed PeerPL framework adapts to challenging tasks including RL with noisy rewards and behavioral cloning (BC) from weak demonstrations. Furthermore, we provide an intensive analysis of the convergence behavior and the sample complexity of our solutions. These results jointly demonstrate that our approach enables agents to learn the optimal policy efficiently under weak supervision. Evaluations on these tasks show strong evidence that PeerPL brings significant improvements over state-of-the-art solutions, especially when the complexity or the noise of the learning environment grows.
To summarize, the contributions of this paper are three-fold: (1) we provide a unified formulation of weakly supervised policy learning to model weak supervision in RL/BC problems; (2) we propose a novel PeerPL solution framework based on computing a correlated agreement with the weak supervision, a new way of performing policy evaluation in RL/BC tasks; (3) PeerPL is theoretically guaranteed to recover the optimal policy (as if the supervision were high-quality and clean), and competitive empirical performance is observed across several policy learning tasks.

2. RELATED WORK

Learning with Noisy Supervision. Learning from noisy labels has been widely explored in the supervised learning domain. Beginning with the seminal work of Natarajan et al. (2013), which first proposed an unbiased surrogate loss function to recover the true loss given knowledge of the noise rates, follow-up works focus on estimating the noise rates from noisy observations (Scott et al., 2013; Scott, 2015; Sukhbaatar & Fergus, 2014; van Rooyen & Williamson, 2015; Menon et al., 2015). Recent work (Wang et al., 2020) adapts this idea to RL and proposes a statistics-based estimation algorithm. However, the estimation is inefficient, especially when the state-action space is large. Moreover, since RL is a sequential process, errors in estimating the noise rates can accumulate and amplify when deploying an RL algorithm. In contrast, our solution does not require a priori specification of the noise rates, offloading the burden of estimation.

Behavioral Cloning (BC). Standard BC (Pomerleau, 1991; Ross & Bagnell, 2010) tackles sequential decision-making by imitating the expert's actions using supervised learning. Specifically, it aims to minimize the one-step deviation error over the expert's trajectory without reasoning about the sequential consequences of actions. The agent therefore suffers from compounding errors when there is a mismatch between the demonstrations and the states actually encountered (Ross & Bagnell, 2010; Ross et al., 2011). Recent works introduce data augmentation (Bojarski et al., 2016b), value-based regularization (Reddy et al., 2019), or inverse dynamics models (Torabi et al., 2018; Monteiro et al., 2020) to encourage learning long-horizon behaviors. While simple and straightforward, BC has been investigated in a wide range of domains (Giusti et al., 2016; Justesen & Risi, 2017) and often yields competitive performance (Farag & Saleh, 2018; Reddy et al., 2019). Our framework complements the current BC literature by introducing a strategy for learning from weak demonstrations (e.g., noisy ones, or ones produced by a poorly trained agent) and providing theoretical guarantees on recovering the clean policy under mild assumptions (Song et al., 2019).

Correlated Agreement

Peer prediction aims to elicit information from self-interested agents without ground-truth verification (Miller et al., 2005; Dasgupta & Ghosh, 2013; Shnayder et al., 2016). The only source of information available for verification is the agents' own reports. In particular, Dasgupta & Ghosh (2013) and Shnayder et al. (2016) propose a correlated agreement (CA) type of mechanism, which evaluates the correlation between agents' reports. In addition to encouraging some agreement between agents, the CA mechanism punishes over-agreement when two agents always report identically. This property reduces the effect of noisy reports by punishing overfitting to them. Recently, Liu & Guo (2020) adapted a similar idea to learning from noisy labels in supervised learning. We consider the more challenging weakly supervised policy learning setting and study convergence rates in sequential decision-making problems.

3. POLICY LEARNING FROM WEAK SUPERVISION

We begin by introducing a general framework that unifies PL with low-quality supervision signals. We then instantiate the proposed weakly supervised formulation in two applications: (1) RL with noisy rewards and (2) behavioral cloning (BC) from weak expert demonstrations.

3.1. PRELIMINARY OF POLICY LEARNING

The goal of policy learning (PL) is to learn a policy $\pi$ that the agent follows to perform a series of actions in a stateful environment. For RL, the interactive environment is characterized as an MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma \rangle$. At each time $t$, the agent in state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$ by following the policy $\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, and potentially receives a reward $r(s_t, a_t) \in \mathcal{R}$. The agent then transfers to the next state $s_{t+1}$ according to a transition probability function $\mathbb{P}$. We denote the generated trajectory by $\tau = \{(s_t, a_t, r_t)\}_{t=0}^{T}$, where $T$ is a finite or infinite horizon. RL algorithms aim to maximize the expected reward over the trajectory $\tau$ induced by the policy: $J(\pi) = \mathbb{E}_{(s_t, a_t, r_t) \sim \tau}\big[\sum_{t=0}^{T} \gamma^t r_t\big]$, where $\gamma \in (0, 1]$ is the discount factor.

Another popular policy learning method is behavioral cloning (BC). The goal of BC is to mimic the expert policy $\pi_E$ through demonstrations $D_E = \{(s_i, a_i)\}_{i=1}^{N}$ drawn from distribution $\mathcal{D}_E$ (generated according to $\pi_E$), where $(s_i, a_i)$ is a state-action pair sampled from the expert's trajectory. Typically, training a policy with standard BC corresponds to maximizing the log-likelihood $J(\pi) = \mathbb{E}_{(s, a) \sim \mathcal{D}_E}[\log \pi(a|s)]$.

In both RL and BC, the agent receives "supervision" either through the reward $r$ by interacting with the environment or through the expert policy $\pi_E$ via observable demonstrations. Given a particular policy class $\Pi$, the optimal policy is defined as $\pi^* = \arg\max_{\pi \in \Pi} J(\pi)$: $\pi^*$ obtains the maximum expected reward over the horizon $T$ in RL, and corresponds to the clean expert policy $\pi_E$ in BC. In practice, one can also combine RL and BC to take advantage of both worlds (Brys et al., 2015; Hester et al., 2018; Guo et al., 2019; Song et al., 2019). Specifically, a recent hybrid framework called policy co-training (Song et al., 2019) is considered in this paper.
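To make the two objectives concrete, here is a minimal Python sketch (our own illustrative code, not from the paper) of a Monte-Carlo estimate of the RL return and of the BC log-likelihood; `policy_probs` and `demos` are assumed interfaces.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of J(pi) = E[sum_t gamma^t r_t] from one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def bc_log_likelihood(policy_probs, demos):
    """BC objective J(pi) = E_{(s,a)~D_E}[log pi(a|s)].

    policy_probs: callable mapping a state to a dict of action probabilities
                  (an assumed interface for this sketch).
    demos: list of (state, action) pairs sampled from the expert.
    """
    return np.mean([np.log(policy_probs(s)[a]) for s, a in demos])
```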

3.2. META FRAMEWORK FOR POLICY LEARNING WITH WEAK SUPERVISION

To unify notation, we denote the weak supervision signal by $\tilde{Y}$, which is either the noisy reward $\tilde{r}$ for RL or the action $\tilde{a}$ performed by a weak expert policy $\tilde{\pi}_E$ for BC; $\tilde{Y}$ denotes a weak version of a high-quality supervision signal $Y$. In an abstract manner, a weakly supervised PL problem can then be formulated as learning the optimal policy $\pi^*$ with access only to a weakly supervised sequence $\{(s_i, a_i), \tilde{Y}_i\}_{i=1}^{N}$. To unify the discussion, suppose we have an evaluation function $\text{Eva}_\pi((s_i, a_i), \tilde{Y}_i)$ that evaluates the policy at $(s_i, a_i)$ against a weak supervision signal $\tilde{Y}_i$. In the RL setting, $\text{Eva}_\pi$ is the loss of the particular RL algorithm, which is a function of the noisy reward $\tilde{r}$ received at $(s_i, a_i)$. In the BC setting, $\text{Eva}_\pi$ is the loss used to evaluate the action taken by the agent given the action taken by the expert. Furthermore, we let $J(\pi)$ denote the function that evaluates policy $\pi$ under a set of state-action pairs with weak supervision signals $\{(s_i, a_i), \tilde{Y}_i\}_{i=1}^{N}$, i.e., $J(\pi) = \mathbb{E}_{(s, a) \sim \tau}[\text{Eva}_\pi((s, a), \tilde{Y})]$. Note that the above unified notation is only for clearer delivery of our framework; we still treat PL as a sequential decision problem. We focus on the following two instantiations of the weakly supervised setting.

RL with Noisy Reward. Consider a finite MDP $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{F}, \mathbb{P}, \gamma \rangle$ with noisy reward channels (Wang et al., 2020), where $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and the noisy reward $\tilde{r}$ is generated following a certain function $\mathcal{F}: \mathcal{R} \to \tilde{\mathcal{R}}$. Denote the trajectory a policy $\pi_\theta$ generates via interacting with $\tilde{\mathcal{M}}$ by $\tilde{\tau}_\theta$. Assuming the reward is discrete with $|\mathcal{R}|$ levels, the noise can be characterized via a matrix $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$, where each entry $c^{\text{RL}}_{j,k} = \mathbb{P}(\tilde{r}_t = R_k \mid r_t = R_j)$ indicates the flipping probability for generating a perturbed outcome. We call $r$ and $\tilde{r}$ the true reward and the noisy reward, respectively.

BC with Weak Demonstration. Instead of observing expert demonstrations generated according to $\pi_E$, we only have access to weak demonstrations $\{(s_i, \tilde{a}_i)\}_{i=1}^{N}$, where $\tilde{a}_i \sim \tilde{\pi}_E(\cdot|s_i)$ is the noisy action and each state-action pair $(s_i, \tilde{a}_i)$ is drawn from distribution $\tilde{\mathcal{D}}_E$. In particular, we assume the noisy action $\tilde{a}_i$ is independent of the state $s_i$ given the deterministic expert action $\pi_E(s_i)$, i.e., $\mathbb{P}(\tilde{a}_i \mid \pi_E(s_i)) = \mathbb{P}(\tilde{a}_i \mid s_i, \pi_E(s_i))$. Similar to RL, we assume the noise regime can be characterized by a confusion matrix $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$, where each entry $c^{\text{BC}}_{j,k} = \mathbb{P}(\tilde{\pi}_E(s) = A_k \mid \pi_E(s) = A_j)$ indicates the probability of the weak expert taking a suboptimal action. In this setting, we would like to recover $\pi^*$ as if we were able to access the high-quality expert policy $\pi_E$ instead of $\tilde{\pi}_E$.

In both settings, the noise matrices $C^{\text{RL}}$ and $C^{\text{BC}}$ are unknown to the learner; avoiding the need for this knowledge is a main challenge in designing our solution, and as we show later, our approach successfully does so.
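Both noise models above can be simulated with one routine. Below is a minimal sketch (our own code; the function and the example matrix are assumptions) that draws a weak supervision signal from a confusion matrix $C$, shown for the binary-reward case used in the later analysis.

```python
import numpy as np

# C[j, k] = P(observed level k | true level j); rows sum to 1.
def corrupt(true_idx, C, rng):
    """Sample a noisy reward/action level index given the true level index."""
    return rng.choice(len(C), p=C[true_idx])

# Binary reward {r-, r+} with error rates e- and e+ (the setting of Lemma 1):
e_minus, e_plus = 0.2, 0.2
C_RL = np.array([[1 - e_minus, e_minus],   # true r-  -> observed (r-, r+)
                 [e_plus, 1 - e_plus]])    # true r+  -> observed (r-, r+)

rng = np.random.default_rng(0)
noisy_level = corrupt(1, C_RL, rng)        # corrupt one true r+ observation
```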

4. PEERPL: WEAKLY SUPERVISED PL VIA CORRELATED AGREEMENT

To deal with weak supervision in PL, we propose a unified and theoretically principled framework, PeerPL. We treat the weak supervision as information coming from a "peer agent", and then evaluate the policy using a certain type of "correlated agreement" function between the learning policy and the peer agent's.

4.1. OVERVIEW OF THE IDEA: CORRELATED AGREEMENT WITH WEAK SUPERVISIONS

We first present the general idea of our PeerPL framework, which uses a concept named "correlated agreement" (CA). For each weakly supervised state-action pair $((s_i, a_i), \tilde{Y}_i)$, we randomly sample another state-action pair $(s_j, a_j)$, $j \neq i$, as well as another supervision signal $\tilde{Y}_k$, $k \neq i, j$, from a different state-action pair. Then we evaluate $((s_i, a_i), \tilde{Y}_i)$ according to the following:

CA with Weak Supervision: $\text{Eva}_\pi\big((s_i, a_i), \tilde{Y}_i\big) - \text{Eva}_\pi\big((s_j, a_j), \tilde{Y}_k\big)$ (1)

Intuitively, the first term above encourages an "agreement" with the weak supervision (the policy agrees with the corresponding supervision), while the second term punishes "blind" over-agreement, which happens when the agent's policy always matches the weak supervision even on randomly paired traces (noise). The randomly paired samples $j, k$ help us perform this check. Note that the implementation of our mechanism requires knowledge of neither $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$ nor $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$, and thus offers a prior-knowledge-free way to learn effectively from weak supervision. This process is illustrated in Figure 1.

Illustrative toy example. Consider a toy BC setting where the learned policy outputs a sequence of actions $a_1 = a_2 = a_3 = 1, a_4 = 0$ that perfectly matches the weak supervision $a'_1 = a'_2 = a'_3 = 1, a'_4 = 0$ (at the same sequence of states). Suppose $\text{Eva}_\pi((s_i, a_i), a'_i)$ evaluates how well the policy matches the expert demonstration ($\text{Eva}_\pi = 1$ for agreeing, $0$ otherwise). Using only $\text{Eva}_\pi((s_i, a_i), a'_i)$ returns the highest score $1$ for agreeing with this noisy/imperfect/low-quality supervision. However, the correlated agreement evaluation returns (for example, for $i = 1$) $\mathbb{E}[\text{Eva}_\pi((s_i, a_i), a'_i) - \text{Eva}_\pi((s_j, a_j), a'_k)] = 1 - (0.75^2 + 0.25^2) = 0.375$, where $0.75^2 + 0.25^2$ is the probability that a randomly paired $a$ and $a'$ match each other (actions $1$ and $0$ appear with frequencies $0.75$ and $0.25$). This example shows that full agreement with the weak supervision is instead punished! In what follows, we concretize our implementation within each of the settings considered and provide theoretical guarantees under weak supervision.
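A minimal sketch of how the CA evaluation in Eqn. (1) could be computed over a batch of weakly supervised pairs; the function names and the batch interface are our own assumptions, and the $\xi$ weight anticipates the instantiations in Sections 4.2 and 4.3.

```python
import random

def ca_score(eval_fn, pairs, xi=1.0, rng=random):
    """Average correlated-agreement score (Eqn. 1) over a batch.

    pairs: list of ((s, a), y_tilde) weakly supervised state-action pairs.
    eval_fn(sa, y): agreement score Eva_pi of the policy at (s, a) with y.
    For each i we sample j != i and k != i, j, and accumulate
    Eva((s_i, a_i), y_i) - xi * Eva((s_j, a_j), y_k).
    """
    total, n = 0.0, len(pairs)
    for i in range(n):
        sa_i, y_i = pairs[i]
        j = rng.choice([t for t in range(n) if t != i])
        k = rng.choice([t for t in range(n) if t not in (i, j)])
        total += eval_fn(sa_i, y_i) - xi * eval_fn(pairs[j][0], pairs[k][1])
    return total / n
```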

4.2. PEERRL: PEER REINFORCEMENT LEARNING

Since the reward signals are no longer credible in the weakly supervised RL setting, we propose the following objective function, which punishes over-agreement based on the CA mechanism:

$$J_{\text{RL}}(\pi_\theta) = \mathbb{E}\big[\text{Eva}^{\text{RL}}_\pi\big((s_i, a_i), \tilde{r}_i\big)\big] - \xi \cdot \mathbb{E}\big[\text{Eva}^{\text{RL}}_\pi\big((s_j, a_j), \tilde{r}_k\big)\big], \quad (2)$$
$$\text{where } \text{Eva}^{\text{RL}}_\pi\big((s, a), \tilde{r}\big) = \ell\big(\pi_\theta, (s, a, \tilde{r})\big). \quad (3)$$

In (2), the first expectation is taken over $(s_i, a_i, \tilde{r}_i) \sim \tilde{\tau}_\theta$ and the second over $(s_j, a_j, \tilde{r}_j) \sim \tilde{\tau}_\theta$, $(s_k, a_k, \tilde{r}_k) \sim \tilde{\tau}_\theta$, where $\tilde{\tau}_\theta$ is the trajectory generated by $\pi_\theta$ under the noisy reward function $\tilde{r}$. The choice of the evaluation function $\ell$ depends on the RL algorithm used (e.g., the temporal difference error (Mnih et al., 2013; Wang et al., 2016) or the policy gradient loss (Sutton et al., 1999)). The learning sequence is encoded in $\pi$; therefore, maximizing the objective $J_{\text{RL}}(\pi)$ is equivalent to maximizing the accumulated reward. $\xi \geq 0$ is a hyperparameter balancing the penalty for blind agreement induced by CA.

In what follows, we consider Q-Learning (Watkins & Dayan, 1992) as the underlying learning algorithm, where $\ell(\pi_\theta, (s, a, \tilde{r})) = \tilde{r}$, and demonstrate that our CA mechanism provides strong guarantees for Q-Learning when only the noisy reward is observed. For clarity, we define the peer reward as

Peer RL Reward: $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \xi \cdot \tilde{r}'$,

where $\tilde{r}'$ is a noisy reward randomly sampled over all state-action pairs and $\xi \geq 0$ balances the noisy reward against the punishment for blind agreement (with $\tilde{r}'$). We set $\xi = 1$ (for the binary case) in the following analysis and treat each $(s, a)$ equally when sampling $\tilde{r}'$. In our experiments, $\tilde{r}_{\text{peer}}$ is not sensitive to the choice of $\xi$, and we keep $\xi$ constant within each run of the RL experiments.

We show that the peer reward $\tilde{r}_{\text{peer}}$ offers an affine transformation of the true reward in expectation, which is the key to guaranteeing that our Peer RL algorithm converges to $\pi^*$. For clarity, consider the binary reward setting ($r_+$ and $r_-$) and denote the error rates in $\tilde{r}$ by $e_+ = \mathbb{P}(\tilde{r} = r_- \mid r = r_+)$ and $e_- = \mathbb{P}(\tilde{r} = r_+ \mid r = r_-)$ (a simplification of $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$ in the binary setting).

Lemma 1. Let $r \in [0, R_{\max}]$ be a bounded reward and assume $1 - e_- - e_+ > 0$. Then
$$\mathbb{E}[\tilde{r}_{\text{peer}}] = (1 - e_- - e_+) \cdot \mathbb{E}[r_{\text{peer}}] = (1 - e_- - e_+) \cdot \mathbb{E}[r] + \text{const},$$
where $r_{\text{peer}} = r_{\text{peer}}(s, a) = r(s, a) - \xi \cdot r'$ is the peer RL reward when observing the true reward $r$.

Lemma 1 shows that by subtracting the peer penalty term $\tilde{r}'$ from the noisy reward $\tilde{r}$, $\tilde{r}_{\text{peer}}$ recovers the clean and true reward $r$ in expectation (up to a positive affine transformation).

Remark. Note that the expectation of the noisy reward $\tilde{r}$ can also be written as $\mathbb{E}[\tilde{r}] = (1 - e_- - e_+)\mathbb{E}[r] + \underbrace{e_- r_+ + e_+ r_-}_{\text{const}}$. However, we claim that the constant in the peer reward has less effect on the true reward $r$. This is because
$$\mathbb{E}[\tilde{r}] = (1 - e_- - e_+) \cdot \Big(\mathbb{E}[r] + \frac{e_+}{1 - e_- - e_+} r_- + \frac{e_-}{1 - e_- - e_+} r_+\Big),$$
$$\mathbb{E}[\tilde{r}_{\text{peer}}] = (1 - e_- - e_+) \cdot \big(\mathbb{E}[r] - (1 - p_{\text{peer}}) r_- - p_{\text{peer}} r_+\big),$$
where $p_{\text{peer}} \in [0, 1]$ denotes the probability that the sampling policy gets a reward $r_+$. Since the magnitude of the noise terms $\frac{e_+}{1 - e_- - e_+}$ and $\frac{e_-}{1 - e_- - e_+}$ can become much larger than $1 - p_{\text{peer}}$ and $p_{\text{peer}}$ in a high-noise regime, $\tilde{r}$ dilutes the informativeness of $\mathbb{E}[r]$. On the contrary, $\mathbb{E}[\tilde{r}_{\text{peer}}]$ contains only a moderate constant and thus maintains more useful training signal from the true reward in practice.

Based on Lemma 1, we further offer the following convergence guarantee:

Theorem 1 (Convergence). Given a finite MDP with noisy reward, denoted $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{F}, \mathbb{P}, \gamma \rangle$, the Q-learning algorithm with peer rewards, given by the update rule
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \Big[\tilde{r}_{\text{peer}}(s_t, a_t) + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a')\Big], \quad \pi_t(s) = \arg\max_{a \in \mathcal{A}} Q_t(s, a),$$
converges w.p.1 to the optimal policy $\pi^*(s)$ as long as $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.
Theorem 1 states that, with peer rewards, the agent converges to the optimal policy w.p.1 without requiring any knowledge of the reward corruption ($C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$), as opposed to previous work (Wang et al., 2020) that requires such knowledge. Moreover, we find that, to guarantee convergence to $\pi^*$, the number of samples our approach needs is no more than $O(1/(1 - e_- - e_+)^2)$ times the number needed when the RL agent observes true rewards perfectly (see Appendix A).

Remark. Even though we only present the analysis for the binary case with Q-Learning, our approach is generic and ready to be plugged into modern DRL algorithms. Specifically, we provide the multi-reward extension and implementations of DQN (Mnih et al., 2013) and policy gradient (Sutton et al., 1999) with peer reward in Appendix A.
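The update rule in Theorem 1 can be sketched as follows (illustrative Python; the `reward_buffer` interface for sampling the peer penalty is our own assumption).

```python
import numpy as np

def q_learning_peer_step(Q, s, a, r_tilde, s_next, reward_buffer, rng,
                         alpha=0.1, gamma=0.99, xi=1.0):
    """One tabular Q-learning update with the peer reward (Theorem 1).

    Sketch with our own interface: `reward_buffer` is a list of noisy rewards
    observed at randomly encountered state-action pairs, from which the peer
    penalty r~' is drawn uniformly (all (s, a) treated equally, as in the text).
    """
    r_prime = reward_buffer[rng.integers(len(reward_buffer))]
    r_peer = r_tilde - xi * r_prime                    # r~_peer = r~ - xi * r~'
    td_target = r_peer + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * td_target
    return Q
```

For the binary-reward analysis the text sets $\xi = 1$; in the experiments a smaller constant (e.g., $\xi = 0.2$) is used throughout each run.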

4.3. PEERBC: PEER BEHAVIORAL CLONING

Similarly, we present our CA solution in the setting of behavioral cloning (PeerBC). In BC, the supervision is given by the weak expert's noisy trajectory. The $\text{Eva}^{\text{BC}}_\pi$ function evaluates the agent policy $\pi_\theta$ against the weak expert's trajectory $\{(s_i, \tilde{a}_i)\}_{i=1}^{N}$ using $\ell(\pi_\theta, (s_i, \tilde{a}_i))$, where $\ell$ is an arbitrary classification loss. Taking the cross-entropy for instance, the objective of PeerBC is

$$J_{\text{BC}}(\pi_\theta) = \mathbb{E}\big[\text{Eva}^{\text{BC}}_\pi\big((s_i, a_i), \tilde{a}_i\big)\big] - \xi \cdot \mathbb{E}\big[\text{Eva}^{\text{BC}}_\pi\big((s_j, a_j), \tilde{a}_k\big)\big], \quad (4)$$
$$\text{where } \text{Eva}^{\text{BC}}_\pi\big((s, a), \tilde{a}\big) = \ell\big(\pi_\theta, (s, \tilde{a})\big) = \log \pi_\theta(\tilde{a} \mid s). \quad (5)$$

In (4), the first expectation is taken over $(s_i, \tilde{a}_i) \sim \tilde{\mathcal{D}}_E$, $a_i \sim \pi(\cdot|s_i)$, and the second over $(s_j, \tilde{a}_j) \sim \tilde{\mathcal{D}}_E$, $a_j \sim \pi(\cdot|s_j)$, $(s_k, \tilde{a}_k) \sim \tilde{\mathcal{D}}_E$, $a_k \sim \pi(\cdot|s_k)$. Again, the second $\text{Eva}^{\text{BC}}_\pi$ term in $J_{\text{BC}}$ serves the purpose of punishing over-agreement with the weak demonstration, and $\xi \geq 0$ is a parameter balancing the penalty for blind agreement. At each iteration, the agent learns under weak supervision $\tilde{a}$, and the training samples are generated from the distribution $\tilde{\mathcal{D}}_E$ determined by the weak expert.

We prove that the policy learned by PeerBC converges to the expert policy when a sufficient amount of weak demonstrations is observed. We focus on the binary action setting for the purpose of theoretical analysis, where the action space is $\mathcal{A} = \{A_+, A_-\}$ and the weakness or noise of the weak expert $\tilde{\pi}_E$ is quantified by $e_+ = \mathbb{P}(\tilde{\pi}_E(s) = A_- \mid \pi_E(s) = A_+)$ and $e_- = \mathbb{P}(\tilde{\pi}_E(s) = A_+ \mid \pi_E(s) = A_-)$ (a simplification of $C^{\text{BC}}_{|\mathcal{A}| \times |\mathcal{A}|}$ in the binary setting). Let $\pi^*_{\tilde{D}_E}$ be the optimal policy obtained by maximizing the objective in (4) with imperfect demonstrations $\tilde{D}_E$ (a particular set of $N$ i.i.d. imperfect demonstrations).

Theorem 2. Denote by $R_{\mathcal{D}_E}(\pi^*_{\tilde{D}_E}) := \mathbb{P}_{(s,a) \sim \mathcal{D}_E}\big(\pi^*_{\tilde{D}_E}(s) \neq a\big)$ the error rate of PeerBC with respect to the clean demonstration distribution. With probability at least $1 - \delta$, it is upper-bounded as
$$R_{\mathcal{D}_E}\big(\pi^*_{\tilde{D}_E}\big) \leq \frac{1 + \xi}{1 - e_- - e_+} \sqrt{\frac{2 \log 2/\delta}{N}},$$
where $N$ is the number of state-action pairs demonstrated by the expert.

Theorem 2 states that as long as sufficiently many weak demonstrations are observed, i.e., $N$ is sufficiently large, the policy learned by PeerBC converges to the clean expert policy $\pi_E(s)$ at a rate of $O(1/\sqrt{N})$.
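A possible PyTorch rendering of the PeerBC objective (4)-(5) for mini-batch training; the batch layout and names are our own assumptions, and the loss is negated so that minimizing it maximizes $J_{\text{BC}}$.

```python
import torch.nn.functional as F

def peer_bc_loss(logits_i, actions_i, logits_j, actions_k, xi=0.2):
    """PeerBC loss (negative of J_BC in Eqn. 4), illustrative sketch.

    logits_i: policy logits on demo states s_i, paired with noisy actions a~_i.
    logits_j / actions_k: logits on randomly drawn states s_j paired with noisy
    actions a~_k from *different* demo pairs (the peer term).
    """
    agree = F.cross_entropy(logits_i, actions_i)   # -E[log pi(a~_i | s_i)]
    peer = F.cross_entropy(logits_j, actions_k)    # -E[log pi(a~_k | s_j)]
    return agree - xi * peer                       # minimize => maximize J_BC
```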

Algorithm 1 Peer policy co-training (PeerCT)

Require: views $A, B$; MDPs $\mathcal{M}^A, \mathcal{M}^B$; policies $\pi^A, \pi^B$; mapping functions $f_{A \to B}, f_{B \to A}$ that map states from one view to the other; CA coefficient $\xi$; step size $\eta$ for policy updates.
1: repeat
2:   Run $\pi^A$ to generate trajectories $\tau^A = \{(s^A_i, a^A_i, r^A_i)\}_{i=1}^{N}$.
3:   Run $\pi^B$ to generate trajectories $\tau^B = \{(s^B_j, a^B_j, r^B_j)\}_{j=1}^{M}$.
4:   Agents label the trajectories for each other: $\tau'^A = \{(s^A_i, \pi^B(f_{A \to B}(s^A_i)))\}_{i=1}^{N}$, $\tau'^B = \{(s^B_j, \pi^A(f_{B \to A}(s^B_j)))\}_{j=1}^{M}$.
5:   Update policies: $\pi^{\{A,B\}} \leftarrow \pi^{\{A,B\}} + \eta \cdot \nabla J_{\text{CT}}(\pi^{\{A,B\}})$.
6: until convergence

Peer Policy Co-Training. Our discussion of BC allows us to study a more challenging co-training task (Song et al., 2019). Given a finite MDP $\mathcal{M}$, two agents receive partial observations, and we let $\pi^A$ and $\pi^B$ denote the policies of agents $A$ and $B$. The two agents are trained jointly to learn from rewards and from noisy demonstrations provided by each other (e.g., at a preliminary training phase). By symmetry, we focus on the case where agent $A$ learns from $B$'s demonstrations on sampled trajectories, and $\pi^B$ effectively serves as a noisy version of the expert policy $\pi_E = \arg\max_{\pi \in \Pi} \mathbb{E}_{(s,a) \sim \mathcal{D}_E}[\log \pi(a|s)]$. For simplicity of presentation, we focus on recovering the clean expert policy by adapting only the BC evaluation term (ignoring the effect of RL rewards; see Eqn. (6)). Let $\tau^A_\theta = \{(s^A_i, a^A_i, r^A_i)\}_{i=1}^{N}$ denote $A$'s trajectory in $\mathcal{M}^A$; agent $B$ substitutes each action $a^A_i$ with its own selection $a'^B_i \sim \pi^B(\cdot \mid f_{A \to B}(s^A_i))$, which serves as the weak supervision. Similar to PeerRL/PeerBC, the objective function of peer co-training (PeerCT) becomes
$$J_{\text{CT}}(\pi_\theta) = \mathbb{E}\big[\text{Eva}^{\text{RL}}_\pi\big((s^A_i, a^A_i), r^A_i\big) + \text{Eva}^{\text{BC}}_\pi\big((s^A_i, a^A_i), a'^B_i\big)\big] - \xi \cdot \mathbb{E}\big[\text{Eva}^{\text{BC}}_\pi\big((s^A_j, a^A_j), a'^B_k\big)\big], \quad (6)$$
where the first expectation is taken over $(s^A_i, a^A_i, r^A_i) \sim \tau^A_\theta$, $a'^B_i \sim \pi^B(\cdot \mid f_{A \to B}(s^A_i))$, and the second over $(s^A_j, a^A_j, r^A_j) \sim \tau^A_\theta$, $(s^A_k, a^A_k, r^A_k) \sim \tau^A_\theta$, $a'^B_k \sim \pi^B(\cdot \mid f_{A \to B}(s^A_k))$; $\ell$ is the function defined in Eqn. (5) measuring the policy difference, and $\text{Eva}^{\text{RL}}_\pi$, $\text{Eva}^{\text{BC}}_\pi$ are defined in Eqn. (3) and (5), respectively. The full PeerCT procedure is given in Algorithm 1, with a code sketch below. We omit a detailed discussion of the convergence of PeerCT, since it can be viewed as a straightforward extension of Theorem 2 to the co-training context.
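A sketch of the per-view PeerCT loss in Eqn. (6), under the same assumptions as the PeerBC sketch above (the batch layout and names are ours).

```python
import torch.nn.functional as F

def peer_ct_loss(rl_loss, bc_logits_i, actions_B_i,
                 bc_logits_j, actions_B_k, xi=0.5):
    """Per-view PeerCT loss for agent A (negative of J_CT), illustrative sketch.

    rl_loss: the RL loss on A's own trajectory (e.g., a policy-gradient loss).
    bc_logits_i / actions_B_i: A's logits on states s^A_i paired with agent B's
    suggested actions a'^B_i (the agreement term).
    bc_logits_j / actions_B_k: logits and B-actions from randomly mismatched
    indices (the peer term punishing blind agreement).
    """
    agree = F.cross_entropy(bc_logits_i, actions_B_i)  # -Eva^BC, matched pairs
    peer = F.cross_entropy(bc_logits_j, actions_B_k)   # -Eva^BC, mismatched pairs
    return rl_loss + agree - xi * peer
```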

5. EXPERIMENTS

We evaluate our solution on three challenging weakly supervised PL problems. Experiments on control games and Atari show that, without any prior knowledge of the noise in the supervision, our approach leverages weak supervision more effectively.

5.1. PEERRL WITH NOISY REWARD

We first evaluate our method in the RL-with-noisy-reward setting. Following Wang et al. (2020), we consider the binary reward $\{-1, 1\}$ for CartPole, where symmetric noise is synthesized with different error rates $e = e_- = e_+$. We choose the DQN (Mnih et al., 2013) and Dueling DQN (DDQN) (Wang et al., 2016) algorithms and train the models for 10,000 steps. We repeat each experiment 10 times with different random seeds and defer the results for DQN to Appendix D.

Figure 2 shows the learning curves of DDQN with different approaches in noisy environments ($\xi = 0.2$). Since the number of training steps is fixed, the faster an algorithm converges, the fewer total episodes the agent goes through, and thus its learning curve lies further to the left. The proposed peer reward outperforms the other baselines significantly, even in the high-noise regime (e.g., $e = 0.4$). Moreover, we highlight that the peer reward requires neither knowledge of the noise rates nor complicated estimation algorithms, in contrast to Wang et al. (2020). Table 1 provides quantitative results on the average reward $R_{\text{avg}}$ and the total number of episodes $N_{\text{epi}}$. Agents with peer reward consistently achieve a larger $R_{\text{avg}}$ (lower generalization error) and a smaller $N_{\text{epi}}$ (faster convergence), which again verifies the effectiveness of our approach.

Analysis of the benefits of PeerRL. When the noise rate $e$ is small, we observe that agents with peer reward converge even faster than agents observing the true reward perfectly. This indicates that factors other than noise reduction may also promote RL learning with peer reward. We hypothesize that (1) the peer reward scales the reward signals appropriately, which potentially reduces variance and makes them easier to learn from; (2) the peer penalty term implicitly encourages exploration; (3) the human-specified "true reward" is itself imperfect, which again yields a weak supervision scenario. More discussion and analysis are deferred to Appendix C.6. There may be multiple explanations for the performance improvement provided by PeerRL depending on the specific RL task; however, we emphasize that the ability to recover from noise is non-negligible, especially in the high-noise regime (e.g., $e = 0.4$).

Table 1: Numerical performance of DDQN on CartPole with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward $\hat{r}$ (Wang et al., 2020), and peer reward $\tilde{r}_{\text{peer}}$ ($\xi = 0.2$). $R_{\text{avg}}$ denotes the average reward per episode after convergence, the higher (↑) the better; $N_{\text{epi}}$ denotes the total number of episodes within 10,000 steps, the lower (↓) the better.

5.2 PEERBC FROM WEAK DEMONSTRATIONS

In the BC setting, we evaluate our approach on four vision-based Atari games. For each environment, we train an imperfect RL model with the PPO algorithm (Schulman et al., 2017). Here, "imperfect" means that training is terminated before convergence, when the performance is about 70%-90% as good as that of the fully converged model. We then collect imperfect demonstrations using this expert model, generating 100 trajectories for each environment. The results are reported under three random seeds. Figure 3 presents the comparison between standard BC and PeerBC, from which we observe that our approach outperforms standard BC and even the expert it learns from.
Note that during the whole training process, the agent never learns by interacting directly with the environment; it only has access to the expert's trajectories. We therefore attribute this performance gain to PeerBC's strong ability to learn from weak supervision: the peer term we add not only provably eliminates the effect of noise but also extracts useful strategy from the demonstrations. Table 2 shows the quantitative results; our approach consistently outperforms the expert and standard BC. As a reference, we also compare against two other baselines, GAIL (Ho & Ermon, 2016) and SQIL (Reddy et al., 2019), and provide a sensitivity analysis of $\xi$ in Appendix D.

Analysis of the benefits of PeerBC. The performance improvement of PeerBC may likewise be coupled with multiple factors. (1) The imperfect expert model can be viewed as a noisy version of the fully converged agent, since there are less-visited states on which the model's selected actions contain noise. (2) The improvement may come from biasing against high-entropy policies; PeerBC is thus useful when the true policy itself is deterministic. We provide more discussion of the second factor in Appendix C.5.

5.3. PEERCT FOR STANDARD POLICY CO-TRAINING

Finally, we verify the effectiveness of the PeerCT algorithm in the policy co-training setting (Song et al., 2019). This setting is more challenging since the states are only partially observable and each agent needs to imitate the other agent's behavior, which is highly biased and imperfect. Note that we adopt exactly the same setting as Song et al. (2019) without injecting any synthetic noise, which demonstrates the potential of our approach to deal with natural noise in real-world applications. Following Song et al. (2019), we mask the first two dimensions of the state vector, respectively, to create two views for co-training in classic control games (Acrobot and CartPole). Similarly, on Atari games each agent removes either all even-indexed coordinates (view A) or all odd-indexed coordinates (view B) of the state vector. As shown in Table 3 and Figure 4, the PeerCT algorithm consistently outperforms both training from a single view and the CoPiEr algorithm, on control games ($\xi = 0.5$ in Figures 4a, 4b) and on Atari games ($\xi = 0.2$ in Figures 4c, 4d). In most cases, our approach achieves faster convergence and lower generalization error than CoPiEr, which again verifies that our way of leveraging information from a peer agent enables the recovery of useful knowledge from highly imperfect supervision.

6. CONCLUSION

We formulate weakly supervised policy learning and propose a theoretically principled framework, PeerPL, that builds on evaluating a learning policy's correlated agreement with the weak supervision. We demonstrate how our method adapts to RL/BC and the combined co-training tasks, and provide intensive analyses of their convergence behaviors and sample complexity. Experiments on these tasks show that our approach leads to substantial improvements over baseline methods.

A ANALYSIS OF PEERRL

We start this section by proving the convergence of Q-Learning under the peer reward $\tilde{r}_{\text{peer}}$ (Theorem 1). Moreover, we give the sample complexity of phased value iteration (Theorem A1). In the rest of this section, we show how to extend the proposed method to the multi-outcome setting (Section A.3) and to modern deep reinforcement learning (DRL) algorithms such as policy gradient (Sutton et al., 1999) and DQN (Mnih et al., 2013; van Hasselt et al., 2016) (Section A.4).

A.1 CONVERGENCE

Recall that we consider the binary reward case $\{r_+, r_-\}$, where $r_+$ and $r_-$ are two reward levels. The flipping errors of the reward are defined as $e_+ = \mathbb{P}(\tilde{r}_t = r_- \mid r_t = r_+)$ and $e_- = \mathbb{P}(\tilde{r}_t = r_+ \mid r_t = r_-)$. The peer reward is defined as $r_{\text{peer}}(s, a) = r(s, a) - r'$, where $r'$ is a reward randomly sampled over all state-action pairs $(s, a)$. Note that we treat each $(s, a)$ equally when sampling $r'$ due to the lack of knowledge of the true transition probability $\mathbb{P}$. In practice, the agent can only observe the noisy version of the peer reward, $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \tilde{r}'$. We provide Q-learning with peer reward in Algorithm A1.

Algorithm A1 Q-Learning with Peer Reward

Require: $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \tilde{\mathcal{R}}, \mathbb{P}, \gamma \rangle$, learning rate $\alpha \in (0, 1)$, initial state distribution $\rho_0$.

1: Initialize $Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ arbitrarily
2: while $Q$ has not converged do
3:   Start in state $s \sim \rho_0$
4:   while $s$ is not terminal do
5:     Compute $\pi$ from $Q$ and the exploration strategy
6:     $a \leftarrow \pi(s)$; $s' \sim \mathbb{P}(\cdot|s, a)$
7:     Observe noisy reward $\tilde{r}(s, a)$ and randomly sample another $\tilde{r}'$ from all state-action pairs
8:     Compute the peer reward $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \tilde{r}'$
9:     $Q(s, a) \leftarrow (1 - \alpha) \cdot Q(s, a) + \alpha \cdot (\tilde{r}_{\text{peer}}(s, a) + \gamma \cdot \max_{a'} Q(s', a'))$
10:    $s \leftarrow s'$
11:  end while
12: end while
13: return $Q(s, a)$ and $\pi(s)$

We then show that the proposed peer reward $\tilde{r}_{\text{peer}}$ offers an affine transformation of the true reward in expectation, which is the key to guaranteeing convergence for RL algorithms.

Lemma 1. Let $r \in [0, R_{\max}]$ be a bounded reward and assume $1 - e_- - e_+ > 0$. Then, defining the peer reward $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \tilde{r}'$, in which the penalty term $\tilde{r}'$ is a noisy reward randomly sampled over all state-action pairs $(s, a)$, we have
$$\mathbb{E}[\tilde{r}_{\text{peer}}] = (1 - e_- - e_+)\mathbb{E}[r_{\text{peer}}] = (1 - e_- - e_+)\mathbb{E}[r] + \text{const},$$
where $r_{\text{peer}}$ is the clean version of the peer reward, observed when the true reward is available.

Proof. With slight abuse of notation, we let $\tilde{r}_{\text{peer}}, \tilde{r}, r, \tilde{r}'$ also denote the corresponding random variables, and let $\pi$ denote the RL agent's policy. Consider the two terms defining the noisy peer reward separately. First,
$$\mathbb{E}[\tilde{r}] = \mathbb{P}(r = r_+ \mid \pi) \cdot \big[e_+ r_- + (1 - e_+) r_+\big] + \mathbb{P}(r = r_- \mid \pi) \cdot \big[(1 - e_-) r_- + e_- r_+\big] = (1 - e_+ - e_-)\mathbb{E}[r] + e_- r_+ + e_+ r_-.$$
Since we treat the visitation probability of all state-action pairs $(s, a)$ equally while sampling the peer penalty $\tilde{r}'$, the probability of the true reward being $r_+$ under this sampling policy $\pi_{\text{sample}}$ is a constant, denoted $p_{\text{peer}} = \mathbb{P}(r = r_+ \mid \pi_{\text{sample}})$. Then
$$\mathbb{E}[\tilde{r}'] = \mathbb{P}(\tilde{r} = r_- \mid \pi_{\text{sample}}) \cdot r_- + \mathbb{P}(\tilde{r} = r_+ \mid \pi_{\text{sample}}) \cdot r_+ = (1 - e_- - e_+)\big[(1 - p_{\text{peer}}) r_- + p_{\text{peer}} r_+\big] + e_- r_+ + e_+ r_-.$$
As a consequence,
$$\mathbb{E}[\tilde{r}_{\text{peer}}] = \mathbb{E}[\tilde{r}] - \mathbb{E}[\tilde{r}'] = (1 - e_+ - e_-)\mathbb{E}[r] - (1 - e_- - e_+)\big[(1 - p_{\text{peer}}) r_- + p_{\text{peer}} r_+\big].$$
It is easy to see that $\mathbb{E}[r_{\text{peer}}] = \mathbb{E}[r] - [(1 - p_{\text{peer}}) r_- + p_{\text{peer}} r_+]$. Therefore, $\mathbb{E}[\tilde{r}_{\text{peer}}] = (1 - e_- - e_+)\mathbb{E}[r_{\text{peer}}] = (1 - e_- - e_+)\mathbb{E}[r] + \text{const}$. ∎

Lemma 1 shows that the proposed peer reward $\tilde{r}_{\text{peer}}$ offers a "noise-free" positive ($1 - e_- - e_+ > 0$) linear transformation of the true reward $r$ in expectation, which is the key to governing convergence. It is widely known in utility theory and the reward shaping literature (Ng et al., 1999; Asmuth et al., 2008; Von Neumann & Morgenstern, 2007) that any positive linear transformation leaves the optimal policy unchanged. As a consequence, we consider a "transformed MDP" $\bar{\mathcal{M}}$ with reward $\bar{r} = (1 - e_- - e_+) r + \text{const}$, where the constant is the same as in Lemma 1. In what follows, we formalize the concept of a transformed MDP with a policy invariance guarantee.

Lemma A1. Given a finite MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma \rangle$, a transformed MDP $\bar{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \bar{\mathcal{R}}, \mathbb{P}, \gamma \rangle$ with a positive linear transformation of the reward, $\bar{r} := a \cdot r + b$ with constants $a > 0$ and $b$, preserves the optimal policy.

Proof. The Q function of the transformed MDP $\bar{\mathcal{M}}$ (denoted $\bar{Q}$) is
$$\bar{Q}(s, a) = \sum_{t=0}^{\infty} \gamma^t \bar{r}_t = \sum_{t=0}^{\infty} \gamma^t (a \cdot r_t + b) = a \sum_{t=0}^{\infty} \gamma^t r_t + \sum_{t=0}^{\infty} \gamma^t b = a \cdot Q(s, a) + B,$$
where $B = \sum_{t=0}^{\infty} \gamma^t b$ is a constant. Therefore, there is only a positive linear shift ($a > 0$) in $\bar{Q}(s, a)$, resulting in the invariance of the optimal policy of the transformed MDP:
$$\bar{\pi}^*(s) = \arg\max_{a \in \mathcal{A}} \bar{Q}^*(s, a) = \arg\max_{a \in \mathcal{A}} [a \cdot Q(s, a) + B] = \arg\max_{a \in \mathcal{A}} Q(s, a) = \pi^*(s). \quad \blacksquare$$

Lemma A1 states that we only need to analyze the convergence of the learned policy $\pi(s)$ to the optimal policy $\bar{\pi}^*(s)$ of the transformed MDP $\bar{\mathcal{M}}$, which is identical to the optimal policy $\pi^*(s)$ of the original MDP. This result is related to potential-based reward shaping (Ng et al., 1999; Asmuth et al., 2008), where a specific class of state-dependent transformations is adopted to speed up the convergence of Q-Learning while maintaining optimal-policy invariance. Moreover, a degenerate case for single-step decisions is studied in utility theory (Von Neumann & Morgenstern, 2007), which also implies our result.
Finally, we need an auxiliary result (Lemma A2) from stochastic approximation to analyze the convergence of Q-Learning.

Lemma A2. The random process $\{\Delta_t\}$ taking values in $\mathbb{R}^n$ and defined as
$$\Delta_{t+1}(x) = (1 - \alpha_t(x)) \Delta_t(x) + \alpha_t(x) F_t(x)$$
converges to zero w.p.1 under the following assumptions:
- $0 \leq \alpha_t \leq 1$, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t(x)^2 < \infty$;
- $\|\mathbb{E}[F_t(x) \mid \mathcal{F}_t]\|_W \leq \gamma \|\Delta_t\|_W$, with $\gamma < 1$;
- $\text{Var}[F_t(x) \mid \mathcal{F}_t] \leq C(1 + \|\Delta_t\|_W^2)$, for $C > 0$.

Here $\mathcal{F}_t = \{\Delta_t, \Delta_{t-1}, \ldots, F_{t-1}, \ldots, \alpha_t, \ldots\}$ stands for the past at step $t$, and $\alpha_t(x)$ is allowed to depend on the past insofar as the above conditions remain valid. The notation $\|\cdot\|_W$ refers to some weighted maximum norm.

Proof of Lemma A2. See the previous literature (Jaakkola et al., 1993; Tsitsiklis, 1994).

Theorem 1 (Convergence). Given a finite MDP with noisy reward, denoted $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \tilde{\mathcal{R}}, \mathcal{F}, \mathbb{P}, \gamma \rangle$, the Q-learning algorithm with peer rewards, given by the update rule
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \Big[\tilde{r}_{\text{peer}}(s_t, a_t) + \gamma \max_{b \in \mathcal{A}} Q_t(s_{t+1}, b)\Big], \quad \pi_t(s) = \arg\max_{a \in \mathcal{A}} Q_t(s, a),$$
converges w.p.1 to the optimal policy $\pi^*(s)$ as long as $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.

Proof. First, we construct a surrogate MDP $\bar{\mathcal{M}}$ with the positive-linearly transformed reward $\bar{r} = (1 - e_- - e_+) r + \text{const}$, where $\text{const} = -(1 - e_- - e_+)\big[(1 - p_{\text{peer}}) r_- + p_{\text{peer}} r_+\big]$. From Lemma A1, the optimal policy of $\bar{\mathcal{M}}$ is exactly the optimal policy of $\mathcal{M}$: $\bar{\pi}^*(s) = \pi^*(s)$. Let $\bar{Q}^*$ denote the optimal state-action value function of the transformed MDP $\bar{\mathcal{M}}$. For brevity, we abbreviate $s_t, s_{t+1}, \tilde{r}_{\text{peer}}(s_t, a_t), Q_t, Q_{t+1}$, and $\alpha_t$ as $s, s', \tilde{r}_{\text{peer}}, Q, Q'$, and $\alpha$, respectively. Subtracting $\bar{Q}^*(s, a)$ from both sides of the update rule,
$$Q'(s, a) - \bar{Q}^*(s, a) = (1 - \alpha)\big(Q(s, a) - \bar{Q}^*(s, a)\big) + \alpha\Big[\tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b) - \bar{Q}^*(s, a)\Big].$$
Let $\Delta_t(s, a) = Q(s, a) - \bar{Q}^*(s, a)$ and $F_t(s, a) = \tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b) - \bar{Q}^*(s, a)$, so that $\Delta_{t+1}(s, a) = (1 - \alpha)\Delta_t(s, a) + \alpha F_t(s, a)$. Consequently,
$$\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = \mathbb{E}\Big[\tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b) - \bar{r} - \gamma \max_{b \in \mathcal{A}} \bar{Q}^*(s', b)\Big] = \gamma\,\mathbb{E}\Big[\max_{b \in \mathcal{A}} Q(s', b) - \max_{b \in \mathcal{A}} \bar{Q}^*(s', b)\Big] \leq \gamma\,\mathbb{E}\Big[\max_{b \in \mathcal{A}, s' \in \mathcal{S}} \big|Q(s', b) - \bar{Q}^*(s', b)\big|\Big] = \gamma\,\|\Delta_t\|_\infty,$$
where we use the unbiasedness property of the peer reward ($\mathbb{E}[\tilde{r}_{\text{peer}}] = \mathbb{E}[\bar{r}]$, Lemma 1) and the inequality $\big|\max_{b} Q(s', b) - \max_{b} \bar{Q}^*(s', b)\big| \leq \max_{b, s'} \big|Q(s', b) - \bar{Q}^*(s', b)\big|$. For the variance,
$$\text{Var}[F_t(s, a) \mid \mathcal{F}_t] = \mathbb{E}\Big[\Big(\tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b) - \mathbb{E}\big[\tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b)\big]\Big)^2\Big] = \text{Var}\Big[\tilde{r}_{\text{peer}} + \gamma \max_{b \in \mathcal{A}} Q(s', b)\Big].$$
Since $\tilde{r}_{\text{peer}}$ is bounded, it can be verified that $\text{Var}[F_t(s, a) \mid \mathcal{F}_t] \leq C''(1 + \|\Delta_t(s, a)\|_\infty^2)$ for some constant $C'' > 0$. Then $\Delta_t$ converges to zero w.p.1 by Lemma A2, i.e., $Q(s, a)$ converges to $\bar{Q}^*(s, a)$. As a consequence, the policy $\pi_t(s)$ converges to the optimal policy $\bar{\pi}^*(s) = \pi^*(s)$. ∎

A.2 SAMPLE COMPLEXITY

In this section, we establish the sample complexity of Q-Learning with peer reward as discussed in Section 4.2. Since the transition probability $\mathbb{P}$ of the MDP remains unknown in practice, we first introduce a practical sampling model $G(\mathcal{M})$ following the previous literature (Kearns & Singh, 1998; 2000; Kearns et al., 1999), in which transitions are observed by calling a generative model. The sample complexity is then the number of calls to $G(\mathcal{M})$ needed to obtain a near-optimal policy.

Definition A1. A generative model $G(\mathcal{M})$ for an MDP $\mathcal{M}$ is a sampling model that takes a state-action pair $(s_t, a_t)$ as input, and outputs the corresponding reward $r(s_t, a_t)$ and the next state $s_{t+1}$ randomly with probability $\mathbb{P}_a(s_t, s_{t+1})$, i.e., $s_{t+1} \sim \mathbb{P}(\cdot|s_t, a_t)$.

It is known that exact value iteration is not feasible when the agent interacts with the generative model $G(\mathcal{M})$ (Wang et al., 2020; Kakade, 2003). For the convenience of analyzing sample complexity, we introduce a phased value iteration following Wang et al. (2020); Kearns & Singh (1998); Kakade (2003).

Algorithm A2 Phased Value Iteration

Require: $G(\mathcal{M})$: generative model of $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathbb{P}, \gamma \rangle$; $T$: number of iterations.
1: Set $V_T(s) = 0$
2: for $t = T - 1, \ldots, 0$ do
3:   Call $G(\mathcal{M})$ $m$ times for each state-action pair and set $\hat{\mathbb{P}}_a(s_t, s_{t+1}) = \#[(s_t, a_t) \to s_{t+1}] / m$
4:   Set $V_t(s_t) = \max_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \hat{\mathbb{P}}_a(s_t, s_{t+1}) [r_t + \gamma V_{t+1}(s_{t+1})]$, $\pi(s_t) = \arg\max_{a \in \mathcal{A}} V_t(s_t)$
5: end for
6: return $V(s)$ and $\pi(s)$

Note that $\hat{\mathbb{P}}_a(s_t, s_{t+1})$ is the estimate of the transition probability $\mathbb{P}_a(s_t, s_{t+1})$ obtained by calling $G(\mathcal{M})$ $m$ times. For simplicity of notation, the iteration index $t$ decreases from $T - 1$ to $0$. We adopt the peer reward in phased value iteration by replacing line 4 of Algorithm A2 with
$$V_t(s_t) = \max_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \hat{\mathbb{P}}_a(s_t, s_{t+1}) \big[\tilde{r}_{\text{peer}}(s_t, a) + \gamma V_{t+1}(s_{t+1})\big].$$
The sample complexity of this variant (phased value iteration) of Q-Learning is then given as follows.

Theorem A1 (Sample Complexity). Let $r \in [0, R_{\max}]$ be a bounded reward. For an appropriate choice of $m$, the phased value iteration algorithm with peer reward $\tilde{r}_{\text{peer}}$ calls the generative model $G(\tilde{\mathcal{M}})$ $O\big(\frac{|\mathcal{S}||\mathcal{A}|T}{\epsilon^2 (1 - e_- - e_+)^2} \log \frac{|\mathcal{S}||\mathcal{A}|T}{\delta}\big)$ times over $T$ epochs, and returns a policy such that for all states $s \in \mathcal{S}$, $\big|\frac{1}{\eta} V^{\pi}(s) - V^*(s)\big| \leq \epsilon$ w.p. at least $1 - \delta$, $0 < \delta < 1$, where $\eta = 1 - e_- - e_+ > 0$ is a constant.

Proof. Similar to Theorem 1, we first construct a transformed MDP $\bar{\mathcal{M}}$ whose optimal policy coincides with that of the original MDP (Lemma A1); we may therefore analyze the sample complexity of phased value iteration under $\bar{\mathcal{M}}$. It is easy to see that $\tilde{r}_{\text{peer}}$ and $V^{\pi}(s) \in \big[0, \frac{R_{\max}}{1 - \gamma}\big]$ are bounded. Using Hoeffding's inequality, we have
$$\Pr\Big(\Big|\mathbb{E}\big[\bar{V}^*_{t+1}(s_{t+1})\big] - \sum_{s_{t+1} \in \mathcal{S}} \hat{\mathbb{P}}_a(s_t, s_{t+1}) \bar{V}^*_{t+1}(s_{t+1})\Big| \geq \epsilon_1\Big) \leq 2 \exp\Big(-\frac{2 m \epsilon_1^2 (1 - \gamma)^2}{R_{\max}^2}\Big),$$
$$\Pr\Big(\Big|\mathbb{E}[\tilde{r}_{\text{peer}}(s_t, a)] - \sum_{s_{t+1} \in \mathcal{S}} \hat{\mathbb{P}}_a(s_t, s_{t+1})\, \tilde{r}_{\text{peer}}(s_t, a)\Big| \geq \epsilon_2\Big) \leq 2 \exp\Big(-\frac{2 m \epsilon_2^2}{R_{\max}^2}\Big).$$
The gap between the learned value function $\bar{V}_t(s)$ and the optimal value function $\bar{V}^*_t(s)$ of the transformed MDP at iteration $t$ then satisfies
$$\bar{V}^*_t(s) - \bar{V}_t(s) = \max_{a \in \mathcal{A}} \mathbb{E}\big[\bar{r}_t + \gamma \bar{V}^*_{t+1}(s_{t+1})\big] - \max_{a \in \mathcal{A}} \sum_{s_{t+1} \in \mathcal{S}} \hat{\mathbb{P}}_a(s_t, s_{t+1})\big[\tilde{r}_{\text{peer}}(s_t, a) + \gamma \bar{V}_{t+1}(s_{t+1})\big]$$
$$\leq \max_{a \in \mathcal{A}} \Big|\mathbb{E}[\bar{r}_t] - \sum_{s_{t+1}} \hat{\mathbb{P}}_a(s_t, s_{t+1})\, \tilde{r}_{\text{peer}}(s_t, a)\Big| + \gamma \max_{a \in \mathcal{A}} \Big|\mathbb{E}\big[\bar{V}^*_{t+1}(s_{t+1})\big] - \sum_{s_{t+1}} \hat{\mathbb{P}}_a(s_t, s_{t+1})\, \bar{V}_{t+1}(s_{t+1})\Big|$$
$$\leq \epsilon_1 + \epsilon_2 + \gamma \max_{s \in \mathcal{S}} \big|\bar{V}^*_{t+1}(s) - \bar{V}_{t+1}(s)\big|,$$
where we use the unbiasedness of the peer reward, $\mathbb{E}[\tilde{r}_{\text{peer}}] = \mathbb{E}[\bar{r}]$. Recursing the above inequality,
$$\max_{s \in \mathcal{S}} \big|\bar{V}^*(s) - \bar{V}(s)\big| \leq (\epsilon_1 + \epsilon_2) + \gamma(\epsilon_1 + \epsilon_2) + \cdots + \gamma^{T-1}(\epsilon_1 + \epsilon_2) = \frac{(\epsilon_1 + \epsilon_2)(1 - \gamma^T)}{1 - \gamma}.$$
Let $\epsilon_1 = \epsilon_2 = \frac{(1 - \gamma)\epsilon}{1 + \gamma}$; then $\max_{s \in \mathcal{S}} |\bar{V}^*(s) - \bar{V}(s)| \leq \epsilon$. In other words, for arbitrarily small $\epsilon$, by choosing $m$ appropriately there always exist $\epsilon_1$ and $\epsilon_2$ such that the value-function error is bounded by $\epsilon$; consequently, phased value iteration with peer reward converges to a near-optimal policy within finitely many steps. Note that there are in total $|\mathcal{S}||\mathcal{A}|T$ transitions for which these conditions must hold, where $|\cdot|$ denotes the number of elements of a set. Using a union bound, the probability of failure of any condition is smaller than
$$2 |\mathcal{S}||\mathcal{A}| T \cdot \exp\Big(-\frac{m \epsilon^2 (1 - \gamma)^2}{(1 + \gamma)^2} \cdot \frac{(1 - \gamma)^2}{R_{\max}^2}\Big).$$
Setting this failure probability to at most $\delta$, $m$ must satisfy $m = O\big(\frac{1}{\epsilon^2} \log \frac{|\mathcal{S}||\mathcal{A}|T}{\delta}\big)$. Consequently, after $m|\mathcal{S}||\mathcal{A}|T = O\big(\frac{|\mathcal{S}||\mathcal{A}|T}{\epsilon^2} \log \frac{|\mathcal{S}||\mathcal{A}|T}{\delta}\big)$ calls, the value function converges to the optimal value function $\bar{V}^*(s)$ for every $s$ in the transformed MDP, with probability at least $1 - \delta$. From Lemma A1, we know $\bar{V}^*(s) = (1 - e_- - e_+) \cdot V^*(s) + C$, where $C$ is a constant. Let $\epsilon = (1 - e_- - e_+) \cdot \epsilon'$ and $\bar{V}(s) = (1 - e_- - e_+) \cdot V'(s) + C$. Then
$$|V^*(s) - V'(s)| = \Big|\frac{\bar{V}^*(s) - C}{1 - e_- - e_+} - \frac{\bar{V}(s) - C}{1 - e_- - e_+}\Big| = \frac{1}{1 - e_- - e_+}\big|\bar{V}^*(s) - \bar{V}(s)\big| \leq \epsilon'.$$
This indicates that when the algorithm converges to the optimal value function of the transformed MDP $\bar{\mathcal{M}}$, it also finds an underlying value function $V'(s) = \frac{1}{\eta} \bar{V}(s)$ (up to the constant shift) that converges to the optimal value function $V^*(s)$ of the original MDP $\mathcal{M}$. It therefore takes $O\big(\frac{|\mathcal{S}||\mathcal{A}|T}{\epsilon'^2 (1 - e_- - e_+)^2} \log \frac{|\mathcal{S}||\mathcal{A}|T}{\delta}\big)$ calls to achieve an $\epsilon'$ error in the value function of the original MDP, which is no more than $O(1/(1 - e_- - e_+)^2)$ times the number needed when the RL agent observes true rewards perfectly. In the high-noise regime, the algorithm suffers from a large $\frac{1}{(1 - e_- - e_+)^2}$ factor and is thus less efficient. Moreover, the sample complexity of phased value iteration with peer reward matches that with the surrogate reward of Wang et al. (2020), although sampling the peer reward is less expensive and does not rely on any knowledge of the noise rates.
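For concreteness, a sketch of Algorithm A2 with the peer-reward replacement of line 4; the generative-model interface (`sample_next_state`, `sample_noisy_reward`) is our own assumption.

```python
import numpy as np

def phased_value_iteration_peer(sample_next_state, sample_noisy_reward,
                                n_states, n_actions, T=50, m=100,
                                gamma=0.99, rng=None):
    """Phased value iteration with peer reward (Algorithm A2 variant).

    `sample_next_state(s, a)` plays the role of the generative model G(M~);
    `sample_noisy_reward(s, a)` returns one noisy reward observation.
    """
    rng = rng or np.random.default_rng(0)
    V = np.zeros(n_states)
    for _ in range(T):
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                for _ in range(m):  # m calls to the generative model per (s, a)
                    s_next = sample_next_state(s, a)
                    # peer reward: noisy reward minus a randomly paired one
                    r_peer = (sample_noisy_reward(s, a)
                              - sample_noisy_reward(rng.integers(n_states),
                                                    rng.integers(n_actions)))
                    Q[s, a] += (r_peer + gamma * V[s_next]) / m
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)
```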

A.3 MULTI-OUTCOME EXTENSION

In this section, we show that our peer reward generalizes to the multi-outcome setting. Recall from Section 3.2 that the reward is discrete with $|\mathcal{R}|$ levels and the noise rates are characterized by $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$. Here we further assume the reward is misreported to each level with a level-specific probability, e.g.,
$$C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|} = \begin{bmatrix} 1 - \sum_{i \neq 1} e_i & e_2 & \cdots & e_{|\mathcal{R}|} \\ e_1 & 1 - \sum_{i \neq 2} e_i & \cdots & e_{|\mathcal{R}|} \\ \vdots & & \ddots & \vdots \\ e_1 & e_2 & \cdots & 1 - \sum_{i \neq |\mathcal{R}|} e_i \end{bmatrix}.$$
Following the notation of A.1, we define the peer reward in the multi-outcome setting as $\tilde{r}_{\text{peer}}(s, a) = \tilde{r}(s, a) - \tilde{r}'$, where $\tilde{r}'$ is randomly sampled following a specific sampling policy $\pi_{\text{sample}}$ over all state-action pairs. Let $\tilde{R}_{\text{peer}}, R, \tilde{R}$, and $\tilde{R}'$ denote the random variables corresponding to $\tilde{r}_{\text{peer}}, r, \tilde{r}, \tilde{r}'$, and let $c_{ij}$ denote the entries of $C^{\text{RL}}_{|\mathcal{R}| \times |\mathcal{R}|}$. Then we have
$$\mathbb{E}_\pi\big[\tilde{R}\big] = \sum_{i=1}^{|\mathcal{R}|} \mathbb{P}(R = R_i \mid \pi) \sum_{j=1}^{|\mathcal{R}|} c_{ij} R_j = \sum_{i=1}^{|\mathcal{R}|} \mathbb{P}(R = R_i \mid \pi) \Big[\Big(1 - \sum_{j \neq i} e_j\Big) R_i + \sum_{j \neq i} e_j R_j\Big] = \Big(1 - \sum_{j=1}^{|\mathcal{R}|} e_j\Big) \mathbb{E}_\pi[R] + \sum_{j=1}^{|\mathcal{R}|} e_j R_j,$$
$$\mathbb{E}_{\pi_{\text{sample}}}\big[\tilde{R}'\big] = \sum_{j=1}^{|\mathcal{R}|} R_j \sum_{i=1}^{|\mathcal{R}|} \mathbb{P}(R = R_i \mid \pi_{\text{sample}})\, c_{ij} = \Big(1 - \sum_{i=1}^{|\mathcal{R}|} e_i\Big) \mathbb{E}_{\pi_{\text{sample}}}[R] + \sum_{j=1}^{|\mathcal{R}|} e_j R_j.$$
Then the expected peer reward satisfies
$$\mathbb{E}\big[\tilde{R}_{\text{peer}}\big] = \mathbb{E}_\pi\big[\tilde{R}\big] - \mathbb{E}_{\pi_{\text{sample}}}\big[\tilde{R}'\big] = \Big(1 - \sum_{j=1}^{|\mathcal{R}|} e_j\Big) \mathbb{E}_\pi[R] - \Big(1 - \sum_{i=1}^{|\mathcal{R}|} e_i\Big) \mathbb{E}_{\pi_{\text{sample}}}[R] = \Big(1 - \sum_{j=1}^{|\mathcal{R}|} e_j\Big) \mathbb{E}_\pi[R] + \text{const}.$$
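The affine identity above can be checked numerically; the following sketch (our own code and parameter choices) builds the assumed confusion matrix and verifies $\mathbb{E}[\tilde{R}_{\text{peer}}] = (1 - \sum_j e_j)\,\mathbb{E}_\pi[R] + \text{const}$ for one setting.

```python
import numpy as np

R = np.array([0.0, 0.5, 1.0])            # |R| = 3 reward levels (assumed values)
e = np.array([0.10, 0.05, 0.15])         # per-level flip probabilities e_j
C = np.tile(e, (3, 1))                   # off-diagonal of column j is e_j
np.fill_diagonal(C, 1 - (e.sum() - e))   # row j diagonal: 1 - sum_{i != j} e_i

p_pi = np.array([0.2, 0.3, 0.5])         # P(R = R_i | pi), assumed
p_sample = np.full(3, 1 / 3)             # uniform sampling policy for the peer term

E_noisy = p_pi @ C @ R                   # E_pi[R~]
E_prime = p_sample @ C @ R               # E[R~'] under the sampling policy
E_peer = E_noisy - E_prime               # E[R~_peer]

# const = -(1 - sum_j e_j) * E_{pi_sample}[R]
affine = (1 - e.sum()) * (p_pi @ R - p_sample @ R)
assert np.isclose(E_peer, affine)
```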

A.4 EXTENSION IN MODERN DRL ALGORITHMS

In this section, we present deep reinforcement learning algorithms combined with our peer reward, in Algorithms A3 and A4. Algorithm A3 gives a peer-reward-aided robust policy gradient, where the gradient update corresponds to the loss function $\ell((s, a), q) = -q \log \pi_\theta(a|s)$, which is classification-calibrated (Liu & Guo, 2020); hence the expectation of the gradient is an unbiased estimate of the policy gradient in the corresponding clean MDP. Algorithm A4 presents a robust DQN algorithm with peer sampling, in which the original loss $\ell((s, a), \tilde{y})$ is also classification-calibrated, so robustness follows via Liu & Guo (2020).

Algorithm A3 Policy Gradient (Sutton et al., 1999) with Peer Reward
Require: $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \tilde{\mathcal{R}}, \mathbb{P}, \gamma \rangle$, learning rate $\alpha \in (0, 1)$, initial state distribution $\rho_0$, weight parameter $\xi$.
1: Initialize $\pi_\theta: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ arbitrarily
2: for episode = 1 to M do
3:   Collect trajectory $\tilde{\tau}_\theta = \{(s_i, a_i, \tilde{r}_i)\}_{i=0}^{T}$, where $s_0 \sim \rho_0$, $a_t \sim \pi_\theta(\cdot|s_t)$, $s_{t+1} \sim \mathbb{P}(\cdot|s_t, a_t)$
4:   Compute $\tilde{q}_t = \sum_{i=t}^{T} \gamma^{i-t} \tilde{r}_i$ for all $t \in \{0, 1, \ldots, T\}$
5:   For each index $i \in \{0, 1, \ldots, T\}$, independently sample two other distinct indices $j, k$,
6:   and update the policy parameters $\theta$ following $\theta \leftarrow \theta + \alpha \big[\tilde{q}_i \nabla_\theta \log \pi_\theta(a_i|s_i) - \xi \cdot \tilde{q}_k \nabla_\theta \log \pi_\theta(a_j|s_j)\big]$
7: end for
8: return $\pi_\theta$

Algorithm A4 Deep Q-Network (Mnih et al., 2013) with Peer Reward
Require: $\tilde{\mathcal{M}} = \langle \mathcal{S}, \mathcal{A}, \tilde{\mathcal{R}}, \mathbb{P}, \gamma \rangle$, learning rate $\alpha \in (0, 1)$, initial state distribution $\rho_0$, weight parameter $\xi$.
1: Initialize replay memory $\mathcal{D}$ to capacity $N$
2: Initialize the action-value function $Q$ with random weights
3: for episode = 1 to M do
4:   for t = 1 to T do
5:     With probability $\epsilon$ select a random action $a_t$; otherwise select $a_t = \arg\max_a Q^*(s_t, a)$
6:     Execute action $a_t$ and observe reward $\tilde{r}_t$ and observation $s_{t+1}$
7:     Store transition $(s_t, a_t, \tilde{r}_t, s_{t+1})$ in $\mathcal{D}$
8:     Sample three random minibatches of transitions $(s_i, a_i, \tilde{r}_i, s_{i+1})$, $(s_j, a_j, \tilde{r}_j, s_{j+1})$, $(s_k, a_k, \tilde{r}_k, s_{k+1})$ from $\mathcal{D}$
9:     Set $\tilde{y}_i = \tilde{r}_i$ for terminal $s_{i+1}$, and $\tilde{y}_i = \tilde{r}_i + \gamma \max_{a'} Q(s_{i+1}, a')$ for non-terminal $s_{i+1}$
10:    Set $\tilde{y}_{\text{peer}} = \tilde{r}_k$ for terminal $s_{j+1}$, and $\tilde{y}_{\text{peer}} = \tilde{r}_k + \gamma \max_{a'} Q(s_{j+1}, a')$ for non-terminal $s_{j+1}$
11:    Perform a gradient descent step on $(\tilde{y}_i - Q(s_i, a_i))^2 - \xi \cdot (\tilde{y}_{\text{peer}} - Q(s_j, a_j))^2$
12:  end for
13: end for
14: return $Q$
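A possible PyTorch rendering of the gradient step in line 11 of Algorithm A4; the batch format and function names are our own assumptions.

```python
import torch

def dqn_peer_loss(q_net, q_target, batch_i, batch_j, batch_k,
                  gamma=0.99, xi=0.2):
    """Peer DQN loss (line 11 of Algorithm A4), illustrative sketch.

    Each batch is a tuple (s, a, r~, s_next, done) of tensors; `a` is a long
    tensor of action indices. The target y~_i uses batch i's own rewards; the
    peer target y~_peer pairs batch k's rewards with batch j's transitions to
    penalize blind agreement.
    """
    s_i, a_i, r_i, sn_i, d_i = batch_i
    s_j, a_j, _, sn_j, d_j = batch_j
    _, _, r_k, _, _ = batch_k
    with torch.no_grad():
        y_i = r_i + gamma * (1 - d_i) * q_target(sn_i).max(dim=1).values
        y_peer = r_k + gamma * (1 - d_j) * q_target(sn_j).max(dim=1).values
    q_i = q_net(s_i).gather(1, a_i.unsqueeze(1)).squeeze(1)
    q_j = q_net(s_j).gather(1, a_j.unsqueeze(1)).squeeze(1)
    return ((y_i - q_i) ** 2).mean() - xi * ((y_peer - q_j) ** 2).mean()
```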

B ANALYSIS OF PEERBC

We prove that the policy learned by PeerBC converges to the expert policy when a sufficient amount of weak demonstrations is observed (Theorem A2). Recall that $\tilde{\mathcal{D}}_E$ denotes the joint distribution of the imperfect expert's state-action pairs $(s, \tilde{a})$, and assume there is a perfect expert whose state-action pairs follow $(s, a) \sim \mathcal{D}_E$. We specify $\ell$ as the indicator classification loss for a clean presentation: $\ell(\pi(s), a) = 1$ when $\pi(s) \neq a$, and $\ell(\pi(s), a) = 0$ otherwise. Let $\tilde{D}_E := \{(s_i, \tilde{a}_i)\}_{i=1}^{N}$ be the set of imperfect demonstrations. Define the population risks
$$R_{\mathcal{D}_E}(\pi) := \mathbb{E}_{(s, a) \sim \mathcal{D}_E}[\ell(\pi(s), a)], \quad R_{\tilde{\mathcal{D}}_E}(\pi) := \mathbb{E}_{(s, \tilde{a}) \sim \tilde{\mathcal{D}}_E}[\ell(\pi(s), \tilde{a})],$$
and their empirical counterparts
$$\hat{R}_{D_E}(\pi) := \frac{1}{N} \sum_{i \in [N]} \ell(\pi(s_i), a_i), \quad \hat{R}_{\tilde{D}_E}(\pi) := \frac{1}{N} \sum_{i \in [N]} \ell(\pi(s_i), \tilde{a}_i).$$
Note that we focus on losses in this proof; the negative of a loss can be seen as a reward. Denote by $\hat{\pi}_{\tilde{D}_E}$ and $\pi_{\tilde{\mathcal{D}}_E}$ the optimal policies obtained by minimizing the (peer) indicator loss on the dataset $\tilde{D}_E$ and under the distribution $\tilde{\mathcal{D}}_E$, respectively. We shorten $\hat{\pi}_{\tilde{D}_E}$ as $\tilde{\pi}^*$, the best policy we can learn from imperfect demonstrations with our algorithm, and let $\pi^*$ be the policy of the perfect expert.

Theorem A2. With probability at least $1 - \delta$, the error rate of $\tilde{\pi}^*$ is upper-bounded by
$$R_{\mathcal{D}_E}(\tilde{\pi}^*) \leq \frac{1 + \xi}{1 - e_- - e_+} \sqrt{\frac{2 \log 2/\delta}{N}},$$
where $N$ is the number of state-action pairs demonstrated by the expert.

Proof. We would like to bound the performance gap between learning from imperfect and perfect demonstrations, i.e., $R_{\mathcal{D}_E}(\tilde{\pi}^*) - R_{\mathcal{D}_E}(\pi^*)$. Using Hoeffding's inequality, with probability at least $1 - \delta$ we have, for any policy $\pi$,
$$\big|\hat{R}_{\tilde{D}_E}(\pi) - R_{\tilde{\mathcal{D}}_E}(\pi)\big| \leq (1 + \xi) \sqrt{\frac{\log 2/\delta}{2N}}.$$
We also have
$$R_{\tilde{\mathcal{D}}_E}(\tilde{\pi}^*) - R_{\tilde{\mathcal{D}}_E}(\pi_{\tilde{\mathcal{D}}_E}) \leq \underbrace{\hat{R}_{\tilde{D}_E}(\tilde{\pi}^*) - \hat{R}_{\tilde{D}_E}(\pi_{\tilde{\mathcal{D}}_E})}_{\leq\, 0} + \big(R_{\tilde{\mathcal{D}}_E}(\tilde{\pi}^*) - \hat{R}_{\tilde{D}_E}(\tilde{\pi}^*)\big) + \big(\hat{R}_{\tilde{D}_E}(\pi_{\tilde{\mathcal{D}}_E}) - R_{\tilde{\mathcal{D}}_E}(\pi_{\tilde{\mathcal{D}}_E})\big) \leq 2 \max_\pi \big|\hat{R}_{\tilde{D}_E}(\pi) - R_{\tilde{\mathcal{D}}_E}(\pi)\big| \leq (1 + \xi) \sqrt{\frac{2 \log 2/\delta}{N}}.$$
Before proceeding, we define a constant capturing the effect of label noise. When the action space has two elements, the problem is essentially binary classification with noisy labels (Liu & Guo, 2020), where the noise rates are $e_+ = \mathbb{P}(\tilde{\pi}_E(s) = A_- \mid \pi^*(s) = A_+)$ and $e_- = \mathbb{P}(\tilde{\pi}_E(s) = A_+ \mid \pi^*(s) = A_-)$; recall the action space is $\mathcal{A} = \{A_+, A_-\}$. The noise constant is $e = e_- + e_+$. Accordingly, when $|\mathcal{A}| > 2$, similar results hold under uniform noise, where $e_u := \mathbb{P}(\tilde{\pi}_E(s) = u \mid \pi^*(s) = u')$, $u' \neq u$, and the noise constant is $e = \sum_{u=1}^{|\mathcal{A}|} e_u$. The feature-independence assumption holds, so the properties of peer loss functions (Liu & Guo, 2020) apply, i.e.,
$$R_{\mathcal{D}_E}(\tilde{\pi}^*) - R_{\mathcal{D}_E}(\pi^*) = \frac{1}{1 - e} \Big(R_{\tilde{\mathcal{D}}_E}(\tilde{\pi}^*) - R_{\tilde{\mathcal{D}}_E}(\pi_{\tilde{\mathcal{D}}_E})\Big) \leq \frac{1 + \xi}{1 - e} \sqrt{\frac{2 \log 2/\delta}{N}}.$$
From the definition and the deterministic assumption on $\pi^*$, we have $R_{\mathcal{D}_E}(\pi^*) = 0$. Thus the error rate is
$$R_{\mathcal{D}_E}(\tilde{\pi}^*) \leq R_{\mathcal{D}_E}(\pi^*) + \frac{1 + \xi}{1 - e} \sqrt{\frac{2 \log 2/\delta}{N}} = \frac{1 + \xi}{1 - e} \sqrt{\frac{2 \log 2/\delta}{N}}. \quad \blacksquare$$

C SUPPLEMENTARY EXPERIMENTS

C.1 EXPERIMENTAL SETUP

We set up our experiments within the popular OpenAI stable-baselines and keras-rl frameworks. Specifically, three popular RL algorithms, Deep Q-Network (DQN) (Mnih et al., 2013; van Hasselt et al., 2016), Dueling DQN (DDQN) (Wang et al., 2016), and DDPG (Lillicrap et al., 2015), are evaluated.

Table A1: Numerical performance of DDQN on CartPole with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward $\hat{r}$ (Wang et al., 2020), and peer reward $\tilde{r}_{\text{peer}}$ ($\xi = 0.2$). $R_{\text{avg}}$ denotes the average reward per episode after convergence (last five episodes), the higher (↑) the better; $N_{\text{epi}}$ denotes the total number of episodes within 10,000 steps, the lower (↓) the better.

We also experiment with DDPG (Lillicrap et al., 2015) and uniform noise in a continuous-reward environment, with the reward value approximated using its maximum point. In Figure A3, the RL agents with the proposed CA objective successfully converge to the optimal policy under different amounts of noise; in contrast, the agents trained on noisy rewards suffer from the biased noise, especially in the high-noise regime.

C.5 SENSITIVITY ANALYSIS OF THE PEER PENALTY $\xi$

In this section, we analyze the sensitivity of $\xi$ in the RL and BC tasks. Note that we did not tune this hyperparameter extensively in any of the experiments presented above, since we found our method works robustly over a wide range of $\xi$.

RL with noisy reward

We repeat the experiment in Figure A1 for DQN but with $\xi$ varying from 0.1 to 0.4. As shown in Figure A2, our method works reasonably and leads to faster convergence compared to the baselines. However, we find that at the late stage of training a small $\xi$ is preferable: the agent has already gained useful knowledge and makes reasonable actions, so an over-large $\xi$ can penalize agreement that is in fact informative.

BC from weak demonstrations

We conduct experiments on Pong with 12 different $\xi$ values, varying from 0.1 to 1.2. As a reference, two other BC baselines, GAIL (Ho & Ermon, 2016) and SQIL (Reddy et al., 2019), are considered. We do not include GAIL in the figure, since GAIL fails to produce meaningful results on vision-based Atari games, as also observed in prior work (Reddy et al., 2019; Brantley et al., 2020). From Figure A5, we can see that PeerBC outperforms pure behavior cloning and SQIL (Reddy et al., 2019) when $\xi$ is within $[0.1, 0.7]$, revealing that PeerBC is a superior behavior cloning approach that better elicits information from imperfect demonstrations.

C.6 VARIANCE REDUCTION IN NOISY REWARD SETTING

In practice, there is a bias-variance trade-off that influences RL training. Even though variance reduction approaches do not, in theory, resolve the challenge posed by bias in the observed rewards, they might benefit the training procedure in practice. To investigate whether variance reduction techniques help in the setting with biased noise, we repeated the experiments in Table 1. We adopt the variance reduction technique (VRT) proposed by Romoff et al. (2018) and find that it brings little benefit to the noisy reward and the surrogate reward, as shown in Figure A6. Moreover, peer + VRT performs worse than the peer reward alone. This demonstrates that the benefit in our setting does not come mainly from variance reduction. The full quantitative results are presented in Table A2.

C.7 STOCHASTIC POLICY FOR BEHAVIORAL CLONING

In this section, we analyze the stochasticity of the imperfect expert model and of the fully converged PPO agent (assumed to be the clean expert), and show that our PeerBC can handle both the case where the clean expert is stochastic and the case where it is rather deterministic. We plot the entropy of the PPO agent during training on the four environments from the BC task in Figure A7, and give the entropy values of the imperfect expert model and the optimal policy in Table A3. We observe that, except for Freeway, the entropy of the expert policies is always larger than 1. We compute the mean of the highest action probability over 1,000 steps for the fully converged PPO agents in Table A4, which again verifies that the true expert policy we aim to recover might not be fully deterministic. These results demonstrate the flexibility of our proposed approach in dealing with both stochastic and deterministic clean expert policies in practice, although a deterministic clean expert policy is assumed in our theoretical analysis. Also, from Figure A7 and Table A3, we notice that the entropy of the imperfect expert models is higher than that of the fully converged PPO agents, implying that the expert models may contain a certain amount of noise: there may be states the expert has not seen often enough, on which its selected actions are noisy. This is consistent with our claim that the benefits of PeerBC come from two aspects: noise reduction of the imperfect expert and the inducement of a more deterministic policy.



Footnotes: (1) We analyzed the sensitivity of $\xi$ and found the algorithm performs reasonably when $\xi \in (0.1, 0.4)$; more insights and experiments with varied $\xi$ are deferred to Appendix D. (2) https://github.com/hill-a/stable-baselines (3) https://github.com/keras-rl/keras-rl (4) https://github.com/araffin/rl-baselines-zoo/blob/master/hyperparams/ppo2.yml#L1





Figure 2: Learning curves of DDQN on CartPole with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward ($\hat{r}$, Wang et al., 2020), and peer reward ($\tilde{r}_{\text{peer}}$, $\xi = 0.2$).

Figure 3: Learning curves of BC on Atari: standard BC, PeerBC (ours), and the expert.

Figure 4: Policy co-training on control/Atari games: single view, CoPiEr (Song et al., 2019), and PeerCT (ours).




Figure A1: Learning curves on CartPole with true reward ($r$), noisy reward ($\tilde{r}$), surrogate reward ($\hat{r}$, Wang et al., 2020), and peer reward ($\tilde{r}_{\text{peer}}$, $\xi = 0.2$). Each experiment is repeated 10 times with different random seeds.

Figure A2: Learning curves of DQN on CartPole with peer reward ($\tilde{r}_{\text{peer}}$) under different choices of $\xi$ (from 0.1 to 0.4).

Figure A5: Sensitivity analysis of $\xi$ for PeerBC on Pong, with behavior cloning, PeerBC ($\xi$ varying from 0.2 to 0.5, and 1.0), the expert, and SQIL as reported by Reddy et al. (2019). Each experiment is repeated under 3 different random seeds.

Table 2: BC from weak demonstrations. Our approach successfully recovers better policies than the expert.

Table 3: Comparison with single-view training and CoPiEr (Song et al., 2019) on standard policy co-training.

RL with noisy reward. Following Wang et al. (2020), we consider the binary reward $\{-1, 1\}$ for CartPole, where symmetric noise is synthesized with different error rates $e = e_- = e_+$. We adopt a five-layer fully connected network and the Adam optimizer. The model is trained for 10,000 steps with a learning rate of $10^{-3}$ and the Boltzmann exploration strategy. The update rate of the target model and the memory size are $10^{-2}$ and 50,000, respectively. Performance is reported over 10 independent trials with different random seeds.

BC with weak expert. We train the imperfect expert within the stable-baselines framework with the default network architecture for Atari and hyper-parameters from rl-baselines-zoo. The expert model is trained for 1,400,000 steps for Pong and 2,000,000 steps for Boxing, Enduro, and Freeway. For each environment, we use the trained model to generate 100 trajectories, and behavior cloning is performed on these trajectories. We adopt the cross-entropy loss for behavior cloning and add a small constant ($1 \times 10^{-8}$) to each logit after the softmax operation in the peer term to prevent it from becoming too large. In the BC experiments, the batch size is 128, the learning rate is $1 \times 10^{-4}$, and the $\epsilon$ value of the Adam optimizer is $1 \times 10^{-8}$.

Policy co-training. For the experiments on Gym (CartPole and Acrobot), we mask the first coordinate of the state vector for one view and the second for the other, as in Song et al. (2019). Both policies are trained with PPO (Schulman et al., 2017) + PeerBC. In each iteration, we sample 128 steps from each of 8 parallel environments. These samples are fed into PPO training with a batch size of 256, a learning rate of $2.5 \times 10^{-4}$, and a clip range of 0.1. Both the learning rate and the clip range decay to 0 over time. We represent the policy by a fully connected network with 2 hidden layers of 128 units each. For the experiments on Atari (Pong and Breakout), the input is raw game images. We adopt the preprocessing introduced in Mnih et al. (2013) and mask the pixels in odd columns for one view and in even columns for the other. The policy uses the default CNN from stable-baselines. Batch size, learning rate, clip range, and other hyper-parameters are the same as in the Gym experiments. Note that we only add PeerBC after 1,000 episodes.

C.3 SUPPLEMENTARY RESULTS FOR FIGURE 2 AND TABLE 1






