BEHAVIORAL CLONING FROM NOISY DEMONSTRATIONS

Abstract

We consider the problem of learning an optimal expert behavior policy given noisy demonstrations that contain observations from both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clean) demonstrations are given from optimal expert policies but not the non-optimal ones, and thus often fail to imitate the optimal expert behaviors given the noisy demonstrations. Prior works that address the problem require (1) learning policies through environment interactions in the same fashion as reinforcement learning, and (2) annotating each demonstration with confidence scores or rankings. However, such environment interactions and annotations in real-world settings take impractically long training time and a significant human effort. In this paper, we propose an imitation learning algorithm to address the problem without any environment interactions and annotations associated with the non-optimal demonstrations. The proposed algorithm learns ensemble policies with a generalized behavioral cloning (BC) objective function where we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than ones learned by BC.

1. INTRODUCTION

Imitation learning (IL) has become a widely used approach to obtain autonomous robotics control systems. IL is often more applicable to real-world problems than reinforcement learning (RL), since obtaining expert demonstrations is often easier than designing the appropriate rewards that RL requires. Several IL methods involve RL (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016). Those IL methods inherit the sample complexity of RL in terms of environment interactions during training. This complexity restricts their applicability to real-world problems, since a large number of environment interactions in real-world settings often takes a long time and can damage the robot or the environment. Therefore, we are interested in IL methods that do not require environment interactions, such as behavioral cloning (BC) (Pomerleau, 1991), which learns an expert policy in a supervised fashion.

BC, as well as popular IL methods such as generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016), assumes that the expert demonstrations are optimal. Unfortunately, it is often difficult to obtain optimal demonstrations for many real-world tasks, because the expert who operates the robot often makes mistakes for various reasons, such as the difficulty of the task, difficulty in handling the controller, limited observability of the environment, or the presence of distractions. The mistakes include unnecessary and/or incorrect operations. Given such noisy expert demonstrations, which contain records of both optimal and non-optimal behavior, BC and the popular IL methods fail to imitate the optimal policy because of the optimality assumption on the demonstrations, as shown in (Wu et al., 2019).

A naive solution to cope with the noisy demonstrations is to discard the non-optimal demonstrations among those already collected. This screening process is often impractical because it involves significant human effort. Most recent IL works suppose settings where only a very limited number of clean expert demonstrations, composed solely of optimal behavior records, are available. Those methods are also vulnerable to noisy demonstrations because of the optimality assumption. Thus they implicitly rely on such an impractical screening process when applied to real-world problems, where many noisy demonstrations, in addition to the clean ones, can be easily obtained.

Some IL methods do address noisy demonstrations. Instead of the screening process, they require annotating each demonstration with confidence scores (Wu et al., 2019) or rankings (Brown et al., 2019). Although they cope well with noisy demonstrations and obtain the optimal behavior policies, such annotation costs significant human effort, as screening does. Hence, we desire IL methods that cope well with noisy demonstrations, which can be easily obtained in real-world settings, without any screening or annotation processes associated with the non-optimal behaviors.

In this paper, we propose a novel imitation learning algorithm to address noisy demonstrations. The proposed algorithm does not require (1) any environment interactions during training, or (2) any screening or annotation processes associated with the non-optimality of the expert behaviors. Our algorithm learns ensemble policies with a generalized BC objective function in which we exploit another policy already learned by BC. Experimental results show that the proposed algorithm can learn policies that are much closer to the optimal policy than those learned by BC.

2. RELATED WORKS

A wide variety of IL methods have been proposed in the last few decades. BC (Pomerleau, 1991) is the simplest IL method among those, and thus BC could be the first IL option when enough clean demonstrations are available. Ross & Bagnell (2010) have theoretically pointed out a downside of BC referred to as compounding error: the small errors of learners trained by BC can compound over time and deteriorate their performance. On the other hand, experimental results in (Sasaki et al., 2018) show that BC given a sufficient amount of clean demonstrations can easily obtain the optimal behavior even for complex continuous control tasks. Hence, the effect of the compounding error is negligible in practice if the amount of clean demonstrations is sufficient. However, even if the amount of demonstrations is large, BC cannot obtain the optimal policy given noisy demonstrations, because of the optimality assumption on the demonstrations.

Other widely used IL approaches are inverse reinforcement learning (IRL) (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004) and adversarial imitation learning (AIL) (Ho & Ermon, 2016). Since those approaches also assume the optimality of the demonstrations, they are also unable to obtain the optimal policy given noisy demonstrations, as shown in (Wu et al., 2019). As we will show in Section 6, our algorithm can successfully learn near-optimal policies if a sufficient amount of noisy demonstrations is given.

Several works address noisy demonstrations (Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019; Kaiser et al., 1995; Grollman & Billard, 2012; Kim et al., 2013). Those works address noisy demonstrations by screening the non-optimal demonstrations with heuristic non-optimality assessments (Kaiser et al., 1995), by annotations associated with the non-optimality (Wu et al., 2019; Brown et al., 2019; Grollman & Billard, 2012), or by training through environment interactions (Kim et al., 2013; Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019). Our algorithm does not require any screening processes, annotations associated with the non-optimality, or environment interactions during training.

Offline RL methods (Lange et al., 2012; Fujimoto et al., 2019; Kumar et al., 2020) train the learner agents without any environment interactions, and allow the training dataset to contain non-optimal trajectories, as in our problem setting. A drawback of offline RL methods for real-world applications is the requirement to design reward functions, which often involves significant human effort, since those methods assume that the reward for each state-action pair is known. As in standard IL methods, our algorithm does not require designing reward functions.

Disagreement regularized imitation learning (DRIL) (Brantley et al., 2019) is a state-of-the-art IL algorithm that employs an ensemble of policies, as our algorithm does. The aim of employing the ensemble differs between DRIL and our algorithm. DRIL uses the disagreement in predictions made by the policies in the ensemble to evaluate whether the states observed while training the learner are ones observed in the expert demonstrations. In contrast, our algorithm uses the ensemble to encourage the learner to take optimal actions at each state, as described in 5.3. In addition, DRIL fundamentally requires environment interactions during training whereas our algorithm does not.

3. PRELIMINARIES AND PROBLEM SETUP

In this work, we consider an episodic fixed-horizon Markov decision process (MDP) formalized as a tuple $\{S, A, P, R, d_0, T\}$, where $S$ is a set of states, $A$ is a set of possible actions agents can take, $P : S \times A \times S \to [0, 1]$ is a transition probability, $R : S \times A \to [0, 1]$ is a reward function, $d_0 : S \to [0, 1]$ is a distribution over initial states, and $T$ is an episode horizon. The agent's behavior is defined by a stochastic policy $\pi : S \times A \to [0, 1]$, and $\Pi$ denotes the set of stochastic policies. The expected one-step immediate reward for a policy $\pi$ given a state $s$ is defined as $R^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[R(s, a)]$.

Let $d_t^{\pi}$ and $d^{\pi} = \frac{1}{T}\sum_{t=1}^{T} d_t^{\pi}$ denote the distribution over states at time step $t$ and the average distribution over $T$ time steps induced by $\pi$, respectively. The distribution $d_1^{\pi}$ at the first step corresponds to $d_0$ for any $\pi$. When following a policy $\pi$ throughout an episode, the expected one-step immediate reward at time step $t$ and the expected $T$-step reward are defined as

$$R_t^{\pi} = \mathbb{E}_{s \sim d_t^{\pi},\, a \sim \pi(\cdot|s)}[R(s, a)] = \mathbb{E}_{s \sim d_t^{\pi}}[R^{\pi}(s)] \quad \text{and} \quad J(\pi, R) = \sum_{t=1}^{T} R_t^{\pi} = T\, \mathbb{E}_{s \sim d^{\pi}}[R^{\pi}(s)],$$

respectively. We refer to $J(\pi, R)$ as the on-policy expected $T$-step reward. We also consider another $T$-step reward defined by $J_{\beta}(\pi, R) = T\, \mathbb{E}_{s \sim d^{\beta}}[R^{\pi}(s)]$, which we call the off-policy expected $T$-step reward, where $\beta \in \Pi$ is a policy that can differ from $\pi$.

In our problem setting, the function $R$ is not given. Instead, we observe noisy demonstrations. We refer to the agent that generates the noisy demonstrations as the noisy expert. The decision process becomes MDP$\setminus\{R\}$ as in common imitation learning settings, and our problem can be formalized as finding an optimal policy in MDP$\setminus\{R\}$. We refer to the policy that takes the optimal (thus not noisy) behavior in episodic tasks as the true expert policy $\pi_e^*$. We make the following four assumptions to further formalize our problem setting:

Assumption 1. The $T$-step expected reward of $\pi_e^*$ satisfies $J(\pi, R) \le J(\pi_e^*, R)$; $J_{\beta}(\pi, R) \le J_{\beta}(\pi_e^*, R)$; and $J_{\beta}(\pi_e^*, R) \le J(\pi_e^*, R)$ for any non-optimal policies $\pi, \beta \in \Pi \setminus \{\pi_e^*\}$.

Assumption 2. With small probability $\epsilon$, which we call the non-optimal probability, the policy $\pi_e$ that the noisy expert follows during demonstrations is sampled at each time step as $\pi_e = \pi \sim p_{\Pi}$ if $\epsilon \ge z \sim U(0, 1)$, otherwise $\pi_e = \pi_e^*$, where $p_{\Pi}$ is an unknown distribution over the set of policies, $z$ is a random variable, and $U(0, 1)$ is the uniform distribution with range $[0, 1]$.

Assumption 3. The reward $R_t^{\pi_e}$ is at least zero if the noisy expert has followed a policy $\pi \in \Pi \setminus \{\pi_e^*\}$ once or more so far; otherwise $R_t^{\pi_e} = \mathbb{E}_{s \sim d_t^{\pi_e}}\big[\epsilon\, \mathbb{E}_{\pi \sim p_{\Pi}}[R^{\pi}(s)] + (1 - \epsilon) R^{\pi_e^*}(s)\big]$.

Assumption 4. The sequence $\{R_1^{\pi_e}, \ldots, R_T^{\pi_e}\}$ is monotonically decreasing: $R_t^{\pi_e} \ge R_{t+1}^{\pi_e}$.

Assumption 1 indicates that both the on-policy and off-policy expected $T$-step rewards obtained by following $\pi_e^*$ are always greater than or equal to those obtained by following any other policy. In other words, we assume the true expert policy is an optimal one in the MDP, and the agent following it is able to behave so that the expected immediate rewards at any state are maximized. Under Assumption 1, the problem we would like to solve in this work is to learn a parameterized policy $\pi_{\theta}$ whose on-policy expected $T$-step reward $J(\pi_{\theta}, R)$ is maximized toward $J(\pi_e^*, R)$.

Assumption 2 indicates that the noisy expert occasionally adopts non-optimal policies, which results in the noisy demonstrations, due to random events, such as the presence of distractions, associated with the random variable $z$. Once the noisy expert has followed a non-optimal policy even once, it visits states that would never be visited by the true expert. Assumption 3 indicates that those states are less rewarded and their rewards are at least zero. Assumption 3 also indicates that the noisy demonstrations contain a number of episodes in which the noisy expert has reached the same state $s$ and has adopted either $\pi_e^*$ or, with probability $\epsilon$, a policy $\pi \in \Pi \setminus \{\pi_e^*\}$ there. Assumption 4 indicates that, since the probability that the noisy expert consecutively follows $\pi_e^*$ decreases as the time step increases according to Assumption 2, the divergence between $d_t^{\pi_e}$ and $d_t^{\pi_e^*}$ becomes greater as the time step $t$ increases, and thus the one-step expected immediate reward $R_t^{\pi_e} = \mathbb{E}_{s \sim d_t^{\pi_e},\, a \sim \pi_e(\cdot|s)}[R(s, a)]$ decreases as $t$ increases.

4. ANALYSIS OF PERFORMANCE DETERIORATION

In this section, we first describe the BC objective in 4.1. Then, we analyze why the learner trained by BC deteriorates in performance when using the noisy demonstrations, from the expected $T$-step reward maximization and KL-divergence minimization perspectives, in 4.2 and 4.3, respectively.

4.1. BEHAVIORAL CLONING OBJECTIVE

Let $\pi_{\theta} \in \Pi$ be a learner policy parameterized by $\theta$, to be optimized by IL algorithms. The common BC objective is as follows:

$$\arg\max_{\theta}\; \mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta}(a|s)]. \tag{1}$$

The objective (1) aims to mimic the expert behavior that follows $\pi_e$. It can be interpreted as maximizing the expected one-step immediate reward $R^{\pi_{\theta}}(s)$ toward $R^{\pi_e}(s)$ at each state $s \sim d^{\pi_e}$. Since the state distribution $d^{\pi_e}$ is not induced by $\pi_{\theta}$, (1) can also be said to maximize the off-policy expected $T$-step reward $J_{\pi_e}(\pi_{\theta}, R)$ toward $J(\pi_e, R)$.
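To make objective (1) concrete, the following is a minimal sketch of its empirical estimate for a tabular softmax policy over discrete actions. The tabular parameterization and the names `log_softmax`, `bc_objective`, `uniform`, and `matched` are assumptions made for illustration; they are not taken from the paper, whose experiments use neural policies.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def bc_objective(policy_logits, demonstrations):
    """Empirical estimate of objective (1): mean log pi_theta(a|s) over demo pairs.

    policy_logits: dict mapping state -> list of logits (hypothetical tabular policy).
    demonstrations: list of (state, action_index) pairs from the (noisy) expert.
    """
    total = 0.0
    for s, a in demonstrations:
        total += log_softmax(policy_logits[s])[a]
    return total / len(demonstrations)

# Toy demonstrations: state 0 shows action 1 twice and action 0 once.
demos = [(0, 1), (0, 1), (0, 0), (1, 2)]
uniform = {0: [0.0, 0.0, 0.0], 1: [0.0, 0.0, 0.0]}
# A policy whose action probabilities match the empirical action frequencies.
matched = {0: [math.log(1 / 3), math.log(2 / 3), -30.0], 1: [-30.0, -30.0, 0.0]}

print(bc_objective(uniform, demos), bc_objective(matched, demos))
```

The policy that matches the empirical action frequencies scores higher than the uniform one, which is exactly why BC reproduces non-optimal actions when they appear in the demonstrations.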

4.2. THE EXPECTED T-STEP REWARD MAXIMIZATION

We obtain a lower bound on the on-policy expected $T$-step reward for the noisy expert policy in almost the same way as the derivation of Theorem 2.1 in (Ross & Bagnell, 2010), where a lower bound was shown for learner policies given the "clean" expert demonstrations.

Theorem 1. If Assumptions 1-4 hold, $J(\pi_e, R)$ has the following lower bound:

$$J(\pi_e, R) \ge \frac{1}{T}\sum_{t=0}^{T-1}(1 - \epsilon)^{t} \cdot \mathbb{E}_{\pi \sim p_{\Pi}}[J_{\pi_e}(\pi, R)]. \tag{2}$$

The detailed derivation can be found in Appendix A.1. Assume that, as the result of BC, the learner policy $\pi_{\theta}$ has a probability of non-optimal behavior of at most $\hat{\epsilon} = \epsilon + \zeta$, where $\zeta \in [0, 1 - \epsilon]$ is an additional probability of non-optimal behavior due to the remaining loss in (1). Note that $\zeta$ may become greater than zero due to the difficulty of optimizing (1) even if $\epsilon = 0$. The learner following $\pi_{\theta}$ with $\hat{\epsilon}$ can be deemed another noisy expert who samples a policy at each time step as $\pi_{\theta} = \pi \sim p_{\pi_{\theta}}$ if $\hat{\epsilon} \ge z \sim U(0, 1)$, otherwise $\pi_{\theta} = \pi_e^*$, where $p_{\pi_{\theta}}$ is a (special) distribution from which the same policy is always sampled. By replacing $\epsilon$ and $p_{\Pi}$ in Theorem 1 with $\hat{\epsilon}$ and $p_{\pi_{\theta}}$, respectively, we obtain the following corollary.

Corollary 1. If Assumptions 1-4 hold and the policy $\pi_{\theta}$ has a probability of non-optimal behavior $\hat{\epsilon} = \epsilon + \zeta$, $J(\pi_{\theta}, R)$ has the following lower bound:

$$J(\pi_{\theta}, R) \ge \frac{1}{T}\sum_{t=0}^{T-1}(1 - \hat{\epsilon})^{t} \cdot J_{\pi_e}(\pi_{\theta}, R). \tag{3}$$

Recall that the BC objective (1) is to maximize $J_{\pi_e}(\pi_{\theta}, R)$. If $\hat{\epsilon} = 0$, Corollary 1 indicates that the on-policy expected $T$-step reward $J(\pi_{\theta}, R)$, which corresponds to the actual learner performance, is boosted by maximizing $J_{\pi_e}(\pi_{\theta}, R)$ through the optimization of the BC objective (1). On the other hand, if $\epsilon > 0$ and thus $\hat{\epsilon} > 0$, the first factor on the RHS in (3) becomes much smaller as $\epsilon$ becomes larger. Corollary 1 thus shows that the probability of non-optimal behavior $\epsilon$ of the noisy expert significantly negates the improvement of the learner performance $J(\pi_{\theta}, R)$ by BC, even if $\zeta$ can be sufficiently minimized through the optimization. Hence, BC is not able to boost the learner performance enough when noisy demonstrations are given.
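To give a feel for how quickly the first factor on the RHS of the bounds in Theorem 1 and Corollary 1 shrinks, the short sketch below evaluates $\frac{1}{T}\sum_{t=0}^{T-1}(1-\epsilon)^t$ for a few values of $\epsilon$. The function name `horizon_factor` and the horizon $T = 1000$ are illustrative assumptions, not values taken from the paper.

```python
def horizon_factor(eps, T):
    """(1/T) * sum_{t=0}^{T-1} (1 - eps)^t, the first factor in the lower bounds."""
    return sum((1.0 - eps) ** t for t in range(T)) / T

# Illustrative horizon; the factor equals 1 only when eps = 0.
T = 1000
for eps in (0.0, 0.01, 0.1, 0.3):
    print(eps, horizon_factor(eps, T))
```

For large $T$ the factor behaves like $1/(T\epsilon)$, so even a modest non-optimal probability makes the guaranteed fraction of the off-policy reward very small.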

4.3. KL DIVERGENCE MINIMIZATION

Let $S^{\pi_e}$ be the set of states observed in the noisy demonstrations. $S^{\pi_e}$ can be thought of as the domain of the (empirical) state distribution $d^{\pi_e}$. $S^{\pi_e}$ can be decomposed into two sets of states as $S^{\pi_e} = S_e^{\pi_e} \cup S_{e+*}^{\pi_e}$, where $S_e^{\pi_e}$ contains states that are observed if the noisy expert has followed a policy $\pi \in \Pi \setminus \{\pi_e^*\}$ once or more so far in the episode, and $S_{e+*}^{\pi_e}$ contains states at which the noisy expert has followed a policy $\pi \in \Pi \setminus \{\pi_e^*\}$ for the first time in the episode. Under Assumption 3, the rewards $R_t^{\pi_e}$ for the states $s \in S_e^{\pi_e}$ are at least zero, whereas $R_t^{\pi_e} = \mathbb{E}_{s \sim d_t^{\pi_e}}\big[\epsilon\, \mathbb{E}_{\pi \sim p_{\Pi}}[R^{\pi}(s)] + (1 - \epsilon) R^{\pi_e^*}(s)\big]$ for the states $s \in S_{e+*}^{\pi_e}$. Writing $d_e^{\pi_e}$ and $d_{e+*}^{\pi_e}$ for the state distributions over $S_e^{\pi_e}$ and $S_{e+*}^{\pi_e}$, the state distribution decomposes as

$$d^{\pi_e}(s) = \alpha\, d_e^{\pi_e}(s) + \beta\, d_{e+*}^{\pi_e}(s), \tag{4}$$

where $\alpha$ and $\beta$ are the ratios at which the noisy expert entered states that belong to $S_e^{\pi_e}$ and $S_{e+*}^{\pi_e}$ during demonstrations, respectively. In addition, $\alpha + \beta = 1$ is satisfied. Using Equation (4), an upper bound of the objective function in Equation (1) is derived as follows:

$$\mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta}(a|s)] \le -\alpha\, \Omega_e(\theta) - \beta\, \Omega_{e+*}(\theta), \tag{5}$$

$$\Omega_e(\theta) = \mathbb{E}_{s \sim d_e^{\pi_e}}\big[D_{KL}[\pi_e(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)]\big], \tag{6}$$

$$\Omega_{e+*}(\theta) = \epsilon\, \mathbb{E}_{s \sim d_{e+*}^{\pi_e},\, \pi \sim p_{\Pi}}\big[D_{KL}[\pi(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)]\big] + (1 - \epsilon)\, \mathbb{E}_{s \sim d_{e+*}^{\pi_e}}\big[D_{KL}[\pi_e^*(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)]\big], \tag{7}$$

where $D_{KL}$ is the forward Kullback-Leibler (KL) divergence. The full derivation can be found in Appendix A.2.

The inequality (5) shows that the BC objective (1) with the noisy demonstrations amounts to minimizing a sum of KL divergences. The first term on the RHS in (7) leads the learner to imitate some non-optimal behaviors, whereas the second term leads it to learn $\pi_e^*$ on the same states. Minimizing the RHS in (7) is difficult because minimizing KL divergences with different target distributions at the same time is difficult in general. The first term on the RHS in (7) thus works as a "noisy" regularizer with coefficient $\epsilon$ that confuses the learner when learning $\pi_e^*$. The difficulty in the optimization due to the noisy regularizer increases $\zeta$ as $\epsilon$ increases.

As mentioned in 4.1 and 4.2, BC maximizes $J_{\pi_e}(\pi_{\theta}, R)$ toward $J(\pi_e, R)$. Hence, minimizing $\Omega_e(\theta)$ in (6) corresponds to maximizing $\mathbb{E}_{s \sim d_e^{\pi_e}}[R^{\pi_{\theta}}(s)]$ toward $\mathbb{E}_{s \sim d_e^{\pi_e}}[R^{\pi_e}(s)]$. Since the rewards $R^{\pi_e}(s)$ are only guaranteed to be at least zero for the states $s \sim d_e^{\pi_e}$ according to Assumption 3 and the definition of $S_e^{\pi_e}$, $\mathbb{E}_{s \sim d_e^{\pi_e}}[R^{\pi_{\theta}}(s)]$ is likewise only guaranteed to be at least zero by minimizing $\Omega_e(\theta)$. Hence $J_{\pi_e}(\pi_{\theta}, R)$ is only guaranteed to be at least zero as the rate $\alpha$ increases, and the rate $\alpha$ increases as the probability of non-optimal behavior $\epsilon$ increases. Thus, the larger the probability $\epsilon$ is, the more difficult it is to boost the learner performance by BC.

If the influence of the noisy regularizer can be reduced, the probability that the learner follows $\pi_e^*$ at the states $s \in S_{e+*}^{\pi_e}$ will increase. In addition, as that probability increases, the rate (corresponding to $\alpha$) for the states $s \in S_e^{\pi_e}$ will decrease. Thus, the more often the learner follows $\pi_e^*$ at the states $s \in S_{e+*}^{\pi_e}$, the more rewards $R^{\pi_e^*}(s)$ the learner obtains according to Assumption 3. To summarize the above analysis, reducing the influence of the noisy regularizer for states $s \in S_{e+*}^{\pi_e}$, which leads the learner to imitate some non-optimal behaviors, might boost the learner performance.
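The effect of the noisy regularizer can be illustrated at a single state with discrete actions. The sketch below relies on a standard property of the forward KL divergence (the minimizer over $q$ of a weighted sum of $D_{KL}[p_i \| q]$ is the weighted mixture of the $p_i$); the two action distributions and the name `omega_state` are hypothetical values chosen for illustration only.

```python
import math

def kl(p, q):
    """Forward KL divergence D_KL[p || q] for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def omega_state(q, pi_bad, pi_star, eps):
    """Single-state analogue of the noisy KL objective:
    eps * KL[pi || q] + (1 - eps) * KL[pi* || q]."""
    return eps * kl(pi_bad, q) + (1.0 - eps) * kl(pi_star, q)

pi_star = [0.05, 0.05, 0.90]   # hypothetical optimal action distribution
pi_bad = [0.90, 0.05, 0.05]    # hypothetical non-optimal action distribution
eps = 0.3
mixture = [eps * b + (1.0 - eps) * s for b, s in zip(pi_bad, pi_star)]

loss_at_mixture = omega_state(mixture, pi_bad, pi_star, eps)
loss_at_optimal = omega_state(pi_star, pi_bad, pi_star, eps)
print(loss_at_mixture, loss_at_optimal)
```

The loss is lower at the mixture than at $\pi_e^*$ itself, so a learner that minimizes this objective is pulled away from the optimal policy in proportion to $\epsilon$.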

5. ALGORITHM

The analyses in Section 4 show that the learner trained by standard BC deteriorates in performance when noisy demonstrations are given. Based on the analyses in 4.2 and 4.3, the learner performance will be boosted if the learner imitates the optimal policy $\pi_e^*$ but not the non-optimal ones $\pi \in \Pi \setminus \{\pi_e^*\}$ for the states $s \in S_{e+*}^{\pi_e}$. In other words, the learner performance will be boosted if $\hat{\epsilon}$ of the learner can be reduced. In this section, we first propose our algorithm, which avoids learning $\pi \in \Pi \setminus \{\pi_e^*\}$ while learning $\pi_e^*$, in 5.1. Then we describe how our algorithm avoids learning $\pi \in \Pi \setminus \{\pi_e^*\}$ from the mode seeking and reward maximization perspectives in 5.2 and 5.3, respectively. Lastly, we discuss limitations of our algorithm in 5.4.

5.1. PROPOSED ALGORITHM

We consider a generalization of the BC objective as follows:

$$\arg\max_{\theta}\; \mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta}(a|s) \cdot \hat{R}(s, a)], \tag{8}$$

where $\hat{R} : S \times A \to [0, 1]$ denotes an arbitrary function which can differ from $R$. If $\hat{R}(s, a) = 1$ for $\forall (s, a) \in S \times A$, the objective (8) corresponds to the BC objective (1). If $\int_A \hat{R}(s, a)\, da = 1$ for $\forall s \in S$ is satisfied, $\hat{R}(s, a)$ can be interpreted as weights for the action samples obtained from the demonstrations, so that the actions are sampled according to their relative weights. The objective (8) can also be deemed that of the off-policy actor-critic (Off-PAC) algorithm$^{1}$ (Degris et al., 2012) with reward functions $\hat{R}(s, a)$ and zero discount factors.

Let $\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_K}$ be $K$ parameterized policies with different initial parameters $\theta_1, \theta_2, \ldots, \theta_K$, and let $\pi_{\theta}(a|s) = \sum_{k=1}^{K} \pi_{\theta_k}(a|s)/K$ denote an ensemble of the parameterized policies with parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$. Let $\pi_{\theta_{old}}$ be a parameterized policy with $\theta_{old}$ which was already optimized with the noisy demonstrations. The main idea of our algorithm is to reuse the old policy $\pi_{\theta_{old}}$ as $\hat{R}(s, a)$ in the generalized BC objective (8):

$$\arg\max_{\theta}\; \mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta}(a|s) \cdot \pi_{\theta_{old}}(a|s)]. \tag{9}$$

The overview of our algorithm is described in Algorithm 1.

Algorithm 1 Behavioral Cloning from Noisy Demonstrations
 1: Given the expert demonstrations $D$.
 2: Set $\hat{R}(s, a) = 1$ for $\forall (s, a) \in D$.
 3: Split $D$ into $K$ disjoint sets $\{D_1, D_2, \ldots, D_K\}$.
 4: for iteration $= 1, M$ do
 5:   for $k = 1, K$ do
 6:     Initialize parameters $\theta_k$.
 7:     for $l = 1, L$ do
 8:       Sample a random minibatch of $N$ state-action pairs $(s_n, a_n)$ from $D_k$.
 9:       Calculate a sampled gradient $\frac{1}{N}\sum_{n=1}^{N} \nabla_{\theta_k} \log \pi_{\theta_k}(a_n|s_n) \cdot \hat{R}(s_n, a_n)$.
10:       Update $\theta_k$ by gradient ascent using the sampled gradient.
11:     end for
12:   end for
13:   Copy $\pi_{\theta_{old}} \leftarrow \pi_{\theta}$.
14:   Set $\hat{R}(s, a) = \pi_{\theta_{old}}(a|s)$ for $\forall (s, a) \in D$.
15: end for
16: return $\pi_{\theta}$.
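The following is a minimal, self-contained sketch of Algorithm 1 on a toy problem with one state and three discrete actions. It is not the authors' implementation: the tabular softmax policies, the function names (`train_policy`, `ensemble_prob`, `bc_from_noisy_demos`), the hyperparameters (`steps`, `batch`, `lr`), and the toy data are all assumptions made for illustration, whereas the paper's experiments use neural policies on continuous control tasks.

```python
import math
import random

def softmax(logits):
    """Action probabilities of a tabular softmax policy at one state."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_policy(samples, n_states, n_actions, steps, batch, lr, rng):
    """Steps 6-11: weighted log-likelihood ascent for one ensemble member.

    samples: list of (state, action, r_hat) triples built from one split D_k.
    """
    logits = [[rng.gauss(0.0, 0.1) for _ in range(n_actions)]
              for _ in range(n_states)]
    for _ in range(steps):
        grad = [[0.0] * n_actions for _ in range(n_states)]
        for s, a, w in rng.choices(samples, k=batch):
            probs = softmax(logits[s])
            for b in range(n_actions):
                # d/d logit_b of log pi(a|s) is 1[b == a] - pi(b|s).
                grad[s][b] += w * ((1.0 if b == a else 0.0) - probs[b]) / batch
        for s in range(n_states):
            for b in range(n_actions):
                logits[s][b] += lr * grad[s][b]
    return logits

def ensemble_prob(policies, s, a):
    """pi_theta(a|s): average of the K member probabilities."""
    return sum(softmax(p[s])[a] for p in policies) / len(policies)

def bc_from_noisy_demos(demos, n_states, n_actions, K=5, M=5,
                        steps=200, batch=32, lr=0.5, seed=0):
    rng = random.Random(seed)
    demos = list(demos)
    rng.shuffle(demos)
    r_hat = [1.0] * len(demos)                              # step 2
    splits = [range(k, len(demos), K) for k in range(K)]    # step 3
    policies = []
    for _ in range(M):                                      # step 4
        policies = []
        for idx in splits:                                  # step 5
            samples = [(demos[i][0], demos[i][1], r_hat[i]) for i in idx]
            policies.append(train_policy(samples, n_states, n_actions,
                                         steps, batch, lr, rng))
        # Steps 13-14: the new ensemble becomes pi_theta_old and reweights D.
        r_hat = [ensemble_prob(policies, s, a) for s, a in demos]
    return policies                                         # step 16

# Toy data (hypothetical): one state, action 2 is optimal, eps = 0.3.
demos = [(0, 2)] * 700 + [(0, 0)] * 150 + [(0, 1)] * 150
bc_policies = bc_from_noisy_demos(demos, 1, 3, M=1)   # M = 1: plain BC ensemble
our_policies = bc_from_noisy_demos(demos, 1, 3, M=5)
p_bc = ensemble_prob(bc_policies, 0, 2)
p_ours = ensemble_prob(our_policies, 0, 2)
print(p_bc, p_ours)
```

With a single outer iteration the weights stay at one and the sketch reduces to an ensemble of plain BC policies, which roughly reproduces the empirical frequency of the optimal action. With more outer iterations the weights of the non-optimal actions shrink and the ensemble concentrates on the optimal action.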

5.2. WEIGHTED ACTION SAMPLING FOR π * e MODE SEEKING

Since $\pi_{\theta_{old}}$ satisfies $\int_A \pi_{\theta_{old}}(a|s)\, da = 1$ for $\forall s \in S$, $\pi_{\theta_{old}}$ can be interpreted as the weights for weighted action sampling. Below we explain the weighted action sampling procedure in our algorithm on $S_{e+*}^{\pi_e}$. Figure 1 depicts a toy example of the sampling procedure. The distribution of the noisy expert actions on $S_{e+*}^{\pi_e}$ is a mixture of two distributions, as shown in Equation (7). If $\epsilon$ is sufficiently small, $\pi_{\theta}$ is optimized so that its mode is closer to that of $\pi_e^*$ than to that of $\pi \in \Pi \setminus \{\pi_e^*\}$, according to the mode seeking properties of the forward KL divergence (Ghasemipour et al., 2020). Given the sampling weights $\pi_{\theta_{old}}(a|s) = \pi_{\theta}(a|s)$ for the empirical action samples, the weighted action distribution is distorted so that its mode also gets closer to the mode of $\pi_e^*$. By iteratively distorting the weighted action distribution with the same procedure, its mode settles near the mode of $\pi_e^*$. The weights for actions sampled from $\pi \in \Pi \setminus \{\pi_e^*\}$ eventually become much smaller, and thus the learner will not learn $\pi \in \Pi \setminus \{\pi_e^*\}$. The mode seeking procedure of our algorithm is analogous to the mean shift algorithm (Fukunaga & Hostetler, 1975): the mode of $\pi_{\theta}$ shifts towards that of $\pi_e^*$ by minimizing the KL divergence between $\pi_{\theta}$ and the weighted action distribution.
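The iterative reweighting described above can be sketched at a single state with a one-dimensional action. The example below is an illustration under strong simplifying assumptions that are not from the paper: the learner is a single Gaussian rather than an ensemble of neural policies, it is fitted in closed form by weighted maximum likelihood, and the two modes ($-2$ for the non-optimal policy, $+2$ for the optimal one) and $\epsilon = 0.4$ are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def weighted_gaussian_fit(actions, weights):
    """Maximizer of sum_i w_i * log N(a_i; mean, std): weighted mean and std."""
    total = sum(weights)
    mean = sum(w * a for a, w in zip(actions, weights)) / total
    var = sum(w * (a - mean) ** 2 for a, w in zip(actions, weights)) / total
    return mean, math.sqrt(var)

rng = random.Random(0)
eps = 0.4
# Noisy expert actions at one state: non-optimal mode at -2, optimal mode at +2.
actions = [rng.gauss(-2.0, 0.5) if rng.random() < eps else rng.gauss(2.0, 0.5)
           for _ in range(5000)]

weights = [1.0] * len(actions)   # first iteration: plain BC
means = []
for _ in range(5):
    mean, std = weighted_gaussian_fit(actions, weights)
    means.append(mean)
    # The fitted policy becomes the old policy that reweights the same samples.
    weights = [gaussian_pdf(a, mean, std) for a in actions]

print(means)
```

The first fitted mean sits between the two modes, as plain BC would, and each later iteration moves it closer to the mode of the majority (optimal) component, in the spirit of the mean shift analogy.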

5.3. REWARD MAXIMIZATION

Like the Off-PAC objective, the objective (9) maximizes the expected (one-step) reward $\hat{R}(s, a) = \pi_{\theta_{old}}(a|s)$. Recall that the learner policy $\pi_{\theta}(a|s) = \sum_{k=1}^{K} \pi_{\theta_k}(a|s)/K$ is an ensemble of the parameterized policies in our algorithm. Following the work in (Perrone, 1993), we obtain

$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta_k}(a|s) \cdot \hat{R}(s, a)] \le \mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_{\theta}(a|s) \cdot \hat{R}(s, a)], \tag{10}$$

where we use Jensen's inequality with the concavity of the logarithm: $\frac{1}{K}\sum_{k=1}^{K} \log \pi_{\theta_k}(a|s) \le \log \pi_{\theta}(a|s)$. The inequality (10) indicates that the ensemble of policies $\pi_{\theta_1}, \pi_{\theta_2}, \ldots, \pi_{\theta_K}$, each of which was learned with (8), attains a value of the objective function in (8) greater than or equal to the average of the values attained by the individual policies in the ensemble. As mentioned in 5.2, $\hat{R}(s, a) = \pi_{\theta_{old}}(a|s)$ becomes higher near the mode of $\pi_e^*$. Thus, forming $\pi_{\theta}$ as an ensemble further encourages its mode to shift to that of $\pi_e^*$ and helps avoid learning $\pi \in \Pi \setminus \{\pi_e^*\}$.
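The Jensen step behind the ensemble inequality can be checked numerically for a single state-action pair. The member probabilities below are hypothetical values for $K = 5$ policies, and the function names are illustrative.

```python
import math

def mean_log_prob(probs):
    """(1/K) * sum_k log pi_{theta_k}(a|s)."""
    return sum(math.log(p) for p in probs) / len(probs)

def log_mean_prob(probs):
    """log pi_theta(a|s), where pi_theta is the average of the K members."""
    return math.log(sum(probs) / len(probs))

member_probs = [0.9, 0.6, 0.3, 0.8, 0.1]   # hypothetical pi_{theta_k}(a|s)
lhs = mean_log_prob(member_probs)
rhs = log_mean_prob(member_probs)
print(lhs, rhs)
```

Because the weight $\hat{R}(s, a)$ is non-negative, multiplying both sides by it and taking expectations preserves the direction of the inequality; equality holds only when all members assign the same probability.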

5.4. LIMITATIONS

Our algorithm has three limitations. First, it requires $K \times M$ times the computational cost of BC, where $M$ is the number of iterations in Algorithm 1. Second, the compounding error due to the probability of non-optimal behavior $\zeta$ still remains unless a sufficient amount of demonstrations is given. Lastly, $\pi_{\theta}$ fits $\pi \in \Pi \setminus \{\pi_e^*\}$ rather than $\pi_e^*$ if the major mode of $\epsilon\pi(a|s) + (1 - \epsilon)\pi_e^*(a|s)$ is nearer to the mode of $\pi(a|s)$ than to that of $\pi_e^*$. This may be caused by a higher kurtosis of $\pi(a|s)$ or by large values of $\epsilon$.

6. EXPERIMENTS

In our experiments, we aim to answer the following three questions:
Q1. Does our algorithm improve the learner performance more than BC given the noisy demonstrations?
Q2. Can the compounding error due to $\zeta$ be reduced as the number of noisy demonstrations increases?
Q3. Is our algorithm competitive with the existing IL methods if both annotations associated with the non-optimality and environment interactions are allowed?

6.1. SETUP

To answer Q1 and Q2, we evaluated our algorithm against BC on four continuous control tasks simulated with the MuJoCo physics simulator (Todorov et al., 2012). We train an agent on each task by the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the rewards defined in OpenAI Gym (Brockman et al., 2016). We use the resulting stochastic policy as the true expert policy $\pi_e^*$. We generate the noisy expert demonstrations using $\pi_e^*$ while randomly adopting non-optimal policies $\pi$ with probability of non-optimal behavior $\epsilon$. The non-optimal policies $\pi$ are selected from uniform distributions $a \sim U(-u, u)$, Gaussian distributions $a \sim N(a^*, I)$ with $a^* \sim \pi_e^*(\cdot|s)$, or a deterministic policy $a = 0$, where $u \in \mathbb{R}^{|A|}$ denotes the all-ones vector and $I \in \mathbb{R}^{|A| \times |A|}$ denotes the identity matrix. The values of $\epsilon$ are selected from $\{0.0, 0.1, 0.2, 0.3, 0.4, 0.5\}$. The noisy expert takes actions following $\pi_e^*$ if $z \ge \epsilon$, and otherwise following $\pi$, which is fixed to the selected one throughout an episode, where $z \sim U(0, 1)$. Each noisy demonstration with the selected $\epsilon$ consists of $N$ state-action pairs, where $N$ is selected from $\{5000, 10000, 50000, 100000\}$. Then we run our algorithm as well as BC to train the learners using each noisy demonstration. We also conducted the same experiments on four low-dimensional discrete control tasks (see Appendix A.4).

To answer Q3, we evaluated our algorithm against IC-GAIL (Wu et al., 2019), 2IWIL (Wu et al., 2019), T-REX (Brown et al., 2019), GAIL and DRIL on three continuous control tasks. IC-GAIL, 2IWIL and T-REX require both annotations associated with the non-optimality and environment interactions. GAIL and DRIL require environment interactions for training, but they do not address the noisy demonstration problem. The true expert policy $\pi_e^*$ is obtained in the same way as described above. The non-optimal policies $\pi$ are fixed to $a \sim U(-u, u)$. We generate noisy expert demonstrations consisting of 10000 state-action pairs for each $\epsilon \in \{0.05, 0.1, 0.15, \ldots, 1.0\}$. Then we run our algorithm and the baselines using all noisy demonstrations. The detailed description of this experimental setup can be found in Appendix A.3.

In both experiments, the performance of the learners is measured by the cumulative rewards they earn in an episode. The cumulative reward is normalized with those earned by $\pi_e^*$ and by a random policy $a \sim U(-u, u)$, so that 1.0 and 0.0 indicate the performance of $\pi_e^*$ and the random policy, respectively. We run five experiments on each task and setup, and measure the mean and standard deviation of the normalized cumulative rewards for each learner over the five experiments. In all experiments, we set the number of policies in the ensemble learner policy $\pi_{\theta}$ to $K = 5$ and the number of iterations to $M = 5$. The implementation details of our algorithm can be found in Appendix A.5.
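The data generation rule and the score normalization described above can be sketched as follows. This is a simplified illustration, not the experimental code: the policies are hypothetical callables, the "state" is a dummy step index instead of a simulator observation, and the `is_optimal` flag is recorded only to inspect the sketch (the proposed algorithm never uses such labels).

```python
import random

def collect_noisy_demo(expert_action, bad_action, eps, n_pairs, rng):
    """Record n_pairs (state, action, is_optimal) triples from a noisy expert.

    The expert policy is followed when z >= eps with z ~ U(0, 1); otherwise
    the non-optimal policy is followed.
    """
    demo = []
    for step in range(n_pairs):
        z = rng.random()
        if z >= eps:
            demo.append((step, expert_action(step), True))
        else:
            demo.append((step, bad_action(step, rng), False))
    return demo

def normalized_return(ret, expert_ret, random_ret):
    """Scale a cumulative reward so that 1.0 is the expert and 0.0 is random."""
    return (ret - random_ret) / (expert_ret - random_ret)

rng = random.Random(0)
demo = collect_noisy_demo(lambda s: 1.0,                        # hypothetical expert
                          lambda s, r: r.uniform(-1.0, 1.0),    # a ~ U(-u, u)
                          eps=0.3, n_pairs=10000, rng=rng)
noise_rate = sum(1 for _, _, ok in demo if not ok) / len(demo)
print(noise_rate)
```

The observed fraction of non-optimal actions concentrates around $\epsilon$ as the number of recorded pairs grows.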

6.2. RESULTS

Figure 2 depicts the experimental results against BC. Over all tasks, our algorithm obtains much better learner performance than BC-Single, which is a single (thus not an ensemble) policy learned by BC. This suggests that the policies learned by our algorithm are closer to $\pi_e^*$ than those learned by BC. The compounding error due to $\zeta$ is expected to be reduced as the number of demonstrations increases. Whereas BC-Ensemble, which denotes the ensemble of policies learned by BC, yields significant performance gains over BC-Single, increasing the number of noisy demonstrations has little effect on the performance of the learner trained by BC-Ensemble, as shown in Figure 2-(D). This indicates that BC-Ensemble cannot reduce the compounding error due to $\epsilon$. On the other hand, our algorithm can boost the learner performance up to that of $\pi_e^*$ as the number of demonstrations increases. This suggests that our algorithm can reduce the compounding error due to both $\epsilon$ and $\zeta$ if a sufficient amount of noisy expert demonstrations is given, as is the case for BC with clean expert demonstrations.

The results with the deterministic non-optimal policy $\pi \in \Pi \setminus \{\pi_e^*\}$, which always takes the action $a = 0$, are worse than those with the other non-optimal policies. This corresponds to the limitation of our algorithm mentioned in 5.4, since the major mode of $\epsilon\pi(a|s) + (1 - \epsilon)\pi_e^*(a|s)$ might be around $a = 0$.

We also conducted ablation experiments in which the number of policies $K$ in our algorithm is selected from $\{1, 5\}$. See Appendix A.6 for details. The ablation results show that the learner obtains better performance as $K$ increases. In addition, the performance of the learner trained by our algorithm is significantly better than that of BC-Single even when $K = 1$. This suggests that our algorithm improves the learner performance not only through the ensemble approach but also through the use of the old policies $\pi_{\theta_{old}}$.

Table 1 shows the experimental results against IC-GAIL, 2IWIL, T-REX, GAIL and DRIL. Over all tasks, 2IWIL and our algorithm successfully obtain the true expert performance while the others do not. This suggests that our algorithm can obtain results competitive with those of existing IL methods even though neither annotations nor environment interactions are used.

7. CONCLUSION

In this paper, we proposed an imitation learning algorithm to cope with noisy expert demonstrations. Experimental results showed that our algorithm can learn behavior policies that are much closer to the true expert policies than those learned by BC. Since our algorithm copes well with noisy expert demonstrations while not requiring any environment interactions or annotations associated with the non-optimal demonstrations, it is more applicable to real-world problems than the prior works. Although our algorithm has a few limitations, as mentioned in 5.4, we believe that the analysis of performance deterioration detailed in Section 4 is a step forward in solving the noisy demonstration problem. In future work, we will consider the setting where the probability of non-optimal behavior $\epsilon$ is state-dependent, which occurs in the real world more often than the state-independent case considered in this paper.

A APPENDIX

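A.1 DETAILED DERIVATION OF THEOREM 1

Proof. Let $q_t = (1 - \epsilon)^{t}$ denote the probability that the noisy expert consecutively follows $\pi_e^*$ in the first $t$ steps, and let $\chi = \sum_{t=1}^{T} q_{t-1}$ denote the sum of $q_{t-1}$ over time steps. Then we obtain:

$$J(\pi_e, R) \ge \sum_{t=1}^{T} \big[ q_{t-1} R_t^{\pi_e} + (1 - q_{t-1}) \cdot 0 \big] \tag{11}$$

$$\ge T \left( \frac{1}{T}\sum_{t=1}^{T} q_{t-1} \right) \left( \frac{1}{T}\sum_{t=1}^{T} R_t^{\pi_e} \right) \tag{12}$$

$$= \frac{\chi}{T} \sum_{t=1}^{T} \mathbb{E}_{s \sim d_t^{\pi_e}}\big[ \epsilon\, \mathbb{E}_{\pi \sim p_{\Pi}}[R^{\pi}(s)] + (1 - \epsilon) R^{\pi_e^*}(s) \big]$$

$$= \frac{\chi}{T} \big\{ \epsilon\, \mathbb{E}_{\pi \sim p_{\Pi}}[J_{\pi_e}(\pi, R)] + (1 - \epsilon) J_{\pi_e}(\pi_e^*, R) \big\}$$

$$\ge \frac{\chi}{T} \big\{ \epsilon\, \mathbb{E}_{\pi \sim p_{\Pi}}[J_{\pi_e}(\pi, R)] + (1 - \epsilon)\, \mathbb{E}_{\pi \sim p_{\Pi}}[J_{\pi_e}(\pi, R)] \big\} \tag{13}$$

$$= \frac{1}{T}\sum_{t=0}^{T-1} (1 - \epsilon)^{t} \cdot \mathbb{E}_{\pi \sim p_{\Pi}}[J_{\pi_e}(\pi, R)].$$

The first inequality (11) follows from Assumptions 2 and 3. The second inequality (12) follows from Chebyshev's sum inequality with the monotonically decreasing property in Assumption 4. The third inequality (13) follows from Assumption 1: $J_{\beta}(\pi, R) \le J_{\beta}(\pi_e^*, R)$ for any $\pi, \beta \in \Pi \setminus \{\pi_e^*\}$.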
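The step from (11) to (12) uses Chebyshev's sum inequality, which states that for two sequences sorted in the same order, $\sum_t a_t b_t \ge T \big(\frac{1}{T}\sum_t a_t\big)\big(\frac{1}{T}\sum_t b_t\big)$. A quick numeric check is sketched below; the reward sequence `r` is a hypothetical decreasing sequence chosen for illustration, and `chebyshev_gap` is an illustrative helper name.

```python
def chebyshev_gap(q, r):
    """sum_t q_t * r_t  minus  T * mean(q) * mean(r)."""
    T = len(q)
    lhs = sum(a * b for a, b in zip(q, r))
    rhs = T * (sum(q) / T) * (sum(r) / T)
    return lhs - rhs

eps, T = 0.1, 50
q = [(1.0 - eps) ** t for t in range(T)]    # q_{t-1} = (1 - eps)^(t-1), decreasing
r = [1.0 / (1.0 + t) for t in range(T)]     # hypothetical decreasing rewards
gap = chebyshev_gap(q, r)
print(gap)
```

The gap is non-negative when both sequences decrease together, which is what Assumption 4 provides; if the rewards were increasing instead, the inequality would reverse.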

A.2 DETAILED DERIVATION OF THE KL DIVERGENCES

From the definition of (4), we obtain:

We annotate confidence scores for the noisy demonstrations so that the confidence is one if the demonstrations are obtained with $\epsilon = 0$ and zero otherwise. The confidence scores are used by IC-GAIL as well as 2IWIL. We use publicly available code (footnote 2) for the implementation of both IC-GAIL and 2IWIL.



1 Although Off-PAC multiplies $\log \pi_\theta(a|s)$ by a density ratio $\pi_e(s|a)/\pi_\theta(s|a)$, $\pi_\theta(s|a)$ is empirically approximated to be one in popular off-policy RL algorithms such as DDPG (Lillicrap et al., 2015).
2 https://github.com/kristery/Imitation-Learning-from-Imperfect-Demonstration
3 https://github.com/hiwonjoon/ICML2019-TREX
4 https://github.com/xkianteb/dril



2: Set $R(s, a) = 1$ for $\forall (s, a) \in D$.
3: Split $D$ into $K$ disjoint sets $\{D_1, D_2, ..., D_K\}$.
4: for iteration = 1, M do
5:

s, a) = $\pi_{\theta_{old}}(a|s)$ for $\forall (s, a) \in D$.
15: end for
16: return $\pi_\theta$.
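The visible steps of the algorithm above (initialize $R = 1$, split $D$ into $K$ sets, then repeatedly refit and set $R$ to the old policy's density) can be sketched on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: it uses a single state, one-dimensional Gaussian policies fitted by closed-form weighted maximum likelihood instead of neural networks, and an ensemble density defined as the average of the member densities. All function names and the toy data are hypothetical.

```python
import numpy as np


def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at a."""
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def fit_weighted_gaussian(a, w):
    """Closed-form maximizer of sum_i w_i * log N(a_i; mu, sigma^2)."""
    w = w / w.sum()
    mu = float(np.sum(w * a))
    sigma = float(np.sqrt(np.sum(w * (a - mu) ** 2)))
    return mu, max(sigma, 1e-3)  # floor keeps the density finite


def ensemble_pdf(a, members):
    """Assumption: the ensemble density is the average of member densities."""
    return np.mean([gaussian_pdf(a, mu, s) for mu, s in members], axis=0)


def weighted_bc(actions, K=3, M=5, seed=0):
    rng = np.random.default_rng(seed)
    R = np.ones_like(actions)                                   # R(s, a) = 1
    splits = np.array_split(rng.permutation(len(actions)), K)   # K disjoint sets
    for _ in range(M):
        # Each member maximizes the R-weighted log-likelihood on its own split.
        members = [fit_weighted_gaussian(actions[i], R[i]) for i in splits]
        # Reweight every demonstration by the old ensemble policy's density.
        R = ensemble_pdf(actions, members)
    return members


def ensemble_mean(members):
    return float(np.mean([mu for mu, _ in members]))


# Hypothetical noisy demonstrations: with probability eps the action is
# uniform noise, otherwise it comes from the optimal expert N(1.0, 0.1^2).
data_rng = np.random.default_rng(0)
N, eps = 6000, 0.4
is_noise = data_rng.random(N) < eps
actions = np.where(is_noise, data_rng.uniform(-2, 2, N), data_rng.normal(1.0, 0.1, N))

bc_mean = ensemble_mean(weighted_bc(actions, M=1))    # M=1 reduces to plain BC
ours_mean = ensemble_mean(weighted_bc(actions, M=5))
print(f"BC mean: {bc_mean:.3f}, reweighted mean: {ours_mean:.3f}")
```

With a single iteration the weights stay at one, so the fit is ordinary BC and its mean is pulled toward the noise. Later iterations down-weight actions that the previous ensemble considers unlikely, which moves the fitted mean toward the expert's mode at 1.0 in this toy setting.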

Figure 1: A toy example of the weighted action sampling procedure at each iteration in our algorithm when given a state $s \in S^{\pi_e}_{e+*}$. On both rows, the horizontal lines are the action domains. The left and right dotted lines on the top row describe $\pi \in \Pi \setminus \{\pi^*_e\}$ and $\pi^*_e(a|s)$, respectively. The dotted lines on the bottom row describe the mixture distribution $\pi_e(a|s) = \epsilon \pi(a|s) + (1-\epsilon) \pi^*_e(a|s)$ with $\epsilon = 0.4$. The solid lines on the top row describe $\pi_\theta(a|s)$ that are optimized with objective (8) at each iteration. The solid lines on the bottom row describe the distributions that draw actions, which were already drawn by $\pi_e(a|s)$ in the noisy demonstrations, according to the current importance weight $\pi_\theta(a|s)$ at each iteration. $\pi_\theta(a|s)$ is optimized at each iteration so that the weighted distribution at the previous iteration is the target distribution.

Figure 2: (A)-(C) The performance of policies vs. $\epsilon$ given 50000 state-action pairs of the noisy expert demonstrations, where the non-optimal policies $\pi \in \Pi \setminus \{\pi^*_e\}$ are (A) $U(-u, u)$, (B) $N(a^*, I)$ with $a^* \sim \pi^*_e(\cdot|s)$, and (C) the deterministic one $a = 0$, respectively. (D) The performance of policies vs. the number of state-action pairs $N$ of the noisy demonstrations with $\epsilon = 0.3$, where $\pi(a|s) = U(-u, u)$. BC-Single is a policy learned by BC. BC-Ensemble is an ensemble of policies, each of which was learned by BC. Shaded regions indicate the standard deviation over five experiments.

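$$
\begin{aligned}
\mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] &= \alpha\, \mathbb{E}_{s \sim d^{\pi_e}_{e},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] && (14) \\
&\quad + \beta\, \mathbb{E}_{s \sim d^{\pi_e}_{e+*},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] && (15)
\end{aligned}
$$

The forward Kullback-Leibler (KL) divergence $D_{KL}$ between $\pi_e$ and $\pi_\theta$ over a state distribution $d^{\pi_e}$ is defined as

$$
\mathbb{E}_{s \sim d^{\pi_e}}[D_{KL}(\pi_e(\cdot|s) \| \pi_\theta(\cdot|s))] = -\mathbb{E}_{s \sim d^{\pi_e}}\left[ \mathbb{E}_{a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] + H[\pi_e(\cdot|s)] \right],
$$

where $H$ denotes the entropy. Since $H[\pi_e(\cdot|s)]$ always takes a positive value and does not depend on $\theta$, we obtain the inequality

$$
\mathbb{E}_{s \sim d^{\pi_e},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \leq -\mathbb{E}_{s \sim d^{\pi_e}}[D_{KL}(\pi_e(\cdot|s) \| \pi_\theta(\cdot|s))].
$$

The same holds for (14):

$$
\alpha\, \mathbb{E}_{s \sim d^{\pi_e}_{e},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] \leq -\alpha\, \mathbb{E}_{s \sim d^{\pi_e}_{e}}[D_{KL}(\pi_e(\cdot|s) \| \pi_\theta(\cdot|s))]. \quad (16)
$$

Since $\pi_e$ adopts both $\pi^*_e$ and $\pi \in \Pi \setminus \{\pi^*_e\}$ following the probability $\epsilon$, the term (15) can be expanded as:

$$
\begin{aligned}
\beta\, \mathbb{E}_{s \sim d^{\pi_e}_{e+*},\, a \sim \pi_e(\cdot|s)}[\log \pi_\theta(a|s)] &= \beta\, \mathbb{E}_{s \sim d^{\pi_e}_{e+*}}\left[ \epsilon\, \mathbb{E}_{\pi \sim p_\Pi,\, a \sim \pi(\cdot|s)}[\log \pi_\theta(a|s)] + (1-\epsilon)\, \mathbb{E}_{a \sim \pi^*_e(\cdot|s)}[\log \pi_\theta(a|s)] \right] \\
&\leq -\beta \left\{ \epsilon\, \mathbb{E}_{s \sim d^{\pi_e}_{e+*},\, \pi \sim p_\Pi}[D_{KL}(\pi(\cdot|s) \| \pi_\theta(\cdot|s))] + (1-\epsilon)\, \mathbb{E}_{s \sim d^{\pi_e}_{e+*}}[D_{KL}(\pi^*_e(\cdot|s) \| \pi_\theta(\cdot|s))] \right\} && (17)
\end{aligned}
$$

A.3 DETAILED DESCRIPTION OF THE EXPERIMENTAL SETUP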

for the states $s \in S^{\pi_e}_{e+*}$. Note that the noisy expert adopts $\pi \in \Pi \setminus \{\pi^*_e\}$ with a probability $\epsilon$ at the states $s \in S^{\pi_e}_{e+*}$. Let $d^{\pi_e}_{e}$ and $d^{\pi_e}_{e+*}$ be the state distributions the noisy expert policy induces in $S^{\pi_e}_{e}$ and $S^{\pi_e}_{e+*}$, respectively. Then we can define $d^{\pi_e}$ as a mixture of those distributions as $d^{\pi_e}(s) = \alpha d^{\pi_e}_{e}(s) + \beta d^{\pi_e}_{e+*}(s)$,

The experimental results against IL methods that require the environment interactions.

[Figure 3 panel titles: CartPole-v1, Acrobot-v1, MountainCar-v0, LunarLander-v2]

For IC-GAIL and 2IWIL, we follow the training procedure described in Section 5 of (Wu et al., 2019).

We annotate rankings for the noisy demonstrations so that a smaller $\epsilon$ corresponds to a higher ranking. Then, we train the learner by T-REX given the ranked demonstration data. We use publicly available code (footnote 3) for the implementation of T-REX.

For training the learner with GAIL and DRIL, we use all noisy demonstrations without any screening process. We use publicly available code (footnote 4) for the implementation of GAIL and DRIL.

A.4 EXPERIMENTAL RESULTS ON DISCRETE CONTROL TASKS

Figure 3 shows the experimental results on four discrete control tasks. On all tasks, our algorithm obtains much better results than BC.

A.5 IMPLEMENTATION DETAILS OF OUR ALGORITHM

We implement our algorithm using $K$ neural networks with two hidden layers to represent the policies $\pi_{\theta_1}, \pi_{\theta_2}, ..., \pi_{\theta_K}$ in the ensemble. The input of each network is a vector representation of the state. Each neural network has 100 hidden units in each hidden layer followed by a hyperbolic tangent nonlinearity, and the dimensionality of its final output corresponds to that of the action space. In the discrete control tasks, the final output is followed by a softmax function. In the continuous control tasks, the final output represents the mean of a Gaussian policy as [...] $\theta_k$ is implemented as a trainable vector independent of the networks. The neural network architecture for the policy trained by BC is the same as that of a single policy in our algorithm. We employ Adam (Kingma & Ba, 2014) for learning the parameters with a learning rate of $\eta \cdot 10^{-4}$, where $\eta$ is a scaling parameter. The parameter $\eta$ scales $R = \pi_{\theta_{old}}(a|s)$ so that training does not slow down when $\pi_{\theta_{old}}(a|s)$ takes small values. The parameters in all layers are initialized by Xavier initialization (Glorot & Bengio, 2010). The mini-batch size and the number of training epochs are 128 and 500, respectively.
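The architecture described above can be sketched as a plain NumPy forward pass. This is a minimal sketch under assumptions rather than the authors' code: the state and action dimensions are hypothetical, biases are initialized to zero (the text only states that Xavier initialization is used), the uniform variant of Xavier initialization is assumed, and training with Adam is omitted.

```python
import numpy as np


def xavier_uniform(rng, fan_in, fan_out):
    """Xavier/Glorot uniform initialization for a (fan_in, fan_out) weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def init_policy(rng, state_dim, action_dim, hidden=100):
    """Two hidden layers of 100 units each; biases start at zero (assumption)."""
    sizes = [state_dim, hidden, hidden, action_dim]
    return [(xavier_uniform(rng, m, n), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, states, discrete=True):
    """Return action probabilities (discrete) or the Gaussian mean (continuous)."""
    h = states
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)              # tanh after each hidden layer
    W, b = params[-1]
    out = h @ W + b
    if not discrete:
        return out                           # mean of the Gaussian policy
    z = out - out.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


# Hypothetical sizes: 4-dimensional states, 2 discrete actions, K = 5 members.
rng = np.random.default_rng(0)
K, state_dim, action_dim = 5, 4, 2
ensemble = [init_policy(rng, state_dim, action_dim) for _ in range(K)]

batch = rng.normal(size=(128, state_dim))    # one mini-batch of 128 states
probs = forward(ensemble[0], batch)
print(probs.shape)
```

Each ensemble member is an independent copy of the same network, so the BC baseline with a single policy corresponds to using one element of the list.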

A.6 ABLATION EXPERIMENTS

We conducted ablation experiments to evaluate how the number of policies $K$ in the ensemble policy $\pi_\theta$ and the number of policies $K_{old}$ used in the old ensemble policy $\pi_{\theta_{old}}$ affect the performance. Table 2 summarizes the results. Even when our algorithm uses $K = 1$, as BC-Single does, its results are better than those of BC. This indicates that the weighted action sampling described in Section 5.2 helps to avoid learning the non-optimal policies without relying on the ensemble approach. The same holds for $K = 5$: our algorithm with $K = 5$ and $K_{old} = 1$ obtains much better performance than BC-Ensemble with $K = 5$. This result also supports the claim that the weighted action sampling works. The learner performance with fixed $K$ increases as $K_{old}$ increases. Similarly, the learner performance with fixed $K_{old}$ increases as $K$ increases. This suggests that both $K$ and $K_{old}$ affect the performance of our algorithm.

Table 2: The performance of policies in the ablation experiment. The number of state-action pairs of the noisy expert demonstrations is $N = 50000$. The non-optimal policies $\pi \in \Pi \setminus \{\pi^*_e\}$ are $U(0, I)$. BC-Single is a policy learned by BC. BC-Ensemble is an ensemble of five policies, each of which was learned by BC. $K$ denotes the number of policies in the ensemble policy $\pi_\theta$. $K_{old}$ denotes the number of policies used in the old ensemble policy $\pi_{\theta_{old}}$. The mean and standard deviation of the normalized cumulative rewards over three experiments are reported.

[Table 2 column headers: Ant-v2, HalfCheetah]

