BEHAVIORAL CLONING FROM NOISY DEMONSTRATIONS

Abstract

We consider the problem of learning an optimal expert behavior policy from noisy demonstrations that contain observations of both optimal and non-optimal expert behaviors. Popular imitation learning algorithms, such as generative adversarial imitation learning, assume that (clean) demonstrations are drawn only from optimal expert policies, and thus often fail to imitate the optimal expert behaviors when given noisy demonstrations. Prior works that address this problem require (1) learning policies through environment interactions in the same fashion as reinforcement learning, and (2) annotating each demonstration with confidence scores or rankings. However, such environment interactions and annotations in real-world settings take impractically long training times and significant human effort. In this paper, we propose an imitation learning algorithm that addresses the problem without any environment interactions or annotations associated with the non-optimal demonstrations. The proposed algorithm learns ensemble policies with a generalized behavioral cloning (BC) objective function that exploits another policy already learned by BC. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than those learned by BC.

1. INTRODUCTION

Imitation learning (IL) has become a widely used approach to obtaining autonomous robotic control systems. IL is often more applicable to real-world problems than reinforcement learning (RL), since collecting expert demonstrations is often easier than designing the appropriate rewards that RL requires. Several IL methods involve RL (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004; Ho & Ermon, 2016). These methods inherit the sample complexity of RL in terms of environment interactions during training. This complexity restricts their applicability to real-world problems, since large numbers of environment interactions in real-world settings often take a long time and can cause damage to the robot or the environment. We are therefore interested in IL methods that do not require environment interactions, such as behavioral cloning (BC) (Pomerleau, 1991), which learns an expert policy in a supervised fashion. BC, as well as popular IL methods such as generative adversarial imitation learning (GAIL) (Ho & Ermon, 2016), assumes that the expert demonstrations are optimal. Unfortunately, it is often difficult to obtain optimal demonstrations for many real-world tasks, because the expert operating the robot often makes mistakes for various reasons, such as the difficulty of the task, difficulty in handling the controller, limited observability of the environment, or the presence of distractions. These mistakes include unnecessary and/or incorrect operations. Given such noisy expert demonstrations, which contain records of both optimal and non-optimal behavior, BC and the popular IL methods fail to imitate the optimal policy due to the optimality assumption on the demonstrations, as shown in (Wu et al., 2019). A naive way to cope with noisy demonstrations is to discard the non-optimal demonstrations among those already collected.
This screening process is often impractical because it involves significant human effort. Most recent IL works assume settings where a very limited number of clean expert demonstrations, composed only of optimal behavior records, are available. Those methods are also vulnerable to noisy demonstrations due to the optimality assumption, and thus implicitly presuppose such an impractical screening process when applied to real-world problems, where many noisy demonstrations beyond the clean ones can easily be obtained. There have been IL methods addressing noisy demonstrations. Instead of the screening process, they require annotating each demonstration with confidence scores (Wu et al., 2019) or rankings (Brown et al., 2019). Even though they cope well with noisy demonstrations and obtain the optimal behavior policies, such annotation costs significant human effort, as screening does. Hence, we desire IL methods that can cope well with noisy demonstrations, which are easily obtained in real-world settings, without any screening or annotation processes associated with the non-optimal behaviors. In this paper, we propose a novel imitation learning algorithm that addresses noisy demonstrations. The proposed algorithm does not require (1) any environment interactions during training, or (2) any screening or annotation processes associated with the non-optimality of the expert behaviors. Our algorithm learns ensemble policies with a generalized BC objective function that exploits another policy already learned by BC. Experimental results show that the proposed algorithm can learn policies that are much closer to optimal than those learned by BC.
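As background for the discussion above, BC reduces imitation to supervised learning on the demonstrated state-action pairs: the policy is fit by maximum likelihood, with no environment interactions. The following is a minimal numpy sketch of that idea for a linear softmax policy on toy data; the network form, data, and hyperparameters are illustrative assumptions, not the setup used in this paper.

```python
import numpy as np

def behavioral_cloning(states, actions, n_actions, lr=0.1, epochs=200):
    """Fit a linear softmax policy pi(a|s) by maximum likelihood on
    demonstrated (state, action) pairs -- the standard BC objective."""
    n, d = states.shape
    W = np.zeros((d, n_actions))
    onehot = np.eye(n_actions)[actions]            # (n, n_actions)
    for _ in range(epochs):
        logits = states @ W                        # (n, n_actions)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = states.T @ (probs - onehot) / n     # gradient of mean NLL
        W -= lr * grad
    return W

# Toy "demonstrations": the expert takes action 1 when the first state
# feature is positive, action 0 otherwise.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 3))
A = (S[:, 0] > 0).astype(int)
W = behavioral_cloning(S, A, n_actions=2)
pred = (S @ W).argmax(axis=1)
print((pred == A).mean())  # high accuracy on this separable toy data
```

Under the optimality assumption this works well; if the pairs in `S`, `A` mixed optimal and non-optimal actions, the maximum-likelihood fit would average over both, which is the failure mode this paper targets.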

2. RELATED WORKS

A wide variety of IL methods have been proposed over the last few decades. BC (Pomerleau, 1991) is the simplest IL method among them, and thus is often the first IL option when enough clean demonstrations are available. Ross & Bagnell (2010) theoretically pointed out a downside of BC referred to as compounding error: the small errors of learners trained by BC can compound over time and degrade their performance. On the other hand, experimental results in (Sasaki et al., 2018) show that, given a sufficient amount of clean demonstrations, BC can easily obtain the optimal behavior even for complex continuous control tasks. Hence, the effect of compounding error is negligible in practice if the amount of clean demonstrations is sufficient. However, even with a large amount of demonstrations, BC cannot obtain the optimal policy from noisy demonstrations due to the optimality assumption. Other widely used IL approaches are inverse reinforcement learning (IRL) (Ziebart et al., 2008; Ng et al., 2000; Abbeel & Ng, 2004) and adversarial imitation learning (AIL) (Ho & Ermon, 2016). Since these approaches also assume the optimality of the demonstrations, they are likewise unable to obtain the optimal policy given noisy demonstrations, as shown in (Wu et al., 2019). As we will show in Section 6, our algorithm can successfully learn near-optimal policies if a sufficient amount of noisy demonstrations is given. Several works address noisy demonstrations (Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019; Kaiser et al., 1995; Grollman & Billard, 2012; Kim et al., 2013).
Those works address noisy demonstrations by either screening the non-optimal demonstrations with heuristic non-optimality assessments (Kaiser et al., 1995), annotating the demonstrations with their non-optimality (Wu et al., 2019; Brown et al., 2019; Grollman & Billard, 2012), or training through environment interactions (Kim et al., 2013; Wu et al., 2019; Brown et al., 2019; Tangkaratt et al., 2019). Our algorithm requires no screening processes, no annotations associated with non-optimality, and no environment interactions during training. Offline RL methods (Lange et al., 2012; Fujimoto et al., 2019; Kumar et al., 2020) train learner agents without any environment interactions and allow the training dataset to contain non-optimal trajectories, as in our problem setting. A drawback of offline RL methods for real-world applications is the requirement to design reward functions, since those methods assume that the reward for each state-action pair is known; designing such functions often involves significant human effort. Our algorithm, like standard IL methods, does not require designing reward functions. Disagreement-regularized imitation learning (DRIL) (Brantley et al., 2019) is a state-of-the-art IL algorithm that, like our algorithm, employs an ensemble of policies. However, the aim of the ensemble differs between DRIL and our algorithm. DRIL uses the disagreement among the predictions of the ensemble's policies to evaluate whether the states observed while training the learner were observed in the expert demonstrations. In contrast, our algorithm uses the ensemble to encourage the learner to take optimal actions in each state, as described in Section 5.3. In addition, DRIL fundamentally requires environment interactions during training, whereas our algorithm does not.
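To make the ensemble-disagreement idea concrete, the sketch below shows how prediction variance across ensemble members can flag states that lie far from the demonstration distribution, which is the kind of signal DRIL regularizes against. This is our illustrative toy construction, not DRIL's actual implementation: the linear "policies" with perturbed weights merely stand in for members trained on different subsets of the demonstrations.

```python
import numpy as np

def ensemble_disagreement(policies, state):
    """Total variance of action predictions across an ensemble of policies.
    High disagreement suggests the state lies outside the distribution the
    ensemble members were trained on."""
    actions = np.array([pi(state) for pi in policies])  # (n_members, a_dim)
    return actions.var(axis=0).sum()

# Illustrative ensemble: linear policies whose weights differ slightly,
# as if each member were trained on a different bootstrap of the demos.
rng = np.random.default_rng(1)
members = [(lambda s, w=rng.normal(1.0, 0.05, 4): w * s) for _ in range(5)]

in_dist = np.ones(4)          # a state similar to the demonstrations
far_out = 100.0 * np.ones(4)  # a state far from the demonstrations
print(ensemble_disagreement(members, in_dist) <
      ensemble_disagreement(members, far_out))  # True
```

Because the members' weight perturbations are amplified by large inputs, disagreement grows with distance from the training data here; DRIL turns such a disagreement score into a cost that discourages the learner from drifting to those states.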

