DEEP Q LEARNING FROM DYNAMIC DEMONSTRATION WITH BEHAVIORAL CLONING

Abstract

Although Deep Reinforcement Learning (DRL) has proven its capability to learn optimal policies by directly interacting with simulation environments, effectively combining DRL with supervised learning and leveraging additional knowledge to assist the DRL agent remains difficult. This study proposes a novel approach that integrates deep Q learning from dynamic demonstrations with a behavioral cloning model (DQfDD-BC), using a supervised learning technique to instruct a DRL model and enhance its performance. Specifically, the DQfDD-BC model leverages historical demonstrations to pre-train a supervised BC model and continually updates it by learning from the dynamically updated demonstrations. The DQfDD-BC model then manages sample complexity by exploiting both the historical and the generated demonstrations. An expert loss function is designed to compare actions produced by the DRL model with those obtained from the BC model, providing advantageous guidance for policy improvement. Experimental results in several OpenAI Gym environments show that the proposed approach adapts to demonstrations of different performance levels and, meanwhile, accelerates the learning process. As illustrated in an ablation study, the dynamic demonstration and expert loss mechanisms, together with the BC model, improve learning convergence compared with the original DQfD model.

1. INTRODUCTION

Deep reinforcement learning (DRL) methods have made great progress (Mnih et al., 2013; 2015; Silver et al., 2017) in several rule-based applications such as the game of Go (Silver et al., 2016). However, due to the diversity and uncertainty of complex systems, it is difficult to build a simulation environment that is consistent with the real-world system, so DRL algorithms usually fail when applied directly to many real-world scenarios. Moreover, a DRL model may produce actions sampled from a random policy while exploring the state-action space, yet random actions are not allowed in many real-world circumstances. For example, in autonomous driving experiments (Kiran et al., 2020), a random policy may cause traffic congestion or even road accidents. Adapting to complex situations therefore becomes one of the most urgent tasks when applying a DRL model to complicated decision-making problems. It is noted that human experts have great advantages in learning efficiency and decision-making performance (Tsividis et al., 2017), and incorporating expert knowledge is a potential way to enhance the adaptability of DRL models for complex tasks (Hester et al., 2018; Matas et al., 2018). Nevertheless, the knowledge and experience of an expert are difficult to model and describe directly. One solution, attracting more and more attention, is to learn expert strategies indirectly from their decision trajectories, also known as demonstrations (Schaal, 1997; Behbahani et al., 2019; Ravichandar et al., 2020). In particular, deep Q learning from demonstrations (DQfD) is a representative algorithm that successfully combines DRL with demonstrations (Hester et al., 2018): it constructs a hybrid loss function that combines the temporal difference (TD) error of the traditional DDQN algorithm with a supervised expert loss.
Through a specially designed large-margin supervised loss function (Piot et al., 2014a; b), the DQfD method guides the agent toward the expert's knowledge by constantly steering the learned policy closer to the one represented by the demonstrations. However, the DQfD model suffers from three major issues. (1) In the DQfD learning process, the trajectory data in the historical demonstration dataset is the only source of expert loss values; the self-generated transitions of the trained agent contribute none. As a result, whenever self-generated transitions are sampled from the experience replay buffer, DQfD relies solely on TD errors to improve the policy and the demonstrations sit idle, which reduces the efficiency of utilizing demonstrations. (2) Under this learning mechanism, static demonstrations are too limited to cover sufficient state-action space during agent training, especially when collecting demonstrations is difficult or expensive in real-world applications. Moreover, as more newly generated transitions are added to the experience replay buffer, the historical demonstrations contribute less and less to policy improvement because their sampling probability keeps decreasing. (3) The DQfD algorithm requires the learned policy to approximate the demonstrations but ignores their imperfection, and imperfect demonstrations are common in real-world applications. A perfect demonstration always provides appropriate guidance, but an imperfect one is detrimental to policy improvement once the learned policy surpasses the policy the demonstration represents. To address these issues, we propose a novel deep Q learning from dynamic demonstrations with behavioral cloning (DQfDD-BC) method. Fig. 1 illustrates the structure of the proposed approach.
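The large-margin supervised loss referenced above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it computes J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where the margin function l(a_E, a) is zero for the expert action and a constant positive margin otherwise; the margin value 0.8 and the function name are assumptions for the sketch.

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin supervised loss for one demonstration state.

    J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E),
    where l(a_E, a) = 0 if a == a_E and `margin` otherwise.
    The loss is zero only when the expert action's Q value exceeds
    every other action's Q value by at least `margin`.
    """
    penalties = np.full_like(q_values, margin)  # margin added to non-expert actions
    penalties[expert_action] = 0.0              # no penalty on the expert action
    return np.max(q_values + penalties) - q_values[expert_action]
```

Because the loss stays positive unless the expert action dominates by the margin, minimizing it pushes the Q function to prefer demonstrated actions while leaving room for TD updates on the remaining actions.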
DQfDD-BC shares the basic components of the DDQN algorithm (Hasselt et al., 2016), including two Q networks and an experience replay buffer, and adds an imitation learning (IL) (Hussein et al., 2017) module and an updatable demonstration set in the replay buffer. In addition, the loss value is calculated from the outputs of both the trained Q network and the BC model. Two main contributions are summarized as follows. (1) An IL model, namely behavioral cloning (BC) (Torabi et al., 2018; Bühler et al., 2020), is introduced to generate the expert loss so that all transitions in the experience replay buffer are utilized. The BC model first extracts the experts' policy from the initial demonstrations, which allows it to provide reasonable actions for newly generated states. During self-learning, the agent's actions are compared with those generated by the BC model through a purpose-built expert loss function. Including the BC model allows the knowledge in the demonstrations to be fully exploited during training and enables the model to cope with states that the experts never encountered. The supervised learning process and the self-learning process promote each other: the supervised model provides a basic reference to guide model adjustment, while the self-learning process keeps improving and overcomes the limitations of the BC model and of suboptimal samples. (2) An automatic update mechanism is proposed to adaptively enhance the BC model. In particular, transitions newly generated by the trained agent are used to fine-tune the BC model once the agent achieves a relatively high performance score. This mechanism incorporates more high-quality transition samples to improve the demonstrations and avoids the potential adverse impacts of imperfect demonstrations. In this study, we evaluate the proposed DQfDD-BC method in several Gym environments (Brockman et al., 2016).
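The two contributions above can be sketched together. In this hedged illustration (all names, the toy linear BC policy, and the score-threshold rule are assumptions, not the paper's implementation), the expert loss takes the BC model's action as its reference instead of a stored demonstration action, so every sampled transition contributes an expert loss, and an episode's transitions are added to the demonstration set only when the agent's score is high enough:

```python
import numpy as np

def bc_policy(state, weights):
    """Toy linear behavioral-cloning policy: pick argmax of W @ state.
    Stands in for the trained BC network."""
    return int(np.argmax(weights @ state))

def expert_loss(q_values, bc_action, margin=0.8):
    """Large-margin expert loss using the BC model's action as the
    reference, so self-generated transitions (with no demonstrated
    action attached) also produce an expert loss signal."""
    penalties = np.full_like(q_values, margin)
    penalties[bc_action] = 0.0
    return np.max(q_values + penalties) - q_values[bc_action]

def maybe_update_demonstrations(demo_buffer, episode, score, threshold):
    """Dynamic-demonstration update: keep an episode's transitions for
    BC fine-tuning only if the agent scored above the threshold,
    filtering out low-quality (imperfect) trajectories."""
    if score >= threshold:
        demo_buffer.extend(episode)
    return demo_buffer
```

The key design choice illustrated here is that demonstration knowledge enters the loss through the BC model's predicted action rather than through the replay sample itself, which is what lets the mechanism cover states the experts never visited.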
For comparison, the DDQN and DQfD methods are used as baselines (Hasselt et al., 2016; Hester et al., 2018). The experiments clearly demonstrate that DQfDD-BC surpasses all baselines in both convergence speed and decision-making performance. The ablation experiments also show that the proposed expert loss function together with the BC model, as well as the dynamic demonstration mechanism, contribute significantly to the superior performance of the DQfDD-BC algorithm.



Figure 1: The technical framework of the proposed DQfDD-BC method, where the IL model represents an imitation learning model.

