DEEP Q LEARNING FROM DYNAMIC DEMONSTRATION WITH BEHAVIORAL CLONING

Abstract

Although Deep Reinforcement Learning (DRL) has proven its capability to learn optimal policies by directly interacting with simulation environments, effectively combining DRL with supervised learning to leverage additional knowledge for the DRL agent remains difficult. This study proposes a novel approach integrating deep Q learning from dynamic demonstrations with a behavioral cloning model (DQfDD-BC), which uses a supervised learning technique to instruct a DRL model and enhance its performance. Specifically, the DQfDD-BC model leverages historical demonstrations to pre-train a supervised BC model and continually updates it by learning from the dynamically updated demonstrations. The DQfDD-BC model then manages the sample complexity by exploiting both the historical and generated demonstrations. An expert loss function is designed to compare actions generated by the DRL model with those obtained from the BC model, providing advantageous guidance for policy improvement. Experimental results in several OpenAI Gym environments show that the proposed approach adapts to different performance levels of demonstrations while accelerating the learning process. As illustrated in an ablation study, the dynamic demonstration and expert loss mechanisms, together with the BC model, improve learning convergence compared with the original DQfD model.

1. INTRODUCTION

Deep reinforcement learning (DRL) methods have made great progress (Mnih et al., 2013; 2015; Silver et al., 2017) when applied to rule-based applications such as the game of Go (Silver et al., 2016). However, due to the diversity and uncertainty of complex systems, it is difficult to build a simulation environment that is consistent with the real-world system. Therefore, DRL algorithms usually fail when applied directly to many real-world scenarios. Moreover, a DRL model may produce an action sampled from a random policy when exploring the state-action space, yet random actions are not permissible in many real-world circumstances. For example, in autonomous driving experiments (Kiran et al., 2020), a random policy may cause traffic congestion or even road accidents. Adapting to complex situations therefore becomes one of the most urgent tasks when applying a DRL model to complicated decision-making problems. It is noted that human experts have great advantages in learning efficiency and decision-making performance (Tsividis et al., 2017). Incorporating expert knowledge is a potential solution to enhance the adaptability of DRL models for complex tasks (Hester et al., 2018; Matas et al., 2018). Nevertheless, the knowledge and experience of an expert are difficult to model and describe directly. One solution, attracting increasing attention, is to indirectly learn expert strategies from their decision trajectories, also known as demonstrations (Schaal, 1997; Behbahani et al., 2019; Ravichandar et al., 2020). In particular, deep Q learning from demonstrations (DQfD) is a representative algorithm that successfully combines DRL with demonstrations (Hester et al., 2018); it combines the temporal difference (TD) error of the traditional DDQN algorithm with a supervised expert loss by constructing a hybrid loss function.
Through a specially designed large-margin supervised loss function (Piot et al., 2014a; b), the DQfD method can guide and assist an agent in learning the expert's knowledge by constantly steering the agent's learned strategies closer to those represented by the demonstrations.
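To make the large-margin supervised loss concrete, the following is a minimal NumPy sketch of the per-state expert loss J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E) used in DQfD, where l(a_E, a) is a positive margin for all actions other than the expert action a_E. The function name and the margin value are illustrative, not part of the original formulation.

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin supervised loss in the style of DQfD (Hester et al., 2018).

    Computes max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), where the margin term
    l(a_E, a) equals `margin` for every action a != a_E and 0 for a_E.
    The loss is zero only when the expert action's Q-value exceeds all
    other actions' Q-values by at least `margin`, steering the learned
    policy toward the demonstrated behavior.
    """
    q_values = np.asarray(q_values, dtype=float)
    penalties = np.full_like(q_values, margin)
    penalties[expert_action] = 0.0  # no margin penalty for the expert action
    return np.max(q_values + penalties) - q_values[expert_action]
```

For instance, with Q-values [1.0, 3.0, 0.5] and expert action 1, the expert action already dominates by more than the margin, so the loss is zero; with expert action 0 instead, the loss is positive, pushing Q(s, 0) upward during training.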

