REPAINT: KNOWLEDGE TRANSFER IN DEEP ACTOR-CRITIC REINFORCEMENT LEARNING

Abstract

Accelerating the learning process for complex tasks by leveraging previously learned tasks has been one of the most challenging problems in reinforcement learning, especially when the similarity between source and target tasks is low or unknown. In this work, we propose a REPresentation-And-INstance Transfer algorithm (REPAINT) for the deep actor-critic reinforcement learning paradigm. In representation transfer, we adopt a kickstarted training method using a pre-trained teacher policy by introducing an auxiliary cross-entropy loss. In instance transfer, we develop a sampling approach, i.e., advantage-based experience replay, on transitions collected following the teacher policy, where only the samples with high advantage estimates are retained for policy updates. We consider both learning an unseen target task by transferring from previously learned teacher tasks, and learning a partially unseen task composed of multiple sub-tasks by transferring from a pre-learned teacher sub-task. In several benchmark experiments, REPAINT significantly reduces the total training time and improves the asymptotic performance compared to training with no prior knowledge and other baselines.

1. INTRODUCTION

Most reinforcement learning methods train an agent from scratch, typically requiring a huge amount of time and computing resources. Accelerating the learning process for complex tasks has been one of the most challenging problems in reinforcement learning (Kaelbling et al., 1996; Sutton & Barto, 2018). In the past few years, deep reinforcement learning has become more ubiquitous for solving sequential decision-making problems in many real-world applications, such as game playing (OpenAI et al., 2019; Silver et al., 2016), robotics (Kober et al., 2013; OpenAI et al., 2018), and autonomous driving (Sallab et al., 2017). The computational cost of learning grows as the task complexity increases in real-world applications. Therefore, it is desirable for a learning algorithm to leverage knowledge acquired in one task to improve performance on other tasks. Transfer learning has achieved significant success in computer vision, natural language processing, and other knowledge engineering areas (Pan & Yang, 2009). In transfer learning, the teacher (source) and student (target) tasks are not necessarily drawn from the same distribution (Taylor et al., 2008a). The unseen student task may be a simple task that is similar to the previously trained tasks, or a complex task with traits borrowed from significantly different teacher tasks. Despite the prevalence of direct weight transfer, knowledge transfer from previously trained agents for reinforcement learning tasks did not gain much attention until recently (Barreto et al., 2019; Ma et al., 2018; Schmitt et al., 2018; Lazaric, 2012; Taylor & Stone, 2009). In this work, we propose a knowledge transfer algorithm for deep actor-critic reinforcement learning, i.e., REPresentation-And-INstance Transfer (REPAINT). The algorithm can be categorized as a representation-instance transfer approach.
Specifically, in representation transfer, we adopt a kickstarted training method (Schmitt et al., 2018) using a previously trained teacher policy, where the teacher policy is used to compute an auxiliary loss during training. In instance transfer, we develop a new sampling algorithm for the replay buffer collected from the teacher policy, where we only keep the transitions whose advantage estimates are greater than a threshold. The experimental results across several transfer learning tasks show that, regardless of the similarity between source and target tasks, introducing knowledge transfer with REPAINT significantly reduces the number of training iterations the agent needs to reach a given reward target, compared to training from scratch and training with only representation transfer or only instance transfer. Additionally, REPAINT also improves the agent's asymptotic performance in comparison with the baselines.
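The two ingredients above can be sketched in a few lines of code. This is a minimal illustration, not the paper's implementation: the function names, the dictionary-based transition format, and the use of plain probability lists (rather than network outputs) are our own simplifying assumptions.

```python
import math

def cross_entropy_aux_loss(student_probs, teacher_probs):
    """Auxiliary cross-entropy H(pi_teacher, pi_student), averaged over states.

    Each argument is a list of per-state action distributions. Minimizing
    this loss pushes the student policy toward the teacher's behavior,
    as in kickstarted training.
    """
    total = 0.0
    for p_teacher, p_student in zip(teacher_probs, student_probs):
        total += -sum(pt * math.log(ps)
                      for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return total / len(student_probs)

def filter_by_advantage(transitions, threshold):
    """Advantage-based experience replay: keep only the teacher-collected
    transitions whose advantage estimate exceeds the threshold."""
    return [t for t in transitions if t["advantage"] > threshold]
```

For example, a buffer `[{"advantage": 0.5, ...}, {"advantage": -0.2, ...}]` filtered with threshold 0.0 retains only the first transition; the discarded low-advantage samples never contribute to the policy update.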

2. RELATED WORK: TRANSFER REINFORCEMENT LEARNING

Transfer learning algorithms in reinforcement learning can be characterized by the definition of the transferred knowledge, which may consist of the parameters of the reinforcement learning algorithm, the representation of the trained policy, or the instances collected from the environment (Lazaric, 2012). When the teacher and student tasks share the same state-action space and are similar enough (Ferns et al., 2004; Phillips, 2006), parameter transfer is the most straightforward approach; namely, one can initialize the policy or value network in the student task with that from the teacher task (Mehta et al., 2008; Rajendran et al., 2015). Parameter transfer with different state-action variables is more complex, where the crucial aspect is to find a suitable mapping from the teacher state-action space to the student state-action space (Gupta et al., 2017; Talvitie & Singh, 2007; Taylor et al., 2008b). Most transfer learning algorithms fall into the category of representation transfer, where the reinforcement learning algorithm learns a specific representation of the task or the solution, and the transfer algorithm performs an abstraction process to fit it to the student task. Konidaris et al. (2012) use the reward shaping approach to learn a portable shaping function for knowledge transfer, while other works use neural networks for feature abstraction (Duan et al., 2016; Parisotto et al., 2015; Zhang et al., 2018). Policy distillation (Rusu et al., 2015) and its variants are another popular choice for learning the teacher task representation, where the student policy aims to mimic the behavior of pre-trained teacher policies during its own learning process (Schmitt et al., 2018; Yin & Pan, 2017).
Recently, successor representation has been widely used in transfer reinforcement learning, in which the rewards are assumed to share some common features, so that the value function can be simply written as a linear combination of the successor features (SF) (Barreto et al., 2017; Madarasz & Behrens, 2019). Barreto et al. (2019) extends the method of using SF and generalised policy improvement in Q-learning (Sutton & Barto, 2018) to more general environments. Borsa et al. (2018), Ma et al. (2018), and Schaul et al. (2015a) learn a universal SF approximator for transfer. The basic idea of instance transfer algorithms is that transferring teacher samples may improve learning on student tasks. Lazaric et al. (2008) and Tirinzoni et al. (2018) selectively transfer samples on the basis of the compliance between tasks in a model-free algorithm, while Taylor et al. (2008a) studies how a model-based algorithm can benefit from samples coming from the teacher task. However, most of the aforementioned algorithms either assume specific forms of reward functions or perform well only when the teacher and student tasks are similar. Additionally, very few algorithms are designed for actor-critic reinforcement learning. In this work, we propose a representation-instance transfer algorithm that handles generic cases of task similarity, is naturally suited to actor-critic algorithms, and can be easily extended to other policy-gradient-based algorithms.

3. BACKGROUND: ACTOR-CRITIC REINFORCEMENT LEARNING

A general reinforcement learning (RL) agent interacting with an environment can be modeled as a Markov decision process (MDP), defined by a tuple M = (S, A, p, r, γ), where S and A are the sets of states and actions, respectively. The state transition function p(·|s, a) maps a state-action pair to a probability distribution over states. The reward function r : S × A × S → R determines the reward received by the agent for a transition from (s, a) to s'. The discount factor γ ∈ [0, 1] provides a means to weigh the long-term objective. Specifically, the goal of an RL agent is to learn a policy π that maps a state to a probability distribution over actions at each time step t, so that taking a_t ∼ π(·|s_t) maximizes the accumulated discounted return \sum_{t \ge 0} \gamma^t r(s_t, a_t, s_{t+1}). A popular choice for addressing this problem is the model-free actor-critic architecture, e.g., Konda & Tsitsiklis (2000); Degris et al. (2012); Mnih et al. (2016); Schulman et al. (2015a; 2017), where the critic estimates the value function and the actor updates the policy distribution in the direction suggested by the critic. The state value function at time t is defined as

V^\pi(s) = \mathbb{E}_{a_i \sim \pi(\cdot \mid s_i)} \left[ \sum_{i \ge t} \gamma^{i-t} r(s_i, a_i, s_{i+1}) \,\middle|\, s_t = s \right].    (3.1)
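The accumulated discounted return above can be computed by folding rewards backward through a trajectory. A minimal sketch (the helper name is ours, not from the paper):

```python
def discounted_return(rewards, gamma):
    """Accumulated discounted return sum_{t>=0} gamma^t * r_t.

    Iterating from the last reward backward turns the sum into the
    recursion G_t = r_t + gamma * G_{t+1}, avoiding explicit powers.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For instance, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates 1 + 0.5·(1 + 0.5·1) = 1.75; averaging such returns over trajectories starting from a state s gives a Monte Carlo estimate of V^π(s) in Eq. (3.1).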




