TRANSFER AMONG AGENTS: AN EFFICIENT MULTIAGENT TRANSFER LEARNING FRAMEWORK

Abstract

Transfer learning has shown great potential to enhance single-agent reinforcement learning (RL) efficiency by sharing policies learned on previous tasks. Similarly, in multiagent settings, learning performance can be promoted if agents can share knowledge with each other. However, it remains an open question how an agent should learn from other agents' knowledge. In this paper, we propose a novel multiagent option-based policy transfer (MAOPT) framework to improve multiagent learning efficiency. Our framework learns what advice to give to each agent and when to terminate it by modeling multiagent policy transfer as an option learning problem. MAOPT provides several variants, which can be classified into two types according to the experience used during training. The first type is MAOPT with the global option advisor, which has access to the global information of the environment. However, in many realistic scenarios we can only obtain each agent's local information due to partial observability. The second type, which is suitable for this setting and collects each agent's local experience for the update, contains MAOPT with the local option advisor and MAOPT with the successor representation option (SRO). In many cases each agent's experience is inconsistent with the others', which causes the option-value estimation to oscillate and become inaccurate. SRO handles this experience inconsistency by decoupling the dynamics of the environment from the rewards, learning the option-value function under each agent's preference. MAOPT can be easily combined with existing deep RL approaches, and experimental results show that it significantly boosts the performance of existing deep RL methods in both discrete and continuous state spaces. We also provide a theoretical analysis showing that this objective is guaranteed to converge to an improved policy and does not affect the convergence of the original RL algorithm.

1. INTRODUCTION

Transfer learning has shown great potential to accelerate single-agent RL by leveraging prior knowledge from policies learned on relevant past tasks (Yin & Pan, 2017; Yang et al., 2020). Inspired by this, transfer learning in multiagent reinforcement learning (MARL) (Claus & Boutilier, 1998; Hu & Wellman, 1998; Bu et al., 2008; Hernandez-Leal et al., 2019; da Silva & Costa, 2019) has also been studied, in two major directions: 1) transferring knowledge across different but similar MARL tasks, and 2) transferring knowledge among multiple agents within the same MARL task. For the former, several works explicitly compute similarities between states or temporal abstractions (Hu et al., 2015; Boutsioukis et al., 2011; Didi & Nitschke, 2016) to transfer across similar tasks with the same number of agents, or design new network structures to transfer across tasks with different numbers of agents (Agarwal et al., 2019; Wang et al., 2020). In this paper, we focus on the latter direction, motivated by the following intuition: in a multiagent system (MAS), each agent's experience is different, so the states each agent encounters (and its degree of familiarity with different regions of the environment) also differ; if knowledge can be transferred across agents in a principled way, all agents could form a big picture of the MAS without exploring the whole state space, facilitating more efficient MARL (da Silva et al., 2020). Transferring knowledge among multiple agents is still at an early stage of investigation, and the assumptions and designs of recent methods are usually simple. For example, LeCTR (Omidshafiei et al., 2019) and HMAT (Kim et al., 2020) adopt the teacher-student framework to learn to teach by assigning each agent two roles (i.e., teacher and student), so an agent can learn when and what to advise other agents, or when to receive advice from them.
However, both LeCTR and HMAT only consider two-agent scenarios. Liang & Li (2020) proposed a teacher-student method in which each agent asks for advice from other agents through a learned attentional teacher selector. However, they simply use the difference of two unbounded value functions as the reward signal, which may cause instability. DVM (Wadhwania et al., 2019) and LTCR (Xue et al., 2020) are multiagent policy distillation methods that transfer knowledge among more than two agents. However, both methods decompose the solution into several stages in a coarse-grained manner. Moreover, they weight distillation equally throughout the whole training process, which is counter-intuitive: a good transfer scheme should be adaptive rather than uniform. For example, transfer should be more frequent at the beginning of training, when agents know little about the environment, and should decay as training continues, when agents gradually become familiar with the environment and should focus more on their own knowledge. In this paper, we propose a novel MultiAgent Option-based Policy Transfer (MAOPT) framework that models policy transfer among multiple agents as an option learning problem. In contrast to previous teacher-student and policy distillation frameworks, MAOPT is adaptive and applicable to scenarios with more than two agents. Specifically, MAOPT adaptively selects a suitable policy for each agent as the advised policy, which serves as a complementary optimization objective for that agent. MAOPT also uses the termination probability as a performance indicator to determine when the advice should be terminated, so as to avoid negative transfer.
Furthermore, to improve scalability and robustness, MAOPT contains two types of variants: MAOPT with the global option advisor (MAOPT-GOA) on one hand, and MAOPT with the local option advisor (MAOPT-LOA) and MAOPT with the successor representation option advisor (MAOPT-SRO) on the other. Ideally, we can use global information to estimate the option-value function; MAOPT-GOA then selects a joint policy set, in which each policy is advised to one agent. However, in many realistic scenarios we can only obtain each agent's local experience, in which case we adopt MAOPT-LOA or MAOPT-SRO. Each agent's experience may be inconsistent with the others' due to partial observation, which can make the option-value estimation inaccurate. MAOPT-SRO overcomes this inconsistency by decoupling the dynamics of the environment from the rewards, learning the option-value function under each agent's preference. MAOPT can be easily incorporated into existing deep RL approaches, and experimental results show that it significantly boosts their performance in both discrete and continuous state spaces.

2. PRELIMINARIES

Stochastic Games (Littman, 1994) are a natural multiagent extension of Markov Decision Processes (MDPs), modeling the dynamic interactions among multiple agents. Since agents may not have access to complete environmental information, we follow previous work and model multiagent learning problems as partially observable stochastic games (Hansen et al., 2004). A Partially Observable Stochastic Game (POSG) is defined as a tuple $\langle N, S, A^1, \cdots, A^n, \mathcal{T}, R^1, \cdots, R^n, O^1, \cdots, O^n \rangle$, where $N$ is the set of agents; $S$ is the set of states; $A^i$ is the set of actions available to agent $i$ (with joint action space $A = A^1 \times A^2 \times \cdots \times A^n$); $\mathcal{T}: S \times A \times S \rightarrow [0, 1]$ is the transition function defining transition probabilities between global states; $R^i: S \times A \rightarrow \mathbb{R}$ is the reward function for agent $i$; and $O^i$ is the set of observations for agent $i$. A policy $\pi^i: O^i \times A^i \rightarrow [0, 1]$ specifies the probability distribution over the action space of agent $i$. The goal of agent $i$ is to learn a policy $\pi^i$ that maximizes the expected return with discount factor $\gamma$: $J = \mathbb{E}_{\pi^i}\left[\sum_{t=0}^{\infty} \gamma^t r_t^i\right]$.

The Options Framework. Sutton et al. (1999) first formalized the idea of a temporally extended action as an option. An option $\omega \in \Omega$ is defined as a triple $\{I_\omega, \pi_\omega, \beta_\omega\}$, in which $I_\omega \subseteq S$ is an initiation state set, $\pi_\omega$ is an intra-option policy, and $\beta_\omega: I_\omega \rightarrow [0, 1]$ is a termination function that specifies the probability that option $\omega$ terminates at state $s \in I_\omega$. An MDP endowed with a set of options becomes a Semi-Markov Decision Process (Semi-MDP), which has a corresponding optimal option-value function over options, learned with intra-option learning. The options framework considers the call-and-return option execution model, in which an agent picks an option $\omega$ according to its option-value function $Q_\omega(s, \omega)$, follows the intra-option policy $\pi_\omega$ until termination, then selects the next option and repeats the procedure.
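The call-and-return execution model described above can be sketched as follows (an illustrative Python sketch over a toy deterministic environment; the class and function names are our own assumptions, not from any released implementation):

```python
import random

class Option:
    """An option ω = (I_ω, π_ω, β_ω): here only the intra-option policy π_ω
    and the termination function β_ω are modeled (illustrative names)."""
    def __init__(self, policy, termination_prob):
        self.policy = policy          # π_ω: state -> action
        self.beta = termination_prob  # β_ω: state -> [0, 1]

def call_and_return(state, options, q_omega, env_step, max_steps=100):
    """Greedy call-and-return execution: pick the option with the highest
    option value, follow its intra-option policy until it terminates
    (a Bernoulli draw on β_ω), then reselect and repeat."""
    trajectory = []
    omega = max(options, key=lambda w: q_omega(state, w))
    for _ in range(max_steps):
        action = omega.policy(state)
        state = env_step(state, action)
        trajectory.append((omega, action, state))
        if random.random() < omega.beta(state):  # option terminates here
            omega = max(options, key=lambda w: q_omega(state, w))
    return trajectory
```

With a never-terminating option (β = 0), the executor keeps following the initially selected intra-option policy for the whole rollout.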
Deep Successor Representation (DSR). The successor representation (SR) (Dayan, 1993) describes the state value function through a prediction about the future occupancy of all states under a fixed policy, thereby decoupling the dynamics of the environment from the rewards. Given a transition $(s, a, s', r)$, the SR is defined as the expected discounted future state occupancy: $M(s, s', a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \mathbb{1}[s_t = s'] \mid s_0 = s, a_0 = a\right]$, where $\mathbb{1}[\cdot]$ is the indicator function, equal to one when its argument is true and zero otherwise. Given the SR, the Q-value for selecting action $a$ at state $s$ can be formulated as the inner product of the SR and the immediate rewards: $Q^\pi(s, a) = \sum_{s' \in S} M(s, s', a) R(s')$. DSR (Kulkarni et al., 2016) extends SR by approximating it with neural networks. Specifically, each state $s$ is represented by a $D$-dimensional feature vector $\phi_s$, the output of a network parameterized by $\theta$. Given $\phi_s$, the SR is represented as $m_{sr}(\phi_s, a \mid \tau)$ parameterized by $\tau$; a decoder $g_{\tilde{\theta}}(\phi_s)$ parameterized by $\tilde{\theta}$ outputs the input reconstruction $\hat{s}$; and the immediate reward at state $s$ is approximated as a linear function of $\phi_s$: $R(s) \approx \phi_s \cdot w$, where $w \in \mathbb{R}^D$ is a weight vector. The Q-value function can then be approximated by combining the two parts: $Q^\pi(s, a) \approx m_{sr}(\phi_s, a \mid \tau) \cdot w$. Stochastic gradient descent is used to update the parameters $(\theta, \tau, w, \tilde{\theta})$. Specifically, the loss for $\tau$ is $L(\tau, \theta) = \mathbb{E}\left[\left(\phi_s + \gamma m'_{sr}(\phi_{s'}, a' \mid \tau') - m_{sr}(\phi_s, a \mid \tau)\right)^2\right]$, where $a' = \arg\max_a m_{sr}(\phi_{s'}, a) \cdot w$ and $m'_{sr}$ is a target SR network parameterized by $\tau'$, following DQN (Mnih et al., 2015) for stable training. The reward weights $w$ are updated by minimizing $L(w, \theta) = \left(R(s) - \phi_s \cdot w\right)^2$, and $\tilde{\theta}$ is updated with an L2 reconstruction loss $L(\tilde{\theta}, \theta) = (\hat{s} - s)^2$. The overall DSR loss is the sum of the three: $L(\theta, \tau, w, \tilde{\theta}) = L(\tau, \theta) + L(w, \theta) + L(\tilde{\theta}, \theta)$.
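The two DSR components, the SR and the reward weights, combine into a Q-value by a simple inner product, and the SR itself is trained with a TD target. A minimal numpy sketch (with tabular stand-ins for the learned networks; all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_actions = 4, 2  # feature dimension and action count (toy sizes)

# Tabular stand-ins for the learned networks: m_sr[a] maps a feature vector
# to its SR under action a, and w are the reward weights with R(s) ≈ φ_s · w.
m_sr = {a: 0.1 * np.eye(D) for a in range(n_actions)}
w = rng.normal(size=D)

def q_value(phi_s, a):
    """Q(s, a) ≈ m_sr(φ_s, a) · w: inner product of the SR and reward weights."""
    return (m_sr[a] @ phi_s) @ w

def sr_td_target(phi_s, phi_s_next, gamma=0.99):
    """TD target for the SR: φ_s + γ m_sr(φ_s', a'), with a' greedy w.r.t. Q,
    mirroring the loss L(τ, θ) above."""
    a_prime = max(range(n_actions), key=lambda a: q_value(phi_s_next, a))
    return phi_s + gamma * (m_sr[a_prime] @ phi_s_next)
```

The full DSR replaces both tabular stand-ins with networks and regresses `m_sr` toward `sr_td_target` while separately fitting `w` to observed rewards.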

3. MULTIAGENT OPTION-BASED POLICY TRANSFER (MAOPT)

3.1. FRAMEWORK OVERVIEW

In this section, we describe our MAOPT framework in detail. Figure 1 illustrates the framework, which contains n agents interacting with the environment, together with their option advisors. At each step, each agent i obtains its own observation o^i, selects an action a^i following its policy π^i, and receives its reward r^i. Each option advisor initializes the option set and selects an option for each agent. During the training phase, the option advisor uses samples from all agents to update the option-value function and the corresponding termination probabilities. Each agent is advised by an option advisor, and the advice is to exploit the advised policy through imitation, which serves as a complementary optimization objective (each agent does not know which policy it imitates or how the extra loss function is calculated)*. The exploitation of the advised policy ends when the selected option terminates, and another option is then selected. In this way, each agent efficiently exploits useful information from other agents, and as a result, the learning process of the whole system is accelerated and improved. Note that in the following sections we assume the agents using the option advisor are homogeneous, i.e., they share the same option set. MAOPT can also support the situation where each agent is initialized with a different number of options, e.g., where each agent only needs to imitate its neighbours. To achieve this, instead of inputting only the state into the option-value network, we input a state-option pair and output a single option value. Our proposed MAOPT can be classified into two types according to the experience used during training. One type is MAOPT with the global option advisor (MAOPT-GOA), which has access to the global information of the environment (i.e., $(s, a, r, s')$, where $r = \sum_{i=1}^{n} r^i$).
Thus, MAOPT-GOA selects a joint option as the advice set given the global observation of the environment, and then evaluates the performance of the selected joint option. Selecting a joint option means that the advice given to each agent begins and ends simultaneously. However, in many realistic scenarios we can only obtain each agent's local information due to partial observation. Moreover, each agent's degree of familiarity with the environment differs, so some agents may need to imitate their teachers for longer. A more flexible way to control when each agent's advice terminates individually is therefore necessary. The other type contains MAOPT with the local option advisor (MAOPT-LOA) and MAOPT with the successor representation option advisor (MAOPT-SRO), which collect each agent's local experience for the update. In many cases, each agent's experience is inconsistent with the others': for example, each agent may have an individual goal to achieve or a different role, and the rewards assigned to the agents differ. If we simply used all experiences for the update, the option-value estimation would oscillate and become inaccurate. MAOPT-SRO handles this experience inconsistency by decoupling the dynamics of the environment from the rewards, learning the option-value function under each agent's preference.

3.2. MAOPT-GOA

In cases where we have access to the global information of the environment, the global option advisor is used to advise each agent. The procedure of MAOPT-GOA is as follows (pseudo-code is included in the appendix). First, MAOPT-GOA initializes the joint option set $\Omega^1 \times \Omega^2 \times \cdots \times \Omega^n$ (where $\Omega^i = \{\omega_1, \cdots, \omega_n\}$). Each option $\omega_i$ corresponds to agent $i$'s policy $\pi^i$. The joint option-value function $Q_\omega(s, \omega \mid \psi)$, parameterized by $\psi$, evaluates the performance of each joint option $\omega$. The corresponding target network is parameterized by $\psi'$, which is copied from $\psi$ every $k$ steps. The termination network, parameterized by $\vartheta$, outputs the termination probability $\beta(s', \omega \mid \vartheta)$ of the joint option $\omega$. The update of the joint option-value network follows previous work (Sutton et al., 1999; Bacon et al., 2017). We first sample $B$ transitions uniformly from the global replay buffer; for each sample $(s, a, r, s')$, we calculate the joint $U$ function, i.e., the joint option value upon arrival:

$$U(s', \omega \mid \psi') = \left(1 - \beta(s', \omega \mid \vartheta)\right) Q_\omega(s', \omega \mid \psi') + \beta(s', \omega \mid \vartheta) \max_{\omega' \in \Omega} Q_\omega(s', \omega' \mid \psi'). \quad (3)$$

Then the option-value network minimizes the following loss:

$$L_\omega = \frac{1}{B} \sum_b \left(r_b + \gamma U(s_{b+1}, \omega \mid \psi') - Q_\omega(s_b, \omega \mid \psi)\right)^2, \quad (4)$$

where $r_b = \sum_{i=1}^{n} r_b^i$. According to the call-and-return option execution model, the termination probability $\beta_\omega$ controls when to terminate the selected joint option (and thus when to select another one), and is updated w.r.t. $\vartheta$ as follows (Bacon et al., 2017):

$$\vartheta \leftarrow \vartheta - \alpha \frac{\partial \beta(s', \omega \mid \vartheta)}{\partial \vartheta} \left(A(s', \omega \mid \psi') + \xi\right),$$

where $A(s', \omega \mid \psi')$ is the advantage function, which can be approximated as $Q_\omega(s', \omega \mid \psi') - \max_{\omega' \in \Omega} Q_\omega(s', \omega' \mid \psi')$, and $\xi$ is a regularization term that encourages exploration (Bacon et al., 2017). Then, given each state $s$, MAOPT-GOA selects a joint option $\omega$ following the $\epsilon$-greedy strategy over the joint option-value function.
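The option value upon arrival and the option-value loss above can be sketched as follows (illustrative Python; the batch layout and function names are our own assumptions, not the authors' code):

```python
import numpy as np

def u_upon_arrival(q_next, beta_next, omega):
    """U(s', ω) = (1 - β(s', ω)) Q(s', ω) + β(s', ω) max_ω' Q(s', ω').
    q_next: target option values at s' (one entry per option);
    beta_next: termination probabilities at s'."""
    return (1.0 - beta_next[omega]) * q_next[omega] + beta_next[omega] * q_next.max()

def option_value_loss(batch, gamma=0.99):
    """Mean squared TD error over a batch, mirroring
    L_ω = (1/B) Σ_b (r_b + γ U(s_{b+1}, ω) - Q(s_b, ω))²."""
    errs = [
        (r + gamma * u_upon_arrival(q_next, beta_next, omega) - q_s[omega]) ** 2
        for omega, r, q_s, q_next, beta_next in batch
    ]
    return float(np.mean(errs))
```

With β = 0 this reduces to the usual continuation value, and with β = 1 to a greedy one-step backup over options.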
Then MAOPT-GOA calculates the cross-entropy $H(\pi_\omega \mid \pi^i)$ between each intra-option policy $\pi_\omega$ and each agent's policy $\pi^i$, and gives it to the corresponding agent, where it serves as a complementary optimization objective: apart from maximizing the cumulative reward, the agent also imitates the intra-option policy $\pi_\omega$ by minimizing the loss $L_{tr}^i$. The imitation of the intra-option policy stops when the option terminates, and another option is then selected to provide advice for the agent. The loss $L_{tr}^i$ is given by

$$L_{tr}^i = f(t) H(\pi_\omega \mid \pi^i), \quad f(t) = 0.5 + \tanh(3 - \mu t)/2,$$

where $f(t)$ is the weighting factor of $H(\pi_\omega \mid \pi^i)$ and $\mu$ is a hyper-parameter controlling how quickly the weight decreases. This means that at the beginning of learning each agent mostly exploits knowledge from other agents; as learning continues, knowledge from other agents becomes less useful and each agent focuses more on its own learned policy. Note that we cannot calculate the cross-entropy between discrete action policies directly. To remedy this, we apply the softmax function with a temperature to the discrete action vectors, transforming the actions into discrete categorical distributions. In MAOPT-GOA, the advice given to each agent begins and ends simultaneously. However, when each agent should stop reusing other agents' knowledge ought to be decided asynchronously and individually, since each agent's degree of familiarity with the environment is probably not identical. Moreover, in many realistic scenarios we can only obtain each agent's local information due to partial observation. A more flexible way to advise each agent is therefore necessary; we describe the second type of MAOPT, which provides it, in the following sections.
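The weighting schedule f(t) and the softmax-based cross-entropy can be sketched as follows (illustrative Python; the value of μ and the temperature are placeholders, not the paper's tuned hyper-parameters):

```python
import math

def transfer_weight(t, mu=1e-5):
    """f(t) = 0.5 + tanh(3 - μt)/2: close to 1 early in training, decaying
    toward 0 as t grows; μ controls the decay rate (value here is illustrative)."""
    return 0.5 + math.tanh(3.0 - mu * t) / 2.0

def softmax(logits, temperature=1.0):
    exps = [math.exp(x / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def transfer_loss(t, advisor_logits, student_logits, mu=1e-5, temperature=1.0):
    """L_tr = f(t) * H(π_ω | π_i): weighted cross-entropy between the advised
    intra-option policy and the agent's own policy over discrete actions,
    with both action vectors first softened into categorical distributions."""
    p = softmax(advisor_logits, temperature)  # advised policy π_ω
    q = softmax(student_logits, temperature)  # agent's policy π_i
    h = -sum(pi * math.log(qi + 1e-12) for pi, qi in zip(p, q))
    return transfer_weight(t, mu) * h
```

Early in training f(t) ≈ 1, so imitation dominates; once μt exceeds 3 the weight collapses toward 0 and the agent's own objective takes over.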

3.3. MAOPT-LOA

MAOPT-LOA equips each agent with an option advisor, and each advisor uses local information from all agents for estimation. Figure 2 illustrates how MAOPT-LOA is applied in actor-critic methods (pseudo-code is included in the appendix). First, MAOPT-LOA initializes $n$ options $\Omega = \{\omega_1, \omega_2, \cdots, \omega_n\}$, where each option $\omega_i$ corresponds to agent $i$'s policy $\pi^i$. The input of the option-value network (parameterized by $\psi$) and the termination network (parameterized by $\vartheta$) is the local observation $o^i$ of each agent $i$. The option-value function $Q_\omega(o^i, \omega \mid \psi)$ and termination probability $\beta(o^i, \omega \mid \vartheta)$ are used to evaluate the performance of each option $\omega_i \in \Omega$; their updates are similar to those in MAOPT-GOA. For the update of each agent $i$, MAOPT-LOA first selects an option $\omega$ from $\{\omega_1, \omega_2, \cdots, \omega_n\}$ following the $\epsilon$-greedy strategy over the option-value function. MAOPT-LOA then calculates the cross-entropy $H(\pi_\omega \mid \pi^i)$ between the intra-option policy $\pi_\omega$ and the agent's policy $\pi^i$, and gives it to the agent, contributing a complementary loss $L_{tr}^i$. Note that the option-value network and termination network collect experience from all agents for the update. What if the experience from one agent is inconsistent with the others'? In a POSG, each agent obtains only a local observation and an individual reward signal, which may differ across agents even in the same state. If we use such inconsistent experiences to update one shared option-value network and termination network, the estimate of the option-value function will oscillate and become inaccurate. We propose MAOPT-SRO to address this problem: it decouples the dynamics of the environment from the rewards, so that it can learn the option-value function and the corresponding termination probabilities under each agent's preference, as described in the next section.
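The per-agent advising loop of MAOPT-LOA, terminating the current option with probability β and reselecting ε-greedily from the local observation, can be sketched as follows (illustrative Python; `q` and `beta` stand in for the option-value and termination networks, and all names are our own):

```python
import random

def advise_step(agent_options, q, beta, obs, epsilon=0.1, rng=random):
    """One advising step of the local option advisor: for each agent i,
    terminate the current option with probability β(o_i, ω) and, if it
    terminated (or no option is assigned yet), reselect ε-greedily over the
    option values at the agent's local observation o_i."""
    n_options = len(agent_options)  # one option per agent's policy in MAOPT-LOA
    for i, (o_i, omega) in enumerate(zip(obs, agent_options)):
        if omega is None or rng.random() < beta(o_i, omega):
            if rng.random() < epsilon:  # explore: random option
                agent_options[i] = rng.randrange(n_options)
            else:                       # exploit: greedy option
                agent_options[i] = max(range(n_options), key=lambda w: q(o_i, w))
    return agent_options
```

Because each agent draws its own termination, advice starts and stops asynchronously across agents, unlike the joint option of MAOPT-GOA.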

3.4. MAOPT-SRO

MAOPT-SRO applies a novel option learning algorithm, Successor Representation Option (SRO) learning, to learn the option-value function under each agent's preference. The SRO network architecture is shown in Figure 3, with each observation $o^i$ from each agent $i$ as input. $o^i$ is passed through two fully-connected layers to generate the state embedding $\phi_{o^i}$, which is fed to three network sub-modules. The first sub-module contains the state reconstruction model, which ensures that $\phi_{o^i}$ represents $o^i$ well, and the weights for the immediate reward approximation at local observation $o^i$; the immediate reward is approximated as a linear function of $\phi_{o^i}$: $R^i(\phi_{o^i}) \approx \phi_{o^i} \cdot w$, where $w \in \mathbb{R}^D$ is the weight vector. The second sub-module approximates the SR for options, $m_{sr}(\phi_{o^i}, \omega \mid \tau)$, which describes the expected discounted future state occupancy of executing option $\omega$; the corresponding target network is parameterized by $\tau'$, which is copied from $\tau$ every $k$ steps. The last sub-module updates the termination probability $\beta(\phi_{o^i}, \omega \mid \vartheta)$, similarly to MAOPT-LOA (Section 3.3).

Given $m_{sr}(\phi_{o^i}, \omega \mid \tau)$, the SRO value function can be approximated as $Q_\omega(\phi_{o^i}, \omega) \approx m_{sr}(\phi_{o^i}, \omega \mid \tau) \cdot w$. Since options are temporal abstractions (Sutton et al., 1999), SRO also needs the $\tilde{U}$ function, which serves as the SR upon arrival, indicating the expected discounted future state occupancy of executing an option upon entering a state. Given a transition $(o^i, a^i, r^i, o^{i\prime})$, the SR upon arrival $\tilde{U}$ is defined as

$$\tilde{U}(\phi_{o^{i\prime}}, \omega \mid \tau') = \left(1 - \beta(\phi_{o^{i\prime}}, \omega \mid \vartheta)\right) m_{sr}(\phi_{o^{i\prime}}, \omega \mid \tau') + \beta(\phi_{o^{i\prime}}, \omega \mid \vartheta)\, m_{sr}(\phi_{o^{i\prime}}, \omega' \mid \tau'), \quad (7)$$

where $\omega' = \arg\max_{\omega \in \Omega} m_{sr}(\phi_{o^{i\prime}}, \omega \mid \tau') \cdot w$. We consider MAOPT-SRO combined with PPO (Schulman et al., 2017), a popular single-agent policy-based RL algorithm; combining MAOPT-SRO with other policy-based RL algorithms is similar. Algorithm 1 illustrates the whole procedure of MAOPT-SRO.

Algorithm 1 MAOPT-SRO.
1: Initialize: option set $\Omega = \{\omega_1, \omega_2, \cdots, \omega_n\}$; state embedding, reward prediction, state reconstruction, termination, SR, and target SR network parameters; actor and critic network parameters for each agent
2: for each episode do
3:   Each agent i obtains its local observation o^i corresponding to the current state s
4:   for each agent i do
5:     MAOPT-SRO selects an option ω for agent i
6:     Agent i selects an action a^i following its policy π^i
7:   end for
8:   Perform the joint action a
9:   Receive the reward r^i and next observation o^i' for each agent i
10:  for each agent i do
11:    Store the transition (o^i, a^i, r^i, o^i') in the replay buffer D^i
12:    If ω terminates, MAOPT-SRO selects another option for agent i
13:  end for
14:  for every T steps do
15:    for each agent i do
16:      Set π^i_old = π^i
17:      Calculate the advantage A^i = Σ_{t'>t} γ^{t'-t} r^i_{t'} - V_{υ^i}(o^i_t)
18:      Optimize the critic loss L^i_c = Σ_{t=1}^T (Σ_{t'>t} γ^{t'-t} r^i_{t'} - V_{υ^i}(o^i_t))²
19:      The option advisor calculates the transfer loss L^i_tr = f(t)H(π_ω|π^i)
20:      Optimize the actor loss L^i_a = -Σ_{t=1}^T [(π^i(a^i_t|o^i_t)/π^i_old(a^i_t|o^i_t)) A^i_t - λKL[π^i_old|π^i]] + L^i_tr w.r.t. ρ^i
21:    end for
22:  end for
23:  Sample a batch of B/N transitions from each buffer D^i
24:  Optimize L(θ̃, θ) = (g_θ̃(φ_{o^i}) - o^i)² w.r.t. θ̃, θ
25:  Optimize L(w, θ) = (r^i - φ_{o^i} · w)² w.r.t. w, θ
26:  for each ω do
27:    if π_ω selects action a^i at observation o^i then
28:      Calculate Ũ(φ_{o^i'}, ω|τ')
29:      Set y ← φ_{o^i} + γŨ(φ_{o^i'}, ω|τ')
30:      Optimize the loss L(τ, θ) = (1/B) Σ_b (y_b - m_sr(φ_{o^i}, ω|τ))² w.r.t. τ
31:      Optimize the termination network w.r.t. ϑ: ϑ ← ϑ - α (∂β(φ_{o^i'}, ω|ϑ)/∂ϑ)(A(φ_{o^i'}, ω|τ) + ξ)
32:    end if
33:  end for
34:  Copy τ to τ' every k steps
35: end for

First, we initialize the network parameters for the state embedding network, reward prediction network, state reconstruction network, termination network, SR network, target SR network, and the actor and critic networks of each agent (Line 1). For each episode, each agent $i$ first obtains its local observation $o^i$, which corresponds to the current state $s$ (Line 3). MAOPT-SRO then selects an option $\omega$ for each agent $i$ (Line 5), and each agent selects an action $a^i$ following its policy $\pi^i$ (Line 6). The joint action $a$ is performed, and the rewards and new observations are returned by the environment (Lines 8-9). Each transition is stored in the replay buffer $D^i$ (Line 11). If $\omega$ terminates, MAOPT-SRO selects another option for agent $i$ (Line 12). At each update step, each agent updates its critic network by minimizing the loss $L_c^i$ (Line 18), where $T$ is the length of the trajectory segment.
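The SRO value and the SR upon arrival in Eq. (7) reduce to inner products and a convex combination, e.g. (a numpy sketch; the array shapes are our own convention, not the authors' implementation):

```python
import numpy as np

def sro_q(m_sr_vec, w):
    """SRO value: Q ≈ m_sr(φ, ω) · w, the inner product of the option-SR
    vector and the reward weights."""
    return m_sr_vec @ w

def sr_upon_arrival(m_sr_next, beta_next, omega, w):
    """Ũ(φ', ω) = (1 - β(φ', ω)) m_sr(φ', ω) + β(φ', ω) m_sr(φ', ω*),
    with ω* = argmax_ω m_sr(φ', ω) · w.
    m_sr_next: array of shape (n_options, D); beta_next: per-option termination
    probabilities at the arrival embedding φ'."""
    omega_star = int(np.argmax(m_sr_next @ w))  # greedy option under Q ≈ m_sr · w
    b = beta_next[omega]
    return (1.0 - b) * m_sr_next[omega] + b * m_sr_next[omega_star]
```

Because the SR (dynamics) and w (each agent's reward preference) are factored apart, the same SR can be scored under different agents' reward weights without re-learning the dynamics part.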
Then each agent updates its actor network by minimizing the sum of the original loss and the transfer loss $L_{tr}^i$ (Line 20). For the update of SRO, a batch of $B/N$ transitions is first sampled from each agent's buffer $D^i$, so there are $B$ transitions in total (Line 23). The SRO loss is composed of three components: the state reconstruction loss $L(\tilde{\theta}, \theta)$, the reward-weight loss $L(w, \theta)$, and the SR loss $L(\tau, \theta)$. The state reconstruction sub-module is updated by minimizing $L(\tilde{\theta}, \theta)$ (Line 24) and $L(w, \theta)$ (Line 25). The second sub-module, the SR network, approximates the SR for options and is updated by minimizing the standard L2 loss $L(\tau, \theta)$ (Lines 26-30). Finally, the termination probability of the selected option is updated (Line 31), where $A(\phi_{o^{i}}, \omega \mid \tau)$ is the advantage function, approximated as $m_{sr}(\phi_{o^{i}}, \omega \mid \tau) \cdot w - \max_{\omega \in \Omega} m_{sr}(\phi_{o^{i}}, \omega \mid \tau) \cdot w$, and $\xi$ is a regularization term that encourages exploration.

4. EXPERIMENTS

In this section, we evaluate the performance of MAOPT combined with a common single-agent RL algorithm (PPO (Schulman et al., 2017)) and a MARL algorithm (MADDPG (Lowe et al., 2017)). The test domains include two representative multiagent games, Pac-Man and the multiagent particle environment (MPE) (illustrated in the appendix). Specifically, we first combine MAOPT-LOA and MAOPT-SRO with PPO on Pac-Man to validate whether MAOPT-SRO successfully handles the sample inconsistency caused by partial observation. Then, we combine MAOPT-GOA, MAOPT-LOA, and MAOPT-SRO with the two baselines (MADDPG and PPO) on MPE to further validate whether MAOPT-SRO provides a more flexible way to transfer knowledge among agents and enhances the advantage of our framework. We also compare with DVM (Wadhwania et al., 2019), the most recent multiagent transfer method†.

4.1. PAC-MAN

Pac-Man (van der Ouderaa, 2016) is a mixed cooperative-competitive maze game with one pac-man player and two ghost players. The goal of the pac-man player is to eat as many pills as possible while avoiding pursuit by the ghost players; the ghost players aim to capture the pac-man player as quickly as possible. In our setting, we control the two ghost players, while the pac-man player, the opponent, is controlled by a well pre-trained PPO policy. The game ends when one ghost catches the pac-man player or the episode exceeds 100 steps. Each ghost receives a -0.01 penalty per step and a +5 reward for catching the pac-man player. We consider two Pac-Man scenarios (OpenClassic and MediumClassic) of increasing difficulty. Figure 4(a) presents the average rewards on the OpenClassic scenario. Both MAOPT-LOA and MAOPT-SRO perform better than the other methods, achieving average discounted rewards of approximately +3 with smaller variance; in contrast, PPO and DVM achieve average discounted rewards of only approximately +2.5 with larger variance. This indicates that both MAOPT-LOA and MAOPT-SRO enable efficient knowledge transfer between the two ghosts, facilitating better performance. Next, we consider a more complex Pac-Man scenario, with a larger layout that contains obstacles (walls). From Figure 4(b) we observe that the advantage of MAOPT-LOA and MAOPT-SRO over PPO and DVM is even more pronounced. Furthermore, MAOPT-SRO performs best among all methods, which means it effectively selects more suitable advice for each agent. The reason MAOPT performs better than DVM is that MAOPT enables each agent to effectively exploit useful information from other agents through the option-based call-and-return mechanism, which avoids negative transfer when other agents' policies are only partially useful.
In contrast, DVM simply transfers all information from other agents through policy distillation. Comparing the results of the two scenarios, we see that the advantage of MAOPT-SRO increases in more challenging scenarios. Intuitively, as the environment becomes more difficult, it is harder for agents to explore it and learn the optimal policy. In such cases, agents need to exploit the knowledge of other agents more efficiently, which greatly accelerates the learning process, as demonstrated by MAOPT-LOA and MAOPT-SRO.

4.2. MPE

MPE (Lowe et al., 2017) is a simple multiagent particle world with continuous observations and a discrete action space. We evaluate the performance of MAOPT on two scenarios: predator-prey and cooperative navigation. Predator-prey contains three agents, which are slower and want to catch one adversary (each hit is rewarded +10). The adversary is faster and wants to avoid being hit by the three agents; obstacles block the way. Cooperative navigation contains four agents and four corresponding landmarks. Agents are penalized with a reward of -1 if they collide with other agents; thus, agents have to learn to cover all the landmarks while avoiding collisions. Both games end when exceeding 100 steps.
Both domains exhibit the sample inconsistency problem, since each agent's local observation contains the relative distances to the other agents, obstacles, and landmarks. Moreover, in cooperative navigation each agent is assigned a different task, i.e., approaching a landmark different from the others', so agents may receive different rewards under the same observation. We therefore cannot directly use all experience to update one shared option-value network. In this case, we design an individual option learning module for each agent in MAOPT-LOA, which collects only that agent's experience to update the option-value function. Figure 5(a) shows the average rewards on predator-prey. All of our proposed variants, MAOPT-GOA, MAOPT-LOA, and MAOPT-SRO (combined with PPO), achieve higher average rewards than PPO and DVM. Figure 5(b) demonstrates a similar phenomenon: both MAOPT-GOA and MAOPT-SRO (combined with MADDPG) perform better than vanilla MADDPG, and MAOPT-SRO performs best among all methods. This is because MAOPT-SRO uses all agents' experience for the update and efficiently distinguishes which part of the information is useful, providing positive advice for each agent. Furthermore, it uses individual termination probabilities to determine when to terminate each agent's advice, a more flexible mechanism that facilitates more efficient and effective knowledge transfer among agents. Table 1 shows the average distance between each agent and its nearest landmark (row 1) and the average collision frequency among agents (row 2) in cooperative navigation. In this game, agents are required to cover all landmarks while avoiding collisions, so a better result means a smaller average distance between agents and landmarks and fewer collisions among agents. MAOPT-GOA, MAOPT-LOA, and MAOPT-SRO achieve fewer collisions and shorter average distances to landmarks than the other methods.
Furthermore, MAOPT-SRO performs best among all methods. The advantage of MAOPT is due to its effectiveness in identifying useful information from other agents' policies; each agent exploits the useful knowledge of other agents, leading to the fewest collisions and the smallest distances to landmarks. Finally, we provide an ablation study investigating whether MAOPT-SRO selects a suitable policy for each agent and thus efficiently enables agents to exploit useful information from others. Figure 6 presents the action movement in the environment, where each arrow shows the direction of movement caused by the chosen action at each location. The four panels show the direction of movement caused by the action selected from the policy of an agent at $t_1 = 6 \times 10^5$ steps (Figure 6(a), top left) and at $t_2 = 2 \times 10^6$ steps (Figure 6(c), bottom left), and the direction of movement caused by the action selected from the intra-option policies of MAOPT-SRO at $t_1 = 6 \times 10^5$ steps (Figure 6(b), top right) and at $t_2 = 2 \times 10^6$ steps (Figure 6(d), bottom right). The preferred direction of movement is towards the blue circle. The actions selected by the intra-option policies of MAOPT-SRO are more accurate than those selected by the agent's own policy, i.e., more prone to pursue the adversary (blue). This indicates that the advised policy selected by MAOPT-SRO performs better than the agent itself, meaning MAOPT-SRO successfully distinguishes useful knowledge from other agents; the agent therefore learns faster and better by exploiting knowledge from the advised policy than by learning from scratch.

A PSEUDO-CODE FOR MAOPT-GOA AND MAOPT-IOA

Algorithm 2 MAOPT-GOA
1: Initialize: the joint option set Ω_1 × ⋯ × Ω_n, each Ω_i = {ω_1, ω_2, ⋯, ω_n}; joint option-value network parameters ψ and target network parameters ψ′; termination network parameters φ; replay buffer D; actor network parameters ρ_i and critic network parameters υ_i for each agent i; batch size T for PPO
2: for each episode do
3:   Each agent i obtains its local observation o^i of the current state s
4:   The option advisor selects a joint option ω
5:   for each agent i do
6:     Select action a^i ∼ π^i(o^i)
7:   end for
8:   Perform the joint action a = {a^1, ⋯, a^n}
9:   Observe reward r = {r^1, ⋯, r^n} and new state s′
10:  Store transition (s, a, r, ω, s′) to replay buffer D
11:  Select another joint option ω if ω terminates
12:  for every T steps do
13:    for each agent i do
14:      Set π^i_old = π^i
15:      Calculate the advantage A^i = Σ_{t′>t} γ^{t′−t} r^i_{t′} − V_{υ_i}(o^i_t)
16:      Optimize the critic loss L^i_c = Σ^T_{t=1} (Σ_{t′>t} γ^{t′−t} r^i_{t′} − V_{υ_i}(o^i_t))^2 w.r.t. υ_i
17:      The option advisor calculates the transfer loss L^i_tr = f(t) H(π_{ω_i} | π^i)
18:      Optimize the actor loss L^i_a = −Σ^T_{t=1} [(π^i(a^i_t|o^i_t) / π^i_old(a^i_t|o^i_t)) A^i_t − λ KL[π^i_old | π^i]] + L^i_tr w.r.t. ρ_i
19:    end for
20:  end for
21:  Sample a batch of B transitions from replay buffer D
22:  for each sample (s, a, r, s′) do
23:    for each joint option ω do
24:      if π_{ω_i} selects action a^i for all ω_i ∈ ω then
25:        Use this sample to update ω:
26:        y ← r + γ U(s′, ω | ψ′)
27:        Optimize the option-value loss L_ω = (1/B) Σ_b (y_b − Q_ω(s_b, ω | ψ))^2 w.r.t. ψ
28:        Optimize the termination loss w.r.t. φ: Δφ = −α (∂β(s′, ω | φ)/∂φ)(Q_ω(s′, ω | ψ) − max_{ω″} Q_ω(s′, ω″ | ψ))
29:      end if
30:    end for
31:  end for
32:  Copy ψ to ψ′ every k steps
33: end for

Algorithm 2 illustrates the whole procedure of MAOPT-GOA. First, we initialize the network parameters of the joint option-value network, the termination network, the joint option-value target network, and the actor and critic networks of each agent i. For each episode, each agent i first obtains its local observation o^i, which corresponds to the current state s (Line 3). Then, MAOPT-GOA selects a joint option ω for all agents (Line 4), and each agent selects an action a^i following its policy π^i (Lines 5-7). The joint action a is performed, and the reward r and new state s′ are returned by the environment (Lines 8, 9). The transition is stored in the replay buffer D (Line 10). If ω terminates, MAOPT-GOA selects another joint option ω (Line 11). For each update step, each agent updates its critic network by minimizing the loss L^i_c (Line 16), and updates its actor network by minimizing the sum of the original loss L^i_a and the transfer loss L^i_tr (Line 18). To update the GOA, we first sample a batch of B transitions from the replay buffer D (Line 21). The GOA then updates the joint option-value network by minimizing the standard L2 loss L_ω (Lines 22-27). Finally, the termination probability of the selected joint option is updated (Line 28).

Algorithm 3 MAOPT-IOA
1: Initialize: the option set {ω_1, ⋯, ω_n}; option-value network parameters ψ and target network parameters ψ′; termination network parameters φ; replay buffer D_i, actor network parameters ρ_i, and critic network parameters υ_i for each agent i; batch size T for PPO
2: for each episode do
3:   Each agent i obtains its local observation o^i of the current state s
4:   for each agent i do
5:     The option advisor selects an option ω for agent i
6:     Select action a^i ∼ π^i(o^i)
7:   end for
8:   Perform the joint action a = {a^1, ⋯, a^n}
9:   Observe reward r = {r^1, ⋯, r^n} and new state s′
10:  for each agent i do
11:    Store transition (o^i, a^i, r^i, o^i′, ω, i) to replay buffer D_i
12:    Select another option ω if ω terminates
13:  end for
14:  for every T steps do
15:    for each agent i do
16:      Set π^i_old = π^i
17:      Calculate the advantage A^i = Σ_{t′>t} γ^{t′−t} r^i_{t′} − V_{υ_i}(o^i_t)
18:      Optimize the critic loss L^i_c = Σ^T_{t=1} (Σ_{t′>t} γ^{t′−t} r^i_{t′} − V_{υ_i}(o^i_t))^2 w.r.t. υ_i
19:      The option advisor calculates the transfer loss L^i_tr = f(t) H(π_ω | π^i)
20:      Optimize the actor loss L^i_a = −Σ^T_{t=1} [(π^i(a^i_t|o^i_t) / π^i_old(a^i_t|o^i_t)) A^i_t − λ KL[π^i_old | π^i]] + L^i_tr w.r.t. ρ_i
21:    end for
22:  end for
23:  Sample a batch of B/N transitions from each agent's replay buffer D_i
24:  for each sample (o^i, a^i, r^i, o^i′, ω, i) do
25:    for each option ω do
26:      if π_ω selects action a^i then
27:        y ← r^i + γ U(o^i′, ω | ψ′)
28:        Optimize the option-value loss L_ω = (1/B) Σ_b (y_b − Q_ω(o^i_b, ω | ψ))^2 w.r.t. ψ
29:      end if
30:      Optimize the termination loss w.r.t. φ:
31:        Δφ = −α (∂β(o^i′, ω | φ)/∂φ)(Q_ω(o^i′, ω | ψ) − max_{ω″} Q_ω(o^i′, ω″ | ψ))
32:    end for
33:  end for
34:  Copy ψ to ψ′ every k steps
35: end for

Algorithm 3 illustrates the whole procedure of MAOPT-IOA. First, we initialize the network parameters of the option-value network, the termination network, the option-value target network, and the actor and critic networks of each agent i. For each episode, each agent i first obtains its local observation o^i, which corresponds to the current state s (Line 3). Then, MAOPT-IOA selects an option ω for each agent i (Line 5), and each agent selects an action a^i following its policy π^i (Line 6). The joint action a is performed, and the reward r and new state s′ are returned by the environment (Lines 8, 9). Each transition is stored in the corresponding agent's replay buffer D_i (Line 11). If ω terminates, MAOPT-IOA selects another option for agent i (Line 12). For each update step, each agent updates its critic network by minimizing the loss L^i_c (Line 18), and updates its actor network by minimizing the sum of the original loss L^i_a and the transfer loss L^i_tr (Line 20). To update the IOA, we first sample a batch of B/N transitions from each agent's buffer D_i (Line 23). MAOPT-IOA then updates the option-value network by minimizing the standard L2 loss L_ω (Lines 24-29). Finally, the termination probability of the selected option is updated (Line 30).
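As a concrete illustration, the two key updates shared by Algorithms 2 and 3 — the decayed transfer loss added to each agent's actor loss, and the TD target y = r + γU(s′, ω) for the option-value function — can be sketched as follows. This is a minimal NumPy sketch under simplifying assumptions, not the paper's implementation: the exponential form of the decay f(t), the tabular option-value table standing in for the network, and the fixed termination probability `beta` are all illustrative choices.

```python
import numpy as np

def transfer_loss(pi_advised, pi_agent, t, decay=1e-5):
    """L_tr = f(t) * H(pi_omega | pi_i): cross-entropy between the advised
    intra-option policy and the agent's policy, weighted by a factor f(t)
    that anneals the advice over time (exponential decay assumed here)."""
    f_t = np.exp(-decay * t)
    cross_entropy = -np.sum(pi_advised * np.log(pi_agent + 1e-8))
    return f_t * cross_entropy

# Toy tabular option-value table Q[s, omega] standing in for the
# option-value network; gamma, lr, beta are illustrative constants.
n_states, n_options, gamma, lr, beta = 4, 3, 0.95, 0.5, 0.1
Q = np.zeros((n_states, n_options))

def u_value(s_next, omega):
    """U(s', w): value upon arrival, mixing continuation of omega with
    re-selection of the best option if omega terminates (prob. beta)."""
    return (1 - beta) * Q[s_next, omega] + beta * Q[s_next].max()

def option_value_update(s, omega, r, s_next):
    """One TD step toward the target y = r + gamma * U(s', omega),
    mirroring the batched L2 loss on y_b in the algorithms."""
    y = r + gamma * u_value(s_next, omega)
    Q[s, omega] += lr * (y - Q[s, omega])
```

Because f(t) shrinks as training proceeds, the transfer term dominates early (imitating the advised policy) and vanishes later, so each agent gradually falls back on its own PPO objective.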

B ENVIRONMENT ILLUSTRATIONS AND DESCRIPTIONS

Pac-Man (van der Ouderaa, 2016) is a mixed cooperative-competitive maze game with one pac-man player and two ghost players (Figure 7). The most complex scenario, shown in Figure 10, contains four ghost players and one pac-man player. The goal of the pac-man player is to eat as many pills (white circles in the grids) as possible while avoiding the pursuit of the ghost players; the ghost players aim to capture the pac-man player as quickly as possible. In our settings, we control the ghost players, while the pac-man player, as the opponent, is controlled by a well pre-trained PPO policy. The game ends when one ghost catches the pac-man player or the episode exceeds 100 steps. Each player receives a -0.01 penalty per step, and each ghost player receives a +5 reward for catching the pac-man player.

MPE (Lowe et al., 2017) is a simple multiagent particle world with continuous observations and a discrete action space. We evaluate the performance of MAOPT on two scenarios: predator and prey (Figure 8 with four agents and Figure 11 with twelve agents), and cooperative navigation (Figure 9 with four agents and Figure 12 with ten agents). Predator and prey contains three (nine in Figure 11) agents (green), which are slower and want to catch one adversary (blue), receiving +10 for each hit. The adversary is faster and wants to avoid being hit by the other agents; obstacles (grey) block the way. Cooperative navigation contains four (ten in Figure 12) agents (green) and four (ten) corresponding landmarks (crosses). Agents are penalized with a reward of -1 if they collide with other agents; thus, agents have to learn to cover all the landmarks while avoiding collisions. At each step, each agent receives a reward equal to the negative distance between itself and its nearest landmark. Both games end after 100 steps.
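The cooperative-navigation reward just described (the negative distance to the nearest landmark, minus 1 per collision) can be written compactly as below. This is an illustrative sketch; the `collision_radius` constant is an assumption, not a value taken from the environment.

```python
import numpy as np

def navigation_reward(agent_pos, landmarks, other_agents, collision_radius=0.1):
    """Per-step reward for one agent in cooperative navigation:
    -(distance to its nearest landmark) - 1 for every collision."""
    nearest = min(np.linalg.norm(agent_pos - lm) for lm in landmarks)
    collisions = sum(
        np.linalg.norm(agent_pos - other) < collision_radius
        for other in other_agents
    )
    return -nearest - 1.0 * collisions
```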

State Description

Pac-Man The layout sizes of the three scenarios are 25 × 9 (OpenClassic), 20 × 11 (MediumClassic), and 28 × 27 (OriginalClassic). The observation of each ghost player contains its own position and the positions of its teammate, the walls, the pills, and the pac-man, encoded as a one-hot vector. The network input is 68-dimensional in OpenClassic, 62-dimensional in MediumClassic, and 111-dimensional in OriginalClassic.

MPE The observation of each agent contains its velocity, its position, and the relative distances to landmarks, blocks, and other agents. The network input is 18-dimensional in predator and prey with four agents (36-dimensional with twelve agents), and 24-dimensional in cooperative navigation with four agents (60-dimensional with ten agents).

C ADDITIONAL RESULTS AND ANALYSIS

We here summarize the components and properties of our framework and list the suitable scenarios for each option advisor. MAOPT-GOA contains n agent models and one model for learning the joint option-value function and termination probabilities. MAOPT-GOA is used when the global information of the environment is available.

D THEORETICAL ANALYSIS

We first explain why the advised policy is better than the agent's own policy. If none of the other agents' policies is better than the agent's own, then the advised policy is the agent's own policy, and there is no need to imitate. The intuition behind such transfer is mutual imitation among agents: if an agent imitates a policy that is better than its own, it can achieve higher performance. Which policy each agent should imitate is decided by our option advisor. The option-value function estimates the performance of each option and its intra-option policy; therefore, we select the option with the maximum option-value for each agent, and the agent imitates the intra-option policy of that option. The convergence of option-value learning has been proved and verified (Sutton et al., 1999; Bacon et al., 2017). Therefore, the advised policy is the best among all policies at the current timestep.

Next, we provide a theoretical analysis showing that each agent's policy finally converges to the imitated policy through imitation. Given n agents and n options, for each agent i the option advisor selects the option ω = arg max_ω Q(s, ω), and ω contains the policy π^*_{i,t}(ρ^*_t) with the maximum expected return η^*_{i,t}(s). Each agent imitates the advised policy π^*_{i,t}(ρ^*_t) to minimize the difference between π^*_{i,t}(ρ^*_t) and π_{i,t}(ρ^i_t):

\dot{ρ}^i = α_i (η^*_{i,t}(s | ρ^*_t) − η_{i,t}(s | ρ^i_t)) (ρ^*_t − ρ^i_t).

If we set x_i = ρ^*_t − ρ^i_t, then the dynamics of x_i are

\dot{x}_i = −α_i (η^*_{i,t}(s) − η_{i,t}(s)) x_i,

and in vector form \dot{x} = A x, where A is the diagonal matrix

A = diag(−α_1 (η^*_{1,t}(s | ρ^*_t) − η_{1,t}(s | ρ^1_t)), ⋯, −α_n (η^*_{n,t}(s | ρ^*_t) − η_{n,t}(s | ρ^n_t))).

Note that η^*_{i,t}(s) − η_{i,t}(s) ≥ 0, so −α_i (η^*_{i,t}(s) − η_{i,t}(s)) ≤ 0: the main diagonal of A contains only non-positive values. Therefore, the real part of every eigenvalue of A is non-positive. By Lyapunov's stability theorem (Shil'nikov, 2001), the system \dot{x} = A x is globally asymptotically stable, and each x_i approaches 0: lim_{t→∞} x_i = 0 for i ∈ {1, 2, ⋯, n}. Therefore, each policy converges to the advised policy through imitation. To conclude, for each agent the advised policy is at least as good as the agent's own policy, and each agent's policy converges to the advised policy through imitation. Thus, each agent's policy converges to an improved policy, and this does not affect the convergence of the vanilla RL algorithm.
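The stability argument can be checked numerically. Under a forward-Euler discretization of the imitation update with a constant non-negative return gap (a simplification — in training the gap η* − η_i itself changes over time), the parameter gap x = ρ* − ρ decays geometrically to zero, as the Lyapunov argument predicts. A toy sketch:

```python
def imitate(rho_star, rho, alpha, eta_gap, steps=2000, dt=0.01):
    """Euler steps of rho_dot = alpha * eta_gap * (rho_star - rho).

    With eta_gap >= 0 the gap x = rho_star - rho shrinks by the factor
    (1 - dt * alpha * eta_gap) each step, so rho converges to rho_star;
    eta_gap is held constant here purely for illustration.
    """
    for _ in range(steps):
        rho = rho + dt * alpha * eta_gap * (rho_star - rho)
    return rho
```

Note that with eta_gap = 0 (the agent's own policy is already the best) the update is a no-op, matching the case where no imitation is needed.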

E NETWORK STRUCTURE AND PARAMETER SETTINGS

Network Structure Here we provide the network structure for PPO and MAOPT-SRO shown in Figure 16 (a) and (b) respectively.

Parameter Settings

Here we provide the hyperparameters for MAOPT and DVM, as well as the two baselines, PPO and MADDPG, in Tables 5 and 6, respectively.



† The details of neural network structures and parameter settings are in the appendix, and we share network parameters among all homogeneous agents (Gupta et al., 2017; Rashid et al., 2018).



Figure 1: Framework overview.

Figure 3: The SRO architecture.


Figure 4: The performance on Pac-Man.

Figure 5: The performance on predator and prey.

Figure 6: Analysis of agent 1's policy and MAOPT-SRO's policy.




Figure 7: Pac-Man. Figure 8: Predator and prey (N = 4).


Figure 9: Cooperative navigation (N = 4).

Figure 10: OriginalClassic. Figure 11: Predator and prey (N = 12).

Figure 12: Cooperative navigation (N = 10).

Figure 13: The performance on predator and prey (N = 4).



Algorithm (MAOPT-SRO) initialization: state feature parameters θ, reward weights w, state reconstruction network parameters, termination network parameters φ, SR network parameters τ, SR target network parameters τ′, batch size T for PPO, and replay buffer D_i, actor network parameters ρ_i, and critic network parameters υ_i for each agent i; then the main loop runs for each episode.

Table 1: Average number of collisions and average distance to the nearest landmark in cooperative navigation.



In practice, however, only partial observations are available in some environments. Therefore, we also provide MAOPT-LOA to enable knowledge transfer among agents in this setting. MAOPT-LOA contains n agent models and one model for learning the individual option-value function and termination probabilities. The option model adopts parameter sharing similar to common MARL training. However, each agent only obtains local observations and individual reward signals, which may differ across agents even at the same state. If we use such inconsistent experiences to update the option-value network and termination network, the estimate of the option-value function oscillates and becomes inaccurate. Due to partial observability and reward conflicts, we design a novel option learning scheme based on successor features. MAOPT-SRO contains n agent models and one model for learning the individual SRO value function and termination probabilities. The SRO model adopts parameter sharing similar to common MARL training.

Aspects of MAOPT with three kinds of option advisors.

Average return with standard deviation in predator and prey (N = 4).


5. CONCLUSION AND FUTURE WORK

In this paper, we propose a novel MultiAgent Option-based Policy Transfer (MAOPT) framework for efficient multiagent learning that takes advantage of option-based policy transfer. Our framework learns what advice to give to each agent and when to terminate it by modeling multiagent transfer as the option learning problem. Furthermore, to improve the robustness of our framework, we provide two types of advisors: MAOPT-GOA, which is adopted in fully cooperative settings (with access to the global state and reward), and MAOPT-LOA together with MAOPT-SRO, which are proposed for mixed settings (with access only to local, possibly inconsistent observations and individual rewards). MAOPT-SRO solves the sample inconsistency caused by partial observability by decoupling the dynamics of the environment from the rewards, learning the option-value function under each agent's preference. MAOPT can be easily combined with existing DRL approaches, and experimental results show it significantly boosts the performance of existing DRL methods. As for future work, it is worth investigating how to achieve coordination among agents by designing MAOPT-GOA in a centralized-training, decentralized-execution manner; for example, how to decompose the joint option-value function into individual option-value functions and update each termination probability separately.

