LEARNING TO SHARE IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In this paper, we study the problem of networked multi-agent reinforcement learning (MARL), where a number of agents are deployed as a partially connected network. Networked MARL requires all agents to make decisions in a decentralized manner to optimize a global objective, with communication restricted to neighbors over the network. We propose a hierarchically decentralized MARL method, LToS, which enables agents to learn to dynamically share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. The two policies form a bi-level optimization and learn alternately. We empirically demonstrate that LToS outperforms existing methods in both a social dilemma and two networked MARL scenarios.

1. INTRODUCTION

In multi-agent reinforcement learning (MARL), multiple agents interact with the environment via their joint action to cooperatively optimize an objective. Many methods of centralized training and decentralized execution (CTDE) have been proposed for cooperative MARL, such as VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019). However, these methods suffer from the overgeneralization issue (Palmer et al., 2018; Castellini et al., 2019). Moreover, they may not easily scale up with the number of agents due to centralized learning (Qu et al., 2020a). In many MARL applications, a large number of agents are deployed as a partially connected network and collaboratively make decisions to optimize the globally averaged return, such as smart grids (Dall'Anese et al., 2013), network routing (Jiang et al., 2020), traffic signal control (Chu et al., 2020), and IoT (Xu et al., 2019). To deal with such scenarios, networked MARL is formulated to decompose the dependency among all agents into dependencies between neighbors only. To avoid decision-making with insufficient information, agents are permitted to exchange messages with neighbors over the network. In such settings, it is feasible for agents to learn to make decisions in a decentralized way (Zhang et al., 2018; Qu et al., 2020b). However, dependency remains a difficulty when each agent attempts to make decisions independently, e.g., in the prisoner's dilemma and the tragedy of the commons (Pérolat et al., 2017). Existing methods tackle these problems by consensus update of value functions (Zhang et al., 2018), credit assignment (Wang et al., 2020), or reward shaping (Chu et al., 2020). However, these methods rely on either access to the global state and joint action (Zhang et al., 2018) or handcrafted reward functions (Wang et al., 2020; Chu et al., 2020).
Inspired by the fact that sharing plays a key role in how humans learn to cooperate, in this paper we propose Learning To Share (LToS), a hierarchically decentralized learning method for networked MARL. LToS enables agents to learn to dynamically share reward with neighbors so as to collaboratively optimize the global objective. The high-level policies decompose the global objective into local ones by determining how to share their rewards, while the low-level policies optimize the local objectives induced by the high-level policies. LToS learns in a decentralized manner, and we prove that the high-level policies are a mean-field approximation of the joint high-level policy. Moreover, the high-level and low-level policies form a bi-level optimization and alternately learn to optimize the global objective. LToS is easy to implement and is currently realized with DDPG (Lillicrap et al., 2016) as the high-level policy and DGN (Jiang et al., 2020) as the low-level policy. We empirically demonstrate that LToS outperforms existing methods for networked MARL in both a social dilemma and two real-world scenarios. To the best of our knowledge, LToS is the first method that learns to share reward for global optimization in networked MARL.

2. RELATED WORK

There are many recent studies on collaborative MARL. Most of them adopt centralized training and decentralized execution, such as COMA (Foerster et al., 2018), VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), and QTRAN (Son et al., 2019). Many are built on factorizing the joint Q-function by assuming additivity (Sunehag et al., 2018), monotonicity (Rashid et al., 2018), or factorizable tasks (Son et al., 2019). However, they are trained in a centralized way and hence may not easily scale up with the number of agents in networked MARL (Qu et al., 2020a). Moreover, these factorized methods suffer from the overgeneralization issue (Palmer et al., 2018; Castellini et al., 2019). Other studies focus on decentralized training specifically in networked MARL, to which our work is more closely related. Zhang et al. (2018) proposed consensus update of value functions, but it requires the global state at each agent, which is usually unavailable in decentralized training. Chu et al. (2020) introduced a spatial discount factor to capture the influence between agents, but it remains hand-tuned. Sodomka et al. (2013) and Peysakhovich & Lerer (2018b) leveraged the concept of transferable utility to encourage cooperation, and Peysakhovich & Lerer (2018a) resorted to game theory and gave more complex reward designs. However, these methods cannot be extended beyond two-player games. Hughes et al. (2018) proposed the inequity aversion model to balance agents' selfish desires and social fairness. Wang et al. (2020) considered learning the Shapley value for credit assignment. However, these methods still rely on hand-crafted reward designs. Mguni et al. (2019) added an extra term to the original reward as non-potential-based reward shaping and used Bayesian optimization to induce convergence to a desirable equilibrium between agents.
However, the extra term remains fixed during an episode, which makes it less capable of dealing with dynamic environments. Moreover, the reward shaping alters the original optimization problem.

3.1. NETWORKED MULTI-AGENT REINFORCEMENT LEARNING

Assume N agents interact with an environment. Let V = {1, 2, ..., N} be the set of agents. The multi-agent system is modeled as an undirected graph G(V, E), where each agent i serves as vertex i and E ⊆ V × V is the set of all edges. Two agents i, j ∈ V can communicate with each other if and only if e_ij = (i, j) ∈ E. We denote agent i together with all its neighbors in the graph as a set N_i. The state of the environment s ∈ S transitions upon joint action a ∈ A according to the transition probability P_a : S × A × S → [0, 1], where the joint action set A = ×_{i∈V} A_i. Each agent i has a policy π_i ∈ Π_i : S × A_i → [0, 1], and we denote the joint policy of all agents as π ∈ Π = ×_{i∈V} Π_i. For networked MARL, a common and realistic assumption is that the reward of each agent i depends only on its own action and the actions of its neighbors (Qu et al., 2020a), i.e., r_i(s, a) = r_i(s, a_{N_i}). Moreover, each agent i may obtain only a partial observation o_i ∈ O_i, but can approximate the state by the observations of N_i (Jiang et al., 2020) or by the observation history (Chu et al., 2020), both denoted by o_i for simplicity. The global objective is to maximize the sum of cumulative rewards of all agents, i.e., Σ_{t=0}^∞ Σ_{i=1}^N γ^t r_i^t.
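As a concrete sketch of this setup, the neighborhood sets N_i (each agent together with its neighbors) and the discounted global objective can be written in a few lines; the function and variable names below are illustrative, not from the paper.

```python
def neighborhoods(num_agents, edges):
    """Build N_i = {i} ∪ {neighbors of i} from an undirected edge list."""
    N = {i: {i} for i in range(num_agents)}
    for i, j in edges:
        N[i].add(j)
        N[j].add(i)
    return N

def global_return(rewards, gamma=0.99):
    """Discounted global objective: sum_t sum_i gamma^t * r_i^t.
    `rewards` is a list over timesteps of per-agent reward lists."""
    return sum(gamma**t * sum(r_t) for t, r_t in enumerate(rewards))
```

For a 3-agent line graph with edges (0, 1) and (1, 2), agent 1's neighborhood is all three agents while agents 0 and 2 only see one neighbor each.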

3.2. MARKOV GAME

In such a setting, each agent may individually maximize its own expected return, which is known as a Markov game. This may lead to a stable outcome or Nash equilibrium, which however is usually sub-optimal. Given π, the value function of agent i is given by

v_i^π(s) = Σ_a π(a|s) Σ_{s'} p_a(s'|s, a) [r_i + γ v_i^π(s')],   (1)

where p_a ∈ P_a describes the state transitions. A Nash equilibrium is defined as (Mguni et al., 2019)

v_i^{(π_i, π_{-i})}(s) ≥ v_i^{(π_i', π_{-i})}(s), ∀π_i' ∈ Π_i, ∀s ∈ S, ∀i ∈ V,   (2)

where π_{-i} = ×_{j∈V\{i}} π_j.
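Once the joint policy is fixed, the per-agent value function above reduces to standard tabular policy evaluation. A minimal sketch, assuming the joint policy has already been marginalized into a state-transition matrix P and an expected-reward vector R_i (names illustrative):

```python
import numpy as np

def evaluate_agent_value(P, R_i, gamma=0.9, iters=500):
    """Tabular policy evaluation of Eq. (1) for a fixed joint policy:
    v_i(s) = R_i[s] + gamma * sum_s' P[s, s'] * v_i(s'),
    where P[s, s'] is the state-transition matrix with the joint policy
    marginalized out, and R_i[s] is agent i's expected reward in state s."""
    v = np.zeros(len(R_i))
    for _ in range(iters):
        v = R_i + gamma * P @ v  # Bellman expectation backup
    return v
```

With a single absorbing state, reward 1 and γ = 0.5, the fixed point is v = 1/(1 − γ) = 2.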

4. METHOD

The basic idea of LToS is to enable agents to learn how to share reward with neighbors such that they are encouraged to collaboratively optimize the global objective in networked MARL. LToS is a decentralized hierarchy. At each agent, the high-level policy determines the weights of reward sharing based on the low-level policies, while the low-level policy directly interacts with the environment to optimize the local objective induced by the high-level policies. Therefore, they form a bi-level optimization and alternately learn towards the global objective.

4.1. REWARD SHARING

The intuition of reward sharing is that if agents share their rewards with others, each agent has to consider the consequences of its actions on others, which promotes cooperation. In networked MARL, as the reward of an agent is assumed to depend on the actions of neighbors, we allow reward sharing between neighboring agents. For the graph of V, we additionally define a set of directed edges, D, constructed from E. Specifically, we add a loop d_ii ∈ D for each agent i and split each undirected edge e_ij ∈ E into two directed edges: d_ij = (i, j) and d_ji = (j, i) ∈ D. Each agent i determines a weight w_ij ∈ [0, 1] for each directed edge d_ij, ∀j ∈ N_i, subject to the constraint Σ_{j∈N_i} w_ij = 1, so that a w_ij proportion of agent i's environment reward r_i is shared with agent j. Let w ∈ W = ×_{d_ij∈D} w_ij be the weights of the graph. The shaped reward after sharing for each agent i is then defined as

r_i^w = Σ_{j∈N_i} w_ji r_j.   (3)
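Equation (3) amounts to reallocating the environment rewards along the directed edges. A minimal sketch (dictionary-based, names assumed):

```python
def shaped_rewards(env_rewards, W, N):
    """Eq. (3): r^w_i = sum_{j in N_i} w_{ji} * r_j.
    W[j][i] is the weight on directed edge (j, i): the fraction of agent j's
    environment reward shared with agent i; row j of W sums to 1 over N_j."""
    for j in W:
        # constraint from the paper: sum_{i in N_j} w_{ji} = 1
        assert abs(sum(W[j].values()) - 1.0) < 1e-8
    return {i: sum(W[j][i] * env_rewards[j] for j in N[i]) for i in N}
```

Because each row of W sums to one, the total is conserved: the shaped rewards sum to the original environment rewards, so reward sharing reallocates, rather than changes, the global objective.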

4.2. HIERARCHY

Assume there is a joint high-level policy φ ∈ Φ : S × W → [0, 1] to determine w. Given φ and w, we can define the value function of π at each agent i based on (1) as

v_i^π(s; φ) = Σ_w φ(w|s) Σ_a π(a|s, w) Σ_{s'} p_a(s'|s, a) [r_i^w + γ v_i^π(s'; φ)],   (4)
v_i^π(s; w, φ) = Σ_a π(a|s, w) Σ_{s'} p_a(s'|s, a) [r_i^w + γ v_i^π(s'; φ)].   (5)

We express w as a discrete action for simplicity; the results also hold for continuous actions if the summations are replaced by integrals. Let V_V^φ(s; π) ≐ Σ_{i∈V} v_i^π(s; φ) and Q_V^φ(s, w; π) ≐ Σ_{i∈V} v_i^π(s; w, φ).

Proposition 4.1. Given π, V_V^φ(s; π) and Q_V^φ(s, w; π) are respectively the value function and action-value function of φ.

Proof. Let r_i^φ ≐ Σ_a π(a|s, w) r_i^w and p_w(s'|s, w) ≐ Σ_a π(a|s, w) p_a(s'|s, a). As commonly assumed, the reward is deterministic given s and a, so from (4) we have

v_i^π(s; φ) = Σ_w φ(w|s) Σ_a π(a|s, w) [r_i^w + Σ_{s'} p_a(s'|s, a) γ v_i^π(s'; φ)]   (6)
            = Σ_w φ(w|s) Σ_{s'} p_w(s'|s, w) [r_i^φ + γ v_i^π(s'; φ)],   (7)

where p_w ∈ P_w : S × W × S → [0, 1] describes the state transitions given π. Let r_V^φ ≐ Σ_{i∈V} r_i^φ, and from (7) we have

V_V^φ(s; π) = Σ_{i∈V} Σ_w φ(w|s) Σ_{s'} p_w(s'|s, w) [r_i^φ + γ v_i^π(s'; φ)]   (8)
            = Σ_w φ(w|s) Σ_{s'} p_w(s'|s, w) [Σ_{i∈V} r_i^φ + γ Σ_{i∈V} v_i^π(s'; φ)]   (9)
            = Σ_w φ(w|s) Σ_{s'} p_w(s'|s, w) [r_V^φ + γ V_V^φ(s'; π)],   (10)

and similarly,

Q_V^φ(s, w; π) = Σ_{i∈V} Σ_{s'} p_w(s'|s, w) [r_i^φ + γ Σ_{w'} φ(w'|s') v_i^π(s'; w', φ)]   (11)
               = Σ_{s'} p_w(s'|s, w) [Σ_{i∈V} r_i^φ + γ Σ_{w'} φ(w'|s') Σ_{i∈V} v_i^π(s'; w', φ)]   (12)
               = Σ_{s'} p_w(s'|s, w) [r_V^φ + γ Σ_{w'} φ(w'|s') Q_V^φ(s', w'; π)].   (13)

Moreover, from the definitions of r_i^w and r_i^φ we have

r_V^φ = Σ_a π(a|s, w) Σ_{i∈V} r_i^w = Σ_a π(a|s, w) Σ_{i∈V} Σ_{j∈N_i} w_ji r_j   (14)
      = Σ_a π(a|s, w) Σ_{(i,j)∈D} w_ij r_i = Σ_a π(a|s, w) Σ_{i∈V} r_i.   (15)

Thus, given π, V_V^φ(s) and Q_V^φ(s, w) are respectively the value function and action-value function of φ in terms of the sum of expected cumulative rewards of all agents, i.e., the global objective. Proposition 4.1 implies that φ directly optimizes the global objective by generating w given π.
Unlike existing hierarchical RL methods, we can directly construct the value function and action-value function of φ based on the value function of π at each agent. As φ optimizes the global objective given π while each π_i individually optimizes the shaped reward at its agent given φ (assuming π converges to a Nash equilibrium or stable outcome, denoted as lim), they form a bi-level optimization. Let J^φ(π) and J^π(φ) denote the objectives of φ and π, respectively. The bi-level optimization can be formulated as follows:

max_φ J^φ(π*(φ))  s.t.  π*(φ) = arg lim_π J^π(φ).   (16)
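The alternating scheme behind Eq. (16) can be illustrated on a toy scalar problem, where the inner (low-level) variable tracks the objective induced by the outer (high-level) one and the outer variable takes first-order steps on the global objective. The objectives here are illustrative stand-ins, not the paper's:

```python
def bilevel_alternate(steps=2000, alpha=0.1, beta=0.01):
    """Toy sketch of Eq. (16): phi (high level) and pi (low level) are
    alternately updated with first-order gradient steps, each treating the
    other as fixed. Here the inner objective is (pi - phi)^2 and the outer
    objective is (pi*(phi) - 3)^2 with the best response pi*(phi) ≈ phi."""
    phi, pi = 0.0, 5.0
    for _ in range(steps):
        pi -= alpha * 2 * (pi - phi)    # inner: track the induced objective
        phi -= beta * 2 * (phi - 3.0)   # outer: first-order step on the global objective
    return phi, pi
```

Both variables converge to the global optimum (3.0) of this toy problem, mirroring how the two policy levels co-adapt during training.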

4.3. DECENTRALIZED LEARNING

Proposition 4.2. The joint high-level policy φ can be learned in a decentralized manner, and the decentralized high-level policies of all agents form a mean-field approximation of φ.

Proof. Let each d_ij ∈ D serve as a vertex with action w_ij and reward w_ij r_i in a new graph G'. Each vertex has its own local policy φ_ij(w_ij|s), and we can verify their independence by means of a Markov random field. For every i ∈ V, {d_ij | j ∈ N_i} forms a fully connected subgraph in G', because their actions are subject to the constraint Σ_{j∈N_i} w_ij = 1. As d_ij ∈ G' only connects to {d_ik | k ∈ N_i \ {j}}, the fully connected subgraph is also a maximal clique. According to the Hammersley-Clifford theorem (Hammersley & Clifford, 1971), we have

φ(w|s) ≈ Π_{i∈V} φ_i(w_i^out|s),

where w_i^out = {w_ij | j ∈ N_i}.

Propositions 4.1 and 4.2 indicate that for each agent i, the low-level policy simply learns a local π_i(a_i|s, w_i^in), where w_i^in = {w_ji | j ∈ N_i}, to optimize the cumulative reward of r_i^w, since r_i^w is fully determined by w_i^in according to (3). The high-level policy φ_i just needs to locally determine w_i^out to maximize the cumulative reward of r_i^φ. Therefore, for decentralized learning, (16) can be decomposed locally for each agent i as

max_{φ_i} J^{φ_i}(φ_{-i}, π_1*(φ), ..., π_N*(φ))  s.t.  π_i*(φ) = arg max_{π_i} J^{π_i}(π_{-i}, φ_1(π), ..., φ_N(π)).   (17)

We abuse the notation and let φ and π also denote their parameterizations, respectively. To solve the optimization, we have

∇_{φ_i} J^{φ_i}(φ_{-i}, π_1*(φ), ..., π_N*(φ)) ≈ ∇_{φ_i} J^{φ_i}(φ_{-i}, π_1 + α ∇_{π_1} J^{π_1}(φ), ..., π_N + α ∇_{π_N} J^{π_N}(φ)),

where α is the learning rate for the low-level policy. Letting π_i' denote π_i + α ∇_{π_i} J^{π_i}(φ), we have

∇_{φ_i} J^{φ_i}(φ_{-i}, π_1*(φ), ..., π_N*(φ)) ≈ ∇_{φ_i} J^{φ_i}(φ_{-i}, π_1', ..., π_N') + α Σ_{j=1}^N ∇²_{φ_i, π_j} J^{π_j}(φ) ∇_{π_j'} J^{φ_i}(φ_{-i}, π_1', ..., π_N').
The second-order derivative is neglected due to its high computational complexity, without incurring a significant performance drop, as in meta-learning (Finn et al., 2017) and neural architecture search (Liu et al., 2019). Similarly, we have

∇_{π_i} J^{π_i}(π_{-i}, φ_1*(π), ..., φ_N*(π)) ≈ ∇_{π_i} J^{π_i}(π_{-i}, φ_1 + β ∇_{φ_1} J^{φ_1}(π), ..., φ_N + β ∇_{φ_N} J^{φ_N}(π)),

where β is the learning rate of the high-level policy. Therefore, we can solve the bi-level optimization (16) by first-order approximations in a decentralized way. For each agent i, φ_i and π_i are alternately updated. In distributed learning, as each agent i usually does not have access to the state, we further approximate φ_i(w_i^out|s) and π_i(a_i|s, w_i^in) by φ_i(w_i^out|o_i) and π_i(a_i|o_i, w_i^in), respectively. Moreover, in networked MARL, as each agent i is closely related to its neighboring agents, (17) can further be seen as: π_i maximizes the cumulative discounted reward of r_i^w given φ_{N_i}, where φ_{N_i} = ×_{j∈N_i} φ_j, and φ_i optimizes the cumulative discounted reward of r_i^φ given π_{N_i}, where π_{N_i} = ×_{j∈N_i} π_j. During training, π_{N_i} and φ_{N_i} are implicitly considered through the interactions of w_i^out and w_i^in, respectively.

The architecture of LToS is illustrated in Figure 1. At each timestep, the high-level policy of each agent i decides the action w_i^out, the weights of reward sharing, based on the observation. Then, the low-level policy takes the observation and w_i^in as input and outputs the action. Agent i obtains the shaped reward according to w_i^in for both the high-level and low-level policies. The gradients are backpropagated along the purple dotted lines. Further, from Proposition 4.1, we have q_i^{φ_i}(s, w_i^out; π_{N_i}) = v_i^{π_i}(s; w_i^in, φ_{N_i}), where q_i^{φ_i} is the action-value function of φ_i given π_{N_i}, and v_i^{π_i} is the value function of π_i given φ_{N_i} and conditioned on w_i^in.
As aforementioned, we approximately have q_i^{φ_i}(o_i, w_i^out) = v_i^{π_i}(o_i; w_i^in). We can see that the action-value function of φ_i is equivalent to the value function of π_i. That said, we can use a single network to approximate the two functions simultaneously. For a deterministic low-level policy, the high-level and low-level policies can share the same action-value function. In the current instantiation of LToS, we use DDPG (Lillicrap et al., 2016) for the high-level policy and DGN (Jiang et al., 2020) (Q-learning) for the low-level policy. Thus, the Q-network of DGN also serves as the critic of DDPG, and the gradient of w_i^in is calculated based on the maximum Q-value over a_i. More discussions about training LToS and the detailed training algorithm are available in Appendix A.1.
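The DDPG-style gradient of the shared Q-network with respect to w_i^in can be sketched numerically. Here a finite-difference estimate stands in for the backward pass of an autodiff framework, and `q_fn` is an assumed stand-in for the DGN Q-network:

```python
import numpy as np

def greedy_q(q_fn, obs, w_in):
    """Maximum of the shared Q-network over discrete low-level actions."""
    return max(q_fn(obs, a, w_in) for a in range(q_fn.n_actions))

def weight_gradient(q_fn, obs, w_in, eps=1e-5):
    """g^in_i = d/d w^in [ max_a Q(o_i, a; w^in) ]: the gradient passed back
    to the neighbors' high-level policies. Estimated by central finite
    differences purely for illustration."""
    g = np.zeros_like(w_in)
    for k in range(len(w_in)):
        up, dn = w_in.copy(), w_in.copy()
        up[k] += eps
        dn[k] -= eps
        g[k] = (greedy_q(q_fn, obs, up) - greedy_q(q_fn, obs, dn)) / (2 * eps)
    return g
```

In the actual method, this gradient is exchanged with neighbors (g_i^in to g_i^out) so that each high-level policy network can be updated by the deterministic policy gradient.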

5. EXPERIMENTS

For the experiments, we adopt three scenarios, depicted in Figure 2. Prisoner is a grid game about social dilemma that easily measures agents' cooperation, while traffic and routing are real-world scenarios of networked MARL. We follow the principle of networked MARL that only allows communication within the neighborhood, as in Jiang et al. (2020) and Chu et al. (2020). To illustrate the reward sharing scheme each agent learns, we use a simple indicator: selfishness, the proportion of reward that an agent chooses to keep for itself. For ablation, we keep the sharing weights fixed for each agent, named fixed LToS. Throughout the experiments, we additionally compare with baselines including DQN, DGN, QMIX, and two methods for networked MARL, i.e., ConseNet (Zhang et al., 2018) and NeurComm (Chu et al., 2020), both of which take advantage of a recurrent neural network (RNN) for partially observable environments (Hausknecht & Stone, 2015). To maximize the average global reward directly, we specially tune the reward shaping factor of the other baselines in prisoner and introduce QMIX as a centralized baseline in traffic and routing. Moreover, as DGN is the low-level policy of LToS, DGN also serves as the ablation of LToS without reward sharing.

5.1. PRISONER

We use prisoner, a grid game version of the well-known matrix game prisoner's dilemma from Sodomka et al. (2013), to demonstrate that LToS is able to learn cooperative policies that achieve the global optimum (i.e., maximize the globally averaged return). As illustrated in Figure 2a, two agents A and B respectively start on the two sides of the middle of a grid corridor with full observation. At each timestep, each agent chooses an action, left or right, and moves to the corresponding adjacent grid; every action incurs a cost of -0.01. There are three goals: two at the ends and one in the middle. An agent gets a reward of +1 for reaching a goal. The game ends once an agent reaches a goal or the two agents reach different goals simultaneously. This game resembles prisoner's dilemma: going for the middle goal ("defect") brings more reward than going for the farther goal on one's own side ("cooperate"), but if both agents defect, a collision occurs and only one of the agents, chosen with equal probability, wins the goal. On the contrary, both agents obtain a higher return if they both cooperate, though it takes more steps.

Figure 3 illustrates the learning curves of all the models in terms of average return. Note that for all three scenarios, solid lines represent the average of three training runs with different random seeds, and shadowed areas the min/max values. As a result of self-interested optimization, DQN converges to the defect/defect Nash equilibrium where each agent receives an expected reward of about 0.5. So does DGN, since it only aims to take advantage of its neighbors' observations while prisoner is already a fully observable environment. ConseNet agents sometimes choose to cooperate by building a consensus on average return at the beginning, but this is unstable and abandoned subsequently. Given a hand-tuned reward shaping factor that directs agents to maximize the average return, NeurComm and fixed LToS agents are able to cooperate eventually.
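For reference, the prisoner environment described above can be sketched minimally. The corridor geometry (7 cells, goals at cells 0, 3, 6, starts at cells 2 and 4) is an assumption chosen to match the description, not taken from the paper:

```python
import random

class Prisoner:
    """Minimal sketch of the prisoner grid game. Action 0 = left, 1 = right.
    Goals sit at both ends and in the middle; a simultaneous arrival at the
    middle goal is a collision won by one agent at random."""
    GOALS = (0, 3, 6)

    def reset(self):
        self.pos = [2, 4]
        return tuple(self.pos)

    def step(self, actions):
        rewards = [-0.01, -0.01]  # every action costs 0.01
        for i, a in enumerate(actions):
            self.pos[i] += 1 if a == 1 else -1
        at_goal = [p in self.GOALS for p in self.pos]
        done = any(at_goal)
        if all(at_goal) and self.pos[0] == self.pos[1]:
            rewards[random.randrange(2)] += 1.0  # collision: random winner
        else:
            for i in range(2):
                if at_goal[i]:
                    rewards[i] += 1.0
        return tuple(self.pos), rewards, done
```

Under this sketch, mutual defection ends in one step with a total reward of 0.98 split unevenly, while mutual cooperation takes two steps and yields 0.99 for each agent, reproducing the dilemma's payoff structure.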
However, NeurComm and fixed LToS converge much more slowly. Coco-Q (Sodomka et al., 2013) and LToS outperform all other methods. As a modified tabular Q-learning method, Coco-Q introduces the coco value (Kalai & Kalai, 2010) as a substitute for the expected return in the Bellman equation and regards the difference as transferred reward. However, it is specifically designed for certain games and is hard to extend beyond two-player games. LToS learns the reward sharing scheme where one agent at first gives all its reward to the other so that both of them are prevented from defecting, and thus achieves the best average return quickly. Through prisoner, we verify that LToS can escape from a local optimum by learning to share reward.

In traffic, as illustrated in Figure 2b, we aim to investigate the capability of LToS in dealing with a highly dynamic environment through reward sharing. We adopt the same problem setting as in Wei et al. (2019). In a road network, each agent serves as the traffic signal control at an intersection. The observation of an agent consists of a one-hot representation of its current phase (directions for red/green lights) and the number of vehicles on each incoming lane of the intersection. At each timestep, an agent chooses a phase from the pre-defined phase set for the next time interval, i.e., 10 seconds. The reward is the negative of the sum of the queue lengths of all approaching lanes at the current timestep. The global objective is to minimize the average travel time of all vehicles in the road network, which is equivalent to minimizing the sum of queue lengths of all intersections over an episode (Zheng et al., 2019). The experiment was conducted on a traffic simulator, CityFlow (Zhang et al., 2019). We use a 6 × 6 grid network with 36 intersections. The traffic flows were generated to simulate dynamic traffic including both peak and off-peak periods, and the statistics are summarized in Table 1.
Figure 4 shows the learning curves of all the models in terms of average travel time of all vehicles in logarithmic form. The performance after convergence is summarized in Table 2, where LToS outperforms all other methods. LToS outperforms DGN, which demonstrates that the reward sharing scheme learned by the high-level policy indeed helps to improve the cooperation of agents. Without the high-level policy, i.e., given fixed sharing weights, fixed LToS does not perform well in a dynamic environment. This indicates the necessity of the high-level policy. Although NeurComm and ConseNet both take advantage of an RNN for partially observable environments, LToS still outperforms these methods, which verifies the improvement LToS brings to networked MARL. QMIX shows apparent instability and is confined to suboptimality (Mahajan et al., 2019). Specifically, even in its best episode, QMIX releases traffic flows from one direction while stopping flows from the other at all times.

5.2. TRAFFIC

We visualize the variation of selfishness of all agents during an episode in Figures 5 and 6. Figure 5 depicts the temporal variation of selfishness for each agent. For most agents, two valleys occur exactly during the two peak periods (i.e., 0-600s and 1,800-2,400s). This is because under heavy traffic agents need to cooperate more closely, which can be induced by being less selfish. We can also see this from the fact that selfishness is even lower in the second valley, where the traffic is heavier (i.e., 2 vs. 1 vehicles/s). This demonstrates that the agents learn to adjust their extent of cooperation to deal with a dynamic environment by controlling the sharing weights. Figure 6 shows the spatial pattern of selfishness at different timesteps, where the distribution of agents is the same as the road network in Figure 2b. The edge and inner agents tend to have very different selfishness. In addition, inner agents keep their selfishness more uniform during off-peak periods, while they diverge and present cross-like patterns during peak periods. This shows that handling heavier traffic requires more diverse reward sharing schemes among agents to promote more sophisticated cooperation.

5.3. ROUTING

Packet routing is regarded as a complex problem in distributed computer networks; here we consider a simplified version. A network consists of multiple routers with a stationary topology. Data packets enter the network (each starting at a router) following a Poisson distribution, and the arrival rate varies during an episode, as summarized in Table 3. Each router has a FIFO queue as its packet buffer. For simplicity, we assume that each queue has unlimited volume and that each packet has a size equal to each link's bandwidth.
At every timestep, each router observes the data packets in its queue and incoming links as well as the indices of neighboring routers, forwards the first packet in the FIFO to the selected next hop, and obtains a reward equal to the negative of its queue length. The transmission time of a packet over a link is proportional to the geographic distance, and the packet is stored after arriving at the next hop unless it has reached its destination. The delay of a packet is the sum of timesteps spent at routers and on links. The goal of packet routing is to send packets to their destinations through hop-by-hop transmissions with minimum average delay. Compared with traffic, routing is a more fine-grained task, because it requires specific control for each data packet. In the experiment, we choose a real network topology: the IBM backbone network of 18 vertices, each serving a city in North America (Knight et al., 2011). The topology is depicted in Figure 2c, where each edge consists of two unidirectional links and the edges vary considerably in distance. We assume that each router performs loopback detection while forwarding. Figure 7 illustrates the learning curves of all the models in terms of average delay, and their performance after convergence in terms of throughput and delay is summarized in Table 4. NeurComm, ConseNet and QMIX are not up to this task and may need many more episodes to converge. By learning proper reward sharing, LToS outperforms all other baselines in terms of both metrics. Compared to traffic, routing additionally involves a heterogeneous network topology. Therefore, the experimental results also verify the capability of LToS to handle both temporal and spatial heterogeneity in networked MARL.
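A single router in this setting can be sketched as below; next-hop selection, link delays, and the topology are elided, and all names are illustrative:

```python
import math
import random
from collections import deque

def poisson_sample(lam, rng=random):
    """Knuth's algorithm for one Poisson(lam) draw."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

class Router:
    """One router in a simplified packet-routing sketch: a FIFO buffer of
    unlimited volume; each timestep new packets arrive ~ Poisson(rate), the
    head-of-queue packet is forwarded, and the reward is -(queue length)."""
    def __init__(self, rate):
        self.rate = rate
        self.queue = deque()

    def step(self, next_hop=None):
        for _ in range(poisson_sample(self.rate)):
            self.queue.append(object())  # placeholder packet
        if self.queue:
            self.queue.popleft()  # forward the first packet to next_hop
        return -len(self.queue)  # reward: negative queue length
```

The per-router reward directly mirrors the paper's setup, so sharing it between neighbors encourages an upstream router not to flood a congested downstream one.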

6. CONCLUSION

In this paper, we proposed LToS, a hierarchically decentralized method for networked MARL. LToS enables agents to share reward with neighbors so as to encourage agents to cooperate on the global objective. For each agent, the high-level policy learns how to share reward with neighbors to decompose the global objective, while the low-level policy learns to optimize the local objective induced by the high-level policies in the neighborhood. Experimentally, we demonstrate that LToS outperforms existing methods in both a social dilemma and two networked MARL scenarios.

Algorithm 1 LToS
1: Initialize φ_i parameterized by θ_i and π_i parameterized by μ_i for each agent i (φ_i is learned using DDPG and π_i is learned using DGN, where they share the Q-network)
2: for episode = 1 to max-training-round do
3:   Initialize a random process X_w for w-action exploration
4:   Initialize a random process X_a for a-action exploration
5:   for max-episode-length do
6:     for each agent i do
7:       w_i^out ← φ_i(o_i) + X_w
8:       a_i ← π_i(o_i; w_i^in) + X_a
9:       Execute action a_i, obtain original reward r_i, and transition to o_i'
10:      Set y_i ← r_i^w + γ q_i^{π_i}(o_i', a_i'; w_i^in')|_{a_i' = π_i(o_i'; w_i^in')}
18:      Update π_i by ∇_{μ_i} (1/|D|) Σ_{(o_i, w_i^in, a_i, r_i^w, o_i')∈D} (y_i − q^{π_i}(o_i, a_i; w_i^in))²
19:      Exchange w_i^out ← φ_i(o_i) and get w_i^in
20:      Calculate the gradient g_i^in = ∇_{w_i^in} q_i^{π_i}(o_i, arg max_{a_i} q_i^{π_i}; w_i^in)
21:      Exchange g_i^in and get gradient g_i^out for w_i^out
22:      Update θ_i by (1/|D|) Σ_{o_i∈D} (∇_{θ_i} φ_i(o_i))^T g_i^out
23:      Softly update θ_i' and μ_i': θ_i' ← τ θ_i + (1 − τ) θ_i' and μ_i' ← τ μ_i + (1 − τ) μ_i'
24:    end for
25:  end if

26:  end for
27: end for

Selfishness Initializer. We choose to predetermine the initial selfishness to learn the high-level policy effectively. With normal initializers, the output of the high-level policy network would be evenly distributed initially. Therefore, we use a special selfishness initializer for each high-level policy network instead. As we use a softmax to produce the weights, which guarantees the constraint Σ_{j∈N_i} w_ij = 1, ∀i ∈ V, we specially set the bias of the last fully-connected layer so that each decentralized high-level policy network initially tends to keep for itself the same reward proportion as the given selfishness. The rest of the reward is evenly distributed among neighbors. LToS learns starting from such initial weights, while fixed LToS uses such weights throughout each experiment. Moreover, we use grid search to find the best selfishness for fixed LToS in traffic and routing. For prisoner, we deliberately set the selfishness to 0.5 so that fixed LToS directly optimizes the average return.

Unified Pseudo-Random Number Generator. LToS is learned in a decentralized manner, which incurs some difficulty for experience replay. As each agent i needs w_i^in to update the network weights of both the high-level and low-level policies, it should sample from its buffer a batch of experiences where each sampled experience is synchronized across the batches of all agents (i.e., the experiences should be collected at the same timestep). To handle this, all agents simply use a unified pseudo-random number generator and the same random seed.

Different Time Scales. As many hierarchical RL methods do, we set the high-level policy to run at a slower time scale than the low-level one. Proposition 4.1 still holds if we expand v_i^π more than one step forward. Assuming the high-level policy runs every M timesteps, we fix w_i^{out,t} = w_i^{out,t+1} = ... = w_i^{out,t+M-1}. M is referred to as the action interval in Table 6.
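The bias trick behind the selfishness initializer can be made concrete: with the other logits at zero, setting the self-edge bias b so that e^b / (e^b + n) equals the desired selfishness s gives b = ln(s·n / (1 − s)), where n is the number of neighbors. The helper name is assumed:

```python
import math

def selfishness_bias(s, num_neighbors):
    """Bias for the self-edge logit of the softmax output layer so that,
    with all other logits at 0, the agent keeps fraction `s` of its reward
    and the rest is split evenly among `num_neighbors` neighbors:
    e^b / (e^b + n) = s  =>  b = ln(s * n / (1 - s))."""
    return math.log(s * num_neighbors / (1.0 - s))
```

For example, s = 0.5 with a single neighbor gives a bias of 0 (an even split), while larger s pushes the initial softmax mass toward the self-edge.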
Infrequent Parameter Update with a Small Learning Rate. Based on the continuity of w, a small modification of φ means a slight modification of the local reward functions, and will intuitively result in an equally slight modification of the low-level value functions. This guarantees the low-level policies are highly reusable.

A.2 HYPERPARAMETERS

Table 5 summarizes the hyperparameters of DQN and DGN, which also serve as the low-level network of LToS. We follow many of the original DGN settings in prisoner and routing, but choose the setting of Wei et al. (2019) in traffic for consistency. Table 6 summarizes the hyperparameters of the high-level network of LToS, which differ from those of the low-level network. Table 7 summarizes the hyperparameters of NeurComm, ConseNet and QMIX.




Figure 1: LToS

Figure 2: Three experimental scenarios: (a) prisoner, (b) traffic, and (c) routing.

Figure 3: Learning curves in prisoner.

Figure 4: Learning curves in traffic.

Figure 5: Temporal pattern of selfishness Figure 6: Spatial pattern of selfishness

Figure 7: Learning curves in routing.

Statistics of traffic flows

Average travel time of all the models in traffic

Statistics of packet flow

Performance of all models in routing: throughput (packets) and average delay (timesteps)

Hyperparameters for DQN and DGN (also serves as the low-level policy network of LToS)

Hyperparameters for the high-level policy network of LToS

Hyperparameters for NeurComm, ConseNet and QMIX

APPENDIX

The hyperparameters of NeurComm and ConseNet adhere to the implementation of Chu et al. (2020). In addition, for tabular Coco-Q, the step-size parameter is 0.5. We adopt soft update for both the high-level and low-level networks and use an Ornstein-Uhlenbeck process (OU) for high-level exploration.

Both fixed LToS and NeurComm exploit static reward shaping, but they adopt different reward shaping schemes that are hard to compare directly. We consider a simple indicator: Self Neighbor Ratio (SNR), the ratio of the reward proportion that an agent keeps for itself to the proportion it obtains from a single neighbor. As the rest of the reward is evenly shared among neighbors in LToS, for each agent i we have SNR = selfishness / (1 − selfishness) × (|N_i| − 1) for LToS, and SNR = 1/α for NeurComm, where α is the spatial discount factor. We adjust the initial selfishness and α to set the SNR of both methods at the same level for a fair comparison.
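The two SNR formulas can be written directly; function names are illustrative. Recall that in LToS each of the |N_i| − 1 neighbors receives (1 − selfishness) / (|N_i| − 1), so the ratio simplifies as below:

```python
def snr_ltos(selfishness, neighborhood_size):
    """Self Neighbor Ratio for LToS: the kept share divided by a single
    neighbor's share, where the remainder (1 - selfishness) is split evenly
    among the (neighborhood_size - 1) neighbors (N_i includes agent i)."""
    return selfishness / (1.0 - selfishness) * (neighborhood_size - 1)

def snr_neurcomm(alpha):
    """SNR for NeurComm with spatial discount factor alpha."""
    return 1.0 / alpha
```

For instance, selfishness 0.5 with |N_i| = 3 gives SNR = 2, which NeurComm matches with a spatial discount factor of α = 0.5.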

