CORRECTING EXPERIENCE REPLAY FOR MULTI-AGENT COMMUNICATION

Abstract

We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely under the communicator's current policy, and thus a better reflection of the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. We find in our experiments that it substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks.

1. INTRODUCTION

Since the introduction of deep Q-learning (Mnih et al., 2013), it has become very common to use previous online experience, for instance stored in a replay buffer, to train agents in an offline manner. An obvious difficulty with doing this is that the information concerned may be out of date, leading the agent woefully astray in cases where the environment of an agent changes over time. One straightforward strategy is to discard old experiences. However, this is wasteful: it requires many more samples from the environment before adequate policies can be learned, and may prevent agents from leveraging past experience sufficiently to act in complex environments. Here, we consider an alternative, Orwellian, possibility of using present information to correct the past, showing that it can greatly improve an agent's ability to learn.

We explore a paradigm case involving multiple agents that must learn to communicate to optimise their own or task-related objectives. As with deep Q-learning, modern model-free approaches often seek to learn this communication off-policy, using experience stored in a replay buffer (Foerster et al., 2016; 2017; Lowe et al., 2017; Peng et al., 2017). However, multi-agent reinforcement learning (MARL) can be particularly challenging, as the underlying game-theoretic structure is well known to lead to non-stationarity, with past experience becoming obsolete as agents progressively come to use different communication codes. It is this non-stationarity that our correction addresses. Altering previously communicated messages is particularly convenient for our purposes, as it has no direct effect on the actual state of the environment (Lowe et al., 2019), but a quantifiable effect on the observed message, which constitutes the receiver's 'social environment'. We can therefore determine what the received message would be under the communicator's current policy, rather than what it was when the experience was first generated.
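Concretely, this determination amounts to re-running the sender's current policy on its stored observation and overwriting the stored message. The following is a minimal sketch of that operation; the field names and the toy policy are our own illustration, not the paper's implementation:

```python
import numpy as np

def relabel_transition(transition, sender_policy):
    """Replace the stored message with the one the sender's *current*
    policy would emit, so the stored experience better reflects the
    receiver's current social environment. Field names are illustrative."""
    corrected = dict(transition)  # leave the original experience intact
    # Recompute the message from the sender's stored observation using the
    # sender's current (rather than historical) communication policy.
    corrected["message"] = sender_policy(transition["sender_obs"])
    return corrected

# Toy sender policy: signal the sign of each observation component.
sender_policy = lambda obs: np.sign(obs)
old = {"sender_obs": np.array([0.7, -0.2]), "message": np.array([0.0, 0.0])}
new = relabel_transition(old, sender_policy)  # message becomes [1.0, -1.0]
```

Note that only the message field is touched: the environment-state portion of the transition is left exactly as experienced, which is what makes this correction cheap to apply.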
Once this is determined, we can simply relabel the past experience to better reflect the agent's current social environment, a form of off-environment correction (Ciosek & Whiteson, 2017). We apply our 'communication correction' using the framework of centralised training with decentralised control (Lowe et al., 2017; Foerster et al., 2018), in which extra information, in this case the policies and observations of other agents, is used during training to learn decentralised multi-agent policies.

2. BACKGROUND

Markov Games. A partially observable Markov game (POMG) (Littman, 1994; Hu et al., 1998) for $N$ agents is defined by a set of states $S$, and sets of actions $A_1, \ldots, A_N$ and observations $O_1, \ldots, O_N$ for each agent. In general, the stochastic policy of agent $i$ may depend on the set of action-observation histories $\mathcal{H}_i \equiv (O_i \times A_i)^*$, such that $\pi_i : \mathcal{H}_i \times A_i \to [0, 1]$. In this work we restrict ourselves to history-independent stochastic policies $\pi_i : O_i \times A_i \to [0, 1]$. The next state is generated according to the state transition function $P : S \times A_1 \times \ldots \times A_N \times S \to [0, 1]$. Each agent $i$ obtains deterministic rewards $r_i : S \times A_1 \times \ldots \times A_N \to \mathbb{R}$ and receives a deterministic private observation $o_i : S \to O_i$. There is an initial state distribution $\rho_0 : S \to [0, 1]$, and each agent $i$ aims to maximise its own discounted sum of future rewards $\mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_i(s, a)\right]$, where $\pi = \{\pi_1, \ldots, \pi_N\}$ is the set of policies for all agents, $a = (a_1, \ldots, a_N)$ is the joint action, and $\rho^{\pi}$ is the discounted state distribution induced by these policies starting from $\rho_0$.

Experience Replay. As an agent continually interacts with its environment, it receives experiences $(o_t, a_t, r_{t+1}, o_{t+1})$ at each time step. However, rather than using those experiences immediately for learning, it is possible to store them in a replay buffer, $D$, and sample them at a later point in time for learning (Mnih et al., 2013).
This breaks the correlation between samples, reducing the variance of updates and the potential to overfit to recent experience. In the single-agent case, prioritising samples from the replay buffer according to the temporal-difference error has been shown to be effective (Schaul et al., 2015). In the multi-agent case, Foerster et al. (2017) showed that issues of non-stationarity could be partially alleviated for independent Q-learners by importance sampling and the use of a low-dimensional 'fingerprint', such as the training iteration number.

MADDPG. Our method can be combined with a variety of algorithms, but we commonly employ it with multi-agent deep deterministic policy gradients (MADDPG) (Lowe et al., 2017), which we describe here. MADDPG is an algorithm for centralised training and decentralised control of multi-agent systems (Lowe et al., 2017; Foerster et al., 2018), in which extra information is used to train each agent's critic in simulation, whilst keeping policies decentralised so that they can be deployed outside of simulation. It uses deterministic policies, as in DDPG (Lillicrap et al., 2015), which condition only on each agent's local observations and actions. MADDPG handles the non-stationarity associated with the simultaneous adaptation of all the agents by introducing a separate centralised critic $Q_i^{\mu}(o, a)$ for each agent, where $\mu$ corresponds to the set of deterministic policies $\mu_i : O_i \to A_i$ of all agents. Here we have denoted the vector of joint observations for all agents as $o$. The multi-agent policy gradient for the policy parameters $\theta_i$ of agent $i$ is:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{o, a \sim D}\left[\nabla_{\theta_i} \mu_i(o_i) \, \nabla_{a_i} Q_i^{\mu}(o, a)\big|_{a_i = \mu_i(o_i)}\right],$$

where $D$ is the experience replay buffer, which contains the tuples $(o, a, r, o')$. Like DDPG, each $Q_i^{\mu}$ is approximated by a critic $Q_i^{w}$, which is updated to minimise the error with the target.

$$L(w_i) = \mathbb{E}_{o, a, r, o' \sim D}\left[(Q_i^{w}(o, a) - y)^2\right],$$

where $y = r_i + \gamma Q_i^{w}(o', a')$ is evaluated for the next state and action, as stored in the replay buffer. We use this algorithm with some additional changes (see Appendix A.3 for details).

Communication. One way to classify communication is whether it is explicit or implicit. Implicit communication involves transmitting information by changing the shared environment (e.g. scattering breadcrumbs). By contrast, explicit communication can be modelled as being separate from the environment, only affecting the observations of other agents. In this work, we focus on explicit communication, with the expectation that dedicated communication channels will frequently be integrated into artificial multi-agent systems such as driverless cars.
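Explicit communication of this kind can be modelled as an extra observation channel: messages leave the environment state untouched and are simply appended to each receiver's private observation. A minimal sketch under that assumption (the function and data layout are our own illustration):

```python
import numpy as np

def augment_observations(private_obs, messages):
    """Concatenate the messages addressed to each agent onto that agent's
    private observation of the environment state. The environment state
    itself is unaffected by what is communicated."""
    return [np.concatenate([o, m]) for o, m in zip(private_obs, messages)]

# Two agents, each receiving one scalar message from the other.
obs = [np.array([0.1, 0.2]), np.array([0.3, 0.4])]
msgs = [np.array([1.0]), np.array([0.0])]
joint_obs = augment_observations(obs, msgs)  # each entry now has length 3
```

Because the message enters only through the observation, relabelling a stored message changes nothing about the stored environment dynamics, which is what licenses the correction described in the introduction.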
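The MADDPG updates described above can also be illustrated numerically. The sketch below substitutes a scalar linear policy $\mu(o) = \theta o$ and a known quadratic critic for the learned networks; all quantities are illustrative stand-ins, not the algorithm as implemented with neural networks:

```python
# Toy stand-ins for the learned quantities (illustrative only):
# critic Q(o, a) = -(a - a_star)**2, so dQ/da = -2 * (a - a_star),
# and a scalar linear policy mu(o) = theta * o.
gamma, a_star = 0.95, 2.0
Q = lambda o, a: -(a - a_star) ** 2
dQ_da = lambda a: -2.0 * (a - a_star)

# Critic target for a sampled transition: y = r_i + gamma * Q_i(o', a').
def td_target(reward, next_o, next_a):
    return reward + gamma * Q(next_o, next_a)

# Deterministic policy gradient: grad_theta J = grad_theta mu(o) * dQ/da,
# evaluated at a = mu(o) = theta * o, giving o * dQ_da(theta * o).
theta, lr, o = 0.0, 0.1, 1.0
for _ in range(100):
    theta += lr * o * dQ_da(theta * o)  # gradient ascent on J(theta)
# theta converges towards a_star, the action the critic prefers.
```

With the critic held fixed this converges exactly; in the full algorithm the critic is itself learned from the targets, so the two updates are interleaved.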

