MULTI-AGENT IMITATION LEARNING WITH COPULAS

Abstract

Multi-agent imitation learning aims to train multiple agents to perform tasks from demonstrations by learning a mapping between observations and actions, which is essential for understanding physical, social, and team-play systems. However, most existing works on modeling multi-agent interactions assume that agents make independent decisions based on their observations, ignoring the complex dependence among agents. In this paper, we propose to use copulas, powerful statistical tools for capturing dependence among random variables, to explicitly model the correlation and coordination in multi-agent systems. Our proposed model separately learns marginals that capture the local behavioral patterns of each individual agent, and a copula function that solely and fully captures the dependence structure among agents. Extensive experiments on synthetic and real-world datasets show that our model outperforms state-of-the-art baselines across various scenarios in the action prediction task, and is able to generate new trajectories close to expert demonstrations.

1. INTRODUCTION

Recent years have witnessed great success of reinforcement learning (RL) for single-agent sequential decision making tasks. As many real-world applications (e.g., multi-player games (Silver et al., 2017; Brown & Sandholm, 2019) and traffic light control (Chu et al., 2019)) involve the participation of multiple agents, multi-agent reinforcement learning (MARL) has attracted increasing attention. However, a key limitation of RL and MARL is the difficulty of designing suitable reward functions for complex tasks with implicit goals (e.g., dialogue systems) (Russell, 1998; Ng et al., 2000; Fu et al., 2017; Song et al., 2018). Indeed, hand-tuning reward functions to induce desired behaviors becomes especially challenging in multi-agent systems, since different agents may have completely different goals and state-action representations (Yu et al., 2019). Imitation learning provides an alternative to directly programming agents by taking advantage of expert demonstrations of how a task should be solved. Although appealing, most prior works on multi-agent imitation learning assume that agents make independent decisions after observing a state (i.e., a mean-field factorization of the joint policy) (Zhan et al., 2018; Le et al., 2017; Song et al., 2018; Yu et al., 2019), ignoring the potentially complex dependencies that exist among agents. Recently, Tian et al. (2019) and Liu et al. (2020) proposed to implement correlated policies with opponent modeling, which incurs unnecessary modeling cost and redundancy, while still lacking coordination during execution. Compared to the single-agent setting, one major and fundamental challenge in multi-agent learning is how to model the dependence among multiple agents in an effective and scalable way. Inspired by probability theory and statistical dependence modeling, in this work we propose to use copulas (Sklar, 1959b; Nelsen, 2007; Joe, 2014) to model multi-agent behavioral patterns.
Copulas are powerful statistical tools for describing the dependence among random variables, and have been widely used in quantitative finance for risk measurement and portfolio optimization (Bouyé et al., 2000). Using a copula-based multi-agent policy enables us to separately learn marginals that capture the local behavioral patterns of each individual agent, and a copula function that solely and fully captures the dependence structure among the agents. Such a factorization can model arbitrarily complex joint policies and leads to interpretable, efficient and scalable multi-agent imitation learning. As a motivating example (see Figure 1), suppose there are two agents, each with a one-dimensional action space. In Figure 1a, although the two joint policies are quite different, they actually share the same copula (dependence structure) and one marginal: the first marginal is Gaussian(0, 1) in both, while the second is Gaussian(0, 1) in one policy and Gumbel(-1.5, 0.7) in the other. Our proposed copula-based policy is capable of capturing such information and, more importantly, we may leverage it to develop efficient algorithms for transfer learning scenarios. For example, when modeling team play in a soccer game where one player is replaced by his/her substitute, while the dependence among the different roles remains essentially the same regardless of players, we can immediately obtain a new joint policy by switching in the new player's marginal while keeping the copula and the other marginals unchanged.
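The factorization described above can be sketched numerically: sample from a shared copula once, then push the resulting uniforms through different marginal inverse CDFs to obtain distinct joint policies with the same dependence structure. The Gaussian copula and the correlation parameter below are our own illustrative choices, not the paper's model.

```python
# Sketch: two joint action distributions sharing one copula but differing
# in a marginal (Gaussian(0, 1) vs. Gumbel(-1.5, 0.7), as in Figure 1a).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000
rho = 0.8  # Gaussian-copula correlation, assumed for illustration

# 1) Sample the shared copula: correlated Gaussians -> uniforms via the CDF
#    (probability integral transform).
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u = stats.norm.cdf(z)  # each column is marginally Uniform(0, 1)

# 2) Apply different marginal inverse CDFs to the same uniforms.
policy_a = stats.norm.ppf(u)  # both marginals Gaussian(0, 1)
policy_b = np.column_stack([
    stats.norm.ppf(u[:, 0]),                          # same first marginal
    stats.gumbel_r.ppf(u[:, 1], loc=-1.5, scale=0.7), # swapped second marginal
])

# Rank correlation is a property of the copula alone, so it is identical
# for both policies even though the marginals differ.
tau_a, _ = stats.kendalltau(policy_a[:, 0], policy_a[:, 1])
tau_b, _ = stats.kendalltau(policy_b[:, 0], policy_b[:, 1])
print(round(tau_a, 6) == round(tau_b, 6))
```

Because the inverse CDFs are strictly increasing, the ranks of the samples are unchanged, which is exactly why Kendall's tau agrees: it depends only on the copula.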
On the other hand, as shown in Figure 1b, two different joint policies may share the same marginals while having different copulas (e.g., uniform over [0, 1]^2 versus uniform over [0, 1/2]^2 ∪ [1/2, 1]^2), which implies that the mean-field policies of previous works (which model only marginal policies and make independent decisions) cannot distinguish these two scenarios and thus cannot achieve coordination correctly. Towards this end, in this paper we propose a copula-based multi-agent imitation learning algorithm that is interpretable, efficient and scalable for modeling complex multi-agent interactions. Extensive experimental results on synthetic and real-world datasets show that our proposed method outperforms state-of-the-art multi-agent imitation learning methods in various scenarios and generates multi-agent trajectories close to expert demonstrations.
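The indistinguishability argument above can be checked directly: the two distributions from Figure 1b have identical Uniform(0, 1) marginals, yet very different dependence. The block construction below is our reading of the figure, written as a small simulation.

```python
# Sketch of Figure 1b: identical marginals, different copulas. A mean-field
# model fitting only the marginals cannot tell these two joints apart.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Copula 1 (independence): uniform over the whole unit square [0, 1]^2.
indep = rng.uniform(0.0, 1.0, size=(n, 2))

# Copula 2: pick a diagonal block, then sample uniformly inside it,
# i.e., uniform over [0, 1/2]^2 union [1/2, 1]^2.
block = rng.integers(0, 2, size=n)  # 0 = lower-left, 1 = upper-right
coupled = 0.5 * rng.uniform(0.0, 1.0, size=(n, 2)) + 0.5 * block[:, None]

# Both coordinates are Uniform(0, 1) under either joint...
print(indep.mean(axis=0).round(2), coupled.mean(axis=0).round(2))

# ...but the dependence differs: under copula 2, a1 < 1/2 forces a2 < 1/2.
# Analytically the correlation is 0.75 for the block mixture, 0 otherwise.
corr_indep = np.corrcoef(indep.T)[0, 1]
corr_coupled = np.corrcoef(coupled.T)[0, 1]
print(round(corr_indep, 2), round(corr_coupled, 2))
```

This mirrors the coordination failure in the text: an independent (mean-field) sampler reproduces the marginals perfectly yet places half its mass in the off-diagonal blocks where the coupled expert never acts.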

2. PRELIMINARIES

In this work, we consider the problem of multi-agent imitation learning under the framework of Markov games (Littman, 1994), which generalize Markov Decision Processes to multi-agent settings where N agents interact with each other. Specifically, in a Markov game, S is the common state space, A_i is the action space for agent i ∈ {1, . . . , N}, η ∈ P(S) is the initial state distribution, and P : S × A_1 × . . . × A_N → P(S) is the state transition distribution of the environment that the agents are interacting with. Here P(S) denotes the set of probability distributions over the state space S. Suppose at time t the agents observe s[t] ∈ S and take actions a[t] := (a_1[t], . . . , a_N[t]) ∈ A_1 × . . . × A_N; the agents will then observe state s[t+1] ∈ S at time t+1 with probability P(s[t+1] | s[t], a_1[t], . . . , a_N[t]). In this process, the agents select the joint action a[t] by sampling from a stochastic joint policy π : S → P(A_1 × . . . × A_N). In the following, we use the subscript -i to denote all agents except agent i. For example, (a_i, a_-i) represents the actions of all agents; π_i(a_i | s) and π_i(a_i | s, a_-i) represent the marginal and conditional policy of agent i induced by the joint policy π(a | s) (through marginalization and Bayes' rule, respectively). We consider the following offline imitation learning problem. Suppose we have access to a set of demonstrations D = {τ_j}_{j=1}^{M} provided by some expert policy π_E(a | s), where each expert trajectory τ_j = {(s_j[t], a_j[t])}_{t=1}^{T} is collected by the following sampling process: s[1] ∼ η(·); then for t = 1, . . . , T, a[t] ∼ π_E(· | s[t]) and s[t+1] ∼ P(· | s[t], a[t]).
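The demonstration-collection process at the end of the section can be written as a short loop. The toy two-agent environment and the "expert" policy below are placeholders of our own (the paper specifies no concrete environment); only the control flow follows the sampling process s[1] ∼ η, a[t] ∼ π_E(·|s[t]), s[t+1] ∼ P(·|s[t], a[t]).

```python
# Minimal sketch of collecting expert demonstrations in a Markov game.
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, T, M = 2, 10, 3  # number of agents, horizon, number of trajectories

def eta():
    """Initial state distribution eta (placeholder: a random scalar state)."""
    return rng.normal()

def expert_policy(s):
    """Joint expert policy pi_E(a | s) with correlated agents (placeholder):
    shared noise couples the two agents' actions."""
    shared = rng.normal()
    return np.array([s + shared + 0.1 * rng.normal() for _ in range(N_AGENTS)])

def transition(s, a):
    """Transition P(s' | s, a): next state depends on the joint action."""
    return 0.9 * s + 0.1 * a.mean() + 0.01 * rng.normal()

def collect_trajectory():
    s, traj = eta(), []                  # s[1] ~ eta
    for _ in range(T):
        a = expert_policy(s)             # a[t] ~ pi_E(. | s[t])
        traj.append((s, a))
        s = transition(s, a)             # s[t+1] ~ P(. | s[t], a[t])
    return traj

demos = [collect_trajectory() for _ in range(M)]  # D = {tau_j}, j = 1..M
print(len(demos), len(demos[0]), demos[0][0][1].shape)
```

Note that the imitation learner only sees the resulting state-action pairs in D; neither π_E nor P is observed, which is what makes the problem offline.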



Figure 1: In each subfigure, the left part visualizes the joint policy π(a_1, a_2 | s) on the joint action space [-3, 3]^2 and the right part shows the corresponding marginal policies (e.g., π_1(a_1 | s) = ∫_{a_2} π(a_1, a_2 | s) da_2) as well as the copula c(F_1(a_1 | s), F_2(a_2 | s)) on the unit cube. Here F_i is the cumulative distribution function of the marginal π_i(a_i | s), and u_i := F_i(a_i | s) is the uniformly distributed random variable obtained by the probability integral transform with F_i. More details and definitions can be found in Section 3.2. (a) Same copula but different marginals. (b) Same marginals but different copulas.

