MULTI-AGENT IMITATION LEARNING WITH COPULAS

Abstract

Multi-agent imitation learning aims to train multiple agents to perform tasks from demonstrations by learning a mapping between observations and actions, which is essential for understanding physical, social, and team-play systems. However, most existing works on modeling multi-agent interactions typically assume that agents make independent decisions based on their observations, ignoring the complex dependence among agents. In this paper, we propose to use copula, a powerful statistical tool for capturing dependence among random variables, to explicitly model the correlation and coordination in multi-agent systems. Our proposed model is able to separately learn marginals that capture the local behavioral patterns of each individual agent, as well as a copula function that solely and fully captures the dependence structure among agents. Extensive experiments on synthetic and real-world datasets show that our model outperforms state-of-the-art baselines across various scenarios in the action prediction task, and is able to generate new trajectories close to expert demonstrations.

1. INTRODUCTION

Recent years have witnessed great success of reinforcement learning (RL) for single-agent sequential decision making tasks. As many real-world applications (e.g., multi-player games (Silver et al., 2017; Brown & Sandholm, 2019) and traffic light control (Chu et al., 2019) ) involve the participation of multiple agents, multi-agent reinforcement learning (MARL) has gained more and more attention. However, a key limitation of RL and MARL is the difficulty of designing suitable reward functions for complex tasks with implicit goals (e.g., dialogue systems) (Russell, 1998; Ng et al., 2000; Fu et al., 2017; Song et al., 2018) . Indeed, hand-tuning reward functions to induce desired behaviors becomes especially challenging in multi-agent systems, since different agents may have completely different goals and state-action representations (Yu et al., 2019) . Imitation learning provides an alternative approach to directly programming agents by taking advantage of expert demonstrations on how a task should be solved. Although appealing, most prior works on multi-agent imitation learning typically assume agents make independent decisions after observing a state (i.e., mean-field factorization of the joint policy) (Zhan et al., 2018; Le et al., 2017; Song et al., 2018; Yu et al., 2019) , ignoring the potentially complex dependencies that exist among agents. Recently, Tian et al. (2019) and Liu et al. (2020) proposed to implement correlated policies with opponent modeling, which incurs unnecessary modeling cost and redundancy, while still lacking coordination during execution. Compared to the single-agent setting, one major and fundamental challenge in multi-agent learning is how to model the dependence among multiple agents in an effective and scalable way. Inspired by probability theory and statistical dependence modeling, in this work, we propose to use copulas (Sklar, 1959b; Nelsen, 2007; Joe, 2014) to model multi-agent behavioral patterns. 
Copulas are powerful statistical tools for describing the dependence among random variables, and they have been widely used in quantitative finance for risk measurement and portfolio optimization (Bouyé et al., 2000). Using a copula-based multi-agent policy enables us to separately learn marginals that capture the local behavioral patterns of each individual agent and a copula function that only and fully captures the dependence structure among the agents. Such a factorization is capable of modeling arbitrarily complex joint policies and leads to interpretable, efficient and scalable multi-agent imitation learning. As a motivating example (see Figure 1), suppose there are two agents, each with a one-dimensional action space. In Figure 1a, although the two joint policies are quite different, they actually share the same copula (dependence structure) and one marginal. Our proposed copula-based policy is capable of capturing such information and, more importantly, we may leverage such information to develop efficient algorithms for such transfer learning scenarios. For example, when we want to model team play in a soccer game and one player is replaced by his/her substitute while the dependence among different roles is basically the same regardless of players, we can immediately obtain a new joint policy by switching in the new player's marginal while keeping the copula and other marginals unchanged.

[Figure 1: ... $\pi(a_1, a_2|s)\,da_2$) as well as the copula $c(F_1(a_1|s), F_2(a_2|s))$ on the unit cube. Here $F_i$ is the cumulative distribution function of the marginal $\pi_i(a_i|s)$ and $u_i := F_i(a_i|s)$ is the uniformly distributed random variable obtained by probability integral transform with $F_i$. More details and definitions can be found in Section 3.2.]
On the other hand, as shown in Figure 1b , two different joint policies may share the same marginals while having different copulas, which implies that the mean-field policy in previous works (only modeling marginal policies and making independent decisions) cannot differentiate these two scenarios to achieve coordination correctly. Towards this end, in this paper, we propose a copula-based multi-agent imitation learning algorithm, which is interpretable, efficient and scalable for modeling complex multi-agent interactions. Extensive experimental results on synthetic and real-world datasets show that our proposed method outperforms state-of-the-art multi-agent imitation learning methods in various scenarios and generates multi-agent trajectories close to expert demonstrations.
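The situation of Figure 1b can be illustrated with a small numerical sketch. The setup below is an assumption made purely for illustration (two agents, standard normal marginals, a Gaussian copula with a hypothetical correlation of 0.9 versus 0.0) and is not taken from the experiments: both joint policies have the same marginals, yet only the copula tells them apart.

```python
import math
import random


def sample_joint(rho, n, rng):
    """Draw n action pairs (a1, a2): both marginals are N(0, 1) and the
    dependence is a Gaussian copula with correlation rho (assumed setup)."""
    pairs = []
    for _ in range(n):
        a1 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        a2 = rho * a1 + math.sqrt(1.0 - rho * rho) * noise
        pairs.append((a1, a2))
    return pairs


def mean_std(values):
    """Sample mean and (population) standard deviation."""
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson(pairs):
    """Sample Pearson correlation between the two agents' actions."""
    m1, s1 = mean_std([p[0] for p in pairs])
    m2, s2 = mean_std([p[1] for p in pairs])
    cov = sum((p[0] - m1) * (p[1] - m2) for p in pairs) / len(pairs)
    return cov / (s1 * s2)


rng = random.Random(0)
coordinated = sample_joint(0.9, 20000, rng)  # strongly dependent agents
mean_field = sample_joint(0.0, 20000, rng)   # independent agents

for name, pairs in (("coordinated", coordinated), ("mean-field", mean_field)):
    _, std_a2 = mean_std([p[1] for p in pairs])
    print(f"{name}: std(a2) = {std_a2:.2f}, corr(a1, a2) = {pearson(pairs):.2f}")
```

A model that only fits the marginals sees the same statistics for both policies; the difference shows up only in the correlation, i.e., in the copula.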

2. PRELIMINARIES

In this work, we consider the problem of multi-agent imitation learning under the framework of Markov games (Littman, 1994), which generalize Markov Decision Processes to multi-agent settings, where $N$ agents are interacting with each other. Specifically, in a Markov game, $\mathcal{S}$ is the common state space, $\mathcal{A}_i$ is the action space for agent $i \in \{1, \ldots, N\}$, $\eta \in \mathcal{P}(\mathcal{S})$ is the initial state distribution and $P : \mathcal{S} \times \mathcal{A}_1 \times \ldots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ is the state transition distribution of the environment that the agents are interacting with. Here $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the state space $\mathcal{S}$. Suppose that at time $t$ the agents observe $s[t] \in \mathcal{S}$ and take actions $a[t] := (a_1[t], \ldots, a_N[t]) \in \mathcal{A}_1 \times \ldots \times \mathcal{A}_N$; the agents will then observe state $s[t+1] \in \mathcal{S}$ at time $t+1$ with probability $P(s[t+1] \,|\, s[t], a_1[t], \ldots, a_N[t])$. In this process, the agents select the joint action $a[t]$ by sampling from a stochastic joint policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A}_1 \times \ldots \times \mathcal{A}_N)$. In the following, we use the subscript $-i$ to denote all agents except for agent $i$. For example, $(a_i, a_{-i})$ represents the actions of all agents; $\pi_i(a_i|s)$ and $\pi_i(a_i|s, a_{-i})$ represent the marginal and conditional policy of agent $i$ induced by the joint policy $\pi(a|s)$ (through marginalization and Bayes' rule, respectively). We consider the following off-line imitation learning problem. Suppose we have access to a set of demonstrations $\mathcal{D} = \{\tau_j\}_{j=1}^{M}$ provided by some expert policy $\pi_E(a|s)$, where each expert trajectory $\tau_j = \{(s_j[t], a_j[t])\}_{t=1}^{T}$ is collected by the following sampling process: $s[1] \sim \eta(s)$, $a[t] \sim \pi_E(a|s[t])$, $s[t+1] \sim P(s|s[t], a[t])$ for $t \in \{1, \ldots, T\}$. The goal is to learn a parametric joint policy $\pi_\theta$ that approximates the expert policy $\pi_E$, such that we can perform downstream inference (e.g., action prediction and trajectory generation).
The learning problem is off-line as we cannot ask for additional interactions with the expert policy or the environment during training, and the reward is also unknown.

3. MODELING MULTI-AGENT INTERACTIONS WITH COPULAS

3.1. MOTIVATION

Many modeling methods for multi-agent learning tasks employ a simplifying mean-field assumption that the agents make independent action choices after observing a state (Albrecht & Stone, 2018; Song et al., 2018; Yu et al., 2019), which means the joint policy can be factorized as follows:

$$\pi(a_1, \ldots, a_N | s) = \prod_{i=1}^{N} \pi_i(a_i | s) \tag{1}$$

Such a mean-field assumption essentially allows for independent construction of each agent's policy. For example, multi-agent behavior cloning by maximum likelihood estimation is now equivalent to performing $N$ single-agent behavior cloning tasks:

$$\max_{\pi} \mathbb{E}_{(s, a) \sim \rho_{\pi_E}} [\log \pi(a | s)] = \sum_{i=1}^{N} \max_{\pi_i} \mathbb{E}_{(s, a_i) \sim \rho_{\pi_E, i}} [\log \pi_i(a_i | s)] \tag{2}$$

where the occupancy measure $\rho_\pi : \mathcal{S} \times \mathcal{A}_1 \times \ldots \times \mathcal{A}_N \to \mathbb{R}$ denotes the state-action distribution encountered when navigating the environment using the joint policy $\pi$ (Syed et al., 2008; Puterman, 2014) and $\rho_{\pi, i}$ is the corresponding marginal occupancy measure. However, when the expert agents are making correlated action choices (e.g., due to joint plans and communication in a soccer game), such a simplifying modeling choice is not able to capture the rich dependency structure and coordination among agent actions. To address this issue, recent works (Tian et al., 2019; Liu et al., 2020) propose to use a different factorization of the joint policy such that the dependency among the $N$ agents can be preserved:

$$\pi(a_i, a_{-i} | s) = \pi_i(a_i | s, a_{-i}) \, \pi_{-i}(a_{-i} | s) \quad \text{for } i \in \{1, \ldots, N\}. \tag{3}$$

Although such a factorization is general and captures the dependency among multi-agent interactions, several issues still remain. First, the modeling cost is increased significantly, because now we need to learn $N$ different and complicated opponent policies $\pi_{-i}(a_{-i}|s)$ as well as $N$ different marginal conditional policies $\pi_i(a_i|s, a_{-i})$, each with a deep neural network. It should be noted that there are many redundancies in such a modeling choice.
Specifically, suppose there are $N > 3$ agents. For agents $1$ and $N$, we need to learn opponent policies $\pi_{-1}(a_2, \ldots, a_N|s)$ and $\pi_{-N}(a_1, \ldots, a_{N-1}|s)$, respectively. These are potentially high dimensional and might require flexible function approximators. However, the dependency structure among agents $2$ to $N-1$ is modeled in both $\pi_{-1}$ and $\pi_{-N}$, which incurs unnecessary modeling cost. Second, when executing the policy, each agent $i$ makes decisions through its marginal policy $\pi_i(a_i|s) = \mathbb{E}_{\pi_{-i}(a_{-i}|s)}[\pi_i(a_i|s, a_{-i})]$ by first sampling $a_{-i}$ from its opponent policy $\pi_{-i}$ and then sampling its action $a_i$ from $\pi_i(\cdot|s, a_{-i})$. Since each agent performs this decision process independently, coordination among agents is still impossible due to sampling randomness. Moreover, a set of independently learned conditional distributions is not necessarily consistent (i.e., induced by the same joint policy) (Yu et al., 2019). In this work, to address the above challenges, we draw inspiration from probability theory and propose to use copulas, a statistical tool for describing the dependency structure between random variables, to model the complicated multi-agent interactions in a scalable and efficient way.

3.2. COPULAS

When the components of a multivariate random variable $x = (x_1, \ldots, x_N)$ are jointly independent, the density of $x$ can be written as:

$$p(x) = \prod_{i=1}^{N} p(x_i) \tag{4}$$

When the components are not independent, this equality no longer holds, as the dependencies among $x_1, \ldots, x_N$ cannot be captured by the marginals $p(x_i)$. However, the difference can be corrected by multiplying the right-hand side of Equation (4) with a function that only and fully describes the dependency. Such a function is called a copula (Nelsen, 2007), a multivariate distribution function on the unit hyper-cube with uniform marginals. Intuitively, let us consider a random variable $x_i$ with continuous cumulative distribution function $F_i$. Applying the probability integral transform gives us a random variable $u_i = F_i(x_i)$, which has a standard uniform distribution. Thus one can use this property to separate the information in the marginals from the dependency structure among $x_1, \ldots, x_N$ by first projecting each marginal onto one axis of the hyper-cube and then capturing the pure dependency with a distribution on the unit hyper-cube. Formally, a copula is the joint distribution of random variables $u_1, \ldots, u_N$, each of which is marginally uniformly distributed on the interval $[0, 1]$. Furthermore, we introduce the following theorem that provides the theoretical foundations of copulas:

Theorem 1 (Sklar, 1959a). Suppose the multivariate random variable $(x_1, \ldots, x_N)$ has marginal cumulative distribution functions $F_1, \ldots, F_N$ and joint cumulative distribution function $F$. Then there exists a copula $C : [0, 1]^N \to [0, 1]$, which is unique when the marginals are continuous, such that:

$$F(x_1, \ldots, x_N) = C\big(F_1(x_1), \ldots, F_N(x_N)\big) \tag{5}$$

When the multivariate distribution has a joint density $f$ and marginal densities $f_1, \ldots, f_N$, we have:

$$f(x_1, \ldots, x_N) = \prod_{i=1}^{N} f_i(x_i) \cdot c\big(F_1(x_1), \ldots, F_N(x_N)\big) \tag{6}$$

where $c$ is the probability density function of the copula. The converse is also true.
Given a copula $C$ and marginals $F_i(x_i)$, the function $C\big(F_1(x_1), \ldots, F_N(x_N)\big) = F(x_1, \ldots, x_N)$ is an $N$-dimensional cumulative distribution function with marginal distributions $F_i(x_i)$. Theorem 1 states that every multivariate cumulative distribution function $F(x_1, \ldots, x_N)$ can be expressed in terms of its marginals $F_i(x_i)$ and a copula $C\big(F_1(x_1), \ldots, F_N(x_N)\big)$. Comparing Eq. (4) with Eq. (6), we can see that a copula function encoding correlations between random variables can be used to correct the mean-field approximation for arbitrarily complex distributions.
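Eq. (6) can be checked numerically in a case where everything is available in closed form. The sketch below assumes a bivariate normal distribution, whose copula is the Gaussian copula; the particular means, standard deviations, correlation, and test points are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def gaussian_copula_density(u1, u2, rho):
    """Density c(u1, u2) of a bivariate Gaussian copula with correlation rho."""
    z1, z2 = STD_NORMAL.inv_cdf(u1), STD_NORMAL.inv_cdf(u2)
    exponent = -(rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2)
    exponent /= 2.0 * (1.0 - rho * rho)
    return math.exp(exponent) / math.sqrt(1.0 - rho * rho)


def bivariate_normal_density(x1, x2, m1, m2, rho):
    """Joint density f(x1, x2) of a bivariate normal with marginals m1, m2."""
    z1 = (x1 - m1.mean) / m1.stdev
    z2 = (x2 - m2.mean) / m2.stdev
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * m1.stdev * m2.stdev * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm


def sklar_density(x1, x2, m1, m2, rho):
    """Right-hand side of Eq. (6): product of marginals times copula density."""
    c = gaussian_copula_density(m1.cdf(x1), m2.cdf(x2), rho)
    return m1.pdf(x1) * m2.pdf(x2) * c


m1, m2, rho = NormalDist(0.0, 1.0), NormalDist(1.0, 2.0), 0.6
points = [(0.3, -1.2), (-0.7, 2.5), (1.1, 0.4)]
for x1, x2 in points:
    direct = bivariate_normal_density(x1, x2, m1, m2, rho)
    factored = sklar_density(x1, x2, m1, m2, rho)
    print(f"f = {direct:.6f}, marginals * copula = {factored:.6f}")
```

With `rho = 0` the copula density is identically 1 and the factorization reduces to the mean-field form of Eq. (4).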

3.3. MULTI-AGENT IMITATION LEARNING WITH COPULA-BASED POLICIES

A central question in multi-agent imitation learning is how to model the dependency structure among agent decisions properly. As discussed above, the framework of copulas provides a mechanism to decouple the marginal policies (individual behavioral patterns) from the dependency left in the joint policy after removing the information in the marginals. In this work, we advocate copula-based policies for multi-agent learning because copulas offer some unique and desirable properties in multi-agent scenarios. For example, suppose we want to model the interactions among players in a sports game. Using a copula-based policy, we obtain marginal policies for each individual player as well as the dependencies among different roles (e.g., forwards and midfielders in soccer). Such a multi-agent learning framework has the following advantages:

• Interpretable. The learned copula density can be easily visualized to intuitively analyze the correlation among agent actions.
• Scalable. When the marginal policies of some agents change but the dependency among different agents remains the same (e.g., in a soccer game, one player is replaced by his/her substitute, but the dependence among different roles is basically the same regardless of players), we can obtain a new joint policy efficiently by switching in the new agent's marginal while keeping the copula and other marginals unchanged.
• Succinct. The copula-based factorization of the joint policy avoids the redundancy in previous opponent modeling approaches (Tian et al., 2019; Liu et al., 2020) (as discussed in Section 3.1) by separately learning marginals and a copula.

Learning. In this section, we discuss how to learn a copula-based policy from a set of expert demonstrations. Under the framework of Markov games and copulas, we factorize the parametric joint policy as:

$$\pi(a_1, \ldots, a_N | s; \theta) = \prod_{i=1}^{N} \pi_i(a_i | s; \theta_i) \cdot c\big(F_1(a_1 | s; \theta_1), \ldots, F_N(a_N | s; \theta_N) \,\big|\, s; \theta_c\big)$$

where $\pi_i(a_i|s; \theta_i)$ is the marginal policy of agent $i$ with parameters $\theta_i$ and $F_i$ is the corresponding cumulative distribution function; the function $c$ (parameterized by $\theta_c$) is the density of the copula on the transformed actions $u_i = F_i(a_i|s; \theta_i)$ obtained by processing the original actions with the probability integral transform. The training algorithm of our proposed method is presented as Algorithm 1. Given a set of expert demonstrations $\mathcal{D}$, our goal is to learn the marginal action distributions of the agents and their copula function. Our approach consists of two steps. We first learn the marginal action distribution of each agent given the current state (lines 1-6). This is achieved by training $\text{MLP}_{\text{marginal}}$, which takes as input a state $s$ and outputs the parameters of the marginal action distributions of the $N$ agents given the input state (line 3). In our implementation, we use a mixture of Gaussians to realize each marginal policy $\pi_i(a_i|s; \theta_i)$ so that we can model complex multi-modal marginals while having a tractable form of the marginal cumulative distribution functions. Therefore, the output of $\text{MLP}_{\text{marginal}}$ consists of the means, covariances, and component weights of the $N$ agents' Gaussian mixtures. We then calculate the likelihood of each observed action $a_j$ based on agent $j$'s marginal action distribution (line 5), and maximize the likelihood by optimizing the parameters of $\text{MLP}_{\text{marginal}}$ (line 6). After learning the marginals, we fix the parameters of the marginal MLPs and start learning the copula (lines 7-20). We first process the original demonstrations using the probability integral transform and obtain a set of new demonstrations with uniform marginals (lines 8-13). Then we learn the density of the copula (lines 14-20).
Notice that the copula can be implemented as either state-dependent (lines 14-17) or state-independent (lines 18-20). For a state-dependent copula, we use $\text{MLP}_{\text{copula}}$, which takes as input the current state $s$ and outputs the parameters of the copula density $c(\cdot|s)$ (line 15). Then we calculate the likelihood of the copula value $u$ (line 16) and maximize the likelihood by updating $\text{MLP}_{\text{copula}}$ (line 17). For a state-independent copula, we directly calculate the likelihood of the copula value $u$ under $c(\cdot)$ (line 19) and learn the parameters of $c(\cdot)$ by maximizing the likelihood (line 20). The copula density ($c(\cdot)$ or $c(\cdot|s)$) can be implemented using parametric methods such as a Gaussian or a mixture of Gaussians. It is worth noting that if the copula is state-independent, it can also be implemented using non-parametric methods such as kernel density estimation (Parzen, 1962; Davis et al., 2011). In this case, we no longer learn the parameters of the copula by maximizing likelihood as in lines 19-20, but simply store all copula values $u$ for density estimation and sampling in the inference stage. We will visualize the learned copula in the experiments.

Inference and Generation. In the inference stage, the goal is to predict the joint actions of all agents given their current state $s$. The inference algorithm is presented as Algorithm 2 in Appendix A, where we first sample a copula value $u = (u_1, \ldots, u_N)$ from the learned copula, either state-dependent or state-independent (lines 1-5), and then apply the inverse probability integral transform to map it to the original action space: $\hat{a}_j = F_j^{-1}(u_j|s)$ (lines 7-10). Note that an analytical form of the inverse cumulative distribution function may not always be available. In our implementation, we use binary search to approximately solve this problem, since $F_j$ is a strictly increasing function; this is shown to be highly efficient in practice. In addition, we can also sample multiple i.i.d. copula values from $c(\cdot|s)$ or $c(\cdot)$ (line 3 or 5), transform them into the original action space, and take their average as the predicted action. This strategy is shown to improve the accuracy of action prediction (in terms of MSE loss), but requires more running time as a trade-off. The generation algorithm is presented as Algorithm 3 in Appendix A. To generate new trajectories, we repeatedly predict agent actions given the current state (line 2), then execute the generated action and obtain an updated state from the environment (line 3).

The computational complexity of the training and inference algorithms is analyzed as follows. The complexity of each round in Algorithm 1 is $O(MTN)$, where $M$ is the number of trajectories in the training set, $T$ is the length of each trajectory, and $N$ is the number of agents. The complexity of Algorithm 2 is $O(N)$. The training and inference algorithms thus scale linearly with the size of the input dataset.
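The binary-search inversion of a Gaussian-mixture CDF mentioned above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the search bracket `[-10, 10]`, the tolerance, and the example mixture are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def mixture_cdf(a, components, weights):
    """CDF F(a) of a one-dimensional Gaussian mixture."""
    return sum(w * c.cdf(a) for c, w in zip(components, weights))


def mixture_inv_cdf(u, components, weights, lo=-10.0, hi=10.0, tol=1e-8):
    """Approximate F^{-1}(u) by bisection.

    Relies on F being strictly increasing and assumes the answer lies in
    [lo, hi], i.e. F(lo) <= u <= F(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, components, weights) < u:
            lo = mid  # the quantile lies to the right of mid
        else:
            hi = mid  # the quantile lies at or to the left of mid
    return 0.5 * (lo + hi)


# Hypothetical two-component marginal for one agent.
components = [NormalDist(-0.5, 0.1), NormalDist(0.5, 0.2)]
weights = [0.5, 0.5]

for u in (0.05, 0.25, 0.75, 0.95):
    a_hat = mixture_inv_cdf(u, components, weights)
    print(f"u = {u:.2f} -> action {a_hat:+.4f}")
```

Each call needs roughly $\log_2((hi - lo)/tol)$ CDF evaluations, about 31 with the defaults above, which is consistent with the inversion being cheap in practice.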

4. RELATED WORK

The key problem in multi-agent imitation learning is how to model the dependence structure among multiple interactive agents. Le et al. ( 2017) learn a latent coordination model for players in a cooperative game, where different players occupy different roles. However, there are many other multiagent scenarios where agents do not cooperate for a same goal or they do not have specific roles (e.g., self-driving). Bhattacharyya et al. ( 2018) adopt parameter sharing trick to extend generative adversarial imitation learning to handle multi-agent problems, but it does not model the interaction of agents. Interaction Network (Battaglia et al., 2016) learns a physical simulation of objects with binary relations, and CommNet (Sukhbaatar et al., 2016) learns dynamic communication among agents. But they fail to characterize the dependence among agent actions explicitly. Researchers also propose to infer multi-agent relationship using graph techniques or attention mechanism. For example, Kipf et al. (2018) propose to use graph neural networks (GNN) to infer the type of relationship among agents. Hoshen (2017) introduces attention mechanism into multi-agent predictive modeling. Li et al. (2020) combine generative models and attention mechanism to capture behavior generating process of multi-agent systems. These works address the problem of reasoning relationship among agents rather than capturing their dependence when agents are making decisions. Another line of related work is deep generative models in multi-agent systems. For example, Zhan et al. (2018) propose a hierarchical framework with programmatically produced weak labels to gen- 

5.1. EXPERIMENTAL SETUP

Datasets. We evaluate our method in three settings. PhySim is a synthetic physical environment where 5 particles are connected by springs. Driving is a synthetic driving environment where one vehicle follows another along a single lane. RoboCup is collected from an international scientific robot competition where two robot teams (22 robots in total) compete against each other. The detailed dataset description is provided in Appendix B. Experimental environments are shown in Figure 2.

Baselines. We compare our method with the following baselines: LR is a logistic regression model that predicts actions of agents using all of their states. SocialLSTM (Alahi et al., 2016) predicts agent trajectories using RNNs with a social pooling layer in the hidden state of nearby agents. IN (Battaglia et al., 2016) predicts agent states and their interactions using deep neural networks. CommNet (Sukhbaatar et al., 2016) simulates inter-agent communication by broadcasting the hidden states of all agents and then predicts their actions. VAIN (Hoshen, 2017) uses neural networks with an attention mechanism for multi-agent modeling. NRI (Kipf et al., 2018) designs a graph neural network based model to learn the interaction type among multiple agents. Since most of the baselines are designed for predicting future states given historical states, we change the implementation of their objective functions and use them to predict the current action of agents given historical states. Each experiment is repeated 3 times, and we report the mean and standard deviation.

Our method is shown to outperform all baselines significantly on all three datasets, which demonstrates that explicitly characterizing the dependence of agent actions can greatly improve the performance of multi-agent behavior modeling.


            Uniform copula     KDE copula         Gaussian mix. copula
PhySim      8.994 ± 0.001      1.256 ± 0.006      2.893 ± 0.012
Driving     -0.571 ± 0.024     -0.916 ± 0.017     -0.621 ± 0.028
RoboCup     3.243 ± 0.049      0.068 ± 0.052      3.124 ± 0.061

Table 2: Negative log-likelihood (NLL) of test trajectories evaluated by different copulas. Uniform copula assumes no dependence among agent actions. KDE copula uses kernel density estimation to model the copula, which is state-independent. Gaussian mixtures copula uses a Gaussian mixture model to characterize the copula, which is state-dependent.

To investigate the efficacy of the copula, we implement three types of copula function: Uniform copula means we do not model dependence among agent actions. KDE copula uses kernel density estimation to model the copula function, which is state-independent. Gaussian mixtures copula uses a Gaussian mixture model to characterize the copula function, whose parameters are output by an MLP taking as input the current state. We train the three models on training trajectories, then calculate the negative log-likelihood (NLL) of test trajectories using the three trained models. A lower NLL score means that the model assigns higher likelihood to the given trajectories, showing that it is better at characterizing the dataset. The NLL scores of the three models on the three datasets are reported in Table 2. Both the KDE copula and the Gaussian mixtures copula surpass the uniform copula, which demonstrates that modeling dependence among agent actions is essential for improving model expressiveness. However, the Gaussian mixtures copula performs worse than the KDE copula, because it is state-dependent and thus more prone to overfitting. Notice that the performance gap between the KDE and Gaussian mixtures copulas is smaller on PhySim (relative to the improvement over the uniform copula), since the PhySim dataset is much larger and the Gaussian mixtures copula can therefore be trained more effectively.

Table 3: Negative log-likelihood (NLL) of new test trajectories in which the action distribution of one agent is changed. We evaluate the new test trajectories based on whether to use the original marginal action distributions or copula, which results in four combinations.

5.3. GENERALIZATION CAPABILITY OF COPULA

One benefit of copulas is that copula captures the pure dependence among agents, regardless of their own marginal action distributions. To demonstrate the generalization capabilities of copulas, we design the following experiment. We first train our model on the original dataset, and learn marginal action distributions and copula function (which is called original marginals and original copula). Then we substitute one of the agents with a new agent and use the simulator to generate a new set of trajectories. Specifically, this is achieved by doubling the action value of one agent (for example, this can be seen as substituting an existing particle with a lighter one in PhySim). We retrain our model on new trajectories and learn new marginals and new copula. We evaluate the likelihood of new trajectories based on whether to use the original marginals or original copula, which, accordingly, results in four combinations. The NLL scores of four combinations are presented in Table 3 . It is clear, by comparing the first and the last column, that "new marginals + new copula" significantly outperform "original marginals + original copula", since new marginals and new copula are trained on new trajectories and therefore characterize the new joint distribution exactly. To see the influence of marginals and copula more clearly, we further compare the results in column 2 and 3, where we use new copula or new marginals separately. It is clear that the model performance does not drop significantly if we use the original copula and new marginals (by comparing column 3 and 4), which demonstrates that the copula function basically stays the same even if marginals are changed. The result supports our claim that the learned copula is generalizable in the case where marginal action distributions of agents change but the internal inter-agent relationship stays the same.
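The idea behind this experiment can be reproduced in a toy setting. The sketch below is not the PhySim experiment: it assumes two agents with Gaussian marginals and a bivariate Gaussian copula with a hypothetical correlation of 0.8, and it uses the true parameters instead of learned ones. Doubling one agent's action is a strictly increasing transform, so it changes that agent's marginal but leaves the copula unchanged. The NLL is computed through the factorization of Eq. (6), i.e., marginal log-densities plus the copula log-density.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
EPS = 1e-12  # clip copula values into the open unit interval
RHO = 0.8    # assumed copula correlation


def log_gaussian_copula(u1, u2, rho):
    """Log-density of a bivariate Gaussian copula (rho = 0: uniform copula)."""
    u1 = min(max(u1, EPS), 1.0 - EPS)
    u2 = min(max(u2, EPS), 1.0 - EPS)
    z1, z2 = STD_NORMAL.inv_cdf(u1), STD_NORMAL.inv_cdf(u2)
    quad = rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2
    return -quad / (2.0 * (1.0 - rho * rho)) - 0.5 * math.log(1.0 - rho * rho)


def nll(data, marginals, rho):
    """Average negative log-likelihood: marginal terms plus copula term."""
    m1, m2 = marginals
    total = 0.0
    for a1, a2 in data:
        total -= math.log(m1.pdf(a1)) + math.log(m2.pdf(a2))
        total -= log_gaussian_copula(m1.cdf(a1), m2.cdf(a2), rho)
    return total / len(data)


# New trajectories: agent 2 is "substituted" by doubling its action.
rng = random.Random(0)
new_data = []
for _ in range(5000):
    z1 = rng.gauss(0.0, 1.0)
    z2 = RHO * z1 + math.sqrt(1.0 - RHO * RHO) * rng.gauss(0.0, 1.0)
    new_data.append((z1, 2.0 * z2))

original_marginals = (NormalDist(0.0, 1.0), NormalDist(0.0, 1.0))
new_marginals = (NormalDist(0.0, 1.0), NormalDist(0.0, 2.0))

nll_orig_marg_orig_copula = nll(new_data, original_marginals, RHO)
nll_new_marg_uniform_copula = nll(new_data, new_marginals, 0.0)
nll_new_marg_orig_copula = nll(new_data, new_marginals, RHO)
print(nll_orig_marg_orig_copula, nll_new_marg_uniform_copula, nll_new_marg_orig_copula)
```

In this toy case, only the substituted agent's marginal has to be replaced; reusing the original copula with the new marginals gives the lowest NLL of the three combinations.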

5.4. COPULA VISUALIZATION

Another benefit of copulas is that they can intuitively demonstrate the correlation among agent actions. We choose the RoboCup dataset to visualize the learned copula. As shown in Figure 4a in Appendix D, we first randomly select a game (the 6th game) between cyrus2017 and helios2017 and draw the trajectories of the 10 players in the left team (L2 ∼ L11, i.e., all except the goalkeeper). It is clear that the 10 players fulfill specific roles: L2 ∼ L4 are defenders, L5 ∼ L8 are midfielders, and L9 ∼ L11 are forwards. Then we plot the copula density between the x-axis (the horizontal direction) of L2 and the x-axis of L3 ∼ L11, respectively, as shown in Figure 4b in Appendix D. These figures illustrate a linear correlation between their moving directions along the x-axis, that is, when L2 moves forward, other players are also likely to move forward. However, the correlation strength differs across players according to the visualized result: L2 exhibits high correlation with L3 and L4, but low correlation with L9 ∼ L11. This is because L2 ∼ L4 are all defenders and therefore collaborate more closely with each other, whereas L9 ∼ L11 are forwards and thus far from L2 on the field. The learned copula can also be used to generate new trajectories. We visualize the result of trajectory generation on the RoboCup dataset. As shown in Figure 3, the dotted lines denote the ground-truth trajectories of the 10 players in an attack from midfield to the penalty area.

5.5. TRAJECTORY GENERATION

The trajectories generated by our copula model (Figure 3b) are quite similar to the demonstration, as they exhibit high consistency. It is clear that the midfielders and forwards (No. 5 ∼ No. 11) are basically moving in the same direction, and they all make a left turn on their way to the penalty area. However, the trajectories generated by independent modeling show little correlation, since the players are all making independent decisions. We also present the result of trajectory generation on the Driving dataset in Appendix E.

6. CONCLUSION AND FUTURE WORK

In this paper, we propose a copula-based multi-agent imitation learning algorithm that is interpretable, efficient and scalable for modeling complex multi-agent interactions. Sklar's theorem allows us to separately learn marginal policies that capture the local behavioral patterns of each individual agent and a copula function that only and fully captures the dependence structure among the agents. Compared to previous multi-agent imitation learning methods based on independent policies (mean-field factorization of the joint policy) or opponent modeling, our method is capable of modeling complex dependence among agents and achieving coordination without any modeling redundancy. Experimental results on physical simulation, driving and robot soccer datasets demonstrate the effectiveness of our method compared with state-of-the-art baselines. We point out two directions for future work. First, the copula function is generalizable only if the dependence structure of the agents (i.e., their role assignment) is unchanged. Therefore, it is interesting to study how to efficiently apply the learned copula to scenarios with an evolving dependence structure. Another practical question is whether our proposed method can be extended to the setting of decentralized execution, since the step of copula sampling (line 3 or 5 in Algorithm 2) is shared by all agents. A straightforward way to solve this problem is to set a fixed sequence of random seeds for all agents in advance, so that the copula sample obtained by all agents is the same at each timestamp. Designing a more robust and elegant mechanism for decentralized execution is also a promising direction.
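The shared-seed idea for decentralized execution can be sketched as follows. Everything concrete here is an assumption for illustration: a state-independent bivariate Gaussian copula with correlation 0.8, Gaussian marginals, and a hypothetical `Agent` class. Each agent draws the full copula sample locally from the seed agreed on for the current timestamp and then uses only its own component.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
RHO = 0.8  # assumed correlation of a bivariate Gaussian copula


def copula_sample(seed):
    """Deterministically draw u = (u1, u2) from the copula for a given seed."""
    rng = random.Random(seed)
    z1 = rng.gauss(0.0, 1.0)
    z2 = RHO * z1 + math.sqrt(1.0 - RHO * RHO) * rng.gauss(0.0, 1.0)
    return (STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2))


class Agent:
    """An agent that knows its own marginal and the pre-agreed seed sequence."""

    def __init__(self, index, marginal, seeds):
        self.index = index
        self.marginal = marginal
        self.seeds = seeds

    def act(self, t):
        u = copula_sample(self.seeds[t])  # identical for every agent at time t
        return self.marginal.inv_cdf(u[self.index])


seeds = [11, 42, 7, 2024]  # fixed in advance and shared by all agents
agents = [Agent(0, NormalDist(0.0, 1.0), seeds), Agent(1, NormalDist(1.0, 0.5), seeds)]

# No communication at execution time: each agent calls act() on its own.
decentralized = [[agent.act(t) for agent in agents] for t in range(len(seeds))]
print(decentralized)
```

Because every agent reconstructs the same copula value, the resulting joint action matches what a centralized sampler would produce with the same seeds.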

A PSEUDO CODE FOR TRAINING, INFERENCE, AND GENERATION PROCEDURE

The pseudo code for the inference and generation procedures is presented in Algorithms 2 and 3, respectively.

Algorithm 2: Inference procedure
    if copula is state-dependent then
        Calculate (parameters of) the copula density: $c(\cdot|s) \leftarrow \text{MLP}_{\text{copula}}(s)$
        Sample a copula value $u = (u_1, \ldots, u_N)$ from $c(\cdot|s)$
    else
        Sample a copula value $u = (u_1, \ldots, u_N)$ from $c(\cdot)$
    // Transform from copula space to action space
    Calculate (parameters of) the conditional marginal action distributions for all agents: $\{f_j(\cdot|s)\}_{j=1}^{N} \leftarrow \text{MLP}_{\text{marginal}}(s)$
    for agent $j = 1, \ldots, N$ do
        $F_j(\cdot|s) \leftarrow$ CDF of $f_j(\cdot|s)$
        $\hat{a}_j \leftarrow F_j^{-1}(u_j|s)$
    $\hat{a} \leftarrow (\hat{a}_1, \ldots, \hat{a}_N)$
    return $\hat{a}$

Algorithm 3: Generation procedure
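The inference procedure, together with the multi-sample averaging strategy described in Section 3.3, might look as follows in code. This is a sketch under strong simplifying assumptions: two agents, a state-independent bivariate Gaussian copula with a hypothetical correlation of 0.8, and single-Gaussian marginals so that the inverse CDF is available directly. The function `marginals_for` is a hypothetical stand-in for $\text{MLP}_{\text{marginal}}$, not a trained network.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def marginals_for(state):
    """Hypothetical stand-in for MLP_marginal: one Gaussian marginal per agent."""
    return [NormalDist(state, 0.5), NormalDist(-state, 1.0)]


def sample_copula(rng, rho):
    """Sample u = (u1, u2) from a state-independent bivariate Gaussian copula."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return [STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)]


def predict(state, rng, rho=0.8, num_samples=1):
    """Predict a joint action; num_samples > 1 averages several predictions."""
    marginals = marginals_for(state)
    totals = [0.0] * len(marginals)
    for _ in range(num_samples):
        u = sample_copula(rng, rho)
        # Transform from copula space to action space, one agent at a time.
        for j, (marginal, u_j) in enumerate(zip(marginals, u)):
            totals[j] += marginal.inv_cdf(u_j)
    return [total / num_samples for total in totals]


single = predict(0.3, random.Random(1))
averaged = predict(0.3, random.Random(0), num_samples=2000)
print(single, averaged)
```

Averaging many samples pulls the prediction toward the marginal means, which is in line with the reported MSE improvement, at the cost of `num_samples` times more sampling work.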

B DATASET DETAILS

PhySim is collected from a physical simulation environment where 5 particles move in a unit 2D box. The state is the locations of all particles and the action is their accelerations (there is no need to include velocities in the state because accelerations are completely determined by particle locations). We add Gaussian noise to the observed values of actions. Particles may be pairwise connected by springs, which can be represented as a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$. The elasticity between two particles scales linearly with their distance. At each timestamp, we randomly sample an adjacency matrix from $\{A_1, A_2\}$ to connect all particles, where $A_1$ and $A_2$ are set as complementary (i.e., $A_1 + A_2 + I = \mathbf{1}$, where $\mathbf{1}$ is the all-ones matrix) to ensure that they are as different as possible. Therefore, the marginal action distribution of each particle given a system state is a Gaussian mixture with two components. Here the coordination signal for particles can be seen as the hidden variable determining which set of springs ($A_1$ or $A_2$) is used at the current time. We generate 10,000 training trajectories, 2,000 validation trajectories, and 2,000 test trajectories for experiments, where the length of each trajectory is 500.

Driving is generated by CARLA (Dosovitskiy et al., 2017), an open-source simulator for autonomous driving research that provides realistic urban environments for training and validation of autonomous driving systems. To generate the driving data, we design a car-following scenario, where a leader car and a follower car drive in the same lane. We make the leader car alternately accelerate to a speed upper bound and slow down to a stop. The leader car ignores the follower and drives according to its own policy. The follower car tries to follow the leader car closely while keeping a safe distance. Here the state is the locations and velocities of the two cars, and the action is their accelerations. We generate 1,009 trajectories in total, and split the data into training, validation, and test sets with a ratio of 6 : 2 : 2. The average trajectory length in the Driving dataset is 85.5.

RoboCup (Michael et al., 2017) is collected from an international scientific robot football competition in which teams of multiple robots compete against each other. The original dataset contains all pairings of 10 teams with 25 repetitions of each game (1,125 games in total). The state of a game (locations and velocities of 22 robots) is recorded every 100 ms, resulting in a trajectory of length 6,000 for each game (10 min). We select the 25 games between two teams, cyrus2017 and helios2017, as the data used in this paper. The state is the locations of 10 robots (excluding the goalkeeper) in the left team, and the action is their velocities. The dataset is split into training, validation, and test sets with a ratio of 6 : 2 : 2.

For the three datasets, to learn the marginal action distribution of each agent (i.e., Gaussian mixtures), we use an MLP with one hidden layer that takes a state as input and outputs the centers of the Gaussian mixtures. To prevent overfitting, the variance of these Gaussian mixtures is parameterized by a free variable for each particle, and the mixture weights are set as uniform. Each dimension of states and actions in the original datasets is normalized to the range $[-1, 1]$. For PhySim, the number of particles is set to 5, the learning rate is set to 0.01, and the weight of the L2 regularizer is set to $10^{-5}$. For Driving, the learning rate is 0.005 and the L2 regularizer weight is $10^{-5}$. For RoboCup, the learning rate is 0.001 and the L2 regularizer weight is $10^{-6}$.
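The complementary spring configurations used in PhySim can be made concrete with a short Python sketch. How the first adjacency matrix is chosen is not stated in the text, so drawing it at random is an assumption here, as are the function names and the seed. The sketch builds a symmetric binary matrix with an empty diagonal, forms its complement so that the two matrices and the identity sum to the all-ones matrix, and then samples the hidden coordination signal for one trajectory of length 500.

```python
import random

def random_spring_graph(n, rng):
    """Symmetric 0/1 adjacency matrix with zeros on the diagonal (assumed random)."""
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = rng.randint(0, 1)
    return a

def complement_graph(a):
    """Connect exactly the pairs that `a` leaves unconnected."""
    n = len(a)
    return [[0 if i == j else 1 - a[i][j] for j in range(n)] for i in range(n)]

rng = random.Random(0)
N = 5  # number of particles in PhySim
A1 = random_spring_graph(N, rng)
A2 = complement_graph(A1)

# Hidden coordination signal: which spring set is active at each timestamp.
hidden_signal = [rng.randint(1, 2) for _ in range(500)]
springs_per_step = [A1 if k == 1 else A2 for k in hidden_signal]
```

Because every particle sees one of two possible spring sets at a given state, its acceleration has two modes, which matches the two-component Gaussian mixture marginals described for this dataset.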

C BASELINE IMPLEMENTATION DETAILS

For LR, we use the default implementation in the Python scikit-learn (sklearn) package. For SocialLSTM (Alahi et al., 2016), the dimension of the input is set to the dimension of states in each dataset. The spatial pooling size is 32, and we use an 8 × 8 sum pooling window without overlaps. The hidden state dimension of the LSTM is 128. The learning rate is 0.001. For IN (Battaglia et al., 2016), all MLPs have one hidden layer of 32 units. The learning rate is 0.005. For CommNet (Sukhbaatar et al., 2016), all MLPs have one hidden layer of 32 units. The dimension of hidden states is set to 64, and the number of communication rounds is set to 2. The learning rate is 0.001. For VAIN (Hoshen, 2017), the encoder and decoder functions are implemented as fully connected neural networks with one hidden layer of 32 units. The dimension of hidden states is 64, and the dimension of attention vectors is 10. The learning rate is 0.0005. For NRI (Kipf et al., 2018), we use an MLP encoder and an MLP decoder, each with one hidden layer of 32 units. The learning rate is 0.001.

D VISUALIZED COPULA ON ROBOCUP DATASET

The trajectories of players in one game as well as the visualized pairwise copula are presented in Figure 4.

E GENERATED TRAJECTORIES ON DRIVING DATASET

For the Driving dataset, we randomly select 10 original trajectories and 10 trajectories generated by our method, and show the visualization results in Figure 5. The x-axis is the timestamp and the y-axis is the location (coordinate) of the two cars. Our learned policy is able to maintain the distance between the two cars.



An alternative approach is to combine the two steps and use end-to-end training, but this does not perform well in practice because the copula term is unlikely to converge before the marginals are well-trained.

Here we assume that each agent is aware of the whole system state. However, our model can be easily generalized to the case where agents are only aware of partial system state by feeding the corresponding state to their MLPs.

https://carla.org/



Figure 1: In each subfigure, the left part visualizes the joint policy $\pi(a_1, a_2|s)$ on the joint action space $[-3, 3]^2$ and the right part shows the corresponding marginal policies (e.g., $\pi_1(a_1|s) =$

Algorithm 1: Training procedure
Input: The number of trajectories $M$, the length of trajectory $T$, the number of agents $N$, demonstrations $D = \{\tau_i\}_{i=1}^{M}$, where each trajectory $\tau_i = \{(s_i[t], a_i[t])\}_{t=1}^{T}$
Output: Marginal action distribution MLP $\mathrm{MLP}_{\mathrm{marginal}}$, state-dependent copula MLP $\mathrm{MLP}_{\mathrm{copula}}$ or state-independent copula density $c(\cdot)$
// Learning marginals
while $\mathrm{MLP}_{\mathrm{marginal}}$ has not converged do
    for each state-action pair $(s, (a_1, \cdots, a_N))$ do
        Calculate the conditional marginal action distributions for all agents: $\{f_j(\cdot|s)\}_{j=1}^{N} \leftarrow \mathrm{MLP}_{\mathrm{marginal}}(s)$
        for agent $j = 1, \cdots, N$ do
            Calculate the likelihood of the observed action $a_j$: $f_j(a_j|s)$
            Maximize $f_j(a_j|s)$ by optimizing $\mathrm{MLP}_{\mathrm{marginal}}$ using gradient descent
// Learning copula
while $\mathrm{MLP}_{\mathrm{copula}}$ or $c(\cdot)$ has not converged do
    for each state-action pair $(s, (a_1, \cdots, a_N))$ do
        $\{f_j(\cdot|s)\}_{j=1}^{N} \leftarrow \mathrm{MLP}_{\mathrm{marginal}}(s)$
        for agent $j = 1, \cdots, N$ do
            $F_j(\cdot|s) \leftarrow$ the CDF of $f_j(\cdot|s)$
            Transform the agent action $a_j$ to a uniformly distributed value $u_j \leftarrow F_j(a_j|s)$
        Obtain $u = (u_1, \cdots, u_N)$ in the unit hyper-cube $[0, 1]^N$
        if copula is set as state-dependent then
            Calculate the state-dependent copula density $c(\cdot|s) \leftarrow \mathrm{MLP}_{\mathrm{copula}}(s)$
            Calculate the likelihood of $u$: $c(u|s)$
            Optimize $\mathrm{MLP}_{\mathrm{copula}}$ by maximizing $\log c(u|s)$ using gradient descent
        else
            Calculate the likelihood of $u$: $c(u)$
            Optimize parameters of $c(\cdot)$ using maximum likelihood or non-parametric methods
return $\mathrm{MLP}_{\mathrm{marginal}}$, $\mathrm{MLP}_{\mathrm{copula}}$ or $c(\cdot)$
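The two-step structure of the training procedure (marginals first, copula second) can be sketched in a heavily simplified, state-independent form. Everything specific below is an assumption for illustration: single Gaussians replace the MLP-parameterized mixtures, a two-agent Gaussian copula is fitted by a simple correlation estimate on normal scores instead of gradient descent, and the demonstrations are synthetic toy data.

```python
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist()

def fit_marginals(actions):
    """Step 1: fit one Gaussian per agent by maximum likelihood."""
    return [NormalDist(fmean(col), pstdev(col)) for col in zip(*actions)]

def to_copula_space(actions, marginals, eps=1e-12):
    """Transform a_j to u_j = F_j(a_j), kept strictly inside (0, 1)."""
    return [[min(max(m.cdf(a), eps), 1.0 - eps) for a, m in zip(row, marginals)]
            for row in actions]

def fit_gaussian_copula(us):
    """Step 2: correlation of the normal scores z_j = Phi^{-1}(u_j), two agents."""
    z1, z2 = zip(*[[STD_NORMAL.inv_cdf(u) for u in row] for row in us])
    m1, m2 = fmean(z1), fmean(z2)
    cov = fmean((a - m1) * (b - m2) for a, b in zip(z1, z2))
    var1 = fmean((a - m1) ** 2 for a in z1)
    var2 = fmean((b - m2) ** 2 for b in z2)
    return cov / (var1 * var2) ** 0.5

# Synthetic demonstrations for two agents with correlated actions (toy data).
rng = random.Random(0)
demos = []
for _ in range(2000):
    shared = rng.gauss(0.0, 1.0)
    demos.append([shared, 2.0 + 0.8 * shared + 0.6 * rng.gauss(0.0, 1.0)])

marginals = fit_marginals(demos)        # learn marginals first
us = to_copula_space(demos, marginals)  # then move to the unit hyper-cube
rho_hat = fit_gaussian_copula(us)       # and fit the copula on top
```

Fitting the copula only after the marginals are fixed mirrors the ordering in the procedure: the values $u_j$ are meaningful only once each $F_j$ describes its agent's actions well.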

Figure 2: Experimental environments (left to right): PhySim, Driving, and RoboCup.

Figure 3: Generated trajectories on RoboCup dataset using independent modeling or copula. Dotted lines are ground-truth trajectories and solid lines are generated trajectories.

Input: Inference module (Algorithm 2), initial state $s[0]$, required length $L$, environment $E$
Output: Generated trajectory $\tau$
for $l = 0, \cdots, L$ do
    Feed state $s[l]$ to the inference module and get the predicted action $\hat{a}[l]$
    Execute $\hat{a}[l]$ in environment $E$ and get a new state $s[l + 1]$
$\tau = \{(s[l], \hat{a}[l])\}_{l=0}^{L}$
return $\tau$
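The generation loop above amounts to a standard rollout. The following Python sketch shows one way to write it. The callables `infer` and `env_step`, and the toy policy and dynamics used to exercise them, are hypothetical stand-ins for the inference module and the environment $E$.

```python
def generate_trajectory(infer, env_step, initial_state, length):
    """Roll out state-action pairs for l = 0, ..., L (inclusive)."""
    state = initial_state
    trajectory = []
    for _ in range(length + 1):
        action = infer(state)             # predicted joint action for this state
        trajectory.append((state, action))
        state = env_step(state, action)   # environment returns the next state
    return trajectory

def toy_infer(state):
    """Hypothetical policy: each agent moves halfway toward the origin."""
    return [-0.5 * x for x in state]

def toy_env_step(state, action):
    """Hypothetical dynamics: the action is added to the state."""
    return [x + a for x, a in zip(state, action)]

trajectory = generate_trajectory(toy_infer, toy_env_step, [1.0, -2.0], 3)
```

Since each predicted action is fed back through the environment, prediction errors can compound along the rollout, so generated trajectories are a stricter test than one-step action prediction.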

Figure 4: (a) Trajectories of 10 players (except the goalkeeper) of the left team in one RoboCup game; (b) Copula density between x-axis of the L2 player and x-axis of another player (L3 ∼ L11).

Figure 5: Original and generated trajectories on Driving dataset. The x-axis is timestamp and y-axis is the location (1D coordinate) of two cars.

Input: Marginal action distribution MLP $\mathrm{MLP}_{\mathrm{marginal}}$, state-dependent copula MLP $\mathrm{MLP}_{\mathrm{copula}}$ or state-independent copula density $c(\cdot)$, current state $s$
Output: Predicted action $\hat{a}$
// Sample from copula
if copula is set as state-dependent then
    Calculate (parameters of) state-dependent copula density $c(\cdot|s)$

