ADVERSARIAL DRIVING POLICY LEARNING BY MISUNDERSTANDING THE TRAFFIC FLOW

Anonymous authors. Paper under double-blind review.

Abstract

Acquiring driving policies that transfer to unseen environments is essential for driving in dense traffic flows. Adversarial training is a promising path to improving robustness under disturbances. Most prior works leverage a few agents to induce the driving policy's failures. However, we argue that directly applying this training framework in dense traffic flow degrades transferability to unseen environments. In this paper, we propose a novel robust policy training framework that applies adversarial training on top of a coordinated traffic flow. We start by building a coordinated traffic flow in which agents are allowed to communicate Social Value Orientations (SVOs). Adversary emerges when the traffic flow misunderstands the SVO of the driving agent. We exploit this property to formulate a minimax optimization problem in which the driving policy maximizes its own reward while a spurious adversarial policy minimizes it. Experiments demonstrate that our adversarial training framework significantly improves the zero-shot transfer performance of the driving policy in dense traffic flows compared to existing algorithms.

1. INTRODUCTION

Policy learning in dense traffic flows is an increasingly active area for both academia and industry in autonomous driving (Dosovitskiy et al., 2017; Suo et al., 2021). Since training driving policies in the real world is costly, researchers build dense traffic flows in simulation as an alternative (Cai et al., 2020; Pal et al., 2020; Wu et al., 2021). Peng et al. (2021) develop a traffic flow that exhibits altruistic behaviors, and a driving policy trained in such a coordinated flow also performs well. However, the internal dynamics of different traffic flows vary, making it difficult to train a driving policy in one flow and transfer it to unseen traffic patterns. Hence, it is indispensable to develop robust driving policies that can transfer among different traffic flows.

An appealing technical route to improve the robustness of driving policies is adversarial attack (Pinto et al., 2017), which models differences between training and evaluation environments as extra disturbances on the driving policy (Wachi, 2019; Chen et al., 2021; Liu et al., 2021; Huang et al., 2022). To exert disturbances on the driving policy, these works leverage a few agents to deliberately induce the driving policy's failures. Although this works well in sparse traffic, the pipeline does not extend to dense traffic flows. On the one hand, increasing the number of attacking agents makes adversarial attacks easier, yet it becomes harder for the driving policy to resist such strong disturbances, which severely harms policy learning. On the other hand, attacking agents concentrate on producing adversarial behaviors towards the driving policy while overlooking the modeling of altruistic behaviors among themselves. Therefore, the key is to construct a coordinated traffic flow that still generates adversarial behaviors. We develop a coordinated traffic flow with communication and propose a misunderstanding-based adversarial training pipeline based on this flow.
Specifically, to build a coordinated traffic flow, we introduce the concept of Social Value Orientation (SVO) (Liebrand, 1984) from social psychology, which balances egoistic and altruistic behaviors for each agent. SVO can be regarded as hidden information of one agent that typically cannot be accessed by other agents. In this paper, however, we allow agents in our traffic flow to communicate genuine SVOs with each other. Since the traffic flow serves as a testbed for training and evaluating driving policies, the coordination mechanism within the traffic flow is invisible to driving policies. In other words, when a driving policy is placed to interact with the traffic flow, the traffic flow requires the driving policy's SVO while the driving policy is unaware of the traffic flow's SVOs. This property offers a neat approach to induce misunderstandings between the driving policy and our traffic flow, making the flow adversarial towards the driving policy. We reserve a spurious adversarial agent to disturb the SVO delivery from the driving agent to other agents and formulate a minimax optimization problem in which the driving policy maximizes its own reward while the spurious adversarial policy minimizes it, as shown in Figure 1.

Contributions. We propose a novel adversarial training framework based on a coordinated traffic flow to obtain driving policies that transfer across various traffic flows. We develop a coordinated traffic flow in which agents exhibit egoistic, prosocial, and altruistic behaviors by communicating SVOs with each other. Based on this traffic flow, we apply adversarial driving policy training by adversarially misunderstanding the traffic flow, which is disturbed into producing improper coordinated behaviors towards the driving policy. We investigate the characteristics of several traffic flows in four challenging scenarios and carry out comprehensive comparative studies to evaluate the robustness of the driving policy.
Results show that our traffic flow achieves the highest success rate and that the proposed adversarial training pipeline significantly improves the transferability of the driving policy compared to existing algorithms.

2. RELATED WORK

Dense traffic flows. Prior works explore different methodologies to simulate dense traffic flows, including rule design (Behrisch et al., 2011; Dosovitskiy et al., 2017; Cai et al., 2020; Zhou et al., 2021), Imitation Learning (IL) (Zhao et al., 2021; Gu et al., 2021; Wang et al., 2022), and Multi-Agent Reinforcement Learning (MARL) (Pal et al., 2020; Palanisamy, 2020; Wu et al., 2021). IL naturally leverages large amounts of human expert data but suffers from severe distribution shift and poor closed-loop performance even in simple scenarios. Most rule- and MARL-based algorithms aim to simulate individual behaviors of distinct agents, overlooking complex interactions among agents. Similar to our work, Peng et al. (2021) also build a coordinated traffic flow based on SVO. However, agents in their traffic flow have no access to other agents' SVOs, leading to conservative behaviors.

Adversarial attack. A common way to acquire a robust policy is Robust Adversarial Reinforcement Learning (RARL) (Pinto et al., 2017; Pan et al., 2019; Vinitsky et al., 2020; Oikarinen et al., 2021). Researchers in autonomous driving also follow this pipeline (Wachi, 2019; Ding et al., 2020; Chen et al., 2021; Sharif & Marijan, 2021; Huang et al., 2022).

SVO-embedded reward. For each agent i, the SVO-embedded reward is

R'_i = cos(c_i) R_i + sin(c_i) R_{S_i}, where R_{S_i} = ( Σ_{j ∈ I_{S_i}} R_j ) / |I_{S_i}|,   (1)

I_{S_i} is the set of surrounding agents w.r.t. agent i, and c_i ∈ [0, π/2] is the SVO of agent i, kept fixed during each episode. Given Equation 1, we formulate an SVO-embedded POSG G' = ⟨I, S, A, P, R', C, ρ_0, O, n, γ, T⟩, where R' = {R'_0, R'_1, ..., R'_{n−1}} denotes the set of SVO-embedded reward functions and C = {c_0, c_1, ..., c_{n−1}} is the set of all SVOs.

Problem formulation. As one can see, SVO determines the trade-off between egoistic and altruistic behaviors. Each agent needs to recognize the SVOs of itself and of other agents, which provides the ability to infer other agents' reward structures. Therefore, we design the policy as β : O_i × C_i × (×_{j ∈ I_{S_i}} C_j) → ∆(A_i), and we use a single policy β to optimize the sum of the n objectives in the SVO-embedded POSG:

max_β E_{s_t ∼ P_β, a_t ∼ β} [ Σ_{i ∈ I} Σ_{t=0}^{T} γ^t R'_i(s_t, a_t, c_i) ],  c_i ∈ C_i.   (2)
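As a concrete illustration of Equation 1, the sketch below blends an agent's own reward with the mean reward of its surrounding agents according to its SVO angle. The dictionary-based bookkeeping (`rewards`, `svos`, `neighbors`) is our own illustration, not the paper's implementation.

```python
import math

def svo_embedded_reward(rewards, svos, neighbors, i):
    """SVO-embedded reward of agent i (Equation 1): blend the agent's own
    reward with the mean reward of its surrounding agents."""
    r_social = sum(rewards[j] for j in neighbors[i]) / len(neighbors[i])
    c = svos[i]  # SVO angle c_i in [0, pi/2]
    return math.cos(c) * rewards[i] + math.sin(c) * r_social

rewards = {0: 1.0, 1: 0.5, 2: -0.2}   # per-agent rewards R_j
neighbors = {0: [1, 2]}               # I_{S_0}: agents surrounding agent 0
svos = {0: math.pi / 4}               # prosocial: equal self/other weighting
r = svo_embedded_reward(rewards, svos, neighbors, 0)
```

With c_i = 0 the agent is fully egoistic (r = R_i); with c_i = π/2 it is fully altruistic (r = R_{S_i}); c_i = π/4 weighs both equally.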

3.2. KEY COMPONENTS

State space. Agents driving in dense traffic flow need to continually interact with surrounding agents. Besides, road structures also influence agents' decisions. Therefore, the state space S needs to cover a collection of static and dynamic elements. The set of static elements E_s includes lane centerlines, sidelines, and agents' global paths, i.e., E_s = {centerline, sideline, path}. The set of dynamic elements E_d includes current and historical poses and velocities (trajectories) of all agents, i.e., E_d = {trajectory_0, trajectory_1, ..., trajectory_{n−1}}. We utilize a vectorized representation based on Gao et al. (2020), which is computation- and memory-efficient. In our work, elements in E_s and E_d are sets of points containing corresponding features. Specifically, a static element e^s_i = {v_0, v_1, ..., v_j, ...}, i ∈ E_s, with v_j = [p_j, l_i, i, j], where p_j = (x, y, θ) is the pose of point j in element i and l_i is the lane width of element i. For points in dynamic elements, v_j = [p_j, c_i, i, j], where p_j = (x, y, θ, v) and c_i denotes the SVO of agent i.

Observation space. In a POSG, each agent can only receive perceptual information locally. We use the L2 norm to define locality, i.e., agent i can only receive points (x_e, y_e) such that ∥(x_i, y_i) − (x_e, y_e)∥_2 ≤ d, where (x_i, y_i) is the current location of agent i.

Design of R. The goal of each agent in dense traffic flow is homogeneous; for instance, all agents want to successfully finish the task as fast as possible. Besides, since each agent receives O_i, designing R_i upon O_i rather than S benefits policy training. In our work, we use a self-motivated reward R_i : O_i × A_i → R with R_i = R for all i ∈ I. However, designing a self-motivated reward function remains an open problem: a fine-grained dense reward accelerates training but relies heavily on human knowledge, while a coarse sparse reward requires much more data.
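The L2-norm locality rule for observations can be sketched as follows; the sensing radius `d = 30.0` is an assumed value, not a parameter reported in the paper.

```python
import math

def local_observation(points, agent_xy, d=30.0):
    """Agent i only receives points within L2 distance d of its location."""
    ax, ay = agent_xy
    return [p for p in points if math.hypot(p[0] - ax, p[1] - ay) <= d]

points = [(0.0, 0.0), (10.0, 0.0), (50.0, 0.0)]
visible = local_observation(points, (0.0, 0.0))  # the point at x = 50 is dropped
```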
To combine both benefits, we design a near-sparse reward function containing a dense reward for incentivizing fast driving and a sparse reward for penalizing catastrophic failures. Catastrophic failures include collision with other agents, deviation from the drivable area, driving too far from the global path, and crashing into the wrong lane. Coordinated behaviors can then be produced by incorporating SVO.

Policy training. We apply Independent Policy Learning (IPL) (Tan, 1993) to solve Equation 2. Although IPL is prone to generating egoistic suboptimal behaviors (de Witt et al., 2020; Yang et al., 2020), we can alleviate this problem by incorporating SVO, which forces the algorithm to consider other agents' goals.
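A minimal sketch of such a near-sparse reward is given below; the coefficients (0.1 for the dense speed term, −10.0 for the sparse failure penalty) are hypothetical placeholders, not the paper's tuned values.

```python
def near_sparse_reward(speed, v_max, collision, off_road, off_route, wrong_lane):
    """Near-sparse reward: a dense term for driving fast plus a sparse
    penalty on catastrophic failures (coefficients are hypothetical)."""
    r = 0.1 * speed / v_max                       # dense: normalized speed
    if collision or off_road or off_route or wrong_lane:
        r -= 10.0                                 # sparse: catastrophic failure
    return r
```

The dense term gives a continuous learning signal at every step, while the sparse penalty dominates whenever any of the four catastrophic failures occurs.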

3.3. POLICY ARCHITECTURE

To better extract static and dynamic features and capture relations among them, we utilize a hierarchical feature extraction framework. We use DeepSet (Zaheer et al., 2017) to aggregate homogeneous information within dynamic and static elements, followed by Multi-Head Attention (MHA) (Vaswani et al., 2017) to further extract heterogeneous information among different elements.

Homogeneous feature aggregation. Consider an element set e ⊂ E, E ∈ {E_s, E_d}. The function processing the set needs to retain adjacency between elements and be permutation-invariant to the order of points within an element. Based on Theorem 2 in Zaheer et al. (2017), the propagation function f is defined as

f(e) = ρ( Σ_{v ∈ e} φ(v) ),   (3)

and we obtain the element-level feature l_e = f(e), where each point v ∈ e is transformed into a representation φ(v) and the sum of representations is processed by the network ρ, a Multi-Layer Perceptron (MLP). In our implementation, DeepSet extracts polyline-level features without introducing too many parameters.

Heterogeneous feature aggregation. The static element-level features l^s_e = [l^s_{e0}, l^s_{e1}, ...] and the dynamic element-level features l^d_e = [l^d_{e0}, l^d_{e1}, ...] go through an MHA layer that accounts for their inter-relations to output the final action for the agents. Given arbitrary feature matrices w, z and their linear projections w_Q, w_K, w_V and z_Q, z_K, z_V, SelfAttn(w) and CrossAttn(w, z) are defined as

SelfAttn(w) = Softmax( w_Q w_K^T / √d_k ) w_V,   CrossAttn(w, z) = Softmax( w_Q z_K^T / √d_k ) z_V,   (4)

where d_k is the dimension of the key vectors. We leverage a one-layer cross-attention network to model the interaction between dynamic and static segments. The dynamic element features l^d_e and static element features l^s_e are fused by the SelfAttn and CrossAttn operations: l^d_o = SelfAttn(l^d_e) + CrossAttn(l^d_e, l^s_e). We train a driving policy π : O → ∆(A) to solve M.
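The two aggregation stages above can be sketched with plain NumPy, using single-head attention and one-layer tanh networks in place of the paper's MLPs; all weight shapes are illustrative. The permutation check confirms the DeepSet encoder is order-invariant, as required by Theorem 2 of Zaheer et al. (2017).

```python
import numpy as np

def deepset(element, phi_w, rho_w):
    """f(e) = rho(sum_{v in e} phi(v)): permutation-invariant encoder."""
    h = np.tanh(element @ phi_w)            # phi applied to every point
    return np.tanh(h.sum(axis=0) @ rho_w)   # rho on the summed representation

def attention(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V: SelfAttn when q, k, v share a source,
    CrossAttn when q comes from a different element set than k, v."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 8))                  # one element: 5 points
phi_w, rho_w = rng.normal(size=(8, 16)), rng.normal(size=(16, 16))
feat = deepset(points, phi_w, rho_w)              # element-level feature
perm = deepset(points[::-1], phi_w, rho_w)        # order should not matter
fused = attention(feat[None, :], feat[None, :], feat[None, :])
```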
c_β ∈ C = [0, π/2] is the SVO of the driving policy as taken by β. The genuine SVO of the driving agent, c_π, is always 0 since existing single-agent algorithms are fully self-interested. Adversary emerges when c_β and c_π differ:

max_π min_{c_β} E_{s_t ∼ P_{β,c_β}, a_t ∼ π} [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ].   (7)

In Section 3.1 and Equation 7, SVOs are invariant during one episode to stabilize training. In adversarial training, however, we aim to destabilize policy training. Therefore, we introduce a spurious policy π_c : O → ∆(C) that produces c_β and is allowed to change across time steps, turning Equation 7 into

max_π min_{π_c} E_{c_{β,t} ∼ π_c, s_t ∼ P_{β,c_{β,t}}, a_t ∼ π} [ Σ_{t=0}^{T} γ^t R(s_t, a_t) ].   (8)

Note that Equation 8 involves three policies: the driving policy π, the background policy β, and the spurious adversarial policy π_c. Since the driving policy maximizes its own reward, it is egoistic from the perspective of social psychology. The background policy controls the whole traffic flow to exhibit egoistic and altruistic behaviors. The spurious policy is the only one that aims to generate adversarial behaviors, by minimizing the driving policy's reward. We highlight that agents in our traffic flow try to coordinate with each other, a fundamental difference from previous attacking agents. Instead of deliberately inducing failures of π, we keep β non-adversarial and leverage an extra π_c to disturb the SVO of π as taken by β.

4.2. ADVERSARIAL POLICY TRAINING

Algorithm 1 outlines our training framework. Given the background policy β, we alternately optimize the driving policy π and the adversarial policy π_c. The parameters θ_π of π and θ_{π_c} of π_c are randomly initialized before training. In each of N iterations, we first optimize θ_π while keeping θ_{π_c} fixed, then optimize θ_{π_c} while keeping θ_π fixed. When optimizing θ_{π_c}, the reward is negated so that the adversary degrades the performance of the driving policy. Note that the action of the adversarial policy is c_β, while a is the action of the driving policy and is used to compute r. This alternating procedure is repeated for N iterations. The main difference between standard and misunderstanding-based adversarial learning is the objective of β. In standard adversarial learning, β controls background agents to attack the driving agent, and background agents know exactly which one the driving agent is. In misunderstanding-based adversarial learning, background agents aim to coordinate with each other, including the driving agent; a background agent cannot distinguish which surrounding agent is the driving agent. Therefore, the spurious agent that produces c_β applies adversary from the perspective of the driving agent. In Algorithm 1, the spurious agent and the driving policy take the same observation.
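A skeletal version of this alternating procedure, with dummy stand-ins for the environment rollout and the RL updates (the paper uses SAC), might look like:

```python
import random

class Policy:
    """Stand-in policy: update() just counts gradient steps."""
    def __init__(self):
        self.steps = 0
    def update(self, batch):
        self.steps += 1

def rollout(pi, pi_c, horizon=4):
    # Stand-in for rolling out pi and pi_c on M; tuples are (o, a, c_beta, o', r).
    return [(None, None, None, None, random.random()) for _ in range(horizon)]

def negate_rewards(batch):
    # The spurious adversary minimizes the driving reward, so it trains on -r.
    return [(o, a, c, o2, -r) for (o, a, c, o2, r) in batch]

def train_adversarial(pi, pi_c, N=3, N1=2, N2=2):
    """Alternating minimax optimization as in Algorithm 1."""
    for _ in range(N):
        for _ in range(N1):          # Stage 1: pi_c fixed, optimize pi on r
            pi.update(rollout(pi, pi_c))
        for _ in range(N2):          # Stage 2: pi fixed, optimize pi_c on -r
            pi_c.update(negate_rewards(rollout(pi, pi_c)))
    return pi, pi_c

pi, pi_c = train_adversarial(Policy(), Policy())
```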

5. RESULTS

In this section, we seek to answer three questions. (1) Can our proposed traffic simulation produce more coordinated behaviors? (2) Does the spurious adversarial policy degrade the driving policy's performance? (3) Does our adversarial training framework improve the driving policy's zero-shot transfer ability? Before discussing these questions, we first explain some preliminary details.

Settings. We evaluate our proposed method using our internal driving simulator, which supports various maps and scenarios.

Training pipelines. We use vanilla RL (VRL), existing robust adversarial RL (RARL), and our proposed misunderstanding-based adversarial learning (M-RARL) to train driving policies. VRL can be applied in IDM, FLOW, CoPO, and SocialComm. RARL and M-RARL can only be applied in FailMaker and SocialComm (with the spurious adversarial agent), respectively.

5.1. PERFORMANCE OF TRAFFIC FLOWS

We demonstrate the performance of different traffic flows. Figure 2 and Figure 3 show quantitative results. As one can see, our proposed SocialComm achieves the highest success rates and average speeds across all scenarios. Compared with CoPO, agents in SocialComm can recognize other agents' SVOs and produce coordinated behaviors, achieving collaboration and high efficiency for the whole system. FailMaker achieves the lowest success rate and highest collision rate (lowest safety) due to its adversarial nature. Note that in merge, our traffic flow outperforms other methods by a large margin: the initial poses of all agents in merge are much closer than those in other scenarios, which makes it harder for the agents to coordinate. Qualitative results are shown in Figure 4. See Appendix A.2 for more results.

5.2. MISUNDERSTANDING-BASED ADVERSARY WITH SOCIALCOMM

Our robust policy learning framework applies adversarial training on a coordinated traffic flow. In this part, we demonstrate that misunderstanding-based adversary successfully degrades the driving policy's performance and can impede the driving agent.

Performance. We first train driving policies using vanilla RL in four non-adversarial traffic flows (IDM, FLOW, CoPO, and SocialComm) and deploy these well-trained driving policies into SocialComm with the spurious adversarial agent. Results are shown in Table 1, where data in parentheses give the change in performance under the spurious adversarial policy. Success rates and efficiencies of all driving policies decrease, revealing that our spurious adversarial policy can impede the driving policy. Note that catastrophic failures of the driving policy still increase under adversary in our traffic flow: although highly coordinated, our traffic flow cannot eliminate catastrophic failures of the whole traffic system, and incurring the driving policy's catastrophic failures is permitted by the optimization objective in Equation 8.

Adversarial behaviors on the coordinated traffic flow. Figure 5 shows qualitative results on how our traffic flow impedes the driving policy from finishing its own task. The driving policy controls the red car. At t = 0s, the driving policy aims to pass through the bottleneck efficiently and keeps a high speed. When the driving policy approaches the bottleneck, where interactions frequently happen, the agent with the blue box slows down (t = 2.5s) to regulate the speed of the driving policy and pass through the bottleneck. After that, the agent with the purple box cuts in ahead of the driving agent (t = 5.0s).
These agents impede the driving policy and degrade its efficiency while maintaining explicit coordinated behaviors among themselves.

5.3. ZERO-SHOT TRANSFER PERFORMANCE OF DRIVING POLICIES

To evaluate robustness in unseen environments, we deploy all driving policies in all traffic flows. In Figure 6, elements on the primary diagonal are obtained by evaluating policies in their training environments (the traffic flow used in training and evaluation is the same) and typically achieve the highest success rates. Off-diagonal elements therefore reveal zero-shot transfer performance, where deeper colors in the heatmap represent higher success rates. Comparing VRL/SocialComm and M-RARL/SocialComm, one can see that properly injecting adversaries in dense traffic flow significantly improves robustness in unseen non-adversarial environments. Note that all methods except RARL/FailMaker perform poorly in FailMaker, since it is extremely easy for background agents in FailMaker to attack driving policies, no matter how shrewdly driving agents act. See Appendix A.3 for more results.

6. CONCLUSION

In this paper, we propose a novel adversarial training framework based on a coordinated traffic flow with communication. Driving policies trained with this framework exhibit robust behaviors across various traffic flows. We report characteristics of several traffic flows in scenarios including intersection, bottleneck, merge, and roundabout, and carry out numerous comparative studies to evaluate the transferability of the driving policy. Results show that our traffic flow achieves the highest success rate and that adversarial learning on our traffic flow significantly improves the driving policy's zero-shot transfer performance compared to existing algorithms.

A APPENDIX

A.1 IMPLEMENTATION DETAILS

Intelligent Driver Model (IDM). IDM is given by Equation 9 and Equation 10. The model describes the acceleration v̇_back of the back agent as a function of the agent's velocity v_back, the reference velocity v_0, the velocity difference to the front agent ∆v = v_back − v_front, and the following distance φ = s_front − L_length,front − s_back. Here, s_front is the position of the front agent, s_back denotes the position of the back agent, and L_length,front denotes the length of the front agent. The physical interpretations of the parameters are the minimum following time T, the minimum following gap s_0, the maximum acceleration a, and the comfortable braking deceleration b:

v̇_back = a [ 1 − (v_back / v_0)^δ − ( ϕ(v_back, ∆v) / φ )^2 ],   (9)

ϕ(v_back, ∆v) = s_0 + v_back T + v_back ∆v / (2 √(ab)).   (10)

Policy learning parameters. For Independent Policy Learning (IPL) and single-agent reinforcement learning algorithms, we utilize Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and the Adam optimizer (Kingma & Ba, 2015). Detailed parameters are shown in Table 3.

A.2 RESULTS ON OUR COORDINATED TRAFFIC FLOW

More results are shown in Table 4, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, and Figure 14.
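Equations 9 and 10 can be written directly as a function; the default parameter values below are illustrative, not those in Table 3.

```python
import math

def idm_acceleration(v_back, v_front, s_front, s_back, length_front,
                     v0=6.0, T=1.5, s0=2.0, a=1.0, b=1.5, delta=4.0):
    """IDM acceleration of the rear agent (Equations 9 and 10)."""
    dv = v_back - v_front                     # closing speed, Delta v
    gap = s_front - length_front - s_back     # following distance, phi
    desired = s0 + v_back * T + v_back * dv / (2 * math.sqrt(a * b))
    return a * (1 - (v_back / v0) ** delta - (desired / gap) ** 2)
```

With a distant leader at matching speed the acceleration is positive; with a stopped leader close ahead it turns strongly negative, i.e., the follower brakes.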

A.3 RESULTS ON ZERO-SHOT TRANSFER

More results are shown in Table 5, Table 6, Table 7, and Table 8.

A.4 DETAILS OF OUR SIMULATOR

Our internal driving simulator is 2D and aims to investigate single- and multi-agent driving behaviors, especially in dense traffic flows. Inspired by the trajectory prediction community (Liang et al., 2020; Zhao et al., 2021; Gu et al., 2021), our simulator utilizes a sparse (vectorized) representation to capture the structural information of high-definition maps and agents. Compared to rasterized encoding, which rasterizes the HD map elements together with agents into an image, the vectorized representation is computation- and memory-efficient (Gao et al., 2020). The critical components of our simulator are built on top of this vectorized representation. For an RL-oriented simulator, there are three critical components: scenario initialization, step forward, and done condition.

scenario initialization. We choose one scenario from intersection, bottleneck, merge, and roundabout and load the pre-built vectorized map for this scenario. After that, we assign a global path, initial pose, and SVO for each agent in the scenario. A vectorized map contains two parts: the centerline and the sideline. Each part is a 3D tensor containing different elements, and each element contains a sequence of points. Each point v is a vector [p, l], where p = (x, y, θ) is the pose and l represents the lane width (for sidelines, l is always 0). The average distance between adjacent points is 2.0 m. Given this map, we build a graph G on top of the centerline, where each element is a node of G and an edge exists only when two elements are connected end to end. For each scenario, we manually pick two sets of points as initial and terminal poses, respectively. For each agent, we randomly select an initial and a terminal pose (p_initial and p_terminal) and use the A* algorithm to search for a list of points from p_initial to p_terminal on G. This list of points is the global path of the agent, in which the first point is the initial pose. Finally, we assign an SVO c ∈ [0, π/2] to this agent.
This procedure is repeated for all agents in the scenario. We further clarify what the term "agent" means in our paper. Agents can be divided into two categories: foreground agents and background agents. Background agents are NPCs with fixed policies such as IDM or learned neural networks; they are part of the environment. For a single-agent environment, there is only one foreground agent (the driving agent). For a multi-agent environment, there exist multiple foreground agents. "Foreground agent" is exactly the meaning of "agent" in the RL community. Currently, when there are n vehicles in our simulator, the numbers of foreground and background agents are (1, n − 1) for single-agent settings and (n, 0) for multi-agent settings.

step forward. We use the bicycle model as the vehicle dynamics model, where the inputs are acceleration a and steer δ and the state is (x, y, θ, v). To guarantee that an agent will not substantially exceed its maximum speed v_max = 6 m/s, we introduce a PID controller (K_p = 1.0, K_i = 0.01, K_d = 0.05) to regulate a given the reference speed v_r and the current speed v. Therefore, the action is (v_r, δ), where v_r ∈ [0, v_max] and δ ∈ [−45°, 45°]. As explained in Section 3.2, the state and observation spaces contain a collection of static and dynamic elements and are vectorized, so their dimensions are inherently not fixed. The length of static elements is not fixed and has no upper bound; the upper bound of dynamic elements' length is 10.

done condition. In our simulator, an agent is done if it reaches its destination, encounters a catastrophic failure, or survives until timeout. Once an agent is done, it is removed from the scenario. When all foreground agents are done, the episode ends. Catastrophic failures include collision with other agents, deviation from the drivable area, driving too far from the global path, and crashing into the wrong lane.
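The step-forward component, i.e., a kinematic bicycle model with a PID controller tracking the reference speed v_r, can be sketched as follows. The wheelbase and time step are assumed values; the PID gains are those stated above.

```python
import math

def bicycle_step(state, accel, steer, wheelbase=2.5, dt=0.1):
    """One Euler step of the kinematic bicycle model on state (x, y, theta, v)."""
    x, y, theta, v = state
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += v / wheelbase * math.tan(steer) * dt
    v += accel * dt
    return (x, y, theta, v)

class SpeedPID:
    """PID controller regulating acceleration toward a reference speed v_r."""
    def __init__(self, kp=1.0, ki=0.01, kd=0.05, dt=0.1):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0
    def __call__(self, v_ref, v):
        err = v_ref - v
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

state, pid = (0.0, 0.0, 0.0, 0.0), SpeedPID()
for _ in range(50):                 # track v_r = 6 m/s while driving straight
    state = bicycle_step(state, pid(6.0, state[3]), steer=0.0)
```

After 50 steps the speed has essentially converged to the 6 m/s reference while the heading stays at zero.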
The maximum steps for one episode are t max and when an agent survives t max steps in the environment, we call it "timeout". In this paper, t max = 100. An agent is marked as success only when it passes the interaction zone (as shown in Figure 15 ). Note that we name each scenario with its interaction zone. For instance, the interaction zone in bottleneck is the bottleneck. 

A.6 ADDITIONAL RESULTS

The comparison of training efficiency between CoPO and SocialComm (Ours) is shown in Figure 16 . The impact of varying the number of agents in each scenario is shown in Figure 17 . 



∆(X) denotes the set of probability distributions over a set X.



Figure 1: Overview of our training framework. Left: We build up a coordinated traffic flow in which agents communicate SVOs to coordinate with each other. Right: By disturbing the SVO of driving agent, our traffic flow exhibits adversarial behaviors towards the driving policy.

Figure 2: Performance of different traffic flows. The radar graphs demonstrate three essential features of different traffic flows. Safety is calculated by taking the complement of catastrophic failures.

Figure 5: Adversarial behaviors towards driving agent generated by coordinated traffic flow.Driving policy controls the red vehicle. Vehicles with blue and purple boxes are two background agents that exhibit adversarial behaviors towards driving policy while maintaining coordination.

Figure 6: Zero-shot transfer performance in intersection and bottleneck. The heatmap reports the success rate (%) of different methods in different traffic flows. Deeper color represents higher success rate. The primary diagonal indicates that the training and evaluation environments are the same. A "†" indicates our proposed misunderstanding-based adversarial training.

A.5 METRICS

Note that most metrics we use are widely adopted in prior works. Success is used in Dosovitskiy et al. (2017); Chen et al. (2020); Wu et al. (2021); Peng et al. (2021); Rhinehart et al. (2019); Chen et al. (2019b); Cai et al. (2019). Collision is used in Chen et al. (2020); Suo et al. (2021); Chen et al. (2019b); Cai et al. (2019). Off Road is used in Rhinehart et al. (2019); Chen et al. (2019b). Off Route is our design, but Chen et al. (2019a); Toromanoff et al. (2020) use it as a reward term. Wrong Lane is used in Rhinehart et al. (2019). Efficiency is used in Wu et al. (2021); Chen et al. (2019b); Moghadam et al. (2020); Cai et al. (2019).

Figure 10: The performance of CoPO and SocialComm (Ours) in roundabout.

In this work, we apply an adversarial training framework on a coordinated traffic flow with communication to solve this problem. We model driving in dense traffic flow as a POSG G = ⟨I, S, A, P, R, ρ_0, O, n, γ, T⟩. n is the number of agents. I denotes the set {0, 1, ..., n − 1}. S is the state space. A is the joint action space of n agents, A = ×_{i∈I} A_i. P : S × A → ∆(S)¹ is the state transition probability. R = {R_0, R_1, ..., R_{n−1}} denotes the set of agent-specific reward functions, with R_i : S × A → R bounded for all i ∈ I; note that each agent i receives a distinct reward from its own reward function, r_i = R_i(s, a). ρ_0 ∈ ∆(S) is the initial state distribution. O is the joint observation space, O = ×_{i∈I} O_i. γ ∈ (0, 1] is the discount factor, and T is the time horizon. In a POSG, each agent i maximizes its own expected cumulative reward via a policy β_i : O_i → ∆(A_i). When n becomes large, it is time- and space-consuming to optimize a set of policies B = {β_0, β_1, ..., β_{n−1}}. To solve this problem, we simply adopt a parameter-sharing strategy (Terry et al., 2020), i.e., β_i = β, with the help of neural networks, which have powerful representation ability.

For simplicity, we use an MLP as the decoder function.

Algorithm 1: Misunderstanding-based adversarial policy training
Input: SVO-embedded POMDP M containing traffic flow policy β
Output: Driving policy π, adversarial policy π_c
Initialize: Learnable parameters θ_π, θ_{π_c}
for n = 1, 2, ..., N do
    /* Stage 1: Given π_c, optimize π */
    for n_1 = 1, 2, ..., N_1 do
        Collect a set of transition tuples {(o, a, o′, r)} by rolling out π and π_c on M;
        Optimize parameters θ_π of π using any RL algorithm;
    /* Stage 2: Given π, optimize π_c */
    for n_2 = 1, 2, ..., N_2 do
        Collect a set of transition tuples {(o, a, c_β, o′, −r)} by rolling out π and π_c on M;
        Optimize parameters θ_{π_c} of π_c using any RL algorithm;

Similar to Peng et al. (2021), we select several highly interactive scenarios including intersection, bottleneck, merge, and roundabout. During training, we randomly place 8 to 20 vehicles in each scenario at the beginning of each episode. After training, we randomly place 20 vehicles and evaluate all relevant methods. See more details of our simulator in Appendix A.4.

Coordinated behaviors in bottleneck. The figure highlights that our traffic flow with communication produces diverse coordinated behaviors such as queueing at the narrow crossing, rushing at open areas, and yielding to avoid crashes.

Lane). Third, driving efficiency is represented by the average speed of the whole traffic simulation (Efficiency). As for single-agent training, these metrics are calculated from the perspective of the ego agent. More discussions can be found in Appendix A.5.

Table 1: Effect of misunderstanding-based adversary with our coordinated traffic flow SocialComm. The table reports the percentage of different metrics in intersection and bottleneck. Results in parentheses indicate the performance change under adversary; results marked in red indicate performance degradation, while results in blue indicate performance increase. A "†" indicates our proposed traffic flow.

Parameters of IDM.
The performance of CoPO and SocialComm (Ours) in intersection.
The performance of CoPO and SocialComm (Ours) in bottleneck.
The performance of CoPO and SocialComm (Ours) in merge.

Zero-shot transfer performance in intersection. Each subtable stores results of different driving policies in the same traffic flow. A " † " indicates our proposed method.


Under review as a conference paper at 2023 

