ADVERSARIAL DRIVING POLICY LEARNING BY MISUNDERSTANDING THE TRAFFIC FLOW

Anonymous authors
Paper under double-blind review

Abstract

Acquiring driving policies that transfer to unseen environments is essential for driving in dense traffic flows. Adversarial training is a promising path to improving robustness under disturbances. Most prior works leverage a few agents to induce failures in the driving policy. However, we argue that directly applying this training framework to dense traffic flows degrades transferability to unseen environments. In this paper, we propose a novel robust policy training framework that applies adversarial training on top of a coordinated traffic flow. We start by building a coordinated traffic flow in which agents are allowed to communicate Social Value Orientations (SVOs). Adversarial behavior emerges when the traffic flow misunderstands the SVO of the driving agent. We exploit this property to formulate a minimax optimization problem in which the driving policy maximizes its own reward while a spurious adversarial policy minimizes it. Experiments demonstrate that our adversarial training framework significantly improves the zero-shot transfer performance of the driving policy in dense traffic flows compared to existing algorithms.

1. INTRODUCTION

Policy learning in dense traffic flows is an increasingly active area for both academia and industry in autonomous driving (Dosovitskiy et al., 2017; Suo et al., 2021). Since training driving policies in the real world is costly, researchers build dense traffic flows in simulation as an alternative (Cai et al., 2020; Pal et al., 2020; Wu et al., 2021). Peng et al. (2021) develop a traffic flow that exhibits altruistic behaviors, and driving policies trained in such a coordinated flow also perform well. However, the internal dynamics of different traffic flows vary, making it difficult to train a driving policy in one flow and transfer it to unseen traffic patterns. Hence, it is indispensable to develop robust driving policies that can transfer among different traffic flows.

An appealing technical route to improving the robustness of driving policies is adversarial attack (Pinto et al., 2017), which models the differences between training and evaluation environments as extra disturbances on the driving policy (Wachi, 2019; Chen et al., 2021; Liu et al., 2021; Huang et al., 2022). To exert disturbances on the driving policy, these works leverage a few agents to deliberately induce failures. Although this pipeline works well in sparse traffic, it does not extend to dense traffic flows. On the one hand, increasing the number of attacking agents makes adversarial attacks easier, yet it is harder for the driving policy to resist such strong disturbances, which severely harms policy learning. On the other hand, attacking agents concentrate on producing adversarial behaviors toward the driving policy while overlooking the modeling of altruistic behaviors among themselves. The key, therefore, is to construct a coordinated traffic flow that still generates adversarial behaviors. We develop a coordinated traffic flow with communication and propose a misunderstanding-based adversarial training pipeline on top of this flow.
Specifically, to build a coordinated traffic flow, we introduce the concept of Social Value Orientation (SVO) (Liebrand, 1984) from social psychology, which balances egoistic and altruistic behaviors for each agent. SVO can be regarded as hidden information of an agent, which typically cannot be accessed by other agents. In this paper, however, we allow agents in our traffic flow to communicate their genuine SVOs with each other. Since the traffic flow serves as a testbed for training and evaluating driving policies, the coordination mechanism within the flow is invisible to the driving policy: when a driving policy interacts with the traffic flow, the flow requires the driving policy's SVO, while the driving policy is unaware of the flow agents' SVOs. This asymmetry offers a neat way to induce misunderstandings between the driving policy and our traffic flow, making the flow adversarial toward the driving policy. We reserve a spurious adversarial agent that disturbs the SVO delivered from the driving agent to the other agents, and formulate a minimax optimization problem in which the driving policy maximizes its own reward while the spurious adversarial policy minimizes it, as shown in Figure 1.

Contributions. We propose a novel adversarial training framework based on a coordinated traffic flow to obtain driving policies that transfer across various traffic flows. We develop a coordinated traffic flow in which agents exhibit egoistic, prosocial, and altruistic behaviors by communicating SVOs with each other. On top of this flow, we perform adversarial driving policy training by adversarially misunderstanding the traffic flow, which is thereby disturbed into producing improper coordinated behaviors toward the driving policy. We investigate the characteristics of several traffic flows in four challenging scenarios and carry out comprehensive comparative studies to evaluate the robustness of the driving policy.
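To make the SVO-based coordination concrete, a common angular formulation in the literature represents SVO as an angle that trades off an agent's own reward against the reward of surrounding agents. The paper does not specify its exact reward shaping, so the function below is an illustrative sketch under that standard angular assumption; the names `r_ego`, `r_others`, and `svo_deg` are ours.

```python
import math

def svo_reward(r_ego, r_others, svo_deg):
    """Blend egoistic and altruistic reward terms via an SVO angle.

    Under the common angular formulation (an assumption here, not
    necessarily the paper's exact shaping):
      0 deg  = fully egoistic, 45 deg = prosocial, 90 deg = altruistic.
    """
    phi = math.radians(svo_deg)
    return math.cos(phi) * r_ego + math.sin(phi) * r_others

# An egoistic agent (SVO = 0 deg) weighs only its own reward:
svo_reward(1.0, 0.5, 0.0)   # -> 1.0
# A prosocial agent (SVO = 45 deg) weighs both terms equally:
svo_reward(1.0, 0.5, 45.0)  # -> ~1.061
```

Communicating these angles lets flow agents anticipate how cooperative a neighbor intends to be, which is precisely the channel the spurious adversary later corrupts.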
Results show that our traffic flow achieves the highest success rate and that the proposed adversarial training pipeline significantly improves the transferability of the driving policy compared to existing algorithms.
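The minimax formulation described above can be sketched with a deliberately simplified scalar game: in practice both policies are trained with reinforcement learning, but here `theta` stands in for the driving policy's parameters, `psi` for the bounded SVO perturbation chosen by the spurious adversary, and the quadratic reward is entirely hypothetical.

```python
# Toy stand-in for max_theta min_psi R(theta, psi). The reward below is
# NOT the paper's reward; it is a hypothetical saddle objective chosen
# so that alternating gradient updates visibly converge.
def reward(theta, psi):
    return -(theta - 1.0) ** 2 + 0.5 * theta * psi

def train(steps=500, lr_drive=0.1, lr_adv=0.2, psi_bound=1.0):
    theta, psi = 0.0, 0.0
    for _ in range(steps):
        # Adversary: gradient descent on R w.r.t. psi (minimization),
        # with the SVO perturbation clipped to a bounded set.
        grad_psi = 0.5 * theta
        psi = max(-psi_bound, min(psi_bound, psi - lr_adv * grad_psi))
        # Driving policy: gradient ascent on R w.r.t. theta (maximization).
        grad_theta = -2.0 * (theta - 1.0) + 0.5 * psi
        theta = theta + lr_drive * grad_theta
    return theta, psi

theta, psi = train()
# The driver settles at the optimum that is robust to the worst-case
# perturbation psi = -1: theta -> 0.75, not the unperturbed optimum 1.0.
```

The qualitative point carries over to the full method: training against the minimizing adversary pulls the driving policy away from the optimum of the undisturbed flow and toward behavior that remains rewarding under worst-case SVO misunderstanding.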

2. RELATED WORK

Dense traffic flows. Prior works explore different methodologies to simulate dense traffic flows, including rule design (Behrisch et al., 2011; Dosovitskiy et al., 2017; Cai et al., 2020; Zhou et al., 2021), Imitation Learning (IL) (Zhao et al., 2021; Gu et al., 2021; Wang et al., 2022), and Multi-Agent Reinforcement Learning (MARL) (Pal et al., 2020; Palanisamy, 2020; Wu et al., 2021). IL naturally leverages abundant human expert data but suffers from severe distribution shift and poor closed-loop performance even in simple scenarios. Most rule- and MARL-based algorithms aim to simulate the individual behaviors of distinct agents, which overlooks the complex interactions among agents. Similar to our work, Peng et al. (2021) also build a coordinated traffic flow based on SVO. However, agents in their traffic flow have no access to other agents' SVOs, leading to conservative behaviors.

Figure 1: Overview of our training framework. Left: We build up a coordinated traffic flow in which agents communicate SVOs to coordinate with each other. Right: By disturbing the SVO of the driving agent, our traffic flow exhibits adversarial behaviors toward the driving policy.

Adversarial attack. A common way to acquire a robust policy is Robust Adversarial Reinforcement Learning (RARL) (Pinto et al., 2017; Pan et al., 2019; Vinitsky et al., 2020; Oikarinen et al., 2021). Researchers in autonomous driving also follow this pipeline (Wachi, 2019; Ding et al., 2020; Chen et al., 2021; Sharif & Marijan, 2021; Huang et al., 2022). Adversarial policies in Ding et al. (2020); Chen et al. (2021); Sharif & Marijan (2021); Huang et al. (2022) are optimized to collide with the driving agent, while Wachi (2019); Huang et al. (2022) attempt to expel the ego agent from drivable areas. By using attacking agents to deliberately interfere with the driving policy, such a pipeline provides large adversarial disturbances. However, excessive concern for rarely occurring scenarios harms the robustness of the driving policy in unseen environments, since it fails

