ADVERSARIAL DRIVING POLICY LEARNING BY MISUNDERSTANDING THE TRAFFIC FLOW

Anonymous authors
Paper under double-blind review

Abstract

Acquiring driving policies that transfer to unseen environments is essential for driving in dense traffic flows. Adversarial training is a promising path to improving robustness under disturbances. Most prior works leverage a few agents to induce failures of the driving policy. However, we argue that directly applying this training framework to dense traffic flows degrades transferability to unseen environments. In this paper, we propose a novel robust policy training framework capable of performing adversarial training on top of a coordinated traffic flow. We start by building a coordinated traffic flow in which agents are allowed to communicate Social Value Orientations (SVOs). Adversarial behavior emerges when the traffic flow misunderstands the SVO of the driving agent. We exploit this property to formulate a minimax optimization problem in which the driving policy maximizes its own reward while a spurious adversarial policy minimizes it. Experiments demonstrate that our adversarial training framework significantly improves the zero-shot transfer performance of the driving policy in dense traffic flows compared to existing algorithms.

1. INTRODUCTION

Policy learning in dense traffic flows is an increasingly active area in autonomous driving for both academia and industry (Dosovitskiy et al., 2017; Suo et al., 2021). Since training driving policies in the real world is costly, researchers build dense traffic flows in simulation as an alternative (Cai et al., 2020; Pal et al., 2020; Wu et al., 2021). Peng et al. (2021) develop a traffic flow that exhibits altruistic behaviors, and driving policies trained in such a coordinated flow also perform well. However, the internal dynamics of different traffic flows vary, making it difficult to train a driving policy in one flow and transfer it to unseen traffic patterns. Hence, it is indispensable to develop robust driving policies that transfer across different traffic flows.

An appealing technical route to improving the robustness of driving policies is adversarial attack (Pinto et al., 2017), which models the differences between training and evaluation environments as extra disturbances on the driving policy (Wachi, 2019; Chen et al., 2021; Liu et al., 2021; Huang et al., 2022). To exert disturbances on the driving policy, these works leverage a few agents that deliberately induce its failures. Although this pipeline works well in sparse traffic, it does not extend to dense traffic flows. On the one hand, increasing the number of attacking agents makes adversarial attacks easier, but it becomes harder for the driving policy to resist such strong disturbances, which severely harms policy learning. On the other hand, attacking agents concentrate on producing adversarial behaviors toward the driving policy while overlooking the modeling of altruistic behaviors among themselves. The key, therefore, is to construct a traffic flow that is coordinated yet still generates adversarial behaviors. We develop a coordinated traffic flow with communication and propose a misunderstanding-based adversarial training pipeline on top of this flow.
Specifically, to build a coordinated traffic flow, we introduce the concept of Social Value Orientation (SVO) (Liebrand, 1984) from social psychology, which balances egoistic and altruistic behaviors for each agent. SVO can be regarded as hidden information of an agent that typically cannot be accessed by other agents. In this paper, however, we allow agents in our traffic flow to communicate their genuine SVOs with each other. Since the traffic flow serves as a testbed for training and evaluating driving policies, the coordination mechanism within the traffic flow is invisible to the driving policy.
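The two ideas above can be sketched in equations. The first line is the common SVO-weighted reward from the social-psychology literature; the second is the minimax objective described in the abstract. The symbols here (the SVO angle \varphi_i, the ego/other reward split, and the adversarial policy \hat{\pi}) are illustrative assumptions following the standard formulation, not necessarily this paper's exact notation:

```latex
% Sketch under assumed notation, not necessarily this paper's exact symbols.
% SVO angle \varphi_i trades agent i's own reward against its neighbors':
r_i^{\mathrm{SVO}} \;=\; \cos(\varphi_i)\, r_i^{\mathrm{ego}}
                 \;+\; \sin(\varphi_i)\, \bar{r}_i^{\mathrm{others}}

% Misunderstanding-based minimax training: the driving policy \pi maximizes
% its expected return while a spurious adversarial policy \hat{\pi} (the flow
% acting on a misunderstood SVO) minimizes it:
\max_{\pi}\; \min_{\hat{\pi}}\;
\mathbb{E}\!\left[\sum_{t} \gamma^{t}\, r_t\!\left(\pi, \hat{\pi}\right)\right]
```

Under this reading, \varphi_i = 0 gives a purely egoistic agent and \varphi_i = \pi/2 a purely altruistic one, so communicating (or misunderstanding) \varphi_i directly shifts how coordinated or adversarial the surrounding flow appears to the driving policy.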

