EMERGENT ROAD RULES IN MULTI-AGENT DRIVING ENVIRONMENTS

Abstract

For autonomous vehicles to safely share the road with human drivers, they must abide by the "road rules" that human drivers have agreed to follow. Road rules include those mandated by law, such as stopping at red lights, as well as subtler social conventions, such as the implicit designation of fast lanes on the highway. In this paper, we provide empirical evidence that, instead of hard-coding road rules into self-driving algorithms, a scalable alternative may be to design multi-agent environments in which road rules emerge as optimal solutions to the problem of maximizing traffic flow. We analyze which ingredients of driving environments cause these road rules to emerge and find that two crucial factors are noisy perception and the spatial density of agents. We provide qualitative and quantitative evidence for the emergence of seven social driving behaviors, ranging from obeying traffic signals to following lanes, all of which emerge from training agents to drive quickly to destinations without colliding. Our results add empirical support for the social road rules that countries worldwide have agreed on for safe, efficient driving.

1. INTRODUCTION

Public roads are significantly safer and more efficient when equipped with conventions restricting how one may use them. These conventions are, to some extent, arbitrary. For instance, a "drive on the left side of the road" convention is, practically speaking, no better or worse than a "drive on the right side of the road" convention. However, the decision to reserve some orientation as the canonical orientation for driving is far from arbitrary: establishing such a convention improves both safety (it decreases the probability of head-on collisions) and efficiency (cars can drive faster without worrying about dodging oncoming traffic). In this paper, we investigate the extent to which these road rules, like the choice of a canonical heading orientation, can be learned in multi-agent driving environments in which agents are trained to drive to different destinations as quickly as possible without colliding with other agents. As visualized in Figure 1, our agents are initialized at random positions on different maps (either synthetically generated or scraped from real-world intersections in the nuScenes dataset (Caesar et al., 2019)) and tasked with reaching a randomly sampled feasible target destination. Intuitively, when agents have full access to the map and the exact states of other agents, optimizing for traffic flow leads the agents to drive in qualitatively aggressive and un-humanlike ways. However, when perception is imperfect and noisy, we show in Section 5 that the agents begin to rely on constructs such as lanes, traffic lights, and safety distance to drive safely at high speeds.
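The perception noise that drives our findings can be illustrated with a minimal sketch of a noisy 2D LiDAR model. The function name, noise parameters, and beam-dropout behavior below are illustrative assumptions, not the paper's actual sensor implementation.

```python
import numpy as np

def noisy_lidar(true_ranges, rng, dropout_p=0.1, sigma=0.05, max_range=50.0):
    """Hypothetical perception-noise model for a 2D LiDAR scan:
    Gaussian noise on each range reading, plus random beam dropout
    (dropped beams read max_range, as if nothing was hit)."""
    ranges = np.asarray(true_ranges, dtype=float)
    # perturb every beam with zero-mean Gaussian range noise
    noisy = ranges + rng.normal(0.0, sigma, size=ranges.shape)
    # randomly drop a fraction of beams entirely
    drop = rng.random(ranges.shape) < dropout_p
    noisy[drop] = max_range
    return np.clip(noisy, 0.0, max_range)
```

Agents observing scans of this form cannot rely on exact positions of other vehicles, which is the condition under which we observe lane following and safety-distance behaviors emerge.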
Notably, while prior work has primarily focused on building driving simulators with realistic sensors that mimic LiDARs and cameras (Dosovitskiy et al., 2017; Manivasagam et al., 2020; Yang et al., 2020; Bewley et al., 2018), we focus on the high-level design choices for the simulator, such as the definition of reward and perception noise, that determine whether agents trained in the simulator exhibit realistic behaviors. We hope that the lessons in state space, action space, and reward design gleaned from this paper will transfer to simulators in which the prototypes for perception and interaction used here are replaced with more sophisticated sensor simulation. Code and documentation for all experiments presented in this paper can be found on our project page (http://fidler-lab.github.io/social-driving). Our main contributions are as follows:

• We define a multi-agent driving environment in which agents equipped with noisy LiDAR sensors are rewarded for reaching a given destination as quickly as possible without colliding with other agents, and show that agents trained in this environment learn road rules that mimic those common in human driving systems.

• We analyze which choices in the definition of the MDP lead to the emergence of these road rules and find that the most important factors are perception noise and the spatial density of agents in the driving environment.

• We release a suite of 2D driving environments (https://github.com/fidler-lab/social-driving) with the intention of stimulating interest within the MARL community in solving fundamental self-driving problems.

2. RELATED WORK

PPO, the algorithm we use in this work, is an on-policy policy gradient algorithm that alternates between sampling from the environment and optimizing the policy with stochastic gradient descent. PPO stabilizes the actor's training by limiting the step size of the policy update using a clipped surrogate objective function.
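The clipped surrogate objective can be sketched in a few lines. This is a minimal NumPy illustration of the standard PPO clipping rule, not our training code; `ppo_clip_loss` and its arguments are names chosen here for exposition.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of PPO.

    ratio     -- pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage -- estimated advantage for each sampled action
    Returns the negated objective (a loss to minimize)."""
    unclipped = ratio * advantage
    # clamp the probability ratio to [1 - eps, 1 + eps]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # take the pessimistic (lower) bound, then negate for gradient descent
    return -np.minimum(unclipped, clipped).mean()
```

Taking the elementwise minimum of the clipped and unclipped terms means the objective never rewards moving the policy further than `eps` away from the behavior policy, which is what limits the update step size.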

Multi-Agent Reinforcement Learning

The central difficulties of multi-agent RL (MARL) include environment non-stationarity (Hernandez-Leal et al., 2019; 2017; Busoniu et al., 2008; Shoham et al., 2007), credit assignment (Agogino and Tumer, 2004; Wolpert and Tumer, 2002), and the curse of dimensionality (Busoniu et al., 2008; Shoham et al., 2007). Recent works (Son et al., 2019; Rashid et al., 2018) have attempted to address these issues in a centralized-training, decentralized-execution framework by factorizing the joint action-value function into individual Q functions for each agent. Alternatively, MADDPG (Lowe et al., 2017) and PPO with a centralized critic (Baker et al., 2019) have shown promising results on MARL problems using policy gradients.

Emergent Behavior

The emergence of human-like behavior in MARL (Leibo et al., 2019) has been studied extensively for problems such as effective tool usage (Baker et al., 2019), ball passing and interception in 3D soccer environments (Liu et al., 2019), capture the flag (Jaderberg et al., 2019), hide and seek (Chen et al., 2020; Baker et al., 2019), communication (Foerster et al., 2016; Sukhbaatar et al., 2016), and role assignment (Wang et al., 2020). For autonomous driving and traffic control (Sykora et al., 2020), emergent behavior has primarily been studied in the context of imitation learning (Bojarski et al., 2016; Zeng et al., 2019; Bansal et al., 2018; Philion and Fidler, 2020; Bhattacharyya et al., 2019). Zhou et al. (2020) solve a problem similar to ours from the perspective of environment design but do not account for real-world aspects such as noisy perception, which we find are crucial for rules to emerge. In contrast to works that study emergent behavior in mixed-autonomy traffic (Wu et al., 2017b), rules embedded in reward functions (Medvet et al., 2017; Talamini et al., 2020), and traffic signal control (Stevens and Yeh, 2016), we consider a fully autonomous driving problem in a decentralized execution framework and show the emergence of standard traffic rules that are present in transportation infrastructure.

3. PROBLEM SETTING

We frame the task of driving as a discrete-time multi-agent Dec-POMDP (Oliehoek et al., 2016). Formally, a Dec-POMDP is a tuple G = ⟨S, A, P, r, ρ_0, O, n, γ, T⟩, where S denotes the state space of the environment; A = A_1 × ⋯ × A_n denotes the joint action space of the n agents, so a joint action is a = (a_1, …, a_n) ∈ A; P : S × A × S → R+ is the state transition probability; r : S × A → R is a bounded reward function; ρ_0 is the initial state distribution; O = O_1 × ⋯ × O_n denotes the joint observation space of the n agents, with o = (o_1, …, o_n) ∈ O; γ ∈ (0, 1] is the discount factor; and T is the time horizon.
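The Dec-POMDP components map naturally onto a multi-agent environment interface. The following is a minimal sketch of that mapping with toy dynamics; the class name, state representation, and reward are illustrative assumptions, not the environments we release.

```python
import random

class DecPOMDPEnv:
    """Minimal sketch of the Dec-POMDP interface: n agents, a joint
    action per step, one partial observation and one reward per agent,
    and an episode that ends at horizon T."""

    def __init__(self, n_agents, horizon=100, gamma=0.99):
        self.n = n_agents      # number of agents
        self.T = horizon       # time horizon T
        self.gamma = gamma     # discount factor in (0, 1]
        self.t = 0

    def reset(self):
        # sample s_0 ~ rho_0 (here: a toy 1D state per agent)
        self.t = 0
        self.state = [random.random() for _ in range(self.n)]
        return self._observe()

    def _observe(self):
        # O: each agent receives only a noisy, partial view of the state
        return [s + random.gauss(0.0, 0.1) for s in self.state]

    def step(self, joint_action):
        # P: toy deterministic transition; r: bounded per-agent reward
        assert len(joint_action) == self.n
        self.state = [s + a for s, a in zip(self.state, joint_action)]
        rewards = [-abs(s) for s in self.state]
        self.t += 1
        done = self.t >= self.T  # episode terminates at horizon T
        return self._observe(), rewards, done
```

In our actual environments, the per-agent observation is the noisy LiDAR scan described earlier, and the reward combines progress toward the destination with collision penalties.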






Figure 1. Multi-agent driving environment. We train agents to travel from a→b as quickly as possible with limited perception while avoiding collisions, and find that "road rules" such as lane following and traffic light usage emerge.


