UNCERTAINTY AND TRAFFIC LIGHT AWARE PEDESTRIAN CROSSING INTENTION PREDICTION

Abstract

Predicting Vulnerable Road User (VRU) crossing intention is one of the major challenges in automated driving. Crossing intention prediction systems trained only on pedestrian features underperform in situations that are most obvious to humans, because humans take additional context features into consideration. Moreover, such systems tend to be over-confident on out-of-distribution samples, making them less reliable for downstream tasks such as sensor fusion and trajectory planning for automated vehicles. In this work, we demonstrate that the results of crossing intention prediction systems can be improved by incorporating traffic light status as an additional input. Further, we make the model robust and interpretable by estimating uncertainty. Experiments on the PIE dataset show that, when traffic-light context is considered, the F1-score improves from 0.77 to at least 0.82 for three different baseline systems. With uncertainty estimation added, the model produces higher uncertainty values for out-of-distribution samples, leading to interpretable and reliable predictions of crossing intention.

1. INTRODUCTION

VRUs are complex participants for an Automated Vehicle (AV) to perceive. The AV should not only be able to detect VRUs, but also understand their underlying intentions and predict their future actions. In addition, several surrounding factors, including oncoming traffic and traffic light status, influence VRU behavior. A pedestrian may decide to stop or go at a particular moment based on these conditions. Consider a situation where a pedestrian is standing near the edge of the curb, or walking towards it, at a traffic-light-controlled junction, intending to cross the driving lane of the ego-vehicle. Knowing that the traffic light is green for the vehicle can help predict that the pedestrian will keep standing or stop at the curb, i.e., that the pedestrian's intention is not to cross. For this reason, it is necessary to consider surrounding factors like traffic light status in addition to behavioral cues to make an accurate prediction. Object-based context cues, such as pedestrian location over a period of time, can provide rich information about VRU motion. However, it is challenging to perceive features such as a pedestrian's interaction with the ego-vehicle, which can determine the vehicle's maneuvering. Humans exhibit highly variable motion patterns, and even the same gesture or activity may differ subtly among individuals depending on geographic location. In such a case, it is helpful to divide the task into a sequence of smaller tasks that can be solved independently. In the VRU case, we can learn a model to infer an appearance-invariant representation. The articulated pose of VRUs is one such representation, commonly used in the literature for action recognition (Duan et al., 2022), gesture recognition (Mitra & Acharya, 2007), emotion recognition (Shi et al., 2020), and intention prediction (Kotseruba et al., 2021).
These object-based features, along with surrounding information about the pedestrian, can be combined over the temporal domain to build a reliable predictor of future VRU actions. In this paper, we explore this approach and attempt to predict pedestrian crossing intention over a future time horizon of 1-2 s, given an observation horizon of 0.5 s. The handling of VRUs is safety-critical, so it is also important to be aware of the uncertainty of the models that predict their behavior. Despite much emphasis on safety, deep learning models are often deployed as black boxes that offer neither reliability nor interpretability. As a result, they give no indication of how the system will behave under unknown circumstances. To interpret how the
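As a minimal illustration of how predictive uncertainty can flag out-of-distribution inputs (a sketch with hypothetical numbers, not the estimation method of this work), one common score is the entropy of the class distribution averaged over several stochastic forward passes, as in Monte Carlo dropout: passes that disagree yield a flatter mean distribution and hence higher entropy.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the mean predicted distribution.

    probs: (T, C) array, each row a softmax output from one
           stochastic forward pass over the same input.
    Higher entropy indicates higher predictive uncertainty.
    """
    mean_p = probs.mean(axis=0)
    return float(-np.sum(mean_p * np.log(mean_p + 1e-12)))

# In-distribution input: passes agree, so uncertainty is low.
in_dist = np.array([[0.95, 0.05], [0.93, 0.07], [0.96, 0.04]])
# Out-of-distribution input: passes disagree, so uncertainty is high.
ood = np.array([[0.90, 0.10], [0.20, 0.80], [0.55, 0.45]])

assert predictive_entropy(ood) > predictive_entropy(in_dist)
```

A downstream planner can then treat low-confidence crossing predictions differently, e.g. by falling back to a conservative maneuver.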

