UNCERTAINTY AND TRAFFIC LIGHT AWARE PEDESTRIAN CROSSING INTENTION PREDICTION

Abstract

Predicting Vulnerable Road User (VRU) crossing intention is one of the major challenges in automated driving. Crossing intention prediction systems trained only on pedestrian features underperform in situations that are most obvious to humans, as the latter take additional context features into consideration. Moreover, such systems tend to be over-confident on out-of-distribution samples, making them less reliable for downstream tasks like sensor fusion and trajectory planning in automated vehicles. In this work, we demonstrate that crossing intention prediction can be improved by incorporating the traffic light status as an additional input. Further, we make the model robust and interpretable by estimating uncertainty. Experiments on the PIE dataset show that the F1-score improves from 0.77 to 0.82 and above for three different baseline systems when traffic-light context is considered. With uncertainty estimation enabled, the model produces higher uncertainty values for out-of-distribution samples, leading to interpretable and reliable predictions of crossing intention.

1. INTRODUCTION

VRUs are complex participants for an Automated Vehicle (AV) to perceive. The AV should not only be able to detect VRUs, but also understand their underlying intentions and predict their future actions. In addition, several surrounding factors, including incoming traffic and traffic light status, influence VRU behavior. A pedestrian may decide to stop or go at a particular moment based on these conditions. Consider a situation where a pedestrian is standing near the boundary of the curb, or walking towards the curb, at a traffic-light junction, intending to cross the driving lane of the ego-vehicle. Knowing that the traffic light is green for the vehicle might help to predict that the pedestrian will keep standing or stop at the curb, i.e., that the pedestrian's intention is not to cross. For this reason, it is necessary to consider surrounding factors like traffic light status in addition to behavioral cues to make an accurate prediction. Object-based context cues such as pedestrian location over a period of time can provide rich information about VRU motion. However, it is challenging to perceive features like human interactions with the ego-vehicle that influence pedestrian maneuvers. Humans exhibit highly variable motion patterns, and even the same gesture or activity may differ subtly among individuals based on geographic location. In such a case, it is helpful to divide the task into a smaller sequence of tasks that can be solved independently. In the VRU case, we can learn a model to infer an appearance-invariant representation. The articulated pose of VRUs is one such representation, commonly used in the literature for action recognition (Duan et al., 2022), gesture recognition (Mitra & Acharya, 2007), emotion recognition (Shi et al., 2020) and intention prediction (Kotseruba et al., 2021).
These object-based features, along with surrounding information about the pedestrian, can be combined over the temporal domain to build a reliable predictor for future VRU actions. In this paper, we explore this approach and attempt to predict pedestrian crossing intention for a future time horizon of 1-2 s by observing pedestrians for a time horizon of 0.5 s. The handling of VRUs is safety-critical, so it is important to be aware of the uncertainty of the models that predict their behavior as well. Despite much emphasis on safety, deep learning models are often deployed as black boxes that offer neither reliability nor interpretability. As a result, they do not indicate how a system will behave under unknown circumstances. To interpret how the model behaves in such situations, we predict the uncertainty of each prediction to know the confidence of our model.
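As a concrete illustration of how such a per-prediction uncertainty can be obtained, the sketch below applies Monte-Carlo dropout to a toy crossing classifier: dropout stays active at inference, and the spread over repeated stochastic forward passes is read as predictive uncertainty. All weights, the feature dimension, and the dropout rate are hypothetical placeholders for illustration, not the trained model from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_forward(x, W, b, p_drop=0.2):
    """One forward pass with dropout kept active at inference (MC dropout)."""
    mask = rng.random(x.shape) > p_drop          # randomly drop input features
    h = np.where(mask, x, 0.0) / (1.0 - p_drop)  # inverted-dropout rescaling
    logit = h @ W + b
    return 1.0 / (1.0 + np.exp(-logit))          # P(crossing) for this pass

def mc_dropout_predict(x, W, b, n_samples=50):
    """Mean over stochastic passes is the prediction; std is the uncertainty."""
    probs = np.array([stochastic_forward(x, W, b) for _ in range(n_samples)])
    return probs.mean(), probs.std()

# Hypothetical trained weights for a 4-dimensional feature vector
# (e.g. normalized bounding-box coordinates of an observed pedestrian).
W = np.array([0.8, -1.2, 0.5, 2.0])
b = -0.1
x = np.array([0.3, 0.7, 0.2, 0.9])

mean_p, uncertainty = mc_dropout_predict(x, W, b)
```

For an out-of-distribution input, the stochastic passes tend to disagree more, so the reported uncertainty grows; downstream components can then discount or reject the prediction.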

2.1. ARCHITECTURES FOR PEDESTRIAN CROSSING INTENTION PREDICTION

To build a safer AV system for urban roads, it is important to estimate the crossing intention of a pedestrian, i.e., whether a pedestrian intends to cross or not cross the road in front of the ego-vehicle within a predefined time horizon. Pedestrian crossing intention prediction is mostly treated as a binary task where the goal is to classify between two classes, Crossing (C) or Not Crossing (NC). One of the early works in this direction, proposed by Rasouli et al. (2017), predicts the crossing action at a given frame using a static representation of the traffic scene and encoding pedestrian looking and walking actions with CNNs. Kotseruba et al. (2021) present a pedestrian action prediction model along with a common evaluation criterion. The authors evaluate different architectures for pedestrian action recognition, namely static models (where the crossing prediction is made using only the last frame in the observation sequence), Recurrent Neural Network (RNN) models, and 3D-convolution and optical-flow based models. They propose a novel architecture based on 3D convolution and multiple RNNs and experiment with different input features such as bounding boxes, local context, human-pose keypoints, and ego-vehicle speed. We base our experiments on this architecture and perform ablation studies to gain insights into dropout and uncertainty. Yang et al. (2022) fuse different modalities, such as sequences of RGB imagery, semantic segmentation masks, and ego-vehicle speed, using attention mechanisms and a stack of recurrent neural networks. Achaji et al. (2022) present a framework based on multiple variations of Transformer models that predicts the pedestrian street-crossing decision from the dynamics of the initiated trajectory, using only bounding boxes as input.
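To make the binary C/NC formulation concrete, the following minimal sketch runs a vanilla RNN over an observed sequence of bounding boxes and emits a sigmoid crossing probability. The weights and the 15-frame observation window (roughly 0.5 s at 30 fps) are illustrative placeholders, not the baseline architecture itself.

```python
import numpy as np

def rnn_crossing_score(bboxes, Wx, Wh, Wo, bh, bo):
    """Run a vanilla RNN over a bounding-box sequence and return P(crossing)
    from the final hidden state. All weights are hypothetical stand-ins for
    trained parameters."""
    h = np.zeros(Wh.shape[0])
    for x in bboxes:                       # one step per observed frame
        h = np.tanh(x @ Wx + h @ Wh + bh)  # recurrent state update
    logit = h @ Wo + bo
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> binary C/NC score

rng = np.random.default_rng(1)
hidden = 8
Wx = rng.normal(scale=0.1, size=(4, hidden))   # bbox = (x1, y1, x2, y2)
Wh = rng.normal(scale=0.1, size=(hidden, hidden))
Wo = rng.normal(scale=0.1, size=hidden)
bh = np.zeros(hidden)
bo = 0.0

# 15 observed frames of normalized image coordinates (toy data).
seq = rng.random((15, 4))
p_cross = rnn_crossing_score(seq, Wx, Wh, Wo, bh, bo)
```

Thresholding `p_cross` at 0.5 yields the C/NC decision; in practice the recurrent cell is a GRU or LSTM, and further input streams (pose, context, ego-vehicle speed) are concatenated per frame.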

2.2. FEATURES USED FOR PREDICTING PEDESTRIAN CROSSING INTENTION

Body language, generally modeled as head orientation, body orientation, posture and gesture, is often used to estimate a pedestrian's future intention. Yang & Ni (2019) use two vision cues to estimate a pedestrian's crossing intention. They propose a looking/not-looking classifier using a 2D CNN to capture the eye contact between a pedestrian and the ego-vehicle. They also propose a C/NC classifier based on a 3D CNN to model the pedestrian's early crossing action. Roth et al. (2021) propose a vehicle-pedestrian path prediction method that takes into account the mutual awareness of the driver and the pedestrian. They extend the Dynamic Bayesian Network (DBN) method of Kooij et al. (2014), which performs path prediction for an individual pedestrian, to the mutual vehicle-pedestrian case. Their results indicate that driver-attention-aware models improve collision risk estimation compared to driver-agnostic models. Human pose is an intermediate representation that is very useful for characterizing various human behaviors. Quintero et al. (2017) propose a method to recognize pedestrian intentions such as standing, walking, stopping and starting based on a Hidden Markov Model (HMM). The authors use 3D positions and displacements of 11 skeleton points. They also propose a single-frame skeleton estimation algorithm based on point clouds extracted from a stereo pair. Fang & López (2018) use CNN-based pedestrian detection, tracking and pose estimation to predict the C/NC action for pedestrians, applying a classifier on top of human-pose features. The authors extend their work to recognize the intention of cyclists along with pedestrians (Fang & López, 2019). Mínguez et al. (2018) present a method to predict the future path, pose and intentions of pedestrians up to a time horizon of 1 s.
The authors use balanced Gaussian process dynamic models (BGPDM) to learn 3D time-related information extracted from the skeleton points.
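A common preprocessing step shared by such pose-based approaches is to normalize the skeleton keypoints by the pedestrian's bounding box, so that the resulting feature is invariant to image position and scale. The sketch below shows a minimal, hypothetical version of that step; the keypoints and box are toy values, not data from any of the cited works.

```python
import numpy as np

def normalize_pose(keypoints, bbox):
    """Express 2D skeleton keypoints relative to the pedestrian's bounding
    box, yielding position- and scale-invariant coordinates in [0, 1].
    keypoints: (K, 2) pixel coordinates; bbox: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    size = np.array([x2 - x1, y2 - y1], dtype=float)
    return (keypoints - np.array([x1, y1])) / size

# Toy 3-point skeleton inside a 40x80-pixel pedestrian box.
kp = np.array([[120.0, 40.0], [118.0, 55.0], [125.0, 80.0]])
feat = normalize_pose(kp, (100.0, 30.0, 140.0, 110.0)).ravel()  # flat feature
```

The flattened vector can be stacked per frame and fed to the temporal models discussed above, alongside bounding-box and context features.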



Razali et al. (2021) propose a multitask architecture to estimate the intention and pose of a pedestrian simultaneously using RGB images as input. Yao et al. (2021) present a multi-task encoder-decoder architecture that predicts pedestrian crossing intent and forecasts the future behavior of pedestrians. The authors also propose an attentive relation network that extracts important features from traffic objects and scenes to improve the intention and action detection framework. Lorenzo et al. (2021b) and Lorenzo et al. (2021a) use vision transformers to encode the non-visual features; they experiment with different types of video encoders and finally fuse the features from the two branches to predict pedestrian crossing intent.

