TRANSCENDENTAL IDEALISM OF PLANNER: EVALUATING PERCEPTION FROM THE PLANNING PERSPECTIVE FOR AUTONOMOUS DRIVING

Anonymous authors
Paper under double-blind review

Abstract

Evaluating the performance of the perception module in autonomous driving is one of the most critical tasks in developing these complex intelligent systems. While module-level unit test methodologies adopted from traditional computer vision tasks are viable to a certain extent, how to evaluate the impact of changes in a perception module on the planning of an autonomous vehicle in a consistent and holistic manner remains far less explored. In this work, we propose a principled framework that provides a coherent and systematic understanding of how perception modules affect the planning that actually controls the vehicle. Specifically, planning of an autonomous vehicle is formulated as an expected utility maximisation problem, where all input signals from upstream modules jointly provide a world state description, and the planner aims to find the optimal action to execute by maximising the expected utility determined by both the world state and the action. We show that, under some mild conditions, the objective function can be represented as an inner product between the world state description and the utility function in a Hilbert space. This geometric interpretation enables a novel way to analyse the impact of noise in world state estimation on the solution to the problem, and leads to a universal quantitative metric for this purpose. The whole framework resembles the idea of transcendental idealism in the classical philosophy literature, which gives the name to our approach.

1. INTRODUCTION

Autonomous driving has recently risen as a fast-advancing realm in both industry and academia, and has received a surge of interest from engineering and scientific communities (Yurtsever et al., 2020; Sun et al., 2020). As an intricate system, an autonomous driving vehicle consists of numerous hardware components and interactive onboard modules. As one such core component, the onboard perception module serves as the major source of real-time characterisation of the dynamic environment an autonomous vehicle (AV) navigates through. To evaluate and improve the perception module, conventional perception tasks (such as detection, segmentation, and tracking) have been well defined, and corresponding performance measurements are established in computer vision to benchmark perception algorithms (Lin et al., 2014). Despite their great success in driving the development of advanced perceptual information processing modules, almost all such metrics focus exclusively on perception-level performance in a deployment-agnostic fashion, for instance, how close a detected object is to the ground truth, while ignoring the actual impact of the result on the entire AV system. Indeed, not all perception errors translate equally to the planning of an AV. Failing to detect a vehicle in front of an AV is far more serious than failing to detect one far behind it. This problem is further compounded by the heterogeneity of perception errors that share little semantics in common ("How does an error of 5m/s in velocity compare to that of a size 25% larger?"), where intuitive manual engineering is widely used (Caesar et al., 2020). Although these issues are typically addressed by integrating road tests in the real world, the process is extremely costly and time-consuming (Wachenfeld and Winner, 2016; Åsljung et al., 2017). As a result, tools are in great demand to effectively and efficiently measure the impact of perception on the whole autonomous driving system before deployment on the road.
Unfortunately, such tools remain far less explored in the research literature.

Figure 1: The change in AV behaviour due to perception error is not always correlated with the cost of the consequence. In (a), the AV has to circumvent the erroneously perceived cone by making a large detour. In (b), the AV only needs to make a slight detour to the right, yet it inevitably hits the cone. In this case, although the behaviour change is far smaller than that of (a), the consequence is significantly worse ("hitting an object" vs. "making a large detour"). In (c), the consequence of either way is indifferent to the AV in moving forward, yet the change in behaviour is considerable in terms of spatiotemporal motion. As for (d), if there are two falsely detected cones on both sides, which are close enough to the AV when passing by despite no collision, the AV still decides to maintain the same motion as in the ground truth case. Therefore, the final behaviour of the AV does not change given the perception error, but the cost of passing by two close objects already changes the planning process, which will be missed by metrics that only look at the AV behaviour or planning result.

Most recently, the community has started to approach this problem with some initial efforts (Sun et al., 2020; Philion et al., 2020; Ivanovic and Pavone, 2021; Deng et al., 2021). Despite some success, these preliminary solutions only exploit certain aspects of the problem, either implicitly relying on a weak correlation between behaviour change and driving cost (Philion et al., 2020), inferring the holistic cost via local properties (Ivanovic and Pavone, 2021), or working at coarse levels (Sun et al., 2020). In this work, we propose a principled and universal framework to quantify how noise in perception input affects the AV planning.
This is achieved by explicitly analysing the process of AV planning, in the context of expected utility maximisation (Osborne and Rubinstein, 1994), and evaluating the change of utility values critical to the AV reasoning subject to input perception errors. Under some mild conditions (Section 3.3), we show that this planning process can be formulated as an optimisation problem with a linear objective function in a Hilbert space, where the utility to optimise is the inner product of an action-wise utility function and the world state distribution represented by perception. This geometric interpretation reveals many natural and insightful properties of the problem. For example, any input error can be decomposed into two parts: one that does not affect the utility comparison (planning-invariant error) and one that directly changes the planning problem (planning-critical error). Based on this insight, we derive a metric that quantifies how a perception error changes the planning process.

We want to emphasise the necessity of understanding the impact of perception errors on an autonomous driving system via the process of planning, rather than purely from the final result (i.e., the AV behaviour, or the trajectory output from the planning module), as proposed by previous works (Philion et al., 2020). This is because the final planning result does not necessarily reflect how AVs evaluate the situation, reason with the environment, and assess the costs of actions. In fact, the correlation between behaviour change and the actual consequence is weak, or even negative in many common cases, as illustrated in Figure 1. Moreover, most works implicitly or explicitly integrate some a priori knowledge of the consequences of perception errors into metric design. The complexity of such impact on autonomous driving, however, is far beyond hand-crafted rules, defeating their purposes despite tremendous amounts of manual effort. For example, Deng et al. (2021) assume that the severity of an error should be weighted proportionally to the reciprocal of its cubed Manhattan distance to the AV, regardless of its position relative to the AV (in front of or behind the AV). In contrast, we make few such presumptions and fully rely on the planning process to infer the error consequence in a fully transparent way, which enables our solution to capture many critical cases.

In this regard, the core principle of our design resembles the idea in the philosophical system of transcendental idealism, proposed by Immanuel Kant in his classical work Critique of Pure Reason (Kant, 1998), which argues that, due to the limitation of the observer's sensibility, cognition of external objects is processed never as they are in themselves, but via the cognitive faculties and subject to the interpretation of the observer's experience. For the same reason, the consequence of environment misrepresentation for an AV due to perception errors is naturally reflected via the change in its planning (the core component of an AV that interprets its environment) and measured by the extra loss incurred, which gives the name to our framework: transcendental idealism of planner (TIP).

2. LITERATURE REVIEW

Planning for Autonomous Vehicles. We consider behavioural decision and motion planning together as the planning process, which generates the vehicle behaviour for the controller to execute given the observations up to the planning time. There is a rich literature addressing these fundamental problems in autonomous driving, which can be roughly categorised into canonical module-based and data-driven methods. The former relies on explicit modelling of target accomplishment in optimisation frameworks and seeks the optimal solution as the result (Schwarting et al., 2018). The latter, on the other hand, aims to directly map raw sensor data into the AV behaviour or final vehicle control signals by leveraging the approximation power of deep learning and massive data (Bojarski et al., 2016; Grigorescu et al., 2020), which has attracted increasing attention recently. In this work, we aim to explore the internal mechanism of a planner to gain some insights into the impact of perception noise on planning, and specifically focus on module-based planning with an explicit target-achieving process. Extension of the results in this work to data-driven planning is left for future work.

Behaviour based Metrics for Upstream Modules. Recent works that aim to assess the performance of perception from the autonomous driving system viewpoint mostly approach the problem in heuristic ways. Considering black-box planning models, Philion et al. (2020) implicitly hypothesise that the driving consequences of perception errors are directly correlated with the change in the AV's planned spatiotemporal trajectories, and propose the planning KL-divergence (PKL) to measure the impact. While intuitive, it fails to incorporate the context of the environment and does not precisely reflect the real cost of input noise in many common traffic scenarios. To deal with the specific problem of object representation, Deng et al. (2021) study how object shapes can affect autonomous driving and devise the support distance error (SDE) to quantify such effects. In a very recent work, Ivanovic and Pavone (2021) start to look into the planning process and employ sensitivity as a probe of an input signal's contribution to AV behaviour. This, however, implicitly leverages local-only properties of differentiable cost functions to infer global results. In comparison, our proposed approach captures the big picture of the planning process and applies to far more general cases.

3. AV PLANNING AS EXPECTED UTILITY MAXIMISATION

To evaluate the performance of perception module from the AV planning perspective in a principled manner, we start by introducing the preliminary basics, and then review the expected utility maximisation (EUM) as the optimal AV action framework. After that, the interpretation of EUM in a Hilbert space is presented, based on which our metric for perception is derived.

3.1. PRELIMINARY

We first present the mathematical basics to facilitate the following theoretical analysis. Unless otherwise specified explicitly, all notation follows the standard in Goodfellow et al. (2016). A probability space $\{\Phi, \mathcal{F}, P\}$ is defined by a sample space $\Phi$, an event space $\mathcal{F}$ (a $\sigma$-algebra on $\Phi$), and a Borel probability measure $P$ on $\mathcal{F}$. A random variable $X : \Phi \to \mathbb{R}^d$ ($d \in \mathbb{N}$) is induced from $\{\Phi, \mathcal{F}, P\}$ with distribution function $F_X(x)$. When absolutely continuous, $F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt$, where $f_X(x)$ is the probability density function (PDF). $L^2(\mathcal{X}, \rho)$ denotes the space of square-integrable functions, and $\rho$ is a Lebesgue measure accordingly. A Hilbert space $\mathcal{H} = (\mathcal{T}, \langle\cdot,\cdot\rangle)$ is defined on a complete space $\mathcal{T}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ and induced norm $\|\cdot\|_{\mathcal{H}}$. Let $S \subset \mathcal{H}$ be a subspace of a Hilbert space $\mathcal{H}$; then $S^{\perp} = \{x \in \mathcal{H} \mid \langle x, y\rangle = 0, \ \forall y \in S\}$ is the orthogonal complement of $S$ (i.e., the set of all vectors orthogonal to $S$). The linear span of a set $S$ is $\mathrm{span}(S)$. $n_v := v/\|v\|$ is an element of unit length in a normed vector space obtained by normalising element $v$.

3.2. AUTONOMOUS VEHICLES AS RATIONAL AGENTS

An AV is an intelligent agent that aims to accomplish some predefined goals in an interactive and uncertain environment. For this, an AV is constantly faced with the problem of planning in the dynamic environment, and the quality of planning determines how well the goals can be achieved. By the classical EUM theory (Osborne and Rubinstein, 1994), at any given time $t$, an agent aims to achieve the maximum expected reward, defined by the utility function $U$, via execution of the optimal action $a_t^*$ such that
$$a_t^* = \arg\max_{a \in D_{a,t}} \mathbb{E}\big[U(S_t, a)\big], \tag{1}$$
where $D_{a,t}$ is the set of all feasible AV actions at time $t$; $S_t \in \mathcal{S}$, a random variable with distribution function $F_{S_t}(s)$, is the world state at time $t$ in the world state space $\mathcal{S}$; and
$$EU(F_{S_t}, a) := \mathbb{E}\big[U(S_t, a)\big] = \int_{s \in \mathcal{S}} U(s, a)\, dF_{S_t}(s). \tag{2}$$
Intuitively, the utility function encodes the goal or reward the AV is supposed to achieve, for example, to reach a destination in time, to minimise the likelihood of collision with other objects, and to avoid sharp changes in motion, while $F_{S_t}(s)$ captures uncertainty about the stochastic environment given all prior world knowledge and historical observations up to $t$, which are estimated by modules like localisation and perception. Architectures of many modern AV planners still follow this classical framework (Paden et al., 2016; Buehler et al., 2009; Fan et al., 2018).
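As an illustration of (1) and (2), the following minimal sketch estimates the expected utility of each action from sampled world states and picks the maximiser. The one-dimensional state (a gap to a lead vehicle), the utility values, the action names, and the Gaussian beliefs are all hypothetical choices made for this example and are not taken from the planner described in this paper.

```python
import random

def expected_utility(utility, state_samples, action):
    """Monte Carlo estimate of E[U(S_t, a)] in Eq. (2) from i.i.d. state samples."""
    return sum(utility(s, action) for s in state_samples) / len(state_samples)

def eum_action(utility, state_samples, actions):
    """Eq. (1): pick the feasible action with the largest estimated expected utility."""
    return max(actions, key=lambda a: expected_utility(utility, state_samples, a))

# Hypothetical utility: the state s is the gap (in metres) to a lead vehicle.
def gap_utility(s, action):
    if action == "keep_speed":
        return -100.0 if s < 10.0 else 1.0  # heavy loss if the gap is too small
    return 0.0  # "slow_down": no progress reward, no collision risk

rng = random.Random(0)  # fixed seed so the sketch is reproducible
far_gap = [rng.gauss(30.0, 2.0) for _ in range(1000)]   # belief: lead vehicle far away
near_gap = [rng.gauss(8.0, 2.0) for _ in range(1000)]   # belief: lead vehicle close

actions = ["keep_speed", "slow_down"]
print(eum_action(gap_utility, far_gap, actions))
print(eum_action(gap_utility, near_gap, actions))
```

The same utility function yields different optimal actions under the two beliefs, which is the sense in which the world state distribution supplied by upstream modules drives the planning result.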

3.3. EXPECTED UTILITY MAXIMISATION IN THE HILBERT SPACE

To gain some insights into the expected utility of (1) and how input noise is consumed by the planning process, we introduce an interpretation in the Hilbert space to leverage geometric tools available from linear algebra. We first establish the conditions under which a probability measure can be embedded into a Hilbert space, followed by the interpretation of EUM from a geometric perspective in Section 4. For brevity, all proofs are left in Appendix G.

Theorem 1 (Probability Measure Embeddings in Hilbert Space). Let $\{\mathcal{X}, d\}$ be a compact metric space with $d$ as the metric function, $p$ be a Borel probability measure on $\mathcal{X}$, and $X$ be a random variable on $\mathcal{X}$ with distribution function $F_X(x)$. If $F_X(x)$ is absolutely continuous and the density function $f_X$ is square-integrable, i.e., $f_X \in L^2$, then there exists a unique element $\mu_p \in \mathcal{H}$ such that
$$\mathbb{E}_X\big[g(x)\big] = \langle \mu_p, g \rangle_{\mathcal{H}}, \quad \forall g \in \mathcal{H}, \tag{3}$$
where element $\mu_p$ denotes the embedding of probability measure $p$ in the Hilbert space $\mathcal{H} = (L^2, \langle\cdot,\cdot\rangle)$, with the inner product given by
$$\langle g, h \rangle_{\mathcal{H}} := \int_{x \in \mathcal{X}} g(x)\,h(x)\,\rho(dx). \tag{4}$$

The critical condition of $F_X(x)$ being absolutely continuous with a square-integrable density function $f_X$ in Theorem 1 is actually general and includes many popular distributions as special cases (see the discussion in Appendix F). Theorem 1 establishes a mapping from probability measures of continuous random variables to $\mathcal{H}$. Additionally, the mapping is injective by the following result.

Theorem 2 (Injection of Probability Measure Embeddings). Let $p$ and $q$ be two Borel probability measures defined on a compact metric space $\{\mathcal{X}, d\}$ with absolutely continuous distribution functions. Then $p = q$ almost everywhere if and only if $\mu_p = \mu_q$, where $\mu_p$ and $\mu_q$ are the embeddings of $p$ and $q$ in $\mathcal{H}$, respectively.

A similar result for mixed distributions is also available in the appendix (Theorem 4).
Under the mild conditions in the aforementioned results, the expected utility maximisation of (1) can be rewritten as
$$a^* = \arg\max_{a \in D_a} \mathbb{E}_{s \sim p(s)}\big[U(s, a)\big] = \arg\max_{a \in D_a} \langle \mu_p, U_a \rangle_{\mathcal{H}}, \quad \forall U(s, a) \in \mathcal{H}. \tag{5}$$
Given the injective correspondence between $p$ and $\mu_p$ established above, we can leverage many tools in algebra (such as inner product, orthogonality, projection, and subspace) to analyse the impact of the perception result on AV planning via the EUM in $\mathcal{H}$, termed the "planning utility Hilbert space", where the topological structure is exclusively determined by its inner product.
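The inner-product form of the expected utility can be checked numerically. The sketch below approximates the $L^2$ inner product with a midpoint rule and uses it to rank two actions. The density $f(x) = 2x$ on $[0, 1]$, the two utility functions, and the action labels are assumptions chosen only because their expectations have simple closed forms.

```python
def inner_product(g, h, lo, hi, n=1000):
    """Midpoint-rule approximation of <g, h> = integral of g(x) h(x) dx on [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += g(x) * h(x)
    return total * step

# Assumed square-integrable density on X = [0, 1]; it plays the role of mu_p.
def density(x):
    return 2.0 * x

# Hypothetical action-wise utility functions U_a in L2([0, 1]).
utilities = {
    "a1": lambda x: x * x,      # E[U] = integral of 2x^3 = 1/2
    "a2": lambda x: 1.0 - x,    # E[U] = integral of 2x(1 - x) = 1/3
}

# Expected utility of each action as an inner product with the density.
expected = {a: inner_product(density, u, 0.0, 1.0) for a, u in utilities.items()}
best_action = max(expected, key=expected.get)
```

Because the objective is linear in the embedded distribution, comparing two actions only requires the inner product of the density with the difference of their utility functions, which is the quantity analysed in Section 4.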

4. PERCEPTION EVALUATION VIA AV PLANNING

In this section, we derive the extra cost of planning incurred by perception errors through the theoretical foundation established in Section 3. While the actual world state characterisation $p(s)$ consists of signals from modules other than perception (e.g., localisation), for brevity we assume that the perception module is the only source for world state estimation in the following discussion.

Figure 2(b): The AV is moving forward on a road of width 6m with a cone in front; the belief of the cone position is a distribution on a line across the road (the $x$ axis). The ground truth distribution $p$ is $\mathcal{U}_{[-3,-2]}$, a uniform distribution with support $[-3, -2]$, while the perception believes its location (distributed as $q$) is $\mathcal{U}_{[-1,0]}$. The AV has two action options: 1) to move forward ($a^*$, the red arrow), with utility function $U_1(x) = -10 \cdot \mathbb{1}_{x \in [-1,1]}$, where $x$ is the position of the cone (a large loss only for collision with the cone); 2) to make a hard brake ($a$, the grey arrow), with constant utility function $U_2(x) = -5$ (the loss of hard braking is identical regardless of the cone position). In this case, $\Delta U$ and $\Delta\mu$ are illustrated in the top right, while the decomposition into PIE $\Delta\mu_{\perp}$ and PCE $\Delta\mu_{\parallel}$ is on the bottom right. Note that $\Delta\mu_{\parallel}$ is of the same shape as $\Delta U$ (up to a negative constant), and $\langle \Delta U, \Delta\mu_{\perp} \rangle = 0$.

4.1. BREAKDOWN OF PERCEPTION ERRORS

Consider a general case where the candidate action set is $D_a = \{a_i\}$, and each action is associated with a distinct utility function $U(s, a_i) \in \mathcal{H}$ such that $\|U(s, a_i) - U(s, a_j)\|_{\mathcal{H}} > 0 \Leftrightarrow a_i \neq a_j, \ \forall a_i, a_j \in D_a$. Let $a^*$ be the optimal action per EUM of (5) given the ground truth distribution of world state $p(s)$. For a specific $a \neq a^*$, $\Delta U(a^*, a) = U_{a^*} - U_a$, and the planning half-space in $\mathcal{H}$ is
$$\mathcal{H}_a := \{f \mid \langle f, \Delta U(a^*, a) \rangle_{\mathcal{H}} > 0, \ f \in \mathcal{H}\}. \tag{6}$$
Given the actual perception result $q(s)$, $a^*$ is preferred over $a$ by EUM if and only if $\mu_q \in \mathcal{H}_a$, i.e.,
$$\xi(q; a^*, a) := \langle \mu_q, \Delta U(a^*, a) \rangle_{\mathcal{H}} = EU(q, a^*) - EU(q, a) > 0, \tag{7}$$
with $\xi(q; \alpha, \beta)$ denoting the $\alpha$-$\beta$ preference score given $q$, which exclusively decides the result of EUM. As illustrated in Figure 2(a), the final planning is made correctly if and only if
$$\mu_q \in \bigcap_{a \in D_a \setminus \{a^*\}} \mathcal{H}_a. \tag{8}$$
When there is an error in perception $q(s)$ (i.e., $\|\mu_q - \mu_p\|_{\mathcal{H}} > 0$), the preference score of (7) may be affected, i.e., $\xi(q; a^*, a) \neq \xi(p; a^*, a)$, and so may the preference between $a^*$ and $a$ by EUM.
To understand how the difference $\Delta\mu = \mu_q - \mu_p$ affects the result of EUM, we further decompose $\Delta\mu$ into two orthogonal components:
$$\Delta\mu = \mu_q - \mu_p = \Delta\mu_{\parallel} + \Delta\mu_{\perp}, \tag{9}$$
where
$$\Delta\mu_{\parallel} = \langle \Delta\mu, n_{\Delta U} \rangle\, n_{\Delta U} = \frac{\langle \Delta\mu, \Delta U \rangle}{\|\Delta U\|_{\mathcal{H}}^{2}}\, \Delta U \tag{10}$$
is the projection of $\Delta\mu$ onto the unit vector $n_{\Delta U}$ (denoted behaviour direction) parallel to $U_{a^*} - U_a$, and
$$\Delta\mu_{\perp} \in \mathrm{span}(\{\Delta U\})^{\perp} \tag{11}$$
is the projection of $\Delta\mu$ onto the orthogonal complement of the subspace spanned by the behaviour direction. In the presence of perception error $\Delta\mu$, as illustrated in Figure 2(a), the change in the preference score of (7) is only determined by $\Delta\mu_{\parallel}$:
$$\Delta\xi(a^*, a; q, p) = \xi(q; a^*, a) - \xi(p; a^*, a) = \langle \Delta\mu, \Delta U \rangle = \langle \Delta\mu_{\parallel} + \Delta\mu_{\perp}, \Delta U \rangle = \langle \Delta\mu_{\parallel}, \Delta U \rangle. \tag{12}$$
For this reason, we denote $\Delta\mu_{\parallel}$ as the planning-critical error (PCE), and $\Delta\mu_{\perp}$ as the planning-invariant error (PIE). This observation reveals a pivotal fact: not all errors in world state estimation or perception are of equivalent impact on the AV planning, and the subspace $\mathrm{span}(\{\Delta U\})^{\perp}$ contains all errors that would not affect EUM at all.

Algorithm 1: TIP Score Evaluation
  Input: An observation sequence from perception $q(\{s_t\}_{t=-\tau}^{0})$ and its ground truth $p(\{s_t\}_{t=-\tau}^{0})$
  Output: TIP score of $q(\{s_t\}_{t=-\tau}^{0})$
  Get the candidate trajectory set $D_{a,p}$ and the optimal action $a^* \in D_{a,p}$ from the planner with the ground truth perception input $p(\{s_t\}_{t=-\tau}^{0})$
  Get the candidate trajectory set $D_{a,q}$ from the planner with the noisy perception input $q(\{s_t\}_{t=-\tau}^{0})$
  Define the reference action set $D_a := D_{a,p} \cup D_{a,q}$
  foreach $a \in D_a$ do
    Compute four estimates via finite-size samples:
      $\widehat{EU}_1 = \frac{1}{n}\sum_{i=1}^{n} U(s_q^{(i)}, a^*)$, $\widehat{EU}_2 = \frac{1}{n}\sum_{i=1}^{n} U(s_q^{(i)}, a)$, $\widehat{EU}_3 = \frac{1}{n}\sum_{i=1}^{n} U(s_p^{(i)}, a^*)$, $\widehat{EU}_4 = \frac{1}{n}\sum_{i=1}^{n} U(s_p^{(i)}, a)$,
      where $\{s_q^{(i)}\}_{i=1}^{n}$ are $n$ i.i.d. observations from $q(\{s_t\}_{t=-\tau}^{0})$, and similarly for $\{s_p^{(i)}\}_{i=1}^{n}$
    Compute $\Delta\xi(a^*, a; q, p) = \widehat{EU}_1 - \widehat{EU}_2 - \widehat{EU}_3 + \widehat{EU}_4$
  end
  Compute and output the result $I(q, p; U, D_a) = \min_{a \in D_a} \Delta\xi(a^*, a; q, p)$
A toy example of PCE and PIE is shown in Figure 2(b). Given this interpretation, we define the maximum reduction in the preference score of $a^*$ over any candidate action as the measure of impact from the perception error on AV planning:
$$I(q, p; U, D_a) = \min_{a \in D_a} \Delta\xi(a^*, a; q, p). \tag{13}$$
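The decomposition can be reproduced numerically for the toy example accompanying Figure 2(b). The sketch below is only an approximation under stated assumptions: the Hilbert space is taken to be $L^2([-3, 3])$ over the 6m road cross-section, functions are discretised on a uniform grid, and the grid size and variable names are choices made for this illustration.

```python
N = 600                      # grid cells on the road cross-section [-3, 3]
STEP = 6.0 / N               # cell width
xs = [-3.0 + (i + 0.5) * STEP for i in range(N)]   # cell midpoints

def inner(f, g):
    """Discretised L2 inner product <f, g> on the grid."""
    return sum(a * b for a, b in zip(f, g)) * STEP

p = [1.0 if -3.0 <= x <= -2.0 else 0.0 for x in xs]   # ground truth density
q = [1.0 if -1.0 <= x <= 0.0 else 0.0 for x in xs]    # perceived density
u_forward = [-10.0 if -1.0 <= x <= 1.0 else 0.0 for x in xs]  # U_1, action a*
u_brake = [-5.0 for _ in xs]                                   # U_2, action a

d_u = [a - b for a, b in zip(u_forward, u_brake)]     # Delta U = U_{a*} - U_a
d_mu = [a - b for a, b in zip(q, p)]                  # Delta mu = mu_q - mu_p

coef = inner(d_mu, d_u) / inner(d_u, d_u)             # projection coefficient
pce = [coef * v for v in d_u]                         # planning-critical error
pie = [m - c for m, c in zip(d_mu, pce)]              # planning-invariant error

delta_xi = inner(d_mu, d_u)                           # change in preference score
```

Under these assumptions the change in preference score evaluates to about $-10$, the projection coefficient is negative (so the PCE has the shape of $\Delta U$ up to a negative constant), and the PIE is numerically orthogonal to $\Delta U$.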

4.2. EVALUATION OF PERCEPTION ERROR IMPACT BY TIP

In practice, combining (7) and (12), the evaluation of perception error impact can be reduced to the calculation of four expected utilities:
$$\Delta\xi(a^*, a; q, p) = \mathbb{E}_{q(s)}\big[U(s, a^*)\big] - \mathbb{E}_{p(s)}\big[U(s, a^*)\big] - \mathbb{E}_{q(s)}\big[U(s, a)\big] + \mathbb{E}_{p(s)}\big[U(s, a)\big]. \tag{14}$$
Computing these expectations in analytical forms typically requires strong assumptions on the forms of both utility and distribution functions, which substantially limits the flexibility and representation capacity. To allow for maximum representation flexibility, we resort to numerical methods and estimate the expected utilities from finite-size samples of world states, and show that the solution is both statistically consistent and uniformly efficient under the mild conditions in Theorem 3. Specifically, for a fixed action $a$, given an i.i.d. sample of the utilities $\{U(S_i, a)\}_{i=1}^{n}$ with $S_i$ drawn from the state distribution $p_S(s)$, an unbiased estimator of the expected utility based on U-statistics is
$$\widehat{EU}_a = \frac{1}{n} \sum_{i=1}^{n} U(S_i, a). \tag{15}$$
Under many common practical conditions, a fast convergence rate via a uniform bound on the estimator of (15) can be achieved by the following result.

Theorem 3 (Exponential Convergence Rate). If there exists an $M \in \mathbb{R}$ such that $|U(S, a)| < M$ almost surely, then
$$\Pr\Big(\big|\widehat{EU}_a - \mathbb{E}\big[U(S, a)\big]\big| > \varepsilon\Big) < 2\exp\!\Big(-\frac{n\varepsilon^{2}}{2L}\Big), \quad \forall \varepsilon > 0, \quad L = \min\big\{M^{2}, \ \mathrm{Var}(U(S, a)) + M\varepsilon/3\big\}. \tag{16}$$

The exponential convergence rate at $O(e^{-n})$ provided by Theorem 3 is significant in the sense that it depends on (i) neither the dimensionality of the original state space $\mathcal{S}$ (i.e., the curse of dimensionality is not invoked), nor (ii) the distribution $p_S(s)$ and utility functions (i.e., $U(S, a)$ and $p_S(s)$ can take arbitrary forms). To facilitate the understanding of our approach, the pseudo code for evaluating TIP is provided in Algorithm 1, which sketches the basic routine to compute the TIP score of a perception input sequence $q(\{s_t\}_{t=-\tau}^{0})$ from $t = -\tau$ to $t = 0$ for planning at $t = 0$.
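The core loop of Algorithm 1 can be sketched in a few lines once the planner has produced the reference action set and the optimal action. The sketch below takes those as given inputs and applies the routine to the toy example accompanying Figure 2(b). The function names, the sampling setup, and the helper that evaluates the right-hand side of the bound in Theorem 3 (read here as requiring $|U| < M$) are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def mean_utility(utility, samples, action):
    """Sample-mean estimator of the expected utility of one action."""
    return sum(utility(s, action) for s in samples) / len(samples)

def tip_score(utility, samples_q, samples_p, a_star, actions):
    """Minimum change in the preference score of a_star over the reference actions."""
    deltas = []
    for a in actions:
        eu1 = mean_utility(utility, samples_q, a_star)  # perception, optimal action
        eu2 = mean_utility(utility, samples_q, a)       # perception, candidate
        eu3 = mean_utility(utility, samples_p, a_star)  # ground truth, optimal action
        eu4 = mean_utility(utility, samples_p, a)       # ground truth, candidate
        deltas.append(eu1 - eu2 - eu3 + eu4)
    return min(deltas)

def tail_bound(n, eps, m, variance):
    """Right-hand side of the Theorem 3 bound, assuming |U| < m almost surely."""
    l_const = min(m * m, variance + m * eps / 3.0)
    return 2.0 * math.exp(-n * eps * eps / (2.0 * l_const))

# Toy utilities from the example with Figure 2(b); s is the lateral cone position.
def toy_utility(s, action):
    if action == "forward":
        return -10.0 if -1.0 <= s <= 1.0 else 0.0  # loss only on collision
    return -5.0  # "brake": constant loss

rng = random.Random(0)
samples_p = [rng.uniform(-3.0, -2.0) for _ in range(1000)]  # ground truth belief
samples_q = [rng.uniform(-1.0, 0.0) for _ in range(1000)]   # noisy perception belief
score = tip_score(toy_utility, samples_q, samples_p, "forward", ["forward", "brake"])
```

Since the loop also visits $a = a^*$, for which the change is zero, the score is never positive; in this toy case it equals $-10$, the full collision loss that the perception error hides from or imposes on the comparison between the two actions.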
Figure 3: NDS saturates in several cases once the noise reaches certain levels, and the sensitivity of TIP to various types and levels of noise is generally more consistent than that of the other three metrics (SDE-APD is computed by SDE-APD@t=1s for velocity noise).

Figure 4: A miss-detected stationary obstacle is placed in front of an AV at $x = 0$ moving at 14m/s along the $+x$ direction. AV-1 ("jerk-averse") is optimised for driving comfort with braking capped at $-4\,\mathrm{m/s^2}$, while AV-2 ("collision-averse") is optimised for safety and can brake as much as $-6\,\mathrm{m/s^2}$. Both SDE-APD and PKL consider closer miss detections worse than further ones. TIP, however, reveals that AV-1 regards the one at 30m as the worst, since the collision is inevitable even if the obstacles at 20m and 25m are detected, yet miss detection at 30m leads to a collision that could otherwise have been avoided. No other metric provides resolution at this level. When the miss detection happens behind the AV, both TIP and PKL ignore its impact, unlike the other two. Note the symmetry of NDS and SDE-APD in both directions of the $x$-axis.

5. EMPIRICAL STUDY

In this section, we evaluate how the proposed TIP works in practice via extensive experiments on both synthetic and real data. Due to the space constraint, more details are left in the appendix.

5.1. BASIC SETTINGS

All AVs used in the experiments are based on the same type of regular passenger vehicle. The planner deployed in the autonomous driving system is derived from the popular open-source Apollo project (Fan et al., 2018). It consists of sub-modules for routing, object motion prediction, cost generation, path finding, and trajectory optimisation. At each planning time instant, these sub-modules (except the trajectory optimisation) analyse the environment and input history, and establish the target utility function for final trajectory optimisation, i.e., the objective utility function $U(\cdot, s)$ is first created with the perception input as (part of) its hyper-parameters. The path finder then provides multiple initial paths as candidates for path-wise trajectory optimisation, and the final choice is determined by a utility decider. The goals the planner strives to achieve include motion smoothness, traffic rule compliance, safety, progress to the destination, etc. The planner has been extensively verified via rigorous road tests in major cities with populations in the millions (see Appendix B for more details). All experiments are implemented in scenarios, as is the standard protocol in autonomous driving (Riedmaier et al., 2020). The scenarios used are collected from real-world road tests (see more details in Appendix D). We consider the planning problem at one particular frame in a scenario at a time, and evaluate the utility of an action (a spatiotemporal trajectory the AV executes) for the next three seconds, following the basic setup of Philion et al. (2020).
For comparison, three baselines are adopted from the spectrum of perception metrics: 1) at the conventional end, nuScenes dataset score (NDS) (Caesar et al., 2020) combines several traditional scoring results for 3D object detection into a single performance measure, 2) SDE average precision distance-weighted (SDE-APD) (Deng et al., 2021) focuses more on support distance errors near the AV in an ego-centric fashion, and 3) PKL (Philion et al., 2020) serves as the representative for AV behaviour-driven metrics.

5.2. RESULTS ON SYNTHETIC DATA

In the first set of experiments, we aim to gain some understanding of how various metrics react to common types of perception noise. A dataset is synthesised from our curated road test scenarios by adding controlled noise to the 3D object ground truth of vehicles, to enable clear observation of the sensitivity of metrics to specific types of perception errors. For this, 1000 5s-long scenarios are assembled, with the number of objects per scene between 30 and 500, where AVs are moving at an average speed of at least 5m/s. The ground truth is annotated by professionally trained human operators. All objects in a scenario are labelled with location, heading, category, and bounding box from 3D point clouds recorded by LiDARs on AVs during road tests. In total, six types of errors are considered. False positives are tested by adding "ghost" vehicles scattered within a 70m-by-30m box centred at the AV, with motion properties randomly perturbed from those of the AV. Miss detection is created by removing objects from the ground truth randomly with a certain probability (i.e., the miss detection rate). Other noises involving location, yaw, velocity and size are sampled from zero-mean Gaussian distributions with different variances and added to the corresponding properties of objects in the ground truth. The comparisons are shown in Figure 3. While all metrics are negatively correlated with all six types of perception noise, NDS saturates for some types (e.g., velocity) due to its design. SDE-APD, which also involves manual engineering, exhibits varying sensitivity at different noise levels (especially for velocity, as the default matching threshold of 0.2m is easily overwhelmed by speed noise larger than 1m/s). The selectivity of TIP tends to be more consistent than that of PKL, in the sense that, while both may predict results in similar dynamic ranges when the noise is mild, the former indicates a larger loss when input perception errors intensify in most cases.
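The noise injection described above can be sketched as follows. This is a simplified illustration under assumptions: objects are plain dictionaries with hypothetical field names, the 70m side of the ghost box is assumed to lie along the AV's longitudinal axis, and ghost motion properties are set to fixed placeholder values rather than being perturbed from the AV's motion as in the actual dataset.

```python
import random

def perturb_objects(objects, rng, miss_rate=0.0, sigma=None):
    """Drop objects with probability miss_rate, then add zero-mean Gaussian noise."""
    sigma = sigma or {}
    noisy = []
    for obj in objects:
        if rng.random() < miss_rate:      # simulated miss detection
            continue
        new_obj = dict(obj)
        for key, std in sigma.items():    # e.g. location, yaw, velocity, size
            new_obj[key] = obj[key] + rng.gauss(0.0, std)
        noisy.append(new_obj)
    return noisy

def add_ghosts(objects, rng, count, length=70.0, width=30.0):
    """Append false positives scattered in a length-by-width box centred at the AV."""
    ghosts = [
        {"x": rng.uniform(-length / 2, length / 2),
         "y": rng.uniform(-width / 2, width / 2),
         "yaw": 0.0, "speed": 0.0, "size": 4.5}   # placeholder motion and size
        for _ in range(count)
    ]
    return list(objects) + ghosts

rng = random.Random(0)
# Hypothetical ground-truth object in the AV frame (metres, radians, m/s).
truth = [{"x": 12.0, "y": 0.5, "yaw": 0.0, "speed": 8.0, "size": 4.5}]
noisy = perturb_objects(truth, rng, miss_rate=0.1, sigma={"x": 0.5, "speed": 1.0})
with_ghosts = add_ghosts(truth, rng, count=3)
```

Keeping each noise type behind its own parameter makes it possible to sweep one error type at a time, which is what allows the sensitivity of a metric to a specific type of perception error to be observed in isolation.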
We further investigate the behaviour of TIP in some individual cases. In a typical miss detection scenario, we remove a stationary vehicle in front of or behind moving AVs. As shown in Figure 4, the outcomes rendered by TIP change with different planner settings, and TIP assigns the worst loss to miss detections at the border of collision events, where the accident could have been avoided by a small margin had the obstacle been successfully detected. This reveals the superior resolution of TIP in identifying critical events from the planning perspective that would have been missed by the other baselines. The case also shows that NDS and SDE-APD fail to distinguish errors on the two sides of the AV, due to their spatial or directional homogeneity by design.

5.3. RESULTS ON REAL DATA

In the second set of experiments, we study the results from the real perception module deployed on our AVs, which is exemplified by a 3D object detection model that predicts the class, location, heading, velocity and size of objects from LiDAR point clouds. TIP is independent of the specific detector and can be applied to various methods (Lang et al., 2019; Yin et al., 2021; Shi et al., 2020). We adopt an end-to-end multi-view fusion (MVF) based model to synergise the birds-eye and perspective views of point clouds (Zhou et al., 2019). The model is trained on 780K LiDAR sweeps using annotations of vehicles, pedestrians and cyclists with detection range [-67.2m, 124.8m] × [-51.2m, 51.2m]. A typical challenge in developing the perception model is to determine how much training is needed to reach a satisfactory level of performance. Conventional solutions require a variety of heterogeneous metrics to measure different aspects of the algorithm, including mean average precision (mAP) for detection and mean squared errors in predicted motion properties. Recently, some unified metrics like NDS (Caesar et al., 2020) have also been proposed through manual engineering. These metrics hardly predict how a change in the perception model affects driving quality, and in most cases the conclusion can only be drawn from large-scale real road tests, which are almost infeasible for this purpose (Wachenfeld and Winner, 2016; Åsljung et al., 2017). We evaluate the performance of our 3D object detection model on the same benchmark as in Section 5.2 and compare the model output against the ground truth. The model is trained for 15 epochs, and results are reported in the left of Figure 5. Not surprisingly, NDS tends to increase as the model training progresses, and the final checkpoint models usually achieve the best performance, since NDS combines errors that are aligned with the loss functions optimised during training.
When evaluated with the AV involved, however, the picture is different. SDE-APD implies that the training, without the AV context, seems to struggle with improving results on close-by objects, as the losses are dominated by a large number of far-away yet more challenging objects. From either the behaviour or the planning perspective, both TIP and PKL indicate that the last checkpoint model is not among the best possible models during training. Instead, models somewhere in the middle of the training can provide better autonomous driving performance. In fact, neither TIP nor PKL improves significantly beyond the 7th epoch, suggesting that early termination of training may be even more beneficial to driving quality. More importantly, we notice that TIP disagrees with PKL on scenarios across models of top performance, where quite a few critical cases are identified by TIP yet missed by PKL. The difference is illustrated in the middle of Figure 5 by the scatter plot of randomly sampled scenarios, where the PKL values are almost zero while the TIP scores are non-trivial for quite a number of cases, suggesting a drastic impact of the perception errors on the AV planning process despite similar AV behaviour outputs with or without these errors (related individual examples are discussed in Appendix C.1). To compare other 3D detectors for offline applications (e.g., auto labelling (Qi et al., 2021)), we implement two offline models with PillarNet (Shi et al., 2022) enhanced with transformer modules as the basic detector. The first one, denoted MVF (PN-T), uses the point cloud only from the current frame for prediction. The second one, denoted 5F-MVF (PN-T), leverages 5 consecutive frames around the current one for prediction. Results are reported in Table 1. Both offline models, with far fewer restrictions on resources, have better performance by SDE-APD.
MVF (PN-T), however, cannot produce precise velocity from the observation of only one frame, which leads to inferior performance on the other three metrics (MVF + tracking is the onboard model discussed above). 5F-MVF (PN-T) delivers the overall best results across all metrics, despite the marginal gap computed by PKL. To further justify the soundness of the proposed approach at the scenario level, we also conduct a subjective evaluation similar to that in (Philion et al., 2020). We collect 258 pairs of scenarios with actual perception noise and check whether TIP, PKL or NDS disagree on the relative severity (i.e., one believes the perception error in scenario A is worse than that in scenario B while the other believes the opposite). These scenario pairs are compared and rated by 10 randomly selected human drivers, who decide which is worse from the human perspective. The result reported in Table 2 suggests that human drivers side with TIP over the other three baselines.

6. CONCLUSION

In this work, we propose a principled framework to evaluate perception from the perspective of planning for autonomous driving. Our approach explicitly exploits the properties of module-based planners and effectively identifies perception noises that may cause large planning change in the context of expected utility maximisation. Extensive experiments on both synthetic and real data confirm that our approach is capable of distinguishing perception errors that would not be separated by conventional metrics or those only exclusively focusing on AV behaviours.

7. ETHICS STATEMENT

Autonomous driving directly involves interaction with human beings, animals and other assets in the real world. Any performance measure tools for autonomous driving may not identify or cover all possible failure cases of AVs that may lead to negative or even catastrophic consequences. TIP is not an exception, despite the principle to precisely reflect the potential loss in planning by design. It is critical that any researchers and practitioners of TIP should still implement standard and comprehensive safety protocols to ensure legitimate compliance to appropriate laws and rules, to minimise the likelihood of any potential negative impacts.

8. REPRODUCIBILITY STATEMENT

The submission includes three major technical parts: theory, implementation, and empirical study. The main results of the theoretical analysis and treatments are presented in Section 3.3 and Section 4.2. The detailed explanation and complete proofs are presented in Appendix F and Appendix G. The computation of TIP is provided as pseudo-code in Algorithm 1. For the empirical study, the planner is derived from the open-sourced Apollo with its configuration fine-tuned on real road test data as described in Appendix B. NDS is computed via the open-sourced implementation. SDE-APD is implemented in house according to the original paper (Deng et al., 2021). PKL is implemented following the open-sourced project.

A ADDITIONAL LITERATURE REVIEW

Expected Utility Maximisation. Almost all solutions proposed so far in the artificial intelligence realm to develop intelligent agents have followed the rational-action strategy as opposed to the human-like behaviour strategy, i.e., the optimal decision made by a rational agent should optimise some achievement measurement on the consequences of its decision, which is, ideally, well aligned with the predefined high-level goals (Russell and Norvig, 2009). Among many possible formulations of the outcome measurement, the expected utility (EU) hypothesis is one of the most popular frameworks in decision theory (von Neumann and Morgenstern, 1944). While first introduced to characterise human behaviours in microeconomics, the idea has also found great success in many other domains, including optimal decision making for artificial intelligence systems (Osborne and Rubinstein, 1994).

Embedding Probability Measures in Hilbert Space. Embedding probability measures into a Hilbert space has been explored in the literature on kernel methods (Berlinet and Thomas-Agnan, 2011; Smola et al., 2007). These methodologies exclusively focus on the reproducing kernel Hilbert space, which is spanned by Mercer kernel functions. However, the implicit requirement on function continuity may restrict its applicability to our problem, where either utility or probability density functions may be discontinuous. In this work, we consider a less restrictive Hilbert space where only square-integrability is needed and more flexible functions are possible.

B AUTONOMOUS VEHICLE PLANNER

We start by introducing the autonomous vehicle planner, which provides the fundamental toolkit for the proposed evaluation framework. Our planner is designed to control a Level 4 AV running in urban areas of major modern cities. Its modularised architecture is similar to many popular utility-based planners (e.g., (Fan et al., 2018)), and consists of four major components as illustrated in Figure 6:
• The predictor infers the motion information $s_m$ in the future (i.e., $t > 0$) for all dynamic road objects from the perception input history (i.e., $t \leqslant 0$) up to the planning time (i.e., $t = 0$).
• The action proposer analyses the current environment at the planning time from (1) the perception input, (2) the future object motion input, and (3) other inputs (e.g., localisation, traffic lights, semantic maps, routing path, etc.), and proposes various sets of behaviours (e.g., "go straight" and "lane change") for the AV with an initial feasible spatiotemporal trajectory for each set.
• The trajectory optimiser takes the results of the above components as input, and finds the optimal spatiotemporal trajectory for each behaviour set by numerically solving an optimisation problem with the initial feasible trajectory from the proposer as the starting point.
• The optimal trajectories from all behaviour sets are then submitted to the action decider, which assembles all information to evaluate the utilities of different candidate actions (with corresponding optimal spatiotemporal trajectories), and makes the final decision on the optimal action $a^*$.
Similar to many popular architectures (e.g., (Fan et al., 2018)), our planner utility function $U(s, a)$ is of the general form $U(s, a) = \sum_i \lambda_i U_i(s, a) + U_s(s) + U_a(a)$, where $\lambda_i$ are the (static) coefficients, the atomic element function $U_i$ depending on both $a$ and $s$ characterises the "compatibility" of action $a$ and scenario $s$, $U_s(s)$ depicts the current environment, and $U_a(a)$ evaluates the quality of the action.
These terms can be categorised into the following groups.
• The smooth motion group encourages motion without abrupt changes in acceleration, and penalises large jerks (i.e., the derivative of acceleration).
• The safety distance to road obstacles group is designed to keep the AV away from other road objects to minimise the likelihood of collision. This distance is defined as the $\ell_2$ distance between the AV spatiotemporal sweeping contour and a foreign object on the road.
• The legal motion satisfaction group is designed to make the AV strictly follow all applicable traffic rules when in motion. For instance, the cost of crossing solid yellow lines is so high that such behaviour is prohibited unless a collision cannot be avoided otherwise. Some other legal options also come at a price to discourage high-risk behaviour (e.g., lane changes in crowded scenes).
• The progress to the destination group aims to guide the AV to achieve the goal in the big picture and reach the final destination.
The aforementioned planner deployed onboard our AVs has gone through rigorous road tests in urban areas of major cities with populations in the millions. Results of more than 10,000 miles on average tested every week indicate that the planner achieves 111.3 miles per intervention (MPI), which suggests that the planner used in this work is a reasonable and validated one.
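The additive utility and the action decider described above can be sketched as follows. This is a minimal toy illustration, not the deployed planner: the `State` and `Action` fields, the three term functions, the 3s horizon, and all numeric weights are hypothetical choices made only to show how $U(s, a) = \sum_i \lambda_i U_i(s, a) + U_s(s) + U_a(a)$ feeds an expected-utility argmax.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class State:
    gap: float  # hypothetical: distance (m) to the object ahead


@dataclass(frozen=True)
class Action:
    name: str
    speed: float  # hypothetical: constant speed (m/s) over the horizon
    jerk: float   # hypothetical: peak jerk (m/s^3) of the trajectory


def safety_term(s: State, a: Action, horizon: float = 3.0) -> float:
    """U_i(s, a): depends on both state and action (toy safety distance)."""
    remaining = s.gap - a.speed * horizon
    return -100.0 if remaining <= 0.0 else -1.0 / remaining


def state_term(s: State) -> float:
    """U_s(s): depends on the state only, so it never changes the argmax."""
    return 0.0


def action_term(a: Action) -> float:
    """U_a(a): smooth-motion penalty plus progress reward (toy values)."""
    return -abs(a.jerk) + a.speed


def utility(s: State, a: Action, lam: float = 1.0) -> float:
    return lam * safety_term(s, a) + state_term(s) + action_term(a)


def decide(belief: List[Tuple[State, float]], actions: List[Action]) -> Action:
    """Pick the action with the highest expected utility under the belief."""
    def expected_utility(a: Action) -> float:
        return sum(prob * utility(s, a) for s, prob in belief)
    return max(actions, key=expected_utility)


forward = Action("move forward", speed=5.0, jerk=0.0)
brake = Action("hard brake", speed=0.0, jerk=5.0)
near = [(State(gap=10.0), 1.0)]
far = [(State(gap=100.0), 1.0)]
print(decide(near, [forward, brake]).name)
print(decide(far, [forward, brake]).name)
```

In this toy setting the decider brakes when the object is 10m ahead and moves forward when it is 100m ahead, which mirrors how the safety term trades off against progress and smoothness.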

C MORE COMPARISONS TO RELATED METRICS

In comparison to the other advanced metrics (PKL (Philion et al., 2020), IPA (Ivanovic and Pavone, 2021), SDE (Deng et al., 2021)) that have recently been proposed for more effective perception evaluation, our approach provides a universal and principled solution to evaluate the impact of perception noise from the perspective of the planning process of an AV.

C.1 COMPARISON WITH PKL

Here we provide more empirical results to better understand the difference between the proposed TIP and PKL (planning KL-divergence) (Philion et al., 2020) .

Results on Synthetic Data

Figure 7 shows a scatter plot of scene-wise TIP and PKL results on the synthetic data generated as described in Section 5.2 with 6 false positives per scene. Some results lie very close to either the x- or y-axis (the top right corner), suggesting that TIP and PKL deviate in determining whether a perception error (i.e., a false positive) is crucial to planning in these cases. A typical scenario of such disagreement is shown in Figure 8, where the behaviour of the AV does not change significantly between ground-truth and noisy perception inputs (PKL = -0.248), yet the planning process has changed considerably (TIP = -61.654) because the proximity of the false positive objects has drastically changed the planning cost to close objects. In this case, TIP is capable of detecting serious perception errors that PKL fails to identify.

Results on Real Data

On the real data, we also have similar observations, and demonstrate the actual scene for one such scenario in Figure 9. As shown in this case, a false detection of a vehicle in front of the AV does not change the behaviour considerably (PKL = -0.802), while the significant planning cost change is reflected by TIP with value -115.42. More individual examples are shown in Figure 10. Overall, on both synthetic and real data, we show that the proposed TIP can efficiently and effectively capture perception errors critical to the planning that may be missed by PKL. This confirms our motivation to exploit the actual AV planning process, as opposed to the AV behaviour (i.e., the result of planning), to gain insights into the impact of input perception error on the whole AV system.

C.2 COMPARISON WITH IPA

IPA (injecting planning-awareness) is recently developed in (Ivanovic and Pavone, 2021) to encode the planning error based on the hypothesis that the impact of object location error is proportional to the gradient magnitude of the planning cost functions involving the AV-object distance.
This solution, however, requires differentiability of the planning cost functions, while our approach does not and is thus more applicable to Level 4 AVs, which are typically structured as a modularised pipeline of individual components including perception, prediction, planning, etc. More seriously, it fails to account for all cases, since local properties (gradients) do not always reflect global ones (overall losses). To illustrate this, consider a scenario where the cost of the AV being close to an object is $1/d$, and assume the following two cases of object location errors.
• Case one: The ground truth distance of an object to the AV is 1m, and the noisy distance estimated by perception is 0.9m. Per the metric IPA defined in (Ivanovic and Pavone, 2021), the result is $\left|\frac{\mathrm{d}}{\mathrm{d}d}(1/d)\right|_{d=1} |\Delta d| = 1 \times |1.0 - 0.9| = 0.1$, while the actual cost difference is $\left|\frac{1}{0.9} - \frac{1}{1}\right| = 1/0.9 - 1 = 0.111$.
• Case two: The ground truth distance of an object to the AV is 2m, and the noisy distance estimated by perception is 2.5m. Per the metric IPA defined in (Ivanovic and Pavone, 2021), the result is $\left|\frac{\mathrm{d}}{\mathrm{d}d}(1/d)\right|_{d=2} |\Delta d| = 0.25 \times |2.5 - 2.0| = 0.125$, while the actual cost difference is $\left|\frac{1}{2} - \frac{1}{2.5}\right| = 0.5 - 1/2.5 = 0.1$.
Obviously, the IPA metric of case two is larger than that of case one, while the actual error in planning cost is the other way round, as the first-order Taylor expansion adopted by IPA cannot precisely capture the change in cost function value over a large variation of the input.
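The two cases above can be checked numerically with a short script. This is only a sketch of the arithmetic in this subsection, assuming the toy cost $1/d$; the function names are hypothetical and do not come from the IPA implementation.

```python
def cost(d: float) -> float:
    """Toy planning cost for being at distance d (metres) from an object."""
    return 1.0 / d


def first_order_estimate(d_true: float, d_noisy: float) -> float:
    """Gradient magnitude at the true distance times the location error."""
    gradient = -1.0 / d_true ** 2  # derivative of 1/d
    return abs(gradient) * abs(d_noisy - d_true)


def actual_cost_change(d_true: float, d_noisy: float) -> float:
    """Exact change in the toy cost caused by the location error."""
    return abs(cost(d_noisy) - cost(d_true))


cases = {"case one": (1.0, 0.9), "case two": (2.0, 2.5)}
for name, (d_true, d_noisy) in cases.items():
    estimate = first_order_estimate(d_true, d_noisy)
    actual = actual_cost_change(d_true, d_noisy)
    print(f"{name}: first-order = {estimate:.3f}, actual = {actual:.3f}")
```

The first-order estimate ranks case two above case one, while the exact cost change ranks them the other way round, which is the ordering mismatch discussed above.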

D SCENARIO COLLECTION

The scenarios used in this work are curated from real-world AV road tests on public roads in urban areas of megacities, e.g., central business districts, populated residential communities, major commercial areas, etc. Each scenario is a 10s-long excerpt extracted from a continuous interval of a road test, which consists of 1) all raw data recordings (LiDAR point clouds, camera images, positioning signals, etc.) from the road test during the interval, and 2) the portion of offline generated high-definition (HD) and birds-eye view (BEV) raster maps that cover the field of perception during the interval. The duration of a road test ranges from tens of minutes to several hours, and the tests cover various times of both weekdays and weekends from early morning till late night during a period of more than one year, providing a rich blend of weather conditions (e.g., sunny, cloudy, rainy, and snowy), traffic intensity (e.g., rush hours on highways and crowded streets during holidays), road participant diversity (e.g., private cars, cyclists, pedestrians, and emergency vehicles), and so forth. The scenarios are selected from non-trivial situations (i.e., those with few traffic participants are filtered out) with balance in AV motion speed, diversity of traffic participants, weather, geographical locations, etc.

Figure 9: Illustration of AV behaviours in reaction to ground truth and actual noisy perception inputs. Under the ground truth perception input (the left column), the AV is clear to move forward with soft braking to keep its distance to another vehicle ("82") in front. Given the noisy perception input (the right column), however, the AV has to brake hard to avoid a potential collision with the false positive vehicle close to it in front (marked by the red arrow). In either case, since the AV speed is slow and it is braking (either soft or hard), the difference in behaviour is insignificant (PKL = -0.802), yet the consequence of the false positive is by no means trivial: the false positive causes a hard brake and a virtual collision (between the behaviours under ground truth perception input and false positive), which is precisely captured by the proposed TIP (TIP = -115.42). The kinematic motion for the ground truth scenario (bottom left) is a = -0.36m/s$^2$, j = -0.72m/s$^3$, and for the noisy scenario (bottom right) is a = -0.36m/s$^2$, j = -76.4m/s$^3$. Note how sharply the braking changes in the presence of the noisy perception (jerk: -0.72m/s$^3$ vs. -76.4m/s$^3$). Clearly, this is a critical error from the system perspective.

E BREAKDOWN OF TIP

In order to facilitate understanding of the TIP evaluation process, we have illustrated the evaluation steps in Figure 11 , where a typical false negative case is used for the analysis.
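The four expected utilities reported in the caption of Figure 11 can be combined in a few lines. The sketch below assumes that TIP is the inner product $\langle \Delta U, \Delta\mu \rangle_H$ with $\Delta U = U_{a^*} - U_a$ and $\Delta\mu = \mu_q - \mu_p$, following the notation of Figure 2; equation (15) is not reproduced in this appendix, so this reading is an assumption that happens to reproduce the reported score, and the function name is hypothetical.

```python
def tip_from_expected_utilities(eu_star_p: float, eu_cand_p: float,
                                eu_star_q: float, eu_cand_q: float) -> float:
    """Assumed form: <U_a* - U_a, mu_q - mu_p>, expanded by linearity.

    eu_star_p: EU of the ground truth behaviour a* in the true environment p
    eu_cand_p: EU of the candidate behaviour a in the true environment p
    eu_star_q: EU of a* in the noisy perceived environment q
    eu_cand_q: EU of a in the noisy perceived environment q
    """
    preference_under_p = eu_star_p - eu_cand_p
    preference_under_q = eu_star_q - eu_cand_q
    return preference_under_q - preference_under_p


# Values taken from the four scenes of Figure 11.
score = tip_from_expected_utilities(
    eu_star_p=-5.83, eu_cand_p=-183.0, eu_star_q=-80.8, eu_cand_q=-58.3
)
print(round(score, 1))  # -199.7
```

Under this reading, the large negative score comes mostly from the true environment, where braking is preferred over moving forward by about 177 utility units, while the noisy perception reverses that preference.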

F EXAMPLES AND NON-EXAMPLES OF SQUARE-INTEGRABLE DENSITY FUNCTIONS

Theorem 1 in the main text requires square-integrability of a density function, which includes many popular cases that may be used for constructing the utility function for planning.

Example 1 (Bounded PDFs). If both the support and the range of the PDF $f(x)$ of a random variable are bounded, then $f(x)$ is square-integrable, e.g., uniform distributions.

Example 2 (Parametric PDFs). PDFs of many popular parametric statistical models are square-integrable, e.g., (sub-)Gaussian, (sub-)Laplace, Gamma with shape parameter larger than 1/2 (including exponential, Erlang, and $\chi^2$ distributions with more than one degree of freedom), etc.

Example 3 (Mixture Models of Countable Components with Square-Integrable PDFs). The PDF of a mixture model is of the form
$$f(x) = \sum_i \alpha_i f_i(x), \quad \alpha_i > 0, \quad \sum_i \alpha_i = 1, \tag{17}$$
where $f_i(x)$ is the PDF of the $i$-th component out of the countable set $\{f_i(x)\}$. $f(x)$ of (17) is square-integrable if $\forall i$, $f_i \in L^2$ and $M = \sup_i \|f_i\|_H < +\infty$, as
$$\int f(x)^2 \, dx = \sum_{i,j} \alpha_i \alpha_j \int f_i(x) f_j(x) \, dx = \sum_{i,j} \alpha_i \alpha_j \langle f_i, f_j \rangle_H \tag{18}$$
$$\leqslant \sum_{i,j} \alpha_i \alpha_j \|f_i\|_H \|f_j\|_H \leqslant M^2 < +\infty. \tag{19}$$
A variety of mixture models are included, such as Gaussian mixture models and mixtures of Gamma distributions.

On the other hand, since $\ell_1$ and $\ell_2$ norms are not necessarily equivalent in infinite-dimensional spaces, there are indeed some density functions $f(x) \in L^1$ with infinite $\ell_2$ norm.

Non-Example 1 (Square-Unintegrable PDFs). Let the distribution $F_X$ of a random variable $X$ be
$$F_X(x) = \begin{cases} 0, & x \in (-\infty, 0) \\ \frac{1}{\sqrt{a}} x^{\frac{1}{2}}, & x \in [0, a] \\ 1, & x \in (a, +\infty) \end{cases}$$
where $a > 0$ is the parameter; the density function is then
$$f(x) = \begin{cases} \frac{1}{2\sqrt{a}} x^{-\frac{1}{2}}, & x \in (0, a) \\ 0, & \text{otherwise} \end{cases}$$
where $f(x)$ is not square-integrable since $f(x)^2 \propto x^{-1}$ increases too fast as $x \to 0$.
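The divergence in Non-Example 1 can be made concrete by truncating the integrals at a small $\varepsilon > 0$. The sketch below evaluates the closed forms $\int_\varepsilon^a f(x)\,dx = 1 - \sqrt{\varepsilon/a}$ and $\int_\varepsilon^a f(x)^2\,dx = \frac{1}{4a}\ln(a/\varepsilon)$, which follow directly from the density given above; the function names and the choice $a = 1$ are illustrative assumptions.

```python
import math


def truncated_mass(eps: float, a: float = 1.0) -> float:
    """Integral of f over [eps, a]; tends to 1 as eps -> 0 (f is a valid PDF)."""
    return 1.0 - math.sqrt(eps / a)


def truncated_squared_norm(eps: float, a: float = 1.0) -> float:
    """Integral of f^2 over [eps, a]; grows like log(1/eps) without bound."""
    return math.log(a / eps) / (4.0 * a)


for eps in (1e-3, 1e-6, 1e-12):
    print(f"eps={eps:g}: mass={truncated_mass(eps):.6f}, "
          f"squared norm={truncated_squared_norm(eps):.3f}")
```

The mass converges to 1 while the squared norm keeps growing as $\varepsilon$ shrinks, so the density is in $L^1$ but not in $L^2$.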

G PROOFS OF THEOREMS IN THE MAIN TEXT

G.1 NOTATIONS

Besides the notations in Section 3.1, a few more are introduced as follows. A unit step function is $W(x - c) = \mathbb{1}_{x \in [c, +\infty)}$, $c \in \mathbb{R}$. $L^1(\mathcal{X}, \rho)$ denotes the space of absolutely integrable functions.

G.2 EMBEDDING PROBABILITY MEASURES IN H

PROOF (Theorem 1). Since $F_X(x)$ is absolutely continuous, there exists a density function $f_X(x) \in L^1$ such that $\frac{\mathrm{d}}{\mathrm{d}x} F_X(x) = f_X(x)$ almost everywhere. Since $f_X(x) \in L^2$, let $M = \|f_X\|_H < +\infty$; then, $\forall g \in H$, we have
$$\left|\mathbb{E}_X[g(x)]\right| = \left|\int_x g(x) \, dF_X(x)\right| \tag{20}$$
$$= \left|\int_x g(x) f_X(x) \rho(dx)\right| \tag{21}$$
$$\leqslant \int_x |g(x)| \, |f_X(x)| \, \rho(dx) \tag{22}$$
$$\leqslant M \|g\|_H, \tag{23}$$
where (23) follows from the Cauchy-Schwarz inequality (Rudin, 1976, Theorem 11.35). Thus, the linear functional $\mathbb{E}_X[\cdot]$ is bounded on $H$ and
$$\mathbb{E}_X[g(x)] = \int_x g(x) \, dF_X(x) = \int_x f_X(x) g(x) \rho(dx) = \langle f_X, g \rangle_H, \quad \forall g \in H,$$
where $\mu_p := f_X \in H$ is the embedding of the probability measure in $H$. Now assume that there exists another element $\mu' \in H$ such that $\mathbb{E}_X[g(x)] = \langle \mu', g \rangle_H$, $\forall g \in H$. Since $\mu_p - \mu' \in H$, we have
$$\|\mu_p - \mu'\|_H^2 = \langle \mu_p - \mu', \mu_p - \mu' \rangle_H = \langle \mu_p, \mu_p - \mu' \rangle_H - \langle \mu', \mu_p - \mu' \rangle_H = \mathbb{E}_X[\mu_p - \mu'] - \mathbb{E}_X[\mu_p - \mu'] = 0.$$
Therefore, the embedding $\mu_p$ for probability measure $p$ in $H$ is a unique equivalence class of the functions that are equal almost everywhere.

G.3 INJECTION OF PROBABILITY MEASURE EMBEDDINGS IN H

To prove the injection of probability measure embedding in Theorem 2, a preliminary result of (Dudley, 2002, Lemma 9.3.2) is first introduced.

Lemma 1. If $(\mathcal{X}, d)$ is a metric space, and $p$ and $q$ are two probability measures on $\mathcal{X}$, then $\mathbb{E}_{x \sim p(x)}[g] = \mathbb{E}_{x \sim q(x)}[g]$, $\forall g \in C_b(\mathcal{X})$ if and only if $p = q$, where $C_b(\mathcal{X})$ is the space of all bounded continuous functions on $\mathcal{X}$.

PROOF (Theorem 2). We prove this theorem in the following two directions.

Necessity. Since the embedding of a probability measure is unique in $H$, it is easy to see that $\mu_p = \mu_q$ if $p = q$.

Sufficiency. Note that, by the Weierstrass extreme value theorem (Rudin, 1976, Theorem 4.16), any real continuous function $g \in C(\mathcal{X})$ on the compact space $\mathcal{X}$ is bounded, i.e., $\forall g \in C(\mathcal{X})$, $\exists M \in \mathbb{R}$ such that $|g(x)| < M$, $\forall x \in \mathcal{X}$. It follows that $C(\mathcal{X}) \subset L^2(\mathcal{X})$ since $\int_{\mathcal{X}} g(x)^2 \rho(dx) \leqslant M^2 |\mathcal{X}| < +\infty$.



Referred to as a unique class of functions that are equal almost everywhere. {h 1 n } itself, however, is not a Cauchy sequence, thus it has no limit.



Figure 1: Illustration of behaviour change vs. driving cost (best viewed in colour). The change in AV behaviour due to perception error is not always correlated with the cost of the consequence. In (a) the AV has to circumvent the erroneously perceived cone by making a large detour, while in (b) the AV only needs to make a slight detour to the right, yet it inevitably hits the cone. In this case, although the behaviour change is far smaller than that of (a), the consequence is significantly worse ("hitting an object" vs. "making a large detour"). In (c) the consequence of either way is indifferent to the AV in moving forward, yet the change in behaviour is considerable in terms of spatiotemporal motion. As for (d), if there are two falsely detected cones on both sides, which are close enough to the AV when passing by despite no collision, the AV still decides to maintain the same motion as in the ground truth case. Therefore, the final behaviour of the AV does not change given the perception error, but the cost of passing by two close objects already changes the planning process, which will be missed by metrics that only look at the AV behaviour or planning result.

Figure 2: (a) Illustration of EUM in $H$. $\Delta U = U_{a^*} - U_a$ defines the behaviour direction; $\xi$ represents the preference score; $\mu_p$ and $\mu_q$ are the embeddings of the ground truth and noisy perception result, respectively; $\Delta\mu$ is the perception error, which is decomposed into the planning-critical error (PCE) $\Delta\mu_\parallel$ and the planning-invariant error (PIE) $\Delta\mu_\perp$; and the shaded area corresponds to $H_a$. (b) A toy example of PCE and PIE (best viewed in colour). The AV is moving forward on a road of width 6m with a cone in front, and the belief of the cone's position is a distribution on a line across the road (the x axis). The ground truth distribution $p$ is $U_{[-3,-2]}$, a uniform distribution with support $[-3, -2]$, while the perception believes its location (distributed as $q$) is $U_{[-1,0]}$. The AV has two action options: 1) to move forward ($a^*$, the red arrow), and the utility function is $U_1(x) = -10 \cdot \mathbb{1}_{x \in [-1,1]}$ with $x$ being the position of the cone (only large loss for collision with the cone); 2) to make a hard brake ($a$, the grey arrow), and the utility function is a constant $U_2(x) = -5$ (loss of hard braking is identical regardless of the cone position). In this case, $\Delta U$ and $\Delta\mu$ are illustrated in the top right, while the decomposition into PIE $\Delta\mu_\perp$ and PCE $\Delta\mu_\parallel$ is on the bottom right. Note that $\Delta\mu_\parallel$ is of the same shape as $\Delta U$ (up to a negative constant), and $\langle \Delta U, \Delta\mu_\perp \rangle = 0$.
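The toy example in Figure 2(b) can be reproduced numerically. The sketch below discretises the road $[-3, 3]$ on a uniform grid and assumes that $\Delta\mu = \mu_q - \mu_p$ and that the PCE is the orthogonal projection of $\Delta\mu$ onto $\Delta U$ in $L^2$; both assumptions are consistent with the caption (a negative proportionality constant and $\langle \Delta U, \Delta\mu_\perp \rangle = 0$), but the function name and grid resolution are illustrative choices.

```python
def decompose_toy_example(cells_per_metre: int = 1000) -> dict:
    """Split the perception error into PCE and PIE for the cone example."""
    n = 6 * cells_per_metre
    dx = 1.0 / cells_per_metre
    xs = [-3.0 + (k + 0.5) * dx for k in range(n)]  # cell midpoints

    def indicator(lo: float, hi: float) -> list:
        return [1.0 if lo <= x <= hi else 0.0 for x in xs]

    def inner(f: list, g: list) -> float:
        # Riemann-sum approximation of the L2 inner product on [-3, 3].
        return sum(u * v for u, v in zip(f, g)) * dx

    p = indicator(-3.0, -2.0)                    # ground truth density U[-3,-2]
    q = indicator(-1.0, 0.0)                     # perceived density U[-1,0]
    u_forward = [-10.0 * v for v in indicator(-1.0, 1.0)]  # U_1 (a*)
    u_brake = [-5.0] * n                                   # U_2 (a)

    delta_u = [u1 - u2 for u1, u2 in zip(u_forward, u_brake)]
    delta_mu = [qv - pv for qv, pv in zip(q, p)]  # assumed sign convention

    coeff = inner(delta_mu, delta_u) / inner(delta_u, delta_u)
    pce = [coeff * u for u in delta_u]            # component along delta_u
    pie = [m - c for m, c in zip(delta_mu, pce)]  # orthogonal remainder

    return {
        "coeff": coeff,
        "inner_du_dmu": inner(delta_u, delta_mu),
        "inner_du_pie": inner(delta_u, pie),
    }


result = decompose_toy_example()
print(result)
```

With these definitions the projection coefficient is $-1/15$, so the PCE is a negatively scaled copy of $\Delta U$, and the PIE is orthogonal to $\Delta U$ up to floating-point error.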

Figure 3: Comparison of metrics on different cases of synthetic noise. The left vertical axes are for NDS and SDE-APD, and the right are for PKL and TIP. Note that NDS saturates in several cases once the noise reaches a certain level, and the sensitivity of TIP to various types and levels of noise is generally more consistent than that of the other three (SDE-APD is computed by SDE-APD@t=1s for velocity noise).

Figure 4: Metrics for AVs of different driving styles.

Figure 5: Comparison of metrics on real data (best viewed in colour). Left: Metrics on different checkpoints during training. Middle: Scatter plot of impacts of perception noise measured by TIP and PKL. Note the number of data points close to the x-axis (PKL = 0), which correspond to critical errors in planning due to the perception noise captured by TIP yet missed by PKL since the AV behaviours are similar with and without the noise. Right: The first one is the ground truth with the corresponding AV behaviour (the spatiotemporal trajectory represented by the green tube with the z-axis as the temporal dimension); the second one shows that a false positive (pointed to by the red arrow) causes an outrageous planning error with a jerk of -76.4m/s$^3$ while the typical limit is around -1.0m/s$^3$ (Wang et al., 2018), despite a mild change in behaviour by PKL. PKL and TIP of this case are highlighted by the red circle in the scatter plot (the middle figure). See details in Figure 9 and Figure 10 of Appendix C.1.


Figure 6: Diagram of the major components in our planner which is used for computing the proposed perception evaluation metric TIP.

Figure 8: Illustration of AV behaviours in ground truth and synthetic scenes with false positives. The green tube represents the spatiotemporal trajectory of the AV with the z-axis as the temporal dimension (same for the rest). Bold solid lines are boundaries of driving areas (e.g., curbs, vegetation zone dividers), while light solid lines are centre lines of vehicle lanes with dashed lines as the lane boundaries. Road objects are marked with 3D bounding boxes in green. Sub-figures in the first (second) row are the birds-eye view (side view) of the scene, and sub-figures in the left (right) column correspond to ground truth (noisy) perception input (same for the rest). In this case, the AV intends to move forward under the ground truth perception input (the left column); in the presence of perception input noise (the right column), the AV behaviour remains almost unchanged (PKL = -0.248), since two false positive vehicles (pointed to by red arrows) on both of its sides force the AV to keep moving straight, yet the close-to-object cost (safety distance to road obstacles) has changed considerably during planning, which is reflected by the TIP score of -61.654.

Figure 10: Illustration of more AV behaviours in reaction to ground truth and actual noisy perception inputs. Two more outrageous perception errors are shown where an object location is improperly perceived such that it is superimposed with the AV. The ground truth is shown on the left and the actual noisy perception is on the right. For the top case: TIP = -132.4, PKL = 0.0, acceleration a = 0.03m/s$^2$, jerk j = 0.09m/s$^3$ (for both ground truth and actual noisy perception scenarios). For the bottom case: TIP = -75.0, PKL = 0.04, a = -0.35m/s$^2$, j = -2.43m/s$^3$ (for the ground truth scenario), and a = -0.35m/s$^2$, j = -2.61m/s$^3$ (for the actual noisy perception scenario).

Figure 11: Breakdown of TIP for a typical miss detection scenario. A stationary vehicle 30m in front of the AV is missed by the onboard object detector. Four scenes are: 1) (top left) ground truth behaviour ($a^*$) in ground truth environment ($p$) with EU -5.83 (hard brake to avoid a collision); 2) (top right) ground truth behaviour ($a^*$) in noisy perceived environment ($q$) with EU -80.8 (hard brake without any obstacles in front); 3) (bottom left) candidate behaviour ($a$) in ground truth environment ($p$) with EU -183.0 (move forward and collide with the object in front); 4) (bottom right) candidate behaviour ($a$) in noisy perceived environment ($q$) with EU -58.3 (move forward). Planned (for AV) or predicted object motion in a scene is rendered as a coloured spatiotemporal trajectory (3D tube with z-axis as time), e.g., each tube consists of locations of the corresponding object at time $t$ ($t$ is the coordinate of the z- or vertical axis). Note that in the lower left scene the candidate AV behaviour $a$ (move forward at a constant speed) is evaluated against the ground truth environment ($p$), which collides with the object at the end of 3s. Corresponding utilities are shown under all cases, and TIP in this case is -199.7 according to the calculation of expected utilities in (15).

Comparison of different detectors across metrics (↑).

Appendix

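Now if $\mu_p = \mu_q$ almost everywhere, we have $\left|\mathbb{E}_{x \sim p(x)}[g] - \mathbb{E}_{x \sim q(x)}[g]\right| = \left|\langle \mu_p - \mu_q, g \rangle_H\right| \leqslant \|\mu_p - \mu_q\|_H \|g\|_H = 0$, $\forall g \in C(\mathcal{X})$. Thus $p = q$ by Lemma 1.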

G.4 APPROXIMATION OF EXPECTATION FOR DISCRETE/MIXED DISTRIBUTIONS IN H

While Theorem 1 in the main text only addresses continuous distributions, a similar result can be found given point-wise continuity conditions for general distributions, which can be decomposed into absolutely continuous and discrete parts (Chung, 2000).

Theorem 4 (Approximation of Mixed Distribution). Let $F_{ac}(x)$ be an absolutely continuous distribution function with density function $f_X(x)$; a mixed distribution function with $\lambda \in (0, 1)$ as the convex combination coefficient. If $f_X(x)$ is square-integrable, and $g(x) \in L^2$ is uniformly continuous at $\{a_i\}$, then there exists a sequence of

We start by considering a simple discrete case by the following lemma.

Lemma 2. Let $F_X(x) = W(x - a)$ be a discrete distribution function with point mass at $a \in \mathcal{X}$. If $g(x) \in L^2$ is continuous at $a$, then there exists a sequence of $\{\mu_{p,n}\} \subset H$ such that

PROOF (Lemma 2). $\forall \varepsilon > 0$, since $g(x)$ is continuous at $a$, there exists a radius $r > 0$ such that with a positive measure $V = \rho(B(a, r)) > 0$, where $B(a, r) \subset \mathcal{X}$ is a neighbourhood of radius $r$ around $a$. We have $g(a) - \varepsilon < \langle h_\varepsilon, g \rangle_H < g(a) + \varepsilon$.

Thus, lim

Lemma 2 implies that the expected value of a function continuous at the point mass of a delta distribution can be approximated by an inner product in $H$ with arbitrary precision.

PROOF (Theorem 4). Note that Since $F_{ac}(x)$ is absolutely continuous, by Theorem 1, there exists a $\mu \in H$ such that On the other hand, $\forall \varepsilon > 0$, since $g(x)$ is uniformly continuous at $\{a_i\}$, there exists a radius $r > 0$ such that $\forall i$, We have Combining (28) and (29) leads to

G.5 UNIFORM CONVERGENCE RATE OF EXPECTED UTILITY ESTIMATORS

PROOF (Theorem 3). Assume that $\{X_i\}_{i=1}^n$ are independent and $X_i \in [a_i, b_i]$ almost surely. Let $\bar{X} = \frac{1}{n} \sum_i X_i$. Per Hoeffding's inequality (Hoeffding, 1963, Theorem 2), for any $\varepsilon > 0$, By symmetry, it also holds true that, for any $\varepsilon > 0$, (31) Combining the one-sided inequalities (30) and (31) leads to where On the other hand, the Bernstein inequality (Bernstein, 1946) also provides an improved revision of Chebyshev's inequality by incorporating both the almost-sure bound and the variance bound: The proof is completed by setting $X_i = U(S_i, a)$ and taking the lowest bound of (32) and (33) for the tail probability of $\bar{X} - \mathbb{E}\bar{X}$.
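The strategy of taking the lower of a Hoeffding-type and a Bernstein-type bound can be illustrated numerically. The sketch below uses the standard textbook two-sided forms of the two inequalities for a sample mean, under the simplifying assumptions that all samples share one range $[\text{lower}, \text{upper}]$, one variance bound, and one almost-sure deviation bound; these forms are not necessarily identical to (32) and (33), and all function and parameter names are hypothetical.

```python
import math


def hoeffding_bound(n: int, eps: float, lower: float, upper: float) -> float:
    """Two-sided Hoeffding bound on P(|mean - E[mean]| >= eps) for n
    independent samples, each lying in [lower, upper] almost surely."""
    width = upper - lower
    return min(1.0, 2.0 * math.exp(-2.0 * n * eps ** 2 / width ** 2))


def bernstein_bound(n: int, eps: float, variance: float, max_dev: float) -> float:
    """Two-sided Bernstein bound for n independent samples with variance at
    most `variance` and |X_i - E[X_i]| <= max_dev almost surely."""
    exponent = n * eps ** 2 / (2.0 * (variance + max_dev * eps / 3.0))
    return min(1.0, 2.0 * math.exp(-exponent))


def tail_bound(n: int, eps: float, lower: float, upper: float,
               variance: float) -> float:
    """Lowest of the two bounds, as in the final step of the proof."""
    max_dev = upper - lower  # crude almost-sure bound on the deviation
    return min(hoeffding_bound(n, eps, lower, upper),
               bernstein_bound(n, eps, variance, max_dev))


# Utilities in [0, 1] with small variance: Bernstein is much tighter here.
print(hoeffding_bound(100, 0.1, 0.0, 1.0))
print(bernstein_bound(100, 0.1, 0.01, 1.0))
print(tail_bound(100, 0.1, 0.0, 1.0, 0.01))
```

When the variance of the sampled utilities is small relative to their range, the Bernstein-type bound dominates; when the variance is close to its maximum, the Hoeffding-type bound can be the tighter one, which is why the minimum of the two is taken.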

