CRITIC SEQUENTIAL MONTE CARLO

Abstract

We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.

1. INTRODUCTION

Sequential Monte Carlo (SMC) (Gordon et al., 1993) is a popular, highly customizable inference algorithm that is well suited to posterior inference in state-space models (Arulampalam et al., 2002; Andrieu et al., 2004; Cappe et al., 2007). SMC is a form of importance sampling that breaks down a high-dimensional sampling problem into a sequence of low-dimensional ones, which are made tractable through repeated application of resampling. In practice, SMC needs informative observations at each time step to be efficient with a finite number of particles. When observations are sparse, SMC loses its typical advantages and needs to be augmented with particle smoothing and backward messages to retain good performance (Kitagawa, 1994; Moral et al., 2009; Douc et al., 2011). SMC can be applied to planning problems using the planning-as-inference framework (Ziebart et al., 2010; Neumann, 2011; Rawlik et al., 2012; Kappen et al., 2012; Levine, 2018; Abdolmaleki et al., 2018; Lavington et al., 2021). In this paper we are interested in solving planning problems with sparse, hard constraints, such as avoiding collisions while driving. In this setting, the constraint is not violated until the collision occurs, but braking needs to begin well in advance to avoid it. Figure 1 demonstrates on a toy example how SMC requires an excessive number of particles to solve such problems. In the language of optimal control (OC) and reinforcement learning (RL), collision avoidance is a sparse reward problem. In this setting, parametric estimators of future rewards (Nair et al., 2018; Riedmiller et al., 2018) are learned in order to alleviate the credit assignment problem (Sutton & Barto, 2018; Dulac-Arnold et al., 2021) and facilitate efficient learning.
In this paper we propose a novel formulation of SMC, called CriticSMC, where a learned critic, inspired by Q-functions in RL (Sutton & Barto, 2018), is used as a heuristic factor (Stuhlmüller et al., 2015) in SMC to ameliorate the problem of sparse observations. We borrow from the recent advances in deep RL (Haarnoja et al., 2018a; Hessel et al., 2018) to learn a critic which approximates future likelihoods in a parametric form. While similar ideas have been proposed in the past (Rawlik et al., 2012; Piché et al., 2019), in this paper we instead suggest (1) using soft Q-functions (Rawlik et al., 2012; Chan et al., 2021; Lavington et al., 2021) as heuristic factors, and (2) choosing the placement of such factors to allow for efficient exploration of action-space through the use of putative particles (Fearnhead, 2004). Additionally, we design CriticSMC to be compatible with informative prior distributions, which may not include an associated (known) log-density function. In planning contexts, such priors can specify additional requirements that may be difficult to define via rewards, such as maintaining human-like driving behavior.

Figure 1: Illustration of the difference between CriticSMC and SMC in a toy environment in which a green ego agent is trying to reach the red goal without being hit by any of the three chasing adversaries. All plots show overlaid samples of environment trajectories conditioned on the ego agent achieving its goal. While SMC will asymptotically explore the whole space of environment trajectories, CriticSMC's method of using the critic as a heuristic within SMC encourages computationally efficient discovery of diverse high-reward trajectories. SMC with a small number of particles fails here because the reward is sparse and the ego agent's prior behavioral policy assigns low probability to trajectories that avoid the barrier and the other agents.
We show experimentally that CriticSMC is able to refine the policy of a foundation (Bommasani et al., 2021) autonomous-driving behavior model to take actions that produce significantly fewer collisions while retaining key behavioral distribution characteristics of the foundation model. This is important not only for the eventual goal of learning complete autonomous driving policies (Jain et al., 2021; Hawke et al., 2021) , but also immediately for constructing realistic infraction-free simulations to be employed by autonomous vehicle controllers (Suo et al., 2021; Bergamini et al., 2021; Ścibior et al., 2021; Lioutas et al., 2022) for training and testing. Planning, either in simulation or real world, requires a model of the world (Ha & Schmidhuber, 2018) . While CriticSMC can act as a planner in this context, we show that it can just as easily be used for model-free online control without a world model. This is done by densely sampling putative action particles and using the critic to select amongst these sampled actions. We also provide ablation studies which demonstrate that the two key components of CriticSMC, namely the use of the soft Q-functions and putative action particles, significantly improve performance over relevant baselines with similar computational resources.

2. PRELIMINARIES

Since we are primarily concerned with planning problems, we work within the framework of Markov decision processes (MDPs). An MDP $M = \{S, A, f, P_0, R, \Pi\}$ is defined by a set of states $s \in S$, actions $a \in A$, reward function $r(s, a, s') \in R$, deterministic transition dynamics function $f(s, a)$, initial state distribution $p_0(s) \in P_0$, and policy distribution $\pi(a \mid s) \in \Pi$. Trajectories are generated by first sampling from the initial state distribution $s_1 \sim p_0$, then sequentially sampling from the policy $a_t \sim \pi(a_t \mid s_t)$ and applying the transition dynamics $s_{t+1} \leftarrow f(s_t, a_t)$ for $T-1$ time steps. Execution of this stochastic process produces a trajectory $\tau = \{(s_1, a_1), \ldots, (s_T, a_T)\} \sim p^{\pi}$, which is then scored using the reward function $r$. The goal in RL and OC is to produce a policy $\pi^* = \arg\max_{\pi} \mathbb{E}_{p^{\pi}}\left[\sum_{t=1}^{T} r(s_t, a_t, s_{t+1})\right]$. We now relate this stochastic process to inference.
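The generative process described above can be sketched in a few lines. This is only an illustrative sketch under our own assumptions: the function name `rollout`, its arguments, and the one-dimensional example environment are hypothetical and do not come from the paper.

```python
import random


def rollout(sample_initial_state, sample_action, f, r, horizon, rng):
    """Sample one trajectory and its total reward in a deterministic MDP."""
    s = sample_initial_state(rng)          # s_1 ~ p_0
    trajectory, total_reward = [], 0.0
    for _ in range(horizon):
        a = sample_action(s, rng)          # a_t ~ pi(a_t | s_t)
        s_next = f(s, a)                   # s_{t+1} <- f(s_t, a_t)
        total_reward += r(s, a, s_next)    # accumulate r(s_t, a_t, s_{t+1})
        trajectory.append((s, a))
        s = s_next
    return trajectory, total_reward


# Hypothetical 1-D environment: the state is an integer position, actions
# move it by -1, 0 or +1, and leaving the interval (-3, 3) is penalized.
rng = random.Random(0)
trajectory, total_reward = rollout(
    sample_initial_state=lambda rng: 0,
    sample_action=lambda s, rng: rng.choice([-1, 0, 1]),
    f=lambda s, a: s + a,
    r=lambda s, a, s_next: 0.0 if abs(s_next) < 3 else -1.0,
    horizon=5,
    rng=rng,
)
```

The objective above is the expectation of `total_reward` over such rollouts, maximized with respect to the policy.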

2.1. REINFORCEMENT LEARNING AS INFERENCE

RL-as-inference (RLAI) considers the relationship between RL and approximate posterior inference to produce a class of divergence minimization algorithms able to estimate the optimal RL policy. The posterior we target is defined by a set of observed random variables $O_{1:T}$ and latent random variables $\tau_{1:T}$. Here, $O$ defines "optimality" random variables, which are Bernoulli distributed with probability proportional to exponentiated reward values (Ziebart et al., 2010; Neumann, 2011; Levine, 2018). They determine whether an individual tuple $\tau_t = \{s_t, a_t, s_{t+1}\}$ is optimal ($O_t = 1$) or sub-optimal ($O_t = 0$). We replace $O_t = 1$ with $O_t$ in the remainder of the paper for conciseness.

Figure 2: The main loop of SMC for planning (a) without heuristic factors, (b) with heuristic factors, and (c) with the placement we use in CriticSMC. We use $\hat{w}$ for pre-resampling weights and $\bar{w}$ for post-resampling weights, and we elide the normalizing factor $W_t = \sum_{i=1}^{N} \hat{w}^i_t$ for clarity. The placement of $h_t$ in CriticSMC crucially enables using putative action particles in Section 3.3.

(a) No heuristic factors
1: algorithm STANDARD
2:   $a^i_t \sim \pi(a_t \mid s^i_t)$
3:   $\hat{s}^i_{t+1} \leftarrow f(s^i_t, a^i_t)$
4:   $\hat{w}^i_t \leftarrow \bar{w}^i_{t-1} \, e^{r(s^i_t, a^i_t, \hat{s}^i_{t+1})}$
5:   $\alpha^i_t \sim \text{RESAMPLE}(\hat{w}^{1:N}_t)$
6:   $s^i_{t+1} \leftarrow \hat{s}^{\alpha^i_t}_{t+1}$
7:   $\bar{w}^i_t \leftarrow \frac{1}{N}$
8: end algorithm

(b) With heuristic factors
1: algorithm H-FACTORS
2:   $a^i_t \sim \pi(a_t \mid s^i_t)$
3:   $\hat{s}^i_{t+1} \leftarrow f(s^i_t, a^i_t)$
4:   $\hat{w}^i_t \leftarrow \bar{w}^i_{t-1} \, e^{r(s^i_t, a^i_t, \hat{s}^i_{t+1}) + h^i_t}$
5:   $\alpha^i_t \sim \text{RESAMPLE}(\hat{w}^{1:N}_t)$
6:   $s^i_{t+1} \leftarrow \hat{s}^{\alpha^i_t}_{t+1}$
7:   $\bar{w}^i_t \leftarrow \frac{1}{N} e^{-h^{\alpha^i_t}_t}$
8: end algorithm

(c) CriticSMC
1: algorithm CRITICSMC
2:   $a^i_t \sim \pi(a_t \mid s^i_t)$
3:
4:   $\hat{w}^i_t \leftarrow \bar{w}^i_{t-1} \, e^{h^i_t}$
5:   $\alpha^i_t \sim \text{RESAMPLE}(\hat{w}^{1:N}_t)$
6:   $s^i_{t+1} \leftarrow f(s^{\alpha^i_t}_t, a^{\alpha^i_t}_t)$
7:   $\bar{w}^i_t \leftarrow \frac{1}{N} e^{r(s^{\alpha^i_t}_t, a^{\alpha^i_t}_t, s^i_{t+1}) - h^{\alpha^i_t}_t}$
8: end algorithm
While we can rarely compute the posterior $p(s_{1:T}, a_{1:T} \mid O_{1:T})$ in closed form, we assume the joint distribution factorizes as

$$p(s_{1:T}, a_{1:T}, O_{1:T}) = p_0(s_1) \prod_{t=1}^{T} p(O_t \mid s_t, a_t, s_{t+1}) \, \delta_{f(s_t, a_t)}(s_{t+1}) \, \pi(a_t \mid s_t),$$

where $\delta_{f(s_t, a_t)}$ is a Dirac measure centered on $f(s_t, a_t)$. This joint distribution can be used following standard procedures from variational inference to learn or estimate the posterior distribution of interest (Kingma & Welling, 2014). How close the estimated policy is to the optimal policy often depends upon the chosen reward surface, the prior distribution over actions, and the chosen policy distribution class. Generally, the prior is chosen to contain minimal information in order to maximize the entropy of the resulting approximate posterior distribution (Ziebart et al., 2010; Haarnoja et al., 2018a). Contrary to classical RL, we are interested in using informative priors whose attributes we want to preserve while maximizing the expected reward ahead. In order to manage this trade-off, we now consider more general inference algorithms for state-space models.

2.2. SEQUENTIAL MONTE CARLO

SMC (Gordon et al., 1993) is a popular algorithm that can be used to sample from the posterior distribution in non-linear state-space models and HMMs. In RLAI, SMC sequentially approximates the filtering distributions $p(s_t, a_t \mid O_{1:t})$ for $t \in 1 \ldots T$ using a collection of weighted samples called particles. The crucial resampling step adaptively focuses computation on the most promising particles while still producing an unbiased estimate of the marginal likelihood (Moral, 2004; Chopin et al., 2012; Pitt et al., 2012; Naesseth et al., 2014; Le, 2017).
The primary sampling loop for SMC in a Markov decision process is provided in Figure 2a, and proceeds by sampling an action $a_t$ given a state $s_t$, generating the next state $s_{t+1}$ using the environment or a model of the world, computing a weight $\hat{w}_t$ using the reward function $r$, and resampling from this particle population. The post-resampling weights $\bar{w}_t$ are assumed to be uniform for simplicity, but non-uniform resampling schemes exist (Fearnhead & Clifford, 2003). Here, each time step only performs simple importance sampling linking the posterior $p(s_t, a_t \mid O_{1:t})$ to $p(s_{t+1}, a_{t+1} \mid O_{1:t+1})$. When the observed likelihood information is insufficient, the particles may fail to cover the portion of the space required to approximate the posterior at the next time step. For example, if all current particles have the vehicle moving at high speed towards the obstacle, it may be too late to brake, causing SMC to erroneously conclude that a collision was inevitable, while in fact it just did not explore braking actions earlier on in time. As shown by Stuhlmüller et al. (2015), we can introduce arbitrary heuristic factors $h_t$ into SMC before resampling, as shown in Figure 2b, mitigating the insufficient observed likelihood information. The factor $h_t$ can be a function of anything sampled up to the point where it is introduced, does not alter the asymptotic behavior of SMC, and can dramatically improve finite sample efficiency if chosen carefully. In this setting, the optimal choice for $h_t$ is the marginal log-likelihood ahead, $\log p(O_{t:T} \mid s_t, a_t)$, which is typically intractable to compute but can be approximated. In the context of avoiding collisions, this term estimates the likelihood of future collisions from a given state. A typical application of such heuristic factors in RLAI, as given by Piché et al. (2019), is shown in Figure 2b.
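To make the basic loop of Figure 2a concrete, the following is a minimal sketch of a single SMC iteration with multinomial resampling. The function `smc_step`, its argument names, and the one-dimensional toy environment with a wall at position 3 are illustrative assumptions of ours, not the paper's implementation.

```python
import math
import random


def smc_step(states, weights, sample_action, f, r, rng):
    """One SMC iteration: propose actions, simulate, weight by reward, resample."""
    n = len(states)
    actions = [sample_action(s, rng) for s in states]       # a_t^i ~ pi(a_t | s_t^i)
    proposed = [f(s, a) for s, a in zip(states, actions)]   # candidate next states
    # Pre-resampling weights: previous weight times exp(reward).
    w_hat = [w * math.exp(r(s, a, s2))
             for w, s, a, s2 in zip(weights, states, actions, proposed)]
    total = sum(w_hat)                                       # normalizing factor W_t
    ancestors = rng.choices(range(n), weights=w_hat, k=n)    # multinomial resampling
    new_states = [proposed[i] for i in ancestors]
    return new_states, [1.0 / n] * n, total


# Hypothetical sparse reward: zero unless the next state reaches the wall at 3.
BETA_PEN = 50.0


def reward(s, a, s_next):
    return 0.0 if s_next < 3 else -BETA_PEN


rng = random.Random(0)
states, weights, total = smc_step(
    [2] * 20, [1.0 / 20] * 20,
    lambda s, rng: rng.choice([-1, 0, 1]),
    lambda s, a: s + a,
    reward, rng,
)
```

In this sketch the weights only become informative once a particle is one step away from the wall, which is the sparse-observation problem discussed above. Every weight evaluation also requires a call to `f`.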

3. CRITICSMC

Historically, heuristic factors in SMC are placed alongside the reward, which is computed by taking a single step in the environment (Figure 2b ). The crucial issue with this methodology is that updating weights requires computing the next state (Line 3 in Figure 2b ), which can both be expensive in complex simulators, and would prevent the use in online control without a world model. In order to avoid this issue while maintaining the advantages of SMC with heuristic factors, we propose to score particles using only the heuristic factor, resample, then compute the next state and the reward, as shown in Figure 2c . We choose h t which only depends on the previous state and actions observed and not the future state, so that we can sample and score a significantly larger number of so-called putative action particles, thereby increasing the likelihood of sampling particles with a large h t . In this section we first show how to construct such h t , then how to learn an approximation to it, and finally how to take full advantage of this sampling procedure using putative action particles.

3.1. FUTURE LIKELIHOODS AS HEURISTIC FACTORS

We consider environments where planning is needed to satisfy certain hard constraints $C(s_t)$ and define the violations of such constraints as infractions. This makes the reward function (and thus the log-likelihood) defined in Section 2 sparse,

$$\log p(O_t \mid s_t, a_t, s_{t+1}) = r(s_t, a_t, s_{t+1}) = \begin{cases} 0, & \text{if } C(s_{t+1}) \text{ is satisfied} \\ -\beta_{\text{pen}}, & \text{otherwise} \end{cases}$$

where $\beta_{\text{pen}} > 0$ is a penalty coefficient. At every time step, the agent receives a reward signal indicating if an infraction occurred (e.g. there was a collision). To guide SMC particles towards states that are more likely to avoid infractions in the future, we use heuristic factors $h_t$ that approximate future likelihoods (Kim et al., 2020), defined as $h_t \approx \log p(O_{t:T} \mid s_t, a_t)$. Such heuristic factors up-weight particles proportionally to how likely they are to avoid infractions in the future, but can be difficult to accurately estimate in practice. As has been shown in previous work (Rawlik et al., 2012; Levine, 2018; Piché et al., 2019; Lavington et al., 2021), $\log p(O_{t:T} \mid s_t, a_t)$ corresponds to the "soft" version of the state-action value function $Q(s_t, a_t)$ used in RL, often called the critic. Following Levine (2018), we use the same symbol $Q$ for the soft critic. Under deterministic state transitions $s_{t+1} \leftarrow f(s_t, a_t)$, the soft Q-function satisfies the following equation, which follows from the exponential definition of the reward given in Equation 2 (a proof is provided in Section A.2 of the Appendix),

$$Q(s_t, a_t) := \log p(O_{t:T} \mid s_t, a_t) = r(s_t, a_t, s_{t+1}) + \log \mathbb{E}_{a_{t+1} \sim \pi(a_{t+1} \mid s_{t+1})}\left[e^{Q(s_{t+1}, a_{t+1})}\right]. \quad (3)$$

CriticSMC sets the heuristic factor $h_t = Q(s_t, a_t)$, as shown in Figure 2c. We note that alternatively one could use the state value function for the next state, $V(s_{t+1}) = \log \mathbb{E}_{a_{t+1}}[\exp(Q(s_{t+1}, a_{t+1}))]$, as shown in Figure 2b. This would be equivalent to the SMC algorithm of Piché et al. (2019) (see Section A.2 of the Appendix), which was originally derived using the two-filter formula (Bresler, 1986; Kitagawa, 1994) instead of heuristic factors. The primary advantage of the CriticSMC formulation is that the heuristic factor can be computed before the next state, thus allowing the application of putative action particles.
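As a small worked example of the soft Bellman relation for Q given above, the expectation over next actions can be approximated by a Monte Carlo average and evaluated with a numerically stable log-mean-exp. The helper names `log_mean_exp` and `soft_backup` and the numerical values are our own illustrative choices, and the sketch assumes all Q values are finite.

```python
import math


def log_mean_exp(values):
    """Stable evaluation of log((1/K) * sum_j exp(values[j])) for finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def soft_backup(reward, next_q_samples):
    """Reward plus the soft value of the next state, estimated from sampled actions."""
    return reward + log_mean_exp(next_q_samples)


# Hypothetical situation: no infraction now (reward 0), and half of the prior's
# next actions lead to an infraction with penalty beta_pen = 50.
q_value = soft_backup(0.0, [0.0, -50.0])
survival_probability = math.exp(q_value)
```

Here `survival_probability` is approximately 0.5, matching the interpretation of Q as the log-probability of remaining infraction-free under the prior policy.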

3.2. LEARNING CRITIC MODELS WITH SOFT Q-LEARNING

Because we do not have direct access to $Q$, we estimate it parametrically with $Q_\phi$. Equation 3 suggests the following training objective for learning the state-action critic (Lavington et al., 2021):

$$L_{TD}(\phi) = \mathbb{E}_{s_t, a_t, s_{t+1} \sim d_{SAO}}\left[\left(Q_\phi(s_t, a_t) - r(s_t, a_t, s_{t+1}) - \log \mathbb{E}_{a_{t+1} \sim \pi(a_{t+1} \mid s_{t+1})}\left[e^{Q_{\perp(\phi)}(s_{t+1}, a_{t+1})}\right]\right)^2\right]$$

$$\approx \mathbb{E}_{s_t, a_t, s_{t+1} \sim d_{SAO}} \, \mathbb{E}_{\hat{a}^{1:K}_{t+1} \sim \pi(a_{t+1} \mid s_{t+1})}\left[\left(Q_\phi(s_t, a_t) - Q_{TA}(s_t, a_t, s_{t+1}, \hat{a}^{1:K}_{t+1})\right)^2\right],$$

where $d_{SAO}$ is the state-action occupancy (SAO) induced by CriticSMC, $\perp$ is the stop-gradient operator (Foerster et al., 2018) indicating that the gradient of the enclosed term is discarded, and the approximate target value $Q_{TA}$ is defined as

$$Q_{TA}(s_t, a_t, s_{t+1}, \hat{a}^{1:K}_{t+1}) = r(s_t, a_t, s_{t+1}) + \gamma \log \frac{1}{K} \sum_{j=1}^{K} e^{Q_{\perp(\phi)}(s_{t+1}, \hat{a}^j_{t+1})}.$$

The discount factor $\gamma$ is introduced to reduce variance and improve the convergence of soft-Q iteration (Bertsekas, 2019; Chan et al., 2021). For stability, we replace the bootstrap term $Q_{\perp(\phi)}$ with a $\phi$-averaging target network $Q_\psi$ (Lillicrap et al., 2016), and use prioritized experience replay (Schaul et al., 2016), a non-uniform sampling procedure. These modifications are standard in deep RL, and help improve stability and convergence of the trained critic (Hessel et al., 2018). We note that unlike Piché et al. (2019), we learn the soft Q-function for the (static) prior policy, dramatically simplifying the training process and allowing faster sampling at inference time.
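The target value and per-sample squared error above can be sketched without any deep learning framework. This is a scalar illustration only: `td_target`, `td_loss` and `polyak_update` are hypothetical helper names, the averaging rate `tau` is our own assumption about how a parameter-averaging target network is typically maintained, and a real implementation would compute the target with the target network and backpropagate through the critic prediction.

```python
import math


def log_mean_exp(values):
    """Stable evaluation of log((1/K) * sum_j exp(values[j])) for finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def td_target(reward, next_q_samples, gamma):
    """Approximate target: reward plus the discounted soft value of the next state."""
    return reward + gamma * log_mean_exp(next_q_samples)


def td_loss(q_pred, reward, next_q_samples, gamma):
    """Squared temporal-difference error for a single transition."""
    return (q_pred - td_target(reward, next_q_samples, gamma)) ** 2


def polyak_update(target_params, online_params, tau):
    """Move target parameters a fraction tau towards the online parameters."""
    return [(1.0 - tau) * p_t + tau * p_o
            for p_t, p_o in zip(target_params, online_params)]


# Hypothetical transition that ends in an infraction with penalty 1.
target = td_target(-1.0, [0.0, 0.0], gamma=0.9)
loss = td_loss(-0.5, -1.0, [0.0, 0.0], gamma=0.9)
new_target_params = polyak_update([0.0, 2.0], [1.0, 2.0], tau=0.1)
```

The values in `next_q_samples` stand for the target network evaluated at K actions sampled from the prior at the next state.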

3.3. PUTATIVE ACTION PARTICLES

Algorithm 1 Critic Sequential Monte Carlo
procedure CRITICSMC($p_0$, $f$, $\pi$, $r$, $Q$, $N$, $K$, $T$)
  Sample $s^{1:N}_1 \sim p_0(s_1)$
  Set $\bar{w}^{1:N}_0 \leftarrow \frac{1}{N}$
  for $t \in 1 \ldots T$ do
    for $n \in 1 \ldots N$ do
      for $k \in 1 \ldots K$ do
        Sample $\hat{a}^{n,k}_t \sim \pi(a_t \mid s^n_t)$
        Set $\hat{w}^{(n-1)K+k}_t \leftarrow \frac{1}{K} \bar{w}^n_{t-1} e^{Q(s^n_t, \hat{a}^{n,k}_t)}$
      end for
    end for
    Set $W_t \leftarrow \sum_{i=1}^{N K} \hat{w}^i_t$
    Sample $\alpha^{1:N}_t \sim \text{RESAMPLE}(\hat{w}^{1:N K}_t / W_t)$
    for $n \in 1 \ldots N$ do
      Set $i \leftarrow \lfloor (\alpha^n_t - 1) / K \rfloor + 1$
      Set $j \leftarrow ((\alpha^n_t - 1) \bmod K) + 1$
      Set $a^n_t \leftarrow \hat{a}^{i,j}_t$
      Set $s^n_{t+1} \leftarrow f(s^i_t, \hat{a}^{i,j}_t)$
      Set $\bar{w}^n_t \leftarrow \frac{1}{N} W_t \, e^{r(s^i_t, \hat{a}^{i,j}_t, s^n_{t+1}) - Q(s^i_t, \hat{a}^{i,j}_t)}$
    end for
  end for
  return $s^{1:N}_{1:T}$, $a^{1:N}_{1:T}$, $\bar{w}^{1:N}_{1:T}$
end procedure

Sampling actions given states is often computationally cheap when compared to generating states following transition dynamics. Even when a large model is used to define the prior policy, it is typically structured such that the bulk of the computation is spent processing the state information and then a relatively small probabilistic head can be used to sample many actions. To take advantage of this, we temporarily increase the particle population size $K$-fold when sampling actions and then reduce it by resampling before the new state is computed. This is enabled by the placement of heuristic factors between sampling the action and computing the next state, as highlighted in Figure 2c. Specifically, at each time step $t$, for each particle $i$ we sample $K$ actions $\hat{a}^{i,j}_t$, resulting in $N \cdot K$ putative action particles (Fearnhead, 2004). The critic is then applied as a heuristic factor to each putative particle, and a population of size $N$ is resampled from these weighted putative particles before the next state is computed. The full algorithm is given in Algorithm 1. For low-dimensional action spaces, it is possible to sample actions densely under the prior, eliminating the need for a separate proposal distribution. This is particularly beneficial in settings where the prior policy is only defined implicitly by a sampler and its log-density cannot be quickly evaluated everywhere.
In the autonomous driving context, the decision leading to certain actions can be complex, but the action space is only two- or three-dimensional. Using CriticSMC, a prior generating human-like actions can be provided as a sampler without the need for a density function. Lastly, CriticSMC can be used for model-free online control by sampling putative actions from the current state, applying the critic, and then selecting a single action through resampling. This can be regarded as a prior-aware approach to selecting actions, similar to the algorithms proposed by Abdolmaleki et al. (2018) and Song et al. (2019).
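The putative-particle step and its model-free special case can be sketched as follows. This is a minimal illustration under our own assumptions: the function names `critic_smc_step` and `select_action`, the hand-written critic, and the one-dimensional environment with a wall at position 3 are hypothetical, and the weights are exponentiated directly, which a practical implementation would likely stabilize by working in log space.

```python
import math
import random


def critic_smc_step(states, weights, sample_action, critic, f, r, k, rng):
    """One CriticSMC iteration with K putative actions per particle."""
    n = len(states)
    putative, w_hat = [], []
    for i, (s, w) in enumerate(zip(states, weights)):
        for _ in range(k):
            a = sample_action(s, rng)                 # putative action from the prior
            putative.append((i, a))
            w_hat.append(w / k * math.exp(critic(s, a)))
    total = sum(w_hat)                                # normalizing factor W_t
    chosen = rng.choices(range(n * k), weights=w_hat, k=n)
    new_states, actions, new_weights = [], [], []
    for idx in chosen:
        i, a = putative[idx]
        s_next = f(states[i], a)                      # simulator called only N times
        correction = r(states[i], a, s_next) - critic(states[i], a)
        new_states.append(s_next)
        actions.append(a)
        new_weights.append(total / n * math.exp(correction))
    return new_states, actions, new_weights


def select_action(state, sample_action, critic, k, rng):
    """Model-free control: resample one of K prior actions using the critic."""
    candidates = [sample_action(state, rng) for _ in range(k)]
    scores = [critic(state, a) for a in candidates]
    top = max(scores)
    return rng.choices(candidates, weights=[math.exp(q - top) for q in scores], k=1)[0]


# Hypothetical toy problem: wall at position 3, penalty 50, hand-written critic.
sim_calls = []


def step(s, a):
    sim_calls.append((s, a))
    return s + a


def toy_reward(s, a, s_next):
    return 0.0 if s_next < 3 else -50.0


def toy_critic(s, a):
    return 0.0 if s + a < 3 else -50.0


def prior(s, rng):
    return rng.choice([-1, 0, 1])


rng = random.Random(0)
new_states, actions, new_weights = critic_smc_step(
    [2] * 5, [0.2] * 5, prior, toy_critic, step, toy_reward, k=16, rng=rng)
chosen_action = select_action(2, prior, toy_critic, k=16, rng=rng)
```

In this sketch 80 putative actions are scored by the critic, but the transition function is evaluated only 5 times, once per surviving particle. `select_action` never calls the transition function at all.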

4. EXPERIMENTS

We demonstrate the effectiveness of CriticSMC for probabilistic planning, where multiple possible future rollouts are simulated from a given initial state, in two environments: a multi-agent point-mass toy environment and a high-dimensional driving simulator. In both environments infractions are defined as collisions with either other agents or the walls. Since the environment dynamics are known and deterministic, we do not learn a state transition model of the world and there is no need to re-plan actions in subsequent time steps. We also show that CriticSMC successfully avoids collisions in the driving environment when deployed in a model-free fashion, in which the proposed optimal actions are executed directly in the environment at every time step during the CriticSMC process. Finally, we show that both the use of putative particles and the use of the Soft-Q function instead of the standard Hard-Q function result in significant improvements in terms of reducing infractions and maintaining behavior close to the prior.

4.1. TOY ENVIRONMENT

In the toy environment, depicted in Figure 1 , the prior policy is a Gaussian random walk towards the goal position without any information about the position of the other agents and the barrier. All external agents are randomly placed and move adversarially and deterministically toward the ego agent. The ego agent commits an infraction if any of the following is true: 1) colliding with any of the other agents, 2) hitting a wall, 3) moving outside the perimeter of the environment. Details of this environment can be found in the Appendix. We compare CriticSMC using 50 particles and 1024 putative action particles on the planning task against several baselines, namely the prior policy, rejection sampling with 1000 maximum trials, and the SMC method of Piché et al. (2019) with 50 particles. We randomly select 500 episodes with different initial conditions and perform 6 independent rollouts for each episode. The prior policy has an infraction rate of 0.84, rejection sampling achieves 0.78 and SMC of Piché et al. (2019) yields an infraction rate of 0.14. CriticSMC reduces infraction rate to 0.02.

4.2. HUMAN-LIKE DRIVING BEHAVIOR MODELING

Human-like driving behavior models are increasingly used to build realistic simulation for training self-driving vehicles (Suo et al., 2021; Bergamini et al., 2021; Ścibior et al., 2021), but they tend to suffer from excessive numbers of infractions, in particular collisions. In this experiment we take an existing model of human driving behavior, ITRA (Ścibior et al., 2021), as the prior policy and attempt to avoid collisions while maintaining the human-likeness of predictions as much as possible. The environment features non-ego agents, for which we replay actions as recorded in the INTERACTION dataset (Zhan et al., 2019). The critic receives a stack of the last two ego-centric, ego-rotated birdview images (Figure 3) of size 256×256×3 as partial observations of the full state. This constitutes a challenging, image-based, high-dimensional continuous control environment, in contrast to Piché et al. (2019), who apply their algorithm to low-dimensional vector-state spaces in the MuJoCo simulator (Todorov et al., 2012; Brockman et al., 2016). The key performance metric in this experiment is the average collision rate, but we also report the average displacement error (ADE$_6$), using the minimum error across six samples for each prediction, which serves as a measure of human-likeness. Finally, the maximum final distance (MFD) metric is reported to measure the diversity of the predictions. The evaluation is performed using the validation split of the INTERACTION dataset, which neither ITRA nor the critic saw during training. We evaluate CriticSMC on a model-based planning task against the following baselines: the prior policy (ITRA), rejection sampling with 5 maximum trials, and the SMC incremental weight update rule proposed by Piché et al. (2019) using 5 particles. CriticSMC uses 5 particles and 128 putative particles; the computational cost of using the putative particles is negligible.
We perform a separate evaluation in each of four locations from the INTERACTION dataset, and for each example in the validation set we execute each method six times independently to compute the performance metrics. Table 1 shows that CriticSMC reduces the collision rate substantially more than any of the baselines and that it suffers a smaller degradation in predictive error (ADE$_6$) than the SMC of Piché et al. (2019). All methods are able to maintain diversity of sampled trajectories on par with the prior policy.

Figure 3: Collision avoidance arising from using CriticSMC for control of the red ego agent in a scenario from the INTERACTION dataset. There are three rows: the top shows the sequence of states leading to a collision arising from choosing actions from the prior policy, the middle row shows that control by CriticSMC's implicit policy avoids the collision, and the third row is a contour plot illustrating the relative values of the critic (brighter corresponds to higher expected reward) evaluated at the current state over the entire action space of acceleration (vertical axis) and steering (horizontal axis). The black dots are 128 actions sampled from the prior policy. The white dot indicates the selected action. Best viewed zoomed onscreen. For more examples see Figure 6 in the Appendix.

Next, we test CriticSMC as a model-free control method, not allowing it to interact with the environment until an action for a given time step is selected, which is equivalent to using a single particle in CriticSMC. Specifically, at each step we sample 128 putative action particles and resample one of them based on critic evaluation as a heuristic factor. We use Soft Actor-Critic (SAC) (Haarnoja et al., 2018a) as a model-free baseline, noting that other SMC variants are not applicable in this context, since they require inspecting the next state to perform resampling. We show in Table 2 that CriticSMC is able to reduce collisions without sacrificing realism and diversity in the predictions. Here SAC does notably worse in terms of both collision rate and ADE. This is unsurprising, as CriticSMC takes advantage of samples from the prior, which is already performant in both metrics, while SAC must be trained from scratch. This example highlights how CriticSMC utilizes prior information more efficiently than black-box RL algorithms like SAC.
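For reference, the ADE metric with a minimum over sampled predictions, as described in this section, can be computed along the following lines. This is a sketch under our own assumptions about the data layout (each trajectory is a list of 2-D positions of equal length); the function names are hypothetical and the evaluation code used in the experiments may differ.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between predicted and recorded positions."""
    distances = [math.dist(p, q) for p, q in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


def min_ade(samples, ground_truth):
    """Minimum ADE across several sampled predictions (six in the experiments)."""
    return min(average_displacement_error(s, ground_truth) for s in samples)


# Hypothetical example with two sampled predictions of a two-step trajectory.
truth = [(0.0, 0.0), (1.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0)],   # offset by one unit at every step
    [(0.0, 0.0), (1.0, 0.0)],   # matches the recording exactly
]
best_error = min_ade(samples, truth)
```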

Effect of Using Putative Action Particles

We evaluate the importance of putative action particles via an ablation study varying the number of particles and putative particles in CriticSMC and standard SMC. Table 3 contains results showing that increasing either the number of particles or the number of putative particles has a significant impact on performance. Putative particles are particularly important since a large number of them can typically be generated with a small computational overhead.

Comparison of Training the Critic With the Soft-Q and Hard-Q Objective

We compare fitted Q iteration (Watkins & Dayan, 1992), which uses the maximum over Q at the next stage to update the critic (i.e., $\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$), with the fitted soft-Q iteration used by CriticSMC (Eq. 4). The results, displayed in Table 4, show that the Hard-Q heuristic factor leads to a significant reduction in collision rate over the prior, but produces a significantly higher ADE$_6$ score. We attribute this to the risk-avoiding behavior induced by Hard-Q.

Execution Time Comparison

Figure 4 shows the average execution time it takes to predict 3 seconds into the future given 1 second of historical observations for the driving behavior modeling experiment. This shows that the run-time of all algorithms is of the same order, while the collision rate of CriticSMC is significantly lower, demonstrating the low overhead of using putative action particles.
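Returning to the Soft-Q versus Hard-Q comparison, the two bootstrap terms differ only in how the critic values of the next actions are aggregated. The sketch below contrasts them on hypothetical numbers; the function names and values are our own and are meant only to illustrate the two operators, not to reproduce the results in Table 4.

```python
import math


def hard_value(q_samples):
    """Hard backup: value of the best available next action."""
    return max(q_samples)


def soft_value(q_samples):
    """Soft backup: log of the average exponentiated value under the prior."""
    m = max(q_samples)
    return m + math.log(sum(math.exp(q - m) for q in q_samples) / len(q_samples))


# Hypothetical next state where only one in four prior actions avoids an
# infraction with penalty 50.
q_next = [0.0, -50.0, -50.0, -50.0]
hard = hard_value(q_next)
soft = soft_value(q_next)
```

On these numbers `hard` is 0 while `soft` is close to log(0.25). The hard backup only looks at the best action, whereas the soft backup also reflects how much prior probability falls on infraction-free actions.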

5. RELATED WORK

SMC methods (Gordon et al., 1993; Kitagawa, 1996; Liu & Chen, 1998) , also known as particle filters, are a well-established family of inference methods for generating samples from posterior distributions. Their basic formulations perform well on the filtering task, but poorly on smoothing (Godsill et al., 2004) due to particle degeneracy. These issues are usually addressed using backward simulation (Lindsten & Schön, 2013) or rejuvenation moves (Gilks & Berzuini, 2001; Andrieu et al., 2010) . These solutions improve sample diversity but are not sufficient in our context, where normal SMC often fails to find even a single infraction-free sample. Lazaric et al. (2007) used SMC for learning actor-critic agents with continuous action environments. Similarly to CriticSMC, Piché et al. (2019) propose using the value function as a backward message in SMC for planning. Their method is equivalent to what is obtained using the equations from Figure 2b with h t = V (s t+1 ) = log E at+1 [exp(Q(a t+1 , s t+1 ))] (see proof in Section A.2 of the Appendix). This formulation cannot accommodate putative action particles and learns a parametric policy alongside V (s t+1 ), instead of applying the soft Bellman update (Asadi & Littman, 2017; Chan et al., 2021) to a fixed prior. In our experiments we used the bootstrap proposal (Gordon et al., 1993) , which samples from the prior model, but in cases where the prior density can be efficiently computed, using a better proposal distribution can bring significant improvements. Such proposals can be obtained in a variety of et al., 2000) or neural networks minimizing the forward Kullback-Leibler divergence (Gu et al., 2015) . CriticSMC can accommodate proposal distributions, but even when the exact smoothing distribution is used as a proposal, backward messages are still needed to avoid the particle populations that focus on the filtering distribution. 
As we show in this work, CriticSMC can be used for planning as well as model-free online control. The policy it defines in the latter case is not parameterized explicitly, but rather obtained by combining the prior and the critic. This is similar to classical Q-learning (Watkins & Dayan, 1992), which obtains the implicit policy by taking the maximum over all actions of the Q function in discrete action spaces. This approach has been extended to continuous domains using ensembles (Deisenroth & Rasmussen, 2011; Ryu et al., 2020; Lee et al., 2021) and quantile networks (Bellemare et al., 2017). The model-free version of CriticSMC is also very similar to soft Q-learning as described by Haarnoja et al. (2017) and Abdolmaleki et al. (2018), and analyzed by Chan et al. (2021). Imitating human driving behavior has been successful in learning control policies for autonomous vehicles (Bojarski et al., 2016; Hawke et al., 2019) and in generating realistic simulations (Bergamini et al., 2021; Ścibior et al., 2021). In both cases, minimizing collisions continues to be one of the most important issues in autonomous vehicle research. Following a data-driven approach, Suo et al. (2021) proposed auxiliary losses for collision avoidance, while Igl et al. (2022) used adversarially trained discriminators to prune predictions that are likely to result in infractions. To the best of our knowledge, ours is the first work to apply a critic targeting the backward message in this context.

6. DISCUSSION

CriticSMC increases the efficiency of SMC for planning in scenarios with hard constraints, when the actions sampled must be adjusted long before the infraction takes place. It achieves this efficiency through the use of a learned critic which approximates the future likelihood using putative particles that densely sample the action space. The performance of CriticSMC relies heavily on the quality of the critic and in this work we display how to take advantage of recent advances in deep RL to obtain one. One avenue for future work is devising more efficient algorithms for learning the soft Q function such as proximal updates (Schulman et al., 2017) or the inclusion of regularization which guards against deterioration of performance late in training (Kumar et al., 2020) . The design of CriticSMC is motivated by the desire to accommodate implicit priors defined as samplers, such as the ITRA model ( Ścibior et al., 2021) we used in our self-driving experiments. For this reason, we avoided learning explicit policies to use as proposal distributions since maintaining similarity with the prior can be extremely complicated. Where the prior density can be computed, learned policies could be successfully accommodated. This is particularly important when the action space is high-dimensional and it is difficult to sample it densely using putative particles. In this work, we focused on environments with deterministic transition dynamics but CriticSMC could also be applied when dynamics are stochastic (i.e. s t+1 ∼ p(s t+1 |s t , a t )). In these settings, the planning as inference framework suffers from optimism bias (Levine, 2018; Chan et al., 2021) , even when exact posterior can be computed, which is usually mitigated by carefully constructing the variational family. For applications in real-world planning, CriticSMC relies on having a model of transition dynamics and the fidelity of that model is crucial for achieving good performance. 
Learning such models from observations is an active area of research (Ha & Schmidhuber, 2018; Chua et al., 2018; Nagabandi et al., 2020). Finally, we focused on avoiding infractions, but CriticSMC is applicable to planning with any reward surfaces and to sequential inference problems more generally.

[...] $e^{Q(s_{t+1}, a_{t+1})}$. (8)

If we assume the dynamics $p(s_{t+1}|s_t, a_t)$ of the environment to be deterministic, $s_{t+1} \leftarrow f(s_t, a_t)$, we can simplify Equation 6 to

$$Q(s_t, a_t) = r(s_t, a_t, s_{t+1}) + \log \mathbb{E}_{a_{t+1} \sim \pi(a_{t+1}|s_{t+1})}\left[e^{Q(s_{t+1}, a_{t+1})}\right]. \tag{9}$$

SMC using value function as heuristic factors. Piché et al. (2019) proposed using state values $V(s_t)$ as backward messages in SMC for planning. Based on the two-filter formula (Bresler, 1986; Kitagawa, 1994), they derive the following weight update rule

$$w_t = w_{t-1}\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}\left[\exp\left(r(s_t, a_t, s_{t+1}) + V(s_{t+1}) - \log \mathbb{E}_{s_t \sim p(s_t|s_{t-1}, a_{t-1})}\left[\exp(V(s_t))\right]\right)\right].$$

We omit the term $-\log \pi_\theta(a_t|s_t)$ since we assume that bootstrap proposals are used instead of learned ones. Thus, the current action is sampled as $a_t \sim \pi(a_t|s_t)$. In our framework, for simplicity, we assume deterministic state transition dynamics $p(s_{t+1}|s_t, a_t)$, which simplifies the update rule to

$$w_t = w_{t-1} \exp\left(r(s_t, a_t, s_{t+1}) + V(s_{t+1}) - V(s_t)\right).$$

In practice, Piché et al. (2019) trained an SAC-based policy and used the learned state-action value functions $Q(s_t, a_t)$ to approximate state values $V(s_t)$. Following a similar experimental setting, we use a soft approximation of the value function terms based on state-action value functions $Q(s_t, a_t)$, as described by Levine (2018):

$$V(s_t) = \log \mathbb{E}_{a_t \sim \pi(a_t|s_t)}\left[\exp(Q(s_t, a_t))\right].$$

This results in the following particle weight update rule

$$w_t = w_{t-1} \exp\left(r(s_t, a_t, s_{t+1}) + \log \mathbb{E}_{\hat{a}_{t+1} \sim \pi(a_{t+1}|s_{t+1})}\left[\exp(Q(s_{t+1}, \hat{a}_{t+1}))\right] - \log \mathbb{E}_{\hat{a}_t \sim \pi(a_t|s_t)}\left[\exp(Q(s_t, \hat{a}_t))\right]\right).$$

We can then define the heuristic factor used in Figure 2b of the main paper as

$$h_t = \log \mathbb{E}_{\hat{a}_{t+1} \sim \pi(a_{t+1}|s_{t+1})}\left[\exp(Q(s_{t+1}, \hat{a}_{t+1}))\right] - \log \mathbb{E}_{\hat{a}_t \sim \pi(a_t|s_t)}\left[\exp(Q(s_t, \hat{a}_t))\right],$$

which utilizes a soft approximation of the value function. It is worth emphasising that this factor requires a next-state sample $s_{t+1}$ from the environment model, which makes the use of putative action particles (see Section 3.3 of the main paper) inefficient and expensive, in contrast to the proposed CriticSMC method.
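The soft value approximation and the resulting weight update can be estimated from sampled actions. The following is a minimal sketch, assuming the expectations are replaced by Monte Carlo averages over Q values of actions drawn from the prior policy; the function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def soft_value(q_values):
    """Monte Carlo estimate of V(s) = log E_{a ~ pi}[exp(Q(s, a))] from sampled Q values."""
    q_values = np.asarray(q_values, dtype=float)
    q_max = np.max(q_values)
    # Log-mean-exp, shifted by the maximum for numerical stability.
    return q_max + np.log(np.mean(np.exp(q_values - q_max)))

def heuristic_factor(q_next, q_curr):
    """h_t = V(s_{t+1}) - V(s_t), with both soft values estimated from samples."""
    return soft_value(q_next) - soft_value(q_curr)

def weight_update(w_prev, reward, q_next, q_curr):
    """w_t = w_{t-1} * exp(r + h_t), the update under deterministic dynamics."""
    return w_prev * np.exp(reward + heuristic_factor(q_next, q_curr))

# Q values of two sampled actions at s_{t+1} and at s_t (made-up numbers).
w_new = weight_update(1.0, 0.0, q_next=[0.0, np.log(3.0)], q_curr=[0.0, 0.0])
```

Note that `q_next` must be evaluated at $s_{t+1}$, so each candidate action needs a call to the environment model before its heuristic factor is available, which is the cost the text refers to.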

A.3 TOY ENVIRONMENT EXPERIMENT DETAILS

In this environment, the ego agent is described by $e_t = (x^e_t, y^e_t, r^e)$, where $(x, y)$ is the position in the square coordinate system $[0, 1]^2$ and $r^e$ is the radius. We randomly position other agents $o^i_t = (x^{o_i}_t, y^{o_i}_t, r^{o_i})$, where $i \in [0, 5]$. In addition, there is a partial barrier in the middle with gates $g_k = (x^{g_k}, y^{g_k}, w^{g_k})$, where $x^{g_k}, y^{g_k}$ are the coordinates of the center of gate $k$, $w^{g_k}$ is the width of the opening, and $k \in [1, 3]$. Finally, a goal position $G = (x^G, y^G, r^G)$ is placed on the other side of the barrier. The ego and the other agents move by displacement actions $a_t = (\Delta x_t, \Delta y_t)$.

The state representation consists of relative distances between the ego agent and the other agents, the centers of the gates, and the goal position. A two-layer fully connected neural network with a ReLU activation function and a size of 64 takes this representation as input and produces a state encoding. A similar network takes the two-dimensional displacement actions as input and produces the action encoding. Finally, another two-layer network takes the concatenation of the state and action encodings as input and produces the Q values.

We train the model using a single Nvidia RTX 2080Ti GPU. The prioritized experience replay buffer has a size of 1 million stored experiences. The discount factor is set to 0.99, the batch size to 256, and the learning rate to 0.001. Finally, we sample 1024 actions when running CriticSMC while training the critic model.
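The critic architecture described above can be sketched as a plain NumPy forward pass. Several details are assumptions made only for illustration: the state dimension of 20 (two-dimensional offsets to six agents, three gates, and the goal), the random initialization, the encoding size of 64, and the placement of the ReLU between the two layers of each network.

```python
import numpy as np

HIDDEN = 64       # layer size stated in the text
STATE_DIM = 20    # assumption: 2D offsets to 6 agents, 3 gates and 1 goal
ACTION_DIM = 2    # displacement actions (dx, dy)

def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Parameters of a two-layer fully connected network (illustrative random init)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    """Two-layer network with a ReLU between the layers."""
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    return hidden @ params["W2"] + params["b2"]

def q_network(params, state_features, actions):
    """Encode state and action separately, concatenate, and map to one Q value each."""
    state_enc = mlp(params["state"], state_features)
    action_enc = mlp(params["action"], actions)
    joint = np.concatenate([state_enc, action_enc], axis=-1)
    return mlp(params["head"], joint)[..., 0]

rng = np.random.default_rng(0)
params = {
    "state": init_mlp(rng, STATE_DIM, HIDDEN, HIDDEN),
    "action": init_mlp(rng, ACTION_DIM, HIDDEN, HIDDEN),
    "head": init_mlp(rng, 2 * HIDDEN, HIDDEN, 1),
}
# Evaluate 1024 putative actions for a single state in one batched call.
state = np.broadcast_to(rng.uniform(-1.0, 1.0, STATE_DIM), (1024, STATE_DIM))
actions = rng.uniform(-0.1, 0.1, (1024, ACTION_DIM))
q_values = q_network(params, state, actions)
```

Because the state encoding does not depend on the action, it could be computed once per state and reused across all putative actions.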

A.4 DRIVING BEHAVIOR MODEL EXPERIMENT DETAILS

The prior model we picked for this experiment is ITRA (Ścibior et al., 2021), but any other probabilistic behavior model can be used. We follow the same architecture and training procedure as described in Ścibior et al. (2021). The prior model is trained on the INTERACTION dataset (Zhan et al., 2019), and the task is, given 10 timesteps of observed behavior, to predict the next 30 timesteps of future trajectories. For the critic, we used the same convolutional neural network architecture as the prior model. The critic takes as input the last two observed birdview images and encodes them separately. The concatenation of the two representations along with the action encoding is processed by a final layer that produces the Q value. The architecture for these layers is the same as in Section A.3. We train the critic model using a single Nvidia RTX 2080Ti GPU. The prioritized experience replay buffer has a size of 1.5 million stored experiences. The discount factor is set to 0.99, the batch size to 256, and the learning rate to 0.001. Finally, we sample 128 actions when running CriticSMC while training the critic model.

A.4.1 REINFORCEMENT LEARNING ENVIRONMENT

The environment used to train the RL agents takes as input a location from the INTERACTION dataset and trains a single-agent policy in which all non-ego actors roll out according to ground truth. Because the CriticSMC algorithm rolls out every agent according to ground truth for the first ten frames of each trajectory before prediction, we simply remove these frames and begin executing the policy on frame eleven. At time step $t$, the policy takes the previous and current birdview images $(b_{t-1}, b_t)$, where each image has size 256×256×1. The stacked birdview images make the total input to the policy and value function 2×256×256×1. The policy produces an action $a_t \in [-1, 1]^2$, which corresponds to the bicycle kinematic model's relative action space (see Ścibior et al. (2021) for more details). The differentiable simulator (Ścibior et al., 2021) then uses $a_t$ to update its state and returns the next birdview image $b_{t+1}$. In this setting, the learned policy distribution is a squashed normal distribution, as is standard for SAC implementations (Haarnoja et al., 2018b). The stochastic policy learned by SAC is tailored towards exploration and thus behaves poorly. For this reason we only report its deterministic behavior (i.e. the mode of the policy) in Table 2 of the main paper. For each of the four locations that were evaluated, the RL agents were run over three different learning rate schedules and three different reward structures for a minimum of 150k time steps. The policy uses the same convolutional neural network architecture as in CriticSMC and is updated according to the soft actor-critic algorithm in stable-baselines3 (Brockman et al., 2016). Table 5 shows the hyper-parameter settings used for training.

Reward Surfaces. In the three reward settings which we tested, a number of different feedback mechanisms were used to produce the desired behavior (i.e. low collision probability and low ADE).
The first was a score reward based upon an estimate of the log-probability under ITRA. To compute this "score reward", the environment passes the pair of birdview images $(b_t, b_{t+1})$ to ITRA, which generates the hypothetical action $a^{\mathrm{ITRA}}_t$ that ITRA would have taken to make the state transition from $b_t$ to $b_{t+1}$. Then, the environment sets the reward to be a monotonic function of the likelihood of $a_t$ under a normal distribution centred around $a^{\mathrm{ITRA}}_t$: $r_{t+1} \equiv \tanh\left(\log p(a_t; a^{\mathrm{ITRA}}_t, \Sigma)\right)$, where $p(\cdot\,; a^{\mathrm{ITRA}}_t, \Sigma) = \mathcal{N}(a^{\mathrm{ITRA}}_t, \Sigma)$ for some covariance $\Sigma$.

Next, we include five simpler reward surfaces which have been shown to improve performance in the literature (Reda et al., 2020). First, the "action reward" is a linear function of the absolute difference between the action output by the policy and the action which ITRA would have taken at time step $t$: $2 - \|a_t - a^{\mathrm{ITRA}}_t\|_1$. Second, the "action difference reward" is the scaled absolute difference between the current and previous actions: $\|a_t - a_{t-1}\|_1$. Third, the environment computes the "ground-truth reward" $r_{t+1}$ by evaluating $s_t$ against the ground truth data from the INTERACTION dataset. In particular, the environment sets the reward to be a linear function of the negative Euclidean distance at time $t + 1$ between the xy-coordinate of the ego-vehicle according to the simulator, $s_{t+1}$, and that according to ground truth, $s^{\mathrm{GT}}_{t+1}$: $100 - \|s_{t+1} - s^{\mathrm{GT}}_{t+1}\|_2$. Fourth, we include a "survival reward" of 1 if the agent does not commit an infraction at step $t$. Lastly, the "infraction reward" is -5 if the agent commits any type of infraction at step $t$ and 5 otherwise.

Latent-Features: 256. Number of neurons used in the output of the feature encoder, which is fed to the standard two-layer multi-layer perceptron defined by standard SAC algorithms (Haarnoja et al., 2018b).

Table 5: Hyper-parameters for the reinforcement learning baseline used in Section 4.2. All hyper-parameters not listed above use the default values provided by the SAC implementation of stable-baselines3 (Brockman et al., 2016).
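The individual reward terms described in this section can be written down directly. The sketch below follows the formulas in the text; the function names are illustrative, the scaling of the action difference term is not specified and is left at 1, and returning 0 for the survival reward when an infraction occurs is an assumption.

```python
import numpy as np

def score_reward(action, itra_action, cov):
    """tanh of the Gaussian log-density of `action` centred at `itra_action`."""
    diff = np.asarray(action, dtype=float) - np.asarray(itra_action, dtype=float)
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    log_prob = -0.5 * (diff.shape[0] * np.log(2.0 * np.pi) + logdet + mahalanobis)
    return np.tanh(log_prob)

def action_reward(action, itra_action):
    """2 - ||a_t - a_t^ITRA||_1."""
    return 2.0 - np.sum(np.abs(np.asarray(action) - np.asarray(itra_action)))

def action_difference_reward(action, prev_action):
    """||a_t - a_{t-1}||_1 (scale factor unspecified in the text, taken as 1)."""
    return np.sum(np.abs(np.asarray(action) - np.asarray(prev_action)))

def ground_truth_reward(xy, xy_ground_truth):
    """100 - ||s_{t+1} - s_{t+1}^GT||_2 on the ego xy-coordinates."""
    return 100.0 - np.linalg.norm(np.asarray(xy) - np.asarray(xy_ground_truth))

def survival_reward(infraction):
    """1 when no infraction occurs; 0 otherwise is an assumption."""
    return 0.0 if infraction else 1.0

def infraction_reward(infraction):
    """-5 on any infraction and 5 otherwise, as stated in the text."""
    return -5.0 if infraction else 5.0
```

For example, `ground_truth_reward([3.0, 4.0], [0.0, 0.0])` evaluates to 95, since the ego vehicle is 5 units from the ground-truth position.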



Figure 4: Execution time comparison between the baseline methods and CriticSMC for both model-predictive planning and model-free online control. The collision infraction rate is averaged across the 4 INTERACTION locations.

Figure 5: Given a well-defined linear Gaussian state-space model, we evaluate the performance of SMC and CriticSMC in estimating the (negative) log-marginal likelihood (bar plots in the left column) relative to the speed of inference measured in wall-clock time (bar plots in the right column). We pick the number of particles as $N \in \{1, 5, 10\}$ and the number of putative action particles as $K \in \{1, 100, 1000\}$. CriticSMC estimates the marginal likelihood better and significantly faster than SMC by taking advantage of a large population of putative action particles and a computationally efficient critic function used as a heuristic factor to guide inference.

Main loop of SMC without heuristic factors (left), with naive heuristic factors $h_t$ (middle)

Infraction rates for different inference methods performing model-predictive planning tested on four locations from the INTERACTION dataset (Zhan et al., 2019).

Infraction rates for performing model-free online control against the prior and SAC policies tested on four locations from the INTERACTION dataset (Zhan et al., 2019).

Infraction rates for SMC and CriticSMC with a varying number of particles and putative particles, tested on 500 random initial states using the proposed toy environment.



Infraction rates for the Hard-Q and the Soft-Q objectives tested on the location DR_DEU_Merging_MT from the INTERACTION dataset (Zhan et al., 2019).

and

$$\log p(O_{t+1:T}|s_t, a_t) = \log \int_{s_{t+1}} \int_{a_{t+1}} p(s_{t+1}|s_t, a_t)\, \pi(a_{t+1}|s_{t+1})\, p(O_{t+1:T}|s_{t+1}, a_{t+1})\, da_{t+1}\, ds_{t+1}$$

ACKNOWLEDGMENTS

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, and the Intel Parallel Computing Centers program. Additional support was provided by UBC's Composites Research Network (CRN), and Data Science Institute (DSI). This research was enabled in part by technical support and computational resources provided by WestGrid (www.westgrid.ca), Compute Canada (www.computecanada.ca), and Advanced Research Computing at the University of British Columbia (arc.ubc.ca).

A APPENDIX

A.1 SEQUENTIAL MONTE CARLO

Here we briefly give an overview of the sequential Monte Carlo (SMC) algorithm adapted for Markov decision processes (MDPs). We borrow the notation from Section 2 in the main paper. Obtaining state-action pairs $(s_{1:T}, a_{1:T})$ that maximize the expected sum of rewards corresponds to sampling state-action pairs from the posterior $p(s_{1:T}, a_{1:T}|O_{1:T})$. The SMC inference algorithm (Gordon et al., 1993) approximates the filtering distributions $p(s_t, a_t|O_{1:t})$. In general, SMC assumes the existence of a proposal distribution $q(s_{t+1}, a_{t+1}|s_t, a_t)$, but for simplicity we instead use bootstrap proposals that use the prior policy. The algorithm samples $N$ independent particles from the initial distribution $s^n_1 \sim p_0(s_1)$, where each particle has uniform weights [...] in the corresponding particle weight, saving the sum of weights and normalizing them before proceeding to the next time step. In our setting, we assume the state dynamics $p(s_{t+1}|s_t, a_t)$ of the environment to be deterministic, $s_{t+1} \leftarrow f(s_t, a_t)$. SMC suffers from weight disparity, which can lead to a reduced effective sample size of particles. This is mitigated by introducing a resampling step RESAMPLE($w^{1:N}_t$) at every iteration, which helps SMC retain promising particles with high weights, whereas particles with low weights will most likely be discarded. See Douc & Cappe (2005) for an extensive overview of different resampling schemes. Algorithm 2 summarizes the SMC process for MDPs using bootstrap proposals.

Using these five feedback mechanisms, we consider three different reward surfaces, each of which is defined by the following reward calculation: [...]

In the first reward setting which was considered, we set all coefficients $\alpha_i$ to zero except the SURVIVE reward, and thus refer to this reward type as the survival reward setting. Next, we consider a setting where we set all $\alpha_i$ to zero except the GROUND TRUTH reward, and refer to this setting as the ground-truth setting. Lastly, we considered a setting where $\alpha_1 = 0.15$, $\alpha_2 = 2.0$, $\alpha_3 = 0.05$, and the remaining $\alpha_i$ are all set to zero. We refer to this setting as the ITRA setting, as it includes the most information about the ITRA model. To arrive at the final result, models were trained under all three of these settings, evaluated, and then chosen based upon the lowest collision infraction rate.
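The bootstrap SMC procedure outlined in Section A.1 can be sketched as follows. This is a simplified illustration and not a reproduction of Algorithm 2: it assumes deterministic dynamics, multinomial resampling at every step, and a user-supplied `log_potential` that plays the role of the per-step log-likelihood of optimality. All function names and the toy one-dimensional problem are hypothetical.

```python
import numpy as np

def bootstrap_smc(init_sampler, prior_policy, dynamics, log_potential,
                  horizon, num_particles, rng):
    """Bootstrap SMC with resampling at every step; returns particles and a log-marginal estimate."""
    states = init_sampler(num_particles, rng)
    log_marginal = 0.0
    for _ in range(horizon):
        actions = prior_policy(states, rng)            # bootstrap proposal from the prior
        next_states = dynamics(states, actions)        # deterministic f(s_t, a_t)
        log_w = log_potential(states, actions, next_states)
        max_log_w = np.max(log_w)
        if not np.isfinite(max_log_w):                 # every particle violated a constraint
            return next_states, -np.inf
        w = np.exp(log_w - max_log_w)
        # Particles are equally weighted after resampling, so the mean
        # incremental weight estimates the conditional likelihood of this step.
        log_marginal += max_log_w + np.log(np.mean(w))
        index = rng.choice(num_particles, size=num_particles, p=w / np.sum(w))
        states = next_states[index]                    # multinomial resampling
    return states, log_marginal

# Hypothetical toy problem: 1D random walk that must stay within |s| <= 1.
def toy_init(num_particles, rng):
    return np.zeros((num_particles, 1))

def toy_policy(states, rng):
    return rng.choice([-1.0, 1.0], size=states.shape)

def toy_dynamics(states, actions):
    return states + actions

def toy_log_potential(states, actions, next_states):
    return np.where(np.abs(next_states[:, 0]) <= 1.0, 0.0, -np.inf)

rng = np.random.default_rng(0)
final_states, log_ml = bootstrap_smc(toy_init, toy_policy, toy_dynamics,
                                     toy_log_potential, horizon=4,
                                     num_particles=100, rng=rng)
```

In the toy problem, particles that leave the allowed interval receive zero weight and are discarded at the next resampling step, so all surviving particles satisfy the constraint.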

A.5 CRITICSMC AS AN EFFICIENT SMC INFERENCE ALGORITHM

We include in the supplementary material a demo code implementation of CriticSMC applied to the following linear Gaussian state-space model (LGSSM) with a well-defined critic function [...] where the state transition function $f(s_t, a_t)$ is assumed to be computationally expensive. The conditional posterior samples from $p(s_{1:T}, a_{1:T}|O_{1:T})$ are defined as states that are within the range defined in Equation 20. We use $T = 10$ in our experiments. Figure 5 demonstrates the performance of CriticSMC compared to SMC for estimating the (negative) log-marginal likelihood $p(O_{1:T})$ relative to the computational time needed to execute the inference algorithm.

