CRITIC SEQUENTIAL MONTE CARLO

Abstract

We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned soft Q-function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.

1. INTRODUCTION

Sequential Monte Carlo (SMC) (Gordon et al., 1993) is a popular, highly customizable inference algorithm that is well suited to posterior inference in state-space models (Arulampalam et al., 2002; Andrieu et al., 2004; Cappe et al., 2007). SMC is a form of importance sampling that breaks a high-dimensional sampling problem into a sequence of low-dimensional ones, made tractable through repeated resampling. In practice, SMC requires informative observations at each time step to be efficient when a finite number of particles is used. When observations are sparse, SMC loses its typical advantages and must be augmented with particle smoothing and backward messages to retain good performance (Kitagawa, 1994; Moral et al., 2009; Douc et al., 2011). SMC can be applied to planning problems through the planning-as-inference framework (Ziebart et al., 2010; Neumann, 2011; Rawlik et al., 2012; Kappen et al., 2012; Levine, 2018; Abdolmaleki et al., 2018; Lavington et al., 2021). In this paper we are interested in solving planning problems with sparse, hard constraints, such as avoiding collisions while driving. In this setting, the constraint is not violated until the collision occurs, but braking must begin well in advance to avoid it. Figure 1 demonstrates on a toy example how SMC requires an excessive number of particles to solve such problems. In the language of optimal control (OC) and reinforcement learning (RL), collision avoidance is a sparse-reward problem. In this setting, parametric estimators of future rewards (Nair et al., 2018; Riedmiller et al., 2018) are learned in order to alleviate the credit assignment problem (Sutton & Barto, 2018; Dulac-Arnold et al., 2021) and facilitate efficient learning.
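To make the failure mode concrete, the following is a minimal sketch of a generic bootstrap SMC loop, not the paper's algorithm: particles are propagated through the prior and multinomially resampled by per-step observation weights. When `log_obs` is uninformative at most time steps (the sparse-constraint setting), intermediate resampling is uniform and provides no guidance, which is why many particles are needed. All function names here are illustrative placeholders.

```python
import numpy as np

def bootstrap_smc(prior_sample, transition, log_obs, T, n_particles, rng):
    """Minimal bootstrap SMC: propagate particles through the prior
    and resample according to per-step observation log-weights."""
    particles = np.array([prior_sample(rng) for _ in range(n_particles)])
    for t in range(T):
        # Propose from the prior transition (the "bootstrap" proposal).
        particles = np.array([transition(p, rng) for p in particles])
        # Weight by the observation density at this step; with sparse
        # observations this is constant at most steps and resampling is
        # effectively uniform, giving no guidance toward the target.
        logw = np.array([log_obs(t, p) for p in particles])
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return particles
```

With a single informative observation at the final step, the posterior mass is only recovered at the very end, after most particle diversity has been spent blindly.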
In this paper we propose a novel formulation of SMC, called CriticSMC, where a learned critic, inspired by Q-functions in RL (Sutton & Barto, 2018), is used as a heuristic factor (Stuhlmüller et al., 2015) in SMC to ameliorate the problem of sparse observations. We borrow from recent advances in deep RL (Haarnoja et al., 2018a; Hessel et al., 2018) to learn a critic which approximates future likelihoods in a parametric form. While similar ideas have been proposed in the past (Rawlik et al., 2012; Piché et al., 2019), in this paper we instead suggest (1) using soft Q-functions (Rawlik et al., 2012; Chan et al., 2021; Lavington et al., 2021) as heuristic factors, and (2) choosing the placement of such factors to allow for efficient exploration of action space through the use of putative particles (Fearnhead, 2004). Additionally, we design CriticSMC to be compatible with informative prior distributions whose log-density function may be unknown. In planning contexts, such priors can specify additional requirements that may be difficult to define via rewards, such as maintaining human-like driving behavior. We show experimentally that CriticSMC is able to refine the policy of a foundation (Bommasani et al., 2021) autonomous-driving behavior model to take actions that produce significantly fewer collisions while retaining key behavioral distribution characteristics of the foundation model.
This is important not only for the eventual goal of learning complete autonomous driving policies (Jain et al., 2021; Hawke et al., 2021), but also immediately for constructing realistic infraction-free simulations to be employed by autonomous vehicle controllers (Suo et al., 2021; Bergamini et al., 2021; Ścibior et al., 2021; Lioutas et al., 2022) for training and testing. Planning, either in simulation or the real world, requires a model of the world (Ha & Schmidhuber, 2018). While CriticSMC can act as a planner in this context, we show that it can just as easily be used for model-free online control without a world model. This is done by densely sampling putative action particles and using the critic to select amongst these sampled actions. We also provide ablation studies which demonstrate that the two key components of CriticSMC, namely soft Q-function heuristic factors and putative action particles, significantly improve performance over relevant baselines with similar computational resources.
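The model-free control mode described above can be sketched as follows, under stated assumptions: `prior_policy_sample` stands in for the learned behavioral prior (sampling only, no density needed) and `soft_q` for the learned critic; both names are hypothetical, and the exact reweighting scheme in the paper may differ. The idea is simply to draw many putative actions from the prior and select one with probability proportional to the exponentiated critic value.

```python
import numpy as np

def act_with_critic(state, prior_policy_sample, soft_q, n_putative, rng):
    """Sample putative actions from a behavioral prior and select one
    by reweighting with the exponentiated soft Q-value (critic)."""
    actions = np.array([prior_policy_sample(state, rng)
                        for _ in range(n_putative)])
    # Critic scores act like heuristic log-weights over putative actions.
    logq = np.array([soft_q(state, a) for a in actions])
    w = np.exp(logq - logq.max())
    w /= w.sum()
    # Resample a single action; high-critic actions are favored while
    # support stays within what the prior proposes.
    return actions[rng.choice(n_putative, p=w)]
```

Because only samples from the prior are required, this is compatible with priors whose density function is unknown, matching the setting described above.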

2. PRELIMINARIES

Since we are primarily concerned with planning problems, we work within the framework of Markov decision processes (MDPs). An MDP $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, f, P_0, R, \Pi\}$ is defined by a set of states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, a reward function $r(s, a, s') \in R$, a deterministic transition dynamics function $f(s, a)$, an initial state distribution $p_0(s) \in P_0$, and a policy distribution $\pi(a \mid s) \in \Pi$. Trajectories are generated by first sampling from the initial state distribution $s_1 \sim p_0$, then sequentially sampling from the policy $a_t \sim \pi(a_t \mid s_t)$ and the transition dynamics $s_{t+1} \leftarrow f(s_t, a_t)$ for $T-1$ time steps. Execution of this stochastic process produces a trajectory $\tau = \{(s_1, a_1), \ldots, (s_T, a_T)\} \sim p_\pi$, which is then scored using the reward function $r$. The goal in RL and OC is to produce a policy $\pi^* = \arg\max_\pi \mathbb{E}_{p_\pi}\!\left[\sum_{t=1}^{T} r(s_t, a_t, s_{t+1})\right]$. We now relate this stochastic process to inference.
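The generative process above can be written directly as code; this is a generic rollout sketch with illustrative placeholder names (`p0_sample`, `policy_sample`), assuming deterministic dynamics $f$ as in the MDP definition.

```python
import numpy as np

def rollout(p0_sample, policy_sample, f, reward, T, rng):
    """Sample one trajectory tau = {(s_1,a_1),...,(s_T,a_T)} from the MDP
    and accumulate its total reward. Dynamics f are deterministic; the
    initial state and the policy are stochastic."""
    s = p0_sample(rng)                 # s_1 ~ p_0
    traj, total = [], 0.0
    for _ in range(T):
        a = policy_sample(s, rng)      # a_t ~ pi(a_t | s_t)
        s_next = f(s, a)               # s_{t+1} <- f(s_t, a_t)
        traj.append((s, a))
        total += reward(s, a, s_next)  # r(s_t, a_t, s_{t+1})
        s = s_next
    return traj, total
```

Optimizing the expectation of `total` over policies is exactly the RL/OC objective stated above; the next section recasts it as posterior inference.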

2.1. REINFORCEMENT LEARNING AS INFERENCE

RL-as-inference (RLAI) considers the relationship between RL and approximate posterior inference to produce a class of divergence-minimization algorithms able to estimate the optimal RL policy. The posterior we target is defined by a set of observed random variables $\mathcal{O}_{1:T}$ and latent random variables $\tau_{1:T}$. Here, $\mathcal{O}$ denotes "optimality" random variables, which are Bernoulli distributed with probability proportional to exponentiated reward values (Ziebart et al., 2010; Neumann, 2011; Levine, 2018). They determine whether an individual tuple $\tau_t = \{s_t, a_t, s_{t+1}\}$ is optimal ($\mathcal{O}_t = 1$) or sub-optimal ($\mathcal{O}_t = 0$).
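In the standard form used by the RLAI literature cited above, the optimality likelihood and the induced posterior over trajectories are

```latex
p(\mathcal{O}_t = 1 \mid \tau_t) \;\propto\; \exp\!\big(r(s_t, a_t, s_{t+1})\big),
\qquad
p(\tau_{1:T} \mid \mathcal{O}_{1:T} = 1) \;\propto\; p_\pi(\tau_{1:T}) \prod_{t=1}^{T} \exp\!\big(r(s_t, a_t, s_{t+1})\big),
```

so that conditioning every $\mathcal{O}_t$ on $1$ tilts the prior trajectory distribution $p_\pi$ towards high-reward trajectories, which is the target distribution the sampler must approximate.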



Figure 1: Illustration of the difference between CriticSMC and SMC in a toy environment in which a green ego agent is trying to reach the red goal without being hit by any of the three chasing adversaries. All plots show overlaid samples of environment trajectories conditioned on the ego agent achieving its goal. While SMC will asymptotically explore the whole space of environment trajectories, CriticSMC's method of using the critic as a heuristic within SMC encourages computationally efficient discovery of diverse high reward trajectories. SMC with a small number of particles fails here because the reward is sparse and the ego agent's prior behavioral policy assigns low probability to trajectories that avoid the barrier and the other agents.

