IMITATING HUMAN BEHAVIOUR WITH DIFFUSION MODELS

Abstract

Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.

1. INTRODUCTION

To enable Human-AI collaboration, agents must learn to best respond to all plausible human behaviours (Dafoe et al., 2020; Mirsky et al., 2022). In simple environments, it suffices to generate all possible human behaviours (Strouse et al., 2021), but as the complexity of the environment grows this approach struggles to scale. If we instead assume access to human behavioural data, collaborative agents can be improved by training with models of human behaviour (Carroll et al., 2019). In principle, human behaviour can be modelled via imitation learning approaches, in which an agent is trained to mimic the actions of a demonstrator from an offline dataset of observation and action tuples. More specifically, Behaviour Cloning (BC), despite being theoretically limited (Ross et al., 2011), has been empirically effective in domains such as autonomous driving (Pomerleau, 1991), robotics (Florence et al., 2022) and game playing (Ye et al., 2020; Pearce and Zhu, 2022). Popular approaches to BC restrict the types of distributions that can be modelled to make learning simpler. A common approach for continuous actions is to learn a point estimate, optimised via Mean Squared Error (MSE), which can be interpreted as an isotropic Gaussian of negligible variance. Another popular approach is to discretise the action space into a finite number of bins and frame learning as a classification problem. Both suffer due to the approximations they make (illustrated in Figure 1), either encouraging the agent to learn an 'average' policy or predicting action dimensions independently, resulting in 'uncoordinated' behaviour (Ke et al., 2020). Such simplistic modelling choices can be successful when the demonstrating policy is itself of restricted expressiveness (e.g. when using trajectories from a single pre-trained policy represented by a simple model).
However, for applications requiring cloning of human behaviour, which contains diverse trajectories and multimodality at decision points, simple models may not be expressive enough to capture the full range and fidelity of behaviours (Orsini et al., 2021). For these reasons, we seek to model the full distribution of observed actions. In particular, this paper focuses on diffusion models, which currently lead in image, video and audio generation (Saharia et al., 2022; Harvey et al., 2022; Kong et al., 2020), and avoid the training instability of generative adversarial networks (Srivastava et al., 2017) and the sampling issues of energy-based models (Florence et al., 2022). By using diffusion models for BC we are able to: 1) more accurately model complex action distributions (as illustrated in Figure 1); 2) significantly outperform state-of-the-art methods (Shafiullah et al., 2022) on a simulated robotic benchmark; and 3) scale to modelling human gameplay in Counter-Strike: Global Offensive, a modern, 3D gaming environment recently proposed as a platform for imitation learning research (Pearce and Zhu, 2022). To achieve this performance, we contribute several innovations to adapt diffusion models to sequential environments. Section 3.2 shows that good architecture design can significantly improve performance. Section 3.3 then shows that Classifier-Free Guidance (CFG), which is a core part of text-to-image models, surprisingly harms performance in observation-to-action models. Finally, Section 3.4 introduces novel, reliable sampling schemes for diffusion models. The appendices include related work, experimental details, as well as further results and explanations.

2. MODELLING CHOICES FOR BEHAVIOURAL CLONING

In this section we examine common modelling choices for BC. For illustration purposes, we created a simple environment to highlight their limitations. We simulated an arcade toy claw machine, as shown in Figures 1, 3 and 4. An agent observes a top-down image of toys (o) and chooses a point in the image, in a 2D continuous action space, a ∈ R^2. If the chosen point is inside the boundaries of a valid toy, the agent successfully obtains the toy. To build a dataset of demonstrations, we synthetically generate images containing one or more toys, and uniformly at random pick a single demonstration action a that successfully grabs a toy (note this toy environment uses synthetic rather than human data). The resulting dataset is used to learn p(a|o). To make training quicker, we restrict the number of unique observations o to seven, though this could be generalised.
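The data-generation procedure above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: observations are represented abstractly as lists of axis-aligned toy boxes rather than rendered images, and all helper names and box parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_observation(n_toys, rng):
    # Hypothetical simplification: each 'toy' is an axis-aligned box
    # (x, y, w, h) inside the unit square, standing in for a rendered image.
    toys = []
    for _ in range(n_toys):
        w, h = rng.uniform(0.1, 0.3, size=2)
        x, y = rng.uniform(0, 1 - w), rng.uniform(0, 1 - h)
        toys.append((x, y, w, h))
    return toys

def demo_action(toys, rng):
    # Uniformly at random pick one toy, then a point inside its
    # boundary: a successful demonstration action a in R^2.
    x, y, w, h = toys[rng.integers(len(toys))]
    return np.array([rng.uniform(x, x + w), rng.uniform(y, y + h)])

# Seven unique observations, many demonstrated actions per observation,
# giving (o, a) tuples from which p(a|o) is learnt.
observations = [make_observation(rng.integers(1, 4), rng) for _ in range(7)]
dataset = [(o, demo_action(o, rng)) for o in observations for _ in range(100)]
```

Because every action is sampled uniformly within some toy, p(a|o) is multimodal whenever an observation contains more than one toy, which is exactly the structure the modelling choices below struggle with.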

MSE.

A popular choice for BC in continuous action spaces approximates p(a|o) by a point-estimate that is optimised via MSE. This makes a surprisingly strong baseline in the literature despite its simplicity. However, MSE suffers from two limitations that harm its applicability to our goal of modelling the full, complex distributions of human behaviour. 1) MSE outputs a point-estimate. This precludes it from capturing any variance or multimodality present in p(a|o). 2) Due to its optimisation objective, MSE learns the 'average' of the distribution. This can bias the estimate towards more frequently occurring actions, or can even lead to out-of-distribution actions (e.g. picking the action between two modes). The first can be partially mitigated by instead assuming a Gaussian distribution, predicting a variance for each action dimension and sampling from the resulting Gaussian. However, due to the MSE objective, the learnt mean is still the average of the observed action distribution. These limitations are visualised in Figure 1.
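The mode-averaging failure can be verified numerically: with demonstrations split evenly between two action modes, gradient descent on the MSE objective converges to the midpoint between them, an action no demonstrator ever took. The modes at ±1 below are illustrative, not taken from the paper's environment.

```python
import numpy as np

# Bimodal 1D demonstration actions: two equally likely modes at -1 and +1
# (e.g. two toys on opposite sides of the claw machine).
actions = np.array([-1.0] * 500 + [1.0] * 500)

# Fit a constant point-estimate by gradient descent on the MSE objective.
pred = 0.3  # arbitrary initialisation
for _ in range(2000):
    grad = 2 * np.mean(pred - actions)  # d/dpred of mean (pred - a)^2
    pred -= 0.1 * grad

print(pred)  # ≈ 0.0: the midpoint between the modes, out of distribution
```

The MSE minimiser is the empirical mean, so no amount of training capacity fixes this: the predicted action sits a full unit away from the nearest demonstrated action.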

Discretised.

A second popular choice is to discretise each continuous action dimension into B bins, and frame learning as a classification task. This has two major limitations. 1) Quantisation errors arise since the model outputs a single value for each bin. 2) Since each action dimension is treated independently, the marginal rather than the joint distribution is learnt. This can lead to issues during sampling whereby dependencies between dimensions are ignored, leading to 'uncoordinated' behaviour. This can be observed in Figure 1, where points outside of the true distribution have been sampled in the bottom-right corner. This can be remedied by modelling action dimensions autoregressively, but such models bring their own challenges and drawbacks (Lin et al., 2021).

K-Means.

Another method, which accounts for dependencies between action dimensions, first clusters the actions across the dataset into K bins (rather than B^|a|) using K-Means. This discretises the joint-action distribution, rather than the marginal as in 'Discretised'. Each action is then associated with its nearest cluster, and learning can again be framed as a classification task. This approach



Figure 1: Expressiveness of a variety of models for behaviour cloning in a single-step, arcade claw game with two simultaneous, continuous actions. Existing methods fail to model the full action distribution, p(a|o), whilst diffusion models excel at covering multimodal & complex distributions.
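The 'uncoordinated' sampling failure of the discretised model described above can be made concrete with a small simulation, sketched here under illustrative assumptions (perfectly correlated demonstration dimensions; sampling each coordinate from its empirical marginal stands in for sampling from independent per-dimension classifiers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Demonstrations lie on two diagonal modes: a = (t, t) or a = (-t, -t),
# so the two action dimensions are perfectly correlated.
sign = rng.choice([-1.0, 1.0], size=1000)
t = rng.uniform(0.5, 1.0, size=1000)
demos = np.stack([sign * t, sign * t], axis=1)

# A per-dimension discretised model learns only the marginals, so
# sampling draws each coordinate independently, ignoring the correlation.
a0 = rng.choice(demos[:, 0], size=1000)
a1 = rng.choice(demos[:, 1], size=1000)
samples = np.stack([a0, a1], axis=1)

# Roughly half the samples mix signs, e.g. (+0.8, -0.7): joint actions
# that never appear anywhere in the demonstrations.
mixed = np.mean(np.sign(samples[:, 0]) != np.sign(samples[:, 1]))
print(mixed)  # ≈ 0.5
```

This is exactly the behaviour visible in Figure 1, where the discretised model places samples in corners of the action space that the true distribution never occupies.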

