NEURAL SDES MADE EASY: SDES ARE INFINITE-DIMENSIONAL GANS

Abstract

Several authors have introduced Neural Stochastic Differential Equations (Neural SDEs), often involving complex theory with various limitations. Here, we aim to introduce a generic, user-friendly approach to neural SDEs. Our central contribution is the observation that an SDE is a map from Wiener measure (Brownian motion) to a solution distribution, which may be sampled from, but which does not admit a straightforward notion of probability density, and that this is just the familiar formulation of a GAN. This produces a continuous-time generative model in which arbitrary drift and diffusions are admissible, and in the infinite data limit any SDE may be learnt. Next, we construct a new scheme for sampling and reconstructing Brownian motion, with constant average-case time and memory costs, adapted to the access patterns of an SDE solver. Finally, we demonstrate that the adjoint SDE (used for backpropagation) may be constructed via rough path theory, without the previous theoretical complexity of two-sided filtrations.

1. INTRODUCTION

Neural differential equations are an elegant concept, bringing together the two dominant modelling paradigms of neural networks and differential equations. Indeed, since their introduction, Neural Ordinary Differential Equations (Chen et al., 2018) have prompted the creation of a wide variety of similarly-inspired models, for example based around controlled differential equations (Kidger et al., 2020b; Morrill et al., 2020), Lagrangians (Cranmer et al., 2020), higher-order ODEs (Massaroli et al., 2020; Norcliffe et al., 2020), and equilibrium points (Bai et al., 2019).

Tzen & Raginsky (2019a;b) obtain Neural SDEs as a continuous limit of deep latent Gaussian models. They train by optimising a variational bound, using forward-mode autodifferentiation. They consider only theoretical applications, for modelling distributions as the terminal value of an SDE.

Li et al. (2020) give arguably the closest analogue to the neural ODEs of Chen et al. (2018). They introduce neural SDEs via a subtle argument involving two-sided filtrations and backward Stratonovich integrals, but in doing so are able to introduce a backward-in-time adjoint equation, using only efficient-to-compute vector-Jacobian products. In applications, they use neural SDEs in a latent variable modelling framework, using the stochasticity to model Bayesian uncertainty.

Hodgkinson et al. (2020) introduce Neural SDEs via an elegant theoretical argument, as a limit of random ODEs. The limit is made meaningful via rough path theory. In applications, they use the limiting random ODEs, and treat stochasticity as a regulariser within a normalising flow. However, they remark that in this setting the optimal diffusion is zero. This is a recurring problem: Innes et al. (2019) also train neural SDEs for which the optimal diffusion is zero.

Our approach is also similar to Briol et al. (2020); Gierjatowicz et al. (2020), in that we use stochasticity to model distributions on path space. The resulting neural SDE is not an improvement to a similar neural ODE, but a standalone concept in its own right.

1.2. CONTRIBUTIONS

Our central contribution is the observation that the mathematical formulation of SDEs is directly comparable to the machine learning formulation of GANs. Using this connection, we show how it becomes straightforward to train neural SDEs as generative time series models. Arbitrary drift and diffusions are admissible, and in the infinite data limit any SDE may be learnt.

Next, we introduce a new way of sampling Brownian motion, adapted to the query patterns typical of SDE solvers. The scheme produces exact samples using O(1) memory and average-case O(1) time. In particular, it can reconstruct its past trajectory, which is necessary for the use of adjoint equations. The scheme operates by combining splittable Pseudo-Random Number Generators (PRNGs), a binary tree of dependent intervals, and a Least Recently Used (LRU) cache of recent queries.

Finally, we demonstrate that the theoretical construction of adjoint SDEs (which may be used to backpropagate through an SDE) may be simplified by using the pathwise formulation of rough path theory. In particular this avoids the previous theoretical complexity of two-sided filtrations.

To facilitate the use of these techniques, we have implemented them as part of an open-source PyTorch-compatible general-purpose SDE library, [redacted]. This may be found at https://github.com/[redacted].

Noise: initial $V \sim \mathcal{N}(0, I_v)$; hidden state $W_t$ = Brownian motion.
Generator: initial $X_0 = \zeta_\theta(V)$; hidden state $\mathrm{d}X_t = \mu_\theta(t, X_t)\,\mathrm{d}t + \sigma_\theta(t, X_t) \circ \mathrm{d}W_t$; output $Y_t = \ell_\theta(X_t)$.
Discriminator: initial $H_0 = \xi_\phi(Y_0)$; hidden state $\mathrm{d}H_t = f_\phi(t, H_t)\,\mathrm{d}t + g_\phi(t, H_t) \circ \mathrm{d}Y_t$; output $D = m_\phi(H_T)$.

Figure 1: Summary of equations.
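The ingredients named for the Brownian sampling scheme (seed splitting, a binary tree of intervals, and an LRU cache) can be illustrated with a deliberately simplified sketch. It is not the scheme of this paper: it resolves queries only up to a tolerance `tol` by repeated bisection and linear interpolation, so it is neither exact nor O(1) on average. The class name `BrownianTree`, the helpers `split_seed` and `normal_from_seed`, and the use of SHA-256 hashing as a stand-in for a splittable PRNG are all assumptions made for illustration.

```python
import hashlib
import math
import random
from functools import lru_cache


def split_seed(seed: int, branch: int) -> int:
    """Derive a child seed deterministically (stand-in for a splittable PRNG)."""
    digest = hashlib.sha256(f"{seed}:{branch}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def normal_from_seed(seed: int) -> float:
    """Draw one standard normal that depends only on `seed`."""
    return random.Random(seed).gauss(0.0, 1.0)


class BrownianTree:
    """Approximate, reconstructible Brownian motion on [t0, t1] with W(t0) = 0."""

    def __init__(self, t0, t1, seed, tol=1e-6, cache_size=128):
        self.t0, self.t1, self.seed, self.tol = t0, t1, seed, tol
        # Terminal value W(t1) ~ N(0, t1 - t0), drawn from its own seed.
        self.w1 = math.sqrt(t1 - t0) * normal_from_seed(split_seed(seed, 2))
        # LRU cache of recently computed midpoints; evicted entries are
        # simply recomputed from their seeds, so memory stays bounded.
        self._midpoint = lru_cache(maxsize=cache_size)(self._midpoint_uncached)

    @staticmethod
    def _midpoint_uncached(a, b, wa, wb, seed):
        # Brownian bridge at the midpoint m of [a, b]:
        # W(m) | W(a), W(b) ~ N((wa + wb) / 2, (b - a) / 4).
        return 0.5 * (wa + wb) + math.sqrt((b - a) / 4.0) * normal_from_seed(seed)

    def __call__(self, t):
        if not self.t0 <= t <= self.t1:
            raise ValueError("t must lie in [t0, t1]")
        a, b, wa, wb, seed = self.t0, self.t1, 0.0, self.w1, self.seed
        # Descend the binary tree of intervals towards t.
        while b - a > self.tol:
            m = 0.5 * (a + b)
            wm = self._midpoint(a, b, wa, wb, seed)
            if t < m:
                b, wb, seed = m, wm, split_seed(seed, 0)
            else:
                a, wa, seed = m, wm, split_seed(seed, 1)
        # Linear interpolation inside the final small interval (approximation).
        return wa + (wb - wa) * (t - a) / (b - a)


if __name__ == "__main__":
    bm = BrownianTree(0.0, 1.0, seed=1234)
    forward = [bm(k / 10) for k in range(11)]
    backward = [bm(k / 10) for k in reversed(range(11))]
    print(forward == backward[::-1])
```

Because every value is a deterministic function of the root seed, querying the path backwards in time (as an adjoint solve would) reproduces the same values as the forward pass without storing the trajectory.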

2. METHOD

2.1. SDES AS GANS

Consider some "noise" distribution $\mu$ on a space $X$, and a target probability distribution $\nu$ on a space $Y$. A generative model for $\nu$ is a learnt function $G_\theta : X \to Y$ trained so that the (pushforward) distribution $G_\theta(\mu)$ approximates $\nu$. For our purposes, a Generative Adversarial Network (GAN) may then be characterised as a choice of $G_\theta$ which may be sampled from (by sampling $\omega \sim \mu$ and then evaluating $G_\theta(\omega)$), but for which the probability density of $G_\theta(\mu)$ is not computable (due to the complicated structure of $G_\theta$).
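As an illustration of this pushforward view applied to an SDE, the sketch below treats a numerical SDE solve as a deterministic function of the noise: once an initial normal draw and the Brownian increments are fixed, the output path is fixed too. The functions `drift`, `diffusion`, `sample_noise` and `generator`, and the parameter dictionary `theta`, are hypothetical stand-ins chosen for brevity and are not neural networks. The solver is a plain Euler-Maruyama discretisation; the diffusion is taken to be constant, in which case the Itô and Stratonovich interpretations coincide.

```python
import math
import random


def drift(t, x, theta):
    """Hypothetical stand-in for a learnt drift: mean reversion towards theta['b']."""
    return theta["a"] * (theta["b"] - x)


def diffusion(t, x, theta):
    """Hypothetical stand-in for a learnt diffusion: a constant (additive noise)."""
    return theta["c"]


def sample_noise(n_steps, dt, rng):
    """Sample omega = (V, Brownian increments), the discretised noise."""
    v = rng.gauss(0.0, 1.0)
    dws = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]
    return v, dws


def generator(noise, theta, dt):
    """Deterministic map from noise to a discretised path (Euler-Maruyama)."""
    v, dws = noise
    x = theta["x0_scale"] * v  # initial condition as a function of V
    t = 0.0
    path = [x]
    for dw in dws:
        x = x + drift(t, x, theta) * dt + diffusion(t, x, theta) * dw
        t += dt
        path.append(x)
    return path


if __name__ == "__main__":
    theta = {"a": 2.0, "b": 1.0, "c": 0.3, "x0_scale": 0.5}
    rng = random.Random(0)
    omega = sample_noise(n_steps=100, dt=0.01, rng=rng)
    path = generator(omega, theta, dt=0.01)
    print(len(path), path[-1])
```

Sampling from the model is cheap (draw `omega`, call `generator`), while the density of the resulting distribution over paths is never written down, which is the GAN-like property described above.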



Neural Stochastic Differential Equations (neural SDEs), such as Tzen & Raginsky (2019a); Li et al. (2020); Hodgkinson et al. (2020) among others.

1.1. RELATED WORK

We begin by discussing previous formulations, and applications, of Neural SDEs.

Rackauckas et al. (2020) treat neural SDEs in classical Feynman-Kac fashion, and like Hodgkinson et al. (2020); Tzen & Raginsky (2019a;b), optimise a loss on just the terminal value of the SDE.

Briol et al. (2020); Gierjatowicz et al. (2020) instead consider the more general case of using a neural SDE to model a time-varying quantity, for which the stochasticity in the system models the variability (specifically certain statistics) of time-varying data. Letting $\mu, \nu$ denote the learnt and true distributions, both train by minimising $|\mu(f) - \nu(f)|$ for functions of interest $f$ (such as derivative payoffs). This corresponds to training with a non-characteristic MMD (Gretton et al., 2013).

Several authors, such as Oganesyan et al. (2020); Hodgkinson et al. (2020); Liu et al. (2019), seek to use stochasticity as a way to enhance or regularise a neural ODE model.

Our approach is most similar to Li et al. (2020), in that we treat neural SDEs as learnt continuous-time model components of a differentiable computation graph, and Briol et al. (2020); Gierjatowicz et al. (

