ACTION MATCHING: A VARIATIONAL METHOD FOR LEARNING STOCHASTIC DYNAMICS FROM SAMPLES

Abstract

Stochastic dynamics are ubiquitous in many fields of science, from the evolution of quantum systems in physics to diffusion-based models in machine learning. Existing methods such as score matching can be used to simulate these processes, but they assume that the dynamics is a diffusion, which is not always the case. In this work, we propose a method called "Action Matching" that enables us to learn a much broader family of stochastic dynamics. Our method requires access only to samples from different time-steps, makes no explicit assumptions about the underlying dynamics, and can be applied even when samples are uncorrelated (i.e., are not part of a trajectory). Action Matching directly learns an underlying mechanism to move samples in time without modeling the distributions at each time-step. In this work, we showcase how Action Matching can be used for several computer vision tasks such as generative modeling, super-resolution, colorization, and inpainting; and further discuss potential applications in other areas of science.

1. INTRODUCTION

The problem of learning stochastic dynamics is one of the most fundamental problems in many fields of science. In physics, porous medium equations (Vázquez, 2007) describe many natural phenomena from this perspective, such as the Fokker–Planck equation in statistical mechanics, the Vlasov equation for plasma, and the nonlinear heat equation. Another prominent example comes from quantum mechanics, where the state of a physical system is a distribution whose evolution is described by the Schrödinger equation. Recently, stochastic dynamics have achieved very promising results in machine learning applications, most prominently in diffusion-based generative models (Song et al., 2020b; Ho et al., 2020).

Informal Problem Setup. In this paper, we approach the problem of learning stochastic dynamics from samples. Suppose we observe the time evolution of some random variable X_t with density q_t, from t_0 to t_1. Having access to samples from the density q_t at different points in time t ∈ [t_0, t_1], we want to build a model of the dynamics by learning how to move samples in time such that they respect the marginals q_t. In this work, we propose a method called "Action Matching" as a solution to this problem.

Learning Stochastic Dynamics vs. Time-Series

There is an important distinction between learning stochastic dynamics and time-series modeling (e.g., language, speech, or video modeling). In time-series modeling, the samples come in trajectories, where the samples within each trajectory are usually highly correlated. In learning stochastic dynamics, by contrast, we only have access to independent samples at any given time-step (i.e., samples that are uncorrelated through time). This degree of freedom allows us to solve different types of problems that cannot be approached by time-series modeling. We provide several examples in our experiments section, but also point out that it is sometimes even physically impossible to obtain samples along trajectories. For example, in quantum mechanics, the act of measurement collapses the wave function, which prevents us from obtaining further samples along that trajectory.

Generative Modeling with Action Matching. From the machine learning perspective, the problem of learning stochastic dynamics is a generalization of generative modeling. One way to solve generative modeling is to first construct a distributional path (a stochastic dynamics) from the data distribution to a tractable prior distribution (e.g., Gaussian or uniform), and then learn to move along this path to generate samples.

Figure 1: Score Matching learns a model for every distribution, while Action Matching learns the transition rule between distributions according to the continuity equation. Here, we illustrate that learning the dynamics might be a much simpler task than learning all the distributions individually.
The most prominent example of this approach is the recent development of diffusion generative models (Song et al., 2020b; Ho et al., 2020), where a stochastic differential equation (SDE) is constructed to move the samples from the data distribution to the prior, and the reverse SDE is obtained by learning the score function of the intermediate distributions via Score Matching (Hyvärinen & Dayan, 2005). Action Matching can be used for generative modeling in a similar way, where we also construct a stochastic dynamics between the data distribution and the prior. The important distinction, however, is that this dynamics is constructed solely from samples of the intermediate distributions, rather than from the analytical SDEs used in diffusion models. This heavily relaxes the constraints on the dynamics required by SDEs, and enables Action Matching to learn a much richer family of dynamics between the two distributions. For example, in both of the widely used VP-SDE and VE-SDE (Song et al., 2020b), the conditionals q_t(x_t | x_0) are tractable Gaussian distributions, while in Action Matching the dynamics can have an arbitrary conditional q_t(x_t | x_0), as long as it can be sampled from. Since SDEs can be sampled from, Action Matching can also learn the dynamics they construct. In Section 5.1, we provide a rich family of dynamics that can be learned with Action Matching without knowledge of the underlying process. Another important distinction between SDEs and Action Matching is that Action Matching spends its modeling capacity only on learning how to move the samples (consistently with the marginals), and makes no attempt to learn the marginals themselves. In diffusion models such as VP-SDE or VE-SDE, by contrast, all the capacity of the model is spent on learning the score function ∇ log q_t(x) of the individual densities for the backward diffusion.
This is wasteful when the evolution of the density is simple but the densities themselves are complicated. An illustrative toy example is provided in Fig. 1, where a complicated density evolves with a constant velocity through time. In this case, Action Matching only needs to learn a constant velocity vector field, without learning anything about the individual marginals. As a practical example, we consider the colorization task in the experiments section, and argue that moving directly from a grayscale image to the colored image with Action Matching is much easier than moving from Gaussian noise to a colored image with a conditional diffusion that conditions on the grayscale image. In short, compared to diffusion generative models, Action Matching has the following advantages:

1. Action Matching relies only on samples and does not require any knowledge of the underlying stochastic dynamics, which is essential when we only have access to samples.
2. Action Matching is designed to learn only the dynamics, rather than the individual distributions q_t, which is useful when a complicated distribution has a simple dynamics.
3. Action Matching's applicability extends beyond that of diffusion models, as it can learn a much richer class of stochastic dynamics (see Theorem 1).

Our contribution is two-fold: 1) in Section 2, we give a mathematically rigorous formulation of the problem of learning stochastic dynamics, explain why this problem is well-defined, and describe what types of dynamics we aim to learn; 2) in Section 3, we present Action Matching as a variational framework for learning these dynamics. Finally, as some of the possible applications of Action Matching, we discuss several computer vision tasks, such as generative modeling, super-resolution, colorization, and inpainting, and provide experiments.

2. PROBLEM FORMULATION OF LEARNING CONTINUOUS DYNAMICS

Continuity Equation. Suppose we have a set of particles in space X ⊂ R^d, initially distributed as q_{t_0}. Let each particle follow a time-dependent ODE (continuous flow) with velocity field v : [t_0, t_1] × X → R^d as follows:

∂x(t)/∂t = v_t(x(t)),   x(t_0) = x.   (1)

From fluid mechanics, we know that the density of the particles at time t, denoted q_t, evolves according to the continuity equation

∂q_t/∂t = −∇·(q_t v_t),

which holds in the distributional sense, where ∇· denotes the divergence operator. Note that even though we arrived at the continuity equation via ODEs, it can describe a rich family of density evolutions in a wide range of stochastic processes, including those of SDEs (see Equation 37 of Song et al. (2020b)), or even those of the porous medium equation (Otto, 2001), which are more general than SDEs. Intuitively, this is because in modeling the density evolution we only care about respecting the marginals q_t, not the underlying stochastic process that produced them. This motivates us to restrict ourselves to ODEs of the form of Eq. (1), and the continuity equation, without losing any modeling capacity. In fact, as the following theorem shows, under mild conditions any continuous dynamics can be modeled by the continuity equation, and moreover any continuity equation results in a continuous dynamics.

Theorem 1 (Adapted from Theorem 8.3.1 of Ambrosio et al. (2008)). Consider a continuous dynamics with density evolution q_t, which satisfies mild conditions (absolute continuity in the 2-Wasserstein space of distributions P_2(X)). Then there exists a unique (up to an additive constant) function s*_t(x), called the "action", such that the vector field v*_t(x) = ∇s*_t(x) and q_t satisfy the continuity equation

∂q_t/∂t = −∇·(q_t ∇s*_t(x)).   (3)

In other words, the ODE ∂x(t)/∂t = ∇s*_t(x) can be used to move samples in time such that the marginals are q_t.
Furthermore, we know that ∇s*_t(x), defined in Eq. (3), minimizes the kinetic energy functional K(v_t), defined below, along q_t (Ambrosio et al., 2008):

K(v_t) := 1/2 ∫_{t_0}^{t_1} E_{q_t(x)} ‖v_t(x)‖² dt,   ∇s*_t = arg min_{v_t} K(v_t)  s.t.  ∂q_t/∂t = −∇·(q_t v_t),   (4)

where the optimization is over all v_t satisfying the continuity equation with q_t. We can use the optimal value of the optimization in Eq. (4) to attribute a unique kinetic energy value K to any stochastic dynamics q_t as follows:

K := 1/2 ∫_{t_0}^{t_1} E_{q_t(x)} ‖∇s*_t(x)‖² dt.

Using Theorem 1, the problem of learning the dynamics boils down to learning the unique vector field ∇s*_t using only samples from q_t. Motivated by this, we restrict our search space of velocity fields to the family of curl-free vector fields

S_t = {∇s_t | s_t : X → R}.

We use a neural network to parameterize the set of functions s_t(x), and propose Action Matching for learning the network such that s_t(x) approximates s*_t(x). Once we have learned the vector field ∇s*_t, we can move samples forward or backward in time by simulating the ODE in Eq. (1) with velocity ∇s*_t. The continuity equation then ensures that samples at any time t ∈ [t_0, t_1] are distributed according to q_t.
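As a minimal numeric illustration of Theorem 1, consider the toy dynamics q_t = N(t·c, I): a standard normal translated with constant velocity c. The action s_t(x) = ⟨c, x⟩ gives the curl-free field ∇s_t = c, and integrating the ODE ∂x/∂t = ∇s_t transports samples so that the marginal at every t is q_t. The example below (our construction, not from the paper) checks this with Euler integration:

```python
import numpy as np

# Toy instance of Theorem 1: q_t = N(t*c, I) with action s_t(x) = <c, x>,
# so the curl-free velocity field is grad s_t(x) = c for every x.
rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])           # constant velocity of the mean
x0 = rng.normal(size=(100_000, 2))  # samples from q_0 = N(0, I)

# Euler-integrate dx/dt = grad s_t(x) = c from t = 0 to t = 1.
x, dt = x0.copy(), 0.01
for _ in range(100):
    x = x + dt * c

# The transported samples should match the marginal q_1 = N(c, I).
print(np.round(x.mean(axis=0), 1))  # [ 2. -1.]
print(np.round(x.std(axis=0), 1))   # [1. 1.]
```

Here the field happens to be constant, so the ODE is trivial; the point is only that moving samples with ∇s_t reproduces the marginals q_t, which is exactly what Action Matching exploits.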

3. ACTION MATCHING

The main development of this paper is the Action Matching method, which allows us to recover the true action s*_t of a continuous dynamics, and thereby simulate it, while having access only to samples from q_t. To do so, we define a variational action s_t(x), parameterized by a neural network, that approximates s*_t(x) by minimizing the "ACTION-GAP" objective

ACTION-GAP(s, s*) := 1/2 ∫_{t_0}^{t_1} E_{q_t(x)} ‖∇s_t(x) − ∇s*_t(x)‖² dt.   (7)

Note that this objective is intractable, as we do not have access to ∇s*. We now propose Action Matching as a variational framework for optimizing this objective. We first show that minimizing the intractable Eq. (7) is tightly related to estimating another intractable quantity: the kinetic energy of a continuous dynamics. As discussed in Theorem 1, we can attribute a kinetic energy K to any absolutely continuous dynamics q_t, using the true action s* (see Eq. (4)). In order to estimate the intractable K, we define a tractable variational kinetic energy lower bound (KILBO) as a functional of an arbitrary variational action s as follows:

KILBO(s) = E_{q_{t_1}(x)}[s_{t_1}(x)] − E_{q_{t_0}(x)}[s_{t_0}(x)]   (action-increment)
           − ∫_{t_0}^{t_1} E_{q_t(x)} [ 1/2 ‖∇s_t(x)‖² + ∂s_t(x)/∂t ] dt.   (smoothness / regularization)

The following proposition establishes that Eq. (7) is the gap between KILBO and the true kinetic energy.

Proposition 1. For an arbitrary variational action s, KILBO(s) is a lower bound on the true kinetic energy K, and the gap can be characterized as

K = KILBO(s) + ACTION-GAP(s, s*).

Thus, since K is not a function of s, the following optimization problems are equivalent:

arg max_s {KILBO(s)} = arg min_s {ACTION-GAP(s, s*)},

where the equality is up to an additive constant. The KILBO bound is tight iff ∇s_t(x) = ∇s*_t(x).

See Appendix A for the proof. Proposition 1 indicates that maximizing the KILBO results in estimating the true kinetic energy, as well as matching the variational action to the true action.
Note that, unlike the intractable K, maximizing KILBO is tractable, as we can use samples of q_t to obtain an unbiased, low-variance estimate of it. KILBO decomposes into an action-increment term and a smoothness term. If we optimize only the action-increment term, we learn large values of s_t(x) at t_1 and small values at t_0; in this case, s_t(x) tends to degenerate into a function with sharp transitions in both the x and t directions. The smoothness term acts as a regularizer by penalizing large gradients in both the x direction, via 1/2 ‖∇s_t(x)‖², and the t direction, via ∂s_t(x)/∂t.
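The KILBO estimate can be sanity-checked on the translation toy dynamics q_t = N(t·c, I), for which the true action is s*_t(x) = ⟨c, x⟩ and the true kinetic energy is K = ½‖c‖². Below is a Monte-Carlo sketch with a linear variational action s_t(x) = ⟨θ, x⟩ (our illustrative parameterization, not the paper's neural network); then ∇s_t = θ and ∂s_t/∂t = 0, so the time integral reduces to ½‖θ‖²:

```python
import numpy as np

# Monte-Carlo KILBO for q_t = N(t*c, I) with linear action s_t(x) = <theta, x>.
rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])
n = 200_000

def kilbo(theta):
    x0 = rng.normal(size=(n, 2))          # samples from q_0 = N(0, I)
    x1 = rng.normal(size=(n, 2)) + c      # samples from q_1 = N(c, I)
    increment = (x1 @ theta).mean() - (x0 @ theta).mean()  # action-increment
    smoothness = 0.5 * np.sum(theta**2)   # int E[ 0.5||grad s||^2 + ds/dt ] dt
    return increment - smoothness

K = 0.5 * np.sum(c**2)            # true kinetic energy of this dynamics: 2.5
print(round(float(kilbo(c)), 1))  # 2.5 -- the bound is tight at theta = c
print(bool(kilbo(0.5 * c) < K))   # True -- any other theta stays below K
```

Consistently with Proposition 1, the bound is tight exactly when ∇s_t matches the true ∇s*_t = c.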

4. GENERATIVE MODELING USING ACTION MATCHING

While Action Matching has a wide range of applications in learning continuous dynamics, in this work we focus on its applications to generative modeling. In Action Matching generative models, we first define a dynamics (i.e., a noising process) that transforms samples from the data distribution q_0 = π into samples from a prior distribution q_1 (e.g., a standard Gaussian). Action Matching is then used to learn the vector field ∇s* of the chosen dynamics. Once ∇s* is learned, we can sample from the target distribution by first sampling from the prior and then moving the samples using the reverse ODE with velocity ∇s*. Finally, Action Matching enables us to compute the exact log-likelihood of the data.

4.1. NOISING PROCESSES IN ACTION MATCHING GENERATIVE MODELS

To learn the vector field ∇s*, Action Matching only requires samples from the intermediate distributions q_t, t ∈ [0, 1], that define the noising process. We now provide a broad family of noising processes that can be used for generative modeling tasks. Consider the process

x_t = f_t(x_0) + σ_t ε,   x_0 ∼ π(x),   ε ∼ p(ε),   (11)

where f_t(x_0) is some transformation of the data, which may be nonlinear. At t = 0, f_0 is the identity function and σ_0 = 0; thus, x_0 is distributed according to the data distribution, i.e., q_0(x_0) = π(x_0). The noising process then gradually eliminates information from the samples via f_t, while increasing the noise scale σ_t. At t = 1, f_1 becomes the zero function and σ_1 = 1; thus, x_1 is distributed as q_1(x_1) = p(x_1). We now demonstrate that the same general idea can be used to construct different noising processes for solving different vision tasks, such as diffusion image generation, super-resolution, colorization, inpainting, and torus image generation. See Fig. 2 for examples of these sampling processes. We demonstrate Action Matching learning these dynamics in the experiments section.

Action Matching for Learning the Diffusion Dynamics. Diffusion processes can be viewed as a special case of the process in Eq. (11) when f_t(x_0) is a linear transformation f_t(x_0) = α_t x_0:

x_t = α_t x_0 + σ_t ε,   ε ∼ N(0, 1),   (12)

where α_t and σ_t can be chosen such that the marginals of Eq. (12) correspond to the marginals of VP-SDE and VE-SDE (Equation 29 of Song et al. (2020b)). This sampling process corresponds to the unconditional image generation task, since the dynamics transforms all the information in the image into Gaussian noise. We can use Action Matching to learn the dynamics of Eq. (12) solely from its samples, without any knowledge of the underlying diffusion process.
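The linear process of Eq. (12) is easy to sample. The sketch below uses the schedule α_t = √(1−t), σ_t = √t, which is our illustrative choice (the text only requires f_0 = identity, σ_0 = 0 and f_1 = 0, σ_1 = 1); for unit-variance data this particular schedule keeps the marginal variance constant:

```python
import numpy as np

# Sampling the linear noising process x_t = alpha_t * x0 + sigma_t * eps of
# Eq. (12), with the illustrative schedule alpha_t = sqrt(1-t), sigma_t = sqrt(t).
rng = np.random.default_rng(0)

def sample_xt(x0, t):
    """Draw x_t from the conditional q_t(x_t | x_0) of the linear process."""
    alpha, sigma = np.sqrt(1.0 - t), np.sqrt(t)
    return alpha * x0 + sigma * rng.normal(size=x0.shape)

x0 = rng.normal(size=(100_000, 2))       # stand-in "data" with unit variance
for t in (0.0, 0.5, 1.0):
    xt = sample_xt(x0, t)
    print(t, round(float(xt.var()), 1))  # variance stays ~1.0 at every t
```

Action Matching sees only such samples {x_t}; the schedule and noise model behind them never enter the objective.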
Equations (11) and (12) showcase how Action Matching generalizes diffusion, by allowing f_t to be any nonlinear function and ε ∼ p(ε) to be any noise model. In contrast, Denoising Score Matching uses a linear f_t for VP-SDE and VE-SDE, with a tractable Gaussian conditional.

Super-Resolution and Inpainting

Action Matching provides a lot of freedom in the choice of the sampling process, which we demonstrate on conditional generation tasks. Consider the following sampling process:

x_t = mask ⊙ x_0 + (1 − mask) ⊙ (α_t x_0 + σ_t ε),

where the mask variable has the same dimensions as x_0 and every coordinate of the mask vector is in {0, 1}. The noising process is thus applied only to a subset of pixels, which can be used to learn the inpainting and super-resolution tasks. In the inpainting task, the mask itself is a Bernoulli random variable that decides whether the top half or the bottom half of the image is destroyed. In the super-resolution task, the mask is fixed and keeps one pixel in each 2 × 2 block, while the remaining pixels are transformed into noise.
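The masked process above can be sketched in a few lines. The fixed one-pixel-per-2×2-block mask mirrors the super-resolution setup; the schedule and the toy 8×8 "image" are our illustration:

```python
import numpy as np

# Masked noising: noise is injected only where mask == 0, so the masked-in
# pixels stay clean at every t.
rng = np.random.default_rng(0)

def masked_xt(x0, mask, t):
    alpha, sigma = np.sqrt(1.0 - t), np.sqrt(t)  # illustrative alpha_t, sigma_t
    eps = rng.normal(size=x0.shape)
    return mask * x0 + (1 - mask) * (alpha * x0 + sigma * eps)

x0 = rng.normal(size=(8, 8))     # a toy 8x8 "image"
mask = np.zeros((8, 8))
mask[::2, ::2] = 1.0             # keep the top-left pixel of every 2x2 block

x1 = masked_xt(x0, mask, t=1.0)  # fully noised outside the mask
print(bool(np.allclose(x1[::2, ::2], x0[::2, ::2])))  # True: kept pixels intact
```

At t = 1 the unmasked pixels are pure noise while the masked pixels still equal the data, which is exactly the conditioning structure the super-resolution model learns to invert.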

Algorithm 1 Generative Modeling using Action Matching

Require: dataset {x^i}_{i=1}^N, x^i ∼ π(x) = q_0(x)
Require: parametric model s_t(x, θ)
for learning iterations do
    sample a batch of data {x_0^i}_{i=1}^n ∼ π(x) = q_0(x_0)
    sample a batch of noise {ε^i}_{i=1}^n ∼ q_1(x_1) = p(ε)
    sample times {t^i}_{i=1}^n ∼ Uniform[0, 1]
    form two batches {x_1^i}_{i=1}^n and {x_{t^i}^i}_{i=1}^n using x_t^i = f_t(x_0^i) + σ_t ε^i
    L = (1/n) Σ_i [ s_0(x_0^i) − s_1(x_1^i) + 1/2 ‖∇s_{t^i}(x_{t^i}^i)‖² + ∂s_{t^i}(x_{t^i}^i)/∂t ]
    update the model θ ← Optimizer(θ, ∇_θ L)
end for
return trained model s_t(x, θ*)

Colorization. Another option for conditional generation is to interpolate between the original datapoint and a nonlinearly transformed version of it, with some noise added. For instance, we perform image colorization using the process

x_t = α_t x_0 + σ_t (10^{-1} ε + gray(x_0)),   (14)

where the function gray(x_0) returns the grayscale version of image x_0. Note that gray(x_0) is not injective, since it maps several different colorizations to the same grayscale image. As we show further, the Action Matching sampling process is a bijection; hence, the addition of noise is crucial for sampling different images from the data distribution given the same conditioning. At the same time, the added noise partially destroys the information. To avoid this corruption of the conditioning image, we concatenate the original grayscale image with the input.

Generative Modeling on a Torus. Finally, we consider the problem of learning a stochastic dynamics on a manifold. We consider a distribution on a torus, where every coordinate of the data vector lies in [0, 1] with periodic boundary conditions. The sampling process interpolating between the data distribution and the noise distribution is then

x_t = (x_0 + σ_t ε) mod 1,   ε ∼ N(0, 1).   (15)

Note that q_1 converges to the uniform distribution on the torus as σ_t → ∞.
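Algorithm 1 can be exercised end-to-end in a deliberately tiny setting. Below, the dynamics is a pure translation x_t = x_0 + t·c (f_t(x) = x + t·c, σ_t = 0) and the variational action is linear, s_t(x) = ⟨θ, x⟩, so ∇s_t = θ and ∂s_t/∂t = 0; both simplifications are ours (the paper uses a neural network for s_t), but the loss is exactly the per-batch objective of Algorithm 1:

```python
import numpy as np

# Minimal rendition of Algorithm 1 on a translation dynamics x_t = x_0 + t*c,
# with linear action s_t(x) = <theta, x>.  For this s_t the batch loss is
#   L = mean[ s_0(x_0) - s_1(x_1) + 0.5*||grad s||^2 + ds/dt ]
#     = -<theta, mean(x1 - x0)> + 0.5*||theta||^2,
# whose exact gradient in theta is theta - mean(x1 - x0).
rng = np.random.default_rng(0)
c = np.array([2.0, -1.0])       # true (constant) velocity field
theta = np.zeros(2)             # parameters of s_t(x) = <theta, x>
lr, n = 0.5, 4096

for _ in range(50):             # "learning iterations" of Algorithm 1
    x0 = rng.normal(size=(n, 2))               # batch from q_0
    x1 = x0 + c                                # batch from q_1 (t = 1)
    grad = theta - (x1 - x0).mean(axis=0)      # exact gradient of L
    theta = theta - lr * grad

print(np.round(theta, 2))       # [ 2. -1.] -- recovers the true field c
```

Maximizing the KILBO (equivalently, minimizing this loss) drives θ to the true velocity c, i.e., ∇s_t → ∇s*_t, matching Proposition 1.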

4.2. LEARNING, SAMPLING, AND LIKELIHOOD EVALUATION OF ACTION MATCHING GENERATIVE MODELS

Learning. Once we define the noising process for q_t, t ∈ [0, 1], we apply Action Matching as described in Algorithm 1. It samples points at different time-steps and then minimizes the objective (7) w.r.t. the parameters θ of s_t(x, θ). In practice, we found that the performance of Algorithm 1 may be hindered by the high variance of the objective estimate. To reduce the variance of the objective (7), we propose to weight it over time and to adaptively select the distribution of sampled time-steps. We derive the weighted KILBO objective in Appendix A, and further discuss the details of training in Appendix B.

Sampling. We sample from the target distribution via the trained function s_t(x(t), θ*) by solving the following ODE backward in time:

∂x/∂t = ∇_x s_t(x(t), θ*),   x(t = 1) = ε,   ε ∼ p(ε).

Recall that this sampling process is justified by Eq. (3), where s_t(x(t), θ*) approximates s*_t.

Evaluating the Log-likelihood. For the generation tasks, the log-likelihood can be computed by integrating the same ODE forward, i.e.,

log q_0(x(0)) = log q_1(x(1)) + ∫_0^1 ∇² s*_t(x(t)) dt,   ∂x/∂t = ∇_x s*_t(x(t)),   x(t = 0) = x,

where we approximate s*_t by s_t(x(t), θ*) and assume the density q_1(x) to be a known analytic distribution.

Figure 3: Illustration of the difference between the Score Matching and Action Matching noising processes on the colorization task. We argue that Action Matching provides a more efficient way to learn the colorization model, since its process requires fewer changes between the input and the resulting images. The additional channels are used to condition all the inputs on the grayscale image.
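The likelihood identity above can be checked on the translation toy dynamics q_t = N(t·c, I), where the learned field would be ∇s_t(x) = c (our stand-in for s_t(x, θ*)). For this linear field the divergence term ∇²s_t vanishes, so the forward ODE maps a data point x to x(1) = x + c and the log-likelihood reduces to the analytic prior density:

```python
import numpy as np

# Exact log-likelihood via the forward ODE, assuming the toy translation
# dynamics q_t = N(t*c, I) with grad s_t(x) = c (so grad^2 s_t = 0).
c = np.array([2.0, -1.0])
d = 2

def log_q1(x):  # known analytic prior q_1 = N(c, I)
    return -0.5 * np.sum((x - c) ** 2) - 0.5 * d * np.log(2 * np.pi)

x = np.array([0.3, -0.7])     # a "data" point
x1 = x.copy()
dt = 0.01
for _ in range(100):          # Euler-integrate dx/dt = grad s_t = c forward
    x1 = x1 + dt * c

logp = log_q1(x1) + 0.0       # + int grad^2 s_t dt, which is 0 here
true = -0.5 * np.sum(x ** 2) - 0.5 * d * np.log(2 * np.pi)  # log N(x; 0, I)
print(bool(abs(logp - true) < 1e-9))  # True: ODE likelihood matches analytic
```

In general the divergence integral is nonzero and is accumulated along the trajectory; here it drops out only because the illustrative field is constant in x.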

4.3. ACTION MATCHING VS. SCORE MATCHING GENERATIVE MODELS

In this section, we provide more insight into Action Matching by drawing connections and highlighting differences to Score Matching (Hyvärinen & Dayan, 2005) and to the recently introduced generative models relying on it (Song et al., 2020b; Ho et al., 2020). The most fundamental difference between Action Matching and Score Matching is that they address completely different estimation problems. Score Matching estimates the gradient of the log-density from samples of a single distribution, whereas Action Matching cannot be applied to a single distribution: it learns the underlying mechanism of a stochastic dynamics, i.e., how distributions change in time. We schematically depict this difference in Fig. 1. Despite the fundamental difference in the problem setup, both methods can be applied to generative modeling. For generative modeling, we have the freedom to choose the stochastic dynamics between the target distribution and the prior. Hence, by choosing the dynamics to be a diffusion (a Fokker–Planck equation) with known drift and diffusion coefficients, one can learn the score of every marginal and then sample using the corresponding ODE or SDE (Song & Ermon, 2019). Furthermore, if the drift term is affine, then Denoising Score Matching (Vincent, 2011) can be applied to learn the score model. More precisely, the model of Song et al. (2020b) requires samples from the noising process and analytic formulas for the drift and diffusion coefficients, where the drift term has to be affine. Action Matching requires only samples from the process to learn the dynamics. Hence, it covers the diffusion case even without knowledge of the drift and diffusion coefficients. Moreover, it can learn a much broader family of generative processes, which can have better properties for different applications. In Fig. 3, we give an example of such a process for the colorization task.
Since Score Matching is defined only for diffusions, its forward process removes all the information about the image, resulting in pure noise. In contrast, with Action Matching we can remove only the information about the color of the image while adding some low-variance noise along the way. For both models, we concatenate the grayscale image with the input. However, we argue that Action Matching requires less computational effort, since fewer modifications to the original image are needed to color it. We discuss and provide evidence for this in Section 5.1.

5.1. GENERATIVE MODELING

Action Matching has a wide range of applications in modeling density evolutions. In this section, we showcase its applications to generative modeling tasks, since they are among the most challenging high-dimensional stochastic dynamics. Action Matching generative models should not be directly compared with diffusion models, as they make different assumptions and have access to different information. A score matching diffusion model such as VE-SDE or VP-SDE explicitly relies on the analytic forms of the drift and diffusion coefficients of the SDE. In contrast, Action Matching infers the underlying vector field of an arbitrary continuous stochastic dynamics solely from samples. For this reason, we expect Action Matching generative models to under-perform in this setting. Diffusion and Torus map images to known distributions; hence, for them, we report negative log-likelihood in bits per dimension (BPD). For all tasks, we report FID evaluated between generated images and the test data (20k images for CelebA, 10k images for CIFAR-10). We apply Action Matching to the MNIST, CelebA (Liu et al., 2015), and CIFAR-10 datasets for a variety of computer vision tasks. Namely, we perform unconditional image generation via diffusion, as well as conditional generation for the super-resolution, inpainting, and colorization tasks. In addition to these settings, we also learn unconditional image generation on a torus, where Denoising Score Matching cannot be applied in its original formulation. As the baseline for unconditional image generation, we use the model from Song et al. (2020b), a diffusion-based generative process trained with Denoising Score Matching. As the baseline for conditional image generation, we follow Saharia et al. (2022) and condition the model by concatenating the conditioning image as additional channels to the main input. We refer to all baselines as Score Matching (SM in Table 1 and Fig. 4).
We discuss further implementation details in Appendix D.1. We train all models for 300k iterations and report negative log-likelihood in bits per dimension (BPD) and FID scores (Heusel et al., 2017) in Table 1. We show images generated by Action Matching in Appendix E and provide animations of the generation process at github.com/action-matching. We observe that Denoising Score Matching performs better than Action Matching on all tasks, which was expected given the additional information about the underlying process that the Denoising Score Matching objective uses. However, as discussed in Section 4.3, we expect Action Matching to converge faster on the conditional image generation tasks, as it only needs to learn a cross-domain transformation rather than conditional generation from Gaussian noise. We verified this hypothesis experimentally by evaluating the FID throughout training on the colorization task, as shown in Fig. 4.

5.2. SCHRÖDINGER EQUATION SIMULATION

In this section, we demonstrate that Action Matching can learn a wide range of stochastic dynamics by applying it to a quantum system evolving according to the Schrödinger equation. The Schrödinger equation describes the evolution of many quantum systems; in particular, it describes the physics of molecular systems. For the ground truth dynamics, we take the dynamics of an excited state of the hydrogen atom, which is described by the following equation:

i ∂ψ(x, t)/∂t = −(1/‖x‖) ψ(x, t) − (1/2) ∇² ψ(x, t).   (18)

The function ψ(x, t) : R³ × R → C is called the wavefunction, and it completely describes the state of the quantum system. In particular, it defines the distribution of the coordinates x through the density q_t(x) := |ψ(x, t)|², whose dynamics is determined by the dynamics of ψ(x, t) in Eq. (18). As the baseline, we take Annealed Langevin Dynamics (ALD), as considered in Song & Ermon (2019). It approximates the ground truth dynamics using only the scores of the distributions, by running an approximate MCMC method (which does not have access to the densities) targeting the intermediate distributions of the dynamics (see Algorithm 3). For the estimation of the scores, we consider Score Matching (SM) (Hyvärinen & Dayan, 2005) and Sliced Score Matching (SSM) (Song et al., 2020a), and additionally evaluate the baseline using the ground truth scores. For further details, we refer the reader to Appendix D.2 and the code at github.com/action-matching. Action Matching outperforms both Score Matching and Sliced Score Matching, precisely simulating the true dynamics (see Fig. 5 and Table 2).

Table 2: Performance of Action Matching and Annealed Langevin Dynamics (ALD) for the Schrödinger equation simulation. For ALD, we estimate the scores in two ways: Score Matching (SM) and Sliced Score Matching (SSM). We also demonstrate that even using the true scores does not allow for a precise simulation.
Although both SM and SSM accurately recover the ground truth scores of the marginal distributions (see the right plot in Fig. 5), one cannot efficiently use them to sample from the ground truth dynamics. Note that even using the ground truth scores in Annealed Langevin Dynamics does not match the performance of Action Matching (see Table 2), since ALD is itself an approximation to the Metropolis-Adjusted Langevin Algorithm. Finally, we provide animations of the learned dynamics for the different methods (see github.com/action-matching) to illustrate the performance difference.

6. CONCLUSION

In this work, we discussed how any continuous dynamics (under mild conditions) can be represented by a unique continuous vector field minimizing the kinetic energy. This representation provides a rigorous mathematical formulation for the problem of learning stochastic dynamics. We then presented Action Matching, as a variational framework for learning this unique vector field solely from samples of the dynamics. We further demonstrated that Action Matching can learn a wide range of continuous dynamics, including those of diffusion. We believe the flexibility that Action Matching introduces will be useful in applications in natural sciences, where stochastic dynamics appear, but the underlying mechanisms are not controlled, and thus we can only make observations.

A ACTION MATCHING

Proposition. For an arbitrary variational action s, KILBO(s) is a lower bound on the true kinetic energy K, and the gap can be characterized as

KILBO(s) = ω_{t_1} E_{q_{t_1}(x)}[s_{t_1}(x)] − ω_{t_0} E_{q_{t_0}(x)}[s_{t_0}(x)] − ∫_{t_0}^{t_1} ω_t E_{q_t(x)} [ 1/2 ‖∇s_t(x)‖² + ∂s_t(x)/∂t + s_t(x) d log ω_t/dt ] dt
         = K − ACTION-GAP(s, s*).

Here we state the weighted version with a time-dependent weight ω_t; setting ω_t ≡ 1 recovers Proposition 1 of the main text. Thus, since K is not a function of s, the following optimization problems are equivalent:

arg max_s {KILBO(s)} = arg min_s {ACTION-GAP(s, s*)},

where the equality is up to an additive constant. The KILBO bound is tight iff ∇s_t(x) = ∇s*_t(x).

Proof.

ACTION-GAP(s, s*) = 1/2 ∫_{t_0}^{t_1} ω_t E_{q_t(x)} ‖∇s_t − ∇s*_t‖² dt
  = 1/2 ∫_{t_0}^{t_1} ∫_x ω_t q_t(x) ‖∇s_t − ∇s*_t‖² dx dt
  = 1/2 ∫_{t_0}^{t_1} ∫_x ω_t q_t(x) ‖∇s_t‖² dx dt − ∫_{t_0}^{t_1} ω_t ∫_x q_t(x) ⟨∇s_t(x), ∇s*_t(x)⟩ dx dt + K,   where K := 1/2 ∫_{t_0}^{t_1} ω_t E_{q_t(x)} ‖∇s*_t‖² dt
  = 1/2 ∫_{t_0}^{t_1} ∫_x ω_t q_t(x) ‖∇s_t‖² dx dt − ∫_{t_0}^{t_1} ω_t ∫_x ⟨∇s_t(x), q_t(x) ∇s*_t(x)⟩ dx dt + K
  (1) = 1/2 ∫_{t_0}^{t_1} ∫_x ω_t q_t(x) ‖∇s_t‖² dx dt + ∫_{t_0}^{t_1} ω_t ∫_x s_t(x) [∇·(q_t(x) ∇s*_t(x))] dx dt + K
  = 1/2 ∫_{t_0}^{t_1} ∫_x ω_t q_t(x) ‖∇s_t‖² dx dt − ∫_{t_0}^{t_1} ∫_x ω_t s_t(x) ∂q_t(x)/∂t dx dt + K
  (2) = ∫_{t_0}^{t_1} ω_t E_{q_t(x)} [ 1/2 ‖∇s_t(x)‖² ] dt − [ ω_t E_{q_t(x)}[s_t(x)] ]_{t_0}^{t_1} + ∫_{t_0}^{t_1} E_{q_t(x)} [ s_t(x) dω_t/dt + ω_t ∂s_t(x)/∂t ] dt + K
  = ∫_{t_0}^{t_1} ω_t E_{q_t(x)} [ 1/2 ‖∇s_t(x)‖² + ∂s_t(x)/∂t + s_t(x) d log ω_t/dt ] dt − ω_{t_1} E_{q_{t_1}(x)}[s_{t_1}(x)] + ω_{t_0} E_{q_{t_0}(x)}[s_{t_0}(x)] + K
  = −KILBO(s) + K,

where in (1) we have used ∫_V ⟨∇g, f⟩ dx = ∮_{∂V} g ⟨f, ds⟩ − ∫_V g (∇·f) dx with a vanishing boundary term, the step after (1) uses the continuity equation ∂q_t/∂t = −∇·(q_t ∇s*_t), and in (2) we have used integration by parts in t together with s_t dω_t/dt = ω_t s_t d log ω_t/dt.

B GENERATIVE MODELING IN PRACTICE

In practice, we found that a naive application of Action Matching (Algorithm 1) to complicated dynamics such as image generation may exhibit poor convergence due to the large variance of the objective estimate. Moreover, the optimization problem

min_{s_t} 1/2 ∫∫ q*_t(x) ‖∇s_t(x) − ∇s*_t(x)‖² dx dt   (20)

might be ill-posed due to a singularity of the ground truth vector field ∇s*_t. Indeed, consider the sampling process

x_t = f_t(x_0) + σ_t ε,   x_0 ∼ π(x),   ε ∼ N(ε | 0, 1),

where the target distribution is a mixture of delta functions π(x) = (1/N) Σ_{i=1}^N δ(x − x^i).

Algorithm 2 Generative Modeling using Action Matching (In Practice)
Require: dataset {x^i}_{i=1}^N, x^i ∼ π(x) = q_0(x)
Require: parametric model s_t(x, θ), weight schedule ω(t)
for learning iterations do
    sample a batch of data {x_0^i}_{i=1}^n ∼ π(x) = q_0(x)
    sample a batch of noise {ε^i}_{i=1}^n ∼ q_1(x_1)
    sample times {t^i}_{i=1}^n ∼ p(t)
    form two batches {x_1^i}_{i=1}^n and {x_{t^i}^i}_{i=1}^n using x_t^i = f_t(x_0^i) + σ_t ε^i
    L = Σ_i (1/p(t^i)) [ s_0(x_0^i) ω(0) − s_1(x_1^i) ω(1) + 1/2 ‖∇s_{t^i}(x_{t^i}^i)‖² ω(t^i) + ∂s_{t^i}(x_{t^i}^i)/∂t ω(t^i) + s_{t^i}(x_{t^i}^i) ∂ω(t^i)/∂t ]
    update the model θ ← Optimizer(θ, ∇_θ L)
end for
return trained model s_t(x, θ*)

Denoting the distribution of x_t by q_t(x), we can solve the continuity equation ∂q_t/∂t = −∇·(q_t ∇s*_t) analytically (see Appendix C). The ground truth vector field is

∇s*_t(x) = (1 / Σ_i q^i_t(x)) Σ_i q^i_t(x) [ (x − f_t(x^i)) ∂/∂t log σ_t + ∂f_t(x^i)/∂t ],   q^i_t(x) = N(x | f_t(x^i), σ²_t).

For generative modeling, it is essential that q_0 = π(x); hence lim_{t→0} σ_t = 0 and lim_{t→0} f_t(x) = x. Assuming that σ²_t is continuous and differentiable at 0, in the limit we have

lim_{t→0} ‖∇s*_t(x)‖² ∝ lim_{t→0} 1/σ²_t,   lim_{t→0} 1/2 ∫ q*_t(x) ‖∇s*_t(x)‖² dx ∝ lim_{t→0} 1/σ²_t.

Thus, the loss is properly defined only on an interval t ∈ (δ, 1], with δ > 0.
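The 1/σ²_t blow-up of the ground-truth field can be seen numerically. Below we take a two-point dataset with f_t = identity and σ_t = √t (our illustrative choice, with ∂/∂t log σ_t = 1/(2t)), evaluate the ground-truth vector field given above on samples from q_t, and observe that E‖∇s*_t‖² grows like 1/σ²_t = 1/t as t → 0:

```python
import numpy as np

# E||grad s*_t||^2 for a two-delta dataset under x_t = x_0 + sqrt(t)*eps.
# With f_t = identity, the field is
#   grad s*_t(x) = sum_i q_t^i(x) (x - x_i) / (2 t * sum_i q_t^i(x)).
rng = np.random.default_rng(0)
data = np.array([-1.0, 1.0])                  # two delta functions in 1D

def mean_sq_field(t, n=200_000):
    x0 = rng.choice(data, size=n)
    x = x0 + np.sqrt(t) * rng.normal(size=n)  # samples from q_t
    w = np.exp(-0.5 * (x[:, None] - data) ** 2 / t)
    w = w / w.sum(axis=1, keepdims=True)      # responsibilities q_t^i(x)
    v = (w * (x[:, None] - data)).sum(axis=1) / (2 * t)
    return float(np.mean(v ** 2))

for t in (0.1, 0.01, 0.001):
    print(t, round(mean_sq_field(t) * t, 2))  # product stays roughly 1/4
```

Since E‖∇s*_t‖²·t stays roughly constant, the unweighted loss integrand diverges as t → 0, which is exactly the singularity the time-reweighting ω(t) is designed to cancel.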
In practice, we want to set $\delta$ as small as possible; hence, ideally, we want to learn $s_t$ on the whole interval $t\in[0, 1]$. We can remove the singularity simply by reweighting the objective in time, i.e.,

$$\frac{1}{2}\int\!\!\int q^*_t(x)\,\|\nabla s_t(x) - \nabla s^*_t(x)\|^2\,dx\,dt \;\Longrightarrow\; \frac{1}{2}\int\!\!\int \omega(t)\,q^*_t(x)\,\|\nabla s_t(x) - \nabla s^*_t(x)\|^2\,dx\,dt.$$

To give an example, we can take $\sigma_t = \sqrt{t}$ and $f_t(x) = x\sqrt{1-t}$; then $\omega(t) = (1-t)\,t^{3/2}$ cancels out the singularities at $t = 0$ and $t = 1$.

The second modification of the original Algorithm 1 is the sampling of time-steps for the estimation of the time integral. Namely, the optimization of the reweighted objective is equivalent to the minimization of

$$L(s) = \omega(t_0)\int s_{t_0}(x)\,q^*_{t_0}(x)\,dx - \omega(t_1)\int s_{t_1}(x)\,q^*_{t_1}(x)\,dx + \int_{t_0}^{t_1}\!\!\int \omega(t)\,q^*_t(x)\left[\frac{1}{2}\|\nabla s_t(x)\|^2 + \frac{\partial s_t(x)}{\partial t} + s_t(x)\frac{d\log\omega(t)}{dt}\right]dx\,dt,$$

where the time integral is estimated by sampling $t\sim p(t)$. We implement sampling from $p(t)$ by aggregating estimates of the integrand's variance throughout training with an exponential moving average, followed by linear interpolation between the estimates.

C "SPARSE DATA" REGIME

We start with the case where the dataset consists of a single point $x_0\in\mathbb{R}^d$:

$$q_0(x) = \delta(x - x_0), \qquad k_t(x_t\mid x) = \mathcal{N}(x_t\mid f_t(x), \sigma^2_t).$$

Then the distribution at time $t$ is

$$q_t(x) = \int dx'\,q_0(x')\,k_t(x\mid x') = \mathcal{N}(x\mid f_t(x_0), \sigma^2_t).$$

The ground-truth vector field $v$ comes from the continuity equation

$$\frac{\partial q_t}{\partial t} = -\langle\nabla, q_t v\rangle \;\Longrightarrow\; \frac{\partial}{\partial t}\log q_t = -\langle\nabla\log q_t, v\rangle - \langle\nabla, v\rangle.$$

For our dynamics, we have

$$\begin{aligned}
\frac{\partial}{\partial t}\log q_t &= \frac{\partial}{\partial t}\left[-\frac{d}{2}\log(2\pi\sigma^2_t) - \frac{1}{2\sigma^2_t}\|x - f_t(x_0)\|^2\right] \qquad (35)\\
&= -d\,\frac{\partial}{\partial t}\log\sigma_t + \frac{1}{\sigma^2_t}\|x - f_t(x_0)\|^2\,\frac{\partial}{\partial t}\log\sigma_t + \frac{1}{\sigma^2_t}\left\langle x - f_t(x_0), \frac{\partial f_t(x_0)}{\partial t}\right\rangle \qquad (36)\\
&= -d\,\frac{\partial}{\partial t}\log\sigma_t + \frac{1}{\sigma^2_t}\left\langle x - f_t(x_0),\;(x - f_t(x_0))\frac{\partial}{\partial t}\log\sigma_t + \frac{\partial f_t(x_0)}{\partial t}\right\rangle;\\
\nabla\log q_t &= -\frac{1}{\sigma^2_t}\big(x - f_t(x_0)\big); \qquad (38)\\
\frac{\partial}{\partial t}\log q_t &= -d\,\frac{\partial}{\partial t}\log\sigma_t - \left\langle\nabla\log q_t,\;(x - f_t(x_0))\frac{\partial}{\partial t}\log\sigma_t + \frac{\partial f_t(x_0)}{\partial t}\right\rangle.
\end{aligned}$$

Matching the corresponding terms in the continuity equation (note that $\langle\nabla, v\rangle = d\,\frac{\partial}{\partial t}\log\sigma_t$ for the candidate $v$ below), we get

$$v = (x - f_t(x_0))\frac{\partial}{\partial t}\log\sigma_t + \frac{\partial f_t(x_0)}{\partial t}.$$
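For a single data point, the effect of the weight $\omega(t) = (1-t)\,t^{3/2}$ can be seen in closed form: with $\sigma_t = \sqrt{t}$ and $f_t(x) = x\sqrt{1-t}$, one gets $\mathbb{E}_{q_t}\|\nabla s^*_t\|^2 = \frac{1}{4t} + \frac{x_0^2}{4(1-t)}$, which diverges at both endpoints, while the weighted integrand stays bounded. A short check ($x_0 = 2$ is an arbitrary choice for illustration):

```python
import numpy as np

x0 = 2.0  # single (hypothetical) data point

def mean_sq_grad(t):
    # E_{q_t} |grad s*_t|^2 for q_t = N(x0*sqrt(1-t), t), where
    # grad s*_t(x) = (x - f_t(x0)) * d/dt log sigma_t + d/dt f_t(x0)
    #             = (x - f_t(x0)) / (2t) - x0 / (2*sqrt(1-t)).
    return 1.0 / (4.0 * t) + x0 ** 2 / (4.0 * (1.0 - t))

omega = lambda t: (1.0 - t) * t ** 1.5

t = np.linspace(1e-6, 1.0 - 1e-6, 100000)
raw = mean_sq_grad(t)
weighted = omega(t) * raw

print(raw[0], raw[-1])   # unweighted integrand diverges at both endpoints
print(weighted.max())    # weighted integrand stays bounded on (0, 1)
```

Indeed, $\omega(t)\,\mathbb{E}_{q_t}\|\nabla s^*_t\|^2 = \frac{(1-t)\sqrt{t}}{4} + \frac{x_0^2\,t^{3/2}}{4}$ vanishes at $t = 0$ and stays finite at $t = 1$.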
For a set of delta functions, we denote

$$q_0(x) = \sum_i\delta(x - x_i), \qquad q_t(x) = \sum_i q^i_t(x), \qquad q^i_t(x) = \mathcal{N}(x\mid f_t(x_i), \sigma^2_t).$$

Due to the linearity of the continuity equation w.r.t. $q$, we have

$$\sum_i\frac{\partial q^i_t}{\partial t} = -\sum_i\langle\nabla, q^i_t v\rangle \;\Longrightarrow\; \sum_i q^i_t\left[\frac{\partial}{\partial t}\log q^i_t + \langle\nabla\log q^i_t, v\rangle + \langle\nabla, v\rangle\right] = 0. \tag{42}$$

We first solve the equation for $\frac{\partial f_t}{\partial t} = 0$, then for $\frac{\partial}{\partial t}\log\sigma_t = 0$, and join the solutions.

For $\frac{\partial f_t}{\partial t} = 0$, we look for the solution in the following form

$$v_\sigma = \frac{A}{\sum_i q^i_t}\sum_i\nabla q^i_t, \qquad q^i_t(x) = \mathcal{N}(x\mid f_t(x_i), \sigma^2_t).$$

Then we have

$$\begin{aligned}
\langle\nabla, v_\sigma\rangle &= \left\langle\nabla\frac{A}{\sum_i q^i_t},\;\sum_i\nabla q^i_t\right\rangle + \frac{A}{\sum_i q^i_t}\sum_i\nabla^2 q^i_t \qquad (44)\\
&= -\frac{A}{(\sum_i q^i_t)^2}\Big\|\sum_i\nabla q^i_t\Big\|^2 + \frac{A}{\sum_i q^i_t}\sum_i q^i_t\left[\|\nabla\log q^i_t\|^2 - \frac{d}{\sigma^2_t}\right],\\
\sum_i q^i_t\,\langle\nabla, v_\sigma\rangle &= -\frac{A}{\sum_i q^i_t}\Big\|\sum_i\nabla q^i_t\Big\|^2 + A\sum_i q^i_t\left[\|\nabla\log q^i_t\|^2 - \frac{d}{\sigma^2_t}\right],
\end{aligned}$$

and from (42) we have

$$\sum_i q^i_t\left[-d\,\frac{\partial}{\partial t}\log\sigma_t + \left\langle\nabla\log q^i_t,\; v_\sigma + \sigma^2_t\frac{\partial}{\partial t}\log\sigma_t\,\nabla\log q^i_t\right\rangle + \langle\nabla, v_\sigma\rangle\right] = 0.$$

From these two equations we have

$$\begin{aligned}
\sum_i q^i_t\,\langle\nabla, v_\sigma\rangle &= -\frac{A}{\sum_i q^i_t}\Big\|\sum_i\nabla q^i_t\Big\|^2 + A\sum_i q^i_t\left[\|\nabla\log q^i_t\|^2 - \frac{d}{\sigma^2_t}\right] \qquad (48)\\
&= d\,\frac{\partial}{\partial t}\log\sigma_t\sum_i q^i_t - \frac{A}{\sum_i q^i_t}\Big\|\sum_i\nabla q^i_t\Big\|^2 - \sigma^2_t\frac{\partial}{\partial t}\log\sigma_t\sum_i q^i_t\,\|\nabla\log q^i_t\|^2.
\end{aligned}$$

Thus, we have $A = -\sigma^2_t\frac{\partial}{\partial t}\log\sigma_t$.

For $\frac{\partial}{\partial t}\log\sigma_t = 0$, we simply check that the solution is

$$v_f = \frac{1}{\sum_i q^i_t}\sum_i q^i_t\,\frac{\partial f_t(x_i)}{\partial t}. \tag{51}$$

Indeed, the continuity equation turns into

$$\sum_i q^i_t\left[\left\langle\nabla\log q^i_t,\; v_f - \frac{\partial f_t(x_i)}{\partial t}\right\rangle + \langle\nabla, v_f\rangle\right] = 0. \tag{52}$$

From the candidate solution and from the continuity equation we can write $\sum_i q^i_t\,\langle\nabla, v_f\rangle$ in two different ways:

$$\begin{aligned}
\sum_i q^i_t\,\langle\nabla, v_f\rangle &= -\frac{1}{\sum_i q^i_t}\left\langle\sum_i\nabla q^i_t,\;\sum_i q^i_t\,\frac{\partial f_t(x_i)}{\partial t}\right\rangle + \sum_i\left\langle\nabla q^i_t, \frac{\partial f_t(x_i)}{\partial t}\right\rangle \qquad (53)\\
&= -\left\langle\sum_i\nabla q^i_t,\; v_f\right\rangle + \sum_i\left\langle\nabla q^i_t, \frac{\partial f_t(x_i)}{\partial t}\right\rangle,
\end{aligned}$$

which coincides with what (52) requires; thus, (51) is indeed a solution. Finally, unifying $v_\sigma$ and $v_f$, we have the full solution

$$v = -\frac{\sigma^2_t\frac{\partial}{\partial t}\log\sigma_t}{\sum_i q^i_t}\sum_i\nabla q^i_t + \frac{1}{\sum_i q^i_t}\sum_i q^i_t\,\frac{\partial f_t(x_i)}{\partial t}, \qquad q^i_t(x) = \mathcal{N}(x\mid f_t(x_i), \sigma^2_t), \tag{55}$$

or, equivalently, using $\nabla q^i_t = -q^i_t\,(x - f_t(x_i))/\sigma^2_t$,

$$v = \frac{1}{\sum_i q^i_t}\sum_i q^i_t\left[(x - f_t(x_i))\frac{\partial}{\partial t}\log\sigma_t + \frac{\partial f_t(x_i)}{\partial t}\right]. \tag{56}$$

For the architecture of the neural network parameterizing $s_t$, we follow (Salimans & Ho, 2021). In more detail, we parameterize $s_t(x)$ as $\|\mathrm{unet}(t, x) - x\|^2$, where $\mathrm{unet}(t, x)$ is the U-net architecture (Ronneberger et al., 2015). For the U-net, we follow (Song et al., 2020b), with the only difference being that we set the channel-multiplier parameter to 64 instead of 128, thus narrowing the architecture. We have to narrow the architecture because Action Matching requires taking the derivative w.r.t. the inputs at every iteration, which is a computational downside compared to Denoising Score Matching; otherwise, training a single model takes a week on 4 GPUs. We use the same U-net architecture for the baseline to parameterize $\nabla\log q_t$.

For diffusion, we take the VP-SDE from (Song et al., 2020b), which corresponds to

$$\alpha_t = \exp\Big(-\frac{1}{2}\int_0^t\beta(s)\,ds\Big), \qquad \sigma^2_t = 1 - \exp\Big(-\int_0^t\beta(s)\,ds\Big), \qquad \beta(s) = 0.1 + 19.9\,s.$$

For the other tasks, we take $\sigma_t = t$ and $\alpha_t = 1 - t$. All images are normalized to the interval $[-1, 1]$. For image generation on the torus, we first normalize the data such that every pixel lies in $[0.25, 0.75]$; this ensures that the shortest distance between the lowest and the largest pixel values is maximal on the circle $[0, 1]$.

Although Action Matching learns deterministic mappings, it is possible to learn one-to-many mappings by adding a small amount of noise to the data. For example, each row of Fig. 6 shows that Action Matching has learned to generate different colorizations from a single grayscale CIFAR-10 image, using different noise samples added to the grayscale image in Eq. (14).
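The full solution can be verified numerically. The sketch below (a toy 1-D mixture with hypothetical data points) checks that $v$ from Eq. (56) satisfies the continuity equation $\frac{\partial q_t}{\partial t} = -\partial_x(q_t v)$ up to finite-difference error; by the identity $\nabla q^i_t = -q^i_t(x - f_t(x_i))/\sigma^2_t$, the same check covers the form of Eq. (55).

```python
import numpy as np

data = np.array([-1.0, 0.5, 2.0])        # hypothetical dataset {x_i}
f = lambda t: data * np.sqrt(1.0 - t)    # f_t(x_i) = x_i sqrt(1-t)
sig = lambda t: np.sqrt(t)               # sigma_t = sqrt(t)

def q_comps(x, t):
    # q_t^i(x) = N(x | f_t(x_i), sigma_t^2), shape (len(x), len(data))
    s2 = sig(t) ** 2
    d = x[:, None] - f(t)[None, :]
    return np.exp(-d ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

def v(x, t):
    # Eq. (56): posterior-weighted average of per-component velocities
    q = q_comps(x, t)
    dlog_sigma = 1.0 / (2.0 * t)                  # d/dt log sqrt(t)
    df = -data / (2.0 * np.sqrt(1.0 - t))         # d/dt f_t(x_i)
    num = q * ((x[:, None] - f(t)[None, :]) * dlog_sigma + df[None, :])
    return num.sum(axis=1) / q.sum(axis=1)

t, h = 0.5, 1e-5
x = np.linspace(-4.0, 4.0, 2001)
dx = x[1] - x[0]

# Continuity residual dq/dt + d/dx (q v), via central finite differences
dq_dt = (q_comps(x, t + h).sum(1) - q_comps(x, t - h).sum(1)) / (2.0 * h)
div_flux = np.gradient(q_comps(x, t).sum(1) * v(x, t), dx)
resid = dq_dt + div_flux

print(np.abs(resid).max())  # ~0 up to discretization error
```

The residual shrinks with the grid spacing, as expected for a second-order finite-difference check.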

D.2 DETAILS ON THE SCHRÖDINGER EQUATION SIMULATION

For the initial state of the dynamics

$$i\,\frac{\partial}{\partial t}\psi(x, t) = -\frac{1}{\|x\|}\,\psi(x, t) - \frac{1}{2}\nabla^2\psi(x, t),$$

we take the following wavefunction

$$\psi(x, t = 0) \propto \psi_{3,2,-1}(x) + \psi_{2,1,0}(x), \qquad q^*_{t=0}(x) = |\psi(x, t = 0)|^2, \tag{58}$$

where $n, l, m$ are the quantum numbers and $\psi_{nlm}$ is the corresponding eigenstate of the Hamiltonian (see Griffiths & Schroeter (2018)). For all the details on sampling and the exact formulas for the initial state, we refer the reader to the code: github.com/action-matching. We evolve the initial state for $T = 14\cdot 10^3$ time units in the system of units where $\hbar = 1$, $m_e = 1$, $e = 1$, $\varepsilon_0 = 1$, collecting a dataset of samples from $q^*_t$. For the time discretization, we take $10^3$ steps; hence, we sample every $14$ time units.

To evaluate each method, we collect all the generated samples from the distributions $q_t$, $t\in[0, T]$, and compare them with the samples from the training data. For the distance metric, we measure the MMD (Gretton et al., 2012) between the generated samples and the training data at $10$ different timesteps $t = \frac{k}{10}T$, $k = 1, \ldots, 10$, and average the distance over the timesteps. For the Annealed Langevin Dynamics, we set the number of intermediate steps to $M = 5$ and select the step size $dt$ by minimizing the MMD using the exact scores $\nabla\log q_t(x)$.

For all methods, we use the same architecture: a multilayer perceptron with $5$ layers of $256$ hidden units each. The network $h(t, x)\colon\mathbb{R}\times\mathbb{R}^3\to\mathbb{R}^3$ takes $x\in\mathbb{R}^3$ and $t\in\mathbb{R}$ and outputs a $3$-d vector. For the score-based models, $h$ directly defines the score, while for Action Matching we use $s_t(x) = \|h(t, x) - x\|^2$ as the model, and the vector field is defined as $\nabla s_t(x)$.
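For completeness, the MMD evaluation can be sketched as follows. This is the standard unbiased estimator of the squared MMD with an RBF kernel (Gretton et al., 2012); the median-heuristic bandwidth is our assumption, and the exact kernel settings used in the experiments may differ.

```python
import numpy as np

def mmd2(X, Y, bandwidth=None):
    """Unbiased estimate of squared MMD with an RBF kernel.

    X, Y: sample arrays of shape (n, d) and (m, d).
    """
    def sq_dists(A, B):
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

    dxx, dyy, dxy = sq_dists(X, X), sq_dists(Y, Y), sq_dists(X, Y)
    if bandwidth is None:  # median heuristic (our assumption)
        bandwidth = np.sqrt(0.5 * np.median(dxy))
    g = 1.0 / (2.0 * bandwidth ** 2)
    n, m = len(X), len(Y)
    kxx = (np.exp(-g * dxx).sum() - n) / (n * (n - 1))  # drop the diagonal
    kyy = (np.exp(-g * dyy).sum() - m) / (m * (m - 1))
    kxy = np.exp(-g * dxy).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
same = mmd2(rng.standard_normal((500, 3)), rng.standard_normal((500, 3)))
diff = mmd2(rng.standard_normal((500, 3)),
            rng.standard_normal((500, 3)) + 2.0)
print(same, diff)  # near zero vs. clearly positive
```

In the evaluation above, this quantity would be computed between generated and training samples at each of the 10 timesteps and then averaged.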

Algorithm 3 Annealed Langevin Dynamics

Require: score model $s_t(x)$, step size $dt$, number of intermediate steps $M$
Require: initial samples $x^i_0\in\mathbb{R}^d$
for time steps $t\in(0, T]$ do
    set the target distribution $q_t$, such that $s_t(x)\approx\nabla\log q_t(x)$
    for intermediate steps $j\in 1, \ldots, M$ do
        $\varepsilon^i\sim\mathcal{N}(0, 1)$
        $x^i_t \leftarrow x^i_t + \frac{dt}{2}\,s_t(x^i_t) + \sqrt{dt}\,\varepsilon^i$
    end for
end for
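A minimal runnable sketch of Algorithm 3 on a toy 1-D dynamics with a known exact score ($q_t = \mathcal{N}(2t, 1)$, so $\nabla\log q_t(x) = -(x - 2t)$); the number of time steps, $M$, and $dt$ below are illustrative and not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dynamics: q_t = N(2t, 1) for t in (0, 1]; the exact score is available.
score = lambda x, t: -(x - 2.0 * t)

def annealed_langevin(x, n_times=100, M=5, dt=0.1):
    # Outer loop over time steps t; inner loop of M Langevin updates
    # targeting q_t, as in Algorithm 3.
    for t in np.linspace(1.0 / n_times, 1.0, n_times):
        for _ in range(M):
            eps = rng.standard_normal(x.shape)
            x = x + 0.5 * dt * score(x, t) + np.sqrt(dt) * eps
    return x

x0 = rng.standard_normal(10000)   # initial samples from q_0 = N(0, 1)
x1 = annealed_langevin(x0)
print(x1.mean(), x1.var())        # mean near 2, variance near 1
```

Note that with a finite step size the chain has a small bias (the stationary variance is slightly above 1, and the mean lags the moving target), which is why the experiments tune $dt$ by minimizing the MMD.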



Figure 2: Examples of different noising processes used for different vision tasks. At t = 0, we start from the data distribution. Depending on the task, the noising process gradually destroys all or part of the information in the data and replaces it with prior noise.

Figure 4: Faster convergence of Action Matching (AM) compared to Score Matching (SM), in terms of FID and the quality of generated samples, for the colorization task on CIFAR-10.

Figure 5: On the left, we demonstrate the performance of the compared algorithms in terms of the MMD averaged over the time of the dynamics; the MMD is measured between generated samples and the training data. On the right, we report the squared error of the score estimates for the score-based methods.

Figure 6: Illustration that Action Matching can learn one-to-many mappings using low-variance noise added to the image. Here, we sample different colorizations starting from the same grayscale input by adding different samples of noise.

Figure 13: Action Matching on CIFAR-10 for super-resolution.

Figure 14: Action Matching on CIFAR-10 for colorization.

Figure 15: Action Matching on CIFAR-10 on the torus.

Experimental results for Action Matching (AM) and Score Matching (SM) on computer vision tasks.

Note that for every choice of $p(t)$ we get an unbiased estimate of the original objective. Thus, we can design $p(t)$ to reduce the variance of the estimate of the middle (time-integral) term of the objective. In our experiments, we observed that simply taking $p(t)$ proportional to the standard deviation of the corresponding integrand significantly reduces the variance, i.e.,

$$p(t) \propto \sqrt{\mathbb{E}_{x\sim q_t}\big[(\zeta_t - \mathbb{E}_{x\sim q_t}\zeta_t)^2\big]},$$

where $\zeta_t$ denotes the integrand of the time integral.
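A toy illustration (our construction) of why this choice of $p(t)$ helps: for $\zeta_t(x) = t^2 x$ with $x\sim\mathcal{N}(0, 1)$, the uniform-time estimator of $\int_0^1\mathbb{E}[\zeta_t]\,dt$ has variance $\int_0^1 t^4\,dt = 1/5$, while sampling $t\sim p(t) = 3t^2$ (proportional to the standard deviation $t^2$) and reweighting by $1/p(t)$ gives variance $1/9$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000

# Integrand zeta_t(x) = t^2 * x, x ~ N(0,1): its std over x at time t is t^2,
# so the matched proposal is p(t) = 3 t^2 (normalized t^2 on [0, 1]).
x = rng.standard_normal(n)

# Uniform time sampling: estimator is zeta itself (p(t) = 1)
t_u = rng.uniform(size=n)
est_uniform = t_u ** 2 * x

# Importance sampling: t ~ p(t) = 3 t^2 via inverse CDF (t = u^(1/3)),
# estimator zeta / p(t)
t_p = rng.uniform(size=n) ** (1.0 / 3.0)
est_weighted = (t_p ** 2 * x) / (3.0 * t_p ** 2)

# Both are unbiased estimates of int_0^1 E[zeta_t] dt = 0, but the
# weighted estimator has lower variance (1/9 vs. 1/5).
print(est_uniform.var(), est_weighted.var())
```

For this zero-mean toy, the std-proportional proposal makes the reweighted estimator independent of $t$ entirely, which is the ideal case; in practice the variance is only estimated, hence the moving-average scheme described in Appendix B.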

