LEARNING DIFFUSION BRIDGES ON CONSTRAINED DOMAINS

Abstract

Diffusion models have recently achieved promising results in generative learning. However, because diffusion processes are most naturally defined on the unconstrained Euclidean space R^d, key challenges arise in developing diffusion-based models for learning data on constrained and structured domains. We present a simple and unified framework that can be easily adapted to various types of domains, including product spaces of any type (bounded/unbounded, continuous/discrete, categorical/ordinal, or their mix). In our model, the diffusion process is driven by a drift force that is a sum of two terms: a singular force designed by Doob's h-transform, which ensures that all outcomes of the process belong to the desired domain, and a non-singular neural force field, which is trained so that the outcome follows the data distribution statistically. Experiments show that our method performs strongly on generating tabular data, images, semantic segmentation maps, and 3D point clouds. Code is available at https://github.com/gnobitab/ConstrainedDiffusionBridge.



However, due to the continuous nature of diffusion processes, the standard approaches are restricted to generating unconstrained continuous data in R^d. For generating data constrained to specially structured domains, such as discrete or bounded data or mixes of them, special techniques, e.g., dequantization (Uria et al., 2013; Ho et al., 2019) and multinomial diffusion (Hoogeboom et al., 2021; Austin et al., 2021), need to be developed case by case, and the results still tend to be unsatisfying despite promising recent advances (Hoogeboom et al., 2021; Austin et al., 2021).


This work proposes a simple and unified framework for learning diffusion models on general constrained domains Ω embedded in the Euclidean space R^d. The idea is to learn a continuous R^d-valued diffusion process Z_t on the time interval t ∈ [0, T], with a carefully designed force field, such that the final state Z_T is guaranteed to 1) fall into the desired domain Ω, and 2) follow the data distribution asymptotically. We achieve both steps by leveraging a key tool in stochastic calculus called Doob's h-transform (Doob, 1984), which provides a formula for deriving diffusion processes whose final states are guaranteed to fall into a specific set or equal a specific value.

Algorithm 1 Learning Diffusion Models on Constrained Domains (a Simple Example)

Input: A dataset D := {x^(i)} drawn from distribution Π* on a domain Ω = {e_1, e_2, ..., e_K}.
Goal: Learn a diffusion model that terminates at time T and generates samples from Π*.
Learning: Solve the optimization below with stochastic gradient descent (or another optimizer):

θ* = argmin_θ ∫_0^T E_{x∼D} [ ∥ f_θ(Z_t, t) − ∇_{Z_t} log ω_Ω(x | Z_t, t) ∥² ] dt,

where

ω_Ω(x | z, t) = exp(−∥z − x∥² / (2(T − t))) / Σ_{e∈Ω} exp(−∥z − e∥² / (2(T − t))),
Z_t = (t/T) x + (1 − t/T) x_0 + sqrt(t(T − t)/T) ξ,

with x drawn from the dataset D, ξ ∼ N(0, I), and x_0 any initial point.
Sampling: Generate a sample Z_T from

dZ_t = [ f_{θ*}(Z_t, t) + ∇_{Z_t} log Σ_{e∈Ω} exp(−∥Z_t − e∥² / (2(T − t))) ] dt + dW_t,   Z_0 = x_0.

Remark: When the domain Ω is a manifold (e.g., a line or surface) in R^d, simply replace the sum Σ_{e∈Ω} with the corresponding line or surface (in general Hausdorff) integral ∫_Ω on Ω.

Our simple procedure can be applied to any domain Ω once a properly defined summation (for discrete sets) or integration (for continuous domains) can be evaluated. To give a quick overview of the practical intuition without invoking the mathematical theory, we show in Algorithm 1 a simple instance of the framework when the domain is a discrete set Ω = {e_1, ..., e_K}. The idea is to set up the diffusion model in the form

dZ_t = [ f_θ(Z_t, t) + ∇_{Z_t} ψ_Ω(Z_t, t) ] dt + dW_t,   ψ_Ω(z, t) := log Σ_{e∈Ω} exp(−∥z − e∥² / (2(T − t))),

where the drift is a sum of a non-singular (e.g., bounded) term f_θ(z, t), a trainable neural force field with parameter θ, and a singular term ∇_z ψ_Ω(z, t), which drives Z_t towards the set Ω as a gradient ascent on ψ_Ω(z, t). Here ψ_Ω(z, t) measures the closeness of z to the set Ω, as the log-likelihood of a Gaussian mixture model (GMM) centered on the elements of Ω with variance T − t. As t approaches the terminal time T, the variance T − t of the GMM goes to zero and the magnitude of ∇_z ψ_Ω(z, t) grows to infinity, hence ensuring that Z_T must belong to Ω.
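The three ingredients of Algorithm 1 (the softmax closeness weights ω_Ω, the singular drift ∇_z ψ_Ω, and the bridge interpolation used to build the regression target) are all cheap to compute for a discrete set. A minimal NumPy sketch, with function names of our own choosing:

```python
import numpy as np

def omega_weights(z, elements, t, T):
    """Softmax weights omega_Omega(e | z, t) over the discrete set Omega."""
    d2 = np.sum((elements - z) ** 2, axis=1)   # ||z - e||^2 for each e in Omega
    logits = -d2 / (2.0 * (T - t))
    logits -= logits.max()                     # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def singular_drift(z, elements, t, T):
    """grad_z psi_Omega(z, t) = sum_e omega_Omega(e | z, t) (e - z) / (T - t)."""
    w = omega_weights(z, elements, t, T)
    return w @ (elements - z) / (T - t)

def bridge_interpolation(x, x0, t, T, rng):
    """Sample Z_t = (t/T) x + (1 - t/T) x0 + sqrt(t (T - t) / T) xi, xi ~ N(0, I)."""
    xi = rng.standard_normal(np.shape(x))
    return (t / T) * np.asarray(x) + (1 - t / T) * np.asarray(x0) \
        + np.sqrt(t * (T - t) / T) * xi
```

As t → T the weights concentrate on the element nearest to z and the drift magnitude scales like 1/(T − t), illustrating the pull into Ω described above.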
In particular, note that

∇_z ψ_Ω(z, t) = Σ_{e∈Ω} ω_Ω(e | z, t) (e − z)/(T − t),   ω_Ω(e | z, t) = exp(−∥z − e∥²/(2(T − t))) / exp(ψ_Ω(z, t)),

which increases at an O(1/(T − t)) rate as t → T; here ω_Ω(e | z, t) is the softmax probability measuring the relative closeness of z to the elements e of Ω (see also Eq. (2)). As we show in Section 2.3, once f_θ is non-singular in the sense of the mild condition ∫_0^T E[∥f_θ(Z_t, t)∥²] dt < +∞, the diffusion model in (3) is guaranteed to yield a final state Z_T that belongs to Ω.

Outline: The rest of the paper is organized as follows. Section 2 introduces the h-transform, which allows us to derive bridge processes that are guaranteed to enter specific sets at the terminal time, and Section 2.3 specifies the parametric diffusion models for Ω-bridges. Then, with the learnable diffusion models, Section 3 introduces the general learning framework along with the loss function.

2. BACKGROUND: DIFFUSION PROCESSES AND h-TRANSFORM

A diffusion process Z = {Z_t : t ∈ [0, T]} on R^d follows a stochastic differential equation of the form

Q: dZ_t = b(Z_t, t) dt + σ(Z_t, t) dW_t,

where W_t is a Wiener process, σ : [0, T] × R^d → R is a positive diffusion coefficient, and b : [0, T] × R^d → R^d is a drift function. We use Q (or P) to denote the path measure of the stochastic process Z, which is a probability measure on the space of continuous paths. Let Q_t be the marginal distribution of Z_t at time t under Q. Our framework relies heavily on bridge processes, special stochastic processes that are guaranteed to reach a deterministic value or fall into a given set at the terminal time T.

☞ For a set Ω ⊆ R^d, a process Z in R^d with law Q is called an Ω-bridge if Q(Z_T ∈ Ω) = 1.

One natural approach to constructing bridge processes is to derive the conditioned process of a general unconstrained process given that the desired bridge constraint holds. Specifically, assume that Q is the law of a general unconstrained diffusion process of the form (4), and denote by Q^Ω(·) = Q(· | Z_T ∈ Ω) the conditioned distribution given that the event Z_T ∈ Ω happens. Then Q^Ω is guaranteed to be an Ω-bridge by definition. Importantly, a remarkable result of Doob (Doob, 1984), now known as the h-transform, shows that Q^Ω is the law of a diffusion process with a properly modified drift term. Below, we introduce this result, first for the x-bridge case when Ω = {x} contains a single point, and then for more general sets Ω. For simplicity, we only state the formulas from the h-transform that are useful for us, without proofs. See, e.g., Oksendal (2013); Rogers & Williams (2000) for more background on the h-transform.

2.1. x-BRIDGES

Let us first consider the x-bridge Q^x(·) := Q(· | Z_T = x), the process Q pinned to a deterministic terminal point Z_T = x. By the h-transform (see, e.g., Oksendal (2013)), the conditioned process Q^x, if it exists, can be shown to be the law of

dZ_t = [ b(Z_t, t) + σ²(Z_t, t) ∇_z log q_{T|t}(x | Z_t) ] dt + σ(Z_t, t) dW_t,   (5)

where q_{T|t}(x | z) is the density of the transition probability Q_{T|t}(dx | z) := Q(Z_T ∈ dx | Z_t = z), with dx denoting an infinitesimal volume centered around x. Compared with the diffusion process (4) of Q, the main difference is that the conditioned process has an additional drift force σ²(z, t) ∇_z log q_{T|t}(x | z), which plays the role of steering Z_t towards the target Z_T = x; this is a singular force whose magnitude increases to infinity as t → T, because q_{T|t}(· | z) is a delta measure centered at z when t = T. In addition, by Bayes' rule, the distribution of the initial state Z_0 should be given by

Z_0 ∼ Q_{0|T}(· | x),   Q_{0|T}(dz | x) ∝ Q_0(dz) q_{T|0}(x | z).   (6)

Example 2.1. If Q is the law of dZ_t = σ_t dW_t, we have Q_{T|t}(· | z) = N(z, β_T − β_t), where β_t = ∫_0^t σ_s² ds. Hence, following the formula in (5), Q^x := Q(· | Z_T = x) is the law of

dZ_t = σ_t² (x − Z_t)/(β_T − β_t) dt + σ_t dW_t,   (7)

with Z_0 ∼ Q_{0|T}(dz) ∝ Q_0(dz) ϕ(x | z, β_T − β_0), where ϕ(· | µ, σ²) is the density of N(µ, σ²). The process in (7) is known as a (time-scaled) Brownian bridge. Note that the drift in (7) grows to infinity in magnitude at a rate O(1/(β_T − β_t)) as t → T, which ensures that Z_T = x with probability one.

Arbitrary initialization: For (5) to be the conditioned process of (4), the initial distribution must follow the Bayes' rule in (6). However, thanks to the singular force ∇_z log q_{T|t}(x | z), the process (5) guarantees Z_T = x from an arbitrary initialization, as long as the process is well defined.
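The pinning behavior of the Brownian bridge (7) is easy to verify numerically. Below is a minimal Euler-Maruyama sketch with σ_t = 1 (so β_t = t); the final state lands on x even from an arbitrary starting point, illustrating the remark above:

```python
import numpy as np

def simulate_x_bridge(x, z0, T=1.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of the Brownian bridge in Eq. (7) with
    sigma_t = 1 (beta_t = t): dZ_t = (x - Z_t)/(T - t) dt + dW_t."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = np.asarray(z0, dtype=float).copy()
    for k in range(n_steps):
        t = k * dt
        # Singular drift pulls z toward x with strength 1/(T - t).
        z = z + (x - z) / (T - t) * dt + np.sqrt(dt) * rng.standard_normal(z.shape)
    return z
```

With 1000 steps, the terminal state differs from x only by the last noise increment of order sqrt(dt), regardless of where z0 is placed.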
When the initialization differs from (6), the process in (5) is no longer the conditioned process of (4), but it remains an x-bridge in that Z_T = x is still guaranteed. To see why, assume that Q is initialized from a deterministic point Z_0 = x_0. Then we would still have Z_0 = x_0 when conditioning on Z_T ∈ Ω by Bayes' rule. This suggests that (5) starting from any deterministic initialization is the conditioned process of Q with the same deterministic initialization, and hence is an x-bridge. As a result, (5) from any stochastic initialization is also an x-bridge, because it can be viewed as a mixture of processes with different deterministic initializations, all of which are x-bridges. See Appendix A.4 for a detailed analysis, in which it is shown that (5) with an arbitrary initialization can be viewed as the conditioned process of a special class of non-Markov processes called reciprocal processes.

2.2. Ω-BRIDGES

More generally, for the law Q of (4) and a set Ω ⊆ R^d, the Ω-bridge Q^Ω := Q(· | Z_T ∈ Ω) follows

Q^Ω: dZ_t = η^Ω(Z_t, t) dt + σ(Z_t, t) dW_t,  with  η^Ω(z, t) = b(z, t) + σ²(z, t) E_{x∼Q_{T|t,z,Ω}}[∇_z log q_{T|t}(x | z)],  Z_0 ∼ Q_{0|T}(· | Z_T ∈ Ω),   (8)

where the drift force η^Ω is similar to that of the x-bridge in (5), except that the final state x is now randomly drawn from an Ω-truncated (or Ω-conditioned) transition probability:

Q_{T|t,z,Ω}(dx | z) := Q(Z_T ∈ dx | Z_t = z, Z_T ∈ Ω),

which is the transition probability from Z_t to Z_T, conditioned on Z_T ∈ Ω. In practice, its form can be derived using Bayes' rule.

Example 2.2. Assume Q follows dZ_t = σ_t dW_t. Then Q^Ω yields the following Ω-bridge:

dZ_t = η^Ω(Z_t, t) dt + σ_t dW_t,   η^Ω(z, t) = σ_t² E_{x∼N_Ω(z, β_T−β_t)}[(x − z)/(β_T − β_t)],   (9)

where N_Ω(µ, σ²) = Law(Z | Z ∈ Ω) with Z ∼ N(µ, σ²) is the Ω-truncated Gaussian distribution, whose density is ϕ_Ω(x) ∝ I(x ∈ Ω) ϕ(x | µ, σ²), with ϕ(x | µ, σ²) the density of N(µ, σ²). Note that η^Ω is tractable to calculate once we can evaluate the expectation of N_Ω(z, β_T − β_t). A general case is when Ω = I_1 × ··· × I_d is a product of one-dimensional sets, in which case the expectation factorizes across the coordinates. As in the x-bridge case, we can set the initialization to be any distribution supported on the set of points that can reach Ω following Q (precisely, points z_0 satisfying Ω ∩ supp(Q_T(· | Z_0 = z_0)) ≠ ∅), using the same mixture-of-initializations argument.
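For a one-dimensional interval domain Ω = [a, b] with the Brownian baseline of Example 2.2, the expectation in (9) is the mean of a truncated Gaussian, which has a standard closed form. A minimal sketch (function names are ours; σ_t = 1 is assumed, so β_T − β_t = T − t):

```python
import math

def _phi(u):
    """Standard normal PDF."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def _Phi(u):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def truncated_normal_mean(mu, var, a, b):
    """Mean of N(mu, var) truncated to [a, b] (standard closed form)."""
    s = math.sqrt(var)
    alpha, beta = (a - mu) / s, (b - mu) / s
    mass = _Phi(beta) - _Phi(alpha)
    return mu + s * (_phi(alpha) - _phi(beta)) / mass

def eta_omega_interval(z, t, a, b, T=1.0):
    """Omega-bridge drift of Eq. (9) for Omega = [a, b] with sigma_t = 1:
    eta(z, t) = E_{x ~ N_[a,b](z, T - t)}[(x - z) / (T - t)]."""
    var = T - t
    return (truncated_normal_mean(z, var, a, b) - z) / var
```

The drift vanishes when z sits symmetrically inside the interval and points back into [a, b] when z is outside, matching the intuition that the singular force steers Z_t into Ω. For a product domain Ω = I_1 × ··· × I_d, the same computation would apply coordinate-wise.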

2.3. A PARAMETRIC FAMILY OF Ω-BRIDGES

The formula in (8) only provides a fixed process for a given Q. For the purpose of learning generative models, however, we need a rich family of Ω-bridges within which we can search for the one that best fits the data distribution. It turns out we can achieve this by simply adding an extra non-singular drift force, which can be a trainable neural network, on top of the Ω-bridge in (8). Specifically, we construct the following parametric diffusion model P^θ:

P^θ: dZ_t = (σ(Z_t, t) f_θ(Z_t, t) + η^Ω(Z_t, t)) dt + σ(Z_t, t) dW_t,   Z_0 ∼ P^θ_0,   (10)

where f_θ(z, t) is a neural network with input (z, t) and parameter θ, which will be trained from the empirical observations. Adding the neural drift term σ(Z_t, t) f_θ(Z_t, t) does not break the Ω-bridge condition, as long as it satisfies a very mild regularity condition:

Proposition 2.3. For any Q^Ω following dZ_t = η^Ω(Z_t, t) dt + σ(Z_t, t) dW_t that is an Ω-bridge, the P^θ in (10) is also an Ω-bridge if E_{Z∼Q^Ω}[∫_0^T ∥f_θ(Z_t, t)∥²₂ dt] < +∞ and KL(Q^Ω_0 || P^θ_0) < +∞.

The condition on f_θ is very mild: it is satisfied if f_θ is bounded, and it can easily hold even when f_θ is unbounded. For example, assuming that ∥f_θ(x)∥ ≤ a∥x∥^β + b, which holds for ReLU networks with β = 1, we only need the underlying process to have a bounded moment E_{Z∼Q^Ω}[∫_0^T ∥Z_t∥^{2β} dt] < +∞, which is a typical regularity condition to expect.
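Proposition 2.3 is easy to probe numerically in the x-bridge special case: adding a bounded extra drift (here an arbitrary tanh force of our own choosing) on top of the singular bridge force still yields Z_T = x, because the O(1/(T − t)) bridge term eventually dominates any bounded perturbation. A 1D sketch:

```python
import numpy as np

def simulate_perturbed_bridge(x, z0, f, T=1.0, n_steps=2000, seed=0):
    """Simulate dZ_t = (f(Z_t, t) + (x - Z_t)/(T - t)) dt + dW_t.
    A bounded extra drift f cannot overcome the singular bridge force,
    so the bridge property Z_T = x is preserved (cf. Proposition 2.3)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = float(z0)
    for k in range(n_steps):
        t = k * dt
        z += (f(z, t) + (x - z) / (T - t)) * dt \
            + np.sqrt(dt) * rng.standard_normal()
    return z
```

Running this with, e.g., f(z, t) = 5 tanh(z) still lands within a last-step noise increment of x, while an unbounded force growing faster than 1/(T − t) near the endpoint could break the bridge.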

3. LEARNING Ω-BRIDGE MODELS

Let {x^(i)}_{i=1}^n be an i.i.d. sample from an unknown distribution Π* on a domain Ω ⊆ R^d. Our goal is to learn the parameter θ of the Ω-bridge model P^θ in (10) such that the terminal distribution Z_T ∼ P^θ_T matches the data X ∼ Π*. We should distinguish P^θ, which is the trainable generative model, from Q, which is a fixed "baseline process" that helps us derive methods for constructing and learning the model. Q can be the simple Brownian motion of Examples 2.1 and 2.2.

(Figure 1: schematic over time t, showing bridge trajectories Z^x ∼ Q^x pinned to data points x ∼ Π* during training, and the learned model P^θ with terminal distribution P^θ_T ≈ Π*.)

As with other diffusion models, P^θ can be viewed as a model with an infinite-dimensional latent variable, the intermediate trajectory of Z. Hence, a canonical learning approach is expectation maximization (EM), which alternates between 1) E-step: estimating the posterior P^{θ,x} := P^θ(Z | Z_T = x) of the latent trajectory Z given the observation Z_T = x; and 2) M-step: estimating the parameter θ with Z imputed from P^{θ,x}. A key challenge, however, is that the posterior distribution P^{θ,x} is difficult to calculate due to the presence of the neural force field in P^θ (the h-transform formula has no closed form), and it needs to be iteratively updated as θ changes. Following DDPM (Ho et al., 2020), we consider a simpler approach that replaces the posterior P^{θ,x} with an arbitrary x-bridge, denoted Q^x. This yields a simplified EM algorithm without the expensive posterior inference in the E-step.
A natural choice is the conditioned process Q^x := Q(Z | Z_T = x), but the method works for a general x-bridge. Specifically, let Q^{Π*}(·) = ∫ Q^x(·) Π*(dx) be the mixture of x-bridges whose end point x is randomly drawn from the data distribution x ∼ Π*. Trajectories from Q^{Π*} can be generated in the following "backward" way: first draw a data point x ∼ Π*, and then draw Z ∼ Q^x conditioned on the end point x. Obviously, by construction, the terminal distribution of Q^{Π*} equals Π*, that is, Q^{Π*}_T = Π*. Then, the model P^θ can be estimated by fitting data drawn from Q^{Π*} using maximum likelihood estimation:

min_θ L(θ) := KL(Q^{Π*} || P^θ).   (11)

Classical (variational) EM would alternately update θ (M-step) and Q^x (E-step) to make Q^x ≈ P^{θ,x}. Why is it OK to simply drop the E-step? At a high level, it is the benefit of using universal approximators like deep neural networks: if the model space of P^θ is sufficiently rich, then by minimizing the KL divergence in (11), P^θ can approximate the given Q^{Π*} well enough (in a way made precise in Appendix A) that their terminal distributions are close: P^θ_T ≈ Q^{Π*}_T = Π*.

☞ Learning latent variable models requires no E-step if the model space is sufficiently rich.

We should note that in this case the latent variable Z in the learned model P^θ is dictated by the choice of the imputation distribution Q, since we have P^{θ,x} = Q^x when the KL divergence in (11) is fully minimized to zero; EM also achieves P^{θ,x} = Q^x, but with the imputation distribution Q^x determined by the model P^θ, not the other way around.

Loss function: In its general form, the x-bridge Q^x that we use can be a non-Markov diffusion process

Q^x: dZ_t = η^x(Z, t) dt + σ(Z_t, t) dW_t,   Z_0 ∼ µ^x,   (12)

which has the same diffusion coefficient σ(Z_t, t) as P^θ in (10), and any x-dependent η^x and initialization µ^x, as long as the x-bridge condition is ensured.
In this general framework, we allow η^x to depend on the whole trajectory Z.

Algorithm 2 Learning Ω-Bridge Diffusion Models

Input: A dataset D := {x^(i)} drawn from distribution Π* on a domain Ω.
Setup: Specify an x-bridge Q^x and an Ω-bridge Q^Ω:

Q^x: dZ_t = η^x(Z, t) dt + σ(Z_t, t) dW_t,   Q^Ω: dZ_t = η^Ω(Z_t, t) dt + σ(Z_t, t) dW_t.

Specify the generative model P^θ based on Q^Ω and a neural network f_θ:

P^θ: dZ_t = (σ(Z_t, t) f_θ(Z_t, t) + η^Ω(Z_t, t)) dt + σ(Z_t, t) dW_t,   Z_0 ∼ P^θ_0.

Default: let Q be the law of dZ_t = σ_t dW_t and derive the bridges by the h-transform, as Q^x = Q(· | Z_T = x) in Eq. (7) and Q^Ω = Q(· | Z_T ∈ Ω) in Eq. (9).
Training: Estimate θ by minimizing the loss function (13) using any off-the-shelf optimizer.
Sampling: Generate a sample Z_T from P^θ with the trained parameter θ.

Using the Girsanov theorem (e.g., Oksendal, 2013), with P^θ in (10) and Q^x in (12), the KL divergence in (11) can be shown to equal

L(θ) = E_{x∼Π*, Z∼Q^x} [ −log p^θ_0(Z_0)  (MLE of initial dist.)  +  (1/2) ∫_0^T ∥σ^{-1}(Z_t, t)(s_θ(Z_t, t) − η^x(Z, t))∥²  (score matching)  dt ] + const,   (13)

where we write s_θ for the overall drift force of P^θ in (10), that is, s_θ(z, t) = σ(z, t) f_θ(z, t) + η^Ω(z, t), and p^θ_0 is the probability density function (PDF) of the initial distribution P^θ_0. Therefore, L(θ) is a sum of the negative log-likelihood of the initial distribution, which encourages P^θ_0 ≈ Q^{Π*}_0, and a least-squares loss between s_θ and η^x. In practice, we simply fix the initial distribution P^θ_0 to be a delta measure on a fixed point (say x_0 = 0), so we only need to train the drift function f_θ. Algorithm 1 is the instance of this framework in which the baseline process Q is the standard Brownian motion dZ_t = dW_t, Q^x = Q(· | Z_T = x), and σ(z, t) = 1. Note that the least-squares term in (13) can be viewed as enforcing f_θ ≈ σ^{-1}(η^x − η^Ω), which reduces to f_θ ≈ ∇ log ω_Ω in the case of Algorithm 1.

Related works: Bridge processes provide a simple and flexible approach to learning diffusion generative models, which was explored in Peluchetti (2021); Ye et al.
(2022); Wu et al. (2022); De Bortoli et al. (2021). Heng et al. (2021) investigates the orthogonal problem of simulating from the bridge Q^x for a given Q. In comparison, our method learns diffusion models on any general domain Ω for which an Ω-bridge can be derived (using the h-transform or any other method), and hence provides a highly flexible framework for learning with structured data (including discrete, continuous, and their mixes). This distinguishes it from existing approaches that are designed for special types of data (e.g., Ho et al., 2022), which proposes to learn first hitting diffusion models for generating data on both discrete sets and spheres. The advantage of our approach is that it is simpler and easier to derive for more complex types of domains.
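The least-squares term of Eq. (13) is straightforward to estimate by Monte Carlo under the default Brownian choices (σ = 1), where the target for f_θ is η^x − η^Ω with η^x(z, t) = (x − z)/(T − t), and Z_t is drawn by the bridge interpolation of Algorithm 1. A minimal sketch, assuming f_theta and eta_omega are passed in as callables (in practice f_theta is the neural network being trained; the names are ours):

```python
import numpy as np

def bridge_loss(f_theta, eta_omega, data, x0, T=1.0, n_times=20, seed=0):
    """Monte Carlo estimate of the least-squares term of Eq. (13) with
    sigma = 1: E || f_theta(Z_t, t) - (eta^x(Z_t, t) - eta^Omega(Z_t, t)) ||^2,
    where eta^x(z, t) = (x - z)/(T - t) is the Brownian x-bridge drift (7)
    and Z_t is sampled by the bridge interpolation of Algorithm 1."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for x in data:
        x = np.asarray(x, dtype=float)
        for _ in range(n_times):
            t = rng.uniform(0.0, 0.99 * T)      # avoid the singular endpoint t = T
            xi = rng.standard_normal(x.shape)
            z = (t / T) * x + (1 - t / T) * x0 + np.sqrt(t * (T - t) / T) * xi
            target = (x - z) / (T - t) - eta_omega(z, t)
            total += np.sum((f_theta(z, t) - target) ** 2)
    return total / (len(data) * n_times)
```

In an actual training loop, this scalar would be minimized over θ with a stochastic optimizer such as Adam; the sketch only illustrates how the regression target is assembled from the two bridge drifts.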

4. EXPERIMENTS

We evaluate our algorithms on generating mixed-type tabular data, grid-valued point clouds, categorical semantic segmentation maps, and discrete CIFAR10 images. We observe that the Ω-bridge provides a particularly attractive and effective approach to generating data from various constrained domains.

Algorithm overview: For all experiments, we use Algorithm 2 with the default choices of Q^x in (7) and Q^Ω in (9). The specific form of η^Ω is derived from the specific choice of the domain Ω. By default, we set the initialization Z_0 = 0 and use the Adam optimizer.

Methods                          Logistic (↑)    AdaBoost (↑)    MLP (↑)
Real Training Data               0.877±0.021     0.912±0.013     0.897±0.012
TVAE (Xu et al., 2019)           0.825±0.012     0.876±0.005     0.845±0.008
CTGAN (Xu et al., 2019)          0.649±0.014     0.841±0.021     0.843±0.016
CopulaGAN (Patki et al., 2016)

4.1. GENERATING MIXED-TYPE TABULAR DATA

Learning to generate tabular data is challenging because tabular data usually contains a mixture of discrete and continuous attributes (Xu et al., 2019; Park et al., 2018). Unlike previous works that carefully design special GANs (Xu et al., 2019; Srivastava et al., 2017), the Ω-bridge can be seamlessly applied to mixed-type tabular data generation without any further modification. In contrast, diffusion processes that work solely on discrete domains (Austin et al., 2021; Hoogeboom et al., 2021) cannot be applied to this task. In this experiment, we use the Adult Income dataset (Kohavi, 1996), which contains 30,162 training samples. The data points are described by a series of attributes, both continuous (age, capital-gain, etc.) and discrete (sex, race, etc.). We compare with conditional tabular GAN (CTGAN) (Xu et al., 2019), CopulaGAN (Patki et al., 2016), and Table VAE (TVAE) (Xu et al., 2019), which are state-of-the-art GAN-based and VAE-based generative models for mixed-type tabular data. Following previous works (Xu et al., 2019; Patki et al., 2016), we measure the accuracy on the real data of a logistic regression, an AdaBoost classifier, and an MLP classifier when trained on the generated data. In the Ω-bridge model, we set σ_t = 3 exp(−3t) and f_θ to a 3-layer MLP. In this case, Ω = I_1 × ··· × I_15, where each I_j is the domain of one attribute.

Result: All three classifiers yield the highest accuracy when trained on the data generated by our method, referred to as Mixed-Bridge in this case. This result indicates that the data generated by Mixed-Bridge is closer to the real distribution than that of the baseline methods.

4.2. GENERATING INTEGER-VALUED POINT CLOUDS

A feature of point clouds of 3D objects in graphics is that their points tend to be distributed evenly, especially if they are discretized from a mesh. This aspect is neglected in most existing works on point cloud generation; as a result, they tend to generate non-uniform points that are unsuitable for real applications, which often involve converting back to meshes with procedures like Ball-Pivoting (Bernardini et al., 1999). We apply our method to generate point clouds constrained to an integer grid, which we show yields much more uniformly distributed points. To the best of our knowledge, ours is the first work on integer-valued 3D point cloud generation.

(Table: test NLL in bits per dimension, estimated with ELBO and IWBO; see Section 4.3.)
Methods                                                   ELBO (↓)   IWBO (↓)
Uniform Dequantization (Uria et al., 2013)                1.010      0.930
Variational Dequantization (Ho et al., 2019)              0.334      0.315
Argmax Flow (Softplus thres.) (Hoogeboom et al., 2021)    0.303      0.290
Argmax Flow (Gumbel distr.) (Hoogeboom et al., 2021)      0.365      0.341
Argmax Flow (Gumbel thres.) (Hoogeboom et al., 2021)      0.307      0.287
Multinomial Diffusion (Hoogeboom et al., 2021)            0…

A point cloud is a set of points {x_i}_{i=1}^m, x_i ∈ R^3, in 3D space, where m is the number of points. We apply two variants of our method: R^d-Bridge and Grid-Bridge. R^d-Bridge generates points in the continuous 3D space, i.e., Ω_R = R^{3m}. Grid-Bridge generates points on integer grids, Ω_Grid = {1, ..., 128}^{3m}. We fix the diffusion coefficient σ_t = 1. The number of diffusion steps K is set to 1000. We test our method on ShapeNet (Chang et al., 2015) chair models, and compare it with Point Cloud Diffusion (PCD) (Luo & Hu, 2021a), a state-of-the-art continuous diffusion-based generative model for point clouds. The neural network f_θ in our methods is the same as that of PCD for fair comparison. Qualitative and quantitative results are shown in Figure 2 and Table 2. Following common practice (Luo & Hu, 2021a,b), we measure the minimum matching distance (MMD), coverage score (COV), and 1-NN accuracy (1-NNA) using the Chamfer Distance (CD) against the test dataset.

Result: Both R^d-Bridge and Grid-Bridge achieve better MMD, COV, and 1-NNA than PCD. Moreover, by constraining the domain of interest to the integer grid, Grid-Bridge yields even better performance than R^d-Bridge.
In Figure 2, since the point clouds generated by Grid-Bridge are restricted to integer grids, the meshes reconstructed by Ball-Pivoting have clearly higher quality than those from R^d-Bridge and PCD.

4.3. GENERATING SEMANTIC SEGMENTATION MAPS ON CITYSCAPES

We consider unconditionally generating categorical semantic segmentation maps. We represent each pixel as a one-hot categorical vector; hence the data domain is Ω = {e_1, ..., e_c}^{h×w}, where c is the number of classes, e_i is the i-th c-dimensional one-hot vector, and h, w are the height and width of the image. In CityScapes (Cordts et al., 2016), h = 32, w = 64, c = 8. In this experiment, we test different schedules of the diffusion coefficient σ_t: (Constant Noise) σ_t = 1; (Noise Decay A) σ_t = a exp(−bt); (Noise Decay B) σ_t = a(1 − t); (Noise Decay C) σ_t = a(1 − exp(−b(1 − t))). Here a and b are hyperparameters. The number of diffusion steps K is set to 500. We measure the negative log-likelihood (NLL) of the test set under the learned models. The NLL (bits per dimension) is estimated with the evidence lower bound (ELBO) and the importance weighted bound (IWBO) (Burda et al., 2016), respectively, as in Hoogeboom et al. (2021). We compare the Ω-Bridge with a state-of-the-art categorical diffusion algorithm, Argmax Flow (and Multinomial Diffusion) (Hoogeboom et al., 2021), and the traditional methods, uniform dequantization (Uria et al., 2013) and variational dequantization (Ho et al., 2019). The numerical results of the baselines are adopted directly from Hoogeboom et al. (2021), and the experiment configuration is kept the same for fair comparison. The neural network f_θ is the same as in Hoogeboom et al. (2021). The results are shown in Figure 3 and Table 3. Our Ω-bridge is named Categorical-Bridge (Cat.-Bridge) in this experiment.

(Table: discrete CIFAR10 results.)
Methods                                        IS (↑)   FID (↓)   NLL (↓)
D3PM uniform L_vb (Austin et al., 2021)        5.99     51.27     5.08
D3PM absorbing L_vb (Austin et al., 2021)      6.26     41.28     4.83
D3PM Gauss L_vb (Austin et al., 2021)          7.75     15.30     3.966
D3PM Gauss L_{λ=0.001} (Austin et al., 2021)   8…

Result: We observe that all four variants of Cat.-Bridge can successfully generate categorical semantic segmentation maps, and different noise schedules result in different empirical performance.
Among the four variants of Cat.-Bridge, Cat.-Bridge with Noise Decay A yields the best ELBO and IWBO, surpassing all the other algorithms in comparison. 
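For concreteness, the four noise schedules compared above can be written as simple functions of $t$. This is a minimal sketch under our own naming conventions (the function and argument names are not from the released code); $a$ and $b$ are the hyper-parameters mentioned in the text.

```python
import math

def make_sigma(schedule, a=1.0, b=1.0):
    """Return sigma_t as a function of t in [0, 1] for the four schedules."""
    if schedule == "constant":   # Constant Noise: sigma_t = 1
        return lambda t: 1.0
    if schedule == "decay_a":    # Noise Decay A: sigma_t = a * exp(-b t)
        return lambda t: a * math.exp(-b * t)
    if schedule == "decay_b":    # Noise Decay B: sigma_t = a * (1 - t)
        return lambda t: a * (1.0 - t)
    if schedule == "decay_c":    # Noise Decay C: sigma_t = a * (1 - exp(-b (1 - t)))
        return lambda t: a * (1.0 - math.exp(-b * (1.0 - t)))
    raise ValueError(schedule)

sigma = make_sigma("decay_a", a=3.0, b=3.0)
```

Note that all three decaying schedules shrink the noise magnitude as $t \to 1$ ($T = 1$ in the experiments), while Constant Noise keeps it fixed.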

5. CONCLUSION AND LIMITATIONS

We present a framework for learning diffusion generative models on constrained data domains. It leaves a number of directions for further exploration and improvement. For example, the practical impact of the choice of the bridge $\mathbb{Q}$, in terms of initialization, dynamics, and noise schedule, is still not well understood and needs more systematic study. Besides, our current method is limited to domains $\Omega$ that are factorizable and integrable. Moreover, applications of the $\Omega$-bridge to many other practical fields remain to be investigated in future work.

Roadmap: The appendix is structured as follows:
• Appendix A provides the theoretical analysis and derivation of diffusion bridges. In particular, Appendix A.1 derives the main training loss; Appendix A.2 derives the drift term $\eta^{\Pi^*}$ of $\mathbb{Q}^{\Pi^*}$; Appendix A.3 proves that we can use a Markov model $\mathbb{P}^\theta$ to match all time-marginals of $\mathbb{Q}^{\Pi^*}$; Appendix A.4 explains why we can use an arbitrary initialization when constructing bridge processes and discusses the reciprocal structure of $\mathbb{Q}^{\Pi^*}$; Appendix A.5 analyzes the time-discretization error and statistical error of the practical discretized algorithm; Appendix A.6 and A.7 present details on the conditions and examples of $\Omega$-bridge construction.
• Appendix B presents additional experiment details and results.

A THEORETICAL ANALYSIS ON BRIDGES

A.1 DERIVATION OF THE MAIN LOSS IN EQUATION (13)

[Proof of Equation (13)] Denote by $\mathbb{Q}^x = \mathbb{Q}(\cdot \mid Z_T = x)$. Note that
\[
\mathrm{KL}(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}^\theta)
= \mathbb{E}_{x\sim \Pi^*, Z\sim \mathbb{Q}^x}\Big[\log \frac{\mathrm{d}\mathbb{Q}^{\Pi^*}}{\mathrm{d}\mathbb{P}^\theta}(Z)\Big]
= \mathbb{E}_{x\sim \Pi^*, Z\sim \mathbb{Q}^x}\Big[\log \frac{\mathrm{d}\mathbb{Q}^{x}}{\mathrm{d}\mathbb{P}^\theta}(Z) + \log \frac{\mathrm{d}\mathbb{Q}^{\Pi^*}}{\mathrm{d}\mathbb{Q}^{x}}(Z)\Big]
= \mathbb{E}_{x\sim \Pi^*}\big[\mathrm{KL}(\mathbb{Q}^{x} \,\|\, \mathbb{P}^\theta)\big] + \mathrm{const},
\]
where const denotes a constant that is independent of $\theta$. Recall that $\mathbb{Q}^x$ follows $\mathrm{d}Z_t = \eta^x(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$, and $\mathbb{P}^\theta$ follows $\mathrm{d}Z_t = s^\theta(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$. By the Girsanov theorem (e.g., Lejay, 2018),
\[
\mathrm{KL}(\mathbb{Q}^{x} \,\|\, \mathbb{P}^\theta)
= \mathrm{KL}(\mathbb{Q}^{x}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\,\mathbb{E}_{Z\sim\mathbb{Q}^x}\Big[\int_0^T \big\| s^\theta(Z_t,t) - \eta^x(Z_{[0,t]},t)\big\|_2^2 \,\mathrm{d}t\Big]
= \mathbb{E}_{Z\sim\mathbb{Q}^x}\Big[-\log p_0^\theta(Z_0) + \frac{1}{2}\int_0^T \big\| s^\theta(Z_t,t) - \eta^x(Z_{[0,t]},t)\big\|_2^2\, \mathrm{d}t\Big] + \mathrm{const}.
\]
Hence
\[
\mathcal{L}(\theta) = \mathbb{E}_{x\sim\Pi^*, Z\sim\mathbb{Q}^x}\Big[-\log p_0^\theta(Z_0) + \frac{1}{2}\int_0^T \big\| s^\theta(Z_t,t) - \eta^x(Z_{[0,t]},t)\big\|_2^2 \,\mathrm{d}t\Big] + \mathrm{const}
= \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[-\log p_0^\theta(Z_0) + \frac{1}{2}\int_0^T \big\| s^\theta(Z_t,t) - \eta^{Z_T}(Z_{[0,t]},t)\big\|_2^2\, \mathrm{d}t\Big] + \mathrm{const}.
\]

A.2 DERIVATION OF THE DRIFT $\eta^{\Pi^*}$ OF $\mathbb{Q}^{\Pi^*}$

Lemma A.1. Let $\mathbb{Q}^x$ be the law of $\mathrm{d}Z^x_t = \eta^x(Z^x_{[0,t]}, t)\mathrm{d}t + \sigma(Z^x_t, t)\mathrm{d}W_t$, $Z_0 \sim \mathbb{Q}^x_0$, and let $\mathbb{Q}^{\Pi^*} := \int \mathbb{Q}^x(Z)\,\Pi^*(\mathrm{d}x)$ for a distribution $\Pi^*$ on $\mathbb{R}^d$. Then $\mathbb{Q}^{\Pi^*}$ is the law of $\mathrm{d}Z_t = \eta^{\Pi^*}(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$, $Z_0 \sim \mathbb{Q}^{\Pi^*}_0$, where
\[
\eta^{\Pi^*}(z_{[0,t]}, t) = \mathbb{E}_{x\sim\Pi^*, Z\sim\mathbb{Q}^x}\big[\eta^x(Z_{[0,t]}, t) \mid Z_{[0,t]} = z_{[0,t]}\big], \qquad
\mathbb{Q}^{\Pi^*}_0(\mathrm{d}z_0) = \mathbb{E}_{x\sim\Pi^*}\big[\mathbb{Q}^x_0(\mathrm{d}z_0)\big].
\]

[Proof] $\mathbb{Q}^{\Pi^*}$ is the solution of the following optimization problem:
\[
\mathbb{Q}^{\Pi^*} = \operatorname*{arg\,min}_{\mathbb{P}} \mathrm{KL}(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}) = \operatorname*{arg\,min}_{\mathbb{P}}\, \mathbb{E}_{x\sim\Pi^*}\big[\mathrm{KL}(\mathbb{Q}^x \,\|\, \mathbb{P})\big] + \mathrm{const}.
\]
By Girsanov's theorem (e.g., Lejay, 2018), any stochastic process $\mathbb{P}$ with $\mathrm{KL}(\mathbb{Q}^x \,\|\, \mathbb{P}) < +\infty$ (and hence equivalent to $\mathbb{Q}^x$) has the form $\mathrm{d}Z_t = \eta^{\mathbb{P}}(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$ for some measurable function $\eta^{\mathbb{P}}$, and
\[
\mathbb{E}_{x\sim\Pi^*}\big[\mathrm{KL}(\mathbb{Q}^x \,\|\, \mathbb{P})\big]
= \mathbb{E}_{x\sim\Pi^*}\big[\mathrm{KL}(\mathbb{Q}^x_0 \,\|\, \mathbb{P}_0)\big]
+ \mathbb{E}_{x\sim\Pi^*, Z\sim\mathbb{Q}^x}\Big[\frac{1}{2}\int_0^T \big\|\sigma(Z_t,t)^{-1}\big(\eta^{\mathbb{P}}(Z_{[0,t]},t) - \eta^x(Z_{[0,t]},t)\big)\big\|_2^2\, \mathrm{d}t\Big].
\]
It is clear that to achieve the minimum, we need to take $\mathbb{P}_0(\cdot) = \mathbb{E}_{x\sim\Pi^*}[\mathbb{Q}^x_0(\cdot)]$ and the drift equal to $\eta^{\Pi^*}(z_{[0,t]}, t) = \mathbb{E}_{x\sim\Pi^*, Z\sim\mathbb{Q}^x}[\eta^x(Z_{[0,t]}, t) \mid Z_{[0,t]} = z_{[0,t]}]$, which yields the desired form of $\mathbb{Q}^{\Pi^*}$.

A.3 $\mathbb{P}^{\theta^*}$ YIELDS A MARKOVIZATION OF $\mathbb{Q}^{\Pi^*}$

As $\mathbb{P}^\theta$ is Markov by the model assumption, it cannot perfectly fit $\mathbb{Q}^{\Pi^*}$, which is non-Markov in general. This is a substantial problem because $\mathbb{Q}^{\Pi^*}$ can be non-Markov even if $\mathbb{Q}^x$ is Markov for all $x \in \Omega$ (see Section A.4). In fact, using Doob's h-transform method (Doob, 1984), $\mathbb{Q}^{\Pi^*}$ can be shown to be the law of a diffusion process
\[
\mathrm{d}Z_t = \eta^{\Pi^*}(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \qquad
\eta^{\Pi^*}(z_{[0,t]}, t) = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\eta^{Z_T}(z_{[0,t]}, t) \mid Z_{[0,t]} = z_{[0,t]}\big],
\]
where $\eta^{\Pi^*}$ is the expectation of $\eta^x$ when $x = Z_T$ is drawn from $\mathbb{Q}^{\Pi^*}$ conditioned on $Z_{[0,t]}$. We resolve this by observing that it is not necessary to match the whole path measure ($\mathbb{P}^\theta \approx \mathbb{Q}^{\Pi^*}$) in order to match the terminal ($\mathbb{P}^\theta_T \approx \mathbb{Q}^{\Pi^*}_T = \Pi^*$). It is enough for $\mathbb{P}^\theta$ to be the best Markov approximation (a.k.a. Markovization) of $\mathbb{Q}^{\Pi^*}$, which matches all (hence the terminal) fixed-time marginals of $\mathbb{Q}^{\Pi^*}$:
\[
\mathrm{Proj}(\mathbb{Q}^{\Pi^*}, \mathcal{M}) := \operatorname*{arg\,min}_{\mathbb{P}\in\mathcal{M}} \mathrm{KL}(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}), \qquad \mathcal{M} = \text{the set of all Markov processes on } [0, T].
\]

Proposition A.2. The global optimum of $\mathcal{L}(\theta)$ in (11) and (13) is achieved by $\theta^*$ if
\[
s^{\theta^*}(z, t) = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\eta^{Z_T}(Z_{[0,t]}, t) \mid Z_t = z\big], \qquad
\mu^{\theta^*}(\mathrm{d}z_0) = \mathbb{Q}^{\Pi^*}_0 = \mathbb{E}_{x\sim\Pi^*}\big[\mathbb{Q}^x_0(\mathrm{d}z_0)\big]. \tag{14}
\]
In this case, $\mathbb{P}^{\theta^*} = \mathrm{Proj}(\mathbb{Q}^{\Pi^*}, \mathcal{M})$ is the Markovization of $\mathbb{Q}^{\Pi^*}$, with which it matches all time-marginals:
\[
\mathbb{P}^{\theta^*}_t = \mathbb{Q}^{\Pi^*}_t \quad \text{for all } t \in [0, T]. \tag{15}
\]
In addition,
\[
\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^\theta_T) \le \mathrm{KL}(\mathbb{P}^{\theta^*} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}^\theta) - \mathrm{KL}(\mathbb{Q}^{\Pi^*} \,\|\, \mathbb{P}^{\theta^*}) = \mathcal{L}(\theta) - \mathcal{L}(\theta^*).
\]
Note that $s^{\theta^*}$ is a conditional expectation of $\eta^{\Pi^*}$: $s^{\theta^*}(z, t) = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\eta^{\Pi^*}(Z_{[0,t]}, t) \mid Z_t = z]$. Peluchetti (2021) gives a related result that the marginals of mixtures of Markov diffusion processes can be matched by another Markov diffusion process, but does not discuss the issue of Markovization nor connect it to the KL divergence. Theorem 1 of Song et al. (2021) is the special case of (15) in which $\mathbb{Q}^{\Pi^*}$ is Markov.

[Proof of Proposition A.2] It is the combined result of Lemma A.3 and Lemma A.4 below.

Lemma A.3. Let $\mathbb{Q}$ be a non-Markov diffusion process on $[0, T]$ of the form $\mathbb{Q}\colon \mathrm{d}Z_t = \eta(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$, $Z_0 \sim \mathbb{Q}_0$, and let $\mathbb{M} = \arg\min_{\mathbb{P}\in\mathcal{M}} \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P})$ be the Markovization of $\mathbb{Q}$, where $\mathcal{M}$ is the set of all Markov processes on $[0, T]$. Then $\mathbb{M}$ is the law of $\mathrm{d}Z_t = m(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$, $Z_0 \sim \mathbb{Q}_0$, where $m(z, t) = \mathbb{E}_{Z\sim\mathbb{Q}}[\eta(Z_{[0,t]}, t) \mid Z_t = z]$. In addition, we have $\mathbb{Q}_t = \mathbb{M}_t$ for all $t \in [0, T]$.

[Proof] By Girsanov's theorem (e.g., Lejay, 2018), any Markov process $\mathbb{M}$ with $\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{M}) < +\infty$ (and hence equivalent to $\mathbb{Q}$) has the form $\mathrm{d}Z_t = m(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t$ for a measurable function $m$; since $\mathbb{M}$ is Markov, the drift depends on $Z_{[0,t]}$ only through $Z_t$. Then
\[
\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{M}) = \mathrm{KL}(\mathbb{Q}_0 \,\|\, \mathbb{M}_0) + \mathbb{E}_{Z\sim\mathbb{Q}}\Big[\frac{1}{2}\int_0^T \big\|\sigma(Z_t, t)^{-1}\big(\eta(Z_{[0,t]}, t) - m(Z_t, t)\big)\big\|_2^2\, \mathrm{d}t\Big].
\]
It is clear that to achieve the minimum we need to take $\mathbb{M}_0 = \mathbb{Q}_0$ and $m(z, t) = \mathbb{E}_{Z\sim\mathbb{Q}}[\eta(Z_{[0,t]}, t) \mid Z_t = z]$. To prove $\mathbb{Q}_t = \mathbb{M}_t$, note the chain rule of the KL divergence:
\[
\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}) = \mathrm{KL}(\mathbb{Q}_t \,\|\, \mathbb{P}_t) + \mathbb{E}_{Z_t\sim\mathbb{Q}_t}\big[\mathrm{KL}(\mathbb{Q}(\cdot \mid Z_t) \,\|\, \mathbb{P}(\cdot \mid Z_t))\big], \quad \forall t \in [0, T].
\]
As the second term is independent of the choice of the marginal $\mathbb{P}_t$ at time $t \in [0, T]$, the optimum can be achieved by $\mathbb{M}$ only if $\mathbb{M}_t = \mathbb{Q}_t$.

Lemma A.4. Let
\[
\mathbb{Q}\colon \mathrm{d}Z_t = \eta(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{Q}_0, \qquad
\mathbb{M}\colon \mathrm{d}Z_t = m(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{Q}_0,
\]
\[
\mathbb{P}^\theta\colon \mathrm{d}Z_t = s^\theta(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{P}^\theta_0,
\]
where $\mathbb{M}$ is the Markovization of $\mathbb{Q}$ (see Lemma A.3). Then $\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{M}) + \mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta)$. Hence, assuming there exists $\theta^*$ such that $\mathbb{P}^{\theta^*} = \mathbb{M}$ and writing $\mathcal{L}(\theta) := \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}^\theta)$, we have
\[
\mathrm{KL}(\mathbb{Q}_T \,\|\, \mathbb{P}^\theta_T) = \mathrm{KL}(\mathbb{M}_T \,\|\, \mathbb{P}^\theta_T) \le \mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta) = \mathcal{L}(\theta) - \mathcal{L}(\theta^*).
\]
[Proof] Note that
\[
\mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta)
= \mathrm{KL}(\mathbb{M}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\int_0^T \mathbb{E}_{Z_t\sim\mathbb{M}_t}\big[\|\sigma(Z_t,t)^{-1}(s^\theta(Z_t,t) - m(Z_t,t))\|_2^2\big]\,\mathrm{d}t
\]
\[
= \mathrm{KL}(\mathbb{Q}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\int_0^T \mathbb{E}_{Z_t\sim\mathbb{Q}_t}\big[\|\sigma(Z_t,t)^{-1}(s^\theta(Z_t,t) - m(Z_t,t))\|_2^2\big]\,\mathrm{d}t \qquad //\; \mathbb{Q}_t = \mathbb{M}_t\ \forall t
\]
\[
= \mathrm{KL}(\mathbb{Q}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\,\|s^\theta - m\|_{\mathbb{Q},\sigma}^2,
\qquad \text{where } \|f\|_{\mathbb{Q},\sigma}^2 := \mathbb{E}_{Z\sim\mathbb{Q}}\Big[\int_0^T \|\sigma(Z_t,t)^{-1} f(Z_t,t)\|_2^2\,\mathrm{d}t\Big].
\]
On the other hand,
\[
\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{Q}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\,\|s^\theta - \eta\|_{\mathbb{Q},\sigma}^2,
\qquad
\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{M}) = \frac{1}{2}\,\|\eta - m\|_{\mathbb{Q},\sigma}^2 .
\]
Using Lemma A.5 with $a(z) = \sigma(z,t)^{-1} s^\theta(z,t)$ and $b(z_{[0,t]}) = \sigma(z,t)^{-1}\eta(z_{[0,t]},t)$, we have the following bias-variance decomposition:
\[
\|\eta - s^\theta\|_{\mathbb{Q},\sigma}^2 = \|s^\theta - m\|_{\mathbb{Q},\sigma}^2 + \|\eta - m\|_{\mathbb{Q},\sigma}^2 .
\]
Hence $\mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta) + \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{M})$. Finally, $\mathrm{KL}(\mathbb{M}_T \,\|\, \mathbb{P}^\theta_T) \le \mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta)$ is a direct result of the following factorization of the KL divergence:
\[
\mathrm{KL}(\mathbb{M} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{M}_T \,\|\, \mathbb{P}^\theta_T) + \mathbb{E}_{x\sim\mathbb{M}_T}\big[\mathrm{KL}(\mathbb{M}(\cdot \mid Z_T = x) \,\|\, \mathbb{P}^\theta(\cdot \mid Z_T = x))\big].
\]

Lemma A.5. Let $(X, Y)$ be a random variable and let $a(x)$, $b(x, y)$ be square-integrable functions. Let $m(x) = \mathbb{E}[b(X, Y) \mid X = x]$. We have
\[
\mathbb{E}\big[\|a(X) - b(X, Y)\|_2^2\big] = \mathbb{E}\big[\|a(X) - m(X)\|_2^2\big] + \mathbb{E}\big[\|b(X, Y) - m(X)\|_2^2\big].
\]

[Proof]
\[
\mathbb{E}\big[\|a(X) - b(X, Y)\|_2^2\big] = \mathbb{E}\big[\|a(X) - m(X) + m(X) - b(X, Y)\|_2^2\big]
= \mathbb{E}\big[\|a(X) - m(X)\|_2^2\big] + \mathbb{E}\big[\|m(X) - b(X, Y)\|_2^2\big] + 2\Delta,
\]
where
\[
\Delta = \mathbb{E}\big[(a(X) - m(X))^\top (m(X) - b(X, Y))\big]
= \mathbb{E}\big[(a(X) - m(X))^\top\, \mathbb{E}[m(X) - b(X, Y) \mid X]\big]
= \mathbb{E}\big[(a(X) - m(X))^\top (m(X) - m(X))\big] = 0.
\]

A.4 MARKOV AND RECIPROCAL PROPERTIES OF $\mathbb{Q}^{\Pi^*}$

Mixture of Bridges and Initialization. It is an immediate observation that mixtures of a set of bridges are also bridges: let $\mathbb{Q}^{z,A}$ be a set of $A$-bridges indexed by a variable $z$; then $\mathbb{Q}^A := \int \mathbb{Q}^{z,A}\,\mu(\mathrm{d}z)$ is an $A$-bridge for any distribution $\mu$ on $z$.
A special case is to take the mixture of the conditioned bridges in (5) starting from different deterministic initializations, which shows that we can obtain a valid $x$-bridge by equipping the same drift in (5) with essentially any initialization. Hence, the choices of the drift force and the initialization in $\mathbb{Q}^x$ can be completely decoupled.

Proposition A.6. Let $\mathbb{Q}$ be a path measure and let $\Omega_x$ be the set of $z_0$ for which $\tilde{\mathbb{Q}}^{z_0,x}(\cdot) := \mathbb{Q}(\cdot \mid Z_T = x, Z_0 = z_0)$ exists. Then $\mathbb{Q}^x := \int \tilde{\mathbb{Q}}^{z_0,x}\,\mu(\mathrm{d}z_0 \mid x)$ is an $x$-bridge, for any distribution $\mu$ on $\Omega \times \Omega$.

[Proof of Proposition A.6] This is an obvious result. We have $\tilde{\mathbb{Q}}^{z_0,x}(Z_T = x) = 1$ by the definition of conditioned processes. Hence $\mathbb{Q}^x(Z_T = x) = \int \tilde{\mathbb{Q}}^{z_0,x}(Z_T = x)\,\mu(\mathrm{d}z_0 \mid x) = \int \mu(\mathrm{d}z_0 \mid x) = 1$.
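The decoupling of drift and initialization can be checked numerically. Below is a minimal sketch (our own code, under the assumption of the standard Brownian $x$-bridge drift $(x - Z_t)/(T - t)$ used later in Appendix A.5): whatever the starting point $z_0$, the simulated endpoint is pinned near $x$, up to the discretization error of the last Euler step.

```python
import math
import random

def simulate_brownian_bridge(x, z0, T=1.0, K=1000, seed=0):
    """Euler-Maruyama simulation of the Brownian x-bridge
    dZ_t = (x - Z_t) / (T - t) dt + dW_t, started from an arbitrary z0."""
    rng = random.Random(seed)
    eps = T / K
    z = z0
    for k in range(K):
        t = k * eps                       # grid times 0, eps, ..., T - eps
        z += eps * (x - z) / (T - t) + math.sqrt(eps) * rng.gauss(0.0, 1.0)
    return z

# The endpoint is (approximately) x = 2.0 regardless of initialization.
for z0 in (-5.0, 0.0, 7.0):
    assert abs(simulate_brownian_bridge(2.0, z0) - 2.0) < 0.5
```

Note the singular drift: at the last step the drift term exactly cancels the remaining gap $x - z$, which is what forces the endpoint onto $x$.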

Markov and Reciprocal Properties of $\mathbb{Q}^{\Pi^*}$. If $\mathbb{Q}^x$ is constructed as $\mathbb{Q}^x = \mathbb{Q}(\cdot \mid Z_T = x)$, it is easy to see that $\mathbb{Q}^{\Pi^*} := \int \mathbb{Q}^x(\cdot)\,\Pi^*(\mathrm{d}x)$ is Markov iff $\mathbb{Q}$ is Markov.

Proposition A.7. Assume $\mathbb{Q}^x = \mathbb{Q}(\cdot \mid Z_T = x)$ and that $\pi^*(z) := \frac{\mathrm{d}\Pi^*}{\mathrm{d}\mathbb{Q}_T}(z)$ exists and is positive everywhere. Then $\mathbb{Q}^{\Pi^*}$ is Markov iff $\mathbb{Q}$ is Markov.

[Proof of Proposition A.7] If $\mathbb{Q}^x = \mathbb{Q}(\cdot \mid Z_T = x)$, we have from the definition of $\mathbb{Q}^{\Pi^*}$:
\[
\mathbb{Q}^{\Pi^*}(Z) = \mathbb{Q}(Z \mid Z_T)\,\Pi^*(Z_T) = \mathbb{Q}(Z)\,\pi^*(Z_T),
\]
where $\pi^*(Z_T) = \frac{\mathrm{d}\Pi^*}{\mathrm{d}\mathbb{Q}_T}(Z_T)$; hence $\mathbb{Q}^{\Pi^*}$ is obtained by reweighting the terminal state $Z_T$ of $\mathbb{Q}$ by a positive factor and has the same Markov structure as $\mathbb{Q}$.

A.5 ERROR ANALYSIS OF THE PRACTICAL DISCRETIZED ALGORITHM

The variance of $\Delta(\theta; Z, t)$ grows to infinity as $t \uparrow T$. Hence, we should not include $\Delta(\theta; Z, T)$ at the end point $\tau_{K+1} = T$ in the sum of the loss $\ell(\theta; Z, \tau)$, to avoid exploding variance. In the sampling phase, the continuous-time model $\mathbb{P}^\theta$ is approximated numerically. A standard approach is the Euler-Maruyama method, which simulates the trajectory on a time grid $\tau$ by
\[
\hat{Z}_{\tau_{k+1}} = \hat{Z}_{\tau_k} + \epsilon_k\, s^\theta(\hat{Z}_{\tau_k}, \tau_k) + \sqrt{\epsilon_k}\,\sigma(\hat{Z}_{\tau_k}, \tau_k)\,\xi_k, \qquad \epsilon_k = \tau_{k+1} - \tau_k, \quad \xi_k \sim \mathcal{N}(0, I_d).
\]
The final output is $\hat{Z}_T$. The following result shows that the KL divergence between $\Pi^*$ and the distribution of $\hat{Z}_T$ can be bounded by the sum of the step size and the expected optimality gap $\mathbb{E}[\hat{\mathcal{L}}(\theta) - \hat{\mathcal{L}}(\theta^*)]$ of the time-discretized loss in (16).

A.5.1 TIME-DISCRETIZATION ERROR ANALYSIS (PROPOSITION A.11)

Proposition A.11. Assume $\Omega = \mathbb{R}^d$ and $\sigma(z, t) = \sigma(t)$ is state-independent. Take the uniform time grid $\tau^{\mathrm{unif}} := \{i\epsilon\}_{i=0}^{K}$ with step size $\epsilon = T/K$ in the sampling step (17). Assume $\sigma(t) > c > 0$ for all $t$, and that $\sigma(t)$ is piecewise constant w.r.t. the time grid $\tau^{\mathrm{unif}}$. Let $\mathcal{L}^\epsilon(\theta) = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\ell(\theta; Z, \tau^{\mathrm{unif}})]$. Let $\mathbb{P}^{\theta,\epsilon}_T$ be the distribution of the resulting sample $\hat{Z}_T$. Let $\theta^*$ be an optimal parameter satisfying (14). Assume
\[
C_0 := \sup_{z,t}\, \max\Big\{\frac{\|s^{\theta^*}(z,t)\|^2}{1 + \|z\|^2},\; \mathrm{tr}(\sigma^2(z,t)),\; \mathbb{E}_{\mathbb{P}^{\theta^*}}[\|Z_0\|^2]\Big\} < +\infty,
\qquad
\|s^{\theta^*}(z, t) - s^{\theta^*}(z', t')\|_2^2 \le L\big(\|z - z'\|^2 + |t - t'|\big)
\]
for all $z, z' \in \mathbb{R}^d$ and $t, t' \in [0, T]$. Then
\[
\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\theta,\epsilon}_T) \le \mathcal{L}^\epsilon(\theta) - \mathcal{L}^\epsilon(\theta^*) + O(\sqrt{\epsilon}).
\]
We provide the analysis and proof of Proposition A.11 in the following text.

Proposition A.12. Assume $\Omega = \mathbb{R}^d$, $\sigma(z, t) = \sigma(t)$ is state-independent, and $\sigma(t) > c > 0$ for all $t \in [0, T]$. Take the uniform time grid $\tau^{\mathrm{unif}} := \{i\epsilon\}_{i=0}^{K}$ with step size $\epsilon = T/K$ in the sampling step (17). Let $\mathcal{L}^\epsilon(\theta) = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\ell^\epsilon(\theta; Z)]$ with
\[
\ell^\epsilon(\theta; Z) = -\log p^\theta_0(Z_0) + \frac{1}{2K}\sum_{k=1}^{K} \big\|\bar{\sigma}_k^{-1}\big(s^\theta(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\big)\big\|_2^2,
\]
where $\epsilon > 0$ is the step size with $T = K\epsilon$ and $t_k = (k-1)\epsilon$, and $\bar{\sigma}_k^2 := (t_{k+1} - t_k)^{-1}\int_{t_k}^{t_{k+1}} \sigma(t)^2\,\mathrm{d}t$. Let $\mathbb{P}^{\theta,\epsilon}_T$ be the distribution of the sample $\hat{Z}_T$ resulting from the following Euler method:
\[
\hat{Z}_{t_{k+1}} = \hat{Z}_{t_k} + \epsilon\, s^\theta(\hat{Z}_{t_k}, t_k) + \sqrt{\epsilon}\,\bar{\sigma}_k\,\xi_k,
\]
where $\xi_k \sim \mathcal{N}(0, I_d)$ is standard Gaussian noise in $\mathbb{R}^d$. Let $\theta^*$ be an optimal parameter satisfying (14). Assume $C_0 := \sup_{z,t} \max\{\|s^{\theta^*}(z,t)\|^2/(1 + \|z\|^2),\; \mathrm{tr}(\sigma^2(z,t)),\; \mathbb{E}_{\mathbb{P}^{\theta^*}}[\|Z_0\|^2]\} < +\infty$, and that $s^{\theta^*}$ satisfies $\|s^{\theta^*}(z, t) - s^{\theta^*}(z', t')\|_2^2 \le L(\|z - z'\|^2 + |t - t'|)$ for all $z, z' \in \mathbb{R}^d$ and $t, t' \in [0, T]$. Then we have
\[
\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\theta,\epsilon}_T) \le \mathcal{L}^\epsilon(\theta) - \mathcal{L}^\epsilon(\theta^*) + O(\sqrt{\epsilon}).
\]
[Proof of Proposition A.11] This follows from Lemma A.13 below, by noting that the process $\tilde{\mathbb{P}}^\theta$ there is equivalent to the Euler method above, and that $\mathcal{L}^\epsilon(\theta) - \mathcal{L}^\epsilon(\theta^*) \le \tilde{\mathcal{L}}^\epsilon(\theta) - \tilde{\mathcal{L}}^\epsilon(\theta^*)$, because
\[
\bar{\sigma}_k^{-2} = \Big((t_{k+1} - t_k)^{-1}\int_{t_k}^{t_{k+1}} \sigma(t)^2\,\mathrm{d}t\Big)^{-1} \le (t_{k+1} - t_k)^{-1}\int_{t_k}^{t_{k+1}} \sigma(t)^{-2}\,\mathrm{d}t .
\]

Lemma A.13. Let $\epsilon = T/K$ be a step size for a positive integer $K$. For each $t \in [0, \infty)$, denote by $\lfloor t\rfloor_\epsilon = \max(\{k\epsilon : k \in \mathbb{N}\} \cap [0, t])$. Assume
\[
\mathbb{Q}^{\Pi^*}\colon \mathrm{d}Z_t = \eta^{\Pi^*}(Z_{[0,t]}, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{Q}_0, \qquad
\mathbb{P}^{\theta^*}\colon \mathrm{d}Z_t = s^{\theta^*}(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{Q}_0,
\]
\[
\mathbb{P}^{\theta}\colon \mathrm{d}Z_t = s^{\theta}(Z_t, t)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{P}^\theta_0, \qquad
\tilde{\mathbb{P}}^{\theta}\colon \mathrm{d}Z_t = s^{\theta}(Z_{\lfloor t\rfloor_\epsilon}, \lfloor t\rfloor_\epsilon)\mathrm{d}t + \sigma(Z_t, t)\mathrm{d}W_t, \quad Z_0 \sim \mathbb{P}^\theta_0 .
\]
Under the assumptions of Proposition A.12,
\[
\mathrm{KL}(\mathbb{P}^{\theta^*} \,\|\, \tilde{\mathbb{P}}^\theta) \le \tilde{\mathcal{L}}^\epsilon(\theta) - \tilde{\mathcal{L}}^\epsilon(\theta^*) + \frac{1}{2}\sqrt{\frac{L}{c^2}\,(C_{\mathbb{P}^{\theta^*}} + 1)\,T\epsilon}.
\]

[Proof] By Girsanov's theorem and Lemma A.14, for any $\omega \ge 0$, $\mathrm{KL}(\mathbb{P}^{\theta^*} \,\|\, \tilde{\mathbb{P}}^\theta) \le (1 + \omega)I_1 + (1 + 1/\omega)I_2$, where $I_1$ is the parameter-error term, bounded by $\tilde{\mathcal{L}}^\epsilon(\theta) - \tilde{\mathcal{L}}^\epsilon(\theta^*)$, and $I_2$ is the time-discretization term, which satisfies
\[
I_2 = \frac{1}{2}\,\mathbb{E}_{\mathbb{P}^{\theta^*}}\Big[\int_0^T \big\|\sigma(Z_t, t)^{-1}\big(s^{\theta^*}(Z_t, t) - s^{\theta^*}(Z_{\lfloor t\rfloor_\epsilon}, \lfloor t\rfloor_\epsilon)\big)\big\|_2^2\,\mathrm{d}t\Big]
\le \frac{L}{2}\,\mathbb{E}_{\mathbb{P}^{\theta^*}}\Big[\int_0^T \sigma^{-2}(Z_t, t)\big(\|Z_t - Z_{\lfloor t\rfloor_\epsilon}\|^2 + (t - \lfloor t\rfloor_\epsilon)\big)\,\mathrm{d}t\Big]
\]
\[
\le \frac{L}{2c^2}\,\mathbb{E}_{\mathbb{P}^{\theta^*}}\Big[\int_0^T \big(\|Z_t - Z_{\lfloor t\rfloor_\epsilon}\|^2 + (t - \lfloor t\rfloor_\epsilon)\big)\,\mathrm{d}t\Big]
\le \frac{L}{2c^2}\,(C_{\mathbb{P}^{\theta^*}} + 1)\int_0^T (t - \lfloor t\rfloor_\epsilon)\,\mathrm{d}t \qquad //\;\text{Lemma A.17}
\]
\[
= \frac{L}{2c^2}\,(C_{\mathbb{P}^{\theta^*}} + 1)\,\frac{T\epsilon}{2}, \qquad //\;\text{Lemma A.15}
\]
where $C_{\mathbb{P}^{\theta^*}}$ is a constant depending on $\mathbb{P}^{\theta^*}$ that comes from Lemma A.17. Hence
\[
\mathrm{KL}(\mathbb{P}^{\theta^*} \,\|\, \tilde{\mathbb{P}}^\theta) \le \inf_{\omega\ge 0}\big\{(1 + \omega)I_1 + (1 + 1/\omega)I_2\big\} = \big(\sqrt{I_1} + \sqrt{I_2}\big)^2 \le \tilde{\mathcal{L}}^\epsilon(\theta) - \tilde{\mathcal{L}}^\epsilon(\theta^*) + \frac{1}{2}\sqrt{\frac{L}{c^2}\,(C_{\mathbb{P}^{\theta^*}} + 1)\,T\epsilon}.
\]
This completes the proof.

Lemma A.14. For any $a, b \in \mathbb{R}^d$ and $\omega \ge 0$, $\|a + b\|_2^2 \le (1 + \omega)\|a\|_2^2 + (1 + 1/\omega)\|b\|_2^2$.

[Proof] $(1 + \omega)\|a\|_2^2 + (1 + 1/\omega)\|b\|_2^2 \ge \|a\|_2^2 + \|b\|_2^2 + 2a^\top b = \|a + b\|_2^2$.

Lemma A.15. Assume $T \ge 0$, $\epsilon \ge 0$ and $T/\epsilon \in \mathbb{N}$. We have $\int_0^T (t - \lfloor t\rfloor_\epsilon)\,\mathrm{d}t = \frac{T\epsilon}{2}$.

[Proof] $\int_0^T (t - \lfloor t\rfloor_\epsilon)\,\mathrm{d}t = \sum_{k=0}^{K-1}\int_0^\epsilon x\,\mathrm{d}x = K\epsilon^2/2 = T\epsilon/2$.

Lemma A.17. Consider
\[
\mathrm{d}Z_t = b(Z_t, t)\,\mathrm{d}t + \sigma(Z_t, t)\,\mathrm{d}W_t, \qquad t \in [0, T].
\]
Assume there exists a finite constant $C_0$ such that $\|b(x, t)\|_2^2 \le C_0(1 + \|x\|_2^2)$ and $\mathrm{tr}(\sigma\sigma^\top(x, t)) \le C_0$ for all $x \in \mathbb{R}^d$, $t \in [0, T]$, and $\mathbb{E}[\|Z_0\|_2^2] \le C_0$. Then for any $0 \le s \le t \le T$, we have $\mathbb{E}[\|Z_t - Z_s\|_2^2] \le K_{C_0,T}\,(t - s)$, where $K_{C_0,T}$ is a finite constant that depends on $C_0$ and $T$.

[Proof] Let $\eta = \sup_{x,t} \mathrm{tr}(\sigma\sigma^\top(x, t))$.
By Itô's lemma,
\[
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathbb{E}\big[\|Z_t - Z_s\|_2^2\big]
= \mathbb{E}\big[2(Z_t - Z_s)^\top b(Z_t, t)\big] + \mathbb{E}\big[\mathrm{tr}(\sigma\sigma^\top(Z_t, t))\big]
\le \mathbb{E}\big[\|Z_t - Z_s\|_2^2 + \|b(Z_t, t)\|_2^2\big] + \eta
\]
\[
\le \mathbb{E}\big[\|Z_t - Z_s\|_2^2\big] + C_0\big(1 + \mathbb{E}[\|Z_t\|_2^2]\big) + \eta
\le (1 + 2C_0)\,\mathbb{E}\big[\|Z_t - Z_s\|_2^2\big] + \eta + C_0\big(1 + 2\,\mathbb{E}[\|Z_s\|_2^2]\big),
\]
where the last step uses $\|Z_t\|_2^2 \le 2\|Z_t - Z_s\|_2^2 + 2\|Z_s\|_2^2$. Using Grönwall's inequality,
\[
\mathbb{E}\big[\|Z_t - Z_s\|_2^2\big] \le (t - s)\big(\eta + C_0(1 + 2\,\mathbb{E}[\|Z_s\|_2^2])\big)\exp\big((t - s)(1 + 2C_0)\big).
\]
Taking $s = 0$ yields $\mathbb{E}[\|Z_t - Z_0\|_2^2] \le t\big(\eta + C_0(1 + 2\,\mathbb{E}[\|Z_0\|_2^2])\big)\exp(t(1 + 2C_0))$. Hence
\[
\mathbb{E}[\|Z_t\|_2^2] \le 2\,\mathbb{E}\big[\|Z_t - Z_0\|_2^2\big] + 2\,\mathbb{E}\big[\|Z_0\|_2^2\big]
\le 4T(C_0 + C_0^2)\exp(T(1 + 2C_0)) + 2C_0 .
\]
Therefore,
\[
\mathbb{E}\big[\|Z_t - Z_s\|_2^2\big] \le (t - s)\big(\eta + C_0(1 + 2\,\mathbb{E}[\|Z_s\|_2^2])\big)\exp\big((t - s)(1 + 2C_0)\big) \le C\,(t - s),
\]
where $C = \big(2C_0 + 4C_0^2 + 8C_0 T(C_0 + C_0^2)\exp(T(1 + 2C_0))\big)\exp(T(1 + 2C_0))$.
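The Euler-Maruyama sampling scheme analyzed in this subsection can be sketched generically as follows. This is our own illustrative code (not from the paper's repository); `s_theta` stands in for the learned drift and `sigma` for the (state-independent) noise schedule.

```python
import math
import random

def euler_maruyama(s_theta, sigma, z0, T=1.0, K=500, rng=None):
    """Simulate dZ_t = s_theta(Z_t, t) dt + sigma(t) dW_t on the uniform grid
    {0, eps, 2*eps, ...}, returning the final state Z_T (1-D sketch)."""
    rng = rng or random.Random(0)
    eps = T / K
    z = z0
    for k in range(K):
        t = k * eps
        z += eps * s_theta(z, t) + math.sqrt(eps) * sigma(t) * rng.gauss(0.0, 1.0)
    return z

# Sanity check with zero noise: dZ = -Z dt decays to z0 * exp(-T).
z_final = euler_maruyama(lambda z, t: -z, lambda t: 0.0, 1.0)
```

With `sigma = 0` the scheme reduces to the explicit Euler method for the ODE $\dot z = s^\theta(z, t)$, which makes the $O(\epsilon)$ one-step bias easy to see in isolation from the noise.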

A.5.2 STATISTICAL ERROR ANALYSIS (PROPOSITION A.18)

To provide a simple analysis of the statistical error, we assume that $\hat\theta_n = \arg\min_\theta \hat{\mathcal{L}}^\epsilon(\theta)$ is an asymptotically normal M-estimator of $\theta^*$ in the sense of classical asymptotic statistics (Van der Vaart, 2000), with which we can estimate the rate of the excess risk $\mathcal{L}^\epsilon(\hat\theta_n) - \mathcal{L}^\epsilon(\theta^*)$ and hence the KL divergence.

Proposition A.18. Assume the conditions in Proposition A.11. Assume $\hat\theta_n = \arg\min_\theta \hat{\mathcal{L}}^\epsilon(\theta)$ with $\hat{\mathcal{L}}^\epsilon(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(\theta; Z^{(i)}, \tau^{\mathrm{unif}})$, $Z^{(i)} \sim \mathbb{Q}^{\Pi^*}$. Take $\mathbb{Q}^x$ to be the standard Brownian bridge $\mathrm{d}Z^x_t = \frac{x - Z^x_t}{T - t}\,\mathrm{d}t + \mathrm{d}W_t$ with $Z_0 \sim \mathcal{N}(0, v_0)$ and $v_0 > 0$. Assume $\sqrt{n}(\hat\theta_n - \theta^*) \overset{d}{\to} \mathcal{N}(0, \Sigma_*)$ as $n \to +\infty$, where $\Sigma_*$ is the asymptotic covariance matrix of the M-estimator $\hat\theta_n$. Assume $\mathcal{L}^\epsilon(\theta)$ is twice continuously differentiable and strongly convex at $\theta^*$. Assume $\Pi^*$ has a finite covariance and admits a density function $\pi$ that satisfies
\[
\sup_{t\in[0,T]}\, \mathbb{E}_{\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_t, t)\|^2\big(1 + \|\nabla\log\pi(Z_T)\|^2 + \mathrm{tr}(\nabla^2\log\pi(Z_T))\big)\Big] < +\infty.
\]
Then
\[
\mathbb{E}\Big[\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\hat\theta_n,\epsilon}_T)\Big] = O\Big(\frac{\log(1/\epsilon) + 1}{n} + \sqrt{\epsilon}\Big). \tag{18}
\]
The expectation in Eq. (18) is w.r.t. the randomness of $\hat\theta_n$.

[Proof of Proposition A.18] Let $\theta^* = \arg\min_\theta \mathcal{L}^\epsilon(\theta) := \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\ell(\theta; Z, \tau^{\mathrm{unif}})]$, where $\{Z^{(i)}\}_{i=1}^n$ is drawn i.i.d. from $\mathbb{Q}^{\Pi^*}$. We assume that $\hat\theta_n$ is an asymptotically normal M-estimator, in which case we have $\sqrt{n}(\hat\theta_n - \theta^*) \overset{d}{\to} \mathcal{N}(0, \Sigma_*)$ with $\Sigma_* = H_*^{-1} V_* H_*^{-1}$, where
\[
H_* = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\nabla^2_\theta \ell(\theta^*; Z)\big], \qquad
V_* = \mathbb{E}\big[\nabla_\theta \ell(\theta^*; Z)\,\nabla_\theta \ell(\theta^*; Z)^\top\big],
\]
and
\[
n\,\mathbb{E}\big[\mathcal{L}^\epsilon(\hat\theta_n) - \mathcal{L}^\epsilon(\theta^*)\big] \asymp \frac{1}{2}\,\mathbb{E}\Big[\sqrt{n}(\theta^* - \hat\theta_n)^\top H_*\, \sqrt{n}(\theta^* - \hat\theta_n)\Big] \asymp \frac{1}{2}\,\mathrm{tr}(H_*^{-1} V_*),
\]
where $f \asymp g$ denotes $f - g = o(1)$. We now need to bound $\mathrm{tr}(H_*^{-1} V_*)$. Combining the results in Lemma A.19 and Lemma A.23, we have, with $t_k = (k-1)\epsilon$ and $T = K\epsilon$,
\[
\mathrm{tr}(H_*^{-1} V_*) = O\Big(1 + \frac{1}{K}\sum_{k=1}^K \frac{1}{T - t_k}\Big) = O(1 + \log(1/\epsilon)).
\]
Hence
\[
\mathbb{E}\big[\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\hat\theta_n,\epsilon}_T)\big]
= O\Big(\mathbb{E}\big[\mathcal{L}^\epsilon(\hat\theta_n) - \mathcal{L}^\epsilon(\theta^*)\big] + \sqrt{\epsilon}\Big)
= O\Big(\frac{\log(1/\epsilon) + 1}{n} + \sqrt{\epsilon}\Big).
\]

Lemma A.19. Assume the conditions in Proposition A.18. Define
\[
I_0 = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta \log p^{\theta^*}_0(Z_0)\|^2\big], \qquad
I_k = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|^2\, \mathrm{tr}\big(\mathrm{cov}(\eta^{Z_T}(Z_{[0,t_k]}, t_k) \mid Z_{t_k})\big)\Big], \quad k = 1, \ldots, K.
\]
Then
\[
\mathrm{tr}(H_*^{-1} V_*)^{1/2} \le \frac{1}{\lambda_{\min}(H_*)^{1/2}}\bigg(I_0^{1/2} + \Big(\frac{1}{K}\sum_{k=1}^K I_k\Big)^{1/2}\bigg).
\]

[Proof] From Lemma A.21, $\mathrm{tr}(H_*^{-1} V_*) \le (\lambda_{\min}(H_*))^{-1}\,\mathrm{tr}(V_*)$, so we just need to bound $\mathrm{tr}(V_*)$. By the triangle inequality,
\[
\mathrm{tr}(V_*)^{1/2} = \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\big[\|\nabla_\theta \ell(\theta^*; Z)\|_2^2\big]^{1/2}
\le I_0^{1/2} + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\big\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\big(s^{\theta^*}(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\big)\big\|_2^2\Big]^{1/2}
\]
\[
\le I_0^{1/2} + \frac{1}{K}\sum_{k=1}^K \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|_2^2\, \big\|s^{\theta^*}(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\big\|_2^2\Big]^{1/2}
= I_0^{1/2} + \frac{1}{K}\sum_{k=1}^K I_k^{1/2}
\le I_0^{1/2} + \Big(\frac{1}{K}\sum_{k=1}^K I_k\Big)^{1/2},
\]
where the second-to-last equality uses $\mathbb{E}[\|s^{\theta^*}(Z_{t_k}, t_k) - \eta^{Z_T}(Z_{[0,t_k]}, t_k)\|^2 \mid Z_{t_k}] = \mathrm{tr}(\mathrm{cov}(\eta^{Z_T}(Z_{[0,t_k]}, t_k) \mid Z_{t_k}))$ and the last inequality uses the concavity of the square root.

Lemma A.20. Assume the results in Lemma A.19 and Lemma A.23 hold. Assume
\[
\max_{k\in\{1,\ldots,K\}}\, \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|_2^2\big(1 + \|\nabla\log\pi^*(Z_T)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(Z_T))\big)\Big] < +\infty.
\]
Then, for $k = 1, \ldots, K$, we have $I_k = O\big(\frac{1}{T - t_k} + 1\big)$.

[Proof] It is a direct application of the variance bound in Lemma A.23.

Lemma A.21. Let $A$ and $B$ be two $d\times d$ positive semi-definite matrices. Then $\mathrm{tr}(AB) \le \lambda_{\max}(A)\,\mathrm{tr}(B)$.

[Proof] Write $A = \sum_{i=1}^d \lambda_i u_i u_i^\top$, where $\lambda_i$ and $u_i$ are the $i$-th eigenvalue and eigenvector of $A$, respectively. Then $\mathrm{tr}(AB) = \sum_{i=1}^d \lambda_i\, u_i^\top B u_i \le \lambda_{\max}(A)\sum_{i=1}^d u_i^\top B u_i = \lambda_{\max}(A)\,\mathrm{tr}(B)$.

Controlling the Conditional Variance of the Regression Problem. Assume $\mathbb{Q}^x$ is the standard Brownian bridge:
\[
\mathbb{Q}^x\colon \mathrm{d}Z^x_t = \frac{x - Z^x_t}{T - t}\,\mathrm{d}t + \mathrm{d}W_t, \qquad Z_0 \sim \mathcal{N}(0, v_0). \tag{19}
\]
In this case, the (ideal) loss function is
\[
\mathcal{L}(\theta) = \mathbb{E}_{X\sim\Pi^*, Z\sim\mathbb{Q}^X}\Big[-\log p^\theta_0(Z_0) + \frac{1}{2}\int_0^T \big\|s^\theta(Z_t, t) - Y_t\big\|^2\,\mathrm{d}t\Big], \qquad Y_t = \frac{X - Z_t}{T - t}.
\]
The second part of the loss is a least-squares regression for predicting $Y_t = \eta^X(Z_t, t)$ with $s^\theta(Z_t, t)$. For the standard Brownian bridge in (19), we have
\[
Z^x_t \sim \mathcal{N}\Big(\frac{t}{T}\,x,\; \frac{t(T - t)}{T} + \frac{(T - t)^2}{T^2}\,v_0\Big).
\]

[Proof] Let $Z^{z_0,x}_t$ be the same process initialized from $Z^{z_0,x}_0 = z_0$.
From the textbook characterization of the Brownian bridge, we can write $Z^{z_0,x}_t = \frac{tx + (T-t)z_0}{T} + \sqrt{\frac{t(T-t)}{T}}\,\xi_t$, where $\xi_t$ is a standard Gaussian random variable. The result follows directly since $Z^x_t = Z^{Z_0,x}_t$ with $Z_0 \sim \mathcal{N}(0, v_0)$.

Lemma A.23. Let $\pi^*$ be the density function of $\Pi^*$ on $\mathbb{R}^d$, whose covariance matrix exists. Let $X \sim \Pi^*$ and $Z \sim \mathbb{Q}^X$ from (19) with $v_0 > 0$. Then the density function $\rho_t(x \mid z)$ of $X \mid Z_t = z$ satisfies
\[
\rho_t(x \mid z) \propto \pi^*(x)\exp\Bigg(-\frac{\big\|\frac{T}{t}z - x\big\|_2^2}{2\big(\frac{T(T-t)}{t} + \frac{v_0(T-t)^2}{t^2}\big)}\Bigg). \tag{21}
\]
In addition, there exist a positive constant $c < +\infty$ and $\tau \in (0, T)$ such that
\[
\mathrm{tr}(\mathrm{cov}_{\rho_t}(x \mid z)) \le
\begin{cases}
w_t\, d + w_t^2\, \mathbb{E}_{\rho_t}\big[\|\nabla_x \log\pi^*(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(x)) \,\big|\, z\big], & \tau \le t \le T, \\[2pt]
c, & 0 \le t \le \tau,
\end{cases}
\]
where $w_t = \frac{T(T-t)}{t} + \frac{v_0(T-t)^2}{t^2}$. So $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x \mid z))$ is bounded and decays to zero at rate $O(T - t)$ as $t \uparrow T$.

[Proof] We know that $X \sim \Pi^*$ and that $Z^X_t \mid X$ is Gaussian with mean $\frac{t}{T}X$; hence (21) is a direct result of Bayes' rule. Then Lemma A.25 gives
\[
\mathrm{tr}(\mathrm{cov}_{\rho_t}(x \mid z)) \le w_t\, d + w_t^2\, \mathbb{E}_{\rho_t}\big[\|\nabla_x\log\pi^*(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi^*(x)) \,\big|\, z\big].
\]
On the other hand,
\[
\rho_t(x \mid z) \propto \pi^*(x)\exp\Big(-\frac{1}{2w_t}\|x\|^2 + \frac{T}{t\,w_t}\, z^\top x\Big).
\]
When $t \to 0$, we have $1/w_t \to 0$ and $T/(t\,w_t) \to 0$. Hence $\rho_t(x \mid z)$ converges to $\pi^*(x)$ as $t \to 0$, and as a result $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x \mid z)) \to \mathrm{tr}(\mathrm{cov}_{\pi^*}(x)) < +\infty$. Therefore, for any $c' > 0$, there exists $t_0 > 0$ such that $\mathrm{tr}(\mathrm{cov}_{\rho_t}(x \mid z)) \le \mathrm{tr}(\mathrm{cov}_{\pi^*}(x)) + c'$ when $0 \le t \le t_0$.

Remark A.24. We need $v_0 > 0$ to ensure that $T/(t\,w_t) \to 0$ in the proof of Lemma A.23. This is for a purely technical reason: it yields a finite bound on the conditional variance when $t$ is close to $0$. We can establish the same result when $v_0 = 0$ by adding the assumption that
\[
\max_{k\in\{1,\ldots,K\}}\, \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Big[\|\nabla_\theta s^{\theta^*}(Z_{t_k}, t_k)\|_2^2\, \mathrm{tr}\big(\mathrm{cov}_{\Pi^*_{Z_{t_k}}}(Z_T)\big)\Big] < +\infty,
\]
where $\Pi^*_z$ is the distribution with density $\pi^*_z(x) \propto \pi^*(x)\exp(z^\top x / T)$.

Lemma A.25.
Let $p(x) \propto \pi(x)\exp\big(-\frac{\alpha\|x - b\|_2^2}{2}\big)$ be a positive probability density function on $\mathbb{R}^d$, where $\alpha > 0$, $b \in \mathbb{R}^d$, and $\log\pi$ is twice continuously differentiable. Then
\[
\mathrm{tr}(\mathrm{cov}_p(x)) \le \alpha^{-1} d + \alpha^{-2}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].
\]

[Proof] Let us focus on the case $b = 0$ first. Stein's identity says that
\[
\mathbb{E}_p\big[(\nabla_x\log\pi(x) - \alpha x)^\top \phi(x) + \nabla_x^\top \phi(x)\big] = 0
\]
for a general continuously differentiable function $\phi$ when the integrals above are finite. Taking $\phi(x) = x$ yields $\mathbb{E}_p[(\nabla_x\log\pi(x) - \alpha x)^\top x] + d = 0$, which gives
\[
\mathbb{E}_p[\|x\|_2^2] = \alpha^{-1}\big(\mathbb{E}_p[\nabla_x\log\pi(x)^\top x] + d\big).
\]
On the other hand, taking $\phi(x) = \nabla_x\log\pi(x)$ yields
\[
\mathbb{E}_p\big[(\nabla_x\log\pi(x) - \alpha x)^\top \nabla_x\log\pi(x) + \mathrm{tr}(\nabla^2\log\pi(x))\big] = 0,
\]
which gives $\mathbb{E}_p[\nabla_x\log\pi(x)^\top x] = \alpha^{-1}\,\mathbb{E}_p[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))]$. Together,
\[
\mathrm{tr}(\mathrm{cov}_p(x)) \le \mathbb{E}_p[\|x\|_2^2] = \alpha^{-1} d + \alpha^{-2}\,\mathbb{E}_p\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}(\nabla^2\log\pi(x))\big].
\]
For $b \ne 0$, define $\tilde{p}(\tilde{x}) \propto \pi(\tilde{x} + b)\exp\big(-\frac{\alpha}{2}\|\tilde{x}\|^2\big)$, which is the distribution of $\tilde{x} = x - b$ when $x \sim p$. Applying the result above to $\tilde{p}$ yields
\[
\mathrm{tr}(\mathrm{cov}_p(x)) = \mathrm{tr}(\mathrm{cov}_{\tilde{p}}(\tilde{x}))
\le \alpha^{-1} d + \alpha^{-2}\,\mathbb{E}_{\tilde{x}\sim\tilde{p}}\big[\|\nabla_{\tilde{x}}\log\pi(\tilde{x} + b)\|_2^2 + \mathrm{tr}\big(\nabla^2\log\pi(\tilde{x} + b)\big)\big]
= \alpha^{-1} d + \alpha^{-2}\,\mathbb{E}_{x\sim p}\big[\|\nabla_x\log\pi(x)\|_2^2 + \mathrm{tr}\big(\nabla^2\log\pi(x)\big)\big].
\]
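Lemma A.25 can be sanity-checked in closed form when $\pi$ is itself a standard Gaussian (our own worked example, $d = 1$, $b = 0$): then $p \propto \exp(-(1+\alpha)x^2/2)$ is $\mathcal{N}(0, 1/(1+\alpha))$, $\nabla_x\log\pi(x) = -x$, and $\nabla^2\log\pi(x) = -1$, so the bound holds with equality.

```python
# Check of Lemma A.25 for pi(x) = exp(-x^2/2) (unnormalized), d = 1, b = 0:
#   E_p[x^2] should equal  alpha^{-1} d + alpha^{-2} E_p[ |grad log pi|^2 + tr(Hess log pi) ]
#                        = 1/alpha + (1/alpha^2) * (E_p[x^2] - 1).
alpha = 2.0
Ep_x2 = 1.0 / (1.0 + alpha)   # variance of N(0, 1/(1+alpha)), i.e. E_p[x^2]
rhs = 1.0 / alpha + (1.0 / alpha ** 2) * (Ep_x2 - 1.0)
assert abs(Ep_x2 - rhs) < 1e-12
```

Algebraically, $1/\alpha - 1/(\alpha(1+\alpha)) = 1/(1+\alpha)$, confirming the identity used in the proof.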

A.6 CONDITION FOR Ω-BRIDGES

We provide the proof of Proposition 2.3.

[Proof of Proposition 2.3] By the formula for the KL divergence between two diffusion processes, we have
\[
\mathrm{KL}(\mathbb{Q}^{\Omega} \,\|\, \mathbb{P}^\theta) = \mathrm{KL}(\mathbb{Q}^{\Omega}_0 \,\|\, \mathbb{P}^\theta_0) + \frac{1}{2}\,\mathbb{E}_{Z\sim\mathbb{Q}^{\Omega}}\Big[\int_0^T \|f^\theta(Z_t, t)\|_2^2\,\mathrm{d}t\Big] < +\infty.
\]
This means that $\mathbb{Q}^{\Omega}$ and $\mathbb{P}^\theta$ are mutually absolutely continuous, and hence have the same support. Therefore, $\mathbb{Q}^{\Omega}(Z_T \in \Omega) = 1$ implies that $\mathbb{P}^\theta(Z_T \in \Omega) = 1$.

A.7 EXAMPLES OF Ω-BRIDGES

If $\Omega$ is a product space, the integration can be factorized into one-dimensional integrals. Specifically, assume $\Omega = I_1 \times \cdots \times I_d$; then $\eta^\Omega(z, t) = [\eta^{I_i}(z_i, t)]_{i=1}^d$, where $\eta^{I_i}$ is the drift force of the $I_i$-bridge and $z_i$ is the $i$-th element of $z = [z_i]$. Therefore, it is sufficient to focus on the 1D case below. Consider the bridge process constructed from the Brownian motion in (9). If $\Omega$ is a discrete set, say $\Omega = \{e_1, \ldots, e_K\}$, we have
\[
\eta^\Omega(z, t) = \sigma_t^2\,\nabla_z \log \sum_{k=1}^K \omega(e_k, z, t), \qquad \omega(e, z, t) = \exp\Big(-\frac{\|z - e\|^2}{2(\beta_T - \beta_t)}\Big).
\]
If $\Omega = [a, b]$ is a bounded interval, we have
\[
\eta^\Omega(z, t) = \sigma_t^2\,\nabla_z \log\Big(F\Big(\frac{z - a}{\sqrt{\beta_T - \beta_t}}\Big) - F\Big(\frac{z - b}{\sqrt{\beta_T - \beta_t}}\Big)\Big),
\]
where $F$ is the standard Gaussian CDF.
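The discrete-set drift above is cheap to evaluate explicitly, since the gradient of the log-sum of Gaussian weights is a softmax-weighted average of the pull terms $(e_k - z)/(\beta_T - \beta_t)$. Below is a minimal 1-D sketch with our own function names (not from the released code).

```python
import math

def discrete_bridge_drift(z, points, sigma2_t, beta_T, beta_t):
    """Drift of the Omega-bridge for a finite set Omega = {e_1, ..., e_K} in 1-D:
    eta(z, t) = sigma_t^2 * [sum_k w_k (e_k - z) / (beta_T - beta_t)] / sum_k w_k,
    with Gaussian weights w_k = exp(-(z - e_k)^2 / (2 (beta_T - beta_t))),
    equal to sigma_t^2 * d/dz log(sum_k w_k)."""
    s2 = beta_T - beta_t
    w = [math.exp(-(z - e) ** 2 / (2.0 * s2)) for e in points]
    pull = sum(wk * (e - z) / s2 for wk, e in zip(w, points))
    return sigma2_t * pull / sum(w)

# Near the terminal time (beta_t close to beta_T), the drift points sharply
# toward the nearest element of Omega, pinning Z_T onto the set.
d = discrete_bridge_drift(z=0.9, points=[0.0, 1.0], sigma2_t=1.0,
                          beta_T=1.0, beta_t=0.99)
assert d > 0   # z = 0.9 is pushed toward the nearest point e = 1.0
```

For numerical stability in practice one would compute the weights in log-space (subtracting the max exponent before exponentiating), since $\beta_T - \beta_t \to 0$ makes the raw weights underflow.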

B ADDITIONAL MATERIALS OF THE EXPERIMENTS

In our experiments, $T = 1$ and $\epsilon = T/K = 1/K$. Moreover, we take the time grid by randomly sampling from $\{i/K\}_{i=0}^{K-1}$ for the training objective in Eq. (13). For evaluation, we calculate the standard evidence lower bound (ELBO) by viewing the resulting time-discretized model as a latent variable model:
\[
\mathbb{E}_{X\sim\Pi^*}\big[-\log \hat{p}^\theta_T(X)\big] \le \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}\Bigg[-\log \frac{\hat{p}^\theta_0(Z_0)}{q_0(Z_0)} - \sum_{k=1}^K \log \frac{\hat{p}^\theta_{t_{k+1}|t_k}(Z_{t_{k+1}} \mid Z_{t_k})}{q_{t_{k+1}|t_k}(Z_{t_{k+1}} \mid Z_{t_k})}\Bigg],
\]
where $t_k = (k-1)\epsilon$, $\hat{p}^\theta$ is the density function of the time-discretized version of $\mathbb{P}^\theta$, and $q$ is the density function of $\mathbb{Q}$. We adopt Monte Carlo sampling to estimate the log-likelihood. As in (Song et al., 2020b), we repeat the estimation 5 times on the test set. For categorical/integer/grid generation, the likelihood of the last step should take the rounding into account: in practice, we have
\[
\hat{Z}_T = \mathrm{rounding}\big(\hat{Z}_{t_K} + \epsilon\, s^\theta(\hat{Z}_{t_K}, t_K) + \sqrt{\epsilon}\,\sigma(\hat{Z}_{t_K}, t_K)\,\xi_K,\; \Omega\big),
\]
where $\mathrm{rounding}(x, \Omega)$ denotes finding the nearest element to $x$ in $\Omega$; hence the likelihood $\hat{p}^\theta_{T|t_K}$ of the last step should incorporate the rounding operator as part of the model.
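The rounding operator used in the last step amounts to a nearest-element projection onto the domain. A minimal sketch (our own code, for a product of identical finite 1-D domains):

```python
def round_to_domain(x, omega):
    """Project each coordinate of x onto the nearest element of the finite
    1-D domain omega, i.e. the rounding(x, Omega) operator applied after
    the final Euler step."""
    return [min(omega, key=lambda e: abs(e - xi)) for xi in x]

# Continuous final states are snapped onto the constrained set:
assert round_to_domain([0.2, 0.8, 3.7], [0, 1, 2, 3]) == [0, 1, 3]
```

For the special case $\Omega = \{0, \ldots, 255\}$ per pixel, this reduces to clipping followed by integer rounding, which is how it would typically be vectorized in practice.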

B.1 GENERATING MIXED-TYPE TABULAR DATA

In this experiment, the metrics are measured with the implementation from the Synthetic Data Vault (SDV) (Patki et al., 2016). For the baseline methods, we adopt their open-sourced official implementation 1. For the machine learning models adopted for evaluation, logistic regression, AdaBoost and MLP, we directly use their default configurations in SDV. For the results in Table 1, we repeat the experiments with 5 different random seeds and report the standard deviation. We provide additional generated samples from Mixed-Bridge in Figure 5.

B.2 GENERATING INTEGER-VALUED POINT CLOUDS

In this experiment, we need to process point cloud data on an integer grid. To prepare the data, we first sample 2048 points from the ground-truth mesh. Then, we normalize all the point clouds to a unit bounding box. After this, we project the points onto grid points by rounding the coordinates to integers. The metrics in the main text, MMD, COV and 1-NNA, are computed with respect to the post-processed integer-valued training point clouds. For the results in Table 2, we repeat the experiments 3 times and report the mean.
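The pre-processing pipeline (normalize to a unit box, then snap onto an integer grid) can be sketched as follows. This is our own illustrative code; `grid_size` is a hypothetical resolution parameter, not a value stated in the text.

```python
def quantize_point_cloud(points, grid_size=128):
    """Normalize a 3-D point cloud into the unit bounding box, then round
    each coordinate onto an integer grid of the given resolution."""
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    scale = max(hi[i] - lo[i] for i in range(3)) or 1.0   # isotropic scaling
    out = []
    for p in points:
        unit = [(p[i] - lo[i]) / scale for i in range(3)]  # coords in [0, 1]
        out.append(tuple(round(u * (grid_size - 1)) for u in unit))
    return out

pts = [(0.0, 0.0, 0.0), (4.0, 2.0, 1.0)]
assert quantize_point_cloud(pts, grid_size=5) == [(0, 0, 0), (4, 2, 1)]
```

Using a single isotropic scale (rather than per-axis scales) preserves the aspect ratio of the shape, which matters when meshes are reconstructed from the quantized points.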

B.3 GENERATING SEMANTIC SEGMENTATION MAPS ON CITYSCAPES

In this experiment, we set (Noise Decay A): $\sigma^2_t = 3\exp(-3t)$; (Noise Decay B): $\sigma^2_t = 3(1 - t)$; (Noise Decay C): $\sigma^2_t = 3(1 - \exp(-3(1 - t)))$. We visualize the noise schedules in Figure 6. Note that, except for Constant Noise, the three other processes all gradually decrease the magnitude of the noise as $t \to 1$. For fair comparison, we use the same neural network as in Hoogeboom et al. (2021). The network is optimized with the Adam optimizer with a learning rate of 0.0002, and the model is trained for 500 epochs. The CityScapes dataset (Cordts et al., 2016) contains photos captured by cameras on driving cars, with a pixel-wise semantic segmentation map labeled for each photo. As in (Hoogeboom et al., 2021), we rescale the segmentation maps from CityScapes to 32 × 64 images using nearest-neighbour interpolation. Our training and test sets are exactly the same as those of (Hoogeboom et al., 2021) for fair comparison. For the results in Table 3, we repeat the experiments 3 times and report the mean. We provide more samples in Figure 9.

B.4 DISCRETE CIFAR10 GENERATION

The model is trained using the same training strategy as DDPM (Ho et al., 2020), with the code base provided in Song et al. (2020b). Specifically, the neural network is the same U-Net structure as the implementation in Song et al. (2020b). The optimizer is Adam with a learning rate of 0.0002. Following common practice (Song & Ermon, 2020; Song et al., 2020b), training is smoothed by an exponential moving average (EMA) with a factor of 0.999. We use $K = 1000$ and $\mathrm{d}t = 0.001$ for discretizing the SDE. To account for the discretization error, after the final step we apply rounding to the generated images to obtain truly integer-valued images. We compare the value distributions of the generated images in Figure 8.
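The EMA smoothing mentioned above is a one-line update applied to the model parameters after each optimizer step. A minimal plain-Python sketch (our own code, operating on parameter lists rather than a framework's state dict):

```python
def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over the parameters:
    ema <- decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# With decay 0.5 and a constant target of 1.0, the EMA after n updates
# starting from 0.0 is exactly 1 - 0.5**n.
ema = [0.0]
for _ in range(10):
    ema = ema_update(ema, [1.0], decay=0.5)
assert abs(ema[0] - (1.0 - 0.5 ** 10)) < 1e-12
```

At sampling time, the EMA copy of the weights (rather than the raw training weights) is what is typically evaluated, since it reduces the variance of the final checkpoint.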



1 https://github.com/sdv-dev/CTGAN



Figure 1: An Ω-Bridge on discrete domain Ω = {1, 2, 3, 4}.

for which the expectation reduces to one dimensional Gaussian integrals. See Appendix A.6 and A.7 for details and examples of η Ω .

al. (2020); Hoogeboom et al. (2021); Austin et al. (2021); Li et al. (2022); Dieleman et al. (2022) for discrete data). De Bortoli et al. (2022) discusses how to learn score-based generative models on general Riemannian manifolds. Another highly related work is Ye et al. (

where $I_1$ to $I_9$ are discrete domains and $I_{10}$ to $I_{15}$ are non-negative continuous domains. For discrete domains $I = \{e_1, \ldots, e_d\}$, we have
\[
\eta^I(z, t) = \sigma_t^2\,\nabla_z \log \sum_{e\in I} \exp\Big(-\frac{\|z - e\|^2}{2(\beta_T - \beta_t)}\Big);
\]
for non-negative continuous domains $I = [0, +\infty)$, derivation shows
\[
\eta^I(z, t) = \sigma_t^2\,\nabla_z \log F\Big(\frac{z}{\sqrt{\beta_T - \beta_t}}\Big),
\]
where $F$ is the standard Gaussian CDF. Finally, we have $\eta^\Omega(z, t) = [\eta^{I_i}(z_i, t)]_{i=1}^{15}$ for the whole domain $\Omega$. We set the number of diffusion steps to $K = 2000$. Results are shown in Table

Figure 2 & Table 2: The point clouds (upper row) generated by different methods and the meshes reconstructed from them (lower row). Grid-Bridge obtains more uniform points, and hence better meshes, thanks to the integer constraints. Numbers in the table are multiplied by 10^3.

Figure 3: Results on generating categorical segmentation maps. Each pixel here is a one-hot vector. Each dimension of the Ω-bridge starts from a deterministic value and evolves through a stochastic trajectory to converge to either 0 or 1. The generated samples have similar visual quality to the training data.

Figure 4: Integer-bridges can generate high-quality discrete samples with different initial distributions. In this experiment, we apply three types of bridges. All of these bridges use the same output domain Ω = {0, ..., 255}^{h×w×c}, where h, w, c are the height, width and number of channels of the images, respectively. We set σ_t = 3 exp(−3t). We consider different initial distributions: (Init. A) Z_0 = 128; (Init. B) Z_0 = μ̂_0; (Init. C) Z_0 ∼ N(μ̂_0, σ̂_0), where μ̂_0 and σ̂_0 are the empirical mean and variance of pixels in the CIFAR10 training set. The number of diffusion steps K is set to 1000. We compare with the variants of a state-of-the-art discrete diffusion model, D3PM (Austin et al., 2021). For fair comparison, we use the DDPM backbone (Ho et al., 2020) as the neural drift f^θ in our method, similar to D3PM. We report the Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), and the negative log-likelihood (NLL) on the test dataset. We call our method Integer-Bridge in this case. The results are shown in Table 4 and Figure 4.

Result: In Table 4, Integer-Bridge with Initializations A, B, and C all achieve lower FIDs (≤ 7) than the variants of D3PM. Among the three kinds of Integer-Bridges, Integer-Bridge (Init. B) obtains the lowest NLL (3.35). It also beats D3PM Gauss + logistic (3.435) on NLL, which has the best NLL among the variants of D3PM.

Lemma A.16 (Grönwall's inequality). Let $I$ denote an interval of the real line of the form $[a, \infty)$ or $[a, b]$ or $[a, b)$ with $a < b$. Let $\alpha$, $\beta$ and $u$ be real-valued functions defined on $I$. Assume that $\beta$ and $u$ are continuous and that the negative part of $\alpha$ is integrable on every closed and bounded subinterval of $I$. (a) If $\beta$ is non-negative and $u$ satisfies the integral inequality
\[
u(t) \le \alpha(t) + \int_a^t \beta(s)\,u(s)\,\mathrm{d}s, \qquad \forall t \in I,
\]
then
\[
u(t) \le \alpha(t) + \int_a^t \alpha(s)\,\beta(s)\exp\Big(\int_s^t \beta(r)\,\mathrm{d}r\Big)\mathrm{d}s, \qquad \forall t \in I.
\]

The expectation in Eq. (18) is w.r.t. the randomness of $\hat\theta_n$. The $\log(1/\epsilon)$ factor shows up as the sum of a harmonic series, since the variance of $\Delta(\theta; Z, t)$ grows as $O(1/(T - t))$ when $t \uparrow T$. Taking $\epsilon = 1/n$ yields $\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\hat\theta_n,\epsilon}_T) = O(\log n / n)$. If we want to achieve $\mathrm{KL}(\Pi^* \,\|\, \mathbb{P}^{\hat\theta_n,\epsilon}_T) = O(\eta)$, it is sufficient to take $K = T/\epsilon = O(1/\eta)$ steps and $n = O(\log(1/\eta)/\eta)$ data points. [Proof of Proposition A.18] Let $\theta^* = \arg\min_\theta \mathcal{L}^\epsilon(\theta) := \mathbb{E}_{Z\sim\mathbb{Q}^{\Pi^*}}[\ell(\theta; Z, \tau^{\mathrm{unif}})]$.

\[
\eta^\Omega(z, t) = \sigma_t^2\,\frac{\sum_{k=1}^K \omega(e_k, z, t)\,\frac{e_k - z}{\beta_T - \beta_t}}{\sum_{k=1}^K \omega(e_k, z, t)} = \sigma_t^2\,\nabla_z \log \sum_{k=1}^K \omega(e_k, z, t)
\]

Figure 5: Generated tabular data from Mixed-Bridge.

Figure 6: Visualization of the noise schedule of Noise decay A, Noise decay B and Noise decay C.

Figure 7: Diffusion process of one pixel (an 8-dimensional vector) in CityScapes. As t → 1, seven of the dimensions reach 0, while one dimension reaches 1, turning the vector into a one-hot vector.

Figure 8: Distribution of final values of images generated with Continuous-Bridge and Integer-Bridge (before rounding) on CIFAR10. We only show values in [125.5, 130.5] for visual clarity. Integer-Bridge generates discrete values.

Classification accuracy on the Adult Income dataset with different classifiers when trained with data synthesized by generative models. Real Training Data shows the upper bound of the metrics.

Results on the CityScapes dataset. Cat. refers to 'Categorical'.

Discrete CIFAR10 Image Generation

…dequantization (Uria et al., 2013) and variational dequantization…

β is non-negative and if u satisfies the integral inequality u(t) ≤ α(t) + ∫_a^t β(s) u(s) ds for all t ∈ I, then u(t) ≤ α(t) + ∫_a^t α(s) β(s) exp(∫_s^t β(r) dr) ds for all t ∈ I.

The conditional variance cov(Y_t | Z_t) is an important factor that influences the error of the regression problem. We now show that tr(cov(Y_t | Z_t)) = O(1/(T − t)), which means that it explodes to infinity when t ↑ T.

Lemma A.22. For the standard Brownian bridge in (19), we have
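As a concrete illustration (not the lemma's proof), take d = 1 and Π* = N(0, v). Then Z_t | X = x ∼ N(tx/T, t(T − t)/T), the posterior variance of X given Z_t is (1/v + t/(T(T − t)))^{-1}, and the target Y_t = (X − Z_t)/(T − t) has conditional variance Var(X | Z_t)/(T − t)², which indeed scales as O(1/(T − t)):

```python
import numpy as np

def cond_var_target(t, T=1.0, v=1.0):
    """Exact Var(Y_t | Z_t) for the target Y_t = (X - Z_t)/(T - t)
    when X ~ N(0, v) and Z_t is a Brownian bridge from 0 to X."""
    post_var = 1.0 / (1.0 / v + t / (T * (T - t)))   # Var(X | Z_t)
    return post_var / (T - t) ** 2

# (T - t) * Var(Y_t | Z_t) stays bounded as t -> T, confirming the
# O(1/(T - t)) rate; for v = T = 1 the product is exactly 1.
for t in (0.9, 0.99, 0.999):
    print((1.0 - t) * cond_var_target(t))
```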

ACKNOWLEDGEMENTS

This research is supported by NSF CAREER 1846421, SenSE 2037267, EAGER 2041327, the Office of Naval Research, and the NSF AI Institute for Foundations of Machine Learning (IFML).

APPENDIX

where π*(Z_T) = dΠ*/dQ_T (Z_T). Therefore, Q^{Π*} is obtained by multiplying a positive factor π*(Z_T) on the terminal state Z_T of Q. Hence Q^{Π*} has the same Markov structure as that of Q.

If Q^x is constructed from mixtures of bridges as above, the resulting Q^{Π*} is more complex. In fact, simply varying the initialization µ in Proposition A.6 can change the Markov structure of Q^{Π*}.

Proposition A.8. Take Q^x to be the dynamics in (7) …

[Proof of Proposition A.8] When taking Q^x to be the dynamics (7) initialized from Z_0 ∼ µ_0 = N(0, v_0), we have … for some f and g, which is not the case except in the degenerate cases (v_0 = 0 and v_0 = +∞), because Q_{0,1} is not factorized. On the other hand, when v_0 = 0, … is Markov following Proposition A.7. When v_0 = +∞, as in the case of SMLD, Q^{Π*} is the law of Z̃_t = Z_{T−t} with dZ̃_t = dW_t and Z̃_0 ∼ Π*, which is also Markov.

The right characterization of Q^{Π*} from Proposition A.6 involves reciprocal processes (Léonard et al., 2014).

Definition A.9. A process Z with law Q on [0, T] is said to be reciprocal if it can be written as …, where µ is a probability measure on Ω × Ω.

[Proof of Proposition A.10] Note that … is reciprocal by Definition A.9. On the other hand, if … for some Markov process M and probability measure µ on Ω × Ω, then in this case we have …

Intuitively, a reciprocal process can be viewed as connecting the head and tail of a Markov chain, yielding a single loop structure: …, where A is any event that occurs between times s and t. Solutions of Schrödinger bridge problems are reciprocal processes (Léonard et al., 2014).

A.5 PRACTICAL ALGORITHM AND ERROR ANALYSIS

In practice, we need to introduce empirical and numerical approximations in both the training and inference phases. Denote by τ = {τ_i}_{i=1}^{K+1} a grid of time points with 0 = τ_1 < τ_2 < . . . < τ_{K+1} = T. During training, we minimize an empirical, time-discretized surrogate of L(θ):

$$\hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \sum_{t \in \tau^{(i)}} \Delta(\theta; Z^{(i)}, t),$$

where ∆(θ; Z, t) := ‖σ^{-1}(Z_t, t)(s_θ(Z_t, t) − η^x(Z_{[0,t]}, t))‖², {Z^{(i)}} is drawn from Q^{Π*}, and τ^{(i)} can be either a deterministic uniform grid of [0, T], i.e., τ^{(i)} = {i/K}_{i=0}^{K}, or drawn i.i.d. uniformly on [0, T] (see, e.g., Song et al. (2020b); Ho et al. (2020)). A subtle problem here is that the
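A minimal sketch of this discretized objective in one dimension, using the classical Brownian-bridge drift (x − Z_t)/(T − t) as the regression target η^x, a constant σ, and a toy linear model for s_θ (all names are hypothetical, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1.0

def sample_bridge(x, taus):
    """Sample a Brownian bridge from 0 at t=0 to x at t=T on the grid taus."""
    z = np.zeros(len(taus))
    for i in range(len(taus) - 1):
        t, s = taus[i], taus[i + 1]
        mean = z[i] + (s - t) * (x - z[i]) / (T - t)   # pinned transition mean
        var = (s - t) * (T - s) / (T - t)
        z[i + 1] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal()
    return z

def surrogate_loss(theta, data, taus, sigma=1.0):
    """Empirical, time-discretized surrogate: average over paths and grid
    points of Delta(theta; Z, t) = ||(s_theta(Z_t, t) - eta(Z_t, t)) / sigma||^2."""
    total = 0.0
    for x in data:
        z = sample_bridge(x, taus)
        for i in range(len(taus) - 1):      # skip t = T, where eta blows up
            t = taus[i]
            eta = (x - z[i]) / (T - t)      # bridge drift target
            s_theta = theta[0] * z[i] + theta[1]   # toy linear drift model
            total += ((s_theta - eta) / sigma) ** 2
    return total / (len(data) * (len(taus) - 1))

taus = np.linspace(0.0, T, 21)              # deterministic uniform grid
loss = surrogate_loss((0.0, 0.0), rng.standard_normal(8), taus)
```

Note that the sampled bridge hits its pinned endpoint exactly, and the loss is only evaluated at t < T, mirroring the ϵ-truncation discussed in the error analysis.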

