FROM POINTS TO FUNCTIONS: INFINITE-DIMENSIONAL REPRESENTATIONS IN DIFFUSION MODELS

Abstract

Diffusion-based generative models learn to iteratively transform unstructured noise into a complex target distribution, in contrast to Generative Adversarial Networks (GANs) or the decoder of Variational Autoencoders (VAEs), which produce samples from the target distribution in a single step. Thus, in diffusion models every sample is naturally connected to a random trajectory which is a solution to a learned stochastic differential equation (SDE). Generative models are only concerned with the final state of this trajectory, which delivers samples from the desired distribution. Abstreiter et al. (2021) showed that these stochastic trajectories can be seen as continuous filters that wash out information along the way. Consequently, it is reasonable to ask whether there is an intermediate time step at which the preserved information is optimal for a given downstream task. In this work, we show that a combination of information content from different time steps gives a strictly better representation for the downstream task. We introduce attention- and recurrence-based modules that "learn to mix" the information content of various time steps such that the resulting representation leads to superior performance on downstream tasks.

1. INTRODUCTION

Much of the progress in Machine Learning hinges on learning good representations of the data, whether in a supervised or unsupervised fashion. In the absence of label information, learning a good representation is typically guided either by reconstruction of the input, as is the case with autoencoders and generative models like variational autoencoders (Vincent et al., 2010; Kingma & Welling, 2013; Rezende et al., 2014), or by some notion of invariance to certain transformations, as in Contrastive Learning and similar approaches (Chen et al., 2020b;d; Grill et al., 2020). In this work, we analyze a novel way of representation learning introduced in Abstreiter et al. (2021), which uses a denoising objective with diffusion-based models to obtain unbounded representations. Diffusion-based models (Sohl-Dickstein et al., 2015; Song et al., 2020; 2021; Sajjadi et al., 2018; Niu et al., 2020; Cai et al., 2020; Chen et al., 2020a; Saremi et al., 2018; Dhariwal & Nichol, 2021; Luhman & Luhman, 2021; Ho et al., 2021; Mehrjou et al., 2017; Nichol & Dhariwal, 2021) are generative models that apply step-wise perturbations, modeled via a Stochastic Differential Equation (SDE), to samples of the data distribution (e.g., CIFAR10) until convergence to an unstructured distribution (e.g., N(0, I)) called, in this context, the prior distribution. In contrast to this diffusion process, a "score model" is learned to approximate the reverse process, which iteratively converges to the data distribution starting from the prior distribution. Beyond the generative modelling capacity of score-based models, we instead use the additionally encoded representations to perform inference tasks, such as classification. In this work, we revisit the formulation provided by Abstreiter et al. (2021); by applying learned mechanisms over diffusion trajectories, we ask about the similarities and differences of representations across the diffusion process.
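The denoising objective underlying this kind of representation learning can be sketched as denoising score matching in which the score model is additionally conditioned on a learned code z = E(x_0): to help the score model denoise, the encoder must retain task-relevant information about x_0. The snippet below is a minimal single-step sketch under assumed names (`encoder`, `score_model`) and an illustrative geometric noise schedule; it is not the paper's exact implementation.

```python
import numpy as np

def ve_std(t, s_min=0.01, s_max=50.0):
    """Perturbation std under a geometric sigma(t) schedule (an illustrative
    choice; the paper only specifies sigma^2(t) abstractly)."""
    s = s_min * (s_max / s_min) ** t
    return np.sqrt(s ** 2 - s_min ** 2)

def drl_loss(x0, encoder, score_model, t, rng):
    """One-sample denoising score-matching loss where the score model is
    conditioned on the encoder output z = encoder(x0)."""
    std = ve_std(t)
    eps = rng.standard_normal(x0.shape)
    xt = x0 + std * eps                   # sample x_t ~ N(x0, std^2 I)
    z = encoder(x0)                       # representation the score model may use
    target = -eps / std                   # score of N(x0, std^2 I) evaluated at x_t
    return np.mean((score_model(xt, t, z) - target) ** 2)

# Toy stand-ins for the learned networks, just to exercise the objective.
rng = np.random.default_rng(0)
enc = lambda x: x.mean(axis=-1, keepdims=True)
score = lambda x, t, z: -(x - z) / ve_std(t) ** 2
x0 = rng.standard_normal((4, 8))
loss = drl_loss(x0, enc, score, t=0.5, rng=rng)
```

Minimizing this loss over many (x_0, t) pairs jointly trains the encoder and score model, which is what ties the learned representation to the denoising task at each noise level.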
Do they encode certain interpretable features at different points, or is it redundant to look at the whole trajectory? It is worth noting that while the representation itself is infinite-dimensional, we discretize it to perform our analysis. Our findings can be summarized as follows:

• We propose using the trajectory-based representation combined with sequential architectures like Recurrent Neural Networks (RNNs) and Transformers to perform downstream predictions using multiple points, as this leads to better performance than finding one best point on the trajectory for downstream predictions (Abstreiter et al., 2021).

• We analyze the representations obtained at different parts of the trajectory through Mutual Information and attention-based relevance to downstream tasks to showcase the differences in information contained along the trajectory.

• We also provide insights into the benefits of using more points on the trajectory, with saturating benefits as our discretization becomes finer. We further show that finer discretizations lead to even more performance benefits when the latent space is severely restricted, e.g., just a 2-dimensional output from the encoder.
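The first contribution above amounts to feeding the encoder outputs from several discretized time steps into a sequence model instead of picking a single step. A minimal sketch of such a trajectory mixer, here with a GRU head (the class and dimension choices are our own illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class TrajectoryMixer(nn.Module):
    """Combine per-time-step codes z_t from a discretized diffusion trajectory
    into one downstream prediction (hypothetical sketch)."""
    def __init__(self, latent_dim=64, hidden_dim=128, num_classes=10):
        super().__init__()
        self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, z_traj):
        # z_traj: (batch, T, latent_dim) -- encoder outputs at T time steps
        _, h = self.rnn(z_traj)       # final hidden state: (1, batch, hidden_dim)
        return self.head(h[-1])       # class logits: (batch, num_classes)

mixer = TrajectoryMixer()
z = torch.randn(8, 16, 64)            # e.g., 16 discretized time steps
logits = mixer(z)
```

A Transformer variant would replace the GRU with self-attention over the T codes; either way the mixer, not a hand-picked time step, decides how to weight information from different noise levels.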

2. BEYOND FIXED REPRESENTATIONS

We first outline how diffusion-based representation learning systems are trained. Given an example x_0 ∈ R^d sampled from the target distribution p_0, the diffusion process constructs the trajectory (x_t)_{t∈[0,1]} through the application of an SDE. In this work, we consider the Variance Exploding SDE (Song et al., 2021) for this diffusion process, defined as

dx = f(x, t) dt + g(t) dw := sqrt(d[σ²(t)]/dt) dw,    (1)

where w is the standard Wiener process and σ²(·) the noise variance of the diffusion process. This leads to a closed-form distribution of x_t conditional on x_0 as p_0t(x_t | x_0) = N(x_t; x_0, [σ²(t) − σ²(0)] I).
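The closed-form conditional above means x_t can be sampled in one shot, without simulating the SDE. A short sketch, assuming the geometric schedule σ(t) = σ_min (σ_max/σ_min)^t that is commonly paired with the Variance Exploding SDE (the specific schedule and constants are illustrative):

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    """Illustrative geometric noise schedule for the VE SDE."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(x0, t, rng):
    """Sample x_t ~ p_0t(x_t | x_0) = N(x_t; x_0, [sigma^2(t) - sigma^2(0)] I)."""
    std = np.sqrt(sigma(t) ** 2 - sigma(0.0) ** 2)
    return x0 + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
x0 = np.zeros((4, 8))                 # a toy batch standing in for images
xt = perturb(x0, t=0.5, rng=rng)      # noised sample halfway along the trajectory
```

As t grows, the variance σ²(t) − σ²(0) explodes, so x_t progressively washes out the information in x_0; this is exactly the filtering view of the trajectory discussed in the introduction.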



Figure 1: Downstream performance of single-point-based representations (MLP) and full-trajectory-based representations (RNN and Tsf) on different datasets for both types of learned encoders, probabilistic (VDRL) and deterministic (DRL), using a 64-dimensional latent space (Top) and a 128-dimensional latent space (Bottom).

