LEARNING MULTI-SCALE LOCAL CONDITIONAL PROBABILITY MODELS OF IMAGES

Abstract

Deep neural networks can learn powerful prior probability models for images, as evidenced by the high-quality generations obtained with recent score-based diffusion methods. But the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we incorporate diffusion methods into a multi-scale decomposition, reducing dimensionality by assuming a stationary local Markov model for wavelet coefficients conditioned on coarser-scale coefficients. We instantiate this model using convolutional neural networks (CNNs) with local receptive fields, which enforce both the stationarity and Markov properties. Global structures are captured using a CNN with receptive fields covering the entire (but small) low-pass image. We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality.



1 INTRODUCTION
Deep neural networks (DNNs) have produced dramatic advances in synthesizing complex images and solving inverse problems, all of which rely (at least implicitly) on prior probability models. Of particular note is the recent development of "diffusion methods" (Sohl-Dickstein et al., 2015), in which a network trained for image denoising is incorporated into an iterative algorithm to draw samples from the prior (Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021), or to solve inverse problems by sampling from the posterior (Kadkhodaie & Simoncelli, 2020; Cohen et al., 2021; Kawar et al., 2021; Daras et al., 2022). The prior in these procedures is implicitly defined by the learned denoising function, which depends on the prior through the score (the gradient of the log density). But density or score estimation is notoriously difficult for high-dimensional signals because of the curse of dimensionality: worst-case data requirements grow exponentially with the data dimension. How do neural network models manage to avoid this curse?

Traditionally, density estimation is made tractable by assuming simple low-dimensional models, or structural properties that allow factorization into products of such models. For example, the classical Gaussian spectral model for images or sounds rests on an assumption of translation-invariance (stationarity), which guarantees factorization in the Fourier domain. Markov random fields (Geman & Geman, 1984) assume localized conditional dependencies, which guarantees that the density can be factorized into terms acting on local, typically overlapping neighborhoods (Clifford & Hammersley, 1971). In the context of images, these models have been effective in capturing local properties, but are not sufficiently powerful to capture long-range dependencies. Multi-scale image decompositions offered a mathematical and algorithmic framework better suited to the structural properties of images (Burt & Adelson, 1983; Mallat, 2008).
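The link between denoising and the score can be made concrete. For Gaussian noise, Miyasawa's identity states that the optimal (MMSE) denoiser equals the noisy observation plus the noise variance times the score of the noisy density. Below is a minimal sketch (not from the paper) verifying this in one dimension with a Gaussian prior x ~ N(0, s²), where every quantity is available in closed form; the function names and parameter values are illustrative assumptions.

```python
import math

# Miyasawa's identity: for y = x + sigma*n with n ~ N(0,1), the MMSE denoiser is
#     f(y) = y + sigma^2 * d/dy log p_sigma(y),
# where p_sigma is the density of the noisy observation y.
# With a Gaussian prior x ~ N(0, s^2), we have p_sigma = N(0, s^2 + sigma^2),
# so both sides can be computed exactly and compared.

def score_noisy(y, s, sigma):
    """Score (d/dy log density) of the noisy density N(0, s^2 + sigma^2)."""
    return -y / (s**2 + sigma**2)

def denoiser_from_score(y, s, sigma):
    """Denoiser implied by the score, via Miyasawa's identity."""
    return y + sigma**2 * score_noisy(y, s, sigma)

def mmse_denoiser(y, s, sigma):
    """Direct MMSE estimate E[x | y] for the Gaussian prior (Wiener shrinkage)."""
    return y * s**2 / (s**2 + sigma**2)

# The two routes to the denoised estimate agree.
s, sigma = 2.0, 0.5
for y in (-1.0, 0.3, 4.2):
    assert math.isclose(denoiser_from_score(y, s, sigma),
                        mmse_denoiser(y, s, sigma))
```

This is why a trained denoiser implicitly encodes the prior: its residual f(y) − y is, up to the factor σ², an estimate of the score of the noise-smoothed density, which diffusion samplers then follow.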
The multi-scale representation facilitates handling of larger

