LEARNING MULTI-SCALE LOCAL CONDITIONAL PROBABILITY MODELS OF IMAGES

Abstract

Deep neural networks can learn powerful prior probability models for images, as evidenced by the high-quality generations obtained with recent score-based diffusion methods. But the means by which these networks capture complex global statistical structure, apparently without suffering from the curse of dimensionality, remain a mystery. To study this, we incorporate diffusion methods into a multi-scale decomposition, reducing dimensionality by assuming a stationary local Markov model for wavelet coefficients conditioned on coarser-scale coefficients. We instantiate this model using convolutional neural networks (CNNs) with local receptive fields, which enforce both the stationarity and Markov properties. Global structures are captured using a CNN with receptive fields covering the entire (but small) low-pass image. We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality.



Deep neural networks (DNNs) have produced dramatic advances in synthesizing complex images and solving inverse problems, all of which rely (at least implicitly) on prior probability models. Of particular note is the recent development of "diffusion methods" (Sohl-Dickstein et al., 2015), in which a network trained for image denoising is incorporated into an iterative algorithm to draw samples from the prior (Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021), or to solve inverse problems by sampling from the posterior (Kadkhodaie & Simoncelli, 2020; Cohen et al., 2021; Kawar et al., 2021; Daras et al., 2022). The prior in these procedures is implicitly defined by the learned denoising function, which depends on the prior through the score (the gradient of the log density). But density or score estimation is notoriously difficult for high-dimensional signals because of the curse of dimensionality: worst-case data requirements grow exponentially with the data dimension. How do neural network models manage to avoid this curse? Traditionally, density estimation is made tractable by assuming simple low-dimensional models, or structural properties that allow factorization into products of such models. For example, the classical Gaussian spectral model for images or sounds rests on an assumption of translation-invariance (stationarity), which guarantees factorization in the Fourier domain. Markov random fields (Geman & Geman, 1984) assume localized conditional dependencies, which guarantees that the density can be factorized into terms acting on local, typically overlapping neighborhoods (Clifford & Hammersley, 1971). In the context of images, these models have been effective in capturing local properties, but are not sufficiently powerful to capture long-range dependencies. Multi-scale image decompositions offered a mathematical and algorithmic framework better suited to the structural properties of images (Burt & Adelson, 1983; Mallat, 2008).
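The connection between denoising and the score that underlies these methods is Miyasawa's relation (often called Tweedie's formula): for observations y = x + σz with z ∼ N(0, I), the MMSE denoiser x̂ satisfies ∇y log p(y) = (x̂(y) − y)/σ². A quick numerical check in the one case where everything is closed-form, a scalar Gaussian prior (variable names here are illustrative, not from any released code):

```python
import numpy as np

# Miyasawa / Tweedie: for y = x + sigma*z, the MMSE denoiser xhat(y)
# satisfies  grad_y log p(y) = (xhat(y) - y) / sigma**2.
# For a scalar Gaussian prior x ~ N(0, s^2), both sides are closed-form:
#   p(y) = N(0, s^2 + sigma^2)   =>  score(y) = -y / (s^2 + sigma^2)
#   xhat(y) = s^2 / (s^2 + sigma^2) * y   (Wiener shrinkage)
s, sigma = 2.0, 0.5
y = np.linspace(-3.0, 3.0, 101)

score = -y / (s**2 + sigma**2)               # exact score of the noisy density
xhat = (s**2 / (s**2 + sigma**2)) * y        # MMSE denoiser for this prior
score_from_denoiser = (xhat - y) / sigma**2  # denoiser residual, rescaled

assert np.allclose(score, score_from_denoiser)
```

This identity is what lets a trained denoiser stand in for the score of its implicit prior in diffusion sampling; for non-Gaussian priors the denoiser is learned rather than closed-form, but the relation is unchanged.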
The multi-scale representation facilitates handling of larger structures, and local (Markov) models have captured these probabilistically (e.g., Chambolle et al. (1998); Malfait & Roose (1997); Crouse et al. (1998); Buccigrossi & Simoncelli (1999); Paget & Longstaff (1998); Mihçak et al. (1999); Wainwright et al. (2001); Şendur & Selesnick (2002); Portilla et al. (2003); Cui & Wang (2005); Lyu & Simoncelli (2009)). Recent work, inspired by renormalization group theory in physics, has shown that probability distributions with long-range dependencies can be factorized as a product of Markov conditional probabilities over wavelet coefficients (Marchand et al., 2022). Although the performance of these models is eclipsed by recent DNN models, the concepts on which they rest (stationarity, locality, and multi-scale conditioning) are still of fundamental importance. Here, we use these tools to constrain and study a score-based diffusion model. A number of recent DNN image synthesis methods, including variational auto-encoders (Chen et al., 2018), generative adversarial networks (Gal et al., 2021), normalizing flow models (Yu et al., 2020; Li, 2021), and diffusion models (Ho et al., 2022; Guth et al., 2022), use coarse-to-fine strategies, generating a sequence of images of increasing resolution, each seeded by its predecessor. With the exception of the last, these do not make explicit the underlying conditional densities, and none impose locality restrictions on their computation. On the contrary, the stage-wise conditional sampling is typically accomplished with huge DNNs (up to billions of parameters) with global receptive fields. Here, we develop a low-dimensional probability model for images decomposed into multi-scale wavelet sub-bands. Following the renormalization group approach, the image probability distribution is factorized as a product of conditional probabilities of its wavelet coefficients conditioned on coarser-scale coefficients. We assume that these conditional probabilities are local and stationary, and hence can be captured with low-dimensional Markov models. Each conditional score can thus be estimated with a conditional CNN (cCNN) with a small receptive field (RF).
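The factorization described above can be made concrete with a single stage of the orthogonal Haar wavelet transform, which splits an image into a half-resolution low-pass band and three detail bands; the detail coefficients at each scale are the variables whose distribution is modeled conditionally on the coarser band. A minimal NumPy sketch (the helper names are ours, not from the paper's released code):

```python
import numpy as np

def haar_decompose(x):
    """One stage of the 2-D Haar transform: low-pass + 3 detail bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0    # coarse-scale image (half resolution)
    dh  = (a - b + c - d) / 2.0    # horizontal detail
    dv  = (a + b - c - d) / 2.0    # vertical detail
    dd  = (a - b - c + d) / 2.0    # diagonal detail
    return low, (dh, dv, dd)

def haar_reconstruct(low, details):
    """Invert haar_decompose exactly (the transform is orthogonal)."""
    dh, dv, dd = details
    x = np.empty((2 * low.shape[0], 2 * low.shape[1]))
    x[0::2, 0::2] = (low + dh + dv + dd) / 2.0
    x[0::2, 1::2] = (low - dh + dv - dd) / 2.0
    x[1::2, 0::2] = (low + dh - dv - dd) / 2.0
    x[1::2, 1::2] = (low - dh - dv + dd) / 2.0
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
low, details = haar_decompose(img)
assert low.shape == (4, 4)                               # 1/4 of the pixels
assert np.allclose(haar_reconstruct(low, details), img)  # perfect reconstruction
```

Recursing on `low` yields the full multi-scale pyramid; in the model described above, the detail bands at each scale are scored by a cCNN conditioned on the next-coarser band.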
The score of the coarse-scale low-pass band (a low-resolution version of the image) is modeled using a CNN with a global RF, enabling representation of large-scale image structures and organization. We test this model on a dataset of face images, which present a challenging example because of their global geometric structure. Using a coarse-to-fine anti-diffusion strategy for drawing samples from the posterior (Kadkhodaie & Simoncelli, 2021), we evaluate the model on denoising, super-resolution, and synthesis, and show that the locality and stationarity assumptions hold for conditional RF sizes as small as 9 × 9 without harming performance. In comparison, the performance of CNNs restricted to a fixed RF size in the pixel domain degrades dramatically when the RF is reduced to such sizes. Thus, high-dimensional score estimation for images can be reduced to low-dimensional Markov conditional models, alleviating the curse of dimensionality. A software implementation is available at https://github.com/LabForComputationalVision/local-probability-models-of-images.
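The RF sizes involved are easy to reason about: a stack of L convolutional layers with 3 × 3 kernels (stride 1) has an RF of (2L + 1) × (2L + 1), so a 9 × 9 conditioning neighborhood corresponds to only four such layers. The sketch below (plain NumPy, our own helper, not the paper's architecture) verifies this by tracking the footprint of an impulse; strictly positive kernels rule out accidental cancellation:

```python
import numpy as np

def conv3x3_same(x, k):
    """'Same' 3x3 convolution with zero padding (plain NumPy, no frameworks)."""
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + x.shape[0], j:j + x.shape[1]]
    return out

# Push an impulse through L layers of strictly positive random 3x3 kernels;
# the nonzero footprint of the output is exactly the receptive field.
rng = np.random.default_rng(0)
x = np.zeros((31, 31))
x[15, 15] = 1.0
L = 4                                  # four 3x3 layers
for _ in range(L):
    x = conv3x3_same(x, rng.uniform(0.1, 1.0, size=(3, 3)))

rows = np.flatnonzero(x.sum(axis=1) > 0)
width = rows[-1] - rows[0] + 1
assert width == 2 * L + 1              # RF is 9 x 9 for L = 4
```

The linear growth of RF with depth is why a pixel-domain CNN needs many more layers (or dilation/pooling) to span an entire image, whereas in the multi-scale model each cCNN only ever needs a small neighborhood at its own scale.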

1. MARKOV WAVELET CONDITIONAL MODELS

Images are high-dimensional vectors. Estimating an image probability distribution or its score therefore suffers from the curse of dimensionality, unless one limits the estimation to a relatively low-dimensional model class. This section introduces such a model class, as a product of Markov conditional probabilities over multi-scale wavelet coefficients. Markov random fields (Dobrushin, 1968; Sherrington & Kirkpatrick, 1975) define low-dimensional models by assuming that the probability distribution has local conditional dependencies over a graph, which is known a priori. One can then factorize the probability density into a product of conditional probabilities, each defined over a small number of variables (Clifford & Hammersley, 1971). Markov random fields have been used to model stationary texture images, with conditional dependencies within small spatial regions of the pixel lattice. At a location u, such a Markov model assumes that the pixel value x(u), conditioned on the pixel values x(v) for v in a neighborhood of u, is independent of all pixels outside this neighborhood. Beyond stationary textures, however, the chaining of short-range dependencies in the pixel domain has proven insufficient to capture the complexity of long-range geometrical structures. Many variants of Markov models have been proposed (e.g., Geman & Geman (1984); Malfait & Roose (1997); Cui & Wang (2005)), but none have demonstrated performance comparable to recent deep networks while retaining a local dependency structure. Based on the renormalization group approach in statistical physics (Wilson, 1971), new probability models were introduced in Marchand et al. (2022), structured as a product of probabilities of wavelet coefficients conditioned on coarser-scale values, with spatially local dependencies. These Markov




