QUANTIZED COMPRESSED SENSING WITH SCORE-BASED GENERATIVE MODELS

Abstract

We consider the general problem of recovering a high-dimensional signal from noisy quantized measurements. Quantization, especially coarse quantization such as 1-bit sign measurements, leads to severe information loss, and thus good prior knowledge of the unknown signal is helpful for accurate recovery. Motivated by the power of score-based generative models (SGM, also known as diffusion models) in capturing the rich structure of natural signals beyond simple sparsity, we propose an unsupervised data-driven approach called quantized compressed sensing with SGM (QCS-SGM), where the prior distribution is modeled by a pre-trained SGM. To perform posterior sampling, an annealed pseudo-likelihood score, called the noise-perturbed pseudo-likelihood score, is introduced and combined with the prior score of the SGM. The proposed QCS-SGM applies to an arbitrary number of quantization bits. Experiments on a variety of benchmark datasets demonstrate that QCS-SGM significantly outperforms existing state-of-the-art algorithms by a large margin for both in-distribution and out-of-distribution samples. Moreover, as a posterior sampling method, QCS-SGM can easily be used to obtain confidence intervals or uncertainty estimates of the reconstructed results.

1. INTRODUCTION

Many problems in science and engineering such as signal processing, computer vision, machine learning, and statistics can be cast as linear inverse problems: y = Ax + n, where A ∈ R^{M×N} is a known linear mixing matrix, n ∼ N(n; 0, σ²I) is i.i.d. additive Gaussian noise, and the goal is to recover the unknown signal x ∈ R^{N×1} from the noisy linear measurements y ∈ R^{M×1}. Among various applications, compressed sensing (CS) provides a highly efficient paradigm that makes it possible to recover a high-dimensional signal from a far smaller number of measurements, M ≪ N (Candès & Wakin, 2008). The underlying wisdom of CS is to leverage the intrinsic structure of the unknown signal x to aid the recovery. One of the most widely used structures is sparsity, i.e., most elements of x are zero under a certain transform domain, e.g., the wavelet or Fourier transform (Candès & Wakin, 2008). In other words, standard CS exploits the fact that many natural signals in real-world applications are (approximately) sparse. This direction has spurred a hugely active field of research during the last two decades, including efficient algorithm design (Tibshirani, 1996; Beck & Teboulle, 2009; Tropp & Wright, 2010; Kabashima, 2003; Donoho et al., 2009), theoretical analysis (Candès et al., 2006; Donoho, 2006; Kabashima et al., 2009; Bach et al., 2012), as well as all kinds of applications (Lustig et al., 2007; 2008), just to name a few. Despite its remarkable success, traditional CS is still limited in the achievable rates since the sparsity assumptions, whether naive sparsity or block sparsity (Duarte & Eldar, 2011), are too simple to capture the complex and rich structure of natural signals. For example, most natural signals are not strictly sparse even in the specified transform domains, and thus relying on sparsity alone for reconstruction can lead to inaccurate results.
Indeed, researchers have proposed to combine sparsity with additional structural assumptions, such as a low-rank assumption (Fazel et al., 2008; Foygel & Mackey, 2014) or total variation (Candès et al., 2006; Tang et al., 2009), to further improve the reconstruction performance. Nevertheless, these hand-crafted priors apply, at best approximately, to a very limited range of signals and are difficult to generalize to other cases. To address this problem, driven by the success of generative models (Goodfellow et al., 2014; Kingma & Welling, 2013; Rezende & Mohamed, 2015), there has been a surge of interest in developing CS methods with data-driven priors (Bora et al., 2017; Hand & Joshi, 2019; Asim et al., 2020; Pan et al., 2021). The basic idea is that, instead of relying on hand-crafted sparsity, the prior structure of the unknown signal is learned through a generative model, such as a VAE (Kingma & Welling, 2013) or GAN (Goodfellow et al., 2014). Notably, the past few years have witnessed the remarkable success of one new family of probabilistic generative models called diffusion models, in particular score matching with Langevin dynamics (SMLD) (Song & Ermon, 2019; 2020) and denoising diffusion probabilistic modeling (DDPM) (Ho et al., 2020; Nichol & Dhariwal, 2021), which have proven extremely effective and even outperform the state-of-the-art (SOTA) GAN (Goodfellow et al., 2014) and VAE (Kingma & Welling, 2013) in density estimation and generation of various natural sources. As both DDPM and SMLD estimate, implicitly or explicitly, the score (i.e., the gradient of the log probability density w.r.t. the data), they are also collectively referred to as score-based generative models (SGM) (Song et al., 2020). Several CS methods with SGM have been proposed very recently (Jalal et al., 2021a; b; Kawar et al., 2021; 2022; Chung et al., 2022) and perform quite well in recovering x from only a few linear measurements in (1).
Nevertheless, the linear model (1) ideally assumes that the measurements have infinite precision, which is not the case in realistic acquisition scenarios. In practice, the obtained measurements have to be quantized to a finite number of Q bits before transmission and/or storage (Zymnis et al., 2009; Dai & Milenkovic, 2011). Quantization leads to information loss, which makes the recovery particularly challenging. For moderate and high quantization resolutions, i.e., large Q, the quantization impact is usually modeled as mere additive Gaussian noise whose variance is determined by the quantization distortion (Dai & Milenkovic, 2011; Jacques et al., 2010). Subsequently, most CS algorithms originally designed for the linear model (1) can be applied with some modifications. However, such an approach is apparently suboptimal since the information about the quantizer is not utilized to its full extent (Dai & Milenkovic, 2011). This is especially true in the case of coarse quantization, i.e., small Q. An extreme and important case of coarse quantization is 1-bit quantization, where Q = 1 and only the signs of the measurements are observed (Boufounos & Baraniuk, 2008). Apart from its extremely low storage cost, 1-bit quantization is particularly appealing in hardware implementations and has also proven robust to both nonlinear distortions and dynamic range issues (Boufounos & Baraniuk, 2008). Consequently, there have been extensive studies on quantized CS, particularly 1-bit CS, in the past decades, and a variety of algorithms have been proposed, e.g., (Zymnis et al., 2009; Dai & Milenkovic, 2011; Plan & Vershynin, 2012; 2013; Jacques et al., 2013; Xu & Kabashima, 2013; Xu et al., 2014; Awasthi et al., 2016; Meng et al., 2018; Jung et al., 2021; Liu et al., 2020; Liu & Liu, 2022). However, most existing methods are based on standard CS methods and therefore inevitably inherit their inability to capture rich structures of natural signals beyond sparsity.
While several recent works (Liu et al., 2020; Liu & Liu, 2022) studied 1-bit CS using generative priors, their main focus is on VAE and/or GAN priors rather than SGM.

1.1. CONTRIBUTIONS

• We propose a novel framework, termed Quantized Compressed Sensing with Score-based Generative Models (QCS-SGM in short), to recover unknown signals from noisy quantized measurements. QCS-SGM applies to an arbitrary number of quantization bits, including the extremely coarse 1-bit sign measurements. To the best of our knowledge, this is the first time that SGM has been utilized for quantized CS (QCS).
• We consider one popular SGM model, NCSNv2 (Song & Ermon, 2020), and verify the effectiveness of the proposed QCS-SGM for QCS on various real-world datasets including MNIST, Cifar-10, CelebA 64 × 64, and the high-resolution FFHQ 256 × 256. Using the pre-trained SGM as a generative prior, for both in-distribution and out-of-distribution samples, QCS-SGM significantly outperforms existing SOTA algorithms by a large margin in QCS. Perhaps surprisingly, even in the extreme case of 1-bit sign measurements, QCS-SGM can still faithfully recover the original images with far fewer measurements than the original signal dimension, as shown in Figure 1. Moreover, compared to existing methods, QCS-SGM tends to recover more natural images with good perceptual quality.
• We propose a noise-perturbed pseudo-likelihood score as a principled way to incorporate information from the measurements, which is one key to the success of QCS-SGM. Interestingly, when degenerated to the linear case (1), if A is row-orthogonal, QCS-SGM reduces to a form similar to Jalal et al. (2021a). While the annealing term (denoted as γ_t² in (4) of Jalal et al. (2021a)) is added heuristically there as one additional hyper-parameter, it appears naturally within our framework and admits an analytical solution. More importantly, for general matrices A, our method generalizes and significantly outperforms Jalal et al. (2021a). It is expected that the idea of the noise-perturbed pseudo-likelihood score can be generalized to other conditional generative models as a general rule.
• As one sampling method, QCS-SGM can easily yield multiple samples with different random initializations, whereby one can obtain confidence intervals or uncertainty estimates.

1.2. RELATED WORKS

The proposed QCS-SGM is a data-driven method for CS; such methods generally follow two lines of research: unsupervised and supervised. The supervised approach trains neural networks end to end using pairs of original signals and observations (Jin et al., 2017; Zhang & Ghanem, 2018; Aggarwal et al., 2018; Wu et al., 2019; Yao et al., 2019; Gilton et al., 2019; Yang et al., 2020; Antun et al., 2020). Despite good performance on specific problems, the supervised approach is severely limited and requires re-training even after a slight change in the measurement model. The unsupervised approach, including the proposed QCS-SGM, only learns a prior and thus avoids such problem-specific training, since the measurement model is known and used only at inference time (Bora et al., 2017; Hand & Joshi, 2019; Asim et al., 2020; Pan et al., 2021). Among others, the CSGM framework (Bora et al., 2017) is the most popular one, learning the prior with generative models. Recent works Liu et al. (2020); Liu & Liu (2022) extended CSGM to non-linear observations including 1-bit CS, and they are the two studies most related to the current work. However, the main focus of Liu et al. (2020); Liu & Liu (2022) is limited to VAE and GAN (in particular DCGAN (Radford et al., 2015)). As is widely known, GAN suffers from unstable training and limited diversity due to its adversarial training nature, while VAE relies on a surrogate loss. Another significant drawback of VAE/GAN priors is that they can have large representation errors or biases due to architecture and training, which can easily be caused by inappropriate latent dimensionality and/or mode collapse (Asim et al., 2020). With the advent of SGM such as SMLD (Song & Ermon, 2019; 2020) and DDPM (Ho et al., 2020; Nichol & Dhariwal, 2021), several recent studies (Jalal et al., 2021a; b; Kawar et al., 2021; 2022; Chung et al., 2022; Daras et al., 2022) have used SGM as a prior for CS and outperform conventional methods.
Interestingly, Daras et al. (2022) performs posterior sampling in the "latent space" of a pre-trained generative model. Nevertheless, all the existing works on SGM for CS focus on linear measurements (1) without quantization, which therefore lead to performance degradation when directly applied to the quantized case (2). Moreover, even in the linear case (1), as illustrated in Corollary 1.2 and Remark 3, the likelihood score computed in Jalal et al. (2021a; b) is not accurate for general matrices A. By contrast, we provide a simple yet effective approximation of the likelihood score which significantly outperforms Jalal et al. (2021a) for general matrices even in the linear case. Another related work (Qiu et al., 2020) studied 1-bit CS where the sparsity is implicitly enforced via mapping a low dimensional representation through a known ReLU generative network. Wei et al. (2019) also considers a non-linear recovery using a generative model assuming that the original signal is generated from a low-dimensional latent vector.

2.1. INVERSE PROBLEMS FROM QUANTIZED MEASUREMENTS

The inverse problem from noisy quantized measurements is posed as follows (Zymnis et al., 2009): y = Q(Ax + n), where the goal is to recover the unknown signal x ∈ R^{N×1} from quantized measurements y ∈ R^{M×1}, where A ∈ R^{M×N} is a known linear mixing matrix, n ∼ N(n; 0, σ²I) is i.i.d. additive Gaussian noise with known variance, and Q(·): R^{M×1} → Q^{M×1} is an element-wise quantizer function which maps each element into a finite (or countable) set of codewords Q, i.e., y_m = Q(z_m + n_m) ∈ Q, or equivalently (z_m + n_m) ∈ Q^{-1}(y_m), m = 1, 2, ..., M, where z_m is the m-th element of z = Ax. Usually, the quantization codewords correspond to intervals (Zymnis et al., 2009), i.e., Q^{-1}(y_m) = [l_{y_m}, u_{y_m}), where l_{y_m}, u_{y_m} denote prescribed lower and upper thresholds associated with the quantized measurement y_m, i.e., l_{y_m} ≤ (z_m + n_m) < u_{y_m}. For example, for a uniform quantizer with Q quantization bits (resolution), the quantization codewords Q = {q_r}_{r=1}^{2^Q} consist of 2^Q elements,

q_r = (2r − 2^Q − 1)Δ/2, r = 1, ..., 2^Q,

where Δ > 0 is the quantization interval, and the lower and upper thresholds associated with q_r are

l_{q_r} = −∞ for r = 1, and l_{q_r} = (r − 2^{Q−1} − 1)Δ for r = 2, ..., 2^Q;
u_{q_r} = (r − 2^{Q−1})Δ for r = 1, ..., 2^Q − 1, and u_{q_r} = +∞ for r = 2^Q.

In the extreme 1-bit case, i.e., Q = 1, only the sign values are observed, i.e., y = sign(Ax + n), where the quantization codewords are Q = {−1, +1} and the associated thresholds are l_{−1} = −∞, u_{−1} = 0 and l_{+1} = 0, u_{+1} = +∞, respectively.
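To make the quantizer concrete, here is a minimal NumPy sketch of the uniform Q-bit quantizer and the 1-bit sign quantizer described above. The function names and the searchsorted-based interval lookup are our own implementation choices, not from the paper.

```python
import numpy as np

def uniform_quantize(z, Q, delta):
    """Uniform Q-bit quantizer of Section 2.1 (a sketch; names are ours).

    Maps each element of z to the codeword q_r = (2r - 2**Q - 1) * delta / 2,
    r = 1..2**Q, whose interval [l_{q_r}, u_{q_r}) contains it, with
    saturation at the two outermost codewords.
    """
    r = np.arange(1, 2**Q + 1)
    codewords = (2 * r - 2**Q - 1) * delta / 2.0
    # lower thresholds: l_{q_1} = -inf; l_{q_r} = (r - 2**(Q-1) - 1) * delta otherwise
    lower = np.concatenate(([-np.inf], (r[1:] - 2**(Q - 1) - 1) * delta))
    # index of the interval [l, u) containing each z_m
    idx = np.searchsorted(lower, z, side="right") - 1
    return codewords[idx], idx

def sign_quantize(z):
    """1-bit quantizer: only the sign survives (codewords {-1, +1})."""
    return np.where(z >= 0, 1.0, -1.0)
```

For instance, with Q = 2 and Δ = 1 the codewords are {−1.5, −0.5, 0.5, 1.5}, and any input above the top threshold saturates to 1.5.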

2.2. SCORE-BASED GENERATIVE MODELS

For any continuously differentiable probability density function p(x), if we have access to its score function ∇_x log p(x), we can iteratively sample from it using Langevin dynamics (Turq et al., 1977; Bussi & Parrinello, 2007; Welling & Teh, 2011):

x_t = x_{t−1} + α_t ∇_{x_{t−1}} log p(x_{t−1}) + √(2α_t) z_t, 1 ≤ t ≤ T,

where z_t ∼ N(z_t; 0, I), α_t > 0 is the step size, and T is the total number of iterations. It has been demonstrated that when α_t is sufficiently small and T is sufficiently large, the distribution of x_T converges to p(x) (Roberts & Tweedie, 1996; Welling & Teh, 2011). In practice, the score function ∇_x log p(x) is unknown and can be estimated by a score network s_θ(x) via score matching (Hyvärinen, 2006; Vincent, 2011). However, vanilla Langevin dynamics faces a variety of challenges such as slow convergence. To address this, inspired by simulated annealing (Kirkpatrick et al., 1983; Neal, 2001), Song & Ermon (2019) proposed an annealed version of Langevin dynamics, which perturbs the data with Gaussian noise of different scales and jointly estimates the score functions of the noise-perturbed data distributions. Accordingly, during inference, annealed Langevin dynamics (ALD) is performed to leverage the information from all noise scales. Specifically, assume that p_β(x̃ | x) = N(x̃; x, β²I), so that p_β(x̃) = ∫ p_data(x) p_β(x̃ | x) dx, where p_data(x) is the data distribution. Consider a sequence of noise scales {β_t}_{t=1}^T satisfying β_max = β_1 > β_2 > ··· > β_T = β_min > 0, where β_min is small enough that p_{β_min}(x̃) ≈ p_data(x̃), and β_max is large enough that p_{β_max}(x̃) ≈ N(x̃; 0, β²_max I).
The noise conditional score network (NCSN) s_θ(x̃, β) proposed in Song & Ermon (2019) aims to estimate the score function of each p_{β_t}(x̃) by optimizing the following weighted sum of score-matching objectives:

θ* = argmin_θ Σ_{t=1}^T E_{p_data(x)} E_{p_{β_t}(x̃|x)} ∥s_θ(x̃, β_t) − ∇_{x̃} log p_{β_t}(x̃ | x)∥²_2. (7)

After training the NCSN, for each noise scale, we can run K steps of Langevin MCMC to obtain a sample from each p_{β_t}(x̃):

x_t^k = x_t^{k−1} + α_t s_θ(x_t^{k−1}, β_t) + √(2α_t) z_t^k, k = 1, 2, ..., K.

The sampling process is repeated sequentially for t = 1, 2, ..., T with x_1^0 ∼ N(x; 0, β²_max I) and x_{t+1}^0 = x_t^K for t < T. As shown in Song & Ermon (2019), when K → ∞ and α_t → 0 for all t, the final sample x_T^K becomes an exact sample from p_{β_min}(x̃) ≈ p_data(x̃) under some regularity conditions. Later, through a theoretical analysis of the learning and sampling process of NCSN, an improved version, termed NCSNv2, was proposed in Song & Ermon (2020); it is more stable and scales to various datasets with high resolutions.
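The ALD sampler above can be sketched in a few lines of NumPy. As a hedge: the learned network s_θ(x̃, β_t) is replaced here by the closed-form score of a toy Gaussian target (for p_data = N(0, 1), the β-perturbed density is N(0, 1 + β²) with score −x/(1 + β²)); the step-size rule α_t = ϵβ_t²/β_T² follows Song & Ermon (2019), and all numerical settings are illustrative.

```python
import numpy as np

def annealed_langevin(score, x0, betas, eps, K, rng):
    """Annealed Langevin dynamics (Song & Ermon, 2019), sketched with a
    caller-supplied score(x, beta) in place of the learned s_theta."""
    x = np.array(x0, dtype=float)
    for beta in betas:                          # beta_1 > ... > beta_T
        alpha = eps * beta**2 / betas[-1]**2    # step-size rule alpha_t = eps * beta_t^2 / beta_T^2
        for _ in range(K):
            z = rng.standard_normal(x.shape)
            x = x + alpha * score(x, beta) + np.sqrt(2 * alpha) * z
    return x

# Toy check: p_data = N(0, 1); the beta-perturbed density is N(0, 1 + beta^2),
# whose score is -x / (1 + beta^2). We run 2000 parallel scalar chains.
score = lambda x, beta: -x / (1.0 + beta**2)
betas = np.geomspace(5.0, 0.1, 10)
rng = np.random.default_rng(0)
samples = annealed_langevin(score, rng.standard_normal(2000), betas, eps=5e-3, K=50, rng=rng)
```

The final samples should be approximately distributed as p_{β_min} = N(0, 1 + β_min²) ≈ N(0, 1).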

3. QUANTIZED CS WITH SGM

In the case of quantized measurements (2) with access to y, the goal is to sample from the posterior distribution p(x | y) rather than p(x). Subsequently, the Langevin dynamics in (6) becomes

x_t = x_{t−1} + α_t ∇_{x_{t−1}} log p(x_{t−1} | y) + √(2α_t) z_t, 1 ≤ t ≤ T,

where the conditional (posterior) score ∇_x log p(x | y) is required. One direct solution is to specially train a score network that matches the score ∇_x log p(x | y). However, this method is rather inflexible and needs to re-train the network when confronted with different measurements. Instead, inspired by conditional diffusion models (Jalal et al., 2021a), we resort to computing the posterior score using Bayes' rule, from which ∇_x log p(x | y) is decomposed into two terms

∇_x log p(x | y) = ∇_x log p(x) + ∇_x log p(y | x), (10)

which comprise the unconditional score ∇_x log p(x) (we call it the prior score) and the conditional score ∇_x log p(y | x) (we call it the likelihood score), respectively. The prior score ∇_x log p(x) can easily be obtained using a trained score network such as NCSN or NCSNv2, so the remaining goal is to compute the likelihood score ∇_x log p(y | x) from the quantized measurements.
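The Bayes decomposition of the posterior score can be sanity-checked in a scalar Gaussian toy model, where both sides are available in closed form. All numbers below are illustrative choices of ours, not from the paper.

```python
import numpy as np

# Scalar toy model: x ~ N(0, s0^2), y = a*x + n, n ~ N(0, s^2).
s0, a, s = 2.0, 0.7, 0.5
x, y = 0.3, 1.1

prior_score = -x / s0**2                       # d/dx log p(x)
likelihood_score = a * (y - a * x) / s**2      # d/dx log p(y | x)
posterior_score = prior_score + likelihood_score

# The Gaussian posterior has variance v = (1/s0^2 + a^2/s^2)^-1 and mean
# m = v * a * y / s^2, so its score is (m - x) / v; both routes must agree.
v = 1.0 / (1.0 / s0**2 + a**2 / s**2)
m = v * a * y / s**2
assert np.isclose(posterior_score, (m - x) / v)
```

The same additivity is what lets QCS-SGM combine a pre-trained prior score with a measurement-dependent likelihood score without any re-training.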

3.1. NOISE-PERTURBED PSEUDO-LIKELIHOOD SCORE

In the case without noise perturbation in x, one can calculate the likelihood score ∇_x log p(y | x) directly from the measurement process (2) and then substitute it into (10) to obtain the posterior score ∇_x log p(x | y). However, this does not work well and might even diverge (Jalal et al., 2021a). The reason is that in NCSN/NCSNv2, as shown in (7), the estimated score is not the original prior score ∇_x log p_data(x) but rather a noise-perturbed prior score ∇_{x̃} log p_{β_t}(x̃) for each noise scale β_t. Therefore, the likelihood score directly computed from (2) does not match the noise-perturbed prior score.

To resolve this mismatch, we instead compute a noise-perturbed pseudo-likelihood score ∇_{x̃} log p_{β_t}(y | x̃), under two assumptions. Assumption 1 (uninformative prior): p(x | x̃) ∝ p_{β_t}(x̃ | x), i.e., the prior p(x) is uninformative relative to the perturbation kernel; this is crude at the beginning of the reverse process but asymptotically accurate as β_t → 0 (see Appendix A). Assumption 2 (row-orthogonality): AA^T is (approximately) diagonal. This holds, e.g., for i.i.d. Gaussian matrices in high dimensions (see Appendix B) and for row-orthogonal matrices, which are known to be advantageous for CS (Vehkaperä et al., 2016). Therefore, this assumption does not impose a serious limitation in the case of CS, and we leave the study of general A as future work. Subsequently, we obtain our main result, as shown in Theorem 1.

Theorem 1. (noise-perturbed pseudo-likelihood score) For each noise scale β_t > 0, if p_{β_t}(x̃ | x) = N(x̃; x, β_t²I), then under Assumption 1 and Assumption 2, the noise-perturbed pseudo-likelihood score ∇_{x̃} log p_{β_t}(y | x̃) for the quantized measurements y in (2) can be computed as

∇_{x̃} log p_{β_t}(y | x̃) = A^T G(β_t, y, A, x̃),

where G(β_t, y, A, x̃) = [g_1, g_2, ..., g_M]^T ∈ R^{M×1} with each element being

g_m = [exp(−ũ²_{y_m}/2) − exp(−l̃²_{y_m}/2)] / [√(σ² + β_t²∥a_m∥²_2) ∫_{l̃_{y_m}}^{ũ_{y_m}} exp(−t²/2) dt], m = 1, 2, ..., M, (12)

where a_m^T ∈ R^{1×N} denotes the m-th row vector of A and

ũ_{y_m} = (a_m^T x̃ − u_{y_m}) / √(σ² + β_t²∥a_m∥²_2), l̃_{y_m} = (a_m^T x̃ − l_{y_m}) / √(σ² + β_t²∥a_m∥²_2).

Proof. The proof is shown in Appendix C.

In the extreme 1-bit case, a more concise result can be obtained, as shown in Corollary 1.1: Corollary 1.1.
(noise-perturbed pseudo-likelihood score in the 1-bit case) For each noise scale β_t > 0, if p_{β_t}(x̃ | x) = N(x̃; x, β_t²I), then under Assumption 1 and Assumption 2, the noise-perturbed pseudo-likelihood score ∇_{x̃} log p_{β_t}(y | x̃) for the sign measurements y = sign(Ax + n) in (5) can be computed as

∇_{x̃} log p_{β_t}(y | x̃) = A^T G(β_t, y, A, x̃),

where G(β_t, y, A, x̃) = [g_1, g_2, ..., g_M]^T ∈ R^{M×1} with each element being

g_m = [(1 + y_m)/(2Φ(z̃_m)) − (1 − y_m)/(2(1 − Φ(z̃_m)))] · exp(−z̃_m²/2) / √(2π(σ² + β_t²∥a_m∥²_2)), m = 1, 2, ..., M, (15)

where z̃_m = a_m^T x̃ / √(σ² + β_t²∥a_m∥²_2) and Φ(z) = (1/√(2π)) ∫_{−∞}^z e^{−t²/2} dt is the cumulative distribution function (CDF) of the standard normal distribution.

Proof. It can easily be proved following the proof of Theorem 1.

Interestingly, if no quantization is used, or equivalently Q → ∞, the model in (2) reduces to the standard linear model (1). Accordingly, the results in Theorem 1 can be modified as in Corollary 1.2:

Corollary 1.2. (noise-perturbed pseudo-likelihood score without quantization) For each noise scale β_t > 0, if p_{β_t}(x̃ | x) = N(x̃; x, β_t²I), then under Assumption 1, the noise-perturbed pseudo-likelihood score ∇_{x̃} log p_{β_t}(y | x̃) for the linear measurements y = Ax + n in (1) can be computed as

∇_{x̃} log p_{β_t}(y | x̃) = A^T (σ²I + β_t²AA^T)^{−1} (y − Ax̃). (16)

Moreover, under Assumption 2, it can be further simplified as

∇_{x̃} log p_{β_t}(y | x̃) = A^T G_linear(β_t, y, A, x̃),

where G_linear(β_t, y, A, x̃) = [g_1, g_2, ..., g_M]^T ∈ R^{M×1} with each element being

g_m = (y_m − a_m^T x̃) / (σ² + β_t²∥a_m∥²_2), m = 1, 2, ..., M. (18)

Proof. The proof is shown in Appendix D.

Remark 3. In contrast to the quantized case, here we obtain a closed-form result (16) for any kind of matrix A. Interestingly, if Assumption 2 is further imposed, it reduces to (18), which is similar to Jalal et al. (2021a), where the annealing term β_t²∥a_m∥²_2 in (18) corresponds to γ_t² in Jalal et al. (2021a).
However, γ_t² in Jalal et al. (2021a) is added heuristically as an additional hyper-parameter, while we derive it in closed form from the principled noise-perturbed pseudo-likelihood score. More importantly, for general matrices A, our closed-form result (16) significantly outperforms Jalal et al. (2021a), as demonstrated in Appendix E. Therefore, we not only explain the necessity of an annealing term in Jalal et al. (2021a) but also extend and improve Jalal et al. (2021a) in the general case.
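For concreteness, the pseudo-likelihood score of Theorem 1 can be sketched for general quantization thresholds as follows. This is our own NumPy rendering (the function name, the threshold encoding, and the erf-based Φ are implementation choices, not from the paper); the observed codewords y_m enter only through their thresholds [l_{y_m}, u_{y_m}).

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))   # standard normal CDF
phi = lambda t: np.exp(-np.square(t) / 2.0) / np.sqrt(2.0 * np.pi)  # standard normal pdf

def likelihood_score(A, x, sigma, beta_t, lower, upper):
    """Noise-perturbed pseudo-likelihood score of Theorem 1 (our sketch).

    lower[m], upper[m] are the thresholds [l_{y_m}, u_{y_m}) of the observed
    codeword y_m; use -inf/+inf for the saturated intervals.
    """
    z = A @ x
    s = np.sqrt(sigma**2 + beta_t**2 * np.sum(A**2, axis=1))  # sqrt(sigma^2 + beta_t^2 ||a_m||^2)
    u_t = (z - upper) / s
    l_t = (z - lower) / s
    # p(y_m | x) = Phi(-u_t) - Phi(-l_t); g_m is its log-derivative w.r.t. a_m^T x
    g = (phi(l_t) - phi(u_t)) / (s * (Phi(-u_t) - Phi(-l_t)))
    return A.T @ g
```

Since each g_m is the exact derivative of log[Φ(−ũ_{y_m}) − Φ(−l̃_{y_m})] with respect to a_m^T x, the returned vector can be checked against a finite-difference gradient of the log pseudo-likelihood.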

3.2. POSTERIOR SAMPLING VIA ANNEALED LANGEVIN DYNAMICS

By combining the prior score from the SGM with the noise-perturbed pseudo-likelihood score in Theorem 1, we obtain the resultant Algorithm 1, namely Quantized CS with SGM (QCS-SGM in short).

Algorithm 1: Quantized Compressed Sensing with SGM (QCS-SGM)
Input: {β_t}_{t=1}^T, ϵ, K, y, A, σ², quantization codewords Q and thresholds {[l_q, u_q) | q ∈ Q}
Initialization: x_1^0 ∼ U(0, 1)
for t = 1 to T do
    α_t ← ϵ β_t² / β_T²
    for k = 1 to K do
        Draw z_t^k ∼ N(0, I)
        Compute G(β_t, y, A, x_t^{k−1}) as in (12) (or (15) for the 1-bit case)
        x_t^k = x_t^{k−1} + α_t [s_θ(x_t^{k−1}, β_t) + A^T G(β_t, y, A, x_t^{k−1})] + √(2α_t) z_t^k
    end for
    x_{t+1}^0 ← x_t^K
end for
Output: x̂ = x_T^K
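A minimal sketch of Algorithm 1 for the 1-bit case follows, with one important substitution: since a pre-trained NCSNv2 is not assumed here, the SGM prior score s_θ(x, β_t) is replaced by the closed-form score of a toy standard Gaussian prior (−x/(1 + β_t²) after β-perturbation). The function name, the clipping of Φ for numerical safety, and all settings are ours.

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF

def qcs_sgm_1bit(y, A, sigma, betas, eps, K, prior_score, rng):
    """Sketch of Algorithm 1 for y = sign(Ax + n); prior_score(x, beta)
    stands in for the trained score network s_theta."""
    M, N = A.shape
    row_norm2 = np.sum(A**2, axis=1)            # ||a_m||_2^2
    x = rng.uniform(0.0, 1.0, size=N)           # x_1^0 ~ U(0, 1)
    for beta in betas:
        alpha = eps * beta**2 / betas[-1]**2    # alpha_t = eps * beta_t^2 / beta_T^2
        s2 = sigma**2 + beta**2 * row_norm2
        for _ in range(K):
            zt = A @ x / np.sqrt(s2)
            p = np.clip(Phi(zt), 1e-10, 1.0 - 1e-10)   # guard against CDF saturation
            # Corollary 1.1: pseudo-likelihood score for 1-bit measurements
            g = ((1 + y) / (2 * p) - (1 - y) / (2 * (1 - p))) \
                * np.exp(-zt**2 / 2) / np.sqrt(2 * np.pi * s2)
            x = (x + alpha * (prior_score(x, beta) + A.T @ g)
                 + np.sqrt(2 * alpha) * rng.standard_normal(N))
    return x
```

On a synthetic problem with x drawn from the assumed Gaussian prior and M ≫ N sign measurements, the returned sample should align closely with the true signal direction.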

4. EXPERIMENTS

We empirically demonstrate the efficacy of the proposed QCS-SGM in various scenarios. The widely used i.i.d. Gaussian matrix A, i.e., A_ij ∼ N(0, 1/M), is considered. The code is available at https://github.com/mengxiangming/QCS-SGM. Datasets: Three popular datasets, MNIST (LeCun & Cortes, 2010), Cifar-10 (Krizhevsky & Hinton, 2009), and CelebA (Liu et al., 2015), as well as the high-resolution Flickr Faces High Quality (FFHQ) dataset (Karras et al., 2018), are considered. We train an NCSNv2 model for MNIST following Song & Ermon (2020), while for Cifar-10, CelebA, and FFHQ we directly use the publicly available pre-trained NCSNv2 models. Therefore, the prior score can be estimated using these pre-trained NCSNv2 models. After observing the quantized measurements as in (2), we infer the original x by posterior sampling via QCS-SGM in Algorithm 1. It is important to note that we select images x that are unseen by the pre-trained SGM models. For details of the experimental setting, please refer to Appendix F.

4.1. 1-BIT QUANTIZATION

First, we perform experiments in the extreme case, i.e., 1-bit measurements. Specifically, we consider images from the MNIST (LeCun & Cortes, 2010) and CelebA (Liu et al., 2015) datasets in the same setting as Liu & Liu (2022). Apart from QCS-SGM, we also show results of LASSO (Tibshirani, 1996) (for CelebA, Lasso with a DCT basis (Ahmed et al., 1974) is used), CSGM (Bora et al., 2017), BIPG (Liu et al., 2020), and OneShot (Liu & Liu, 2022). For Lasso (Lasso-DCT), CSGM, BIPG, and OneShot, we follow the default settings of the open-sourced code of Liu & Liu (2022). Typical reconstructed images from 1-bit measurements with fixed M ≪ N are shown in Figure 2 for both MNIST and CelebA. Perhaps surprisingly, QCS-SGM can faithfully recover the images from only M ≪ N 1-bit measurements, while the other methods fail or recover only quite vague images. To quantitatively evaluate the effect of M, we compare the different algorithms using two popular metrics, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), for different M; the results are shown in Figure 3. It can be seen that, under both metrics, the proposed QCS-SGM significantly outperforms all the other algorithms by a large margin. An investigation of the effect of (pre-quantization) Gaussian noise is shown in Figure 23 in Appendix J. Please refer to Appendix H for more results. Multiple samples & uncertainty estimates: It is worth pointing out that, as a sampling method, QCS-SGM can yield multiple samples with different random initializations, so that one can easily obtain confidence intervals or uncertainty estimates of the reconstructed results. Please refer to Appendix G for more typical samples from QCS-SGM. In contrast, OneShot (Liu & Liu, 2022) can easily get stuck in a local minimum under different random initializations (Liu & Liu, 2022).

4.2. MULTI-BIT QUANTIZATION

Next, we evaluate the efficacy of QCS-SGM in the case of multi-bit quantization, e.g., 2-bit and 3-bit. The results on Cifar-10 and CelebA are shown in Figure 4. The results for the linear case without quantization, using (18) in Corollary 1.2, are also shown for comparison. As expected, the reconstruction performance improves as the quantization resolution increases. One typical result for high-resolution FFHQ 256 × 256 images is shown in Figure 1. Moreover, to evaluate the quantization effect under a "fixed budget" setting, i.e., Q × M remains the same for different Q and M, we conduct experiments on CelebA with Q × M = 12288 fixed. Interestingly, as shown in Table 1, under the fixed budget, while there is an apparent increase in PSNR for the multi-bit case, the perceptual metric SSIM remains about the same as in the 1-bit case. For more results, please refer to Appendix H.

4.3. OUT-OF-DISTRIBUTION PERFORMANCE

In practice, the signals of interest may be out-of-distribution (OOD) because the available training dataset is biased and unrepresentative of the true underlying distribution. To assess the performance of QCS-SGM on OOD datasets, we choose CelebA as the training set while evaluating results on OOD images randomly sampled from the FFHQ dataset, which have features of variation that are rare among CelebA images (e.g., skin tone, age, beards, and glasses) (Jalal et al., 2021b). As shown in Figure 5, even on OOD images, the proposed QCS-SGM still obtains significantly competitive results compared with other existing algorithms.

[Table 1: PSNR ↑ and SSIM ↑ per method on CelebA (64 × 64) under a fixed budget Q × M = 12288: 1-bit (M = 12288), 2-bit (M = 6144), 3-bit (M = 4096); the numerical entries were not recovered.]

5. CONCLUSION

In this paper, a novel framework called Quantized Compressed Sensing with Score-based Generative Models (QCS-SGM) is proposed to reconstruct a high-dimensional signal from noisy quantized measurements. Thanks to the power of SGM in capturing the rich structure of natural signals beyond simple sparsity, the proposed QCS-SGM significantly outperforms existing SOTA algorithms by a large margin under coarsely quantized measurements in a variety of experiments, for both in-distribution and out-of-distribution datasets. There are still some limitations of QCS-SGM. For example, it is limited to the case where the measurement matrix A is (approximately) row-orthogonal. The generalization of QCS-SGM to general matrices is an important direction for future work. In addition, the training of SGM models is at present still very expensive, which hinders the applicability of QCS-SGM in resource-limited scenarios where no pre-trained models are available. Compared to traditional CS algorithms without a data-driven prior, the possible lack of sufficient training data also poses a great challenge for QCS-SGM. It is therefore important to address these limitations in the future. Nevertheless, as this is a research topic with rapid progress, we believe that SGM could open up new opportunities for quantized CS, and we sincerely hope that our work, as a first step, can inspire more studies on this fascinating topic.

A VERIFICATION OF ASSUMPTION 1

Mathematically, we can write the exact p_β(y | x̃) as p_β(y | x̃) = ∫ p(y | x) p(x | x̃) dx, where, from Bayes' rule,

p(x | x̃) = p_β(x̃ | x) p(x) / ∫ p_β(x̃ | x) p(x) dx. (20)

For NCSN/NCSNv2, recall that the likelihood p_β(x̃ | x) follows a Gaussian distribution N(x̃; x, β²I), where β² is the variance of the perturbing Gaussian noise and, by construction, β → 0 in the later steps of the reverse diffusion process. As a result, from (20), the likelihood p_β(x̃ | x) dominates the posterior p(x | x̃) as β → 0, and therefore p(x | x̃) ∝ p_β(x̃ | x), indicating that the prior p(x) becomes uninformative. Note that this is particularly the case when the entropy of p(x) is high, which is usual for generative models capable of generating diverse images. While this assumption is crude at the beginning of the reverse diffusion process, when the noise variance β² is large, it is asymptotically accurate as the process goes forward. The effectiveness of this assumption is also empirically supported by the various experiments in Section 4.

Figure 6: Comparison of the exact mean and variance of p(x | x̃) with the pseudo mean and variance under the uninformative-prior assumption, i.e., p(x | x̃) ∝ p(x̃ | x), in the toy scalar Gaussian example. In this plot, we set β_max = 90, β_min = 0.01, T = 500, which are the same as the NCSNv2 settings for CelebA, and the prior standard deviation is set to σ_0 = 15. It can be seen that the approximated values approach the exact values very quickly, verifying the effectiveness of Assumption 1 in this toy example.

A toy example: In the following, we illustrate the assumption in a toy example where x reduces to a scalar random variable x and the associated prior p(x) follows a Gaussian distribution, i.e., p(x) = N(x; 0, σ_0²), where σ_0² is the prior variance. The likelihood p(x̃ | x) in this case is simply p(x̃ | x) = N(x̃; x, β²).
Then, from (20), after some algebra, it can be computed that the posterior distribution p(x | x̃) is p(x | x̃) = N(x; m_exact, v_exact), where

m_exact = σ_0²/(σ_0² + β²) · x̃, v_exact = σ_0²β²/(σ_0² + β²). (22)

Under Assumption 1, i.e., p(x | x̃) ∝ p(x̃ | x), we obtain an approximation of p(x | x̃) as follows:

p(x | x̃) ≃ N(x; m_pseudo, v_pseudo), where m_pseudo = x̃, v_pseudo = β². (24)

Comparing the exact result (22) with the approximate result (24), it can easily be seen that for any fixed σ_0² > 0, as β → 0 we have m_pseudo → m_exact and v_pseudo → v_exact. In Figure 6, similarly to NCSN/NCSNv2, we anneal β geometrically as β_t = β_max (β_min/β_max)^{(t−1)/(T−1)} and compare m_pseudo, v_pseudo with m_exact, v_exact as t increases from 1 to T. It can be seen in Figure 6 that the approximated values m_pseudo, v_pseudo, especially the variance v_pseudo, approach the exact values m_exact, v_exact very quickly, verifying the effectiveness of Assumption 1 in this toy example.
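The toy comparison of Figure 6 is easy to reproduce numerically. The sketch below uses the same annealing schedule (β_max = 90, β_min = 0.01, T = 500) and σ_0 = 15; the value of x̃ is our arbitrary choice, as the source does not specify it.

```python
import numpy as np

# Scalar toy model from Appendix A: prior x ~ N(0, sigma0^2), perturbation
# x_tilde | x ~ N(x, beta^2); compare exact posterior moments against the
# pseudo moments implied by Assumption 1 (p(x | x_tilde) ∝ p(x_tilde | x)).
sigma0, x_tilde, T = 15.0, 1.0, 500
betas = 90.0 * (0.01 / 90.0) ** (np.arange(T) / (T - 1))  # geometric annealing

m_exact = sigma0**2 / (sigma0**2 + betas**2) * x_tilde
v_exact = sigma0**2 * betas**2 / (sigma0**2 + betas**2)
m_pseudo = np.full_like(betas, x_tilde)
v_pseudo = betas**2

# As beta -> 0 the pseudo moments converge to the exact ones
assert abs(m_pseudo[-1] - m_exact[-1]) < 1e-6
assert abs(v_pseudo[-1] - v_exact[-1]) < 1e-6
```

At the first step (β = 90) the pseudo moments are badly off, exactly as the text warns; by the final steps the gap is below 1e-6.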

B VERIFICATION OF ASSUMPTION 2

Before verifying it in the i.i.d. Gaussian case, we first briefly comment on the general case where AAᵀ is not an exactly diagonal matrix. As demonstrated in Appendix C, Assumption 2 is introduced to ensure that the covariance matrix σ²I + β_t² AAᵀ of the Gaussian noise is diagonal, so that we can obtain a closed-form solution for the likelihood score in the quantized case, which is otherwise intractable.

Suppose that the elements of A are i.i.d. Gaussian, i.e., A_ij ∼ N(0, σ²). We investigate the elements of the matrix C = AAᵀ = {C_ij}, i, j = 1, ..., M. Regarding the diagonal elements C_ii, by definition,

C_ii = Σ_{n=1}^{N} A_in², i = 1, ..., M.

As C_ii is the sum of squares of N i.i.d. Gaussian random variables A_in, n = 1, ..., N, it follows a Gamma distribution Γ(N/2, 2σ²), whose mean and variance are Nσ² and 2Nσ⁴, respectively. Regarding the off-diagonal elements C_ij, i ≠ j, by definition,

C_ij = Σ_{n=1}^{N} A_in A_jn, i, j = 1, ..., M, i ≠ j.

As A_in and A_jn are independent Gaussians for i ≠ j, the mean and variance of C_ij are 0 and Nσ⁴, respectively. When σ² = 1/M and M = αN, where α > 0 is the constant measurement ratio, the variance of C_ij is 1/(α²N) → 0 as N → ∞. As a result, all the off-diagonal elements of AAᵀ tend to zero as N → ∞. From another perspective, in Figure 7 we compute the ratio between the average magnitude of the off-diagonal elements C_ij and the diagonal elements C_ii when M = αN with α = 0.5. As N increases, the magnitude of the off-diagonal elements becomes negligible. Consequently, when A is i.i.d. Gaussian, the matrix AAᵀ is well approximated by a diagonal matrix in the high-dimensional regime.
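The entrywise concentration argument above can be checked empirically; a minimal sketch (the dimensions are illustrative choices):

```python
import numpy as np

# Empirical check of Assumption 2: for A with i.i.d. N(0, 1/M) entries and
# M = alpha * N, the off-diagonal entries of C = A A^T become negligible
# relative to the diagonal (mean N * sigma^2 = 1/alpha) as N grows.
rng = np.random.default_rng(0)
alpha = 0.5

for N in [100, 1000, 4000]:
    M = int(alpha * N)
    A = rng.normal(0.0, np.sqrt(1.0 / M), size=(M, N))
    C = A @ A.T
    diag = np.abs(np.diag(C)).mean()                       # ~ 1 / alpha
    off = np.abs(C - np.diag(np.diag(C))).sum() / (M * (M - 1))
    print(N, off / diag)                                   # shrinks ~ 1/sqrt(N)
```

The printed ratio decays roughly as 1/√N, consistent with the variance computation above and with Figure 7.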

C PROOF OF THEOREM 1

Proof. Let us denote z = Ax. For each noise scale β_t > 0, under Assumption 1, we have p_{β_t}(x | x̃) ∝ p_{β_t}(x̃ | x) = N(x̃; x, β_t² I), so that we can equivalently write x = x̃ + β_t w, where w ∼ N(0, I). As a result, z = Ax = A(x̃ + β_t w) = Ax̃ + β_t Aw. Then, from (2), we obtain y = Q(Ax̃ + ñ), where ñ = n + β_t Aw. Since n ∼ N(0, σ²I) and w ∼ N(0, I) are independent of each other, ñ is also Gaussian with mean zero and covariance σ²I + β_t² AAᵀ, i.e., ñ ∼ N(ñ; 0, σ²I + β_t² AAᵀ). Subsequently, under Assumption 2, i.e., AAᵀ is a diagonal matrix, the elements ñ_m ∼ N(ñ_m; 0, σ² + β_t² ‖a_m‖₂²) of ñ are independent of each other, where a_mᵀ denotes the m-th row of A, and thus we can obtain a closed-form solution for the likelihood distribution (we refer to it as a pseudo-likelihood due to the assumptions used) p(y | ẑ = Ax̃) as follows:

p(y | ẑ = Ax̃) = ∏_{m=1}^{M} p(y_m | ẑ_m = a_mᵀ x̃),

where, from the definition of the quantizer Q,

p(y_m | ẑ_m = a_mᵀ x̃) = Pr(l_{y_m} ≤ ẑ_m + ñ_m < u_{y_m}) (31)
= Φ( (−ẑ_m + u_{y_m}) / √(σ² + β_t² ‖a_m‖₂²) ) − Φ( (−ẑ_m + l_{y_m}) / √(σ² + β_t² ‖a_m‖₂²) ) (32)
= Φ(−ũ_{y_m}) − Φ(−l̃_{y_m}),

where Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−t²/2} dt is the cumulative distribution function of the standard normal distribution and

ũ_{y_m} = (a_mᵀ x̃ − u_{y_m}) / √(σ² + β_t² ‖a_m‖₂²),  l̃_{y_m} = (a_mᵀ x̃ − l_{y_m}) / √(σ² + β_t² ‖a_m‖₂²).

As a result, the noise-perturbed pseudo-likelihood score ∇_x̃ log p_{β_t}(y | x̃) for the quantized measurements y in (2) can be computed as

∇_x̃ log p_{β_t}(y | x̃) = Aᵀ G(β_t, y, A, x̃),

where G(β_t, y, A, x̃) = [g_1, g_2, ..., g_M]ᵀ ∈ R^{M×1} with each element being

g_m = ( e^{−ũ_{y_m}²/2} − e^{−l̃_{y_m}²/2} ) / ( √(σ² + β_t² ‖a_m‖₂²) · ∫_{l̃_{y_m}}^{ũ_{y_m}} e^{−t²/2} dt ),  m = 1, 2, ..., M,

which completes the proof.

Proof. Similarly to the proof of Theorem 1, let us denote z = Ax. For each noise scale β_t > 0, under Assumption 1, we can equivalently write x = x̃ + β_t w, where w ∼ N(0, I).
As a result, z = Ax = A(x̃ + β_t w) = Ax̃ + β_t Aw. Then, in the case of the linear model, from (1), we obtain y = Ax̃ + ñ, where ñ = n + β_t Aw. Since n ∼ N(0, σ²I) and w ∼ N(0, I) are independent of each other, ñ is also Gaussian with mean zero and covariance σ²I + β_t² AAᵀ, i.e., ñ ∼ N(ñ; 0, σ²I + β_t² AAᵀ). Therefore, a closed-form solution for the likelihood distribution p(y | ẑ = Ax̃) can be obtained as follows:

p(y | ẑ = Ax̃) = exp( −(1/2) (y − Ax̃)ᵀ (σ²I + β_t² AAᵀ)^{−1} (y − Ax̃) ) / √( (2π)^M det(σ²I + β_t² AAᵀ) ).

As a result, we readily obtain a closed-form solution for the noise-perturbed pseudo-likelihood score ∇_x̃ log p_{β_t}(y | x̃) as follows:

∇_x̃ log p_{β_t}(y | x̃) = Aᵀ (σ²I + β_t² AAᵀ)^{−1} (y − Ax̃).

Furthermore, if AAᵀ is a diagonal matrix, the inverse of σ²I + β_t² AAᵀ becomes trivial since it is a diagonal matrix whose m-th diagonal element is σ² + β_t² ‖a_m‖₂². After some simple algebra, we obtain the equivalent representation in (18), which completes the proof.
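As a sanity check of Theorem 1, the score AᵀG can be compared against a finite-difference gradient of the log pseudo-likelihood. The sketch below uses a 1-bit sign quantizer as an illustrative choice (bin edges l = 0, u = ∞ for y = +1 and l = −∞, u = 0 for y = −1); all sizes and values are assumptions for illustration:

```python
import math
import numpy as np

# Standard normal cdf/pdf (math.erf handles +-inf correctly)
Phi = lambda t: 0.5 * (1.0 + np.vectorize(math.erf)(t / math.sqrt(2.0)))
phi = lambda t: np.exp(-t**2 / 2.0) / math.sqrt(2.0 * math.pi)

rng = np.random.default_rng(1)
M, N = 8, 12
A = rng.normal(0.0, 1.0 / math.sqrt(M), size=(M, N))
x_tilde = rng.normal(size=N)
sigma, beta_t = 0.1, 0.5

# 1-bit measurements y_m = sign(a_m^T x + n_m) with bins [l_{y_m}, u_{y_m})
y = np.sign(A @ x_tilde + sigma * rng.normal(size=M))
l = np.where(y > 0, 0.0, -np.inf)
u = np.where(y > 0, np.inf, 0.0)
s = np.sqrt(sigma**2 + beta_t**2 * np.sum(A**2, axis=1))  # sqrt(sigma^2 + beta_t^2 ||a_m||^2)

def log_pseudo_lik(x):
    u_t, l_t = (A @ x - u) / s, (A @ x - l) / s
    return float(np.sum(np.log(Phi(-u_t) - Phi(-l_t))))

def score(x):
    u_t, l_t = (A @ x - u) / s, (A @ x - l) / s
    g = (phi(l_t) - phi(u_t)) / (s * (Phi(-u_t) - Phi(-l_t)))  # g_m of Theorem 1
    return A.T @ g

eps = 1e-6
num_grad = np.array([
    (log_pseudo_lik(x_tilde + eps * e) - log_pseudo_lik(x_tilde - eps * e)) / (2 * eps)
    for e in np.eye(N)])
print(np.max(np.abs(num_grad - score(x_tilde))))  # small finite-difference error
```

The analytic score matches the numerical gradient up to finite-difference error, confirming the g_m expression above.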

E COMPARISON WITH ALD IN JALAL ET AL. (2021A) IN THE LINEAR CASE

As shown in Corollary 1.2, in the special case without quantization, our results in Theorem 1 reduce to a form similar to the ALD in Jalal et al. (2021a). However, there are several important differences. First, our results are derived in a principled way as the noise-perturbed pseudo-likelihood score and admit closed-form solutions, while the results in Jalal et al. (2021a) are obtained heuristically by adding an additional hyper-parameter (and thus require fine-tuning). Second, the results in Jalal et al. (2021a) are similar to the approximate version (18) of ours, which holds only when AAᵀ is a diagonal matrix; for general matrices A, we have the closed-form approximation (16). We compare our results with ALD (Jalal et al., 2021a) in Figure 8 and Figure 9. At a low condition number of A, when AAᵀ is approximately diagonal, ours with the diagonal approximation (18) performs similarly to (16), and both are similar to ALD (Jalal et al., 2021a), as expected. However, when the condition number of A is large, so that AAᵀ is far from diagonal, our method with (16) significantly outperforms the ALD in Jalal et al. (2021a).
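To make the distinction concrete, the following sketch contrasts the two linear-case score formulas; the sizes and the singular-value rescaling used to inflate the condition number are illustrative assumptions:

```python
import numpy as np

# General linear score with the full inverse (sigma^2 I + beta_t^2 A A^T)^{-1},
# versus the diagonal approximation that divides element-wise by
# sigma^2 + beta_t^2 ||a_m||^2.
rng = np.random.default_rng(2)
M, N = 20, 40
sigma, beta_t = 0.05, 0.3

def scores(A, y, x_tilde):
    r = y - A @ x_tilde
    C = sigma**2 * np.eye(M) + beta_t**2 * (A @ A.T)
    general = A.T @ np.linalg.solve(C, r)                 # full-inverse form
    d = sigma**2 + beta_t**2 * np.sum(A**2, axis=1)       # diagonal of C
    diag = A.T @ (r / d)                                  # diagonal approximation
    return general, diag

x_tilde = rng.normal(size=N)
A = rng.normal(0.0, 1.0 / np.sqrt(M), size=(M, N))

# inflate the condition number by rescaling singular values
U, s_vals, Vt = np.linalg.svd(A, full_matrices=False)
A_ill = U @ np.diag(s_vals * np.logspace(0, -3, M)) @ Vt

for mat in (A, A_ill):
    y = mat @ rng.normal(size=N) + sigma * rng.normal(size=M)
    g, dg = scores(mat, y, x_tilde)
    print(np.linalg.norm(g - dg) / np.linalg.norm(g))     # relative gap
```

The relative gap between the two forms is typically much larger for the ill-conditioned matrix, mirroring the behavior reported in Figures 8 and 9.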

F DETAILED EXPERIMENTAL SETTINGS

In training NCSNv2 for MNIST, we used a training setup similar to that of Song & Ermon (2020) for Cifar-10, as follows. Training: batch-size: 128, n-epochs: 500000, n-iters: 300001, snapshot-freq: 50000, snapshot-sampling: true, anneal-power: 2, log-all-sigmas: false. Please refer to Song & Ermon (2020) and the associated open-sourced code for training details. For Cifar-10, CelebA, and FFHQ, we directly use the pre-trained models available in this Link. When performing posterior sampling using QCS-SGM in Algorithm 1, for simplicity we set a constant value ϵ = 0.0002 for all quantized measurements (e.g., 1-bit, 2-bit, 3-bit) on MNIST, Cifar-10, and CelebA. For the high-resolution FFHQ 256 × 256, we set ϵ = 0.00005 for the 1-bit case and ϵ = 0.00002 for the 2-bit and 3-bit cases, respectively. For all linear measurements on MNIST, Cifar-10, and CelebA, we set ϵ = 0.00002. We believe some further improvement could be achieved by fine-tuning ϵ for different scenarios. For MNIST and Cifar-10, we set β_1 = 50, β_T = 0.01, T = 232; for CelebA, we set β_1 = 90, β_T = 0.01, T = 500; for FFHQ, we set β_1 = 348, β_T = 0.01, T = 2311, the same as Song & Ermon (2020). The number of steps K in QCS-SGM for each noise scale is set to K = 5 in all experiments. For more details, please refer to the submitted code.

G MULTIPLE SAMPLES AND UNCERTAINTY ESTIMATES

As a posterior sampling method, QCS-SGM can yield multiple samples under different random initializations, so that confidence intervals or uncertainty estimates of the reconstructed results are easily obtained. For example, typical samples as well as their mean and standard deviation (std) are shown in Figure 10 for MNIST and CelebA in the 1-bit case. Figure 13 shows the quantitative results of QCS-SGM for different quantization bits in the same setting as Table 1.



While more sophisticated quantization schemes exist (Dirksen, 2019), here we focus on memoryless scalar quantization, where the quantizer acts element-wise, due to its popularity and simplicity.
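A memoryless scalar quantizer can be sketched as follows; the uniform Q-bit design over [−r, r] and the helper name make_quantizer are illustrative assumptions, but the lower/upper bin edges l_y, u_y are exactly the quantities entering the pseudo-likelihood of Theorem 1:

```python
import numpy as np

# Uniform memoryless scalar quantizer: Q acts element-wise, mapping each
# entry to its bin index; lower[y], upper[y] are the bin edges [l_y, u_y),
# with open outermost bins.
def make_quantizer(n_bits, r=1.0):
    n_bins = 2 ** n_bits
    edges = np.linspace(-r, r, n_bins + 1)
    lower = edges[:-1].copy(); lower[0] = -np.inf
    upper = edges[1:].copy();  upper[-1] = np.inf
    def Q(z):
        # element-wise bin index in {0, ..., n_bins - 1}
        return np.clip(np.searchsorted(edges, z, side="right") - 1, 0, n_bins - 1)
    return Q, lower, upper

Q, lower, upper = make_quantizer(n_bits=2, r=1.0)
z = np.array([-3.0, -0.4, 0.1, 2.5])
y = Q(z)
print(y, lower[y], upper[y])   # each z_m lies in [lower[y_m], upper[y_m])
```

The 1-bit sign quantizer is the special case n_bits = 1 with edges at zero, where each measurement retains only the sign of a_mᵀx + n_m.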



Figure 1: Reconstructed images of our QCS-SGM for one FFHQ 256 × 256 high-resolution RGB test image (N = 256 × 256 × 3 = 196608 pixels) from noisy, heavily quantized (1-bit, 2-bit, and 3-bit) 8× CS measurements y = Q(Ax + n), i.e., M = 24576 ≪ N. The measurement matrix A ∈ R^{M×N} is i.i.d. Gaussian, i.e., A_ij ∼ N(0, 1/M), and Gaussian noise n is added with standard deviation σ = 10⁻³.

Figure 2: Typical reconstructed images from 1-bit measurements on MNIST and CelebA. QCS-SGM faithfully recovers original images from 1-bit measurements even when M ≪ N . Compared to other methods, QCS-SGM recovers more natural images with good perceptual quality.

Figure 3: Quantitative comparisons based on different metrics for 1-bit MNIST and CelebA. QCS-SGM remarkably outperforms all other methods under all metrics.

Figure 4: Results of QCS-SGM for Cifar-10 and CelebA images under different quantization bits.

Figure 5: Results on out-of-distribution (OOD) FFHQ dataset from 1-bit (2-bit and 3-bit are also shown for QCS-SGM) measurements. M = 10000, σ = 0.001.

Figure 7: The ratio between the average magnitude of the off-diagonal elements C_ij and the diagonal elements C_ii when M = αN with α = 0.5.

Figure 8: Averaged cosine similarity and PSNR of reconstructed MNIST images for the ALD in Jalal et al. (2021a) and ours (with formulas (18) and (16), respectively) when M = 200, σ = 0.1, for different condition numbers of the matrix A up to cond(A) = 1000. It can be seen that our method with (16) significantly outperforms the ALD in Jalal et al. (2021a) at high condition numbers while performing similarly at low condition numbers. Ours with the diagonal approximation (18) is about the same as ALD, as expected.

Figure 9: Typical recovered MNIST images for the ALD in Jalal et al. (2021a) and ours (with formulas (18) and (16), respectively) when M = 200, σ = 0.05, and the condition number of the matrix A is cond(A) = 1000. It can be seen that our method with (16) significantly outperforms the ALD in Jalal et al. (2021a), which performs about the same as ours with the diagonal approximation (18).

Figure 10: Multiple samples of QCS-SGM on the MNIST (M = 200, 400, σ = 0.05) and CelebA (M = 4000, 1000, σ = 0.001) datasets from 1-bit measurements. The mean and std of the samples are also shown.

Figure 11 and Figure 12 show results with relatively large values of M in the same setting as Figure 2 and Figure 4, respectively.

Figure 11: Typical reconstructed images from 1-bit measurements on MNIST (M = 400) and CelebA (M = 10000).

Figure 12: Results of QCS-SGM for Cifar-10 (M = 4000) and CelebA (M = 16000) images under different quantization bits.

Figure 13: Quantitative results of QCS-SGM for different quantization bits, σ = 0.001.

Figure 14 and Figure 15 show the reconstructed images of QCS-SGM for Cifar10 and CelebA in the fixed budget case of Q × M = 3072 and Q × M = 12288, respectively.

Figure 14: Reconstructed images of QCS-SGM for Cifar10 in the fixed budget case (Q × M = 3072).

Figure 15: Reconstructed images of QCS-SGM for CelebA in the fixed budget case (Q × M = 12288) in the same setting as Table 1.

Figure 17: Results of QCS-SGM on Cifar10 for CS8, i.e., M = 1536, N = 3072, in the noiseless case.

To address this problem, we propose to calculate the noise-perturbed pseudo-likelihood score ∇_x̃ log p_{β_t}(y | x̃) by explicitly accounting for the noise perturbation in NCSN/NCSNv2. Unfortunately, for NCSN/NCSNv2, ∇_x̃ log p_{β_t}(y | x̃) is intractable in the general case. To this end, we propose a simple yet effective approximation by introducing two assumptions.


ACKNOWLEDGEMENTS

X. Meng would like to thank Jiulong Liu for his help in understanding their code of OneShot. This work was supported by JSPS KAKENHI Nos. 17H00764, 19H01812, and 22H05117, and JST CREST Grant Number JPMJCR1912, Japan.

CODE AVAILABILITY

The code is available at https://github.com/mengxiangming/QCS

I COMPARISON WITH NEUMANN NETWORKS IN THE CASE OF NO QUANTIZATION

In the case of no quantization, we compare our method with one popular method called Neumann Networks (Gilton et al., 2019). As in Gilton et al. (2019), the values reported are the median across a test set of size 256. The results for Cifar10 are shown in Table 2. It can be seen that our method outperforms Neumann Networks (Gilton et al., 2019) by a large margin in all cases. Reconstructed images of QCS-SGM are shown in Figures 16-19. We do not show comparisons for quantized CS since Neumann Networks does not support quantization.

J COMPARISON IN THE EXACTLY SAME SETTING AS LIU & LIU (2022)

Note that there is a slight difference in the modeling of the measurement matrix A and noise n between ours and that in Liu & Liu (2022). In Liu & Liu (2022), it is assumed that A is an i.i.d. Gaussian matrix whose elements follow A_ij ∼ N(0, 1) and that n is i.i.d. Gaussian with variance σ², i.e., n_i ∼ N(0, σ²). In practice, however, the measurement matrix A is usually normalized so that the norm of each column equals 1; accordingly, in our setting, each element of the i.i.d. Gaussian matrix follows A_ij ∼ N(0, 1/M). Mathematically, there is a one-to-one correspondence between the two settings, but the simulated settings differ due to the different scaling with the measurement size M. As a result, for an exact comparison with the results in Liu & Liu (2022), we also conducted experiments in exactly the same setting as Liu & Liu (2022), i.e., A_ij ∼ N(0, 1) and n_i ∼ N(0, σ²). The results are shown in Figures 20, 21, 22, and 23. An investigation of the effect of (pre-quantization) Gaussian noise is shown in Figure 23. Interestingly, the results in Figure 23 empirically demonstrate that QCS-SGM is robust to pre-quantization noise, i.e., similar performance is achieved over a large range of noise levels, even when the noise is very large or approaches zero. Note that, unlike the compared methods, the results of QCS-SGM in Figure 23 are not normalized to the range [0, 1], which suggests that the "dithering" effect is not apparent for QCS-SGM. A rigorous analysis of this is left as future work.

