DIFFERENTIABLE GAUSSIANIZATION LAYERS FOR INVERSE PROBLEMS REGULARIZED BY DEEP GENERATIVE MODELS

Abstract

Deep generative models such as GANs, normalizing flows, and diffusion models are powerful regularizers for inverse problems. They exhibit great potential for helping reduce ill-posedness and attain high-quality results. However, the latent tensors of such deep generative models can fall out of the desired high-dimensional standard Gaussian distribution during inversion, particularly in the presence of data noise and inaccurate forward models, leading to low-fidelity solutions. To address this issue, we propose to reparameterize and Gaussianize the latent tensors using novel differentiable data-dependent layers wherein custom operators are defined by solving optimization problems. These proposed layers constrain inverse problems to obtain high-fidelity in-distribution solutions. We validate our technique on three inversion tasks: compressive-sensing MRI, image deblurring, and eikonal tomography (a nonlinear PDE-constrained inverse problem) using two representative deep generative models: StyleGAN2 and Glow. Our approach achieves state-of-the-art performance in terms of accuracy and consistency.

1. INTRODUCTION

Inverse problems play a crucial role in many scientific fields and everyday applications. For example, astrophysicists use radio electromagnetic data to image galaxies and black holes (Högbom, 1974; Akiyama et al., 2019). Geoscientists rely on seismic recordings to reveal the internal structures of Earth (Tarantola, 1984; Tromp et al., 2005; Virieux & Operto, 2009). Biomedical engineers and doctors use X-ray projections, ultrasound measurements, and magnetic resonance data to reconstruct images of human tissues and organs (Lauterbur, 1973; Gemmeke & Ruiter, 2007; Lustig et al., 2007). Therefore, developing effective solutions for inverse problems is of great importance in advancing scientific endeavors and improving our daily lives.

Solving an inverse problem starts with the definition of a forward mapping from parameters $m$ to data $d$, which we formally write as

$$d = f(m) + \epsilon, \qquad (1)$$

where $f$ stands for a forward model that usually describes some physical process, $\epsilon$ denotes noise, $d$ the observed data, and $m$ the parameters to be estimated. The forward model can be either linear or nonlinear and either explicit or implicitly defined by solving partial differential equations (PDEs). This study considers three representative inverse problems: Compressive Sensing MRI, Deblurring, and Eikonal (traveltime) Tomography, which have important applications in medical science, geoscience, and astronomy. The details of each problem and its forward model are in App. A.

The forward problem maps $m$ to $d$, while the inverse problem estimates $m$ given $d$. Unfortunately, inverse problems are generally under-determined with infinitely many compatible solutions and intrinsically ill-posed because of the nature of the physical system. Worse still, the observed data are usually noisy, and the assumed forward model might be inaccurate, exacerbating the ill-posedness. These challenges require using regularization to inject a priori knowledge into inversion processes to obtain plausible and high-fidelity results. Therefore, an inverse problem is usually posed as an optimization problem:

$$\arg\min_{m} \frac{1}{2}\|d - f(m)\|_2^2 + R(m), \qquad (2)$$

where $R(m)$ is the regularization term. Beyond traditional regularization methods such as the Tikhonov regularization and Total Variation (TV) regularization, deep generative models (DGM), such as VAEs (Kingma & Welling, 2013), GANs (Goodfellow et al., 2014), and normalizing flows (Dinh et al., 2014; 2016; Kingma et al., 2016; Papamakarios et al., 2017; Marinescu et al., 2020), have shown great potential for regularizing inverse problems (Bora et al., 2017; Van Veen et al., 2018; Hand et al., 2018; Ongie et al., 2020; Asim et al., 2020; Mosser et al., 2020; Li et al., 2021; Siahkoohi et al., 2021; Whang et al., 2021; Cheng et al., 2022; Daras et al., 2021; 2022). Such deep generative models directly learn from training data distributions and are a powerful and versatile prior. They map latent vectors $z$ to outputs $m$ distributed according to an a priori distribution: $m = g(z) \sim p_{\text{target}}$, with $z \sim \mathcal{N}(0, I)$, for example. The framework of DGM-regularized inversion (Bora et al., 2017) is

$$\arg\min_{z} \frac{1}{2}\|d - f \circ g(z)\|_2^2 + R'(z), \qquad (3)$$

where the deep generative model $g$, whose layers are frozen, reparameterizes the original variable $m$, acting as a hard constraint. Instead of optimizing for $m$, we now estimate the latent variable $z$ and retrieve the inverted $m$ by forward mappings. Since the latent distribution is usually a standard Gaussian, the new (optional) regularization term $R'(z)$ can be chosen as $\beta\|z\|_2^2$ for GANs and VAEs, where $\beta$ is a weighting factor. See App. J.1 for more details on a similar formulation for normalizing flows. Since the optimal $\beta$ depends on the problem and data, tuning $\beta$ is highly subjective and costly.

However, this formulation of DGM-regularized inversion still leads to unsatisfactory results if the data are noisy or the forward model is inaccurate, as shown in Fig. 20, even if we fine-tune the weighting parameter $\beta$. To analyze this problem, first recall that a well-trained DGM has a latent space (usually) defined on a standard Gaussian distribution. In other words, a DGM either only sees standard Gaussian latent tensors during training (e.g., GANs) or learns to establish 1-1 mappings between training examples and typical samples from the standard Gaussian distribution (e.g., normalizing flows, App. J.2). As a result, the generator may map out-of-distribution latent vectors to unrealistic results. We show in Fig. 1 the visual effects of several types of deviations of latent vectors from a spherical Gaussian (with a temperature of 0.7). It can be seen that (1) the latent tensor should have i.i.d. entries, and (2) these entries should be distributed as a 1D standard Gaussian in order to generate plausible images. Since the traditional DGM-regularized inversion lacks such a Gaussianity constraint, we conjecture that the latent tensor deviates from the desired high-dimensional standard Gaussian distribution during inversion, leading to poor results.

Our observations and reasoning motivate us to propose a set of differentiable Gaussianization layers to reparameterize and Gaussianize the latent vectors of deep generative models (e.g., StyleGAN2 (Karras et al., 2020) and Glow (Kingma & Dhariwal, 2018)) for inverse problems. The implementation is available here.
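To make the latent-space formulation of Eq. 3 concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a hypothetical linear "generator" `G` in place of a frozen DGM $g$, a linear forward operator `A` in place of $f$, and plain gradient descent instead of L-BFGS; all names and sizes are illustrative.

```python
import numpy as np

def objective(z, d, A, G, beta):
    """0.5 * ||d - A G z||^2 + beta * ||z||^2, a toy instance of Eq. 3."""
    r = d - A @ (G @ z)
    return 0.5 * r @ r + beta * z @ z

def invert_latent(d, A, G, beta=1e-3, n_iter=2000):
    """Estimate the latent z by gradient descent with a fixed, safe step size."""
    M = A @ G                                        # composed map f o g (linear here)
    hess = M.T @ M + 2.0 * beta * np.eye(G.shape[1])
    lr = 1.0 / np.linalg.norm(hess, 2)               # 1 / largest Hessian eigenvalue
    z = np.zeros(G.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ z - d) + 2.0 * beta * z
        z = z - lr * grad
    return z

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 8))      # hypothetical generator: latent (8) -> model (64)
A = rng.standard_normal((16, 64))     # under-determined forward model: 64 -> 16
z_true = rng.standard_normal(8)
d = A @ (G @ z_true) + 0.01 * rng.standard_normal(16)

z_hat = invert_latent(d, A, G)
m_hat = G @ z_hat                     # retrieve m through the forward mapping m = g(z)
```

With a real DGM the map $g$ is nonlinear, so the gradient would come from automatic differentiation, and nothing in the penalty $\beta\|z\|_2^2$ alone forces the entries of $z$ to look i.i.d. Gaussian.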

2.1. OVERVIEW - REPARAMETERIZATION USING GAUSSIANIZATION LAYERS

Our solution is based on a necessary condition of the standard Gaussian prior on latent tensors. Let us define a partition of a latent tensor $z \in \mathbb{R}^n$ as the collection of non-overlapping patches $P(z) = \{z_i\}_{i=1,\dots,N}$, where the patches $z_i \in \mathbb{R}^D$ are of the same dimension and can be assembled as $z$, i.e., $n = N \times D$. If the latent tensor $z \sim \mathcal{N}(0, I)$, then for all $z_i$ from any partition $P(z)$, we have $z_i \sim \mathcal{N}(0, I)$. Note that $z$ is a symbolic representation of the latent tensor. In a specific DGM, $z$ can be either a 2D/3D tensor or a list of such tensors corresponding to a multi-scale architecture (App. D). Even though there are numerous partition schemes, such as random grouping of components, we choose the simplest: partitioning $z$ based on neighboring components, which works well in practice.

To constrain $z_i \sim \mathcal{N}(0, I)$ during inversion, we reparameterize $z_i$ by constructing a mapping $h : v_i \mapsto z_i$, such that $z_i \sim \mathcal{N}(0, I)$. The new variables $v_i$ are of the same dimension as $z_i$. Once we have constructed the patch-level mapping $h$, we can obtain a mapping $h^\dagger$ at the tensor level, such that $z = h^\dagger(v)$, where $v$ is assembled from $v_i$ in the same way as $z$ from $z_i$. For example, $h^\dagger = \mathrm{diag}(h, \dots, h)$ if patches are extracted from neighboring components and are concatenated into a vectorized $v$. Hence, the original DGM-regularized inversion in Eq. 3 becomes

$$\arg\min_{v} \frac{1}{2}\|d - f \circ g \circ h^\dagger(v)\|_2^2. \qquad (4)$$

Since we are imposing a constraint through reparameterization, there is no need to include the regularization term $R'(z)$. The new formulation is still an unconstrained optimization problem, enabling us to use highly efficient unconstrained optimizers, such as L-BFGS (Nocedal & Wright, 2006) and ADAM (Kingma & Ba, 2015). The remaining critical piece is to construct $h$, and it leads to our Gaussianization layers.
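The partition into neighboring patches and the block-diagonal lifting $h^\dagger = \mathrm{diag}(h, \dots, h)$ can be sketched as follows. This is an illustration under stated assumptions: the helper names (`partition`, `assemble`, `lift`) are hypothetical, and the patch-level map `h` used here (per-patch standardization) is only a placeholder, not the Gaussianization map constructed in the paper.

```python
import numpy as np

def partition(z, D):
    """Split a flat latent vector of length n = N * D into N neighboring patches."""
    z = np.asarray(z, dtype=float)
    assert z.size % D == 0, "n must be divisible by the patch size D"
    return z.reshape(-1, D)                 # row i is the patch z_i

def assemble(patches):
    """Inverse of `partition`: concatenate the patches back into a flat vector."""
    return np.asarray(patches).reshape(-1)

def lift(h):
    """Build the tensor-level map h_dagger = diag(h, ..., h) from a patch-level h."""
    def h_dagger(v, D):
        return assemble(np.stack([h(v_i) for v_i in partition(v, D)]))
    return h_dagger

# Placeholder patch-level map (an assumption for illustration only).
def h(v_i):
    return (v_i - v_i.mean()) / v_i.std()

v = np.arange(12.0)
z = lift(h)(v, D=4)                         # three patches of dimension four
```

In the actual layers, the parameters of $h$ are estimated from all patches jointly, so the same $h$ is applied to every patch, which is what the block-diagonal structure expresses.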
First, we translate the constraint of $z_i = h(v_i) \sim \mathcal{N}(0, I)$ into the following optimization problem:

$$\arg\min_{h} D_{\mathrm{KL}}\big(p_Z(h(v_i)) \,\|\, \mathcal{N}(0, I)\big), \qquad (5)$$

where $p_Z$ is the probability density function (PDF) of $z_i$. Second, we adopt the framework proposed by precursor works of normalizing flows on Gaussianization (Chen & Gopinath, 2000; Laparra et al., 2011) to solve this optimization problem. The KL-divergence can be decomposed as the sum of the multi-information $I(z_i)$ and the marginal negentropy $J_m(z_i)$ (Chen & Gopinath, 2000; Meng et al., 2020):

$$D_{\mathrm{KL}}\big(p_Z(z_i) \,\|\, \mathcal{N}(0, I)\big) = I(z_i) + J_m(z_i), \qquad (6)$$

where

$$I(z_i) = D_{\mathrm{KL}}\Big(p_Z(z_i) \,\Big\|\, \prod_{j=1}^{D} p_j\big(z_i^{(j)}\big)\Big), \quad \text{and} \quad J_m(z_i) = \sum_{j=1}^{D} D_{\mathrm{KL}}\Big(p_j\big(z_i^{(j)}\big) \,\Big\|\, \mathcal{N}(0, 1)\Big). \qquad (7)$$

Here $z_i^{(j)}$ denotes the $j$-th component of patch vector $z_i$, and $p_j$ stands for the marginal PDF for that component. The multi-information $I(z_i)$ quantifies the independence of the components of $z_i$, while the marginal negentropy $J_m(z_i)$ describes how close each component is to a 1D standard Gaussian. With this decomposition, the optimization procedure depends on two facts: (1) the standard Gaussian, and hence the KL divergence in Eq. 6, is invariant under an orthogonal transformation, and (2) the multi-information term is invariant under a component-wise invertible differentiable transformation (App. J.3). As a result, we perform Gaussianization in two steps:

1. Minimize the multi-information $I(z_i)$. This is done by an orthogonal transformation that keeps the overall KL divergence the same but increases the negentropy $J_m(z_i)$. We achieve this by using our independent component analysis (ICA) layer. ICA is the optimal choice since it maximizes non-Gaussianity so that the subsequent marginal Gaussianization step removes it and results in a large decrease in $D_{\mathrm{KL}}\big(p_Z(h(v_i)) \,\|\, \mathcal{N}(0, I)\big)$.

2. Minimize the marginal negentropy $J_m(z_i)$ by component-wise operations that perform 1D Gaussianization of the marginal distributions $p_j$, $j = 1, \dots, D$. The multi-information does not change under component-wise invertible operations. Therefore, the overall KL divergence between $z_i$ and the Gaussian distribution decreases.

The Gaussianization steps are well-aligned with the motivating example (Fig. 1). To constrain DGM outputs to be plausible, one should make components within latent tensor patches independent (or destroy the patterns) (Fig. 1(c)) and shape the 1D distribution as close as possible to Gaussian (Fig. 1(a)(b)). We will see that $h$ is parameterized by an orthogonal matrix and two scalar parameters in 1D Gaussianization layers. Unlike conventional neural network layers, the input-data-dependent Gaussianization layers are not defined by learning from a dataset ahead of time but by solving certain optimization problems on the fly (Fig. 18). Special care should be taken to implement the gradient computation correctly and ensure that it passes the finite-difference convergence test (App. G). As an overview, the composition of our proposed layers is:

$v$ → Whitening → ICA → Yeo-Johnson → Lambert $W \times F_X$ → Standardization → $z$,

where whitening and ICA belong to the first step and the rest belong to the second step. We will discuss in App. K some possible simplifications of the layers in practice after our ablation studies. The overall proposed inversion process (one iteration) is illustrated in Fig. 2.

[Figure 2 diagram; recoverable labels: compute misfit, observed data, calculated data, gradient-based update.]

Figure 2: Illustration of the proposed inversion process. Gradient computation in the Gaussianization layers is enabled by the implicit function theorem and automatic differentiation (App. G). We use the L-BFGS optimizer (Nocedal & Wright, 2006) to update v.
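The decomposition of the KL divergence into multi-information plus marginal negentropy, and the effect of an orthogonal transformation on each term, can be checked in closed form for a zero-mean Gaussian with covariance $\Sigma$. The sketch below is a worked special case under that assumption (the paper's latent patches are not Gaussian in general); the formulas used are the standard Gaussian ones, $D_{\mathrm{KL}} = \tfrac{1}{2}[\mathrm{tr}\,\Sigma - D - \log\det\Sigma]$, $I = \tfrac{1}{2}[\sum_j \log\Sigma_{jj} - \log\det\Sigma]$, and $J_m = \tfrac{1}{2}\sum_j[\Sigma_{jj} - 1 - \log\Sigma_{jj}]$.

```python
import numpy as np

def kl_to_std_normal(S):
    """D_KL( N(0, S) || N(0, I) )."""
    D = S.shape[0]
    return 0.5 * (np.trace(S) - D - np.linalg.slogdet(S)[1])

def multi_information(S):
    """D_KL( N(0, S) || product of its marginals N(0, S_jj) )."""
    return 0.5 * (np.sum(np.log(np.diag(S))) - np.linalg.slogdet(S)[1])

def marginal_negentropy(S):
    """Sum over components of D_KL( N(0, S_jj) || N(0, 1) )."""
    var = np.diag(S)
    return 0.5 * np.sum(var - 1.0 - np.log(var))

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
S = B @ B.T + 0.5 * np.eye(4)          # a correlated, non-unit-variance covariance

# Orthogonal rotation onto the eigenvectors of S (the Gaussian analogue of step 1).
_, Q = np.linalg.eigh(S)
S_rot = Q.T @ S @ Q                     # covariance of Q^T z

total, total_rot = kl_to_std_normal(S), kl_to_std_normal(S_rot)
```

Here the rotation leaves the total KL divergence unchanged, drives the multi-information to zero, and therefore moves the whole divergence into the marginal negentropy, which the component-wise step then removes. For Gaussian data a decorrelating rotation suffices; the ICA layer plays this role for non-Gaussian data.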

2.2. REDUCING MULTI-INFORMATION - ICA LAYER

The orthogonal matrix $W$ is constructed by the independent component analysis (ICA). The input is the patch vectors $\{v_i\}_{i=1,\dots,N}$. The ICA algorithm first computes the input-dependent orthogonal matrix $W$ and then computes $p_i = W^\top v_i$, $i = 1, \dots, N$, as the output. The orthogonal matrix $W$ makes the entries of each $p_i$ independent random variables. We use the FastICA algorithm (Hyvarinen, 1999; Hyvärinen & Oja, 2000), which employs a fixed-point algorithm to maximize a contrast function $\Phi$ (e.g., the logcosh function), for our ICA layer to reduce multi-information. The FastICA algorithm typically requires that the data are pre-whitened. We adopt the ZCA whitening method or the iterative whitening method introduced in Hyvarinen (1999) (App. E). With the whitened data, we compute $W$ using a damped fixed-point iteration scheme:

$$W^{+} = \frac{1}{N}\Big[\alpha V \phi\big(W^\top V\big)^\top - W \,\mathrm{diag}\big(\phi'\big(W^\top V\big)\mathbf{1}\big)\Big], \qquad (8)$$

where $\mathbf{1}$ is an all-one vector, the column vectors of $V$ are $\{v_i\}_{i=1,\dots,N}$, $\phi(\cdot) = \Phi'(\cdot)$, $\alpha \in (0, 1)$, and we use $\alpha = 0.8$ throughout our experiments. To save computation time, we only perform a maximum of 10 iterations. The details of the whole algorithm can be found in App. E.

We set the initial $W$ as an identity matrix. If the input vectors are already standard Gaussian (in-distribution), the computed $W$ will still be an identity matrix, which maps the input to the same output. In practice, the empirical distribution from finite samples is not a standard Gaussian, so $W$ is not an identity matrix but another orthogonal matrix, which still maps standard Gaussian input vectors to standard Gaussian vectors as output. For this reason, we sample starting $\{v_i\}_{i=1,\dots,N}$ from the standard Gaussian distribution to start inversions with plausible outputs.
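The whitening and fixed-point steps described above can be sketched in NumPy as follows. This is a hedged illustration, not the algorithm of App. E: it assumes ZCA whitening, the logcosh contrast (so $\phi = \tanh$), a column convention in which the components are $W^\top v_i$, and a symmetric re-orthogonalization of $W$ after every update (a common FastICA choice that the text above does not spell out). The function names are hypothetical.

```python
import numpy as np

def zca_whiten(V):
    """V has shape (D, N), one patch per column. Returns data with identity covariance."""
    Vc = V - V.mean(axis=1, keepdims=True)
    cov = Vc @ Vc.T / Vc.shape[1]
    evals, E = np.linalg.eigh(cov)
    return E @ np.diag(evals ** -0.5) @ E.T @ Vc

def sym_orthogonalize(W):
    """Nearest orthogonal matrix to W (its polar factor), computed from the SVD."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt

def fastica_layer(V, alpha=0.8, n_iter=10):
    """Damped fixed-point iteration on whitened V; returns W and the outputs W^T V."""
    D, N = V.shape
    W = np.eye(D)                                  # identity initialization
    for _ in range(n_iter):
        Y = W.T @ V                                # current component estimates
        phi = np.tanh(Y)                           # phi = derivative of logcosh
        dphi = 1.0 - phi ** 2                      # phi' = second derivative
        W = (alpha * V @ phi.T - W @ np.diag(dphi.sum(axis=1))) / N
        W = sym_orthogonalize(W)                   # assumed re-orthogonalization step
    return W, W.T @ V

rng = np.random.default_rng(0)
sources = rng.laplace(size=(4, 2000))              # non-Gaussian toy sources
mixed = rng.standard_normal((4, 4)) @ sources
V = zca_whiten(mixed)
W, P = fastica_layer(V)
```

Because $W$ stays orthogonal, the outputs remain white; only the higher-order dependencies between components change.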

2.3. REDUCING MARGINAL NEGENTROPY

For 1D Gaussianization, we choose a combination of the Yeo-Johnson transformation that reduces skewness and the Lambert $W \times F_X$ transformation that reduces heavy-tailedness. Both are layers based on optimization problems with only one parameter, which makes them cheap to compute and easy to back-propagate through. Eq. 7 requires us to perform such 1D transformations for each component of the random vectors. In other words, we need to solve the same optimization problem $D$ times, which imposes a substantial computational burden. Instead, we empirically find it acceptable to share the same optimization-generated parameter across all components. In other words, we perform only a single 1D Gaussianization, treating all entry values in the latent vector as the data simultaneously.

Power transformation layer We propose to use the power transformation or Yeo-Johnson transformation (Yeo & Johnson, 2000) to reduce the skewness of distributions (Fig. 3).

Lambert $W \times F_X$ layer Due to noise and inaccurate forward models, we observe that the distribution of latent vector values tends to be shaped as a heavy-tailed distribution during the inversion process. To reduce the heavy-tailedness, we adopt the Lambert $W \times F_X$ method detailed in Goerg (2015). We use the parameterized Lambert $W \times F_X$ distribution family to approximate a heavy-tailed input and solve an optimization problem to estimate an optimal parameter $\delta$ (App. E), with which the inverse transformation maps the heavy-tailed distribution towards a Gaussian. Fig. 3(b) shows that the Lambert $W \times F_X$ layer acts as a nonlinear squashing function. As $\delta$ increases, it compresses the large values more and reduces the heavy-tailedness. Intuitively, the Lambert $W \times F_X$ layer can also be interpreted as an intelligent way of imposing constraints on the range of values instead of a simple box constraint. We refer the readers to App. E for more details about the optimization problem and implementation.
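The two 1D maps can be sketched as follows, using the standard Yeo-Johnson formula (Yeo & Johnson, 2000) and the heavy-tail Lambert W relation of Goerg (2015), $x = u \exp(\delta u^2 / 2)$ with inverse $u = \mathrm{sign}(x)\sqrt{W(\delta x^2)/\delta}$. This is a sketch under assumptions: the parameters $\lambda$ and $\delta$ are taken as given here, whereas the layers estimate them by optimization (App. E), location and scale handling is omitted, and the Newton solver `lambert_w0` is a small hypothetical helper rather than a library call.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson power transform with parameter lam (lam = 1 is the identity)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if abs(lam) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2.0) > 1e-12:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

def lambert_w0(y, n_iter=50):
    """Principal branch of Lambert W for y >= 0, by Newton's method on w * exp(w) = y."""
    y = np.asarray(y, dtype=float)
    w = np.log1p(y)                       # starts at or above the root, so Newton is monotone
    for _ in range(n_iter):
        ew = np.exp(w)
        w = w - (w * ew - y) / (ew * (w + 1.0))
    return w

def lambert_heavy_tail(u, delta):
    """Forward map: Gaussian-like u -> heavy-tailed x = u * exp(delta * u^2 / 2)."""
    u = np.asarray(u, dtype=float)
    return u * np.exp(0.5 * delta * u ** 2)

def lambert_gaussianize(x, delta):
    """Inverse map: squashes large |x|, more strongly as delta grows."""
    x = np.asarray(x, dtype=float)
    if delta <= 0:
        return x.copy()
    return np.sign(x) * np.sqrt(lambert_w0(delta * x ** 2) / delta)

u = np.linspace(-3.0, 3.0, 13)
x_heavy = lambert_heavy_tail(u, delta=0.3)
u_back = lambert_gaussianize(x_heavy, delta=0.3)
```

The round trip `u -> x_heavy -> u_back` recovers the input, and applying `lambert_gaussianize` with a larger `delta` compresses large values more, consistent with the squashing behavior described above.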
Standardization with temperature Since the Lambert $W \times F_X$ layer output may not necessarily have a zero mean and a unit (or a prescribed) variance, we standardize the output using

$$z = \gamma \, \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}(x)}}, \qquad (9)$$

where $\gamma$ is the temperature parameter suggested in Kingma & Dhariwal (2018).
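A minimal sketch of the standardization step, assuming the mean and variance are the empirical ones computed over all entries of the layer input (the function name is illustrative):

```python
import numpy as np

def standardize(x, gamma=1.0):
    """Shift to zero mean and rescale so the standard deviation equals gamma."""
    x = np.asarray(x, dtype=float)
    return gamma * (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal(1000) + 2.0
z = standardize(x, gamma=0.7)            # e.g. the temperature used for Glow
```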

3. RELATED WORK

End-to-end NNs for inverse problems There are numerous end-to-end neural networks designed for inverse problems, using CNNs (Chen et al., 2017; Jin et al., 2017; Sriram et al., 2020), GANs (Mardani et al., 2018; Lugmayr et al., 2020; Wang et al., 2018), invertible networks (Ardizzone et al., 2018), and diffusion models (Kawar et al., 2022). The general idea is simple: train a neural network that directly maps observed data to estimated parameters. Even though such methods seem effective in a few applications, DGM-regularized inversion with our Gaussianization layers is preferable for the following reasons. First, the forward modeling can be so expensive computationally that it is infeasible to collect a decent training dataset for some applications. For example, one large-scale fluid mechanics or wave propagation simulation can take hours, if not days. Second, the relationship between parameters and data can be highly nonlinear, and multiple solutions may exist. An end-to-end network may map data to interpolated solutions that are not realistic. In comparison, our method can start from different initializations or even employ sampling techniques to address this issue. Third, the configuration of data collection can change from experiment to experiment. It is impractical, for example, to re-train the network each time we change the number and locations of sensors. In contrast, not only can our method deal with this situation, but it can even use the same DGM for different forward models, as we can see in the compressive sensing MRI and eikonal tomography examples. While almost all end-to-end methods are only applied to linear inverse problems, our Gaussianization layers are also effective in nonlinear problems.

Other techniques to improve DGM-regularized inversion In high-dimensional space, the probability mass of a standard Gaussian distribution concentrates within the so-called Gaussian typical set (App. I). To be in the Gaussian typical set, one necessary but not sufficient condition is to be within an annulus area around a high-dimensional sphere (App. I). Utilizing this geometric property, DGM-regularized inversion methods like Bojanowski et al. (2017) and Liang et al. (2021) force updated latent vectors to stay on the sphere. This strategy is closely related to spherical interpolation (White, 2016). We call this strategy the spherical constraint for inversion. In the original StyleGAN2 paper, the authors also noticed that in image projection tasks, the noise maps tend to have leakage from signals (Karras et al., 2020), which is the same phenomenon we discussed. They proposed a multi-scale noise regularization term (NoiseRlg) to penalize spatial correlation in noise maps. We extend the same technique to our inverse problems for comparison. Note that we use a whitening layer before the ICA layer. The whitening layer can be used alone, similar to Huang et al. (2018) and Siarohin et al. (2018); its performance will be reported in ablation studies. Also, Wulff & Torralba (2020) observed that for StyleGAN2, the Leaky ReLU function can "Gaussianize" latent vectors in the W space. CSGM-w (Kelkar & Anastasio, 2021) utilizes this a priori knowledge to improve DGM-regularized compressive sensing problems. To further compare with the Gaussianization layers, we in addition propose an alternative idea that reparameterizes latent vectors using learnable orthogonal matrices (Cayley parameterization) and fixed latent vectors, which is closely related to the work of orthogonal over-parameterized training (Liu et al., 2021) (App. H).

Recently, score-based generative models have started to show promise for solving inverse problems (Jalal et al., 2021; Song et al., 2021). However, they have been mainly applied to linear inverse problems and challenged by noisy data (Kawar et al., 2021). Besides, score-based methods might not work for certain physics-based inverse problems, since Gaussian noise parameters may break the physics simulation engine.
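The spherical constraint discussed above can be sketched as a reparameterization that projects a free variable onto the sphere of radius $\sqrt{\dim(v)}$, where the norm of a high-dimensional standard Gaussian sample concentrates. This is a minimal illustration with a hypothetical function name, not the code of the cited works.

```python
import numpy as np

def spherical_reparam(v):
    """Map v to z = v / ||v||_2 * sqrt(dim(v)), which lies on the sqrt(dim) sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v) * np.sqrt(v.size)

rng = np.random.default_rng(0)
sample = rng.standard_normal(10000)
ratio = np.linalg.norm(sample) / np.sqrt(sample.size)   # close to 1 in high dimension
z = spherical_reparam(np.array([1.0, 2.0, 3.0, 4.0]))
```

Being on the sphere is necessary but not sufficient for typicality: a vector with strongly correlated or skewed entries can still have the right norm, which is the gap the Gaussianization layers target.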

4. EXPERIMENTS

We consider three representative inversion problems for testing: compressive sensing MRI, image deblurring, and eikonal traveltime tomography. For MRI and eikonal tomography, we used synthetic brain images as inversion targets and used the pre-trained StyleGAN2 weights from Kelkar & Anastasio (2021) (trained on data from the databases of fastMRI (Zbontar et al., 2018; Knoll et al., 2020), TCIA-GBM (Scarpace et al., 2016), and OASIS-3 (LaMontagne et al., 2019)) for regularization. We used the test split of the CelebA-HQ dataset (Karras et al., 2018) for deblurring, and the DGM is a Glow network trained on the training split. We refer readers to App. F for details on datasets and training. We tested each parameter configuration in each inversion on 100 images (25 in the eikonal tomography due to its expensive forward modeling). Since the deep generative models are highly nonlinear, the results may get stuck in local minima. Thus, we started inversion using three different randomly initialized latent tensors for each of the 100 or 25 images, picked the best value among the three for each metric, and reported the mean and standard deviation of those metrics, except for CSGM-w, TV, and NoiseRlg, where the initialization is fixed. The metrics we used were PSNR, SSIM (Wang et al., 2004), and an additional LPIPS (Zhang et al., 2018) for the CelebA-HQ data. We used the L-BFGS (Nocedal & Wright, 2006) optimizer in all experiments except TV, noise regularization, and CSGM-w, which use FISTA (Beck & Teboulle, 2009) or ADAM (Kingma & Ba, 2015). The temperature was set to 1.0 for StyleGAN2 and 0.7 for Glow.

4.1. COMPRESSIVE SENSING MRI USING STYLEGAN2

The mathematical model of compressive sensing MRI is

$$d = Am + \epsilon, \qquad (10)$$

where $A \in \mathbb{C}^{M \times N}$ is the sensing matrix, which consists of FFT and subsampling in the k-space (frequency domain). Eq. 10 is an under-determined system, and we use Accl $= N/M$ to denote the acceleration ratio. We also added i.i.d. complex Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB or 10 dB to the measured data. See App. A.1 for more background information.

Table 1 compares the results from total variation regularization (TV), noise regularization (NoiseRlg) (Karras et al., 2020), spherical constraint/reparameterization: $z = v / \|v\|_2 \cdot \sqrt{\dim(v)}$, CSGM-w (Kelkar & Anastasio, 2021), our proposed orthogonal reparameterization (Orthogonal), and our proposed Gaussianization layers (G layers). Fig. 4 shows examples of inversion results. In the base case where Accl=8x and SNR=20 dB, the Gaussianization layers give the best scores, and this advantage gets more significant when data SNR decreases to 10 dB. Interestingly, the scores from all methods improve significantly if we make the system better determined (i.e., Accl=2x), and the performance of TV, spherical constraint, and Gaussianization layers becomes more similar in this scenario. We conclude that our proposed Gaussianization layers are effective and more robust than other methods, particularly in low-SNR scenarios.

Ablation study Table 2 summarizes the ablation study on the components of the Gaussianization layers. We kept the standardization layer on for all cases. The conclusions are as follows:

1. The ICA layer plays the most significant role in improving result scores;
2. The whitening/ZCA layer alone is not effective;
3. The Yeo-Johnson (YJ) and the Lambert $W \times F_X$ (Lambt) layers are more effective when the noise level is higher (e.g., SNR=10 dB vs. 20 dB). Their performance seems data-dependent: when SNR=20 dB, YJ seems more effective, while Lambt gives the best scores when SNR=10 dB. But the overall improvement from these two layers is marginal. In practice, one may only use ICA and one of the 1D Gaussianization layers.

Additionally, we tested the effect of patch size on the z (style) vectors (App. C). We find that the largest possible patch size gives the best results.

4.2. DEBLURRING USING GLOW

The mathematical model of deblurring is

$$d = H * m + \epsilon, \qquad (11)$$

where $H$ is a smoothing filter and $*$ denotes convolution. We used a Gaussian smoothing filter with a standard deviation of 3, and added noise $\epsilon \sim \mathcal{N}(0, 50^2 I)$ to the observed data. Although the system may not be under-determined, the high-frequency information is lost due to low-pass filtering; hence this is also an ill-posed problem. Though we only tested on facial images, deblurring has wide applications in astronomy and geophysics. See App. A.2 for more background information. Table 3 and Fig. 5 show that the Gaussianization layers are also effective in Glow, better than using the spherical constraint or the orthogonal reparameterization. We also demonstrate the efficacy of Gaussianization layers when the forward model is inaccurate, in which case the induced error in data is not Gaussian (App. C).

Ablation study Table 3 also shows that the two parameterization schemes (App. D) for Glow using the Gaussianization layers have similar performance. Besides, we report the ablation study on the components of the Gaussianization layers for Glow (App. C).

4.3. EIKONAL TOMOGRAPHY USING STYLEGAN2

In acoustic wave imaging (e.g., ultrasound tomography), we excite waves using sparsely distributed sources one at a time at the boundary of the object. Then we reconstruct its internal structures (the spatial distribution of wave speed) given the first-arrival travel time recorded on the boundary. The following eikonal equation approximately describes the shortest travel time $T(x; x_s)$ that the acoustic wave emerging from the source location $x_s$ takes to reach location $x$ inside the target object (Yilmaz, 2001):

$$|\nabla T(x; x_s)| = 1/c(x), \quad T(x_s; x_s) = 0, \qquad (12)$$

where $c(x)$ is the wave propagation speed at each location. Both this eikonal PDE and the implicitly defined forward mapping $c(x) \to T(x)$ are nonlinear, and there has been little research on DGM-regularized inverse problems with such nonlinear characteristics. The inverse problem is severely ill-posed, which is equivalent to a curved-ray tomography problem. See App. A.3 for more background information. We added noise to the recorded travel time using the following formula: $T_{\text{noisy}}(x_r; x_s) = T(x_r; x_s)(1 + \epsilon)$, where $\epsilon \sim \mathcal{N}(0, 0.001^2)$ and $x_r$ denotes any receiver location. In other words, longer traveltime corresponds to larger uncertainties. We show that the Gaussianization layers outperformed other methods in this tomography task in Table 4 and Fig. 6. Note that this is a statistical conclusion. We also report an example where the spherical constraint works better than the Gaussianization layers (bottom right).
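The multiplicative traveltime noise model can be sketched as follows; the function name and the sample traveltimes are illustrative assumptions.

```python
import numpy as np

def add_traveltime_noise(T, sigma=0.001, rng=None):
    """Return T * (1 + eps) with eps ~ N(0, sigma^2) drawn independently per receiver."""
    rng = np.random.default_rng() if rng is None else rng
    T = np.asarray(T, dtype=float)
    return T * (1.0 + sigma * rng.standard_normal(T.shape))

rng = np.random.default_rng(0)
T_clean = np.array([0.0, 1.0e-4, 2.0e-4])     # hypothetical first-arrival times in seconds
T_noisy = add_traveltime_noise(T_clean, rng=rng)
```

Because the perturbation is proportional to $T$, its absolute size grows with traveltime, and a zero traveltime at the source stays exactly zero.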

5. DISCUSSION AND CONCLUSIONS

We provide insights on the experiment results and further discuss this work's limitations, computational cost, and broader impact in App. K. In summary, we have identified a critical problem in DGM-regularized inversion: the latent tensor can deviate from a typical example from the desired high-dimensional standard Gaussian distribution, leading to unsatisfactory inversion results. To address the problem, we have introduced the differentiable Gaussianization layers that reparameterize and Gaussianize latent tensors so that the inversion results remain plausible. In general, our method has achieved the best scores in terms of mean values and standard deviations compared with other methods, demonstrating our method's advantages and high performance in terms of accuracy and consistency. (Regarding the standard deviation of scores, we only compare with other methods with competitive mean scores, such as orthogonal reparameterization and spherical constraint.) Our proposed layers are plug-and-play, require minimal parameter tuning, and can be applied to various deep generative models and inverse problems.

A BACKGROUND OF FORWARD MODELS

A.1 COMPRESSIVE SENSING MRI

The MRI process essentially samples the spatial frequency components of a target, following some trajectories in the spatial frequency space (k-space) according to the design of the physical system. For example, a system may sample the k-space line-by-line horizontally/vertically or in radial directions. If the k-space has been fully sampled on a Cartesian grid, one can directly use inverse FFT to reconstruct the image. For various practical reasons, however, it is necessary to speed up the data collection process, usually by skipping data points in the k-space, which can be mathematically represented by a masking operation. In addition, there can be multiple coils with different sensitivity maps collecting data simultaneously. The mathematical formulation reads

$$d_i = P F S_i m + \epsilon, \quad i = 1, \dots, N_{\text{coils}},$$

where $d_i \in \mathbb{C}^N$ is the k-space data corresponding to the $i$-th coil, $m \in \mathbb{R}^N$ is the target object, $P \in \mathbb{R}^{N \times N}$ is the mask, $F \in \mathbb{C}^{N \times N}$ is the Fourier transform operator, $S_i \in \mathbb{C}^{N \times N}$ is the point-wise sensitivity map (a diagonal matrix) corresponding to the $i$-th coil, and $\epsilon$ denotes noise. To ensure a fair comparison with prior work and reproducibility, we used the same masks from the repository of Kelkar & Anastasio (2021) (Fig. 7). In addition, we also used the same single-coil setup as in Kelkar & Anastasio (2021), where the sensitivity matrix is an identity matrix. We condense the effects of all operators into a linear operator $A \in \mathbb{C}^{M \times N}$ and arrive at the under-determined system $d = Am + \epsilon$ (Eq. 10) from the main text, where we use Accl $= N/M$ to denote the acceleration ratio.
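For the single-coil case, the operator $A = PF$ and its adjoint can be sketched with NumPy's FFT. This is an illustration under assumptions: the mask below (every 8th k-space column) is hypothetical and is not one of the masks of Kelkar & Anastasio (2021), the SNR is assumed to be a power ratio in dB, and the function names are illustrative.

```python
import numpy as np

def mri_forward(m, mask):
    """A m = P F m: orthonormal 2D FFT, then keep only the M sampled k-space entries."""
    return np.fft.fft2(m, norm="ortho")[mask]

def mri_adjoint(d, mask):
    """A^H d: zero-fill the unsampled k-space entries, then apply the inverse FFT."""
    k = np.zeros(mask.shape, dtype=complex)
    k[mask] = d
    return np.fft.ifft2(k, norm="ortho")

def add_complex_noise(d, snr_db, rng):
    """Add i.i.d. complex Gaussian noise at a prescribed SNR (assumed power ratio, dB)."""
    p_noise = np.mean(np.abs(d) ** 2) / 10.0 ** (snr_db / 10.0)
    noise = rng.standard_normal(d.shape) + 1j * rng.standard_normal(d.shape)
    return d + np.sqrt(p_noise / 2.0) * noise

rng = np.random.default_rng(0)
m = rng.standard_normal((32, 32))             # stand-in for the target image
mask = np.zeros((32, 32), dtype=bool)
mask[:, ::8] = True                           # hypothetical sampling pattern
accl = mask.size / mask.sum()                 # acceleration ratio N / M
d = add_complex_noise(mri_forward(m, mask), snr_db=20.0, rng=rng)
```

The adjoint is what a gradient-based inversion needs in order to back-propagate the data misfit through $A$.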

A.2 DEBLURRING

The mathematical model behind deblurring is

$$d = H * m + \epsilon,$$

where $H$ is a smoothing filter, $*$ denotes convolution, and $\epsilon$ is noise. The purpose of deblurring is to recover the original sharp image $m$ given a noisy blurred observation $d$. In this study, we showed deblurring examples for natural images. In scientific applications, deblurring or deconvolution is also a powerful tool. For example, in astronomy, $d$ is a blurred image from a telescope, $H$ is a point-spread function (PSF) constructed from the physics model of the telescope, and we want to obtain a sharper image from the observation (Starck et al., 2002). In geophysics, $d$ can be the seismic data, $H$ is a calibrated wavelet, and we want to obtain sharp images of reflectivities defining the boundaries of subsurface strata (Lines & Treitel, 1984; Zhang & Castagna, 2011). In general, $H$ is a low-pass or band-pass filter, so certain frequency contents are lost in the forward process. The deblurring or deconvolution process needs to recover such missing information. In addition, the noise makes the inversion process unstable. The deblurring or deconvolution problem is thus ill-posed.
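The loss of high-frequency content under a Gaussian smoothing filter can be seen in a small sketch of the forward model. It assumes periodic boundary conditions so that the convolution can be evaluated with the FFT; the boundary handling used in the experiments is not stated here, and the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(shape, sigma):
    """Periodic 2D Gaussian kernel centered at pixel (0, 0), normalized to sum to 1."""
    ny, nx = shape
    y = np.fft.fftfreq(ny) * ny               # signed integer offsets 0, 1, ..., -1
    x = np.fft.fftfreq(nx) * nx
    k = np.exp(-(y[:, None] ** 2 + x[None, :] ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(m, sigma=3.0):
    """Forward model H * m as a circular convolution evaluated in the Fourier domain."""
    H = np.fft.fft2(gaussian_kernel(m.shape, sigma))
    return np.real(np.fft.ifft2(H * np.fft.fft2(m)))

rng = np.random.default_rng(0)
m = rng.standard_normal((32, 32))
d = blur(m) + 0.01 * rng.standard_normal(m.shape)   # noise level chosen for illustration

ii, jj = np.indices((32, 32))
checker = np.where((ii + jj) % 2 == 0, 1.0, -1.0)   # highest-frequency pattern on the grid
```

Blurring `checker` returns an image that is numerically zero, so no inversion can recover that component from the data alone; this is the missing information a prior has to supply.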

A.3 EIKONAL TOMOGRAPHY

In acoustic wave imaging (e.g., ultrasound tomography), we excite waves using sparsely distributed sources one at a time at the boundary of the object. Then we reconstruct its internal structures (the spatial distribution of wave speed) given the first-arrival travel time recorded on the boundary. The following eikonal equation describes the shortest travel time T (x; x s ) that the acoustic wave emerging from the source location x s takes to reach location x inside the target object in the high-frequency limit (Yilmaz, 2001) : |∇T (x; x s ) | = 1/c (x) , T (x s ; x s ) = 0, where c(x) is the wave propagation speed at each location. Both this eikonal PDE and the implicitly defined forward mapping c(x) → T (x) are nonlinear. To be more specific, we show the setup of our experiment in Fig. 8(a) . For the convenience of numerical testing, we put sources and receivers on the boundaries of a square box that contains the object, suggesting that the object is immersed in a square box filled with water. In reality, one can put sources and receivers directly on the target. The dimension of the square area is 25.6 cm × 25.6 cm with a grid interval of 0.001 m. There are eight sources located on each side, and receivers are located at each grid point on the boundary. We can interpret the nonlinearity and ill-posedness of eikonal tomography from another perspective. As shown in Fig. 8 (b), we plot contours T (x) = const representing the wavefronts. Under the high-frequency approximation, the propagation of waves can be viewed as curved rays traveling in directions perpendicular to such wavefronts. Since the wavefronts depend on velocity field c(x), the rays are also functionals of the parameter c(x) to be estimated, contrary to straight-ray tomography such as CT. The eikonal inversion problem is thus nonlinear. Besides, the rays carry information about the medium averaged along their paths. 
One property of curved-ray tomography is that the ray coverage is uneven inside the object. In fact, curved rays tend to avoid low-velocity areas, giving us little information about such regions and making the inverse problem intrinsically ill-posed. We solved the eikonal equation using the fast sweeping method (Zhao, 2005) and computed the gradient using the discrete adjoint-state method, using the code from the repository of Li et al. (2020). Our eikonal tomography uses the same StyleGAN2 network as the compressive sensing MRI experiments. The value range of StyleGAN2 is [-1, 1]. In the forward model, we map the StyleGAN2 output to $c(x)$ in two steps. First, we convert its values to the range [0, 1] using $m \leftarrow (m + 1)/2$. Second, we convert the values to the range of acoustic wave velocity using $m \leftarrow 100 \times m + 1500$. This relationship is purely manufactured for our synthetic tests. One should use a more realistic relationship in practice.
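As a rough illustration of the forward model in this section, the sketch below implements the two-step velocity mapping and a basic first-order fast sweeping solver on a uniform grid. It is a simplified stand-in written for this appendix, not the code of Li et al. (2020) used in the experiments; the function names, the number of sweep rounds, and the small demo grid are assumptions.

```python
import numpy as np


def stylegan_to_velocity(m):
    """Map generator output in [-1, 1] to wave speed, following the two steps above."""
    m01 = (m + 1.0) / 2.0          # step 1: [-1, 1] -> [0, 1]
    return 100.0 * m01 + 1500.0    # step 2: [0, 1] -> [1500, 1600]


def fast_sweep_eikonal(c, src, h, n_rounds=4):
    """First-order Godunov fast sweeping for |grad T| = 1/c with T(src) = 0.

    c: 2D array of wave speeds; src: (row, col) index of the source;
    h: grid interval. Returns the first-arrival travel-time field T.
    """
    ny, nx = c.shape
    T = np.full((ny, nx), np.inf)
    T[src] = 0.0
    slowness = 1.0 / c
    # Four alternating sweep orderings cover all characteristic directions.
    orders = [
        (range(ny), range(nx)),
        (range(ny), range(nx - 1, -1, -1)),
        (range(ny - 1, -1, -1), range(nx)),
        (range(ny - 1, -1, -1), range(nx - 1, -1, -1)),
    ]
    for _ in range(n_rounds):
        for rows, cols in orders:
            for i in rows:
                for j in cols:
                    if (i, j) == tuple(src):
                        continue
                    # Smallest neighbor value along each axis (upwind choice).
                    a = min(T[i - 1, j] if i > 0 else np.inf,
                            T[i + 1, j] if i < ny - 1 else np.inf)
                    b = min(T[i, j - 1] if j > 0 else np.inf,
                            T[i, j + 1] if j < nx - 1 else np.inf)
                    if np.isinf(a) and np.isinf(b):
                        continue  # no information has reached this node yet
                    f = slowness[i, j] * h
                    if abs(a - b) >= f:
                        t_new = min(a, b) + f          # one-sided update
                    else:
                        t_new = 0.5 * (a + b + np.sqrt(2.0 * f * f - (a - b) ** 2))
                    T[i, j] = min(T[i, j], t_new)
    return T
```

For example, `fast_sweep_eikonal(stylegan_to_velocity(m), (0, 10), 0.001)` would give the travel-time field for a source on the top boundary of a generated image `m`; the receivers then sample this field on the other three sides.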

B ADDITIONAL MOTIVATING EXAMPLES

Glow. We varied the $\ell_2$-norm of the latent tensors for Glow and report the outputs in Fig. 9. The images become increasingly unrealistic as we increase the norm beyond $1.0\sqrt{d}$, where $d$ is the latent tensor dimension. This phenomenon demonstrates why we need to use a temperature < 1 for Glow.

StyleGAN2. StyleGAN2 is a well-designed generator with a built-in spherical transformation for the style vector. It seems very robust to random vectors drawn from distributions similar to those in Fig. 1. However, we were able to find challenging examples for it from inversion tasks. In Fig. 10, the first example is the direct output using only the pathologic style vector. The second example is the generator output using the style vector after whitening only. As we can see, whitening alone is ineffective in improving the image quality, which is confirmed by our ablation studies. In contrast, if we use the ICA layer, we see a large improvement in visual quality; our interpretation is that removing higher-order dependencies is very important. Note that there are eyeglasses in the third figure, meaning that if we relax the 1D Gaussian requirement, we can sample some rare examples. Finally, we use the full G layers to get another plausible image.

Stable Diffusion. Stable Diffusion (Rombach et al., 2022) is a state-of-the-art deep generative model with text-to-image synthesis capability, which maps a Gaussian latent tensor to a high-resolution image.

[Figure 10 panel titles: No processing; Whitening only; ICA only; Full G layers.]

We repeated the tests for Glow and StyleGAN2 on Stable Diffusion and report the results in Fig. 11. We scaled all latent tensors to lie on the sphere with a radius of $\sqrt{\text{tensor dimension}}$. The distribution for the skewed case was the exponential-gamma distribution $\log\Gamma(1, 1)$ (the "log-gamma distribution" in SciPy), and the distribution for the heavy-tailed case was a Lambert W × $F_X$ distribution with parameter $\delta = 0.5$ based on a standard Gaussian (Goerg, 2015). To generate the latent tensor with non-i.i.d. entries, we first sampled $z_0 \in \mathbb{R}^{4 \times 64 \times 64} \sim \mathcal{N}(0, I)$. Then we added $0.5 \sin(2\pi x/64) \otimes \cos(2\pi y/64)$ to each channel of $z_0$ to obtain the latent tensor $z$, where $\otimes$ stands for the outer product. The number of denoising steps, the classifier-free guidance scale, and the output dimensions for Stable Diffusion are 50, 7.5, and 512 × 512, respectively.

[Figure 12 panel titles: latent norms of $0.7\sqrt{d}$, $\sqrt{d}$, and $1.3\sqrt{d}$.]

The outputs of Stable Diffusion are completely out of range if the latent tensors deviate from a standard Gaussian, with much poorer quality than those from Glow and StyleGAN2. For example, Stable Diffusion picks up and amplifies the sparse large-amplitude points in the heavy-tailed case and the 2D sinusoidal pattern in the non-i.i.d. case. However, with our Gaussianization layers, we are still able to constrain the outputs within the plausible range. Interestingly, we show in Fig. 12 that Stable Diffusion is much more sensitive to the norm of latent tensors than the other DGMs. Therefore, we anticipate that our Gaussianization layers will be critical for potential inversion applications using Stable Diffusion, such as text-guided inversion.
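The three kinds of deviated latent tensors described above can be generated along the following lines. This is a minimal sketch under stated assumptions: $x$ and $y$ are taken to be integer pixel indices, the samples are rescaled to the sphere without any prior standardization, and all function names are hypothetical.

```python
import numpy as np
from scipy import stats


def to_sphere(z):
    """Rescale z so that its l2 norm equals sqrt(number of entries)."""
    return z * np.sqrt(z.size) / np.linalg.norm(z)


def skewed_latent(shape, seed=0):
    """i.i.d. entries from SciPy's log-gamma distribution, log Gamma(1, 1)."""
    z = stats.loggamma.rvs(1.0, size=shape, random_state=seed)
    return to_sphere(z)


def heavy_tailed_latent(shape, delta=0.5, seed=0):
    """i.i.d. Lambert W x F_X entries built from a standard Gaussian u."""
    u = np.random.default_rng(seed).standard_normal(shape)
    return to_sphere(u * np.exp(0.5 * delta * u ** 2))


def non_iid_latent(shape=(4, 64, 64), seed=0):
    """Gaussian noise plus a 2D sinusoidal pattern added to every channel."""
    _, h, w = shape
    z0 = np.random.default_rng(seed).standard_normal(shape)
    x, y = np.arange(h), np.arange(w)  # assumed: integer pixel coordinates
    pattern = 0.5 * np.outer(np.sin(2 * np.pi * x / 64), np.cos(2 * np.pi * y / 64))
    return to_sphere(z0 + pattern)  # pattern broadcasts over the channel axis
```

All three tensors have exactly the norm of a typical Gaussian sample, so a norm check alone cannot flag them.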

C ADDITIONAL EXPERIMENTS

First, we report in Table 5 an ablation study on the effects of various combinations of regularization techniques on style vectors and noise maps in StyleGAN2. The observations are: (1) if we turn off the update of noise maps, the Gaussianization layers lead to the best result; (2) if we turn on the update of noise maps, applying the Gaussianization layers to both the style vectors and noise maps gives the best results.

Table 5: Ablation study on different combinations of regularization on the style vectors and noise maps in StyleGAN2 for compressive sensing MRI. In the parentheses, "u" stands for unconstrained, "s" stands for spherical constraint, "g" stands for reparameterization with the Gaussianization layers, "o" stands for orthogonal reparameterization, and "" means that the update is turned off. Note that StyleGAN2 has a built-in spherical transformation layer for the style vectors.

Second, we studied the effect of patch size on the performance of Gaussianization layers. To be more specific, we tested the effects of various patch sizes on 1D style vectors. The largest patch size is 64 since the number of extracted patches should not be smaller than the dimension of the patches. We only turned on the ICA layer and the standardization layer in this experiment. One can observe that both the PSNR and the SSIM increase as the patch size increases. We would advise using the largest possible patch size in Gaussianization layers, although we only used a patch size of 32 for style vectors in all other experiments. In all experiments, we fixed the patch size for Glow as 3 × 8 × 8 and the patch size for the noise maps in StyleGAN2 as 1 × 8 × 8. In the experiments with Glow, the image dimension was 3 × 128 × 128. If we chose a larger patch size, we could not obtain enough patch vectors. As for StyleGAN2, if we look at the parameterization illustrated in Fig. 14, there is already a 4 × 4 noise map unconstrained if we choose the patch size as 8 × 8.
If we increase the patch size to 1 × 16 × 16, there will be two additional 8 × 8 noise maps unconstrained. To minimize the influence of unconstrained parameters, we chose the patch size for noise maps as 1 × 8 × 8.

Third, we did a parameter sweep on the weighting parameter β in the Glow-regularized deblurring problem (Eq. 3 or Eq. 44). Consistent with Fig. 20, the conventional Glow-based regularization is not as effective as our Gaussianization layers. If β is too large (e.g., 10 or 100), the inversion results become unrealistic with huge errors.

Fourth, we investigated the performance of Gaussianization layers when the forward model is inaccurate (Table 8). Our Gaussianization layers still outperform the conventional Glow-regularized inversion.

Table 8: Glow-regularized deblurring with an inaccurate filter. The ground-truth standard deviation of the Gaussian filter is 3, which is used to generate the observed data, but we used 5 for inversion.

Method | LPIPS↓ | PSNR↑ | SSIM↑
Conventional (β = 1.0) | 0.21±0.06 | 17.72±1.41 | 0.49±0.059
G layers | 0.17±0.06 | 18.70±1.40 | 0.50±0.060

In addition, we did an ablation study on the components of Gaussianization layers in Glow-regularized deblurring (Table 9). We adopted the second parameterization scheme (App. D). The observations are similar to those from Table 2: 1. The ICA layer is the most significant part in improving result scores, especially LPIPS, the score that matches human perception the best; 2. The whitening/ZCA layer is not effective when used alone; 3. There is no clear winner among combinations of 1D Gaussianization layers, and their difference in performance is marginal. So we picked the Lambert W × $F_X$ layer, one of the best-performing options, as the 1D Gaussianization layer in all other deblurring experiments.

Inversion using out-of-distribution images. Finally, we tested the limit of DGM-regularized inversion by using real-world out-of-distribution target images. We randomly sampled 25 MRI images from the fastMRI DICOM dataset (Zbontar et al., 2018; Knoll et al., 2020). The RSNA clinical trial processor was used to anonymize the whole dataset, and each image has been manually inspected to make sure that there is no protected information. According to the fastMRI paper (Zbontar et al., 2018), "This dataset represents a larger variety of machines and settings than are present in the raw data." Also, the image values are represented in uint16 rather than floating-point numbers, and their value range can be very large (0 to several thousand). So we normalized the image values to the range [-1, 1] using img = (img - img.min()) / (img.max() - img.min()) * 2 - 1. However, the trained StyleGAN2 outputs are always slightly beyond [-1, 1], and we clipped the values in both test image generation and inversion. Therefore, the new examples are out-of-distribution.
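The normalization and clipping steps described above can be written as two small helpers. This is a minimal sketch with hypothetical function names; the cast to floating point before the arithmetic and the absence of a guard for constant images are choices made here for brevity.

```python
import numpy as np


def normalize_to_unit_range(img):
    """Rescale an image (e.g., uint16 DICOM values) linearly to [-1, 1]."""
    img = img.astype(np.float64)       # leave integer arithmetic first
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) * 2.0 - 1.0   # assumes hi > lo


def clip_generator_output(x):
    """Clip generator outputs that slightly exceed [-1, 1]."""
    return np.clip(x, -1.0, 1.0)
```

For instance, `normalize_to_unit_range(np.array([0, 2000, 4000], dtype=np.uint16))` maps the three values to -1, 0, and 1.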
We kept the same experimental setup of 8× acceleration and 20 dB SNR and tested inversion with the spherical constraint and with the Gaussianization layers, both when optimizing only the style vectors and when also optimizing the noise maps. We report the inversion results in Table 10 and Fig. 13.

First, none of the inversion results shown here are visually comparable to the ground truth, which shows the necessity of having a large training dataset that covers the target distribution. Also, it is vital to ensure that the pre-processing and value ranges of training and target images are the same. Second, in terms of metrics, the results are comparable to but slightly worse than those on synthetic test images reported previously. This is expected because these images are out of distribution. Besides, in cases where we applied Gaussianization layers, using both noise maps and style vectors gives better results than using style vectors only. In addition, Gaussianization layers improved results when we optimized the noise maps in addition to the style vectors. However, we found that using only the style vector and the spherical constraint gives the best result on this test set, which contradicts our results reported in Table 5. Our interpretation is that the dimension of the style vectors is still low relative to that of the noise maps or the latent tensor used in Glow, so our Gaussianization layers may be more effective in dealing with the latter two types of parameters.

D REPARAMETERIZATION SCHEMES FOR STYLEGAN2 AND GLOW

Fig. 14 illustrates the reparameterization scheme for StyleGAN2 using the Gaussianization layers. The patch sizes for the style vectors and noise maps are 32 and 1 × 8 × 8, respectively. Note that we transpose the latent tensors in Fig. 14 compared to the notations in the corresponding equations and algorithms. Also, consistent with Gu et al. (2020), we find that a single latent code is insufficient for image reconstruction. Thus, we use multiple style vectors that are fed into different intermediate layers.

[Figure 14: Reparameterization scheme for StyleGAN2. Panel labels: Unfolding ($v_1$, $v_2$, $v_3$, $v_4$); Vectors; Tensor; Tensor (1 channel); Tensor (4 channels). The patch vectors after the Gaussianization layers are $\{z_i \mid i = 1, \dots, N\}$ mentioned in the main text.]

For Glow, we came up with two reparameterization schemes (Fig. 15). The first one (P1) reparameterizes patches from a latent tensor with the same dimension as the output image. The dimensions of the patches and the image are 3 × 8 × 8 and 3 × 128 × 128, respectively. Then a multi-scale squeezing operation maps the latent tensor into a list of tensors corresponding to different scales of Glow. Glow uses the list of tensors as the input. The second scheme (P2) reparameterizes patches extracted directly from tensors in the above list. We also partition each tensor into 3 × 8 × 8 patches. The unfolding and squeezing operations are illustrated in Fig. 16. In the multi-scale architecture of Glow, the squeezing operation is applied recursively on half of the output tensors cut in the channel direction (Dinh et al., 2016; Kingma & Dhariwal, 2018).
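The unfolding and squeezing operations can be sketched with plain array reshapes. The code below assumes non-overlapping patches and one particular ordering of patch entries and squeezed channels; the actual ordering in our implementation (and in Glow) may differ, and the function names are hypothetical. The last helper checks the requirement from App. C that the number of patches should not be smaller than the patch dimension.

```python
import numpy as np


def unfold_patches(z, ph, pw):
    """Cut a (C, H, W) tensor into non-overlapping C x ph x pw patches.

    Returns a matrix of shape (D, N) whose columns are flattened patches,
    with D = C * ph * pw and N = (H / ph) * (W / pw).
    """
    c, h, w = z.shape
    p = z.reshape(c, h // ph, ph, w // pw, pw)
    p = p.transpose(1, 3, 0, 2, 4)            # (nh, nw, C, ph, pw)
    return p.reshape(-1, c * ph * pw).T


def fold_patches(V, shape, ph, pw):
    """Inverse of unfold_patches: rebuild the (C, H, W) tensor."""
    c, h, w = shape
    p = V.T.reshape(h // ph, w // pw, c, ph, pw)
    return p.transpose(2, 0, 3, 1, 4).reshape(c, h, w)


def squeeze(z):
    """Space-to-depth: (C, H, W) -> (4C, H/2, W/2)."""
    c, h, w = z.shape
    z = z.reshape(c, h // 2, 2, w // 2, 2)
    return z.transpose(0, 2, 4, 1, 3).reshape(4 * c, h // 2, w // 2)


def enough_patches(shape, ph, pw):
    """True if the number of patches N is at least the patch dimension D."""
    c, h, w = shape
    return (h // ph) * (w // pw) >= c * ph * pw
```

For a 3 × 128 × 128 latent tensor, 3 × 8 × 8 patches give a 192 × 256 matrix (256 patches of dimension 192), whereas 3 × 16 × 16 patches would give only 64 patches of dimension 768, which is why a larger patch size does not provide enough patch vectors.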

E MORE DETAILS OF THE GAUSSIANIZATION LAYERS

E.1 ICA LAYER

The overall ICA layer is summarized in Algorithm 1. We set a maximum number of fixed-point iterations to reduce the computational cost and to ensure accurate gradient computation that can pass the finite-difference convergence test. The gradients are back-propagated through the loops without using the trick introduced in App. G.

Whitening. The FastICA algorithm typically requires that the data are pre-whitened. Let $V$ be the data matrix whose columns are data vectors. We first subtract the mean from the data vectors using $V \leftarrow V - \mathrm{mean}(V, \mathrm{dim} = 1)$. Then we compute the data covariance matrix using $C = (1 - \eta)\frac{1}{N-1} V V^\top + \eta I$, where we use a small constant (e.g., $\eta = 0.001$) to blend the empirical covariance matrix and an identity matrix to avoid ill-conditioning. After these data preparation steps, the ZCA whitening used throughout the study first computes the eigenvalues $\Lambda$ and eigenvectors $D$ of $C$, and then outputs the whitened data using $V \leftarrow D \Lambda^{-1/2} D^\top V$. Alternatively, we can use the following steps to whiten the data, which are also used later in the ICA iterations to decorrelate the column vectors of the orthogonal matrix (Hyvarinen, 1999):

1. Initialize $W \leftarrow I$;
2. Compute $W \leftarrow W / \sqrt{\|W^\top C W\|_2}$; (16)
3. Repeat until convergence: $W \leftarrow \frac{3}{2} W - \frac{1}{2} W W^\top C W$.

Algorithm 1, steps:

// Whitening stage
1. $V \leftarrow V - \mathrm{mean}(V, \mathrm{dim} = 1)$, $C = (1 - \eta)\frac{1}{N-1} V V^\top + \eta I$;
2. $V \leftarrow$ ZCA-whitening($V$);
// ICA stage
3. $W = W^* \leftarrow I$, $j \leftarrow 1$;
4. while $j \le J$ do
5.     $W \leftarrow \frac{1}{N}\left[\alpha V \phi(W^\top V)^\top - W\,\mathrm{diag}\left(\phi'(W^\top V)\mathbf{1}\right)\right]$ (Eq. 8);
6.     $W = W_0 \leftarrow W / \sqrt{\|W^\top W\|_2}$, $k \leftarrow 1$;
7.     while $k < K$ do
8.         $W \leftarrow \frac{3}{2} W - \frac{1}{2} W W^\top W$;
9.         if $\|W - W_0\| < \epsilon$ then
10.            break;
11.        $W_0 \leftarrow W$, $k \leftarrow k + 1$;
12.    if $\|W - W^*\| < \epsilon$ then
13.        break;
14.    $W^* \leftarrow W$, $j \leftarrow j + 1$;
15. return $P \leftarrow W^\top V$.

The modified FastICA iterations. As stated in Hyvärinen (1999), the objective function for one neural unit with weight vector $w_i$ and input $v$ is

$$\arg\max_{w_i} E\left[\Phi(w_i^\top v)\right], \quad \text{s.t. } E\left[(w_i^\top v)^2\right] = 1,$$

where $\Phi$ is the contrast function (e.g., logcosh). The original derivations convert this constrained optimization to an unconstrained one using Lagrange multipliers. However, this procedure is unnecessary since the matrix $W$ is orthogonalized after each iteration, and the input vectors have been pre-whitened. Therefore, we only need to solve the following equation

$$E\left[v\,\phi(w_i^\top v)\right] = 0,$$

whose Jacobian is

$$J = E\left[v v^\top \phi'(w_i^\top v)\right] \approx E\left[v v^\top\right] E\left[\phi'(w_i^\top v)\right] = E\left[\phi'(w_i^\top v)\right] I,$$

where $\phi$ is the derivative of $\Phi$, and the last equality uses $E[v v^\top] = I$ for whitened data. The Newton iteration scheme is thus $w_i \leftarrow w_i - E\left[v\,\phi(w_i^\top v)\right] / E\left[\phi'(w_i^\top v)\right]$. To improve the convergence, we damp the iterations by a parameter $\alpha \in (0, 1)$. Also, using the same technique to convert the Newton iterations to fixed-point iterations in Hyvärinen (1999); Hyvarinen (1999), we arrive at the modified fixed-point iteration scheme:

$$w_i \leftarrow \alpha E\left[v\,\phi(w_i^\top v)\right] - E\left[\phi'(w_i^\top v)\right] w_i,$$

followed by the aforementioned decorrelation procedure after each step. The convergence of the modified FastICA iterations can be proved similarly as in Oja & Yuan (2006).
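The ICA layer described in this section can be sketched in NumPy as follows. This is a forward-pass illustration only, assuming the logcosh contrast function (so $\phi = \tanh$), weight vectors stored as columns of $W$, and hypothetical function names; the layer used in our experiments is implemented in an automatic-differentiation framework so that gradients can flow through the loops.

```python
import numpy as np


def zca_whiten(V, eta=1e-3):
    """Center the columns of V (D x N) and apply ZCA whitening."""
    V = V - V.mean(axis=1, keepdims=True)
    n = V.shape[1]
    C = (1.0 - eta) * (V @ V.T) / (n - 1) + eta * np.eye(V.shape[0])
    lam, D = np.linalg.eigh(C)                    # C = D diag(lam) D^T
    return D @ np.diag(lam ** -0.5) @ D.T @ V


def sym_decorrelate(W, K=100, tol=1e-10):
    """Iterative symmetric decorrelation: drives W^T W towards I."""
    W = W / np.linalg.norm(W, 2)                  # = W / sqrt(||W^T W||_2)
    for _ in range(K):
        W_new = 1.5 * W - 0.5 * W @ W.T @ W
        done = np.linalg.norm(W_new - W) < tol
        W = W_new
        if done:
            break
    return W


def ica_layer(V, alpha=0.8, J=10, K=100, tol=1e-6):
    """Damped FastICA fixed-point iterations on whitened data."""
    V = zca_whiten(V)
    d, n = V.shape
    W = np.eye(d)
    for _ in range(J):
        Y = W.T @ V
        g = np.tanh(Y)                            # phi, derivative of logcosh
        g_prime = 1.0 - g ** 2                    # phi'
        W_new = (alpha * V @ g.T - W @ np.diag(g_prime.sum(axis=1))) / n
        W_new = sym_decorrelate(W_new, K)
        converged = np.linalg.norm(W_new - W) < tol
        W = W_new
        if converged:
            break
    return W.T @ V                                # P, same shape as V
```

Since $W$ stays (numerically) orthogonal and $V$ is whitened, the rows of the output remain uncorrelated with unit variance; the fixed-point updates only rotate the data towards directions with reduced higher-order dependencies.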

E.2 POWER TRANSFORMATION LAYER

Algorithm 2: Power Transformation Layer
Input: Data vector $p$
Output: Vector $s$ whose values are 1D Gaussianized.
1. Estimate $\lambda$ from $p$ using Eq. 24;
2. Compute $s$ with the estimated $\lambda$ and data $p$ using Eq. 23;
3. return $s$.

We propose to use the power transformation, or Yeo-Johnson transformation (Yeo & Johnson, 2000), to reduce the skewness of distributions:

$$s(\lambda, p) = \begin{cases} \left[(p + 1)^{\lambda} - 1\right]/\lambda, & p \ge 0,\ \lambda \ne 0, \\ \log(p + 1), & p \ge 0,\ \lambda = 0, \\ -\left[(-p + 1)^{2 - \lambda} - 1\right]/(2 - \lambda), & p < 0,\ \lambda \ne 2, \\ -\log(-p + 1), & p < 0,\ \lambda = 2, \end{cases} \quad (23)$$

where $p$ is an input value, $s$ is an output value, and $\lambda$ is the parameter to be estimated. As shown in Fig. 3 ... where $p$ is the input data vector with entries $p_i$, $i = 1, \dots, n$. We use a custom operator based on SciPy's implementation using Brent's algorithm (Brent, 2013) to find an approximate minimum of Problem 24. Continuing from the approximate minimum, we use Brent's root-finding algorithm (Brent, 2013) to find the minimum where the gradient vanishes. Since the parameter $\lambda$ depends on the input data, we need to back-propagate the gradient through the optimization process. The power transformation layer is summarized in Algorithm 2.

E.3 LAMBERT W × $F_X$ LAYER

Due to noise and inaccurate forward models, we observe that the distribution of latent vector values tends to become heavy-tailed during the inversion process. To reduce the heavy-tailedness, we adopt the Lambert W × $F_X$ method detailed in Goerg (2015). Let $X$ be a random variable whose CDF is $F_X$, with mean $\mu_X$ and standard deviation $\sigma_X$. The following transformation with a heavy-tail parameter $\delta \ge 0$:

$$S = U \exp\left(\frac{\delta}{2} U^2\right) \sigma_X + \mu_X,$$

... databases of fastMRI (Zbontar et al., 2018; Knoll et al., 2020), TCIA-GBM (Scarpace et al., 2016), and OASIS-3 (LaMontagne et al., 2019). As a result, there is no sensitive personal information in our target images. In the data generation process, we only used one style vector with a dimension of 512.
We picked 100 images that are visually plausible, whose random seeds can be found in our code. These random seeds were never used to initialize inversion. In inversion experiments, to avoid inversion crime, we used 14 such style parameters: one for the lowest resolution of 4 × 4, one for the tRGB layer, and two for each resolution from 8 × 8 to 256 × 256. This choice of expanding the latent space dimension is also justified by our observation that only one style parameter vector is generally insufficient for inversion tasks with data from datasets such as CelebA-HQ (Karras et al., 2018).

We used the CelebA-HQ dataset (Karras et al., 2018) (under the Creative Commons CC BY-NC 4.0 license) for the deblurring experiments. All images were downsampled to the resolution of 128 × 128. We split the 30000 images from CelebA-HQ into the subsets of training (24183 images), validation (2993 images), and testing (2824 images) following the original splits from CelebA (Liu et al., 2015). For the inversion tests, we randomly selected 100 images from the test set. All training was conducted using 8 × 32 GB Nvidia V100 GPUs with a batch size of 64. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of $10^{-4}$, as well as $\beta_1 = 0.9$ and $\beta_2 = 0.99$.

Figure 18: ... Lambert W × $F_X$ layers. We denote the input and output by $x$ and $y$, respectively. The layer is represented by $h_\theta$, and the layer is defined by solving an optimization problem whose objective function is $l$. The back-propagation of gradients in components defined by Eq. 24 and Eq. 28 is enabled by the implicit function theorem and automatic differentiation detailed in App. G.
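The two 1D Gaussianization layers of App. E.2 and E.3 can be sketched as follows (forward pass only). The Yeo-Johnson part implements Eq. 23 and estimates $\lambda$ by minimizing the standard Yeo-Johnson negative profile log-likelihood with Brent's method, which we assume here plays the role of Problem 24. The Lambert part uses the closed-form inverse of $z = u \exp(\delta u^2 / 2)$ from Goerg (2015), with $u$ the standardized variable; the estimation of $\delta$ is omitted. All function names are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar
from scipy.special import lambertw


def yeo_johnson(p, lam):
    """Yeo-Johnson transformation of Eq. 23, applied element-wise."""
    p = np.asarray(p, dtype=float)
    s = np.empty_like(p)
    pos = p >= 0
    if abs(lam) > 1e-12:
        s[pos] = ((p[pos] + 1.0) ** lam - 1.0) / lam
    else:
        s[pos] = np.log1p(p[pos])
    if abs(lam - 2.0) > 1e-12:
        s[~pos] = -((1.0 - p[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        s[~pos] = -np.log1p(-p[~pos])
    return s


def yj_nll(lam, p):
    """Negative profile log-likelihood of lambda (assumed form of Problem 24)."""
    s = yeo_johnson(p, lam)
    jac = np.sum(np.sign(p) * np.log1p(np.abs(p)))
    return 0.5 * p.size * np.log(s.var()) - (lam - 1.0) * jac


def estimate_lambda(p):
    """Estimate lambda with Brent's derivative-free minimization."""
    res = minimize_scalar(yj_nll, bracket=(-2.0, 2.0), args=(p,), method="brent")
    return res.x


def lambert_forward(u, delta):
    """Heavy-tail transformation z = u * exp(delta / 2 * u^2)."""
    return u * np.exp(0.5 * delta * u ** 2)


def lambert_inverse(z, delta):
    """Gaussianizing inverse: u = sign(z) * sqrt(W(delta * z^2) / delta)."""
    if delta == 0.0:
        return z
    w = np.real(lambertw(delta * z ** 2))   # principal branch, real for arg >= 0
    return np.sign(z) * np.sqrt(w / delta)
```

As a usage example, drawing `p = stats.loggamma.rvs(1.0, size=5000, random_state=0)` and applying `yeo_johnson(p, estimate_lambda(p))` should substantially reduce the sample skewness reported by `stats.skew`.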

G GRADIENT COMPUTATION OF THE OPTIMIZATION-BASED DIFFERENTIABLE LAYERS

[Figure 18 diagram: input $x$ → $\arg\min_\theta l(h_\theta(x))$ → $\theta^*$ → $h_{\theta^*}(x)$ → output $y$.]

As shown in Fig. 18, in the power transformation and Lambert W × $F_X$ layers, there are operators whose outputs are obtained by solving optimization problems formally described as

$$\theta^* = \arg\min_\theta l(x, \theta),$$

where $l$ denotes the objective function that defines the operator (combining the objective function and layer $h_\theta$ in Fig. 18), and $\theta^*$ stands for the optimal output, which is a scalar in our cases but can also be a vector in general. The optimality condition is

$$l_\theta(x, \theta) := L(x, \theta) = 0, \quad (30)$$

where the subscript denotes partial derivatives. The optimality condition implicitly defines a forward operator of the following form:

$$\theta^* = \mathrm{op}_{\mathrm{forward}}(x). \quad (31)$$

The backward operator is

$$\frac{\partial \chi}{\partial x} = \mathrm{op}_{\mathrm{backward}}\left(\frac{\partial \chi}{\partial \theta}, \theta^*, x\right),$$

where $\chi$ is the objective function of an inverse problem. Differentiating Eq. 30 with respect to $x$, we have

$$L_x + L_\theta \theta_x = 0 \implies \theta_x = -L_\theta^{-1} L_x,$$

using the implicit function theorem. Then, to back-propagate the gradient from $\partial\chi/\partial\theta$ to $\partial\chi/\partial x$, we use

$$\frac{\partial \chi}{\partial x} = \frac{\partial \chi}{\partial \theta} \theta_x = -\frac{\partial \chi}{\partial \theta} L_\theta^{-1} L_x = -\frac{\partial \chi}{\partial \theta} H_\theta^{-1} L_x,$$

where $H_\theta = L_\theta = l_{\theta\theta}$ is the Hessian matrix of $l$ with respect to $\theta$. In our problems, the output $\theta$ is a scalar, so it is easy to use automatic differentiation to compute $L_\theta$ directly and hence $L_\theta^{-1}$. Otherwise, if $\theta$ has many parameters, we can first solve the following linear system with an auxiliary vector $\lambda$:

$$\lambda H_\theta = -\frac{\partial \chi}{\partial \theta},$$

and then compute the gradient using $\partial\chi/\partial x = \lambda L_x$, a technique also known as the adjoint-state method. Note that there is no need to compute the Hessian explicitly; one can use automatic differentiation to compute the vector-Hessian product $\lambda H_\theta$ and use iterative linear solvers like GMRES (Saad & Schultz, 1986) to solve the linear system.

Figure 19: Gradient accuracy test for custom operators. An accurate gradient should make the error converge at a rate of the second order following the red dashed line.
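To make the implicit-function-theorem gradient concrete, the sketch below applies it to a toy scalar operator. The objective $l(x, \theta) = \sum_i \cosh(\theta - x_i)$ is chosen only because its derivatives are available in closed form; it is not one of the objectives used in our layers, and the function names are hypothetical. The last helper evaluates the first-order Taylor remainder, which should shrink quadratically with the step size when the gradient is correct.

```python
import numpy as np
from scipy.optimize import brentq


def op_forward(x):
    """theta* = argmin_theta sum_i cosh(theta - x_i), found from L(x, theta) = 0."""
    L = lambda t: np.sum(np.sinh(t - x))          # L = dl/dtheta
    return brentq(L, x.min() - 1.0, x.max() + 1.0, xtol=1e-14)


def op_backward(dchi_dtheta, theta, x):
    """dchi/dx = -dchi/dtheta * L_theta^{-1} * L_x (implicit function theorem)."""
    L_theta = np.sum(np.cosh(theta - x))          # scalar Hessian of l
    L_x = -np.cosh(theta - x)                     # one entry per x_i
    return -dchi_dtheta * L_x / L_theta


def f(x):
    """Scalar test function built on the forward operator."""
    return op_forward(x) ** 2


def grad_f(x):
    """Gradient of f using the custom backward operator."""
    theta = op_forward(x)
    return op_backward(2.0 * theta, theta, x)


def taylor_errors(x, dx, eps_list):
    """|f(x + eps dx) - f(x) - eps grad_f(x)^T dx| for each eps."""
    slope = grad_f(x) @ dx
    return [abs(f(x + eps * dx) - f(x) - eps * slope) for eps in eps_list]
```

Calling `taylor_errors(x, dx, [1e-1, 1e-2, 1e-3])` with a unit-norm `dx` should return errors that decrease by roughly a factor of 100 per step, which is the behavior the red dashed line in Fig. 19 represents.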
As a final note, we check the accuracy of our gradient computation using the finite-difference convergence test based on the Taylor expansion:

$$f(x + \epsilon\,\delta x) = f(x) + \epsilon\,\nabla f(x)^\top \delta x + O(\epsilon^2),$$

where $\delta x$ is a random vector with a unit $\ell_2$ norm, and $\nabla f(x)$ is computed using our custom gradient. Here we define $f$ as a composite function that maps the (vector) output of a forward operator to a scalar, e.g., $f(x) = \|\mathrm{op}_{\mathrm{forward}}(x)\|_2^2$. As we decrease $\epsilon$, we should see that the error term $\left|f(x + \epsilon\,\delta x) - f(x) - \epsilon\,\nabla f(x)^\top \delta x\right|$ decreases at a rate of the second order. All our layers passed this test, as in the example shown in Fig. 19. This test should be conducted in double precision.

This theorem is a direct application of the asymptotic equipartition property (AEP), which is based on the i.i.d. assumption of sequence entries and the weak law of large numbers. Similar to the Gaussian annulus theorem, Theorem 2 depicts a geometric property of the Gaussian typical set: it is concentrated in an annulus near a shell of radius $\sigma\sqrt{n}$, which can be directly verified by definition. Equivalently, a vector whose $\ell_2$ norm deviates significantly from $\sqrt{n}$ cannot be a typical sample from a standard Gaussian. However, one cannot assert that a vector is sampled from the Gaussian typical set by only checking its norm.
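Both statements in the last paragraph can be checked numerically. The following minimal sketch (with an arbitrary dimension and sample count chosen for illustration) shows that the norms of standard Gaussian samples concentrate near $\sqrt{n}$, and that a constant vector has exactly the same norm without being Gaussian at all.

```python
import numpy as np


def norm_ratio(z):
    """l2 norm of z divided by sqrt(n), where n is the number of entries."""
    return np.linalg.norm(z) / np.sqrt(z.size)


rng = np.random.default_rng(0)
n = 16384

# Standard Gaussian samples: all ratios fall in a thin shell around 1.
ratios = np.array([norm_ratio(rng.standard_normal(n)) for _ in range(200)])

# Counterexample: the all-ones vector lies exactly on the sphere of radius
# sqrt(n), yet its entries have zero variance and are clearly not Gaussian.
ones = np.ones(n)
ones_ratio = norm_ratio(ones)
ones_std = ones.std()
```

This is why the spherical constraint alone is a necessary but not sufficient condition for a latent tensor to be in distribution.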

J MISCELLANEOUS TOPICS

where $M$ denotes the parameter space. The probability density $p_M$ introduces our a priori knowledge and is represented by a normalizing flow (Glow) $g_\theta$, which is a differentiable invertible mapping between two distributions, parameterized by neural network parameters $\theta$: $m = g_\theta(z)$, where $z$ is the latent vector. After training, the log probability density of a given model $m$ is

$$\log p_M(m; \theta) = \log p_Z\left(g_\theta^{-1}(m)\right) + \log\left|\det J_{g_\theta^{-1}}(m)\right| = \log p_Z(z) - \log\left|\det J_{g_\theta}(z)\right|, \quad (43)$$

where $Z$ stands for the latent space. When we use the trained network in inversion, we freeze the network weights; hence, we drop $\theta$ in $g_\theta$ hereafter in our notation. Therefore, Eq. 3 for Glow-regularized inversion is

$$\arg\min_z \frac{1}{2}\left\|d - f \circ g(z)\right\|_2^2 - \beta\left[\log p_Z(z) - \log\left|\det J_g(z)\right|\right]. \quad (44)$$

The new regularization term in Eq. 3 is $R(z) = -\beta\left[\log p_Z(z) - \log\left|\det J_g(z)\right|\right]$, and we have shown the results with different values of β in Fig. 20 and Table 7.

... where $p^*_M$ is the target distribution in the physical parameter space, and $p^*_Z$ is the corresponding latent-space distribution under the normalizing flow. This means that minimizing the forward KL divergence in the $M$ domain (the physical parameter space) is equivalent to minimizing the reverse KL divergence in the $Z$ domain (the latent space). This fact and Theorem 2 imply that a well-trained normalizing flow maps samples from the target distribution into the Gaussian typical set with very high probability and vice versa.

... it is not a significant feature within our training data. One needs to pay particular attention to training dataset construction to ensure that it represents the typical features to be investigated. One should also interpret their results with this caveat in mind. On the other hand, one may combine Gaussianization layers and latent space manipulation techniques (Shen et al., 2020) for a guided inversion, e.g., requiring that the inverted images should have eyeglasses. Second, we only use one set of Gaussianization layers.
One can further improve the performance by using additional sets of layers, possibly with different partition schemes, such as using the rolling operation (App. J.4) to shift the patch extraction locations. Third, since the deep generative models are highly nonlinear, the results may get stuck in local minima. We mitigate this problem by starting inversion from multiple randomly initialized latent vectors. However, this technique is computationally expensive. Improving the initialization process is an important research direction.

The Gaussianization layers are based on optimization and fixed-point iterations, which can be computationally expensive. To study possible simplifications, we estimated the forward and backward time for various combinations of sub-layers using 100 repeated experiments (Table 11). The computation was performed using one Nvidia V100 GPU and an Intel Xeon Platinum 8276 CPU. We always turn on the ICA layer (together with whitening) since the ablation studies show that it contributes the most to the performance increase (Table 2 and Table 9). Combining results from those ablation studies, we can see that the option ICA+YJ strikes a good balance between performance and computation time. It has a shorter runtime than ICA+Lambert and ICA+YJ+Lambert. At the same time, it gives comparable and sometimes the best results. On the other hand, the additional computation time from Gaussianization layers is well justified by the performance increase, especially if the physics simulation is much more time-consuming. For example, the eikonal solver for the experimental setup in our study has a forward runtime of 0.293±0.013 seconds and a backward runtime of 5.933±0.151 seconds. In addition, our implementation can be greatly improved by more efficient utilization of GPUs and by rewriting bottleneck components using high-performance code.

Table 11: Forward and backward computation time of various combinations of Gaussianization sublayers.
The times were computed using tensors of the same shape as latent tensors for Glow and StyleGAN2, respectively. Each computation time was estimated by 100 repeated experiments. We always turn on the whitening layer and the standardization layer.

We also provide insights into why Gaussianization layers perform better than other methods. First, the spherical constraint only relies on the geometric property of the Gaussian typical set (App. I). However, as shown in Fig. 1, it is insufficient to only constrain the norm of latent vectors in inversion. Gaussianization is necessary to ensure inverted results are in the range of DGMs. Second, the orthogonal reparameterization generally underperforms the Gaussianization layers. Our interpretation is as follows: unlike the Gaussianization layers, which actively destroy latent tensor patterns, orthogonal reparameterization cannot prevent the fixed typical Gaussian latent tensors from being permuted into ones with spatial patterns. In addition, we have shown in Fig. 10 that the whitening transformation alone is ineffective in keeping the DGM outputs within the range. We thus interpret that reducing higher-order dependencies is essential. Therefore, the noise regularization and the whitening layers yield less satisfying results than the Gaussianization layers.

We believe this study can benefit both the broader scientific community and the general public. However, we point out that the MRI and eikonal tomography experiments are purely synthetic and numerical. They do not fully reflect realistic medical imaging configurations. One should be cautious about applying our techniques to real data and interpreting the results.



... $+ R(m)$, (2)

Out-of-distribution/typicality tests are very challenging for high-dimensional data (Rabanser et al., 2019; Nalisnick et al., 2019). Also, since latent tensors are Gaussian/Gaussian-like, it is hard to conduct dimension reduction for such tests.

For example, in elastic waveform inversion, initializing material properties using Gaussian noise may create unrealistic scenarios where the P-wave speed is lower than the S-wave speed at some spatial locations.



Figure 1: Comparison of images generated by a deep generative model (DGM), Glow, using latent tensors that deviate from a spherical Gaussian distribution (left) and those after corresponding corrections (right). The visual effects highlight the necessity of keeping the latent tensor within such a distribution during inversion. The second column shows the characteristics of deviated latent tensors: (a) histogram: i.i.d. components but the distribution is skewed; (b) histogram: i.i.d. components but the distribution is heavy-tailed; (c) latent tensor image: non-i.i.d. entries. The first column shows the corresponding outputs of a Glow network. The third column shows latent tensors corrected by (a) the Yeo-Johnson layer (YJ), (b) the Lambert W × F X layer (LB), and (c) the full set of our Gaussianization layers (G layers). Those corrected latent tensors map to realistic images shown in the fourth column. All latent tensors have a norm of 0.7 √ vec dim because of the Gaussian Annulus Theorem (App. I) and the fact that Glow works best with a temperature smaller than one (see Fig. 9). Additional examples for StyleGAN2 and Stable Diffusion (Rombach et al., 2022) can be found in App. B.

Figure 3: The nonlinear activation functions from (a) the power transformation (Yeo-Johnson) layer and (b) the Lambert W × F X layer.

Figure 4: Comparison of compressive sensing MRI inversion results (Accl=8x, SNR=20 dB).

Figure 5: Comparison of deblurring results from different methods.

Figure 6: Comparison of eikonal tomography results from different methods.

Figure 7: Masks for compressive sensing MRI (Kelkar & Anastasio, 2021). White: 1, black: 0. (a) 8× acceleration; (b) 2× acceleration.

Figure 8: Experimental setup for eikonal tomography. (a) The target image and source locations; (b) Travel time field and the receiver array corresponding to the source. The contours are wavefronts. The propagation of waves can be viewed as curved rays traveling in directions perpendicular to such wavefronts; (c) The profile of the shortest wave travel time recorded at the receiver array. The receiver index starts from 0 in the top-left corner in subfigure (b) and increases in the clockwise direction.

Fig. 8(b) shows the shortest travel time field of the generated wave from the indicated source location. For each source, we only use receivers on the three other sides, indicated by the red lines, excluding the one on which the source is located. Fig. 8(c) shows the profile of the shortest wave travel time recorded at the receiver array. The receiver index starts from 0 in the top-left corner and increases in the clockwise direction.

Figure 9: The visual effects of the $\ell_2$ norm of latent tensors on Glow outputs. The top and bottom panels correspond to latent tensors drawn from a spherical Gaussian and a spherical skewed distribution, respectively. The two tensors are scaled to have various norms: 0.1, 0.7, 1.0, and 2.0 of $\sqrt{d}$ (the square root of the latent tensor dimension). The images become increasingly unrealistic as we increase the norm beyond $1.0\sqrt{d}$. This phenomenon demonstrates why we need to use a temperature < 1 for Glow.

Figure 10: The visual effects of various components of Gaussianization layers on a pathologic style vector for StyleGAN2.

Figure 12: Comparison of Stable Diffusion outputs generated using spherical Gaussian latent tensors of different norms, where d = √ tensor dimension.

To save experiment time in the tests of Fig. 13, we only use one 1D Gaussianization layer (either Yeo-Johnson or Lambert W × $F_X$).

Table 10: MRI inversion using real-world out-of-distribution images.

Configuration | PSNR↑ | SSIM↑
(c) Style only, G layers, YJ for 1D | 25.23±3.30 | 0.765±0.050
(d) Style + noise, G layers, YJ for 1D | 25.39±3.28 | 0.769±0.045
(e) Style only, G layers, Lambert for 1D | 25.39±3.38 | 0.762±0.064
(f) Style + noise, G layers, Lambert for 1D | 25.45±3.39 | 0.765±0.064

Figure 13: Examples of MRI inversion using out-of-distribution images. (Accl=8x, SNR=20 dB). (a) Style vector only + spherical constraint; (b) style vector + noise maps + spherical constraint; (c) style vector only + Gaussianization (only the Yeo-Johnson layer for 1D Gaussianization); (d) style vector + noise maps + Gaussianization (only the Yeo-Johnson layer for 1D); (e) style vector only + Gaussianization (only the Lambert layer for 1D); (f) style vector + noise maps + Gaussianization (only the Lambert layer for 1D).

Figure 15: Two reparameterization schemes for Glow. (a) P1; (b) P2. The dimensions of the latent parameters are [vector dimension × the number of vectors]. The parameters before the Gaussianization layers are {v_i | i = 1, ..., N}, and the patch vectors after the Gaussianization layers are the {z_i | i = 1, ..., N} mentioned in the main text.

Figure 16: Illustration of the squeezing and unfolding operations.

Input: matrix V ∈ R^{D×N}; error tolerance ε; damping parameters η = 10^{-4} and α = 0.8; maximum iteration numbers J = 10 and K = 100.
Output: matrix P ∈ R^{D×N} with i.i.d. entries
// Whitening stage

(a), the form of the Yeo-Johnson activation function depends on the parameter λ. If λ = 1, the mapping is the identity. If λ > 1, the activation function is convex: it compresses the left tail and extends the right tail, reducing left-skewness. If λ < 1, the activation function is concave and, conversely, reduces right-skewness. The only parameter, λ, is determined by solving an optimization problem that minimizes the negentropy:

s(\lambda, p_i)) + (\lambda - 1) \sum_{i=1}^{n} \operatorname{sign}(p_i) \log(|p_i| + 1),
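The behavior described above can be checked with a minimal sketch. It assumes the standard Yeo-Johnson transform and the standard profile log-likelihood, -(n/2) log(variance of the transformed data) plus the Jacobian term (λ - 1) Σ sign(p_i) log(|p_i| + 1); the leading term of the objective is not fully legible in this excerpt, so that part is an assumption. The function names and the grid search (used here in place of a proper optimizer) are illustrative only.

```python
import math
import random

def yeo_johnson(x, lam):
    """Standard Yeo-Johnson transform of a scalar x with parameter lam."""
    if x >= 0.0:
        if abs(lam) < 1e-12:
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)

def skewness(data):
    """Sample skewness (third standardized moment)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

def yj_log_likelihood(data, lam):
    """Assumed objective: Gaussian profile log-likelihood of the transformed data."""
    n = len(data)
    t = [yeo_johnson(x, lam) for x in data]
    mean = sum(t) / n
    var = sum((v - mean) ** 2 for v in t) / n
    jac = sum(math.copysign(1.0, x) * math.log1p(abs(x)) for x in data)
    return -0.5 * n * math.log(var) + (lam - 1.0) * jac

def fit_lambda(data, lo=-2.0, hi=4.0, steps=301):
    """Illustrative grid search over lam; not the optimizer used in the paper."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: yj_log_likelihood(data, lam))

rng = random.Random(0)
data = [rng.expovariate(1.0) - 1.0 for _ in range(2000)]  # right-skewed toy data
lam = fit_lambda(data)
transformed = [yeo_johnson(x, lam) for x in data]
print(lam, skewness(data), skewness(transformed))
```

For right-skewed input such as this, the fitted λ should fall below 1 (the concave regime), and the skewness of the transformed data should be closer to zero than that of the input.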

Figure 17: The negative log-likelihood (NLL, reported in bits per dimension) on the training and validation splits of the CelebA-HQ dataset with different numbers of multi-scale levels.

For the hyper-parameters of the Glow networks, we used 4 multi-scale levels and 32 flow steps, and we used only additive coupling layers. Fig. 17 reports the training process. For each epoch, we computed the training negative log-likelihood (NLL) averaged over the epoch and the validation NLL at the end of the epoch. The validation curves suggest that it is better to use 4 multi-scale levels. We chose the network weights from the epoch before the validation NLL stopped decreasing: epoch 850 for the CelebA-HQ dataset. All training was conducted on 8 × 32 GB Nvidia V100 GPUs with a batch size of 64. We used the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 10^{-4}, β_1 = 0.9, and β_2 = 0.99.

Figure 18: Illustration of the forward computation of the optimization-based ICA, Yeo-Johnson, and Lambert W × F_X layers. We denote the input and output by x and y, respectively. The layer is represented by h_θ and is defined by solving an optimization problem whose objective function is l. The back-propagation of gradients through the components defined by Eq. 24 and Eq. 28 is enabled by the implicit function theorem and automatic differentiation, as detailed in App. G.
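To illustrate how the implicit function theorem yields gradients through a layer whose parameter is defined by an inner optimization, the following sketch uses a scalar toy objective l(θ, x) = (θ - x)²/2 + θ⁴/4. This objective is hypothetical and chosen only because its stationarity condition is easy to verify; it is not the objective of Eq. 24 or Eq. 28, and the procedure in App. G may differ. At the optimum θ*(x), the condition g(θ, x) = ∂l/∂θ = 0 holds, so dθ*/dx = -(∂g/∂θ)^{-1} ∂g/∂x.

```python
def stationarity(theta, x):
    """g(theta, x) = dl/dtheta for the toy objective l = (theta - x)^2 / 2 + theta^4 / 4."""
    return theta - x + theta ** 3

def solve_theta(x, iters=50):
    """Inner problem: find theta*(x) with g(theta*, x) = 0 by Newton's method."""
    theta = 0.0
    for _ in range(iters):
        theta -= stationarity(theta, x) / (1.0 + 3.0 * theta ** 2)
    return theta

def implicit_grad(x):
    """d theta*/dx from the implicit function theorem: -(dg/dtheta)^{-1} * dg/dx."""
    theta = solve_theta(x)
    dg_dtheta = 1.0 + 3.0 * theta ** 2
    dg_dx = -1.0
    return -dg_dx / dg_dtheta

x0 = 2.0
h = 1e-6
finite_diff = (solve_theta(x0 + h) - solve_theta(x0 - h)) / (2.0 * h)
print(solve_theta(x0), implicit_grad(x0), finite_diff)
```

The implicit gradient needs only the solution of the inner problem and local derivatives of g, so the solver iterations themselves do not have to be differentiated.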

Figure 20: Comparison of conventional DGM (Glow)-regularized deblurring results for different values of the weighting factor β with the result obtained using our Gaussianization layers.

DUALITY OF KL DIVERGENCE

As also shown in Papamakarios et al. (2017), the KL-divergence between two distributions does not change under a differentiable invertible transformation, so

D_{KL}[p^*_M(m) \,\|\, p_M(m; \theta)] = D_{KL}[p^*_Z(z; \theta) \,\|\, p_Z(z)],
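The invariance can be checked numerically on a toy case. The sketch below is an illustration under assumed settings, not part of the method: it takes two 1D Gaussians p and q, applies the bijection u = exp(x) (so the transformed densities are log-normal), and estimates the KL-divergence by the trapezoid rule in both x-space and u-space. The distribution parameters and grid are arbitrary choices.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_pushforward_pdf(u, mu, sigma):
    """Log-density of u = exp(x), x ~ N(mu, sigma^2); the Jacobian contributes -log(u)."""
    return log_normal_pdf(math.log(u), mu, sigma) - math.log(u)

def kl_trapezoid(points, log_p, log_q):
    """Trapezoid-rule estimate of KL[p || q] on a (possibly non-uniform) grid."""
    vals = [math.exp(log_p(t)) * (log_p(t) - log_q(t)) for t in points]
    return sum(0.5 * (vals[i] + vals[i + 1]) * (points[i + 1] - points[i])
               for i in range(len(points) - 1))

mu_p, s_p = 0.0, 1.0   # p = N(0, 1)
mu_q, s_q = 0.5, 1.5   # q = N(0.5, 1.5^2)

xs = [-10.0 + 20.0 * i / 4000 for i in range(4001)]
us = [math.exp(x) for x in xs]  # image of the grid under u = exp(x)

kl_x = kl_trapezoid(xs,
                    lambda x: log_normal_pdf(x, mu_p, s_p),
                    lambda x: log_normal_pdf(x, mu_q, s_q))
kl_u = kl_trapezoid(us,
                    lambda u: log_pushforward_pdf(u, mu_p, s_p),
                    lambda u: log_pushforward_pdf(u, mu_q, s_q))
# Closed-form KL between two 1D Gaussians, for reference.
kl_exact = math.log(s_q / s_p) + (s_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * s_q ** 2) - 0.5
print(kl_x, kl_u, kl_exact)
```

The two estimates agree up to discretization error because the Jacobian factor appears in both transformed densities and cancels inside the logarithm.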

Comparison of compressive sensing MRI results.

Ablation study of the Gaussianization layers.

Deblurring results using Glow.

Eikonal tomography using StyleGAN2

Ablation study on patch size of the style vectors in StyleGAN2.

Glow-regularized deblurring results using different values of β.

Ablation study of the components of Gaussianization layers applied to Glow-regularized deblurring.

ACKNOWLEDGEMENTS

The author would like to thank Huseyin Denli, Ashutosh Tewari, Myun-Seok Cheon, Di Du, Stuart Harwood, Yu Fan, and Qiuzi Li for helpful discussions and the anonymous reviewers for their constructive feedback, which greatly improved the paper.


where U = (X − μ_X)/σ_X, is a bijection that maps X to another random variable S with heavier tails.

The transformation Eq. 25 is bijective if δ ≥ 0, and we can use the Lambert W function to find its inverse. The Lambert W function W is defined as the inverse of q = W^{-1}(t) = t exp(t), where t and q are scalars. Given q, Halley's method can be used to find t = W(q) (Corless et al., 1996). Hence, the inverse of Eq. 25 is given by Eq. 26.

We use the parameterized Lambert W × F_X distribution family to approximate a heavy-tailed input distribution and use Eq. 26 to recover a distribution with lighter tails. To make the recovered distribution close to a Gaussian distribution, we compute the optimal heavy-tail parameter δ by minimizing the difference between the kurtosis of the output distribution and 3 (Eq. 28; kurtosis is a common surrogate measure of negentropy (Hyvärinen & Oja, 2000)), where s is the data vector and Kurt is the kurtosis. We constrain δ > 0 and solve Eq. 28 using the L-BFGS-B optimizer (Zhu et al., 1997).

In addition, we estimate the mean μ_X and standard deviation σ_X along with δ using the Iterative Generalized Method of Moments (IGMM) (Goerg, 2015), which embeds an optimization problem for δ in an outer loop of iterations that estimates σ_X and μ_X (see Algorithm 3). If the kurtosis of the input data vector is not greater than 3, we skip the whole Lambert W × F_X layer and output the data vector directly.

Algorithm 3: Lambert W × F_X Layer with the Iterative Generalized Method of Moments (IGMM) (Goerg, 2015)
Input: data vector s, error tolerance ε = 10^{-5}, maximum iteration number K = 100.
Output: vector x whose empirical distribution is less heavy-tailed than s, with kurtosis ≈ 3.
(k+1), and σ_X, δ^(k+1);
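The following sketch shows how Halley's method can evaluate the Lambert W function and how it inverts a heavy-tail transform. It assumes that Eq. 25 has the heavy-tail form of Goerg (2015), z = u exp(δu²/2) in standardized variables, whose inverse is u = sign(z) (W(δz²)/δ)^{1/2}; Eq. 25 to Eq. 27 are not reproduced in this excerpt, so this form is an assumption. The function names, the starting guess, and the value of δ are illustrative.

```python
import math
import random

def lambert_w(q, tol=1e-12, max_iter=50):
    """Principal branch W(q) for q >= 0: solve t * exp(t) = q with Halley's method."""
    if q < 0.0:
        raise ValueError("this sketch only handles q >= 0")
    t = math.log1p(q)  # starting guess; exact at q = 0
    for _ in range(max_iter):
        e = math.exp(t)
        f = t * e - q
        # Halley update for f(t) = t e^t - q, using f' = e^t (t + 1), f'' = e^t (t + 2).
        step = f / (e * (t + 1.0) - (t + 2.0) * f / (2.0 * t + 2.0))
        t -= step
        if abs(step) <= tol * (1.0 + abs(t)):
            break
    return t

def heavy_tail(u, delta):
    """Assumed forward map (standardized variables): heavier tails for delta > 0."""
    return u * math.exp(0.5 * delta * u * u)

def heavy_tail_inverse(z, delta):
    """Inverse of heavy_tail via the Lambert W function."""
    if delta == 0.0:
        return z
    return math.copysign(math.sqrt(lambert_w(delta * z * z) / delta), z)

def kurtosis(data):
    """Sample kurtosis (fourth standardized moment); equals 3 for a Gaussian."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    return m4 / m2 ** 2

rng = random.Random(0)
delta = 0.3  # illustrative heavy-tail parameter
u = [rng.gauss(0.0, 1.0) for _ in range(5000)]
s = [heavy_tail(v, delta) for v in u]          # heavy-tailed data
x = [heavy_tail_inverse(v, delta) for v in s]  # recovered lighter-tailed data
max_err = max(abs(a - b) for a, b in zip(u, x))
print(kurtosis(u), kurtosis(s), kurtosis(x), max_err)
```

In the layer itself, δ is not known in advance; it would be chosen so that the kurtosis of the recovered vector is close to 3, which this sketch does not attempt.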

F DETAILS OF DATASETS AND TRAINING

For MRI and eikonal tomography, we generate synthetic brain images as inversion targets using the pre-trained StyleGAN2 weights from Kelkar & Anastasio (2021), which are trained on data from the

H ORTHOGONAL REPARAMETERIZATION

We also propose to reparameterize the latent vector z using an orthogonal matrix R:

z = R v,

where v is fixed during an inversion, and we treat R as the parameter instead. There are various ways to parameterize an orthogonal matrix. We choose the Cayley parameterization, one of the best reported in Liu et al. (2021). Therefore, the DGM-regularized inversion using orthogonal reparameterization optimizes over R rather than z. The specific reparameterization schemes for StyleGAN2 and Glow are the same as in App. D. For style vectors, we use the full dimension of 512 because of the observation described in Table 6. For Glow and the noise maps in StyleGAN2, we use patch sizes 3 × 8 × 8 and 1 × 8 × 8, respectively, to save computation time and memory.
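As a minimal sketch of a Cayley parameterization, the code below builds R = (I + A)^{-1}(I - A) from a skew-symmetric matrix A = W - W^T, where W holds unconstrained parameters. This is one common form of the Cayley transform and is an assumption here; the exact form used in the paper is not shown in this excerpt. The helper names and the small dimension are illustrative. Because R is orthogonal, z = R v has the same ℓ2 norm as v.

```python
import math
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def cayley(W):
    """Assumed form: R = (I + A)^{-1} (I - A) with A = W - W^T skew-symmetric."""
    n = len(W)
    A = [[W[i][j] - W[j][i] for j in range(n)] for i in range(n)]
    i_plus = [[(1.0 if i == j else 0.0) + A[i][j] for j in range(n)] for i in range(n)]
    i_minus = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)]
    return solve(i_plus, i_minus)

def l2_norm(x):
    return math.sqrt(sum(t * t for t in x))

rng = random.Random(0)
n = 4
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]  # free parameters
v = [rng.gauss(0.0, 1.0) for _ in range(n)]                      # fixed during inversion
R = cayley(W)
z = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]

RtR = matmul(transpose(R), R)
orth_err = max(abs(RtR[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))
print(orth_err, l2_norm(v), l2_norm(z))
```

I + A is always invertible for skew-symmetric A, so the map from W to R is defined for every W, which makes unconstrained gradient-based updates of W possible.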

I GAUSSIAN TYPICAL SET

The Gaussian annulus theorem states that a high-dimensional standard Gaussian distribution has most of its probability mass concentrated within an annulus around a high-dimensional sphere:

Theorem 1 (Gaussian Annulus Theorem (Blum et al., 2020)) For an n-dimensional standard Gaussian, for any β ≤ √n, all but at most 3e^{-cβ²} of the probability mass lies within the annulus √n − β ≤ |x| ≤ √n + β, where c is a fixed positive constant.

This theorem gives a necessary geometric condition for typical samples from a high-dimensional standard Gaussian. However, the converse is not true: not all vectors whose ℓ2 norm satisfies √n − β ≤ |x| ≤ √n + β are typical samples from the standard Gaussian distribution. As a result, a DGM does not necessarily map a latent vector that lies within the Gaussian annulus to a plausible image, as demonstrated in Fig. 1.

The formal definition of a typical set is as follows.

Definition 1 (Cover & Thomas (2012)) Let p_X(x) be a distribution whose support is X. The typical set A_ε^{(n)} is defined as the set of sequences

A_\epsilon^{(n)} = \left\{ (x_1, \ldots, x_n) \in \mathcal{X}^n : \left| -\frac{1}{n} \log p(x_1, \ldots, x_n) - H[X] \right| \le \epsilon \right\},

where H[X] is the entropy of the random variable X.

Now a random vector x ∈ R^n ∼ N(0, σ²I) can be factorized into i.i.d. random variables that are distributed as N(0, σ²). Therefore, we can regard x as an i.i.d. sequence and give the following definition:

Definition 2 (Gaussian Typical Set) A Gaussian typical set is the typical set A_ε^{(n)} of x ∼ N(0, σ²I).

The following theorem guarantees that a typical sample from x ∈ R^n ∼ N(0, σ²I) resides in the Gaussian typical set with very high probability.

Theorem 2 (Cover & Thomas (2012)) For every ε > 0, the typical set has probability P(A_ε^{(n)}) > 1 − ε for a sufficiently large dimension n.
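Both concentration statements can be checked empirically. The sketch below draws samples from an n-dimensional standard Gaussian and measures the fraction that falls inside the annulus of Theorem 1 and inside the typical set of Definition 1. It assumes the differential entropy of N(0, 1) in nats for H[X], and the values of n, β, ε, and the number of trials are arbitrary illustrative choices.

```python
import math
import random

def sample_gaussian(n, rng):
    """One sample from the n-dimensional standard Gaussian."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

def neg_log_density_per_dim(x):
    """-(1/n) log p(x) for the n-dimensional standard Gaussian, in nats."""
    n = len(x)
    return 0.5 * math.log(2.0 * math.pi) + sum(v * v for v in x) / (2.0 * n)

rng = random.Random(0)
n, trials = 1000, 200
beta, eps = 3.0, 0.1                               # illustrative choices
entropy = 0.5 * math.log(2.0 * math.pi * math.e)   # differential entropy of N(0, 1)

samples = [sample_gaussian(n, rng) for _ in range(trials)]
in_annulus = sum(abs(l2_norm(x) - math.sqrt(n)) <= beta for x in samples) / trials
in_typical = sum(abs(neg_log_density_per_dim(x) - entropy) <= eps for x in samples) / trials
print(in_annulus, in_typical)
```

With these settings, nearly all samples should satisfy both conditions, consistent with the two theorems.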

J.3 INVARIANCE OF KL-DIVERGENCE AND MULTI-INFORMATION

For completeness, we briefly prove the key properties of the KL-divergence and multi-information used in the Gaussianization framework.

First, we prove that the KL-divergence is invariant under differentiable bijections. We write the definition of the KL-divergence between distributions p_x and q_x of a random vector x as

D_{KL}[p_x \,\|\, q_x] = \int p_x(x) \log \frac{p_x(x)}{q_x(x)} \, dx.

Suppose there is a differentiable and invertible transformation T that maps x to u: u = T(x). Then the PDF under the change of variables is

p_u(u) = p_x(x) \, |\det J_T(x)|^{-1},

where we define the Jacobian matrix J_T(x) = ∂u/∂x (x). Therefore, we have

D_{KL}[p_u \,\|\, q_u] = \int p_u(u) \log \frac{p_u(u)}{q_u(u)} \, du = \int p_x(x) \log \frac{p_x(x) \, |\det J_T(x)|^{-1}}{q_x(x) \, |\det J_T(x)|^{-1}} \, dx = D_{KL}[p_x \,\|\, q_x].

Then, we show that the multi-information does not change under component-wise invertible differentiable transformations. The multi-information is defined by the KL-divergence between the joint distribution p(x) and the product of the marginals:

I(x) = D_{KL}\left[ p(x) \,\Big\|\, \prod_i p_i(x_i) \right].

We define the bijection u = T(x) specifically with invertible differentiable transformations for each component, T_i(x_i) = u_i. Then, using the KL-divergence invariance we just proved and that

We point out the limitations and potential improvements of this work. First, requiring latent vectors to traverse within the Gaussian distribution/typical set means that the statistics of the training dataset will dominate the results. Fig. 21 shows that our method cannot restore the eyeglasses because

