ON THE INVERSION OF DEEP GENERATIVE MODELS

Abstract

Deep generative models (e.g. GANs and VAEs) have been developed quite extensively in recent years. Lately, there has been an increased interest in the inversion of such a model, i.e. given a (possibly corrupted) signal, we wish to recover the latent vector that generated it. Building upon sparse representation theory, we define conditions that rely only on the cardinalities of the hidden layer and are applicable to any inversion algorithm (gradient descent, deep encoder, etc.), under which such generative models are invertible with a unique solution. Importantly, the proposed analysis is applicable to any trained model, and does not depend on Gaussian i.i.d. weights. Furthermore, we introduce two layer-wise inversion pursuit algorithms for trained generative networks of arbitrary depth, where one of them is accompanied by recovery guarantees. Finally, we validate our theoretical results numerically and show that our method outperforms gradient descent when inverting such generators, both for clean and corrupted signals.

1. INTRODUCTION

In the past several years, deep generative models, e.g. Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Auto-Encoders (VAEs) (Kingma & Welling, 2013), have been greatly developed, leading to networks that can generate images, videos, and speech, among others, that look and sound authentic to humans. Loosely speaking, these models learn, in an unsupervised manner, a mapping from a random low-dimensional latent space to the training data distribution. Interestingly, deep generative models are not used only to generate arbitrary signals. Recent works rely on the inversion of these models to perform visual manipulations, compressed sensing, image interpolation, and more (Zhu et al., 2016; Bora et al., 2017; Simon & Aberdam, 2020). In this work, we study this inversion task. Formally, denoting the signal to invert by y ∈ R^n, the generative model by G : R^{n_0} → R^n, and the latent vector by z ∈ R^{n_0}, we study the following problem:

z* = argmin_z (1/2) ||G(z) − y||_2^2,    (1)

where G is assumed to be a feed-forward neural network. The first question that comes to mind is whether this model is invertible, or equivalently, does Equation 1 have a unique solution? In this work, we establish theoretical conditions that guarantee the invertibility of the model G. Notably, the provided theorems are applicable to general non-random generative models, and do not depend on the chosen inversion algorithm. Once the existence of a unique solution is established, the next challenge is to provide a recovery algorithm that is guaranteed to obtain the sought solution. A common and simple approach is to draw a random vector z and iteratively update it using gradient descent, aiming to minimize Equation 1 (Zhu et al., 2016; Bora et al., 2017). Unfortunately, since the inversion problem is generally non-convex, this approach has theoretical guarantees only in limited scenarios (Hand et al., 2018; Hand & Voroninski, 2019).
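To make the optimization in Equation 1 concrete, the following is a minimal NumPy sketch of gradient-descent inversion for a toy two-layer ReLU generator. All dimensions, the step size, and the two-restart heuristic are illustrative choices made here, not values from the paper.

```python
import numpy as np

def generator(z, W1, W2):
    """Toy two-layer generator G(z) = W2 @ ReLU(W1 @ z)."""
    return W2 @ np.maximum(W1 @ z, 0.0)

def gd_invert(y, W1, W2, z0, steps=3000, lr=0.1):
    """Plain gradient descent on 0.5 * ||G(z) - y||_2^2, starting from z0."""
    z = z0.copy()
    for _ in range(steps):
        h = W1 @ z
        r = W2 @ np.maximum(h, 0.0) - y              # residual G(z) - y
        z -= lr * (W1.T @ ((h > 0) * (W2.T @ r)))    # chain rule through ReLU
    return z

rng = np.random.default_rng(1)
n0, n1, n = 10, 40, 80                               # illustrative expanding net
W1 = rng.standard_normal((n1, n0)) / np.sqrt(n0)
W2 = rng.standard_normal((n, n1)) / np.sqrt(n1)
z_true = rng.standard_normal(n0)
y = generator(z_true, W1, W2)
# Restart from +/- z0: for random nets the loss surface may have a spurious
# region around a negative multiple of z (Hand & Voroninski, 2019).
z0 = rng.standard_normal(n0)
cands = [gd_invert(y, W1, W2, s * z0) for s in (1.0, -1.0)]
z_hat = min(cands, key=lambda z: np.linalg.norm(generator(z, W1, W2) - y))
print(np.linalg.norm(generator(z_hat, W1, W2) - y) / np.linalg.norm(y))
```

The non-convexity discussed above is exactly why such a scheme comes with guarantees only for random weights under additional assumptions.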
An alternative approach is to train an encoding neural network that maps images to their latent vectors (Zhu et al., 2016; Donahue et al., 2016; Bau et al., 2019; Simon & Aberdam, 2020); however, this method is not accompanied by any theoretical justification. We adopt a third approach, in which the generative model is inverted in an analytical fashion. Specifically, we perform the inversion layer-by-layer, similarly to Lei et al. (2019). Our approach is based on the observation that every hidden layer is the outcome of a weight matrix multiplying a sparse vector, followed by a ReLU activation. By utilizing sparse representation theory, the proposed algorithm ensures perfect recovery in the noiseless case and a bounded estimation error in the noisy one. Moreover, we show numerically that our algorithm outperforms gradient descent in several tasks, including the reconstruction of noiseless and corrupted images. Main contributions: The contributions of this work are both theoretical and practical. We derive theoretical conditions for the invertibility of deep generative models by ensuring a unique solution for the inversion problem defined in Equation 1. In short, these conditions rely on the growth of the number of non-zero elements of consecutive hidden layers, by a factor of 2 for trained networks and by any constant greater than 1 for random models. Then, by leveraging the inherent sparsity of the hidden layers, we introduce a layer-wise inversion algorithm with provable guarantees in the noiseless and noisy settings for fully-connected generators. To the best of our knowledge, this is the first work that provides such guarantees for general (non-random) models, addressing both the conceptual inversion and provable algorithms for solving Equation 1. Finally, we provide numerical experiments demonstrating the superiority of our approach over gradient descent in various scenarios.

1.1. RELATED WORK

Inverting deep generative models: A tempting approach to solving Equation 1 is to use first-order methods such as gradient descent. Even though this inversion is generally non-convex, the works of Hand & Voroninski (2019); Hand et al. (2018) show that if the weights are random then, under additional assumptions, no spurious stationary points exist, and thus gradient descent converges to the optimum. A different analysis, given in Latorre et al. (2019), studies the case of strongly smooth generative models that are near-isometries. In this work, we study the inversion of general (non-random and non-smooth) ReLU-activated generative networks, and provide a provable algorithm that empirically outperforms gradient descent. A close but different line of theoretical work analyzes the compressed sensing abilities of trained deep generative networks (Shah & Hegde, 2018; Bora et al., 2017); however, these works assume that an ideal inversion algorithm, solving Equation 1, exists. Other works (Bojanowski et al., 2017; Wu et al., 2019) suggest training procedures that result in generative models that can be easily inverted. Nevertheless, in this work we do not assume anything about the training procedure itself, and rely only on the weights of the trained model.

Layer-wise inversion:

The closest work to ours, and indeed its source of inspiration, is Lei et al. (2019), which proposes a novel scheme for inverting generative models. Assuming that the input signal was corrupted by noise bounded in the ℓ_1 or ℓ_∞ sense, they suggest inverting the model layer-by-layer using linear programs. That said, to ensure a stable inversion, their analysis is restricted to cases where: (i) the network weights are Gaussian i.i.d. variables; (ii) the layers expand such that the number of non-zero elements in each layer is larger than the size of the entire layer preceding it; and (iii) the last activation function is either ReLU or leaky-ReLU. Unfortunately, as mentioned in their work, these three assumptions often do not hold in practice. In this work, we rely neither on the distribution of the weights nor on the chosen activation function of the last layer. Furthermore, we relax the expansion assumption to rely only on the expansion of the number of non-zero elements. This relaxation is especially needed in the last hidden layer, which is typically larger than the image size. Neural networks and sparse representation: In the search for a deeper theoretical understanding of deep learning, a series of papers suggested a connection between neural networks and sparse coding, demonstrating that the forward pass of a neural network is in fact a pursuit for a multi-layer sparse representation (Papyan et al., 2017; Sulam et al., 2018; Chun & Fessler, 2019; Sulam et al., 2019; Romano et al., 2019; Xin et al., 2016). In this work, we expand this proposition by showing that the inversion of a generative model is based on sequential sparse coding steps.

2. THE GENERATIVE MODEL

Notations: We use bold uppercase letters to represent matrices and bold lowercase letters to represent vectors. The vector w_j represents the jth column of the matrix W; similarly, w_{i,j} represents the jth column of the matrix W_i. The ReLU activation is the entry-wise operator ReLU(u) = max{u, 0}. We denote by spark(W) the smallest number of columns of W that are linearly dependent, and by ||x||_0 the number of non-zero elements in x. The mutual coherence of a matrix W is defined as μ(W) = max_{i≠j} |w_i^T w_j| / (||w_i||_2 ||w_j||_2). Finally, we define x^S and W^S as the supported vector and the row-supported matrix according to the set S, and denote by S^c the complementary set of S.

Problem Statement: We consider a typical generative scheme G : R^{n_0} → R^n of the form:

x_0 = z,
x_{i+1} = ReLU(W_i x_i), for all i ∈ {0, . . . , L − 1},    (2)
G(z) = φ(W_L x_L),

where x_i ∈ R^{n_i}, {x_i}_{i=1}^{L} are the hidden layers, W_i ∈ R^{n_{i+1} × n_i} are the weight matrices (n_{L+1} = n), x_0 = z ∈ R^{n_0} is the latent vector, usually drawn at random from a normal distribution, z ∼ N(0, σ^2 I_{n_0}), and φ is an invertible activation function, e.g. tanh, sigmoid, or piece-wise linear. Given a sample x = G(z) created by the generative model above, we aim to recover its latent vector z. Note that each hidden vector in the model is produced by a ReLU activation, leading to hidden layers that are inherently sparse. This observation supports our approach of studying this model via sparse representation theory. In what follows, we use it to derive theoretical statements on the invertibility and stability of this problem, and to develop pursuit algorithms that are guaranteed to restore the original latent vector.
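The generative scheme of Equation 2 can be sketched as follows. The layer sizes are illustrative, and the snippet simply verifies the observation above: with random weights, the ReLU hidden layers come out roughly 50% sparse.

```python
import numpy as np

def generate(z, weights, phi=np.tanh):
    """Forward pass of Equation 2: x_{i+1} = ReLU(W_i x_i), output phi(W_L x_L).

    Returns the output signal and the list of hidden layers [x_1, ..., x_L]."""
    x = z
    hidden = []
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)    # ReLU zeroes out roughly half the entries
        hidden.append(x)
    return phi(weights[-1] @ x), hidden

rng = np.random.default_rng(0)
dims = [20, 128, 392, 784]            # n_0, n_1, n_2, n (illustrative sizes)
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
z = rng.standard_normal(dims[0])
x, hidden = generate(z, weights)
sparsities = [np.count_nonzero(h) / h.size for h in hidden]
print(sparsities)                     # each close to 0.5 for random weights
```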

3. INVERTIBILITY AND UNIQUENESS

We start by addressing the question: "Is this generative process invertible?" In other words, given a signal that was generated by the model, x = G(z*), we know that a solution z* to the inverse problem exists; however, can we ensure that it is the only one? Theorem 1 below (its proof is given in Appendix A) provides such guarantees, based on the sparsity level of the hidden layers and the spark of the weight matrices (see Section 2). Importantly, this theorem is not restricted to a specific pursuit algorithm; it can rather be used with any restoration method (gradient descent, deep encoder, etc.) to determine whether the recovered latent vector is the unique solution.

Definition 1 (sub-spark). Define the s-sub-spark of a matrix W as the minimal spark over all subsets S of rows of cardinality |S| = s: sub-spark(W, s) = min_{|S|=s} spark(W^S).

Definition 2 (sub-rank). Define the s-sub-rank of a matrix W as the minimal rank over all subsets S of rows of cardinality |S| = s: sub-rank(W, s) = min_{|S|=s} rank(W^S).

Theorem 1 (Uniqueness). Consider the generative scheme described in Equation 2 and a signal x = G(z*) with a corresponding set of representations {x*_i}_{i=1}^{L} that satisfy:
(i) s_L = ||x*_L||_0 < spark(W_L)/2;
(ii) s_i = ||x*_i||_0 < sub-spark(W_i, s_{i+1})/2, for all i ∈ {1, . . . , L − 1};
(iii) n_0 = sub-rank(W_0, s_1) ≤ s_1.
Then, z* is the unique solution to the inverse problem that meets these sparsity conditions.

Theorem 1 is the first of its kind to provide uniqueness guarantees for general non-statistical weight matrices. Moreover, it only requires an expansion of the layer cardinalities, as opposed to Huang et al. (2018); Hand & Voroninski (2019) and Lei et al. (2019), which require a dimensionality expansion that often does not hold for the last layer (typically n < n_L). A direct corollary of Theorem 1 concerns the case of random matrices.
In this case, the probability of having n linearly dependent columns is essentially zero (Elad, 2010, Chapter 2). Hence, the conditions of Theorem 1 become: (i) s_L < (n + 1)/2; (ii) s_i < (s_{i+1} + 1)/2; (iii) s_1 ≥ n_0. In fact, since singular square matrices have Lebesgue measure zero, this corollary holds for almost every set of matrices. In practice, to allow a sufficient increase in the cardinalities of the hidden layers, their dimensions should expand as well, excluding the last layer. For example, if the dimensions of the hidden layers increase by a factor of 2, then as long as the hidden layers preserve a constant percentage of non-zero elements, Theorem 1 holds almost surely. Notably, this is the common practice in various generative architectures, such as DCGAN (Radford et al., 2015) and PGAN (Karras et al., 2017). Nevertheless, in the random setting we can further relax the above conditions by utilizing a theorem from Foucart & Rauhut (2013). This theorem considers a typical sparse representation model with a random dictionary, and states that a sparse representation is unique as long as its cardinality is smaller than the signal dimension. Therefore, as presented in Theorem 2, in the random setting the cardinality needs to grow across the layers only by a constant, i.e. s_i < s_{i+1} and s_L < n.

Theorem 2 (Uniqueness for Random Weight Matrices). Assume that the weight matrices comprise independent and identically distributed entries (say, Gaussian). If the representations of a signal x = G(z*) satisfy:
(i) s_L = ||x_L||_0 < n;
(ii) s_i = ||x_i||_0 < s_{i+1}, for all i ∈ {1, . . . , L − 1};
(iii) s_1 = ||x_1||_0 ≥ n_0,
then, with probability 1, the inverse problem has a unique solution that meets these conditions. The above theorem states that to ensure a unique global minimum in the stochastic case, the number of non-zero elements need only grow by a constant from one layer to the next.
The proof of this theorem follows the same protocol as Theorem 1's proof, while replacing the spark-based uniqueness (Donoho & Elad, 2003) with Foucart & Rauhut (2013) . As presented in Section 6.1, these conditions are very effective in predicting whether the generative process is invertible or not, regardless of the recovery algorithm used.
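Since the conditions of Theorem 2 involve only the cardinalities of the hidden layers, they can be checked directly from the supports. A small hypothetical helper (names and sizes are illustrative):

```python
import numpy as np

def random_uniqueness_conditions(hidden, n0, n):
    """Check the cardinality conditions of Theorem 2 for random weights:
    (i) s_L < n, (ii) s_i < s_{i+1} for i < L, (iii) s_1 >= n_0."""
    s = [int(np.count_nonzero(h)) for h in hidden]   # s_1, ..., s_L
    cond_last = s[-1] < n
    cond_growth = all(s[i] < s[i + 1] for i in range(len(s) - 1))
    cond_first = s[0] >= n0
    return cond_last and cond_growth and cond_first

# Illustrative supports with cardinalities 64 -> 196, for n0 = 20, n = 784.
rng = np.random.default_rng(0)
h1 = np.zeros(128); h1[rng.choice(128, 64, replace=False)] = 1.0
h2 = np.zeros(392); h2[rng.choice(392, 196, replace=False)] = 1.0
print(random_uniqueness_conditions([h1, h2], n0=20, n=784))  # True
```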

4. PURSUIT GUARANTEES

In this section we provide an inversion algorithm supported by reconstruction guarantees for the noiseless and noisy settings. To reveal the potential of our approach, we first discuss the performance of an Oracle, in which the true supports of all the hidden layers are known and only their values are missing. This estimation can be performed by a sequence of simple linear projections onto the known supports. Note that already in the first step of estimating x_L we can see the advantage of utilizing the inherent sparsity of the hidden layers: the reconstruction error of the Oracle is proportional to s_L = ||x_L||_0, whereas solving a least-squares problem, as suggested in Lei et al. (2019), results in an error that is proportional to n_L. For more details see Appendix B.

Algorithm 1 Layered Basis-Pursuit
Input: y = G(z) + e ∈ R^n, where ||e||_2 ≤ ε, and sparsity levels {s_i}_{i=1}^{L}.
First step: x̂_L = argmin_x (1/2) ||φ^{−1}(y) − W_L x||_2^2 + λ_L ||x||_1, with λ_L = 2 ε_{L+1}. Set Ŝ_L = Support(x̂_L) and ε_L = ((3 + √1.5) √s_L / min_j ||w_{L,j}||_2) ε_{L+1}.
General step: For each layer i = L − 1, . . . , 1 execute:
1. x̂_i = argmin_x (1/2) ||x̂_{i+1}^{Ŝ_{i+1}} − W_i^{Ŝ_{i+1}} x||_2^2 + λ_i ||x||_1, with λ_i = 2 ε_{i+1}.
2. Set Ŝ_i = Support(x̂_i) and ε_i = ((3 + √1.5) √s_i / min_j ||w_{i,j}^{Ŝ_{i+1}}||_2) ε_{i+1}.
Final step: Set ẑ = argmin_z (1/2) ||x̂_1^{Ŝ_1} − W_0^{Ŝ_1} z||_2^2.

In what follows, we propose to invert the model by solving sparse coding problems layer-by-layer, leveraging the sparsity of all the intermediate feature vectors. Specifically, Algorithm 1 describes a layered Basis-Pursuit approach, and Theorem 3 provides reconstruction guarantees for this algorithm. The proof of this theorem is given in Appendix C. In Corollary 1 we provide guarantees for this algorithm when inverting non-random generative models in the noiseless case.
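A minimal sketch of Algorithm 1 in the noiseless case, with each lasso step solved by plain ISTA. The solver choice, dimensions, and thresholds are illustrative, and φ is taken to be the identity (i.e. already inverted):

```python
import numpy as np

def ista(W, y, lam, steps=3000):
    """min_x 0.5*||y - Wx||_2^2 + lam*||x||_1 via iterative soft-thresholding."""
    t = 1.0 / np.linalg.norm(W, 2) ** 2      # step size 1/L, L = ||W||_2^2
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        x = x - t * (W.T @ (W @ x - y))
        x = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)
    return x

def layered_basis_pursuit(y, weights, lam=1e-6, tol=1e-6):
    """Noiseless sketch of Algorithm 1 (phi assumed already inverted).
    weights = [W_0, ..., W_L]; each hidden layer is x_{i+1} = ReLU(W_i x_i)."""
    x = ista(weights[-1], y, lam)            # estimate x_L from y = W_L x_L
    for W in weights[-2:0:-1]:               # W_{L-1}, ..., W_1
        S = np.abs(x) > tol                  # estimated support of layer above
        x = ista(W[S, :], x[S], lam)         # solve on the supported rows only
    S = np.abs(x) > tol
    return np.linalg.lstsq(weights[0][S, :], x[S], rcond=None)[0]

rng = np.random.default_rng(0)
n0, n1, n = 5, 30, 100
W0 = rng.standard_normal((n1, n0)) / np.sqrt(n0)
W1 = rng.standard_normal((n, n1)) / np.sqrt(n1)
z_true = rng.standard_normal(n0)
y = W1 @ np.maximum(W0 @ z_true, 0.0)
z_hat = layered_basis_pursuit(y, [W0, W1])
print(np.linalg.norm(z_hat - z_true))
```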

Figure 1: The Latent-Pursuit inverts the generative model layer-by-layer as described in Algorithm 3, and is composed of three steps: (1) last-layer inversion by Algorithm 5; (2) midlayer inversions using Algorithm 2; and (3) first-layer inversion via Algorithm 2 with the x-step replaced by Equation 10.

Theorem 3 (Layered Basis-Pursuit Stability). Suppose that y = x + e, where x = G(z) is an unknown signal with known sparsity levels {s_i}_{i=1}^{L}, and ||e||_2 ≤ ε. Let ℓ be the Lipschitz constant of φ^{−1} and define ε_{L+1} = ℓε. If in each midlayer i ∈ {1, . . . , L} it holds that s_i < 1/(3 μ_{s_{i+1}}(W_i)), then:
• the support of x̂_i is a subset of the true support, Ŝ_i ⊆ S_i;
• the vector x̂_i is the unique solution of the basis-pursuit;
• the midlayer error satisfies ||x̂_i − x_i||_2 < ε_i, where ε_i = ((3 + √1.5) √s_i / min_j ||w_{i,j}^{Ŝ_{i+1}}||_2) ε_{i+1};
• the recovery error on the latent space is upper bounded by

||ẑ − z||_2 < (ε_{L+1} / √ϕ) ∏_{i=1}^{L} (3 + √1.5) √s_i / min_j ||w_{i,j}^{Ŝ_{i+1}}||_2,    (4)

where ϕ = λ_min((W_0^{Ŝ_1})^T W_0^{Ŝ_1}) > 0.

Corollary 1 (Layered Basis-Pursuit, Noiseless Case). Let x = G(z) with sparsity levels {s_i}_{i=1}^{L}, and assume that s_i < 1/(3 μ_{s_{i+1}}(W_i)) for all i ∈ {1, . . . , L}, and that ϕ = λ_min((W_0^{Ŝ_1})^T W_0^{Ŝ_1}) > 0. Then Algorithm 1 recovers the latent vector perfectly, ẑ = z.
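The coherence condition s_i < 1/(3 μ(W_i)) appearing in Theorem 3 and Corollary 1 is directly computable from the weights. A small illustrative helper (the worst-case bound it returns is pessimistic, as the experiments in Section 6 show):

```python
import numpy as np

def mutual_coherence(W):
    """mu(W) = max_{i != j} |w_i^T w_j| / (||w_i||_2 ||w_j||_2)."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)  # normalize the columns
    G = np.abs(Wn.T @ Wn)
    np.fill_diagonal(G, 0.0)                           # exclude i == j
    return G.max()

def coherence_bound(W):
    """Largest cardinality s satisfying s < 1/(3*mu(W))."""
    return int(np.ceil(1.0 / (3.0 * mutual_coherence(W))) - 1)

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 392))                    # illustrative layer size
print(mutual_coherence(W), coherence_bound(W))
```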

5. THE LATENT-PURSUIT ALGORITHM

While Algorithm 1 provably inverts the generative model, it only uses the non-zero elements x̂_{i+1}^{Ŝ_{i+1}} to estimate the previous layer x_i. Here we present the Latent-Pursuit algorithm, which extends the Layered Basis-Pursuit algorithm by imposing two additional constraints. First, Latent-Pursuit adds the inequality constraints W_i^{Ŝ^c_{i+1}} x̂_i ≤ 0 that emerge from the ReLU activation. Second, recall that the ReLU activation constrains the midlayers to have non-negative values, x_i ≥ 0. Furthermore, we refrain from inverting the activation function φ^{−1} directly, since in practice this inversion might be unstable, e.g. when using tanh. In what follows, we describe each of the three parts of the proposed algorithm: (i) the image layer; (ii) the middle layers; and (iii) the first layer.

Starting with the inversion of the last layer, i.e. the image layer, we need to solve

x̂_L = argmin_x (1/2) ||y − φ(W_L x)||_2^2 + λ_L 1^T x, s.t. x ≥ 0,    (5)

where 1^T x represents an ℓ_1 regularization term under the non-negativity constraint. Assuming that φ is smooth and strictly monotonically increasing, this is a smooth convex problem with separable constraints, and therefore it can be solved using a projected gradient algorithm. In particular, we employ FISTA (Beck & Teboulle, 2009), as described in Algorithm 5 in Appendix D.

We move on to the middle layers, i.e. estimating x_i for i ∈ {1, . . . , L − 1}. Here, both the approximated vector and the given signal are assumed to result from a ReLU activation function. This leads to the following problem:

x̂_i = argmin_x (1/2) ||x̂_{i+1}^{Ŝ} − W_i^{Ŝ} x||_2^2 + λ_i 1^T x, s.t. x ≥ 0, W_i^{Ŝ^c} x ≤ 0,    (6)

where Ŝ = Ŝ_{i+1} is the support of the output of the layer to be inverted, and Ŝ^c = Ŝ^c_{i+1} is its complement. To solve this problem we introduce an auxiliary variable a = W_i^{Ŝ^c} x, leading to the following augmented Lagrangian form:

min_{x,a,u} (1/2) ||x̂_{i+1}^{Ŝ} − W_i^{Ŝ} x||_2^2 + λ_i 1^T x + (ρ_i/2) ||a − W_i^{Ŝ^c} x + u||_2^2, s.t. x ≥ 0, a ≤ 0.    (7)
This optimization problem could be solved using ADMM (Boyd et al., 2011); however, that would require inverting a matrix of size n_i × n_i, which might be costly. Alternatively, we employ a more general method, the alternating direction proximal method of multipliers (Beck, 2017, Chapter 15), in which a quadratic proximity term, (1/2) ||x − x^{(k)}||_Q^2, is added to Equation 7. By setting

Q = αI − (W_i^{Ŝ})^T W_i^{Ŝ} + βI − ρ_i (W_i^{Ŝ^c})^T W_i^{Ŝ^c}, with α + β ≥ λ_max((W_i^{Ŝ})^T W_i^{Ŝ} + ρ_i (W_i^{Ŝ^c})^T W_i^{Ŝ^c}),    (8)

we get that Q is positive semidefinite. This leads to the Linearized-ADMM scheme in Algorithm 2, which is guaranteed to converge to the optimal solution of Equation 7 (see details in Appendix D). Having recovered all the hidden layers, only the latent vector z is left to be estimated. For this inversion step we adopt a MAP estimator, utilizing the fact that z is drawn from a normal distribution:

ẑ = argmin_z (1/2) ||x̂_1^{Ŝ} − W_0^{Ŝ} z||_2^2 + (γ/2) ||z||_2^2, s.t. W_0^{Ŝ^c} z ≤ 0,    (9)

with γ > 0. This problem can be solved by the Linearized-ADMM algorithm described above (see details in Appendix D), except that the update of x in Algorithm 2 becomes:

z^{(k+1)} ← (1/(α + β + γ)) [(α + β) z^{(k)} − (W_0^{Ŝ})^T (W_0^{Ŝ} z^{(k)} − x̂_1^{Ŝ}) − ρ_1 (W_0^{Ŝ^c})^T (W_0^{Ŝ^c} z^{(k)} − a^{(k)} − u^{(k)})].    (10)

Once the latent vector z and all the hidden layers {x_i}_{i=1}^{L} are recovered, we propose an optional step to improve the final estimation. In this step, which we refer to as debiasing, we freeze the recovered supports and optimize only over the non-zero values in an end-to-end fashion. This is equivalent to computing the Oracle, except that here the supports are not known but rather estimated using the proposed pursuit. Algorithm 3 provides a short description of the entire proposed inversion method.

Algorithm 2 Latent Pursuit: Midlayer Inversion
Initialization: x^{(0)} ∈ R^{n_i}, u^{(0)}, a^{(0)} ∈ R^{s_{i+1}}, ρ_i > 0, and α, β satisfying Equation 8.
Until converged: for k = 0, 1, . . . execute:
1. x^{(k+1)} ← ReLU( x^{(k)} − (1/(α+β)) (W_i^{Ŝ})^T (W_i^{Ŝ} x^{(k)} − x̂_{i+1}^{Ŝ}) − (ρ_i/(α+β)) (W_i^{Ŝ^c})^T (W_i^{Ŝ^c} x^{(k)} − a^{(k)} − u^{(k)}) − λ_i/(α+β) ).
2. a^{(k+1)} ← −ReLU( u^{(k)} − W_i^{Ŝ^c} x^{(k+1)} ).
3. u^{(k+1)} ← u^{(k)} + a^{(k+1)} − W_i^{Ŝ^c} x^{(k+1)}.

Algorithm 3 The Latent-Pursuit Algorithm
Initialization: Set λ_i > 0 and ρ_i > 0.
First step: Estimate x_L, i.e. solve Equation 5 using Algorithm 5.
Middle step: For layers i = L − 1, . . . , 1, estimate x_i using Algorithm 2.
Final step: Estimate z using Algorithm 2 with the x-step replaced by Equation 10.
Debiasing (optional): Set ẑ ← argmin_z (1/2) ||y − φ(∏_{i=L}^{0} W_i^{Ŝ_{i+1}} z)||_2^2.
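A NumPy sketch of Algorithm 2's three updates, under illustrative choices (λ_i = 0, zero initializations, and α + β upper-bounded via spectral norms so that Equation 8 holds):

```python
import numpy as np

def midlayer_inversion(x_next, W, S, lam=0.0, rho=1e-2, iters=1500):
    """Sketch of Algorithm 2: recover x_i >= 0 from x_{i+1}^S = W^S x_i,
    with the ReLU constraint W^{S^c} x_i <= 0 handled via the auxiliary a."""
    WS, WSc = W[S, :], W[~S, :]
    # alpha + beta >= lambda_max(WS'WS + rho*WSc'WSc), as in Equation 8.
    ab = np.linalg.norm(WS, 2) ** 2 + rho * np.linalg.norm(WSc, 2) ** 2
    x = np.zeros(W.shape[1])
    a = np.zeros(WSc.shape[0])
    u = np.zeros(WSc.shape[0])
    for _ in range(iters):
        g = WS.T @ (WS @ x - x_next[S]) + rho * WSc.T @ (WSc @ x - a - u)
        x = np.maximum(x - (g + lam) / ab, 0.0)   # step 1: gradient + ReLU
        a = -np.maximum(u - WSc @ x, 0.0)         # step 2: project onto a <= 0
        u = u + a - WSc @ x                       # step 3: dual update
    return x

rng = np.random.default_rng(0)
n_i, n_next = 30, 100
W = rng.standard_normal((n_next, n_i)) / np.sqrt(n_i)
x_true = np.maximum(rng.standard_normal(n_i), 0.0)   # non-negative sparse layer
x_next = np.maximum(W @ x_true, 0.0)
S = x_next > 0                                       # support of the layer above
x_hat = midlayer_inversion(x_next, W, S)
print(np.linalg.norm(x_hat - x_true))
```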

6. NUMERICAL EXPERIMENTS

We demonstrate the effectiveness of our approach through numerical experiments, where our goal is twofold. First, we study random generative models and show the ability of the uniqueness claim above (Theorem 2) to predict when both gradient descent and our approach fail to invert G because the inversion is not unique. In addition, we show that in these random networks, under the conditions of Corollary 1, the latent vector is perfectly recovered by both the Layered Basis-Pursuit and the Latent-Pursuit algorithms. Our second goal is to demonstrate the advantage of Latent-Pursuit over gradient descent for trained generative models, in two settings: noiseless reconstruction and image inpainting. These results support Theorem 2, stating that to guarantee a unique solution, the hidden-layer cardinality s_1 ≈ n_1/2 should be larger than the latent-space dimension and smaller than the image-space dimension. Moreover, they support Corollary 1 by showing that under the non-zero expansion condition, both Layered Basis-Pursuit and Latent-Pursuit (Algorithms 1 and 3) recover the original latent vector perfectly.

6.1. RANDOM WEIGHTS

First, we validate the above theorems on random generative models, considering a framework similar to Huang et al. (2018) and Lei et al. (2019). Here, the generator is composed of two layers, x = G(z) = tanh(W_2 ReLU(W_1 z)), where the dimensions of the network are n = 625, n_1 varying between 50 and 1000, and n_0 ∈ {100, 200}. The weight matrices W_1 and W_2 are drawn from an i.i.d. Gaussian distribution. We generate 512 signals by feeding the generator with latent vectors drawn from a Gaussian distribution, and then test the inversion performance on these signals, in terms of SNR for all the layers, using gradient descent, Layered Basis-Pursuit (Algorithm 1), and Latent-Pursuit (Algorithm 3). For gradient descent, we run 10,000 steps and use the smallest step-size from {1e−1, 1e0, 1e1, 1e2, 1e3, 1e4} that results in a gradient norm smaller than 1e−9. For Layered Basis-Pursuit we use the best λ_1 from {1e−5, 7e−6, 3e−6, 1e−6, 0}, and for Latent-Pursuit we use λ_1 = 0, ρ = 1e−2 and γ = 0. For Layered Basis-Pursuit and Latent-Pursuit we perform a debiasing step in a similar manner to gradient descent. Figure 2 marks median results with the central line, while the ribbons show the 90%, 75%, 25%, and 10% quantiles. In these experiments the sparsity level of the hidden layer is approximately 50%, s_1 = ||x_1||_0 ≈ n_1/2, due to the weights being random. In what follows, we split the analysis of Figure 2 into three segments. Roughly, these segments are s_1 < n_0, n_0 < s_1 < n, and n < s_1, as suggested by the theoretical results given in Theorem 2 and Corollary 1. In the first segment, Figure 2 shows that all three methods fail. Indeed, as suggested by the uniqueness conditions introduced in Theorem 2, when s_1 < n_0 the inversion problem of the first layer does not have a unique global minimizer. The dashed vertical line in Figure 2 marks the spot where n_1/2 = n_0.
Interestingly, we note that the conclusions in (Huang et al., 2018; Lei et al., 2019), suggesting that large latent spaces cause gradient descent to fail, are imprecise and valid only for a fixed hidden-layer size. This can be seen by comparing n_0 = 100 to n_0 = 200. As a direct outcome of our uniqueness study, and as demonstrated in Figure 2, gradient descent (and any other algorithm) fails when the ratio between the cardinalities of the layers is smaller than 2. Nevertheless, Figure 2 exposes an advantage of our approach over gradient descent. Our methods successfully invert all the layers that follow the layer for which the sparsity assumptions do not hold, and fail only past that layer, since only then is uniqueness no longer guaranteed. Gradient descent, in contrast, starts at a random location, and therefore all the layers are poorly reconstructed. For the second segment, we recall Theorem 3 and in particular Corollary 1, where we have shown that Layered Basis-Pursuit and Latent-Pursuit are guaranteed to perfectly recover the latent vector as long as the cardinality of the midlayer satisfies n_0 ≤ s_1 ≤ 1/(3 μ(W_1)). Indeed, Figure 2 demonstrates the success of these two methods even when s_1 ≈ n_1/2 is greater than the worst-case bound 1/(3 μ(W_1)). Moreover, this figure validates that Latent-Pursuit, which leverages additional properties of the signal, outperforms Layered Basis-Pursuit, especially when s_1 is large. Importantly, while the analysis in Lei et al. (2019) suggests that n has to be larger than n_1, in practice all three methods succeed in inverting the signal even when n_1 > n. This result highlights the strength of the proposed analysis, which leans on the cardinality of the layers rather than their size. We move on to the third and final segment, where the size of the hidden layer is significantly larger than the dimension of the image.
Unfortunately, in this scenario the layer-wise methods fail, while gradient descent succeeds. Note that, in this setting, inverting the last layer solely is an ambitious (actually, impossible) task; however, since gradient descent solves an optimization problem of a much lower dimension, it succeeds in this case as well. This experiment and the accompanied analysis suggest that a hybrid approach, utilizing both gradient descent and the layered approach, might be of interest. We defer a study of such an approach for future work.
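For reference, the per-layer quality metric used in these comparisons is the SNR in dB; a minimal helper (the exact evaluation code of the paper is not shown, so this is only a sketch of the standard definition):

```python
import numpy as np

def snr_db(x_true, x_hat):
    """Reconstruction quality in dB: 20 * log10(||x|| / ||x - x_hat||)."""
    err = np.linalg.norm(x_true - x_hat)
    if err == 0:
        return np.inf                      # perfect reconstruction
    return 20.0 * np.log10(np.linalg.norm(x_true) / err)

x = np.array([1.0, -2.0, 3.0])
print(snr_db(x, x), snr_db(x, x + 0.01))
```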

6.2. TRAINED NETWORK

To demonstrate the practical contribution of our work, we experiment with a generative network trained on the MNIST dataset. Our architecture is composed of fully connected layers of sizes 20, 128, 392, and finally an image of size 28 × 28 = 784. The first two layers include batch normalization and a ReLU activation function, whereas the last one includes a piecewise linear unit (Nicolae, 2018). We train this network in an adversarial fashion using a fully connected discriminator and spectral normalization (Miyato et al., 2018). We should note that images produced by fully connected models are typically not as visually appealing as those generated by convolutional architectures. However, since the theory provided here focuses on fully connected models, this setting was chosen for the experimental section, similar to previous work (Huang et al., 2018; Lei et al., 2019) that studies the inversion process.

Network inversion:

We start with the noiseless setting and compare the Latent-Pursuit algorithm to the Oracle (which knows the exact support of each layer) and to gradient descent. To invert a signal and compute its reconstruction quality, we first invert the entire model and estimate the latent vector. Then, we feed this vector back to the model to estimate the hidden representations and the reconstructed image. For our algorithm we use ρ = 1e−2 for all layers and 10,000 iterations of debiasing. For the gradient-descent runs, we use 10,000 iterations, a momentum of 0.9, and a step size of 1e−1 that gradually decays to ensure convergence. Overall, we repeat this experiment 512 times. Figure 3a shows the reconstruction error of the latent vector. First, we observe that the performance of our inversion algorithm is on par with that of the Oracle. Moreover, not only does our approach perform much better than gradient descent, but in many experiments the latter fails utterly. In Appendix E.1 we provide the reconstruction error for all the layers, followed by image samples. A remark regarding the run-time of these algorithms is in order. Using an Nvidia 1080Ti GPU, the proposed Latent-Pursuit algorithm took approximately 15 seconds per layer to converge, for a total of approximately 75 seconds including the debiasing step, over all 512 experiments. In comparison, gradient descent took approximately 30 seconds to conclude.

Image inpainting:

We continue our experiments with image inpainting, i.e. inverting the network and reconstructing a clean signal when only some of its pixels are known. First, we apply a random mask in which 45% of the pixels are randomly concealed. Since the number of known pixels is still larger than the number of non-zero elements in the layer preceding the image, our inversion algorithm usually reconstructs the image successfully, as suggested by Figure 3b. In this experiment we perform slightly worse than the Oracle, which is not surprising considering the information disparity between the two. As for gradient descent, we see results similar to those obtained in the non-corrupted setting. Appendix E.2 provides image samples and a reconstruction comparison across all layers. Finally, we repeat the above experiment using a deterministic mask that conceals the upper ∼45% of each image (13 out of 28 rows). The results of this experiment, provided in Figure 3c and Appendix E.3, lead to similar conclusions as in the previous experiment. Indeed, since the model contains fully connected layers, we expect the last two experiments to show comparable results.
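In the inpainting setting, only the first (image-layer) step of the inversion changes: the data term is restricted to the known pixels. A hedged sketch, assuming φ has already been inverted (identity here), using ISTA for the restricted lasso, with illustrative sizes:

```python
import numpy as np

def inpaint_last_layer(y_obs, mask, W_L, lam=1e-6, steps=3000):
    """Estimate the last hidden layer from the known pixels only, by
    restricting the first lasso step of Algorithm 1 to the unmasked rows."""
    W = W_L[mask, :]                        # keep only rows of known pixels
    t = 1.0 / np.linalg.norm(W, 2) ** 2     # ISTA step size 1/||W||_2^2
    x = np.zeros(W.shape[1])
    for _ in range(steps):
        x = x - t * (W.T @ (W @ x - y_obs[mask]))
        x = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)
    return x

rng = np.random.default_rng(0)
nL, n = 40, 100
W_L = rng.standard_normal((n, nL)) / np.sqrt(nL)
x_L = np.maximum(rng.standard_normal(nL), 0.0)   # non-negative sparse layer
y = W_L @ x_L
mask = rng.random(n) > 0.45                      # ~45% of pixels concealed
x_hat = inpaint_last_layer(y, mask, W_L)
print(np.linalg.norm(x_hat - x_L))
```

As in the text, recovery succeeds here because the number of known pixels still exceeds the number of non-zero elements in the preceding layer.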

7. CONCLUSIONS

In this paper we have introduced a novel perspective on the inversion of deep generative networks and its connection to sparse representation theory. Building on this, we have proposed novel invertibility guarantees for such models, covering both random and trained networks. We have accompanied our analysis with novel pursuit algorithms for this inversion, and presented numerical experiments that validate our theoretical claims and the superiority of our approach over the more classic gradient descent. We believe that the insights underlying this work could lead to a broader activity which further improves the inversion of these models in a variety of tasks.

Theorem 4 (The Oracle). Given a noisy signal y = G(z) + e, where e ~ N(0, σ²I), and assuming known supports {S_i}_{i=1}^{L}, the recovery errors satisfy:

σ² s_i / ∏_{j=i}^{L} λ_max(W̃_j^T W̃_j) ≤ E ||x̂_i − x_i||_2² ≤ σ² s_i / ∏_{j=i}^{L} λ_min(W̃_j^T W̃_j), for i ∈ {1, ..., L},

where W̃_i is the row- and column-supported matrix W_i[S_{i+1}, S_i]. The recovery-error bounds for the latent vector are similarly given by:

σ² n_0 / ∏_{j=0}^{L} λ_max(W̃_j^T W̃_j) ≤ E ||ẑ − z||_2² ≤ σ² n_0 / ∏_{j=0}^{L} λ_min(W̃_j^T W̃_j).

This result reveals another advantage of employing the sparse-coding approach over solving least-squares problems, as the error in each layer can be proportional to s_i rather than to n_i.

Proof. Assume y = x + e with x = G(z); then the Oracle estimate for the L-th layer is x̂_{S_L} = W̃_L^† y. Since y = W̃_L x_{S_L} + e, we get that x̂_{S_L} = x_{S_L} + ẽ_L, where ẽ_L = W̃_L^† e and ẽ_L ~ N(0, σ²(W̃_L^T W̃_L)^{-1}). Therefore, using the same proof technique as in Aberdam et al. (2019), the upper bound on the recovery error in the L-th layer is:

E ||x̂_L − x_L||_2² = σ² trace((W̃_L^T W̃_L)^{-1}) ≤ σ² s_L / λ_min(W̃_L^T W̃_L).

Using the same approach, we can derive the lower bound via the largest eigenvalue of W̃_L^T W̃_L. In a similar fashion, we can write x̂_{S_i} = x_{S_i} + ẽ_i for all i ∈ {0, ..., L−1}, where ẽ_i = A_{[i,L]} e and A_{[i,L]} ≜ W̃_i^† W̃_{i+1}^† ··· W̃_L^†. Therefore, the upper bound for the recovery error in the i-th layer becomes:

E ||x̂_i − x_i||_2² = E ||A_{[i,L]} e||_2² = σ² trace(A_{[i,L]} A_{[i,L]}^T)
= σ² trace(A_{[i,L−1]} W̃_L^† (W̃_L^†)^T A_{[i,L−1]}^T)
= σ² trace(A_{[i,L−1]} (W̃_L^T W̃_L)^{-1} A_{[i,L−1]}^T)
≤ (σ² / λ_min(W̃_L^T W̃_L)) trace(A_{[i,L−1]} A_{[i,L−1]}^T)
≤ ··· ≤ (σ² / ∏_{j=i+1}^{L} λ_min(W̃_j^T W̃_j)) trace(A_{[i,i]} A_{[i,i]}^T)
= (σ² / ∏_{j=i+1}^{L} λ_min(W̃_j^T W̃_j)) trace((W̃_i^T W̃_i)^{-1})
≤ σ² s_i / ∏_{j=i}^{L} λ_min(W̃_j^T W̃_j),

and this concludes the proof.
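The trace identity at the heart of the proof, E||W̃^† e||_2² = σ² trace((W̃^T W̃)^{-1}), and the surrounding eigenvalue bounds can be checked numerically. A minimal sketch, with a Gaussian stand-in for the supported weight matrix and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
s_L, n, sigma = 5, 40, 0.1
Wt = rng.standard_normal((n, s_L))      # stand-in for the supported matrix W_L[:, S_L]
G = Wt.T @ Wt                           # Gram matrix; sigma^2 * trace(G^{-1}) is the exact MSE
trace_inv = np.trace(np.linalg.inv(G))

# Monte-Carlo estimate of E || W^dagger e ||_2^2 for e ~ N(0, sigma^2 I)
pinv = np.linalg.pinv(Wt)
trials = 20000
errs = pinv @ (sigma * rng.standard_normal((n, trials)))
mc = np.mean(np.sum(errs**2, axis=0))

# eigenvalue bounds from Theorem 4: s/lambda_max <= trace(G^{-1}) <= s/lambda_min
lam = np.linalg.eigvalsh(G)             # eigenvalues in ascending order
lower = sigma**2 * s_L / lam[-1]
upper = sigma**2 * s_L / lam[0]
```

The Monte-Carlo average should match σ²·trace((W̃^T W̃)^{-1}) up to sampling error, and that value sits between the two eigenvalue bounds.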

C THEOREM 3: PROOF

Proof. We first recall the stability guarantee of Tropp (2006) for the basis-pursuit.

Lemma 2 (Basis-Pursuit Stability, Tropp (2006)). Let x* be an unknown sparse representation with known cardinality ||x*||_0 = s, and let y = Wx* + e, where W is a matrix with unit-norm columns and ||e||_2 ≤ ε. Assume the mutual coherence of the dictionary W satisfies s < 1/(3µ(W)). Let x̂ = arg min_x ½||y − Wx||_2² + λ||x||_1, with λ = 2ε. Then x̂ is unique, the support of x̂ is a subset of the support of x*, and ||x* − x̂||_∞ < (3 + √1.5)ε.

In order to use the above lemma in our analysis, we need to modify it such that W does not need to be column-normalized, and such that the error is ℓ₂- rather than ℓ_∞-bounded. For the first modification, we decompose a general unnormalized matrix W as W̄D, where W̄ is the normalized matrix, w̄_i = w_i/||w_i||_2, and D is a diagonal matrix with d_i = ||w_i||_2. Using the above lemma we get that ||D(x* − x̂)||_∞ < (3 + √1.5)ε. Thus, the error in x is bounded by

||x − x̂||_∞ < (3 + √1.5)ε / min_i ||w_i||_2.

Since Lemma 2 guarantees that the support of x̂ is a subset of the support of x*, we can conclude that

||x − x̂||_2 < (3 + √1.5)ε√s / min_i ||w_i||_2.

Under the conditions of Theorem 3, we can use the above conclusion to guarantee that estimating x_L from the noisy input y using basis-pursuit must lead to a unique x̂_L such that its support is a subset of that of x_L. Also,

||x_L − x̂_L||_2 < ε_L = (3 + √1.5) ε_{L+1} √s_L / min_j ||w_{L,j}||_2,

where w_{L,j} is the j-th column of W_L, and ε_{L+1} denotes the noise level of φ^{-1}(y), since applying φ^{-1} can increase the noise energy by a bounded factor.

Algorithm 5 Latent Pursuit: Last-Layer Inversion
Input: y ∈ R^n, K ∈ N, λ_L ≥ 0, µ ∈ (0, 2/ℓ), where φ(·) is ℓ-smooth and strictly monotonically increasing.
Initialization: u^(0) ← 0, x_L^(0) ← 0, t^(0) ← 1.
General step: for k = 0, 1, ..., K execute the following:
1. g ← W_L^T [φ′(W_L x_L^(k)) ⊙ (φ(W_L x_L^(k)) − y)]
2. u^(k+1) ← ReLU(x_L^(k) − µ(g + λ_L 1))
3. t^(k+1) ← (1 + √(1 + 4(t^(k))²)) / 2
4. x_L^(k+1) ← u^(k+1) + ((t^(k) − 1)/t^(k+1)) (u^(k+1) − u^(k))
Return: x_L^(K)
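Algorithm 5 can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: φ is taken as the identity (as assumed elsewhere in the paper), so the φ′ factor drops out, and all dimensions and regularization values are hypothetical:

```python
import numpy as np

def latent_pursuit_last_layer(y, W, lam, mu, K):
    """FISTA-style sketch of Algorithm 5 with phi = identity (so the phi' factor is 1)."""
    x = np.zeros(W.shape[1])   # extrapolated point x_L^(k)
    u = np.zeros(W.shape[1])   # proximal iterate u^(k)
    t = 1.0
    for _ in range(K):
        g = W.T @ (W @ x - y)                       # gradient of the data term
        u_next = np.maximum(x - mu * (g + lam), 0)  # ReLU step: prox of lam*1^T x with x >= 0
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2    # FISTA momentum sequence
        x = u_next + ((t - 1) / t_next) * (u_next - u)
        u, t = u_next, t_next
    return u                                        # the last (feasible) proximal iterate

rng = np.random.default_rng(2)
n, m, s = 100, 20, 4
W = rng.standard_normal((n, m))
x_true = np.zeros(m)
x_true[rng.choice(m, s, replace=False)] = rng.uniform(1, 2, s)  # nonnegative sparse code
y = W @ x_true

step = 1.0 / np.linalg.eigvalsh(W.T @ W)[-1]        # 1/L, the usual FISTA step size
x_hat = latent_pursuit_last_layer(y, W, lam=1e-3, mu=step, K=5000)
```

With a small λ_L and a consistent, over-determined system, the iterate approaches the true nonnegative sparse code and its support.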
Moving on to the estimation of the previous layer, we have that x̂_L^{Ŝ_L} = W_{L−1}^{Ŝ_L} x_{L−1} + e_L, where ||e_L||_2 ≤ ε_L. By the assumptions of Theorem 3, the mutual-coherence condition holds, and therefore we get that the support of x̂_{L−1} is a subset of the support of x_{L−1}, that x̂_{L−1} is unique, and that

||x_{L−1} − x̂_{L−1}||_2 < ε_{L−1} = (3 + √1.5) ε_L √s_{L−1} / min_j ||w_{L−1,j}^{Ŝ_L}||_2.

Using the same proof technique for all the hidden layers results in

||x_i − x̂_i||_2 < ε_i = (3 + √1.5) ε_{i+1} √s_i / min_j ||w_{i,j}^{Ŝ_{i+1}}||_2,

for all i ∈ {1, ..., L−1}, where w_{i,j}^{Ŝ_{i+1}} is the j-th column of W_i^{Ŝ_{i+1}}. Finally, we have that x̂_1^{Ŝ_1} = W_0^{Ŝ_1} z + e_1, where ||e_1||_2 ≤ ε_1. Therefore, if ϕ = λ_min((W_0^{Ŝ_1})^T W_0^{Ŝ_1}) > 0 and

ẑ = arg min_z ½||x̂_1^{Ŝ_1} − W_0^{Ŝ_1} z||_2²,

then

||ẑ − z||_2² = ||(W_0^{Ŝ_1})^† e_1||_2² ≤ (1/ϕ) ε_1²,

which concludes the Theorem 3 guarantees.
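The recursive error bounds above can be tabulated mechanically. The helper below is hypothetical (not an algorithm from the paper); it simply propagates ε_i = (3 + √1.5) ε_{i+1} √s_i / min_j ||w_{i,j}||_2 from the top layer down, with Gaussian stand-ins for the supported weight matrices:

```python
import numpy as np

C = 3 + np.sqrt(1.5)   # stability constant from Lemma 2

def cascade_bounds(row_supported_weights, cardinalities, eps_in):
    """Propagate eps_i = C * eps_{i+1} * sqrt(s_i) / min_j ||w_j||_2,
    going from the last layer down to the first."""
    bounds = []
    eps = eps_in
    for W, s in zip(row_supported_weights, cardinalities):
        col_norms = np.linalg.norm(W, axis=0)       # ||w_j||_2 per column
        eps = C * eps * np.sqrt(s) / col_norms.min()
        bounds.append(eps)
    return bounds

rng = np.random.default_rng(6)
Ws = [rng.standard_normal((20, 10)), rng.standard_normal((10, 6))]  # supported W_L, W_{L-1}
eps_list = cascade_bounds(Ws, [10, 6], eps_in=0.1)
```

Note how the bound typically grows as it propagates toward the latent space, mirroring the ε_{i} < ε_{i-1} ordering of the proof read in reverse.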

D DETAILS ON THE LATENT-PURSUIT ALGORITHM

Here we provide additional details on the Latent-Pursuit algorithm described in Section 5. In order to estimate the last layer, we aim to solve

x̂_L = arg min_x ½||y − φ(W_L x)||_2² + λ_L 1^T x, s.t. x ≥ 0.

For this goal we make use of the FISTA algorithm (Beck & Teboulle, 2009), as described in Algorithm 5. As described in Section 5, for estimating the middle layers we aim to solve:

x̂_i = arg min_x ½||x̂^{Ŝ_{i+1}} − W_i^{Ŝ} x||_2² + λ_i 1^T x, s.t. x ≥ 0, W_i^{Ŝ^c} x ≤ 0.

Using the auxiliary variable a = W_i^{S^c} x and the positive semidefinite matrix Q = (αI − (W_i^S)^T W_i^S) + (βI − ρ_i (W_i^{S^c})^T W_i^{S^c}), the Linearized-ADMM aims to solve:

min_{x,a} ½||x̂^{S_{i+1}} − W_i^S x||_2² + λ_i 1^T x + (ρ_i/2)||a − W_i^{S^c} x + u||_2² + ½||x − x^(k)||_Q², s.t. x ≥ 0, a ≤ 0,   (28)

where u is the scaled dual variable. This leads to an algorithm that alternates through the following steps:

x^(k+1) ← arg min_{x ≥ 0} (α/2)||x − (x^(k) − (1/α)(W_i^S)^T(W_i^S x^(k) − x̂^{S_{i+1}}))||_2² + λ_i 1^T x + (β/2)||x − (x^(k) − (ρ_i/β)(W_i^{S^c})^T(W_i^{S^c} x^(k) − a^(k) − u^(k)))||_2²,   (29)

a^(k+1) ← arg min_{a ≤ 0} ||a − W_i^{S^c} x^(k+1) + u^(k)||_2²,   (30)

u^(k+1) ← u^(k) + a^(k+1) − W_i^{S^c} x^(k+1).

Thus, the Linearized-ADMM algorithm, described in Algorithm 2, is guaranteed to converge to the optimal solution of Equation 7. After recovering all the hidden layers, we aim to estimate the latent vector z. For this inversion step we adopt a MAP estimator, as described in Section 5:

ẑ = arg min_z ½||x̂^{S_1} − W_0^{S_1} z||_2² + (γ/2)||z||_2², s.t. W_0^{S^c} z ≤ 0,

with γ > 0. In fact, this problem can be solved by a Linearized-ADMM algorithm similar to the one described above, except for the update of x (Equation 29), which becomes:

z^(k+1) ← arg min_z (α/2)||z − (z^(k) − (1/α)(W_0^S)^T(W_0^S z^(k) − x̂^{S_1}))||_2² + (β/2)||z − (z^(k) − (ρ_1/β)(W_0^{S^c})^T(W_0^{S^c} z^(k) − a^(k) − u^(k)))||_2² + (γ/2)||z||_2².   (33)

Equivalently, for the latent vector z, the first step of Algorithm 2 is changed to:

z^(k+1) ← (1/(α + β + γ)) [(α + β) z^(k) − (W_0^S)^T(W_0^S z^(k) − x̂^{S_1}) − ρ_1 (W_0^{S^c})^T(W_0^{S^c} z^(k) − a^(k) − u^(k))].
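The latent-vector step can be sketched numerically. This is a hedged reconstruction: the a-update below is the standard ADMM projection onto {a ≤ 0} (a detail the text does not spell out), the dimensions are hypothetical, and the instance is built so that the inequality constraint is strictly inactive at the solution, which makes the answer checkable against z_true:

```python
import numpy as np

rng = np.random.default_rng(3)
n0, s1, c1 = 6, 12, 18
z_true = rng.standard_normal(n0)
WS = rng.standard_normal((s1, n0))           # rows of W_0 on the support S_1
raw = rng.standard_normal((c1, n0))          # rows on the complement S_1^c,
WSc = raw * np.where(raw @ z_true < 0, 1.0, -1.0)[:, None]  # signed so WSc @ z_true < 0
x1 = WS @ z_true                             # noiseless supported activations

gamma, rho = 1e-6, 1.0
alpha = np.linalg.eigvalsh(WS.T @ WS)[-1]    # makes alpha*I - WS^T WS PSD
beta = rho * np.linalg.eigvalsh(WSc.T @ WSc)[-1]

z = np.zeros(n0)
a = np.zeros(c1)
u = np.zeros(c1)
for _ in range(5000):
    # closed-form z-update (the modified first step of Algorithm 2)
    z = ((alpha + beta) * z - WS.T @ (WS @ z - x1)
         - rho * WSc.T @ (WSc @ z - a - u)) / (alpha + beta + gamma)
    a = np.minimum(0.0, WSc @ z - u)         # assumed projection step enforcing a <= 0
    u = u + a - WSc @ z                      # scaled dual update
```

Since the constraint is strictly inactive here and γ is tiny, the iterates approach z_true while remaining feasible.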

E INVERSION RESULTS FOR TRAINED NETWORKS

Here we provide detailed results for the various inversion experiments described in Section 6.2.

E.1 CLEAN IMAGES

Figure 4 demonstrates the reconstruction error for all the layers when inverting clean images. In Figures 5 and 6 we demonstrate success and failure cases of the gradient-descent algorithm and compare them to our approach.

E.2 RANDOM MASK INPAINTING

Figures 7-9 demonstrate the performance of our approach compared to gradient descent, in terms of SNR and image quality, for the randomly-generated mask experiment.

E.3 NON-RANDOM MASK INPAINTING

Figures 10-12 demonstrate the performance of our approach compared to gradient descent, in terms of SNR and image quality, for the non-random mask experiment.



Note that after training, batch-normalization is a simple linear operation. For simplicity we assume here that φ is the identity function.



Figure 2: Gaussian i.i.d. weights: recovery errors as a function of the hidden-layer size (n_1), where the image-space dimension is 625. Subfigures (a)-(c) correspond to z ∈ R^100 and (d)-(f) to z ∈ R^200. These results support Theorem 2, stating that to guarantee a unique solution, the hidden-layer cardinality s_1 ≈ n_1/2 should be larger than the latent-space dimension and smaller than the image-space dimension. Moreover, they support Corollary 1 by showing that under the non-zero expansion condition, both Layered Basis-Pursuit and Latent-Pursuit (Algorithms 1 and 3) recover the original latent vector perfectly.

Figure 3: Trained three-layer model: reconstruction error of the latent vector z over 512 experiments. Latent-Pursuit clearly outperforms gradient descent.

Algorithm 4 The Layer-Wise Oracle
Input: y = G(z) + e ∈ R^n, and the supports of each layer {S_i}_{i=1}^{L}.
First step: x̂_L = arg min_x ½||φ^{-1}(y) − W̃_L x||_2², where W̃_L is the column-supported matrix W_L[:, S_L].
Intermediate steps: For each layer i = L−1, ..., 1, set x̂_i = arg min_x ½||x̂_{i+1}^{S_{i+1}} − W̃_i x||_2², where W̃_i is the row- and column-supported matrix W_i[S_{i+1}, S_i].
Final step: Set ẑ = arg min_z ½||x̂_1^{S_1} − W_0^{S_1} z||_2².
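A minimal sketch of this layer-wise Oracle for a two-layer generator, assuming φ is the identity and no noise (all dimensions hypothetical); each step is a least-squares solve on the known support:

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1, n = 5, 30, 60
W0 = rng.standard_normal((n1, n0))
W1 = rng.standard_normal((n, n1))

z = rng.standard_normal(n0)
x1 = np.maximum(W0 @ z, 0)          # ReLU hidden activations (about half are zero)
y = W1 @ x1                         # generated signal, phi = identity, no noise
S1 = np.flatnonzero(x1)             # the support assumed known by the Oracle

# First step: least squares against the column-supported matrix W1[:, S1]
x1_hat = np.linalg.lstsq(W1[:, S1], y, rcond=None)[0]
# Final step: recover z from the row-supported matrix W0[S1, :]
z_hat = np.linalg.lstsq(W0[S1, :], x1_hat, rcond=None)[0]
```

In the noiseless case both systems are consistent and over-determined, so recovery is exact up to floating-point error; with noise, the bounds of Theorem 4 apply instead.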


A THEOREM 1: PROOF

Proof. The main idea of the proof is to show that, under the conditions of Theorem 1, the inversion task at every layer i ∈ {1, . . . , L + 1} has a unique global minimum. For this goal we utilize the well-known uniqueness guarantee from sparse representation theory.

Lemma 1 (Sparse Representation Uniqueness Guarantee, Donoho & Elad (2003); Elad (2010)). If a system of linear equations y = Wx has a solution x satisfying ||x||_0 < spark(W)/2, then this solution is necessarily the sparsest possible.

Using the above lemma, we can conclude that if x_L obeys ||x_L||_0 = s_L < spark(W_L)/2, then x_L is the unique vector with at most s_L nonzeros satisfying the equation φ^{-1}(y) = W_L x_L.

Moving on to the previous layer, we can employ the above lemma again for the supported vector x_L^{S_L}. This way, we can ensure that x_{L−1} is the unique s_{L−1}-sparse solution of x_L^{S_L} = W_{L−1}^{S_L} x_{L−1} as long as s_{L−1} < spark(W_{L−1}^{S_L})/2. However, the condition s_{L−1} = ||x_{L−1}||_0 < sub-spark(W_{L−1}, s_L)/2 implies that the above necessarily holds. In this way we can ensure that each layer i, i ∈ {1, . . . , L−1}, has a unique sparse solution.

Finally, in order to invert the first layer we need to solve x_1^{S_1} = W_0^{S_1} z. If W_0^{S_1} has full column rank, this system either has no solution or a unique one. In our case, we do know that a solution exists, and thus, necessarily, it is unique. A necessary but insufficient condition for this to be true is s_1 ≥ n_0. The additional requirement sub-rank(W_0, s_1) = n_0 (≤ s_1) is sufficient for z to be the unique solution, and this concludes the proof.
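Spark is combinatorial to compute, but the standard sparse-representation bound spark(W) ≥ 1 + 1/µ(W) (a classical result, not specific to this paper) turns Lemma 1 into a tractable, if conservative, test. A minimal sketch with hypothetical dimensions:

```python
import numpy as np

def mutual_coherence(W):
    """Largest absolute inner product between distinct normalized columns of W."""
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    G = np.abs(Wn.T @ Wn)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(5)
W = rng.standard_normal((64, 128))
mu = mutual_coherence(W)
spark_lb = 1 + 1 / mu        # classical bound: spark(W) >= 1 + 1/mu(W)
max_card = spark_lb / 2      # Lemma 1 then certifies uniqueness for ||x||_0 < max_card
```

Any cardinality below max_card is certified unique, although the true spark-based threshold can be far larger.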

B THE ORACLE ESTIMATOR

The motivation for studying the recovery ability of the Oracle is that it can reveal the power of utilizing the inherent sparsity of the feature maps. Therefore, we analyze the layer-wise Oracle estimator described in Algorithm 4, which follows the same layer-by-layer fashion we adopt in both the Layered Basis-Pursuit (Algorithm 1) and the Latent-Pursuit (Algorithm 3). In this analysis we assume that the contaminating noise is additive white Gaussian.

The noisy signal y carries additive noise with energy proportional to its dimension, σ²n. Theorem 4 suggests that the Oracle can attenuate this noise by a factor of n_0/n, which is typically much smaller than 1. Moreover, the error in each layer is proportional to its cardinality, σ²s_i. These results are expected, as the Oracle simply projects the noisy signal onto low-dimensional subspaces of known dimension.

