IMPLICIT NORMALIZING FLOWS

Abstract

Normalizing flows define a probability distribution by an explicit invertible transformation $z = f(x)$. In this work, we present implicit normalizing flows (ImpFlows), which generalize normalizing flows by allowing the mapping to be implicitly defined by the roots of an equation $F(z, x) = 0$. ImpFlows build on residual flows (ResFlows) with a proper balance between expressiveness and tractability. Through theoretical analysis, we show that the function space of ImpFlows is strictly richer than that of ResFlows. Furthermore, for any ResFlow with a fixed number of blocks, there exists some function that the ResFlow can only approximate with non-negligible error, yet that is exactly representable by a single-block ImpFlow. We propose a scalable algorithm to train and draw samples from ImpFlows. Empirically, we evaluate ImpFlows on several classification and density modeling tasks, and ImpFlows outperform ResFlows with a comparable amount of parameters on all the benchmarks.

1. INTRODUCTION

Normalizing flows (NFs) (Rezende & Mohamed, 2015; Dinh et al., 2014) are promising methods for density modeling. NFs define a model distribution $p_x(x)$ by specifying an invertible transformation $f(x)$ from $x$ to another random variable $z$. By the change-of-variable formula, the model density is
$$\ln p_x(x) = \ln p_z(f(x)) + \ln |\det(J_f(x))|, \quad (1)$$
where $p_z(z)$ follows a simple distribution, such as a Gaussian. NFs are particularly attractive due to their tractability, i.e., the model density $p_x(x)$ can be directly evaluated as in Eqn. (1). To achieve such tractability, NF models should satisfy two requirements: (i) the mapping between $x$ and $z$ is invertible; (ii) the log-determinant of the Jacobian $J_f(x)$ is tractable. Searching for rich model families that satisfy these tractability constraints is crucial for the advance of normalizing flow research. For the second requirement, earlier works such as inverse autoregressive flow (Kingma et al., 2016) and RealNVP (Dinh et al., 2017) restrict the model family to those with triangular Jacobian matrices. More recently, free-form Jacobian approaches have emerged, such as Residual Flows (ResFlows) (Behrmann et al., 2019; Chen et al., 2019). They relax the triangular Jacobian constraint by utilizing a stochastic estimator of the log-determinant, enriching the model family. However, the Lipschitz constant of each transformation block is constrained for invertibility. In general, this is not preferable, because mapping a simple prior distribution to a potentially complex data distribution may require a transformation with a very large Lipschitz constant (see Fig. 3 for a 2D example). Moreover, all the aforementioned methods assume that there exists an explicit forward mapping $z = f(x)$. Bijections with an explicit forward mapping cover only a fraction of the broad class of invertible functions suggested by the first requirement, which may limit the model capacity.
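The change-of-variable formula above can be sanity-checked in one dimension. The following is an illustrative sketch (not the paper's model), assuming the simple affine flow $f(x) = (x - \mu)/\sigma$ with a standard Gaussian prior, in which case the formula must recover the closed-form $\mathcal{N}(\mu, \sigma^2)$ log-density:

```python
import math

# Change-of-variable check in 1D: with the explicit flow
# f(x) = (x - mu) / sigma and a standard Gaussian prior p_z,
# ln p_x(x) = ln p_z(f(x)) + ln |det J_f(x)| must equal the
# closed-form N(mu, sigma^2) log-density.

def log_pz(z):
    # standard Gaussian log-density
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def log_px(x, mu, sigma):
    # change-of-variable formula; here J_f(x) = 1/sigma
    z = (x - mu) / sigma
    return log_pz(z) + math.log(1.0 / sigma)

def log_gauss(x, mu, sigma):
    # closed-form N(mu, sigma^2) log-density, for reference
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))

assert abs(log_px(1.3, 0.5, 2.0) - log_gauss(1.3, 0.5, 2.0)) < 1e-12
```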
In this paper, we propose implicit flows (ImpFlows) to generalize NFs, allowing the transformation to be implicitly defined by an equation $F(z, x) = 0$. Given $x$ (or $z$), the other variable can be computed by an implicit root-finding procedure $z = \mathrm{RootFind}(F(\cdot, x))$. An explicit mapping $z = f(x)$ used in prior NFs can be viewed as a special case of ImpFlows in the form $F(z, x) = f(x) - z = 0$. To balance expressiveness and tractability, we present a specific form of ImpFlows, where each block is the composition of a ResFlow block and the inverse of another ResFlow block. We theoretically study the model capacity of ResFlows and ImpFlows in the function space. We show that the function family of single-block ImpFlows is strictly richer than that of two-block ResFlows by relaxing the Lipschitz constraints. Furthermore, for any ResFlow with a fixed number of blocks, there exists some invertible function that the ResFlow can only approximate with non-negligible error, but that an ImpFlow can model exactly. On the practical side, we develop a scalable algorithm to estimate the probability density and its gradients, and to draw samples from ImpFlows. The algorithm leverages the implicit differentiation formula. Despite being more powerful, the gradient computation of ImpFlows is largely similar to that of ResFlows, except for some additional overhead on root finding. We test the effectiveness of ImpFlows on several classification and generative modeling tasks. ImpFlows outperform ResFlows on all the benchmarks, with comparable model sizes and computational cost.

2. RELATED WORK

Expressive Normalizing Flows Many works focus on improving the capacity of NFs. For example, Dinh et al. (2014; 2017) design coupling layers with triangular Jacobians, while Behrmann et al. (2019); Chen et al. (2019) propose NFs with free-form Jacobians, which approximate the log-determinant with stochastic estimators. In parallel with architecture design, Chen et al. (2020); Huang et al. (2020); Cornish et al. (2020); Nielsen et al. (2020) improve the capacity of NFs by operating in a higher-dimensional space. As mentioned in the introduction, all these existing works adopt explicit forward mappings, which form only a subset of the broad class of invertible functions. In contrast, the implicit function family we consider is richer. While we primarily discuss the implicit generalization of ResFlows (Chen et al., 2019) in this paper, the general idea of utilizing implicit invertible functions could potentially be applied to other models as well. Finally, Zhang et al. (2020) formally prove that the model capacity of ResFlows is restricted by the dimension of the residual blocks. In comparison, we study another limitation of ResFlows in terms of the bounded Lipschitz constant, and compare the function families of ResFlows and ImpFlows of comparable depth.

Continuous Time Flows (CTFs) (Chen et al., 2018b; Grathwohl et al., 2019; Chen et al., 2018a) are a flexible alternative to discrete-time flows for generative modeling. They typically treat the invertible transformation as a dynamical system, which is approximately simulated by ordinary differential equation (ODE) solvers. In contrast, the implicit function family considered in this paper does not involve differential equations, and only requires fixed-point solvers. Moreover, the theoretical guarantees are different. While CTFs typically study the universal approximation capacity in the continuous-time case (i.e., the "infinite depth" limit), we consider the model capacity of ImpFlows and ResFlows under a finite number of transformation steps.
Finally, while CTFs are flexible, their learning is challenging due to instability (Liu et al., 2020; Massaroli et al., 2020) and exceedingly many ODE solver steps (Finlay et al., 2020).

Implicit Deep Learning Implicit functions have also been used to define network layers, e.g., in deep equilibrium models (Bai et al., 2019), and Sitzmann et al. (2020) incorporate periodic functions for representation learning. Different from these works, which consider implicit functions as a replacement for feed-forward networks, we develop invertible implicit functions for normalizing flows, discuss the conditions for the existence of such functions, and theoretically study the model capacity of our proposed ImpFlows in the function space.

3. IMPLICIT NORMALIZING FLOWS

We now present implicit normalizing flows, by starting with a brief overview of existing work.

3.1. NORMALIZING FLOWS

As shown in Eqn. (1), a normalizing flow $f: x \to z$ is an invertible function that defines a probability distribution via the change-of-variable formula. The modeling capacity of normalizing flows depends on the expressiveness of the invertible function $f$. Residual flows (ResFlows) (Chen et al., 2019; Behrmann et al., 2019) are a particularly powerful class of NFs due to their free-form Jacobians. ResFlows use $f = f_L \circ \cdots \circ f_1$ to construct the invertible mapping, where each layer $f_l$ is an invertible residual network whose Lipschitz constant is bounded by a fixed constant $\kappa$:
$$f_l(x) = x + g_l(x), \quad \mathrm{Lip}(g_l) \le \kappa < 1, \quad (2)$$
where $\mathrm{Lip}(g)$ is the Lipschitz constant of a function $g$ (see Sec. 4.1 for details). Despite their free-form Jacobians, the model capacity of ResFlows is still limited by the Lipschitz constant of the invertible function. The Lipschitz constant of each ResFlow block $f_l$ cannot exceed 2 (Behrmann et al., 2019), so the Lipschitz constant of an $L$-block ResFlow cannot exceed $2^L$. However, to transform a simple prior distribution into a potentially complex data distribution, the Lipschitz constant of the transformation may in general need to be very large. Therefore, ResFlows can be undesirably deep simply to meet the Lipschitz constraints (see Fig. 3 for a 2D example). Below, we present implicit flows (ImpFlows) to relax the Lipschitz constraints.
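The $2^L$ bound above can be illustrated numerically. The following 1D sketch (with toy blocks $f_l(x) = x + \kappa \sin(x)$, not the paper's networks) estimates the Lipschitz constant of a composed ResFlow by finite differences and checks it against $(1 + \kappa)^L < 2^L$:

```python
import math

# 1D illustration of the Lipschitz bound: each block f_l(x) = x + kappa*sin(x)
# has Lip(g_l) = kappa < 1, so an L-block ResFlow is at most
# (1 + kappa)^L < 2^L Lipschitz, no matter how it is trained.
kappa, L = 0.9, 4

def resflow(x):
    # f = f_L o ... o f_1 with identical blocks, for simplicity
    for _ in range(L):
        x = x + kappa * math.sin(x)
    return x

# Estimate the Lipschitz constant by finite-difference slopes on a grid;
# by the mean value theorem, each slope is bounded by sup |f'|.
xs = [0.01 * i for i in range(-1000, 1001)]
slopes = [abs(resflow(b) - resflow(a)) / (b - a) for a, b in zip(xs, xs[1:])]

assert max(slopes) <= (1 + kappa) ** L < 2 ** L
```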

3.2. MODEL SPECIFICATION

In general, an implicit flow (ImpFlow) is defined as an invertible mapping between random variables $x$ and $z$ of dimension $d$ by finding the roots of $F(z, x) = 0$, where $F$ is a function from $\mathbb{R}^{2d}$ to $\mathbb{R}^d$. In particular, the explicit mappings $z = f(x)$ used in prior flow instances (Chen et al., 2019; Kingma & Dhariwal, 2018) can be expressed as implicit functions of the form $F(z, x) = f(x) - z = 0$. While ImpFlows are a powerful family to explore, in general they are not guaranteed to satisfy the invertibility and the tractability of the log-determinant required by NFs. In this paper, we focus on the following specific form, which achieves a good balance between expressiveness and tractability, and leave other possibilities for future studies.

Definition 1. Let $g_z: \mathbb{R}^d \to \mathbb{R}^d$ and $g_x: \mathbb{R}^d \to \mathbb{R}^d$ be two functions such that $\mathrm{Lip}(g_x) < 1$ and $\mathrm{Lip}(g_z) < 1$, where $\mathrm{Lip}(g)$ is the Lipschitz constant of a function $g$. A specific form of ImpFlows is defined by $F(z, x) = 0$, where
$$F(z, x) = g_x(x) - g_z(z) + x - z. \quad (3)$$

The root pairs of Eqn. (3) form a subset of $\mathbb{R}^d \times \mathbb{R}^d$, which in fact defines the assignment rule of a unique invertible function $f$. To see this, for any $x_0$, according to Definition 1, we can construct a contraction $h_{x_0}(z) = F(z, x_0) + z$ with a unique fixed point, which corresponds to a unique root (w.r.t. $z$) of $F(z, x_0) = 0$, denoted by $f(x_0)$. Similarly, in the reverse process, given a $z_0$, the root (w.r.t. $x$) of $F(z_0, x) = 0$ also exists and is unique, denoted by $f^{-1}(z_0)$. These two properties are sufficient to ensure the existence and the invertibility of $f$, as summarized in the following theorem.

Theorem 1. Eqn. (3) defines a unique mapping $f: \mathbb{R}^d \to \mathbb{R}^d$, $z = f(x)$, and $f$ is invertible.

See the proof in Appendix A.1. Theorem 1 characterizes the validity of the ImpFlows introduced in Definition 1. In fact, a single ImpFlow is a stack of a single ResFlow and the inverse of another single ResFlow, which will be formally stated in Sec. 4.
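The fixed-point construction above can be sketched in 1D with illustrative contractions (toy functions, not the paper's networks): iterating $h(z) = F(z, x_0) + z$ gives the forward map, and exchanging the roles of the two variables gives the inverse, so invertibility holds by construction.

```python
import math

# 1D sketch of Definition 1 and Theorem 1:
# F(z, x) = gx(x) - gz(z) + x - z with Lip(gx), Lip(gz) < 1.
gx = lambda x: 0.8 * math.tanh(x)   # Lip(gx) = 0.8 < 1
gz = lambda z: 0.5 * math.sin(z)    # Lip(gz) = 0.5 < 1

def f(x0, iters=200):
    # forward map: unique fixed point of h(z) = gx(x0) + x0 - gz(z)
    z = x0
    for _ in range(iters):
        z = gx(x0) + x0 - gz(z)
    return z

def f_inv(z0, iters=200):
    # inverse map: unique fixed point of x -> gz(z0) + z0 - gx(x)
    x = z0
    for _ in range(iters):
        x = gz(z0) + z0 - gx(x)
    return x

x0 = 1.7
assert abs(f_inv(f(x0)) - x0) < 1e-9   # round trip recovers x0
```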
We will investigate the expressiveness of the function family of ImpFlows in Sec. 4, and present a scalable algorithm to learn a deep generative model built upon ImpFlows in Sec. 5.

4. EXPRESSIVE POWER

We first present some preliminaries on Lipschitz continuous functions in Sec. 4.1 and then formally study the expressive power of ImpFlows, especially in comparison to ResFlows. In particular, we prove that the function space of ImpFlows is strictly richer than that of ResFlows in Sec. 4.2 (see an illustration in Fig. 1 (a)). Furthermore, for any ResFlow with a fixed number of blocks, there exists some function that the ResFlow can only approximate with non-negligible error, yet that is exactly representable by a single-block ImpFlow. These results are illustrated in Fig. 1 (b).

4.1. LIPSCHITZ CONTINUOUS FUNCTIONS

For any differentiable function $f: \mathbb{R}^d \to \mathbb{R}^d$ and any $x \in \mathbb{R}^d$, we denote the Jacobian matrix of $f$ at $x$ by $J_f(x) \in \mathbb{R}^{d \times d}$.

Definition 2. A function $f: \mathbb{R}^d \to \mathbb{R}^d$ is called Lipschitz continuous if there exists a constant $L$ s.t.
$$\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d.$$
The smallest $L$ satisfying the inequality is called the Lipschitz constant of $f$, denoted by $\mathrm{Lip}(f)$. In general, the value of $\mathrm{Lip}(f)$ depends on the choice of the norm $\|\cdot\|$; we use the $L_2$-norm by default in this paper for simplicity.

Definition 3. A function $f: \mathbb{R}^d \to \mathbb{R}^d$ is called bi-Lipschitz continuous if it is Lipschitz continuous and has an inverse mapping $f^{-1}$ which is also Lipschitz continuous.

It is useful to consider an equivalent characterization of the Lipschitz constant in our following analysis.

Proposition 1. (Rademacher's theorem (Federer, 1969, Theorem 3.1.6)) If $f: \mathbb{R}^d \to \mathbb{R}^d$ is Lipschitz continuous, then $f$ is differentiable almost everywhere, and
$$\mathrm{Lip}(f) = \sup_{x \in \mathbb{R}^d} \|J_f(x)\|_2,$$
where $\|M\|_2 = \sup_{\{v: \|v\|_2 = 1\}} \|Mv\|_2$ is the operator norm of the matrix $M \in \mathbb{R}^{d \times d}$.

4.2. COMPARISON TO TWO-BLOCK RESFLOWS

We formally compare the expressive power of a single-block ImpFlow and a two-block ResFlow. We highlight the structure of the theoretical results of this subsection in Fig. 1 (a) and present a 1D motivating example in Fig. 2. All the proofs can be found in Appendix A.

On the one hand, according to the definition of ResFlows, the function family of single-block ResFlows is
$$\mathcal{R} := \{f : f = g + \mathrm{Id},\ g \in C^1(\mathbb{R}^d, \mathbb{R}^d),\ \mathrm{Lip}(g) < 1\}, \quad (4)$$
where $C^1(\mathbb{R}^d, \mathbb{R}^d)$ consists of all functions from $\mathbb{R}^d$ to $\mathbb{R}^d$ with continuous derivatives and $\mathrm{Id}$ denotes the identity map. Besides, the function family of $\ell$-block ResFlows is defined by composition:
$$\mathcal{R}_\ell := \{f : f = f_\ell \circ \cdots \circ f_1 \text{ for some } f_1, \cdots, f_\ell \in \mathcal{R}\}. \quad (5)$$
By the definitions in Eqn. (4) and Eqn. (5), $\mathcal{R}_1 = \mathcal{R}$. On the other hand, according to the definition of the ImpFlow in Eqn. (3), we can obtain
$$(g_x + \mathrm{Id})(x) = g_x(x) + x = g_z(z) + z = (g_z + \mathrm{Id})(z). \quad (6)$$
Equivalently, we have $z = \left((g_z + \mathrm{Id})^{-1} \circ (g_x + \mathrm{Id})\right)(x)$, where $\circ$ denotes the composition of functions, which implies that the function family of single-block ImpFlows is
$$\mathcal{I} = \{f : f = f_2^{-1} \circ f_1 \text{ for some } f_1, f_2 \in \mathcal{R}\}. \quad (7)$$
Intuitively, a single-block ImpFlow can be interpreted as the composition of a ResFlow block and the inverse of another ResFlow block, which may not have an explicit form (see Fig. 2 (c) and (d) for a 1D example). Therefore, it is natural to investigate the relationship between $\mathcal{I}$ and $\mathcal{R}_2$. Before that, we first introduce a family of "monotonically increasing functions" that does not have an explicit Lipschitz constraint, and show that it is strictly larger than $\mathcal{R}$.

Lemma 1.
$$\mathcal{R} \subsetneq \mathcal{F} := \{f \in \mathcal{D} : \inf_{x \in \mathbb{R}^d,\, v \in \mathbb{R}^d,\, \|v\|_2 = 1} v^T J_f(x) v > 0\}, \quad (8)$$
where $\mathcal{D}$ is the set of all bi-Lipschitz $C^1$-diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$, and $A \subsetneq B$ means $A$ is a proper subset of $B$. Note that it follows from Behrmann et al. (2019, Lemma 2) that all functions in $\mathcal{R}$ are bi-Lipschitz, so $\mathcal{R} \subset \mathcal{D}$.
In the 1D case, we have
$$\mathcal{R} = \{f \in C^1(\mathbb{R}) : \inf_{x \in \mathbb{R}} f'(x) > 0,\ \sup_{x \in \mathbb{R}} f'(x) < 2\}, \quad \mathcal{F} = \{f \in C^1(\mathbb{R}) : \inf_{x \in \mathbb{R}} f'(x) > 0\}.$$
In the high-dimensional case, $\mathcal{R}$ and $\mathcal{F}$ are hard to illustrate. Nevertheless, the Lipschitz constants of the functions in $\mathcal{R}$ are less than 2 (Behrmann et al., 2019), while those of the functions in $\mathcal{F}$ can be arbitrarily large. Based on Lemma 1, we prove that the function family of ImpFlows $\mathcal{I}$ consists of the compositions of two functions in $\mathcal{F}$, and is therefore strictly larger than $\mathcal{R}_2$, as summarized in the following theorem.

Theorem 2. (Equivalent form of the function family of a single-block ImpFlow)
$$\mathcal{I} = \mathcal{F}^2 := \{f : f = f_2 \circ f_1 \text{ for some } f_1, f_2 \in \mathcal{F}\}.$$

Note that the identity mapping $\mathrm{Id} \in \mathcal{F}$, so it is easy to see that $\mathcal{F} \subset \mathcal{I}$. Thus, the Lipschitz constant of a single ImpFlow (and of its inverse) can be arbitrarily large. Because $\mathcal{R} \subsetneq \mathcal{F}$ and there exist functions in $\mathcal{I} \setminus \mathcal{R}_2$ (see a constructed example in Sec. 4.3), we obtain the following corollary.

Corollary 1. $\mathcal{R} \subsetneq \mathcal{R}_2 \subsetneq \mathcal{F}^2 = \mathcal{I}$.

The results of the 1D example in Fig. 2 (b) and (c) accord with Corollary 1. Besides, Corollary 1 can be generalized to the case of $2\ell$-block ResFlows and $\ell$-block ImpFlows, which strongly motivates the usage of implicit layers in normalizing flows.
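The 1D characterizations above admit a simple numerical membership check. The following illustrative sketch (with a toy linear map, not from the paper) verifies that $f(x) = 5x$ is monotone, hence in $\mathcal{F} \subset \mathcal{I}$ (e.g., via $f = \mathrm{Id}^{-1} \circ f$), but violates the slope bound of $\mathcal{R}$; since any factorization into two $\mathcal{R}$-blocks multiplies slopes that are each below 2, it also lies outside $\mathcal{R}_2$, in line with Corollary 1.

```python
# 1D membership check: f is in R iff 0 < f' < 2 everywhere,
# and in F iff f' > 0 everywhere.
def deriv(f, x, h=1e-6):
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: 5.0 * x
slopes = [deriv(f, 0.1 * i) for i in range(-50, 51)]

in_F = min(slopes) > 0                        # monotone increasing
in_R = min(slopes) > 0 and max(slopes) < 2    # single ResFlow block
assert in_F and not in_R
# A two-block factorization would need two slopes, each < 2, whose
# product is 5 > 4 -- impossible, so f is not in R_2 either.
```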

4.3. COMPARISON WITH MULTI-BLOCK RESFLOWS

We further investigate the relationship between $\mathcal{R}_\ell$ for $\ell > 2$ and $\mathcal{I}$, as illustrated in Fig. 1 (b). For a fixed $\ell$, the Lipschitz constants of functions in $\mathcal{R}_\ell$ are still bounded, and there are infinitely many functions that are not in $\mathcal{R}_\ell$ but in $\mathcal{I}$. We construct one such function family: for any $L, r \in \mathbb{R}^+$, define
$$\mathcal{P}(L, r) = \{f : f \in \mathcal{F},\ \exists\, B_r \subset \mathbb{R}^d,\ \forall x, y \in B_r,\ \|f(x) - f(y)\|_2 \ge L \|x - y\|_2\}, \quad (9)$$
where $B_r$ is a $d$-dimensional ball of radius $r$. Obviously, $\mathcal{P}(L, r)$ is an infinite set. Below, we show that for all $0 < \ell < \log_2(L)$, $\mathcal{R}_\ell$ has a non-negligible approximation error for functions in $\mathcal{P}(L, r)$, while such functions are exactly representable by functions in $\mathcal{I}$.

Theorem 3. Given $L > 0$ and $r > 0$, we have
• $\mathcal{P}(L, r) \subset \mathcal{I}$.
• $\forall\, 0 < \ell < \log_2(L)$, $\mathcal{P}(L, r) \cap \mathcal{R}_\ell = \emptyset$. Moreover, for any $f \in \mathcal{P}(L, r)$ with $d$-dimensional ball $B_r$, the minimal error of fitting $f$ on $B_r$ by functions in $\mathcal{R}_\ell$ satisfies
$$\inf_{g \in \mathcal{R}_\ell} \sup_{x \in B_r} \|f(x) - g(x)\|_2 \ge \frac{r}{2} (L - 2^\ell). \quad (10)$$

It follows from Theorem 3 that to model some $f \in \mathcal{P}(L, r)$, we need only a single-block ImpFlow, but at least a $\log_2(L)$-block ResFlow. In Fig. 2 (b), we show a 1D case where a 3-block ResFlow cannot fit a function that is exactly representable by a single-block ImpFlow. In addition, we also prove some other properties of ImpFlows, in particular $\mathcal{R}_3 \subset \mathcal{I}$. We formally present the results in Appendix B.
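The lower bound of Theorem 3 can be illustrated numerically in 1D. The following sketch restricts the approximator $g$ to affine maps for simplicity (an illustration, not a proof): $f(x) = Lx$ on $B_r = [-r, r]$ lies in $\mathcal{P}(L, r)$, and any $g$ with slope below $2^\ell$ must miss $f$ in sup-norm by at least $\frac{r}{2}(L - 2^\ell)$.

```python
# Numeric sketch of Eqn. (10) in 1D with affine approximators.
L_const, l_blocks, r = 10.0, 2, 1.0
f = lambda x: L_const * x          # f is in P(L, r) on [-r, r]

def best_affine_fit_error(slope):
    # sup-norm error of the best g(x) = slope*x + c on [-r, r],
    # minimized over a grid of intercepts c
    xs = [r * i / 100.0 for i in range(-100, 101)]
    cs = [0.01 * j for j in range(-300, 301)]
    return min(max(abs(f(x) - (slope * x + c)) for x in xs) for c in cs)

bound = 0.5 * r * (L_const - 2 ** l_blocks)   # the Theorem 3 lower bound
# even the largest admissible slope (just below 2^l = 4) cannot beat it
assert best_affine_fit_error(3.999) >= bound
```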

5. GENERATIVE MODELING WITH IMPFLOWS

ImpFlows can be parameterized by neural networks and stacked to form a deep generative model that models high-dimensional data distributions. We develop a scalable algorithm to perform inference, sampling and learning in such models. For simplicity, we focus on a single block during the derivation. Formally, a parametric ImpFlow block $z = f(x; \theta)$ is defined by $F(z, x; \theta) = 0$, where
$$F(z, x; \theta) = g_x(x; \theta) - g_z(z; \theta) + x - z, \quad \mathrm{Lip}(g_x) < 1,\ \mathrm{Lip}(g_z) < 1. \quad (11)$$
Here $\theta$ denotes all the parameters in $g_x$ and $g_z$ (which does NOT mean that $g_x$ and $g_z$ share parameters). Note that $x$ refers to the input of the layer, not the input data.

The inference process of computing $z$ given $x$ in a single ImpFlow block amounts to finding the root of $F(z, x; \theta) = 0$ w.r.t. $z$, which cannot be computed explicitly because of the implicit formulation. Instead, we adopt a quasi-Newton method (i.e., Broyden's method (Broyden, 1965)) to solve this problem iteratively:
$$z^{[i+1]} = z^{[i]} - \alpha B F(z^{[i]}, x; \theta), \quad i = 0, 1, \cdots, \quad (12)$$
where $B$ is a low-rank approximation of the inverse Jacobian and $\alpha$ is the step size, which is computed dynamically by line search. The stopping criterion is $\|F(z^{[i]}, x; \theta)\|_2 < \varepsilon_f$, where $\varepsilon_f$ is a hyperparameter that trades off computation time against precision. As Theorem 1 guarantees the existence and uniqueness of the root, the convergence of Broyden's method is also guaranteed, and is typically faster than linear.

Another inference problem is to estimate the log-likelihood. Assume $z \sim p(z)$, where $p(z)$ is a simple prior distribution (e.g., standard Gaussian). The log-likelihood of $x$ can be written as
$$\ln p(x) = \ln p(z) + \ln \det(I + J_{g_x}(x)) - \ln \det(I + J_{g_z}(z)), \quad (13)$$
where $J_f(x)$ denotes the Jacobian matrix of a function $f$ at $x$. See Appendix A.4 for the detailed derivation. Exact calculation of the log-determinant terms requires $O(d^3)$ time and is hard to scale up to high-dimensional data.
Instead, we propose the following unbiased estimator of $\ln p(x)$, using the same technique as Chen et al. (2019) together with the Skilling-Hutchinson trace estimator (Skilling, 1989; Hutchinson, 1989):
$$\ln p(x) = \ln p(z) + \mathbb{E}_{n \sim p(N),\, v \sim \mathcal{N}(0, I)} \left[ \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} \frac{v^T [J_{g_x}(x)^k] v - v^T [J_{g_z}(z)^k] v}{P(N \ge k)} \right], \quad (14)$$
where $p(N)$ is a distribution supported over the positive integers. The sampling process of computing $x$ given $z$ can also be solved by Broyden's method, with the hyperparameters shared with the inference process.

In the learning process, we perform stochastic gradient descent to minimize the negative log-likelihood of the data, denoted by $L$. For efficiency, we estimate the gradients w.r.t. the model parameters in the backpropagation manner. According to the chain rule and the additivity of the log-determinant, in each layer we need to estimate the gradients w.r.t. $x$ and $\theta$ of Eqn. (13). In particular, the gradient computation involves two kinds of terms: one is $\frac{\partial}{\partial (\cdot)} \ln \det(I + J_g(x; \theta))$ and the other is $\frac{\partial L}{\partial z} \frac{\partial z}{\partial (\cdot)}$, where $g$ is a function satisfying $\mathrm{Lip}(g) < 1$ and $(\cdot)$ denotes $x$ or $\theta$. On the one hand, for the log-determinant term, we can use the same technique as Chen et al. (2019) and obtain an unbiased gradient estimator:
$$\frac{\partial \ln \det(I + J_g(x; \theta))}{\partial (\cdot)} = \mathbb{E}_{n \sim p(N),\, v \sim \mathcal{N}(0, I)} \left[ \sum_{k=0}^{n} \frac{(-1)^k}{P(N \ge k)} v^T J_g(x; \theta)^k \frac{\partial J_g(x; \theta)}{\partial (\cdot)} v \right], \quad (15)$$
where $p(N)$ is a distribution supported over the positive integers. On the other hand, $\frac{\partial L}{\partial z} \frac{\partial z}{\partial (\cdot)}$ can be computed according to the implicit function theorem as follows (see details in Appendix A.5):
$$\frac{\partial L}{\partial z} \frac{\partial z}{\partial (\cdot)} = \frac{\partial L}{\partial z} J_G^{-1}(z) \frac{\partial F(z, x; \theta)}{\partial (\cdot)}, \quad (16)$$
where $G(z; \theta) = g_z(z; \theta) + z$. In comparison with directly differentiating through the quasi-Newton iterations of the forward pass, the implicit gradient above is simple and memory-efficient, treating the root solver as a black box. Following Bai et al. (2019), we compute $\frac{\partial L}{\partial z} J_G^{-1}(z)$ by solving a linear system iteratively, as detailed in Appendix C.1.
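The log-determinant terms above rest on the power series $\ln \det(I + A) = \sum_{k \ge 1} (-1)^{k+1} \mathrm{tr}(A^k)/k$ for $\|A\|_2 < 1$, which is what the Skilling-Hutchinson plus Russian-roulette estimator samples. The following deterministic sketch checks the series itself with exact traces and a fixed truncation (the paper's estimator instead replaces $\mathrm{tr}(A^k)$ by $v^T A^k v$ and the truncation by an unbiased reweighting):

```python
import numpy as np

# Verify ln det(I + A) = sum_{k>=1} (-1)^{k+1} tr(A^k) / k for ||A||_2 < 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A *= 0.5 / np.linalg.norm(A, 2)          # enforce spectral norm 0.5 < 1

exact = np.linalg.slogdet(np.eye(4) + A)[1]

series, Ak = 0.0, np.eye(4)
for k in range(1, 61):                    # truncated power series
    Ak = Ak @ A
    series += (-1) ** (k + 1) * np.trace(Ak) / k

assert abs(series - exact) < 1e-9
```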
The training algorithm is formally presented in Appendix C.4.
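The root-finding step of the inference process can be sketched in 1D. Below, a secant-style rank-1 quasi-Newton update stands in for the multidimensional Broyden's method, and $g_x$, $g_z$ are toy contractions rather than the paper's networks; the stopping criterion $\|F\| < \varepsilon_f$ matches the text above.

```python
import math

# Solve F(z, x) = gx(x) - gz(z) + x - z = 0 for z given x.
gx = lambda x: 0.9 * math.tanh(x)
gz = lambda z: 0.5 * math.sin(z)
F = lambda z, x: gx(x) - gz(z) + x - z

def solve_z(x, eps_f=1e-9, max_iter=100):
    z0, z1 = x, x + 0.1          # two initial iterates
    F0, F1 = F(z0, x), F(z1, x)
    for _ in range(max_iter):
        if abs(F1) < eps_f or F1 == F0:
            break
        # secant update: (z1 - z0)/(F1 - F0) approximates the inverse Jacobian
        z0, z1 = z1, z1 - F1 * (z1 - z0) / (F1 - F0)
        F0, F1 = F1, F(z1, x)
    return z1

z = solve_z(2.0)
assert abs(F(z, 2.0)) < 1e-9     # stop criterion reached
```

Since Theorem 1 guarantees a unique root, the iteration has a well-defined target; in the 1D case here, the residual is strictly monotone in $z$, so the secant iterates converge quickly.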

6. EXPERIMENTS

We demonstrate the model capacity of ImpFlows on classification and density modeling tasks. In all experiments, we use spectral normalization (Miyato et al., 2018) to enforce the Lipschitz constraints, where the Lipschitz upper bound of each layer (called the Lipschitz coefficient) is denoted by $c$. For Broyden's method, we use $\varepsilon_f = 10^{-6}$ and $\varepsilon_b = 10^{-10}$ for training and testing, to numerically ensure the invertibility and the stability during training. Please see Appendix D for other detailed settings, including the method of estimating the log-determinant, the network architecture, the learning rate, the batch size, and so on.
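Spectral normalization can be sketched with a few lines of power iteration. This is an illustrative matrix version (the actual implementation normalizes the network's weights during training): rescale a weight matrix so its spectral norm is at most the Lipschitz coefficient $c < 1$.

```python
import numpy as np

def spectral_normalize(W, c=0.9, n_iters=500):
    """Rescale W so that its spectral norm is at most c."""
    u = np.ones(W.shape[0])
    for _ in range(n_iters):             # power iteration for sigma_max(W)
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                    # estimate of the largest singular value
    return W * min(1.0, c / sigma)       # shrink only, never inflate

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
W_sn = spectral_normalize(W, c=0.9)
assert np.linalg.norm(W_sn, 2) <= 0.9 + 1e-4
```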

6.1. VERIFYING CAPACITY ON CLASSIFICATION

We first empirically compare ResFlows and ImpFlows on classification tasks. Compared with generative modeling, classification is a more direct measure of the richness of a functional family, because it isolates function fitting from generative modeling subtleties, such as log-determinant estimation. We train both models in the same settings on CIFAR10 and CIFAR100 (Krizhevsky & Hinton, 2009). Specifically, we use an architecture similar to ResNet-18 (He et al., 2016). Overall, the amounts of parameters of ResNet-18 with vanilla ResBlocks, with ResFlows, and with ImpFlows are all the same, 6.5M. The detailed network structure can be found in Appendix D. The classification results are shown in Table 1. To see the impact of the Lipschitz constraints, we vary the Lipschitz coefficient $c$ to show the difference between ResFlows and ImpFlows under a fixed Lipschitz upper bound. Given different values of $c$, the classification results of ImpFlows are consistently better than those of ResFlows. These results empirically validate Corollary 1, which claims that the function family of a single-block ImpFlow is strictly richer than that of a two-block ResFlow.

Published as a conference paper at ICLR 2021

6.2. DENSITY MODELING ON 2D TOY DATA

For the density modeling tasks, we first evaluate ImpFlows on the Checkerboard data, whose density is multi-modal, as shown in Fig. 3 (a). For fairness, we follow the same experimental settings as Chen et al. (2019) (which are specified in Appendix D), except that we adopt the Sine (Sitzmann et al., 2020) activation function for all models. We note that the data distribution has bounded support, while we want to fit a transformation $f$ mapping it to the standard Gaussian distribution, whose support is unbounded. A perfect $f$ requires a sufficiently large $\|J_f(x)\|_2$ for some $x$ mapped far from the mean of the Gaussian. Therefore, the Lipschitz constant of such an $f$ is too large for it to be fitted by a ResFlow with 8 blocks (see Fig. 3 (b)). A 4-block ImpFlow achieves a result of 5.05 bits, which outperforms the 5.08 bits of an 8-block ResFlow with the same number of parameters. These results accord with our theoretical results in Theorem 2 and strongly motivate ImpFlows.

6.3. DENSITY MODELING ON REAL DATA

We also train ImpFlows on some real density modeling datasets, including the tabular datasets used by Papamakarios et al. (2017), CIFAR10, and 5-bit 64 × 64 CelebA (Kingma & Dhariwal, 2018). For all the real datasets, we use the scalable algorithm proposed in Sec. 5. We test our performance on five tabular datasets: POWER (d = 6), GAS (d = 8), HEPMASS (d = 21), MINIBOONE (d = 43) and BSDS300 (d = 63) from the UCI repository (Dua & Graff, 2017), where d is the data dimension. For a fair comparison, on each dataset we use a 10-block ResFlow and a 5-block ImpFlow with the same amount of parameters, as well as a 20-block ImpFlow for a better result. The detailed network architecture and hyperparameters can be found in Appendix D. Table 2 shows the average test log-likelihood for ResFlows and ImpFlows. ImpFlows achieve better density estimation performance than ResFlows consistently on all datasets. Again, the results demonstrate the effectiveness of ImpFlows.

We then test ImpFlows on the CIFAR10 dataset. We train a multi-scale convolutional version of both ImpFlows and ResFlows, following the same settings as Chen et al. (2019), except that we use a smaller network of 5.5M parameters for both models (see details in Appendix D). As shown in Table 3, ImpFlow achieves better results than ResFlow consistently given different values of the Lipschitz coefficient c. Moreover, the computation time of ImpFlow is comparable to that of ResFlow; see Appendix C.2 for detailed results. Besides, there is a trade-off between the expressiveness and the numerical optimization of ImpFlows in larger models. Based on the above experiments, we believe that advances such as a lower-variance estimator of the log-determinant can benefit ImpFlows in larger models, which we leave for future work. We also train ImpFlows on 5-bit 64 × 64 CelebA. For a fair comparison, we use the same settings as Chen et al. (2019). Samples from our model are shown in Appendix E.

7. CONCLUSIONS

We propose implicit normalizing flows (ImpFlows), which generalize normalizing flows by utilizing an implicit invertible mapping defined by the roots of the equation $F(z, x) = 0$. ImpFlows build on Residual Flows (ResFlows) with a good balance between tractability and expressiveness. We show that the functional family of ImpFlows is richer than that of ResFlows, particularly for modeling functions with large Lipschitz constants. Based on the implicit differentiation formula, we present a scalable algorithm to train and evaluate ImpFlows. Empirically, ImpFlows outperform ResFlows on several classification and density modeling benchmarks. Finally, while this paper mostly focuses on the implicit generalization of ResFlows, the general idea of utilizing implicit functions for NFs could be extended to a wider scope. We leave this for future work.

A ADDITIONAL LEMMAS AND PROOFS

A.1 PROOF FOR THEOREM 1

Proof. (Theorem 1)

First, for any $x_0 \in \mathbb{R}^d$, the mapping $h_{x_0}(z) = F(z, x_0) + z$ is a contraction, which follows from the Lipschitz condition on $g_z$:
$$\|(F(z_1, x_0) + z_1) - (F(z_2, x_0) + z_2)\| = \|g_z(z_2) - g_z(z_1)\| \le \mathrm{Lip}(g_z) \|z_1 - z_2\| < \|z_1 - z_2\|.$$
Therefore, $h_{x_0}(z)$ has a unique fixed point, denoted by $f(x_0)$:
$$h_{x_0}(f(x_0)) = f(x_0) \iff F(f(x_0), x_0) = 0.$$
Similarly, for any $z_0 \in \mathbb{R}^d$, there exists a unique $g(z_0)$ satisfying $F(z_0, g(z_0)) = 0$. Moreover, letting $z_0 = f(x_0)$, we have $F(f(x_0), g(f(x_0))) = 0$. By uniqueness, $g(f(x_0)) = x_0$ for all $x_0 \in \mathbb{R}^d$, and similarly $f(g(x_0)) = x_0$ for all $x_0 \in \mathbb{R}^d$. Therefore, $f$ is unique and invertible.

A.2 PROOF FOR THEOREM 2

We denote by $\mathcal{D}$ the set of all bi-Lipschitz $C^1$-diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$. First, we prove Lemma 1 in the main text.

Proof. (Lemma 1) For all $f \in \mathcal{R}$, we have $\sup_{x \in \mathbb{R}^d} \|J_f(x) - I\|_2^2 < 1$, which is equivalent to
$$\sup_{x,\, v,\, \|v\|_2 = 1} \|(J_f(x) - I) v\|_2^2 < 1 \quad \text{(definition of the operator norm)}$$
$$\iff \sup_{x,\, v,\, \|v\|_2 = 1} v^T (J_f^T(x) - I)(J_f(x) - I) v < 1 \iff \sup_{x,\, v,\, \|v\|_2 = 1} \left( v^T J_f^T(x) J_f(x) v - 2 v^T J_f(x) v \right) < 0.$$
Note that $J_f(x)$ is nonsingular, so for all $x, v \in \mathbb{R}^d$ with $\|v\|_2 = 1$ we have $v^T J_f^T(x) J_f(x) v > 0$. Thus,
$$0 > \sup_{x,\, v,\, \|v\|_2 = 1} \left( v^T J_f^T(x) J_f(x) v - 2 v^T J_f(x) v \right) \ge \sup_{x,\, v,\, \|v\|_2 = 1} \left( -2 v^T J_f(x) v \right),$$
so $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0$. Note that the converse is not true, because $v^T J_f(x) v > 0$ does not restrict the upper bound of the Lipschitz constant of $f$. For example, when $f(x) = m x$ with $m$ a positive real number, we have
$$\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v = \inf_{x,\, v,\, \|v\|_2 = 1} v^T (m I) v = m > 0,$$
where $m$ can be arbitrarily large. So $\mathcal{R} \subsetneq \mathcal{F}$.

Lemma 2. For all $f \in \mathcal{D}$, if $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0$, then $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f^{-1}}(x) v > 0$.

Proof. (Lemma 2) By the inverse function theorem, $J_{f^{-1}}(x) = J_f^{-1}(f^{-1}(x))$. Because $f$ is a bijection from $\mathbb{R}^d$ to $\mathbb{R}^d$, we have
$$\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f^{-1}}(x) v = \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f^{-1}(f^{-1}(x)) v = \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f^{-1}(x) v.$$
Let $u = J_f^{-1}(x) v$ and $v_0 = u / \|u\|_2$, so that $\|v_0\|_2 = 1$ and
$$v^T J_f^{-1}(x) v = u^T J_f^T(x) u = u^T J_f(x) u = \|u\|_2^2\, v_0^T J_f(x) v_0.$$
The second equality uses the fact that for a real $d \times d$ matrix $A$ and any $x \in \mathbb{R}^d$, $x^T A x = (x^T A x)^T = x^T A^T x$, because $x^T A x \in \mathbb{R}$. Note that $f$ is Lipschitz continuous, so $\|J_f(x)\|_2 \le \mathrm{Lip}(f)$. Hence $1 = \|v\|_2 \le \|J_f(x)\|_2 \|u\|_2 \le \mathrm{Lip}(f) \|u\|_2$, which means $\|u\|_2 \ge \frac{1}{\mathrm{Lip}(f)}$.
Thus,
$$\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f^{-1}(x) v = \inf_{x,\, v,\, \|v\|_2 = 1} \|u\|_2^2\, v_0^T J_f(x) v_0 \ge \frac{1}{\mathrm{Lip}(f)^2} \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0.$$

Lemma 3. For all $f \in \mathcal{D}$, if $f^{-1} \in \mathcal{R}$, then $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0$.

Proof. (Lemma 3) For $f \in \mathcal{D}$ with $f^{-1} \in \mathcal{R}$, Lemma 1 gives $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f^{-1}}(x) v > 0$. Since $f^{-1} \in \mathcal{D}$, Lemma 2 yields $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0$.

Lemma 4. For all $f \in \mathcal{D}$, if $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v > 0$, then there exists $\alpha_0 > 0$ such that for all $0 < \alpha < \alpha_0$, $\sup_{x \in \mathbb{R}^d} \|\alpha J_f(x) - I\|_2 < 1$.

Proof. (Lemma 4) Note that $f$ is Lipschitz continuous, so $\mathrm{Lip}(f) = \sup_{x} \|J_f(x)\|_2$. Denote $\beta = \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v$ and let $\alpha_0 = \frac{\beta}{\mathrm{Lip}(f)^2} > 0$. For all $0 < \alpha < \alpha_0$, using $v^T J_f^T(x) v = v^T J_f(x) v$ (for a real matrix $A$, $x^T A x = (x^T A x)^T = x^T A^T x$ since $x^T A x \in \mathbb{R}$), we have
$$\sup_x \|\alpha J_f(x) - I\|_2^2 = \sup_{x,\, v,\, \|v\|_2 = 1} v^T (\alpha J_f^T(x) - I)(\alpha J_f(x) - I) v = 1 + \sup_{x,\, v,\, \|v\|_2 = 1} \left( \alpha^2 v^T J_f^T(x) J_f(x) v - 2 \alpha v^T J_f(x) v \right)$$
$$\le 1 + \alpha^2 \sup_{x,\, v,\, \|v\|_2 = 1} v^T J_f^T(x) J_f(x) v - 2 \alpha \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_f(x) v = 1 + \alpha^2 \sup_x \|J_f(x)\|_2^2 - 2 \alpha \beta$$
$$= 1 + \alpha (\alpha \mathrm{Lip}(f)^2 - 2 \beta) < 1 + \alpha (\alpha_0 \mathrm{Lip}(f)^2 - 2 \beta) = 1 - \alpha \beta < 1.$$

Proof. (Theorem 2) Denote
$$\mathcal{P} = \{f \in \mathcal{D} \mid \exists f_1, f_2 \in \mathcal{D},\ f = f_2 \circ f_1, \text{ where } \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_1}(x) v > 0 \text{ and } \inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_2}(x) v > 0\}.$$
First, we show that $\mathcal{I} \subset \mathcal{P}$. For any $f \in \mathcal{I}$, write $f = f_2 \circ f_1$ with $f_1 \in \mathcal{R}$ and $f_2^{-1} \in \mathcal{R}$. By Lemma 1 and Lemma 3, we have $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_1}(x) v > 0$ and $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_2}(x) v > 0$. Thus $f \in \mathcal{P}$, so $\mathcal{I} \subset \mathcal{P}$.

Next, we show that $\mathcal{P} \subset \mathcal{I}$. For any $f \in \mathcal{P}$, write $f = f_2 \circ f_1$, where $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_1}(x) v > 0$ and $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_2}(x) v > 0$.
From Lemma 2, we have $\inf_{x,\, v,\, \|v\|_2 = 1} v^T J_{f_2^{-1}}(x) v > 0$. By Lemma 4, there exist $\alpha_1 > 0$ and $\alpha_2 > 0$ such that for all $0 < \alpha < \min\{\alpha_1, \alpha_2\}$,
$$\sup_x \|\alpha J_{f_1}(x) - I\|_2 < 1, \quad \sup_x \|\alpha J_{f_2^{-1}}(x) - I\|_2 < 1.$$
Let $\alpha = \frac{1}{2} \min\{\alpha_1, \alpha_2\}$ and let $g = g_2 \circ g_1$, where $g_1(x) = \alpha f_1(x)$ and $g_2(x) = f_2(\frac{x}{\alpha})$. We have $g(x) = f_2(\frac{\alpha f_1(x)}{\alpha}) = f(x)$, and
$$J_{g_1}(x) = \alpha J_{f_1}(x), \quad g_2^{-1}(x) = \alpha f_2^{-1}(x), \quad J_{g_2^{-1}}(x) = \alpha J_{f_2^{-1}}(x).$$
So we have
$$\sup_x \|J_{g_1}(x) - I\|_2 = \sup_x \|\alpha J_{f_1}(x) - I\|_2 < 1, \quad \sup_x \|J_{g_2^{-1}}(x) - I\|_2 = \sup_x \|\alpha J_{f_2^{-1}}(x) - I\|_2 < 1.$$
Thus $g_1 \in \mathcal{R}$, $g_2^{-1} \in \mathcal{R}$ and $f = g_2 \circ g_1$, so $f \in \mathcal{I}$. Therefore $\mathcal{P} \subset \mathcal{I}$. In conclusion, $\mathcal{I} = \mathcal{P}$.

A.3 PROOF FOR THEOREM 3

Firstly, we prove a lemma about bi-Lipschitz continuous functions.

Lemma 5. If $f: (\mathbb{R}^d, \|\cdot\|) \to (\mathbb{R}^d, \|\cdot\|)$ is bi-Lipschitz continuous, then

$$\frac{1}{\mathrm{Lip}(f^{-1})} \le \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|} \le \mathrm{Lip}(f), \qquad \forall x_1, x_2 \in \mathbb{R}^d,\ x_1 \ne x_2.$$

Proof. (Proof of Lemma 5). $\forall x_1, x_2 \in \mathbb{R}^d$ with $x_1 \ne x_2$, we have

$$\|f(x_1) - f(x_2)\| \le \mathrm{Lip}(f)\,\|x_1 - x_2\|,$$
$$\|x_1 - x_2\| = \|f^{-1}(f(x_1)) - f^{-1}(f(x_2))\| \le \mathrm{Lip}(f^{-1})\,\|f(x_1) - f(x_2)\|.$$

Combining the two inequalities gives the result.

Assume a residual flow $f = f_L \circ \cdots \circ f_1$, where each layer $f_l$ is an invertible residual network: $f_l(x) = x + g_l(x)$ with $\mathrm{Lip}(g_l) \le \kappa < 1$. Each layer $f_l$ is bi-Lipschitz, and it follows from Behrmann et al. (2019) and Lemma 5 that

$$1 - \kappa \le \frac{\|f_l(x_1) - f_l(x_2)\|}{\|x_1 - x_2\|} \le 1 + \kappa < 2, \qquad \forall x_1, x_2 \in \mathbb{R}^d,\ x_1 \ne x_2. \tag{20}$$

By multiplying these inequalities over all layers, we obtain a bi-Lipschitz bound for ResFlows, as shown in Lemma 6.

Lemma 6. For ResFlows built as $f = f_L \circ \cdots \circ f_1$, where $f_l(x) = x + g_l(x)$ and $\mathrm{Lip}(g_l) \le \kappa < 1$, we have

$$(1-\kappa)^L \le \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|} \le (1+\kappa)^L, \qquad \forall x_1, x_2 \in \mathbb{R}^d,\ x_1 \ne x_2.$$

Next, we prove Theorem 3.

Proof. (Theorem 3). According to the definition of $\mathcal{P}(L, r)$, we have $\mathcal{P}(L, r) \subset \mathcal{F} \subset \mathcal{I}$. $\forall\, 0 < \ell < \log_2(L)$, we have $L - 2^\ell > 0$. $\forall g \in \mathcal{R}^\ell$, by Lemma 6, we have

$$\|g(x) - g(y)\|_2 \le 2^\ell\,\|x - y\|_2, \qquad \forall x, y \in B_r.$$

Thus, $\forall x_0 \in B_r$, we have

$$\begin{aligned}
\|f(x) - g(x)\|_2 &= \|f(x) - f(x_0) + g(x_0) - g(x) + f(x_0) - g(x_0)\|_2 \\
&\ge \|f(x) - f(x_0)\|_2 - \|g(x_0) - g(x)\|_2 - \|f(x_0) - g(x_0)\|_2 \\
&\ge (L - 2^\ell)\,\|x - x_0\|_2 - \|f(x_0) - g(x_0)\|_2.
\end{aligned}$$

So

$$\sup_{x\in B_r} \|f(x) - g(x)\|_2 \ge \sup_{x\in B_r} (L - 2^\ell)\,\|x - x_0\|_2 - \|f(x_0) - g(x_0)\|_2 \ge (L - 2^\ell)\,r - \|f(x_0) - g(x_0)\|_2.$$

Notice that the inequality above holds for any $x_0 \in B_r$, so

$$\sup_{x\in B_r} \|f(x) - g(x)\|_2 \ge \sup_{x_0\in B_r}\Big[(L - 2^\ell)\,r - \|f(x_0) - g(x_0)\|_2\Big] = (L - 2^\ell)\,r - \inf_{x_0\in B_r} \|f(x_0) - g(x_0)\|_2 \ge (L - 2^\ell)\,r - \sup_{x_0\in B_r} \|f(x_0) - g(x_0)\|_2.$$

Therefore,

$$\sup_{x\in B_r} \|f(x) - g(x)\|_2 \ge \frac{r}{2}\,(L - 2^\ell), \qquad \forall g \in \mathcal{R}^\ell,$$

and hence

$$\inf_{g\in\mathcal{R}^\ell}\ \sup_{x\in B_r} \|f(x) - g(x)\|_2 \ge \frac{r}{2}\,(L - 2^\ell).$$

Because $\forall f \in \mathcal{P}(L, r)$, $\inf_{g\in\mathcal{R}^\ell} \sup_{x\in B_r} \|f(x) - g(x)\|_2 > 0$, we have $\mathcal{R}^\ell \cap \mathcal{P}(L, r) = \emptyset$.

A.4 PROOF FOR EQUATION 13

Proof. (Equation 13). By the change-of-variables formula,

$$\log p(x) = \log p(z) + \log\left|\det\frac{\partial z}{\partial x}\right|.$$

Since $z = f(x)$ is defined by the equation $F(z, x) = g_x(x) - g_z(z) + x - z = 0$, the implicit function theorem gives

$$\frac{\partial z}{\partial x} = J_f(x) = -[J_{F,z}(z)]^{-1}[J_{F,x}(x)] = (I + J_{g_z}(z))^{-1}(I + J_{g_x}(x)).$$

Thus,

$$\log\left|\det\frac{\partial z}{\partial x}\right| = \ln|\det(I + J_{g_x}(x))| - \ln|\det(I + J_{g_z}(z))|.$$

Note that any eigenvalue $\lambda$ of $J_{g_x}(x)$ satisfies $|\lambda| \le \sigma(J_{g_x}(x)) = \|J_{g_x}(x)\|_2 < 1$, so every real eigenvalue lies in $(-1, 1)$ and complex eigenvalues occur in conjugate pairs with positive product $|1+\lambda|^2$. Thus, $\det(I + J_{g_x}(x)) > 0$, and similarly $\det(I + J_{g_z}(z)) > 0$. Therefore,

$$\log\left|\det\frac{\partial z}{\partial x}\right| = \ln\det(I + J_{g_x}(x)) - \ln\det(I + J_{g_z}(z)).$$

A.5 PROOF FOR EQUATION 16

Proof. (Equation 16). Implicitly differentiating both sides of $F(z, x; \theta) = 0$ with respect to $x$ gives

$$\frac{\partial g_x(x;\theta)}{\partial x} - \frac{\partial g_z(z;\theta)}{\partial z}\frac{\partial z}{\partial x} + I - \frac{\partial z}{\partial x} = 0,$$

so we have

$$\frac{\partial z}{\partial x} = \left(I + \frac{\partial g_z(z;\theta)}{\partial z}\right)^{-1}\left(I + \frac{\partial g_x(x;\theta)}{\partial x}\right) = J_G^{-1}(z)\,\frac{\partial F(z, x;\theta)}{\partial x}.$$

Implicitly differentiating both sides of $F(z, x;\theta) = 0$ with respect to $\theta$ gives

$$\frac{\partial g_x(x;\theta)}{\partial\theta} - \frac{\partial g_z(z;\theta)}{\partial\theta} - \frac{\partial g_z(z;\theta)}{\partial z}\frac{\partial z}{\partial\theta} - \frac{\partial z}{\partial\theta} = 0,$$

so we have

$$\frac{\partial z}{\partial\theta} = \left(I + \frac{\partial g_z(z;\theta)}{\partial z}\right)^{-1}\left(\frac{\partial g_x(x;\theta)}{\partial\theta} - \frac{\partial g_z(z;\theta)}{\partial\theta}\right) = J_G^{-1}(z)\,\frac{\partial F(z, x;\theta)}{\partial\theta}.$$

Therefore, the gradient from $z$ to $(\cdot)$ is

$$\frac{\partial\mathcal{L}}{\partial z}\frac{\partial z}{\partial(\cdot)} = \frac{\partial\mathcal{L}}{\partial z}\,J_G^{-1}(z)\,\frac{\partial F(z, x;\theta)}{\partial(\cdot)}.$$
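As a sanity check of the implicit gradient, in the scalar case the formula reduces to $\frac{dz}{dx} = \frac{1 + g_x'(x)}{1 + g_z'(z)}$. The sketch below (ours, with toy contractive branches rather than the paper's networks) solves $F(z, x) = 0$ by fixed-point iteration and compares the implicit gradient against a central finite difference:

```python
import math

# Toy residual branches (our choices), both 0.5-Lipschitz, so the
# contractivity requirement Lip(g) < 1 of ResFlows/ImpFlows is satisfied.
g_x = lambda u: 0.5 * math.tanh(u)
g_z = lambda u: 0.5 * math.sin(u)
dg_x = lambda u: 0.5 * (1.0 - math.tanh(u) ** 2)
dg_z = lambda u: 0.5 * math.cos(u)

def solve_z(x, tol=1e-13):
    """Solve F(z, x) = g_x(x) - g_z(z) + x - z = 0 for z.

    The iteration z <- x + g_x(x) - g_z(z) is a contraction because
    Lip(g_z) <= 0.5, so it converges to the unique root.
    """
    z_new = z = x
    for _ in range(200):
        z_new = x + g_x(x) - g_z(z)
        if abs(z_new - z) < tol:
            break
        z = z_new
    return z_new

def implicit_grad(x):
    """Scalar case of the implicit gradient: dz/dx = (1 + g_x'(x)) / (1 + g_z'(z))."""
    z = solve_z(x)
    return (1.0 + dg_x(x)) / (1.0 + dg_z(z))

# The implicit gradient matches a central finite difference through the solver.
x, h = 1.0, 1e-5
fd = (solve_z(x + h) - solve_z(x - h)) / (2.0 * h)
assert abs(implicit_grad(x) - fd) < 1e-6
```

This mirrors the structure of the full algorithm: the backward pass never differentiates through the solver iterations, only through the root condition.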

B OTHER PROPERTIES OF IMPLICIT FLOWS

In this section, we present some additional properties of ImpFlows.

Lemma 7. For a single ImpFlow $f \in \mathcal{I}$, assume that $f = f_2^{-1} \circ f_1$, where

$$f_1(x) = x + g_1(x),\quad \mathrm{Lip}(g_1) \le \kappa < 1, \qquad f_2(x) = x + g_2(x),\quad \mathrm{Lip}(g_2) \le \kappa < 1. \tag{21}$$

Then

$$\frac{1-\kappa}{1+\kappa} \le \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|} \le \frac{1+\kappa}{1-\kappa}, \qquad \forall x_1, x_2 \in \mathbb{R}^d,\ x_1 \ne x_2.$$

Proof. (Proof of Lemma 7). According to Eqn. (20), we have

$$1-\kappa \le \frac{\|f_1(x_1) - f_1(x_2)\|}{\|x_1 - x_2\|} \le 1+\kappa, \qquad \frac{1}{1+\kappa} \le \frac{\|f_2^{-1}(x_1) - f_2^{-1}(x_2)\|}{\|x_1 - x_2\|} \le \frac{1}{1-\kappa}, \qquad \forall x_1 \ne x_2.$$

Multiplying these two inequalities gives the result.

Theorem 4. (Limitation of the single ImpFlow).

$$\mathcal{I} \subset \{f \in \mathcal{D} : \forall x \in \mathbb{R}^d,\ \lambda(J_f(x)) \cap \mathbb{R}_{<0} = \emptyset\},$$

where $\lambda(A)$ denotes the set of all eigenvalues of the matrix $A$.

Proof. (Proof of Theorem 4)

We prove by contradiction. Assume $\exists f \in \mathcal{I}$ and $x \in \mathbb{R}^d$ s.t. $\exists \lambda \in \lambda(J_f(x))$ with $\lambda < 0$, and let $u \ne 0$ be a corresponding eigenvector: $J_f(x)u = \lambda u$. By Theorem 2, $\exists f_1, f_2 \in \mathcal{F}$ with $f = f_2 \circ f_1$, hence $J_f(x) = J_{f_2}(f_1(x))\, J_{f_1}(x)$. Denote $A := J_{f_2}(f_1(x))$ and $B := J_{f_1}(x)$. Since $f_1, f_2 \in \mathcal{F}$, we have $v^T A v > 0$ and $w^T B w > 0$ for all nonzero $v, w \in \mathbb{R}^d$. Note that $B$ is the Jacobian of a bi-Lipschitz function at a single point, so $B$ is non-singular, and since $u \ne 0$ we have $Bu \ne 0$. Thus,

$$(Bu)^T A (Bu) = (Bu)^T\big((AB)u\big) = \lambda\, u^T B^T u = \lambda\, u^T B u,$$

where the last equality uses the fact that for a real $d \times d$ matrix $M$ and any $x \in \mathbb{R}^d$, $x^T M x = (x^T M x)^T = x^T M^T x$ because $x^T M x \in \mathbb{R}$. The left-hand side is positive while the right-hand side is negative, a contradiction. Therefore, $\mathcal{I}$ cannot include all bi-Lipschitz $C^1$-diffeomorphisms. As a corollary, we have $\mathcal{R}^3 \not\subset \mathcal{I}$.

Corollary 2. $\mathcal{R}^3 \not\subset \mathcal{I}$.

Proof. (Proof of Corollary 2). Consider three linear functions on $\mathbb{R}^2$:

$$f_1(x) = x + \begin{pmatrix} -0.46 & -0.20 \\ 0.85 & 0.00 \end{pmatrix} x, \qquad f_2(x) = x + \begin{pmatrix} -0.20 & -0.70 \\ 0.30 & -0.60 \end{pmatrix} x, \qquad f_3(x) = x + \begin{pmatrix} -0.50 & -0.60 \\ -0.20 & -0.55 \end{pmatrix} x.$$

One can check that $f = f_1 \circ f_2 \circ f_3$ is in $\mathcal{R}^3$, and $f$ is a linear function with Jacobian

$$\begin{pmatrix} 0.2776 & -0.4293 \\ 0.5290 & -0.6757 \end{pmatrix}.$$

However, this matrix has two negative eigenvalues, $-0.1881$ and $-0.2100$, so $f$ is not in $\mathcal{I}$. Therefore, $\mathcal{R}^3 \not\subset \mathcal{I}$.
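The counterexample in Corollary 2 can be verified numerically. The sketch below (ours) composes the three residual Jacobians and checks that the product has two negative real eigenvalues:

```python
import math

def matmul(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def jac(W):
    """Jacobian of the residual map x -> x + Wx is I + W."""
    return [[(1.0 if i == j else 0.0) + W[i][j] for j in range(2)] for i in range(2)]

# Residual weights from Corollary 2.
W1 = [[-0.46, -0.20], [0.85, 0.00]]
W2 = [[-0.20, -0.70], [0.30, -0.60]]
W3 = [[-0.50, -0.60], [-0.20, -0.55]]

# J_f = (I + W1)(I + W2)(I + W3) for f = f1 o f2 o f3.
J = matmul(jac(W1), matmul(jac(W2), jac(W3)))

# Eigenvalues of a 2x2 matrix via the characteristic polynomial.
tr = J[0][0] + J[1][1]
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
disc = tr * tr - 4.0 * det
assert disc > 0  # both eigenvalues are real here
eig1 = (tr + math.sqrt(disc)) / 2.0
eig2 = (tr - math.sqrt(disc)) / 2.0
assert eig1 < 0 and eig2 < 0  # both negative, so f is outside I by Theorem 4
```

Running it reproduces the Jacobian entries and confirms both eigenvalues are negative, which is exactly the condition that places $f$ outside $\mathcal{I}$.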

C COMPUTATION C.1 APPROXIMATE INVERSE JACOBIAN

Exact computation of the Jacobian-inverse term is expensive for high-dimensional tasks. We use a technique similar to Bai et al. (2019) to compute $\frac{\partial\mathcal{L}}{\partial z} J_G^{-1}(z)$: we solve the linear system in the variable $y$,

$$J_G^T(z)\, y^T = \left(\frac{\partial\mathcal{L}}{\partial z}\right)^T,$$

where the left-hand side is a vector-Jacobian product that can be efficiently computed by autograd packages for any $y$ without materializing the Jacobian matrix. In this work, we also use Broyden's method to find the root, as in the forward pass, with tolerance $\epsilon_b$ as the stopping criterion.

Remark. Although the forward, inverse, and backward passes of ImpFlows all require solving the root of some equation, we can choose $\epsilon_f$ and $\epsilon_b$ small enough to make the approximation error negligible. Thus, there is a trade-off between computation cost and approximation error. In practice, we use $\epsilon_f = 10^{-6}$ and $\epsilon_b = 10^{-10}$ and empirically do not observe any error accumulation. Note that such an approximation is rather different from the variational inference technique in Chen et al. (2020); Nielsen et al. (2020), because we only focus on the exact log-density itself.
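The paper solves this system with Broyden's method; as a simpler illustration (ours), note that since $J_G = I + J_{g_z}$ with $\|J_{g_z}\|_2 \le \kappa < 1$, the system $y\, J_G = \frac{\partial\mathcal{L}}{\partial z}$ can also be solved by the contraction $y \leftarrow \frac{\partial\mathcal{L}}{\partial z} - y\, J_{g_z}$:

```python
# Fixed-point sketch (ours, not the paper's Broyden solver) for the backward
# linear system y @ J_G = v, where J_G = I + J_gz and ||J_gz||_2 < 1.
def vecmat(y, M):
    """Row-vector times matrix."""
    return [sum(y[i] * M[i][j] for i in range(len(y))) for j in range(len(M[0]))]

def solve_backward(v, J_gz, tol=1e-12, max_iter=1000):
    """Iterate y <- v - y @ J_gz; a contraction because ||J_gz||_2 < 1."""
    y = [0.0] * len(v)
    for _ in range(max_iter):
        y_new = [v[j] - q for j, q in enumerate(vecmat(y, J_gz))]
        if max(abs(a - b) for a, b in zip(y_new, y)) < tol:
            return y_new
        y = y_new
    return y

# Toy Jacobian of g_z with spectral norm < 1 (hypothetical values).
J_gz = [[0.3, -0.2], [0.1, 0.4]]
v = [1.0, -2.0]  # stands in for dL/dz
y = solve_backward(v, J_gz)

# Check the residual of y @ (I + J_gz) = v.
JG = [[1.3, -0.2], [0.1, 1.4]]
assert max(abs(a - b) for a, b in zip(vecmat(y, JG), v)) < 1e-10
```

In practice one would replace `vecmat` with an autograd vector-Jacobian product so the Jacobian is never materialized, which is the point of the technique above.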

C.2 COMPUTATION TIME

We evaluate the average computation time for the models trained on CIFAR10 in Table 3 on a single Tesla P100 (SXM2-16GB); see Table 4 for details. For a fair comparison, the forward (inference) time of ImpFlow in the training phase is comparable to that of ResFlow because the log-determinant term is the main cost. The backward pass of ImpFlow costs more than that of ResFlow because it requires a custom backward method in PyTorch to solve the linear system. The training time includes the forward pass, the backward pass and other operations (such as the Lipschitz iterations for spectral normalization). We use the same method as the released code of ResFlows (fixed-point iterations with tolerance $10^{-5}$) for the sampling phase. The sampling time of ImpFlow is less than that of ResFlow because inverting an $L$-block ImpFlow requires solving $L$ fixed points, while inverting a $2L$-block ResFlow requires solving $2L$ fixed points. Fast sampling is particularly desirable since it is the main advantage of flow-based models over autoregressive models. We also evaluate the average number of Broyden's method iterations and the average number of function evaluations during Broyden's method; see Table 5 for details.

D EXPERIMENT DETAILS

D.1 FITTING A 1D FUNCTION

The function (data) to be fitted is

$$f(x) = \begin{cases} 0.1x, & x < 0, \\ 10x, & x \ge 0. \end{cases}$$

For $\mathcal{I}$, we can construct fully-connected neural networks with ReLU activations and 3 parameters as follows:

$$g_x(x) = \mathrm{ReLU}(-0.9x), \qquad g_z(z) = -\sqrt{0.9}\,\mathrm{ReLU}(\sqrt{0.9}\,z).$$

Both networks can be implemented with spectral normalization. Let $f_{\mathcal{I}}$ be the implicit function defined by Eqn. (3) using the above $g_x(x)$ and $g_z(z)$. Next we show that $f = f_{\mathcal{I}}$. Let $f_1(x) = x + \mathrm{ReLU}(-0.9x)$ and $f_2(x) = x - \sqrt{0.9}\,\mathrm{ReLU}(\sqrt{0.9}\,x)$; then $f_2^{-1}(x) = x + \mathrm{ReLU}(9x)$. Therefore, $f_{\mathcal{I}} = f_2^{-1} \circ f_1 = f$.
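As a quick numerical check (ours) of this construction, the implicit function can be evaluated by the fixed-point iteration $z \leftarrow x + g_x(x) - g_z(z)$, which is a contraction since $\mathrm{Lip}(g_z) = 0.9 < 1$:

```python
import math

relu = lambda u: max(u, 0.0)
s = math.sqrt(0.9)

g_x = lambda x: relu(-0.9 * x)
g_z = lambda z: -s * relu(s * z)  # Lip(g_z) = 0.9 < 1

def f_impl(x, tol=1e-12, max_iter=10000):
    """Solve F(z, x) = g_x(x) - g_z(z) + x - z = 0 by fixed-point iteration."""
    z = x
    for _ in range(max_iter):
        z_new = x + g_x(x) - g_z(z)
        if abs(z_new - z) < tol:
            return z_new
        z = z_new
    return z

# The implicit flow reproduces f(x) = 0.1x for x < 0 and 10x for x >= 0,
# even though each residual branch has Lipschitz constant only 0.9.
assert abs(f_impl(-2.0) - (-0.2)) < 1e-9
assert abs(f_impl(3.0) - 30.0) < 1e-9
```

This is exactly the point of the example: the composed map has a Lipschitz constant of 10, far beyond what any fixed-depth ResFlow with $\kappa = 0.9$ blocks could reach without many layers.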
For every residual block of $\mathcal{R}$, $\mathcal{R}^2$ and $\mathcal{R}^3$, we train a 4-layer MLP with ReLU activations and 128 hidden units; the Lipschitz coefficient for spectral normalization is 0.99, and the number of iterations for the spectral-norm computation is 200. The objective function is $\min_\theta \mathbb{E}_{x\sim\mathrm{Unif}(-1,1)}\,(f_\theta(x) - f(x))^2$, where $f_\theta$ is the function composed of 1, 2 or 3 residual blocks. We use a batch size of 5000 and the Adam optimizer with learning rate $10^{-3}$ and weight decay $10^{-5}$. We train each model until convergence on a single NVIDIA GeForce GTX 1080 Ti. The final losses for $\mathcal{R}$, $\mathcal{R}^2$ and $\mathcal{R}^3$ are 5.25, 2.47 and 0.32, respectively.

D.2 CLASSIFICATION

For the classification tasks, we remove all the BatchNorm layers inside each ResBlock and only keep the BatchNorm layer in the downsampling layer. Moreover, as a single ImpFlow consists of two residual blocks with the same input and output dimensions, we replace the downsampling shortcut with an identity shortcut in each scale of ResNet-18, and add a downsampling layer (a convolutional layer) with BatchNorm after the two residual blocks of each scale. Thus, each scale consists of two ResBlocks with the same input and output dimensions, which (6.5M parameters) differs from the vanilla ResNet-18 architecture (11.2M parameters). Note that the "vanilla ResNet-18" in our main text refers to this 6.5M-parameter architecture, which is the same as the versions used for ResFlow and ImpFlow. We use common settings: a batch size of 128, the Adam optimizer with learning rate $10^{-3}$ and no weight decay, and 150 epochs in total. For the spectral normalization iterations, we use an error bound of $10^{-3}$, the same as Chen et al. (2019). We run every experiment on a single NVIDIA GeForce RTX 2080 Ti.

D.3 DENSITY MODELING ON TOY 2D DATA

Following the same settings as Chen et al. (2019), we use 4-layer multilayer perceptrons (MLPs) with fully-connected layers of 128 hidden units. We use the Adam optimizer with learning rate $10^{-3}$ and weight decay $10^{-5}$. Moreover, we find that $\frac{1}{2\pi}\sin(2\pi x)$ is a better activation for this task while maintaining the 1-Lipschitz property, so we use this activation function for all experiments; it leads to faster convergence and better log-likelihood for both ResFlows and ImpFlows, as shown in Fig. 4. We do not use any ActNorm or BatchNorm layers. For the log-determinant term, we use brute-force computation as in Chen et al. (2019). For the forward and backward passes, we use Broyden's method to compute the roots, with $\epsilon_f = 10^{-6}$. The Lipschitz coefficient for spectral normalization is 0.999, and the number of iterations for spectral normalization is 20. The batch size is 5000, and we train for 50000 iterations. The test batch size is 10000. We also vary the network depth to compare ImpFlow and ResFlow.
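The scaled sine activation stays 1-Lipschitz because its derivative is $\cos(2\pi x)$, which is bounded by 1 in absolute value. A quick numeric check (ours) over difference quotients on a grid:

```python
import math

def sin_act(x):
    """Scaled sine activation (1 / (2*pi)) * sin(2*pi*x); its derivative is cos(2*pi*x)."""
    return math.sin(2.0 * math.pi * x) / (2.0 * math.pi)

# All difference quotients stay below 1 in absolute value, so the activation
# is 1-Lipschitz like ReLU or LipSwish, but smooth and non-monotonic.
pts = [i / 100.0 for i in range(-200, 201)]
quotients = [abs(sin_act(a) - sin_act(b)) / abs(a - b)
             for a in pts for b in pts if a != b]
assert max(quotients) <= 1.0 + 1e-12
```

Non-monotonicity is what distinguishes it from the usual flow activations; the Lipschitz bound is what keeps it admissible for spectral-normalized residual blocks.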

D.4 DENSITY MODELING ON TABULAR DATASETS

We use the same data preprocessing as Papamakarios et al. (2017), including the train/valid/test splits. For all models, we use a batch size of 1000 (for both training and testing) and a learning rate of $10^{-3}$ for the Adam optimizer. The main settings are the same as those of Chen et al. (2019) on the toy 2D datasets. The residual blocks are 4-layer MLPs with 128 hidden units. The ResFlows use 10 blocks and the ImpFlows use 5 blocks to ensure the same number of parameters, and we additionally use a 20-block ImpFlow for a better result. We again use the sine activation $\frac{1}{2\pi}\sin(2\pi x)$. We do not use any ActNorm or BatchNorm layers. For the Lipschitz coefficient, we use $c = 0.9$, and the iteration error bound for spectral normalization is $10^{-3}$. For the settings of our scalable algorithm, we use brute-force computation of the log-determinant term for the POWER and GAS datasets, and the same estimation settings as Chen et al. (2019) for the HEPMASS, MINIBOONE and BSDS300 datasets. In particular, for the estimation settings, we always exactly compute 2 terms of the log-determinant series during training and 20 terms during testing. We use a geometric distribution with $p = 0.5$ for the distribution $p(N)$ of the log-determinant estimator, and a single sample of $(n, v)$ for the estimator in both training and testing. We train each experiment on a single NVIDIA GeForce RTX 2080 Ti for about 4 days for ResFlows and 6 days for ImpFlows; the 20-block ImpFlow is trained for about 2 weeks. However, we find that the 20-block ImpFlow overfits the training set of MINIBOONE because this dataset is quite small, so we use early stopping there.

D.5 DENSITY MODELING ON IMAGE DATASETS

For the CIFAR10 dataset, we follow the same settings and architectures as Chen et al. (2019). In particular, every convolutional residual block is LipSwish → 3×3 Conv → LipSwish → 1×1 Conv → LipSwish → 3×3 Conv. The overall architecture is Image → LogitTransform(α) → k × ConvBlock → [Squeeze → k × ConvBlock] × 2, where ConvBlock is an i-ResBlock for ResFlows and an ImpBlock for ImpFlows, with k = 4 for ResFlows and k = 2 for ImpFlows. Following Chen et al. (2019), the first ConvBlock does not have LipSwish as a pre-activation. We use ActNorm2d after every ConvBlock. We do not use the FC layers (Chen et al., 2019). We use 512 hidden channels, a batch size of 64, and the Adam optimizer with learning rate $10^{-3}$. The iteration error bound for spectral normalization is $10^{-3}$. We use α = 0.05 for CIFAR10. For the settings of our scalable algorithm, we follow Chen et al. (2019) for the log-determinant terms: we always exactly compute 10 terms of the log-determinant series during training and 20 terms during testing, use a Poisson distribution for $p(N)$, and use a single sample of $(n, v)$ for the log-determinant estimator in both training and testing. We train each ResFlow on a single NVIDIA GeForce RTX 2080 Ti and each ImpFlow on two NVIDIA GeForce RTX 2080 Ti GPUs, for about 6 days for ResFlows and 8 days for ImpFlows. Although the numbers of parameters are the same, ImpFlows need more GPU memory due to the PyTorch implementation of the backward pass for the implicit function. For the CelebA dataset, we use exactly the same settings as the final version of ResFlows in Chen et al. (2019), except that we use the sine activation $\frac{1}{2\pi}\sin(2\pi x)$.



We refer readers to Broyden (1965) for the calculation details of B. See https://github.com/thu-ml/implicit-normalizing-flows for details.



(a) Relationship between $\mathcal{R}$ and $\mathcal{I}$. (b) Relationship between $\mathcal{R}^2$ and $\mathcal{I}$, formally presented in Sec. 4.3.

Figure 1: An illustration of our main theoretical results on the expressive power of ImpFlows and ResFlows. Panel (a) and Panel (b) correspond to the results in Sec. 4.2 and Sec. 4.3, respectively.

Figure 3: Checkerboard data density and the results of an 8-block ResFlow and a 4-block ImpFlow.

Algorithm 2 (loop body): define $G(z;\theta) = g_z(z;\theta) + z$ and $h(y) = y\,J_G(z) - \frac{\partial\mathcal{L}}{\partial z}$; initialize $y \leftarrow 0$; while $\|h(y)\|_2 \ge \epsilon_b$: set $B \leftarrow$ the estimated inverse Jacobian of $h(y)$ (e.g., by Broyden's method), $\alpha \leftarrow \mathrm{LineSearch}(y, h, B)$, $y \leftarrow y - \alpha B h(y)$.

Table 6: Average test log-likelihood (in nats) for different $\epsilon_f$ of ImpFlow on the POWER dataset, with $\epsilon_f$ ranging from $10^{-8}$ to $10^{-1}$.

Figure 4: 8-block ResFlows with different activation functions trained on the Checkerboard dataset.

Figure 5: Test NLL (in bits) by varying the network depth. Lower is better.


Classification error rate (%) on test set of vanilla ResNet, ResFlow and ImpFlow of ResNet-18 architecture, with varying Lipschitz coefficients c.

Average test log-likelihood (in nats) of tabular datasets. Higher is better.

Average bits per dimension of ResFlow and ImpFlow on CIFAR10, with varying Lipschitz coefficients c. Lower is better.

Single-batch computation time (seconds) for ResFlow and ImpFlow in Table 3 on a single Tesla P100 (SXM2-16GB).

Table 5: Single-batch iterations of Broyden's method during the forward and backward passes for a single block of ImpFlow in Table 3.

We train an ImpFlow on the POWER dataset with $\epsilon_f = 10^{-6}$ (see Appendix D for detailed settings), and then test this model with different $\epsilon_f$ to probe the numerical sensitivity of the fixed-point iterations. Table 6 shows that our model is not sensitive to $\epsilon_f$ within a fair range.

Algorithm 2: Backward Algorithm for a Single-Block ImpFlow. Require: $g_{x;\theta}$, $g_{z;\theta}$ in Eqn. (3), stop criterion $\epsilon_b$. Input: $x$, $z$, $\frac{\partial\mathcal{L}}{\partial z}$. Output: the gradient for $x$ and $\theta$ from $z$, i.e., $\frac{\partial\mathcal{L}}{\partial(\cdot)}$.

ACKNOWLEDGEMENT

We thank Yuhao Zhou, Shuyu Cheng, Jiaming Li, Kun Xu, Fan Bao, Shihong Song and Qi'An Fu for proofreading. This work was supported by the National Key Research and Development Program of China (Nos. 2020AAA0104304), NSFC Projects (Nos. 61620106010, 62061136001, U19B2034, U181146, 62076145), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, Huawei Hisilicon Kirin Intelligence Engineering Development, the MindSpore team, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. C. Li was supported by the fellowship of China postdoctoral Science Foundation (2020M680572), and the fellowship of China national postdoctoral program for innovative talents (BX20190172) and Shuimu Tsinghua Scholar. J. Chen was supported by Shuimu Tsinghua Scholar.


Published as a conference paper at ICLR 2021

