RG-FLOW: A HIERARCHICAL AND EXPLAINABLE FLOW MODEL BASED ON RENORMALIZATION GROUP AND SPARSE PRIOR

Abstract

Flow-based generative models have become an important class of unsupervised learning approaches. In this work, we incorporate the key ideas of the renormalization group (RG) and sparse prior distributions to design a hierarchical flow-based generative model, called RG-Flow, which can separate information at different scales of images, with disentangled representations at each scale. We demonstrate our method mainly on the CelebA dataset and show that the disentangled representations at different scales enable semantic manipulation and style mixing of the images. To visualize the latent representations, we introduce receptive fields for flow-based models and find that the receptive fields learned by RG-Flow are similar to those in convolutional neural networks. In addition, we replace the widely adopted Gaussian prior distribution with a sparse prior distribution to further enhance the disentanglement of representations. From a theoretical perspective, the proposed method has O(log L) complexity for image inpainting, compared to previous generative models with O(L^2) complexity.

1. INTRODUCTION

One of the most important unsupervised learning tasks is to learn the data distribution and build generative models. Over the past few years, various types of generative models have been proposed. Flow-based generative models are a particular family of generative models with tractable distributions (Dinh et al., 2017; Kingma & Dhariwal, 2018; Chen et al., 2018b; 2019; Behrmann et al., 2019; Hoogeboom et al., 2019; Brehmer & Cranmer, 2020; Rezende et al., 2020; Karami et al., 2019). Yet their latent variables are on an equal footing and mixed globally. Here, we propose a new flow-based model, RG-Flow, which is inspired by the idea of the renormalization group in statistical physics. RG-Flow imposes locality and hierarchical structure on the bijective transformations. This allows us to access information at different scales of the original images through latent variables at different locations, which offers better explainability. Combined with sparse priors (Olshausen & Field, 1996; 1997; Hyvärinen & Oja, 2000), we show that RG-Flow achieves hierarchical disentangled representations.

The renormalization group (RG) is a powerful tool for analyzing statistical mechanics models and quantum field theories in physics (Kadanoff, 1966; Wilson, 1971). It progressively extracts coarse-scale statistical features of the physical system and decimates irrelevant fine-grained statistics at each scale. Typically, the local transformations used in RG are designed by human physicists and are not bijective. Flow-based models, on the other hand, use cascaded invertible global transformations to progressively turn a complicated data distribution into a Gaussian distribution. Here, we combine the key ideas from RG and flow-based models. The proposed RG-Flow enables the machine to learn the optimal RG transformation from data by constructing local invertible transformations, and builds a hierarchical generative model of the data distribution.
Latent representations are introduced at different scales, capturing the statistical features at the corresponding scales. Together, the latent representations of all scales can be jointly inverted to generate the data. This method was recently proposed in the physics community as NeuralRG (Li & Wang, 2018; Hu et al., 2020). Our main contributions are two-fold. First, RG-Flow naturally separates the statistics of different scales in the input distribution and represents the information at each scale in its latent variables z. These hierarchical latent variables live on a hyperbolic tree. Taking the CelebA dataset (Liu et al., 2015) as an example, the network not only finds high-level representations, such as the gender and emotion factors of human faces, but also mid-level and low-level representations. To visualize representations at different scales, we adopt the concept of the receptive field from convolutional neural networks (CNNs) (LeCun, 1988; LeCun et al., 1989) and visualize the hidden structures in RG-Flow. In addition, since the statistics are separated in a hierarchical fashion, we show that the representations can be mixed at different scales, achieving an effect similar to style mixing. Second, we introduce a sparse prior distribution for the latent variables. We find that the sparse prior helps to further disentangle the representations and make them more explainable. The widely adopted Gaussian prior is rotationally symmetric; as a result, each latent variable in a flow model usually does not have a clear semantic meaning. By using a sparse prior, we demonstrate clear semantic meaning in the latent space.

2. RELATED WORK

Some flow-based generative models also possess a multi-scale latent space (Dinh et al., 2017; Kingma & Dhariwal, 2018), and hierarchies of features have recently been utilized in Schirrmeister et al. (2020), where the top-level features are shown to perform strongly in the out-of-distribution (OOD) detection task. Yet previous models do not impose a hard locality constraint on the multi-scale structure. In Appendix C, we discuss the differences between globally connected multi-scale flows and RG-Flow, and we see that semantically meaningful receptive fields do not show up in the globally connected case. Recently, more expressive bijective maps have been developed (Hoogeboom et al., 2019; Karami et al., 2019; Durkan et al., 2019), and those methods can be incorporated into the proposed structure to further improve the expressiveness of RG-Flow. Some other classes of generative models rely on a separate inference model to obtain the latent representation. Examples include variational autoencoders (Kingma & Welling, 2014), adversarial autoencoders (Makhzani et al., 2015), InfoGAN (Chen et al., 2016), and BiGAN (Donahue et al., 2017; Dumoulin et al., 2017). Those techniques typically do not use hierarchical latent variables, and the inference of latent variables is approximate. Notably, recent advances suggest that hierarchical latent variables may be beneficial (Vahdat & Kautz, 2020). In addition, the coarse-to-fine generation process has also been discussed in other generative models, such as the Laplacian pyramid of adversarial networks (Denton et al., 2015) and multi-scale autoregressive models (Reed et al., 2017). Disentangled representations (Tenenbaum & Freeman, 2000; DiCarlo & Cox, 2007; Bengio et al., 2013) are another important aspect of understanding how a model generates images (Higgins et al., 2018).
In particular, disentangled high-level representations have been discussed and improved upon from information-theoretic principles (Cheung et al., 2015; Chen et al., 2016; 2018a; Higgins et al., 2017; Kipf et al., 2020; Kim & Mnih, 2018; Locatello et al., 2019; Ramesh et al., 2018). Apart from high-level representations, a multi-scale structure also lies in the distribution of natural images. If a model can separate information of different scales, its multi-scale representations can be used for other tasks, such as style transfer (Gatys et al., 2016; Zhu et al., 2017), face mixing (Karras et al., 2019; Gambardella et al., 2019; Karras et al., 2020), and texture synthesis (Bergmann et al., 2017; Jetchev et al., 2016; Gatys et al., 2015; Johnson et al., 2016; Ulyanov et al., 2016). Typically, in flow-based generative models, a Gaussian distribution is used as the prior for the latent space. Due to the rotational symmetry of the Gaussian prior, an arbitrary rotation of the latent space leads to the same likelihood. Sparse priors (Olshausen & Field, 1996; 1997; Hyvärinen & Oja, 2000) were proposed as an important tool for unsupervised learning and lead to better explainability in various domains (Ainsworth et al., 2018; Arora et al., 2018; Zhang et al., 2019). To break the symmetry of the Gaussian prior and further improve explainability, we introduce a sparse prior to flow-based models. Please refer to Figure 12 for a quick illustration of the difference between the Gaussian prior and the sparse prior, where the sparse prior leads to better disentanglement. The renormalization group has a broad impact ranging from particle physics to statistical physics. Apart from analytical studies in field theories (Wilson, 1971; Fisher, 1998; Stanley, 1999), RG has also been useful in numerically simulating quantum states.
The multi-scale entanglement renormalization ansatz (MERA) (Vidal, 2008; Evenbly & Vidal, 2014) implements the hierarchical structure of RG in tensor networks to represent quantum states. The exact holographic mapping (EHM) (Qi, 2013; Lee & Qi, 2016; You et al., 2016) further extends MERA to a bijective (unitary) flow between latent product states and visible entangled states. Recently, Li & Wang (2018) and Hu et al. (2020) incorporated the MERA structure and deep neural networks to design a flow-based generative model that allows the machine to learn the EHM from statistical physics and quantum field theory actions. In quantum machine learning, the recent development of quantum convolutional neural networks (Cong et al., 2019) also utilizes the MERA structure. The similarity between RG and deep learning has been discussed in several works (Bény, 2013; Mehta & Schwab, 2014; Bény & Osborne, 2015; Oprisa & Toth, 2017; Lin et al., 2017; Gan & Shu, 2017). Information-theoretic objectives that guide machine-learned RG transformations have been proposed in recent works (Koch-Janusz & Ringel, 2018; Hu et al., 2020; Lenggenhager et al., 2020). The meaning of the emergent latent space has been related to quantum gravity (Swingle, 2012; Pastawski et al., 2015), which leads to the exciting development of machine-learning holography (You et al., 2018; Hashimoto et al., 2018; Hashimoto, 2019; Akutagawa et al., 2020; Hashimoto et al., 2020).

3. METHODS

Flow-based generative models. Flow-based generative models are a family of generative models with tractable distributions, which allow efficient sampling and exact evaluation of the probability density (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018; Chen et al., 2019). The key idea is to build a bijective map G(z) = x between the visible variables x and the latent variables z. The visible variables x are the data that we want to generate, which may follow a complicated probability distribution, whereas the latent variables z usually follow a simple distribution that can be easily sampled, for example an i.i.d. Gaussian distribution. In this way, data can be efficiently generated by first sampling z and then mapping it to x through x = G(z). In addition, we obtain the probability density of each data sample x,

log p_X(x) = log p_Z(z) − log |det(∂G(z)/∂z)|.

The bijective map G(z) = x is usually composed of a series of bijectors, G(z) = G_1 ∘ G_2 ∘ ⋯ ∘ G_n(z), such that each bijector layer G_i has a tractable Jacobian determinant and can be inverted efficiently. The two key ingredients of flow-based models are the design of the bijective map G and the choice of the prior distribution p_Z(z).

Structure of RG-Flow networks. Much prior research has focused on designing more powerful bijective blocks for the generator G to improve its expressive power and achieve better approximations of complicated probability distributions. Here, we instead focus on designing the architecture that arranges the bijective blocks in a hierarchical structure, so as to separate features of different scales in the data and to disentangle the latent representations.
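The change-of-variables bookkeeping above can be illustrated with a minimal numpy sketch. This is a toy stand-in, not RG-Flow itself: two composed element-wise affine bijectors G = G1 ∘ G2, whose diagonal Jacobians make the per-layer log-determinants simply add.

```python
import numpy as np

# Toy sketch (not the paper's architecture): two composed invertible
# element-wise affine maps illustrating the change-of-variables formula
# log p_X(x) = log p_Z(z) - log|det(dG(z)/dz)|.

def g1(z):
    return 2.0 * z + 1.0        # invertible: z = (x - 1) / 2

def g2(z):
    return 1.5 * z - 3.0        # invertible: z = (x + 3) / 1.5

def generator(z):
    return g1(g2(z))            # G(z) = G1(G2(z))

def log_abs_det_jacobian(d):
    # element-wise affine maps have diagonal Jacobians, so the
    # log-determinant is d * log(scale), summed over the two layers
    return d * (np.log(2.0) + np.log(1.5))

def log_prob_x(z):
    # standard-normal prior on z; density of the pushforward x = G(z)
    log_pz = -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi)
    return log_pz - log_abs_det_jacobian(len(z))

z = np.array([0.3, -1.2])
x = generator(z)                # sampling: map latent to visible
```

Sampling and density evaluation both reduce to one pass through the composed maps; coupling layers such as those in Real NVP follow the same bookkeeping with triangular rather than diagonal Jacobians.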

Figure 1: (a) Renormalization: fine-grained variables x^(0), x^(1), x^(2), x^(3) are progressively coarse-grained along the RG scale, with the decimated features split out as latent variables z^(0), …, z^(3). (b) Generation: the inverse RG transformation maps the latent variables z back to the visible variables x.

Our design is motivated by the idea of RG in physics, which progressively separates the coarse-grained data statistics from the fine-grained statistics by local transformations at different scales. Let x be the visible variables, i.e. the input image (level-0), denoted as x^(0) ≡ x. A step of the RG transformation extracts the coarse-grained information x^(1) to send to the next layer (level-1) and splits out the rest of the fine-grained information as auxiliary variables z^(0). The procedure can be described by the following recursive equation (at level h),

(x^(h+1), z^(h)) = R_h(x^(h)),

which is illustrated in Fig. 1(a), where dim(x^(h+1)) + dim(z^(h)) = dim(x^(h)), and the RG transformation R_h can be made invertible. At each level, the transformation R_h is a local bijective map, constructed by stacking trainable bijective blocks; we will specify its details later. The split-out information z^(h) can be viewed as latent variables arranged at different scales. The inverse RG transformation G_h ≡ R_h^(−1) then generates the fine-grained image, x^(h) = R_h^(−1)(x^(h+1), z^(h)) = G_h(x^(h+1), z^(h)). The highest-level image x^(h_L) = G_{h_L}(z^(h_L)) can be considered as generated directly from latent variables z^(h_L) without referring to any higher-level coarse-grained image, where h_L = log_2 L − log_2 m for an original image of size L × L with local transformations acting on kernels of size m × m. Therefore, given the latent variables z = {z^(h)} at all levels h, the original image can be restored by the following nested maps, as illustrated in Fig. 1(b),

x ≡ x^(0) = G_0(G_1(G_2(⋯, z^(2)), z^(1)), z^(0)) ≡ G(z),

where z = {z^(0), …, z^(h_L)}.
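The recursive coarse-graining and its inverse can be sketched on a toy one-dimensional "image". The level maps below are illustrative placeholders (simple invertible interleavings), not trained bijectors, but they respect the dimension bookkeeping dim(x^(h+1)) + dim(z^(h)) = dim(x^(h)).

```python
import numpy as np

# Toy sketch of the nested generation x = G0(G1(G2(..., z2), z1), z0)
# on a 1D signal of length L = 8 with kernel m = 2, i.e. three latent
# levels (h = 0, 1, 2). The maps are placeholders, not trained bijectors.

def g_level(x_coarse, z_level):
    # double the resolution by interleaving coarse variables and latents
    out = np.empty(2 * len(x_coarse))
    out[0::2] = x_coarse + z_level   # invertible given the latents
    out[1::2] = z_level
    return out

def r_level(x_fine):
    # inverse of g_level: one RG step splitting coarse part and latents
    z = x_fine[1::2]
    return x_fine[0::2] - z, z

def generate(z_levels):
    # z_levels[-1] is the top (coarsest) set of latents
    x = z_levels[-1]
    for z in reversed(z_levels[:-1]):
        x = g_level(x, z)
    return x

rng = np.random.default_rng(0)
L = 8
z_levels = [rng.normal(size=L // 2), rng.normal(size=L // 4), rng.normal(size=L // 4)]
x = generate(z_levels)               # total latent count equals L
```

The total number of latents (4 + 2 + 2 = 8) equals the visible dimension, as required of a bijective flow, and `r_level` recovers each level's latents exactly.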
RG-Flow is a flow-based generative model that uses the above composite bijective map G as the generator. To model the RG transformation, we arrange the bijective blocks in a hierarchical network architecture. Fig. 2(a) shows the side view of the network, where each green or yellow block is a local bijective map. Following the notation of MERA networks, the green blocks are disentanglers, which reparametrize local variables to reduce their correlations, and the yellow blocks are decimators, which separate the decimated features out as latent variables. The blue dots at the bottom are the visible variables x from the data, and the red crosses are the latent variables z. We omit the color channels of the image in the illustration, since the number of color channels is kept unchanged through the transformation. As a mathematical description of the single-step RG transformation R_h, in each block (p, q), labeled by p, q = 0, 1, …, L/(2^h m) − 1, the mapping from x^(h) to (x^(h+1), z^(h)) is given by

{y^(h)_{2^h(mp + m/2 + a, mq + m/2 + b)}}_{(a,b) ∈ Λ^1_m} = R^dis_h({x^(h)_{2^h(mp + m/2 + a, mq + m/2 + b)}}_{(a,b) ∈ Λ^1_m}),

{x^(h+1)_{2^h(mp + a, mq + b)}}_{(a,b) ∈ Λ^2_m} ∪ {z^(h)_{2^h(mp + a, mq + b)}}_{(a,b) ∈ Λ^1_m ∖ Λ^2_m} = R^dec_h({y^(h)_{2^h(mp + a, mq + b)}}_{(a,b) ∈ Λ^1_m}),

where Λ^k_m = {(ka, kb) | a, b = 0, 1, …, m/k − 1} denotes the set of pixels in an m × m square with stride k, and y is the intermediate result after the disentangler but before the decimator. The notation x^(h)_{(i,j)} stands for the variable (a vector of all channels) at pixel (i, j) and RG level h (similarly for y and z). The disentanglers R^dis_h and decimators R^dec_h can be any bijective neural networks. In practice, we use the coupling layers proposed in Real NVP (Dinh et al., 2017) to build them, with a detailed description in Appendix A. By specifying the RG transformation R_h = R^dec_h ∘ R^dis_h as above, the generator G_h ≡ R_h^(−1) is automatically specified as the inverse transformation.
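The index sets Λ^k_m and the block coordinates are just bookkeeping, which a short sketch makes concrete (kernel m = 4 here, chosen for illustration):

```python
# Sketch of the pixel bookkeeping in one RG step. Lambda_m^k is the set
# of pixels in an m x m square with stride k; the decimator keeps the
# stride-2 subset as coarse-grained variables x^(h+1) and splits the
# remaining pixels out as latents z^(h).

def lam(m, k):
    # pixels of an m x m block sampled with stride k
    return {(k * a, k * b) for a in range(m // k) for b in range(m // k)}

def block_pixels(p, q, h, m=4):
    # absolute image coordinates 2^h(mp + a, mq + b) covered by block (p, q)
    return {(2**h * (m * p + a), 2**h * (m * q + b)) for (a, b) in lam(m, 1)}

m = 4
kept = lam(m, 2)               # promoted to the next level (stride-2 subset)
split = lam(m, 1) - lam(m, 2)  # decimated into latents
assert len(kept) == 4 and len(split) == 12   # 1/4 of variables kept per step
```

Keeping one quarter of the variables per step is exactly the 2× downsampling per spatial dimension that the hierarchical structure requires.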
Training objective. After decomposing the statistics into multiple scales, we need the latent features to be decoupled. So we assume that the latent variables z are independent random variables, described by a factorized prior distribution

p_Z(z) = ∏_l p(z_l),

where l labels every element of z, including the RG level, the pixel position, and the channel. This prior gives the network the incentive to minimize the mutual information between latent variables. This minimal bulk mutual information (minBMI) principle was previously proposed as the information-theoretic principle that defines the RG transformation (Li & Wang, 2018; Hu et al., 2020).

Starting from a set of independent latent variables z, the generator G should build up correlations locally at different scales, such that the multi-scale correlation structure can emerge in the resulting image x to model the correlated probability distribution of the data. To achieve this goal, we maximize the log-likelihood of x drawn from the data set. The loss function to minimize reads

L = −E_{x∼p_data(x)} [log p_X(x)] = −E_{x∼p_data(x)} [log p_Z(R(x)) + log |det(∂R(x)/∂x)|],

where R(x) ≡ G^(−1)(x) = z denotes the RG transformation, which contains the trainable parameters. By optimizing the parameters, the network learns the optimal RG transformation from the data.

Receptive fields of latent variables. Due to the local transformations in our hierarchical network, we can define the generation causal cone of a latent variable as the area affected when that latent variable is changed. This is illustrated as the red cone in Fig. 2(c). To visualize the latent-space representation, we define the receptive field of a latent variable z_l as

RF_l = E_{z∼p_Z(z)} |∂G(z)/∂z_l|_c,

where |·|_c denotes the 1-norm over color channels. The receptive field reflects the response of the generated image to an infinitesimal change of the latent variable z_l, averaged over p_Z(z). Therefore, the receptive field of a latent variable is always contained in its generation causal cone. Higher-level latent variables have larger receptive fields than lower-level ones. In particular, if the receptive fields of two latent variables do not overlap, which is often the case for lower-level latent variables, they automatically become disentangled in the representation.
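The training objective can be sketched numerically. Below, the "RG transformation" R is a toy element-wise affine map (a stand-in for the trained network) and the prior is the factorized Laplacian used later in the paper; the negative log-likelihood is estimated on a small batch.

```python
import numpy as np

# Sketch of L = -E_x[log p_Z(R(x)) + log|det(dR(x)/dx)|] with a toy
# element-wise affine R (a stand-in, not the trained network) and a
# factorized Laplacian prior p(z_l) = exp(-|z_l|/b) / (2b).

def R(x, scale, shift):
    return (x - shift) / scale          # inverse of G(z) = scale*z + shift

def log_prior(z, b=1.0):
    # factorized Laplacian prior, summed over latent components
    return np.sum(-np.abs(z) / b - np.log(2 * b), axis=-1)

def nll(x_batch, scale, shift):
    z = R(x_batch, scale, shift)
    # elementwise map: log|det dR/dx| = -d * log(scale)
    log_det = -x_batch.shape[-1] * np.log(scale)
    return -np.mean(log_prior(z) + log_det)

rng = np.random.default_rng(1)
x_batch = rng.laplace(loc=2.0, scale=3.0, size=(256, 4))
# parameters matching the data distribution give a lower NLL
assert nll(x_batch, scale=3.0, shift=2.0) < nll(x_batch, scale=1.0, shift=0.0)
```

Minimizing this loss over the parameters of R is, in this sketch, exactly the maximum-likelihood training described above.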
Image inpainting and error correction. Another advantage of the network's locality can be demonstrated in the inpainting task. Similar to the generation causal cone, we can define the inference causal cone, shown as the blue cone in Fig. 2(d). If we perturb a pixel at the bottom of the blue cone, all the latent variables within the cone may be affected, whereas the latent variables outside the cone cannot be affected. An important property of the hyperbolic tree-like network is that each higher level contains exponentially fewer latent variables. Even though the inference causal cone expands as we go to higher levels, the number of latent variables dilutes exponentially as well, resulting in a constant number of latent variables covered by the inference causal cone at each level. Therefore, if a small local region of an image is corrupted, only O(log L) latent variables need to be modified, where L is the edge length of the entire image, whereas for globally connected networks, all O(L^2) latent variables have to be varied.
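The O(log L) count can be checked with a back-of-envelope sketch: at each level the cone widens by a constant number of coarse pixels (here taken to be the kernel size m, an illustrative assumption) while the latent grid coarsens by 2×, so the per-level count saturates.

```python
# Sketch of the O(log L) inpainting argument: the inference causal cone
# widens by roughly the kernel size m per level while the latent grid
# coarsens by 2x, so it covers a bounded number of latents per level,
# and there are log2(L/m) levels in total.

def cone_latents_per_level(L, m=2):
    counts = []
    width = m                      # corrupted patch covers one m x m block
    while L > m:
        width = width // 2 + m     # coarsen by 2x, then expand by the kernel
        counts.append(width * width)
        L //= 2
    return counts

for L in (16, 64, 256):
    counts = cone_latents_per_level(L)
    # per-level count saturates at a constant; total grows like log2(L)
    print(L, len(counts), max(counts))
```

For L = 256 this gives 7 levels with at most 9 latents each, versus the L^2 = 65536 latents a globally connected network would need to vary.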

Sparse prior distribution.

We have chosen to hard-code the RG information principle by using a factorized prior distribution, p_Z(z) = ∏_l p(z_l). The common practice is to choose p(z_l) to be the standard Gaussian distribution, which is spherically symmetric: if we apply any rotation to z, the distribution remains the same. Therefore, we cannot prevent different features from being mixed under an arbitrary rotation. To overcome this issue, we use an anisotropic sparse prior distribution for p_Z(z). In our implementation, we choose the Laplacian distribution p(z_l) = (1/2b) exp(−|z_l|/b), which is sparser than the Gaussian distribution and breaks the spherical symmetry of the latent space. In Appendix E, we show a two-dimensional pinwheel example to illustrate this intuition. This heuristic encourages the model to find more semantically meaningful representations by breaking the spherical symmetry.
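The symmetry-breaking argument can be verified directly in two dimensions: rotating a latent vector leaves the Gaussian log-density unchanged but generally changes the Laplacian log-density, which favors axis-aligned (sparse) codes.

```python
import numpy as np

# Demonstration that the Laplacian prior breaks the rotational symmetry
# of the Gaussian prior: a "one-hot" sparse code has a strictly higher
# Laplacian log-density than its rotated (dense) counterpart.

def log_gauss(z):
    return float(np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi)))

def log_laplace(z, b=1.0):
    # factorized Laplacian prior p(z_l) = exp(-|z_l|/b) / (2b)
    return float(np.sum(-np.abs(z) / b - np.log(2 * b)))

theta = np.pi / 4
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
z = np.array([1.0, 0.0])            # sparse, axis-aligned latent code

assert np.isclose(log_gauss(z), log_gauss(rot @ z))   # rotation-invariant
assert log_laplace(z) > log_laplace(rot @ z)          # symmetry broken
```

Since the model is trained by maximum likelihood, the Laplacian prior thus pushes the learned transformation toward representations where features align with individual latent axes.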

4. EXPERIMENTS

Synthetic multi-scale datasets. To illustrate RG-Flow's ability to disentangle representations at different scales and at different spatial locations, we propose two synthetic datasets with multi-scale features, named MSDS1 and MSDS2. Their samples are shown in Appendix B. In each image, there are 16 ovals with different colors and orientations. In MSDS1, all ovals in an image have almost the same color, while their orientations are randomly distributed, so the color is a global feature and the orientation is a local feature. In MSDS2, on the contrary, the orientation is a global feature and the color is a local one. We implement RG-Flow as shown in Fig. 2. After training, we find that RG-Flow easily captures the characteristics of those datasets: the ovals in each image from MSDS1 have almost the same color, and those from MSDS2 have the same orientation. In particular, in Fig. 3, we plot the effect of varying latent variables at different levels, together with their receptive fields. For MSDS1, if we vary a high-level latent variable, the color of the whole image changes, which shows that the network has captured the global feature of the dataset; if we vary a low-level latent variable, only the orientation of the corresponding oval changes. As the ovals are spatially separated, the low-level representations of different ovals are disentangled. Similarly, for MSDS2, varying a high-level latent variable changes the orientations of all ovals, and varying a low-level latent variable changes the color of only the corresponding oval. For comparison, we also trained Real NVP on our synthetic datasets and find that it fails to learn the global and local characteristics of those datasets. Details can be found in Appendix B.
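The factor structure of the two datasets can be sketched as follows (the exact rendering used in the paper may differ; this only generates the per-oval color and orientation factors):

```python
import numpy as np

# Sketch of the MSDS factor structure: each image has 16 ovals; MSDS1
# shares one color across all ovals (global factor) with independent
# per-oval orientations (local factor), and MSDS2 swaps the two roles.

def sample_factors(dataset, rng):
    if dataset == "MSDS1":
        color = np.repeat(rng.uniform(0, 1, size=(1, 3)), 16, axis=0)
        angle = rng.uniform(0, np.pi, size=16)
    elif dataset == "MSDS2":
        color = rng.uniform(0, 1, size=(16, 3))
        angle = np.repeat(rng.uniform(0, np.pi), 16)
    return color, angle

rng = np.random.default_rng(0)
c1, a1 = sample_factors("MSDS1", rng)
assert np.allclose(c1, c1[0])      # color is the global factor in MSDS1
c2, a2 = sample_factors("MSDS2", rng)
assert np.allclose(a2, a2[0])      # orientation is the global factor in MSDS2
```

A model that disentangles scales should route the shared factor to a high-level latent and the per-oval factors to spatially separated low-level latents, which is exactly what Fig. 3 shows for RG-Flow.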

Figure 3: varying high-level and low-level latent variables for MSDS1 and MSDS2.

After training, the network learns to progressively generate finer-grained images, as shown in Fig. 4(a). The colors in the coarse-grained images are not necessarily the same as those at the same positions in the fine-grained images, because there is no constraint preventing the RG transformation from mixing color channels. Receptive fields. To visualize the latent-space representation, we calculate the receptive field for each latent variable and list some of them in Fig. 4(b). The receptive field size is small for low-level variables and large for high-level ones, as implied by the generation causal cone. At the lowest level (h = 0), the receptive fields are merely small dots. At the second lowest level (h = 1), small structures emerge, such as an eyebrow, an eye, or a part of the hair. At the middle level (h = 2), eyebrow, eye, and forehead-bang structures emerge. At the highest level (h = 3), each receptive field grows to cover the whole image. We investigate those explainable latent representations in the next section. For comparison, we show the receptive fields of Real NVP in Appendix C. Even though Real NVP has a multi-scale structure, it is not locally constrained, and semantic representations at different scales do not emerge. Learned features on different scales. In this section, we show that some of these emergent structures correspond to explainable latent features. Flow-based generative models encode maximally, in the sense that their core is a bijective map, which preserves the dimensionality between data and encoding. Usually, the images in a dataset live on a low-dimensional manifold, so not all dimensions are needed to encode the data. In Fig. 4(c) we show the statistics of the strength of the receptive fields.
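The receptive-field estimate RF_l = E_z |∂G(z)/∂z_l|_c can be computed by finite differences. The sketch below uses a toy local generator (a stand-in for RG-Flow) where each output pixel depends only on nearby latents, so the resulting receptive field is spatially localized.

```python
import numpy as np

# Finite-difference sketch of RF_l = E_z |dG(z)/dz_l| on a toy local
# 1D generator (a stand-in, not RG-Flow): pixel i mixes latents
# i-1, i, i+1, so the receptive field of one latent covers 3 pixels.

def toy_generator(z):
    return z + 0.5 * np.roll(z, 1) + 0.5 * np.roll(z, -1)

def receptive_field(l, n=16, n_samples=64, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    rf = np.zeros(n)
    for _ in range(n_samples):
        z = rng.normal(size=n)
        dz = np.zeros(n)
        dz[l] = eps
        # |G(z + dz) - G(z)| / eps approximates |dG/dz_l| per pixel
        rf += np.abs(toy_generator(z + dz) - toy_generator(z)) / eps
    return rf / n_samples

rf = receptive_field(l=8)
assert rf[7] > 0 and rf[8] > 0 and rf[9] > 0   # localized around pixel 8
```

For RG-Flow itself, the same estimate would be taken per color channel (the 1-norm |·|_c) with z sampled from the prior; latents whose receptive fields do not overlap are automatically disentangled.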
Most of the latent variables have receptive fields with relatively small strength, meaning that changing their values does not affect the generated images much. We therefore focus on latent variables whose receptive field strength is greater than one, which have visible effects on the generated images. We use h to label the RG level of latent variables; for example, the lowest-level latent variables have h = 0, whereas the highest-level latent variables have h = 4. In the following, we focus on h = 1 (low-level), h = 2 (mid-level), and h = 3 (high-level) latent variables. A few latent variables with h = 0 also have visible effects, but their receptive fields are only small dots with no emergent structures. For the high-level latent representations, we found in total 30 latent variables with visible effects, six of which are identified with disentangled and explainable meanings: gender, emotion, light angle, azimuth, hair color, and skin color. In Fig. 5(a), we plot the effect of varying these six high-level variables, together with their receptive fields. For the mid-level latent representations, we plot the four leading variables together with their receptive fields in Fig. 5(b); they control an eye, an eyebrow, the upper-right bang, and the collar, respectively. For the low-level representations, some leading variables control an eyebrow and an eye, as shown in Fig. 5(c). These variables achieve better-disentangled representations when their receptive fields do not overlap. Image mixing in the scaling direction. Given two images x_A and x_B, conventional image mixing takes a linear combination between z_A = G^{-1}(x_A) and z_B = G^{-1}(x_B) by z = α z_A + (1 - α) z_B with α ∈ [0, 1], and generates the mixed image x = G(z). In our model, the latent variables z are indexed by the pixel position (i, j) and the RG level h.
Direct access to the latent variable at each position (i, j) enables us to mix latent variables in a different manner, which may be dubbed "hyperbolic mixing". We consider mixing the large-scale (high-level) features of x_A and the small-scale (low-level) features of x_B by combining their corresponding latent variables via z^{(h)} = z_A^{(h)} for h ≥ Θ and z^{(h)} = z_B^{(h)} for h < Θ, where Θ serves as a dividing line between the scales. As shown in Fig. 6(a), as we change Θ from 0 to 3, more low-level information in the blonde-hair image is mixed with the high-level information of the black-hair image. In particular, when Θ = 3, the mixed face has eyes, nose, eyebrows, and mouth similar to the blonde-hair image, while the high-level information, such as face orientation and hair color, is taken from the black-hair image. In addition, this mixing is not symmetric under the interchange of z_A and z_B; see Fig. 6(b) for comparison. This hyperbolic mixing achieves an effect similar to StyleGAN (Karras et al., 2019; 2020), in that we can take mid-level information from one image and mix it with the high-level information of another image. In Fig. 6(c), we show more examples of mixed faces. Image inpainting and error correction. The existence of the inference causal cone ensures that at most O(log L) latent variables are affected if a small local region is corrupted and needs to be inpainted. In Fig. 7, we show that RG-Flow can faithfully recover the corrupted region (marked in red) using only the latent variables located inside the inference causal cone, which are around one third of all latent variables. For comparison, if we randomly pick the same number of latent variables to modify in Real NVP, it fails to inpaint, as shown in Fig. 7 (Constrained Real NVP). To achieve a recovery of similar quality with Real NVP, all latent variables, which are of order O(L^2), need to be modified, as shown in Fig. 7 (Real NVP).
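The level-wise combination above is a hard switch on the RG level h rather than a linear interpolation. A minimal sketch, with hypothetical latent codes stored as one tensor per level:

```python
def hyperbolic_mix(z_A, z_B, theta):
    """Mix two hierarchical latent codes by RG level.

    z_A, z_B: dicts mapping RG level h -> latent tensor at that level.
    Levels h >= theta (coarse, high-level) are taken from image A;
    levels h <  theta (fine, low-level) are taken from image B.
    """
    return {h: (z_A[h] if h >= theta else z_B[h]) for h in z_A}

# Toy latent codes with four RG levels (placeholder strings instead of tensors).
z_A = {h: f"A{h}" for h in range(4)}
z_B = {h: f"B{h}" for h in range(4)}
print(hyperbolic_mix(z_A, z_B, theta=2))  # {0: 'B0', 1: 'B1', 2: 'A2', 3: 'A3'}
```

Sweeping theta from 0 to 3 reproduces the behavior in Fig. 6(a): theta = 0 returns image A unchanged, and each increment hands one more fine scale over to image B.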
See Appendix F for more details about the inpainting task and its quantitative evaluations.
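The O(log L) scaling can be illustrated with a rough counting model of the inference causal cone. The kernel size, widening rule, and per-level counting below are simplifying assumptions rather than the paper's exact architecture; the takeaway is that the cone contributes O(1) latents per RG level, and there are O(log L) levels.

```python
import math

def cone_latents(L, m=4, region=1):
    """Rough count of latent variables inside the inference causal cone.

    At each RG step the image edge halves; a corrupted region of `region`
    pixels widens by at most (m - 1) before halving, so its projected width
    saturates at a constant and the total count grows like log L.
    """
    total, levels = 0, int(math.log2(L))
    width = region
    for _ in range(levels):
        width = math.ceil((width + m - 1) / 2)  # widen by the kernel, then halve
        total += width * width * 3              # 3/4 of variables decimated per level
    return total, levels

for L in (16, 32, 64):
    print(L, cone_latents(L))  # count grows linearly in log L, not in L**2
```

Doubling L adds one RG level and a constant number of cone latents, in contrast to the O(L^2) latents of a globally mixed model such as Real NVP.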



Figure 1: (a) The forward RG transformation splits out decimated features at different scales. (b) The inverse RG transformation generates the fine-grained image from latent variables.

Fig. 2(b) shows the top-down view of a step of the RG transformation. The green/yellow blocks (disentanglers/decimators) are interwoven on top of each other. The covering area of a disentangler or decimator is defined by the kernel size m × m of the bijector; for example, in Fig. 2(b), the kernel size is 4 × 4. After the decimator, three-fourths of the degrees of freedom are decimated into latent variables (red crosses in Fig. 2(a)), so the edge length of the image is halved.
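The bookkeeping of one decimation step can be sketched as follows. This shows the structure only, using 2 × 2 blocks instead of the paper's 4 × 4 kernels and no learned bijector: one quarter of the variables stay on the coarse grid, three-fourths split off as latents, and the edge length halves.

```python
import numpy as np

def rg_step(x, m=2):
    """One decimation step on an L x L image (structure only, no learned map).

    The image is split into m x m blocks; one variable per block is kept as
    the coarse-grained image, and the remaining m*m - 1 variables per block
    are split off as decimated latent variables.
    """
    L = x.shape[0]
    blocks = (x.reshape(L // m, m, L // m, m)
               .transpose(0, 2, 1, 3)
               .reshape(L // m, L // m, m * m))
    coarse = blocks[..., 0]    # kept degrees of freedom (1/4 for m = 2)
    latents = blocks[..., 1:]  # decimated degrees of freedom (3/4 for m = 2)
    return coarse, latents

coarse, latents = rg_step(np.arange(16).reshape(4, 4))
print(coarse.shape, latents.shape)  # (2, 2) (2, 2, 3)
```

Iterating `rg_step` on successive `coarse` outputs reproduces the hierarchy of Fig. 2(a): each level stores its own latents, and the image shrinks until only the top-level code remains.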

Figure 2: Subplot (a) shows the side view of the network. Green/yellow blocks denote the disentanglers/decimators, which are bijective maps. Subplot (b) shows the top-down view of the network. The red area in subplot (c) illustrates the generation causal cone of a latent variable. The blue area in subplot (d) illustrates the inference causal cone of a visible variable.

Figure 3: Multi-scale latent representations for MSDS1 and MSDS2.

Figure 4: Subplot (a) shows the progressive generation of images during the inverse RG. Subplot (b) shows some receptive fields of latent variables from low level to high level. The strength of each receptive field is rescaled to one for better visualization. Subplot (c) shows the statistics of the receptive fields' strength.

Figure 5: Semantic factors found on different levels.

Figure 6: Image mixing in the hyperbolic tree-like latent space.


Figure 7: Inpainting locally corrupted images.

5. DISCUSSION AND CONCLUSION

In this paper, we combined the ideas of the renormalization group and sparse prior distributions to design RG-Flow, a probabilistic flow-based generative model. This versatile architecture can incorporate any bijective map to achieve an expressive flow-based generative model. We have shown that RG-Flow can separate information at different scales and encode it in latent variables living on a hyperbolic tree. To visualize the latent representations in RG-Flow, we defined receptive fields for flow-based models in analogy to those in CNNs. Taking the CelebA dataset as our main example, we have shown that RG-Flow finds not only high-level representations, but also mid-level and low-level ones. The receptive fields serve as visual guidance for finding explainable representations. In contrast, semantic representations of mid-level and low-level structures do not emerge in globally connected multi-scale flow models, such as Real NVP. We have also shown that the latent representations can be mixed at different scales, which achieves an effect similar to style mixing. In our model, if the receptive fields of two latent representations do not overlap, they are naturally disentangled. For high-level representations, we propose to utilize a sparse prior to encourage disentanglement. We find that if the dataset contains only a few high-level factors, such as the 3D Chair dataset (Aubry et al., 2014) shown in Appendix G, it is hard to find explainable high-level disentangled representations, because of the redundant nature of the encoding in flow-based models. Incorporating information-theoretic criteria to disentangle high-level representations in this redundant encoding procedure will be an interesting future direction.

