EFFICIENT SAMPLING FOR GENERATIVE ADVERSARIAL NETWORKS WITH REPARAMETERIZED MARKOV CHAINS

Abstract

Recently, sampling methods have been successfully applied to enhance the sample quality of Generative Adversarial Networks (GANs). However, in practice, they typically suffer from poor sample efficiency because the proposals are drawn independently from the generator. In this work, we propose REP-GAN, a novel sampling method that allows general dependent proposals by REParameterizing the Markov chains into the latent space of the generator. Theoretically, we show that our reparameterized proposal admits a closed-form Metropolis-Hastings acceptance ratio. Empirically, extensive experiments on synthetic and real datasets demonstrate that our REP-GAN substantially improves sample efficiency while simultaneously obtaining better sample quality.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have achieved great success in generating realistic images in recent years (Karras et al., 2019; Brock et al., 2019). Unlike previous models that explicitly parameterize the data distribution, GANs rely on alternating optimization between a generator and a discriminator to learn the data distribution implicitly. However, in practice, samples generated by GANs still suffer from problems such as mode collapse and bad artifacts.

Recently, sampling methods have shown promising results in enhancing the sample quality of GANs by making use of the information in the discriminator. In the alternating training scheme of GANs, the generator only performs a few updates in the inner loop and does not fully utilize the density ratio information estimated by the discriminator. Thus, after GAN training, sampling methods further exploit this information to bridge the gap between the generative distribution and the data distribution in a fine-grained manner. For example, DRS (Azadi et al., 2019) applies rejection sampling, and MH-GAN (Turner et al., 2019) adopts Markov chain Monte Carlo (MCMC) sampling to improve the sample quality of GANs.

Nevertheless, these methods still suffer from the sample efficiency problem. For example, as will be shown in Section 5, MH-GAN's average acceptance ratio on CIFAR-10 can be lower than 5%, which makes the Markov chains slow to mix. As MH-GAN adopts an independent proposal q, i.e., q(x'|x) = q(x'), the difference between samples can be so large that the proposal gets rejected easily. To address this limitation, we propose to generalize the independent proposal to a general dependent proposal q(x'|x). To this end, the proposed sample can be a refinement of the previous one, which leads to a higher acceptance ratio and better sample quality. We can also balance between the exploration and exploitation of the Markov chains by tuning the step size.
However, it is hard to design a proper dependent proposal in the high-dimensional sample space X because the energy landscape could be very complex (Neal et al., 2010). Nevertheless, we notice that the generative distribution p_g(x) of GANs is implicitly defined as the push-forward of the latent prior distribution p_0(z), and designing proposals in the low-dimensional latent space is generally much easier. Hence, GAN's latent variable structure motivates us to design a structured dependent proposal with two pairing Markov chains, one in the sample space X and the other in the latent space Z. As shown in Figure 1, given the current pairing samples (z_k, x_k), we draw the next proposal x' in a bottom-to-up way: 1) drawing a latent proposal z' following q(z'|z_k); 2) pushing it forward through the generator to get the sample proposal x' = G(z'); 3) assigning x_{k+1} = x' if the proposal x' is accepted, and otherwise rejecting it, x_{k+1} = x_k. By utilizing the underlying structure of GANs, the proposed reparameterized sampler becomes more efficient in the low-dimensional latent space.

[Figure 1: the two pairing Markov chains, z_k → z_{k+1} → z_{k+2} → … in the latent space and x_k → x_{k+1} → x_{k+2} → … in the sample space, linked through the generator G.]

We summarize our main contributions as follows:

• We propose a structured dependent proposal for GANs, which reparameterizes the sample-level transition x_k → x' into the latent-level transition z_k → z' with two pairing Markov chains. We prove that our reparameterized proposal admits a tractable acceptance criterion.
• Our proposed method, called REP-GAN, serves as a unified framework for the existing sampling methods of GANs. It provides a better balance between exploration and exploitation through the structured dependent proposal, and also corrects the bias of Markov chains through the acceptance-rejection step.

• Empirical results demonstrate that REP-GAN achieves better image quality and much higher sample efficiency than the state-of-the-art methods on both synthetic and real datasets.

2. RELATED WORK

Although GANs are able to synthesize high-quality images, the minimax nature of GANs makes their training quite unstable, which often results in degraded sample quality. A vast literature has since been developed to fix the problems of GANs, including novel network modules (Miyato et al., 2018), training mechanisms (Metz et al., 2017), and alternative objectives (Arjovsky et al., 2017).

Moreover, there is another line of work using sampling methods to improve the sample quality of GANs. DRS (Azadi et al., 2019) first proposes to use rejection sampling. MH-GAN (Turner et al., 2019) instead uses the Metropolis-Hastings (MH) algorithm with an independent proposal. DDLS (Che et al., 2020) and DCD (Song et al., 2020) apply gradient-based proposals by viewing a GAN as an energy-based model. Tanaka (2019) proposes a similar gradient-based method named DOT from the perspective of optimal transport. Different from these works, our REP-GAN introduces a structured dependent proposal through latent reparameterization, and combines all three effective sampling mechanisms, namely the Markov chain Monte Carlo method, the acceptance-rejection step, and the latent gradient-based proposal, to further improve the sample efficiency. As shown in Table 1, many existing works are special cases of our REP-GAN.

Our method also belongs to the thread of works that combine MCMC and neural networks for better sample quality. Previously, some works combine variational autoencoders (Kingma & Welling, 2014) and MCMC to bridge the amortization gap (Salimans et al., 2015; Hoffman, 2017; Li et al., 2017), while others directly learn a neural proposal function for MCMC (Song et al., 2017; Levy et al., 2018; Wang et al., 2018). Our work instead reparameterizes the high-dimensional sample-level transition into a simpler low-dimensional latent space via the learned generator network.

3. BACKGROUND

3.1. GAN

GANs model the data distribution $p_d(x)$ implicitly with a generator $G : \mathcal{Z} \to \mathcal{X}$ mapping from a low-dimensional latent space $\mathcal{Z}$ to a high-dimensional sample space $\mathcal{X}$,

$$x = G(z), \quad z \sim p_0(z), \qquad (1)$$

where the sample $x$ follows the generative distribution $p_g(x)$ and the latent variable $z$ follows the prior distribution $p_0(z)$, e.g., a standard normal distribution $\mathcal{N}(0, I)$. In GAN, a discriminator $D : \mathcal{X} \to [0, 1]$ is learned to distinguish samples from $p_d(x)$ and $p_g(x)$ in an adversarial way:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_d(x)} \log(D(x)) + \mathbb{E}_{z \sim p_0(z)} \log(1 - D(G(z))). \qquad (2)$$

Goodfellow et al. (2014) point out that an optimal discriminator $D$ implies the density ratio between the data and generative distributions:

$$D(x) = \frac{p_d(x)}{p_d(x) + p_g(x)} \;\Rightarrow\; \frac{p_d(x)}{p_g(x)} = \left(\frac{1}{D(x)} - 1\right)^{-1}. \qquad (3)$$
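To make Eqn. (3) concrete, here is a minimal numerical sketch (not from the paper's code) of recovering the density ratio from a discriminator score:

```python
def density_ratio(d_score):
    """Recover p_d(x)/p_g(x) = (1/D(x) - 1)^(-1) from an optimal
    discriminator score D(x) = p_d(x) / (p_d(x) + p_g(x)) (Eqn. (3))."""
    return 1.0 / (1.0 / d_score - 1.0)

# If p_d(x) = 3 and p_g(x) = 1 at some point x, the optimal score is
# D(x) = 3 / (3 + 1) = 0.75, and the recovered ratio is 3 (up to
# floating-point rounding).
ratio = density_ratio(0.75)
print(round(ratio, 6))  # -> 3.0
```

This inversion is what allows all the sampling methods below to treat the trained discriminator as a density ratio estimator.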

3.2. MCMC

Markov chain Monte Carlo (MCMC) refers to a family of sampling methods that draw a chain of samples $x_{1:K} \in \mathcal{X}^K$ from a target distribution $p_t(x)$. We denote the initial distribution by $p_0(x)$ and the proposal distribution by $q(x'|x_k)$. With the Metropolis-Hastings (MH) algorithm, we accept the proposal $x' \sim q(x'|x_k)$ with probability

$$\alpha(x', x_k) = \min\left(1, \frac{p_t(x')\, q(x_k|x')}{p_t(x_k)\, q(x'|x_k)}\right) \in [0, 1]. \qquad (4)$$

If $x'$ is accepted, $x_{k+1} = x'$; otherwise $x_{k+1} = x_k$. Under mild assumptions, the Markov chain is guaranteed to converge to $p_t(x)$ as $K \to \infty$. In practice, the sample efficiency of MCMC crucially depends on the proposal distribution's trade-off between exploration and exploitation.
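As a self-contained illustration of Eqn. (4), the following toy sketch runs random-walk MH on a standard normal target (not the GAN setting); with a symmetric Gaussian proposal, the $q$ terms cancel:

```python
import math
import random

def metropolis_hastings(log_target, x0, steps, step_size=0.5, seed=0):
    """Random-walk Metropolis-Hastings. With a symmetric Gaussian
    proposal, q(x_k|x')/q(x'|x_k) = 1, so the acceptance probability
    of Eqn. (4) reduces to the target ratio p_t(x')/p_t(x_k)."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(steps):
        proposal = x + step_size * rng.gauss(0.0, 1.0)
        alpha = min(1.0, math.exp(log_target(proposal) - log_target(x)))
        if rng.random() < alpha:
            x, accepted = proposal, accepted + 1
        chain.append(x)
    return chain, accepted / steps

# Target: a standard normal (log-density up to a constant). The chain
# mean drifts toward 0, and the acceptance ratio reflects the step size.
chain, acc = metropolis_hastings(lambda v: -0.5 * v * v, x0=3.0, steps=20000)
```

Tuning `step_size` is exactly the exploration/exploitation trade-off mentioned above: tiny steps are almost always accepted but explore slowly, while huge steps are mostly rejected.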

4. THE PROPOSED REP-GAN

In this section, we first review MH-GAN and point out the limitations. We then propose our structured dependent proposal to overcome these obstacles, and finally discuss its theoretical properties as well as practical implementations.

4.1. FROM INDEPENDENT PROPOSAL TO DEPENDENT PROPOSAL

MH-GAN (Turner et al., 2019) first proposes to improve GAN sampling with MCMC. Specifically, given a perfect discriminator D and a decent (but imperfect) generator G after training, they take the data distribution $p_d(x)$ as the target distribution and use the generator distribution $p_g(x)$ as an independent proposal:

$$x' \sim q(x'|x_k) = q(x') = p_g(x'). \qquad (5)$$

With the MH criterion (Eqn. (4)) and the density ratio (Eqn. (3)), we should accept $x'$ with probability

$$\alpha_{\mathrm{MH}}(x', x_k) = \min\left(1, \frac{p_d(x')\, q(x_k)}{p_d(x_k)\, q(x')}\right) = \min\left(1, \frac{D(x_k)^{-1} - 1}{D(x')^{-1} - 1}\right). \qquad (6)$$

However, to achieve tractability, MH-GAN adopts an independent proposal q(x') with poor sample efficiency. As the proposed sample x' is independent of the current sample x_k, the difference between the two samples can be so large that it results in a very low acceptance probability. Consequently, samples can be trapped in the same place for a long time, leading to a very slow mixing of the chain.

A natural solution is to take a dependent proposal q(x'|x_k) that proposes a sample x' close to the current one x_k, which is more likely to be accepted. Nevertheless, the MH acceptance criterion of such a dependent proposal,

$$\alpha_{\mathrm{DEP}}(x', x_k) = \min\left(1, \frac{p_d(x')\, q(x_k|x')}{p_d(x_k)\, q(x'|x_k)}\right), \qquad (7)$$

is generally intractable because the data density $p_d(x)$ is unknown. Besides, it is hard to design a proper dependent proposal q(x'|x_k) in the high-dimensional sample space $\mathcal{X}$ with its complex landscape. These obstacles prevent us from adopting a dependent proposal that is more suitable for MCMC.

4.2. A TRACTABLE STRUCTURED DEPENDENT PROPOSAL WITH REPARAMETERIZED MARKOV CHAINS

As discussed above, the major difficulty with a general dependent proposal q(x'|x_k) is computing the MH criterion. We show that it can be made tractable by considering an additional pairing Markov chain in the latent space. As we know, samples of GANs lie in a low-dimensional manifold induced by the push-forward of the latent variable. Suppose that at the k-th step of the Markov chain, we have a GAN sample $x_k$ with latent $z_k$. Instead of drawing a sample x' directly from a sample-level proposal distribution q(x'|x_k), we first draw a latent proposal z' from a dependent latent proposal distribution q(z'|z_k). Afterward, we push the latent z' forward through the generator and take the output x' = G(z') as our sample proposal.

As illustrated in Figure 1, our bottom-to-up proposal relies on the transition reparameterization with two pairing Markov chains in the sample space $\mathcal{X}$ and the latent space $\mathcal{Z}$. Hence we call it a REP (reparameterized) proposal. Through a learned generator, we transport the transition $x_k \to x'$ in the high-dimensional space $\mathcal{X}$ into the transition $z_k \to z'$ in the low-dimensional space $\mathcal{Z}$, which enjoys a much better landscape and makes it easier to design proposals in MCMC algorithms. For example, the latent target distribution is nearly standard normal when the generator is nearly perfect. In fact, under mild conditions, the REP proposal distribution $q_{\mathrm{REP}}(x'|x_k)$ and the latent proposal distribution $q(z'|z_k)$ are tied by the following change of variables (Gemici et al., 2016; Ben-Israel, 1999):

$$\log q_{\mathrm{REP}}(x'|x_k) = \log q(z'|z_k) - \frac{1}{2} \log \det\left(J_{z'}^\top J_{z'}\right), \qquad (8)$$

where $J_z$ denotes the Jacobian matrix of the push-forward G at z, i.e., $[J_z]_{ij} = \partial x_i / \partial z_j$, $x = G(z)$. Nevertheless, it remains unclear whether we can perform the MH test to decide the acceptance of the proposal x'. Note that a general dependent proposal distribution does not admit a tractable MH acceptance criterion (Eqn. (7)).
Perhaps surprisingly, it can be shown that with our structured REP proposal, the MH acceptance criterion is tractable for general latent proposals q(z'|z_k).

Theorem 1. Consider a Markov chain of GAN samples $x_{1:K}$ with initial distribution $p_g(x)$. For step k+1, we accept our REP proposal $x' \sim q_{\mathrm{REP}}(x'|x_k)$ with probability

$$\alpha_{\mathrm{REP}}(x', x_k) = \min\left(1, \frac{p_0(z')\, q(z_k|z')}{p_0(z_k)\, q(z'|z_k)} \cdot \frac{D(x_k)^{-1} - 1}{D(x')^{-1} - 1}\right), \qquad (9)$$

i.e., let $x_{k+1} = x'$ if x' is accepted and $x_{k+1} = x_k$ otherwise. Further assume the chain is irreducible, aperiodic, and not transient. Then, according to the Metropolis-Hastings algorithm, the stationary distribution of this Markov chain is the data distribution $p_d(x)$ (Gelman et al., 2013).

Proof. Note that, similar to Eqn. (8), we also have the change of variables between $p_g(x)$ and $p_0(z)$:

$$\log p_g(x)\big|_{x=G(z)} = \log p_0(z) - \frac{1}{2} \log \det\left(J_z^\top J_z\right). \qquad (10)$$

According to Gelman et al. (2013), the assumptions that the chain is irreducible, aperiodic, and not transient ensure that the chain has a unique stationary distribution, and the MH algorithm ensures that this stationary distribution equals the target distribution $p_d(x)$. Thus we only need to show that the MH criterion in Eqn. (9) holds. Together with Eqn. (3), (7), and (8), we have

$$
\begin{aligned}
\frac{p_d(x')\, q(x_k|x')}{p_d(x_k)\, q(x'|x_k)}
&= \frac{p_d(x')\, q(z_k|z') \left[\det\left(J_{z_k}^\top J_{z_k}\right)\right]^{-\frac{1}{2}}}{p_d(x_k)\, q(z'|z_k) \left[\det\left(J_{z'}^\top J_{z'}\right)\right]^{-\frac{1}{2}}} \\
&= \frac{q(z_k|z') \left[\det\left(J_{z_k}^\top J_{z_k}\right)\right]^{-\frac{1}{2}} p_0(z') \left[\det\left(J_{z'}^\top J_{z'}\right)\right]^{-\frac{1}{2}} \left(D(x_k)^{-1} - 1\right)}{q(z'|z_k) \left[\det\left(J_{z'}^\top J_{z'}\right)\right]^{-\frac{1}{2}} p_0(z_k) \left[\det\left(J_{z_k}^\top J_{z_k}\right)\right]^{-\frac{1}{2}} \left(D(x')^{-1} - 1\right)} \\
&= \frac{p_0(z')\, q(z_k|z') \left(D(x_k)^{-1} - 1\right)}{p_0(z_k)\, q(z'|z_k) \left(D(x')^{-1} - 1\right)},
\end{aligned}
$$

where the first equality uses Eqn. (8), and the second substitutes $p_d(x) = p_g(x)\,(D(x)^{-1} - 1)^{-1}$ (Eqn. (3)) together with $p_g(x) = p_0(z)\,[\det(J_z^\top J_z)]^{-1/2}$ (Eqn. (10)); the determinant terms then cancel. Hence the proof is completed.

The theorem above demonstrates the following favorable properties of our method:

• The discriminator score ratio is the same as in $\alpha_{\mathrm{MH}}(x', x_k)$, but MH-GAN is restricted to a specific independent proposal. Our method instead works for any latent proposal q(z'|z_k). When we take q(z'|z_k) = p_0(z'), our method reduces to MH-GAN.

• Compared to $\alpha_{\mathrm{DEP}}(x', x_k)$ of a general dependent proposal (Eqn. (7)), the unknown data density terms are successfully canceled in the reparameterized acceptance criterion.

• The reparameterized MH acceptance criterion becomes tractable, as it only involves the latent priors, the latent proposal distributions, and the discriminator scores.

Combining the REP proposal $q_{\mathrm{REP}}(x'|x_k)$ and its tractable MH criterion $\alpha_{\mathrm{REP}}(x', x_k)$, we have developed a novel sampling method for GANs, coined REP-GAN. See Algorithm 1 in the appendix for a detailed description. Moreover, our method can serve as a general approximate inference technique for Bayesian models by bridging MCMC and GANs. Previous works (Marzouk et al., 2016; Titsias, 2017; Hoffman et al., 2019) also propose to avoid the bad geometry of a complex probability measure by reparameterizing the Markov transitions into a simpler measure. However, these methods are limited to explicit invertible mappings without dimensionality reduction. In our work, we first show that it is also tractable to conduct such model-based reparameterization with implicit models like GANs.
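For intuition, Eqn. (9) can be evaluated from just the latent prior densities and the two discriminator scores. Below is a minimal sketch (not the paper's code), assuming a standard normal prior and a symmetric latent proposal (e.g., a Gaussian random walk) so that the q terms cancel:

```python
import math

def log_p0(z):
    """Log-density of the standard normal latent prior, up to a constant."""
    return -0.5 * sum(v * v for v in z)

def rep_alpha(z_prop, z_cur, d_prop, d_cur):
    """REP-GAN acceptance probability (Eqn. (9)) for a *symmetric*
    latent proposal q(z'|z_k), whose forward/backward terms cancel."""
    log_ratio = (log_p0(z_prop) - log_p0(z_cur)
                 + math.log(1.0 / d_cur - 1.0)
                 - math.log(1.0 / d_prop - 1.0))
    return min(1.0, math.exp(log_ratio))

# A proposal with both a higher prior density and a higher discriminator
# score is always accepted (hypothetical score values):
print(rep_alpha([0.1, 0.1], [2.0, 2.0], d_prop=0.6, d_cur=0.5))  # -> 1.0
```

Note how no data density appears anywhere, which is exactly the cancellation that Theorem 1 establishes.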

4.3. A PRACTICAL IMPLEMENTATION

REP-GAN enables us to utilize the vast literature of existing MCMC algorithms (Neal et al., 2010) to design dependent proposals for GANs. We take Langevin Monte Carlo (LMC) as an example. As an Euler-Maruyama discretization of the Langevin dynamics, LMC updates the Markov chain with

$$x_{k+1} = x_k + \frac{\tau}{2}\nabla_x \log p_t(x_k) + \sqrt{\tau}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I), \qquad (11)$$

for a target distribution $p_t(x)$. Compared to MH-GAN, LMC utilizes the gradient information to explore the energy landscape more efficiently. However, if we directly take the (unknown) data distribution $p_d(x)$ as the target distribution $p_t(x)$, LMC does not admit a tractable update rule.

As discussed above, the reparameterization of REP-GAN makes it easier to design transitions in the low-dimensional latent space. Hence, we instead propose to use LMC for the latent Markov chain. We assume that the data distribution also lies in the low-dimensional manifold induced by the generator, i.e., $\mathrm{Supp}(p_d) \subset \mathrm{Im}(G)$. This implies that the data distribution $p_d(x)$ also has a pairing distribution in the latent space, denoted $p_t(z)$. They are tied by the change of variables

$$\log p_d(x)\big|_{x=G(z)} = \log p_t(z) - \frac{1}{2}\log\det\left(J_z^\top J_z\right). \qquad (12)$$

Taking $p_t(z)$ as the (unknown) target distribution of the latent Markov chain, we have the following Latent LMC (L2MC) proposal:

$$
\begin{aligned}
z' &= z_k + \frac{\tau}{2}\nabla_z \log p_t(z_k) + \sqrt{\tau}\,\varepsilon \\
&= z_k + \frac{\tau}{2}\nabla_z \log \frac{p_t(z_k)\left[\det\left(J_{z_k}^\top J_{z_k}\right)\right]^{-\frac{1}{2}}}{p_0(z_k)\left[\det\left(J_{z_k}^\top J_{z_k}\right)\right]^{-\frac{1}{2}}} + \frac{\tau}{2}\nabla_z \log p_0(z_k) + \sqrt{\tau}\,\varepsilon \\
&= z_k + \frac{\tau}{2}\nabla_z \log \frac{p_d(x_k)}{p_g(x_k)} + \frac{\tau}{2}\nabla_z \log p_0(z_k) + \sqrt{\tau}\,\varepsilon \\
&= z_k - \frac{\tau}{2}\nabla_z \log\left(D(x_k)^{-1} - 1\right) + \frac{\tau}{2}\nabla_z \log p_0(z_k) + \sqrt{\tau}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I), \qquad (14)
\end{aligned}
$$

where $x_k = G(z_k)$; the third equality uses the change of variables in Eqn. (10) and (12), and the last uses the density ratio in Eqn. (3). As we can see, L2MC is made tractable by our structured dependent proposal with pairing Markov chains. DDLS (Che et al., 2020) proposes a similar Langevin proposal by formalizing GANs as an implicit energy-based model, while here we provide a straightforward derivation through reparameterization.
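A minimal sketch of one L2MC proposal step (an illustrative re-implementation, not the paper's code), assuming a standard normal prior so that grad_z log p_0(z) = -z; the gradient of the discriminator term is passed in as a callable, since in practice it comes from backpropagating through D and G:

```python
import math
import random

def l2mc_step(z, grad_log_ratio_term, tau=0.01, rng=None):
    """One L2MC proposal for a standard normal prior p_0:
        z' = z - (tau/2) * grad_z log(1/D(G(z)) - 1)
               + (tau/2) * grad_z log p_0(z) + sqrt(tau) * eps.
    `grad_log_ratio_term(z)` must return grad_z log(1/D(G(z)) - 1);
    here it is a user-supplied callable standing in for autodiff."""
    rng = rng or random.Random(0)
    g = grad_log_ratio_term(z)
    return [zi - 0.5 * tau * gi - 0.5 * tau * zi
            + math.sqrt(tau) * rng.gauss(0.0, 1.0)
            for zi, gi in zip(z, g)]

# With the discriminator term switched off (e.g., a perfect generator,
# where D = 1/2 everywhere so its gradient vanishes), the chain simply
# performs Langevin sampling of the standard normal prior:
rng = random.Random(1)
z = [3.0, -3.0]
for _ in range(1000):
    z = l2mc_step(z, lambda zz: [0.0] * len(zz), tau=0.1, rng=rng)
```

In a real GAN, `grad_log_ratio_term` would be computed by differentiating log(1/D(G(z)) - 1) with respect to z through both networks.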
Our major difference to DDLS is that REP-GAN also includes a tractable MH correction step (Eqn. ( 9)), which accounts for the numerical errors introduced by the discretization and ensures that detailed balance holds.

4.4. EXTENSION TO WGAN

Our method can also be extended to other kinds of GANs, like Wasserstein GAN (WGAN) (Arjovsky et al., 2017). The WGAN objective is

$$\min_G \max_D \; \mathbb{E}_{x \sim p_d(x)}[D(x)] - \mathbb{E}_{x \sim p_g(x)}[D(x)],$$

where $D : \mathcal{X} \to \mathbb{R}$ is restricted to be a Lipschitz function. Under certain conditions, WGAN also implies an approximate estimate of the density ratio (Che et al., 2020):

$$D(x) \approx \log \frac{p_d(x)}{p_g(x)} + \mathrm{const} \;\Rightarrow\; \frac{p_d(x)}{p_g(x)} \approx \exp(D(x)) \cdot \mathrm{const}.$$

Following the same derivations as in Eqn. (11) and (14), we obtain the WGAN version of REP-GAN. Specifically, with $x_k = G(z_k)$, the L2MC proposal follows

$$z' = z_k + \frac{\tau}{2}\nabla_z D(x_k) + \frac{\tau}{2}\nabla_z \log p_0(z_k) + \sqrt{\tau}\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),$$

and the MH acceptance criterion is

$$\alpha_{\mathrm{REP\text{-}W}}(x', x_k) = \min\left(1, \frac{q(z_k|z')\, p_0(z')}{q(z'|z_k)\, p_0(z_k)} \cdot \frac{\exp(D(x'))}{\exp(D(x_k))}\right).$$
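Relative to the standard GAN case, the WGAN criterion simply swaps the (1/D - 1) terms for exp(D). A minimal sketch under the same assumptions as before (standard normal prior, symmetric latent proposal; hypothetical critic values, not the paper's code):

```python
import math

def rep_wgan_alpha(z_prop, z_cur, d_prop, d_cur):
    """WGAN acceptance probability (alpha_REP-W) with a symmetric latent
    proposal: the density-ratio term uses exp(D) in place of 1/D - 1,
    and the unknown constant cancels between numerator and denominator."""
    log_p0 = lambda z: -0.5 * sum(v * v for v in z)
    log_ratio = log_p0(z_prop) - log_p0(z_cur) + d_prop - d_cur
    return min(1.0, math.exp(log_ratio))

# With equal prior densities, acceptance depends only on the critic gap:
print(rep_wgan_alpha([0.0], [0.0], d_prop=2.0, d_cur=1.0))  # -> 1.0
print(round(rep_wgan_alpha([0.0], [0.0], d_prop=1.0, d_cur=2.0), 3))  # -> 0.368
```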

5. EXPERIMENTS

We show our empirical results on both synthetic and real image datasets.

5.1. SYNTHETIC DATA

Following DOT (Tanaka, 2019) and DDLS (Che et al., 2020), we apply REP-GAN to the synthetic Swiss Roll dataset, where data samples lie on a Swiss roll manifold in two-dimensional space. We construct the dataset with scikit-learn using 100,000 samples, and train a WGAN as in Tanaka (2019), where both the generator and discriminator are fully connected neural networks with leaky ReLU nonlinearities. We optimize the model using the Adam optimizer with α = 0.0001, β1 = 0.5, β2 = 0.9. After training, we draw 1,000 samples with different sampling methods. As shown in Figure 2, with an appropriate step size (τ = 0.01), the gradient-based methods (DDLS and REP-GAN) outperform the independent proposals (DRS and MH-GAN) by a large margin, while the samples of DDLS trace a more discontinuous shape than those of REP-GAN. In DDLS, when the step size becomes too large (τ = 0.1, 1), the numerical error of the Langevin dynamics becomes so large that the chain either collapses or diverges. In contrast, such bad proposals are rejected by the MH correction step of REP-GAN, which prevents the misbehavior of the Markov chain.

5.2. REAL IMAGE DATA

Following MH-GAN (Turner et al., 2019), we conduct experiments on two real-world image datasets, CIFAR-10 and CelebA, for DCGAN (Radford et al., 2015) and WGAN (Arjovsky et al., 2017). Following the conventional evaluation protocol, we initialize each Markov chain with a GAN sample, run it for 640 steps, and take the last sample for evaluation. We collect 50,000 samples to evaluate the Inception Score (Salimans et al., 2016). The step size τ of our L2MC proposal is 0.01 on CIFAR-10 and 0.1 on CelebA. We calibrate the discriminator with logistic regression as in Turner et al. (2019). From Table 2, we can see that our method outperforms state-of-the-art methods on both datasets. We also plot the Inception Score and acceptance ratio per epoch in Figure 3 based on our re-implementation. Although the training process of GANs is known to be very unstable, our REP-GAN still outperforms previous sampling methods both consistently (superior in most epochs) and significantly (the improvement is larger than the error bar), as shown in the left panel of Figure 3. In the right panel, we find that the average acceptance ratio of MH-GAN is lower than 0.2 in most cases, while REP-GAN has an acceptance ratio of 0.4-0.8, which is known to be a good trade-off for MCMC algorithms. We also notice that the acceptance ratio goes down as the training continues. We suspect this is because the distribution landscape becomes more complex, so a constant sampling step size will produce more distinct samples that are more likely to get rejected. Ablation study. From Table 3, we can see that without the MH correction step, the Langevin steps often result in worse sample quality. Meanwhile, the acceptance ratio is very small on CIFAR-10 without the dependent REP proposal. As a result, our REP-GAN (REP+MH) is the only setup that consistently improves over the baseline and obtains the best Inception Score on each dataset.
The only exception is DCGAN on CelebA, where the independent proposal outperforms our REP proposal with a higher acceptance ratio. We believe this is because the human face samples of CelebA are very similar to one another, such that independent samples from the generator can also be easily accepted. Nevertheless, the acceptance ratio of the independent proposal can be much lower on datasets with diverse sources of images, like CIFAR-10. Markov chain visualization. In Figure 4, we demonstrate two Markov chains sampled with different methods. We can see that MH-GAN is often trapped in the same place because of the independent proposals. DDLS and REP-GAN instead gradually refine the samples with gradient steps. In addition, comparing the gradient-based methods, we can see that the MH rejection steps of REP-GAN help avoid some bad artifacts in the images. For example, in the camel-like images marked in red, the body of the camel is separated in the sample of DDLS (middle) but not in the sample of REP-GAN (bottom). Note that the evaluation protocol only needs the last step of the chain, so we prefer a small step size that fine-tunes the initial samples for better sample quality. As shown in Figure 5, our REP proposal can also produce very diverse images with a large step size.

A APPENDIX

A.1 ASSUMPTIONS AND IMPLICATIONS

Note that our method requires a few assumptions on the models for the analysis to hold. Here we state them explicitly and discuss their applicability and potential impacts.

Assumption 1. The generator mapping G : R^n → R^m (n < m) is injective, and its Jacobian matrix ∂G(z)/∂z, of size m × n, has full column rank for all z ∈ R^n.

For the change of variables in Eqn. (8) and (10) to hold, according to Ben-Israel (1999), we need the mapping to be injective and its Jacobian to have full column rank. A mild sufficient condition for injectivity is that the generator only contains (non-degenerate) affine layers and injective nonlinearities, like LeakyReLU. It is not hard to show that such a condition also implies the full rankness of the Jacobian. In fact, this architecture has already been found to benefit GANs and achieve state-of-the-art results (Tang, 2020). The affine layers here are also likely to be non-degenerate because their weights are randomly initialized and typically do not degenerate during the training of GANs.

Assumption 2. The discriminator D offers a perfect estimate of the density ratio between the generative distribution p_g(x) and the data distribution p_d(x), as in Eqn. (3).

This is a common, critical, but less practical assumption among the existing sampling methods for GANs. It is unlikely to hold exactly in practice: during the alternating training of GANs, the generator is changing all the time, and a few updates of the discriminator cannot fully learn the corresponding density ratio. Nevertheless, we believe the discriminator captures the density ratio information to a certain extent, which explains why the sampling methods consistently improve over the baseline at each epoch. From our understanding, the estimated density ratio is good enough to push the samples closer to the data distribution, but not good enough to match it exactly.
This could be the reason why the Inception Scores obtained by the sampling methods improve over the baselines but cannot reach that of real data and fully close the gap, even with very long runs of the Markov chains. Hence, there is still much room for improvement. To list a few directions, one can develop mechanisms that provide more accurate density ratio estimates, relax the assumptions required for the method to hold, or establish estimation error bounds. Overall, we believe GANs offer an interesting alternative scenario for the development of sampling methods.

A.2 ALGORITHM PROCEDURE

We give a detailed description of the algorithm procedure of our REP-GAN in Algorithm 1.

Algorithm 1 GAN sampling with Reparameterized Markov chains (REP-GAN)
Input: trained GAN with (calibrated) discriminator D and generator G, Markov chain length K, latent prior distribution p_0(z), latent proposal distribution q(z'|z_k);
Output: an improved GAN sample x_K;
Draw an initial sample x_1: 1) draw an initial latent z_1 ∼ p_0(z), and 2) push it forward, x_1 = G(z_1);
for each step k ∈ [1, K − 1] do
  Draw a REP proposal x' ∼ q_REP(x'|x_k): 1) draw a latent proposal z' ∼ q(z'|z_k), and 2) push it forward, x' = G(z');
  Accept x' with probability α_REP(x', x_k) (Eqn. (9));
  if x' is accepted then
    Let x_{k+1} = x', z_{k+1} = z'
  else
    Let x_{k+1} = x_k, z_{k+1} = z_k
  end if
end for

A.3 ADDITIONAL RESULTS

Here we list some additional empirical results of our method.

Fréchet Inception Distance (FID). We additionally report the comparison of Fréchet Inception Distance (FID) in Table 4. Because previous works do not report FID on these benchmarks, we report our re-implementation results instead. We can see that the ranks are consistent with the Inception Scores in Table 2 and that our method is superior in most cases.

Computation overhead. In Table 5, we compare different gradient-based sampling methods of GANs. DDLS and our REP-GAN take 88.94s and 88.85s, respectively, so the difference is negligible. Without the MH step, our method takes 87.62s, meaning the additional MH step only costs 1.4% computation overhead, which is also negligible, yet it brings a significant improvement in sample quality, as shown in Table 3.

Markov chain visualization on CelebA. We demonstrate two Markov chains on CelebA with different MCMC sampling methods of WGAN in Figure 6. We can see that on CelebA, the acceptance ratio of MH-GAN becomes much higher than that on CIFAR-10. Nevertheless, the sample quality is still relatively low. In comparison, the gradient-based methods can gradually refine the samples with Langevin steps, and our REP-GAN can alleviate image artifacts with MH correction steps.

A.4 MULTI-MODAL EXPERIMENTS

Aside from the manifold learning example shown in Figure 2, we additionally conduct experiments to illustrate the performance of our sampling methods on multi-modal distributions.

25-Gaussians. To begin with, we consider the 25-Gaussians dataset widely discussed in previous work (Azadi et al., 2019; Turner et al., 2019; Che et al., 2020). The dataset is generated by a mixture of twenty-five two-dimensional isotropic Gaussian distributions with variance 0.01 and means separated by 1, arranged in a grid. We train a small Wasserstein GAN with the standard WGAN-GP objective, following the setup in Tanaka (2019). After training, we draw 1,000 samples with each sampling method. As before, we start each Markov chain from a GAN sample, run it for 100 steps, and collect the last example for evaluation. As shown in Figure 7, compared to MH-GAN, the gradient-based methods (DDLS and ours) produce much better samples, close to the data distribution, given a proper step size. Comparing DDLS and ours, DDLS tends to concentrate so heavily on the mode centers that its standard deviation can be even smaller than that of the data distribution. In contrast, our method preserves more sample diversity while still concentrating on the mode centers. The difference becomes more obvious as the step size grows. When τ = 0.1, as marked with blue circles, the samples of DDLS become so concentrated that some modes are missed. When τ = 1, samples of DDLS diverge far beyond the 5x5 grid. In comparison, our REP-GAN does not suffer from these issues, as the MH correction steps account for the bias introduced by numerical errors.

Scale to more modes. Above, we experimented with a relatively easy scenario where the multi-modal distribution has only 5x5 modes (n = 5 modes along each axis). In fact, the distinctions between the sampling methods become even more obvious when we scale to more modes. Specifically, as shown in Figure 8, we also compare them w.r.t.
mixtures of Gaussians with 9x9 and 13x13 modes, respectively. The rest of the setup is the same as for the 25-Gaussians. Note that throughout the experiments in this part, we adopt a proper step size, τ = 0.01, for the gradient-based methods (DDLS and REP-GAN) by default. Under these more challenging scenarios, the gradient-based methods still consistently outperform MH-GAN. Moreover, our REP-GAN has a clearer advantage over DDLS. Specifically, for 9x9 modes, our REP-GAN produces samples that are less noisy (i.e., fewer samples lie away from the modes) while preserving all the modes. Meanwhile, DDLS makes a critical mistake: it drops one of the modes (lower left corner, marked with a red circle) during the Markov chain update. As discussed above, we believe this is because DDLS has a bias towards regions with high density.
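For reference, the grid-of-Gaussians data used in these experiments can be generated as below. This is a sketch with hypothetical parameter names; variance 0.01 corresponds to a standard deviation of 0.1, and the means are spaced 1 apart (n = 5 recovers the 25-Gaussians dataset; n = 9 or 13 gives the harder settings).

```python
import numpy as np

def grid_gaussians(n=5, num_samples=1000, std=0.1, spacing=1.0, seed=0):
    """Sample from a mixture of n x n isotropic 2D Gaussians whose means
    lie on a regular grid with the given spacing, centered at the origin."""
    rng = np.random.default_rng(seed)
    # Grid of component means, shifted so the grid is centered at (0, 0).
    means = np.array([(i, j) for i in range(n) for j in range(n)], dtype=float)
    means = (means - (n - 1) / 2.0) * spacing
    # Pick a mixture component uniformly for each sample, then add noise.
    idx = rng.integers(0, n * n, size=num_samples)
    return means[idx] + std * rng.standard_normal((num_samples, 2))

data = grid_gaussians(n=5, num_samples=1000)   # the 25-Gaussians dataset
```

Mode coverage can then be evaluated by assigning each generated sample to its nearest grid center and counting how many centers receive at least one sample.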



For fair comparison, our training and evaluation follow the official code of MH-GAN (Turner et al., 2019): https://github.com/uber-research/metropolis-hastings-gans

CONCLUSION

In this paper, we have proposed a novel method, REP-GAN, to improve the sampling of GANs. We devise a structured dependent proposal that reparameterizes the sample-level transition of GANs into a latent-level transition. More importantly, we are the first to prove that this general proposal admits a tractable MH criterion. Experiments show that our method not only improves sample efficiency but also achieves state-of-the-art sample quality on benchmark datasets compared to existing sampling methods.




Figure 1: Illustration of REP-GAN's reparameterized dependent proposal with two paired Markov chains, one in the latent space Z and the other in the sample space X.

Figure 2: Visualization of samples with different sampling methods on the Swiss Roll dataset. Here τ denotes the Langevin step size in Eqn. (17).

Figure 3: Average Inception Score (left) and acceptance ratio (right) vs. training epoch on CIFAR-10 based on our re-implementation. The standard deviation is shown with shaded error bar (left).

Figure 4: The first 20 steps of two Markov chains with the same initial samples. The chains are generated by MH-GAN (top), DDLS (middle), and REP-GAN (bottom).

Figure 5: Visualization of 5 Markov chains of our REP proposals (i.e., REP-GAN without the MH rejection steps) with a large step size (τ = 1).

Figure 6: Visualization of the Markov chains of MH-GAN (top), DDLS (middle), and REP-GAN (bottom) on CelebA with WGAN backbone.

Table 1: Comparison of sampling methods for GANs in terms of three effective sampling mechanisms.

Table 2: Inception Scores (IS) of different sampling methods on CIFAR-10 and CelebA.

Table 3: Ablation study of our REP-GAN with Inception Scores (IS) and acceptance ratios (Accept), averaged over five adjacent checkpoints. IND refers to the independent proposal of MH-GAN; REP refers to our REP proposal; MH denotes the MH rejection step of the corresponding sampler.


Table 4: Fréchet Inception Distance (FID) of different MCMC sampling methods on CIFAR-10 and CelebA, based on our re-implementation.

Table 5: Comparison of computation cost (measured in seconds) of gradient-based MCMC sampling methods of GANs. We report the total time to sample a batch of 500 samples with DCGAN on an NVIDIA 1080 Ti GPU. We initialize each chain with GAN samples and run it for 640 steps.

