BRGANS: STABILIZING GANS' TRAINING PROCESS WITH BROWNIAN MOTION CONTROL

Abstract

The training process of generative adversarial networks (GANs) is unstable and does not converge globally. In this paper, we propose a universal higher-order noise-based controller called the Brownian Motion Controller (BMC) that is invariant to the GAN's framework, so that the training process of GANs is stabilized. Specifically, starting with the prototypical case of Dirac-GANs, we design a BMC and propose Dirac-BrGANs, which recover exactly the same, but now reachable, optimal equilibrium regardless of the GAN's framework. The optimal equilibrium of our Dirac-BrGANs' training system is globally unique and always exists. Furthermore, we give theoretical proof that the training process of Dirac-BrGANs achieves exponential stability almost surely for any arbitrary initial value and derive bounds on the rate of convergence. We then extend our BMC to normal GANs and propose BrGANs. We provide numerical experiments showing that our BrGANs effectively stabilize GANs' training process and obtain state-of-the-art performance in terms of FID and inception score compared to other stabilizing methods.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are a popular deep-learning-based generative architecture. Given a multi-dimensional input dataset with unknown distribution $P_{real}$, GANs can obtain an estimated distribution $P_{model}$ and produce new entries that are as close to indistinguishable as possible from the input entries. For example, GANs can be used to generate images that look real to the human eye (Wang et al., 2017). A GAN's architecture consists of two neural networks: a generator and a discriminator. The generator creates new elements that resemble the entries from the input dataset as closely as possible. The discriminator, on the other hand, aims to distinguish the (counterfeit) entries produced by the generator from the original members of the input dataset. The GAN's two networks can be modeled as a minimax problem: they compete against one another while striving to reach a Nash equilibrium, an optimal solution where the generator produces fake entries that are, from the point of view of the discriminator, in all respects indistinguishable from real ones. Unfortunately, training GANs often suffers from instabilities. Previously, theoretical analysis has been conducted on GANs' training process. Fedus et al. (2018) argue that the traditional view of training GANs as minimizing the divergence between the real and model distributions is too restrictive and thus leads to instability. Arora et al. (2018) show that GANs' training process does not lead the generator to the desired distribution. Farnia & Ozdaglar (2020) suggest that current training methods of GANs do not always have a Nash equilibrium, and Heusel et al. (2017a) push GANs to converge to a local Nash equilibrium using a two time-scale update rule (TTUR).
Many previous methods (Mescheder et al., 2018; Arjovsky & Bottou, 2017; Nagarajan & Kolter, 2017; Kodali et al., 2017) have investigated the causes of such instabilities and attempted to reduce them by introducing various changes to the GAN architecture. However, as Mescheder et al. (2018) show in their study of the convergence behavior of several GAN models, despite bringing significant improvements, GANs and their variations are still far from achieving stability in the general case. To accomplish this goal, we design a Brownian Motion Control (BMC) using control theory and propose a universal model, BrGANs, to stabilize GANs' training process. We start with the prototypical Dirac-GAN (Mescheder et al., 2018) and analyze its system of training dynamics. We then design a Brownian motion controller (BMC) on the training dynamics of the Dirac-GAN in order to stabilize this system over the time domain $t$, and generalize our BMC to the normal GAN setting to propose BrGANs. Our contributions are as follows:

• We design the Brownian Motion Controller (BMC), a universal higher-order noise-based controller for GANs' training dynamics that is compatible with all GAN frameworks, and we give both theoretical and empirical analysis showing that the BMC effectively stabilizes GANs' training process.

• Under the Dirac-GAN setting, we propose Dirac-BrGANs and conduct theoretical stability analysis to derive bounds on the rate of convergence. Our proposed Dirac-BrGANs converge globally with exponential stability.

• We extend the BMC to normal GAN settings and propose BrGANs. In experiments, our BrGANs converge faster and perform better in terms of inception and FID scores on the CIFAR-10 and CelebA datasets than previous baselines across various GAN models.

1.2. RELATED WORK

To stabilize GANs' training process, a lot of work has been done on modifying the training architecture. Karras et al. (2018) train the generator and the discriminator progressively to stabilize the training process. Wang et al. (2021) observe that during training, the discriminator converges faster and dominates the dynamics. They produce an attention map from the discriminator and use it to improve the spatial awareness of the generator, pushing the GAN's solution closer to the equilibrium. On the other hand, much work stabilizes GANs' training process with modified objective functions. Kodali et al. (2017) add a gradient penalty to the objective function of their model, DRAGAN, to avoid local equilibria. This method has fewer mode collapses and can be applied to many GAN frameworks. Other work, such as Generative Multi-Adversarial Networks (GMAN) (Durugkar et al., 2017), packing GANs (PacGAN) (Lin et al., 2017), and energy-based GANs (Zhao et al., 2016), modifies the discriminator to achieve better stability. Xu et al. (2019) formulate GANs as a system of differential equations and add closed-loop control (CLC) to a few variations of GANs to enforce stability. However, the design of their controller depends on the objective function of the GAN model and does not work for all variations of GANs. Motivated by this line of work, we analyze GANs' training process from control theory's perspective and design an invariant Brownian Motion Controller (BMC) to stabilize it. Compared with Xu et al. (2019), our proposed BrGANs converge faster, perform better, and do not rely on any specific GAN architecture.

2. CONTROLLING DIRAC-GAN WITH BROWNIAN MOTION

In this section, we introduce the BMC, a higher-order noise-based controller, as a universal control function that is invariant to the objective functions of various GAN models. In addition, we prove that the Dirac-GAN with BMC is exponentially stable, and we derive bounds on its convergence rate.

2.1. DYNAMIC SYSTEM OF DIRAC-GANS

In the Dirac-GAN setting, the distribution of the generator $G$ follows $p_G(x) = \delta(x - \theta)$ and the discriminator is linear, $D_\varphi(x) = \varphi x$. The true data distribution is given by $p_D(x) = \delta(x - c)$ with a constant $c$. (Notice that for the Dirac distribution, $\delta$ can be considered as an impulse at the origin, such that $\delta(x) = 1$ at the origin and $0$ otherwise.) The objective functions of the Dirac-GAN can be written as:

$$\max_\varphi L_D(\varphi; \theta) = h_1(D_\varphi(c)) + h_2(D_\varphi(\theta)), \qquad \max_\theta L_G(\theta; \varphi) = h_3(D_\varphi(\theta)), \tag{1}$$

where $h_1(\cdot)$ and $h_3(\cdot)$ are increasing functions and $h_2(\cdot)$ is a decreasing function around zero (Xu et al., 2019). The training process of the Dirac-GAN can be modelled as a system of differential equations. Following Mescheder et al. (2018) and Xu et al. (2019), the training dynamical system of the Dirac-GAN is formulated as:

$$\begin{cases} \dfrac{d\varphi(t)}{dt} = h_1'(\varphi(t)c)\,c + h_2'(\varphi(t)\theta(t))\,\theta(t), \\[4pt] \dfrac{d\theta(t)}{dt} = h_3'(\varphi(t)\theta(t))\,\varphi(t). \end{cases} \tag{2}$$

This system has a constant nontrivial solution $(\theta, \varphi) = (c, 0)$. Let $\tilde\theta(t) = \theta(t) - c$ and convert the original system (2) to:

$$\begin{cases} \dfrac{d\varphi(t)}{dt} = h_1'(\varphi(t)c)\,c + h_2'\big(\varphi(t)(\tilde\theta(t) + c)\big)\,(\tilde\theta(t) + c), \\[4pt] \dfrac{d\tilde\theta(t)}{dt} = h_3'\big(\varphi(t)(\tilde\theta(t) + c)\big)\,\varphi(t). \end{cases} \tag{3}$$

The equilibrium of system (3) is $(0, 0)$. Define $X(t) = (\varphi(t), \tilde\theta(t))^\top$ and

$$f(X(t)) = \Big(h_1'(\varphi(t)c)\,c + h_2'\big(\varphi(t)(\tilde\theta(t) + c)\big)\,\tilde\theta(t) + h_2'\big(\varphi(t)(\tilde\theta(t) + c)\big)\,c,\;\; h_3'\big(\varphi(t)(\tilde\theta(t) + c)\big)\,\varphi(t)\Big)^\top.$$

Then system (3) can be rewritten as:

$$dX(t) = f(X(t))\,dt. \tag{4}$$
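To see the instability concretely, the following minimal sketch (our own illustration, not code from the paper) integrates system (2) with forward Euler for the WGAN instance $h_1(t) = t$, $h_2(t) = -t$, $h_3(t) = t$, so $h_1' = h_3' = 1$ and $h_2' = -1$; the distance to the equilibrium $(\varphi, \theta) = (0, c)$ never decreases:

```python
import numpy as np

# Forward-Euler simulation of the uncontrolled Dirac-GAN dynamics, Eq. (2),
# for the WGAN case h1(t) = t, h2(t) = -t, h3(t) = t (so h1' = h3' = 1, h2' = -1).
# The trajectory circles the equilibrium (phi, theta) = (0, c) without converging.
def simulate_dirac_gan(c=1.0, phi0=0.5, theta0=0.0, dt=1e-3, steps=20000):
    phi, theta = phi0, theta0
    radii = []
    for _ in range(steps):
        dphi = 1.0 * c + (-1.0) * theta        # h1'(phi*c)*c + h2'(phi*theta)*theta
        dtheta = 1.0 * phi                     # h3'(phi*theta)*phi
        phi += dt * dphi
        theta += dt * dtheta
        radii.append(np.hypot(phi, theta - c))  # distance to equilibrium (0, c)
    return np.array(radii)

radii = simulate_dirac_gan()
# The distance to the equilibrium does not shrink: the iterates keep oscillating.
print(radii[0], radii[-1])
```

In continuous time the shifted system is a pure rotation, so the radius is conserved; discretization even makes it grow slightly, matching the oscillation seen in Figure 1.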

2.2. DESIGNING BROWNIAN MOTION CONTROLLER FOR DIRAC-GAN

Brownian motion is a natural phenomenon that captures the random displacements of particles in $d$-dimensional space. At each time step, the displacement $B_t$ is an independent, identically distributed random variable taking values in $\mathbb{R}^d$; the distribution of $B_t$ is normally characterized by a multivariate Gaussian distribution. Denote the position of a particle at initial time $0$ as $X(0)$. Then at time $T$, this particle's position is given as

$$X(T) = X(0) + \int_0^T B_t \, dt. \tag{5}$$

In control theory, noise-based controllers like our Brownian motion controller (BMC) are a useful tool to stabilize dynamical systems and push the solution towards the optimal value over the time domain $t$ (Mao et al., 2002). In this section, we design a BMC on the Dirac-GAN's training dynamics to improve stability. To stabilize system (4), we propose the following higher-order noise-based controller:

$$u(t) = \epsilon_1 X(t)\, \dot{B}_1(t) + \epsilon_2 |X(t)|^{\beta} X(t)\, \dot{B}_2(t), \tag{6}$$

where $B_1(t)$ and $B_2(t)$ are independent one-dimensional Brownian motions, $\beta > 1$, and $\epsilon_1$, $\epsilon_2$ are non-negative constants. Incorporating the BMC (6), the controlled system is given as

$$dX(t) = f(X(t))\,dt + u(t)\,dt. \tag{7}$$
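A minimal Euler–Maruyama sketch of the controlled system (7), again for the WGAN instance $h_1' = h_3' = 1$, $h_2' = -1$ in the shifted coordinates $(\varphi, \tilde\theta)$; the hyper-parameter values $\epsilon_1$, $\epsilon_2$, $\beta$ below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

# Euler-Maruyama simulation of the controlled system (7): the WGAN-style
# Dirac-GAN drift (h1' = h3' = 1, h2' = -1) plus the BMC noise terms
#   u(t) = eps1 * X dB1 + eps2 * |X|^beta * X dB2.
def simulate_brgan(c=1.0, eps1=2.0, eps2=0.5, beta=1.5,
                   x0=(0.5, -1.0), dt=1e-3, steps=50000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)              # x = (phi, theta_tilde); equilibrium (0, 0)
    for _ in range(steps):
        phi, th = x
        drift = np.array([c - (th + c), phi])  # shifted system (3) for the WGAN case
        db1, db2 = rng.normal(0.0, np.sqrt(dt), size=2)
        r = np.linalg.norm(x)
        x = x + drift * dt + eps1 * x * db1 + eps2 * (r ** beta) * x * db2
    return np.linalg.norm(x)

print(simulate_brgan())   # close to 0: the controlled system contracts
```

The multiplicative noise term $\epsilon_1 X\,dB_1$ contributes a drift of about $-\epsilon_1^2/2$ to $\log|X(t)|$, which is exactly the stabilizing mechanism exploited in the analysis below.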

2.3. DIRAC-BRGAN WITH EXPONENTIAL STABILITY

In this section, we establish the existence of a unique global solution of system (7) and its stability. The equilibrium point of system (3) is $(\varphi(t_e), \tilde\theta(t_e)) = (0, 0)$. Without the BMC, the training of a regular GAN or a WGAN (Arjovsky et al., 2017) is unstable and oscillates around the equilibrium point $(0, 0)$. Figure 1 illustrates the gradient maps of $\tilde\theta$ against $\varphi$ and the convergence behavior over the time domain $t$. We can see that the gradients of both the generator and the discriminator oscillate around the equilibrium point but never converge to it. For the stability analysis, we impose the following assumption on the smoothness of the functions $h_1, h_2, h_3$ in system (2).

Assumption 1. There exist positive constants $\alpha_1, \alpha_2, \alpha_3$ such that for any $x, y \in \mathbb{R}^n$,

$$|h_1'(x) - h_1'(y)| \le \alpha_1 |x - y|, \quad |h_2'(x) - h_2'(y)| \le \alpha_2 |x - y|, \quad |h_3'(x) - h_3'(y)| \le \alpha_3 |x - y|.$$

In what follows, we first prove that the BMC from equation (6) yields a unique global solution in Theorem 1. That is, no matter which initial point $X(0)$ we start from, system (7) a.s. has a unique solution $X(t)$ as $t$ goes to infinity. Then, in Theorem 2, we show that this unique global solution exponentially converges to the equilibrium point a.s., with bounds involving the hyperparameters $\epsilon_1$, $\epsilon_2$, and $\beta$, which in turn affect the rate of convergence. Combining Theorem 1 and Theorem 2, we conclude that system (7) is stable, and thus the Dirac-BrGAN is stable and converges to the optimal solution as required.

Theorem 1. (Proof in Appendix A) Under Assumption 1, for any initial value $X(0) = \xi \in \mathbb{R}^2$, if $\epsilon_2 \neq 0$ and $\beta > 1$, then there a.s. exists a unique global solution $X(t)$ to system (7) on $t \in [0, \infty)$.

Theorem 2. (Proof in Appendix B) Let Assumption 1 hold, and assume that $\epsilon_2 \neq 0$ and $\beta > 1$.
If

$$\frac{\epsilon_1^2}{2} - \phi > 0, \quad \text{where } \phi = \max_{x \ge 0} \Big\{ -\frac{\epsilon_2^2}{2} x^{2\beta} + \big(\alpha_2^2 + \tfrac{1}{2}\alpha_3^2\big)x^2 + \big[(1 + \tfrac{1}{2}\alpha_1^2)c^2 + 2c + \tfrac{1}{2}\big] \Big\}, \tag{9}$$

then for any $X(0) = \xi$ and any sufficiently small constant $\varepsilon \in (0, \epsilon_1^2/2 - \phi)$, the global solution $X(t)$ of system (7) has the property that

$$\limsup_{t \to \infty} \frac{\log |X(t)|}{t} \le -\Big(\frac{\epsilon_1^2}{2} - \phi\Big) + \varepsilon, \quad \text{a.s.,}$$

that is, the solution of system (7) is a.s. exponentially stable. Here, since $\epsilon_1^2/2 - \phi > 0$ and $\varepsilon$ is sufficiently small, when Eq. (9) is satisfied we have $\limsup_{t\to\infty} \log|X(t)|/t \le -\lambda$ a.s. for some positive constant $\lambda$. Rearranging, we get $|X(t)| \le e^{-\lambda t}$ for all sufficiently large $t$ a.s., which implies $X(t) \to (0, 0)$ as $t \to \infty$, as required. Notice that the rate of convergence depends only on the constant $\lambda$, which in turn depends on $\epsilon_1$ and $\phi$. This means the convergence rate is decided by the choice of the hyperparameters $\epsilon_1$, $\epsilon_2$, and $\beta$. In practice, we can tune these three variables as desired, as long as they satisfy the constraint in Eq. (9). Notice also that our Dirac-BrGANs work for any $h_1$, $h_2$, and $h_3$ satisfying the smoothness condition in Assumption 1. In other words, we have proven that the Dirac-BrGANs are stable regardless of the GAN's architecture, and we have given theoretical bounds on the rate of convergence. In Figure 1, we present visual evidence that the Dirac-BrGANs are stable and converge to the optimal equilibrium as required.
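The bound in Theorem 2 is easy to evaluate numerically. The sketch below (with illustrative constants $\alpha_i$, $c$, $\epsilon_1$, $\epsilon_2$, $\beta$ of our own choosing, not values from the paper) estimates $\phi$ by a grid search over $x \ge 0$ and checks that $\epsilon_1^2/2 - \phi > 0$, which guarantees an a.s. exponential rate of at least $\epsilon_1^2/2 - \phi$:

```python
import numpy as np

# Numerical sketch of the bound in Theorem 2: phi is the maximum over x >= 0 of
#   g(x) = -(eps2^2 / 2) x^{2 beta} + (alpha2^2 + alpha3^2 / 2) x^2
#          + (1 + alpha1^2 / 2) c^2 + 2 c + 1/2,
# and the a.s. exponential rate is at least eps1^2 / 2 - phi when that is positive.
def phi_bound(eps2, beta, a1, a2, a3, c, x_max=50.0, n=200001):
    x = np.linspace(0.0, x_max, n)
    g = (-(eps2 ** 2) / 2.0 * x ** (2 * beta)
         + (a2 ** 2 + 0.5 * a3 ** 2) * x ** 2
         + (1.0 + 0.5 * a1 ** 2) * c ** 2 + 2.0 * c + 0.5)
    return g.max()

phi = phi_bound(eps2=1.0, beta=2.0, a1=1.0, a2=1.0, a3=1.0, c=1.0)
eps1 = 4.0
rate = eps1 ** 2 / 2.0 - phi
print(phi, rate)   # rate > 0, so this choice satisfies Eq. (9)
```

For these constants the maximizer is $x^2 = 3/2$, giving $\phi = 5.125$, and $\epsilon_1 = 4$ yields a guaranteed rate of $8 - 5.125 = 2.875$; increasing $\epsilon_1$ directly increases the guaranteed rate, matching the tuning discussion above.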

3. GENERALIZATION OF BMC TO GANS

In section 2 we designed a universal BMC for Dirac-GANs and proved that the Dirac-BrGANs are globally exponentially stable. In this section, we are going to generalize the BMC for normal GANs (i.e., GANs other than the Dirac-GAN). We consider any GANs where the Generator (G) and the Discriminator (D) are neural networks in their respective function spaces.

3.1. MODELLING DYNAMICS OF GANS

Analogously to the Dirac-GAN, the training dynamics of normal GANs can be formulated as a system of differential equations. Instead of $\theta$ and $\varphi$, we directly use $G(z, t)$ and $D(x, t)$ to represent, respectively, the generator and the discriminator. The objective functions of GANs can be written as:

$$\max_D L_D(D; G) = \mathbb{E}_{p(x)}[h_1(D(x))] + \mathbb{E}_{p_G(x)}[h_2(D(x))], \qquad \max_G L_G(G; D) = \mathbb{E}_{p_z(z)}[h_3(D(G(z)))]. \tag{13}$$

Following Xu et al. (2019)'s notation, the training dynamics of the generator and discriminator over the time domain $t$ can be written as:

$$\begin{cases} \dfrac{dD(x, t)}{dt} = p(x)\,\dfrac{dh_1(D(x))}{dD(x)} + p_G(x)\,\dfrac{dh_2(D(x))}{dD(x)}, & \forall x, \\[6pt] \dfrac{dG(z, t)}{dt} = p_z(z)\,\dfrac{dh_3(D(G(z)))}{dD(G(z))}\,\dfrac{dD(G(z))}{dG(z)}, & \forall z. \end{cases} \tag{14}$$

Now define $X(t) = (D(x, t), G(z, t))^\top$ and

$$f(X(t)) = \begin{pmatrix} p(x)\,\dfrac{dh_1(D(x))}{dD(x)} + p_G(x)\,\dfrac{dh_2(D(x))}{dD(x)} \\[6pt] p_z(z)\,\dfrac{dh_3(D(G(z)))}{dD(G(z))}\,\dfrac{dD(G(z))}{dG(z)} \end{pmatrix}. \tag{15}$$

Then system (14) can be written compactly as

$$dX(t) = f(X(t))\,dt. \tag{16}$$
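To make the discriminator equation in system (14) concrete, here is a minimal grid-based sketch (our own illustration, not the paper's implementation) for the WGAN-style choice $h_1(t) = t$, $h_2(t) = -t$, under which $dD(x,t)/dt = p(x) - p_G(x)$, with the generator density held fixed:

```python
import numpy as np

# Discretized sketch of the discriminator dynamics in system (14) for the
# WGAN-style choice h1(t) = t, h2(t) = -t, so dD(x,t)/dt = p(x) - p_G(x):
# D increases where the data density exceeds the (here frozen) generator density.
x = np.linspace(-5, 5, 501)
p_data = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)   # illustrative N(1, 1)
p_gen = np.exp(-0.5 * (x + 1.0) ** 2) / np.sqrt(2 * np.pi)    # illustrative N(-1, 1)

D = np.zeros_like(x)
dt = 0.1
for _ in range(100):
    D += dt * (p_data - p_gen)   # Euler step of dD/dt = p(x) - p_G(x)

# D becomes positive where p_data > p_gen (x > 0) and negative where p_gen dominates.
print(D[x > 1].mean(), D[x < -1].mean())
```

This illustrates why the uncontrolled discriminator drifts without bound wherever the two densities differ, which is the instability the BMC is designed to damp.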

3.2. BRGAN: STABILIZED GANS WITH BMC

The optimal solution of a normal GAN is achieved when $D(x) = 0$ for the discriminator and $p_G(x) = p(x)$ for the generator. In control theory, when we design a controller for a dynamical system, we need to know the optimal solution so that the controller can push the dynamical system to its equilibrium point without changing that equilibrium. For normal GANs, we only have information on the optimal solution of the discriminator, so we impose the BMC on the discriminator only. Notice that for the Dirac-GAN, we impose the BMC $u(t) = \epsilon_1 X(t)\dot{B}_1(t) + \epsilon_2 |X(t)|^{\beta} X(t)\dot{B}_2(t)$ on $X(t)$, i.e., on both the generator and the discriminator. Since we now impose the BMC only on the discriminator, while still retaining information from the generator, we slightly modify Eq. (6) so that

$$u_D(t) = \epsilon_1 D(x)\,\dot{B}_1(t) + \epsilon_2 \big(D^2(x) + D^2(G(z))\big) D(x)\,\dot{B}_2(t), \tag{17}$$

where $B_1(t)$ and $B_2(t)$ are independent one-dimensional Brownian motions and $\epsilon_1$, $\epsilon_2$ are non-negative constants. Since we implement BrGANs through gradient descent, our BMC can be reflected in the discriminator's objective function, whose derivative is Eq. (17). We thus integrate Eq. (17) with respect to $D(x)$ and modify the objective function of the discriminator in (13) to:

$$\max_D \tilde{L}_D(D; G) = L_D(D; G) + \frac{\epsilon_1}{2} D^2(x)\,\dot{B}_1(t) + \Big[\frac{\epsilon_2}{4} D^4(x) + \frac{\epsilon_2}{2} D^2(G(z)) D^2(x)\Big]\dot{B}_2(t). \tag{18}$$

We implement these objective functions in Section 4. Our numerical experiments show that BrGANs successfully stabilize GAN models and generate images of promising quality.
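As a sanity check on this integration step, the following sketch (with illustrative values for $\epsilon_1$, $\epsilon_2$, and the sampled noise increments, none taken from the paper) verifies numerically that the regularizer added in Eq. (18) has derivative with respect to $D(x)$ equal to the controller $u_D$ in Eq. (17):

```python
import numpy as np

# Check that the regularizer added to the discriminator objective in Eq. (18)
# has derivative (w.r.t. D(x)) equal to the controller u_D in Eq. (17).
def bmc_regularizer(d, d_g, eps1, eps2, db1, db2):
    return (0.5 * eps1 * d ** 2 * db1
            + (0.25 * eps2 * d ** 4 + 0.5 * eps2 * d_g ** 2 * d ** 2) * db2)

def u_d(d, d_g, eps1, eps2, db1, db2):
    return eps1 * d * db1 + eps2 * (d ** 2 + d_g ** 2) * d * db2

# Illustrative scalar values for D(x), D(G(z)), the gains, and noise increments.
d, d_g, eps1, eps2, db1, db2 = 0.7, -0.3, 0.1, 0.01, 0.8, -0.5
h = 1e-6
numeric = (bmc_regularizer(d + h, d_g, eps1, eps2, db1, db2)
           - bmc_regularizer(d - h, d_g, eps1, eps2, db1, db2)) / (2 * h)
print(numeric, u_d(d, d_g, eps1, eps2, db1, db2))  # the two values agree
```

In practice this means the controller can be applied with any stochastic-gradient framework simply by adding the Eq. (18) terms to the discriminator loss, with fresh Brownian increments sampled at each step.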

4. EVALUATION

In this section, we show the effectiveness of the BMC by providing both quantitative and qualitative results.

4.1. EXPERIMENTAL SETTING

Datasets: We evaluate our proposed BrGANs on the well-established CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015) datasets. The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. This dataset can be used for both conditional and unconditional image generation. In order to compare our training method fairly with the solutions from related work, we use a batch size of 64 and the same generator and discriminator architecture under the same codebase. CelebA contains 202,599 face images of size 64×64 with diverse facial features.

Implementation details: Both our generator and discriminator are composed of convolutional, batch normalization, and activation layers. The generator uses four transposed convolutional layers to convert a 1×100 latent vector to a 3×32×32 image; each layer is followed by batch normalization and a ReLU activation. The discriminator first uses three convolutional layers to obtain a 1024×4×4 feature map and then feeds it to an MLP to produce a single value. Each of our models is trained on an Nvidia 2080 Ti GPU. The batch size is 64; the generator is trained for 50,000 iterations and the discriminator for 250,000 iterations. We use Adam with a 1e-4 learning rate as the optimizer.

Evaluation metrics: For the Dirac-GAN problem, the optimal solution is known, so we can measure the convergence speed and draw the gradient map for different training algorithms. For CIFAR-10, we use the FID score (Heusel et al., 2017b) and the inception score (Barratt & Sharma, 2018) to measure the quality of the generated images. We also compare the FID and inception scores across different timestamps to show the convergence speed.
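As a shape sanity check on the generator described above, the standard transposed-convolution output-size formula `out = (in - 1) * stride - 2 * padding + kernel` traces a 1×1 spatial latent map to 32×32 in four layers; the specific kernel/stride/padding values below are a common DCGAN-style choice, assumed for illustration rather than taken from the paper:

```python
# Shape walk for a four-layer transposed-convolution generator mapping a
# 1x100 latent vector (treated as a 1x1 spatial map) to a 3x32x32 image,
# using out = (in - 1) * stride - 2 * padding + kernel.
def deconv_out(size, kernel, stride, padding):
    return (size - 1) * stride - 2 * padding + kernel

size = 1                                    # 1x1 spatial latent map
for kernel, stride, padding in [(4, 1, 0),  # 1 -> 4
                                (4, 2, 1),  # 4 -> 8
                                (4, 2, 1),  # 8 -> 16
                                (4, 2, 1)]: # 16 -> 32
    size = deconv_out(size, kernel, stride, padding)
print(size)  # 32
```

The same formula run in reverse recovers the discriminator's 32×32 → 4×4 downsampling before the final MLP.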

4.2. CONVERGENCE OF DIRAC-BRGAN

The gradient map and the convergence curve are presented in Figure 1. These results show that our Dirac-BrGANs have better convergence patterns and speed than Dirac-GANs. Without adding the BMC to the training objective, Dirac-GANs cannot reach equilibrium: the parameters of the generator and the discriminator oscillate in a circle, as shown in Figure 1. In contrast, the parameters of the Dirac-BrGANs oscillate only in the first 500 iterations and converge within 800 iterations. We also study different combinations of $\epsilon_1$ and $\epsilon_2$ under $\beta = 1$ and $\beta = 2$. As shown in Table 1 and Table 2, Dirac-BrGANs converge better when we set $\epsilon_1 = 0.1$ and $\epsilon_2 = 0.01$. Generally, a larger $\epsilon$ leads to a faster convergence rate, but once $\epsilon$ is large enough, the effect of increasing it saturates. On the other hand, when $\epsilon$ is too small, Dirac-BrGANs take more than 100,000 iterations to converge.

4.3. QUANTITATIVE RESULTS

We demonstrate that BrGANs converge faster and generate higher-fidelity images than Wasserstein GANs (WGAN) (Arjovsky et al., 2017), WGAN with weight clipping (WGAN-CP), WGAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017), GANs with closed-loop control (WGAN-CLC) (Xu et al., 2019), and their combinations. From the results in Table 3, we observe that our proposed WGAN-BR-GP achieves 22.10 FID and a 5.42 inception score on the CIFAR-10 dataset, the best results among all compared GANs. Specifically, WGAN-BR outperforms WGAN, WGAN-BR-CP outperforms WGAN-CP and WGAN-CLC-CP, and WGAN-BR-GP outperforms WGAN-GP and WGAN-CLC-GP. From Table 4, we observe similar trends.

Figure 2: FID score on CIFAR10. Figure 3: Inception score on CIFAR10.

4.4. QUALITATIVE RESULTS

We provide qualitative results on the CIFAR-10 and CelebA datasets. In Fig. 5 and Fig. 4, the images are generated by WGAN, WGAN-GP, WGAN-GP-BR, and WGAN-GP-CLC, from top left to bottom right, respectively. It can be observed that our WGAN-GP-BR generates images with higher fidelity that are more visually plausible.

5. CONCLUSION

In this paper, we revisit GANs' instability problem from the perspective of control theory. Our work incorporates a higher-order nonlinear controller and modifies the objective function of the discriminator to stabilize GAN models. We design a universal noise-based control method called Brownian Motion Control (BMC) and propose BrGANs, which achieve exponential stability; notably, our BMC is compatible with all GAN variations. Experimental results demonstrate that our BrGANs converge faster and perform better in terms of FID and inception scores on CIFAR-10 and CelebA. Our theoretical analysis is carried out under the Dirac-GAN setting, where we stabilize the generator and the discriminator simultaneously. Under normal GANs' settings, however, we only design the BMC for the discriminator: we stabilize the discriminator directly and thereby induce stability of the generator. Additionally, our BMC is derived in a continuous-time setting, while GANs' training proceeds in discrete time steps. To resolve these two problems, future work can estimate the generator's equilibrium at each time step and impose a controller on the generator and discriminator simultaneously.
We therefore have

$$EV(X(t \wedge \tau_k)) \le E|\xi|^p + Ht.$$



Figure 1: The gradient map and convergence behavior of Dirac-WGANs (first row) and Dirac-BrWGANs (second row), where the Nash equilibrium of both models should be at $(0, 0)^\top$.

Figure 4: CIFAR.
Figure 5: CelebA.

$Ht$ is independent of $k$. By the definition of $\tau_k$, $|X(\tau_k)| = k$, so

$$P(\tau_k \le t)\,k^p \le E\big[\mathbf{1}_{\{\tau_k \le t\}} V(X(\tau_k))\big] \le EV(X(t \wedge \tau_k)) \le E|\xi|^p + Ht,$$

and letting $k \to \infty$ gives $P(\tau_k \le t) \to 0$, as required.

Applying the Itô formula to $\log |X(t)|$ yields

$$\log |X(t)| = \log |X(0)| + \cdots + \int_0^t \epsilon_2 |X(s)|^{\beta}\, dB_2(s).$$

Table 2: Convergence iterations for $\beta = 2$ under Dirac-BrGANs.

Results on CIFAR-10: Convergence of GANs is presented in Fig. 2 and Fig. 3, measured by FID score and inception score, respectively. It is readily observed that our proposed BrGANs achieve better FID and inception scores given the same training iteration. Xu et al. (2019) add an L2 regularization (CLC) to the objective function of the discriminator. In our BrGANs, we incorporate information from both the generator and the discriminator into our controller, to make sure the discriminator does not dominate the training process. Our BrGANs also compute faster than Xu et al. (2019), since we do not need to keep and update a buffer during the training process.

APPENDIX A: PROOF OF THEOREM 1

Under Assumption 1, for any initial value $X(0) = \xi \in \mathbb{R}^2$, if $\epsilon_2 \neq 0$ and $\beta > 1$, then there a.s. exists a unique global solution $X(t)$ to system (7) on $t \in [0, \infty)$.

Proof. Under Assumption 1, for any bounded initial value $X(0) \in \mathbb{R}^n$, there exists a unique maximal local strong solution $X(t)$ of system (7) on $t \in [0, \tau_e)$, where $\tau_e$ is the explosion time. To show that the solution is actually global, we only need to prove that $\tau_e = \infty$ a.s. Let $k_0$ be a sufficiently large positive number such that $|X(0)| < k_0$. For each integer $k \ge k_0$, define the stopping time

$$\tau_k = \inf\{t \in [0, \tau_e) : |X(t)| \ge k\},$$

with the traditional convention $\inf \emptyset = \infty$, where $\emptyset$ denotes the empty set. Clearly, $\tau_k$ is increasing as $k \to \infty$ and $\tau_k \to \tau_\infty \le \tau_e$ a.s. If we can show that $\tau_\infty = \infty$ a.s., then $\tau_e = \infty$ a.s., which implies the desired result. This is equivalent to proving that, for any $t > 0$, $P(\tau_k \le t) \to 0$ as $k \to \infty$.

To prove this statement, for any $p \in (0, 1)$, define the $C^2$-function $V(x) = |x|^p$. One can show that $X(t) \neq 0$ for all $0 \le t \le \tau_e$ a.s. Thus, one can apply the Itô formula to show that for any $t \in [0, \tau_e)$,

$$dV(X(t)) = LV(X(t))\,dt + p\,\epsilon_1 |X(t)|^p\, dB_1(t) + \cdots,$$

where $LV$ is the associated diffusion operator. Noting that $p \in (0, 1)$, $\beta > 1$, and $\epsilon_2 \neq 0$, by the boundedness of polynomial functions there exists a positive constant $H$ such that $LV(x) \le H$.

For any $\varepsilon \in (0, 1)$, choose $\theta > 0$ such that $\theta\varepsilon > 1$. Then for each integer $m > 0$, the exponential martingale inequality applies. Since $\sum_{m=1}^{\infty} m^{-\theta\varepsilon} < \infty$, by the well-known Borel–Cantelli lemma there exists an $\bar{\Omega}_0 \subseteq \Omega$ with $P(\bar{\Omega}_0) = 1$ such that for any $\omega \in \bar{\Omega}_0$ there exists an integer $m(\omega)$ for which the corresponding bound holds whenever $m > m(\omega)$. This, together with Assumption 1, yields a bound of the form $\cdots + \epsilon_1\, dB_1(t) + \theta\varepsilon \log(t + 1)$. Letting $\varepsilon$ be sufficiently small, by the definition of $\phi$ in (9), the claim follows for sufficiently small $\varepsilon \in (0, \epsilon_1^2/2 - \phi)$.

