TWO STEPS AT A TIME - TAKING GAN TRAINING IN STRIDE WITH TSENG'S METHOD

Abstract

Motivated by the training of Generative Adversarial Networks (GANs), we study methods for solving minimax problems with additional nonsmooth regularizers. We do so by employing monotone operator theory, in particular the Forward-Backward-Forward (FBF) method, which avoids the known issue of limit cycling by correcting each update by a second gradient evaluation. Furthermore, we propose a seemingly new scheme which recycles old gradients to mitigate the additional computational cost. In doing so we rediscover a known method, related to Optimistic Gradient Descent Ascent (OGDA). For both schemes we prove novel convergence rates for convex-concave minimax problems via a unifying approach. The derived error bounds are in terms of the gap function for the ergodic iterates. For the deterministic and the stochastic problem we show a convergence rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$, respectively. We complement our theoretical results with empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.

1. INTRODUCTION

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have proven to be a powerful class of generative models, producing for example unseen realistic images. Two neural networks, called generator and discriminator, compete against each other in a game. In the special case of a zero-sum game this task can be formulated as a minimax (also known as saddle point) problem. Conventionally, GANs are trained using variants of (stochastic) Gradient Descent Ascent (GDA), which are known to exhibit oscillatory behavior and thus fail to converge even for simple bilinear saddle point problems, see Goodfellow (2016). We therefore propose the use of methods with provable convergence guarantees for (stochastic) convex-concave minimax problems, even though GAN objectives are well known not to satisfy these properties. Along similar lines, an adaptation of the Extragradient method (EG) (Korpelevich, 1976) for the training of GANs was suggested in Gidel et al. (2019), whereas Daskalakis et al. (2018); Daskalakis & Panageas (2018); Liang & Stokes (2019) studied Optimistic Gradient Descent Ascent (OGDA) based on optimistic mirror descent (Rakhlin & Sridharan, 2013a;b). We instead investigate the Forward-Backward-Forward (FBF) method (Tseng, 1991) from monotone operator theory, which uses two gradient evaluations per update, similar to EG, in order to circumvent the aforementioned issues. Instead of trying to improve GAN performance via new architectures, loss functions, etc., we contribute to the theoretical foundation of their training from the point of view of optimization.

Contribution. Establishing the connection between GAN training and monotone inclusions motivates the use of the FBF method, originally designed to solve this type of problem. This approach allows us to naturally extend the constrained setting to a regularized one by making use of the proximal operator.
We also propose a variant of FBF that reuses previous gradients to reduce the computational cost per iteration, which turns out to be a known method, related to OGDA. By developing a unifying scheme that captures FBF and a generalization of OGDA, we reveal a hitherto unknown connection. Using this approach we prove novel nonasymptotic convergence statements in terms of the minimax gap for both methods in the context of saddle point problems. In the deterministic and stochastic setting we obtain rates of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt{k})$, respectively. Finally, we highlight the relevance of our proposed method as well as the role of regularizers by showing empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.

Organization. This paper is structured as follows. In Section 2 we highlight the connection between GAN training and monotone inclusions and give an extensive review of methods with convergence guarantees for the latter. The main results as well as a precise definition of the measure of optimality are discussed in Section 3. Finally, Section 4 illustrates the empirical performance in the training of GANs as well as in solving bilinear problems.

2. GAN TRAINING AS MONOTONE INCLUSION

The GAN objective was originally cast as a two-player zero-sum game between the discriminator $D_y$ and the generator $G_x$ (Goodfellow et al., 2014), given by

$$\min_x \max_y \; \mathbb{E}_{\rho\sim q}\big[\log(D_y(\rho))\big] + \mathbb{E}_{\zeta\sim p}\big[\log(1 - D_y(G_x(\zeta)))\big],$$

exhibiting the aforementioned minimax structure. Due to problems with vanishing gradients in the training of such models, a successful alternative formulation called Wasserstein GAN (WGAN) (Arjovsky et al., 2017) has been proposed. In this case the minimization tries to reduce the Wasserstein distance between the true distribution $q$ and the one learned by the generator. Reformulating this distance via the Kantorovich-Rubinstein duality leads to an inner maximization over 1-Lipschitz functions, which are approximated via neural networks, yielding the saddle point problem

$$\min_x \max_{y:\, \|D_y\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{\rho\sim q}\big[D_y(\rho)\big] - \mathbb{E}_{\zeta\sim p}\big[D_y(G_x(\zeta))\big].$$

2.1. CONVEX-CONCAVE MINIMAX PROBLEMS

Due to the observations made in the previous paragraph we study the following abstract minimax problem

$$\min_{x\in\mathbb{R}^d} \max_{y\in\mathbb{R}^n} \; \Psi(x, y) := f(x) + \mathbb{E}_{\xi\sim Q}[\Phi(x, y; \xi)] - h(y), \tag{1}$$

where the convex-concave coupling function $\Phi(x, y) := \mathbb{E}_{\xi\sim Q}[\Phi(x, y; \xi)]$, which hides the stochasticity for ease of notation, is differentiable with $L$-Lipschitz continuous gradient. The proper, convex and lower semicontinuous functions $f : \mathbb{R}^d \to \mathbb{R}\cup\{+\infty\}$ and $h : \mathbb{R}^n \to \mathbb{R}\cup\{+\infty\}$ act as regularizers. A solution of (1) is given by a so-called saddle point $(x^*, y^*)$ fulfilling for all $x$ and $y$

$$\Psi(x^*, y) \le \Psi(x^*, y^*) \le \Psi(x, y^*).$$

In the context of two-player games this corresponds to a pair of strategies where no player can be better off by changing just their own strategy. For the purpose of this motivating section, we restrict ourselves for now to the special case of the deterministic constrained version of (1), given by

$$\min_{x\in X} \max_{y\in Y} \; \Phi(x, y),$$

where $f$ and $h$ are given by indicator functions of closed convex sets $X$ and $Y$, respectively. The indicator function $\delta_C$ of a set $C$ is defined as $\delta_C(z) = 0$ for $z \in C$ and $\delta_C(z) = +\infty$ otherwise.

2.2. MINIMAX PROBLEMS AS MONOTONE INCLUSIONS

If the coupling function $\Phi$ is convex-concave and differentiable, then solving (1) is equivalent to solving the first order optimality conditions, which can be written as a so-called monotone inclusion with $w = (x, y) \in \mathbb{R}^m$ and $m = d + n$, given by

$$0 \in F(w) + N_\Omega(w). \tag{2}$$

The entities involved are

$$F(x, y) := \big(\nabla_x \Phi(x, y),\, -\nabla_y \Phi(x, y)\big), \tag{3}$$

and the normal cone $N_\Omega$ of the convex set $\Omega := X \times Y$. The normal cone mapping is given by

$$N_\Omega(w) = \{v \in \mathbb{R}^m : \langle v, w' - w\rangle \le 0 \;\; \forall w' \in \Omega\}$$

for $w \in \Omega$ and $N_\Omega(w) = \emptyset$ for $w \notin \Omega$. Here, the operators $F$ and $N_\Omega$ satisfy well known properties from convex analysis (Bauschke & Combettes, 2011); in particular the first one is monotone (and Lipschitz if $\nabla\Phi$ is so) whereas the latter one is maximal monotone. We call a possibly set-valued operator $A$ from $\mathbb{R}^m$ to itself monotone if

$$\langle u - u', z - z'\rangle \ge 0 \quad \forall u \in A(z),\; u' \in A(z').$$

We say $A$ is maximal monotone if there exists no monotone operator $A'$ such that the graph of $A$ is properly contained in the graph of $A'$. Problems of type (2) have been studied thoroughly in convex optimization, with the most established solution methods being Extragradient (Korpelevich, 1976) and Forward-Backward-Forward (Tseng, 1991). Both methods are known to generate sequences of iterates converging to a solution of (2). Note that in the unconstrained setting (i.e. if $\Omega$ is the entire space) both of these algorithms even produce the same iterates.
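As a small numerical illustration of the monotonicity property defined above, the following sketch evaluates $\langle F(z) - F(z'), z - z'\rangle$ on random pairs of points. It uses the bilinear coupling $\Phi(x, y) = xy$ and, as an illustrative assumption not taken from the text, the coupling $\Phi(x, y) = x^2 + xy - y^2$; the function names are hypothetical.

```python
import numpy as np

def F_bilinear(w):
    # Phi(x, y) = x * y  =>  F(x, y) = (dPhi/dx, -dPhi/dy) = (y, -x)
    x, y = w
    return np.array([y, -x])

def F_quadratic(w):
    # Illustrative assumption: Phi(x, y) = x**2 + x*y - y**2
    # => F(x, y) = (2x + y, -(x - 2y)) = (2x + y, -x + 2y)
    x, y = w
    return np.array([2.0 * x + y, -x + 2.0 * y])

def monotonicity_value(F, z, z_prime):
    # <F(z) - F(z'), z - z'>, which is nonnegative for a monotone operator
    return float(np.dot(F(z) - F(z_prime), z - z_prime))

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=2), rng.normal(size=2)) for _ in range(1000)]

# Smallest observed value over all sampled pairs.
min_bilinear = min(monotonicity_value(F_bilinear, z, zp) for z, zp in pairs)
min_quadratic = min(monotonicity_value(F_quadratic, z, zp) for z, zp in pairs)
print(min_bilinear, min_quadratic)
```

For the bilinear coupling the inner product vanishes identically, so $F$ is monotone but not strongly monotone, which is consistent with the cycling of plain gradient descent ascent mentioned in the introduction. For the quadratic example the inner product equals $2\|z - z'\|^2$.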

2.3. SOLVING MONOTONE INCLUSIONS

The connection between monotone inclusions and saddle point problems is of course not new. The application of Extragradient (EG) to minimax problems has been studied in the seminal paper Nemirovski (2004) under the name of Mirror Prox, and a convergence rate of $\mathcal{O}(1/k)$ in terms of the function values has been proven. Even a stochastic version of the Mirror Prox algorithm has been studied in Juditsky et al. (2011) with a convergence rate of $\mathcal{O}(1/\sqrt{k})$. Applied to problem (2), with $P_\Omega$ being the projection onto $\Omega$, it iterates

$$\text{EG:}\quad \begin{cases} w_k = P_\Omega[z_k - \alpha_k F(z_k)] \\ z_{k+1} = P_\Omega[z_k - \alpha_k F(w_k)]. \end{cases}$$

The Forward-Backward-Forward (FBF) method, introduced in Tseng (1991), has not been studied rigorously for minimax problems in terms of function values yet, despite promising applications in Boţ et al. (2020) and its advantage of requiring only one projection, whereas EG needs two. It is given by

$$\text{FBF:}\quad \begin{cases} w_k = P_\Omega[z_k - \alpha_k F(z_k)] \\ z_{k+1} = w_k + \alpha_k (F(z_k) - F(w_k)). \end{cases} \tag{4}$$

Both EG and FBF have the "disadvantage" of needing two gradient evaluations per iteration. A possible remedy, suggested in Gidel et al. (2019) for EG under the name of extrapolation from the past, is to recycle previous gradients. In a similar fashion we consider

$$\text{FBFp:}\quad \begin{cases} w_k = P_\Omega[z_k - \alpha_k F(w_{k-1})] \\ z_{k+1} = w_k + \alpha_k (F(w_{k-1}) - F(w_k)), \end{cases} \tag{5}$$

where we replaced $F(z_k)$ by $F(w_{k-1})$ twice in (4). As a matter of fact, the above method can be written exclusively in terms of the first variable $w_k$ by incrementing the index $k$ in the first update and then substituting in the second line. This results in

$$w_{k+1} = P_\Omega\big[w_k - \alpha_{k+1} F(w_k) + \alpha_k (F(w_{k-1}) - F(w_k))\big]. \tag{6}$$

This way we rediscover a known method which was studied in Malitsky & Tam (2020) for general monotone inclusions under the name of forward-reflected-backward.
It reduces to optimistic mirror descent (Rakhlin & Sridharan, 2013a;b) in the unconstrained case with constant step size $\alpha_k = \alpha$, giving

$$w_{k+1} = w_k - \alpha\big(2F(w_k) - F(w_{k-1})\big), \tag{7}$$

which has been proposed for the training of GANs under the name of Optimistic Gradient Descent Ascent (OGDA), see Daskalakis et al. (2018); Daskalakis & Panageas (2018); Liang & Stokes (2019). All of the above methods and extensions rely solely on the monotone operator formulation of the saddle point problem, where the two components $x$ and $y$ play a symmetric role. Taking the special minimax structure into consideration, Hamedani & Aybat (2018) showed convergence of a method that uses an optimistic step (7) in one component and a regular gradient step in the other, thus requiring less storage of past gradients in comparison to (6). On the downside, however, by reducing the number of required gradient evaluations per iteration, the largest possible step size is reduced from $1/L$ (see Korpelevich (1976) or Section 3) to $1/(2L)$ (see Gidel et al. (2019); Malitsky & Tam (2020); Malitsky (2015) or Section 3). To summarize, the number of required gradient evaluations is halved, but so is the step size, resulting in no clear net gain.
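The update rules above can be compared on the unconstrained bilinear problem $\min_x \max_y xy$, for which $F(x, y) = (y, -x)$ and $L = 1$. The following is a minimal sketch under these assumptions; the function names, the step size $\alpha = 0.3$ and the starting point are illustrative choices, and the OGDA recursion (7) is started with the convention $w_{-1} = w_0$.

```python
import numpy as np

def F(w):
    # Bilinear coupling Phi(x, y) = x * y  =>  F(x, y) = (y, -x), Lipschitz with L = 1
    return np.array([w[1], -w[0]])

def gda(z0, alpha, iters):
    z = np.array(z0, dtype=float)
    for _ in range(iters):
        z = z - alpha * F(z)
    return z

def extragradient(z0, alpha, iters):
    z = np.array(z0, dtype=float)
    for _ in range(iters):
        w = z - alpha * F(z)           # extrapolation step
        z = z - alpha * F(w)           # update using the gradient at w
    return z

def fbf(z0, alpha, iters):
    z = np.array(z0, dtype=float)
    for _ in range(iters):
        Fz = F(z)
        w = z - alpha * Fz             # forward step (the projection is the identity here)
        z = w + alpha * (Fz - F(w))    # correcting second forward step
    return z

def ogda(w0, alpha, iters):
    w = np.array(w0, dtype=float)
    F_prev = F(w)                      # convention w_{-1} = w_0
    for _ in range(iters):
        F_curr = F(w)                  # one new gradient evaluation per iteration
        w = w - alpha * (2.0 * F_curr - F_prev)
        F_prev = F_curr
    return w

z0 = [1.0, 1.0]
alpha, iters = 0.3, 500
norm_gda = np.linalg.norm(gda(z0, alpha, iters))
norm_eg = np.linalg.norm(extragradient(z0, alpha, iters))
norm_fbf = np.linalg.norm(fbf(z0, alpha, iters))
norm_ogda = np.linalg.norm(ogda(z0, alpha, iters))
print(norm_gda, norm_eg, norm_fbf, norm_ogda)
```

In this unconstrained setting the EG and FBF iterates coincide up to rounding, as noted in Section 2.2, while the GDA iterates move away from the saddle point $(0, 0)$.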

2.4. REGULARIZERS

The role of regularizers is well studied in many fields such as statistics (Tibshirani, 1996), signal processing (Palomar & Eldar, 2010) or inverse problems (Rudin et al., 1992). They serve different purposes such as inducing sparsity in the solution or conditioning of the problem. In the context of deep learning this has been explored from different perspectives, e.g. in incremental convex neural networks where neurons with zero weights are removed from the network and new ones are inserted according to different policies, see Bach (2017); Bengio et al. (2006); Rosset et al. (2007); Pieper & Petrosyan (2020). Other examples include the box constraints for WGANs with weight clipping (see Arjovsky et al. (2017)) or spectral normalization (see Miyato et al. (2018)), which has so far rather been considered as part of the architecture, but can at the same time be seen as a regularization term of the function values. In the framework of monotone operator theory the optimality condition of the regularized minimax problem (1) can be written as

$$0 \in F(w) + \partial r(w), \tag{8}$$

where $r$ is given by $(x, y) \mapsto f(x) + h(y)$. The possibly set-valued operator $\partial r$ denotes the subdifferential of $r$ and is given by

$$\partial r(w) := \{v \in \mathbb{R}^m : \langle v, w' - w\rangle + r(w) \le r(w') \;\; \forall w' \in \mathbb{R}^m\}.$$

The monotone inclusion (8) generalizes (2) in a natural way, since $N_\Omega = \partial\delta_\Omega$. Similarly, the projection constitutes a special case of the so-called proximal mapping, which for the function $r$ and $\lambda > 0$ is given by

$$\operatorname{prox}_{\lambda r}(w) := \arg\min_{w' \in \mathbb{R}^m} \Big\{ r(w') + \frac{1}{2\lambda}\|w' - w\|^2 \Big\}.$$

In particular, the proximal mapping of the indicator $\delta_\Omega$ yields the projection onto the set $\Omega$, i.e. $\operatorname{prox}_{\lambda\delta_\Omega} = P_\Omega$.
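Two proximal mappings that are relevant later in the experiments have simple closed forms: for the $\ell_1$-norm the proximal mapping is componentwise soft-thresholding, and for the indicator of a box it is clipping. The sketch below illustrates both and compares the first one against a brute-force minimization of the defining objective on a grid; all function names and parameter values are hypothetical choices for illustration.

```python
import numpy as np

def prox_l1(w, lam, kappa=1.0):
    # prox of lam * kappa * ||.||_1: componentwise soft-thresholding
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - lam * kappa, 0.0)

def prox_box_indicator(w, lam, lower=-1.0, upper=1.0):
    # prox of the indicator of the box [lower, upper]^m: the projection,
    # which does not depend on lam
    return np.clip(np.asarray(w, dtype=float), lower, upper)

def prox_by_grid_search(r, w, lam, grid):
    # Brute-force scalar check of the definition:
    # minimize r(w') + (w' - w)^2 / (2 * lam) over the grid points w'
    values = r(grid) + (grid - w) ** 2 / (2.0 * lam)
    return grid[np.argmin(values)]

soft = prox_l1([-2.0, 0.3, 1.5], lam=0.5)
clipped = prox_box_indicator([-3.0, 0.2, 5.0], lam=0.5)
grid = np.linspace(-3.0, 3.0, 6001)
brute = prox_by_grid_search(np.abs, 1.5, 0.5, grid)
print(soft, clipped, brute)
```

Soft-thresholding shrinks every entry towards zero and sets small entries exactly to zero, whereas clipping leaves entries inside the box untouched.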

3. MAIN RESULTS

Motivated by the considerations above we study the inclusion problem

$$0 \in F(w) + \partial r(w), \tag{9}$$

where $F : \mathbb{R}^m \to \mathbb{R}^m$ is a monotone and Lipschitz operator and $r : \mathbb{R}^m \to \mathbb{R}\cup\{+\infty\}$ is a proper, convex and lower semicontinuous function.

3.1. MEASURE OF OPTIMALITY

There are two common quantities measuring the quality of a point with respect to the monotone inclusion (8). The most natural one is the distance to the solution set, for which typically only asymptotic convergence can be proved. If $F$ arises from a saddle point problem (1), meaning that $F$ has the form (3), we want to use a more problem-specific measure, the minimax gap, which for a point $w = (u, v) \in \mathbb{R}^d \times \mathbb{R}^n$ is given by

$$\sup_{y\in\mathbb{R}^n} \Psi(u, y) - \inf_{x\in\mathbb{R}^d} \Psi(x, v) = \sup_{x\in\mathbb{R}^d,\, y\in\mathbb{R}^n} \big\{\Psi(u, y) - \Psi(x, v)\big\}. \tag{10}$$

This minimax gap can be interpreted from a game theoretic standpoint as the sum of the maximal payoffs achievable by the two players by playing their respective best responses, given the current strategy of the opponent. In the more general monotone inclusion setting, where no function values are available, an appropriate generalization of (10) is given for any $w \in \mathbb{R}^m$ by

$$\sup_{z\in\mathbb{R}^m} \; \langle F(z), w - z\rangle + r(w) - r(z).$$

If $r$ is the indicator $\delta_\Omega$ of the compact and convex set $\Omega$, it is clear that the supremum is only taken over $z \in \Omega$ and will thus be finite.

The restricted gap. Since the problem (9) is in general unconstrained and the supremum can be infinite, we consider instead, as done for example in Nesterov (2007), the restricted gap where the above supremum is taken over an auxiliary compact set $B \subset \mathbb{R}^m$ instead of the entire space. Note that the restricted gap is in general only a reasonable measure of optimality for elements of $B$. It is nonnegative on $B$ and zero for points of $B$ which solve (9). Additionally, we want to be able to conclude that if a point $w^*$ has zero gap it solves (9). This is for example the case if $w^*$ is in the interior of $B$, which can always be ensured if $B$ is chosen large enough. In order to capture both settings at the same time we define the following unifying gap

$$G_B(w) := \begin{cases} \sup_{(x,y)\in B} \; \Psi(u, y) - \Psi(x, v) & \text{if } F \text{ and } r \text{ come from (1)}, \\ \sup_{z\in B} \; \langle F(z), w - z\rangle + r(w) - r(z) & \text{otherwise.} \end{cases} \tag{11}$$
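To make the restricted gap concrete, consider as a worked example the bilinear problem $\Psi(x, y) = xy$ (without regularizers) and the illustrative choice $B = [-1, 1]^2$, which is an assumption made here for the sake of the computation. The supremum then separates into two one-dimensional problems:

```latex
\begin{aligned}
G_B(u, v)
  &= \sup_{(x, y) \in [-1,1]^2} \big\{ \Psi(u, y) - \Psi(x, v) \big\}
   = \sup_{y \in [-1,1]} u\,y \;+\; \sup_{x \in [-1,1]} (-x\,v) \\
  &= |u| + |v| .
\end{aligned}
```

In this example the restricted gap is nonnegative everywhere and vanishes only at $(u, v) = (0, 0)$, which lies in the interior of $B$ and is the saddle point of $xy$.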

3.2. METHODS

We now present a novel unifying scheme for solving problem (9), which generalizes FBF (4) and in addition recovers the method motivated in (5) as FBFp. Let us point out again that the latter algorithm was already introduced in Malitsky & Tam (2020) and corresponds to OGDA (Rakhlin & Sridharan, 2013a; Daskalakis et al., 2018; Daskalakis & Panageas, 2018) if $F$ stems from the minimax setting (3).

Algorithm 3.1 (generalized FBF). For a starting point $z_0 \in \mathbb{R}^m$ and step sizes $\alpha_k > 0$ we consider for all $k \ge 0$

$$\begin{aligned} w_k &= \operatorname{prox}_{\alpha_k r}\big(z_k - \alpha_k F(\diamondsuit_k)\big) \\ z_{k+1} &= w_k + \alpha_k \big(F(\diamondsuit_k) - F(w_k)\big). \end{aligned}$$

For $\diamondsuit_k = z_k$ this reduces to the well known FBF method, whereas $\diamondsuit_k = w_{k-1}$, with the additional initial condition $w_{-1} = z_0$, recycles previous gradients (FBFp).

Consider the scenario where $F$ is given as an expectation $\mathbb{E}_\xi[F(\,\cdot\,; \xi)]$, e.g. coming from (1), and only a stochastic estimator $F(\,\cdot\,; \xi)$ is accessible instead of $F$ itself. In this case we adapt Algorithm 3.1 in the following way.

Algorithm 3.2 (generalized stochastic FBF). For a starting point $z_0 \in \mathbb{R}^m$ and step sizes $\alpha_k > 0$ we consider for all $k \ge 0$

$$\begin{aligned} &\xi_k \sim Q \quad (\text{optionally } \eta_k \sim Q) \\ w_k &= \operatorname{prox}_{\alpha_k r}\big(z_k - \alpha_k F(\diamondsuit_k; \triangle_k)\big) \\ z_{k+1} &= w_k + \alpha_k \big(F(\diamondsuit_k; \triangle_k) - F(w_k; \xi_k)\big). \end{aligned}$$

For $\diamondsuit_k = z_k$ and $\triangle_k = \eta_k$ this results in a stochastic version of FBF, whereas $\diamondsuit_k = w_{k-1}$ and $\triangle_k = \xi_{k-1}$ recycles previous gradients (stochastic FBFp) with the additional initial condition $w_{-1} = z_0$ and $\xi_{-1} = \eta_0$.

Even though both methods encompassed by the unifying scheme Algorithm 3.1 have been studied in the deterministic setting before, the stated convergence results are new. Note that while the rate for FBF is completely new, our result for FBFp provides only a generalization of the known rate for OGDA, see Mokhtari et al. (2019). Similarly, the stochastic version of FBF has been considered before in Boţ et al. (2019) and rates have been obtained, but only in terms of the fixed point residual and not the function values.
However, we want to point out that the stochastic version of FBFp has not been considered prior to this work.
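The deterministic scheme of Algorithm 3.1 can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: it applies both variants to the regularized toy problem $\min_x \max_{y\in[-1,1]} \kappa|x| + xy$ from Section 4.1, with the assumed values $\kappa = 0.5$, $\alpha = 0.4$, $z_0 = (0.5, 0.5)$ and $B = [-1,1]^2$. The function names and the closed-form expression of the restricted gap on this particular $B$ are derived here for illustration and are not taken from the text.

```python
import numpy as np

KAPPA = 0.5  # assumed weight of the L1 regularizer

def F(w):
    # Coupling Phi(x, y) = x * y  =>  F(x, y) = (y, -x), Lipschitz with L = 1
    return np.array([w[1], -w[0]])

def prox_r(w, alpha):
    # r(x, y) = KAPPA * |x| + indicator of [-1, 1] in y:
    # soft-thresholding in x, clipping in y
    x = np.sign(w[0]) * max(abs(w[0]) - alpha * KAPPA, 0.0)
    y = min(max(w[1], -1.0), 1.0)
    return np.array([x, y])

def generalized_fbf(F, prox, z0, alpha, iters, recycle=False):
    """Averaged iterates of Algorithm 3.1 with constant step size.

    recycle=False: diamond_k = z_k      (FBF, two evaluations of F per iteration)
    recycle=True:  diamond_k = w_{k-1}  (FBFp, one new evaluation per iteration)
    """
    z = np.array(z0, dtype=float)
    F_prev = F(z) if recycle else None   # F(w_{-1}) with w_{-1} = z_0
    w_sum = np.zeros_like(z)
    for _ in range(iters):
        F_diamond = F_prev if recycle else F(z)
        w = prox(z - alpha * F_diamond, alpha)
        F_w = F(w)
        z = w + alpha * (F_diamond - F_w)
        F_prev = F_w
        w_sum += w
    return w_sum / iters

def restricted_gap(u, v):
    # Closed form of sup_{(x,y) in B} Psi(u, y) - Psi(x, v) for B = [-1, 1]^2
    # and Psi(x, y) = KAPPA * |x| + x * y, valid for |v| <= 1
    return (KAPPA + 1.0) * abs(u) + max(0.0, abs(v) - KAPPA)

z0, alpha, K = [0.5, 0.5], 0.4, 2000
avg_fbf = generalized_fbf(F, prox_r, z0, alpha, K, recycle=False)
avg_fbfp = generalized_fbf(F, prox_r, z0, alpha, K, recycle=True)
gap_fbf = restricted_gap(*avg_fbf)
gap_fbfp = restricted_gap(*avg_fbfp)
print(gap_fbf, gap_fbfp)
```

The only difference between the two variants is which gradient is used in place of $F(\diamondsuit_k)$; in the recycling variant the evaluation $F(w_k)$ from the previous iteration is stored and reused, so only one new evaluation of $F$ is needed per iteration.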

3.3. CONVERGENCE

In the following let $B \subset \mathbb{R}^m$ be the compact set of the restricted (unifying) gap function (11), with $D := \sup_{w,z\in B} \|z - w\|$ denoting its diameter. For convenience in the estimation we assume that the starting point $z_0$ of the discussed methods is in $B$.

Theorem 3.1 (deterministic). Let $(w_k)_{k\ge 0}$ be the sequence generated by Algorithm 3.1. If (i) FBF, i.e. $\diamondsuit_k = z_k$, with step size $\alpha_k = \alpha \le 1/L$, or (ii) FBFp, i.e. $\diamondsuit_k = w_{k-1}$, with step size $\alpha_k = \alpha \le 1/(2L)$ is chosen, then for all $K \ge 1$ the averaged iterates $\bar{w}_K := \frac{1}{K}\sum_{k=0}^{K-1} w_k$ fulfill

$$G_B(\bar{w}_K) \le \frac{D^2}{2\alpha K},$$

where $G_B$ is the restricted gap defined in (11).

In order to derive similar convergence statements for the stochastic algorithm we need to assume (standard) properties of the gradient estimator $F(\,\cdot\,; \xi)$.

Assumption 1. Unbiasedness: $\mathbb{E}_\xi[F(w; \xi)] = F(w)$ for all $w \in \mathbb{R}^m$.

Assumption 2. Bounded variance: $\mathbb{E}_\xi[\|F(w; \xi) - F(w)\|^2] \le \sigma^2$ for all $w \in \mathbb{R}^m$.

In fact, we only need the above assumption to hold for all iterates $w_k$. Such a hypothesis is in practice difficult to check, but could be exploited in special cases where additional properties of the variance and boundedness of the iterates are known a priori.

Assumption 3. The samples $\xi_k$ are independent of the iterates $w_k$, for all $k \ge 0$.

Equipped with these assumptions we are now able to prove the statement.

Theorem 3.2 (stochastic). Let Assumptions 1, 2 and 3 hold and let $(w_k)_{k\ge 0}$ be the sequence generated by Algorithm 3.2. If (i) stochastic FBF, i.e. $\diamondsuit_k = z_k$ and $\triangle_k = \eta_k$, with step size $\alpha_k \le \alpha \le 1/(\sqrt{2}L)$, or (ii) stochastic FBFp, i.e. $\diamondsuit_k = w_{k-1}$ and $\triangle_k = \xi_{k-1}$, with step size $\alpha_k \le \alpha \le 1/(3L)$ is chosen, then for all $K \ge 1$ the averaged iterates $\bar{w}_K := \frac{\sum_{k=0}^{K-1}\alpha_k w_k}{\sum_{k=0}^{K-1}\alpha_k}$ fulfill

$$\mathbb{E}[G_B(\bar{w}_K)] \le \frac{D^2 + 24\sigma^2 \sum_{k=0}^{K-1}\alpha_k^2}{\sum_{k=0}^{K-1}\alpha_k},$$

where $G_B$ is the restricted gap defined in (11).
The above theorem exhibits a classical step size dependence (Robbins & Monro, 1951), yielding convergence for sequences $(\alpha_k)_{k\ge 0}$ that are square summable, $\sum_{k=0}^{\infty}\alpha_k^2 < +\infty$, but not summable, $\sum_{k=0}^{\infty}\alpha_k = +\infty$. Additionally, if in the setting of Theorem 3.2 the step size is chosen to be $\alpha_k = \alpha/\sqrt{k+1}$, a convergence rate can be obtained and is given by

$$\mathbb{E}[G_B(\bar{w}_K)] = \mathcal{O}\Big(\frac{1}{\sqrt{K}}\Big). \tag{12}$$

If the step size does not go to zero, the gap can usually not be expected to vanish either. However, we can still show a decrease in the gap up to a residual stemming from the variance. In particular, for a constant step size $\alpha_k = \alpha$ we have

$$\mathbb{E}[G_B(\bar{w}_K)] \le \frac{D^2}{\alpha K} + 24\sigma^2\alpha. \tag{13}$$

Additionally, if the number of iterations $K$ is fixed beforehand, a conclusion similar to (12) can be obtained by choosing $\alpha = 1/\sqrt{K}$ in (13).
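The fixed-horizon statement can be made explicit by a short computation. Substituting $\alpha = 1/\sqrt{K}$ into (13) gives the following, where the conditions on $K$ simply restate the step size restrictions of Theorem 3.2:

```latex
\mathbb{E}\big[G_B(\bar{w}_K)\big]
  \;\le\; \frac{D^2}{K^{-1/2}\,K} + 24\sigma^2 K^{-1/2}
  \;=\; \frac{D^2 + 24\sigma^2}{\sqrt{K}},
\qquad \text{provided } K \ge 2L^2 \text{ (FBF)} \text{ or } K \ge 9L^2 \text{ (FBFp)}.
```

As an additional observation that is not part of the stated results: for a scaled choice $\alpha = c/\sqrt{K}$ with $c > 0$, the right-hand side of (13) becomes $(D^2/c + 24\sigma^2 c)/\sqrt{K}$, which for $\sigma > 0$ is minimized by $c = D/(\sqrt{24}\,\sigma)$, as long as the resulting step size still respects the bounds of Theorem 3.2.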

4. EXPERIMENTS

The aim of this section is to show how the use of methods with convergence guarantees, albeit only in the monotone setting, can yield better training performance for different architectures and objectives. In particular, we demonstrate that FBF can perform at least as well as EG while requiring fewer evaluations of the regularizers.

4.1. 2D TOY EXAMPLE

Following Goodfellow (2016); Mescheder et al. (2018) and others we consider the canonical example $\min_x \max_y xy$, which illustrates that cycling behavior occurs even for bilinear minimax problems. We augment this example by adding a nonsmooth L1-regularizer for one player, with $\kappa > 0$, resulting in

$$\min_{x\in\mathbb{R}} \max_{y\in[-1,1]} \; \kappa|x| + xy. \tag{14}$$

The aforementioned issue of GDA (and its proximal extension PGDA) cycling around the solution is highlighted in Figure 1. The other methods, for which we display the averaged iterates, do however converge to a solution and show a decrease in the restricted gap in accordance with the theory. Although the proximal steps provide improvement towards the solution $(0, 0)$ and FBF uses only half as many of them as EG, FBF outperforms the competing algorithms.

4.2. WGAN TRAINED ON CIFAR10

In this section we apply the above proposed techniques from monotone inclusions to the training of Wasserstein GANs employing DCGAN (Radford et al., 2015) and ResNet (He et al., 2016) architectures. All models are trained on the CIFAR10 dataset (Krizhevsky et al., 2009), which consists of 60,000 images in 10 different classes (with 50,000 training images and 10,000 test images), using an NVIDIA RTX 2080Ti GPU. For the DCGAN experiments we work with the original WGAN formulation including weight clipping, since it includes regularizers innately (the indicator of a box for the weights of the discriminator). In addition we propose a modification of the WGAN formulation which replaces the box constraint on the discriminator's weights with an L1-regularization, under the name of WGAN-L1. This results in a soft-thresholding operation instead of the "harsh" clipping. For the experiments on ResNet we use the WGAN-GP formulation (Gulrajani et al., 2017), which penalizes the norm of the gradient of the discriminator to enforce the Lipschitz constraint, together with spectral normalization of the weight matrices (Miyato et al., 2018), which can be seen as a projection. The two evaluation metrics used are the Inception Score (IS, higher is better) (Salimans et al., 2016) and the Fréchet inception distance (FID, lower is better) (Heusel et al., 2017), both computed on 50,000 samples. In the case of the IS we use the updated and corrected implementation from Barratt & Sharma (2018). All results are averaged over 5 runs for each method. In Table 1 the best IS and FID for each method are reported. FBF Adam performs at least as well as all considered competitors with respect to both evaluation metrics. One can also see that WGAN-L1 using the proximal operator improves the performance of all considered methods. Figure 2 shows the training progress regarding IS for each method and both problem formulations.
The graphs suggest that making use of the WGAN-L1 objective has a stabilizing effect during training, leading to a smoother and more consistent learning curve, a property that only FBF Adam seems to exhibit for weight clipping. Figure 3 as well as Table 1 show that for the WGAN-GP formulation FBF Adam maintains the improved performance of EG compared to GDA, while requiring only half as many spectral normalizations, resulting in time savings of up to 10% as reported in Miyato et al. (2018).

5. CONCLUSION

By highlighting the connection between GAN objectives and monotone inclusions, we are able to tackle their training via the Forward-Backward-Forward method which is known to converge to a solution for convex-concave minimax problems. We deepened this theoretical understanding by proving novel convergence rates in terms of the function values. We complement these rigorous considerations by promising practical results, indicating that application of FBF can lead to improved performance and saved computation time (compared to EG). 

A DEFINITIONS

In Section 2.4 we require the regularizers to be proper, convex and lower semicontinuous, which are common properties in convex analysis. We call a function $r : \mathbb{R}^m \to \mathbb{R}\cup\{+\infty\}$ proper if it is not constant $+\infty$, which means that it takes a finite value for at least a single point. In addition, we say that $r$ is lower semicontinuous if for all $z_0 \in \mathbb{R}^m$

$$\liminf_{z\to z_0} r(z) \ge r(z_0).$$

It is easy to see that if $C \subset \mathbb{R}^m$ is nonempty, closed and convex, then the indicator $\delta_C$ of this set, given by

$$\delta_C(z) = \begin{cases} 0 & \text{if } z \in C, \\ +\infty & \text{otherwise,} \end{cases}$$

fulfills the assumptions of being proper, convex and lower semicontinuous.

B ABOUT THE GAP FUNCTION

Typically in monotone inclusions, the distance to the set of solutions is used as a measure of quality of a given point due to the lack of more specific structure in general. Asymptotic convergence of the iterates has been established for FBF and FBFp in Bauschke & Combettes (2011, Proposition 27.13) and Malitsky & Tam (2020), respectively. Furthermore, no convergence rates can be expected without stronger monotonicity assumptions. We want to take into account the special structure of the monotone inclusion coming from the minimax problem (1). For this reason we use the following (restricted) minimax gap, common for saddle point problems, which for a point $(u, v)$ is given by

$$G_B(u, v) = \sup_{(x,y)\in B} \; \Psi(u, y) - \Psi(x, v). \tag{15}$$

For the general case, i.e. $F$ being an arbitrary monotone and Lipschitz operator, this is connected to the other measure of optimality we use in (11), for $w \in \mathbb{R}^m$ given by

$$G_B(w) = \sup_{z\in B} \; \langle F(z), w - z\rangle + r(w) - r(z), \tag{16}$$

where we interpret the possible occurrence of $\infty - \infty$ as $+\infty$. It stems from the field of Variational Inequalities, where such a function is also known as a merit function (Nesterov, 2007). The relevance of the above two quantities will be made clear by the following statements.

Proof. A saddle point $(x^*, y^*)$ clearly fulfills

$$\sup_{(x,y)\in\mathbb{R}^d\times\mathbb{R}^n} \; \Psi(x^*, y) - \Psi(x, y^*) = 0.$$

On the other hand let $G_B(x^*, y^*) = 0$. For an arbitrary point $(x, y)$ we can choose $\alpha \in (0, 1)$ large enough such that $(u, v) := \alpha(x^*, y^*) + (1-\alpha)(x, y)$ is in the interior of $B$. Therefore,

$$\Psi(x^*, v) - \Psi(u, y^*) = \Psi(x^*, \alpha y^* + (1-\alpha)y) - \Psi(\alpha x^* + (1-\alpha)x, y^*) \le 0.$$

Using the convex-concave structure of $\Psi$ we deduce that

$$\alpha\Psi(x^*, y^*) + (1-\alpha)\Psi(x^*, y) - \alpha\Psi(x^*, y^*) - (1-\alpha)\Psi(x, y^*) \le 0,$$

which implies that $\Psi(x^*, y) \le \Psi(x, y^*)$. Since $(x, y)$ was chosen arbitrarily, $(x^*, y^*)$ is a saddle point.

Similarly, an analogous statement can be shown for (16).
The proof, however, is split up into multiple lemmas to highlight the connection to Variational Inequalities.

Theorem B.2. Let $F : \mathbb{R}^m \to \mathbb{R}^m$ be monotone and continuous, $r : \mathbb{R}^m \to \mathbb{R}\cup\{+\infty\}$ proper, convex and lower semicontinuous and $B \subset \mathbb{R}^m$. A point $w^*$ in the interior of $B$ solves the monotone inclusion

$$0 \in F(w) + \partial r(w) \tag{17}$$

if and only if its restricted gap (16) is zero, $G_B(w^*) = 0$. For all other elements of $B$ the gap is nonnegative.

Let the assumptions of Theorem B.2 hold true for the following lemmas as we break up the proof into separate statements. We do so by making use of the associated Variational Inequality (VI)

$$\text{find } w \text{ such that } \langle F(w), z - w\rangle + r(z) - r(w) \ge 0 \quad \forall z \in \mathbb{R}^m. \tag{18}$$

Lemma B.3. The monotone inclusion (17) is equivalent to the VI (18).

Proof. The equivalence of (17) and (18) follows immediately from the definition of the subdifferential of $r$.

The formulation (18) is typically referred to as the strong form of the VI, whereas

$$\text{find } w \text{ such that } \langle F(z), z - w\rangle + r(z) - r(w) \ge 0 \quad \forall z \in \mathbb{R}^m \tag{19}$$

is known as the weak formulation.

Lemma B.4. Under the given assumptions the notions of weak and strong VI are equivalent.

Proof. For the monotone operator $F$ it is clear that if $w^*$ is a solution to the strong formulation (18), it is also a solution to the weak formulation (19). In fact, if $F$ is continuous the reverse implication also holds true. To see this, let $w^*$ be a solution to the weak VI (19) and $z = \alpha w^* + (1-\alpha)u$ for an arbitrary $u \in \mathbb{R}^m$ and $\alpha \in (0, 1)$, then

$$\langle F(\alpha w^* + (1-\alpha)u), (1-\alpha)(u - w^*)\rangle + r(\alpha w^* + (1-\alpha)u) - r(w^*) \ge 0.$$

This implies by the convexity of $r$ that

$$(1-\alpha)\langle F(\alpha w^* + (1-\alpha)u), u - w^*\rangle + (1-\alpha)(r(u) - r(w^*)) \ge 0.$$

By dividing by $(1-\alpha)$ and then taking the limit $\alpha \to 1$ we obtain that $w^*$ is a solution of the strong form (18).

With the notion of VIs in mind, the above defined gap (16) becomes natural as it measures how much the statement of (19) is violated.

Lemma B.5. $G_B$ is nonnegative on $B$ and zero for solutions of the weak VI.

Proof.
It is clear that $G_B(w) \ge 0$ for $w \in B$, as $z = w$ can be chosen in the supremum. On the other hand, if $w^* \in B$ is a solution to the weak VI (19) then $G_B(w^*) = 0$. This follows from the fact that for a solution of (19) we have for all $z \in B$

$$\langle F(z), w^* - z\rangle + r(w^*) - r(z) \le 0.$$

Therefore the supremum over the above expression in $z$ is also at most zero, but clearly zero is attained for $z = w^*$.

For the reverse implication to hold true, we may not use points on the boundary of $B$.

Lemma B.6. If a point $w^*$ in the interior of $B$ exhibits zero gap $G_B(w^*) = 0$, then it is a solution to the weak VI (19).

Proof. Since $w^*$ is in the interior of $B$ we can, for an arbitrary $w \in \mathbb{R}^m$, choose $\alpha \in (0, 1)$ large enough such that $z := \alpha w^* + (1-\alpha)w \in B$. Using this $z$ in the supremum of the gap we deduce that

$$\langle F(\alpha w^* + (1-\alpha)w), w^* - \alpha w^* - (1-\alpha)w\rangle + r(w^*) - r(\alpha w^* + (1-\alpha)w) \le 0.$$

This implies that

$$(1-\alpha)\langle F(\alpha w^* + (1-\alpha)w), w - w^*\rangle + (1-\alpha)(r(w) - r(w^*)) \ge 0.$$

By dividing by $(1-\alpha)$ and then taking the limit $\alpha \to 1$ we deduce that $w^*$ solves the strong form of the VI (18).

C REFINED THEOREMS

Recall that the restricted (unifying) gap function $G_B$ defined in (11) is computed with respect to a set $B \subset \mathbb{R}^m$, where $D := \sup_{w,z\in B}\|z - w\|$ denotes its diameter, and it is assumed that $z_0 \in B$. Furthermore, the averaged iterates $\bar{w}_K$ for $K \ge 1$ are given by

$$\bar{w}_K := \frac{\sum_{k=0}^{K-1}\alpha_k w_k}{\sum_{k=0}^{K-1}\alpha_k}.$$

C.1 DETERMINISTIC STATEMENTS

The convergence statement of Theorem 3.1 actually holds true not just for a constant step size as presented in Section 3, but for variable step sizes as well.

Theorem C.1. Let $(w_k)_{k\ge 0}$ be the sequence generated by Algorithm 3.1. If (i) FBF, i.e. $\diamondsuit_k = z_k$, with step size $0 < \alpha_k \le \alpha \le 1/L$, or (ii) FBFp, i.e. $\diamondsuit_k = w_{k-1}$, with step size $0 < \alpha_k \le \alpha \le 1/(2L)$ is chosen, then for all $K \ge 1$

$$G_B(\bar{w}_K) \le \frac{D^2}{2\sum_{k=0}^{K-1}\alpha_k}.$$

C.2 STOCHASTIC STATEMENTS

We actually prove a slightly more general version of Theorem 3.2. In particular the step size can be chosen larger than initially claimed, however, at the cost of a worse constant.

Theorem C.2. Let Assumptions 1, 2 and 3 hold and let $(w_k)_{k\ge 0}$ be the sequence generated by FBF, i.e. Algorithm 3.2 with $\diamondsuit_k = z_k$ and $\triangle_k = \eta_k$. Let the step size satisfy $\alpha_k \le \alpha < \frac{1}{L}$, then

$$\mathbb{E}[G_B(\bar{w}_K)] \le \frac{D^2 + 4(1 - \alpha^2 L^2)^{-1}\sigma^2\sum_{k=0}^{K-1}\alpha_k^2}{2\sum_{k=0}^{K-1}\alpha_k}$$

for all $K \ge 1$.

Theorem 3.2 (i) can be deduced from the above statement by using $\alpha = 1/(\sqrt{2}L)$, which yields $(1 - \alpha^2 L^2)^{-1} = 2$.

D.3 FORWARD-BACKWARD-FORWARD

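Proof for deterministic FBF, Theorem C.1 (i). We start off by plugging $\diamondsuit_k = z_k$ into (21). Since $W_k = Z_k = 0$ we can use $\gamma \to 0$ to deduce that for all $k \ge 0$

$$\alpha_k g(w_k, z) + \tfrac{1}{2}\|z_{k+1} - z\|^2 \le \tfrac{1}{2}\|z_k - z\|^2 - \tfrac{1}{2}(1 - \alpha_k^2 L^2)\|z_k - w_k\|^2.$$

From this it is clear that the step size is constrained by $\alpha \le 1/L$ as stated in the theorem. By summing up from $k = 0$ to $K-1$ and dividing by $\sum_{k=0}^{K-1}\alpha_k$ we obtain

$$\frac{1}{\sum_{k=0}^{K-1}\alpha_k}\sum_{k=0}^{K-1}\alpha_k g(w_k, z) \le \frac{\|z_0 - z\|^2}{2\sum_{k=0}^{K-1}\alpha_k}.$$

The claimed statement is then derived by taking the supremum in $z$ over $B$ and applying Lemma D.1.

Proof for stochastic FBF, Theorem C.2. Plugging $\diamondsuit_k = z_k$ and $\triangle_k = \eta_k$ into (21) gives for all $k \ge 0$

$$\begin{aligned} \alpha_k g(w_k, z) + \tfrac{1}{2}\|z_{k+1} - z\|^2 \le{}& \tfrac{1}{2}\|z_k - z\|^2 - \tfrac{1}{2}\big(1 - (1+\gamma)\alpha_k^2 L^2\big)\|z_k - w_k\|^2 + \alpha_k\langle W_k, z - v_k\rangle \\ &+ \alpha_k\langle W_k, v_k - w_k\rangle + (1+\gamma^{-1})\alpha_k^2\big(\|W_k\|^2 + \|Z_k\|^2\big). \end{aligned}$$

By summing this inequality up and applying Lemma D.2 with $v_0 = z_0$, $p_k = -\alpha_k W_k$ and $v_{k+1} := v_k - p_k$ we deduce that

$$\sum_{k=0}^{K-1}\langle -\alpha_k W_k, v_k - z\rangle \le \tfrac{1}{2}\|z_0 - z\|^2 + \tfrac{1}{2}\sum_{k=0}^{K-1}\alpha_k^2\|W_k\|^2,$$

and therefore

$$\sum_{k=0}^{K-1}\alpha_k g(w_k, z) \le \|z_0 - z\|^2 + \sum_{k=0}^{K-1}\Big[\alpha_k\langle W_k, v_k - w_k\rangle + 2(1+\gamma^{-1})\alpha_k^2\big(\|W_k\|^2 + \|Z_k\|^2\big)\Big].$$

By choosing $\gamma$ such that $\alpha = (\sqrt{1+\gamma}\,L)^{-1}$ we deduce that $1 + \gamma^{-1} = 1/(1 - \alpha^2 L^2)$. Next, we take the supremum over $z \in B$ and the expectation to obtain

$$\mathbb{E}\Big[\sup_{z\in B}\sum_{k=0}^{K-1}\alpha_k g(w_k, z)\Big] \le D^2 + 4(1 - \alpha^2 L^2)^{-1}\sigma^2\sum_{k=0}^{K-1}\alpha_k^2,$$

where we used that

$$\mathbb{E}[\langle W_k, v_k - w_k\rangle] = \mathbb{E}\big[\mathbb{E}[\langle W_k, v_k - w_k\rangle \mid w_{[k]}, \xi_{[k-1]}]\big] = \mathbb{E}\big[\langle \mathbb{E}[W_k \mid w_{[k]}, \xi_{[k-1]}], v_k - w_k\rangle\big] = 0,$$

with $\xi_{[k-1]} = (\xi_0, \dots, \xi_{k-1})$

$$\alpha_k g(w_k, z) + \tfrac{1}{2}\|z_{k+1} - z\|^2 \le \tfrac{1}{2}\|z_k - z\|^2 - \tfrac{1}{2}\|z_k - w_k\|^2 + \tfrac{1}{2}\alpha_k^2 L^2\|w_{k-1} - w_k\|^2. \tag{28}$$

Now we need to bound the term $\|w_{k-1} - w_k\|^2$ by $\|z_k - w_k\|^2$. Since

$$2\|z_k - w_k\|^2 + 2\|z_k - w_{k-1}\|^2 \ge \|w_k - w_{k-1}\|^2 \tag{29}$$

we have for all $k \ge 1$

$$\|z_k - w_k\|^2 \ge -\|z_k - w_{k-1}\|^2 + \tfrac{1}{2}\|w_{k-1} - w_k\|^2 \ge -\alpha_{k-1}^2 L^2\|w_{k-1} - w_{k-2}\|^2 + \tfrac{1}{2}\|w_{k-1} - w_k\|^2, \tag{30}$$

whereas for $k = 0$, since $w_{-1} = z_0$, we have that

$$\|z_0 - w_0\|^2 = \|w_{-1} - w_0\|^2. \tag{31}$$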
Plugging (31) into (28) for $k = 0$ we get that

$$\alpha_0 g(w_0, z) + \frac{1}{2}\|z_1 - z\|^2 + \frac{1}{2}(1 - \alpha_0^2 L^2)\|w_0 - w_{-1}\|^2 \le \frac{1}{2}\|z_0 - z\|^2. \tag{32}$$

Plugging (30) into (28) we get that for all $k \ge 1$

$$\alpha_k g(w_k, z) + \frac{1}{2}\|z_{k+1} - z\|^2 + \frac{1}{2}\Big(\frac{1}{2} - \alpha_k^2 L^2\Big)\|w_k - w_{k-1}\|^2 \le \frac{1}{2}\|z_k - z\|^2 + \frac{1}{2}\alpha_{k-1}^2 L^2 \|w_{k-1} - w_{k-2}\|^2. \tag{33}$$

In order to be able to telescope we need to ensure that for all $k \ge 0$

$$\frac{1}{2} - \alpha_k^2 L^2 \ge \alpha_k^2 L^2.$$

This is equivalent to the condition $\alpha_k \le 1/(2L)$, which was required in the statement of the theorem. Now we sum up (33) from $k = 1$ to $K - 1$, which yields

$$\sum_{k=1}^{K-1} \alpha_k g(w_k, z) + \frac{1}{2}\|z_K - z\|^2 + \frac{1}{2}\Big(\frac{1}{2} - \alpha_{K-1}^2 L^2\Big)\|w_{K-1} - w_{K-2}\|^2 \le \frac{1}{2}\|z_1 - z\|^2 + \frac{1}{2}\alpha_0^2 L^2 \|w_0 - w_{-1}\|^2. \tag{34}$$

Adding (34) and (32) and dividing by $\sum_{k=0}^{K-1} \alpha_k$ we deduce

$$\frac{1}{\sum_{k=0}^{K-1} \alpha_k} \sum_{k=0}^{K-1} \alpha_k g(w_k, z) \le \frac{\|z_0 - z\|^2}{2 \sum_{k=0}^{K-1} \alpha_k},$$

where we used that $1 - \alpha_0^2 L^2 \ge \alpha_0^2 L^2$ to get rid of $\|w_0 - w_{-1}\|^2$. The final statement follows by taking the supremum in $z$ over $B$ and applying Lemma D.1.

Proof for stochastic FBFp, Theorem C.3. By using $\diamondsuit_k = w_{k-1}$ we deduce from (21) for all $k \ge 0$ that

$$\alpha_k g(w_k, z) + \frac{1}{2}\|z_{k+1} - z\|^2 \le \frac{1}{2}\|z_k - z\|^2 - \frac{1}{2}\|z_k - w_k\|^2 + \frac{1}{2}(1 + \gamma)\alpha_k^2 L^2 \|w_{k-1} - w_k\|^2 + \alpha_k \langle W_k, z - w_k \rangle + 2(1 + \gamma^{-1})\alpha_k^2 \big(\|W_k\|^2 + \|Z_k\|^2\big).$$

As in (27) we can split $\alpha_k \langle W_k, z - w_k \rangle$ into $\alpha_k \langle W_k, z - v_k \rangle + \alpha_k \langle W_k, v_k - w_k \rangle$ and use Lemma D.2 to deduce

$$\sum_{k=0}^{K-1} \alpha_k g(w_k, z) \le \|z_0 - z\|^2 + \sum_{k=0}^{K-1} \Big( -\frac{1}{2}\|z_k - w_k\|^2 + \frac{1}{2}(1 + \gamma)\alpha_k^2 L^2 \|w_{k-1} - w_k\|^2 + \alpha_k \langle W_k, v_k - w_k \rangle + 3(1 + \gamma^{-1})\alpha_k^2 \big(\|W_k\|^2 + \|Z_k\|^2\big) \Big).$$

Taking now the supremum over $z \in B$ and then the expectation we conclude that the inequality

$$\mathbb{E}\Big[\sup_{z \in B} \sum_{k=0}^{K-1} \alpha_k g(w_k, z)\Big] \le D^2 - \frac{1}{2}\sum_{k=0}^{K-1} \Big( \|z_k - w_k\|^2 - (1 + \gamma)\alpha_k^2 L^2 \|w_{k-1} - w_k\|^2 \Big) + 3(1 + \gamma^{-1})\sigma^2 \sum_{k=0}^{K-1} \alpha_k^2 \tag{35}$$

holds. Let from now on $k \ge 1$, as we will treat the case $k = 0$ separately. Using (29) we deduce that

$$\|z_k - w_k\|^2 \ge -\|z_k - w_{k-1}\|^2 + \frac{1}{2}\|w_{k-1} - w_k\|^2 \ge -\alpha_{k-1}^2 \|F(w_{k-1}; \xi_{k-1}) - F(w_{k-2}; \xi_{k-2})\|^2 + \frac{1}{2}\|w_{k-1} - w_k\|^2. \tag{36}$$

Now we bound the difference of the two estimators by inserting $\pm F(w_{k-1})$, $\pm F(w_{k-2})$ and applying the inequality $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, which yields

$$\|F(w_{k-1}; \xi_{k-1}) - F(w_{k-2}; \xi_{k-2})\|^2 \le 3\|W_{k-1}\|^2 + 3\|W_{k-2}\|^2 + 3\|F(w_{k-2}) - F(w_{k-1})\|^2.$$

We conclude that

$$\mathbb{E}\|F(w_{k-1}; \xi_{k-1}) - F(w_{k-2}; \xi_{k-2})\|^2 \le 6\sigma^2 + 3L^2\, \mathbb{E}\|w_{k-1} - w_{k-2}\|^2. \tag{37}$$

Using (37) in (36) we deduce that

$$\mathbb{E}\|z_k - w_k\|^2 \ge -\alpha_{k-1}^2 \big(6\sigma^2 + 3L^2\, \mathbb{E}\|w_{k-1} - w_{k-2}\|^2\big) + \frac{1}{2}\mathbb{E}\|w_{k-1} - w_k\|^2, \tag{38}$$

whereas for $k = 0$ we have (31). Now we plug (38) into (35) to conclude that

$$\mathbb{E}\Big[\sup_{z \in B} \sum_{k=0}^{K-1} \alpha_k g(w_k, z)\Big] \le D^2 - \frac{1}{2}\sum_{k=1}^{K-1} \Big( -3\alpha_{k-1}^2 L^2\, \mathbb{E}\|w_{k-1} - w_{k-2}\|^2 + \Big(\frac{1}{2} - (1 + \gamma)\alpha_k^2 L^2\Big)\|w_{k-1} - w_k\|^2 \Big) + \frac{1}{2}\big((1 + \gamma)\alpha_0^2 L^2 - 1\big)\|w_{-1} - w_0\|^2 + 6(1 + \gamma^{-1})\sigma^2 \sum_{k=0}^{K-1} \alpha_k^2. \tag{39}$$

From this we conclude that in order to be able to telescope we need to enforce

$$\frac{1}{2} - (1 + \gamma)\alpha_k^2 L^2 \ge 3\alpha_k^2 L^2,$$

which is equivalent to

$$\frac{1}{2(4 + \gamma)} \ge \alpha_k^2 L^2.$$

Since $\alpha_k \le \alpha$, we can ensure this by choosing $\gamma$ such that

$$\frac{1}{2(4 + \gamma)} = \alpha^2 L^2. \tag{40}$$

With (40) in place we conclude from (39) that the inequality

$$\mathbb{E}\Big[\sup_{z \in B} \sum_{k=0}^{K-1} \alpha_k g(w_k, z)\Big] \le D^2 + \frac{1}{2}\big((4 + \gamma)\alpha_0^2 L^2 - 1\big)\|w_{-1} - w_0\|^2 + 6(1 + \gamma^{-1})\sigma^2 \sum_{k=0}^{K-1} \alpha_k^2$$

holds. Using the fact that $3\alpha_0^2 L^2 \le 1 - (1 + \gamma)\alpha_0^2 L^2$ from (40) to discard the $\|w_0 - w_{-1}\|^2$ term yields

$$\mathbb{E}\Big[\sup_{z \in B} \sum_{k=0}^{K-1} \alpha_k g(w_k, z)\Big] \le D^2 + 6(1 + \gamma^{-1})\sigma^2 \sum_{k=0}^{K-1} \alpha_k^2. \tag{41}$$

Solving (40) for $\gamma$ gives

$$\frac{1}{\gamma} = \frac{2\alpha^2 L^2}{1 - 8\alpha^2 L^2}. \tag{42}$$

Plugging (42) into (41), dividing by $\sum_{k=0}^{K-1} \alpha_k$ and applying Lemma D.1 yields the final statement.
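The step size conditions used in the proofs above can be observed on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes the update $w_k = z_k - \alpha F(\diamondsuit_k)$, $z_{k+1} = w_k + \alpha (F(\diamondsuit_k) - F(w_k))$ (Tseng's method with $r = 0$, so the proximal step is the identity), and the bilinear problem $\min_x \max_y xy$ with operator $F(x, y) = (y, -x)$ and $L = 1$. All function names are hypothetical.

```python
import math


def F(w):
    """Operator of the bilinear problem min_x max_y x*y (Lipschitz constant 1)."""
    x, y = w
    return (y, -x)


def gda(z0, alpha, iters):
    """Simultaneous gradient descent ascent, shown for contrast."""
    z = z0
    for _ in range(iters):
        g = F(z)
        z = tuple(zi - alpha * gi for zi, gi in zip(z, g))
    return z


def fbf(z0, alpha, iters, past=False):
    """FBF (diamond_k = z_k) or, with past=True, FBFp (diamond_k = w_{k-1})."""
    z = z0
    w = z0
    w_prev = z0  # convention w_{-1} = z_0
    for _ in range(iters):
        g_old = F(w_prev) if past else F(z)  # F(diamond_k)
        w = tuple(zi - alpha * gi for zi, gi in zip(z, g_old))
        g_new = F(w)
        # Correction step: z_{k+1} = w_k + alpha * (F(diamond_k) - F(w_k)).
        z = tuple(wi + alpha * (go - gn) for wi, go, gn in zip(w, g_old, g_new))
        w_prev = w
    return w


def norm(v):
    return math.hypot(v[0], v[1])


z0 = (1.0, 1.0)
res_gda = norm(gda(z0, 0.5, 200))
res_fbf = norm(fbf(z0, 0.5, 200))               # alpha <= 1/L
res_fbfp = norm(fbf(z0, 0.3, 500, past=True))   # alpha <= 1/(2L)
```

On this example the GDA iterates spiral away from the saddle point $(0, 0)$, whereas both corrected schemes contract towards it. FBFp reuses the previous gradient and therefore needs only one new evaluation of $F$ per iteration.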

F HYPERPARAMETERS

For the WGAN formulation with weight clipping, see Table 4, we used the extensively tuned hyperparameters from Gidel et al. (2019) for ExtraAdam, Adam1 and OptimisticAdam. Note that our values of the Inception Score (IS) differ from the ones reported in Gidel et al. (2019) as we use the newer implementation of the IS proposed in Barratt & Sharma (2018). For FBF-Adam we tuned the step size and kept all other hyperparameters equal.

Table 4: Hyperparameters used for the WGAN formulation (with weight clipping).

(DCGAN) WGAN Hyperparameters

| Parameter | Value |
|---|---|
| Batch size | 64 |
| Number of generator updates | 500,000 |
| Adam $\beta_1$ | 0.5 |
| Adam $\beta_2$ | 0.9 |
| Weight clipping for the discriminator | 0.01 |
| Learning rate for discriminator | $5 \times 10^{-4}$ (Extra Adam); $2 \times 10^{-4}$ (AltAdam1, FBF Adam, Optim. Adam) |
| Learning rate for generator | $5 \times 10^{-5}$ (Extra Adam); $2 \times 10^{-5}$ (AltAdam1, FBF Adam, Optim. Adam) |

For our newly proposed WGAN-L1 formulation using 1-norm regularization, see Table 5, we limited the hyperparameter search to the step sizes, with the values in Table 4 as initial guesses. We chose the value performing the best in terms of IS and FID for a sample seed. All other parameters were kept the same as in Gidel et al. (2019); Boţ et al. (2020).

Table 5: Hyperparameters used for the WGAN-L1 formulation (with soft thresholding).

(DCGAN) WGAN-L1 Hyperparameters

| Parameter | Value |
|---|---|
| Batch size | 64 |
| Number of generator updates | 500,000 |
| Adam $\beta_1$ | 0.5 |
| Adam $\beta_2$ | 0.9 |
| L1 regularization for the discriminator | $1 \times 10^{-4}$ |
| Learning rate for discriminator | $1 \times 10^{-3}$ (FBF Adam, Extra Adam); $5 \times 10^{-4}$ (Optim. Adam); $2 \times 10^{-4}$ (AltAdam1) |
| Learning rate for generator | $1 \times 10^{-4}$ (FBF Adam, Extra Adam); $5 \times 10^{-5}$ (Optim. Adam); $2 \times 10^{-5}$ (AltAdam1) |

For the experiments based on the WGAN-GP formulation including spectral normalization we limited the hyperparameter search to the step sizes, with the values recommended in Gidel et al. (2019) as initial guesses. We used a single power iteration for the spectral normalization as suggested in Miyato et al. (2018) and reduced the number of generator updates by a factor of two to ease the computational burden.



Figure 1: A comparison of the methods presented in Section 2.3 applied to problem (14) with κ = 0.01. PGDA denotes (alternating) gradient descent ascent with proximal steps. As mentioned in the introduction it fails to converge. EGp denotes the method presented in Gidel et al. (2019) as extrapolation from the past. For the restricted gap we use B 1 = B 2 = [-1, 1].

Our hyperparameter search was limited to the step sizes when using the WGAN-L1 and WGAN-GP formulation, while all other parameters were kept the same as in Gidel et al. (2019); Boţ et al. (2020). It seems noteworthy that in the case of soft-thresholding bigger step sizes performed better with the only exception of AltAdam1.
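For readers unfamiliar with the two discriminator constraints compared here, the following minimal sketch contrasts soft thresholding (the proximal operator of a scaled 1-norm) with weight clipping (the projection onto a box). It assumes plain lists of weights and a threshold equal to step size times regularization strength, which is the standard form of the proximal step. The function names and the sample values are illustrative and not taken from the authors' code.

```python
def soft_threshold(weights, threshold):
    """Proximal operator of threshold * ||.||_1, applied componentwise."""
    return [
        (abs(w) - threshold) * (1.0 if w > 0 else -1.0) if abs(w) > threshold else 0.0
        for w in weights
    ]


def clip(weights, bound):
    """Projection onto the box [-bound, bound]^n, i.e. weight clipping."""
    return [max(-bound, min(bound, w)) for w in weights]


# Hypothetical weights; the threshold 0.01 is chosen only for readability.
weights = [0.5, -0.03, 0.004, -2.0]
shrunk = soft_threshold(weights, 0.01)
clipped = clip(weights, 0.01)
```

Soft thresholding shrinks every weight towards zero by the threshold and zeroes out small ones, while clipping leaves small weights untouched and truncates large ones.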

Figure 2: Left: Average and best/worst IS on the WGAN objective with weight clipping. Right: Average and best/worst IS on the WGAN-L1 objective using the proximal operator. The WGAN-L1 objective improves the IS in comparison to weight clipping and stabilizes the behavior of all considered methods during the training procedure.

Figure 3: Average and best/worst results regarding IS (left) and FID (right) using ResNet architecture on the WGAN-GP objective including spectral normalization. Middle: Samples from the generator trained with FBF Adam.

Theorem B.1. Let $\Phi : \mathbb{R}^d \times \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable, let $f : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ and $h : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$ be proper, convex and lower semicontinuous, and let $B \subset \mathbb{R}^d \times \mathbb{R}^n$. A point $(x^*, y^*)$ in the interior of $B$ solves the saddle point problem (1) if and only if its minimax gap (15) is zero, $G_B(x^*, y^*) = 0$. For all other elements of $B$ the gap is nonnegative.
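As a small illustration of the statement, the sketch below evaluates a restricted gap for the bilinear function $\Phi(x, y) = xy$ on $B_1 = B_2 = [-1, 1]$ with $f = h = 0$. It assumes the usual form of the gap, $\max_{y' \in B_2} \Phi(x, y') - \min_{x' \in B_1} \Phi(x', y)$, since definition (15) is not reproduced in this section. The function name is hypothetical.

```python
def restricted_gap(x, y, lo=-1.0, hi=1.0):
    """Gap max_{y' in [lo, hi]} x*y' - min_{x' in [lo, hi]} x'*y for Phi(x, y) = x*y.

    Phi is linear in each argument, so the extrema over an interval
    are attained at its endpoints.
    """
    best_response_y = max(x * lo, x * hi)
    best_response_x = min(lo * y, hi * y)
    return best_response_y - best_response_x


gap_at_saddle = restricted_gap(0.0, 0.0)
gap_elsewhere = restricted_gap(0.5, -0.25)
```

For this example the gap equals $|x| + |y|$, so it vanishes exactly at the saddle point $(0, 0)$, which lies in the interior of $B$.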

Now, we can turn to proving the theorem.

Proof of Theorem B.2. Combine Lemma B.3, B.4, B.5 and B.6.

and $w_{[k]} = (w_0, \dots, w_k)$. The final statement follows by dividing by $\sum_{k=0}^{K-1} \alpha_k$ and applying Lemma D.1.

D.4 FORWARD-BACKWARD-FORWARD-PAST

Proof for deterministic FBFp, Theorem C.1 (ii). We start off by plugging $\diamondsuit_k = w_{k-1}$ into (21). Since $W_k = Z_k = 0$ we can use $\gamma \to 0$ to conclude that for all $k \ge 0$

The best Inception Score (IS) and Fréchet Inception Distance (FID). The columns denoted by WGAN, WGAN-L1 and WGAN-GP refer to the standard formulation with weight clipping, our regularized implementation using the 1-norm and the formulation with gradient penalty and spectral normalization, respectively.

Given the ubiquity and dominance of Adam (Kingma & Ba, 2014) as an optimizer for many deep learning related training tasks, instead of using vanilla SGD we opt for Adam updates. This results in a method we call FBF Adam. Analogous approaches have been applied in Gidel et al. (2019)



Theorem C.3. Let Assumptions 1, 2 and 3 hold and let $(w_k)_{k \ge 0}$ be the sequence generated by FBFp, i.e. Algorithm 3.2 with $\diamondsuit_k = w_{k-1}$ and sample $\xi_{k-1}$. Let the step size

Theorem 3.2 (ii) is obtained from the above theorem by using the particular step size bound $\alpha = 1/(3L)$, which yields

$$\frac{4\alpha^2 L^2}{1 - 8\alpha^2 L^2} = 4.$$

Although the step size in the refined statements Theorem C.2 and C.3 can be chosen arbitrarily close to $1/L$ and $1/(2\sqrt{2}L)$ for stochastic FBF and stochastic FBFp, respectively, this does not mean it should be, since the constant in the convergence rate deteriorates as the step size approaches its allowed upper bound.
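How quickly the constants deteriorate can be checked numerically. The following minimal sketch evaluates the two step-size-dependent factors that appear in this appendix, $(1 - \alpha^2 L^2)^{-1}$ for stochastic FBF and $4\alpha^2 L^2 / (1 - 8\alpha^2 L^2)$ for stochastic FBFp. The function names and the choice $L = 1$ are illustrative assumptions.

```python
def fbf_noise_factor(alpha, L):
    """(1 - alpha^2 L^2)^{-1}; requires alpha < 1/L (stochastic FBF)."""
    t = (alpha * L) ** 2
    if not t < 1.0:
        raise ValueError("need alpha < 1/L")
    return 1.0 / (1.0 - t)


def fbfp_noise_factor(alpha, L):
    """4 alpha^2 L^2 / (1 - 8 alpha^2 L^2); requires alpha < 1/(2 sqrt(2) L)."""
    t = (alpha * L) ** 2
    if not 8.0 * t < 1.0:
        raise ValueError("need alpha < 1/(2*sqrt(2)*L)")
    return 4.0 * t / (1.0 - 8.0 * t)


L = 1.0  # hypothetical Lipschitz constant
fbf_default = fbf_noise_factor(2 ** -0.5 / L, L)     # alpha = 1/(sqrt(2) L)
fbfp_default = fbfp_noise_factor(1.0 / (3.0 * L), L)  # alpha = 1/(3L)
fbf_near_limit = fbf_noise_factor(0.99 / L, L)
fbfp_near_limit = fbfp_noise_factor(0.35 / L, L)
```

The two default step sizes reproduce the values 2 and 4 quoted in the text, while step sizes just below the admissible limits inflate the factors by an order of magnitude.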

D PROOFS

D.1 PREPARATIONS

We introduce the notation connected to the strong formulation of the VI (18) associated with the monotone inclusion (9), given by $g(w, z) := \langle F(w), w - z \rangle + r(w) - r(z)$, for $g : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R} \cup \{+\infty\}$. Next we will establish the fact that this function can be used to bound the (restricted) unifying gap function, which, we recall, is defined as

where in the first case $(u, v) \in \mathbb{R}^d \times \mathbb{R}^n$ is identified with $w \in \mathbb{R}^m$. In particular the dimensions fulfill $d + n = m$, and $r(w)$ is given by $f(u) + h(v)$.

Lemma D.1. It holds that for all $K \ge 1$

Proof. First we will prove the case where $F$ is derived from a saddle point problem. Note that from the convex-concave structure of $\Phi$ we get that

By summing the two up we obtain

We can reformulate the above inequality in terms of $g$ to see that for $z$

The statement of the first case is obtained by adding $r(w) - r(z)$ on both sides and using the fact that $\Psi$ is convex-concave.

If $F$ is a general monotone operator, then we use its monotonicity to deduce that

The desired result follows from using the linearity of the inner product.

Notation. We denote the error of the stochastic estimator via

(20)

Furthermore, we will denote via $\mathbb{E}[\,\cdot \mid U]$ the conditional expectation with respect to the random variable $U$.

We will also need the following lemma.

Lemma D.2. Let $(p_k)_{k \ge 0} \in \mathbb{R}^d$ be a given sequence and $(v_k)_{k \ge 0}$ recursively defined for all $k \ge 0$ by

Proof. From the three point identity it follows immediately that

from which the statement of the lemma follows.

D.2 A UNIFIED DECREASE RESULT

We will start with a unifying proposition which covers the common parts of all convergence proofs.

Proposition D.3. For a $\gamma > 0$ we have that for all $k \ge 0$ and

Proof. Let $k \ge 0$ and $z \in \mathbb{R}^m$ be arbitrary. Using the decomposition (20) it follows that

Since $\operatorname{prox}_{\alpha_k r} = (\mathrm{Id} + \alpha_k \partial r)^{-1}$ we deduce that

(23)

Adding (22) and (23) gives that

(24)

We estimate the inner product on the left side of the inequality by inserting and subtracting $z_k$ and using the three point identity twice to deduce

The first two summands are fine as they will telescope, so we are left with estimating $\|z_{k+1} - w_k\|^2$. By the definition of $z_{k+1}$ we have that

where we inserted and subtracted $F(\diamondsuit_k)$ and $F(w_k)$ and applied Young's inequality to deduce the result. Adding (26), (25) and (24) we conclude that

