ADVERSARIAL PROBLEMS FOR GENERATIVE NETWORKS

Abstract

We are interested in the design of generative networks. The training of these mathematical structures is mostly performed with the help of adversarial (min-max) optimization problems. We propose a simple methodology for constructing such problems that assures, at the same time, consistency of the corresponding solution. We give characteristic examples developed with our method, some of which can be recognized from other applications and some of which are introduced here for the first time. We compare the various possibilities by applying them to well-known datasets using neural networks of different configurations and sizes.

1. INTRODUCTION

The problem we are interested in can be summarized as follows: we are given two collections of training data {z_j} and {x_i}. The samples of the first set follow the origin probability density h(z) and those of the second the target density f(x). The target density f(x) is considered unknown, while h(z) can either be known, with the possibility to produce samples z_j whenever necessary, or unknown, in which case the second training set {z_j} is also fixed. Our goal is to design a deterministic transformation G(z) so that the data {y_j} produced by applying y = G(z) onto {z_j} follow the target density f(y). Of course, one may wonder whether the proposed problem enjoys any solution, namely, whether there indeed exists a transformation G(z) capable of transforming z into y, with the former following the origin density h(z) and the latter the target density f(y). The problem of transforming random vectors has been analyzed in Box & Cox (1964), where existence is shown under general conditions. Computing the actual transformation is, however, a completely different challenge, with one of the possible solutions relying on adversarial approaches applied to neural networks. The most well-known usage of this result is the possibility to generate synthetic data that follow the unknown target density f(x). In this case h(z) is selected to be simple (e.g. i.i.d. standard Gaussian or i.i.d. uniform) so that generating realizations from h(z) is straightforward. As mentioned, the adversarial approach can be applied even if the origin density h(z) is unknown, provided that we have a dataset {z_j} with data following the origin density. It was Goodfellow et al. (2014) who first introduced the idea of adversarial (min-max) optimization and demonstrated that it results in the determination of the desired transformation G(z) (consistency). Alternative adversarial approaches were subsequently suggested by Martin Arjovsky & Bottou (2017); Bińkowski et al.
(2018) and shown to also deliver the correct transformation G(z). We must mention the work of Nowozin et al. (2016), in which a class of min-max optimizations, f-GANs, was defined to design generator/discriminator pairs. Then, Liu et al. (2017) defined the class of adversarial-divergence objective functions, which further unified f-GANs, MMD-GAN (Li et al., 2017), WGAN, WGAN-GP (Gulrajani et al., 2017), and entropic regularized optimal transport problems; they also investigated under what conditions the discriminator's class has the effect of matching generalized moments. Next, the work of Song & Ermon (2019) connected f-GANs and Wasserstein GANs (WGANs) (Martin Arjovsky & Bottou, 2017), and later Birrell et al. (2020) generalized the results by introducing the (f, Γ)-divergences, which allowed bridging f-divergences and integral probability metrics. Our class of generative adversarial problems establishes a one-to-one correspondence with f-GANs under the ideal (non data-driven) setup. However, we believe that our approach enjoys certain significant advantages. First, the definition of the two functions φ(z), ψ(z) in equation 8 is straightforward, while Nowozin et al. (2016) requires solving an additional optimization problem for the derivation of each GAN loss. An additional benefit of our approach is the complete control over the result of the maximization problem that defines the discriminator; in other words, we can decide what function the discriminator must estimate. In Nowozin et al. (2016) such flexibility does not exist. This is important because we can select the approximation function properly so as to avoid the need to impose difficult constraints on the discriminator output (e.g. positivity), since such constraints tend to seriously affect the approximation quality of the corresponding neural network.
Further, there is no need for the discriminator to be a Lipschitz function, as in WGAN or WGAN-GP, where extra operations are required to enforce the Lipschitz property. Furthermore, we will show that the function the discriminator tries to approximate is a transformation of the likelihood ratio r(x) = g(x)/f(x), and there are important applications in Statistics where one is interested in estimating only such a transformation, with the most common cases being the likelihood ratio itself, its logarithm (the log-likelihood ratio), or the ratio r(x)/(1 + r(x)), which plays the role of the posterior probability between two densities. In other words, there are applications where one is interested only in the "max" part of the min-max problem. Finally, because we know what transformation of the likelihood ratio the discriminator tries to approximate, it is possible to compare the different GANs on how closely they reach the optimal value of the likelihood ratio r(x) = 1, meaning f(x) = g(x). As in Nowozin et al. (2016), we will show that our method provides an abundance of adversarial problems that are capable of identifying the appropriate transformation G(z). Furthermore, we will also provide a simple recipe as to how we can successfully construct such problems. Arguing along the same lines as the existing min-max formulations, we would like to optimally specify a vector transformation G(z), the generator, and a scalar function D(x), the discriminator. To achieve this, for each combination {G(z), D(x)} we define the cost function
$$J(G, D) = \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big)\big] + \mathbb{E}_{z\sim h}\big[\psi\big(D(G(z))\big)\big], \quad (1)$$
where φ(z), ψ(z) are two scalar functions of the scalar z and $\mathbb{E}_{x\sim f}[\cdot]$, $\mathbb{E}_{z\sim h}[\cdot]$ denote expectation with respect to the densities f(x), h(z) respectively. The optimum generator/discriminator combination is then identified by solving the following min-max problem
$$\min_{G(z)} \max_{D(x)} J(G, D) = \min_{G(z)} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big)\big] + \mathbb{E}_{z\sim h}\big[\psi\big(D(G(z))\big)\big]. \quad (2)$$
We must point out that our goal is not to solve equation 2 but, rather, to find a class of functions φ(z), ψ(z) such that the transformation G(z) that comes out of the solution of equation 2 makes y = G(z) follow the target density f(y) when z follows the origin density h(z). If z is random following h(z) then y = G(z) is also random, and we denote with g(y) its corresponding probability density. Clearly, there exists a correspondence between transformations G(z) and densities g(y) when the density h(z) of z is fixed. Since we can write $\mathbb{E}_{z\sim h}[\psi(D(G(z)))] = \mathbb{E}_{y\sim g}[\psi(D(y))]$, this allows us to argue that the min-max problem in equation 2 is equivalent to
$$\min_{g(y)} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big)\big] + \mathbb{E}_{y\sim g}\big[\psi\big(D(y)\big)\big]. \quad (3)$$
It is now possible to combine the two expectations by applying a change of measure and a change of variables and equivalently write equation 3 as follows:
$$\min_{g(x)} \max_{D(x)} \int \Big\{\phi\big(D(x)\big) + \psi\big(D(x)\big)\frac{g(x)}{f(x)}\Big\} f(x)\,dx = \min_{g(x)} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big],$$
where r(x) = g(x)/f(x) denotes the corresponding likelihood ratio. Since f(x) is also fixed, there is again a correspondence between r(x) and g(x), hence the previous min-max problem becomes equivalent to
$$\min_{r(x)\in\mathcal{L}_f} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big]. \quad (4)$$
Here $\mathcal{L}_f$ denotes the class of all likelihood ratios r(x) with respect to the density f(x), namely all functions r(x) that satisfy
$$\mathcal{L}_f = \Big\{r(x) : r(x) \ge 0,\ \int r(x)f(x)\,dx = 1\Big\}. \quad (5)$$
Using these definitions, let us define the cost
$$J(r, D) = \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big] \quad (6)$$
and, according to equation 4, we are interested in the following min-max problem
$$\min_{r(x)\in\mathcal{L}_f} \max_{D(x)} J(r, D). \quad (7)$$
As mentioned, our actual goal is not to solve the adversarial problem. Instead, we would like to properly identify pairs of functions {φ(z), ψ(z)} so that equation 7 accepts as its solution the function r(x) = 1.
Indeed, if r(x) = 1 is the solution to equation 7, this means that g(x) = f(x) is the solution to equation 3 and, finally, that the optimum G(z) obtained from equation 2 is such that y = G(z) follows g(y) = f(y), which, of course, is our original objective. Even though the min-max problem in equation 2 is what we attempt to solve, it is through equation 7 that we understand what its solution entails. In the next section we focus on equation 6 and equation 7 and propose a simple design method (recipe) for the two functions φ(z), ψ(z) that assures that the solution of equation 7 is indeed r(x) = 1. Before we discuss the details of our work, we would like to summarize this paper's contributions.
• We design a family of GAN problems using a likelihood ratio approach. In this class, all optimization problems have the desired property that the generator output follows the target distribution of the random vector of interest, x; in other words, the likelihood ratio of the two distributions is equal to one.
• We propose a straightforward recipe to explore this GAN family. With this methodology, we were able to identify subclasses characterized by specific transformations of the likelihood ratio. Within these subclasses, we discovered novel objective functions and classified previously introduced GANs (such as the Wasserstein and Cross-Entropy GANs).
• We propose a new online metric for evaluating the performance of our generative model during training.
• Our experiments provide insights into the behavior of the different GAN objective functions, with some of the novel objective functions performing better than the already known GANs.

2. A CLASS OF FUNCTIONS φ(z), ψ(z)

Suppose that ω(r) is a strictly increasing and (left and right) differentiable scalar function of the nonnegative scalar r, i.e. r ∈ [0, ∞). Denote with J_ω = ω([0, ∞)) the range of values of ω(r) and let ω^{-1}(z) be the inverse function of ω(r), which exists and is defined for z ∈ J_ω.
Let ρ(z) > 0 be a positive scalar function also defined for z ∈ J_ω. Then, using ω(r) and ρ(z), we propose the following pair φ(z), ψ(z):
$$\phi'(z) = -\omega^{-1}(z)\rho(z), \qquad \psi'(z) = \rho(z), \quad (8)$$
where "′" denotes derivative. Since ω(r) and ρ(z) are arbitrary (provided they satisfy the strict-increase and positivity constraints respectively), the class of pairs defined by equation 8 is very rich, allowing for a multitude of choices. We show next that any such pair {φ(z), ψ(z)} gives rise to a min-max problem, as in equation 7, that accepts r(x) = 1 as its unique solution. We prove this claim in two steps. The first involves a theorem where we consider a simplified version of the min-max problem.

Theorem 1 Let ω(r), φ(z), ψ(z) and J_ω be defined as above, with the additional constraint ψ(ω(1)) = 0. Fix r ≥ 0 and consider φ(D) + rψ(D) as a function of the scalar D. Then, for any D ∈ J_ω, we have that
$$\phi(D) + r\psi(D) \le \phi\big(\omega(r)\big) + r\psi\big(\omega(r)\big), \quad (9)$$
with equality if and only if D = ω(r). Consider next the minimization with respect to r of the maximal value in equation 9. It is then true that
$$\min_{r\ge 0}\ \phi\big(\omega(r)\big) + r\psi\big(\omega(r)\big) = \phi\big(\omega(1)\big), \quad (10)$$
with equality if and only if r = 1.

A consequence of Theorem 1 is the next corollary, which constitutes the second and final step in proving that the adversarial problem defined in equation 7 has as its unique solution the function r(x) = 1.

Corollary 1 If the functions φ(z), ψ(z) satisfy equation 8 and ω(r) is strictly increasing and left and right differentiable, then in the adversarial problem defined in equation 7 the maximizer is D(x) = ω(r(x)) and the minimizer is r(x) = 1, while the resulting min-max value is equal to
$$\min_{r(x)\in\mathcal{L}_f} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big] = \phi\big(\omega(1)\big) + \psi\big(\omega(1)\big). \quad (11)$$
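As a sanity check of the recipe, the sketch below verifies Theorem 1 numerically for one concrete, illustrative pair of our own choosing: ω(r) = r and ρ(z) = 1, which by equation 8 gives φ(z) = -z²/2 and ψ(z) = z - 1 (the integration constant of ψ is fixed so that ψ(ω(1)) = 0, as the theorem requires). This is only a grid-search demonstration, not part of the paper's experiments.

```python
import math

# Hypothetical pair derived from equation 8 with omega(r) = r, rho(z) = 1:
# phi'(z) = -z  ->  phi(z) = -z**2 / 2
# psi'(z) = 1   ->  psi(z) = z - 1   (constant chosen so psi(omega(1)) = 0)

def phi(z):
    return -0.5 * z * z

def psi(z):
    return z - 1.0

def omega(r):
    return r

def inner(d, r):
    """The objective phi(D) + r * psi(D) of Theorem 1, for fixed r."""
    return phi(d) + r * psi(d)

grid = [i / 1000.0 for i in range(0, 3001)]  # D and r sampled on [0, 3]

# Equation 9: for fixed r, the maximizer over D should be D = omega(r).
r_fixed = 2.0
d_star = max(grid, key=lambda d: inner(d, r_fixed))

# Equation 10: the maximal value, minimized over r, is attained at r = 1.
def max_value(r):
    return inner(omega(r), r)  # by Theorem 1 the inner max sits at D = omega(r)

r_star = min(grid, key=max_value)
print(d_star, r_star)  # close to omega(2.0) = 2.0 and r = 1.0
```

The same two checks go through for any other (ω, ρ) choice from the recipe, which is the content of Theorem 1.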

2.1. SUBCLASSES OF THE GANS FAMILY

Let us now present some of the subclasses of the GANs family, where each subclass is characterized by the type of transformation ω(r) of the likelihood ratio. Once we fix ω(r) and give pairs {φ(z), ψ(z)} that satisfy equation 8, we are able to "pick" objective functions lying in the ω(r) subregion.

Subclass A: ω(r) = r^α. The first examined subclass is the simplest one, consisting of just powers of the likelihood ratio. To the best of our knowledge, this is the first work proposing objective functions from this class. To find the pairs {φ(z), ψ(z)} we proceed as follows. We have that ω^{-1}(z) = z^{1/α} and J_ω = [0, ∞). According to equation 8, for z ∈ [0, ∞) we must define φ'(z) = -z^{1/α}ρ(z), ψ'(z) = ρ(z). The likelihood ratio is recovered from the discriminator as r = D^{1/α}. Some examples of this subclass are presented in Table 1. For the particular selection ω(r) = r (corresponding to α = 1) we can show that the resulting cost is equivalent to the Bregman cost (Bregman, 1967). In fact, there is a one-to-one correspondence between our ρ(z) function and the function that defines the Bregman cost. This correspondence, however, is lost once we switch to a different α or a different ω(r) function, suggesting that the proposed class of pairs {φ(z), ψ(z)} is far richer than the class induced by the Bregman cost.

Table 1: Subclass A optimization problems for GANs
GAN | φ(z) | ψ(z) | J_ω
A1a | -z | log(z) | [0, ∞)
A1b | -log(z) | -z^{-1} | [0, ∞)
A2 | -(1 + z) | -(1 + z^{-1}) | [0, ∞)
A3 | -log(1 + z) | -log(1 + z^{-1}) | [0, ∞)
MSE | -0.5 z² | z | [0, ∞)

Subclass B: ω(r) = α^{-1} log r. This subclass considers one of the most popular transformations of the likelihood ratio, the log-likelihood ratio. As with Subclass A, the following examples are presented here for the first time. They can be used either under a min-max setting, for the determination of the generator/discriminator pair, or under a pure maximization setting for the direct estimation of the log-likelihood ratio function log r(x). We have ω^{-1}(z) = e^{αz} and J_ω = R.
As before, ρ(z) must be strictly positive and, according to equation 8, for all real z we must define φ'(z) = -e^{αz}ρ(z), ψ'(z) = ρ(z). The likelihood ratio is r = e^{αD}. In Table 2 we can see some examples of this subclass.

Table 2: Subclass B optimization problems for GANs
GAN | φ(z) | ψ(z) | J_ω
B1a | -e^z | z | R
B1b | -z | -e^{-z} | R
Exponential | -e^{0.5z} | -e^{-0.5z} | R
B2 | -log(1 + e^z) | -log(1 + e^{-z}) | R

Subclass C: ω(r) = r/(r + 1). As we already mentioned, this is another important transformation of the likelihood ratio. Interestingly, to this subclass belongs the first introduced GAN (Goodfellow et al., 2014), the Cross Entropy GAN. When ω(r) = r/(r + 1) we have ω^{-1}(z) = z/(1 - z) and J_ω = [0, 1]. For ρ(z) > 0, z ∈ [0, 1], we must define the functions φ(z), ψ(z) according to equation 8: φ'(z) = -(z/(1 - z))ρ(z), ψ'(z) = ρ(z). In this case the likelihood ratio is r = D/(1 - D). In Table 3 we see the Cross Entropy GAN and C2, which is presented here for the first time.

Table 3: Subclass C optimization problems for GANs
GAN | φ(z) | ψ(z) | J_ω
Cross Entropy | log(1 - z) | log(z) | [0, 1]
C2 | z + log(1 - z) | z | [0, 1]

Subclass D: ω(r) = sign(log r). This is a special case of ω(r), with the corresponding function not being strictly increasing. It turns out that we can still come up with optimization problems, two of which are known and used in practice, by considering ω(r) as the limit of a sequence of strictly increasing functions.

Monotone Loss: As a first approximation we propose sign(z) ≈ tanh((c/2)z), where c > 0 is a parameter. We note that lim_{c→∞} tanh((c/2)z) = sign(z). Using this approximation we can write
$$\mathrm{sign}(\log r) \approx \tanh\Big(\frac{c}{2}\log r\Big) = \frac{r^c - 1}{r^c + 1} = \omega(r).$$
As mentioned, we have exact equality for c → ∞. Let us perform our analysis by assuming that c is finite. We note that ω^{-1}(z) = ((1 + z)/(1 - z))^{1/c} and J_ω = [-1, 1]. Consequently, if ρ(z) > 0 for z ∈ [-1, 1], we must define φ'(z) = -((1 + z)/(1 - z))^{1/c} ρ(z), ψ'(z) = ρ(z).

D1) If we let c → ∞, in order to converge to the desired sign function, this yields φ'(z) = -ρ(z) and ψ'(z) = ρ(z).
This suggests that $\phi(z) = -\int^z \rho(x)\,dx$ is decreasing and $\psi(z) = \int^z \rho(x)\,dx = -\phi(z)$ is increasing. In fact, any strictly increasing function ψ(z) can be adopted, provided we select φ(z) = -ψ(z). There is a popular combination that falls under Case D1): the selection ψ(z) = z = -φ(z) reminds us of the Wasserstein GAN (Martin Arjovsky & Bottou, 2017), with two differences: in our case z should lie in [-1, 1], and the discriminator is not constrained to be a Lipschitz function.

Hinge Loss: As a second approximation we use the expression sign(z) ≈ sign(z)|z|^{1/c}, c > 0, which is strictly increasing, continuous, and converges to sign(z) as c → ∞. This suggests that sign(log r) ≈ sign(log r)|log r|^{1/c} = ω(r), with ω^{-1}(z) = e^{sign(z)|z|^c}. Since ω(r) can assume any real value, we conclude that J_ω = R, which clearly differs from the previous approximation where we had J_ω = [-1, 1]. If ρ(z) > 0, z ∈ R, then, according to equation 8, we must define φ'(z) = -e^{sign(z)|z|^c} ρ(z), ψ'(z) = ρ(z). We present the following case, which leads to a very well known pair from a completely different application.

D2) If we select ψ'(z) = ρ(z) = e^{-(max{z,0})^c} > 0, then φ'(z) = -e^{sign(z)|z|^c} e^{-(max{z,0})^c}. If we now let c → ∞, we obtain the limiting forms of the derivatives, which become ψ'(z) = 1_{z<1} and φ'(z) = -1_{z>-1}. By integrating we arrive at φ(z) = -max{1 + z, 0} and ψ(z) = -max{1 - z, 0}. We notice that, since c → ∞, we cannot recover the likelihood ratio in terms of the discriminator. The cost based on this particular pair is called the hinge loss (Tang, 2013), and it is very popular in binary classification, where one is interested only in the maximization problem. The corresponding method is known to exhibit an overall performance which, in practice, is considered among the best (Rosasco et al., 2004; Janocha & Czarnecki, 2017). Here, as in Zhao et al. (2016), we propose the hinge loss as a means to perform adversarial optimization for the design of the generator G(z).
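To make the finite-c monotone-loss construction concrete, the sketch below (our own illustration, with the assumed choices c = 2 and ρ(z) = 1) builds φ on a grid by integrating φ'(z) = -((1+z)/(1-z))^{1/2} and checks that the inner maximizer of φ(D) + rψ(D) is indeed D = ω(r) = (r² - 1)/(r² + 1).

```python
import math

# Illustrative choices (not prescribed by the text): c = 2, rho(z) = 1.
# Then psi(z) = z, and phi'(z) = -((1 + z) / (1 - z)) ** (1 / 2).

def phi_prime(z):
    return -((1.0 + z) / (1.0 - z)) ** 0.5  # equation 8 with rho = 1, c = 2

r = 2.0
step = 5e-4
zs = [-0.999 + i * step for i in range(int(1.998 / step) + 1)]  # grid in (-1, 1)

# Recover phi on the grid by trapezoidal integration of phi'
# (the additive constant does not affect the argmax).
phi_vals = [0.0]
for a, b in zip(zs, zs[1:]):
    phi_vals.append(phi_vals[-1] + 0.5 * (phi_prime(a) + phi_prime(b)) * (b - a))

# Maximize phi(D) + r * psi(D) with psi(z) = z.
obj = [p + r * z for p, z in zip(phi_vals, zs)]
d_star = zs[max(range(len(zs)), key=obj.__getitem__)]
print(d_star)  # close to omega(2) = (4 - 1) / (4 + 1) = 0.6
```

Letting c grow flattens ω^{-1} toward 1 on (-1, 1), which is exactly the Case D1) limit described above.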
Table 4: Subclass D optimization problems for GANs
GAN | φ(z) | ψ(z) | J_ω
Hinge | -(1 + z)_+ | -(1 - z)_+ | R
Wasserstein | -z | z | [-1, 1]

In Table 4 we can see the Hinge and Wasserstein GAN optimization problems. The detailed derivation of the above optimization problems, and some more examples, can be found in Appendix A.2. This completes our presentation of examples. We must emphasize, however, that these are only a few illustrations of the possible pairs {φ(z), ψ(z)} one can construct. Indeed, combining, as dictated by equation 8, any strictly increasing function ω(r) with any positive function ρ(z) generates a legitimate pair {φ(z), ψ(z)} and a corresponding min-max problem, as in equation 7, that enjoys the desired solution r(x) = 1.

Remark 1 Given the numerous choices we have in defining adversarial optimizations for solving the same problem, one may wonder whether there exists a means to rank these methods and identify the most efficient, at least for classes of data. This is an extremely challenging question which, unfortunately, finds no answer even in simpler problems. For example, in binary classification there are also classes of optimization problems that are used to train neural networks (Bartlett et al., 2006; Masnadi-Shirazi & Vasconcelos, 2009). However, no theoretical analysis exists so far that can order them and designate the most efficient one. This is possible only through experience with countless simulations.
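Any pair from the tables above plugs into the same two-sample objective. The sketch below (with hypothetical helper names of our own) builds the empirical cost from a given (φ, ψ) pair and checks, for the Cross Entropy pair of Table 3, that the inner maximization over D recovers the posterior-probability transform D = ω(r) = r/(1 + r).

```python
import math

def make_objective(phi, psi):
    """Empirical two-sample cost: mean phi(D(x_i)) + mean psi(D(G(z_j)))."""
    def J(d_real, d_fake):
        return (sum(phi(d) for d in d_real) / len(d_real)
                + sum(psi(d) for d in d_fake) / len(d_fake))
    return J

# Cross Entropy pair from Table 3 (Subclass C, omega(r) = r / (r + 1)):
phi_ce = lambda z: math.log(1.0 - z)
psi_ce = lambda z: math.log(z)

# For a fixed likelihood-ratio value r, the inner maximizer of
# phi(D) + r * psi(D) over D in (0, 1) should be D = r / (1 + r).
r = 3.0
grid = [i / 10000.0 for i in range(1, 10000)]  # open interval (0, 1)
d_star = max(grid, key=lambda d: phi_ce(d) + r * psi_ce(d))
print(d_star)  # close to 3 / 4 = 0.75
```

Swapping in any other row of Tables 1-4 changes only the two lambdas; the surrounding machinery is identical, which is the point of the recipe.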

3. DATA-DRIVEN SETUP AND NEURAL NETWORKS

Let us now consider the data-driven version of the problem. As mentioned, the target density f(x) is unknown. Instead, we are given a collection of realizations {x_i} that follow f(x) and a second collection {z_j} that follows the origin density h(z). These data constitute our training set. The second set {z_j} can either become available "on the fly", when h(z) is known, by generating realizations every time they are needed, or it can be considered fixed from the start, exactly as {x_i}, if h(z) is also unknown. As we pointed out in Section 1, we are interested in designing a generator G(z) so that, when we apply it onto the data z_j, that is, y_j = G(z_j), the resulting y_j follow a density that matches the target density f(x). Since we are now considering the data-driven version of the problem, we limit G(z), D(x) to be the outputs of corresponding neural networks. Therefore the generator is replaced by G(z, θ) and the discriminator by D(x, ϑ), where θ, ϑ summarize the parameters of the two neural networks. Of course, instead of neural networks one could use any other parametric family, such as SVMs, capable of efficiently approximating any nonlinear function. Once we have selected our favorite ω(r) and ρ(z) functions, we can compute from equation 8 the functions φ(z), ψ(z) that enter into the min-max problem defined in equation 2. This problem, after limiting the generator and discriminator to neural networks, can be rewritten as follows:
$$\min_\theta \max_\vartheta \mathcal{J}(\theta, \vartheta) = \min_\theta \max_\vartheta \mathbb{E}_{x\sim f}\big[\phi\big(D(x, \vartheta)\big)\big] + \mathbb{E}_{z\sim h}\big[\psi\big(D(G(z, \theta), \vartheta)\big)\big].$$
If θ_o, ϑ_o are the corresponding optimum parameter values, and the structure of the two networks is sufficiently rich, we expect that G(z, θ_o), D(x, ϑ_o) will approximate the optimum functions G(z), D(x) of the ideal problem in equation 2, respectively.
In particular, for θ_o, the generator G(z, θ_o), whenever applied onto any z_j that follows h(z), will result in a y_j = G(z_j, θ_o) that follows a density which is expected to be close to the target density f(y).

Remark 2 When replacing G(z), D(x) with neural networks, we must take special care of the corresponding outputs. Basically, we must guarantee that they are of the correct form. This is particularly important in the case of the scalar output D(x, ϑ) of the discriminator. We recall that the optimum discriminator is D(x) = ω(r(x)). This implies that we need to assure that D(x, ϑ) takes values in J_ω (the range of ω(r)). Consequently, we must apply a proper nonlinearity to the output of the discriminator that guarantees this fact.
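One practical way to read Remark 2 is as a lookup from the range J_ω of each subclass to a standard squashing nonlinearity for the discriminator's raw output. The mapping below reflects our own (assumed) activation choices; the text only requires that the output land inside J_ω.

```python
import math

# Assumed activation choices for each J_omega; only the ranges are dictated
# by Remark 2, the specific functions are our illustration.

def softplus(t):
    """Numerically stable log(1 + exp(t)): maps R -> (0, inf)."""
    return math.log1p(math.exp(-abs(t))) + max(t, 0.0)

def sigmoid(t):
    """Maps R -> (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def identity(t):
    """J_omega = R: no squashing needed."""
    return t

OUTPUT_ACTIVATION = {
    "subclass_A": softplus,    # omega(r) = r**alpha,         J_omega = [0, inf)
    "subclass_B": identity,    # omega(r) = log(r) / alpha,   J_omega = R
    "subclass_C": sigmoid,     # omega(r) = r / (1 + r),      J_omega = [0, 1]
    "wasserstein": math.tanh,  # omega(r) = sign(log r),      J_omega = [-1, 1]
}
```

Bounded activations such as the sigmoid also sidestep the hard positivity constraints mentioned in Section 1, since the constraint is absorbed into the output layer rather than imposed on the loss.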

4. EXPERIMENTS

In this section, we examine the performance of the GAN objectives presented in Tables 1-4 on different datasets. For that reason, we tested their performance on four different datasets, namely MNIST (LeCun et al., 1998), CelebA (Liu et al., 2015), CIFAR-10 (Krizhevsky et al., 2009), and Stanford Cars (Krause et al., 2013). We recall that GANs are notorious for their non-robust behavior (Bengio, 2012; Creswell et al., 2018; Mescheder et al., 2017). For the stabilization of the training process, we used the maximum gradient-penalty methodology, which was generalized to a class of Lipschitz GANs in Zhou et al. (2019) (implementation details in Appendix A.3). In this section we present our results for the regularization parameter λ = 10; in the Appendix we include results for different values of λ. In Tables 5 and 6 we present the final attained Fréchet Inception Distance (FID) (Heusel et al., 2017) and Kernel Inception Distance (KID) (Bińkowski et al., 2018) scores after training. In Table 7 we present the mean absolute difference of the discriminator-estimated likelihood ratio from the optimal likelihood ratio, which is equal to one, as well as the variance around the optimal value. For the Hinge and Wasserstein GANs we cannot compute the likelihood-ratio-related metrics, as we mentioned in Section 2.1. Also, in Figure 4 we show the evolution of FID (second row) and KID (third row) during training. In the first row we see the value of the distance $d(x, y) = |\mathbb{E}_{x\sim f}[D(x)] - \mathbb{E}_{y\sim g}[D(y)]|$. For the estimation of the expectations, we used 64 examples for MNIST and 128 for Stanford Cars, CelebA, and CIFAR-10. We believe that this is an insightful quantity, since its value indicates how accurately the discriminator can distinguish samples of the target distribution from the generator output. Finally, in Appendix A.3 some generated samples of our trained generative models are included.
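The two monitoring quantities of this section are easy to compute from discriminator outputs alone. The sketch below (function names are our own) evaluates d(x, y) and the Table 7 statistics, recovering the estimated likelihood ratio via the inverse transform ω^{-1} of the chosen subclass.

```python
import math

def d_xy(d_real, d_fake):
    """d(x, y) = |E_f[D(x)] - E_g[D(y)]|, estimated from two batches."""
    mean = lambda v: sum(v) / len(v)
    return abs(mean(d_real) - mean(d_fake))

def lr_metrics(d_out, omega_inv):
    """Mean absolute deviation of r_hat = omega_inv(D) from the optimal
    value r = 1, and the variance around that optimal value."""
    r_hat = [omega_inv(d) for d in d_out]
    mad = sum(abs(r - 1.0) for r in r_hat) / len(r_hat)
    var = sum((r - 1.0) ** 2 for r in r_hat) / len(r_hat)
    return mad, var

# Example with a Subclass B discriminator (alpha = 1), where r = exp(D):
d_out = [0.0, math.log(2.0), -math.log(2.0)]   # r_hat = [1, 2, 0.5]
mad, var = lr_metrics(d_out, math.exp)
print(d_xy([1.0, 3.0], [2.0, 2.0]), mad, var)
```

For Subclass A (α = 1) one would pass `omega_inv = lambda d: d`, and for Subclass C `lambda d: d / (1 - d)`; for the Hinge and Wasserstein losses no inverse exists, which is why those rows are absent from Table 7.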
Table 7: Mean absolute difference of the discriminator-estimated likelihood ratio from the optimal value, and variance around the optimal likelihood ratio (column scales: CARS ×1, ×10^-2; CELEBA ×1, ×10^-3; CIFAR10 ×10^-2, ×10^-4; MNIST ×10^-1, ×10). [Table body not recovered.]

For the MNIST dataset, the different GANs have very similar performance, which is reasonable since this is a simple dataset. The d(x, y) curves converge quickly, indicating that the discriminator has stabilized. Interestingly, GANs with similar d(x, y) curves also have close FID and KID scores: in particular, the Subclass C GANs (Cross-Entropy with C2); Subclass D (Hinge) with Wasserstein; B2, B1a and A3; B1b with Exponential; A1b with A2; and A1a with MSE. For the CIFAR-10 dataset, the FID and KID scores for the different objective functions tend to behave similarly, but some differences exist. In particular, the objectives A1a, A1b, A2, MSE, B1a, B1b create, in the initial iterations, a steep valley in the metric d(x, y), and at the same time these objectives attain smaller FID scores faster than the other GANs. Also, the further improvement of the scores during training seems to be related to the behavior of d(x, y). For instance, the d(x, y) of A1b, A2 is still increasing during the last iterations, and the generated image quality (FID score) keeps improving. Lastly, similar to MNIST, GANs that have very close d(x, y) tend to have close FID and KID curves. For Stanford Cars and CelebA, we observe some performance gaps between the different subclasses. Specifically, for CelebA, the Subclass D and Subclass B objective functions start to diverge after some training steps, which is evident from the FID and KID curves, where the scores increase dramatically after approximately the first half of the iterations. This is also noticeable in d(x, y), where a sudden drop appears around the same period.
Interestingly, the variance around the optimal likelihood-ratio value is increased for the GANs that started to diverge. For Stanford Cars, Subclass C has the worst performance, with the recorded KID score being an order of magnitude larger than those of the other objectives (except the Subclass B objectives B2 and Exponential). Furthermore, for this dataset the distances d(x, y) are "noisy" (strong fluctuations) and of approximately ten times larger magnitude compared to the corresponding curves for the other datasets. This is reasonable, since this dataset is a complex one, containing natural scenes with cars, and, importantly, it is significantly smaller (∼8000 training samples, whereas MNIST and CIFAR-10 have 60000 and CelebA 200000). In summary, our simulations indicate that the Subclass A objectives A1a, A1b, A2, MSE have the best performance, both in terms of the computed metrics (hence image-generation quality) and in terms of stability during training. They might not give the very best score in all datasets, but they are very close to the best recorded and, most importantly, they exhibit an overall more stable behavior when compared to objectives from the other subclasses. Furthermore, the convergence of the discriminator to the optimal likelihood ratio between the dataset and the generator output can serve as an ad-hoc measure of the behavior/stability of our generative models, one that can be used during training to provide online, useful insights into the current training condition.

5. CONCLUSION

In this paper, we presented and demonstrated a straightforward methodology for determining loss functions that solve the generative adversarial problem. Our results suggest that there is no single loss function that achieves the best performance, in terms of the examined metrics, across all datasets. This performance variation among loss functions becomes evident as the increasing complexity of the datasets, which convolutes the generation task, is better addressed by some loss functions that clearly outperform others. Specifically, in simpler datasets such as MNIST the evaluated loss functions yield very similar performance, whereas in more intricate datasets like CelebA, CIFAR-10, and Stanford Cars, performance "gaps" between the different loss functions, and the different subclasses, emerge. Our findings also indicate that in every generation task previously unexplored loss functions outperformed the ones proposed earlier. Consequently, this function class is worth exploring to identify new loss functions that can be used and evaluated in different applications. Our method provides a versatile tool that can be exploited in that direction.

A APPENDIX

A.1 PROOFS

Theorem 1 Proof. We note that the constraint ψ(ω(1)) = 0 does not affect the generality of our class of functions since, from equation 8, ψ(z) is defined after integration up to an arbitrary additive constant. We can always select this constant so that the constraint is satisfied. We would also like to emphasize that this constraint is needed only for the proof of this theorem and is not necessary for the corresponding min-max problem defined in equation 7. For fixed r, to find the maximum of φ(D) + rψ(D) we consider the derivative with respect to D which, using equation 8, takes the form
$$\phi'(D) + r\psi'(D) = \big(r - \omega^{-1}(D)\big)\rho(D).$$
The strict increase of ω(r) is inherited by its inverse function ω^{-1}(z) which, combined with the positivity of ρ(z), implies that the previous expression has the same sign as r - ω^{-1}(D) or, equivalently, ω(r) - D. Consequently D = ω(r) is the only critical point of φ(D) + rψ(D), and it is a global maximum. Of course, there are possibilities for extrema at the two end points of J_ω, but they can only be (local) minima. Let us now focus on the resulting function φ(ω(r)) + rψ(ω(r)). Taking its derivative with respect to r yields
$$\frac{d}{dr}\Big[\phi\big(\omega(r)\big) + r\psi\big(\omega(r)\big)\Big] = \big[\phi'\big(\omega(r)\big) + r\psi'\big(\omega(r)\big)\big]\,\omega'(r) + \psi\big(\omega(r)\big) = \psi\big(\omega(r)\big),$$
where the last equality is due to the specific definition of the two functions φ(z), ψ(z) in equation 8, which makes the bracketed term vanish at D = ω(r). Since ψ'(z) = ρ(z) > 0, ψ(z) is strictly increasing and, being the integral of ρ(z), it is also continuous in z. If we combine this property with the strict increase and continuity (a result of left and right differentiability) of ω(r), we conclude that ψ(ω(r)) is also strictly increasing and continuous in r. We recall that ψ(z) is selected to satisfy ψ(ω(1)) = 0; consequently, the derivative ψ(ω(r)) is negative for r < 1 and positive for r > 1, so at r = 1 the function φ(ω(r)) + rψ(ω(r)) has a unique minimum, which is global, with no other critical points. Of course, it can still exhibit extrema at r = 0 and/or r → ∞, but they can only be (local) maxima.

Corollary 1 Proof.
First, we observe that
$$\mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big] = \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\bar\psi\big(D(x)\big)\big] + \psi\big(\omega(1)\big),$$
with the equality being true since $\mathbb{E}_{x\sim f}[r(x)] = 1$, and where $\bar\psi(z) = \psi(z) - \psi(\omega(1))$. We start with the maximization problem. Since D(x) is a function of x, we have
$$\max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\bar\psi\big(D(x)\big)\big] = \mathbb{E}_{x\sim f}\Big[\max_{D}\ \phi(D) + r(x)\bar\psi(D)\Big].$$
The maximization under the expectation can be performed for each fixed x. However, when we fix x, r(x) becomes a constant and the result of the maximization depends only on the actual value of r(x). This suggests that we can limit ourselves to functions of the form D(x) = D(r(x)). After this observation we can drop the dependence on x and perform, equivalently, the maximization of φ(D(r)) + r ψ̄(D(r)) for each fixed r. The pair {φ(z), ψ̄(z)} satisfies the assumptions of Theorem 1, therefore the maximization is achieved by D(r) = ω(r). This implies that
$$\max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\psi\big(D(x)\big)\big] = \mathbb{E}_{x\sim f}\big[\phi\big(\omega(r(x))\big) + r(x)\bar\psi\big(\omega(r(x))\big)\big] + \psi\big(\omega(1)\big).$$
We can now continue in a similar way for the minimization problem. Specifically,
$$\min_{r(x)\in\mathcal{L}_f} \max_{D(x)} \mathbb{E}_{x\sim f}\big[\phi\big(D(x)\big) + r(x)\bar\psi\big(D(x)\big)\big] = \min_{r(x)\in\mathcal{L}_f} \mathbb{E}_{x\sim f}\big[\phi\big(\omega(r(x))\big) + r(x)\bar\psi\big(\omega(r(x))\big)\big]$$
$$\ge \mathbb{E}_{x\sim f}\Big[\min_{r(x)\in\mathcal{L}_f}\ \phi\big(\omega(r(x))\big) + r(x)\bar\psi\big(\omega(r(x))\big)\Big] \ge \mathbb{E}_{x\sim f}\Big[\min_{r\ge 0}\ \phi\big(\omega(r)\big) + r\bar\psi\big(\omega(r)\big)\Big] = \phi\big(\omega(1)\big),$$
with the last inequality being true since the minimization inside the expectation is now unconstrained, and the last equality being a consequence of Theorem 1 (recall that ψ̄(ω(1)) = 0). The final lower bound is clearly attained by r(x) = 1, which is also a legitimate solution of the constrained minimization, since r(x) = 1 belongs to the class $\mathcal{L}_f$ of likelihood ratios. Consequently r(x) = 1 is the solution of the min-max problem. Returning to the original min-max setup, with ψ(z) in place of ψ̄(z), we can clearly see that the min-max value satisfies equation 11. This completes the proof.

A.2 THE SUBCLASSES OF THE GANS FAMILY

Subclass A: $\omega(r) = r^\alpha$. The first subclass we examine is the simplest one, consisting of plain powers of the likelihood ratio. To the best of our knowledge, this is the first work proposing objective functions from this class. To find the pairs $\{\phi(z), \psi(z)\}$ we proceed as follows. We have $\omega^{-1}(z) = z^{1/\alpha}$ and $J_\omega = [0, \infty)$. According to equation 8, for $z \in [0, \infty)$ we must define $\phi'(z) = -z^{1/\alpha}\rho(z)$ and $\psi'(z) = \rho(z)$. The following examples can be shown to satisfy these equations.

A1) Selecting $\rho(z) = z^\beta$ with $\beta \neq -1, -1-\frac{1}{\alpha}$ yields $\phi(z) = -\frac{z^{1+\frac{1}{\alpha}+\beta}}{1+\frac{1}{\alpha}+\beta}$ and $\psi(z) = \frac{z^{1+\beta}}{1+\beta}$. For $\beta = -1$ we have $\rho(z) = z^{-1}$, $\phi(z) = -\alpha z^{1/\alpha}$, $\psi(z) = \log z$. For $\beta = -1-\frac{1}{\alpha}$ we have $\rho(z) = z^{-1-\frac{1}{\alpha}}$, $\phi(z) = -\log z$, $\psi(z) = -\alpha z^{-1/\alpha}$.

A2) Selecting $\alpha = 1$, $\rho(z) = \frac{1}{1+z}$ yields $\phi(z) = -(1+z)$ and $\psi(z) = -(1+z^{-1})$.

A3) Selecting $\alpha = 1$, $\rho(z) = \frac{1}{(1+z)z}$ yields $\phi(z) = -\log(1+z)$ and $\psi(z) = -\log(1+z^{-1})$.

For the particular selection $\omega(r) = r$ (corresponding to $\alpha = 1$) we can show that the resulting cost is equivalent to the Bregman cost (Bregman, 1967). In fact, there is a one-to-one correspondence between our $\rho(z)$ function and the function that defines the Bregman cost. This correspondence, however, is lost once we switch to a different $\alpha$ or a different $\omega(r)$, suggesting that the proposed class of pairs $\{\phi(z), \psi(z)\}$ is far richer than the class induced by the Bregman cost. We should mention that in A1) the selection $\alpha = 1$, $\beta = 0$ corresponds to the mean square error criterion; if we apply only the maximization problem, this coincides with a likelihood ratio estimation technique proposed in the literature by Sugiyama et al. (2010; 2012). We will refer to this case as the MSE GAN.

Subclass B: $\omega(r) = \alpha^{-1}\log r$. This subclass considers one of the most popular transformations of the likelihood ratio, the log-likelihood ratio. As in the first subclass, the following examples are, to our knowledge, presented here for the first time.
They can be used either under a min-max setting, for the determination of the generator/discriminator pair, or under a pure maximization setting for the direct estimation of the log-likelihood ratio function $\log r(x)$. We have $\omega^{-1}(z) = e^{\alpha z}$ and $J_\omega = \mathbb{R}$. As before, $\rho(z)$ must be strictly positive and, according to equation 8, for all real $z$ we must define $\phi'(z) = -e^{\alpha z}\rho(z)$ and $\psi'(z) = \rho(z)$. The following examples satisfy these equations.

B1) Selecting $\rho(z) = e^{-\beta z}$ with $\beta \neq 0, \alpha$ produces $\phi(z) = -\frac{e^{(\alpha-\beta)z}}{\alpha-\beta}$ and $\psi(z) = -\frac{e^{-\beta z}}{\beta}$. If $\beta = 0$ then $\rho(z) = 1$, $\phi(z) = -\frac{e^{\alpha z}}{\alpha}$, $\psi(z) = z$. If $\beta = \alpha$ then $\rho(z) = e^{-\alpha z}$, $\phi(z) = -z$ and $\psi(z) = -\frac{e^{-\alpha z}}{\alpha}$. We call the case $\alpha = 1$, $\beta = 0.5$ the Exponential GAN.

B2) If $\alpha = 1$ and $\rho(z) = \frac{1}{1+e^z}$, then $\phi(z) = -\log(1+e^z)$ and $\psi(z) = -\log(1+e^{-z})$.

Subclass C: $\omega(r) = \frac{r}{r+1}$. As we already mentioned, this is another important transformation of the likelihood ratio. Interestingly, the first GAN ever introduced, the Cross Entropy GAN of Goodfellow et al. (2014), belongs to this subclass. When $\omega(r) = \frac{r}{r+1}$ we have $\omega^{-1}(z) = \frac{z}{1-z}$ and $J_\omega = [0, 1]$. For $\rho(z) > 0$, $z \in [0, 1]$, we must define the functions $\phi(z), \psi(z)$ according to equation 8, namely $\phi'(z) = -\frac{z}{1-z}\rho(z)$ and $\psi'(z) = \rho(z)$. The next set of examples can be seen to satisfy these equations.

C1) Selecting $\rho(z) = \frac{1}{z}$ yields $\phi(z) = \log(1-z)$ and $\psi(z) = \log z$.

C2) Selecting $\rho(z) = (1-z)^\alpha$ with $\alpha \neq 0, -1$ yields $\phi(z) = -\frac{(1-z)^{\alpha+1}}{1+\alpha} + \frac{(1-z)^\alpha}{\alpha}$ and $\psi(z) = -\frac{(1-z)^{1+\alpha}}{1+\alpha}$. For $\alpha = 0$ we have $\rho(z) = 1$ and $\phi(z) = z + \log(1-z)$, $\psi(z) = z$, while for $\alpha = -1$ we have $\rho(z) = \frac{1}{1-z}$ and $\phi(z) = -\log(1-z) - \frac{1}{1-z}$, $\psi(z) = -\log(1-z)$.

In C1) we recognize the functions used in the original article by Goodfellow et al. (2014). C2) appears here for the first time.

Subclass D: $\omega(r) = \mathrm{sign}(\log r)$. This is a special case in which the corresponding function $\omega(r)$ is not strictly increasing.
It turns out that we can still arrive at optimization problems, two of which are known and used in practice, by considering $\omega(r)$ as the limit of a sequence of strictly increasing functions.

Monotone Loss: As a first approximation we propose $\mathrm{sign}(z) \approx \tanh(\frac{c}{2}z)$, where $c > 0$ is a parameter; note that $\lim_{c\to\infty}\tanh(\frac{c}{2}z) = \mathrm{sign}(z)$. Using this approximation we can write $\mathrm{sign}(\log r) \approx \tanh\big(\frac{c}{2}\log r\big) = \frac{r^c - 1}{r^c + 1} = \omega(r)$. As mentioned, we obtain exact equality as $c \to \infty$. Let us perform our analysis assuming that $c$ is finite. We note that $\omega^{-1}(z) = \big(\frac{1+z}{1-z}\big)^{1/c}$ and $J_\omega = [-1, 1]$. Consequently, if $\rho(z) > 0$ for $z \in [-1, 1]$, we must define $\phi'(z) = -\big(\frac{1+z}{1-z}\big)^{1/c}\rho(z)$ and $\psi'(z) = \rho(z)$.

D1) If we let $c \to \infty$ in order to converge to the desired sign function, this yields $\phi'(z) = -\rho(z)$ and $\psi'(z) = \rho(z)$. This suggests that $\phi(z) = -\int^z \rho(x)\,dx$ is decreasing and $\psi(z) = \int^z \rho(x)\,dx = -\phi(z)$ is increasing. In fact, any strictly increasing function $\psi(z)$ can be adopted provided we select $\phi(z) = -\psi(z)$. There is a popular combination that falls under case D1): the selection $\psi(z) = z = -\phi(z)$ is reminiscent of the Wasserstein GAN (Martin Arjovsky & Bottou, 2017), with two differences: in our case $z$ must lie in $[-1, 1]$, and the discriminator is not constrained to be a Lipschitz function.

Hinge Loss: As a second approximation we use the expression $\mathrm{sign}(z) \approx \mathrm{sign}(z)|z|^{1/c}$, $c > 0$, which is strictly increasing, continuous and converges to $\mathrm{sign}(z)$ as $c \to \infty$. This suggests $\mathrm{sign}(\log r) \approx \mathrm{sign}(\log r)|\log r|^{1/c} = \omega(r)$, with $\omega^{-1}(z) = e^{\mathrm{sign}(z)|z|^c}$. Since $\omega(r)$ can assume any real value we conclude that $J_\omega = \mathbb{R}$ which, clearly, differs from the previous approximation where we had $J_\omega = [-1, 1]$. If $\rho(z) > 0$ for $z \in \mathbb{R}$ then, according to equation 8, we must define $\phi'(z) = -e^{\mathrm{sign}(z)|z|^c}\rho(z)$ and $\psi'(z) = \rho(z)$. We present the following case, which leads to a very well known pair from a completely different application.
D2) If we select $\psi'(z) = \rho(z) = e^{-|z|^c} + \mathbb{1}_{z<-1} > 0$, then $\phi'(z) = -e^{\mathrm{sign}(z)|z|^c}\big(e^{-|z|^c} + \mathbb{1}_{z<-1}\big)$. If we now let $c \to \infty$, the derivatives assume the limiting forms $\psi'(z) = \mathbb{1}_{z<1}$ and $\phi'(z) = -\mathbb{1}_{z>-1}$. By integrating we arrive at $\phi(z) = -\max\{1+z, 0\}$ and $\psi(z) = -\max\{1-z, 0\}$. The cost based on this particular pair is called the hinge loss (Tang, 2013) and is very popular in binary classification, where one is interested only in the maximization problem. The corresponding method is known to exhibit an overall performance that, in practice, is considered among the best (Rosasco et al., 2004; Janocha & Czarnecki, 2017). Here, as in Zhao et al. (2016), we propose the hinge loss as a means to perform adversarial optimization for the design of the generator $G(z)$. This completes our presentation of examples. We must emphasize, however, that these are only a few illustrations of the possible pairs $\{\phi(z), \psi(z)\}$ one can construct. Indeed, combining, as dictated by equation 8, any strictly increasing function $\omega(r)$ with any positive function $\rho(z)$ generates a legitimate pair $\{\phi(z), \psi(z)\}$ and a corresponding min-max problem of the form in equation 7 that enjoys the desired solution $r(x) = 1$.
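As a sanity check of the constructions presented in this section, the sketch below verifies numerically, for representative members of Subclasses A, C and D, that the maximizer of $\phi(D) + r\psi(D)$ is indeed $D = \omega(r)$, and that the two algebraic forms of the Monotone Loss approximation coincide. The specific parameter values ($\alpha = 2$, the grids, the tested values of $r$) are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def argmax_D(phi, psi, r, grid):
    """Grid-search maximizer of phi(D) + r * psi(D)."""
    return grid[np.argmax(phi(grid) + r * psi(grid))]

# Subclass A, case A1 with alpha = 2, beta = 0: phi(z) = -(2/3) z^{3/2},
# psi(z) = z; the maximizer should be omega(r) = r^2
zA = np.linspace(0.0, 30.0, 300_001)
for r in [0.5, 1.0, 2.0, 3.0]:
    D_star = argmax_D(lambda z: -(2 / 3) * z**1.5, lambda z: z, r, zA)
    assert abs(D_star - r**2) < 1e-2

# Subclass C, case C1 (Cross Entropy GAN): phi(z) = log(1-z), psi(z) = log z;
# the maximizer should be the classical discriminator omega(r) = r/(r+1)
zC = np.linspace(1e-6, 1 - 1e-6, 1_000_001)
for r in [0.2, 1.0, 5.0]:
    D_star = argmax_D(lambda z: np.log(1 - z), lambda z: np.log(z), r, zC)
    assert abs(D_star - r / (r + 1)) < 1e-4

# Monotone Loss: tanh((c/2) log r) equals (r^c - 1)/(r^c + 1)
rr = np.linspace(0.05, 20.0, 2001)
for c in [1.0, 4.0, 16.0]:
    assert np.allclose(np.tanh(0.5 * c * np.log(rr)), (rr**c - 1) / (rr**c + 1))

# Hinge loss (case D2): phi(z) = -max(1+z, 0), psi(z) = -max(1-z, 0);
# the maximizer is sign(log r), consistent with omega(r) = sign(log r)
zD = np.linspace(-3.0, 3.0, 60_001)
for r in [0.3, 0.8, 1.5, 4.0]:
    D_star = argmax_D(lambda z: -np.maximum(1 + z, 0.0),
                      lambda z: -np.maximum(1 - z, 0.0), r, zD)
    assert abs(D_star - np.sign(np.log(r))) < 1e-3
```

The hinge check excludes $r = 1$, where $\phi(D) + r\psi(D)$ is flat on $[-1, 1]$ and the maximizer is not unique.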

A.3 EXPERIMENTS

In our experiments we employed two neural networks, the generator and the discriminator. For the generator we used a four-layer neural network where the first layer is linear and the remaining layers are deconvolutional, with ReLU activation functions between the layers except for the final layer, where we used a sigmoid function since the output is an image with pixel values in the range [0, 1]. The generator input is a standard i.i.d. normal vector of dimension 64 for MNIST and 128 for Stanford Cars, CelebA and CIFAR-10. The output of the generator is a 784 × 1 vector for the MNIST dataset, whereas for the other datasets it is a 3072 × 1 vector. For the discriminator we used a four-layer neural network with three convolutional layers followed by a linear layer. We applied Leaky ReLUs between the layers except for the final layer, where we adopted appropriate output functions based on the range $J_\omega$. For the training of the two neural networks we applied the Adam algorithm (Kingma & Ba, 2014) with $\beta_1 = 0.5$, $\beta_2 = 0.9$, learning rate $10^{-4}$, and batch size 50 for MNIST and 128 for Stanford Cars, CelebA and CIFAR-10. For all datasets, the training lasted 180000 iterations.

A.4 CELEBA

In Figures 2, 3, 4, 5 we present the likelihood ratio estimated by the discriminator neural network and the FID and KID scores of the generated images during training. For every optimization problem, we tested performance for different values of the regularization parameter λ of the maximum gradient penalty. The values of λ were {0.01, 0.1, 1, 10}, as in Zhou et al. (2019). Whenever a λ value is missing, the corresponding algorithm diverged during training. From our simulations we notice that, especially in the case of the KID score, the likelihood ratio approximation accuracy of the discriminator correlates with the quality of the synthetic images produced by the generator.
For instance, in Figure 2 for A3 with λ = 10, at approximately 150000 iterations the likelihood ratio (which should converge to 1) starts to fluctuate more widely around the optimal value; the KID and FID scores then increase, indicating that the quality of the generated images drops. This is apparent in Figure 11, where we see some blurry, almost faceless, images. We notice the same behavior for the A1b, B1a, B1b, Exponential and B2 GANs. Moreover, in all cases, we notice faster convergence for larger values of λ. Interestingly, there is a similar behavior, in terms of convergence, across the different λ's. For example, in B2 with λ = 10 we notice that the variance around one increases; the same happens for λ = 1 and λ = 0.1, only more slowly in terms of iterations. It therefore seems that the regularization parameter cannot fix the divergence problems of a GAN, but it can influence when they become evident. As we already discussed, in the case of the Hinge and Wasserstein GANs the likelihood ratio value is not computable. For this reason, in Figure 5 we present the discriminator output for dataset samples (D(X)) and for synthetic samples (D(G(Z))). In the Hinge GAN, the discriminator output gradually increases its range of values around iteration 130000 for λ = 10 and around iteration 200000 for λ = 1. Accordingly, the synthetic images produced by the generator start to look less similar to the ones from the dataset. The Wasserstein GANs exhibit the same patterns, but in this case the discriminator output tends to assume large values. Our simulations suggest that a way to reduce these large values is to increase the regularization parameter λ, in other words to "force" the discriminator to be Lipschitz.
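Returning to the generator configuration described in A.3, the output dimensions quoted there (784 × 1 for MNIST, 3072 × 1 for the other datasets) can be checked with simple shape bookkeeping for the deconvolutional layers. The kernel, stride and padding values below are assumptions (typical DCGAN-style choices), as the text does not specify them; this is an illustrative sketch, not the exact configuration used in the experiments.

```python
# Shape bookkeeping for a DCGAN-style generator as described in A.3. The
# kernel/stride/padding values are assumptions (typical DCGAN choices:
# kernel 4, stride 2, padding 1), not specifics taken from this section.

def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# CelebA / CIFAR-10 / Stanford Cars: three stride-2 deconvolutional layers
# upsample 4x4 feature maps to 32x32; with 3 color channels the flattened
# output is the 3072 x 1 vector mentioned in the text
size = 4
for _ in range(3):
    size = deconv_out(size)
assert size == 32 and 3 * 32 * 32 == 3072

# MNIST: two stride-2 layers take 7x7 to 28x28; a kernel-3, stride-1 layer
# would preserve that size, and 28 x 28 flattens to the 784 x 1 vector
size = 7
for _ in range(2):
    size = deconv_out(size)
assert size == 28 and deconv_out(size, kernel=3, stride=1) == 28
assert 28 * 28 == 784
```

The formula in `deconv_out` is the standard output-size rule for transposed convolutions; any layer counts or starting resolutions other than those asserted above would simply require different (equally hypothetical) kernel/stride choices.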



Figure 1: Evolution of the corresponding FID and KID scores during training for the datasets Stanford Cars, CelebA, CIFAR-10, and MNIST.

Figure 2: Likelihood ratio (r), KID, FID scores for CelebA dataset. Simulation results correspond to Subclass A GANs for different values of the maximum gradient penalty hyperparameter λ.

Figure 3: Likelihood ratio (r), KID, FID scores for CelebA dataset. Simulation results correspond to Subclass B GANs for different values of the maximum gradient penalty hyperparameter λ.

Figure 6: Likelihood ratio (r), KID, FID scores for Stanford Cars dataset. Simulation results correspond to Subclass A GANs for different values of the maximum gradient penalty hyperparameter λ.

Figure 7: Likelihood ratio (r), KID, FID scores for Stanford Cars dataset. Simulation results correspond to Subclass B GANs for different values of the maximum gradient penalty hyperparameter λ.



[Tables: optimization problems for the examined GAN variants — A1a, A1b, A2, A3, MSE, B1a, B1b, Exponential, B2, Cross Entropy, C2, Hinge and Wasserstein; numerical entries not recoverable from the extraction.]

