FUNCTION-SPACE REGULARIZED RÉNYI DIVERGENCES

Abstract

We propose a new family of regularized Rényi divergences parametrized not only by the order α but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when α > 1; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the α → ∞ limit, which leads to a regularized worst-case regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.

1. INTRODUCTION

Rényi divergence Rényi (1961) is a significant extension of the Kullback-Leibler (KL) divergence for numerous applications; see, e.g., Van Erven & Harremos (2014). The recent neural-based estimators for divergences Belghazi et al. (2018), along with generative adversarial networks (GANs) Goodfellow et al. (2014), accelerated the use of divergences in the field of deep learning. Neural-based divergence estimators are made feasible through the utilization of variational representation formulas. These formulas are essentially lower bounds (and, occasionally, upper bounds) which are approximated by tractable statistical averages. The estimation of a divergence based on variational formulas is a notoriously difficult problem. Challenges include potentially high bias that may require an exponential number of samples McAllester & Stratos (2020), or exponential statistical variance for certain variational estimators Song & Ermon (2019), rendering divergence estimation both data-inefficient and computationally expensive. This is especially prominent for Rényi divergences with order larger than 1. Indeed, numerical simulations have shown that, unless the distributions P and Q are very close to one another, the Rényi divergence R_α(P∥Q) is almost intractable to estimate when α > 1 due to the high variance of the statistically-approximated risk-sensitive observables Birrell et al. (2021); see also the recent analysis in Lee & Shin (2022). A similar issue has also been observed for the KL divergence, Song & Ermon (2019). Overall, the lack of low-variance estimators for Rényi divergences has prevented widespread and accessible experimentation with this class of information-theoretic tools, except in very special cases. We hope our results here will provide a suitable set of tools to address this gap in the methodology. One approach to variance reduction is the development of new variational formulas.
This direction is especially fruitful for the estimation of mutual information van den Oord et al. (2018); Cheng et al. (2020). Another approach is to regularize the divergence by restricting the function space of the variational formula. Indeed, instead of directly attacking the variance issue, the function space of the variational formula can be restricted, for instance, by bounding the test functions or, more appropriately, by bounding the derivative of the test functions. The latter regularization leads to Lipschitz continuous function spaces, which are also foundational to integral probability metrics (IPMs) and more specifically to the duality property of the Wasserstein metric. In this paper we combine the above two approaches, first deriving a new variational representation of the classical Rényi divergences and then regularizing via an infimal convolution as follows:

R^{Γ,IC}_α(P∥Q) := inf_η { R_α(P∥η) + W_Γ(Q, η) },   (1)

where P and Q are the probability distributions being compared, the infimum is over the space of probability measures, R_α is the classical Rényi divergence, and W_Γ is the IPM corresponding to the chosen regularizing function space Γ. The new family of regularized Rényi divergences developed here addresses the risk-sensitivity issue inherent in prior approaches. More specifically, our contributions are as follows.
• We define a new family of function-space regularized Rényi divergences via the infimal convolution operator (1) between the classical Rényi divergence and an arbitrary IPM. The new regularized Rényi divergences inherit their function space from the IPM. For instance, they inherit mass-transport properties when one regularizes using the 1-Wasserstein metric.
• We derive a dual variational representation (11) of the regularized Rényi divergences which avoids risk-sensitive terms and can therefore be used to construct lower-variance statistical estimators.
• We prove a series of properties for the new object: (a) the divergence property; (b) being bounded by the minimum of the Rényi divergence and the IPM, thus allowing for the comparison of non-absolutely continuous distributions; (c) limits as α → 1 from both left and right; (d) regimes in which the limiting cases R_α(P∥Q) and W_Γ(Q, P) are recovered.
• We propose a rescaled version of the regularized Rényi divergences (16) which leads to a new variational formula for the worst-case regret (i.e., α → ∞). This new variational formula does not involve the essential supremum of the density ratio, as in the classical definition of worst-case regret, thereby avoiding risk-sensitive terms.
• We present a series of illustrative examples and counterexamples that further motivate the proposed definition of the function-space regularized Rényi divergences.
• We present numerical experiments that show (a) that we can estimate the new divergence for large values of the order α without variance issues and (b) that we can train GANs using regularized function spaces.
Related work. The order of a Rényi divergence controls the weight put on the tails, with the limiting cases being mode-covering and mode-selection Minka (2005). Rényi divergence estimation is used in a number of applications, including Sajid et al. (2022)

2. NEW VARIATIONAL REPRESENTATIONS OF CLASSICAL RÉNYI DIVERGENCES

The Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) between P and Q, denoted R_α(P∥Q), can be defined as follows. Let ν be a sigma-finite positive measure with dP = p dν and dQ = q dν. Then

R_α(P∥Q) := (1/(α(α−1))) log ∫_{q>0} p^α q^{1−α} dν   if 0 < α < 1, or if α > 1 and P ≪ Q,
R_α(P∥Q) := +∞   if α > 1 and P ̸≪ Q,   (2)

where P ≪ Q denotes absolute continuity of P with respect to Q. There always exists such a ν (e.g., ν = P + Q) and one can show that the definition (2) does not depend on the choice of ν. The R_α provide a notion of 'distance' between P and Q in that they satisfy the divergence property, i.e., they are non-negative and equal zero iff Q = P. The limit of R_α as α approaches 1 or 0 equals the KL or reverse KL divergence respectively Van Erven & Harremos (2014). An alternative representation of R_α, the so-called Rényi-Donsker-Varadhan variational formula, was derived from (2) in Birrell et al. (2021):

R_α(P∥Q) = sup_{ϕ∈M_b(Ω)} { (1/(α−1)) log ∫e^{(α−1)ϕ} dP − (1/α) log ∫e^{αϕ} dQ },  P, Q ∈ P(Ω).   (3)

Here (Ω, M) denotes a measurable space, M_b(Ω) the space of bounded measurable real-valued functions on Ω, and P(Ω) is the space of probability measures on Ω. By a change-of-variables argument this can be transformed into the following new variational representation; see Theorem A.2 in Appendix A for a proof. We call it the convex-conjugate Rényi variational formula (CC-Rényi).

Theorem 2.1 (Convex-Conjugate Rényi Variational Formula). Let P, Q ∈ P(Ω) and α ∈ (0, 1) ∪ (1, ∞). Then

R_α(P∥Q) = sup_{g∈M_b(Ω): g<0} { ∫g dQ + (1/(α−1)) log ∫|g|^{(α−1)/α} dP + α^{−1}(log α + 1) }.   (4)

If (Ω, M) is a metric space with the Borel σ-algebra then (4) holds with M_b(Ω) replaced by C_b(Ω), the space of bounded continuous real-valued functions on Ω.

The representation (4) is of convex-conjugate type, which will be key in our development of function-space regularized Rényi divergences. It is also of independent interest as it avoids risk-sensitive terms, unlike (3), which contains cumulant-generating functions.
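As a sanity check of Theorem 2.1 (our sketch, with illustrative discrete distributions, not from the paper), the first-order conditions of (4) identify the maximizer g*_i = −(p_i/q_i)^α/(αT) with T = Σ_j p_j^α q_j^{1−α}; plugging it in recovers R_α exactly, while any other admissible g yields a lower bound:

```python
import math

def renyi_exact(p, q, alpha):
    # R_alpha(P||Q) = 1/(alpha(alpha-1)) * log sum_i p_i^alpha q_i^(1-alpha)
    t = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(t) / (alpha * (alpha - 1))

def cc_objective(g, p, q, alpha):
    # Objective of the convex-conjugate formula (4), for componentwise g < 0:
    # int g dQ + 1/(alpha-1) log int |g|^((alpha-1)/alpha) dP + (log alpha + 1)/alpha
    assert all(gi < 0 for gi in g)
    term1 = sum(gi * qi for gi, qi in zip(g, q))
    term2 = math.log(sum(abs(gi)**((alpha - 1) / alpha) * pi
                         for gi, pi in zip(g, p))) / (alpha - 1)
    return term1 + term2 + (math.log(alpha) + 1) / alpha

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
alpha = 3.0

# Maximizer implied by the first-order conditions of (4)
T = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
g_star = [-(pi / qi)**alpha / (alpha * T) for pi, qi in zip(p, q)]

exact = renyi_exact(p, q, alpha)
assert abs(cc_objective(g_star, p, q, alpha) - exact) < 1e-9
# Any other admissible g gives a strict lower bound
assert cc_objective([-1.0, -0.5, -2.0], p, q, alpha) <= exact
```

In practice the closed-form maximizer is unavailable; g is parametrized by a neural network and the supremum is approximated by stochastic gradient ascent.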
This makes (4) better behaved in estimation problems, especially when α > 1; see the example in Section 6.1 below. We also obtain a new variational formula for worst-case regret, as defined by Van Erven & Harremos (2014):

D_∞(P∥Q) := lim_{α→∞} αR_α(P∥Q) = log ess sup_P (dP/dQ) if P ≪ Q, and +∞ if P ̸≪ Q.   (5)

In contrast to (5), which requires estimation of the likelihood ratio, the new variational formula (6) below avoids risk-sensitive terms.

Theorem 2.2 (Worst-case Regret Variational Formula). Let P, Q ∈ P(Ω). Then

D_∞(P∥Q) = sup_{g∈M_b(Ω): g<0} { ∫g dQ + log ∫|g| dP + 1 }.   (6)

If Ω is a metric space with the Borel σ-algebra then (6) holds with M_b(Ω) replaced by C_b(Ω).

See Theorem A.5 in Appendix A for a proof. Equation (6) is a new result of independent interest and will also be useful in our study of the α → ∞ limit of the function-space regularized Rényi divergences that we define in the next section.

3. FUNCTION-SPACE REGULARIZED RÉNYI DIVERGENCES

We are now ready to define the function-space regularized Rényi divergences and derive their key properties. In this section, X will denote a compact metric space, P(X) will denote the set of Borel probability measures on X, and C(X) will denote the space of continuous real-valued functions on X. We equip C(X) with the supremum norm and recall that the dual space of C(X) is C(X)* = M(X), the space of finite signed Borel measures on X (see the Riesz representation theorem, e.g., Theorem 7.17 in Folland (2013)).

Definition 3.1. Given a test-function space Γ ⊂ C(X), we define the infimal-convolution Γ-Rényi divergence (i.e., IC-Γ-Rényi divergence) between P, Q ∈ P(X) by

R^{Γ,IC}_α(P∥Q) := inf_{η∈P(X)} { R_α(P∥η) + W_Γ(Q, η) },  α ∈ (0, 1) ∪ (1, ∞),   (7)

where W_Γ denotes the Γ-IPM

W_Γ(μ, ν) := sup_{g∈Γ} { ∫g dμ − ∫g dν },  μ, ν ∈ M(X).   (8)

Remark 3.2. The classical Rényi divergence is convex in its second argument but not in its first when α > 1 Van Erven & Harremos (2014).
This is the motivation for defining the IC-Γ-Rényi divergences via an infimal convolution in the second argument of R_α; convex-analysis tools will be critical in deriving properties of R^{Γ,IC}_α below. For α ∈ (0, 1) one can use the identity R_α(P∥Q) = R_{1−α}(Q∥P) to rewrite (7) as an infimal convolution in the first argument.

The definition (7) can be thought of as a regularization of the classical Rényi divergence using the Γ-IPM. For computational purposes it is significantly more efficient to have a dual formulation, i.e., a representation of R^{Γ,IC}_α in terms of a supremum over a function space. To derive such a representation we begin with the variational formula for R_α from Theorem 2.1. If we define the convex mapping Λ^P_α : C(X) → (−∞, ∞],

Λ^P_α[g] := ∞·1_{g ̸< 0} − ( (1/(α−1)) log ∫|g|^{(α−1)/α} dP + α^{−1}(log α + 1) )·1_{g<0},   (9)

then (4) from Theorem 2.1 can be written as a convex conjugate,

R_α(P∥Q) = (Λ^P_α)*[Q] := sup_{g∈C(X)} { ∫g dQ − Λ^P_α[g] }.   (10)

One can then use Fenchel-Rockafellar duality to derive a dual formulation of the IC-Γ-Rényi divergences. To apply this theory we will need to work with spaces of test functions that satisfy the following admissibility properties. These properties are similar to those used in the construction of regularized KL and f-divergences in Dupuis, Paul & Mao, Yixiang (2022) and Birrell et al. (2022a).

Definition 3.3. We will call Γ ⊂ C(X) admissible if it is convex and contains the constant functions. We will call an admissible Γ strictly admissible if there exists a P(X)-determining set Ψ ⊂ C(X) such that for all ψ ∈ Ψ there exist c ∈ R, ϵ > 0 such that c ± ϵψ ∈ Γ. Recall that Ψ being P(X)-determining means that for all Q, P ∈ P(X), if ∫ψ dQ = ∫ψ dP for all ψ ∈ Ψ then Q = P.

Putting the above pieces together one obtains the following variational representation.

Theorem 3.4. Let Γ ⊂ C(X) be admissible, P, Q ∈ P(X), and α ∈ (0, 1) ∪ (1, ∞). Then:
1.
R^{Γ,IC}_α(P∥Q) = sup_{g∈Γ: g<0} { ∫g dQ + (1/(α−1)) log ∫|g|^{(α−1)/α} dP + α^{−1}(log α + 1) }.   (11)

2. If (11) is finite then there exists η* ∈ P(X) such that

R^{Γ,IC}_α(P∥Q) = inf_{η∈P(X)} { R_α(P∥η) + W_Γ(Q, η) } = R_α(P∥η*) + W_Γ(Q, η*).   (12)

3. R^{Γ,IC}_α(P∥Q) ≤ min{ R_α(P∥Q), W_Γ(Q, P) }.
4. If Γ is strictly admissible then R^{Γ,IC}_α has the divergence property.

See Theorem B.3 in Appendix B for detailed proofs of these results as well as several additional properties. We note that there are alternative strategies for proving the variational formula (11) which make different assumptions; further comments on this can be found in Remark B.4. Important examples of strictly admissible Γ include the following:
1. Γ = C(X), which leads to the classical Rényi divergences.
2. Γ = Lip_1(X), i.e., all 1-Lipschitz functions. This regularizes the Rényi divergences via the 1-Wasserstein metric.
3. Γ = {c + g : c ∈ R, g ∈ C(X), |g| ≤ 1}. This regularizes the Rényi divergences via the total-variation metric.
4. Γ = {c + g : c ∈ R, g ∈ Lip_1(X), |g| ≤ 1}. This regularizes the Rényi divergences via the Dudley metric.
5. Γ = {c + g : c ∈ R, g ∈ V, ∥g∥_V ≤ 1}, the unit ball in a RKHS V ⊂ C(X). This regularizes the Rényi divergences via MMD.
In practice, uniform bounds can be implemented using an appropriately chosen final NN layer. Lipschitz bounds can be implemented using spectral normalization of neural networks Miyato et al. (2018), or using a soft gradient penalty Gulrajani et al. (2017). The function space Γ for structure-preserving GANs discussed in the Appendix is implemented using equivariant neural networks, Birrell et al. (2022b). If Γ is a ball in an RKHS the implementation is carried out using the same tools used in, e.g., MMD distances and divergences, Gretton et al. (2012); Glaser et al. (2021). The IC-Γ-Rényi divergences also satisfy a data-processing inequality. See Theorem B.8 in Appendix B for a proof as well as details regarding the notation.
Theorem 3.5 (Data Processing Inequality). Let α ∈ (0, 1) ∪ (1, ∞), Q, P ∈ P(X), and let K be a probability kernel from X to Y such that K[g] ∈ C(X) for all g ∈ C(X, Y). If Γ ⊂ C(Y) is admissible then R^{Γ,IC}_α(K[P]∥K[Q]) ≤ R^{K[Γ],IC}_α(P∥Q). If Γ ⊂ C(X × Y) is admissible then R^{Γ,IC}_α(P ⊗ K∥Q ⊗ K) ≤ R^{K[Γ],IC}_α(P∥Q). If K[Γ] is strictly contained in Γ then the bounds in Theorem 3.5 can be strictly tighter than the classical data-processing inequality Van Erven & Harremos (2014). Data-processing inequalities are important for constructing symmetry-preserving GANs; see Birrell et al. (2022b) and Section D.1.
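To illustrate the dual representation (11) concretely (our sketch, not from the paper), take X = {0, x}, P = δ_0, Q = cδ_0 + (1−c)δ_x, and let Γ be the L-Lipschitz functions. Writing g = (−u, −v) with u, v > 0 and |u − v| ≤ Lx, the P-integral term of (11) reduces to α^{−1} log u, the optimal v saturates the Lipschitz constraint, and a one-dimensional grid search over u matches the closed-form expression for this two-point example derived in Section 5:

```python
import math

# Dual side of (11) on X = {0, x}: P = delta_0, Q = c*delta_0 + (1-c)*delta_x,
# Gamma = L-Lipschitz functions. For g = (-u, -v) the Q-term is -(c*u + (1-c)*v),
# so the optimal v is as small as the Lipschitz constraint |u - v| <= L*x allows.
def ic_dual(alpha, L, x, c, n=200000, umax=20.0):
    best = -math.inf
    for i in range(1, n + 1):
        u = umax * i / n
        v = max(u - L * x, 1e-12)  # smallest admissible (strictly positive) v
        val = (-(c * u + (1 - c) * v) + math.log(u) / alpha
               + (math.log(alpha) + 1) / alpha)
        best = max(best, val)
    return best

def ic_closed_form(alpha, L, x, c):
    # Piecewise closed form of R^{L Gamma, IC}_alpha(P||Q_{x,c}) from Section 5
    if alpha * L * x < 1:
        return (1 - c) * L * x
    if alpha * L * x <= 1 / c:
        return 1 / alpha - c * L * x + math.log(alpha * L * x) / alpha
    return math.log(1 / c) / alpha

for alpha, L, x, c in [(2.0, 1.0, 0.3, 0.4), (2.0, 4.0, 0.5, 0.4)]:
    assert abs(ic_dual(alpha, L, x, c) - ic_closed_form(alpha, L, x, c)) < 1e-3
```

The grid search stands in for the neural-network optimization used in the experiments; it is feasible here only because the test-function space is two-dimensional.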

4. LIMITS, INTERPOLATIONS, AND REGULARIZED WORST-CASE REGRET

Next we use Theorem 3.4 to compute various limits of the IC-Γ-Rényi divergences. First we show that they interpolate between R_α and W_Γ in the following sense (see Theorem B.5 for a proof).

Theorem 4.1. Let Γ ⊂ C(X) be admissible, P, Q ∈ P(X), and α ∈ (0, 1) ∪ (1, ∞).
1. lim_{δ→0+} (1/δ) R^{δΓ,IC}_α(P∥Q) = W_Γ(Q, P).
2. If Γ is strictly admissible then lim_{L→∞} R^{LΓ,IC}_α(P∥Q) = R_α(P∥Q).

Now we discuss the limiting behavior in α. These results generalize several properties of the classical Rényi divergences Van Erven & Harremos (2014). First we consider the α → 1 limit; see Theorem B.6 for a proof.

Theorem 4.2. Let Γ ⊂ C(X) be admissible and P, Q ∈ P(X). Then

lim_{α→1+} R^{Γ,IC}_α(P∥Q) = inf_{η∈P(X): ∃β>1, R_β(P∥η)<∞} { R(P∥η) + W_Γ(Q, η) },   (13)
lim_{α→1−} R^{Γ,IC}_α(P∥Q) = inf_{η∈P(X)} { R(P∥η) + W_Γ(Q, η) }   (14)
  = sup_{g∈Γ: g<0} { ∫g dQ + ∫log|g| dP } + 1.   (15)

Remark 4.3. When Γ = C(X), changing variables to g = −exp(ϕ − 1) transforms (15) into the Legendre-transform variational formula for R(P∥Q); see equation (1) in Birrell et al. (2022c) with f(x) = x log(x). Eq. (14) is an infimal convolution of the reverse KL-divergence, as opposed to the results in Dupuis, Paul & Mao, Yixiang (2022), which apply to the (forward) KL-divergence.

Function-space regularized worst-case regret. Next we investigate the α → ∞ limit of the IC-Γ-Rényi divergences, which will lead to the function-space regularized worst-case regret. First recall that some authors use an alternative definition of the classical Rényi divergences, related to the one used in this paper by D_α(·∥·) := αR_α(·∥·). This alternative definition has the useful property of being non-decreasing in α; see Van Erven & Harremos (2014). Appropriately rescaled, the IC-Γ-Rényi divergence also satisfies this property, leading to the following definition.

Definition 4.4. For Γ ⊂ C(X), α ∈ (0, 1) ∪ (1, ∞), and P, Q ∈ P(X) we define

D^{Γ,IC}_α(P∥Q) := α R^{Γ/α,IC}_α(P∥Q).   (16)
Note that αR^{Γ/α,IC}_α(P∥Q) is non-decreasing in α; see Lemma B.1 for a proof. We now show that the divergences D^{Γ,IC}_α are well behaved in the α → ∞ limit, generalizing (5). Taking this limit provides a definition of function-space regularized worst-case regret, along with the following dual variational representation.

Theorem 4.5. Let Γ ⊂ C(X) be admissible and P, Q ∈ P(X). Then

D^{Γ,IC}_∞(P∥Q) := lim_{α→∞} D^{Γ,IC}_α(P∥Q) = inf_{η∈P(X)} { D_∞(P∥η) + W_Γ(Q, η) }   (17)
  = sup_{g∈Γ: g<0} { ∫g dQ + log ∫|g| dP + 1 }.   (18)

We call D^{Γ,IC}_∞ the infimal-convolution Γ-worst-case regret (i.e., IC-Γ-WCR). The method of proof of Theorem 4.5 is similar to that of part (1) of Theorem 3.4; see Theorem B.7 in Appendix B for details. Theorem 4.5 suggests that D^{Γ,IC}_α is the appropriate α-scaling to use when α is large, and we find this to be the case in practice; see the example in Section 6.3.1.
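Both the monotonicity in α and the worst-case-regret representation in Theorem 4.5 can be checked in the unregularized case Γ = C(X), where the IC divergence reduces to the classical D_α = αR_α (our sketch, with illustrative discrete distributions):

```python
import math

def D_alpha(p, q, alpha):
    # With Gamma = C(X) the IC divergence reduces to the classical one, so
    # D_alpha = alpha * R_alpha = (1/(alpha-1)) log sum_i p_i^alpha q_i^(1-alpha)
    return math.log(sum(pi**alpha * qi**(1 - alpha)
                        for pi, qi in zip(p, q))) / (alpha - 1)

def wcr_objective(g, p, q):
    # Objective of the sup in Theorem 4.5: int g dQ + log int |g| dP + 1, g < 0
    assert all(gi < 0 for gi in g)
    return (sum(gi * qi for gi, qi in zip(g, q))
            + math.log(sum(-gi * pi for gi, pi in zip(g, p))) + 1)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
d_inf = math.log(max(pi / qi for pi, qi in zip(p, q)))  # worst-case regret

# Monotonicity in alpha and convergence to D_inf
alphas = [0.5, 2.0, 4.0, 16.0, 64.0, 256.0]
vals = [D_alpha(p, q, a) for a in alphas]
assert all(a <= b + 1e-12 for a, b in zip(vals, vals[1:]))
assert abs(vals[-1] - d_inf) < 0.02

# The sup concentrates on the atom maximizing p_i/q_i: with g ~ -(1/q_i*) there
# (and nearly zero elsewhere), the objective approaches D_inf from below.
i_star = max(range(len(p)), key=lambda i: p[i] / q[i])
g_near = [-1 / q[i] if i == i_star else -1e-9 for i in range(len(p))]
assert abs(wcr_objective(g_near, p, q) - d_inf) < 1e-6
assert wcr_objective([-1.0, -2.0, -0.5], p, q) <= d_inf
```

Note that, unlike the α → 1 (KL) limit, the logarithm sits outside the P-integral here; this is what removes the essential supremum from the classical definition.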

5. ANALYTICAL EXAMPLES AND COUNTEREXAMPLES

In this section we present several analytical examples and counterexamples that illustrate important properties of the IC-Γ-Rényi divergences and demonstrate weaknesses of other attempts to define regularized Rényi divergences. In particular, we show that other attempts at regularizing Rényi divergences fail to inherit important properties from the Γ-IPM. More details on the computations can be found in Appendix C.

Infimal convolution and scaling limits: First we present a simple example that illustrates the infimal convolution formula and limiting properties from Sections 3 and 4. Let P = δ_0, Q_{x,c} = cδ_0 + (1−c)δ_x for c ∈ (0, 1), x > 0, and let Γ = Lip_1. Then for L > 0 one can compute

R^{LΓ,IC}_α(P∥Q_{x,c}) = (1−c)Lx   if 0 < αLx < 1,
R^{LΓ,IC}_α(P∥Q_{x,c}) = α^{−1} − cLx + α^{−1} log(αLx)   if 1 ≤ αLx ≤ 1/c,
R^{LΓ,IC}_α(P∥Q_{x,c}) = α^{−1} log(1/c)   if αLx > 1/c.   (19)

In particular, it is straightforward to show that R^{LΓ,IC}_α(P∥Q_{x,c}) ≤ (1−c)Lx = W_{LΓ}(Q_{x,c}, P), lim_{x→0+} R^{LΓ,IC}_α(P∥Q_{x,c}) = lim_{x→0+} (1−c)Lx = 0, and lim_{L→∞} R^{LΓ,IC}_α(P∥Q_{x,c}) = α^{−1} log(1/c) = R_α(P∥Q_{x,c}). We can also rewrite this in terms of the solution to the infimal convolution problem and take the worst-case-regret scaling limit as follows:

R^{LΓ,IC}_α(P∥Q_{x,c}) = W_{LΓ}(Q_{x,c}, P)   if 0 < αLx < 1,
R^{LΓ,IC}_α(P∥Q_{x,c}) = R_α(P∥Q_{x,1/(αLx)}) + W_{LΓ}(Q_{x,c}, Q_{x,1/(αLx)})   if 1 ≤ αLx ≤ 1/c,
R^{LΓ,IC}_α(P∥Q_{x,c}) = R_α(P∥Q_{x,c})   if αLx > 1/c,

lim_{α→∞} αR^{Γ/α,IC}_α(P∥Q_{x,c}) = W_Γ(Q_{x,c}, P)   if 0 < x < 1,
lim_{α→∞} αR^{Γ/α,IC}_α(P∥Q_{x,c}) = D_∞(P∥Q_{x,1/x}) + W_Γ(Q_{x,c}, Q_{x,1/x})   if 1 ≤ x ≤ 1/c,
lim_{α→∞} αR^{Γ/α,IC}_α(P∥Q_{x,c}) = D_∞(P∥Q_{x,c})   if x > 1/c.   (20)

Γ-Rényi-Donsker-Varadhan counterexample: As an alternative to Definition 3.1, one can attempt to regularize the Rényi divergences by restricting the test-function space in the variational representation (3), leading to the Γ-Rényi-Donsker-Varadhan (Γ-Rényi-DV) divergences

R^{Γ,DV}_α(P∥Q) := sup_{ϕ∈Γ} { (1/(α−1)) log ∫e^{(α−1)ϕ} dP − (1/α) log ∫e^{αϕ} dQ }.   (21)
The bound log ∫e^{cϕ} dP ≥ c∫ϕ dP for all ϕ ∈ Γ, c ∈ R implies that R^{Γ,DV}_α ≤ W_Γ for α ∈ (0, 1), making (21) a useful regularization of the Rényi divergences in this case; this utility was demonstrated in Pantazis et al. (2022), where it was used to construct GANs. However, estimators built from the representation (21) (i.e., replacing P and Q by empirical measures) are known to be numerically unstable when α > 1. Below we provide a counterexample showing that, unlike for the IC-Γ-Rényi divergences, R^{Γ,DV}_α ̸≤ W_Γ in general when α > 1. We conjecture that this is a key reason for the instability of Γ-Rényi-Donsker-Varadhan estimators when α > 1. Let P_{x,c} = cδ_0 + (1−c)δ_x, Q = δ_0 for x > 0, c ∈ (0, 1), and Γ_L = Lip_L. Then for α > 1 we have R^{Γ_L,DV}_α(P_{x,c}∥Q) = (1/(α−1)) log(c + (1−c) exp((α−1)Lx)) and W_{Γ_L}(P_{x,c}, Q) = (1−c)Lx. Using strict concavity of the logarithm one can then obtain the bound

R^{Γ_L,DV}_α(P_{x,c}∥Q) > W_{Γ_L}(P_{x,c}, Q).   (22)

This shows that, when α > 1, Γ-Rényi-DV violates the key property that allows the IC-Γ-Rényi divergences to inherit properties from the corresponding Γ-IPM. Another alternative is to begin with (3) and then reduce the test-function space to (1/α) log(Γ), where the logarithm is introduced to eliminate the exponential functions in (21). However, this definition also fails to provide an appropriate regularized Rényi divergence; in particular, it is incapable of meaningfully comparing Dirac distributions. See Appendix C.3 for details. These counterexamples lend further credence to our infimal-convolution based regularization approach (7).
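Both computations above can be verified numerically (our sketch; the parameter values are illustrative). The first block recovers the piecewise formula for R^{LΓ,IC}_α(P∥Q_{x,c}) by brute-force minimization over η = sδ_0 + (1−s)δ_x, using R_α(δ_0∥η) = α^{−1} log(1/s) and W_{LΓ}(Q_{x,c}, η) = Lx|c − s|; the second checks the counterexample inequality (22):

```python
import math

def ic_closed_form(alpha, L, x, c):
    # Piecewise formula for R^{L Gamma, IC}_alpha(delta_0 || c delta_0 + (1-c) delta_x)
    if alpha * L * x < 1:
        return (1 - c) * L * x
    if alpha * L * x <= 1 / c:
        return 1 / alpha - c * L * x + math.log(alpha * L * x) / alpha
    return math.log(1 / c) / alpha

def ic_brute_force(alpha, L, x, c, n=100000):
    # Infimal convolution over eta = s delta_0 + (1-s) delta_x, s in (0, 1]
    return min(math.log(1 / s) / alpha + L * x * abs(c - s)
               for s in (i / n for i in range(1, n + 1)))

cases = [(2.0, 1.0, 0.3, 0.4), (2.0, 4.0, 0.5, 0.4), (3.0, 10.0, 1.0, 0.2)]
for alpha, L, x, c in cases:
    assert abs(ic_closed_form(alpha, L, x, c) - ic_brute_force(alpha, L, x, c)) < 1e-3

# Counterexample (22): Gamma-Renyi-DV exceeds the IPM when alpha > 1
alpha, L, x, c = 2.0, 1.0, 1.0, 0.5
dv = math.log(c + (1 - c) * math.exp((alpha - 1) * L * x)) / (alpha - 1)
assert dv > (1 - c) * L * x  # so R^{Gamma,DV}_alpha <= W_Gamma fails
```

The brute-force search is possible here because X has only two points; it exercises the primal (infimal-convolution) side, complementing the dual representation used for estimation.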

6. NUMERICAL EXPERIMENTS

In this section we present numerical examples that demonstrate the use of the IC-Γ-Rényi divergences for both estimation and training of GANs (additional examples can be found in Appendix D). All of the divergences considered in this paper have a variational representation of the form D(P∥Q) = sup_{g∈Γ} H[g; P, Q] for some objective functional H; we use the corresponding estimator

D_n(P∥Q) := sup_{θ∈Θ} H[g_θ; P_n, Q_n],   (23)

where P_n, Q_n are n-sample empirical measures and g_θ is a family of neural networks (NNs) with parameters θ ∈ Θ. For Lipschitz function spaces we weaken the Lipschitz constraint to a soft one-sided gradient penalty (see Section 4.1 of Birrell et al. (2022a)). Optimization is performed using the Adam optimizer Kingma & Ba (2014). For the infimal-convolution divergences we enforce negativity of the test functions (i.e., discriminators) using a final layer having one of the following forms: 1) −abs(x), or 2) −(1/(1−x))1_{x<0} − (1+x)1_{x≥0}. The latter, which we term poly-softplus, is C^1 and decays like O(|x|^{−1}) as x → −∞.
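The poly-softplus final layer can be implemented and checked in a few lines (a sketch of ours; the checks verify strict negativity, the C^1 matching at x = 0, and the slow decay):

```python
import math

def poly_softplus(x):
    """Strictly negative C^1 final layer used to enforce g < 0:
    -(1/(1-x)) for x < 0 and -(1+x) for x >= 0; decays like O(1/|x|)
    toward 0 from below as x -> -infinity."""
    return -1.0 / (1.0 - x) if x < 0 else -(1.0 + x)

# Strict negativity on a wide grid
assert all(poly_softplus(t / 10) < 0 for t in range(-1000, 1000))

# C^1 at the switch point: value -1 and slope -1 from both sides
h = 1e-6
assert poly_softplus(0.0) == -1.0
left = (poly_softplus(0.0) - poly_softplus(-h)) / h
right = (poly_softplus(h) - poly_softplus(0.0)) / h
assert abs(left + 1.0) < 1e-3 and abs(right + 1.0) < 1e-3

# Slow decay toward 0 from below as x -> -infinity (vs. -abs, which grows)
assert -1e-3 < poly_softplus(-1e4) < 0
```

In a deep-learning framework the same scalar map would be applied elementwise to the last linear layer's output; the slow decay keeps gradients informative for strongly negative pre-activations, which is consistent with the reduced rate of failed runs reported below for poly-softplus versus −abs.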

6.1. VARIANCE OF RÉNYI ESTIMATORS

As a first example, we compare estimators of the classical Rényi divergences (i.e., without regularization) constructed from DV-Rényi (3) and CC-Rényi (4) in a simple case where the exact Rényi divergence is known. We let Q and P be 1000-dimensional Gaussians with equal variance and study R_α(P∥Q) as a function of the separation between their means. The results are shown in Figure 1. We see that the estimator based on the convex-conjugate Rényi variational formula (4) has smaller variance and mean-squared error (MSE) than the Rényi-Donsker-Varadhan variational formula (3), with the difference becoming very large when α ≫ 1 or when P and Q are far apart (i.e., when μ_q is large). The Rényi-Donsker-Varadhan estimator only works well when μ_q and α are both not too large, but even in such cases the convex-conjugate Rényi estimator generally performs better. We conjecture that this difference is due to the presence of risk-sensitive terms in (3) which were eliminated in the new representation (4). We note that the NN for the convex-conjugate Rényi estimator used the poly-softplus final layer, as we found the −abs final layer to result in a significant percentage of failed runs (i.e., NaN outputs); this issue did not arise when using poly-softplus. We do not show results for either DV-WCR or CC-WCR here as the exact divergence is infinite in this example. DV-R_α refers to Rényi divergence estimators built using (3) while CC-R_α refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Note that the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large.
Strikingly, the 1-D case exhibits the same behavior (see Figure 3 in Appendix D.2), demonstrating that the DV-Rényi estimator is unsuitable even in low dimensions.

6.2. DETECTION OF RARE SUB-POPULATIONS IN SINGLE-CELL BIOLOGICAL DATASETS

A critical task in cancer assessment is the detection of rare sub-populations subsumed in the overall population of cells. The advent of affordable flow and mass cytometry technologies that perform single-cell measurements opens a new direction for the analysis and comparison of high-dimensional cell distributions Shahi et al. (2017) via divergence estimation. We consider single-cell mass cytometry measurements on 16 bone marrow protein markers (d = 16) coming from healthy individuals and from diseased individuals with acute myeloid leukemia Levine et al. (2015). Following Weber et al. (2019), we create two datasets: one with only healthy samples and another with a decreasing percentage of sick cells, and compute several divergences. Considering the estimated divergence value as the score of a binary classifier, we compute the ROC curve and the respective area under the ROC curve (AUC) for any pair of sample distributions. More specifically, true negatives correspond to the divergence values between two healthy datasets while true positives correspond to the divergence between a healthy and a diseased dataset. Thus, the AUC is 1.0 when the divergence estimates are completely separable while the AUC is 0.5 when they completely overlap. Table 1 reports the AUC values for the scaled IC-Γ-Rényi divergences (16), various levels of rarity, and two sample sizes for the datasets. The best performance in the Rényi family is obtained for α = ∞ using the IC-Γ-WCR variational formula (18). IC-Γ-WCR also outperforms the first-order Wasserstein distance in both sample-size regimes.
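The AUC used above is equivalent to the Mann-Whitney statistic on the divergence scores; a minimal sketch (the score values below are hypothetical placeholders, not measurements from the paper):

```python
def auc(negatives, positives):
    """Area under the ROC curve via the Mann-Whitney U statistic: the
    probability that a positive score exceeds a negative one (ties count 1/2)."""
    pairs = [(n, p) for n in negatives for p in positives]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for n, p in pairs)
    return wins / len(pairs)

# Hypothetical divergence estimates: healthy-vs-healthy comparisons play the
# role of true negatives, healthy-vs-diseased comparisons of true positives.
healthy_vs_healthy = [0.11, 0.09, 0.13, 0.10]
healthy_vs_diseased = [0.42, 0.37, 0.55, 0.48]
assert auc(healthy_vs_healthy, healthy_vs_diseased) == 1.0  # fully separable

# Identical score distributions give chance-level AUC
assert auc([0.1, 0.3, 0.5], [0.1, 0.3, 0.5]) == 0.5
```

The quadratic pairwise loop suffices for the small number of comparisons per table entry; a rank-based implementation would be preferable for large score sets.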

6.3. IC-Γ-RÉNYI GANS

Finally, we study a pair of GAN examples (the second example is presented in Appendix D). Here the goal is to learn a distribution P using a family of generator distributions Q_ψ ∼ h_ψ(X), where X is a noise source and h_ψ is a family of neural networks parametrized by ψ ∈ Ψ, i.e., the goal is to solve

inf_{ψ∈Ψ} D_n(P∥Q_ψ),

where D_n is a divergence estimator of the form (23). In particular, we study the GANs constructed from the newly introduced IC-Γ-Rényi and IC-Γ-WCR divergences and compare them with Wasserstein GAN Gulrajani et al. (2017); Arjovsky et al. (2017). In Figure 2 we demonstrate improved performance of the IC-Γ-Rényi and IC-Γ-WCR GANs, as compared to Wasserstein GAN with gradient penalty (WGAN-GP), on the CIFAR-10 dataset Krizhevsky et al. (2009). The IC GANs also outperform the Rényi-DV GAN (21), as the latter is highly unstable when α > 1 and so the training generally encounters NaN values after a small number of training epochs (hence we omit those results from the figure). We use the same ResNet neural network architecture as in (Gulrajani et al., 2017, Appendix F) and focus on evaluating the effect of different divergences. Here we let Γ be the set of 1-Lipschitz functions, implemented via a gradient penalty. Note that D^{Γ,IC}_∞ performs significantly better than R^{Γ,IC}_α with large α, and the rescaled D^{Γ,IC}_α-GAN performs better than the R^{Γ,IC}_α-GAN when α is large. We also show the averaged final FID score in the legend, computed using 50000 samples from both P and Q. For the IC GANs we enforce negativity of the discriminator by using a final layer equal to −abs. The GANs were trained using the Adam optimizer with an initial learning rate of 0.0002. The left pane shows that the IC-Γ-Rényi GANs outperform WGAN while the right pane shows that GANs based on the rescaled D^{Γ,IC}_α divergences (16) perform better when α is large, including in the α → ∞ limit, i.e., IC-Γ-WCR (17).
In both cases the IC GANs outperform the Γ-Rényi-DV GANs with α > 1 (21); the latter fail to converge due to the presence of risk-sensitive terms. Theorem A.1 (Rényi-Donsker-Varadhan Variational Formula) . Let P and Q be probability measures on (Ω, M) and α ∈ R, α ̸ = 0, 1. Then for any set of functions, Φ, with M b (Ω) ⊂ Φ ⊂ M(Ω) we have R α (P ∥Q) = sup ϕ∈Φ 1 α -1 log e (α-1)ϕ dP - 1 α log e αϕ dQ , where we interpret ∞ -∞ ≡ -∞ and -∞ + ∞ ≡ -∞. If in addition (Ω, M) is a metric space with the Borel σ-algebra then (25) holds for all Φ that satisfy Lip b (Ω) ⊂ Φ ⊂ M(Ω), where Lip b (Ω) is the space of bounded Lipschitz functions on Ω (i.e., Lipschitz for any Lipschitz constant L ∈ (0, ∞)). Using Theorem A.1 we can derive a new variational representation that takes the form of a convex conjugate. Theorem A.2 (Convex-Conjugate Rényi Variational Formula). Let P, Q ∈ P(Ω) and α ∈ (0, 1) ∪ (1, ∞). Then R α (P ∥Q) = sup g∈M b (Ω):g<0 gdQ + 1 α -1 log |g| (α-1)/α dP + α -1 (log α + 1) . ( ) If (Ω, M) is a metric space with the Borel σ-algebra then ( 26) holds with M b (Ω) replaced by C b (Ω), the space of bounded continuous real-valued functions on Ω. Proof. Let Φ = {α -1 log(-h) : h ∈ M b (Ω), h < 0}. We have M b (Ω) ⊂ Φ ⊂ M(Ω), hence Theorem A.1 implies R α (P ∥Q) = sup ϕ∈Φ 1 α -1 log e (α-1)ϕ dP - 1 α log e αϕ dQ = sup h∈M b (Ω):h<0 1 α -1 log e (α-1)(α -1 log(-h)) dP - 1 α log e α(α -1 log(-h)) dQ = sup h∈M b (Ω):h<0 1 α -1 log |h| (α-1)/α dP - 1 α log (-h)dQ . Note that the second term is finite but the first term is possibly infinite when α ∈ (0, 1). Next use the identity log(c) = inf z∈R {z -1 + ce -z } , c ∈ (0, ∞) in the second term to write R α (P ∥Q) = sup h∈M b (Ω):h<0 1 α -1 log |h| (α-1)/α dP - 1 α inf z∈R {z -1 + e -z (-h)dQ} (29) = sup z∈R sup h∈M b (Ω):h<0 1 α -1 log |h| (α-1)/α dP + z -1 α + α -1 e -z hdQ . 
For each z ∈ R make the change variables h = αe z g, g ∈ M b (Ω), g < 0 in the inner supremum to derive R α (P ∥Q) = sup z∈R sup g∈M b (Ω):g<0 1 α -1 log |αe z g| (α-1)/α dP - z -1 α + α -1 e -z αe z gdQ (30) = sup z∈R sup g∈M b (Ω):g<0 1 α -1 log |g| (α-1)/α dP + (α -1 (log α + 1) + gdQ) = sup g∈M b (Ω):g<0 gdQ + 1 α -1 log |g| (α-1)/α dP + α -1 (log α + 1) . This completes the proof of ( 26). The proof of the metric-space version in nearly identical. Remark A.3. To reverse the above derivation and obtain (25) (with Φ = {ϕ ∈ M(Ω) : ϕ is bounded above}) from ( 26), change variables g → -c exp(αϕ), ϕ ∈ Φ, c > 0 in (26) and then maximize over c. Corollary A.4. If X is a compact metric space with the Borel σ-algebra, P, Q ∈ P(X), and α ∈ (0, 1) ∪ (1, ∞) then C b (X) = C(X) and so R α (P ∥Q) = sup g∈C(X):g<0 gdQ + 1 α -1 log |g| (α-1)/α dP + α -1 (log α + 1) . Next we derive a variational formula for the worst case regret, defined by D ∞ (P ∥Q) := lim α→∞ αR α (P ∥Q) . ( ) Theorem A.5. Let P, Q ∈ P(Ω). Then D ∞ (P ∥Q) = sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 . ( ) If Ω is a metric space with the Borel σ-algebra then (33) holds with M b (Ω) replaced by C b (Ω). Remark A.6. Note that on a compact metric space, the space of bounded continuous functions is the same as the space of all continuous functions. Proof. Recall Van Erven & Harremos (2014) D ∞ (P ∥Q) = log ess sup P dP dQ , P ≪ Q ∞, P ̸ ≪ Q . First suppose P ̸ ≪ Q. Then there exists a measurable set A with Q(A) = 0 and P (A) > 0. Let αgdQ + α α -1 log |g| (α-1)/α dP + (log α + 1) g n = -n1 A -1 A c . Then ≥ lim α→∞ gdQ + α α -1 log |g/α| (α-1)/α dP + (log α + 1) = lim α→∞ gdQ + α α -1 log |g| (α-1)/α dP + 1 = gdQ + log |g|dP + 1 for all g ∈ M b (Ω), g < 0. Here we used the dominated convergence theorem to evaluate the limit. Hence, by maximizing over g we obtain D ∞ (P ∥Q) ≥ sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 . ( ) To prove the reverse inequality, take any r ∈ (0, ess sup P dP/dQ). 
By definition of the essential supremum we have P (dP/dQ > r) > 0. We also have the bound P (dP/dQ > r) = 1 dP/dQ>r dP dQ dQ ≥ 1 dP/dQ>r rdQ = rQ(dP/dQ > r) .  This holds for all r < ess sup P dP/dQ, therefore we can take r ↗ ess sup P dP/dQ and use (34) to conclude sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 ≥ log(ess sup P dP/dQ) = D ∞ (P ∥Q) . Combining this with (39) completes the proof of (33). To prove the reverse inequality, take any g ∈ M b (Ω) with g < 0. By Lusin's theorem, for all ϵ > 0 there exists a closed set E ϵ and h ϵ ∈ C b (Ω) such that P (E c ϵ ) ≤ ϵ, Q(E c ϵ ) ≤ ϵ, h ϵ | Eϵ = g, and inf g ≤ h ϵ ≤ 0. Define g ϵ = h ϵ -ϵ. Then g ϵ < 0, g ϵ ∈ C b (Ω) and we have sup g∈C b (Ω):g<0 gdQ + log |g|dP ≥ g ϵ dQ + log |g ϵ |dP (46) = gdQ + (h ϵ -g)1 E c ϵ dQ -ϵ + log( |g|dP + (|h ϵ | -|g|)1 E c ϵ dP + ϵ) ≥ gdQ -(sup g -inf g)Q(E c ϵ ) -ϵ + log( |g|dP + inf gP (E c ϵ ) + ϵ) ≥ gdQ -(sup g -inf g)ϵ -ϵ + log( |g|dP + inf gϵ + ϵ) . Taking the limit ϵ → 0 + we therefore obtain sup g∈C b (Ω):g<0 gdQ + log |g|dP ≥ gdQ + log |g|dP . ( ) This holds for all g ∈ M b (Ω) with g < 0, hence by taking the supremum over g we obtain the reverse inequality to (45). This completes the proof.
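Theorems A.1 and A.2 can be sanity-checked numerically. The sketch below is an illustration we add here (not code from the paper): it brute-forces both variational objectives over grids for a pair of two-point distributions; the values of p, q, α and the grid ranges are arbitrary choices. Under the normalization of R_α used in this paper (cf. the definition D_∞ = lim_{α→∞} αR_α), the two suprema should agree.

```python
import math

# Two-point distributions; the values of p, q, alpha are illustrative choices.
p = [0.8, 0.2]   # P
q = [0.5, 0.5]   # Q
alpha = 2.0

def dv_objective(phi):
    """Renyi-Donsker-Varadhan objective of Theorem A.1 at the test function phi."""
    term_p = math.log(sum(pi * math.exp((alpha - 1.0) * f) for pi, f in zip(p, phi)))
    term_q = math.log(sum(qi * math.exp(alpha * f) for qi, f in zip(q, phi)))
    return term_p / (alpha - 1.0) - term_q / alpha

def cc_objective(g):
    """Convex-conjugate objective of Theorem A.2 at a strictly negative g."""
    lin = sum(qi * gi for qi, gi in zip(q, g))
    s = sum(pi * abs(gi) ** ((alpha - 1.0) / alpha) for pi, gi in zip(p, g))
    return lin + math.log(s) / (alpha - 1.0) + (math.log(alpha) + 1.0) / alpha

phi_grid = [0.02 * i for i in range(-150, 151)]   # phi values in [-3, 3]
dv_sup = max(dv_objective((a, b)) for a in phi_grid for b in phi_grid)

u_grid = [0.02 * i for i in range(-250, 101)]     # log|g| values in [-5, 2]
cc_sup = max(cc_objective((-math.exp(u1), -math.exp(u2)))
             for u1 in u_grid for u2 in u_grid)

print(dv_sup, cc_sup)  # the two suprema agree up to grid resolution
```

Both grid maxima approximate log(Σ_i p_i^2/q_i)/(α(α−1)) = log(1.36)/2 for these parameters, consistent with the change-of-variables derivation in the proof of Theorem A.2.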

B PROOFS

In this appendix we provide a number of proofs that were omitted from the main text. Recall that X denotes a compact metric space. Lemma B.1. Let Γ ⊂ C(X) and P, Q ∈ P(X). Then αR Γ/α,IC α (P ∥Q) is non-decreasing in α ∈ (0, 1) ∪ (1, ∞). If 0 ∈ Γ then αR Γ,IC α (P ∥Q) is also non-decreasing. Proof. If 0 ∈ Γ then W Γ ≥ 0, hence αR Γ,IC α (P ∥Q) = inf η∈P(X) {αR α (P ∥η) + αW Γ (Q, η)} (48) where both α → αR α (P ∥η) and α → αW Γ (Q, η) are non-decreasing. Therefore the infimum is as well. The proof for α → αR Γ/α,IC α (P ∥Q) is similar, though it doesn't require the assumption 0 ∈ Γ due to the identity αW Γ/α = W Γ . Next we prove a key lemma that is used in our main result. First recall the definition Λ P α [g] := ∞1 g̸ <0 + (- 1 α -1 log ∫ |g| (α-1)/α dP + α -1 (log α + 1)) 1 g<0 , g ∈ C(X). (49) Lemma B.2. Λ P α is convex and is continuous on {g ∈ C(X) : g < 0}, an open subset of C(X). Proof. First we prove convexity. Let g 0 , g 1 ∈ {g ∈ C(X) : g < 0} and λ ∈ (0, 1). For α ∈ (0, 1) we can use the inequality λa + (1 -λ)b ≥ a λ b 1-λ for all a, b > 0 to compute - 1 α -1 log ∫ |λg 1 + (1 -λ)g 0 | (α-1)/α dP ≤ - 1 α -1 log ∫ (|g 1 | λ |g 0 | 1-λ ) (α-1)/α dP . Using Hölder's inequality with exponents p = 1/λ, q = 1/(1 -λ) we then obtain - 1 α -1 log ∫ (|g 1 | λ |g 0 | 1-λ ) (α-1)/α dP (51) ≤ - 1 α -1 log (∫ |g 1 | (α-1)/α dP) λ (∫ |g 0 | (α-1)/α dP) 1-λ = λ (- 1 α -1 log ∫ |g 1 | (α-1)/α dP) + (1 -λ) (- 1 α -1 log ∫ |g 0 | (α-1)/α dP) . Therefore g → - 1 α-1 log ∫ |g| (α-1)/α dP is convex on {g < 0}. This proves Λ P α is convex when α ∈ (0, 1). Now suppose α > 1. The map t > 0, t → t (α-1)/α is concave and -log is decreasing and convex, hence - 1 α -1 log ∫ |λg 1 + (1 -λ)g 0 | (α-1)/α dP (52) ≤ - 1 α -1 log (λ ∫ |g 1 | (α-1)/α dP + (1 -λ) ∫ |g 0 | (α-1)/α dP) ≤ λ (- 1 α -1 log ∫ |g 1 | (α-1)/α dP) + (1 -λ) (- 1 α -1 log ∫ |g 0 | (α-1)/α dP) . This proves that Λ P α is also convex when α > 1.
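The convexity argument above can be spot-checked numerically. The following sketch is our illustration (the reference measure P and the functions g_0, g_1 are arbitrary choices): it verifies midpoint convexity of Λ^P_α on a two-point space for one α in each of the two regimes treated in the proof.

```python
import math

p = [0.6, 0.4]  # an arbitrary two-point reference measure P (illustrative)

def Lambda(g, alpha):
    """Lambda^P_alpha[g] from (49), for strictly negative g on a two-point space."""
    assert all(gi < 0 for gi in g)
    s = sum(pi * abs(gi) ** ((alpha - 1.0) / alpha) for pi, gi in zip(p, g))
    return -math.log(s) / (alpha - 1.0) + (math.log(alpha) + 1.0) / alpha

g0, g1 = (-1.0, -2.0), (-3.0, -0.5)  # arbitrary strictly negative functions
mid = tuple(0.5 * (a + b) for a, b in zip(g0, g1))

for alpha in (0.5, 2.0):  # one alpha in (0,1), one in (1,infinity)
    lhs = Lambda(mid, alpha)
    rhs = 0.5 * Lambda(g0, alpha) + 0.5 * Lambda(g1, alpha)
    print(alpha, lhs <= rhs + 1e-12)  # midpoint convexity holds in both regimes
```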
Openness of {g < 0} follows from the assumption that X is compact and so any strictly negative continuous function is strictly bounded away from zero. Continuity on {g < 0} then follows from the dominated convergence theorem. Now we prove our main theorem, deriving the dual variational formula and other important properties of the IC-Γ-Rényi divergences. Theorem B.3. Let Γ ⊂ C(X) be admissible, P, Q ∈ P(X), and α ∈ (0, 1) ∪ (1, ∞). Then: 1. R Γ,IC α (P ∥Q) = sup g∈Γ:g<0 { ∫ g dQ + 1 α -1 log ∫ |g| (α-1)/α dP + α -1 (log α + 1) } . (53) 2. If (53) is finite then there exists η * ∈ P(X) such that R Γ,IC α (P ∥Q) = inf η∈P(X) {R α (P ∥η) + W Γ (Q, η)} = R α (P ∥η * ) + W Γ (Q, η * ) . (54) 3. R Γ,IC α (P ∥Q) is convex in Q. If α ∈ (0, 1) then R Γ,IC α (P ∥Q) is jointly convex in (P, Q). 4. (P, Q) → R Γ,IC α (P ∥Q) is lower semicontinuous. 5. R Γ,IC α (P ∥Q) ≥ 0 with equality if P = Q. 6. R Γ,IC α (P ∥Q) ≤ min{R α (P ∥Q), W Γ (Q, P )}. 7. If Γ is strictly admissible then R Γ,IC α has the divergence property. Proof. 1. Define F, G : C(X) → (-∞, ∞] by F = Λ P α and G[g] = ∞1 g̸ ∈Γ -E Q [g]. Using the assumptions on Γ along with Lemma B.2 we see that F and G are convex, F [-1] < ∞, G[-1] < ∞, and F is continuous at -1. Therefore Fenchel-Rockafellar duality (see, e.g., Theorem 4.4.3 in Borwein & Zhu (2006)) along with the identity C(X) * = M (X) gives sup g∈C(X) {-F [g] -G[g]} = inf η∈M (X) {F * [η] + G * [-η]} , and if either side is finite then the infimum on the right hand side is achieved at some η * ∈ M (X). Using the definitions, we can rewrite the left hand side as follows: sup g∈C(X) {-F [g] -G[g]} = sup g∈Γ:g<0 { ∫ g dQ + 1 α -1 log ∫ |g| (α-1)/α dP + α -1 (log α + 1) } . We can also compute G * [-η] = sup g∈C(X) {- ∫ g dη -(∞1 g̸ ∈Γ -E Q [g])} = W Γ (Q, η) . (57) Therefore inf η∈M (X) {(Λ P α ) * [η] + W Γ (Q, η)} (58) = sup g∈Γ:g<0 { ∫ g dQ + 1 α -1 log ∫ |g| (α-1)/α dP + α -1 (log α + 1) } . Next we show that the infimum over M (X) can be restricted to P(X).
First suppose η ∈ M (X) with η(X) ̸ = 1. Then, using the assumption that Γ contains the constant functions, we have W Γ (Q, η) ≥ E Q [±n] -±ndη = ±n(1 -η(X)) → ∞ (59) as n → ∞ (for appropriate choice of sign). Therefore W Γ (Q, η) = ∞ if η(X) ̸ = 1. This implies that the infimum can be restricted to {η ∈ M (X) : η(X) = 1}. Now suppose η ∈ M (X) is not positive. Take a measurable set A with η(A) < 0. By Lusin's theorem, for all ϵ > 0 there exists a closed set E ϵ ⊂ X and a continuous function g ϵ ∈ C(X) such that |η|(E c ϵ ) < ϵ, 0 ≤ g ϵ ≤ 1, and g ϵ | Eϵ = 1 A . Define g n,ϵ = -ng ϵ -1, n ∈ Z + . Then g n,ϵ ∈ {g ∈ C(X) : g < 0}, hence (Λ P α ) * [η] ≥ g n,ϵ dη + 1 α -1 log |g n,ϵ | (α-1)/α dP + α -1 (log α + 1) (60) = -nη(A) + nη(A ∩ E c ϵ ) -n g ϵ 1 E c ϵ dη -η(X) + 1 α -1 log |ng ϵ + 1| (α-1)/α dP + α -1 (log α + 1) ≥n(|η(A)| -2ϵ) -η(X) + 1 α -1 log |ng ϵ + 1| (α-1)/α dP + α -1 (log α + 1) . If α > 1 then log |ng ϵ + 1| (α-1)/α dP ≥ 0 and if α ∈ (0, 1) then log |ng ϵ + 1| (α-1)/α dP ≤ 0. In either case we have 1 α-1 log |ng ϵ + 1| (α-1)/α dP ≥ 0 and so (Λ P α ) * [η] ≥n(|η(A)| -2ϵ) -η(X) + α -1 (log α + 1) . ( ) By choosing ϵ < |η(A)|/2 and taking n → ∞ we see that (Λ P α ) * [η] = ∞ whenever η ∈ M (X) is not positive. Therefore the infimum can further be restricted to positive measures. Combining these results we find sup g∈Γ:g<0 gdQ + 1 α -1 log |g| (α-1)/α dP + α -1 (log α + 1) (62) = inf η∈P(X) {(Λ P α ) * [η] + W Γ (Q, η)} . For η ∈ P(X), equation ( 10) implies (Λ P α ) * [η] = R α (P ∥η). This completes the proof. 2. The existence of a minimizer follows from Fenchel-Rockafellar duality; again, see Theorem 4.4.3 in Borwein & Zhu (2006) . 3. This follows from (53) together with the fact that the supremum of convex functions is convex and y → 1 α-1 log(y) is convex when α ∈ (0, 1). 4. Compactness of X implies that g and |g| (α-1)/α are bounded and continuous whenever g ∈ Γ satisfies g < 0. 
Therefore Q → gdQ and P → |g| (α-1)/α dP are continuous in the weak topology on P(X). Therefore the objective functional in ( 53) is continuous in (P, Q). The supremum is therefore lower semicontinuous. 5. This easily follows from the definition (7). 6. R α is a divergence, hence is non-negative. Γ contains the constant functions, hence W Γ ≥ 0. Therefore R Γ,IC α ≥ 0. If Q = P then 0 ≤ R Γ,IC α (P ∥Q) ≤ R α (P ∥P ) + W Γ (P, P ) = 0, hence R Γ,IC α (P ∥Q) = 0. 7. Suppose Γ is strictly admissible. Due to part 5 of this theorem, we only need to show that if R Γ,IC α (P ∥Q) = 0 then P = Q. If R Γ,IC α (P ∥Q) = 0 then part 2 implies there exists η * ∈ P(X) such that 0 = R α (P ∥η * ) + W Γ (Q, η * ) . ( ) Both terms are non-negative, hence R α (P ∥η * ) = 0 = W Γ (Q, η * ). R α has the divergence property, hence η * = P . So W Γ (Q, P ) = 0. Therefore 0 ≥ gdQ -gdP for all g ∈ Γ. Let Ψ be as in the definition of strict admissibility and let ψ ∈ Ψ. There exists c ∈ R, ϵ > 0 such that c ± ϵψ ∈ Γ and so 0 ≥ ±ϵ( ψdQ -ψdP ). Therefore ψdQ = ψdP for all ψ ∈ Ψ. Ψ is P(X)-determining, hence Q = P . Remark B.4. The Fenchel-Rockafellar Theorem applies under two different sets of assumptions: the first assumes both mappings are lower semicontinuous (LSC) while the second applies when one mapping is continuous at a point where both are finite. The mapping Λ P α , as defined by ( 49) and appearing in (10), is not LSC but it is continuous on its domain, hence we used the second version of Fenchel-Rockafellar in our proof of Theorem B.3. For α > 1 one could alternatively redefine Λ P α along the boundary of {g < 0} to make it LSC while still maintaining the relation ( 10) and thereby utilize the first version of Fenchel-Rockafellar. This alternative approach is also amenable to extending the theorem to non-compact spaces, using the methods from Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a) . However, these methods do not apply to α ∈ (0, 1). 
With this in mind, in order to provide a simple unified treatment of all α ∈ (0, 1) ∪ (1, ∞) we structured our proof around the second version of Fenchel-Rockafellar. Despite the fact that Λ P α is not LSC, the Fenchel-Rockafellar Theorem does imply that convex duality holds at all points of continuity in the domain, i.e., one has Λ P α [g] = sup η∈M (X) { gdη -R α (P ∥η)} for all g < 0 , but this duality formula doesn't necessarily hold if g ̸ < 0. Here, R α (P ∥η) for general η ∈ M (X) is defined via the variational formula R α (P ∥η) := (Λ P α ) * [η] = sup g∈C(X) { gdη -Λ P α [g]} and one can rewrite this in terms of the classical Rényi divergence as follows R α (P ∥η) = ∞ if η ̸ ≥ 0 or η = 0 , R α (P ∥ η ∥η∥ ) -1 α log ∥η∥ if η is a nonzero positive measure. Next we prove the limiting results from Theorem 4.1. Theorem B.5. Let Γ ⊂ C(X) be admissible, P, Q ∈ P(X), and α ∈ (0, 1) ∪ (1, ∞). Then lim δ→0 + 1 δ R δΓ,IC α (P ∥Q) = W Γ (Q, P ) and if Γ is strictly admissible we have lim L→∞ R LΓ,IC α (P ∥Q) = R α (P ∥Q) . Proof. It is straightforward to show that the scaled function spaces are admissible and W cΓ = cW Γ for all c > 0. First we prove (67). From the definition 7 we have δ -1 R δΓ,IC α (P ∥Q) = inf η∈P(X) {δ -1 R α (P ∥η) + W Γ (Q, η)} ≤ W Γ (Q, P ) and so δ -1 R δΓ,IC α (P ∥Q) is non-increasing in δ. Therefore lim δ→0 + δ -1 R δΓ,IC α (P ∥Q) = sup δ>0 δ -1 R δΓ,IC α (P ∥Q) and lim δ→0 + δ -1 R δΓ,IC α (P ∥Q) ≤ W Γ (Q, P ) . We will assume this inequality is strict and derive a contradiction. This assumption, together with (70), implies R δΓ,IC α (P ∥Q) < ∞ for all δ > 0. Part (2) of Theorem 3.4 then implies the existence of η * ,δ ∈ P(X) such that δ -1 R δΓ,IC α (P ∥Q) = δ -1 R α (P ∥η * ,δ ) + W Γ (Q, η * ,δ ) ≥ W Γ (Q, η * ,δ ) . ( ) Take a sequence δ n → 0 + . We have assumed X is compact, hence P(X) is also compact and so there exists a weakly convergent subsequence η * ,δn j → η * . 
From the variational formulas ( 25) and (8) we see that R α (P ∥•) and W Γ (Q, •) are lower semicontinuous, hence lim inf j W Γ (Q, η * ,δn j ) ≥ W Γ (Q, η * ) and R α (P ∥η * ) ≤ lim inf j R α (P ∥η * ,δn j ) ≤ lim inf j δ nj (δ -1 nj R α (P ∥η * ,δn j ) + W Γ (Q, η * ,δn j )) (73) = lim inf j δ nj (δ -1 nj R δn j Γ,IC α (P ∥Q)) = 0 , where the last equality follows from the assumed strictness of the inequality (70). Therefore the divergence property for the classical Rényi divergences implies R α (P ∥η * ) = 0 and P = η * . Combining the above results we obtain lim δ→0 + δ -1 R δΓ,IC α (P ∥Q) = lim j→∞ δ -1 nj R δn j Γ,IC α (P ∥Q) ≥ lim inf j W Γ (Q, η * ,δn j ) (75) ≥W Γ (Q, η * ) = W Γ (Q, P ) . This contradicts (71) and therefore we have proven the equality (67). Now we assume Γ is strictly admissible and will prove (68) via similar reasoning. From the definition 7 we see that R LΓ,IC α (P ∥Q) = inf η∈P(X) {R α (P ∥η) + LW Γ (Q, η)} ≤ R α (P ∥Q) and R LΓ,IC α is non-decreasing in L. Hence lim L→∞ R LΓ,IC α (P ∥Q) = sup L>0 R LΓ,IC α (P ∥Q) and lim L→∞ R LΓ,IC α (P ∥Q) ≤ R α (P ∥Q) . Suppose this inequality is strict. Then R LΓ,IC α (P ∥Q) < ∞ for all L and we can use part (2) of Theorem 3.4 to conclude there exists η * ,L ∈ P(X) such that R LΓ,IC α (P ∥Q) = R α (P ∥η * ,L ) + LW Γ (Q, η * ,L ) . (78) Take L n → ∞. Compactness of P(X) implies the existence of a weakly convergent subsequence η * ,j := η * ,Ln j → η * ∈ P(X). Lower semicontinuity of R α (P ∥•) and W Γ (Q, •) imply lim inf j R α (P ∥η * ,j ) ≥ R α (P ∥η * ) and W Γ (Q, η * ) ≤ lim inf j W Γ (Q, η * ,j ) = lim inf j L -1 nj W Ln j Γ (Q, η * ,j ) ≤ lim inf j L -1 nj R Ln j Γ,IC α (P ∥Q) = 0 , where the last equality follows from the assumed strictness of the inequality (77). Therefore W Γ (Q, η * ) = 0. Γ is strictly admissible, hence Q = η * (see the proof of part (7) of Theorem 3.4). 
Combining these results we see that lim L→∞ R LΓ,IC α (P ∥Q) = lim j R Ln j Γ,IC α (P ∥Q) = lim j (R α (P ∥η * ,Ln j ) + L nj W Γ (Q, η * ,Ln j )) (80) ≥ lim inf j R α (P ∥η * ,j ) ≥ R α (P ∥η * ) = R α (P ∥Q) . This contradicts the assumed strictness of the inequality (77) and hence ( 77) is an equality. This completes the proof. Next we prove Theorem 4.2, regarding the α → 1 limit of the IC-Γ-Rényi divergences. Theorem B.6. Let Γ ⊂ C(X) be admissible and P, Q ∈ P(X). Then lim α→1 + R Γ,IC α (P ∥Q) = inf η∈P(X): ∃β>1,R β (P ∥η)<∞ {R(P ∥η) + W Γ (Q, η)} , lim α→1 -R Γ,IC α (P ∥Q) = inf η∈P(X) {R(P ∥η) + W Γ (Q, η)} (82) = sup g∈Γ:g<0 { gdQ + log |g|dP } + 1 . (83) Proof. Lemma B.1 implies α → αR Γ,IC α (P ∥Q) is non-decreasing on (1, ∞), therefore lim α→1 + αR Γ,IC α (P ∥Q) = inf α>1 αR Γ,IC α (P ∥Q) (84) = inf α>1 inf η∈P(X) {αR α (P ∥η) + αW Γ (Q, η)} = inf η∈P(X) inf α>1 {αR α (P ∥η) + αW Γ (Q, η)} = inf η∈P(X) { lim α→1 + αR α (P ∥η) + W Γ (Q, η)} . From Van Erven & Harremos (2014) we have lim α→1 + R α (P ∥η) = R(P ∥η) if ∃β > 1, R β (P ∥η) < ∞ ∞ otherwise (85) and so we can conclude lim α→1 + R Γ,IC α (P ∥Q) = lim α→1 + αR Γ,IC α (P ∥Q) = inf η∈P (X): ∃β>1,R β (P ∥η)<∞ {R(P ∥η) + W Γ (Q, η)} . (86) This proves (81). Now we compute the limit as α → 1 -. Note that the limit exists due to the fact that α → αR Γ,IC α (P ∥Q) is non-decreasing. From the definition ( 7), for all η ∈ P(X) we have lim α→1 -R Γ,IC α (P ∥Q) ≤ lim α→1 -R α (P ∥η) + W Γ (Q, η) = R(P ∥η) + W Γ (Q, η) . (87) Here we used the fact that lim α→1 -R α (P ∥η) = R(P ∥η) (see Van Erven & Harremos (2014) ). Maximizing over η then gives lim α→1 -R Γ,IC α (P ∥Q) ≤ inf η∈P(X) {R(P ∥η) + W Γ (Q, η)} . 
To prove the reverse inequality, use part 1 of Theorem 3.4 to compute lim α→1 - R Γ,IC α (P ∥Q) = lim α→1 - αR Γ,IC α (P ∥Q) = lim α→1 - sup g∈Γ:g<0 { α ∫ g dQ + α α -1 log ∫ |g| (α-1)/α dP + log α + 1 } ≥ ∫ g dQ + lim α→1 - α α -1 log ∫ |g| (α- We now use Fenchel-Rockafellar duality (Theorem 4.4.3 in Borwein & Zhu (2006)) to compute the dual variational representation of the right hand side of (90). Define F, G : C(X) → (-∞, ∞] by F [g] = ∞1 g̸ <0 - ∫ log |g| dP and G[g] = ∞1 g̸ ∈Γ -E Q [g]. It is straightforward to show that F and G are convex, F [-1] < ∞, G[-1] < ∞, and F is continuous at -1. Therefore inf g∈C(X) {F [g] + G[g]} = sup η∈C(X) * {-F * (-η) -G * (η)} , i.e. sup g∈Γ:g<0 {E Q [g] + ∫ log |g| dP } = inf η∈M (X) {F * (η) + W Γ (Q, η)} , where F * (η) = sup g∈C(X):g<0 { ∫ g dη + ∫ log |g| dP }. Now we show the infimum can be restricted to η ∈ P(X): If η(X) ̸ = 1 then by taking g = ±n we find W Γ (Q, η) ≥ n|Q(X) -η(X)| → ∞ (93) as n → ∞. Therefore W Γ (Q, η) = ∞ if η(X) ̸ = 1. Now suppose η ∈ M (X) is not positive. Take a measurable set A with η(A) < 0. By Lusin's theorem, for all ϵ > 0 there exists a closed set E ϵ ⊂ X and a continuous function g ϵ ∈ C(X) such that |η|(E c ϵ ) < ϵ, 0 ≤ g ϵ ≤ 1, and g ϵ | Eϵ = 1 A . Define g n,ϵ = -ng ϵ -1, n ∈ Z + . Then g n,ϵ ∈ {g ∈ C(X) : g < 0}, hence F * (η) ≥ ∫ g n,ϵ dη + ∫ log |g n,ϵ | dP (94) = ∫ (-ng ϵ -1) dη + ∫ log(ng ϵ + 1) dP ≥ n(|η(A)| - ∫ (g ϵ -1 A )1 E c ϵ dη) -η(X) ≥ n(|η(A)| -ϵ) -η(X) . Letting ϵ = |η(A)|/2 and taking n → ∞ gives F * (η) = ∞. Therefore we conclude inf η∈M (X) {F * (η) + W Γ (Q, η)} = inf η∈P(X) {F * (η) + W Γ (Q, η)} . To evaluate F * (η) for η ∈ P(X) we make a change of variables g = -exp(h -1), h ∈ C(X) to obtain F * (η) = sup h∈C(X) { ∫ h dP - ∫ e h-1 dη} -1 = R(P ∥η) -1 . Here we used the Legendre-transform variational representation of the KL divergence; see equation (1) in Birrell et al. (2022c) with f (x) = x log(x).
Combining these results we obtain inf η∈P(X) {R(P ∥η) + W Γ (Q, η)} ≥ lim α→1 - R Γ,IC α (P ∥Q) ≥ sup g∈Γ:g<0 gdQ + log |g|dP + 1 = inf η∈M (X) {F * (η) + W Γ (Q, η)} + 1 = inf η∈P(X) {R(P ∥η) + W Γ (Q, η)} . This completes the proof. Now we prove Theorem 4.5, regarding the α → ∞ limit of the IC-Γ-Rényi divergences. Theorem B.7. Let Γ ⊂ C(X) be admissible and P, Q ∈ P(X). Then lim α→∞ αR Γ/α,IC α (P ∥Q) = inf η∈P (X) {D ∞ (P ∥η) + W Γ (Q, η)} (98) = sup g∈Γ:g<0 gdQ + log |g|dP + 1 . ( ) Proof. First note that αR Γ/α,IC α (P ∥Q) = inf η∈P(X) {αR α (P ∥η) + W Γ (Q, η)} (100) is nondecreasing in α, therefore for η ∈ P(X) we have  Next we use the Fenchel-Rockafellar duality to derive a dual formulation of the right hand side of (105). Define G, F : C(X) → (-∞, ∞], G[g] = ∞1 g̸ ∈Γ -E Q [g], F [g] = ∞1 g̸ <0 -log |g|dP . It is straightforward to prove that G, F are convex and G[-1] < ∞, F [-1] < ∞ and F is continuous at -1. Therefore Fenchel-Rockafellar duality implies inf g∈C(X) {F [g] + G[g]} = sup η∈C(X) * {-F * [-η] -G * [η]} , i.e. sup g∈Γ:g<0 E Q [g] + log |g|dP = inf η∈M (X) {F * [η] + W Γ (Q, η)} , where F * [η] = sup g∈C(X):g<0 { gdη + log |g|dP }. We now prove that the infimum over M (X) can be restricted to P(X). First suppose η(X) ̸ = 1. Then, because Γ contains the constant functions, we have W Γ (Q, η) ≥ ±n(1 -η(X)) → ∞ as n → ∞ for appropriate choice of sign. Therefore W Γ (Q, η) = ∞ when η(X) ̸ = 1. Now suppose η ∈ M (X) is not positive. Take a measurable set A with η(A) < 0. By Lusin's theorem, for all ϵ > 0 there exists a closed set E ϵ ⊂ X and a continuous function g ϵ ∈ C(X) such that |η|(E c ϵ ) < ϵ, 0 ≤ g ϵ ≤ 1, and g ϵ | Eϵ = 1 A . Define g n,ϵ = -ng ϵ -1, n ∈ Z + . Then g n,ϵ ∈ {g ∈ C(X) : g < 0}, hence F * [η] ≥ g n,ϵ dη + log |g n,ϵ |dP (109) =n(|η(A)| -(g ϵ -1 A )1 E c ϵ dη) -η(X) + log(n g ϵ dP + 1) ≥n(|η(A)| -ϵ) -η(X) . Letting ϵ = |η(A)|/2 and then taking n → ∞ we see that F * [η] = ∞ when η is not positive. 
Together these results imply inf η∈M (X) {F * [η] + W Γ (Q, η)} = inf η∈P(X) {F * [η] + W Γ (Q, η)} . Finally, using Theorem A.5 we see that F * [η] + 1 = sup g∈C(X):g<0 { ∫ g dη + log ∫ |g| dP } + 1 = D ∞ (P ∥η) for all η ∈ P(X). Combining these results gives lim α→∞ αR Γ/α,IC α (P ∥Q) ≥ sup g∈Γ:g<0 { ∫ g dQ + log ∫ |g| dP + 1 } (112) = inf η∈M (X) {F * [η] + W Γ (Q, η)} + 1 = inf η∈P(X) {D ∞ (P ∥η) + W Γ (Q, η)} ≥ lim α→∞ αR Γ/α,IC α (P ∥Q) . This completes the proof. Finally, we prove Theorem 3.5, the data-processing inequality for the IC-Γ-Rényi divergences. First we introduce the following notation: Let Y be another compact metric space and K be a probability kernel from X to Y . Given P ∈ P(X) we denote the composition of P with K by P ⊗ K (a probability measure on X × Y ) and we denote the marginal distribution on Y by K[P ]. Given g ∈ C(X × Y ) we let K[g] denote the function on X given by x → ∫ g(x, y)K x (dy). Theorem B.8 (Data Processing Inequality). Let α ∈ (0, 1) ∪ (1, ∞), Q, P ∈ P(X), and K be a probability kernel from X to Y such that K[g] ∈ C(X) for all g ∈ C(X × Y ). 1. If Γ ⊂ C(Y ) is admissible then R Γ,IC α (K[P ]∥K[Q]) ≤ R K[Γ],IC α (P ∥Q) . (113) 2. If Γ ⊂ C(X × Y ) is admissible then R Γ,IC α (P ⊗ K∥Q ⊗ K) ≤ R K[Γ],IC α (P ∥Q) . (114) Proof. It is straightforward to show that admissibility of Γ implies admissibility of K[Γ]. Hence we can write R K[Γ],IC α (P ∥Q) = sup g∈K[Γ]:g<0 { ∫ g dQ + 1 α -1 log ∫ |g| (α-1)/α dP + α -1 (log α + 1) } (115) ≥ sup g∈Γ:g<0 { ∫ K[g] dQ + 1 α -1 log ∫ |K[g]| (α-1)/α dP + α -1 (log α + 1) } . Using Jensen's inequality we can derive | ∫ g(y)K x (dy)| (α-1)/α ≤ ∫ |g(y)| (α-1)/α K x (dy) , α ∈ (0, 1) , | ∫ g(y)K x (dy)| (α-1)/α ≥ ∫ |g(y)| (α-1)/α K x (dy) , α > 1 . Combining (115)-(117) with the monotonicity of y → 1 α-1 log(y) we arrive at (113). The proof of (114) is similar.
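The worst-case-regret objective ∫ g dQ + log ∫ |g| dP + 1 appearing in Theorems A.5 and B.7 can also be checked by brute force on a discrete example. In the sketch below (our illustration; the distributions and the grid range are arbitrary choices) the grid supremum approaches log max_i (p_i/q_i) = log ess sup_P dP/dQ, as Theorem A.5 predicts; the optimizer pushes |g| toward 0 off the maximizing atom, so a wide grid is used.

```python
import math

p = [0.8, 0.2]  # P (illustrative)
q = [0.5, 0.5]  # Q (illustrative), with P absolutely continuous w.r.t. Q

def wcr_objective(g):
    """Worst-case-regret objective of Theorem A.5: int g dQ + log int |g| dP + 1."""
    assert all(gi < 0 for gi in g)
    return (sum(qi * gi for qi, gi in zip(q, g))
            + math.log(sum(pi * abs(gi) for pi, gi in zip(p, g))) + 1.0)

# Parametrize |g_i| = exp(u_i) and scan a wide grid of u values.
u_grid = [0.02 * i for i in range(-450, 151)]  # u in [-9, 3]
best = max(wcr_objective((-math.exp(u1), -math.exp(u2)))
           for u1 in u_grid for u2 in u_grid)

d_inf = math.log(max(pi / qi for pi, qi in zip(p, q)))  # log ess sup_P dP/dQ
print(best, d_inf)  # grid supremum approaches d_inf from below
```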

C DETAILS ON ANALYTICAL EXAMPLES AND COUNTEREXAMPLES

In this appendix we present several details regarding the analytical examples found in Section 5.

C.1 INFIMAL CONVOLUTION AND SCALING LIMITS

First we present a simple example that illustrates the infimal convolution formula and limiting properties from Sections 3 and 4. Let P = δ 0 , Q x,c = cδ 0 + (1 -c)δ x for c ∈ (0, 1), x > 0, and let Γ = Lip 1 . Then for L > 0 one can compute R α (P ∥Q x,c ) = α -1 log(1/c) , W LΓ (Q x,c , P ) = (1 -c)Lx , and R LΓ,IC α (P ∥Q x,c ) = sup a,b<0:|a-b|≤x {Lca + L(1 -c)b + α -1 log(L|a|)} + α -1 (log α + 1) (120) = sup a<0 {Lca + L(1 -c) min{x + a, 0} + α -1 log(L|a|)} + α -1 (log α + 1) = α -1 + α -1 sup y>0 { -cy + log y , y ≤ αLx ; (1 -c)αLx -y + log y , y > αLx } = { (1 -c)Lx , 0 < αLx < 1 ; α -1 -cLx + α -1 log(αLx) , 1 ≤ αLx ≤ 1/c ; α -1 log(1/c) , αLx > 1/c } . In particular, it is straightforward to show that R LΓ,IC α (P ∥Q x,c ) ≤ W LΓ (Q x,c , P ) , lim x→0 + R LΓ,IC α (P ∥Q x,c ) = lim x→0 + (1 -c)Lx = 0 , lim L→∞ R LΓ,IC α (P ∥Q x,c ) = α -1 log(1/c) = R α (P ∥Q x,c ) . We can also rewrite this in terms of the solution to the infimal convolution problem as follows R LΓ,IC α (P ∥Q x,c ) = { W LΓ (Q x,c , P ) , 0 < αLx < 1 ; R α (P ∥Q x,1/(αLx) ) + W LΓ (Q x,c , Q x,1/(αLx) ) , 1 ≤ αLx ≤ 1/c ; R α (P ∥Q x,c ) , αLx > 1/c } . Taking the worst-case-regret scaling limit we find lim α→∞ αR Γ/α,IC α (P ∥Q x,c ) = { (1 -c)x , 0 < x < 1 ; 1 -cx + log(x) , 1 ≤ x ≤ 1/c ; log(1/c) , x > 1/c } (125) = { W Γ (Q x,c , P ) , 0 < x < 1 ; D ∞ (P ∥Q x,1/x ) + W Γ (Q x,c , Q x,1/x ) , 1 ≤ x ≤ 1/c ; D ∞ (P ∥Q x,c ) , x > 1/c } , where D ∞ (P ∥Q x,c ) = log(1/c).
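The piecewise formula above can be verified by directly maximizing over the two values of the test function. The sketch below is our illustration (the parameter choices α, L, c and the grids are arbitrary): it compares a one-dimensional grid maximization against the closed form, with one value of x in each of the three regimes.

```python
import math

alpha, L, c = 2.0, 1.0, 0.25  # illustrative parameters; regime boundaries at
                              # alpha*L*x = 1 and alpha*L*x = 1/c

def ic_sup(x):
    """Maximize L*c*a + L*(1-c)*b + log(L|a|)/alpha + (log(alpha)+1)/alpha over
    a, b < 0 with |a - b| <= x, writing a = -s; the optimal -b is as small as
    the Lipschitz constraint allows, namely max(s - x, 0+)."""
    const = (math.log(alpha) + 1.0) / alpha
    best = -float("inf")
    for i in range(-1200, 501):        # s = exp(u), u in [-6, 2.5]
        s = math.exp(0.005 * i)
        t = max(s - x, 1e-12)          # t = -b, pushed to the constraint
        val = -L * c * s - L * (1.0 - c) * t + math.log(L * s) / alpha + const
        best = max(best, val)
    return best

def ic_closed_form(x):
    """Piecewise formula for R^{L Gamma,IC}_alpha(delta_0 || Q_{x,c})."""
    if alpha * L * x < 1.0:
        return (1.0 - c) * L * x
    if alpha * L * x <= 1.0 / c:
        return 1.0 / alpha - c * L * x + math.log(alpha * L * x) / alpha
    return math.log(1.0 / c) / alpha

for x in (0.3, 1.0, 4.0):  # one x in each regime for these parameters
    print(x, ic_sup(x), ic_closed_form(x))
```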

C.2 Γ-RÉNYI-DONSKER-VARADHAN COUNTEREXAMPLE

As an alternative to Definition 3.1, one can attempt to regularize the Rényi divergences by restricting the test-function space in the variational representation (3), leading to the Γ-Rényi-Donsker-Varadhan divergences R Γ,DV α (P ∥Q) := sup ϕ∈Γ { 1 α -1 log ∫ e (α-1)ϕ dP - 1 α log ∫ e αϕ dQ } . (126) The bound log ∫ e cϕ dP ≥ c ∫ ϕ dP , ϕ ∈ Γ, c ∈ R, implies that R Γ,DV α ≤ W Γ for α ∈ (0, 1), making (126) a useful regularization of the Rényi divergences in this case; this utility was demonstrated in Pantazis et al. (2022), where it was used to construct GANs. However, the representation (126) is known to be poorly behaved when α > 1. Here we provide a counterexample showing that, unlike for the IC-Γ-Rényi divergences, R Γ,DV α ̸ ≤ W Γ in general when α > 1. We conjecture that this fact is the reason for the poor behavior of R Γ,DV α when α > 1. In particular, let P x,c = cδ 0 + (1 -c)δ x and Q = δ 0 for x > 0, c ∈ (0, 1). R Γ L ,IC α (P ∥Q x ) ≤ Lx = W Γ L (P, Q x ) , lim x→0 + R Γ L ,IC α (P ∥Q x ) = 0 , showing that R Γ L ,IC α is able to capture the convergence of Q x to P as x → 0 + , while R Γ,log-DV α fails to do so.
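The counterexample of C.2 can be confirmed numerically. The sketch below is our illustration (the values of α, L, x, c are arbitrary choices with α > 1): it maximizes the Γ-Rényi-DV objective (126) over the two values φ(0), φ(x) subject to the Lipschitz constraint, and checks that the supremum matches the closed form (128) and strictly exceeds W_{Γ_L}(P_{x,c}, Q).

```python
import math

alpha, L, x, c = 2.0, 1.0, 1.0, 0.5   # illustrative parameters with alpha > 1
# P = c*delta_0 + (1-c)*delta_x, Q = delta_0; a test function phi in Lip_L is
# determined on the support by a = phi(0), b = phi(x) with |a - b| <= L*x.

def dv_objective(a, b):
    """Gamma-Renyi-DV objective (126) with phi(0) = a, phi(x) = b, Q = delta_0."""
    first = math.log(c * math.exp((alpha - 1.0) * a)
                     + (1.0 - c) * math.exp((alpha - 1.0) * b)) / (alpha - 1.0)
    second = a  # (1/alpha) log E_Q[e^{alpha phi}] = (1/alpha) log e^{alpha a} = a
    return first - second

grid = [0.01 * i for i in range(-300, 301)]
dv_sup = max(dv_objective(a, b) for a in grid for b in grid
             if abs(a - b) <= L * x + 1e-12)

w_gamma = (1.0 - c) * L * x   # W_{Gamma_L}(P_{x,c}, Q)
closed = math.log(c + (1.0 - c) * math.exp((alpha - 1.0) * L * x)) / (alpha - 1.0)
print(dv_sup, closed, w_gamma)  # dv_sup matches closed and exceeds w_gamma
```

The objective is constant along the boundary b = a + Lx, so the grid attains the closed-form value exactly, exhibiting R^{Γ_L,DV}_α > W_{Γ_L} for α > 1.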

D ADDITIONAL EXAMPLES D.1 TRAINING SYMMETRY-PRESERVING GANS ON ROTMNIST

When learning a distribution P that is invariant under a symmetry group (e.g., rotation invariance for images without preferred orientation) one can greatly increase performance by using a GAN that incorporates the symmetry information into the generator and the discriminator space Γ, Dey et al. (2021). A theory of such symmetry-preserving GANs was developed in Birrell et al. (2022b) and the new divergences introduced in this paper satisfy the assumptions required to apply that theory. In Table 2 we demonstrate this effectiveness on the RotMNIST dataset, obtained by randomly rotating the original MNIST digit dataset LeCun et al. (1998), resulting in a rotation-invariant distribution. Note that incorporating more symmetry information into the GAN (i.e., progressing down the rows of the table) results in greatly improved performance, especially in the low-data regime. Table 2: The median of the FIDs (lower is better), calculated every 1,000 generator updates for 20,000 iterations, over three independent trials. The number of training samples used for the experiments varies from 1% (600) to 10% (6,000) of the RotMNIST training set. The NN structure and hyperparameters are the same as those used in Section 5.4 of Birrell et al. (2022b). Eqv G (resp. Inv D) denotes that the symmetry information was incorporated into the generator (resp. discriminator) while CNN implies that a convolutional NN was used (without rotational symmetry). Σ denotes the rotation group used, where C n denotes rotations by integer multiples of 2π/n.

D.2 DV-RÉNYI VS. CC-RÉNYI ESTIMATORS IN LOWER DIMENSIONS

Here we compare the DV-Rényi and CC-Rényi estimators on the Gaussian test problem from Section 6.1, except in lower dimensions (1-D and 100-D). Qualitatively, the behavior is similar.
In particular, it is striking that the DV-Rényi estimator performs extremely poorly even in the 1-D case (see Figure 3) while the CC-Rényi estimator has much lower variance and MSE when the separation between the distributions becomes larger (i.e., as µ q increases). In both cases, DV-R α refers to Rényi divergence estimators built using (3) while CC-R α refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500, and the variance and MSE were computed using data from 50 independent runs. In both the 1-D and 100-D cases the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large; see Figures 3 and 4.



(behavioural sciences), Mironov (2017) (differential privacy), and Li & Turner (2016) (variational inference); in the latter the variational formula is an adaptation of the evidence lower bound. Rényi divergences have also been applied in the training of GANs, in Bhatia et al. (2021) (loss function for binary classification, discrete case) and in Pantazis et al. (2022) (continuous case, based on the Rényi-Donsker-Varadhan variational formula in Birrell et al. (2021)). Rényi divergences with α > 1 are also used in contrastive representation learning, Lee & Shin (2022), as well as in PAC-Bayesian bounds, Bégin et al. (2016). In the context of uncertainty quantification and sensitivity analysis, Rényi divergences provide confidence bounds for rare events, Atar et al. (2015); Dupuis et al. (2020), with higher rarity corresponding to larger α. Reducing the variance of divergence estimators through control of the function space has recently been proposed. In Song & Ermon (2019) an explicit bound on the output restricts the divergence values. A systematic theoretical framework on how to regularize through the function space has been developed in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a) for the KL and f-divergences. Despite not covering the Rényi divergence, the theory in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a), and particularly the infimal-convolution formulation, clearly inspired the current work. However, adapting the infimal-convolution method to the Rényi divergence setting requires two new technical innovations: (a) We develop a new low-variance convex-conjugate variational formula for the classical Rényi divergence in Theorem 2.1 (see also Fig. 1), allowing us to apply infimal-convolution tools to develop the new Γ-Rényi divergences in Theorem 3.4. (b) We study the α → ∞ limit of (a) to obtain a new low-variance variational representation of worst-case regret in Theorem 2.2 and study its Γ-regularization in Theorem 4.5.

Figure 1: Variance and MSE of estimators of the classical Rényi divergence between 1000-dimensional Gaussians. DV-R α refers to Rényi divergence estimators built using (3) while CC-R α refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Note that the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large. Strikingly, the 1-D case exhibits the same behavior (see Figure 3 in Appendix D.2), demonstrating that the DV-Rényi estimator is unsuitable even in low dimensions.

Figure 2: Comparison between IC-Γ-Rényi GAN, IC-Γ-WCR GAN, and WGAN-GP (both 1- and 2-sided) on the CIFAR-10 dataset. Here we plot the inception score as a function of the number of training epochs (moving average over the last 5 data points, with results averaged over 5 runs). We also show the averaged final FID score in the legend, computed using 50000 samples from both P and Q. For the IC GANs we enforce negativity of the discriminator by using a final layer equal to -abs. The GANs were trained using the Adam optimizer with an initial learning rate of 0.0002. The left pane shows that the IC-Γ-Rényi GANs outperform WGAN while the right pane shows that GANs based on the rescaled D Γ,IC

sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 ≥ g n dQ + log |g n |dP + 1 (35) = -nQ(A) -Q(A c ) + log(nP (A) + P (A c )) + 1 = log(nP (A) + P (A c )) → ∞ (36) as n → ∞. Therefore sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 = ∞ = D ∞ (P ∥Q) . (37) Now suppose P ≪ Q. Using the definition (32) along with Theorem A.2 and changing variables g = g/α we have D ∞ (P ∥Q)

ϵ > 0 define g c,ϵ = -c1 dP/dQ>r -ϵ. These satisfy g c,ϵ ∈ M b (Ω), g c,ϵ < 0 and sosup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 ≥ g c,ϵ dQ + log |g c,ϵ |dP + 1 (41) = -cQ(dP/dQ > r) -ϵ + log(cP (dP/dQ > r) + ϵ) + 1 ≥ -cP (dP/dQ > r)/r -ϵ + log(cP (dP/dQ > r) + ϵ) + 1 .Letting ϵ → 0 + we find sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 ≥ -cP (dP/dQ > r)/r + log(cP (dP/dQ > r)) + 1 (42) for all c > 0. We have P (dP/dQ > r) > 0, hence by maximizing over c > 0 and changing variables to z = cP (dP/dQ > r) we obtain sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 ≥ sup z>0 {-z/r + log(z) + 1} = log(r) .

Now suppose Ω is a metric space. We clearly have D ∞ (P ∥Q) = sup g∈M b (Ω):g<0 gdQ + log |g|dP + 1 (45) ≥ sup g∈C b (Ω):g<0 gdQ + log |g|dP + 1 .

1)/α dP + 1 = gdQ + d dy | y=0 log e y log |g| dP + 1 = gdQ + log |g|dP + 1 for all g ∈ Γ, g < 0. Therefore, maximizing over g gives lim α→1 -R Γ,IC α (P ∥Q) ≥ sup g∈Γ:g<0 gdQ + log |g|dP + 1 .

∥η) + W Γ (Q, η)} =D ∞ (P ∥η) + W Γ (Q, η) .Maximizing over η gives the upper boundlim α→∞ αR Γ/α,IC α (P ∥Q) ≤ inf η∈P(X) {D ∞ (P ∥η) + W Γ (Q, η)} . (102)To prove the reverse inequality, use the variational formula (53) to write αR Γ|g/α| (α-1)/α dP + log α + 1 (103) |g| (α-1)/α dP + 1 .Therefore, for all g ∈ Γ, g < 0 we can use the dominated convergence theorem to compute gdQ + log |g|dP + 1 .Maximizing over g then gives lim α→∞ αR Γ/α,IC α (P ∥Q) ≥ sup g∈Γ:g<0 gdQ + log |g|dP + 1 .

1) and Γ L = Lip L . Then for α > 1 we have R Γ L ,DV α (P x,c ∥Q) = sup a,b∈R:|a-b|≤Lx { 1 α -1 log(c exp((α -1)a) + (1 -c) exp((α -1)b)) -a } = sup a∈R { 1 α -1 log(c exp((α -1)a) + (1 -c) exp((α -1)(Lx + a))) -a } = 1 α -1 log (c + (1 -c) exp((α -1)Lx)) , W Γ L (P x,c , Q) = sup |a-b|≤Lx {ca + (1 -c)b -a} = (1 -c)Lx . (129) Note that the condition α > 1 was crucial in computing the supremum over b in (128). Using strict concavity of the logarithm one can then obtain the bound R Γ L ,DV α (P x,c ∥Q) > W Γ L (P x,c , Q) . (130)

C.3 log-Γ-RÉNYI-DONSKER-VARADHAN COUNTEREXAMPLE

A second alternative to Definition 3.1 is to again start with (3) and then reduce the test-function space to 1 α log(Γ). As we show below, this definition fails to provide a regularized divergence; in particular, it is incapable of meaningfully comparing Dirac distributions. Let P = δ 0 , Q x = δ x , x > 0, Γ L = Lip L . Then straightforward computations using the variational definition give R Γ L ,DV -log α (P ∥Q x ) = α -1 sup {Lb + α -1 log L + α -1 log(|a|)} + α -1 (log(α) -x|)} + α -1 log L + α -1 (log α + 1) = { α -1 log(αLx) + α -1 , x ≥ 1/(αL) ; Lx , x < 1/(αL) } .

[Figure: sample panels for the architectures CNN D (Σ = C_4); CNN G + Inv D (Σ = C_4); Eqv G + Inv D (Σ = C_4); Eqv G + Inv D (Σ = C_8).]

Figure 3: Variance and MSE of estimators of the classical Rényi divergence between 1-dimensional Gaussians. DV-R_α refers to Rényi divergence estimators built using (3), while CC-R_α refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Note that the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large.
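To illustrate the kind of objective behind such estimators, the sketch below estimates a classical Rényi divergence between two 1-dimensional Gaussians by maximizing an objective of the form ∫g dQ + (α/(α-1)) log ∫|g|^{(α-1)/α} dP + 1 over a one-parameter family of negative test functions with a known density ratio. This is a simplified stand-in, not the paper's neural estimator: the family g_t, the closed-form density ratio, and the grid search are illustrative assumptions.

```python
import numpy as np

# Hypothetical illustration (not the paper's exact estimator): Monte Carlo
# estimate of the Renyi divergence between P = N(0,1) and Q = N(1,1) by
# maximizing  E_Q[g] + (alpha/(alpha-1)) * log E_P[|g|^((alpha-1)/alpha)] + 1
# over the family g_t = -t * (dP/dQ)^alpha, t > 0. For equal variances,
# R_alpha = alpha * (mu_P - mu_Q)^2 / 2, i.e. R_2 = 1 here.
rng = np.random.default_rng(0)
alpha, n = 2.0, 400_000
xP = rng.normal(0.0, 1.0, n)           # samples from P
xQ = rng.normal(1.0, 1.0, n)           # samples from Q

def ratio(x):                          # dP/dQ for N(0,1) vs N(1,1)
    return np.exp(0.5 - x)

beta = (alpha - 1.0) / alpha
mQ = np.mean(ratio(xQ) ** alpha)       # E_Q[ratio^alpha]; |g_t| = t * ratio^alpha
mP = np.mean(ratio(xP) ** (alpha - 1.0))  # E_P[ratio^(alpha-1)]

t_grid = np.linspace(0.01, 2.0, 2000)
obj = -t_grid * mQ + (alpha / (alpha - 1.0)) * np.log(t_grid ** beta * mP) + 1.0
estimate = obj.max()                   # should be close to R_2 = 1
assert abs(estimate - 1.0) < 0.2
```

With a neural parametrization of g one would instead run stochastic gradient ascent on the same objective over minibatches, as in the experiments above.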

Figure 4: Variance and MSE of estimators of the classical Rényi divergence between 100-dimensional Gaussians. DV-R_α refers to Rényi divergence estimators built using (3), while CC-R_α refers to estimators built using our new variational representation (4). We used a NN with one fully connected layer of 64 nodes, ReLU activations, and a poly-softplus final layer (for CC-Rényi). We trained for 10000 epochs with a minibatch size of 500. The variance and MSE were computed using data from 50 independent runs. Again, the CC-Rényi estimator has significantly reduced variance and MSE compared to the DV-Rényi estimator, even when α is large.

AUC values (higher is better) for several divergences and various levels of rarity. The AUC values have been averaged over 50 independent runs. The neural discriminator has 2 hidden layers with 32 units each and ReLU activation. The D^{Γ,IC}_α divergences used the poly-softplus final layer.

ACKNOWLEDGMENTS

The research of J.B., M.K. and L.R.-B. was partially supported by the Air Force Office of Scientific Research (AFOSR) under the grant FA9550-21-1-0354. The research of M.K. and L.R.-B. was partially supported by the National Science Foundation (NSF) under the grants DMS-2008970 and TRIPODS CISE-1934846. The research of P.D. was partially supported by the NSF under the grant DMS-1904992 and by the AFOSR under the grant FA-9550-21-1-0354. The work of Y.P. was partially supported by the Hellenic Foundation for Research and Innovation (HFRI) through the "Second Call for HFRI Research Projects to support Faculty Members and Researchers" under Project 4753. This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.

VARIATIONAL FORMULAS FOR CLASSICAL RÉNYI DIVERGENCES

In this appendix we provide several variational formulas for the classical Rényi divergences, some of which are new. In the following we let (Ω, M) denote a measurable space, M(Ω) the space of measurable real-valued functions on Ω, M_b(Ω) the subspace of bounded functions, and P(Ω) the space of probability measures on Ω. First we recall the Rényi-Donsker-Varadhan variational formula derived in Birrell et al. (2021). This is a generalization of the Donsker-Varadhan variational representation of the KL divergence.

