FUNCTION-SPACE REGULARIZED RÉNYI DIVERGENCES

Abstract

We propose a new family of regularized Rényi divergences parametrized not only by the order α but also by a variational function space. These new objects are defined by taking the infimal convolution of the standard Rényi divergence with the integral probability metric (IPM) associated with the chosen function space. We derive a novel dual variational representation that can be used to construct numerically tractable divergence estimators. This representation avoids risk-sensitive terms and therefore exhibits lower variance, making it well-behaved when α > 1; this addresses a notable weakness of prior approaches. We prove several properties of these new divergences, showing that they interpolate between the classical Rényi divergences and IPMs. We also study the α → ∞ limit, which leads to a regularized worst-case regret and a new variational representation in the classical case. Moreover, we show that the proposed regularized Rényi divergences inherit features from IPMs such as the ability to compare distributions that are not absolutely continuous, e.g., empirical measures and distributions with low-dimensional support. We present numerical results on both synthetic and real datasets, showing the utility of these new divergences in both estimation and GAN training applications; in particular, we demonstrate significantly reduced variance and improved training performance.

1. INTRODUCTION

Rényi divergence, Rényi (1961), is a significant extension of the Kullback-Leibler (KL) divergence for numerous applications; see, e.g., Van Erven & Harremos (2014). Recent neural-based estimators for divergences, Belghazi et al. (2018), along with generative adversarial networks (GANs), Goodfellow et al. (2014), have accelerated the use of divergences in deep learning. Neural-based divergence estimators are made feasible by variational representation formulas; these formulas are essentially lower bounds (and, occasionally, upper bounds) which are approximated by tractable statistical averages. The estimation of a divergence based on variational formulas is a notoriously difficult problem. Challenges include potentially high bias that may require an exponential number of samples, McAllester & Stratos (2020), and exponential statistical variance for certain variational estimators, Song & Ermon (2019), rendering divergence estimation both data-inefficient and computationally expensive. This is especially prominent for Rényi divergences of order larger than 1. Indeed, numerical simulations have shown that, unless the distributions P and Q are very close to one another, the Rényi divergence R_α(P∥Q) is almost intractable to estimate when α > 1 due to the high variance of the statistically approximated risk-sensitive observables, Birrell et al. (2021); see also the recent analysis in Lee & Shin (2022). A similar issue has been observed for the KL divergence, Song & Ermon (2019). Overall, the lack of low-variance estimators for Rényi divergences has prevented widespread and accessible experimentation with this class of information-theoretic tools, except in very special cases. We hope our results here will provide a suitable set of tools to address this gap in the methodology.
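To make this variance issue concrete, the following minimal sketch (a toy illustration with one-dimensional Gaussians; the means, sample sizes, and orders are arbitrary choices, and it is not one of the experiments reported later) estimates R_α(P∥Q) through the classical density-ratio form R_α(P∥Q) = (α − 1)^{-1} log E_Q[(dP/dQ)^α] and reports how the spread of the estimates grows with α.

```python
# Toy illustration (not from the paper): naive Monte Carlo estimation of the
# classical Renyi divergence R_alpha(P||Q) = (alpha-1)^{-1} log E_Q[(dP/dQ)^alpha]
# for two equal-variance 1-D Gaussians.  The risk-sensitive term (dP/dQ)^alpha
# makes the estimator's variance grow rapidly with alpha.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
mu_p, mu_q, sigma = 1.0, 0.0, 1.0          # P = N(1, 1), Q = N(0, 1)

def renyi_mc(alpha, n_samples):
    """Naive Monte Carlo estimate of R_alpha(P||Q) from samples of Q."""
    x = rng.normal(mu_q, sigma, size=n_samples)
    # log dP/dQ for equal-variance Gaussians
    log_ratio = ((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2.0 * sigma ** 2)
    return (logsumexp(alpha * log_ratio) - np.log(n_samples)) / (alpha - 1.0)

def renyi_exact(alpha):
    """Closed form for equal-variance Gaussians: alpha (mu_p - mu_q)^2 / (2 sigma^2)."""
    return alpha * (mu_p - mu_q) ** 2 / (2.0 * sigma ** 2)

for alpha in [0.5, 2.0, 5.0, 10.0]:
    est = np.array([renyi_mc(alpha, 10_000) for _ in range(50)])
    print(f"alpha={alpha:5.1f}  exact={renyi_exact(alpha):6.3f}  "
          f"MC mean={est.mean():6.3f}  MC std={est.std():6.3f}")
```

The risk-sensitive term (dP/dQ)^α is what drives the growth of the reported spread as α increases, and it is precisely the type of term that the variational representations developed below are designed to avoid.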
One approach to variance reduction is the development of new variational formulas. This direction is especially fruitful for the estimation of mutual information, van den Oord et al. (2018); Cheng et al. (2020). Another approach is to regularize the divergence by restricting the function space of the variational formula. Indeed, instead of directly attacking the variance issue, the function space of the variational formula can be restricted, for instance, by bounding the test functions or, more appropriately, by bounding the derivatives of the test functions. The latter regularization leads to Lipschitz-continuous function spaces, which are also foundational to integral probability metrics (IPMs) and, more specifically, to the duality property of the Wasserstein metric. In this paper we combine the above two approaches, first deriving a new variational representation of the classical Rényi divergences and then regularizing via an infimal convolution as follows:

R^{Γ,IC}_α(P∥Q) := inf_η { R_α(P∥η) + W^Γ(Q, η) },    (1)

where P and Q are the probability distributions being compared, the infimum is over the space of probability measures, R_α is the classical Rényi divergence, and W^Γ is the IPM corresponding to the chosen regularizing function space Γ. The new family of regularized Rényi divergences developed here addresses the risk-sensitivity issue inherent in prior approaches. More specifically, our contributions are as follows.

• We define a new family of function-space regularized Rényi divergences via the infimal convolution operator between the classical Rényi divergence and an arbitrary IPM (1). The new regularized Rényi divergences inherit their function space from the IPM; for instance, they inherit mass-transport properties when one regularizes using the 1-Wasserstein metric.
• We derive a dual variational representation (11) of the regularized Rényi divergences which avoids risk-sensitive terms and can therefore be used to construct lower-variance statistical estimators.
• We prove a series of properties of the new object:
(a) the divergence property; (b) being bounded by the minimum of the Rényi divergence and the IPM, thus allowing for the comparison of non-absolutely continuous distributions; (c) limits as α → 1 from both the left and the right; and (d) regimes in which the limiting cases R_α(P∥Q) and W^Γ(Q, P) are recovered.
• We propose a rescaled version of the regularized Rényi divergences (16), which leads to a new variational formula for the worst-case regret (i.e., α → ∞). This new variational formula does not involve the essential supremum of the density ratio that appears in the classical definition of worst-case regret, thereby avoiding risk-sensitive terms.
• We present a series of illustrative examples and counterexamples that further motivate the proposed definition of the function-space regularized Rényi divergences.
• We present numerical experiments showing (a) that we can estimate the new divergence for large values of the order α without variance issues and (b) that we can train GANs using regularized function spaces.

Related work. The order of the Rényi divergence controls the weight put on the tails, with the limiting cases being mode-covering and mode-selecting, Minka (2005). Rényi divergence estimation is used in a number of applications, including Sajid et al. (2022) (behavioural sciences), Mironov (2017) (differential privacy), and Li & Turner (2016) (variational inference); in the latter the variational formula is an adaptation of the evidence lower bound. Rényi divergences have also been applied in the training of GANs in Bhatia et al. (2021) (loss function for binary classification, discrete case) and in Pantazis et al. (2022) (continuous case, based on the Rényi-Donsker-Varadhan variational formula of Birrell et al. (2021)). Rényi divergences with α > 1 are also used in contrastive representation learning, Lee & Shin (2022), as well as in PAC-Bayesian bounds, Bégin et al. (2016). In the context of uncertainty quantification and sensitivity analysis, Rényi divergences provide confidence bounds for rare events, Atar et al. (2015); Dupuis et al. (2020), with higher rarity corresponding to larger α. Reducing the variance of divergence estimators through control of the function space has recently been proposed: in Song & Ermon (2019), an explicit bound on the test-function outputs restricts the divergence values. A systematic theoretical framework for regularizing through the function space has been developed in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a) for the KL and f-divergences. Although it does not cover the Rényi divergence, the theory in Dupuis, Paul & Mao, Yixiang (2022); Birrell et al. (2022a), and particularly its infimal-convolution formulation, clearly inspired the current work. However, adapting the infimal-convolution method to the Rényi divergence setting requires two new technical innovations: (a) we develop a new low-variance convex-conjugate variational formula for the classical Rényi divergence in Theorem 2.1 (see also Fig. 1), allowing us to apply infimal-convolution tools to develop the new Γ-Rényi divergences in Theorem 3.4; (b) we study the α → ∞ limit of (a) to obtain a new low-variance variational representation of the worst-case regret in Theorem 2.2 and study its Γ-regularization in Theorem 4.5.
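As a purely illustrative instance of the infimal-convolution definition (1), the following sketch computes the regularized divergence for distributions supported on a finite grid of points on the real line, with Γ taken to be the 1-Lipschitz functions so that W^Γ is the 1-Wasserstein distance. The grid, the discretized Gaussian inputs, and the use of a generic optimizer over η are illustrative choices; this is not the neural estimator based on the dual representation (11).

```python
# Illustrative sketch (not the paper's estimator): the infimal-convolution definition
#   R^{Gamma,IC}_alpha(P||Q) = inf_eta { R_alpha(P||eta) + W^Gamma(Q, eta) }
# for distributions on a finite grid of real numbers, with Gamma = 1-Lipschitz
# functions, so that W^Gamma is the 1-Wasserstein distance.
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(-3.0, 3.0, 41)            # common support for P, Q, and eta

def renyi(p, q, alpha):
    """Classical Renyi divergence of order alpha between discrete distributions."""
    mask = p > 0
    return np.log(np.sum(p[mask] ** alpha * q[mask] ** (1.0 - alpha))) / (alpha - 1.0)

def wasserstein1(p, q, x):
    """1-Wasserstein distance on the real line via the CDF formula."""
    return np.sum(np.abs(np.cumsum(p - q))[:-1] * np.diff(x))

def renyi_ic(p, q, alpha, x):
    """Minimize R_alpha(p||eta) + W1(q, eta) over eta, parametrized by a softmax."""
    def objective(theta):
        eta = np.exp(theta - theta.max())
        eta /= eta.sum()
        return renyi(p, eta, alpha) + wasserstein1(q, eta, x)
    theta0 = np.log(q)   # start at eta = Q, where the objective equals R_alpha(p||q)
    return minimize(objective, theta0, method="L-BFGS-B").fun

def discretized_gaussian(mu, sigma, x):
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

P = discretized_gaussian(0.5, 1.0, grid)
Q = discretized_gaussian(-0.5, 1.0, grid)
alpha = 2.0
print("R_alpha(P||Q)            =", renyi(P, Q, alpha))
print("W1(Q, P)                 =", wasserstein1(Q, P, grid))
# The optimizer returns an upper bound on the infimum in (1).
print("R_alpha^{Gamma,IC}(P||Q) ~", renyi_ic(P, Q, alpha, grid))
```

In exact arithmetic the infimum in (1) is at most min{R_α(P∥Q), W^Γ(Q, P)} (take η = Q or η = P, respectively), consistent with the interpolation between Rényi divergences and IPMs discussed above; the value returned by the generic optimizer here is only an upper bound on that infimum.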

