SEMI-AUTOREGRESSIVE ENERGY FLOWS: TOWARDS DETERMINANT-FREE NORMALIZING FLOWS

Abstract

Normalizing flows are a popular approach for constructing probabilistic and generative models. However, maximum likelihood training of flows is challenging due to the need to calculate computationally expensive determinants of Jacobians. This paper takes steps towards addressing this challenge by introducing objectives and model architectures for determinant-free training of flows. Central to our framework is the energy objective, a multidimensional extension of proper scoring rules that admits efficient estimators based on random projections. The energy objective does not require calculating determinants and therefore supports general flow architectures that are not well-suited to maximum likelihood training. In particular, we introduce semi-autoregressive flows, an architecture that can be trained with the energy loss, and that interpolates between fully autoregressive and non-autoregressive models, capturing the benefits of both. We empirically demonstrate that energy flows achieve competitive generative modeling performance while maintaining fast generation and posterior inference.

1. INTRODUCTION

Normalizing flows are one of the major families of probabilistic and generative models (Rezende and Mohamed, 2015; Kingma et al., 2016; Papamakarios et al., 2019). They feature tractable inference and maximum likelihood learning, and have applications in areas such as image generation (Kingma and Dhariwal, 2018; Dinh et al., 2014), anomaly detection (Nalisnick et al., 2019), and density estimation (Papamakarios et al., 2017; Dinh et al., 2017). However, flows require calculating computationally expensive determinants of Jacobians in order to evaluate their densities; this either limits the range of architectures compatible with flows, or makes flow models with highly expressive neural architectures slow to train. This paper questions the use of maximum likelihood for training flows and instead explores an approach to determinant-free training inspired by two-sample testing and the theory of proper scoring rules. See Appendix G for more detailed motivation. Si et al. (2022) recently showed that normalizing flows can be trained using objectives derived from proper scoring rules (Gneiting and Raftery, 2007a) that involve only samples from the model and the data distribution; such objectives do not require computing densities and depart from the log-likelihood-based training of autoregressive models as in Papamakarios et al. (2017). Although quantile flows (Si et al., 2022) are determinant-free, they are necessarily autoregressive, because the CDF-based objective is only defined in one dimension, and they therefore inherit limitations such as slow sampling. Here, we extend the sample-based proper scoring rule framework of Si et al. (2022) to models that are not fully autoregressive. Central to our approach is the energy objective, a multidimensional extension of proper scoring rules that requires only model samples, not densities.
We complement this objective with efficient estimators based on random projections and compare against alternative sample-based objectives that serve as strong baselines. We examine the theoretical properties of our approach, draw connections to divergence minimization, and highlight benefits over maximum likelihood training. Our framework enables training model architectures that are more general than the ones compatible with maximum likelihood learning (e.g., densely connected networks). In particular, we propose semi-autoregressive flows, an architecture trained with the energy loss that integrates the speed of feed-forward architectures with the sample quality of autoregressive models.

Contributions. In summary, this work (1) questions the use of maximum likelihood for training flows and proposes an alternative approach based on proper scoring rules and two-sample tests that extends quantile flows (Si et al., 2022) to multiple dimensions. We (2) introduce specific two-sample objectives, such as the energy loss, and derive efficient slice-based estimators. We also (3) provide a theoretical analysis of the proposed objectives, showing that they are consistent and feature unbiased gradients. Finally, we (4) introduce a semi-autoregressive architecture that features high sample quality and speed on generation and posterior inference tasks.

2. BACKGROUND

Normalizing Flow Models. Generative modeling involves specifying a probabilistic model p(y) ∈ ∆(R^d) over a high-dimensional y ∈ R^d (Kingma and Welling, 2014; Goodfellow et al., 2014). A normalizing flow is a generative model p(y) defined via an invertible mapping f : R^d → R^d between a noise variable z ∈ R^d sampled from a prior z ∼ p(z) and the target variable y (Rezende and Mohamed, 2015; Papamakarios et al., 2019). We may obtain an analytical expression for the likelihood p(y) via the change of variables formula p(y) = |det(∂f(z)/∂z)|^{-1} p(z), where |det(∂f(z)/∂z)| denotes the absolute value of the determinant of the Jacobian of f. Computing this quantity is often expensive; hence we typically choose f from a class of models for which the determinant is tractable (Rezende and Mohamed, 2015), such as autoregressive models (Papamakarios et al., 2017).

Proper Scoring Rules. Consider a score or loss ℓ : ∆(R^d) × R^d → R+ over a probabilistic forecast F ∈ ∆(R^d) and a sample y ∈ R^d. The loss ℓ is proper if the true distribution G ∈ arg min_F E_{y∼G} ℓ(F, y) (Gneiting and Raftery, 2007a). A popular proper loss is the continuous ranked probability score (CRPS), defined for two cumulative distribution functions (CDFs) F and G as CRPS(F, G) = ∫ (F(y) − G(y))^2 dy. When we only have samples from G, we can generalize this score to obtain the following loss for a single sample y′: CRPS_s(F, y′) = ∫ (F(y) − I(y − y′))^2 dy, where I denotes the Heaviside step function. The above CRPS can also be written as an expectation relative to the distribution F: CRPS(F, y′) = −(1/2) E_F |Y − Y′| + E_F |Y − y′|, where Y, Y′ are independent copies of a random variable distributed according to F. Recently, Si et al. (2022) proposed autoregressive quantile flows, which are trained using the CRPS and are determinant-free. We seek to extend the approach of Si et al. (2022) beyond autoregressive flows.
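The expectation form of the CRPS admits a simple Monte Carlo estimator from model samples alone. The sketch below is an illustration under our own naming, not the paper's implementation: it estimates CRPS(F, y′) = E_F|Y − y′| − (1/2) E_F|Y − Y′| from i.i.d. draws from F, using all distinct pairs for the second term.

```python
import numpy as np

def crps_sample(model_samples, y_obs):
    """Monte Carlo estimate of CRPS(F, y') = E_F|Y - y'| - 0.5 * E_F|Y - Y'|.

    model_samples: 1-D array of i.i.d. draws from the forecast F.
    y_obs: scalar observation y'.
    """
    s = np.asarray(model_samples, dtype=float)
    n = len(s)
    # First term: mean absolute deviation of samples from the observation.
    term1 = np.mean(np.abs(s - y_obs))
    # Second term: average pairwise distance over all distinct sample pairs,
    # an unbiased estimate of E_F |Y - Y'|.
    diffs = np.abs(s[:, None] - s[None, :])
    term2 = diffs.sum() / (n * (n - 1))
    return term1 - 0.5 * term2
```

Because only absolute differences of samples appear, the estimator is differentiable (almost everywhere) with respect to the parameters generating the samples, which is what makes this loss usable for training without densities.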
Two-Sample Tests and Integral Probability Metrics. Two-sample tests compare distributions F, G based on their respective sets of samples D_F = {y^(i)}_{i=1}^m and D_G = {x^(i)}_{i=1}^n. Specifically, a two-sample test seeks to determine whether F = G using only these samples. Integral probability metrics (IPMs) provide a natural family of test statistics of the form d_H(F, G) = sup_{h∈H} |E_{y∼F} h(y) − E_{x∼G} h(x)|, where H is a class of witness functions; the energy distance and the maximum mean discrepancy are special cases.
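As an illustration of a sample-based two-sample statistic, the sketch below estimates a sliced energy distance between two d-dimensional sample sets by averaging one-dimensional energy distances over random unit projections. This is a generic random-projection construction in the spirit of the estimators discussed here, not the paper's exact objective; the function names are our own.

```python
import numpy as np

def energy_distance_1d(xs, ys):
    """Plug-in estimate of the 1-D energy distance
    D(F, G) = 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    cross = np.mean(np.abs(xs[:, None] - ys[None, :]))
    within_x = np.mean(np.abs(xs[:, None] - xs[None, :]))
    within_y = np.mean(np.abs(ys[:, None] - ys[None, :]))
    return 2.0 * cross - within_x - within_y

def sliced_energy_distance(X, Y, n_proj=100, rng=None):
    """Average 1-D energy distances of X, Y (arrays of shape [n, d])
    projected onto n_proj random unit directions."""
    rng = np.random.default_rng(rng)
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)          # random direction on the unit sphere
        total += energy_distance_1d(X @ w, Y @ w)
    return total / n_proj
```

The statistic is zero (up to sampling noise) when the two sample sets come from the same distribution and positive otherwise, and it involves only pairwise distances of samples, so no densities or Jacobian determinants are needed.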



Energy Flows and Semi-Autoregressive Energy Flows (SAEFs) are invertible generative models that feature expressive architectures, exact likelihood and posterior evaluation, and training that does not require computing log-determinants, in contrast to VAEs (Kingma and Welling, 2014), MAFs (Papamakarios et al., 2017), NAFs (Huang et al., 2018), AQFs (Si et al., 2022), GMMNets (Li et al., 2015), and CramerGANs (Bellemare et al., 2017).

