TRAINING NORMALIZING FLOWS FROM DEPENDENT DATA

Abstract

Normalizing flows are powerful non-parametric statistical models that function as a hybrid between density estimators and generative models. Current learning algorithms for normalizing flows assume that data points are sampled independently, an assumption that is frequently violated in practice, which may lead to erroneous density estimation and data generation. We propose a likelihood objective of normalizing flows incorporating dependencies between the data points, for which we derive a flexible and efficient learning algorithm suitable for different dependency structures. We show that respecting dependencies between observations can improve empirical results on both synthetic and real-world data.

1. INTRODUCTION

Density estimation and generative modeling of complex distributions are fundamental problems in statistics and machine learning and are significant in various application domains. Remarkably, normalizing flows (Rezende & Mohamed, 2015; Papamakarios et al., 2021) can solve both of these tasks at the same time. Furthermore, their neural architecture allows them to capture even very high-dimensional and complex structured data (such as images and time series). In contrast to other deep generative models such as variational autoencoders (VAEs), which only optimize a lower bound on the likelihood objective, normalizing flows optimize the likelihood directly.

Previous work on both generative models and density estimation with deep learning assumes that data points are sampled independently from the underlying distribution. However, this modeling assumption is oftentimes heavily violated in practice. Figure 1 illustrates why this can be problematic: a standard normalizing flow trained on dependent data will misinterpret the sampling distortions in the training data as true signal (Figure 1c). Our proposed method, on the other hand, can correct for the data dependencies and reconstruct the original density more faithfully (Figure 1d).

The problem of correlated data is very common and occurs in many applications. Consider the ubiquitous task of image modeling. The Labeled Faces in the Wild (LFW; Huang et al., 2008) data set consists of facial images of celebrities, but some individuals in the data set are grossly overrepresented. For example, George W. Bush is depicted in 530 images, while around 70% of the individuals in the data set appear only once. A generative model trained naively on these data will put considerably more probability mass on images similar to George W. Bush, compared to the less represented individuals.
Arguably, most downstream tasks, such as image generation and outlier detection, would benefit from a model that is less biased towards these overrepresented individuals.

In the biomedical domain, large cohort studies involve participants that oftentimes are directly related (such as parents and children) or indirectly related (by sharing genetic material due to a shared ancestry), a phenomenon called population stratification (Cardon & Palmer, 2003). These dependencies between individuals play a major role in the traditional analyses of these data and require sophisticated statistical treatment (Lippert et al., 2011), but current deep-learning-based nonparametric models lack the required methodology to do so. This can have a considerable negative impact on downstream tasks, as we will show in our experiments.

In finance, accurate density estimation and modeling of assets (e.g., stock market data) is essential for risk management and modern trading strategies. Data points are often heavily correlated with one another, due to time, sector, or other relations. Traditionally, financial analysts often use copulas for the modeling of non-parametric data, which themselves can be interpreted as a simplified version of normalizing flows.

In certain settings from classical parametric statistics, direct modeling of the dependencies in maximum likelihood models is analytically feasible. For linear and generalized linear models, dependencies are usually addressed either with random effects in linear mixed models (Jiang & Nguyen, 2007) or sometimes only by the inclusion of fixed-effects covariates (Price et al., 2006). Recent work in deep learning introduced concepts from random-effects linear models into deep learning for prediction tasks such as regression and classification (Simchoni & Rosset, 2021; Xiong et al., 2019; Tran et al., 2020). In federated learning of generative models, researchers usually deal with violations of the i.i.d. assumption with ad hoc methods and without consideration of the statistical implications (Augenstein et al., 2020; Rasouli et al., 2020). These methods also only consider block-type, repeat-measurement dependencies for multi-source integration. To the best of our knowledge, both deep generative models and deep density estimation so far lack the tools to address violations of the independence assumption in the general setting and in a well-founded statistical framework.

In this work we show that the likelihood objective of normalizing flows naturally allows for the explicit incorporation of data dependencies. We investigate several modes of modeling the dependency between data points, appropriate in different settings. We also propose efficient optimization procedures for this objective. We then apply our proposed method to three high-impact real-world settings. First, we model a set of complex biomedical phenotypes and show that adjusting for the genetic relatedness of individuals leads to a considerable increase in statistical testing power in a downstream genome-wide association analysis. Next, we consider two image data sets, one with facial images, the other from the biomedical domain, leading to less biased generative image models. In the last application, we use normalizing flows to better model the return correlations between financial assets. In all experiments, we find that adjustment for dependencies can significantly improve the model fit of normalizing flows.
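To give a concrete flavor of how inter-sample dependence can enter a flow likelihood, the following is a minimal, hypothetical sketch (illustrative only; it is not the objective derived in this paper, and the affine flow, function names, and toy covariance are our own assumptions). For a one-dimensional affine flow, the stacked latent codes of all n observations are modeled jointly as N(0, K), where K is an inter-sample covariance matrix (e.g., a kinship matrix); K = I recovers the usual i.i.d. objective:

```python
import numpy as np

def dependent_flow_loglik(x, K, a=1.0, b=0.0):
    """Joint log-likelihood of 1-D observations under a toy affine flow
    x = a*u + b, where the stacked latents u = (u_1, ..., u_n) follow
    N(0, K) with inter-sample covariance K. Setting K = I gives the
    standard i.i.d. objective. Illustrative sketch, not the paper's
    general objective."""
    u = (x - b) / a                       # inverse flow t^{-1}(x)
    n = len(x)
    _, logdet_K = np.linalg.slogdet(K)
    quad = u @ np.linalg.solve(K, u)      # u^T K^{-1} u
    log_pu = -0.5 * (n * np.log(2 * np.pi) + logdet_K + quad)
    log_jac = -n * np.log(abs(a))         # |d t^{-1}/dx| = 1/a per sample
    return log_pu + log_jac

# two highly correlated observations: accounting for their dependence
# changes the likelihood relative to the naive i.i.d. assumption
x = np.array([1.2, 1.25])
K_dep = np.array([[1.0, 0.9], [0.9, 1.0]])
ll_iid = dependent_flow_loglik(x, np.eye(2))
ll_dep = dependent_flow_loglik(x, K_dep)
```

In this toy example the two near-identical observations are far more plausible under the dependent model than under independence, which is exactly the kind of sampling distortion a naive i.i.d. fit would instead absorb into the estimated density.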

2. METHODS

In this section we describe our methodology for training normalizing flows from dependent data. First, we will derive a general formulation of the likelihood under weak assumptions on the dependencies among observations. Afterwards, we will investigate two common settings in more detail.

2.1. BACKGROUND: LIKELIHOOD WITH INDEPENDENT OBSERVATIONS

A normalizing flow is an invertible function t : R^p → R^p that maps a p-dimensional noise variable u to a p-dimensional synthetic data variable x. The noise variable u usually follows a simple distribution (such as N_p(0, I_p) or the uniform distribution on [0, 1]^p), for which the density is explicitly known and efficiently computable. By the change of variables formula, the log-density can be explicitly computed as

log(p_x(x)) = log(p_u(u)) - log(|det J_t(u)|),

where J_t(u) denotes the Jacobian of t evaluated at u.
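As a numerical sanity check of this formula (the affine map and variable names below are illustrative, not from the paper), consider the linear flow t(u) = Au + b with a standard normal base distribution. Its change-of-variables density must coincide with the closed-form N(b, AA^T) density:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
A = np.tril(rng.normal(size=(p, p))) + 5 * np.eye(p)  # invertible Jacobian
b = rng.normal(size=p)

def log_px(x):
    """Change of variables for the affine flow t(u) = A u + b:
    log p_x(x) = log p_u(u) - log|det J_t(u)|, with u = A^{-1}(x - b)
    and J_t(u) = A (constant for an affine map)."""
    u = np.linalg.solve(A, x - b)
    log_pu = -0.5 * (p * np.log(2 * np.pi) + u @ u)   # standard normal base
    _, log_abs_det = np.linalg.slogdet(A)
    return log_pu - log_abs_det

# reference: x = A u + b with u ~ N(0, I_p) is exactly N(b, A A^T)
x = rng.normal(size=p)
S = A @ A.T
diff = x - b
ref = -0.5 * (p * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
              + diff @ np.linalg.solve(S, diff))
```

Here `log_px(x)` and `ref` agree up to floating-point error, confirming that subtracting the log-absolute-determinant of the Jacobian correctly converts the base density into the data density.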

Figure 1: Example setting with synthetic data sampled with inter-instance dependencies. Training a standard normalizing flow on this data biases the model. Adjusting for the dependencies during training recovers the true underlying distribution.

