TRAINING NORMALIZING FLOWS FROM DEPENDENT DATA

Abstract

Normalizing flows are powerful non-parametric statistical models that function as a hybrid between density estimators and generative models. Current learning algorithms for normalizing flows assume that data points are sampled independently, an assumption that is frequently violated in practice and that may lead to erroneous density estimation and data generation. We propose a likelihood objective for normalizing flows that incorporates dependencies between the data points, and we derive a flexible and efficient learning algorithm for it that is suitable for different dependency structures. We show that respecting dependencies between observations can improve empirical results on both synthetic and real-world data.

1. INTRODUCTION

Density estimation and generative modeling of complex distributions are fundamental problems in statistics and machine learning and are significant in various application domains. Remarkably, normalizing flows (Rezende & Mohamed, 2015; Papamakarios et al., 2021) can solve both of these tasks at the same time. Furthermore, their neural architecture allows them to capture even very high-dimensional and complex structured data (such as images and time series). In contrast to other deep generative models such as variational autoencoders (VAEs), which only optimize a lower bound on the likelihood objective, normalizing flows optimize the likelihood directly.

Previous work on both generative models and density estimation with deep learning assumes that data points are sampled independently from the underlying distribution. However, this modeling assumption is oftentimes heavily violated in practice. Figure 1 illustrates why this can be problematic: a standard normalizing flow trained on dependent data will misinterpret the sampling distortions in the training data as true signal (Figure 1c). Our proposed method, on the other hand, can correct for the data dependencies and reconstruct the original density more faithfully (Figure 1d).

The problem of correlated data is very common and occurs in many applications. Consider the ubiquitous task of image modeling. The Labeled Faces in the Wild (LFW; Huang et al., 2008) data set consists of facial images of celebrities, but some individuals in the data set are grossly overrepresented. For example, George W. Bush is depicted in 530 images, while around 70% of the individuals in the data set appear only once. A generative model trained naively on these data will put considerably more probability mass on images similar to George W. Bush than on images of the less represented individuals. Arguably, most downstream tasks, such as image generation and outlier detection, would benefit from a model that is less biased towards these overrepresented individuals.

In the biomedical domain, large cohort studies involve participants that oftentimes are directly related (such as parents and children) or indirectly related (by sharing genetic material due to a shared ancestry), a phenomenon called population stratification (Cardon & Palmer, 2003). These dependencies between individuals play a major role in the traditional analyses of these data and require sophisticated statistical treatment (Lippert et al., 2011), but current deep-learning-based non-parametric models lack the required methodology to do so. This can have a considerable negative impact on downstream tasks, as we will show in our experiments.

In finance, accurate density estimation and modeling of assets (e.g., stock market data) is essential for risk management and modern trading strategies. Data points are often heavily correlated with one another due to time, sector, or other relations. Traditionally, financial analysts often use copulas for the non-parametric modeling of these data, which themselves can be interpreted as a simplified version of normalizing flows.
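To make the independence assumption discussed above explicit, it helps to recall the standard training objective of a normalizing flow; the notation below is illustrative and not the formalism introduced later in the paper. For an invertible map $f_\theta$ from data space to a base density $p_Z$, the change-of-variables formula gives the model density, and training on observations $x_1, \dots, x_n$ maximizes the factorized log-likelihood

$$\log p_\theta(x_1, \dots, x_n) \;=\; \sum_{i=1}^{n} \Big[ \log p_Z\big(f_\theta(x_i)\big) + \log \big| \det J_{f_\theta}(x_i) \big| \Big],$$

where $J_{f_\theta}(x_i)$ denotes the Jacobian of $f_\theta$ evaluated at $x_i$. The sum over independent per-sample terms is precisely the assumption that breaks down for dependent observations; a dependency-aware objective must instead model the joint distribution of the latent codes $f_\theta(x_1), \dots, f_\theta(x_n)$ (for example, hypothetically, through a correlated base distribution) rather than letting it factorize.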

