TRADE: A SIMPLE SELF-ATTENTION-BASED DENSITY ESTIMATOR

Abstract

We present TraDE, a self-attention-based architecture for auto-regressive density estimation with continuous- and discrete-valued data. Our model is trained using a penalized maximum-likelihood objective, which ensures that samples from the density estimate resemble the training data distribution. The use of self-attention means that the model need not retain conditional sufficient statistics during the autoregressive process beyond what is needed for each covariate. On standard tabular and image data benchmarks, TraDE produces significantly better density estimates than existing approaches such as normalizing-flow estimators and recurrent autoregressive models. However, log-likelihood on held-out data only partially reflects how useful these estimates are in real-world applications. In order to systematically evaluate density estimators, we present a suite of tasks such as regression using generated samples, out-of-distribution detection, and robustness to noise in the training data, and demonstrate that TraDE works well in these scenarios.

1. INTRODUCTION

Density estimation involves estimating a probability density p(x) given independent, identically distributed (iid) samples from it. This is a versatile and important problem, as it allows one to generate synthetic data or perform novelty and outlier detection; it is also an important subroutine in applications of graphical models. Deep neural networks are a powerful function class, and learning complex distributions with them is promising. This has resulted in a resurgence of interest in the classical problem of density estimation. One of the more popular techniques for density estimation is to sample data from a simple reference distribution and then learn a (sequence of) invertible transformations that adapt it to the target distribution. Flow-based methods (Durkan et al., 2019b) employ this idea with great success. A more classical approach is to decompose p(x) iteratively via conditional probabilities p(x_{i+1} | x_1, ..., x_i) and fit each conditional using the data (Murphy, 2013). One may even employ implicit generative models to sample from p(x) directly, perhaps without the ability to compute density estimates. This is the case with Generative Adversarial Networks (GANs), which reign supreme for image synthesis via sampling (Goodfellow et al., 2014; Karras et al., 2017).

Implementing the methods above, however, requires special care: e.g., the normalizing transform requires the network to be invertible with an efficiently computable Jacobian. Auto-regressive models using recurrent networks are difficult to scale to high-dimensional data due to the need to store a potentially high-dimensional conditional sufficient statistic (and also due to vanishing gradients). Generative models can be difficult to train, and GANs lack a closed-form density model. Much of the current work is devoted to mitigating these issues. The main contributions of this paper include:

1. We introduce TraDE, a simple but novel auto-regressive density estimator that uses self-attention in place of a recurrent neural network, yet remains capable of approximating any density function. To our knowledge, this is the first adaptation of Transformer-like architectures for continuous-valued density estimation.

2. Log-likelihood on held-out data is the prevalent metric for evaluating density estimators; however, it provides only a partial view of their performance in real-world applications. We propose a suite of experiments to systematically evaluate the performance of density estimators on downstream tasks such as classification and regression using generated samples, detection of out-of-distribution samples, and robustness to noise in the training data.

3. We provide extensive empirical evidence that TraDE substantially outperforms other density estimators on standard and additional benchmarks, along with thorough ablation experiments to dissect the empirical gains.

The main feature of this work is the simplicity of our proposed method along with its strong, systematically evaluated, empirical performance.
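The autoregressive property that self-attention must respect, namely that the representation of covariate x_i may depend only on x_1, ..., x_{i-1}, can be enforced with a causal mask. The following is a minimal single-head sketch in NumPy; the random weight matrices and dimensions are illustrative only and do not reflect TraDE's actual architecture.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with a causal (lower-triangular) mask,
    so the output at position i depends only on inputs at positions <= i."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out attention to strictly-future positions.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
x = rng.standard_normal((5, d))
out = causal_self_attention(x, Wq, Wk, Wv)

# Perturbing a future covariate leaves all earlier outputs unchanged,
# which is exactly the autoregressive property.
x2 = x.copy()
x2[4] += 1.0
out2 = causal_self_attention(x2, Wq, Wk, Wv)
assert np.allclose(out[:4], out2[:4])
```

Because the mask makes each position's output a function of its prefix alone, all d conditionals can be evaluated in parallel during training, without carrying a recurrent hidden state.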

2. BACKGROUND AND RELATED WORK

Given a dataset x_1, ..., x_n where each sample x_l ∈ R^d is drawn iid from a distribution p(x), the maximum-likelihood formulation of density estimation finds a θ-parameterized distribution q with θ* = argmax_θ (1/n) Σ_{l=1}^n log q(x_l; θ). The candidate distribution q can be parameterized in a variety of ways, as we discuss next.

Normalizing flows write x ~ q as a transformation of samples z from some base distribution p_z from which one can draw samples easily (Papamakarios et al., 2019). If this mapping is f_θ : z → x, the two distributions can be related using the determinant of the Jacobian as q(x; θ) := p_z(z) |det(df_θ/dz)|^{-1}. A practical limitation of flow-based models is that f_θ must be a diffeomorphism, i.e., it is invertible and both f_θ and f_θ^{-1} are differentiable. Good performance using normalizing flows imposes nontrivial restrictions on how one can parameterize f_θ: it must be flexible yet invertible, with a Jacobian that can be computed efficiently. There are a number of techniques to achieve this, e.g., linear mappings, planar/radial flows (Rezende & Mohamed, 2015; Tabak & Turner, 2013), Sylvester flows (Berg et al., 2018), coupling (Dinh et al., 2014), and auto-regressive models (Larochelle & Murray, 2011). One may also compose the transformations, e.g., using monotonic mappings f_θ in each layer (Huang et al., 2018; De Cao et al., 2019).

Auto-regressive models factorize the joint distribution as a product of univariate conditional distributions q(x; θ) := Π_i q_i(x_i | x_1, ..., x_{i-1}; θ). The auto-regressive approach to density estimation is straightforward and flexible, as there is no restriction on how each conditional distribution is modeled. Often, a single recurrent neural network (RNN) is used to sequentially estimate all conditionals with a shared set of parameters (Oliva et al., 2018; Kingma et al., 2016). For high-dimensional data, the challenge lies in handling the increasingly large state x_1, ..., x_{i-1} required to properly infer x_i. In recurrent auto-regressive models, the values of these conditioned-upon variables are stored in some representation h_i, which is updated via a function h_{i+1} = g(h_i, x_i). This overcomes the problem of high-dimensional estimation, albeit at the expense of a loss in fidelity. Techniques like masking the computational paths in a feed-forward network are popular for alleviating these problems further (Uria et al., 2016; Germain et al., 2015; Papamakarios et al., 2017). Many existing auto-regressive algorithms are highly sensitive to the variable ordering chosen for factorizing q, and some methods must train complex ensembles over multiple orderings to achieve good performance (Germain et al., 2015; Uria et al., 2014). While auto-regressive models are commonly applied to natural language and time-series data, that setting only involves variables that are already naturally ordered (Chelba et al., 2013). In contrast, we consider continuous (and discrete) density estimation of vector-valued data, e.g., tabular data, where the underlying ordering and dependencies between variables are often unknown.

Generative models focus on drawing samples from the estimated distribution that resemble the true distribution of the data. There is a rich history of learning explicit models via variational inference (Jordan et al., 1999), which allow both drawing samples and estimating the log-likelihood, as well as implicit models such as Generative Adversarial Networks (GANs; Goodfellow et al., 2014).
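As a concrete sketch of the change-of-variables formula for flows described above, consider the one-dimensional diffeomorphism f(z) = exp(z) applied to a standard normal base distribution, which yields a log-normal density. The function names and interface below are illustrative, not from any particular flow library.

```python
import numpy as np

def flow_log_density(x, f_inv, log_abs_det_jac):
    """log q(x) = log p_z(f^{-1}(x)) + log |det d f^{-1}(x) / dx|,
    with a standard normal base density p_z."""
    z = f_inv(x)
    log_pz = -0.5 * np.log(2.0 * np.pi) - 0.5 * z ** 2
    return log_pz + log_abs_det_jac(x)

# Toy diffeomorphism f(z) = exp(z), so x is log-normally distributed:
# f^{-1}(x) = log(x) and log |d f^{-1}/dx| = -log(x).
log_q = flow_log_density(2.0, np.log, lambda x: -np.log(x))
print(log_q)  # the standard log-normal log-density at x = 2
```

Note that both ingredients, the inverse map and the log-determinant of its Jacobian, must be cheap to evaluate; this is precisely the restriction on f_θ that the paragraph above discusses.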



Figure 1: TraDE is well-suited to density estimation of Transformers. Left: Bumblebee (true density); right: density estimated from data.

