CATEGORICAL NORMALIZING FLOWS VIA CONTINUOUS TRANSFORMATIONS

Abstract

Despite their popularity, to date, the application of normalizing flows on categorical data remains limited. The current practice of using dequantization to map discrete data to a continuous space is inapplicable as categorical data has no intrinsic order. Instead, categorical data have complex and latent relations that must be inferred, like the synonymy between words. In this paper, we investigate Categorical Normalizing Flows, that is, normalizing flows for categorical data. By casting the encoding of categorical data in continuous space as a variational inference problem, we jointly optimize the continuous representation and the model likelihood. Using a factorized decoder, we introduce an inductive bias to model any interactions in the normalizing flow. As a consequence, we not only simplify the optimization compared to having a joint decoder, but also make it possible to scale up to a large number of categories, which is currently impossible with discrete normalizing flows. Based on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant generative model on graphs. GraphCNF implements a three-step approach modeling the nodes, edges, and adjacency matrix stepwise to increase efficiency. On molecule generation, GraphCNF outperforms both the one-shot and autoregressive flow-based state-of-the-art.

1. INTRODUCTION

Normalizing flows have been popular for tasks with continuous data like image modeling (Dinh et al., 2017; Kingma and Dhariwal, 2018; Ho et al., 2019) and speech generation (Kim et al., 2019; Prenger et al., 2019), providing efficient parallel sampling and exact density evaluation. Normalizing flows rely on the rule of change of variables, a continuous transformation designed for continuous data. However, many data types are typically encoded as discrete, categorical variables, like language and graphs, where normalizing flows are not straightforward to apply. To address this, it has recently been proposed to discretize the transformations inside normalizing flows so that they act directly on discrete data. Unfortunately, these discrete transformations have been shown to be limited in terms of vocabulary size and layer depth due to gradient approximations (Hoogeboom et al., 2019; Tran et al., 2019). For the specific case of discrete but ordinal data, like images where integers represent quantized values, a popular strategy is to add a small amount of noise to each value (Dinh et al., 2017; Ho et al., 2019). It is unnatural, however, to apply such dequantization techniques to the general case of categorical data, where values represent categories with no intrinsic order. Treating these categories as integers for dequantization biases the data towards a non-existing order and makes the modeling task significantly harder. Besides, relations between categories are often multi-dimensional, for example word meanings, and cannot be represented with dequantization. In this paper, we investigate normalizing flows for the general case of categorical data. To account for discontinuity, we propose continuous encodings in which different categories correspond to unique, non-overlapping, and thus close-to-deterministic volumes in a continuous latent space.
Instead of pre-specifying the non-overlapping volumes per category, we resort to variational inference to jointly learn those volumes and model the likelihood with a normalizing flow at the same time. This work is not the first to combine variational inference with normalizing flows, which are mostly considered for improving the flexibility of the approximate posterior (Kingma et al., 2016; Rezende and Mohamed, 2015; Van Den Berg et al., 2018). Different from previous works, we use variational inference to learn a continuous representation z of the discrete categorical data x for a normalizing flow. A similar idea has been investigated by Ziegler and Rush (2019), who use a variational autoencoder structure with the normalizing flow as the prior. As both their decoder and normalizing flow model (complex) dependencies between categorical variables, Ziegler and Rush (2019) rely on intricate yet sensitive learning schedules for balancing the likelihood terms. Instead, we propose to separate representation and relation modeling by factorizing the decoder over both the categorical variables x and the conditioning latents z. This forces the encoder and decoder to focus only on the mapping from categorical data to continuous encodings, and not to model any interactions. By inserting this inductive bias, we move all complexity into the flow. We call this approach Categorical Normalizing Flows (CNF). Categorical Normalizing Flows can be applied to any task involving categorical variables, but we primarily focus on modeling graphs. Current state-of-the-art approaches often rely on autoregressive models (Li et al., 2018; Shi et al., 2020; You et al., 2018) that view graphs as sequences, although there exists no intrinsic order of the nodes. In contrast, normalizing flows can perform generation in parallel, making a definition of order unnecessary. By treating both nodes and edges as categorical variables, we employ our variational inference encoding and propose GraphCNF.
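Concretely, the resulting objective can be written as a standard evidence lower bound in which the factorized decoder only scores each variable against its own latent, leaving all interactions to the flow prior. The following is a sketch in our notation, where x_1, ..., x_S denote the individual categorical variables and z_1, ..., z_S their continuous encodings; the exact formulation in later sections may differ in details:

```latex
\log P_{\text{model}}(x)
  \;\geq\;
  \mathbb{E}_{q(z \mid x)}\!\left[
      \sum_{i=1}^{S} \log p(x_i \mid z_i)
      \;+\; \log p(z)
      \;-\; \log q(z \mid x)
  \right]
```

Here p(z) is the likelihood under the normalizing flow prior, and q(z | x) is the variational encoder that maps categories to continuous volumes.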
GraphCNF is a novel permutation-invariant normalizing flow for graph generation which assigns equal likelihood to any ordering of nodes. Meanwhile, GraphCNF efficiently encodes the node attributes, edge attributes, and graph structure in three consecutive steps. As shown in the experiments, the improved encoding and flow architecture allow GraphCNF to significantly outperform both the autoregressive and parallel flow-based state-of-the-art. Further, we show that Categorical Normalizing Flows can be used in problems with regular categorical variables, like modeling natural language or sets. Our contributions are summarized as follows. Firstly, we propose Categorical Normalizing Flows, which use variational inference with a factorized decoder to move all complexity into the prior and scale up to a large number of categories. Secondly, building on Categorical Normalizing Flows, we propose GraphCNF, a permutation-invariant normalizing flow for graph generation. On molecule generation, GraphCNF sets a new state-of-the-art for flow-based methods, outperforming one-shot and autoregressive baselines. Finally, we show that simple mixture models for encoding distributions are accurate, efficient, and generalize across a multitude of setups, including sets, language, and graphs.

2.1. NORMALIZING FLOWS ON CONTINUOUS DATA

A normalizing flow (Rezende and Mohamed, 2015; Tabak and Vanden Eijnden, 2010) is a generative model that models a probability distribution p(z^{(0)}) by applying a sequence of invertible, smooth mappings f_1, ..., f_K : R^d → R^d. Using the rule of change of variables, the likelihood of the input z^{(0)} is determined as follows:

$$p(z^{(0)}) = p(z^{(K)}) \cdot \prod_{k=1}^{K} \left| \det \frac{\partial f_k(z^{(k-1)})}{\partial z^{(k-1)}} \right|$$

where z^{(k)} = f_k(z^{(k-1)}), and p(z^{(K)}) represents a prior distribution. This calculation requires computing the Jacobian for each of the mappings f_1, ..., f_K, which is expensive for arbitrary functions. Thus, the mappings are often designed to allow efficient computation of their determinants. One such mapping is the coupling layer proposed by Dinh et al. (2017), which has been shown to work well with neural networks. For a detailed introduction to normalizing flows, we refer the reader to Kobyzev et al. (2019).
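To make the change-of-variables rule concrete, the following minimal numpy sketch implements a single affine coupling layer in the style of Dinh et al. (2017). The fixed linear "networks" and all sizes are illustrative stand-ins for learned neural networks, not the architecture used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def affine_coupling_forward(z, scale_net, shift_net):
    # The first half of the dimensions passes through unchanged and
    # parameterizes an elementwise affine map of the second half. The
    # Jacobian is therefore triangular, and its log-determinant is simply
    # the sum of the log-scales.
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    s, t = scale_net(z1), shift_net(z1)
    return np.concatenate([z1, z2 * np.exp(s) + t], axis=-1), s.sum(axis=-1)

def affine_coupling_inverse(z, scale_net, shift_net):
    # Exact inverse: z1 is unchanged, so s and t can be recomputed.
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    s, t = scale_net(z1), shift_net(z1)
    return np.concatenate([z1, (z2 - t) * np.exp(-s)], axis=-1)

# Toy stand-ins for the learned networks producing log-scale and shift.
W_s = rng.normal(size=(2, 2))
W_t = rng.normal(size=(2, 2))
scale_net = lambda h: np.tanh(h @ W_s)   # bounded log-scale for stability
shift_net = lambda h: h @ W_t

z0 = rng.normal(size=(5, 4))             # batch of 5 samples, d = 4
zK, log_det = affine_coupling_forward(z0, scale_net, shift_net)
z0_rec = affine_coupling_inverse(zK, scale_net, shift_net)
assert np.allclose(z0, z0_rec)           # the layer is exactly invertible
assert log_det.shape == (5,)             # one log|det J| per sample
```

Stacking several such layers (with the roles of the two halves alternating) and summing their log-determinants gives exactly the product term in the change-of-variables formula above.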

2.2. NORMALIZING FLOWS ON CATEGORICAL DATA

We define x = {x_1, ..., x_S} to be a multivariate, categorical random variable, where each element x_i is itself a categorical variable over K categories with no intrinsic order. For instance, x could be a sentence with x_i being its words. Our goal is to learn the joint probability mass function, P_model(x), via a normalizing flow. Specifically, as normalizing flows constitute a class of continuous transformations, we aim to learn a continuous latent space in which each categorical choice of a variable x_i maps to a stochastic continuous variable z_i ∈ R^d whose distribution we learn.
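One simple instance of such a learned encoding is a mixture model: each category corresponds to a small Gaussian in R^d, and the factorized decoder p(x_i | z_i) follows from the mixture by Bayes' rule. The numpy sketch below is our own illustration with hand-picked means and sizes, not the trained model; it only shows why small, well-separated per-category volumes make the decoding close to deterministic:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, S = 10, 2, 8   # illustrative: 10 categories, 2-dim latents, 8 variables

# Mixture-encoding parameters: one Gaussian per category. Means are placed
# on a circle here purely so the example is well separated; in practice
# they would be learned jointly with the flow.
angles = 2 * np.pi * np.arange(K) / K
mu = 3.0 * np.stack([np.cos(angles), np.sin(angles)], axis=-1)  # (K, d)
log_sigma = np.full((K, d), -2.0)  # small std -> near non-overlapping volumes

def encode(x):
    """Sample z_i ~ q(z_i | x_i) = N(mu[x_i], diag(sigma[x_i]^2))."""
    eps = rng.normal(size=(x.shape[0], d))
    return mu[x] + np.exp(log_sigma[x]) * eps

def log_gauss(z, m, ls):
    # Log-density of a diagonal Gaussian, summed over the d dimensions.
    return (-0.5 * ((z - m) / np.exp(ls)) ** 2 - ls
            - 0.5 * np.log(2 * np.pi)).sum(-1)

def decoder_log_prob(z, x):
    """Factorized decoder p(x_i | z_i) via Bayes' rule with a uniform
    category prior; it models no interactions between the variables, so
    all dependencies must be captured by the flow prior."""
    all_lp = np.stack([log_gauss(z, mu[k], log_sigma[k])
                       for k in range(K)], axis=-1)       # (S, K)
    m = all_lp.max(axis=-1, keepdims=True)                # stable logsumexp
    log_norm = m + np.log(np.exp(all_lp - m).sum(axis=-1, keepdims=True))
    log_post = all_lp - log_norm
    return log_post[np.arange(x.shape[0]), x]

x = rng.integers(0, K, size=S)   # one categorical variable per position
z = encode(x)
lp = decoder_log_prob(z, x)
# With small sigma, each z_i lands deep inside its category's volume, so
# the factorized decoder recovers the true category almost surely.
assert np.exp(lp).min() > 0.9
```

The key property illustrated here is the one exploited in the paper: because the volumes barely overlap, the decoder carries almost no modeling burden and can be factorized without loss, while all structure between variables is left to the normalizing flow.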

