REDUCING THE COMPUTATIONAL COST OF DEEP GENERATIVE MODELS WITH BINARY NEURAL NETWORKS

Abstract

Deep generative models provide a powerful set of tools for understanding real-world data. But as these models improve, they grow in size and complexity, and their memory and run-time costs grow with them. Using binary weights in neural networks is one method that has shown promise in reducing this cost. However, whether binary neural networks can be used in generative models has remained an open problem. In this work we show, for the first time, that generative models which utilize binary neural networks can be trained successfully, reducing their computational cost dramatically. We develop a new class of binary weight normalization, and provide insights into architecture design for these binarized generative models. We demonstrate that two state-of-the-art deep generative models, the ResNet VAE and Flow++ models, can be binarized effectively using these techniques. Our binary models achieve loss values close to those of the regular models while being 90%-94% smaller, and they also allow significant speed-ups in execution time.

1. INTRODUCTION

As machine learning models continue to grow in parameter count, there is a corresponding effort to reduce the ever-increasing memory and computational requirements that these models incur. One method to make models more efficient is to use neural networks whose weights, and possibly activations, are restricted to be binary-valued (Courbariaux et al., 2015; Courbariaux et al., 2016; Rastegari et al., 2016; McDonnell, 2018; Gu et al., 2018). Binary weights and activations require significantly less memory, and also admit faster low-level implementations of key operations such as linear transformations than the usual floating-point precision allows. Although the application of binary neural networks to classification is relatively well-studied, there has been no research that we are aware of examining whether binary neural networks can be used effectively in unsupervised learning problems. Indeed, many of the deep generative models that are popular for unsupervised learning do have high parameter counts and are computationally expensive (Vaswani et al., 2017; Maaløe et al., 2019; Ho et al., 2019a). These models would stand to benefit significantly from converting their weights and activations to binary values, which we call binarization for brevity.

In this work we focus on non-autoregressive models with explicit densities. One such class of density model is the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), a latent variable model which has been used to model many high-dimensional data domains accurately. State-of-the-art VAE models tend to have deep hierarchies of latent layers, and have demonstrated good performance relative to comparable modelling approaches (Ranganath et al., 2016; Kingma et al., 2016; Maaløe et al., 2019). Whilst this deep hierarchy makes the model powerful, the model size and compute requirements increase with the number of latent layers, making very deep models resource intensive.
Another class of explicit-density model is the flow model, which applies a sequence of invertible transformations to a simple density, with the transformed density approximating the data-generating distribution. Flow models which achieve state-of-the-art performance compose many transformations to give flexibility to the learned density (Kingma & Dhariwal, 2018; Ho et al., 2019a). Again the computational cost increases as the number of transformations increases.

To examine how to binarize hierarchical VAEs and flow models successfully, we take two models which have demonstrated excellent modelling performance - the ResNet VAE (Kingma et al., 2016) and the Flow++ model (Ho et al., 2019a) - and implement the majority of each model with binary neural networks. Using binary weights and activations reduces the computational cost, but also decreases the representational capability of the model. Our aim is therefore to strike a balance between reducing the computational cost and maintaining good modelling performance. We show that it is possible to decrease the model size drastically, and allow for significant speed-ups in run time, with only a minor impact on the achieved loss value.

We make the following key contributions:
• We propose an efficient binary adaptation of weight normalization, a reparameterization technique often used in deep generative models to accelerate convergence. Binary weight normalization is the generative-modelling alternative to the batch normalization commonly used in binary neural networks.
• We show that we can binarize the majority of weights and activations in deep hierarchical VAE and flow models without significantly hurting performance, and we present the corresponding binary architecture designs for both the ResNet VAE and the Flow++ model.
• We perform experiments with different levels of binarization, clearly demonstrating the trade-off between binarization and performance.
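As a reference point for the first contribution above: standard weight normalization (Salimans & Kingma, 2016) reparameterizes each weight vector as w = g · v/‖v‖, separating the learned direction v from a learned scale g. The sketch below shows this standard real-valued technique only; the comment about a binary variant is our own illustrative assumption, and the paper's actual binary adaptation is described later.

```python
import numpy as np

def weight_norm(v, g):
    """Standard weight normalization: w = g * v / ||v||.

    v : unconstrained direction vector, g : learned scalar scale.
    (Hypothetical note, not the paper's method: a binary variant would
    constrain the direction entries to {-1, +1}, whose norm is simply
    sqrt(len(v)), so the normalization can be folded into g.)
    """
    return g * v / np.linalg.norm(v)

# The resulting weight vector has norm exactly g, regardless of ||v||.
w = weight_norm(np.array([3.0, -4.0]), g=2.0)
```

Because the direction is normalized, the scale of the effective weights is controlled entirely by g, which is what makes the reparameterization attractive when weights are quantized.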

2. BACKGROUND

In this section we give background on the implementation and training of binary neural networks. We also describe the generative models that we implement with binary neural networks in detail.

2.1. BINARY NEURAL NETWORKS

In order to reduce the memory and computational requirements of neural networks, there has been recent research into how to effectively utilise networks which use binary-valued weights w_B, and possibly also activations α_B, rather than the usual real-valued¹ weights and activations (Courbariaux et al., 2015; Courbariaux et al., 2016; Rastegari et al., 2016; McDonnell, 2018; Gu et al., 2018). In this work, we use the convention of binary values being in B := {-1, 1}.

Motivation. The primary motivation for using binary neural networks is to decrease the memory and computational requirements of the model. Clearly binary weights require less memory to store: 32× less than the usual 32-bit floating-point weights. Binary neural networks also admit significant speed-ups. A reported 2× speed-up can be achieved by a layer with binary weights and real-valued inputs (Rastegari et al., 2016). This can be made an additional 29× faster if the inputs to the layer are also constrained to be binary (Rastegari et al., 2016). With both binary weights and inputs, linear operators such as convolutions can be implemented using the inexpensive XNOR and bit-count binary operations. A simple way to ensure binary inputs to a layer is to place a binary activation function before the layer (Courbariaux et al., 2016; Rastegari et al., 2016).

Optimization. Taking a trained model with real-valued weights and binarizing the weights has been shown to lead to a significant worsening of performance (Alizadeh et al., 2019). So instead the binary weights are learned during training. It is common not to optimize the binary weights directly, but instead to optimize a set of underlying real-valued weights w_R which can then be binarized in some fashion for inference. In this paper we adopt the convention of binarizing the underlying weights using the sign function (see Equation 2).
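The XNOR-and-bit-count implementation of binary linear operations mentioned above can be sketched as follows. This NumPy version illustrates the arithmetic identity only; real implementations pack bits into machine words and use a hardware popcount instruction, which is where the speed-up comes from.

```python
import numpy as np

def dot_xnor_popcount(a_bits, b_bits):
    """Dot product of two {-1, +1} vectors via XNOR + popcount.

    a_bits, b_bits : boolean arrays, where True encodes +1 and
    False encodes -1. For n-dimensional vectors,
        dot = 2 * popcount(XNOR(a, b)) - n,
    since XNOR is true exactly where the signs agree (a +1
    contribution to the dot product) and false where they
    disagree (a -1 contribution).
    """
    n = a_bits.size
    agree = np.count_nonzero(~(a_bits ^ b_bits))  # popcount of XNOR
    return 2 * agree - n

a = np.array([+1, -1, +1, +1])
b = np.array([+1, +1, -1, +1])
assert dot_xnor_popcount(a > 0, b > 0) == a @ b
```

The same identity extends to matrix multiplications and convolutions, since both reduce to many such dot products.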
We also use the sign function as the activation function when we use binary activations (see Equation 5, where α_R are the real-valued pre-activations). We define the sign function as

    sign(x) = +1 if x ≥ 0, and -1 otherwise,

so that its output always lies in B.
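Since the gradient of the sign function is zero almost everywhere, binary networks are commonly trained with a straight-through estimator (e.g. Courbariaux et al., 2016), which passes the gradient through the binarization unchanged, typically clipped where the underlying weight is large. A minimal sketch of this convention (the clipping threshold of 1 is the common choice, not necessarily this paper's):

```python
import numpy as np

def sign_binarize(w_real):
    """Forward binarization: w_B = sign(w_R), mapping 0 to +1."""
    return np.where(w_real >= 0, 1.0, -1.0)

def ste_backward(grad_wb, w_real, clip=1.0):
    """Straight-through estimator: copy the gradient w.r.t. the
    binary weights onto the real-valued weights, zeroing it where
    |w_R| exceeds the clipping threshold so the underlying weights
    stay near the binary range."""
    return grad_wb * (np.abs(w_real) <= clip)
```

During training, `ste_backward` supplies the update direction for w_R, while `sign_binarize(w_R)` is used in the forward pass and at inference time.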
¹ We use "real-valued" throughout the paper to be synonymous with "implemented with floating-point precision".

