BITAT: NEURAL NETWORK BINARIZATION WITH TASK-DEPENDENT AGGREGATED TRANSFORMATION

Abstract

Neural network quantization aims to transform the high-precision weights and activations of a given neural network into low-precision counterparts for reduced memory usage and computation, while preserving the performance of the original model. However, quantizing compactly-designed backbone architectures often used for edge-device deployment to 1-bit weights and 1-bit activations results in severe performance degradation. This paper proposes a novel quantization-aware training method that effectively alleviates performance degradation even under extreme quantization by focusing on inter-weight dependencies, both between the weights within each layer and across consecutive layers. To minimize the impact of quantizing each weight on the others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, such that each weight is disentangled from the others. We then quantize the weights based on their importance to minimize the loss of information from the original weights and activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that quantization at each layer reflects the quantized distributions of weights and activations at the previous layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates performance degradation on ImageNet and successfully preserves the full-precision model performance on CIFAR-100 with compact backbone networks.

1. INTRODUCTION

Over the past decade, deep neural networks (NNs) have achieved tremendous success in solving various real-world problems (Creswell et al., 2018; Gidaris et al., 2018; Chen et al., 2020; Karras et al., 2021; Radford et al., 2021). Recently, network architectures have become increasingly large, based on the empirical observation that scale improves performance. However, this makes it increasingly difficult to deploy them on edge devices with limited memory and computational power. Therefore, many recent works focus on building resource-efficient networks to bridge the gap between model scale and the computational complexity and memory bounds permissible for on-device deployment. Several works design computation- and memory-efficient architecture modules, while others compress a given neural network by either pruning its weights (Yoon & Hwang, 2017; He et al., 2020b; Lin et al., 2020a) or reducing the number of bits used to represent the weights and activations (Bulat et al., 2021; Dbouk et al., 2020; Li et al., 2021). The latter approach, neural network quantization, is beneficial for building on-device AI systems since edge devices oftentimes only support low-bitwidth parameters and/or operations. However, it inevitably suffers from non-negligible loss of the information encoded in the full-precision model. This loss of information becomes worse with extreme quantization into binary neural networks with 1-bit weights and 1-bit activations (Bulat et al., 2021; Zhuang et al., 2019; Qin et al., 2020b).

How, then, can we effectively preserve the original model performance even with extremely low-precision networks? To address this question, we focus on a somewhat overlooked property of NNs for quantization: the weights in a layer are highly correlated with each other and with the weights in consecutive layers.
Quantizing a weight will inevitably affect the other weights within the same layer, since together they comprise the transformation represented by that layer. Thus, quantizing the weights and activations of a specific layer alters the correlation and relative importance among them. Moreover, it also largely impacts the next layer, which directly consumes the output of the quantized layer, since all layers together comprise the function represented by the neural network.
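To make the problem concrete, consider the standard baseline that our method contrasts with: binarizing each weight independently by its sign, with a per-layer scaling factor (in the style of XNOR-Net). The sketch below is illustrative only; the function name and scaling choice are ours, not the paper's method. It shows that such an element-wise quantizer is blind to the inter-weight correlations discussed above.

```python
import numpy as np

def binarize_weights(W):
    """Naive 1-bit quantization: replace each weight by its sign,
    scaled by the layer's mean absolute weight. Each weight is
    quantized independently, ignoring correlations with the other
    weights in the layer and in neighboring layers."""
    alpha = np.mean(np.abs(W))  # per-layer scaling factor
    return alpha * np.sign(W)

# Two highly correlated rows: after independent binarization they
# collapse onto the same binary pattern, so the distinction between
# them (and its effect on the next layer's input) is lost.
W = np.array([[0.9, -0.2],
              [0.85, -0.25]])
Wb = binarize_weights(W)
reconstruction_error = np.linalg.norm(W - Wb)
```

Because both rows map to the same scaled sign pattern, the quantization error of one weight cannot be compensated by the others; accounting for these dependencies is precisely what motivates the transformation described in this paper.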

