BITAT: NEURAL NETWORK BINARIZATION WITH TASK-DEPENDENT AGGREGATED TRANSFORMATION

Abstract

Neural network quantization aims to transform the high-precision weights and activations of a given neural network into low-precision ones for reduced memory usage and computation while preserving the performance of the original model. However, extreme quantization to 1-bit weights and 1-bit activations of compactly designed backbone architectures, which are often used for edge-device deployments, results in severe performance degeneration. This paper proposes a novel Quantization-Aware Training method that can effectively alleviate performance degeneration even under extreme quantization by focusing on inter-weight dependencies: those between the weights within each layer and those across consecutive layers. To minimize the impact of quantizing each weight on the others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, such that each weight is disentangled from the others. We then quantize the weights based on their importance to minimize the loss of information from the original weights and activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that quantization at each layer reflects the quantized distributions of weights and activations at previous layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates the performance degeneration on ImageNet and successfully preserves the full-precision model performance on CIFAR-100 with compact backbone networks.

1. INTRODUCTION

Over the past decade, deep neural networks (NNs) have achieved tremendous success in solving various real-world problems (Creswell et al., 2018; Gidaris et al., 2018; Chen et al., 2020; Karras et al., 2021; Radford et al., 2021). Network architectures have recently become increasingly large, motivated by empirical observations of their improved performance. However, it is increasingly difficult to deploy them on edge devices with limited memory and computational power. Therefore, many recent works focus on building resource-efficient networks to bridge the gap between model scale and the computational complexity and memory bounds permissible for on-device deployment. Several works design computation- and memory-efficient architecture modules, while others compress a given neural network, either by pruning its weights (Yoon & Hwang, 2017; He et al., 2020b; Lin et al., 2020a) or by reducing the number of bits used to represent the weights and activations (Bulat et al., 2021; Dbouk et al., 2020; Li et al., 2021). The latter approach, neural network quantization, is beneficial for building on-device AI systems since edge devices oftentimes only support low-bitwidth parameters and operations. However, it inevitably suffers from non-negligible forgetting of the information encoded in the full-precision model. Such loss of information becomes worse under extreme quantization into binary neural networks with 1-bit weights and 1-bit activations (Bulat et al., 2021; Zhuang et al., 2019; Qin et al., 2020b).

How can we then effectively preserve the original model performance even with extremely low-precision networks? To address this question, we focus on a somewhat overlooked property of NNs for quantization: the weights in a layer are highly correlated with each other and with the weights in consecutive layers.
Quantizing the weights at a given layer inevitably affects the other weights within the same layer, since together they comprise the transformation represented by that layer. Thus, quantizing the weights and activations at a specific layer alters the correlation and relative importance among them. Moreover, it also largely impacts the next layer, which directly consumes the layer's output; together, the layers comprise the function represented by the neural network.

To tackle this challenging problem, we propose a new Quantization-Aware Training (QAT) method, referred to as Neural Network Binarization with Task-dependent Aggregated Transformation (BiTAT), as illustrated in Figure 1 (Left). Our method sequentially quantizes the weights at each layer of a pre-trained model based on chunk-wise, input-dependent weight importance by training orthonormal dependency matrices and scaling vectors. After quantizing each layer, we fine-tune the subsequent full-precision layers, which take the quantized layer's output as input, for a few epochs while keeping the quantized weights frozen. We further aggregate redundant input dimensions of the transformation matrices and scaling vectors, significantly reducing the computational cost of the quantization process. Such consideration of inter-weight dependencies allows our BiTAT algorithm to better preserve the information in a given high-precision network, achieving performance comparable to the original full-precision network even under extreme quantization, i.e., binarization of both weights and activations.

The main contributions of the paper can be summarized as follows:

• We demonstrate that weight dependencies within each layer and across layers play an essential role in preserving model performance during quantized training.

• We propose an input-dependent quantization-aware training method that binarizes neural networks. We disentangle the correlations among weights within and across multiple layers by training rotation matrices and importance vectors, which guide the quantization process to account for the importance of the disentangled weights.

• We empirically validate our method on several benchmark datasets against state-of-the-art NN quantization methods, showing that it significantly outperforms the baselines on compact neural network architectures.
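The core idea, quantizing each layer in a rotated basis where correlated weight coordinates are disentangled and scored by importance, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' exact algorithm: the function name `quantize_layer`, the mean-absolute per-coordinate scale `alpha`, and the importance-weighted error used to train `R` and `s` are all illustrative choices.

```python
import torch

def quantize_layer(W, R, s):
    """Illustrative sketch of rotation-based binarization.

    W : (out, in) full-precision weight matrix of one layer
    R : (in, in) orthonormal rotation matrix (trained so that R @ R.T == I)
    s : (in,) importance vector over the disentangled coordinates
    """
    # Rotate weights into a basis in which input dimensions are decorrelated.
    W_rot = W @ R                               # (out, in)
    # Binarize in the rotated basis, with one scale per rotated coordinate.
    B = torch.sign(W_rot)
    alpha = W_rot.abs().mean(dim=0)             # (in,) per-coordinate scale
    # Map the quantized weights back to the original basis (R is orthonormal).
    W_hat = (B * alpha) @ R.T
    # Importance-weighted reconstruction error in the rotated basis:
    # coordinates with larger s contribute more, steering training of R and s.
    err = ((((W - W_hat) @ R) ** 2) * s).mean()
    return W_hat, err
```

In this sketch the rotation `R` plays the role of the trained orthonormal dependency matrix and `s` the importance vector; in practice both would be learned end-to-end on the task loss, layer by layer, as described above.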

2. RELATED WORK

Minimizing the quantization error. Quantization methods for deep neural networks can be broadly categorized into several strategies (Qin et al., 2020a). We first introduce methods that aim to minimize the weight/activation discrepancy between quantized models and their high-precision counterparts. XNOR-Net (Rastegari et al., 2016) minimizes the least-squares error between quantized and full-precision weights for each output channel of a layer. DBQ (Dbouk et al., 2020) and QIL (Jung et al., 2019) perform layer-wise quantization with parametric scale or transformation functions optimized for the task. Yet, these methods quantize full-precision weight elements regardless of their correlations with other weights. While TSQ (Wang et al., 2018) and Real-to-Bin (Martinez et al., 2020) propose to minimize the ℓ2 distance between the quantized activations and the real-valued network's activations by leveraging intra-layer weight dependency, they do not consider cross-layer dependencies. ProxyBNN (He et al., 2020a) adopts the orthogonal matrix to preserve the
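As a reference point, the channel-wise least-squares objective used by XNOR-Net admits a closed-form solution: with B = sign(W), the scale α minimizing ∥W − αB∥² for a channel is the mean absolute value of that channel's weights. A minimal sketch (the function name and the flattened weight layout are illustrative assumptions):

```python
import torch

def xnor_binarize(W):
    """XNOR-Net-style binarization with a per-output-channel scale.

    W : (out_channels, n) filter weights, flattened per output channel.
    Returns alpha * sign(W), the least-squares-optimal 1-bit approximation.
    """
    B = torch.sign(W)
    # Closed-form optimal scale: mean absolute weight per output channel.
    alpha = W.abs().mean(dim=1, keepdim=True)
    return alpha * B
```

Note that this per-channel scheme treats each weight independently given its channel; it captures no correlation between weights, which is precisely the limitation discussed above.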



Figure 1: Left: An illustration of our proposed method. Weight elements within a layer are highly correlated with each other and with the weights in other layers. Our BiTAT sequentially obtains the quantized weights of each layer based on the importance of the disentangled weights, using a trainable orthonormal rotation matrix and importance vector. Right: Categorization of strong quantization methods relevant to ours.

