BITAT: NEURAL NETWORK BINARIZATION WITH TASK-DEPENDENT AGGREGATED TRANSFORMATION

Abstract

Neural network quantization aims to transform the high-precision weights and activations of a given neural network into low-precision weights and activations, reducing memory usage and computation while preserving the performance of the original model. However, 1-bit weights and 1-bit activations of compactly designed backbone architectures, which are often used for edge-device deployment, result in severe performance degeneration. This paper proposes a novel Quantization-Aware Training method that can effectively alleviate performance degeneration even under extreme quantization by focusing on inter-weight dependencies, both between the weights within each layer and across consecutive layers. To minimize the quantization impact of each weight on others, we perform an orthonormal transformation of the weights at each layer by training an input-dependent correlation matrix and importance vector, such that each weight is disentangled from the others. We then quantize the weights based on their importance to minimize the loss of information from the original weights and activations. We further perform progressive layer-wise quantization from the bottom layer to the top, so that quantization at each layer reflects the quantized distributions of weights and activations at previous layers. We validate the effectiveness of our method on various benchmark datasets against strong neural quantization baselines, demonstrating that it alleviates the performance degeneration on ImageNet and successfully preserves full-precision model performance on CIFAR-100 with compact backbone networks.

1. INTRODUCTION



Despite their impact on NN quantization, such inter-weight dependencies have been relatively overlooked. As shown in Figure 1 Right, although BRECQ (Li et al., 2021) addresses the problem by considering the dependency between filters in each block, it is limited to the Post-Training Quantization (PTQ) setting, which suffers from inevitable information loss and thus inferior performance. Most recent Quantization-Aware Training (QAT) methods (Dbouk et al., 2020; Liu et al., 2020) obtain quantized weights by minimizing quantization losses with parameterized activation functions, disregarding cross-layer weight dependencies. To the best of our knowledge, no prior work explicitly considers dependencies among the weights for QAT. To tackle this challenging problem, we propose a new QAT method, referred to as Neural Network Binarization with Task-dependent Aggregated Transformation (BiTAT), illustrated in Figure 1 Left. Our method sequentially quantizes the weights at each layer of a pre-trained model based on chunk-wise input-dependent weight importance, obtained by training orthonormal dependency matrices and scaling vectors. After quantizing each layer, we fine-tune the subsequent full-precision layers, which take the quantized layer's outputs as input, for a few epochs while keeping the quantized weights frozen. In addition, we aggregate redundant input dimensions of the transformation matrices and scaling vectors, significantly reducing the computational cost of the quantization process.
Such consideration of inter-weight dependencies allows our BiTAT algorithm to better preserve the information of a given high-precision network, allowing it to achieve performance comparable to the original full-precision network even under extreme quantization, such as binarization of both weights and activations. The main contributions of this paper can be summarized as follows:
• We demonstrate that weight dependencies within each layer and across layers play an essential role in preserving model performance during quantized training.
• We propose an input-dependent quantization-aware training method that binarizes neural networks. We disentangle the correlation among weights across multiple layers by training rotation matrices and importance vectors, which guide the quantization process with the importance of the disentangled weights.
• We empirically validate our method on several benchmark datasets against state-of-the-art NN quantization methods, showing that it significantly outperforms baselines with compact neural network architectures.

2. RELATED WORK

Minimizing the quantization error. Quantization methods for deep neural networks can be broadly categorized into several strategies (Qin et al., 2020a). We first introduce methods that aim to minimize the weight/activation discrepancy between quantized models and their high-precision counterparts. XNOR-Net (Rastegari et al., 2016) minimizes the least-squares error between quantized and full-precision weights for each output channel of a layer. DBQ (Dbouk et al., 2020) and QIL (Jung et al., 2019) perform layer-wise quantization with parametric scale or transformation functions optimized for the task. Yet, they quantize full-precision weight elements regardless of the correlation with other weights. While TSQ (Wang et al., 2018) and Real-to-Bin (Martinez et al., 2020) propose to minimize the $\ell_2$ distance between the quantized activations and the real-valued network's activations by leveraging intra-layer weight dependency, they do not consider cross-layer dependencies. ProxyBNN (He et al., 2020a) adopts an orthogonal matrix to preserve the correlation between coordinates while minimizing the quantization error. Recently, BRECQ (Li et al., 2021) and related work on post-training quantization (Nagel et al., 2020) consider the interdependencies between the weights and the activations using a Taylor series-based approach. However, calculating the Hessian matrix of a large neural network is often intractable, so they resort to strong assumptions, such as a block-diagonal structure of the Hessian, to remain feasible. BiTAT solves this problem by training the dependency matrices alongside the quantized weights while grouping similar weights together to reduce the computational cost.

Modifying the task loss function. BNN-DL (Ding et al., 2019) adds a distributional loss that enforces the weight distributions to be quantization-friendly.
Apprentice (Mishra & Marr, 2018) uses knowledge distillation to preserve the knowledge of the full-precision teacher network in the quantized network. However, such methods only constrain the distributional properties of the weights, not the dependencies and values of individual weight elements. CI-BCNN (Wang et al., 2019) parameterizes bitcount operations by exploring the interaction between output channels using reinforcement learning and quantizes the floating-point accumulation in convolution operations accordingly. However, reinforcement learning is expensive, and it still does not consider cross-layer dependencies. RBNN (Lin et al., 2020b) achieves significantly higher cosine similarity between the full-precision weights and their binarization by constraining the model to preserve fewer angular biases.

Reducing the gradient error. Liu et al. (2018) devise a better gradient estimator for the sign function used to binarize the activations, together with a magnitude-aware gradient correction method. PCNN (Gu et al., 2019) proposes a new discrete backpropagation method via projection, where a layer-wise trainable function effectively projects the weights at each layer to multiple quantized weights. ReActNet (Liu et al., 2020) achieves state-of-the-art performance for binary neural networks by training a generalized activation function for the compact network architecture used in Liu et al. (2018). However, their quantizer functions conduct element-wise, unstructured compression without considering the change in other correlated weights during quantized training. This makes the search process converge to suboptimal solutions, since the task loss is the only guide for finding the optimal quantized weights, which is often insufficient for high-dimensional and complex architectures.
On the other hand, we can obtain a better-informed guide that compels the training procedure to spend more time searching in areas that are more likely to contain high-performing quantized weights.

3. WEIGHT IMPORTANCE FOR QUANTIZATION-AWARE TRAINING

We aim to quantize a full-precision neural network into a binary neural network (BNN), composed of binarized 1-bit weights and activations, that preserves the performance of the original full-precision model. Let $f(\cdot\,; \mathcal{W})$ be an $L$-layered neural network parameterized by a set of pre-trained weights $\mathcal{W} = \{w^{(1)}, \dots, w^{(L)}\}$, where $w^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ is the weight at layer $l$ and $d_0$ is the dimensionality of the input. Given a training dataset $\mathcal{X}$ and corresponding labels $\mathcal{Y}$, existing QAT methods (Rastegari et al., 2016; Dbouk et al., 2020; Jung et al., 2019; Bethge et al., 2020; Yamamoto, 2021; Park & Yoo, 2020) search for optimal quantized weights by solving an optimization problem that can be generally described as

$$\operatorname*{minimize}_{\mathcal{W},\,\phi} \; \mathcal{L}_{\text{task}}\big(f(\mathcal{X}; Q(\mathcal{W}; \phi)),\, \mathcal{Y}\big), \tag{1}$$

where $\mathcal{L}_{\text{task}}$ is a standard task loss function, such as the cross-entropy loss, and $Q(\cdot\,; \phi)$ is the weight quantization function parameterized by $\phi$, which transforms a real-valued vector into a discrete, binary vector. Existing works typically quantize by rounding each element of the weights or activations to the nearest quantization value. This is equivalent to minimizing loss terms based on the Mean Squared Error (MSE) between the full-precision and quantized weights at each layer:

$$Q(w) := \alpha^* b^*, \quad \text{where } \alpha^*, b^* = \operatorname*{argmin}_{\alpha \in \mathbb{R},\, b \in \{-1,1\}^m} \|w - \alpha b\|_2^2, \tag{2}$$

where $m$ is the dimensionality of the target weight. For inference, $w_q = Q(w)$ is used. Using this quantizer, QAT methods iteratively search for the quantized weights based on the task loss using stochastic gradient descent, and the model parameters converge into a ball-like region around the full-precision weights $w$. However, the region around the optimal full-precision weights may contain suboptimal solutions with high errors. We demonstrate this inefficiency of the existing quantizer formulation through a simple experiment in Figure 2. Suppose we have three input points, $x_1$, $x_2$, and $x_3$, and full-precision weights $w$.
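The inner problem of the quantizer in Equation 2 has a well-known closed-form solution: $b^* = \operatorname{sign}(w)$ and $\alpha^* = \operatorname{mean}(|w|)$. A minimal numpy sketch (ours, for illustration; not the paper's code):

```python
import numpy as np

def binarize(w):
    """Closed-form solution of min_{alpha, b} ||w - alpha*b||_2^2
    over alpha in R, b in {-1, +1}^m: b* = sign(w), alpha* = mean(|w|)."""
    b = np.sign(w)
    b[b == 0] = 1.0              # break ties at zero
    alpha = np.abs(w).mean()
    return alpha * b

w = np.array([0.9, -1.2, 0.1, -0.4])
w_q = binarize(w)                # alpha = 0.65, w_q = [0.65, -0.65, 0.65, -0.65]
```

Note that this quantizer minimizes the MSE to $w$ in isolation; as the experiment below shows, a small MSE per weight does not imply a small task loss.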
Quantized training of the weights using Equation 2 successfully reduces the MSE between the quantized and full-precision weights, but the task prediction loss using $w_q$ is nonetheless very high. The main source of error is the independent application of the quantization process to each weight element: neural network weights are not independent but highly correlated, so, holding the loss value constant, quantizing (perturbing) one weight will affect the others. Moreover, after quantization, the weight importances can also change significantly. Both factors lead to high errors in the pre-activations. On the other hand, our proposed QAT method BiTAT, described in Section 4, achieves a quantized model with smaller error. This results from the consideration of inter-weight dependencies, which we describe in the next subsection.

How can we, then, find the low-precision subspace containing the best-performing quantized weights on the task by exploiting the inter-weight dependencies? The properties of the input distribution give us some insight. Consider a task composed of $N$ centered training samples $\{x_1, \dots, x_N\} = X \in \mathbb{R}^{N \times d_0}$. We can obtain the principal components of the training samples $v_1, \dots, v_{d_0} \in \mathbb{R}^{d_0}$ and the corresponding variances $\lambda_1, \dots, \lambda_{d_0} \ge 0$, in descending order. When we optimize a single-layered neural network parameterized by $w^{(1)}$, the neurons corresponding to the columns of $w^{(1)}$ are oriented in directions similar to the principal components with higher variances (i.e., $v_i$ rather than $v_j$, for $i < j$), which are much more likely to be activated than the others. We apply a change of basis to the column space of the weight matrix $w^{(1)}$ with the bases $(v_1, \dots, v_{d_0})$:

$$V^{(0)} \widetilde{w}^{(1)} = w^{(1)} \;\Longleftrightarrow\; \widetilde{w}^{(1)} = V^{(0)\top} w^{(1)}, \tag{3}$$

where $V^{(0)} = [v_1 \,|\, \cdots \,|\, v_{d_0}] \in \mathbb{R}^{d_0 \times d_0}$ is an orthonormal matrix.
The top rows of the transformed weight matrix $\widetilde{w}^{(1)}$ contain the more important weights, whereas the bottom rows contain less important ones. Therefore, the accuracy of the model is more affected by perturbations of the weights in the top rows than in the bottom rows. Note that this transformation can also be applied to convolutional layers by "unfolding" the input image or feature map into a set of patches, which converts the convolutional weights into a matrix (detailed descriptions of the orthonormal transformations for convolutional layers are provided in the supplementary file). We can also easily generalize the method to multi-layer neural networks by taking the inputs to the $l$-th layer as the "training set", assuming that all of the previous layers' weights are fixed: $\{x_i^{(l)} = \delta(w^{(l)\top} x_i^{(l-1)})\}_{i=1}^{N}$, where $\delta(\cdot)$ is the nonlinear transformation defined by the non-linear activations and any layers other than the linear transformation with the weights, such as pooling or Batch Normalization. We then obtain the change-of-basis matrix $V^{(l)}$ for layer $l$ via PCA on $\{x_i^{(l-1)}\}$. The impact of transforming the weights is shown in Figure 3. We compute the principal components of each layer in the initial pre-trained model and measure the test accuracy after adding noise to the top-5 highest-variance (dashed red) or lowest-variance (dashed blue) components per layer. While a model with perturbed high-variance components degrades as the noise scale increases, a model with perturbed low-variance components consistently maintains high performance even under large perturbations. This shows that preserving the important weight components, which respond to high-variance input components, is critical for effective neural network quantization.
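The change of basis in Equation 3 can be sketched as follows; this is an illustrative numpy example with synthetic anisotropic inputs (all names and data are ours), showing that perturbing high-variance rows of the transformed weights changes the pre-activations far more than perturbing low-variance rows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)) * np.linspace(3.0, 0.1, 8)  # anisotropic inputs
X -= X.mean(axis=0)                                        # center the samples

# Eigendecomposition of the input covariance yields the PCA basis V^(0).
cov = X.T @ X / len(X)
lam, V = np.linalg.eigh(cov)       # eigenvalues in ascending order
V = V[:, ::-1]                     # reorder so v_1 has the largest variance

w = rng.normal(size=(8, 4))        # a single layer's weights, d0 x d1
w_tilde = V.T @ w                  # transformed weights (change of basis)

def preact_change(row, eps=0.1):
    """Norm of the pre-activation change when row `row` of w_tilde is
    perturbed by eps: large for high-variance rows, tiny for low ones."""
    d = np.zeros_like(w_tilde)
    d[row] = eps
    return np.linalg.norm(X @ (V @ d))
```

Here `preact_change(0)` (the top, highest-variance row) is roughly 30x larger than `preact_change(7)`, mirroring the noise experiment of Figure 3.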

3.2. CROSS-LAYER WEIGHT CORRELATION IMPACTS MODEL PERFORMANCE

So far, we have only described dependencies among weights within a single layer. However, dependencies between weights across different layers also significantly impact performance. To validate this, we perform layer-wise sequential training from the bottom layer to the top. When training each layer, the model computes the principal components of the target layer and adds noise to its top-5 high- or low-variance components. As shown in Figure 3, progressive training with the low-variance components (solid blue) achieves significantly improved accuracy over the end-to-end training counterpart (dashed blue) at high noise scales, which demonstrates the benefit of modeling weight dependencies in earlier layers. We describe further details in the supplementary file.

4. TASK-DEPENDENT WEIGHT TRANSFORMATION FOR NN BINARIZATION

Our objective is to obtain binarized weights $w_q$ given pre-trained full-precision weights. We effectively mitigate performance degeneration from the binarization process by focusing on the inter-weight dependencies within each layer and across consecutive layers. Given a single-layered neural network parameterized by $w^{(1)}$, we first reformulate the quantization function $Q$ in Equation 2 with the weight correlation matrix $V^{(0)}$ and the importance vector $s^{(0)}$, so that each weight is disentangled from the others while larger quantization errors are allowed on the unimportant disentangled weights (unless otherwise stated, we omit the superscript denoting the layer index):

$$Q(w; s, V) = \operatorname*{argmin}_{w_q \in \mathcal{Q}} \big\|\operatorname{diag}(s)\, V^\top (w - w_q)\big\|_F^2 + \gamma \|w_q\|_1, \tag{4}$$

where $V \in \mathbb{R}^{d_0 \times d_0}$, and $s \in \mathbb{R}^{d_0}$ is a scaling term that assigns a different importance score to each row of $V^\top w$. We denote by $\mathcal{Q} = \{\alpha \odot b : \alpha \in \mathbb{R}^{d_1}, b \in \{-1, 1\}^{d_0 \times d_1}\}$ the set of possible binarized values for $w \in \mathbb{R}^{d_0 \times d_1}$ with a scalar scaling factor for each output channel, where $\odot$ is an element-wise multiplication operator with dimensions broadcast appropriately. We additionally include an $\ell_1$ norm adjusted by a hyperparameter $\gamma$. At the same time, we want our quantized model to minimize the empirical task loss (e.g., the cross-entropy loss) on a given dataset. We thus formulate the full objective as a bilevel optimization problem that finds the best quantized weights minimizing the task loss while considering the cross-layer weight dependencies and the relative importance among weights:

$$w^*, s^*, V^* = \operatorname*{argmin}_{w,\, s,\, V} \mathcal{L}_{\text{task}}\big(f(\mathcal{X}; w_q),\, \mathcal{Y}\big), \quad \text{where } w_q = Q(w; s, V). \tag{5}$$

After quantized training, the quantized weights at each layer are determined by $w_q^* = Q(w^*; s^*, V^*)$. In practice, directly solving the above bilevel optimization problem is impractical due to its excessive computational cost.
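The importance-weighted quantization error at the core of the objective $Q(w; s, V)$ above is straightforward to transcribe. A hedged numpy sketch (identity basis and hand-picked importances, purely for illustration) shows that corrupting an important row is penalized far more than corrupting an unimportant one:

```python
import numpy as np

def quant_objective(w, w_q, s, V, gamma=1e-3):
    """||diag(s) V^T (w - w_q)||_F^2 + gamma * ||w_q||_1."""
    resid = np.diag(s) @ V.T @ (w - w_q)
    return np.sum(resid ** 2) + gamma * np.abs(w_q).sum()

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 3))
V = np.eye(4)                        # identity basis, for illustration only
s = np.array([2.0, 1.5, 1.0, 0.1])  # row importances (top row matters most)

err_top = w.copy(); err_top[0] += 0.5   # corrupt an important row
err_bot = w.copy(); err_bot[3] += 0.5   # corrupt an unimportant row
```

With these importances, the same 0.5 perturbation costs roughly $400\times$ more on the top row than on the bottom one, which is exactly the asymmetry the quantizer exploits.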
We therefore consider the following relaxed problem:

$$\alpha^*, w^*, s^*, V^* = \operatorname*{argmin}_{\alpha,\, w,\, s,\, V} \mathcal{L}_{\text{task}}\big(f(\mathcal{X}; \alpha \cdot \operatorname{sgn}(w)),\, \mathcal{Y}\big) + \lambda \big\|\operatorname{diag}(s) V^\top (w - \alpha \cdot \operatorname{sgn}(w))\big\|_F^2 + \gamma \|\alpha \cdot \operatorname{sgn}(w)\|_1, \tag{6}$$

where $\lambda$ is a hyperparameter balancing the quantization objective and the task loss, and $\operatorname{sgn}(w)$ denotes the sign function applied element-wise to $w$. Since it is impossible to compute gradients for the discrete values in quantized weights, we adopt the straight-through estimator (Bengio et al., 2013), which is broadly used across QAT methods; we follow Liu et al. (2020) for the derivative of $\operatorname{sgn}(\cdot)$. Finally, we obtain the desired quantized weights by $w_q^* = \alpha \cdot \operatorname{sgn}(w^*)$. To obtain the off-diagonal parts of the cross-layer dependency matrix $V$, we minimize the following training loss with respect to $s$ and $V$ to dynamically determine their values (we omit $\mathcal{X}$ and $\mathcal{Y}$ from the arguments for readability):

$$\mathcal{L}_{\text{train}}(\alpha, w, s, V) = \mathcal{L}_{\text{task}}\big(f(\mathcal{X}; \alpha \cdot \operatorname{sgn}(w)),\, \mathcal{Y}\big) + \lambda \big\|\operatorname{diag}(s) V^\top (w - \alpha \cdot \operatorname{sgn}(w))\big\|_F^2 + \gamma \|\alpha \cdot \operatorname{sgn}(w)\|_1 + \operatorname{Reg}(s, V), \tag{7}$$

where $\operatorname{Reg}(s, V) := \|V V^\top - I\|^2 + |\sigma - \sum_i \log(s_i)|^2$ is a regularization term that enforces $V$ to be orthogonal and keeps the scale of $s$ constant. Here, $\sigma$ is the constant initial value of $\sum_i \log(s_i)$, the non-negative importance score.
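A minimal numpy sketch of the straight-through estimator used here: quantize in the forward pass, pass the gradient through $\operatorname{sgn}(\cdot)$ unchanged in the backward pass. The clipping window $|w| \le 1$ follows common BNN practice (e.g., Liu et al., 2018); this is an illustrative stand-in, not the authors' implementation:

```python
import numpy as np

def ste_forward(w):
    """Forward: w_q = alpha * sign(w) with a per-tensor scale."""
    alpha = np.abs(w).mean()
    return alpha * np.where(w >= 0, 1.0, -1.0)

def ste_backward(w, grad_wq):
    """Backward: dL/dw ~= dL/dw_q, masked where |w| > 1
    (straight-through estimator, Bengio et al., 2013)."""
    return grad_wq * (np.abs(w) <= 1.0)

w = np.array([0.3, -2.0, 0.7])
g = np.array([1.0, 1.0, 1.0])
# gradient flows through in-range weights only:
# ste_backward(w, g) -> [1.0, 0.0, 1.0]
```

In an actual training loop, `ste_backward` would replace the (zero almost everywhere) true derivative of the sign function when updating the latent full-precision weights.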


4.1. LAYER-PROGRESSIVE QUANTIZATION WITH BLOCK-WISE WEIGHT DEPENDENCY

We now extend our formulation to multi-layered neural networks with cross-layer weight dependency. Given the objective function above, it is inefficient to perform quantization-aware training while considering the complete correlations of all weights in the given neural network. Therefore, we only consider cross-layer dependencies between a few consecutive layers (which we denote as a block), and initialize $s$ and $V$ using Principal Component Analysis (PCA) on the inputs to the layers within each block. Formally, we define a weight correlation matrix for a neural network block, $V^{(\text{block})} \in \mathbb{R}^{(\sum_{i=1}^{k} d_i) \times (\sum_{i=1}^{k} d_i)}$, where $k$ is the number of layers in the block, similarly to the block-diagonal formulation in Li et al. (2021), expressing the dependencies between weights across layers in the off-diagonal parts. We initialize $s^{(l)}$ and the diagonal blocks $V^{(l)}$ by applying PCA to the input covariance matrix:

$$s^{(l)} \leftarrow (\lambda^{(l)})^{\frac{1}{2}}, \quad V^{(l)} \leftarrow U^{(l)}, \quad \text{where } U^{(l)} \lambda^{(l)} (U^{(l)})^\top := \frac{1}{N} \sum_{i=1}^{N} o_i^{(l-1)} o_i^{(l-1)\top}, \tag{8}$$

where $o_i^{(l)}$ is a column vector, the output of the $l$-th layer, and $o_i^{(0)} = x_i$. This allows the weights at the $l$-th layer to consider the dependencies on the weights from earlier layers within the same neural block; we refer to this method as sequential quantization, whereby the model alleviates the quantization errors accumulated through propagation from lower to higher consecutive layers while preserving the performance of the quantized model. Then, instead of keeping one set of $s$ and $V$ per layer, we keep the previous layers' $s$ and $V$ and expand them. Specifically, when quantizing layer $l$, which is part of a block starting at layer $m$, we first apply PCA to the input covariance matrix to obtain $\lambda^{(l)}$ and $U^{(l)}$.
We then expand the existing $s^{(m:l-1)}$ and $V^{(m:l-1)}$ to obtain $s^{(m:l)} \in \mathbb{R}^{D + d_{l-1}}$ and $V^{(m:l)} \in \mathbb{R}^{(D + d_{l-1}) \times (D + d_{l-1})}$ as follows*:

$$[s^{(m:l)}]_i := \begin{cases} [s^{(m:l-1)}]_i, & i \le D, \\ [(\lambda^{(l)})^{\frac{1}{2}}]_{i-D}, & D < i, \end{cases} \qquad [V^{(m:l)}]_{i,j} := \begin{cases} [V^{(m:l-1)}]_{i,j}, & i, j \le D, \\ [U^{(l)}]_{i-D,\, j-D}, & D < i, j, \\ 0, & \text{otherwise}, \end{cases}$$

where $D = \sum_{i=m}^{l-2} d_i$, as illustrated in Figure 4. The weight dependencies between different layers (i.e., the off-diagonal areas) are trainable and zero-initialized. That is, at each layer-wise quantization step in the target block, we train the importance vector and the orthonormal correlation matrix, whose expanded areas are initialized with the PCA components of the current layer's inputs. To enable the matrix multiplication of the weights with the expanded $s$ and $V$, we define the expanded block weights†: $\widetilde{w}^{(m:l)} = [\operatorname{PadCol}(\widetilde{w}^{(m:l-1)}, d_l);\, w^{(l)}]$, where $\operatorname{PadCol}(\cdot, c)$ zero-pads the input matrix on the right by $c$ columns. Our final training objective with cross-layer dependencies is then given as follows:

$$\mathcal{L}_{\text{train}}(w^{(l:L)}, s^{(m:l)}, V^{(m:l)}) = \mathcal{L}_{\text{task}}\big(f(\mathcal{X}; \{\alpha \cdot \operatorname{sgn}(w^{(l)}), w^{(l+1:L)}\}),\, \mathcal{Y}\big) + \lambda \big\|\operatorname{diag}(s^{(m:l)})\, V^{(m:l)\top} \big(\widetilde{w}^{(m:l)} - \alpha \cdot \operatorname{sgn}(\widetilde{w}^{(m:l)})\big)\big\|_F^2 + \gamma \big\|\alpha \cdot \operatorname{sgn}(\widetilde{w}^{(m:l)})\big\|_1 + \operatorname{Reg}(s^{(m:l)}, V^{(m:l)}).$$

Algorithm 1: Neural Network Binarization with Task-dependent Aggregated Transformation
1: Input: Pre-trained weights $w^{(1)}, \dots, w^{(L)}$ for $L$ layers, task loss function $\mathcal{L}$, maximum input-dimension group size $k$, quantization epochs per layer $N_{ep}$.
2: Output: Quantized weights $w_q^{*(1)}, \dots, w_q^{*(L)}$.
3: $B_1, \dots, B_n \leftarrow$ divide the neural network into $n$ blocks
4: for each block $B$ do
5:   $s = [\,]$, $V = [\,]$
6:   for each layer $l$ in $B$ do
7:     $o^{(l-1)} \leftarrow$ inputs for layer $l$
8:     $P \leftarrow$ if $d_{l-1} > k$ then K-MEANS$(X^{(l)}, k)$ else $I_{d_{l-1}}$ ▷ grouping permutation matrix
9:     $U \operatorname{diag}(\lambda) U^\top = \text{PCA}(P o^{(l-1)})$ ▷ initialization values for the expanded part
10:    $s \leftarrow [s;\, \lambda^{\frac{1}{2}}]$, $V \leftarrow \begin{bmatrix} V & 0 \\ 0 & U \end{bmatrix}$ ▷ expand $s$ and $V$
11:    $\alpha^{(l:L)}, w^{(l:L)}, s, V \leftarrow \operatorname{argmin} \mathcal{L}_{\text{train}}(\alpha, w, s, V)$ ▷ iterate for $N_{ep}$ epochs
12:    $w_q^{(l)} \leftarrow \alpha^{(l)} \cdot \operatorname{sgn}(w^{(l)})$
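The expansion of $s$ and $V$ described above amounts to concatenating the new layer's PCA scales and embedding its basis block-diagonally, with zero-initialized (trainable) off-diagonal blocks. A small numpy sketch with hypothetical dimensions:

```python
import numpy as np

def expand(s_prev, V_prev, lam_l, U_l):
    """Append the new layer's PCA scales/basis; the off-diagonal
    (cross-layer) blocks start at zero and are trained afterwards."""
    D, d = len(s_prev), len(lam_l)
    s = np.concatenate([s_prev, np.sqrt(lam_l)])
    V = np.zeros((D + d, D + d))
    V[:D, :D] = V_prev
    V[D:, D:] = U_l
    return s, V

s0, V0 = np.array([1.5, 0.8]), np.eye(2)         # earlier layers in the block
lam, U = np.array([4.0, 1.0, 0.25]), np.eye(3)   # PCA of current layer inputs
s, V = expand(s0, V0, lam, U)                    # s has length 5, V is 5x5
```

The identity matrices stand in for the PCA bases here; in the actual method the zero blocks of `V` become trainable cross-layer dependency entries.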
Given the backbone architecture with $L$ layers, we minimize $\mathcal{L}_{\text{train}}$ with respect to $w^{(l)}$, $s^{(l)}$, and $V^{(l)}$ to find the desired binarized weights $w_q^{*(l)}$ for layer $l$ while keeping the other layers frozen. Next, we fine-tune the subsequent layers using the task loss function for a few epochs before performing QAT on them, as illustrated in Figure 4. This sequential quantization proceeds from the bottom layer to the top, and the obtained binarized weights stay frozen for the remainder of training.
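At a high level, this progressive schedule can be sketched as the following loop, where `quantize_layer` applies the per-output-channel binarizer and `finetune` is a placeholder for the task-loss fine-tuning of the remaining full-precision layers (both are simplified stand-ins for the procedure described above, not the actual training code):

```python
import numpy as np

def quantize_layer(w):
    """Per-output-channel binarization: alpha_j * sign(w[:, j])."""
    alpha = np.abs(w).mean(axis=0, keepdims=True)
    return alpha * np.sign(w)

def finetune(layers, frozen_upto, epochs=1):
    """Placeholder: fine-tune full-precision layers above `frozen_upto`
    with the task loss (omitted in this sketch)."""
    return layers

# three toy "layers" of a pre-trained network
layers = [np.random.default_rng(l).normal(size=(4, 4)) for l in range(3)]

for l in range(len(layers)):
    layers[l] = quantize_layer(layers[l])       # binarize and freeze layer l
    layers = finetune(layers, frozen_upto=l)    # adapt the layers above it
```

After the loop, every column of every layer takes only two values, $\pm\alpha_j$, which is what enables XNOR-Bitcount inference.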

4.2. COST-EFFICIENT BITAT: AGGREGATED WEIGHT CORRELATION USING A REDUCTION MATRIX

We derived a QAT formulation that focuses on cross-layer weight dependency by learning block-wise weight correlation matrices. Yet, as the number of inputs to higher layers is often large, the model constructs high-dimensional $V^{(l)}$ in upper blocks, which is costly. To reduce the training memory footprint as well as the computational complexity, we aggregate the input dimensions into several small groups based on functional similarity using k-means clustering. First, we take the outputs of the $l$-th layer, $o_1^{(l)}, \dots, o_N^{(l)} \in \mathbb{R}^{d_l}$, and, viewing each output dimension across samples, obtain $d_l$ points $p_1, p_2, \dots, p_{d_l} \in \mathbb{R}^N$; we then cluster the points into $k$ groups using k-means clustering, each containing roughly $d_l/k$ points. Let $g_i \in \{1, 2, \dots, k\}$ indicate the group index of $p_i$, for $i = 1, \dots, d_l$. We construct the reduction matrix $P \in \mathbb{R}^{k \times d_l}$, where $P_{ij} = \frac{1}{d_l/k}$ if $g_j = i$, and $0$ otherwise. Each group corresponds to a single row of the reduced $V^{(l+1)} \in \mathbb{R}^{k \times k}$, instead of the original $d_l \times d_l$ matrix. In practice, this significantly reduces the memory consumption of $V$ (down to 0.07%). We then replace $s$ and $V^\top$ in the training objective with $s$ and $V^\top P$, respectively, initializing $s$ and $V$ with the grouped input covariance $\frac{1}{N} \sum_{i=1}^{N} (P o_i^{(l-1)})(P o_i^{(l-1)})^\top$.
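Constructing the reduction matrix $P$ from group assignments can be sketched as follows (we hard-code hypothetical k-means labels instead of running the clustering; each row of $P$ averages the input dimensions of one group):

```python
import numpy as np

def reduction_matrix(groups, k):
    """P in R^{k x d}: row i averages the input dimensions assigned to group i."""
    d = len(groups)
    P = np.zeros((k, d))
    for j, g in enumerate(groups):
        P[g, j] = 1.0
    P /= np.maximum(P.sum(axis=1, keepdims=True), 1)   # average within group
    return P

groups = np.array([0, 0, 1, 2, 1, 2])   # hypothetical k-means labels, d_l = 6
P = reduction_matrix(groups, k=3)        # 3 x 6; V shrinks from 6x6 to 3x3
o = np.arange(6.0)
reduced = P @ o                          # grouped (averaged) activations
```

Here `P` averages by each group's actual size, which coincides with the $1/(d_l/k)$ entries in the text whenever the groups are balanced.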

5. EXPERIMENTS

We validate our quantization-aware training method, BiTAT, on multiple benchmark datasets: CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and the ILSVRC2012 ImageNet (Deng et al., 2009) datasets. We use the MobileNet V1 (Howard et al., 2017) backbone network, a compact neural architecture designed for mobile devices. We follow the overall experimental setups of prior works (Yamamoto, 2021; Liu et al., 2020). BRECQ introduces an adaptive PTQ method focusing on weight dependency via Hessian matrix computations, resulting in significant performance deterioration and excessive training time. DBQ and LCQ suggest QAT methods, but the degree of bit-width compression for the weights and activations is limited to 2 to 8 bits, which is insufficient for our goal of neural network binarization with 1-bit weights and activations. MeliusNet suffers only a small accuracy drop, but it has a high OP count. Because DBQ and LCQ restrict the bit-width to 4 bits or higher, they cannot enjoy the XNOR-Bitcount optimization for speedup. Although Bi-Real Net, Real-to-Bin, and EBConv successfully achieve neural network binarization, they adopt the over-parameterized ResNet as the backbone network, resulting in a higher OP count; moreover, except for EBConv, these works still suffer from a significant accuracy drop. ReActNet binarizes all of the weights and activations (except the first and last layers) in compact network architectures while preventing model convergence failure. Nevertheless, it still suffers from considerable performance degeneration of the binarized model. In contrast, our BiTAT prevents information loss during quantized training down to 1 bit, outperforming ReActNet by 0.37% on ImageNet, 0.53% on CIFAR-10, and 2.31% on CIFAR-100. Note that BiTAT further achieves performance on par with the full-precision MobileNet backbone on CIFAR-100.
The results support our claim on layer-wise quantization from the bottom layer to the top, reflecting the disentangled weight importance and correlation with the quantized weights at earlier layers.

Ablation study

We conduct ablation studies to analyze the effect of the salient components of our proposed method in Figure 7 Left. BiTAT based on layer-wise sequential quantization without weight transformation already surpasses the performance of ReActNet, demonstrating that layer-wise progressive QAT, through an implicit reflection of adjusted importance, plays a critical role in preserving the pre-trained model during quantization. Adopting the intra-layer weight transformation using the input-dependent orthonormal matrix brings no significant benefit; we thus expect that disentangling only intra-layer weight dependency is insufficient to fully reflect the adjusted importance of each weight after binarization of earlier weights/activations. This is evidenced by BiTAT with both intra-layer and cross-layer weight dependencies achieving improved performance over the intra-layer-only variant. Yet, this requires considerable additional training time to compute



* $[\,\cdot\,]_i$ indicates the $i$-th element of the object inside the brackets. † $[A; B]$ indicates vertical concatenation of the matrices $A$ and $B$.



Figure 1: Left: An illustration of our proposed method. Weight elements in a layer are highly correlated with each other, as well as with the weights in other layers. Our BiTAT sequentially obtains the quantized weights of each layer based on the importance of the disentangled weights, using a trainable orthonormal rotation matrix and importance vector. Right: Categorization of quantization methods relevant to ours.


Figure 2: A simple experiment showing that cross-layer weight correlation is critical for finding well-performing quantized weights during QAT.

Figure 3: Solid lines: test accuracy of a MobileNetV2 model on the CIFAR-100 dataset after adding Gaussian noise to the top 5 rows or the bottom 5 rows of $\widetilde{w}^{(l)}$ for all layers, considering the dependency on the lower layers. Dashed lines: without considering the dependency on the lower layers. The x-axis is in log scale.

Figure 4: Quantization-aware training with BiTAT: we perform a sequential training process, alternating quantization training of a layer with rapid fine-tuning of the upper layers. At each layer-wise quantization step, we also train the importance vector and the orthonormal correlation matrix, which are initialized with the PCA components of the current and lower layers' inputs in the target block, and which guide the quantization to consider the importance of the disentangled weights.

Figure 5: Initialization of the block correlation matrix.

We describe the full training process of our proposed method in Algorithm 1. The total number of training epochs is $O(L N_{ep})$, where $L$ is the number of layers and $N_{ep}$ is the number of epochs of the quantization step for each layer.

Table 1: Performance comparison of BiTAT with baselines. We report the averaged test accuracy across three independent runs. The best results are highlighted in bold, and results of cost-expensive models (> 10^8 ImageNet FLOPs) are de-emphasized in gray. Results marked † are reported from the respective papers. While our method targets the QAT problem, we extensively compare BiTAT against various methods; the Post-Training Quantization (PTQ) method BRECQ (Li et al., 2021), and the Quantization-Aware Training (QAT) methods DBQ (Dbouk et al., 2020), EBConv (Bulat et al., 2021), Bi-Real Net (Liu et al., 2018), Real-to-Bin (Martinez et al., 2020), LCQ (Yamamoto, 2021), MeliusNet (Bethge et al., 2020), and ReActNet (Liu et al., 2020). Note that DBQ, LCQ, and MeliusNet keep some crucial layers, such as 1×1 downsampling layers, in full precision, leading to inefficiency at evaluation time. Due to the page limit, we provide details on the baselines, the training and inference phases during QAT, and the hyperparameter setups in the supplementary file.


with a chunk-wise transformation matrix. In the end, BiTAT with aggregated transformations, our full method, outperforms the defective variants in terms of both model performance and training time by drastically removing redundant correlation through reduction matrices. We note that using k-means clustering for aggregated correlation is also essential: another variant, BiTAT with filter-wise transformations, which instead aggregates the weights filter-wise, results in deteriorated performance.

5.2. QUALITATIVE ANALYSIS

Visualization of the reduction matrix. We visualize the weight grouping of BiTAT in Figure 7 Right to analyze the effect of the reduction matrix, which groups the weight dependencies in each layer based on the similarity between input dimensions. Each 3×3 square represents a convolutional filter, and each unique color represents the group to which each weight element is assigned, determined by the k-means algorithm as described in Section 4.2. We observe that weight elements in the same filter do not share their dependencies; rather, on average, they belong to four to five different weight groups. In contrast, BRECQ regards the weights in each filter as one group when computing the dependencies across layers, which is problematic since weight elements in the same filter can behave differently from each other.

Visualization of cross-layer weight dependency. In Figure 6, we visualize the learned transformation matrices $V$ (top row), which show that many weight elements at each layer also depend on other layers' weights, as highlighted in darker colors, verifying our initial claim. Further, we visualize their multiplications with the corresponding importance vectors, $\operatorname{diag}(s) V^\top$ (bottom row). Here, the rows of $V^\top$ are sorted by relative importance in increasing order at each layer. We observe that important weights in a layer affect other layers, demonstrating that cross-layer weight dependency impacts model performance during quantized training.

6. CONCLUSION

In this work, we explored a long-overlooked factor that is crucial for preventing performance degeneration under extreme neural network quantization: inter-weight dependencies. That is, quantization of a set of weights affects the weights of other neurons within the same layer, as well as the weights in consecutive layers. Grounded in empirical analyses of this interdependency, we proposed a Quantization-Aware Training (QAT) method for binarizing the weights and activations of a given neural network with minimal loss of performance. Specifically, we proposed an orthonormal transformation of the weights at each layer that disentangles the correlation among the weights, minimizing the negative impact of quantizing one weight on the others. Further, we learned a scaling term that allows a varying degree of quantization error for each weight based on its measured importance during layer-wise quantization. We then proposed an iterative algorithm that performs the layer-wise quantization progressively. We demonstrated the effectiveness of our method for neural network binarization on multiple benchmark datasets with compact backbone networks, largely outperforming state-of-the-art baselines.

