BRECQ: PUSHING THE LIMIT OF POST-TRAINING QUANTIZATION BY BLOCK RECONSTRUCTION

Abstract

We study the challenging task of neural network quantization without end-to-end retraining, called post-training quantization (PTQ). PTQ usually requires only a small subset of training data but produces less powerful quantized models than quantization-aware training (QAT). In this work, we propose a novel PTQ framework, dubbed BRECQ, which pushes the limits of bit-width in PTQ down to INT2 for the first time. BRECQ leverages the basic building blocks in neural networks and reconstructs them one-by-one. In a comprehensive theoretical study of the second-order error, we show that BRECQ achieves a good balance between cross-layer dependency and generalization error. To further exploit the power of quantization, the mixed precision technique is incorporated into our framework by approximating the inter-layer and intra-layer sensitivities. Extensive experiments on various handcrafted and searched neural architectures are conducted for both image classification and object detection tasks. For the first time, we show that, without bells and whistles, PTQ can attain 4-bit ResNet and MobileNetV2 models comparable with QAT, while enjoying 240× faster production of quantized models.

1. INTRODUCTION

The past decade has witnessed the rapid development of deep learning in many tasks, such as computer vision and autonomous driving. However, the huge computation cost and memory footprint of deep learning have received considerable attention. Some works, such as neural architecture search (Zoph & Le, 2016), try to design and search tiny networks, while others, like quantization (Hubara et al., 2017) and network pruning (Han et al., 2015), compress and accelerate off-the-shelf well-trained, redundant networks. Many popular quantization and pruning methods follow a simple pipeline: train the original model, then fine-tune the quantized/pruned model. However, this pipeline requires the full training dataset and substantial computation resources to perform end-to-end backpropagation, which greatly delays the production cycle of compressed models. Besides, the training data are not always available due to privacy concerns. Therefore, there is growing demand in industry for quantizing neural networks without retraining, which is called post-training quantization (PTQ). Although PTQ is fast and light, it suffers from severe accuracy degradation when the quantization precision is low. For example, DFQ (Nagel et al., 2019) can quantize ResNet-18 to 8-bit without accuracy loss (69.7% top-1 accuracy), but with 4-bit quantization it only achieves 39% top-1 accuracy. The primary reason is that approximation in the parameter space is not equivalent to approximation in the model space, so minimizing the former does not guarantee minimization of the final task loss. Recent works like Nagel et al. (2020) recognized this problem and analyzed the loss degradation by Taylor series expansion: analysis of the second-order error term indicates that we can reconstruct each layer's output to approximate the task-loss degradation.
However, their work cannot further quantize the weights to INT2, because the cross-layer dependency in the Hessian matrix cannot be ignored when the perturbation on the weights is not small enough. In this work, we analyze the second-order error based on the Gauss-Newton matrix. We show that the second-order error can be transformed into the network's final output, but this transformation suffers from bad generalization. To achieve the best trade-off, we adopt an intermediate choice: block reconstruction. Our contributions are threefold:

1. Based on the second-order analysis, we define a set of reconstruction units and show that block reconstruction is the best choice, with support from theoretical and empirical evidence. We also use the Fisher Information Matrix to assign each pre-activation an importance measure during reconstruction.

2. We incorporate a genetic algorithm and a well-defined intra-block sensitivity measure to generate latency- and size-guaranteed mixed precision quantized neural networks, which yields a general improvement on both specialized hardware (FPGA) and general hardware (ARM CPU).

3. We conduct extensive experiments to verify the proposed methods. We find that our method is applicable to a large variety of tasks and models. Moreover, we show for the first time that post-training quantization can quantize weights to INT2 without significant accuracy loss.

2. PRELIMINARIES

Notations. Vectors are denoted by small bold letters and matrices (or tensors) by capital bold letters. For instance, W and w represent the weight tensor and its flattened version. A bar accent denotes the expectation over data points, e.g. ā. A bracketed superscript w^(ℓ) indicates the layer index. For a convolutional or a fully-connected layer, we mark its input and output vectors by x and z. Thus, given a feedforward neural network with n layers, we can denote the forward process by

x^(ℓ+1) = h(z^(ℓ)) = h(W^(ℓ) x^(ℓ) + b^(ℓ)), 1 ≤ ℓ ≤ n,

where h(·) indicates the activation function (ReLU in this paper). For simplicity, we omit the analysis of the bias b^(ℓ), as it can be merged into the activation. ||·||_F denotes the Frobenius norm.

Quantization Background. Uniform symmetric quantization maps floating-point numbers to a set of fixed points. These points (or grids) have the same interval and are symmetrically distributed. We denote this set as Q_b^(u,sym) = s × {−2^(b−1), …, 0, …, 2^(b−1) − 1}, where s is the step size between two grids and b is the bit-width. The quantization function, denoted by q(·): R → Q_b^(u,sym), is generally designed to minimize the quantization error:

min ||ŵ − w||_F^2, s.t. ŵ ∈ Q_b^(u,sym).   (2)

Solving this minimization problem, one can easily obtain q(·) by leveraging the rounding-to-nearest operation ⌊·⌉. Rounding-to-nearest is a prevalent method to perform quantization, e.g. PACT (Choi et al., 2018). However, recent empirical and theoretical evidence suggests that simply minimizing the quantization error in the parameter space does not yield optimal task performance. Specifically, Esser et al. (2020) propose to learn the step size s by gradient descent in quantization-aware training (QAT). LAPQ (Nahshan et al., 2019) searches for the step size that minimizes the loss function without re-training the weights.
Their motivations all point toward minimizing a final objective, i.e., the task loss: min E[L(ŵ)], s.t. ŵ ∈ Q_b^(u,sym). While this optimization objective is simple and can be well optimized in QAT scenarios, it is not easy to learn the quantized weights without end-to-end finetuning and without sufficient training data and computing resources. In the post-training quantization setting, we only have the full-precision weights w = arg min_w E[L(w)], w ∈ R^d, and a small subset of training data for calibration.

Taylor Expansion. It turns out that quantization imposed on weights can be viewed as a special case of weight perturbation. To quantitatively analyze the loss degradation caused by quantization, Nagel et al. (2020) use a Taylor series expansion and approximate the loss degradation by

E[L(w + ∆w)] − E[L(w)] ≈ ∆w^T ḡ^(w) + ½ ∆w^T H^(w) ∆w,

where ḡ^(w) = E[∇_w L] and H^(w) = E[∇²_w L] are the expected gradient and Hessian, and ∆w is the weight perturbation. Given that the pre-trained model has converged to a minimum, the gradient can safely be taken as close to 0. However, optimizing with the large-scale full Hessian is memory-infeasible on many devices, as the full Hessian requires terabytes of memory space. To tackle this problem, they make two assumptions:

1. Layers are mutually independent, so the Hessian is layer-diagonal and Kronecker-factored, i.e., H^(w^(ℓ)) = E[x^(ℓ) x^(ℓ),T ⊗ H^(z^(ℓ))], where ⊗ is the Kronecker product.

2. The second-order derivative of the pre-activations is a constant diagonal matrix (H^(z^(ℓ)) = c × I), independent of the input data points.

Under these assumptions, the objective is transformed into a practical proxy signal, the change in feature maps (z = Wx), and the quantized model can be obtained by a layer-by-layer feature-map reconstruction algorithm (with few calibration images).
Recent works, like Bit-Split (Wang et al., 2020) and AdaQuant (Hubara et al., 2020), also adopt this layer-wise objective to improve post-training quantization. However, they fail to quantize weights to INT2. We believe the inherent reason is that when ∆w grows larger, the former assumptions no longer hold and a more accurate signal is required.
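The round-to-nearest solution of Eq. (2) can be sketched in a few lines of numpy (the min-max choice of the step size s below is a simplification; in practice s is calibrated against a loss):

```python
import numpy as np

def quantize_uniform_symmetric(w, bits=4, step=None):
    """Round-to-nearest onto the grid s * {-2^(b-1), ..., 2^(b-1) - 1}."""
    qmin, qmax = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    if step is None:
        # simple min-max choice of s; in practice s is tuned against a loss
        step = np.abs(w).max() / qmax
    q = np.clip(np.round(w / step), qmin, qmax)   # integer grid indices
    return step * q, step

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
w_hat, s = quantize_uniform_symmetric(w, bits=4)
# a 4-bit tensor can take at most 2^4 = 16 distinct values
assert len(np.unique(w_hat)) <= 16
```

Quantization error here is roughly s²/12 per weight, far smaller than the weight variance for moderate bit-widths.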

3. PROPOSED METHOD

3.1. CROSS-LAYER DEPENDENCY

Denote the neural network output by z^(n) = f(θ); the loss function can be represented by L(f(θ)), where θ = vec[w^(1),T, …, w^(n),T]^T is the stacked vector of the weights in all n layers. The Hessian matrix can be computed by

∂²L/(∂θ_i ∂θ_j) = ∂/∂θ_j ( Σ_{k=1}^m ∂L/∂z_k^(n) · ∂z_k^(n)/∂θ_i )
= Σ_{k=1}^m ∂L/∂z_k^(n) · ∂²z_k^(n)/(∂θ_i ∂θ_j) + Σ_{k,l=1}^m ∂z_k^(n)/∂θ_i · ∂²L/(∂z_k^(n) ∂z_l^(n)) · ∂z_l^(n)/∂θ_j,   (5)

where z^(n) ∈ R^m. Since the pretrained full-precision model has converged to a local minimum, we can assume the Hessian is positive semi-definite (PSD). Specifically, the converged model has ∇_{z^(n)} L close to 0, so the first term in Eq. (5) is neglected and the Hessian becomes the Gauss-Newton (GN) matrix G^(θ). The GN matrix can be written in matrix form (Botev et al., 2017) as

H^(θ) ≈ G^(θ) = J_{z^(n)}(θ)^T H^(z^(n)) J_{z^(n)}(θ),   (6)

where J_{z^(n)}(θ) is the Jacobian matrix of the network output with respect to the network parameters. However, in practice we cannot explicitly compute and store the Jacobian for each input data point in such a raw form. To reduce the computation and memory budget, we transform the second-order error into the network output, as shown in the following theorem.

Theorem 3.1. Consider an n-layer feedforward neural network with ReLU activation function. Assuming all weights are quantized, the second-order error optimization can be transformed by

arg min_θ ∆θ^T H^(θ) ∆θ ≈ arg min_θ E[∆z^(n),T H^(z^(n)) ∆z^(n)].

Remark 3.1. The same transformation is also applicable for activation quantization. The quadratic loss is defined as E[∆γ^T H^(γ) ∆γ], where ∆γ = vec[∆x^(1),T, …, ∆x^(n),T]^T.

We prove the theorem using the quadratic form; details can be found in Appendix A.1. Here we provide a sketch of the proof in matrix form. The product of the perturbation and the Jacobian can be viewed as the first-order Taylor approximation of the change in the network output ∆z^(n):

∆z^(n) = ẑ^(n) − z^(n) ≈ J_{z^(n)}(θ) ∆θ.   (8)

Therefore, combining Eq.
(8) and Eq. (6), we can transform the large-scale second-order error into the change in the network output, characterized by the output Hessian H^(z^(n)). The theorem implies a simple observation: given a well-trained teacher model and an initialized student model, we can minimize their discrepancy by reconstructing the network's final output z^(n), which coincides with and generalizes distillation (Hinton et al., 2015; Polino et al., 2018). LAPQ (Nahshan et al., 2019) also considers the dependency, but its optimization does not rely on second-order information. We should emphasize, however, that distillation requires the same computation and data resources as the normal training procedure, which is impractical for PTQ with limited data.
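Theorem 3.1 can be checked numerically on a toy network: for a small perturbation ∆θ, the Gauss-Newton quadratic form ∆θ^T J^T H^(z) J ∆θ should closely match ∆z^T H^(z) ∆z computed from the actual output change. The sketch below is our own toy setup, with the softmax cross-entropy output Hessian diag(p) − pp^T standing in for H^(z^(n)) and a finite-difference Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU network z = W2 @ relu(W1 @ x); theta stacks both weights.
d_in, d_h, d_out = 4, 6, 3
x = rng.normal(size=d_in)
shapes = [(d_h, d_in), (d_out, d_h)]
split = shapes[0][0] * shapes[0][1]

def forward(theta):
    W1 = theta[:split].reshape(shapes[0])
    W2 = theta[split:].reshape(shapes[1])
    return W2 @ np.maximum(W1 @ x, 0.0)

theta = 0.5 * rng.normal(size=split + shapes[1][0] * shapes[1][1])
z = forward(theta)

# Numeric Jacobian J of the network output w.r.t. theta.
eps = 1e-6
J = np.stack([(forward(theta + eps * e) - z) / eps
              for e in np.eye(theta.size)], axis=1)

# PSD output Hessian of softmax cross-entropy: diag(p) - p p^T.
p = np.exp(z) / np.exp(z).sum()
Hz = np.diag(p) - np.outer(p, p)

# A small perturbation standing in for quantization noise.
dtheta = 1e-3 * rng.normal(size=theta.size)
dz = forward(theta + dtheta) - z              # true output change
quad_gn = dtheta @ (J.T @ Hz @ J) @ dtheta    # Gauss-Newton form, Eq. (6)
quad_out = dz @ Hz @ dz                       # output-space form, Theorem 3.1
assert np.allclose(quad_gn, quad_out, rtol=5e-2, atol=1e-12)
```

The agreement degrades as ∆θ grows, which mirrors the paper's argument that larger quantization noise (INT2) demands a more accurate signal.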

3.2. BLOCK RECONSTRUCTION

Although the network-output reconstruction has an accurate estimation of the second-order error, we find in practice that it is worse than layer-by-layer reconstruction in PTQ. The primary reason is that optimizing the whole network over 1024 calibration samples easily leads to over-fitting. As Jakubovitz et al. (2019) explain, networks can have perfect expressivity when the number of parameters exceeds the number of training samples, but a lower training error does not ensure a lower test error. We find that layer-wise reconstruction acts like a regularizer, reducing the generalization error by matching each layer's output distribution. In other words, both layer-wise and network-wise output reconstruction have their own drawbacks, and there should be a better bias-variance trade-off at an intermediate reconstruction granularity. Layer-wise optimization corresponds to a layer-diagonal Hessian (Fig. 1b, blue parts) and network-wise optimization corresponds to the full Hessian (Fig. 1b, green parts). Similarly, we can define an intermediate block-diagonal Hessian. Formally, if layer k to layer ℓ (where 1 ≤ k < ℓ ≤ n) form a block, the block weight vector is θ̃ = vec[w^(k),T, …, w^(ℓ),T]^T and the second-order error can likewise be transformed by ∆θ̃^T H^(θ̃) ∆θ̃ = E[∆z^(ℓ),T H^(z^(ℓ)) ∆z^(ℓ)]. Such a block-diagonal Hessian ignores the inter-block dependency while keeping the intra-block dependency, and it produces less generalization error. We can then reconstruct the intermediate outputs block-by-block. To this end, we define two extra kinds of intermediate reconstruction granularity: stage-wise reconstruction and block-wise reconstruction. These 4 reconstruction granularities are described below:

1. Layer-wise Reconstruction: Assume the Hessian matrix is layer-diagonal and optimize the layer outputs one-by-one.
It does not consider cross-layer dependency and resembles existing methods (Nagel et al., 2020; Hubara et al., 2020; Wang et al., 2020).

2. Block-wise Reconstruction: A block is the core component of modern CNNs, such as the Residual Bottleneck Block shown in Fig. 1a. This method assumes the Hessian matrix is block-diagonal and performs reconstruction block-by-block, ignoring inter-block dependencies.

3. Stage-wise Reconstruction:

A stage is where the feature maps are downsampled and the number of channels increases, which is believed to produce higher-level features. A typical CNN for the ImageNet dataset contains 4 or 5 different stages. This method simultaneously optimizes all layers within a stage and thus considers more dependencies than the block-wise method.

4. Network-wise Reconstruction: Optimize the whole quantized network by reconstructing the output of the final layer. This method resembles distillation but does not yield good performance with few images because of the high generalization error.

The relationship between network, stage, block, and layer is illustrated in Fig. 1a. We test these 4 kinds of reconstruction granularity and find that block-wise optimization outperforms the others. We think this is because the main off-diagonal loss in the Hessian is concentrated within each block, as the orange part of Fig. 1b illustrates, while the inter-block loss is small and can be ignored in the optimization. The shortcut connections proposed in He et al. (2016) may also increase the dependencies within a block. Meanwhile, stage-wise and network-wise optimization suffer from bad generalization on the validation set and degenerate the final performance. We report the quantitative comparison in Sec. 4.1.

Algorithm 1: BRECQ optimization
Input: pretrained FP model; calibration dataset; iterations T
for each block i = 1, 2, …, N in the FP model do
    Collect the input data to the block x^(i), the FP output z^(i), and its gradient g^(z^(i));
    for each iteration j = 1, 2, …, T do
        Get the quantized output ẑ^(i) and compute ∆z^(i) = z^(i) − ẑ^(i);
        Descend Eq. (10) and update the rounding of all the weights in this block (Eq. (16));
        if activation quantization is triggered then
            Update the activation quantization step size (Eq. (18));
After optimization, compute the sensitivity for each layer and between layers (2-bit only);
return quantized model; sensitivities for mixed precision
We name our algorithm BRECQ because we choose the block as our base reconstruction unit. It is worth pointing out that our analysis does not give the optimal configuration of the reconstruction granularity. The choice of block-wise optimization comes from our experiments, and we find it has two merits: (1) no hyper-parameters are introduced, and (2) it is applicable to all models and tasks we tested.
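As an illustration of why reconstructing the block output can beat layer-local quantization, the toy numpy sketch below (all sizes and the grid-search calibration are our own simplifications; BRECQ instead learns adaptive rounding by gradient descent) picks quantization step sizes for a two-layer block either by weight-space MSE or by block-output MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, step, bits=2):
    """Round-to-nearest onto a b-bit symmetric grid with the given step."""
    qmax = 2 ** (bits - 1) - 1
    return step * np.clip(np.round(w / step), -(qmax + 1), qmax)

# A toy "block": linear -> ReLU -> linear.
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
X = rng.normal(size=(8, 256))                 # calibration inputs to the block
Z = W2 @ np.maximum(W1 @ X, 0.0)              # full-precision block output

def block_out(s1, s2):
    return quantize(W2, s2) @ np.maximum(quantize(W1, s1) @ X, 0.0)

# Shared candidate grid of step sizes for each weight tensor.
grid1 = np.abs(W1).max() * np.linspace(0.05, 0.5, 16)
grid2 = np.abs(W2).max() * np.linspace(0.05, 0.5, 16)

# Baseline: pick each step to minimize the *weight-space* MSE (layer-local).
s1_w = min(grid1, key=lambda s: np.sum((W1 - quantize(W1, s)) ** 2))
s2_w = min(grid2, key=lambda s: np.sum((W2 - quantize(W2, s)) ** 2))

# Block reconstruction: pick the pair minimizing the *block-output* MSE.
s1_b, s2_b = min(((a, b) for a in grid1 for b in grid2),
                 key=lambda p: np.sum((Z - block_out(*p)) ** 2))

err_weight = np.sum((Z - block_out(s1_w, s2_w)) ** 2)
err_block = np.sum((Z - block_out(s1_b, s2_b)) ** 2)
assert err_block <= err_weight   # output reconstruction can only do better here
```

Since the weight-MSE steps lie in the same candidate grid, the block-output search is guaranteed to match or beat them on the calibration set; the paper's point is that this gap matters most at very low bit-widths.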

3.3. APPROXIMATING PRE-ACTIVATION HESSIAN

With the block-diagonal approximated Hessian matrix, we can measure the cross-layer dependency inside each block and transform any block's second-order error to the output of that block: E[∆z^(ℓ),T H^(z^(ℓ)) ∆z^(ℓ)]. This objective requires further knowledge of the rest of the network, i.e., the pre-activation Hessian H^(z^(ℓ)). One way is to follow Nagel et al. (2020) and assume H^(z^(ℓ)) = c × I, so that the quadratic loss becomes ||∆z^(ℓ)||². This method is easy to implement but loses too much information. Instead, we use the diagonal Fisher Information Matrix (FIM) to replace the pre-activation Hessian. Formally, given a probabilistic model p_θ(y|x), the FIM is defined as

F(θ) = E[∇_θ log p_θ(y|x) ∇_θ log p_θ(y|x)^T] = −E[∇²_θ log p_θ(y|x)].   (9)

The FIM is equal to the negative expected Hessian of the log-likelihood; therefore, a simple corollary is that the Hessian of the task loss becomes the FIM if the model distribution matches the true data distribution (LeCun et al., 2012). Although matching the true data distribution exactly seems impossible, this is the best we can do since the pretrained model has converged. The diagonal of the pre-activation FIM equals the squared gradient of each element, which is successfully used in Adam (Kingma & Ba, 2014) for the second momentum. The optimization objective becomes

min_ŵ E[∆z^(ℓ),T H^(z^(ℓ)) ∆z^(ℓ)] = min_ŵ E[∆z^(ℓ),T diag((∂L/∂z_1^(ℓ))², …, (∂L/∂z_a^(ℓ))²) ∆z^(ℓ)].   (10)

Compared with plain MSE minimization, the above minimization incorporates the squared gradient information: outputs with higher absolute gradients receive more attention when being reconstructed. A similar method for pruning pre-activations was proposed in Theis et al. (2018). Note that BRECQ is compatible with any optimization method, like STE (Hubara et al., 2017).
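Eq. (10) amounts to a squared-gradient-weighted mean squared error over the block output. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def fim_weighted_loss(z_fp, z_q, grad_z):
    """Diagonal-FIM proxy of Eq. (10): gradient-squared-weighted output error."""
    dz = z_fp - z_q
    return np.sum((grad_z ** 2) * dz ** 2)

rng = np.random.default_rng(0)
z_fp = rng.normal(size=100)                 # FP block output (one sample)
z_q = z_fp + 0.1 * rng.normal(size=100)    # quantized block output
grad = rng.normal(size=100)                # dL/dz collected from the FP model

loss = fim_weighted_loss(z_fp, z_q, grad)
assert loss >= 0.0
# a pre-activation with zero gradient contributes nothing to the objective
assert fim_weighted_loss(z_fp, z_q, np.zeros(100)) == 0.0
```

Setting `grad_z` to all ones recovers the plain MSE objective of the c × I assumption.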
Here we adopt adaptive rounding (Nagel et al., 2020) for the weights and a learned step size (Esser et al., 2020) for the activations, because we observe that they generally perform better in PTQ; see details in Appendix B.4.1. We formulate the overall calibration algorithm for a unified-precision model in Algorithm 1. We should emphasize that we only need a small subset (1024 images in our experiments) of the whole training dataset to calibrate the quantized model, and we can obtain a quantized ResNet-18 within 20 minutes on a single GTX 1080 Ti GPU.
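For reference, a sketch of the adaptive rounding parameterization from Nagel et al. (2020) that we adopt (the stretch constants ζ = 1.1, γ = −0.1 and the regularizer follow that paper; the helper names are ours):

```python
import numpy as np

ZETA, GAMMA = 1.1, -0.1   # stretch constants from Nagel et al. (2020)

def h(v):
    """Rectified sigmoid mapping the continuous variable V into [0, 1]."""
    return np.clip(1.0 / (1.0 + np.exp(-v)) * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)

def soft_quantize(w, v, step, bits=8):
    """Adaptive rounding: floor onto the grid, then add the learned offset h(V)."""
    qmax = 2 ** (bits - 1) - 1
    return step * np.clip(np.floor(w / step) + h(v), -(qmax + 1), qmax)

def round_reg(v, beta):
    """Regularizer pushing every h(V) toward exactly 0 or 1 as beta anneals."""
    return np.sum(1.0 - np.abs(2.0 * h(v) - 1.0) ** beta)

rng = np.random.default_rng(0)
w, v = rng.normal(size=50), rng.normal(size=50)
w_hat = soft_quantize(w, v, step=0.1)
# the learned offset can only move each weight within one grid step of floor
assert np.all(np.abs(w_hat - 0.1 * np.floor(w / 0.1)) <= 0.1 + 1e-9)
```

During calibration, V is optimized to minimize Eq. (10) plus the annealed regularizer; after convergence, h(V) collapses to hard 0/1 rounding decisions.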

3.4. MIXED PRECISION

To further push the limit of post-training quantization, we employ mixed precision techniques, formulated as

min_c L(ŵ, c), s.t. H(c) ≤ δ, c ∈ {2, 4, 8}^n.   (11)

Here, c is the bit-width vector whose length equals the number of layers, and H(·) is a hardware performance measurement function used to ensure the mixed precision model has the same or lower hardware cost (e.g., memory and speed) than a predefined threshold δ. We choose 2, 4, and 8 bits for mixed precision because they are the most common in practical deployment. Regarding the training loss L, we find that nearly all existing literature (Cai et al., 2020; Hubara et al., 2020; Dong et al., 2019) uses a layer-wise measurement: they assume the sensitivities within a layer are independent and can be summed together, so the mixed precision problem becomes an integer programming problem. However, we argue that the loss measurement should contain two parts, a diagonal loss and an off-diagonal loss. The first is the same as in previous works and measures the sensitivity of each layer independently, while the off-diagonal loss measures the cross-layer sensitivity. Theoretically, we should examine all combinations, which results in 3^n possibilities and prohibits the search algorithm. Our first attempt is to restrict the off-diagonal loss to the block level, since, as mentioned above, the Hessian can be approximated by a block-diagonal matrix. Even so, the search space remains large: if a block has four layers, we have to consider 3^4 = 81 combinations for a single block. Based on our preliminary experiments, we find that 4-bit and 8-bit quantization barely drop the final accuracy. Hence we only take 2-bit combinations into consideration, which drastically reduces the search space. We use a genetic algorithm (Guo et al., 2020) to search the optimal bit-width configuration under a hardware performance threshold; the algorithm is given in Algorithm 2.
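A minimal sketch of the constrained genetic search (the sensitivity lookup table here is random and purely illustrative; in BRECQ it is measured during reconstruction, and the real fitness also includes the intra-block cross terms):

```python
import numpy as np

rng = np.random.default_rng(0)
BITS = np.array([2, 4, 8])
n_layers = 10
params = rng.integers(10_000, 200_000, size=n_layers)   # weights per layer

# Hypothetical sensitivity lookup table: sens[l, i] is the loss proxy of
# putting layer l at BITS[i]; lower bit-widths are more sensitive.
sens = np.sort(rng.random((n_layers, 3)), axis=1)[:, ::-1]

budget = 0.6 * np.sum(params) * 8       # model-size threshold delta, in bits

def size(cfg):
    return np.sum(params * BITS[cfg])

def fitness(cfg):
    # Infeasible individuals get a large penalty instead of being discarded.
    penalty = 1e6 if size(cfg) > budget else 0.0
    return np.sum(sens[np.arange(n_layers), cfg]) + penalty

pop = rng.integers(0, 3, size=(50, n_layers))           # population size 50
for _ in range(100):                                    # 100 generations
    pop = pop[np.argsort([fitness(c) for c in pop])]    # elitist ranking
    children = []
    for _ in range(25):                                 # breed from the top half
        a, b = pop[rng.integers(0, 25, size=2)]
        child = np.where(rng.random(n_layers) < 0.5, a, b)      # crossover
        mut = rng.random(n_layers) < 0.1                        # mutation 0.1
        child = np.where(mut, rng.integers(0, 3, size=n_layers), child)
        children.append(child)
    pop = np.vstack([pop[:25], children])

best = min(pop, key=fitness)
assert size(best) <= budget             # the returned configuration is feasible
```

Because both the sensitivity and the hardware cost come from lookup tables, each fitness evaluation is a few array operations, which is why the search finishes in seconds.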
Due to space limits, we defer related works to Section 5, where readers can find a brief discussion of quantization and second-order analysis.

4. EXPERIMENTS

In this section, we report experimental results on the ImageNet classification task and the MS COCO object detection task. The detailed implementation of the experiments can be found in Appendix B.4.4. The rest of this section contains an ablation study on reconstruction granularity, classification and detection results, mixed precision results, and a comparison with quantization-aware training. In Appendix B, we conduct more experiments, including the impact of the first and the last layer, and the impact of the calibration dataset's size and source.

4.1. ABLATION STUDY

We test four kinds of reconstruction granularity: net-wise, stage-wise, block-wise, and layer-wise. We conduct ImageNet experiments on MobileNetV2 and ResNet-18 with 2-bit weight quantization for all layers except the first and the last. Table 1 shows that block-wise optimization outperforms the other methods. This result implies that the generalization error in net-wise and stage-wise optimization outweighs their off-diagonal loss. In ResNet-18 the differences are not significant; this can potentially be attributed to the fact that ResNet-18 has only 19 layers in its body, so the block size, as well as the stage size, is small, leading to indistinct results.

4.2. IMAGENET

We conduct experiments on a variety of modern deep learning architectures, including ResNet (He et al., 2016) with normal convolutions, MobileNetV2 (Sandler et al., 2018) with depthwise separable convolutions, and RegNet (Radosavovic et al., 2020) with group convolutions. Last but not least, we also investigate a neural-architecture-searched (NAS) model, MNasNet (Tan et al., 2019). In Table 2, we only quantize weights into low-bit integers and keep activations in full precision. We compare with strong baselines including Bias Correction, optimal MSE, AdaRound, AdaQuant, and Bit-Split. Note that the first and the last layer are kept at 8-bit. While most existing methods perform well in 4-bit quantization, they cannot successfully quantize the model to 2-bit. Our method consistently achieves the lowest accuracy degradation for ResNets (within 5%) and for the other compact models. We further quantize activations to 4-bit to make the quantized model runnable on integer-arithmetic hardware platforms. We find that 4-bit activation quantization has a large impact on RegNet and MobileNetV2; nonetheless, our method still outperforms the other state-of-the-art methods. Notably, BRECQ is the first to promote the 2W4A accuracy of PTQ to a usable level, while all other existing methods crash.

4.5. MIXED PRECISION

In this section, we test (1) model-size-guaranteed mixed precision and (2) FPGA-latency-guaranteed mixed precision to unleash the potential of mixed precision and further push the limit of PTQ. We choose ResNet-18, MobileNetV2, and RegNetX-600MF to validate the efficacy of our algorithm. Note that in this section we keep activations at 8-bit because we only compare the discrepancy between unified and mixed precision in weights. We omit 3-bit weight quantization in unified precision because it is usually hardware-unfriendly. Latency settings can be found in Appendix B.4.3. From Fig. 2 we find that (1) mixed precision consistently outperforms unified precision, especially at extremely low bit-widths, e.g., up to a 10% accuracy increase at the same latency as the 2-bit model; and (2) mixed precision can produce many bit configurations that adapt to a variety of hardware requirements, while unified precision offers only 2 fixed models.

5. RELATED WORKS

Quantization. Model quantization can be classified into two categories: quantization-aware training (QAT) and post-training quantization (PTQ). Rounding floating-point numbers to fixed-point numbers produces zero gradients almost everywhere; therefore, most QAT methods employ the Straight-Through Estimator (STE) for gradient approximation. Gong et al. (2019) use a differentiable tanh function to gradually approach the step function. Choi et al. (2018) and Esser et al. (2020) introduce parameterized clipping thresholds and learn them by STE. Apart from uniform quantization, some works like Li et al. (2019) argue that non-uniform quantization has better performance than uniform quantization while keeping its efficiency. Despite the promising results given by QAT methods, they usually need more than 100 GPU-hours. In that case, PTQ plays an important role, and it is what we focus on in this paper. Generally, most deep learning models can be safely quantized to 8-bit without re-training. Data-Free Quantization. Nagel et al. (2019) even do layer-wise 8-bit PTQ without any data. However, in 4-bit quantization, most parameter-space-based methods cannot obtain good performances. Recently, Nagel et al. (2020) proposed layer-wise calibration and made huge progress in 4-bit quantization. Our work continues this Taylor-expansion analysis and considers the off-diagonal loss. Another perspective on quantization is the precision allocation scheme. Hardware-Aware Quantization (HAQ, Wang et al. (2019)) leverages reinforcement learning to search the optimal bit-width configuration. Hessian-aware quantization (HAWQ) (Dong et al., 2019) utilizes second-order information to decide the bit-width. Mixed precision also appears in PTQ, such as the Pareto frontier method in ZeroQ (Cai et al., 2020) and the integer programming method in AdaQuant (Hubara et al., 2020).

Second-order Analysis and Optimization

The history of second-order information in perturbation analysis can be traced back to the 1990s, e.g., Optimal Brain Surgeon (Hassibi & Stork, 1993; Dong et al., 2017). The Hessian matrix is essential for pruning and quantization; as aforementioned, HAWQ uses the largest eigenvalue of the Hessian to determine sensitivity. The Hessian matrix is also important for second-order optimization such as Newton's method, as it encodes the curvature information. However, calculating the true full Hessian is prohibitive for today's deep learning architectures. Therefore, approximations are made to simplify the calculation and reduce the storage, e.g., Gauss-Newton optimization with Kronecker-factored recursive approximation (Botev et al., 2017). Hessian-free optimization (Martens, 2010) avoids the explicit computation of the Hessian matrix by solving the linear system Hv = g using Hessian-vector products. Second-order optimization with the FIM is called Natural Gradient Descent (Amari, 1998). K-FAC (Martens & Grosse, 2015) utilizes a layer-diagonal FIM and approximates the expected Kronecker product to compute the curvature information.

6. CONCLUSION

In this paper, we propose BRECQ, a post-training quantization framework built on second-order error analysis. We show that reconstructing quantization at the block granularity strikes a good balance between cross-layer dependency and generalization error, especially in 2-bit weight quantization, where no prior work succeeds. BRECQ is compatible with mixed precision and can reduce the search cost. To the best of our knowledge, BRECQ reaches the highest performance in post-training quantization and is the first to be on par with 4-bit quantization-aware training.

A MAIN PROOFS

A.1 PROOF OF THEOREM 3.1

Proof. We prove the theorem using the quadratic form. Denote the weight vector by θ ∈ R^d and the network output vector by z^(n) ∈ R^m. The quadratic form ∆θ^T H^(θ) ∆θ can be expanded as

∆θ^T H^(θ) ∆θ = Σ_{i=1}^d ∆θ_i² ∂²L/∂θ_i² + 2 Σ_{i<j} ∆θ_i ∆θ_j ∂²L/(∂θ_i ∂θ_j) = Σ_{i=1}^d Σ_{j=1}^d ∆θ_i ∆θ_j ∂²L/(∂θ_i ∂θ_j),   (12)

where L is the cross-entropy loss. Based on Eq. (5) (with the first term neglected), we have

∂²L/(∂θ_i ∂θ_j) = Σ_{k,l=1}^m ∂z_k^(n)/∂θ_i · ∂²L/(∂z_k^(n) ∂z_l^(n)) · ∂z_l^(n)/∂θ_j.   (13)

Substituting the above equation into Eq. (12), we have

∆θ^T H^(θ) ∆θ = Σ_{i=1}^d Σ_{j=1}^d ∆θ_i ∆θ_j Σ_{k=1}^m Σ_{l=1}^m ∂z_k^(n)/∂θ_i · ∂²L/(∂z_k^(n) ∂z_l^(n)) · ∂z_l^(n)/∂θ_j   (14a)
= Σ_{k=1}^m Σ_{l=1}^m ∂²L/(∂z_k^(n) ∂z_l^(n)) ( Σ_{i=1}^d ∆θ_i ∂z_k^(n)/∂θ_i ) ( Σ_{j=1}^d ∆θ_j ∂z_l^(n)/∂θ_j )   (14b)
= (J_{z^(n)}(θ) ∆θ)^T H^(z^(n)) (J_{z^(n)}(θ) ∆θ),   (14c)

where J_{z^(n)}(θ) denotes the Jacobian matrix of z^(n) with respect to θ. Finally, we use the first-order Taylor expansion, as in Eq. (8), to approximate the change in the network output:

∆z^(n) ≈ J_{z^(n)}(θ) ∆θ.   (15)

Therefore, the final objective is transformed to ∆z^(n),T H^(z^(n)) ∆z^(n).

B EXPERIMENTS B.1 EFFECT OF THE FIRST AND THE LAST LAYER

Many papers claim that the first and the last layer can have a huge impact on the final accuracy. In this section, we investigate this phenomenon as well as the impact of the first and the last layer on hardware performance. We test ResNet-18, MobileNetV2, and RegNet-600MF. Our observations are:

1. In terms of accuracy, 4-bit quantization is essentially benign: quantizing either of these two layers drops accuracy only slightly (within 0.2%). But in 2-bit quantization, the last fully-connected layer is far more important than the first layer. We also observe that the first layer in MobileNetV2 and RegNet (3×3 kernels, 32 channels) is slightly more sensitive than that in ResNet-18 (7×7 kernels, 64 channels).

2. In terms of model size, the first layer has only a minor impact because the input images have just 3 channels, while the last layer contains many weight parameters and greatly affects the memory footprint. We should point out that a small model size in the first layer does not mean a small memory burden, because the input image itself costs a large amount of memory.

3. In terms of latency, the situation depends on the architecture. For example, in ResNet-18 the first layer has a huge impact on the latency, while in MobileNetV2 and RegNet-600MF the last layer is more important than the first. This is because latency is affected by multiple factors, such as the input feature-map size, the FLOPs, and the weight memory size. The arithmetic intensity (OPs/byte) greatly affects latency: operations with high arithmetic intensity, i.e., shallow layers in the network, show a smaller latency gap between different bit-widths.

In conclusion, we find that keeping the first and the last layer at 8-bit is unnecessary. Especially in ResNet-18, setting all layers to 4-bit results in 53.3 ms latency, faster than the 59.8 ms of 2-bit quantization (with the first and last layer at 8-bit), while the accuracy is even 4% higher.
Such phenomenon indicates the potential power of the mixed precision.

B.2 EFFECT OF DATA

We evaluated the influence of the size of the calibration dataset and the source of the data on ResNet-18. We test different numbers of input data points and find that the improvement in 4-bit quantization is trivial, yet in 2-bit quantization the accuracy increases by 5% as the number of data points grows. We also test the distilled data introduced in ZeroQ (Cai et al., 2020). Distilled data is learned from the pretrained model's BN statistics, i.e.,

x*_distilled = arg min_x Σ_{i=1}^n ((µ_i − µ̃_i)² + (σ_i − σ̃_i)²),

where µ_i and σ_i are the mean and standard deviation stored in the BN statistics of the i-th layer, and µ̃_i and σ̃_i are the corresponding statistics produced by the distilled data. We find that distilled data performs well in 4-bit quantization but still leaves a large margin to the original ImageNet dataset in 2-bit quantization. We also find the final accuracy does not benefit much from increasing the number of distilled data points; this might be because the distilled data are all minimized by the same objective and have low diversity.

On ARM CPUs, we do not get a higher computation efficiency for extremely low bit-widths such as 2-bit and 4-bit, but we can acquire a better memory-access efficiency. The primary speedup comes from the reduction of data movement. Specifically, we can perform more additions in the same 8-bit register before having to move the result to a 16-bit register to avoid overflow; the lower the bit-width, the less movement is needed. Together with optimized data packing and data padding, we can run mixed precision quantization on a Raspberry Pi 3B, which has a 1.2 GHz 64-bit quad-core ARM Cortex-A53. Note that this implementation is not optimized for depthwise separable or group convolution, therefore we only verify the latency on ResNets.
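The distillation objective above can be sketched as follows, heavily simplified: a single frozen linear layer stands in for the network, the recorded output mean/std stand in for the BN statistics, and a numeric gradient replaces backpropagation (all names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                    # one frozen layer of the FP model

# "BN statistics" recorded while real data passed through the layer.
real = rng.normal(loc=0.5, scale=2.0, size=(256, 4))
mu = (real @ W.T).mean(axis=0)
sigma = (real @ W.T).std(axis=0)

def stat_loss(x_flat):
    """ZeroQ-style objective: match the stored per-channel mean/std."""
    y = x_flat.reshape(16, 4) @ W.T
    return np.sum((y.mean(0) - mu) ** 2 + (y.std(0) - sigma) ** 2)

# Distill random noise into calibration data with a crude central-difference
# gradient (a stand-in for backpropagation through the real network).
x = rng.normal(size=16 * 4)
init_loss = stat_loss(x)
lr, eps = 0.1, 1e-5
for _ in range(300):
    g = np.array([(stat_loss(x + eps * e) - stat_loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(x.size)])
    x -= lr * g

assert stat_loss(x) < 0.5 * init_loss   # statistics now match far better
```

Because every distilled batch descends the same statistics-matching loss, the resulting samples are highly correlated, which is consistent with the low-diversity observation above.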

B.4.4 IMPLEMENTATION DETAILS

The ImageNet dataset consists of 1.2M training images and 50K test images. We follow the standard pre-processing (He et al., 2016) to get 1024 224×224 input images as the calibration dataset. We fold the batch normalization layers into the convolutions and freeze the BN statistics before post-training quantization. We use the Adam optimizer (Kingma & Ba, 2014) to learn the weight rounding and the activation range to reconstruct the block output. Note that some layers are not a component of any block, such as the first convolutional layer, the last fully connected layer, and the last convolutional layer in MobileNetV2; these layers use plain layer-wise reconstruction. The batch size is set to 32 and each block is optimized for 2×10⁴ iterations. The learning rate is set to 10⁻³ during the whole learning process. Other hyper-parameters, such as the temperature β, are kept the same as in Nagel et al. (2020). For the activation step size, we also use the Adam optimizer and set the learning rate to 4×10⁻⁵. Note that we do not implement the gradient scale introduced in the original paper (Esser et al., 2020). After reconstruction, we store the sensitivity measured on the calibration dataset; in 2-bit quantization we additionally store the intra-block sensitivity. The sensitivity, as well as the hardware performance of each layer, is stored in a lookup table, which is consulted when computing the fitness values and hardware performance in the genetic algorithm. For the genetic algorithm, we set the population size to 50 and evolve for 100 iterations to obtain the best individual. The first population is initialized from a Gaussian distribution and the samples are rounded to integers in [0, 1, 2], corresponding to bit-widths [2, 4, 8]. The mutation probability is set to 0.1. The genetic algorithm usually completes the evolution in only about 3 seconds. For object detection tasks, we use 256 training images taken from the MS COCO dataset for calibration.
The image resolution is set to 800 (max size 1333) for ResNet-18 and ResNet-50, while the image resolution for MobileNetV2 is set to 600 (max size 1000). Note that we only apply block reconstruction in the backbone, because other parts of the architecture, such as the Feature Pyramid Network, do not have the block structure; a plain layer-wise reconstruction is therefore applied to the rest of the network. Learning hyper-parameters are kept the same as in the ImageNet experiments.
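The genetic search described above can be sketched as follows. The lookup tables `sensitivity` and `size_per_bit`, the size-budget constraint, and the one-point crossover scheme are our own simplifications for illustration; the paper's hyper-parameters (population 50, 100 iterations, mutation probability 0.1, and genes in {0, 1, 2} encoding bit-widths {2, 4, 8}) are kept:

```python
import random

BITS = [2, 4, 8]  # gene value g in {0, 1, 2} encodes BITS[g]

def genetic_search(sensitivity, size_per_bit, budget,
                   n_layers, pop=50, iters=100, p_mut=0.1, seed=0):
    """Toy genetic search for per-layer bitwidths.

    sensitivity[l][g] and size_per_bit[l][g] stand in for the paper's lookup
    tables (accuracy sensitivity and model size of layer l at BITS[g]).
    """
    rng = random.Random(seed)

    def fitness(ind):
        size = sum(size_per_bit[l][g] for l, g in enumerate(ind))
        if size > budget:                      # infeasible under the constraint
            return float("inf")
        return sum(sensitivity[l][g] for l, g in enumerate(ind))

    popu = [[rng.randrange(3) for _ in range(n_layers)] for _ in range(pop)]
    for _ in range(iters):
        popu.sort(key=fitness)
        parents = popu[: pop // 2]             # keep the best half
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_layers)
            child = a[:cut] + b[cut:]          # one-point crossover
            child = [rng.randrange(3) if rng.random() < p_mut else g
                     for g in child]           # per-gene mutation
            children.append(child)
        popu = parents + children
    best = min(popu, key=fitness)
    return [BITS[g] for g in best]
```

Because fitness is just two table lookups per layer, each candidate is evaluated in microseconds, which is why the full evolution finishes in seconds rather than the hours an end-to-end search would take.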



To prevent ambiguity, we hereby use layer-diagonal Hessian to replace the common name "block-diagonal Hessian" because the block in this paper means a building block in the CNNs. We also test mobile CPU latency guaranteed mixed precision, located in Appendix B.3.



Low-bit GEMM implementation on ARM CPU.

Figure 5: FPGA-based and mobile CPU-based latency acquisition.

A network is composed of a stem layer (the first convolution on input images), a body, and a head layer (average pooling followed by a fully connected layer). The body contains several stages, and a stage contains several blocks. A representative block is the bottleneck block with a residual path. We define 4 kinds of reconstruction granularity, namely net-wise, stage-wise, block-wise and layer-wise optimization, each of which corresponds to an essential component of a CNN.

Ablation study.

Accuracy comparison on weight-only quantized post-training models. Activations here are unquantized and kept in full precision. We also conduct a variance study for our experiments. Bold values indicate the best results. * indicates our implementation based on open-source code.

Accuracy comparison on fully quantized post-training models. Activations here are quantized to 4-bit. Notation follows the table above.

Performance as well as training cost comparison with quantization-aware training (QAT).

Object detection task (MS COCO) comparison on fully quantized post-training models. Activations here are quantized to 8-bit. We report the bounding box mean Average Precision (mAP) metric.

Impact of the first and the last layer

ACKNOWLEDGMENT

We thank Markus Nagel and anonymous reviewers for their kind help of this work. This project is primarily supported by NSFC 61876032.

AVAILABILITY

Codes are available at https://github.com/yhhhli/BRECQ.

APPENDIX

In this section, we compare our algorithm (post-training quantization) with several quantization-aware training methods, including PACT (Choi et al., 2018), DSQ (Gong et al., 2019), LSQ (Esser et al., 2020), and a mixed precision technique, HAQ (Wang et al., 2019). Table 4 shows that although BRECQ is a PTQ method with limited available data, it achieves accuracy comparable to existing quantization-aware training models. In addition, our method surpasses them on 4-bit MobileNetV2 while using less than one GPU hour of training. Our method also has accuracy comparable to HAQ, a training-based mixed precision method. Note that our GPU hours include three unified-precision trainings (2-, 4-, and 8-bit, respectively), and further mixed-precision configurations only need to check the lookup table; in contrast, HAQ searches end-to-end from scratch for each hardware performance threshold.

4.4. MS COCO

To validate the effectiveness of BRECQ on other tasks, we conduct object detection experiments with the two-stage Faster R-CNN (Ren et al., 2015) and the one-stage RetinaNet (Lin et al., 2017). ResNet-18, ResNet-50, and MobileNetV2 are adopted as backbones for the detection models. Results in Table 5 demonstrate that our method incurs almost no performance drop with 4-bit weight and 8-bit activation quantization. In particular, BRECQ only loses 0.21% mAP on Faster R-CNN with a 4-bit ResNet-18 backbone. On RetinaNet with a 4-bit ResNet-50 backbone, our method outperforms the mixed-precision ZeroQ model by 3% mAP. Even when the weight bit-width decreases to 2, the model still achieves near-original mAP.

B.3 MOBILE CPU LATENCY GUARANTEED MIXED PRECISION

In this section, we test mobile CPU latency guaranteed mixed precision. The latency lookup table is measured using the technique in Gong et al. (2019). We only validate it on ResNet-18 and ResNet-50 because the current low-bit General Matrix Multiply (GEMM) implementation only supports normal convolution. The results concur with Fig. 2: below 4-bit, mixed precision achieves better task performance than the unified-precision models. For ResNet-50, the improvement is smaller than that for ResNet-18 and other mixed precision models; we think this is because the sensitivities across layers in ResNet-50 are not distinct, so the improvement brought by mixed precision is trivial.

B.4 IMPLEMENTATION

B.4.1 LEARNING STRATEGIES

In this work, we mainly focus on developing the optimization objective rather than optimization strategies. We observe that adaptive rounding performs well in post-training quantization. A brief introduction to AdaRound is given below; the detailed algorithm can be found in Nagel et al. (2020). The traditional quantization function is performed by a rounding-to-nearest operation: ŵ = s × clip(⌊w/s⌉, n, p). AdaRound instead optimizes the rounding policy in post-training quantization. Specifically, all weights are initially rounded by a floor operation, and a learnable variable v determines whether the final rounding result is flooring or ceiling. A sigmoid-like function σ(·) keeps the learnable variable moving between 0 and 1, and a regularization term ensures that σ(v) converges to either 0 or 1. The formulation is given by

ŵ = s × clip(⌊w/s⌋ + σ(v), n, p),    f_reg(v) = Σ_i (1 − |2σ(v_i) − 1|^β).

The minimization problem together with the regularization is given by

min_v E[Δz⊤ diag((∂L/∂z_1)², …, (∂L/∂z_n)²) Δz] + λ f_reg(v),

where progressively decreasing β during calibration ensures that σ(v) converges to binary values. The activations cannot be quantized using adaptive rounding because they vary with the input data points; thus, we can only adjust their quantization step size (Esser et al., 2020). Denoting the quadratic loss in the above equation as L_q, the gradient of the step size is given by

∂L_q/∂s = (∂L_q/∂â)(∂â/∂s),  with ∂â/∂s = −a/s + ⌊a/s⌉ if n < a/s < p, n if a/s ≤ n, and p if a/s ≥ p,

where all step sizes in the block are optimized jointly.
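The learnable-rounding quantizer described above can be sketched in a few lines. The stretch parameters `zeta = 1.1` and `gamma = -0.1` follow AdaRound's rectified sigmoid; the function names are our own:

```python
import numpy as np

def rectified_sigmoid(v, zeta=1.1, gamma=-0.1):
    """sigma(v): a sigmoid stretched to [gamma, zeta] and clipped to [0, 1],
    so it can reach exactly 0 or 1 at finite v."""
    return np.clip(1 / (1 + np.exp(-v)) * (zeta - gamma) + gamma, 0.0, 1.0)

def soft_quantize(w, v, s, n, p):
    """w-hat = s * clip(floor(w/s) + sigma(v), n, p): flooring plus a learnable
    offset that decides between flooring and ceiling."""
    return s * np.clip(np.floor(w / s) + rectified_sigmoid(v), n, p)

def round_reg(v, beta):
    """f_reg(v) = sum_i 1 - |2 sigma(v_i) - 1|^beta: zero when every sigma(v_i)
    is binary; annealing beta pushes the soft rounding toward {0, 1}."""
    return np.sum(1.0 - np.abs(2 * rectified_sigmoid(v) - 1.0) ** beta)
```

For a large positive v the offset saturates at 1 (ceiling) and the regularizer vanishes; at v = 0 the offset is 0.5 and the regularizer is maximal, which is what drives the final binary rounding decision during calibration.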

B.4.3 LATENCY ACQUISITION

We test the latency of quantized neural networks on a self-developed simulator of a precision-variable accelerator for NNs. The basic architecture of this accelerator is inspired by typical systolic matrix multiplication. The accelerator supports per-channel quantization parameters. The precision of each layer of a NN is highly configurable in this accelerator, supporting 9 types of precision combinations (activation: 2-, 4-, 8-bit × weight: 2-, 4-, 8-bit), see Fig. 5a. With the support of a scalable function unit (Sharma et al., 2018), the peak performance of the accelerator achieves the corresponding linear improvement as the precision decreases. For example, the peak performance of this accelerator is 256 GMAC/s in 8-bit × 8-bit precision, and it scales to 512 GMAC/s in 8-bit × 4-bit precision and 4 TMAC/s in 2-bit × 2-bit precision. Although this accelerator provides considerable computation resources, especially at low precision, the parallelism of specific layers (like depthwise convolution) and the bandwidth of the on-chip buffer are limited. Consequently, the actual performance may not scale exactly with the peak performance, and the final performance differs according to the size and type of the layers. The simulator performs cycle-accurate simulation and evaluation of a given NN executed on the accelerator, so we can obtain an equivalent evaluation by using this simulator. The simulator is available in the provided source code.

For the acquisition of mobile ARM CPU latency, we adopt the redesigned low-bit GEMM implementation in Han et al. (2020). Fig. 5b shows a brief overview of the low-bit GEMM implementation. Since no instruction on the ARM architecture supports bit-widths below 8, we cannot get higher computation efficiency for extremely low bit-widths such as 2-bit and 4-bit, but we can acquire better memory access efficiency. The primary speedup comes from the reduction of data movement: specifically, we can perform more additions in the same 8-bit register before we have to move the result to a 16-bit register to avoid overflow, and the lower the bit-width, the less movement is needed. Together with optimized data packing and padding, we can run mixed precision quantization on a Raspberry Pi 3B, which has a 1.2 GHz 64-bit quad-core ARM Cortex-A53. Note that this implementation is not optimized for depthwise separable or group convolution; therefore, we only verify the latency on ResNets.
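The linear peak-performance scaling of the accelerator can be captured by a one-line model (a sketch of the scaling rule quoted in the text, not of the actual cycle-accurate simulator; the function name is our own):

```python
def peak_macs(w_bits, a_bits, base=256.0, base_bits=(8, 8)):
    """Peak throughput in GMAC/s under the linear scaling rule: performance
    grows inversely with the weight x activation bit product, from a baseline
    of 256 GMAC/s at 8-bit x 8-bit."""
    return base * (base_bits[0] * base_bits[1]) / (w_bits * a_bits)
```

This reproduces the numbers in the text: 256 GMAC/s at 8×8, 512 GMAC/s at 8×4, and 4096 GMAC/s (4 TMAC/s) at 2×2; real layers fall short of these peaks when parallelism or on-chip bandwidth is the bottleneck.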

