POWERQUANT: AUTOMORPHISM SEARCH FOR NON-UNIFORM QUANTIZATION

Abstract

Deep neural networks (DNNs) are nowadays ubiquitous in many domains such as computer vision. However, due to their high latency, the deployment of DNNs hinges on the development of compression techniques such as quantization, which consists in lowering the number of bits used to encode the weights and activations. Growing concerns over privacy and security have motivated the development of data-free techniques, at the expense of accuracy. In this paper, we identify the uniformity of the quantization operator as a limitation of existing approaches, and propose a data-free non-uniform method. More specifically, we argue that, to be readily usable without dedicated hardware and implementation, non-uniform quantization shall not change the nature of the mathematical operations performed by the DNN. This leads us to search among the continuous automorphisms of $(\mathbb{R}_+^*, \times)$, which boil down to the power functions defined by their exponent. To find this parameter, we propose to optimize the reconstruction error of each layer: in particular, we show that this procedure is locally convex and admits a unique solution. At inference time, we show that our approach, dubbed PowerQuant, only requires simple modifications of the quantized DNN activation functions. As such, with only negligible overhead, it significantly outperforms existing methods in a variety of configurations.

1. INTRODUCTION

Deep neural networks (DNNs) have tremendously improved algorithmic solutions for a wide range of tasks. In computer vision in particular, these achievements come at a steep price, as deploying DNNs bears a significant energy cost. Consequently, the generalization of their usage hinges on the development of compression strategies. Quantization is one of the most promising such techniques: it consists in reducing the number of bits needed to encode the DNN weights and/or activations, thus limiting the cost of data processing on a computing device. Existing DNN quantization techniques for computer vision tasks are numerous and can be distinguished by their constraints. One such constraint is data usage, as introduced in Nagel et al. (2019), motivated by data privacy and security concerns. Data-free approaches such as Banner et al. (2019); Cai et al. (2020); Choukroun et al. (2019); Fang et al. (2020); Garg et al. (2021); Zhao et al. (2019); Nagel et al. (2019) exploit heuristics and weight properties in order to perform the most efficient weight quantization without access to the training data. Compared to data-driven methods, the aforementioned techniques are more convenient to use but usually come with higher accuracy loss at equivalent compression rates. The performance of data-driven methods offers an upper bound on what can be expected from data-free approaches, and in this work we aim at further narrowing the gap between the two. To achieve this goal, we propose to leverage a second aspect of quantization: uniformity. For simplicity, most quantization techniques such as Nagel et al. (2019); Zhao et al. (2019); Cong et al. (2022) perform uniform quantization, i.e. they map floating point values to an evenly spread, discrete space. However, non-uniform quantization can theoretically provide a closer fit to the network weight distributions, thus better preserving the network accuracy.
Previous work on non-uniform quantization either focused on the search of binary codes (Banner et al., 2019; Hubara et al., 2016; Jeon et al., 2020; Wu et al., 2016; Zhang et al., 2018) or leveraged logarithmic distributions (Miyashita et al., 2016; Zhou et al., 2017). However, these approaches map floating point multiplications to other operations that are hard to leverage on current hardware (e.g. bit-shifts), as opposed to uniform quantization which maps floating point multiplications to integer multiplications (Gholami et al., 2021; Zhou et al., 2016). To circumvent this limitation and reach a tighter fit between the quantized and original weight distributions, in this work, we propose to search for the best possible quantization operator that preserves the nature of the mathematical operations. We show that this search boils down to the space defined by the continuous automorphisms of $(\mathbb{R}_+^*, \times)$, which is limited to power functions defined by their exponent. We optimize the value of this parameter by minimizing the error introduced by quantization. This allows us to reach superior accuracy, as illustrated in Fig 1 (the proposed method, right bin in blue, drastically improves upon the existing data-free methods, especially in the challenging W4/A4 quantization). To sum up, our contributions are:
• We search for the best quantization operator that does not change the nature of the mathematical operations performed by the DNN, i.e. among the automorphisms of $(\mathbb{R}_+^*, \times)$. We show that this search can be narrowed down to finding the best exponent for power functions.
• We find the optimal exponent parameter to more closely fit the original weight distribution compared with existing (e.g. uniform and logarithmic) baselines. To do so, we propose to optimize the quantization reconstruction error. We show that this problem is locally convex and admits a unique solution.
• In practice, we show that the proposed approach, dubbed PowerQuant, only requires simple modifications of the quantized DNN activation functions and accumulation. Furthermore, we demonstrate through extensive experimentation that our method achieves outstanding results on various and challenging benchmarks with negligible computational overhead.

2. RELATED WORK

2.1 QUANTIZATION

In this section, we provide background on the current state of DNN quantization. Notice that while certain approaches are geared towards memory footprint reduction (e.g. without quantizing inputs and activations) (Chen et al., 2015; Gong et al., 2014; Han et al., 2016; Zhou et al., 2017), in what follows, we essentially focus on methods that aim at reducing the inference time. In particular, motivated by the growing concerns for privacy and security, data-free quantization methods (Banner et al., 2019; Cai et al., 2020; Choukroun et al., 2019; Fang et al., 2020; Garg et al., 2021; Zhao et al., 2019; Nagel et al., 2019; Cong et al., 2022) are emerging and have significantly improved over the recent years. The first breakthrough in data-free quantization (Nagel et al., 2019) was based on two mathematical ingenuities. First, they exploited the properties of piece-wise affine activation functions (e.g. in ReLU-based DNNs) to balance the per-channel weight distributions by iteratively applying scaling factors to consecutive layers. Second, they proposed a bias correction scheme that consists in updating the bias terms of the layers with the difference between the expected quantized predictions and the original predictions. They achieved near full-precision accuracy in int8 quantization. Since this seminal work, two main categories of data-free quantization methods have emerged. First, data-generation based methods, such as Cai et al. (2020); Garg et al. (2021), use samples generated by Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to fine-tune the quantized model through knowledge distillation (Hinton et al., 2014). Nevertheless, these methods are time-consuming and require significantly more computational resources. Other methods, such as Banner et al. (2019); Choukroun et al. (2019); Fang et al. (2020); Zhao et al. (2019); Nagel et al. (2019); Cong et al. (2022), focus on improving the quantization operator but usually achieve lower accuracy. One limitation of these approaches is that they are essentially restricted to uniform quantization, while considering non-uniform mappings between the floating point and low-bit representations might be key to superior performance.

2.2. NON-UNIFORM QUANTIZATION

Indeed, in uniform settings, continuous variables are mapped to an equally-spaced grid in the original, floating point space. Such a mapping introduces an error; applying it to an a priori non-uniform weight distribution is likely to be suboptimal in the general case. To circumvent this limitation, non-uniform quantization has been introduced (Banner et al., 2019; Hubara et al., 2016; Jeon et al., 2020; Wu et al., 2016; Zhang et al., 2018; Miyashita et al., 2016; Zhou et al., 2017; Zhang et al., 2021a; Li et al., 2019). We distinguish two categories of non-uniform quantization approaches. First, methods that introduce a code-base and require very sophisticated implementations for actual inference benefits, such as Banner et al. (2019); Hubara et al. (2016); Jeon et al. (2020); Wu et al. (2016); Zhang et al. (2018). Second, methods that simply modify the quantization operator, such as Miyashita et al. (2016); Zhou et al. (2017). In particular, Zhang et al. (2021a) propose a log-quantization technique; similarly, Li et al. (2019) use log quantization with basis 2. In both cases, in practice, such logarithmic quantization schemes change the nature of the mathematical operations involved, with multiplications being replaced by bit-shifts. Because the very nature of the mathematical operations is intrinsically altered, these schemes are hard to leverage without dedicated hardware and implementation: instead of transforming floating point multiplications into integer multiplications, they change them into bit-shifts or even look-up tables (LUTs). Some of these operations are very specific to certain hardware (e.g. LUTs are designed for FPGAs) and may not be well supported on most devices.
Conversely, in this work, we propose a non-uniform quantization scheme that preserves the nature of the mathematical operations by mapping floating point multiplications to standard integer multiplications. As a result, our approach boils down to simple modifications of the computations in the quantized DNN, hence allowing higher accuracies than uniform quantization methods while leading to straightforward, ready-to-use inference speed gains. Below we describe the methodology behind the proposed approach.

3. METHODOLOGY

Let F be a trained feed forward neural network with L layers, each comprising a weight tensor $W_l$. Let Q be a quantization operator such that the quantized weights $Q(W_l)$ are represented on b bits. The most popular such operator is the uniform one. We argue that, despite its simplicity, the choice of such a uniform operator is responsible for a significant part of the quantization error. In fact, the weights $W_l$ most often follow a bell-shaped distribution for which uniform quantization is ill-suited: intuitively, in such a case, we would want to quantize the small weights on the peak of the distribution more precisely. For this reason, the most popular non-uniform quantization scheme is logarithmic quantization, which yields superior performance. Practically speaking, however, it consists in replacing the quantized multiplications by bit-shift operations. As a result, these methods have limited adaptability, as the increment speed is hardware dependent. To address this problem, we look for the non-uniform quantization operators that preserve the nature of matrix multiplications. Formally, setting aside the rounding operation in quantization, we want to define the space $\mathcal{Q}$ of functions Q such that

$$\forall Q \in \mathcal{Q},\ \exists Q^{-1} \in \mathcal{Q}\ \text{s.t.}\ \forall x, y \quad Q^{-1}(Q(x) * Q(y)) = x \times y \qquad (1)$$

where $*$ is the internal composition law of the quantized space, $\times$ is the standard multiplication, and $Q$, $Q^{-1}$ are the quantization and de-quantization operators, respectively. In the case of uniform quantization and of our work, $*$ is the multiplication, while in other non-uniform works it often corresponds to operations that are harder to leverage, e.g. bit-shifts. Recall that, for now, we omit the rounding operator. The proposed PowerQuant method consists in the search for the best suited operator Q for a given trained neural network and input statistics.

3.1 AUTOMORPHISMS OF $(\mathbb{R}_+^*, \times)$

Let Q be a transformation from $\mathbb{R}_+$ to $\mathbb{R}_+$.
In this case, $*$, the internal composition law of the quantized space in (1), simply denotes the scalar multiplication operator $\times$, and (1) becomes $Q(x) \times Q(y) = Q(x \times y)$ for all $x, y \in \mathbb{R}_+$. In order to define a de-quantization operation, we need $Q^{-1}$ to exist, i.e. Q must be bijective. Thus, by definition, Q is a group automorphism of $(\mathbb{R}_+^*, \times)$: quantization operators that preserve the nature of multiplications are restricted to automorphisms of $(\mathbb{R}_+^*, \times)$. The following lemma further restricts the search to power functions.

Lemma 1. The set of continuous automorphisms of $(\mathbb{R}_+^*, \times)$ is the set of power functions $\mathcal{Q} = \{Q : x \mapsto x^a \mid a \in \mathbb{R}\}$.

A proof of this result can be found in Appendix A. For the sake of clarity, we now include the rounding operation in the quantization operators:

$$\mathcal{Q} = \left\{ Q_a : W \mapsto \text{sign}(W) \times \left\lfloor (2^{b-1}-1)\, \frac{|W|^a}{\max |W|^a} \right\rceil \ \middle|\ a \in \mathbb{R} \right\} \qquad (2)$$

where W is a tensor and all the operations are performed element-wise. As functions of W, the quantization operators defined in equation 2 are (signed) power functions. Fig 2 illustrates the effect of the power parameter a on quantization (vertical bars). Uniform quantization and a = 1 are equivalent and correspond to a quantization invariant to the weight distribution. For a < 1, the quantization is more fine-grained on weight values with low absolute value and coarser on high absolute values. Conversely, for a > 1, the quantization becomes more fine-grained on high absolute values. We now define the search protocol in the proposed search space $\mathcal{Q}$.
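To make the operator concrete, here is a minimal NumPy sketch of the signed power quantizer of equation 2 and its inverse; the function names are ours for illustration, not from an official implementation:

```python
import numpy as np

def power_quantize(w, a, b=8):
    """Signed power quantizer of equation 2: codes in [-(2^(b-1)-1), 2^(b-1)-1]."""
    levels = 2 ** (b - 1) - 1
    pw = np.abs(w) ** a                        # |W|^a, element-wise
    return np.sign(w) * np.round(levels * pw / pw.max())

def power_dequantize(q, a, w_max, b=8):
    """Inverse mapping: undo the scaling, then the power."""
    levels = 2 ** (b - 1) - 1
    return np.sign(q) * (np.abs(q) * w_max ** a / levels) ** (1.0 / a)

w = np.array([0.5, -0.25, 1.0, 0.03])
q = power_quantize(w, a=0.5, b=8)
w_hat = power_dequantize(q, a=0.5, w_max=np.abs(w).max(), b=8)
# with a = 0.5, small weights such as 0.03 are recovered almost exactly
```

Note that with a = 1 `power_quantize` reduces to the usual symmetric uniform quantizer, which is why uniform quantization is contained in the search space.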

3.2. AUTOMORPHISM SEARCH AS A MINIMIZATION PROBLEM

We propose to use the error introduced by quantization on the weights as a proxy for the distance between the quantized and the original model.

Reconstruction Error Minimization:

The operator $Q_a$ is not a bijection. Thus, quantization introduces a reconstruction error, summed over all the layers of the network and defined as follows:

$$\epsilon(F, a) = \sum_{l=1}^{L} \left\| W_l - Q_a^{-1}(Q_a(W_l)) \right\|_p \qquad (3)$$

where $\|\cdot\|_p$ denotes the $L_p$ vector norm (in practice p = 2, see Appendix B) and the de-quantization operator $Q_a^{-1}$ is defined as:

$$Q_a^{-1}(W) = \text{sign}(W) \times \left| W \times \frac{\max |W|^a}{2^{b-1}-1} \right|^{\frac{1}{a}} \qquad (4)$$

In practice, the problem of finding the best exponent $a^* = \arg\min_a \epsilon(F, a)$ in (3) is a locally convex optimization problem (Appendix C.1) which has a unique minimum (see Appendix C.2). We find the optimal value for a using the Nelder-Mead method (Nelder & Mead, 1965), which solves problems for which derivatives may not be known or, in our case, are almost surely zero (due to the rounding operation). In practice, more recent solvers are not required in order to reach the optimal solution (see Appendix D). Lastly, we discuss the limitations of the proposed metric in Appendix H.
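The per-layer search can be sketched as follows. The helper names are hypothetical; for simplicity we use a golden-section search, a derivative-free 1-D method in the same spirit as the Nelder-Mead simplex the paper relies on (both sidestep the almost-everywhere-zero gradients caused by rounding):

```python
import numpy as np

def reconstruction_error(weights, a, b=4, p=2):
    """Equation 3: quantization reconstruction error summed over the layers."""
    levels = 2 ** (b - 1) - 1
    err = 0.0
    for w in weights:
        pw = np.abs(w) ** a
        q = np.round(levels * pw / pw.max())               # power codes
        w_hat = np.sign(w) * (q * pw.max() / levels) ** (1.0 / a)
        err += np.linalg.norm((w - w_hat).ravel(), ord=p)
    return err

def search_exponent(weights, lo=0.1, hi=2.0, iters=40):
    """Derivative-free search for a* on [lo, hi] (golden-section sketch)."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    x1 = hi - phi * (hi - lo)
    x2 = lo + phi * (hi - lo)
    f1 = reconstruction_error(weights, x1)
    f2 = reconstruction_error(weights, x2)
    for _ in range(iters):
        if f1 < f2:                      # minimum lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - phi * (hi - lo)
            f1 = reconstruction_error(weights, x1)
        else:                            # minimum lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + phi * (hi - lo)
            f2 = reconstruction_error(weights, x2)
    return 0.5 * (lo + hi)
```

On bell-shaped (e.g. Gaussian) weights at 4 bits, this search typically settles below 1, consistent with the exponent values around 0.55 reported in Section 4.2.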

3.3. FUSED DE-QUANTIZATION AND ACTIVATION FUNCTION

Based on equation 2, quantizing the weights requires storing and multiplying W along with a sign tensor, which is memory and computationally intensive. For the weights, however, this can be computed once during the quantization process, inducing no overhead during inference. As for activations, we do not have to store the sign of ReLU activations, as they are always positive. In this case, the power function has to be computed at inference time (see Algorithm 2). However, it can be efficiently computed following Kim et al. (2021), using Newton's method to approximate continuous functions in integer-only arithmetic. This method is very efficient in practice, as it converges in 2 steps for low bit representations (four steps for int32). Thus, PowerQuant leads to significant accuracy gains with limited computational overhead. Conversely, for non-ReLU feed forward networks such as EfficientNets (SiLU) or image transformers (GeLU), activations are signed. This can be tackled using asymmetric quantization, which consists in the use of a zero-point. In general, asymmetric quantization allows one to better cover the support of the quantized values. In our case, we use asymmetric quantization to work with positive values only. Formally, both SiLU and GeLU are analytically bounded below, by $-C_{\text{SiLU}}$ with $C_{\text{SiLU}} = 0.27846$ and $-C_{\text{GeLU}}$ with $C_{\text{GeLU}} = 0.169971$ respectively. Consequently, assuming a layer with SiLU activation with input x and weights W, we have:

$$Q_a^{-1}\big(Q_a(x + C_{\text{SiLU}})\, Q_a(W)\big) \approx \big((x + C_{\text{SiLU}})^a W^a\big)^{\frac{1}{a}} = xW + C_{\text{SiLU}} W \qquad (5)$$

The bias term $C_{\text{SiLU}} W$ induces a very slight computational overhead, which is standard in asymmetric quantization. We provide a detailed empirical evaluation of this cost in Appendix G. Using the adequate value for the bias corrector, we can generalize equation 5 to any activation function σ. The quantization process and inference with the quantized DNN are summarized in Algorithms 1 and 2.
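The identity behind equation 5 can be checked numerically (rounding set aside); the constants below are those quoted above:

```python
import numpy as np

a, C = 0.55, 0.27846            # exponent and SiLU shift quoted in the text
x = np.linspace(-0.27, 3.0, 7)  # SiLU outputs lie above -C, so x + C > 0
w = 0.8
lhs = ((x + C) ** a * w ** a) ** (1.0 / a)   # product in the power domain
rhs = x * w + C * w                          # shifted result of equation 5
assert np.allclose(lhs, rhs)
```

The residual bias C·W is the standard zero-point correction of asymmetric quantization and can be folded into the layer bias.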
The proposed representation is fully compatible with integer multiplication as defined in Jacob et al. (2018); hence, it supports integer-only inference (see Appendix F for more details).

Algorithm 1 Weight Quantization Algorithm

Require: trained neural network F with L layers to quantize, number of bits b
a ← solver(min_a ϵ(F, a)) ▷ in practice we use the Nelder-Mead method
for l ∈ {1, . . . , L} do
    W_sign ← sign(W_l) ▷ save the sign of the scalar values in W_l
    W_l ← W_sign × |W_l|^a ▷ power transformation
    s ← max |W_l| / (2^(b-1) − 1) ▷ get quantization scale
    Q : W_l ↦ ⌊W_l / s⌉ and Q^(-1) : W ↦ W_sign × |W × s|^(1/a) ▷ define Q and Q^(-1)
end for

Algorithm 2 Simulated Inference Algorithm

Require: quantized neural network F with L layers, input X and exponent a*
for l ∈ {1, . . . , L} do
    X ← X^(a*) ▷ X is assumed positive (see equation (5))
    X_Q ← ⌊X s_X⌉ ▷ where s_X is a scale in the input range
    O ← F_l(X_Q) ▷ where (·)^(1/a*) is applied at the accumulation
    X ← σ(O) s_X s_W ▷ where σ is the activation function and s_W the weight scale
end for
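Under simplifying assumptions (a single dense layer, positive inputs, symmetric weight codes; the function name is ours), Algorithms 1 and 2 can be simulated as:

```python
import numpy as np

def simulated_dense(x, w, a, b=8):
    """Simulated PowerQuant inference for one dense layer (Algorithms 1-2).

    Inputs `x` are assumed positive (e.g. post-ReLU). Multiplications stay
    integer multiplications; the de-quantizing power (.)^(1/a) is applied
    at the accumulation, before summing.
    """
    levels = 2 ** (b - 1) - 1
    # Algorithm 1: power transform of the weights, then integer codes
    pw = np.abs(w) ** a
    s_w = pw.max() / levels
    w_q = np.sign(w) * np.round(pw / s_w)
    # Algorithm 2: power transform and quantization of the inputs
    px = x ** a
    s_x = px.max() / levels
    x_q = np.round(px / s_x)
    # integer products, power de-accumulation, then de-quantizing rescale
    prod = x_q[None, :] * w_q
    acc = (np.sign(prod) * np.abs(prod) ** (1.0 / a)).sum(axis=1)
    return acc * (s_x * s_w) ** (1.0 / a)
```

In an actual integer pipeline, the (·)^(1/a) step would be approximated in integer-only arithmetic with Newton's method, as discussed in Section 3.3; here we use floating point for clarity.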

4. EXPERIMENTS

4.1. DATASETS AND IMPLEMENTATION DETAILS

We validate the proposed PowerQuant method on ImageNet classification (Deng et al., 2009) (≈1.2M train / 50k test images). In our experiments, we used pre-trained MobileNets (Sandler et al., 2018), ResNets (He et al., 2016), EfficientNets (Tan & Le, 2019) and DenseNets (Huang et al., 2017). We used TensorFlow implementations of the baseline models from official repositories, achieving standard baseline accuracies. The quantization process was implemented with the NumPy library. Activations are quantized as unsigned integers and weights are quantized using a symmetric representation. We fold batch-normalization layers as in Yvinec et al. (2022a). We performed an ablation study using the uniform quantization operator over weight values from Krishnamoorthi (2018) and the logarithmic quantization from Miyashita et al. (2016). For our comparison with state-of-the-art approaches in data-free quantization, we implemented the more complex quantization operator from SQuant (Cong et al., 2022). To compare with strong baselines, we also implement bias correction (Nagel et al., 2019) (which measures the expected difference between the outputs of the original and quantized models and updates the bias terms to compensate for this difference) as well as input weight quantization (Nagel et al., 2019).

4.2. EXPONENT PARAMETER FITTING

Fig 3 illustrates the evolution of both the accuracy of the whole DNN and the reconstruction error summed over all the layers of the network, as functions of the exponent parameter a. Our target is the highest accuracy with respect to the value of a; however, in a data-free context, we only have access to the reconstruction error. Nevertheless, as shown in Fig 3, these metrics are strongly anticorrelated. Furthermore, while the reconstruction curve is not convex, it behaves well for simplex-based optimization methods such as the Nelder-Mead method. This is due to two properties: it is locally convex (Appendix C.1) and has a unique minimum (Appendix C.2). Empirically, optimal values a* for the exponent parameter are centered on 0.55, which approximately corresponds to the first distribution in Fig 2. Still, as shown in Table 1, we observe some variations in the best value for a, which motivates the optimization of a for each network and bitwidth. Furthermore, our results provide a novel insight on the difference between pruning and quantization. In the pruning literature (Han et al., 2015; Frankle & Carbin, 2018; Molchanov et al., 2019), the baseline method consists in setting the smallest scalar weight values to zero and keeping the highest non-zero values unchanged, assuming that small weights contribute less to the network prediction. In a similar vein, logarithmic quantization, or power quantization with a > 1, coarsely quantizes small scalar values (almost zeroing them out) to better preserve the precision on larger values. In practice, in our case, lower reconstruction errors, and better accuracies, are achieved by setting a < 1: this suggests that the assumption behind pruning cannot be straightforwardly applied to quantization, where we argue that finely quantizing smaller weights is in fact paramount to preserve the patterns learned at each layer, and the representation power of the whole network.
Another approach that puts more emphasis on the nuances between low valued weights is logarithmic based non-uniform quantization. In Table 1 and Appendix E, we compare the proposed power method to both uniform and logarithmic approaches. By definition, the proposed power method necessarily outperforms the uniform method in every scenario as uniform quantization is included in the search space. For instance, in int4, the proposed method improves the accuracy by 13.22 points on ResNet 50. This improvement can also be attributed to a better input quantization of each layer, especially on ResNet 50 where the gap in the reconstruction error (over the weights) is smaller.

4.3. COMPARISON WITH DATA-FREE QUANTIZATION METHODS

In Table 2, we report the performance of several data-free quantization approaches on ResNet 50. Although no real training data is involved in these methods, some approaches such as ZeroQ (Cai et al., 2020), DSG (Zhang et al., 2021b) or GDFQ (Xu et al., 2020) rely on data generation (DG) in order to calibrate parameters of the method or to apply fine-tuning to preserve the accuracy through quantization. As shown in Table 2, in the W8/A8 setup, the proposed PowerQuant method outperforms other data-free solutions, fully preserving the accuracy of the floating point model. The gap is even wider in the more challenging low bit W4/A4 setup, where PowerQuant improves the accuracy by 1.93 points over SQuant (Cong et al., 2022) and by 14.88 points over GDFQ. This shows the effectiveness of the method on ResNet 50. We provide more results on DenseNet (Huang et al., 2017), MobileNet (Sandler et al., 2018) and EfficientNet (Tan & Le, 2019) in the Appendix. We also evaluate image transformers (2021) with baseline accuracies 72.2, 79.9 and 81.8 and ≈5M, ≈22M and ≈87M parameters respectively. Similarly to ConvNets, the image transformers are better quantized using PowerQuant than with standard uniform quantization schemes such as DFQ. Furthermore, more complex and recent data-free quantization schemes such as SQuant tend to under-perform on the novel transformer architectures as compared to ConvNets. This is not the case for PowerQuant, which maintains its very high performance even in low bit representations. This is best illustrated on ViT, where PowerQuant in W4/A8 outperforms both DFQ and SQuant even when they are allowed 8 bits for the weights (W8/A8), by a whopping 4.91 points. The proposed PowerQuant even outperforms methods dedicated to transformer quantization, such as PSAQ (Li et al., 2022), on every image transformer tested. We further compare the proposed power quantization, in W4/A8, on natural language processing (NLP) tasks and report results in Table 4.
We evaluate a BERT model (Devlin et al., 2018) on GLUE (Wang et al., 2018) and report both the original (reference) and our reproduced (baseline) results. We compare the three quantization processes: uniform, logarithmic and PowerQuant. Similarly to computer vision tasks, the power quantization outperforms the other methods in every instance, which further confirms its ability to generalize well to transformers and NLP tasks. In what follows, we show experimentally that our approach induces negligible overhead at inference time, making this accuracy enhancement virtually free from a computational standpoint. We use the metric from (2022) to provide a hardware-agnostic measurement of the overhead computation cost in quantized neural networks. In Table 5, we evaluate the cost in the inference graph due to the change in the activation function. We observe very similar results to Table 17. The proposed changes are negligible in terms of computational cost on all tested networks. Furthermore, DenseNet has the highest cost due to its very dense connectivity. On the other hand, using this metric, it seems that the overhead cost due to the zero-point technique from Section 3.3 for EfficientNet has no significant impact as compared to MobileNet and ResNet. In addition, we discuss the inference and processing cost of PowerQuant on specific hardware using dedicated tools in Appendix K.

5. CONCLUSION

In this paper, we pinpointed the uniformity of the quantization operator as a limitation of existing data-free methods. To address this limitation, we proposed a novel data-free method for non-uniform quantization of trained neural networks for computer vision tasks, with an emphasis on not changing the nature of the mathematical operations involved (e.g. matrix multiplication). This led us to search among the continuous automorphisms of $(\mathbb{R}_+^*, \times)$, which are restricted to the power functions $x \mapsto x^a$. We proposed an optimization of this exponent parameter based upon the reconstruction error between the original floating point weights and the quantized ones. We showed that this procedure is locally convex and admits a unique solution. At inference time, the proposed approach, dubbed PowerQuant, involves only very simple modifications of the quantized DNN activation functions. We empirically demonstrated that PowerQuant allows a closer fit to the original weight distributions compared with uniform or logarithmic baselines, and significantly outperforms existing methods on a variety of benchmarks with only negligible computational overhead at inference time. In addition, we also discussed and addressed some of the limitations in terms of optimization (per-layer or global) and generalization (non-ReLU networks). Future work involves the search for a better proxy error than the proposed weight reconstruction error, as well as the extension of the search space to other composition laws of $\mathbb{R}_+$ that are suited for efficient calculus and inference.

A PROOF OF LEMMA 1

In this section, we provide a simple proof of Lemma 1, as well as a discussion of the continuity hypothesis.

Proof. We have $\forall x \in \mathbb{R}_+,\ Q(x) \times Q(0) = Q(0)$ and $\forall x \in \mathbb{R}_+,\ Q(x) \times Q(1) = Q(x)$, which implies that Q is either the constant 1, or $Q(0) = 0$ and $Q(1) = 1$. Because Q is an automorphism, we can eliminate the first option. We now show that Q is necessarily a power function. Let n be an integer; then $Q(x^n) = Q(x) \times Q(x^{n-1}) = Q(x)^2 \times Q(x^{n-2}) = \cdots = Q(x)^n$. Similarly, for fractions, $Q(x^{1/n}) \times \cdots \times Q(x^{1/n}) = Q(x) \Leftrightarrow Q(x^{1/n}) = Q(x)^{1/n}$. Assuming Q is continuous, we deduce that for any $a \in \mathbb{R}$,

$$Q(x^a) = Q(x)^a \qquad (7)$$

To verify that the solutions are limited to power functions, we proceed by reductio ad absurdum. Assume Q is not a power function. Then there exist $(x, y) \in \mathbb{R}_+^2$ with $x, y \neq 1$ and $a \in \mathbb{R}$ such that $Q(x) \neq x^a$ and $Q(y) = y^a$. By definition of the logarithm, there exists $b \neq 0$ such that $x^b = y$. From (7), $Q(x^b) = Q(x)^b$, while $Q(x^b) = Q(y) = y^a = x^{ab}$. It follows that $Q(x)^b = x^{ab}$, i.e. $Q(x) = x^a$, a contradiction. Consequently, the suitable functions Q are limited to power functions, i.e. $\mathcal{Q} = \{Q : x \mapsto x^a \mid a \in \mathbb{R}\}$.

We would also like to emphasize that there are other automorphisms of $(\mathbb{R}_+^*, \times)$. However, the construction of such automorphisms requires the axiom of choice (Herrlich, 2006). Such automorphisms are not applicable in our case, which is why the key constraint is being an automorphism rather than the continuity property.
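A quick numeric illustration of the lemma (with arbitrarily chosen values): power functions satisfy the automorphism identity, while a logarithmic mapping, as used in log-quantization, does not:

```python
import numpy as np

x, y, a = 2.0, 3.0, 0.55
power = lambda t: t ** a
# Q(x) * Q(y) == Q(x * y): multiplication is preserved
assert np.isclose(power(x) * power(y), power(x * y))

log_map = lambda t: np.log2(t) + 5.0   # an affine-log map breaks the identity
assert not np.isclose(log_map(x) * log_map(y), log_map(x * y))
```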

B NORM SELECTION

In the minimization objective, we need to select a norm. In this section, we provide theoretical arguments in favor of the $\ell_2$ vector norm. Let F be a feed forward neural network with L layers to quantize, each defined by a set of weights $W_l = (w_{i,j}) \in \mathbb{R}^{n_l \times m_l}$ and bias $b_l \in \mathbb{R}^{n_l}$. We note $(\lambda_l^{(i)})_i$ the eigenvalues associated with $W_l$. We want to study the distance $d(F, F_a)$ between the predictive function F and its quantized version $F_a$, defined as

$$d(F, F_a) = \max_{x \in D} \| F(x) - F_a(x) \|_p$$

where D is the domain of F. We prove that minimizing the reconstruction error with respect to a is equivalent to minimizing $d(F, F_a)$ with respect to a. Assume L = 1 for the sake of simplicity and drop the index l. With the proposed PowerQuant method, we minimize the vector norm

$$\| W - Q_a^{-1}(Q_a(W)) \|_p^p = \sum_{i \leq n} \sum_{j \leq m} \left| w_{i,j} - Q_a^{-1}(Q_a(w_{i,j})) \right|^p \qquad (10)$$

For p = 2, the Euclidean norm upper-bounds the spectral norm; thus minimizing $\| W - Q_a^{-1}(Q_a(W)) \|_2$ controls $d(F, F_a)$ for L = 1. However, minimizing for another value of p may result in a different optimal solution and therefore not necessarily minimize $d(F, F_a)$. In the context of data-free quantization, we want to avoid uncontrollable changes to F, which is why we recommend the use of p = 2.

C MATHEMATICAL PROPERTIES

C.1 LOCAL CONVEXITY

We prove that the minimization problem defined in equation 3 is locally convex around the solution $a^*$. Formally, we prove that

$$a \mapsto \left\| x - Q_a^{-1}(Q_a(x)) \right\|_p \qquad (11)$$

is locally convex around $a^*$ defined as $\arg\min_a \| x - Q_a^{-1}(Q_a(x)) \|_p$.

Lemma 2. The minimization problem defined as $\arg\min_a \| x - Q_a^{-1}(Q_a(x)) \|_p$ is locally convex around any solution $a^*$.

Proof. We recall that $\frac{\partial x^a}{\partial a} = x^a \log(x)$. The function $x - Q_a^{-1}(Q_a(x))$ is differentiable. We assume $x \in \mathbb{R}$; we can then simplify the sign functions (assume x positive without loss of generality) and note $y = \max |x|$. Then

$$\frac{\partial Q_a^{-1}(Q_a(x))}{\partial a} = \frac{\partial}{\partial a}\left[\left(\left\lfloor (2^{b-1}-1)\frac{x^a}{y^a}\right\rceil \frac{y^a}{2^{b-1}-1}\right)^{\frac{1}{a}}\right]$$

Noting $B = 2^{b-1}-1$, this simplifies to

$$\frac{\partial Q_a^{-1}(Q_a(x))}{\partial a} = y\,\frac{\partial}{\partial a}\left(\frac{\left\lfloor B\left(\frac{x}{y}\right)^a\right\rceil}{B}\right)^{\frac{1}{a}}$$

By the standard differentiation rules, the rounding operator has a zero derivative almost everywhere. Consequently, we get

$$\frac{\partial Q_a^{-1}(Q_a(x))}{\partial a} = -\frac{y}{a^2}\left(\frac{\left\lfloor B\left(\frac{x}{y}\right)^a\right\rceil}{B}\right)^{\frac{1}{a}} \log\left(\frac{\left\lfloor B\left(\frac{x}{y}\right)^a\right\rceil}{B}\right)$$

We can now compute the second derivative of $Q_a^{-1}(Q_a(x))$:

$$\frac{\partial^2 Q_a^{-1}(Q_a(x))}{\partial a^2} = \frac{y}{a^4}\left(\frac{\left\lfloor B\left(\frac{x}{y}\right)^a\right\rceil}{B}\right)^{\frac{1}{a}} \log^2\left(\frac{\left\lfloor B\left(\frac{x}{y}\right)^a\right\rceil}{B}\right)$$

From this expression, using the property $(f \circ g)'' = f'' \circ g \times g'^2 + f' \circ g \times g''$ and the derivatives $\left(|\cdot|^{\frac{1}{p}}\right)' = \frac{x|x|^{\frac{1}{p}-2}}{p}$ and $\left(|\cdot|^{\frac{1}{p}}\right)'' = \frac{1-p}{p^2}\frac{|x|^{\frac{1}{p}}}{x^2}$, we get for any $x_i \in x$

$$\frac{\partial^2 \left|x_i - Q_a^{-1}(Q_a(x_i))\right|^{\frac{1}{p}}}{\partial a^2} = \frac{1-p}{p^2} \frac{\left|x_i - Q_a^{-1}(Q_a(x_i))\right|^{\frac{1}{p}}}{\left(x_i - Q_a^{-1}(Q_a(x_i))\right)^2} \left(\frac{\partial Q_a^{-1}(Q_a(x))}{\partial a}\right)^2 + \frac{\left(x_i - Q_a^{-1}(Q_a(x_i))\right)\left|x_i - Q_a^{-1}(Q_a(x_i))\right|^{\frac{1}{p}-2}}{p} \frac{\partial^2 Q_a^{-1}(Q_a(x))}{\partial a^2} \qquad (17)$$

We denote the first term of this sum $T_1$, and write the second term as the product of $T_2 = \frac{\left(x_i - Q_a^{-1}(Q_a(x_i))\right)\left|x_i - Q_a^{-1}(Q_a(x_i))\right|^{\frac{1}{p}-2}}{p}$ and $T_3 = \frac{\partial^2 Q_a^{-1}(Q_a(x))}{\partial a^2}$. We know that $T_1 > 0$ and $T_3 > 0$, and that $T_2$ is continuous in a.
At $a^*$, the terms in $|x_i - Q_a^{-1}(Q_a(x_i))|$ are negligible in comparison with $\left(\frac{\partial Q_a^{-1}(Q_a(x))}{\partial a}\right)^2$ and $\frac{\partial^2 Q_a^{-1}(Q_a(x))}{\partial a^2}$. Consequently, there exists an open set around $a^*$ where $T_1 > |T_2| T_3$, and hence $\frac{\partial^2 |x_i - Q_a^{-1}(Q_a(x_i))|}{\partial a^2} > 0$. This concludes the proof.

C.2 UNIQUENESS OF THE SOLUTION

In this section, we provide the elements of proof of the uniqueness of the solution of the minimization of the quantization reconstruction error.

Lemma 3. The minimization problem over $x \in \mathbb{R}^N$ defined as $\arg\min_a \| x - Q_a^{-1}(Q_a(x)) \|_p$ almost surely has a unique global minimum $a^*$.

Proof. We assume that x cannot be exactly quantized, i.e. $\min_a \| x - Q_a^{-1}(Q_a(x)) \|_p > 0$, which holds almost everywhere. We proceed by reductio ad absurdum and assume that there exist two optimal solutions $a_1$ and $a_2$. We expand the expression and get

$$\left\| x - Q_a^{-1}(Q_a(x)) \right\|_p = \left\| x - \left( \left\lfloor (2^{b-1}-1)\frac{|x|^a}{\max|x|^a} \right\rceil \frac{\max|x|^a}{2^{b-1}-1} \right)^{\frac{1}{a}} \text{sign}(x) \right\|_p$$

We note the rounding term $R_a = \left\lfloor (2^{b-1}-1)\frac{|x|^a}{\max|x|^a} \right\rceil$ and get

$$\left\| x - Q_a^{-1}(Q_a(x)) \right\|_p = \left\| x - \left( R_a \frac{\max|x|^a}{2^{b-1}-1} \right)^{\frac{1}{a}} \text{sign}(x) \right\|_p$$

Assume first $R_{a_1} = R_{a_2} = R$. Then the minimization problem $\arg\min_a \left\| x - \left( R \frac{\max|x|^a}{2^{b-1}-1} \right)^{\frac{1}{a}} \text{sign}(x) \right\|_p$ is convex and has a unique solution, thus $a_1 = a_2$. Now assume $R_{a_1} \neq R_{a_2}$. Denote by $D(R)$ the domain of power values a over which $\left\lfloor (2^{b-1}-1)\frac{|x|^a}{\max|x|^a} \right\rceil = R$. If there is a value a outside of $D(R_{a_1}) \cup D(R_{a_2})$ such that $R'$ has each of its coordinates strictly between the corresponding coordinates of $R_{a_1}$ and $R_{a_2}$, then, without loss of generality, at least half of the coordinates of $R_{a_1}$ are further away from the corresponding coordinates of x than one quantization step. This implies that there exists a value $a'$ in $D(R')$ such that $\| x - Q_{a'}^{-1}(Q_{a'}(x)) \|_p < \| x - Q_{a_1}^{-1}(Q_{a_1}(x)) \|_p$, which contradicts our hypothesis. Thus, there are up to N possible values of R that minimize the problem, which happens iff x is such that at least one coordinate can be either ceiled or floored by the rounding. The set defined by this condition has zero measure.

D SOLVER FOR MINIMIZATION

In the main article, we state that the Nelder-Mead solver (Nelder & Mead, 1965) can be used to find the optimal $a^*$. We tested several other solvers and report the results in Table 6. The empirical results show that virtually any popular solver can be used, and that the Nelder-Mead solver is sufficient for this minimization problem.
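As an illustration, the fit of the exponent can be sketched as follows. This is a minimal stand-in with our own function and variable names (`recon_error` and the synthetic weights are assumptions, not the paper's code); a self-contained coarse-to-fine grid search plays the role of the derivative-free solver, and any off-the-shelf alternative such as SciPy's Nelder-Mead could be substituted.

```python
import numpy as np

B = 2 ** (8 - 1) - 1  # b = 8 bits

def recon_error(a, w):
    # reconstruction error ||w - Q_a^{-1}(Q_a(w))||_2 of the power quantizer
    m = np.max(np.abs(w)) ** a
    q = np.round(B * np.sign(w) * np.abs(w) ** a / m)  # integer codes in [-B, B]
    w_hat = np.sign(q) * (np.abs(q) * m / B) ** (1.0 / a)
    return float(np.linalg.norm(w - w_hat))

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.03, size=2048)  # synthetic layer weights (assumption)

# derivative-free minimization: coarse-to-fine grid search over a
lo, hi = 0.2, 2.0
for _ in range(4):
    grid = np.linspace(lo, hi, 41)
    errs = [recon_error(a, w) for a in grid]
    i = int(np.argmin(errs))
    a_star, best = float(grid[i]), errs[i]
    lo, hi = float(grid[max(i - 1, 0)]), float(grid[min(i + 1, len(grid) - 1)])
print(a_star, best)
```

Because each refinement keeps the previous best point inside the bracket, the returned error never exceeds the values at the initial endpoints, which is all the minimization requires in practice.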

E COMPARISON BETWEEN LOG, NAIVE AND POWER QUANTIZATION COMPLEMENTARY RESULTS

To complement the results provided in the main paper on ResNet 50, we list in Table 7 more quantization setups on ResNet 50 as well as DenseNet 121. In a nutshell, the proposed power quantization systematically achieves significantly higher accuracy and lower reconstruction error than the logarithmic and uniform quantization schemes. As a side note, the poor performance of the logarithmic approach on DenseNet 121 can be attributed to the skewness of the weight distributions. Formally, ResNet 50 and DenseNet 121 weight values show similar average standard deviations across layers (0.0246 and 0.0264, respectively) as well as similar kurtosis (6.905 and 6.870, respectively). However, their skewness values differ significantly: 0.238 for ResNet 50 and more than twice as much, 0.489, for DenseNet 121. Logarithmic quantization, which focuses on values close to zero, is very sensitive to asymmetry, which explains its poor performance on DenseNet 121. In contrast, the proposed method offers robust performance in all situations.
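For reference, the per-tensor moments quoted above can be computed as follows. This is a sketch on synthetic data: the two tensors and the `moments` helper are our own illustration, not the actual ResNet or DenseNet weights.

```python
import numpy as np

def moments(w):
    # standardized third and fourth moments: skewness and kurtosis
    z = (w - w.mean()) / w.std()
    return float(np.mean(z ** 3)), float(np.mean(z ** 4))

rng = np.random.default_rng(3)
symmetric = rng.normal(0.0, 0.025, size=100_000)              # roughly symmetric (assumption)
skewed = symmetric + 0.05 * np.abs(rng.normal(size=100_000))  # right-skewed (assumption)

skew_sym, kurt_sym = moments(symmetric)
skew_skw, kurt_skw = moments(skewed)
print(skew_sym, skew_skw)  # the second tensor has the larger skewness
```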

F HOW TO PERFORM MATRIX MULTIPLICATION WITH POWERQUANT

The proposed PowerQuant method preserves multiplications, i.e. a multiplication in the floating-point space remains a multiplication in the quantized (integer) space. This allows one to leverage the implementations of uniform quantization currently available on most hardware Gholami et al. (2021); Zhou et al. (2016). However, while PowerQuant preserves multiplications, it does not preserve additions, which are significantly less costly than multiplications. Consequently, in order to infer under the PowerQuant transformation, instead of accumulating the quantized products as done in standard quantization Jacob et al. (2018), one needs to accumulate powers of said products. Formally, consider two quantized weights $w_1, w_2$ and their respective quantized inputs $x_1, x_2$. The standard accumulation would be performed as $w_1 x_1 + w_2 x_2$. In the case of PowerQuant, it becomes $(w_1 x_1)^{1/a} + (w_2 x_2)^{1/a}$. Previous studies on quantization have demonstrated that such power functions can be computed with very high fidelity at almost no latency cost Kim et al. (2021).
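The accumulation rule above can be checked numerically on a toy dot product. This is a self-contained sketch with our own quantizer and variable names; the exponent, bit-width and tensor sizes are arbitrary assumptions.

```python
import numpy as np

a, bits = 0.7, 8  # arbitrary exponent and bit-width (assumptions)
B = 2 ** (bits - 1) - 1

def quant(x):
    # symmetric power quantization: integer codes in [-B, B] plus a scale m
    m = np.max(np.abs(x)) ** a
    return np.round(B * np.sign(x) * np.abs(x) ** a / m), m

rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.05, size=64)
x = rng.uniform(0.0, 1.0, size=64)

qw, mw = quant(w)
qx, mx = quant(x)

# standard uniform quantization would accumulate sum(qw * qx); under
# PowerQuant the accumulator instead sums the a-th roots of the products
prod = qw * qx
acc = np.sum(np.sign(prod) * np.abs(prod) ** (1.0 / a))
y_hat = float(acc * (mw * mx / B ** 2) ** (1.0 / a))  # rescale to floats
y_ref = float(w @ x)
print(y_ref, y_hat)  # should match up to quantization error
```

Note that the per-product root is exactly the $(w_i x_i)^{1/a}$ term of the text; the final rescaling by $(m_w m_x / B^2)^{1/a}$ is a single floating-point factor, as in standard requantization.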

G OVERHEAD COST OF ZERO-POINTS IN ACTIVATION QUANTIZATION

The overhead cost introduced in equation 5 is well known in quantization in general, as it arises from asymmetric quantization. Nonetheless, we share here (as well as in the article) some empirical values obtained with our own implementation. We include ResNet 50 as it can also be quantized using asymmetric quantization, although in our experiments we only applied asymmetric quantization to SiLU- and GeLU-based architectures. It is worth noting that, according to LSQ+ Bhalgat et al. (2020), asymmetric quantization can be achieved at virtually no run-time cost.
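The cross term behind this overhead can be made explicit with a small numeric sketch. This is our own minimal formulation of zero-point (asymmetric) quantization; the variable names and value ranges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(0.0, 0.05, size=128)  # symmetric weights (assumption)
x = rng.uniform(0.0, 1.0, size=128)  # asymmetric activation range (assumption)

# symmetric weight quantization: w ~ s_w * q_w
B = 127
s_w = np.max(np.abs(w)) / B
q_w = np.round(w / s_w)

# asymmetric activation quantization with a zero-point: x ~ s_x * (q_x - z_x)
s_x = (x.max() - x.min()) / 255.0
z_x = np.round(-x.min() / s_x)
q_x = np.round(x / s_x + z_x)

# integer accumulation plus ONE extra term z_x * sum(q_w): this is the
# overhead; since the weights are fixed, it can be precomputed offline
acc = np.sum(q_w * q_x) - z_x * np.sum(q_w)
y_hat = float(s_w * s_x * acc)
print(float(w @ x), y_hat)  # should match up to quantization error
```

The extra term depends only on the weights and the activation zero-point, which is why LSQ+ style implementations can fold it into the bias at virtually no run-time cost.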

H LIMITATIONS OF THE RECONSTRUCTION ERROR METRIC

In the proposed PowerQuant method, we fit the parameter $a$ based on the reconstruction error over all the weights, i.e. over all layers of the whole network. Then, we perform per-channel quantization layer by layer independently. However, if the final objective is to minimize the reconstruction error from equation 3, a more efficient approach would consist in fitting the parameter $a$ separately for each layer. We note $a_l^*$ such that for every layer $l$ we have $a_l^* = \arg\min_a \|W_l - Q_a^{-1}(Q_a(W_l))\|_p$. Then the network $(F, (a_l)^*)$ quantized with a per-layer fit of the power parameter satisfies $\sum_{l=1}^L \|W_l - Q_{a_l}^{-1}(Q_{a_l}(W_l))\|_p < \sum_{l=1}^L \|W_l - Q_a^{-1}(Q_a(W_l))\|_p$ if and only if there exists at least one $l$ such that $a_l \neq a$. Consequently, if the reconstruction error were a perfect estimate of the resulting accuracy, the per-layer strategy would offer an even higher accuracy than the proposed PowerQuant method. Unfortunately, the empirical evidence in Table 9 shows that the proposed PowerQuant method achieves better results in every benchmark. This observation demonstrates the limits of the reconstruction error metric. We explain this phenomenon by the importance of input and activation quantization: the per-layer strategy can be seen as a form of overfitting of the parameters $a_l$ on the weights, which leads to poor performance on activation quantization and prediction. More generally, this highlights the limitations of the reconstruction error as a proxy for maximizing the accuracy. Previous results can be interpreted in a similar way. For instance, in SQuant Cong et al. (2022), the authors claim that it is better to minimize the absolute sum of errors rather than the sum of absolute errors, and achieve good performance in data-free quantization.
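The inequality above can be illustrated numerically. This is a sketch on synthetic layer weights with our own helper function; by construction, the per-layer fit can only lower the total weight reconstruction error, which is precisely why its inferior accuracy in Table 9 exposes the limits of the metric.

```python
import numpy as np

B = 2 ** (8 - 1) - 1

def recon_error(a, w):
    # ||w - Q_a^{-1}(Q_a(w))||_2 for the power quantizer Q_a
    m = np.max(np.abs(w)) ** a
    q = np.round(B * np.sign(w) * np.abs(w) ** a / m)
    w_hat = np.sign(q) * (np.abs(q) * m / B) ** (1.0 / a)
    return float(np.linalg.norm(w - w_hat))

rng = np.random.default_rng(5)
layers = [rng.normal(0.0, s, size=1024) for s in (0.01, 0.03, 0.08)]  # synthetic W_l
grid = np.linspace(0.2, 2.0, 91)

# global fit: one shared a minimizing the summed error; per-layer: one a_l each
global_err = min(sum(recon_error(a, w) for w in layers) for a in grid)
per_layer_err = sum(min(recon_error(a, w) for a in grid) for w in layers)
print(per_layer_err, global_err)  # per-layer fit is never worse on the weights
```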

I IMPROVEMENT WITH RESPECT TO QAT

In the introduction, we argued that the performance of data-driven quantization schemes defines an upper bound on data-free performance, and our goal was to narrow the resulting gap. In Table 10, we report the evolution of the gap between data-free and data-driven quantization techniques. These empirical results validate the significant improvement of the proposed method, which narrows the gap between data-free and data-driven quantization methods by 26.66% to 29.74%. In order to complete our comparison to QAT methods, we considered the short re-training (30 epochs) regime from OCTAV in Table 11. We can draw two observations from this comparison. First, on ResNet 50, OCTAV achieves remarkable results by reaching near full-precision accuracy. Still, the proposed method does not fall far behind, with only 5.31 points lower accuracy while being data-free. Second, on very small models such as MobileNet V2, using a strong quantization operator rather than a short re-training leads to a huge accuracy improvement, as PowerQuant achieves 45.18 points higher accuracy. This is also the finding of the authors of OCTAV, who conclude that models such as MobileNet tend to be very challenging to quantize using static quantization and short re-training. In Table 12, we draw a comparison between the proposed PowerQuant and the QAT method OCTAV Sakr et al. (2022), both using dynamic quantization (i.e. estimating the ranges of the activations on-the-fly depending on the input). As expected, the use of dynamic ranges has a considerable influence on the performance of both quantization methods. The QAT method OCTAV achieves very impressive results, even outperforming the full-precision model on ResNet 50. Nevertheless, it is on MobileNet that the influence of dynamic ranges is the most striking: for OCTAV, we observe a boost of almost 71 points, going from almost random predictions to near exact full-precision accuracy.
It is to be noted that PowerQuant does not fall shy of these performances: even using static quantization, we still manage to preserve some of the predictive capability of the model. Furthermore, using dynamic quantization, PowerQuant achieves accuracies similar to OCTAV while not involving any fine-tuning, contrary to OCTAV. All in all, we can conclude that the proposed data-free method manages to remain close to a state-of-the-art QAT method in some contexts. An interesting future work could be the extension of PowerQuant to a QAT setting, possibly learning the power parameter $a$ used in our quantization operator.

J COMPARISON TO STATE-OF-THE-ART DATA-FREE QUANTIZATION ON OTHER CONVNETS

In addition to our evaluation on ResNet, we provide complementary results on DenseNet in Table 13 as well as on the challenging and compact architectures MobileNet and EfficientNet in Table 14. We observe a significant boost in performance from PowerQuant as compared to the other, very competitive, data-free solutions.

K OVERHEAD COST DISCUSSION

In this section, we provide more empirical results on the inference cost of the proposed method. Table 17 shows the inference time of DNNs quantized with our approach (which only implies modifications of the activation function and a bias correction, see Section 3.3). For DenseNet, ResNet and MobileNet V2, the baseline activation function is the ReLU, which is particularly fast to compute. We also used the simulation tool nntool from GreenWaves, a RISC-V chip manufacturer, which enables simulating the inference cost of quantized neural networks on their GAP units. We tested a single convolutional layer with bias and ReLU activation plus our power quantization operation, and report the number of cycles and operations. These results demonstrate that, even without any optimization, the proposed method has a marginal computational cost on MCU inference, which corroborates our previous empirical results. We would like to put the emphasis on the fact that this cost could be further reduced by optimizing the computation of the power function using existing methods such as Kim et al. (2021). Similarly, we measure the empirical time required to perform the proposed quantization method on several neural networks and report the results in Table 19. These results show that the proposed PowerQuant method offers outstanding trade-offs in terms of compression and accuracy, at virtually no cost in processing and inference time as compared to other data-free quantization methods. For instance, SQuant is a sophisticated method that requires heavy lifting in order to efficiently process a neural network: on a CPU, it requires at least 100 times more time to reach a lower accuracy than the proposed method, as we showcase in our comparison to state-of-the-art quantization schemes.



EXPERIMENTS

In this section, we empirically validate our method. First, we discuss the optimization of the exponent parameter $a$ of PowerQuant using the reconstruction error, showing its interest as a proxy for the quantized model accuracy from an experimental standpoint. We show that the proposed approach preserves this reconstruction error significantly better, allowing a closer fit to the original weight distribution through non-uniform quantization. Second, we show through a variety of benchmarks that the proposed approach significantly outperforms state-of-the-art data-free methods, thanks to a more efficient power function quantization with optimized exponent. Third, we show that the proposed approach comes at a negligible cost in terms of inference speed.



Figure 1: Comparison of the proposed method to other data-free quantization schemes on DenseNet 121 pretrained on ImageNet. The proposed method (right bin in blue) drastically improves upon the existing data-free methods especially in the challenging W4/A4 quantization.

Figure 2: Influence of the power parameter a on the quantized distribution for weights distributed following a Gaussian prior. In such a case, the reconstruction error is typically minimized for a < 1.

Figure 3: Accuracy/reconstruction error relationship for ResNet and DenseNet quantized in W4/A4.

Comparison between logarithmic, uniform and the proposed quantization scheme on ResNet 50 trained for the ImageNet classification task. We report, for different quantization configurations (weights noted W and activations noted A), both the top-1 accuracy and the reconstruction error (equation 3).

Comparison between state-of-the-art post-training quantization techniques on ResNet 50 on ImageNet. We distinguish methods relying on data (synthetic or real) from those that do not. In addition to being fully data-free, our approach significantly outperforms existing methods.

Comparison of data-free quantization methods on ViT and DeiT trained on ImageNet.

Complementary Benchmarks on the GLUE task with the BERT transformer architecture quantized in W4/A8. We provide the original performance (from the article) as well as our reproduced results (baseline).

ACE cost of the overhead computations introduced by PowerQuant.

Minimization of the reconstruction error on a MobileNet V2 for W6/A6 quantization with different solvers.

Comparison between logarithmic, uniform and the proposed quantization scheme on ResNet 50 and DenseNet 121 trained for the ImageNet classification task. We report, for different quantization configurations (weights noted W and activations noted A), both the top-1 accuracy and the reconstruction error (equation 3).

Overhead induced by asymmetric quantization

Comparison between the per-layer and global fitting of the power parameter a on a ResNet 50 trained for the ImageNet classification task.

Performance gap as compared to data-driven techniques on ResNet 50 quantization in W4/A4. The relative gap improvement over the state-of-the-art SQuant [6] is measured as $(g_s - g_p)/g_s$, where $g_s$ and $g_p$ denote the gaps of SQuant and PowerQuant, respectively.

Performance gap between data-free PowerQuant and short-retraining OCTAV Sakr et al. (2022).

Performance gap between PowerQuant and OCTAV Sakr et al. (2022) (using an additional short retraining), both using dynamic range estimation.

Comparison between state-of-the-art post-training quantization techniques on DenseNet 121 on ImageNet. We distinguish methods relying on data (synthetic or real) from those that do not. In addition to being fully data-free, our approach significantly outperforms existing methods.

Complementary benchmarks on ImageNet and weight-only quantization for BERT in Table 16.

In Table 13, we report the performance of other data-free quantization processes on DenseNet 121. The OMSE method (Choukroun et al., 2019) is a post-training quantization method that leverages validation examples during quantization, and thus cannot be labelled as data-free. Yet, we include this work in our comparison as it shows strong performance in terms of accuracy at a very low usage of real data. As showcased in Table 13, the proposed PowerQuant method almost preserves the floating-point accuracy in W8/A8 quantization. Additionally, on the challenging W4/A4 setup, our approach improves the accuracy by a remarkable 12.30 points over OMSE and 17.54 points over SQuant. This is due to the overall better efficiency of non-uniform quantization, which allows a theoretically closer fit to the weight distributions of each DNN layer. The results on MobileNet and EfficientNet from Table 14 confirm our previous findings.

Inference time, in seconds, over ImageNet using batches of size 16 for several networks on an RTX 2070 GPU. We also report the accuracy for the W6/A6 quantization setup.

Inference cost of each component of a convolutional layer, and percentage of the total in terms of number of cycles, on a wide range of simulated hardware using nntool from GreenWaves.

We report the processing time in seconds (on an Intel(R) Core(TM) i9-9900K CPU) required to quantize a trained neural network such as ResNet, MobileNet, DenseNet or EfficientNet.

ACKNOWLEDGMENTS

This work has been supported by the French National Association for Research and Technology (ANRT), the company Datakalab (CIFRE convention C20/1396) and by the French National Research Agency (ANR) (FacIL). This work was granted access to the HPC resources of IDRIS under the allocation 2022-AD011013384 made by GENCI.

ANNEX

Nevertheless, our results show that our approach increases the whole inference time by only about 1% on most networks. More precisely, in the case of ResNet 50, the change in activation function induces a slowdown of 0.15%. The largest runtime increase is obtained on DenseNet, with a 3.4% increase. Lastly, note that our approach is also particularly fast and efficient on EfficientNet B0, which uses the SiLU activation, thanks to the bias correction technique introduced in Section 3.3. Overall, the proposed approach can be easily implemented and induces a negligible inference overhead on GPU. To further justify the practicality of the proposed quantization process, we recall that the only practical concern that may arise is the activation function, as the other operations are strictly identical to standard uniform quantization. According to Kim et al. (2021), efficient power functions can be implemented on generic hardware as long as it supports standard integer arithmetic, i.e. as long as it supports uniform quantization. When it comes to Field-Programmable Gate Arrays (FPGA), activation functions are implemented using look-up tables (LUT), as detailed in Hajduk (2017). More precisely, they are pre-computed using Padé approximations, which are quotients of polynomial functions. Consequently, the proposed approach would simply change the polynomial values but not the inference time, as it would still rely on the same number of LUTs. In general, non-linear activation functions can be implemented very effectively in quantized runtimes Lam et al. (2022). However, these considerations are hardware agnostic. In order to circumvent this limitation and address any remaining concerns, we conducted a small study using the simulation tool nntool from GreenWaves, described in Section K.
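For instance, a LUT realization of the extra power function on 8-bit codes can be sketched as follows (our own illustration; the exponent is an arbitrary assumption). The table is computed once offline, so inference replaces each power evaluation by a single table read, in line with the FPGA discussion above.

```python
import numpy as np

a = 0.6  # arbitrary power parameter (assumption)
# precompute x -> round(x^(1/a)) for all 256 unsigned 8-bit codes once
lut = np.round(np.arange(256, dtype=np.float64) ** (1.0 / a)).astype(np.int64)

def power_activation(q):
    # q: array of unsigned 8-bit codes; one table read per element
    return lut[q]

codes = np.array([0, 1, 17, 255])
print(power_activation(codes))
```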

