BIBENCH: BENCHMARKING AND ANALYZING NETWORK BINARIZATION

Abstract

Neural network binarization is one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width of weights and activations. However, despite being a general technique, recent works reveal that applying binarization in various practical scenarios, including multiple tasks, architectures, and hardware, is not trivial. Moreover, common challenges, such as severe degradation in accuracy and limited efficiency gains, suggest that specific attributes of binarization are not thoroughly studied and adequately understood. To comprehensively understand binarization methods, we present BiBench, a carefully engineered benchmark with in-depth analysis for network binarization. We first inspect the requirements of binarization in actual production settings. Then, for the sake of fairness and systematic evaluation, we define the evaluation tracks and metrics. We also perform a comprehensive evaluation with a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, based on our benchmark results and analysis, we suggest a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way toward more extensive adoption of network binarization and serves as a foundation for future research.

1. INTRODUCTION

Since the rise of modern deep learning, the contradiction between ever-increasing model size and limited deployment resources has persisted. For this reason, compression technologies are crucial for practical deep learning and have been widely studied, including model quantization (Gong et al., 2014; Wu et al., 2016; Vanhoucke et al., 2011; Gupta et al., 2015), network pruning (Han et al., 2015; 2016; He et al., 2017), knowledge distillation (Hinton et al., 2015; Xu et al., 2018; Chen et al., 2018; Yim et al., 2017; Zagoruyko & Komodakis, 2017), lightweight architecture design (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018b; Ma et al., 2018), and low-rank decomposition (Denton et al., 2014; Lebedev et al., 2015; Jaderberg et al., 2014; Lebedev & Lempitsky, 2016). As a compression approach that reduces the bit-width to the extreme of 1-bit, network binarization is regarded as the most aggressive quantization technology (Rusci et al., 2020; Choukroun et al., 2019; Qin et al., 2022; Shang et al., 2022b; Zhang et al., 2022b; Bethge et al., 2020; 2019; Martinez et al., 2019; Helwegen et al., 2019). Binarized models leverage the most compact 1-bit parameters, which take little storage and memory and accelerate inference through efficient bitwise operations. Compared to other compression technologies like network pruning and architecture design, network binarization enjoys stronger topological generality since it applies only to parameters. Therefore, in academic research, network binarization is widely studied as an independent compression technique instead of the 1-bit specialization of quantization (Gong et al., 2019; Gholami et al., 2021). It is impressive that state-of-the-art (SoTA) binarization algorithms push binarized models to full-precision performance on large-scale tasks (Deng et al., 2009; Liu et al., 2020). However, existing network binarization is still far from practical.
We point out that two worrisome trends are emerging from the accuracy and efficiency perspectives in current binarization research:

Figure 1: Evaluation tracks of BiBench. Our benchmark evaluates binarization algorithms on the most comprehensive set of evaluation tracks, including "Learning Task", "Neural Architecture", "Corruption Robustness", "Training Consumption", "Theoretical Complexity", and "Hardware Inference".

Trend-1. Accuracy comparison converging to a limited scope. In recent binarization research, a few image classification tasks, e.g., CIFAR-10 and ImageNet, have become the standard options for comparing accuracy. This typical selection of evaluation tasks helps make the accuracy comparison among different binarization algorithms clear and fair. However, since most binarization algorithms are engineered for learning tasks with image-modality inputs, the presented insights and conclusions are rarely verified on a broader range of other modalities and tasks. The monotonic tasks also hinder comprehensive evaluation from an architectural perspective. Besides, data noise such as corruption is a common problem on low-cost edge devices and is widely studied in compression (Lin et al., 2018; Rakin et al., 2021), whereas few advanced binarization algorithms consider the robustness of binarized models. Trend-2. Efficiency analysis remaining at the theoretical level. Network binarization is widely recognized for its significant storage and computation savings; for convolutions, the theoretical savings are up to 32× in storage and 64× in computation, respectively (Rastegari et al., 2016; Bai et al., 2021). However, for lack of support from hardware libraries, models compressed by binarization algorithms can hardly be evaluated on real-world edge hardware, leaving their efficiency claims without experimental evidence.
In addition, the training efficiency of binarization algorithms is usually neglected in current research, which causes several negative phenomena when training a binary network, such as increasing demand for computation resources and time, sensitivity to hyperparameters, and the need for detailed tuning during optimization. In this paper, we present BiBench, a network Binarization Benchmark that evaluates binarization algorithms comprehensively from the accuracy and efficiency perspectives (Table 1). Based on BiBench, we benchmark 8 representative binarization algorithms on 9 deep learning datasets, 13 different neural architectures, 2 deployment libraries, 14 hardware chips, and various hyperparameter settings. Building BiBench cost about 4 GPU-years of computation time, devoted to promoting comprehensive evaluation of network binarization from the perspectives of accuracy and efficiency. Furthermore, we analyze the benchmark results in depth, reveal insights along the evaluation tracks, and give suggestions for designing practical binarization algorithms.

2. BACKGROUND

2.1. NETWORK BINARIZATION

Binarization compresses weights w ∈ R^{c_in×c_out×k×k} and activations a ∈ R^{c_in×w×h} to 1-bit in computationally dense convolutions, where c_in, k, c_out, w, and h denote the input channel, kernel size, output channel, input width, and input height, respectively. The computation can be expressed as

o = α · popcount(xnor(sign(a), sign(w))),  (1)

where o denotes the outputs and α ∈ R^{c_out} denotes the optional scaling factor calculated as α = ∥w∥ / n (Courbariaux et al., 2016b; Rastegari et al., 2016); xnor and popcount are bitwise instructions as defined in (Arm, 2020; AMD, 2022). Though enjoying extreme compression and acceleration, the severely limited representation causes degradation of binarized networks. Therefore, various algorithms constantly emerge to improve accuracy (Yuan & Agaian, 2021). The vast majority of existing binarization algorithms focus on improving the binarized operators. As shown in Eq. (1), the fundamental difference between binarized and full-precision networks is that the former applies binarized operators, which also directly affect the binarized models' optimization properties and determine their hardware efficiency (Alizadeh et al., 2019; Geiger & Team, 2020).

Table 1 (flattened during extraction; only partially recoverable) compares the evaluation scope of representative algorithms, including Bi-Real (Liu et al., 2018b), XNOR++ (Bulat et al., 2019), ReActNet (Liu et al., 2020), ReCU (Xu et al., 2021b), and FDA (Xu et al., 2021a): each of these covers at most 2 tasks and 6 architectures in its original paper, while our benchmark (BiBench) covers 9 tasks, 13 architectures, and all evaluation tracks. In the table, "√" and "×" indicate whether the track is considered in the original paper of the binarization algorithm, while "*" indicates it is only studied in other related work; "s", "τ", and "g" indicate the "scaling factor", "parameter redistribution", and "gradient approximation" techniques, respectively.
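As a minimal PyTorch sketch (not the code of any specific algorithm in the benchmark), the binarized operator in Eq. (1) can be emulated in floating point; on real hardware the +1/-1 multiply-accumulate is replaced by xnor/popcount:

```python
import torch
import torch.nn.functional as F

def binarized_conv2d(a, w, stride=1, padding=0):
    """Simulated binarized convolution following Eq. (1).

    a: activations of shape (N, c_in, h, w); w: weights of shape
    (c_out, c_in, k, k). The {-1, +1} products are emulated here with
    a floating-point convolution for clarity.
    """
    # Channel-wise scaling factor alpha = ||w||_1 / n, as in XNOR-Net.
    n = w[0].numel()
    alpha = w.abs().sum(dim=(1, 2, 3)) / n           # shape (c_out,)
    # sign() maps values to {-1, +1}; zeros are mapped to +1 for determinism.
    bin_a = torch.where(a >= 0, 1.0, -1.0)
    bin_w = torch.where(w >= 0, 1.0, -1.0)
    o = F.conv2d(bin_a, bin_w, stride=stride, padding=padding)
    return o * alpha.view(1, -1, 1, 1)

out = binarized_conv2d(torch.randn(1, 3, 8, 8), torch.randn(4, 3, 3, 3), padding=1)
```

The scaling factor follows the XNOR-Net convention cited above; algorithms such as BNN omit it entirely.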
In addition, improvements to binarized operators transfer flexibly across neural architectures and learning tasks (Wang et al., 2020b; Qin et al., 2022; Zhao et al., 2022), which fully exerts the generalizability of bit-width compression. We thereby consider 8 generic binarization algorithms in BiBench (Courbariaux et al., 2016b; Rastegari et al., 2016; Zhou et al., 2016b; Liu et al., 2018b; Bulat et al., 2019; Liu et al., 2020; Xu et al., 2021b; a). The techniques of these binarization algorithms focus on binarized-operator improvement and can be broadly categorized into several types, i.e., scaling factors, parameter redistribution, and gradient approximation. Note that techniques requiring specialized local structures or training pipelines, such as the Bi-Real shortcut (Liu et al., 2018a) and duplicate activation (Liu et al., 2020), are excluded for a fair comparison. For more details of the binarization algorithms in BiBench, please see Appendix A.

2.2. CHALLENGES FOR BINARIZATION

Since about 2015, network binarization has attracted great interest across research fields, including but not limited to vision and language understanding. However, various challenges remain in the production and deployment of network binarization in actual practice. The goal in production is to train accurate binarized networks with controllable resources. Recent works have revealed that the capability of binarization algorithms evaluated on image classification is not always applicable to new learning tasks and neural architectures (Qin et al., 2020a; Wang et al., 2020b; Qin et al., 2021; Liu et al., 2022). And to achieve higher accuracy, some binarization algorithms require several times the training resources of full-precision networks. Ideal binarized networks should be hardware-friendly and robust when deployed on edge devices. Unlike well-supported multi-bit (2-8 bit) quantization, most mainstream inference libraries have not yet supported the deployment of binarization on hardware (NVIDIA, 2022; HUAWEI, 2022; Qualcomm, 2022), which means the theoretical efficiency of existing binarization algorithms cannot be aligned with their performance on actual hardware. Moreover, the data collected by low-cost devices in natural edge scenarios is not always clean and high-quality, which severely affects the robustness of binarized models (Lin et al., 2018; Ye et al., 2019; Cygert & Czyżewski, 2021). However, recent binarization algorithms rarely consider corruption robustness in their design.

3. BIBENCH: TRACKS AND METRICS FOR BINARIZATION

In this section, we present BiBench towards accurate and efficient network binarization. As Figure 1 shows, we build 6 evaluation tracks and corresponding metrics based on the practical challenges in the production and deployment of binarization. For all tracks, higher metrics mean better performance.

3.1. TOWARDS ACCURATE BINARIZATION

The evaluation tracks for accurate network binarization in our BiBench are "Learning Task" and "Neural Architecture" (for production), and "Corruption Robustness" (for deployment). ① Learning Task. We select 9 learning tasks spanning 4 data modalities to comprehensively evaluate network binarization algorithms. For the widely evaluated 2D visual modality, in addition to image classification tasks on CIFAR-10 (Krizhevsky et al., 2014) and ImageNet (Krizhevsky et al., 2012), we also include object detection tasks on PASCAL VOC (Hoiem et al., 2009) and COCO (Lin et al., 2014) across all algorithms. For the 3D visual modality, we evaluate the algorithms on the ModelNet40 classification (Wu et al., 2015) and ShapeNet segmentation (Chang et al., 2015) tasks of 3D point clouds. For the textual modality, the natural language understanding tasks in the GLUE benchmark (Wang et al., 2018) are applied for evaluation. For the speech modality, we evaluate algorithms on the Speech Commands KWS task (Warden, 2018). The details of tasks and datasets are in Appendix B. We then build the evaluation metric for this track. For a particular binarization algorithm, we take the accuracy of full-precision models as the baseline and calculate the mean relative accuracy over all architectures on each task. We then calculate the Overall Metric (OM) of the task track as the quadratic mean over all tasks (Curtis & Marshall, 2000):

OM_task = √( (1/N) · Σ_{i=1}^{N} E²( A^{bi}_{task_i} / A_{task_i} ) ),

where A^{bi}_{task_i} and A_{task_i} denote the accuracy results of the binarized and full-precision models on the i-th task, N is the number of tasks, and E(·) denotes taking the mean value. The quadratic-mean form is applied uniformly in BiBench to unify all metrics; it prevents a metric from being unduly influenced by a few bad items and thus measures the overall performance on each track. ② Neural Architecture.
We choose diverse neural architectures spanning mainstream CNN-based, transformer-based, and MLP-based designs to evaluate the generalizability of binarization algorithms from the neural architecture perspective. We adopt standard ResNet-18/20/34 (He et al., 2016) and VGG (Simonyan & Zisserman, 2015) to evaluate CNN architectures, and the Faster-RCNN (Ren et al., 2015) and SSD300 (Liu et al., 2016) frameworks are applied as detectors. We binarize BERT-Tiny4/Tiny6/Base (Kenton & Toutanova, 2019) with the bi-attention mechanism for convergence (Qin et al., 2021) to evaluate transformer-based architectures. And we evaluate PointNet-vanilla and PointNet (Qi et al., 2017) with the EMA aggregator (Qin et al., 2020a), FSMN (Zhang et al., 2015), and Deep-FSMN (Zhang et al., 2018a) as typical MLP-based architectures for their linear-unit composition. The details of these architectures are presented in Appendix C. Similar to the overall metric for the learning task track, we build the metric for neural architecture:

OM_arch = √( (1/3) · ( E²(A^{bi}_{CNN} / A_{CNN}) + E²(A^{bi}_{Transformer} / A_{Transformer}) + E²(A^{bi}_{MLP} / A_{MLP}) ) ).

③ Corruption Robustness. The corruption robustness of binarized models on deployment is critical for dealing with bad cases like perceptual device damage, a common problem with low-cost equipment in real-world implementations. We consider the robustness of binarized models to corruption of 2D visual data and evaluate algorithms on the CIFAR10-C (Hendrycks & Dietterich, 2018) benchmark. We evaluate binarization algorithms' performance on corrupted data compared with normal data using the corruption generalization gap (Zhang et al., 2022a): G_{task_i} = A^{norm}_{task_i} − A^{corr}_{task_i}, where A^{corr}_{task_i} and A^{norm}_{task_i} denote the accuracy results under all architectures on the i-th corruption task and the corresponding normal task, respectively. And the overall metric on this track is calculated by

OM_robust = √( (1/C) · Σ_{i=1}^{C} E²( G_{task_i} / G^{bi}_{task_i} ) ).
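The quadratic-mean construction shared by the track metrics above (OM_task, OM_arch, OM_robust) can be sketched in a few lines; the numeric inputs below are hypothetical, not benchmark results:

```python
import math

def quadratic_mean(values):
    """Quadratic mean sqrt(mean(v^2)), the form shared by all BiBench overall metrics."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def om_task(rel_acc_per_task):
    """OM_task: each entry is the mean relative accuracy E(A_bi / A_fp)
    over all architectures for one task."""
    return quadratic_mean(rel_acc_per_task)

def om_robust(gap_ratios):
    """OM_robust: each entry is E(G / G_bi), the ratio of full-precision
    to binarized corruption generalization gaps for one corruption type."""
    return quadratic_mean(gap_ratios)

# Hypothetical example: two tasks retaining 90% and 60% relative accuracy.
score = om_task([0.9, 0.6])
```

A plain arithmetic mean of [0.9, 0.6] would give 0.75; the quadratic mean gives about 0.765, weighting the stronger track slightly more while still penalizing the weak one, which is the "prevents undue influence by bad items" property described above.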

3.2. TOWARDS EFFICIENT BINARIZATION

As for the efficiency of network binarization, we evaluate "Training Consumption" for production, and "Theoretical Complexity" and "Hardware Inference" for deployment. ④ Training Consumption. We consider the occupied training resources and the hyperparameter sensitivity of binarization algorithms, which affect the consumption of a single training run and of the whole tuning process, respectively. For each algorithm, we train its binarized networks with various hyperparameter settings, including different learning rates, learning rate schedulers, optimizers, and even random seeds, to evaluate whether the binarization algorithm is easy to tune to an optimal state. We align the epochs for binarized and full-precision networks and compare their consumption and time. The evaluation metric for the training consumption track relates to training time and hyperparameter sensitivity. For a specific binarization algorithm, we have

OM_train = √( (1/2) · ( E²( T_{train} / T^{bi}_{train} ) + E²( std(A_{hyper}) / std(A^{bi}_{hyper}) ) ) ),

where T_{train} denotes the set of times of a single training run, A_{hyper} is the set of results under different hyperparameter settings, and std(·) takes standard deviation values. ⑤ Theoretical Complexity. When evaluating theoretical complexity, we calculate the compression and speedup ratios before and after binarization on architectures such as ResNet18. The evaluation metric relates to model size (MB) savings and computational floating-point operations (FLOPs) at inference. For binarized parameters, the storage occupation is 1/32 that of their 32-bit floating-point counterparts (Rastegari et al., 2016). For binarized operations, the multiplication between a 1-bit weight and a 1-bit activation approximately takes 1/64 FLOPs on a CPU with an instruction size of 64 bits (Zhou et al., 2016b; Liu et al., 2018b; Li et al., 2019).
The compression ratio r_c and speedup ratio r_s are

r_c = |M|_{ℓ0} / ( (1/32) · |M̂|_{ℓ0} + |M|_{ℓ0} − |M̂|_{ℓ0} ),  r_s = FLOPs_M / ( (1/64) · FLOPs_{M̂} + FLOPs_M − FLOPs_{M̂} ),

where M and M̂ denote the full-precision parameters of the original model and the parameters that are binarized, respectively, and FLOPs_M and FLOPs_{M̂} denote the computation related to these parameters, respectively. And the overall metric for theoretical complexity is OM_comp = √( (1/2) · ( E²(r_c) + E²(r_s) ) ). ⑥ Hardware Inference. Given the limited support for binarization in hardware deployment, only two inference libraries, Larq's Compute Engine (Geiger & Team, 2020) and JD's daBNN (Zhang et al., 2019), can deploy and evaluate binarized models on ARM hardware in practice. Regarding target hardware devices, we mainly focus on ARM CPU inference, as this is the mainstream hardware type for edge scenarios, including HUAWEI Kirin, Qualcomm Snapdragon, Apple M1, MediaTek Dimensity, and Raspberry Pi. We put the hardware details in Appendix D. And for a certain binarization algorithm, we take the savings of storage and inference time under different inference libraries and hardware as the evaluation metric:

OM_infer = √( (1/2) · ( E²( T_{infer} / T^{bi}_{infer} ) + E²( S_{infer} / S^{bi}_{infer} ) ) ),

where T_{infer} and S_{infer} denote the inference time and storage on different devices, respectively.
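The ratios r_c and r_s above can be sketched directly; the parameter and FLOP counts in the example are hypothetical round numbers, not measurements from any model in the benchmark:

```python
def compression_speedup(total_params, bin_params, total_flops, bin_flops):
    """Theoretical compression ratio r_c and speedup ratio r_s for a
    partially binarized model.

    bin_params / bin_flops are the parameters and FLOPs that are
    binarized (|M_hat| and FLOPs_{M_hat}); the remainder stays full
    precision. Binarized storage costs 1/32 of FP32, and a 1-bit
    multiply costs roughly 1/64 FLOPs on a 64-bit CPU.
    """
    r_c = total_params / (bin_params / 32 + (total_params - bin_params))
    r_s = total_flops / (bin_flops / 64 + (total_flops - bin_flops))
    return r_c, r_s

# A fully binarized model approaches the ideal 32x / 64x savings.
r_c, r_s = compression_speedup(11.7e6, 11.7e6, 1.8e9, 1.8e9)
```

In practice the first and last layers are usually kept in full precision, so real models land below these ideal ratios.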

4. BIBENCH IMPLEMENTATION

This section shows the implementation details and the training and inference pipelines of BiBench. Implementation details. We implement BiBench with PyTorch (Paszke et al., 2019) packages. Each binarized operator is defined in a fully independent single file, and the corresponding operator in the original model can be flexibly replaced by its binarized counterpart when evaluating different tasks and architectures. For deployment, we export the well-trained binarized models of a binarization algorithm to the Open Neural Network Exchange (ONNX) format (developers, 2021) and then feed them to the inference libraries in BiBench (if they support this algorithm). Training and inference pipelines. Hyperparameters: We train the binarized networks with the same number of epochs as their full-precision counterparts. Inspired by the results in Section 5.2.1, we use the Adam optimizer for all binarized models for better convergence; the initial learning rate is 1e-3 by default (or 0.1× the default learning rate), and the learning rate scheduler is CosineAnnealingLR (Loshchilov & Hutter, 2017). Architecture: In BiBench, we thoroughly follow the original architectures of the full-precision models and binarize their convolution, linear, and multiplication units with the binarization algorithms. Hardtanh is uniformly used as the activation function to avoid the all-one feature. Pretraining: We adopt finetuning for all binarization algorithms, and for each of them, we initialize all binarized models from the same pre-trained model for a specific neural architecture and learning task to eliminate inconsistency at initialization.
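The operator-replacement scheme described above can be sketched as follows; `BinarizedConv2d` and `binarize_model` are hypothetical names for illustration, not BiBench's actual classes, and the sketch omits details such as dilation/groups copying and gradient approximation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizedConv2d(nn.Conv2d):
    """Hypothetical drop-in operator: binarizes weights and activations in
    forward while keeping latent float weights for the optimizer."""
    def forward(self, x):
        bx = torch.where(x >= 0, 1.0, -1.0)
        bw = torch.where(self.weight >= 0, 1.0, -1.0)
        return F.conv2d(bx, bw, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

def binarize_model(model):
    """Recursively replace every nn.Conv2d with its binarized counterpart,
    copying the pre-trained weights (the finetuning setup described above)."""
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d):
            new = BinarizedConv2d(child.in_channels, child.out_channels,
                                  child.kernel_size, child.stride,
                                  child.padding, bias=child.bias is not None)
            new.load_state_dict(child.state_dict())
            setattr(model, name, new)
        else:
            binarize_model(child)
    return model
```

Because the binarized operator subclasses the original one, the rest of the model definition and the ONNX export path stay unchanged.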

5. BIBENCH EVALUATION

This section shows our evaluation results and analysis in BiBench. The main accuracy results are in Table 2, and the efficiency results are in Table 3. More detailed results are in Appendix E.

5.1. ACCURACY ANALYSIS FOR BINARIZATION

For the accuracy of network binarization, we present the evaluation results in Table 2 for each accuracy-related track, using the metrics defined in Section 3.1.

5.1.1. LEARNING TASK TRACK

We present the evaluation results of binarization on various tasks. Besides the overall metric OM_task, we also present the relative accuracy of binarized networks compared to full-precision ones. Accuracy retention is still the most rigorous challenge for network binarization. With fully unified training pipelines and neural architectures, there is a large gap between the performance of binarized and full-precision networks on most learning tasks. For example, the results on the large-scale ImageNet and COCO tasks are usually less than 80% of their full-precision counterparts. Moreover, the marginal effect of advanced binarization algorithms is starting to appear; e.g., on ImageNet, the SoTA algorithm ReCU is less than 3% higher than the vanilla BNN. The binarization algorithms' performance differs significantly across data modalities. When comparing various tasks, an interesting phenomenon is that binarized networks suffer a huge accuracy drop on the language understanding GLUE benchmark but can almost approach full-precision performance on the ModelNet40 point cloud classification task. Such phenomena suggest that directly transferring the insights of binarization studies across different tasks is non-trivial. As for the overall performance, both ReCU and ReActNet show high accuracy across different learning tasks. Surprisingly, although ReCU wins the championship on most individual tasks, ReActNet ultimately stands out in the overall metric comparison. They both apply reparameterization in the forward propagation and gradient approximation in the backward propagation.

5.1.2. NEURAL ARCHITECTURE TRACK

Binarization exhibits a clear advantage on CNN-based and MLP-based architectures over transformer-based ones. Since they have been widely studied, the advanced binarization algorithms can achieve about 78%-86% of the full-precision accuracy on CNNs, and the binarized networks with MLP architectures even approach full-precision performance (e.g., Bi-Real Net 87.83%). In contrast, transformer-based architectures suffer extremely significant performance degradation when binarized, and none of the algorithms achieves an overall accuracy metric higher than 70%. Compared to CNNs and MLPs, the results show that transformer-based architectures built on unique attention mechanisms require specific binarization designs instead of direct binarization. The overall winner on the architecture track is the FDA algorithm, which achieves the best results on both CNN-based and transformer-based architectures. The evaluation of these two tracks suggests that the binarization algorithms applying statistical channel-wise scaling factors and custom gradient approximation, like FDA and ReActNet, have a certain degree of stability advantage.

5.1.3. CORRUPTION ROBUSTNESS TRACK

Binarized networks can approach full-precision-level robustness to corruption. Surprisingly, binarized networks show robustness close to their full-precision counterparts in the corruption evaluation. Results on the CIFAR10-C dataset show that binarized networks perform close to full-precision networks under typical 2D image corruption; ReCU and XNOR-Net even outperform their full-precision counterparts. Under the same corruption-robustness requirements, a binarized network thus requires little additional design or supervision for robustness. Binarized networks usually show robustness to corruption comparable to their full-precision counterparts, which can be seen as a general property of binarized networks rather than of specific algorithms.

5.2. EFFICIENCY ANALYSIS FOR BINARIZATION

As for efficiency, we discuss and analyze the metrics of the training consumption, theoretical complexity, and hardware inference tracks below.

5.2.1. TRAINING CONSUMPTION TRACK

We comprehensively investigate the training cost of binarization algorithms with ResNet18 on CIFAR-10 and present the sensitivity and training time results for different binarization algorithms in Table 3 and Figure 3, respectively. "Binarization ≠ Sensitivity": existing techniques can stabilize binarization-aware training. A common intuition is that the training of binarized networks is usually more sensitive to training settings than that of full-precision networks, caused by the representation limits and gradient approximation errors brought by the high degree of discretization. However, we find that the hyperparameter sensitivities of existing binarization algorithms are polarized: some are even more hyperparameter-stable than full-precision training, while others fluctuate hugely. The reason lies in the differences among the techniques applied in these algorithms' binarized operators. The training-stable binarization algorithms often share the following commonalities: (1) channel-wise floating-point scaling factors based on learning or statistics; (2) soft gradient approximation to reduce gradient error. These hyperparameter-stable binarization algorithms do not necessarily outperform other algorithms, but they simplify the tuning process in production and obtain reliable accuracy in a single training run. The preference for hyperparameter settings is evident in the training of binarized networks. From the statistical results in Figure 2, training with the Adam optimizer, a 1× learning rate (identical to the full-precision network), and the CosineAnnealingLR scheduler is more stable than training with other settings. Inspired by this, we adopt this setting as part of the standard training pipelines when evaluating binarization. Soft gradient approximation in binarization brings a significant increase in training time.
Comparing the time consumed by each binarization algorithm, we find that the training time of algorithms using custom gradient approximation techniques, such as Bi-Real and ReActNet, increases significantly; the metric of FDA is even as low as 20.62%, which means its training time is close to 5× that of full-precision network training.
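The soft gradient approximations discussed above are typically implemented as a custom autograd function: the forward pass is the hard sign, while the backward pass uses a smooth surrogate gradient. The sketch below uses a clipped piecewise-polynomial surrogate in the spirit of Bi-Real Net; it is illustrative only, not the exact published form of any single algorithm:

```python
import torch

class SoftSign(torch.autograd.Function):
    """sign() forward with a soft surrogate gradient in backward."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, 1.0, -1.0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Derivative of a clipped quadratic surrogate: 2 - 2|x| on [-1, 1],
        # zero outside; this replaces the zero-almost-everywhere true gradient.
        grad = torch.clamp(2 - 2 * x.abs(), min=0)
        return grad_out * grad

x = torch.randn(8, requires_grad=True)
SoftSign.apply(x).sum().backward()
```

The extra elementwise work in backward (versus a plain straight-through clip) is one source of the training-time overhead the track measures.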

5.2.2. THEORETICAL COMPLEXITY TRACK

There is no obvious difference in theoretical complexity among binarization algorithms. The leading cause of differences in compression ratio is the definition of each model's static scaling factor; e.g., BNN does not apply any factor and enjoys the highest compression. For theoretical acceleration, the main differences come from two aspects. First, reducing static scaling factors also improves the theoretical speedup. Second, real-time re-scaling and mean-shifting of activations bring additional computation, which costs ReActNet, for example, about 0.11× of speedup. In general, the theoretical complexity of the methods is similar, and the overall metrics lie in the range [12.71, 12.94]. These results suggest that the binarization algorithms should have similar inference efficiency. Compared to this theoretical track, the hardware inference track brings us further insights into binarization in real-world deployment.

5.2.3. HARDWARE INFERENCE TRACK

Limited inference libraries lead to an almost fixed paradigm of binarization deployment. After investigating existing open-source inference libraries, we find that few support the deployment of binarization algorithms on hardware: only Larq (Geiger & Team, 2020) and daBNN (Zhang et al., 2019) provide complete deployment pipelines, and they mainly support deployment on ARM devices. We first evaluate the deployment capability of the two inference libraries in Table 4. Both inference libraries support a channel-wise scaling factor in floating-point form and force it to be fused into the BN layer (a BN layer must follow every convolution of the binarized model). Neither supports dynamic activation statistics or re-scaling at inference. The only difference is that Larq further supports mean-shifting the activation with a fixed bias. Constrained by the inference libraries, the practical deployment of binarization algorithms is limited. The scaling-factor shape of XNOR++ leads to its failed deployment, and XNOR also fails because of its activation re-scaling technique. Table 4 was flattened during extraction; its recoverable content is:

Algorithm | Deployable | Scaling factor | Fused into BN | Activation re-scaling | Mean-shifting
BNN | √ | N/A | N/A | × | ×
XNOR | × | Channel-wise FP32 | √ | √ | ×
DoReFa | √ | Channel-wise FP32 | √ | × | ×
Bi-Real | √ | Channel-wise FP32 | √ | × | ×
XNOR++ | × | Spatial-wise FP32 | × | × | ×
ReActNet | √ | Channel-wise FP32 | √ | × | √
ReCU | √ | Channel-wise FP32 | √ | × | ×
FDA | √ | Channel-wise FP32 | √ | × | ×

The vast majority of binarization methods thus have almost identical inference performance, and the mean-shifting operation of ReActNet on activations only slightly affects efficiency; i.e., binarized models must satisfy a fixed deployment paradigm and show almost identical efficiency in practice. Born for the edge: the lower the chip's computing power, the higher the binarization speedup. After deploying and evaluating binarized models across a dozen chips, we compare the average speedup of the binarization algorithms on each chip.
A counterintuitive finding is that the higher the chip's capability, the worse the speedup of binarization on that chip (Figure 3). Further observation shows that this contradiction arises mainly because higher-performance chips gain more acceleration from multi-threading when running floating-point models, so the relative speedup of binarized models is smaller on these chips. The scenarios where network binarization comes into play best are low-performance, low-cost edge chips, and the extreme compression and acceleration of binarization make the deployment of advanced neural networks on the edge possible.

5.3. SUGGESTIONS FOR BINARIZATION ALGORITHM

Based on the above evaluation and analysis, we attempt to summarize a paradigm for accurate and efficient network binarization among existing techniques: (1) Soft gradient approximation is an undisputed must-have technique. This binarization technique does not affect hardware inference efficiency and is adopted by all winning binarization algorithms on the accuracy tracks, including Bi-Real, ReActNet, and ReCU. (2) Channel-wise scaling factors are the only option available for practical binarization. The results of the learning task and neural architecture tracks demonstrate the advantage of floating-point scaling factors, and the analysis of the efficiency tracks further restricts them to the channel-wise form. (3) Mean-shifting the input with a fixed bias is an optional helpful operation. Our results show that this technique in ReActNet effectively improves accuracy and consumes almost no extra computation, but not all inference libraries support it. We must stress that although benchmarking on the evaluation tracks leads us to several ground rules towards accurate and efficient binarization, no binarization technique or algorithm works well across all scenarios so far. In the future, binarization research should focus on breaking the mutual restrictions between production and deployment, and binarization algorithms should consider deployability and efficiency; inference libraries are also expected to support more advanced binarized operators.

6. DISCUSSION

In this paper, we present BiBench, a versatile and comprehensive benchmark that delves into the most fundamental questions of model binarization. BiBench covers 8 network binarization algorithms, 9 deep learning datasets (including one corruption benchmark), 13 different neural architectures, 2 deployment libraries, 14 real-world hardware chips, and various hyperparameter settings. Moreover, BiBench proposes evaluation tracks specifically designed to measure critical aspects such as accuracy under multiple conditions and efficiency when deployed on actual hardware. More importantly, by collating experimental results and analysis, BiBench hopes to establish an empirically optimized paradigm with several critical considerations for designing accurate and efficient binarization methods. We hope BiBench can facilitate a fair comparison of algorithms through a systematic investigation with metrics that reflect the fundamental requirements and serve as a foundation for applying model binarization in broader and more practical scenarios.

APPENDIX FOR BIBENCH

A DETAILS OF BINARIZATION ALGORITHMS

General Binarization: In previous studies, quantization schemes with lower bit-widths were regarded as more aggressive (Rusci et al., 2020; Choukroun et al., 2019; Qin et al., 2022), because lower bit-widths usually lead to higher compression and speed-up but result in more severe accuracy loss. With the lowest bit-width among all quantization approaches, 1-bit quantization (binarization) is regarded as the most aggressive quantization technique (Qin et al., 2022), facing severe accuracy challenges but enjoying the highest compression and speedup ratios.

Training. During the training of a binarized model, the sign function is usually applied in forward propagation, and STE or another gradient approximation is applied in backward propagation to make the binarized model trainable. Since the parameters are quantized to binary values, network binarization approaches usually use a simple sign function as the quantizer instead of directly sharing the quantizer with multi-bit (2-8 bit) quantization (Gong et al., 2019; Gholami et al., 2021). Specifically, as Gong et al. (2019) describe, for multi-bit uniform quantization, given the bit-width $b$ and a floating-point activation/weight $x$ in the range $(l, u)$, the complete quantization-dequantization process can be defined as $Q_U(x) = \operatorname{round}(x / \Delta) \cdot \Delta$, where the original range $(l, u)$ is divided into $2^b - 1$ intervals $P_i$, $i \in \{0, 1, \cdots, 2^b - 1\}$, and $\Delta = \frac{u-l}{2^b-1}$ is the interval length. When $b = 1$, $Q_U(x)$ reduces to the sign function, and the binarization function is expressed as $Q_B(x) = \operatorname{sign}(x)$. Therefore, binarization can be regarded as the 1-bit specialization of quantization.

Deployment. For real-world hardware deployment, every 32 binarized parameters are packed together using 32-bit instructions and computed simultaneously, which is the main principle behind both the acceleration and the compression of binarized networks.
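The quantization-dequantization process above can be sketched in a few lines. The following is an illustrative NumPy snippet (not any library's API): `uniform_quantize` implements $Q_U$, and `binarize` implements the 1-bit case $Q_B(x) = \operatorname{sign}(x)$ (with the common convention that zero maps to $+1$).

```python
import numpy as np

def uniform_quantize(x, b, l, u):
    """Quantize-dequantize x in range (l, u) with b bits:
    Q_U(x) = round(x / delta) * delta, where delta = (u - l) / (2**b - 1)."""
    delta = (u - l) / (2 ** b - 1)
    return np.round(x / delta) * delta

def binarize(x):
    """1-bit quantizer Q_B(x) = sign(x); zero is mapped to +1 by convention."""
    return np.where(x >= 0, 1.0, -1.0)
```

For example, with b = 2 and range (0, 1), the interval length is 1/3, so 0.3 is snapped to the nearest level 1/3.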
Instructions including XNOR (or a combination of EOR and NOT) and popcount enable binarized networks to be deployed on real-world hardware. An XNOR (exclusive-NOR) gate is a combination of an XOR gate followed by an inverter. XOR (also known as EOR) is a pervasive instruction that has long existed in the assembly instruction sets of all target platforms. The popcount instruction means Population Count per byte: it counts the number of bits set to one in each vector element in the source register, places the result into a vector, and writes the vector to the destination register (Arm, 2020). This instruction is applied to accelerate the inference of binarized networks (Hubara et al., 2016; Rastegari et al., 2016) and is widely supported by various hardware; e.g., the definitions of popcount for ARM and x86 are given in (Arm, 2020) and (AMD, 2022), respectively.

Comparison with other compression techniques. Most existing network compression technologies aim to reduce the size and computation of full-precision models. Specifically, knowledge distillation supervises compact small (student) models with the intermediate features and/or soft outputs of a large (teacher) model (Hinton et al., 2015; Xu et al., 2018; Chen et al., 2018; Yim et al., 2017; Zagoruyko & Komodakis, 2017); model pruning (Han et al., 2015; 2016; He et al., 2017) and low-rank decomposition (Denton et al., 2014; Lebedev et al., 2015; Jaderberg et al., 2014; Lebedev & Lempitsky, 2016) reduce network parameters and computation by pruning and low-rank approximation; compact model design directly designs a compact model (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018b; Ma et al., 2018). Although these compression technologies can effectively reduce the number of parameters, the compressed model still uses 32-bit floating-point numbers, which leaves room for further compression using model quantization/binarization.
Compared with multi-bit (2-8 bit) model quantization, which compresses parameters to integers (Gong et al., 2014; Wu et al., 2016; Vanhoucke et al., 2011; Gupta et al., 2015), binarization usually directly applies the sign function to compress the model to the more compact 1-bit form (Rusci et al., 2020; Choukroun et al., 2019; Qin et al., 2022; Shang et al., 2022b; Qin et al., 2020b). Moreover, thanks to the binary parameters, bitwise operations (XNOR and popcount) can be applied to accelerate inference.
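The bitwise arithmetic behind this acceleration can be sketched in pure Python (illustrative only; real libraries pack 32 or more elements per machine word). Encoding $+1$ as bit 1 and $-1$ as bit 0, the dot product of two $\pm 1$ vectors of length $n$ satisfies $\sum_i a_i w_i = 2\,\mathrm{popcount}(\mathrm{XNOR}(a, w)) - n$:

```python
def pack(vec):
    """Pack a +/-1 vector into an integer bitmask (LSB = element 0)."""
    bits = 0
    for i, v in enumerate(vec):
        if v > 0:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed +/-1 vectors of length n:
    sum(a_i * w_i) = 2 * popcount(XNOR(a, w)) - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask   # XNOR realized as EOR followed by NOT
    return 2 * bin(xnor).count("1") - n
```

For instance, a = [+1, -1, +1, -1] and w = [+1, +1, -1, -1] agree in two positions and disagree in two, so the dot product is 0, matching the floating-point result.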

Selection Rules:

First of all, we state that we obey some general rules when selecting binarization algorithms: the selected algorithms should improve the binarized operator, since it is the fundamental difference between binarized and full-precision models (as discussed in Section 2.1). We thus exclude algorithms and techniques that require specialized local structures or training pipelines, for a fair comparison. Then, we explain in detail the choice of binarization algorithms and why they are representative. When we built BiBench, we considered various binarization algorithms with improved operator techniques in binarization research; we list them in detail in Table 5. We consider the following perspectives, with the goal of making the selected algorithms representative and able to complete all evaluations in BiBench fairly: Operator Techniques (Yes/No), Year, Conference, Citation count (as of 2022/11/08), Open source (Yes/No), and Specified Structure / Training-pipeline (Yes/No/Optional). (1) We analyze the techniques proposed in these works. Following the general rules mentioned above, all considered binarization algorithms should make significant contributions to improving the binarized operator (Operator Techniques: Yes) and should not include techniques bound to specific architectures or training pipelines, so that they can complete all the evaluations of the learning task, neural architecture, and training consumption tracks in BiBench (Specified Structure / Training-pipeline: No/Optional, where Optional means such techniques are included but can be fully decoupled from the binarized operator). (2) We also consider the impact and reproducibility of these works. We prioritized works with more than 100 citations, which are more frequently discussed and compared in binarization research and thus have higher impact. Works from 2021 and later are regarded as SoTA binarization algorithms and are also prioritized.
Furthermore, we prefer the selected works to have official open-source implementations for reproducibility. Based on the above selection, eight binarization algorithms, i.e., BNN, XNOR-Net, DoReFa-Net, Bi-Real Net, XNOR-Net++, ReActNet, FDA, and ReCU, stand out and are fully evaluated in our BiBench.

BNN (Courbariaux et al., 2016b): During training, BNN uses the straight-through estimator (STE) to calculate the gradient $g_x$, which takes the saturation effect into account:

$$\operatorname{sign}(x)=\begin{cases}+1 & \text{if } x \ge 0\\ -1 & \text{otherwise}\end{cases}, \qquad g_x=\begin{cases}g_b & \text{if } x \in (-1,1)\\ 0 & \text{otherwise.}\end{cases}$$

During inference, the computation is expressed as $o = \operatorname{sign}(a) \circledast \operatorname{sign}(w)$, where $\circledast$ indicates a convolution implemented with XNOR and bitcount operations.

XNOR-Net (Rastegari et al., 2016): XNOR-Net obtains channel-wise scaling factors $\alpha = \frac{\|w\|_{\ell 1}}{n}$ for the weight, and $K$ containing the scaling factors $\beta$ for all sub-tensors in the activation $a$. The convolution between activation $a$ and weight $w$ can then be approximated mainly with binary operations:

$$o = (\operatorname{sign}(a) \circledast \operatorname{sign}(w)) \odot K\alpha, \tag{14}$$

where $w \in \mathbb{R}^{c \times w \times h}$ and $a \in \mathbb{R}^{c \times w_{in} \times h_{in}}$ denote the weight and input tensors, respectively. The STE is also applied in the backward propagation during training.

Bi-Real Net (Liu et al., 2018b): Bi-Real Net proposes a piecewise polynomial function as the gradient approximation function:

$$\operatorname{bireal}(a)=\begin{cases}-1 & \text{if } a<-1\\ 2a+a^2 & \text{if } -1\le a<0\\ 2a-a^2 & \text{if } 0\le a<1\\ 1 & \text{otherwise}\end{cases}, \qquad \frac{\partial \operatorname{bireal}(a)}{\partial a}=\begin{cases}2+2a & \text{if } -1\le a<0\\ 2-2a & \text{if } 0\le a<1\\ 0 & \text{otherwise.}\end{cases}$$

The forward propagation of Bi-Real Net is the same as Eq. (15).

XNOR-Net++ (Bulat et al., 2019): XNOR-Net++ proposes to re-formulate Eq. (14) as

$$o = (\operatorname{sign}(a) \circledast \operatorname{sign}(w)) \odot \Gamma,$$

and we adopt $\Gamma$ in the following form in our experiments (it achieves the best performance in the original paper): $\Gamma = \alpha \otimes \beta \otimes \gamma$, $\alpha \in \mathbb{R}^{o}$, $\beta \in \mathbb{R}^{h_{out}}$, $\gamma \in \mathbb{R}^{w_{out}}$, where $\alpha$, $\beta$, and $\gamma$ are learnable during training.

ReActNet (Liu et al., 2020): ReActNet defines RSign as a binarization function with channel-wise learnable thresholds:

$$x_b = \operatorname{rsign}(x)=\begin{cases}+1 & \text{if } x>\alpha\\ -1 & \text{if } x\le\alpha,\end{cases}$$

where $\alpha$ is a learnable coefficient controlling the threshold, and the forward propagation is $o = (\operatorname{rsign}(a) \circledast \operatorname{sign}(w)) \odot \alpha$.

ReCU (Xu et al., 2021b): As described in their paper, ReCU is formulated as $\operatorname{recu}(w) = \max(\min(w, Q_{(\tau)}), Q_{(1-\tau)})$, where $Q_{(\tau)}$ and $Q_{(1-\tau)}$ denote the $\tau$ quantile and $1-\tau$ quantile of $w$, respectively. Other implementation details strictly follow the original paper and official code.

FDA (Xu et al., 2021a): FDA computes the gradient of $o$ in backward propagation as

$$\frac{\partial \ell}{\partial t} = \left(\frac{\partial \ell}{\partial o} w_2^\top \odot \mathbb{1}((t w_1) \ge 0)\right) w_1^\top + \frac{\partial \ell}{\partial o}\,\eta'(t) + \frac{\partial \ell}{\partial o} \odot \frac{4\omega}{\pi}\sum_{i=0}^{n}\cos((2i+1)\omega t),$$

where $\frac{\partial \ell}{\partial o}$ is the gradient from the upper layers, $\odot$ represents element-wise multiplication, and $\frac{\partial \ell}{\partial t}$ is the partial gradient on $t$ that propagates back to the former layer. $w_1$ and $w_2$ are the weights of the original model and the noise adaptation module, respectively, and FDA updates them as

$$\frac{\partial \ell}{\partial w_1} = t^\top\left(\frac{\partial \ell}{\partial o} w_2^\top \odot \mathbb{1}((t w_1) \ge 0)\right), \qquad \frac{\partial \ell}{\partial w_2} = \sigma(t w_1)^\top \frac{\partial \ell}{\partial o}.$$
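The training-time behavior of these operators can be illustrated numerically. The following minimal NumPy sketch (illustrative, not BiBench code) contrasts the vanilla STE gradient used by BNN with the Bi-Real Net piecewise-polynomial gradient approximation defined above:

```python
import numpy as np

def sign_forward(a):
    """Binarization forward pass: sign(a), mapping zero to +1."""
    return np.where(a >= 0, 1.0, -1.0)

def ste_backward(a, grad_out):
    """Vanilla STE (BNN): pass the upstream gradient where a is in (-1, 1)."""
    return grad_out * (np.abs(a) < 1)

def bireal_backward(a, grad_out):
    """Bi-Real Net gradient approximation:
    d(bireal)/da = 2 + 2a on [-1, 0), 2 - 2a on [0, 1), 0 elsewhere."""
    g = np.zeros_like(a)
    neg = (a >= -1) & (a < 0)
    pos = (a >= 0) & (a < 1)
    g[neg] = 2 + 2 * a[neg]
    g[pos] = 2 - 2 * a[pos]
    return grad_out * g
```

Both estimators zero the gradient outside [-1, 1]; inside that range, the Bi-Real approximation emphasizes inputs close to zero, where the sign function is most sensitive.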

B DETAILS OF LEARNING TASKS

Selection Rules: To comprehensively evaluate the performance of binarization algorithms across learning tasks, we select a variety of representative tasks. First, representative perception modalities are covered in our benchmark, including (2D/3D) vision, text, and speech. Research on these modalities has progressed rapidly and has broad impact, so we choose specific tasks and datasets within them. Specifically, (1) in the 2D vision modality, we choose the basic image classification task and the object detection task (one of the most popular downstream tasks); the former includes the CIFAR10 and ImageNet datasets, the latter the Pascal VOC and COCO datasets. ImageNet and COCO are the more challenging large-scale datasets, while CIFAR10 and Pascal VOC are more basic. For the other modalities, binarization remains challenging even on the fundamental tasks and datasets in each field, since there are few related binarization studies: (2) in the 3D vision modality, the basic point cloud classification dataset ModelNet40 is selected to evaluate binarization performance; it is regarded as one of the most fundamental tasks in 3D point cloud research and is widely studied. (3) In the text modality, the General Language Understanding Evaluation (GLUE) benchmark is usually recognized as the most popular dataset, including nine sentence- or sentence-pair language understanding tasks. (4) In the speech modality, the keyword spotting task is chosen as the base task, specifically the Google Speech Commands classification dataset. Based on the above reasons and rules, we have selected a series of challenging and representative tasks for BiBench to evaluate binarization comprehensively, and have obtained a series of reliable and informative conclusions.
CIFAR10 (Krizhevsky et al., 2014): The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms and is widely used for image classification. It contains 60,000 color images of 32x32 pixels, categorized into 10 classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. Each class has 6,000 images, of which 5,000 are for training and 1,000 for testing. The evaluation metric of the CIFAR-10 dataset is accuracy, defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP (True Positive) denotes cases correctly identified as positive, TN (True Negative) cases correctly identified as negative, FP (False Positive) cases incorrectly identified as positive, and FN (False Negative) cases incorrectly identified as negative. Accuracy is thus the proportion of TP and TN among all evaluated cases.

ImageNet (Krizhevsky et al., 2012): ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon Mechanical Turk, a crowd-sourced image labeling service. As part of the Pascal Visual Object Challenge, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) was established in 2010. ILSVRC uses a subset of ImageNet with about 1,000 images in each of 1,000 categories: approximately 1.2 million training images, 50,000 validation images, and 150,000 testing images in total. ImageNet also uses accuracy, defined above, to evaluate the predicted results.

Pascal VOC07 (Hoiem et al., 2009): The evaluation metric for this detection dataset is the mean average precision (mAP):

$$mAP = \frac{1}{n}\sum_{k=1}^{n} AP_k,$$

where $AP_k$ denotes the average precision of the $k$-th category, computed as the area under the precision-recall curve: $AP_k = \int_0^1 p_k(r)\,dr$.
Especially for VOC07, we apply the 11-point interpolated AP, which divides the recall range into $\{0.0, 0.1, \ldots, 1.0\}$ and computes the average of the maximum precision values at these 11 recall levels:

$$AP = \frac{1}{11}\sum_{r \in \{0.0,\ldots,1.0\}} p_{\text{interp}}(r),$$

where the interpolated precision at recall level $r$ is the maximum precision at any recall level to its right: $p_{\text{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})$.

COCO17 (Lin et al., 2014): The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset consisting of 328K images. Following community feedback, the 2017 release changed the training/validation split from 83K/41K to 118K/5K; the images and annotations are the same. The 2017 test set is a subset of 41K images from the 2015 test set, and an additional 123K images are included in the unannotated set. COCO17 also uses the mean average precision (mAP) defined above for PASCAL VOC07.

ModelNet40 (Wu et al., 2015): The ModelNet40 dataset contains point clouds of synthetic objects. As the most widely used benchmark for point cloud analysis, ModelNet40 is popular due to its diversity of categories, clean shapes, and careful construction. The original ModelNet40 consists of 12,311 CAD-generated meshes divided into 40 categories, of which 9,843 are for training and 2,468 for testing. The point cloud data points are sampled by a uniform sampling method from the mesh surfaces and then scaled into a unit sphere by moving to the origin. The ModelNet40 dataset also uses accuracy as the metric, defined above for CIFAR10.

ShapeNet (Chang et al., 2015): ShapeNet is a large-scale repository of 3D CAD models developed by researchers from Stanford University, Princeton University, and the Toyota Technological Institute at Chicago, USA.
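The 11-point interpolation above can be written as a short function. This is an illustrative sketch (not a detection toolkit's implementation) that takes matched lists of recall and precision points from a precision-recall curve:

```python
def ap_11point(recalls, precisions):
    """VOC07 11-point interpolated AP: average over r in {0.0, 0.1, ..., 1.0}
    of p_interp(r) = max precision at any recall >= r (0 if none)."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

For a detector with precision 1.0 up to recall 0.5 and nothing beyond, only the 6 recall levels 0.0 through 0.5 contribute, giving AP = 6/11.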
Using WordNet hypernym-hyponym relationships, the repository contains over 300M models, with 220,000 classified into 3,135 classes. The ShapeNet Parts subset contains 31,693 meshes divided into 16 categories of objects (i.e., tables, chairs, planes, etc.); each shape contains 2-5 parts (with 50 part classes in total).

GLUE (Wang et al., 2018): The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, including the single-sentence tasks CoLA and SST-2, the similarity and paraphrasing tasks MRPC, STS-B, and QQP, and the natural language inference tasks MNLI, QNLI, RTE, and WNLI. Among them, SST-2, MRPC, QQP, MNLI, QNLI, RTE, and WNLI use accuracy as the metric, defined as in CIFAR10. CoLA is measured by the Matthews Correlation Coefficient (MCC), which is better suited to binary classification with extremely unbalanced numbers of positive and negative samples:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$

And STS-B is measured by the Pearson/Spearman Correlation Coefficients:

$$r_{\text{Pearson}} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right), \qquad r_{\text{Spearman}} = 1 - \frac{6\sum d_i^2}{n(n^2-1)},$$

where $n$ is the number of observations, $s_X$ and $s_Y$ are the standard deviations of $X$ and $Y$, respectively, and $d_i$ is the difference between the ranks of corresponding variables.

SpeechCom. (Warden, 2018): As part of its training and evaluation process, SpeechCom provides a collection of audio recordings of spoken words. Its primary goal is to provide a way to build and test small models that detect a single word belonging to a set of ten target words, producing as few false positives as possible from background noise or unrelated speech. The accuracy metric for SpeechCom is the same as for CIFAR10.
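The MCC and Spearman formulas above are easy to check with a toy implementation. The following sketch is illustrative (libraries such as scikit-learn and SciPy provide production versions); `spearman` assumes the inputs are already rank vectors with no ties:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def spearman(rank_x, rank_y):
    """Spearman correlation from rank vectors (no ties):
    1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A perfect classifier (no FP, no FN) gives MCC = 1, and identical versus fully reversed rankings give Spearman correlations of +1 and -1.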

C DETAILS OF NEURAL ARCHITECTURES

ResNet (He et al., 2016): Residual Networks, or ResNets, learn residual functions with reference to the layer inputs instead of learning unreferenced functions. Instead of making stacked layers directly fit a desired underlying mapping, residual nets let these layers fit a residual mapping. There is empirical evidence that these networks are easier to optimize and can achieve higher accuracy at considerably increased depth.

VGG (Simonyan & Zisserman, 2015): VGG is a classical convolutional neural network architecture, proposed through an analysis of how to increase the depth of such networks. It is characterized by its simplicity: the network utilizes small 3×3 filters, and the only other components are pooling layers and a fully connected layer.

MobileNetV2 (Sandler et al., 2018): MobileNetV2 is a convolutional neural network architecture that performs well on mobile devices. The model has an inverted residual structure with residual connections between the bottleneck layers. The intermediate expansion layer employs lightweight depthwise convolutions to filter features as a source of nonlinearity. The architecture begins with an initial layer of 32 convolution filters, followed by 19 residual bottleneck layers.

Faster-RCNN (Ren et al., 2015): Faster R-CNN is an object detection model that improves on Fast R-CNN by utilizing a region proposal network (RPN) with the CNN model. The RPN shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals. A fully convolutional network is used to simultaneously predict object bounds and objectness scores at each position. RPNs are trained end-to-end to produce high-quality region proposals and instruct the unified network where to search. Sharing their convolutional features allows the RPN and Fast R-CNN to be combined into a single network. Faster R-CNN consists of two modules.
The first module is a deep, fully convolutional network that proposes regions, and the second is the detector that uses the proposals to produce the final prediction boxes.

SSD (Liu et al., 2016): SSD is a single-stage object detection method that discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature-map location. During prediction, each default box is adjusted to better match the shape of the object based on its scores for each object category. In addition, the network handles objects of different sizes by combining predictions from multiple feature maps with different resolutions.

BERT (Kenton & Toutanova, 2019): BERT, or Bidirectional Encoder Representations from Transformers, improves upon standard Transformers by removing the unidirectionality constraint with a masked language model (MLM) pre-training objective. By masking some tokens of the input, the masked language model attempts to predict the original vocabulary id of the masked word based solely on its context. An MLM objective differs from a left-to-right language model in that it enables the representation to integrate the left and right contexts, which facilitates pre-training a deep bidirectional Transformer. Additionally, BERT uses a next-sentence prediction task that pre-trains text-pair representations along with the masked language model. Note that we replace the directly binarized attention with a bi-attention mechanism to prevent the model from completely crashing (Qin et al., 2021).

PointNet (Qi et al., 2017): PointNet is a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. The architecture directly takes point clouds as input and outputs either class labels for the entire input or per-point segment/part labels. PointNet-Vanilla is a variant of PointNet that drops the T-Net module.
And for all PointNet models, we apply EMA-Max (Qin et al., 2020a) as the aggregator, because directly following the max pooling aggregator causes the binarized PointNets to fail to converge.

FSMN (Zhang et al., 2015): The feedforward sequential memory network, or FSMN, is a novel neural network structure for modeling long-term dependency in time series without using recurrent feedback. It is a standard fully connected feedforward neural network containing learnable memory blocks. As a short-term memory mechanism, the memory blocks encode long context information using a tapped-delay line structure.

Deep-FSMN (Zhang et al., 2018a): The Deep-FSMN architecture is an improved feedforward sequential memory network (FSMN) with skip connections between memory blocks in adjacent layers. The skip connections allow information to be transferred across layers, so the gradient vanishing problem can be avoided when building very deep structures.

We also evaluate binarization algorithms on language and speech tasks, for which we test TinyBERT (4 layers and 6 layers) on the GLUE benchmark and FSMN and Deep-FSMN on the SpeechCommand dataset. Results are listed in Table 8.

E FULL RESULTS

To demonstrate the robustness of binarized algorithms to corruption, we show results on the CIFAR10-C benchmark, which is used to benchmark robustness to common perturbations, in Table 9 and Table 10. It includes 15 kinds of noise, blur, weather, and digital corruption, each with five levels of severity. The sensitivity to hyperparameters during training is shown in Tables 11-12.



Figure 2: Comparisons of accuracy under different training settings.

Figure 3: The lower the chip's computing power, the higher the inference speedup of deployed binarized models.

DoReFa-Net (Zhou et al., 2016b): DoReFa-Net applies the following function for 1-bit weight and activation:

$$o = (\operatorname{sign}(a) \circledast \operatorname{sign}(w)) \odot \alpha. \tag{15}$$

The STE is also applied in the backward propagation with the full-precision gradient.

CIFAR10-C (Hendrycks & Dietterich, 2018): CIFAR10-C is a dataset generated by adding 15 common corruptions and 4 extra corruptions to the test images of the CIFAR10 dataset. It benchmarks the frailty of classifiers under corruption, including noise, blur, weather, and digital influences. Each type of corruption has five levels of severity, resulting in 75 distinct corruptions. We report the accuracy of the classifiers under each level of severity and each corruption, and use the mean and relative corruption errors as metrics. Denote the error rate of a Network under given Settings as $E^{\text{Network}}_{\text{Settings}}$. To aggregate a classifier's performance across the five severities, the Corruption Error for a certain corruption type is computed with the formula

$$CE^{\text{Network}}_{\text{Corruption}} = \sum_{s=1}^{5} E^{\text{Network}}_{s,\text{Corruption}} \Big/ \sum_{s=1}^{5} E^{\text{AlexNet}}_{s,\text{Corruption}}.$$

To make Corruption Errors comparable across corruption types, the difficulty is adjusted by dividing by AlexNet's errors.
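The Corruption Error normalization described above amounts to a ratio of summed error rates. A minimal sketch (illustrative only; the error rates themselves come from the benchmark):

```python
def corruption_error(err_network, err_alexnet):
    """CE for one corruption type: the network's error rates summed over the
    five severity levels, normalized by AlexNet's summed error rates."""
    assert len(err_network) == 5 and len(err_alexnet) == 5
    return sum(err_network) / sum(err_alexnet)
```

A network that halves AlexNet's error rate at every severity level obtains CE = 0.5; values below 1 indicate better-than-AlexNet robustness for that corruption type.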

Comparison between BiBench and existing binarization works along evaluation tracks.

Accuracy benchmark for network binarization. Blue: best in a row. Red: worst in a row.

Efficiency benchmark for network binarization. Blue: best in a row. Red: worst in a row.

Deployment capability of different inference libraries on real hardware.

The considered binarization algorithms and our final selections in BiBench. Bold means that the algorithm has an advantage in that column.

Hisilicon Kirin (Hisilicon, 2022): Kirin is a series of ARM-based systems-on-a-chip (SoCs) produced by HiSilicon; products include the Kirin 970, Kirin 980, Kirin 985, etc.

Qualcomm Snapdragon: Snapdragon is a system-on-a-chip (SoC) processor architecture provided by Qualcomm. The original Snapdragon chip, the Scorpion, was similar to the ARM Cortex-A8 core based upon the ARMv7 instruction set, but it was enhanced by the use of SIMD operations, which provided higher performance. Qualcomm Snapdragon processors are based on the Krait architecture and are equipped with an integrated LTE modem, providing seamless connectivity across 2G, 3G, and LTE networks.

Raspberry Pi (Wikipedia, 2022b): Raspberry Pi is a series of small single-board computers (SBCs) developed in the United Kingdom by the Raspberry Pi Foundation in association with Broadcom. Raspberry Pi was originally designed to promote the teaching of basic computer science in schools and in developing countries. As a result of its low cost, modularity, and open design, it is used in many applications, including weather monitoring, and is sold outside the intended market. It is typically used by computer hobbyists and electronics enthusiasts due to its adoption of HDMI and USB standards.

Table 7 shows the accuracy of different binarization algorithms on 2D and 3D vision tasks, including CIFAR10, ImageNet, PASCAL VOC07, and COCO14 for 2D vision tasks and ModelNet40 for 3D vision tasks.

Accuracy on 2D and 3D Vision Tasks.

Tables 11-12 report the sensitivity to hyperparameters in training. For each binarization algorithm, we use the SGD or Adam optimizer, 1× or 0.1× of the original learning rate, a cosine or step learning rate scheduler, and 200 training epochs. Each case is run five times to show training stability, and we report the mean and standard deviation (std) of accuracy. The best accuracy and the lowest std for each binarization algorithm are bolded.

Accuracy on ShapeNet dataset.

Accuracy on Language and Speech Tasks.

Results for Robustness Corruption on CIFAR10-C Dataset with Different BinarizationAlgorithms (1/2).

Results for Robustness Corruption on CIFAR10-C Dataset with Different Binarization Algorithms (2/2).

Sensitivity to Hyper Parameters in Training (1/2). Algorithm Epoch Optimizer Learning Rate Scheduler Acc. Acc. 1 Acc. 2 Acc. 3 Acc. 4 mean std

Sensitivity to Hyper Parameters in Training (2/2). Algorithm Epoch Optimizer Learning Rate Scheduler Acc. Acc. 1 Acc. 2 Acc. 3 Acc. 4 mean std

ACKNOWLEDGEMENT

We conduct comprehensive deployment and inference on various kinds of hardware, including the Kirin series (970, 980, 985, 990, and 9000E), Dimensity series (820 and 9000), Snapdragon series (855+, 870, and 888), Raspberry Pi (3B+ and 4B), and Apple M1 series (M1 and M1 Max). Limited by framework support, we can only test BNN and ReActNet with the Larq compute engine and only BNN with daBNN. We convert models to enable actual inference on real hardware, covering ResNet18/34 and VGG-Small on Larq, and only ResNet18/34 on daBNN. We test 1, 2, 4, and 8 threads for each hardware device and additionally test 16 threads for Apple Silicon on Larq; daBNN only supports single-thread inference. Results are showcased in Tables 13-16.

