BIBENCH: BENCHMARKING AND ANALYZING NETWORK BINARIZATION

Abstract

Neural network binarization is one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-width of weights and activations. However, despite being a general technique, recent works reveal that applying binarization in various practical scenarios, including multiple tasks, architectures, and hardware platforms, is not trivial. Moreover, common challenges, such as severe degradation in accuracy and limited efficiency gains, suggest that specific attributes of binarization have not been thoroughly studied and adequately understood. To comprehensively understand binarization methods, we present BiBench, a carefully engineered benchmark with in-depth analysis for network binarization. We first inspect the requirements of binarization in actual production settings. Then, for fair and systematic comparison, we define the evaluation tracks and metrics. We also perform a comprehensive evaluation with a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and that newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, based on our benchmark results and analysis, we suggest a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way toward more extensive adoption of network binarization and serves as a foundation for future research.

1. INTRODUCTION

Since the rise of modern deep learning, the contradiction between ever-increasing model size and limited deployment resources has persisted. For this reason, compression technologies are crucial for practical deep learning and have been widely studied, including model quantization (Gong et al., 2014; Wu et al., 2016; Vanhoucke et al., 2011; Gupta et al., 2015), network pruning (Han et al., 2015; 2016; He et al., 2017), knowledge distillation (Hinton et al., 2015; Xu et al., 2018; Chen et al., 2018; Yim et al., 2017; Zagoruyko & Komodakis, 2017), lightweight architecture design (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018b; Ma et al., 2018), and low-rank decomposition (Denton et al., 2014; Lebedev et al., 2015; Jaderberg et al., 2014; Lebedev & Lempitsky, 2016). As a compression approach that reduces the bit-width to the extreme of 1-bit, network binarization is regarded as the most aggressive quantization technology (Rusci et al., 2020; Choukroun et al., 2019; Qin et al., 2022; Shang et al., 2022b; Zhang et al., 2022b; Bethge et al., 2020; 2019; Martinez et al., 2019; Helwegen et al., 2019). Binarized models leverage the most compact 1-bit parameters, which take little storage and memory and accelerate inference through efficient bitwise operations. Compared with other compression technologies such as network pruning and architecture design, network binarization enjoys stronger topological generality, since it applies only to parameters. Therefore, in academic research, network binarization is widely studied as an independent compression technique rather than as the 1-bit specialization of quantization (Gong et al., 2019; Gholami et al., 2021). Impressively, state-of-the-art (SoTA) binarization algorithms push binarized models toward full-precision performance on large-scale tasks (Deng et al., 2009; Liu et al., 2020). However, existing network binarization is still far from practical.
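The 1-bit compression described above can be illustrated with a minimal numpy sketch: binarize a full-precision weight tensor to {-1, +1} signs, keep a scaling factor, and pack the signs into 1-bit storage. The shapes and names here are illustrative, not any particular algorithm's implementation.

```python
import numpy as np

# A minimal sketch of weight binarization (illustrative, not a specific
# paper's method): signs are stored at 1 bit each plus one fp32 scale.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64, 3, 3)).astype(np.float32)

w_bin = np.sign(w)            # values in {-1, 0, +1}
w_bin[w_bin == 0] = 1         # map sign(0) to +1 by convention
alpha = np.abs(w).mean()      # per-tensor scaling factor, ||w||_1 / n

packed = np.packbits((w_bin > 0).flatten())  # 8 signs per byte
full_bytes = w.nbytes                        # 32 bits per weight
bin_bytes = packed.nbytes + 4                # 1 bit per weight + fp32 alpha

print(f"full-precision: {full_bytes} B, binarized: {bin_bytes} B, "
      f"ratio: {full_bytes / bin_bytes:.1f}x")
```

The storage ratio comes out just under the ideal 32x because of the small constant overhead of the scaling factor.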
We point out two worrisome trends emerging in current binarization research, from the accuracy and efficiency perspectives:

Figure 1: Evaluation tracks of BiBench. Our benchmark evaluates binarization algorithms on the most comprehensive set of evaluation tracks, including "Learning Task", "Neural Architecture", "Corruption Robustness", "Training Consumption", "Theoretical Complexity", and "Hardware Inference".

Trend-1. Accuracy comparison converging to a limited scope. In recent binarization research, a few image classification tasks, e.g., CIFAR-10 and ImageNet, have become the standard options for comparing accuracy. This typical selection of evaluation tasks enables a clear and fair comparison of accuracy among different binarization algorithms. However, since most binarization algorithms are engineered for learning tasks with image-modality inputs, the presented insights and conclusions are rarely verified on a broader range of other modalities and tasks. The monotonous task selection also hinders comprehensive evaluation from an architectural perspective. Besides, data noise such as corruption is a common problem on low-cost edge devices and is widely studied in compression (Lin et al., 2018; Rakin et al., 2021), yet few advanced binarization algorithms consider the robustness of binarized models.

Trend-2. Efficiency analysis remaining at the theoretical level. Network binarization is widely recognized for its significant storage and computation savings; for convolutions, the theoretical savings are up to 32× and 64×, respectively (Rastegari et al., 2016; Bai et al., 2021). However, owing to the lack of support from hardware libraries, models compressed by binarization algorithms can hardly be evaluated on real-world edge hardware, leaving their efficiency claims without experimental evidence.
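One common back-of-the-envelope derivation of the 32× and 64× figures is sketched below, under the usual assumptions of an fp32 baseline and 64-bit registers; this is a counting argument, not a measurement.

```python
# Theoretical savings of binarization over fp32 (a counting sketch).
bits_fp, bits_bin = 32, 1
memory_saving = bits_fp / bits_bin   # 32x: fp32 weights -> 1-bit weights

# Compute saving: on a 64-bit register, one XNOR plus one POPCOUNT covers
# 64 binary multiply-accumulates; counting that pair as a single unit of
# work versus 64 floating-point MACs yields the 64x figure.
register_width = 64
ops_fp = register_width              # 64 fp multiply-accumulates
ops_bin = 1                          # one xnor+popcount pair
compute_saving = ops_fp / ops_bin

print(memory_saving, compute_saving)  # 32.0 64.0
```

Real hardware rarely reaches these bounds, which is precisely the deployment gap the benchmark sets out to measure.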
In addition, the training efficiency of binarization algorithms is usually neglected in current research, which leads to several negative phenomena when training a binarized network, such as growing demands on computation resources and time, sensitivity to hyperparameters, and the need for careful tuning during optimization.

In this paper, we present BiBench, a network Binarization Benchmark that evaluates binarization algorithms comprehensively from the accuracy and efficiency perspectives (Table 1). Based on BiBench, we benchmark 8 representative binarization algorithms on 9 deep learning datasets, 13 neural architectures, 2 deployment libraries, 14 hardware chips, and various hyperparameter settings. Building BiBench cost about 4 GPU-years of computation time, all devoted to promoting comprehensive evaluation of network binarization from the accuracy and efficiency perspectives. Furthermore, we analyze the benchmark results in depth, reveal insights along each evaluation track, and offer suggestions for designing practical binarization algorithms.

2. BACKGROUND

2.1 NETWORK BINARIZATION

Binarization compresses the weights $\mathbf{w} \in \mathbb{R}^{c_{in} \times c_{out} \times k \times k}$ and activations $\mathbf{a} \in \mathbb{R}^{c_{in} \times w \times h}$ of the computationally dense convolution to 1-bit, where $c_{in}$, $k$, $c_{out}$, $w$, and $h$ denote the input channels, kernel size, output channels, input width, and input height, respectively. The computation can be expressed as

$$\mathbf{o} = \alpha \, \mathrm{popcount}\left(\mathrm{xnor}\left(\mathrm{sign}(\mathbf{a}), \mathrm{sign}(\mathbf{w})\right)\right), \quad (1)$$

where $\mathbf{o}$ denotes the outputs, $\alpha \in \mathbb{R}^{c_{out}}$ denotes the optional scaling factor calculated as $\alpha = \frac{\|\mathbf{w}\|_1}{n}$ (Courbariaux et al., 2016b; Rastegari et al., 2016), and xnor and popcount are bitwise instructions (Arm, 2020; AMD, 2022). Though enjoying extreme compression and acceleration, binarized networks suffer accuracy degradation from this severely limited representation. Therefore, various algorithms constantly emerge to improve their accuracy (Yuan & Agaian, 2021). The vast majority of existing binarization algorithms focus on improving the binarized operators. As shown in Eq. (1), the fundamental difference between binarized and full-precision networks is
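The bitwise computation in Eq. (1) rests on an identity that can be sanity-checked with a small numpy sketch: encoding {-1, +1} signs as {0, 1} bits, a single xnor-plus-popcount reproduces the dot product of the sign vectors via dot = 2·popcount(xnor(a, w)) − n. The names and sizes below are illustrative.

```python
import numpy as np

# Verify the xnor-popcount identity behind Eq. (1) on random sign vectors.
rng = np.random.default_rng(0)
n = 256
a = np.sign(rng.standard_normal(n)); a[a == 0] = 1
w = np.sign(rng.standard_normal(n)); w[w == 0] = 1

# Reference: plain dot product of the {-1, +1} sign vectors.
ref = float(a @ w)

# Bitwise version: encode +1 -> 1 and -1 -> 0, then xnor and popcount.
a_bits = (a > 0)
w_bits = (w > 0)
matches = np.count_nonzero(~(a_bits ^ w_bits))  # popcount(xnor(a, w))
bitwise = 2 * matches - n                       # matches - mismatches

# The optional scaling factor alpha = ||w||_1 / n of Eq. (1) would simply
# multiply this result; it is omitted here for clarity.
assert bitwise == ref
```

Each matching bit position contributes +1 to the dot product and each mismatch contributes −1, which is why matches − (n − matches) recovers the exact ±1 inner product from one popcount.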

