BIBENCH: BENCHMARKING AND ANALYZING NETWORK BINARIZATION

Abstract

Neural network binarization is one of the most promising compression approaches, offering extraordinary computation and memory savings by minimizing the bit-widths of weights and activations. However, despite being a general technique, recent works reveal that applying binarization in various practical scenarios, including multiple tasks, architectures, and hardware platforms, is not trivial. Moreover, common challenges, such as severe accuracy degradation and limited efficiency gains, suggest that specific attributes of binarization are not thoroughly studied or adequately understood. To comprehensively understand binarization methods, we present BiBench, a carefully engineered benchmark with in-depth analysis for network binarization. We first inspect the requirements of binarization in actual production settings. Then, for the sake of fair and systematic comparison, we define evaluation tracks and metrics. We also perform a comprehensive evaluation of a rich collection of milestone binarization algorithms. Our benchmark results show that binarization still faces severe accuracy challenges, and that newer state-of-the-art binarization algorithms bring diminishing improvements, even at the expense of efficiency. Moreover, the actual deployment of certain binarization operations reveals a surprisingly large deviation from their theoretical consumption. Finally, based on our benchmark results and analysis, we suggest establishing a paradigm for accurate and efficient binarization among existing techniques. We hope BiBench paves the way toward more extensive adoption of network binarization and serves as a foundation for future research.

1. INTRODUCTION

Since the rise of modern deep learning, the contradiction between ever-increasing model size and limited deployment resources has persisted. For this reason, compression technologies are crucial for practical deep learning and have been widely studied, including model quantization (Gong et al., 2014; Wu et al., 2016; Vanhoucke et al., 2011; Gupta et al., 2015), network pruning (Han et al., 2015; 2016; He et al., 2017), knowledge distillation (Hinton et al., 2015; Xu et al., 2018; Chen et al., 2018; Yim et al., 2017; Zagoruyko & Komodakis, 2017), lightweight architecture design (Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018b; Ma et al., 2018), and low-rank decomposition (Denton et al., 2014; Lebedev et al., 2015; Jaderberg et al., 2014; Lebedev & Lempitsky, 2016). As a compression approach that reduces the bit-width to the extreme of 1 bit, network binarization is regarded as the most aggressive quantization technology (Rusci et al., 2020; Choukroun et al., 2019; Qin et al., 2022; Shang et al., 2022b; Zhang et al., 2022b; Bethge et al., 2020; 2019; Martinez et al., 2019; Helwegen et al., 2019). Binarized models leverage the most compact 1-bit parameters, which require little storage and memory and accelerate inference through efficient bitwise operations. Compared to other compression technologies such as network pruning and architecture design, network binarization enjoys stronger topological generality since it applies only to parameters. Therefore, in academic research, network binarization is widely studied as an independent compression technique rather than as the 1-bit specialization of quantization (Gong et al., 2019; Gholami et al., 2021). Impressively, state-of-the-art (SoTA) binarization algorithms push binarized models to full-precision performance on large-scale tasks (Deng et al., 2009; Liu et al., 2020). However, existing network binarization is still far from practical.
We point out two worrisome trends emerging in current binarization research, from the accuracy and efficiency perspectives:

