HIGH-CAPACITY EXPERT BINARY NETWORKS

Abstract

Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between binary models and their real-valued counterparts remains a challenging, unsolved research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time, conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the number of binary operations within the same budget. (c) To improve network design, we propose a principled binary network search mechanism that unveils a set of network topologies with favorable properties. Overall, our method improves upon prior work, with no increase in computational cost, by ∼ 6%, reaching a groundbreaking ∼ 71% on ImageNet classification. Code will be made available here.

1. INTRODUCTION

A promising, hardware-aware direction for designing efficient deep learning models is that of network binarization, in which filter and activation values are restricted to two states only: ±1 (Rastegari et al., 2016; Courbariaux et al., 2016). This comes with two important advantages: (a) it compresses the weights by a factor of 32× via bit-packing, and (b) it replaces the computationally expensive multiply-add with bit-wise xnor and popcount operations, offering, in practice, a speed-up of ∼ 58× on a CPU (Rastegari et al., 2016). Despite this, how to reduce the accuracy gap between a binary model and its real-valued counterpart remains an open problem, and it is currently the major impediment to the wide-scale adoption of binary models. In this work, we propose to approach this challenging problem from 3 key perspectives:

1. Model capacity: To increase model capacity, we introduce the first application of Conditional Computing (Bengio et al., 2013; 2015; Yang et al., 2019) to binary networks, which we call Expert Binary Convolution. For each convolutional layer, rather than learning a weight tensor that is expected to generalize well across the entire input space, we learn a set of N experts, each of which is tuned to specialize to a portion of it. During inference, a very light-weight gating function dynamically selects a single expert for each input sample and uses it to process the input features. Learning to select a single expert, tuned to the input data, is a key property of our method which renders it suitable for binary networks and contrasts our approach with previous works in conditional computing (Yang et al., 2019).
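To make the two mechanics above concrete, here is a minimal pure-Python sketch of (a) the xnor/popcount reduction of a binary dot product and (b) single-expert gating. All function names, shapes, and the use of a plain argmax gate are illustrative assumptions for exposition, not the paper's implementation:

```python
# Toy sketch of Expert Binary Convolution's single-expert selection.
# All names and shapes are illustrative; a real layer operates on 4D tensors.

def binarize(w):
    """Binarize real-valued values to {-1, +1} via the sign function."""
    return [1 if v >= 0 else -1 for v in w]

def xnor_popcount_dot(a, b):
    """Binary dot product: for {-1,+1} vectors, multiply-add reduces to
    counting matching bits (xnor + popcount): dot = 2 * matches - n."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 2 * matches - len(a)

def gate(features, gate_weights):
    """Light-weight gating: score each expert from the input features and
    return the index of the single winning expert (argmax)."""
    scores = [sum(f * g for f, g in zip(features, gw)) for gw in gate_weights]
    return max(range(len(scores)), key=scores.__getitem__)

def ebconv_1d(features, experts, gate_weights):
    """Select one binary expert conditioned on the input, then apply it."""
    k = gate(features, gate_weights)      # data-specific expert index
    expert = binarize(experts[k])         # binary filter of the winner
    x_bin = binarize(features)            # binary input activations
    return xnor_popcount_dot(x_bin, expert)
```

Note that the sketch only covers inference: at training time the hard argmax would need a differentiable surrogate so that gradients can reach the gating function.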

2. Representation capacity:

There is an inherent information bottleneck in binary networks: since only 2 states are used to characterize each feature, the learning of highly accurate models is hindered. To this end, for the first time, we highlight the question of depth vs. width in binary networks and propose a surprisingly unexplored, efficient mechanism for increasing the effective width of the network while preserving the original computational budget. We show that our approach leads to noticeable gains in accuracy without increasing computation.

3. Network design: Finally, inspired by similar work on real-valued networks (Tan & Le, 2019), we propose a principled approach to searching for optimal directions for scaling up binary networks.

Main results: Without increasing the computational budget of previous works, our method improves upon the state-of-the-art (Martinez et al., 2020) by ∼ 6%, reaching a groundbreaking ∼ 71% on ImageNet classification.

2. RELATED WORK

2.1. NETWORK BINARIZATION

Since the seminal works of Courbariaux et al. (2015; 2016), which showed that training fully binary models (both weights and activations) is possible, and Rastegari et al. (2016), which reported the very first binary model of high accuracy, there has been a great research effort to develop binary models that are competitive in terms of accuracy with their real-valued counterparts; see, for example, (Lin et al., 2017; Liu et al., 2018; Alizadeh et al., 2018; Bulat et al., 2019; Bulat & Tzimiropoulos, 2019; Ding et al., 2019; Wang et al., 2019; Zhuang et al., 2019; Zhu et al., 2019; Kim et al., 2020; Bulat et al., 2020; Martinez et al., 2020). Notably, many of these improvements, including real-valued down-sampling layers (Liu et al., 2018), double skip connections (Liu et al., 2018), learned scale factors (Bulat & Tzimiropoulos, 2019), PReLUs (Bulat et al., 2019) and two-stage optimization (Bulat et al., 2019), have been put together to build a strong baseline in Martinez et al. (2020) which, further boosted by a sophisticated distillation and data-driven channel rescaling mechanism, yielded an accuracy of ∼ 65% on ImageNet. This method, along with the recent binary NAS of Bulat et al. (2020), which reports an accuracy of ∼ 66%, represents, to our knowledge, the state-of-the-art in binary networks. Our method further improves upon these works, achieving an accuracy of ∼ 71% on ImageNet, crucially without increasing the computational complexity.

To achieve this, to our knowledge, we propose for the first time to explore ideas from Conditional Computing (Bengio et al., 2013; 2015) and learn data-specific binary expert weights which are dynamically selected during inference conditioned on the input data. Secondly, we are the first to identify width as an important factor for increasing the representation capacity of binary networks, and we introduce a surprisingly simple yet effective mechanism to enhance it without increasing complexity. Finally, although binary architecture design via NAS (Liu et al., 2018; Real et al., 2019) has recently been explored in (Kim et al., 2020; Bulat et al., 2020), we propose to approach it from a different perspective, more closely related to that of Tan & Le (2019), which was developed for real-valued networks.

2.2. CONDITIONAL COMPUTATION

Conditional computation is a very general data processing framework which refers to using different models, or different parts of a model, conditioned on the input data. Wang et al. (2018) and Wu et al. (2018) propose to completely bypass certain parts of the network during inference using skip connections, training a policy network via reinforcement learning. Gross et al. (2017) proposes to train large models by using a mixture of experts trained independently on different partitions of the data. While speeding up training, this approach is neither end-to-end trainable nor tuned towards improving the model accuracy. Shazeer et al. (2017) trains thousands of experts that are combined using a noisy top-k expert selection, while Teja Mullapudi et al. (2018) introduces the HydraNet, in which a routing function selects and combines a subset of different operations; the latter is more closely related to online network search. Chen et al. (2019) uses a separate network to dynamically select a variable set of filters, while Dai et al. (2017) learns a dynamically computed offset.

More closely related to the proposed EBConv is Conditional Convolution, where Yang et al. (2019) propose to learn a Mixture of Experts, i.e. a set of filters that are linearly combined using a routing function. In contrast, our approach learns to select a single expert at a time. This is critical for binary networks for two reasons: (1) The linear combination of a set of binary weights is non-binary and, hence, a second binarization is required, giving rise to training instability and increased memory consumption. In Section 5, we compare with such a model and show that our approach works significantly better. (2) The additional computation needed to multiply and sum the weights, while negligible for real-valued networks, can lead to a noticeable computational increase for binary ones.
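Point (1) above can be seen with a toy, pure-Python example (all values illustrative): blending binary experts with routing coefficients, as in a CondConv-style mixture, produces weights outside {-1, +1}, whereas hard selection of a single expert keeps them binary:

```python
# Why linear expert mixing breaks binarity (illustrative values only).

def binarize(w):
    """Sign binarization to {-1, +1}."""
    return [1 if v >= 0 else -1 for v in w]

experts = [binarize([0.7, -0.1, 0.4]), binarize([-0.5, 0.9, 0.2])]
# experts == [[1, -1, 1], [-1, 1, 1]]

# CondConv-style mixing: routing weights alpha linearly blend the experts.
alpha = [0.3, 0.7]
mixed = [sum(a * e[i] for a, e in zip(alpha, experts)) for i in range(3)]
# mixed is approximately [-0.4, 0.4, 1.0] -> no longer in {-1, +1}, so a
# second binarization would be needed before the xnor/popcount convolution.

# Single-expert selection: pick the highest-scoring expert; weights stay binary.
selected = experts[alpha.index(max(alpha))]
# selected == [-1, 1, 1] -> directly usable by binary kernels.
```

The design choice follows directly: selecting instead of mixing avoids both the re-binarization step and the per-inference multiply-sum over all expert weights.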

