HIGH-CAPACITY EXPERT BINARY NETWORKS

Abstract

Network binarization is a promising hardware-aware direction for creating efficient deep models. Despite its memory and computational advantages, reducing the accuracy gap between binary models and their real-valued counterparts remains a challenging, unsolved research problem. To this end, we make the following 3 contributions: (a) To increase model capacity, we propose Expert Binary Convolution, which, for the first time, tailors conditional computing to binary networks by learning to select one data-specific expert binary filter at a time conditioned on input features. (b) To increase representation capacity, we propose to address the inherent information bottleneck in binary networks by introducing an efficient width expansion mechanism which keeps the binary operations within the same budget. (c) To improve network design, we propose a principled binary network search mechanism that unveils a set of network topologies with favorable properties. Overall, our method improves upon prior work, with no increase in computational cost, by ∼ 6%, reaching a groundbreaking ∼ 71% on ImageNet classification. Code will be made available here.

1. INTRODUCTION

A promising, hardware-aware direction for designing efficient deep learning models is that of network binarization, in which filter and activation values are restricted to two states only: ±1 (Rastegari et al., 2016; Courbariaux et al., 2016). This comes with two important advantages: (a) it compresses the weights by a factor of 32× via bit-packing, and (b) it replaces the computationally expensive multiply-add with bit-wise xnor and popcount operations, offering, in practice, a speed-up of ∼ 58× on a CPU (Rastegari et al., 2016). Despite this, how to reduce the accuracy gap between a binary model and its real-valued counterpart remains an open problem, and it is currently the major impediment to their wide-scale adoption. In this work, we propose to approach this challenging problem from 3 key perspectives: 1. Model capacity: To increase model capacity, we introduce the first application of Conditional Computing (Bengio et al., 2013; 2015; Yang et al., 2019) to binary networks, which we call Expert Binary Convolution. For each convolutional layer, rather than learning a single weight tensor that is expected to generalize well across the entire input space, we learn a set of N experts, each of which is tuned to specialize to a portion of it. During inference, a very light-weight gating function dynamically selects a single expert for each input sample and uses it to process the input features. Learning to select a single expert tuned to the input data is a key property of our method which renders it suitable for binary networks, and contrasts our approach with previous works in conditional computing (Yang et al., 2019).
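The single-expert selection described above can be illustrated with a minimal NumPy sketch. This is not the paper's exact design: the pooling-plus-linear gate, the 1×1 convolution, and all tensor shapes are illustrative assumptions chosen to keep the example self-contained.

```python
import numpy as np

def sign_binarize(x):
    # Binarize to {-1, +1}; ties map to +1.
    return np.where(x >= 0, 1.0, -1.0)

def expert_binary_conv(x, experts, gate_w):
    """Sketch of an expert binary convolution (1x1 case for clarity).

    x:        input features, shape (C_in, H, W)
    experts:  N candidate real-valued filter banks, shape (N, C_out, C_in)
    gate_w:   gating weights, shape (N, C_in) -- a hypothetical gate design
    """
    # Light-weight gating: pool spatially, score each expert, pick one.
    pooled = x.mean(axis=(1, 2))      # (C_in,)
    scores = gate_w @ pooled          # (N,)
    k = int(np.argmax(scores))        # a single, data-specific expert

    # Binarize the selected expert's weights and the input activations.
    w_b = sign_binarize(experts[k])   # (C_out, C_in)
    x_b = sign_binarize(x)            # (C_in, H, W)

    # A 1x1 binary convolution reduces to channel mixing with binary
    # weights; in hardware this multiply-add becomes xnor + popcount.
    return np.tensordot(w_b, x_b, axes=([1], [0]))  # (C_out, H, W)
```

Note that only the selected expert is binarized and applied per sample, so the inference cost stays that of a single binary convolution regardless of N.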

2. Representation capacity:

There is an inherent information bottleneck in binary networks, as only 2 states are used to characterize each feature, which hinders the learning of highly accurate models. To this end, for the first time, we highlight the question of depth vs. width in binary networks and propose a surprisingly unexplored, efficient mechanism for increasing the effective width of the network while preserving the original computational budget. We show that our approach leads

