MEMORIZATION CAPACITY OF NEURAL NETWORKS WITH CONDITIONAL COMPUTATION

Abstract

Many empirical studies have demonstrated the performance benefits of conditional computation in neural networks, including reduced inference time and power consumption. We study the fundamental limits of neural conditional computation from the perspective of memorization capacity. For Rectified Linear Unit (ReLU) networks without conditional computation, it is known that memorizing a collection of n input-output relationships can be accomplished via a neural network with O(√n) neurons. Calculating the output of this neural network requires O(√n) elementary arithmetic operations (additions, multiplications, and comparisons) for each input. Using a conditional ReLU network, we show that the same task can be accomplished using only O(log n) operations per input. This represents an almost exponential improvement over networks without conditional computation. We also show that the Θ(log n) rate is the best possible. Our achievability result uses a general methodology to synthesize a conditional network out of an unconditional network in a computationally efficient manner, bridging the gap between unconditional and conditional architectures.

1. INTRODUCTION

1.1 CONDITIONAL COMPUTATION

Conditional computation refers to utilizing only certain parts of a neural network, in an input-adaptive fashion (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013). This can be done through gating mechanisms combined with a tree-structured network, as in the case of "conditional networks" (Ioannou et al., 2016) or neural trees and forests (Tanno et al., 2019; Yang et al., 2018; Kontschieder et al., 2015). Specifically, depending on the inputs or on features extracted from the inputs, a gate selects the sub-networks that will further process the gate's input features. Another family of conditional computation methods is the so-called early-exit architectures (Teerapittayanon et al., 2016; Kaya et al., 2019; Gormez et al., 2022). In this case, one typically places classifiers at intermediate layers of a large network, making it possible to exit at a given layer and reach a final classification verdict whenever the corresponding classifier is sufficiently confident in its decision. Several other sub-techniques of conditional computation exist and have been well studied, including layer skipping (Graves, 2016), channel skipping in convolutional neural networks (Gao et al., 2019), and reinforcement learning methods for input-dependent dropout policies (Bengio et al., 2015). Although the methods are diverse (Han et al., 2021), the general intuitions as to why conditional computation can improve the performance of neural networks remain the same. First, the computation units are chosen adaptively to process the features that are particular to the given input pattern; for example, a cat image is ideally processed only by "neurons that are specialized to cats." Second, one allocates just enough computation units to a given input, avoiding a waste of resources.
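As a minimal sketch of the early-exit mechanism described above, the following toy forward pass attaches a classifier head to an intermediate layer and skips the remaining computation whenever that head's confidence exceeds a threshold. All weights, dimensions, and the confidence threshold here are hypothetical placeholders (a real model would be trained), chosen only to illustrate the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-layer ReLU network with one intermediate (early-exit) head.
# Weights are random placeholders, not a trained model.
W1 = rng.normal(size=(8, 4))          # layer 1: R^4 -> R^8
W2 = rng.normal(size=(8, 8))          # layer 2: R^8 -> R^8
head_early = rng.normal(size=(3, 8))  # classifier at the intermediate layer
head_final = rng.normal(size=(3, 8))  # classifier at the output layer

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_forward(x, threshold=0.9):
    """Return (predicted class, exit point); stop at the first
    classifier whose top softmax probability exceeds `threshold`."""
    h1 = relu(W1 @ x)
    p = softmax(head_early @ h1)
    if p.max() >= threshold:          # confident enough: skip layer 2
        return int(p.argmax()), "early"
    h2 = relu(W2 @ h1)                # otherwise run the remaining layer
    p = softmax(head_final @ h2)
    return int(p.argmax()), "final"

label, exit_point = early_exit_forward(rng.normal(size=4))
```

Inputs that trigger the early exit cost one layer's worth of arithmetic instead of two; the threshold trades accuracy against the fraction of inputs that exit early.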
The end result is various benefits relative to a network without conditional computation, including reduced computation time and power/energy consumption (Kim & Seo, 2020). Achieving these benefits is especially critical in edge networks with resource-limited devices (Li et al., 2021a;b). Moreover, conditioning incurs minimal loss, or in some cases no loss, in learning performance.

