MEMORIZATION CAPACITY OF NEURAL NETWORKS WITH CONDITIONAL COMPUTATION

Abstract

Many empirical studies have demonstrated the performance benefits of conditional computation in neural networks, including reduced inference time and power consumption. We study the fundamental limits of neural conditional computation from the perspective of memorization capacity. For Rectified Linear Unit (ReLU) networks without conditional computation, it is known that memorizing a collection of n input-output relationships can be accomplished via a neural network with O(√n) neurons. Calculating the output of this neural network can be accomplished using O(√n) elementary arithmetic operations (additions, multiplications, and comparisons) for each input. Using a conditional ReLU network, we show that the same task can be accomplished using only O(log n) operations per input. This represents an almost exponential improvement as compared to networks without conditional computation. We also show that the Θ(log n) rate is the best possible. Our achievability result utilizes a general methodology to synthesize a conditional network out of an unconditional network in a computationally-efficient manner, bridging the gap between unconditional and conditional architectures.

1. INTRODUCTION

1.1 CONDITIONAL COMPUTATION Conditional computation refers to utilizing only certain parts of a neural network, in an input-adaptive fashion (Davis & Arel, 2013; Bengio et al., 2013; Eigen et al., 2013). This can be done through gating mechanisms combined with a tree-structured network, as in the case of "conditional networks" (Ioannou et al., 2016) or neural trees and forests (Tanno et al., 2019; Yang et al., 2018; Kontschieder et al., 2015). Specifically, depending on the input, or on features extracted from the input, a gate selects the sub-networks that will further process the gate's input features. Another family of conditional computation methods is the so-called early-exit architectures (Teerapittayanon et al., 2016; Kaya et al., 2019; Gormez et al., 2022). In this case, one typically places classifiers at intermediate layers of a large network. This makes it possible to exit at a certain layer and reach a final classification verdict whenever the corresponding intermediate classifier is sufficiently confident in its decision. Several other sub-techniques of conditional computation exist and have been well-studied, including layer skipping (Graves, 2016), channel skipping in convolutional neural networks (Gao et al., 2019), and reinforcement learning methods for input-dependent dropout policies (Bengio et al., 2015). Although there are many diverse methods (Han et al., 2021), the general intuitions as to why conditional computation can improve the performance of neural networks remain the same: First, the computation units are chosen in an adaptive manner to process the features that are particular to the given input pattern. For example, a cat image is ideally processed only by "neurons that are specialized to cats." Second, one allocates just enough computation units to a given input, avoiding a waste of resources.
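As a purely illustrative sketch of the hard-gating idea described above (the routing rule and expert sub-networks below are hypothetical choices, not an architecture from the cited works), a gate inspects the input and activates exactly one expert sub-network, so only that expert's weights are touched at inference time:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class GatedBlock:
    """Minimal hard-gating block: a cheap gating function routes each
    input to exactly one of several single-layer ReLU experts."""

    def __init__(self, experts, gate):
        self.experts = experts  # list of (W, b) pairs, one per expert
        self.gate = gate        # maps an input vector to an expert index

    def __call__(self, x):
        W, b = self.experts[self.gate(x)]  # conditional: one expert only
        return relu(W @ x + b)

# Toy usage: route by the sign of the first coordinate.
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((4, 3)), np.zeros(4)) for _ in range(2)]
block = GatedBlock(experts, gate=lambda x: int(x[0] > 0))
y = block(np.array([1.0, -0.5, 0.2]))
print(y.shape)  # (4,)
```

Only one of the two experts is evaluated per input, which is the source of the computational savings discussed throughout the paper.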
The end result is a variety of benefits relative to a network without conditional computation, including reduced computation time and power/energy consumption (Kim & Seo, 2020). Achieving these benefits is especially critical in edge networks with resource-limited devices (Li et al., 2021a;b). Moreover, conditioning incurs minimal, or in some cases no, loss in learning performance. Numerous empirical studies have demonstrated the benefits of conditional computation in many different settings. Understanding the fundamental limits of conditional computation in neural networks is thus crucial, but has not been well-investigated in the literature. There is a wide body of work on the theoretical analysis of decision tree learning (Maimon & Rokach, 2014), which can be considered an instance of conditional computation. These results are, however, not directly applicable to neural networks. In Cho & Bengio (2014), a feature vector is multiplied by one of several different weight matrices, chosen according to the significant bits of the feature vector, resulting in an exponential increase in the number of free parameters of the network (referred to as the capacity of the network in Cho & Bengio (2014)). On the other hand, the potential benefits of this scheme have not been formally analyzed.
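A minimal sketch may help fix ideas for the bit-dependent weight selection just described (this is our own illustrative reading, not code from Cho & Bengio (2014)): the sign pattern of k designated coordinates indexes into a bank of 2^k weight matrices, so the parameter count grows exponentially in k while the per-input cost remains that of a single matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3                                # feature dim, number of "significant bits"
bank = rng.standard_normal((2**k, d, d))   # 2^k candidate weight matrices

def conditional_matmul(x):
    """Select one matrix from the bank using the sign bits of the first k
    coordinates of x, then apply only that matrix."""
    bits = (x[:k] > 0).astype(int)
    idx = int("".join(map(str, bits)), 2)  # k sign bits -> bank index
    return bank[idx] @ x

x = rng.standard_normal(d)
y = conditional_matmul(x)
print(y.shape)  # (8,)
```

With k sign bits, the network stores 2^k d^2 free parameters but performs only one d-by-d multiplication per input, which is the exponential parameter growth noted above.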

1.2. MEMORIZATION CAPACITY

In this work, we consider the problem of neural conditional computation from the perspective of memorization capacity. Here, capacity refers to the maximum number of input-output pairs in reasonably general position that a neural network of a given size can learn. It is typically expressed as the minimum number of neurons or weights required for a given dataset of size, say n. Early works on the memorization capacity of neural networks include Baum (1988); Mitchison & Durbin (1989); Sontag (1990). In particular, Baum (1988) shows that, for threshold networks, O(n) neurons and weights are sufficient for memorization. This sufficiency result is later improved to O(√n) neurons and O(n) weights by Vershynin (2020); Rajput et al. (2021). There are also several studies on other activation functions, especially the Rectified Linear Unit (ReLU), given its practicality and wide utilization in deep learning applications. Initial works (Zhang et al., 2017; Hardt & Ma, 2016) show that O(n) neurons and weights are sufficient for memorization in the case of ReLU networks. This is improved to O(√n) weights and O(n) neurons in Yun et al. (2019). In addition, Park et al. (2021) proves the sufficiency of O(n^(2/3)) weights and neurons, and finally, Vardi et al. (2022) shows that memorization can be achieved with only O(√n) weights and neurons, up to logarithmic factors. For the sigmoid activation function, it is known that O(√n) neurons and O(n) weights (Huang, 2003), or O(n^(2/3)) weights and neurons (Park et al., 2021), are sufficient. Memorization and expressivity have also been studied in the context of specific network architectures such as convolutional neural networks (Cohen & Shashua, 2016; Nguyen & Hein, 2018).

The aforementioned achievability results have also been proven to be tight in certain cases. A very useful tool in this context is the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 2015). In fact, applying VC dimension theory to neural networks (Anthony et al., 1999), it can be shown that the numbers of neurons and weights must both grow polynomially in the size of the dataset for successful memorization. Specifically, Θ(√n) weights and Θ(n^(1/4)) neurons are optimal for ReLU networks, up to logarithmic factors. We will justify this statement later on for completeness.

1.3. SCOPE, MAIN RESULTS, AND ORGANIZATION

We analyze the memorization capacity of neural networks with conditional computation. We describe our neural network and the associated computational complexity models in Section 2. We describe a general method to synthesize conditional networks from unconditional networks in Section 3. We provide our main achievability and converse results for memorization capacity in Sections 4 and 5, respectively. We draw our main conclusions in Section 6. Some of the technical proofs are provided in the supplemental material. We note that this paper is specifically on analyzing the theoretical limits of neural conditional computation. In particular, we show that n input-output relationships can be memorized using a conditional network that needs only O(log n) operations per input or inference step. The best unconditional architecture requires O(√n) operations for the same task. This suggests that conditional models can offer significant time/energy savings as compared to unconditional architectures. In general, understanding the memorization capacity of neural networks is a well-studied problem of fundamental importance and is related to the expressive power of neural networks. A related but separate problem is generalization, i.e., how to design conditional networks that can not only recall the memory patterns with reasonable accuracy but also generalize to unseen examples. The "double-descent" phenomenon (Belkin et al., 2019; Nakkiran et al., 2021) suggests that the goals of memorization and generalization are not contradictory and that a memorizing network can potentially also generalize well. A further
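The claimed complexity gap, O(log n) operations per input with conditioning versus O(√n) without, can be made concrete with a toy one-dimensional example (an illustration of the operation count only, not the construction used in this paper): memorizing n scalar input-output pairs with a balanced binary tree of comparisons costs about log2(n) comparisons per input, since each comparison activates only one of two subtrees.

```python
# n sorted input-output pairs to memorize (toy dataset, arbitrary labels).
n = 1024
xs = [float(i) for i in range(n)]
ys = [(3 * i + 1) % 97 for i in range(n)]

ops = 0  # global counter of comparisons performed

def lookup(x):
    """Conditional 'network': a balanced binary search over thresholds.
    Each internal node compares the input against one threshold and
    descends into a single subtree, for ~log2(n) comparisons per input."""
    global ops
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        ops += 1                 # one comparison = one active gate
        if x <= xs[mid]:
            hi = mid
        else:
            lo = mid + 1
    return ys[lo]

assert all(lookup(xs[i]) == ys[i] for i in range(n))  # perfect recall
print(ops / n)  # average comparisons per input: log2(1024) = 10.0
```

An unconditional memorizer must instead touch on the order of all of its O(√n) active neurons for every input, which for n = 1024 would already be roughly 32 operations rather than 10, and the gap widens rapidly with n.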

