ONE NETWORK FITS ALL? MODULAR VERSUS MONOLITHIC TASK FORMULATIONS IN NEURAL NETWORKS

Abstract

Can deep learning solve multiple tasks simultaneously, even when they are unrelated and very different? We investigate how the representations of the underlying tasks affect the ability of a single neural network to learn them jointly. We present theoretical and empirical findings that a single neural network is capable of simultaneously learning multiple tasks from a combined data set, for a variety of methods for representing tasks; for example, when the distinct tasks are encoded by well-separated clusters or decision trees over certain task-code attributes. More concretely, we present a novel analysis showing that families of simple programming-like constructs for the codes encoding the tasks are learnable by two-layer neural networks with standard training. We study more generally how the complexity of learning such combined tasks grows with the complexity of the task codes; we find that combining many tasks may incur a sample complexity penalty, even though the individual tasks are easy to learn. We provide empirical support for the usefulness of the learning bounds by training networks on clusters, decision trees, and SQL-style aggregation.

* Work performed in part while visiting Google.

1. INTRODUCTION

Standard practice in machine learning has long been to address only carefully circumscribed, often closely related tasks. For example, we might train a single classifier to label an image as containing objects from a certain predefined set, or to label the words of a sentence with their semantic roles. Indeed, when working with relatively simple classes of functions like linear classifiers, it would be unreasonable to expect to train a classifier that handles more than such a carefully scoped task (or related tasks, as in standard multitask learning). As techniques for learning with relatively rich classes such as neural networks have been developed, it is natural to ask whether such scoping of tasks is inherently necessary. Indeed, many recent works (see Section 1.2) have proposed eschewing this careful scoping of tasks, and instead training a single, "monolithic" function spanning many tasks. Large, deep neural networks can, in principle, represent multiple classifiers in such a monolithic learned function (Hornik, 1991), giving rise to the field of multitask learning. This combined function might be learned by combining all of the training data for all of the tasks into one large batch (see Section 1.2 for some examples). Taken to an extreme, we could consider seeking to learn a universal circuit, that is, a circuit that interprets arbitrary programs in a programming language which can encode various tasks. But the ability to represent such a monolithic combined function does not necessarily entail that such a function can be efficiently learned by existing methods. Cryptographic hardness theorems (Kearns & Valiant, 1994) establish that this is not possible in general by any method, let alone the specific training methods used in practice.
Figure 1: Our framework shows that it is possible to learn analytic functions such as the gravitational force law, decision trees with different functions at the leaf nodes, and programming constructs such as those on the right, all using a non-modular monolithic architecture.

Nevertheless, we can still ask how rich a family of tasks can be learned by these standard methods. In this work, we study the extent to which backpropagation with stochastic gradient descent (SGD) can learn such monolithic functions on diverse, unrelated tasks. There might still be some inherent benefit to an architecture in which tasks are partitioned into sub-tasks of small scope, and the training data is correspondingly partitioned prior to learning. For example, in early work on multitask learning, Caruana (1997) observed that training a network to solve unrelated tasks simultaneously seemed to harm overall performance. Similarly, the seminal work of Jacobs et al. (1991) begins by stating that "If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization". We therefore ask whether, for an unfortunate choice of tasks in our model, learning by standard methods might be fundamentally impaired. As a point of reference from neuroscience, the classical view is that distinct tasks are handled in the brain by distinct patches of the cortex. While it is a subject of debate whether modularity exists for higher-level tasks (Samuels, 2006), it is accepted that there are dedicated modules for low-level tasks such as vision and audio processing. Thus, it seems that the brain produces a modular architecture, in which different tasks are handled by different regions of the cortex.
Conceivably, this division into task-specific regions might be driven by fundamental considerations of learnability: A single, monolithic neural circuit might simply be too difficult to learn because the different tasks might interfere with one another. Others have taken neural networks trained by backpropagation as a model of learning in the cortex (Musslick et al., 2017) ; to the extent that this is reasonable, our work has some bearing on these questions as well.

1.1. OUR RESULTS

We find, perhaps surprisingly, that combining multiple tasks into one cannot fundamentally impair learning with standard training methods. We demonstrate this for a broad family of methods for combining individual tasks into a single monolithic task. For example, the inputs for each individual task may come from a disjoint region (for example, a disjoint ball) in a common input space, and each individual task could then involve applying some arbitrary simple function (e.g., a separate linear classifier for each region). Alternatively, there may be an explicit "task code" attribute (e.g., a one-hot code), together with the usual input attributes and output label(s), where examples with the same task code are examples for the same learning task. Complementing our results that combining multiple tasks does not impair learning, we also find that some task coding schemes do incur a sample complexity penalty. A vast variety of task coding schemes may be used. As a concrete example, when the data points for each task are well separated into distinct clusters, and the tasks are linear classification tasks, we show that a two-layer architecture trained with SGD successfully learns the combined, monolithic function; the required amount of data simply scales as the sum of the amounts required to learn each task individually (Theorem 2). Meanwhile, if the tasks are determined by a balanced decision tree of height h on d code attributes (as in Fig. 1, left), we find that the training time and amount of data needed scale as ~ d^h, which is quasipolynomial in the 2^h leaves (distinct tasks) when d is of similar size to h, and thus when the coding is efficient (Theorem 3). We also prove a corresponding lower bound, which shows that this bound is in fact asymptotically tight (Theorem 3).
More generally, for task codings based on decision trees using linear splits with a margin of at least γ (when the data has unit ℓ2 norm), the training time and required data are asymptotically bounded by ~ e^{O(h/γ²)}, which for constant γ is polynomial in the 2^h functions (Theorem 4). We generalize from these cluster-based and decision-tree-based task codings to more complex codes that are actually simple programs. For instance, we show that SQL-style aggregation queries over a fixed database, written as functions of the parameters of the query, can also be learned this way. More generally, simple programming constructs (such as those in Fig. 1, right), built by operations such as composition, aggregation, concatenation, and branching on a small number of such learnable functions, are also learnable (Theorem 5). In general, we can learn a low-depth formula (a circuit with fan-out 1) in which each gate is not merely a switch (as in a decision tree), but can be any analytic function on the inputs, including arithmetic operations. Again, our key technical contribution is to show that all of these functions are efficiently learned by SGD. This is non-trivial since, although universal approximation theorems show that such functions can be expressed by (sufficiently wide) two-layer neural networks, under standard assumptions some expressible functions are not learnable (Klivans & Sherstov, 2009). We supplement the theoretical bounds with experiments on clusters, decision trees, and SQL-style aggregation, showing that such functions are indeed learned in practice. We note that the learning of such combined functions could have been engineered by hand: for example, there exist efficient algorithms for learning clusterings or such decision trees, and it is easy to learn the linear classifiers given the partitioned data. Likewise, these classes of functions are all known to be learnable by other methods, given an appropriate transformation of the input features.
The key point is that the two-layer neural network can jointly learn the task coding scheme and the task-specific functions without special engineering of the architecture. That is, it is unnecessary to engineer a way of partitioning the data into separate tasks prior to learning. Relatedly, the time and sample requirements of learning multiple tasks on a single network are, in general, insufficient to explain the modularity observed in biological neural networks if their learning dynamics are similar to SGD; i.e., we cannot explain the presence of modularity from such general considerations. All our theoretical results are based upon a fundamental theorem showing that analytic functions can be efficiently learnt by wide (but finite-width) two-layer neural networks with standard activation functions (such as ReLU), using SGD from a random initialization. Specifically, we derive novel generalization bounds for multivariate analytic functions (Theorems 1 and 8) by relating wide networks to kernel learning with a specific network-induced kernel (Jacot et al., 2018; Du et al., 2019; Allen-Zhu et al., 2019; Arora et al., 2019a; Lee et al., 2019), known as the neural tangent kernel (NTK) (Jacot et al., 2018). We further develop a calculus of bounds showing that the sum, product, ratio, and composition of analytic functions are also learnable, with bounds constructed using the familiar product and chain rules of univariate calculus (Corollaries 1, 2). These learnability results may be of independent interest; for example, they can be used to show that natural physical laws like the gravitational force equations (shown in Fig. 1) can be efficiently learnt by neural networks (Section B.1). Furthermore, our bounds imply that the NTK for the ReLU activation has theoretical learning guarantees superior to those of the Gaussian kernel (Section A.2), which we also demonstrate empirically with experiments on learning the gravitational force law (Section B.2).

1.2. RELATED WORK

Most related to our work are a number of works in application areas that have sought to learn a single network that can perform many different tasks. In natural language processing, Tsai et al. (2019) show that a single model can solve machine translation across more than 50 languages. Many other works in NLP similarly seek to use one model for multiple languages, or even multiple tasks (Johnson et al., 2017; Aharoni et al., 2019; Bapna et al., 2019; Devlin et al., 2018). Monolithic models have also been successfully trained for tasks in very different domains, such as speech and language (Kaiser et al., 2017). Finally, there is also work on training extremely large neural networks which have the capacity to learn multiple tasks (Shazeer et al., 2017; Raffel et al., 2019). These works provide empirical clues suggesting that a single network can successfully be trained to perform a wide variety of tasks. But they do not provide a systematic theoretical investigation of the extent of this ability, as we do here. Caruana (1997) proposed multitask learning, in which a single network is trained to solve multiple tasks on the same input simultaneously, as a vector of outputs. He observed that the average generalization error for the multiple tasks may be much better than when the tasks are trained separately, and this observation initiated an active area of machine learning research (Zhang & Yang, 2017). Multitask learning is obviously related to our monolithic architectures. The difference is that whereas in multitask learning all of the tasks are computed simultaneously and output on separate gates, here all of the tasks share a common set of outputs, and the task code inputs switch between the various tasks. Furthermore, contrary to the main focus of multitask learning, we are primarily interested in the extent to which different tasks may interfere, rather than how much similar ones may benefit.
Our work is also related to studies of neural models of multitasking in cognitive science. In particular, Musslick et al. (2017) consider a similar two-layer architecture in which there is a set of task code attributes. But, as in multitask learning, they are interested in how many of these tasks can be performed simultaneously, on distinct outputs. They analyze the tradeoff between improved sample complexity and interference of the tasks with a handcrafted "gating" scheme, in which parts of the activity are zeroed out depending on the input (as opposed to the usual nonlinearities); in this model, they find that the speedup from multitask learning comes at the penalty of limiting the number of tasks that can be correctly computed as the similarity of inputs varies. Thus, in contrast to our model, where the single network computes distinct tasks sequentially, they do find that the distinct tasks can interfere with each other when we seek to solve them simultaneously.

2. TECHNICAL OVERVIEW

We now give a more detailed overview of our theoretical techniques and results, with informal statements of our main theorems. For full formal statements and proofs, please see the Appendix.

2.1. LEARNING ANALYTIC FUNCTIONS

Our technical starting point is to generalize the analysis of Arora et al. (2019b) in order to show that two-layer neural networks with standard activations, trained by SGD from random initialization, can learn analytic functions on the unit sphere. We then obtain our results by demonstrating how our representations of interest can be captured by analytic functions with power series representations of appropriately bounded norms. Formal statements and proofs for this section appear in Appendix A.2. Let S^d denote the unit sphere in d dimensions.

Theorem 1. (Informal) Given an analytic function g(y), the function g(β · x), for fixed β ∈ R^d (with β := ‖β‖_2) and inputs x ∈ S^d, is learnable to error ε with n = O((β ḡ'(β) + ḡ(0))² / ε²) examples using a single-hidden-layer, finite-width neural network of width poly(n) trained with SGD, where

    ḡ(y) = Σ_{k=0}^∞ |a_k| y^k    (1)

and the a_k are the power series coefficients of g(y).

We will refer to ḡ(1) as the norm of the function g; this captures the Rademacher complexity of learning g, and hence the required sample complexity. We also show that ḡ in fact tightly captures the Rademacher complexity of learning g, i.e., there is a lower bound on the Rademacher complexity based on the coefficients of ḡ for certain input distributions (see Corollary 5 in Section C of the appendix). We also note that we can prove a much more general version for multivariate analytic functions g(x), with a modified norm function ḡ(y) constructed from the multivariate power series representation of g(x) (Theorem 8 in Appendix A.2). The theorems can also be extended to develop a "calculus of bounds" which lets us compute new bounds for functions created via combinations of learnable functions. In particular, we have a product rule and a chain rule:

Corollary 1 (Product rule). Let g(x) and h(x) meet the conditions of Theorem 1. Then the product g(x)h(x) is efficiently learnable as well, with O(M_{g·h} / ε²) samples, where M_{g·h} = ḡ'(1) h̄(1) + ḡ(1) h̄'(1) + ḡ(0) h̄(0).

Corollary 2 (Chain rule). Let g(y) be an analytic function and h(x) be efficiently learnable, with auxiliary functions ḡ(y) and h̄(y) respectively. Then the composition g(h(x)) is efficiently learnable as well, with O(M_{g∘h} / ε²) samples, where M_{g∘h} = ḡ'(h̄(1)) h̄'(1) + ḡ(h̄(0)), provided that h̄(0) and h̄(1) lie within the radius of convergence of ḡ.

Figure 2: Some of the task codings which fit in our framework. On the left, we show a task coding via clusters; here, c^(i) is the code for the ith cluster. On the right, we show a task coding based on low-depth decision trees; here, c_i is the ith coordinate of the code c of the input datapoint.

The calculus of bounds enables us to prove learning bounds on increasingly expressive functions, and we can prove results that may be of independent interest. As an example, we show in Appendix B.1 that the forces on k bodies interacting via Newtonian gravitation, as shown in Figure 1, can be learned to error ε using only k^{O(ln(k/ε))} examples (even though the function 1/x has a singularity at 0).
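To make the norm calculus concrete, the following sketch numerically evaluates the norm function ḡ of Theorem 1 and the resulting sample-complexity scales for the example g(y) = e^y; the example function and the truncation degree are our illustrative choices, not constructions from the paper.

```python
import math

# Hedged sketch: the norm function ḡ(y) = sum_k |a_k| y^k from Theorem 1,
# evaluated by truncating the power series (truncation degree K is our choice).

def norm_fn(coeffs, y):
    """ḡ(y) = sum_k |a_k| y^k for the given power-series coefficients."""
    return sum(abs(a) * y ** k for k, a in enumerate(coeffs))

def norm_fn_deriv(coeffs, y):
    """ḡ'(y) = sum_{k>=1} k |a_k| y^(k-1)."""
    return sum(k * abs(a) * y ** (k - 1) for k, a in enumerate(coeffs) if k >= 1)

# Example: g(y) = e^y has a_k = 1/k!, so ḡ(y) = e^y as well.
K = 30  # truncation degree
exp_coeffs = [1.0 / math.factorial(k) for k in range(K)]

# Theorem 1 sample-complexity scale for g(beta . x): M = (beta*ḡ'(beta) + ḡ(0))^2
beta = 1.0
M_exp = (beta * norm_fn_deriv(exp_coeffs, beta) + norm_fn(exp_coeffs, 0.0)) ** 2

# Product rule (Corollary 1) scale for g*h with g = h = exp:
g1 = norm_fn(exp_coeffs, 1.0)        # ḡ(1) = e
g0 = norm_fn(exp_coeffs, 0.0)        # ḡ(0) = 1
gp1 = norm_fn_deriv(exp_coeffs, 1.0)  # ḡ'(1) = e
M_prod = gp1 * g1 + g1 * gp1 + g0 * g0

print(M_exp)   # (e + 1)^2
print(M_prod)  # 2 e^2 + 1
```

Here M_exp evaluates to (e + 1)² and M_prod to 2e² + 1, illustrating how the bound for a product is assembled from the norms and derivatives of the factors.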

2.2. TASK CODING VIA CLUSTERS

Our analysis of learning analytic functions allows us to prove that a single network with standard training can learn multiple tasks. We formalize the problem of learning multiple tasks as follows. In general, these networks take pairs of inputs (c, x), where c is a task code and x is the input (vector) for the task represented by c. We assume both c and x have fixed dimensionality. These pairs are then encoded by the concatenation of the two vectors, which we denote by c; x. Given k tasks, corresponding to evaluation of functions f_1, ..., f_k respectively on the input x, the ith task has a corresponding code c^(i). Now, we wish to learn a function g such that g(c^(i); x) = f_i(x) from examples of the form ((c^(i); x), f_i(x)). This g is a "monolithic" function combining the k tasks. More generally, there may be some noise (bounded within a small ball around c^(i)) in the task codes, which would require learning the monolithic function g(c; x) = f_j(x), where j = argmin_i ‖c − c^(i)‖_2. Alternatively, the task codes are not given explicitly but are inferred by checking which ball center c^(i) (unique per task) is closest to the input x (see Fig. 2 (left) for an example). Note that these are all generalizations of a simple one-hot coding. We assume throughout that the f_i are analytic, with bounded-norm multinomial Taylor series representations. Our technical tool is the following lemma (proved in Appendix A.2), which shows that the univariate step function 1(x ≥ 0) can be approximated to error ε with margin γ using a low-degree polynomial which can be learnt using SGD.

Lemma 1. Given a scalar x, let Φ(x, γ, ε) = (1/2)(1 + erf(Cx √(log(1/ε)) / γ)), where erf is the Gauss error function and C is a constant. Let Φ̃(x, γ, ε) be the function Φ(x, γ, ε) with its Taylor series truncated at degree O(log(1/ε)/γ). Then Φ̃(x, γ, ε) = O(ε) for x ≤ −γ/2, and Φ̃(x, γ, ε) = 1 − O(ε) for x ≥ γ/2. Moreover, Φ̃(x, γ, ε) can be learnt using SGD with at most e^{O(log(1/ε)/γ²)} examples.
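A minimal numerical sketch of the smooth-step idea behind Lemma 1: we truncate the Taylor series of a scaled erf and check that the result is near 0 below the margin and near 1 above it. The constant C = 2, the erf scaling, and the (generous) truncation degree here are our illustrative assumptions, not the lemma's exact constants.

```python
import math

# Hedged sketch of Lemma 1: a truncated-Taylor erf approximates the step
# 1(x >= 0) with margin gamma and error eps. C and the truncation degree
# below are illustrative choices.

def erf_taylor(z, degree):
    """Taylor series of erf(z), keeping only terms of degree <= `degree`."""
    total = 0.0
    n = 0
    while 2 * n + 1 <= degree:
        total += (-1) ** n * z ** (2 * n + 1) / (math.factorial(n) * (2 * n + 1))
        n += 1
    return 2.0 / math.sqrt(math.pi) * total

def phi_trunc(x, gamma, eps, C=2.0):
    """Phi~(x, gamma, eps): smooth step with a generous truncation degree."""
    a = C * math.sqrt(math.log(1.0 / eps)) / gamma
    degree = max(3, int(40 * math.log(1.0 / eps) / gamma))
    return 0.5 * (1.0 + erf_taylor(a * x, degree))

gamma, eps = 1.0, 1e-2
lo = phi_trunc(-gamma / 2, gamma, eps)  # below the margin: close to 0
hi = phi_trunc(+gamma / 2, gamma, eps)  # above the margin: close to 1
print(lo, hi)
```

With these choices, `lo` is on the order of 10^-3 and `hi` is its mirror image near 1, matching the O(ε) and 1 − O(ε) behavior the lemma guarantees.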
Using this lemma, we show that indicator functions for detecting membership in a ball near a prototype c^(i) can also be sufficiently well approximated by functions with such a Taylor series representation. Specifically, we use the truncated representation of the erf function to indicate that ‖c − c^(i)‖ is small. As long as the centers are sufficiently well separated, we can find a low-degree, low-norm function this way using Lemma 1. For example, to check whether c is within distance r of the center c^(i), we can use 1(‖c − c^(i)‖² ≤ r²), which can be approximated using the function Φ̃ from Lemma 1. Then, given such approximate representations for the task indicators I_1(c), ..., I_k(c), the function g(c; x) = I_1(c) f_1(x) + ··· + I_k(c) f_k(x) has norm linear in the complexities of the task functions, so that it is learnable by Theorem 1 (we scale the inputs to lie within the unit ball, as required by Theorem 1). We state the result below; for the formal statement and proof, see Appendix A.3.

Theorem 2. (Informal) Given k analytic functions having Taylor series representations with norm at most poly(k/ε) and degree at most O(log(k/ε)), a two-layer neural network trained with SGD can learn the following functions g(c; x) on the unit sphere to accuracy ε, with sample complexity poly(k/ε) times the sum of the sample complexities for learning each of the individual functions:

• for Ω(1)-separated codes c^(1), ..., c^(k), if ‖c − c^(i)‖_2 ≤ O(1), then g(c; x) = f_i(x).
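The cluster-coded monolithic target of Theorem 2 can be sketched as follows; the centers, radius, and per-task linear functions below are toy assumptions used only to illustrate the function g(c; x) = Σ_i I_i(c) f_i(x) that the network is asked to learn.

```python
import numpy as np

# Hedged sketch of the cluster-coded monolithic target of Theorem 2:
# g(c; x) = sum_i I_i(c) f_i(x), where I_i(c) indicates that the task code c
# lies in a small ball around the center c^(i). All quantities here are toy.

rng = np.random.default_rng(0)
d, k = 8, 4
centers = np.eye(k, d)             # k well-separated unit-norm task codes
weights = rng.normal(size=(k, d))  # task i is the linear map f_i(x) = <w_i, x>

def g(c, x, radius=0.3):
    """Monolithic function: applies the task whose center is within `radius` of c."""
    dists = np.linalg.norm(centers - c, axis=1)
    i = int(np.argmin(dists))
    if dists[i] > radius:
        return 0.0                 # c matches no task code
    return float(weights[i] @ x)

# A noisy code near center 2 still selects task f_2:
noise = rng.normal(size=d)
c = centers[2] + 0.1 * noise / np.linalg.norm(noise)  # ||c - c^(2)|| = 0.1
x = rng.normal(size=d)
print(g(c, x), float(weights[2] @ x))  # the two values agree
```

This is the target function only; the point of Theorem 2 is that a generically trained two-layer network recovers it from examples, without being told the cluster structure.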

2.3. TASK CODING VIA LOW-DEPTH DECISION TREES

Theorem 2 can be viewed as performing a single k-way branching choice of which task function to evaluate. Alternatively, we can consider a sequence of such choices, and obtain a decision tree in which the leaves indicate which task function is to be applied to the input. We first consider the simple case of a decision tree when c is a {±1}-valued vector. We can check that the values c_1, ..., c_h match the fixed assignment c^(i)_1, ..., c^(i)_h that reaches a given leaf of the tree using the function I_{c^(i)}(c) = ∏_{j=1}^h (1 + c_j c^(i)_j)/2 (or similarly for any subset of up to h of the indices). Then g(c; x) = I_{c^(1)}(c) f_1(x) + ··· + I_{c^(k)}(c) f_k(x) represents our decision tree coding of the tasks (see Fig. 2 (right) for an example). For the theorem, we again scale the inputs to lie within the unit ball:

Theorem 3. (Informal) Two-layer neural networks trained with SGD can learn such a decision tree of depth h to within error ε, with sample complexity O(d^h/ε²) times the sum of the sample complexities for learning each of the individual functions at the leaves. Furthermore, conditioned on the hardness of learning parity with noise, d^{Ω(h)} examples are in fact necessary to learn a decision tree of depth h.

We can generalize the previous decision tree to allow a threshold-based decision at every internal node, instead of just looking at a coordinate. Assume that the input data lies in the unit ball and that each decision is made with a margin of at least γ. We can then use a product of our truncated erf polynomials to represent branches of the tree. We thus show:

Theorem 4. (Informal) If we have a decision tree of depth h where each decision is made with a margin of at least γ, then we can learn such a function to within error ε, with sample complexity e^{O(h log(1/ε)/γ²)} times the sample complexity of learning each of the leaf functions.

For the formal statements and proofs, see Appendix A.4.
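The decision-tree target can be sketched directly; we use the standard product-form leaf indicator for {±1}-valued codes (which is 1 exactly when the code matches a leaf's assignment and 0 otherwise), and toy linear leaf functions of our own choosing.

```python
import numpy as np
from itertools import product

# Hedged sketch of the decision-tree task coding (Theorem 3): for {±1}-valued
# codes, the leaf indicator I_a(c) = prod_j (1 + c_j * a_j) / 2 equals 1 iff
# c matches the leaf assignment a. Leaf functions here are toy linear maps.

h = 3  # tree depth -> 2^h leaves
leaves = [np.array(a) for a in product([-1, 1], repeat=h)]

def leaf_indicator(c, a):
    """1 iff every coordinate of c matches a; any mismatch zeroes the product."""
    return np.prod((1 + c * a) / 2.0)

def g(c, x, leaf_fns):
    """Monolithic function: sum of indicator-gated leaf functions."""
    return sum(leaf_indicator(c, a) * f(x) for a, f in zip(leaves, leaf_fns))

# Toy leaf functions f_i(x) = (i + 1) * sum(x):
leaf_fns = [lambda x, s=i: (s + 1.0) * x.sum() for i in range(2 ** h)]

c = leaves[5]                     # code selecting leaf 5
x = np.array([0.5, -0.25, 1.0])
print(g(c, x, leaf_fns))          # equals f_5(x) = 6 * 1.25 = 7.5
```

Because each mismatched coordinate contributes a factor (1 + (−1))/2 = 0, exactly one leaf function survives the sum, mirroring the switching behavior described above.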
Note that by Theorem 3, the exponential dependence on the depth in these theorems is necessary.

2.4. SIMPLE PROGRAMMING CONSTRUCTS

So far, we have discussed jointly learning k functions with task codings represented by clusters and decision trees. We now move to a more general setup, where we allow simple programming constructs such as composition, aggregation, concatenation, and branching on different functions. At this stage, the distinction between "task codes" and "inputs" becomes somewhat arbitrary. Therefore, we will generally drop the task codes c from the inputs. The class of programming constructs we can learn is a generalization of the decision tree, and we refer to it as a generalized decision program.

Definition 1. We define a generalized decision program to be a circuit with fan-out 1 (i.e., a tree topology). Each gate in the circuit computes a function of the outputs of its children, and the root (top) node computes the final output. All gates, including the leaf gates, have access to the input x.

We can learn generalized decision programs where each node evaluates one among a large family of operations, first described informally below and then defined formally.

Arithmetic/analytic formulas. As discussed in Section 2.1, the learnability of analytic functions allows us to learn not only functions with bounded Taylor series, but also sums, products, and ratios of such functions. Thus, we can learn constant-depth arithmetic formulas with bounded outputs, as well as analytic functions (with appropriately bounded Taylor series) applied to such learnable functions.

Aggregation. We observe that the sum of k functions with bounded Taylor representations yields a function of the same degree and with norm at most k times greater; the average of these k functions, meanwhile, does not increase the magnitude of the norm. Thus, these standard aggregation operations are represented very efficiently.
These enable us to learn functions that answer a family of SQL-style queries against a fixed database, as follows: suppose I(x, r) is an indicator function for whether or not the record r satisfies the predicate with parameters x. Then the sum of the m entries of a database that satisfy the predicate given by x is represented by I(x, r^(1)) r^(1) + ··· + I(x, r^(m)) r^(m). Thus, as long as the predicate function I and the records r^(i) have bounded norms, the function mapping the parameters x to the result of the query is learnable. We remark that max aggregation can also be represented as a sum of appropriately scaled threshold indicators, provided that there is a sufficient gap between the maximum value and the other values.
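A minimal sketch of this indicator-weighted representation of a query, with a toy database and a hypothetical threshold predicate of our own choosing:

```python
# Hedged sketch: an SQL-style sum query represented as the indicator-weighted
# sum I(x, r_1) r_1 + ... + I(x, r_m) r_m over a fixed toy database.

records = [12.0, 40.0, 7.0, 55.0, 23.0]  # fixed database entries (toy values)

def predicate(x, r):
    """I(x, r): does record r satisfy the threshold predicate with parameter x?"""
    return 1.0 if r >= x else 0.0

def query_sum(x):
    """SELECT SUM(r) WHERE r >= x, written as an indicator-weighted sum."""
    return sum(predicate(x, r) * r for r in records)

print(query_sum(20.0))  # 40 + 55 + 23 = 118
print(query_sum(0.0))   # sums every record: 137
```

The learnability claim is that, once the hard indicator is replaced by its smooth truncated-erf surrogate, the whole map from query parameter x to query answer has bounded norm and hence falls under Theorem 1.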

Structured data

We note that our networks already receive vectors of inputs and may produce vectors of outputs. Thus, one may trivially represent structured inputs and outputs such as those in Fig. 1 (right) using these vectors. We now formalize this by defining the class of functions we allow.

Definition 2. We support the following operations at any gate in the generalized decision program. Let every gate have at most k children, let g be the output of some gate, and let {f_1, ..., f_k} be the outputs of the children of that gate.

1. Any analytic function of the child gates which can be approximated by a polynomial of degree at most p, including the sum g = Σ_{i=1}^k f_i and the product of p terms g = ∏_{i=1}^p f_i.

2. Margin-based switch (decision) gate with children {f_1, f_2} and some constant margin γ, i.e., g = f_1 if ⟨β, x⟩ − α ≤ −γ/2, and g = f_2 if ⟨β, x⟩ − α ≥ γ/2, for a vector β and constant α.

3. Cluster-based switch gate with k centers {c^(1), ..., c^(k)}, with separation r (for some constant r), i.e., the output is f_i if ‖x − c^(i)‖ ≤ r/3, and 0 if x does not match any of the centers. A special case of this is a look-up table which returns value v_i if x = c^(i).

4. Composition of two functions, g(x) = f_1(f_2(x)).

5. Create a tuple out of separate fields by concatenation: given inputs {f_1, ..., f_k}, g outputs a tuple [f_1, ..., f_k], which creates a single data structure out of the children. Or, extract a field out of a tuple: for a fixed field i, given the tuple [f_1, ..., f_k], g returns f_i.

6. For a fixed table T with k entries {r_1, ..., r_k}, a Boolean-valued function b, and an analytic function f, SQL queries of the form SELECT SUM f(r_i) WHERE b(r_i, x) for the input x, i.e., g computes Σ_{i: b(r_i, x)=1} f(r_i). (We assume that f takes bounded values and that b can be approximated by an analytic function of degree at most p.) For an example, see the function avg_income_zip_code() in Fig. 1 (right).

As an example of a simple program we can support, refer to Fig. 1 (right), which involves table lookups, decision nodes, analytic functions such as Euclidean distance, and SQL queries. Theorem 5 is our learning guarantee for generalized decision programs. See Section A.5 in the Appendix for proofs, formal statements, and a detailed description of the program in Fig. 1 (right).

Theorem 5. (Informal) Any generalized decision program of constant depth h using the above operations with p ≤ O(log(k/ε)) can be learnt to within error ε with sample complexity k^{poly(log(k/ε))}. For the specific case of the program in Fig. 1 (right), it can be learnt using (k/ε)^{O(log(1/ε))} examples, where k is the number of individuals in the database.
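As an illustration, the following sketch evaluates a toy depth-2 generalized decision program: a margin-based switch gate at the root routes to either an analytic gate or an aggregation gate. The particular gates, margin, and weight vector β are our illustrative choices, not the program of Fig. 1 (right).

```python
import math

# Hedged sketch of a toy depth-2 generalized decision program (Definition 2):
# root = margin-based switch gate; children = an analytic gate and an
# aggregation gate. Gate choices and constants are illustrative.

def analytic_gate(x):
    """Gate of type 1: a bounded analytic function of the input."""
    return math.exp(-sum(v * v for v in x))

def aggregation_gate(x):
    """Aggregation: the average of the input coordinates."""
    return sum(x) / len(x)

def switch_root(x, beta=(1.0, 0.0, 0.0), alpha=0.0, gamma=0.2):
    """Margin-based switch (type 2): routes on the sign of <beta, x> - alpha.
    Inputs are promised to stay gamma/2 away from the decision boundary."""
    s = sum(b * v for b, v in zip(beta, x)) - alpha
    return analytic_gate(x) if s <= -gamma / 2 else aggregation_gate(x)

print(switch_root((-0.5, 0.2, 0.1)))  # s = -0.5: analytic branch
print(switch_root((0.5, 0.2, 0.1)))   # s = +0.5: aggregation branch
```

Theorem 5 says that such a composite, once its switch gates are smoothed by the truncated-erf indicators, is again a single bounded-norm analytic function and hence learnable end to end.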

3. EXPERIMENTS

We next empirically explore the learnability of multiple functions by a two-layer neural network when the tasks are coded by well-separated clusters or decision trees, and more generally the learnability of SQL-style aggregation for a fixed database. We find good agreement between the empirical performance and the bounds of Section 2. See Appendix D for more details of the experimental setup.

Learning binary classification on well-separated cluster data. We demonstrate through experiments on synthetic data that a single neural network can learn multiple tasks if the tasks are well separated into clusters, as discussed in Section 2.2. Here the data is drawn from a mixture of k well-separated Gaussians in d = 50 dimensions. Within each Gaussian, the data points are marked with one of two labels. For the label generation, we consider two cases: first, when the labels within each cluster are determined by a simple linear classifier, and second, when the labels are given by a random teacher neural network with one hidden layer of 10 hidden units. Fig. 3 shows the performance of a single two-layer neural network with 50k hidden units on this task. The performance of the neural network changes only slightly as the number of clusters (k) increases, suggesting that a single neural network can learn across all clusters.

Learning polynomial functions on the leaves of a decision tree. We consider the problem of learning polynomial functions selected by a decision tree. The data generation process is as follows. We first fix the parameters: tree depth h, decision variable threshold margin γ, number of variables k, and degree p for the leaf functions. Then we specify a full binary decision tree of depth h with a random polynomial function on each leaf.
To do this, we first generate thresholds t_1, t_2, ..., t_h from the uniform distribution on [0, 1], and 2^h leaf functions which are homogeneous polynomials of k variables and degree p, with uniformly distributed random coefficients in [0, 1]. A train/test example (x, y), where x = (x_1, ..., x_h, x_{h+1}, ..., x_{h+p}), is generated by first sampling the x_i's from the uniform distribution on [0, 1], selecting the corresponding leaf based on x_1, ..., x_h (that is, go left at the first branch if x_1 ≤ t_1, otherwise go right, and so on), and computing y by evaluating the leaf function at (x_{h+1}, ..., x_{h+p}). The data is generated with the guarantee that each leaf has the same number of data points. Fig. 4 shows the performance of a two-layer neural network with 32 × 2^h hidden units, measured by the R-squared metric. Here the R-squared metric is defined as 1 − Σ_i (ŷ_i − y_i)² / Σ_i (y_i − ȳ)², and is the fraction of the underlying variance explained by the model. Note that for a model that outputs the mean ȳ for any input, the R-squared metric would be zero. We observed that, for a fixed number of training samples, accuracy increases as the threshold margin increases, and that the dependence of sample complexity on test error agrees with the bound in Theorem 4.
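The data-generation process and the R-squared metric described above can be sketched as follows; for simplicity, the leaf functions here are single monomials (a coefficient times the product of the last p coordinates) rather than the random homogeneous polynomials used in the experiments.

```python
import numpy as np

# Hedged sketch of the decision-tree experiment data (Section 3) and the
# R-squared metric. Leaf functions are simplified to single monomials.

rng = np.random.default_rng(0)
h, p, n = 3, 2, 1000

t = rng.uniform(size=h)               # thresholds t_1, ..., t_h
leaf_coef = rng.uniform(size=2 ** h)  # one coefficient per leaf (simplified)

X = rng.uniform(size=(n, h + p))      # x = (x_1..x_h, x_{h+1}..x_{h+p})

def leaf_index(row):
    """Walk the full binary tree: go left at level j if x_j <= t_j, else right."""
    idx = 0
    for j in range(h):
        idx = 2 * idx + (0 if row[j] <= t[j] else 1)
    return idx

# y = (leaf coefficient) * (product of the last p coordinates)
y = np.array([leaf_coef[leaf_index(r)] * np.prod(r[h:]) for r in X])

def r_squared(y_hat, y):
    """1 - sum_i (y_hat_i - y_i)^2 / sum_i (y_i - mean(y))^2."""
    return 1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_squared(y, y))                          # perfect model scores 1.0
print(r_squared(np.full_like(y, y.mean()), y))  # mean predictor scores 0.0
```

Note that we do not enforce the equal-points-per-leaf guarantee here; the experiments reportedly rebalance the sampled data so that every leaf contributes the same number of examples.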

4. CONCLUSION AND FUTURE WORK

Our results indicate that even using a single neural network, we can still learn tasks across multiple, diverse domains. However, modular architectures may still have benefits over monolithic ones: they might use less energy and computation, as only a portion of the total network needs to evaluate any data point. They may also be more interpretable, as it is clearer what role each part of the network is performing. It is an open question whether any of these benefits of modularity can be extended to monolithic networks. For instance, is it necessary for a monolithic network to have modular parts which perform identifiable simple computations? And if so, can we efficiently identify these from the larger network? This could help in interpreting and understanding large neural networks. Our work also begins to establish how neural networks can learn functions which are represented as simple programs. This perspective raises the question: how rich can these programs be? Can we learn programs from a full-featured language? In particular, supposing that they combine simpler programs using other basic operations such as composition, can such libraries of tasks be learned as well, i.e., can these learned programs be reused? We view this as a compelling direction for future work.

A THEORETICAL RESULTS

A.1 KERNEL LEARNING BOUNDS

In this section, we develop the theory of learning analytic functions. For a given function $g$, we define a parameter $M_g$ related to the sample complexity of learning $g$ with small error with respect to a given loss function.

Definition 3. Fix a learning algorithm and a 1-Lipschitz loss function $L$. For a function $g$ over a distribution of inputs $D$, a given error scale $\epsilon$, and a confidence parameter $\delta$, let the sample complexity $n_{g,D}(\epsilon, \delta)$ be the smallest integer such that, when the algorithm is given $n_{g,D}(\epsilon, \delta)$ i.i.d. examples of $g$ on $D$, with probability greater than $1 - \delta$ it produces a trained model $\hat g$ with generalization error $\mathbb{E}_{x \sim D}[L(g(x), \hat g(x))]$ less than $\epsilon$. Fix a constant $C > 0$. We say $g$ is efficiently learned by the algorithm (w.r.t. $C$) if there exists a constant $M_g$ (depending on $g$) such that for all $\epsilon$, $\delta$, and distributions $D$ on the inputs of $g$, $n_{g,D}(\epsilon, \delta) \le C\,[M_g + \log(\delta^{-1})]/\epsilon^2$.

For example, it is known (Talagrand (1994)) that there exists a suitable choice of $C$ such that empirical risk minimization for a class of functions efficiently learns those functions with $M_g$ at most the VC dimension of that class. Previous work focused on computing $M_g$, for functions defined on the unit sphere, for wide neural networks trained with SGD. We extend the bounds derived in Arora et al. (2019a) to analytic functions, and show that they apply to kernel learning methods as well as neural networks. The analysis in Arora et al. (2019a) focused on training the hidden layers of wide networks with SGD. We first show that these bounds are more general and in particular apply to the case where only the final-layer weights are trained (corresponding to the NNGP kernel in Lee et al. (2019)); therefore our results apply to general kernel learning as well. The proof strategy consists of showing that finite-width networks have a sensible infinite-width limit, and that training causes only a small change in the parameters of the network.
Let $m$ be the number of hidden units and $n$ the number of data points. Let $y$ be the $n \times 1$ vector of training outputs, and let $h$ be the $n \times m$ random matrix of hidden-layer activations (as a function of the weights of the lower layer) for all $n$ data points. We will first show the following.

Theorem 6. For sufficiently large $m$, a function $g$ can be learned efficiently in the sense of Definition 3 by training the final-layer weights only with SGD, with the constant $M_g$ satisfying
$M_g \le y^T (H^\infty)^{-1} y$ (4)
where $H^\infty = \mathbb{E}[h h^T]$ is the NNGP kernel from Lee et al. (2019).

We require some technical lemmas in order to prove the theorem. We first need to show that $H^\infty$ is, with high probability, invertible. If $K(x, x')$, the kernel function that generates $H^\infty$, is given by an infinite Taylor series in $x \cdot x'$, it can be argued that $H^\infty$ has full rank for most real-world distributions. For example, for the ReLU activation this holds as long as no two data points are co-linear (see Definition 5.1 in Arora et al. (2019a)). We can prove this more explicitly in the following lemma.

Lemma 2. If all $n$ data points $x$ are distinct and the Taylor series of $K(x, x')$ in $x \cdot x'$ has positive coefficients everywhere, then $H^\infty$ is not singular.

Proof. First consider the case where the input $x$ is a scalar. Since the Taylor series of $K(x, x')$ contains monomials of all degrees of $x x'$, we can view it as an inner product in the kernel space induced by the feature map $\phi(x) = (1, x, x^2, \ldots)$, where the inner product is diagonal (but with potentially different weights) in this basis. For any distinct set of inputs $\{x_1, \ldots, x_n\}$, the vectors $\phi(x_i)$ are linearly independent.
The first $n$ columns produce the Vandermonde matrix obtained by stacking the rows $(1, x, x^2, \ldots, x^{n-1})$ for the $n$ different values of $x$, which is well known to be non-singular (a zero eigenvector would correspond to a degree-$(n-1)$ polynomial with $n$ distinct roots $\{x_1, \ldots, x_n\}$). This extends to the case of multidimensional $x$ if the values, projected along some direction, are distinct. In this case, the kernel space corresponds to the direct sum of copies of $\phi$ applied elementwise to each coordinate of $x$. If all the points are distinct, the probability that a given pair coincides under a random projection is negligible; by a union bound, the probability that any pair coincides is also bounded, so there must be directions along which the projections are distinct. Therefore, $H^\infty$ can be considered invertible in general.

As $m \to \infty$, $h h^T$ concentrates to its expected value. More precisely, $(h h^T)^{-1}$ approaches $(H^\infty)^{-1}$ for large $m$ if we assume that the smallest eigenvalue satisfies $\lambda_{\min}(H^\infty) \ge \lambda_0$, which by the above lemma holds for fixed $n$. (For the ReLU NTK the difference becomes negligible with high probability for $m = \mathrm{poly}(n/\lambda_0)$; Arora et al. (2019a).) This allows us to replace $h h^T$ with $H^\infty$ in any bounds involving the former. We can obtain learning bounds in terms of $h h^T$ by studying the upper-layer weights $w$ of the network after training. After training, we have $y = h w$. If $h h^T$ is invertible (which the above arguments show is true with high probability for large $m$), the following lemma holds.

Lemma 3. If we initialize a random lower layer and train the weights of the upper layer, then there exists a solution $w$ with squared norm $y^T (h h^T)^{-1} y$.

Proof. The minimum-norm solution to $y = h w$ is $w^* = (h^T h)^{-1} h^T y$ (with the inverse interpreted on the row space of $h$, since $h^T h$ has rank $n$). Its squared norm $(w^*)^T w^*$ is $y^T h (h^T h)^{-2} h^T y$. We claim that $h (h^T h)^{-2} h^T = (h h^T)^{-1}$. To show this, consider the SVD decomposition $h = U S V^T$.
Expanding, we have $h (h^T h)^{-2} h^T = U S V^T (V S^2 V^T)^{-2} V S U^T$. Evaluating the right-hand side gives $U S^{-2} U^T = (h h^T)^{-1}$. Therefore, the squared norm of the minimum-norm solution is $y^T (h h^T)^{-1} y$.

We can now complete the proof of Theorem 6.

Proof of Theorem 6. For large $m$, the squared norm of the weights approaches $y^T (H^\infty)^{-1} y$. Since the lower layer is fixed, the optimization problem is linear, and therefore convex, in the trained weights $w$; hence SGD with a small learning rate will reach this optimal solution. The Rademacher complexity of this function class is at most $\sqrt{y^T (H^\infty)^{-1} y}$, which we bound using $M_g$, an upper bound on $y^T (H^\infty)^{-1} y$. The optimal solution has zero training error (by the assumption that $H^\infty$ is full rank), and the generalization error is no more than $O\big(\sqrt{y^T (H^\infty)^{-1} y/(2n)}\big)$, which is at most $\epsilon$ if we use at least $n = \Omega(M_g/\epsilon^2)$ training samples. Note that this is identical to the previous results for training the hidden layer only (Arora et al. (2019a); Du et al. (2019)).
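Lemma 3 is easy to sanity-check numerically; the following is our sketch (not from the paper), using a random ReLU-like activation matrix with $m \gg n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 500                                     # n data points, m >> n hidden units

h = np.maximum(rng.standard_normal((n, m)), 0.0)   # random "activation" matrix h
y = rng.standard_normal(n)                         # training targets

# minimum-norm solution to y = h @ w via the pseudoinverse
w_star = np.linalg.pinv(h) @ y

# Lemma 3 claims ||w*||^2 = y^T (h h^T)^{-1} y
lhs = w_star @ w_star
rhs = y @ np.linalg.solve(h @ h.T, y)
print(lhs, rhs)
```

The two quantities agree to numerical precision, since $w^* = h^T (h h^T)^{-1} y$ and so $(w^*)^T w^* = y^T (h h^T)^{-1} y$.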

A.2 LEARNING ANALYTIC FUNCTIONS

Now we derive our generalization bounds for univariate functions. We use Theorem 6 to prove the following corollary, a more general version of Corollary 6.2 of Arora et al. (2019a), which was proven for wide ReLU networks with a trainable hidden layer only.

Corollary 3. Consider the function $g : \mathbb{R}^d \to \mathbb{R}$ given by
$g(x) = \sum_k a_k (\beta_k^T x)^k$. (8)
Then, if $g$ is restricted to $\|x\| = 1$, and the NTK or NNGP kernel can be written as $H(x, x') = \sum_k b_k (x \cdot x')^k$, the function can be learned efficiently with a wide one-hidden-layer network in the sense of Definition 3 with
$M_g = \sum_k b_k^{-1/2} |a_k| \, \|\beta_k\|_2^k$ (9)
up to $g$-independent constants of $O(1)$. In the particular case of a ReLU network, the bound is
$M_g = \sum_k k \, |a_k| \, \|\beta_k\|_2^k$. (10)

The original corollary applied only to networks with a trained hidden layer, and the bound for the ReLU network excluded odd monomials of power greater than 1.

Proof. The extension to the NNGP kernel follows from Theorem 6, which allows the application of the arguments used to prove Corollary 6.2 of Arora et al. (2019a) (particularly those found in their Appendix E). The extension of the ReLU bound to odd powers can be achieved with the following modification: append a constant component to the input $x$, so that the new input to the network is $(x/\sqrt{2}, 1/\sqrt{2})$. The kernel then becomes
$K(x, x') = \frac{x \cdot x' + 1}{4\pi}\left(\pi - \arccos\frac{x \cdot x' + 1}{2}\right)$. (11)
Rewriting the power series as an expansion around $x \cdot x' = 0$, we have terms of all powers. An asymptotic analysis of the coefficients using known results shows that the $b_k$ are asymptotically $O(k^{-3/2})$, meaning Equation 10 applies to these kernels without restriction to even $k$.

Equation 9 suggests that kernels with slowly decaying (but still convergent) $b_k$ will give the best bounds for learning polynomials. Many popular kernels do not meet this criterion.
For example, for inputs on the sphere of radius $r$, the Gaussian kernel $K(x, x') = e^{-\|x - x'\|^2/2}$ can be written as $K(x, x') = e^{-r^2} e^{x \cdot x'}$. This has $b_k^{-1/2} = e^{r^2/2} \sqrt{k!}$, which increases rapidly with $k$. This provides theoretical justification for the empirically inferior performance of the Gaussian kernel, which we will present in Section B.2. Guided by this theory, we focus on kernels where $b_k^{-1/2} \le O(k)$ for all $k$ (equivalently, $b_k \ge \Omega(k^{-2})$). The modified ReLU meets this criterion, and hand-crafted kernels of the form
$K(x, x') = \sum_k k^{-s} (x \cdot x')^k$ (12)
with $s \in (1, 2]$ are also valid slowly decaying kernels on the sphere. We call these slowly decaying kernels. We note that by Lemma 3, the results of Corollary 3 apply to networks with output-layer training only, as well as to kernel learning (which can be implemented by training wide networks). Using the extension of Corollary 3 to odd powers, we first show that analytic functions with appropriately bounded norms can be learnt.

Theorem 7. Let $g(y)$ be a function analytic around 0, with radius of convergence $R_g$. Define the auxiliary function $\tilde g(y)$ by the power series
$\tilde g(y) = \sum_{k=0}^{\infty} |a_k| y^k$ (13)
where the $a_k$ are the power series coefficients of $g(y)$. Then the function $g(\beta \cdot x)$, for a fixed vector $\beta \in \mathbb{R}^d$ and $\|x\| = 1$, is efficiently learnable in the sense of Definition 3 using a model with a slowly decaying kernel $K$, with
$M_g = \beta \tilde g'(\beta) + \tilde g(0)$ (14)
provided the norm $\beta \equiv \|\beta\|_2$ is less than $R_g$.

Proof. We first note that the radius of convergence of the power series of $\tilde g(y)$ is also $R_g$, since $g(y)$ is analytic. Applying Equation 10, pulling out the 0th-order term, and factoring out $\beta$, we get
$M_g = |a_0| + \beta \sum_{k=1}^{\infty} k |a_k| \beta^{k-1} = \beta \tilde g'(\beta) + \tilde g(0)$ (15)
since $\beta < R_g$.

The tilde function is the notion of complexity that measures how many samples we need to learn a given function. Informally, the tilde function makes all coefficients in the Taylor series positive.
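The gap between the Gaussian and slowly decaying kernels above is easy to see numerically; this small comparison is ours, using $r = 1$ and $s = 2$ as example values:

```python
import math

r, s = 1.0, 2.0   # sphere radius and slow-decay exponent (example values)
for k in [1, 5, 10, 20]:
    gauss = math.exp(r**2 / 2) * math.sqrt(math.factorial(k))  # b_k^{-1/2} for the Gaussian kernel
    slow = k ** (s / 2)                                        # b_k^{-1/2} = k^{s/2} for K = sum_k k^{-s} (x.x')^k
    print(k, gauss, slow)
```

The Gaussian coefficient penalty grows like $\sqrt{k!}$, while the slowly decaying kernel's penalty stays polynomial in $k$, which is exactly why the latter gives useful bounds for high-degree polynomials.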
The sample complexity is governed by the value of the tilde function at 1 (in other words, the $\ell_1$ norm of the coefficients of the Taylor series). For a multivariate function $g(x)$, we define its tilde function $\tilde g(y)$ by substituting a univariate $y$ for each inner product term $\langle \alpha, x \rangle$. The above theorem can then be generalized to multivariate analytic functions.

Lemma 4. Given a collection of $p$ vectors $\beta_i \in \mathbb{R}^d$, the function $f(x) = \prod_{i=1}^{p} \beta_i \cdot x$ is efficiently learnable with
$M_f = p \prod_i \beta_i$ (16)
where $\beta_i \equiv \|\beta_i\|_2$.

Proof. The proof of Corollary 6.2 in Arora et al. (2019a) relied on the following statement: given positive semi-definite matrices $A$ and $B$ with $A \succeq B$, we have
$P_B A^{-1} P_B \preceq B^{+}$ (17)
where $+$ denotes the Moore–Penrose pseudoinverse and $P_B$ is the projection onto the range of $B$. We use this result, together with the Taylor expansion of the kernel and a particular decomposition of a multivariate monomial, as follows. Let the matrix $X$ be the training data, whose $\alpha$th column $x_\alpha$ is a unit vector in $\mathbb{R}^d$. Given $K \equiv X^T X$, the matrix of inner products, the Gram matrix $H^\infty$ of the kernel can be written as
$H^\infty = \sum_{k=0}^{\infty} b_k K^{\bullet k}$ (18)
where $\bullet$ denotes the Hadamard (elementwise) power. Consider the problem of learning the function $f(x) = \prod_{i=1}^{p} \beta_i \cdot x$. Note that we can write
$f(X) = (X^{\odot p})^T \left(\otimes_{i=1}^{p} \beta_i\right)$. (19)
Here $\otimes$ is the tensor product, which takes an $n_1$-dimensional vector $w$ and an $n_2$-dimensional vector $v$ and returns the $n_1 n_2$-dimensional vector
$w \otimes v = (w_1 v_1, w_1 v_2, \ldots, w_1 v_{n_2}, w_2 v_1, \ldots, w_{n_1} v_{n_2})^T$. (20)
The operator $\odot$ is the (column-wise) Khatri–Rao product, which takes an $n_1 \times n_3$ matrix $A = (a_1, \ldots, a_{n_3})$ and an $n_2 \times n_3$ matrix $B = (b_1, \ldots, b_{n_3})$ and returns the $n_1 n_2 \times n_3$ matrix
$A \odot B = (a_1 \otimes b_1, \ldots, a_{n_3} \otimes b_{n_3})$, (21)
and $X^{\odot p}$ denotes the $p$-fold Khatri–Rao product of $X$ with itself. For $p = 2$, this form of $f(X)$ can be verified explicitly:
$(X^{\odot 2})^T (\beta_1 \otimes \beta_2) = (x_1 \otimes x_1, \ldots, x_n \otimes x_n)^T (\beta_1 \otimes \beta_2)$. (22)
The $\alpha$th element of the matrix product is
$(x_\alpha \otimes x_\alpha) \cdot (\beta_1 \otimes \beta_2) = (\beta_1 \cdot x_\alpha)(\beta_2 \cdot x_\alpha)$ (23)
which is exactly $f(x_\alpha)$. The formula for $p > 2$ follows by finite induction. With this form of $f(X)$, we can follow the steps of the proof in Appendix E of Arora et al. (2019a), which was written for the case where the $\beta_i$ are identical:
$y^T (H^\infty)^{-1} y = (\otimes_{i=1}^{p} \beta_i)^T X^{\odot p} (H^\infty)^{-1} (X^{\odot p})^T (\otimes_{i=1}^{p} \beta_i)$. (24)
Using Equation 17, applied to $K^{\bullet p}$, we have
$y^T (H^\infty)^{-1} y \le b_p^{-1} (\otimes_{i=1}^{p} \beta_i)^T X^{\odot p} P_{K^{\bullet p}} (K^{\bullet p})^{+} P_{K^{\bullet p}} (X^{\odot p})^T (\otimes_{i=1}^{p} \beta_i)$. (25)
Since the columns of $X^{\odot p}$ are eigenvectors of $P_{K^{\bullet p}}$ with eigenvalue 1, and $X^{\odot p} (K^{\bullet p})^{+} (X^{\odot p})^T = P_{X^{\odot p}}$, we have
$y^T (H^\infty)^{-1} y \le b_p^{-1} (\otimes_{i=1}^{p} \beta_i)^T P_{X^{\odot p}} (\otimes_{i=1}^{p} \beta_i)$ (26)
$y^T (H^\infty)^{-1} y \le b_p^{-1} \prod_{i=1}^{p} \beta_i \cdot \beta_i$. (27)
For the slowly decaying kernels, $b_p \ge p^{-2}$, so $b_p^{-1/2} \le p$. Taking square roots, we have $\sqrt{y^T (H^\infty)^{-1} y} \le M_f$ for
$M_f = p \prod_i \beta_i$ (28)
where $\beta_i \equiv \|\beta_i\|_2$, as desired.

This leads to the following generalization of Theorem 7.

Theorem 8. Let $g(x)$ be a function with the multivariate power series representation
$g(x) = \sum_k \sum_{v \in V_k} a_v \prod_{i=1}^{k} (\beta_{v,i} \cdot x)$ (29)
where the elements of $V_k$ index the $k$th-order terms of the power series. Define $\tilde g(y) = \sum_k \tilde a_k y^k$ with coefficients
$\tilde a_k = \sum_{v \in V_k} |a_v| \prod_{i=1}^{k} \beta_{v,i}$. (30)
If the power series of $\tilde g(y)$ converges at $y = 1$, then with high probability $g(x)$ can be learned efficiently in the sense of Definition 3 with $M_g = \tilde g'(1) + \tilde g(0)$.

Proof. Follow the construction in Theorem 7, using Lemma 4 to bound the individual terms; then sum and evaluate the power series of $\tilde g'(1)$ to arrive at the bound.

Remark 1. Note that the $\tilde g$ function defined above for multivariate functions depends on the representation, i.e., on the choice of the vectors $\beta$. To be fully formal, $\tilde g(y)$ should be written $\tilde g_\beta(y)$; for clarity, we drop $\beta$ and leave it implicit in the $\tilde g$ notation.

Remark 2.
If $g(x)$ can be approximated by some function $g_{\mathrm{app}}$ such that $|g(x) - g_{\mathrm{app}}(x)| \le \epsilon$ for all $x$ in the unit ball, then Theorem 8 can be used to learn $g(x)$ within error $2\epsilon$ with sample complexity $O(M_{g_{\mathrm{app}}}/\epsilon^2)$. To verify Remark 2, note that we are doing regression on the upper layer of the neural network, where the lower layer is random. Based on $g_{\mathrm{app}}$, there exists a low-norm solution for the upper-layer regression weights which achieves error at most $\epsilon$. If we solve the regression under the appropriate norm ball, then we get training error at most $\epsilon$, and the generalization error will be at most $2\epsilon$ with $O(M_{g_{\mathrm{app}}}/\epsilon^2)$ samples.

We can also derive the equivalents of the product and chain rules for function composition.

Proof of Corollary 1. Consider the power series of $g(x) h(x)$, which exists and is convergent since each individual series exists and is convergent. Let the elements of $V_{j,g}$ and $V_{k,h}$ index the $j$th-order terms of $g$ and the $k$th-order terms of $h$, respectively. The individual terms in the series have the form
$a_v b_w \prod_{j'=1}^{j} (\beta_{v,j'} \cdot x) \prod_{k'=1}^{k} (\beta_{w,k'} \cdot x)$ for $v \in V_{j,g}$, $w \in V_{k,h}$,
with bound
$(j + k)\, |a_v|\, |b_w| \prod_{j'=1}^{j} \beta_{v,j'} \prod_{k'=1}^{k} \beta_{w,k'}$ for $v \in V_{j,g}$, $w \in V_{k,h}$,
for all terms with $j + k > 0$, and $\tilde g(0) \tilde h(0)$ for the term with $j = k = 0$. Distribute the $(j + k)$ product and focus first on the $j$ term. Summing over the $V_{k,h}$ for all $k$, we get
$\sum_k \sum_{w \in V_{k,h}} j\, |a_v|\, |b_w| \prod_{j'=1}^{j} \beta_{v,j'} \prod_{k'=1}^{k} \beta_{w,k'} = j\, |a_v| \prod_{j'=1}^{j} \beta_{v,j'} \, \tilde h(1)$. (31)
Now summing over $j$ and $V_{j,g}$, we get $\tilde g'(1) \tilde h(1)$. Doing the same for the $k$ term, after summing we get $\tilde g(1) \tilde h'(1)$. These bounds add, and we obtain the desired formula for $M_{gh}$, which, up to the additional $\tilde g(0) \tilde h(0)$ term, is the product rule applied to $\tilde g$ and $\tilde h$. One immediate application of this corollary is the product of many univariate analytic functions.
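The tensor-product decomposition used in the proof of Lemma 4 (Equation 23) can be checked numerically; the following is our sketch, verifying that $(X^{\odot p})^T (\otimes_i \beta_i)$ recovers $\prod_i \beta_i \cdot x_\alpha$ column by column:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, p = 4, 6, 3

X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)             # unit-norm data columns x_alpha
betas = rng.standard_normal((p, d))        # the p vectors beta_i

# p-fold column-wise Khatri-Rao power: column alpha becomes x_a ⊗ x_a ⊗ ... (p copies)
KR = X.copy()
for _ in range(p - 1):
    KR = np.einsum('ia,ja->ija', KR, X).reshape(-1, n)

# tensor product of the beta vectors
tb = betas[0]
for b in betas[1:]:
    tb = np.kron(tb, b)

lhs = KR.T @ tb                            # (X^{⊙p})^T (⊗_i beta_i)
rhs = np.prod(betas @ X, axis=0)           # prod_i (beta_i · x_alpha), per column
print(np.max(np.abs(lhs - rhs)))           # agreement to numerical precision
```

This is exactly the $p > 2$ induction step of the proof played out numerically: each additional Khatri–Rao factor multiplies in one more $(\beta_i \cdot x_\alpha)$.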
If we define $G(x) = \prod_i g_i(\beta_i \cdot x)$, where each of the corresponding $\tilde g_i(y)$ has the appropriate convergence properties, then $G$ is efficiently learnable with bound $M_G$ given by
$M_G = \frac{d}{dy} \prod_i \tilde g_i(\beta_i y)\Big|_{y=1} + \prod_i \tilde g_i(0)$. (32)

Proof of Corollary 2. Writing out $g(h(x))$ as a power series in $h(x)$, we have
$g(h(x)) = \sum_{k=0}^{\infty} a_k (h(x))^k$. (33)
We can bound each term individually, and use the $k$-wise product rule to bound each term of $(h(x))^k$. Doing this, we have
$M_{g \circ h} = \sum_{k=1}^{\infty} k\, |a_k|\, \tilde h'(1)\, \tilde h(1)^{k-1} + \sum_{k=0}^{\infty} |a_k|\, \tilde h(0)^k$. (34)
Factoring out $\tilde h'(1)$ from the first term and then evaluating each of the series gives the desired result.

The following corollary considers the case where the function $g(x)$ is low-degree, and directly follows from Theorem 8.

Fact 1. The following facts about the tilde function will be useful in our analysis.
1. Given a multivariate analytic function $g(x)$ of degree $p$, for $x$ in the $d$-dimensional unit ball, there is a function $\tilde g(y)$ as defined in Theorem 8 such that $g(x)$ is learnable to error $\epsilon$ with $O(p \tilde g(1)/\epsilon^2)$ samples.
2. The tilde of a sum of two functions is at most the sum of the tildes of the functions: if $f = g + h$, then $\tilde f(y) \le \tilde g(y) + \tilde h(y)$ for $y \ge 0$.
3. The tilde of a product of two functions is at most the product of the tildes of the functions: if $f = g \cdot h$, then $\tilde f(y) \le \tilde g(y) \tilde h(y)$ for $y \ge 0$.
4. If $g(x) = f(\alpha x)$, then $\tilde g(y) \le \tilde f(\alpha y)$ for $y \ge 0$.

5. If $g(x) = f(x + c)$ for some $\|c\| \le 1$, then $\tilde g(y) \le \tilde f(y + 1)$ for $y \ge 0$. Combining this with the previous fact, if $g(x) = f(\alpha(x - c))$ for some $\|c\| \le 1$, then $\tilde g(1) \le \tilde f(2\alpha)$.

To verify the last part, note that in the definition of $\tilde g$ we replace $\langle \beta, x \rangle$ by $y$. Therefore, we have an additional $\langle \beta, c \rangle$ term when we compute the tilde function for $g(x) = f(x + c)$. As $\|c\| \le 1$, the additional term is at most 1.

The following lemma shows how we can approximate the indicator $1(x > \alpha)$ with a low-degree polynomial when $x$ is at least $\gamma/2$ away from $\alpha$. We will use this primitive several times to construct low-degree analytic approximations of indicator functions. The result is based on the following simple fact.

Fact 2. If the Taylor series of $g(x)$ is exponentially decreasing, then we can truncate it at degree $O(\log(1/\epsilon))$ while incurring error $\epsilon$. We will use this fact to construct low-degree approximations of functions.

Lemma 5. Given a scalar $x$, let
$\Phi(x, \gamma, \epsilon, \alpha) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{(x - \alpha)\, c \sqrt{\log(1/\epsilon)}}{\gamma}\right)\right)$
for some constant $c$. Let $\Phi'(x, \gamma, \epsilon, \alpha)$ be the function $\Phi(x, \gamma, \epsilon, \alpha)$ with its Taylor series truncated at degree $O(\log(1/\epsilon)/\gamma^2)$. Then for $|\alpha| < 1$,
$\Phi'(x, \gamma, \epsilon, \alpha) \le \epsilon$ for $x \le \alpha - \gamma/2$, and $\Phi'(x, \gamma, \epsilon, \alpha) \ge 1 - \epsilon$ for $x \ge \alpha + \gamma/2$.
Also, $M_{\Phi'}$ is at most $e^{O(\log(1/\epsilon)/\gamma^2)}$.

Proof. Note that $\Phi(x, \gamma, \epsilon, \alpha)$ is the cumulative distribution function (cdf) of a normal distribution with mean $\alpha$ and standard deviation $O(\gamma/\sqrt{\log(1/\epsilon)})$. At most $\epsilon/100$ of the probability mass of a Gaussian distribution lies more than $O(\sqrt{\log(1/\epsilon)})$ standard deviations away from the mean. Therefore,
$\Phi(x, \gamma, \epsilon, \alpha) \le \epsilon/100$ for $x \le \alpha - \gamma/2$, and $\Phi(x, \gamma, \epsilon, \alpha) \ge 1 - \epsilon/100$ for $x \ge \alpha + \gamma/2$.
Note that
$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt = \frac{2}{\sqrt{\pi}} \sum_{i=0}^{\infty} \frac{(-1)^i x^{2i+1}}{i!\,(2i+1)}$.
Therefore, the coefficients in the Taylor series expansion of $\mathrm{erf}((x - \alpha) c \sqrt{\log(1/\epsilon)}/\gamma)$ in terms of $(x - \alpha)$ become smaller than $\epsilon$ for $i > O(\log(1/\epsilon)/\gamma^2)$ and decrease geometrically thereafter. Therefore, we can truncate the Taylor series at degree $O(\log(1/\epsilon)/\gamma^2)$ and still have an $O(\epsilon)$ approximation.
Note that for $f(x) = \mathrm{erf}(x)$, $\tilde f(y) \le \frac{2}{\sqrt{\pi}} \int_0^y e^{t^2}\, dt \le \frac{2}{\sqrt{\pi}}\, y e^{y^2} \le e^{O(y^2)}$. After shifting by $\alpha$ and scaling by $O(\sqrt{\log(1/\epsilon)}/\gamma)$, we get $\tilde \Phi'(y) = e^{O((y + \alpha)^2 \log(1/\epsilon)/\gamma^2)}$. For $y = 1$, this is at most $e^{O(\log(1/\epsilon)/\gamma^2)}$. The result now follows by Fact 1.
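Lemma 5 can be illustrated numerically. The following is our sketch, with explicit choices ($c = 2$ for the unspecified constant, a fixed truncation degree of 50 standing in for $O(\log(1/\epsilon)/\gamma^2)$) that the lemma leaves abstract:

```python
import math

def phi_truncated(x, gamma, eps, alpha, c=2.0, deg=50):
    """Truncated Taylor series of Phi(x, gamma, eps, alpha) from Lemma 5.

    Phi is the soft indicator (1/2)(1 + erf((x - alpha) * c * sqrt(log(1/eps)) / gamma)),
    expanded via erf(z) = (2/sqrt(pi)) * sum_i (-1)^i z^(2i+1) / (i! (2i+1)).
    deg plays the role of the O(log(1/eps)/gamma^2) truncation degree.
    """
    z = (x - alpha) * c * math.sqrt(math.log(1 / eps)) / gamma
    s = sum((-1) ** i * z ** (2 * i + 1) / (math.factorial(i) * (2 * i + 1))
            for i in range(deg))
    return 0.5 * (1 + 2 / math.sqrt(math.pi) * s)

gamma, eps, alpha = 0.5, 0.01, 0.2
below = phi_truncated(alpha - gamma / 2 - 0.05, gamma, eps, alpha)  # left of the margin
above = phi_truncated(alpha + gamma / 2 + 0.05, gamma, eps, alpha)  # right of the margin
print(below, above)
```

Outside the margin of width $\gamma$, the truncated polynomial is within $\epsilon$ of the hard indicator $1(x > \alpha)$, which is the property the later theorems rely on.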

A.3 LEARNABILITY OF CLUSTER BASED DECISION NODE

In the informal version of the result for learning cluster-based decisions, we assumed that the task code $c$ is prefixed to the input datapoint, which we refer to as $x_{\mathrm{inp}}$. For the formal version of the theorem, we use a small variation: the task code and input $(c, x_{\mathrm{inp}})$ are mapped to $x = c + x_{\mathrm{inp}} \cdot (r/3)$ for some constant $r < 1/6$. Since $x_{\mathrm{inp}}$ lies on the unit sphere, $x$ is at distance at most $r/3$ from the center it is mapped to. Note that the overall function $f$ can be written as
$f(x) = \sum_{j=1}^{k} 1\big(\|x - c_j\|^2 \le (r/2)^2\big)\, f_j\big((x - c_j)/(r/3)\big)$
where $f_j$ is the function corresponding to the center $c_j$. The main idea is to show that the indicator function can be expressed as an analytic function.

Theorem 9 (formal version of Theorem 2). Assume that $d \ge 10 \log k$ (otherwise we can pad with extra coordinates to increase the dimensionality). Then we can find $k$ centers in the unit ball which are at least $r$ apart, for some constant $r$. Let
$f(x) = \sum_{j=1}^{k} 1\big(\|x - c_j\|^2 \le (r/2)^2\big)\, f_j\big((x - c_j)/(r/3)\big)$
where $f_j$ is the function corresponding to the center $c_j$. Then, if each $f_j$ is a degree-$p$ polynomial, the $M_f$ of the function $f$ satisfies $M_f \le p \cdot \mathrm{poly}(k/\epsilon) \sum_j \tilde f_j(6/r) \le p \cdot \mathrm{poly}(k/\epsilon) (6/r)^p \sum_j \tilde f_j(1)$.

Proof. Let
$f_{\mathrm{app}}(x) = \sum_{j=1}^{k} \Phi'\big(\|x - c_j\|^2, (r/2)^2, \epsilon/k, (r/4)^2\big)\, f_j\big((x - c_j)/(r/3)\big)$
where $\Phi'$ is defined in Lemma 5. Let $I_j(x) = \Phi'(\|x - c_j\|^2, (r/2)^2, \epsilon/k, (r/4)^2)$. The indicator $I_j(x)$ checks whether $\|x - c_j\|$ is a constant fraction less than $r/2$ or a constant fraction more than $r/2$. Note that if $x$ is from a different cluster, then $\|x - c_j\|$ is at least some constant, and hence $I_j(x)$ is at most $\epsilon/k$; the contribution from $k$ such clusters is at most $\epsilon$. If $\|x - c_j\| \le r/3$, then the indicator is at least $1 - O(\epsilon/k)$. Hence, as $f_{\mathrm{app}}$ is an $O(\epsilon)$-approximation to $f$, by Remark 2 it suffices to show learnability of $f_{\mathrm{app}}$. If $y = \langle x, c_j \rangle$, and assuming $x$ and the centers $c_j$ all lie on the unit sphere, $\tilde I_j(y) = \tilde \Phi'(2 + 2y, r/3, \epsilon/k, r/3) \le e^{O(\log(k/\epsilon))} = \mathrm{poly}(k/\epsilon)$. By Fact 1, $\tilde f(y) \le \mathrm{poly}(k/\epsilon) \sum_j \tilde f_j(6/r)$.
As the $f_j$ have degree at most $p$, $\tilde f(y) \le \mathrm{poly}(k/\epsilon) \sum_j \tilde f_j(6/r) \le p \cdot \mathrm{poly}(k/\epsilon)(6/r)^p \sum_j \tilde f_j(1)$.

Corollary 4. The previous theorem implies that we can also learn $f$ where $f$ is a lookup table, with $M_f = \mathrm{poly}(k/\epsilon)$, as long as the keys $c_i$ are well separated. Note that as long as the keys $c_i$ are distinct (for example, names), we can hash them to random vectors on a sphere so that they are all well separated.

Note that the indicator function for the informal version of Theorem 9 stated in the main body is the same as that for the lookup table in Corollary 4. Therefore, the informal version of Theorem 9 follows as a corollary of Theorem 9.
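The hashing step in Corollary 4 is easy to sanity-check numerically; this sketch (ours, with a SHA-256-seeded generator as one arbitrary way to hash keys) shows that distinct string keys mapped to pseudorandom unit vectors in $d = O(\log k)$ dimensions are pairwise well separated:

```python
import hashlib
import numpy as np

def key_to_sphere(key, d, seed=0):
    """Hash an arbitrary string key to a pseudorandom unit vector in R^d."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

k = 100
d = 10 * int(np.ceil(np.log(k)))          # d >= 10 log k, as in Theorem 9
centers = np.stack([key_to_sphere(f"task-{i}", d) for i in range(k)])

# minimum pairwise distance between the k hashed centers
dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
print(dists.min())
```

For random unit vectors in this dimension, pairwise inner products concentrate near 0, so the minimum pairwise distance concentrates near $\sqrt{2}$, giving the constant separation $r$ the theorem needs.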

A.4 LEARNABILITY OF FUNCTIONS DEFINED ON LEAVES OF A DECISION TREE

We consider decision trees on inputs drawn from $\{-1, 1\}^d$, and show that such a decision tree $g$ can be learnt with $M_g \le O(d^h)$. From this section onwards, we view the combined input $(c, x)$ as $x$. The decision tree $g$ can be written as $g(x) = \sum_j I_j(x) v_j$, where the summation runs over all leaves, $I_j(x)$ is the indicator function of leaf $j$, and $v_j \in [-1, 1]$ is the constant value at leaf $j$. We scale the inputs by $\sqrt{d}$ to make them lie on the unit sphere, so each coordinate of $x$ is $\pm 1/\sqrt{d}$. Let the total number of leaves in the decision tree be $B$. The indicator function of the $j$th leaf can be written as a product over all internal decision nodes on its path. Let $j_l$ be the variable at the $l$th decision node on the path to the $j$th leaf. We can write
$I_j(x) = \prod_l \big(a_{j_l} x_{j_l} + b_{j_l}\big)$,
where each $x_{j_l} \in \{-1/\sqrt{d}, 1/\sqrt{d}\}$, $a_{j_l} \in \{-\sqrt{d}/2, \sqrt{d}/2\}$, and $b_{j_l} \in \{-1/2, 1/2\}$. The values of $a_{j_l}$ and $b_{j_l}$ are chosen depending on whether the path to the $j$th leaf takes the left or the right child at the $l$th decision variable. For ease of exposition, the following theorem is stated for the case where the leaf functions are constant; the case where the leaves hold analytic functions follows in the same way.

Theorem 10. If a function is given by $g(x) = \sum_{j=1}^{B} I_j(x) v_j$, where $I_j(x)$ is a leaf indicator function in the above form, with tree depth $h$, then $M_g$ is at most $O(d^h)$.

Proof. Note that $\tilde g(y) \le \sum_j \tilde I_j(y) |v_j| \le \sum_j \prod_l \big(\sqrt{d}\, y/2 + 1/2\big)$, which implies $\tilde g(1) \le 2^h (\sqrt{d}/2 + 1/2)^h \le d^h$. As the degree of $g$ is at most $h$, we have $M_g \le h \tilde g(1) \le h d^h$.

Remark 3. Note that by Theorem 10 we need $O\big((\log k)^{\log k} \epsilon^{-2}\big)$ samples to learn a lookup table based on a decision tree. On the other hand, by Corollary 4 we need $\mathrm{poly}(k/\epsilon)$ samples to learn a lookup table using cluster-based decision nodes.
This shows that using a hash function to obtain a random $O(\log k)$-bit encoding of the indices for the $k$ lookups is more efficient than using a fixed length-$\log k$ encoding. We also prove a corresponding lower bound, Theorem 14, which shows that $d^{\Omega(h)}$ samples are necessary to learn decision trees of depth $h$.

We now consider decision trees in which the branching is based on the inner product of $x$ with some direction $\beta_{j,l}$. Assuming a constant gap for each decision split, the decision tree indicator function can be written as
$I_j(x) = \prod_l 1\big(\langle x, \beta_{j,l} \rangle > \alpha_{j,l}\big)$.

Theorem 11 (formal version of Theorem 4). A decision tree of depth $h$, where every node partitions along some direction with margin $\gamma$, can be written as $g(x) = \sum_{j=1}^{B} I_j(x) f_j(x)$; then
$M_g = e^{O(h \log(1/\epsilon)/\gamma^2)}\big(p + h \log(1/\epsilon)/\gamma^2\big) \sum_j \tilde f_j(1)$,
where $p$ is the maximum degree of the $f_j$.

Proof. Define
$g_{\mathrm{app}}(x) = \sum_{j=1}^{B} \prod_l \Phi'\big(\langle x, \beta_{j,l} \rangle, \gamma, \epsilon/h, \alpha_{j,l}\big)\, f_j(x)$
where $\Phi'$ is as defined in Lemma 5. Note that for $y = 1$, $\tilde \Phi'(1, \gamma, \epsilon/h, \alpha_{j,l}) \le e^{O(\log(1/\epsilon)/\gamma^2)}$. Therefore,
$\tilde g_{\mathrm{app}}(1) \le \sum_{j=1}^{B} \prod_l \tilde \Phi'(1, \gamma, \epsilon/h, \alpha_{j,l})\, \tilde f_j(1) \le e^{O(h \log(1/\epsilon)/\gamma^2)} \sum_j \tilde f_j(1)$.
Note that the degree of $g_{\mathrm{app}}$ is at most $O(p + h \log(1/\epsilon)/\gamma^2)$. Therefore,
$M_{g_{\mathrm{app}}} \le e^{O(h \log(1/\epsilon)/\gamma^2)}\big(p + h \log(1/\epsilon)/\gamma^2\big) \sum_j \tilde f_j(1)$.
By Remark 2, learnability of $g$ follows from the learnability of its analytic approximation $g_{\mathrm{app}}$.
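The product form of the leaf indicator used in Theorem 10 can be checked directly. This is our sketch; it uses the sign choice $b_{j_l} = 1/2$ with $a_{j_l} = \pm\sqrt{d}/2$ (one valid assignment among those the text allows):

```python
import itertools
import numpy as np

d = 4
sqrt_d = np.sqrt(d)

def leaf_indicator(x, path_vars, path_dirs):
    """Product-form leaf indicator: factor l is a*x + b with a = ±sqrt(d)/2, b = 1/2.

    path_dirs[l] = +1 means the path requires x[path_vars[l]] = +1/sqrt(d);
    each factor is 1 when the coordinate matches the branch and 0 otherwise.
    """
    out = 1.0
    for var, direction in zip(path_vars, path_dirs):
        out *= (direction * sqrt_d / 2) * x[var] + 0.5
    return out

path_vars = [0, 1, 2]          # decision variables along one root-to-leaf path
path_dirs = [+1, -1, +1]       # required sign of each variable on that path

for bits in itertools.product([-1, 1], repeat=d):
    x = np.array(bits) / sqrt_d                     # hypercube point scaled to the sphere
    on_path = all(bits[v] == s for v, s in zip(path_vars, path_dirs))
    assert leaf_indicator(x, path_vars, path_dirs) == (1.0 if on_path else 0.0)
print("indicator matches on all", 2 ** d, "points")
```

Each factor evaluates to exactly 0 or 1 on the scaled hypercube, so the depth-$h$ product is the hard indicator of the leaf, while having degree only $h$ as a polynomial.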

A.5 GENERALIZED DECISION PROGRAM

In this section, instead of a decision tree, we consider a circuit with fan-out 1, where each gate (node) evaluates some function of the values returned by its children and the input $x$. A decision tree is a special case of such a circuit in which all gates are switches. So far the function outputs were univariate, but we now generalize to multivariate (vector) outputs as well. The functions can therefore evaluate and return data structures, represented by vectors. We assume that each output is at most $d$-dimensional and lies in the unit ball.

Definition 4. For a function $f$ with multivariate output, we define $\tilde f(y)$ as the sum of $\tilde f_i(y)$ over the output coordinates $f_i$.

Remark 4. Theorems 9, 10, and 11 extend to the multivariate-output case. Note that if each of the individual functions has degree at most $p$, then the sample complexity for learning the multivariate-output $f$ is at most $O(p \tilde f(1)/\epsilon^2)$ (where the multivariate tilde function is defined in Definition 4).

We now define a generalized decision program and the class of functions that we support.

Definition 5. We define a generalized decision program to be a circuit with fan-out 1 (i.e., a tree topology) where each gate evaluates a function of the values returned by its children and the input $x$, and the root gate evaluates the final output. All gates, including those at the leaves, have access to the input $x$. We support the following gate operations. Let $h$ be the output of a gate, let each gate have at most $k$ children, and let $\{f_1, \ldots, f_k\}$ be the outputs of its children.
1. Any analytic function of the child gates of degree at most $p$, including the sum $h = \sum_{i=1}^{k} f_i$ and the product of $p$ terms $h = \prod_{i=1}^{p} f_i$.
2. Margin-based switch (decision) gate with children $\{f_1, f_2\}$, some constant margin $\gamma$, vector $\beta$, and constant $\alpha$: $h = f_1$ if $\langle \beta, x \rangle - \alpha \le -\gamma/2$, and $h = f_2$ if $\langle \beta, x \rangle - \alpha \ge \gamma/2$.
3. Cluster-based switch gate with $k$ centers $\{c^{(1)}, \ldots$
, c^{(k)}\}$, with separation $r$ for some constant $r$; the output is $f_i$ if $\|x - c^{(i)}\| \le r/3$.
4. Lookup table with $k$ key–value pairs.
5. SQL-style aggregation over a table with entries $\{r_1, \ldots, r_k\}$: $h(x) = \sum_i f(r_i)\, p(r_i, x)$. Here, we assume that $f$ has bounded value and that $p$ can be approximated by an analytic function of degree at most $p$.
6. Compositions of functions, $h(x) = f(g(x))$.

First, we note that all of the above operators can be approximated by low-degree polynomials.

Claim 1. If $p \le O(\log(k/\epsilon))$, each of the above operators in the generalized decision program can be expressed as a polynomial of degree at most $O(\log(k/\epsilon))$, where $k$ is the maximum out-degree of any node.

Remark 5. Note that for the SQL query we can also approximate aggregation operators other than SUM, such as MAX or MIN. For example, to approximate the MAX of $x_1, \ldots, x_k$ up to $\epsilon$, where the inputs lie in $[0, 1]$, we can first write it as $\mathrm{MAX}(x_1, \ldots, x_k) = \sum_j 1\big(\sum_i 1(x_i > j) > 1/2\big)$, where $j$ ranges over a grid of values in $[0, 1]$ (each term weighted by the grid spacing), and then approximate the indicators by analytic functions.

Lemma 6 shows how to compute the tilde function of a generalized decision program.

Lemma 6. The tilde function of a generalized decision program can be computed recursively with the following steps:
1. For a sum gate $h = f + g$: $\tilde h(y) = \tilde f(y) + \tilde g(y)$.
2. For a product gate $h = f \cdot g$: $\tilde h(y) = \tilde f(y) \cdot \tilde g(y)$.
3. For a margin-based decision gate (switch) with children $f$ and $g$: $h = I_{\mathrm{left}} f + (1 - I_{\mathrm{left}}) g$ and $\tilde h(y) = \tilde I_{\mathrm{left}}(y)\big(\tilde f(y) + \tilde g(y)\big) + \tilde g(y)$, where $I_{\mathrm{left}}$ is the indicator that the left child is chosen.
4. For a cluster-based decision gate (switch) with children $\{f_1, \ldots, f_k\}$: $\tilde h(y) \le \sum_i \tilde I_i(y)\, \tilde f_i(6y/r)$, where $I_i$ is the indicator of the cluster corresponding to the $i$th child.
5. For a lookup table with $k$ key–value pairs: $\tilde h(y) \le k \tilde I(y)$, as long as the $\ell_1$ norm of each value is at most 1.
6. Creating a data structure out of separate fields can be done by concatenation, and the $\tilde h$ of the result is at most the sum of the original tilde functions. Extracting a field from a data structure can be done in the same way.
7. For a SQL aggregation gate $h(x) = \sum_i f(r_i)\, p(r_i, x)$: $\tilde h(y) \le \sum_i \tilde I_{p, r_i}(y)$, where $I_{p, r_i}$ is the indicator for $p(r_i, x)$. For example, $x$ here can denote some threshold value to be applied to a column of the table, or a selection of some subset of entries (in Fig. 1, $x$ is the zip code).
8. For $h(x) = f(g(x))$: $\tilde h(y) \le \tilde f(\tilde g(y))$.

All parts of the above lemma except the last follow directly from the results in the previous subsection. Below, we prove the result for the last part, regarding function composition.

Lemma 7. Assume that all functions have input and output dimension at most $d$. If $f$ and $g$ are two functions of degree at most $p_1$ and $p_2$ respectively, then $h(x) = f(g(x))$ has degree at most $p_1 p_2$ and $\tilde h(y) \le \tilde f(\tilde g(y))$.

Proof. The claim is immediate when $f$ and $g$ have scalar inputs and outputs. Let $g(x) = (g_1(x), \ldots, g_d(x))$. Begin with the case $f = \langle \beta, x \rangle$ with $\|\beta\| = 1$. Then $\tilde h(y) = \sum_i |\beta_i|\, \tilde g_i(y) \le \sum_i \tilde g_i(y) \le \tilde g(y)$. When $f = \prod_{i=1}^{p_1} \langle \beta_i, x \rangle$, $\tilde h(y) \le \tilde g(y)^{p_1} \le \tilde f(\tilde g(y))$. The same argument works when we take linear combinations, and also for a multivariate $f$ (as $\tilde f$ for a multivariate $f$ is, by definition, the sum of the individual $\tilde f_i$).

We now present our result for learning generalized decision programs.

Theorem 12. Let the in-degree of every gate be at most $k$. The sample complexity for learning the following classes of generalized decision programs is as follows:
1. If every gate is either a decision node with margin $\gamma$, a sum gate, or a lookup of size at most $k$, then $M_g \le e^{O(h \log(1/\epsilon)/\gamma^2)} k^{O(h)}$.
2. For some constant $C$, if there are at most $C$ product gates with degree at most $C$, and every other gate is a decision gate with margin $\gamma$ or a sum gate with constant functions at the leaves, then $M_g \le e^{O(h \log(1/\epsilon)/\gamma^2)}$.
3. Given a function $f$ and a Boolean function $p$ which can be approximated by a polynomial of degree at most $O(\log(k/\epsilon))$, for a SQL operator $g$ over a table $T$ with $k$ entries $\{r_1, \ldots, r_k\}$ representing SELECT SUM f(r_i), WHERE p(r_i, x), we have $M_g \le \sum_i \tilde I_{p, r_i}(1)$.

4.

Let the function at every gate be an analytic function $f$ of degree at most $p$, with the sum of the coefficients of $f$ upper bounded by $c^p$ for some constant $c$. Then note that $\tilde f(y) \le (cy)^p$ for $y \ge 1$. Therefore, the final function satisfies $\tilde g(y) \le (cky)^{p^h}$, and hence $M_g \le (ck)^{p^h}$.

Proof. The first three claims can be obtained using Lemma 6. For the final claim, consider the polynomial obtained by expanding the function at each gate bottom-up. We upper bound $\tilde g(y)$ for the overall function $g$ of the generalized decision program, starting with $\tilde f(y)$ for the leaf gates. For any internal gate $i$, let $g_i(x) = f_i(f_{j_1}(x), \ldots, f_{j_p}(x))$, where the $f_{j_t}$ are the outputs of the children of gate $i$. We recursively compute $\tilde g_i(y) = \tilde f_i\big(\sum_l \tilde f_{j_l}(y)\big)$. Therefore, for a gate with $k$ children, $\tilde g_i(y) \le \big(c \sum_l \tilde g_{j_l}(y)\big)^p$, and for the root gate $g_0$, $\tilde g_0(y) \le (cky)^{p^h}$.

Remark 6. Note that the dependence on $h$ is doubly exponential. We show a corresponding lower bound in Theorem 15, establishing that this is necessary.

Theorem 12 implies that we can learn programs such as the formal version of the example in Fig. 1. We can use the product and chain rules to show that many functions important in scientific applications are efficiently learnable, even when the function has a singularity. As an example demonstrating both, we prove the following bound on learning Newton's law of gravitation.

Theorem 13. Consider a system of $k$ bodies with positions $x_i \in \mathbb{R}^3$ and masses $m_i$, interacting via the force
$F_i = \sum_{j \ne i} \frac{m_i m_j}{r_{ij}^3}\, (x_j - x_i)$
where $r_{ij} \equiv \|x_i - x_j\|$. We assume that $R = r_{\max}/r_{\min}$, the ratio between the largest and smallest pairwise distance between any two bodies, is constant. Suppose the $m_i$ have been rescaled to lie between 0 and 1. Then the force law is efficiently learnable in the sense of Definition 3 using the modified ReLU kernel, to generalization error less than $\epsilon$, using $k^{O(\ln(k/\epsilon))}$ samples.

Proof.
We will prove learning bounds for each component of F separately, showing efficient learning with probability greater than 1 − δ/3k. Then, by a union bound over the 3k components, the probability of simultaneously learning all the components efficiently will be 1 − δ. There are two levels of approximation: first, we construct a function which is within ε/2 of the original force law but more learnable; second, we prove bounds on learning that function to within error ε/2.

We first rescale the collective vector of the {x_i} so that its length is at most 1. In these new units, r²_max ≤ 2/k. The first component of the force on x_1 can be written as (F_1)_1 = Σ_{j=2}^{k} (m_1 m_j / r²_{1j}) · ((x_j)_1 − (x_1)_1)/r_{1j}. If we find a bound M_f for an individual contribution f to the force, we can get a bound on the total via √M_F = (k − 1)√M_f.

Consider an individual force term in the sum. The force has a singularity at r_{1j} = 0. In addition, the function r_{1j} itself is non-analytic due to the branch cut at 0. We will instead approximate the force law with a finite power series in r²_{1j}, and prove bounds on learning that power series. The power series representation of (1 − x)^{−3/2} is Σ_{n=0}^{∞} ((2n+1)!!/(2n)!!) x^n. If we approximate the function with d terms, the error can be bounded using Taylor's theorem. The Lagrange form of the error gives us the bound

|(1 − x)^{−3/2} − Σ_{n=0}^{d} ((2n+1)!!/(2n)!!) x^n| ≤ √(πd) |x|^{d+1} / (1 − |x|)^{5/2+d},   (40)

where we use (2n+1)!!/(2n)!! ≈ √(πn) for large n. We can use the above expansion by rewriting

r⁻³_{1j} = a⁻³ (1 − (1 − r²_{1j}/a²))^{−3/2}   (41)

for some shift a. Approximating with f_d(r²_{1j}), the first d terms of the power series in (1 − r²_{1j}/a²), gives us the error

|f_d(r²_{1j}) − r⁻³_{1j}| ≤ √(πd) |1 − r²_{1j}/a²|^{d+1} / (a³ (1 − |1 − r²_{1j}/a²|)^{5/2+d}),   (42)

which we want to be small over the range r_min ≤ r_{1j} ≤ r_max. The bound is optimized when it takes the same value at r_min and r_max, so we set a² = (r²_min + r²_max)/2.
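As a sanity check, this shifted power-series approximation can be sketched numerically; the values of r_min, r_max, and d below are illustrative choices, not values from the paper:

```python
def approx_inv_r_cubed(r, r_min, r_max, d):
    """Approximate r**-3 by the first d+1 terms of the shifted power
    series r**-3 = a**-3 * (1 - (1 - r**2/a**2))**(-3/2), using the
    shift a**2 = (r_min**2 + r_max**2)/2 chosen in the proof."""
    a2 = (r_min**2 + r_max**2) / 2.0
    u = 1.0 - r**2 / a2
    coeff, total = 1.0, 0.0
    for n in range(d + 1):
        total += coeff * u**n
        coeff *= (2*n + 3) / (2*n + 2)   # ratio of successive (2n+1)!!/(2n)!! terms
    return total / a2**1.5               # multiply by a**-3

# Worst-case error over [r_min, r_max] shrinks rapidly with d (here R = 2).
err = max(abs(approx_inv_r_cubed(r, 0.5, 1.0, 40) - r**-3)
          for r in [0.5, 0.6, 0.75, 0.9, 1.0])
```

With d = 40 terms the approximation already agrees with r⁻³ to several decimal places over the whole range, consistent with the exponential decay in (42).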
In the limit that r_max ≫ r_min, where learning is most difficult, the bound becomes

|f_d(r²_{1j}) − r⁻³_{1j}| ≤ (√(8πd)/r³_max) (R²/2)^{5/2+d} e^{−2(d+1)/R²}.   (43)

As predicted by the theory, none of the models learn well if R is not fixed. We randomly drew the masses corresponding to the k + 1 bodies from [0, 10]. We generated 5 million such examples, each with 4(k + 1) features corresponding to the location and mass of each of the bodies, and a single label corresponding to the gravitational force F on the target body along the x-axis. We held out 10% of the dataset as test data to compute the root mean square error (RMSE) in prediction. We trained three different neural networks on this data, corresponding to the various kernels analyzed in the previous section:

1. A wide one-hidden-layer ReLU network (corresponding to the ReLU NTK kernel).
2. A wide one-hidden-layer ReLU network with a constant bias feature added to the input (corresponding to the NTK kernel).
3. A wide one-hidden-layer network with an exponential activation function, where only the top layer of the network is trained (corresponding to the Gaussian kernel).

We used a hidden-layer width of 1000 for all the networks, as we observed that increasing the width further did not improve results significantly. All the hidden-layer weights were initialized randomly. As shown in Figure 5, all three networks are able to learn the gravitational force equation with small normalized RMSE (normalized by the range F_max − F_min of the forces) for hundreds of bodies. Both the ReLU network and the ReLU network with bias outperform the network corresponding to the Gaussian kernel (in terms of RMSE) as k increases. In particular, the Gaussian kernel's performance degrades quickly at around 400 bodies, with a normalized RMSE exceeding 50%.
This is consistent with the learning bounds for these kernels in Section A.2, and suggests that those bounds may be useful for comparing the performance of different networks in practice. We did not, however, observe much difference in the performance of the ReLU network when adding a bias to the input, which suggests that the inability to obtain an analytical bound, due to the ReLU NTK kernel containing only even powers, may be a shortcoming of the proof technique rather than a property that fundamentally limits the model.
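The three architectures can be sketched as follows. This is a minimal numpy sketch at initialization only; the widths, scalings, and initialization scheme are our own illustrative choices, and no training loop is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 16, 1000   # input dim (4*(k+1) in the experiment), hidden width

# 1. One-hidden-layer ReLU network (ReLU NTK kernel regime).
W1 = rng.normal(size=(width, d)) / np.sqrt(d)
v1 = rng.normal(size=width) / np.sqrt(width)
relu_net = lambda x: v1 @ np.maximum(W1 @ x, 0.0)

# 2. The same network with a constant bias feature appended to the
#    input (NTK kernel with bias).
W2 = rng.normal(size=(width, d + 1)) / np.sqrt(d + 1)
v2 = rng.normal(size=width) / np.sqrt(width)
bias_net = lambda x: v2 @ np.maximum(W2 @ np.append(x, 1.0), 0.0)

# 3. Exponential activation with frozen hidden weights; only the top
#    layer would be trained (Gaussian-kernel regime).  Inputs are scaled
#    down so exp() stays in a numerically stable range.
W3 = rng.normal(size=(width, d)) / np.sqrt(d)
v3 = rng.normal(size=width) / np.sqrt(width)
exp_net = lambda x: v3 @ np.exp(W3 @ x / np.sqrt(d))

x = rng.normal(size=d)
preds = [relu_net(x), bias_net(x), exp_net(x)]
```

In the third model only `v3` would receive gradient updates, matching the "train only the top layer" setup described above.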

C LOWER BOUNDS

First, we show that an exponential dependence on the depth h is necessary for learning decision trees. The result depends on the hardness of solving parity with noise.

Conjecture 1 (hardness of parity with noise). Let a, x ∈ {0,1}^d be d-dimensional Boolean vectors. In the parity with noise problem, we are given noisy inner products modulo 2 of an unknown vector x with the examples a_i, i.e., b_i = ⟨a_i, x⟩ + η_i mod 2, where η_i is a Bernoulli random variable which is 1 with probability 0.1. Then any algorithm for finding x needs at least 2^{Ω(d)} time or examples.

We provide a more detailed setup of the experiment reported in Fig. 3a, where the task codes are given by clusters and there is a separate linear function for every cluster. In this experiment, the data is drawn from k clusters, with a mixture of two well-separated Gaussians in each cluster. Data points from the two Gaussians within each cluster are assigned two different labels, for 2k labels in total. Fig. 6a below shows an instance of this task in two dimensions; the red circles represent the clusters, and the two classes in each cluster are drawn from well-separated Gaussians. In high dimensions, the clusters are very well-separated, and running k-means clustering to identify the k cluster centers and then learning a simple linear classifier within each cluster achieves near-perfect classification accuracy. Fig. 6b shows the performance of a single neural network trained on this task (same as Fig. 3a in the main body). We can see that a single neural network still achieves good performance, with a modest increase in the required number of samples.
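The modular baseline described above can be sketched as follows. For brevity we route points by the true cluster centers rather than running k-means (which recovers them in this well-separated regime), and all data-generation constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, per = 10, 50, 40                 # clusters, dimension, points per class

centers = rng.normal(size=(k, d)) * 10     # well-separated in high dimension
u = rng.normal(size=(k, d))                # within-cluster class direction
X, y, cid = [], [], []
for c in range(k):
    for label in (0, 1):
        pts = centers[c] + (2*label - 1) * u[c] + 0.1 * rng.normal(size=(per, d))
        X.append(pts); y += [label] * per; cid += [c] * per
X, y, cid = np.vstack(X), np.array(y), np.array(cid)

# Modular predictor: route each point to its nearest center, then apply
# that cluster's linear classifier (here the class-mean difference).
w = np.zeros((k, d)); b = np.zeros(k)
for c in range(k):
    m0 = X[(cid == c) & (y == 0)].mean(0)
    m1 = X[(cid == c) & (y == 1)].mean(0)
    w[c] = m1 - m0
    b[c] = -w[c] @ (m0 + m1) / 2

nearest = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
pred = (np.einsum('ij,ij->i', X, w[nearest]) + b[nearest] > 0).astype(int)
accuracy = (pred == y).mean()
```

Because the clusters are far apart relative to the within-cluster spread, the routing step is essentially error-free and the per-cluster linear classifiers separate the two Gaussians almost perfectly.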



Figure 3: Binary classification on multiple clusters; results are an average over 3 trials. (a) Random linear classifier for each cluster. (b) Random teacher network for each cluster. A single neural network does well even when there are multiple clusters: the error does not increase substantially on increasing the number of clusters k.

Learning SQL-style aggregation queries. We demonstrate the learnability of SQL-style aggregation queries, which are functions of the form SELECT SUM/MIN/MAX f(x) WHERE p(x) FROM DATABASE. The train and test datasets are generated from the Penn World Table (

Figure 4: Learning random homogeneous polynomials of 4 variables and degree 4 on the leaves of a decision tree; results are averaged over 7 trials. (a) Sample complexity scales as e^{O(h log(1/ε)/γ²)} with error ε, where error is measured by (1 − test R-squared). (b) For fixed tree depth (h = 10), accuracy increases with increasing margin.
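The data-generating process in this experiment can be sketched as follows; the routing of inputs by task-code bits and the random leaf coefficients are our own illustrative choices:

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(2)
h, n_vars, degree = 3, 4, 4          # tree depth, polynomial variables/degree

# One random homogeneous degree-4 polynomial of 4 variables per leaf.
monos = list(combinations_with_replacement(range(n_vars), degree))
leaf_coeffs = rng.normal(size=(2**h, len(monos)))

def label(code_bits, z):
    """Route down the tree using the task-code bits, then evaluate the
    polynomial stored at the reached leaf on the input z."""
    leaf = 0
    for bit in code_bits:             # each level branches on one code bit
        leaf = 2*leaf + bit
    return sum(c * np.prod(z[list(m)]) for c, m in zip(leaf_coeffs[leaf], monos))

code = rng.integers(0, 2, size=h)
z = rng.normal(size=n_vars)
value = label(code, z)
```

Since each leaf polynomial is homogeneous of degree 4, scaling the input by 2 scales the label by exactly 16, which provides a quick correctness check.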

Figure 5: RMSE vs. number of bodies k for learning the gravitational force law with different kernels, normalized by the range F_max − F_min of the forces. Gaussian kernels learn worse than ReLU at large k.

Figure 6: Experiment where data is clustered into tasks, with a separate linear function for each task. (a) An instance of the problem with multiple clusters; each cluster is indicated by a red circle. (b) Performance of a single neural network trained on this task. A single neural network does well even when there are multiple clusters.

Feenstra et al., 2015), which contains 11,830 rows of economic data. The WHERE clause takes the form (x_{i_1} ≥ t_{i_1}) AND . . . AND (x_{i_k} ≥ t_{i_k}).

R-squared for SQL-style aggregation. A single network with one hidden layer achieves high R-squared values, and the error does not increase substantially when the complexity of the aggregation is increased by adding more columns to the WHERE clause.
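Generating training examples for such aggregation queries can be sketched as follows; the table contents, the aggregated function f, and the feature encoding are illustrative stand-ins, not the actual Penn World Table setup:

```python
import numpy as np

rng = np.random.default_rng(3)
rows, cols = 500, 6
table = rng.uniform(0, 1, size=(rows, cols))    # stand-in numeric table
f = lambda r: r[0] * r[1]                        # aggregated function f(x)

def example(n_conj):
    """One training example: the thresholds of a conjunctive WHERE
    clause become the features, and the label is
    SELECT SUM f(x) WHERE AND_j (x[c_j] >= t_j)."""
    cs = rng.choice(cols, size=n_conj, replace=False)
    ts = rng.uniform(0, 1, size=n_conj)
    mask = np.all(table[:, cs] >= ts, axis=1)
    features = np.concatenate([cs.astype(float), ts])
    return features, sum(f(r) for r in table[mask])

feats, target = example(n_conj=3)
```

Increasing `n_conj` mirrors the experiment's increase in the number of columns in the WHERE clause.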

special case of this is a look-up table which returns value v_i if x = c^{(i)}, and 0 if x does not match any of the centers.

4. Create a data structure out of separate fields by concatenation, such as constructing a tuple [f_1, . . . , f_k], which creates a single data structure out of its children, or extract a field out of a data structure.

5. Given a table T with k entries {r_1, . . . , r_k}, a Boolean-valued function p, and an analytic function f, SQL queries of the form SELECT SUM f(r_i) WHERE p(r_i, x).

7. Given an analytic function f and a Boolean function p, for a SQL operator h over a table T with k entries {r_1, . . . , r_k} representing SELECT SUM f(r_i) WHERE p(r_i, x), or in other words h(x) = Σ_i f(r_i) · I[p(r_i, x)].

Number of epochs and average runtime

ACKNOWLEDGEMENTS

Brendan Juba was partially supported by NSF Awards CCF-1718380, IIS-1908287, and IIS-1939677, and was visiting Google during a portion of this work. Vatsal Sharan was supported in part by NSF award 1704417.


The following claim follows from Theorem 12.

Claim 2. The above classes and functions can be implemented and learnt using (k/ε)^{O(log(1/ε))} samples, where the tables are of size at most k.

Proof. We begin with the in_same_zip_code() function. Note that this is a special case of the cluster-based functions. As in Corollary 4, all attributes such as zip-code are appropriately hashed so that they are well-separated. We can then test equality with an indicator function for a ball around the zip-code of Person A. The indicator function for a ball can be approximated by a low-degree polynomial, as in the cluster-based branching results in Theorem 9. As the total number of individuals is at most k, by Theorem 9 the sample complexity is at most poly(k/ε).

For the avg_income_zip_code() function, we use the SQL query result in Theorem 12. The indicators test equality in the case of our program, and hence, as in the previous case, we can use the cluster-based branching result in Theorem 9 to approximate these indicators by polynomial functions, obtaining a sample complexity of poly(k/ε).

Finally, we argue that we can learn the get_straight_line_distance() function. Here we are composing two functions f and (g_1, g_2), where f is the distance function and (g_1, g_2) are the lookups for the latitude and longitude of Persons A and B. By Corollary 4, the lookups have g̃_i(1) ≤ poly(k/ε). By part 6 of Lemma 6, the tilde for the concatenation is the sum of the tildes for the individual functions. For computing the Euclidean distance √(Σ_i (x_i − y_i)²), note that the square root function does not have a Taylor series defined at 0. However, we can use the same analysis as in the proof for learning the 1/x function in the gravitational law (see Appendix B.1) to get a polynomial of degree at most O(log(1/ε)), and hence f̃(y) ≤ (O(y))^{log(1/ε)}.
Thus, using the composition rule in Lemma 6, the sample complexity is (k/ε)^{O(log(1/ε))}.

where R = r_max/r_min, which is constant by assumption. In order to estimate an individual contribution to the force to error ε/2k (so that the total error is ε/2), we must bound the per-term approximation error by ε/2k; this allows us to choose the smallest d which achieves this error. Taking the logarithm of both sides, and using that r²_max ≤ 2/k after rescaling, we find that the choice d ≥ R² ln(k²/ε) ensures error less than ε/2k per term.

Using this approximation, we can use the product and chain rules to get learning bounds on the force law. We write the approximation explicitly and bound its tilde norm; the number of samples needed for efficient learning is then bounded as in (48). Evaluating, and using r²_max ≤ 2/k and d = R² ln(k²/ε), gives the bound, whose asymptotic behavior is M_F = k^{O(ln(k/ε))}.

We can therefore learn an ε/2-approximation of one component of F_1, with probability at least 1 − δ/3k and error ε/2, using O(4(M_F + log(3k/δ))/ε²) samples. Therefore, we can learn F_1 to error ε with the same number of samples. Using a union bound, with probability at least 1 − δ we can simultaneously learn all components of all the {F_i} with that number of samples.

We note that since the cutoff of the power series at d(ε) = O(R² ln(k²/ε)) dominates the bound, we can easily compute learning bounds for other power-series kernels as well. If the d-th power-series coefficient of the kernel is b_d, then the bound on M_F is increased by a factor depending on b_{d(ε)}, which increases the exponent of k by a factor of ln(R² ln(k²/ε)).
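The sum, product, and composition rules used throughout these proofs can be sketched as simple bookkeeping on coefficient lists, where a function's tilde is represented by the non-negative coefficients of its bounding power series. This is a toy encoding of the rules of Lemmas 6 and 7, with illustrative functions of our own choosing:

```python
import numpy as np

# Represent the "tilde" of a function by the coefficient list (low to
# high degree) of a power series with non-negative coefficients, so
# that tilde(y) = sum_j c_j * y**j.

def tilde_sum(c1, c2):                       # (f + g)~ <= f~ + g~
    n = max(len(c1), len(c2))
    return np.pad(c1, (0, n - len(c1))) + np.pad(c2, (0, n - len(c2)))

def tilde_prod(c1, c2):                      # (f * g)~ <= f~ * g~
    return np.convolve(c1, c2)

def tilde_compose(cf, cg):                   # (f o g)~ <= f~(g~(y))
    out = np.array([0.0])
    for c in reversed(cf):                   # Horner evaluation in g~
        out = tilde_sum(tilde_prod(out, cg), np.array([c]))
    return np.trim_zeros(out, 'b')

# Example: f(x) = 2x + x^2 and g(x) = 3x, so f(g(y)) = 6y + 9y^2 and
# the sample-complexity proxy (f o g)~(1) = 15.
cf, cg = np.array([0.0, 2.0, 1.0]), np.array([0.0, 3.0])
h = tilde_compose(cf, cg)
value_at_1 = h.sum()                         # h~(1)
```

Note that the composed degree is the product of the degrees (here 2 × 1 = 2), matching Lemma 7.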

B.2 EMPIRICAL CONFIRMATION OF LEARNING BOUNDS

We empirically validated our analytical learning bounds by training models to learn the gravitational force function for k bodies (with k ranging from 5 to 400) in 3-dimensional space. We created synthetic datasets by randomly drawing k points from [0, 1]³, corresponding to the locations of the k bodies, and computed the gravitational force (according to Figure 1) on a target body also drawn randomly from [0, 1]³. To avoid singularities, we ensured a minimum distance of 0.1 between the target body and the other bodies (corresponding to the choice R = 10).

(where Ω hides poly-logarithmic factors in d). Similarly, if x is given to be s-sparse for s ≪ d, then any algorithm for finding x needs at least d^{Ω(s)} time or examples.

Note that the hardness of learning parity with noise is a standard assumption in computational learning theory and forms the basis of many cryptographic protocols (Regev, 2009). The best known algorithm for solving parity with noise needs 2^{O(d/log d)} time and examples (Blum et al., 2003). Learning parities is also known to provably require 2^{Ω(d)} samples for the class of algorithms known as statistical query algorithms; these algorithms are only allowed to obtain estimates of statistical properties of the examples, but cannot see the examples themselves (Kearns, 1998). Note that the usual stochastic algorithms for training neural networks, such as SGD, can be implemented in the statistical query model (Song et al., 2017). Similar hardness results are conjectured for the problem of learning sparse parity with noise, where the best known algorithm runs in time d^{O(s)} (Valiant, 2015).

Based on the hardness of parity with noise, we show that an exponential dependence on the depth for learning decision trees is necessary.

Theorem 14. Conditioned on the hardness of the sparse parity with noise problem, any algorithm for learning decision trees of depth h needs at least d^{Ω(h)} time or examples.

Proof.
Note that we can represent a parity with noise problem where the answer is h-sparse by a decision tree of depth h, where the leaves represent the solutions to the parity problem. The result then follows from the hardness of the sparse parity with noise problem.

We also show that the doubly exponential dependence on the depth for learning generalized decision programs is necessary.

Theorem 15. Learning a generalized decision program which is a binary tree of depth h using stochastic gradient descent requires at least 2^{2^{Ω(h)}} examples. Conditioned on the hardness of learning noisy parities, any algorithm for learning a generalized program of depth h needs at least 2^{2^{Ω(h)}} time or examples (where Ω hides poly-logarithmic factors in h).

Proof. Note that a generalized decision program of depth h can encode a parity function over D = 2^h bits. Any statistical query algorithm for learning a parity over D bits needs at least 2^{Ω(D)} samples. As stochastic gradient descent can be implemented in the statistical query model, the bound for stochastic gradient descent follows. To prove the general lower bound, note that a generalized decision program of depth h can also encode a noisy parity function over D = 2^h bits. Conditioned on the hardness of parity with noise, any algorithm for learning noisy parities needs at least 2^{Ω(D)} samples. Hence the bound for general algorithms also follows.

In our framework, we assume that all the underlying functions that we learn are analytic, or have an analytic approximation. It is natural to ask whether such an assumption is necessary. Next, we show that learning even simple compositions of functions, such as their sum, is not possible without some assumptions on the individual functions.

Lemma 8. There exist function classes F_1 and F_2 which can be learnt efficiently, but for every f_1 ∈ F_1 there exists f_2 ∈ F_2 such that f_1 + f_2 is hard to learn (conditioned on the hardness of learning parity with noise).

Proof.
Both f_1 and f_2 are modifications of the parity with noise problem. The input in both cases is x ∈ {0,1}^d. Let β be the solution to the noisy parity problem. The output for the function class F_1 is [β, y], where y is the value of the noisy parity for the input. The output for the function class F_2 is [−β, y], where y is again the value of the noisy parity for the input. Note that F_1 and F_2 are trivial to learn, as the solution β to the noisy parity problem is already part of the output. For any f_1 ∈ F_1, choose f_2 ∈ F_2 to be the function with the same vector β. Then the sum f_1 + f_2 cancels β and reveals only the noisy parity values, so conditioned on the hardness of learning parity with noise, f_1 + f_2 is hard to learn.
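A toy instantiation of this construction (the dimension, sample count, and noise draws are arbitrary choices of ours) makes the cancellation explicit:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 8, 5
beta = rng.integers(0, 2, size=d)            # hidden parity vector

def noisy_parity(a):
    noise = rng.random() < 0.1               # flip with probability 0.1
    return (a @ beta + noise) % 2

def f1(a):
    return np.concatenate([beta, [noisy_parity(a)]])

def f2(a):
    return np.concatenate([-beta, [noisy_parity(a)]])

A = rng.integers(0, 2, size=(n, d))
sums = np.array([f1(a) + f2(a) for a in A])
# The first d coordinates cancel exactly: beta is no longer observable,
# and only the (hard-to-learn) noisy parity labels remain.
```

Each class alone leaks β in its output and is trivially learnable; the sum is exactly a noisy parity oracle.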

C.1 LOWER BOUNDS FOR LEARNING ANY ANALYTIC FUNCTION

In this section, we show a lower bound on the Rademacher complexity ȳ^T H^{-1} y based on the coefficients in the polynomial expansion of the function g. Hence g̃ characterizes the complexity of learning g.

For any J = (J_1, . . . , J_n) ∈ ℕ^n, write the monomial X^J = x_1^{J_1} · · · x_n^{J_n}. Define |J| = Σ_k J_k. For a polynomial p(x) = Σ_J a_J X^J, where a_J ∈ ℂ, its degree is deg(p) = max_{a_J ≠ 0} |J|. The following fact shows that monomials form an orthogonal basis over the unit circle in the complex plane.

Fact 3. ⟨X^J, X^{J'}⟩_{ℂ^n} = 1 if J = J', and 0 otherwise (here ⟨·,·⟩_{ℂ^n} denotes the inner product over the unit circle in the complex plane).

Note that according to Theorem 7, the sample complexity for learning g(x) depends on g̃(1) = Σ_j j|a_j|, which is the ℓ1 norm of the derivative. The following lemma shows that this is tight, in the sense that Ω(Σ_j j a_j²) samples, the ℓ2 analogue of the derivative norm, are necessary for learning g(x).

For any variable x, let x̄ denote its complex conjugate. Let x_1, x_2, . . . , x_n denote the training examples. Let Q denote the kernel polynomial, so that K(x_i, x_j) = Q(x̄_i^T x_j), and write Q(t) = Σ_i q_i t^i. For simplicity, we consider the case where the power series and the kernel polynomial are univariate polynomials of bounded degree deg(q). We assume that we have enough samples that Fact 3 holds when averaging over all samples. Let q_J be the coefficient of T^J in the polynomial expansion of Q.

Lemma 9. For a univariate polynomial y = p(x), ȳ^T H^{-1} y = Σ_j a_j²/q_j asymptotically in the sample size, where the a_j are the coefficients of the polynomial p. For a multivariate polynomial, ȳ^T H^{-1} y = Σ_J a_J²/q_J asymptotically in the sample size. Here, H^{-1} denotes the pseudoinverse of H.

Proof. We begin with the univariate case. Let {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} denote the training examples and their labels, and let y be the vector of all the labels {y_i}.
Let d = max{deg(p), deg(q)} (where we assume deg(q) is bounded, for simplicity). Now consider the matrix G with n rows and d columns whose (i,j)-th entry is x_i^j. Note that Ḡ^T transforms y from the standard basis to the monomial basis, i.e., the expected value of (1/n) Ḡ^T y is (a_1, . . . , a_d) (by Fact 3). Therefore, (1/n) Ḡ^T y = (a_1, . . . , a_d) asymptotically in the sample size n. We claim that H = G D Ḡ^T, where D is the diagonal matrix with D_{k,k} = q_k. To verify this, let G^{(i)} denote the i-th row of G and observe that the (i,j)-th entry of G D Ḡ^T is G^{(i)} D Ḡ^{(j)T} = Σ_k q_k x_i^k x̄_j^k, which is the corresponding kernel entry. Now, given the orthonormality of the monomial basis, (1/n) Ḡ^T G = I. Therefore, since H = G D Ḡ^T is the SVD of H, H^{-1} = (1/n²) G D^{-1} Ḡ^T. Hence ȳ^T H^{-1} y = ((1/n) Ḡ^T ȳ)^T D^{-1} ((1/n) Ḡ^T y) = Σ_j (1/q_j) a_j². For the multivariate case, instead of having d columns in G, we have one column for every possible value of J of degree at most d, and in the diagonal entry D_{J,J} we put q_J, the coefficient of T^J in the polynomial expansion of Q.

Corollary 5. For the ReLU activation, q_j = O(1/j), and hence ȳ^T H^{-1} y ≥ Ω(Σ_j j a_j²) asymptotically in the sample size.

Note that in Theorem 7 the upper bound on the sample complexity was O(Σ_j j|a_j|); hence Theorem 7 is tight up to the distinction between the ℓ1 and ℓ2 norms of the weighted coefficients.
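Lemma 9 can be checked numerically on points drawn uniformly from the unit circle, where the orthogonality of Fact 3 holds exactly for roots of unity. The coefficients below are arbitrary, and the conjugation convention is the one for which the stated identity holds exactly:

```python
import numpy as np

n, powers = 64, [1, 2, 3]          # sample count, monomial degrees
q = np.array([1.0, 0.5, 0.25])     # kernel power-series coefficients q_j
a = np.array([2.0, -1.0, 0.5])     # target polynomial coefficients a_j

# Sample points on the unit circle (roots of unity), so monomials of
# distinct degree are exactly orthogonal when summed over the samples.
x = np.exp(2j * np.pi * np.arange(n) / n)
G = np.stack([x**j for j in powers], axis=1)   # n x d monomial matrix

H = G @ np.diag(q) @ G.conj().T    # kernel matrix H = G D G^H
y = G @ a                          # labels y = p(x)

# Quadratic form ybar^T H^{-1} y (pseudoinverse, since H has rank d).
val = (y.conj() @ np.linalg.pinv(H) @ y).real
# val matches sum_j a_j**2 / q_j = 4 + 2 + 1 = 7.
```

Here `np.linalg.pinv` plays the role of H^{-1} in the lemma, since H is rank-deficient (rank d, not n).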

