SPOTTING EXPRESSIVITY BOTTLENECKS AND FIXING THEM OPTIMALLY

Abstract

Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. This way of proceeding however forces the evolution of the function during training to lie within the realm of what is expressible with the chosen architecture, and prevents any optimization across possible architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that the information about desirable architectural changes, due to expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from the backpropagation. To do this, we propose a new mathematically well-grounded method to detect expressivity bottlenecks on the fly and solve them by adding suitable neurons when and where needed. Thus, while the standard approach requires large networks, in terms of number of neurons per layer, for expressivity and optimization reasons, we are able to start with very small neural networks and let them grow appropriately. As a proof of concept, we show results on the MNIST dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter search.

1. INTRODUCTION

Issues with the fixed-architecture paradigm. Universal approximation theorems such as Hornik et al. (1989) are historically among the first theoretical results obtained on neural networks, establishing the family of neural networks with arbitrary width as a good candidate for a parameterized space of functions to be used in machine learning. However the current common practice in neural network training consists in choosing a fixed architecture and training it, without any architecture modification in the meantime. This inconveniently prevents the direct application of these universal approximation theorems, as expressivity bottlenecks that might arise in a given layer during training cannot be fixed. There are two approaches to circumvent this in daily practice. Either one chooses a (very) large width, to be sure to avoid expressivity issues (Hanin & Rolnick, 2019b; Raghu et al., 2017), but then consumes extra computational power to train such big models, and often needs to reduce the model afterwards, possibly using probabilistic edges (Liu et al., 2019). Or one tries different architectures and keeps the most suitable one (in terms of performance-size compromise for instance), which multiplies the computational power by the number of trials. This latter approach relates to the Auto-DeepLearning field, where different exploration strategies over the space of architecture hyper-parameters (among other ones) have been tested, including reinforcement learning (Baker et al., 2017; Zoph & Le, 2016), Bayesian optimization techniques (Mendoza et al., 2016), and evolutionary approaches (Miller et al., 1989; Miikkulainen et al., 2017), which rely on random tries and consequently take time for exploration. Within that line, Net2Net (Chen et al., 2015), AdaptNet (Yang et al., 2018) and MorphNet (Gordon et al., 2018) propose different strategies to explore possible variations of a given architecture, possibly guided by model size constraints.
Instead, we aim at providing a way to locate precisely expressivity bottlenecks in a trained network, which might speed up neural architecture search significantly. Moreover, based on such observations, we aim at modifying the architecture on the fly during training, in a single run (no re-training), using first-order derivatives only, while avoiding neuron redundancy. Neural architecture growth. A related line of work consists in growing networks neuron by neuron, by iteratively estimating the best possible neurons to add, according to a certain criterion. For instance, Wu et al. (2019) and Firefly (Wu et al., 2020) aim at escaping local minima by adding neurons that minimize the loss under neighborhood constraints. These neurons are found by gradient descent or by solving quadratic problems involving second-order derivatives. Another example is GradMax (Evci et al., 2022), which seeks to minimize the loss as fast as possible and involves another quadratic problem. However, the neurons added by these approaches are possibly redundant with existing neurons, in particular if one does not wait for training convergence to a local minimum (which is time-consuming) before adding neurons, therefore producing larger-than-needed architectures. By contrast, we will explicitly take redundancy into account in our growing criterion. Optimization properties. An important reason for common practice to choose wide architectures is the associated optimization properties: sufficiently large networks are proved theoretically and shown empirically to be better optimizers than small ones (Jacot et al., 2018). Typically, small networks exhibit issues with spurious local minima, while wide ones usually find good nearly-global minima. One of our goals is to train small networks without suffering from such optimization difficulties.

Notions of expressivity. Several concepts of expressivity or complexity exist in the Machine Learning literature, ranging from Vapnik-Chervonenkis dimension and Rademacher complexity to the number of pieces in a piecewise affine function (as networks with ReLU activations are) (Serra et al., 2018; Hanin & Rolnick, 2019a). Bottlenecks have also been studied from the point of view of Information Theory, through the mutual information between the activities of different layers (Tishby & Zaslavsky, 2015); this quantity is however difficult to estimate. Also from Information Theory, the Minimum Description Length paradigm and Kolmogorov complexity make it possible to search for a compromise between performance and model complexity. In this article, we aim at measuring lacks of expressivity as the difference between what the backpropagation asks for and what can be done by a small parameter update (such as a gradient step), that is, between the desired variation for each activation in each layer (for each sample) and the best one that can be realized by a parameter update. Intuitively, differences arise when a layer does not have sufficient expressive power to realize the desired variation. Our main contributions are that we:
• take a functional analysis viewpoint on gradient descent for neural networks, suggesting to attempt to follow the functional gradient. We optimize not only the weights of the current architecture, but also the architecture itself on the fly, in order to progressively move towards more suitable parameterized functional spaces;
• properly define and quantify the notion of expressivity bottleneck, globally at the neural network output as well as at each layer, and this in an easily computable way. This allows us to localize the expressivity bottlenecks, by spotting the layers with the greatest lack of expressivity;
• mathematically define the best possible neurons to add to a given layer to decrease the lack of expressivity, as the solution of a quadratic problem; compute them and their associated expressivity gain;
• check that adding these best neurons is indeed better than adding random ones;
• are able to train a neural network without gradient descent (yet still relying on backpropagation), by just adding such best neurons, without any parameter update;
• naturally obtain a series of compromises between performance and number of neurons, in a single run, thus removing the need for layer-width hyper-optimization, and this at a computational cost competitive with classically training a large model just once. One can define a target accuracy and stop adding neurons when it is reached.

2.1. NOTATIONS

We consider a feedforward neural network with L hidden layers, f_θ : R^p → R^d, whose parameters θ := (W_1, ..., W_L) are organized into affine layers followed by activation functions σ_l. We denote the dataset by D := {(x_1, y_1), ..., (x_N, y_N)}, with x_i ∈ R^p and y_i ∈ R^d, and the loss function by L. We assume that each σ_l is differentiable at 0 with σ_l(0) = 0, and that L is differentiable on R^d. Except in Section 3.4, the dataset D is fixed. The network iteratively computes (see Figure 1 for notations):

b_0(x) = x
a_l(x) = W_l b_{l-1}(x)
b_l(x) = σ_l(a_l(x))

with f_θ(x) = σ_L(a_L(x)). To any vector-valued function t(x) and any batch of inputs X := [x_1, ..., x_n], we associate the concatenated matrix T(X) := (t(x_1) ... t(x_n)). The matrices of pre-activation and post-activation activities at layer l over a minibatch X are thus respectively A_l(X) = (a_l(x_1) ... a_l(x_n)) and B_l(X) = (b_l(x_1) ... b_l(x_n)).
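To fix ideas, the recursion above can be written in a few lines of NumPy. This is a toy sketch: the layer sizes and the tanh activation (which satisfies σ_l(0) = 0) are our own choices for illustration, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(weights, X):
    """Forward pass over a batch X (one column per sample).
    Returns the pre-activation matrices [A_1, ..., A_L] and the
    post-activation matrices [B_0, ..., B_L], with B_0 = X."""
    B = [X]
    A = []
    for W in weights:
        A_l = W @ B[-1]          # A_l(X) = W_l B_{l-1}(X)
        B_l = np.tanh(A_l)       # B_l(X) = sigma_l(A_l(X)); tanh(0) = 0
        A.append(A_l)
        B.append(B_l)
    return A, B

p, n = 3, 5                      # input dimension, batch size
weights = [rng.standard_normal((4, p)),   # W_1: 4 x 3
           rng.standard_normal((2, 4))]   # W_2: 2 x 4
X = rng.standard_normal((p, n))
A, B = forward(weights, X)
```

The columns-as-samples convention mirrors the matrices A_l(X) and B_l(X) defined above.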

2.2. APPROACH

We take a functional perspective on the use of neural networks. Ideally, in a machine learning task, one would search for a function f : R^p → R^d that minimizes the loss L by gradient descent: ∂f/∂t = -∇_f L(f), for some metric on the functional space F (typically L2), where ∇_f denotes the functional gradient. For a chosen η > 0, the descent direction v_goal := -η ∇_f L(f) is a function of the same type as f, indicating the best infinitesimal variation to add to f to decrease the loss. In practice, to compute the gradient, a finite-dimensional parametric space of functions is considered, by choosing a particular neural network architecture A with weights θ ∈ Θ_A. The associated parametric search space F_A then consists of all functions f_θ that can be represented with such a network, for any parameter value θ. Gradient descent reminder. For the sake of simplicity, let us consider a loss of the form L(f) = E_{(x,y)∼D}[L(f(x), y)]. Under standard weak assumptions (A.1), and up to a multiplicative learning rate, gradient descent takes the form:

∂θ/∂t = -∇_θ L(f_θ) = -E_{(x,y)∼D}[∇_θ L(f_θ(x), y)]

Using the chain rule, this yields the function change:

v_GD := η ∂f_θ/∂t = η (∂f_θ/∂θ)(∂θ/∂t) = (∂f_θ/∂θ) E_{(x,y)∼D}[(∂f_θ/∂θ)^T(x) v_goal(x)]

Optimal move. We name T^{f_θ}_A, or just T_A, the tangent space of F_A at f_θ, that is, the set of all possible infinitesimal variations around f_θ under small parameter variations:

T^{f_θ}_A := { (∂f_θ/∂θ) δθ | δθ ∈ Θ }

This linear space is a first-order approximation of the neighborhood of f_θ within F_A. The direction v_GD obtained above by gradient descent is actually not the best one to consider within T_A. Indeed, the best move v* would be the orthogonal projection of the desired direction v_goal := -η ∇_{f_θ} L(f_θ) onto T_A. This projection depends on the chosen metric and is what a (generalization of the notion of) natural gradient computes (Ollivier, 2017).
The expressivity bottleneck is measured as the difference between the optimal functional move v* given the architecture A and the functional gradient v_goal: the former is the projection of the latter onto the tangent space T_A (cf Figure 2).

Lack of expressivity. When -η ∇_{f_θ} L(f_θ) does not belong to the reachable subspace T_A, there is a lack of expressivity, that is, the parametric space A is not rich enough to follow the ideal functional gradient descent. This happens frequently with small neural networks.

Example. Suppose one tries to estimate the function y = f_true(x) = 2 sin(x) + x with a linear model f_predict(x) = ax + b. Consider (a, b) = (1, 0) and the square loss L. For the dataset of inputs (x_0, x_1, x_2, x_3) = (0, 1.5, π, 4.5), there exists no parameter update (δa, δb) that would improve the prediction at x_0, x_1, x_2 and x_3 simultaneously, as the space of linear functions {f : x → ax + b | a, b ∈ R} is not expressive enough. To improve the prediction at x_0, x_1, x_2 and x_3, one should look for another, more expressive functional space, such that for i = 0, 1, 2, 3 the functional update Δf_predict(x_i) := f^{t+1}_predict(x_i) - f^t_predict(x_i) goes in the same direction as minus the functional gradient:

v_goal(x_i) := -η ∇_{f_predict(x_i)} L(f_predict(x_i), y_i) = -2η (f_predict(x_i) - y_i)

where y_i = f_true(x_i).

Ideal updates. The same reasoning can be applied to the pre-activations a_l, seen as functions defined over the input space of the neural network. The optimal update of the weights of the different layers is then the projection of the desired update direction of the pre-activation functions, obtained by backpropagation, onto the linear subspace T^{a_l}_A of variations possible within the architecture, as we now detail, just as in the intuition above. Training a neural network is usually done by gradient descent, which consists in updating the weight matrices W_l recursively using backpropagation. More precisely, the gradient of the loss w.r.t. W_{l,ij} is the product of the layer-specific derivative ∂a_l(x)/∂W_{l,ij}, from the pre-activations to the weight in question, with the backpropagated factor ∇_u L(σ_L(W_L(... σ_l(u)))) |_{u = a_l(x)}, from the loss L down to the layer pre-activations a_l(x). The opposite of this second factor, multiplied by the positive real number η,

v^l_goal(x) := -η ∇_u L(σ_L(W_L(... σ_l(u)))) |_{u = a_l(x)},

indicates the desired update direction for u = a_l(x). Mathematically speaking, if at time t any activity update were possible at each layer l, we would choose at time t+1 the pre-activation function updates such that for all samples i = 1, ..., n:

Δa_l(x_i) := a^{t+1}_l(x_i) - a^t_l(x_i) = -η ∇_{a^t_l(x)} L(σ_L(W_L(... a^t_l(x))), y) |_{(x,y) = (x_i, y_i)}   (1)

Unfortunately, most of the time no parameter move δθ is able to induce this progression for each x_i at the same time, because the θ-parameterized family of functions a_l is not expressive enough. Intuitively, the ideal update above does not result from a parameter update but from an update of the pre-activation functions in their functional spaces. Following this ideal update rather than classical gradient descent would optimally decrease the loss, with an order-1 effect in η, provided the gradient is nonzero. Unlike this ideal update, classical gradient descent decreases the global loss at first order in η but does not necessarily improve the prediction for every x_i. For the rest of the paper we will write v^l_goal(x_i) := -η ∇_{a_l(x)} L(f_θ(x), y) |_{(x,y)=(x_i,y_i)}.

Activity update resulting from a parameter change. Given a subset of parameters θ̄, an incremental direction δθ̄ to update θ̄, and an amplitude η > 0, the impact of the parameter update δθ̄ on the pre-activity a_l at layer l is, at order 1 in δθ̄, v_l(x_i, δθ̄) := (∂a_l(x_i)/∂θ̄) δθ̄.

Remark 1. We could have chosen to study the desired update for b_l instead; our choice of a_l is explained in Section A.2.
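The toy linear-model example above can be checked numerically. The following sketch (our own illustration, not the paper's code, taking η = 1) shows that the best update achievable within the linear family leaves a nonzero residual against the desired directions v_goal(x_i):

```python
import numpy as np

xs = np.array([0.0, 1.5, np.pi, 4.5])
y = 2 * np.sin(xs) + xs              # targets y_i = f_true(x_i)
f = 1.0 * xs + 0.0                   # predictions with (a, b) = (1, 0)
v_goal = -2 * (f - y)                # desired directions, eta = 1

# Best achievable functional update within the linear family:
# least-squares fit of v_goal by delta_a * x + delta_b.
design = np.stack([xs, np.ones_like(xs)], axis=1)
coef, *_ = np.linalg.lstsq(design, v_goal, rcond=None)
residual = v_goal - design @ coef    # nonzero: a lack of expressivity
```

At x_0 and x_2 the prediction is already exact (v_goal = 0), while x_1 asks for a positive move and x_3 for a negative one; no line through these four desired values exists, so the residual norm is strictly positive.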

3. EXPRESSIVITY BOTTLENECKS

We now define expressivity bottlenecks based on the activity updates, both the actual ones v_l(.) and the desired ones v^l_goal(.) (cf Figure 2):

Definition 3.1 (Lack of expressivity). For a neural network f_θ and a minibatch of points {(x_i, y_i)}_{i=1}^n, we define the lack of expressivity at layer l as how far the desired activity update V^l_goal is from the closest possible activity update realizable by a parameter change δθ:

min_{v_l ∈ T^{a_l}_A} Σ_{i=1}^n ||v_l(x_i) - v^l_goal(x_i)||² = min_{δθ} ||V_l(X, δθ) - V^l_goal(X)||²_Tr   (2)

where ||.|| stands for the L2 norm and ||.||_Tr for the Frobenius norm. In the two following parts we fix a minibatch {(x_i, y_i)}_{i=1}^n, i.e. a subset of the full dataset D. As X := [x_1, ..., x_n] is then fixed, we simplify the notation: A_l := A_l(X), B_l := B_l(X), V_l := V_l(X), etc.

3.1. BEST MOVE WITHOUT MODIFYING THE ARCHITECTURE OF THE NETWORK

Let δW*_l be the solution of (2) when the parameter variation δθ is restricted to involve only the parameters W_l of layer l. This move is sub-optimal in that it does not result from an update of all architecture parameters, but only of those of the current layer:

δW*_l = arg min_{δW_l ∈ M(size of W_l)} ||V_l(δW_l) - V^l_goal||²_Tr   (3)

where M(p, q) denotes the set of matrices of size p × q, here the size of W_l. We denote by V^{*0}_l the associated activity variation: V^{*0}_l = δW*_l B_{l-1}, i.e. v^{*0}_l(x_i) = δW*_l b_{l-1}(x_i).

Proposition 3.1. The solution of (3) is:

δW*_l = (1/n) V^l_goal B^T_{l-1} ((1/n) B_{l-1} B^T_{l-1})^+

where P^+ denotes the generalized (Moore-Penrose) inverse of a matrix P. This update δW*_l is not equivalent to the usual gradient descent update, whose form is δW^GD_l ∝ V^l_goal B^T_{l-1}. In fact, V^{*0}_l is the projection of V^l_goal onto the post-activation matrix of layer l-1, that is to say onto the span of all possible directions from the post-activations, through the projector (1/n) B^T_{l-1} ((1/n) B_{l-1} B^T_{l-1})^+ B_{l-1}. To increase expressivity where needed, we will aim at enlarging this span with the directions most useful to close the gap between this best update and the desired one. Note that the update δW*_l consists of a standard gradient term (V^l_goal B^T_{l-1}) and of a (kind of) natural gradient only for the last part (the projector).
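The computation in Proposition 3.1 can be sketched in NumPy (a toy illustration with random matrices standing in for B_{l-1} and V^l_goal; variable names are ours). The residual of the induced projection is exactly the lack of expressivity of Definition 3.1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_prev, d_out = 8, 3, 4                  # batch size, widths of layers l-1 and l
B_prev = rng.standard_normal((d_prev, n))   # B_{l-1}, one column per sample
V_goal = rng.standard_normal((d_out, n))    # desired pre-activation update V_goal^l

S = B_prev @ B_prev.T / n                   # (1/n) B_{l-1} B_{l-1}^T
delta_W = (V_goal @ B_prev.T / n) @ np.linalg.pinv(S)   # Proposition 3.1
V_star = delta_W @ B_prev                   # realized update V_l^{*0}

# Lack of expressivity (Definition 3.1): squared Frobenius residual.
bottleneck = np.linalg.norm(V_goal - V_star) ** 2
```

Since V_star is an orthogonal projection of V_goal onto the span of the rows of B_{l-1}, the residual V_goal - V_star is orthogonal to that span, which gives a quick sanity check of the formula.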

3.2. REDUCING EXPRESSIVITY BOTTLENECK BY MODIFYING THE ARCHITECTURE

To get as close as possible to V^{l+1}_goal and to increase the expressive power of the current neural network, we modify each layer of its structure. At layer l, we add K neurons n_1, ..., n_K with input weights α_1, ..., α_K and output weights ω_1, ..., ω_K (cf Figure 4). This results in the changes W^T_l ← (W^T_l α_1 ... α_K) and W_{l+1} ← (W_{l+1} ω_1 ... ω_K). We denote this modification of the architecture by θ ← θ ⊕ θ^K_{l↔l+1}, where ⊕ is the concatenation sign and θ^K_{l↔l+1} := (α_k, ω_k)_{k=1}^K are the added neurons. The added neurons could be chosen randomly, as in usual neural network initialization, but this would not yield any guarantee regarding the impact on the loss. Another possibility would be to set either the input weights (α_k)_{k=1}^K or the output weights (ω_k)_{k=1}^K to 0, so that the function f_θ(.) would not be modified, while its gradient w.r.t. θ would be enriched by the new parameters. A third option is to solve an optimization problem as in the previous section with the modified structure θ ← θ ⊕ θ^K_{l↔l+1}, jointly searching for both the optimal new parameters θ^K_{l↔l+1} and the optimal variation δW_l of the old ones:

arg min_{θ^K_{l↔l+1}, δW_l} ||V^{l+1}_goal - V_{l+1}((δW_l, θ^K_{l↔l+1}))||²_Tr

Figure 4: Adding two neurons at layer l (K = 2), with the new connections in purple. Here α_i ∈ R³ and ω_i ∈ R³ for i = 1, 2.

As the displacement V_{l+1} at layer l+1 is actually a sum of the moves induced by the neurons already present (δW_l) and by the added neurons (θ^K_{l↔l+1}), our problem rewrites as:

arg min_{θ^K_{l↔l+1}, δW_l} ||V^{l+1}_goal - V_{l+1}(δW_l) - V_{l+1}(θ^K_{l↔l+1})||²_Tr   (4)

with v_{l+1}(x, θ^K_{l↔l+1}) := Σ_{k=1}^K ω_k (b_{l-1}(x)^T α_k). Referring to the definition of v_{l+1}(x), this expression requires justification, because the partial derivative with respect to (α_k, ω_k) at 0 is actually 0 (see A.2). We solve this problem in two steps.
Let us fix for the moment δW_l ∈ M(|v^{l+1}_goal(x)|, |b_l(x)|), standing for an update of the matrix W_{l+1}, and search for the best new parameters θ^K_{l↔l+1}. We write V^{l+1}_{goal,proj} = V^{l+1}_{goal,proj}(δW_l) := V^{l+1}_goal - V_{l+1}(δW_l). We are looking for the following quantity:

(θ̃^{K,*}_{l↔l+1}, K*) = ((α̃*_k, ω̃*_k)_{k=1}^{K*}, K*) := arg min_{(α_k, ω_k)_{k=1}^K, K} ||V^{l+1}_{goal,proj} - V_{l+1}(θ^K_{l↔l+1})||²_Tr   (5)

We define the matrices N := (1/n) B_{l-1} (V^{l+1}_{goal,proj})^T and S := (1/n) B_{l-1} B^T_{l-1}. Note that N depends on δW_l. Using the low-rank matrix approximation theorem (Eckart & Young, 1936), we can solve this quadratic optimization problem as follows. As S is positive semi-definite, let us denote its Cholesky decomposition by S = S^{1/2} S^{1/2 T}, and consider the SVD of the matrix (S^{1/2})^{-1} N = Σ_{k=1}^R λ_k u_k v_k^T, with λ_1 ≥ ... ≥ λ_R ≥ 0, where R is the rank of the matrix N. Then:

Proposition 3.2. The solution of (5) can be written as:
• optimal number of neurons: K* = R;
• their optimal weights: (α̃*_k, ω̃*_k) = ((S^{1/2 T})^{-1} u_k, v_k) for k = 1, ..., R.

Moreover, for any number of neurons K ⩽ R and associated scaled weights θ̃^{K,*}_{l↔l+1}, the expressivity gain and the first order in η of the loss improvement due to the addition of these K neurons are equal, and can be quantified very simply as a function of the eigenvalues λ_k:

(1/n) ||V^{l+1}_{goal,proj} - V_{l+1}(θ̃^{K,*}_{l↔l+1})||²_Tr = (1/n) ||V^{l+1}_{goal,proj}||²_Tr - Σ_{k=1}^K λ²_k

(1/n) Σ_{i=1}^n L(f_{θ ⊕ θ̃^{K,*}_{l↔l+1}}(x_i), y_i) = (1/n) Σ_{i=1}^n L(f_θ(x_i), y_i) - (σ'_l(0)/η) Σ_{k=1}^K λ²_k + o(||θ̃^{K,*}_{l↔l+1}||²)

Proposition 3.3. If S is positive definite, then solving (5) is equivalent to taking ω_k ∝ N^T α_k and finding the K first eigenvectors α_k, associated with the K largest eigenvalues λ²_k, of the generalized eigenvalue problem:

N N^T α_k = λ²_k S α_k

This formulation is useful when the dimensions of N and S are large.
Using the LOBPCG method (Benner, 2020), this generalized eigenvalue problem can be solved without inverting the matrix or computing the Cholesky factorization of Proposition 3.2. In practice the matrix S is positive definite except for l-1 = 0, and even in that case it is possible to define a Cholesky decomposition of S (cf Appendix).

Corollary. The matrix δW_l which minimizes (6) (through its impact on V^{l+1}_{goal,proj}), or equivalently minimizes the sum of the orders zero and one in η of (7), is given by δW*_{l+1} in Proposition 3.1.

Corollary. For all integers m, m' such that m + m' ⩽ R, at order one in η it is equivalent to add m + m' neurons simultaneously according to the previous method, or to add m neurons and then m' neurons by applying the previous method twice in succession.

Minimizing the distance (6), i.e. the distance between V^{l+1}_goal(δW_l) and V_{l+1}(θ_{l↔l+1}), is equivalent to minimizing the loss L at first order, due to the following expansion:

L(f_{θ ⊕ θ^K_{l↔l+1}}) = L(f_θ) - σ'_l(0) (1/η) (1/n) ⟨V^{l+1}_goal(δW_l), V_{l+1}(θ_{l↔l+1})⟩_Tr + o(||V_{l+1}(θ_{l↔l+1})||)

When solving (6), we notice that the family {V_{l+1}((α_k, ω_k))}_{k=1}^K of pre-activity variations induced by adding the neurons θ̃^{K,*}_{l↔l+1} is orthogonal for the trace scalar product; we could say that the added neurons are orthogonal to each other in that sense. The addition of each neuron k has an impact of the order of λ²_k, which can be used as a criterion to decide whether neuron k should be added to the layer or not, e.g. only if λ²_k / L(f_θ) > τ for some threshold τ. We name the operation θ ← θ ⊕ θ̃^{K,*}_{l↔l+1} the K-update of the network at layer l.
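The construction of Propositions 3.2 and 3.3 can be sketched as follows, with random matrices standing in for B_{l-1} and the residual V^{l+1}_{goal,proj} (a toy NumPy illustration under our own naming, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_prev, d_next = 16, 4, 3
B_prev = rng.standard_normal((d_prev, n))     # B_{l-1}
V_proj = rng.standard_normal((d_next, n))     # residual desired update V_goal,proj^{l+1}

N = B_prev @ V_proj.T / n                     # N = (1/n) B_{l-1} V^T
S = B_prev @ B_prev.T / n                     # S = (1/n) B_{l-1} B_{l-1}^T
S_half = np.linalg.cholesky(S)                # S = S_half S_half^T (S positive definite here)

# SVD of S^{-1/2} N: singular values lambda_k, vectors u_k (left) and v_k (right).
U, lam, Vt = np.linalg.svd(np.linalg.inv(S_half) @ N)
K = 2                                         # keep the K largest singular values
alphas = [np.linalg.inv(S_half.T) @ U[:, k] for k in range(K)]  # input weights alpha_k
omegas = [Vt[k, :] for k in range(K)]                           # output weights omega_k
gain = float(np.sum(lam[:K] ** 2))            # first-order expressivity gain, sum of lambda_k^2
```

A sanity check consistent with Proposition 3.3: each computed α_k satisfies the generalized eigenvalue relation N N^T α_k = λ²_k S α_k.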

3.3. CHOOSING THE AMPLITUDE FACTOR γ WHEN ADDING NEURONS

Let us consider the (l+1)-th layer of the network. Having the best update δW*_{l+1} for the linear layer l+1, and the best neurons (α̃*_k, ω̃*_k)_{k=1}^K to add accordingly to layer l, we estimate the best factors γ_0 and γ by which to multiply these updates, in order to speed up learning. Defining the updates θ^{δ(l+1)}(γ_0) = (W_1, ..., W_{l+1} + γ_0 δW*_{l+1}, ..., W_L) and θ̃^{K,*}_{l↔l+1}(γ) = (γ α̃*_k, ω̃*_k)_{k=1}^K, we apply a line search algorithm to find a local minimum of the loss function:

γ*_0 := arg min_{γ_0 ∈ V(0)} (1/n) Σ_{i=1}^n L(f_{θ^{δ(l+1)}(γ_0)}(x_i), y_i)

γ* := arg min_{γ ∈ V(0)} (1/n) Σ_{i=1}^n L(f_{θ^{δ(l+1)}(γ*_0) ⊕ θ̃^{K,*}_{l↔l+1}(γ)}(x_i), y_i)

where V(0) is a positive neighbourhood of zero. Remark: the amplitude factor can also be defined differently, for example by choosing different amplitude factors for the input and output parameters, i.e. (√γ_1 α̃_k, √γ_2 ω̃_k)_{k=1}^K.
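The two line searches above can be sketched with a simple grid search stand-in (our own minimal illustration; the paper does not specify the line search algorithm, and the toy quadratic loss below merely plays the role of the batch loss as a function of the amplitude):

```python
import numpy as np

def line_search(loss_of_gamma, gammas):
    """Return the candidate amplitude with the lowest loss: a grid-search
    stand-in for the line search over a positive neighbourhood of zero."""
    losses = [loss_of_gamma(g) for g in gammas]
    return float(gammas[int(np.argmin(losses))])

# Toy stand-in for (1/n) sum_i L(f_{theta(gamma)}(x_i), y_i), minimal at gamma = 0.3.
toy_loss = lambda g: (g - 0.3) ** 2 + 1.0
gamma_star = line_search(toy_loss, np.linspace(0.0, 1.0, 101))
```

In the method itself, the same routine would be called twice: once for γ_0 with the in-layer update applied, then for γ with the new neurons appended.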

3.4. VARIANCE OF THE ESTIMATOR

Writing n_l for the number of neurons already present in layer l before the K-update, we discuss the variance of our estimate θ̃^{K,*}_{l↔l+1}, for l = 1, ..., L-1, as a function of the minibatch. At layer l+1, the solution of (5) is an estimator of θ^{K,*}_{l↔l+1} := (α*_k, ω*_k)_{k=1}^K, the minimizer of:

θ^{K,*}_{l↔l+1} := arg min_{α,ω} E[ || v^{l+1}_{goal,proj}(X_i) - Σ_{k=1}^K v^k_{l+1}(X_i) ||² ]   (8)

where the expectation is taken over random samples X_i ∼ D. The variance of the directions (α̃*_k, ω̃*_k)_{k=1}^K grows as √p, where p := max(n_{l-1}, n_{l+1}), and shrinks as 1/√N, where N is the minibatch size used for the estimation of (8). In practice we start with N = 100 and increase N with the architecture growth.

In Figures 5 and 6, we run an experiment on the MNIST dataset (LeCun et al., 1998), with 7 CPUs, repeated 20 times. All models are trained for 50 seconds with Adam(lr = 0.0001, µ = 0), with a constant minibatch size of 100. The activation functions are σ_l = selu for l = 1, 2 and σ_3 = softmax. We plot the mean and standard deviation of the test-set accuracy, for our approach with architecture growth and for standard training with a fixed architecture. For our approach, we start with a feedforward model with 2 hidden layers of widths [1, 1], initialized with Kaiming normal. Every 0.05 seconds (with Adam(lr = 0.0001)), we extend the two hidden layers according to our method. We compare the performance of our method with the classical training of large models (Fig. 5), and of models with the same final architecture as ours (Fig. 6), to check performance when one already knows the correct architecture. Classical models are initialized with Kaiming normal. More plots can be found in Appendix C. We note that large models take more computational time to train, and that architecture growth yields better or similar performance while avoiding layer-width tuning.

5. ABOUT GREEDY GROWTH SUFFICIENCY

One might wonder whether a greedy approach to layer growth might get stuck in a non-optimal state. We derive the following series of propositions in this regard. Since in this work we add neurons layer by layer, independently, we study here the case of a single-hidden-layer network, to spot potential layer growth issues. For the sake of simplicity, we consider the task of least-squares regression towards an explicit continuous target f*, defined on a compact set. That is, we aim at minimizing the loss:

inf_f Σ_{x∈D} ||f(x) - f*(x)||²

where f(x) is the output of the neural network and D is the training set.

Proposition 5.1 (Greedy completion of an existing network). If f is not f* yet, then there exists a set of neurons to add to the hidden layer such that the new function f' will have a lower loss than f. One can even choose the added neurons such that the loss is arbitrarily well minimized.

Furthermore:

Proposition 5.2 (Greedy completion by one single neuron). If f is not f* yet, there exists a neuron to add to the hidden layer such that the new function f' will have a lower loss than f.

As a consequence, there exists no situation where one would need to add many neurons simultaneously to decrease the loss: it is always feasible with a single neuron. One can express a lower bound on how much the loss improves (for the best such neuron), but it is not a very good bound without further assumptions on f.

Proposition 5.3 (Greedy completion by one infinitesimal neuron). The neuron in the previous proposition can be chosen to have arbitrarily small input weights.

This detail is important in that our approach is based on the tangent space of the function f and consequently manipulates infinitesimal quantities.
Though we perform a line search in a second step and consequently add non-infinitesimal neurons, our first optimization problem relies on the linearization of the activation function by requiring the added neuron to have infinitely small input weights, which makes the problem easier to solve. Proposition 5.3 confirms that such a neuron does exist. Correlations and higher orders. Note that, as a matter of fact, our approach exploits linear correlations between the inputs of a layer and the desired output variations. It may happen that the loss is not minimized yet, but there is no such correlation left to exploit. In that case the optimization problem (5) will not find neurons to add. Yet, following Prop. 5.3, there does exist a neuron with arbitrarily small input weights that can reduce the loss. This paradox can be explained by pushing further the Taylor expansion of that neuron's output in terms of its weight amplitude (a single factor ε on all of its input weights):

σ(ε α·x) ≈ σ(0) + σ'(0) ε (α·x) + (1/2) σ''(0) ε² (α·x)² + O(ε³)

Though the linear term α·x might be uncorrelated with the desired output variation over the dataset, i.e. E_{x∼D}[(α·x) v_goal(x)] = 0, the quadratic term (α·x)², or higher-order ones otherwise, might be correlated with it. Finding neurons with such higher-order correlations can be done by increasing accordingly the power of (α·x) in the optimization problem (4). Note that one could also consider other function bases than the polynomials of the Taylor expansion. In all cases, one does not need to solve such problems exactly, but just to find an approximate solution, i.e. a neuron improving the loss. Adding random neurons. Another possibility to suggest additional neurons, when expressivity bottlenecks are detected but no correlation (up to order p) can be exploited anymore, is to add random neurons.
The first p orders of the Taylor expansion will show zero correlation with the desired output variation, hence no loss improvement nor worsening at those orders, but the correlation of the (p+1)-th order will be nonzero with probability 1, in the spirit of random projections. The loss can then be improved, all the more so with a line search to optimize the neuron amplitude. However, such random neurons also contribute to directions of the functional space other than the desired one. This hinders the loss improvement to be expected from them, as the line search will need to find a compromise with the loss changes brought by these other directions. This is confirmed experimentally in Appendix C.4. To alleviate this, in the spirit of common neural network training practice, one could consider brute-force combinatorics, adding many random neurons and hoping that one will be close enough to the desired direction. The difference with standard training is that we would perform such computationally-costly searches only when and where relevant, after exploiting all the simple information (linear correlations in each layer) first.
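The correlation argument above can be illustrated numerically. In the following construction of ours (one-dimensional input, α = 1), the desired variation has no linear correlation with the input to exploit, yet a clear second-order one:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)     # scalar layer input over the dataset
v = x ** 2 - 1.0                    # desired output variation: an even function of x

lin_corr = np.mean(x * v)           # ~ E[(alpha . x) v(x)]: vanishes (odd integrand)
quad_corr = np.mean(x ** 2 * v)     # ~ E[(alpha . x)^2 v(x)] = E[x^4] - E[x^2] = 2
```

A neuron whose first-order term is useless here can thus still reduce the loss through its quadratic Taylor term, which is what raising the power of (α·x) in the optimization problem captures.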

6. CONCLUSION

We have properly defined lacks of expressivity, and their minimization has allowed us to optimize the architecture on the fly, so as to better follow the functional gradient, enabling architecture growth. Apart from the straightforward extension to other types of layers (such as convolutions) and to the addition of new layers, future work will pay attention to overfitting possibilities (which we have not observed so far, thanks to the optimally small number of parameters) and to neuron addition strategies.

A ASSUMPTIONS

A.1 MEASURE THEORY STATEMENT

Let X be an open subset of R, and Ω be a measure space. Suppose f : X × Ω → R satisfies the following conditions:
• f(x, ω) is a Lebesgue-integrable function of ω for each x ∈ X;
• for almost all ω ∈ Ω, the partial derivative f_x of f with respect to x exists for all x ∈ X;
• there is an integrable function θ : Ω → R such that |f_x(x, ω)| ≤ θ(ω) for all x ∈ X and almost every ω ∈ Ω.
Then, for all x ∈ X,

d/dx ∫_Ω f(x, ω) dω = ∫_Ω f_x(x, ω) dω

See standard references on differentiation under the integral sign for the proof and details.

A.2 REMARKS

When increasing the size of layer l with θ^K_{l↔l+1} := (α_k, ω_k)_{k=1}^K, starting from (α_k, ω_k)_{k=1}^K = 0, the first-order outcome for v_{l+1}(x, θ^K_{l↔l+1}) is 0, because the gradient with respect to (α_k, ω_k)_{k=1}^K vanishes at (α_k, ω_k)_{k=1}^K = 0. In mathematical terms:

v_{l+1}(x, θ^K_{l↔l+1}) := (∂a_{l+1}(x)/∂θ^K_{l↔l+1}) |_{θ^K_{l↔l+1}=0} θ^K_{l↔l+1} = 0   (9)

The impact of this modification of the structure has to be seen differently. A first point of view is to choose the (ω_k)_{k=1}^K first, and then compute v_{l+1}(x, (α_k)_{k=1}^K) as a function of the family (ω_k)_{k=1}^K. We then have:

v_{l+1}(x, θ^K_{l↔l+1}) := v_{l+1}(x, (α_k)_{k=1}^K) = (∂a_{l+1}(x)/∂(α_k)_{k=1}^K) |_{(α_k)_{k=1}^K = 0} (α_k)_{k=1}^K = Σ_{k=1}^K ω_k b_{l-1}(x)^T α_k

This amounts to saying that, for each family (ω_k)_{k=1}^K, the tangent space at a_{l+1} restricted to moves of the family (α_k)_{k=1}^K, i.e.

T^{a_{l+1}}_A := { (∂a_{l+1}/∂(α_k)_{k=1}^K) |_{(α_k)_{k=1}^K=0} (α_k)_{k=1}^K | (α_k)_{k=1}^K ∈ R^{|b_{l-1}(x)| K} },

varies with the family (ω_k)_{k=1}^K, i.e. T^{a_{l+1}}_A = T^{a_{l+1}}_A((ω_k)_{k=1}^K). Optimizing over the ω_k amounts to searching for the best tangent space, while optimizing over the α_k amounts to finding the best projection onto the tangent space defined by the ω_k. Note that differentiating with respect to the α_k eases the problem by removing the non-linearity σ_l. Reversing the roles of the α_k and of the ω_k, i.e. fixing the α_k and computing v_{l+1}(x, (ω_k)_{k=1}^K), makes the problem harder to solve, because the non-linearity σ_l remains in the optimization problem. From another point of view, one can consider the second order of a_{l+1}(x) in (α_k, ω_k)_{k=1}^K at 0 to recover the same expression.
Indeed, taking the Taylor expansion in $(\alpha_k, \omega_k)_{k=1}^K$:
$$a_{l+1}(x) = a_{l+1}(x)\big|_{(\alpha_k, \omega_k)_{k=1}^K = 0} + \sum_{k=1}^K \omega_k \, b_{l-1}(x)^T \alpha_k + o\Big(\big(\|(\alpha_k)_{k=1}^K\| + \|(\omega_k)_{k=1}^K\|\big)^2\Big)$$

B.1 PROPOSITION 3.1

Denote by $M^+$ the generalized inverse of a matrix $M$. Then:
$$\delta W_l^* = \Big(\frac{1}{n} V^{\text{goal}}_l B_{l-1}^T\Big) \Big(\frac{1}{n} B_{l-1} B_{l-1}^T\Big)^+ \quad \text{and} \quad V_l^* = \Big(\frac{1}{n} V^{\text{goal}}_l B_{l-1}^T\Big) \Big(\frac{1}{n} B_{l-1} B_{l-1}^T\Big)^+ B_{l-1}$$
Proof. Consider the function $g(\delta W_l) = \|V^{\text{goal}}_l - \delta W_l B_{l-1}\|^2_{\text{Tr}}$. Then:
$$g(\delta W_l + H) = \|V^{\text{goal}}_l - \delta W_l B_{l-1} - H B_{l-1}\|^2_{\text{Tr}} = g(\delta W_l) - 2\big\langle V^{\text{goal}}_l - \delta W_l B_{l-1},\; H B_{l-1}\big\rangle_{\text{Tr}} + o(\|H\|)$$
$$= g(\delta W_l) - 2\,\text{Tr}\Big(\big(V^{\text{goal}}_l - \delta W_l B_{l-1}\big)^T H B_{l-1}\Big) + o(\|H\|) = g(\delta W_l) - 2\,\text{Tr}\Big(B_{l-1}\big(V^{\text{goal}}_l - \delta W_l B_{l-1}\big)^T H\Big) + o(\|H\|)$$
$$= g(\delta W_l) - 2\Big\langle \big(V^{\text{goal}}_l - \delta W_l B_{l-1}\big) B_{l-1}^T,\; H \Big\rangle_{\text{Tr}} + o(\|H\|)$$
By identification, $\nabla_{\delta W_l}\, g(\delta W_l) = -2\big(V^{\text{goal}}_l - \delta W_l B_{l-1}\big) B_{l-1}^T$, and:
$$\nabla_{\delta W_l}\, g(\delta W_l) = 0 \implies V^{\text{goal}}_l B_{l-1}^T = \delta W_l B_{l-1} B_{l-1}^T$$
Using the definition of the generalized inverse $M^+$:
$$\delta W_l^* = \Big(\frac{1}{n} V^{\text{goal}}_l B_{l-1}^T\Big) \Big(\frac{1}{n} B_{l-1} B_{l-1}^T\Big)^+$$

B.2 PROPOSITION 3.2

If $S$ is positive definite, consider the Cholesky decomposition $S = S^{1/2} S^{1/2\,T}$, denote by $R$ the rank of the matrix $N$ and by $S^{-1/2} N = \sum_{k=1}^R \lambda_k u_k v_k^T$ the SVD of the matrix $S^{-1/2} N$, with $\lambda_1 \geq ... \geq \lambda_R \geq 0$. Then the solution of (4) is given by $K^* = R$ and $(\alpha_k^*, \omega_k^*) = \big((S^{1/2\,T})^{-1} u_k,\; v_k\big)$ for $k = 1, ..., R$.
Moreover, for all $K \leq R$, with $\theta^{K,*}_{l\leftrightarrow l+1} := (\alpha_k^*, \omega_k^*)_{k=1}^K$ and, for $\gamma$ positive, $\theta^{K,*}_{l\leftrightarrow l+1}(\gamma) := (\gamma \alpha_k^*, \omega_k^*)_{k=1}^K$, we have:
$$\frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^{K,*}_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}} = \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}}\big\|^2_{\text{Tr}} - \sum_{k=1}^K \lambda_k^2 \qquad (10)$$
$$\frac{1}{n}\sum_{i=1}^n L\big(f_{\theta \oplus \theta^{K,*}_{l\leftrightarrow l+1}(\gamma)}(x_i), y_i\big) = \frac{1}{n}\sum_{i=1}^n L\big(f_\theta(x_i), y_i\big) - \sigma'_l(0)\,\frac{\gamma}{\eta}\sum_{k=1}^K \lambda_k^2 + o(\gamma)$$
Proof.
$$\arg\min_{\theta^K_{l\leftrightarrow l+1}} \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}} = \arg\min_{\theta^K_{l\leftrightarrow l+1}} -\frac{2}{n}\big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\rangle_{\text{Tr}} + \frac{1}{n}\big\|V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}} =: \arg\min_{\theta^K_{l\leftrightarrow l+1}} \frac{1}{n}\, g(\theta^K_{l\leftrightarrow l+1})$$
We write:
$$\frac{1}{n}\, g(\theta^K_{l\leftrightarrow l+1}) = -\frac{2}{n}\sum_{i=1}^n \sum_{k=1}^K v^{\text{goal}}_{l+1,\text{proj}}(x_i)^T \omega_k \, \alpha_k^T b_{l-1}(x_i) + \frac{1}{n}\sum_{k,j=1}^K \sum_{i=1}^n \alpha_k^T b_{l-1}(x_i)\, \omega_k^T \omega_j \, \alpha_j^T b_{l-1}(x_i)$$
$$= -2\sum_{k=1}^K \alpha_k^T \Big(\frac{1}{n}\sum_{i=1}^n b_{l-1}(x_i)\, v^{\text{goal}}_{l+1,\text{proj}}(x_i)^T\Big) \omega_k + \sum_{k,j=1}^K \omega_k^T \omega_j \, \alpha_k^T \Big(\frac{1}{n}\sum_{i=1}^n b_{l-1}(x_i)\, b_{l-1}(x_i)^T\Big) \alpha_j = -2\sum_{k=1}^K \alpha_k^T N \omega_k + \sum_{k,j=1}^K \omega_k^T \omega_j \, \alpha_k^T S \alpha_j$$
with
$$N = \frac{1}{n}\sum_{i=1}^n b_{l-1}(x_i)\, v^{\text{goal}}_{l+1,\text{proj}}(x_i)^T = \frac{1}{n} B_{l-1}\, V^{\text{goal}\;T}_{l+1,\text{proj}}, \qquad S = \frac{1}{n}\sum_{i=1}^n b_{l-1}(x_i)\, b_{l-1}(x_i)^T = \frac{1}{n} B_{l-1} B_{l-1}^T$$
Under review as a conference paper at ICLR 2023
Suppose $S$ is positive semi-definite, write $S = S^{1/2} S^{1/2\,T}$, set $\gamma_k = S^{1/2\,T}\alpha_k$, and let $S^{-1/2} N = \sum_{r=1}^R \lambda_r u_r v_r^T$ be the SVD of the matrix $S^{-1/2} N$. Then:
$$-\sum_{k=1}^K \alpha_k^T N \omega_k = -\sum_k \gamma_k^T S^{-1/2} N \omega_k = -\Big\langle \sum_k \gamma_k \omega_k^T,\; \sum_{r=1}^R \lambda_r u_r v_r^T \Big\rangle_{\text{Tr}}, \quad \text{with } \langle A, B\rangle_{\text{Tr}} = \text{Tr}(A^T B)$$
$$\sum_{k,j=1}^K \omega_k^T \omega_j \, \alpha_k^T S \alpha_j = \sum_{k,j} \omega_k^T \omega_j \, \gamma_j^T \gamma_k = \Big\|\sum_k \gamma_k \omega_k^T\Big\|^2_{\text{Tr}}, \quad \text{with } \|A\|^2_{\text{Tr}} = \text{Tr}(A^T A)$$
Then we have:
$$\arg\min_{K,\; \theta^K_{l\leftrightarrow l+1}} \frac{1}{n}\, g(\alpha, \omega) = \arg\min_{K,\; \alpha = (S^{1/2\,T})^{-1}\gamma,\; \omega} \Big\| S^{-1/2} N - \sum_{k=1}^K \gamma_k \omega_k^T \Big\|^2_{\text{Tr}}$$
The solution is then given by Eckart & Young (1936), choosing $K = \text{rank}(S^{-1/2} N)$ and $\sum_{k=1}^K \gamma_k \omega_k^T = \sum_{r=1}^K \lambda_r u_r v_r^T$. Choosing $K = R$ is the best option.
The minimization also gives the following properties at the optimum: for $k \neq j$, $\langle \gamma_k \omega_k^T,\, \gamma_j \omega_j^T \rangle_{\text{Tr}} = 0$, and
$$\Big\| S^{-1/2} N - \sum_{k=1}^K \gamma_k \omega_k^T \Big\|^2_{\text{Tr}} = \sum_{r=K+1}^R \lambda_r^2 = \big\|S^{-1/2} N\big\|^2_{\text{Tr}} - \Big\|\sum_{k=1}^K \gamma_k \omega_k^T\Big\|^2_{\text{Tr}}$$

We also have the following property:
$$\arg\min_{\theta^K_{l\leftrightarrow l+1}} \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}} = \arg\min_{H \geq 0}\; \arg\min_{\theta^K_{l\leftrightarrow l+1},\; \|V_{l+1}(\theta^K_{l\leftrightarrow l+1})\|_{\text{Tr}} = H} \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}}$$
$$= \arg\min_{H \geq 0}\; \arg\min_{\theta^K_{l\leftrightarrow l+1},\; \|V_{l+1}(\theta^K_{l\leftrightarrow l+1})\|_{\text{Tr}} = H} -\frac{2}{n}\big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\rangle_{\text{Tr}} + \frac{1}{n} H^2$$
$$= \arg\min_{H \geq 0}\; \arg\min_{\theta^K_{l\leftrightarrow l+1},\; \|V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*\|_{\text{Tr}} = 1} -\frac{2}{n}\, H \big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*\big\rangle_{\text{Tr}} + \frac{1}{n} H^2$$
with $V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*$ the solution of the second $\arg\min$ (i.e. for $H = 1$). The norm minimizing the first $\arg\min$ is then given by:
$$H^* = \big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*\big\rangle_{\text{Tr}}$$
Furthermore:
$$\min_{\theta^K_{l\leftrightarrow l+1}} \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^K_{l\leftrightarrow l+1})\big\|^2_{\text{Tr}} = \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}}\big\|^2_{\text{Tr}} - \sum_{r=1}^K \lambda_r^2 = \frac{1}{n}\big\|V^{\text{goal}}_{l+1,\text{proj}}\big\|^2_{\text{Tr}} - \frac{1}{n} H^{*\,2}$$
$$\implies H^* = \big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*\big\rangle_{\text{Tr}} = \sqrt{n \sum_{r=1}^K \lambda_r^2}$$
Since $V_{l+1}(\theta^{K,*}_{l\leftrightarrow l+1}) = H^*\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*$:
$$\big\langle V_{l+1}(\theta^{K,*}_{l\leftrightarrow l+1}),\, V^{\text{goal}}_{l+1,\text{proj}}\big\rangle_{\text{Tr}} = H^* \big\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})^*\big\rangle_{\text{Tr}} = H^{*\,2}$$
where the last equality is given by the optimization of $\|S^{-1/2} N - \sum_{k=1}^K u_k \omega_k^T\|^2_{\text{Tr}}$. So minimizing the scalar product $-\langle V^{\text{goal}}_{l+1,\text{proj}},\, V_{l+1}(\theta^K_{l\leftrightarrow l+1})\rangle_{\text{Tr}}$ at fixed norm of $V_{l+1}(\theta^K_{l\leftrightarrow l+1})$ is equivalent to minimizing the norm $\|V^{\text{goal}}_{l+1,\text{proj}} - V_{l+1}(\theta^K_{l\leftrightarrow l+1})\|^2_{\text{Tr}}$. Finally:
$$\frac{1}{n}\sum_{i=1}^n L\big(f_{\theta(\gamma_0 \delta W^*_{l+1}) \oplus (\alpha^*_k, \omega^*_k)_{k=1}^R(\gamma)}(x_i), y_i\big) = \frac{1}{n}\sum_{i=1}^n L\big(f_\theta(x_i), y_i\big) - \frac{\gamma}{\eta}\,\sigma'_l(0)\sum_{r=1}^K \lambda_r^2 - \frac{\gamma_0}{\eta}\,\frac{1}{n}\,\sigma'_l(0)\big\langle V_{l+1}(\delta W^*_{l+1}),\, V^{\text{goal}}_{l+1}\big\rangle_{\text{Tr}} + o(\max(\gamma, \gamma_0))$$
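Proposition 3.2 can be checked numerically on synthetic data: build $S$ and $N$ from random activities, take the SVD of $S^{-1/2}N$, and verify the Eckart–Young identity used above. A minimal numpy sketch, where the dimensions and random data are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, K = 200, 8, 5, 3

# B: post-activations b_{l-1}(x_i) as columns; Vgoal: desired activity updates
B = rng.standard_normal((d_in, n))
Vgoal = rng.standard_normal((d_out, n))

S = B @ B.T / n                      # S = (1/n) B B^T (positive definite a.s.)
N = B @ Vgoal.T / n                  # N = (1/n) B Vgoal^T

# S^{1/2} via Cholesky: S = L L^T, hence S^{-1/2} N = L^{-1} N
L = np.linalg.cholesky(S)
M = np.linalg.solve(L, N)            # M = S^{-1/2} N
U, lam, Vt = np.linalg.svd(M, full_matrices=False)

# Rank-K truncation: squared trace-norm error equals the discarded lambda_k^2
M_K = (U[:, :K] * lam[:K]) @ Vt[:K]
err = np.linalg.norm(M - M_K, 'fro') ** 2
assert np.isclose(err, np.sum(lam[K:] ** 2))

# Recover the new neurons: alpha_k = (S^{1/2 T})^{-1} u_k, omega_k = v_k
alpha = np.linalg.solve(L.T, U[:, :K])
```

The assertion is exactly Equation (10) up to the $\frac{1}{n}\|V^{\text{goal}}_{l+1,\text{proj}}\|^2_{\text{Tr}}$ offset.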

B.3 CHOLESKY DECOMPOSITION FOR S POSITIVE SEMI-DEFINITE

When the matrix $S$ is positive semi-definite but not positive definite, the following trick can be applied. Consider the SVD $S = U\Sigma V^T$; as $S$ is symmetric positive semi-definite, $V = U$. Perform the QR decomposition of the matrix $\sqrt{\Sigma}\,U^T = QR$, with $Q$ an orthogonal matrix and $R$ an upper triangular matrix. Since $R = Q^T \sqrt{\Sigma}\, U^T$ and $Q$ is orthogonal, one can remark that $R^T R = U \sqrt{\Sigma}\, Q Q^T \sqrt{\Sigma}\, U^T = U \Sigma U^T = S$. The factor $R^T$ can thus play the role of the Cholesky factor $S^{1/2}$.
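A small numpy sketch of this factorization trick, where the rank-deficient $S$ below is an arbitrary example on which a plain Cholesky decomposition would fail:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a rank-deficient positive semi-definite S (rank 3 in dimension 5)
A = rng.standard_normal((3, 5))
S = A.T @ A / 5.0

# SVD: S = U diag(sigma) U^T (U = V since S is symmetric PSD)
U, sigma, _ = np.linalg.svd(S)

# QR of sqrt(Sigma) U^T gives an upper-triangular R with R^T R = S
Q, R = np.linalg.qr(np.sqrt(sigma)[:, None] * U.T)
assert np.allclose(R.T @ R, S)   # R plays the role of the Cholesky factor
```

The identity $R^T R = S$ holds even though $S$ is singular, which is exactly what the argument above requires.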

B.4 PROPOSITION 3.3

Suppose $S$ is positive semi-definite and write $S = S^{1/2} S^{1/2\,T}$. Solving (7) is equivalent to taking $\omega_k = N^T \alpha_k$ and finding the $K$ first eigenvectors $\alpha_k$ associated with the $K$ largest eigenvalues $\lambda$ of the generalized eigenvalue problem:
$$N N^T \alpha_k = \lambda S \alpha_k$$
Proof. The LOBPCG problem is equivalent to maximizing the generalized Rayleigh quotient:
$$\alpha^* = \arg\max_\alpha \frac{\alpha^T N N^T \alpha}{\alpha^T S \alpha}, \qquad p^* = \arg\max_{p = S^{1/2\,T}\alpha} \frac{p^T S^{-1/2} N N^T S^{-1/2\,T} p}{p^T p} = \arg\max_{\|p\|=1} \big\|N^T S^{-1/2\,T} p\big\|, \qquad \alpha^* = S^{-1/2\,T} p^*$$
Considering the SVD $S^{-1/2} N = \sum_{r=1}^R \lambda_r u_r v_r^T$, we have $S^{-1/2} N N^T S^{-1/2\,T} = \sum_{r=1}^R \lambda_r^2 u_r u_r^T$, because $j \neq i \implies u_i^T u_j = 0$ and $v_i^T v_j = 0$. So maximizing the first formula is equivalent to taking $p_k^* = u_k$, hence $\alpha_k = S^{-1/2\,T} u_k$, and $N^T \alpha_k = N^T S^{-1/2\,T} u_k = \lambda_k v_k$.

We prove the second part of Corollary 3.2 by induction. For $m = m' = 1$:
$$a_{l+1}(x)_{t+1} = a_{l+1}(x)_t + \gamma\, V(\theta^{1,*}_{l\leftrightarrow l+1}, x) + o(\gamma)$$
$$v^{\text{goal}}_{l+1,t+1}(x) = v^{\text{goal}}_{l+1,t}(x) + \gamma\, \nabla_{a_{l+1}(x)} L(f_\theta(x), y)^T\, v(\theta^{1,*}_{l\leftrightarrow l+1}, x) + o(\gamma)$$
Adding the second neuron, we obtain the minimization problem:
$$\arg\min_{\alpha_2, \omega_2} \big\|V^{\text{goal}}_{l+1,t} - V_{l+1}(\alpha_2, \omega_2)\big\|_{\text{Tr}} + o(1)$$

B.5 SECTION "THEORY BEHIND GREEDY GROWTH" WITH PROOFS

One might wonder whether a greedy approach on layer growth might get stuck in a non-optimal state. We derive the following series of propositions in this regard. Since in this work we add neurons layer by layer independently, we study here the case of a single-hidden-layer network, to spot potential layer growth issues. For the sake of simplicity, we consider the task of least-squares regression towards an explicit continuous target $f^*$, defined on a compact set. That is, we aim at minimizing the loss:
$$\inf_f \sum_{x \in D} \|f(x) - f^*(x)\|^2$$
where $f(x)$ is the output of the neural network and $D$ is the training set.
Proposition B.1 (Greedy completion of an existing network). If $f$ is not $f^*$ yet, there exists a set of neurons to add to the hidden layer such that the new function $f'$ will have a lower loss than $f$.
One can even choose the added neurons such that the loss is arbitrarily well minimized.
Proof. The classic universal approximation theorem about neural networks with one hidden layer (Pinkus, 1999) states that for any continuous function $g^*$ defined on a compact set $\omega$, for any desired precision $\gamma$, and for any activation function $\sigma$ provided it is not a polynomial, there exists a neural network $g$ with one hidden layer (possibly quite large when $\gamma$ is small) and with this activation function $\sigma$, such that $\forall x,\; \|g(x) - g^*(x)\| \leq \gamma$. We apply this theorem to the case where $g^* = f^* - f$, which is continuous as $f^*$ is continuous and $f$ is a shallow neural network, hence a composition of linear functions and of the function $\sigma$, which we will suppose to be continuous for the sake of simplicity. We will also suppose that $f$ is real-valued for the sake of simplicity, but the result trivially extends to vector-valued functions (just concatenate the networks obtained for each output independently). We choose $\gamma = \frac{1}{10}\|f^* - f\|_{L^2}$, where $\langle a \,|\, b \rangle_{L^2} = \frac{1}{|\omega|}\int_{x \in \omega} a(x)\, b(x)\, dx$. This way we obtain a one-hidden-layer neural network $g$ with activation function $\sigma$ such that:
$$\forall x \in \omega,\; -\gamma \leq g(x) - g^*(x) \leq \gamma$$
i.e. $\forall x \in \omega,\; g(x) = f^*(x) - f(x) + a(x)$ with $\forall x \in \omega,\; |a(x)| \leq \gamma$. Then:
$$\forall x \in \omega,\; f^*(x) - \big(f(x) + g(x)\big) = -a(x)$$
$$\forall x \in \omega,\; \big(f^*(x) - h(x)\big)^2 = a^2(x) \qquad (12)$$
with $h$ being the function corresponding to a neural network obtained by concatenating the hidden-layer neurons of $f$ and $g$, and consequently summing their outputs. Hence:
$$\|f^* - h\|^2_{L^2} = \|a\|^2_{L^2} \leq \gamma^2 = \frac{1}{100}\|f^* - f\|^2_{L^2}$$
and consequently the loss is reduced indeed (by a factor of 100 in this construction). The same holds in expectation or sum over a training set, by choosing $\gamma = \frac{1}{10}\sqrt{\frac{1}{|D|}\sum_{x \in D}\|f(x) - f^*(x)\|^2}$, as Equation (12) then yields:
$$\sum_{x \in D} \big(f^*(x) - h(x)\big)^2 = \sum_{x \in D} a^2(x) \leq \frac{1}{100}\sum_{x \in D}\big(f^*(x) - f(x)\big)^2$$
which proves the proposition as stated.
For more general losses, one can consider the order-1 (linear) development of the loss and ask for a network $g$ that is close to (the opposite of) the gradient of the loss.
Proof of the additional remark. The proof in Pinkus (1999) relies on the existence of real values $c_n$ such that the $n$-th order derivatives $\sigma^{(n)}(c_n)$ are not $0$. Then, by considering appropriate values arbitrarily close to $c_n$, one can approximate the $n$-th derivative of $\sigma$ at $c_n$ and consequently the monomial of order $n$. This standard proof then concludes by density of polynomials in the set of continuous functions. Provided the activation function $\sigma$ is not a polynomial, these values $c_n$ can actually be chosen arbitrarily, in particular arbitrarily close to $0$. This corresponds to choosing neuron input weights arbitrarily close to $0$.
Proposition B.2 (Greedy completion by one single neuron). If $f$ is not $f^*$ yet, there exists a neuron to add to the hidden layer such that the new function $f'$ will have a lower loss than $f$.
Proof. From the previous proposition, there exists a finite set of neurons to add such that the loss will be decreased. In this particular setting of $L^2$ regression, or for more general losses if considering small function moves, this means that the function represented by this set of neurons has a strictly negative component over the gradient $g$ of the loss ($g = 2(f - f^*)$ in the case of $L^2$ regression). That is, denoting these $N$ neurons by $a_i \sigma(w_i \cdot x)$:
$$\Big\langle \sum_{i=1}^N a_i \sigma(w_i \cdot x) \,\Big|\, g \Big\rangle_{L^2} = K < 0, \quad \text{i.e.} \quad \sum_{i=1}^N \big\langle a_i \sigma(w_i \cdot x) \,\big|\, g \big\rangle_{L^2} = K < 0$$
Now, by contradiction, if there existed no neuron $i$ among these such that $\langle a_i \sigma(w_i \cdot x) \,|\, g \rangle_{L^2} \leq \frac{1}{N} K$, then we would have:
$$\forall i \in [1, N],\; \big\langle a_i \sigma(w_i \cdot x) \,\big|\, g \big\rangle_{L^2} > \frac{1}{N} K \implies \sum_{i=1}^N \big\langle a_i \sigma(w_i \cdot x) \,\big|\, g \big\rangle_{L^2} > K$$
hence a contradiction. Necessarily, at least one of the $N$ neurons thus satisfies $\langle a_i \sigma(w_i \cdot x) \,|\, g \rangle_{L^2} \leq \frac{1}{N} K < 0$ and decreases the loss when added to the hidden layer of the neural network representing $f$.
Moreover this decrease is at least $\frac{1}{N}$ of the loss decrease resulting from the addition of all $N$ neurons. As a consequence, our greedy approach will not get stuck in a situation where one would need to add many neurons simultaneously to decrease the loss: this is always feasible with a single neuron. One can express a lower bound on how much the loss has improved (for the best such neuron), but it is not a very good one without further assumptions on $f$. Proposition B.3 (Greedy completion by one infinitesimal neuron). The neuron in the previous proposition can be chosen to have arbitrarily small input weights. Proof. This is straightforward, as, following a previous remark, the neurons found to collectively decrease the loss can be supposed to all have arbitrarily small input weights. This detail is important in that our approach is based on the tangent space of the function $f$ and consequently manipulates infinitesimal quantities. Though we perform a line search in a second step and consequently add non-infinitesimal neurons, our first optimization problem relies on the linearization of the activation function by requiring the added neurons to have infinitely small input weights, without which it would be much harder to solve. This proposition confirms that such a neuron does indeed exist.
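Propositions B.2 and B.3 can be illustrated numerically: among a random pool of candidate neurons with small input weights, there is (with high probability) at least one whose addition strictly decreases the $L^2$ regression loss. A toy numpy sketch, where the target, the candidate pool and the weight scale are illustrative choices, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 256)        # training set D
f_star = np.sin(3.0 * x)                # continuous target f*
f = np.zeros_like(x)                    # current network output (no hidden neuron yet)
loss = np.mean((f - f_star) ** 2)

def sigma(z):
    return np.tanh(z)                   # a non-polynomial activation

# Candidate neurons with arbitrarily small input weights (cf. Proposition B.3);
# for each, the output weight a is set by 1-D least squares on the residual.
best = None
for _ in range(100):
    w, b = 0.01 * rng.standard_normal(2)
    h = sigma(w * x + b)
    if np.mean(h * h) < 1e-12:
        continue
    a = np.mean(h * (f_star - f)) / np.mean(h * h)  # optimal output weight
    new_loss = np.mean((f + a * h - f_star) ** 2)
    if best is None or new_loss < best:
        best = new_loss

assert best is not None and best < loss  # a single neuron suffices to decrease the loss
```

The decrease is strict whenever a candidate correlates with the residual, which happens generically for random input weights.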

C ADDITIONAL EXPERIMENTAL RESULTS

All the experiments are performed 20 times on the MNIST dataset, and the models are trained with Adam ($lr = 0.0001$, $\mu = 0$, batch size $= 100$) on 7 CPUs. For the classic model, neurons are initialized with Kaiming normal initialization. For our approach, we always start with a model of size [1, 1] initialized with Kaiming normal initialization, and we expand its architecture using Proposition 3.2 every 0.05 seconds. The batch size $n^l_{mb}$ for estimating $\theta^*_{l\leftrightarrow l+1}(\gamma)$ and $\delta W^*_{l+1}$ is $n^l_{mb} = 100$ at the first step and for every layer. When applying our method on every layer, $n^l_{mb}$ is updated as $n^l_{mb} \leftarrow n^l_{mb} \times \max(\text{neurons}_{l-1}, \text{neurons}_{l+1})$, where $\text{neurons}_l$ denotes the number of neurons at layer $l$. In the following section, we modify the training time between architecture growths (Figures 8 and 10). The $y$-axis is the accuracy on the test set.

C.1 CHANGING TRAINING TIME BETWEEN EACH ARCHITECTURE EXPANSION STEP

In this part we modify the training time between each architecture growth. We apply our method 8 times on each hidden layer; the final architecture is [222, 71].

C.3 EIGENVALUES

In this section we apply our method 15 times on each layer. After applying our method 8 times, the accuracy of the system does not increase significantly: we have $\mathbb{E}_{x,y\sim D}[z^T v^{\text{goal}}_l(x)] \approx 0$ for every vector $z$, as explained in part 5. The eigenvalues for the first and second hidden layers are shown in Figures 11 and 12. When performing the quadratic optimization (5), we obtain the optimal direction for $(\alpha^*_k, \omega^*_k)_{k=1}^R$. It is also possible to generate the new neurons randomly and only compute the amplitude factors. This second option has the benefit of being less time-consuming, but it projects the desired direction onto those random vectors, which degrades the accuracy score compared to the optimal solution defined in 3.1.
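This projection loss can be illustrated with a toy computation: the energy of a desired update matrix captured by $K$ random directions is at most (and typically well below) the energy captured by the top-$K$ singular directions of Proposition 3.2; the matrix below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((20, 20))   # stand-in for the desired update matrix
K = 3

# Optimal: energy captured by the top-K singular directions
s = np.linalg.svd(M, compute_uv=False)
opt_energy = np.sum(s[:K] ** 2)

# Random: project M onto K random orthonormal directions,
# keeping the optimal amplitudes (least-squares output weights)
Q, _ = np.linalg.qr(rng.standard_normal((20, K)))
rand_energy = np.linalg.norm(M.T @ Q, 'fro') ** 2  # ||projection onto span(Q)||^2

assert rand_energy <= opt_energy + 1e-9
```

By the Ky Fan inequality the random subspace can never capture more energy than the top-$K$ singular subspace, and for random directions the gap is usually large, which matches the accuracy degradation mentioned above.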



Figure 2: The expressivity bottleneck is measured as the difference between the optimal functional move $v^*$ given the architecture $A$ and the functional gradient $v^{\text{goal}}$. The former is the projection of the latter onto the tangent space $T_A$.

Figure 3: Linear interpolation

Figure 5: All graphics represent the same experiment from different perspectives. Left: after partitioning computational time into intervals of 0.1 seconds, we compute a linear interpolation of the accuracy value. Middle: accuracy against the number of epochs, where the time needed to compute the optimal neurons is not noticeable. Right: accuracy against computational time, where the durations of the Cholesky decompositions and their occurrence times are averaged over experiments, for better visualisation purposes.

Figure 6: Left: we plot the interpolation of accuracy on intervals of 0.1 seconds. Middle: zoom of the top left plot. Right: accuracy against the number of epochs; unlike the middle plot of Figure 5, our method is trained on fewer epochs than the classic model: with equal architecture, our method spends time computing the best neurons while the classic method continues its training.

Figure 7: Changing the tangent space with different values of the family $(\omega_k)_{k=1}^K$.

Figure 8: The different shades of red correspond to different training times between each architecture growth, in seconds.

Figure 10: Comparison between the classic training of architecture [1000, 1000] and our approach with architecture [929, 141]. All models are trained for 20 seconds; we plot the mean and standard deviation of the accuracy on the test set. For our method, the model is trained for 0.05 seconds between each architecture growth. Left: interpolation on intervals of size 0.01. Middle: accuracy as a function of epochs. Right: accuracy as a function of the mean time in each category: classic training and our approach.
Estimated decrease of the loss, i.e. $\sum_k \lambda_k^2$:

Figure 11: Mean and standard deviation of $\frac{1}{k}\sum_k \lambda_k^2$ for the first and second hidden layers, as a function of the number of architecture growths.

Figure 12: Mean and standard deviation of $\max_k(\lambda_k)$ for the first and second hidden layers, as a function of the number of architecture growths.

Figure 13: Experiment performed 20 times on the MNIST dataset: a starting model [1, 1] (in black) is initialized with Kaiming normal initialization, then duplicated to give the red and the green models. The structure of the red model is modified by our method to reach the structure [110, 51], while the green model is extended with random neurons. Then all models are trained for 5 seconds. The white space for the red model corresponds to the quadratic optimisation and the computation of the amplitude factor, while for the green model it corresponds only to the computation of the amplitude factor.


Before looking at the impact on the loss, let us prove Corollary 3.2. We are looking for the matrix $\delta W_l$ minimizing the last equality. Consider the span of matrices $\mathcal{S} = \{\delta W_l B_l \,|\, \delta W_l \text{ a matrix}\}$ and $P$ the projection matrix onto $\mathcal{S}$. Then $\delta W^*_{l+1} B_l$ of Proposition 3.1 minimizes the first norm and is the orthogonal projection of $V^{\text{goal}}_{l+1}$ onto the span $\mathcal{S}$, by definition of the orthogonal projection onto a linear span.

