ON THE COMPLEXITY OF NONSMOOTH AUTOMATIC DIFFERENTIATION

Abstract

Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with competing forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.

1. INTRODUCTION

Automatic evaluation of derivatives:

Algorithmic differentiation (AD) appeared around 60 years ago (Beda et al., 1959; Wengert, 1964) and has been continuously developed and used in many contexts since then; see Griewank et al. (1989); Griewank and Walther (2008) for a thorough discussion. Today, it is at the core of modern learning architectures (Rumelhart et al., 1986; LeCun et al., 2015; Baydin et al., 2018), to the point that training a neural network (NN) is ultimately a way to combine the outputs of AD. There are many practical and theoretical developments available nowadays: flexible and efficient numerical libraries (Abadi et al., 2016; Paszke et al., 2019; Bradbury et al., 2018), an implicit differentiation theory (Griewank and Faure, 2003; Griewank and Walther, 2008) and its extensions (Agrawal et al., 2019; Bai et al., 2019; Bolte et al., 2021; Blondel et al., 2021), the adjoint method (Farrell et al., 2013; Pearlmutter, 1995; Plessix, 2006) with application to neural ODEs (Chen et al., 2018), "piggyback" style differentiation of optimization algorithms (Griewank and Faure, 2003; Mehmood and Ochs, 2020; Bertrand et al., 2020; Lorraine et al., 2020), or differentiation of conjugate gradient algorithms (Gratton et al., 2014).

Backward algorithmic differentiation, or backpropagation, plays a particular role when smooth optimization tasks are at stake, as it evaluates the gradient of a function with a cost proportional to that of function evaluation, independently of dimension. This property, called the cheap gradient principle (Wolfe, 1982; Griewank and Walther, 2008), is at the root of the machine learning libraries revolution. According to the key complexity theory version of this result, due to Baur and Strassen (1983), the arithmetic complexity of evaluating the derivative of a rational function is at most 5 times the complexity of evaluating the function itself.
Extensions exist for smooth differentiable functions (Baur and Strassen, 1983; Griewank and Walther, 2008), but little is known about the nonsmooth case. The objective of this paper is precisely to present a simple, general, nonsmooth cheap conservative gradient principle and to explore other complexity results for evaluating nonsmooth derivatives. This extends the cheap gradient principle of smooth AD to the path differentiable world (Bolte and Pauwels, 2020b), which includes semi-algebraic and, more generally, definable functions (Coste, 2000a; b), a class that contains the vast majority of machine learning programs used in practice; see for example Bolte and Pauwels (2020b).

Nonsmooth AD & computational complexity:

Sorting values, pooling data, thresholding functions, or determining closest points are some of the most essential numerical decision operations. They are ubiquitous in machine learning and modern optimization. All of them are nonsmooth, and most of them have a very desirable feature: they are cheap to compute, much cheaper than smoothed surrogates. For instance, the famous ReLU activation in deep learning, whose role is to threshold negative values to zero to allow for the inactivity of neurons, requires only one bit of encoding in theory. On the other hand, other nonlinear activations potentially require auxiliary algorithms for their evaluation, incurring a higher computational cost. This simplicity of use also comes with the issue of finding an adequate way of training models, and thus of differentiating nonsmooth objects. The standard computational practice of AD consists in applying differential calculus rules directly to nonsmooth objects, replacing gradients by surrogates, typically Clarke subgradients. This is how AD is performed within TensorFlow, PyTorch or Jax. This approach has shown tremendous success (LeCun et al., 2015) and has been massively used for the last 10 years.
Yet, despite this empirical success, Barton et al. (2018) claimed that "there does not seem to exist [at this day] a true analogous reverse AD mode to compute generalized derivatives for nonsmooth functions", illustrating the difficulty of nonsmooth AD. Conservative gradients were introduced by Bolte and Pauwels (2020a; b); Bolte et al. (2021) as a faithful mathematical model capturing the formal application of calculus rules to subdifferentials. The reader unfamiliar with this notion may, in an ML context, think of conservative gradients as the outputs of calculus rules formally applied to Clarke subgradients and Jacobians. Our goal is to provide an adequate computational complexity theory for conservative calculus, a theory that will therefore match standard practical approaches. Among other possible first-order options offered by nonsmooth calculus, we also investigate the properties of directional derivatives and those of the Clarke subdifferential. For directional derivatives, our motivation comes from the fact that this nonsmooth operation has general calculus rules, while the Clarke subdifferential is central in terms of variational interpretation.

Contributions:

The main thesis of this work is that conservative gradients have computational properties similar to smooth derivatives, which are much more favorable than those of alternative nonsmooth oracles such as subgradients or directional derivatives.

• We provide a simple computational model for addressing the question of complexity theory of nonsmooth numerical programs.
• For the backward mode, we prove a cheap conservative gradient principle à la Baur-Strassen, generalizing the state of the art to nonsmooth programs modeling most NNs. We establish that, regardless of dimension, the computational cost of a conservative gradient is of the order of that of function evaluation. Our results provide a theoretical validation of the fact that the cost of backpropagation does not depend on the programs' smoothness.
• For the forward mode, we relate the computational cost of $p$ directional derivatives to that of $p \times p$ matrix multiplication. We provide lower complexity bounds that illustrate the limits to which this deficiency may be improved. This applies to existing nonsmooth AD frameworks (Khan and Barton, 2012; 2013).
• We establish that computing two distinct elements in the Clarke subdifferential of a given point is NP-hard for simple ReLU programs. This result also applies to the lexicographic subdifferential. In contrast, we show that the problem can be solved in polynomial time for conservative gradients. This reflects the computational difficulty of dealing with the Clarke subdifferential.
• A result of independent interest: deciding differentiability of a ReLU program at a point is NP-hard.

Relation with existing work:

Conservative gradients were introduced in Bolte and Pauwels (2020a; b) to model "formal subdifferentiation" used by practitioners and nonsmooth "backpropagation". They were further studied in Lewis and Tian (2021); Davis and Drusvyatskiy (2021); Bolte et al. (2021) and empirically investigated in Bertoin et al. (2021).
Computational complexity was only qualitatively considered. We provide a rigorous description of this aspect based on an arithmetic computational cost framework capturing programming with nondifferentiable components. The quest for a computationally cheap nonsmooth derivative has a long history in the AD literature. Existing works of Griewank (Griewank and Walther, 2008; Griewank, 2013; Griewank and Rojas, 2019; Griewank and Walther, 2020) are essentially based on piecewise smoothness structures (Scholtes, 2012). A cheap subgradient principle was also given in Kakade and Lee (2018), but it requires a very strong qualification condition. As illustrated in Griewank and Rojas (2019), such qualification conditions can be computationally hard to check in practice. In another research line, based on chain rules for directional derivatives, Khan and Barton (Khan and Barton, 2012; 2013; 2015; Barton et al., 2018) studied the vector forward mode of AD. In particular, they investigated the forward AD framework to evaluate elements of the lexicographic subdifferential (see Nesterov (2005)), which is contained in the Clarke subdifferential. In the worst case, the computational overhead ratio they obtain is proportional to the ambient dimension. This contrasts with our cheap gradient principle, whose constant is dimensionless. While these contributions are most relevant to nonsmooth AD, their applicability to large-scale learning models is limited, due to the central role of forward AD.

Organization of the paper:

We introduce elements of nonsmooth analysis and, in particular, the notion of conservative gradient used throughout this work in Section 2. Section 3 describes a general model of computation that allows one to express the computational cost and complexity of programs, functions and their conservative gradients. This section also presents an abstract program algorithmic differentiation framework.
These elements are gathered in Section 4 which presents our extension of the Baur-Strassen result with the cheap conservative gradient principle and its illustrations. To conclude, in Section 5, we describe computational lower bounds for evaluating directional derivatives and distinct subgradients for simple programs.

2. NONSMOOTH GENERALIZED GRADIENTS

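They are fundamental to expressing variations of nonsmooth losses in Machine Learning. Given a locally Lipschitz continuous function $F : \mathbb{R}^p \to \mathbb{R}$, the Clarke subdifferential of $F$ is

$$\partial^c F(x) = \operatorname{conv}\left\{ \lim_{k \to +\infty} \nabla F(x_k) \,:\, x_k \in \operatorname{diff}_F,\ x_k \xrightarrow[k \to +\infty]{} x \right\} \qquad (1)$$

where $\operatorname{diff}_F$ is the full measure set where $F$ is differentiable and $\nabla F$ is the standard gradient (Clarke, 1983). The subdifferential is set-valued, which we write $\partial^c F : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$. For each $x \in \mathbb{R}^p$, elements of $\partial^c F(x)$ are called Clarke subgradients of $F$. A selection $d$ in $\partial^c F$ is a function $d : \mathbb{R}^p \to \mathbb{R}^p$ such that for all $x \in \mathbb{R}^p$, $d(x) \in \partial^c F(x)$. If $F$ is $C^1$ then $\partial^c F = \{\nabla F\}$ everywhere, so the only possible selection is $d = \nabla F$. We will manipulate derived dictionaries, which typically provide a selection in either the Clarke subdifferential, or more general set-valued maps.

Example 1 For $\mathrm{ReLU} : t \mapsto \max(0, t)$, the set $\partial^c \mathrm{ReLU}(t)$ is $\{0\}$ if $t < 0$, $\{1\}$ if $t > 0$ and $[0, 1]$ if $t = 0$. We may define the function $\mathrm{ReLU}'$ as a selection in $\partial^c \mathrm{ReLU}$: $\mathrm{ReLU}'(t) = 1$ if $t > 0$, and $\mathrm{ReLU}'(t) = 0$ otherwise.

The chain rule, essential to AD, generally fails for Clarke subgradients. This is why we now consider the more flexible notion of conservative gradients.

Definition 2 (Conservative gradient) (Bolte and Pauwels, 2020a) Let $F : \mathbb{R}^p \to \mathbb{R}$ be a locally Lipschitz continuous function and $D_F : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$ a locally bounded, nonempty and graph-closed set-valued map. Then $D_F$ is a conservative gradient for $F$ if, for any absolutely continuous curve $\gamma : [0, 1] \to \mathbb{R}^p$,

$$\frac{d}{dt} F(\gamma(t)) = \langle v, \dot{\gamma}(t) \rangle \quad \text{for all } v \in D_F(\gamma(t)), \text{ for almost all } t \in [0, 1]. \qquad (2)$$

In this case, $F$ is called path differentiable. Conservative Jacobians are defined similarly. As above, $d : \mathbb{R}^p \to \mathbb{R}^p$ is a selection of $D_F$ if $d(x) \in D_F(x)$ for all $x \in \mathbb{R}^p$. A rich class of path differentiable functions is given by locally Lipschitz continuous semi-algebraic functions, with the Clarke subdifferential as a conservative gradient. Actually, virtually all functions used in machine learning are path differentiable (Bolte and Pauwels, 2020a; b). The most salient facts about path differentiable functions and their conservative gradients are:

• (Clarke subgradient) For all $x \in \mathbb{R}^p$, $\partial^c F(x) \subset \operatorname{conv}(D_F(x))$.
• (Gradient almost everywhere) Conservative gradients are gradients a.e. (Bolte and Pauwels, 2020a).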
• (First-order oracle) Selections in conservative gradients can be used as surrogate gradients, while preserving convergence guarantees (Bolte and Pauwels, 2020a; b; Bolte et al., 2021).

Conservative Jacobians can be composed while preserving conservativity (Bolte and Pauwels, 2020a), a feature that Clarke Jacobians do not enjoy: let $F : \mathbb{R}^p \to \mathbb{R}^m$ and $G : \mathbb{R}^m \to \mathbb{R}^l$ be locally Lipschitz continuous mappings, and let $d_F : \mathbb{R}^p \to \mathbb{R}^{m \times p}$ and $d_G : \mathbb{R}^m \to \mathbb{R}^{l \times m}$ be selections in conservative Jacobians for $F$ and $G$ respectively. Then the product mapping $x \mapsto d_G(F(x)) \times d_F(x)$ is a selection in a conservative Jacobian for $G \circ F$. The use of conservative Jacobians provides a very convenient framework to model AD in the nonsmooth case, see Bolte and Pauwels (2020a; b). A fundamental theorem is the following:

Theorem 1 (Path differentiable functions are ubiquitous) (Bolte and Pauwels (2020a)) Locally Lipschitz semi-algebraic (or definable) functions are path differentiable.
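To make the formal application of calculus rules concrete, here is a minimal Python sketch. It applies the selection ReLU' of Example 1 to the program t ↦ ReLU(t) − ReLU(−t), which computes the identity function. The names `relu`, `relu_prime`, `identity_program` and `formal_derivative` are illustrative choices, not notation from the paper.

```python
def relu(t):
    return max(0.0, t)


def relu_prime(t):
    # Selection in the Clarke subdifferential of ReLU, as in Example 1.
    return 1.0 if t > 0 else 0.0


def identity_program(t):
    # ReLU(t) - ReLU(-t) equals t for every real t.
    return relu(t) - relu(-t)


def formal_derivative(t):
    # Chain rule applied formally to each term:
    # d/dt ReLU(t) = ReLU'(t) and d/dt [-ReLU(-t)] = ReLU'(-t).
    return relu_prime(t) + relu_prime(-t)


for t in (-2.0, 0.0, 3.0):
    print(t, identity_program(t), formal_derivative(t))
```

Away from the origin the formal derivative equals 1, the true derivative. At t = 0 it returns 0, which is not a Clarke subgradient of the identity (whose Clarke subdifferential is {1}). This is the kind of output that conservative gradients are designed to model: correct almost everywhere, and still usable as a first-order oracle.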

3.1. CALCULUS MODEL, PROGRAMS, COMPUTATIONAL COST AND COMPLEXITY

A dictionary $\mathcal{D}$ is a finite set of real functions (e.g. $\{+, -, \times, /\}$); it is paired with $\mathcal{P}^0(\mathcal{D})$, a set of elementary programs implementing them in real arithmetic. Starting from $\mathcal{P}^0(\mathcal{D})$, we aim at capturing the notion of "program of programs" at any depth. As this is an inductive process, we call $k \in \mathbb{N}$ a program "level", which is simply an induction counter needed for consistency. Recursively, programs of level $k + 1$, in $\mathcal{P}^{k+1}(\mathcal{D})$, consist of combinations of outputs of programs of level $k$, in $\mathcal{P}^k(\mathcal{D})$. For example, if $P_1$ and $P_2$ are elementary programs in $\mathcal{P}^0(\mathcal{D})$, then the program which sums the outputs of $P_1$ and $P_2$ is of level 1. More precisely:

Let $p, q$ be input and output sizes respectively and $m \geq p + q$ a memory size. A predecessor relation is a set-valued map $\mathrm{pr} : \{1, \ldots, m\} \rightrightarrows \{1, \ldots, m\}$ such that for $i = 1, \ldots, m$:

• for $j \in \mathrm{pr}(i)$, $j < i$;
• $\mathrm{pr}(i)$ is empty if $i \leq p$ and nonempty otherwise.

An adapted program sequence $(g_i)_{i=p+1}^m$ in $\mathcal{P}^k(\mathcal{D})$ is a set of programs such that $g_i$ has $|\mathrm{pr}(i)|$ input arguments and a single output, for all $i = p + 1, \ldots, m$. Given $\left(p, q, m, \mathrm{pr}, (g_i)_{i=p+1}^m\right)$, the program given in Algorithm 1 is a level $k + 1$ program on $\mathcal{D}$.

Algorithm 1:
Program data: $\left(p, q, m, \mathrm{pr}, (g_i)_{i=p+1}^m\right)$.
Input: $x = (x_1, \ldots, x_p)$
1: for $i = p + 1, p + 2, \ldots, m$ do
2:   $x_i = g_i(x_{\mathrm{pr}(i)})$ where
3:   $x_{\mathrm{pr}(i)} = (x_j)_{j \in \mathrm{pr}(i)}$.
4: end for
Return: $y := (x_j)_{j=m-q+1}^m$.

The set of programs with dictionary $\mathcal{D}$ is $\mathcal{P}(\mathcal{D}) = \bigcup_{k \geq 0} \mathcal{P}^k(\mathcal{D})$. We shall see however that $\mathcal{P}^k(\mathcal{D}) = \mathcal{P}^1(\mathcal{D})$ for all $k$, using a modification of the computational graph.

A cost on a dictionary $\mathcal{D}$ is a nonnegative function on $\mathcal{D}$; it extends additively by induction to programs on $\mathcal{D}$ through the rule $\mathrm{cost}(P) = \sum_{i=p+1}^{m} \mathrm{cost}(g_i)$, where $P$ is a program on $\mathcal{D}$ as described in Algorithm 1. A direct example is the dictionary of arithmetic functions $\{+, -, \times, /\}$, together with addition or multiplication by fixed constants, denoted by $+c$ and $\times c$ respectively; see also Section A.1. Throughout the paper, we assume that dictionaries contain at least the operations $+$ and $\times$. Each program on $\mathcal{D}$ may be represented by a program in $\mathcal{P}^1(\mathcal{D})$ with the same cost, by expanding all subprograms until they reduce to an elementary program. Cost evaluation is thus well defined on such programs. As detailed in Appendix A.1, this model of computation is equivalently expressed using directed acyclic graphs.

To sum up, we have defined the set of programs $\mathcal{P}(\mathcal{D})$ on $\mathcal{D}$, which includes programs of programs. The programs $g_i$ in Algorithm 1 may be taken in $\mathcal{P}(\mathcal{D})$. The cost of a program is evaluated through the calls it makes to elementary programs in the dictionary.

Programs vs functions:

A program $P$ defines a unique input-output function $f$: we say that $P$ "computes" $f$, or "implements" $f$, and with a slight abuse of notation, we will identify $P$ and $f$ when there is no ambiguity (e.g. derivative of $P$). We use the equivalence relation $\sim$ to relate programs computing the same function. The equivalence classes correspond to functions expressible by programs with a given dictionary $\mathcal{D}$.
Given a function $f : \mathbb{R}^p \to \mathbb{R}^q$ and a program $P$ on dictionary $\mathcal{D}$, with $p$ inputs and $q$ outputs, we write $f = [P]$ to denote the fact that $P$ is in the equivalence class of programs computing $f$, that is, $P$ implements $f$.
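The program model of Algorithm 1 and its additive cost can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary `DICTIONARY`, the example data `PR` and `G`, and the function names are hypothetical, and indices are kept 1-based to follow the text.

```python
# Hypothetical dictionary: name -> (function, cost). Unit costs are an assumption.
DICTIONARY = {
    "add": (lambda a, b: a + b, 1),
    "mul": (lambda a, b: a * b, 1),
    "relu": (lambda a: max(0.0, a), 1),
}


def run_program(p, q, m, pr, g, x):
    """Evaluate a program as in Algorithm 1; memory cells are indexed 1..m."""
    mem = [None] + list(x) + [None] * (m - p)
    for i in range(p + 1, m + 1):
        func, _ = DICTIONARY[g[i]]
        mem[i] = func(*(mem[j] for j in pr[i]))
    return mem[m - q + 1 : m + 1]


def program_cost(p, m, g):
    """Additive cost: sum of the costs of g_i for i = p+1, ..., m."""
    return sum(DICTIONARY[g[i]][1] for i in range(p + 1, m + 1))


# Example program with p = 2 inputs, q = 1 output, m = 5 memory cells:
# x3 = x1 * x2, x4 = ReLU(x3), x5 = x4 + x1.
PR = {3: (1, 2), 4: (3,), 5: (4, 1)}
G = {3: "mul", 4: "relu", 5: "add"}

print(run_program(2, 1, 5, PR, G, (2.0, 3.0)))
print(program_cost(2, 5, G))
```

The predecessor relation is stored as a mapping from a cell index to the tuple of cells it reads, so the condition j < i for j in pr(i) guarantees that every argument is available when cell i is computed.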

Complexity of a function:

The complexity of a function $f$ over a dictionary $\mathcal{D}$ is the quantity $\mathrm{comp}(f, \mathcal{D}) = \inf \{\mathrm{cost}(P) \ \text{s.t.}\ P \in \mathcal{P}(\mathcal{D}),\ f = [P]\}$, the infimum being over all programs implementing $f$ on dictionary $\mathcal{D}$. It may be infinite; when it is finite, it is attained.

3.2. AUTOMATIC DIFFERENTIATION

We restrict ourselves to programs implementing real-valued functions, that is, Algorithm 1 with a single output, $q = 1$. Given a dictionary $\mathcal{D}$ of locally Lipschitz path differentiable functions, a derived dictionary is a set of functions $\mathcal{D}' \supset \mathcal{D}$ which extends $\mathcal{D}$ and contains the operations required to express at least one element in a conservative gradient for each of the functions in $\mathcal{D}$, for example, an element in the Clarke subdifferential. We also consider a cost function on $\mathcal{D}'$, which we denote by cost and which extends to programs over $\mathcal{D}'$. Given programs $g_i$ on $\mathcal{D}$, $i = p + 1, \ldots, m$, we define $d_i$, a derived program on $\mathcal{D}'$ with $|\mathrm{pr}(i)|$ inputs and outputs, which returns an element of a conservative gradient for $g_i$ (for instance a Clarke subgradient, or simply a gradient in the $C^1$ case). By $gd_i$, we denote a program on $\mathcal{D}'$ evaluating $(g_i(x), d_i(x))$ jointly for a given $x$. We denote by Algorithm 1' an extension of Algorithm 1 which additionally returns $w_i = d_i(x_{\mathrm{pr}(i)})$ for $i = p + 1, \ldots, m$, by replacing line 2 in Algorithm 1 with a call to $gd_i$ instead of $g_i$. The backward (resp. forward) AD program $\mathrm{backprop}(P)$ (resp. $\mathrm{forprop}(P)$) is defined as follows:

Algorithm 2: Algorithmic differentiation of $P$ as in Section 3.1
Input: variables $(x_i)_{i=1}^p$
Forward evaluation with derivatives: evaluate $w_i = d_i(x_{\mathrm{pr}(i)})$, $i = p + 1, \ldots, m$, with Algorithm 1': Algorithm 1 with $gd_i$ instead of $g_i$ on line 2.

Note that Algorithm 2 starts with Algorithm 1', i.e., Algorithm 1 with $gd_i$ instead of $g_i$ on line 2. The computational cost of $gd_i$, denoted $\mathrm{cost}(gd_i)$, should be thought of as an exogenous parameter: it may model, for instance, the use of underlying software libraries or the hardware properties.
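The two differentiation modes can be sketched in Python as follows. This is a minimal sketch, assuming that the backward and forward steps following the forward evaluation are the standard reverse and forward accumulation rules; the derived dictionary `DERIVED`, the example data `PR` and `G`, and all function names are hypothetical.

```python
# Hypothetical derived dictionary: each entry plays the role of gd_i and
# returns (value, tuple of partial derivatives), one partial per predecessor.
DERIVED = {
    "add": lambda a, b: (a + b, (1.0, 1.0)),
    "mul": lambda a, b: (a * b, (b, a)),
    "relu": lambda a: (max(0.0, a), (1.0 if a > 0 else 0.0,)),
}


def forward_with_derivatives(p, m, pr, g, x):
    """Algorithm 1': evaluate the program and store w_i = d_i(x_pr(i))."""
    mem = [None] + list(x) + [None] * (m - p)
    w = {}
    for i in range(p + 1, m + 1):
        mem[i], w[i] = DERIVED[g[i]](*(mem[j] for j in pr[i]))
    return mem, w


def backprop(p, m, pr, g, x):
    """Reverse accumulation: one multiplication and one addition per edge."""
    _, w = forward_with_derivatives(p, m, pr, g, x)
    v = [0.0] * (m + 1)
    v[m] = 1.0
    for i in range(m, p, -1):
        for k, j in enumerate(pr[i]):
            v[j] += v[i] * w[i][k]
    return v[1 : p + 1]


def forprop(p, m, pr, g, x):
    """Forward accumulation: a vector of size p is propagated along each edge."""
    _, w = forward_with_derivatives(p, m, pr, g, x)
    dx = {i: [1.0 if c == i else 0.0 for c in range(1, p + 1)]
          for i in range(1, p + 1)}
    for i in range(p + 1, m + 1):
        dx[i] = [sum(w[i][k] * dx[j][c] for k, j in enumerate(pr[i]))
                 for c in range(p)]
    return dx[m]


# x3 = x1 * x2, x4 = ReLU(x3), x5 = x4 + x1
PR = {3: (1, 2), 4: (3,), 5: (4, 1)}
G = {3: "mul", 4: "relu", 5: "add"}
print(backprop(2, 5, PR, G, (2.0, 3.0)))
print(forprop(2, 5, PR, G, (2.0, 3.0)))
```

In this sketch the backward loop touches each edge of the computational graph once with scalar operations, while the forward loop manipulates vectors of length p on each edge, which is where a dependence on the dimension can enter.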

4. COMPUTATIONAL COMPLEXITY OF NONSMOOTH AD

We now evaluate the complexity of the forprop and backprop operations for conservative gradients in the path-differentiable case, which encompasses, as mentioned earlier, all semi-algebraic and definable locally Lipschitz functions. We show, in particular, that backpropagation with conservative gradients has a computational overhead ratio that is independent of the dimension. This is in contrast with the best known algorithmic oracles for the Clarke subdifferential (see Khan and Barton (2012; 2013; 2015); Barton et al. (2018) and Appendix A.2), whose computational overhead ratio scales linearly with the dimension.

Theorem 2 (Complexity of nonsmooth AD) Let $P$ be a program over a dictionary $\mathcal{D}$ of path-differentiable functions with $p$ inputs as in Algorithms 1 and 2. Then, the corresponding function $[P]$ is path differentiable, and there is a conservative gradient $D_P$ for the function $[P]$ such that:

(i) (Cost of backward mode) At each input point $x \in \mathbb{R}^p$, the output of program $\mathrm{backprop}(P)$ is in $D_P(x)$ and we have $\mathrm{cost}(\mathrm{backprop}(P)) \leq \omega_b \, \mathrm{cost}(P)$, where

$$\omega_b = \max_{i = p+1, \ldots, m} \left\{ \frac{\mathrm{cost}(gd_i) + 2 \max(\mathrm{cost}(+), \mathrm{cost}(\times)) \, |\mathrm{pr}(i)|}{\mathrm{cost}(g_i)} \right\}. \qquad (3)$$

(ii) (Cost of forward mode) At each input point $x \in \mathbb{R}^p$, the output of program $\mathrm{forprop}(P)$ is in $D_P(x)$ and we have $\mathrm{cost}(\mathrm{forprop}(P)) \leq \omega_f \, \mathrm{cost}(P)$, where

$$\omega_f = \max_{i = p+1, \ldots, m} \left\{ \frac{\mathrm{cost}(gd_i) + p \, |\mathrm{pr}(i)| \, \mathrm{cost}(\times) + p \, (|\mathrm{pr}(i)| - 1) \, \mathrm{cost}(+)}{\mathrm{cost}(g_i)} \right\}.$$

There is a dissymmetry between the two modes since the constant $\omega_b$ is independent of the dimension $p$. This is why property (i) is sometimes called the "cheap conservative gradient principle", extending the classical smooth one which was derived by Baur and Strassen (1983) for real rational functions. Theorem 2 describes worst-case upper bounds (maximum over $i$), which are tight, for example, if $\mathrm{pr}(i)$ and the costs of $g_i$ and $gd_i$ are independent of $i$. We will consider several examples now.
The class of ReLU programs:

Let $\mathcal{D}_{\mathrm{ReLU}}$ be the dictionary composed of elementary arithmetic operations, logarithm, exponential and the ReLU function:

$$\mathcal{D}_{\mathrm{ReLU}} := \{+, \times, +c, \times c, \mathrm{inv}, \exp, \log, \mathrm{ReLU}\}. \qquad (4)$$

A ReLU program $P$ is a program with dictionary $\mathcal{D}_{\mathrm{ReLU}}$; it can be expressed in a compositional form (Section 3.1) with program sequences in $\mathcal{D}_{\mathrm{ReLU}}$. Note that this yields path differentiable functions.

Assumption 1 (Computational cost) In Algorithm 2, define the dictionary $\mathcal{D}'_{\mathrm{ReLU}} := \mathcal{D}_{\mathrm{ReLU}} \cup \{\mathrm{ReLU}'\}$ as in Example 1; then, all operations from $\mathcal{D}'_{\mathrm{ReLU}}$ have unit cost (see Remark 1).

Corollary 1 (Backprop complexity of ReLU programs) Let $P$ be a ReLU program. Under Assumption 1, we have $\mathrm{cost}(\mathrm{backprop}(P)) \leq 5 \, \mathrm{cost}(P)$.

This extends to more complex cost weighting schemes (Remark 1) and to selection functions, which capture virtually all losses in ML (Remark 2).

Remark 1 (On refined cost systems) Unit cost in Assumption 1 gives a simple interpretation to Corollary 1: the cost of a program is the total number of numerical operations. This rough estimate of computational complexity could be refined with different weighting schemes. However, the obtained constant 5 is robust to many different weighting choices, far beyond Assumption 1. We detail an example in Appendix B.2 for which the cost of all smooth nonlinear operations different from $+$ or $\times$ is $c_{\mathrm{nonlin}} \geq 1$ and we model the cost of sign branching in the computation of ReLU and ReLU' with a constant $c_{\mathrm{ReLU}} \geq 0$. This yields the same constant as in Corollary 1.

Remark 2 (Beyond ReLU programs) Many other dictionaries could be considered. ReLU is an example chosen for its simplicity, but Corollary 1 would hold similarly (with the same constant 5) for many different nonsmooth activations or components such as absolute value, max-pooling, the ELU function, and the $\ell_1$ and $\ell_\infty$ norms.
Similar results could be developed for the class of selection functions, which encompasses the vast majority of ML building blocks (see Bolte and Pauwels (2020b)). This is sketched in Appendix B.3.

Chaining backpropagation derived programs:

Our approach is flexible enough to describe "programs of programs" and backpropagation chaining. Let $P$ be a program as in Algorithm 1, with adapted ReLU program sequence $(g_i)_{i=p+1}^m$. If $\mathrm{cost}(g_i) \gg |\mathrm{pr}(i)|$, then $g_i$ is a "long program", with many operations per input. We may set $gd_i = \mathrm{backprop}(g_i)$ using Algorithm 2, $i = p + 1, \ldots, m$. From Corollary 1, we have $\mathrm{cost}(gd_i) / \mathrm{cost}(g_i) \leq 5$, and for long programs $\omega_b \simeq 5$ in Theorem 2. This illustrates the versatility of our approach as it captures the complexity of chaining backprop operations, the resulting estimate being quite sharp in the regime of long programs.

Beyond backpropagation:

Programs may be differentiated by other means than backpropagation. Examples include forward propagation, with applications in optimization and algorithmic unrolling (Mehmood and Ochs, 2020; Lorraine et al., 2020; Maclaurin et al., 2015), implicit differentiation (Agrawal et al., 2018; Winston and Kolter, 2020; Bai et al., 2019; Bolte et al., 2021) with application in optimization and hyperparameter optimization (Bertrand et al., 2020), adjoint differentiation (Plessix, 2006) in programs with components involving ordinary differential equations (Courtier and Rabier, 1997; Chen et al., 2018), differentiation of conjugate gradient (Gratton et al., 2014), the Cholesky algorithm (Smith, 1995), and approximation of Jacobian matrices involving a non-uniform FFT (Wang and Fessler, 2021). Let $P$ be a program as in Algorithm 1. Theorem 2 relates the complexity of combining derived programs in Algorithm 2 to the following quantities, for $i = p + 1, \ldots, m$:

• $\mathrm{cost}(gd_i) / \mathrm{cost}(g_i)$: the "computational overhead ratio".
• $|\mathrm{pr}(i)| \, \mathrm{cost}(\times) / \mathrm{cost}(g_i)$: the ratio between multiplication cost and average cost per input argument.

The first quantity depends on the technique used to obtain $gd_i$. The second quantity is typically less than 2 (at least one arithmetic operation per input) and becomes negligible for long programs (many operations per input). For example in Mehmood and Ochs (2020); Lorraine et al. (2020); Maclaurin et al. (2015), for one $i$, the program $gd_i$ is an optimization algorithm in $\mathbb{R}^p$, a long program differentiated using forward propagation. The corresponding overhead ratio is in this case $3p + 5$ (Theorem 2). If combined with an outer backward pass, we obtain a dimension-dependent overhead ratio, in contrast with full backward differentiation. Our model provides computational cost estimates for mixed techniques, here a combination of inner forward and outer backward propagation.
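The constants of Theorem 2 are simple to evaluate once the costs of each g_i and gd_i are fixed. The Python sketch below does this for a unit-cost ReLU dictionary. The table `UNIT_COST_NODES` is an assumption made for illustration: it counts the operations of one plausible implementation of each derived program (for example, the derivative of inv obtained from two extra multiplications) and is not taken from the paper's appendix.

```python
def omega_b(nodes, cost_add=1.0, cost_mul=1.0):
    """Backward constant of Theorem 2; nodes = (cost g, cost gd, |pr|) triples."""
    return max((c_gd + 2.0 * max(cost_add, cost_mul) * n) / c_g
               for c_g, c_gd, n in nodes)


def omega_f(nodes, p, cost_add=1.0, cost_mul=1.0):
    """Forward constant of Theorem 2 for input dimension p."""
    return max((c_gd + p * n * cost_mul + p * (n - 1) * cost_add) / c_g
               for c_g, c_gd, n in nodes)


# Assumed costs (cost of g, cost of gd, number of predecessors), unit-cost model.
UNIT_COST_NODES = {
    "add": (1.0, 1.0, 2),    # partial derivatives are the constants (1, 1)
    "mul": (1.0, 1.0, 2),    # partial derivatives are the inputs themselves
    "addc": (1.0, 1.0, 1),
    "mulc": (1.0, 1.0, 1),
    "inv": (1.0, 3.0, 1),    # 1/x, then -(1/x) * (1/x): two extra operations
    "exp": (1.0, 1.0, 1),    # the derivative reuses the value
    "log": (1.0, 2.0, 1),    # one extra call to inv
    "relu": (1.0, 2.0, 1),   # one extra call to ReLU'
}

nodes = list(UNIT_COST_NODES.values())
print(omega_b(nodes))
for p in (10, 100, 1000):
    print(p, omega_f(nodes, p))
```

With these assumed costs the backward constant evaluates to 5, in line with Corollary 1, whereas the forward constant grows linearly with p.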

5. ON THE COMPUTATIONAL HARDNESS OF GENERALIZED GRADIENTS

Let $P$ and $DP$ be two programs such that $DP$ evaluates jointly $P$ and a derivative of $P$. In the sequel, we use the term (computational) overhead ratio of $DP$ to denote the quantity $\mathrm{cost}(DP) / \mathrm{cost}(P)$, and computational overhead ratio of derivatives of $P$ to denote the quantity $\mathrm{comp}(DP) / \mathrm{cost}(P)$. As established in Theorem 2, this ratio is dimensionless in the case of backpropagation with conservative gradients. Are there other ways to compute cheap nonsmooth gradients? Toward an answer to this question, we discuss this ratio for other nonsmooth differentiation oracles: directional derivatives (for which we relate worst-case complexity to that of matrix multiplication) and lexicographic derivatives with forward AD (with an overhead ratio of order $p$, see Barton et al. (2018)). As for the Clarke subdifferential, we prove the hardness of subgradient enumeration. Our motivation to estimate the complexity of these particular types of derivatives (directional, lexicographic and Clarke) is that they serve as a basis for alternative implementable AD approaches (see Barton et al. (2018) and references therein), and are thus strategies competing with conservative gradient backpropagation. The results presented below do not provide a definitive answer, but they strongly suggest that backpropagation of conservative gradients has a much more favorable complexity.

5.1. THE OVERHEAD RATIO FOR EVALUATING p DIRECTIONAL DERIVATIVES

Given $G : \mathbb{R}^p \to \mathbb{R}$ locally Lipschitz and $x, d \in \mathbb{R}^p$, the directional derivative of $G$ at $x$ in direction $d$ is given by $\lim_{t \downarrow 0} (G(x + td) - G(x)) / t$ when the limit exists. This section considers a family of functions with $p$ inputs and $q$ real parameters, represented by a locally Lipschitz function $F : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$, for which we investigate the hardness of evaluating $p$ directional derivatives. The function $F$ may describe, for instance, a ReLU feedforward neural network empirical loss, parameterized by $q$ real weights, with $p$ inputs. For functions represented by ReLU programs, we prove an overhead ratio of order $p^{\omega - 2 + o(1)}$, where $\omega$ is the matrix multiplication exponent (see definition below). In all rigor, it is not known whether $\omega > 2$ or $\omega = 2$, so the derived ratio could be essentially dimensionless (if $\omega = 2$), though all practical evidence is against this so far. The best known upper bound is $\omega < 2.37$, and in practice the matrix multiplication exponent is closer to 2.7, both corresponding to a dimension-dependent overhead, in contrast with the smooth case, where the overhead ratio to evaluate $p$ directional derivatives (essentially a gradient) is essentially dimensionless.

Complexity of matrix multiplication:

Throughout this section, we set $\mathcal{D} = \{+, \times, +c, \times c\}$, with unit costs (corresponding to polynomial functions). Denote by $c(p)$ the complexity of $p \times p$ matrix multiplication. More precisely, if $f : \mathbb{R}^{p \times p} \times \mathbb{R}^{p \times p} \to \mathbb{R}^{p \times p}$ is such that $f(A, B) = AB$ for all square matrices $A, B \in \mathbb{R}^{p \times p}$, we have $c(p) = \mathrm{comp}(f, \mathcal{D})$, which we may write $c(p) = p^{\omega + o(1)}$, where $\omega$ is called the matrix multiplication exponent. Note that $c(p) \geq p^2$, as one needs at least one operation for each of the $2p^2$ entries.

Directional derivatives:

Given a function $F : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$, we denote by $F'_1 : \mathbb{R}^p \times \mathbb{R}^q \times \mathbb{R}^{p \times p} \to \mathbb{R}^p$ the function which associates to $x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$ and a matrix $A \in \mathbb{R}^{p \times p}$ the $p$ directional derivatives with respect to the $x$ variable, for fixed $y$, in the directions given by the columns of $A$.
The proof of the following theorem is given in Section C.

Theorem 3 (Computational ratio for directional derivatives) There exists a function $F : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ and a program $P_F$ implementing $F$ on dictionary $\{+, \times, \mathrm{ReLU}, +c, \times c\}$ (all operations have unit cost), such that for any program $P'$ implementing $(y, A) \mapsto F'_1(0, y, A)$ on the derived dictionary $\{+, \times, \mathrm{ReLU}, \mathrm{ReLU}', +c, \times c\}$,

$$\mathrm{cost}(P') / \mathrm{cost}(P_F) \geq (c(p) - 5p) / (40 p^2) = p^{\omega - 2 + o(1)}. \qquad (5)$$

Theorem 3 involves $q$ parameters: parametric dependency is required to express hardness. Indeed, for some parameter values, computation may be trivial (e.g. null values). Alternatively, it states that for some values of the $q$ parameters, computing $p$ directional derivatives has a cost as in (5). The bound in (5) is sharp up to multiplicative constants for linear ReLU networks, see Remark 5 in Appendix A.2.

Consequences:

Our overhead estimate is roughly $p^{\omega - 2}$, and it constitutes a bottleneck: a "cheap nonsmooth $p$ directional derivatives principle" would imply easy matrix multiplication, to the point that $\omega = 2$. Since the seminal work of Strassen et al. (1969), it is known that $\omega \leq \log_2(7) \simeq 2.81$. Determining the precise exponent $\omega$ has been an object of intense research (Robinson, 2005). Asymptotically, one has $2 \leq \omega < 2.373$, see Williams (2012); Le Gall (2014), the best known bound being given in Alman and Williams (2021). In this case, the estimate in (5) is roughly $p^{0.373}$. These estimates may involve non-constructive existence proofs, or suffer from the curse of recursion: meaningful efficiency occurs only for inaccessible sizes. According to Dumas and Pan (2016), for values $p \leq 10^6$ the most efficient practical algorithms have a complexity of the order $p^{2.774}$, resulting in an overhead of order $p^{0.774}$, in contrast with the constant overhead incurred by nonsmooth backpropagation. More discussion is given in Appendix A.2.
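The orders of magnitude quoted above can be checked with a few lines of Python. This is plain arithmetic on the exponents stated in the text, ignoring constants and the o(1) term; the function name `overhead` is illustrative.

```python
import math


def overhead(p, omega):
    """Order of magnitude p^(omega - 2) of the overhead ratio in (5)."""
    return p ** (omega - 2.0)


exponents = [
    ("Strassen, log2(7)", math.log2(7)),
    ("asymptotic bound", 2.373),
    ("practical algorithms", 2.774),
]

for name, omega in exponents:
    ratios = [round(overhead(p, omega), 1) for p in (10**2, 10**4, 10**6)]
    print(name, round(omega - 2.0, 3), ratios)
```

Even for the smallest exponent, the estimated overhead keeps growing with p, whereas the backward mode constant of Theorem 2 does not depend on p.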
Comparison with the smooth case:

If $F$ is $C^1$, evaluating $p$ directional derivatives is comparatively easier because $F'(x, d) = \langle \nabla F(x), d \rangle$ for all $x, d \in \mathbb{R}^p$. Hence, one may first evaluate $\nabla F$ (once), at a cost similar to that of $F$ (cheap gradient principle), and then evaluate $p$ scalar products, at a cost $p^2$. If the cost of $F$ is of order $p^2$ at least (for example, $F$ is a feedforward neural network with $p$ inputs and a layer of $p$ hidden neurons), then this is overall proportional to the cost of computing $F$.
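The contrast between the two cases can be illustrated with a small Python sketch. The functions `smooth` and `nonsmooth` are toy examples chosen for illustration, and `one_sided_quotient` approximates a directional derivative by a difference quotient with a small step; none of these names come from the paper.

```python
def relu(t):
    return max(0.0, t)


def one_sided_quotient(f, x, d, t=1e-6):
    """Approximate the directional derivative of f at x in direction d."""
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (f(shifted) - f(x)) / t


def smooth(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]


def smooth_grad(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]


def nonsmooth(x):
    return relu(x[0] - x[1])


directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Smooth case: one gradient evaluation, then one scalar product per direction.
grad = smooth_grad([1.0, 2.0])
smooth_dd = [sum(gi * di for gi, di in zip(grad, d)) for d in directions]

# Nonsmooth case at the origin: the map d -> F'(0, d) is not linear, so no
# single vector reproduces all directional derivatives by scalar products.
nonsmooth_dd = [one_sided_quotient(nonsmooth, [0.0, 0.0], d) for d in directions]

print(smooth_dd)
print(nonsmooth_dd)
```

For the toy nonsmooth function, the derivatives in directions d and −d are not opposite to each other, so the scalar product shortcut of the smooth case is not available at such points.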

5.2. COMPUTING CLARKE SUBGRADIENTS USING FORWARD AUTOMATIC DIFFERENTIATION

In Khan and Barton (2012; 2013; 2015), several automatic differentiation strategies are proposed to evaluate elements of the Clarke subdifferential. These approaches are based on directional (Shapiro, 1990) and lexicographic derivatives (Nesterov, 2005), which satisfy a chain rule under structural assumptions. The chain rule may be implemented using the vector forward mode of automatic differentiation (Barton et al., 2018), which suffers from a computational overhead scaling linearly in $p$, contrary to the reverse mode in Theorem 2. Reducing this factor is an open question, even for compositional functions involving only univariate nonsmoothness such as absolute value (Khan, 2018). More details are given in Appendix A.2.1.

5.3. COMPUTATIONAL HARDNESS OF SUBGRADIENT ENUMERATION

We investigate in this section the hardness of finding subgradients for programs defined on the elementary dictionary $D_0 = \{+, -, \mathrm{ReLU}\}$ with unit costs. Let us denote by $\mathcal{P}(D_0)$ the set of such programs. We will, with a slight abuse of notation, identify a program $P \in \mathcal{P}(D_0)$ with the function it computes to state our complexity result (proof in Section D).

Theorem 4 (Clarke subgradients and NP-Hardness) (i) The problem of finding two distinct subgradients in the Clarke subdifferential of $P \in \mathcal{P}(D_0)$ at a given input (or one single subgradient if it is reduced to a singleton) is NP-hard. (ii) Deciding if $P \in \mathcal{P}(D_0)$ is not differentiable at some given input is NP-hard.

Remark 3 In Theorem 4, numerical parameters and inputs are constrained to be in $\{-1, 0, 1\}$, so that the hardness result does not depend on numerical representation and only involves program size (strong NP-hardness). See Appendix D for more details.

The above problems (i)-(ii) enter the field of computational complexity as we consider programs $P \in \mathcal{P}(D_0)$ with a natural notion of size, given by their cost, $\mathrm{cost}(P)$, the number of operations (recall that we assumed unit costs). Since the considered programs implement piecewise linear functions, it follows from (Barton et al., 2018, Proposition 2.7) that our hardness result also holds for the lexicographic subdifferential (Nesterov, 2005), which reduces in this case to the set of neighboring gradients (see Section D).

The counterpart of the above problem for AD conservative gradients as in Definition 2 is tractable, illustrating a major computational difference between the Clarke subdifferential and the AD conservative gradient. The proof is in Section D.4, by reduction to a graph shortest path problem.
Proposition 1 (Finding two elements in autodiff conservative gradients is tractable) Given $P \in \mathcal{P}(D_0)$, with conservative gradient $D_P$ given by Theorem 2, finding two elements in $D_P(x)$ at a given input $x$ (or one single element if $D_P(x)$ is a singleton) is solvable in polynomial time.

6. CONCLUSION

We extended the "cheap gradient" principle to nonsmooth automatic differentiation with a flexible version of Baur-Strassen's result: the overhead ratio of conservative gradients is independent of the dimension. On the other hand, we showed that the potential gain in efficiency of forward AD for multiple directional derivatives is limited due to an intrinsic connection to matrix multiplication. Finally, we have shown that for simple ReLU networks, the enumeration of Clarke subgradients is computationally hard, in contrast to the enumeration of conservative gradients. The global picture is significantly different from the smooth case, with a well understood "cheap gradient" principle that yields "cheap $p$ directional derivatives", illustrating the specificities of nonsmoothness. Our results confirm the centrality of conservative gradients in nonsmooth AD and machine learning: they generalize gradients with a clear "cheap principle", contrary to concurrent notions. An important open question in this context is the complexity of subgradients, or, in other words, the existence of a "cheap subgradient principle". We conjecture a negative answer in general.

support of ANR Chess, grant ANR-17-EURE-0010, TSE-P and the Centre Lagrange. We thank our collaborators in the Thales LAS France, especially Andrei Purica, for helpful comments. We are grateful to Serge Gratton, Pierre Weiss and Pierre Boudier for useful reference suggestions.

Published as a conference paper at ICLR 2023

This is the appendix for "On the complexity of nonsmooth automatic differentiation".

We represent programs using the DAG representation as in Remark 4. Let us define a simple dictionary $D := \{+, \times\}$ and introduce a level 0 elementary program $P_0$ such that $P_0(a, b) = a + b$, meaning that $P_0$ computes the quantity $a + b$. $P_0$ is identified with $+$ from the dictionary. We also introduce a level 1 program $P_1$ such that $P_1(a, b, c) = a \times (b + c)$.
We can construct an equivalent level 1 program $Q_1$ such that $Q_1(a, b, c) = a \times b + a \times c$. In this case, $P_1$ and $Q_1$ are equivalent, that is $[P_1] = [Q_1]$, since they compute the same quantity. The level 2 program $P_2$ is such that $P_2(a, b, c, d) = (a + b) \times (c + d) = Q_1(a, c, d) + P_1(b, c, d)$ and uses the level 1 programs $Q_1$ and $P_1$ in its computation nodes. The Directed Acyclic Graphs (DAGs) representing these programs are given in Figure 1. Assuming $\mathrm{cost}(+) = \mathrm{cost}(\times) = 1$, we have $\mathrm{cost}(P_0) = 1$, $\mathrm{cost}(P_1) = 2$, $\mathrm{cost}(Q_1) = 3$ and $\mathrm{cost}(P_2) = \mathrm{cost}(Q_1) + \mathrm{cost}(P_1) + \mathrm{cost}(+) = 6$.
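The cost bookkeeping for these small programs can be reproduced mechanically. The sketch below is an illustration only: it represents programs as nested tuples (expression trees rather than the DAGs of Remark 4), and the helper names `cost`, `evaluate` and `rename` are hypothetical.

```python
# Unit costs for the dictionary D = {+, x}, as assumed in the text.
UNIT_COST = {"+": 1, "*": 1}

def cost(expr):
    # An expression is an input name (str) or a tuple (op, left, right).
    # Every computation node is counted once per occurrence in the tree.
    if isinstance(expr, str):
        return 0
    op, left, right = expr
    return UNIT_COST[op] + cost(left) + cost(right)

def evaluate(expr, env):
    # Evaluate the expression for the input values given in env.
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    return a + b if op == "+" else a * b

def rename(expr, mapping):
    # Plug other input names into a program, e.g. Q1(a, c, d).
    if isinstance(expr, str):
        return mapping[expr]
    op, left, right = expr
    return (op, rename(left, mapping), rename(right, mapping))

P0 = ("+", "a", "b")                            # a + b
P1 = ("*", "a", ("+", "b", "c"))                # a x (b + c)
Q1 = ("+", ("*", "a", "b"), ("*", "a", "c"))    # a x b + a x c
# P2(a, b, c, d) = Q1(a, c, d) + P1(b, c, d)
P2 = ("+",
      rename(Q1, {"a": "a", "b": "c", "c": "d"}),
      rename(P1, {"a": "b", "b": "c", "c": "d"}))

costs = {name: cost(prog) for name, prog in
         [("P0", P0), ("P1", P1), ("Q1", Q1), ("P2", P2)]}
```

Equivalent programs such as `P1` and `Q1` compute the same function but have different costs, which is why complexity is attached to programs rather than to functions.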


Figure 1: DAG illustrating different programs with dictionary $D := \{+, \times\}$. (a) $P_0(a, b) = a + b$, of level 0, which is identified with $+$ from the dictionary; (b) $P_1(a, b, c) = a(b + c)$, of level 1; (c) $Q_1(a, b, c) = ab + ac$, of level 1 and equivalent to $P_1$; (d) $P_2(a, b, c, d) = (a + b)(c + d) = Q_1(a, c, d) + P_1(b, c, d)$, of level 2.

A.2 COMMENTS ON SECTION 5

A.2.1 FORWARD AD AND CLARKE SUBGRADIENTS

Nesterov (2005) introduced the notion of lexicographic subdifferential, denoted here $\partial^L F$ for a Lipschitz function $F : \mathbb{R}^p \to \mathbb{R}$. The construction of $\partial^L F$ is based on successive local approximations of $F$ with directional derivatives, and one has $\partial^L F(x) \subset \partial^c F(x)$ for all $x$ such that the first term is well defined. It is known that automatic differentiation can be used to compute directional derivatives, particularly the forward mode of automatic differentiation (Griewank and Walther, 2008). Based on this observation, Khan and Barton developed several algorithms to evaluate elements of $\partial^c F$, based on directional derivatives (Khan and Barton, 2012; 2013; 2015). They concentrate on piecewise $C^1$ functions, see for example Scholtes (2012), and propose to handle compositional structures with different restrictions on the function class considered, such as functions in abs-normal forms (Khan and Barton, 2012), or broader classes (Khan and Barton, 2013; Barton et al., 2018). All these procedures either require evaluating $p$ directional derivatives (Khan and Barton, 2012; 2013), or rely on forward chain rule propagation for lexicographic derivatives (Khan and Barton, 2015; Barton et al., 2018), which also requires maintaining $p$ directional derivatives. For this reason, all these methods suffer from a multiplicative computational overhead ratio of the order of $p$ in the worst case, and it is not known if this could be improved (Barton et al., 2018), although efforts have been made in this direction (Khan, 2018).

A.2.2 MATRIX MULTIPLICATIONS

Remark 5 The lower bound described in Theorem 3 is sharp for a linear ReLU network $F$ as in (11) involving only square $p \times p$ matrices. Indeed, $p$ directional derivatives of $F$ in directions $a_1, \ldots, a_p$ can be computed with roughly $L c(p)$ operations, using a matrix multiplication algorithm realizing the $c(p)$ bound, for example using the forward mode of AD (Khan and Barton, 2012; 2013). The naive $P_F$ algorithm for forward evaluation performs roughly $2 L p^2$ operations, which results in the bound (neglecting terms of order one in numerator and denominator)

$$\frac{\mathrm{comp}(F_d, D \cup \{\mathrm{ReLU}, \mathrm{ReLU}'\})}{\mathrm{cost}(P_F)} \leq \frac{c(p)}{2 p^2}$$

for this class of networks, to be compared with (5). Finally, we remark that in the smooth case such complexity estimates reduce to gradient computation, which can be done using backward algorithmic differentiation with a constant multiplicative overhead ratio.

We denote by $F_d$ the function $F_d : (y, A) \mapsto F'_1(0, y, A)$ which computes $p$ directional derivatives at a given point. Setting $\omega = \limsup_{p \to \infty} \log(c(p)) / \log(p)$, since $P'$ is an arbitrary program implementing $F_d$, we have shown that asymptotically, for any $\epsilon > 0$,

$$\sup_{p,\ F = [P_F],\ P_F \in \mathcal{P}(D \cup \{\mathrm{ReLU}\})} \frac{\mathrm{comp}(F_d, D \cup \{\mathrm{ReLU}, \mathrm{ReLU}'\})}{\mathrm{cost}(P_F)} \times p^{2 - \omega + \epsilon} = +\infty,$$

where the supremum is taken over all $p$ and all functions $F : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$ implemented by a program $P_F$ with dictionary $D \cup \{\mathrm{ReLU}\}$. It is not known whether $\omega > 2$.
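The forward-mode computation mentioned in Remark 5 can be sketched as follows. This is a minimal illustration, not the algorithm of Khan and Barton: it assumes a network as in (11) with ReLU applied to every hidden coordinate, uses NumPy's ordinary matrix product in place of a fast matrix multiplication algorithm, and the names `network`, `relu_dir` and `forward_directional` are hypothetical. It relies on the chain rule for directional derivatives, with the directional derivative of ReLU at $z$ in direction $v$ being $v$ if $z > 0$, $0$ if $z < 0$ and $\mathrm{ReLU}(v)$ if $z = 0$.

```python
import numpy as np

def network(mats, x):
    # F(x) = M_L ReLU(M_{L-1} ... ReLU(M_1 x)); x may hold several columns.
    z = x
    for M in mats[:-1]:
        z = np.maximum(M @ z, 0.0)
    return mats[-1] @ z

def relu_dir(z, V):
    # Directional derivative of coordinatewise ReLU at z for each column of V:
    # v where z > 0, 0 where z < 0, ReLU(v) where z == 0.
    pos = (z > 0)[:, None]
    zero = (z == 0)[:, None]
    return np.where(pos, V, 0.0) + np.where(zero, np.maximum(V, 0.0), 0.0)

def forward_directional(mats, x, A):
    # Columns of A are the directions a_1, ..., a_p.
    # Each layer costs one matrix-matrix product M @ V.
    z, V = x, A
    for M in mats[:-1]:
        pre, dpre = M @ z, M @ V
        z, V = np.maximum(pre, 0.0), relu_dir(pre, dpre)
    return mats[-1] @ z, (mats[-1] @ V).ravel()

rng = np.random.default_rng(0)
p = 4
mats = [rng.integers(-1, 2, size=(p, p)).astype(float) for _ in range(2)]
mats.append(rng.integers(-1, 2, size=(1, p)).astype(float))
A = rng.integers(-1, 2, size=(p, p)).astype(float)
value, dd = forward_directional(mats, np.zeros(p), A)
```

At $x = 0$ the function is positively homogeneous, so each directional derivative equals $F(a_j)$, which gives a simple consistency check for the sketch.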

B PROOFS RELATED TO SECTION 4

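Proof of Theorem 2: Given a program $P$ as in Section 3.1, the path differentiability of $[P]$ is immediate by composition and the chain rule property. The associated conservative gradient $D_P$ is constructed in Bolte and Pauwels (2020a). We have the following cost estimates, which can be deduced from the definition of the cost of a program in Section 3.1.

• Algorithm 1, forward evaluation:

$$\mathrm{cost}(P) = \mathrm{cost}(\text{Algorithm 1}) = \sum_{i=p+1}^{m} \mathrm{cost}(g_i). \qquad (6)$$

• Algorithm 1, forward evaluation with derivatives (Algorithm 1' with $gd_i$ instead of $g_i$ on line 2):

$$\mathrm{cost}(\text{Algorithm 1'}) = \sum_{i=p+1}^{m} \mathrm{cost}(gd_i). \qquad (7)$$

• Algorithm 2, backward AD cost:

$$\mathrm{cost}(\mathrm{backprop}(P)) = \mathrm{cost}(\text{Algorithm 1'}) + \sum_{i=p+1}^{m} |\mathrm{pr}(i)| (\mathrm{cost}(+) + \mathrm{cost}(\times)) = \sum_{i=p+1}^{m} \left[ \mathrm{cost}(gd_i) + |\mathrm{pr}(i)| (\mathrm{cost}(+) + \mathrm{cost}(\times)) \right]. \qquad (8)$$

• Algorithm 2, forward AD cost:

$$\mathrm{cost}(\mathrm{forprop}(P)) = \mathrm{cost}(\text{Algorithm 1'}) + \sum_{i=p+1}^{m} \left[ p |\mathrm{pr}(i)| \mathrm{cost}(\times) + p (|\mathrm{pr}(i)| - 1) \mathrm{cost}(+) \right] = \sum_{i=p+1}^{m} \left[ \mathrm{cost}(gd_i) + p |\mathrm{pr}(i)| \mathrm{cost}(\times) + p (|\mathrm{pr}(i)| - 1) \mathrm{cost}(+) \right]. \qquad (9)$$

Let us derive the complexity bounds of Algorithm 2 relative to Algorithm 1.

Backward AD complexity result: Using (8) and the fact that cost takes values in $\mathbb{R}_+^*$, we have

$$\mathrm{cost}(\mathrm{backprop}(P)) = \sum_{i=p+1}^{m} \left[ \mathrm{cost}(gd_i) + |\mathrm{pr}(i)| (\mathrm{cost}(+) + \mathrm{cost}(\times)) \right] = \sum_{i=p+1}^{m} \mathrm{cost}(g_i) \times \frac{\mathrm{cost}(gd_i) + |\mathrm{pr}(i)| (\mathrm{cost}(+) + \mathrm{cost}(\times))}{\mathrm{cost}(g_i)} \leq \max_{i=p+1,\ldots,m} \left( \frac{\mathrm{cost}(gd_i) + |\mathrm{pr}(i)| (\mathrm{cost}(+) + \mathrm{cost}(\times))}{\mathrm{cost}(g_i)} \right) \sum_{i=p+1}^{m} \mathrm{cost}(g_i),$$

where the inequality is due to factorization by the maximal value. Using (6), we obtain $\mathrm{cost}(\mathrm{backprop}(P)) \leq \omega_b \times \mathrm{cost}(P)$ where $\omega_b$ is given in (3). This proves point (i).

Forward AD complexity result: Using (9) and the fact that cost takes values in $\mathbb{R}_+^*$, we have

$$\mathrm{cost}(\mathrm{forprop}(P)) = \sum_{i=p+1}^{m} \left[ \mathrm{cost}(gd_i) + p |\mathrm{pr}(i)| \mathrm{cost}(\times) + p (|\mathrm{pr}(i)| - 1) \mathrm{cost}(+) \right] = \sum_{i=p+1}^{m} \mathrm{cost}(g_i) \times \frac{\mathrm{cost}(gd_i) + p |\mathrm{pr}(i)| \mathrm{cost}(\times) + p (|\mathrm{pr}(i)| - 1) \mathrm{cost}(+)}{\mathrm{cost}(g_i)} \leq \max_{i=p+1,\ldots,m} \left( \frac{\mathrm{cost}(gd_i) + p |\mathrm{pr}(i)| \mathrm{cost}(\times) + p (|\mathrm{pr}(i)| - 1) \mathrm{cost}(+)}{\mathrm{cost}(g_i)} \right) \sum_{i=p+1}^{m} \mathrm{cost}(g_i),$$

where the inequality is due to factorization by the maximal value.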
Using (6), we obtain $\mathrm{cost}(\mathrm{forprop}(P)) \leq \omega_f \times \mathrm{cost}(P)$ where $\omega_f$ is given in (3).

Case 1 ($\mathrm{cost}(\times)$, $\mathrm{cost}(+)$) Let us define $g(a, b) = a \times b$. To evaluate $g$, we need one operation from $D_{\mathrm{ReLU}}$. The derived program $d$ related to $g$ should satisfy $d(a, b) = (b, a)$, which does not require any additional operation. Therefore, from Assumption 1 we can deduce that $\mathrm{cost}(g) = 1$ and $\mathrm{cost}(gd) = 1$. We get the same result for $\mathrm{cost}(+)$ by applying identical reasoning.

Case 2 ($\mathrm{cost}(\times c)$, $\mathrm{cost}(+c)$) Let us define $g(a) = c \times a$. To evaluate $g$, we need one operation from $D_{\mathrm{ReLU}}$. The derived program $d$ related to $g$ should satisfy $d(a) = c$, which does not require any additional operation from $D'_{\mathrm{ReLU}}$. Therefore, from Assumption 1 we can deduce that $\mathrm{cost}(g) = 1$ and $\mathrm{cost}(gd) = 1$. We get the same result for $\mathrm{cost}(+c)$ by applying identical reasoning.

Case 3 ($\mathrm{cost}(\log)$) Let us define $g(a) = \log(a)$. To evaluate $g$, we need one operation from $D_{\mathrm{ReLU}}$. The derived program $d$ related to $g$ should satisfy $d(a) = 1/a$, which requires the inverse operation from $D'_{\mathrm{ReLU}}$. Therefore, from Assumption 1 we can deduce that $\mathrm{cost}(g) = 1$ and $\mathrm{cost}(gd) = 2$.

Case 4 ($\mathrm{cost}(\exp)$) Let us define $g(a) = \exp(a)$. To evaluate $g$, we need one operation from $D_{\mathrm{ReLU}}$. The derived program $d$ related to $g$ should satisfy $d(a) = g(a)$, which does not require any operation from $D'_{\mathrm{ReLU}}$. Finally, from Assumption 1 we can deduce that $\mathrm{cost}(g) = 1$ and $\mathrm{cost}(gd) = 1$.

Case 5 ($\mathrm{cost}(\mathrm{inv})$) Let us define $g(a) = 1/a$. To evaluate $g$, we need one operation from $D_{\mathrm{ReLU}}$. The derived program $d$ related to $g$ should satisfy $d(a) = -1/a^2$, which requires one additional multiplication to compute the square and one multiplication by $(-1)$ from $D'_{\mathrm{ReLU}}$. Finally, from Assumption 1 we can deduce that $\mathrm{cost}(g) = 1$ and $\mathrm{cost}(gd) = 3$.

Case 6 ($\mathrm{cost}(\mathrm{ReLU})$) Let us define $g(x) = \mathrm{ReLU}(x) = \max(x, 0)$. To evaluate $g$, we need to evaluate the sign of $x$. The derived program $\mathrm{ReLU}'$ can also be computed from the sign of $x$ without further operation.
We have $\mathrm{cost}(g) = 1$ by hypothesis, but it is also reasonable to consider $\mathrm{cost}(gd) = 1$ as both operations only require sign evaluation of the same object.

Remark 6 Since the $D_{\mathrm{ReLU}}$ dictionary contains the ReLU function, we can build other nonsmooth functions such as the maximum and the absolute value. For example, $\max\{x, y\} = \mathrm{ReLU}(x - y) + y = \mathrm{ReLU}(x - y) + \mathrm{ReLU}(y) - \mathrm{ReLU}(-y)$.
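The identity in Remark 6 can be checked directly. The following sketch is only an illustration; the function names are hypothetical, and the expression used for the absolute value, $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$, is a standard identity that is not stated in the text above.

```python
def relu(x):
    return max(x, 0.0)

def max_via_relu(x, y):
    # max{x, y} = ReLU(x - y) + y, where y is rewritten as ReLU(y) - ReLU(-y)
    # so that only +, - and ReLU are used.
    return relu(x - y) + relu(y) - relu(-y)

def abs_via_relu(x):
    # Assumed standard identity (not from the text): |x| = ReLU(x) + ReLU(-x).
    return relu(x) + relu(-x)

# Sample points chosen as dyadic numbers so that float arithmetic is exact.
samples = [(-2.0, 1.5), (1.5, -2.0), (0.0, 0.0), (3.0, 3.0), (-0.5, -0.25)]
max_checks = [max_via_relu(x, y) == max(x, y) for x, y in samples]
abs_checks = [abs_via_relu(x) == abs(x) for x, _ in samples]
```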

B.2 AN EXTENSION OF TABLE 1

The justifications of the following are similar to Section B.1, simply taking into consideration different types of operations. Taking $c_{\mathrm{nonlin}} = c_{\mathrm{ReLU}} = 1$, we recover Table 1. We replace ReLU by $\times\mathrm{ReLU}$, which corresponds to its usage in practice and allows us to balance the cost of ReLU operations and that of multiplications. For the $\times\mathrm{ReLU}$ operation, the justification is as follows.

Case 7 ($\mathrm{cost}(\times\mathrm{ReLU})$) The operation has two arguments and requires one sign evaluation and one multiplication in the worst case, so we assign it the cost $1 + c_{\mathrm{ReLU}}$. The differentiated program $d$ should compute the function $(a, b) \mapsto (\mathrm{ReLU}(b), a \times \mathrm{ReLU}'(b))$. One can write a program to compute jointly $g$ and $d$ as follows: return $(a \times b, b, a)$ if $b \geq 0$ and $(0, 0, 0)$ if $b < 0$. This only requires a sign check, whose cost is $c_{\mathrm{ReLU}}$, and a multiplication. We therefore model this operation such that $\mathrm{cost}(gd) = \mathrm{cost}(g) = 1 + c_{\mathrm{ReLU}}$.

Further refinements could be considered, including various types of computational operations such as memory moves; these are beyond the scope of the present paper.
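The joint program described in Case 7 is short enough to write out. The sketch below follows the branch rule stated in the text (the branch $b \geq 0$ is treated as active); the function name `mul_relu_with_derivative` is hypothetical.

```python
def mul_relu_with_derivative(a, b):
    # Jointly returns g(a, b) = a * ReLU(b) and the derived values
    # (ReLU(b), a * ReLU'(b)), using one sign check and one multiplication.
    if b >= 0:
        # Active branch: ReLU(b) = b and ReLU'(b) is taken equal to 1.
        return a * b, b, a
    # Inactive branch: value and both partial derivatives are zero.
    return 0.0, 0.0, 0.0

active = mul_relu_with_derivative(2.0, 3.0)
inactive = mul_relu_with_derivative(2.0, -1.0)
at_kink = mul_relu_with_derivative(2.0, 0.0)
```

No operation beyond those needed for $g$ itself is performed, which is the reason for modelling $\mathrm{cost}(gd) = \mathrm{cost}(g)$.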

B.3 ADDITIONAL ELEMENTARY NONSMOOTH PROGRAMS AND COST EXAMPLES

For simplicity, we do not discuss the dictionary and its related derived dictionary as there are many possibilities, one of them being $D_{\mathrm{ReLU}}$ and $D'_{\mathrm{ReLU}}$ as all the considered operations can be equivalently

| $g$ | $(+, \times)$ | $\lvert\cdot\rvert$ | ELU | $3 \times 3$-max-pool | $\lVert\cdot\rVert_\infty$ | $\lVert\cdot\rVert_1$ |
|---|---|---|---|---|---|---|
| $\mathrm{cost}(g)$ | $1$ | $1 + c_R$ | $2 + c_R + c_{nl}$ | $153 + 8 c_R$ | $n + 2 n c_R - 1$ | $n(2 + c_R) - 1$ |
| $\lvert\mathrm{pr}\rvert$ | $2$ | $1$ | $1$ | $9$ | $n$ | $n$ |
| $\mathrm{cost}(gd)$ | $1$ | $1 + c_R$ | $2 + c_R + c_{nl}$ | $153 + 8 c_R$ | $n + 2 n c_R - 1$ | $n(2 + c_R) - 1$ |
| $\mathrm{cost}(gd) / \mathrm{cost}(g)$ | $1$ | $1$ | $1$ | $1$ | $1$ | $1$ |
| $\mathrm{cost}(\times) \lvert\mathrm{pr}\rvert / \mathrm{cost}(g)$ | $4$ | $\frac{1}{1 + c_R}$ | $\frac{1}{2 + c_R + c_{nl}}$ | $\frac{9}{153 + 8 c_R}$ | $\frac{n}{n + 2 n c_R - 1}$ | $\frac{n}{n(2 + c_R) - 1}$ |
| $\omega$ | $5$ | $\leq 3$ | $\leq 2$ | $\leq 1.12$ | $\leq 3$ | $\leq 2$ |

Case 8 (Absolute value and Leaky-ReLU) Recall that $|x| = x$ if $x > 0$ and $-x$ otherwise. Similarly, $\text{Leaky-ReLU}(x) = x$ if $x > 0$ and $ax$ otherwise, for some parameter $a \in (0, 1)$. The reasoning and result are exactly the same for both operations, so we treat the absolute value. The construction is similar to what was proposed for $\times\mathrm{ReLU}$ in the previous section. Let $g$ be a program to evaluate $|\cdot|$; in the worst case it requires one sign evaluation and one multiplication so that $\mathrm{cost}(g) = 1 + c_{\mathrm{ReLU}}$. Similarly, it is possible to build a program which returns $(x, 1)$ if $x > 0$ and $(-x, -1)$ otherwise; this computes $(gd)$ and requires the exact same operations so that $\mathrm{cost}(gd) = \mathrm{cost}(g) = 1 + c_{\mathrm{ReLU}}$.

Case 9 (ELU) $f(x) = x$ if $x \geq 0$ and $f(x) = a(e^x - 1)$ if $x < 0$, with $a > 0$. Let $g$ be a program to evaluate the ELU function; it requires a sign evaluation and, in the worst case, one nonlinear operation to evaluate $e^x$, one multiplication to evaluate $a e^x$, and one subtraction to evaluate $a e^x - a$. Therefore, $\mathrm{cost}(g) = c_{\mathrm{ReLU}} + c_{\mathrm{nonlin}} + 2$. The derived program $d$ requires the same sign and returns $1$ or $a e^x$ depending on the sign. This does not require any additional operation and therefore the joint computation of $g$ and $d$ satisfies $\mathrm{cost}(gd) = \mathrm{cost}(g)$.
Case 10 (max-m-linear) Let $n$ be a number of inputs and $m \geq 2$ a number of linear functions which are parameters, represented by a matrix $A$, and consider a fixed input vector of size $n$ represented by $x \in \mathbb{R}^n$. Setting $\max_m : \mathbb{R}^m \to \mathbb{R}$ the function which evaluates the maximum of $m$ numbers, we consider $g$ a program which evaluates the function $A \mapsto \max_m(Ax)$. Recall that $x$ is fixed so that the number of inputs is $m \times n$. The matrix-vector product requires $m \times (2n - 1)$ multiplications and additions and the evaluation of $\max_m$ requires $(m - 1) c_{\mathrm{ReLU}}$ as it requires $m - 1$ pairwise comparisons. We therefore have $\mathrm{cost}(g) = m \times (2n - 1) + (m - 1) c_{\mathrm{ReLU}}$. As for the derived program $d$, setting $M_i = 0$ except for row number $i$, which attains the maximum in $g$ and which is set to $x$, we have an element of a conservative gradient for $g$. It is possible to jointly compute $g(A)$ and $d(A)$ by invoking a program which returns $((Ax)[i], M_i)$ where $i$ is any index realizing the max and $M_i$ is as discussed. This does not require more operations and we have therefore $\mathrm{cost}(gd) = \mathrm{cost}(g) = m \times (2n - 1) + (m - 1) c_{\mathrm{ReLU}}$.

Case 11 (Two dimensional max-pooling ($3 \times 3$-max-pool)) We consider a kernel of size $3 \times 3$ for simplicity. The goal is to differentiate with respect to the kernel weights for a fixed input. Let $g$ denote a program implementing such a function; it is of the same form as max-m-linear except that the matrix $A$ is of size $9 \times 25$ (padding values at the boundary of the $3 \times 3$ patch, this gives $5 \times 5 = 25$ inputs and 9 outputs), but it is sparse and can be parametrized by only 9 values, and the evaluation of the linear function for a fixed $5 \times 5$ input only requires $9 \times (9 + 8) = 153$ additions and multiplications. We then take the maximum of these 9 outputs, which requires 8 pairwise comparisons, so that $\mathrm{cost}(g) = 153 + 8 c_{\mathrm{ReLU}}$. For the same reason as max-m-linear, we have $\mathrm{cost}(gd) = \mathrm{cost}(g) = 153 + 8 c_{\mathrm{ReLU}}$.

Case 12 ($\ell_1$-norm, $\|\cdot\|_1$) Denote by $g$ a program which evaluates the $\ell_1$ norm on $\mathbb{R}^n$. It has $n$ inputs.
In the worst case, its evaluation can be done with $n - 1$ additions, $n$ multiplications by $-1$ and $n$ pairwise comparisons. Therefore we have $\mathrm{cost}(g) = 2n + n c_{\mathrm{ReLU}} - 1$. For the same reasons as all examples before, it is possible to identify a derived program $d$ without requiring any additional operation so that $\mathrm{cost}(gd) = \mathrm{cost}(g) = 2n + n c_{\mathrm{ReLU}} - 1$.

Case 13 (Median of $n$ numbers) Denote by $g$ a program that evaluates the median of $n$ numbers. This can be done by sorting the $n$ numbers and outputting the value at position $\lfloor n/2 \rfloor$, which requires roughly $n \log(n)$ operations, depending on the algorithm used. The sorting operation is a permutation; one could apply the same permutation to the vector $(1, 2, \ldots, n)$ without any additional operation required. The number at position $\lfloor n/2 \rfloor$, call it $i$, is the index of the value associated with the median. Setting $d$ to be the null vector in $\mathbb{R}^n$ with value 1 at position $i$ only, we have a selection in a conservative gradient for the median with no additional operation required. Therefore in this case $\mathrm{cost}(g) = \mathrm{cost}(gd)$.

Case 14 (Selection functions) This example encompasses virtually all examples used in machine learning and extends the median example above. Assume that $f : \mathbb{R}^p \to \mathbb{R}$ is locally Lipschitz, given in the form $f(x) = f_{s(x)}(x)$ where $s : \mathbb{R}^p \to \{1, \ldots, m\}$ is an index selection function, and for each $i = 1, \ldots, m$, $f_i : \mathbb{R}^p \to \mathbb{R}$ is a $C^2$ function. Let $g$ be a program computing $f$; one possibility is to first evaluate $s(x)$ at cost $c_s$ and then evaluate $f_{s(x)}(x)$ at cost $c_f$. As shown in Bolte and Pauwels (2020b), under very mild restrictions on $s$ and $f$ (which should be expressed with logarithms, polynomials, exponentials etc.), the function $x \mapsto \nabla f_{s(x)}(x)$ is a conservative gradient for $f$. It can be seen that it is possible to evaluate jointly $(g, gd)$ by first computing $s$, at a cost $c_s$, then evaluating $f_s$ and $\nabla f_s$ jointly at a cost $c_\nabla$.
$$\mathrm{cost}(g) = c_s + c_f, \qquad \mathrm{cost}(gd) = c_s + c_\nabla, \qquad \frac{\mathrm{cost}(gd)}{\mathrm{cost}(g)} = \frac{c_s + c_\nabla}{c_s + c_f} \leq \frac{c_s + 5 c_f}{c_s + c_f},$$

where we used $c_\nabla \leq 5 c_f$, the cheap gradient principle for smooth programs. This ratio is close to 5 if $c_s$ is negligible, in which case we recover the usual ratio for smooth programs. It is close to 1 if $c_s$ dominates, which is the case in the median example where $f_s$ just corresponds to coordinate number $s$ of the input and has a constant derivative.

where the first inequality is because $D_2$ is a program computing $BA^T$ for all $A, B$ on dictionary $D$, the second is because adding computation increases the cost, the third is a property of backward algorithmic differentiation on $D$ and the last one is by construction of $P_2$. Note that $\mathrm{comp}(BA^T, D) = c(p)$ by definition, therefore we have the claimed lower bound

$$\frac{\mathrm{cost}(P')}{\mathrm{cost}(P_F)} \geq \frac{c(p) - 5p}{5\,\mathrm{cost}(P_F)} = \frac{c(p) - 5p}{40 p^2}. \qquad \square$$

C.2 AN ADDITIONAL LEMMA

Lemma 1 Let $Q : \mathbb{R}^p \to \mathbb{R}$ be a polynomial and $P_1$ be a program (without loss of generality of level 1) on the dictionary $D' = \{+, \times, \mathrm{ReLU}, +c, \times c\}$, such that $Q = [P_1]$ for all inputs restricted to an open set $S \subset \mathbb{R}^p$. Then there is a level 1 program $P_2$ on the dictionary $D = D' \setminus \{\mathrm{ReLU}\} = \{+, \times, +c, \times c\}$ such that $Q = [P_2]$ (for all inputs in $\mathbb{R}^p$). Furthermore, if $\mathrm{cost}(\mathrm{ReLU}) = \mathrm{cost}(\times c)$, then $\mathrm{cost}(P_2) = \mathrm{cost}(P_1)$.

Proof: We use the DAG representation of programs as in Remark 4. Therefore $P_1$ is described by a DAG whose nodes are either input nodes or computation nodes implementing functions from $D'$. The function computed by $P_1$ as well as each of its nodes are semi-algebraic (Bochnak et al., 2013; Coste, 2000a; b). To each ReLU node in the graph representing $P_1$ (assume that there are $N$ of them) we associate a number: the function $\mathrm{ReLU}'$ evaluated on its input (with the convention that $\mathrm{ReLU}'(0) = 0$). This defines a semialgebraic function $G : \mathbb{R}^p \to \{0, 1\}^N$.
As it has values in a finite set, by semialgebraicity, there is an open subset $S' \subset S$ such that $G$ is constant on $S'$ (Coste, 2000a, Theorem 6.7). Consider $P_2$ whose computation graph is the same as that of $P_1$ except that each ReLU node is replaced by multiplication by the corresponding $\mathrm{ReLU}'$ value (which is constant on $S'$). Then $Q = [P_1] = [P_2]$ for all inputs in the open set $S'$. All computation nodes of programs on $D$ are multivariate polynomials and two polynomials which agree on an open set are equal globally. This concludes the proof. $\square$

D PROOFS OF SECTION 5.3

We investigate in this section the hardness of finding a Clarke subgradient for programs defined on the elementary dictionary $D_0 = \{+, -, \mathrm{ReLU}\}$. We start with an equivalent representation of these programs as linear ReLU networks with skip connections and specific weight matrices. This equivalence preserves representation size up to polynomial factors. We will then prove a hardness result on such ReLU networks. This will provide proof arguments for Theorem 4 by the polynomial time equivalence of the two representations. We proceed similarly to prove Proposition 1, using the equivalence between the two representations.

CONNECTIONS

Given a set of matrices $M_1 \in \{-1, 0, 1\}^{p_1 \times p}$, $M_2 \in \{-1, 0, 1\}^{p_2 \times p_1}$, $\ldots$, $M_{L-1} \in \{-1, 0, 1\}^{p_{L-1} \times p_{L-2}}$, $M_L \in \{-1, 0, 1\}^{1 \times p_{L-1}}$, we consider the function

$$F : \mathbb{R}^p \to \mathbb{R}, \qquad F : x \mapsto M_L \Phi_{L-1}(M_{L-1} \Phi_{L-2}(\ldots \Phi_1(M_1 x))), \qquad (11)$$

where $\Phi_i : \mathbb{R}^{p_i} \to \mathbb{R}^{p_i}$ are given functions which apply to each coordinate an activation function which is either the identity or the ReLU function. There is an obvious notion of size for this representation, corresponding to the number of free parameters (matrix entries and coordinates on which ReLU or identity is applied): the size of the representation is $p_{L-1} + \sum_{i=1}^{L-1} (p_i \times p_{i-1} + p_i)$.

A function $F$ given in (11) can be represented by a program on $D_0$ of equivalent size; this corresponds to a naive implementation. Similarly, any program $P \in \mathcal{P}(D_0)$ on $p$ inputs and with a single output can be represented by a network as in (11) whose size is at most $18\,\mathrm{cost}(P)^3$. Indeed, we may assume that $\mathrm{cost}(P) \geq p/2$ without loss of generality; otherwise, the program would not perform operations on some of the input variables and it could be simplified by removing variables which do not affect the output. Recall that $m$ in Algorithm 1 is the memory footprint of $P$; in our case, it is $m = p + \mathrm{cost}(P)$, the number of inputs plus the total number of operations. Note that we have $m \leq 3\,\mathrm{cost}(P)$. Each operation $+$, $-$ or ReLU in the program can be represented by an $m \times m$ matrix composed with a certain $\Phi : \mathbb{R}^m \to \mathbb{R}^m$, whose contribution to the ReLU network size is at most $(m^2 + m) \leq 2m^2 \leq 18\,\mathrm{cost}(P)^2$ since $m$ is an integer and $m \leq 3\,\mathrm{cost}(P)$. There are $\mathrm{cost}(P)$ such operations, so that a program can be represented equivalently by a linear ReLU network with $L = \mathrm{cost}(P)$ layers, each contributing at most $18\,\mathrm{cost}(P)^2$ to the network size, so that the size of the resulting network is at most $18\,\mathrm{cost}(P)^3$, which is the desired bound.
We have shown that working with functions represented as in equation (11) is equivalent to working with programs in $\mathcal{P}(D_0)$, as it is possible to switch from one to the other at the cost of an increase of the representation size which is only cubic. Therefore we will from now on work with functions represented as linear ReLU networks with skip connections as in (11), and NP-hardness or polynomial time results on such functions will be valid on $\mathcal{P}(D_0)$ by the construction above.

D.2 FURTHER PROPERTIES OF LINEAR ReLU NETWORKS

Throughout this section $F$ denotes a function with representation as in (11). This function is positively homogeneous, it satisfies $F(0) = 0$ and it is piecewise linear. By piecewise linearity, its Clarke subdifferential is a polyhedron (see e.g., Arora et al. (2018); Raghu et al. (2017)). The Clarke subdifferential is a conservative gradient for this function, and we will associate to it a different conservative gradient, associated to Algorithm 2.

Definition 2 (Autodiff conservative gradient) We consider a specific conservative gradient for $F$, given by

$$D^a_F(x) = \{ M_1^T D_1 M_2^T D_2 \ldots M_{L-1}^T D_{L-1} M_L^T \},$$

where for $i = 1, \ldots, L - 1$, $D_i$ is a diagonal matrix whose entries respect the sign pattern of the corresponding activation function: 1 if the activation is the identity, 0 if the activation is ReLU and the input is negative, 1 if the input is positive and all elements in $[0, 1]$ if the input is null. We have in particular

$$D^a_F(0) = \{ M_1^T D_1 M_2^T D_2 \ldots M_{L-1}^T D_{L-1} M_L^T \},$$

where in this case, diagonal entries of matrices $D_i$ corresponding to ReLU activations are arbitrary in $[0, 1]$ and the remaining diagonal entries are 1 (corresponding to identity activations).

The autodiff conservative gradient is associated with the algorithmic differentiation of a natural numerical program implementing $F$ as in Subsection 3.2.
Furthermore, one can check that given a program $P \in \mathcal{P}(D_0)$, after the transformation outlined in Section D.1, $D^a_F$ coincides with $D_P$ in Theorem 2. In the following problem, $D_F$ could be, for example, the Clarke subdifferential of $F$ or the algorithmic differentiation conservative gradient $D^a_F$. We consider the following problem.

Problem 1 (Conservative gradient enumeration) Given matrices $M_1 \in \mathbb{R}^{p_1 \times p}$, $M_2 \in \mathbb{R}^{p_2 \times p_1}$, $\ldots$, $M_{L-1} \in \mathbb{R}^{p_{L-1} \times p_{L-2}}$, $M_L \in \mathbb{R}^{1 \times p_{L-1}}$, and functions $\Phi_1, \ldots, \Phi_{L-1}$, consider $F : \mathbb{R}^p \to \mathbb{R}$ the associated linear ReLU network with skip connections in (11), $x \in \mathbb{R}^p$ and $D_F : \mathbb{R}^p \rightrightarrows \mathbb{R}^p$ a conservative gradient for $F$. Compute two distinct elements in $D_F(x)$, or one element if it is a singleton.

This problem enters the field of computational complexity as we have associated to it a representation size corresponding to the number of "free parameters" to be chosen: each matrix entry and the activation (ReLU or identity) corresponding to each coordinate, resulting in a number of parameters $p_{L-1} + \sum_{i=1}^{L-1} (p_i \times p_{i-1} + p_i)$. In what follows, we will consider integral or rational entries for matrices and input $x$ with the common notion of bit size (Schrijver, 1998).
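To make Definition 2 concrete, the following sketch computes one element of $D^a_F(x)$ by a forward pass that records the diagonals $D_i$ followed by a backward accumulation of the matrix product. It is an illustration under stated assumptions: every hidden coordinate uses a ReLU activation, the same value `tie` in $[0, 1]$ is used for all null inputs, and the function name `autodiff_element` is hypothetical.

```python
import numpy as np

def autodiff_element(mats, x, tie=0.0):
    # mats = [M_1, ..., M_L]; returns M_1^T D_1 ... M_{L-1}^T D_{L-1} M_L^T.
    # Forward pass: store the diagonal of each D_i (1 if input > 0,
    # 0 if input < 0, `tie` in [0, 1] if input == 0).
    z, diags = x, []
    for M in mats[:-1]:
        pre = M @ z
        diags.append(np.where(pre > 0, 1.0, np.where(pre < 0, 0.0, tie)))
        z = np.maximum(pre, 0.0)
    # Backward pass: accumulate the product from right to left,
    # using only matrix-vector products.
    g = mats[-1].T[:, 0]
    for M, d in zip(reversed(mats[:-1]), reversed(diags)):
        g = M.T @ (d * g)
    return g

# F(x) = ReLU(x) - ReLU(-x) = x, written as a two-layer network.
mats_id = [np.array([[1.0], [-1.0]]), np.array([[1.0, -1.0]])]
at_zero_low = autodiff_element(mats_id, np.array([0.0]), tie=0.0)   # array([0.])
at_zero_high = autodiff_element(mats_id, np.array([0.0]), tie=1.0)  # array([2.])
away_from_zero = autodiff_element(mats_id, np.array([1.0]))         # array([1.])
```

In this small example the function is the identity, with derivative 1 everywhere, yet different admissible choices at the null inputs give different elements of the product set at $x = 0$. This illustrates that the autodiff conservative gradient depends on the program and may be larger than the Clarke subdifferential.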

D.2.1 CLARKE ENUMERATION IS NP-HARD FOR RELU NETWORKS

The decision version of Problem 1, under the same assumptions, is to decide if there exist two distinct elements in $D_F(x)$, that is, to decide if $D_F(x)$ is not reduced to a singleton.

Theorem 5 (Finding two Clarke subgradients is NP-Hard) The decision version of Problem 1 with matrix and vector entries in $\{-1, 0, 1\}$ and $D_F = \partial^c F$ is NP-hard.

Sketch of proof:

We encode a boolean formula $\pi$ on $p$ boolean variables in a linear ReLU network with $p$ inputs, of size proportional to that of $\pi$. We do so by replacing "or" operations by maxima, "and" operations by minima, negation by multiplication by $-1$ and adding ReLU operations to the result. Using Lemma 3 in Appendix D.5, the resulting $F$ is represented by a linear ReLU network. By construction, 0 is a global minimum of $F$ so $0 \in \partial^c F(0)$, and $F$ takes positive values if and only if $\pi$ is satisfiable, if and only if $\partial^c F(0) \neq \{0\}$. We detail this proof in the coming sections.

Theorem 5 illustrates the hardness of enumerating Clarke subgradients of linear ReLU networks. For $F$ as in (11) and $x \in \mathbb{R}^p$, $\partial^c F(x)$ is not a singleton if and only if $F$ is not differentiable at $x$, therefore:

Corollary 2 (Deciding non-differentiability of a NN is NP-Hard) Given a linear ReLU network as in (11) with matrices as in Theorem 5 and $x \in \mathbb{R}^p$, deciding if $F$ is not differentiable at $x$ is NP-hard.

In the coming section, we will provide a proof for Theorem 5 and Corollary 2. By the polynomial time equivalence of the representation of programs in $\mathcal{P}(D_0)$ and functions as in (11) detailed in Section D.1, this proves Theorem 4. We add a remark on the lexicographic subdifferential. It follows from (Barton et al., 2018, Proposition 2.7) that, for a linear ReLU network $F$ as in (11), the lexicographic subdifferential (Nesterov, 2005) is the set of neighboring gradients and is contained in the Clarke subdifferential.

Corollary 3 (Finding two lexicographic subgradients is NP-Hard) Theorem 5 remains true if $D_F$ is the lexicographic subdifferential.

D.3 PROOF OF THE MAIN HARDNESS RESULT

Preliminary on 3-SAT We will use reduction to the 3-SAT problem, which is among the most well known NP-complete problems. Recall that a boolean formula is built from boolean variables and operators: AND (conjunction, denoted $\wedge$), OR (disjunction, $\vee$) and NOT (negation, $\neg$). A literal is either a variable or the negation of a variable. A clause is a disjunction of literals (or a single literal). A formula is in conjunctive normal form (CNF) if it is a conjunction of clauses or a clause. 3-SAT is the decidability problem associated to CNF formulas with clauses containing 3 literals; such formulas are called 3-CNF formulas.

Example 2 The formula $(b_1 \vee b_2 \vee \neg b_3) \wedge (b_1 \vee b_4 \vee \neg b_5) \wedge (\neg b_2 \vee \neg b_3 \vee b_6)$ is 3-CNF with 6 boolean variables $b_1, \ldots, b_6$ and 3 clauses.

Problem 2 (3-SAT) Given $p, n \in \mathbb{N}$ and a boolean function $\pi$ with $p$ boolean arguments $b_1, \ldots, b_p$ represented by a 3-CNF formula with $n$ clauses, decide if there exists an assignment $(b_1, \ldots, b_p) \in \{0, 1\}^p$ such that $\pi(b_1, \ldots, b_p) = 1$.

Proof of Theorem 5:

The reduction is to 3-SAT. Consider a 3-CNF function $\pi$ in $p$ variables $b_1, \ldots, b_p$ with $n$ clauses of size 3. We may assume without loss of generality that $n$ is of the form $2^k$ for $k \in \mathbb{N}$, by adding clauses which are always true and increasing the number of clauses by a factor at most 2. We will consider $p$ real variables $x_1, \ldots, x_p$. Consider the first clause of $\pi$, say for example $(b_1 \vee b_2 \vee \neg b_3)$. We associate to each literal the corresponding variable $x$ if the literal is equal to a variable, and $-x$ if it is the negation of the corresponding variable, for example $x_1, x_2, -x_3$. These are combined using $\mathrm{ReLU} \circ \max$, resulting in the expression $\mathrm{ReLU}(\max\{x_1, x_2, -x_3\})$. We proceed similarly for each clause and obtain $n = 2^k$ expressions involving $\mathrm{ReLU} \circ \max$ where the max is over three numbers. The max of 3 numbers is the same as the max of 4 numbers (by copying one of the inputs) and, according to Lemma 3, can be represented by a ReLU network with 2 ReLU layers of size at most $3 \times 2 = 6$ with weight matrices in $\{-1, 0, 1\}$. We may therefore represent the $n$ $\mathrm{ReLU} \circ \max$ expressions with a network with $p$ inputs and $n$ outputs, with 3 ReLU layers (2 for each max and one for the outer ReLU) of size at most $6n$ (6 nodes for each max) involving matrices with entries in $\{-1, 0, 1\}$. These expressions are combined using the operator min applied to the $n = 2^k$ clauses. Thanks to Lemma 3 again, using $\min\{a, b\} = -\max\{-a, -b\}$, the min over the $2^k$ numbers can be expressed with $k$ layers of size at most $3 \times 2^{k-1} = \frac{3}{2} n$. We call the resulting network $F$. It has a representation as in (11), with matrices with entries in $Z_3 = \{-1, 0, 1\}$ as in Problem 1. It contains $\log_2(n) + 3$ ReLU layers of size at most $6n$ and it has therefore a description whose size is polynomially bounded in $n$, which is proportional to the bit size representation of the 3-CNF formula $\pi$.
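The construction above can be sketched on small instances. The snippet below is an illustration only: it evaluates $F$ directly as a min of $\mathrm{ReLU} \circ \max$ terms instead of building the layered network of Lemma 3, it encodes clauses in a DIMACS-like convention ($+i$ for $b_i$, $-i$ for $\neg b_i$) that is an assumption of this sketch, and it compares against an exponential brute-force satisfiability check. All function names are hypothetical.

```python
from itertools import product

# Example 2 of the text: +i stands for b_i, -i for NOT b_i (1-indexed).
EXAMPLE_2 = [(1, 2, -3), (1, 4, -5), (-2, -3, 6)]

def relu(t):
    return max(t, 0)

def F(clauses, x):
    # min over clauses of ReLU(max over literals of +x_i or -x_i).
    return min(
        relu(max(x[abs(l) - 1] if l > 0 else -x[abs(l) - 1] for l in clause))
        for clause in clauses
    )

def satisfiable(clauses, p):
    # Brute force over all boolean assignments (exponential in p).
    return any(
        all(any((b[abs(l) - 1] == 1) == (l > 0) for l in clause)
            for clause in clauses)
        for b in product((0, 1), repeat=p)
    )

def positive_on_signs(clauses, p):
    # Does F take a positive value on some sign vector in {-1, 1}^p?
    return any(F(clauses, x) > 0 for x in product((-1, 1), repeat=p))

# An unsatisfiable 3-CNF formula: all 8 sign patterns on three variables.
UNSAT = [(s1 * 1, s2 * 2, s3 * 3) for s1, s2, s3 in product((1, -1), repeat=3)]
```

On both formulas, `satisfiable` and `positive_on_signs` agree, in line with the equivalence between satisfiability of $\pi$ and positivity of $F$ used in the proof.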
Example 3 If the 3-CNF formula is given by $(b_1 \vee b_2 \vee \neg b_3) \wedge (b_1 \vee b_4 \vee \neg b_5) \wedge (\neg b_2 \vee \neg b_3 \vee b_6) \wedge (b_2 \vee \neg b_2 \vee b_6)$ with $p = 6$ boolean variables and $n = 4$ clauses, we obtain a network computing the following expression in 6 real variables $x_1, \ldots, x_6$:

$$F(x_1, \ldots, x_6) = \min\big(\mathrm{ReLU}(\max(x_1, x_2, -x_3)),\ \mathrm{ReLU}(\max(x_1, x_4, -x_5)),\ \mathrm{ReLU}(\max(-x_2, -x_3, x_6)),\ \mathrm{ReLU}(\max(x_2, -x_2, x_6))\big).$$

We have the following rules for min and max over real numbers $a, b, c$ (we use the convention $\mathrm{sign}(0) = 0$).
• $\max(a, b, c) > 0 \Leftrightarrow (a > 0) \vee (b > 0) \vee (c > 0)$.
• $\max(a, b, c) > 0 \Leftrightarrow \max(\mathrm{sign}(a), \mathrm{sign}(b), \mathrm{sign}(c)) > 0$.
• $\min(a, b, c) > 0 \Leftrightarrow (a > 0) \wedge (b > 0) \wedge (c > 0)$.
• $\min(a, b, c) > 0 \Leftrightarrow \min(\mathrm{sign}(a), \mathrm{sign}(b), \mathrm{sign}(c)) > 0$.
• $a > 0 \Leftrightarrow (-a < 0) \Leftrightarrow \mathrm{sign}(a) > 0$.
• $\mathrm{ReLU}(\max(\mathrm{sign}(a), \mathrm{sign}(b), \mathrm{sign}(c))) \in \{0, 1\}$.

Because of the $\min \circ \mathrm{ReLU}$ structure, we have $F(x) \geq 0$ for all $x$. Furthermore, $F(0) = 0$, so that 0 is a global minimum of $F$ and $0 \in \partial^c F(0)$. For any $x$, we have $F(x) > 0$ if and only if the output of each max is positive, if and only if each max clause contains a positive argument. We therefore have that $F(x) > 0$ if and only if $F(\mathrm{sign}(x)) > 0$, where sign is the coordinatewise application of the sign, taking value 0 at 0. We have the following chain of equivalences:

$$
\begin{aligned}
\partial^c F(0) \neq \{0\} &\Leftrightarrow \exists x \in \mathbb{R}^p,\ F(x) \neq 0 \\
&\Leftrightarrow \exists x \in \mathbb{R}^p,\ F(x) > 0 \\
&\Leftrightarrow \exists x \in \mathbb{R}^p,\ x_i \neq 0\ (\forall i = 1, \ldots, p),\ F(x) > 0 \\
&\Leftrightarrow \exists x \in \mathbb{R}^p,\ x_i \neq 0\ (\forall i = 1, \ldots, p),\ F(\mathrm{sign}(x)) > 0 \\
&\Leftrightarrow \exists x \in \{-1, 1\}^p,\ F(x) > 0 \\
&\Leftrightarrow \exists x \in \{-1, 1\}^p,\ \pi(b) = 1,\ b_i = I(x_i = 1)\ (i = 1, \ldots, p),
\end{aligned}
$$

where $I$ outputs 1 if the boolean argument is true, and 0 otherwise. The first equivalence is by Lemma 2, the second is because $F \geq 0$, the third is because $F$ is continuous, the fourth is by the discussion above, and the fifth is obvious because all possible $\{-1, 1\}$ patterns can be described as the coordinatewise sign of vectors in $\mathbb{R}^p$ with nonzero entries.
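The correspondence between the network of Example 3 and its 3-CNF formula can be checked numerically on the hypercube $\{-1, 1\}^6$. The sketch below is only an illustration under stated assumptions: it evaluates $F$ directly with Python's `min` and `max` rather than through the layered matrix representation (11), and the function names `relu`, `F`, `pi` and `check_equivalence` are hypothetical.

```python
from itertools import product


def relu(t):
    return max(t, 0.0)


def F(x):
    """Expression of Example 3; x is a sequence (x1, ..., x6) of reals."""
    x1, x2, x3, x4, x5, x6 = x
    return min(
        relu(max(x1, x2, -x3)),
        relu(max(x1, x4, -x5)),
        relu(max(-x2, -x3, x6)),
        relu(max(x2, -x2, x6)),
    )


def pi(b):
    """3-CNF formula of Example 3; b is a sequence (b1, ..., b6) in {0,1}."""
    b1, b2, b3, b4, b5, b6 = b
    return int(
        (b1 or b2 or not b3)
        and (b1 or b4 or not b5)
        and (not b2 or not b3 or b6)
        and (b2 or not b2 or b6)
    )


def check_equivalence():
    """Check F(x) > 0 <=> pi(b) = 1 with b_i = I(x_i = 1) on {-1,1}^6."""
    for x in product((-1.0, 1.0), repeat=6):
        b = tuple(1 if xi == 1.0 else 0 for xi in x)
        if (F(x) > 0) != (pi(b) == 1):
            return False
    return True
```

Calling `check_equivalence()` enumerates all 64 sign patterns; this mirrors the last equivalence of the chain for this particular formula and is not a substitute for the general argument.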
For the last equivalence, for $x_i \in \{-1, 1\}$ we set $b_i = 0$ if $x_i = -1$ and $b_i = 1$ if $x_i = 1$. Each $\mathrm{ReLU} \circ \max$ applied to the sign vector corresponds to a clause and its output is in $\{0, 1\}$.

Proposition 2 Problem 1 with matrix entries in $\mathbb{Q}$ and $D_F = D^a_F$ is polynomial time solvable.

Proof of Proposition 2: Consider the following polynomial expression:

$$M_1^T (\bar{Q}_1 + Q_1) \cdots M_{L-1}^T (\bar{Q}_{L-1} + Q_{L-1}) M_L^T, \qquad (13)$$

where we decomposed $D_i = \bar{Q}_i + Q_i$ in Definition 2, such that $\bar{Q}_i$ is constant and diagonal, with zero entries except for the entries equal to 1 which are enforced by the network activation and sign pattern: strictly positive activation before application of ReLU when the network is evaluated at $x$, or identity activations. Furthermore, $Q_i$ contains $q_i \leq p_i$ diagonal variables to be chosen in $[0, 1]$, corresponding to the zero activation pattern before application of ReLU, for $i = 1, \ldots, L-1$. The strictly negative values before application of ReLU do not play an additional role: they correspond to diagonal entries constrained to 0 in both $\bar{Q}_i$ and $Q_i$, $i = 1, \ldots, L-1$.

Note that a polynomial is constant on a box with nonempty interior if and only if it is constant everywhere, so the polynomial expression in (13) is constant when the diagonal entries are constrained to $[0, 1]$ if and only if it is constant. The problem therefore reduces to deciding whether the polynomial expression in (13) is nonconstant with respect to the variables $Q_1, \ldots, Q_{L-1}$. We show that this reduces to a graph connectivity problem over $2 + \sum_{i=1}^{L-1} q_i$ vertices, with edge weights given by partial products in (13).

First, the problem can be reduced to finding a nonzero value of the expression in (13). Indeed, one can subtract the value obtained by choosing $Q_i = 0$, $i = 1, \ldots, L-1$, and use the following block representation (14):

Therefore, expression (13) is nonconstant if and only if the expression in (14) takes a nonzero value for some assignment of $Q_1, \ldots, Q_{L-1}$.
The number of variables in (13) and (14) is the same and they have exactly the same form. Therefore we assume without loss of generality that the problem is to decide whether the polynomial expression in (13) is not equal to the null polynomial. Expression (13) is a vector function, each of its coordinates being a polynomial function. It is not identically null if and only if there exists a coordinate which is not the null polynomial, so we may add a diagonal matrix $Q_0$ with $p_0 = p$ diagonal entries in $[0, 1]$ (and $\bar{Q}_0 = 0$ for the sake of symmetry) and $M_0 \in \mathbb{R}^{p \times 1}$ the vector of all ones, and find a nonzero value for the product

$$M_0^T (\bar{Q}_0 + Q_0) M_1^T (\bar{Q}_1 + Q_1) \cdots M_{L-1}^T (\bar{Q}_{L-1} + Q_{L-1}) M_L^T. \qquad (15)$$

Expression (15) is now real valued and therefore defines a polynomial. For each $i = 0, \ldots, L-1$, denote by $d_i \in [0, 1]^{q_i}$ the vector containing the diagonal entries of the matrix $Q_i$; this corresponds exactly to the variable diagonal elements of $D_i$ in Definition 2. Denote by $P(d_0, \ldots, d_{L-1})$ the obtained polynomial. $P$ is multilinear in $d_0, \ldots, d_{L-1}$, that is, it has an affine dependency on one block vector if the others are fixed. Therefore the Hessian of $P$ has zero diagonal blocks and the function is harmonic (the Hessian has zero trace). As a consequence, the maximum principle for harmonic functions entails that its maximum and minimum on any polytope are attained at vertices.

For $i = 0, \ldots, L-1$, denote by $\Delta_i \subset \mathbb{R}^{q_i}$ the convex hull of the origin and the canonical basis vectors; this is a $q_i$-dimensional simplex with nonempty interior. The polynomial $P$ in (15) is identically zero if and only if it vanishes on the product of simplices $\Delta_0 \times \cdots \times \Delta_{L-1}$ (which has nonempty interior), if and only if it vanishes on the product set of the vertices of these simplices, by the maximum principle.
In other words, $P$ is not identically zero if and only if it takes a nonzero value when each $d_i$ is restricted to be either an element of the canonical basis (a vector with exactly one entry equal to 1 and all other entries zero) or the null vector. Define a graph with a layer structure:



Constants need to be distinguished from variables (for instance to define a polynomial)




, DISCUSSION AND TECHNICAL ELEMENTS

A.1 COMMENTS ON SECTION 3

A.1.1 COMPUTATIONAL MODEL IN SECTION 3.1

DAG representation and examples 3.1: We start with a remark regarding representations of programs as directed acyclic graphs and use them to illustrate the model of computation proposed in the main text. It reduces to that of arithmetic circuit complexity for a dictionary composed of elementary arithmetic operations.

Remark 4 (Programs as directed graphs) A predecessor relation trivially describes a directed acyclic graph (DAG). Therefore, a program is equivalently represented as a DAG, with nodes corresponding either to input variables (empty predecessor set) or to computations (nonempty predecessor set). Directed edges connect predecessor nodes to their successors. Each computation node contains a lower-level program (with a single output), the number of incoming edges being consistent with the number of arguments. The cost of a node is that of the underlying program, and the cost of $P$ is the sum of the costs of its nodes. Nodes without outgoing edges are output nodes. See examples in Appendix A.1.
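The DAG view of Remark 4 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's formal model: the toy dictionary `DICTIONARY` with unit costs, the function `run_program` and the example `PROGRAM` are hypothetical, nodes are assumed to be listed in topological order, and the single output is assumed to be the last node.

```python
import operator

# Assumed toy dictionary: each elementary function comes with a Python
# implementation and a cost (unit costs are an assumption of this sketch).
DICTIONARY = {
    "+": (operator.add, 1),
    "*": (operator.mul, 1),
}


def run_program(num_inputs, nodes, inputs):
    """Evaluate a program given as a DAG listed in topological order.

    `nodes` is a list of (op_name, predecessors); predecessors index earlier
    nodes, indices 0..num_inputs-1 being reserved for the input nodes.
    Returns (value of the last node, total cost = sum of the node costs).
    """
    values = list(inputs)
    assert len(values) == num_inputs
    total_cost = 0
    for op_name, predecessors in nodes:
        func, cost = DICTIONARY[op_name]
        values.append(func(*(values[j] for j in predecessors)))
        total_cost += cost
    return values[-1], total_cost


# Program for (x1 + x2) * x1: node 2 = node 0 + node 1, node 3 = node 2 * node 0.
PROGRAM = [("+", (0, 1)), ("*", (2, 0))]
```

With these conventions, `run_program(2, PROGRAM, (3, 4))` evaluates $(3 + 4) \times 3$ and reports a cost of two elementary operations.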



TABLE 1 OF THE $\mathcal{D}_{\mathrm{ReLU}}$-DICTIONARY.

The proof of Corollary 1 follows from Theorem 2 by computing the relevant constants. They are shown in Table 1; let us justify the proposed numbers.

Extension of cost table. $c_{\mathrm{nonlin}} \geq 1$ is the cost of nonlinear operations and $c_{\mathrm{ReLU}} \geq 0$ is the cost of sign evaluation for $\mathrm{ReLU}$ or $\mathrm{ReLU}'$. We use the same framework as in B.2 and we identify the cost of comparing two real numbers with $c_{\mathrm{ReLU}} > 0$. For each program $g$ and associated derived program $d$, we let

Extension of cost table. $c_{\mathrm{nonlin}} \geq 1$ is the cost of nonlinear operations and $c_{\mathrm{ReLU}} \geq 0$ is the cost of sign evaluation for $\mathrm{ReLU}$ or $\mathrm{ReLU}'$. For simplicity, $c_{\mathrm{ReLU}}$ is abbreviated $c_R$ and $c_{\mathrm{nonlin}}$ is abbreviated $c_{nl}$.

The output of each $\mathrm{ReLU} \circ \max$ clause is 1 if and only if at least one of its arguments is 1, if and only if one of the literals of the corresponding disjunction is 1, if and only if the disjunction applied to the corresponding boolean variables is true. Otherwise, it is 0. Similarly, the min combination has positive output if and only if all max outputs are 1, if and only if all the disjunctions applied to the variables $b_i$ are true. This shows that Problem 1 is NP-hard, because $0 \in \partial^c F(0)$, and $\partial^c F(0) \neq \{0\}$ if and only if there exist two distinct elements in $\partial^c F(0)$. $\square$

D.4 PROOF OF FEASIBILITY FOR AUTODIFF CONSERVATIVE GRADIENT

The counterpart of Problem 1 for the AD conservative gradient in Definition 2 is tractable, illustrating a major computational difference between the Clarke subdifferential and the AD conservative gradient. The proof is in Section D.4, by reduction to a graph shortest path problem. By the polynomial time equivalence between linear ReLU networks and programs on $\{+, -, \mathrm{ReLU}\}$ proved in Section D.1, this proves Proposition 1.

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

The authors acknowledge the support of the AI Interdisciplinary Institute ANITI funding under the grant agreement ANR-19-PI3A-0004. The authors acknowledge the support of the Association nationale de la recherche et de la technologie (ANRT) and Thales LAS France, which contributed to Ryan B's grant. Jérôme B. and Edouard P. acknowledge the financial support of Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant numbers FA9550-19-1-7026 and FA8655-22-1-7012, and ANR MaSDOL 19-CE23-0017-01. Jérôme B. also acknowledges the

C PROOFS OF SECTION 5.1

C.1 PROOF OF THE MAIN RESULT

Proof of Theorem 3: Let $U \in \mathbb{R}^{p \times p}$ be an orthogonal matrix with entries in $\{-1, 1\}$ whose columns are denoted by $u_1, \ldots, u_p$ (with squared norm $p$). Assume that we have as variables a matrix $M \in \mathbb{R}^{p \times p}$ and two matrices $A, B \in \mathbb{R}^{p \times p}$ with columns $a_1, \ldots, a_p$ and $b_1, \ldots, b_p$ respectively.

Consider the function

The pair $(M, B)$ will be identified as $y$ in the statement of the theorem. Considering the dictionary of elementary functions $\{+, \times, \mathrm{ReLU}, +c, \times c\}$, $F$ has a representation as a program $P_F$ using the identity $|t| = \mathrm{ReLU}(t) + \mathrm{ReLU}(-t)$ for all $t \in \mathbb{R}$. We may construct $P_F$ such that $\mathrm{cost}(P_F) = 6p^2 + 2p \leq 8p^2$, where we count $2p^2 - p$ operations for each matrix vector multiplication needed to evaluate $U B^T M x$ (there are three of them), $p$ multiplications by $-1$ to evaluate $-U B^T M x$, $2p$ applications of ReLU (on $U B^T M x$ and $-U B^T M x$), $p$ additions of ReLU outputs to evaluate $p$ applications of the absolute value, $p - 1$ additions for the outer sum and 1 operation for the division.

Now consider the constraints

The set of matrices $A, B, M$ satisfying this constraint is an open set, call it $S$. We now restrict our attention to this open set and argue that $\mathrm{cost}(P')$ does not change if the input variables are constrained to be in $S$.

We have for all $i = 1, \ldots, p$ and $(A, B, M) \in S$ the following directional derivatives with respect to the variable $x$: $F'_1(0, B, M, a_i) =$

Setting the function $G : (A, B, M) \mapsto \sum_{i=1}^{p} F'_1(0, B, M, a_i) = \mathrm{Tr}(M A B^T)$, we have that $G$ is a polynomial and $\nabla_M G(A, B, M) = \sum_{i=1}^{p} b_i a_i^T = B A^T$. Note that this does not depend on $M$. Fix $P'$, any program implementing the directional derivatives function $(y, A) \mapsto F'_1(0, y, A)$ of $F$ described above, with dictionary $\{+, \times, \mathrm{ReLU}, \mathrm{ReLU}', +c, \times c\}$, as in the statement of the theorem.

Claim 1 There is a program $P_2$ on dictionary $\mathcal{D} = \{+, \times, +c, \times c\}$ such that $G = [P_2]$ (on the whole space) and $\mathrm{cost}(P_2) \leq \mathrm{cost}(P') + p$.

We use the DAG representation of programs as in Remark 4. Therefore $P'$ is described by a DAG whose nodes are either input nodes or computation nodes implementing functions from $\mathcal{D}'_{\mathrm{ReLU}}$. We will modify the program by simple modifications of the computation nodes. We may obtain a program $P_0$ implementing $G$ on $S$ with dictionary $\mathcal{D}'_{\mathrm{ReLU}}$ with $\mathrm{cost}(P_0) \leq \mathrm{cost}(P') + p$ by summing the outputs of $P'$.
The $\mathrm{ReLU}'$ nodes in $P_0$ represent a semialgebraic function (Coste, 2000a;b) with values in a finite set. Therefore, there is a dense open semialgebraic set on which all $\mathrm{ReLU}'$ nodes in $P_0$ are locally constant (Coste, 2000a, Theorem 6.7). Reducing $S$ if necessary, we obtain a program $P_1$ on dictionary $\mathcal{D}_{\mathrm{ReLU}}$ such that $P_1 = P_0$ on $S$ by replacing each $\mathrm{ReLU}'$ node in $P_0$ by the corresponding constant. We have $\mathrm{cost}(P_1) \leq \mathrm{cost}(P_0)$ (we replace computing nodes by constants). By Lemma 1, there is a program $P_2$ on $\mathcal{D}$ such that $\mathrm{cost}(P_2) = \mathrm{cost}(P_1) \leq \mathrm{cost}(P_0) \leq \mathrm{cost}(P') + p$ and $G = [P_2]$ (on the whole space). This proves the claim.

We may obtain a program $D_2$ implementing $\nabla_M G$ with dictionary $\mathcal{D}$ by backward algorithmic differentiation on $P_2$, that is $D_2 = \mathrm{backprop}(P_2)$. We have therefore

$$\mathrm{comp}(B A^T, \mathcal{D}) \leq \mathrm{cost}(D_2) \leq \mathrm{cost}(P_2, D_2) \leq 5\,\mathrm{cost}(P_2) \leq 5p + 5\,\mathrm{cost}(P'),$$

• The source layer $V_{-1}$ contains a single source node, $v_{-1,1}$.
• The zero-th layer $V_0$ contains $q_0 = p$ nodes $v_{0,1}, \ldots, v_{0,q_0}$.
• Recursively, the $i$-th layer $V_i$ contains $q_i$ nodes $v_{i,1}, \ldots, v_{i,q_i}$, for $i = 1, \ldots, L-1$.
• The sink layer $V_L$ contains a single node $v_{L,1}$.

We connect nodes between consecutive layers, respecting the order induced by the layer structure. For $i = -1, \ldots, L-1$ and $j = 0, \ldots, L$, with $j > i$, we connect layers $V_i$ and $V_j$ as follows

where if $j = i + 1$ the product reduces to the identity ($R = M_j^T$).
• For $k = 1, \ldots, q_i$ and $l = 1, \ldots, q_j$, add an edge between $v_{i,k}$ and $v_{j,l}$ if $R_{k,l} \neq 0$.

The resulting graph has a number of nodes equal to the number of ReLU functions in $F$ plus $p$ additional nodes and the source and sink nodes. Computation of edges can be done in polynomial time: it requires at most $4(L+1)^2$ matrix products involving at most $2L + 1$ matrices.
Indeed, the product of $m$ matrices has polynomial time complexity in the representation bit size of the $m$ input matrices.

In this graph, a directed path from the source to the sink visits each layer at most once, and in that case it visits a single node. Each such path corresponds to a monomial with nonzero coefficient appearing in the polynomial $P$ in (15), by construction of the graph structure. Conversely, each nonzero coefficient of a given monomial in (15) is uniquely associated with a path in the graph, which corresponds to the nodes associated with the variables in the monomial. Therefore, the source is connected to the sink if and only if there is a nonzero monomial in (15), if and only if the corresponding polynomial is nonzero. Furthermore, each path corresponds to the evaluation of the program at a vertex of the product $\Delta_0 \times \cdots \times \Delta_{L-1}$. Therefore finding a path connecting the source to the sink allows one to compute a nonzero element in the product, and if no such path exists, the polynomial is identically zero.

So we have shown that the truth value of Problem 1 with $D_F = D^a_F$ is the same as the source being connected to the sink by a directed path in the graph we defined, which has size polynomially bounded in the network size. Connectivity can be solved, for example using Dijkstra's algorithm, in time $O(|V|^2)$ where $|V|$ is the number of nodes (or vertices). A path represents a nonzero element of $D_F(0)$ and if no such path exists, we conclude that $D_F(0) = \{0\}$. This shows that the problem is solvable in polynomial time and concludes the proof.
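The final connectivity test can be sketched in a few lines of Python. The text mentions Dijkstra's algorithm; since only reachability matters, this sketch swaps in a plain breadth-first search, which is a choice made for this illustration. The adjacency-dictionary encoding, the function name `find_path` and the node labels in the usage note are hypothetical, and building the edges from the matrix products $R$ is not covered here.

```python
from collections import deque


def find_path(edges, source, sink):
    """Return a directed path from source to sink as a list of nodes, or None.

    `edges` maps each node to an iterable of its successors.
    """
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            # Walk back through the parents to recover the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for successor in edges.get(node, ()):
            if successor not in parent:
                parent[successor] = node
                queue.append(successor)
    return None
```

For a toy layered graph such as `{"source": ["v0_1", "v0_2"], "v0_1": ["v1_1"], "v0_2": ["sink"]}`, the call `find_path(edges, "source", "sink")` returns a path through `"v0_2"`; in the setting of the proof, a returned path would identify a monomial of $P$ with nonzero coefficient, and `None` would correspond to $D_F(0) = \{0\}$.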

D.5 ADDITIONAL LEMMAS

The following lemma provides a characterization of singleton subgradients for linear ReLU networks.

Lemma 2 Let $F$ be a linear ReLU network, then $\partial^c F(0) = \{0\}$ if and only if $F$ is constant.

Proof: If $F$ is constant, the result is immediate because $F = 0$. Now, suppose that $\partial^c F(0) = \{0\}$. We know that $F$ is piecewise linear and there exists a finite set of polyhedra whose union is $\mathbb{R}^p$, where $F$ is affine over each polyhedron. Furthermore, $F$ is positively homogeneous, therefore for each $x \in \mathbb{R}^p$, $\partial^c F(x) = \partial^c F(\lambda x)$ with $\lambda > 0$. Setting $R \subset \mathbb{R}^p$, the full measure set where $F$ is differentiable, one has that for all $x \in \mathbb{R}^p$ and

Therefore, each affine part has zero derivative on each polyhedron and by continuity we conclude that $F$ is constant. $\square$

The next lemma describes an explicit representation of the maximum of finitely many numbers using a ReLU network with weights in $\{-1, 0, 1\}$.

Lemma 3 Given $k \in \mathbb{N}$, $k > 0$, there exists $F$, a ReLU network with $k$ ReLU layers of size at most $3 \times 2^{k-1}$ and weight matrices with entries in $\{-1, 0, 1\}$, with $p = 2^k$ inputs, such that for any $x \in \mathbb{R}^p$, $F(x) = \max_{i=1,\ldots,2^k} x_i$.

Proof: We proceed by recursion on $k$. Note that for any $x_1, x_2 \in \mathbb{R}$,

$$\max\{x_1, x_2\} = \mathrm{ReLU}(x_1 - x_2) + x_2 = \mathrm{ReLU}(x_1 - x_2) + \mathrm{ReLU}(x_2) - \mathrm{ReLU}(-x_2).$$

Set the matrices

$$A = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}.$$

The function $F_1 : \mathbb{R}^2 \to \mathbb{R}$ given by $F_1(x) = B\,\mathrm{ReLU}(Ax)$ satisfies $F_1(x) = \max\{x_1, x_2\}$. This proves the result for $k = 1$.

Now assume that for $k \geq 1$, we have a network with $k$ ReLU layers of size at most $3 \times 2^{k-1}$ represented by matrices $M_1, \ldots, M_{k+1}$ with entries in $\{-1, 0, 1\}$, such that the corresponding ReLU network $F_k$, as in (11), satisfies $F_k(x) = \max_{i=1,\ldots,2^k} x_i$ for all $x \in \mathbb{R}^{2^k}$.

Set $\tilde{F}_k$ the concatenation of two copies of $F_k$, that is $\tilde{F}_k : \mathbb{R}^{2^{k+1}} \to \mathbb{R}^2$, such that for all $x, y \in \mathbb{R}^{2^k}$,

$$\tilde{F}_k(x, y) = \begin{pmatrix} \max_{i=1,\ldots,2^k} x_i \\ \max_{i=1,\ldots,2^k} y_i \end{pmatrix}.$$

The matrices representing $\tilde{F}_k$ can be described in block form

$$\tilde{M}_i = \begin{pmatrix} M_i & 0 \\ 0 & M_i \end{pmatrix} \in \mathbb{R}^{(2p_i) \times (2p_{i-1})}$$

for $i = 1, \ldots, k+1$, where $p_0 = 2^k$ and $p_{k+1} = 1$.
This network is made of $k$ layers of size at most $3 \times 2^{k}$, it has $2^{k+1}$ inputs and two outputs, and its weight matrices have elements in $\{-1, 0, 1\}$.

The block representation of the last matrix of this network is of the form

$$\tilde{M}_{k+1} = \begin{pmatrix} M_{k+1} & 0 \\ 0 & M_{k+1} \end{pmatrix} \in \mathbb{R}^{2 \times 2l},$$

where $l$ is the size of the row vector $M_{k+1}$. We have

We set $F_{k+1}(x, y) = F_1(F_k(x), F_k(y)) = F_1(\tilde{F}_k(x, y))$ for all $x, y \in \mathbb{R}^{2^k}$. In matrix notation we have $F_{k+1}(x, y) = B\,\mathrm{ReLU}(A \tilde{F}_k(x, y))$.

The involved matrices are $M_{k+2} = B$, $A \times \tilde{M}_{k+1}$ and $\tilde{M}_k, \ldots, \tilde{M}_1$. They all have entries in $\{-1, 0, 1\}$ and the corresponding network has layers of size at most $3 \times 2^{k}$. The result then holds by recursion. $\square$

