ON THE COMPLEXITY OF NONSMOOTH AUTOMATIC DIFFERENTIATION

Abstract

Using the notion of conservative gradient, we provide a simple model to estimate the computational costs of the backward and forward modes of algorithmic differentiation for a wide class of nonsmooth programs. The overhead complexity of the backward mode turns out to be independent of the dimension when using programs with locally Lipschitz semi-algebraic or definable elementary functions. This considerably extends Baur-Strassen's smooth cheap gradient principle. We illustrate our results by establishing fast backpropagation results of conservative gradients through feedforward neural networks with standard activation and loss functions. Nonsmooth backpropagation's cheapness contrasts with concurrent forward approaches, which have, to this day, dimension-dependent worst-case overhead estimates. We provide further results suggesting the superiority of backward propagation of conservative gradients. Indeed, we relate the complexity of computing a large number of directional derivatives to that of matrix multiplication, and we show that finding two subgradients in the Clarke subdifferential of a function is an NP-hard problem.

1. INTRODUCTION

Automatic evaluation of derivatives: Algorithmic differentiation (AD) appeared around 60 years ago (Beda et al., 1959; Wengert, 1964), and has since been constantly developed and used in many contexts; see Griewank et al. (1989); Griewank and Walther (2008) for a thorough discussion. Today, it is at the core of modern learning architectures (Rumelhart et al., 1986; LeCun et al., 2015; Baydin et al., 2018), to the point that training a neural network (NN) is ultimately a way to combine the outputs of AD. Many practical and theoretical developments are available nowadays: flexible and efficient numerical libraries (Abadi et al., 2016; Paszke et al., 2019; Bradbury et al., 2018), an implicit differentiation theory (Griewank and Faure, 2003; Griewank and Walther, 2008) and its extensions (Agrawal et al., 2019; Bai et al., 2019; Bolte et al., 2021; Blondel et al., 2021), the adjoint method (Farrell et al., 2013; Pearlmutter, 1995; Plessix, 2006) with application to neural ODEs (Chen et al., 2018), "piggyback" style differentiation of optimization algorithms (Griewank and Faure, 2003; Mehmood and Ochs, 2020; Bertrand et al., 2020; Lorraine et al., 2020), and differentiation of conjugate gradient algorithms (Gratton et al., 2014). Backward algorithmic differentiation, or backpropagation, plays a particular role when smooth optimization tasks are at stake, as it evaluates the gradient of a function at a cost proportional to that of function evaluation, independently of the dimension. This property, called the cheap gradient principle (Wolfe, 1982; Griewank and Walther, 2008), is at the root of the machine learning libraries revolution. According to the key complexity-theoretic version of this result, due to Baur and Strassen (1983), the arithmetic complexity of evaluating the derivative of a rational function is at most 5 times the complexity of evaluating the function itself.
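To make the cheap gradient principle concrete, the following minimal Python sketch (an illustration of reverse-mode AD, not the formal computational model developed later; all names are ours) records elementary operations on a tape during evaluation and then performs a single backward sweep. The sweep touches each recorded operation a bounded number of times, so the cost of the gradient is proportional to that of evaluating the function, independently of the number of inputs.

```python
# Minimal tape-based reverse-mode AD sketch (illustrative only).
# Evaluation records each elementary operation together with its local
# partial derivatives; a single reverse sweep then accumulates adjoints.

class Var:
    def __init__(self, value, parents=()):
        self.value = value        # primal value
        self.parents = parents    # pairs (parent Var, local partial derivative)
        self.grad = 0.0           # adjoint, filled by the backward sweep

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    # Reverse topological sweep over the recorded operations: the work is
    # proportional to the work of the forward evaluation.
    out.grad = 1.0
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    for v in reversed(order):
        for p, d in v.parents:
            p.grad += d * v.grad

# f(x, y) = (x + y) * x, so df/dx = 2x + y and df/dy = x.
x, y = Var(3.0), Var(2.0)
f = (x + y) * x
backward(f)
print(f.value, x.grad, y.grad)   # 15.0 8.0 3.0
```

One backward sweep here produces all partial derivatives at once, whereas a forward-mode pass produces a single directional derivative per sweep.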
Extensions exist for smooth differentiable functions (Baur and Strassen, 1983; Griewank and Walther, 2008), but little is known about the nonsmooth case. The objective of this paper is precisely to present a simple, general, nonsmooth cheap conservative gradient principle and to explore other complexity results for evaluating nonsmooth derivatives. This extends the cheap gradient principle of smooth AD to the path-differentiable world (Bolte and Pauwels, 2020b), which includes semi-algebraic and, more generally, definable functions (Coste, 2000a;b), a class that contains the vast majority of machine learning programs used in practice; see for example Bolte and Pauwels (2020b). Nonsmooth AD & computational complexity: Sorting values, pooling data, thresholding functions, or determining closest points are some of the most essential numerical decision operations. They are ubiquitous in machine learning and modern optimization. All of them are nonsmooth, and most of them have a very desirable feature: they are cheap to compute, much cheaper than smoothed surrogates. For instance, the famous ReLU activation in deep learning, whose role is to threshold negative values to zero to allow for the inactivity of neurons, requires only one bit of encoding in theory. On the other hand, other nonlinear activations potentially require auxiliary algorithms for their evaluation, incurring a higher computational cost. This simplicity of use also comes with the issue of finding an adequate way of training models and, thus, of differentiating such objects. The standard computational practice of AD consists in applying differential calculus rules directly to nonsmooth objects, replacing gradients by surrogates, typically Clarke subgradients. This is how AD is performed within TensorFlow, PyTorch or Jax. This approach has shown tremendous success (LeCun et al., 2015) and has been massively used for the last 10 years.
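As a concrete illustration of this practice (our own sketch; the choice of derivative 0 for relu at the origin follows a common framework convention), formally chaining surrogate derivatives may disagree with the true gradient: the program below implements the identity function as relu(x) − relu(−x), yet the formal chain rule returns 0 at the origin instead of 1.

```python
# Sketch: formal nonsmooth AD fixes a surrogate derivative at kinks.
# Here relu'(0) is set to 0, a common framework convention.

def relu(x):
    return max(x, 0.0)

def relu_prime(x):
    return 1.0 if x > 0 else 0.0   # surrogate value 0 chosen at the kink x = 0

def f(x):
    # f(x) = relu(x) - relu(-x) equals the identity function for every x.
    return relu(x) - relu(-x)

def f_prime_ad(x):
    # Chain rule applied formally to the surrogate derivatives.
    return relu_prime(x) - relu_prime(-x) * (-1.0)

print(f(0.0), f_prime_ad(0.0))   # 0.0 0.0  (true derivative of the identity is 1)
print(f(2.0), f_prime_ad(2.0))   # 2.0 1.0
```

The output 0 at the origin is not the gradient, but it is exactly the kind of program-dependent output that conservative gradients are designed to model.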
Yet, despite this empirical success, Barton et al. (2018) claimed that "there does not seem to exist [at this day] a true analogous reverse AD mode to compute generalized derivatives for nonsmooth functions", illustrating the difficulty of nonsmooth AD. Conservative gradients were introduced by Bolte and Pauwels (2020a;b); Bolte et al. (2021) as a faithful mathematical model capturing the formal application of calculus rules to subdifferentials. The reader unfamiliar with this notion may think of conservative gradients, in an ML context, as the outputs of calculus rules formally applied to Clarke subgradients and Jacobians. Our goal is to provide an adequate computational complexity theory for conservative calculus, a theory that therefore matches standard practical approaches. Among other possible first-order options offered by nonsmooth calculus, we also investigate the properties of directional derivatives and those of the Clarke subdifferential. For directional derivatives, our motivation comes from the fact that this nonsmooth operation has general calculus rules, while the Clarke subdifferential is central in terms of variational interpretation.
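The general calculus rules for directional derivatives can be illustrated by a forward-mode sketch (ours; a simplified stand-in for frameworks in the spirit of Khan and Barton) that propagates pairs (value, directional derivative) through elementary operations. Unlike surrogate subgradient calculus, these rules are exact even at kinks: the directional derivative of relu at 0 in direction d is max(d, 0).

```python
# Minimal forward-mode sketch propagating (value, directional derivative) pairs.
# Directional derivatives compose exactly, including at kinks.

def d_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def d_mul(a, b):
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def d_relu(a):
    v, d = a
    if v > 0:
        return (v, d)
    if v < 0:
        return (0.0, 0.0)
    return (0.0, max(d, 0.0))   # exact directional derivative at the kink

# g(x, y) = relu(x * y) at (x, y) = (0, 3) in direction (1, 0):
# the inner directional derivative is 3, and relu at 0 maps it to max(3, 0) = 3.
x, y = (0.0, 1.0), (3.0, 0.0)
g = d_relu(d_mul(x, y))
print(g)   # (0.0, 3.0)
```

The price of this exactness is that each sweep yields the derivative in a single direction, so many directions require many sweeps.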

Contributions:

The main thesis of this work is that conservative gradients have computational properties similar to smooth derivatives, which are much more favorable than those of alternative nonsmooth oracles such as subgradients or directional derivatives.
• We provide a simple computational model for addressing the question of a complexity theory for nonsmooth numerical programs.
• For the backward mode, we prove a cheap conservative gradient principle à la Baur-Strassen, generalizing the state of the art to nonsmooth programs modeling most NNs. We establish that, regardless of the dimension, the computational cost of a conservative gradient is of the order of that of function evaluation. Our results provide a theoretical validation of the fact that the cost of backpropagation does not depend on the program's smoothness.
• For the forward mode, we relate the computational cost of p directional derivatives to that of p × p matrix multiplication. We provide lower complexity bounds that illustrate the limits to which this deficiency may be improved. This applies to existing nonsmooth AD frameworks (Khan and Barton, 2012; 2013).
• We establish that computing two distinct elements in the Clarke subdifferential at a given point is NP-hard for simple ReLU programs. This result also applies to the lexicographic subdifferential. In contrast, we show that the problem can be solved in polynomial time for conservative gradients. This reflects the computational difficulty of dealing with the Clarke subdifferential.
• A result of independent interest: deciding the differentiability of a ReLU program at a point is NP-hard.
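The link between multiple directional derivatives and matrix multiplication stated above can already be seen on the simplest possible program, a single linear layer: propagating p directions through it amounts to a matrix-matrix product. The NumPy sketch below is ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
A = rng.standard_normal((n, n))    # weights of one linear layer x -> A @ x
D = rng.standard_normal((n, p))    # p directions, one per column

# One forward-mode sweep per direction: each column of D yields one
# directional derivative A @ D[:, j]...
jvp_one_by_one = np.stack([A @ D[:, j] for j in range(p)], axis=1)

# ...and stacking the p sweeps is exactly the matrix product A @ D.
assert np.allclose(jvp_one_by_one, A @ D)
```

For deeper programs the same observation applies layer by layer, which is why the cost of many directional derivatives is naturally tied to that of matrix multiplication.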



Relation with existing work: Conservative gradients were introduced in Bolte and Pauwels (2020a;b) to model "formal subdifferentiation" used by practitioners and nonsmooth "backpropagation". They were further studied in Lewis and Tian (2021); Davis and Drusvyatskiy (2021); Bolte et al. (2021) and empirically investigated in Bertoin et al. (2021). Computational complexity was

