NEURAL NETWORK APPROXIMATION OF LIPSCHITZ FUNCTIONS IN HIGH DIMENSIONS WITH APPLICATIONS TO INVERSE PROBLEMS

Abstract

The remarkable successes of neural networks in a huge variety of inverse problems have fueled their adoption in disciplines ranging from medical imaging to seismic analysis over the past decade. However, the high dimensionality of such inverse problems has simultaneously left current theory, which predicts that networks should scale exponentially in the dimension of the problem, unable to explain why the seemingly small networks used in these settings work as well as they do in practice. To reduce this gap between theory and practice, a general method for bounding the complexity required for a neural network to approximate a Lipschitz function on a high-dimensional set with a low-complexity structure is provided herein. The approach is based on the observation that the existence of a linear Johnson-Lindenstrauss (JL) embedding A ∈ R^{d×D} of a given high-dimensional set S ⊂ R^D into a low-dimensional cube [-M, M]^d implies that for any Lipschitz function f : S → R^p, there exists a Lipschitz function g : [-M, M]^d → R^p such that g(Ax) = f(x) for all x ∈ S. Hence, if one has a neural network which approximates g : [-M, M]^d → R^p, then a layer can be added which implements the JL embedding A in order to obtain a neural network which approximates f : S → R^p. By pairing JL embedding results with results on the approximation of Lipschitz functions by neural networks, one then obtains bounds on the complexity required for a neural network to approximate Lipschitz functions on high-dimensional sets. The end result is a general theoretical framework which can be used to better explain the observed empirical successes of smaller networks in a wider variety of inverse problems than current theory allows.

1. INTRODUCTION

At present, various network architectures (fully connected NNs, CNNs, ResNets) achieve state-of-the-art performance in a broad range of inverse problems, including matrix completion (Zheng et al., 2016; Monti et al., 2017; Dziugaite & Roy, 2015; He et al., 2017), image deconvolution (Xu et al., 2014; Kupyn et al., 2018), low-dose CT reconstruction (Nah et al., 2017), and electric and magnetic inverse problems (Coccorese et al., 1994), as well as seismic analysis and electromagnetic scattering. However, since these problems are very high dimensional, classical universal approximation theory for such networks provides very pessimistic estimates of the network sizes required to learn such inverse maps (i.e., as being much larger than what standard computers can store, much less train). As a result, a gap still exists between the widely observed successes of networks in practice and the network size bounds provided by current theory in many inverse problem applications. The purpose of this paper is to provide a refined bound on the size of networks in a wide range of such applications and to show that the required network size is indeed affordable in many inverse problem settings. In particular, the bound developed herein depends on the model complexity of the domain of the forward map instead of the domain's extrinsic input dimension, and is therefore much smaller in a wide variety of model settings. To be more specific, recall that in most inverse problems one aims to recover some signal x from its measurement y = F(x). Here y and x could both be high-dimensional vectors, or even matrices or tensors, and F, which is called the forward map/operator, could be either linear or nonlinear with various regularity conditions depending on the application. In all cases, however, recovering x from y amounts to inverting F. In other words, one aims to find the operator F^{-1} that sends every measurement y back to the original signal x.
Depending on the specific application of interest, there are various commonly considered forms of the forward map F. For example, F could be a linear map from a high- to a low-dimensional space as in compressive sensing applications; F could be a convolution operator that computes the shifted local blurring of an image as in the image deblurring setting; F could be a mask that filters out the unobserved entries of the data as in the matrix completion application; or F could be the source-to-solution map of a differential equation as in ODE/PDE-based inverse problems. In most of these applications, the inverse operator F^{-1} does not possess a closed-form expression. As a result, in order to approximate the inverse one commonly uses analytical approaches that involve solving, e.g., an optimization problem. Take sparse recovery as an example. With the prior knowledge that the true signal x ∈ R^n is sparse, one can recover it from the underdetermined measurements y = Ax ∈ R^m (with m < n) by solving the optimization problem x̂ = arg min_z ||z||_0 subject to Az = y. The inverse of the linear measurement map F(x) = Ax, when F is restricted to the low-complexity domain of sparse vectors, is then given by F^{-1}(y) = x̂, the minimizer above. Note that traditional optimization-based approaches can be extremely slow for large-scale problems (e.g., for n large above). Alternatively, we can approximate the inverse operator by a neural network instead. Amortizing the initial cost of an expensive training stage, the network can later achieve unprecedented speed at the test stage, leading to better total efficiency over its lifetime. To realize this goal, however, we first need to find a neural network architecture f_θ and train it to approximate F^{-1}, so that the approximation error max_y ||f_θ(y) - F^{-1}(y)|| = max_y ||f_θ(y) - x|| is small.
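As a concrete illustration of the sparse recovery setup above, the sketch below builds a small instance of y = Ax with an s-sparse x and an underdetermined Gaussian A, and recovers x with orthogonal matching pursuit, a standard greedy stand-in for the NP-hard ℓ0 minimization (all problem sizes here are hypothetical choices for illustration, not values from the paper):

```python
import numpy as np

# A small hypothetical instance of the sparse recovery problem y = Ax:
# x is s-sparse in R^n, A is an m x n Gaussian matrix with m < n.
rng = np.random.default_rng(0)
n, m, s = 64, 32, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x[support] = rng.standard_normal(s)
y = A @ x

def omp(A, y, s):
    """Orthogonal matching pursuit: greedily build a support of size s,
    re-fitting the coefficients by least squares at each step."""
    residual, idx = y.copy(), []
    for _ in range(s):
        # Pick the column most correlated with the current residual.
        idx.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        residual = y - A[:, idx] @ coef
    z = np.zeros(A.shape[1])
    z[idx] = coef
    return z

x_hat = omp(A, y, s)
print(np.linalg.norm(x_hat - x))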
The purpose of this paper is to provide a unified way of obtaining a meaningful estimate of the size of the network one needs in situations where the domain of F has low complexity, as is the case in, e.g., compressive sensing, low-rank matrix completion, and deblurring with low-dimensional signal assumptions.
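The low-complexity phenomenon the abstract relies on can be checked numerically: a random Gaussian matrix acts as a Johnson-Lindenstrauss embedding of a low-complexity set (here, sparse vectors) into a much lower dimension while nearly preserving pairwise distances. The following is a minimal sketch with hypothetical dimensions chosen only for illustration:

```python
import numpy as np

# Sketch: a random Gaussian A in R^{d x D} as a JL embedding of a
# low-complexity set S (s-sparse vectors in R^D) into dimension d << D.
rng = np.random.default_rng(1)
D, d, s, N = 1000, 40, 5, 50

# Sample N points of S: random s-sparse vectors in R^D.
points = np.zeros((N, D))
for row in points:
    row[rng.choice(D, size=s, replace=False)] = rng.standard_normal(s)

A = rng.standard_normal((d, D)) / np.sqrt(d)  # scaled so E||Ax||^2 = ||x||^2
embedded = points @ A.T

def pairwise_dists(P):
    """All pairwise Euclidean distances between rows of P."""
    diffs = P[:, None, :] - P[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

orig, emb = pairwise_dists(points), pairwise_dists(embedded)
mask = ~np.eye(N, dtype=bool)       # ignore zero self-distances
ratios = emb[mask] / orig[mask]
print(ratios.min(), ratios.max())   # concentrated near 1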

2. RELATED WORK

The expressive power of neural networks is important in applications both as a means of guiding network architecture design choices and for providing confidence that good network solutions exist in general situations. As a result, numerous results about the approximation power of neural networks have been established in recent years (Zhou, 2020; Petersen & Voigtlaender, 2020; Yarotsky, 2022; 2018; Lin & Jegelka, 2018). Most results concern the approximation of functions on R^D, however, and yield network sizes that increase exponentially with the input dimension D. As a result, the high dimensionality of many inverse problems leads to bounds from most of the existing literature which are too large to explain the observed empirical success of neural approaches in such applications. A similar high-dimensional scaling issue arises in many image classification tasks as well. Motivated by this setting, Chen et al. (2019) refined previous approximation results for ReLU networks and showed that input data lying close to a low-dimensional manifold leads to network sizes that grow exponentially only with respect to the intrinsic dimension of the manifold. However, this improved bound relies on the assumption that the data fit a manifold, which is quite strong in the inverse problems setting. For example, even the "simple" sparse recovery problem does not have a domain/range that forms a manifold (note that the intersections of the s-dimensional subspaces prevent it from being a manifold). Therefore, studying the expressive power of networks on inverse problems requires removing such strict manifold assumptions. Another mild issue with such manifold results is that the number of neurons also depends on the curvature of the manifold in question, which can be difficult to estimate. Furthermore, such curvature dependence is unavoidable for manifold results and needs to be incorporated into any valid bounds.
In this paper, we provide another way to estimate the size of the network, by directly using the Gaussian width of the data as a measure of its inherent complexity. Our result can therefore be



¹ To see why, e.g., curvature dependence is unavoidable, consider any discrete training dataset contained in a compact ball. There always exists a 1-dimensional manifold, namely a curve, that passes through all of the data points. Thus, the mere existence of a 1-dimensional manifold does not mean that the data complexity is low; curvature information and other manifold properties matter as well.
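The Gaussian width w(S) = E[sup_{x∈S} ⟨g, x⟩], with g standard normal, can be estimated by Monte Carlo. For the set of unit-norm s-sparse vectors in R^n the supremum has a closed form (the ℓ2 norm of the s largest-magnitude entries of g), which the hedged sketch below uses; the dimensions are hypothetical choices for illustration:

```python
import numpy as np

# Monte Carlo estimate of the Gaussian width
#   w(S) = E[ sup_{x in S} <g, x> ],  g ~ N(0, I_n),
# for S = {unit-norm s-sparse vectors in R^n}.  For this S the sup
# equals the l2 norm of the s largest-magnitude entries of g.
rng = np.random.default_rng(2)
n, s, trials = 512, 8, 2000

def width_sparse(n, s, trials, rng):
    sups = []
    for _ in range(trials):
        g = rng.standard_normal(n)
        top = np.sort(np.abs(g))[-s:]      # s largest |g_i|
        sups.append(np.linalg.norm(top))   # sup over s-sparse unit vectors
    return float(np.mean(sups))

w_sparse = width_sparse(n, s, trials, rng)
w_sphere = np.sqrt(n)  # the width of the full unit sphere scales as sqrt(n)
print(w_sparse, w_sphere)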

