ON THE UNIVERSAL APPROXIMATION PROPERTY OF DEEP FULLY CONVOLUTIONAL NEURAL NETWORKS

Abstract

We study the approximation of shift-invariant or equivariant functions by deep fully convolutional networks from the dynamical systems perspective. We prove that deep residual fully convolutional networks and their continuous-layer counterpart can achieve universal approximation of these symmetric functions at constant channel width. Moreover, we show that the same can be achieved by nonresidual variants with at least 2 channels in each layer and convolutional kernel size of at least 2. In addition, we show that these requirements are necessary, in the sense that networks with fewer channels or smaller kernels fail to be universal approximators.

1. INTRODUCTION

Convolutional Neural Networks (CNNs) are widely used as fundamental building blocks in the design of modern deep learning architectures, for they can extract key data features with far fewer parameters, lowering both memory requirements and computational cost. When the input data contain spatial structure, such as pictures or videos, this parsimony often does not hurt performance. This is particularly interesting in the case of fully convolutional neural networks (FCNNs) (Long et al., 2015), built by composing convolution, nonlinear activation, and summing (averaging) layers, with the last layer being a permutation-invariant pooling operator; see Figure 1. Consequently, a prominent feature of the FCNN is that when the input data indices are shifted (e.g. in pictures, videos, or other higher-dimensional spatial data), the output remains the same. This is called shift invariance. An example application of FCNNs is image classification, where the class label (or probability, under the softmax activation) of an image remains the same when the image is translated (i.e. its pixels are shifted). A variant of the FCNN applies to problems where the output data has the same size as the input data, e.g. pixel-wise segmentation of images (Badrinarayanan et al., 2017). In this case, simply stacking the fully convolutional layers is enough. We call this type of CNN an equivariant fully convolutional neural network (eq-FCNN): when the input data indices are shifted, the output data indices shift by the same amount. This is called shift equivariance. It is believed that the success of these convolutional architectures hinges on shift invariance or equivariance, which capture intrinsic structure in spatial data. From an approximation theory viewpoint, this presents a delicate trade-off between expressiveness and invariance: layers cannot be so complex that they break the invariance property, yet must not be so simple that they lose approximation power.
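To make the two symmetries concrete, the following is a minimal NumPy sketch (our own illustration, not code from the paper; all function names are hypothetical). It checks that stacked circular convolutions with ReLU activations are shift equivariant, and that appending a permutation-invariant pooling layer (here, max) makes the whole map shift invariant:

```python
import numpy as np

def circular_conv1d(x, kernels):
    """Circular 1-D convolution: x has shape (C_in, N), kernels (C_out, C_in, k)."""
    c_out, c_in, k = kernels.shape
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            for j in range(k):
                # roll by -j implements periodic (circular) indexing x[i][(n + j) % N]
                out[o] += kernels[o, i, j] * np.roll(x[i], -j)
    return out

def eq_fcnn(x, layers):
    """Stacked conv + ReLU layers: a shift-equivariant map."""
    for w in layers:
        x = np.maximum(circular_conv1d(x, w), 0.0)
    return x

rng = np.random.default_rng(0)
# Width 2, kernel size 2 -- the minimal configuration discussed in the paper.
layers = [rng.standard_normal((2, 1, 2)), rng.standard_normal((2, 2, 2))]
x = rng.standard_normal((1, 8))

y = eq_fcnn(x, layers)
y_shift = eq_fcnn(np.roll(x, 3, axis=1), layers)

# Equivariance: shifting the input shifts the feature maps by the same amount.
assert np.allclose(np.roll(y, 3, axis=1), y_shift)
# Invariance: a permutation-invariant pooling layer removes the shift entirely.
assert np.isclose(y.max(), y_shift.max())
```

The same check carries over to 2-D inputs, since circular convolution commutes with index shifts along each axis.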
The interaction between invariance and network architectures has been a subject of intense study in recent years. For example, Cohen & Welling (2016c) designed steerable CNNs to handle the motion group for robotics. Deep Sets (Zaheer et al., 2017) were proposed to handle permutation invariance and equivariance. Other approaches to building equivariance and shift invariance include parameter sharing (Ravanbakhsh et al., 2017) and the homogeneous space approach (Cohen & Welling, 2016b; Cohen et al., 2019). See Bronstein et al. (2017) for a more recent survey. Among these architectures, the FCNN is perhaps the simplest and most widely used model. Therefore, the study of its theoretical properties is a natural first and fundamental step toward investigating more complicated architectures. In this paper, we focus on the expressive power of the FCNN. Mathematically, we consider whether a function F can be approximated by the FCNN (or eq-FCNN) function family in the L^p sense. This is also known as universal approximation in L^p. The literature contains many such results for fully connected neural networks, e.g. Lu et al. (2017); Yarotsky (2018a); Shen et al. (2019). However, relatively few results address the approximation of shift invariant functions by fully convolutional networks. An intuitive reason is that the symmetry constraint (shift invariance) hinders unconditional universal approximation. This can also be proved rigorously: in Li et al. (2022b), the authors showed that if a function can be approximated by an invariant function family to arbitrary accuracy, then the function itself must be invariant. As a consequence, when we consider the approximation property of the FCNN, we should only consider shift invariant functions. This introduces new difficulties compared with the fully connected setting.
For this reason, many existing results on convolutional network approximation rely on some way of breaking shift invariance, and thus apply to general function classes without symmetry constraints (Oono & Suzuki, 2019). Moreover, results on convolutional networks usually require at least one layer to have a large number of channels. In contrast, we establish universal approximation results for fully convolutional networks in which shift invariance is preserved. Moreover, we show that approximation can be achieved by increasing depth at constant channel number, with fixed kernel size in each layer. The main result of this paper (Theorem 1) shows that if we choose ReLU as the activation function and the terminal layer is a general pooling operator satisfying mild technical conditions (e.g. max, summation), then convolutional layers with at least 2 channels and kernel size at least 2 can achieve universal approximation of shift invariant functions via repeated stacking (composition). The result is sharp in the sense that neither the convolution kernel size nor the channel number can be further reduced while preserving the universal approximation property. To prove the result on FCNNs, we rely on the dynamical systems approach, in which residual neural networks are idealized as continuous-time dynamical systems. This approach was introduced in E (2017) and first used to develop stable architectures (Haber & Ruthotto, 2017) and control-based training algorithms (Li et al., 2018). It was also popularized in the machine learning literature as neural ODEs (Chen et al., 2018). On the approximation theory front, the dynamical systems approach was used to prove universal approximation of general model architectures through composition (Li et al., 2022a). The work of Li et al. (2022b) extended the result to functions/networks with symmetry constraints, and as a corollary obtained a universal approximation result for residual fully convolutional networks with kernel sizes equal to the image size. The results in this paper restrict the kernel size in a more practical way, and can handle common architectures used in applications, which typically have kernel sizes ranging from 3 to 7. Moreover, we also establish the sharpness of the requirements on channel numbers and kernel sizes. The restriction on width and kernel size in fact yields more interesting theoretical consequences: once our approximation results are established under finite (and minimal) width and kernel size requirements, they can be used to obtain the universal approximation property for a variety of larger models, simply by showing that those models contain our minimal construction. In summary, the main contributions of this work are as follows: 1. We prove the universal approximation property of both continuous and time-discretized fully convolutional neural networks with residual blocks and kernel size of at least 2. This result concerns deep but narrow neural networks with residual blocks. We provide



Figure 1: An illustration of a fully convolutional neural network.
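The dynamical systems viewpoint invoked above idealizes each residual block as one forward-Euler step of an ODE. A minimal NumPy sketch of this correspondence (our own illustrative code, not the paper's construction; names are hypothetical) also shows that the residual map preserves shift equivariance, so stacking many such blocks keeps the symmetry intact:

```python
import numpy as np

def circ_conv(x, w):
    """Circular convolution; x: (C, N) feature map, w: (C, C, k) kernel."""
    out = np.zeros_like(x)
    for o in range(w.shape[0]):
        for i in range(w.shape[1]):
            for j in range(w.shape[2]):
                out[o] += w[o, i, j] * np.roll(x[i], -j)
    return out

def residual_fcnn(x, weights, dt):
    """Each block x <- x + dt * ReLU(conv(x)) is one forward-Euler step of
    the continuous-layer dynamics dx/dt = ReLU(conv(x; w(t)))."""
    for w in weights:
        x = x + dt * np.maximum(circ_conv(x, w), 0.0)
    return x

rng = np.random.default_rng(1)
# Depth 5, constant width 2, kernel size 2: deep-but-narrow regime.
weights = [rng.standard_normal((2, 2, 2)) for _ in range(5)]
x = rng.standard_normal((2, 8))

y = residual_fcnn(x, weights, dt=0.1)
# The residual map commutes with index shifts, so the composed network does too.
assert np.allclose(residual_fcnn(np.roll(x, 2, axis=1), weights, dt=0.1),
                   np.roll(y, 2, axis=1))
```

Appending a permutation-invariant pooling operator to this equivariant flow would yield the shift-invariant (FCNN) setting of the main theorem.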

