MODELLING LONG RANGE DEPENDENCIES IN ND: FROM TASK-SPECIFIC TO A GENERAL PURPOSE CNN

Abstract

Performant Convolutional Neural Network (CNN) architectures must be tailored to specific tasks in order to account for the length, resolution, and dimensionality of the input data. In this work, we tackle the need for problem-specific CNN architectures. We present the Continuous Convolutional Neural Network (CCNN): a single CNN able to process data of arbitrary resolution, dimensionality and length without any structural changes. Its key components are its continuous convolutional kernels, which model long-range dependencies at every layer, and thus remove the need of current CNN architectures for task-dependent downsampling and depths. We showcase the generality of our method by using the same architecture for tasks on sequential (1D), visual (2D) and point-cloud (3D) data. Our CCNN matches and often outperforms the current state-of-the-art across all tasks considered.

1. INTRODUCTION

The vast popularity of Convolutional Neural Networks (CNNs) (LeCun et al., 1998) is a result of their high performance and efficiency, which has led them to achieve state-of-the-art results in applications across sequential (Abdel-Hamid et al., 2014; Van Den Oord et al., 2016), visual (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014) and high-dimensional data (Schütt et al., 2017; Wu et al., 2019). Nevertheless, a major limitation of CNNs, and of other neural networks, is that their architectures must be tailored to particular applications in order to account for the length, resolution and dimensionality of the input data. This has led to a plethora of task-specific architectures (Oord et al., 2016; Bai et al., 2018; Simonyan & Zisserman, 2014; Szegedy et al., 2015; Ronneberger et al., 2015; He et al., 2016; Qi et al., 2017; Wu et al., 2019), which (i) hampers the selection of the most appropriate architecture for a particular task, and (ii) obscures the transfer and generalization of insights across applications. In this work, we tackle the need for problem-specific CNN architectures and propose a generic CNN architecture that can be used independently of the length, resolution and dimensionality of the data.

CNN architectures are data dependent. Current CNN architectures are task-specific because they are tied to the length, resolution, and dimensionality of the input. The length of the data varies from task to task, e.g., audio fragments may span milliseconds to minutes. This requires carefully chosen kernel sizes, depths and pooling layers so that the receptive field of the network matches the length of the input. In addition, data is intrinsically continuous. As such, its semantic meaning is independent of the resolution at which it is sampled, e.g., the same audio may be expressed at different resolutions. Nevertheless, current CNN architectures are resolution-bound, and thus different resolutions require different CNNs. These limitations are aggravated for multi-dimensional data. Each input dimension can be defined at different lengths and resolutions, e.g., video, rectangular images, and each data modality brings its own conventions for each of these properties, e.g., the resolution of a second of audio (16kHz) (Warden, 2018) strongly contrasts with that of images (32 × 32) (Krizhevsky et al., 2009).

Towards a unified CNN architecture. As discussed in Sec. 3, the core component that makes CNNs data-dependent is their discrete convolutional kernels. Discrete convolutional kernels are implemented via a one-to-one mapping between kernel values and model parameters (Fig. 1, left), which (i) binds them to the input resolution and length, and (ii) makes them ill-suited to model long-range dependencies, as large convolutional kernels require a correspondingly large number of parameters. This is why standard CNNs favour local kernels in combination with task-dependent depths and pooling layers to model long-range dependencies, at the cost of making the architecture task-dependent.

The need for a continuous parameterization. To overcome task-dependent architectures, it is crucial to define a kernel parameterization that decouples the parameter count from the kernel size. Following Schütt et al. (2017) and Romero et al. (2022b), we use a small neural network to define a continuous mapping from positions to the value of the kernel at those positions. The resulting continuous convolutional kernels (Fig. 2) allow for the construction of convolutional kernels of arbitrary size in a parameter-efficient manner. Consequently, the same convolutional layers, and thus the same CNN, can be used regardless of the input length, resolution and dimensionality. We leverage this formulation to construct the Continuous Convolutional Neural Network (CCNN): a single CNN architecture that can be applied regardless of the input length, resolution and dimensionality.

Empirical results. To showcase the proposed CCNN, we deploy the same architecture for several tasks on sequential (1D), visual (2D) and point-cloud (3D) data. Our CCNN matches and often outperforms the current state-of-the-art across all tasks considered. Importantly, the continuous parameterization of the CCNN allows it to handle irregularly sampled data natively. As a result, the CCNN is not restricted to grid data, e.g., 3D voxels, and can be used on point clouds directly. These properties allow the CCNN to achieve strong empirical results on the tasks considered in 1D, 2D and 3D without structural changes.
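To illustrate the idea behind this continuous parameterization, the following minimal NumPy sketch renders a kernel from the same fixed set of network parameters at two different resolutions and applies it to a 1D signal. The two-layer MLP, its sizes, and the sine nonlinearity are illustrative assumptions for the sketch, not the paper's exact kernel network:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16
# Tiny kernel network phi_kernel: relative position -> kernel value.
# (Assumed 2-layer MLP with a sine nonlinearity; sizes are arbitrary.)
w1, b1 = rng.normal(size=(1, hidden)), rng.normal(size=hidden)
w2, b2 = rng.normal(size=(hidden, 1)), rng.normal(size=1)

def phi_kernel(coords):
    """coords: (num_positions, 1) relative positions in [-1, 1]."""
    return (np.sin(coords @ w1 + b1) @ w2 + b2)[:, 0]

x = rng.normal(size=100)  # a toy 1D input signal
# The SAME parameters render kernels of any size / resolution.
for size in (7, 31):
    coords = np.linspace(-1.0, 1.0, size)[:, None]
    kernel = phi_kernel(coords)             # (size,) kernel values
    y = np.convolve(x, kernel, mode="same")  # convolve with the rendered kernel
    print(size, y.shape)
```

Because the kernel is a function over continuous positions, the same parameters can also be evaluated at arbitrary (e.g., irregularly sampled) coordinates.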

2. RELATED WORK

An extended section with extended comparisons to related works is provided in Appx. A.



Figure 1: Discrete and continuous convolutional kernels. Discrete convolutional kernels assign a weight w_i out of a discrete set of weights W to each relative offset x_i. This ties the kernel to the length, resolution and dimensionality of the input, limiting the general applicability of CNN architectures. Instead, our Continuous Convolutional Neural Network parameterizes kernel values as a continuous function φ_kernel over the input domain R^d, which decouples it from these data characteristics.
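The decoupling described in the caption can be made concrete with a back-of-the-envelope parameter count (a sketch under assumptions: the hidden width of the kernel network is arbitrary). A discrete kernel needs one parameter per kernel position, so its cost grows as size^d, while a small kernel network has a fixed cost independent of kernel size:

```python
def discrete_kernel_params(size: int, dim: int) -> int:
    # One independent weight w_i per kernel position.
    return size ** dim

def kernel_network_params(dim: int, hidden: int = 32) -> int:
    # An assumed 2-layer MLP (dim -> hidden -> 1): weights + biases.
    return dim * hidden + hidden + hidden + 1

for size in (3, 33, 333):
    print(size, discrete_kernel_params(size, dim=2), kernel_network_params(dim=2))
# The discrete count explodes (9, 1089, 110889) while the network stays at 129.
```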

(a) Kernel network. (b) D=1, sequences. (c) D=2, images. (d) D=3, volumes.

Figure 2: Continuous convolutional kernels: the key to a unified CNN architecture. The continuous parameterization of convolutional kernels used in this work consists of a small kernel network φ_kernel that receives coordinates as input and outputs the value of the convolutional kernel at that position (2a). By changing the dimensionality of the coordinates x_i, the same kernel network can render convolutional kernels for sequential (2b), visual (2c), and higher dimensional data (2d).
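The caption's point that only the coordinate dimensionality changes can be sketched as follows (a minimal NumPy illustration; the grid construction and the MLP shapes are assumptions made for the sketch):

```python
import numpy as np

def coordinate_grid(size, dim):
    """Relative coordinates in [-1, 1]^dim on a size**dim grid."""
    axes = [np.linspace(-1.0, 1.0, size)] * dim
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack([m.ravel() for m in mesh], axis=-1)  # (size**dim, dim)

def render_kernel(size, dim, hidden=16, seed=0):
    """Render a dim-dimensional kernel with a tiny (assumed) kernel network."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(size=(dim, hidden))  # only this input layer depends on dim
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=(hidden, 1))
    values = np.sin(coordinate_grid(size, dim) @ w1 + b1) @ w2
    return values.reshape((size,) * dim)

# Sequences (1D), images (2D), volumes (3D): same architecture, different coords.
for dim in (1, 2, 3):
    print(dim, render_kernel(5, dim).shape)
```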

Our contributions are:
• We propose the Continuous Convolutional Neural Network: a general purpose CNN architecture able to process data of arbitrary resolution, dimensionality and length without structural changes.
• We study the layers of CNNs, and demonstrate that the ability to model long-range dependencies in ND without input-dependent downsampling and depth values is necessary and sufficient for the construction of a general purpose CNN architecture.
• In order to model long-range dependencies in ND without input-dependent downsampling and depth values, we utilize and improve the Continuous Kernel Convolutions of Romero et al. (2022b).

