DCT-DIFFSTRIDE: DIFFERENTIABLE STRIDES WITH REAL-VALUED DATA

Abstract

Reducing the size of intermediate feature maps within neural network architectures is critical for generalization performance as well as memory and computational complexity. Until recently, most methods required downsampling rates (i.e., decimation) to be predefined and static during training, so finding optimal downsampling rates required a vast hyper-parameter search. Recent work proposed DiffStride, a novel differentiable method for learning strides which uses the discrete Fourier transform (DFT) to learn decimation rates. However, in many cases the DFT does not capture signal properties as efficiently as the discrete cosine transform (DCT). We therefore propose an alternative method for learning decimation strides, DCT-DiffStride, along with new regularization methods to reduce model complexity. Our method employs the DCT and its inverse as a low-pass filter in the frequency domain to reduce feature map dimensionality. Leveraging the well-known energy compaction properties of the DCT for natural signals, we evaluate DCT-DiffStride against competing methods on image and audio datasets, demonstrating a favorable tradeoff between model performance and model complexity. Additionally, we show that DCT-DiffStride and DiffStride can be applied to data outside the natural signal domain, broadening the applicability of such methods.

1. INTRODUCTION

Dimensionality reduction is a necessary computation in nearly all modern data and signal analysis. In convolutional networks, this reduction is often achieved through strided convolution that decimates a high-resolution input (e.g., images) into a lower-resolution space. Prediction in this lower-resolution space can enable the network to focus on relevant classification features while ignoring irrelevant or redundant ones. This reduction is vital to the information distillation pipeline in neural networks, where data is transformed to be increasingly sparse and abstract (Chollet, 2021). On image data, decimation reduces the spatial dimensions (e.g., width and height); on time-series data, it is performed along the temporal axis. Decimation can be performed in a variety of ways, including statistical aggregation, pooling methods (e.g., max and average pooling), and strided convolutions. By reducing the size of intermediate feature maps, the computational and memory complexity of the architecture is reduced: fewer operations, such as multiply-accumulates (MACs), are needed to produce feature maps because there are fewer values for the network to operate on and store. In convolutional neural networks (CNNs), decimation also increases the receptive field, as subsequent kernels have access to downsampled values that span several frames. Decimation can further improve generalization performance because lower-resolution feature maps make the network resilient to redundancy and allow it to ignore spurious activations and outliers that would otherwise not be filtered out. Most decimation methods, however, still require massive hyper-parameter searches to find the optimal window sizes and strides on which to operate. This puts the onus on post-processing and cross-validation to find optimal values, which may not be feasible in time or computation. To address this concern, Riad et al.
(2022) proposed DiffStride, which learns a cropping mask in the Fourier domain, enabling a differentiable method for finding optimal downsampling rates. Decimation is thus learned through gradient-based methods rather than through massive hyper-parameter searches. Importantly, Riad et al. (2022) utilized the discrete Fourier transform (DFT), learning the cropping size in the Fourier frequency domain. However, in many applications the DFT can be outperformed by the discrete cosine transform (DCT) (Ahmed et al., 1974) due to differing periodicity assumptions and better energy compaction. We argue that the DCT has several advantages over the DFT, particularly for real-valued data and many natural signals, and we introduce DCT-DiffStride, which leverages these advantages to learn decimation rates in a CNN. Our contributions in this work are summarized as follows. First, we leverage the improved energy compaction properties of the DCT over the DFT, enabling smaller feature maps in CNNs without substantial loss in model performance. Second, we examine the tradeoff between model complexity and model performance for both DiffStride and DCT-DiffStride across a range of datasets including audio, images, and communications signals; in all tested situations, we conclude that DCT-DiffStride is superior or comparable to DiffStride. Third, we show that these methods can be applied outside the natural signal domain, even though the motivation for using the DCT/DFT as dimensionality reduction techniques (i.e., as low-pass filters) is founded in natural signals, increasing the potential applications of such methods. While better performance at lower model complexities is useful in many cases, the implementations of DCT-DiffStride and DiffStride also rely on energy being highly concentrated in the lower frequencies.
Because the learned strides act as cutoff frequencies of a low-pass filter, we hypothesize that when energy is concentrated in the lower frequencies, less energy is cropped from the signal in the higher frequencies and thus less information is lost. A property of many signals is that energy is not spread uniformly throughout the spectrum but is highly concentrated in the lower frequencies. For many data sources, including natural signals, this property could make the DCT preferable to the DFT.
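As a concrete sketch of this cropping idea (our own illustration with a fixed cutoff; the actual DCT-DiffStride learns the cutoff differentiably, and the function name and `keep` value below are assumptions, not from the paper), retaining only the lowest-frequency DCT coefficients acts as a low-pass filter and yields a shorter, downsampled signal:

```python
import numpy as np
from scipy.fft import dct, idct

def dct_lowpass_downsample(x, keep):
    """Drop all but the first `keep` DCT coefficients, then invert.

    The retained coefficients are the lowest frequencies, so the crop is a
    low-pass filter; `keep` plays the role of the cutoff that a learned-stride
    method would find via gradient descent.
    """
    X = dct(x, type=2, norm='ortho')            # orthonormal DCT-II
    return idct(X[:keep], type=2, norm='ortho')  # shorter inverse: decimation

t = np.linspace(0.0, 1.0, 64, endpoint=False)
x = np.cos(2.0 * np.pi * 2.0 * t)               # smooth, low-frequency signal
y = dct_lowpass_downsample(x, keep=16)          # 64 samples -> 16 samples

# For a smooth signal, nearly all energy survives the crop (Parseval
# holds for the orthonormal DCT, so coefficient energy equals signal energy).
retained = np.sum(dct(x, norm='ortho')[:16] ** 2) / np.sum(x ** 2)
```

Because the output has a quarter of the samples, every subsequent layer operates on and stores a quarter of the values, which is the source of the complexity savings discussed above.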

2. FREQUENCY-BASED ANALYSIS

Naturally occurring signals are often defined as signals that humans have evolved to recognize, such as speech (Singh & Theunissen, 2003) and natural landscapes (Torralba & Oliva, 2003; Ruderman, 1994). These natural signals typically contain more low-frequency content than high-frequency content; that is, they are typically smooth and change gradually. Taking advantage of this observation, Rippel et al. (2015) introduced spectral pooling, which implements low-pass filters in the Fourier domain to enable fractional decimation rates. This mitigates the information loss of pooling methods such as max-pooling and ensures resolution is not reduced too early in the network, allowing most of the signal content to remain present in subsequent layers while still reducing dimensionality. Building on the idea of Fourier-domain learning, Pratt et al. (2017) introduced Fourier (or spectral) convolutional networks, a method for optimizing convolutional filters in the Fourier domain without conversion to spatial representations. However, without an efficient decimation scheme, these networks exploded in the number of trainable parameters. Lin et al. (2019) leveraged the Fourier convolution theorem to increase the computational efficiency of pre-trained CNNs, adapting convolutional operations to use the DFT. Chi et al. (2020) expanded this idea to cross-scale DFTs that employ Fourier units in multiple branches. Wood & Larson (2021) used learned functions in the Fourier domain to filter signals dynamically; their parametric spectral functions often worked to preserve low-frequency content, but because no specific cropping mechanism was employed, reductions in computational complexity were not investigated. A fractional decimation methodology was similarly proposed using Winograd algorithms for acceleration (Pan & Chen, 2021). These important works led to the innovations in DiffStride (Riad et al., 2022) and our proposed DCT-DiffStride.

In signal analysis and signal classification, it is often desirable to use a linear transformation that compacts a large fraction of a signal's energy into just a few transform (or "spectral") coefficients. Let us first define an N-dimensional real-valued discrete signal as components of the vector x ∈ R^N. The optimal linear transformation matrix T for energy compaction is comprised of column vectors that are the eigenvectors of the covariance matrix of x with itself, which is the Karhunen-Loève transform (KLT).
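To make the compaction argument concrete, the following sketch (our own illustration, not taken from the paper) compares the fraction of energy that the lowest-frequency DCT and DFT coefficients capture for a smooth but non-periodic signal, a ramp. The DFT's implicit periodic extension introduces a discontinuity at the boundary that spreads energy into high frequencies, while the DCT's implicit symmetric extension does not:

```python
import numpy as np
from scipy.fft import dct, fft

N = 64
x = np.linspace(-1.0, 1.0, N)   # smooth ramp; endpoints differ, so the
                                # DFT's periodic extension has a jump

def dct_energy_fraction(x, k):
    """Fraction of signal energy in the first k DCT-II coefficients."""
    X = dct(x, norm='ortho')                  # orthonormal: Parseval holds
    return np.sum(X[:k] ** 2) / np.sum(x ** 2)

def dft_energy_fraction(x, k):
    """Fraction of signal energy in DFT bins 0..k-1 and their conjugates."""
    X = fft(x) / np.sqrt(len(x))              # unitary scaling: Parseval holds
    e = np.abs(X[:k]) ** 2
    e[1:] *= 2.0                              # count conjugate-symmetric bins
    return np.sum(e) / np.sum(x ** 2)

frac_dct = dct_energy_fraction(x, 4)          # 4 real coefficients
frac_dft = dft_energy_fraction(x, 4)          # bin 0 plus 3 conjugate pairs
```

Even though the DFT is given more values here (each conjugate pair is two complex bins), the DCT concentrates a larger fraction of the ramp's energy in its first few coefficients, which is the property DCT-DiffStride exploits when cropping.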

