DCT-DIFFSTRIDE: DIFFERENTIABLE STRIDES WITH REAL-VALUED DATA

Abstract

Reducing the size of intermediate feature maps within neural network architectures is critical for generalization performance as well as memory and computational complexity. Until recently, most methods required downsampling rates (i.e., decimation) to be predefined and static during training, with optimal downsampling rates requiring a vast hyper-parameter search. Recent work proposed a novel and differentiable method for learning strides, named DiffStride, which uses the discrete Fourier transform (DFT) to learn strides for decimation. However, in many cases the DFT does not capture signal properties as efficiently as the discrete cosine transform (DCT). Therefore, we propose an alternative method for learning decimation strides, DCT-DiffStride, as well as new regularization methods to reduce model complexity. Our work employs the DCT and its inverse as a low-pass filter in the frequency domain to reduce feature map dimensionality. Leveraging the well-known energy compaction properties of the DCT for natural signals, we evaluate DCT-DiffStride against its competitors on image and audio datasets, demonstrating a favorable tradeoff in model performance and model complexity. Additionally, we show that DCT-DiffStride and DiffStride can be applied to data outside the natural signal domain, broadening the applicability of such methods.

1. INTRODUCTION

Dimensionality reduction is a necessary computation in nearly all modern data and signal analysis. In convolutional networks, this reduction is often achieved through strided convolution that decimates a high-resolution input (e.g., images) into a lower-resolution space. Prediction in this lower-resolution space can enable the network to focus on relevant classification features while ignoring irrelevant or redundant features. This reduction is vital to the information distillation pipeline in neural networks, where data is transformed to be increasingly sparse and abstract (Chollet, 2021). On image data, decimation comes in the form of reducing spatial dimensions (e.g., width and height). On time-series data, decimation is performed along the temporal axis. Decimation can be performed in a variety of ways including statistical aggregation, pooling methods (e.g., max and average pooling), and strided convolutions. By reducing the size of intermediate feature maps, the computational and memory complexity of the architecture are reduced. Fewer operations, such as multiplies and accumulates (MACs), are needed to produce feature maps as there are fewer values for the network to operate on and store. In convolutional neural networks (CNNs), decimation allows for an increase in the receptive field as subsequent kernels have access to downsampled values that span several frames. Decimation also allows for increased generalization performance because lower-resolution feature maps make the network resilient to redundancy and able to ignore spurious activations and outliers that would otherwise not be filtered out. Most decimation methods, however, still require massive hyper-parameter searches to find the optimal window sizes and strides on which to operate. This puts the onus on post-processing and cross-validation to find optimal values, which may not be feasible in time or computation. To address this concern, Riad et al.
(2022) proposed DiffStride, which learns a cropping mask in the Fourier domain, enabling a differentiable method to find optimal downsampling rates. Thus, decimation is learned through gradient-based methods rather than through massive hyper-parameter searches. Importantly, Riad et al. (2022) utilized the discrete Fourier transform (DFT), learning the cropping size in the Fourier frequency domain. However, in many applications the DFT can be outperformed by the discrete cosine transform (DCT) (Ahmed et al., 1974) due to differing periodicity assumptions and better energy compaction. We argue the use of the DCT has several advantages over the DFT, particularly for real-valued data and many natural signals. We introduce DCT-DiffStride, which leverages the advantages of the DCT to learn decimation rates in a CNN. Our contributions in this work are summarized as follows: first, we leverage the improved energy compaction properties of the DCT over the DFT, enabling smaller feature maps in CNNs without substantial loss in model performance. Second, we examine the tradeoff in model complexity and model performance for both DiffStride and DCT-DiffStride across a range of datasets including audio, images, and communications signals. In all tested situations, we conclude that DCT-DiffStride is superior or comparable to DiffStride. Third, we show that these methods can be applied outside the natural signal domain, even though the motivation for using the DCT/DFT as dimensionality reduction techniques (i.e., as low-pass filters) is founded in natural signals, increasing the potential applications of such methods. While better performance at lower model complexities is useful in many cases, the implementations of DCT-DiffStride and DiffStride also depend on energy components being highly concentrated in the lower frequencies.
Because the learned strides are cutoff frequencies creating a low-pass filter, we hypothesize that less energy is cropped from the signal in the higher frequencies and thus less information is lost. A property of many signals is that energy is not spread uniformly throughout the spectrum but is highly concentrated in the lower frequencies. For many data sources, including natural signals, this property can make the DCT preferable to the DFT.
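The energy-compaction argument above can be illustrated numerically. The following sketch (ours, not from the paper; it assumes NumPy and SciPy are available) measures the fraction of a smooth, non-periodic signal's energy retained when only the lowest-frequency coefficients are kept, for both the DCT and the DFT:

```python
import numpy as np
from scipy.fft import dct

n = 256
x = np.linspace(-1.0, 1.0, n)        # smooth, non-periodic, zero-mean ramp

X_dct = dct(x, type=2, norm='ortho') # real-valued orthonormal DCT-II spectrum
X_dft = np.fft.fft(x, norm='ortho')  # complex orthonormal DFT spectrum

k = 32
e_dct = np.abs(X_dct) ** 2
e_dft = np.abs(X_dft) ** 2
dct_frac = e_dct[:k].sum() / e_dct.sum()
# For the DFT, low frequencies occupy both ends of the spectrum (conjugate
# symmetry), so keep k//2 bins from each end for a comparable low-pass crop.
dft_frac = (e_dft[:k // 2].sum() + e_dft[-(k // 2):].sum()) / e_dft.sum()
print(f"energy kept by {k} coefficients -- DCT: {dct_frac:.4f}, DFT: {dft_frac:.4f}")
```

The DFT's implied periodic extension introduces a boundary discontinuity in the ramp, spreading energy into high frequencies, while the DCT's even extension keeps the extended signal smooth; this is the property DCT-DiffStride exploits.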

2. FREQUENCY-BASED ANALYSIS

Naturally occurring signals are often defined as signals that humans have evolved to recognize, such as speech (Singh & Theunissen, 2003) and natural landscapes (Torralba & Oliva, 2003; Ruderman, 1994). These natural signals typically have more low frequency content than high frequency content; that is, they are typically smooth and change gradually. Taking advantage of this observation, Rippel et al. (2015) introduced spectral pooling, which implements low-pass filters in the Fourier domain to enable fractional decimation rates, mitigating the information loss of pooling methods such as max-pooling and ensuring resolution is not reduced too early in the network. This allows most of the content of the signal to remain present in subsequent layers while still providing a method to reduce dimensionality. Building on the idea of Fourier domain learning, Pratt et al. (2017) introduced Fourier (or spectral) convolutional networks, a method for optimizing filter weights using the DFT. This approach optimized convolutional filters in the Fourier domain without conversion to spatial representations; however, without an efficient decimation scheme, these networks exploded in the number of trainable parameters. Lin et al. (2019) leveraged the Fourier convolution theorem to increase the computational efficiency of pre-trained CNNs, adapting convolutional operations to use the DFT. Chi et al. (2020) expanded this idea to cross-scale DFTs that employ Fourier units in multiple branches. Wood & Larson (2021) used learned functions in the Fourier domain to filter signals dynamically. The parametric spectral functions often worked to preserve low frequency content, although a specific cropping mechanism was not employed; therefore, reduction in computational complexity was not investigated.
A fractional decimation methodology was similarly proposed using Winograd algorithms for acceleration (Pan & Chen, 2021). These important works led to the innovations in DiffStride (Riad et al., 2022) and our proposed DCT-DiffStride. In signal analysis and signal classification, it is often desirable to use a linear transformation that tends to compact a large fraction of a signal's energy into just a few transform (or "spectral") coefficients. Let us first define an N-dimensional real-valued discrete signal as components of the vector x ∈ R^N. The optimal linear transformation matrix, T, for energy compaction is comprised of column vectors that are the eigenvectors of the autocorrelation matrix of x, R_x = E{xx^T}. This result has been known since at least the 1933 publication by Hotelling (1933), and T is commonly referred to as the Karhunen-Loève transform (KLT) (Loève, 1945; Karhunen, 1946). In terms of comprising independent energy components, the DFT is often used due to the Wiener-Khinchine theorem (Ziemer & Trantor, 1985), which states that the power spectral density of a physical signal is the Fourier transform of the signal's autocorrelation function. The DFT is optimal whenever x is composed of periodic signals; thus, the DFT is equivalent to the KLT for a periodic signal because it is the optimal transformation matrix with respect to energy compaction (Pearl, 1973). As a signal loses its periodic structure, the DFT becomes less optimal in terms of energy compaction because additional basis vectors have significant energies when representing x. This introduces limitations on the rate of decimation for non-periodic signals. Additionally, the spectral coefficients of the DFT are complex, even when the transformed signal x is real-valued.
These properties of the DFT can be problematic when incorporated with CNNs because they can increase computational complexity and memory footprints. In an effort to further reduce computational complexity, there is interest in linear transformations that maximize energy compaction yet yield real-valued spectra. Naturally occurring signals are often non-periodic and are more amenable to being modeled as stationary first-order Markov processes or, more generally, auto-regressive models of order one, AR(1). For this reason, we propose the use of the discrete cosine transform (DCT) over the DFT for efficient energy compaction of natural signals. In fact, the commonly used compression algorithm JPEG (Wallace, 1991) also makes use of the DCT for this reason. The DCT uses real-valued cosine functions mapping an input sequence x from R^N → R^N. For an AR(1) process, the DCT is the optimal KLT basis (Unser, 1984; Torun & Akansu, 2013) (also see appendix). This enables the DCT to express AR(1) processes in fewer components than the DFT, which increases the energy compaction of the transform and allows a higher decimation rate while preserving a similar amount of signal content. Because the DCT (and DFT) operate on discrete signals, boundary conditions must be defined for both the left and right boundaries of the repeated sequence, along with the point at which the function is defined as even. This gives rise to various definitions of the DCT, eight in total. In this work, we refer to the orthonormal DCT type-II as "the DCT." The boundary conditions (implied periodicity) for the DCT-II are even at n = -1/2 and n = N - 1/2, as seen in Figure 1. The single-dimension orthonormal DCT-II is given by Equation (1) and is straightforward to extend to multiple dimensions:

X_k = w(k) Σ_{n=0}^{N-1} x_n cos( (π/N) (n + 1/2) k ),  for k = 0, ..., N - 1,   (1)

where w(0) = √(1/N) and w(k) = √(2/N) for k ≥ 1.
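As a sanity check, Equation (1) can be implemented directly. The sketch below (our illustration, assuming SciPy is available) compares the naive O(N^2) sum against the library DCT and verifies that the orthonormal transform preserves energy:

```python
import numpy as np
from scipy.fft import dct

def dct2_manual(x):
    """Orthonormal DCT-II of a 1-D signal, following Equation (1)."""
    N = len(x)
    n = np.arange(N)
    X = np.empty(N)
    for k in range(N):
        # w(0) = sqrt(1/N); w(k) = sqrt(2/N) for k >= 1
        scale = np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)
        X[k] = scale * np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
    return X

x = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 3.0])
X = dct2_manual(x)

# Matches the library's orthonormal DCT-II, and energy is preserved.
assert np.allclose(X, dct(x, type=2, norm='ortho'))
assert np.isclose(np.sum(X ** 2), np.sum(x ** 2))
```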

3. DCT-DIFFSTRIDE

Similarly to the architectures in (Rippel et al., 2015; Riad et al., 2022), DCT-DiffStride, depicted in Figure 1, applies a cropping mask to the frequency-domain representation of the input signal, followed by the inverse transform on the cropped frequency signal to return to the spatial domain. For example, an input image x ∈ R^{H×W} is transformed by the DCT resulting in X ∈ R^{H×W}. X is then cropped, which is equivalent to decimation; thus, adjusting the size of the crop adjusts the stride of the filtering operation. Because the cropping operation itself is not differentiable, a stop-gradient operator (Bengio et al., 2013) is used so that the DCT-DiffStride layer remains differentiable with respect to the learnable decimation rates, S. The output of the DCT is not necessarily symmetric, unlike the DFT of a real-valued input; consequently, the learned cropping mask does not need to be symmetric to reconstruct the signal. To produce the windowed mask, we use an implementation similar to (Riad et al., 2022) based on the adaptive attention span proposed by Sukhbaatar et al. (2019); however, we account for the differences in symmetry. DCT-DiffStride can produce a smaller minimum-sized feature map than DiffStride because this symmetry does not need to be kept: the minimum output shape for a DCT-DiffStride layer is 1 + R, versus 2 + 2R for DiffStride, where R is a defined smoothing factor for the cropping mask. Although this difference may be inconsequential for high-dimensional x, the achievable compression for lower-dimensional signals may be largely impacted. For example, with a smoothness of R = 4 and x ∈ R^32, as in many image datasets, DiffStride can compress to approximately 30% of the original size for a single layer, whereas DCT-DiffStride can compress to approximately 15%.

3.1 REGULARIZATION

Riad et al.
(2022) proposed a novel regularization method to promote better usage of time and memory complexity, given by Σ_{l=1}^{L} Π_{i=1}^{l} 1/(S_i^h · S_i^w). There are a few drawbacks to this method. First, the model could learn to increase the striding factors, thereby decreasing the regularization loss without reducing the feature map size; this is because of the max and min operations that keep the feature dimensionality from collapsing and retain the smoothness factor. In other words, the strides are unconstrained, and the network can decrease the loss without any functional value. Second, although the regularization loss is proportional to model complexity, the regularization value has no direct interpretation since the strides are unconstrained. While increasing the weight of the regularization term gives the network incentive to reduce memory and time complexity, the term lacks semantic value. We propose a novel regularization term that alleviates these concerns by expressing model complexity as a percentage. DCT-DiffStride reduces the feature map size along the spatial/temporal axes; however, it does not change the number of channels or the number of parameters in neighboring layers. As such, we define the model complexity, C, as a function of the spatial/temporal feature map dimensionality:

C = Σ_{l=1}^{L} Σ_{d=1}^{D} E_{l,d} / M_{l,d},

where L is the number of convolutional layers, D is the dimensionality of the spatial/temporal axes, E_{l,d} is the expanse (e.g., height or width for an image), and M_{l,d} is the maximum possible expanse (i.e., when no decimation is performed in the network). We note that this does not include the complexity of fully connected layers, as they are also kept constant in the network. The output shape of each DCT-DiffStride layer is a function of the learned stride for each particular dimension, the employed smoothness factor, and the minimum allowed feature map size.
The minimum feature map size for DCT-DiffStride is 1 because the output of the DCT is not symmetric. For DiffStride, the minimum size is 2 to account for the symmetry of the DFT for real-valued, even-length signals. The output shape for a single dimension of a DCT-DiffStride layer is ⌊max(E_{l,d}/S_{l,d}, 1)⌋ + R, where E is the expanse, l the layer, d the dimension, S the learned stride, and R the smoothness factor. The output shape calculation for DiffStride is similar, but the symmetry for real-valued inputs is taken into account, giving ⌊max(E_{l,d}/S_{l,d}, 2)⌋ + 2R. In our regularization term, we normalize each output shape by the maximum shape that the output could be, M (i.e., when no decimation is performed). This allows each regularization term to be interpreted as the percentage of total complexity for a given layer and dimension on a normalized range of [0, 1]. We propose a novel regularization method with two variants that alleviate the issues of interpretability and introduce implicit constraints keeping the loss bounded. In practice, there may be computational or memory limitations on the device on which the model will be deployed. By specifying the maximum desired percentage model complexity, ĉ, Equation (2) can be used to incentivize the network not to exceed the desired complexity while still allowing it to learn the optimal rate of decimation for each layer; L is the total number of DCT-DiffStride layers. This allows the network to learn a less complex model than ĉ (i.e., learn larger decimation rates) and optimize other loss terms such as cross entropy. In traditional decimation methods, the size of each layer would be designed a priori to keep the model within ĉ complexity, but the proposed method allows the network to learn a smaller model while determining which layers should have higher/lower decimation rates through gradient descent.
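To make the shape arithmetic concrete, the helpers below (a hypothetical sketch; the function names are ours) compute the single-dimension output expanses for both methods, reproducing the R = 4, 32-sample example from above:

```python
import math

R = 4  # cropping-mask smoothness factor

def out_dim_dct_diffstride(expanse, stride, R=R):
    """Output expanse of one DCT-DiffStride dimension: floor(max(E/S, 1)) + R."""
    return math.floor(max(expanse / stride, 1.0)) + R

def out_dim_diffstride(expanse, stride, R=R):
    """DiffStride keeps DFT symmetry: floor(max(E/S, 2)) + 2R."""
    return math.floor(max(expanse / stride, 2.0)) + 2 * R

# With a 32-sample expanse and a very large stride, the minima differ:
print(out_dim_dct_diffstride(32, 64))  # 1 + R = 5   -> ~15% of 32
print(out_dim_diffstride(32, 64))      # 2 + 2R = 10 -> ~30% of 32
```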
This method can also be generalized to specify percentage complexity for specific layers if overall model complexity is not sufficient.

L_max complexity = max(C - ĉ · L, 0)   (2)

While Equation (2) is likely the method to employ in real-world situations, this work aims to illustrate an advantageous tradeoff in model complexity and model performance for DCT-DiffStride against DiffStride. To fairly compare the two methods, each needs to be evaluated at similar model complexity values. We therefore propose a slight modification of Equation (2), shown in Equation (3), such that the models will have very similar complexity values. Here, the network is penalized for having a larger or smaller overall complexity than ĉ, rather than only creating an upper bound as in Equation (2). Notably, the network is still able to learn decimation rates for each DCT-DiffStride layer, but the overall model complexity needs to match a value close to ĉ. This is purely to illustrate the tradeoff in model complexity and model performance.

L_sq complexity = (C - ĉ · L)^2   (3)

For the remainder of this work, we use Equation (3) to ensure a fair comparison between DCT-DiffStride and DiffStride. The overall loss function for all experiments is L = L_cross entropy + λ L_sq complexity. We set λ = 100 for all experiments to make the regularization the dominant term in the overall loss calculation, which helps ensure both DCT-DiffStride and DiffStride reach similar model complexities for ease of comparison.
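A minimal sketch of the two proposed regularizers (ours; it assumes the sum-over-layers-and-dimensions reading of C given above, illustrated with a 1-D model so C and the budget ĉ · L share the same scale):

```python
def model_complexity(expanses, max_expanses):
    """C = sum over layers l and dimensions d of E_{l,d} / M_{l,d}."""
    return sum(E / M
               for E_l, M_l in zip(expanses, max_expanses)
               for E, M in zip(E_l, M_l))

def max_complexity_loss(C, c_hat, L):
    """Equation (2): penalize complexity only above the budget c_hat * L."""
    return max(C - c_hat * L, 0.0)

def sq_complexity_loss(C, c_hat, L):
    """Equation (3): pull overall complexity toward c_hat * L from both sides."""
    return (C - c_hat * L) ** 2

# Three 1-D layers whose expanses have been decimated from a maximum of 32.
expanses = [[16], [8], [4]]
max_expanses = [[32], [32], [32]]
C = model_complexity(expanses, max_expanses)  # 0.5 + 0.25 + 0.125 = 0.875
print(C, max_complexity_loss(C, 0.3, 3), sq_complexity_loss(C, 0.3, 3))
```

With ĉ = 0.3 and L = 3, the budget is 0.9: Equation (2) yields zero loss because the model is already under budget, while Equation (3) still nudges C toward 0.9.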

4. EXPERIMENTS

We compare DCT-DiffStride and DiffStride on five classification tasks. We include both image and audio datasets of naturally occurring signals, as well as a modulation classification dataset representing non-natural signals. Providing a wide variety of dataset types allows exploration of the advantages that DCT-DiffStride and DiffStride may possess and expands the range of applications for both methods. To ensure stability during training, all methods utilize unit ℓ2-norm global gradient clipping. In this work, we apply DCT-DiffStride between convolutional layers, but it can be applied between other architectural layers (e.g., transformer layers) with minimal modification. While Riad et al. (2022) showed that DiffStride is resilient to different stride initializations, our work aims to illustrate the tradeoff for DCT-DiffStride in terms of model complexity and model performance when compared to DiffStride. All models are trained with categorical cross-entropy loss along with the appropriate regularization penalty for each model complexity, ĉ. Due to their smaller size, we first benchmark DCT-DiffStride on the CIFAR datasets (Krizhevsky et al., 2009) with an increased number of ĉ values in order to better characterize the complexity tradeoff. For each CIFAR experiment, we utilize stochastic gradient descent (SGD) (Saad, 1999) with an initial learning rate of 0.1, momentum of 0.99 (Qian, 1999), and a batch size of 32. We train the models for a total of 80 epochs with a learning rate decay factor of 0.1 at epochs 20, 40, and 60, giving a final learning rate of 1e-4. We use ĉ = [0.1, 0.2, 0.3, ..., 0.8] to develop a tradeoff curve in terms of model complexity and model performance. We use the same model definition as (Riad et al., 2022), and we report the highest performing model from their work using traditional strided convolutions to establish a baseline against traditional methods.

Results, ImageNet

A similar result was found on ImageNet as on the CIFAR datasets. DCT-DiffStride achieved a much higher evaluation accuracy of 48% compared to 43.5% for DiffStride at 10% model complexity, and the gap decreases as more intermediate activations are retained. This reaffirms that DCT-DiffStride is able to use larger decimation rates than DiffStride even on larger images while maintaining strong performance.

Experimental Setup

In addition to image data, we evaluate DCT-DiffStride on the speech commands dataset (Warden, 2018). Raw audio data is sampled at 16 kHz and processed into 64-dimensional log mel-scaled spectrograms using a window size of 25 ms and 10 ms overlap (Stevens et al., 1937). All spectrograms are mean-normalized for each mel bin. In order to leverage batch training, audio is randomly cropped or padded to 1 s in length. We use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.1, a decay factor of 0.1 at epochs 20 and 40, and a batch size of 32 for a total of 60 epochs. To establish a baseline against traditional methods, we report the accuracy from (Riad et al., 2022) using traditional strided convolutions. Similarly to (Riad et al., 2022), a 2D CNN based on (Tagliasacchi et al., 2019) is employed, where convolutional blocks consist of three convolutions: a (3 × 1) kernel applied over time, followed by a (1 × 3) kernel over frequency, and a residual convolution. These convolutional blocks are illustrated in Figure 3a. The architecture consists of 5 convolutional blocks, each followed by a decimation layer; a decimation layer is also applied to the input of the network. The architecture then employs global average pooling and a single linear layer to perform classification. The number of channels for each convolutional block is [128, 256, 256, 512, 512], and decimation rates are initialized to S = [2, 2, 1, 2, 1, 2].

Results

The accuracy versus model complexity is shown in Figure 3b. While both DCT-DiffStride and DiffStride perform comparably, DiffStride was found to have a slight advantage over DCT-DiffStride on the speech commands dataset. Both methods were able to use large decimation rates without loss in performance, indicating redundant activation values. Performance was found to decrease at high model complexity values; generalization performance may decrease without a large amount of decimation, possibly due to spurious activations influencing predictions. Both methods outperformed the baseline at a similar model complexity, further demonstrating the ability of the network to learn superior decimation rates. A possible explanation for the similar performances is that speech is comprised of many near-periodic sources (e.g., vowel sounds and semivowels) and non-periodic sources (e.g., voiceless fricatives and plosives); thus, no clear advantage in periodicity or boundary conditions exists between the DCT and DFT. Moreover, the energy concentrations in speech are more dispersed among phonemes, which may explain why the energy compaction of the DCT does not provide a distinct advantage. For the modulation classification task, we utilize a 1D CNN convolving over the temporal axis with 1D DCT-DiffStride and DiffStride layers. We base our model on (Harper et al., 2020), utilizing mean and variance pooling after the convolutional layers, a common technique used by x-vector architectures (Snyder et al., 2018). We train a baseline model that uses an architecture equivalent to the model in (Harper et al., 2020) but trained on signals in the specified dB range. To compare DCT-DiffStride and DiffStride, the decimation methods are applied following each convolutional layer. We use the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.01, a decay factor of 0.1 at epochs 15, 30, and 45, and a batch size of 128 for a total of 60 epochs. All decimation rates are initialized to one.
It is interesting that DCT-DiffStride and DiffStride are able to perform well on modulation data, particularly at small model complexities, because modulation data is well known to contain a number of high frequency components that are important for classification. One possible explanation relates to the concept of neural collapse (Papyan et al., 2020; Han et al., 2022; Zhang et al., 2021; Belkin et al., 2019): models trained with categorical cross-entropy tend to produce activations very close to the mean class value in later layers. Since the zeroth entry of the DFT or DCT output is proportional to the mean value of the signal, DCT-DiffStride and DiffStride may be able to leverage neural collapse. More specifically, as networks are trained, the within-class variability decreases under neural collapse, which in turn decreases the high frequency content in later-layer activations and increases the low frequency content. Because DCT-DiffStride and DiffStride operate as low-pass filters, they may be able to exploit this phenomenon.
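The claim that the zeroth DFT/DCT coefficient encodes the signal mean (up to a scale factor) is easy to verify numerically (our sketch; assumes SciPy is available):

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(7)
x = rng.standard_normal(100)
N = len(x)

# Zeroth DFT coefficient is the sum of the signal, i.e. N times its mean.
X0_dft = np.fft.fft(x)[0]
assert np.isclose(X0_dft.real, N * x.mean()) and np.isclose(X0_dft.imag, 0.0)

# Zeroth orthonormal DCT-II coefficient is sqrt(N) times the mean:
# scale sqrt(1/N) applied to the sum of the samples.
X0_dct = dct(x, type=2, norm='ortho')[0]
assert np.isclose(X0_dct, np.sqrt(N) * x.mean())
```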

5. DECIMATION RATE AND AR(1) RELATIONSHIP

In this section, we investigate the relationship between how well AR(1) processes fit the given datasets and intermediate activations, and the learned decimation values. Because the DCT is the optimal KLT basis for an AR(1) process (see appendix), we would expect trained DCT-DiffStride networks to have larger decimation rates when the intermediate activations for a given layer are better modeled by an AR(1) process. To investigate this hypothesis, we take 100 random samples from each dataset, compute the activations of each layer of the 10% complexity model, fit an AR(1) process to each channel, and store the absolute value of the residuals between the activations and the AR(1)-predicted activations for each sample and layer. We use the statsmodels package (Seabold & Perktold, 2010) to perform the AR(1) fitting. For 2-D data, the appendix describes the unraveling process to unpack the data into a 1-D series that can be modeled by an AR(1) process. The distribution of absolute residuals is shown in the box plots in Figure 4, and the feature map shape following decimation for each layer is shown as dotted lines. Feature map shape is defined as height × width for 2-D data and length for 1-D data (RadioML 2018.01A dataset). For the speech commands dataset, decimation is performed directly on the input, whereas in the remaining datasets decimation is not performed until after convolutional layers; for this reason, the residual box plots for the input and the first layer are the same for the speech commands dataset. A consistent trend was seen across the datasets for DCT-DiffStride: when the residuals were small for successive layers, the feature map was more aggressively decimated; when the residuals increased, the decimation rate stabilized. This supports the hypothesis that DCT-DiffStride leverages the DCT for larger decimation rates when the data better fits an AR(1) process.
This behavior provides further evidence as to why DCT-DiffStride outperforms DiffStride, particularly for low-complexity models.
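The residual computation described above can be sketched as follows (our illustration; we substitute a plain least-squares AR(1) fit for the statsmodels call, which serves the same purpose here). A strongly autocorrelated signal yields small absolute residuals, while white noise does not:

```python
import numpy as np

def ar1_abs_residuals(x):
    """Fit x[t] = c + phi * x[t-1] by least squares (a NumPy stand-in for the
    statsmodels AR(1) fit used in the paper) and return (|residuals|, phi)."""
    y, z = x[1:], x[:-1]
    A = np.column_stack([np.ones_like(z), z])
    (c, phi), *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.abs(y - (c + phi * z)), phi

rng = np.random.default_rng(0)
n = 2000
# A strongly AR(1) signal should yield small residuals...
smooth = np.empty(n)
smooth[0] = 0.0
for t in range(1, n):
    smooth[t] = 0.99 * smooth[t - 1] + 0.01 * rng.standard_normal()
res_smooth, phi_smooth = ar1_abs_residuals(smooth)

# ...while white noise is poorly captured by an AR(1) model.
res_noise, phi_noise = ar1_abs_residuals(rng.standard_normal(n))
print(phi_smooth, res_smooth.mean(), res_noise.mean())
```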

6. CONCLUSION

We propose DCT-DiffStride, a method to perform differentiable decimation leveraging the energy-compaction properties of the discrete cosine transform. DCT-DiffStride is able to outperform competitors on various classification tasks, particularly when large decimation rates are used. While low-pass filtering methods are traditionally applied to natural signals, we show that DCT-DiffStride generalizes outside this domain and can be used as a direct replacement for other decimation methods, such as DiffStride and max pooling, in more general applications.

A.2 LINEAR TRANSFORMATION TO MAXIMIZE ENERGY COMPACTION IN INDEPENDENT SPECTRAL COEFFICIENTS: THE KLT TRANSFORM

In our work, we make use of the DCT to leverage its energy-compaction properties to enable learnable decimation rates. In this section, we provide the mathematical foundation explaining the DCT's ability to represent AR(1) processes with fewer terms than the DFT, enabling larger decimation rates without compromising classification performance. It is desirable to use a transformation that tends to compact a large fraction of a signal's energy into just a few transform (or "spectral") coefficients. This characteristic is coupled with another desirable feature wherein maximum decorrelation is present among the transform coefficients; that is, the coefficients contain independent portions of the signal's energy components. Let us first define an N-sample real-valued discrete signal as components of the vector x. The components of x are x_k ∈ R where 0 ≤ k ≤ N - 1, or more succinctly, x ∈ R^N. For simplicity while attempting to be as general as possible, we assume that x is a normalized zero-mean random vector, thus E{x} = 0, and the x_k are variates of a zero-mean random variable X. The covariance matrix of x with itself is its N × N autocorrelation matrix, R_x = E{xx^T}. Note that the total energy of the signal x, assumed to be ||x||^2, is contained within the coefficients of the matrix R_x. The component of R_x at row k and column j is denoted R_x(k, j) and encodes the correlation between the two signal components x_k and x_j, comprising the energy term |x_k x_j|. Moreover, since x ∈ R^N, R_x = R_x^T. Observation 1: if the components x_k and x_j are uncorrelated for all k ≠ j, then R_x(k, j) = 0 off the diagonal, and R_x takes the form of a diagonal matrix with positive diagonal values equal to E{|x_k|^2}.
Because the signal x is of a form wherein the energy of each component x_k is proportional to |x_k|^2, the total energy of the signal is the inner product of x with itself, or its squared L2-norm:

||x||^2 = E{x^T x} = Σ_{k=0}^{N-1} R_x(k, k)   (4)

We denote the complex conjugate transpose of a matrix A as A*, and likewise the conjugate transpose of a column vector v as the row vector v*. If the vector v is real-valued, as is the case for the signal vector x, then v* = v^T. Let us apply a linear transformation T to x as y = Tx, where T is an orthogonal transformation matrix. Since T is orthogonal, the energy components in the spectral coefficients comprising y are independent. The physical principle of conservation of energy applies such that ||y||^2 = E{y^T y} = ||x||^2. We also note that the covariance of y with itself is its autocorrelation matrix, R_y, as given in Equation (5):

R_y = E{yy^T} = E{Tx(Tx)*} = E{Txx^T T*} = T E{xx^T} T* = T R_x T*   (5)

Equation (5) indicates that if it is desirable to compact energy into the fewest M < N components of the transformed vector y, then the transformation matrix T should be structured such that the energy of y is contained within the first M coefficients and y_k = 0 for M ≤ k ≤ N - 1. We define the M-dimensional vector y_M to be the first M coefficients of y = Tx. If a suitable transformation matrix T is chosen such that maximum energy compaction results, then from Equation (4) we have Equation (6):

||x||^2 = ||y_M||^2 = Σ_{k=0}^{M-1} |y_k|^2 = E{y_M^T y_M} = E{(Tx)* Tx} = E{x^T T* T x}   (6)

Because y_M^T = [y_0 y_1 ... y_{M-1} 0 ... 0] = x^T T* and y_M = Tx, we see that y = Tx = y_M, yet the last N - M row vectors of T are irrelevant since y_k = 0 for all k ≥ M when the transformation y = Tx is calculated.
Thus, we replace the transformation matrix $\mathbf{T}$ with $\mathbf{T}_M$, where $\mathbf{T}_M$ has the same row vectors $\mathbf{t}_k^*$ as matrix $\mathbf{T}$ for all $k < M$ and has zero-valued row vectors, $\mathbf{t}_k^* = \mathbf{0}^T$, for all $M \le k \le N-1$, where $\mathbf{0}$ denotes the null vector. This implies that maximum energy compaction occurs when we rewrite the rightmost term of Equation (6) as given in Equation (7):

$$E\{\mathbf{x}^T\mathbf{T}^*\mathbf{T}\mathbf{x}\} = E\{\mathbf{x}^T\mathbf{T}_M^*\mathbf{T}_M\mathbf{x}\} = E\left\{\mathbf{x}^T \begin{bmatrix} \mathbf{t}_0 & \mathbf{t}_1 & \cdots & \mathbf{t}_{M-1} & \mathbf{0} & \cdots & \mathbf{0} \end{bmatrix} \begin{bmatrix} \mathbf{t}_0^* \\ \mathbf{t}_1^* \\ \vdots \\ \mathbf{t}_{M-1}^* \\ \mathbf{0}^T \\ \vdots \\ \mathbf{0}^T \end{bmatrix} \mathbf{x}\right\} \tag{7}$$

From Equation (4), we know that $\|\mathbf{x}\|^2 = \|\mathbf{y}_M\|^2 = \sum_{k=0}^{N-1} R_x(k,k)$, and from Equation (6) that $\|\mathbf{x}\|^2 = \|\mathbf{y}_M\|^2 = E\{\mathbf{x}^T\mathbf{T}^*\mathbf{T}\mathbf{x}\}$; therefore, only the diagonal values of $\mathbf{T}\mathbf{T}^*$ result in non-zero coefficients. Alternatively, since $\mathbf{T}$ by definition is an orthogonal transform, the inner product of any two distinct column vectors vanishes, $(\mathbf{t}_k, \mathbf{t}_j) = 0$ for all $k \neq j$. Thus, we can rewrite Equation (7) as Equation (8):

$$E\{\mathbf{x}^T\mathbf{T}^*\mathbf{T}\mathbf{x}\} = E\left\{\mathbf{x}^T \sum_{k=0}^{M-1} \mathbf{t}_k\mathbf{t}_k^* \, \mathbf{x}\right\} \tag{8}$$

Using matrix algebra identities, we rearrange the terms of Equation (8), resulting in Equation (9):

$$E\{\mathbf{x}^T\mathbf{T}^*\mathbf{T}\mathbf{x}\} = \sum_{k=0}^{M-1} E\{\mathbf{t}_k^*\mathbf{x}\mathbf{x}^T\mathbf{t}_k\} = \sum_{k=0}^{M-1} \mathbf{t}_k^*\mathbf{R}_x\mathbf{t}_k \tag{9}$$

Equating $\|\mathbf{x}\|^2 = \|\mathbf{y}_M\|^2 = E\{\mathbf{x}^T\mathbf{T}^*\mathbf{T}\mathbf{x}\}$ from Equations (6) and (9) with $\|\mathbf{x}\|^2 = E\{\mathbf{x}^T\mathbf{x}\} = \sum_{k=0}^{N-1} R_x(k,k)$ from Equation (4), we obtain Equation (10):

$$\sum_{k=0}^{M-1} \mathbf{t}_k^*\mathbf{R}_x\mathbf{t}_k = \sum_{k=0}^{N-1} R_x(k,k) \tag{10}$$

Because the orthogonal eigenvectors $\mathbf{t}_k$ serve as column vectors of the transformation matrix $\mathbf{T}$, and $\mathbf{R}_x$ is a diagonal matrix, the spectral decomposition theorem applies to Equation (10), indicating that the expression $\mathbf{T}^*\mathbf{R}_x\mathbf{T}$ satisfies the eigendecomposition where $R_x(k,k)$ is the $k$th eigenvalue, $\lambda_k$, for the eigenvector $\mathbf{t}_k$, as given in Equation (11).
$$\mathbf{R}_x\mathbf{t}_k = \lambda_k\mathbf{t}_k, \quad \lambda_k = R_x(k,k) \tag{11}$$

Therefore, the optimal transformation for compacting maximal signal energy into the fewest spectral coefficients, such that all components contain independent, or decorrelated, energy values, is built from the covariance (or autocorrelation) matrix of $\mathbf{x}$, $\mathbf{R}_x = \mathrm{Cov}\{\mathbf{x}\} = E\{\mathbf{x}\mathbf{x}^T\}$: the rows of $\mathbf{T}$ are the orthonormal eigenvectors of $\mathbf{R}_x$, making $\mathbf{T}$ an orthogonal transform. This particular transformation, which maximizes energy compaction within as few independent or decorrelated spectral coefficients as possible, is not a new result: it was formulated by Harold Hotelling in 1933 (Hotelling, 1933) for the purpose of obtaining principal components of a sequence of values for statistical analyses and is thus sometimes referred to as the "Hotelling transform" or, in more contemporary times, as "principal components analysis" (PCA). In later publications by Kari Karhunen and Michel Loève, this result was independently obtained and published by both researchers (Loève, 1945; 1955; Karhunen, 1946) as a series-expansion method for representing continuous random processes; thus the transform is also referred to as the "Karhunen-Loève transform" or "KLT." In the contemporary era of data analytics, PCA is often applied to data sequences represented as $\mathbf{x}$ to estimate the data vector in a reduced-dimensional space, as originally proposed by Hotelling. Modern PCA is typically used to determine a reduced set of dominant basis vectors whose linear combinations closely approximate the data vector $\mathbf{x}$, allowing it, or at least a close approximation to it, to be processed in a lower-dimensional space as described in (Fukunaga, 1993).
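As a concrete illustration of the KLT's compaction property, the sketch below builds the basis from the eigenvectors of a correlated covariance matrix (an AR(1)-style Toeplitz matrix, chosen purely for illustration) and compares the expected energy captured by the first M KLT coefficients against simply keeping the first M raw samples:

```python
import numpy as np

N, M, rho = 16, 4, 0.95
# Covariance of a strongly correlated process (illustrative):
# R_x(k, j) = rho^|k - j|.
R_x = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# KLT basis: rows are eigenvectors of R_x, ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R_x)
order = np.argsort(eigvals)[::-1]
T = eigvecs[:, order].T
assert np.allclose(T @ T.T, np.eye(N))     # the KLT basis is orthogonal

# Expected energy in the first M KLT coefficients is the sum of the M
# largest eigenvalues; compare with keeping the first M raw samples.
klt_energy = eigvals[order][:M].sum()
raw_energy = np.trace(R_x[:M, :M])
assert klt_energy > raw_energy
```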

A.3 PRACTICAL TRANSFORMS TO MAXIMIZE ENERGY COMPACTION WITH INDEPENDENT SPECTRAL COEFFICIENTS

As is apparent from the previous derivations, the property of requiring the spectral coefficients to be independent, in our case of representing decorrelated signal energy components, implies that the desired transformation be orthogonal, as is the case for the KLT. Unfortunately, computation of the KLT transformation matrix, the autocorrelation of the signal $\mathbf{x}$, is computationally intense. In general, transforming an $N$-dimensional vector $\mathbf{x}$ requires $N^2$ scalar multiplications and on the order of $N^2$ scalar additions of two operands. Other classes of orthogonal transformations are independent of the signal's properties and comprise a fixed transformation matrix. The use of a fixed transformation matrix is desirable since it may be used repeatedly, independent of the particular structure of the signal. Furthermore, some classes of these transformation matrices have desirable structural properties, such as being representable as products of sparse matrix factors, enabling efficient algorithms that greatly reduce the number of arithmetic operations required to compute the spectral or transformation vector. In particular, the discrete Fourier transform (DFT) has both the property of a transformation matrix with a fixed structure and of being decomposable into a set of sparse matrix factors. This observation led to the development of the so-called "Fast Fourier Transform" (FFT) (Cooley & Tukey, 1965), an efficient implementation of the DFT. Since the publication of the FFT algorithm, which some attribute to work as early as that of Carl Gauss, many alternative efficient algorithms have been developed. A summary of more modern and effective spectral transformation methods and algorithms is described in (Thornton et al., 2012).
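The advantage of a fixed, signal-independent transformation matrix can be seen directly: the DFT matrix below depends only on $N$, and the FFT computes the identical product in $O(N \log N)$ rather than $O(N^2)$ operations (the size here is illustrative):

```python
import numpy as np

N = 8
k = np.arange(N)
# Fixed, signal-independent DFT matrix: W[k, n] = exp(-2j*pi*k*n/N).
W = np.exp(-2j * np.pi * np.outer(k, k) / N)

x = np.random.default_rng(2).standard_normal(N)
# The FFT computes the same matrix-vector product with sparse factors.
assert np.allclose(W @ x, np.fft.fft(x))
```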

A.4 KLT FOR REAL-VALUED PERIODIC PROCESS: THE DISCRETE FOURIER TRANSFORM

In terms of comprising independent energy components, the DFT is often used due to the Wiener-Khinchine theorem, which states that the power spectral density of a physical signal is equivalent to the Fourier transform of the signal's autocorrelation function (Ziemer & Trantor, 1985). For these reasons, the DFT or FFT is often used to compute the energy components of a signal. The DFT is optimal whenever the vector $\mathbf{x}$ comprises a periodic signal, that is, $x_k = x_{k+N}$ for all values of $k$. This can be shown to be the case since the autocorrelation matrix $\mathbf{R}_x$ then has a circulant structure:

$$\mathbf{R}_x = E\{\mathbf{x}\mathbf{x}^T\} = \begin{bmatrix} r_0 & r_1 & \cdots & r_{N-1} \\ r_{N-1} & r_0 & \cdots & r_{N-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_1 & r_2 & \cdots & r_0 \end{bmatrix} \tag{12}$$

The eigenvectors of $\mathbf{R}_x$ as given in Equation (11) can be determined and are found to consist of the Fourier basis vectors, $\mathbf{w}_N^k$, used to construct the DFT transformation matrix, as shown in Equation (13):

$$\mathbf{R}_x\mathbf{w}_N^k = \lambda_k\mathbf{w}_N^k, \quad (\mathbf{w}_N^k)_n = e^{-i2\pi kn/N}, \quad i^2 = -1 \tag{13}$$

Thus, the DFT is a special case of the KLT for strictly periodic signals. As a signal loses its periodic structure, the DFT becomes less optimal in terms of energy compaction, but it retains its ability to preserve independent or decorrelated energy components in the spectral vector due to the Wiener-Khinchine theorem. For this reason, the DFT is often the first choice for determining the energy components of a signal, owing to the widespread popularity of efficient "FFT" algorithms and the fact that many signals in communications systems are modulated carrier waves and thus have a good degree of periodicity. The spectral coefficients comprising the DFT are complex, even when the transformed signal $\mathbf{x}$ is real-valued. In an effort to further reduce computational complexity, there is interest in linear transformations that maximize energy compaction yet yield real-valued spectra.
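The circulant structure and its diagonalization by the Fourier basis in Equation (13) can be verified numerically; in this sketch (with an arbitrary symmetric lag sequence chosen for illustration), the unitary DFT basis renders a circulant autocorrelation matrix diagonal:

```python
import numpy as np

N = 6
# Circulant autocorrelation of a periodic process: row i is row 0
# cyclically shifted, built from a symmetric lag sequence r (illustrative).
r = np.array([4.0, 2.0, 1.0, 0.5, 1.0, 2.0])
R = np.array([[r[(j - i) % N] for j in range(N)] for i in range(N)])

# Unitary DFT matrix whose columns are the Fourier basis vectors w_N^k.
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

# R is diagonal in the Fourier basis, so each w_N^k is an eigenvector of R.
D = F.conj().T @ R @ F
assert np.allclose(D, np.diag(np.diag(D)), atol=1e-10)
```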
A.5 KLT FOR AR(1) PROCESS: THE DISCRETE COSINE TRANSFORM

Suppose the signal $\mathbf{x}$ comprises components related as $x_k = \rho x_{k-1} + z_k$, where $z_k$ is a variate (i.e., the $k$th outcome) of a Gaussian-distributed random variable $Z$ with zero mean, $\mu_Z = 0$, and variance $\sigma_Z^2$, and the constant correlation coefficient $\rho$ satisfies $|\rho| < 1$. In this case, $\mathbf{x}$ is a stationary first-order Markov process or, more generally, an autoregressive model of order one, AR(1). The autocorrelation matrix for $\mathbf{x}$ is $\mathbf{R}_x = \mathrm{Cov}\{\mathbf{x}\} = E\{\mathbf{x}\mathbf{x}^T\}$ and is given in Equation (14), following Lai (1978):

$$\mathbf{R}_x = \sigma_Z^2 \begin{bmatrix} \rho^0 & \rho^1 & \rho^2 & \cdots & \rho^{N-1} \\ \rho^1 & \rho^0 & \rho^1 & \cdots & \rho^{N-2} \\ \rho^2 & \rho^1 & \rho^0 & \cdots & \rho^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & \rho^0 \end{bmatrix} \tag{14}$$

We wish to compute the eigenvectors of $\mathbf{R}_x$ in Equation (14) and make use of the fact that the eigenvectors of $\mathbf{R}_x$ and $\beta^2\mathbf{R}_x^{-1}$ are equivalent, wherein the scalar constant $\beta^2$ is defined as:

$$\beta^2 = \frac{1-\rho^2}{1+\rho^2} \tag{15}$$

We also define the scalar constant $\alpha$ in Equation (16) as:

$$\alpha = \frac{\rho}{1+\rho^2} \tag{16}$$

The form of $\beta^2\mathbf{R}_x^{-1}$ follows from Mallik (2001) and is that of the tridiagonal matrix shown in Equation (17):

$$\beta^2\mathbf{R}_x^{-1} = \frac{1}{\sigma_Z^2} \begin{bmatrix} 1-\rho\alpha & -\alpha & 0 & \cdots & 0 \\ -\alpha & 1 & -\alpha & \ddots & \vdots \\ 0 & -\alpha & 1 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & -\alpha \\ 0 & \cdots & 0 & -\alpha & 1-\rho\alpha \end{bmatrix} \tag{17}$$

When the correlation coefficient is close to unity, $\rho \approx 1$, and the variance is normalized to unity, $\sigma_Z^2 = 1$, the matrix $\beta^2\mathbf{R}_x^{-1}$ simplifies to that of Equation (18):

$$\mathbf{T}_c = \begin{bmatrix} 1-\alpha & -\alpha & 0 & \cdots & 0 \\ -\alpha & 1 & -\alpha & \ddots & \vdots \\ 0 & -\alpha & 1 & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & -\alpha \\ 0 & \cdots & 0 & -\alpha & 1-\alpha \end{bmatrix} \tag{18}$$

The eigenvectors of $\mathbf{T}_c$ are determined as per Da Fonseca (2007), leading to the result:

$$\mathbf{T}_c\mathbf{t}_{ck} = \lambda_k\mathbf{t}_{ck} \tag{19}$$

Explicitly, these eigenvectors have components of the form provided in Equation (20):

$$t_{ck}(m) = c_k\cos\!\left(\frac{(2m+1)k\pi}{2N}\right), \quad k, m = 0, 1, \ldots, N-1; \quad c_0 = \sqrt{\tfrac{1}{N}}, \;\; c_k = \sqrt{\tfrac{2}{N}} \text{ for } k \ge 1 \tag{20}$$

The coefficients of the $\mathbf{t}_{ck}$ are recognized as the components of the "discrete cosine transform" (DCT) matrix as defined in (Ahmed et al., 1974). It is noted that these coefficients are related to the DFT coefficients in Equation (13) as $\mathrm{Re}\{w_N^k\}$. As previously mentioned, the fact that the DCT yields a real-valued spectrum is advantageous from a computational-complexity viewpoint compared to the complex-valued spectra arising from the DFT.
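The practical consequence, better energy compaction for AR(1)-like signals, can be demonstrated with a short simulation. The sketch below (with N, M, and ρ chosen purely for illustration) builds the orthonormal DCT-II matrix of Equation (20) and compares the average fraction of energy captured by the first M DCT versus DFT coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, rho = 64, 8, 0.95

# Orthonormal DCT-II matrix: row k holds the cos((2m+1)k*pi / 2N) terms.
idx = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * N))
C[0] /= np.sqrt(2.0)
assert np.allclose(C @ C.T, np.eye(N))      # the DCT basis is orthonormal

def ar1(n):
    # x_k = rho * x_{k-1} + z_k with zero-mean Gaussian innovations z_k.
    x = np.zeros(n)
    for k in range(1, n):
        x[k] = rho * x[k - 1] + rng.standard_normal()
    return x

dct_frac = dft_frac = 0.0
for _ in range(200):
    x = ar1(N)
    c = C @ x                           # real-valued DCT spectrum
    f = np.fft.fft(x) / np.sqrt(N)      # complex-valued, unitary DFT spectrum
    dct_frac += (c[:M] ** 2).sum() / (c ** 2).sum()
    dft_frac += (np.abs(f[:M]) ** 2).sum() / (np.abs(f) ** 2).sum()

# For rho near 1, the DCT packs more energy into the first M coefficients.
assert dct_frac > dft_frac
```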



Figure 1: An overview of DCT-DiffStride on a greyscale image from ImageNet (Deng et al., 2009).

4.1 IMAGE DATA

Experimental Setup, CIFAR We compare DCT-DiffStride and DiffStride on various image classification tasks including CIFAR10, CIFAR100, and ImageNet. Each comparison uses the ResNet18 (He et al., 2016) architecture with a residual block using the same definition as (Riad et al., 2022). Each image experiment applies mixup (Zhang et al., 2018) with α = 0.2, which regularizes the CNN using convex combinations of training instances. Convolutional layers use a weight decay of 5e-3, and decimation rates are initialized to S = [1, 2, 2, 2].
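For intuition, the transform-crop-inverse pipeline that DCT-DiffStride builds on can be sketched with a fixed stride. Note this is a simplified, non-differentiable illustration: the actual method replaces the hard crop below with a smooth learnable mask so that the stride receives gradients.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II matrix; rows are the cosine basis vectors.
    idx = np.arange(n)
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

def dct_downsample(fmap, stride):
    # Forward 2-D DCT, keep the low-frequency corner, then inverse 2-D DCT
    # at the reduced size (a fixed-stride, hard-crop low-pass filter).
    H, W = fmap.shape
    h, w = int(H / stride), int(W / stride)
    spec = dct_matrix(H) @ fmap @ dct_matrix(W).T   # 2-D DCT
    cropped = spec[:h, :w]                          # low frequencies only
    return dct_matrix(h).T @ cropped @ dct_matrix(w)

x = np.random.default_rng(4).standard_normal((32, 32))
y = dct_downsample(x, 2.0)
assert y.shape == (16, 16)
```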

Figure 2: Accuracy over various model complexities with (a) showing CIFAR10 results, (b) showing CIFAR100 results, and (c) showing ImageNet results.

Figure 3: Parts (a) and (b) illustrate the speech commands experiment, with (a) showing a convolutional block for audio data, where BN stands for batch normalization, and (b) showing top-1 accuracy over various model complexities for the speech commands dataset. (c) demonstrates accuracy over various model complexities for the RadioML 2018.01A dataset for signals in [-14, 14] dB SNR.

Figure 4: Absolute residuals (box plots) and learned decimation shapes for each layer. The top row, left to right: CIFAR10, CIFAR100, and ImageNet. The second row, left to right: speech commands and RadioML 2018.01A.

Experimental Setup To investigate this, we evaluate DCT-DiffStride on the RadioML 2018.01A (O'Shea et al., 2018) dataset with the task of modulation scheme classification. There are 24 different classes with a total of 2.56M signals S(T), each represented as a 2-dimensional vector consisting of in-phase (I) and quadrature (Q) components where S(T) = I(T) + jQ(T). Observations range from -20dB to +30dB signal-to-noise ratio (SNR) in 2dB increments for a total of 26 different SNR values. The dataset is balanced across modulation type and SNR value with a total of 4,096 observations per {modulation type, SNR} pair. While there is no official training split, we follow a similar procedure as in (O'Shea et al., 2018; Harper et al., 2020; 2021), where 1M observations are randomly selected for training and 1.56M observations are reserved for evaluation. As observed in (O'Shea et al., 2018; Harper et al., 2020; 2021), classification performance plateaus approximately below -14dB and above 14dB SNR. In this work, we subset the training and evaluation datasets to only include signals in the range [-14, 14]dB SNR, resulting in approximately 575K training observations and 900K evaluation observations.
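The dataset sizes quoted above follow from simple bookkeeping; the arithmetic below (assuming the 1M/1.56M train/eval split is uniform across SNR values) reproduces the approximate 575K/900K subset counts:

```python
# Bookkeeping for the RadioML 2018.01A dataset splits described above.
classes, snr_values, per_pair = 24, 26, 4096
total = classes * snr_values * per_pair
assert total == 2_555_904                    # the quoted ~2.56M signals

# Restricting to [-14, 14] dB in 2 dB steps keeps 15 of the 26 SNR values.
kept = len(range(-14, 16, 2))
assert kept == 15
subset = classes * kept * per_pair           # 1,474,560 observations

# Assuming the 1M / 1.56M train/eval split is uniform across SNR, the
# subset holds roughly 575K training and 900K evaluation observations.
train = subset * (1_000_000 / total)
assert abs(train - 575_000) < 5_000
assert abs((subset - train) - 900_000) < 5_000
```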

Table 1: RadioML 2018.01A results for each modulation category. Best-performing models for each model complexity and modulation grouping are shown in bold. To make the table more compact, DCT and DFT represent DCT-DiffStride and DiffStride, respectively.

Results Although modulation data is non-natural, Figure 3c and Table 1 clearly illustrate that DCT-DiffStride and DiffStride are able to classify modulation schemes even at low model complexities. Interestingly, both architectures outperformed the baseline architecture that does not use decimation. DCT-DiffStride was found to outperform DiffStride when the decimation rate was increased (i.e., at lower model complexity), consistent with our image dataset results. DCT-DiffStride produced the best overall performing model at a model complexity of approximately 50%.

