DUAL-TREE WAVELET PACKET CNNS FOR IMAGE CLASSIFICATION

Abstract

In this paper, we target an important issue of deep convolutional neural networks (CNNs): the lack of a mathematical understanding of their properties. To address this problem, we present an explicit formalism motivated by the similarities between trained CNN kernels and oriented Gabor filters. The core idea is to constrain the behavior of convolutional layers by splitting them into a succession of wavelet packet decompositions, which are modulated by freely-trained mixture weights. We evaluate our approach on three variants of wavelet decomposition, using the AlexNet architecture for image classification as an example. The first variant relies on the separable wavelet packet transform, while the other two implement the 2D dual-tree real and complex wavelet packet transforms, taking advantage of their feature extraction properties such as directional selectivity and shift invariance. Our experiments show that we match the accuracy of standard AlexNet with significantly fewer parameters, together with an interpretation of the network that is grounded in mathematical theory.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) have dramatically improved state-of-the-art performance in many domains such as speech recognition, visual object recognition and object detection (LeCun et al., 2015). However, they are very resource-intensive, and a full mathematical understanding of their properties remains a challenging issue. On the other hand, in the field of signal processing, wavelet and multi-resolution analysis are built upon a well-established mathematical framework. They have proven to be very efficient in tasks such as signal compression and denoising (Mallat, 2009). Moreover, wavelet filters have been widely used as feature extractors for signal, image and texture classification (Laine & Fan, 1993; Pittner & Kamarthi, 1999; Yen, 2000; Huang & Aviyente, 2008). While both fields rely on filters to achieve their goals, the two approaches are radically different. In wavelet analysis, filters are specifically designed to meet very restrictive conditions, whereas CNNs use freely-trained filters, without any prior assumption on their behavior. Nevertheless, in many computer vision tasks, CNNs tend to learn first-layer parameters that closely resemble oriented Gabor filters (Boureau et al., 2010; Yosinski et al., 2014). This phenomenon suggests that early layers extract general features such as edges or basic shapes, which are independent of the task at hand.

Proposed approach

In order to improve our understanding of CNNs, we propose to constrain their behavior by replacing freely-trained filters with a succession of discrete wavelet packet decompositions modulated by mixture weights. We therefore introduce prior assumptions to guide learning and reduce the number of trainable parameters in convolution layers, while retaining predictive power. The main goal of our work is to describe and interpret the observed behavior of CNNs with a sparse model, taking advantage of the feature extraction properties of wavelet packet transforms. By increasing control over the network, we pave the way for future applications in which theoretical guarantees are critical. In this paper, we describe our wavelet packet CNN architectures with a mathematical formulation and introduce an algorithm to visualize the resulting filters. As a proof of concept, we based our experiments on AlexNet (Krizhevsky et al., 2012). Our choice was driven by the large kernels in its first layer, whose convolutions are performed with a downsampling factor of 4. This allows us to perform two levels of wavelet decomposition without any additional transformation, and facilitates visual comparison with our own custom filters. Note, however, that most CNNs trained on natural image datasets exhibit the same oscillating patterns. We therefore believe that our work could be extended to other architectures with a few adaptations.

Related work

In a similar spirit, a few attempts to combine the two research fields have been made in recent years. Wavelet scattering networks (Bruna & Mallat, 2013) compute CNN-like cascading wavelet convolutions to obtain translation-invariant image representations that are stable to deformation and preserve high-frequency information. They were later adapted to the discrete case using complex oriented wavelet frames (Singh & Kingsbury, 2017). While these networks are designed from scratch and are fully deterministic, other approaches enhance existing networks with wavelet filter preprocessing or embedding. The goal is either to improve classification performance without increasing the network complexity (Chang & Morgan, 2014; Williams & Li, 2016; Fujieda et al., 2017; Williams & Li, 2018; Lu et al., 2018; Luan et al., 2018), or to replace freely-trained layers by more constrained structures implementing spectral filtering. Such models include Gabor filters in parallel to regular trainable weight kernels (Sarwar et al., 2017), wavelet scattering coefficients as the input of a CNN (Oyallon et al., 2018), or linear combinations of discrete cosine transforms (Ulicny et al., 2019). Our approach falls into this second category, although our design is based upon a different CNN architecture, i.e., AlexNet. To our knowledge, we are the first to introduce the dual-tree wavelet packet transform (DT-CWPT) (Bayram & Selesnick, 2008) in such a context. Like the filters used in the above papers, wavelet packet transforms are well-localized in the frequency domain and share a subsampling factor over the output feature maps. A major advantage of our approach is sparsity: a single vector (called a conjugate mirror filter, or CMF) is sufficient to characterize the whole process.
Moreover, like Gabor filters, DT-CWPT extracts oriented and shift-invariant features, but achieves this goal with minimal redundancy, while providing an efficient decomposition algorithm based on separable filter banks. The discrete cosine transform has a complexity similar to that of DT-CWPT, but lacks orientation properties. Our models therefore provide a sparser description of the observed behavior of convolutional layers. This is a step toward a more complete description of CNNs using a small number of arbitrary parameters.

2. BACKGROUND

Notations. In this paper, d-dimensional tensors are written with straight bold capital letters: Z ∈ R^(A_1×···×A_d), where A_i denotes the size of Z along its i-th dimension; the shape of Z is denoted A_1 × ··· × A_d. 2D matrices are written in italic: U ∈ R^(A×B), and 1D vectors in bold lower-case letters: z ∈ R^A. For the sake of legibility, indices are written between square brackets.

Convolution between two matrices. The convolution between two matrices U ∈ R^(A×B) and V ∈ R^(A'×B') is defined, for all m ∈ {0 .. A + A' − 2} and n ∈ {0 .. B + B' − 2}, by

(U * V)[m, n] = Σ_{i,j} U[m − i, n − j] · V[i, j].

Since some indices are negative or bigger than the matrix size, U and V must be extended beyond their limits, either by setting all outside values to zero, or by using a periodic or symmetric pattern. Practical implications of this choice will not be discussed in this paper.

Discrete wavelet packet transform (WPT). This is a brief overview of the WPT algorithm (Mallat, 2009), written as a sequence of matrix convolutions; an illustration of the transform is given in Appendix A.7. We implicitly build a discrete orthogonal basis in which any matrix X ∈ R^(N×N) can be decomposed. The basis is made of oriented 2D waveforms with high frequency resolution, which is an interesting property for feature extraction. Considering a pair of conjugate mirror filters (CMFs) h and g ∈ R^µ, we build a separable 2D filter bank, made of one low-pass filter G^(0) = h hᵀ and three high-pass filters G^(1) = h gᵀ, G^(2) = g hᵀ and G^(3) = g gᵀ. We start the decomposition with D^(0)_0 = X. Let us assume that, for a given j ∈ N, the feature maps of wavelet packet coefficients at scale j, denoted D^(k)_j, have already been computed for any k ∈ {0 .. 4^j − 1}. We then compute the wavelet packet coefficients at the coarser scale j + 1 by decomposing each feature map D^(k)_j into four smaller submatrices:

∀l ∈ {0 .. 3},  D^(4k+l)_{j+1} = (D^(k)_j * G^(l)) ↓ 2.  (1)

At each scale j > 0, the set of 4^j matrices D^(k)_j is a representation of X from which the original signal can be reconstructed. Figure 1 illustrates the resulting WPT filters for j = 2.

Dual-tree complex wavelet packet transform (DT-CWPT). WPT has interesting properties such as sparse signal representation and vertical / horizontal feature discrimination. However, it suffers from a lack of shift invariance and poor directional selectivity.
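As a toy illustration of expression (1), the sketch below computes one level of 2D wavelet packet decomposition. The length-2 Haar CMF pair is used for brevity (an assumption for this example; the paper's experiments use Q-shift filters of length 10), and the "valid" convolution range is used so no boundary extension is needed:

```python
import numpy as np

# One level of the 2D wavelet packet decomposition of expression (1),
# sketched with the Haar CMF pair (the paper uses Q-shift filters of length 10).
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass CMF
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass CMF

def wpt_level(X, h, g):
    """Split X into four subbands D^(0..3): convolve with G^(l), then downsample by 2."""
    # Separable 2D filter bank: G^(0)=h h^T, G^(1)=h g^T, G^(2)=g h^T, G^(3)=g g^T
    bank = [np.outer(h, h), np.outer(h, g), np.outer(g, h), np.outer(g, g)]
    subbands = []
    for G in bank:
        Gf = G[::-1, ::-1]  # flip the kernel: true convolution, not cross-correlation
        mu = G.shape[0]
        D = np.array([[np.sum(X[i:i + mu, j:j + mu] * Gf)
                       for j in range(0, X.shape[1] - mu + 1, 2)]
                      for i in range(0, X.shape[0] - mu + 1, 2)])
        subbands.append(D)
    return subbands

X = np.arange(16.0).reshape(4, 4)
D = wpt_level(X, h, g)
```

Since the Haar transform is orthogonal, the energy of X equals the total energy of the four subbands, which is an easy sanity check on the decomposition.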
To overcome this, Kingsbury (2001) designed a discrete wavelet transform in which input images are efficiently decomposed in a tight frame of complex oriented waveforms with limited redundancy. It was generalized to the wavelet packet framework by Bayram & Selesnick (2008). In a nutshell, let us assume that we have decomposed an input matrix X into four WPT representations D^(k)_{a,j}, D^(k)_{b,j}, D^(k)_{c,j}, D^(k)_{d,j}. This is achieved by applying expression (1) with four suitable filter banks G^(l)_a, G^(l)_b, G^(l)_c and G^(l)_d. Then we can compute the following complex wavelet packet coefficients E^(k)_j and E'^(k)_j, for each k ∈ {0 .. 4^j − 1}:

[ E^(k)_j ; E'^(k)_j ] = [ I  −I ; I  I ] · [ D^(k)_{a,j} ; D^(k)_{d,j} ] + i · [ I  I ; I  −I ] · [ D^(k)_{c,j} ; D^(k)_{b,j} ].  (2)

For a given scale j > 0, the set of 2 · 4^j complex matrices {E^(k)_j, E'^(k)_j} constitutes a redundant representation of X from which the original signal can be reconstructed. DT-CWPT is oriented, and nearly shift invariant if we consider the modulus of the complex coefficients. Figure 1 illustrates the resulting DT-CWPT filters for j = 2.

Dual-tree real wavelet packet transform (DT-RWPT). By computing only the real part of the above coefficients, we get a representation of X in a real tight frame. As above, DT-RWPT is an oriented transform, but it does not possess the shift invariance property. This may have consequences on its predictive power, as will be seen later.

Link with Gabor filters. As presented above, the wavelet packet transforms compute a full decomposition in what Mallat (2009) calls a pseudo-local cosine basis. The resulting filters have identical window size, with a varying number of oscillations within these windows (see Figure 1). Therefore, such wavelet packets share similarities with Gabor filters. However, they offer a competitive advantage: the decomposition is performed efficiently using one or a few separable filter banks, which are fully characterized by a single one-dimensional vector.
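The recombination of expression (2) is a pure channel-mixing step and can be sketched directly; `dual_tree_combine` is a hypothetical helper name, and the toy matrices stand in for actual wavelet packet coefficients from the four trees:

```python
import numpy as np

# Sketch of the dual-tree recombination in expression (2): the four real WPT
# trees D_a..D_d are mixed into two complex coefficient sets E and E'.
def dual_tree_combine(Da, Db, Dc, Dd):
    # [E ; E'] = [[I,-I],[I,I]]·[Da;Dd] + i·[[I,I],[I,-I]]·[Dc;Db]
    E  = (Da - Dd) + 1j * (Dc + Db)
    Ep = (Da + Dd) + 1j * (Dc - Db)
    return E, Ep

rng = np.random.default_rng(0)
Da, Db, Dc, Dd = (rng.standard_normal((2, 2)) for _ in range(4))
E, Ep = dual_tree_combine(Da, Db, Dc, Dd)
```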
Convolutional layers. Let P denote the number of samples (batch size), K (resp. L) the number of input (resp. output) channels, (M, N) the size of the input feature maps and (µ, ν) the kernel size. A 2D convolutional layer with fixed parameters s, d, q ∈ N* (stride, dilation factor and number of groups, respectively), weight W ∈ R^((K/q)×L×µ×ν) and bias b ∈ R^L, transforms any 4D input tensor X ∈ R^(P×K×M×N) into an output tensor Y ∈ R^(P×L×M'×N'), such that

Y[p, l] = b[l] + Σ_{k=0}^{K/q−1} ((X[p, k_0(l) + k] * (W[k, l] ↑ d)) ↓ s),  (3)

where k_0(l) = ⌊lq/L⌋ · K/q denotes the first input channel influencing the l-th output. Note that in expression (3), Y[p, l], X[p, k_0(l) + k] and W[k, l] are 2D matrices, while b[l] is a scalar.

Definition 1. We denote by C^(q)_{s,d}(W, b) the operator computing (3): Y = C^(q)_{s,d}(W, b) · X.

In this paper, we focus on AlexNet's first convolutional layer, which can be represented as a convolution operator C^(1)_{4,1}(W_alex, b_alex), with W_alex ∈ R^(3×64×11×11) and b_alex ∈ R^64. The kernels W_alex[0, k] after training with ImageNet are displayed in Figure 1; we can notice oriented oscillating patterns similar to wavelet packet filters.
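Expression (3) can be made concrete with a dense, readable sketch (written for clarity, not speed; `conv_layer` is a hypothetical name, zero padding is assumed, and the "full" convolution range is kept before striding):

```python
import numpy as np

# Toy implementation of the convolutional layer of expression (3):
# stride s, dilation d, q groups, zero padding.
def conv_layer(X, W, b, s=1, d=1, q=1):
    P, K, M, N = X.shape          # batch, input channels, spatial size
    Kq, L, mu, nu = W.shape       # K//q, output channels, kernel size
    mu_d, nu_d = d * (mu - 1) + 1, d * (nu - 1) + 1  # dilated kernel size
    Mo, No = M + mu_d - 1, N + nu_d - 1              # full-convolution size
    Y = np.zeros((P, L, -(-Mo // s), -(-No // s)))   # ceil division for stride
    for p in range(P):
        for l in range(L):
            k0 = (l * q // L) * (K // q)  # first input channel feeding output l
            acc = np.full((Mo, No), float(b[l]))
            for k in range(Kq):
                for i in range(mu):
                    for j in range(nu):
                        # (U * V)[m, n] = sum_{i,j} U[m-i, n-j] V[i, j], V dilated by d
                        acc[d*i:d*i + M, d*j:d*j + N] += W[k, l, i, j] * X[p, k0 + k]
            Y[p, l] = acc[::s, ::s]   # downsample by the stride
    return Y

# Two input channels (values 1 and 10), one 1x1 kernel per channel (2 and 3), bias 1:
X = np.ones((1, 2, 3, 3)); X[0, 1] *= 10.0
W = np.zeros((2, 1, 1, 1)); W[0, 0, 0, 0], W[1, 0, 0, 0] = 2.0, 3.0
Y = conv_layer(X, W, np.array([1.0]))   # every output entry is 1 + 2*1 + 3*10 = 33
```

Setting q = 2 on the same input splits the channels into two groups, so that each output channel only sees the input channels selected by k_0(l).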

3. PROPOSED MODELS

We now introduce several network architectures that are built on standard AlexNet, in which the first 11 × 11 convolutional layer is replaced by a succession of WPT or DT-(RC)WPT decompositions modulated by mixture weights. Each network takes as input a 4D tensor X ∈ R P ×3×224×224 , i.e., a set of P images with three input channels (RGB images).

3.1. WPT MODULE

This module computes two successive WPT decompositions (1) for every input channel. Each step j ∈ {0, 1} is implemented as a strided convolution operator C^(q_j)_{2,1}(W_j, 0) (see Definition 1), where W_j contains the fixed low- and high-pass filters. In this configuration, each input channel is convolved with its own set of filters. More precisely, we have s = 2, d = 1 and q_j = K_j, where K_j = 3 · 4^j denotes the number of input channels. W_j ∈ R^(1×(4K_j)×µ×µ) is such that, for all k ∈ {0 .. K_j − 1} and l ∈ {0 .. 3}, W_j[0, 4k + l] = G^(l). The output, denoted D ∈ R^(P×48×N×N), is such that

D = C^(12)_{2,1}(W_1, 0) · C^(3)_{2,1}(W_0, 0) · X.  (4)

Once this is done, we need to modulate the importance of each wavelet packet. Moreover, the number of output channels must be equal to 64, as in standard AlexNet, and every output channel must be influenced by each input RGB channel. This is achieved with a 1 × 1 convolutional layer (Lin et al., 2014; Szegedy et al., 2015) placed after the WPT decomposition. Note that this approach was also chosen by Ulicny et al. (2019) in what they call a harmonic block. The final output, denoted Y_wpt ∈ R^(P×64×56×56), is such that

Y_wpt = C^(1)_{1,1}(W_mix, b_mix) · D,  (5)

where W_mix ∈ R^(48×64×1×1) and b_mix ∈ R^64 are freely trained. A schematic representation of the WPT module can be found in Figure 2-2. The orange ("FB", a.k.a. filter bank) and green ("Conv") layers compute expressions (4) and (5), respectively.

Number of trainable parameters. The only trainable parameters of the WPT module are W_mix and b_mix, i.e., 48 · 64 + 64 = 3,136 parameters.

3.2. DT-(RC)WPT MODULES

The dual-tree modules follow the same principle, with WPT replaced by DT-RWPT or DT-CWPT. For any sample p ∈ {0 .. P − 1}, the real and complex dual-tree coefficients are given by

E_R[p] = [ Re E[p] ; Re E'[p] ] = [ D_a[p] − D_d[p] ; D_a[p] + D_d[p] ];

E_C[p] = [ Re E[p] ; Re E'[p] ; Im E[p] ; Im E'[p] ] = [ D_a[p] − D_d[p] ; D_a[p] + D_d[p] ; D_c[p] + D_b[p] ; D_c[p] − D_b[p] ],  (6)

where D_a, D_b, D_c and D_d ∈ R^(P×48×N×N) are computed similarly to (4). Expression (6) is a tensor formulation of (2), where the real and imaginary parts of the complex coefficients are stored separately. As for WPT in (4), both DT-RWPT and DT-CWPT can be expressed as a succession of CNN-style convolution operators; the required technicalities are provided in Appendix A.4. Again, we placed a 1 × 1 convolutional layer after the wavelet packet decompositions. The final outputs, denoted Y_dt-Rwpt and Y_dt-Cwpt ∈ R^(P×64×N×N), are such that

Y_dt-Rwpt = C^(1)_{1,1}(W_mix, b_mix) · E_R;  Y_dt-Cwpt = C^(1)_{1,1}(W'_mix, b'_mix) · E_C,  (7)

where W_mix ∈ R^(96×64×1×1), W'_mix ∈ R^(192×64×1×1), b_mix and b'_mix ∈ R^64 are freely trained. A schematic representation of both modules can be found in Figure 2-3 and 2-4. The blue ("∓" and "±") and green ("Conv") layers compute expressions (6) and (7), respectively.

Number of trainable parameters. The DT-RWPT and DT-CWPT modules have 6,208 and 12,352 trainable parameters, respectively (23,296 in a standard AlexNet). We will see in Section 5 how these numbers can be further decreased without degrading the performance of the network.
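The parameter counts quoted above follow directly from the 1 × 1 mixing layers' shapes, since those are the only trainable parameters in each module; a quick arithmetic check:

```python
# Trainable-parameter counts: each module only trains its 1x1 mixing layer
# (W_mix and b_mix), while AlexNet's first layer trains full 11x11 kernels.
def n_params(c_in, c_out, k=1):
    return c_in * c_out * k * k + c_out   # weights + biases

wpt  = n_params(48, 64)         # WPT module
dt_r = n_params(96, 64)         # DT-RWPT module
dt_c = n_params(192, 64)        # DT-CWPT module
alex = n_params(3, 64, k=11)    # standard AlexNet, first layer
```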

3.3. KERNEL VISUALIZATION

WPT and DT-(RC)WPT modules are designed as a succession of multi-channel convolutional layers. The following proposition states that such cascading layers can be expressed as a single CNN-style convolution operator. It provides an explicit formulation of the resulting hyperparameters (i.e., stride, dilation factor and number of groups describing input-output channel connections) and weight tensor. It takes advantage of the well-known result that two successive convolutions can be written as another convolution with a wider kernel.

Proposition 1. Let C^(1)_{s,1}(W, b) denote an initial convolution operator, with W ∈ R^(K×L×µ×ν). We consider a second operator C^(q)_{t,1}(V, a), with V ∈ R^((L/q)×L'×µ'×ν') and a ∈ R^(L') (we assume that both L and L' are divisible by q). Then

C^(q)_{t,1}(V, a) · C^(1)_{s,1}(W, b) = C^(q')_{s',d'}(W', b'),

with s' = st (stride), d' = 1 (dilation) and q' = 1 (number of groups). Moreover, the resulting weight W' is computed using a dilated CNN-style convolution operator:

W' = C^(q_w)_{s_w, d_w}(V, 0) · W,  (9)

with s_w = 1, d_w = s and q_w = q. We also have b' = a + f(V, b), where f is such that f(V, 0) = 0. An explicit formulation of f is given in Appendix A.6.

Remark. Proposition 1 is valid only if the matrices are infinitely extended beyond their limits: we either get an infinite sequence with finite support (zero padding), an N-periodic signal (periodic padding) or a 2N-periodic signal (symmetric padding). In practice, the amount of padding at each layer must be carefully chosen to avoid distortion effects at the edges of feature maps, as done in our implementation.

Proposition 1, whose proof is given in Appendix A.6, shows that the wavelet packet modules compute

Y_wpt = C^(1)_{4,1}(W_wpt, b_mix) · X;  Y_dt-Rwpt = C^(1)_{4,1}(W_dt-Rwpt, b_mix) · X;  Y_dt-Cwpt = C^(1)_{4,1}(W_dt-Cwpt, b_mix) · X,

where W_wpt, W_dt-Rwpt and W_dt-Cwpt ∈ R^(3×64×(3µ−2)×(3µ−2)) are obtained from (9). As a reminder, µ denotes the size of the CMF. By means of comparison, AlexNet's first layer computes

Y_alex = C^(1)_{4,1}(W_alex, b_alex) · X,

where W_alex ∈ R^(3×64×11×11) and b_alex ∈ R^64. Therefore, all modules are represented as convolution operators with stride s = 4, mapping 3 input channels to 64 output channels. Regarding kernel size, it is bigger than 11 as soon as µ ≥ 5; however, most of the energy is concentrated in a region whose size is similar to that of AlexNet kernels. A visualization of these kernels after training is given in Figure 3; training details are provided in Section 4. For the sake of visual comparison, we only display center patches of size 11 × 11 from the original matrices. It turns out that, in all our models, between 97% and 99% of their energy (i.e., the squared L2-norm) is concentrated in these cropped regions. We point out that computing the resulting kernels serves analysis purposes only and is not involved in the training process. Whereas the WPT module mainly extracts horizontal and vertical features, many more orientations arise from the dual-tree modules. We can also notice the low-pass filters, which appear as color blobs. The resemblance with AlexNet kernels (see Figure 1) is prominent.
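The composition rule behind Proposition 1 can be verified numerically; the sketch below does so in 1D for brevity (an assumption for this example; the paper works in 2D, where the same identity applies separably):

```python
import numpy as np

# Numerical check of the composition rule of Proposition 1, in 1D:
# a convolution with kernel w and stride s, followed by one with kernel v and
# stride t, equals a single convolution with stride s*t whose kernel is
# w convolved with the s-dilated version of v.
rng = np.random.default_rng(1)
x, w, v = rng.standard_normal(32), rng.standard_normal(4), rng.standard_normal(3)
s, t = 2, 2

# Two-stage computation: convolve and stride by s, then convolve and stride by t
y_two = np.convolve(np.convolve(x, w)[::s], v)[::t]

# One-stage computation: dilate v by s, convolve once, stride by s*t
v_up = np.zeros(s * (len(v) - 1) + 1)
v_up[::s] = v                             # insert s-1 zeros between taps
y_one = np.convolve(x, np.convolve(w, v_up))[::s * t]
```

The two outputs agree elementwise, which is exactly the statement that cascaded strided convolutions collapse into one convolution with a wider (dilated-composite) kernel.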

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Wavelet filters

For the experiments, we used a PyTorch / CUDA implementation. Our wavelet packet modules were designed with a Q-shift orthogonal filter of length 10 (Kingsbury, 2003), which approximately meets the half-sample-shift condition required for the dual-tree transforms. For the sake of consistency, we also used this filter for conventional WPT.

Datasets. Our models were trained and evaluated on the ImageNet ILSVRC2012 dataset (Russakovsky et al., 2015). Since the online evaluation server is no longer available, we set aside 100,000 images from the training set (100 per class) in order to create a validation set. We used this subset to compute the accuracy rate along the training phase. As for the validation set provided by ImageNet, we turned it into a test set on which our trained models were evaluated. To assess the generalization performance of our models, we also fine-tuned them on the PASCAL VOC 2012 (Everingham et al., 2015) and COCO 2014 (Lin et al., 2015) datasets, on the multilabel classification task. For this, we initialized the networks with the parameters previously obtained with ImageNet and replaced the last fully-connected layer by a layer containing the desired number of outputs. Since, again, we did not have access to the ground truth for the "official" test sets, we split each validation set into two roughly equal parts. We then used the first part for validation and the second for testing.

Training details. For each dataset, the models were trained on a single GPU. The training procedure was inspired by many ILSVRC papers (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016). More precisely, it was carried out by optimizing the cross-entropy loss with stochastic gradient descent. For this, we fed the network with random batches of 256 images, until we reached 100 cycles through the whole training set (100 epochs, i.e., 461.4K iterations for ImageNet). The momentum was set to 0.9 and the weight decay to 5 · 10^−4. As for the learning rate, it was initially set to 10^−2, and then decreased by a factor of 10 every 25 epochs. To reduce overfitting, we followed the data augmentation procedure used in Inception networks (Szegedy et al., 2015). The images are first normalized to a specified mean and standard deviation for each RGB channel. Then they are randomly flipped and cropped, covering from 8% to 100% of their original size, with a random aspect ratio varying from 3/4 to 4/3, before being resized to 224 × 224 using bilinear interpolation.
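The step schedule described above (start at 10^−2, divide by 10 every 25 epochs) can be written as a one-liner; `learning_rate` is a hypothetical helper name:

```python
# Step learning-rate schedule: 1e-2, decreased by a factor of 10 every 25 epochs,
# yielding four plateaus over the 100 training epochs.
def learning_rate(epoch, lr0=1e-2, drop=0.1, every=25):
    return lr0 * drop ** (epoch // every)
```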

Model evaluation

The test phase was carried out following Krizhevsky et al. (2012). Namely, predictions are made over 10 patches extracted from each input image, and the softmax output vectors are then averaged to get the overall prediction. We used top-1 and top-5 accuracy rates (ImageNet) and average precision (multilabel tasks) (Everingham et al., 2015) as evaluation metrics.

Comparison with existing models. Our models were compared with a standard AlexNet that we trained according to the same procedure. In addition, we wanted to isolate the contribution of the wavelet packet modules to the global predictive power. To achieve this, we trained an AlexNet in which the first 11 × 11 convolutional layer was frozen to its initial parameters.
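The 10-crop scheme of Krizhevsky et al. (2012) can be sketched as follows; `model` is a stand-in for any callable returning a softmax vector, and the helper names are hypothetical:

```python
import numpy as np

# 10-crop evaluation: four corner crops plus the center crop, each with its
# horizontal flip; the ten softmax outputs are averaged into one prediction.
def ten_crops(img, size):
    H, W = img.shape[:2]
    tops  = [0, 0, H - size, H - size, (H - size) // 2]
    lefts = [0, W - size, 0, W - size, (W - size) // 2]
    crops = [img[t:t + size, l:l + size] for t, l in zip(tops, lefts)]
    return crops + [c[:, ::-1] for c in crops]   # add horizontal flips

def averaged_prediction(model, img, size=224):
    return np.mean([model(c) for c in ten_crops(img, size)], axis=0)

# Toy 6x6 "image" and a dummy model standing in for the trained network:
img = np.arange(36.0).reshape(6, 6)
crops = ten_crops(img, 4)
pred = averaged_prediction(lambda c: np.array([float(c.sum()), 1.0]), img, size=4)
```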

4.2. RESULTS AND DISCUSSION

The evolution of the loss function and validation error during training on ImageNet is shown in Figure 4. Similar graphs can be found in Appendix A.3 for the VOC and COCO datasets. Besides, the classification performance of our trained models is displayed in Table 1. For multilabel tasks, we define the average error by E = 1 − Π ∈ [0, 1], where Π denotes the average precision. The DT-CWPT models almost reach standard AlexNet's accuracy, with half as many parameters in the first layer. While the WPT and DT-RWPT models achieve lower performance, they still score much higher than the frozen version of AlexNet. This result suggests that the predictive power is partly attributable to the wavelet packet modules themselves, and not entirely to the following layers. Besides, the good results we obtained on multilabel classification tasks support the generalizability of our models. Looking at Figure 3, it is easy to explain why conventional WPT has the lowest accuracy of all our models. We identified two main reasons. (1) The filter design makes it impossible to extract oriented features that are neither horizontal nor vertical. Instead, it yields checkerboard patterns, which are useful to catch the remaining information and ensure perfect reconstruction, but fail to process geometric image features like ridges and edges, as pointed out by Selesnick et al. (2005). (2) There is no medium-frequency feature extractor like in AlexNet kernels (black-and-white patches side by side). Such patterns can however be found in dual-tree kernels: if we consider the real and imaginary parts of the complex low-pass filters separately, we actually get one low-pass and one oriented band-pass filter (see Figure 1). Regarding the DT-RWPT model, it performs better than WPT but does not reach the accuracy of DT-CWPT or standard AlexNet, despite generating oriented filters.
Intuitively, DT-CWPT is twice as redundant as DT-RWPT, and is thus more likely to extract relevant features for image classification. We propose here a more specific interpretation. In Section 2, we mentioned the shift invariance property of DT-CWPT, which neither WPT nor DT-RWPT possess. By applying a slight shift to an image, large disturbances can indeed be observed in the matrices of real coefficients (Selesnick et al., 2005). Important features extracted from one image can thus disappear from the other. On the other hand, the modulus of the complex wavelet packet coefficients is nearly shift invariant, meaning that their value is smoothly transferred toward the neighboring pixels in the shift direction. Therefore, any loss of information in their real part is recovered in their imaginary part. The DT-CWPT module is thus capable of extracting similar features from two shifted images, which could explain its superior performance. We tested the robustness of our models with respect to small shifts and obtained results that support our hypothesis; more details can be found in Appendix A.1. The source code used for our experiments will be published shortly after paper acceptance, together with notebooks for the sake of replicability.
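A toy illustration of this real-vs-modulus argument, under the simplifying assumption that an oriented band-pass response behaves locally like a complex exponential: a one-sample shift strongly perturbs the real part but leaves the modulus untouched.

```python
import numpy as np

# Model an oriented band-pass "wavelet" response as a complex exponential
# and compare the effect of a one-sample input shift on its real part
# (the DT-RWPT case) versus its modulus (the DT-CWPT case).
n = np.arange(64)
omega = np.pi / 3
coeff = np.exp(1j * omega * n)                 # reference response
coeff_shifted = np.exp(1j * omega * (n - 1))   # response to the shifted input

real_change = np.max(np.abs(coeff.real - coeff_shifted.real))
modulus_change = np.max(np.abs(np.abs(coeff) - np.abs(coeff_shifted)))
```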

5. CONCLUSION AND FUTURE WORK

After a period of frantic competition over classification performance, research is now increasingly focused on understanding the learning mechanisms at work in CNNs. In this perspective, we proposed an architecture in which standard convolutional layers are replaced by dual-tree wavelet packet feature extractors. Our experiments show that such networks can compete with conventional models, while providing a sparse description of their behavior: the DT-CWPT module contains half as many trainable parameters as standard AlexNet's first layer. Future research could further increase the sparsity of our models. (1) Some filters generally extract less information than others, in that the corresponding feature map's energy is much lower. By discarding these feature maps before computing the 1 × 1 convolutions, we could reduce the number of parameters by the same amount. More details about the energy distribution are given in Appendix A.2. (2) Feature map combinations may not be equally valuable. We could constrain the 1 × 1 convolutional layer by preventing some scales or orientations from wiring together, or by separating the real and imaginary coefficients. So far, we have trained our models with a predefined CMF, but other filters may provide better extraction properties due to higher frequency localization or a larger number of vanishing moments. One way to address this question could be to let the network learn the optimal filter, with a proper regularizer in the loss function. This will be addressed in future work. We tested our framework on the first layer of AlexNet, because introducing the wavelet packet transform into it is quite straightforward. Nevertheless, the phenomenon of oscillating patterns is not restricted to this particular model, nor is it specific to the sole task of image classification. However, extending our models to a wider range of architectures must be handled carefully to match the desired hyperparameters.
This will be tackled in future work, in which we will also benchmark our results with other wavelet CNN approaches.

A APPENDIX A.1 ROBUSTNESS OF OUR MODELS WITH RESPECT TO SMALL SHIFTS

To assess the robustness of our models with respect to small shifts, we compared the network outputs between a reference image and eight shifted versions along each axis, over our custom ImageNet test set (50,000 images). To do so, the Kullback-Leibler divergence is computed between each pair of softmax activation vectors (∈ R^1000) after forward propagation through the network. An illustration of the results can be found in Figure 5.

Figure 5: Shift-robustness of our models, compared to standard AlexNet. The Kullback-Leibler divergence is computed between output vectors and averaged over the whole dataset (50,000 images).

When the input image is shifted by 4 pixels, the output of the first layer is strictly shifted by one pixel. The first layer is therefore invariant to a 4-pixel shift. Consequently, any divergence between outputs should be due either to edge effects or to the action of deeper layers. Likewise, when the shift is equal to 8, the invariance applies to the first two layers. However, when the shift is not a multiple of 4, we observe larger discrepancies, which depend on the chosen model. The sensitivity to small shifts seems to increase with the network's predictive power. On the other hand, we observe higher discrepancies for WPT AlexNet and DT-RWPT AlexNet, compared to DT-CWPT AlexNet. This is in agreement with our hypothesis that the near-shift-invariance property of DT-CWPT is, to some extent, conserved across the whole network, and therefore brings a competitive advantage regarding predictive power. However, DT-CWPT AlexNet fails to reach the shift-robustness of standard AlexNet, suggesting that further improvements could be brought to our models (see Section 5).
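The shift-robustness measure used above is the standard Kullback-Leibler divergence between two softmax output vectors; a minimal sketch (`kl_divergence` is a hypothetical name, and the small `eps` guards against zero probabilities):

```python
import numpy as np

# Kullback-Leibler divergence between the softmax output for a reference
# image and the one for its shifted version: KL(p || q) = sum p * log(p / q).
def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
```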

A.2 ENERGY DISTRIBUTION OVER FEATURE MAPS

Figure 6 displays the mean energies of the 30 feature maps of high-frequency DT-CWPT coefficients, computed over our custom ImageNet test set. As we can see, the energy distribution is very unbalanced over the different filters.
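The energy referred to here is the squared L2 norm of each feature map, averaged over the dataset; a minimal sketch of how low-energy channels would be identified for the pruning discussed in Section 5 (helper names hypothetical, toy data in place of actual DT-CWPT coefficients):

```python
import numpy as np

# Mean energy (squared L2 norm) per channel of a feature map tensor,
# averaged over the batch; low-energy channels are pruning candidates.
def mean_channel_energy(batch):
    """batch: (P, C, H, W) -> mean energy per channel, shape (C,)."""
    return np.mean(np.sum(batch ** 2, axis=(2, 3)), axis=0)

rng = np.random.default_rng(2)
batch = rng.standard_normal((8, 4, 5, 5))
batch[:, 3] *= 0.01                  # make one channel nearly inactive
energies = mean_channel_energy(batch)
low = int(np.argmin(energies))       # index of the weakest channel
```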

A.3 TRAINING AND VALIDATION CURVES FOR VOC AND COCO DATASETS

The evolution of the loss function and validation error during training on the multilabel tasks is shown in Figure 7. The graphs share similarities with Figure 4. These experiments suggest that DT-CWPT AlexNet generalizes well to other recognition tasks. We can notice the erratic aspect of the validation curves during the first 25 epochs, which may be due to a poorly chosen initial learning rate. After this parameter is decreased, the validation errors become more stable.

A.4 IMPLEMENTATION OF THE DT-(RC)WPT MODULES

In this section, we show that the dual-tree transforms can be written as a succession of CNN-style convolution operators. We focus on DT-CWPT, but this can be easily adapted to DT-RWPT. The first step is to duplicate each input image four times (one copy for each filter bank). Given an input tensor X ∈ R^(P×3×224×224), this operation can be written as a CNN-style 1 × 1 convolution operator:

X_0 = C^(1)_{1,1}(V_dupl, 0) · X,  (13)

where X_0 ∈ R^(P×12×224×224) and V_dupl = [ I  I  I  I ] ∈ R^(3×12×1×1), with I ∈ R^(3×3×1×1) such that I[k, l, 0, 0] = 1 if k = l, and 0 otherwise. Then, each level of filter bank decomposition can be summarized into a single CNN-style convolution operator C^(q_j)_{s,d}(W_j, 0), with s = 2, d = 1 and q_j = K_j, where K_j = 3 · 4^(j+1) denotes the number of input channels. For any j ∈ {0, 1}, the weight tensor W_j ∈ R^(1×(4K_j)×µ×µ) has the following structure:

W_j[0] = [ W_{a,j}[0] ; W_{b,j}[0] ; W_{c,j}[0] ; W_{d,j}[0] ],

where W_{a,j}, W_{b,j}, W_{c,j} and W_{d,j} are built similarly to the WPT module, using the filter banks G^(l)_a, G^(l)_b, G^(l)_c and G^(l)_d, respectively. Denoting by D the output of this stage, we get

D = C^(48)_{2,1}(W_1, 0) · C^(12)_{2,1}(W_0, 0) · X_0.  (15)

Remark. Note that we have, for all samples p ∈ {0 .. P − 1}, D[p] = [ D_a[p] ; D_b[p] ; D_c[p] ; D_d[p] ].

Finally, expression (6) is computed as a CNN-style 1 × 1 convolution operator:

E_C = C^(1)_{1,1}(V_combine, 0) · D,  (17)

where V_combine ∈ R^(192×192×1×1) has the following block structure:

V_combine = [ I  O  O  −I ; O  I  I  O ; I  O  O  I ; O  −I  I  O ],  (18)

with:
• I ∈ R^(48×48×1×1) such that I[k, l, 0, 0] = 1 if k = l, and 0 otherwise;
• O ∈ R^(48×48×1×1) such that O[k, l, 0, 0] = 0 for all k, l ∈ {0 .. 47}.

Note that the two last dimensions of V_combine have size 1 × 1; the "convolution" kernels are thus reduced to singleton matrices. Therefore, expressions (13), (15) and (17) provide a description of DT-CWPT as a succession of CNN-style convolution operators.
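Since the kernels above are 1 × 1, each of these operators reduces to a matrix multiplication over the channel axis. The sketch below illustrates this for the duplication step of expression (13), with the trailing singleton kernel dimensions dropped for brevity (`conv1x1` is a hypothetical helper name):

```python
import numpy as np

# A 1x1 convolution is a matrix multiplication over the channel axis.
# Here, V_dupl copies each of the 3 input channels four times, one copy
# per filter bank, as in expression (13) (1x1 kernel dimensions dropped).
def conv1x1(X, V):
    """X: (P, K, H, W), V: (K, L) channel-mixing matrix -> (P, L, H, W)."""
    return np.einsum('pkhw,kl->plhw', X, V)

K = 3
V_dupl = np.tile(np.eye(K), (1, 4))      # (3, 12): four identity blocks side by side
X = np.random.default_rng(3).standard_normal((2, K, 4, 4))
X0 = conv1x1(X, V_dupl)                  # (2, 12, 4, 4): channel l is a copy of l % 3
```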

A.5 AN ALGORITHM TO COMPUTE THE RESULTING WEIGHT AND BIAS

Proposition 1 provides an iterative algorithm to compute the weight and bias resulting from a succession of CNN-style convolution operators. Special care must be paid to initialization: the number of groups in the first convolution operator must indeed be equal to 1. In order to meet this requirement, we introduce an identity operator C^(1)_{1,1}(I, 0), where I ∈ R^(K×K×1×1) is defined by I[k, l, 0, 0] = 1 if k = l, and 0 otherwise.

Proposition 2 (Identity operator). Let K, L ∈ N* and t, q ∈ N* be such that both K and L are divisible by q. For all V ∈ R^((K/q)×L×µ×ν) and all a ∈ R^L,

C^(q)_{t,1}(V, a) = C^(q)_{t,1}(V, a) · C^(1)_{1,1}(I, 0).  (19)

Proof. It can easily be proven that, for all X ∈ R^(P×K×N×N), C^(1)_{1,1}(I, 0) · X = X.

Therefore, Proposition 1 can be used on expression (19) in order to initialize the algorithm. The details are given in Algorithm 1. The next steps require the two following lemmas:

Lemma 1. For all 2D matrices $U$, $V$, and all $b \in \mathbb{R}$,
$$(b + U) * V = b \sum_{m,n} V[m, n] + (U * V).$$

Lemma 2. For all 2D matrices $U$, $V$, and all integers $s, t \in \mathbb{N}^*$,
$$\big((U \downarrow s) * V\big) \downarrow t = \big(U * (V \uparrow s)\big) \downarrow (st),$$
where, as a reminder, $(V \uparrow s)$ denotes the $s$-dilated matrix.



For any $U \in \mathbb{R}^{A \times B}$, $\overline{U}$ denotes the flipped matrix: $\overline{U}[m, n] = U[A - (m + 1),\, B - (n + 1)]$. The upsampling and downsampling operators are denoted by $\uparrow$ and $\downarrow$, respectively. For any $\alpha \in \mathbb{N}^*$, $(U \uparrow \alpha)[m, n] = U[m/\alpha,\, n/\alpha]$ if both $m$ and $n$ are divisible by $\alpha$ ($= 0$ otherwise), and $(U \downarrow \alpha)[m, n] = U[\alpha m,\, \alpha n]$. Finally, for any scalar $z \in \mathbb{R}$, we denote $z + U := zJ + U$, where $J \in \mathbb{R}^{A \times B}$ denotes the matrix of ones.
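The upsampling and downsampling operators above are straightforward to implement, and Lemma 2 can then be verified numerically on random matrices. This is a sketch with our own helper names (`conv2d_full`, `up`, `down`); convolution is implemented as full 2D convolution with zero padding, and the two sides of the identity are compared on their common support.

```python
import numpy as np

def conv2d_full(U, V):
    """Full 2D (true) convolution with zero padding: output size grows."""
    A0, A1 = U.shape
    B0, B1 = V.shape
    out = np.zeros((A0 + B0 - 1, A1 + B1 - 1))
    for i in range(B0):
        for j in range(B1):
            out[i:i + A0, j:j + A1] += V[i, j] * U
    return out

def down(U, a):
    """(U 'down' a)[m, n] = U[a*m, a*n]."""
    return U[::a, ::a]

def up(U, a):
    """(U 'up' a): the a-dilated matrix (zeros inserted between samples)."""
    out = np.zeros(((U.shape[0] - 1) * a + 1, (U.shape[1] - 1) * a + 1))
    out[::a, ::a] = U
    return out

rng = np.random.default_rng(0)
U = rng.standard_normal((16, 16))
V = rng.standard_normal((3, 3))
s, t = 2, 3

lhs = down(conv2d_full(down(U, s), V), t)    # ((U down s) * V) down t
rhs = down(conv2d_full(U, up(V, s)), s * t)  # (U * (V up s)) down (st)

# The two arrays may differ in size by zero-valued border entries only.
m0 = min(lhs.shape[0], rhs.shape[0])
m1 = min(lhs.shape[1], rhs.shape[1])
assert np.allclose(lhs[:m0, :m1], rhs[:m0, :m1])
```
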

Figure 1: (a) and (b): respectively, WPT and DT-CWPT filters for $j = 2$, computed with Q-shift orthogonal CMFs of length 10 (Kingsbury, 2003). The matrices have been cropped to $11 \times 11$. (b) displays 32 complex filters, alternately represented by their real and imaginary parts; 1 and 2 are the filters computing $E^{(k)}_j$ and $E^{(k)}_j$.

The first layer of AlexNet is a convolution operator $(W, b)$, with $W \in \mathbb{R}^{3 \times 64 \times 11 \times 11}$ and $b \in \mathbb{R}^{64}$. The kernels $W[0, k]$ after training on ImageNet are displayed in Figure 1. We can notice oriented oscillating patterns similar to wavelet packet filters.

Figure 2: (1) First layer of AlexNet; (2) WPT module; (3) DT-RWPT module; (4) DT-CWPT module. Only the green layers (Conv) have trainable parameters. The numbers between each pair of layers indicate the height, width and depth (number of channels) of the current feature map tensor.

Figure 3: From left to right: $\{W_{\mathrm{wpt}}[0, k]\}$, $\{W_{\mathrm{dt\text{-}rwpt}}[0, k]\}$ and $\{W_{\mathrm{dt\text{-}cwpt}}[0, k]\}$, for $k \in \{0 \,.\,.\, 63\}$, after training on ImageNet ILSVRC2012. All modules are implemented with a Q-shift filter ($\mu = 10$). Whereas the WPT module mainly extracts horizontal and vertical features, many more orientations arise from the dual-tree modules. We can also notice the low-pass filters, which appear as color blobs. The resemblance with the AlexNet kernels (see Figure 1) is prominent.

Figure 4: Evolution of the loss function (left) and top-1 validation error (right) during the first 65 training epochs. Validation was performed by simply resizing the smaller edge of each image to 224 pixels and extracting a single patch of size $224 \times 224$ at the center.

Figure 6: Mean energies of the 30 feature maps of high-frequency DT-CWPT coefficients, computed over our custom ImageNet test set.

Figure 7: Evolution of the loss function (left) and average error (right) over the VOC and COCO validation sets (top and bottom, respectively).

Algorithm 1 Composition of convolution operators

Require: $K = L_0$ {number of input channels}
Require: $\{(L_1, t_1, q_1, W_1, b_1), \ldots, (L_R, t_R, q_R, W_R, b_R)\}$ {list of output channels, strides, groups, weights and biases}
Ensure: $\forall r \in \{1 \,.\,.\, R\}$, both $L_{r-1}$ and $L_r$ are divisible by $q_r$
Ensure: $\forall r \in \{1 \,.\,.\, R\}$, $W_r \in \mathbb{R}^{(L_{r-1}/q_r) \times L_r \times \mu_r \times \nu_r}$ and $|b_r| = L_r$
  $W \leftarrow I \in \mathbb{R}^{K \times K \times 1 \times 1}$ {identity weight}
  $b \leftarrow 0$ {initial bias}
  $s \leftarrow 1$ {initial stride}
  for $r \in \{1 \,.\,.\, R\}$ do
    $W \leftarrow C^{(q_r)}_{1,\,s}(W_r,\, 0) \bullet W$ {resulting weight}
    $b \leftarrow b_r + \big(\langle P^{(q_r)}_{\lfloor l q_r / L_r \rfloor}\, b,\; S_l W_r \rangle\big)_{l \in \{0 \,.\,.\, L_r - 1\}}$ {resulting bias}
    $s \leftarrow s \times t_r$ {new stride}
  end for
  $W \leftarrow \overline{W}$ {flip the weight tensor along its two last dimensions}
  return $(s, W, b)$ {resulting stride, weight and bias}

A.6 PROOF OF PROPOSITION 1

Proof. Let $X \in \mathbb{R}^{P \times K \times N \times N}$ denote an input tensor. Let $Y \in \mathbb{R}^{P \times L \times M \times M}$ and $Y' \in \mathbb{R}^{P \times L' \times M' \times M'}$ denote the outputs of the convolution operators, such that
$$Y = C^{(1)}_{s,\,1}(W,\, b) \bullet X \quad \text{and} \quad Y' = C^{(q)}_{t,\,1}(V,\, a) \bullet Y. \tag{20}$$

Let $p \in \{0 \,.\,.\, P - 1\}$. By using (3) and (20), we have, for all $l' \in \{0 \,.\,.\, L' - 1\}$,
$$Y'[p, l'] = a[l'] + \sum_{l=0}^{L/q - 1} Y\Big[p,\; \Big\lfloor \frac{l' q}{L'} \Big\rfloor \frac{L}{q} + l\Big] * V[l, l'] \downarrow t, \tag{21}$$
and, for all $l \in \{0 \,.\,.\, L - 1\}$,
$$Y[p, l] = b[l] + \sum_{k=0}^{K - 1} \big(X[p, k] * W[k, l]\big) \downarrow s. \tag{22}$$

Then, by plugging (22) into (21) and by using Lemma 1, we get
$$Y'[p, l'] = b'[l'] + \sum_{l=0}^{L/q - 1} \sum_{k=0}^{K - 1} \Big( \big( X[p, k] * W\big[k,\, \lfloor l' q / L' \rfloor \tfrac{L}{q} + l\big] \big) \downarrow s * V[l, l'] \Big) \downarrow t,$$
where
$$b'[l'] = a[l'] + \sum_{l=0}^{L/q - 1} b\Big[ \Big\lfloor \frac{l' q}{L'} \Big\rfloor \frac{L}{q} + l \Big] \sum_{m,n} V[l, l', m, n]. \tag{25}$$
Finally, by reversing the two sums and using Lemma 2, we get
$$Y'[p, l'] = b'[l'] + \sum_{k=0}^{K - 1} X[p, k] * W'[k, l'] \downarrow (st),$$
which corresponds, through the definition of a CNN-style convolution operator (3), to the definition of $W'[k, l']$ in Proposition 1. The expression of $b'[l']$ defined in (25) can be rewritten in a more concise way:
$$b'[l'] = a[l'] + \big\langle P^{(q)}_{\lfloor l' q / L' \rfloor}\, b,\; S_{l'} V \big\rangle.$$
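In the simplest setting (one channel, stride 1, no groups), the composition rule proved above reduces to two facts: the composed kernel is the convolution of the individual kernels, and the first layer's bias folds into the new bias exactly as in Lemma 1. A minimal 1D NumPy sketch of this special case (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x  = rng.standard_normal(32)          # 1D input "image"
w1 = rng.standard_normal(5); b1 = 0.7   # first layer: kernel + scalar bias
w2 = rng.standard_normal(3); b2 = -0.2  # second layer

# Two successive bias-augmented convolutions ('valid' boundary mode).
y = np.convolve(np.convolve(x, w1, 'valid') + b1, w2, 'valid') + b2

# Single equivalent convolution: kernel = w1 * w2 (full convolution),
# bias folded as in Lemma 1: b' = b2 + b1 * sum(w2).
w_c = np.convolve(w1, w2, 'full')
b_c = b2 + b1 * w2.sum()
y_c = np.convolve(x, w_c, 'valid') + b_c

assert np.allclose(y, y_c)
```

The multi-channel, grouped, strided case handled by Algorithm 1 follows the same pattern, with the kernel composition performed by a dilated convolution operator and the bias folding by the $\langle P^{(q)}_\gamma b, S_l W \rangle$ inner products.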

Figure 8: Left: original image from ImageNet ILSVRC2012 ($j = 0$). Middle: WPT with $j = 1$. Right: WPT with $j = 2$. At each step $j$, the feature maps of wavelet packet coefficients $D^{(k)}_j$ are further decomposed.

Figure 9: Modulus of the complex DT-CWPT coefficients, computed with $j = 2$: $E^{(k)}_j$ (left) and

A WPT module has $|W_{\mathrm{mix}}| + |b_{\mathrm{mix}}| = 3{,}136$ trainable parameters, versus $23{,}296$ for the first convolutional layer in a standard AlexNet.
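These counts can be reproduced directly: the AlexNet figure is $3 \cdot 64 \cdot 11 \cdot 11 + 64$, while the WPT module only trains a $1 \times 1$ mixing layer mapping the 48 wavelet packet feature maps to 64 output channels (channel counts taken from the text above):

```python
# AlexNet first conv layer: 64 kernels of size 3 x 11 x 11, plus 64 biases.
alexnet_params = 3 * 64 * 11 * 11 + 64
assert alexnet_params == 23296

# WPT module: only the 1x1 mixing layer is trained, mapping 48 wavelet
# packet channels to 64 output channels, plus 64 biases.
wpt_mix_params = 64 * 48 * 1 * 1 + 64
assert wpt_mix_params == 3136
```
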

Error rates on our custom validation and test sets.


• $P^{(q)}_\gamma b$, for $\gamma \in \{0 \,.\,.\, q - 1\}$, denotes a partition of $b$ into $q$ even-size slices: $P^{(q)}_\gamma b = b\big[\gamma \tfrac{K}{q} : (\gamma + 1)\tfrac{K}{q} - 1\big]$;
• $S_l W \in \mathbb{R}^{K/q}$ denotes the vector such that, for any $k \in \{0 \,.\,.\, K/q - 1\}$, $S_l W[k] = \sum_{m,n} W[k, l, m, n]$.
Then we get $f(V, 0) = 0$, as stated in Proposition 1.

Remark. Operators $P^{(q)}_\gamma$ and $S_l$ have the advantage of being efficiently computed with libraries such as PyTorch or even NumPy.
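As the remark suggests, both operators are one-liners in NumPy. A sketch under our own naming (`P` for slicing, `S` for spatial summation), assuming the paper's weight layout $(K/q) \times L \times \mu \times \nu$:

```python
import numpy as np

# P^(q)_gamma b: the gamma-th of q equal-size slices of the bias vector b.
def P(q, gamma, b):
    K = b.shape[0]
    return b[gamma * K // q : (gamma + 1) * K // q]

# S_l W: per-input-channel sum of kernel l over its spatial support,
# for a weight tensor W of shape (K/q, L, mu, nu).
def S(l, W):
    return W[:, l].sum(axis=(-2, -1))

b = np.arange(8.0)         # K = 8
assert np.allclose(P(4, 1, b), [2.0, 3.0])

W = np.ones((2, 3, 5, 5))  # K/q = 2, L = 3, 5x5 kernels
assert np.allclose(S(0, W), [25.0, 25.0])
```

The bias update of Algorithm 1 then reads `b_new = b_r + np.array([P(q, l * q // L, b) @ S(l, W_r) for l in range(L)])` for `L = L_r` output channels.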

A.7 ILLUSTRATION OF WPT AND DT-CWPT

Figures 8 and 9 illustrate WPT and DT-CWPT, respectively. We chose an image from our custom ImageNet test set (one channel only), resized to 224 × 224 pixels. We can notice that specific orientations tend to be selected by specific filters, especially in the dual-tree transform. Moreover, some filters seem to extract features of higher energy than others.

