DUAL-TREE WAVELET PACKET CNNS FOR IMAGE CLASSIFICATION

Abstract

In this paper, we address an important issue of deep convolutional neural networks (CNNs): the lack of a mathematical understanding of their properties. We present an explicit formalism, motivated by the similarities between trained CNN kernels and oriented Gabor filters, to address this problem. The core idea is to constrain the behavior of convolutional layers by splitting them into a succession of wavelet packet decompositions, which are modulated by freely-trained mixture weights. We evaluate our approach on image classification with the AlexNet architecture as an example, using three variants of wavelet decomposition. The first variant relies on the separable wavelet packet transform, while the other two implement the 2D dual-tree real and complex wavelet packet transforms, taking advantage of their feature extraction properties such as directional selectivity and shift invariance. Our experiments show that we match the accuracy of standard AlexNet with a significantly lower number of parameters, while gaining an interpretation of the network that is grounded in mathematical theory.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) have dramatically improved state-of-the-art performance in many domains such as speech recognition, visual object recognition and object detection (LeCun et al., 2015). However, they are very resource-intensive, and a full mathematical understanding of their properties remains a challenging issue. On the other hand, in the field of signal processing, wavelet and multiresolution analysis is built upon a well-established mathematical framework. It has proven to be very efficient in tasks such as signal compression and denoising (Mallat, 2009). Moreover, wavelet filters have been widely used as feature extractors for signal, image and texture classification (Laine & Fan, 1993; Pittner & Kamarthi, 1999; Yen, 2000; Huang & Aviyente, 2008). While both fields rely on filters to achieve their goals, the two approaches are radically different. In wavelet analysis, filters are specifically designed to meet very restrictive conditions, whereas CNNs use freely-trained filters, without any prior assumption on their behavior. Nevertheless, in many computer vision tasks, CNNs tend to learn first-layer parameters that are remarkably similar to oriented Gabor filters (Boureau et al., 2010; Yosinski et al., 2014). This phenomenon suggests that early layers extract general features such as edges or basic shapes, which are independent of the task at hand.

Proposed approach

In order to improve our understanding of CNNs, we propose to constrain their behavior by replacing freely-trained filters with a series of discrete wavelet packet decompositions modulated by mixture weights. We thereby introduce prior assumptions to guide learning and reduce the number of trainable parameters in convolution layers, while retaining predictive power. The main goal of our work is to describe and interpret the observed behavior of CNNs with a sparse model, taking advantage of the feature extraction properties of wavelet packet transforms. By increasing control over the network, we pave the way for future applications in which theoretical guarantees are critical. In this paper we describe our wavelet packet CNN architectures with a mathematical formulation and introduce an algorithm to visualize the resulting filters. As a proof of concept, we based our experiments on AlexNet (Krizhevsky et al., 2012). Our choice was driven by the large kernels in its first layer, whose convolutions are performed with a downsampling factor of 4. This allows us to perform two levels of wavelet decomposition without any additional transformation, and facilitates visual comparison with our own custom filters. Note, however, that most CNNs trained on natural image datasets exhibit the same oscillating patterns. We therefore believe that our work could be extended to other architectures with a few adaptations.
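To make the core idea concrete, the following minimal NumPy sketch (our own illustration under assumed choices, not the paper's implementation) applies one level of a separable wavelet packet decomposition using the Haar conjugate mirror filter, then combines the resulting subbands with trainable mixture weights:

```python
import numpy as np

# Haar conjugate mirror filter (an assumed choice for illustration; the paper
# considers other CMFs as well).
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass

def wpt_level(X):
    """One separable wavelet packet level: 4 subbands, each downsampled by 2.

    For a 2-tap filter with stride 2 this reduces to a correlation over
    non-overlapping 2x2 blocks.
    """
    subbands = []
    for fr in (h, g):          # filter along rows
        for fc in (h, g):      # filter along columns
            K = np.outer(fr, fc)  # 2x2 separable kernel
            A, B = X.shape
            Y = np.zeros((A // 2, B // 2))
            for m in range(A // 2):
                for n in range(B // 2):
                    Y[m, n] = np.sum(X[2*m:2*m+2, 2*n:2*n+2] * K)
            subbands.append(Y)
    return np.stack(subbands)  # shape (4, A//2, B//2)

def mix(subbands, W):
    """Mixture weights: each output map is a trainable linear combination
    of the fixed wavelet packet subbands (W has shape (n_out, 4))."""
    return np.tensordot(W, subbands, axes=([1], [0]))
```

In this toy setup only the entries of `W` would be learned; the wavelet filters themselves stay fixed, which is the source of the parameter reduction discussed above.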

Related work

In a similar spirit, a few attempts to combine the two research fields have been made in recent years. Wavelet scattering networks (Bruna & Mallat, 2013) compute CNN-like cascading wavelet convolutions to obtain translation-invariant image representations that are stable to deformation and preserve high-frequency information. They were later adapted to the discrete case using complex oriented wavelet frames (Singh & Kingsbury, 2017). While these networks are designed from scratch and are fully deterministic, other approaches enhance existing networks with wavelet filter preprocessing or embedding. The goal is either to improve classification performance without increasing network complexity (Chang & Morgan, 2014; Williams & Li, 2016; Fujieda et al., 2017; Williams & Li, 2018; Lu et al., 2018; Luan et al., 2018), or to replace freely-trained layers with more constrained structures implementing spectral filtering. Such models include Gabor filters in parallel to regular trainable weight kernels (Sarwar et al., 2017), wavelet scattering coefficients as the input of a CNN (Oyallon et al., 2018), or linear combinations of discrete cosine transforms (Ulicny et al., 2019). Our approach falls into this second category, although our design is based upon a different CNN architecture, i.e., AlexNet. To our knowledge, we are the first to introduce the dual-tree wavelet packet transform (DT-CWPT) (Bayram & Selesnick, 2008) in such a context. Like the filters used in the above papers, wavelet packet transforms are well-localized in the frequency domain and share a subsampling factor over the output feature maps. A major advantage of our approach is sparsity: a single vector (called a conjugate mirror filter, or CMF) is sufficient to characterize the whole process.
Moreover, like Gabor filters, DT-CWPT extracts oriented and shift-invariant features, but achieves this goal with minimal redundancy, while providing an efficient decomposition algorithm based on separable filter banks. The discrete cosine transform has a similar computational complexity but lacks the orientation properties of DT-CWPT. Our models therefore provide a sparser description of the observed behavior of convolutional layers. This is a step toward a more complete description of CNNs that relies on only a small number of arbitrary parameters.

2. BACKGROUND

Notations. In this paper, $d$-dimensional tensors are written with straight bold capital letters: $\mathbf{Z} \in \mathbb{R}^{A_1 \times \cdots \times A_d}$, where $A_i$ denotes the size of $\mathbf{Z}$ along its $i$-th dimension; the shape of $\mathbf{Z}$ is denoted $A_1 \times \cdots \times A_d$. 2D matrices are written in italics: $U \in \mathbb{R}^{A \times B}$, and 1D vectors in bold lower-case letters: $\mathbf{z} \in \mathbb{R}^{A}$. For the sake of legibility, indices are written between square brackets.

The convolution between two matrices $U \in \mathbb{R}^{A \times B}$ and $V \in \mathbb{R}^{A' \times B'}$ is defined, for all $m \in \{0 \,..\, A + A' - 2\}$ and $n \in \{0 \,..\, B + B' - 2\}$, by
$$(U * V)[m, n] = \sum_{i, j} U[m - i,\, n - j] \cdot V[i, j].$$
Since some indices are negative or larger than the matrix size, $U$ and $V$ must be extended beyond their boundaries, either by setting all outside values to zero, or by using a periodic or symmetric pattern. The practical implications of this choice will not be discussed in this paper.

Discrete wavelet packet transform (WPT). This is a brief overview of the WPT algorithm (Mallat, 2009), written as a sequence of matrix convolutions. An illustration of the transform is given in Appendix A.7.
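A direct, unoptimized sketch of this full convolution with zero padding outside the matrix boundaries (our own illustration; function names are not from the paper):

```python
import numpy as np

def conv2d_full(U, V):
    """Full 2D convolution: (U * V)[m, n] = sum_{i, j} U[m-i, n-j] * V[i, j],
    with U extended by zeros beyond its boundaries."""
    A, B = U.shape
    Ap, Bp = V.shape
    out = np.zeros((A + Ap - 1, B + Bp - 1))
    for m in range(A + Ap - 1):
        for n in range(B + Bp - 1):
            s = 0.0
            # V is zero outside its support, so i, j range over V's indices.
            for i in range(Ap):
                for j in range(Bp):
                    if 0 <= m - i < A and 0 <= n - j < B:
                        s += U[m - i, n - j] * V[i, j]
            out[m, n] = s
    return out
```

Note that the output has shape $(A + A' - 1) \times (B + B' - 1)$, matching the index ranges in the definition above.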



For any $U \in \mathbb{R}^{A \times B}$, $\overline{U}$ denotes the flipped matrix: $\overline{U}[m, n] = U[A - (m + 1),\, B - (n + 1)]$. The upsampling and downsampling operators are respectively denoted $\uparrow$ and $\downarrow$. For any $\alpha \in \mathbb{N}^*$, $(U \uparrow \alpha)[m, n] = U[m / \alpha,\, n / \alpha]$ if both $m$ and $n$ are divisible by $\alpha$ (and $0$ otherwise), and $(U \downarrow \alpha)[m, n] = U[\alpha m,\, \alpha n]$. Finally, for any scalar $z \in \mathbb{R}$, we denote $z + U = zJ + U$, where $J \in \mathbb{R}^{A \times B}$ denotes the matrix of ones.
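These three operators translate directly into NumPy slicing (a sketch for illustration; function names are our own):

```python
import numpy as np

def flip(U):
    # Flipped matrix: U_bar[m, n] = U[A - (m + 1), B - (n + 1)]
    return U[::-1, ::-1]

def upsample(U, alpha):
    # (U up alpha)[m, n] = U[m/alpha, n/alpha] when alpha divides both
    # m and n, and 0 otherwise
    A, B = U.shape
    out = np.zeros((A * alpha, B * alpha), dtype=U.dtype)
    out[::alpha, ::alpha] = U
    return out

def downsample(U, alpha):
    # (U down alpha)[m, n] = U[alpha * m, alpha * n]
    return U[::alpha, ::alpha]
```

Downsampling after upsampling by the same factor recovers the original matrix, while the reverse composition zeroes out the discarded entries.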

