DUAL-TREE WAVELET PACKET CNNS FOR IMAGE CLASSIFICATION

Abstract

In this paper, we target an important issue of deep convolutional neural networks (CNNs): the lack of a mathematical understanding of their properties. To address this problem, we present an explicit formalism motivated by the similarity between trained CNN kernels and oriented Gabor filters. The core idea is to constrain the behavior of convolutional layers by splitting them into a succession of wavelet packet decompositions, which are modulated by freely-trained mixture weights. We evaluate our approach with three variants of wavelet decompositions, using the AlexNet architecture for image classification as an example. The first variant relies on the separable wavelet packet transform, while the other two implement the 2D dual-tree real and complex wavelet packet transforms, taking advantage of their feature extraction properties such as directional selectivity and shift invariance. Our experiments show that we match the accuracy of standard AlexNet with a significantly lower number of parameters, while obtaining an interpretation of the network that is grounded in mathematical theory.

1. INTRODUCTION

Deep convolutional neural networks (CNNs) have dramatically improved state-of-the-art performance in many domains such as speech recognition, visual object recognition and object detection (LeCun et al., 2015). However, they are very resource-intensive, and a full mathematical understanding of their properties remains a challenging issue. On the other hand, in the field of signal processing, wavelet and multi-resolution analysis are built upon a well-established mathematical framework. They have proven to be very efficient in tasks such as signal compression and denoising (Mallat, 2009). Moreover, wavelet filters have been widely used as feature extractors for signal, image and texture classification (Laine & Fan, 1993; Pittner & Kamarthi, 1999; Yen, 2000; Huang & Aviyente, 2008). While both fields rely on filters to achieve their goals, the two approaches are radically different. In wavelet analysis, filters are specifically designed to meet very restrictive conditions, whereas CNNs use freely-trained filters, without any prior assumption on their behavior. Nevertheless, in many computer vision tasks, CNNs tend to learn first-layer parameters that closely resemble oriented Gabor filters (Boureau et al., 2010; Yosinski et al., 2014). This phenomenon suggests that early layers extract general features such as edges or basic shapes, which are independent of the task at hand.

Proposed approach

In order to improve our understanding of CNNs, we propose to constrain their behavior by replacing freely-trained filters with a series of discrete wavelet packet decompositions modulated by mixture weights. We thereby introduce prior assumptions that guide learning and reduce the number of trainable parameters in convolution layers, while retaining predictive power. The main goal of our work is to describe and interpret the observed behavior of CNNs with a sparse model, taking advantage of the feature extraction properties of wavelet packet transforms. By increasing control over the network, we pave the way for future applications in which theoretical guarantees are critical.
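The core mechanism can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the Haar filters, the single decomposition level, and the function names (`wavelet_packet_level`, `mixture_layer`) are illustrative assumptions. It shows the division of labor described above: the wavelet filters are fixed by design, and only the mixture weights combining the subbands would be trained.

```python
import numpy as np

# Haar analysis filters (fixed by design, not learned).
LO = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
HI = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

def filt_down(x, f, axis):
    """Apply a 2-tap filter along one axis, then downsample by 2."""
    x = np.moveaxis(x, axis, 0)
    y = f[0] * x[0::2] + f[1] * x[1::2]
    return np.moveaxis(y, 0, axis)

def wavelet_packet_level(img):
    """One level of the separable 2D wavelet packet transform.

    Returns the four subbands [LL, LH, HL, HH]: rows are filtered
    first (axis 0), then columns (axis 1), with every low/high
    combination kept, as a packet decomposition requires.
    """
    rows = [filt_down(img, f, axis=0) for f in (LO, HI)]
    return [filt_down(r, f, axis=1) for r in rows for f in (LO, HI)]

def mixture_layer(img, weights):
    """Combine the fixed wavelet subbands with mixture weights.

    `weights` plays the role of the freely-trained parameters: the
    filters themselves stay fixed, so only len(weights) scalars per
    layer would be learned (hypothetical simplification; the paper's
    layers modulate deeper decompositions).
    """
    bands = wavelet_packet_level(img)
    return sum(w * b for w, b in zip(weights, bands))
```

On a constant image, only the LL subband is non-zero (the high-pass Haar filter annihilates constants), which makes the feature-extraction behavior of the fixed filters easy to verify by hand.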

