LEAF: A LEARNABLE FRONTEND FOR AUDIO CLASSIFICATION

Abstract

Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to the present day. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.

1. INTRODUCTION

Learning representations by backpropagation in deep neural networks has become the standard in audio understanding, ranging from automatic speech recognition (ASR) (Hinton et al., 2012; Senior et al., 2015) to music information retrieval (Arcas et al., 2017), as well as animal vocalizations (Lostanlen et al., 2018) and audio events (Hershey et al., 2017; Kong et al., 2019). Still, a striking constant throughout the history of audio classification is the mel-filterbank, a fixed, hand-engineered representation of sound. Mel-filterbanks first compute a spectrogram, using the squared modulus of the short-term Fourier transform (STFT). Then, the spectrogram is passed through a bank of triangular bandpass filters, spaced on a logarithmic scale (the mel-scale) to replicate the non-linear human perception of pitch (Stevens & Volkmann, 1940). Finally, the resulting coefficients are passed through a logarithmic compression, to replicate our non-linear sensitivity to loudness (Fechner et al., 1966). This approach of drawing inspiration from the human auditory system to design features for machine learning has been historically successful (Davis & Mermelstein, 1980; Mogran et al., 2004). Moreover, decades after the design of mel-filterbanks, Andén & Mallat (2014) showed that they coincidentally exhibit desirable mathematical properties for representation learning, in particular shift-invariance and stability to small deformations. Hence, both from an auditory and a machine learning perspective, mel-filterbanks represent strong audio features. However, the design of mel-filterbanks is also flawed by biases. First, not only has the mel-scale been revised multiple times (O'Shaughnessy, 1987; Umesh et al., 1999), but the auditory experiments that led to its original design could not be replicated afterwards (Greenwood, 1997).
Similarly, better alternatives to log-compression have been proposed, such as the cubic root for speech enhancement (Lyons & Paliwal, 2008) or the 10th root for ASR (Schluter et al., 2007). Moreover, even though matching human perception provides good inductive biases for some application domains, e.g., ASR or music understanding, these biases may also be detrimental, e.g., for tasks that require fine-grained resolution at high frequencies. Finally, the recent history of other fields like computer vision, in which the rise of deep learning methods has allowed learning representations from raw pixels rather than from engineered features (Krizhevsky et al., 2012), inspired us to take the same path. In this work, we argue that a credible alternative to mel-filterbanks for classification should be evaluated across many tasks, and propose the first extensive study of learnable frontends for audio over a wide and diverse range of audio signals, including speech, music, audio events, and animal sounds. By breaking down mel-filterbanks into three components (filtering, pooling, compression/normalization), we propose LEAF, a novel frontend that is fully learnable in all its operations, while being controlled by just a few hundred parameters. In a multi-task setting over eight datasets, we show that we can learn a single set of parameters that outperforms mel-filterbanks, as well as previously proposed learnable alternatives. Moreover, these findings are replicated when training a different model for each individual task. We also confirm these results on a challenging, large-scale benchmark: classification on Audioset (Gemmeke et al., 2017). In addition, we show that the inductive biases of our frontend (i.e., learning bandpass filters, lowpass filtering before downsampling, and learning a per-channel compression) are general enough to benefit other systems, and propose a new, improved version of SincNet (Ravanelli & Bengio, 2018). Our code is publicly available.
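The fixed pipeline described above (spectrogram via the squared modulus of the STFT, triangular bandpass filters spaced on the mel scale, then compression) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `mel_filterbank_features`, the parameter defaults, and the small stabilizing epsilon are illustrative choices, and the `compression` options mirror the alternatives mentioned in the text (log, cubic root, 10th root).

```python
import numpy as np

def hz_to_mel(f):
    # A common formulation of the mel scale (one of several revisions).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank_features(waveform, sr=16000, n_fft=512, hop=160,
                            n_mels=40, compression="log"):
    """|STFT|^2 -> triangular mel filters -> pointwise compression."""
    # 1) Spectrogram: squared modulus of the short-term Fourier transform.
    window = np.hanning(n_fft)
    frames = np.stack([waveform[i:i + n_fft] * window
                       for i in range(0, len(waveform) - n_fft + 1, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2+1)

    # 2) Triangular bandpass filters, equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    energies = spec @ fbank.T                         # (n_frames, n_mels)

    # 3) Compression: the classic log, or the root alternatives cited above.
    if compression == "log":
        return np.log(energies + 1e-6)
    if compression == "cubic_root":   # cf. Lyons & Paliwal (2008)
        return energies ** (1.0 / 3.0)
    if compression == "tenth_root":   # cf. Schluter et al. (2007)
        return energies ** (1.0 / 10.0)
    raise ValueError(f"unknown compression: {compression}")
```

LEAF's premise is that each of these three stages, here fixed by hand, can instead be parameterized and learned by backpropagation.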

2. RELATED WORK

In the last decade, several works have addressed the problem of learning the audio frontend as an alternative to mel-filterbanks. The first notable contributions in this field emerged for ASR, with Jaitly & Hinton (2011) pretraining Restricted Boltzmann Machines from the waveform, and Palaz et al. (2013) training a hybrid DNN-HMM model, replacing mel-filterbanks with several layers of convolution. However, these alternatives, as well as others proposed more recently (Tjandra et al., 2017; Schneider et al., 2019), are composed of many layers, which makes a fair comparison with mel-filterbanks difficult. In the following section, we focus on frontends that provide a lightweight, drop-in replacement for mel-filterbanks, with comparable capacity.

2.1. LEARNING FILTERS FROM WAVEFORMS

A first attempt at learning the filters of mel-filterbanks was proposed by Sainath et al. (2013), where a filterbank is initialized using the mel-scale and then learned together with the rest of the network, taking a spectrogram as input. Instead, Sainath et al. (2015) and Hoshen et al. (2015) later



https://github.com/google-research/leaf-audio



Figure 1: Breakdown of the computation of mel-filterbanks, Time-Domain filterbanks, SincNet, and the proposed LEAF frontend. Orange boxes are fixed, while computations in blue boxes are learnable. Grey boxes represent activation functions.

