LEAF: A LEARNABLE FRONTEND FOR AUDIO CLASSIFICATION

Abstract

Mel-filterbanks are fixed, engineered audio features that emulate human perception and have been used throughout the history of audio understanding, up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.

1. INTRODUCTION

Learning representations by backpropagation in deep neural networks has become the standard in audio understanding, ranging from automatic speech recognition (ASR) (Hinton et al., 2012; Senior et al., 2015) to music information retrieval (Arcas et al., 2017), as well as animal vocalizations (Lostanlen et al., 2018) and audio events (Hershey et al., 2017; Kong et al., 2019). Still, a striking constant throughout the history of audio classification is the mel-filterbanks, a fixed, hand-engineered representation of sound. Mel-filterbanks first compute a spectrogram, using the squared modulus of the short-term Fourier transform (STFT). Then, the spectrogram is passed through a bank of triangular bandpass filters, spaced on a logarithmic scale (the mel-scale) to replicate the non-linear human perception of pitch (Stevens & Volkmann, 1940). Finally, the resulting coefficients are passed through a logarithmic compression, to replicate our non-linear sensitivity to loudness (Fechner et al., 1966). This approach of drawing inspiration from the human auditory system to design features for machine learning has been historically successful (Davis & Mermelstein, 1980; Mogran et al., 2004). Moreover, decades after the design of mel-filterbanks, Andén & Mallat (2014) showed that they coincidentally exhibit desirable mathematical properties for representation learning, in particular shift-invariance and stability to small deformations. Hence, both from an auditory and a machine learning perspective, mel-filterbanks represent strong audio features. However, the design of mel-filterbanks is also flawed by biases. First, not only has the mel-scale been revised multiple times (O'Shaughnessy, 1987; Umesh et al., 1999), but the auditory experiments that led to its original design could not be replicated afterwards (Greenwood, 1997).
Similarly, better alternatives to log-compression have been proposed, such as the cubic root for speech enhancement (Lyons & Paliwal, 2008) or the 10th root for ASR (Schluter et al., 2007). Moreover, even though matching human perception provides good inductive biases for some application domains, e.g., ASR or music understanding, these biases may also be detrimental, e.g., for tasks that require fine-grained resolution at high frequencies. Finally, the recent history of other fields like computer vision, in which the rise of deep learning methods has allowed learning representations from raw pixels rather than from engineered features (Krizhevsky et al., 2012), inspired us to take the same path.
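For concreteness, the fixed mel-filterbank pipeline described above (spectrogram via the squared modulus of the STFT, triangular bandpass filters spaced on the mel scale, then log compression) can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the classic recipe, not code from this work; the O'Shaughnessy (1987) mel formula, Hann window, and the window size, hop, and filter count are arbitrary example choices.

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale as commonly formulated (O'Shaughnessy, 1987)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, fmin=0.0, fmax=None):
    """Triangular bandpass filters with centers evenly spaced on the mel scale."""
    fmax = fmax if fmax is not None else sample_rate / 2.0
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising edge of the triangle
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge of the triangle
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(signal, sample_rate=16000, n_fft=512,
                        hop=160, n_filters=40):
    # 1. Spectrogram: squared modulus of the STFT (Hann-windowed frames)
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # 2. Bank of mel-scaled triangular filters
    mel_spec = spec @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 3. Log compression (small offset avoids log(0))
    return np.log(mel_spec + 1e-6)
```

On one second of 16 kHz audio with these example settings, this yields a matrix of shape (frames, 40). Every step here is a fixed design choice; the point of a learnable frontend is to replace the filter shapes, the pooling and the compression with parameters trained by backpropagation.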

