LEARNING THE SPECTROGRAM TEMPORAL RESOLUTION FOR AUDIO CLASSIFICATION

Abstract

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. The temporal resolution of a spectrogram depends on the hop size. Previous works generally assume the hop size should be a constant value, such as ten milliseconds. However, a fixed hop size or resolution is not always optimal for different types of sound. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution learning to improve the performance of audio classification models. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier, and can be jointly optimized with the classification task. We evaluate DiffRes on the mel-spectrogram, followed by state-of-the-art classifier backbones, and apply it to five different subtasks. Compared with using the fixed-resolution mel-spectrogram, the DiffRes-based method can achieve the same or better classification accuracy with at least 25% fewer temporal dimensions at the feature level, which also reduces the computational cost. Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy with the same computational complexity.
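As a rough intuition for the frame-merging idea described above, the toy sketch below scores each spectrogram frame by its spectral change from the previous frame and averages low-change frames into the next kept frame. This is a hand-written, non-differentiable caricature for illustration only: the function name `merge_frames`, the `keep_ratio` parameter, and the change-based importance score are all hypothetical choices of ours, not the DiffRes algorithm, whose merging is learned jointly with the classifier.

```python
import numpy as np

def merge_frames(spec, keep_ratio=0.75):
    """Hypothetical toy frame merging (NOT the DiffRes method).

    `spec` has shape (time, frequency). Frames whose spectrum changes
    little from the previous frame are averaged into the next kept
    frame, shrinking the temporal dimension to about `keep_ratio`.
    """
    t = spec.shape[0]
    # Per-frame "importance": spectral change from the previous frame.
    # The first frame gets infinite importance so it is never dropped.
    change = np.r_[np.inf, np.linalg.norm(np.diff(spec, axis=0), axis=1)]
    n_drop = t - int(np.ceil(t * keep_ratio))
    dropped = set(np.argsort(change)[:n_drop].tolist())
    merged, group = [], []
    for i in range(t):
        group.append(spec[i])
        if i not in dropped:          # a kept frame closes the group
            merged.append(np.mean(group, axis=0))
            group = []
    if group:                         # trailing dropped frames
        merged.append(np.mean(group, axis=0))
    return np.stack(merged)
```

With `keep_ratio=0.75`, a 100-frame spectrogram shrinks to roughly 75 frames, mirroring the "at least 25% fewer temporal dimensions" figure from the abstract.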

1. INTRODUCTION

Audio classification refers to a series of tasks that assign labels to an audio clip. These tasks include audio tagging (Kong et al., 2020), speech keyword classification (Kim et al., 2021), and music genre classification (Castellon et al., 2021). The input to an audio classification system is usually a one-dimensional audio waveform, which can be represented by discrete samples. Although there are methods using time-domain samples as features (Kong et al., 2020; Luo & Mesgarani, 2018; Lee et al., 2017), the majority of studies on audio classification convert the waveform into a spectrogram as the input feature (Gong et al., 2021b;a). A spectrogram is usually calculated with the Fourier transform (Champeney & Champeney, 1987), applied to short waveform chunks multiplied by a windowing function, resulting in a two-dimensional time-frequency representation. According to Gabor's uncertainty principle (Gabor, 1946), there is always a trade-off between time and frequency resolution. To achieve the desired resolution on the temporal dimension, it is common practice (Kong et al., 2021a; Liu et al., 2022a) to apply a fixed hop size between windows to capture the dynamics between adjacent frames. With a fixed hop size, the spectrogram has a fixed temporal resolution, which we will refer to simply as resolution. Using a fixed resolution is not necessarily optimal for an audio classification model. Intuitively, the resolution should depend on the temporal pattern: fast-changing signals call for high resolution, while relatively steady or silent signals may not need the same high resolution for the best accuracy (Huzaifah, 2017). For example, Figure 1 shows that increasing the resolution reveals more details in the spectrogram of Alarm Clock, while the pattern of Siren stays mostly the same. This indicates that the finer details in the high-resolution Siren may not substantially contribute to classification accuracy.
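To make the relationship between hop size and temporal resolution concrete, the sketch below computes a magnitude spectrogram with a plain NumPy STFT and compares the number of frames produced by a 10 ms hop and a 1 ms hop at a 16 kHz sample rate. The window and hop lengths are illustrative choices on our part, not the settings used in the paper.

```python
import numpy as np

def stft_frames(waveform, win_length=400, hop_length=160):
    """Magnitude spectrogram via a minimal STFT.

    win_length=400 and hop_length=160 correspond to a 25 ms window
    and a 10 ms hop at a 16 kHz sample rate (illustrative values).
    Returns an array of shape (n_frames, n_freq_bins).
    """
    window = np.hanning(win_length)
    n_frames = 1 + (len(waveform) - win_length) // hop_length
    frames = np.stack([
        waveform[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # One row per time frame, one column per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of audio at 16 kHz: a 10 ms hop yields ~100 frames,
# while a 1 ms hop (hop_length=16) yields ~10x the temporal resolution.
x = np.random.randn(16000)
coarse = stft_frames(x, hop_length=160)  # shape (98, 201)
fine = stft_frames(x, hop_length=16)     # shape (976, 201)
```

Shrinking the hop tenfold multiplies the temporal dimension roughly tenfold at the same frequency resolution, which is exactly the computational pressure that motivates merging non-essential frames downstream.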
There are plenty of studies on learning a suitable frequency resolution in a similar spirit (Stevens et al., 1937; Sainath et al., 2013; Ravanelli & Bengio, 2018b; Zeghidour et al., 2021), but learning the temporal resolution is still under-explored. Most previous studies focus on investigating the effect of different temporal resolutions (Kekre et al., 2012; Huzaifah, 2017; Ilyashenko et al., 2019; Liu et al., 2022c). Huzaifah (2017) observes the optimal temporal resolution for audio classifi-

