LEARNING THE SPECTROGRAM TEMPORAL RESOLUTION FOR AUDIO CLASSIFICATION

Abstract

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. The temporal resolution of a spectrogram depends on the hop size. Previous works generally assume the hop size should be a constant value, such as ten milliseconds. However, a fixed hop size or resolution is not always optimal for different types of sound. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution learning to improve the performance of audio classification models. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier, and can be jointly optimized with the classification task. We evaluate DiffRes on the mel-spectrogram, followed by state-of-the-art classifier backbones, and apply it to five different subtasks. Compared with using the fixed-resolution mel-spectrogram, the DiffRes-based method can achieve the same or better classification accuracy with at least 25% fewer temporal dimensions at the feature level, which alleviates the computational cost at the same time. Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, we show that DiffRes can improve classification accuracy with the same computational complexity.

1. INTRODUCTION

Audio classification refers to a series of tasks that assign labels to an audio clip. These tasks include audio tagging (Kong et al., 2020), speech keyword classification (Kim et al., 2021), and music genre classification (Castellon et al., 2021). The input to an audio classification system is usually a one-dimensional audio waveform, represented by discrete samples. Although there are methods that use time-domain samples as features (Kong et al., 2020; Luo & Mesgarani, 2018; Lee et al., 2017), the majority of studies on audio classification convert the waveform into a spectrogram as the input feature (Gong et al., 2021b;a). A spectrogram is usually calculated by the Fourier transform (Champeney & Champeney, 1987), applied to short waveform chunks multiplied by a windowing function, resulting in a two-dimensional time-frequency representation. According to Gabor's uncertainty principle (Gabor, 1946), there is always a trade-off between time and frequency resolution. To achieve the desired resolution on the temporal dimension, it is common practice (Kong et al., 2021a; Liu et al., 2022a) to apply a fixed hop size between windows to capture the dynamics between adjacent frames. With a fixed hop size, the spectrogram has a fixed temporal resolution, which we will refer to simply as resolution. Using a fixed resolution is not necessarily optimal for an audio classification model. Intuitively, the resolution should depend on the temporal pattern: fast-changing signals call for high resolution, while relatively steady or silent signals may not need the same high resolution for the best accuracy (Huzaifah, 2017). For example, Figure 1 shows that increasing the resolution reveals more details in the spectrogram of Alarm Clock, while the pattern of Siren stays mostly the same. This indicates that the finer details in the high-resolution Siren may not essentially contribute to classification accuracy.
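The link between hop size and temporal resolution can be made concrete with a minimal spectrogram computation. The sketch below is illustrative only: the 16 kHz sampling rate, the 25 ms Hann window, and the `spectrogram` helper are our assumptions, not details from the paper.

```python
import numpy as np

def spectrogram(x, win_length, hop_length):
    """Magnitude spectrogram: maps R^L -> R^{F x T} (Hann window, no padding)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop_length
    frames = np.stack([x[t * hop_length : t * hop_length + win_length] * window
                       for t in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1)).T  # shape (F, T)

sr = 16000
x = np.random.randn(sr)        # 1 s of dummy audio
win = int(0.025 * sr)          # 25 ms window -> F = win // 2 + 1 = 201 bins
for hop_ms in (40, 10, 1):
    hop = int(hop_ms / 1000 * sr)
    S = spectrogram(x, win, hop)
    print(f"hop {hop_ms} ms -> shape {S.shape}")
```

The frequency dimension F is fixed by the window length, while halving the hop doubles the number of time frames T, which is why a finer temporal resolution directly increases the classifier's input length and computational cost.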
There are plenty of studies on learning a suitable frequency resolution in a similar spirit (Stevens et al., 1937; Sainath et al., 2013; Ravanelli & Bengio, 2018b; Zeghidour et al., 2021), but learning the temporal resolution is still under-explored. Most previous studies focus on investigating the effect of different temporal resolutions (Kekre et al., 2012; Huzaifah, 2017; Ilyashenko et al., 2019; Liu et al., 2022c). Our pilot study shows that when increasing the resolution, the improvement on different types of sound is not consistent, and some of them even degrade at a higher resolution. This motivates us to design a method that can learn the optimal resolution. Besides, the potential of the high-resolution spectrogram, e.g., with a one-millisecond (ms) hop size, is still unclear. Popular choices of hop size include 10 ms (Böck et al., 2012; Kong et al., 2020; Gong et al., 2021a) and 12.5 ms (Shen et al., 2018; Rybakov et al., 2022). Previous studies (Kong et al., 2020; Ferraro et al., 2021) show that classification performance can be steadily improved by increasing the resolution. One remaining question is: can an even finer resolution improve the performance? We conduct a pilot study on this question on a limited-vocabulary speech recognition task with hop sizes smaller than 10 ms (see Figure 10b in Appendix A.3). We notice that accuracy can still be improved with a smaller hop size, at the cost of increased computational complexity. This indicates that there is still useful information at higher temporal resolutions. To the best of our knowledge, this work is the first to demonstrate learning the temporal resolution of the audio spectrogram. We show that learning the temporal resolution leads to efficiency and accuracy improvements over the fixed-resolution spectrogram. We propose a lightweight algorithm, DiffRes, that makes the spectrogram resolution differentiable during model optimization.
DiffRes can be used as a "drop-in" module after the spectrogram calculation and optimized jointly with the downstream task. For the optimization of DiffRes, we propose a loss function, the guide loss, to inform the model of the low importance of the empty frames formed by SpecAugment (Park et al., 2019). The output of DiffRes is a time-frequency representation with varying resolution, which is achieved by adaptively merging the time steps of a fixed-resolution spectrogram. The adaptive temporal resolution alleviates the temporal redundancy of the spectrogram and can speed up computation during training and inference. We perform experiments on five different audio tasks, including the largest audio dataset, AudioSet (Gemmeke et al., 2017). DiffRes shows clear improvements on all tasks over the fixed-resolution mel-spectrogram baseline and other learnable front-ends (Zeghidour et al., 2021; Ravanelli & Bengio, 2018b; Zeghidour et al., 2018). Compared with the fixed-resolution spectrogram, we show that using DiffRes can achieve a temporal dimension reduction of at least 25% with the same or better audio classification accuracy. On high-resolution spectrograms, we also show that DiffRes can improve classifier performance without increasing the feature temporal dimensions. Our code is publicly available¹.
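Section 2.2 gives the actual differentiable formulation. Purely as a toy, non-differentiable illustration of merging non-essential frames while preserving important ones, one could rank frames by a heuristic importance score (here simple frame energy, our own assumption) and average each low-importance frame into the kept frame preceding it:

```python
import numpy as np

def merge_frames(spec, keep_ratio=0.75):
    """Toy illustration (NOT the DiffRes algorithm): keep the highest-energy
    frames and average each dropped frame into the kept frame before it."""
    F, T = spec.shape
    n_keep = max(1, int(round(keep_ratio * T)))
    order = np.argsort(-spec.sum(axis=0))  # frames by descending energy
    keep = np.sort(order[:n_keep])         # kept frame indices, in time order
    out = np.empty((F, n_keep))
    bounds = np.append(keep, T)            # frames before keep[0] are simply dropped here
    for j in range(n_keep):
        s, e = bounds[j], bounds[j + 1]
        # each output frame averages a kept frame with the dropped frames after it
        out[:, j] = spec[:, s:e].mean(axis=1)
    return out
```

With `keep_ratio=0.75` this mirrors the 25% temporal-dimension reduction quoted above, but unlike this fixed energy heuristic, DiffRes learns which frames to merge jointly with the classifier.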

2. LEARNING TEMPORAL RESOLUTION WITH DIFFRES

We provide an overview of DiffRes-based audio classification in Section 2.1, and introduce the detailed formulation and optimization of DiffRes in Sections 2.2.1, 2.2.2, and 2.3.

2.1. OVERVIEW

Let x ∈ R^L denote a one-dimensional audio waveform, where L is the number of audio samples. An audio classification system can be divided into a feature extraction stage and a classification stage. In the feature extraction stage, the audio waveform is processed by a function Q_{l,h}: R^L → R^{F×T}, with window length l and hop size h, which maps the waveform into a two-dimensional time-frequency representation with F frequency bins and T time frames.
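The two-stage decomposition can be sketched as follows. This is a NumPy toy: `Q` is a plain magnitude spectrogram standing in for the feature extractor, and `classify` is a random linear map standing in for the classification stage; neither is a model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x, l, h):
    """Feature extraction Q_{l,h}: R^L -> R^{F x T} (magnitude spectrogram)."""
    w = np.hanning(l)
    T = 1 + (len(x) - l) // h
    frames = np.stack([x[t * h : t * h + l] * w for t in range(T)])
    return np.abs(np.fft.rfft(frames, axis=-1)).T

def classify(S, n_classes=5):
    """Stand-in classifier: mean-pool over time, then a random linear layer."""
    W = rng.standard_normal((n_classes, S.shape[0]))
    return W @ S.mean(axis=1)

x = rng.standard_normal(16000)   # 1 s of dummy audio at 16 kHz
S = Q(x, l=400, h=160)           # 25 ms window, 10 ms hop -> shape (201, 98)
# A drop-in resolution module would transform S here, before the classifier.
logits = classify(S)
print(S.shape, logits.shape)
```

DiffRes sits exactly at the commented line: it consumes the fixed-resolution output of Q_{l,h} and hands the classifier a shorter, adaptively merged representation.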



¹ https://anonymous.4open.science/r/diffres-8F22



Huzaifah (2017) observes that the optimal temporal resolution for audio classification is class dependent. Ferraro et al. (2021) experiment on music tagging with coarse-resolution spectrograms and observe that similar performance can be maintained while being much faster to compute. Kazakos et al. (2021) propose a two-stream architecture that processes both fine-grained and coarse-resolution spectrograms and achieves state-of-the-art results on VGG-Sound (Chen et al., 2020). Recently, Liu et al. (2022c) propose a spectrogram-pooling-based module that improves classification efficiency with negligible performance degradation. In addition, our pilot study shows that the optimal resolution is not the same for different types of sound (see Figure 10a in Appendix A.3).
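For contrast with adaptive merging, a uniform pooling baseline in the spirit of such efficiency-oriented modules can be sketched in a few lines (this is generic temporal average pooling, not the exact module of Liu et al. (2022c)):

```python
import numpy as np

def temporal_pool(spec, factor=4):
    """Uniform temporal average pooling: (F, T) -> (F, T // factor).
    Trailing frames that do not fill a full group are discarded."""
    F, T = spec.shape
    T2 = T - T % factor
    return spec[:, :T2].reshape(F, T2 // factor, factor).mean(axis=-1)
```

Uniform pooling compresses every region of the spectrogram equally, regardless of content; preserving fast-changing regions at full resolution while compressing steady ones is precisely the gap that adaptive, learned merging aims to close.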

Figure 1: Spectrograms of the Alarm Clock and Siren sounds with 40 ms and 10 ms hop sizes, all with a 25 ms window size. The pattern of Siren, which is relatively stable, does not change significantly with a smaller hop size (i.e., higher temporal resolution), while Alarm Clock is the opposite.

