SOUNDCOUNT: SOUND COUNTING FROM RAW AUDIO WITH DYADIC DECOMPOSITION NEURAL NETWORK

Abstract

In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by proposing a novel end-to-end trainable neural network (DyDecNet, comprising a dyadic decomposition front-end and a backbone network), and by quantifying the difficulty level of counting as a function of sound polyphonicity. Unlike existing audio-processing methods that uniformly apply a set of frequency-selective filters on the raw waveform in a one-stage manner to obtain a time-frequency (TF) representation, our dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain the TF representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. We argue that such a dyadic decomposition front-end better characterizes the sound polyphonicity and concurrency that commonly exist in the sound counting task, while introducing negligible extra computational cost. To better quantify the sound counting difficulty level, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on sound datasets from different domains: bioacoustic sound (both synthetic and real-world), telephone-ring sound and music sound. Comprehensive experimental results show our method outperforms existing sound event detection (SED) methods significantly. The dyadic decomposition front-end can also be used as a general front-end by existing methods to improve their performance accordingly.

1. INTRODUCTION

Suppose you went to the seaside and heard a cacophony of seagulls, squawking and squabbling. An interesting question that naturally arises is whether you can tell, from the sound alone, how many seagulls are flocking around you. Although this is a trivial example, the sound "crowd counting" problem has a number of important applications. For example, passive acoustic monitoring (PAM) is widely used to record sounds in natural habitats, providing measures of ecosystem diversity and density [2, 15, 12]. Sound counting helps to quantify and map sound pollution by counting the number of individual polluting events [4]. It can also be used in music content analysis [24]. Despite its importance, research on sound counting has lagged far behind its well-established crowd counting counterparts in images [49, 46], video [29] or joint audio-visual data [22]. We conjecture that the lack of exploration stems from three main factors. First, sound counting has long been thought of as a problem already solved by sound event detection (SED) methods [35, 9, 1, 19], since SED goes further and identifies each sound event's (e.g. a bird call's) start time, end time and semantic identity; the sound count then becomes easily accessible by simply adding up all detected events. Second, current SED only tags whether a class of sound event is present within a window, regardless of the number of concurrent sound sources of the same class, such as a series of baby cries or multiple bird calls [41]. Third, labelling acoustic data is technically harder and more time-consuming than labelling images, due to the overlap of concurrent and diverse sources. The lack of well-labelled sound data in crowded sound scenes naturally hampers research progress. Existing SED sound datasets [1, 20] capture simple acoustic scenarios with low polyphony and small event variance. The simplified acoustic scenario in turn makes the sound counting task tractable for SED methods. But when the sound scene becomes much more complex, with highly concurrent sound events, SED methods quickly lose their capability to discriminate different sound events [38, 9]. Therefore, a study specific to the sound counting problem is desirable and overdue. In this paper, we study the general sound counting problem under highly polyphonic, cluttered and concurrent situations. Whilst the challenges of image-based crowd counting mainly lie in spatial density, occlusion and perspective distortion, the sound counting challenges are two-fold. First, acoustic scenes are additive mixtures of sound along both the time and frequency axes, making counting overlapping sounds difficult (temporal concurrence and spectrum overlap). Second, there is a large variance in event loudness due to spherical signal attenuation with distance (loudness variance). Tackling these challenges requires a more careful way to process the raw sound waveform so as to better localize sound in the time-frequency domain. In this paper, we propose a novel dyadic decomposition neural network to learn a sound density representation capable of estimating cardinality directly from the raw sound waveform. Unlike existing sound waveform processing methods that apply frequency-selective filters on the raw waveform in a single stage [19, 10, 48, 18, 14], our network progressively decomposes the raw sound waveform in a dyadic manner, where the intermediate waveform convolved by each parent filter is further processed by its two child filters.
The two child filters evenly split the parent filter's frequency response, with one child filter encoding the waveform approximation (the one with the lower-half frequency response) and the other encoding the waveform details (the one with the higher-half frequency response). To accommodate sound loudness variance, spectrum overlap and time concurrence, we further propose an energy gain normalization module to regularize each intermediate parent waveform before feeding it to two child filters for further processing. This hierarchical dyadic decomposition front-end enables the neural network to learn a robust TF representation in a multi-stage, coarse-to-fine manner, while introducing negligible extra computation cost. By making each filter's frequency cutoff parameters learnable and self-adjustable during optimization in a data-driven way, the final learned TF representation can better characterize sound existence in the time and frequency domains. Following the front-end, we add a backbone network that continues to learn a framewise representation along time. Such a representation can be used to derive the final sound count by directly regressing the count number, regressing a density map (the one we choose), or following the SED pipeline. Apart from the network, we further propose three polyphony-aware metrics to quantify the sound counting difficulty level: polyphony ratio, maximum polyphony and mean polyphony. We give a detailed discussion of the feasibility of the three metrics. We run experiments on cross-domain sound datasets: bird sound (both real-world and synthetic), telephone-ring sound (synthetic), and music sound [24] (real-world). Experimental results show our method (DyDecNet) outperforms existing SED-based methods significantly on both real-world and synthetic datasets. Replacing existing methods' one-stage raw-waveform front-ends with our dyadic decomposition front-end dramatically improves their performance accordingly. Since the real-world datasets contain relatively low polyphony levels, we specially synthesize a bird sound dataset with much higher polyphony and spectral overlap. The synthesized sound dataset has two subsets: one involves four kinds of bird sound (a heterophonic scenario); the other has just one kind of sound (a homophonic scenario). Experiments on these synthetic datasets help to test performance under highly polyphonic situations. In summary, we make three main contributions. First, we propose a dyadic decomposition front-end to decompose the raw waveform in a multi-stage, coarse-to-fine manner, which better handles loudness variance, spectrum overlap and time concurrence. Second, we propose a new set of polyphony-aware evaluation metrics to comprehensively and objectively quantify sound counting difficulty levels. Third, we show the efficiency and generalization of DyDecNet on sound datasets across different domains.

2. RELATED WORK

Crowd counting from images or audio-visual data has been thoroughly studied in recent years [49, 22]; the target is to estimate the instance number in very crowded scenes (e.g. pedestrians in a train station) that cannot be efficiently handled by object detection methods. The methods approaching image crowd counting evolve chronologically from early detection-based [26] to later regression-based [11] and density map estimation [27] methods. Accompanying these methods, various neural network architectures have been designed to achieve higher performance. The counterpart task purely in sound, however, has been nearly ignored. Existing research mainly focuses on sound event detection, including sound event localization and detection (SELD) [19, 18, 1, 10] from a microphone array, temporal sound event detection [7, 39] and high-frequency time series analysis [36]. These works often combine convolutional neural networks (CNN) [8] and recurrent neural networks [1] to separate sound sources. The datasets they work on are relatively simple: the sound scenes contain few overlapping sound events. The common way to process a raw sound waveform is to first convert the 1D waveform into a 2D time-frequency representation so that sound events' frequency properties and their variation along the time axis are made explicit. Most existing methods [10, 1, 7, 39] adopt the Fourier transform [14] or Wavelet transform [33] to obtain such a 2D representation, in which the whole conversion process is fixed. Some recent work [19, 48, 42, 45, 37] re-parameterizes the frequency-selective conversion filters to be learnable so that the whole neural network is able to learn directly from the raw sound waveform. Experimental results show that enabling the neural network to learn from the raw waveform often achieves better performance than a traditional fixed conversion. These methods, however, convert the raw waveform in a one-stage manner. Our proposed dyadic decomposition neural network instead processes the raw waveform in a dyadic multi-stage manner.

Dyadic Network. The dyadic representation idea was initially proposed to represent signals hierarchically [6, 3], in a multi-scale manner. Its core idea is to construct a bank of filters (either learnable or fixed) such that each filter extracts features at a certain scale or resolution; combining them leads to a more comprehensive and complete analysis. A similar idea has been widely used in the computer vision community, including pyramid feature representations for object detection [30] and semantic segmentation [43, 28].

Figure 1: DyDecNet pipeline. We first feed the input raw sound waveform to the dyadic decomposition front-end to learn a time-frequency representation, which is further fed to a backbone neural network to continue to learn a framewise representation. Such a representation retains time information, so it is general enough to produce the count number by either regression or an SED method. The dyadic decomposition front-end consists of a set of parameterized learnable band-pass filters. Each intermediate waveform processed by a parent filter is further processed by two child filters, with the lower-half filter (red) encoding the approximation and the higher-half filter (light blue) encoding the details.

3. DYADIC DECOMPOSITION NEURAL NETWORK

Different sound classes typically exhibit different spectral properties. A canonical way to process a raw sound waveform is to apply a frequency-selective filter bank $\mathcal{F}_f = \{f_i\}_{i=1}^{k}$ to project the raw sound waveform onto different frequency bins. The traditional Fourier transform [14] or Wavelet transform [33] construct fixed filter banks in which all filter-construction hyperparameters are empirically chosen and thus may not be optimal for a particular task. Recent methods [19, 48] relax some hyperparameters to be trainable so that the filter bank can be optimized in a data-driven way. A learnable filter bank often leads to better performance than fixed filters. However, all existing methods apply all filters, either learnable or fixed, on the raw waveform in a one-stage manner. Such shallow, one-stage processing may fail to learn powerful and robust representations for the sound counting task, where large loudness variance and heavy spectrum overlap exist. In our dyadic decomposition framework, we instead adopt a progressive pairwise decomposition strategy to obtain the time-frequency (TF) representation. It learns a TF representation from coarse-grained to fine-grained granularity. Specifically, it consists of a dyadic frontend and a backbone network.

3.1. DYADIC FREQUENCY DECOMPOSITION FRONTEND

In the dyadic frequency decomposition frontend, we construct a set of $D$ hierarchical filter banks $\mathcal{F}^{D}_{\text{dyadic}} = \{F^{1}_{2^1}, F^{2}_{2^2}, \cdots, F^{D}_{2^D}\}$. The $d$-th filter bank has $2^d$ filters, each parameterized by a learnable high frequency-cutoff parameter and a learnable low frequency-cutoff parameter. By cascading these filter banks, we consecutively decompose the raw waveform dyadically in the frequency domain, leading to a coarse-grained to fine-grained TF representation. Specifically, denoting the dyadic filter bank depth by $D$, the depth-$d$ filter bank $F^{d}_{2^d}$ has $2^d$ filters that evenly divide the waveform sampling frequency $F_s$. Therefore, each single filter's frequency response length is $\frac{F_s}{2^d}$, and the $i$-th filter $f^{d}_{i}$'s high frequency cutoff $F_h$ and low frequency cutoff $F_l$ are initialized as,

$$F_h(f^{d}_{i}) = \frac{F_s}{2^d}\cdot(i+1), \qquad F_l(f^{d}_{i}) = \frac{F_s}{2^d}\cdot i \qquad (1)$$

From Eqn. (1) we can see that the dyadic decomposition frontend forms a complete binary-tree-like structure, in which the filter number doubles and each filter's frequency response length halves as the tree's depth increases by one. The intermediate waveform processed by a "parent" filter is further processed by its two "children" filters, whose frequency responses evenly split the parent filter's frequency response. The child filter carrying the higher half of the frequency response encodes the detail of the parent's intermediate waveform, while the one carrying the lower half encodes the approximation. For example, the filter $f^{d}_{i}$ in the $d$-th filter bank has a frequency response in $[\frac{F_s}{2^d}\cdot i,\ \frac{F_s}{2^d}\cdot(i+1)]$; its two children filters $f^{d+1}_{2i}$ and $f^{d+1}_{2i+1}$ at depth $d+1$ evenly divide this range, so $f^{d+1}_{2i}$ carries $[\frac{F_s}{2^d}\cdot i,\ \frac{F_s}{2^d}(i+\frac{1}{2})]$ and $f^{d+1}_{2i+1}$ carries $[\frac{F_s}{2^d}(i+\frac{1}{2}),\ \frac{F_s}{2^d}\cdot(i+1)]$. With the pre-constructed dyadic decomposition filter banks, we cascade them together to process the raw sound waveform, progressively learning the final TF representation. In our implementation, each filter in the dyadic filter banks is a learnable band-pass filter. We adopt a filter that is rectangular band-pass in the frequency domain, parameterized by a learnable high frequency cutoff $F_h$ and a learnable low frequency cutoff $F_l$. Converting it to the time domain through the inverse Fourier transform, we get a sinc-like filter that is convolved with the waveform. For example, the filter $f^{d}_{i}$ in Eqn. (1) can be represented as,

$$f^{d}_{i}[t; F_h, F_l] = 2F_h\,\mathrm{sinc}(2\pi F_h t) - 2F_l\,\mathrm{sinc}(2\pi F_l t) \qquad (2)$$

where $\mathrm{sinc}(x) = \sin(x)/x$ and $t$ indexes time. $F_h$ and $F_l$ are initialized according to Eqn. (1), but they are further adjusted during the training process. $\mathrm{sinc}(\cdot)$ filters have been successfully used in speech recognition [42] and sound event detection and localization [19]. In our dyadic decomposition frontend, each filter at each depth has separate and independent learnable parameters (high and low frequency cutoffs). Moreover, our constructed filter is much longer (1025 taps in our case) than traditional 1D/2D convolution filters (3 or 5 taps). This length gives each filter a wide field-of-view on the raw waveform, and cascading them allows filters in later layers (larger depth) to have an even wider field-of-view on the input. With this advantage, we do not have to model sound event temporal dependency explicitly with an RNN.
As a result, the whole dyadic frequency decomposition frontend is fully convolutional and parametrically learnable; it is parameter-frugal and computationally efficient. In practice, the dyadic decomposition frontend depth is 8, so the output TF representation has 256 frequency bins. In the first 5 dyadic filter banks, we downsample each intermediate waveform by 2 before feeding it to its two children filters, to reduce the memory cost.
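To make the construction concrete, the following is a minimal PyTorch sketch of one learnable sinc band-pass filter (Eqn. 2) and of how a parent band is split into its two children. The class and function names, kernel normalization, and padding choices are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SincBandPass(nn.Module):
    """One learnable band-pass filter f_i^d realized as in Eqn. (2)."""

    def __init__(self, f_low, f_high, sample_rate=24000, kernel_size=1025):
        super().__init__()
        self.sample_rate = sample_rate
        self.kernel_size = kernel_size
        # Learnable cutoffs in Hz, initialized per Eqn. (1).
        self.f_low = nn.Parameter(torch.tensor(float(f_low)))
        self.f_high = nn.Parameter(torch.tensor(float(f_high)))
        t = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        self.register_buffer("t", t / sample_rate)  # symmetric time axis in seconds

    def forward(self, x):  # x: (batch, 1, time)
        fl = self.f_low.abs()
        fh = self.f_high.abs().clamp(max=self.sample_rate / 2)
        # torch.sinc(u) = sin(pi*u)/(pi*u), so sinc(2*f*t) equals sin(2*pi*f*t)/(2*pi*f*t).
        kernel = 2 * fh * torch.sinc(2 * fh * self.t) - 2 * fl * torch.sinc(2 * fl * self.t)
        kernel = (kernel / (kernel.abs().sum() + 1e-8)).view(1, 1, -1)  # rough normalization
        return nn.functional.conv1d(x, kernel, padding=self.kernel_size // 2)

def make_children(f_low, f_high, sample_rate=24000):
    """Split a parent band [f_low, f_high] into an approximation (lower half)
    and a detail (higher half) child filter, as in the dyadic tree."""
    mid = (f_low + f_high) / 2
    return SincBandPass(f_low, mid, sample_rate), SincBandPass(mid, f_high, sample_rate)
```

Cascading such pairs depth by depth (8 levels in the paper) yields the 256-band TF representation; the energy gain normalization of Sec. 3.2 would sit between a parent's output and its two children.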

3.2. ENERGY GAIN NORMALIZATION

We further design an energy gain normalization module to regularize each intermediate waveform before feeding it to the next dyadic filter bank. The motivation for introducing energy gain normalization is two-fold: first, to reduce the sound event loudness variance caused by sound events' different spatial locations; second, to push the frontend to better tackle the spectrum-overlap challenge caused by intra-class sound events in the sound scene. Specifically, each intermediate waveform $W_{f^{d}_{i}}$ is first smoothed with a Gaussian kernel of learnable width $\sigma$, yielding an envelope $W_{g^{d}_{i}}$ that captures the local loudness. We then introduce a learnable automatic gain control parameter $\alpha$ to mitigate the loudness impact. Furthermore, two learnable compression parameters $\delta$ and $\gamma$ are introduced to further compress $W_{f^{d}_{i}}$. The overall energy gain normalization can be represented as,

$$\tilde{W}_{f^{d}_{i}} = \left(\frac{W_{f^{d}_{i}}}{(W_{g^{d}_{i}})^{\alpha}} + \delta\right)^{\gamma} - \delta^{\gamma}$$

where $\alpha$, $\delta$ and $\gamma$ are learnable parameters. As a result, the energy gain normalization eg-Norm is fully learnable and parameterized by four learnable parameters, eg-Norm($\sigma, \alpha, \delta, \gamma$). In practice, each filter in the dyadic filter banks is associated with an independent eg-Norm module. Similar energy normalization has been successfully used in tasks like keyword spotting [47, 31]. The difference lies in the fact that those methods apply an exponential moving average to obtain the smoothed waveform representation, so the computation is slow because it iterates along the time axis step by step. Our energy gain normalization strategy instead adopts a Gaussian kernel to get the smoothed waveform, which can be easily implemented as a 1D convolution. The dyadic filter visualization and the energy normalization module are shown in Fig. 3.
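A minimal sketch of such an eg-Norm module is given below, assuming the smoothing is a Gaussian 1D convolution over the rectified waveform; the kernel size, the rectification, the clamping for numerical safety, and the parameter handling are illustrative assumptions, with initial values taken from the experiment section.

```python
import torch
import torch.nn as nn

class EnergyGainNorm(nn.Module):
    """Energy gain normalization eg-Norm(sigma, alpha, delta, gamma) -- a sketch."""

    def __init__(self, kernel_size=129, alpha=0.96, delta=2.0, gamma=0.5, sigma=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.delta = nn.Parameter(torch.tensor(delta))
        self.gamma = nn.Parameter(torch.tensor(gamma))
        self.sigma = nn.Parameter(torch.tensor(sigma))
        t = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        self.register_buffer("t", t)
        self.kernel_size = kernel_size

    def forward(self, w):  # w: (batch, 1, time), one filter channel
        # Gaussian smoothing of the rectified waveform gives the loudness envelope W_g.
        width = self.sigma.abs() * self.kernel_size + 1e-4
        gauss = torch.exp(-0.5 * (self.t / width) ** 2)
        gauss = (gauss / gauss.sum()).view(1, 1, -1)
        w_g = nn.functional.conv1d(w.abs(), gauss, padding=self.kernel_size // 2)
        # Gain control then compression: (W / W_g^alpha + delta)^gamma - delta^gamma.
        base = (w / (w_g + 1e-8) ** self.alpha + self.delta).clamp(min=1e-8)
        return base ** self.gamma - self.delta ** self.gamma
```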

3.3. BACKBONE NEURAL NETWORK

We add a lightweight backbone neural network on top of the frontend to further learn a representation useful for sound counting. The backbone network consists of two parts: per-channel pooling and inter-channel 1D convolution. Unlike existing methods [9, 1] that first convert the 1D sound waveform into a 2D map with a fixed FFT-like transform and then learn from the 2D map with 2D convolution operations, our method directly learns from the raw sound waveform with learnable 1D convolutions. Specifically, we downsample each channel separately by assigning each channel an independent frequency-sensitive learnable filter. We call such learnable downsampling per-channel pooling; it helps to learn each sound event's frequency variation along the time axis individually. Moreover, we add normal 1D convolutions to achieve inter-channel communication, which helps the network learn interactions between concurrent sound events. A detailed illustration is given in Appendix Table IV. The backbone serves as the backend to learn a framewise representation for counting.
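As a rough illustration of these two operations, the block below uses a grouped (depthwise) strided 1D convolution for per-channel pooling and an ordinary 1D convolution for inter-channel communication; kernel sizes, stride, normalization and activation are assumptions, not the exact architecture of Appendix Table IV.

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    """One backbone block: per-channel pooling + inter-channel 1D convolution (a sketch)."""

    def __init__(self, channels=256, pool_kernel=7, pool_stride=4, mix_kernel=3):
        super().__init__()
        # groups=channels gives every frequency channel its own learnable downsampling filter.
        self.per_channel_pool = nn.Conv1d(channels, channels, pool_kernel,
                                          stride=pool_stride, padding=pool_kernel // 2,
                                          groups=channels)
        # Ordinary 1D convolution mixes information across frequency channels.
        self.inter_channel = nn.Conv1d(channels, channels, mix_kernel, padding=mix_kernel // 2)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, frequency channels, time)
        x = self.act(self.per_channel_pool(x))
        return self.act(self.norm(self.inter_channel(x)))
```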

3.4. DENSITY MAP AND LOSS FUNCTION

The backbone network discussed above learns a framewise representation of shape $[T_b, F_b]$, where $T_b$ indicates the number of time steps and $F_b$ the feature size. There are three potential ways to derive the final sound count from the learned representation: 1) directly regress the count number; 2) follow the SED method: detect sound events first and then aggregate the results to get the final count; 3) predict a density map. For a sound event with time span $[t_1, t_2]$, its density map is a 1D vector with value $\frac{1}{t_2 - t_1}$ during its occurrence time and 0 elsewhere, so the count number equals the integral of the vector. We show that regressing the density map produces the best result (see Table 6). We thus adopt the mean squared error (MSE) loss during training to directly regress the density map. A comparison of the three methods is shown in Fig. 2.
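A minimal sketch of the density-map target and the MSE training loss, following the definition above; the frame resolution, clip length and variable names are assumptions for illustration.

```python
import torch

def build_density_map(events, n_frames, clip_len=5.0):
    """events: list of (t1, t2) spans in seconds; returns a 1D target of length n_frames."""
    density = torch.zeros(n_frames)
    frames_per_sec = n_frames / clip_len
    for t1, t2 in events:
        a = int(t1 * frames_per_sec)
        b = max(int(t2 * frames_per_sec), a + 1)
        density[a:b] += 1.0 / (b - a)  # each event integrates (sums) to 1
    return density

# Training: MSE between predicted and ground-truth density maps.
# pred = model(waveform)                             # (batch, n_frames)
# loss = torch.nn.functional.mse_loss(pred, target)
# Inference: the count estimate is the integral of the map, pred.sum(dim=-1).
```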

4. EVALUATION METRIC DISCUSSION

Mean absolute error (MAE) and mean squared error (MSE) are two widely used metrics in crowd counting [32, 49]. Specifically, denoting the ground truth count and predicted count of the $i$-th sound clip by $y_i$ and $\hat{y}_i$ respectively, MAE is defined as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$ and MSE as $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$. We also report the accuracy rate (AccuRate), the ratio of clips whose count is accurately predicted. We introduce a tolerance term $p$, where $p = 0$ means the predicted count has to match the ground truth exactly to be treated as accurate, and $p = 1$ relaxes the constraint so that a mismatch of one count is still treated as accurate.
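The three general metrics can be computed directly from the per-clip counts; a minimal sketch follows, where `tol` is the tolerance term $p$ (the function name and tensor interface are assumptions).

```python
import torch

def counting_metrics(y_true, y_pred, tol=0):
    """MAE, MSE and AccuRate over a batch of clips (1D tensors of counts)."""
    err = y_pred.float() - y_true.float()
    mae = err.abs().mean()
    mse = (err ** 2).mean()
    accu = (err.abs() <= tol).float().mean()  # AccuRate with tolerance p
    return mae.item(), mse.item(), accu.item()
```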

4.1. POLYPHONY-AWARE DIFFICULTY QUANTIFICATION

The aforementioned three general metrics do not reflect the impact of the nature of the sound scene on algorithms. We therefore introduce three polyphony-aware metrics to quantify the sound counting difficulty level imposed by the sound scene itself. The three metrics are time-window invariant, so they can be used as general metrics to quantify the difficulty level of sound scenes of various lengths. Polyphony Ratio (ratio-polyp) describes the ratio of polyphony (at least two sound events happening at the same time) over a period of time. It binarizes each time step as either polyphonic or non-polyphonic (monophonic or silent), so the value lies in $[0, 1]$. Maximum Polyphony (max-polyp) focuses on the maximum polyphony level over a time period. It is motivated by the fact that humans' capability to discriminate different sound events drops sharply as the number of temporally overlapping sound events increases. It is a positive integer and helps us to understand an algorithm's capability in tackling the polyphony peak. Mean Polyphony (mean-polyp) instead focuses on the average level of polyphony within a time period. It is designed to reflect an algorithm's capability in tackling the average polyphony level over an arbitrary time window. Given a sound vector of $T_n$ time steps $[p_1, p_2, \cdots, p_{T_n}]$, where $p_i \geq 0$ is the number of sound events active at time step $i$, the three metrics are defined as,

$$\text{ratio-polyp} = \frac{\sum_{i=1}^{T_n} \mathbb{1}_{2}(p_i)}{T_n}; \qquad \text{max-polyp} = \max_{i=1,\cdots,T_n} p_i; \qquad \text{mean-polyp} = \frac{\sum_{i=1}^{T_n} \max(p_i - 1, 0)}{T_n}$$

where $\mathbb{1}_{2}(p_i)$ is an indicator function that equals 1 if $p_i \geq 2$ and 0 otherwise. With the three quantifying metrics, we can report the general metrics (MAE, MSE) against various difficulty levels.
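The following short sketch computes the three polyphony-aware metrics for one clip from its per-frame event-count vector, following the definitions above; the tensor interface is an assumption.

```python
import torch

def polyphony_metrics(p):
    """p: 1D integer tensor, p[i] = number of sound events active at time step i."""
    p = p.float()
    ratio_polyp = (p >= 2).float().mean()              # fraction of polyphonic time steps
    max_polyp = int(p.max().item())                    # polyphony peak
    mean_polyp = torch.clamp(p - 1.0, min=0.0).mean()  # average surplus concurrency
    return ratio_polyp.item(), max_polyp, mean_polyp.item()
```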

5. EXPERIMENT

We run experiments on six sound datasets derived from four common domains. 1. Bioacoustic Sound. We focus on bird sound, as it is ubiquitous in most terrestrial environments and has distinctive vocal acoustic properties. Specifically, we test three datasets: one real-world dataset, NorthEastUS_Bird [12], and two synthesized datasets, Polyphony4Birds (heterophony test) and Polyphony1Bird (homophony test). The NorthEastUS_Bird data is recorded in a nature reserve in the northeastern United States. It encompasses 385 minutes of dawn chorus recordings collected in July 2018, with a total of 48 bird species. The average bird sound duration is very short (less than 1 s) and the polyphony level (max-polyp and mean-polyp) is small. To test performance under highly polyphonic situations, we synthesize two bird sound datasets. The first dataset contains four sounds: junco, American redhead, eagle, and rooster, from the copyright-free website findsounds.com; we call it Polyphony4Birds (heterophony test). The second dataset contains one sound, rooster; we call it Polyphony1Bird (homophony test). 2. Indoor Sound. We count telephone-ring sound; the telephone-ring seed sound comes from the same copyright-free website. We follow the Polyphony1Bird synthesis procedure, except that we synthesize in a much smaller room (10 m × 10 m × 3 m) to reflect room reverberation. 3. Outdoor Sound. We count car engine sound, as it is widely heard in outdoor urban scenarios. The car engine seed sound also comes from the same copyright-free website, and we again follow the Polyphony1Bird synthesis procedure. 4. Music Sound. We use the OpenMic2018 dataset [24]. The target is to count the number of musical instrument classes played in an audio clip, regardless of how many times each class is played. This dataset contains 20 musical instrument categories, but provides only the total count per clip rather than each instrument's start and end time, so we directly regress the number. A direct comparison of the six datasets is given in Table 1; we refer the reader to Appendix Sec. B for more discussion of the data synthesis process. Comparing Methods: We compare our framework with two main method categories: 1) traditional deterministic signal processing methods, including Librosa-onset and Aubio-onset; 2) SED-based methods. Librosa-onset [34] provides an onset/offset detection method for music note detection; it measures the uplift or shift of spectral energy to decide the starting time of a note, and we use its onset/offset detection to count sound events. Aubio-onset [5] achieves pitch tracking by aligning the period and phase of the Mel spectrogram; we use its pitch tracking to count. SED-based methods build on traditional fixed TF representations, such as the short-time Fourier transform (STFT) and LogMel. The TF representation is treated as a 2D image to be processed by a sequence of 2D convolution operators, and GRU [13] or LSTM [21] layers are often adopted to model temporal dependency. We compare three typical SED methods: 1) CRNNNet [9] uses 2D convolutions to learn multiple compressed TF representations from the input TF map, concatenates them along the frequency dimension, and feeds them to an LSTM [21] to learn a framewise representation. 2) DND-SED [16] instead adopts depthwise 2D convolution and dilated convolution to avoid using an RNN.
3) SELDNet [1] was originally designed for joint sound event detection and localization; it adopts 2D convolutions to process the 2D TF map and a bidirectional GRU to model temporal dependency. The three comparing methods' network architectures are slightly adjusted to fit our datasets. We call our method the Dyadic Decomposition Network (DyDecNet). Implementation Detail and Experiment Configuration. For all six datasets, input audio is segmented into 5-second clips at a sampling rate of 24 kHz, so the input waveform has 120,000 data points; it is normalized into [-1, 1]. We train the models with PyTorch [40] on a TITAN RTX [25] GPU. The network architecture of DyDecNet is shown in Appendix Table IV. To train the neural network, we adopt the Adam optimizer with an initial learning rate of 0.001, which decays by a factor of 0.5 every 20 epochs; overall, we train for 60 epochs. We train each method 10 times independently and report the mean value; we omit the standard deviation from the tables because it is very small (about 0.03). We first train the comparing SED methods with both their suggested training strategy and our training strategy, then choose the better-performing one as the final result. For the energy gain normalization we initialize α = 0.96, δ = 2.0, γ = 0.5, σ = 0.5. The batch size is 128.
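For reference, a minimal sketch of this training configuration in PyTorch follows; `model` and `train_loader` are assumed to exist and the names are illustrative.

```python
import torch

# Assumed to be defined elsewhere: model (a DyDecNet-like network) and train_loader
# yielding (waveform, density) pairs of 5 s clips (120,000 samples) with batch size 128.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(60):
    for waveform, density in train_loader:
        pred = model(waveform)                               # predicted density map
        loss = torch.nn.functional.mse_loss(pred, density)   # density-map regression loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```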

5.1. EXPERIMENTAL RESULT

The quantitative results on MSE and MAE are given in Table 2, and the accuracy rate results are given in Table II of the Appendix. From the two tables we can see that our proposed DyDecNet outperforms both classic deterministic signal processing methods and existing SED methods by a large margin. Our framework is better than the baselines discussed in this paper in both real-world and synthesized sound scenes; it is capable of learning powerful representations from weak sound signals (NorthEastUS_Bird dataset) as well as from highly polyphonic (our four synthesized datasets), heavily spectrum-overlapping, loudness-varying sound events. At the same time, we observe that the two deterministic signal processing methods (Librosa-onset and Aubio-onset) produce the worst results, below both the SED-based methods and DyDecNet. The higher the polyphony level of the dataset, the worse the two deterministic methods perform. For example, on the NorthEastUS_Bird dataset with its relatively low polyphony level, Librosa-onset and Aubio-onset perform relatively well, with the accuracy rate (p = 1) reaching 0.58. On our synthesized datasets with much higher polyphony levels, however, their accuracy drops to near zero. This shows that traditional signal processing methods are not fit for sound event counting in crowded acoustic scenes. Among the three bird datasets, SED-based methods and DyDecNet show decreasing performance on Polyphony4Birds, Polyphony1Bird and NorthEastUS_Bird, respectively. The largest performance drop is observed on the real-world NorthEastUS_Bird dataset, which shows that counting from real-world data is a tough task that deserves more attention in the future. Spectrum overlap caused by intra-class sound events is another potential challenge (performance is better on Polyphony4Birds than on Polyphony1Bird) that may need more work to tackle. The MSE and MAE variation against the max-polyp, ratio-polyp and mean-polyp difficulty levels on NorthEastUS_Bird is shown in Fig. 5. We can observe that the proposed max-polyp, ratio-polyp and mean-polyp metrics effectively quantify the difficulty level of the sound counting task: all methods show a dramatic performance drop as the difficulty level under each metric increases. Nevertheless, DyDecNet remains the best performer across all three difficulty metrics, showing it outperforms the comparing methods under all difficulty levels discussed in this paper.

5.2. ABLATION STUDY

We conduct ablation studies on the NorthEastUS_Bird data. First, we disentangle the dyadic decomposition frontend and the backbone network of our framework to figure out their individual contributions. To this end, on the one hand, we attach the dyadic decomposition frontend to the three SED methods' backbone networks so that they can learn the TF representation from the raw waveform; we call them SELDNet_dydec, CRNNNet_dydec and DND-SED_dydec, respectively. On the other hand, we feed our backbone neural network with fixed pre-extracted TF features, including the short-time Fourier transform (STFT), LogMel, MFCC and Gabor Wavelet filters; we call them DyDecNet_STFT, DyDecNet_LogMel, DyDecNet_MFCC and DyDecNet_Gabor, respectively. The results are in Tables 3 and 4. We observe that replacing traditional fixed TF features with the dyadic decomposition frontend significantly improves performance (Table 3). The gain is two-fold: first, the dyadic decomposition frontend enables the network to learn directly from the raw waveform, so that all frequency-selective filters are adjustable during training; second, the progressive dyadic decomposition enables the neural network to learn robust representations for sound counting. Similarly, a large performance drop is observed if we let our backbone neural network learn from traditional fixed TF features (Table 4). It thus shows that both the dyadic decomposition frontend and the backbone neural network are important for the sound counting task. Second, we examine whether the dyadic decomposition itself is essential for sound counting, and how important the energy normalization block is. We test three variants: 1) our network with single-scale decomposition, i.e. applying all filters directly on the raw waveform (DyDecNet_SingScale), which validates the necessity of the hierarchical dyadic decomposition framework; 2) replacing the energy normalization module with traditional batch normalization [23] (DyDecNet_BN); and 3) removing normalization altogether (DyDecNet_noNorm). The results are in Table 5, from which we can clearly observe that either removing energy normalization or replacing it with batch normalization significantly reduces performance, showing the importance of energy normalization. Lastly, to show the effectiveness of the density map, we run two ablations that instead directly regress the final count number or first detect the sound events. From the results in Table 6, we conclude that directly regressing the sound event count leads to inferior performance compared with estimating the density map, and treating it as a sound event detection problem leads to the worst performance. Another ablation study, on the impact of energy gain normalization on traditional TF features, is presented in Appendix Sec. D.5; we refer the reader to that section for more details.

6. LIMITATION DISCUSSION AND CONCLUSION

We do not discuss using a microphone array for enhanced counting, nor do we test our dyadic decomposition front-end on other acoustic tasks (e.g. source separation). Another limitation is that we used just one instance for each sound category in our synthetic datasets, which does not reflect real scenarios; a more convincing dataset would involve as many diverse instances of each sound as possible. This also remains future work.

APPENDIX A SOUND COUNTING PROBLEM DEFINITION

Given a mono-channel $T$-second raw sound waveform $x(t)$ sampled at a fixed sampling rate $F_s$, the recording contains $N$ independent sound events $E = \{E_i = (t_s, t_e)\}_{i=1}^{N}$, where each sound event may freely be stationary or moving in the open area. The target is to design a neural network $\mathcal{N}$ parameterized by $\theta$ to predict the sound event number $N$ from the raw sound waveform, $N = \mathcal{N}(\theta \mid x(t))$. In our formulation, the counting process is class-agnostic, so all sound events are treated as instances to count, regardless of their classes. Three factors make it a challenging task: 1) Large data size: microphones usually record sound at a high sampling rate (e.g. 24 kHz), resulting in a large raw waveform; processing it thus requires filters with few parameters and low computational cost. 2) Concurrent sound events (polyphony): sound events freely overlap both spatially and temporally, resulting in highly polyphonic recordings; it is hard to separate them from the compressed 1D waveform. 3) Loudness variance and spectrum overlap: sound events of the same class but different spatial locations show large variance in their received loudness, and they also overlap heavily in the frequency domain. The above issues make counting a tough task.

We note that there are other relevant public bird sound datasets [44, 20, 9], but we find them unsuitable for our study. For example, in the TUT-SED 2009 data [20], the polyphony level is small and the involved bird sounds usually last too long (not temporally separable and countable). Similarly, the Bird Audio Detection challenge (BAD challenge) [44] contains highly sparse bird chirps (very low polyphony). Moreover, the two real-world bird sound datasets [20, 44] do not provide bird sound start and end time labels, so they are not suitable for our study. The synthesized dataset TUT-SED Synthetic 2016 [9] also contains very limited samples of high polyphony. The direct comparison between these datasets is given in Table I, from which we can see our two created datasets enjoy much higher polyphony levels, making them more suitable for the sound counting task. More detailed experimental results (MAE variation) on NorthEastUS_Bird are given in Fig. 5, from which we can observe that as max-polyp, ratio-polyp and mean-polyp increase, all methods (including our DyDecNet) degrade in performance. The three comparing methods (CRNNNet [9], DND-SED [16] and SELDNet [17]) show a sharp performance drop as the three proposed sound counting difficulty levels increase, whereas our DyDecNet largely mitigates the challenge caused by higher counting difficulty (its curve rises only slightly as the difficulty level increases).
It thus shows that 1) the proposed max-polyp, ratio-polyp and mean-polyp are capable of accurately measuring the sound counting difficulty level from different perspectives, and 2) the proposed DyDecNet is capable of mitigating these sound counting difficulties.

To regress the density map (sub-figure C), we add a fully-connected (FC) layer to reduce the feature dimension (which can also be treated as the frequency dimension) to 1. Alternatively, we can use two fully-connected layers to reduce the 2D representation to a scalar value, so as to directly regress the sound count number (sub-figure D). The process of constructing the density map is shown in sub-figure B. Note that since we use supervised learning, we know each sound event's start and end time, so the density map can easily be constructed by assigning a constant value over the span between the start and end time, such that the events add up to the sound count. Adopting density maps for counting is widely used in vision-based crowd counting [49, 26], where predicting a density map usually gives superior performance to object detection methods (in our case, the SED method) and to directly regressing the count; this conclusion from vision-based crowd counting matches our experimental results in sound-based counting. The reasons why the density-map-based method outperforms SED and direct regression are, in our understanding, two-fold: 1) unlike SED methods that try to discriminate different sound event classes from temporally overlapping sound input, the density-map-based method ignores the sound event class and instead treats every sound event as an instance spanning its active time range; the reduced difficulty enables the neural network (DyDecNet) to learn expressive representations. 2) Each sound event has a definite start and end time in the time dimension, and the 2D time-frequency representation learned by the dyadic decomposition front-end and backbone network naturally maintains this characteristic; using it to regress the density map internally exploits the temporal locations of the sound events, which helps the counting task.



see https://www.findsounds.com/




Figure 2: Illustration of the three counting methods. For the density map (sub-fig. C), the sum (or integral) of the density map equals the count number. We can also directly regress the final count number (sub-fig. D), or use the SED method (sub-fig. E). A detailed illustration is in Appendix Fig. V.

Figure 5: MSE and MAE variation against max-polyp, ratio-polyp and mean-polyp on NorthEastUS_Bird dataset. More results are in Appendix.

Figure II: AccuRate/MSE/MAE variation against max-polyp, ratio-polyp and mean-polyp on Polyphony1Bird Dataset.

Figure IV: MAE variation under different signal-to-noise ratios. Different colors indicate different methods, aligned with the colors in all other figures of the paper: light green: CRNNNet, black: SELDNet, magenta: DND-SED, blue: DyDecNet. The horizontal axis goes (from left to right) toward reduced noise interference.

Table 1: Comparison of the six datasets in terms of data size, sound event class number and polyphony level.

Table 2: MSE and MAE results on the six datasets. We leave the Accuracy Rate metric to the Appendix due to space limitations.





Ablation study on various DyDecNet variants.

Ablation study on various counting methods.

B MORE DISCUSSION ON DATASET CREATION

B.1 MOTIVATION OF POLYPHONY4BIRDS AND POLYPHONY1BIRD DATASET CREATION

Our motivation for synthesizing Polyphony4Birds and Polyphony1Bird is three-fold: 1. The NorthEastUS_Bird dataset has as many as 48 different bird categories; it helps to test various methods' capability in tackling the high bird-diversity challenge. 2. The Polyphony4Birds dataset contains 4 kinds of bird sounds, but at a much higher polyphony level (in terms of ratio-polyp, max-polyp and mean-polyp); it helps us to test various methods' capability in tackling limited bird categories but a high polyphony level (heterophony test). 3. The Polyphony1Bird dataset contains 1 bird sound class at a much higher polyphony level; it involves heavy spectrum overlap (due to temporally overlapping sounds of the same category), so it helps to test various methods' capability in tackling the combined high spectrum-overlap and high-polyphony challenge (homophony test). In the Polyphony4Birds dataset, 4 is an arbitrary number; we experimentally find that involving 4 bird sounds is representative enough for the heterophony test.

Table I: Comparison between various sound datasets, where "n/a" means not available.

Table II: Accuracy Rate results on the six datasets.

B.2 HOW TO SIMULATE THE OPEN AREA ENVIRONMENT

We collect 4 seed sounds from a copyright-free website¹: junco, American redhead, eagle, and rooster. To maximally reflect the outdoor scenario, we simulate a large open-area environment of [100 m, 100 m, 100 m] with one microphone at [50 m, 50 m, 1 m]. The walls are assigned a high sound absorption coefficient, so the reverberation is negligible and the simulation resembles an outdoor open area. We introduce a random SNR (signal-to-noise ratio) at the microphone receiver, drawn around two Gaussian means (-33 dB and -20 dB). We place each seed sound at a random 3D spatial location with a random start time to imitate natural bird sounds, which are emitted from random locations at random times. A post-processing step is added to keep the dataset balanced across the various polyphony-level metrics.

C MORE DISCUSSION ON COMPARING METHODS

A more detailed comparison between the various methods is given in Table III. We can see that our proposed DyDecNet is lightweight and directly learns from the raw sound waveform (so it is end-to-end trainable); it thus strikes a good balance between model performance and model efficiency (inference time).

The results are given in Table VI and Table VII, from which we can observe that the performance of traditional T-F features slightly increases after introducing the energy normalization module. It thus shows the necessity of energy normalization for the sound counting task in highly polyphonic situations. However, these variants still perform worse than DyDecNet, which shows that hierarchical dyadic decomposition with energy normalization is essential for the sound counting task.

DyDecNet with Traditional T-F feature and learnable energy normalization module

DyDecNet with Traditional T-F feature


The two appendix figures contain the accuracy rate, MSE and MAE variation against max-polyp, ratio-polyp and mean-polyp on the two synthetic datasets. From the two figures, we can draw a similar conclusion to that on the NorthEastUS_Bird dataset (Fig. I): as max-polyp, ratio-polyp and mean-polyp increase, all methods' performance gradually degrades. Our proposed DyDecNet remains the best-performing method under all sound counting difficulty metrics. Specifically, we can see that:

• All methods give the best performance on the Polyphony4Birds dataset, the second best on the Polyphony1Bird dataset, and the worst on the NorthEastUS_Bird dataset. It thus shows that 1) spectrum overlap due to heavy temporal overlap of same-class sounds (represented by the Polyphony1Bird dataset) remains a challenge for the sound counting task; 2) sound counting in open areas with noise pollution, high sound diversity (in our case, diversity means bird categories; we have 48 bird classes in the NorthEastUS_Bird dataset) and limited labelled data remains another challenge. We hope to attract more researchers to the sound counting task in more challenging scenarios.

• On our two synthetic datasets, we do not observe as sharp a performance drop as on the real-world NorthEastUS_Bird dataset. It thus shows that real-world sound counting becomes increasingly challenging as the three proposed difficulty metrics increase. We suspect that larger models and larger training datasets are needed to achieve better performance, which can be treated as a future research direction.

D.3 COUNTING ON MORE BIRD CLASSES

In the main paper, our two synthetic datasets Polyphony4Birds and Polyphony1Bird involve only a limited number of bird classes (up to 4). We naturally want to figure out the performance of all methods (DyDecNet and the three comparing methods) when more bird classes are involved, so we follow the same data creation procedure to synthesize four extra datasets.

In the dyadic decomposition front-end, the constructed filters' trainable frequency cutoffs (high and low) are initialized to evenly divide the frequency range of the input sound waveform (half of the sampling rate). In our design, the filter number doubles as the depth increases by 1, so each filter at the preceding depth is associated with two child filters at the next depth, whose frequency cutoffs evenly divide the frequency range of their parent filter (see Fig. 1 in the main paper). In our implementation, we simply connect a filter at the preceding depth with its two child filters at the next depth, and organize the filters along the channel dimension. Each filter is instantiated as a Sinc(·) filter, which comprises a learnable high frequency cutoff and a learnable low frequency cutoff. Table IV lists the network architecture (layer name, filter number, output size); the input is a 5 s audio waveform.

Finally, the overall density map for an input audio clip is obtained by summing all events' density maps together; the sum (or integral) of the overall density map equals the sound count of the audio (in the illustrated case, 2). The comparison between the three counting methods is shown in sub-figures C, D and E, respectively. Given the feature representation learned by the dyadic decomposition front-end and the backbone network, we use either 1) one fully-connected (FC) layer to reduce the channel dimension to 1 while keeping the time dimension, giving a vector of the same size as the density map, which we can use to regress the density map (sub-figure C) or to classify the events at each time step (the SED method, a multi-label classification, sub-figure E); or 2) two fully-connected (FC) layers to consecutively reduce both the channel dimension and the time dimension to 1, so as to directly predict the count number (sub-figure D).
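As a rough sketch of the three prediction heads compared above, the module below maps backbone features to a density map, per-step SED logits, and a direct count; the feature shape, class count and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CountingHeads(nn.Module):
    """Density-map, SED and direct-regression heads on top of backbone features (a sketch)."""

    def __init__(self, channels=256, time_steps=128, n_classes=4):
        super().__init__()
        self.to_density = nn.Linear(channels, 1)        # sub-figure C: framewise density value
        self.to_sed = nn.Linear(channels, n_classes)    # sub-figure E: per-step multi-label logits
        self.to_count = nn.Sequential(                  # sub-figure D: collapse channels, then time
            nn.Linear(channels, 1), nn.Flatten(), nn.Linear(time_steps, 1))

    def forward(self, feat):                            # feat: (batch, channels, time_steps)
        x = feat.transpose(1, 2)                        # (batch, time_steps, channels)
        density = self.to_density(x).squeeze(-1)        # (batch, time_steps)
        sed_logits = self.to_sed(x)                     # (batch, time_steps, n_classes)
        count = self.to_count(x).squeeze(-1)            # (batch,)
        return density, sed_logits, count
```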

