SOUNDCOUNT: SOUND COUNTING FROM RAW AUDIO WITH DYADIC DECOMPOSITION NEURAL NETWORK

Abstract

In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sound events in raw audio characterized by a high degree of polyphonicity. We do so by proposing a novel end-to-end trainable neural network (which we call DyDecNet, comprising a dyadic decomposition front-end and a backbone network), and by quantifying the difficulty of counting as a function of sound polyphonicity. Unlike existing audio-processing methods that uniformly apply a set of frequency-selective filters to the raw waveform in a one-stage manner to obtain a time-frequency (TF) representation, our dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis, producing the TF representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform produced by a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response: the higher-half child filter encodes the detail and the lower-half child filter encodes the approximation. We further introduce an energy gain normalization to handle sound loudness variance and spectrum overlap, applying it to each intermediate parent waveform before feeding it to the two child filters. We argue that such a dyadic decomposition front-end better characterizes the sound polyphonicity and concurrency that commonly arise in the sound counting task, while introducing negligible extra computational cost. To better quantify the difficulty of sound counting, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on three main sound datasets from different domains: bioacoustic sound (both synthetic and real-world), telephone-ring sound and music sound. Comprehensive experimental results show that our method outperforms existing sound event detection (SED) methods significantly.
Moreover, the dyadic decomposition front-end can be adopted by existing methods as a general-purpose front-end to improve their performance accordingly.

1. INTRODUCTION

Suppose you went to the seaside and heard a cacophony of seagulls, squawking and squabbling. An interesting question that naturally arises is whether you can tell, from the sound alone, how many seagulls are flocking around you. Although this is a toy example, the sound "crowd counting" problem has a number of important applications. For example, passive acoustic monitoring (PAM) is widely used to record sounds in natural habitats, providing measures of ecosystem diversity and density [2, 15, 12]. Sound counting helps to quantify and map noise pollution by counting the number of individual polluting events [4]. It can also be used in music content analysis [24]. Despite its importance, research on sound counting has lagged far behind its well-established crowd counting counterparts based on images [49, 46], video [29] or joint audio-visual data [22]. We conjecture that this lack of exploration stems from three main factors. First, sound counting has long been regarded as a problem already solved by sound event detection (SED) methods [35, 9, 1, 19], since SED goes further and identifies each sound event's (e.g. a bird call) start time, end time and semantic identity; the count then becomes easily accessible by simply summing all detected events. Second, current SED only tags whether a class of sound event is present within a window, regardless of the number of concurrent sound sources of the same class, such as a series of baby cries or multiple bird calls [41]. Third, labelling acoustic data is technically harder and more time-consuming than labelling images, due to the overlap of concurrent and diverse sources; the lack of well-labelled sound data for crowded sound scenes naturally hampers research progress. Existing SED sound datasets [1, 20] capture simple acoustic scenarios with low polyphony and small event variance. This simplified acoustic scenario in turn makes the sound counting task tractable for SED methods.
However, when the sound scene becomes much more complex, with highly concurrent sound events, SED methods quickly lose their ability to discriminate between different sound events [38, 9]. A study dedicated to the sound counting problem is therefore desirable and overdue. In this paper, we study the general sound counting problem in highly polyphonic, cluttered and concurrent situations. Whilst the challenges of image-based crowd counting mainly lie in spatial density, occlusion and perspective distortion, the challenges of sound counting are two-fold. First, acoustic scenes are additive mixtures of sounds along both the time and frequency axes, making it difficult to count overlapping sounds (temporal concurrence and spectrum overlap). Second, there is large variance in event loudness due to spherical signal attenuation with distance (loudness variance). Tackling these challenges requires a more principled way of processing the raw sound waveform so as to better localize sounds in the time-frequency domain. In this paper, we propose a novel dyadic decomposition neural network that learns a sound density representation capable of estimating cardinality directly from the raw sound waveform. Unlike existing waveform-processing methods, which all apply frequency-selective filters to the raw waveform in a single stage [19, 10, 48, 18, 14], our network progressively decomposes the raw waveform in a dyadic manner, where the intermediate waveform produced by each parent filter is further processed by its two child filters. The two child filters evenly split the parent filter's frequency response, with one child filter encoding the waveform approximation (the lower-half frequency response) and the other encoding the waveform details (the higher-half frequency response).
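To make the dyadic splitting concrete, the sketch below recursively halves each parent band into a lower-half (approximation) and higher-half (detail) child band. It is our illustration, not the paper's implementation: DyDecNet learns the cutoff frequencies during training, whereas this sketch uses fixed windowed-sinc band-pass filters, and the filter length and window are our assumptions.

```python
import numpy as np

def bandpass_kernel(f_lo, f_hi, sr, taps=129):
    """Windowed-sinc band-pass FIR kernel for the band [f_lo, f_hi] in Hz."""
    t = np.arange(taps) - (taps - 1) / 2
    # Difference of two low-pass sinc kernels yields a band-pass response.
    h = (2 * f_hi / sr) * np.sinc(2 * f_hi * t / sr) \
        - (2 * f_lo / sr) * np.sinc(2 * f_lo * t / sr)
    return h * np.hamming(taps)

def dyadic_decompose(x, f_lo, f_hi, sr, depth):
    """Recursively split the band [f_lo, f_hi] into dyadic halves.

    Returns the list of leaf sub-band waveforms, i.e. the rows of the
    coarse-to-fine time-frequency representation.
    """
    if depth == 0:
        return [x]
    mid = (f_lo + f_hi) / 2  # even split of the parent band
    lo = np.convolve(x, bandpass_kernel(f_lo, mid, sr), mode="same")  # approximation
    hi = np.convolve(x, bandpass_kernel(mid, f_hi, sr), mode="same")  # detail
    return (dyadic_decompose(lo, f_lo, mid, sr, depth - 1)
            + dyadic_decompose(hi, mid, f_hi, sr, depth - 1))

sr = 16000
x = np.random.randn(sr)  # 1 second of toy "audio"
bands = dyadic_decompose(x, 0.0, sr / 2, sr, depth=3)
print(len(bands))  # 2**3 = 8 leaf sub-bands
```

A decomposition of depth d produces 2^d leaf sub-bands, each the same length as the input, so the front-end's cost grows only with the number of filter applications rather than with any intermediate spectrogram resolution.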
To accommodate sound loudness variance, spectrum overlap and temporal concurrence, we further propose an energy gain normalization module that regularizes each intermediate parent waveform before it is fed to the two child filters for further processing. This hierarchical dyadic decomposition front-end enables the network to learn a robust TF representation in a multi-stage, coarse-to-fine manner, while introducing negligible extra computational cost. By making each filter's frequency cutoff parameters learnable and self-adjustable during optimization in a data-driven way, the final learned TF representation better characterizes sound presence in the time and frequency domains. Following the front-end, we add a backbone network that learns a time-framewise representation. This representation can be used to derive the final sound count by directly regressing the count, regressing a density map (the option we choose), or following the SED pipeline. Apart from the network, we further propose three polyphony-aware metrics to quantify the difficulty of a sound counting task: polyphony ratio, maximum polyphony and mean polyphony. We discuss the feasibility of these three metrics in detail. We run experiments on four cross-domain sound datasets: a bird sound set (both real-world and synthetic), a telephone-ring sound set (synthetic), and music sound [24] (real-world). Experimental results show that our method (DyDecNet) significantly outperforms existing SED-based methods on both real-world and synthetic datasets. Replacing the one-stage waveform-processing front-end of existing methods with our dyadic decomposition front-end dramatically improves their performance accordingly. Since the real-world datasets exhibit relatively low polyphony, we additionally synthesize a bird sound dataset with a much higher polyphony level and greater spectral overlap.
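The exact form of the energy gain normalization is not specified at this point in the paper, so the following is only a rough, hypothetical stand-in: a frame-wise RMS gain control with exponential smoothing that brings loud and faint passages to comparable scales before the waveform reaches the child filters. The frame size, smoothing factor and the use of RMS are all our assumptions.

```python
import numpy as np

def energy_gain_normalize(wave, frame=256, alpha=0.9, eps=1e-8):
    """Hypothetical energy gain normalization: divide each frame by a
    smoothed estimate of its RMS energy, reducing loudness variance
    between faint (distant) and loud (close) sound events."""
    n = len(wave) // frame
    out = np.empty(n * frame)
    smoothed = None
    for i in range(n):
        seg = wave[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(seg ** 2) + eps)
        # Exponential moving average keeps the gain stable over time.
        smoothed = rms if smoothed is None else alpha * smoothed + (1 - alpha) * rms
        out[i * frame:(i + 1) * frame] = seg / (smoothed + eps)
    return out

# Faint passage followed by a 100x louder passage.
x = np.concatenate([0.01 * np.random.randn(4096),
                    1.00 * np.random.randn(4096)])
y = energy_gain_normalize(x)
```

After normalization, the energy ratio between the loud and faint halves is greatly reduced, which is the behaviour the module is meant to provide, whatever its exact parameterization.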
The synthesized dataset has two subsets: one involves four kinds of bird sound (a heterophonic scenario); the other contains just one kind of sound (a homophonic scenario). Experiments on this synthetic dataset help to test performance in highly polyphonic situations. In summary, we make three main contributions. First, we propose a dyadic decomposition front-end that decomposes the raw waveform in a multi-stage, coarse-to-fine manner, which better handles loudness variance, spectrum overlap and temporal concurrence. Second, we propose a new set of polyphony-aware evaluation metrics to comprehensively and objectively quantify the difficulty of sound counting. Third, we show the effectiveness and generalization of DyDecNet on sound datasets across different domains.
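As an illustration of how such polyphony-aware metrics can be computed, the sketch below rasterizes annotated event intervals onto a time grid and reads the statistics off the frame-wise event counts. The precise definitions used in the paper may differ; in particular, taking the ratio and mean over active frames (rather than all frames) is our assumption.

```python
import numpy as np

def polyphony_metrics(events, hop=0.01):
    """Plausible stand-ins for the three polyphony-aware metrics.

    `events` is a list of (start, end) times in seconds. We compute:
      - max polyphony: peak number of simultaneously active events;
      - mean polyphony: average event count over frames with >= 1 event;
      - polyphony ratio: fraction of active frames with >= 2 events.
    """
    t_end = max(end for _, end in events)
    grid = np.arange(0.0, t_end, hop)          # frame start times
    counts = np.zeros(len(grid), dtype=int)
    for start, end in events:
        counts[(grid >= start) & (grid < end)] += 1
    active = counts > 0
    return {
        "max_polyphony": int(counts.max()),
        "mean_polyphony": float(counts[active].mean()),
        "polyphony_ratio": float((counts >= 2).sum() / active.sum()),
    }

# Three overlapping toy events: all three are active during [0.5, 1.0).
m = polyphony_metrics([(0.0, 1.0), (0.5, 1.5), (0.5, 2.0)])
print(m)  # max_polyphony = 3
```

Under these definitions, a recording of back-to-back but non-overlapping calls has max polyphony 1 and polyphony ratio 0, while a dawn chorus with many simultaneous singers scores high on all three, matching the intuition that it is the harder counting problem.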

2. RELATED WORK

Crowd counting from images or audio-visual data has been thoroughly studied in recent years [49, 22]; its goal is to estimate the number of instances in very crowded scenes (e.g. pedestrians in a train station) that cannot be efficiently handled by object detection methods. Methods for image-based crowd counting have chronologically evolved from early detection-based approaches [26] to later regression-

