SOUNDCOUNT: SOUND COUNTING FROM RAW AUDIO WITH DYADIC DECOMPOSITION NEURAL NETWORK

Abstract

In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We do so by proposing a novel end-to-end trainable neural network, which we call DyDecNet, comprising a dyadic decomposition front-end and a backbone network, and by quantifying the difficulty of counting as a function of sound polyphonicity. Unlike existing audio-processing methods that uniformly apply a set of frequency-selective filters to the raw waveform in a one-stage manner to obtain a time-frequency (TF) representation, our dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis, producing the TF representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform convolved by a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to compensate for sound loudness variance and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. We argue that such a dyadic decomposition front-end better characterizes the sound polyphonicity and concurrency common in the sound counting task, while introducing negligible extra computational cost. To better quantify sound counting difficulty, we further design three polyphony-aware metrics: polyphony ratio, max polyphony and mean polyphony. We test DyDecNet on sound datasets from three domains: bioacoustic sound (both synthetic and real-world), telephone-ring sound and music sound. Comprehensive experimental results show that our method significantly outperforms existing sound event detection (SED) methods.
The dyadic decomposition front-end can also serve as a general front-end for existing methods, improving their performance accordingly.
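To make the dyadic idea concrete, the sketch below implements a fixed-filter cascade in the spirit of the front-end described above: at each stage the current approximation band is split by a pair of half-band child filters into a detail (upper half) and a new approximation (lower half). This is only a minimal illustration under stated assumptions: the Haar-like filter pair and the RMS-based energy normalization are illustrative stand-ins, whereas DyDecNet uses learnable filter pairs and a learned energy gain normalization.

```python
# Minimal NumPy sketch of a dyadic decomposition front-end.
# The fixed Haar-like filters and the RMS normalization are
# illustrative stand-ins for DyDecNet's learnable components.
import numpy as np

def dyadic_decompose(x, n_levels=4, eps=1e-8):
    """Split a waveform into n_levels detail bands plus a final
    approximation, coarse-to-fine along the frequency axis."""
    # "Child" filter pair that evenly splits the parent band into a
    # lower half (approximation) and an upper half (detail).
    lo = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass child
    hi = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass child

    bands = []
    approx = x
    for _ in range(n_levels):
        # Stand-in for energy gain normalization of the parent waveform.
        approx = approx / (np.sqrt(np.mean(approx ** 2)) + eps)
        detail = np.convolve(approx, hi, mode="same")[::2]  # upper half
        approx = np.convolve(approx, lo, mode="same")[::2]  # lower half
        bands.append(detail)
    bands.append(approx)
    return bands  # one sub-band per octave, finest (highest) first

# Each level halves the sample rate, so band k covers roughly the
# (Nyquist/2^(k+1), Nyquist/2^k] frequency range of the input.
```

Because only the approximation branch is recursed on and each stage halves the sample rate, the total work is a geometric series in the input length, which is why the multi-stage decomposition adds negligible cost compared with a one-stage filterbank.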

1. INTRODUCTION

Suppose you went to the seaside and heard a cacophony of seagulls, squawking and squabbling. An interesting question that naturally arises is whether you can tell, from the sound alone, how many seagulls are flocking around you. Although a toy example, this sound "crowd counting" problem has a number of important applications. For example, passive acoustic monitoring (PAM) is widely used to record sounds in natural habitats, providing measures of ecosystem diversity and density [2, 15, 12]. Sound counting helps quantify and map noise pollution by counting the number of individual polluting events [4]. It can also be used in music content analysis [24].

Despite its importance, research on sound counting has lagged far behind its well-established crowd-counting counterparts in images [49, 46], video [29] and joint audio-visual analysis [22]. We conjecture that this lack of exploration stems from three main factors. First, sound counting has long been regarded as a problem already solved by sound event detection (SED) methods [35, 9, 1, 19], since SED goes further and identifies each sound event's (e.g., a bird call) start time, end time and semantic identity; the sound count then seems easily obtainable by simply adding up all detected events. Second, current SED only tags whether a class of sound event is present within a window, regardless of the number of concurrent sound sources of the same class, such as a series of baby cries or multiple bird calls [41]. Third, labelling acoustic data is technically harder and more time-consuming than labelling images, due to the overlap of concurrent and diverse sources. The lack

