AN EFFICIENT ENCODER-DECODER ARCHITECTURE WITH TOP-DOWN ATTENTION FOR SPEECH SEPARATION

Abstract

Deep neural networks have shown excellent prospects in speech separation tasks. However, obtaining good results while keeping low model complexity remains challenging in real-world applications. In this paper, we propose TDANet, a bio-inspired efficient encoder-decoder architecture that mimics the brain's top-down attention, reducing model complexity without sacrificing performance. The top-down attention in TDANet is extracted by a global attention (GA) module and cascaded local attention (LA) layers. The GA module takes multi-scale acoustic features as input to extract a global attention signal, which then modulates features of different scales through direct top-down connections. The LA layers use features of adjacent layers as input to extract a local attention signal, which modulates the lateral input in a top-down manner. On three benchmark datasets, TDANet consistently achieved separation performance competitive with previous state-of-the-art (SOTA) methods at higher efficiency. Specifically, TDANet's multiply-accumulate operations (MACs) are only 5% of those of Sepformer, one of the previous SOTA models, and its CPU inference time is only 10% of Sepformer's. In addition, a large-size version of TDANet obtained SOTA results on the three datasets, with MACs still only 10% of Sepformer's and CPU inference time only 24% of Sepformer's. Our study suggests that top-down attention can be a more efficient strategy for speech separation.
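To make the modulation scheme described above concrete, the following is a minimal NumPy sketch of how a single global attention signal can modulate features at several temporal scales. This is our illustration, not the paper's implementation: the function name `ga_modulate`, the mean-pooling over time and scales, and the sigmoid gating are all simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ga_modulate(feats):
    """Hypothetical sketch of top-down modulation: pool every scale over
    time, average the pooled vectors into one global attention signal,
    then gate each scale with that signal."""
    pooled = [f.mean(axis=-1) for f in feats]   # one (N,) vector per scale
    g = sigmoid(np.mean(pooled, axis=0))        # (N,) global attention signal
    return [f * g[:, None] for f in feats]      # gate broadcasts over time

# N = 64 channels; S = 3 down-samplings give 4 scales of lengths 128..16
rng = np.random.default_rng(0)
feats = [rng.standard_normal((64, 128 // 2**s)) for s in range(4)]
out = ga_modulate(feats)
assert all(o.shape == f.shape for o, f in zip(out, feats))
```

Because the gate lies in (0, 1), each feature is attenuated rather than amplified in this toy version; the actual GA module is learned and can both enhance and inhibit features.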

1. INTRODUCTION

At cocktail parties, people's conversations are inevitably disturbed by various sounds (Bronkhorst, 2015; Cherry, 1953), such as environmental noise and extraneous audio signals, potentially degrading the quality of communication. Humans can effortlessly perceive the speech of a target speaker at a cocktail party, which improves the accuracy of speech recognition (Haykin & Chen, 2005). In the speech processing field, the corresponding challenge is to separate different speakers' audio from a mixture, a task known as speech separation. Owing to the rapid development of deep neural networks (DNNs), DNN-based speech separation methods have improved significantly (Luo & Mesgarani, 2019; Luo et al., 2020; Tzinis et al., 2020; Chen et al., 2020; Subakan et al., 2021; Hu et al., 2021; Li & Luo, 2022). As in natural language processing, SOTA speech separation methods now embrace increasingly complex models to achieve better separation performance, such as DPTNet (Chen et al., 2020) and Sepformer (Subakan et al., 2021). These models typically use multiple transformer layers (Vaswani et al., 2017) to capture longer contextual information, which leads to large parameter counts and high computational cost and makes them difficult to deploy on edge devices. We question whether such complexity is always needed to improve separation performance.

The human brain can process large amounts of sensory information with extremely low energy consumption (Attwell & Laughlin, 2001; Howarth et al., 2012), so we turn to the brain for inspiration. Numerous neuroscience studies suggest that, in solving the cocktail party problem, the brain relies on a cognitive process called top-down attention (Wood & Cowan, 1995; Haykin & Chen, 2005; Fernández et al., 2015), which enables humans to focus on task-relevant stimuli and ignore irrelevant distractions.
Specifically, top-down attention modulates (enhances or inhibits) cortical responses to different sensory information (Gazzaley et al., 2005; Johnson & Zatorre, 2005). Through this neural modulation, the brain can focus on the speech of interest and ignore other speech in a multi-speaker scenario (Mesgarani & Chang, 2012). We note that encoder-decoder speech separation networks (e.g., SuDORM-RF (Tzinis et al., 2020) and A-FRCNN (Hu et al., 2021)) contain top-down, bottom-up, and lateral connections, similar to the brain's hierarchical structure for processing sensory information (Park & Friston, 2013). These models mainly simulate the interaction between lower (e.g., A1) and higher (e.g., A2) sensory areas in primates, neglecting the role of higher cortical areas, such as the frontal and occipital cortices, in accomplishing challenging auditory tasks like the cocktail party problem (Bareham et al., 2018; Cohen et al., 2005). Nevertheless, they provide a good framework for applying top-down attention mechanisms. In the speech separation process, the information carried by the encoder and decoder is not always useful, so an automatic mechanism is needed to modulate the features transmitted through the lateral and top-down connections.

We propose an encoder-decoder architecture equipped with top-down attention for speech separation, called TDANet. As shown in Figure 1



Figure 1: Main architecture of TDANet. N and T ′ denote the number of channels and the length of features, respectively. By down-sampling S times, TDANet contains S + 1 features with different temporal resolutions. Here, we set S to 3. The red, blue, and orange arrows indicate bottom-up, top-down, and lateral connections, respectively. (a) The structure of the encoder, where "DWConv" denotes a depthwise convolutional layer with a kernel size of 5 and a stride of 2, followed by GLN. (b) The "Up-sample" layer denotes nearest neighbor interpolation. (c) The structure of the decoder, where the LA layer adaptively modulates features of different scales through a set of learnable parameters.
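As a rough sketch of the down-sampling path in Figure 1(a), the following NumPy code builds the S + 1 multi-scale features and a nearest-neighbor up-sampling step. This is illustrative only: the plain-Python depthwise convolution, the shared per-channel kernel across layers, and the concrete sizes (N = 64, T ′ = 128, S = 3) are our assumptions, not the authors' code, and normalization is omitted.

```python
import numpy as np

def dwconv_down(x, kernel):
    """Depthwise 1-D convolution, kernel size 5, stride 2, padding 2:
    each channel is filtered independently and the length is halved."""
    n, t = x.shape
    xp = np.pad(x, ((0, 0), (2, 2)))        # zero-pad 2 frames on each side
    out_t = (t + 2 * 2 - 5) // 2 + 1        # standard conv output length
    out = np.zeros((n, out_t))
    for i in range(out_t):
        out[:, i] = np.sum(xp[:, 2 * i:2 * i + 5] * kernel, axis=1)
    return out

def upsample_nearest(x):
    """Nearest-neighbor interpolation: repeat each frame twice."""
    return np.repeat(x, 2, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 128))          # (N, T') input features
k = rng.standard_normal((64, 5))            # one 5-tap filter per channel
feats = [x]
for _ in range(3):                          # S = 3 down-samplings
    feats.append(dwconv_down(feats[-1], k))
print([f.shape[-1] for f in feats])         # lengths 128, 64, 32, 16
```

The list `feats` corresponds to the S + 1 resolutions in the figure, and `upsample_nearest` restores a coarse feature to the length of the next finer scale so it can be fused via a top-down connection.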

, TDANet adds a global attention (GA) module to the encoder-decoder architecture, which modulates encoder features of different scales in a top-down manner through an attention signal obtained from multi-scale features. The modulated features are gradually restored to high-resolution auditory features through local attention (LA) layers in the top-down decoder. The experimental results demonstrated that TDANet achieved competitive separation performance on three datasets (LRS2-2Mix, Libri2Mix (Cosentino et al., 2020), and WHAM! (Wichern et al., 2019)) at far lower computational cost. Taking the LRS2-2Mix dataset as an

