SHORT-TERM MEMORY CONVOLUTIONS

Abstract

The real-time processing of time series signals is a critical issue for many real-life applications. Real-time processing is especially important in the audio domain, as human perception of sound is sensitive to any kind of disturbance in perceived signals, especially the lag between auditory and visual modalities. The rise of deep learning (DL) models complicated the landscape of signal processing. Although they often have superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC), together with its transposed counterpart. The main advantage of STMC is its low latency, comparable to long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and to a GhostNet model in an acoustic scene classification (ASC) task. In the case of speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting the output quality. The inference time for the ASC task was up to 4 times faster while preserving the original accuracy.

1. INTRODUCTION

Convolutional neural networks (CNNs) have risen to become one of the dominant model types in deep learning (DL). The most prominent examples come from computer vision, e.g. image classification (Rawat & Wang, 2017), object detection (Zhao et al., 2019), or image segmentation (Minaee et al., 2021). CNN models have proved effective in certain signal processing tasks, especially where long-term context is not required, such as speech enhancement (Sun et al., 2021), sound source separation (Stoller et al., 2018), or sound event detection (Lim et al., 2017). Some authors showed that convolutional models can achieve performance similar to recurrent neural networks (RNNs) at a fraction of the model parameters (Takahashi & Mitsufuji, 2017). It is also argued that CNNs are easier to parallelize than RNNs (Gui et al., 2019; Liu et al., 2022; Rybalkin et al., 2021; Kong et al., 2021). However, unlike RNNs, which can process incoming data one sample at a time, CNNs require a chunk of data to work correctly. The minimum chunk size is equal to the size of the receptive field, which depends on the kernel sizes, strides, dilation, and the number of convolutional layers. Additionally, overlaps may be required to reduce the undesired edge effects of padding. Hence, standard CNN models are characterized by a higher latency than RNNs. Algorithmic latency, which is related to model requirements and limitations such as the minimal chunk size (e.g. the size of a single FFT frame), a model look-ahead, etc., is inherent to the algorithm. It can be viewed as the delay between output and input signals under the assumption that all computations are instantaneous. Computation time is the second component of latency. In the case of CNNs it does not depend linearly on the chunk size, because the whole receptive field has to be processed regardless of the desired output size.
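As noted above, the minimum chunk size a CNN can process equals its receptive field, determined by the kernel sizes, strides, and dilations of its layers. The following sketch (illustrative only; the function name and layer configuration are our own, not part of any specific model in this work) computes the receptive field of a stack of 1-D convolutional layers using the standard recurrence: each layer enlarges the receptive field by (kernel - 1) * dilation scaled by the cumulative stride of the layers below it.

```python
def receptive_field(layers):
    """Receptive field (in input samples) of stacked 1-D convolutions.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input layer upward.
    """
    rf = 1     # one output sample covers one input sample initially
    jump = 1   # spacing of the current layer's inputs in the original signal
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Example: three layers, kernel 3, stride 2, no dilation.
print(receptive_field([(3, 2, 1)] * 3))  # -> 15
```

This makes the latency trade-off concrete: producing even a single output sample requires at least `receptive_field(...)` input samples, and doubling the depth of such a strided stack roughly doubles the cumulative stride at each added layer, so the receptive field, and hence the algorithmic latency, grows quickly with depth.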
In the case of audio-visual signals, humans are able to spot a lag between auditory and visual stimuli above 10 ms (Mcpherson et al., 2016). However, the maximum latency accepted in conversations can be up to 40 ms (Staelens et al., 2012; Jaekl et al., 2015; Ipser et al., 2017). For best human-device

