SHORT-TERM MEMORY CONVOLUTIONS

Abstract

The real-time processing of time series signals is a critical issue for many real-life applications. Real-time processing is especially important in the audio domain, as human perception of sound is sensitive to any kind of disturbance in perceived signals, especially the lag between the auditory and visual modalities. The rise of deep learning (DL) models has complicated the landscape of signal processing. Although they often achieve superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose a novel method for minimizing inference-time latency and memory consumption, called Short-Term Memory Convolution (STMC), together with its transposed counterpart. The main advantage of STMC is a latency comparable to that of long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate the application of this solution to a U-Net model for a speech separation task and to a GhostNet model for an acoustic scene classification (ASC) task. For speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting output quality. Inference for the ASC task was up to 4 times faster while preserving the original accuracy.

1. INTRODUCTION

Convolutional neural networks (CNNs) have become one of the dominant types of models in deep learning (DL). The most prominent example is computer vision, e.g. image classification (Rawat & Wang, 2017), object detection (Zhao et al., 2019), or image segmentation (Minaee et al., 2021). CNN models have proved effective in certain signal processing tasks, especially where long-term context is not required, such as speech enhancement (Sun et al., 2021), sound source separation (Stoller et al., 2018), or sound event detection (Lim et al., 2017). Some authors showed that convolutional models can achieve performance similar to recurrent neural networks (RNNs) at a fraction of the model parameters (Takahashi & Mitsufuji, 2017). It is also argued that CNNs are easier to parallelize than RNNs (Gui et al., 2019; Liu et al., 2022; Rybalkin et al., 2021; Kong et al., 2021). However, unlike RNNs, which can process incoming data one sample at a time, CNNs require a chunk of data to work correctly. The minimum chunk size is equal to the size of the receptive field, which depends on the kernel sizes, strides, dilations, and the number of convolutional layers. Additionally, overlaps may be required to reduce the undesired edge effects of padding. Hence, standard CNN models are characterized by a higher latency than RNNs. Algorithmic latency, which is related to model requirements and limitations such as the minimal chunk size (e.g. the size of a single FFT frame), a model look-ahead, etc., is inherent to the algorithm. It can be viewed as the delay between the output and input signals under the assumption that all computations are instantaneous. Computation time is the second component of latency. In the case of CNNs it does not depend linearly on the chunk size, because the whole receptive field has to be processed regardless of the desired output size.
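To make this dependence concrete, the receptive field of a stack of 1-D convolutional layers can be computed with the standard recurrence sketched below. This is an illustrative helper under simplified assumptions (1-D, no pooling), not code from the model discussed in this work:

```python
# Minimal sketch: receptive field of stacked 1-D convolutions as a
# function of kernel size, stride, and dilation (hypothetical helper).

def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, ordered
    from input to output. Returns the number of input samples that one
    output sample depends on."""
    rf, jump = 1, 1  # current receptive field and cumulative stride
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Three layers with kernel 3, stride 1, no dilation:
# rf = 1 + 2 + 2 + 2 = 7 input samples per output sample.
print(receptive_field([(3, 1, 1)] * 3))  # -> 7
```

With strides or dilations the receptive field grows much faster, which is why deeper CNNs need correspondingly larger input chunks before they can emit a single valid output sample.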
In the case of audio-visual signals, humans are able to spot a lag between auditory and visual stimuli above 10 ms (Mcpherson et al., 2016). However, the maximum latency accepted in conversations can be up to 40 ms (Staelens et al., 2012; Jaekl et al., 2015; Ipser et al., 2017). For best human-device interaction, in many audio applications the buffer size is set to match the maximum acceptable latency.

1.1. RELATED WORKS

Many researchers have presented solutions addressing the problem of latency minimization in signal processing models. Wilson et al. (2018) studied a model consisting of bidirectional LSTM (BLSTM), fully connected (FC), and convolutional layers. Firstly, they proposed to use a unidirectional LSTM instead of the BLSTM and found that it reduces latency by a factor of 2 while having little effect on performance. Secondly, they proposed to alter the receptive field of each convolutional layer to be causal rather than centered on the currently processed data point. In addition, they proposed to shift the input features with respect to the output, effectively providing a certain amount of future context, which the authors referred to as look-ahead. The authors showed that predicting future spectrogram masks comes at a significant cost in accuracy, with a reduction of 6.6 dB in signal-to-distortion ratio (SDR) for a 100 ms shift between input and output, compared to a zero-look-ahead causal model. It was argued that this effect occurs because the model is not able to respond immediately to changing noise and speech characteristics. Romaniuk et al. (2020) further modified the above-mentioned model by removing the LSTM and FC layers and replacing the convolutional layers with depth-wise convolutions followed by point-wise convolutions, among other changes. These changes achieved a 10-fold reduction in fused multiply-accumulate operations per second (FMA/s). However, the reduction in computational complexity was accompanied by a 10% reduction in signal-to-noise ratio (SNR) performance. The authors also introduced partial caching of input STFT frames, which they referred to as incremental inference. When a new frame arrives, it is padded with the recently cached input frames to match the receptive field of the model. Subsequently, the model processes the composite input and yields a corresponding output frame.

Kondratyuk et al. (2021) introduced a new family of CNNs for online classification of videos (MoViNets). The authors developed a more comprehensive approach to data caching, namely layer-wise caching instead of input-only caching. MoViNets process videos in small consecutive subclips, requiring constant memory. This is achieved through so-called stream buffers, which cache feature maps at subclip boundaries. Using stream buffers reduces peak memory consumption by up to an order of magnitude in large 3D CNNs; a less significant, 2-fold memory reduction was noted in smaller networks. Since the method is aimed at online inference, stream buffers are best used in conjunction with causal networks. The authors enforced causality by moving the right (future) padding to the left side (past). It was reported that stream buffers lead to an approximately 1% reduction in model accuracy and a slight increase in computational complexity, which however might be implementation dependent.
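The causality trick of moving the future padding to the past can be illustrated with a toy 1-D example. The snippet below is a minimal NumPy sketch, not the MoViNets implementation: it shows that with left-only padding, an output sample no longer depends on any future input sample, whereas centered ("same") padding does leak future information.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution (cross-correlation) of x with kernel w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

w = np.array([1., 1., 1.])               # toy kernel, k = 3
x = np.arange(6, dtype=float)

centered = conv1d(np.pad(x, (1, 1)), w)  # "same" padding: uses one future sample
causal = conv1d(np.pad(x, (2, 0)), w)    # all padding moved to the past

# Perturb a *future* sample and compare the outputs at time step 2:
x2 = x.copy()
x2[3] = 100.0
assert conv1d(np.pad(x2, (2, 0)), w)[2] == causal[2]    # causal: unchanged
assert conv1d(np.pad(x2, (1, 1)), w)[2] != centered[2]  # centered: leaks future
```

Both variants produce an output of the same length as the input; only the alignment between input and output time steps changes.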

1.2. NOVELTY

In this work we propose a novel approach to data caching in convolutional layers, called Short-Term Memory Convolution (STMC), which allows processing of chunks of arbitrary size without any computational overhead (i.e. after model initialization the computational cost of processing a chunk depends linearly on its size), thus reducing computation time. The method is model- and task-agnostic, as its only prerequisite is the use of stacked convolutional layers. We also systematically address the problem of algorithmic latency (namely, look-ahead) by discussing the causality of transposed convolutional layers and proposing the adjustments necessary to guarantee causality of auto-encoder-like CNN models. The STMC layers are based on the following principles:

• Input data is processed in chunks of arbitrary size in an online mode.

• Each chunk is propagated through all convolutional layers, and the output of each layer is cached with shift registers (contrary to input-only caching as in Romaniuk et al. (2020)), providing a so-called past context.

• The past context is never recalculated by any convolutional layer. When processing a time series, all calculations are performed exactly once, regardless of the processed chunk size.
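The principles above can be sketched in a few lines of NumPy. This is a simplified 1-D illustration of layer-wise caching with shift registers under our own assumptions (single-channel signals, zero-initialized caches), not the actual STMC implementation; the class and function names are hypothetical.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution (cross-correlation) of signal x with kernel w."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

class CachedConv1d:
    """A causal conv layer with a shift-register cache holding its last
    (kernel - 1) input samples, so past context is never recomputed."""
    def __init__(self, w):
        self.w = w
        self.cache = np.zeros(len(w) - 1)    # zero-initialized past context

    def __call__(self, chunk):
        x = np.concatenate([self.cache, chunk])
        self.cache = x[-(len(self.w) - 1):]  # shift-register update
        return conv1d(x, self.w)             # convolves only the new chunk

def run_chunked(layers, signal, chunk_size):
    """Stream a signal through stacked cached layers, chunk by chunk.
    Each layer caches its own output context, so every sample is
    convolved exactly once per layer, for any chunking of the input."""
    out = []
    for i in range(0, len(signal), chunk_size):
        y = signal[i:i + chunk_size]
        for layer in layers:
            y = layer(y)
        out.append(y)
    return np.concatenate(out)
```

In this sketch, streaming with any chunk size yields exactly the same output as offline causal processing of the whole signal, e.g. `run_chunked(layers, sig, 1)` and `run_chunked(layers, sig, len(sig))` agree sample for sample (with freshly initialized caches).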

