SKIPW: RESOURCE ADAPTABLE RNN WITH STRICT UPPER COMPUTATIONAL LIMIT

Abstract

We introduce Skip-Window, a method that allows recurrent neural networks (RNNs) to trade off accuracy for computational cost during the analysis of a sequence. Similarly to existing approaches, Skip-Window extends existing RNN cells by adding a mechanism that encourages the model to process fewer inputs. Unlike existing approaches, Skip-Window is able to respect a strict computational budget, making the model more suitable for limited hardware such as edge devices. We evaluate this approach on four datasets: a human activity recognition task, sequential MNIST, IMDB and the adding task. Our results show that Skip-Window often exceeds the accuracy of existing approaches at a lower computational cost while strictly limiting said cost.

1. INTRODUCTION

Since recurrent neural networks (RNNs) were introduced (Williams et al., 1986), they have become one of the reference methods for processing sequences. A typical architecture is the Long Short-Term Memory network (LSTM), which enabled improvements in natural language processing tasks such as large-vocabulary speech recognition (Sak et al., 2014; Li & Wu, 2015). Combined with CNNs, they have also reached state-of-the-art performance in automatic image captioning (Vinyals et al., 2015). Deep learning models are now brought closer to the user rather than running in a distant cloud, reducing latency and network congestion and improving data security and privacy. However, smartphones and other user devices impose additional constraints such as limited computation or energy. Handling these constraints has become an active research topic (Zhang et al., 2017; 2018; Howard et al., 2019; Wu et al., 2019; Cai et al., 2020). User devices can also host multiple processes that run simultaneously and may start or stop abruptly, modifying the constraints affecting each process. Few works have considered models that can be modified at run time to adapt to an evolving computational limit (Yu et al., 2019; Yu & Huang, 2019; Guerra et al., 2020; Jin et al., 2020). However, none of these focuses on sequences, and therefore none addresses the problem of adapting the model in the middle of a sequence.

In this context, this paper introduces Skip-Window (SkipW), a flexible recurrent neural network architecture: its computational cost can be dynamically adapted during the analysis of a sequence to meet changing real-time constraints. The proposed architecture can be combined with any RNN cell and makes it possible to strictly limit the computational resources used so that a given budget is never exceeded. Furthermore, empirical experiments on four datasets (the adding task, MNIST, IMDB and HAR-2D-POSE) demonstrate that this subsampling architecture is interesting in itself: Skip-Window matches or exceeds the accuracy of existing approaches for a given computational cost. In addition, measurements on specific processors show that SkipW yields real computational and energy savings.

2. RELATED WORK

Typically, RNNs maintain a "state", a vector of variables, over time. This state is supposed to accumulate relevant information and is updated recursively: each input of the sequence is typically a) processed by some deep layers and b) then combined with the previous state through some other deep layers to compute the new state. Hence, the RNN can be seen as a function taking a sequence of inputs x = (x_1, ..., x_T) and recursively computing a sequence of states s = (s_1, ..., s_T), where each state s_t is computed from s_{t-1} and x_t by a cell S of the RNN. As neural networks are increasingly run on limited hardware, recent research has focused on controlling their computational cost.
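The recursive state computation described above can be illustrated with a small sketch. This is a minimal NumPy example under stated assumptions: the tanh cell and the weight shapes are hypothetical illustrations, not tied to any specific architecture discussed here.

```python
import numpy as np

def run_rnn(cell, x, state_dim):
    """Recursively apply an RNN cell S to a sequence of inputs.

    cell(s_prev, x_t) returns the new state s_t = S(s_{t-1}, x_t).
    Returns the full sequence of states (s_1, ..., s_T).
    """
    T = len(x)
    states = np.zeros((T, state_dim))
    s_prev = np.zeros(state_dim)      # s_0: initial state
    for t in range(T):
        s_prev = cell(s_prev, x[t])   # s_t = S(s_{t-1}, x_t)
        states[t] = s_prev
    return states

# Hypothetical cell: a single tanh layer combining state and input.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1   # state-to-state weights
U = rng.standard_normal((4, 3)) * 0.1   # input-to-state weights
cell = lambda s, x: np.tanh(W @ s + U @ x)

states = run_rnn(cell, rng.standard_normal((5, 3)), state_dim=4)
```

Every architecture discussed below keeps this outer loop and changes only whether (or how much of) the cell is evaluated at each step.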

2.1. FLEXIBLE NEURAL NETWORKS

A few architectures have recently been designed to adapt the computational complexity of a Deep Neural Network (DNN) without reloading the whole model. This can be achieved by removing/adding neurons (Yu et al., 2019; Yu & Huang, 2019) or by modifying the quantization of the weights (Guerra et al., 2020; Jin et al., 2020). An efficient embedding of a mixture of Convolutional Neural Networks (CNNs) also makes it possible to add or remove several models at the same time, hence changing the computational cost (Ruiz & Verbeek, 2019).

2.1.1. THRRNN

For RNNs specifically, ThrRNN (Lambert et al., 2020) aims to control computation time by not processing some inputs. Skipping is controlled by a binary update gate u_t, and the trade-off between the average accuracy and the average number of updates can be modified during inference by changing a single parameter thr. ThrRNN can wrap any RNN cell S:

u_t = f_binarize(ũ_t, thr) = 0 if ũ_t < thr, 1 otherwise    (1)
∆ũ_t = σ(W s_t + b)    (2)
ũ_{t+1} = u_t ∆ũ_t + (1 - u_t)(ũ_t + min(∆ũ_t, 1 - ũ_t))    (3)
s_t = u_t S(s_{t-1}, x_t) + (1 - u_t) s_{t-1}    (4)

When an input is processed, the update gate computes the quantity ∆ũ_t, which determines how many inputs will be skipped: in practice, the ∆ũ_t are accumulated in ũ_t until ũ_t ≥ thr.
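The skipping mechanism above can be sketched in a few lines. This is a minimal NumPy sketch, not the authors' implementation: the cell, the gate parameters W and b, and the initialization ũ_1 = 1 (so the first input is always processed) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def thr_rnn(cell, x, state_dim, W, b, thr=0.5):
    """Sketch of the ThrRNN skipping mechanism.

    cell is any RNN cell S(s_prev, x_t); W, b parameterize the
    update gate. Returns the states and the number of cell updates.
    """
    s = np.zeros(state_dim)
    u_tilde = 1.0                      # assumption: process the first input
    n_updates = 0
    states = []
    for t in range(len(x)):
        u = 1.0 if u_tilde >= thr else 0.0        # binarize the gate
        if u == 1.0:
            s = cell(s, x[t])                     # update branch of the state
            n_updates += 1                        # (skip branch keeps s as-is)
        delta = float(sigmoid(W @ s + b))         # gate increment
        # Reset the accumulator after an update, otherwise accumulate
        # delta until the threshold thr is reached.
        u_tilde = u * delta + (1 - u) * (u_tilde + min(delta, 1 - u_tilde))
        states.append(s.copy())
    return np.array(states), n_updates
```

With thr = 0 every input is processed; raising thr makes the accumulator take longer to reach the threshold, so more inputs are skipped, which is the single-parameter trade-off described above.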

2.2. RECURRENT NEURAL NETWORK WITH LOW COMPUTATIONAL COMPLEXITY

Several architectures have been proposed to limit or reduce the computational cost of RNNs, but this cost cannot be adapted at inference. A first class of architectures dynamically reduces computation based on the input. SkipRNN (Campos et al., 2018) predates and is similar to ThrRNN, except that the binarization function does not change. A similar mechanism has been proposed by Zhang et al. (2019). Other architectures directly select the next input to process (Yeung et al., 2016; Yu et al., 2017; Hansen et al., 2019; Song et al., 2018). Early exit has also been investigated by Dennis et al. (2019). Tao et al. (2019) also use x_t as input to an update gate, as do Seo et al. (2018); Jernite et al. (2017); Li et al. (2020); however, these do not skip any input but perform partial state updates.

A second class of architectures focuses on reducing the overall cost of the RNN. FastRNN is an RNN augmented with a residual connection with two extra scalar parameters, and FastGRNN improves on FastRNN: the residual connection is extended to a gate and the RNN matrices are low-rank, sparse and quantized (Kusupati et al., 2018). Other architectures reduce the RNN length: Chan et al. (2016) train an encoder to reduce the input length, while Yeung et al. (2016); Shan et al. (2018); Chen et al. (2018) propose various mechanisms to summarize subsequences or windows of inputs.

2.3. RECURRENT NEURAL NETWORK WITH HIERARCHICAL-DEPENDENT COMPLEXITY

A class of architectures focuses on the concept of hierarchy levels to reduce complexity. These methods are mainly used in the context of multi-layer RNNs, where each layer is supposed to model a different level of the hierarchy (e.g. for a corpus the levels could be documents, paragraphs, sentences, words, letters). These approaches rely on the assumption that a hierarchical separation exists within a sequence of inputs, which might not always be the case.

