CONTINUAL TRANSFORMERS: REDUNDANCY-FREE ATTENTION FOR ONLINE INFERENCE

Abstract

Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63× and 2.6×, respectively, while retaining predictive performance.

1. INTRODUCTION

Many real-life usage scenarios such as the perception in self-driving cars and live monitoring of critical resources process a continual stream of inputs and require near-instantaneous predictions per time-step. This stands in contrast to what many common benchmarks for deep learning evaluate, namely the operation on distinct batches of data with no inter-batch relationships. Consequently, a plethora of methods have been developed (Ji et al., 2013; Carreira & Zisserman, 2017; Varol et al., 2018; Yan et al., 2018; Heidari & Iosifidis, 2021; Vaswani et al., 2017; Arnab et al., 2021; Bakhtiarnia et al., 2021b), which focus on batch-wise processing, but fail to optimise for online operation, where new information (e.g., a video frame / token) arrives at each step from a continual input stream, and future information is not available at the current time-step. We need a class of networks that operate efficiently on both batches of data and on continual streams. Accordingly, we propose a reformulation of the Transformer Encoder as a Continual Inference Network (CIN, Section 2.1) which accelerates the stream processing on time-series data, while retaining weight-compatibility. Specifically, we derive two variants of Continual Scaled Dot-Product Attention (SDA) for the cases where prior output tokens should and should not be updated after observing a new input token. Notably, our attention formulations reduce the per-step cost of SDA (Vaswani et al., 2017) from time complexity $O(n^2 d)$ to $O(nd)$ and memory complexity $O(n^2)$ to $O(nd)$ and are readily embedded into Continual Multi-Head Attention (MHA) and Continual Transformer Encoder blocks. Finally, we propose the use of Recycling Positional Encoding to accommodate progressive caching of partial attention results for continual data streams. Due to the interdependence of SDA outputs, Continual Transformers are most efficient for shallow architectures.
Shallow Transformers have many applications such as augmentations of CNNs (Touvron et al., 2021), light-weight Natural Language Processing (Cornia et al., 2020), fusion operations in multi-modal (e.g. audio-visual) settings (Chumachenko et al., 2022) and early exit branches in multi-exit architectures (Bakhtiarnia et al., 2021a;b). In our experiments (source code: https://github.com/lukashedegaard/continual-transformers), we validate their exceptional efficiency improvements on common benchmarks in Online Action Detection (Idrees et al., 2017) and Online Audio Classification (Tzanetakis et al., 2001).

2.1. CONTINUAL INFERENCE NETWORKS

A Continual Inference Network (CIN) is a Neural Network which
• is capable of continual step inference without computational redundancy,
• is capable of batch inference corresponding to a non-continual Neural Network,
• produces identical outputs for batch and step inference given identical receptive fields,
• uses one set of trainable parameters for both batch and step inference.

These requirements ensure that a Neural Network has broad applicability for both (offline) batch-wise inference (i.e., most research benchmarks) and online stream processing. Non-CINs can operate on streams of data by caching prior steps in a first-in first-out (FIFO) queue and aggregating them to a full (spatio-)temporal input, which is processed similarly to an offline batch. However, this entails computational redundancy in proportion to the sequence length. CINs perform step-wise inference without such caching and repeated computation. Uni-directional Recurrent Neural Networks are an example of Continual Inference Networks. Their default mode of operation is by time-step and they are easily applied to spatio-temporal batches of data by concatenation of the step-wise outputs. Recently, a modification to the spatio-temporal 3D convolution was proposed (Hedegaard & Iosifidis, 2021), which enables existing 3D CNNs to operate efficiently during continual inference. A similar principle was used to enhance Spatio-temporal Graph Convolutions as well (Hedegaard et al., 2022).
In Section 3, we derive a CIN formulation for Transformer Encoders.

2.2. TRANSFORMER ARCHITECTURES

Initially proposed for sequence-to-sequence modelling in Natural Language Processing, the Transformer (Vaswani et al., 2017) has become a canonical building block in many applications of Deep Learning, including Computer Vision (Dosovitskiy et al., 2021; Arnab et al., 2021; Wang et al., 2021; Carion et al., 2020) and Audio Classification (Gong et al., 2021). Their success can be partly attributed to reduced inductive bias compared with CNNs and RNNs, which allows better adaptations when sufficiently large datasets are available; the Scaled Dot-Product Attention (SDA) maps a set of input tokens to a set of outputs without inherent preconceptions. However, this many-to-many attention exhibits quadratic growth in time and space complexity with the token count $n$ in the set. A great deal of research has sought to improve the efficiency of Transformers (Tay et al., 2020). Block-wise or Chunking methods such as Image Transformer (Parmar et al., 2018) and Vision Transformer (Dosovitskiy et al., 2021) group up entities of a local receptive field into a single block, reducing the $O(n^2)$ complexity to $O(n_b^2)$, where $n_b < n$ is the number of blocks. Techniques such as sliding windows, dilation and pooling can be used to achieve a similar effect (Beltagy et al., 2020). The Reformer (Kitaev et al., 2020) reduces the complexity to $O(n \log n)$ by learning groupings in a data-driven manner via Locality-Sensitive Hashing (LSH). A different paradigm aims to derive approximations of the self-attention matrix. Methods such as Linformer (Wang et al., 2020), Nyströmformer (Xiong et al., 2021) and Performer (Choromanski et al., 2021) reduce the complexity from $O(n^2)$ to $O(n)$. Unlike these efforts, our approach produces the exact same computational outputs for temporal sequences as the original Multi-Head Attention.

3. CONTINUAL TRANSFORMERS

In this work, we examine the use of Transformer Encoders for stream-processing, where we receive one token per time-step. Specifically, the query, key and value inputs constitute a continual stream of $d$-dimensional tokens and we wish to compute the outputs for each step immediately, considering the $n - 1$ prior tokens. We begin our exposition in Section 3.1 by considering the Scaled Dot-Product Attention (SDA) for this task. To alleviate the inefficiencies of SDA, we propose two alternative computational sequences in Section 3.2 and Section 3.3 and compare them to SDA in Section 3.4. Finally, Sections 3.5 to 3.8 build up the full architecture and discuss architectural considerations.

3.1. REGULAR SCALED DOT-PRODUCT ATTENTION

Denoting query, key, and value sequence matrices by $Q, K, V \in \mathbb{R}^{n \times d}$, the regular Scaled Dot-Product Attention first defined by Vaswani et al. (2017) can be written as:

$$\text{Att}(Q, K, V) = D^{-1} A V, \qquad A = \exp\left(Q K^\top / \sqrt{d}\right), \qquad D = \text{diag}\left(A \mathbf{1}_n^\top\right), \tag{1}$$

where $A, D \in \mathbb{R}^{n \times n}$ and $\mathbf{1}_n$ is a row-vector of $n$ ones. In each time-step, we can update $Q$, $K$, and $V$ by discarding their oldest token and prepending a new one in a FIFO manner. This is a common implementation for step-wise inference, e.g. found in the FAIRSEQ library (Ott et al., 2019). Each time-step results in $2n^2 d + 2nd$ multiplications, $2n^2 d - nd - n$ additions, and $n^2$ exponentiations as accounted for in Appendix A.1, which amounts to a time complexity of $O(n^2 d)$ and a memory complexity of $O(n^2)$ originating from the transient feature-map $A$. Furthermore, a constant-sized cache of size $3(n - 1)d$ is needed to store the $n - 1$ latest tokens in $Q$, $K$ and $V$. We could avoid considerable redundancy by caching $Q K^\top$ directly. However, this comes with a memory penalty of $(n - 1)^2$. Fortunately, another computational scheme can be devised.
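The FIFO baseline described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `sda` and `fifo_step` are hypothetical, the newest token is stored in the last row (a choice of this sketch that follows the concatenation order used in Section 3.2), and no max-subtraction is applied for numerical stability because Eq. (1) has none.

```python
import numpy as np

def sda(Q, K, V):
    """Regular Scaled Dot-Product Attention, Att(Q, K, V) = D^-1 A V."""
    d = Q.shape[-1]
    A = np.exp(Q @ K.T / np.sqrt(d))        # (n, n) transient feature-map
    D_diag = A.sum(axis=1, keepdims=True)   # row sums, i.e. the diagonal of D
    return (A @ V) / D_diag                 # (n, d) attention outputs

def fifo_step(Q, K, V, q_new, k_new, v_new):
    """Drop the oldest token, add the newest one, and recompute the whole window."""
    Q = np.vstack([Q[1:], q_new])
    K = np.vstack([K[1:], k_new])
    V = np.vstack([V[1:], v_new])
    return sda(Q, K, V), (Q, K, V)
```

Every call to `fifo_step` rebuilds the full n × n matrix A even though only one row and one column of it changed, which is the redundancy the continual formulations remove.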

3.2. CONTINUAL RETROACTIVE SCALED DOT-PRODUCT ATTENTION

We can compute $D^{-1} A V$ in a step-wise manner using the latest query, key, and value steps, $q_{\text{new}}, k_{\text{new}}, v_{\text{new}} \in \mathbb{R}^{1 \times d}$, alongside appropriately cached partial results. The softmax normalisation with $D^{-1}$ can be efficiently implemented via column-aligned element-wise multiplications (denoted by $\odot$ hereafter) of a column-vector $d = A \mathbf{1}_n^\top$. If we cache the $n - 1$ values for the prior step tokens, i.e. $d_{\text{mem}} = A_{\text{prev}}^{(-n+1:-1)} \mathbf{1}_{n-1}^\top$, alongside $Q$ and $K$, we can define the step update as:

$$d^{(-n+1:-1)} = d_{\text{mem}}^{(-n+2:0)} - \exp\left(Q_{\text{mem}} k_{\text{old}}^\top\right) + \exp\left(Q_{\text{mem}} k_{\text{new}}^\top\right), \tag{2}$$

$$d^{(0)} = \exp\left(\frac{q_{\text{new}}}{\sqrt{d}} \left(K_{\text{mem}} \,\|\, k_{\text{new}}\right)^\top\right) \mathbf{1}_n^\top, \tag{3}$$

where $Q_{\text{mem}}$ ($K_{\text{mem}}$) are the $n - 1$ prior query (key) tokens, $k_{\text{old}}$ is the key from $n$ steps ago, and $\|$ denotes concatenation of matrices along the first dimension. Negative indices indicate prior time-steps. An update for $AV$ can likewise be defined as a function of the $n - 1$ prior values $AV_{\text{mem}}$:

$$AV^{(-n+1:-1)} = AV_{\text{mem}}^{(-n+2:0)} - \exp\left(Q_{\text{mem}} k_{\text{old}}^\top\right) v_{\text{old}} + \exp\left(Q_{\text{mem}} k_{\text{new}}^\top\right) v_{\text{new}}, \tag{4}$$

$$AV^{(0)} = \exp\left(\frac{q_{\text{new}}}{\sqrt{d}} \left(K_{\text{mem}} \,\|\, k_{\text{new}}\right)^\top\right) \left(V_{\text{mem}} \,\|\, v_{\text{new}}\right). \tag{5}$$

Finally, we compute the Continual Retroactive Attention output in the usual manner:

$$\text{CoReAtt}(q_{\text{new}}, k_{\text{new}}, v_{\text{new}}) = d^{-1} \odot AV. \tag{6}$$

A visual depiction of these update steps is provided in Appendix A.2. A time-step can now be computed with $7nd + 2n - 3d$ multiplications, $6nd + 3n - 6d - 3$ additions, and $3n - 2$ exponentials. This time complexity of $O(nd)$ per step and memory complexity of $O(nd)$ is a significant improvement over the prior $O(n^2 d)$ and $O(n^2)$ complexities in Section 3.1.
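To make the update in Eqs. (2) to (6) concrete, the following NumPy sketch performs one retroactive step from cached partial results. It is an illustration under stated assumptions, not the reference implementation: the names `init_state` and `co_re_att_step` and the dictionary-based state are hypothetical, the state keeps all n rows of d and AV for simplicity (only the latest n - 1 are reused), and the 1/√d scaling of Eq. (1) is applied to every exponent so that the result can be compared with the regular SDA.

```python
import numpy as np

def init_state(Q, K, V):
    """Build the cached quantities from one full window of n tokens."""
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return {"Q": Q, "K": K, "V": V, "d": A.sum(axis=1), "AV": A @ V}

def co_re_att_step(state, q_new, k_new, v_new):
    """One Continual Retroactive Attention step for new tokens of shape (d,)."""
    scale = np.sqrt(q_new.shape[0])
    Q_mem, K_mem, V_mem = state["Q"][1:], state["K"][1:], state["V"][1:]
    k_old, v_old = state["K"][0], state["V"][0]

    e_old = np.exp(Q_mem @ k_old / scale)  # contribution of the key that slides out
    e_new = np.exp(Q_mem @ k_new / scale)  # contribution of the key that arrives

    K = np.vstack([K_mem, k_new])
    V = np.vstack([V_mem, v_new])
    a_new = np.exp(q_new @ K.T / scale)    # attention row of the newest query

    # Eqs. (2) and (3): updated row sums for the prior tokens, then the new token.
    d = np.append(state["d"][1:] - e_old + e_new, a_new.sum())
    # Eqs. (4) and (5): updated unnormalised outputs, in the same order.
    AV_prior = state["AV"][1:] - np.outer(e_old, v_old) + np.outer(e_new, v_new)
    AV = np.vstack([AV_prior, a_new @ V])

    new_state = {"Q": np.vstack([Q_mem, q_new]), "K": K, "V": V, "d": d, "AV": AV}
    return AV / d[:, None], new_state      # Eq. (6)
```

Each step touches only vectors and matrices with O(nd) entries; the n × n matrix A is never materialised after initialisation.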

3.3. CONTINUAL SINGLE-OUTPUT SCALED DOT-PRODUCT ATTENTION

Both the Regular and Continual Retroactive Dot-Product Attentions produce attention outputs for the current step, as well as $n - 1$ retroactively updated steps. In cases where retroactive updates are not needed, we can simplify the computation greatly via a Continual Single-Output Dot-Product Attention (CoSiAtt). In essence, the regular SDA is reused, but prior values of $k$ and $v$ are cached between steps (as in Ott et al., 2019), and only the attention corresponding to a single query token $q$ is computed:

$$\text{CoSiAtt}(q, k_{\text{new}}, v_{\text{new}}) = a \left(V_{\text{mem}} \,\|\, v_{\text{new}}\right) / a \mathbf{1}_n^\top, \qquad a = \exp\left(\frac{q}{\sqrt{d}} \left(K_{\text{mem}} \,\|\, k_{\text{new}}\right)^\top\right). \tag{7}$$

A step output is computed with $2nd + 2d$ multiplications, $2nd - d - 1$ additions, and $n$ exponentials. The time and memory complexities remain $O(nd)$ per step. Using the (leading) query $q_{\text{new}}$ as input, the attention is purely causal. Alternatively, prior (lagging) query vectors could be cached and used as query input, though this would introduce a network delay.
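Eq. (7) translates almost line by line into NumPy. The sketch below is illustrative only: the function name `co_si_att_step` and the convention of returning the trimmed caches are assumptions of this example, and `K_mem` and `V_mem` are assumed to hold exactly the n - 1 prior tokens.

```python
import numpy as np

def co_si_att_step(K_mem, V_mem, q, k_new, v_new):
    """One Continual Single-Output Attention step for tokens of shape (d,)."""
    K = np.vstack([K_mem, k_new])               # (n, d) keys in the window
    V = np.vstack([V_mem, v_new])               # (n, d) values in the window
    a = np.exp(q @ K.T / np.sqrt(q.shape[0]))   # (n,) unnormalised attention row
    out = (a @ V) / a.sum()                     # (d,) single output token
    return out, (K[1:], V[1:])                  # caches for the next step
```

Passing the newest query as `q` gives the causal variant described above; passing an older cached query would correspond to the lagging variant.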

3.4. COMPARISON OF SCALED DOT-PRODUCT ATTENTIONS

Assuming $n - 1$ prior $q$, $k$ and $v$ steps have been calculated by the Continual SDA modules, and that $Q = (Q_{\text{mem}} \,\|\, q_{\text{new}})$, $K = (K_{\text{mem}} \,\|\, k_{\text{new}})$, and $V = (V_{\text{mem}} \,\|\, v_{\text{new}})$, we have the correspondence:

$$\text{Att}(Q, K, V)^{(t)} = \text{CoReAtt}(q_{\text{new}}, k_{\text{new}}, v_{\text{new}})^{(t)} = \text{CoSiAtt}(q_t, k_{\text{new}}, v_{\text{new}}). \tag{8}$$

Here, $q_t$ is the $t$-th row of $Q$, i.e. $Q^{(t)}$. During stream processing, the complexity of the Continual Retroactive SDA scales significantly more favourably than that of the regular SDA. For example, the floating point operations (FLOPs) are reduced by 31× when $n = d = 100$ and 308× when $n = d = 1000$. If retroactive output updates are not needed, the Continual Single-Output SDA reduces FLOPs by respectively 100× and 1000×. The scaling properties are detailed in Appendix A.1.
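The quoted reduction factors can be reproduced from the operation counts stated in Sections 3.1 to 3.3. The short script below is a sanity check, not part of the original work; the function names are hypothetical, and it assumes that each multiplication, addition, and exponentiation counts as one FLOP.

```python
def flops_regular(n, d):
    """Per-step operations of the regular SDA (Section 3.1)."""
    mul = 2 * n * n * d + 2 * n * d
    add = 2 * n * n * d - n * d - n
    exp = n * n
    return mul + add + exp

def flops_retroactive(n, d):
    """Per-step operations of the Continual Retroactive SDA (Section 3.2)."""
    mul = 7 * n * d + 2 * n - 3 * d
    add = 6 * n * d + 3 * n - 6 * d - 3
    exp = 3 * n - 2
    return mul + add + exp

def flops_single_output(n, d):
    """Per-step operations of the Continual Single-Output SDA (Section 3.3)."""
    mul = 2 * n * d + 2 * d
    add = 2 * n * d - d - 1
    exp = n
    return mul + add + exp

for n in (100, 1000):
    reg = flops_regular(n, n)
    print(n, round(reg / flops_retroactive(n, n)), round(reg / flops_single_output(n, n)))
# prints:
# 100 31 100
# 1000 308 1000
```

Under these counts the regular SDA costs exactly n times the Single-Output SDA for any n and d, since 4n²d + nd + n² - n = n(4nd + d + n - 1).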

3.5. CONTINUAL MULTI-HEAD ATTENTION

Continual Scaled Dot-Product Attentions can replace regular SDAs directly in a Multi-Head Attention (MHA). Given a new query, key, and value, $q, k, v$, the Continual MHA is defined as

$$\text{CoMHA}(q, k, v) = \left( \Big\|_{i=0}^{h-1} \text{CoAtt}\left(q W_Q^i, k W_K^i, v W_V^i\right) \right) W_O,$$

where $\|$ denotes concatenation of $h$ heads, $W_Q^i, W_K^i \in \mathbb{R}^{d \times d_K / h}$ and $W_V^i \in \mathbb{R}^{d \times d_V / h}$ are the projection matrices of head $i$, and $W_O \in \mathbb{R}^{d_V \times d_O}$ is the output projection. CoAtt can be either CoReAtt or CoSiAtt.
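As an illustration of the definition above, the sketch below implements one Continual MHA step with the Single-Output attention in every head. It is a hedged example rather than the authors' code: the function name `co_si_mha_step`, the stacked weight layout of shape (h, d, d_head), and the per-head cache list are assumptions, and self-attention is assumed so that a single token `x` serves as query, key, and value.

```python
import numpy as np

def co_si_mha_step(caches, x, W_Q, W_K, W_V, W_O):
    """Single-output Continual MHA step for one new token x of shape (d,).

    W_Q, W_K, W_V have shape (h, d, d_head); W_O has shape (h * d_head, d_out).
    caches is a list of h (K_mem, V_mem) pairs with the n - 1 prior projected tokens.
    """
    heads, new_caches = [], []
    for (K_mem, V_mem), Wq, Wk, Wv in zip(caches, W_Q, W_K, W_V):
        q, k, v = x @ Wq, x @ Wk, x @ Wv            # per-head projections
        K, V = np.vstack([K_mem, k]), np.vstack([V_mem, v])
        a = np.exp(q @ K.T / np.sqrt(q.shape[0]))   # scaled by the head dimension
        heads.append((a @ V) / a.sum())
        new_caches.append((K[1:], V[1:]))           # drop the oldest token
    return np.concatenate(heads) @ W_O, new_caches
```

Only the projected keys and values are cached, so the projections of prior tokens are never recomputed.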

3.6. CONTINUAL TRANSFORMER ENCODER

A Continual MHA block can be integrated in a Continual Transformer Encoder block as follows:

$$y = \text{LayerNorm}\left(\text{Sel}(x) + \text{CoMHA}(x, x, x)\right), \qquad z = \text{LayerNorm}\left(y + \text{FF}(y)\right),$$

where $x$ corresponds to the newest step input and $\text{Sel}(\cdot)$ selects a single (last) token of $x$ if CoSiMHA is used, or selects all tokens otherwise. $\text{FF}(\cdot)$ is a two-layer feed-forward network with weights $W_1, W_2$, biases $w_1, w_2$, and an activation function $\sigma(\cdot)$, i.e. $\text{FF}(x) = \sigma(x W_1 + w_1) W_2 + w_2$. Aside from the residual selection, this is identical to common Transformer Encoder implementations (Vaswani et al., 2017; Dosovitskiy et al., 2021).
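The block structure can be sketched for the Single-Output case, where Sel(x) is simply the newest token. The code below is a simplified illustration with several assumptions that are not taken from the source: the function names are hypothetical, LayerNorm is implemented without learnable scale and shift, σ is taken to be ReLU, and the attention is passed in as a callable `co_mha` that maps the newest token to its attention output.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalisation over the last axis, without learnable parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, w1, W2, w2):
    """FF(x) = sigma(x W1 + w1) W2 + w2 with sigma = ReLU (an assumption)."""
    return np.maximum(x @ W1 + w1, 0.0) @ W2 + w2

def co_encoder_step(x_new, co_mha, ff_params):
    """One single-output encoder step: residual attention, then residual FF."""
    y = layer_norm(x_new + co_mha(x_new))   # Sel(x) is the newest token here
    return layer_norm(y + feed_forward(y, *ff_params))
```

In a full model `co_mha` would wrap a stateful Continual MHA; keeping it as a callable keeps this sketch independent of any particular caching scheme.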

3.7. RECYCLING POSITIONAL ENCODING

Since a Transformer Encoder does not provide positional bias, it is common to augment a token $x_i$ with a positional encoding $p$, i.e. $\tilde{x}_i = x_i \circ p_i$, where $\circ$ could be addition or concatenation. In regular Transformers, the index $i$ denotes a position in a sequence rather than a position in time. However, this static positional assignment is problematic in the context of continual inference; the last token at time $t = 0$ will be the next-to-last token at time $t = 1$, and thus in need of a different positional encoding than in the prior time-step. Instead, CINs require dynamic positions. There have been multiple prior works (Shaw et al., 2018; Huang et al., 2019; Dai et al., 2019) which create relative encodings by augmenting the SDA with positional offsets $P$ between query and keys, i.e. $A = \exp(Q K^\top / \sqrt{d} + P)$. While a similar modification to the continual attentions is possible, it is incompatible with the regular SDA in Eq. (1). Instead, we use a Recycling Positional Encoding (RPE), which lets the positional encoding follow each token in time and recycles old encodings:

$$\tilde{x}_t = x_t + p_{\tau_t}, \qquad \tau_t = (\tau_{t-1} + 1) \bmod T,$$

where $T$ is the number of encodings. While RPE does not specify relative encodings explicitly, the absolute positional interpretation of each token changes dynamically when a new token arrives. In practice, the network learns relative, shift-invariant positional information by training with random $\tau_0$ for each batch. Random shifts during training were recently explored in (Kiyono et al., 2021; Likhomanenko et al., 2021; Dehghani et al., 2019) as well. RPE can use either learned or predefined encodings. In the latter case, Cyclic Positional Encoding (Ma et al., 2021), a sinusoidal encoding inspired by Gray code, is a good fit. If we reuse the encoding immediately after an old token has "slid out", i.e. $T = n$, a token will have the same positional encoding relative to another whether it was $m$ steps older or $n - m$ steps newer.
The positional ambiguity can be avoided by extending the number of positional tokens to T = 2n -1. We explore both options in Section 4.1.2.
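The index recurrence and the ambiguity argument can be checked with a few lines of plain Python. This is an illustrative sketch with hypothetical function names; it only reasons about encoding indices and assumes that the relative position of two tokens in a window of length n is an offset between -(n - 1) and n - 1.

```python
def recycling_positions(num_steps, T, tau_0=0):
    """Encoding indices tau_t = (tau_{t-1} + 1) mod T for successive tokens."""
    return [(tau_0 + t) % T for t in range(num_steps)]

def offsets_are_unambiguous(n, T):
    """True if every signed offset in [-(n-1), n-1] maps to a distinct value mod T."""
    residues = {m % T for m in range(-(n - 1), n)}
    return len(residues) == 2 * n - 1

print(recycling_positions(6, T=4, tau_0=2))   # [2, 3, 0, 1, 2, 3]
print(offsets_are_unambiguous(n=4, T=4))      # False: -m and n - m coincide mod n
print(offsets_are_unambiguous(n=4, T=7))      # True: T = 2n - 1 separates all offsets
```

With T = n, an offset of -m and an offset of n - m leave the same index difference, which is the ambiguity described above; T = 2n - 1 is the smallest T for which all 2n - 1 offsets are distinct.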

3.8. ARCHITECTURAL CONSIDERATIONS

Block count. In Section 3.4, we observed an exact correspondence between the results of the continual and regular SDA layers. However, the correspondence does not necessarily hold for stacked layers. Consider the result of stacking two Continual Single-Output Transformer Encoder blocks. While the first block outputs a step $t$ that is identical to that in a corresponding regular block, the second block would have been initialised with prior step-wise inputs, which were the result of prior input windows instead of the current one; the correspondence would not hold. Though it is not convertible to/from a regular Transformer Encoder, the stacked Single-Output Transformer Encoder architecture has the merit of efficiency. This was exploited in Transformer-XL (Dai et al., 2019). Given a single step input, the Continual Retroactive Transformer Encoder block produces output tokens corresponding to the entire observed sequence inside the window. Due to this one-to-many input-output mapping, it is not possible to stack multiple such layers. Nevertheless, it can be used in conjunction with a Continual Single-Output Transformer Encoder with optional regular Transformer Encoder blocks in between as illustrated in Fig. 1. The Regular Transformer Encoder blocks in the resulting architecture have a significantly larger computational complexity than the Continual Retroactive and Single-Output blocks. Consequently, we recommend that Continual Transformer Encoders be used primarily in lightweight architectures with one or two blocks unless compatibility with non-continual Transformers is not required and only Single-Output blocks are used.

Class token. It is common to add a class token as input to Transformers (Devlin et al., 2019; Dosovitskiy et al., 2021), which accumulates information from other tokens prior to classification. However, it cannot be used naïvely with CINs, as this would effectively double the number of input steps. In practice, it can be employed in Continual multi-block Transformer Encoders as input to the second block (see Fig. 1), but this placement limits class token interaction with downstream layers. It can also be used for one-block Transformer Encoders if the value token is omitted as input.

Peak memory reduction trick. The FLOPs for $\text{Att}(Q, K, V)$ are exactly $n$ times those of $\text{CoSiAtt}(q, k_{\text{new}}, v_{\text{new}})$. Comparing their memory complexity, the regular SDA is $O(n^2)$, while the Single-Output SDA is $O(nd)$. In practical applications where system memory is limited, we may thus reduce the maximum memory requirement of the computational device at inference by up to $d/n$ (assuming $n \gg d$) by computing each row of the attention individually. However, this may reduce throughput due to reduced parallelism.
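The peak memory reduction trick amounts to evaluating the regular attention one query row at a time. The NumPy sketch below illustrates the idea; the function name `sda_rowwise` is hypothetical, and the explicit Python loop stands in for whatever row-wise scheduling a real system would use.

```python
import numpy as np

def sda_rowwise(Q, K, V):
    """Att(Q, K, V) computed one query row at a time.

    Only one length-n attention row is alive at any moment instead of the
    full (n, n) matrix A, at the cost of a sequential loop over queries.
    """
    scale = np.sqrt(Q.shape[1])
    out = np.empty((Q.shape[0], V.shape[1]))
    for t, q in enumerate(Q):
        a = np.exp(q @ K.T / scale)   # (n,) attention row for query t
        out[t] = (a @ V) / a.sum()
    return out
```

The outputs match the batched computation; what changes is that the largest intermediate shrinks from n × n to n entries, which is where the reduced parallelism mentioned above comes from.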

4.1. ONLINE ACTION DETECTION

Online Action Detection (OAD) (De Geest et al., 2016) entails the per-frame classification of human actions in a video stream as they happen, without the ability to change prior predictions or to use future information. This is fundamentally more restrictive than Temporal Action Localisation, where the whole video clip is processed before start and end frames of an action are determined (Shou et al., 2016; Xu et al., 2017; Shou et al., 2017; Wu et al., 2019). The dominant design in OAD works at the time of writing is to employ a two-stream Convolutional Neural Network as backbone for frame-wise feature extraction with RGB images as inputs in one stream and Optical Flow fields in the other (Gao et al., 2017; Xu et al., 2019; Eun et al., 2020; Wang et al., 2021; Xu et al., 2021). (This feature extraction is in itself quite computationally costly; we consider the optimisation of the backbone as orthogonal future work and follow the same feature extraction procedure as other OAD works at this time.) On top of these, OAD methods encode temporal information and perform predictions per time-step, e.g. by means of RNNs (Gao et al., 2017; Xu et al., 2019; Eun et al., 2020) or Transformers (Wang et al., 2021; Xu et al., 2021). Alongside the action detection for the current frame, an action anticipation task may be learned in parallel by means of decoder structures, as this has been found to improve the primary OAD task. Unlike RNNs, an output update for the regular SDA in a Transformer block cannot be naïvely computed for a single step by feeding successive video frames. Instead, prior step features must be cached, re-loaded and re-processed by the Transformer in each step in correspondence with a predefined window-size of prior steps. As laid out in Section 3. [...]

[...] Heilbron et al., 2015) or Kinetics-400 (Carreira & Zisserman, 2017). For TVSeries (De Geest et al., 2016), the network learns on the train and validation sets (20 videos) and evaluates on the test set (7 videos) as in Wang et al. (2021).
RGB and Optical Flow features were extracted using an MMAction2 (Contributors, 2020) pipeline with ActivityNet v1.3 (Heilbron et al., 2015) and Kinetics-400 (Carreira & Zisserman, 2017) pretrained TSN ResNet-50 (He et al., 2016) backbones. This is similar to the feature extraction process used by LSTR (Xu et al., 2021). Following Wang et al. (2021), we use a batch size of 128, sequence length 64, initial learning rate $10^{-4}$ with a factor ten reduction each epoch, alongside weight decay $10^{-4}$, and dropout with probability 0.1. We report results using two epochs of training on an Nvidia RTX 2080 Ti GPU. We track mean Average Precision (mAP) for THUMOS14 and mean calibrated Average Precision (mcAP) (De Geest et al., 2016) for TVSeries, alongside FLOPs per prediction and parameters of the OAD module (feature extraction excluded). We report the mean ± standard deviation over five runs.

4.1.2. ABLATION STUDIES

Removing the Decoder. As a first step to make an efficient Continual OadTR, we remove the decoder blocks used for action anticipation, which has a large impact on computational efficiency and the ease of transformation to a Continual Inference Network. The first two lines of Table 1a present the results of the removal. Contrary to the observations of Wang et al. (2021), we did not find any drop in accuracy when excluding the decoder. We do, however, gain a large reduction in FLOPs and model size; they were reduced to 58% and 30%, respectively. Given these computational improvements, we exclude the decoder in subsequent experiments.

[...]). Accordingly, we ablate its use and position. In cases where it is removed, we predict on the token corresponding to the last input token. The results of varying CLS pos. are noted in Table 1a. For the one-block architecture, the removal came with a noticeable drop in mAP, while the two-block architecture saw small improvements when removing or introducing the class token later. For the three-block model, the use of class tokens in block two achieved the highest mAP. Though it is commonly accepted that class tokens should be introduced alongside other inputs in the first block, our results indicate that they can accumulate sufficient information with only one or two blocks, and that later stage introduction may work better in some applications. In general, the mAP values achieved when varying CLS pos. and the number of blocks are very similar to one another, while (re)moving the class token and reducing the number of blocks both reduce computational complexity. This encourages the use of shallow Transformer Encoders over deeper ones as well as the removal of class tokens, as we do in the following experiments.

Positional Encodings. We can transfer parameters from the simplified one- and two-block OadTR to the corresponding Continual architecture, CoOadTR. Here, the one-block version (CoOadTR-b1) uses CoSiMHA, and the two-block model (CoOadTR-b2) uses CoReMHA in the first block and Single-Output MHA in the second. However, a regular positional encoding is not suited for continual inference (see Section 3.7). We evaluate the performance of using non-continual encodings for continual inference, as well as of our proposed Recycling Positional Encodings with fixed or learned parameters. In addition, we explore the impact of extending the number of tokens from $n$ to $2n - 1$ to avoid positional ambiguity. As seen in Table 1b, non-continual encodings used in the continual setting result in a severe mAP drop. Recycling Positional Encodings alleviate this. Comparing learned and fixed encodings, we find the learned encodings to work better when the number of encoding tokens corresponds to the sequence length $n$, and the fixed encoding to work best when positional ambiguity is alleviated by extending the number of tokens to $2n - 1$. Fixed encoding with $2n - 1$ tokens works best overall and is employed in subsequent experiments unless stated otherwise. There is no difference in FLOPs for either strategy, and the difference in parameter count is negligible.

4.1.3. COMPARISON WITH PRIOR WORKS

We evaluate the (Co)OadTR architectures on THUMOS14 and TVSeries with two sets of features as described in Section 4.1.1. Since no prior OAD works have reported complexity metrics, we measured the FLOPs for TRN (Xu et al., 2019) based on the publicly available source code to serve as a point of reference. The results of this benchmark are presented in Table 2 and Fig. 2. OadTR and our simplified (continual) one-block (b1) and two-block (b2) versions without decoder and class tokens generally achieve competitive precision in comparison with prior works, surpassing all but OadTR and LSTR. On THUMOS14, our reproduced OadTR results are slightly lower than originally reported by Wang et al. (2021) (the reported 58.3% could not be reproduced using their publicly available code), whereas the achieved TVSeries results are higher (we attribute our higher mcAP to differences in the feature extraction pipeline). The (Co)OadTR-b# architectures largely retain precision and allow significantly reduced FLOPs per prediction. Our proposed continual variants CoOadTR-b1 and CoOadTR-b2 reduce FLOPs by 255× and 6.1×, respectively, compared to OadTR, while either achieving the same performance or conceding no more than one percentage point. On average, continual and non-continual (Co)OadTR-b# models achieve similar mAP on THUMOS14, while OadTR-b# have slightly higher mcAP on TVSeries. We attribute these discrepancies to differences in positional encoding. All in all, the CoOadTR-b# models provide far-superior computational efficiency to prior works, achieving state-of-the-art performance/efficiency trade-offs by a large margin.

Table 2 (partial; only the following rows survive, and the OadTR row is truncated):

Model                        THUMOS14 mAP (%)   TVSeries mcAP (%)   FLOPs
TRN (Xu et al., 2019)        47.2               83.7                1387.5
FATS (Kim et al., 2021)      51.6               81.7                -
IDN (Eun et al., 2020)       50.0               84.7                -
TFN (Eun et al., 2021)       55.7               85.0                -
LSTR (Xu et al., 2021)       65.3               88.1                -
OadTR (Wang et al., 2021)    58

4.1.4. AUDIO-VISUAL ONLINE ACTION DETECTION

To showcase the validity of our method in audio-visual settings as well, we explore the addition of audio-features to the Online Action Detection task on THUMOS14. As described in Section 4.2, audio-features are extracted using Mel spectrograms and an AudioSet pre-trained VGGish network (Hershey et al., 2017) (output of the penultimate layer) on 1.0 second windows with a step size of 0.2 seconds to match the 5.0 FPS sampling rate of the video features. The audio-features by themselves do not provide enough signal to reach good Online Action Detection performance (yielding only 6.7% mAP with an OadTR network). When concatenated with RGB and Flow they do provide a modest improvement as seen in Table 3 . On average, this amounts to +0.6% mAP when combined with ActivityNet features and +0.5% mAP when used with Kinetics-400 features with shallower models enjoying the largest improvements.

4.2.1. BACKGROUND

Audio Classification is the categorisation of audio waveforms. Though waveform sequences can be used directly (Lee et al., 2017) , it is common to first convert them to spectrograms. Mel spectrograms are obtained by a nonlinear transformation of a frequency scale (Stevens et al., 1937) , which is designed based on empirical knowledge about the human auditory system (Choi et al., 2016) . By employing spectrograms, audio classification can be approached in the same way as image classification (Palanisamy et al., 2020) . 

4.2.2. EXPERIMENTS

We conduct experiments on the Music Genre Classification dataset GTZAN (Tzanetakis & Cook, 2002). It consists of 100 30-second clips for each of ten music genres. Each audio clip is sampled at 22,050 Hz. Since there are no predefined splits for GTZAN, we randomly select 10% of the data for validation and 10% for testing. The input is transformed to a temporal sequence by sliding a one-second window over each 30-second clip with a slide step size of 250 ms, leading to 120 one-second clips. These are subsequently converted to Mel spectrograms. We then fine-tune a VGGish network, pre-trained on AudioSet (Hershey et al., 2017) [...]

Floating Point Operations for the Continual Retroactive Dot-Product Attention in Eqs. (2) to (6). The outputs of the exponentials in Eq. (2) and Eq. (3) can be reused in Eq. (4) and Eq. (5), respectively, and are omitted in the count.

           Mul.          Add                   Exp
Eq. (2)    2(n-1)d       2(n-2)d + 2(n-1)      2(n-1)
Eq. (3)    nd + n + d    nd + (n-1) + d        n
Eq. (4)    2(n-1)d       2(n-1)d               0
Eq. (5)    nd            (n-1)d                0
Eq. (6)    nd + n        0                     0

A schematic illustration of the Audio Classification experiments architecture is depicted in Fig. 6.
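As a consistency check, the per-equation counts for Eqs. (2) to (6) in the table above can be summed and compared with the totals stated in Section 3.2. The script below is only a verification aid; the function names are hypothetical and the expressions are copied from the table entries.

```python
def retroactive_flops_by_equation(n, d):
    """(multiplications, additions, exponentials) per equation, as tabulated."""
    return {
        "Eq. (2)": (2 * (n - 1) * d, 2 * (n - 2) * d + 2 * (n - 1), 2 * (n - 1)),
        "Eq. (3)": (n * d + n + d, n * d + (n - 1) + d, n),
        "Eq. (4)": (2 * (n - 1) * d, 2 * (n - 1) * d, 0),
        "Eq. (5)": (n * d, (n - 1) * d, 0),
        "Eq. (6)": (n * d + n, 0, 0),
    }

def retroactive_totals(n, d):
    """Column-wise sums over all five equations."""
    rows = retroactive_flops_by_equation(n, d).values()
    return tuple(sum(column) for column in zip(*rows))

def stated_totals(n, d):
    """Totals quoted in Section 3.2."""
    return (7 * n * d + 2 * n - 3 * d, 6 * n * d + 3 * n - 6 * d - 3, 3 * n - 2)

print(retroactive_totals(64, 192) == stated_totals(64, 192))  # True
```

The column sums agree with the closed-form totals for any n and d, which confirms that the table and the text describe the same operation count.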



Footnotes

Source code: https://github.com/lukashedegaard/continual-transformers.

The feature extraction commonly used in Online Action Detection (OAD) works is in itself quite computationally costly. We consider the optimisation of the backbone as orthogonal future work and will follow the same feature extraction procedure as other OAD works at this time.

The reported 58.3% on THUMOS14 could not be reproduced using their publicly available code.

We attribute our higher mcAP to differences in the feature extraction pipeline.

4. EXPERIMENTS

We provide case studies within two perception disciplines, namely Online Action Detection (Section 4.1) and Audio Classification (Section 4.2). In each case, we will start with a brief overview of the field, followed by experiments and results.



Figure 1: Multi-block Continual Transformer Encoder with Recycling Positional Encoding. For b > 2 blocks, regular Transformer Encoder blocks can be added between an initial Continual Retroactive block and a final Single-Output block. A class-token may be used after the initial block.

Figure 2: Visual comparison of OAD methods on THUMOS14 and TVSeries for backbones trained on ActivityNet 1.3 and Kinetics-400.

Figure 3: FLOPs/step and memory footprint for Regular, Continual Retroactive, and Continual Single-Output Scaled Dot-Product Attention at varying sequence length n and embedding dimension d. Column (a) has d fixed to 256; Column (b) has n fixed to 256.

Figure 5: Continual Single-Output Dot-Product Attention. The key (K) and value (V ) matrices are aggregated over time by caching the step vectors k new and v new in a FIFO queue. During each step, only the attention output associated with q is computed.

Table 1: Ablation experiments on THUMOS14 with TSN-Anet features. Best metrics are highlighted. '-' indicates that a particular feature was not used. (a) Class token variations with OadTR. CLS pos. is the encoder block into which CLS is input. (b) Positional encoding variations for CoOadTR.

Table 2: Online Action Detection results. FLOPs per prediction are noted for inference on THUMOS14. The best and next-best metrics are highlighted.

Table 3: Audio-Visual result, THUMOS14.

Table 4: Audio Classification results for GTZAN.

... and use the penultimate layer for feature extraction. A batch size of 64 and the Adam optimizer (Kingma & Ba, 2015) are used with an initial learning rate of $10^{-4}$. The learning rate is reduced by a factor of 0.6 on plateau with a tolerance of two epochs, and an early stopping mechanism is used, with a maximum of 100 epochs allowed. The VGGish base-network attains an accuracy of 86.1% on the dataset of one-second clips with 72.1M parameters and 864.7M FLOPs. Subsequently, the audio features are passed to a (Continual) Transformer Encoder which has 16 attention heads, an embedding dimension of 192 and an MLP dimension of 384. The Transformer Encoder is trained on the whole temporal sequence using a batch size of 32 and the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of $10^{-5}$ and a weight decay of $10^{-4}$ for 50 epochs. Since the Transformer Encoder is trained on entire 30-second clips, there are fewer data points available for this training. Accordingly, the size of the validation set is increased to 18%. All audio classification training procedures were carried out on a single Nvidia RTX 2080 Ti GPU. Table 4 presents the accuracy and efficiency of regular and Continual Transformers during online inference. As a baseline, we also include the result of majority voting among the clips to classify the entire sequence. The Continual Transformers obtain similar accuracy to regular Transformers while consuming 1.76× fewer FLOPs when using two blocks and 51.5× fewer FLOPs when using one Transformer Encoder block.

5. CONCLUSION

In this work, we presented Continual Transformers, a redundancy-free reformulation of Transformers tailored for online inference. Central to the Continual Transformer are the Continual Retroactive and Single-Output Attention operations, which produce outputs identical to the original Scaled Dot-Product Attention for continual input sequences, while greatly reducing the time and memory complexity per prediction. The applicability of Continual Transformer architectures was experimentally validated in Online Action Detection and Online Audio Classification settings, where we observed reductions in time complexity of up to multiple orders of magnitude for lightweight architectures at modest accuracy concessions. Continual Transformers constitute an algorithmic innovation, which could make possible hitherto unseen precision, speed, and power efficiency in online inference use-cases. With applications spanning enhanced perception and reactivity of robots and autonomous vehicles, weather forecasting, price prediction and surveillance, we hope it will be used for the common good.

Floating Point Operations for the Scaled Dot-Product Attention in Eq. (1). $D^{-1}(\cdot)$ can be efficiently computed as element-wise multiplication with $AV$.

Floating Point Operations for the Continual Retroactive Dot-Product Attention in Eqs. (2) to (6).

Floating Point Operations for the Continual Single-Output SDA in Eq. (7).

ACKNOWLEDGMENTS

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR).

[Figure: audio classification pipeline; recoverable labels: Waveform, VGGish]

