ADAPTIVE COMPUTATION WITH ELASTIC INPUT SEQUENCE

Abstract

When solving a problem, human beings adapt the type of information they use, the procedure they follow, and the amount of time they spend approaching and solving it. Most standard neural networks, however, apply the same function and expend a fixed computation budget on every sample, regardless of its nature or difficulty. Adaptivity is a powerful paradigm: it not only gives practitioners flexibility in the downstream usage of these models but can also serve as a powerful inductive bias for solving certain challenging classes of problems (Dehghani et al., 2018; Banino et al., 2021). In this work, we propose a new strategy, AdaTape, that enables dynamic computation in neural networks via adaptive tape tokens. AdaTape employs an elastic input sequence by equipping an architecture with a dynamic read-and-write tape. Specifically, we adaptively generate input sequences using tape tokens obtained from a tape bank that can either be trainable or derived from the input data. We analyze the challenges and requirements of making both the sequence content and its length dynamic, and propose the Adaptive Tape Reading (ATR) algorithm to achieve both objectives. Via extensive experiments on image recognition tasks, we show that AdaTape achieves better performance at comparable computational cost.

1. INTRODUCTION

Adaptive computation is central to human intelligence: humans spend a variable amount of time and energy on different problems depending on their complexity (Meunier et al., 2009). Adaptivity in neural networks is attractive for two key reasons. Firstly, adaptive computation can be a powerful and essential inductive bias for solving challenging problems that would otherwise be significantly harder (Dehghani et al., 2018; Banino et al., 2021). Secondly, adaptive computation imbues practitioners with downstream flexibility in how these models are used. Altering the computation budget of a model after it has been trained is, for the most part, impossible; hence, the ability to flexibly and dynamically scale computational costs and budgets is highly desirable.

This paper proposes AdaTape, a new general-purpose adaptive computation method. The key idea is to introduce elastic input sequences by means of a dynamic read-and-write memory tape. Unlike prior works that investigate adaptivity via sparse conditional computation (Fedus et al., 2022; 2021; Lepikhin et al., 2020) or via recursion over the architecture (Dehghani et al., 2018; Banino et al., 2021; Graves, 2016), this work presents a new perspective that explores adaptivity with respect to input sequence length (or, from the perspective of a Neural Turing Machine (Graves et al., 2014), read/write memory tapes). We postulate that this direction is crucial for the development of this class of methods and is complementary to the existing suite of techniques for encouraging adaptive computation in neural networks. AdaTape promotes adaptivity in both the type and the amount of computation: it controls (1) the content of the tape tokens and (2) the number of tape tokens used for each input.
To this end, AdaTape is characterized by a tape bank that is read dynamically, using a newly proposed halting algorithm which we call Adaptive Tape Reading (ATR). Concretely, the ATR method adaptively and dynamically selects the content and length of a memory tape that is appended to the inputs of a standard Transformer (Vaswani et al., 2017). Given that increasing the computation budget generally improves quality (Kaplan et al., 2020; Dehghani et al., 2021; Hoffmann et al., 2022; Zhai et al., 2022; Abnar et al., 2021), this enables a new way to adaptively scale the computation budget without adding new parameters or applying part of the model recursively. To ascertain the effectiveness of AdaTape, we first evaluate it on the challenging Parity task (Graves, 2016; Banino et al., 2021), a standard verification check for Adaptive Computation Time (ACT) algorithms (Graves, 2016). Our results demonstrate that AdaTape performs well on this problem, which remains completely unsolvable by vanilla Transformers. This not only verifies that the AdaTape inductive bias is crucial for solving certain classes of problems but also supports its correctness. Given that the standard Transformer is touted as a universal architecture with ubiquitous impact across many fields (Jumper et al., 2021; Dosovitskiy et al., 2020; Vaswani et al., 2017), it is striking that Transformers lack the inductive bias for a standard vector parity problem. Finally, we conduct large-scale experiments on vision tasks (e.g., image recognition with few-shot evaluation), showing that AdaTape outperforms vanilla Transformers when compute-matched (Dehghani et al., 2021) in terms of both FLOPs and throughput. While AdaTape does not improve efficiency during training, the flexibility and adaptivity it grants allow dynamic scaling of the computation budget during inference.
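Before the formal treatment in the following sections, the core behavior of ATR, selecting a variable-length sequence of tape tokens per input, can be sketched in a few lines. This is a minimal illustration under assumed design choices (greedy similarity-based selection with a cumulative-weight halting threshold), not the exact ATR algorithm; all names and the selection rule are hypothetical.

```python
import numpy as np

def adaptive_tape_read(query, bank, threshold=0.95, max_tokens=8):
    """Illustrative sketch of ATR-style reading (NOT the paper's exact
    algorithm): greedily pick tape tokens from `bank` by similarity to a
    per-input `query`, halting once the cumulative selection weight
    passes `threshold` or `max_tokens` is reached."""
    selected = []          # indices of bank tokens chosen so far
    cumulative = 0.0       # accumulated halting weight
    available = list(range(len(bank)))
    for _ in range(max_tokens):
        # Softmax over similarities between the query and remaining tokens.
        scores = np.array([query @ bank[i] for i in available])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        best = int(np.argmax(probs))
        selected.append(available.pop(best))
        cumulative += float(probs[best])
        if cumulative >= threshold:  # dynamic halting: tape length varies
            break
    return [bank[i] for i in selected]
```

Because the halting weight accumulates at a rate that depends on the input, different queries read tapes of different lengths, which is the elasticity the method relies on.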
Given that the standard practice is to train and serve multiple models to cater to variable computation budgets (e.g., smaller models for less important workloads and larger models for prioritized examples), we consider the ability of a single model to flex between multiple requirements to be highly desirable.

2. ADATAPE: ADAPTIVE COMPUTATION WITH ELASTIC INPUT SEQUENCE

Neural networks can obtain adaptivity by using different functions or different computation budgets for different inputs. Consider a deep neural network as a function f(x; θ), whose output depends on both the input x and the parameters θ. To adapt the function type, a subset of the parameters θ is typically activated sparsely, conditioned on x; this form of adaptivity is also known as conditional computation. Research on Mixture-of-Experts (Fedus et al., 2021; Lepikhin et al., 2020; Xue et al., 2021; Lou et al., 2021; Riquelme et al., 2021) introduces adaptivity in the function type via routing, specializing the computation for each input sample. Another line of adaptive computation research targets a dynamic computation budget. In standard neural networks (e.g., Transformers), the computation budget is fixed across samples. However, recent studies show that adaptive computation budgets help solve many tasks on which the vanilla Transformer fails entirely (Dehghani et al., 2018; Banino et al., 2021; Abnar et al., 2020). Most of these works use dynamic depth to adaptively allocate the computation budget. For instance, Graves (2016) proposed the Adaptive Computation Time (ACT) algorithm, which uses a learned halting mechanism to repeat a recurrent computation a variable number of times per input.
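The ACT-style dynamic-depth mechanism mentioned above can be sketched as follows. This is a minimal illustration after Graves (2016), with `step_fn`, `halt_w`, and `halt_b` standing in for the learned recurrent step and halting unit; it is not a faithful reimplementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def act_forward(state, step_fn, halt_w, halt_b, eps=0.01, max_steps=10):
    """ACT-style halting sketch (after Graves, 2016): repeat `step_fn`
    until the accumulated halting probability reaches 1 - eps, then
    return the halting-weighted combination of intermediate states."""
    states, weights = [], []
    total = 0.0
    for t in range(max_steps):
        state = step_fn(state)
        p = float(sigmoid(halt_w @ state + halt_b))
        if total + p >= 1.0 - eps or t == max_steps - 1:
            weights.append(1.0 - total)  # remainder goes to the last step
            states.append(state)
            break
        weights.append(p)
        states.append(state)
        total += p
    # Output is a probability-weighted mean; depth varies per input.
    return sum(w * s for w, s in zip(weights, states)), len(states)
```

The key contrast with AdaTape is the axis of adaptivity: ACT varies the number of applications of the same function (depth), whereas AdaTape varies the input sequence length.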



Figure 1: An overview of AdaTape. For different samples, we pick a variable number of different tokens from the tape bank. The tape bank can be derived from the input, e.g., by extracting extra fine-grained information, or it can be a set of trainable vectors. Adaptive Tape Reading is used to recursively select sequences of tape tokens, with variable lengths, for different inputs. These tokens are then simply appended to the inputs and fed to the Transformer encoder.
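The final step of the figure, appending the selected tape tokens to the input tokens, amounts to a concatenation along the sequence axis. A toy sketch with assumed ViT-style shapes (the patch count, tape length, and embedding size below are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                     # assumed embedding dimension
patch_tokens = rng.normal(size=(196, d))   # e.g., 14x14 grid of image patches
tape_tokens = rng.normal(size=(5, d))      # variable length per input (5 here)

# Tape tokens are simply appended to the input sequence before the encoder,
# so a longer tape directly means more computation for that sample.
sequence = np.concatenate([patch_tokens, tape_tokens], axis=0)
assert sequence.shape == (196 + 5, d)
```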

