ADAPTIVE COMPUTATION WITH ELASTIC INPUT SEQUENCE

Abstract

When solving a problem, human beings adapt the type of information they use, the procedure they follow, and the amount of time they spend on the problem. In contrast, most standard neural networks apply the same function and a fixed computation budget to every sample, regardless of its nature or difficulty. Adaptivity is a powerful paradigm as it not only imbues practitioners with flexibility pertaining to the downstream usage of these models but can also serve as a powerful inductive bias for solving certain challenging classes of problems (Dehghani et al., 2018; Banino et al., 2021). In this work, we propose a new strategy, AdaTape, that enables dynamic computation in neural networks via adaptive tape tokens. AdaTape employs an elastic input sequence by equipping an architecture with a dynamic read-and-write tape. Specifically, we adaptively generate input sequences using tape tokens obtained from a tape bank that can either be trainable or generated from input data. We analyze the challenges and requirements of obtaining dynamic sequence content and length, and propose the Adaptive Tape Reading (ATR) algorithm to achieve both objectives. Via extensive experiments on image recognition tasks, we show that AdaTape achieves better performance while maintaining a comparable computational cost.

1. INTRODUCTION

Adaptive computation is central to human intelligence. This is clear, given that humans spend a variable amount of time and energy on different problems depending on their complexity (Meunier et al., 2009). Adaptivity in neural networks is attractive for two key reasons. Firstly, adaptive computation can be a powerful and essential inductive bias for solving challenging problems that would otherwise be significantly harder (Dehghani et al., 2018; Banino et al., 2021). Secondly, adaptive computation can give practitioners downstream flexibility in how these models are used. For the most part, altering the computation budget of a model after it has been trained is almost impossible, so the ability to flexibly and dynamically scale computational costs and budgets is highly desirable.

This paper proposes AdaTape, a new general-purpose adaptive computation method. The key idea is to introduce elastic input sequences by means of a dynamic read-and-write memory tape. Unlike prior works that investigate adaptivity via sparse conditional computation (Fedus et al., 2022; 2021; Lepikhin et al., 2020) or via recursion over the architecture (Dehghani et al., 2018; Banino et al., 2021; Graves, 2016), this work presents a new perspective that explores adaptivity with respect to input sequence length (or, from the perspective of a Neural Turing Machine (Graves et al., 2014), read/write memory tapes). We postulate that this direction is crucial for the development of this class of methods and is complementary to the existing suite of methods for adaptive computation in neural networks. AdaTape promotes adaptivity in both the type and the amount of computation. Specifically, AdaTape controls (1) the contents of the tape tokens and (2) the number of tape tokens used for each input.
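To make the two controlled quantities concrete, the following is a minimal sketch of how tape tokens might be selected from a bank: a query vector repeatedly picks the most relevant bank token and accumulates a halting score, so both the content (which tokens) and the length (how many tokens) are input-dependent. All names, the similarity measure, and the halting rule here are illustrative assumptions, not the paper's exact ATR algorithm.

```python
import numpy as np

def adaptive_tape_reading(query, bank, max_tokens=4, threshold=1.0):
    """Illustrative sketch: iteratively select the bank token most
    similar to the current query, accumulate a halting score, and stop
    once the score passes `threshold` or `max_tokens` tokens are read.

    query: (d,) vector summarizing the input so far (assumed).
    bank:  (n, d) matrix of candidate tape tokens.
    Returns the indices of the selected tape tokens, in reading order.
    """
    selected, cum_score = [], 0.0
    available = list(range(len(bank)))
    q = query.copy()
    while available and len(selected) < max_tokens and cum_score < threshold:
        sims = np.array([q @ bank[i] for i in available])
        # Softmax over the remaining bank tokens gives selection weights.
        w = np.exp(sims - sims.max())
        w /= w.sum()
        best = available[int(np.argmax(w))]
        selected.append(best)
        available.remove(best)
        cum_score += float(w.max())  # halting score of the chosen token
        q = q + bank[best]           # update the query with the token read
    return selected
```

Inputs whose queries match the bank sharply accumulate halting score quickly and read few tokens, while diffuse queries read more, which is the sense in which sequence length becomes elastic.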
To this end, AdaTape is characterized by a tape bank that can be dynamically read from, using a newly proposed dynamic halting algorithm which we call Adaptive Tape Reading (ATR). Concretely, the ATR method adaptively and dynamically selects the content and length of this memory tape, which is appended to the inputs of a standard Transformer (Vaswani et al., 2017). Given that the

