NEAR-OPTIMAL GLIMPSE SEQUENCES FOR IMPROVED HARD ATTENTION NEURAL NETWORK TRAINING

Anonymous

Abstract

Hard visual attention is a promising approach to reduce the computational burden of modern computer vision methodologies. Hard attention mechanisms are typically non-differentiable; they can be trained with reinforcement learning, but the high-variance training this entails hinders more widespread application. We show how hard attention for image classification can be framed as a Bayesian optimal experimental design (BOED) problem. From this perspective, the optimal locations to attend to are those which provide the greatest expected reduction in the entropy of the classification distribution. We introduce methodology from the BOED literature to approximate this optimal behaviour, and use it to generate 'near-optimal' sequences of attention locations. We then show how to use such sequences to partially supervise, and therefore speed up, the training of a hard attention mechanism. Although generating these sequences is computationally expensive, they can be reused by any other networks later trained on the same task.

1. INTRODUCTION

Attention can be defined as the "allocation of limited cognitive processing resources" (Anderson, 2005). In humans, the density of photoreceptors varies across the retina: it is much greater in the centre (Bear et al., 2007), and the eye covers an approximately 210 degree field of view (Traquair, 1949). This means that the visual system is a limited resource with respect to observing the environment, and that it must be allocated, or controlled, by some attention mechanism. We refer to this kind of controlled allocation of limited sensor resources as "hard" attention. This is in contrast with "soft" attention, the controlled application of limited computational resources to full sensory input. Hard attention can solve certain tasks using orders of magnitude less sensor bandwidth and computation than the alternatives (Katharopoulos & Fleuret, 2019; Rensink, 2000). It may therefore enable the use of modern approaches to computer vision in low-power settings such as mobile devices.

This paper focuses on the application of hard attention in image classification. Our model of attention (shown in Fig. 1) is as follows: a recurrent neural network (RNN) is given T steps to classify some unchanging input image. Before each step, the RNN outputs the coordinates of a pixel in the image. A patch of the image centred around this pixel is then fed into the RNN. We call this image patch a glimpse, and the coordinates a glimpse location. As such, the RNN controls its input by selecting each glimpse location, and this decision can be based on previous glimpses. After T steps, the RNN's hidden state is mapped to a classification output. As with most artificial hard attention mechanisms (Mnih et al., 2014; Ba et al., 2014), this output is not differentiable with respect to the sequence of glimpse locations selected.
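To make this glimpse loop concrete, the following is a minimal NumPy sketch of the process just described. The 8x8 patch size, the zero-padding in `f_fovea`, and the single-layer tanh RNN with a fixed linear location read-out are all illustrative assumptions; the actual model uses learned location and embedding networks.

```python
import numpy as np

def f_fovea(image, loc, size=8):
    """Extract a size x size glimpse centred on pixel coordinates `loc`.

    The image is zero-padded so glimpses near the border stay valid.
    """
    pad = size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad)), mode="constant")
    r, c = int(loc[0]) + pad, int(loc[1]) + pad
    return padded[r - pad : r - pad + size, c - pad : c - pad + size]

image = np.arange(64 * 64, dtype=float).reshape(64, 64)
glimpse = f_fovea(image, (0, 0))
print(glimpse.shape)  # (8, 8)

# A toy T-step loop over the image. The location "network" here is just a
# fixed read-out of the hidden state squashed into valid pixel coordinates,
# standing in for the learned stochastic policy of the paper.
rng = np.random.default_rng(0)
T, hidden_dim = 4, 32
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_g = rng.normal(scale=0.1, size=(hidden_dim, 8 * 8 + 2))
h = np.zeros(hidden_dim)
for t in range(T):
    loc = (np.tanh(h[:2]) * 0.5 + 0.5) * 63   # map hidden state to coordinates
    y = f_fovea(image, loc)
    inp = np.concatenate([y.ravel() / y.size, loc / 63.0])
    h = np.tanh(W_h @ h + W_g @ inp)          # RNN step on the glimpse embedding
print(h.shape)  # (32,)
```

After the loop, `h` would be mapped to a classification distribution by a final classifier layer, which is omitted here.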
This makes training with standard gradient backpropagation impossible, and so high-variance gradient estimators such as REINFORCE (Williams, 1992) are commonly used instead (Mnih et al., 2014; Ba et al., 2014). The resulting noisy gradient estimates make training difficult, especially for large T.

Figure 1: The hard attention network architecture we consider, consisting of an RNN core (yellow), a location network (light blue), a glimpse embedder (dark blue), and a classifier (red). h_t is the RNN hidden state after t steps. The network outputs distributions over where to attend (l_t) at each time step, and over the class label (θ) after T steps.

In order to improve hard attention training, we take inspiration from neuroscience literature which suggests that visual attention is directed so as to maximally reduce entropy in an agent's world model (Bruce & Tsotsos, 2009; Itti & Baldi, 2009; Schwartenbeck et al., 2013; Feldman & Friston, 2010). There is a corresponding mathematical formulation of such an objective, namely Bayesian optimal experimental design (BOED) (Chaloner & Verdinelli, 1995). BOED tackles the problem of designing an experiment to maximally reduce uncertainty in some unknown variable. When classifying an image with hard visual attention, the 'experiment' is the process of taking a glimpse; the 'design' is the glimpse location; and the unknown variable is the class label. In general, BOED is applicable only when a probabilistic model of the experiment exists. This could be, for example, a prior distribution over the class label and a generative model for the observed image patch conditioned on the class label and glimpse location. We leverage generative adversarial networks (GANs) (Goodfellow et al., 2014) to provide such a model.

We use methodology from BOED to introduce the following training procedure for hard attention networks, which we call partial supervision by near-optimal glimpse sequences (PS-NOGS).

1. We assume that we are given an image classification task and a corresponding labelled dataset. Then, for a subset of the training images, we determine an approximately optimal (in the BOED sense) glimpse location for a hard attention network to attend to at each time step. We refer to the resulting sequences of glimpse locations as near-optimal glimpse sequences. Section 4 describes our novel method to generate them.
2. We use these near-optimal glimpse sequences as an additional supervision signal for training a hard attention network. Section 5 introduces our novel training objective for this.

We empirically investigate the performance of PS-NOGS and find that it leads to faster training than our baselines, and qualitatively different behaviour with competitive accuracy. We validate the use of BOED to generate glimpse sequences through comparisons with supervision both by hand-crafted glimpse sequences, and by glimpse sequences sampled from a trained hard attention network.
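The BOED criterion underlying this procedure can be written as an expected information gain: EIG(l) = H[p(θ)] − E_{y∼p(y|l)} H[p(θ|y,l)], i.e. the expected reduction in class-label entropy from a glimpse at location l. The sketch below computes this for a toy discrete model with assumed likelihood tables; it is illustrative only, and does not reflect the paper's GAN-based generative model.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def expected_information_gain(prior, likelihood):
    """EIG(l) = H[p(theta)] - E_{y ~ p(y|l)} H[p(theta | y, l)].

    prior:      (n_classes,) distribution over the label theta.
    likelihood: (n_classes, n_outcomes) table p(y | theta, l) for one location l.
    """
    joint = prior[:, None] * likelihood                    # p(theta, y | l)
    marginal_y = joint.sum(axis=0)                         # p(y | l)
    posterior = joint / np.clip(marginal_y, 1e-12, None)   # p(theta | y, l)
    posterior_entropies = entropy(posterior.T)             # one H per outcome y
    return entropy(prior) - marginal_y @ posterior_entropies

prior = np.array([0.5, 0.3, 0.2])
# Two candidate glimpse locations (likelihood tables assumed for illustration):
# the first is uninformative, the second separates class 0 from the rest.
uninformative = np.full((3, 2), 0.5)
informative = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
eigs = [expected_information_gain(prior, L) for L in (uninformative, informative)]
print(int(np.argmax(eigs)))  # 1: attend to the informative location
```

Choosing the glimpse location with maximal EIG at each step is the 'near-optimal' behaviour the sequences aim to capture; the uninformative location yields exactly zero gain, since the posterior equals the prior for every outcome.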

2. HARD ATTENTION

Given an image, I, we consider the task of inferring its label, θ. We use an architecture based on that of Mnih et al. (2014), shown in Fig. 1. It runs for a fixed number of steps, T. At each step t, the RNN samples a glimpse location, l_t, from a distribution conditioned on previous glimpses via the RNN's hidden state. A glimpse, in the form of a contiguous square of pixels, is extracted from the image at this location; we denote this y_t = f_fovea(I, l_t). An embedding of y_t and l_t is then input to the RNN. After T glimpses, the network outputs a classification distribution q_φ(θ | y_{1:T}, l_{1:T}), where φ are the learnable network parameters. Mnih et al. (2014) use glimpses consisting of three image patches at different resolutions, but the architectures are otherwise identical. As it directly processes only a fraction of an image, this architecture is suited to low-power scenarios such as use on mobile devices.

During optimisation, gradients cannot be computed by simple backpropagation since f_fovea is non-differentiable. An alternative, taken by Mnih et al. (2014) and others in the literature (Ba et al., 2014; Sermanet et al., 2014), is to obtain high-variance gradient estimates using REINFORCE (Williams, 1992). Although these are unbiased, their high variance has made scaling beyond simple problems such as digit classification (Netzer et al., 2011) challenging. Section 7 describes alternatives (Ba et al., 2015; Lawson et al., 2018) to training with REINFORCE, but similar problems with scalability exist. This has led many studies to focus on easing the learning task by altering the architecture: e.g., by processing a downsampled image before selecting glimpse locations (Ba et al., 2014; Sermanet et al., 2014; Katharopoulos & Fleuret, 2019). We summarise these innovations in Section 7, but they tend to be less suitable for low-power computation. We therefore believe that improved training of the architecture in Fig. 1 is an important research problem, and it is the focus of this paper.
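For reference, REINFORCE replaces the unavailable gradient with a score-function Monte Carlo estimate, ∇_φ E_{l∼π_φ}[R(l)] = E_{l∼π_φ}[R(l) ∇_φ log π_φ(l)]. The sketch below applies it to a toy categorical policy over K candidate locations, with an assumed per-location reward (in the attention setting this would be, e.g., the log-likelihood of the true class); it is a minimal illustration, not the paper's training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
logits = np.zeros(K)                      # policy parameters phi
reward = np.array([0.1, 0.2, 0.9, 0.3])   # assumed reward per location

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(logits, n_samples):
    """Score-function estimate of d/d(logits) of E_{l ~ pi}[R(l)]."""
    pi = softmax(logits)
    grad = np.zeros_like(logits)
    for _ in range(n_samples):
        l = rng.choice(K, p=pi)
        # grad of log pi(l) for a categorical softmax policy: onehot(l) - pi
        score = -pi.copy()
        score[l] += 1.0
        grad += reward[l] * score
    return grad / n_samples

pi = softmax(logits)
exact = pi * (reward - pi @ reward)   # closed-form gradient, for comparison
estimate = reinforce_grad(logits, n_samples=10_000)
# The estimate is unbiased and converges to `exact` as n_samples grows,
# but any single sample is very noisy, which is the difficulty in the text.
```

The variance of the single-sample estimate (here tamed only by averaging 10,000 draws) is what makes training slow for large T, motivating the partial supervision proposed in this paper.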

