HARD ATTENTION CONTROL BY MUTUAL INFORMATION MAXIMIZATION

Abstract

Biological agents have adopted the principle of attention to limit the rate of incoming information from the environment. One question that arises is if an artificial agent has access to only a limited view of its surroundings, how can it control its attention to effectively solve tasks? We propose an approach for learning how to control a hard attention window by maximizing the mutual information between the environment state and the attention location at each step. The agent employs an internal world model to make predictions about its state and focuses attention towards where the predictions may be wrong. Attention is trained jointly with a dynamic memory architecture that stores partial observations and keeps track of the unobserved state. We demonstrate that our approach is effective in predicting the full state from a sequence of partial observations. We also show that the agent's internal representation of the surroundings, a live mental map, can be used for control in two partially observable reinforcement learning tasks. Videos of the trained agent can be found at https://sites.google.com/view/hard-attention-control.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have successfully employed neural networks over the past few years, surpassing human-level performance in many tasks (Mnih et al., 2015; Silver et al., 2017; Berner et al., 2019; Schulman et al., 2017). But a key difference between how tasks are performed by humans and by RL algorithms is that humans can focus on parts of the state at a time, using attention to limit the amount of information gathered at every step. We actively control our attention to build an internal representation of our surroundings over multiple fixations (Fourtassi et al., 2017; Barrouillet et al., 2004; Yarbus, 2013; Itti, 2005). We also use memory and internal world models to predict the motion of dynamic objects in the scene when they are not under direct observation (Bosco et al., 2012). By limiting the amount of input information in these two ways, i.e. directing attention only where needed and internally modeling the rest of the environment, we are able to be more efficient in terms of the data that must be collected from the environment and processed at each time step.

Figure 1: Illustration of the PhysEnv domain (Du & Narasimhan, 2019). Here, the agent (dark blue) needs to navigate to the goal (red) while avoiding enemies (light blue). Typically the full environment states (top) are provided to the RL algorithm to learn from. We consider the case where only observations from under a hard attention window (bottom) are available. The attention's location (yellow square) at each time step is controllable by the agent. Using only partial observations, the agent must learn to represent its surroundings and complete its task.

By contrast, modern reinforcement learning methods often operate on the entire state. Observing the entire state simultaneously may be difficult in realistic environments. Consider an embodied agent that must actuate its camera to gather visual information about its surroundings while learning how to cross a busy street.
At every moment, there are different objects in the environment competing for its attention. The agent needs to learn to look left and right to store the locations, headings, and speeds of nearby vehicles, and perhaps other pedestrians. It must learn to create a live map of its surroundings, frequently checking back on moving objects to update their dynamics. In other words, the agent must learn to selectively move its attention so as to maximize the amount of information it collects from the environment at a time, while internally modeling the motion of the other, more predictable parts of the state. Its internal representation, built from successive glimpses, must be sufficient to learn how to complete tasks in this partially observable environment.

We consider the problem of acting in an environment that is only partially observable through a controllable, fixed-size, hard attention window (see figure 1). Only the part of the state that is under the attention window is available to the agent as its observation. The rest must be inferred from previous observations and experience. We assume that the location of the attention window at every time step is under the control of the agent, and that its size relative to the full environment state is known to the agent. We distinguish this task from that of learning soft attention (Vaswani et al., 2017), where the full state is attended to, weighted by a vector, and then fed into subsequent layers.

Our system must learn to 1) decide where to place the attention in order to gather more information about its surroundings, 2) record the observation made into an internal memory and model the motion within parts of the state that were unobserved, and 3) use this internal representation to learn how to solve its task within the environment. Our approach for controlling attention uses RL to maximize an information-theoretic objective closely related to the notion of surprise or novelty (Schmidhuber, 1991).
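The hard attention setting described above can be made concrete with a minimal sketch of how an observation is produced: the agent picks a window location, and only the crop under that window is returned. The function name, window size, and clipping behavior here are illustrative assumptions, not details from the paper.

```python
import numpy as np

def extract_glimpse(state, loc, window=16):
    """Crop a fixed-size hard attention window from the full state.

    state: (H, W, C) array for the full environment frame.
    loc:   (row, col) top-left corner of the attention window,
           chosen by the agent at each step.
    Everything outside the returned crop stays unobserved and must be
    inferred from memory. (Illustrative sketch; window size and clipping
    are assumptions.)
    """
    r, c = loc
    h, w = state.shape[:2]
    r = int(np.clip(r, 0, h - window))  # keep the window inside the frame
    c = int(np.clip(c, 0, w - window))
    return state[r:r + window, c:c + window]

full_state = np.zeros((64, 64, 3))
glimpse = extract_glimpse(full_state, (10, 40))
```

The agent's observation at each step is only this `(16, 16, 3)` crop; the attention policy's action is the `loc` argument.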
It is unsupervised in terms of environment rewards, i.e. it can be trained on offline data (states and actions) without knowing the task or the associated rewards. We discuss this in more detail in section 5.

Memory also plays a crucial role in allowing agents to solve tasks in partially observable environments. We pair our attention control mechanism with a memory architecture inspired largely by Du & Narasimhan (2019)'s SpatialNet, but modified to work in partially observable domains. This is described in section 4.

Empirically, we show in section 6.1 that our system is able to reconstruct the full state image, including dynamic objects, at all time steps given only partial observations. Further, we show in section 6.2 that the internal representation built by our attention control mechanism and memory architecture is sufficient for the agent to learn to solve tasks in this challenging partially observable environment.
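One simple way to realize a surprise-driven attention reward of the kind described above is to reward the attention policy where the world model's prediction disagrees with what is actually observed. This is a sketch using pixel error as a stand-in for information gain; the paper's actual objective is a mutual-information quantity, and the function names here are assumptions.

```python
import numpy as np

def attention_reward(predicted_state, observed_glimpse, loc, window=16):
    """Reward the attention policy for looking where the world model errs.

    predicted_state:  the model's prediction of the full (H, W, C) frame.
    observed_glimpse: the (window, window, C) crop actually seen at loc.
    Mean squared pixel error is used as a simple proxy for the information
    gained by observing that region (an assumption; not the paper's exact
    mutual-information objective).
    """
    r, c = loc
    predicted_glimpse = predicted_state[r:r + window, c:c + window]
    return float(np.mean((predicted_glimpse - observed_glimpse) ** 2))

pred = np.zeros((64, 64, 3))      # model predicted an empty region
obs = np.ones((16, 16, 3))        # but the glimpse contained objects
reward = attention_reward(pred, obs, (0, 0))
```

Regions the model already predicts well yield near-zero reward, so the policy is pushed toward unpredictable, dynamic parts of the state, matching the intuition of attending where predictions may be wrong.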

2. RELATED WORK

Using hard attention for image classification or object recognition is well studied in computer vision (Alexe et al., 2012; Butko & Movellan, 2009; Larochelle & Hinton, 2010; Paletta et al., 2005; Zoran et al., 2020; Welleck et al., 2017). Attention allows for processing only the salient or interesting parts of the image (Itti et al., 1998). Similarly, attention control has been applied to tracking objects within a video (Denil et al., 2012; Kamkar et al., 2020; Yu et al., 2020).

Surprisingly, little recent work exists on hard attention control in reinforcement learning domains, where a sequential decision-making task has to be solved using partial observations from under the attention window. Mnih et al. (2014) proposed a framework for hard attention control in the classification setting and a simple reinforcement learning task. Their approach uses environment rewards to train an attention control RL agent. Ours differs mainly in that we train the attention control using our novel information-theoretic objective as the reward. Mnih et al. (2014)'s approach leads to a task-specific policy for attention control, whereas our approach is unsupervised in terms of the task and can be applied generally to downstream tasks in the environment. Our approach also differs in that we use a memory architecture better suited to partially observable tasks with 2D images as input, compared to the RNN used by Mnih et al. (2014).

There has been much prior work on memory and world models for reinforcement learning (Ha & Schmidhuber, 2018; Graves et al., 2016; Hausknecht & Stone, 2015; Khan et al., 2017). The work closest to our own is Du & Narasimhan (2019)'s SpatialNet, which attempts to learn a task-agnostic world model for multi-task settings. Our memory architecture is largely inspired by SpatialNet.
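The core idea of a SpatialNet-style memory, a 2D feature map whose cells mirror environment locations, can be sketched as follows. This is a minimal illustration of the write step only: observed cells are overwritten with glimpse features while unobserved cells retain their previous contents. The real architecture also propagates unobserved cells forward with learned dynamics; all names here are assumptions.

```python
import numpy as np

def write_to_memory(memory, glimpse_features, loc):
    """Write features from an observed glimpse into a spatial memory map.

    memory:           (H, W, D) tensor spatially aligned with the environment.
    glimpse_features: (h, w, D) features from the current attention window.
    loc:              (row, col) where the window was placed.
    Cells under the window are overwritten; all other cells keep their
    previous (internally modeled) contents. (Sketch of the write step only;
    the learned update dynamics are omitted.)
    """
    r, c = loc
    h, w = glimpse_features.shape[:2]
    memory[r:r + h, c:c + w] = glimpse_features
    return memory

memory = np.zeros((64, 64, 8))
feat = np.ones((16, 16, 8))
memory = write_to_memory(memory, feat, (4, 4))
```

Because the memory is spatially indexed, a downstream decoder can read the whole map to reconstruct the full state, and the policy can read it as a live mental map of the surroundings.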

