ACHIEVING EXPLAINABILITY IN A VISUAL HARD ATTENTION MODEL THROUGH CONTENT PREDICTION

Abstract

A visual hard attention model actively selects and observes a sequence of subregions in an image to make a prediction. Unlike a deep convolutional network, a hard attention model makes it explainable which regions of the image contributed to the prediction. However, the attention policy the model uses to select these regions is typically not explainable. The majority of hard attention models determine the attention-worthy regions by first analyzing a complete image. In practice, however, the entire image may not be available at the outset; instead, it is sensed gradually through a series of partial observations. In this paper, we design an efficient hard attention model for classifying such partially observable scenes. The attention policy used by our model is explainable and non-parametric. The model estimates the expected information gain (EIG) obtained from attending to various regions by predicting their content ahead of time. Following Bayesian Optimal Experiment Design, it compares the EIG of candidate regions and attends to the region with maximum EIG. We train our model with a differentiable objective, optimized using gradient descent, and test it on several datasets. The performance of our model is comparable to or better than the baseline models.

1. INTRODUCTION

Though deep convolutional networks achieve state-of-the-art performance on the image classification task, it is difficult to explain which input regions affected their output. A technique called visual hard attention provides this explanation by design. A hard attention model sequentially attends to small but informative subregions of the input, called glimpses, to make predictions. While the attention mechanism explains the task-specific decisions, the attention policies learned by the model remain unexplainable. For example, one cannot explain the attention policy of a caption-generation model that correctly predicts the word 'frisbee' while looking at a region far from the actual frisbee (Xu et al. (2015)). The majority of hard attention models first analyze a complete image to locate the task-relevant subregions and then attend to these locations to make predictions (Ba et al. (2014); Elsayed et al. (2019)). In practice, however, we often do not have access to the entire scene and must gradually attend to important subregions to collect task-specific information. At each step in the process, we decide the next attention-worthy location based on the partial observations collected so far. Explainable attention policies are all the more desirable under such partial observability. Mnih et al. (2014) present a model that functions under partial observability, but their attention policy is not explainable. They train their model with the REINFORCE algorithm (Williams (1992)), which is challenging to optimize. Moreover, the model's performance is affected adversely if the parameterization of the attention policy is not suitable. For example, an object classification model with a unimodal Gaussian policy learns to attend to the background region between two objects (Sermanet et al. (2014)).

This paper develops a hard attention model with an explainable attention policy for classifying images through a series of partial observations. We formulate the problem of hard attention as Bayesian Optimal Experiment Design (BOED). A recurrent model finds the location expected to gain maximum information about the class label and attends to this location. To estimate the expected information gain (EIG) under partial observability, the model predicts the content of the unseen regions based on the regions observed so far. Using the knowledge gained by attending to various locations in an image, the model predicts the class label. To the best of our knowledge, ours is the first hard attention model that is entirely explainable under partial observability. Our main contributions are as follows. First, our attention policies are explainable by design: one can explain that the model attends to a specific location because it expects the corresponding glimpse to maximize the expected information gain. Second, the model does not rely on the complete image to predict the attention locations and performs well under partial observability. Third, the training objective is differentiable and can be optimized using standard gradient backpropagation; we train the model with discriminative and generative objectives to predict the label and the image content, respectively. Fourth, our attention policy is non-parametric and can be implicitly multi-modal.
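The EIG-based selection rule described above can be illustrated with a minimal sketch. The NumPy code below is a hypothetical illustration, not the paper's implementation: `sample_glimpse_at` and `p_y_given_g` stand in for the generative content predictor and the classifier posterior. For each candidate location, it imagines glimpse contents, measures the expected drop in class-label entropy, and picks the location with maximum EIG.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution p(y)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def expected_information_gain(p_y, sample_glimpse, p_y_given_g, n_samples=8):
    """EIG(l) = H[p(y|o)] - E_{g ~ p(g|o,l)}[ H[p(y|o,g)] ].

    p_y            : class posterior given the observations o so far
    sample_glimpse : draws an imagined glimpse content for one location
    p_y_given_g    : class posterior after conditioning on an imagined glimpse
    """
    prior_h = entropy(p_y)
    post_h = np.mean([entropy(p_y_given_g(sample_glimpse()))
                      for _ in range(n_samples)])
    return prior_h - post_h

def select_location(p_y, candidates, sample_glimpse_at, p_y_given_g):
    """Attend the candidate location with maximum EIG."""
    eigs = [expected_information_gain(p_y, lambda l=l: sample_glimpse_at(l),
                                      p_y_given_g)
            for l in candidates]
    return candidates[int(np.argmax(eigs))]
```

For instance, with a uniform two-class posterior, a location whose imagined glimpse collapses the posterior to near-certainty has EIG close to log 2, while a location whose glimpse leaves the posterior unchanged has EIG 0, so the former is selected.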

2. RELATED WORKS

A hard attention model prioritizes task-relevant regions to extract meaningful features from an input. Early attempts to model attention employed image saliency as a priority map. High-priority regions were selected using methods such as winner-take-all (Koch & Ullman (1987); Itti et al. (1998); Itti & Koch (2000)), searching by throwing out all features but the one with minimal activity (Ahmad (1992)), and dynamic routing of information (Olshausen et al. (1993)). A few works used graphical models to model visual attention. Rimey & Brown (1991) used augmented hidden Markov models to model the attention policy. Larochelle & Hinton (2010) used a Restricted Boltzmann Machine (RBM) with third-order connections between the attention location, the glimpse, and the representation of a scene. Motivated by this, Zheng et al. (2015) proposed an autoregressive model that admits exact gradients, unlike an RBM. Tang et al. (2014) used an RBM as a generative model and searched for informative locations using the Hamiltonian Monte Carlo algorithm. Many works used reinforcement learning to train attention models. Paletta et al. (2005) used Q-learning with a reward that measures the objectness of the attended region. Denil et al. (2012) estimated rewards using particle filters and employed a policy based on Gaussian Processes and the upper confidence bound. Butko & Movellan (2008) modeled attention as a partially observable Markov decision process and used a policy gradient algorithm for learning; Butko & Movellan (2009) later extended this approach to multiple objects. Recently, the machine learning community has used the REINFORCE policy gradient algorithm to train hard attention models (Mnih et al. (2014); Ba et al. (2014); Xu et al. (2015); Elsayed et al. (2019)). Among these, only Elsayed et al. (2019) claim explainability by design. Other works use an EM-style learning procedure (Ranzato (2014)), the wake-sleep algorithm (Ba et al. (2015)), a voting-based region selection (Alexe et al. (2012)), and differentiable models (Gregor et al. (2015); Jaderberg et al. (2015); Eslami et al. (2016)). Among the recent models, Ba et al. (2014); Ranzato (2014); Ba et al. (2015) look at a low-resolution gist of the input at the beginning, and Xu et al. (2015); Elsayed et al. (2019); Gregor et al. (2015); Jaderberg et al. (2015); Eslami et al. (2016) consume the whole image to predict the locations to attend. In contrast, our model does not look at the entire image, at low resolution or otherwise. Moreover, our attention policies are explainable. Our model can thus be applied in a wide range of scenarios where explainable predictions are desirable for partially observable images.

3. MODEL

In this paper, we consider a recurrent attention model that sequentially captures glimpses from an image x and predicts a label y. The model runs from time t = 0 to T-1. It uses a recurrent net to maintain a hidden state h_{t-1} that summarizes the glimpses observed up to time t-1. At time t, it predicts coordinates l_t based on the hidden state h_{t-1} and captures a square glimpse g_t centered at l_t in the image x, i.e., g_t = g(x, l_t). It uses g_t and l_t to update the hidden state to h_t and predicts the label y based on the updated state h_t.

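The per-step loop of the recurrent attention model in Section 3 can be sketched as follows. This is a minimal runnable illustration with toy stand-ins: `policy`, `update`, and `classify` are placeholder callables, not the paper's recurrent network, and the glimpse extractor simply clamps the window to the image bounds.

```python
import numpy as np

def glimpse(x, l, size=2):
    """Extract a square size x size glimpse g_t = g(x, l_t) around l_t,
    clamped so the window stays inside the image."""
    r = max(0, min(x.shape[0] - size, l[0] - size // 2))
    c = max(0, min(x.shape[1] - size, l[1] - size // 2))
    return x[r:r + size, c:c + size]

def run_episode(x, policy, update, classify, h0, T=3):
    """One episode of the recurrent attention loop.

    At each step t: predict l_t from h_{t-1}, capture g_t = g(x, l_t),
    update the state h_t = update(h_{t-1}, g_t, l_t); after T steps,
    predict the label y from the final state.
    """
    h = h0
    for t in range(T):
        l = policy(h)       # l_t from h_{t-1}
        g = glimpse(x, l)   # g_t = g(x, l_t)
        h = update(h, g, l) # h_t summarizes glimpses up to time t
    return classify(h)      # predict y from the final hidden state
```

In the full model, `policy` would be replaced by the EIG-maximizing location selection and `update`/`classify` by the recurrent and classification networks trained with the generative and discriminative objectives.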