ACHIEVING EXPLAINABILITY IN A VISUAL HARD ATTENTION MODEL THROUGH CONTENT PREDICTION

Abstract

A visual hard attention model actively selects and observes a sequence of subregions in an image to make a prediction. Unlike a deep convolutional network, a hard attention model makes it explainable which regions of the image contributed to the prediction. However, the attention policy used by the model to select these regions is not explainable. The majority of hard attention models determine the attention-worthy regions by first analyzing a complete image. In practice, however, the entire image may not be available at the beginning but may instead be sensed gradually through a series of partial observations. In this paper, we design an efficient hard attention model for classifying partially observable scenes. The attention policy used by our model is explainable and non-parametric. The model estimates the expected information gain (EIG) obtained from attending to various regions by predicting their content ahead of time. It compares the EIG of candidate regions using Bayesian Optimal Experiment Design and attends to the region with the maximum EIG. We train our model with a differentiable objective, optimized using gradient descent, and test it on several datasets. The performance of our model is comparable to or better than that of the baseline models.

1. INTRODUCTION

Though deep convolutional networks achieve state-of-the-art performance on the image classification task, it is difficult to explain which input regions affected their output. A technique called visual hard attention provides this explanation by design. A hard attention model sequentially attends to small but informative subregions of the input, called glimpses, to make predictions. While the attention mechanism explains the task-specific decisions, the attention policies learned by the model remain unexplainable. For example, one cannot explain the attention policy of a caption generation model that correctly predicts the word 'frisbee' while looking at a region far from an actual frisbee (Xu et al., 2015).

The majority of hard attention models first analyze a complete image to locate the task-relevant subregions and then attend to these locations to make predictions (Ba et al., 2014; Elsayed et al., 2019). In practice, however, we often do not have access to the entire scene; instead, we gradually attend to important subregions to collect task-specific information. At each step in the process, we decide the next attention-worthy location based on the partial observations collected so far. Explainable attention policies are especially desirable under such partial observability. Mnih et al. (2014) present a model that functions under partial observability, but their attention policies are not explainable. They train their model with the REINFORCE algorithm (Williams, 1992), which is challenging to optimize. Moreover, the model's performance suffers if the parameterization of the attention policy is suboptimal. For example, an object classification model with a unimodal Gaussian policy learns to attend to the background region between two objects (Sermanet et al., 2014).
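The EIG-based selection described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; `predict_content` and `classifier` are hypothetical stand-ins for the model's content predictor and class posterior, and EIG is approximated as the drop in posterior entropy averaged over imagined glimpse contents:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def expected_information_gain(p_class, content_samples, classifier):
    """Approximate EIG of attending one location.

    p_class: current class posterior, shape (C,).
    content_samples: imagined glimpse contents predicted for this location.
    classifier: maps a glimpse content to an updated class posterior.
    """
    prior_entropy = entropy(p_class)
    # Average posterior entropy over the predicted (imagined) contents.
    expected_posterior_entropy = np.mean(
        [entropy(classifier(g)) for g in content_samples])
    return prior_entropy - expected_posterior_entropy

def select_location(p_class, candidates, predict_content, classifier):
    """Attend to the candidate location with maximum estimated EIG."""
    eig = [expected_information_gain(p_class, predict_content(loc), classifier)
           for loc in candidates]
    return candidates[int(np.argmax(eig))]
```

A location whose predicted content would collapse the class posterior scores a high EIG, while a location whose content leaves the posterior unchanged scores zero, so the selection rule favors the most informative glimpse.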

This paper develops a hard attention model with an explainable attention policy for classifying images through a series of partial observations. We formulate hard attention as a problem of Bayesian Optimal Experiment Design (BOED). A recurrent model finds the location expected to gain the maximum information about the class label and attends to that location. To estimate the expected information gain (EIG) under partial observability, the model predicts the content of the un-

