VARIATIONAL STRUCTURED ATTENTION NETWORKS FOR DENSE PIXEL-WISE PREDICTION

Abstract

State-of-the-art performance in dense pixel-wise prediction tasks is obtained with specifically designed convolutional networks. These models often benefit from attention mechanisms that allow better learning of deep representations. Recent works showed the importance of estimating both spatial- and channel-wise attention tensors. In this paper we propose a unified approach to jointly estimate spatial attention maps and channel attention vectors so as to structure the resulting attention tensor. Moreover, we integrate the estimation of the attention within a probabilistic framework, leading to VarIational STructured Attention networks (VISTA-Net). We implement the inference rules within the neural network, thus allowing for joint learning of the probabilistic and the CNN front-end parameters. Importantly, as demonstrated by our extensive empirical evaluation on six large-scale datasets, VISTA-Net outperforms the state of the art in multiple continuous and discrete pixel-level prediction tasks, thus confirming the benefit of structuring the attention tensor and of inferring it within a probabilistic formulation.

1. INTRODUCTION

Over the past decade, convolutional neural networks (CNNs) have become the privileged methodology for computer vision tasks requiring dense pixel-wise prediction, such as semantic segmentation (Chen et al., 2016b; Fu et al., 2019), monocular depth prediction (Liu et al., 2015; Roy & Todorovic, 2016), contour detection (Xu et al., 2017a) and surface normal estimation (Eigen et al., 2014). Recent studies provided clear evidence that attention mechanisms (Mnih et al., 2014) within deep networks are a crucial factor in improving performance (Chen et al., 2016b; Xu et al., 2017a; Fu et al., 2019; Zhan et al., 2018). In particular, previous works demonstrated that deeply learned attention, acting as soft weights that interact with deep features at each channel (Zhong et al., 2020; Zhang et al., 2018; Song et al., 2020) and at each pixel location (Li et al., 2020a; Johnston & Carneiro, 2020; Tay et al., 2019), improves pixel-wise prediction accuracy (see Fig. 1.a and Fig. 1.b). Recently, Fu et al. (2019) proposed the Dual Attention Network (DANet), embedding in a fully convolutional network (FCN) two complementary attention modules, specifically conceived to separately model the semantic dependencies associated with the spatial and the channel dimensions (Fig. 1.c). Concurrently, other approaches have considered structured attention models integrated within a graph network framework (Zhang et al., 2020; Chen et al., 2019; Xu et al., 2017a), showing the empirical advantage of adopting a graphical model to effectively capture the structured information present in the hidden layers of the neural network, thus enabling the learning of better deep feature representations. Notably, Xu et al.
(2017a) first introduced attention-gated conditional random fields (AG-CRFs), a convolutional neural network implementing a probabilistic graphical model that considers attention variables as gates (Minka & Winn, 2009) in order to learn improved deep features and effectively fuse multi-scale information. However, their structured attention model operates only at the spatial level, while channel-wise dependencies are not considered. This paper advances the state of the art in dense pixel-wise prediction by proposing a novel approach to learn more effective deep representations, integrating a structured attention model that jointly accounts for spatial- and channel-level dependencies using an attention tensor (Fig. 1.d) within a CRF framework. More precisely, inspired by Xu et al. (2017a), we model the attention as gates. Crucially, we address the question of how to enforce structure within these latent gates, in order to jointly model spatial- and channel-level dependencies while learning deep features. To do so, we hypothesize that the attention tensor is the sum of T rank-1 tensors, each being the tensor product of a spatial attention map and a channel attention vector. This attention tensor is used as a structured latent attention gate, enhancing the feature maps. We cast the inference problem into a maximum-likelihood estimation formulation that is made computationally tractable thanks to a variational approximation. Furthermore, we implement the maximum-likelihood update rules within a neural network, so that they can be jointly learned with the preferred CNN front-end. We call our approach, based on structured attention and variational inference, VarIational STructured Attention Networks (VISTA-Net). We evaluate our method on multiple pixel-wise prediction problems, i.e. monocular depth estimation, semantic segmentation and surface normal prediction, considering six publicly available datasets, i.e.
NYUD-V2 (Silberman et al., 2012), KITTI (Geiger et al., 2013), Pascal-Context (Mottaghi et al., 2014), Pascal VOC2012 (Everingham et al., 2010), Cityscapes (Cordts et al., 2016) and ScanNet (Dai et al., 2017). Our results demonstrate that VISTA-Net is able to learn rich deep representations thanks to the proposed structured attention and our probabilistic formulation, outperforming state-of-the-art methods. Related Work. Several works have considered integrating attention models within deep architectures to improve performance in tasks such as image categorization (Xiao et al., 2015), speech recognition (Chorowski et al., 2015) and machine translation (Vaswani et al., 2017; Kim et al., 2017; Luong et al., 2015). Focusing on pixel-wise prediction, Chen et al. (2016b) first described an attention model to combine multi-scale features learned by an FCN for semantic segmentation. Zhang et al. (2018) designed EncNet, a network equipped with a channel attention mechanism to model global context. Zhao et al. (2018) proposed to account for pixel-wise dependencies by introducing relative position information along the spatial dimension within the convolutional layers. Huang et al. (2019b) described CCNet, a deep architecture that embeds a criss-cross attention module modeling contextual dependencies with sparsely-connected graphs, so as to achieve higher computational efficiency. Fu et al. (2019) proposed to model semantic dependencies associated with the spatial and channel dimensions by using two separate attention modules. Zhong et al. (2020) introduced a squeeze-and-attention network (SANet) tailored to pixel-wise prediction that takes into account spatial and channel inter-dependencies in an efficient way. Attention was first adopted within a CRF framework by Xu et al. (2017a), who introduced gates to control the message passing between latent variables and showed that this strategy is effective for contour detection.
Our work significantly departs from these previous approaches, as we introduce a novel structured attention mechanism that jointly handles spatial- and channel-level dependencies within a probabilistic framework. Notably, we also show that our model can be successfully employed on several challenging dense pixel-level prediction tasks. Our work is also closely related to previous studies on dual graph convolutional networks (Zhang et al., 2019c) and dynamic graph message passing networks (Zhang et al., 2020), which have been successfully used for pixel-level prediction tasks. However, while they also rely on message passing to learn refined deep feature representations, they lack a probabilistic formulation. Finally, previous studies (Xu et al., 2017c; Arnab et al., 2016; Chen et al., 2019) described CRF-based models for pixel-wise estima-




Figure 1: Different attention mechanisms. (a) and (b) correspond to channel-only and spatial-only attention, respectively. (c) corresponds to previous works (Fu et al., 2019) adding (⊕) a channel and a spatial attention tensor. (d) shows the attention mechanism of VISTA-Net: a channel-wise vector and a spatial map are estimated and then tensor-multiplied (⊗), yielding a structured attention tensor. The attention tensor acts as a structured latent gate, producing a probabilistically enhanced feature map.
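To make the structured-attention idea concrete, the sketch below builds an attention tensor as the sum of T rank-1 tensors (each the tensor product of a channel vector and a spatial map) and uses it to gate a feature map. This is a minimal NumPy illustration of the mechanism in Fig. 1.d, not the paper's implementation: the function name, the tensor shapes, and the sigmoid squashing into a (0, 1) gate are our own assumptions for exposition.

```python
import numpy as np

def structured_attention(features, spatial_maps, channel_vecs):
    """Hypothetical sketch of a rank-1 structured attention gate.

    features:     (C, H, W) feature map from the CNN front-end
    spatial_maps: (T, H, W) spatial attention maps
    channel_vecs: (T, C)    channel attention vectors

    The attention tensor A (C, H, W) is the sum over t of the tensor
    product channel_vecs[t] (x) spatial_maps[t], i.e. T rank-1 terms.
    """
    # 'tc,thw->chw' forms the outer product per t and sums over t
    A = np.einsum('tc,thw->chw', channel_vecs, spatial_maps)
    # Squash into (0, 1) so the tensor acts as a soft latent gate
    gate = 1.0 / (1.0 + np.exp(-A))
    return gate * features
```

With T = 1 the attention tensor is a single rank-1 term (pure channel ⊗ spatial structure); larger T trades parsimony for expressiveness while keeping the tensor structured.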

