CROSS-LAYER RETROSPECTIVE RETRIEVING VIA LAYER ATTENTION

Abstract

Growing evidence shows that strengthening layer interactions can enhance the representation power of a deep neural network, while self-attention excels at learning interdependencies by retrieving query-activated information. Motivated by this, we devise a cross-layer attention mechanism, called multi-head recurrent layer attention (MRLA), that sends a query representation of the current layer to all previous layers to retrieve query-related information from different levels of receptive fields. A lightweight version of MRLA is also proposed to reduce the quadratic computation cost. The proposed layer attention mechanism can enrich the representation power of many state-of-the-art vision networks, including CNNs and vision transformers. Its effectiveness has been extensively evaluated on image classification, object detection and instance segmentation tasks, where improvements are consistently observed. For example, MRLA improves Top-1 accuracy on ResNet-50 by 1.6% while introducing only 0.16M parameters and 0.07B FLOPs. Surprisingly, it boosts performance by a large margin of 3-4% box AP and mask AP in dense prediction tasks. Our code is available at https://github.com/joyfang1106/MRLA.

1. INTRODUCTION

Growing evidence indicates that strengthening layer interactions can encourage the information flow of a deep neural network (He et al., 2016; Huang et al., 2017; Zhao et al., 2021). For example, in vision networks, the receptive fields are usually enlarged as layers are stacked. These hierarchical receptive fields play different roles in extracting features: local texture features are captured by small receptive fields, while global semantic features are captured by large receptive fields. Hence, encouraging layer interactions can enhance the representation power of networks by combining different levels of features. Previous empirical studies also support the necessity of building interdependencies across layers. ResNet (He et al., 2016) added a skip connection between two consecutive layers. DenseNet (Huang et al., 2017) further reinforced layer interactions by making each layer accessible to all subsequent layers within a stage. More recently, GLOM (Hinton, 2021) adopted an architecture with intensive interactions, including bottom-up, top-down, and same-level interactions, in an attempt to represent part-whole hierarchies in a neural network.

Meanwhile, the attention mechanism has proven itself in learning interdependencies by retrieving query-activated information in deep neural networks. Existing work on attention largely emphasizes amplifying interactions within a layer (Hu et al., 2018; Woo et al., 2018; Dosovitskiy et al., 2021). These methods apply attention to channels, spatial locations, and patches; however, none of them considers attention over layers, which are actually the higher-level features of a network.

It is then natural to ask: "Can attention replicate its success in strengthening layer interactions?" This paper gives a positive answer. Specifically, starting from the vanilla attention, we first give a formal

