LUNA: LANGUAGE AS CONTINUING ANCHORS FOR REFERRING EXPRESSION COMPREHENSION

Abstract

Referring expression comprehension aims to localize the object described by a natural language expression in an image. Using location priors to remedy inaccuracies in cross-modal alignment is the state of the art for CNN-based methods tackling this problem. Recent Transformer-based models cast this idea aside, making the case for steering away from hand-designed components. In this work, we propose LUNA, which uses language as continuing anchors to guide box prediction in a Transformer decoder, and show that language-guided location priors can be effectively exploited in a Transformer-based architecture. Specifically, we first initialize an anchor box from the input expression via a small "proto-decoder", and then use this anchor as a location prior in a modified Transformer decoder for predicting the bounding box. Passing through each decoder layer, the anchor box is first used as a query for pooling multi-modal context, and then updated based on the pooled context. This approach allows the decoder to focus selectively on one part of the scene at a time, which reduces noise in the multi-modal context and leads to more accurate box predictions. Our method outperforms existing state-of-the-art methods on the challenging datasets of ReferIt Game, RefCOCO/+/g, and Flickr30K Entities.

1. INTRODUCTION

Referring expression comprehension (REC) is the task of localizing natural language in images: a plain-text expression describes a single object or a group of objects in an image, and the objective is to put a bounding box around the target. It provides fundamental value to real-world applications such as robotics Tellex et al. (2020), image editing Shi et al. (2021), and surveillance Li et al. (2017). To reduce the search space and mitigate the difficulty of cross-modal context modeling, early methods rely on prior knowledge about the "likely" locations of the target.

Specifically, region proposals and anchor boxes are the two most common types of location priors. Two-stage models Hu et al. (2017); Wang et al. (2018); Yu et al. (2018a;b); Yang et al. (2019a) leverage region proposals, consisting of thousands of massively overlapping boxes extracted using a standalone proposal method Uijlings et al. (2013); Zitnick & Dollár (2014); Ren et al. (2015). A cross-modal similarity-based ranking method is then used to select one proposal as the prediction. Such models cannot recover from proposal failures and generally suffer a low recall rate Yang et al. (2019b). On the other hand, one-stage models Yang et al. (2019b; 2020); Luo et al. (2020) leverage dense anchor boxes defined over image locations and directly predict a bounding box from integrated multi-modal feature maps. This approach allows box regression to be conditioned on the input expression, thus addressing the aforementioned issue of two-stage models. The definitions of anchors, however, are heuristic and greatly influence model performance.

The most recent Transformer-based methods Deng et al. (2021); Li & Sigal (2021) discard location priors and leverage the powerful correlation-modeling capability of the Transformer architecture Devlin et al. (2019); Vaswani et al. (2017) for end-to-end prediction. Typically, a Transformer encoder Devlin et al. (2019) jointly embeds visual and linguistic inputs, and box prediction is made from a context feature vector globally pooled from the encoder outputs based on query-to-feature similarities. The query (which can be a black-box learnable feature vector Deng et al. (2021) or a linguistic feature vector summarizing the expression Li & Sigal (2021)) serves as a representation of the target, and context pooling is guided by similarity-based attention. A potential issue with this approach is that without location priors (pre-designed or learnable), these queries rely on purely "content"-based similarities to decide context locations, which in practice often contain numerous inaccuracies. Li & Sigal (2021) demonstrate that even in scenes of relatively simple composition, the attention of the context feature vector from this approach often peaks at multiple locations, including those outside the target area, which leads to wrong or inaccurate box predictions.

Motivated by the above observations, in this work we propose a Transformer-based decoding method for addressing REC, which we term LUNA (short for LangUage as contiNuing Anchors). It consists in leveraging the input expression to generate a series of continuously updated anchor boxes (shown in Fig. 1) that guide object localization. LUNA generates the first anchor box by attending to image regions under the guidance of the input expression. This is achieved via a cross-attention-based proto-decoder, which summarizes an object representation based on word-specific visual context and decodes an approximate location of the target. Given the initial object representation and the anchor box, a stack of modified Transformer decoder layers iteratively refines the object representation and updates the anchor box. Progressing through each layer, the current anchor box is projected into a high-dimensional space and used for pooling multi-modal context; a new anchor box is then predicted from the pooled context. We refer to this stack of decoder layers as a continuous anchor-guided decoder. This decoding approach allows the model to focus selectively on one part of the scene at a time and to acquire more accurate context information with focused attention.

Figure 1: Example anchor boxes learned by our method. The anchor boxes are generated sequentially by a stack of modified Transformer decoder layers. The next anchor box is predicted based on context aggregated by the current anchor box. The last predicted box is used as the final prediction.

To evaluate the efficacy of the proposed method, we conduct extensive experiments on the challenging datasets of ReferIt Kazemzadeh et al. (2014), RefCOCO Yu et al. (2016), RefCOCO+ Yu et al. (2016), RefCOCOg Nagaraja et al. (2016), and Flickr30K Entities Plummer et al. (2015), for which we improve the state of the art by large margins.

Our contributions are summarized as follows. (1) We propose a simple but effective Transformer-based decoding method which leverages anchors as location priors for tackling referring expression comprehension. (2) Our decoding strategy consists of a proto-decoder that estimates an initial location from language, and a continuous anchor-guided decoder that predicts bounding boxes progressively. (3) We obtain new state-of-the-art results on five REC benchmarks, demonstrating the effectiveness and generality of the proposed method.

2. RELATED WORK

Since the proposal of this task in several parallel studies Yu et al. (2016); Hu et al. (2016); Nagaraja et al. (2016), the state-of-the-art paradigm for tackling it has transitioned from two-stage models to one-stage models to the most recent Transformer-based models. Two-stage models Hu et al. (2017); Wang et al. (2018); Yu et al. (2018a;b); Yang et al. (2019a) rank a set of image regions based on their similarities with the referring expression, where the regions are pre-extracted using a region proposal method. Popular proposal methods include unsupervised ones based on hand-crafted features Uijlings et al. (2013); Zitnick & Dollár (2014) and pre-trained neural nets Ren et al. (2015). This propose-and-rank approach is slow and suffers from the limitations of the proposal method (such as location inaccuracies from unsupervised methods or biases of the proposal network) Liao et al. (2020); Yang et al. (2019b). To address these issues, one-stage models opt for directly predicting the bounding box from fused multi-modal features, relying on feature fusion mechanisms and dense anchoring Yang et al. (2019b; 2020).
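The query-to-feature context pooling used by prior Transformer-based methods, a single query vector attending over all encoder outputs with no location prior, can be sketched as follows. This is a minimal NumPy illustration, not the implementation of any cited model; the function and variable names are ours, and learned projections and multi-head attention are omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def pool_context(query, features):
    """Pool one context vector from encoder outputs by query-to-feature similarity.

    query:    (d,)   target representation, e.g. a learnable or linguistic vector
    features: (n, d) flattened visual (or multi-modal) encoder outputs
    Returns the attention weights (n,) and the pooled context vector (d,).
    """
    scores = features @ query / np.sqrt(query.shape[0])  # scaled dot products
    weights = softmax(scores)                            # similarity-based attention
    context = weights @ features                         # global weighted pooling
    return weights, context

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))  # 16 encoder tokens of dimension 8
q = rng.normal(size=8)
w, ctx = pool_context(q, feats)
```

Because the weights depend only on content similarity, nothing constrains them to concentrate on a single spatial region, which is exactly the failure mode Li & Sigal (2021) report.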


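The continuous anchor-guided decoding loop described in the Introduction can likewise be illustrated with a toy sketch. The stand-ins are labeled in the comments: a fixed Gaussian spatial weighting replaces the learned anchor-guided attention, and a weighted mean of token positions replaces the box-regression head; the real decoder operates on high-dimensional multi-modal features and predicts a full box.

```python
import numpy as np

def refine_anchor(anchor, positions, num_layers=3, sigma=0.2):
    """Toy sketch of continuous anchor-guided decoding.

    anchor:    (cx, cy), the current anchor center in normalized coordinates
               (box width/height and the learned projections are omitted).
    positions: (n, 2) normalized coordinates of visual tokens.

    Each "layer" pools context with attention concentrated around the
    current anchor, then predicts the next anchor from the pooled context.
    """
    anchor = np.asarray(anchor, dtype=float)
    for _ in range(num_layers):
        d2 = ((positions - anchor) ** 2).sum(axis=1)  # squared distance to anchor
        w = np.exp(-d2 / (2 * sigma ** 2))            # anchor-guided attention (toy)
        w /= w.sum()
        anchor = w @ positions                        # box-head stand-in: weighted mean
    return anchor

# Tokens clustered around a hypothetical target at (0.7, 0.3),
# plus a few distractor tokens far away.
positions = np.array([
    [0.68, 0.28], [0.72, 0.28], [0.68, 0.32], [0.72, 0.32], [0.70, 0.30],
    [0.10, 0.90], [0.05, 0.85], [0.15, 0.95],
])
refined = refine_anchor((0.5, 0.5), positions)
```

Even in this toy form, the mechanism shows the intended behavior: the anchor suppresses distant distractors at each step, so successive updates pull it toward the token cluster rather than averaging over the whole scene.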