LANGUAGE CONTROLS MORE THAN TOP-DOWN ATTENTION: MODULATING BOTTOM-UP VISUAL PROCESSING WITH REFERRING EXPRESSIONS

Abstract

How to best integrate linguistic and perceptual processing in multimodal tasks is an important open problem. In this work we argue that the common technique of using language to direct visual attention over high-level visual features may not be optimal. Using language throughout the bottom-up visual pathway, going from pixels to high-level features, may be necessary. Our experiments on several English referring expression datasets show significant improvements when language is used to control the filters for bottom-up visual processing in addition to top-down attention.

1. INTRODUCTION

As human beings, we can easily understand the surrounding environment with our visual system and interact with each other using language. Since the work of Winograd (1972), developing a system that understands human language in a situated environment has been one of the long-standing goals of artificial intelligence. Recent successes of deep learning in both the language and vision domains have increased interest in tasks that combine language and vision (Antol et al., 2015; Xu et al., 2015; Krishna et al., 2016; Suhr et al., 2017; Anderson et al., 2018b; Hudson & Manning, 2019). However, how to best integrate linguistic and perceptual processing is still an important open problem. In this work we investigate whether language should be used to control the filters for bottom-up visual processing as well as top-down attention. In the human visual system, attention is driven both by "top-down" cognitive processes (e.g., focusing on a target's color or location) and by "bottom-up" salient, behaviorally relevant stimuli (e.g., fast-moving objects) (Corbetta & Shulman, 2002; Connor et al., 2004; Theeuwes, 2010). Studies on embodied language explore the link between linguistic and perceptual representations (Pulvermüller, 1999; Vigliocco et al., 2004; Gallese & Lakoff, 2005), and it is often assumed that language has a high-level effect on perception and drives "top-down" visual attention (Bloom, 2002; Jackendoff & Jackendoff, 2002; Dessalegn & Landau, 2008). However, recent studies from cognitive science point out that language comprehension also affects low-level visual processing (Meteyard et al., 2007; Boutonnet & Lupyan, 2015). Motivated by this, we propose a model* that can modulate either or both of the "bottom-up" and "top-down" visual pathways with language-conditional filters.
Current deep learning systems for language-vision tasks typically start with low-level image processing that is not conditioned on language, then connect the language representation with high-level visual features to control the visual focus. To integrate the two modalities, concatenation (Malinowski et al., 2015), element-wise multiplication (Malinowski et al., 2015; Lu et al., 2016; Kim et al., 2016) or attention from language to vision (Xu et al., 2015; Xu & Saenko, 2016; Yang et al., 2016; Lu et al., 2017; Anderson et al., 2018a; Zellers et al., 2019) may be used. Crucially, these systems do not condition low-level visual features on language. One exception is De Vries et al. (2017), who propose conditioning the ResNet (He et al., 2016) image processing network with language-conditioned batch normalization parameters at every stage. Our model differs from these architectures by having explicit "bottom-up" and "top-down" branches, which allows us to experiment with modulating one or both branches with language-generated kernels. We evaluate our proposed model on the task of image segmentation from referring expressions: given an image and a natural language description, the model returns a segmentation mask that marks the object(s) described. We can contrast this with purely image-based object detection (Girshick, 2015; Ren et al., 2017) and semantic segmentation (Long et al., 2015; Ronneberger et al., 2015; Chen et al., 2017) tasks, which are limited to predefined semantic classes. Our task gives users more flexibility to interact with the system by allowing them to describe objects of interest in free-form language. The language input may contain various visual attributes (e.g., color, shape), spatial information (e.g., "on the right", "in front of"), actions (e.g., "running", "sitting") and interactions/relations between different objects (e.g., "arm of the chair that the cat is sitting in").
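To make the contrast concrete, the following is a minimal NumPy sketch of language-conditioned batch normalization in the style of De Vries et al. (2017): two small linear maps predict per-channel scale and shift parameters from a sentence embedding, which then modulate a normalized visual feature map. All shapes and weight names here are illustrative, not taken from any cited implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of visual feature maps and sentence embeddings.
B, C, H, W = 2, 8, 4, 4           # batch, channels, height, width
D = 16                            # language embedding size

visual = rng.standard_normal((B, C, H, W))
lang = rng.standard_normal((B, D))

# Language-conditioned batch-norm parameters: linear maps predict a
# per-channel scale (gamma) and shift (beta) from the sentence embedding.
W_gamma = rng.standard_normal((D, C)) * 0.1
W_beta = rng.standard_normal((D, C)) * 0.1
gamma = 1.0 + lang @ W_gamma      # (B, C), centered around identity scale
beta = lang @ W_beta              # (B, C)

# Normalize each channel over batch and spatial dims, then modulate.
mean = visual.mean(axis=(0, 2, 3), keepdims=True)
var = visual.var(axis=(0, 2, 3), keepdims=True)
normed = (visual - mean) / np.sqrt(var + 1e-5)
modulated = gamma[:, :, None, None] * normed + beta[:, :, None, None]

print(modulated.shape)  # (2, 8, 4, 4)
```

Because the normalization statistics stay language-independent while only the affine parameters depend on the expression, this mechanism lets language influence every stage of the visual network at low cost.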
This makes the task both more challenging and well suited for comparing different strategies of language control. The perceptual module of our model is based on the U-Net image segmentation architecture (Ronneberger et al., 2015). This architecture has clearly separated bottom-up and top-down branches, which allows us to easily vary which parts are conditioned on language. The bottom-up branch starts from low-level visual features and applies a sequence of contracting filters that produce successively higher-level feature maps with lower spatial resolution. Following this is a top-down branch, which takes the final low-resolution feature map and applies a sequence of expanding filters that eventually produce a segmentation mask at the original image resolution. Information flows between branches through skip connections between contracting and expanding filters at the same level. We experiment with conditioning one or both of these branches on language. To make visual processing conditional on language, we add language-conditional filters at each level of the architecture, similar to Misra et al. (2018). Our baseline applies language-conditional filters only on the top-down branch. Modulating only the top-down/expanding branch with language means the high-level features extracted by the bottom-up/contracting branch cannot be language-conditional. Our model expands on this baseline by modulating both branches with language-conditional filters. Empirically, we find that adding language modulation to the bottom-up/contracting branch yields a significant improvement over the baseline model. Our proposed model achieves state-of-the-art performance on three different English referring expression datasets.
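The core operation of a language-conditional filter can be sketched as follows: a linear layer maps the sentence embedding to convolution kernels, so the filters applied to a feature map are a function of the referring expression. This NumPy sketch uses 1x1 kernels for brevity (a 1x1 convolution is a per-pixel linear map over channels); the sizes and the weight matrix `W_k` are hypothetical, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a feature map at one U-Net level and a sentence embedding.
C_in, C_out, H, W = 8, 4, 6, 6
D = 16
feat = rng.standard_normal((C_in, H, W))
lang = rng.standard_normal(D)

# A linear layer generates 1x1 convolution kernels from the language
# embedding, so the filters at this level depend on the expression.
W_k = rng.standard_normal((D, C_out * C_in)) * 0.1
kernels = (lang @ W_k).reshape(C_out, C_in)   # one 1x1 kernel per output channel

# Apply the generated kernels: a per-pixel linear map over channels.
out = np.einsum('oc,chw->ohw', kernels, feat)
out = np.maximum(out, 0.0)                    # ReLU

print(out.shape)  # (4, 6, 6)
```

Inserting such a generated filter at each level of the contracting branch is what lets low-level visual processing respond to the expression, whereas the baseline only inserts them on the expanding branch.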

2. RELATED WORK

In this section, we review work in several related areas: Semantic segmentation classifies the object category of each pixel in an image without language input. Referring expression comprehension locates a bounding box for the object(s) described in the language input. Image segmentation from referring expressions generates a segmentation mask for the object(s) described in the language input. We also cover work on language-conditional (dynamic) filters and studies that use them to modulate deep-learning models with language.

2.1. SEMANTIC SEGMENTATION

Early semantic segmentation models are based on Fully Convolutional Networks (FCN) (Long et al., 2015). DeepLab (Chen et al., 2017) and U-Net (Ronneberger et al., 2015) are the state-of-the-art semantic segmentation models most closely related to our work. DeepLab replaces regular convolutions with atrous (dilated) convolutions in the last residual block of ResNets (He et al., 2016) and implements Atrous Spatial Pyramid Pooling (ASPP), which fuses multi-scale visual information. The U-Net architecture (Ronneberger et al., 2015) improves over the standard FCN by connecting the contracting (bottom-up) and expanding (top-down) paths at the same resolution: the output of the encoder layer at each level is passed to the decoder at the same level.
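The U-Net skip connection described above can be sketched in a few lines of NumPy: one encoder/decoder level where the decoder concatenates the encoder's same-resolution output with its own upsampled features. The pooling and upsampling choices here (2x2 max pooling, nearest-neighbour upsampling) are illustrative stand-ins for the learned layers of a real U-Net.

```python
import numpy as np

rng = np.random.default_rng(2)

def downsample(x):
    """2x2 max pooling: halves the spatial resolution of a (C, H, W) map."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour upsampling: doubles the spatial resolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# One level of a U-Net-style encoder/decoder.
enc = rng.standard_normal((8, 16, 16))   # encoder output at this level
bottom = downsample(enc)                  # passed further down the network
dec = upsample(bottom)                    # decoder features coming back up

# Skip connection: concatenate encoder and decoder features channel-wise,
# so the expanding path sees high-resolution detail from the contracting path.
fused = np.concatenate([enc, dec], axis=0)

print(fused.shape)  # (16, 16, 16)
```

In a full U-Net the concatenated map would then pass through further convolutions; the key point is that each decoder level receives features from the encoder level of matching resolution.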

2.2. REFERRING EXPRESSION COMPREHENSION

Early models for this task were typically built using a hybrid LSTM-CNN architecture (Hu et al., 2016b; Mao et al., 2016). Newer models (Hu et al., 2017; Yu et al., 2016; 2018; Wang et al., 2019) use a Region-based CNN (R-CNN) variant (Girshick et al., 2014; Ren et al., 2017; He et al., 2017) as a sub-component to generate object proposals. Nagaraja et al. (2016) propose a solution based on multiple instance learning. Cirik et al. (2018) implement a model based on Neural Module Networks (NMN) using syntactic information. Among the literature, Compositional Modular Network

* We will release our code and pre-trained models along with a reproducible environment after the blind review process.

