LANGUAGE CONTROLS MORE THAN TOP-DOWN ATTENTION: MODULATING BOTTOM-UP VISUAL PROCESSING WITH REFERRING EXPRESSIONS

Abstract

How best to integrate linguistic and perceptual processing in multimodal tasks is an important open problem. In this work we argue that the common technique of using language to direct visual attention only over high-level visual features may not be optimal: using language throughout the bottom-up visual pathway, from pixels to high-level features, may be necessary. Our experiments on several English referring expression datasets show significant improvements when language is used to control the filters for bottom-up visual processing in addition to top-down attention.

1. INTRODUCTION

As human beings, we can easily understand the surrounding environment with our visual system and interact with each other using language. Since the work of Winograd (1972), developing a system that understands human language in a situated environment has been one of the long-standing goals of artificial intelligence. Recent successes of deep learning in both the language and vision domains have increased interest in tasks that combine language and vision (Antol et al., 2015; Xu et al., 2015; Krishna et al., 2016; Suhr et al., 2017; Anderson et al., 2018b; Hudson & Manning, 2019). However, how best to integrate linguistic and perceptual processing is still an important open problem. In this work we investigate whether language should be used to control the filters for bottom-up visual processing as well as top-down attention.

In the human visual system, attention is driven both by "top-down" cognitive processes (e.g. focusing on a target's color or location) and by "bottom-up" salient, behaviourally relevant stimuli (e.g. fast-moving objects) (Corbetta & Shulman, 2002; Connor et al., 2004; Theeuwes, 2010). Studies on embodied language explore the link between linguistic and perceptual representations (Pulvermüller, 1999; Vigliocco et al., 2004; Gallese & Lakoff, 2005), and it is often assumed that language has a high-level effect on perception and drives "top-down" visual attention (Bloom, 2002; Jackendoff & Jackendoff, 2002; Dessalegn & Landau, 2008). However, recent studies from cognitive science point out that language comprehension also affects low-level visual processing (Meteyard et al., 2007; Boutonnet & Lupyan, 2015). Motivated by this, we propose a model¹ that can modulate either or both of "bottom-up" and "top-down" visual processing with language-conditional filters.
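The core idea of language-conditional filters can be illustrated with a minimal sketch: a language embedding is projected by a learned matrix into the weights of a 1x1 convolution, which is then applied to a visual feature map. The function and parameter names below (`language_conditioned_filter`, `w_gen`) are hypothetical illustrations, not the paper's actual implementation; a real model would learn `w_gen` end-to-end and typically use larger kernels and a deep-learning framework.

```python
import numpy as np

def language_conditioned_filter(features, lang_emb, w_gen, c_out):
    """Apply a 1x1 convolution whose kernel is generated from language.

    features: (C_in, H, W) visual feature map
    lang_emb: (D,) sentence embedding of the referring expression
    w_gen:    (D, C_out * C_in) learned kernel-generator matrix (assumed)
    c_out:    number of output channels
    """
    c_in, h, w = features.shape
    # Project the language embedding into a flat bank of kernel weights,
    # then reshape into a (C_out, C_in) 1x1 convolution kernel.
    kernel = (lang_emb @ w_gen).reshape(c_out, c_in)
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', kernel, features)

# Toy usage: an 8-channel feature map modulated by a 16-dim embedding.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 5, 5))
emb = rng.standard_normal(16)
w_gen = rng.standard_normal((16, 4 * 8)) * 0.1
out = language_conditioned_filter(feats, emb, w_gen, c_out=4)
```

Because the kernel itself depends on the utterance, different referring expressions induce different low-level feature extractors, rather than merely re-weighting a fixed set of high-level features.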
Current deep learning systems for language-vision tasks typically start with low-level image processing that is not conditioned on language, then connect the language representation with high-level visual features to control the visual focus. To integrate the two modalities, concatenation (Malinowski et al., 2015), element-wise multiplication (Malinowski et al., 2015; Lu et al., 2016; Kim et al., 2016), or attention from language to vision (Xu et al., 2015; Xu & Saenko, 2016; Yang et al., 2016; Lu et al., 2017; Anderson et al., 2018a; Zellers et al., 2019) may be used. Crucially, these approaches do not condition low-level visual features on language. One exception is De Vries et al. (2017), which proposes conditioning the ResNet (He et al., 2016) image processing network with language-conditioned batch normalization parameters at every stage. Our model differs from these architectures in having explicit "bottom-up" and "top-down" branches, which allows us to experiment with modulating one or both branches with language-generated kernels.
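For contrast, the conditional batch normalization of De Vries et al. (2017) does not generate kernels; it predicts per-channel scale and shift parameters from the language embedding and applies them after normalizing the visual features. The sketch below is a simplified NumPy rendition under assumed shapes; the projection matrices `w_gamma` and `w_beta` are hypothetical names for the learned predictors.

```python
import numpy as np

def conditional_batch_norm(features, lang_emb, w_gamma, w_beta, eps=1e-5):
    """FiLM-style conditional batch normalization (simplified sketch).

    features: (N, C, H, W) batch of visual feature maps
    lang_emb: (D,) sentence embedding
    w_gamma:  (D, C) learned projection predicting per-channel scale deltas
    w_beta:   (D, C) learned projection predicting per-channel shifts
    """
    # Standard batch-norm statistics, computed per channel.
    mean = features.mean(axis=(0, 2, 3), keepdims=True)
    var = features.var(axis=(0, 2, 3), keepdims=True)
    normed = (features - mean) / np.sqrt(var + eps)
    # Language predicts a residual scale (around 1) and a shift per channel.
    gamma = 1.0 + lang_emb @ w_gamma          # (C,)
    beta = lang_emb @ w_beta                  # (C,)
    return gamma[None, :, None, None] * normed + beta[None, :, None, None]

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8, 4, 4))
emb = rng.standard_normal(16)
y = conditional_batch_norm(x, emb, rng.standard_normal((16, 8)) * 0.1,
                           rng.standard_normal((16, 8)) * 0.1)
```

The key architectural difference is that conditional batch normalization only rescales existing channels, whereas language-generated kernels can remix channels into new features, and our two-branch design lets us apply such modulation selectively to the bottom-up pathway, the top-down pathway, or both.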



¹ We will release our code and pre-trained models along with a reproducible environment after the blind review process.

