MOVIE: REVISITING MODULATED CONVOLUTIONS FOR VISUAL COUNTING AND BEYOND

Abstract

This paper focuses on visual counting: predicting the number of occurrences in a natural image given a query (e.g., a question or a category). Unlike most prior work, which uses explicit, symbolic models that can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of the residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically, and only needs a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) it advances the state of the art on counting-specific VQA tasks while being more efficient; 2) it outperforms prior art on difficult benchmarks such as COCO for common object counting; and 3) it helped us secure first place in the 2020 VQA challenge when integrated as a module for 'number'-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.

1. INTRODUCTION

We focus on visual counting: given a natural image and a query, the goal is to predict the correct number of occurrences in the image corresponding to that query. The query is generic: it can be a natural language question (e.g., 'how many kids are on the sofa') or a category name (e.g., 'car'). Since visual counting requires open-ended query grounding and multiple steps of visual reasoning (Zhang et al., 2018), it is a unique testbed for evaluating a machine's ability to understand multi-modal data. Mimicking how humans count, most existing counting modules (Trott et al., 2018) adopt an intuition-driven reasoning procedure that counts iteratively, mapping candidate image regions to symbols and counting them explicitly based on their relationships (Fig. 1, top-left). While interpretable, modeling regions and relations repeatedly can be computationally expensive (Jiang et al., 2020). More importantly, counting is merely a single visual reasoning task: if we consider the full spectrum of reasoning tasks (e.g., logical inference, spatial configuration), it is probably infeasible to manually design specialized modules for every one of them (Fig. 1, bottom-left). In this paper, we aim to establish a simple and effective alternative for visual counting without explicit, symbolic reasoning.

Our work is built on two research frontiers. First, on the synthetic CLEVR dataset (Johnson et al., 2017), it was shown that using queries to directly modulate convolutions can lead to major improvements in the reasoning power of a Convolutional Network (ConvNet), e.g., reaching 94% accuracy on counting (Perez et al., 2018). However, it was difficult to transfer this finding to natural images, partially due to the dominance of bottom-up attention features that represent images with regions (Anderson et al., 2018).
Interestingly, recent analysis discovered that plain convolutional features can be as powerful as region features (Jiang et al., 2020); this serves as a second stepping stone for our approach, allowing a fair comparison against region-based counting modules.

Motivated by fusing multiple modalities locally for counting, the central idea behind our approach is to revisit convolutions modulated by query representations. Following ResNet (He et al., 2016), we choose the bottleneck as our basic building block, with each bottleneck modulated once. Multiple bottlenecks are stacked together to form our final module; hence we call our method MoVie: Modulated conVolutional bottlenecks. Inference with MoVie is a simple feed-forward pass applied holistically to the feature map, and reasoning is done implicitly (Fig. 1, top-right). Finally, we validate the feasibility of MoVie for reasoning tasks beyond counting (Fig. 1, bottom-right) through its near-perfect accuracy on CLEVR and competitive results on GQA (Hudson & Manning, 2019a). This evidence suggests that modulated convolutions such as MoVie can potentially serve as a general mechanism for visual reasoning. Code will be made available.
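To make the mechanism concrete, below is a minimal NumPy sketch of a query-modulated bottleneck in the spirit described above (a FiLM-style channel-wise scale and shift predicted from the query). All names and shapes are hypothetical, 1x1 convolutions are written as matrix multiplies, and the 3x3 convolution of a full ResNet bottleneck is omitted for brevity; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def modulated_bottleneck(x, query, params):
    """One query-modulated bottleneck (hypothetical sketch).

    x:      feature map, shape (C, H, W)
    query:  query embedding, shape (D,)
    params: dict of weight matrices; 1x1 convs are matmuls over channels
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)

    # 1x1 "reduce" conv: C -> C/4 channels, then ReLU
    h = np.maximum(params["w_reduce"] @ flat, 0.0)

    # The query predicts a channel-wise scale (gamma) and shift (beta);
    # each bottleneck is modulated exactly once
    gamma = params["w_gamma"] @ query                 # (C/4,)
    beta = params["w_beta"] @ query                   # (C/4,)
    h = np.maximum(gamma[:, None] * h + beta[:, None], 0.0)

    # 1x1 "expand" conv back to C channels, plus the residual connection
    out = params["w_expand"] @ h + flat
    return np.maximum(out, 0.0).reshape(C, H, W)

# Toy usage: stack a few modulated bottlenecks (toy shared weights)
rng = np.random.default_rng(0)
C, D, H, W = 8, 16, 4, 4
params = {
    "w_reduce": rng.standard_normal((C // 4, C)) * 0.1,
    "w_gamma": rng.standard_normal((C // 4, D)) * 0.1,
    "w_beta": rng.standard_normal((C // 4, D)) * 0.1,
    "w_expand": rng.standard_normal((C, C // 4)) * 0.1,
}
x = rng.standard_normal((C, H, W))
q = rng.standard_normal(D)
for _ in range(3):
    x = modulated_bottleneck(x, q, params)
print(x.shape)  # (8, 4, 4)
```

Note how inference is a single feed-forward pass over the whole feature map: the query reshapes the convolutional computation everywhere at once, rather than driving an iterative, region-by-region procedure.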

2. RELATED WORK

Here we discuss works related to both counting modules and the counting task itself.

Explicit counting/reasoning modules. (Trott et al., 2018) was among the first to treat counting differently from other types of questions, casting the task as a sequential decision-making problem optimized by reinforcement learning. A similar argument for this distinction was presented in (Zhang et al., 2018), which took a step further by showing that their fully differentiable method can be attached to generic VQA models as a module. However, the idea of modular design for VQA was not new: notably, several seminal works (Andreas et al., 2016; Hu et al., 2017) described learnable procedures for constructing networks for visual reasoning, with reusable modules optimized for particular capabilities (e.g., count, compare). Our work differs from these in philosophy, as they put more emphasis (and likely bias) on interpretation, whereas we seek data-driven, general-purpose components for visual reasoning.

Implicit reasoning modules. Besides modulated convolutions (Perez et al., 2018; De Vries et al., 2017), another notable work is the Relation Network (Santoro et al., 2017), which learns to represent pair-wise relationships between features from different locations through simple MLPs, and showcases super-human performance on CLEVR. The counter from TallyQA (Acharya et al., 2019) followed this idea and built two such networks: one among foreground regions, and one between foreground and background. However, their counter is still region-based, and neither generalization as a VQA module nor transfer to other counting/reasoning tasks was shown. Because existing VQA benchmarks like VQA 2.0 also include counting questions, generic VQA models (Fukui et al., 2016; Yu et al., 2019) without explicit counters also fall within the scope of our comparison.
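The pairwise mechanism of the Relation Network mentioned above can be sketched in a few lines: an MLP is applied to every ordered pair of location features and the results are summed. The sketch below is a toy illustration under our own naming (`relation_pool`, a one-layer "MLP"), not the original implementation.

```python
import numpy as np

def relation_pool(features, mlp):
    """Sum an MLP over all ordered pairs of location features.

    features: (N, D) array, one D-dim vector per image location/region
    mlp:      callable mapping a (2*D,) pair vector to a (K,) output
    """
    N, D = features.shape
    out = np.zeros_like(mlp(np.zeros(2 * D)))
    for i in range(N):
        for j in range(N):
            # Each pair (i, j) is scored independently, then aggregated
            out = out + mlp(np.concatenate([features[i], features[j]]))
    return out

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 12)) * 0.1          # toy one-layer "MLP"
mlp = lambda pair: np.maximum(W @ pair, 0.0)
feats = rng.standard_normal((5, 6))             # 5 locations, 6-dim features
pooled = relation_pool(feats, mlp)
print(pooled.shape)  # (4,)
```

The quadratic number of pairs is what makes such region-and-relation modeling expensive as the number of candidate regions grows, in contrast to the single modulated forward pass used by MoVie.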

Figure 1: We study visual counting. Different from previous works that perform explicit, symbolic counting (left), we propose an implicit, holistic counter, MoVie, that directly modulates convolutions (right) and can outperform state-of-the-art methods on multiple benchmarks. Its simple design also allows potential generalization beyond counting to other visual reasoning tasks (bottom).

