MOVIE: REVISITING MODULATED CONVOLUTIONS FOR VISUAL COUNTING AND BEYOND

Abstract

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models, which can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of the residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically, and needs only a single forward pass during inference. Nevertheless, MoVie showcases strong performance for counting: 1) advancing the state of the art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; and 3) helping us secure first place in the 2020 VQA Challenge when integrated as a module for 'number'-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.

1. INTRODUCTION

We focus on visual counting: given a natural image and a query, the task is to predict the correct number of occurrences in the image corresponding to that query. The query is generic: it can be a natural language question (e.g. 'how many kids are on the sofa') or a category name (e.g. 'car'). Since visual counting requires open-ended query grounding and multiple steps of visual reasoning (Zhang et al., 2018), it is a unique testbed for evaluating a machine's ability to understand multi-modal data. Mimicking how humans count, most existing counting modules (Trott et al., 2018) adopt an intuition-driven reasoning procedure that performs counting iteratively, mapping candidate image regions to symbols and counting them explicitly based on their relationships (Fig. 1, top-left). While interpretable, repeatedly modeling regions and relations can be computationally expensive (Jiang et al., 2020). More importantly, counting is merely a single visual reasoning task: if we consider the full spectrum of reasoning tasks (e.g. logical inference, spatial configuration), it is probably infeasible to manually design specialized modules for every one of them (Fig. 1, bottom-left).

In this paper, we aim to establish a simple and effective alternative for visual counting without explicit, symbolic reasoning. Our work builds on two research frontiers. First, on the synthetic CLEVR dataset (Johnson et al., 2017), it was shown that using queries to directly modulate convolutions can lead to major improvements in the reasoning power of a Convolutional Network (ConvNet), e.g. achieving a near-perfect 94% on counting (Perez et al., 2018). However, it was difficult to transfer this finding to natural images, partially due to the dominance of bottom-up attention features that represent images with regions (Anderson et al., 2018).
Interestingly, recent analysis discovered that plain convolutional features can be as powerful as region features (Jiang et al., 2020), which serves as the second stepping stone for our approach, allowing a fair comparison against region-based counting modules. Motivated by the benefit of fusing the two modalities locally for counting, the central idea behind our approach is to revisit convolutions modulated by query representations. Following ResNet (He et al., 2016), we choose the bottleneck as our basic building block, with each bottleneck modulated once. Multiple bottlenecks are stacked together to form our final module. Therefore, we call our method MoVie: Modulated conVolutional bottlenecks. Inference with MoVie is a single feed-forward pass applied holistically to the feature map, and reasoning is done implicitly (Fig. 1, top-right).
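To make the mechanism concrete, the sketch below implements a FiLM-style (Perez et al., 2018) modulated bottleneck in plain NumPy. This is a minimal illustration under simplifying assumptions, not the paper's implementation: the 1x1 convolutions are channel-wise einsums, the 3x3 convolution of a real ResNet bottleneck is omitted for brevity, and all names, shapes, and the query-to-parameter mapping are hypothetical.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution as a channel mixing: x (C_in, H, W), w (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def film(x, gamma, beta):
    # Channel-wise affine modulation conditioned on the query:
    # each channel c is scaled by gamma[c] and shifted by beta[c].
    return gamma[:, None, None] * x + beta[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def modulated_bottleneck(x, params, gamma, beta):
    # One bottleneck, modulated once, with a residual connection.
    h = relu(conv1x1(x, params['reduce']))   # reduce channels (C -> C_mid)
    h = relu(film(h, gamma, beta))           # query-conditioned modulation
    h = conv1x1(h, params['expand'])         # expand channels (C_mid -> C)
    return relu(x + h)                       # residual add, as in ResNet

# Illustrative usage: a random feature map and a query embedding
# (in the paper the query would come from a question/category encoder).
rng = np.random.default_rng(0)
C, C_mid, H, W, D = 8, 4, 5, 5, 16
x = rng.standard_normal((C, H, W))          # convolutional feature map
q = rng.standard_normal(D)                  # query embedding
W_g = rng.standard_normal((C_mid, D)) * 0.1 # maps query -> gamma
W_b = rng.standard_normal((C_mid, D)) * 0.1 # maps query -> beta
gamma, beta = W_g @ q, W_b @ q
params = {'reduce': rng.standard_normal((C_mid, C)) * 0.1,
          'expand': rng.standard_normal((C, C_mid)) * 0.1}
y = modulated_bottleneck(x, params, gamma, beta)
assert y.shape == x.shape                   # shape-preserving, stackable
```

Because each block preserves the feature-map shape, several such bottlenecks can be stacked and the whole stack evaluated in a single forward pass, which is what makes this style of implicit, holistic reasoning cheap compared with iterative region-based counting.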

