VISUAL QUESTION ANSWERING FROM ANOTHER PERSPECTIVE: CLEVR MENTAL ROTATION TESTS

Abstract

Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem, made even harder when it must be solved from a single image. 3D computer vision has a long history of examining related problems. However, often what one is most interested in is the answer to a relatively simple question posed in another visual frame of reference, as opposed to creating a full 3D reconstruction. Mental rotation tests can also manifest as consequential questions in the real world, such as: does the pedestrian that I see, see the car that I am driving? We explore a controlled setting in which questions are posed about the properties of a scene as it would be observed from another viewpoint. To do this we have created a new version of the CLEVR VQA problem setup and dataset, which we call CLEVR Mental Rotation Tests (CLEVR-MRT), where the goal is to answer questions about the original CLEVR viewpoint given a single image obtained from a different viewpoint of the same scene. Using CLEVR-MRT we examine standard state-of-the-art methods, show how they fall short, and then explore novel neural architectures that infer representations encoded as feature volumes describing a scene. Our new methods apply rigid transformations to feature volumes, conditioned on the viewpoint camera. We examine the efficacy of different model variants through a rigorous ablation study. Furthermore, we examine the use of contrastive learning to infer a volumetric encoder in a self-supervised manner and find that this approach yields the best results of our study on CLEVR-MRT.

1. INTRODUCTION

Psychologists have employed mental rotation tests for decades (Shepard & Metzler, 1971) as a powerful tool for probing how the human mind interprets and (internally) manipulates three-dimensional representations of the world. Instead of using these tests to probe the human capacity for mental 3D manipulation, we are interested here in: a) understanding the ability of modern deep neural architectures to perform mental rotation tasks, and b) building architectures better suited to 3D inference and understanding. Recent applications of concepts from 3D graphics to deep learning, and vice versa, have led to promising results. We are similarly interested in leveraging models of 3D image formation from the graphics and vision communities to augment neural network architectures with inductive biases that improve their ability to reason about the real world. Here we measure the effectiveness of adding such biases, confirming their ability to improve the performance of neural models on mental rotation tasks.

Concepts from inverse graphics can be used to guide the construction of neural architectures designed to perform tasks related to the reverse of the traditional image synthesis process: namely, taking 2D image input and inferring 3D information about the scene. For instance, 3D reconstruction in computer vision (Furukawa & Hernández, 2015) can be realized with neural-based approaches that output voxel (Wu et al., 2016; Nguyen-Phuoc et al., 2019), mesh (Wang et al., 2018), or point cloud (Qi et al., 2017) representations of the underlying 3D scene geometry. Such inverse graphics methods range from fully-differentiable graphics pipelines (Kato et al., 2018) to implicit neural-based approaches with learnable modules designed to mimic the structure of certain components of the forward graphics pipeline (Yao et al., 2018; Thies et al., 2019). While inverse rendering is potentially
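As a concrete illustration of the kind of camera-conditioned rigid transformation of feature volumes discussed above, the following minimal NumPy sketch rotates a voxel feature grid about its vertical axis using inverse warping with nearest-neighbour resampling. This is an illustrative toy, not the paper's implementation: the function names and the (C, D, H, W) layout are our own assumptions, and a learned model would typically use differentiable trilinear sampling instead of nearest-neighbour lookup.

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the vertical (z) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def rotate_volume(vol, R):
    """Rigidly rotate a (C, D, H, W) feature volume about its centre.

    Inverse warping: each output cell samples the input cell found at
    R^{-1} times its centred (x, y, z) coordinate, nearest-neighbour.
    """
    C, D, H, W = vol.shape
    zs, ys, xs = np.meshgrid(np.arange(D), np.arange(H), np.arange(W),
                             indexing="ij")
    coords = np.stack([xs, ys, zs], axis=-1).astype(float)   # (D, H, W, 3)
    centre = np.array([(W - 1) / 2, (H - 1) / 2, (D - 1) / 2])
    # For row vectors, applying R^{-1} = R^T amounts to `rows @ R`.
    src = (coords - centre) @ R + centre
    idx = np.rint(src).astype(int)
    valid = np.all((idx >= 0) & (idx < np.array([W, H, D])), axis=-1)
    xi = np.clip(idx[..., 0], 0, W - 1)
    yi = np.clip(idx[..., 1], 0, H - 1)
    zi = np.clip(idx[..., 2], 0, D - 1)
    out = vol[:, zi, yi, xi]          # advanced indexing copy, (C, D, H, W)
    out[:, ~valid] = 0.0              # zero cells that fell outside the grid
    return out

# Usage: a single marker feature rotated 90 degrees about z.
vol = np.zeros((1, 5, 5, 5))
vol[0, 2, 2, 4] = 1.0                 # marker at (x=4, y=2, z=2)
rotated = rotate_volume(vol, rotation_z(np.pi / 2))
```

In a neural architecture the angle (or full camera pose) supplied to `rotation_z` would come from the viewpoint camera parameters, so downstream question-answering layers can operate in a canonical frame of reference.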

