VISUAL QUESTION ANSWERING FROM ANOTHER PERSPECTIVE: CLEVR MENTAL ROTATION TESTS

Abstract

Different types of mental rotation tests have been used extensively in psychology to understand human visual reasoning and perception. Understanding what an object or visual scene would look like from another viewpoint is a challenging problem that is made even harder if it must be performed from a single image. 3D computer vision has a long history of examining related problems. However, often what one is most interested in is the answer to a relatively simple question posed in another visual frame of reference, as opposed to creating a full 3D reconstruction. Mental rotation tests can also manifest as consequential questions in the real world, such as: does the pedestrian that I see, see the car that I am driving? We explore a controlled setting in which questions are posed about the properties of a scene as if it were observed from another viewpoint. To do this we have created a new version of the CLEVR VQA problem setup and dataset that we call CLEVR Mental Rotation Tests (CLEVR-MRT), where the goal is to answer questions about the original CLEVR viewpoint given a single image obtained from a different viewpoint of the same scene. Using CLEVR-MRT we examine standard state-of-the-art methods, show how they fall short, and then explore novel neural architectures that infer representations encoded as feature volumes describing a scene. Our new methods use rigid transformations of feature volumes conditioned on the viewpoint camera. We examine the efficacy of different model variants through a rigorous ablation study. Furthermore, we examine the use of contrastive learning to infer a volumetric encoder in a self-supervised manner, and find that this approach yields the best results of our study on CLEVR-MRT.

1. INTRODUCTION

Psychologists have employed mental rotation tests for decades (Shepard & Metzler, 1971) as a powerful tool for probing how the human mind interprets and (internally) manipulates three-dimensional representations of the world. Instead of using these tests to probe the human capacity for mental 3D manipulation, we are interested here in: a) understanding the ability of modern deep neural architectures to perform mental rotation tasks, and b) building architectures better suited to 3D inference and understanding. Recent applications of concepts from 3D graphics to deep learning, and vice versa, have led to promising results. We are similarly interested in leveraging models of 3D image formation from the graphics and vision communities to augment neural network architectures with inductive biases that improve their ability to reason about the real world. Here we measure the effectiveness of adding such biases, confirming their ability to improve the performance of neural models on mental rotation tasks. Concepts from inverse graphics can be used to guide the construction of neural architectures designed to perform tasks related to the reverse of the traditional image synthesis process: namely, taking 2D image input and inferring 3D information about the scene. For instance, 3D reconstruction in computer vision (Furukawa & Hernández, 2015) can be realized with neural-based approaches that output voxel (Wu et al., 2016; Nguyen-Phuoc et al., 2019), mesh (Wang et al., 2018), or point cloud (Qi et al., 2017) representations of the underlying 3D scene geometry. Such inverse graphics methods range from fully-differentiable graphics pipelines (Kato et al., 2018) to implicit neural-based approaches with learnable modules designed to mimic the structure of certain components of the forward graphics pipeline (Yao et al., 2018; Thies et al., 2019).
While inverse rendering is potentially an interesting and useful goal in itself, many computer vision systems could benefit from neural architectures that demonstrate good performance on more targeted mental rotation tasks. In our work here we are interested in exploring neural "mental rotation" by adapting a well-known standard benchmark for visual question answering (VQA) so that questions must be answered with respect to another viewpoint. We use the Compositional Language and Elementary Visual Reasoning (CLEVR) Diagnostic Dataset (Johnson et al., 2017) as the starting point for our work. While we focus on this well-known benchmark, many analogous questions of practical interest exist. For example, given the camera viewpoint of a (blind) person crossing the road, can we infer whether each of the drivers of the cars at an intersection can see this person crossing the street? As humans, we are endowed with the ability to reason about scenes and imagine them from different viewpoints, even if we have only seen them from one perspective. As noted by others, it therefore seems intuitive that we should encourage the same capabilities in deep neural networks (Harley et al., 2019). In order to answer such questions effectively, some sort of representation encoding 3D information seems necessary to permit inferences to be drawn after a change in the orientation and position of the viewpoint camera. However, humans clearly do not have access to error signals obtained through re-rendering scenes, yet are still able to perform such tasks. To explore these problems in a controlled setting, we adapt the original CLEVR setup in which a VQA model is trained to answer different types of questions about a scene consisting of various types and colours of objects. While images from this dataset are generated through the rendering of randomly generated 3D scenes, the three-dimensional structure of the scene is never fully exploited because the viewpoint camera never changes.
We call our problem formulation and dataset CLEVR-MRT, as it is a new Mental Rotation Test version of the CLEVR problem setup. In CLEVR-MRT, alternative views of a scene are rendered and used as the input to a perception pipeline that must then answer a question posed with respect to another (the original CLEVR) viewpoint.¹ This gives rise to a more difficult task in which the VQA model must learn how to map from its current viewpoint to the viewpoint that is required to answer the question. This can be seen in Figure 1(b). Figure 1(a) depicts a real-world situation and analogy where the answers to similar types of questions may help different types of systems make consequential decisions, e.g. intelligent intersections, cars, robots, or navigation assistants for the blind. The fact that MRTs are a classical tool of psychology, together with the link to these different practical applications, motivated us to create the controlled setting of CLEVR-MRT depicted in Figure 1. Using our new mental rotation task definition and our CLEVR-MRT dataset, we examine a number of new inverse-graphics-inspired neural architectures. We examine models that use the FiLM (Feature-wise Linear Modulation) technique (Perez et al., 2017) for VQA, which delivers competitive performance using contemporary state-of-the-art convolutional network techniques. We observe that such methods fall short in this more challenging MRT VQA setting. This motivates us to create new architectures that infer a latent feature volume that we subject to rigid 3D transformations (rotations and translations), in a manner that has been examined in 3D generative modelling techniques such as spatial transformers (Jaderberg et al., 2015) and HoloGAN (Nguyen-Phuoc et al., 2019).
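To make the two key ingredients concrete, the following is a minimal NumPy sketch (not the implementation used in this work; all names and shapes are illustrative assumptions) of how a latent feature volume can be rigidly rotated conditioned on the viewpoint camera's azimuth, and how FiLM applies a per-channel scale and shift derived from a question encoding:

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation about the vertical axis, e.g. derived from camera azimuth."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.],
                     [s,  c, 0.],
                     [0., 0., 1.]])

def rotate_volume(vol, R):
    """Rigidly transform a (C, D, H, W) feature volume by inverse warping:
    each output voxel samples the input at R^{-1} x (nearest neighbour)."""
    C, D, H, W = vol.shape
    out = np.zeros_like(vol)
    # Voxel centres on a normalised [-1, 1]^3 grid.
    zs, ys, xs = np.meshgrid(np.linspace(-1, 1, D),
                             np.linspace(-1, 1, H),
                             np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    # Row-vector product coords @ R applies R^T, i.e. the inverse rotation.
    src = coords @ R
    # Map normalised source coordinates back to nearest voxel indices.
    ix = np.round((src[:, 0] + 1) / 2 * (W - 1)).astype(int)
    iy = np.round((src[:, 1] + 1) / 2 * (H - 1)).astype(int)
    iz = np.round((src[:, 2] + 1) / 2 * (D - 1)).astype(int)
    valid = ((0 <= ix) & (ix < W) & (0 <= iy) & (iy < H)
             & (0 <= iz) & (iz < D))  # voxels rotated outside stay zero
    out_flat = out.reshape(C, -1)
    out_flat[:, valid] = vol[:, iz[valid], iy[valid], ix[valid]]
    return out

def film(feats, gamma, beta):
    """FiLM: per-channel scale (gamma) and shift (beta), here assumed to be
    predicted from the question embedding by a small network."""
    return gamma[:, None, None, None] * feats + beta[:, None, None, None]
```

In the differentiable setting, the nearest-neighbour lookup would be replaced by trilinear sampling (as in spatial transformers) so that gradients flow through the warp, and the rotation would come from the relative pose between the observed and canonical cameras.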



¹ Dataset and code will be available at https://github.com/anonymouscat2434/clevr-mrt




(a) (Left) A view of a street corner. (Middle) A CLEVR-like representation of the scene with abstractions of buildings, cars and pedestrians. (Right) The same virtual scene from another viewpoint, where questions concerning the relative positions of objects after a mental rotation could be of significant practical interest. (b) Random views of an example scene in CLEVR-MRT. The center image is the 'canonical' view, which is the unseen point of view for which questions must be answered using only one of the other views as input.

Figure 1: (a) A real-world example where the ability to perform mental rotations can be of practical utility. (b) Images from the CLEVR-MRT dataset.

