Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

Abstract

Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, ranging from 3D reconstruction to visual reasoning and visual manipulation. Along with the benchmark, we provide a general encoder-decoder framework that allows the evaluation of arbitrary visual representations on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems by obtaining better, more general-purpose visual representations.

1. INTRODUCTION

The long-term goal of machine vision (Marr, 1982; Barrow and Tenenbaum, 1981; Ikeuchi and Hebert, 1996; Marr, 2010) is to build a general-purpose vision system that can perceive, understand, and react to visual inputs from unconstrained environments. The term general-purpose can be best understood by observing our own visual system, which supports various complex higher-order visual tasks, from edge detection and object recognition to visual navigation and manipulation, all rooted in a common brain region that produces core visual representations. On the other hand, even though the field of computer vision has been thriving over the last decade with models that solve complex tasks increasingly well (He et al., 2016; 2017; Ronneberger et al., 2015; Dosovitskiy et al., 2020), there is still a considerable gap between vision models and the aforementioned human visual system. In particular, current vision models are mostly task-specific and often contain specialized components or architectures designed for a specific setting. They are also hindered by diverse input-output formats that vary from task to task. Recent works in self-supervised learning (Oord et al., 2018; He et al., 2020; Chen et al., 2020; Grill et al., 2020; He et al., 2021; Huang et al., 2021; Ma et al., 2022) explore more general visual representations by learning from large-scale visual data. Advances in vision-language modeling (Lu et al., 2020; Hu and Singh, 2021; Cho et al., 2021; Gupta et al., 2021; Kamath et al., 2021; Wang et al., 2022) also seek to unify vision tasks under a task-agnostic model architecture. However, these models are evaluated solely on traditional tasks like classification, detection, or vision-language understanding.
It is still unclear whether current computer vision models are general-purpose like the human visual system, as there is no overarching benchmark that covers all visual tasks holistically, from low-level perception to high-level reasoning and acting. To facilitate research toward general-purpose vision, we present the General-purpose Visual Understanding Evaluation (G-VUE) benchmark. We carefully curate 11 tasks from four functional domains that visual systems should support: Perceive, Ground, Reason, and Act, ordered by their cognitive complexity. These four domains cover the full spectrum of human visual tasks and

