UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS

Abstract

We propose UNIFIED-IO, a model that performs a large variety of AI tasks, spanning classical computer vision tasks such as pose estimation, object detection, depth estimation, and image generation, vision-and-language tasks such as region captioning and referring expression, and natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a wide variety of tasks poses unique challenges due to the heterogeneous inputs and outputs of each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 90 diverse datasets from the vision and language fields. UNIFIED-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWiz-Ground, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for UNIFIED-IO are available at: unified-io.allenai.org

1. INTRODUCTION

We present UNIFIED-IO, the first neural model to jointly perform a large and diverse set of AI tasks spanning classical computer vision (such as object detection, segmentation, and depth estimation), image synthesis (such as image generation and image in-painting), vision-and-language (such as visual question answering, image captioning, and referring expression), and NLP (such as question answering and paraphrasing). Unified general-purpose models avoid the need for task-specific design, learn and perform a wide range of tasks with a single architecture, can utilize large and diverse data corpora, can effectively transfer concept knowledge across tasks, and can even perform tasks unknown and unobserved at design and training time.

Building unified models for computer vision has proven quite challenging, since vision tasks have incredibly diverse input and output representations. For instance, object detection produces bounding boxes around objects in an image, segmentation produces binary masks outlining regions in an image, visual question answering produces an answer as text, and depth estimation produces a map detailing the distance of each pixel from the camera. This heterogeneity makes it very challenging to architect a single model for all of these tasks.

In contrast, while the landscape of natural language processing (NLP) tasks, datasets, and benchmarks is large and diverse, their inputs and desired outputs can often be uniformly represented as sequences of tokens. Sequence-to-sequence (Seq2Seq) architectures (Raffel et al., 2020; Brown et al., 2020), specifically designed to accept and produce such sequences of tokens, are thus widely applicable to many tasks. Unified models employing such architectures have been central to much recent progress in NLP.

Unified models for computer vision typically use a shared visual backbone to produce visual embeddings but then employ individual branches for each of the desired tasks. These include models
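To make the token-unification idea concrete, the following minimal sketch (ours, not the authors' released code) shows how structurally different outputs, a detection bounding box and a free-form text answer, could both be serialized into one flat sequence of discrete vocabulary tokens that a single Seq2Seq decoder can produce. The `<loc_*>` token names, the bin count of 1000, and the helper functions `quantize_coord`, `box_to_tokens`, and `text_to_tokens` are illustrative assumptions, not details taken from the paper.

```python
# Sketch: homogenizing heterogeneous task outputs into one discrete
# token sequence. Vocabulary layout and bin count are assumptions.

NUM_LOCATION_BINS = 1000  # assumed number of coordinate bins

def quantize_coord(value: float, extent: float) -> str:
    """Map a continuous coordinate in [0, extent] to a discrete location token."""
    bin_id = min(int(value / extent * NUM_LOCATION_BINS), NUM_LOCATION_BINS - 1)
    return f"<loc_{bin_id}>"

def box_to_tokens(box, image_w, image_h):
    """Serialize an (x1, y1, x2, y2) box as four location tokens."""
    x1, y1, x2, y2 = box
    return [quantize_coord(x1, image_w), quantize_coord(y1, image_h),
            quantize_coord(x2, image_w), quantize_coord(y2, image_h)]

def text_to_tokens(text):
    """Stand-in for a subword tokenizer (e.g., SentencePiece)."""
    return text.lower().split()

# A detection target and a VQA answer both become flat token sequences
# over one shared vocabulary, so a single decoder can emit either one.
detection_target = box_to_tokens((48.0, 20.0, 320.0, 240.0), 640, 480) \
    + text_to_tokens("dog")
vqa_target = text_to_tokens("two people are surfing")

print(detection_target)  # ['<loc_75>', '<loc_41>', '<loc_500>', '<loc_500>', 'dog']
print(vqa_target)        # ['two', 'people', 'are', 'surfing']
```

Because every target is just a token sequence over a shared vocabulary, the same architecture and loss apply to every task; the same trick extends, with per-modality tokenizers, to masks, depth maps, and generated images.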

