UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS

Abstract

We propose UNIFIED-IO, a model that performs a large variety of AI tasks, spanning classical computer vision tasks including pose estimation, object detection, depth estimation, and image generation; vision-and-language tasks such as region captioning and referring expression; and natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. UNIFIED-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, SWiG, VizWiz-Ground, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for UNIFIED-IO are available at: unified-io.

1. INTRODUCTION

We present UNIFIED-IO, the first neural model to jointly perform a large and diverse set of AI tasks spanning classical computer vision (such as object detection, segmentation, and depth estimation), image synthesis (such as image generation and image in-painting), vision-and-language (like visual question answering, image captioning, and referring expression) and NLP (such as question answering and paraphrasing). Unified general-purpose models avoid the need for task-specific design, learn and perform a wide range of tasks with a single architecture, can utilize large, diverse data corpora, can effectively transfer concept knowledge across tasks, and can even perform tasks unknown and unobserved at design and training time.

Building unified models for computer vision has proven to be quite challenging since vision tasks have incredibly diverse input and output representations. For instance, object detection produces bounding boxes around objects in an image, segmentation produces binary masks outlining regions in an image, visual question answering produces an answer as text, and depth estimation produces a map detailing the distance of each pixel from the camera. This heterogeneity makes it very challenging to architect a single model for all these tasks.

In contrast, while the landscape of natural language processing (NLP) tasks, datasets, and benchmarks is large and diverse, their inputs and desired outputs can often be uniformly represented as sequences of tokens. Sequence-to-sequence (Seq2Seq) architectures (Raffel et al., 2020; Brown et al., 2020), specifically designed to accept and produce such sequences of tokens, are thus widely applicable to many tasks. Unified models employing such architectures have been central to much recent progress in NLP.

Unified models for computer vision typically use a shared visual backbone to produce visual embeddings but then employ individual branches for each of the desired tasks.
Figure 1: UNIFIED-IO is a single sequence-to-sequence model that performs a variety of tasks in computer vision and NLP using a unified architecture without a need for either task- or modality-specific branches. This broad unification is achieved by homogenizing every task's input and output into a sequence of discrete vocabulary tokens. UNIFIED-IO supports modalities as diverse as images, masks, keypoints, boxes, and text, and tasks as varied as depth estimation, inpainting, semantic segmentation, captioning, and reading comprehension.

These include models like Mask R-CNN (He et al., 2017) for classical visual tasks, which use an ImageNet pre-trained encoder followed by branches for detection and segmentation, trained in a fully supervised manner. In the vision and language (V&L) domain, CNN backbones feed visual features to transformer architectures that also incorporate language, followed by task-specific heads for visual question answering, referring expression, visual commonsense reasoning, etc. More recent models share components across tasks: for instance, a single language decoder that serves multiple tasks requiring language output, like captioning and classification. However, most progress in unified models continues to be centered around V&L tasks, owing to the simplicity of building shared language decoders, and is often limited to supporting just a handful of tasks. UNIFIED-IO is a Seq2Seq model capable of performing a variety of tasks using a unified architecture without a need for either task- or even modality-specific branches. This broad unification is achieved by homogenizing every task's output into a sequence of discrete tokens.
Dense structured outputs such as images, segmentation masks, and depth maps are converted to sequences using a vector quantization variational auto-encoder (VQ-GAN) (Esser et al., 2021); sparse structured outputs such as bounding boxes and human joint locations are transcribed into sequences of coordinate tokens; and language outputs are converted to sequences using byte-pair encoding. This unification enables UNIFIED-IO to jointly train on over 90 datasets spanning computer vision, V&L, and NLP tasks with a single streamlined transformer encoder-decoder architecture (Raffel et al., 2020). Our jointly trained UNIFIED-IO is the first model to support all 7 tasks in the General Robust Image Task (GRIT) Benchmark (Gupta et al., 2022b) and obtains the top overall score of 64.3 when averaging across all tasks, handily beating the second best model, which obtains 32.0. We further evaluate UNIFIED-IO on 16 diverse benchmarks across computer vision and NLP, without any fine-tuning towards any individual benchmark, and find that it performs remarkably well compared to specialized (or fine-tuned) state-of-the-art models.

2. VISION, LANGUAGE AND MULTI-MODAL TASKS

UNIFIED-IO is designed to handle a wide range of language, vision and language, and classic vision tasks in a unified way. We curate 95 datasets from 62 publicly available data sources as targets for our model to learn during multi-task training. These datasets cover a wide range of tasks, skills, and modalities.

Table 1: Task groups and tasks used in multi-task training. For each group and task we list an example dataset source, the number of datasets, the number of examples, the percent of the total number of examples, and the sampling rate during training (Section 3.3), followed by the input and output modalities (T = Text, I = Image, S = Sparse, D = Dense).

| Task | Example source | Datasets | Examples | % | Rate | In (T I S D) | Out (T I S D) |
|---|---|---|---|---|---|---|---|
| Sparse Labelling | - | - | - | - | - | ✓ ✓ ✓ - | - - ✓ - |
| Object Detection | Open Images | 3 | 1.9m | 1.5 | 3.6 | - ✓ - - | - - ✓ - |
| Object Localization | VG | 3 | 6m | 4.6 | 7.1 | ✓ ✓ - - | - - ✓ - |
| Keypoint Estimation | COCO | 1 | 140k | 0.1 | 0.7 | - ✓ ✓ - | - - ✓ - |
| Referring Expression | RefCOCO | 3 | 130k | 0.1 | 1.1 | ✓ ✓ - - | - - ✓ - |
| Dense Labelling | - | 6 | 2.4m | 1.8 | 6.2 | ✓ ✓ - - | - - - ✓ |
| Depth Estimation | NYU Depth | 1 | 48k | 0.1 | 0.4 | - ✓ - - | - - - ✓ |
| Surface Normal Estimation | FrameNet | 2 | 210k | 0.2 | 1.1 | - ✓ - - | - - - ✓ |
| Object Segmentation | LVIS | 3 | 2.1m | 1.6 | 4.7 | ✓ ✓ - - | - - - ✓ |
| Image Classification | - | 9 | 22m | 16.8 | 12.5 | - ✓ ✓ - | ✓ - - - |
| Image Classification | ImageNet | 6 | 16m | 12.2 | 8.1 | ✓ ✓ - - | ✓ - - - |
| Object Categorization | COCO | 3 | 6m | 4.6 | 4.4 | - ✓ ✓ - | ✓ - - - |
| Image Captioning | - | 7 | 31m | 23.7 | 12.5 | - ✓ ✓ - | ✓ - - - |
| Webly Supervised Captioning | CC12M | 3 | 26m | 19.7 | 8.8 | - ✓ - - | ✓ - - - |
| Supervised Captioning | VizWiz | 3 | 1.4m | 1.1 | 1.7 | - ✓ - - | ✓ - - - |
| Region Captioning | VG | 1 | 3.8m | 2.9 | 2.0 | - ✓ ✓ - | ✓ - - - |
| Vision & Language | - | 16 | 4m | 3.0 | 12.5 | ✓ ✓ ✓ - | ✓ - - ✓ |
| Visual Question Answering | VQA 2.0 | 13 | 3.3m | 2.5 | 10.4 | ✓ ✓ ✓ - | ✓ - - - |
| Relationship Detection | VG | 2 | 640k | 0.5 | 1.9 | - ✓ ✓ - | ✓ - - - |
| Grounded VQA | VizWiz | 1 | 6.5k | 0.1 | 0.1 | ✓ ✓ - - | ✓ - - ✓ |
| NLP | - | 31 | 7.1m | 5.4 | 12.5 | ✓ - - - | ✓ - - - |
| Text Classification | MNLI | 17 | 1.6m | 1.2 | 4.8 | ✓ - - - | ✓ - - - |
| Question Answering | SQuAD | 13 | 1.7m | 1.3 | 5.2 | ✓ - - - | ✓ - - - |
| Text Summarization | Gigaword | 1 | 3.8m | 2.9 | 2.5 | ✓ - - - | ✓ - - - |
| Language Modelling | - | 2 | - | - | 12.5 | ✓ - - - | ✓ - - - |
| Masked Language Modelling | C4 | 2 | - | - | 12.5 | ✓ - - - | ✓ - - - |
| All Tasks | - | 95 | 130m | 100 | 100 | ✓ ✓ ✓ ✓ | ✓ ✓ ✓ ✓ |

We categorize the input and output modalities of each task into 4 different types: Text (natural language tokens); Image (RGB images); Sparse (a small number of location coordinates within the image); and Dense (per-pixel labels such as depth maps, surface normal maps, etc.). We group related datasets into tasks and related tasks into groups.

Language Modeling. This group contains the masked language modeling pre-training task (see Section 3.3) using text from C4 (Raffel et al., 2020) and Wikipedia (Foundation), which we include to ensure the knowledge gained from language pre-training is not lost during multi-task training. Other pre-training tasks are not included because the relevant datasets are already used in other supervised tasks (e.g., for captioning or classification). Table 1 shows the details of tasks and groups. We list an example dataset source, the number of datasets, the number of examples, the percent of the total number of examples, and the sampling rate during training (Section 3.3) for each group and task. Subsequent columns show which modalities are required for the inputs and outputs. We defer additional task details, inference details, the complete list of datasets, and visualizations to Appendix A.1.

3. UNIFIED-IO

Our goal is to build a single unified model that can support a diverse set of tasks across computer vision and language with little to no need for task-specific customizations and parameters. Such unified architectures can be applied to new tasks with little to no knowledge of the underlying machinery, enable general pre-training to benefit many diverse downstream applications, can be jointly trained on a large number of tasks, and allow knowledge to be shared between tasks.

3.1. UNIFIED TASK REPRESENTATIONS

Supporting a variety of modalities such as images, language, boxes, binary masks, segmentation masks, etc. without task-specific heads requires representing these modalities in a shared and unified space. To do this, we discretize the text, images, and other structured outputs in our tasks and represent them with tokens drawn from a unified and finite vocabulary.

Images and dense structure representations. A variety of tasks in computer vision require the model to produce high-dimensional outputs such as images (e.g., image in-painting) or per-pixel labels (e.g., depth estimation). To handle these modalities, we first convert per-pixel labels into RGB images. For depth, we construct a grayscale image by normalizing the depth map. For surface normal estimation, we convert the x/y/z orientations into r/g/b values. For segmentation, we map each instance present in the image to a unique color. We randomly select colors for each instance and specify the color-to-class mapping in the text instead of using a universal color-to-class mapping. This avoids requiring a fixed list of classes and avoids colors that may be only marginally different when a large number of classes is present. We then encode these images as discrete tokens using a VQ-GAN. In particular, we use the ImageNet-pretrained VQ-GAN from Esser et al. (2021) with 256 × 256 resolution, a compression ratio of 16, and a codebook size of 16384. The VQ-GAN codebook is added to the vocabulary as additional tokens that can be generated by the decoder. During training, the tokens for the target image are used as targets. During inference, the VQ-GAN decoder converts the generated image tokens into an output image.

Sparse structure representations. We encode sparse structures such as bounding boxes or human joints by adding 1000 special tokens to the vocabulary to represent discretized image coordinates (Chen et al., 2022b).
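The per-pixel-to-RGB conversions described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the maximum depth used for normalization and the random color-sampling scheme are assumptions.

```python
import numpy as np

def depth_to_grayscale(depth, max_depth=10.0):
    """Normalize a metric depth map into a uint8 grayscale image.

    max_depth is an assumed clipping range, not a value from the paper.
    """
    d = np.clip(depth, 0.0, max_depth) / max_depth
    return (d * 255).astype(np.uint8)

def normals_to_rgb(normals):
    """Map unit surface normals with x/y/z in [-1, 1] to r/g/b in [0, 255]."""
    return ((normals + 1.0) / 2.0 * 255).astype(np.uint8)

def segmentation_to_rgb(instance_map, rng=None):
    """Assign a random unique color to each instance id (0 = background).

    Returns the RGB image and the color-to-instance mapping, which the paper
    describes as being specified in the text prompt rather than fixed globally.
    """
    rng = rng or np.random.default_rng(0)
    ids = [i for i in np.unique(instance_map) if i != 0]
    colors = {int(i): rng.integers(0, 256, size=3, dtype=np.uint8) for i in ids}
    out = np.zeros((*instance_map.shape, 3), dtype=np.uint8)
    for i, c in colors.items():
        out[instance_map == i] = c
    return out, colors
```

The resulting RGB images would then be passed through the VQ-GAN encoder to obtain discrete tokens.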
Points are then encoded with a sequence of two such tokens, one for the x and one for the y coordinate, and boxes are encoded using a sequence of four tokens, two for the upper left corner and two for the lower right corner. Labeled boxes are encoded as a box followed by a text class label, and joints are encoded as a sequence of points followed by a text visibility label. This allows us to handle a wide variety of tasks that use these elements in their inputs or outputs (see Appendix A.1 for examples).
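The coordinate discretization can be sketched as follows, assuming 1000 bins (the number of location tokens stated above) and the "loc" token naming used in the example cards later in the paper; the exact rounding and normalization by image size are illustrative assumptions.

```python
NUM_BINS = 1000  # 1000 special location tokens added to the vocabulary

def point_to_tokens(x, y, width, height, bins=NUM_BINS):
    """Quantize an (x, y) pixel coordinate into two discrete location tokens."""
    bx = min(int(x / width * bins), bins - 1)
    by = min(int(y / height * bins), bins - 1)
    return [f"loc{bx}", f"loc{by}"]

def box_to_tokens(x1, y1, x2, y2, width, height):
    """Encode a bounding box as four location tokens:
    upper-left corner first, then lower-right corner."""
    return (point_to_tokens(x1, y1, width, height)
            + point_to_tokens(x2, y2, width, height))
```

A labeled box would simply append the class name as ordinary text tokens after these four location tokens.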

3.2. UNIFIED ARCHITECTURE

Universally representing a wide variety of tasks as input and output sequences of discrete tokens enables us to employ architectures that have proven successful in natural language processing. In UNIFIED-IO, we propose a pure transformer model largely following the design of T5 (Raffel et al., 2020). In particular, UNIFIED-IO is an encoder-decoder architecture where both the encoder and decoder are composed of stacked transformer layers, which in turn are composed of self-attention transformers, cross-attention transformers (in the decoder), and feed-forward neural networks. The layers are applied residually, and layer norms are applied before each transformer and feed-forward network. See Raffel et al. (2020) for details. We make a few architectural changes to adapt the T5 architecture to our setting. First, to handle input images, we reshape the image into a sequence of patches that are embedded with a linear projection similar to Dosovitskiy et al. (2021). Second, we expand the vocabulary to include the location tokens and the image tokens used in the VQ-GAN. Third, we extend the 1-d relative embeddings (Raffel et al., 2020) to 2-d with a fixed number of learned embeddings. We also add absolute position embeddings to the token embeddings following prior work.
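The first architectural change, reshaping an image into a sequence of linearly projected patches, can be sketched as follows; the patch size of 16 and the plain matrix projection are illustrative assumptions in the spirit of Dosovitskiy et al. (2021), not the paper's exact configuration.

```python
import numpy as np

def patchify(image, patch=16):
    """Reshape an HxWxC image into a sequence of flattened patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # (grid_h, grid_w, patch, patch, c)
    return x.reshape(-1, patch * patch * c)  # (num_patches, patch_dim)

def embed_patches(patches, proj):
    """Linearly project flattened patches into the transformer embedding space."""
    return patches @ proj                    # (num_patches, d_model)
```

The resulting patch embeddings are fed to the encoder alongside the text token embeddings.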

4.3. RESULTS ON ADDITIONAL TASKS

We report results on 16 additional tasks used in our training setup. For these tasks, we do not expect to achieve state-of-the-art results, since specialized models are usually designed and hyper-parameter tuned for a single task, while we evaluate a single jointly trained model. We also avoid extensive task-specific tricks like color jittering, horizontal flipping, CIDEr optimization, and label smoothing, which are often responsible for considerable gains in individual task performance; we leave such task-specific tuning for future work. See Table 5 for the results. When possible, we additionally report the best prior result on these tasks from a unified model, meaning a model trained in a multi-task setting with a unified architecture (no task-specific heads or customizations) on at least three other tasks. UNIFIED-IO provides strong performance on all these tasks despite being massively multi-tasked. We review more fine-grained results below. Depth Estimation. On depth estimation, UNIFIED-IO achieves 0.475 rmse, which is behind SOTA (Li et al., 2022e).

6. CONCLUSION

We have presented UNIFIED-IO, a unified architecture that supports a large variety of computer vision and NLP tasks with diverse inputs and outputs, including images, continuous maps, binary masks, segmentation masks, text, bounding boxes, and keypoints. This unification is made possible by homogenizing each of these modalities into a sequence of discrete tokens. The 2.9B parameter UNIFIED-IO XL model is jointly trained on 90+ datasets, is the first model to perform all 7 tasks on the GRIT benchmark, and obtains impressive results across 16 other vision and NLP benchmarks, with no benchmark fine-tuning or task-specific modifications.

Keypoint Estimation. Keypoint estimation requires returning the location of 17 keypoints on a human body (e.g., eyes, nose, feet, etc.) for each person in an image. While it is possible to perform this task in one pass by listing the keypoints of all people in the image in a single output sequence, this can result in an extremely long output sequence, so UNIFIED-IO uses a multi-step approach instead. To do this, UNIFIED-IO is trained to complete the subtask of detecting the keypoints for a single person in a given region. For this subtask, the input prompt specifies the target region, and the output is a list of 17 points (a pair of location tokens for the x and y coordinates) along with a visibility label (1 for not visible, 2 for partly visible, 3 for fully visible). Non-visible points are preceded by two copies of a new special token that indicates there are no valid coordinates. The keypoint metric does not award points for correctly identifying non-visible points, so during inference we mask that special token so the model makes a best-effort guess for the coordinates of every single point. Training data for this subtask comes from COCO human pose data (Lin et al., 2014) with the ground-truth person regions as input.
During inference we locate person regions using the object localization prompt, then apply UNIFIED-IO again to find keypoints for each detected region.
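The keypoint target encoding described above can be sketched as follows. The token names, the mapping from COCO's visibility flags (0/1/2) to the paper's 1/2/3 labels, and the special no-coordinate token name are assumptions for illustration.

```python
VISIBILITY = {0: "1", 1: "2", 2: "3"}  # COCO flag -> paper's visibility label
NO_COORD = "locNone"                   # stand-in for the special "no valid coordinate" token

def keypoints_to_sequence(keypoints, width, height, bins=1000):
    """Encode 17 COCO keypoints as (x-token, y-token, visibility) triples.

    `keypoints` is a list of (x, y, v) triples; points that are not visible
    emit two copies of the special no-coordinate token in place of the
    location tokens, matching the target format described in the text.
    """
    seq = []
    for x, y, v in keypoints:
        if v == 0:  # not annotated / not visible
            seq += [NO_COORD, NO_COORD, VISIBILITY[v]]
        else:
            bx = min(int(x / width * bins), bins - 1)
            by = min(int(y / height * bins), bins - 1)
            seq += [f"loc{bx}", f"loc{by}", VISIBILITY[v]]
    return seq
```

At inference, the logit for `NO_COORD` would be masked so the model always emits a concrete coordinate guess.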

[Keypoint estimation example card: ground truth vs. prediction]
Prompt: Find the human joints in the region "loc260 loc375 loc726 loc545".
Output: loc307 loc487 loc299 loc499 loc295 loc481 loc315 loc507 loc305 loc455 loc369 loc517 loc359 loc413 loc457 loc527 loc429 loc387 loc423 loc529 ….



https://github.com/CompVis/taming-transformers
https://www.tensorflow.org/datasets/catalog/nyu_depth_v2
http://datasets.lids.mit.edu/sparse-to-dense/data/nyudepthv2.tar.gz



Figure 2: Unified-IO. A schematic of the model with four demonstrative tasks: object segmentation, visual question answering, depth estimation, and object localization.

Vision & Language. A broad category for other tasks that require joint reasoning over image content and a natural language query. There are many popular vision and language datasets, and we categorize these datasets into 3 tasks: visual question answering (Antol et al., 2015), relationship detection (Lu et al., 2016), and grounded VQA (Chen et al., 2022a).

NLP. Tasks with text as the only input and output modality, including text classification (Williams et al., 2018), question answering (Rajpurkar et al., 2016), and text summarization (Graff et al., 2003).

Published as a conference paper at ICLR 2023

The text includes the bounding boxes and class names of all objects in the image. We randomize the order of the output objects during training, but for simplicity leave integrating more complex data-augmentation techniques (Chen et al., 2022b) to future work.

omitted for brevity

Object Localization. Object localization requires returning bounding boxes around all objects of a given category. Training data is derived from our object detection training data by constructing a training example from each category of objects present in an image. The input is then the image and a prompt specifying the target class, and the output is a list of all boxes that contain an instance of that class. The class for each box (which is always the class specified in the prompt) is included in the output to keep the output format consistent with the object detection output. Object localization can be given input categories that are not present in the image. To handle this, we construct negative samples by randomly selecting categories not present in the image to use as input, in which case the output is an empty sequence.

omitted for brevity

Referring Expression Comprehension. This task requires the model to localize an image region described by a natural language expression. The annotation is similar to object localization, except that the target is specified with a natural language expression instead of a class name. Datasets for this task include RefCOCO (Kazemzadeh et al., 2014), RefCOCO+ (Kazemzadeh et al., 2014), and RefCOCOg (Mao et al., 2016). Example prompt: Which region does the text "man in solid black" describe?
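The construction of localization examples with negative samples can be sketched as follows; the prompt wording and the number of negatives per image are hypothetical, chosen only to illustrate the positive/negative example structure described above.

```python
import random

def localization_examples(image_id, boxes_by_class, all_classes,
                          num_negatives=1, rng=None):
    """Build object-localization training examples from detection annotations.

    One example per category present in the image, plus randomly sampled
    negative categories whose target is an empty sequence.
    """
    rng = rng or random.Random(0)
    examples = []
    for cls, boxes in boxes_by_class.items():
        examples.append({"image": image_id,
                         "prompt": f"Which regions contain {cls}?",  # hypothetical wording
                         "target": [(box, cls) for box in boxes]})
    absent = [c for c in all_classes if c not in boxes_by_class]
    for cls in rng.sample(absent, min(num_negatives, len(absent))):
        examples.append({"image": image_id,
                         "prompt": f"Which regions contain {cls}?",
                         "target": []})  # category not present: empty output
    return examples
```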

text outputs omitted for brevity

A.1.3 DENSE LABELLING TASKS

Object Segmentation. Object segmentation requires finding the binary segmentation mask of each instance of a particular category in an image. The input is an image and a prompt that includes the target class, while the output is an RGB image with a black background and instances of that class filled in with unique colors following the method in Section 3.1. The output image is resized to match the input image if needed using nearest-neighbor resizing, and binary masks are built from each unique color. In practice, the output image from UNIFIED-IO can have slightly non-uniform colors or extraneous background pixels, likely due to limits in what the D-VAE can decode/encode, so the output pixels are clustered by color and connected components of fewer than 8 pixels are removed to build cleaned instance masks. Segmentation annotations come from Open Images, LVIS, and COCO.

Depth Estimation. Depth estimation requires assigning each pixel in an image a depth value. This task uses a static prompt as input, and the output is a grayscale image representing the normalized depth at each pixel. The generated output image is resized to the same size as the input image, and the pixel values are then rescaled by the maximum depth in the training data to get an output depth map. Training data comes from the NYU Depth Dataset V2 (Nathan Silberman & Fergus, 2012). Example prompt: What is the depth map of the image?

Surface Normal Estimation. UNIFIED-IO is trained on the FrameNet (Huang et al., 2019a) and BlendedMVS (Yao et al., 2020) surface normal estimation datasets. For this task, the input is a static prompt and an image, and the output is an RGB representation of the x/y/z orientation of the surface at each pixel. The generated output image is resized to match the input image and converted back to x/y/z orientations to produce the final output. Example prompt: What is the surface normal of the image?
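The segmentation post-processing above (group pixels by color, drop tiny connected components) can be sketched as follows. This is a simplified version: it groups by exact color, whereas a real implementation would first cluster near-identical colors produced by the decoder.

```python
import numpy as np
from collections import deque

def masks_from_color_image(rgb, min_pixels=8):
    """Recover per-instance binary masks from a generated color image.

    Black is background; connected components (4-connectivity) smaller than
    `min_pixels` are discarded as decoder noise, following the text above.
    """
    h, w, _ = rgb.shape
    colors = {tuple(int(v) for v in p) for p in rgb.reshape(-1, 3)}
    masks = {}
    for color in colors:
        if color == (0, 0, 0):
            continue
        mask = np.all(rgb == color, axis=-1)
        cleaned = np.zeros_like(mask)
        seen = np.zeros_like(mask)
        for sy, sx in zip(*np.nonzero(mask)):
            if seen[sy, sx]:
                continue
            component, queue = [], deque([(sy, sx)])
            seen[sy, sx] = True
            while queue:  # BFS flood fill over same-color neighbors
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_pixels:
                for y, x in component:
                    cleaned[y, x] = True
        if cleaned.any():
            masks[color] = cleaned
    return masks
```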
et al., 2019; Clark et al., 2019; Khashabi et al., 2018; Roemmele et al., 2011), Cosmos QA (Huang et al., 2019b), OpenBookQA (Mihaylov et al., 2018), and HellaSwag (Zellers et al., 2019b). If the text context is longer than our maximum sequence length, we use a sliding-window approach following Devlin et al. (2019), which exposes the model to different windows of text from the context and returns the highest-confidence answer. Example: context: Uptake of O2 from the air is the essential purpose of respiration, so oxygen supplementation is used in medicine. Treatment not only increases oxygen levels in the patient's blood…. question: What medical treatment is used to increase oxygen uptake in a patient?

Also following past work (Raffel et al., 2020), text classification tasks are formatted by placing the input sentences and a query in the prompt and training the model to generate the target class. Datasets include tasks from GLUE and SuperGLUE (Wang et al., 2018; 2019; Warstadt et al., 2018; Socher et al., 2013; Dolan & Brockett, 2005; Iyer et al., 2017; Cer et al., 2017; Williams et al., 2018; Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009; Levesque et al., 2012; De Marneff et al., 2019; Pilehvar & Camacho-Collados, 2018), as well as SNLI (Bowman et al., 2015), SciTail (Khot et al., 2018), IMDB Reviews (Maas et al., 2011), and PAWS (Zhang et al., 2019). Example: context: Swansea striker Lee Trundle has negotiated a lucrative image-rights deal with the League One club. Lee Trundle is in business with the League One club. question: Does this sentence entail the following sentence? Answer: Yes, it entails.

Text Summarization. Text summarization is again done by providing the input paragraph and a prompt as input and generating a summary as output. We use the Gigaword dataset (Graff et al., 2003; Rush et al., 2015) for training data.
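The sliding-window procedure for long contexts can be sketched as follows; the stride of half a window and the `score_fn` stand-in for a model call are assumptions, since the text only specifies overlapping windows with a highest-confidence vote.

```python
def sliding_windows(tokens, max_len, stride=None):
    """Split a long token sequence into overlapping windows of at most max_len."""
    stride = stride or max_len // 2  # assumed 50% overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

def answer_long_context(tokens, max_len, score_fn):
    """Run the model on each window and keep the highest-confidence answer.

    score_fn stands in for a model call returning (answer, confidence).
    """
    return max((score_fn(w) for w in sliding_windows(tokens, max_len)),
               key=lambda pair: pair[1])[0]
```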

Example: context: Sri Lankan president Chandrika Kumaratunga called on the country's private sector to invest in the establishment of townships while pledging to provide government assistance for such ventures, official radio said Saturday. question: What is a short summary of this document? Answer: calls for private sector investment in townships

A.1.8 LANGUAGE MODELING TASKS

Masked Language Modeling. Following T5 (Raffel et al., 2020), the masked language modelling objective randomly samples and then drops out 15% of tokens in the input sequence. Each consecutive span of dropped-out tokens is replaced by a single sentinel token. The target is to recover the dropped tokens given the sentinel tokens. We use the C4 (Raffel et al., 2020) and Wikipedia (Foundation) datasets.
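The T5-style span corruption described above can be sketched as follows; the `<extra_id_N>` sentinel naming follows the T5 convention and the per-token Bernoulli sampling is a simplification of T5's span sampling.

```python
import random

def span_corrupt(tokens, mask_rate=0.15, rng=None):
    """Drop ~15% of tokens; each run of dropped tokens becomes one sentinel.

    Returns (inputs, targets): the corrupted input sequence and the target
    sequence that recovers each dropped span after its sentinel.
    """
    rng = rng or random.Random(0)
    drop = [rng.random() < mask_rate for _ in tokens]
    inputs, targets, sentinel, in_span = [], [], 0, False
    for tok, dropped in zip(tokens, drop):
        if dropped:
            if not in_span:  # open a new masked span with a fresh sentinel
                inputs.append(f"<extra_id_{sentinel}>")
                targets.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            targets.append(tok)
            in_span = True
        else:
            inputs.append(tok)
            in_span = False
    return inputs, targets
```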


Figure 4: Task groups (inner circle), tasks (middle circle), and datasets (outer circle) used in multi-task training of UNIFIED-IO. Sizes correspond to the sampling rate in the training distribution. Best viewed in color.

Figure 4 shows a visualization of the multi-task training distribution used by UNIFIED-IO from Table 1. As discussed in Section 3.3, we equally sample each group (1/8) except image synthesis (3/16) and dense labeling (1/16), since dense labeling has a much smaller sample size compared to image synthesis. We sample tasks and datasets (middle and outer circles) with a temperature-scaled mixing strategy to make sure the model is sufficiently exposed to underrepresented tasks. We raise each task's mixing rate to the power of 1/T and then renormalize the rates so that they sum to 1. Following Raffel et al. (2020), we use T = 2 in our experiments. Due to the large variance in dataset size, some of the tasks are rarely sampled. For example, the depth estimation task has only the NYU Depth dataset source (Nathan Silberman & Fergus, 2012), and thus its sampling rate is only 0.43%. However, the model still works well for depth estimation, even outperforming concurrent work (Kolesnikov et al., 2022) (0.385 vs. 0.467 RMSE). We suspect the large model capacity and masked image denoising pre-training improve the performance. Similarly, grounded VQA (Chen et al., 2022a) has a 0.15% sample rate, but the model can still achieve state-of-the-art performance on this task, partly because it is trained on many related datasets for VQA and segmentation.
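The temperature-scaled mixing described above reduces to a few lines; this sketch takes raw dataset shares and returns the renormalized sampling rates for T = 2.

```python
def mixing_rates(sizes, T=2.0):
    """Temperature-scaled mixing: raise each task's share to the power 1/T,
    then renormalize so the rates sum to 1 (T = 2 following Raffel et al.)."""
    raw = [s ** (1.0 / T) for s in sizes]
    total = sum(raw)
    return [r / total for r in raw]
```

With T > 1 this flattens the distribution: a dataset 4x larger than another is sampled only 2x as often at T = 2, which is how underrepresented tasks stay sufficiently exposed.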

Figure 10: Image captioning qualitative examples.

Figure 12: Natural language processing qualitative examples.


Size variants of UNIFIED-IO. Both the encoder and decoder are based on the T5 implementation (Raffel et al., 2020). Parameters of the VQ-GAN (Esser et al., 2021) are not included in the total parameter count.

UNIFIED-IO is jointly trained on a large variety of tasks. Since our goal is to examine whether a single unified model can solve a variety of tasks simultaneously, we do not perform task-specific fine-tuning, although prior work (Lu et al., 2020; Wang et al., 2022b) shows it can further improve task performance.

Pre-training. To learn good representations from large-scale webly supervised image and text data, we consider two pre-training tasks: text span denoising and masked image denoising. The text span denoising task follows Raffel et al. (2020): randomly corrupt 15% of the tokens and replace each run of consecutive corrupted tokens with a unique mask token. The masked image denoising task follows Bao et al. (2022) and He et al. (2022): randomly mask 75% of the image patches, with the goal of recovering the whole image. When another modality is present, i.e., image or text, the model can use information from that modality to complete the task.
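The patch-masking step of masked image denoising can be sketched as follows; this is a minimal MAE-style illustration of selecting which 75% of patch indices to mask, not the paper's exact implementation.

```python
import random

def mask_patches(num_patches, mask_ratio=0.75, rng=None):
    """Pick a random subset of patch indices to mask.

    Returns (masked, visible) index lists; the model sees only the visible
    patches and is trained to recover the whole image.
    """
    rng = rng or random.Random(0)
    num_mask = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_mask))
    visible = [i for i in range(num_patches) if i not in masked]
    return sorted(masked), visible
```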

Table 3: Comparison of our UNIFIED-IO models to recent SOTA on the GRIT benchmark. Results for CLIP and OFA are obtained from the GRIT challenge.

UNIFIED-IO is the first model to support all seven tasks in GRIT. As seen in Table 3, UNIFIED-IO XL outperforms all prior submissions to GRIT, obtaining an average accuracy of 64.3 on test. The next best submission is GPV-2 (Kamath et al., 2022), which obtains 32.0 and supports only 4 of the 7 tasks. UNIFIED-IO XL also outperforms the multi-task checkpoint of OFA LARGE (Wang et al., 2022b) on VQA, referring expression, and categorization. Mask R-CNN (He et al., 2017) is a strong baseline for core vision tasks; UNIFIED-IO XL outperforms it on localization and segmentation, partly because UNIFIED-IO XL shows little degradation in performance between same and new concepts, as discussed in Appendix A.5.

Due to computational constraints, we ablate UNIFIED-IO LARGE and train for 250k steps. When ablating a task group, we reduce the number of training steps so that all models are trained on approximately the same number of examples for each of the remaining task groups. Results are shown in Table 4 on GRIT and MNLI (Williams et al., 2018). In spite of supporting a large number of heterogeneous tasks, UNIFIED-IO is able to perform well across all tasks. Reducing this heterogeneity by removing task groups does not significantly impact the performance of individual tasks. This is notable since removing a task group significantly reduces the scope of what a model needs to learn while keeping the model capacity fixed, and it empirically demonstrates the effectiveness of the proposed unified architecture for massive heterogeneous task support. An exception is that removing the NLP group significantly boosts categorization, which might indicate that the sentence classification task interferes with image classification.
Removing captioning also boosts performance on VQA and a few other tasks, which might be because captioning requires a relatively large amount of model capacity to learn free-form text generation, in contrast to VQA, which requires short answer phrases from a limited vocabulary. Removing image synthesis causes a major regression in keypoint estimation. Manual inspection shows that the model predicts standing-

Table 4: Ablation study on holding out task groups, evaluated on GRIT and MNLI (Williams et al., 2018).


On depth estimation, UNIFIED-IO lags behind SOTA (Li et al., 2022e) but is similar to the recently proposed unified model UViM (Kolesnikov et al., 2022), despite being trained to do far more tasks. More discussion can be found in Appendix A.8.

NLP tasks. UNIFIED-IO achieves respectable results on three NLP tasks but lags behind SOTA models (Smith et al., 2022; Zoph et al., 2022; He et al., 2021). This can partly be attributed to scale: modern NLP models contain 100 billion+ parameters and undergo more extensive NLP pre-training. Our use of a pre-trained VQ-GAN greatly simplifies our training and is surprisingly effective for dense prediction tasks. However, it does mean UNIFIED-IO has limited image generation capabilities (recent works (Yu et al., 2022b) have shown this method can be greatly improved, but these were not available at the time of development). We also found in a small-scale study that our model does not always understand prompts not seen in the training data (see Appendix A.6).

OFA (Wang et al., 2022b) proposes a similar approach that also supports image locations and text-to-image synthesis. However, OFA does not support dense labeling tasks such as depth estimation, segmentation, and surface normal estimation. Other closely related models include UViM (Kolesnikov et al., 2022), which generates a discrete guiding code for a D-VAE to build an autoregressive model for panoptic segmentation, depth prediction, and colorization; Pix2Seq v2 (Chen et al., 2022c), which extends Pix2Seq to segmentation, keypoint estimation, and image captioning; and Visual Prompting (Bar et al., 2022), which adapts a pre-trained visual model to novel downstream tasks via image inpainting. UNIFIED-IO covers all these tasks and focuses on multi-tasking rather than task-specific fine-tuning. Additional discussion is presented in Appendix A.10.

Figure 3 shows a visualization of the pre-training data distribution used by UNIFIED-IO. As discussed in Section 3.3, we sample data equally between the text denoising and image denoising objectives (inner circle of Figure 3). For text denoising, half of the samples come from pure text data, i.e., C4 and Wikipedia. The other half is constructed from image-and-class data, such as Imagenet21k (Ridnik et al., 2021), or image-and-caption data, such as YFCC15M (Radford et al., 2021). For image denoising, we use the text information when a class or caption is present in the data source, and we sample each dataset proportionally to its size. For both text and image denoising, when both text and image are present as inputs, we randomly drop both modalities 10% of the time.
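The sampling logic above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dataset names and sizes are placeholders, and the exact interaction between objective choice and modality dropping is an assumption.

```python
import random

def sample_pretraining_example(datasets, rng=random):
    """Pick a dataset proportionally to its size, then choose an objective
    and decide whether to drop the input modalities.

    `datasets` maps name -> (size, has_text). Names and sizes here are
    illustrative placeholders, not the paper's actual corpus statistics.
    """
    names = list(datasets)
    sizes = [datasets[n][0] for n in names]
    # Sample a dataset proportionally to its size.
    name = rng.choices(names, weights=sizes, k=1)[0]
    _, has_text = datasets[name]

    # Text denoising and image denoising are sampled equally.
    objective = rng.choice(["text_denoising", "image_denoising"])

    # When both modalities are present, drop both inputs 10% of the time.
    drop_inputs = has_text and rng.random() < 0.10
    return {"dataset": name, "objective": objective, "drop_inputs": drop_inputs}

example = sample_pretraining_example({
    "imagenet21k": (13_000_000, True),
    "yfcc15m": (15_000_000, True),
})
```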

Published as a conference paper at ICLR 2023

A APPENDIX

A.1 TASK DETAILS

UNIFIED-IO is jointly trained on a large and diverse set of vision, language, and vision & language tasks. In this section, we describe these tasks in detail and show the prompts we use during training and inference (text on the left of each example card). We also provide qualitative examples of both the ground truth and the predictions made by UNIFIED-IO.

A.1.1 IMAGE SYNTHESIS TASKS

Image Synthesis from Text. This task requires generating an image that matches a sentence. Training data comes from 4 captioning datasets: COCO Caption (Chen et al., 2015), Conceptual Captions 3M and 12M (Changpinyo et al., 2021), and RedCaps (Desai et al., 2021), as well as datasets used for image classification, using the object class as the input caption. Specialized image generation models like DALL•E 2 (Ramesh et al., 2022) use an order of magnitude more data, but we limit our sources to these sets for training efficiency.

Prompt: What is the complete image? Text: "a white sink in a small tiled bathroom"

[Example cards: image generation — ground truth vs. prediction]

Image Inpainting. This task requires filling in a region of an image with a target object. Training data for this task is built from object bounding box annotations from Open Images (Kuznetsova et al., 2020), Visual Genome (Krishna et al., 2017), and COCO (Lin et al., 2014). For each object, the input is the source image with the object's bounding box blanked out. The input prompt provides the bounding box's location and the target category. The target output is the original image.
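Constructing one such training example can be sketched as follows. This is an illustrative sketch: the exact prompt wording and the location-token quantization scheme are assumptions modeled on the paper's "loc###" tokens, not the authors' code.

```python
import numpy as np

def make_inpainting_example(image, box, category, n_loc_bins=1000):
    """Build one image-inpainting training example (illustrative sketch).

    `image` is an HxWx3 uint8 array, `box` is (y1, x1, y2, x2) in pixels.
    """
    h, w = image.shape[:2]
    y1, x1, y2, x2 = box

    # Input image: the source image with the object's box blanked out.
    src = image.copy()
    src[y1:y2, x1:x2] = 0

    # Quantize box coordinates into discrete location tokens (assumed scheme).
    def loc(v, extent):
        return f"loc{int(v / extent * (n_loc_bins - 1))}"

    prompt = (f'Fill in the blank region '
              f'"{loc(y1, h)} {loc(x1, w)} {loc(y2, h)} {loc(x2, w)}" '
              f'with a "{category}"')
    # The target output is the original, unmasked image.
    return src, prompt, image

img = np.full((256, 256, 3), 255, dtype=np.uint8)
masked, prompt, target = make_inpainting_example(img, (64, 64, 128, 128), "person")
```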

[Example cards: image inpainting — ground truth vs. prediction]

Prompt: Fill in the blank region "loc257 loc425 loc575 loc758" person"?

Image Synthesis from Segmentation. This task involves generating an image that matches an input semantic segmentation, i.e., a set of class labels for some or all of the pixels in the image. UNIFIED-IO is trained for this task using segmentation annotations from COCO (Lin et al., 2014).

(Welinder et al., 2010). For this task the input is an image and a static prompt, and the output is a class name. During inference we compute the log-probability of each class label in the dataset being evaluated and return the highest-scoring one. This ensures UNIFIED-IO does not return a category from a different categorization dataset that is a synonym or hypernym of the correct label.

The total vocabulary size is 49536, with 32152 language tokens, 1000 location tokens, and 16384 vision tokens. We use the ImageNet-pretrained VQ-GAN checkpoint with 16384 tokens and f = 16; please refer to Esser et al. (2021) for details. During training, we randomly sub-sample 128 image patches for the pre-training stage and 256 image patches (out of 576) for the multi-task stage. We do not use dropout. The Adafactor (Shazeer & Stern, 2018) optimizer is used to save memory. We use a learning rate of 10^-2 for the first 10,000 steps and then decay at a rate of 1/√k, where k is the step number. We train with β1 = 0.9 and β2 = 1.0 - k^(-0.8). We use global-norm gradient clipping at 1.0 and find this is crucial to stabilize XL training. We train the Small, Base, and Large models with a batch size of 2048 and XL with a batch size of 1024 due to memory considerations.

GRIT provides a breakdown of metrics into two groups: same for samples that only contain concepts seen in the primary training data (a set of common datasets like COCO, ImageNet, and Visual Genome), and new for samples containing at least one concept unseen in the primary training data.
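The optimizer schedules above can be sketched as follows. This is a minimal illustration of the stated rules (constant 10^-2 for 10,000 steps, then 1/√k decay, with β2 = 1 - k^(-0.8)), not the authors' exact implementation; in particular, scaling the decay so the schedule is continuous at step 10,000 is an assumption, since the paper only states the 1/√k rate.

```python
import math

WARMUP_STEPS = 10_000
BASE_LR = 1e-2

def learning_rate(k):
    """Constant 1e-2 for the first 10k steps, then decay as 1/sqrt(k).

    The decay is scaled so the schedule is continuous at k = 10,000
    (an assumption; the paper only states the 1/sqrt(k) rate).
    """
    if k <= WARMUP_STEPS:
        return BASE_LR
    return BASE_LR * math.sqrt(WARMUP_STEPS) / math.sqrt(k)

def adafactor_beta2(k):
    """Step-dependent second-moment decay: beta2 = 1 - k^(-0.8)."""
    return 1.0 - k ** -0.8

lr = learning_rate(40_000)   # 1e-2 * sqrt(10000/40000) = 5e-3
b2 = adafactor_beta2(10_000)
```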
Table 6 shows results for UNIFIED-IO and other leaderboard entries on the ablation set, divided into the same and new concepts. UNIFIED-IO XL shows little degradation in performance between same and new compared to competing entries. On some tasks UNIFIED-IO even outperforms on the new split compared to the same. This indicates that the volume of training data used to train UNIFIED-IO has broad coverage of concepts and provides almost as effective a level of supervision as large standard vision datasets like COCO. Furthermore, since UNIFIED-IO is a uniquely unified architecture with no task-specific parameters, it is very likely able to effectively transfer knowledge across different tasks.

Overall, we find that the model has some capacity to generalize to paraphrases of the prompt (e.g., row 3 works reasonably well despite using completely different words), but there are paraphrases that result in a very significant performance decrease (e.g., rows 5, 6, and 8). We also find that removing the spaces around the punctuation sometimes results in minor regressions (row 0 vs. row 1) and sometimes in sharply reduced performance (row 6 vs. row 7), showing UNIFIED-IO can be sensitive to formatting details. We hypothesize that this is caused by the SentencePiece tokenizer changing the tokenization of the referring expression if the quotes are not separated from it by spaces. Building multi-task models that can generalize to different prompts, and ideally to prompts for completely new tasks, is an exciting avenue for future work.

A.7 CROSS TASK GENERALIZATION CASE STUDY

Qualitative examples of UNIFIED-IO applied to two out-of-domain settings, surface normal detection on COCO images and animal pose estimation, are included in Figure 5 with the other qualitative examples. For surface normal detection, we find that the model produces plausible images even for objects like cats or humans that are not in the surface normal training data. However, other scenes, such as outdoor scenes (Figure 5, bottom right), are less coherent. Despite not being trained on animal pose estimation, UNIFIED-IO is sometimes able to find animal keypoints. For animals standing or crouching on two legs, the keypoints are reasonably accurate (first two images); however, for animals standing on four legs, the model will find leg and eye points but then guess arm positions that would make sense for a person instead of attaching points to the other legs. While this hints that the model was able to combine skills learned from human pose estimation data with knowledge of animals learned from other tasks, it also shows that more work is needed to fully realize this potential.

A.8 NYUV2 RESULTS

The first version of the UNIFIED-IO multi-tasking data distribution contains two sources of depth data: nyu_depth_v2 from TensorFlow Datasets and a pre-processed version from sparse-to-dense.pytorch. Since the original NYUv2 dataset has many 0-distance regions (holes), which can be problematic for sequence training, we included the latter source because it replaces the 0-distance holes with approximations. The second version of the UNIFIED-IO model, trained for the code release, uses an updated multi-tasking data distribution that only contains the TensorFlow source of the NYUv2 dataset, and we find a significant drop in performance for the XL model (0.475 vs. 0.385) while other tasks maintain similar performance. We suspect the reason is that sparse-to-dense.pytorch uses a different split and contaminates the training data for the NYUv2 evaluation. Our final result is a little worse than UViM (0.475 vs. 0.467).
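The idea of replacing 0-distance holes with approximations can be sketched as follows. This is an illustrative sketch only: it iteratively fills each hole with the mean of its valid 4-neighbors, whereas the actual pre-processing in sparse-to-dense.pytorch may use a different (e.g., colorization-based) inpainting method.

```python
import numpy as np

def fill_depth_holes(depth, max_iters=100):
    """Replace 0-valued holes in a depth map with a local approximation.

    Each hole pixel is repeatedly filled with the mean of its valid
    (non-zero) 4-neighbors until no holes remain.
    """
    d = depth.astype(np.float64).copy()
    for _ in range(max_iters):
        holes = d == 0
        if not holes.any():
            break
        # Gather the four neighbors via shifted views of a zero-padded copy.
        padded = np.pad(d, 1)
        neighbors = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                              padded[1:-1, :-2], padded[1:-1, 2:]])
        valid = neighbors > 0
        counts = valid.sum(axis=0)
        sums = (neighbors * valid).sum(axis=0)
        # Fill holes that have at least one valid neighbor this round.
        fillable = holes & (counts > 0)
        d[fillable] = sums[fillable] / counts[fillable]
    return d

depth = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 2.0],
                  [3.0, 3.0, 0.0]])
filled = fill_depth_holes(depth)
```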

