UNIVERSAL FEW-SHOT LEARNING OF DENSE PREDICTION TASKS WITH VISUAL TOKEN MATCHING

Abstract

Dense prediction tasks are a fundamental class of problems in computer vision. As supervised methods suffer from high pixel-wise labeling cost, a few-shot learning solution that can learn any dense prediction task from a few labeled images is desirable. Yet, current few-shot learning methods target a restricted set of tasks such as semantic segmentation, presumably due to challenges in designing a general and unified model that can flexibly and efficiently adapt to arbitrary tasks of unseen semantics. We propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. It employs non-parametric matching on patch-level embedded tokens of images and labels, which encapsulates all tasks. VTM also adapts flexibly to any task with a tiny amount of task-specific parameters that modulate the matching algorithm. We implement VTM as a powerful hierarchical encoder-decoder architecture built on ViT backbones, where token matching is performed at multiple feature hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with fully supervised baselines using only 10 labeled examples of novel tasks (0.004% of full supervision), and sometimes outperforms them using 0.1% of full supervision.

1. INTRODUCTION

Dense prediction tasks constitute a fundamental class of computer vision problems, where the goal is to learn a mapping from an input image to a pixel-wise annotated label. Examples include semantic segmentation, depth estimation, edge detection, and keypoint detection, to name a few (Zamir et al., 2018; Cai & Pu, 2019). While supervised methods have achieved remarkable progress, they require a substantial amount of manually annotated pixel-wise labels, leading to a massive and often prohibitive per-task labeling cost (Kang et al., 2019; Liu et al., 2020; Ouali et al., 2020). Prior work on transfer and multi-task learning has made efforts to relieve this burden, but it often assumes that relations between tasks are known in advance and still requires a fairly large amount of labeled images of the task of interest (e.g., thousands) (Zamir et al., 2018; Standley et al., 2020; O Pinheiro et al., 2020; Wang et al., 2021). This motivates us to seek a few-shot learning solution that can universally learn arbitrary dense prediction tasks from a few (e.g., ten) labeled images.

However, existing few-shot learning methods for computer vision target a restricted set of tasks, such as classification, object detection, and semantic segmentation (Vinyals et al., 2016; Kang et al., 2019; Min et al., 2021). As a result, they often exploit prior knowledge and assumptions specific to these tasks in designing the model architecture and training procedure, and are therefore not suited to generalizing to arbitrary dense prediction tasks (Snell et al., 2017; Fan et al., 2022; Iqbal et al., 2022; Hong et al., 2022). To our knowledge, no prior work in few-shot learning provides an approach to solving arbitrary dense prediction tasks in a universal manner.

We argue that a universal few-shot learner for arbitrary dense prediction tasks must meet the following desiderata. First, the learner must have a unified architecture that can handle arbitrary tasks by design and share most of its parameters across tasks, so that it can acquire generalizable knowledge for few-shot learning of arbitrary unseen tasks. Second, the learner should flexibly adapt its prediction mechanism to solve diverse tasks of unseen semantics, while being efficient enough to prevent over-fitting. Designing such a learner is highly challenging, as it should be general and unified yet able to flexibly adapt to any unseen task without over-fitting to a few examples.

In this work, we propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. We draw inspiration from the cognitive process of analogy making (Mitchell, 2021); given a few examples of a new task, humans can quickly understand how to relate input and output based on a similarity between examples (i.e., assign similar outputs to similar inputs), while flexibly adjusting the notion of similarity to the given context. In VTM, we implement analogy-making for dense prediction as patch-level non-parametric matching, where the model learns a similarity over image patches that captures the similarity over label patches. Given a few labeled examples of a novel task, it first adapts its similarity function to describe the given examples well, then predicts the labels of an unseen image by combining the label patches of the examples based on image patch similarity.
Despite its simplicity, the model has a unified architecture for arbitrary dense prediction tasks, since the matching algorithm naturally encapsulates all tasks and label structures (e.g., continuous or discrete). In addition, we introduce only a small amount of task-specific parameters, which makes our model flexible as well as robust to over-fitting.

Our contributions are as follows. (1) For the first time to our knowledge, we propose and tackle the problem of universal few-shot learning of arbitrary dense prediction tasks. We formulate the problem as episodic meta-learning and identify two key desiderata of the learner: a unified architecture and a flexible adaptation mechanism. (2) We propose Visual Token Matching (VTM), a novel universal few-shot learner for dense prediction tasks. It employs non-parametric matching on tokenized image and label embeddings, which flexibly adapts to unseen tasks using a tiny amount of task-specific parameters. (3) We implement VTM as a powerful hierarchical encoder-decoder architecture, where token matching is performed at multiple feature hierarchies using an attention mechanism. We employ ViT image and label encoders (Dosovitskiy et al., 2020) and a convolutional decoder (Ranftl et al., 2021), which seamlessly work with our algorithm. (4) We demonstrate VTM on a challenging variant of the Taskonomy dataset (Zamir et al., 2018) and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with or outperforms fully supervised baselines given extremely few examples (0.1%), sometimes using only 10 labeled images (< 0.004%).
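The core matching operation can be viewed as cross-attention from query image tokens to support image tokens, whose weights are then used to combine support label tokens. The snippet below is a minimal sketch of this patch-level non-parametric matching under that reading, not the authors' implementation; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def token_matching(query_img_tokens, support_img_tokens, support_lbl_tokens):
    """Predict query label tokens by matching query image tokens to support image tokens.

    query_img_tokens:   (M, D) patch tokens of the query image
    support_img_tokens: (N, D) patch tokens gathered from all support images
    support_lbl_tokens: (N, D) label tokens aligned with the support image tokens
    Returns (M, D) predicted label tokens for the query image.
    """
    d = query_img_tokens.shape[-1]
    # Scaled dot-product similarity between every query patch and every support patch.
    sim = query_img_tokens @ support_img_tokens.t() / d ** 0.5   # (M, N)
    # Soft matching weights: each query patch attends to all support patches.
    weights = F.softmax(sim, dim=-1)
    # Combine support label tokens according to image-patch similarity.
    return weights @ support_lbl_tokens                           # (M, D)
```

In the full model described in the paper, this matching is performed at multiple feature hierarchies of the ViT encoders, and the combined label tokens are decoded into a dense prediction by a convolutional decoder.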

2. PROBLEM SETUP

We propose and tackle the problem of universal few-shot learning of arbitrary dense prediction tasks. In our setup, we consider any arbitrary task $\mathcal{T}$ that can be expressed as follows:

$\mathcal{T}: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times C_\mathcal{T}}, \quad C_\mathcal{T} \in \mathbb{N}.$ (1)

This subsumes a wide range of vision tasks including semantic segmentation, depth estimation, surface normal prediction, and edge prediction, to name a few, which vary in the structure of the output space, e.g., dimensionality ($C_\mathcal{T}$) and topology (discrete or continuous), as well as in the required knowledge. Our goal is to build a universal few-shot learner $\mathcal{F}$ that, for any such task $\mathcal{T}$, can produce predictions $\hat{Y}^q$ for an unseen image (query) $X^q$ given a few labeled examples (support set) $\mathcal{S}_\mathcal{T}$:

$\hat{Y}^q = \mathcal{F}(X^q; \mathcal{S}_\mathcal{T}), \quad \mathcal{S}_\mathcal{T} = \{(X^i, Y^i)\}_{i \leq N}.$ (2)

To build such a universal few-shot learner $\mathcal{F}$, we adopt the conventional episodic training protocol, where training is composed of multiple episodes, each simulating a few-shot learning problem. To this end, we utilize a meta-training dataset $\mathcal{D}_\text{train}$ that contains labeled examples of diverse dense prediction tasks. Each training episode simulates a few-shot learning scenario of a specific task $\mathcal{T}_\text{train}$ in the dataset: the objective is to produce correct labels for query images given a support set. By experiencing multiple episodes of few-shot learning, the model is expected to acquire general knowledge for fast and flexible adaptation to novel tasks. At test time, the model is asked to perform few-shot learning on arbitrary unseen tasks $\mathcal{T}_\text{test}$ not included in the training dataset $\mathcal{D}_\text{train}$.

An immediate challenge in handling arbitrary tasks of the form in Eq. 1 is that each task in meta-training and testing has a different output structure (i.e., the output dimension $C_\mathcal{T}$ varies per task), making it difficult to design a single, unified parameterization of a model for all tasks. As a simple yet general solution, we cast a task $\mathcal{T}: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times C_\mathcal{T}}$ into $C_\mathcal{T}$ single-channel sub-tasks $\mathcal{T}^1, \cdots, \mathcal{T}^{C_\mathcal{T}}$ of learning each channel, and model each sub-task $\mathcal{T}^c: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^{H \times W \times 1}$ independently using the shared model $\mathcal{F}$ in Eq. 2. Although multi-channel information is beneficial in general, we observe that its impact is negligible in practice, while the channel-wise decomposition brings other useful benefits: it augments the number of tasks in meta-training, allows generalization to unseen tasks of arbitrary output dimension, and enables more efficient parameter-sharing within and across tasks. Without loss of generality, the rest of the paper assumes that every task has a single-channel label.
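To make the channel-wise decomposition and the episodic protocol concrete, the following is a small illustrative sketch of splitting a $C_\mathcal{T}$-channel task into single-channel sub-tasks and sampling a support/query split for one episode. The function names and the 10-shot default are our own assumptions for illustration, not the authors' code.

```python
import torch

def decompose_task(images, labels):
    """Cast a C_T-channel dense prediction task into C_T single-channel sub-tasks.

    images: (B, 3, H, W)   input images of one task
    labels: (B, C_T, H, W) pixel-wise labels with C_T channels
    Returns a list of (images, single_channel_labels) pairs, one per sub-task.
    """
    C_T = labels.shape[1]
    # Each label channel becomes an independent sub-task sharing the same input images.
    return [(images, labels[:, c:c + 1]) for c in range(C_T)]

def sample_episode(images, labels, shot=10):
    """Split one sub-task's data into a support set and a query set for an episode."""
    perm = torch.randperm(images.shape[0])
    support_idx, query_idx = perm[:shot], perm[shot:]
    support = (images[support_idx], labels[support_idx])
    query = (images[query_idx], labels[query_idx])
    return support, query
```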

CODE AVAILABILITY

Code is available at https://github.com/GitGyun/visual_token_matching.

