TASKPROMPTER: SPATIAL-CHANNEL MULTI-TASK PROMPTING FOR DENSE SCENE UNDERSTANDING

Abstract

Learning effective representations for multiple tasks simultaneously within a unified network is a fundamental paradigm for multi-task dense visual scene understanding. This requires jointly modeling (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task representation interactions. Existing works typically model these three perspectives with separately designed structures: shared network modules for task-generic learning, distinct modules for task-specific learning, and connections among these components for cross-task interactions. Modeling all three perspectives within each network layer in an end-to-end manner remains barely explored in the literature, yet doing so can not only minimize the effort of carefully designing empirical structures for the three multi-task representation learning objectives, but also greatly improve the representation learning capability of the multi-task network, since the entire model capacity is used to optimize the three objectives jointly. In this paper, we propose TaskPrompter, a novel spatial-channel multi-task prompting transformer framework to achieve this goal. Specifically, we design a set of spatial-channel task prompts and learn their spatial- and channel-wise interactions with the shared image tokens in each transformer layer via the attention mechanism, as aggregating spatial and channel information is critical for dense prediction tasks. Each task prompt learns the task-specific representation for one task, while all prompts jointly contribute to the learning of the shared image token representations, and the interactions among different task prompts model cross-task relationships. To decode dense predictions for multiple tasks from the learned spatial-channel task prompts, we accordingly design a dense task prompt decoding mechanism, which queries the shared image tokens using the task prompts to obtain spatial- and channel-wise task-specific representations.
Extensive experiments on two challenging multi-task dense scene understanding benchmarks (i.e., NYUD-V2 and PASCAL-Context) show the superiority of the proposed framework, and TaskPrompter establishes new state-of-the-art performance on multi-task dense prediction. Code and models are publicly available at https://github.com/prismformore/Multi-Task-Transformer.

1. INTRODUCTION

Dense visual scene understanding is a fundamental research topic in computer vision that involves many dense prediction tasks, including semantic segmentation, depth estimation, surface normal estimation, and boundary detection. These distinct tasks share a common understanding of the scene, which motivates researchers to design learning systems that model and predict multiple tasks in a unified framework, an approach called "multi-task learning" (MTL). MTL has two main strengths: on one hand, learning a unified multi-task model for multiple tasks is typically more parameter-efficient than training several single-task models; on the other hand, different tasks can facilitate each other under a good MTL design (Vandenhende et al., 2021). With the powerful boost of deep learning, researchers have successfully designed highly promising multi-task learning models by exploiting the commonality and individuality of the tasks (Maninis et al., 2019; Xu et al., 2018; Kendall et al., 2018; Kokkinos, 2017). Traditionally, researchers manually design different types of modules in the multi-task network architecture to learn information useful for multi-task prediction along three aspects: task-generic representations, task-specific representations, and cross-task interactions. For instance, earlier works (Liu et al., 2019; Gao et al., 2019; Misra et al., 2016) design dedicated modules to learn task-specific representations and embed cross-task information interactions through hand-crafted structures deployed in the encoder, while several recent works (Ye & Xu, 2022; Li et al., 2022b; Vandenhende et al., 2020) develop task-specific and cross-task modules in the decoder and share the encoder among different tasks.
However, all of these methods decouple the learning of task-generic representations, task-specific representations, and cross-task interactions into different network modules, which not only makes the architecture design more challenging, as each module needs to be configured with a specific structure and capacity, but is also suboptimal, since learning effective communication among these three important perspectives of information is critical for multi-task dense prediction. To tackle this issue, we believe a better MTL framework should be capable of learning task-generic and task-specific representations as well as their interactions jointly in each layer across the whole network architecture. In this paper, we achieve this goal by proposing a novel Spatial-Channel Multi-task Prompting framework, coined TaskPrompter. The core idea of TaskPrompter is to design "spatial-channel task prompts", which are task-specific learnable tokens that learn spatial- and channel-wise task-specific information for each task. More specifically, the task prompts are embedded together with the task-generic patch tokens computed from the input image as the input of a transformer with a specially designed Spatial-Channel Task Prompt Learning module. The task prompts and patch tokens interact with each other and refine themselves by means of the attention mechanism in each transformer layer. In this way, TaskPrompter learns task-generic and task-specific representations as well as cross-task interactions simultaneously and does not require designing different types of network modules. Given the learned spatial-channel task prompts and image patch tokens, how to effectively decode multi-task dense features and predictions from them is a non-trivial problem.
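The joint prompt–token interaction described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function `joint_prompt_attention` and its unprojected single-head form are simplifying assumptions made purely for illustration. The key point it demonstrates is that, once the learnable task prompts are concatenated with the patch tokens, a single self-attention operation simultaneously models task-specific learning (prompt rows attending to patches), task-generic learning (patch rows attending to each other and to prompts), and cross-task interaction (prompt rows attending to other prompts).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_prompt_attention(patch_tokens, task_prompts, Wq, Wk, Wv):
    """One self-attention layer over the concatenation of task prompts and
    image patch tokens (illustrative sketch, single head, no residuals).

    patch_tokens: (N, D) task-generic tokens from the input image
    task_prompts: (T, D) learnable task-specific tokens, one per task
    Wq, Wk, Wv:   (D, D) query/key/value projections
    """
    x = np.concatenate([task_prompts, patch_tokens], axis=0)   # (T+N, D)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # The (T+N, T+N) affinity matrix couples all three perspectives:
    # prompt-prompt (cross-task), prompt-patch (task-specific),
    # and patch-patch (task-generic) interactions in one operation.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    T = task_prompts.shape[0]
    return out[:T], out[T:], attn  # refined prompts, refined patches, affinities
```

Stacking such layers refines prompts and patch tokens together, which is why no separate task-specific or cross-task modules are needed.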
To meet this challenge, we further propose a novel Dense Spatial-Channel Task Prompt Decoding method, which leverages both the spatial-wise and channel-wise affinities calculated between the task prompts and the patch tokens in the attention modules to extract dense task features. These features are further refined by the cross-task affinity obtained from the self-attention weights among task prompts, and the final multi-task dense predictions are produced from them. In summary, the contributions of this work are threefold:
• We propose a novel Spatial-Channel Multi-task Prompting framework (TaskPrompter) for multi-task dense scene understanding. Our method combines the learning of task-generic and task-specific representations, as well as cross-task interactions, in each layer across the whole network architecture by introducing task prompts into our transformer.
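The decoding idea described earlier, that prompt–patch affinities carry spatial task information while the prompts themselves can gate feature channels, can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's decoder: `decode_task_features`, the single-vector affinity, and the additive fusion of the two branches are all hypothetical choices made to keep the example minimal.

```python
import numpy as np

def decode_task_features(patch_tokens, spatial_affinity, channel_prompt):
    """Illustrative sketch of decoding dense features for ONE task.

    patch_tokens:     (N, D) shared image tokens from the transformer
    spatial_affinity: (N,)   attention of this task's prompt over the
                             N patches, highlighting task-relevant locations
    channel_prompt:   (D,)   channel-wise gate derived from the task prompt
    """
    # Spatial branch: re-weight each patch token by how strongly the
    # task prompt attends to that spatial location.
    spatial_feat = spatial_affinity[:, None] * patch_tokens    # (N, D)
    # Channel branch: re-weight feature channels with the task prompt,
    # selecting the channels most relevant to this task.
    channel_feat = patch_tokens * channel_prompt[None, :]      # (N, D)
    # Fuse the two views into a dense task-specific feature map
    # (additive fusion is an assumption for this sketch).
    return spatial_feat + channel_feat
```

A per-task prediction head (e.g., a linear layer plus upsampling) would then map these (N, D) task features to the dense output for that task.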



Figure 1: Illustration of our Spatial-Channel Multi-task Prompting framework (TaskPrompter). TaskPrompter unifies the learning of task-specific and task-generic representations as well as cross-task interactions in each layer throughout the whole transformer architecture, via the embedding of task prompts and patch tokens. The task prompts are projected to spatial task prompts and channel task prompts to learn spatial- and channel-wise interactions, which are critical for dense prediction. The spatial and channel task prompts, together with the patch tokens, are further used in the proposed Dense Spatial-Channel Task Prompt Decoding module to produce dense task-specific features and the final multi-task predictions.

