TASKPROMPTER: SPATIAL-CHANNEL MULTI-TASK PROMPTING FOR DENSE SCENE UNDERSTANDING

Abstract

Learning effective representations simultaneously from multiple tasks in a unified network framework is a fundamental paradigm for multi-task dense visual scene understanding. This requires jointly modeling (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task representation interactions. Existing works typically model these three perspectives with separately designed structures, using shared network modules for task-generic learning, different modules for task-specific learning, and connections among these components for cross-task interactions. Modeling all three perspectives in each network layer in an end-to-end manner has barely been explored in the literature. Doing so can not only minimize the effort of carefully designing empirical structures for the three multi-task representation learning objectives, but also greatly improve the representation learning capability of the multi-task network, since all the model capacity is used to optimize the three objectives together. In this paper, we propose TaskPrompter, a novel spatial-channel multi-task prompting transformer framework to achieve this target. Specifically, we design a set of spatial-channel task prompts and learn their spatial- and channel-wise interactions with the shared image tokens in each transformer layer with an attention mechanism, as aggregating spatial and channel information is critical for dense prediction tasks. Each task prompt learns a task-specific representation for one task, while all the prompts jointly contribute to the learning of the shared image token representations, and the interactions between different task prompts model the cross-task relationship. To decode dense predictions for multiple tasks with the learned spatial-channel task prompts from the transformer, we accordingly design a dense task prompt decoding mechanism, which queries the shared image tokens using task prompts to obtain spatial- and channel-wise task-specific representations. Extensive experiments on two challenging multi-task dense scene understanding benchmarks (i.e. NYUD-v2 and PASCAL-Context) show the superiority of the proposed framework, and TaskPrompter establishes new state-of-the-art performance on multi-task dense prediction. Code and models are publicly available at https://github.com/prismformore/Multi-Task-Transformer.

1. INTRODUCTION

Dense visual scene understanding is a fundamental research topic in computer vision that involves many dense prediction tasks, including semantic segmentation, depth estimation, surface normal estimation, and boundary detection. These distinct tasks share a fundamental understanding of the scene, which motivates researchers to design learning systems that model and predict multiple tasks in a unified framework, an approach called "multi-task learning" (MTL). MTL has two main strengths: on the one hand, learning a unified multi-task model for multiple tasks is typically more parameter-efficient than training several single-task models; on the other hand, different tasks can facilitate each other with a good design in MTL (Vandenhende et al., 2021). With the powerful boost of deep learning, researchers have successfully designed highly promising multi-task learning models by exploiting the commonality and individuality of the tasks (Mani-
However, all of these methods decouple the learning of task-generic representations, task-specific representations, and cross-task interactions into different network modules. This not only makes the architecture design more challenging, as each module needs to be configured with a specific structure and capacity, but is also suboptimal, as learning effective communication among these three important perspectives of information is critical for multi-task dense prediction.

To tackle the above-mentioned issue, we believe a better MTL framework should be capable of learning task-generic and task-specific representations as well as their interactions jointly in each layer across the whole network architecture. In this paper, we achieve this goal by proposing a novel Spatial-Channel Multi-task Prompting framework, coined TaskPrompter. The core idea of TaskPrompter is to design "spatial-channel task prompts", which are task-specific learnable tokens that learn spatial- and channel-wise task-specific information for each task. More specifically, the task prompts are embedded together with the task-generic patch tokens computed from the input image as the input of a transformer with a specially designed Spatial-Channel Task Prompt Learning module. The task prompts and patch tokens interact with each other and refine themselves by means of the attention mechanism in each transformer layer. In this way, TaskPrompter manages to learn task-generic and task-specific representations as well as cross-task interactions simultaneously, and does not require the design of different types of network modules.

With the learned spatial-channel task prompts and image patch tokens, effectively decoding multi-task dense features and predictions from them is a non-trivial problem. To meet this challenge, we further propose a novel Dense Spatial-Channel Task Prompt Decoding method, which leverages both the spatial-wise and channel-wise affinities calculated between the task prompts and the patch tokens in attention modules to extract dense task features. The features are further refined by the cross-task affinity obtained from the self-attention weights among task prompts. The final multi-task dense predictions are produced based on the dense task features.

In summary, the contribution of this work consists of three parts:

• We propose a novel Spatial-Channel Multi-task Prompting framework (TaskPrompter) for multi-task dense scene understanding. Our method essentially combines the learning of task-generic and task-specific representations, as well as cross-task interactions, in each layer across the whole network architecture by introducing task prompts in our transformer.
• A Spatial-Channel Task Prompt Learning module is designed. It can be flexibly deployed in each transformer layer for learning and refining task prompts and patch tokens along both spatial and channel dimensions.
• We further design a novel Dense Spatial-Channel Task Prompt Decoding method based on the learned task-specific task prompts and task-generic patch tokens to generate pixel-wise predictions for multiple tasks simultaneously.

Extensive experiments on two challenging multi-task dense prediction benchmarks (i.e. PASCAL-Context and NYUD-v2) clearly verify the effectiveness of the proposed method, which demonstrates superior performance compared with the previous state-of-the-art methods.

3. TASKPROMPTER: SPATIAL-CHANNEL MULTI-TASK PROMPTING

The proposed TaskPrompter framework (see Fig. 1) can be divided into three parts, i.e. Prompt Embedding, Spatial-Channel Task Prompt Learning, and Dense Spatial-Channel Task Prompt Decoding. We now introduce the details of these components one by one in this section.

3.1. PROMPT EMBEDDING

We adopt a classic transformer pipeline to embed an input image into a sequence of patch tokens (Dosovitskiy et al., 2021). It should be noted that our method is independent of the choice of transformer architecture. The input image is first processed by a patch embedding layer, which is a convolutional layer with feature resolution downsampling. Suppose the output feature map of the patch embedding layer has a shape $(H, W, C)$, where $H$ and $W$ are the height and width, and $C$ is the number of channels of the feature map. The feature map is first reshaped into $N = H \times W$ patch tokens in a $C$-dimensional latent space, to which positional encodings are then added. To enable multi-task prompting, we propose to embed $T$ learnable task-specific tokens as task prompts in the same $C$-dimensional latent space as that of the patch tokens, where $T$ is the number of tasks in multi-task learning. Each task prompt corresponds to one task. Next, the task prompts are concatenated with the patch tokens to form a token sequence matrix $\mathbf{Z}_0$:

$$\mathbf{Z}_0 = [\mathbf{p}_1; \mathbf{p}_2; \dots; \mathbf{p}_T; \mathbf{x}_1; \mathbf{x}_2; \dots; \mathbf{x}_N] \in \mathbb{R}^{(T+N) \times C}, \quad N = H \times W,$$

where $[\cdot]$ indicates concatenation, $\{\mathbf{p}_i\}_{i=1}^{T}$ denotes the sequence of task prompts, and $\{\mathbf{x}_i\}_{i=1}^{N}$ denotes the sequence of patch tokens. In this way, the input image and task prompts are encoded and aggregated into a joint token sequence, which is refined through the proposed spatial-channel task prompt learning process.
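The construction of the joint token sequence can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `embed_with_task_prompts` is hypothetical, random arrays stand in for the patch-embedding output, the learnable prompts, and the positional encodings, and the positional encodings are assumed to be added to the patch tokens only, as the text describes.

```python
import numpy as np

def embed_with_task_prompts(feature_map, task_prompts, pos_encoding):
    """Build Z0 = [p_1; ...; p_T; x_1; ...; x_N] from an (H, W, C) feature map."""
    H, W, C = feature_map.shape
    # Flatten the spatial grid into N = H * W patch tokens of dimension C
    # and add the positional encodings to the patch tokens.
    patch_tokens = feature_map.reshape(H * W, C) + pos_encoding
    # Prepend the T task prompts (one per task) to the patch tokens.
    return np.concatenate([task_prompts, patch_tokens], axis=0)

rng = np.random.default_rng(0)
H, W, C, T = 4, 6, 8, 3
feature_map = rng.standard_normal((H, W, C))    # stand-in for the patch-embedding output
task_prompts = rng.standard_normal((T, C))      # stand-in for the learnable task prompts
pos_encoding = rng.standard_normal((H * W, C))  # stand-in for positional encodings
Z0 = embed_with_task_prompts(feature_map, task_prompts, pos_encoding)
print(Z0.shape)  # (T + N, C) = (27, 8)
```

The first $T$ rows of the result are the task prompts and the remaining $N$ rows are the patch tokens, so later stages can recover either group by slicing.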

3.2. SPATIAL-CHANNEL TASK PROMPT LEARNING

We design a spatial-channel task prompt learning module for concurrently learning and refining task prompts and patch tokens along both spatial and channel feature dimensions, as spatial and channel information are both critical for various dense prediction tasks (Fu et al., 2019). The module details are depicted in Fig. 2. The goal of our design is to learn task-generic and task-specific visual representations, as well as cross-task interactions, simultaneously, by taking full advantage of the structure of the transformer with the embedded task prompts. Specifically, for each transformer layer, we adopt the basic layer structure of the classic ViT (Dosovitskiy et al., 2021), including MLPs, layer normalization layers (LayerNorm), and skip connections, while the core multi-head attention module is replaced by the proposed Spatial-Channel Task Prompt Learning module, in which we design two types of task prompt learning paradigms: (i) Spatial Task Prompt Learning and (ii) Channel Task Prompt Learning.

Without loss of generality, we use the first transformer layer for illustration. After the first LayerNorm, we obtain a token sequence $\mathbf{Z} = \mathrm{LayerNorm}(\mathbf{Z}_0) \in \mathbb{R}^{(T+N) \times C}$, which consists of task prompts $\mathbf{P} \in \mathbb{R}^{T \times C}$ and patch tokens $\mathbf{X} \in \mathbb{R}^{N \times C}$. Then, $\mathbf{P}$ is projected into spatial task prompts $\mathbf{P}^s$ and channel task prompts $\mathbf{P}^c$, where the superscript $s$ denotes 'spatial' and $c$ denotes 'channel'. The projection can be formulated as:

$$\mathbf{P}^s = f_{C \to C}(\mathbf{P}) \in \mathbb{R}^{T \times C}, \quad \mathbf{P}^c = f_{C \to N}(\mathbf{P}) \in \mathbb{R}^{T \times N},$$

where $f_{C \to C}$ denotes an identity mapping with the feature dimension unchanged, and $f_{C \to N}: \mathbb{R}^{T \times C} \to \mathbb{R}^{T \times N}$ is a 2-layer MLP that projects the input task prompts $\mathbf{P}$ from the $C$-dimensional latent space to an $N$-dimensional latent space ($N = H \times W$). In this way, each channel task prompt aligns its feature dimension to the number of patch tokens for performing channel interactions.
Then, $\mathbf{P}^s$ and $\mathbf{X}$ are fed into the proposed spatial task prompt learning module, and concurrently $\mathbf{P}^c$ and $\mathbf{X}$ are fed into the channel task prompt learning module.

Spatial Task Prompt Learning

This module simultaneously learns spatial task prompts $\mathbf{P}^s$ and patch tokens $\mathbf{X}$, where each spatial task prompt interacts with patch tokens along the spatial dimension to model the spatial-wise relationships between task prompts and patch tokens. The spatial task prompts $\mathbf{P}^s$ and patch tokens $\mathbf{X}$ are stacked as a token sequence $[\mathbf{P}^s; \mathbf{X}] \in \mathbb{R}^{(T+N) \times C}$, which is linearly projected by learnable parameters $\mathbf{W}^s_q, \mathbf{W}^s_k, \mathbf{W}^s_v \in \mathbb{R}^{C \times C}$ to respectively produce the query $\mathbf{Q}^s$, the key $\mathbf{K}^s$, and the value $\mathbf{V}^s$ as follows:

$$\mathbf{Q}^s = [\mathbf{P}^s; \mathbf{X}] \times \mathbf{W}^s_q, \quad \mathbf{K}^s = [\mathbf{P}^s; \mathbf{X}] \times \mathbf{W}^s_k, \quad \mathbf{V}^s = [\mathbf{P}^s; \mathbf{X}] \times \mathbf{W}^s_v.$$

As a standard procedure, before we compute the multi-head self-attention (MSA), we partition $\mathbf{Q}^s$, $\mathbf{K}^s$, $\mathbf{V}^s$ evenly along the last dimension into different groups as the input of different heads. Suppose that we utilize $N^s_{head}$ heads in MSA. After head partition we have $\mathbf{Q}^s, \mathbf{K}^s, \mathbf{V}^s \in \mathbb{R}^{N^s_{head} \times (T+N) \times \frac{C}{N^s_{head}}}$, and the spatial self-attention map is computed by $\mathbf{A}^s = \mathbf{Q}^s \times \mathbf{K}^{s\top}$, where the symbol '$\top$' denotes a transposing operation that transposes the last two dimensions of a tensor. $\mathbf{A}^s$ is then scaled and normalized by a softmax function, and multiplied by $\mathbf{V}^s$ to obtain a new token sequence. The token sequence merges different heads and is projected by a linear layer as in a standard MSA, which provides an output token sequence consisting of updated spatial task prompts $\mathbf{P}^{s\prime}$ and patch tokens $\mathbf{X}^{\prime}$.

Channel Task Prompt Learning

This module is proposed to model the channel-wise relationships between the channel task prompts $\mathbf{P}^c$ and the patch tokens $\mathbf{X}$ along the channel dimension. We perform cross-attention to model the channel-wise relationships.
The channel task prompts $\mathbf{P}^c$ are projected by a learnable parameter $\mathbf{W}^c_q \in \mathbb{R}^{N \times N}$ to produce the query $\mathbf{Q}^c \in \mathbb{R}^{T \times N}$, and the patch tokens $\mathbf{X}$ are separately projected by two learnable parameter matrices $\mathbf{W}^c_k \in \mathbb{R}^{C \times C}$ and $\mathbf{W}^c_v \in \mathbb{R}^{C \times C}$, and transposed to produce the key $\mathbf{K}^c$ and the value $\mathbf{V}^c$:

$$\mathbf{Q}^c = \mathbf{P}^c \times \mathbf{W}^c_q, \quad \mathbf{K}^c = (\mathbf{X} \times \mathbf{W}^c_k)^{\top}, \quad \mathbf{V}^c = (\mathbf{X} \times \mathbf{W}^c_v)^{\top}.$$

Before computing the multi-head cross-attention, we need to partition $\mathbf{Q}^c$, $\mathbf{K}^c$, $\mathbf{V}^c$ along the last dimension into different groups as the input of different heads in multi-head channel-wise cross-attention. Suppose the number of heads used in Channel Task Prompt Learning is $N^c_{head}$. The most straightforward way is to evenly partition the matrices along their last dimension into $N^c_{head}$ groups. However, as $\mathbf{K}^c$ and $\mathbf{V}^c$ are computed from patch tokens $\mathbf{X}$, their last dimension contains the spatial relationship of pixels, and a standard partition disrupts this spatial relationship, which is critical for learning features for dense predictions. A more reasonable strategy is to reorganize $\mathbf{Q}^c$, $\mathbf{K}^c$, $\mathbf{V}^c$ based on spatial adjacency. Specifically, we first reshape the last dimension of these matrices to a spatial shape, e.g. $\mathbb{R}^{C \times H \times W}$ for the key and value since $H \times W = N$, and then partition the spatial planes formed by the last 2 dimensions evenly into $N^c_{head}$ local windows, as shown in Fig. 6. Notably, $N^c_{head}$ needs to be properly set so that the number of windows along the height dimension $N^h_{win}$ and along the width dimension $N^w_{win}$ satisfies $N^h_{win} \times N^w_{win} = N^c_{head}$. This process is the proposed "window partition", which maintains the spatial relationship for multi-head channel cross-attention calculation. After the window partition, we have $\mathbf{Q}^c \in \mathbb{R}^{N^c_{head} \times T \times N/N^c_{head}}$ and $\mathbf{K}^c, \mathbf{V}^c \in \mathbb{R}^{N^c_{head} \times C \times N/N^c_{head}}$. Then, channel attention maps are calculated by $\mathbf{A}^c = \mathbf{Q}^c \times \mathbf{K}^{c\top}$. $\mathbf{A}^c$ is scaled and normalized by a softmax function and multiplied by $\mathbf{V}^c$, and the result is processed by a linear layer to obtain the updated channel task prompts $\mathbf{P}^{c\prime}$.
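The window partition and the channel cross-attention can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions rather than the authors' code: the helper names (`window_partition`, `window_merge`, `channel_prompt_attention`) are hypothetical, the scaling factor is assumed to be the inverse square root of the per-head dimension, the 2-layer MLP $f_{C \to N}$ is replaced by a random stand-in for $\mathbf{P}^c$, and the final linear layer is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, H, W, n_h, n_w):
    """(D, H*W) -> (n_h*n_w, D, h_win*w_win): one local window per head."""
    D = x.shape[0]
    h, w = H // n_h, W // n_w
    x = x.reshape(D, n_h, h, n_w, w).transpose(1, 3, 0, 2, 4)
    return x.reshape(n_h * n_w, D, h * w)

def window_merge(x, H, W, n_h, n_w):
    """Inverse of window_partition: (n_h*n_w, D, h_win*w_win) -> (D, H*W)."""
    D = x.shape[1]
    h, w = H // n_h, W // n_w
    x = x.reshape(n_h, n_w, D, h, w).transpose(2, 0, 3, 1, 4)
    return x.reshape(D, H * W)

def channel_prompt_attention(P_c, X, Wq, Wk, Wv, H, W, n_h, n_w):
    Q = window_partition(P_c @ Wq, H, W, n_h, n_w)       # (heads, T, N/heads)
    K = window_partition((X @ Wk).T, H, W, n_h, n_w)     # (heads, C, N/heads)
    V = window_partition((X @ Wv).T, H, W, n_h, n_w)     # (heads, C, N/heads)
    scale = Q.shape[-1] ** -0.5                          # assumed scaling factor
    A = softmax(Q @ K.transpose(0, 2, 1) * scale, axis=-1)  # (heads, T, C)
    out = A @ V                                          # (heads, T, N/heads)
    return window_merge(out, H, W, n_h, n_w), A          # (T, N), (heads, T, C)

rng = np.random.default_rng(0)
H, W, C, T, n_h, n_w = 4, 4, 6, 3, 2, 2
N = H * W
P_c = rng.standard_normal((T, N))        # stand-in for f_{C->N}(P)
X = rng.standard_normal((N, C))          # patch tokens
Wq = rng.standard_normal((N, N)) * 0.1
Wk = rng.standard_normal((C, C)) * 0.1
Wv = rng.standard_normal((C, C)) * 0.1
P_c_new, A_c = channel_prompt_attention(P_c, X, Wq, Wk, Wv, H, W, n_h, n_w)
print(P_c_new.shape, A_c.shape)  # (3, 16) (4, 3, 6)
```

Each head sees one spatially contiguous window of $h_{win} \times w_{win}$ positions, so the attention weights in `A_c` relate every task prompt to every channel separately for each window.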
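To update the overall task prompts $\mathbf{P}$ with information learned from both the spatial and channel task prompts, as shown in Fig. 2, $\mathbf{P}^{c\prime}$ is projected back to the $C$-dimensional latent space with a 2-layer MLP $f_{N \to C}: \mathbb{R}^{T \times N} \to \mathbb{R}^{T \times C}$, and then added to $\mathbf{P}^{s\prime}$ to obtain the combined task prompts $\mathbf{P}^{\prime}$ as an update of $\mathbf{P}$:

$$\mathbf{P}^{\prime} = \mathbf{P}^{s\prime} + f_{N \to C}(\mathbf{P}^{c\prime}). \tag{5}$$

$\mathbf{P}^{\prime}$ and $\mathbf{X}^{\prime}$ are refined by the typical LayerNorm and MLP, and then stacked as a new token sequence $\mathbf{Z}^{\prime}$, which is fed into the next transformer layer following the same procedure for further learning.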
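For illustration, the spatial branch and the prompt update of Eq. (5) can be sketched as below. This is a minimal NumPy sketch, not the released implementation: the function names are hypothetical, the output projection of the MSA is omitted, the scaling is assumed to be $1/\sqrt{C/N^s_{head}}$, the returned attention map is taken after the softmax, and the 2-layer MLP $f_{N \to C}$ is assumed to use a ReLU activation without biases.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_prompt_attention(P_s, X, Wq, Wk, Wv, n_heads):
    """Self-attention over the stacked sequence [P^s; X]."""
    T = P_s.shape[0]
    Z = np.concatenate([P_s, X], axis=0)                  # (T+N, C)
    L, C = Z.shape
    d = C // n_heads

    def split(M):                                         # (L, C) -> (heads, L, d)
        return M.reshape(L, n_heads, d).transpose(1, 0, 2)

    Q, K, V = split(Z @ Wq), split(Z @ Wk), split(Z @ Wv)
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d), axis=-1)  # (heads, L, L)
    out = (A @ V).transpose(1, 0, 2).reshape(L, C)        # merge heads
    return out[:T], out[T:], A                            # P^s', X', A^s

def update_prompts(P_s_new, P_c_new, W1, W2):
    """Eq. (5): P' = P^s' + f_{N->C}(P^c'), with an assumed ReLU MLP."""
    hidden = np.maximum(P_c_new @ W1, 0.0)
    return P_s_new + hidden @ W2

rng = np.random.default_rng(0)
T, N, C, n_heads = 2, 6, 8, 2
P_s = rng.standard_normal((T, C))
X = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
P_s_new, X_new, A_s = spatial_prompt_attention(P_s, X, Wq, Wk, Wv, n_heads)

P_c_new = rng.standard_normal((T, N))     # stand-in for the channel-branch output
W1 = rng.standard_normal((N, 16)) * 0.1   # hidden width 16 is an arbitrary choice
W2 = rng.standard_normal((16, C)) * 0.1
P_new = update_prompts(P_s_new, P_c_new, W1, W2)
print(A_s.shape, P_new.shape)  # (2, 8, 8) (2, 8)
```

Because prompts and patch tokens share one attention map, the same tensor `A_s` contains the prompt-to-patch block and the prompt-to-prompt block used later in decoding.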

3.3. DENSE SPATIAL-CHANNEL TASK PROMPT DECODING

To decode multiple dense predictions for distinct tasks from the task-specific task prompts and task-generic patch tokens, we need to design an effective decoding method for TaskPrompter. Since the task prompts, including the spatial and channel task prompts, are task-discriminative, the affinities calculated between the spatial/channel task prompts and the shared patch tokens are also distinct. The different task prompts localize different spatial regions or channels on the patch tokens. This is also confirmed by our visualization of the learned spatial and channel affinities, as shown in Fig. 4. Based on the learned spatial and channel task prompts, we propose a Dense Spatial-Channel Task Prompt Decoding method, which consists of Spatial Task Prompting and Channel Task Prompting strategies that respectively compute spatial-wise and channel-wise task-specific features for the final multi-task predictions, as shown in Fig. 3.

Spatial Task Prompting

Each spatial task prompt corresponds to a spatial affinity map, obtained by computing the affinity between the spatial task prompt and all the patch tokens. We call this spatial affinity map the Spatial-Task-Prompt Affinity, as shown in Fig. 3. It can be extracted directly from the spatial attention maps learned in the task-prompt learning stage, i.e. $\mathbf{A}^s \in \mathbb{R}^{N^s_{head} \times (T+N) \times (T+N)}$, with $N = H \times W$ as the number of patch tokens and $T$ as the number of tasks. For the spatial task prompt of task $t$, we extract an attention tensor in space $\mathbb{R}^{N^s_{head} \times N}$ from $\mathbf{A}^s$, and then reshape it into a new tensor in space $\mathbb{R}^{N^s_{head} \times 1 \times H \times W}$, which is the Spatial-Task-Prompt Affinity, denoted as $\mathbf{A}^{p \to s}_t$. On the other hand, given an updated token sequence $\mathbf{X}^{\prime} \in \mathbb{R}^{N \times C}$ produced by a transformer layer, we transpose and reshape $\mathbf{X}^{\prime}$ into space $\mathbb{R}^{N^s_{head} \times C/N^s_{head} \times H \times W}$, and denote it as $\mathbf{X}^{\prime s}$.
The spatial-wise task-specific features $\mathbf{F}^s_t$ for task $t$ can be decoded by:

$$\mathbf{F}^s_t = f_{sr}(\mathbf{A}^{p \to s}_t \odot \mathbf{X}^{\prime s}),$$

where the symbol $\odot$ indicates the Hadamard product and $f_{sr}(\cdot)$ represents the operation of reshaping the tensor into space $\mathbb{R}^{C \times H \times W}$.

Channel Task Prompting

Each channel task prompt corresponds to a channel affinity vector, which can be computed by measuring the affinity between the channel task prompt and all the channels of the patch tokens. We name it the Channel-Task-Prompt Affinity, denoted as $\mathbf{A}^{p \to c}_t$ with $t$ indicating the task, which can be used to decode the channel-wise task-specific representation for task $t$. We obtain the Channel-Task-Prompt Affinity $\mathbf{A}^{p \to c}_t$ of task $t$ from $\mathbf{A}^c \in \mathbb{R}^{N^c_{head} \times T \times C}$ by taking one slice along the second dimension and reshaping it into space $\mathbb{R}^{C \times N^h_{win} \times 1 \times N^w_{win} \times 1}$. We also reshape $\mathbf{X}^{\prime}$ as $\mathbf{X}^{\prime c} \in \mathbb{R}^{C \times N^h_{win} \times h_{win} \times N^w_{win} \times w_{win}}$, where $h_{win} = H/N^h_{win}$ and $w_{win} = W/N^w_{win}$. Then, we can compute the channel-wise task-specific features $\mathbf{F}^c_t$ for task $t$ as follows:

$$\mathbf{F}^c_t = f_{cr}(\mathbf{A}^{p \to c}_t \odot \mathbf{X}^{\prime c}),$$

where $f_{cr}(\cdot)$ denotes the operation of reshaping the tensor into the space $\mathbb{R}^{C \times H \times W}$.
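The two prompting operations amount to broadcasting an affinity tensor over the reshaped patch tokens. The following NumPy sketch shows one way to realise them; it is an illustration only, with hypothetical function names, random stand-ins for $\mathbf{A}^s$, $\mathbf{A}^c$, and $\mathbf{X}^{\prime}$, and the assumptions that the prompt-to-patch affinity is the row of task $t$ restricted to the patch-token columns and that heads in the channel branch are ordered row-major over the windows.

```python
import numpy as np

def spatial_task_prompting(A_s, X_new, t, T, H, W):
    """F^s_t = f_sr(A^{p->s}_t * X'^s) with broadcasting over channels."""
    heads = A_s.shape[0]
    N, C = X_new.shape
    # Attention of prompt t (query) over the N patch tokens (keys).
    aff = A_s[:, t, T:].reshape(heads, 1, H, W)
    X_s = X_new.T.reshape(heads, C // heads, H, W)
    return (aff * X_s).reshape(C, H, W)

def channel_task_prompting(A_c, X_new, t, H, W, n_h, n_w):
    """F^c_t = f_cr(A^{p->c}_t * X'^c) with broadcasting inside each window."""
    N, C = X_new.shape
    h, w = H // n_h, W // n_w
    # (heads, C) -> (C, heads) -> (C, n_h, 1, n_w, 1): one weight per channel and window.
    aff = A_c[:, t, :].T.reshape(C, n_h, 1, n_w, 1)
    X_c = X_new.T.reshape(C, n_h, h, n_w, w)
    return (aff * X_c).reshape(C, H, W)

rng = np.random.default_rng(0)
T, H, W, C = 2, 4, 4, 6
N, heads_s, n_h, n_w = H * W, 3, 2, 2
A_s = rng.random((heads_s, T + N, T + N))   # stand-in spatial attention map
A_c = rng.random((n_h * n_w, T, C))         # stand-in channel attention map
X_new = rng.standard_normal((N, C))         # stand-in updated patch tokens
F_s = spatial_task_prompting(A_s, X_new, 1, T, H, W)
F_c = channel_task_prompting(A_c, X_new, 1, H, W, n_h, n_w)
print(F_s.shape, F_c.shape)  # (6, 4, 4) (6, 4, 4)
```

In the spatial branch every channel group of a head is weighted by the same per-position map, whereas in the channel branch every position inside a window is weighted by the same per-channel value.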

Spatial-Channel Fusion

To fuse the task-specific features $\mathbf{F}^s_t$ and $\mathbf{F}^c_t$ from the spatial and channel task prompting, we concatenate them along the channel dimension and halve the channel number using a $3 \times 3$ convolution ($\mathrm{CONV}_{3 \times 3}$) with batch normalization (BN) and GELU to obtain a fused task-specific feature $\mathbf{F}_t$ for task $t$ as follows:

$$\mathbf{F}_t = \mathrm{GELU} \circ \mathrm{BN} \circ \mathrm{CONV}_{3 \times 3}([\mathbf{F}^s_t; \mathbf{F}^c_t]).$$

Then, we stack the prompted task-specific features of all the $T$ tasks along the first dimension and obtain an overall task feature map $\mathbf{F} \in \mathbb{R}^{T \times C \times H \times W}$.

Cross-task Reweighting

The spatial and channel task prompting decode the relationship between task prompts and patch tokens. They do not involve the cross-task relationship, although this relationship is modeled in the task-prompt learning stage in the encoder. To encourage cross-task information exchange in our decoding stage, we further put forward Cross-task Reweighting, as also shown in Fig. 3. First, we extract the affinity tensor among the $T$ task prompts from the attention map $\mathbf{A}^s$. The affinity tensor is in space $\mathbb{R}^{N^s_{head} \times T \times T}$. Then, we project the first dimension to 1 using a 2-layer MLP, and obtain the Cross-Task Affinity $\mathbf{A}^{p \to p}$. The prompted task features $\mathbf{F} \in \mathbb{R}^{T \times C \times H \times W}$ are updated by $\mathbf{F} \leftarrow \mathbf{A}^{p \to p} \times \mathbf{F}$. The prompted task features $\mathbf{F}$ contain features for all the $T$ tasks. They are split and separately fed into $T$ task-specific prediction heads for dense predictions. Each prediction head is composed of a simple $3 \times 3$ convolutional block with BN and GELU, and a linear projection layer.

Hierarchical Prompting

As discussed in a previous multi-task transformer for dense scene understanding (Ye & Xu, 2022), different levels of transformer features help improve the multi-task performance. Therefore, we deploy our prompt decoding method at multiple levels of the transformer, and name it "Hierarchical Prompting (HP)". Specifically, we conduct Dense Spatial-Channel Task Prompt Decoding at multiple levels of the transformer for prompting the task features, instead of only at the last layer. The multi-level task features are later fused into one by addition to obtain the final prompted task features, which are fed into the prediction heads as described above.
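Cross-task Reweighting and the additive fusion used in Hierarchical Prompting can be sketched in NumPy as follows. The sketch is illustrative only: the function names are hypothetical, the 2-layer MLP over the head dimension is assumed to use a ReLU activation without biases, and the product $\mathbf{A}^{p \to p} \times \mathbf{F}$ is interpreted as a matrix product over the task dimension.

```python
import numpy as np

def cross_task_reweight(A_s, F, T, W1, W2):
    """F <- A^{p->p} x F, where A^{p->p} is derived from the prompt-to-prompt attention."""
    # Prompt-to-prompt block of the spatial attention map: (heads, T, T).
    A_pp = A_s[:, :T, :T]
    # 2-layer MLP over the head dimension: heads -> hidden -> 1 (ReLU assumed).
    A_pp = np.moveaxis(A_pp, 0, -1)                    # (T, T, heads)
    A_pp = (np.maximum(A_pp @ W1, 0.0) @ W2)[..., 0]   # (T, T)
    # Each task feature becomes a weighted sum of the features of all tasks.
    return np.einsum("ts,schw->tchw", A_pp, F)

def hierarchical_fuse(features_per_level):
    """Fuse the prompted task features of several transformer levels by addition."""
    return np.sum(features_per_level, axis=0)

rng = np.random.default_rng(0)
T, heads, C, H, W = 3, 4, 2, 2, 2
N = H * W
A_s = rng.random((heads, T + N, T + N))   # stand-in spatial attention map
F = rng.standard_normal((T, C, H, W))     # stand-in prompted task features
W1 = rng.standard_normal((heads, 8))      # hidden width 8 is an arbitrary choice
W2 = rng.standard_normal((8, 1))
F_rw = cross_task_reweight(A_s, F, T, W1, W2)
F_hp = hierarchical_fuse([F_rw, F_rw])    # two levels, reusing the same features for brevity
print(F_rw.shape, F_hp.shape)  # (3, 2, 2, 2) (3, 2, 2, 2)
```

The reweighting is linear in the task features, so it only redistributes information across tasks and leaves the spatial layout of each feature map unchanged.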

Effectiveness of Spatial and Channel Task Prompt Learning and Decoding

We evaluate the proposed methods on the PASCAL-Context dataset and report the results in Table 1. Using Spatial Task Prompt Learning already brings a clear performance improvement on all the tasks, in particular a boost of 0.91, 0.78, and 2.50 points for Semseg, Parsing, and Boundary, respectively, compared against the baseline. By adding Channel Task Prompt Learning, the performance gains are further increased to 1.41, 1.48, and 3.10 points for the three tasks, respectively. These experimental results clearly demonstrate the effectiveness of the core design of TaskPrompter.

Effectiveness of Cross-Task Reweighting and Hierarchical Prompting

Furthermore, as shown in Table 1, cross-task reweighting improves the performance of most of the tasks, with a multi-task gain (i.e. $\Delta_m$) of 0.21 points. Hierarchical Prompting substantially improves the performance on all the tasks by deploying task-prompt decoding at multiple levels of the transformer. Hierarchical Prompting is designed to facilitate the decoding of task-specific features with the learned spatial-channel task prompts at each layer of the transformer encoder. With the embedding and learning of the global task prompts from the beginning of the transformer encoder, our model can naturally perform spatial-channel hierarchical prompting at different layers for decoding task-specific features, which is very beneficial for producing more effective multi-task representations.

Scaling TaskPrompter

We follow ViT (Dosovitskiy et al., 2021) and build the proposed Spatial-Channel Multi-task Prompting framework on a transformer with 24 layers, denoted as TaskPrompter-Large. We also denote the one with 12 layers as TaskPrompter-Base, and compare their performances on both the PASCAL-Context and NYUD-v2 datasets. The results are reported in Table 2.
We can observe that models with bigger capacity generally bring better performance for 2018), the performances of some tasks may be worse because of the multi-task competition issue.

Qualitative Visualization of Spatial-Channel Task Prompt Affinity

To investigate whether the task prompts learn task-specific affinity on patch tokens, we visualize the affinity values between task prompts and patch tokens in the Dense Spatial-Channel Task Prompt Decoding module, as shown in Fig. 4 and Fig. 8. We can clearly observe that the activated spatial-task-prompt affinity values are highly related to the particularity of each task, which indicates that the spatial task prompts can effectively encode task-specific representations, and attend to different semantic regions of patch tokens that are more beneficial for the prediction of a specific task when performing the decoding. On the other hand, we calculate the average of the channel-task-prompt affinity maps over all the test images of the PASCAL-Context dataset and randomly select 60 channels for visualization. We can observe that the different channel task prompts have distinct attention responses to different channels, which verifies that the channel-task-prompt affinity encodes task-specific relationships between channel task prompts and patch tokens along the channel dimension.

Study of Number of Heads in Channel Task Prompt Learning

As introduced in Section 3.2, we partition query and key tensors in Channel Task Prompt Learning into different groups as input of different heads in cross-attention. We report the performance comparison of using different numbers of heads in Fig. 5. We observe that using more heads brings better performance for most tasks.

Comparison with Previous SOTA on NYUD-v2 and PASCAL-Context

Table 3

5. CONCLUSION

We have presented a Spatial-Channel Multi-task Prompting (TaskPrompter) framework for simultaneously learning task-generic and task-specific representations as well as cross-task interaction in each layer throughout the whole transformer architecture. We first propose to learn task prompts to encode task-specific information, and design a dedicated module to learn the relationship between task prompts and patch tokens along both spatial and channel dimensions. Furthermore, we propose a novel spatial-channel task prompt decoding method to generate dense task-specific features for prediction. The effectiveness of our method is validated by both quantitative and qualitative experiments, showing superior performances on different task sets. The performances of our method clearly surpass the existing multi-task dense scene understanding models.

We investigate how the number of levels used by Hierarchical Prompting (HP) influences the overall performance of TaskPrompter. The experimental results with the varying number of levels are presented in Fig. 7. It can be observed that using only two levels in HP already largely boosts the performances on all five tasks, demonstrating the effectiveness of the hierarchical prompting scheme. Further increasing the number of levels improves some tasks (e.g., Semseg, Normal, and Boundary), while for others (e.g., Parsing and Saliency) the performance saturates with small variances.

1024×2048. For the evaluation of semantic segmentation, we use the more challenging 19-class labels. The models are evaluated on the validation set for all the tasks. 3D detection (3Ddet) uses mean detection score (mDS) as metric, which is the official evaluation index provided by Cityscapes-3D.
Implementation of TaskPrompter on Cityscapes-3D

Since the images in Cityscapes-3D have a larger spatial resolution and 3Ddet is sensitive to object sizes, we build our multi-task prompting method on the Swin-Base model (Liu et al., 2021b), which maintains high-resolution features in the backbone. Notably, as the transformer block in Swin-Base uses local window attention instead of global attention as in ViT, we clone the spatial task prompts and embed them into each local window to model the interaction with all patch tokens. After computing the window attention, the spatial task prompts of different windows are merged into one by taking their average. In the decoding stage, we stitch the Spatial-Task-Prompt affinity tensors from all windows into a global affinity tensor to query the patch tokens. The channel task prompt learning and decoding are not affected by the Swin-Base architecture, and can be performed in the same way as with the ViT backbone. The channel number of task prompts is doubled at each Patch Merging layer of Swin-Base to maintain the same channel number as the patch tokens. As in the experiments on PASCAL-Context and NYUD-v2, we use task-specific 3x3Conv-BN-ReLU blocks as prediction heads for generating final task features for all 3 tasks, and a linear layer for generating the final predictions for Semseg and Depth. As for 3Ddet, we adopt the final prediction heads of FCOS-3D (Wang et al., 2021) to predict the location coordinates, rotation angles, sizes, object classes, center-ness, and direction classification. To reduce computation cost, we reduce the resolution of images from 1024×2048 to 768×1536. The batch size is set to 2 and the model is trained for 40k iterations. We also use the Adam optimizer with a learning rate of $2 \times 10^{-5}$ and no weight decay. For the 3D detection task, Non-maximum Suppression is used with a threshold of 0.3.
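The cloning and averaging of spatial task prompts across local windows, as described in the implementation details above, can be sketched as follows. This NumPy sketch is a deliberately simplified illustration, not the Swin-Base implementation: the function name is hypothetical, a single attention head with identity query/key/value projections is assumed, shifted windows and relative position biases are ignored, and the per-window affinities are returned without being stitched into a global tensor.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_prompt_attention(P_s, X_windows):
    """P_s: (T, C) prompts; X_windows: (n_win, n_tok, C) patch tokens grouped by window."""
    n_win, n_tok, C = X_windows.shape
    T = P_s.shape[0]
    # Clone the spatial task prompts into every local window.
    P_clone = np.broadcast_to(P_s, (n_win, T, C))
    Z = np.concatenate([P_clone, X_windows], axis=1)       # (n_win, T + n_tok, C)
    # Window attention (single head, identity projections: a simplification).
    A = softmax(Z @ Z.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    out = A @ Z
    # Merge the prompts of different windows by averaging.
    P_new = out[:, :T].mean(axis=0)                        # (T, C)
    X_new = out[:, T:]                                     # (n_win, n_tok, C)
    # Per-window prompt-to-patch affinities, to be stitched into a global map.
    affinity = A[:, :T, T:]                                # (n_win, T, n_tok)
    return P_new, X_new, affinity

rng = np.random.default_rng(0)
T, C, n_win, n_tok = 3, 8, 4, 9
P_s = rng.standard_normal((T, C))
X_windows = rng.standard_normal((n_win, n_tok, C))
P_new, X_new, affinity = windowed_prompt_attention(P_s, X_windows)
print(P_new.shape, X_new.shape, affinity.shape)  # (3, 8) (4, 9, 8) (4, 3, 9)
```

Averaging keeps a single set of $T$ prompts per layer regardless of the number of windows, which is what allows the same decoding procedure to be reused with a windowed backbone.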
Loss Functions

For 3D detection, similar to FCOS3D (Wang et al., 2021), we use focal loss (Lin et al., 2017) for object classification, smooth L1 loss for location coordinates and size regression, and cross-entropy loss for direction classification and center-ness regression.

Experimental Results on Cityscapes-3D

The overall performance comparison with state-of-the-art methods (single-task models) and the multi-task baseline on Cityscapes-3D is shown in Table 4. The multi-task baseline adopts a similar design to the multi-task baseline on PASCAL-Context but is built upon the Swin-Base backbone. It should be noted that we are the first in the literature to simultaneously perform all three tasks on this dataset. We clearly observe that our TaskPrompter significantly improves on our baseline on all three tasks, further confirming the effectiveness of the method. Moreover, TaskPrompter even yields better performance than several SOTA single-task models, such as on 3Ddet (Haq et al., 2022) and Depth (Wang et al., 2020). It also shows decent performance on the highly competitive task (e.g. Semseg) compared with the SOTA (Zheng et al., 2021). We also visualize our prediction results on Cityscapes-3D and compare them with ground truth labels in Fig. 11. TaskPrompter can generate competitive results for multiple 2D and 3D scene understanding tasks simultaneously. These experimental results further indicate that the proposed TaskPrompter can be effectively adapted to other transformer models and task sets.

5. We observe that our TaskPrompter outperforms all previous SOTA methods with less computation cost. The reason is that TaskPrompter avoids the heavy additional multi-task decoder used by previous decoder-focused methods. We show the number of parameters and FLOPs of TaskPrompter with different numbers of tasks on PASCAL-Context in Table 8. We observe that from 1 task to 5 tasks the parameters increase by only 10.16% and the FLOPs by only 27.76%, which demonstrates the strong scaling ability of our multi-task model.



Figure 1: Illustration of our Spatial-Channel Multi-task Prompting framework (TaskPrompter). TaskPrompter unifies the learning of task-specific and task-generic representations as well as cross-task interactions in each layer throughout the whole transformer architecture, with the embedding of task prompts and patch tokens. The task prompts are projected to spatial task prompts and channel task prompts to learn spatial- and channel-wise interactions, which are critical for dense predictions. The spatial and channel task prompts as well as patch tokens are further used in the proposed Dense Spatial-Channel Task Prompt Decoding module to prompt dense task-specific features and the final multi-task predictions.

Figure 2: An illustration of the proposed Spatial-Channel Task Prompt Learning module in a transformer layer. This module learns the T task prompts by interacting with patch tokens along both the spatial and channel dimensions. The task prompts are projected into T spatial task prompts (each C-dimensional) for Spatial Task Prompt Learning and into T channel task prompts (each N-dimensional) for Channel Task Prompt Learning.

, P c′ is projected back to the C-dimensional latent space with a

Figure 3: A diagram illustration of Dense Spatial-Channel Task Prompt Decoding. The spatial attention map and channel attention map are calculated from the query and key tensors in Spatial Task Prompt Learning and Channel Task Prompt Learning, respectively. They are used to guide the decoding of task-specific features from patch tokens along spatial and channel dimensions.

Figure 4: Visualization examples of the spatial-task-prompt affinity (the first three rows) and the channel-task-prompt affinity (the last row). It can be observed that different spatial and channel task prompts can both attend to distinct spatial or channel locations of the patch tokens, which indicates that the task prompts can effectively learn task-specific representations from the interaction with the image patch tokens.

SETUP

Datasets We evaluate the proposed TaskPrompter mainly on two widely used multi-task dense visual scene understanding datasets, i.e. NYUD-v2 (Silberman et al., 2012) and PASCAL-Context (Chen et al., 2014). Details of the datasets are presented in Appendix A.3.

Figure 5: Influence of the number of heads (windows) in Channel Task Prompt Learning.

Figure 7: Influence of using different numbers of levels in Hierarchical Prompting (HP). The levels are chosen evenly based on network depth. "1" means not using HP. Using only two levels in HP already brings a large performance gain.

Figure 8: More visualization results of the spatial-task-prompt affinity maps on the PASCAL-Context dataset. We can observe that the task prompts attend to different areas of the images based on the characteristics of the tasks.

COMPARISON WITH SOTA METHODS USING VIT-LARGE BACKBONE

As transformer-based multi-task learning methods have appeared only recently (Ye & Xu, 2022), we reimplement several CNN-based SOTA methods, including ATRC (Bruggemann et al., 2021), MTI-Net (Vandenhende et al., 2020), and PAD-Net (Xu et al., 2018), on a ViT-Large backbone. We compare the performance of TaskPrompter and these methods in Table

Multi-task Dense Scene Understanding with Deep Learning Several works have verified that scene understanding tasks can benefit from each other via multi-task learning (MTL) in deep learn-

Effectiveness of different components of TaskPrompter. The performance gains over the baseline are shown in brackets. '↓' means lower is better and '↑' means higher is better.

Comparison between TaskPrompter-Base and TaskPrompter-Large.

Comparison with state-of-the-art methods on NYUD-v2 (left) and PASCAL-Context (right). Our TaskPrompter clearly outperforms the previous state-of-the-art methods.

reports a comparison of the proposed TaskPrompter against previous state-of-the-art methods, including InvPT (Ye & Xu, 2022), ATRC (Bruggemann et al., 2021), MTI-Net (Vandenhende et al., 2020) and PAD-Net (Xu et al., 2018), on both the NYUD-v2 and PASCAL-Context datasets. Notably, the previous best method (i.e. InvPT) and our TaskPrompter are built upon the transformer architecture with the same backbone. Our TaskPrompter establishes new state-of-the-art performance on all 9 metrics on these two datasets. On NYUD-v2, the performance of Semseg is clearly boosted from the previous best, i.e. 53.56 to 55.30 (+1.74). On the PASCAL-Context dataset, Semseg is improved from the previous best 79.03 to 80.89 (+1.86) and Parsing is improved from 67.61 to 68.89 (+1.28).

Performance of joint 2D-3D multi-task scene understanding on Cityscapes-3D dataset. TaskPrompter achieves better or comparable results against SOTA methods of multiple tasks. Bold denotes the best.

Study of the efficiency of TaskPrompter with different numbers of tasks. The last column reports the increases in the computation costs of the model from one task to five tasks.

ACKNOWLEDGEMENTS

This research is supported in part by HKUST-SAIL joint research funding, the Early Career Scheme of the Research Grants Council (RGC) of the Hong Kong SAR under grant No. 26202321 and HKUST Startup Fund No. R9253.

ETHICS STATEMENT

The proposed method focuses on the improvement of multi-task deep learning algorithms without introducing new tasks or datasets. Thus our work doesn't raise any new ethical issues.

REPRODUCIBILITY STATEMENT

The experiments conducted in this paper are based on widely used public datasets. We present the proposed methods in detail and include the implementation particulars in Section 4.1 and Section A.3. The code will be made publicly available to help further study.

A APPENDIX

A.1 PRELIMINARIES: MULTI-HEAD SELF-ATTENTION AND MULTI-HEAD CROSS-ATTENTION

Multi-head self-attention and multi-head cross-attention are both widely used variants of the attention mechanism proposed by Vaswani et al. (2017). As the input of the attention module, the token sequences are projected by individual learnable weights to query Q, key K, and value V matrices. Essentially, the attention mechanism is a weighted sum over the values, guided by the affinity between the query and the key. The output of the attention module is calculated by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{C}}\right)V,$$

where C is the embedding dimension. If the input token sequences are projected into N_head sets of query, key, and value tensors and attention is computed separately for each set, it is called "multi-head attention". The difference between multi-head self-attention (MHSA) and multi-head cross-attention (MHCA) is that the query, key, and value inputs of MHSA are projected from the same token sequence, while the query and key matrices of MHCA are projected from different token sequences.
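The preliminaries above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the square projection matrices `w_q`, `w_k`, `w_v`, and the omission of an output projection are assumptions made for brevity. Following Vaswani et al. (2017), the scaling inside each head uses the per-head dimension.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(q k^T / sqrt(d)) v,
    # where d is the size of the last axis of q.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, n_head):
    # x_q: (Lq, C) tokens that produce the queries.
    # x_kv: (Lk, C) tokens that produce the keys and values.
    # Passing the same sequence for both gives self-attention (MHSA);
    # passing two different sequences gives cross-attention (MHCA).
    def split_heads(t):
        length, dim = t.shape
        return t.reshape(length, n_head, dim // n_head).transpose(1, 0, 2)

    q = split_heads(x_q @ w_q)
    k = split_heads(x_kv @ w_k)
    v = split_heads(x_kv @ w_v)
    out = attention(q, k, v)  # (n_head, Lq, C // n_head)
    # Concatenate the heads back into a (Lq, C) tensor.
    return out.transpose(1, 0, 2).reshape(x_q.shape[0], -1)
```

Calling `multi_head_attention(x, x, ...)` corresponds to MHSA, while `multi_head_attention(prompts, patches, ...)` with two hypothetical token sequences corresponds to MHCA; the output always has one row per query token.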

A.2 WINDOW PARTITION IN CHANNEL TASK PROMPT LEARNING

We show a visual illustration of the window partition technique used in Channel Task Prompt Learning in Fig. 6.

Figure 6: Window partition. We partition the spatial planes formed by the last 2 dimensions of K_c and V_c evenly into local windows to maintain the spatial relationship when fed into different attention heads for cross-attention calculation in the Channel Task Prompt Learning.
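The window partition described above can be sketched as follows. This is only an illustration under stated assumptions: the tensor layout `(..., H, W)`, the function name `window_partition`, and the grid parameters `n_h` and `n_w` are hypothetical, and the windows are taken to be non-overlapping and of equal size.

```python
import numpy as np

def window_partition(x, n_h, n_w):
    # Split the last two (spatial) dimensions of x into an n_h x n_w grid
    # of equally sized, non-overlapping windows.
    # x: (..., H, W) -> (n_h * n_w, ..., H // n_h, W // n_w)
    *lead, height, width = x.shape
    if height % n_h != 0 or width % n_w != 0:
        raise ValueError("H and W must be divisible by the window grid")
    wh, ww = height // n_h, width // n_w
    x = x.reshape(*lead, n_h, wh, n_w, ww)
    nd = len(lead)
    # Move the two window-index axes to the front so that each window
    # keeps its own contiguous (wh, ww) spatial plane.
    x = np.moveaxis(x, (nd, nd + 2), (0, 1))
    return x.reshape(n_h * n_w, *lead, wh, ww)
```

Each slice along the first axis of the result is one local window, which could then be routed to its own attention head; windows are ordered row by row over the grid.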

A.3 IMPLEMENTATION DETAILS

Datasets We evaluate the proposed TaskPrompter mainly on two widely used multi-task dense visual scene understanding datasets, i.e. NYUD-v2 (Silberman et al., 2012) and PASCAL-Context (Chen et al., 2014). Specifically, PASCAL-Context provides 4,998 images in the training set and 5,105 images in the testing set. This dataset offers dense labels for multiple tasks including semantic segmentation, human parsing, and object boundary detection. Additionally, Maninis et al. (2019) provide pseudo ground truth labels for surface normal estimation and saliency detection. NYUD-v2 provides 1,449 images in total, of which 795 are used for training and the remaining 654 for testing. It includes dense labels for semantic segmentation, monocular depth estimation, surface normal estimation, and object boundary detection. In our experimental setup, we include all the tasks in these datasets for a comprehensive performance study.

Model Training

The models for different experiments are trained for 40,000 iterations on all datasets, with a batch size of 4 unless otherwise specified. We adopt the Adam optimizer with a learning rate of 2 × 10^-5 and a weight decay rate of 1 × 10^-6, together with a polynomial learning rate scheduler. For the continuous regression tasks (i.e. Depth and Normal), we use L1 losses. For the discrete classification tasks (i.e. Semseg, Parsing, Saliency, and Boundary), we use cross-entropy losses. The learnable task prompts are randomly initialized from a normal distribution (mean=1, std=1).

Data Processing

For a fair comparison with InvPT (Ye & Xu, 2022), we follow its data processing pipeline. On PASCAL-Context, we pad the image to 512 × 512, while on NYUD-v2, we randomly crop the input image to 448 × 576. We use the same data augmentation, including random color jittering, random cropping, random scaling, and random horizontal flipping.

do and propose a novel Dense Spatial-Channel Task Prompt Decoding to decode multi-task dense predictions with the help of task prompts.
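The optimization settings listed under Model Training can be summarized in a small NumPy sketch. The numeric values (iterations, batch size, learning rate, weight decay, prompt initialization) come from the text; everything else is an assumption for illustration, in particular the decay power of 0.9, the function names, and the use of a plain multi-class cross-entropy for all classification tasks. The Adam update itself is not implemented here.

```python
import numpy as np

# Values stated in the text.
MAX_ITERS = 40_000
BATCH_SIZE = 4
BASE_LR = 2e-5
WEIGHT_DECAY = 1e-6

def poly_lr(step, base_lr=BASE_LR, max_iters=MAX_ITERS, power=0.9):
    # Polynomial learning-rate decay. power=0.9 is an assumed value;
    # the text only says that a polynomial scheduler is used.
    progress = min(step, max_iters) / max_iters
    return base_lr * (1.0 - progress) ** power

def l1_loss(pred, target):
    # Mean absolute error, for regression tasks such as Depth and Normal.
    return float(np.mean(np.abs(pred - target)))

def cross_entropy_loss(logits, labels):
    # logits: (N, K) class scores, labels: (N,) integer class ids.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def init_task_prompts(num_tasks, dim, seed=0):
    # Normal initialization with mean=1 and std=1, as stated in the text.
    # The (num_tasks, dim) shape is a hypothetical layout.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=1.0, scale=1.0, size=(num_tasks, dim))
```

With these definitions, the learning rate starts at the base value at step 0 and decays to zero at the final iteration.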

A.8 ABLATION STUDY OF USING TASK-SPECIFIC ENCODERS

To verify the importance of jointly learning task-specific representations, task-generic representations, and cross-task interactions in TaskPrompter, we design a model variant of TaskPrompter that uses a task-specific encoder for each task. We name this variant "TaskPrompter w/ TE". In each task-specific encoder, we use one task prompt for the corresponding task, and thus there is no task-generic feature in the encoder stage. We still use our Dense Spatial-Channel Task Prompt Decoding to generate the prediction for each task from the task-specific feature of the encoder. We adopt ViT-S as the model backbone and compare this variant to TaskPrompter with the same ViT-S backbone in Table 6. We find that when using only task-specific encoders, the model performance decreases on all tasks, even though the model capacity and computation cost are much larger because each task has an independent encoder. These results verify the importance of jointly modeling both task-specific and task-generic features in TaskPrompter, in terms of both effectiveness and efficiency.

