M³SAT: A SPARSELY ACTIVATED TRANSFORMER FOR EFFICIENT MULTI-TASK LEARNING FROM MULTIPLE MODALITIES

Abstract

Multi-modal multi-task learning (M²TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, which is common in real-world applications such as autonomous driving and robotics control. Current single-model solutions for M²TL usually fall short in several aspects. The backbone shared between modalities is prone to overfitting on the simpler modality, while jointly optimizing the tasks suffers from unstable training due to gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality avoids these problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M³SAT, a sparsely activated transformer for efficient M²TL. The proposed framework tailors mixture-of-experts (MoE) layers into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts a routing policy to assign attention heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks. The disentangled parameter space also restrains the simpler modalities from overfitting. Sparsely activating the transformer further enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of M³SAT: a remarkable performance margin (e.g., ≥ 1.37%) is achieved over dense models with the same computation cost. More importantly, M³SAT achieves these performance improvements with a fraction of the computation cost: our computation is only 1.38% ∼ 53.51% of that of the SOTA methods. Our code will be released upon acceptance.

1. INTRODUCTION

Recently, multi-modal machine learning models have proven effective in several domains, mainly including image, language, and audio understanding Ramesh et al. (2022); Saharia et al. (2022); Agrawal et al. (2017); Yang et al. (2016); Wang et al. (2022). As the need to understand our surroundings keeps rising, new sensing modalities that go beyond these domains need to be deployed and incorporated into multi-modal learning. To illustrate, consider an autonomous vehicle system. Nowadays, autonomous vehicles are equipped with different types of sensors to ensure viable perceptual capability under adverse conditions such as rain, haze, and snow. Therefore, performing multi-modal perception by fusing the data from these sensors has become a necessity. For example, Janani et al. (2022) use the eye blink sensor and photoplethysmography sensor for fatigue detection, Li et al. (2022) use the RGB camera, LiDAR, and millimeter-wave radar for 3D detection and tracking, Raguraman & Park (2020) use the RGB camera and LiDAR for drivable area detection, and Han et al. (2022) use the RGB camera and LiDAR for collision avoidance. In addition, an autonomous vehicle system usually needs to perform a large number of tasks concurrently, including fatigue detection Nemcova et al. (2021), 3D object detection and tracking Li et al. (2022), lane detection Gao et al. (2019), and local planning Isele et al. (2018), which poses challenges to the underlying system. For example, autonomous vehicles usually move at a speed between 60 and 120 km/h, forcing most of these tasks to run at a high frequency (e.g., 10 Hz to 60 Hz or higher). The fact that autonomous vehicles usually have limited computation resources suggests that each task needs to finish within a pre-set time budget, and that we cannot afford to load different task models when switching tasks.

Multi-modal multi-task learning (M²TL) Liang et al. (2022); Hu & Singh (2021) aims at solving multiple multi-modal tasks simultaneously with a single model. However, challenges from both multi-modal learning and multi-task learning hinder us from building an effective M²TL model. First, multi-modal networks are often prone to overfitting, with different modalities overfitting at different rates, so naively training them together is sub-optimal Wang et al. (2020). Second, training multiple tasks within a single model often results in tasks competing for model capacity, since the same weights may receive conflicting update directions Chen et al. (2020b); Fifty et al. (2021). Notably, we assume that the intelligent system often only requires a small number of tasks simultaneously, and each task only involves a subset of all the modalities. For such a system, a "fully activated" model is heavily redundant and hard to scale. For example, Singh et al. (2022); Hu & Singh (2021) have to activate a massive transformer-based network for each task, with each modality using a distinct transformer encoder. Thus, as the backbone network grows with the number of modalities and tasks, the inference latency of each task becomes catastrophically long.

To tackle these bottlenecks, we propose the Multi-modal Multi-task Sparsely Activated Transformer (M³SAT), which organically adapts the mixture of experts (MoE) Riquelme et al. (2021); Lepikhin et al. (2021) to efficient M²TL, as MoE can adaptively divide-and-conquer the entire model capacity into smaller sub-models Shazeer et al. (2017); Kim et al. (2021b). We train the routing policy within our backbone to select a subset of experts for each input token. During training, the load and importance balancing losses prevent the feature tokens from always being routed to the same expert, and thus distribute the parameter updates of a specific modality across different experts. This effectively restrains the easier modality from overfitting. Meanwhile, the routing strategy separates the parameter spaces, which balances feature reuse and avoids training conflicts among tasks. In fact, vanilla MoE already disentangles the parameter spaces of the FFN; however, we find that these experts with separated parameter spaces are still insufficient to handle multiple multi-modal tasks. Therefore, M³SAT adopts MoE in both the feed-forward network (FFN) and the self-attention modules of the vanilla transformer encoder backbone. By untangling more parameters into distinct parameter spaces of the transformer backbone, M³SAT better restrains the simpler modalities from overfitting and alleviates the gradient conflicts between different tasks. During inference, M³SAT only activates the experts corresponding to the necessary modality/task instead of the entire model. As such, the highly sparsely activated transformer achieves efficient inference for the specific modality and task. To verify the effectiveness of M³SAT, we conduct a comprehensive evaluation on MultiBench, a large-scale benchmark spanning more than 10 modalities and testing 20 prediction tasks across 6 distinct research areas. Our model surpasses the performance of the state-of-the-art (SOTA) multi-modal multi-task model on MultiBench.

Meanwhile, our computation cost is only 1.38% ∼ 53.51% of that of the current SOTA multi-modal multi-task model on MultiBench. Our main contributions are outlined below:

• We target the problem of efficient multi-modal multi-task learning and propose the first multi-modal multi-task mixture-of-experts model.
• We engage MoE to achieve three goals: (1) resolving the training conflicts among tasks, (2) restraining the easier modality from overfitting, and (3) sparsely activating paths for single-modality and single-task inference.
• We demonstrate remarkable performance improvements over dense models with equivalent computational cost, and outperform the current multi-task state of the art with only 1.38% to 53.51% of its computational cost.
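The token routing and balancing loss described above can be sketched in a few lines. The following is a minimal, hypothetical NumPy sketch of a top-k MoE FFN layer, not the actual M³SAT implementation: the expert count, the top-2 rule, and the coefficient-of-variation importance loss follow the common Shazeer-style MoE recipe, and all names (`MoEFFN`, `w_gate`, `balance_loss`) are illustrative. Note that experts receiving no tokens are skipped entirely, which is the source of the sparse computation at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEFFN:
    """Hypothetical mixture-of-experts FFN with top-k token routing."""
    def __init__(self, d_model=8, d_hidden=16, n_experts=4, k=2):
        self.k = k
        self.n_experts = n_experts
        self.w_gate = rng.normal(0, 0.1, (d_model, n_experts))  # router weights
        self.experts = [
            (rng.normal(0, 0.1, (d_model, d_hidden)),
             rng.normal(0, 0.1, (d_hidden, d_model)))
            for _ in range(n_experts)
        ]

    def __call__(self, tokens):
        # tokens: (n_tokens, d_model)
        logits = tokens @ self.w_gate                  # (n_tokens, n_experts)
        gates = softmax(logits)
        topk = np.argsort(-gates, axis=1)[:, :self.k]  # k chosen experts per token
        out = np.zeros_like(tokens)
        for e in range(self.n_experts):
            mask = (topk == e).any(axis=1)             # tokens routed to expert e
            if not mask.any():
                continue                               # inactive expert: no compute
            w1, w2 = self.experts[e]
            h = np.maximum(tokens[mask] @ w1, 0.0) @ w2
            out[mask] += gates[mask, e][:, None] * h   # gate-weighted combination
        # Importance balancing loss: squared coefficient of variation of the
        # per-expert total gate mass; minimizing it spreads tokens over experts
        # so no single expert absorbs all updates from one easy modality.
        importance = gates.sum(axis=0)
        balance_loss = importance.var() / (importance.mean() ** 2 + 1e-9)
        return out, balance_loss

tokens = rng.normal(size=(10, 8))
layer = MoEFFN()
out, loss = layer(tokens)
print(out.shape)
```

In training, `balance_loss` would be added (with a small weight) to the task losses; the same routing idea extends to selecting attention heads, with heads playing the role of experts.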

Multi-modal and Multi-task Learning. There has been a long history of work on multi-modal and multi-task learning. On the one hand, most previous efforts on multi-task learning Strezoski et al. (2019); Zamir et al. (2018); Søgaard & Goldberg (2016); Hashimoto et al. (2017) focus on

