M³SAT: A SPARSELY ACTIVATED TRANSFORMER FOR EFFICIENT MULTI-TASK LEARNING FROM MULTIPLE MODALITIES

Abstract

Multi-modal multi-task learning (M²TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, and is common in real-world applications such as autonomous driving and robotics control. Current single-model solutions for M²TL usually fall short in several aspects. A backbone shared across modalities is prone to overfitting the simpler modality, while jointly optimizing the tasks suffers from unstable training due to gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality avoids these problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M³SAT, a sparsely activated transformer for efficient M²TL. The proposed framework tailors mixture-of-experts (MoE) layers into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts a routing policy to assign attention heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks. The disentangled parameter space also mitigates the tendency of the simpler modality to overfit. Sparsely activating the transformer further enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of M³SAT: a remarkable performance margin (e.g., ≥ 1.37%) is achieved over dense models with the same computation cost. More importantly, M³SAT achieves these improvements at a fraction of the computation cost: only 1.38% ∼ 53.51% of that of the SOTA methods. Our code will be released upon acceptance.
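The sparse activation described above can be illustrated with a minimal top-k gated MoE feed-forward layer. This is a generic sketch of the standard MoE pattern, not the paper's exact routing policy; all names, dimensions, and the choice of a softmax router with top-k selection are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEFFN:
    """Mixture-of-experts FFN with top-k routing (illustrative sketch).

    Each token is routed to its top-k experts, so only a fraction of the
    layer's parameters are activated per token: compute scales with k,
    not with the total number of experts.
    """
    def __init__(self, d_model, d_hidden, n_experts, k=1, seed=0):
        rng = np.random.default_rng(seed)
        self.k = k
        self.w_gate = rng.normal(0, 0.02, (d_model, n_experts))        # router weights
        self.w1 = rng.normal(0, 0.02, (n_experts, d_model, d_hidden))  # per-expert up-proj
        self.w2 = rng.normal(0, 0.02, (n_experts, d_hidden, d_model))  # per-expert down-proj

    def __call__(self, x):
        # x: (n_tokens, d_model)
        scores = softmax(x @ self.w_gate)                 # routing probabilities
        topk = np.argsort(scores, axis=-1)[:, -self.k:]   # top-k expert indices per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            for e in topk[t]:
                h = np.maximum(x[t] @ self.w1[e], 0.0)    # expert FFN with ReLU
                out[t] += scores[t, e] * (h @ self.w2[e]) # gate-weighted combination
        return out

# Usage: route 5 tokens through 4 experts, activating 2 experts per token.
layer = MoEFFN(d_model=8, d_hidden=16, n_experts=4, k=2)
y = layer(np.random.default_rng(1).normal(size=(5, 8)))  # shape (5, 8)
```

Because only k of the n_experts expert FFNs run per token, the same mechanism applied to attention heads and FFN experts lets different modalities and tasks occupy largely disjoint parameter subsets.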

1. INTRODUCTION

Recently, multi-modal machine learning models have proven effective in several domains, mainly including image, language and audio understanding Ramesh et al. (2022); Saharia et al. (2022); Agrawal et al. (2017); Yang et al. (2016); Wang et al. (2022). As the need to understand our surroundings keeps rising, new sensing modalities that go beyond these domains need to be deployed and incorporated in multi-modal learning. To illustrate, let us consider an example autonomous vehicle system. Nowadays, autonomous vehicles are equipped with different types of sensors to ensure viable perceptual capability under adverse conditions such as rain, haze, and snow. Therefore, performing multi-modal perception by fusing the data from these sensors has become a necessity. For example, Janani et al. (2022) use the eye blink sensor and photoplethysmography sensor for fatigue detection, Li et al. (2022) use the RGB camera, LiDAR and millimeter wave radar for 3D detection and tracking, Raguraman & Park (2020) use the RGB camera and LiDAR for drivable area detection, and Han et al. (2022) use the RGB camera and LiDAR for collision avoidance. In addition, an autonomous vehicle system usually needs to perform a large number of tasks concurrently, including fatigue detection Nemcova et al. (2021), 3D object detection and tracking Li et al. (2022), lane detection Gao et al. (2019) and local planning Isele et al. (2018), etc., which poses challenges to the underlying system. For example, autonomous vehicles usually move at a speed

