M 3 SAT: A SPARSELY ACTIVATED TRANSFORMER FOR EFFICIENT MULTI-TASK LEARNING FROM MULTIPLE MODALITIES

Abstract

Multi-modal multi-task learning (M 2 TL) aims to discover the implicit correspondences among heterogeneous modalities and tasks, which is common in real-world applications like autonomous driving and robotics control. Current single-model solutions for M 2 TL usually fall short in several aspects. The backbone shared between the modalities is prone to overfitting the simpler modality, while jointly optimizing the tasks suffers from unstable training due to gradient conflicts across tasks. On the other hand, designing a separate model for each task and modality avoids the above problems but leads to prohibitively expensive computation and memory consumption, rendering this approach unrealistic. In this work, we propose M 3 SAT, a sparsely activated transformer for efficient M 2 TL. The proposed framework tailors the mixture-of-experts (MoE) into both the self-attention and the feed-forward networks (FFN) of a transformer backbone. It adopts a routing policy to assign attention heads and FFN experts during training, which effectively disentangles the parameter space to prevent training conflicts among diverse modalities and tasks. Meanwhile, the disentangled parameter space also restrains the simpler modality from overfitting. Sparsely activating the transformer also enables efficient computation for each input sample. Through comprehensive evaluation, we demonstrate the effectiveness of M 3 SAT: a remarkable performance margin (e.g., ≥ 1.37%) is achieved over dense models with the same computation cost. More importantly, M 3 SAT achieves the above performance improvements with a fraction of the computation cost: our computation is only 1.38% ∼ 53.51% of that of the SOTA methods. Our code will be released upon acceptance.

1. INTRODUCTION

Recently, multi-modal machine learning models have proven effective in several domains, mainly including image, language and audio understanding Ramesh et al. (2022); Saharia et al. (2022); Agrawal et al. (2017); Yang et al. (2016); Wang et al. (2022). As the need to understand our surroundings keeps rising, new sensing modalities that go beyond these domains need to be deployed and incorporated in multi-modal learning. To illustrate, let us consider an example autonomous vehicle system. Nowadays, autonomous vehicles are equipped with different types of sensors to ensure viable perceptual capability under adverse conditions such as rain, haze, and snow. Therefore, performing multi-modal perception by fusing the data from these sensors has become a necessity. For example, Janani et al. (2022) uses the eye blink sensor and photoplethysmography sensor for fatigue detection, Li et al. (2022) uses the RGB camera, LiDAR and millimeter wave radar for 3D detection and tracking, Raguraman & Park (2020) uses the RGB camera and LiDAR for drivable area detection, and Han et al. (2022) uses the RGB camera and LiDAR for collision avoidance. In addition, an autonomous vehicle system usually needs to perform a large number of tasks concurrently, including fatigue detection Nemcova et al. (2021), 3D object detection and tracking Li et al. (2022), lane detection Gao et al. (2019) and local planning Isele et al. (2018), etc., which poses challenges to the underlying system. For example, autonomous vehicles usually move at a speed between 60 ∼ 120 km/h, forcing most of these tasks to run at a high frequency (e.g., 10Hz ∼ 60Hz or higher). The fact that autonomous vehicles usually have limited computation resources suggests that each task needs to finish within a pre-set time, and that we cannot afford to load different task models when switching tasks.

Multi-modal multi-task learning (M 2 TL) Liang et al. (2022); Hu & Singh (2021) aims at solving multiple multi-modal tasks simultaneously with a single model. However, challenges from both multi-modal learning and multi-task learning hinder us from building an effective M 2 TL model. Firstly, multi-modal networks are often prone to overfitting, with different modalities overfitting at different rates, and thus naively training them together is sub-optimal Wang et al. (2020). Secondly, training multiple tasks within a single model often results in tasks that compete for model capacity, since the same weights might receive conflicting update directions Chen et al. (2020b); Fifty et al. (2021). Notably, we assume that the intelligent system often only requires a small number of tasks simultaneously, and each task only involves a subset of all the modalities. For such a system, the "fully activated" model is heavily redundant and hard to scale. For example, Singh et al. (2022); Hu & Singh (2021) have to activate a massive transformer-based network for each task, with each modality using a distinct transformer encoder. Thus, as the backbone network grows with the number of modalities and tasks, the inference latency of each task becomes catastrophically long.

To tackle these bottlenecks, we propose the Multimodal Multi-task Sparsely Activated Transformer (M 3 SAT), which organically adapts the mixture of experts (MoE) Riquelme et al. (2021); Lepikhin et al. (2021) for efficient M 2 TL tasks, as MoE can adaptively divide-and-conquer the entire model capacity into smaller sub-models Shazeer et al. (2017); Kim et al. (2021b). We train the routing policy within our backbone to select the subset of experts for each input token. In the training stage, the load and importance balancing loss prevents the feature tokens from always being routed to the same expert, and thus distributes the parameter updates of a specific modality across different experts.
This can effectively restrain the easy modality from overfitting. Meanwhile, the routing strategy separates the parameter spaces, which can balance feature reuse and avoid training conflicts among tasks. In fact, vanilla MoE already disentangles the parameter spaces of the FFN; however, we find that these experts with separated parameter spaces are still insufficient to handle multiple multi-modal tasks. Therefore, M 3 SAT adopts the MoE in both the feed-forward network (FFN) and the self-attention modules of the vanilla transformer encoder backbone. By untangling more parameters into distinct parameter spaces of the transformer backbone, M 3 SAT better restrains the simpler modalities from overfitting and alleviates the gradient conflicts between different tasks. During the inference stage, M 3 SAT only activates those experts corresponding to the necessary modality/task instead of the entire model. As such, the sparsely activated transformer achieves efficient inference for the specific modality and task.

To verify the effectiveness of our M 3 SAT, we conduct a comprehensive evaluation on MultiBench, a large-scale benchmark spanning more than 10 modalities and testing for 20 prediction tasks across 6 distinct research areas. Our model surpasses the performance of the state-of-the-art (SOTA) multi-modal multi-task model on MultiBench. Meanwhile, our computation cost is 1.38% ∼ 53.51% of the computation cost of the current SOTA multi-modal multi-task model on MultiBench. Our main contributions are outlined below:

• We target the problem of efficient multi-modal multi-task learning and propose the first multi-modal multi-task mixture-of-experts model.
• We engage MoE to achieve the following three goals: (1) solving the training conflicts among tasks, (2) restraining the easy modality from overfitting, and (3) sparsely activating paths for single-modality and single-task inference.
• We demonstrate remarkable performance improvements over dense models with equivalent computational cost and outperform current multi-task state-of-the-art performance with only 1.38% to 53.51% of their computational cost.

2. RELATED WORK

Multi-modal and Multi-task Learning. There has been a long history of work on multi-modal and multi-task learning. On the one hand, most previous efforts on multi-task learning 

3. METHODOLOGY

We first describe the overall architecture of our M 3 SAT, as shown in Figure 1 , and then present the proposed Sparse MoE design for multi-modal, multi-task learning.

3.1. MULTI-MODAL MULTI-TASK MODEL DESIGN

Input Data Preprocessing. Each modality is treated as a sequence. The modality-specific Fourier positional encoding and one-hot modality encoding are then applied to integrate temporal/positional information and modality information into the input embedding sequence. We defer to Appendix A.1 (i) the details of processing each modality as sequential data, (ii) the Fourier positional encoding setting for different modalities, and (iii) the one-hot modality encoding.

Unimodal Encoder. After the input-data pre-processing, we obtain sequential tokens of different modalities, in which the feature dimension of all modalities is the same. A transformer-based Perceiver Block Jaegle et al. (2021) is adopted to convert each modality sequence to sequences with the same length. Note that only one copy of the transformer-based Perceiver Block is used for all modalities and tasks. Moreover, sharing the parameters of the unimodal encoder across different modalities also removes the need to set up a specific encoder for each modality. The details of the transformer-based Perceiver Block can be found in Appendix A.2.

Figure 1: Our model first standardizes each input modality into a sequence and uses modality-specific embedding layers to capture the modality-specific information. Then the uni-modal encoder layer converts each sequence to sequences of the same length. We concatenate these modality tokens on the sequence dimension within each task and input them to M 3 SAT encoder layers for multi-modal multi-task learning. These M 3 SAT encoder layers perform efficient modality information fusion, eliminate the training conflicts among tasks, and keep the easy modality from overfitting.

Consecutive Transformer Encoder with MoE. At this point, we have the multi-modal tokens for each task, for which the sequence length of each modality is the same, and the feature dimension of each modality token is the same. After concatenating these modality tokens on the sequence dimension within each task, we feed these tokens into several consecutive transformer encoder layers. These transformer encoder layers are composed of our proposed M 3 SAT encoder layer and the vanilla dense transformer encoder layer. Specifically, the M 3 SAT encoder layer replaces the self-attention layer and the feed-forward network (FFN) layer of the dense encoder layer with corresponding sparse MoE layers. The M 3 SAT encoder layer is introduced in Section 3.2, and the detailed configuration of the M 3 SAT layer and other training setups are provided in Appendix A.3.

Task-specific Head and Multi-task Learning. Finally, we use a linear layer with normalization per task for task-specific learning. Our optimization objective is to minimize a weighted sum of the losses of the multiple tasks.

3.2. MULTI-MODAL MULTI-TASK MOE FOR M 3 SAT

We first describe the standard MoE, then the proposed Sparse MoE Self-attention, and finally the multi-router version of MoE, which consists of a standard MoE and a Sparse MoE Self-attention.

Mixture of Experts Layer. A Mixture of Experts (MoE) layer typically consists of a group of $N$ experts $f_1, f_2, \ldots, f_N$ together with a routing network (also called router) that selects appropriate experts. In transformer-based models, the experts are usually multi-layer perceptrons (Riquelme et al., 2021). We inherit the router design from V-MoE (Riquelme et al., 2021). For an input $x$, the output of the MoE layer combines the top $K$ experts selected by a router $R$, as shown below:

$$y = \sum_{k=1}^{K} R(x)_k \cdot f_k(x), \qquad R(x) = \mathrm{TopK}(\mathrm{softmax}(\mathrm{Gate}(x)), K), \qquad \mathrm{TopK}(v, K) = \begin{cases} v & \text{if } v \text{ is in the top } K \text{ elements} \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathrm{Gate}(\cdot)$ represents the learnable network within the router, for which we employ a single linear layer without bias in practice. The $\mathrm{softmax}(\cdot)$ and $\mathrm{TopK}(\cdot, K)$ work together to set all vector elements to zero except the elements with the largest $K$ values. To avoid always routing to the same experts while ignoring others, we employ the load and importance balancing loss following Shazeer et al. (2017). We list the settings of $K$ and $N$ for different task groups in Appendix A. M 3 SAT uses the vanilla sparsely activated MoE in the FFN layer.

Sparse Self-attention MoE. We first revisit the definition of the original Self-attention layer, which computes the self-attention of the input tokens with the scaled dot product:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{C}}\right)V,$$

where $Q, K, V \in \mathbb{R}^{S \times C}$ are the query, key, and value matrices computed by three linear layers from the input tokens; $S$ and $C$ denote the sequence length and the hidden dimension. These three linear layers for computing query, key, and value use the same architecture but different parameters. In our proposed Sparse Self-attention MoE, we integrate MoE into these three linear layers to further disentangle the parameter spaces, as displayed on the left side of Figure 2. For each attention MoE:

$$y = \sum_{k=1}^{K} R(x)_k \cdot f^{a}_k(x),$$

where the expert candidates $f^{a}_k$ are shared across modalities and tasks. Unlike vanilla MoE, the expert $f^{a}_k(\cdot)$ is a single linear layer whose input and output dimensions both equal the hidden dimension. Each expert of vanilla MoE is computed as $W_2\,\delta_{gelu}(W_1 x)$, where $\delta_{gelu}$ is the GELU activation Hendrycks & Gimpel (2016), and $W_1$ and $W_2$ are two learnable weight matrices. Each expert of the Sparse MoE Self-attention layer is computed as $Wx$, where $W$ is the learnable weight matrix for calculating the key, query, or value for self-attention. Note that, unlike Fedus et al. (2022); Zhu et al. (2022), we use three independent routers to route tokens for q, k, and v separately.

2021a) investigate task-specific routing networks for multi-task learning. This paper takes one step further: we propose the task/modality-specific multi-router MoE to study the benefits of the multi-router design. Formally, we define the output of our MoE layer as follows:

$$y_t = \sum_{k=1}^{K} R_s(x)_k \cdot f_k(x),$$

where $s$ is the routing network index and can be set as the task index or the modality index. The expert $f_k(\cdot)$ can be either a single linear layer (used in the Sparse MoE Self-attention layer) or an FFN layer (used in the Sparse MoE FFN layer). All of these experts are shared between different modalities and tasks.
Both the Sparse MoE Self-attention layer and the Sparse MoE FFN layer can use the task-specific or modality-specific router. Therefore, we design four versions of multi-router M 3 SAT: i) the Multi-router M 3 SAT uses modality-specific routing networks in the Sparse MoE Self-attention layer and task-specific routing networks in the Sparse MoE FFN layer; ii) the R-Multi-router M 3 SAT 'reverses' the settings of the Multi-router M 3 SAT, using modality-specific routing networks in the Sparse MoE FFN layer and task-specific routing networks in the Sparse MoE Self-attention layer. In addition, we use only modality-specific routing networks (P-Modality-router M 3 SAT) or only task-specific routing networks (P-Task-router M 3 SAT) in both the Sparse MoE Self-attention layer and the Sparse MoE FFN layer. The backbone model parameters of M 3 SAT and these versions of multi-router M 3 SAT do not increase proportionally as more modalities and tasks are involved in training. We show the details of the task/modality-specific multi-router MoE in Figure 3. The effects of the multi-router MoE are discussed in Section 4.3 and Appendix B.
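The routing equations of this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under assumptions, not the released implementation: the function and parameter names (`top_k_gate`, `moe_linear`, `sparse_moe_self_attention`, `params`) are hypothetical, the attention is single-head, the balancing loss is omitted, and every expert is evaluated densely for readability, whereas a sparse implementation would only run the K selected experts per token.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_gate(x, w_gate, k):
    """R(x) = TopK(softmax(Gate(x)), K) with a bias-free linear gate."""
    probs = softmax(x @ w_gate)                    # (S, N) routing probabilities
    top_idx = np.argsort(probs, axis=-1)[:, -k:]   # indices of the K largest entries
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return probs * mask                            # all other entries are set to zero

def moe_linear(x, routers, experts, s, k):
    """y = sum_k R_s(x)_k * f_k(x), each expert f_k being a single linear layer."""
    gates = top_k_gate(x, routers[s], k)           # router s (task or modality index)
    # Assumption: dense evaluation of all experts, then gating (illustration only).
    expert_out = np.einsum("sc,ncd->snd", x, experts)   # (S, N, C)
    return np.einsum("sn,snd->sd", gates, expert_out)   # (S, C)

def sparse_moe_self_attention(x, params, s, k):
    """Single-head attention whose q, k, v projections are three independent MoEs."""
    q = moe_linear(x, params["q_routers"], params["q_experts"], s, k)
    key = moe_linear(x, params["k_routers"], params["k_experts"], s, k)
    v = moe_linear(x, params["v_routers"], params["v_experts"], s, k)
    c = x.shape[-1]
    return softmax(q @ key.T / np.sqrt(c)) @ v

# Hypothetical sizes: S tokens, hidden size C, N experts, top-K, three routers.
rng = np.random.default_rng(0)
S, C, N, K, NUM_ROUTERS = 6, 8, 4, 2, 3
params = {}
for name in ("q", "k", "v"):
    params[f"{name}_routers"] = rng.normal(size=(NUM_ROUTERS, C, N))
    params[f"{name}_experts"] = rng.normal(size=(N, C, C)) / np.sqrt(C)
tokens = rng.normal(size=(S, C))
gates = top_k_gate(tokens, params["q_routers"][0], K)
out = sparse_moe_self_attention(tokens, params, s=1, k=K)
```

Selecting `s` by task index or by modality index is what distinguishes the multi-router variants in this sketch; the experts themselves stay shared, so adding a task or modality only adds router weights.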

4.1. IMPLEMENTATION DETAILS

To evaluate the proposed method, we conduct experiments on MultiBench, a large-scale multi-modal multi-task benchmark involving more than 10 modalities and testing for 20 prediction tasks across 6 research areas. We choose 7 tasks in MultiBench and train 3 multi-modal multi-task models across combinations of these tasks in

$$\frac{1}{T}\sum_{i}^{T} (-1)^{l_i}\,\frac{M_{m,i} - M_{b,i}}{M_{b,i}},$$

where $M_i$ is the metric of task $i$, and $l_i = -1$ if a lower value means better performance. The M 2 TL results of HighMMT are obtained by running their released code and training configuration. Training each task group takes about 12 to 24 hours for both HighMMT and our M 3 SAT model. The performances of these tasks for HighMMT and M 3 SAT that we report in this paper are the mean of 3 repetitions. For the min and max performance of MultiBench in Table 2, we report numbers directly from the MultiBench paper.

Configuration Details. We display the overall architecture of our model in Figure 1 and the architecture design details of the proposed M 3 SAT in Figure 2. We conduct all of our experiments on the NVIDIA A30 Tensor Core GPU. Please refer to Appendix A.3 for more details on network configuration and training setup.
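As a worked illustration of the relative-gain formula above, the following sketch averages the signed relative change of each task metric. It rests on assumptions not stated in the text: the sign factor is taken to be −1 for lower-is-better metrics and +1 otherwise, the two metric lists are read as the compared model and a baseline, and the function name and the numbers are hypothetical.

```python
def relative_m2tl_gain(model_metrics, baseline_metrics, lower_is_better):
    """Average signed relative change of the model over the baseline across tasks."""
    terms = []
    for m, b, lower in zip(model_metrics, baseline_metrics, lower_is_better):
        sign = -1.0 if lower else 1.0   # assumption: flip the sign for lower-is-better metrics
        terms.append(sign * (m - b) / b)
    return sum(terms) / len(terms)

# Hypothetical numbers: an accuracy going from 50 to 55 and an MSE going from 0.5 to 0.4.
gain = relative_m2tl_gain([55.0, 0.4], [50.0, 0.5], [False, True])
print(f"relative gain: {gain:+.2%}")
```

With this convention, an improvement on any task contributes a positive term regardless of whether the underlying metric is an accuracy or an error.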

4.2. PERFORMANCE COMPARISON OF M 3 SAT WITH EXISTING MULTIMODAL MODELS

In Table 2, we compare the performance of our model with the current SOTA model HighMMT Liang et al. (2022) as well as 20 recent multimodal models that are implemented in Liang et al. (2021). The results show that our method outperforms HighMMT on all tasks under all three settings (+12.93%/+20.19%/+2.28% M 2 TL performance, respectively). Notably, the 'ENRICO' performance of M 3 SAT is even significantly higher (+20.58% single-task performance) than the best performance of MultiBench, which sets a new state-of-the-art result. Meanwhile, M 3 SAT only uses 1.38% ∼ 53.51% of the computation resources of HighMMT. For the large setting, although the number of parameters of M 3 SAT is larger than that of HighMMT, the computation resources (FLOPs) we use are still much smaller.

Ablation Study: expert number and selection number. For the MoE layer, the number of selected experts per token $K$ and the total number of experts $N$ are two of the most significant hyperparameters. Due to the limited space, we show the detailed performance in Appendix C.2.

In-Depth Discussion: MTL. We measure the following two metrics to explain why MoE is successful from the multi-task learning (MTL) view: the gradient positive sign purity and the inter-task affinity. The gradient positive sign purity Chen et al. (2020b) (GPSP) measures how many positive gradients are present in a network parameter at any given value. $P$ is bounded by $[0, 1]$. A value of $P$ close to 0 or 1 indicates that the gradient conflict of MTL has less effect on the corresponding parameter. In Figure 4, we discretize $P$ into 5 intervals and then count the number of parameters that fall within these 5 intervals. We record the GPSP distribution of M 3 SAT, M 3 SAT without MoE on self-attention, M 3 SAT without MoE on FFN, and the equal-computation dense model. The inter-task affinity $Z_{i \to j}$ defined by Fifty et al. (2021) indicates the influence of the parameter update of task $i$ on task $j$. A higher value of $Z_{i \to j}$ indicates that the parameter update is beneficial for task $j$, while a lower value of $Z_{i \to j}$ indicates that the parameter update is antagonistic to task $j$. For the medium setting, in the right part of Figure 4, we record the inter-task affinity of the 'ENRICO' task to the 'PUSH' task for M 3 SAT, the multi-router M 3 SAT, and the equal-computation dense model. Compared with other models, the GPSP of M 3 SAT accumulates more in the intervals [0.6, 0.8] and [0.8, 1.0], which shows that, by splitting the parameter space, only a fraction of the conflicting parameters are active for a specific task. The inter-task affinity of M 3 SAT and multi-router M 3 SAT is higher than that of the dense model most of the time, which shows that MoE can restrain the gradient conflict of MTL. For more details on the GPSP and the inter-task affinity, please refer to Appendix C.5 and Appendix C.6.

In-depth Discussion: multi-modal learning. From the perspective of multi-modal learning, the optimal gradient blend (OGB) defined by Wang et al. (2020) indicates which modality is prone to overfitting (the smaller the value, the more prone the modality is to overfitting). For a multi-modal task with $M$ modalities, the OGB is bounded: $w^{ogb}_m \in [0, 1]$ and $\sum_{m}^{M} w^{ogb}_m = 1$, where $m$ is the modality index. The greater the difference between the modality OGB values within a single task, the more serious the overfitting problem for the modalities with smaller OGB values. In Table 4, we present the optimal gradient blend of the trained models under different MoE settings. For the PUSH and AV-MNIST tasks, the overfitting problem still exists. However, M 3 SAT alleviates the problem in the ENRICO task. For more details on the optimal gradient blend, please see Appendix C.7.

In-depth Discussion: Expert distribution. We also explore how routing is distributed across different modalities and tasks. Due to the limited space, we show the routing distribution on testing data for different modalities and tasks of the medium setting in Appendix C.8.
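The GPSP statistic used in the MTL discussion above (formula given in Appendix C.5) can be illustrated with a small NumPy sketch. It is a sketch under assumptions rather than the evaluation code: the function names are hypothetical, gradients are passed as a (tasks × parameters) array, and a parameter whose gradients are all zero is assigned P = 0.5, a convention chosen here only to avoid division by zero.

```python
import numpy as np

def gradient_positive_sign_purity(task_grads):
    """P = 0.5 * (1 + sum_i g_i / sum_i |g_i|), computed per parameter over T tasks."""
    task_grads = np.asarray(task_grads, dtype=float)     # shape (T, num_params)
    total = task_grads.sum(axis=0)
    total_abs = np.abs(task_grads).sum(axis=0)
    # Assumption: all-zero gradients give a ratio of 0, i.e. P = 0.5.
    ratio = np.divide(total, total_abs, out=np.zeros_like(total), where=total_abs > 0)
    return 0.5 * (1.0 + ratio)

def purity_histogram(purity, num_bins=5):
    """Fraction of parameters falling into each of num_bins equal intervals of [0, 1]."""
    counts, _ = np.histogram(purity, bins=np.linspace(0.0, 1.0, num_bins + 1))
    return counts / counts.sum()

# Toy gradients for 2 tasks and 3 parameters: agreeing (+), conflicting, agreeing (-).
grads = [[1.0, 1.0, -1.0],
         [1.0, -1.0, -1.0]]
purity = gradient_positive_sign_purity(grads)
ratios = purity_histogram(purity)
```

In the toy example, the two parameters whose task gradients agree in sign land at P = 1 and P = 0, while the conflicting parameter lands at P = 0.5, which is the regime the discussion associates with gradient conflict.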

5. CONCLUSION AND LIMITATION

This paper proposes a sparsely activated transformer model for efficient multi-modal multi-task learning. By tailoring the mixture-of-experts into both the self-attention and the feed-forward networks of a transformer backbone, we achieve the following. Firstly, we sparsely activate experts in the self-attention and the feed-forward networks during training to restrain easy modalities from overfitting and to mitigate MTL gradient conflicts. Secondly, given any task and its corresponding modalities, we can activate only the sparse 'expert' pathway for efficiency. Comprehensive experiments show that the proposed M 3 SAT surpasses the SOTA (+12.93%/+20.19%/+2.28% M 2 TL performance) with a fraction of the computation cost; our computation cost is only 1.38% ∼ 53.51% of that of the SOTA model. Our experiments on MoE also provide rational perspectives for designing multi-modal multi-task learning neural network architectures. The limitation of our work is that the proposed M 3 SAT is only evaluated on academic datasets. In future work, we will evaluate M 3 SAT on more practical tasks such as indoor robots and autonomous vehicles. We also expect to expand our model size for larger-scale tasks and more kinds of modalities.

A.1 PROCESS DATA INTO SEQUENCE

Following the process of Jaegle et al. (2022), we first standardize each input into a sequence. For each modality, following Jaegle et al. (2022), we define some hyperparameters (such as max_freq, num_freq_bands, and freq_base) for the Fourier positional encoding, which provides the positional information. Modalities such as text and time-series are already sequential data. We apply 1D positional encoding for these modalities $x \in \mathbb{R}^{b_m \times t_m \times d_m}$, where $b_m$, $t_m$, $d_m$ are the batch size, sequence length, and input dimension of the current modality, respectively. For image and similar modalities, we follow the processing procedure of Dosovitskiy et al. (2021), which breaks each input into $h_m \times w_m$ patches and flattens it as a sequence of $p^2$ regions. We use 2D positional encoding for image and similar modality inputs $x \in \mathbb{R}^{b_m \times h_m \times w_m \times d_m}$, where $h_m \times w_m$ is the number of patches. For the image modality, $d_m$ is the number of pixels within a patch. For video and similar modalities, we treat each frame as an image; therefore, we apply 3D positional encoding for input $x \in \mathbb{R}^{b_m \times l_m \times h_m \times w_m \times d_m}$, where $l_m$ is the number of frames. For the other modalities, such as table and graph, we treat each element in the table/graph as an element in the sequence and use a 1D positional encoding. After transforming the inputs into sequence data, we show the subsequent processing procedure in Algorithm 1. The 'max_modality_dim' equals $\max_{m \in M}(d_m + d_{p_m})$, where $d_{p_m}$ is the dimension of the Fourier positional encoding for the corresponding modality. The one-hot encoding is defined as $e_m \in \mathbb{R}^{|M|}$, where $|M|$ is the number of all modalities involved.
Algorithm 1 Data Preprocess

# x: the input tokens of a specific modality
def DataPreprocess(x, modality):
    # get positional encoding information
    # pos_dim: indicates 1D/2D/3D positional encoding
    enc_pos = fourier_encode(modality.pos_dim, modality.max_freq,
                             modality.num_freq_bands, modality.freq_base)
    # add padding for modalities with smaller input dimension
    # max_modality_dim: the maximum input dimension over all modalities
    # input_dim: the input dimension of the current modality
    padding = zeros(max_modality_dim - modality.input_dim)
    # modality one-hot encoding
    # modality_index: the index of the current modality
    modality_encodings = one_hot(modality.modality_index)
    # construct final input
    modality_input = concatenate(x, padding, enc_pos, modality_encodings)
    return modality_input
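For readers who want to run something close to Algorithm 1, the following NumPy sketch handles the 1D case only. Several details are assumptions not stated in the text: the Fourier features follow a common Perceiver-style recipe (sine and cosine of linearly spaced frequencies up to max_freq / 2, plus the raw position), `freq_base` is not used, the batch dimension is dropped, and the padding is computed so that every modality reaches `max_modality_dim` before the one-hot code is appended. The names `fourier_encode_1d` and `data_preprocess` are hypothetical.

```python
import numpy as np

def fourier_encode_1d(seq_len, max_freq, num_freq_bands):
    """Assumed Perceiver-style 1D Fourier features of size 2 * num_freq_bands + 1."""
    pos = np.linspace(-1.0, 1.0, seq_len)[:, None]             # (T, 1) positions in [-1, 1]
    freqs = np.linspace(1.0, max_freq / 2.0, num_freq_bands)   # (B,) frequency bands
    angles = np.pi * pos * freqs                               # (T, B)
    return np.concatenate([np.sin(angles), np.cos(angles), pos], axis=-1)

def data_preprocess(x, modality_index, num_modalities, max_modality_dim,
                    max_freq, num_freq_bands):
    """Concatenate tokens, zero padding, positional encoding, and one-hot modality code."""
    seq_len, input_dim = x.shape
    enc_pos = fourier_encode_1d(seq_len, max_freq, num_freq_bands)
    pad_width = max_modality_dim - (input_dim + enc_pos.shape[-1])
    if pad_width < 0:
        raise ValueError("max_modality_dim is smaller than this modality's dimension")
    padding = np.zeros((seq_len, pad_width))
    one_hot = np.zeros((seq_len, num_modalities))
    one_hot[:, modality_index] = 1.0
    return np.concatenate([x, padding, enc_pos, one_hot], axis=-1)

# Two hypothetical sequential modalities with different lengths and feature sizes.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 3))      # d_m = 3, positional dim 2*2+1 = 5  -> 8
series = rng.normal(size=(7, 6))    # d_m = 6, positional dim 2*4+1 = 9  -> 15
max_modality_dim = 15
text_tokens = data_preprocess(text, 0, 2, max_modality_dim, max_freq=10.0, num_freq_bands=2)
series_tokens = data_preprocess(series, 1, 2, max_modality_dim, max_freq=10.0, num_freq_bands=4)
```

Both outputs share the same feature dimension (here 15 + 2 = 17) while keeping their own sequence lengths, which is the property the unimodal encoder relies on.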

A.2 THE UNIMODAL ENCODER

The result of Algorithm 1 is then fed into the unimodal encoder layer. We display the details of the unimodal encoder layer in Figure 5. The sequence length $T$ differs across modalities, as $T$ can be $t_m$, $h_m \times w_m$, or $l_m \times h_m \times w_m$. However, the cross attention between the input sequence and the latent input converts the sequence lengths of the different modalities into the same value. For example, suppose the input modality sequence is $x \in \mathbb{R}^{T_m \times D}$ and the latent input is $z \in \mathbb{R}^{N \times C}$. After the three linear layers, we obtain $K, V \in \mathbb{R}^{T_m \times X}$ and $Q \in \mathbb{R}^{N \times X}$. Following the scaled dot-product attention $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{C}}\right)V$, the dimension after the attention is $\mathrm{Attention}(Q, K, V) \in \mathbb{R}^{N \times X}$. Therefore, the sequence length of the output depends on the sequence length of the latent input and the

we construct a same capacity model where we ×4 the number of attention heads, ×8 the dimension of each attention head, and ×32 the hidden dimension of the MLP layer.

We find that the single-router is the best architecture for M 2 TL. The second-best architecture uses the task-specific router in the self-attention layer and the dense layer in the FFN layer. Meanwhile, using the modality-specific router in the self-attention layer and the task-specific router in the FFN layer also seems a reasonable choice. For better understanding, we display the architecture of the Multi-Router M 3 SAT and the R-Multi-Router M 3 SAT in Figure 6 and Figure 7, respectively.

B.1 USING CONSECUTIVE M 3 SAT

This section illustrates how to use consecutive M 3 SAT layers as the transformer backbone, and provides more observations on how to use M 3 SAT as the network gets deeper. Our experimental results in Table 7 show:

• The performance may not improve as the number of M 3 SAT layers increases.
• The location of M 3 SAT matters. Using M 3 SAT in shallow layers helps the most.
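Returning to the unimodal encoder of Appendix A.2, the length-normalizing effect of the cross attention can be checked with a short NumPy sketch. It is a single-head illustration under assumptions: the weight names and sizes are hypothetical, and the scores are scaled by the square root of the projection dimension, which may differ from the constant used in the paper's formula.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, latent, w_q, w_k, w_v):
    """Queries come from the latent input, keys and values from the modality sequence."""
    q = latent @ w_q                           # (N, X)
    k = x @ w_k                                # (T_m, X)
    v = x @ w_v                                # (T_m, X)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T_m); scaling choice is an assumption
    return softmax(scores) @ v                 # (N, X): length set by the latent input

# Hypothetical sizes: input dim D, latent dim C, projection dim X, latent length N.
rng = np.random.default_rng(0)
D, C, X, N = 17, 12, 16, 4
latent = rng.normal(size=(N, C))
w_q = rng.normal(size=(C, X))
w_k = rng.normal(size=(D, X))
w_v = rng.normal(size=(D, X))
short_out = cross_attention(rng.normal(size=(5, D)), latent, w_q, w_k, w_v)
long_out = cross_attention(rng.normal(size=(11, D)), latent, w_q, w_k, w_v)
```

Sequences of length 5 and 11 both come out with N rows, so a single shared block can serve modalities of any length.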

C EXPERIMENTS DETAILS

We show the number of parameters and the computation cost of the current SOTA and M 3 SAT in Figure 8 .

C.1

PUSH Lee et al. (2020a), i.e., the MUJOCO PUSH task, is a planar pushing task in which a 7-DoF Franka Panda robot pushes a circular puck with its end-effector in simulation. We estimate the 2D position of the unknown object on a table surface while the robot intermittently interacts with the object. This dataset contains 1000 training data, 10 validation data, and 100 testing data, where each data point is split into 29 sequences, and each sequence includes 16 consecutive steps.

V&T Lee et al. (2020b), also called 'VISION&TOUCH', is a real-world robot manipulation dataset that collects visual, force, and robot proprioception data for a peg insertion task, in which the robot inserts the peg into the hole. In this paper, we use this dataset to predict whether the manipulator will contact the peg in the next step, which is a binary classification task. We follow the setting of MultiBench and use 117,600 data points for training and the remaining 29,400 data points for validation and testing.

ENRICO Leiva et al. (2020) includes 20 Android app design categories. Each data point consists of the app screenshot and the view hierarchy. The view hierarchy describes the spatial and structural layout of the UI elements of the corresponding screenshot. During training, the view hierarchy is rendered as a "wireframe", which can be viewed as a form of set data. ENRICO contains 947 data points for training, 219 data points for validation, and 292 data points for testing.

AV-MNIST Vielzeuf et al. (2018) is a multimedia dataset that uses audio and image information to classify the digit into one of 10 classes (0-9). This dataset comprises 55,000 training data points, 5,000 validation data points, and 10,000 testing data points.

UR-FUNNY is a multi-modal affective computing dataset for humor detection in human speech. Each data point of UR-FUNNY is a video with text, visual, and acoustic modalities. We train on this dataset to predict whether the current data point makes people feel positive or negative. There are 1,166, 300, and 400 videos in the train, valid, and test data, respectively.

MOSEI Zadeh et al. (2018) is the largest dataset of sentence-level sentiment analysis and emotion recognition in real-world online videos. Each video is annotated for 9 discrete emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed, and neutral) and continuous emotion values (valence, arousal, and dominance). We follow MultiBench and train on this dataset as a binary classification task. We use 16,265, 1,869, and 4,643 train, valid, and

We conduct ablation studies on the number of selected experts $K$ and the total number of experts $N$ for the medium setting. From Table 9, we can observe that the performance increases with $N$. However, increasing the total number of experts requires more memory resources.

Increasing the number of selected experts $K$ can improve performance to some extent, but too large a $K$ prevents the parameters from getting enough training, which decreases performance. Appropriate values of $N$ and $K$ are crucial for M 2 TL performance.

Before we input tokens into our transformer backbone (several consecutive transformer encoder layers), we concatenate the tokens on the sequence dimension. Therefore, we can fuse different modalities through the attention layer within each transformer encoder layer. To further illustrate that such an operation is necessary, we additionally train the same model but concatenate tokens along the batch axis. Our results in Table 10 show that fusing modalities by concatenating tokens along the sequence axis is beneficial for our tasks.

Prior works Fedus et al. (2022); Zhu et al. (2022) also apply MoE in the attention layer. However, they all use a single router to route tokens for q, k, and v simultaneously. We think such a design lacks flexibility. Therefore, in our MoE attention layer, the routers for q, k, and v are separate, which provides a more flexible attention mechanism. To support the above statement, we conduct additional experiments in Table 11 to study the advantage of M 3 SAT versus the prior MoE attention design style (q, k, and v using the same router in the MoE attention).

C.5 THE GRADIENT POSITIVE SIGN PURITY OF M 3 SAT

The Gradient Positive Sign Purity Chen et al. (2020b) P of a single parameter for T tasks is defined as: P = 1 2 (1 + T i ∆L i T i |∆L i | ), where ∆L i is the gradient for the task i. The Gradient Positive Sign Purity is bounded by [0, 1], which P close to 1 or 0 indicates such parameters suffer less gradient confliction from multi-task training. We use the trained model to collect the Gradient Positive Sign Purity of such model. Then we discrete the Gradient Positive Sign Purity value into five intervals of each parameter and count the ratio of parameters in these five intervals. 2021) is defined as follows: Z t i→j = 1 - L j (X t , θ t+1 s|i , θ t j ) L j (X t , θ t s , θ t j ) , where X t is the training batch at time-step t, θ t+1 s|i is the updated shared parameters after a gradient step with respect to the task i. θ t j represents the task j's specific parameters. Considering the imbalance between datasets in the MultiBench, we set the size of X t is the training data size of such task t. Therefore, for the medium setting, we collect the task affinity by solitary training the 'PUSH' task for an single epoch, then we calculate the loss of 'ENRICO' and 'AV-MNIST' on the corresponding training data. We count the task affinity from 'PUSH' to 'ENRICO' and 'AV-MNIST' every 10 epochs during training. We display the task affinity changes with training epochs in Figure 8 . The task affinity of M 3 SAT and multi-router M 3 SAT is usually higher than the one of the dense model which indicates that the MoE we proposed alleviates the training conflict of MTL. C.7 THE OPTIMAL GRADIENT BLEND OF M 3 SAT The optimal gradient blend Wang et al. ( 2020) is used to re-weight the feature of each modality during multi-modal training. The optimal gradient blend will give this modality a small weight for the modality that is easy to prone to overfitting. The weight of each modality is bounded by [0, 1] within a task, and the sum of all modalities for this task is 1. 
Therefore, the gap between different modalities within a task indicates that the modality with a smaller weight (optimal gradient blend) tends to overfit. We collect the optimal gradient blend of the corresponding trained model to determine whether our proposed model can restrain the easier modality from overfitting. We use a modified version of the optimal gradient blend, where the unnormalized optimal gradient blend of modality $m$ is defined as:

$$w^{m,n}_{\mathrm{unnorm}} = \frac{L^{m}_{\mathrm{valid}}}{L^{m}_{\mathrm{valid}} - L^{m}_{\mathrm{train}}},$$

where $L^{m}_{\mathrm{valid}}$ is the validation loss after training for $n$ epochs using only modality $m$, and $L^{m}_{\mathrm{train}}$ is the training loss after training for $n$ epochs using only modality $m$. For task $i$, the final optimal gradient blend we report is:

$$w_{i,m} = \frac{w^{m,n}_{\mathrm{unnorm}}}{\sum_{m'=1}^{M} w^{m',n}_{\mathrm{unnorm}}},$$



Figure 2: The detailed architecture of the M 3 SAT Encoder Layer: the Sparse MoE Self-attention and the Sparse MoE FFN. The Sparse MoE Self-attention consists of three Self-attention MoEs that compute the value, key, and query, respectively. Note that each expert of the Self-attention MoE is composed of a single linear layer. The Sparse MoE FFN is the same as the vanilla MoE layer.
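To make the structure described in this caption concrete, the following is a minimal pure-Python sketch of a sparse MoE whose experts are single linear layers, applied independently to the query, key, and value projections. The function names (`moe_linear`, `moe_qkv`), the softmax gate, and the renormalization of the top-k gate values are illustrative assumptions, not details confirmed by the paper.

```python
import math


def matvec(weight, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_linear(x, router_weight, expert_weights, k):
    """Route one token x to its top-k linear experts and mix their outputs.

    Assumption: gates are softmax probabilities renormalized over the
    selected experts; the paper's exact gating function may differ.
    """
    probs = softmax(matvec(router_weight, x))
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in top)
    out = [0.0] * len(expert_weights[0])  # output dim = rows of one expert
    for e in top:
        gate = probs[e] / norm
        y = matvec(expert_weights[e], x)  # each expert is one linear layer
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top


def moe_qkv(x, params, k=1):
    """Compute query, key, and value with three independent MoEs."""
    return {
        name: moe_linear(x, params[name]["router"], params[name]["experts"], k)[0]
        for name in ("query", "key", "value")
    }
```

Only the k selected experts are evaluated per token, which is what keeps the per-token computation small relative to a dense layer with the same total number of parameters.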

Figure 3: In the multi-router version of the M 3 SAT encoder layer, we allow the Self-attention MoE and the Sparse MoE FFN to use either a task-specific router network or a modality-specific router network. With the task-specific router, each task owns its router network; with the modality-specific router, each modality owns its router network.
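The router choice described in this caption can be sketched as a simple lookup. The snippet below is an illustrative assumption about how the router for a token might be selected; the dictionary layout and the names `select_router`, `MULTI_ROUTER`, and `R_MULTI_ROUTER` are hypothetical. The two per-layer configurations follow the captions of the Multi-router and Reverse Multi-router variants.

```python
def select_router(routers, mode, task, modality):
    """Return the router parameters used for one token in one sub-layer."""
    if mode == "single":
        return routers["single"]              # one router shared by all tokens
    if mode == "task":
        return routers["task"][task]          # each task owns its router
    if mode == "modality":
        return routers["modality"][modality]  # each modality owns its router
    raise ValueError(f"unknown router mode: {mode}")


# Router mode per sub-layer for the two multi-router variants.
MULTI_ROUTER = {"self_attention": "modality", "ffn": "task"}
R_MULTI_ROUTER = {"self_attention": "task", "ffn": "modality"}


def routers_for_token(routers, variant, task, modality):
    """Pick the self-attention and FFN routers for one token."""
    return {
        sub_layer: select_router(routers[sub_layer], mode, task, modality)
        for sub_layer, mode in variant.items()
    }
```

Keeping the selection separate from the expert computation means the same experts can be shared across tasks and modalities while only the routing decision is specialized.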

Figure 4: The distribution of the Gradient Positive Sign Purity (left), and the inter-task affinity of the 'ENRICO' to the 'PUSH' task (right).
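The purity distribution reported in this figure can be reproduced in outline as follows. This is a minimal sketch under stated assumptions: gradients are given as one list of per-task scalars per parameter, the five intervals are taken to be equal-width bins on [0, 1], and a parameter whose gradients are all zero is assigned a purity of 0.5. None of these choices is specified in the text.

```python
def gradient_sign_purity(task_grads):
    """P = 0.5 * (1 + sum(g_i) / sum(|g_i|)) for one scalar parameter."""
    total_abs = sum(abs(g) for g in task_grads)
    if total_abs == 0.0:
        # Assumption: no gradient signal at all is treated as neutral.
        return 0.5
    return 0.5 * (1.0 + sum(task_grads) / total_abs)


def purity_histogram(per_param_task_grads, num_bins=5):
    """Fraction of parameters whose purity falls into each bin on [0, 1]."""
    counts = [0] * num_bins
    for grads in per_param_task_grads:
        p = gradient_sign_purity(grads)
        # Clamp so that p == 1.0 is counted in the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        counts[idx] += 1
    n = len(per_param_task_grads)
    return [c / n for c in counts]
```

Parameters whose task gradients all share a sign land in the first or last bin, while parameters with strongly conflicting gradients land in the middle bin.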

Figure 7: In the Reverse Multi-router M 3 SAT (R-Multi-router M 3 SAT) encoder layer, we use the task-specific router in the Self-attention layer and the modality-specific router in the FFN layer.

INDEPENDENT ROUTING POLICY BETWEEN Q, K, AND V Prior works Fedus et al. (2022); Zhu et al. (

Figure 8: The inter-task affinity of the 'ENRICO' to the 'AV-MNIST' task (left), and the inter-task affinity of the 'ENRICO' to the 'PUSH' task (right). The results reported are the average of three replicates.
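The inter-task affinity plotted in this figure compares the loss of task j before and after the shared parameters take a gradient step on task i. The sketch below illustrates that ratio only; the function `inter_task_affinity`, its argument names, and the toy squared-error loss are hypothetical and stand in for the actual model and training loop.

```python
def inter_task_affinity(loss_j, batch, shared_before, shared_after_i, task_j_params):
    """Z_{i->j} = 1 - L_j(after step on task i) / L_j(before the step)."""
    loss_before = loss_j(batch, shared_before, task_j_params)
    loss_after = loss_j(batch, shared_after_i, task_j_params)
    return 1.0 - loss_after / loss_before


def toy_loss(batch, shared_scale, task_bias):
    """Toy task loss (assumption): mean squared error of scale * x + bias."""
    errors = [(shared_scale * x + task_bias - y) ** 2 for x, y in batch]
    return sum(errors) / len(errors)


# Example: the update on task i moves the shared scale from 1.0 to 1.5.
example_batch = [(1.0, 2.0), (2.0, 4.0)]
example_affinity = inter_task_affinity(toy_loss, example_batch, 1.0, 1.5, 0.0)
```

A positive value means the step taken for task i also reduced the loss of task j, and a negative value means it increased that loss.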

Figure 9: The token distributions of the small setting of the first M 3 SAT layer. The first two rows show the token distribution of different modalities for the 'PUSH' dataset, and the 'V&T' dataset. The last row shows the token distribution across different tasks within the self-attention key layer, the self-attention query layer, the self-attention value layer, and the FFN layer.

Figure 10: The token distributions of the medium setting of the first M 3 SAT layer. The first three rows show the token distribution of different modalities for the 'ENRICO' dataset, the 'AV-MNIST' dataset, and the 'PUSH' dataset. The last row shows the token distribution across different tasks within the self-attention key layer, the self-attention query layer, the self-attention value layer, and the FFN layer.

Figure 11: The token distributions of the large setting of the first M 3 SAT layer. The first four rows show the token distribution of different modalities for the 'AV-MNIST' dataset, the 'MOSEI' dataset, the 'UR-FUNNY' dataset, and the 'MIMIC' dataset. The last row shows the token distribution across different tasks within the self-attention key layer, the self-attention query layer, the self-attention value layer, and the FFN layer.

modalities, such as language and vision understanding. MaTL Strezoski et al. (2019) enables structured deterministic sampling of multiple sub-architectures within a single model for multiple vision tasks. Søgaard & Goldberg (2016) design an MTL model with bi-RNNs for vision tasks. On the other hand, recent work on multi-modal learning favors Transformer-based models for learning general-purpose models over two or three modalities, typically language, vision, and audio Ramesh et al. (2022); Saharia et al. (2022); Agrawal et al. (2017); Yang et al. (2016); Dai et al. (2022). Based on the vanilla text-based Transformer model Vaswani et al. (2017), many multi-modal extensions use full self-attention over modalities concatenated across the sequence dimension Su et al. (2020); Chen et al. (2020a) or a cross-modal attention layer Tan & Bansal (2019); Tsai et al. (2019). Several works such as Perceiver Jaegle et al. (2021), ViT-BERT Li et al. (2021), and PolyViT Likhosherstov et al. (2021) have investigated the potential of using the same unimodal encoder architecture for different modalities. Moreover, multiple works have endeavored to build a single model that works well on multiple multi-modal tasks (i.e., multi-modal multi-task learning) Su et al. (2020); Cho et al. (2021); Hu & Singh (2021); Lu et al. (2019); Akbari et al. (2021). VATT Akbari et al. (2021) introduces a shared model on video, audio, and text data to perform audio-only, video-only, and image-text retrieval tasks. VL-BERT Su et al. (2020) investigates a simple yet powerful pre-trainable generic representation for visual-linguistic tasks. UniT Hu & Singh (2021) uses a single model for several vision-and-language tasks. HighMMT Liang et al.



We follow the setting of HighMMT Liang et al. (2022), which uses three multi-modal multi-task training setups to evaluate the performance of M 3 SAT. These setups include tasks with different modality inputs, prediction objectives, research areas, and dataset sizes.

We compare the performance of our model, HighMMT (the state-of-the-art multi-modal multi-task learning method on the MultiBench benchmark), and all 20 models implemented in the benchmark in three different training settings. We show that our model outperforms the HighMMT model in most tasks.

Comparison of routing networks. To explore the effects of different routing networks, we consider the influences of task-specific routing networks and modality-specific routing networks in the self-attention layer and the FFN layer separately. We also investigate the combinations between the multi-routing network and the single-routing networks in Appendix B.

The optimal gradient blend for each task under different model architectures.
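The blend weights reported in this table follow from the per-modality validation and training losses. The following sketch shows the computation for one task under stated assumptions: losses are passed as a hypothetical mapping from modality name to a (validation loss, training loss) pair, and a non-positive gap between validation and training loss is rejected, a case the text does not discuss.

```python
def unnormalized_blend(valid_loss, train_loss):
    """w_unnorm = L_valid / (L_valid - L_train) for one modality."""
    gap = valid_loss - train_loss
    if gap <= 0.0:
        # Assumption: the formula is only used when validation loss
        # exceeds training loss.
        raise ValueError("validation loss must exceed training loss")
    return valid_loss / gap


def blend_weights(losses_by_modality):
    """Normalize the per-modality weights of one task so they sum to 1."""
    unnorm = {
        modality: unnormalized_blend(valid, train)
        for modality, (valid, train) in losses_by_modality.items()
    }
    total = sum(unnorm.values())
    return {modality: w / total for modality, w in unnorm.items()}


# Example with made-up losses: "image" has the larger generalization gap.
example_weights = blend_weights({"image": (1.0, 0.5), "audio": (1.0, 0.8)})
```

In this made-up example the modality with the larger gap between validation and training loss receives the smaller weight (about 2/7 versus 5/7), matching the reading that a small weight signals a tendency to overfit.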

The results of different MoE router settings in the medium setting.

Task performances of different models. M 3 SAT 2/3/4 layers: 2/3/4 transformer encoder layers, with every other layer replaced by an M 3 SAT layer. P-M 3 SAT 2/3/4 layers: 2/3/4 consecutive M 3 SAT layers. M 3 SAT early/middle/late-2: encoder layers in which the early/middle/late-2 encoder layers are replaced with two M 3 SAT layers.

In the Multi-router M 3 SAT encoder layer, we use the modality-specific router in the Self-attention layer and the task-specific router in the FFN layer.

Detailed results of parameter and computation cost.

test data points, respectively.

Ablation Effects of the number K of selected experts per token and the total number N of experts.

Concatenate tokens along the batch axis.

Using a single router to route tokens for q, k, and v simultaneously.

ENRICO ↑ PUSH ↓ AV-MNIST ↑ ∆(%) ↑

Modality

feature dimension depends on the unimodal encoder's hidden size, which is independent of the shape of the input modality sequence. The hidden dimension of the self-attention encoder layer equals that of the preceding cross-attention layer.

A.3 THE MODEL AND TRAINING SETUPS

We list hyperparameters for the training and the model in Table 13, Table 14 and

For our proposed M 3 SAT, we can use a single-router MoE, a multi-router MoE, or a dense network in each of the self-attention and FFN layers. The multi-router MoE can further be divided into the modality-specific multi-router MoE and the task-specific multi-router MoE. Therefore, we explore all possible combinations of the above settings in the self-attention and FFN layers. We list all explored network architectures in Table 5. We run these network architectures in the medium setting and report the results in Table 6. All results reported in Table 6 use the same hyperparameters as in Table 14, except for the routing network setting. In particular, the 'Dense Model' is an equal-computation dense model, of which we propose two kinds: 'Dense Model 1' uses the transformer encoder layer with double depth, and 'Dense Model 2' uses a hidden dimension 4x wider than that of the transformer encoder layer. To further illustrate that our performance gains mainly come from our M 3 SAT design,

where M is the number of modalities of the task i.

For M 2 TL, an appropriate combination of modality-specific routers and task-specific routers (multi-router M 3 SAT) works better than purely using one of them: in Figure 8 and Table 12, the inter-task affinity and the optimal gradient blend of multi-router M 3 SAT are better than those of models which only use modality-specific routers (P-Modality-gate M 3 SAT) or task-specific routers (P-Task-gate M 3 SAT).

C.8 EXPERT DISTRIBUTION

This section explores how the routing policy of M 3 SAT distributes tokens across different tasks and modalities. We show the routing distributions on the test data in Figure 9, Figure 10, and Figure 11. In these three settings, our routers work well, and most experts handle all modalities and tasks. Meanwhile, several experts focus on specific tasks. For the large setting, we find that the routing policy tends to route tokens to several specific experts, which also suggests that the MoE separates parameters with conflicting gradients in MTL. In particular, for the 'MIMIC' dataset, only 2 to 4 experts are activated for this task. In Figure 9, Figure 10, and Figure 11, we also denote the FFN layer as the MLP layer.
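A token distribution of this kind can be tallied from logged routing decisions. The sketch below assumes that each routed token is recorded as a (group, expert index) pair, where the group is a task or modality name; the function name `expert_distribution` and this logging format are hypothetical.

```python
from collections import Counter, defaultdict


def expert_distribution(assignments, num_experts):
    """Return, per group, the fraction of its tokens sent to each expert."""
    counts = defaultdict(Counter)
    for group, expert in assignments:
        counts[group][expert] += 1
    return {
        group: [c[e] / sum(c.values()) for e in range(num_experts)]
        for group, c in counts.items()
    }


def active_experts(distribution, threshold=0.0):
    """Per group, list the experts receiving more than `threshold` of tokens."""
    return {
        group: [e for e, frac in enumerate(fracs) if frac > threshold]
        for group, fracs in distribution.items()
    }
```

Applying `active_experts` to the per-task distribution gives a direct count of how many experts a task actually uses, which is the quantity referred to when noting that only a few experts are activated for one dataset.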

