VARIATIONAL MULTI-TASK LEARNING

Abstract

Multi-task learning aims to improve the overall performance of a set of tasks by leveraging their relatedness. When training data is limited, using priors is pivotal, but currently this is done in ad-hoc ways. In this paper, we develop variational multi-task learning (VMTL), a general probabilistic inference framework for simultaneously learning multiple related tasks. We cast multi-task learning as a variational Bayesian inference problem, which enables task relatedness to be explored in a principled way by specifying priors. We introduce Gumbel-softmax priors to condition the prior of each task on related tasks. Each prior is represented as a mixture of the variational posteriors of other related tasks, and the mixing weights are learned in a data-driven manner for each individual task. The posteriors over representations and classifiers are inferred jointly for all tasks, and individual tasks improve their performance by using the shared inductive bias. Experimental results demonstrate that VMTL tackles challenging multi-task learning with limited training data well, achieving state-of-the-art performance on four benchmark datasets and consistently surpassing previous methods.

1. INTRODUCTION

Multi-task learning (Caruana, 1997) is a fundamental learning paradigm for machine learning, which aims to simultaneously solve multiple related tasks, improving the performance of the individual tasks by sharing knowledge. The crux of multi-task learning is how to explore task relatedness (Argyriou et al., 2007; Zhang & Yeung, 2012), which is non-trivial since the underlying relationship among tasks can be complicated and highly nonlinear. This has been extensively investigated in previous work by learning shared features, designing regularizers imposed on parameters (Pong et al., 2010; Kang et al., 2011; Jawanpuria et al., 2015; Jalali et al., 2010) or exploring priors over parameters (Heskes, 2000; Bakker & Heskes, 2003; Xue et al., 2007; Zhang & Yeung, 2012; Long et al., 2017). Recently, deep neural networks have been developed to learn shared representations in the feature layers while keeping the classifier layers independent (Yang & Hospedales, 2016; Hashimoto et al., 2016; Ruder et al., 2017). It would be beneficial to learn representations and classifiers jointly by fully leveraging the shared knowledge of related tasks, which however remains an open problem. In our work, we consider a particularly challenging setting, where each task contains limited training data. Even more challenging, we have only a handful of related tasks to gain shared knowledge from. This is in stark contrast to few-shot learning (Gordon et al., 2018; Finn et al., 2017; Vinyals et al., 2016), which also suffers from limited data for each task, but usually has a large number of related tasks. Therefore, in our scenario, it is difficult to learn a proper model for each task independently without overfitting (Long et al., 2017; Zhang & Yang, 2017), and it is crucial to leverage the inductive bias (Baxter, 2000) provided by the other related tasks that are learned simultaneously.
To do so, we employ the Bayesian framework as it is able to deliver uncertainty estimates on predictions and automatic model regularization (MacKay, 1992; Graves, 2011), which makes it well suited for multi-task learning with limited training data. The major motivation of our work is to leverage the Bayesian learning framework to handle the challenge of limited data in multi-task learning. In this paper, we introduce variational multi-task learning (VMTL), a novel variational Bayesian inference approach that explores task relatedness in a principled way. In order to fully utilize the shared knowledge from related tasks, we explore task relationships in both the feature representation and the classifier by placing prior distributions over them in a Bayesian framework. Thus, multi-task learning is cast as a variational inference problem for feature representations and classifiers jointly. The introduced variational inference allows us to specify the priors by conditioning on the variational posteriors of related tasks. To further leverage the shared knowledge from related tasks, we introduce the Gumbel-softmax prior for each task, which is a mixture of the variational posteriors of other related tasks. We apply the Gumbel-softmax technique (Jang et al., 2016) to learn the mixing weights jointly with the probabilistic modelling parameters by back-propagation. The Gumbel-softmax priors are incorporated into the inference of the posteriors over representations and classifiers, which enables them to leverage the shared knowledge. We validate the effectiveness of the proposed VMTL by extensive evaluation on four challenging benchmarks for multi-task learning. The results demonstrate the benefit of variational Bayesian inference for modeling multi-task learning. VMTL consistently outperforms previous methods in terms of the average accuracy over all tasks.

2. METHODOLOGY

In this work, we tackle the challenging multi-task learning setting where only a few training samples are available for each task, and only a limited number of related tasks are available to share knowledge with. We investigate multi-task learning under the Bayesian learning framework, where we learn the task relationship in a principled way by exploring priors. We cast multi-task learning as a variational inference problem, which offers a unified framework to learn task relatedness in both feature representations and task-specific classifiers.

2.1. MULTI-TASK VARIATIONAL INFERENCE

In our setting the tasks are classification problems which share the same label space, but where the samples are drawn from different data distributions. Given a set of related tasks {D_t}_{t=1}^T, where each task D_t = {(x_t^n, y_t^n)}_{n=1}^{N_t} and N_t is the number of training samples in the t-th task, the goal under this setting is to predict the label y of a test sample x for all tasks simultaneously, using the shared information extracted from other related tasks. The main challenge is the limited number of labeled samples for each task, which makes it difficult to learn a proper model for each task independently (Long et al., 2017; Zhang & Yang, 2017).

Under this multi-task learning setting, we consider the Bayesian treatment. For a single task without knowledge sharing from related tasks, we place a prior over its classifier parameters w, which gives rise to the following data log-likelihood to maximize:

log p(D) = log ∫ p(D|w) p(w) dw.  (1)

For multi-task learning, we solve T tasks simultaneously with knowledge sharing among tasks. Thus, after observing data from all T tasks, the true posterior p(w_t|D_t) of a single task t becomes p(w_t|D_{1:T}). Using Bayes' rule, we have the posterior for task t as follows:

p(w_t|D_{1:T}) ∝ p(w_t) ∏_{i=1}^T p(D_i|w_i) ∝ p(w_t|D_{1:T}\D_t) p(D_t|w_t).  (2)

We introduce a variational distribution q(w_t; θ_t), parameterized by θ_t, for the current task t to approximate the true posterior, which involves minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior:

θ_t* = argmin_{θ_t} D_KL[ q(w_t; θ_t) || p(w_t|D_{1:T}) ].  (3)

Generally, the approximate posterior is defined as a fully factorized Gaussian distribution, i.e., q(w_t; θ_t) = N(μ_t, σ_t²) (Nguyen et al., 2017; Kingma et al., 2015; Molchanov et al., 2017). By applying Eq. (2) to (3) and extending them to all tasks, we obtain an evidence lower bound (ELBO) for multi-task learning:

L_ELBO(θ) = (1/T) Σ_{t=1}^T { E_{w_t∼q}[ log p(D_t|w_t) ] − D_KL[ q(w_t; θ_t) || p(w_t|D_{1:T}\D_t) ] },  (4)

where θ = {θ_t}_{t=1}^T is the set of parameters for all task-specific classifiers. We maximize the ELBO to optimize the model parameters for multi-task learning. It is worth noting that for multi-task learning the prior of each task is conditioned on the other tasks, which allows knowledge sharing between them, in contrast to single-task learning without knowledge sharing, where uninformative Gaussian priors are generally applied. This multi-task ELBO in Eq. (4) provides a general probabilistic inference framework that enables task relatedness to be explored in a principled way by leveraging the inductive bias provided by other related tasks (Ruder, 2017). For each task, the conditional prior p(w_t|D_{1:T}\D_t) serves as a regularizer for the posterior inference.

In order to fully leverage the shared knowledge from related tasks to improve each individual task, in addition to the classifiers, we would also like to share knowledge among the feature representations of samples from different tasks. To this end, we introduce the conditional prior p(z_t^n|x_t^n) over the feature representation z_t^n for each sample x_t^n in task t. In doing so, we are able to explore task relatedness among feature representations in a unified way, as we do for classifiers. We rewrite the log-likelihood as a sum over the marginal likelihoods of individual data points:

log p(D_t|w_t) = (1/N_t) Σ_{n=1}^{N_t} log p(y_t^n|w_t, x_t^n).  (5)

The variational Bayesian inference framework developed in Eq. (4) allows the latent representation to be seamlessly incorporated into the log-likelihood in Eq. (5).
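Both KL terms in this framework are divergences between diagonal Gaussians, which have a closed form. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    # summed over the last dimension.
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)
```

Minimizing this divergence against a prior built from the posteriors of other tasks is what pulls each task-specific posterior toward the shared solution.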
Under the assumption that w_t and z_t^n are conditionally independent, we obtain a new marginal conditional log-likelihood as follows:

log p(y_t^n|w_t, x_t^n) = log ∫ p(y_t^n, z_t^n|w_t, x_t^n) dz_t^n = log ∫ p(y_t^n|z_t^n, w_t) p(z_t^n|x_t^n) dz_t^n.  (6)

The posterior p(z_t^n|x_t^n, y_t^n) of the latent representation is intractable. Therefore, as in VAEs (Kingma & Welling, 2013), we again introduce a variational posterior, q(z_t^n|x_t^n; φ), which is conditioned on the sample x_t^n. Here φ denotes the parameters of the inference network for latent representations, which are optimized jointly with the other parameters of our model. By introducing q(z_t^n|x_t^n; φ) into Eq. (6) and applying Jensen's inequality, we have

log p(y_t^n|w_t, x_t^n) ≥ E_{z_t^n ∼ q}[ log p(y_t^n|z_t^n, w_t) ] − D_KL[ q(z_t^n|x_t^n; φ) || p(z_t^n|x_t^n) ].  (7)

After integrating Eq. (4) and Eq. (7), we obtain the following variational objective for multi-task learning:

L_VMTL(θ, φ) = (1/T) Σ_{t=1}^T { (1/N_t) Σ_{n=1}^{N_t} [ E_{q(w_t)} E_{q(z_t^n)}[ log p(y_t^n|z_t^n, w_t) ] − D_KL[ q(z_t^n|x_t^n; φ) || p(z_t^n|x_t^n) ] ] − D_KL[ q(w_t; θ_t) || p(w_t|D_{1:T}\D_t) ] }.  (8)

This objective provides a general probabilistic inference framework for multi-task learning, which allows us to jointly explore shared knowledge among representations and classifiers in a unified way by specifying priors. The detailed derivation of the ELBO for z and w is given in Appendix A.1. A graphical illustration of VMTL is shown in Fig. 1.

2.2. LEARNING TASK RELATEDNESS VIA GUMBEL-SOFTMAX PRIORS

The proposed variational multi-task inference framework enables the relationship among tasks to be explored by designing priors for both latent representations and classifiers. In Bayesian inference, priors serve as regularization, which provides a principled way of sharing information across multiple tasks. We introduce the Gumbel-softmax prior for each task, which is a mixture of the variational posteriors of other related tasks.

In the case of latent representations, we define the prior by using the other tasks' posteriors over latent representations:

p(z_t|x_t) := Σ_{i ∈ {1,...,T}\t} A_{ti} q(z_i|m_i; φ),  (9)

where m_i is the mean feature representation of samples from the same class in the i-th task. The mixing weight A_{ti} is defined as a binary value indicating whether two tasks are correlated or not. To enable learning this binary A_{ti} with back-propagation, we introduce the Gumbel-softmax technique (Jang et al., 2016):

A_{ti} = exp((log π_{ti} + g_{ti})/τ) / [ exp((log π_{ti} + g_{ti})/τ) + exp((log(1 − π_{ti}) + g'_{ti})/τ) ],  (10)

where g_{ti} and g'_{ti} are samples taken from a Gumbel distribution, obtained by inverse transform sampling: draw u ∼ Uniform(0, 1) and compute g = −log(−log(u)). π_{ti} is the learnable parameter in the Gumbel-softmax technique, denoting the probability that two tasks are correlated, and τ is the softmax temperature.

Likewise, we specify the prior over the classifier parameters w_t in the same way as for z_t:

p(w_t|D_{1:T}\D_t) := Σ_{i ∈ {1,...,T}\t} A^(w)_{ti} q(w_i; θ_i),  (11)

where A^(w)_{ti} is obtained in a similar way as in Eq. (10). Note that we use different mixing weights in designing the priors for representations and classifiers, which gave better results than a shared weight in our preliminary experiments. This is likely because the representations and classifiers leverage different correlation patterns among tasks.
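Eq. (10) is straightforward to implement. The sketch below draws one relaxed binary mixing weight in plain Python for illustration; in the actual model this would run inside an autodiff framework so gradients flow to the learnable π_{ti}:

```python
import math
import random

def sample_gumbel():
    # Inverse transform sampling: u ~ Uniform(0,1), g = -log(-log u).
    # Tiny epsilons guard the logarithms at the boundaries.
    u = random.random()
    return -math.log(-math.log(u + 1e-20) + 1e-20)

def relaxed_binary_weight(pi, tau):
    # Relaxed sample of the binary mixing weight A_ti in Eq. (10):
    # pi is the learnable probability that the two tasks are correlated,
    # tau is the softmax temperature.
    a = (math.log(pi) + sample_gumbel()) / tau
    b = (math.log(1.0 - pi) + sample_gumbel()) / tau
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)
```

As τ → 0 the samples concentrate on {0, 1}, recovering hard task selection; a larger τ yields softer weights with lower-variance gradients, which is why the temperature is annealed during training.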
It is worth mentioning that the fundamental assumption in multi-task learning is that tasks are related and there is positive transfer among them. Since tasks share the same label space in our setting, the case with only task interference would hardly occur. In the case of only task interference, the KL terms in Eq. (8) would degenerate to an ℓ2 regularization on the representations and classifiers.

2.3. AMORTIZED INFERENCE

We leverage an amortization technique (Gershman & Goodman, 2014), in which we amortize the computational cost of inferring the posterior of the latent representation, as done in VAEs (Kingma & Welling, 2013). Amortized inference can also be adopted to learn classifier parameters, similar to the probabilistic prediction in few-shot learning (Gordon et al., 2018). To this end, we design the variational posterior q(w_t|D_t) in a context-independent manner such that each weight vector w_t^c depends only on samples from class c of the current task t:

q(w_t|D_t) = ∏_{c=1}^C q(w_t^c|D_t^c) = ∏_{c=1}^C N(μ_t^c, diag((σ_t^c)²)),  (12)

where D_t^c denotes the samples from the c-th class, C is the size of the shared label space, and each posterior is parameterized as a diagonal Gaussian distribution. The amortized inference of these posteriors is implemented by multi-layer perceptrons (MLPs), whose parameters are jointly optimized in end-to-end learning. For a given task, we use amortized inference to generate the classifier weight for each specific class from the mean feature representation of that class; thus, the weights for different classes are drawn from different distributions. In contrast to the amortized inference for latent representations, the amortized classifier inference enables the cost to be shared across classes, which reduces the overall cost. It therefore offers an effective way to handle a large number of object classes and can still produce competitive performance even in the presence of the amortization gap (Cremer et al., 2018).
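A minimal sketch of the amortized classifier in Eq. (12): a single MLP, shared across classes, maps the mean feature of each class to the mean and log-variance of that class's weight vector. All layer sizes and names here are our own illustrative choices, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class AmortizedClassifier:
    """Shared inference network: class-mean feature (d,) -> (mu, logvar) of w_c."""
    def __init__(self, d, hidden=128):
        self.W1 = rng.normal(0.0, 0.05, (d, hidden))
        self.W2 = rng.normal(0.0, 0.05, (hidden, 2 * d))

    def posterior(self, class_means):            # class_means: (C, d)
        h = np.tanh(class_means @ self.W1)
        mu, logvar = np.split(h @ self.W2, 2, axis=-1)
        return mu, logvar                        # each of shape (C, d)

    def sample(self, class_means):
        # One classifier weight vector per class via the reparameterization trick.
        mu, logvar = self.posterior(class_means)
        return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

d, C = 16, 5
net = AmortizedClassifier(d)
w = net.sample(rng.normal(size=(C, d)))          # shape (C, d)
```

Because the same network serves every class, the inference cost no longer grows with the number of classifier distributions, which is the amortization benefit the section describes.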

2.4. EMPIRICAL OBJECTIVE FUNCTION

In our implementation of the variational objective for multi-task learning in Eq. (8), we adopt Monte Carlo sampling and obtain the empirical objective function:

L̂_VMTL(θ, φ) = (1/T) Σ_{t=1}^T { (1/N_t) Σ_{n=1}^{N_t} [ (1/(LM)) Σ_{ℓ=1}^L Σ_{m=1}^M log p(y_t^n|z_t^{n,(ℓ)}, w_t^{(m)}) − D_KL[ q(z_t^n|x_t^n; φ) || p(z_t^n|x_t^n) ] ] − D_KL[ q(w_t; θ_t) || p(w_t|D_{1:T}\D_t) ] },  (13)

where z_t^{n,(ℓ)} ∼ q(z_t^n|x_t^n; φ) and w_t^{(m)} ∼ q(w_t; θ_t). L and M are the numbers of Monte Carlo samples; in practice, both are set to 10, which performs well while being computationally efficient. We maximize this empirical objective to optimize the model's parameters: the log-likelihood term is implemented as the cross-entropy loss, and the KL terms can be computed in closed form. To sample from the variational posteriors, we adopt the reparameterization trick (Kingma & Welling, 2013). In the posterior inference of classifiers without amortization, we use the local reparameterization trick to reduce the gradient variance (Kingma et al., 2015). The priors p(z_t^n|x_t^n) and p(w_t|D_{1:T}\D_t) are implemented with the Gumbel-softmax priors provided in Eq. (9) and Eq. (11), respectively. For amortized classifiers, the variational posterior q(w_t|D_t) is implemented using Eq. (12).
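Putting the pieces together, the empirical objective averages the log-likelihood over L samples of z and M samples of w, then subtracts the two KL terms. A NumPy sketch for a single task, with all names hypothetical and averaging conventions simplified relative to Eq. (13); a real implementation would use an autodiff framework so the reparameterized samples carry gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=-1, keepdims=True))

def kl_diag(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL between diagonal Gaussians, summed over all elements.
    vq, vp = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (vq + (mu_q - mu_p) ** 2) / vp - 1.0)

def task_bound(z_post, w_post, z_prior, w_prior, y, L=10, M=10):
    # Monte Carlo estimate of the per-task bound. Each *_post / *_prior is a
    # (mu, logvar) pair; z has shape (N, d), w has shape (C, d).
    (mu_z, lv_z), (mu_w, lv_w) = z_post, w_post
    n = np.arange(len(y))
    ll = 0.0
    for _ in range(L):
        z = mu_z + np.exp(0.5 * lv_z) * rng.standard_normal(mu_z.shape)      # sample z
        for _ in range(M):
            w = mu_w + np.exp(0.5 * lv_w) * rng.standard_normal(mu_w.shape)  # sample w
            ll += log_softmax(z @ w.T)[n, y].mean()
    ll /= L * M
    return ll - kl_diag(mu_z, lv_z, *z_prior) - kl_diag(mu_w, lv_w, *w_prior)
```

The negative of this bound is the training loss; the log-likelihood term corresponds to the cross-entropy mentioned above.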

3. RELATED WORKS

Multi-task learning (Caruana, 1997) is a machine learning paradigm that aims to leverage shared knowledge from multiple related tasks to improve the generalization performance of all the tasks simultaneously. The core of multi-task learning is how to explore the task relationship, which has been extensively investigated in the literature. Early works design feature-based or parameter-based regularizations to explore task relationships (Liu et al., 2009; Obozinski et al., 2010; Pong et al., 2010; Jalali et al., 2010; Kang et al., 2011; Jawanpuria et al., 2015; Long et al., 2017; Zhang et al., 2020). In our work, we address multi-task learning in a probabilistic inference framework by casting it as a variational Bayesian inference problem. We explore task relationships in a principled way by specifying priors conditioned on other tasks, which enables the model to share knowledge among related tasks for learning both representations and classifiers.

4. EXPERIMENTS

We conduct experiments on four benchmark datasets for multi-task learning with limited training data. The results demonstrate the benefits of variational Bayesian approximation to representations and classifiers for multi-task learning. Our VMTL achieves the best performance and consistently surpasses previous methods. We provide more experimental results in the Appendix B.

4.1. DATASETS

We evaluate the proposed VMTL under a challenging multi-task learning setting, where each task has limited training data and only a handful of related tasks that can be learned simultaneously to leverage the shared knowledge. To benchmark our model against previous methods, we conduct experiments on four benchmark datasets, where tasks are defined as classification problems from distinct domains with a shared label space. Office-Home (Venkateswara et al., 2017) contains images from four domains/tasks: Artistic (A), Clipart (C), Product (P) and Real-world (R). Each task contains images from 65 object categories collected in office and home settings, with about 15,500 images in total. Office-Caltech (Gong et al., 2012) was constructed by selecting the ten categories shared between the Office-31 (Saenko et al., 2010) and Caltech-256 (Griffin et al., 2007) datasets. One task consists of data from Caltech-256 (C), and the other three tasks consist of data from Office-31, whose images were collected from three distinct domains/tasks: Amazon (A), Webcam (W) and DSLR (D). There are 8-151 samples per category per task, and 2,533 images in total.

Comparison Methods

We compare our method with single-task learning (STL) as well as a variational version of STL (VSTL), implemented by introducing variational Bayesian inference over representations and classifiers without using the task relationship (details can be found in Appendix A.2). We also define a basic multi-task learning (bMTL) model, a deep learning model with a shared feature extractor and task-specific classifiers. bMTL serves as a baseline to demonstrate the benefits of the probabilistic modeling of multi-task learning based on variational Bayesian inference. VMTL is our basic proposed method, and VMTL-AC is our proposed method with amortized classifiers.
We compare with multilinear relationship network (MRN) (Long et al., 2017) , which holds the best performance among previous methods on the four benchmarks.

Effectiveness in Handling Limited Data

We conduct ablation studies to demonstrate the benefits of the proposed variational multi-task learning in exploring task relatedness. We conduct experiments under a large range of train-test splits from 5% to 50%. The results on the Office-Home dataset for different tasks and the average accuracy of all tasks are illustrated in Fig. 2 . More detailed experimental results are provided in the Appendix B.1. The performance advantage of our VMTL is larger in the settings with less training data (≤ 20%), which demonstrates its effectiveness in handling challenging scenarios with very limited training data.

Effectiveness of Variational Bayesian Approximation

We investigate the effect of variational Bayesian inference for representations and classifiers separately. We conduct these experiments on the Office-Home dataset. The results with different train-test splits are shown in Table 1; more detailed experimental results are provided in Appendix B.2. As can be seen, both variational Bayesian representations and classifiers benefit performance. This benefit becomes more significant when training data is very limited, which indicates the effectiveness of leveraging shared knowledge by conditioning priors on related tasks in our VMTL. In addition, in Fig. 2, VSTL is shown to be a strong learning model, which again demonstrates the benefits of variational Bayesian representations and classifiers compared to STL.

Effectiveness of Gumbel-Softmax Priors

The introduced Gumbel-softmax priors provide an effective way to learn data-driven task relationships. To demonstrate their effectiveness, we compare with several alternatives, including the mean and the learnable weighted posteriors of other tasks. The comparison results are shown in Table 2. The proposed Gumbel-softmax priors perform best, consistently surpassing the alternatives. It is also worth noting that the advantage of Gumbel-softmax priors is even larger with very limited training data, e.g., 5%, a challenging scenario where it is crucial to leverage task relatedness.

Comparison with Other Methods

The comparison results on the small-scale Office-Home, Office-Caltech and ImageCLEF datasets and the large-scale DomainNet dataset are shown in Tables 3, 4, 5 and 6, respectively. The average accuracy over all tasks is used as the performance measure. The best average accuracies are marked in bold, and the second-best are underlined. Due to space limitations, we show the experimental results with more data accessible in Table 15 and Table 16 of Appendix B.4. A comprehensive comparison with more methods is also provided in Appendix B.4. Our VMTL consistently achieves the best performance on all small-scale and large-scale datasets under all the defined train-test split settings. It is worth highlighting that in the most challenging setting of 5% training data, VMTL shows a large performance advantage over the compared methods. Specifically, on ImageCLEF under the 5% setting, VMTL surpasses the second best by a margin of up to 2.3%. This demonstrates the effectiveness of VMTL in exploring relatedness to improve the performance of each task. In addition, our VMTL-AC also produces comparable performance and is better than most previous methods. It is worth mentioning that VMTL-AC has computational advantages, with faster convergence than VMTL due to the amortized learning, as shown in Fig. 3. Moreover, we found that VMTL-AC demonstrates good robustness against adversarial attacks. This could be because amortized learning uses the mean feature representations to generate classifiers, which is more robust to attacks; the detailed discussion is given in Appendix B.5. Finally, the improvement of VSTL over STL also indicates the benefits of variational Bayesian approximation for representations and classifiers.

5. CONCLUSION

In this paper, we address the multi-task learning problem and tackle a challenging setting where each task has a very limited amount of training data and only a handful of related tasks. To this end, we develop variational multi-task learning (VMTL), a general probabilistic inference framework for simultaneously learning multiple tasks. We cast multi-task learning as a variational inference problem, which enables task relationships to be explored in a principled way by specifying priors. Specifically, we introduce Gumbel-softmax priors, which offer an effective way to learn task relatedness in a data-driven manner for each task. We evaluate VMTL on four benchmark datasets for multi-task learning. The results demonstrate that VMTL consistently achieves performance better than or comparable to state-of-the-art multi-task learning approaches.

A DERIVATION

A.1 DERIVATION OF EVIDENCE LOWER BOUND FOR MULTI-TASK LEARNING

We provide a derivation of the evidence lower bound with z and w jointly. The log-likelihood of task t is conditioned on the data from other related tasks:

log p(y_t|x_t, D_{1:T}\D_t) = log ∫∫ p(y_t, z_t, w_t|x_t, D_{1:T}\D_t) dw_t dz_t.  (14)

Under the assumption that w_t and z_t are conditionally independent, we obtain

log p(y_t|x_t, D_{1:T}\D_t)
= log ∫∫ p(y_t|z_t, w_t) p(z_t|x_t) p(w_t|D_{1:T}\D_t) dw_t dz_t
= log ∫ [ ∫ p(y_t|z_t, w_t) p(z_t|x_t) dz_t ] p(w_t|D_{1:T}\D_t) dw_t
= log ∫ [ ∫ p(y_t|z_t, w_t) p(z_t|x_t) dz_t ] p(w_t|D_{1:T}\D_t) (q(w_t)/q(w_t)) dw_t
≥ −KL[q(w_t)||p(w_t|D_{1:T}\D_t)] + E_{q(w_t)}[ log ∫ p(y_t|z_t, w_t) p(z_t|x_t) dz_t ]
= −KL[q(w_t)||p(w_t|D_{1:T}\D_t)] + E_{q(w_t)}[ log ∫ p(y_t|z_t, w_t) p(z_t|x_t) (q(z_t|x_t)/q(z_t|x_t)) dz_t ]
≥ −KL[q(w_t)||p(w_t|D_{1:T}\D_t)] − KL[q(z_t|x_t)||p(z_t|x_t)] + E_{q(w_t)} E_{q(z_t|x_t)}[ log p(y_t|z_t, w_t) ].

A.2 DERIVATION OF EVIDENCE LOWER BOUND FOR SINGLE-TASK LEARNING

Generally, the proposed Bayesian inference framework, which infers the posteriors of representations z and classifiers w jointly, can be widely applied in other research fields. For simplicity, we introduce a variational version of single-task learning (VSTL) and provide the derivation of its evidence lower bound. It is worth noting that single-task learning does not share knowledge among tasks; thus, the log-likelihood for single-task learning is not conditioned on the data from other related tasks.

log p(y|x)
= log ∫∫ p(y, z, w|x) dw dz
= log ∫∫ p(y|z, w) p(w) p(z|x) dw dz
= log ∫ [ ∫ p(y|z, w) p(z|x) dz ] p(w) (q(w)/q(w)) dw
≥ −KL[q(w)||p(w)] + E_{q(w)}[ log ∫ p(y|z, w) p(z|x) dz ]
= −KL[q(w)||p(w)] + E_{q(w)}[ log ∫ p(y|z, w) p(z|x) (q(z|x)/q(z|x)) dz ]
≥ −KL[q(w)||p(w)] − KL[q(z|x)||p(z|x)] + E_{q(w)} E_{q(z|x)}[ log p(y|z, w) ].

Usually, the approximate posteriors q(w) and q(z|x) are defined as fully factorized Gaussian distributions. Due to the lack of extracted information offered by other related tasks, the priors p(w) and p(z|x) are set to a standard Gaussian distribution, as applied in (Sohn et al., 2015).

B.2 EFFECTIVENESS OF VARIATIONAL BAYESIAN APPROXIMATION

The comparison results on the performance of Bayesian approximation for representations z and classifiers w on the Office-Home, Office-Caltech and ImageCLEF datasets are shown in Tables 8, 9 and 10, respectively. Both variational Bayesian representations and classifiers benefit performance. We find that Bayesian classifiers in the variational inference framework contribute more to the performance than Bayesian representations, likely because Bayesian classifiers can better improve the model's discriminative ability. Our method jointly infers the posteriors over feature representations and classifiers in a Bayesian framework, and consistently outperforms its variants on all three benchmarks.

B.3 EFFECTIVENESS OF GUMBEL-SOFTMAX PRIORS

The performance comparison of the proposed VMTL with different priors on the Office-Home, Office-Caltech and ImageCLEF datasets is shown in Tables 11, 12 and 13, respectively. "Mean" denotes that the prior of the current task is the mean of the variational posteriors of other related tasks. "Learnable weighted" denotes that the weights for mixing the variational posteriors of other related tasks are learnable. Our Gumbel-softmax priors apply the Gumbel-softmax technique to learn the mixing weights, which introduces uncertainty into the task relationships in order to extract sufficient transferable information from other tasks. On all three datasets, the designed priors consistently outperform the alternatives.

B.4 A COMPREHENSIVE COMPARISON WITH OTHER METHODS

The comprehensive comparison includes additional state-of-the-art methods, such as multi-task feature learning (MTFL) (Argyriou et al., 2007).

B.5 ROBUSTNESS AGAINST ADVERSARIAL ATTACKS

We conduct experiments to show the robustness of our methods against adversarial attacks. In our experiments, the adversarial attack is implemented by the fast gradient sign method (Goodfellow et al., 2014), where ε denotes the noise level. We evaluate our proposed VMTL, VMTL-AC, and the basic multi-task learning model (bMTL) on the Office-Home dataset. As shown in Fig. 4, under different noise levels, VMTL outperforms bMTL. As the noise level increases, the variant VMTL-AC is more robust and significantly outperforms the baseline multi-task learning model.



Fig. 1. A graphical illustration of the proposed model, variational multi-task learning (VMTL). The two dashed lines show that the prior of the current task depends on the posteriors of other tasks for classifiers and representations. VMTL offers a principled way to explore task relationships: for each task, the priors over the classifiers and feature representations are conditioned on other tasks.

Obozinski et al. (2010) are the first to study the multi-task feature selection (MTFS) problem based on the ℓ_{2,1} norm. Liu et al. (2009) propose to use the ℓ_{∞,1} norm in the objective function to select important features. Kang et al. (2011) design multiple task clusters, aiming to minimize the squared trace norm of the classifier parameters in each cluster. Bayesian methods (Heskes, 2000; Bakker & Heskes, 2003; Yu et al., 2005; Xue et al., 2007; Guo et al., 2011; Titsias & Lázaro-Gredilla, 2011; Zhang & Yeung, 2012; Long et al., 2017) have been developed for multi-task learning under probabilistic frameworks, where the regularization usually corresponds to a prior. Heskes (2000) proposes a Bayesian neural network for multi-task learning and analyses it within an empirical Bayesian framework. Yu et al. (2005) investigate Gaussian processes for multi-task learning, assuming that all models are sampled from a common prior. Zhang & Yeung (2012) reformulate the ℓ_{p,q} norm regularizer as a matrix-variate generalized normal prior and utilize the prior information to explore task relations. Long et al. (2017) explore tensor normal distributions as priors over network parameters in different layers, which explicitly model the positive and negative relations across features and tasks. Xue et al. (2007) propose a non-parametric hierarchical Bayesian model to avoid the high complexity of model parameters, implemented with a deterministic inference method. Further, Qian et al. (2020) adopt a variational information bottleneck method (Alemi et al., 2016) with an uninformative prior distribution to improve the latent probabilistic representation.
An important conclusion drawn by Qian et al. (2020) is that, under adversarial attacks, variational latent representations are regularized and thereby expected to be more robust to noise than deterministic latent representations. The idea of conditioning priors on posteriors is also amenable to continual learning, in that the posterior of the previous task can be used as the prior of the current task to reduce catastrophic forgetting (Nguyen et al., 2017; Adel et al., 2019; Ebrahimi et al., 2019). In addition, Bragman et al. (2019) are the first to apply the Gumbel-softmax to learn task relatedness in multi-task learning. Although these methods are also applicable to the data setting in this work and achieve encouraging improvements, they under-perform with very limited training data.
Deep learning has recently been explored for multi-task learning (Misra et al., 2016; Yang & Hospedales, 2016; Hashimoto et al., 2016; Ruder et al., 2017; Long et al., 2017; Meyerson & Miikkulainen, 2017; Chen et al., 2018; Maziarz et al., 2019) by designing different deep architectures to explore task relationships. Liu et al. (2019) and Zhang et al. (2020) use a hard parameter-sharing encoder to extract shared representations, while learning a task-specific decoder to obtain pixel-level predictions. Based on soft parameter sharing, Misra et al. (2016) propose cross-stitch units to allow the model to leverage shared knowledge from another task. Gao et al. (2020) follow the soft parameter-sharing mechanism and incorporate neural architecture search into general-purpose multi-task learning. Meyerson & Miikkulainen (2017), Rosenbaum et al. (2017), Maziarz et al. (2019) and Strezoski et al. (2019) develop flexible soft-ordering approaches to enable more effective sharing among tasks. Instead of learning the structure of sharing, Kendall et al. (2018) propose to weight multiple loss functions by considering the homoscedastic uncertainty of each task. Chen et al. (2018) present a gradient normalization algorithm that automatically balances training in deep multi-task models by dynamically tuning gradient magnitudes.

Fig. 2. The performance under different proportions of training data on the Office-Home dataset.

Fig. 3. Training loss over iterations. VMTL-AC converges faster than VMTL, demonstrating the computational benefit of amortized learning.

Fig. 4. The performance for each task under different noise levels on the Office-Home dataset.

Effectiveness of variational Bayesian approximation for representations and classifiers.

Performance comparison of VMTL with different priors. The detailed results on each individual task are provided in Appendix B.3. Our Gumbel-softmax prior produces the best performance.

Zisserman, 2014) and apply the remaining model to extract the feature representation x for each sample, from which we infer the latent representation z using amortized inference by MLPs (Kingma & Welling, 2013). In our experiments, the temperature is annealed using the same schedule as in (Jang et al., 2016): we start with a high temperature and gradually anneal it to a small but non-zero value. For the KL divergence, we use the annealing scheme from (Bowman et al., 2015), increasing its weight at a rate of 1e-6 per iteration. The dimension of the latent variable is set to 512. We adopt the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 for training. All results are reported with 95% confidence intervals over five runs.
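The two annealing schedules described above can be sketched as follows. This is a minimal numpy illustration, not the paper's code: the function names, the exponential form of the temperature schedule, and its constants are assumptions; only the KL rate of 1e-6 per iteration and the high-to-small-but-non-zero temperature behavior come from the text.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    # Gumbel-softmax relaxation (Jang et al., 2016): perturb logits with
    # Gumbel(0, 1) noise, then apply a temperature-controlled softmax.
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + g) / tau
    y = np.exp(y - y.max())          # numerically stable softmax
    return y / y.sum()

def annealed_tau(step, tau0=1.0, rate=1e-4, tau_min=0.1):
    # Hypothetical schedule: start high, decay exponentially,
    # floor at a small but non-zero temperature.
    return max(tau0 * np.exp(-rate * step), tau_min)

def kl_weight(step, rate=1e-6):
    # KL annealing (Bowman et al., 2015): linear ramp, capped at 1.
    return min(1.0, rate * step)

rng = np.random.default_rng(0)
logits = np.log(np.array([0.2, 0.5, 0.3]))  # mixing weights over related tasks
weights = gumbel_softmax(logits, annealed_tau(step=0), rng)
```

At low temperature the samples approach one-hot vectors, so the prior effectively selects a single related task's posterior; at high temperature the mixture over related tasks stays smooth and gradients remain informative.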

Performance comparison of different methods on the Office-Home dataset for multiple tasks.



Performance comparison of different methods on the ImageCLEF dataset for multiple tasks.



\log p(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{D}_{1:T} \backslash \mathcal{D}_t) = \log \iint p(\mathbf{y}_t, \mathbf{z}_t, \mathbf{w}_t \mid \mathbf{x}_t, \mathcal{D}_{1:T} \backslash \mathcal{D}_t) \, d\mathbf{w}_t \, d\mathbf{z}_t \quad (14)

Under the assumption that \mathbf{w}_t and \mathbf{z}_t are conditionally independent, we therefore obtain

\log p(\mathbf{y}_t \mid \mathbf{x}_t, \mathcal{D}_{1:T} \backslash \mathcal{D}_t) = \log \iint p(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{w}_t) \, p(\mathbf{z}_t \mid \mathbf{x}_t) \, p(\mathbf{w}_t \mid \mathcal{D}_{1:T} \backslash \mathcal{D}_t) \, d\mathbf{w}_t \, d\mathbf{z}_t = \log \int \Big[ \int p(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{w}_t) \, p(\mathbf{z}_t \mid \mathbf{x}_t) \, d\mathbf{z}_t \Big] \, p(\mathbf{w}_t \mid \mathcal{D}_{1:T} \backslash \mathcal{D}_t) \, d\mathbf{w}_t
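The factorized integrals above are intractable in general and can be approximated with Monte Carlo samples drawn independently for z_t and w_t, which the conditional-independence assumption permits. The numpy sketch below is a hypothetical illustration: `sample_z`, `sample_w`, and `likelihood` are placeholder callables standing in for the model's actual inference networks and likelihood.

```python
import numpy as np

def predictive_log_likelihood(x, y, sample_z, sample_w, likelihood, n_samples=100):
    # log p(y|x, D) ~= log( (1/S) * sum_s p(y | z_s, w_s) ),
    # with z_s ~ p(z|x) and w_s ~ p(w|D) sampled independently,
    # as licensed by the conditional-independence assumption.
    vals = [likelihood(y, sample_z(x), sample_w()) for _ in range(n_samples)]
    return float(np.log(np.mean(vals)))

# Toy stand-ins: z is a noisy copy of x, w is fixed, and the likelihood
# is a Gaussian-shaped score in (0, 1].
rng = np.random.default_rng(0)
ll = predictive_log_likelihood(
    x=1.0, y=1.0,
    sample_z=lambda x: x + rng.normal(scale=0.1),
    sample_w=lambda: 1.0,
    likelihood=lambda y, z, w: float(np.exp(-(y - w * z) ** 2)),
)
```

Averaging the likelihoods before taking the log (rather than averaging log-likelihoods) keeps the estimator consistent with the marginal predictive above, at the cost of a mild bias for finite sample sizes.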

al., 2015; Kingma et al., 2015; Molchanov et al., 2017).

B.1 EFFECTIVENESS IN HANDLING LIMITED DATA

We further provide detailed information in Table 7 about the average accuracy in Fig. 2. Our proposed probabilistic models, i.e., VMTL and VMTL-AC, outperform the deterministic baseline multi-task learning model (bMTL), which demonstrates the benefits of our proposed variational Bayesian framework. Given a limited amount of training data, STL and VSTL cannot train a proper model for each task. As the training data decreases, our methods based on the variational Bayesian framework are able to better handle this challenging case by incorporating the shared knowledge into the prior of each task. The best average accuracies are marked in bold, and the second best are underlined. Performance of average accuracy under different proportions of training data on Office-Home.

The results of the above state-of-the-art methods are taken from (Long et al., 2017). The results on the three datasets under the 20% train-test split and the results on DomainNet under the 2% and 4% train-test splits are provided in Table 15 and Table 16, respectively. The proposed VMTL consistently achieves the best performance on all datasets under all train-test split settings. VMTL with amortized classifiers (VMTL-AC) yields competitive performance, better than most previous methods.

B.5 ROBUSTNESS OF OUR METHODS

Detailed results on performance of Bayesian approximation for representation z


Detailed results on performance of VMTL with different priors on Office-Home.

Detailed results on performance of VMTL with different priors on Office-Caltech.
Learnable weighted: 92.9±0.4 94.7±0.2 96.4±0.4 89.9±0.3 93.5±0.1 95.5±0.1 97.0±0.3 97.9±0.0 90.2±0.2 95.2±0.1 95.8±0.2 94.5±0.3 99.2±0.9 90.7±0.6 95.1±0.1
Gumbel-softmax: 93.8±0.3 95.5±0.4 96.4±0.4 90.0±0.3 93.9±0.2 95.5±0.1 97.0±0.1 97.9±0.3 91.0±0.1 95.3±0.1 95.6±0.1 95.8±0.4 99.2±0.6 90.6±0.5 95.3±0.1

Detailed results on performance of VMTL with different priors on ImageCLEF.
0±0.7 71.4±0.1 58.8±0.4 75.7±0.2 93.5±0.4 86.5±0.3 71.7±0.4 60.2±0.4 78.0±0.2 93.1±0.5 89.6±0.5 78.8±0.8 58.8±0.7 80.1±0.2
Learnable weighted: 90.7±0.6 81.8±0.4 71.9±0.7 57.4±0.8 75.5±0.4 93.7±0.3 86.1±0.4 71.9±0.8 59.3±0.6 77.8±0.2 93.3±0.4 89.6±0.3 77.7±0.6 59.2±0.3 79.9±0.3
Gumbel-softmax: 91.1±0.3 83.2±0.6 71.4±0.4 58.3±0.8 76.0±0.2 93.7±0.4 86.5±0.4 71.8±0.4 59.5±0.6 77.9±0.2 94.0±0.2 89.7±0.3 77.9±0.4 59.7±0.5 80.3±0.2

Performance comparison of different methods on the Office-Home dataset for multiple tasks: Artistic (A), Clipart (C), Product (P) and Real-world (R).
VMTL-AC: 52.3±0.4 37.5±0.5 70.1±0.3 66.7±0.2 56.7±0.2 58.4±0.5 46.5±0.3 76.9±0.2 73.1±0.3 63.7±0.1 62.3±0.1 52.4±0.3 82.0±0.2 75.9±0.4 68.2±0.1
VMTL: 53.8±0.6 38.6±0.2 71.4±0.3 68.8±0.2 58.2±0.2 60.3±0.5 47.5±0.2 78.1±0.2 74.2±0.1 65.0±0.0 64.0±0.3 53.3±0.3 82.5±0.2 76.6±0.3 69.1±0.1

Performance comparison of different methods on the three datasets with the 20% train-test split: Office-Home, Office-Caltech, ImageCLEF.
8±0.5 80.9±0.3 72.9±0.4 65.6±0.1 95.0±0.2 94.5±0.7 96.0±1.2 86.8±0.2 93.1±0.1 92.9±0.5 86.5±0.2 71.9±0.8 56.0±0.8 76.8±0.3
VMTL-AC: 62.3±0.1 52.4±0.3 82.0±0.2 75.9±0.4 68.2±0.1 95.2±0.3 95.7±0.5 99.4±0.3 90.7±0.4 95.2±0.2 92.9±0.3 87.7±0.2 77.0±0.7 60.1±0.7 79.4±0.2
VMTL: 64.0±0.3 53.3±0.3 82.5±0.2 76.6±0.3 69.1±0.1 95.6±0.1 95.8±0.4 99.2±0.6 90.6±0.5 95.3±0.1 94.0±0.2 89.7±0.3 77.9±0.4 59.7±0.5 80.3±0.2

Performance comparison of different methods on the large-scale dataset DomainNet for multiple tasks: Clipart (C), Infograph (I), Painting (P), Quickdraw (Q), Real (R) and Sketch (S).
VMTL: 35.7±0.1 14.3±0.2 40.0±0.2 18.5±0.1 65.4±0.2 25.5±0.1 33.2±0.1 30.9±0.1 11.9±0.0 35.4±0.2 15.7±0.1 61.8±0.1 21.8±0.2 29.6±0.1

