DECIPHERING AND OPTIMIZING MULTI-TASK LEARNING: A RANDOM MATRIX APPROACH

Abstract

This article provides theoretical insights into the inner workings of multi-task and transfer learning methods, by studying the tractable least-square support vector machine multi-task learning (LS-SVM MTL) method, in the limit of large (p) and numerous (n) data. By a random matrix analysis applied to a Gaussian mixture data model, the performance of MTL LS-SVM is shown to converge, as n, p → ∞, to a deterministic limit involving simple (small-dimensional) statistics of the data. We show (i) that the standard MTL LS-SVM algorithm is in general strongly biased and may dramatically fail (to the point that individual single-task LS-SVMs may outperform the MTL approach, even for quite resembling tasks), and our analysis provides a simple method to correct these biases; and (ii) we reveal the sufficient statistics at play in the method, which can be efficiently estimated, even for quite small datasets. The latter result is exploited to automatically optimize the hyperparameters, without resorting to any cross-validation procedure. Experiments on popular datasets demonstrate that our improved MTL LS-SVM method is computationally efficient and outperforms state-of-the-art multi-task and transfer learning techniques that are sometimes much more elaborate.

1. INTRODUCTION

The advent of elaborate learning machines capable of surpassing human performance on dedicated tasks has reopened past challenges in machine learning. Transfer learning, and multi-task learning (MTL) in general, by which known tasks are used to help a machine learn other related tasks, is one of them. The particularly interesting aspects of multi-task learning lie in the possibility (i) to exploit the resemblance between the datasets associated with each task so that the tasks "help each other" and (ii) to train a machine on a specific target dataset comprising few labelled data by exploiting much larger labelled datasets, albeit composed of different data. Practical applications are numerous, ranging from the prediction of student test results for a collection of schools (Aitkin & Longford, 1986), to the survival of patients in different clinics, to the value of many possibly related financial indicators (Allenby & Rossi, 1998), to the preference modelling of individuals in a marketing context, etc.

Since MTL seeks to improve the performance of a task with the help of related tasks, a central issue to (i) understand the functioning of MTL, (ii) adequately adapt its hyperparameters and eventually (iii) improve its performance consists in characterizing how MTL relates tasks to one another and in identifying which features are "transferred". The article aims to decipher these fundamental aspects for sufficiently general data models.

Several data models may be considered to enforce relatedness between tasks. A common assumption is that the data lie close to each other in a geometrical sense (Evgeniou & Pontil, 2004), live in a low-dimensional manifold (Agarwal et al., 2010), or share a common prior (Daumé III, 2009). We follow here the latter assumption, in assuming that, for each task, the data arise from a 2-class Gaussian mixture.
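As an illustration, the two-class Gaussian mixture data model assumed above can be sketched as follows; the dimensions, class-mean direction, and the perturbation relating the two tasks are placeholder values of ours, not quantities from the article.

```python
import numpy as np

def gaussian_mixture_task(n_per_class, p, mu, rng):
    """Draw a 2-class Gaussian mixture for one task:
    class +1 has mean +mu, class -1 has mean -mu, identity covariance."""
    X_pos = rng.standard_normal((n_per_class, p)) + mu   # class +1
    X_neg = rng.standard_normal((n_per_class, p)) - mu   # class -1
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

rng = np.random.default_rng(0)
p = 100                              # data dimension (large in the analysis)
mu1 = np.zeros(p); mu1[0] = 2.0      # class-mean direction of task 1 (assumed)
mu2 = mu1.copy(); mu2[1] = 0.5       # task 2: a slightly perturbed mean (assumed)
X1, y1 = gaussian_mixture_task(50, p, mu1, rng)
X2, y2 = gaussian_mixture_task(50, p, mu2, rng)
```

The closer mu1 and mu2, the more "related" the two tasks are under this common-prior view of relatedness.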
Methodologically, in its simplest approach, MTL algorithms can be obtained from a mere extension of support vector machines (SVM), accounting for more than one task. That is, instead of finding the hyperplane (through its normal vector ω) best separating the two classes of a unique dataset, Evgeniou & Pontil (2004) propose to produce best separating hyperplanes (or normal vectors) ω_1, . . . , ω_k for each pair of data classes of the k tasks, with the additional constraint that the normal vectors take the form ω_i = ω_0 + v_i for some common vector ω_0 and dedicated vectors v_i. The amplitude of the vectors v_i is controlled (through an additional hyperparameter) to enforce or relax task relatedness. We study this approach here. Yet, to obtain explicit and thus more insightful results, we specifically resort to a least-square SVM (as proposed, e.g., in (Xu et al., 2013)) rather than a margin-based SVM. This only marginally alters the overall behavior of the MTL algorithm and has no impact on the main insights drawn in the article. Moreover, by a now well-established universality argument in large-dimensional statistics, Mai et al. (2019) show that quadratic (least-square) cost functions are asymptotically optimal and uniformly outperform alternative costs (such as margin-based methods or logistic approaches), even in a classification setting. This argument further motivates the choice of considering first and foremost the LS-SVM version of MTL-SVM.

Technically, the article exploits the powerful random matrix theory to study the performance of the MTL least-square SVM algorithm (MTL LS-SVM) for data arising from a Gaussian mixture model, assuming the total number n and dimension p of the data are both large, i.e., as n, p → ∞ with p/n → c ∈ (0, ∞).
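The ω_i = ω_0 + v_i construction with a least-square cost can be sketched as a ridge regression over stacked task-specific features; the function names, the feature-stacking formulation, and the penalty layout (lam on ω_0, lam/gamma on each v_i) are our illustrative choices, not the article's exact algorithm or hyperparametrization.

```python
import numpy as np

def mtl_lssvm_fit(Xs, ys, lam=1.0, gamma=1.0):
    """Least-square MTL sketch: the score for a sample x of task t is
    (w0 + v_t)^T x, fit by ridge regression on +/-1 labels with penalties
    lam*||w0||^2 + (lam/gamma)*sum_t ||v_t||^2.
    Xs, ys: lists (one entry per task) of (n_t, p) data and +/-1 labels."""
    k, p = len(Xs), Xs[0].shape[1]
    rows = []
    for t, X in enumerate(Xs):
        Z = np.zeros((X.shape[0], (k + 1) * p))
        Z[:, :p] = X                          # shared block -> w0
        Z[:, (t + 1) * p:(t + 2) * p] = X     # task-specific block -> v_t
        rows.append(Z)
    Z, y = np.vstack(rows), np.concatenate(ys)
    # block-diagonal ridge penalty: lam on w0, lam/gamma on each v_t
    reg = np.concatenate([np.full(p, lam), np.full(k * p, lam / gamma)])
    W = np.linalg.solve(Z.T @ Z + np.diag(reg), Z.T @ y)
    return W[:p], W[p:].reshape(k, p)         # w0, stacked v_t

def mtl_lssvm_score(w0, V, t, x):
    """Classification score g(x) for a sample x of task t."""
    return (w0 + V[t]) @ x
```

Here gamma plays the role of the relatedness hyperparameter: a small gamma heavily penalizes the v_t, tying all tasks to the common ω_0, while a large gamma lets each task learn almost independently.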
As such, our work follows the recent wave of interest in the asymptotics of machine learning algorithms, as studied lately in e.g., (Liao & Couillet, 2019; Deng et al., 2019; Mai & Couillet, 2018; El Karoui et al., 2010). Our analysis reveals the following major conclusions:

• we exhibit the sufficient statistics which concretely enable task comparison in the MTL LS-SVM algorithm; we show that, even when the data are of large dimension (p ≫ 1), these statistics remain small-dimensional (they only scale with the number k of tasks);

• while it is conventional to manually set the labels associated with each dataset within {-1, 1}, we prove that this choice is largely suboptimal and may even cause MTL to severely fail (causing "negative transfer"); we instead provide the optimal values for the labels of each dataset, which depend on the sought-for objective: these optimal values are furthermore easily estimated from very few training data (i.e., no cross-validation is needed);

• for an unknown new datum x, the MTL LS-SVM algorithm allocates a class based on the comparison of a score g(x) to a threshold ζ, usually set to zero. We prove that, depending on the statistics and the number of elements of the training dataset, a bias is naturally induced that makes ζ = 0 a largely suboptimal choice in general. We provide a correction for this bias, which again can be estimated from the training data alone;

• we demonstrate on popular real datasets that our proposed optimized MTL LS-SVM is both resilient to real data and also manages, despite not being a best-in-class MTL algorithm, to rival and sometimes largely outperform competing state-of-the-art algorithms.

These conclusions thus allow for an optimal use of MTL LS-SVM with performance-maximizing hyperparameters and strong theoretical guarantees. As such, the present article offers through MTL LS-SVM a viable, fully-controlled (and even better performing) alternative to state-of-the-art MTL.

Reproducibility.
Matlab and Julia codes for reproducing the results of the article are available in the supplementary materials.

In the supplementary material, we extend this setting to a much broader and more realistic scope, and justify in passing the relevance of a Gaussian mixture modelling to address multi-task learning with real data.

Notation. e_m^[n] ∈ R^n is the canonical vector of R^n with [e_m^[n]]_i = δ_{mi}. Moreover, e_ij^[2k] = e_{2(i-1)+j}^[2k]. Similarly, E_ij^[n] ∈ R^{n×n} is the matrix with [E_ij^[n]]_{ab} = δ_{ia} δ_{jb}. The notations A ⊗ B and A ⊕ B denote the Kronecker product and direct sum of the matrices A and B, respectively.

