DECIPHERING AND OPTIMIZING MULTI-TASK LEARNING: A RANDOM MATRIX APPROACH

Abstract

This article provides theoretical insights into the inner workings of multi-task and transfer learning methods, by studying the tractable least-square support vector machine multi-task learning (MTL LS-SVM) method in the limit of large dimension (p) and number (n) of data. Through a random matrix analysis applied to a Gaussian mixture data model, the performance of MTL LS-SVM is shown to converge, as n, p → ∞, to a deterministic limit involving simple (small-dimensional) statistics of the data. We prove (i) that the standard MTL LS-SVM algorithm is in general strongly biased and may dramatically fail (to the point that individual single-task LS-SVMs may outperform the MTL approach, even for quite resembling tasks); our analysis provides a simple method to correct these biases; and (ii) we reveal the sufficient statistics at play in the method, which can be efficiently estimated even from quite small datasets. The latter result is exploited to automatically optimize the hyperparameters without resorting to any cross-validation procedure. Experiments on popular datasets demonstrate that our improved MTL LS-SVM method is computationally efficient and outperforms state-of-the-art multi-task and transfer learning techniques that are sometimes much more elaborate.

1. INTRODUCTION

The advent of elaborate learning machines capable of surpassing human performance on dedicated tasks has reopened past challenges in machine learning. Transfer learning, and multi-task learning (MTL) in general, by which known tasks are used to help a machine learn other related tasks, is one of them. The particularly interesting aspects of multi-task learning lie in the possibility (i) to exploit the resemblance between the datasets associated with each task, so that the tasks "help each other", and (ii) to train a machine on a specific target dataset comprising few labelled data by exploiting much larger labelled datasets, albeit composed of different data. Practical applications are numerous, ranging from the prediction of student test results for a collection of schools (Aitkin & Longford, 1986), to the survival of patients in different clinics, to the value of many possibly related financial indicators (Allenby & Rossi, 1998), to the preference modelling of individuals in a marketing context, etc. Since MTL seeks to improve the performance on a task with the help of related tasks, a central issue to (i) understand the functioning of MTL, (ii) adequately adapt its hyperparameters and (iii) eventually improve its performance consists in characterizing how MTL relates tasks to one another and in identifying which features are "transferred". The article aims to decipher these fundamental aspects for sufficiently general data models. Several data models may be considered to enforce relatedness between tasks. A common assumption is that the data lie close to each other in a geometrical sense (Evgeniou & Pontil, 2004), live in a low-dimensional manifold (Agarwal et al., 2010), or share a common prior (Daumé III, 2009). We here follow the latter assumption, positing that, for each task, the data arise from a 2-class Gaussian mixture.
Methodologically, in its simplest approach, MTL algorithms can be obtained from a mere extension of support vector machines (SVM) accounting for more than one task. That is, instead of finding the hyperplane (through its normal vector ω) best separating the two classes of a unique dataset, (Evgeniou & Pontil, 2004) proposes to produce best-separating hyperplanes (or normal vectors) ω_1, …, ω_k for each pair of data classes of the k tasks, with the additional constraint that the normal vectors take the form ω_i = ω_0 + v_i for some common vector ω_0 and dedicated vectors v_i. The amplitude of the vectors v_i is controlled (through an additional hyperparameter) to enforce or relax task relatedness. We study this approach here. Yet, to obtain explicit and thus more insightful results, we specifically resort to a least-square SVM (as proposed, e.g., in (Xu et al., 2013)) rather than a margin-based SVM. This only marginally alters the overall behavior of the MTL algorithm and has no impact on the main insights drawn in the article. Moreover, by a now well-established universality argument in large-dimensional statistics, (Mai et al., 2019) show that quadratic (least-square) cost functions are asymptotically optimal and uniformly outperform alternative costs (such as margin-based or logistic approaches), even in a classification setting. This argument further motivates the choice of considering first and foremost the LS-SVM version of MTL-SVM. Technically, the article exploits the powerful random matrix theory to study the performance of the MTL least-square SVM algorithm (MTL LS-SVM) for data arising from a Gaussian mixture model, assuming the total number n and dimension p of the data are both large, i.e., n, p → ∞ with p/n → c ∈ (0, ∞).
As such, our work follows the recent wave of interest in the asymptotics of machine learning algorithms, as studied lately in, e.g., (Liao & Couillet, 2019; Deng et al., 2019; Mai & Couillet, 2018; El Karoui et al., 2010). Our analysis reveals the following major conclusions:
• we exhibit the sufficient statistics which concretely enable task comparison in the MTL LS-SVM algorithm; we show that, even when the data are of large dimension (p ≫ 1), these statistics remain small dimensional (they only scale with the number k of tasks);
• while it is conventional to manually set the labels associated with each dataset within {−1, 1}, we prove that this choice is largely suboptimal and may even cause MTL to severely fail (causing "negative transfer"); we instead provide the optimal values for the labels of each dataset, which depend on the sought-for objective; these optimal values are furthermore easily estimated from very few training data (i.e., no cross-validation is needed);
• for an unknown new datum x, the MTL LS-SVM algorithm allocates a class based on the comparison of a score g(x) to a threshold ζ, usually set to zero. We prove that, depending on the statistics and sizes of the training datasets, a bias is naturally induced that makes ζ = 0 a largely suboptimal choice in general. We provide a correction for this bias, which again can be estimated from the training data alone;
• we demonstrate on popular real datasets that our proposed optimized MTL LS-SVM is both resilient to real data and manages, despite not being a best-in-class MTL algorithm, to rival and sometimes largely outperform competing state-of-the-art algorithms.
These conclusions allow for an optimal use of MTL LS-SVM with performance-maximizing hyperparameters and strong theoretical guarantees. As such, the present article offers through MTL LS-SVM a viable, fully-controlled (and even better-performing) alternative to state-of-the-art MTL.
Reproducibility.
Matlab and Julia codes for reproducing the results of the article are available in the supplementary material.
Notation. e_m^[n] ∈ R^n is the canonical vector of R^n with [e_m^[n]]_i = δ_mi; moreover, e_ij^[2k] ≡ e_{2(i−1)+j}^[2k]. Similarly, E_ij^[n] ∈ R^{n×n} is the matrix with [E_ij^[n]]_ab = δ_ia δ_jb. The notations A ⊗ B and A ⊙ B for matrices or vectors A, B are respectively the Kronecker and Hadamard products. D_x is the diagonal matrix containing on its diagonal the elements of the vector x, and A_i• is the i-th row of A. The notation Å indicates that a centering operation has been performed on the matrix or vector A. Uppercase calligraphic letters (𝒜, 𝒦, Γ, ℳ, 𝒱, …) are used for deterministic small-dimensional matrices. Finally, 1_m and I_m are respectively the vector of all ones of dimension m and the identity matrix of dimension m × m. The index pair (i, j) generally refers to Class j in Task i.

2. RELATED WORKS

Let us first point out the difference between MTL and transfer learning: while MTL makes no distinction between tasks and aims to improve the performance of all of them, transfer learning aims to maximize the performance of a target task with the help of all source tasks. Yet, since both approaches mostly share the same learning process, we focus in this section on the MTL literature, which divides into parameter-based and feature-based MTL. In the parameter-based MTL approach, the tasks are assumed to share some parameters (e.g., the hyperplanes best separating each class) or to have hyperparameters drawn from a common prior distribution. Existing learning methods (SVM, logistic regression, etc.) can then be appropriately modified to incorporate these relatedness assumptions. In this context, (Evgeniou & Pontil, 2004; Xu et al., 2013; Parameswaran & Weinberger, 2010) respectively adapt the SVM, LS-SVM, and Large Margin Nearest Neighbor (LMNN) algorithms to the MTL paradigm. The present article borrows ideas from Evgeniou & Pontil (2004); Xu et al. (2013). In the feature-based MTL approach, the task data are instead assumed to share a low-dimensional common representation. In this context, most works aim to determine a mapping of the ambient data space into a low-dimensional subspace (through sparse coding, deep neural networks, principal component analysis, etc.) in which the tasks have high similarity (Argyriou et al., 2007; Maurer et al., 2013; Zhang et al., 2016; Pan et al., 2010); other works simply resort to feature selection, merely extracting a subset of the original feature space (Obozinski et al., 2006; Wang & Ye, 2015; Gong et al., 2012).
We must insist that, in the present work, our ultimate objective is to study and improve "data-generic" MTL mechanisms under no structural assumption on the data; this approach is quite unlike recent works exploiting convolutive techniques in deep neural nets or low dimensional feature-based methods to perform transfer or multi-task learning mostly for computer vision. From a theoretical standpoint though, few works have provided a proper understanding of the various MTL algorithms. To our knowledge, the only such results arise from elementary learning theory (Rademacher complexity, VC dimension, covering number, stability) and only provide loose performance bounds (Baxter, 2000; Ben-David & Schuller, 2003; Baxter, 1997) . As such, the present work fills a long-standing gap in the MTL research.

3. THE MULTI-TASK LEARNING SETTING

Let X ∈ R^{p×n} be a collection of n independent data vectors of dimension p. The data are divided into k subsets attached to individual "tasks". Specifically, letting X = [X_1, …, X_k], Task i is a binary classification problem with training samples X_i = [X_i^(1), X_i^(2)] ∈ R^{p×n_i}, where X_i^(j) = [x_i1^(j), …, x_in_ij^(j)] ∈ R^{p×n_ij} gathers the n_ij vectors of class j ∈ {1, 2} for Task i. In particular, n = Σ_{i=1}^k n_i and n_i = n_i1 + n_i2 for each i ∈ {1, …, k}. To each x_il ∈ R^p of the training set is attached a corresponding "label" (or score) y_il ∈ R. We denote y_i = [y_i1, …, y_in_i]^T ∈ R^{n_i} the vector of all labels for Task i, and y = [y_1^T, …, y_k^T]^T ∈ R^n the vector of all labels. These labels are generally chosen to be ±1 but, for reasons that will become clear in the course of the article, we voluntarily do not enforce binary labels here. Before detailing the multi-task classification scheme, a preliminary task-wise centering operation is performed on the data, i.e., we consider in the following the datasets

X̊_i = X_i (I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T),  ∀i ∈ {1, …, k}.

As such, we systematically work with the labeled datasets (X̊_1, y_1), …, (X̊_k, y_k). Remark 1 in the supplementary material motivates this choice, which avoids extra biases produced by the algorithm.
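For concreteness, the task-wise centering X̊_i = X_i (I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T) simply removes each task's empirical mean from its columns. A minimal numpy sketch (the function name is ours; the paper's own code is in Matlab and Julia):

```python
import numpy as np

def center_task(X_i):
    """Task-wise centering: X̊_i = X_i (I - (1/n_i) 1 1^T),
    i.e., subtract the empirical mean of the n_i columns."""
    return X_i - X_i.mean(axis=1, keepdims=True)
```

Broadcasting over the columns is equivalent to right-multiplying by the projector I − (1/n_i) 1 1^T.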

3.1. THE OPTIMIZATION FRAMEWORK

The multi-task learning least-square support vector machine (MTL LS-SVM) aims to predict, for input vectors x ∈ R^p not belonging to the training set, an associated score y upon which a decision on the class allocation of x is taken, for a given target task. To this end, based on the labeled sets (X̊_1, y_1), …, (X̊_k, y_k), MTL LS-SVM determines the normal vectors W = [ω_1, ω_2, …, ω_k] ∈ R^{p×k} and intercepts b = [b_1, b_2, …, b_k]^T ∈ R^k defining k separating hyperplanes for the corresponding k binary classification tasks. In order to account for task relatedness, each ω_i assumes the form ω_i = ω_0 + v_i for some common ω_0 ∈ R^p and task-dedicated v_i ∈ R^p. Formally, writing V = [v_1, …, v_k] ∈ R^{p×k} (so that W = ω_0 1_k^T + V) and following the work of (Evgeniou & Pontil, 2004; Xu et al., 2013), the optimization problem is

min_{(ω_0, V, b) ∈ R^p × R^{p×k} × R^k}  (1/(2λ)) ‖ω_0‖² + (1/2) Σ_{i=1}^k ‖v_i‖²/γ_i + (1/2) Σ_{i=1}^k ‖ξ_i‖²,
ξ_i = y_i − (X̊_i^T ω_i + b_i 1_{n_i}),  1 ≤ i ≤ k.

In this expression, the parameter λ enforces task relatedness while the parameters γ_1, …, γ_k enforce better classification of the data into their respective classes. Being a quadratic optimization problem under linear equality constraints, ω_0, V, b are obtained explicitly (see details in Section 1 of the supplementary material). The solution is best described through the expression of the hyperplane normals ω_1, …, ω_k ∈ R^p, which take the form

ω_i = (1/(kp)) (e_i^[k]T ⊗ I_p) A Z α,   b = (P^T Q P)^{−1} P^T Q y,

where α = Q(y − Pb) = Q^{1/2}(I_n − Q^{1/2} P (P^T Q P)^{−1} P^T Q^{1/2}) Q^{1/2} y ∈ R^n is the vector of Lagrange dual variables and

Q = ((1/(kp)) Z^T A Z + I_n)^{−1} ∈ R^{n×n},   Z = Σ_{i=1}^k E_ii^[k] ⊗ X̊_i ∈ R^{pk×n},
A = (D_γ + λ 1_k 1_k^T) ⊗ I_p ∈ R^{kp×kp},   P = Σ_{i=1}^k E_ii^[k] ⊗ 1_{n_i} ∈ R^{n×k},

with γ = [γ_1, …, γ_k]^T and D_γ = diag(γ). MTL LS-SVM differs from a single-task joint LS-SVM on all data in that the data X̊_1, …, X̊_k are not treated simultaneously but through k distinct filters: this explains why Z ∈ R^{kp×n} is not the mere concatenation [X̊_1, …, X̊_k] but has a block-diagonal structure isolating each X̊_i. As such, the X̊_i-relating matrix A plays an important role in the MTL learning process. With this formulation of the solution (W, b), the prediction of the class of any new data point x ∈ R^p for the target Task i is obtained from the classification score

g_i(x) = (1/(kp)) (e_i^[k] ⊗ x̊)^T A Z α + b_i    (1)

where x̊ = x − (1/n_i) X_i 1_{n_i} is a centered version of x with respect to the training dataset of Task i.
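Since the solution is fully explicit, it can be transcribed directly. The sketch below is our own numpy rendering (not the authors' Matlab/Julia code) of the training equations and of the score of equation (1); it forms A, Z, P and Q explicitly and is therefore only meant for small p and n:

```python
import numpy as np

def mtl_lssvm_fit(X_list, y_list, lam, gamma):
    """Closed-form MTL LS-SVM (Evgeniou & Pontil, 2004; Xu et al., 2013).
    X_list[i]: (p, n_i) *centered* data of task i; y_list[i]: (n_i,) scores.
    Returns (alpha, b, AZ), all that is needed to score new points."""
    k, p = len(X_list), X_list[0].shape[0]
    n_is = [X.shape[1] for X in X_list]
    n = sum(n_is)
    y = np.concatenate(y_list)
    # Z = sum_i E_ii ⊗ X̊_i: block-diagonal stacking isolating each task
    Z = np.zeros((k * p, n))
    P = np.zeros((n, k))
    col = 0
    for i, Xi in enumerate(X_list):
        Z[i * p:(i + 1) * p, col:col + n_is[i]] = Xi
        P[col:col + n_is[i], i] = 1.0
        col += n_is[i]
    # A = (D_gamma + lam 11^T) ⊗ I_p, then Q = ((1/(kp)) Z^T A Z + I_n)^{-1}
    A_small = np.diag(gamma) + lam * np.ones((k, k))
    AZ = np.kron(A_small, np.eye(p)) @ Z
    Q = np.linalg.inv(Z.T @ AZ / (k * p) + np.eye(n))
    b = np.linalg.solve(P.T @ Q @ P, P.T @ Q @ y)
    alpha = Q @ (y - P @ b)
    return alpha, b, AZ

def mtl_lssvm_score(x, i, alpha, b, AZ, p, k):
    """Classification score g_i(x) of equation (1), for an already-centered x."""
    e_x = np.zeros(k * p)
    e_x[i * p:(i + 1) * p] = x
    return e_x @ AZ @ alpha / (k * p) + b[i]
```

A production version would exploit the Kronecker structure of A rather than forming the kp × kp matrix.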

3.2. LARGE DIMENSIONAL STATISTICAL MODELLING

The first objective of the article is to quantify the MTL performance, and thus the (a priori intricate) statistics of g_i(x), under a sufficiently simple but telling Gaussian mixture model for the training and test data.

Assumption 1 (Distribution of X and x). The columns of [X, x] are independent Gaussian random vectors. Specifically, the n_ij samples x_i1^(j), …, x_in_ij^(j) of class j for Task i are independent N(μ_ij, I_p) vectors, and we let Δμ_i ≡ μ_i1 − μ_i2. As for x, it is an independent N(μ_x, I_p) vector.

In the supplementary material, Assumption 1 is relaxed to [X, x] arising from a generative model of the type x_il^(j) = h_ij(z_il^(j)) for z_il^(j) ∼ N(0, I_p) and h_ij: R^p → R^p a 1-Lipschitz function. This model encompasses extremely realistic data, including data arising from generative networks (e.g., GANs (Goodfellow et al., 2014)), and is shown in the supplementary material to be universal in the sense that, as n, p → ∞, the asymptotic performance of MTL LS-SVM only depends on the statistical means and covariances of the x_il^(j): the performance under such complex mixtures thus provably coincides with that under an elementary Gaussian mixture. This generalized study however comes at the expense of more complex definitions and formulas, which impedes readability; hence the simpler isotropic Gaussian mixture model adopted here. Our central technical approach to the performance evaluation of MTL LS-SVM consists in placing ourselves in the large-p, large-n regime of random matrix theory.

Assumption 2 (Growth Rate). As n → ∞, n/p → c_0 > 0 and, for 1 ≤ i ≤ k, 1 ≤ j ≤ 2, n_ij/n → c_ij > 0. We let c_i = c_i1 + c_i2 and c = [c_1, …, c_k]^T ∈ R^k.

With these notations and assumptions, we are in a position to present our main theoretical results.
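Both the pure Gaussian model of Assumption 1 and its Lipschitz relaxation are easy to sample. In the sketch below (ours), the coordinate-wise tanh is our illustrative choice of 1-Lipschitz map h; any other 1-Lipschitz function fits the relaxed model:

```python
import numpy as np

def sample_task(mu1, mu2, n1, n2, h=None, rng=None):
    """Draw a two-class mixture for one task.
    Assumption 1: x ~ N(mu_j, I_p). The relaxed model of the supplementary
    material instead sets x = h(z) + mu_j with z ~ N(0, I_p) and h 1-Lipschitz;
    tanh (selected with h='tanh') is one such map, chosen here for illustration."""
    if rng is None:
        rng = np.random.default_rng()
    p = mu1.shape[0]
    Z1 = rng.normal(size=(p, n1))
    Z2 = rng.normal(size=(p, n2))
    if h == 'tanh':  # |tanh'| <= 1, hence 1-Lipschitz coordinate-wise
        Z1, Z2 = np.tanh(Z1), np.tanh(Z2)
    return Z1 + mu1[:, None], Z2 + mu2[:, None]
```

The universality result states that, asymptotically, only the means and covariances of such samples matter to the MTL LS-SVM performance.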

4.1. TECHNICAL STRATEGY AND NOTATIONS

To evaluate the statistics of g_i(x) (equation 1), we resort to so-called deterministic equivalents for the matrices Q, AZQ, etc., which appear at the core of the formulation of g_i(x). These are provided in Lemma 1 of the supplementary material. Our strategy then consists in "decoupling" the effect of the data statistics from that of the MTL hyperparameters λ, γ_1, …, γ_k. Specifically, we extract two fundamental quantities for our analysis: a data-related matrix ℳ ∈ R^{2k×2k} and a hyperparameter matrix 𝒜 ∈ R^{k×k}:

ℳ = Σ_{i,j=1}^k Δμ_i^T Δμ_j E_ij^[k] ⊗ c_i c_j^T,   c_i = [ (c_i2/c_i)√(c_i1/c_i), −(c_i1/c_i)√(c_i2/c_i) ]^T,
𝒜 = (I_k + D_Δ^{−1/2} (D_γ + λ 1_k 1_k^T)^{−1} D_Δ^{−1/2})^{−1},

where Δ = [Δ_1, …, Δ_k]^T is the unique positive solution to the implicit system Δ_i = c_i/c_0 − 𝒜_ii (implicit because 𝒜 is a function of the Δ_j's). In passing, it will appear convenient to use the shortcut notation Δ = [Δ_11, …, Δ_k2]^T ∈ R^{2k}, where Δ_ij = (c_ij/(c_i c_0)) Δ_i. We will see that ℳ plays the role, in the limit of large p, n, of a sufficient statistic for the performance of the MTL LS-SVM algorithm, only involving (i) the data statistics μ_11, …, μ_k2 and (ii) the (limiting) relative numbers c_11/c_1, …, c_k2/c_k of elements per class in each task. As for 𝒜, it captures the impact of the hyperparameters λ, γ_1, …, γ_k and of the dimension ratios c_1, …, c_k and c_0. These two matrices are combined in the core matrix Γ ∈ R^{2k×2k} of the upcoming MTL LS-SVM performance analysis, defined as

Γ = (I_2k + (𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ)^{−1},

where we recall that '⊙' is the Hadamard (entry-wise) matrix product. We noted in the introduction of Section 3 that we purposely relax the binary "labels" y_ij associated with each datum x_ij in each task into real-valued "scores" y_ij ∈ R. This will have fundamental consequences for the MTL performance. Yet, since the data vectors are i.i.d. within each class, we impose equal scores y_i1 = … = y_in_i within each class. As such, we may reduce the complete score vector y ∈ R^n to the form y = [ỹ_11 1_{n_11}^T, …, ỹ_k2 1_{n_k2}^T]^T for ỹ = [ỹ_11, …, ỹ_k2]^T ∈ R^{2k}. From Remark 1 of the supplementary material, the performance of MTL is insensitive to a constant shift of the scores y_i1 and y_i2 of any given Task i: as such, the recentered version ẙ = [ẙ_11, …, ẙ_k2]^T of ỹ, where

ẙ_ij = ỹ_ij − ((n_i1/n_i) ỹ_i1 + (n_i2/n_i) ỹ_i2),

will be central in the upcoming results.
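The recentering ẙ is a per-task removal of the class-size-weighted mean of the score pair. A small numpy helper (ours) makes the operation and its defining property n_i1 ẙ_i1 + n_i2 ẙ_i2 = 0 explicit:

```python
import numpy as np

def recenter_scores(y_tilde, n):
    """ẙ_ij = ỹ_ij - ((n_i1/n_i) ỹ_i1 + (n_i2/n_i) ỹ_i2), task by task.
    y_tilde: (k, 2) array of per-class scores; n: (k, 2) array of class sizes."""
    y_tilde = np.asarray(y_tilde, dtype=float)
    n = np.asarray(n, dtype=float)
    # weighted per-task mean (n_i1 ỹ_i1 + n_i2 ỹ_i2) / n_i, removed from both entries
    shift = (n * y_tilde).sum(axis=1, keepdims=True) / n.sum(axis=1, keepdims=True)
    return y_tilde - shift
```

For balanced classes (n_i1 = n_i2) and ỹ_i = (1, −1), the recentering changes nothing; for imbalanced classes it does, which is precisely the source of the threshold bias discussed below.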

4.2. MAIN RESULTS

Theorem 1 (Asymptotics of g_i(x)). Under Assumptions 1-2, for x ∼ N(μ_x, I_p) with μ_x = μ_ij,

g_i(x) − G_ij → 0,   G_ij ∼ N(m_ij, σ_i²)

in distribution, where, letting m = [m_11, …, m_k2]^T,

m = ỹ − D_Δ^{−1/2} Γ D_Δ^{1/2} ẙ,
σ_i² = (1/Δ_i) ẙ^T D_Δ^{1/2} Γ (D_{(𝒦_i•/c) ⊗ 1_2} + 𝒱_i) Γ D_Δ^{1/2} ẙ,

with 𝒱_i = (1/c_0) ((𝒜 D_{c_0 𝒦_i•/c} + e_i^[k] 𝒜_i•) ⊗ 1_2 1_2^T) ⊙ ℳ and 𝒦 = [𝒜 ⊙ 𝒜](I_k − D_{c_0/(kc)} [𝒜 ⊙ 𝒜])^{−1}.

As anticipated, the statistics of the classification scores g_i(x) mainly depend on the data statistics μ_ij and on the hyperparameters λ, γ_1, …, γ_k through the matrix Γ (and more marginally through 𝒱_i and 𝒦 for the variances). Since g_i(x) has a Gaussian limit centered at m_ij, the (asymptotic) standard decision for x to be allocated to Class 1 (x → C_1) or Class 2 (x → C_2) for Task i is obtained by the "averaged-mean" test

g_i(x) ≷_{C_2}^{C_1} (m_i1 + m_i2)/2,    (2)

the classification error rate ε_i1 ≡ P(x → C_2 | x ∈ C_1) of which is then

ε_i1 = P( g_i(x) ≤ (m_i1 + m_i2)/2 | x ∈ C_1 ) = Q( (m_i1 − m_i2)/(2σ_i) ) + o(1)    (3)

with m_ij, σ_i as in Theorem 1 and Q(t) = (1/√(2π)) ∫_t^∞ e^{−u²/2} du. Further comment on Γ is due before moving to the practical consequences of Theorem 1. From the expression (𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ, we observe that: (i) if λ ≪ 1, then 𝒜 is diagonally dominant and thus "filters out" in the Hadamard product all off-diagonal entries of ℳ, i.e., all cross-terms Δμ_i^T Δμ_j for i ≠ j, therefore refusing to exploit the correlation between tasks; (ii) if instead λ is not small, 𝒜 may be developed (using the Sherman-Morrison matrix inverse formula) as the sum of a diagonal matrix, which again filters out the Δμ_i^T Δμ_j for i ≠ j, and of a rank-one matrix which instead performs a weighted sum (through the γ_i's and the Δ_i's) of the entries of ℳ. Specifically, letting γ^{−1} = (γ_1^{−1}, …, γ_k^{−1})^T in the expression of 𝒜, we have

(D_γ + λ 1_k 1_k^T)^{−1} = D_γ^{−1} − λ γ^{−1} γ^{−1T} / (1 + λ Σ_{i=1}^k γ_i^{−1}).

As such, disregarding the regularization effect of the Δ_i's, the off-diagonal Δμ_i^T Δμ_j entry of ℳ is weighted by the coefficient (γ_i γ_j)^{−1}: the impact of the γ_i's is thus strongly associated with the relevance of the correlation between tasks.
A fundamental aspect of Theorem 1 is its conclusion that the performance of the large-dimensional (n, p ≫ 1) classification problem at hand boils down to 2k-dimensional statistics, as all objects defined in the theorem statement are at most of size 2k. More importantly from a practical perspective, these "sufficient statistics" are amenable to fast and efficient estimation: a few training samples suffice to estimate all quantities involved in the theorem. This, as a corollary, lets one envision efficient transfer learning methods based on very scarce data samples, as discussed in Remark 1. Estimating m_ij and σ_i not only allows one to anticipate theoretical performance, but also enables the actual estimation of the decision threshold (m_i1 + m_i2)/2 of equation 2 and opens the possibility to largely optimize MTL LS-SVM through an (asymptotically) optimal choice of the training scores ỹ. Indeed, the asymptotics in Theorem 1 depend in an elegant manner on the training data labels (scores) ỹ. Since the variance σ_i² is independent of the classes, we easily determine the vector ỹ = ỹ* minimizing the misclassification probability for Task i as

ỹ* = argmax_{ỹ ∈ R^{2k}} (m_i1 − m_i2)²/σ_i²
   = argmax_{ỹ ∈ R^{2k}} [ỹ^T (I_2k − D_Δ^{1/2} Γ D_Δ^{−1/2})(e_i1^[2k] − e_i2^[2k])]² / [ỹ^T D_Δ^{1/2} Γ (D_{(𝒦_i•/c) ⊗ 1_2} + 𝒱_i) Γ D_Δ^{1/2} ỹ],

the solution of which is explicit:

ỹ* = D_Δ^{−1/2} Γ^{−1} ℋ [(𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ] D_Δ^{−1/2} (e_i1^[2k] − e_i2^[2k]),   ℋ ≡ (D_{(𝒦_i•/c) ⊗ 1_2} + 𝒱_i)^{−1},    (4)

with corresponding (asymptotically) optimal classification error ε_i1* (equation 3) given by

ε_i1* = Q( (1/2) √( (e_i1^[2k] − e_i2^[2k])^T 𝒢 (e_i1^[2k] − e_i2^[2k]) ) ),
𝒢 = D_Δ^{1/2} [(𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ] 𝒱_i^{−1} [(𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ] D_Δ^{1/2}.

The only non-diagonal matrices in equation 4 are Γ and 𝒱_i, in which ℳ plays the role of a "variance profile" matrix. In particular, assuming Δμ_i^T Δμ_ℓ = 0 for all ℓ ≠ i (i.e., the statistical means of all tasks are orthogonal to those of Task i), the two rows and two columns of ℳ associated with Task i are all zero but on their 2 × 2 diagonal block. Therefore, ỹ* is filled with zeros except for its two Task-i entries. Any other values at the zero-entry locations of ỹ* (such as the usual ỹ = [1, −1, …, 1, −1]^T) would be suboptimal and possibly severely detrimental to the classification performance of Task i (not by altering the means m_i1, m_i2 but by increasing the variance σ_i²). This extreme example strongly suggests that, in order to maximize the MTL performance on Task i, one must impose low scores ỹ_jl on all Tasks j strongly different from Task i. The choice ỹ = [1, −1, …, 1, −1]^T can also be very detrimental when Δμ_i^T Δμ_j < 0 for some i, j: that is, when the mapping of the two classes within each task is reversed (e.g., if Class 1 of Task 1 is closer to Class 2 of Task 2 than to Class 1 of Task 2). In this setting, it is easily seen that ỹ = [1, −1, …, 1, −1]^T works against the classification and performs much worse than a single-task LS-SVM. Another interesting conclusion arises in the simplified setting of equal numbers of samples per task and per class, i.e., n_11 = … = n_k2. In this case,

ỹ* = Γ^{−1} ℋ [(𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ] (e_i1^[2k] − e_i2^[2k]),

in which all matrices are organized in 2 × 2 blocks of equal entries. This immediately implies that ỹ*_j1 = −ỹ*_j2 for all j, so in particular the detection threshold (m_i1 + m_i2)/2 of the averaged-mean test (equation 2) is zero, as conventionally assumed. In all other settings for the n_jl's, it is very unlikely that ỹ*_i1 = −ỹ*_i2, and the optimal decision threshold must also be estimated. These various conclusions give rise to an improved MTL LS-SVM algorithm. A pseudo-code (Algorithm 1), along with discussions on (i) the estimation of the statistics appearing in Theorem 1 (Remark 1) and (ii) its multi-class extension (Remark 2), is provided next. Matlab and Julia implementations of Algorithm 1 and its extensions are provided in the supplementary material.
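The averaged-mean test and its limiting error of equation (3) only involve the Gaussian tail function Q, so both the error under the oracle threshold (m_i1 + m_i2)/2 and the cost of naively thresholding at zero can be evaluated in a few lines. The sketch below is ours (with illustrative values), the zero-threshold variant being our own way of exposing the bias:

```python
from math import erfc, sqrt

def gauss_Q(t):
    """Gaussian tail Q(t) = P(N(0,1) > t) = erfc(t / sqrt(2)) / 2."""
    return 0.5 * erfc(t / sqrt(2.0))

def mtl_error_rate(m_i1, m_i2, sigma_i):
    """Asymptotic misclassification rate of the averaged-mean test
    g_i(x) ≷ (m_i1 + m_i2)/2, as in equation (3)."""
    return gauss_Q(abs(m_i1 - m_i2) / (2.0 * sigma_i))

def naive_error_rate(m_i1, m_i2, sigma_i):
    """Average error when thresholding at zero instead (m_i1 > m_i2 assumed),
    illustrating the penalty when the bias (m_i1 + m_i2)/2 is nonzero."""
    return 0.5 * (gauss_Q(m_i1 / sigma_i) + gauss_Q(-m_i2 / sigma_i))
```

For symmetric means (m_i1 = −m_i2) the two coincide; as soon as the bias (m_i1 + m_i2)/2 moves away from zero, the naive threshold strictly degrades the error, which is the phenomenon visible in Figure 1.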

5. EXPERIMENTS

Our theoretical results (the data-driven optimal tuning of MTL LS-SVM, as well as the anticipation of its classification performance) find various practical applications and consequences. We exploit them here in the context of transfer learning, first on a binary decision over synthetic data, and then on a multiclass classification over real data. To this end, and before proceeding to our experimental setup, a few key remarks on the practical exploitation of Theorem 1 are in order.

Remark 1 (On the estimation of m_ij and σ_i). All quantities defined in Theorem 1 are a priori known, apart from the k² inner products Δμ_i^T Δμ_j for 1 ≤ i, j ≤ k. For these, define S_il ⊂ {1, …, n_il} (l = 1, 2) and the corresponding indicator vector j_il ∈ R^{n_i} with [j_il]_a = δ_{a ∈ S_il}. For i = j, further let S'_il ⊂ {1, …, n_il} with S_il ∩ S'_il = ∅ and the corresponding indicator vector j'_il ∈ R^{n_i}. Then, the following estimates hold:

Δμ_i^T Δμ_j − (j_i1/|S_i1| − j_i2/|S_i2|)^T X_i^T X_j (j_j1/|S_j1| − j_j2/|S_j2|) = O((p min_{l∈{1,2}} {|S_il|, |S_jl|})^{−1/2}),  i ≠ j,
Δμ_i^T Δμ_i − (j_i1/|S_i1| − j_i2/|S_i2|)^T X_i^T X_i (j'_i1/|S'_i1| − j'_i2/|S'_i2|) = O((p min_{l∈{1,2}} {|S_il|, |S'_il|})^{−1/2}).

Observe in particular that a single sample (two when i = j) per task and per class (|S_il| = 1) is sufficient to obtain a consistent estimate of all quantities, so long as p is large.

Remark 2 (From binary to multiclass MTL). Accessing the vector m in Theorem 1 allows for the extension of the MTL framework to a multiclass-per-task MTL, discarding the well-known inherent biases of multiclass SVM. In the context of L_i classes for Task i, using a one-versus-all approach (for each ℓ ∈ {1, …, L_i}, one MTL LS-SVM is run with Class ℓ as the target class and all other classes together as Class "2"), one needs access to the L_i pairs of values (m_i1(ℓ), m_i2(ℓ)) and, for a new x, decides on the genuine class of x based on the largest value among g_i(x; 1) − m_i1(1), …, g_i(x; L_i) − m_i1(L_i), with g_i(x; ℓ) the output score of "Class ℓ versus all". For simplicity, from Remark 1 of the supplementary material, one may choose smart shift vectors ȳ(ℓ) ∈ R^k for the scores ỹ(ℓ) ∈ R^{2k} (i.e., replace ỹ(ℓ) by ỹ(ℓ) + ȳ(ℓ) ⊗ 1_2) so that m_i1(ℓ) = 0 for each ℓ. Under this shift, the selected class is simply the one for which g_i(x; ℓ) is maximal.
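The estimator of Remark 1 is a plain inner product between two differences of class sample means. The sketch below is our numpy rendering; for i = j, the remark requires two disjoint index sets, which the same function supports by passing disjoint index pairs:

```python
import numpy as np

def estimate_cross_stat(X_i, X_j, S_i, S_j):
    """Few-shot estimate of Δμ_i^T Δμ_j (Remark 1, our transcription).
    X_i: (p, n_i) training data of task i; S_i = (S_i1, S_i2) are index lists
    of class-1 / class-2 columns of X_i (similarly S_j for task j).
    For i = j, pass two disjoint index pairs as S_i and S_j."""
    def diff_of_means(X, S1, S2):
        # (j_1/|S_1| - j_2/|S_2|)^T X^T, i.e., mean over S_1 minus mean over S_2
        return X[:, S1].mean(axis=1) - X[:, S2].mean(axis=1)
    return diff_of_means(X_i, *S_i) @ diff_of_means(X_j, *S_j)
```

Consistently with the O((p min|S|)^{−1/2}) rate, even very small index sets already give usable estimates when p is large.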

5.1. EFFECT OF INPUT SCORE (LABEL) AND THRESHOLD DECISION CHOICES

In order to support the theoretical insights drawn in the article, our first experiment illustrates, on synthetic data, the effects of the bias in the decision threshold for g_i(x) (in general not centered at zero) and of the input score (label) optimization ỹ*. Specifically, MTL LS-SVM is first applied to the following two-task (k = 2) setting: for Task 1, x_1l^(j) ∼ N((−1)^j μ_1, I_p) and, for Task 2, x_2l^(j) ∼ N((−1)^j μ_2, I_p), where μ_2 = βμ_1 + √(1 − β²) μ_1^⊥, with μ_1^⊥ any vector orthogonal to μ_1 and β ∈ [−1, 1]. This setting allows us to tune, through β, the similarity between the tasks. For four different values of β, Figure 1 depicts the distribution of the binary output scores g_i(x), both for the classical MTL LS-SVM (top displays) and for our proposed random-matrix-improved scheme with optimized input labels (bottom displays). As a first remark, note that theoretical predictions and empirical outputs closely fit for all values of β, thereby corroborating our theoretical findings. In practical terms, the figure supports (i) the importance of estimating the decision threshold, which is non-trivial (not always close to zero), and (ii) the relevance of an appropriate choice of the input labels to improve the discrimination between the two classes, especially when the two tasks are not closely related, as shown by the classification errors reported in red in the figure.
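The task-similarity construction μ_2 = βμ_1 + √(1 − β²) μ_1^⊥ is straightforward to reproduce. The sketch below (ours) draws a random μ_1^⊥ by Gram-Schmidt, scaled so that ⟨μ_1, μ_2⟩/‖μ_1‖² equals β exactly and ‖μ_2‖ = ‖μ_1‖:

```python
import numpy as np

def related_mean(mu1, beta, rng=None):
    """Return mu2 = beta*mu1 + sqrt(1-beta^2)*mu1_perp with mu1_perp ⟂ mu1
    and ‖mu1_perp‖ = ‖mu1‖, so the task correlation <mu1, mu2>/‖mu1‖² is beta."""
    if rng is None:
        rng = np.random.default_rng()
    v = rng.normal(size=mu1.shape)
    v = v - (v @ mu1) / (mu1 @ mu1) * mu1        # Gram-Schmidt: remove mu1 component
    mu1_perp = v / np.linalg.norm(v) * np.linalg.norm(mu1)
    return beta * mu1 + np.sqrt(1.0 - beta**2) * mu1_perp
```

Negative β then produces the "reversed class mapping" scenario of Section 4.2, where the standard ±1 labels actively hurt the target task.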

5.2. MULTICLASS TRANSFER LEARNING

We next turn to the classical Office+Caltech256 (Saenko et al., 2010; Griffin et al., 2007) real-data (image) benchmark for transfer learning, consisting of the 10 categories shared by both datasets. For fair comparison with previous works, we compare images based on p = 4096 VGG features. Half of the samples of the target task are randomly selected as test data, and the accuracy is evaluated over 20 trials. We use here Algorithm 1, the results of which (Ours) are reported in Table 1 against the non-optimized LS-SVM (Xu et al., 2013) and alternative state-of-the-art algorithms: MMDT, CDLS and ILS.

Algorithm 1 Proposed Multi-Task Learning algorithm.
Input: Training samples X = [X_1, …, X_k] with X_i = [X_i^(1), …, X_i^(L_i)] ∈ R^{p×n_i}, and test data x.
Output: Estimated class ℓ̂ ∈ {1, …, L_t} of x for the target Task t.
for j = 1 to L_t do
  Center and normalize the data per task: for all i ∈ {1, …, k},
    X̊_i ← X_i (I_{n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T);
    X̊_i ← X̊_i / √((1/(n_i p)) tr(X̊_i X̊_i^T)).
  Estimate the matrix ℳ (Remark 1) and Δ, using X̊.
  Set ỹ(j) = D_Δ^{−1/2} Γ^{−1} ℋ [(𝒜 ⊗ 1_2 1_2^T) ⊙ ℳ] D_Δ^{−1/2} (e_t1^[2k] − e_t2^[2k]).
  Estimate and center m = m(j) from Theorem 1, as per Remark 1.
  (Optional) Estimate the theoretical classification error ε_t(λ, γ) and minimize it over (λ, γ).
  Compute the classification scores g_t(x; j) according to equation 1.
end for
Output: ℓ̂ = argmax_{ℓ ∈ {1, …, L_t}} g_t(x; ℓ).

Table 1 demonstrates that our proposed improved MTL LS-SVM, despite its simplicity and unlike the competing methods used for comparison, has stable performance and is quite competitive. We further recall that, in addition to these high levels of performance, the method comes along with theoretical guarantees, which none of the competing works provide.

6. CONCLUSION: BEYOND MTL

Through the example of transfer learning (and more generally multi-task learning), we have demonstrated the capability of random matrix theory to (i) predict and improve the performance of machine learning algorithms and, most importantly, (ii) turn simplistic (and in theory largely suboptimal) methods, such as LS-SVM here, into competitive state-of-the-art algorithms. As Gaussian mixtures are quite "universal" and thus already appropriate to handle real data (as shown in the supplementary material), one may surmise the optimality of the least-square approach, thereby opening the possibility to prove that MTL LS-SVM is likely close-to-optimal even in real-data settings. This is yet merely a first step towards a generalized use of random matrix theory and large-dimensional statistics to devise much-needed low-computational-cost and explainable, yet highly competitive, machine learning methods from elementary optimization schemes.



In the supplementary material, we extend this setting to a much broader and more realistic scope, and justify in passing the relevance of a Gaussian mixture modelling to address multi-task learning with real data.
MMDT: Max Margin Domain Transform, proposed in (Hoffman et al., 2013), applies a weighted SVM to a learnt transformation of the source and target. CDLS: Cross-Domain Landmark Selection, proposed in (Hubert Tsai et al., 2016), derives a domain-invariant feature space. ILS: Invariant Latent Space, proposed in (Herath et al., 2017), learns an invariant latent space to reduce the discrepancy between source and target.



Figure 1: Scores g_2(x) [empirical histogram vs. theory in solid lines] for x of Class C_1 (red) or Class C_2 (blue) for Task 2, in a 2-task (k = 2) setting of isotropic Gaussian mixtures: (top) classical MTL LS-SVM with y ∈ {±1} and threshold ζ = 0; (bottom) proposed optimized MTL LS-SVM with ỹ* and estimated threshold ζ*. Decision thresholds ζ are represented as dashed vertical lines; red numbers are misclassification rates. Task relatedness: β = 0 for orthogonal tasks, β > 0 for positively correlated tasks, β < 0 for negatively correlated tasks. p = 100, [c_11, c_12, c_21, c_22] = [0.3, 0.4, 0.1, 0.2], γ = 1_2, λ = 10. Histograms drawn from 1000 test samples of each class.


Table 1: Classification accuracy over the Office+Caltech256 database, for different "source to target" task pairs (S → T) based on VGG features, with c (Caltech), w (Webcam), a (Amazon), d (dslr). Best score in boldface, second-to-best in italic.

Method |                       accuracy per S → T pair                                        | Mean
LSSVM  | 96.69  89.90  92.90  90.00  93.80  78.70  93.50   95.00  85.00  90.20  94.70  100   | 91.70
MMDT   | 93.90  87.05  90.83  84.40  94.17  86.25  94.58   97.50  86.25  87.23  92.05  97.35 | 90.96
ILS    | 77.89  73.55  86.85  76.22  86.22  71.34  74.53   82.80  68.15  63.49  78.98  92.88 | 77.74
CDLS   | 97.60  88.30  93.54  88.30  93.54  92.50  93.54   93.75  93.75  88.30  97.35  96.70 | 93.10
Ours   | 98.68  89.90  94.40  90.60  94.40  93.80  94.20  100    92.50  89.90  98.70  99.30 | 94.70

ACKNOWLEDGMENTS

Couillet's work is partially supported by MIAI at University Grenoble-Alpes (ANR-19-P3IA-0003) and the HUAWEI LarDist project.

