RANDOM MATRIX ANALYSIS TO BALANCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING UNDER THE LOW DENSITY SEPARATION ASSUMPTION

Abstract

We propose a theoretical framework to analyze semi-supervised classification under the low density separation assumption in a high-dimensional regime. In particular, we introduce QLDS, a linear classification model, where the low density separation assumption is implemented via quadratic margin maximization. The algorithm has an explicit solution with rich theoretical properties, and we show that particular cases of our algorithm are the least-square support vector machine in the supervised case, the spectral clustering in the fully unsupervised regime, and a class of semi-supervised graph-based approaches. As such, QLDS establishes a smooth bridge between these supervised and unsupervised learning methods. Using recent advances in the random matrix theory, we formally derive a theoretical evaluation of the classification error in the asymptotic regime. As an application, we derive a hyperparameter selection policy that finds the best balance between the supervised and the unsupervised terms of our learning criterion. Finally, we provide extensive illustrations of our framework, as well as an experimental study on several benchmarks to demonstrate that QLDS, while being computationally more efficient, improves over cross-validation for hyperparameter selection, indicating a high promise of the usage of random matrix theory for semi-supervised model selection.

1. INTRODUCTION

Semi-supervised learning (SSL; Chapelle et al., 2010; van Engelen and Hoos, 2019) aims to learn using both labeled and unlabeled data at once. This machine learning approach has received a lot of attention over the past decade due to its relevance to many real-world applications, where the annotation of data is costly and performed manually (Imran et al., 2020), while the data acquisition is cheap and may result in an abundance of unlabeled data (Fergus et al., 2009). As such, semi-supervised learning can be seen as a learning framework that lies in between the supervised and the unsupervised settings, where the former occurs when all the data is labeled, and the latter is recovered when only unlabeled data is available. Generally, a semi-supervised algorithm is expected to outperform its supervised counterpart trained only on labeled data by efficiently extracting from unlabeled examples the information valuable to the prediction task. In practice, integrating unlabeled observations into the learning process does not always improve the performance (Singh et al., 2008), since the marginal data distribution p(x) must contain information on the prediction task p(y|x). Consequently, most semi-supervised approaches rely on specific assumptions about how p(x) and p(y|x) are linked with each other. It is principally assumed that examples similar to each other tend to share the same class labels (van Engelen and Hoos, 2019), and the implementation of this assumption results in different families of semi-supervised learning models. A first family of approaches aims to capture the intrinsic geometry of the data using a graph Laplacian (Chong et al., 2020; Song et al., 2022) and supposes that high-dimensional data points with the same label lie on the same low-dimensional manifold (Belkin and Niyogi, 2004). Another family of semi-supervised algorithms suggests that examples from a dense region belong to the same class.
While some methods explicitly look for such regions by relying on a clustering algorithm (Rigollet, 2007; Peikari et al., 2018), another idea is to directly restrict the classification model to have a decision boundary that only passes through low density regions. This latter approach is said to rely on the Low Density [...] us from the algorithmic point of view, showing a direct extension of QLDS to a non-linear case. It is important to mention other theoretical studies of approaches based on the low density separation, including upper bounds of the classification error of TSVM (Derbeko et al., 2004; Wang et al., 2007) and analysis of the self-training algorithm (Feofanov et al., 2021; Zhang et al., 2022).

Graph-based Semi-Supervised Learning. The principle of a graph-based approach is to 1) build a suitable graph with all the labeled and the unlabeled examples as the nodes, connected by weighted edges measuring the pairwise similarities (graph construction step), and 2) search for a function f over the graph that is as close as possible to the given labels and that is smooth on the entire constructed graph (label inference step). The graph structure naturally reflects the manifold assumption in SSL, which suggests that samples located near each other on a low-dimensional manifold should share similar labels. Among the graph construction methods, the K-nearest neighbor (KNN) graph (Ozaki et al., 2011; Vega-Oliveros et al., 2014) and b-Matching methods (Jebara et al., 2009; Dhillon et al., 2010), along with their extensions, are the most popular ones. Several extensions have considered labeled samples as prior knowledge to refine the generated graph (Rohban and Rabiee, 2012; Berton and Lopes, 2014).
Depending on the particular choice of loss functions, the label inference methods can be divided into label propagation approaches (Xiaojin and Zoubin, 2002; Zhou et al., 2003), manifold regularization (Belkin et al., 2006; Xu et al., 2010), Poisson learning (Calder et al., 2020) and deformed Laplacian regularization (Gong et al., 2015). Recently, Mai and Couillet (2021) proposed a theoretical analysis of a unified framework for label inference in a graph that encompasses label propagation, manifold, and Laplacian regularization as special cases. In this paper, we recover the framework of Mai and Couillet (2021) as a special case of QLDS.

Large Dimensional Analysis for Machine Learning. Recently, Random Matrix Theory (RMT) has received particular attention in the machine learning community for studying the asymptotic performance in a regime where the dimension is of the same order of magnitude as the sample size. Recent advances include the analysis of the linear discriminant (Niyazi et al., 2021), spectral clustering (Couillet and Benaych-Georges, 2016), least square SVM (Liao and Couillet, 2019), and graph-based semi-supervised learning (Mai and Couillet, 2021). In this paper, we show that the theoretical findings of the last three aforementioned works are recovered from our theoretical analysis of QLDS provided in Section 4. We derive our theoretical results under the assumption that observations follow a vector-concentration inequality (Louart and Couillet, 2018), which can be particularly interesting for deep learning representations that preserve the concentration property (Seddik et al., 2020). It is interesting to mention that a number of machine learning algorithms have been theoretically analyzed using methods from theoretical physics, especially glassy physics (Agliari et al., 2020; Carleo et al., 2019; Loureiro et al., 2021; Cui et al., 2021; d'Ascoli et al., 2020).
To continue with statistical physics-based methods, we highlight the work of Lelarge and Miolane (2019), who derived the asymptotic Bayes risk using information theory and the cavity method (Mézard et al., 1987). Although statistical physics and RMT-based approaches share the same objectives, the techniques used and the interpretations make them two different but complementary methods. To the best of our knowledge, there is no analysis of the algorithm studied in this paper using a statistical physics approach, which we nevertheless believe is possible and could be an interesting direction for future work. For completeness, let us also mention the works based on the Convex Gaussian Min-Max Theorem (Thrampoulidis et al., 2015; 2016), which allows the analysis of many machine learning algorithms but is mathematically different from the approach used in this paper.

3. FRAMEWORK

Notations. Matrices are represented by bold capital letters (e.g., matrix $\mathbf{A}$), vectors by bold lowercase letters (e.g., vector $\mathbf{v}$), and scalars by non-bold letters (e.g., variable $a$). The canonical vector of size $n$ is denoted by $\mathbf{e}_m^{[n]} \in \mathbb{R}^n$, $1 \le m \le n$, where the $i$-th element is 1 if $i = m$, and 0 otherwise. The diagonal matrix with diagonal $\mathbf{x}$ and 0 elsewhere is denoted by $\mathbf{D}_{\mathbf{x}}$, while $\mathbf{A}_{i:}$ denotes the $i$-th row of the matrix $\mathbf{A}$.

Semi-supervised Setting. We consider binary classification problems, where an observation $\mathbf{x} \in \mathbb{R}^d$ is described by $d$ features and belongs either to the class $\mathcal{C}_1$ with a label $y = -1$ or to the class $\mathcal{C}_2$ with a label $y = +1$. We assume that training data consists of $n_\ell$ labeled examples $(\mathbf{X}_\ell, \mathbf{y}_\ell) = (\mathbf{x}_i, y_i)_{i=1}^{n_\ell} \in \mathbb{R}^{d \times n_\ell} \times \{-1, +1\}^{n_\ell}$ and $n_u$ unlabeled examples $\mathbf{X}_u = (\mathbf{x}_i)_{i=n_\ell+1}^{n_\ell+n_u} \in \mathbb{R}^{d \times n_u}$ given without labels. Following the transductive setting (Vapnik, 1982), we formulate the goal of semi-supervised learning as learning a classification model $\mathbb{R}^d \to \{-1, +1\}$ that yields the minimal error on the unlabeled data $\mathbf{X}_u$. For convenience, we denote the concatenation of labeled and unlabeled observations by $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$ and the total sample size by $n = n_\ell + n_u$. For each class $\mathcal{C}_j$, $j \in \{1, 2\}$, we denote the labeled observations from this class as $\mathbf{X}_\ell^{(j)} = [\mathbf{x}_1^{(j)}, \ldots, \mathbf{x}_{n_{\ell j}}^{(j)}]$, where $\mathbf{X}_\ell = [\mathbf{X}_\ell^{(1)}, \mathbf{X}_\ell^{(2)}]$ and $n_{\ell 1} + n_{\ell 2} = n_\ell$. The same convention is used for the unlabeled data $\mathbf{X}_u$. By $n_j = n_{\ell j} + n_{uj}$ we denote the total number of samples in class $\mathcal{C}_j$, $j \in \{1, 2\}$.

QLDS. Based on the training set $[\mathbf{X}_\ell, \mathbf{X}_u]$, we seek a separating hyperplane (linear decision boundary) $\boldsymbol{\omega}^\star$ that is a solution of the following optimization problem:

$$\boldsymbol{\omega}^\star = \arg\min_{\boldsymbol{\omega}} \; \underbrace{\frac{\alpha_\ell}{2} \sum_{i=1}^{n_\ell} \left( y_i - \frac{\mathbf{x}_i^\top \boldsymbol{\omega}}{\sqrt{n}} \right)^2}_{\text{label fidelity term}} \; - \; \underbrace{\frac{\alpha_u}{2} \sum_{i=n_\ell+1}^{n_\ell+n_u} \left( \frac{\boldsymbol{\omega}^\top \mathbf{x}_i}{\sqrt{n}} \right)^2}_{\text{low density separation}} \; + \; \underbrace{\frac{\lambda}{2} \|\boldsymbol{\omega}\|^2}_{\text{regularization}}. \qquad (1)$$

The first term is the label fidelity term that involves the labeled data only, and it represents the classical least-square loss used in the LS-SVM.
The second term implements the low density separation regularization by maximizing the square of the margin of each unlabeled example, thereby pushing the decision boundary away from the unlabeled points. The third term is the classical Tikhonov regularization, although we fix $\lambda$ to the maximum eigenvalue of $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$ (for more details, see Appendix C.2 and E.6). The first two terms are considered up to a $1/\sqrt{n}$ factor in order to ease the notations of the theoretical derivations of Section 4. Note that the label fidelity term can alternatively be represented by the hinge loss or the log-loss, which slightly alters the overall behavior of the algorithm. Our choice of the least square loss is primarily motivated by the possibility of obtaining more explicit, tractable and insightful results, as well as a numerically cheaper implementation. The optimal choice of the loss for the supervised part is a highly interesting question in the literature. Although it is difficult to formulate a strong statement valid for all practical situations, some asymptotic attempts have been made (Aubin et al., 2020; Mai and Liao, 2019). More related to our hypothesis, Mai and Liao (2019) show that for isotropic Gaussian mixture models in the high dimensional regime, quadratic cost functions are optimal and outperform alternative costs such as SVM or logistic approaches. Table 4 in the Appendix summarizes the classification error obtained with three losses for the labeled part (hinge, logistic, and quadratic) and two losses for the unlabeled part (quadratic and absolute value). This table shows that the selection of losses presented in the article has a competitive performance.

The optimization problem in Equation (1) is convex as soon as $\lambda > \lambda_{\max}$, where $\lambda_{\max}$ is the maximum eigenvalue of $\alpha_u \frac{\mathbf{X}_u \mathbf{X}_u^\top}{n} - \alpha_\ell \frac{\mathbf{X}_\ell \mathbf{X}_\ell^\top}{n}$, and admits a unique solution (all details are given in the supplementary material, Section A) given by

$$\boldsymbol{\omega}^\star = \frac{1}{\sqrt{n}} \left( \lambda \mathbf{I}_d - \alpha_u \frac{\mathbf{X}_u \mathbf{X}_u^\top}{n} + \alpha_\ell \frac{\mathbf{X}_\ell \mathbf{X}_\ell^\top}{n} \right)^{-1} \mathbf{X}_\ell \mathbf{y}_\ell. \qquad (2)$$

It is worth remarking that for the fully-supervised case $(\alpha_\ell, \alpha_u) = (1, 0)$, we recover the Least Square SVM (Suykens and Vandewalle, 1999). Another extreme case is to take $(\alpha_\ell, \alpha_u) = (0, 1)$, which leads to the optimal decision boundary of the graph-based approach proposed by Mai and Couillet (2021) (further denoted by GB-SSL). Moreover, if in addition to $(\alpha_\ell, \alpha_u) = (0, 1)$ we take $\lambda$ as the maximum eigenvalue of $(1/n)\, \mathbf{X}_u \mathbf{X}_u^\top$, we recover spectral clustering (see Section B of the supplementary material for a complete derivation). Given the optimal decision boundary as per Equation (2), the decision score function for any example $\mathbf{x} \in \mathbb{R}^d$ is given as

$$f(\mathbf{x}) = \frac{1}{\sqrt{n}} \boldsymbol{\omega}^{\star\top} \mathbf{x} = \frac{1}{n} \mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \left( \lambda \mathbf{I}_d - \alpha_u \frac{\mathbf{X}_u \mathbf{X}_u^\top}{n} + \alpha_\ell \frac{\mathbf{X}_\ell \mathbf{X}_\ell^\top}{n} \right)^{-1} \mathbf{x}. \qquad (3)$$
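As an illustration, the closed-form solution of Equation (2) and the score of Equation (3) can be sketched in a few lines of NumPy. This is only a minimal sketch: the function names `qlds_fit` and `qlds_score`, the synthetic data, and the particular value of `lam` (slightly above the largest eigenvalue of the sample covariance of all data, which is sufficient for convexity when α_u ≤ 1) are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def qlds_fit(X_l, y_l, X_u, alpha_l, alpha_u, lam):
    """Closed-form QLDS weight vector, following Equation (2).

    X_l: (d, n_l) labeled data, y_l: (n_l,) labels in {-1, +1},
    X_u: (d, n_u) unlabeled data. lam must exceed the largest eigenvalue of
    alpha_u * X_u X_u^T / n - alpha_l * X_l X_l^T / n.
    """
    d, n_l = X_l.shape
    n = n_l + X_u.shape[1]
    A = lam * np.eye(d) - alpha_u * (X_u @ X_u.T) / n + alpha_l * (X_l @ X_l.T) / n
    # Solve the d x d linear system instead of forming the inverse explicitly.
    return np.linalg.solve(A, X_l @ y_l) / np.sqrt(n)

def qlds_score(omega, X, n):
    """Decision scores f(x) = omega^T x / sqrt(n) for the columns of X (Equation (3))."""
    return X.T @ omega / np.sqrt(n)

# Toy usage on a synthetic two-class Gaussian mixture (illustrative values only).
rng = np.random.default_rng(0)
d, n_l, n_u = 20, 40, 200
mu = np.full(d, 0.5)
y_l = rng.choice([-1.0, 1.0], size=n_l)
y_u = rng.choice([-1.0, 1.0], size=n_u)
X_l = rng.standard_normal((d, n_l)) + np.outer(mu, y_l)
X_u = rng.standard_normal((d, n_u)) + np.outer(mu, y_u)
n = n_l + n_u
X = np.hstack([X_l, X_u])
lam = 1.001 * np.linalg.eigvalsh(X @ X.T / n).max()  # assumed choice of lambda
omega = qlds_fit(X_l, y_l, X_u, alpha_l=1.0, alpha_u=0.5, lam=lam)
y_hat = np.where(qlds_score(omega, X_u, n) < 0, -1.0, 1.0)
print("transductive accuracy:", np.mean(y_hat == y_u))
```

Solving the linear system rather than inverting the matrix is a standard numerical choice; it does not change the solution.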

4. THEORETICAL ANALYSIS AND ITS APPLICATION

In this section, we theoretically analyze the statistical behavior of QLDS and its decision function f (x). First, we state the assumptions used for theoretical analysis. Then, we present the main results and describe an application for hyperparameter selection.

4.1. ASSUMPTIONS

In the following, we assume the following classical concentration property.

Assumption 1 (Concentration of $\mathcal{D}(\mathbf{X})$). For the two classes $\mathcal{C}_j$, $j \in \{1, 2\}$, we assume that all vectors $\mathbf{x}_1^{(j)}, \ldots, \mathbf{x}_{n_j}^{(j)} \in \mathcal{C}_j$ are i.i.d. and in particular $\mathrm{Cov}(\mathbf{x}_i^{(j)}) = \mathbf{I}_d$. Moreover, we assume that there exist two constants $C, c > 0$ (independent of $n$, $d$) such that, for any 1-Lipschitz function $f : \mathbb{R}^d \to \mathbb{R}$,

$$\forall t > 0, \quad \mathbb{P}_{\mathbf{x} \sim \mathcal{D}(\mathbf{X})} \left( |f(\mathbf{x}) - m_{f(\mathbf{x})}| \ge t \right) \le C e^{-(t/c)^2},$$

where $m_Z$ is a median of the random variable $Z$.

Assumption 1 notably encompasses the following scenarios: the columns of $\mathbf{X}$ are (a) independent Gaussian random vectors with identity covariance, (b) independent random vectors uniformly distributed on the sphere of radius $\sqrt{d}$ in $\mathbb{R}^d$, and, most importantly, (c) any Lipschitz continuous transformation thereof, such as data generated by GANs, as recently shown theoretically in (Seddik et al., 2020). In the appendix (Section D), we further explain the concentrated vector assumption and discuss its relevance and generality for the study of machine learning algorithms. In Assumption 1, we only consider an identity covariance matrix to keep this presentation simple. The more general case of an arbitrary covariance matrix $\boldsymbol{\Sigma}_j$ is fully derived in the supplementary material, Section C. We should mention that it is convenient to "center" the data $\mathbf{X}$ for the sake of simplicity. This centering operation is performed on the whole data set $\mathbf{X}$ by subtracting the global mean from the training points, i.e., $\mathbf{X} \leftarrow \mathbf{X} - \mathbb{E}[\mathbf{X}]$. Furthermore, we place ourselves into the following large dimensional regime.

Assumption 2 (Growth Rate). As $n \to \infty$, we consider the regime where $d = O(n)$, with $d/n \to c_0 > 0$. Furthermore, $n_{\ell j}/n \to c_{\ell j}$ and $n_{uj}/n \to c_{uj}$ for $j = 1, 2$. We denote $\mathbf{c}_\ell = [c_{\ell 1}, c_{\ell 2}]$ and $\mathbf{c}_u = [c_{u1}, c_{u2}]$.

This assumption of a commensurable relationship between the number of samples and their dimension corresponds to a realistic regime and differs from classical asymptotics, where the number of samples is often assumed to be exponentially larger than the feature size. Note that this asymptotic regime, classical in Random Matrix Theory, fits most real-life applications and has been successfully applied in telecommunications (Couillet and Debbah, 2011), finance (Potters et al., 2005) and more recently in machine learning (Liao, 2019; Mai and Couillet, 2021; Tiomoko et al., 2020).
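To give a concrete feel for Assumption 1 in scenario (a), the following sketch samples standard Gaussian vectors in increasing dimension and measures the spread of one particular 1-Lipschitz observation, the Euclidean norm. The choice of the norm as test function, the sample size and the dimensions are assumptions made purely for illustration. The spread should stay of order one (roughly 0.7) while the norm itself grows like the square root of d.

```python
import numpy as np

rng = np.random.default_rng(0)

def lipschitz_spread(d, n_samples=2000):
    """Empirical standard deviation of the 1-Lipschitz map x -> ||x||_2 for x ~ N(0, I_d)."""
    x = rng.standard_normal((n_samples, d))
    return np.linalg.norm(x, axis=1).std()

# The dimension grows by two orders of magnitude, the spread does not.
spreads = {d: lipschitz_spread(d) for d in (10, 100, 1000)}
for d, s in spreads.items():
    print(f"d={d:5d}  std of ||x|| = {s:.3f}")
```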

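We introduce the mean matrix $\mathbf{M} = [\boldsymbol{\mu}_1, \boldsymbol{\mu}_2] \in \mathbb{R}^{d \times 2}$, where $\boldsymbol{\mu}_j = \mathbb{E}_{\mathbf{x} \in \mathcal{C}_j}[\mathbf{x}] \in \mathbb{R}^d$ is the theoretical mean of the class $\mathcal{C}_j$, $j \in \{1, 2\}$. Further, we define matrices $\mathcal{M}$ and $\mathcal{G}$ that play an important role in the formulation of the statistics of $f(\mathbf{x})$.

Definition 1 (Data statistics matrices $\mathcal{M}$ and $\mathcal{G}$). We define the data matrices $\mathcal{M}$ and $\mathcal{G}$ as

$$\mathcal{M} = \left( \mathbf{D}_{\boldsymbol{\kappa}}^{-1} + \delta \mathbf{M}^\top \mathbf{M} \right)^{-1}, \qquad \mathcal{G} = \left( -\frac{n_u}{n (1 - \alpha_u \delta)} + \mathbf{a}^\top \mathbf{d} \right) \delta \mathbf{M}^\top \mathbf{M},$$

where the vectors $\mathbf{a}$, $\mathbf{d}$ and $\boldsymbol{\kappa}$ are the unique positive solution of the following fixed point equations:

$$a_j = \frac{c_{\ell j} \alpha_\ell^2}{(1 + \alpha_\ell \delta)^2} + \frac{c_{uj} \alpha_u^2}{(1 - \alpha_u \delta)^2}, \qquad d_j = -\frac{\delta^2}{(1 - \alpha_u \delta)^2} \, \frac{c_0 n_u}{n (1 - c_0 \delta^2 a_j)},$$

$$\delta = \frac{1}{\lambda + \kappa_1 + \kappa_2}, \qquad \kappa_j = \frac{c_{\ell j} \alpha_\ell}{1 + \alpha_\ell \delta} - \frac{c_{uj} \alpha_u}{1 - \alpha_u \delta}.$$

The existence of $\mathbf{a}$, $\mathbf{d}$, $\boldsymbol{\kappa}$ and $\delta$ is a direct application of (Louart and Couillet, 2018, Proposition 3.8). These quantities are common in Random Matrix Theory, where they correct large biases in high dimensions (for more details, we refer to the supplementary material, Section D). We are now in a position to introduce the asymptotic theoretical analysis of the score $f(\mathbf{x})$ of any unlabeled sample $\mathbf{x}$.

Theorem 1. Let $\mathbf{X} \in \mathbb{R}^{d \times n}$ be a data set that follows Assumptions 1 and 2. For any $\mathbf{x} \in \mathbf{X}_u$ with $\mathbf{x} \in \mathcal{C}_j$ and $f(\mathbf{x}) = \frac{1}{\sqrt{n}} \boldsymbol{\omega}^{\star\top} \mathbf{x}$ defined by Equation (3), we have almost surely for both classes $j$

$$f(\mathbf{x} \mid \mathbf{x} \in \mathcal{C}_j) - f_j \xrightarrow{\text{a.s.}} 0, \qquad \text{where } f_j \sim \mathcal{N}(m_j, \sigma^2).$$

The mean $m_j$ and the variance $\sigma^2$ are defined as

$$m_j = (-1)^j \, \frac{c_{\ell j} - \left( \mathbf{e}_1^{[2]} - \mathbf{e}_2^{[2]} \right)^\top \mathbf{D}_{\mathbf{c}_\ell} \mathbf{D}_{\boldsymbol{\kappa}}^{-1} \mathcal{M} \mathbf{e}_j^{[2]}}{\kappa_j (1 - \alpha_u \delta)(1 + \alpha_\ell \delta)},$$

$$\sigma^2 = \left( \mathbf{e}_1^{[2]} - \mathbf{e}_2^{[2]} \right)^\top \left( \mathbf{D}_{\mathbf{s}} \mathcal{M} \mathcal{G} \mathcal{M} \mathbf{D}_{\mathbf{s}} + \mathbf{D}_{\mathbf{d}} \mathbf{D}_{\mathbf{c}_\ell} \right) \left( \mathbf{e}_1^{[2]} - \mathbf{e}_2^{[2]} \right),$$

with $\mathbf{s} = \left[ c_{\ell 1} / (\kappa_1 (1 + \alpha_\ell \delta)), \; c_{\ell 2} / (\kappa_2 (1 + \alpha_\ell \delta)) \right]$. Finally, the theoretical classification error is asymptotically given by

$$\varepsilon = \frac{1}{2} \left( 1 - \mathrm{erf}\left( \frac{m_1 - m_2}{2 \sqrt{2} \sigma} \right) \right), \qquad (4)$$

where $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt$ is the Gauss error function.

A fundamental aspect of Theorem 1 is that the performance of the large dimensional (large $n$, large $d$) classification problem under consideration merely concentrates into two-dimensional sufficient statistics, as all objects defined in the theorem are at most of size 2.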
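Once the two class means and the common variance of the score are known, turning them into a classification error is a one-line computation. The sketch below evaluates an erf expression of this type and checks it against a Monte Carlo simulation of Gaussian scores thresholded at zero. It assumes equal class proportions and means that are symmetric around zero, and it uses the absolute gap between the means so that the result does not depend on which class carries the negative score; the function name and the numerical values are illustrative only.

```python
import math
import numpy as np

def gaussian_sign_rule_error(m1, m2, sigma):
    """0.5 * (1 - erf(|m1 - m2| / (2 sqrt(2) sigma))): error of thresholding at the midpoint."""
    gap = abs(m1 - m2)
    return 0.5 * (1.0 - math.erf(gap / (2.0 * math.sqrt(2.0) * sigma)))

# Illustrative score statistics (assumed values, symmetric around zero).
m1, m2, sigma = -1.0, 1.0, 1.0
theory = gaussian_sign_rule_error(m1, m2, sigma)

# Monte Carlo check: draw scores for each class and apply the sign rule.
rng = np.random.default_rng(0)
n_draws = 200_000
scores_c1 = rng.normal(m1, sigma, n_draws)  # class C_1, label -1
scores_c2 = rng.normal(m2, sigma, n_draws)  # class C_2, label +1
empirical = 0.5 * np.mean(scores_c1 >= 0) + 0.5 * np.mean(scores_c2 < 0)
print(f"theory={theory:.4f}  empirical={empirical:.4f}")
```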
All quantities defined in Theorem 1 are a priori known, apart from the proportion of classes in $\mathbf{X}_u$ and the matrix $\mathbf{M}^\top \mathbf{M} \in \mathbb{R}^{2 \times 2}$, whose $(i, j)$-entries are the inner products $\boldsymbol{\mu}_i^\top \boldsymbol{\mu}_j$ that have to be estimated from data. From a practical perspective, these inner products are easily amenable to fast and efficient estimation as per Proposition 2, requiring only a few training data samples.

Proposition 2 (On the estimation of $m_j$ and $\sigma$). The following estimates hold:

$$\left[ \mathbf{M}^\top \mathbf{M} \right]_{ij} = \begin{cases} \dfrac{4}{n_{\ell i}^2} \, \mathbf{1}_{n_{\ell i}/2}^\top \mathbf{X}_{\ell;1}^{(i)\top} \mathbf{X}_{\ell;2}^{(i)} \mathbf{1}_{n_{\ell i}/2} + O\!\left( 1 / \sqrt{d \, n_{\ell i}} \right) & \text{if } i = j, \\[2ex] \dfrac{1}{n_{\ell i} n_{\ell j}} \, \mathbf{1}_{n_{\ell i}}^\top \mathbf{X}_\ell^{(i)\top} \mathbf{X}_\ell^{(j)} \mathbf{1}_{n_{\ell j}} + O\!\left( (d \min\{n_{\ell i}, n_{\ell j}\})^{-1/2} \right) & \text{otherwise,} \end{cases}$$

with $\mathbf{X}_\ell^{(j)} = [\mathbf{X}_{\ell;1}^{(j)}, \mathbf{X}_{\ell;2}^{(j)}]$ an even-sized division of $\mathbf{X}_\ell^{(j)}$.

Note that a single sample per class (two when $i = j$) is sufficient to obtain a consistent estimate for all quantities as long as $d$ is large. In the semi-supervised setting, when only a few labeled examples are available, it is thus still possible to estimate $\mathbf{M}^\top \mathbf{M}$. It is important to remark that the convergence rate of the estimation is a quadratic improvement over the convergence rate of the usual central limit theorem. Finally, to estimate the proportion of classes in the unlabeled set, which is not known a priori, we assume that the distribution of classes is the same for the labeled and unlabeled data, so that we have $c_{uj} = c_{\ell j} \frac{n_u}{n_\ell}$ for $j \in \{1, 2\}$. We show in the supplementary material (Section E) that this assumption has little impact on the theoretical insights as well as on the experiments. As an application of Theorem 1, we provide in Figure 5 of the Appendix a "phase diagram" (relative gain with respect to supervised learning as a function of the labeled sample size and the task difficulty), which shows that a non-trivial gain with respect to the fully supervised case is obtained when few labeled samples are available and when the task is difficult. This conclusion is similar to the existing conclusions of (Mai and Couillet, 2021; Lelarge and Miolane, 2019).
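The estimator of Proposition 2 amounts to taking inner products of empirical class means, with the diagonal entries computed from two disjoint halves of the same class. A minimal NumPy sketch is given below; the function name `estimate_mean_gram`, the handling of odd sample sizes (the last column is dropped when splitting), and the synthetic data are assumptions made for the example.

```python
import numpy as np

def estimate_mean_gram(X_by_class):
    """Estimate M^T M from a list of per-class labeled data matrices of shape (d, n_i)."""
    k = len(X_by_class)
    G = np.empty((k, k))
    for i, Xi in enumerate(X_by_class):
        half = Xi.shape[1] // 2
        # Diagonal entry: inner product of the means of two disjoint halves, so that
        # the noise of one half is independent of the noise of the other.
        G[i, i] = Xi[:, :half].mean(axis=1) @ Xi[:, half:2 * half].mean(axis=1)
        for j in range(i + 1, k):
            # Off-diagonal entry: inner product of the two class means.
            G[i, j] = G[j, i] = Xi.mean(axis=1) @ X_by_class[j].mean(axis=1)
    return G

# Toy usage: two Gaussian classes with opposite means of squared norm 4.
rng = np.random.default_rng(0)
d, n_per_class = 200, 400
mu = np.zeros(d)
mu[0] = 2.0
X1 = rng.standard_normal((d, n_per_class)) + mu[:, None]
X2 = rng.standard_normal((d, n_per_class)) - mu[:, None]
G_hat = estimate_mean_gram([X1, X2])
print(np.round(G_hat, 2))  # compare with [[4, -4], [-4, 4]]
```

Splitting each class in two for the diagonal entries avoids the positive bias that the squared norm of a single empirical mean would inherit from the noise.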

4.3. APPLICATION TO HYPERPARAMETER SELECTION

Following the discussion in Section 3, the theorem allows us to recover the asymptotic performance of spectral clustering, the graph-based approach GB-SSL of Mai and Couillet (2021), and the LS-SVM (Suykens and Vandewalle, 1999). This generality of the theorem represents an important asset in the unification of some SSL learning schemes. In particular, as the theoretical error can be regarded as a function of $\alpha_\ell$ and $\alpha_u$, we propose below to use Equation (4) as a criterion to automatically select $\alpha_\ell$ and $\alpha_u$ through a grid search over different values. This leads to Algorithm 1. Note that the classification error is invariant to a scaling of $\lambda$ (see Equation (3)). Thus, we fix the value of $\lambda$ in our experiments to be $\lambda_{\max}$, the maximum eigenvalue of $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$, and optimize only $\alpha_\ell$ and $\alpha_u$. The fixed value corresponds to the one also proposed in (Mai and Couillet, 2021). We give more details about this choice for $\lambda$ and its numerical stability in Appendix C.2.

Our proposal to select $\alpha_\ell$ and $\alpha_u$ by the theorem is motivated by several practical reasons. Firstly, the importance of labeled and unlabeled examples varies, making the graph-based learning more effective in some cases, and the LS-SVM more effective in others. By properly choosing $\alpha_\ell$ and $\alpha_u$, one can find the best balance between GB-SSL and LS-SVM. Secondly, the classical approach of selecting hyperparameters by cross-validation suffers from high computational time and is prone to bias in the semi-supervised setting due to the scarcity of the labeled set (Madani et al., 2005).

Algorithm 1: QLDS algorithm with optimal selection of $\alpha_\ell$ and $\alpha_u$
    Input: labeled data $\mathbf{X}_\ell$ and unlabeled data $\mathbf{X}_u$; grid of hyperparameter values $\{(\alpha_\ell^{(t)}, \alpha_u^{(t)})\}_{t=1}^{T}$
    Preprocessing: center data $\mathbf{X} \leftarrow \mathbf{X} - \mathbb{E}[\mathbf{X}]$, where $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$
    Output: estimated label $\hat{y} \in \{-1, 1\}$ for each unlabeled example $\mathbf{x} \in \mathbf{X}_u$
    estimate the inner products $\mathbf{M}^\top \mathbf{M}$ using Proposition 2
    choose $\lambda$ as the maximum eigenvalue of $\frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top$
    for $t = 1, \ldots, T$ grid-search steps do
        take $\alpha_\ell^{(t)}$ and $\alpha_u^{(t)}$
        estimate the classification error $\varepsilon^{(t)}$ by Theorem 1 with $\alpha_u = \alpha_u^{(t)}$ and $\alpha_\ell = \alpha_\ell^{(t)}$
    end for
    select $\alpha_u^\star$ and $\alpha_\ell^\star$ by finding the $t$ that yields the minimal classification error $\varepsilon^{(t)}$
    compute the decision score $f(\mathbf{x})$ using Equation (3) with $\alpha_u = \alpha_u^\star$ and $\alpha_\ell = \alpha_\ell^\star$
    return label $\hat{y} = -1$ if $f(\mathbf{x}) < 0$, and $\hat{y} = 1$ otherwise

Complexity analysis. Algorithm 1, further denoted QLDS(th), can be decomposed into 1) training of QLDS, 2) estimation of $\mathbf{M}^\top \mathbf{M}$, and 3) selection of $\alpha_\ell$ and $\alpha_u$ over the grid. As QLDS has an explicit solution, its complexity is equivalent to that of computing the decision scores $f(\mathbf{x})$, which requires solving a system of $n$ linear equations, yielding complexity $O(n^3)$. The computation of $\mathbf{M}^\top \mathbf{M}$ is of complexity $O(dn + d)$ (estimation + product). Hyperparameter selection consists of iterating $T$ times the error estimation from Theorem 1, and its complexity is $O(T)$. Finally, the global complexity of QLDS(th) is $O(n^3)$ in the regime of Assumption 2. It is important to mention that an alternative way to optimize $\alpha_\ell$ and $\alpha_u$ is cross-validation (QLDS(cv)), which requires optimizing QLDS for each candidate $(\alpha_\ell^{(t)}, \alpha_u^{(t)})$ over $K$ folds, leading to a complexity of $O(T K n^3)$. This indicates a clear advantage of using Theorem 1 in terms of time complexity, as highlighted in Table 1.
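The selection loop of Algorithm 1 is a plain grid search in which the model is never retrained: only a scalar error estimate is evaluated per grid point. The skeleton below makes this explicit. The callable `estimate_error` stands for the theoretical error of Theorem 1 and is left as a parameter; the function `toy_error` used in the usage example is a hypothetical stand-in with a known minimum, not the formula of the theorem.

```python
import numpy as np

def select_alphas(grid, estimate_error):
    """Return the (alpha_l, alpha_u) pair of the grid with the smallest estimated error."""
    errors = np.array([estimate_error(a_l, a_u) for a_l, a_u in grid])
    best = int(np.argmin(errors))
    return grid[best], errors[best]

# Hypothetical stand-in for the Theorem 1 estimate (NOT the actual formula).
def toy_error(alpha_l, alpha_u):
    return 0.1 + (alpha_l - 0.6) ** 2 + (alpha_u - 0.3) ** 2

# T = 121 candidate pairs; each evaluation costs O(1) once M^T M has been estimated.
grid = [(a_l, a_u) for a_l in np.linspace(0, 1, 11) for a_u in np.linspace(0, 1, 11)]
(best_l, best_u), best_err = select_alphas(grid, toy_error)
print(best_l, best_u, best_err)
```

With cross-validation, each call inside the list comprehension would instead trigger K full trainings of QLDS, which is where the factor T K in the complexity comes from.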

5. EXPERIMENTAL RESULTS

In this section, we illustrate the robustness of the different algorithms and the optimization of the hyperparameters $\alpha_\ell$ and $\alpha_u$ proposed in the previous section. More specifically, Section 5.1 empirically confirms the robustness of the concentrated random vector assumption on real data by comparing the empirical distribution of the decision score with the theoretical prediction of Theorem 1. Section 5.2 analyzes the performance of QLDS when increasing the number of labeled examples, and Section 5.3 is a benchmark with several baselines on a wide range of real data sets. We compare the following methods:
• QLDS(0, 1) with $\alpha_\ell = 0$, $\alpha_u = 1$, which stands for the graph-based approach proposed in (Mai and Couillet, 2021);
• QLDS(1, 0) with $\alpha_\ell = 1$, $\alpha_u = 0$ [...]

Throughout the experimental part, we use several data sets described as follows (see more details in Section E of the supplementary material):
• Synthetic: Gaussian mixture model with $\mathbf{x}_i^{(j)} \sim \mathcal{N}(\boldsymbol{\mu}_j, \mathbf{I}_d)$ with $\boldsymbol{\mu}_1 = -\boldsymbol{\mu}_2$;
• Amazon Review data set (McAuley et al., 2015; He and McAuley, 2016) built from textual user reviews, positive or negative, on books (books), DVDs (dvd), electronics (electronics), and kitchen (kitchen) items respectively. The data is encoded as $d = 400$-dimensional tf-idf feature vectors of bag-of-words unigrams and bigrams;
• Adult data set (Kohavi et al., 1996), which consists in predicting whether income exceeds 50 000 per year based on census data;
• Mushrooms data set from the UCI Machine Learning repository (Dua and Graff, 2017), which classifies mushrooms as poisonous or edible based on their physical characteristics;
• Splice data set from the UCI Machine Learning repository (Dua and Graff, 2017), which aims to recognize two types of splice junctions in DNA sequences.

5.1. ROBUSTNESS OF THEORETICAL ANALYSIS TO REAL DATA

This section illustrates the close fit of the theoretical performance (i.e., Theorem 1) on the synthetic and two real-life data sets. To do so, we compare the empirical decision function represented by the histograms on Figure 1 versus the Gaussian statistics m j and σ 2 from Theorem 1.

5.2. ANALYSIS OF SAMPLE SIZE

Figure 2 represents the classification error as a function of the number of labeled examples. The picture shows that the theoretical model selection outperforms the cross-validation scheme and is close to the oracle selection (which uses the ground truth labels). In general, we observe that QLDS(th) is very stable in comparison with the cross-validation selector QLDS(cv).

5.3. COMPARATIVE PERFORMANCE ON SEVERAL DATA SETS

Table 2 compares the performance obtained on several data sets for a fixed number of labeled and unlabeled data, in order to analyze the performance of the hyperparameter selection and to validate the theoretical intuitions formulated in this article. The experimental results show that:
• QLDS benefits from both labeled and unlabeled data and significantly outperforms LS-SVM and GB-SSL on data sets 5 and 2 respectively.
• Fine-tuning of $\alpha_\ell$ and $\alpha_u$ provides better results than setting them to the default values.
• Hyperparameter selection using Theorem 1 outperforms or is comparable to cross-validation, while being more robust according to the standard deviation of the error.
• There is still room for improvement when we compare QLDS(or) with QLDS(th).

6. CONCLUDING REMARKS

In this paper, we proposed a theoretical analysis of a simple yet powerful linear semi-supervised classifier that relies on the low density separation assumption. Moreover, our approach builds a bridge between several existing approaches such as the least square support vector machine, spectral clustering, and graph-based semi-supervised learning. The key to our analysis was to use modern large dimensional statistics to quantify the classification error through the data statistics of the decision function. Based on this result, we proposed a hyperparameter selection criterion that demonstrated promising experimental results compared to the time-consuming cross-validation. The proposed theoretical study opens broad perspectives for the analysis of the LDS assumption in more challenging settings such as multi-class classification, the non-linear case, or fully unsupervised domain adaptation.

[...] smoothness assumption on the graph. Specifically, the algorithm estimates a class attachment "score" $f_i$ for each node $i$ by solving the optimization problem:

$$\min_{\mathbf{f}} \sum_{i, i'} \omega_{ii'} (f_i - f_{i'})^2 \quad \text{s.t.} \quad f_i = y_i \;\; \forall \, 1 \le i \le n_\ell. \qquad (10)$$

Here, the term $\omega_{ii'} (f_i - f_{i'})^2$ imposes label consistency of nearby samples (smoothness condition on the labels of the graph). The optimization problem in Equation (10) is the classical Laplacian regularization algorithm studied in depth in (Mai and Couillet, 2018). There, the authors showed the fundamental importance of "centering" the weight matrix $\mathbf{W}$. This centering approach corrects an important bias in the regularized Laplacian which completely annihilates the use of unlabeled data in a large dimensional setting. A significant performance increase was reported, both in theory and in practice, in (Mai and Couillet, 2021) when this basic, yet counter-intuitive, correction is accounted for. More specifically, the centering is performed as $\hat{\mathbf{W}} = \mathbf{P} \mathbf{W} \mathbf{P}$ with $\mathbf{P} = \mathbf{I}_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top$ the centering projector.
However, the optimization problem described in (10) now becomes non-convex since the entries of the weight matrix $\hat{\mathbf{W}}$ may take negative values (this must actually be the case as the mean value of the entries of $\hat{\mathbf{W}}$ is zero). To deal with this problem, Mai and Couillet (2021) propose to constrain the norm of the unlabeled data score vector $\mathbf{f}_u$ (that is, the score vector restricted to unlabeled data) by appending a regularization term $\alpha \|\mathbf{f}_u\|^2$ to the previous minimization problem. This leads, under a more convenient matrix formulation, to

$$\min_{\mathbf{f}_u \in \mathbb{R}^{n_u}} \alpha \|\mathbf{f}_u\|^2 - \mathbf{f}^\top \hat{\mathbf{W}} \mathbf{f} \quad \text{s.t.} \quad \mathbf{f}_\ell = \mathbf{y}_\ell.$$

This problem is now convex for all $\alpha > \|\hat{\mathbf{W}}_{uu}\|$, where $\hat{\mathbf{W}}_{uu}$ is the restriction of the matrix $\hat{\mathbf{W}}$ to the unlabeled data. It is a quadratic optimization problem with linear equality constraints, so $\mathbf{f}_u$ can be obtained explicitly. Its solution is best described under the matrix formulation (using a linear kernel, i.e., $h(x) = x$):

$$\mathbf{f}_u = \frac{1}{n} \mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \left( \lambda \mathbf{I}_d - \frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top \right)^{-1} \mathbf{X}_u. \qquad (13)$$

The graph-based SSL solution given in Equation (13) is a particular case of the QLDS solution given in Equation (9) with $\alpha_u = 1$ and $\alpha_\ell = 0$.

C.2 LINK TO SPECTRAL CLUSTERING AND CHOICE OF λ

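Spectral clustering is a particular case of (13) when $\lambda$ is the maximum eigenvalue of $\frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top$. Indeed, using the eigenvalue decomposition

$$\frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top = \sum_{i=1}^{d} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top,$$

Equation (13) can be rewritten as

$$\mathbf{f}_u = \frac{1}{n} \mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \left( \lambda \mathbf{I}_d - \frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top \right)^{-1} \mathbf{X}_u = \frac{1}{n} \sum_{i=1}^{d} \frac{\mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \mathbf{u}_i \mathbf{u}_i^\top}{\lambda - \lambda_i} \mathbf{X}_u.$$

When $\lambda \to \lambda_{\max} = \max_i \lambda_i$, and denoting by $\mathbf{u}_{\max}$ the eigenvector corresponding to the largest eigenvalue $\lambda_{\max}$, we obtain

$$\mathbf{f}_u \sim \frac{\mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \mathbf{u}_{\max} \mathbf{u}_{\max}^\top}{\lambda - \lambda_{\max}} \mathbf{X}_u \propto \mathbf{u}_{\max}^\top \mathbf{X}_u,$$

which amounts to projecting the unlabeled data $\mathbf{X}_u$ onto the eigenvector of $\frac{1}{n} \mathbf{X}_u \mathbf{X}_u^\top$ associated with its largest eigenvalue, that is, to the spectral clustering algorithm with a linear kernel.

The parameter λ. In order for QLDS to specialize to spectral clustering in the unlabeled regime, we fix the parameter $\lambda = \lambda_{\max}$ to be the maximum eigenvalue of $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$. For numerical reasons, in all the experiments we use $\lambda = (1 + \varepsilon) \lambda_{\max}$ with $\varepsilon = 10^{-3}$. Although many choices of $\varepsilon$ have been tried out, we did not find substantial improvements from treating it as a hyperparameter and therefore keep it fixed.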
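The limit above can be checked numerically. The sketch below builds the graph-based scores from the eigendecomposition of the unlabeled sample covariance and measures their alignment with the projection of the unlabeled data onto the leading eigenvector as λ approaches its largest eigenvalue from above. The synthetic Gaussian mixture, the values of ε, and the helper names are assumptions made for illustration; the absolute cosine is expected to approach 1 as ε decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_l, n_u = 30, 20, 300
n = n_l + n_u
mu = np.full(d, 0.6)
y_l = rng.choice([-1.0, 1.0], size=n_l)
y_u = rng.choice([-1.0, 1.0], size=n_u)
X_l = rng.standard_normal((d, n_l)) + np.outer(mu, y_l)
X_u = rng.standard_normal((d, n_u)) + np.outer(mu, y_u)

# Eigendecomposition of (1/n) X_u X_u^T; eigh returns eigenvalues in ascending order.
eigvals, U = np.linalg.eigh(X_u @ X_u.T / n)
lam_max, u_max = eigvals[-1], U[:, -1]

def gb_ssl_scores(lam):
    """f_u = (1/n) sum_i (y_l^T X_l^T u_i) / (lam - lam_i) * u_i^T X_u."""
    coeffs = (U.T @ (X_l @ y_l)) / (lam - eigvals)
    return (coeffs @ (U.T @ X_u)) / n

def abs_cosine(a, b):
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

spectral = u_max @ X_u  # projection of the unlabeled data on the leading eigenvector
for eps in (1.0, 1e-1, 1e-3, 1e-6):
    f_u = gb_ssl_scores((1 + eps) * lam_max)
    print(f"eps={eps:g}  |cos(f_u, u_max^T X_u)| = {abs_cosine(f_u, spectral):.6f}")
```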

D THEORETICAL ANALYSIS OF QLDS

We recall the solution of the optimization problem of QLDS as

$$\mathbf{f}_u = \frac{1}{n} \mathbf{y}_\ell^\top \mathbf{X}_\ell^\top \left( \lambda \mathbf{I}_d + \alpha_\ell \frac{\mathbf{X}_\ell \mathbf{X}_\ell^\top}{n} - \alpha_u \frac{\mathbf{X}_u \mathbf{X}_u^\top}{n} \right)^{-1} \mathbf{X}_u.$$

The goal is to understand the statistical behavior of $\mathbf{f}_u$, in particular its distribution and the moments of this distribution. To that end, we assume the following concentration property on the data $\mathbf{X} = [\mathbf{X}_\ell, \mathbf{X}_u]$.

Assumption 3 (Distribution of $\mathcal{D}(\mathbf{X})$). There exist two constants $C, c > 0$ (independent of $n$, $d$) such that, for any 1-Lipschitz function $f : \mathbb{R}^{p \times n} \to \mathbb{R}$,

$$\mathbb{P}_{\mathbf{x} \sim \mathcal{D}(\mathbf{X})} \left( |f(\mathbf{x}) - m_{f(\mathbf{x})}| \ge t \right) \le C e^{-(t/c)^2} \quad \forall t > 0,$$

where $m_Z$ is a median of the random variable $Z$. We require that the columns of $\mathbf{X}$ are independent and that for $\ell \in \{1, 2\}$, $\mathbf{x}$ [...]

As discussed in the main article, Assumption 3 notably encompasses the following scenarios: the columns of $\mathbf{X}$ are (i) independent Gaussian random vectors with identity covariance, (ii) independent random vectors uniformly distributed on the sphere of radius $\sqrt{p}$ in $\mathbb{R}^p$, and, most importantly, (iii) any Lipschitz continuous transformation thereof. Scenario (iii) is of particular relevance for practical data settings, as recently shown in (Seddik et al., 2020). Indeed, random data generated by GANs (for example, images) can be modeled as in case (iii).

An intuitive explanation of Assumption 1 is that the transformed random variable $f(\mathbf{x})$, for any Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$, has a variance of order $O(1)$. In particular, this variance does not depend on the initial dimension $d$. Although we are not aware of any formal method to check whether some data follow this assumption, a line of reasoning suggests that this concentration property is most likely present in many real data sets. Indeed, most machine learning algorithms are Lipschitz maps that transform data of high dimension $d$ into a scalar (the decision score). If the data were not concentrated, the decision score $f(\mathbf{x})$ would have a very large variance (depending on the dimension $d$), which would in turn lead to a random performance.
The fact that a machine learning algorithm is expected to reach a non-trivial performance (better than random guessing), combined with the fact that common machine learning algorithms are Lipschitz maps, suggests that the concentration assumption is meaningful for real applications. As an example, we perform the following experiment: for the books data set, we take a subset of examples and a subset of features, learn QLDS(1,0) on them, and plot the empirical distribution of $f(x)$. We conduct this experiment with increasing $n$ and $d$, and see in Figure 3 that the variance remains of the same order as $n$ and $d$ grow.

Furthermore, we place ourselves in the following large dimensional regime.

Assumption 4 (Growth Rate). As $n \to \infty$, we consider the regime where $d = O(n)$ and assume $d/n \to c_0 > 0$. Furthermore, $n_{\ell j}/n \to c_{\ell j}$ and $n_{uj}/n \to c_{uj}$ for $j = 1, 2$. We denote $c_\ell = [c_{\ell 1}, c_{\ell 2}]$ and $c_u = [c_{u1}, c_{u2}]$.

This assumption of a commensurable relationship between the number of samples and their dimension corresponds to a realistic regime and differs from classical asymptotics, where the number of samples is often assumed to be exponentially larger than the feature size, which is very unlikely in real-life applications. Under Assumptions 3 and 4, the objective of this section is threefold: (i) determine the distribution of $f_u$, (ii) determine the first order moment of $f_u$, and (iii) determine the second order moment of $f_u$.

[Figure 3 legend: books data set, with $(n, d)$ ranging from $(200, 40)$ to $(2000, 400)$ in steps of $(200, 40)$.]
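The concentration property of Assumption 3 can be illustrated on a toy case. The sketch below is an assumption-laden illustration rather than the experiment of Figure 3: it uses standard Gaussian vectors (scenario (i)) and the 1-Lipschitz function $f(x) = \|x\|$, with the hypothetical helper name `norm_variance`, and shows that the variance of $f(x)$ stays of order $O(1)$ as the dimension $d$ grows.

```python
import numpy as np

def norm_variance(d, n_samples=5000, seed=0):
    """Empirical variance of the 1-Lipschitz map x -> ||x|| for x ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, d))
    return float(np.var(np.linalg.norm(x, axis=1)))

# The mean of ||x|| grows like sqrt(d), but its variance does not grow with d.
variances = {d: norm_variance(d) for d in (10, 100, 1000)}
```

For Gaussian data this variance is known to approach 1/2 for large $d$, so the dictionary `variances` should contain values of the same order for all three dimensions.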

D.1 DISTRIBUTION OF f u

The Gaussian distribution of the decision score has recently been proved for several learning schemes, in (Tiomoko et al., 2020) for the theoretical analysis of multi-task learning and in (Seddik et al., 2021) for the theoretical analysis of softmax. We follow a similar approach, described as follows.

Proof under a Gaussian mixture model. Under a Gaussian mixture assumption for the input data $X$, the convergence in distribution of the classification score $f(x)$ is immediate: since $f(x)$ is the projection of the deterministic vector $\omega$ on the Gaussian random vector $x$, it follows that $\omega^\top x$ is asymptotically Gaussian.

Extension to the concentrated random vector assumption. Conditionally on the training data $X$, the classification score $f(x)$ is expressed as the projection of the deterministic vector $\omega$ on the concentrated random vector $x$. The result then follows from a CLT for concentrated vectors, which states that projections of concentrated random vectors on deterministic vectors are asymptotically Gaussian. This is ensured by the following result.

Theorem 3 (CLT for concentrated vectors (Klartag, 2007; Fleury et al., 2007)). If $x$ is a concentrated random vector as defined in Assumption 3 with $\mathbb{E}[x] = 0$, $\mathbb{E}[xx^\top] = I_p$, and $\sigma$ is the uniform measure on the sphere $\mathcal{S}^{p-1} \subset \mathbb{R}^p$ of radius 1, then for any integer $k = O(1)$, there exist two constants $C, c$ and a set $\Theta \subset (\mathcal{S}^{p-1})^k$ such that $\underbrace{\sigma \otimes \cdots \otimes \sigma}_{k}(\Theta) \geq 1 - \sqrt{p}\, C e^{-c\sqrt{p}}$ and, for all $\theta = (\theta_1, \ldots, \theta_k) \in \Theta$ and all $a \in \mathbb{R}^k$,
$$\sup_{t \in \mathbb{R}} \left|\mathbb{P}(a^\top \theta^\top x \geq t) - G(t)\right| \leq C p^{-\frac{1}{4}},$$
with $G(t)$ the cumulative distribution function of $\mathcal{N}(0, 1)$.

The result then follows directly. Since $f(x)$ is asymptotically Gaussian, we focus on computing its first and second order moments.
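As an illustration of Theorem 3, the sketch below projects vectors drawn uniformly on the sphere of radius $\sqrt{p}$ (a concentrated, non-Gaussian distribution with zero mean and identity second moment) onto a random unit direction, and computes the Kolmogorov distance between the empirical distribution of the projection and the standard Gaussian. The function name `ks_distance_to_gaussian` and all sizes are illustrative assumptions.

```python
import math
import numpy as np

def ks_distance_to_gaussian(p=200, n_samples=10000, seed=0):
    """Kolmogorov distance between theta^T x and N(0, 1), x uniform on the sqrt(p)-sphere."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, p))
    # Normalizing Gaussian vectors gives the uniform law on the sphere.
    x = math.sqrt(p) * g / np.linalg.norm(g, axis=1, keepdims=True)
    theta = rng.standard_normal(p)
    theta /= np.linalg.norm(theta)                 # deterministic-like unit direction
    z = np.sort(x @ theta)
    gaussian_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    upper = np.arange(1, n_samples + 1) / n_samples
    lower = np.arange(0, n_samples) / n_samples
    return float(max(np.max(upper - gaussian_cdf), np.max(gaussian_cdf - lower)))
```

A small returned distance, of the order of the sampling noise $1/\sqrt{n_{\text{samples}}}$, is consistent with the projection being close to Gaussian.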

D.2 FIRST ORDER MOMENT OF f u

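Using Equation (14), the first order moment of $f_u$ can be computed as
$$\mathbb{E}[f_u] = \mathbb{E}\left[\frac{1}{n} y^\top X_\ell^\top \left(\lambda I_d + \alpha_\ell \frac{X_\ell X_\ell^\top}{n} - \alpha_u \frac{X_u X_u^\top}{n}\right)^{-1} X_u\right]. \qquad (15)$$
Let us define for convenience the data matrix $X$ as the concatenation of the labeled and unlabeled data matrices $X_\ell$ and $X_u$, i.e., $X = [X_\ell, X_u] \in \mathbb{R}^{d \times n}$. Then the expectation in (15) can be rewritten in the more compact form
$$\mathbb{E}[f_u] = \mathbb{E}\left[\frac{1}{n} y^\top X_\ell^\top Q X_u\right], \quad Q = \left(\lambda I_d + \frac{X A X^\top}{n}\right)^{-1}, \quad A = \begin{bmatrix} \alpha_\ell I_{n_\ell} & 0_{n_\ell \times n_u} \\ 0_{n_u \times n_\ell} & -\alpha_u I_{n_u} \end{bmatrix}.$$
To proceed, we furthermore introduce the matrices
$$S_\ell = \begin{bmatrix} I_{n_\ell} \\ 0_{n_u \times n_\ell} \end{bmatrix} \in \mathbb{R}^{n \times n_\ell}, \qquad S_u = \begin{bmatrix} 0_{n_\ell \times n_u} \\ I_{n_u} \end{bmatrix} \in \mathbb{R}^{n \times n_u},$$
such that $X_\ell = X S_\ell$ and $X_u = X S_u$. This leads to the following compact expression, which depends only on the random matrix $X$:
$$\mathbb{E}[f_u] = \mathbb{E}\left[y^\top S_\ell^\top \frac{X^\top Q X}{n} S_u\right]. \qquad (16)$$
Furthermore, let us recall the concept of deterministic equivalents, a classical object in random matrix theory.

Definition 2 ((Couillet and Debbah, 2011, Chapter 6)). A deterministic matrix $\bar{F} \in \mathbb{R}^{n \times d}$ is said to be a deterministic equivalent of a given random matrix $F \in \mathbb{R}^{n \times d}$, denoted $F \leftrightarrow \bar{F}$, if for any deterministic linear functional $f_{n,d} : \mathbb{R}^{n \times d} \to \mathbb{R}$ of bounded norm (uniformly over $d, n$), $f_{n,d}(F - \bar{F}) \to 0$ almost surely as $n, d \to \infty$.

In particular, if $F \leftrightarrow \bar{F}$, then $u^\top (F - \bar{F}) v \xrightarrow{a.s.} 0$ for $u, v$ two unit vectors, and for every deterministic matrix $A$ of bounded norm we also have $\frac{1}{n} \operatorname{tr} A(F - \bar{F}) \xrightarrow{a.s.} 0$. Deriving deterministic equivalents of the various objects under consideration is the crucial tool to obtain the main result. Deterministic equivalents are particularly suitable to handle bilinear forms involving the random matrix $F$, as is the case for the statistics of $f_u$, where the bilinear form $\frac{X^\top Q X}{n}$ appears (see Equation (16)).

Deterministic equivalent of $\frac{X^\top Q X}{n}$. Let $u, v$ be unit vectors for the $\ell_2$-norm. We develop:
$$\frac{1}{n} \mathbb{E}[u^\top X^\top Q X v] = \frac{1}{n} \sum_{i,j=1}^{n} \mathbb{E}\left[u_i x_i^\top Q x_j v_j\right] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[u_i x_i^\top Q x_i v_i\right] + \frac{1}{n} \sum_{\substack{i,j=1 \\ i \neq j}}^{n} \mathbb{E}\left[u_i x_i^\top Q x_j v_j\right].$$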
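Furthermore, let us define for convenience the matrix $X_{-i}$, which is the matrix $X$ with the zero vector on its $i$-th column, such that $X X^\top = X_{-i} X_{-i}^\top + x_i x_i^\top$. Applying the Sherman-Morrison matrix inversion lemma (i.e., $(M + uv^\top)^{-1} = M^{-1} - \frac{M^{-1} u v^\top M^{-1}}{1 + v^\top M^{-1} u}$ for any invertible matrix $M$ and vectors $u, v$) to $Q$ leads to
$$Q = Q_{-i} - \frac{\frac{1}{n} A_{ii} Q_{-i} x_i x_i^\top Q_{-i}}{1 + \frac{1}{n} A_{ii} x_i^\top Q_{-i} x_i}, \qquad Q_{-i} = \left(\frac{X_{-i} A X_{-i}^\top}{n} + \lambda I_d\right)^{-1}.$$
The latter allows us to disentangle the strong dependency between $Q$ and $x_i$ as
$$Q x_i = \frac{Q_{-i} x_i}{1 + \frac{1}{n} A_{ii} x_i^\top Q_{-i} x_i}. \qquad (17)$$
Using Equation (17), we rewrite $\frac{X^\top Q X}{n}$ as
$$\frac{1}{n} \mathbb{E}[u^\top X^\top Q X v] = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\frac{u_i x_i^\top Q_{-i} x_i v_i}{1 + A_{ii} \bar{\delta}_i}\right] + \frac{1}{n} \sum_{\substack{i,j=1 \\ i \neq j}}^{n} \mathbb{E}\left[\frac{u_i x_i^\top Q_{-ij} x_j v_j}{(1 + A_{ii} \bar{\delta}_i)(1 + A_{jj} \bar{\delta}_j)}\right],$$
with $\bar{\delta}_i = \frac{1}{n} \mathbb{E}\left[x_i^\top Q_{-i} x_i\right]$. Since Assumption 3 ensures that $x_1^{(j)}, \ldots, x_{n_j}^{(j)}$, $j = 1, 2$, are i.i.d. data vectors, we impose the natural constraint of equal $\bar{\delta}_1 = \ldots = \bar{\delta}_{n_j}$ within every class $j = 1, 2$. As such, we may write the complete vector $\bar{\delta} \in \mathbb{R}^n$ in the form
$$\bar{\delta} = \left[\delta_1 1_{n_{\ell 1}}, \delta_2 1_{n_{\ell 2}}, \delta_1 1_{n_{u1}}, \delta_2 1_{n_{u2}}\right],$$
where $\delta_j = \frac{1}{n} \mathbb{E}\left[x_i^\top Q_{-i} x_i \mid x_i \in \mathcal{C}_j\right] = \frac{1}{n} \operatorname{tr}(\Sigma_j \bar{Q})$ is defined for each class $j = 1, 2$, with $\bar{Q}$ a deterministic equivalent of $Q$. Using the shortcut notation $\bar{x}_i = \mathbb{E}[x_i]$ and the independence between samples $x_i$ and $x_j$ for $i \neq j$, the expectation can finally be obtained as
$$\frac{1}{n} \mathbb{E}[u^\top X^\top Q X v] = \sum_{i=1}^{n} \frac{u_i \bar{\delta}_i v_i}{1 + A_{ii} \bar{\delta}_i} + \frac{1}{n} \sum_{\substack{i,j=1 \\ i \neq j}}^{n} \frac{u_i \bar{x}_i^\top \bar{Q}_{-ij} \bar{x}_j v_j}{(1 + A_{ii} \bar{\delta}_i)(1 + A_{jj} \bar{\delta}_j)} + O(1/\sqrt{n}).$$
We therefore deduce a deterministic equivalent for $\frac{1}{n} X^\top Q X$ as
$$\frac{1}{n} X^\top Q X \leftrightarrow \Delta + \frac{1}{n} J M_\delta^\top \bar{Q} M_\delta J^\top,$$
where $\Delta$ is the diagonal matrix with entries $\Delta_{ii} = \frac{\bar{\delta}_i}{1 + A_{ii} \bar{\delta}_i}$,
$$M_\delta = \left[\frac{\mu_1}{1 + \alpha_\ell \delta_1}, \frac{\mu_2}{1 + \alpha_\ell \delta_2}, \frac{\mu_1}{1 - \alpha_u \delta_1}, \frac{\mu_2}{1 - \alpha_u \delta_2}\right] \quad \text{and} \quad J = \begin{bmatrix} 1_{n_{\ell 1}} & & 0 \\ & \ddots & \\ 0 & & 1_{n_{u2}} \end{bmatrix}.$$
Since $\Delta$ is diagonal, $S_\ell^\top \Delta S_u = 0$, and the expectation can finally be obtained as
$$\mathbb{E}[f_u] = y^\top S_\ell^\top \left(\Delta + \frac{1}{n} J M_\delta^\top \bar{Q} M_\delta J^\top\right) S_u = \frac{1}{n} y^\top S_\ell^\top J M_\delta^\top \bar{Q} M_\delta J^\top S_u.$$
It then remains to find a deterministic equivalent $\bar{Q}$ for $Q$.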
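The leave-one-out identity (17) is purely algebraic and can be verified numerically. The following is a small sketch under illustrative assumptions (Gaussian data, arbitrary sizes and hyperparameters, hypothetical function name `leave_one_out_check`); it builds $Q$ and $Q_{-i}$ explicitly and returns both sides of the identity.

```python
import numpy as np

def leave_one_out_check(i=0, d=5, n_l=4, n_u=6, lam=10.0,
                        alpha_l=1.0, alpha_u=0.5, seed=0):
    """Return (Q x_i, Q_{-i} x_i / (1 + A_ii x_i^T Q_{-i} x_i / n))."""
    rng = np.random.default_rng(seed)
    n = n_l + n_u
    X = rng.standard_normal((d, n))
    # Diagonal of A: +alpha_l on labeled columns, -alpha_u on unlabeled ones.
    a = np.concatenate([alpha_l * np.ones(n_l), -alpha_u * np.ones(n_u)])
    Q = np.linalg.inv(lam * np.eye(d) + (X * a) @ X.T / n)
    X_minus = X.copy()
    X_minus[:, i] = 0.0                            # remove the i-th sample
    Q_minus = np.linalg.inv(lam * np.eye(d) + (X_minus * a) @ X_minus.T / n)
    x_i = X[:, i]
    lhs = Q @ x_i
    rhs = Q_minus @ x_i / (1.0 + a[i] * (x_i @ Q_minus @ x_i) / n)
    return lhs, rhs
```

The two returned vectors should agree up to floating-point error for both labeled indices ($A_{ii} = \alpha_\ell$) and unlabeled indices ($A_{ii} = -\alpha_u$).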
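Similarly to (Louart and Couillet, 2018), the deterministic equivalent for $Q$ can be obtained as
$$Q \leftrightarrow \bar{Q} = \left(\lambda I_d + \frac{\alpha_\ell c_{\ell 1} C_1}{1 + \alpha_\ell \delta_1} + \frac{\alpha_\ell c_{\ell 2} C_2}{1 + \alpha_\ell \delta_2} - \frac{\alpha_u c_{u1} C_1}{1 - \alpha_u \delta_1} - \frac{\alpha_u c_{u2} C_2}{1 - \alpha_u \delta_2}\right)^{-1}.$$
Further defining
$$\kappa_1 = \frac{c_{\ell 1} \alpha_\ell}{1 + \alpha_\ell \delta_1} - \frac{c_{u1} \alpha_u}{1 - \alpha_u \delta_1}, \qquad \kappa_2 = \frac{c_{\ell 2} \alpha_\ell}{1 + \alpha_\ell \delta_2} - \frac{c_{u2} \alpha_u}{1 - \alpha_u \delta_2},$$
we can further write, by the Woodbury identity,
$$\bar{Q} = \bar{Q}_0 - \bar{Q}_0 M \left(D_\kappa^{-1} + M^\top \bar{Q}_0 M\right)^{-1} M^\top \bar{Q}_0, \qquad \bar{Q}_0 = \left(\lambda I_d + \kappa_1 \Sigma_1 + \kappa_2 \Sigma_2\right)^{-1}.$$
Therefore
$$M_\delta^\top \bar{Q} M_\delta = D_\delta A^\top D_\kappa^{-1} \left(I_2 - \left(D_\kappa^{-1} + M^\top \bar{Q}_0 M\right)^{-1} D_\kappa^{-1}\right) A D_\delta,$$
where $\delta = [1/(1 + \alpha_\ell \delta_1), 1/(1 + \alpha_\ell \delta_2), 1/(1 - \alpha_u \delta_1), 1/(1 - \alpha_u \delta_2)]$ and $A = [I_2, I_2]$. We then deduce the expectation as
$$m_j = \mathbb{E}[f_i \mid x_i \in \mathcal{C}_j] = \left(e_1^{[2]} - e_2^{[2]}\right)^\top D_{c_\ell} D_{\delta_\ell} D_\kappa^{-1} \left(I_2 - \left(D_\kappa^{-1} + M^\top \bar{Q}_0 M\right)^{-1} D_\kappa^{-1}\right) D_{\delta_u} e_j^{[2]},$$
with $\mathcal{M} = \left(D_\kappa^{-1} + M^\top \bar{Q}_0 M\right)^{-1}$, $\delta_\ell = [1/(1 + \alpha_\ell \delta_1), 1/(1 + \alpha_\ell \delta_2)]$ and $\delta_u = [1/(1 - \alpha_u \delta_1), 1/(1 - \alpha_u \delta_2)]$. In the case of identity covariance tackled in the main article, we have $\delta := \delta_1 = \delta_2$ and
$$\mathcal{M} = \left(D_\kappa^{-1} + \frac{M^\top M}{\lambda + \kappa_1 + \kappa_2}\right)^{-1}.$$
Therefore the mean reads
$$m_j = \mathbb{E}[f_i \mid x_i \in \mathcal{C}_j] = \frac{(-1)^j c_{\ell j} - \left(e_1^{[2]} - e_2^{[2]}\right)^\top D_{c_\ell} D_\kappa^{-1} \mathcal{M} e_j^{[2]}}{\kappa_j (1 - \alpha_u \delta)(1 + \alpha_\ell \delta)}.$$

The last step consists in finding a deterministic equivalent of $Q C_{u\delta} Q$, denoted $E$. To that end, let us evaluate, for any deterministic vectors $u, v \in \mathbb{R}^d$ of unit norm, $\frac{1}{n} \mathbb{E}[u^\top Q C_{u\delta} (Q - \bar{Q}) v]$. Applying the matrix identity $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$, valid for any invertible matrices $A, B$, to $Q - \bar{Q}$, and using algebraic simplifications, in particular Equation (17), we successively obtain
$$\begin{aligned} \frac{1}{n} \mathbb{E}\left[u^\top Q C_{u\delta} (Q - \bar{Q}) v\right] &= \frac{1}{n} \mathbb{E}\left[u^\top Q C_{u\delta} Q \left(C_\delta - \frac{X A X^\top}{n}\right) \bar{Q} v\right] \\ &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[-\frac{1}{n} A_{ii} \frac{u^\top Q C_{u\delta} Q_{-i} x_i x_i^\top \bar{Q} v}{1 + A_{ii} \bar{\delta}_i}\right] + \frac{1}{n} \mathbb{E}\left[u^\top Q C_{u\delta} Q C_\delta \bar{Q} v\right] + O(1/\sqrt{n}) \\ &= \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[-\frac{1}{n} A_{ii} \frac{u^\top Q_{-i} C_{u\delta} Q_{-i} x_i x_i^\top \bar{Q} v}{1 + A_{ii} \bar{\delta}_i}\right] + \frac{1}{n} \mathbb{E}\left[u^\top Q C_{u\delta} Q C_\delta \bar{Q} v\right] \\ &\quad + \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{E}\left[\frac{1}{n} A_{ii}^2 \frac{u^\top Q_{-i} x_i x_i^\top Q_{-i} C_{u\delta} Q_{-i} x_i x_i^\top \bar{Q} v}{(1 + A_{ii} \bar{\delta}_i)^2}\right] + O(1/\sqrt{n}) \\ &= \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{E}\left[\frac{1}{n} \operatorname{tr}\left(C_i Q_{-i} C_{u\delta} Q\right) \frac{A_{ii}^2\, u^\top Q C_i \bar{Q} v}{(1 + A_{ii} \bar{\delta}_i)^2}\right] + O(1/\sqrt{n}), \end{aligned}$$
where the first two terms of the third expression cancel in expectation by definition of $C_\delta$, and where $\bar{Q} = (\lambda I_d + C_\delta)^{-1}$, i.e.,
$$C_\delta = \frac{\alpha_\ell c_{\ell 1} C_1}{1 + \alpha_\ell \delta_1} + \frac{\alpha_\ell c_{\ell 2} C_2}{1 + \alpha_\ell \delta_2} - \frac{\alpha_u c_{u1} C_1}{1 - \alpha_u \delta_1} - \frac{\alpha_u c_{u2} C_2}{1 - \alpha_u \delta_2}.$$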
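Therefore
$$Q C_{u\delta} Q \leftrightarrow E = \bar{Q} C_{u\delta} \bar{Q} + \sum_{k=1}^{2} \frac{c_{\ell k} \alpha_\ell^2 d_k}{(1 + \alpha_\ell \delta_k)^2} \bar{Q} C_k \bar{Q} + \sum_{k=1}^{2} \frac{c_{uk} \alpha_u^2 d_k}{(1 - \alpha_u \delta_k)^2} \bar{Q} C_k \bar{Q}, \qquad (23)$$
where $d_k = \frac{1}{n} \operatorname{tr}\left(C_k Q C_{u\delta} Q\right)$. Right-multiplying Equation (23) by $C_k$ and taking the trace allows us to retrieve an expression for $D = D_d$, with $d = [d_1, d_2]$, as
$$D = D_{\tilde{t}} \left(I_2 - D_{\tilde{a}} \tilde{V}\right)^{-1}, \quad \tilde{V}_{kk'} = \frac{1}{n} \operatorname{tr}\left(C_k \bar{Q} C_{k'} \bar{Q}\right), \quad \tilde{t}_k = \frac{1}{n} \operatorname{tr}\left(C_k \bar{Q} C_{u\delta} \bar{Q}\right), \quad \tilde{a}_k = \frac{c_{\ell k} \alpha_\ell^2}{(1 + \alpha_\ell \delta_k)^2} + \frac{c_{uk} \alpha_u^2}{(1 - \alpha_u \delta_k)^2}.$$
As for the mean, the variance can be further simplified as
$$\operatorname{Var}(f_i) = (e_1 - e_2)^\top \left(D_{c_\ell} D_{\delta_\ell} D_\kappa^{-1} \mathcal{M} G \mathcal{M} D_\kappa^{-1} D_{\delta_\ell} D_{c_\ell} + D_d D_{c_\ell}\right) (e_1 - e_2),$$
where
$$G = M^\top \bar{Q}_0\, \mathcal{C}\, \bar{Q}_0 M, \qquad \mathcal{C} = C_{u\delta} + \sum_{k=1}^{2} \tilde{a}_k d_k C_k.$$
In the case of identity covariance matrix tackled in the main article,
$$G = \left(-\frac{c_u}{(1 - \alpha_u \delta)} + \sum_{k=1}^{2} \tilde{a}_k d_k\right) \delta\, M^\top M,$$
with $d_k$ and $\tilde{a}_k$ which simplify as
$$d_k = -\frac{1}{(1 - \alpha_u \delta)^2} \frac{c_0 c_u}{(\lambda + \kappa_1 + \kappa_2)^2 - c_0 \tilde{a}_k}, \qquad \tilde{a}_k = \frac{c_{\ell k} \alpha_\ell^2}{(1 + \alpha_\ell \delta)^2} + \frac{c_{uk} \alpha_u^2}{(1 - \alpha_u \delta)^2}.$$
This leads to the following theorem for general covariance matrices.

Theorem 4. Let $X \in \mathbb{R}^{d \times n}$ be a data set that follows Assumptions 3 and 4, and consider the notation defined previously. For any $x \in X_u$ with $x \in \mathcal{C}_j$ and $f(x) = \frac{1}{\sqrt{n}} \omega^\top x$, we have almost surely, for both classes $j$,
$$f(x \mid x \in \mathcal{C}_j) - f_j \xrightarrow{a.s.} 0, \quad \text{where } f_j \sim \mathcal{N}\left(m_j, \sigma_j^2\right).$$
The mean $m_j$ and the variance $\sigma_j^2$ are defined as
$$m_j = \frac{1}{n} y^\top S_\ell^\top J M_\delta^\top \bar{Q} M_\delta J^\top S_u, \qquad \sigma_j^2 = (e_1 - e_2)^\top \left(D_{c_\ell} D_{\delta_\ell} D_\kappa^{-1} \mathcal{M} G \mathcal{M} D_\kappa^{-1} D_{\delta_\ell} D_{c_\ell} + D_d D_{c_\ell}\right) (e_1 - e_2).$$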
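The deterministic equivalent $\bar{Q}$ of Section D.2 can be sanity-checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not the paper's procedure: it takes zero class means and identity covariance, so that $C_1 = C_2 = I_d$ and $\delta_1 = \delta_2 = \delta$ solves the scalar fixed point $\delta = c_0 / (\lambda + \kappa_1 + \kappa_2)$, and it compares this value with the empirical quantity $\frac{1}{n} \operatorname{tr} Q$. The function names and the plain fixed-point iteration are hypothetical choices.

```python
import numpy as np

def delta_fixed_point(c0, c_l, c_u, lam, alpha_l, alpha_u, n_iter=200):
    """Solve delta = c0 / (lam + kappa(delta)) by fixed-point iteration.

    c_l and c_u are the total labeled and unlabeled proportions
    (c_l1 + c_l2 and c_u1 + c_u2), so kappa stands for kappa_1 + kappa_2.
    """
    delta = c0 / lam
    for _ in range(n_iter):
        kappa = (c_l * alpha_l / (1.0 + alpha_l * delta)
                 - c_u * alpha_u / (1.0 - alpha_u * delta))
        delta = c0 / (lam + kappa)
    return delta

def delta_empirical(d=200, n_l=100, n_u=300, lam=5.0,
                    alpha_l=1.0, alpha_u=0.5, seed=0):
    """Empirical (1/n) tr Q for Q = (lam I + X A X^T / n)^{-1}, X Gaussian."""
    rng = np.random.default_rng(seed)
    n = n_l + n_u
    X = rng.standard_normal((d, n))
    a = np.concatenate([alpha_l * np.ones(n_l), -alpha_u * np.ones(n_u)])
    Q = np.linalg.inv(lam * np.eye(d) + (X * a) @ X.T / n)
    return float(np.trace(Q) / n)
```

With the default sizes, `delta_fixed_point(0.5, 0.25, 0.75, 5.0, 1.0, 0.5)` and `delta_empirical()` should be close to each other, which is the behavior predicted by the deterministic equivalent in this simplified case.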

E EXPERIMENTS

This section complements Section 5 of the main paper by giving more details on the experimental setup and by performing two additional experiments.

E.1 EXPERIMENTAL SETUP

Table 3 sums up the characteristics of the publicly available real data sets used in our experiments. As we are interested in the practical use of the proposed approach in the semi-supervised regime, we test the performance in the case where $n_\ell \ll n_u$. Thus, instead of using the original train/test splits proposed by the data sources, we set our own labeled/unlabeled splits to fit the semi-supervised context. For each data set, we repeat the experiment 20 times by randomly splitting the original data into a labeled and an unlabeled set, fixing their sample sizes to the values shown in Table 3. For the results, we evaluate the transductive error on the unlabeled data and display the average and the standard deviation (both in %) over the 20 trials. All experiments were performed on a laptop with an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz and 16GB of RAM. The implementation code for reproducing the experimental results of the paper will be released upon acceptance of the article.

In the first experiment, we analyze the influence of the assumption $c_{uj} = c_{\ell j} \frac{n_u}{n_\ell}$ (the proportions of class 1 and class 2 are the same in the unlabeled and labeled sets). We represent the optimal classification error under this assumption and the optimal classification error knowing the true value of $c_{uj}$, as a function of the violation of this assumption, measured by the ratio $\frac{n_{\ell j} n_u}{n_{uj} n_\ell}$. As a reminder, this assumption is needed since the theoretical performance depends on $c_{uj}$, which is not known a priori and needs to be estimated.
As shown in Figure 4, this assumption does not alter the overall behavior of the model selection approach: the selected model is the same even when the proportions of class 1 and class 2 differ between the labeled and the unlabeled sets ($\frac{n_{\ell j} n_u}{n_{uj} n_\ell} \neq 1$).

E.3 IMPROVEMENT OVER THE SUPERVISED BASELINE

In a second experiment, we represent the gain of QLDS with respect to LS-SVM (the difference between the two classification errors) when the hyperparameters are optimized (as performed theoretically in Algorithm 1 of the main paper), as a function of the difficulty of the task (implemented through the norm of the matrix $M$) and of the number of labeled samples. Figure 5, which looks like a "phase diagram", shows that a non-trivial gain is obtained with respect to the fully supervised case. In particular, one can see that the gain of using a semi-supervised approach is relevant when few labeled samples are available and when the task is difficult. This conclusion is similar to the existing conclusions of (Mai and Couillet, 2021; Lelarge and Miolane, 2019).

Figure 5: (Left) Relative gain with respect to supervised learning as a function of the labeled sample size and the task difficulty, on a synthetic Gaussian mixture model. The task difficulty on the y-axis is measured by the distance $\|\mu_1 - \mu_2\|$ between the means of class 1 and class 2: a higher value means that the task is easy and a smaller value means that the task is difficult. In the lower left corner (difficult task and a small number of labeled samples), a non-trivial gain is obtained with respect to the fully supervised case.

E.4 COMPARISON OF DIFFERENT SEMI-SUPERVISED LOSSES

In this section, we additionally support our choice of the learning objective given by Eq. 1 and compare different possibilities to construct the loss function for semi-supervised linear classification. More specifically, we compare for the labeled part

1. the quadratic loss $\sum_{i=1}^{n_\ell} (y_i - x_i^\top \omega)^2$,
2. the differentiable surrogate of the hinge loss (Zhang and Oles, 2001) $\sum_{i=1}^{n_\ell} \frac{1}{\gamma} \log\left(1 + \exp\{\gamma(1 - y_i x_i^\top \omega)\}\right)$ with $\gamma$ set to 20,
3. the log-loss $\sum_{i=1}^{n_\ell} y_i \log \sigma(x_i^\top \omega) + (1 - y_i) \log(1 - \sigma(x_i^\top \omega))$,

and for the unlabeled part

1. the quadratic margin $\sum_{i=n_\ell+1}^{n_\ell+n_u} (\omega^\top x_i)^2$,
2. the differentiable surrogate of the absolute value of the margin $\sum_{i=n_\ell+1}^{n_\ell+n_u} \exp\{-3(\omega^\top x_i)^2\}$ (Chapelle and Zien, 2005).

We consider all possible combinations of the labeled and the unlabeled parts, which results in 6 semi-supervised losses. We optimize them using the Adam optimizer (Kingma and Ba, 2015), fixing the learning rate and the weight decay to $10^{-3}$ and $10^{-5}$, respectively. Note that when the square loss and the quadratic margin are considered, we obtain a gradient-based version of QLDS. For a fair comparison, for each loss, we perform a grid search over possible values of $\alpha_\ell$, $\alpha_u$, $\lambda$ and choose the best solution according to the oracle, namely, the performance on the unlabeled data. Table 4 reports the performance results on 7 real data sets. One can see that in most cases the quadratic margin outperforms the absolute value of the margin. In general, the combination of the square loss and the quadratic margin appears to be stable, leading to the second-best solution in many cases. Thus, by choosing this learning objective, we do not lose much efficiency, while keeping a convex objective and the ability to conduct a theoretical analysis.

This section extends Section 5.2 of the main paper by providing experimental results for different sizes of the labeled set. In addition, we depict the values of $\alpha_\ell$ and $\alpha_u$ (averaged over 20 splits) taken by QLDS (th), QLDS (cv) and QLDS (or).
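The loss terms listed above can be written compactly. The following sketch is only an illustration: it assumes data matrices with samples stored as columns ($X_\ell \in \mathbb{R}^{d \times n_\ell}$, $X_u \in \mathbb{R}^{d \times n_u}$), labels $y_i \in \{-1, +1\}$, and hypothetical function names. It covers the quadratic loss, the smooth hinge surrogate and both unlabeled terms, and leaves out the log-loss as well as the way the terms are weighted and combined, which is specified by Eq. 1 of the main paper.

```python
import numpy as np

def quadratic_loss(w, X_l, y):
    """Labeled term 1: sum_i (y_i - x_i^T w)^2."""
    return float(np.sum((y - X_l.T @ w) ** 2))

def smooth_hinge_loss(w, X_l, y, gamma=20.0):
    """Labeled term 2: sum_i (1/gamma) log(1 + exp(gamma (1 - y_i x_i^T w)))."""
    z = gamma * (1.0 - y * (X_l.T @ w))
    # logaddexp(0, z) = log(1 + exp(z)), computed without overflow.
    return float(np.sum(np.logaddexp(0.0, z)) / gamma)

def quadratic_margin(w, X_u):
    """Unlabeled term 1: sum_i (w^T x_i)^2."""
    return float(np.sum((X_u.T @ w) ** 2))

def exp_margin(w, X_u):
    """Unlabeled term 2: sum_i exp(-3 (w^T x_i)^2)."""
    return float(np.sum(np.exp(-3.0 * (X_u.T @ w) ** 2)))
```

Each function returns a scalar that can be passed to any gradient-based optimizer; the use of `np.logaddexp` is a numerical-stability choice of this sketch and is not taken from the paper.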
All the results can be seen in Figure 6 and Figure 7 .

E.6 CHOICE OF λ

In Section 5, we have fixed the value of $\lambda$ to the maximum eigenvalue of $X = [X_\ell, X_u]$ for all versions of QLDS. To support this choice and make sure that it does not harm the baselines, in this section we provide an additional experiment, where we compare the fixed value of $\lambda$ with the case where $\lambda$ is tuned by 10-fold cross-validation on the available labeled set. Table 5 reports this comparison for QLDS (1,0) (LS-SVM) and QLDS (0,1) (GB-SSL). As one can see, on 5 of the 7 data sets the maximum eigenvalue heuristic outperforms cross-validation. The experimental results suggest that the cross-validation policy is more relevant in the cases where the labeled data is more informative than the unlabeled data (adult and mushrooms). Otherwise, the maximum eigenvalue heuristic seems to be more appropriate, which is in accordance with (Mai and Couillet, 2021).

Table 5: The classification error of the supervised and the unsupervised baselines when the hyperparameter $\lambda$ is fixed to the maximum eigenvalue, and when it is tuned using cross-validation on the labeled set. The smallest error for each baseline is highlighted in bold. Columns: data set; QLDS (1,0) (LS-SVM) with $\lambda$ fixed and with $\lambda$ cross-validated (CV); QLDS (0,1) (GB-SSL) with $\lambda$ fixed and with $\lambda$ cross-validated (CV).



¹ Up to a scaling $y^\top X_\ell^\top u_{\max}$, which does not impact the classification error.



(High-dimensional asymptotics) As $n \to \infty$, we consider the regime where $d = O(n)$ and assume $d/n \to c_0 > 0$. Furthermore, for $j = 1, 2$, $n_{\ell j}/n \to c_{\ell j}$ and $n_{uj}/n \to c_{uj}$. We denote $c_\ell = [c_{\ell 1}, c_{\ell 2}]$ and $c_u = [c_{u1}, c_{u2}]$.

0 which stands for LS-SVM;
• Self-training with the Least-Square SVM as the base classifier, where the confidence threshold is optimized as proposed by Feofanov et al. (2019), denoted ST (LS-SVM);
• QLDS(cv), with model selection of $\alpha_\ell$ and $\alpha_u$ by 10-fold cross-validation;
• QLDS(th), with model selection of $\alpha_\ell$ and $\alpha_u$ performed theoretically using Theorem 1;
• QLDS(or), an oracle used to measure the efficiency of the proposed algorithm: QLDS where model selection of $\alpha_\ell$ and $\alpha_u$ is performed on the ground truth (as if the labels of the unlabeled examples were known). It represents a lower bound on the classification error of the previous approaches.

Figure 1: Empirical versus theoretical density of the decision score $f(x)$ for (Left) a synthetic data set with $d = 100$, $n_{\ell 1} = n_{\ell 2} = 100$, $n_{u1} = n_{u2} = 1000$; (Center) Review-kitchen classification with $d = 400$, $n_{\ell 1} = n_{\ell 2} = 100$; (Right) Review-books classification with $d = 400$, $n_{\ell 1} = n_{\ell 2} = 100$. For both review data sets, the empirical histogram is computed using 400 unlabeled samples.

are i.i.d. such that $\mathrm{Cov}(x_i^{(j)}) = \Sigma_j$. We further denote the mean and covariance for the columns of $X$ respectively as $\mu_j \equiv \mathbb{E}[x_1^{(j)}]$ and $C_j = \Sigma_j + \mu_j \mu_j^\top$.

Figure 3: Practical illustration of the concentration property on books data set. With increasing n and d, every colored histogram corresponds to the empirical distribution of the g(x) of QLDS(1,0).

Figure 4: Optimal classification error as a function of the discrepancy between the class proportions in the labeled and unlabeled sets ($\frac{n_{\ell j} n_u}{n_{uj} n_\ell}$).

Figure 6: The performance and model selection results on different data sets with the increase of the number of labeled examples.

Figure 7: The performance and model selection results on different data sets with the increase of the number of labeled examples.

Running time comparison between the theory-based hyperparameter selector and the cross-validation-based selector with 10 folds, with $n_{\ell j} = n_{uj} = d$ for $j \in \{1, 2\}$.

The classification error of different methods under consideration on the real benchmark data sets.

…              ± 2.25    26.47 ± 0.72     49.13 ↓ ± 0.65    35.83 ↓ ± 2.48   27.91 ± 3.32     26.03 ± 0.79     25.7 ± 0.93
dvd            38.33 ↓ ± 1.72   29.12 ± 1.35     49.25 ↓ ± 0.68    36.46 ↓ ± 1.94   29.53 ± 3.48     28.53 ± 1.33     26.94 ± 1.47
electronics    34.15 ↓ ± 3.25   19.4 ± 0.29      48.67 ↓ ± 1.05    31.69 ↓ ± 3.56   20.1 ↓ ± 1.03    19.41 ± 0.46     19.11 ± 0.58
kitchen        32.39 ↓ ± 3.02   19.31 ± 0.16     49.07 ↓ ± 0.64    29.62 ↓ ± 3.03   19.98 ↓ ± 2.28   19.11 ± 0.32     18.67 ± 0.43
splice         39.81 ↓ ± 2.93   35.48 ± 0.86     44.36 ↓ ± 2.3     39.36 ↓ ± 3.12   37.02 ± 3.04     35.35 ± 1.26     33.63 ± 1.75
adult          33.35 ± 0.68     36.28 ↓ ± 0.06   32.55 ± 1.47      35.45 ↓ ± 0.75   32.25 ± 1.92     32.88 ± 2.46     31.9 ± 1.74
mushrooms      6.55 ↓ ± 2.07    11.33 ↓ ± 0.04   33.94 ↓ ± 10.67   6.62 ↓ ± 2.39    2.57 ± 1.86      8.49 ↓ ± 3.63    1.75 ± 1.31

Table 3: Characteristics of data sets used in our experiments.

Table 4: The classification error of different semi-supervised losses on the real benchmark data sets. Square Loss - Quadratic Margin corresponds to QLDS. The smallest and the second smallest error values are highlighted in bold and italics, respectively.

…              ± 1.23   34.13 ± 2.71   23.83 ± 0.85   33.38 ± 2.78   36.2 ± 1.99    36.67 ± 2.05
dvd            24.81 ± 2.97   34.74 ± 2.76   23.33 ± 1.9    34.86 ± 2.72   37.34 ± 2.37   37.69 ± 2.22
electronics    19.82 ± 0.57   26.07 ± 2.88   19.22 ± 0.56   26.37 ± 2.89   32.52 ± 2.55   33.27 ± 2.83
kitchen        18.75 ± 0.85   24.02 ± 2.86   17.93 ± 0.57   22.95 ± 1.84   31.04 ± 3.04   31.75 ± 3.3
splice         34.47 ± 2.59   34.29 ± 3.8    34.42 ± 2.23   34.3 ± 3.73    38.7 ± 2.26    38.72 ± 2.27
mushrooms      1.55 ± 0.9     1.17 ± 0.66    2.33 ± 1.02    1.75 ± 0.74    1.9 ± 0.86     1.98 ± 1.0
adult          19.63 ± 0.88   19.65 ± 0.9    18.38 ± 0.73   18.5 ± 0.81    18.43 ± 0.64   18.47 ± 0.72

E.5 PERFORMANCE DEPENDING ON THE NUMBER OF LABELED EXAMPLES

Data set       QLDS (1,0) Fixed   QLDS (1,0) CV   QLDS (0,1) Fixed   QLDS (0,1) CV
books          37.47 ± 2.25       38.32 ± 2.37    26.47 ± 0.72       32.84 ± 8.65
dvd            38.33 ± 1.72       38.56 ± 2.03    29.12 ± 1.35       32.74 ± 7.26
electronics    34.15 ± 3.25       35.2 ± 3.0      19.4 ± 0.29        23.8 ± 9.19
kitchen        32.39 ± 3.02       33.42 ± 4.44    19.31 ± 0.16       22.05 ± 8.55
splice         39.81 ± 2.93       40.38 ± 3.31    35.48 ± 0.86       39.53 ± 3.53
adult          33.35 ± 0.68       32.13 ± 1.88    36.28 ± 0.06       34.0 ± 0.73
mushrooms      6.55 ± 2.07        2.53 ± 1.38     11.33 ± 0.04       8.8 ± 1.47

A APPENDIX

B SOLUTION OF QLDS

We recall the optimization problem of QLDS without bias as […] where […]. The loss $L(\omega)$ can be rewritten in a more convenient and compact matrix formulation […]. Taking the derivative of the loss function $L(\omega)$ with respect to $\omega$ leads to […]. The optimal value of $\omega$ (up to a scaling of $\alpha_\ell$) is found by setting the gradient to zero […]. The decision function for the unlabeled data $X_u$ is given as […]. Importantly, the Hessian of the loss reads as […], where $\lambda_{\max}(M)$ denotes the maximum eigenvalue of the matrix $M$. Therefore the loss function is convex as soon as $\lambda > \lambda_{\max} - \alpha$ […]

Given generally few labeled examples and comparatively many unlabeled ones, the idea of graph-based SSL is to construct a connected graph that propagates effective labeled information to the unlabeled data. More specifically, the data are represented by a finite weighted graph $G = (\mathcal{N}, \mathcal{E}, W)$ consisting of a set of nodes $\mathcal{N}$ based on the data samples $X = [X_\ell, X_u]$, a set of edges $\mathcal{E}$ and its associated weight matrix $W = \{\omega_{ii'}\}_{i,i'=1}^{n}$, where $\omega_{ii'}$ measures the similarity between data points $x_i$ and $x_{i'}$ […] for some non-decreasing non-negative function $h$, so that similar data vectors $x_i, x_{i'}$ are connected with a large weight. Graph-based learning algorithms estimate the label of each node based on a

D.3 SECOND ORDER MOMENT OF f u

The second order moment of $f_u$ can be computed as $\frac{1}{n^2} \mathbb{E}\left[y^\top S_\ell^\top X^\top Q X S_u S_u^\top X^\top Q X S_\ell y\right]$. Let us define for convenience the matrix $B = S_u S_u^\top$. As previously, we are looking for a deterministic equivalent for $X^\top Q X B X^\top Q X$. We proceed in the same way by computing […] and we reuse Equation (17) in order to continue […] where $C_{u\delta}$ is defined as […]. Let us denote for convenience by $E$ the deterministic equivalent of $Q C_{u\delta} Q$. Then a deterministic equivalent for $X^\top Q X B X^\top Q X$ is given as […] where $\mathcal{E}$ is the diagonal matrix containing on its diagonal $\mathcal{E}_{ii} = \frac{1}{n} \mathbb{E}\left[\operatorname{tr}\left(C_i Q C_{u\delta} Q\right)\right] = \frac{1}{n} \operatorname{tr}(C_i E)$. We therefore deduce the variance of $f_u$ as

