RANDOM MATRIX ANALYSIS TO BALANCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING UNDER THE LOW DENSITY SEPARATION ASSUMPTION

Abstract

We propose a theoretical framework to analyze semi-supervised classification under the low density separation assumption in a high-dimensional regime. In particular, we introduce QLDS, a linear classification model in which the low density separation assumption is implemented via quadratic margin maximization. The algorithm has an explicit solution with rich theoretical properties, and we show that particular cases of our algorithm are the least-squares support vector machine in the supervised case, spectral clustering in the fully unsupervised regime, and a class of semi-supervised graph-based approaches. As such, QLDS establishes a smooth bridge between these supervised and unsupervised learning methods. Using recent advances in random matrix theory, we formally derive a theoretical evaluation of the classification error in the asymptotic regime. As an application, we derive a hyperparameter selection policy that finds the best balance between the supervised and unsupervised terms of our learning criterion. Finally, we provide extensive illustrations of our framework, as well as an experimental study on several benchmarks, demonstrating that QLDS, while being computationally more efficient, improves over cross-validation for hyperparameter selection. This indicates the promise of random matrix theory for semi-supervised model selection.

1. INTRODUCTION

Semi-supervised learning (SSL, Chapelle et al., 2010; van Engelen and Hoos, 2019) aims to learn from labeled and unlabeled data at once. This machine learning approach has received a lot of attention over the past decade due to its relevance to many real-world applications, where the annotation of data is costly and performed manually (Imran et al., 2020), while data acquisition is cheap and may result in an abundance of unlabeled data (Fergus et al., 2009). As such, semi-supervised learning can be seen as a learning framework that lies between the supervised and the unsupervised settings: the former occurs when all the data is labeled, and the latter is restored when only unlabeled data is available.

Generally, a semi-supervised algorithm is expected to outperform its supervised counterpart trained only on labeled data by efficiently extracting information valuable to the prediction task from unlabeled examples. In practice, integrating unlabeled observations into the learning process does not always improve performance (Singh et al., 2008), since unlabeled data helps only when the marginal data distribution p(x) contains information on the prediction task p(y|x). Consequently, most semi-supervised approaches rely on specific assumptions about how p(x) and p(y|x) are linked with each other. It is principally assumed that examples similar to each other tend to share the same class labels (van Engelen and Hoos, 2019), and implementations of this assumption result in different families of semi-supervised learning models. The first family aims to capture the intrinsic geometry of the data using a graph Laplacian (Chong et al., 2020; Song et al., 2022) and supposes that high-dimensional data points with the same label lie on the same low-dimensional manifold (Belkin and Niyogi, 2004). Another family of semi-supervised algorithms assumes that examples from the same dense region belong to the same class.
While some methods explicitly look for such regions by relying on a clustering algorithm (Rigollet, 2007; Peikari et al., 2018), another idea is to directly restrict the classification model to have a decision boundary that passes only through low density regions. This latter approach is said to rely on the Low Density Separation (LDS) assumption (Chapelle and Zien, 2005; van Engelen and Hoos, 2019), and it has been widely used in practice in recent decades, combined with the support vector machine (Bennett and Demiriz, 1998; Joachims, 1999), ensemble methods (d'Alché-Buc et al., 2001; Feofanov et al., 2019) and deep learning methods (Sajjadi et al., 2016; Berthelot et al., 2019).

Despite its popularity, the study of the low density separation assumption still has many open questions. First, there is a deficiency of works devoted to theoretical analysis of an algorithm's performance under this assumption, and most approaches focus on the methodological part (van Engelen and Hoos, 2019). Second, in real applications, it remains unclear how a semi-supervised algorithm should balance the importance of the labeled and the unlabeled examples so as not to degrade the performance with respect to supervised and unsupervised baselines. This implies in particular that hyperparameter selection for a semi-supervised classification model is crucial, and it is known that using cross-validation for model selection may be suboptimal in the semi-supervised case due to the lack of labeled examples (Madani et al., 2005).

Motivated by the aforementioned reasons, this paper proposes a framework to analyze semi-supervised classification under the low density separation assumption using the power of random matrix theory (Paul and Aue, 2014; Marchenko and Pastur, 1967). We consider a simple yet insightful quadratic margin maximization problem, QLDS, that seeks an optimal balance between the labeled part, represented by the Least Squares Support Vector Machine (LS-SVM, Suykens and Vandewalle, 1999), and the unlabeled part, represented by spectral clustering (Ng et al., 2001).
In addition, the considered algorithm recovers the graph-based approach proposed by Mai and Couillet (2021) as a particular case.

The main contributions of this paper may be summarized as follows:

• We propose a large dimensional analysis of QLDS and derive a theoretical evaluation of the classification error in the asymptotic regime under the data concentration assumption (Louart and Couillet, 2018). The results allow a strong understanding of the interplay between the data statistics and the hyperparameters of the model.

• Based on this theoretical result, we propose a hyperparameter selection approach to optimally balance the supervised and unsupervised terms of QLDS. We empirically validate this approach on synthetic and real-world data, showing that it outperforms hyperparameter selection by cross-validation both in terms of performance and running time.

The remainder of the article is structured as follows. In Section 2, we review the related work. Section 3 introduces the semi-supervised framework as well as the optimization problem of QLDS. Under mild conditions on the data distribution, Section 4 provides the large dimensional analysis of the proposed algorithm along with several insights and discusses its application to hyperparameter selection. Section 5 provides various numerical experiments to corroborate the pertinence of both the theoretical analysis and the hyperparameter selection policy. Section 6 concludes the article.
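To make the supervised/unsupervised balance concrete, the following is a minimal numerical sketch of a QLDS-style linear classifier. This is an illustrative assumption rather than the paper's exact formulation: we reward signed margins on labeled points and squared margins on unlabeled points, regularize the weight vector, and solve the stationarity condition in closed form. The function name `qlds_sketch` and the specific regularization choice are ours.

```python
import numpy as np

def qlds_sketch(X_l, y_l, X_u, gamma_l, gamma_u):
    """Hypothetical sketch of a QLDS-style linear classifier.

    Maximizes over w:
        gamma_l/n_l * sum_i y_i w^T x_i      (supervised: signed margins)
      + gamma_u/n_u * sum_j (w^T x_j)^2      (unsupervised: quadratic margins)
      - 0.5 * ||w||^2                        (regularization)

    Setting the gradient to zero gives the explicit solution
        (I - 2*gamma_u/n_u * X_u^T X_u) w = gamma_l/n_l * X_l^T y_l,
    which is well posed when gamma_u is small enough for the matrix on
    the left to remain positive definite.
    """
    n_l, d = X_l.shape
    n_u = X_u.shape[0]
    A = np.eye(d) - (2.0 * gamma_u / n_u) * (X_u.T @ X_u)
    b = (gamma_l / n_l) * (X_l.T @ y_l)
    return np.linalg.solve(A, b)
```

In this sketch, `gamma_u = 0` yields a purely supervised direction proportional to `X_l.T @ y_l` (an LS-SVM-like mean-difference classifier), while letting the unsupervised term dominate pushes `w` toward the leading eigenvector of `X_u.T @ X_u`, i.e. a spectral direction, which mirrors the two extreme cases discussed above.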

2. RELATED WORK

LDS in Semi-supervised Learning. Formally introduced by Chapelle and Zien (2005), the LDS assumption imposes the optimal class boundary to lie in a low density region. This assumption is usually implemented via margin maximization, which underlies, either explicitly or implicitly, many semi-supervised algorithms such as the Transductive SVM (TSVM) (Joachims, 1999; Ding et al., 2017), self-training (Tür et al., 2005; Feofanov et al., 2019) and entropy minimization approaches (Grandvalet and Bengio, 2004; Sajjadi et al., 2016). As the margins' signs for unlabeled data are unknown, various unsigned alternatives have been proposed (d'Alché-Buc et al., 2001; Grandvalet and Bengio, 2004), where the classical approach is to consider the margin's absolute value (Joachims, 1999; Amini et al., 2008). In practice, the latter is usually replaced by an exponential surrogate function for gradient-based optimization of TSVM (Chapelle and Zien, 2005; Gieseke et al., 2014).

In this paper, we consider TSVM with the quadratic margin, which is both differentiable and convex; this allows us to perform a theoretical analysis and to obtain graph-based semi-supervised learning as a particular case. A similar framework of quadratic margins was considered by Belkin et al. (2006), whose work addressed a more general case with a kernel-based SVM and the Laplacian matrix integrated into the objective, for which they proved a Representer theorem. While our work focuses on explicitly deriving a theoretical expression of the classification error, their paper may complement
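As a small illustration of these unsigned margin choices, the snippet below compares three surrogates for an unlabeled margin m = w^T x. The exponential form and its scale `s = 3.0` are an assumption for illustration, not necessarily the exact surrogate used in the cited works.

```python
import numpy as np

def abs_margin(m):
    # Classical unsigned TSVM choice: non-differentiable at m = 0.
    return np.abs(m)

def exp_surrogate(m, s=3.0):
    # A smooth, decreasing surrogate of the absolute margin, in the
    # spirit of gradient-based TSVM; the exact form and scale s are
    # illustrative assumptions here.
    return np.exp(-s * m ** 2)

def quad_margin(m):
    # Quadratic margin considered in this paper: smooth and convex,
    # which makes a closed-form solution and theoretical analysis tractable.
    return m ** 2
```

Of the three, only the quadratic margin is both smooth everywhere and polynomial in w, which is precisely what allows the explicit solution and the random matrix analysis developed in the following sections.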

