RANDOM MATRIX ANALYSIS TO BALANCE BETWEEN SUPERVISED AND UNSUPERVISED LEARNING UNDER THE LOW DENSITY SEPARATION ASSUMPTION

Abstract

We propose a theoretical framework to analyze semi-supervised classification under the low density separation assumption in a high-dimensional regime. In particular, we introduce QLDS, a linear classification model in which the low density separation assumption is implemented via quadratic margin maximization. The algorithm has an explicit solution with rich theoretical properties, and we show that particular cases of our algorithm are the least-squares support vector machine in the supervised case, spectral clustering in the fully unsupervised regime, and a class of semi-supervised graph-based approaches. As such, QLDS establishes a smooth bridge between these supervised and unsupervised learning methods. Using recent advances in random matrix theory, we formally derive a theoretical evaluation of the classification error in the asymptotic regime. As an application, we derive a hyperparameter selection policy that finds the best balance between the supervised and the unsupervised terms of our learning criterion. Finally, we provide extensive illustrations of our framework, as well as an experimental study on several benchmarks demonstrating that QLDS, while being computationally more efficient, improves over cross-validation for hyperparameter selection, which indicates the promise of random matrix theory for semi-supervised model selection.

1. INTRODUCTION

Semi-supervised learning (SSL; Chapelle et al., 2010; van Engelen and Hoos, 2019) aims to learn from labeled and unlabeled data at once. This machine learning approach has received a lot of attention over the past decade due to its relevance to many real-world applications, where the annotation of data is costly and performed manually (Imran et al., 2020), while data acquisition is cheap and may result in an abundance of unlabeled data (Fergus et al., 2009). As such, semi-supervised learning can be seen as a learning framework that lies between the supervised and the unsupervised settings: the former occurs when all the data is labeled, and the latter is recovered when only unlabeled data is available. Generally, a semi-supervised algorithm is expected to outperform its supervised counterpart trained only on labeled data by efficiently extracting, from unlabeled examples, the information valuable to the prediction task. In practice, integrating unlabeled observations into the learning process does not always improve performance (Singh et al., 2008), since unlabeled data help only when the marginal data distribution p(x) contains information on the prediction task p(y|x). Consequently, most semi-supervised approaches rely on specific assumptions about how p(x) and p(y|x) are linked to each other. It is principally assumed that examples similar to each other tend to share the same class label (van Engelen and Hoos, 2019), and different implementations of this assumption lead to different families of semi-supervised learning models. A first family of approaches aims to capture the intrinsic geometry of the data using a graph Laplacian (Chong et al., 2020; Song et al., 2022) and supposes that high-dimensional data points with the same label lie on the same low-dimensional manifold (Belkin and Niyogi, 2004). Another family of semi-supervised algorithms assumes that examples from the same dense region belong to the same class.
While some methods explicitly look for such regions by relying on a clustering algorithm (Rigollet, 2007; Peikari et al., 2018), another idea is to directly restrict the classification model to have a decision boundary that only passes through low density regions. This latter approach is said to rely on the Low Density

