EXPERIMENTAL DESIGN FOR OVERPARAMETERIZED LEARNING WITH APPLICATION TO SINGLE SHOT DEEP ACTIVE LEARNING

Anonymous

Abstract

The impressive performance exhibited by modern machine learning models hinges on the ability to train such models on very large amounts of labeled data. However, since access to large volumes of labeled data is often limited or expensive, it is desirable to alleviate this bottleneck by carefully curating the training set. Optimal experimental design is a well-established paradigm for selecting data points to be labeled so as to maximally inform the learning process. Unfortunately, classical theory on optimal experimental design focuses on selecting examples in order to learn underparameterized (and thus, non-interpolative) models, while modern machine learning models such as deep neural networks are overparameterized, and are oftentimes trained to be interpolative. As such, classical experimental design methods are not applicable in many modern learning setups. Indeed, the predictive performance of underparameterized models tends to be variance dominated, so classical experimental design focuses on variance reduction, while the predictive performance of overparameterized models can also be, as is shown in this paper, bias dominated or of mixed nature. In this paper we propose a design strategy that is well suited for overparameterized regression and interpolation, and we demonstrate the applicability of our method in the context of deep learning by proposing a new algorithm for single shot deep active learning.

1. INTRODUCTION

The impressive performance exhibited by modern machine learning models hinges on the ability to train such models on very large amounts of labeled data. In practice, in many real-world scenarios, even when raw data exists aplenty, acquiring labels might prove challenging and/or expensive. This severely limits the ability to deploy machine learning capabilities in real-world applications. This bottleneck has been recognized early on, and methods to alleviate it have been suggested. Most relevant for our work is the large body of research on active learning and optimal experimental design, which aims at selecting data points to be labeled so as to maximally inform the learning process. Disappointingly, active learning techniques seem to deliver mostly lukewarm benefits in the context of deep learning. One possible reason why experimental design has so far failed to make an impact in the context of deep learning is that such models are overparameterized, and are oftentimes trained to be interpolative (Zhang et al., 2017), i.e., they are trained until a perfect fit of the training data is found. This raises a conundrum: the classical perspective on statistical learning theory is that overfitting should be avoided, since there is a tradeoff between the fit and the complexity of the model. This conundrum is exemplified by the double descent phenomenon (Belkin et al., 2019b; Bartlett et al., 2020): when fixing the model size and increasing the amount of training data, the test error initially goes down, then starts to go up, exploding as the amount of training data approaches the model complexity, and then starts to descend again. This runs counter to the statistical intuition that more data implies better learning. Indeed, when using interpolative models, more data can hurt (Nakkiran et al., 2020a)! This phenomenon is exemplified by the curve labeled "Random Selection" in Figure 1.
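The sample-wise double descent described above can be reproduced in a few lines of NumPy. The sketch below (an illustrative construction, not the paper's experiment; all names and constants are our own) fits a minimum-norm least-squares interpolator to n random training points from a d-dimensional noisy linear model and measures its test error; as n approaches d, the error typically spikes before descending again in the overdetermined regime.

```python
import numpy as np

def min_norm_test_error(n_train, d=50, n_test=1000, sigma=0.5, seed=0):
    """Test MSE of the minimum-norm least-squares fit trained on n_train samples.

    For n_train < d the pseudoinverse returns the interpolating solution of
    minimum Euclidean norm; for n_train > d it returns the usual OLS solution.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(d) / np.sqrt(d)      # ground-truth coefficients
    X = rng.standard_normal((n_train, d))             # random training inputs
    y = X @ w_true + sigma * rng.standard_normal(n_train)  # noisy responses
    w_hat = np.linalg.pinv(X) @ y                     # minimum-norm solution
    X_test = rng.standard_normal((n_test, d))
    return float(np.mean((X_test @ w_hat - X_test @ w_true) ** 2))

# Test error as the number of labeled samples grows past the model size d=50.
sizes = [10, 25, 50, 75, 100]
errors = [min_norm_test_error(n) for n in sizes]
```

Plotting `errors` against `sizes` yields a curve of the same qualitative shape as the "Random Selection" curve in Figure 1: the error is largest around n = d, where interpolation amplifies label noise.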



Figure 1 explores the predictive performance of various designs when learning a linear regression model as the amount of labeled training data is varied.

