EXPERIMENTAL DESIGN FOR OVERPARAMETERIZED LEARNING WITH APPLICATION TO SINGLE SHOT DEEP ACTIVE LEARNING

Anonymous

Abstract

The impressive performance exhibited by modern machine learning models hinges on the ability to train such models on very large amounts of labeled data. However, since access to large volumes of labeled data is often limited or expensive, it is desirable to alleviate this bottleneck by carefully curating the training set. Optimal experimental design is a well-established paradigm for selecting data points to be labeled so as to maximally inform the learning process. Unfortunately, classical theory on optimal experimental design focuses on selecting examples in order to learn underparameterized (and thus, non-interpolative) models, while modern machine learning models such as deep neural networks are overparameterized, and oftentimes are trained to be interpolative. As such, classical experimental design methods are not applicable in many modern learning setups. Indeed, the predictive performance of underparameterized models tends to be variance dominated, so classical experimental design focuses on variance reduction, while the predictive performance of overparameterized models can also be, as is shown in this paper, bias dominated or of mixed nature. In this paper we propose a design strategy that is well suited for overparameterized regression and interpolation, and we demonstrate the applicability of our method in the context of deep learning by proposing a new algorithm for single shot deep active learning.

1. INTRODUCTION

The impressive performance exhibited by modern machine learning models hinges on the ability to train such models on very large amounts of labeled data. In practice, in many real world scenarios, even when raw data exists aplenty, acquiring labels might prove challenging and/or expensive. This severely limits the ability to deploy machine learning capabilities in real world applications. This bottleneck was recognized early on, and methods to alleviate it have been suggested. Most relevant to our work is the large body of research on active learning and optimal experimental design, which aims at selecting data points to be labeled so as to maximally inform the learning process. Disappointingly, active learning techniques seem to deliver mostly lukewarm benefits in the context of deep learning. One possible reason why experimental design has so far failed to make an impact in the context of deep learning is that such models are overparameterized, and oftentimes are trained to be interpolative (Zhang et al., 2017), i.e., they are trained so that a perfect fit of the training data is found. This raises a conundrum: the classical perspective on statistical learning theory is that overfitting should be avoided since there is a tradeoff between the fit and the complexity of the model. This conundrum is exemplified by the double descent phenomenon (Belkin et al., 2019b; Bartlett et al., 2020): when fixing the model size and increasing the amount of training data, the test error initially descends, then starts to rise, exploding when the amount of training data approaches the model complexity, and then descends again. This runs counter to the statistical intuition that more data implies better learning. Indeed, when using interpolative models, more data can hurt (Nakkiran et al., 2020a)! This phenomenon is exemplified by the curve labeled "Random Selection" in Figure 1.
The fact that more data can hurt further motivates experimental design in the interpolative regime. Presumably, if data is carefully curated, more data should never hurt. Unfortunately, classical optimal experimental design focuses on the underparameterized (and thus, non-interpolative) case. As such, the theory reported in the literature is often not applicable in the interpolative regime. As our analysis shows (see Section 3), the prediction error of interpolative models can be bias dominated (the first descent phase, i.e., when the training size is very small compared to the number of parameters), variance dominated (near equality of training size and number of parameters), or of mixed nature. In contrast, properly trained underparameterized models tend to have prediction error which is variance dominated, so classical experimental design focuses on variance reduction. As such, naively using classical optimality criteria, such as V-optimality (the one most relevant for generalization error), in the context of interpolation tends to produce poor results when the prediction error is bias dominated or of mixed nature. This is exemplified by the curve labeled "Classical OED" in Figure 1. The goal of this paper is to understand these regimes, and to propose an experimental design strategy that is well suited for overparameterized models. Like many recent works that attempt to understand the double descent phenomenon by analyzing underdetermined linear regression, we too use a simple linear regression model in our analysis of experimental design in the overparameterized case (however, we also consider kernel ridge regression, not only linear interpolative models). We believe that understanding experimental design in the overparameterized linear regression case is a prelude to designing effective design algorithms for deep learning.
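For concreteness, the variance-centric classical approach discussed above can be sketched as a greedy V-optimal design: each step adds the pool point that most reduces the average prediction variance of ordinary least squares over a set of evaluation points. This is a minimal illustration of the classical criterion only, not the method proposed in this paper; the name `greedy_v_optimal` and the small ridge term (used to keep the information matrix invertible before d points are chosen) are our own assumptions:

```python
import numpy as np

def greedy_v_optimal(X_pool, X_eval, k, ridge=1e-6):
    """Greedy classical V-optimal design.

    The OLS prediction variance at a point x is proportional to
    x^T (X_S^T X_S)^{-1} x; each step adds the pool point minimizing the
    average of this quantity over the rows of X_eval.
    """
    n, d = X_pool.shape
    selected = []
    A = ridge * np.eye(d)                # regularized information matrix
    for _ in range(k):
        best, best_score = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            A_i = A + np.outer(X_pool[i], X_pool[i])
            # Average prediction variance over the evaluation set.
            score = np.trace(X_eval @ np.linalg.solve(A_i, X_eval.T))
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
        A += np.outer(X_pool[best], X_pool[best])
    return selected
```

As the text notes, a criterion of this form only targets the variance term, which is why it underperforms in the bias-dominated overparameterized regime ("Classical OED" in Figure 1).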
Indeed, recent theoretical results have shown a deep connection between deep learning and kernel learning via the so-called Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019a; Lee et al., 2019). Based on this connection, and as a proof-of-concept, we propose a new algorithm for single shot deep active learning. Let us now summarize our contributions:

• We analyze the prediction error of learning overparameterized linear models for a given fixed design, revealing three possible regimes that call for different design criteria: bias dominated, variance dominated, and mixed nature. We also reveal an interesting connection between overparameterized experimental design and the column subset selection problem (Boutsidis et al., 2009), transductive experimental design (Yu et al., 2006), and coresets (Sener & Savarese, 2018). We also extend our approach to kernel ridge regression.

• We propose a novel greedy algorithm for finding designs for overparameterized linear models. As exemplified by the curve labeled "Overparameterized OED" in Figure 1, our algorithm is sometimes able to mitigate the double descent phenomenon, while still performing better than classical OED (though no formal proof of this fact is provided).

• We show how our algorithm can also be applied to kernel ridge regression, and report experiments which show that when the number of parameters is in a sense infinite, our algorithm is able to find designs that are better than the state of the art.

• We propose a new algorithm for single shot deep active learning, a scarcely treated problem so far, and demonstrate its effectiveness on MNIST.

Related Work. The phenomena of benign overfitting and double descent were first recognized in DNNs (Zhang et al., 2017), and later discussed and analyzed in the context of linear models (Zhang et al., 2017; Belkin et al., 2018; 2019a; b; Bartlett et al., 2020).
Recently, there has also been growing interest in the related phenomenon of "more data can hurt" (Nakkiran et al., 2020a; Nakkiran, 2019; Nakkiran et al., 2020b; Loog et al., 2019). A complementary work discussed the need to consider a zero or negative regularization coefficient for large real-life linear models (Kobak et al., 2020).
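To make the connection to the column subset selection problem mentioned in the contributions concrete, here is a sketch of a greedy, residual-based selection rule in that spirit. This is not the paper's algorithm (which is not reproduced here); the name `greedy_css` is hypothetical, and the rule simply tries to cover the span of the pool, which is the quantity that controls the bias term in the bias-dominated regime:

```python
import numpy as np

def greedy_css(X, k):
    """Greedy column-subset-style selection of k data points.

    Repeatedly picks the point (row of X) with the largest residual norm
    after projecting onto the span of the points chosen so far, then
    deflates all residuals by that direction. Covering the span of the
    data in this way targets the bias of a minimum-norm interpolator.
    """
    R = X.copy().astype(float)           # residuals of all rows
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        i = int(np.argmax(norms))        # most-uncovered point
        selected.append(i)
        v = R[i] / norms[i]              # orthonormal direction of the new point
        R -= np.outer(R @ v, v)          # project all residuals off v
    return selected
```

Under this heuristic, points that are nearly in the span of already-selected points contribute little and are never chosen, which is the intuition behind the bias-oriented design criteria discussed in Section 3.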



Figure 1 explores the predictive performance of various designs when learning a linear regression model as the amount of labeled training data varies.

Figure 1: MSE of a minimum norm linear interpolative model. We use synthetic data of dimension 100. The full description is in Appendix E.

