EXPERIMENTAL DESIGN FOR OVERPARAMETERIZED LEARNING WITH APPLICATION TO SINGLE SHOT DEEP ACTIVE LEARNING

Anonymous

Abstract

The impressive performance exhibited by modern machine learning models hinges on the ability to train such models on very large amounts of labeled data. However, since access to large volumes of labeled data is often limited or expensive, it is desirable to alleviate this bottleneck by carefully curating the training set. Optimal experimental design is a well-established paradigm for selecting data points to be labeled so as to maximally inform the learning process. Unfortunately, classical theory on optimal experimental design focuses on selecting examples in order to learn underparameterized (and thus, non-interpolative) models, while modern machine learning models such as deep neural networks are overparameterized, and oftentimes are trained to be interpolative. As such, classical experimental design methods are not applicable in many modern learning setups. Indeed, the predictive performance of underparameterized models tends to be variance dominated, so classical experimental design focuses on variance reduction, while the predictive performance of overparameterized models can also be, as is shown in this paper, bias dominated or of mixed nature. In this paper we propose a design strategy that is well suited for overparameterized regression and interpolation, and we demonstrate the applicability of our method in the context of deep learning by proposing a new algorithm for single shot deep active learning.

1. INTRODUCTION

The impressive performance exhibited by modern machine learning models hinges on the ability to train those models on very large amounts of labeled data. In practice, in many real world scenarios, even when raw data exists aplenty, acquiring labels might prove challenging and/or expensive. This severely limits the ability to deploy machine learning capabilities in real world applications. This bottleneck has been recognized early on, and methods to alleviate it have been suggested. Most relevant for our work is the large body of research on active learning and optimal experimental design, which aims at selecting data points to be labeled so as to maximally inform the learning process. Disappointingly, active learning techniques seem to deliver mostly lukewarm benefits in the context of deep learning. One possible reason why experimental design has so far failed to make an impact in the context of deep learning is that such models are overparameterized, and oftentimes are trained to be interpolative (Zhang et al., 2017), i.e., they are trained so that a perfect fit of the training data is found. This raises a conundrum: the classical perspective on statistical learning theory is that overfitting should be avoided since there is a tradeoff between the fit and the complexity of the model. This conundrum is exemplified by the double descent phenomenon (Belkin et al., 2019b; Bartlett et al., 2020): when fixing the model size and increasing the amount of training data, the test error initially decreases, then starts to increase, peaking when the amount of training data approaches the model complexity, and then descends again. This runs counter to the statistical intuition that more data implies better learning. Indeed, when using interpolative models, more data can hurt (Nakkiran et al., 2020a)! This phenomenon is exemplified in the curve labeled "Random Selection" in Figure 1.
Figure 1 explores the predictive performance of various designs when learning a linear regression model while varying the amount of labeled training data. The fact that more data can hurt further motivates experimental design in the interpolative regime: presumably, if data is carefully curated, more data should never hurt. Unfortunately, classical optimal experimental design focuses on the underparameterized (and thus, non-interpolative) case. As such, the theory reported in the literature is often not applicable in the interpolative regime. As our analysis shows (see Section 3), the prediction error of interpolative models can be bias dominated (the first descent phase, i.e., when the training size is very small compared to the number of parameters), variance dominated (near equality of training size and number of parameters), or of mixed nature. In contrast, properly trained underparameterized models tend to have prediction error which is variance dominated, so classical experimental design focuses on variance reduction. As such, naively using classical optimality criteria, such as V-optimality (the one most relevant for generalization error) or others, in the context of interpolation tends to produce poor results when the prediction error is bias dominated or of mixed nature. This is exemplified in the curve labeled "Classical OED" in Figure 1. The goal of this paper is to understand these regimes, and to propose an experimental design strategy that is well suited for overparameterized models. Like many recent works that attempt to understand the double descent phenomenon by analyzing underdetermined linear regression, we too use a simple linear regression model in our analysis of experimental design in the overparameterized case (however, we also consider kernel ridge regression, not only linear interpolative models).
We believe that understanding experimental design in the overparameterized linear regression case is a prelude to designing effective design algorithms for deep learning. Indeed, recent theoretical results showed a deep connection between deep learning and kernel learning via the so-called Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019a; Lee et al., 2019). Based on this connection, and as a proof-of-concept, we propose a new algorithm for single shot deep active learning. Let us now summarize our contributions:

• We analyze the prediction error of learning overparameterized linear models for a given fixed design, revealing three possible regimes that call for different design criteria: bias dominated, variance dominated, and mixed nature. We also reveal an interesting connection between overparameterized experimental design and the column subset selection problem (Boutsidis et al., 2009), transductive experimental design (Yu et al., 2006), and coresets (Sener & Savarese, 2018). We also extend our approach to kernel ridge regression.

• We propose a novel greedy algorithm for finding designs for overparameterized linear models. As exemplified in the curve labeled "Overparameterized OED" in Figure 1, our algorithm is sometimes able to mitigate the double descent phenomenon, while still performing better than classical OED (though no formal proof of this fact is provided).

• We show how our algorithm can also be applied to kernel ridge regression, and report experiments which show that when the number of parameters is in a sense infinite, our algorithm is able to find designs that are better than the state of the art.

2. UNDERPARAMETERIZED V-OPTIMAL EXPERIMENTAL DESIGN

Consider a noisy linear response model y = x^T w + ε, where ε ∼ N(0, σ^2) and w ∈ R^d, and assume we are given data points x_1, ..., x_n for which we obtained independent responses y_i = x_i^T w + ε_i. Consider the underparameterized case, i.e., n ≥ d, and furthermore assume that the set {x_1, ..., x_n} contains at least d linearly independent vectors. The best linear unbiased estimator ŵ of w, according to the Gauss-Markov theorem, is given by:

ŵ = argmin_w ‖Xw − y‖_2^2 = X^+ y

where X ∈ R^{n×d} is the matrix whose rows are x_1, ..., x_n, y = [y_1 ... y_n]^T ∈ R^n, and X^+ is the Moore-Penrose pseudoinverse of X. It is well known that ŵ − w is a normal random vector with zero mean and covariance matrix σ^2 M^{-1}, where M = X^T X is the Fisher information matrix. This implies that ŷ(x) − y(x) is also a normal variable with zero mean and variance σ^2 x^T M^{-1} x. Assume also that x comes from a distribution ρ. With that we can further define the excess risk R(ŵ) = E_{x∼ρ}[(x^T w − x^T ŵ)^2] and its expectation:

E[R(ŵ)] = E_{x∼ρ}[Var[y(x) − ŷ(x)]] = E_{x∼ρ}[σ^2 x^T M^{-1} x] = Tr(σ^2 M^{-1} C_ρ)    (1)

where C_ρ is the uncentered second moment matrix of ρ: C_ρ := E_{x∼ρ}[x x^T]. Eq. (1) motivates the so-called V-optimal design criterion: select the dataset x_1, ..., x_n so that ϕ(M) := Tr(M^{-1} C_ρ) is minimized (if we do not have access to C_ρ, it is possible to estimate it by drawing samples from ρ). In doing so, we are trying to minimize the expected (with respect to the noise ε) average (with respect to the data x) prediction variance, since the risk is composed solely of it (due to the fact that the estimator is unbiased). As we shall see, this is in contrast with the overparameterized case, in which the estimator is biased. V-optimality is only one instance of various statistical criteria used in experimental design. In general experimental design, the focus is on minimizing a preselected criterion ϕ(M) (Pukelsheim, 2006).
For example, in D-optimal design ϕ(M) = det(M^{-1}), and in A-optimal design ϕ(M) = Tr(M^{-1}). However, since minimizing the V-optimality criterion corresponds to minimizing the risk, it is the more appropriate choice when assessing the predictive performance of machine learning models.
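The classical V-optimality criterion of this section can be sketched in a few lines of numpy; `v_criterion` and all other names below are ours, and C_ρ is replaced by the plug-in estimate from unlabeled samples suggested above:

```python
import numpy as np

def v_criterion(X, C_rho):
    """Classical V-optimality value Tr(M^{-1} C_rho); assumes M = X^T X is invertible (n >= d)."""
    M = X.T @ X
    return np.trace(np.linalg.solve(M, C_rho))

rng = np.random.default_rng(0)
d, n, m = 5, 50, 1000
pool = rng.standard_normal((m, d))       # unlabeled samples drawn from rho
C_rho = pool.T @ pool / m                # plug-in estimate of E[x x^T]
X_design = rng.standard_normal((n, d))   # an (arbitrary) underparameterized design
score = v_criterion(X_design, C_rho)
```

A V-optimal design algorithm would search for the n pool points minimizing this score; here we only evaluate it for a fixed design.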

3. OVERPARAMETERIZED EXPERIMENTAL DESIGN CRITERIA

In this section we derive an expression for the risk in the overparameterized case, i.e., an analogue of Eq. (1) that also covers the case n ≤ d (our expressions also hold for n > d). This, in turn, leads to an experimental design criterion analogous to V-optimality, but relevant for overparameterized modeling as well. We design a novel algorithm based on this criterion in subsequent sections.

3.1. OVERPARAMETERIZED REGRESSION AND INTERPOLATION

When n ≥ d there is a natural candidate for ŵ: the best linear unbiased estimator X^+ y. However, when d > n there is no longer a unique minimizer of ‖Xw − y‖_2^2, as there is an infinite number of interpolating w's, i.e., w's such that Xw = y (the last statement makes the mild additional assumption that X has full row rank). One natural strategy for dealing with the non-uniqueness is to consider the minimum norm interpolator:

ŵ := argmin_w ‖w‖_2^2  s.t.  Xw = y

It is still the case that ŵ = X^+ y. Another option for dealing with the non-uniqueness of the minimizer is to add a ridge term, i.e., an additive penalty λ‖w‖_2^2. Let:

ŵ_λ := argmin_w ‖Xw − y‖_2^2 + λ‖w‖_2^2

One can show that

ŵ_λ = X_λ^+ y    (2)

where for λ ≥ 0 we define X_λ^+ := (X^T X + λI_d)^+ X^T (see also Bardow (2008)). Note that Eq. (2) holds both for the overparameterized (d ≥ n) and underparameterized (d < n) case.

Proposition 1. The function λ → X_λ^+ is continuous for all λ ≥ 0.

The proof, like all of our proofs, is delegated to the appendix. Thus, we also have that the minimum norm interpolator ŵ is equal to ŵ_0, and that λ → ŵ_λ is continuous. This implies that the various expressions for the expected risk of ŵ_λ hold also when λ = 0. So, henceforth we analyze the expected risk of ŵ_λ, and the results also apply to ŵ.
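A minimal numerical sketch of ŵ_λ and of the continuity claimed in Proposition 1 (variable names ours; we rely on the standard identity (X^T X)^+ X^T = X^+):

```python
import numpy as np

def w_hat(X, y, lam):
    """Ridge-regularized estimator w_lambda = (X^T X + lam I)^+ X^T y; lam = 0 gives min-norm interpolation."""
    d = X.shape[1]
    return np.linalg.pinv(X.T @ X + lam * np.eye(d)) @ X.T @ y

rng = np.random.default_rng(1)
n, d = 10, 30                         # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = w_hat(X, y, 0.0)                 # should equal the minimum norm interpolator
w_min_norm = np.linalg.pinv(X) @ y
w_small = w_hat(X, y, 1e-6)           # small lambda: close to w0 by continuity
```

The interpolation constraint Xŵ_0 = y holds here because a generic Gaussian X has full row rank.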

3.2. EXPECTED RISK OF ŵλ

The following proposition gives an expression for the expected risk of the regularized estimator ŵ_λ. Note that it holds both for the overparameterized (d ≥ n) and underparameterized (d < n) case.

Proposition 2. We have

E[R(ŵ_λ)] = ‖C_ρ^{1/2}(I − M_λ^+ M)w‖_2^2  (bias)  +  σ^2 Tr(C_ρ (M_λ^+)^2 M)  (variance)

where M_λ := X^T X + λI_d = M + λI_d. The expectation is with respect to the training noise ε.

The last proposition motivates the following design criterion, which can be viewed as a generalization of classical V-optimality:

ϕ_λ(M) := ‖C_ρ^{1/2}(I − M_λ^+ M)w‖_2^2 + σ^2 Tr(C_ρ (M_λ^+)^2 M).

For λ = 0 the expression simplifies to:

ϕ_0(M) = ‖C_ρ^{1/2}(I − P_M)w‖_2^2 + σ^2 Tr(C_ρ M^+)

where P_M = M^+ M is the projection on the row space of X. Note that when n ≥ d and X has full column rank, ϕ_0(M) reduces to the variance of underparameterized linear regression, so minimizing ϕ_λ(M) is indeed a generalization of the V-optimality criterion. Note the bias-variance tradeoff in ϕ_λ(M). When the bias term is much larger than the variance, something we should expect for small n, it makes sense for the design algorithm to be bias oriented. When the variance is larger, something we should expect for n ≈ d or n ≥ d, the design algorithm should be variance oriented. It is also possible to have a mixed nature in which bias and variance are of the same order.
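The decomposition in Proposition 2 can be probed numerically. The sketch below (our variable names, synthetic Gaussian data) computes the bias and variance terms for an overparameterized and an underparameterized design; in line with the discussion above, we expect the former to be bias dominated for small n, and the latter to have (numerically) zero bias:

```python
import numpy as np

def psd_sqrt(C):
    """Square root of a PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def risk_terms(X, C, w, sigma2, lam):
    """Bias and variance terms of E[R(w_hat_lambda)] from Proposition 2 (a sketch)."""
    d = X.shape[1]
    M = X.T @ X
    Mp = np.linalg.pinv(M + lam * np.eye(d))
    soft_proj = np.eye(d) - Mp @ M            # the "soft projection" I - M_lambda^+ M
    bias = np.linalg.norm(psd_sqrt(C) @ soft_proj @ w) ** 2
    var = sigma2 * np.trace(C @ Mp @ Mp @ M)
    return bias, var

rng = np.random.default_rng(2)
d, sigma2 = 20, 0.01
w_true = rng.standard_normal(d)
pool = rng.standard_normal((500, d))
C = pool.T @ pool / 500

X_over = rng.standard_normal((5, d))          # n << d: expect bias-dominated risk
X_under = rng.standard_normal((100, d))       # n > d: unbiased, variance only
b_over, v_over = risk_terms(X_over, C, w_true, sigma2, 0.0)
b_under, v_under = risk_terms(X_under, C, w_true, sigma2, 0.0)
```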

3.3. PRACTICAL CRITERION

As is, ϕ_λ is problematic as an experimental design criterion since it depends both on w and on C_ρ. We discuss how to handle an unknown C_ρ in Subsection 3.5. Here we discuss how to handle an unknown w. Note that obviously w is unknown: it is exactly what we want to approximate! If we have a good guess w̄ for the true value of w, then we can replace w with w̄ in ϕ_λ. However, in many cases such an approximation is not available. Instead, we suggest replacing the bias component with an upper bound:

‖C_ρ^{1/2}(I − M_λ^+ M)w‖_2^2 ≤ ‖w‖_2^2 · ‖C_ρ^{1/2}(I − M_λ^+ M)‖_F^2.

Let us now define a new design criterion which has an additional parameter t ≥ 0:

φ̄_{λ,t}(M) = ‖C_ρ^{1/2}(I − M_λ^+ M)‖_F^2  (bias bound, divided by ‖w‖_2^2)  +  t·Tr(C_ρ (M_λ^+)^2 M)  (variance, divided by ‖w‖_2^2).

The parameter t captures an a-priori assumption on the tradeoff between bias and variance: if t = σ^2/‖w‖_2^2, then ϕ_λ(M) ≤ ‖w‖_2^2 · φ̄_{λ,t}(M). Thus, minimizing φ̄_{λ,t}(M) corresponds to minimizing an upper bound on ϕ_λ, if t is set correctly. Another interpretation of φ̄_{λ,t}(M) is as follows. If we assume that w ∼ N(0, γ^2 I_d), then

E_w[ϕ_λ(M)] = γ^2 ‖C_ρ^{1/2}(I − M_λ^+ M)‖_F^2 + σ^2 Tr(C_ρ (M_λ^+)^2 M)

so if we set t = σ^2/γ^2, then γ^2 φ̄_{λ,t}(M) = E_w[ϕ_λ(M)], and minimizing φ̄_{λ,t}(M) corresponds to minimizing the risk averaged over both the noise and w, if t is set correctly. Again, the parameter t captures an a-priori assumption on the tradeoff between bias and variance.

Remark 1. One alternative strategy for dealing with the fact that w is unknown is to consider a sequential setup where batches are acquired incrementally based on increasingly refined approximations of w. Such a strategy falls under the heading of sequential experimental design. In this paper we focus on single shot experimental design, i.e., examples are chosen to be labeled only once. We leave sequential experimental design to future research.
Although we decided to focus on the single shot scenario for simplicity, it actually captures important real-life scenarios.
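The upper-bound relation ϕ_λ(M) ≤ ‖w‖_2^2 · φ̄_{λ,t}(M) for t = σ^2/‖w‖_2^2 can be checked numerically; the following is a sketch on synthetic data, with our own naming:

```python
import numpy as np

def phi_bar(X, C, lam, t):
    """Practical criterion phi_bar_{lam,t}(M): Frobenius bias bound plus t times the variance term."""
    d = X.shape[1]
    M = X.T @ X
    Mp = np.linalg.pinv(M + lam * np.eye(d))
    soft_proj = np.eye(d) - Mp @ M
    vals, vecs = np.linalg.eigh(C)                              # PSD square root of C
    C_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    bias_bound = np.linalg.norm(C_half @ soft_proj, "fro") ** 2
    variance = np.trace(C @ Mp @ Mp @ M)
    return bias_bound + t * variance

rng = np.random.default_rng(3)
d = 15
X = rng.standard_normal((6, d))
pool = rng.standard_normal((300, d))
C = pool.T @ pool / 300
w = rng.standard_normal(d)
sigma2, lam = 0.25, 0.1
t = sigma2 / (w @ w)

# phi_lambda(M) for this particular w, and its bound ||w||^2 * phi_bar_{lam,t}(M)
M = X.T @ X
Mp = np.linalg.pinv(M + lam * np.eye(d))
soft_proj = np.eye(d) - Mp @ M
vals, vecs = np.linalg.eigh(C)
C_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
phi = np.linalg.norm(C_half @ soft_proj @ w) ** 2 + sigma2 * np.trace(C @ Mp @ Mp @ M)
bound = (w @ w) * phi_bar(X, C, lam, t)
```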

3.4. COMPARISON TO OTHER GENERALIZED V-OPTIMALITY CRITERIA

Consider the case of λ = 0. Note that we can write φ̄_{0,t}(M) = ‖C_ρ^{1/2}(I − P_M)‖_F^2 + t·Tr(C_ρ M^+). Recall that the classical V-optimal experimental design criterion is Tr(C_ρ M^{-1}), which is only applicable if n ≥ d (otherwise, M is not invertible). Indeed, if n ≥ d and M is invertible, then P_M = I_d and φ̄_{0,t}(M) is equal to Tr(C_ρ M^{-1}) up to a constant factor. However, M is not invertible if n < d, and the expression Tr(C_ρ M^{-1}) does not make sense. One naive generalization of classical V-optimality for n < d is to simply replace the inverse with the pseudoinverse, i.e., Tr(C_ρ M^+). This corresponds to minimizing only the variance term, i.e., taking t → ∞. This is consistent with classical experimental design, which focuses on variance reduction, and is appropriate when the risk is variance dominated.

Another generalization of V-optimality can be obtained by replacing M with its regularized (and invertible) version

M_µ = M + µI_d for some chosen µ > 0, obtaining Tr(C_ρ M_µ^{-1}). This is exactly the strategy employed in transductive experimental design (Yu et al., 2006), and it also emerges in a Bayesian setup (Chaloner & Verdinelli, 1995). One can try to eliminate the parameter µ by taking the limit of the minimizers when µ → 0. The following proposition shows that this is actually equivalent to taking t = 0.

Proposition 3. For a compact domain Ω ⊂ R^{d×d} of symmetric positive semidefinite matrices:

lim_{µ→0} argmin_{M∈Ω} Tr(C_ρ M_µ^{-1}) ⊆ argmin_{M∈Ω} Tr(C_ρ (I − P_M)).

We see that the aforementioned generalizations of V-optimality correspond to either disregarding the bias term (t = ∞) or disregarding the variance term (t = 0). However, using φ̄_{0,t}(M) allows much better control over the bias-variance tradeoff (see Figure 1).

Let us now consider the case of λ > 0. We show that the regularized criterion Tr(C_ρ M_µ^{-1}) used in transductive experimental design (see Proposition 3) with µ = λ corresponds to also using t = λ. In the absence of a decent model of the noise, which is a typical situation in machine learning, Prop. 5 suggests perhaps minimizing only Tr(C_ρ M_λ^{-1}), without the need to set t. However, this approach may be suboptimal in the overparameterized regime: it implicitly considers t = λ (see Prop. 4), which in a bias dominated regime can put too much emphasis on minimizing the variance. A sequential approach to experimental design can lead to better modeling of the noise, thereby assisting in dynamically setting t during acquisition-learning cycles. However, in the single shot regime, noise estimation is difficult. Arguably, there exist better values for t than the default rule-of-thumb t = λ.
In particular, we conjecture that t = 0 is a better rule-of-thumb than t = λ for severely overparameterized regimes: it suppresses the potential damage of choosing a too large λ, and it is reasonable also when λ is small (since we are anyway in a bias dominated regime), so we can focus on minimizing the bias only. In the experiments section we show an experiment that supports this conjecture. Notice that t = ∞ corresponds to minimizing only the variance, while t = 0 corresponds to minimizing only the bias.

3.5. APPROXIMATING C ρ

Our criteria so far depend on C_ρ. Oftentimes C_ρ is unknown. However, it can be approximated using unlabeled data. Suppose we have m unlabeled points (i.e., drawn from ρ), and suppose we write them as the rows of V ∈ R^{m×d}. Then E[m^{-1} V^T V] = C_ρ. Thus, we can write

m·ϕ_λ(M) ≈ ψ_λ(M) := ‖V(I_d − M_λ^+ M)w‖_2^2 + σ^2 Tr(V (M_λ^+)^2 M V^T),  λ ≥ 0,

and use ψ_λ(M) instead of ϕ_λ(M). For minimum norm interpolation we have ψ_0(M) = ‖V(I_d − P_M)w‖_2^2 + σ^2 Tr(V M^+ V^T). Again, let us turn this into a practical design criterion by introducing an additional parameter t:

ψ̄_{λ,t}(M) := ‖V(I_d − M_λ^+ M)‖_F^2 + t·Tr(V (M_λ^+)^2 M V^T).
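A sketch of the pool-based criterion ψ̄_{λ,t} in numpy (our naming); with λ = t = 0 and a selection whose rows span R^d, the bias bound vanishes, while a tiny selection leaves a large bias bound:

```python
import numpy as np

def psi_bar(V, S_rows, lam, t):
    """psi_bar_{lam,t}(M) with M = X^T X, where X consists of the selected rows of the pool V."""
    X = V[list(S_rows)]
    d = V.shape[1]
    M = X.T @ X
    Mp = np.linalg.pinv(M + lam * np.eye(d))
    soft_proj = np.eye(d) - Mp @ M
    bias_bound = np.linalg.norm(V @ soft_proj, "fro") ** 2
    variance = np.trace(V @ Mp @ Mp @ M @ V.T)
    return bias_bound + t * variance

rng = np.random.default_rng(8)
m, d = 50, 6
V = rng.standard_normal((m, d))
full = psi_bar(V, range(m), 0.0, 0.0)   # full pool: zero bias bound
pair = psi_bar(V, [0, 1], 0.0, 0.0)     # two samples: large bias bound
```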

4. POOL-BASED OVERPARAMETERIZED EXPERIMENTAL DESIGN

In the previous section we defined design criteria φ̄_{λ,t} and ψ̄_{λ,t} that are appropriate for overparameterized linear regression. While one can envision a situation in which we are free to choose X so as to minimize the design criterion, in the much more realistic pool-based active learning setup we assume that we are given in advance a large pool of unlabeled data x_1, ..., x_m. The training set is chosen to be a subset of the pool. This subset is then labeled, and learning is performed. The goal of pool-based experimental design algorithms is to choose the subset to be labeled. We formalize the pool-based setup as follows. Recall that to approximate C_ρ we assumed we have a pool of unlabeled data written as the rows of V ∈ R^{m×d}. We assume that V serves also as the pool of samples from which X is selected. For a matrix A and index sets S ⊆ [n], T ⊆ [d], let A_{S,T} be the matrix obtained by restricting A to the rows whose index is in S and the columns whose index is in T. If a colon (:) appears instead of an index set, it denotes the full index set corresponding to that dimension. Our goal is to select a subset S of cardinality n such that ψ̄_{λ,t}(V_{S,:}^T V_{S,:}) is minimized (i.e., setting X = V_{S,:}). Formally, we pose the problem of minimizing ψ̄_{λ,t}(V_{S,:}^T V_{S,:}) over subsets S ⊆ [m] of cardinality n.

5. OPTIMIZATION ALGORITHM

In this section we propose an algorithm for overparameterized experimental design. Our algorithm is based on greedy minimization of a kernelized version of ψ̄_{λ,t}(V_{S,:}^T V_{S,:}). Thus, before presenting our algorithm, we show how to handle feature spaces defined by a kernel. Denote our unlabeled pool of data by z_1, ..., z_m ∈ R^D, and suppose we are using a feature map φ : R^D → H, where H is some Hilbert space (e.g., H = R^d), i.e., the regression function is y(z) = ⟨φ(z), w⟩_H. We can then envision the pool of data to be defined by x_j = φ(z_j), j = 1, ..., m. If we assume we have a kernel function k : R^D × R^D → R such that k(x, z) = ⟨φ(x), φ(z)⟩_H, then the kernelized criterion J_{λ,t}, given in Eq. (4), can be evaluated using the kernel matrix K alone. For λ = 0 we have a simpler form:

J_{0,t}(S) = Tr(K_{:,S}(−K_{S,S}^{-1} + t·K_{S,S}^{-2})K_{:,S}^T).

Interestingly, when λ = 0 and t = 0, minimizing J_{0,0}(S) is equivalent to maximizing the trace of the Nystrom approximation of K. Another case for which Eq. (4) simplifies is t = λ (this equation was already derived in Yu et al. (2006)):

J_{λ,λ}(S) = Tr(−K_{:,S}(K_{S,S} + λI_{|S|})^{-1}K_{:,S}^T).

Eq. (4) allows us, via the kernel trick, to perform experimental design for learning nonlinear models defined using high dimensional feature maps. Our greedy algorithm proceeds as follows. We start with S^(0) = ∅, and proceed in iterations. At iteration j, given the selected samples S^(j−1) ⊂ [m], the greedy algorithm finds the index i^(j) ∈ [m] − S^(j−1) that minimizes J_{λ,t}(S^(j−1) ∪ {i^(j)}). We set S^(j) ← S^(j−1) ∪ {i^(j)}. We continue iterating until S^(j) reaches its target size and/or J_{λ,t}(S^(j)) is small enough. The cost of iteration j in a naive implementation is O((m − j)(mj^2 + j^3)). Through careful matrix algebra, the cost of iteration j can be reduced to O((m − j)(mj + j^2)) = O(m^2 j) (since j ≤ m). The cost of finding a design of size n is then O(m^2(n^2 + D)), assuming the entire kernel matrix K is formed at the start and a single evaluation of k takes O(D) time. Details are delegated to Appendix C.
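As a concrete illustration, here is a naive (non-optimized) sketch of the greedy procedure for the λ = 0 case, using the closed form J_{0,t}(S) above; it does not implement the O(m^2 j) rank-2 updates of Appendix C, and all names are ours:

```python
import numpy as np

def J0(K, S, t):
    """J_{0,t}(S) = Tr(K_{:,S}(-K_{S,S}^{-1} + t K_{S,S}^{-2})K_{:,S}^T), the lambda = 0 criterion."""
    Ks = K[:, S]
    A = np.linalg.pinv(K[np.ix_(S, S)])
    return np.trace(Ks @ (-A + t * (A @ A)) @ Ks.T)

def greedy_design(K, n, t=0.0):
    """Greedy selection: at each step add the pool index minimizing J_{0,t}. Naive cost per step."""
    S = []
    for _ in range(n):
        candidates = [i for i in range(K.shape[0]) if i not in S]
        S.append(min(candidates, key=lambda i: J0(K, S + [i], t)))
    return S

rng = np.random.default_rng(4)
Z = rng.standard_normal((40, 3))
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 3.0)                 # RBF kernel matrix on a toy pool
S = greedy_design(K, 5)
```

For t = 0, J_{0,0}(S) equals minus the trace of the Nystrom approximation of K, so it is nonincreasing as points are added.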
6. SINGLE SHOT DEEP ACTIVE LEARNING

Consider a DNN, and suppose the weights of the various layers can be represented as a vector θ ∈ R^d. Given a specific θ, let f_θ(•) denote the function instantiated by the network when the weights are set to θ. The crucial observation is that when the network is wide enough (width in convolutional layers refers to the number of output channels), we use a quadratic loss function (i.e., l(f_θ(x), y) = 1/2·(f_θ(x) − y)^2), and the initial weights θ_0 are initialized randomly in a standard way, then when training the DNN using gradient descent the vector of parameters θ stays almost fixed. Thus, when we consider the iterates θ_1, θ_2, ... formed by training, a first-order Taylor approximation gives:

f_{θ_k}(x) ≈ f_{θ_0}(x) + ∇_θ f_{θ_0}(x)^T (θ_k − θ_0)

Informally speaking, the approximation becomes an equality in the infinite width limit. The Taylor approximation implies that if we further assume that θ_0 is such that f_{θ_0}(x) = 0, the learned prediction function of the DNN is well approximated by the solution of a kernel regression problem with the (finite) Neural Tangent Kernel, defined as

k_{f,θ_0}(x, z) := ∇_θ f_{θ_0}(x)^T ∇_θ f_{θ_0}(z)

We remark that there are a few simple tricks to fulfill the requirement that f_{θ_0}(x) = 0. It has also been shown that under a certain initialization distribution, when the width goes to infinity, the NTK k_{f,θ_0} converges in probability to a deterministic kernel k_f, the infinite NTK. Thus, in a sense, instead of training a finite width DNN, we can take the width to infinity and solve a kernel regression problem instead. Although it is unclear whether the infinite NTK can be an effective alternative to DNNs in the context of inference, one can postulate that it can be used for deep active learning. That is, in order to select examples to be labeled, use an experimental design algorithm for kernel learning applied to the corresponding NTK. Specifically, for single shot deep active learning, we propose to apply the algorithm presented in the previous section to the infinite NTK. In the next section we present preliminary experiments with this algorithm. We leave theoretical analysis to future research.
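The finite NTK definition k_{f,θ_0}(x, z) = ∇_θ f_{θ_0}(x)^T ∇_θ f_{θ_0}(z) can be illustrated on a toy two-layer network with manually derived parameter gradients (a sketch with our own architecture choice, not the paper's experimental setup); the resulting Gram matrix is symmetric positive semidefinite, as any kernel matrix must be:

```python
import numpy as np

rng = np.random.default_rng(5)
d_in, width = 4, 64
W = rng.standard_normal((width, d_in)) / np.sqrt(d_in)   # toy network f(x) = a^T tanh(W x)
a = rng.standard_normal(width) / np.sqrt(width)

def grad_theta(x):
    """Gradient of f w.r.t. all parameters (W flattened, then a)."""
    h = np.tanh(W @ x)
    dW = np.outer(a * (1 - h ** 2), x)   # df/dW_{ij} = a_i (1 - h_i^2) x_j
    return np.concatenate([dW.ravel(), h])  # df/da = h

def ntk(x, z):
    return grad_theta(x) @ grad_theta(z)

x1, x2 = rng.standard_normal(d_in), rng.standard_normal(d_in)
G = np.array([[ntk(x1, x1), ntk(x1, x2)],
              [ntk(x2, x1), ntk(x2, x2)]])
```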

7. EMPIRICAL EVALUATION

Transductive vs. ψ̄_{λ,0} criterion (i.e., variance-oriented vs. bias-oriented designs). ψ̄_{λ,0} and ψ̄_{λ,λ} are simplified versions of the ψ̄_{λ,t} criterion. Our conjecture is that in the overparameterized regime ψ̄_{λ,0} is preferable, at least for relatively large λ. Table 1 empirically supports our conjecture. In this experiment, we performed an experimental design task on 112 classification datasets from the UCI database (similar to the list that was used by Arora et al. (2019b)). Learning is performed using kernel ridge regression with a standard RBF kernel. We tried different values of λ and checked which criterion leads to a smaller classification error on a test set when selecting 50 samples. Each entry in Table 1 counts how many times ψ̄_{λ,λ} won, ψ̄_{λ,0} won, or the error was the same. We consider the errors equal when the difference is less than 5%.

Figure 2 reports the mean and standard deviation (over the parameter initializations) of the final accuracy. We see a consistent advantage in terms of accuracy for designs selected via our algorithm, though as expected the advantage shrinks as the training size increases. Notice that, compared to the accuracy of our design with 400 training samples, random selection required as many as 600 samples for Wide-LeNet5 to achieve the same accuracy! Two remarks are in order. First, to prevent overfitting and reduce the computational load, at each iteration of the greedy algorithm we computed the score only on a subset of 2000 samples from the pool. Second, to keep the experiment simple we refrained from using tricks that ensure f_{θ_0} = 0.

A PROOFS A.1 PROOF OF PROPOSITION 1

Proof. We prove the case of d ≥ n (for X ∈ R^{n×d}). The proof for d < n is similar. It is enough to show that lim_{λ→0} X_λ^+ = X^+. For a scalar γ, let γ^+ := 1/γ if γ ≠ 0, and γ^+ := 0 if γ = 0. Let X = UΣV^T be the SVD of X, with Σ = [diag(σ_1, ..., σ_n) 0] ∈ R^{n×d}, where σ_1, ..., σ_n are the singular values of X. We have X^+ = VΣ^+U^T, where Σ^+ ∈ R^{d×n} is the matrix whose top n×n block is diag(σ_1^+, ..., σ_n^+) and whose remaining rows are zero. On the other hand, simple matrix algebra shows that

X_λ^+ = V Σ_λ U^T,  where Σ_λ ∈ R^{d×n} has top n×n block diag((σ_1^2 + λ)^+ σ_1, ..., (σ_n^2 + λ)^+ σ_n) and zero remaining rows.    (5)

Now, clearly, for i = 1, ..., n, lim_{λ→0^+} (σ_i^2 + λ)^+ σ_i = σ_i^+. So the limit of the diagonal matrix in Eq. (5) when λ → 0^+ is Σ^+. Since matrix product is a linear, and thus continuous, function, the proposition follows.

A.2 PROOF OF PROPOSITION 2

Proof. Let us write ε := [ε_1 ... ε_n]^T, so y = Xw + ε. Thus,

ŵ_λ = X_λ^+ y = X_λ^+ Xw + X_λ^+ ε = M_λ^+ Mw + X_λ^+ ε

and

x^T w − x^T ŵ_λ = x^T (I_d − M_λ^+ M)w − x^T X_λ^+ ε.

For brevity we denote P^λ_{⊥X} = I_d − M_λ^+ M. Note that this is not really a projection, but rather (informally) a "soft projection". So:

(x^T w − x^T ŵ_λ)^2 = w^T P^λ_{⊥X}(xx^T)P^λ_{⊥X} w − 2 w^T P^λ_{⊥X}(xx^T)X_λ^+ ε + ε^T (X_λ^+)^T (xx^T) X_λ^+ ε.

Finally,

E[R(ŵ_λ)] = E_{x,ε}[(x^T w − x^T ŵ_λ)^2]
= E_ε[E_x[(x^T w − x^T ŵ_λ)^2 | ε]]
= E_ε[w^T P^λ_{⊥X} C_ρ P^λ_{⊥X} w − 2 w^T P^λ_{⊥X} C_ρ X_λ^+ ε + ε^T (X_λ^+)^T C_ρ X_λ^+ ε]
= w^T P^λ_{⊥X} C_ρ P^λ_{⊥X} w + σ^2 Tr((X_λ^+)^T C_ρ X_λ^+)
= ‖C_ρ^{1/2} P^λ_{⊥X} w‖_2^2 + σ^2 Tr(C_ρ X_λ^+ (X_λ^+)^T)
= ‖C_ρ^{1/2}(I − M_λ^+ M)w‖_2^2 + σ^2 Tr(C_ρ (M_λ^+)^2 M)

A.3 PROOF OF PROPOSITION 3

Before proving Proposition 3 we need the following definition and theorem.

Definition 1. For a family of sets {A_λ}_{λ∈R}, A_λ ⊂ R^d, we write lim_{λ→λ̄} A_λ = A if A is the set of all points w for which there exist λ_n → λ̄ and w_n ∈ A_{λ_n} such that w_n → w.

Theorem (limits of minimizers). If f : R^d × R → R is continuous, then lim_{λ→λ̄} argmin_w f(w, λ) ⊆ argmin_w f(w, λ̄).

Proof. Suppose w̄ ∈ lim_{λ→λ̄} argmin_w f(w, λ). This implies that there exist λ_n → λ̄ such that w_n ∈ argmin_w f(w, λ_n) and w_n → w̄. From the continuity of f we have that f(w_n, λ_n) → f(w̄, λ̄). Now suppose, for the sake of contradiction, that w̄ ∉ argmin_w f(w, λ̄). So there is a u such that f(u, λ̄) < f(w̄, λ̄). From the continuity of f in λ there is an n_0 such that for all n > n_0, f(u, λ_n) < f(w̄, λ̄). Then, from the continuity of f in w and w_n → w̄, for sufficiently large n, f(w_n, λ_n) > f(u, λ_n), which contradicts w_n ∈ argmin_w f(w, λ_n).

We are now ready to prove Proposition 3.

B CONNECTION TO CORESETS

In the coreset approach to active learning, a subset S is selected so that the coreset loss C(S) is minimized. Here, l(x, y | S) is a loss function, and the conditioning on S denotes that the parameters of the loss function are the ones obtained when training only using the indices selected in S. For linear regression the conditioning on S is not relevant (since the parameters do not affect the loss).
The motivation for minimizing C(S) is that the expected test loss can be broken into the generalization loss on the entire dataset (which is fixed), the training loss (which is 0 in the presence of overparameterization), and the coreset loss. One popular approach to active learning using coresets is to find a cover. A δ-cover of a set of points A is a set of points B such that for every x ∈ A there exists a y ∈ B such that ‖x − y‖_2 ≤ δ (other metrics can be used as well). Sener & Savarese (2018) showed that, under suitable Lipschitz and boundedness conditions, if {x_i}_{i∈S} is a δ-cover of {x_i}_{i∈[m]}, then C(S) ≤ O(δ + m^{-1/2}), which motivates finding an S that minimizes δ_S, where δ_S denotes the minimal δ for which {x_i}_{i∈S} is a δ-cover of {x_i}_{i∈[m]}. Since for an x in the pool (which is a row of V), ‖(I_d − P_M)x‖_2^2, for M = V_{S,:}^T V_{S,:}, is the squared minimal distance from x to the span of {x_i}_{i∈S}, and as such is always smaller than the squared distance between x and its closest point in {x_i}_{i∈S}, it is easy to show that m^{-1} ψ̄_{0,0}(V_{S,:}^T V_{S,:}) ≤ δ_S^2. Thus, minimizing δ_S can be viewed as minimizing an upper bound on the bias term when λ = 0. Under the setup of the experiment in Section 7 we tried to replace our design with the k-centers algorithm, which is often used as an approximate solution for the problem of finding the S that minimizes δ_S. However, the results we got were much worse than random design, probably due to the problem of outliers. We did not try more sophisticated versions of the k-centers algorithm that tackle the problem of outliers.
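The relation between the cover radius and the bias term can be checked numerically. The sketch below (all names ours) runs the standard farthest-first traversal, the classic greedy 2-approximation for k-centers, and verifies m^{-1} ψ̄_{0,0}(V_{S,:}^T V_{S,:}) ≤ δ_S^2:

```python
import numpy as np

rng = np.random.default_rng(6)
m, d, n = 200, 10, 8
V = rng.standard_normal((m, d))

def k_centers(V, n):
    """Farthest-first traversal; returns the selected indices and the cover radius delta_S."""
    S = [0]
    dists = np.linalg.norm(V - V[0], axis=1)
    for _ in range(n - 1):
        i = int(np.argmax(dists))                 # farthest point from current centers
        S.append(i)
        dists = np.minimum(dists, np.linalg.norm(V - V[i], axis=1))
    return S, float(dists.max())

S, delta = k_centers(V, n)
X = V[S]
P = np.linalg.pinv(X) @ X                          # projection onto the span of the selected rows
psi_00 = np.linalg.norm(V @ (np.eye(d) - P), "fro") ** 2
```

The bound holds because each pool point's distance to the span of the selected rows is at most its distance to the nearest selected point.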

C DETAILS ON THE ALGORITHM

We discuss the case of λ = 0. The case of λ > 0 requires some more careful matrix algebra, so we omit the details.

Let us define

A_j := K_{S^(j),S^(j)}^{-1},  B_j := K_{:,S^(j)}^T K_{:,S^(j)}

and note that J_{λ,t}(S^(j)) = −Tr(B_j(A_j − t·A_j^2)). We also denote by Ã_j and B̃_j the matrices obtained from A_j and B_j (respectively) by adding a zero row and column. Our goal is to efficiently compute J_{λ,t}(S^(j−1) ∪ {i}) for every i ∈ [m] − S^(j−1), so as to find i^(j) and form S^(j). We assume that at the start of iteration j we already have A_{j−1} and B_{j−1} in memory. We show later how to efficiently update A_j and B_j once we have found i^(j). For brevity, let us denote

S_i^(j) := S^(j−1) ∪ {i},  A_{ji} := K_{S_i^(j),S_i^(j)}^{-1},  B_{ji} := K_{:,S_i^(j)}^T K_{:,S_i^(j)}.

Let us also define

C_{j−1} := B̃_{j−1} Ã_{j−1},  D_{j−1} := B̃_{j−1} Ã_{j−1}^2,  E_{j−1} := Ã_{j−1}^2.

Again, we assume that at the start of iteration j we already have C_{j−1}, D_{j−1} and E_{j−1} in memory, and show how to efficiently update these.

Let

W_{ji} := [ 0_{j−1}                          K_{:,S^(j−1)}^T K_{:,i} ;
            K_{:,i}^T K_{:,S^(j−1)}          K_{:,i}^T K_{:,i} ]

and note that B_{ji} = B̃_{j−1} + W_{ji}. Also important is the fact that W_{ji} has rank 2, and that finding its factors takes O(mj), discounting the cost of computing columns of K. Next, let us denote

r_{ji} := 1 / (K_{ii} − K_{S^(j−1),i}^T A_{j−1} K_{S^(j−1),i})

and

Q_{ji} := r_{ji} · [ A_{j−1} K_{S^(j−1),i} K_{S^(j−1),i}^T A_{j−1}    −A_{j−1} K_{S^(j−1),i} ;
                     −K_{S^(j−1),i}^T A_{j−1}                          1 ].

A well known identity regarding the Schur complement implies that A_{ji} = Ã_{j−1} + Q_{ji}. Also important is the fact that Q_{ji} has rank at most 2, and that finding its factors takes O(j^2), discounting the cost of computing entries of K.

So

J_{λ,t}(S_i^(j)) = −Tr(B_{ji}(A_{ji} − t·A_{ji}^2))
= −Tr((B̃_{j−1} + W_{ji})((Ã_{j−1} + Q_{ji}) − t·(Ã_{j−1} + Q_{ji})^2))
= −Tr(C_{j−1} + B̃_{j−1} Q_{ji} + W_{ji}(Ã_{j−1} + Q_{ji}))
  + t·Tr(D_{j−1} + B̃_{j−1}(Ã_{j−1} Q_{ji} + Q_{ji} Ã_{j−1} + Q_{ji}^2))
  + t·Tr(W_{ji}(E_{j−1} + Q_{ji}^2 + Ã_{j−1} Q_{ji} + Q_{ji} Ã_{j−1})).

Now, C_{j−1} is already in memory, so Tr(C_{j−1}) can be computed in O(j); Q_{ji} has rank 2 and B̃_{j−1} is in memory, so Tr(B̃_{j−1} Q_{ji}) can be computed in O(j^2); and W_{ji} has rank 2 and Ã_{j−1} is in memory, so Tr(W_{ji}(Ã_{j−1} + Q_{ji})) can be computed in O(j^2). Using a similar rationale, all the other terms of J_{λ,t}(S_i^(j)) can also be computed in O(j) or O(j^2), and overall J_{λ,t}(S_i^(j)) can be computed in O(j^2). Thus, scanning for i^(j) takes O((m − j)j^2). Once i^(j) has been identified, we set S^(j) = S_{i^(j)}^(j), A_j = A_{ji^(j)} = Ã_{j−1} + Q_{ji^(j)} and B_j = B_{ji^(j)} = B̃_{j−1} + W_{ji^(j)}. The last two can be computed in O(j^2) once we form Q_{ji^(j)} and W_{ji^(j)}; computing the factors of these matrices takes O(mj). As for updating C_{j−1}, we have

C_j = C̃_{j−1} + B̃_{j−1} Q_{ji^(j)} + W_{ji^(j)} Ã_{j−1} + W_{ji^(j)} Q_{ji^(j)}

where C̃_{j−1} is obtained from C_{j−1} by adding a zero row and column. Since C_{j−1} is in memory and both Q_{ji^(j)} and W_{ji^(j)} have rank O(1), we can compute C_j in O(j^2). Similar reasoning can be used to show that D_j and E_j can also be computed in O(j^2). Overall, the cost of iteration j is O((m − j)(mj + j^2)) = O(m^2 j) (since j ≤ m). The cost of finding a design of size n is O(m^2(n^2 + D)), assuming the entire kernel matrix K is formed at the start and a single evaluation of k takes O(D) time.
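The Schur-complement update A_{ji} = Ã_{j−1} + Q_{ji} can be verified numerically against a direct inverse (a sketch with our own names; in this symmetric case the correction Q_{ji} in fact has rank at most 2):

```python
import numpy as np

rng = np.random.default_rng(7)
m = 30
Z = rng.standard_normal((m, 8))
K = Z @ Z.T + 1e-3 * np.eye(m)        # a PSD kernel-like matrix, safely invertible on small subsets

S = [2, 5, 9]
i = 11
A = np.linalg.inv(K[np.ix_(S, S)])    # A_{j-1} = K_{S,S}^{-1}
k = K[np.ix_(S, [i])]                 # K_{S,i}, a column vector
r = 1.0 / (K[i, i] - (k.T @ A @ k).item())   # reciprocal of the Schur complement
j = len(S)
A_tilde = np.zeros((j + 1, j + 1))
A_tilde[:j, :j] = A                   # A_{j-1} padded with a zero row and column
Q = r * np.block([[A @ k @ k.T @ A, -A @ k],
                  [-k.T @ A, np.ones((1, 1))]])
A_new = A_tilde + Q                   # the claimed update for K_{S+{i},S+{i}}^{-1}
direct = np.linalg.inv(K[np.ix_(S + [i], S + [i])])
```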

D EXPERIMENTAL PARAMETERS EXPLORATION AND COMPARISON TO TRANSDUCTIVE EXPERIMENTAL DESIGN

In this subsection we report a set of experiments on a kernel ridge regression setup (though in one experiment we set the ridge term to 0, so we are using interpolation). We use the MNIST handwritten digits dataset (LeCun et al., 2010), where the regression target response was computed by applying a one-hot encoding to the labels 0-9. Nevertheless, we still measure the MSE, and do not use the learnt models as classifiers. We use the RBF kernel $k(x,z) = \exp(-\gamma\|x-z\|_2^2)$ with parameter $\gamma = 1/784$. From the dataset, we used the standard test set of 10,000 images and randomly selected another 10,000 images from the 60,000 training images as a pool. We used our proposed greedy algorithm to select training sets of sizes 1 to 100. We use two values of $\lambda$: $\lambda = 0$ (interpolation) and $\lambda = 0.75^2$. The optimal $\lambda$ according to cross validation was the smallest we checked, so we just used $\lambda = 0$. However, in some cases having $\lambda > 0$ is desirable from a computational perspective, e.g., it caps the condition number of the kernel matrix, making the linear system easier to solve. Furthermore, in real-world scenarios we often have no data before we start to acquire labels, and even when we do, it is not always distributed as the test data, so computing the optimal $\lambda$ can be challenging. Results are reported in Figure 3. The left panel shows the results for $\lambda = 0$. We report results for $t = 0$ and $t = 0.5$. The choice of $t = 0$ worked better. Kernel models with the RBF kernel are highly overparameterized (the hypothesis space is infinite dimensional), so we expect the MSE to be bias dominated, in which case a small $t$ (or $t = 0$) might work best. Recall that the option of $\lambda = t = 0$ is equivalent to the Column Subset Selection Problem, is the limiting case of transductive experimental design (Yu et al., 2006), and can be related to the coreset approach (specifically, Sener & Savarese (2018)). The case of $\lambda = 0.75^2$ is reported in the right panel of Figure 3. We tried $t = 0$ and $t = \lambda = 0.75^2$.
Here too, using a purely bias-oriented objective (i.e., $t = 0$) worked better. Note that this is in contrast with classical OED, which uses variance-oriented objectives. The choice of $t = \lambda$ worked well, but not optimally. In general, in the reported experiments, and in other experiments conducted but not reported, the choice of $t = \lambda$, which is, as we have shown in this paper, equivalent to transductive experimental design, usually works well but is not optimal.
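For concreteness, a naive (non-incremental) version of the $\lambda = t = 0$ greedy criterion used above can be sketched as follows. This is our own illustrative code, not the paper's implementation: it re-solves the linear system at every step instead of using the rank-2 updates described earlier, so it is only suitable for small pools.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # k(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def greedy_design(K, n):
    # lambda = t = 0: greedily maximize Tr(K_{:,S} K_{S,S}^{-1} K_{:,S}^T),
    # i.e., minimize J_{0,0}(S); equivalent to a greedy CSSP heuristic.
    m, S = K.shape[0], []
    for _ in range(n):
        best, best_val = None, -np.inf
        for i in range(m):
            if i in S:
                continue
            T = S + [i]
            Kss = K[np.ix_(T, T)]
            val = np.trace(K[:, T] @ np.linalg.solve(Kss, K[:, T].T))
            if val > best_val:
                best, best_val = i, val
        S.append(best)
    return S

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 784))       # stand-in pool; MNIST images in the paper
K = rbf_kernel(X, gamma=1.0 / 784)       # same gamma as in the experiments
S = greedy_design(K, n=5)
assert len(set(S)) == 5
```

Each greedy step scores every remaining pool point by how much of the kernel's "energy" its addition captures, which is why the selected points tend to spread out over the pool.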

E EXPERIMENTAL SETUP FOR RESULT REPORTED IN FIGURE 1

First, $w \in \mathbb{R}^{100}$ was sampled randomly from $N(0, I)$. Then a pool (the set from which we later choose the design) of 500 samples and a test set of 100 samples were randomly generated according to $x \sim N(0, \Sigma)$, $\epsilon \sim N(0, \sigma^2)$ and $y = x^T w + \epsilon$, where $\Sigma \in \mathbb{R}^{100\times 100}$ is diagonal with $\Sigma_{ii} = \exp(-2.5i/100)$, and $\sigma = 0.2$. We then created three incremental designs (training sets) of size 120 according to three different methods:
• Random design - at each iteration we randomly choose the next training sample from the remaining pool.
• Classical OED (variance oriented) - at each iteration we choose the next training sample from the remaining pool with a greedy step that minimizes the variance term in Eq. (3).
In Figure 4 we compare the results of our method on LeNet5 with the results of our method on Wide-LeNet5. We see that while the results on the wide version are generally better, both for random designs and our design, our method brings a consistent advantage over random design. In both the narrow and the wide versions, the random design requires about 600 training samples to achieve the accuracy achieved by our algorithm with only 400 training samples! The parameters used by our algorithm to select the design are $\lambda = t = 0$. For the network training we used SGD with batch size 128, learning rate 0.1, and no regularization. The number of SGD iterations is equivalent to 20 epochs of the full training set.
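The synthetic setup described at the start of this appendix can be sketched in a few lines of numpy. This is our own minimal version; the variable names and the indexing convention in $\Sigma_{ii}$ are our assumptions, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, pool_size, test_size, sigma = 100, 500, 100, 0.2

w = rng.standard_normal(d)                          # w ~ N(0, I)
# Sigma_ii = exp(-2.5 i / 100); we index i = 0..99 (convention is ours)
Sigma = np.diag(np.exp(-2.5 * np.arange(d) / 100))

def sample(n):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)  # x ~ N(0, Sigma)
    y = X @ w + sigma * rng.standard_normal(n)               # y = x^T w + eps
    return X, y

X_pool, y_pool = sample(pool_size)
X_test, y_test = sample(test_size)
assert X_pool.shape == (500, 100) and y_test.shape == (100,)
```

The fast spectral decay of $\Sigma$ is what makes the design choice matter: most of the pool's variance is concentrated in a few directions, so a careful design covers them with far fewer labels than random selection.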

G SEQUENTIAL VS SINGLE SHOT ACTIVE LEARNING

While in this work we focus on single shot active learning, an interesting question is how it compares to sequential active learning. In sequential active learning we alternate between a model-improvement step and a step of acquiring new labels. This obviously gives an advantage to sequential active learning over single shot active learning, as the latter is a restricted instance of the former. Since we do not yet have a sequential version of our algorithm to compare with, we chose to experimentally compare our single shot algorithm with the classical method of uncertainty sampling (Bartlett et al., 2020). This method has proved to be relatively efficient for neural networks (Gal et al., 2017). Uncertainty-sampling-based active learning requires computing the uncertainty of the updated model on each sample in the pool; as such, this approach is sequential by nature. Usually uncertainty sampling is derived in connection with the cross entropy loss, since in that case the network output after the softmax layer can be interpreted as a probability estimate of $y = i$ given $x$, which we denote $p_i(x)$. The uncertainty score (in one common version) is then given by $1 - \max_{i\in[L]} p_i(x)$. Because we use the square loss, we need to adapt the way $p_i(x)$ is computed. Considering the fact that the square loss is the outcome of a maximum likelihood model that, given $x$, assumes $y \sim N(f(x), I_L)$, it makes sense to use $p_i(x) = (2\pi)^{-L/2} e^{-\frac{1}{2}\|y_i - f(x)\|_2^2}$, where $y_i$ is the one-hot vector of $i$. Initially, our selection procedure shows a clear advantage. However, once the training set grows large enough, the benefit of a sequential setup starts to kick in, and the sequential algorithm starts to show superior results. This experiment motivates further development of a sequential version of our algorithm.
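The squared-loss uncertainty score above can be sketched as follows. This is our own illustrative code (the function and argument names are ours); `f_x` stands for the network output $f(x)$ on a single pool sample.

```python
import numpy as np

def uncertainty(f_x, num_classes=10):
    # p_i(x) = (2*pi)^(-L/2) * exp(-0.5 * ||y_i - f(x)||^2), y_i one-hot
    L = num_classes
    ps = np.array([np.exp(-0.5 * np.sum((np.eye(L)[i] - f_x) ** 2))
                   for i in range(L)]) * (2 * np.pi) ** (-L / 2)
    # score: 1 - max_i p_i(x); larger means the model is less certain
    return 1.0 - ps.max()

confident = uncertainty(np.eye(10)[3])    # output equal to a one-hot vector
unsure = uncertainty(np.full(10, 0.1))    # diffuse output, far from every one-hot
assert unsure > confident
```

Only the ordering of the scores matters for selection, so the constant $(2\pi)^{-L/2}$ could be dropped; we keep it to match the formula in the text.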



In practice, when n is only mildly bigger than d it is usually better to regularize the problem.



Figure 1: MSE of a minimum norm linear interpolative model. We use synthetic data of dimension 100. The full description is in Appendix E.

For any matrix space $\Omega$ and $\lambda > 0$:
$$\operatorname*{argmin}_{M\in\Omega} \operatorname{Tr}\left(C_\rho M_\lambda^{-1}\right) = \operatorname*{argmin}_{M\in\Omega} \phi_{\lambda,\lambda}(M)$$
So, transductive experimental design corresponds to a specific choice of bias-variance tradeoff. Another interesting relation with transductive experimental design is given by the next proposition, which is a small modification of Theorem 1 due to Gu et al. (2012).
Proposition 5. For any $\lambda > 0$ and $t \geq 0$: $\phi_{\lambda,t}(M) \leq (\lambda + t)\operatorname{Tr}\left(C_\rho M_\lambda^{-1}\right)$

(Pool-based Overparameterized V-Optimal Design) Given a pool of unlabeled examples $V \in \mathbb{R}^{m\times d}$, a regularization parameter $\lambda \geq 0$, a bias-variance tradeoff parameter $t \geq 0$, and a design size $n$, find a minimizer of
$$\min_{S\subseteq[m],\ |S|=n} \psi_{\lambda,t}\left(V_{S,:}^T V_{S,:}\right).$$
Problem 1 is a generalization of the Column Subset Selection Problem (CSSP) (Boutsidis et al., 2009). In the CSSP, we are given a matrix $U \in \mathbb{R}^{d\times m}$ and a target number of columns $n$, and our goal is to select a subset $T$ which is a minimizer of
$$\min_{T\subseteq[m],\ |T|=n} \left\|\left(I_d - U_{:,T}U_{:,T}^+\right)U\right\|_F^2.$$
When $\lambda = 0$ and $t = 0$, Problem 1 reduces to the CSSP for $U = V^T$. The $\lambda = t = 0$ case is also somewhat related to the coreset approach for active learning (Sener & Savarese, 2018; Pinsler et al., 2019; Ash et al., 2019; Geifman & El-Yaniv, 2017). See Appendix B.
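The CSSP residual that the $\lambda = t = 0$ case reduces to can be written in a few lines. This is our own sketch (helper names are ours): the residual measures the energy of $U$ outside the span of the selected columns.

```python
import numpy as np

def cssp_residual(U, T):
    # ||(I - U_{:,T} U_{:,T}^+) U||_F^2 : energy of U not captured by span(U_{:,T})
    P = U[:, T] @ np.linalg.pinv(U[:, T])   # projector onto span of selected columns
    return np.linalg.norm(U - P @ U, 'fro') ** 2

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 12))
# Selecting all columns spans the full column space, so the residual vanishes.
assert cssp_residual(U, list(range(12))) < 1e-9
# Adding columns can only shrink the residual (monotonicity of projections).
assert cssp_residual(U, [0]) >= cssp_residual(U, [0, 1])
```

Minimizing this residual over subsets of fixed size is NP-hard in general, which is why the paper resorts to a greedy procedure.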

Kernelization. If $|S| \leq d$ and $V_{S,:}$ has full row rank, we have $(V_{S,:})_\lambda^+ = V_{S,:}^T\left(V_{S,:}V_{S,:}^T + \lambda I_{|S|}\right)^{-1}$, which allows us to write
$$\begin{aligned}\psi_{\lambda,t}\left(V_{S,:}^T V_{S,:}\right) ={}& \operatorname{Tr}\left(V\left(I - 2V_{S,:}^T\left(V_{S,:}V_{S,:}^T + \lambda I_{|S|}\right)^{-1}V_{S,:}\right)V^T\right) \\ &+ \operatorname{Tr}\left(VV_{S,:}^T\left(V_{S,:}V_{S,:}^T + \lambda I_{|S|}\right)^{-1}V_{S,:}V_{S,:}^T\left(V_{S,:}V_{S,:}^T + \lambda I_{|S|}\right)^{-1}V_{S,:}V^T\right) \\ &+ t\operatorname{Tr}\left(VV_{S,:}^T\left(V_{S,:}V_{S,:}^T + \lambda I_{|S|}\right)^{-2}V_{S,:}V^T\right)\end{aligned}$$
Let now $K := VV^T \in \mathbb{R}^{m\times m}$. Then $V_{S,:}V_{S,:}^T = K_{S,S}$ and $VV_{S,:}^T = K_{:,S}$. Since $\operatorname{Tr}(K)$ is constant, minimizing $\psi_{\lambda,t}(V_{S,:}^T V_{S,:})$ is equivalent to minimizing
$$J_{\lambda,t}(S) := \operatorname{Tr}\left(K_{:,S}\left(-2\left(K_{S,S}+\lambda I_{|S|}\right)^{-1} + \left(K_{S,S}+\lambda I_{|S|}\right)^{-1}K_{S,S}\left(K_{S,S}+\lambda I_{|S|}\right)^{-1} + t\left(K_{S,S}+\lambda I_{|S|}\right)^{-2}\right)K_{:,S}^T\right).$$
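The equivalence up to the constant $\operatorname{Tr}(K)$ is easy to verify numerically. The following sketch (our own code and variable names) evaluates $\psi_{\lambda,t}$ in the primal $d$-dimensional form and $J_{\lambda,t}$ purely from kernel entries, and checks that they differ by exactly $\operatorname{Tr}(K)$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam, t = 20, 6, 0.5, 0.25
V = rng.standard_normal((m, d))
K = V @ V.T
S = [0, 3, 7]                            # |S| = 3 <= d, full row rank a.s.

Vs, Kss, Ks = V[S, :], K[np.ix_(S, S)], K[:, S]
R = np.linalg.inv(Kss + lam * np.eye(len(S)))   # (V_S V_S^T + lam*I)^{-1}

# psi computed in the primal (d-dimensional) form
psi = (np.trace(V @ (np.eye(d) - 2 * Vs.T @ R @ Vs) @ V.T)
       + np.trace(V @ Vs.T @ R @ Kss @ R @ Vs @ V.T)
       + t * np.trace(V @ Vs.T @ R @ R @ Vs @ V.T))

# J computed purely from kernel entries, as in the display above
J = np.trace(Ks @ (-2 * R + R @ Kss @ R + t * R @ R) @ Ks.T)

assert np.isclose(psi - np.trace(K), J)  # psi = Tr(K) + J
```

This is what makes the algorithm kernelizable: the greedy selection only ever needs rows and columns of $K$, never the (possibly infinite-dimensional) features themselves.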

Figure 2: Single shot active learning with Wide-LeNet5 model on MNIST.

and only if there exists a sequence $\lambda_n \to \bar\lambda$ and a sequence $w_n \to w$ where $w_n \in A_{\lambda_n}$ for sufficiently large $n$. Theorem 1. (A restricted version of Theorem 1.17 in Rockafellar & Wets (2009)) Consider $f: \Omega \times \Psi \to \mathbb{R}$ where $\Omega \subseteq \mathbb{R}^d$ and $\Psi \subseteq \mathbb{R}$ are compact and $f$ is continuous. Then $\lim_{\lambda\to\bar\lambda} \operatorname{argmin}_w f(w,\lambda) \subseteq \operatorname{argmin}_w f(w,\bar\lambda)$.

Figure 3: Kernel regression experiments on MNIST.

Figure 5: Single shot active learning vs sequential active learning. MNIST and (standard) LeNet5

Figure 5 shows a comparison between the accuracy achieved with our single shot algorithm and sequential active learning on MNIST with LeNet5. The acquisition batch size of the sequential active learning was set to 100. Our algorithm ran with $\lambda = t = 0$. For the network training we used SGD with batch size 128, learning rate 0.1, and no $\ell_2$ regularization. The number of SGD iterations is equivalent to 20 epochs of the full training set.

We propose a new algorithm for single shot deep active learning, a scarcely treated problem so far, and demonstrate its effectiveness on MNIST. Experimental design is a well-established paradigm in statistics, extensively covered in the literature for the linear case (Pukelsheim, 2006) and the nonlinear case (Pronzato & Pázman, 2013). Its application to pool-based active learning with batch acquisitions was explored by Yu et al. (2006) for linear models and by Hoi et al. (2006) for logistic regression. It was also proposed in the context of deep learning (Sourati et al., 2018). Another related line of work is recent work by Haber and Horesh on experimental design for ill-posed inverse problems (Haber et al., 2008; 2012; Horesh et al., 2010). Active learning in the context of overparameterized learning was explored by Karzand & Nowak (2020); however, their approach differs from ours significantly since it is based on artificially completing the labels using a minimax approach. In the context of Laplacian Regularized Least Squares (LapRLS), which is a generalization of ridge regression, Gu et al. (2012) showed rigorously that the criterion of Yu et al. (2006) is justified as a bound on both the bias and variance components of the expected error. We further show that this bound is in some sense tight only if the parameter norm is one and the noise variance equals the $\ell_2$ penalty coefficient. In addition, we postulate and show experimentally that in the overparameterized case using a bias-dominant criterion is preferable. Another case in which the bias term does not vanish is when the model is misspecified. For linear and generalized linear models this case has been tackled by reweighting the loss function.

$J_{\lambda,t}(S)$ can be computed without actually forming $x_1, \ldots, x_m$, since entries of $K$ can be computed via $k$. If $\mathcal{H}$ is the Reproducing Kernel Hilbert Space of $k$, then this is exactly the setting that corresponds to kernel ridge regression (possibly with a zero ridge term).

There are a few ways in which our proposed experimental design algorithm can be used in the context of deep learning. For example, one can consider a sequential setting where the current labeled data are used to create a linear approximation via the Fisher information matrix at the point of minimum loss (Sourati et al., 2018). However, such a strategy falls under the heading of sequential experimental design, and, as we previously stated, in this paper we focus on single shot active learning, i.e., no labeled data is available before or during acquisition (Yang & Loog, 2019). In order to design an algorithm for deep active learning, we leverage a recent breakthrough in the theoretical analysis of deep learning: the Neural Tangent Kernel (NTK) (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a). A rigorous exposition of the NTK is beyond the scope of this paper, but a short and heuristic explanation is sufficient for our needs.

ψλ,0 vs ψλ,λ on UCI datasets: we generated designs on 112 classification datasets; each cell details the number of datasets in which that selection of t was clearly superior to the other possible choice, or the same (for the "SAME" column).
Deep Active Learning. Here we report preliminary experiments with the proposed algorithm for single shot deep active learning (Section 6). Additional experiments are reported in the appendix. We used the MNIST dataset and the square loss for training. As for the network architecture, we used a version of LeNet5 (LeCun et al., 1998) that is widened by a factor of 8; we refer to this network as "Wide-LeNet5". The setup is as follows. We use Google's open source neural tangents library (Novak et al., 2020) to compute the Gram matrix of the infinite NTK using 59,940 training samples (we did not use the full 60,000 training samples due to batching-related technical issues). We then used the algorithm proposed in Section 5 to incrementally select greedy designs of up to 800 samples, where we set the parameters to λ = t = 0. We then trained the original neural network with different design sizes, each design with five different random initial parameters. Learning was conducted using SGD, with a fixed learning rate of 0.1, batch size of 128, and no weight decay. Instead of counting epochs, we simply capped the number of SGD iterations to be equivalent to 20 epochs of the full training set. We computed the accuracy of the model predictions on 9,963 test-set samples (again, due to technical issues related to batching).

• Overparameterized OED - at each iteration we choose the next training sample from the remaining pool with a greedy step that minimizes Eq. (3), with λ = 0 and t = σ². With the addition of each new training sample, we computed the new MSE achieved on the test set with minimum-norm linear regression.
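The evaluation step shared by all three designs, fitting the minimum-norm interpolating linear model and recording its test MSE, can be sketched as follows (our own code; the noiseless labels are used only for the interpolation check at the end).

```python
import numpy as np

def min_norm_mse(X_train, y_train, X_test, y_test):
    # Minimum-norm least-squares solution via the pseudo-inverse;
    # in the underdetermined regime (n < d) this interpolates the training set.
    w_hat = np.linalg.pinv(X_train) @ y_train
    return np.mean((X_test @ w_hat - y_test) ** 2)

rng = np.random.default_rng(0)
d = 100
w = rng.standard_normal(d)
X_tr = rng.standard_normal((30, d))             # underdetermined: 30 < 100
y_tr = X_tr @ w                                 # noiseless labels for this check
X_te = rng.standard_normal((50, d))

mse = min_norm_mse(X_tr, y_tr, X_te, X_te @ w)
assert np.isfinite(mse) and mse >= 0
# The min-norm solution fits the training data exactly:
assert np.allclose(X_tr @ np.linalg.pinv(X_tr) @ y_tr, y_tr)
```

Tracking this MSE as the design grows from 1 to 120 samples traces out exactly the kind of curves shown in Figure 1, including the explosion near the interpolation threshold for the random design.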


Proof. Consider the function
$$f(M,\mu) = \begin{cases} \mu\operatorname{Tr}\left(C_\rho(M+\mu I)^{-1}\right) & \mu > 0 \\ \operatorname{Tr}\left(C_\rho(I - M^+M)\right) & \mu = 0\end{cases}$$
defined over $\Omega \times \mathbb{R}_{\geq 0}$, where $\mathbb{R}_{\geq 0}$ denotes the set of non-negative real numbers. Note that this function is well-defined since $\Omega$ is a set of positive semidefinite matrices. We now show that $f$ is continuous. For $\mu > 0$ it is clearly continuous for every $M$, so we focus on the case $\mu = 0$ for an arbitrary $M$. Consider a sequence $\mathbb{R}_{>0} \ni \mu_n \to 0$ (where $\mathbb{R}_{>0}$ is the set of positive reals) and $\Omega \ni M_n \to M$. Since $\Omega$ is compact, $M \in \Omega$. Let us write a spectral decomposition $M_n = U_n\Lambda_n U_n^T$ (recall that $\Omega$ is a set of symmetric matrices), where $\Lambda_n$ is diagonal with non-negative diagonal entries (recall that $\Omega$ is a set of positive semidefinite matrices). Let $M = U\Lambda U^T$ be a spectral decomposition of $M$. Without loss of generality we may assume that $U_n \to U$ and $\Lambda_n \to \Lambda$. Now note that $(\Lambda_n + \mu_n I)^{-1}\Lambda_n \to \operatorname{sign}(\Lambda)$, where the sign is taken entrywise, which implies that $(M_n + \mu_n I)^{-1}M_n \to U\operatorname{sign}(\Lambda)U^T$ since matrix multiplication is continuous. Next, note that $\mu_n(M_n + \mu_n I)^{-1} = I - (M_n + \mu_n I)^{-1}M_n$, so the continuity of the trace operator implies that $f(M_n, \mu_n) \to \operatorname{Tr}\left(C_\rho(I - U\operatorname{sign}(\Lambda)U^T)\right) = \operatorname{Tr}\left(C_\rho(I - M^+M)\right) = f(M, 0)$, which shows that $f$ is continuous. Since $\lambda > 0$, it does not affect the minimizer.
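The limit underlying this continuity argument, $\mu\operatorname{Tr}(C_\rho(M+\mu I)^{-1}) \to \operatorname{Tr}(C_\rho(I - M^+M))$ as $\mu \to 0^+$, can be checked numerically on a deliberately singular $M$. This is our own sketch (names are ours).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, 3))
M = A @ A.T                         # PSD with rank 3: singular on purpose
C = np.eye(d)                       # any fixed PSD weighting matrix C_rho

# mu = 0 value: Tr(C (I - M^+ M)), the projector onto the null space of M
limit = np.trace(C @ (np.eye(d) - np.linalg.pinv(M) @ M))

# mu > 0 values: mu * Tr(C (M + mu I)^{-1}) for shrinking mu
vals = [mu * np.trace(C @ np.linalg.inv(M + mu * np.eye(d)))
        for mu in (1e-2, 1e-4, 1e-6)]
errs = [abs(v - limit) for v in vals]

assert errs[0] > errs[1] > errs[2]  # each term mu/(lam+mu) decreases with mu
assert errs[2] < 1e-3               # already close to the mu = 0 value
```

With $C_\rho = I$ the limit equals the nullity of $M$ (here 3), since each zero eigenvalue contributes $\mu/(0+\mu) = 1$ to the trace for every $\mu > 0$.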

B RELATION TO CORESETS

The idea in the coreset approach for active learning is to find an S such that

