GENERALIZATION PROPERTIES OF RETRIEVAL-BASED MODELS Anonymous authors Paper under double-blind review

Abstract

Many modern high-performing machine learning models such as GPT-3 primarily rely on scaling up models, e.g., transformer networks. Simultaneously, a parallel line of work aims to improve the model performance by augmenting an input instance with other (labeled) instances during inference. Examples of such augmentations include task-specific prompts and similar examples retrieved from the training data by a nonparametric component. Remarkably, retrieval-based methods have enjoyed success on a wide range of problems, ranging from standard natural language processing and vision tasks to protein folding, as demonstrated by many recent efforts, including WebGPT and AlphaFold. Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored. In this paper, we present a formal treatment of retrieval-based models to characterize their generalization ability. In particular, we focus on two classes of retrieval-based classification approaches: First, we analyze a local learning framework that employs an explicit local empirical risk minimization based on retrieved examples for each input instance. Interestingly, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low complexity parametric component to ensure good overall accuracy. The second class of retrieval-based approaches we explore learns a global model using kernel methods to directly map an input instance and retrieved examples to a prediction, without explicitly solving a local learning task.

1. INTRODUCTION

As our world is complex, we need expressive machine learning models to make high-accuracy predictions on real-world problems. There are multiple ways to increase the expressiveness of a machine learning model. A popular way is to homogeneously scale the size of a parametric model, such as a neural network, which has been behind many recent high-performance models such as GPT-3 (Brown et al., 2020) and ViT (Dosovitskiy et al., 2021). Their performance (accuracy) improves monotonically with increasing model size, as demonstrated by "scaling laws" (Kaplan et al., 2020). Such large models, however, have their own limitations, including high computation cost, catastrophic forgetting (they are hard to adapt to changing data), and a lack of provenance and explainability. Classical instance-based models (Fix & Hodges, 1989), on the other hand, offer many desirable properties by design: efficient data structures, incremental learning (easy addition and deletion of knowledge), and some provenance for their predictions based on the nearest neighbors w.r.t. the input. However, these models often suffer from weaker empirical performance as compared to deep parametric models. Increasingly, a middle ground combining the two paradigms and retaining the best of both worlds is becoming popular across various domains, ranging from natural language (Das et al., 2021; Wang et al., 2022; Liu et al., 2022; Izacard et al., 2022), to vision (Liu et al., 2015; 2019; Iscen et al., 2022; Long et al., 2022), to reinforcement learning (Blundell et al., 2016; Pritzel et al., 2017; Ritter et al., 2020), to even protein structure prediction (Cramer, 2021). In such approaches, given a test input, one first retrieves relevant entries from a data index and then processes the retrieved entries along with the test input to make the final prediction using a machine learning model. This process is visualized in Figure 1b.
For example, in semantic parsing, models that augment a parametric seq2seq model with similar examples have not only outperformed much larger models but are also more robust to changes in data (Das et al., 2021). While classical learning setups (cf. Figure 1a) have been studied extensively over decades, even basic properties and trade-offs pertaining to retrieval-based models (cf. Figure 1b), despite their aforementioned remarkable successes, remain highly under-explored. Most of the existing efforts on retrieval-based machine learning models focus solely on developing end-to-end domain-specific models, without identifying the key dataset properties or structures that are critical in realizing performance gains by such models. Furthermore, because an input and its associated retrieved set are highly dependent, existing statistical learning techniques do not apply directly. This prompts the natural question: What should be the right theoretical framework that can help rigorously showcase the value of the retrieved set in ensuring superior performance of modern retrieval-based models? In this paper, we take the first step towards answering this question, while focusing on the classification setting (Sec. 2.1). We begin with the hypothesis that the model might be using the retrieved set to do local learning implicitly and then adapt its predictions to the neighborhood of the test point. This idea is inspired by Bottou & Vapnik (1992). Such local learning is potentially beneficial in cases where the underlying task has a local structure, where a much simpler function class suffices to explain the data in a given local neighborhood even though the data can be complex overall (formally defined in Sec. 2.2). For instance, looking at a few answers on Stack Overflow, even if they do not address exactly the same problem, may help us solve our issue much faster than understanding the whole system. We formalize this effect.
We begin by analyzing an explicit local learning algorithm: For each test input, we (1) retrieve a few training examples located in the vicinity of the test input; (2) train a local model by performing empirical risk minimization (ERM) with only these retrieved examples (local ERM); and (3) apply the resulting local model to make a prediction on the test input. For this retrieval-based local ERM, we derive finite-sample generalization bounds in Sec. 3 that highlight a trade-off between the complexity of the underlying function class and the size of the neighborhood where the local structure of the data distribution holds. Under this assumption of local regularity, we show that by using a much simpler function class for the local model, we can achieve a loss/error similar to that of a complex global model (Thm. 3.4). Thus, we show that breaking down the underlying learning task into local sub-tasks enables the model to employ a low-complexity parametric component to ensure good overall accuracy. Note that the local ERM setup is reminiscent of semiparametric polynomial regression (Fan & Gijbels, 2018) in statistics, which is a special case of our setup. However, semiparametric polynomial regression has only been analyzed asymptotically under the mean squared error loss (Ruppert & Wand, 1994), and its treatment under a more general loss is unexplored. We acknowledge that such local learning cannot be the complete picture behind the effectiveness of retrieval-based models. As noted in Zakai & Ritov (2008), there always exists a model with a global component that is more "preferable" to a local-only model. In Sec. 3.2, we extend local ERM to a two-stage setup: First learn a global representation using the entire dataset, and then utilize the representation at test time while solving the local ERM as previously defined. This enables the local learning to benefit from good-quality global representations, especially in sparse data regions.
Finally, we move beyond explicit local learning to a setting that more closely resembles empirically successful systems such as REINA, WebGPT, and AlphaFold: a model that directly learns to predict from the input instance and the associated retrieved similar examples end-to-end. Towards this, we take a preliminary step in Sec. 4 by studying a novel formulation of classification over an extended feature space (to account for the retrieved examples) by using kernel methods (Deshmukh et al., 2019). To summarize, our main contributions include: 1) setting up a formal framework for classification under local regularity; 2) a finite-sample analysis of the explicit local learning framework; 3) extending the analysis to incorporate a globally learnt model; and 4) providing the first rigorous treatment of end-to-end retrieval-based models to understand their generalization by using kernel-based learning.

2. PROBLEM SETUP

We first provide a brief background on (multiclass) classification along with necessary notations. Subsequently, we discuss the problem setup considered in this paper, which deals with designing retrieval-based classification models for the data distributions with local regularity.

2.1. MULTICLASS CLASSIFICATION

In this work, we restrict ourselves to the (multiclass) classification setting, with access to $n$ training examples $S = \{(x_i, y_i)\}_{i \in [n]} \subset \mathcal{X} \times \mathcal{Y}$, sampled i.i.d. from the data distribution $\mathcal{D} := \mathcal{D}_{X,Y}$. Given $S$, one is interested in learning a classifier $h : \mathcal{X} \to \mathcal{Y}$ that minimizes the misclassification error. It is common to define a classifier via a scorer $f : x \mapsto (f_1(x), \ldots, f_{|\mathcal{Y}|}(x)) \in \mathbb{R}^{|\mathcal{Y}|}$ that assigns a score to each class in $\mathcal{Y}$ for an instance $x$. For a scorer $f$, the corresponding classifier takes the form $h_f(x) = \arg\max_{y \in \mathcal{Y}} f_y(x)$. Furthermore, we define the margin of $f$ at a given label $y \in \mathcal{Y}$ as

$$\gamma_f(x, y) = f_y(x) - \max_{y' \neq y} f_{y'}(x). \tag{1}$$

Let $P_{\mathcal{D}}(A) := \mathbb{E}_{(X,Y)\sim\mathcal{D}}[\mathbb{1}\{A\}]$ for any event $A$. Given $S$ and a set of scorers $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$, learning a model implies finding a scorer in $\mathcal{F}$ that minimizes the misclassification error:

$$f^{*} = \arg\min_{f \in \mathcal{F}} P_{\mathcal{D}}\big(h_f(X) \neq Y\big).$$

One typically employs a surrogate loss $\ell$ (Bartlett et al., 2006) for the misclassification loss $\mathbb{1}\{h_f(X) \neq Y\}$ and aims to minimize the associated risk:

$$R_\ell(f) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell(f(X), Y)\big].$$

Since the underlying data distribution $\mathcal{D}$ is only accessible via the examples in $S$, one learns a good scorer by minimizing the (global) empirical risk over a large function class $\mathcal{F}^{\mathrm{global}}$ as follows:

$$\hat f = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \hat R_\ell(f), \qquad \hat R_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell\big(f(x_i), y_i\big).$$
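The definitions above can be made concrete with a small sketch. Everything named here is an assumption for illustration and not part of the paper: `linear_scorer` is a hypothetical three-class linear scorer, `sample` is a toy data set, and the logistic function of the margin stands in for the surrogate loss (it is decreasing in the margin and 1-Lipschitz).

```python
import math

def predict(scores):
    """Classifier h_f(x): index of the class with the highest score."""
    return max(range(len(scores)), key=lambda y: scores[y])

def margin(scores, y):
    """Margin gamma_f(x, y) = f_y(x) - max_{y' != y} f_{y'}(x)."""
    return scores[y] - max(s for j, s in enumerate(scores) if j != y)

def margin_loss(gamma):
    """Logistic surrogate log(1 + exp(-gamma)), written in an overflow-safe form."""
    return max(-gamma, 0.0) + math.log1p(math.exp(-abs(gamma)))

def empirical_risk(scorer, sample):
    """Average surrogate loss of `scorer` over the labeled sample."""
    return sum(margin_loss(margin(scorer(x), y)) for x, y in sample) / len(sample)

def linear_scorer(x):
    """Hypothetical 3-class linear scorer on 2-d inputs (illustrative weights)."""
    weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
    return [w[0] * x[0] + w[1] * x[1] for w in weights]

# Toy labeled sample: each point is classified correctly with a positive margin.
sample = [((2.0, 0.0), 0), ((0.0, 3.0), 1), ((-1.0, -1.0), 2)]
```

A positive margin at the true label means the classifier predicts that label; a negative margin means a misclassification, which is why a decreasing function of the margin is a natural surrogate.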

2.2. DATA DISTRIBUTIONS WITH LOCAL REGULARITY

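In this work, we assume that the underlying data distribution $\mathcal{D}$ follows a local-regularity structure, where a much simpler (parametric) function class suffices to explain the data in each local neighborhood. Formally, for $x \in \mathcal{X}$ and $r > 0$, we define $\mathcal{B}^{x,r} := \{x' \in \mathcal{X} : d(x, x') \leq r\}$, an $r$-radius ball around $x$, w.r.t. a metric $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{D}^{x,r}$ be the data distribution restricted to $\mathcal{B}^{x,r}$, i.e.,

$$\mathcal{D}^{x,r}(A) = \frac{\mathcal{D}(A)}{\mathcal{D}(\mathcal{B}^{x,r} \times \mathcal{Y})}, \qquad A \subseteq \mathcal{B}^{x,r} \times \mathcal{Y}.$$

Now, the local regularity condition of the data distribution ensures that, for each $x \in \mathcal{X}$, there exists a low-complexity function class $\mathcal{F}^{x}$, with $|\mathcal{F}^{x}| \ll |\mathcal{F}^{\mathrm{global}}|$, that approximates the Bayes optimal (w.r.t. $\mathcal{F}^{\mathrm{global}}$) for the local classification problem defined by $\mathcal{D}^{x,r}$. That is, for a given $\varepsilon_X > 0$, we have

$$\min_{f \in \mathcal{F}^{x}} \mathbb{E}_{\mathcal{D}^{x,r}}\big[\ell(f(X), Y)\big] \leq \min_{f \in \mathcal{F}^{\mathrm{global}}} \mathbb{E}_{\mathcal{D}^{x,r}}\big[\ell(f(X), Y)\big] + \varepsilon_X, \qquad \forall\, x \in \mathcal{X}. \tag{6}$$

As an example, if $\mathcal{F}^{\mathrm{global}}$ is linear in $\mathbb{R}^d$ (possibly dense) with bounded norm $\tau$, then $\mathcal{F}^{x}$ can be a simpler function class such as linear in $\mathbb{R}^d$ with sparsity $k \ll d$ and with bounded norm $\tau_x \leq \tau$.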

2.3. RETRIEVAL-BASED CLASSIFICATION MODEL

This work focuses on retrieval-based methods that can leverage the aforementioned local regularity structure of the data distribution. In particular, we focus on two such approaches.

Local empirical risk minimization. Given a (test) instance $x$, the local empirical risk minimization (ERM) approach first retrieves a neighboring set $\mathcal{R}^{x} = \{(x_j, y_j)\} \subseteq S$. Subsequently, it identifies a (local) scorer $\hat f^{x}$ from a 'simple' function class $\mathcal{F}^{\mathrm{loc}} \subset \{f : \mathcal{X} \to \mathbb{R}^{|\mathcal{Y}|}\}$ as follows:

$$\hat f^{x} = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} \hat R^{x}_\ell(f), \qquad \hat R^{x}_\ell(f) := \frac{1}{|\mathcal{R}^{x}|} \sum_{(x', y') \in \mathcal{R}^{x}} \ell\big(f(x'), y'\big). \tag{7}$$

Here, $\mathcal{R}^{x}$ corresponds to the samples in $S$ that belong to $\mathcal{B}^{x,r}$; hence, it follows the distribution $\mathcal{D}^{x,r}$. We assume there exists $N(r, \delta)$ such that for any $r \geq 0$ and $\delta > 0$,

$$P_{(X,Y)\sim\mathcal{D}}\big[\,|\mathcal{R}^{X}| < N(r, \delta)\,\big] \leq \delta, \qquad \text{and} \qquad P_{(X,Y)\sim\mathcal{D}}\big[\,|\mathcal{R}^{X}| = 0\,\big] = 0. \tag{8}$$

Note that the local ERM approach requires solving a local learning task for each test instance. Such a local learning algorithm was introduced in Bottou & Vapnik (1992). Another point worth mentioning here is that (7) employs the same function class $\mathcal{F}^{\mathrm{loc}}$ for each $x$, whereas the local regularity assumption (cf. (6)) allows for an instance-dependent function class $\mathcal{F}^{x}$. We consider $\mathcal{F}^{\mathrm{loc}}$ that approximates $\cup_{x \in \mathcal{X}} \mathcal{F}^{x}$ closely. In particular, we assume that, for some $\varepsilon_{\mathrm{loc}} > 0$, we have

$$\min_{f \in \mathcal{F}^{\mathrm{loc}}} \mathbb{E}_{\mathcal{D}^{x,r}}\big[\ell(f(X), Y)\big] \leq \min_{f \in \mathcal{F}^{x}} \mathbb{E}_{\mathcal{D}^{x,r}}\big[\ell(f(X), Y)\big] + \varepsilon_{\mathrm{loc}}, \qquad \forall\, x \in \mathcal{X}. \tag{9}$$

Continuing with the example following (6), where $\mathcal{F}^{x}$ is linear with sparsity $k \ll d$ and bounded norm $\tau_x$, one can take $\mathcal{F}^{\mathrm{loc}}$ to be linear with the same sparsity $k$ and bounded norm $\tau' < \sup_{x \in \mathcal{X}} \tau_x$.

Classification with extended feature space. Another approach to leverage the retrieved neighboring labeled instances during classification is to directly learn a scorer that maps $(x, \mathcal{R}^{x}) \in \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})$ to per-class scores.
One can learn such a scorer over the extended feature space $\mathcal{X} \times (\mathcal{X} \times \mathcal{Y})$ as follows:

$$\hat f^{\mathrm{ex}} = \arg\min_{f \in \mathcal{F}^{\mathrm{ex}}} \hat R^{\mathrm{ex}}_\ell(f), \qquad \hat R^{\mathrm{ex}}_\ell(f) := \frac{1}{n} \sum_{i \in [n]} \ell\big(f(x_i, \mathcal{R}^{x_i}), y_i\big), \tag{10}$$

where $\mathcal{F}^{\mathrm{ex}} \subset \{f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}^{|\mathcal{Y}|}\}$ denotes a function class over the extended space. Unlike the local ERM approach, (10) learns a common function over the extended space and does not require solving an optimization problem for each test instance. That said, since $\mathcal{F}^{\mathrm{ex}}$ operates on the extended feature space, it can be significantly more complex and computationally expensive to employ than $\mathcal{F}^{\mathrm{loc}}$.

Our goal is to develop a theoretical understanding of the generalization behavior of these two retrieval-based methods for classification with locally regular data distributions. We present our theoretical treatment of local ERM and classification with extended feature space in Sec. 3 and 4, respectively.
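The local ERM procedure of this section (retrieve the examples in the $r$-ball, fit a simple local model on them, classify the test point) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: inputs are one-dimensional, the local class is a binary linear scorer centered at the test point, and a fixed number of gradient steps on the logistic loss stands in for exact ERM. The names `retrieve`, `local_erm_predict`, `label`, and the toy `train` set are hypothetical.

```python
import math

def retrieve(x, train, r):
    """Retrieved set R^x: training pairs whose input lies in the r-ball around x."""
    return [(xi, yi) for xi, yi in train if abs(xi - x) <= r]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_erm_predict(x, train, r, steps=500, lr=1.0):
    """Fit the local scorer w * (x' - x) + b on R^x and classify x by the sign of b.

    Gradient descent on the logistic loss is used as an approximate ERM solver.
    """
    neighbors = retrieve(x, train, r)
    if not neighbors:
        raise ValueError("empty retrieved set; increase r")
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xi, yi in neighbors:
            err = sigmoid(w * (xi - x) + b) - yi
            grad_w += err * (xi - x)
            grad_b += err
        w -= lr * grad_w / len(neighbors)
        b -= lr * grad_b / len(neighbors)
    # At the test point the centered feature is 0, so the score is just b.
    return 1 if b > 0 else 0

def label(x):
    """Toy labels: each cluster has its own threshold, so no single global threshold fits."""
    if x < 1.5:
        return 1 if x > 0.5 else 0
    return 1 if x < 2.5 else 0

# Two clusters of training inputs, on [0, 1) and [2, 3).
inputs = [i * 0.05 for i in range(20)] + [2.0 + i * 0.05 for i in range(20)]
train = [(xi, label(xi)) for xi in inputs]
```

In this toy set a single global linear threshold cannot fit both clusters, while each small neighborhood is explained by a trivial local model, which is the kind of local regularity described in Sec. 2.2.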

3. LOCAL EMPIRICAL RISK MINIMIZATION

Before presenting an excess risk bound for the local ERM method, we introduce various necessary definitions and assumptions that play a critical role in our analysis. We say that a scorer $f$ is $L$-coordinate Lipschitz iff for all $y \in \mathcal{Y}$ and $x, x' \in \mathcal{X}$, we have

$$|f_y(x) - f_y(x')| \leq L \,\|x - x'\|_2.$$

In this section, we restrict ourselves to loss functions that act on the margin of a scorer (cf. (1)), i.e., for any given example $(x, y)$ and any scorer $f$, we have $\ell(f(x), y) = \ell(\gamma_f(x, y))$. In addition, we assume that, naturally, $\ell$ is a decreasing function of the margin. Furthermore, we assume that $\ell$ is an $L_\ell$-Lipschitz function, i.e., $|\ell(\gamma) - \ell(\gamma')| \leq L_\ell\, |\gamma - \gamma'|$, $\forall\, \gamma \geq \gamma'$. Note that the local ERM selects a scorer from $\mathcal{F}^{\mathrm{loc}}$. At $x \in \mathcal{X}$, let $f^{x,*}$ denote the minimizer of the population version of the local loss, and $f^{*}$ the population risk minimizer for the global loss, i.e.,

$$f^{x,*} = \arg\min_{f \in \mathcal{F}^{\mathrm{loc}}} \mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big] \quad \text{and} \quad f^{*} = \arg\min_{f \in \mathcal{F}^{\mathrm{global}}} \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell(f(X), Y)\big]. \tag{11}$$

Given a distribution $\mathcal{D}$, we define the weak margin condition (Döring et al., 2018) for a scorer $f$ as follows.

Definition 3.1. A scorer $f$ satisfies the $(\alpha, c)$-weak margin condition iff, for all $t \geq 0$,

$$P_{(X,Y)\sim\mathcal{D}}\big(|\gamma_f(X, Y)| \leq t\big) \leq c\, t^{\alpha}.$$

One of the key assumptions that we rely on is the existence of an underlying scorer $f_{\mathrm{true}}$ that explains the true labels, while ensuring the weak margin condition (cf. Definition 3.1). Here, we note that the true function $f_{\mathrm{true}}$ may lie neither in the function class $\mathcal{F}^{\mathrm{global}}$ nor in $\mathcal{F}^{\mathrm{loc}}$.

Assumption 3.2 (True scorer function). There exists a scorer $f_{\mathrm{true}}$ such that, for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $f_{\mathrm{true}}$ generates the true label, i.e., $\gamma_{f_{\mathrm{true}}}(x, y) > 0$, and $|\mathcal{R}^{X}| \perp_{\mathcal{D}} \gamma_{f_{\mathrm{true}}}(X, Y)$. Furthermore, we assume $f_{\mathrm{true}}$ is $L_{\mathrm{true}}$-coordinate Lipschitz, and satisfies the $(\alpha_{\mathrm{true}}, c_{\mathrm{true}})$-weak margin condition.
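Definition 3.1 can be probed empirically on a finite set of margins. The helper below is a hypothetical sketch, not part of the paper: it compares the empirical fraction of margins with $|\gamma| \leq t$ against $c\,t^{\alpha}$ at a handful of thresholds. Passing this check on a sample at a few values of $t$ is only a sanity check and does not establish the population-level condition for all $t \geq 0$.

```python
def weak_margin_violations(margins, alpha, c, thresholds):
    """Return the thresholds t at which the empirical P(|gamma| <= t) exceeds c * t**alpha."""
    n = len(margins)
    violated = []
    for t in thresholds:
        fraction = sum(1 for g in margins if abs(g) <= t) / n
        if fraction > c * t ** alpha:
            violated.append(t)
    return violated

# Illustrative margins spread evenly over (0, 1]: the fraction below t is about t.
margins = [(i + 1) / 100 for i in range(100)]
```

With these evenly spread margins, `weak_margin_violations(margins, 1.0, 1.1, [0.1, 0.25, 0.5])` reports no violation, whereas the stricter pair `alpha=2.0, c=1.0` is violated at every listed threshold, since the mass near zero margin grows linearly rather than quadratically in `t`.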

3.1. EXCESS RISK BOUND FOR LOCAL ERM

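Now that we have introduced the required background and assumptions, we present our results on characterizing the generalization behavior of local ERM. In particular, we aim to bound

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell\big(\hat f^{X}(X), Y\big) - \ell\big(f^{*}(X), Y\big)\Big]. \tag{12}$$

Note that in the above expression $\hat f^{X}$ (cf. (7)) is a function of $\mathcal{R}^{X}$, and the expectation over $\mathcal{R}^{X}$ is taken implicitly. Towards this, we first obtain the following upper bound on (12).

Lemma 3.3. The expected excess risk of the local ERM solution $\hat f^{X}$ is bounded as

$$
\begin{aligned}
\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell\big(\hat f^{X}(X), Y\big) - \ell\big(f^{*}(X), Y\big)\Big]
\leq\; & \underbrace{\mathbb{E}_{(X,Y)\sim\mathcal{D}}\, \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\Big[\ell\big(f^{X,*}(X'), Y'\big) - \ell\big(f^{*}(X'), Y'\big)\Big]}_{\text{Local vs Global Optimal Risk}} \\
& + \underbrace{\sum_{\mathcal{F} \in \{\mathcal{F}^{\mathrm{global}}, \mathcal{F}^{\mathrm{loc}}\}} \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f \in \mathcal{F}} \Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big] - \ell\big(f(X), Y\big)\Big|}_{\text{Global and Local: Sample vs Retrieved Set Risk}} \\
& + \underbrace{\mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f \in \mathcal{F}^{\mathrm{loc}}} \Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big] - \frac{1}{|\mathcal{R}^{X}|} \sum_{(x',y') \in \mathcal{R}^{X}} \ell\big(f(x'), y'\big)\Big|}_{\text{Generalization of Local ERM}} \\
& + \underbrace{\mathbb{E}_{(X,Y)\sim\mathcal{D}} \Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y')\big] - \frac{1}{|\mathcal{R}^{X}|} \sum_{(x',y') \in \mathcal{R}^{X}} \ell\big(f^{X,*}(x'), y'\big)\Big|}_{\text{Central Absolute Moment of } f^{X,*}}.
\end{aligned}
$$

We defer the proof of Lem. 3.3 to Appendix B. Now, as a strategy to obtain the desired excess risk bounds, we separately bound the four terms appearing in Lem. 3.3. Note that the first term captures the expected difference between the loss incurred by the global population optimum $f^{*} \in \mathcal{F}^{\mathrm{global}}$ and the local population optimum $f^{x,*} \in \mathcal{F}^{\mathrm{loc}}$ in a local region around the test instance $x$. The second term captures the loss of a scorer evaluated at $x$ vs. the expected value of the loss of the scorer at a random instance sampled in the local region of $x$ according to $\mathcal{D}^{x,r}$. The third term corresponds to the standard 'generalization error' for the local ERM with respect to the local data distribution $\mathcal{D}^{X,r}$, whereas the fourth term is the empirical variation of the true local function $f^{X,*}$ around its true mean under $\mathcal{D}^{X,r}$. Let the coordinate-Lipschitz constants for scorers in $\mathcal{F}^{\mathrm{loc}}$ and $\mathcal{F}^{\mathrm{global}}$ be $L_{\mathrm{loc}}$ and $L_{\mathrm{global}}$, respectively.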
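We define a function class $\mathcal{G}(X, Y) = \{(x', y') \mapsto \ell(\gamma_f(x', y')) - \ell(\gamma_f(X, Y)) : f \in \mathcal{F}^{\mathrm{loc}}\}$. Here, by subtracting $\ell(f(X), Y)$ from the loss, we center the losses on $\mathcal{R}^{X}$ for any function $f \in \mathcal{F}^{\mathrm{loc}}$, and obtain a tighter bound by utilizing the local nature of the distribution $\mathcal{D}^{X,r}$. For any $L > 0$, for notational convenience let us define

$$M_r(L; \ell, f_{\mathrm{true}}, \mathcal{F}) = 2 L_\ell \Big( L r + \big(\max\{L r,\, 2\|\mathcal{F}\|_\infty\} - L r\big)\, c_{\mathrm{true}} \big(2 L_{\mathrm{true}}\, r\big)^{\alpha_{\mathrm{true}}} \Big). \tag{13}$$

Now, by controlling the different terms appearing in the bound in Lem. 3.3, we obtain the following.

Theorem 3.4. For any $\delta > 0$, the expected excess risk of the local ERM solution $\hat f^{X}$ is bounded as

$$
\begin{aligned}
\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell\big(\hat f^{X}(X), Y\big) - \ell\big(f^{*}(X), Y\big)\Big]
\leq\; & \underbrace{(\varepsilon_X + \varepsilon_{\mathrm{loc}})}_{\text{Local vs Global Optimal loss (I)}}
+ \underbrace{M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}) + M_r(L_{\mathrm{global}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{global}})}_{\text{Global and Local: Sample vs Retrieved Set Risk (II)}} \\
& + \underbrace{2\, \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] + 5 M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}}) \sqrt{\frac{2 \ln(4/\delta)}{N(r, \delta)}} + 4 \delta L_\ell \|\mathcal{F}^{\mathrm{loc}}\|_\infty \Big(2 + \sqrt{2 \ln(4/\delta)}\Big)}_{\text{Generalization of Local ERM (III)}},
\end{aligned}
$$

where $\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))$ is the empirical Rademacher complexity of $\mathcal{G}(X, Y)$.

Before discussing the implications of the aforementioned excess risk bound, we instantiate $\mathcal{F}^{\mathrm{loc}}$ with a few common function classes from the literature (see Appendix B for the detailed proof of Thm. 3.4 and for descriptions of these specific instances).

Kernel-based classifiers. When $f_y(\cdot)$ belongs to a bounded RKHS with $\ell_\infty$-norm bound $B$ (Zhang, 2004), for some universal constant $C > 0$ and any $\delta > 0$,

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] \leq C\, \frac{|\mathcal{Y}|\, L_\ell B \ln(n + 1)^{3/2}}{\sqrt{N(r, \delta)}} + 2 \delta B.$$

Similarly, when $f_y(\cdot)$ belongs to a bounded RKHS with $\ell_2$-norm bound $B$ (Lei et al., 2019), for some universal constant $C > 0$ and any $\delta > 0$,

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] \leq C\, \frac{L_\ell B \ln(n |\mathcal{Y}|)^{3/2}}{\sqrt{N(r, \delta)}} + 2 \delta B.$$

Feed-forward classifiers. Assume that $f_y(\cdot)$ is an $L$-layer feed-forward network with 1-Lipschitz non-linearities (Bartlett et al., 2017). Let, for layers $l = 1$ to $L$, the dimension of the weight matrix be $(d_l \times d_{l-1})$ with $d_L = |\mathcal{Y}|$.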
Also, let $b_l$ and $s_l$ be the $\ell_{2,1}$-norm and spectral-norm upper bounds for the layer-$l$ weight matrix, respectively, with $b_l / s_l \leq \kappa$. We define $d_{\max} = \max_{l \in [L]} d_l$ and let $B = \max_{x \in \mathcal{X}} \|x\|_2 \prod_{l=1}^{L} s_l$. Then, for some universal constant $C > 0$ and any $\delta > 0$,

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] \leq C\, \frac{L_\ell B \sqrt{\kappa}\, \ln(d_{\max})\, L^{3/4} \ln\big(L_\ell B \sqrt{n}\big)^{3/2}}{\sqrt{N(r, \delta)}} + 2 \delta B.$$

Implications of the excess risk bound. Our main result for local ERM highlights the trade-off between approximation and generalization as the retrieval radius $r$ varies. To elaborate, note that the approximation error comprises two components, given by (I) and (II) in Thm. 3.4. $\varepsilon_X$ is the gap in approximating the $r$-radius neighborhood around $X$ with a simple local function class $\mathcal{F}^{X}$, which varies with $X \in \mathcal{X}$. $\varepsilon_{\mathrm{loc}}$ is the gap in approximating the union of the local function classes $\cup_{x \in \mathcal{X}} \mathcal{F}^{x}$ with a single function class $\mathcal{F}^{\mathrm{loc}}$ (possibly with smaller complexity), while allowing a different optimizer $\hat f^{X} \in \mathcal{F}^{\mathrm{loc}}$ for each $X \in \mathcal{X}$. As $r$ increases, both $\varepsilon_X$ and $\varepsilon_{\mathrm{loc}}$ typically increase. For example, when approximating a polynomial function locally with a linear function, $\varepsilon_X$ increases as the radius increases. Thus, (I) increases with $r$. The second component of the approximation error, (II), corresponds to the difference in risk between the sample $X$ and the retrieved set $\mathcal{R}^{X}$ for $\mathcal{F}^{\mathrm{global}}$ and $\mathcal{F}^{\mathrm{loc}}$, i.e., $M_r(L_{\mathrm{global}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{global}})$ and $M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}})$. As we increase $r$, Eq. (13) suggests that these terms increase as $O(\mathrm{poly}(r))$. On the other hand, the generalization error (III) depends on the size of the retrieved set $\mathcal{R}^{X}$ and the Rademacher complexity of $\mathcal{G}(X, Y)$, which is induced by $\mathcal{F}^{\mathrm{loc}}$. With increasing radius $r$, the term $N(r, \delta)$ increases. The Rademacher complexity decays with increasing radius $r$, typically at the rate of $O(1/\sqrt{N(r, \delta)})$. Thus, under the local ERM setting, the total approximation error increases with increasing radius $r$ for a fixed $\mathcal{F}^{\mathrm{loc}}$.
In contrast, the generalization error decreases with increasing radius $r$ for a fixed $\mathcal{F}^{\mathrm{loc}}$. This suggests a trade-off between the approximation and generalization errors as we make a design choice about $r$. (We empirically validate this in Figure 2.) It is also worth comparing local ERM with conventional (non-local) ERM. Under the local-regularity condition (Sec. 2.2), one would utilize a simple $\mathcal{F}^{\mathrm{loc}}$ for local ERM, which corresponds to the Rademacher complexity term in Theorem 3.4 being small. In contrast, the generalization bound for the traditional (non-local) ERM approach depends on the Rademacher complexity of a function class $\mathcal{F}^{\mathrm{global}}$ that can achieve a low approximation error on the entire domain. Such a function class (even under the regularity assumption) would be much more complex than $\mathcal{F}^{\mathrm{loc}}$, resulting in a large Rademacher complexity. For the right design choice of $r$ and $\mathcal{F}^{\mathrm{loc}}$, the increase in approximation error of local ERM can be offset by the large generalization error of $\mathcal{F}^{\mathrm{global}}$. As a consequence, local ERM with a simple function class $\mathcal{F}^{\mathrm{loc}}$ can outperform (non-local) ERM with a complex class $\mathcal{F}^{\mathrm{global}}$.

3.2. ENDOWING LOCAL ERM WITH GLOBAL REPRESENTATIONS

Note that the local ERM method takes a somewhat myopic view and does not aim to learn a global hypothesis that (partially or entirely) explains the entire data distribution. Such an approach may result in poor performance in those regions of the input domain that are not well represented in the training set. Here, we explore a two-stage learning approach so as to leverage the global pattern present in the training data and address this apparent shortcoming of local ERM. Given the training data $S$ and a simple function class $\mathcal{G}^{\mathrm{loc}}$ of maps $\mathbb{R}^d \to \mathbb{R}^{|\mathcal{Y}|}$, the first stage involves learning a $d$-dimensional feature map $\Phi_S : \mathcal{X} \to \mathbb{R}^d$ that ensures a good representation for the entire data distribution (Radford et al., 2021; Grill et al., 2020; Cer et al., 2018; Reimers & Gurevych, 2019). Subsequently, given a test instance $x$ and its retrieved neighboring points $\mathcal{R}^{x} = \{(x_j, y_j)\} \subseteq S$, one employs local ERM with the function class

$$\mathcal{F}_{\Phi_S} = \{x \mapsto g \circ \Phi_S(x) : g \in \mathcal{G}^{\mathrm{loc}}\}. \tag{14}$$

At this point, it is tempting to invoke the proof strategy outlined following Lem. 3.3, with $\mathcal{F}^{\mathrm{loc}}$ replaced by $\mathcal{F}_{\Phi_S}$, to characterize the performance of the aforementioned two-stage method. Note that one can indeed bound the first two terms appearing in Lem. 3.3 for the two-stage method as well. However, bounding the third term, which corresponds to the generalization gap for local ERM, becomes challenging as $\mathcal{F}_{\Phi_S}$ depends on $S$ via the global representation $\Phi_S$ learned in the first stage. Interestingly, Foster et al. (2019) explored a general framework to address such dependence for standard (non-retrieval-based) learning. In fact, as an instantiation of their general framework, Foster et al. (2019, Sec. 5.4) consider ERM in the feature space defined by a representation. We employ their techniques to obtain the following result on the generalization gap for local ERM with $\mathcal{F}_{\Phi_S}$.

Proposition 3.5.
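Assume that the representation learned during the first stage is $\Delta$-sensitive, i.e., for $S$ and $S'$ that differ in a single example, we have $\|\Phi_S(x) - \Phi_{S'}(x)\| \leq \Delta$, $\forall\, x \in \mathcal{X}$. Furthermore, we assume that each $g \in \mathcal{G}^{\mathrm{loc}}$ (cf. (14)) is $L$-Lipschitz, and that the loss $\ell : \mathbb{R}^{|\mathcal{Y}|} \times \mathcal{Y} \to \mathbb{R}$ is $L_{\ell,1}$-Lipschitz w.r.t. the $\|\cdot\|_\infty$-norm in the first argument and is bounded by $M$. Then, the following holds with probability at least $1 - \delta$:

$$
\sup_{f \in \mathcal{F}_{\Phi_S}} \Big\{ \mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big] - \hat R^{x}_\ell(f) \Big\}
\leq \big(M + 2 \Delta L L_{\ell,1} |\mathcal{R}^{x}|\big) \sqrt{\frac{\log(1/\delta)}{2 |\mathcal{R}^{x}|}}
+ \mathbb{E}_{\mathcal{R}^{x}\sim\mathcal{D}^{x,r}} \Big[ \sup_{f \in \mathcal{F}_{\Phi_S}} \Big\{ \mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big] - \hat R^{x}_\ell(f) \Big\} \Big].
$$

Furthermore,

$$
\mathbb{E}_{\mathcal{R}^{x}\sim\mathcal{D}^{x,r}} \Big[ \sup_{f \in \mathcal{F}_{\Phi_S}} \Big\{ \mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big] - \hat R^{x}_\ell(f) \Big\} \Big] \leq 2\, \mathfrak{R}(\ell \circ \mathcal{F}_{\Phi_S}),
$$

where $\ell \circ \mathcal{F}_{\Phi_S} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}_{\Phi_S}\}$.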

4. CLASSIFICATION IN EXTENDED FEATURE SPACE

Next, we focus on a family of retrieval-based methods that directly learn a scorer to map an input instance and its neighboring labeled instances to a score vector (cf. (10)). In fact, as discussed in Sec. 1, many successful modern instances of retrieval-based models such as REINA (Wang et al., 2022) and KATE (Liu et al., 2022) belong to this family. In this section, we provide the first rigorous treatment (to the best of our knowledge) of such models. Note that our objective is to learn a function $f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}^{|\mathcal{Y}|}$ (cf. Sec. 2.3). In this work, we restrict ourselves to a sub-family of such retrieval-based methods that first map $\mathcal{R}^{x} \sim \mathcal{D}^{x,r}$ to $\hat{\mathcal{D}}^{x,r}$, an empirical estimate of the local distribution $\mathcal{D}^{x,r}$, which is subsequently utilized to make a prediction for $x$. In particular, the scorers of interest are of the form

$$(x, \mathcal{R}^{x}) \mapsto f(x, \hat{\mathcal{D}}^{x,r}) = \big(f_1(x, \hat{\mathcal{D}}^{x,r}), \ldots, f_{|\mathcal{Y}|}(x, \hat{\mathcal{D}}^{x,r})\big) \in \mathbb{R}^{|\mathcal{Y}|}.$$

Note that the general framework for learning in the extended feature space $\tilde{\mathcal{X}} := \mathcal{X} \times \Delta_{\mathcal{X} \times \mathcal{Y}}$ provides a very rich class of functions. Here, we focus on a specific form of learning methods in $\tilde{\mathcal{X}}$ based on kernel methods, adapting the work on kernel methods for domain generalization (Deshmukh et al., 2019). In particular, we study the generalization of a kernel-based classifier over $\tilde{\mathcal{X}}$ learnt via regularized ERM. Due to space constraints, we present an informal version of our result below. See Appendix D for the precise statement (cf. Thm. D.4), necessary background, and detailed proof.

Theorem 4.1 (Informal). Let $0 \leq \delta \leq 1$ and $N(r, \delta)$ be as defined in (8). Then, under appropriate assumptions, with probability at least $1 - \delta$, we have

$$
\sup_{f \in \mathcal{F}} \big| \hat R^{\mathrm{ex}}_\ell(f) - R^{\mathrm{ex}}_\ell(f) \big| \lesssim C_1\, n^{-\frac{1}{2}} \Big(1 + \log^{\frac{3}{2}}\big(\sqrt{2}\, n |\mathcal{Y}|\big)\Big) + C_2 \sqrt{\frac{\log(n/\delta)}{N(r, \delta/n)}} + C_3 \sqrt{\frac{\log(1/\delta)}{n}},
$$

where $\mathcal{F}$ is the extended feature kernel function class, and $\hat R^{\mathrm{ex}}_\ell(f)$ and $R^{\mathrm{ex}}_\ell(f)$ are the empirical and population risks, respectively.

Interestingly, the bound in Thm. 4.1 implies that the size of the retrieved set $\mathcal{R}^{x}$ (as captured by $N(r, \delta/n)$) has to scale at least logarithmically in the size of the training set $n$ to ensure convergence.
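To give a feel for what a kernel over the extended feature space can look like, the sketch below builds a product kernel in the style commonly used for kernel-based domain generalization: a kernel on the inputs multiplied by a Gaussian kernel on the distance between kernel mean embeddings of the two retrieved sets. This specific form, the RBF base kernels, the label-agreement factor, and all function names are assumptions made for illustration; the precise construction analyzed in the paper is the one given in its Appendix D.

```python
import math

def rbf(u, v, bandwidth=1.0):
    """Gaussian RBF kernel between two equal-length tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def pair_kernel(z1, z2):
    """Base kernel on labeled examples z = (x, y): RBF on inputs times label agreement."""
    (x1, y1), (x2, y2) = z1, z2
    return rbf(x1, x2) * (1.0 if y1 == y2 else 0.0)

def embedding_inner(set_a, set_b):
    """Inner product of the empirical kernel mean embeddings of two retrieved sets."""
    total = sum(pair_kernel(a, b) for a in set_a for b in set_b)
    return total / (len(set_a) * len(set_b))

def extended_kernel(x1, set_1, x2, set_2, sigma=1.0):
    """Product kernel on (input, empirical local distribution) pairs."""
    sq_dist = (embedding_inner(set_1, set_1)
               - 2.0 * embedding_inner(set_1, set_2)
               + embedding_inner(set_2, set_2))
    # Clamp tiny negative values caused by floating-point rounding.
    return rbf(x1, x2) * math.exp(-max(sq_dist, 0.0) / (2.0 * sigma ** 2))

# Two hypothetical retrieved sets of labeled 2-d points.
retrieved_a = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.2), 1)]
retrieved_b = [((1.0, 1.0), 1), ((1.1, 0.9), 1)]
```

A kernel of this kind compares two test situations both by how close the inputs are and by how similar their labeled neighborhoods are, so a single regularized ERM over it can share information across instances without solving a separate problem per test point.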

5. EXPERIMENTS

There have been numerous successful practical applications of retrieval-based models in the literature (e.g., Wang et al., 2022; Das et al., 2021). Here, we present a brief empirical study of such models in order to corroborate the benefits predicted by our theoretical results.

Task and dataset. We perform experiments on both synthetic and real datasets, as summarized below. Further details are relegated to Appendix E. (i) Synthetic. We consider a task of binary classification on a Gaussian mixture. Each mixture component is endowed with its own local linear decision boundary. We randomly generate a training set of n = 10000 points in a 10-dimensional space. We use Euclidean distance for retrieval and perform 10-fold cross-validation. (ii) CIFAR-10. Next, we consider a task of binary classification on real data for object recognition. In particular, we consider the subset of the CIFAR-10 dataset restricted to images from the "Cat" and "Dog" classes. We randomly partition the data into a training set of n = 10000 points and the remaining 2000 points for testing. We use Euclidean distance for retrieval and perform 10-fold cross-validation. (iii) ImageNet. Finally, we consider the 1000-way classification task on the ImageNet dataset. We use the standard train-test split with n = 1281167 training and 50000 test examples. Following standard practice in the literature, we use unsupervised but globally learned features from ALIGN (Jia et al., 2021) to do image retrieval. This also showcases the benefits of endowing local ERM with a global representation (Sec. 3.2). Given the large computational cost, we could only run each experiment once in this setting.

Methods. On all datasets, as baselines, we consider a simple linear classifier and a multi-layer perceptron (MLP) with two layers. For retrieval-based models, we consider each of the above methods as the local model to fit on the retrieved data points via the local ERM framework (Sec. 3).
For the synthetic dataset, we also consider support vector machines with a polynomial kernel (of degree 3) and with a radial basis function (RBF) kernel, both as baselines and for local ERM. For ImageNet, we additionally consider the state-of-the-art (SoTA) single model published for this task, from CVPR 2022 (Zhai et al., 2022), as a baseline. In addition, for ImageNet, we also consider the pretrain-finetune version of local ERM, where we use the retrieved set to fine-tune a MobileNetV3 (Howard et al., 2019) model that has been pretrained on the entire ImageNet dataset.

Observations. In Fig. 2, we observe the trade-off in varying the size of the retrieved set (as dictated by the neighborhood radius) on the performance of retrieval-based methods across all settings. When the number of retrieved samples is small, local ERM has lower accuracy due to a large generalization error. When the retrieved set is large, local ERM fails to minimize the loss effectively due to the lack of model capacity. This effect is more pronounced for simpler function classes such as the linear classifier than for the MLP. In Fig. 2c, we see that, via local ERM with a small MobileNetV3 model, we are able to achieve a top-1 accuracy of 82.78, whereas a regularly trained MobileNetV3 model achieves a top-1 accuracy of only 65.80. The result is also competitive with the SoTA of 90.45, which is obtained with a much larger model. Thus, our empirical evaluation demonstrates the utility of retrieval-based models via the simple local ERM framework. In particular, it allows small models to attain very high performance.

6. RELATED WORK AND DISCUSSION

Local polynomial regression. Perhaps the problem most similar to our setup is studied in the rich body of work on local polynomial regression, which dates back to the pioneering works of Stone (1977; 1980). This line of work aims to fit a low-degree polynomial at each point in the data set based on a subset of data points. Such approaches gained a lot of attention as parametric regression was not adequate in various practical applications of the time. The performance of this approach critically depends on the subset selected to locally fit the data. Towards this, various selection approaches have been considered: fixed bandwidth (Katkovnik & Kheisin, 1979), nearest neighbors (Cleveland, 1979), kernel weighted (Ruppert & Wand, 1994), and adaptive methods (Ruppert et al., 1995). So far, the analysis of local polynomial regression has been mainly restricted to classical techniques like minimax estimation, on which the literature is vast for various settings. The first results on asymptotic minimax risks were established by Pinsker (1980).

Multi-task and meta learning. At a surface level, our setup might resemble multi-task and meta-learning frameworks. In multi-task learning, we are given examples from T tasks/distributions and the objective is to ensure good classification performance on all the tasks. In meta-learning, the setting is made harder by requiring good performance on a new target task. A common approach in these settings is to learn a shared representation across the tasks and then learn a simple task-specific mapping on top of these learned shared features (Vilalta & Drissi, 2002, inter alia). While there is a vast literature on multi-task and meta-learning methods, the number of theoretical investigations is quite limited.
There are a few works studying upper bounds on the generalization error in multi-task environments (Amit & Meir, 2017; Ben-David & Borbely, 2008; Ben-David et al., 2010; Pentina & Lampert, 2014), and even fewer in the case of meta-learning (Balcan et al., 2019; Khodak et al., 2019; Tripuraneni et al., 2021; Du et al., 2020). However, most of these works assume linear or other very simple model classes, whereas we consider general function classes using kernel methods. Moreover, recall that our assumption on the underlying data distribution (Sec. 2.2) implies that it can be approximated by a mixture of tasks. However, by design, most of these tasks have very little overlap in the instance space. Additionally, the number of tasks can be very large in our case. Finally, it is not a priori clear which task a particular example belongs to. Thus, it is not straightforward to employ the aforementioned representation-based approaches for multi-task or meta-learning in our setting. Interestingly, in this work, we show that retrieval-based approaches alleviate the need to identify the task membership. By relying on retrieved neighboring instances, it is possible to obtain performance guarantees on their data domain that are attuned to the local structure of the problem (cf. Sec. 3).

Conclusion and future direction.

In this work, we initiate the development of a theoretical framework to study the generalization behavior of modern retrieval-based machine learning models. Our treatment of an explicit local learning paradigm, namely local-ERM, establishes an approximation vs. generalization error trade-off. This highlights the advantage realized by access to a retrieved set during classification, as it enables good performance with much simpler (local) function classes. As for the retrieval-based models that leverage a retrieved set without explicitly performing local learning, we present a systematic study by considering a kernel-based classifier over an extended feature space. Studying end-to-end retrieval-based models beyond kernel-based classification is a natural and fruitful direction for future work. It is also worth exploring whether existing retrieval-based end-to-end models inherently perform implicit local learning via architectures such as Transformers.

A PRELIMINARIES

Definition A.1 (Rademacher complexity). Given a sample $S = \{z_i = (x_i, y_i)\}_{i \in [n]} \subset \mathcal{Z}$ and a real-valued function class $\mathcal{F} : \mathcal{Z} \to \mathbb{R}$, the empirical Rademacher complexity of $\mathcal{F}$ with respect to $S$ is defined as

$$\mathfrak{R}_S(\mathcal{F}) = \frac{1}{n}\, \mathbb{E}_{\sigma}\Big[\sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \sigma_i f(z_i)\Big],$$

where $\sigma = \{\sigma_i\}_{i \in [n]}$ is a collection of $n$ i.i.d. Bernoulli random variables taking the values $+1$ and $-1$ with equal probability. For $n \in \mathbb{N}$, the Rademacher complexity $\mathfrak{R}_n(\mathcal{F})$ and the worst case Rademacher complexity $\overline{\mathfrak{R}}_n(\mathcal{F})$ are defined as follows.

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_{S \sim \mathcal{D}^n}\big[\mathfrak{R}_S(\mathcal{F})\big], \quad \text{and} \quad \overline{\mathfrak{R}}_n(\mathcal{F}) = \sup_{S \in \mathcal{Z}^n} \mathfrak{R}_S(\mathcal{F}).$$

Definition A.2 (Covering number). Let $\epsilon > 0$ and $\|\cdot\|$ be a norm defined over $\mathbb{R}^n$. Given a function class $\mathcal{F} : \mathcal{Z} \to \mathbb{R}$ and a collection of points $S = \{z_i\}_{i \in [n]} \subset \mathcal{Z}$, we call a set of points $\{u_j\}_{j \in [m]} \subset \mathbb{R}^n$ an $(\epsilon, \|\cdot\|)$-cover of $\mathcal{F}$ with respect to $S$ if we have

$$\sup_{f \in \mathcal{F}} \min_{j \in [m]} \|f(S) - u_j\| \le \epsilon,$$

where $f(S) = \big(f(z_1), \ldots, f(z_n)\big) \in \mathbb{R}^n$. The $\|\cdot\|$-covering number $N_{\|\cdot\|}(\epsilon, \mathcal{F}, S)$ denotes the cardinality of the minimal $(\epsilon, \|\cdot\|)$-cover of $\mathcal{F}$ with respect to $S$.
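In particular, if $\|\cdot\|$ is a normalized $\ell_p$ norm ($\|v\| = \big(\frac{1}{\dim(v)} \sum_{i=1}^{\dim(v)} |v_i|^p\big)^{1/p}$),

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell(\hat f^{X}(X), Y) - \ell(f^{*}(X), Y)\big]$$

// We add and subtract loss of the local optimizer $f^{X,*}(\cdot)$ expected over $\mathcal{D}^{X,r}$

$$= \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell(\hat f^{X}(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') + \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X), Y)\Big]$$

// We add and subtract loss of the global optimizer $f^{*}(\cdot)$ expected over $\mathcal{D}^{X,r}$

$$\begin{aligned}
= \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[&\ell(\hat f^{X}(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') + \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{*}(X'), Y') - \ell(f^{*}(X), Y) \\
&+ \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{*}(X'), Y')\Big]
\end{aligned}$$

// We group (1) local vs global optimizer, (2) global optimizer at $X$ vs expected over $\mathcal{D}^{X,r}$,
// and (3) ERM loss at $X$ vs local optimizer loss expected over $\mathcal{D}^{X,r}$

$$\begin{aligned}
= {}& \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X'), Y')\big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{*}(X'), Y') - \ell(f^{*}(X), Y)\Big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell(\hat f^{X}(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y')\Big]
\end{aligned}$$

// We add and subtract loss of the empirical optimizer $\hat f^{X}(\cdot)$ expected over $\mathcal{D}^{X,r}$

$$\begin{aligned}
= {}& \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X'), Y')\big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{*}(X'), Y') - \ell(f^{*}(X), Y)\Big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\ell(\hat f^{X}(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(\hat f^{X}(X'), Y')\big] + \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(\hat f^{X}(X'), Y')\big] - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y')\Big]
\end{aligned}$$

// We (1) bound difference of loss at $X$ and loss expected over $\mathcal{D}^{X,r}$ by maximizing over function class,
// and (2) subtract empirical loss of empirical optimizer and add (larger) empirical loss of local optimizer

$$\begin{aligned}
\le {}& \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X'), Y')\big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{global}}} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y') - \ell(f(X), Y)\Big) \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{loc}}} \Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big]\Big| \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(\hat f^{X}(X'), Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(\hat f^{X}(x'), y')\Big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f^{X,*}(x'), y') - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y')\Big]
\end{aligned}$$

// We (1) bound difference of empirical vs expected loss of empirical optimizer by maximizing over function class,

$$\begin{aligned}
\le {}& \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X'), Y')\big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{global}}} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y') - \ell(f(X), Y)\Big) \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{loc}}} \Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big]\Big| \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{loc}}} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f(x'), y')\Big) \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f^{X,*}(x'), y')\Big|
\end{aligned}$$

B.2 PROOF OF THEOREM 3.4

As discussed in Sec. 3, the proof of Theorem 3.4 requires bounding three terms in Lemma 3.3. We now proceed to establish the desired bounds.

Local vs global loss. The local vs global loss can be bounded easily using the local regularity condition and the fact that $\mathcal{F}^{\mathrm{loc}} \approx \cup_{x} \mathcal{F}^{x}$. Let $f^{X,\mathrm{loc}} = \arg\min_{f\in\mathcal{F}^{X}} \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y')$. Then

$$\begin{aligned}
\mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{*}(X'), Y')\big]
\le {}& \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{X,\mathrm{loc}}(X'), Y')\big] \\
&+ \mathbb{E}_{(X,Y)\sim\mathcal{D}}\,\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,\mathrm{loc}}(X'), Y') - \ell(f^{*}(X'), Y')\big] \\
\le {}& \varepsilon_{\mathrm{loc}} + \varepsilon_{X}.
\end{aligned}$$

Global and local: Sample vs retrieved set risk. The following lemma bounds the second term in Lemma 3.3. Recall the definition, for any $L > 0$,

$$M_r(L; \ell, f_{\mathrm{true}}, \mathcal{F}) = 2L_{\ell}\Big(Lr + \big(\max\{Lr, 2\|\mathcal{F}\|_{\infty}\} - Lr\big)\, c_{\mathrm{true}}\,(2L_{\mathrm{true}}\, r)^{\alpha_{\mathrm{true}}}\Big).$$

Lemma B.1. Under Assumption 3.2, for an $L$-coordinate Lipschitz function class $\mathcal{F}$ with $\|\mathcal{F}\|_{\infty} := \sup_{x\in\mathcal{X}} \sup_{f\in\mathcal{F}} \|f(x)\|_{\infty}$, we have

$$\mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}} \Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big]\Big| \le 2L_{\ell}\Big(Lr + \big(\max\{Lr, 2\|\mathcal{F}\|_{\infty}\} - Lr\big)\, c_{\mathrm{true}}\,(2L_{\mathrm{true}}\, r)^{\alpha_{\mathrm{true}}}\Big).$$

Proof. We are given the example $(X, Y)$. Let us fix an arbitrary $f \in \mathcal{F}$ and an arbitrary example $(x', y')$ in the $r$-neighborhood of $X$. We first bound the perturbation in $\gamma_f(\cdot)$ for a given label $\tilde{Y}$.

$$\begin{aligned}
|\gamma_f(X_1, \tilde{Y}) - \gamma_f(X_2, \tilde{Y})|
&\le \Big|f_{\tilde{Y}}(X_1) - \max_{s\neq\tilde{Y}} f_s(X_1) - f_{\tilde{Y}}(X_2) + \max_{s\neq\tilde{Y}} f_s(X_2)\Big| \\
&\le |f_{\tilde{Y}}(X_1) - f_{\tilde{Y}}(X_2)| + \Big|\max_{s\neq\tilde{Y}} f_s(X_1) - \max_{s\neq\tilde{Y}} f_s(X_2)\Big| \\
&\le |f_{\tilde{Y}}(X_1) - f_{\tilde{Y}}(X_2)| + \max_{s\neq\tilde{Y}} |f_s(X_1) - f_s(X_2)| \\
&\le 2L\,\|X_1 - X_2\|_2.
\end{aligned}$$

We can now proceed with bounding the loss.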
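$$|\ell(f(X), Y) - \ell(f(x'), y')| = |\ell(\gamma_f(X, Y)) - \ell(\gamma_f(x', y'))| \le L_{\ell}\,|\gamma_f(X, Y) - \gamma_f(x', y')| \le \begin{cases} 4L_{\ell}\,\|f\|_{\infty}, & Y \neq y', \\ 2L_{\ell}\,L r, & Y = y'. \end{cases}$$

Under Assumption 3.2, if we have $\gamma_{f_{\mathrm{true}}}(X, Y) > 2L_{\mathrm{true}}\, r$, then following the above argument we have $\gamma_{f_{\mathrm{true}}}(X', Y) > 0$, and thus $Y$ is the true label of $X'$. In other words, $\gamma_{f_{\mathrm{true}}}(X, Y) > 2L_{\mathrm{true}}\, r$ implies that, for any $X'$ in the $r$-neighborhood of $X$, its true label satisfies $Y' = Y$. Hence

$$\begin{aligned}
|\ell(f(X), Y) - \ell(f(x'), y')|
&\le 2L_{\ell}\,L r\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) > 2L_{\mathrm{true}}\, r\big) + 2L_{\ell}\max\{Lr, 2\|f\|_{\infty}\}\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big) \\
&\le 2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|f\|_{\infty}\} - Lr\big)\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big).
\end{aligned}$$

As $(x', y')$ was an arbitrary $r$-neighbor, we have

$$\begin{aligned}
\Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y')\Big|
&\le \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\big|\ell(f(X), Y) - \ell(f(X'), Y')\big| \\
&\le 2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|f\|_{\infty}\} - Lr\big)\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big).
\end{aligned}$$

Furthermore, as $f$ was arbitrary, we have

$$\begin{aligned}
\sup_{f\in\mathcal{F}} \Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y')\Big|
&\le \sup_{f\in\mathcal{F}} \Big[2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|f\|_{\infty}\} - Lr\big)\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big)\Big] \\
&= 2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|\mathcal{F}\|_{\infty}\} - Lr\big)\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big).
\end{aligned}$$

Note that $f_{\mathrm{true}}$ is independent of $f$, which was used in the derivation of the above inequalities. Taking expectation over $(X, Y)$ and using the margin condition given in Assumption 3.2, we obtain

$$\begin{aligned}
\mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}} \Big|\ell(f(X), Y) - \mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f(X'), Y')\Big|
&\le 2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|\mathcal{F}\|_{\infty}\} - Lr\big)\,\mathbb{P}_{(X,Y)\sim\mathcal{D}}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big) \\
&\le 2L_{\ell}\,L r + 2L_{\ell}\big(\max\{Lr, 2\|\mathcal{F}\|_{\infty}\} - Lr\big)\, c_{\mathrm{true}}\,(2L_{\mathrm{true}}\, r)^{\alpha_{\mathrm{true}}} \\
&= M_r(L; \ell, f_{\mathrm{true}}, \mathcal{F}).
\end{aligned}$$

Plugging the Lipschitz bounds for the function classes $\mathcal{F}^{\mathrm{loc}}$ and $\mathcal{F}^{\mathrm{global}}$ into the above lemma bounds the second term. In particular, we have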

Generalization of Local ERM. Recall the function class

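$$\mathcal{G}(X, Y) = \big\{\ell(\gamma_f(\cdot, \cdot)) - \ell(\gamma_f(X, Y)) : f \in \mathcal{F}^{\mathrm{loc}}\big\}.$$

Here $\mathcal{G}(X, Y) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. Note that

$$\begin{aligned}
&\mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{loc}}} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f(x'), y')\Big) \\
&= \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{f\in\mathcal{F}^{\mathrm{loc}}} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f(X'), Y') - \ell(f(X), Y)\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \big(\ell(f(x'), y') - \ell(f(X), Y)\big)\Big) \\
&= \mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{g\in\mathcal{G}(X,Y)} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[g(X', Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} g(x', y')\Big).
\end{aligned}$$

We next state a standard result of learning theory that bounds the final term using the Rademacher complexity of the function class $\mathcal{G}(X, Y)$ (Shalev-Shwartz & Ben-David, 2014).

Lemma B.2 (Adapted from Theorem 26.5 in Shalev-Shwartz & Ben-David (2014)). For any $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ and a neighborhood set $\mathcal{R}^{X}$, and any function $g \in \mathcal{G}(X, Y)$, for each $\delta > 0$ the following holds with probability at least $(1 - \delta)$:

$$\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[g(X', Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} g(x', y') \le 2\,\mathfrak{R}_{\mathcal{R}^{X}}\big(\mathcal{G}(X, Y)\big) + 4\,\mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big)\sqrt{\frac{2\ln(4/\delta)}{|\mathcal{R}^{X}|}}.$$

Taking expectation with respect to $(X, Y)$, we obtain

$$\begin{aligned}
&\mathbb{E}_{(X,Y)\sim\mathcal{D}} \sup_{g\in\mathcal{G}(X,Y)} \Big(\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[g(X', Y')\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} g(x', y')\Big) \\
&\le 2\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] + 4\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big)\sqrt{\frac{2\ln(4/\delta)}{|\mathcal{R}^{X}|}}\Big] + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty} \\
&\le 2\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] + 4\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big)\big]\;\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\sqrt{\frac{2\ln(4/\delta)}{|\mathcal{R}^{X}|}}\Big] + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty} \\
&\le 2\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] + 4 M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}})\sqrt{\frac{2\ln(4/\delta)}{N(r, \delta)}} \\
&\qquad + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}\;\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\sqrt{\frac{2\ln(4/\delta)}{|\mathcal{R}^{X}|}}\;\Big|\;|\mathcal{R}^{X}| \le N(r, \delta)\Big] + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty} \\
&\le 2\,\mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))\big] + 4 M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}})\sqrt{\frac{2\ln(4/\delta)}{N(r, \delta)}} + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}\Big(1 + \sqrt{2\ln(4/\delta)}\Big).
\end{aligned}$$

In the first inequality, with probability $(1 - \delta)$ we apply the bound from Lemma B.2, whereas we use the bound $4L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}$ with the remaining probability $\delta$. In the subsequent inequalities, we condition on retrieved sets of size at least $N(r, \delta)$, which happens with probability at least $(1 - \delta)$ by assumption, and on the complementary event, which has probability at most $\delta$, we again use $4L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}$ together with $|\mathcal{R}^{X}| \ge 1$.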
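Also, from the proof of Lemma B.1 we have that

$$\mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big) \le 2L_{\ell}\Big(Lr + \big(\max\{Lr, 2\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}\} - Lr\big)\,\mathbb{1}\big(\gamma_{f_{\mathrm{true}}}(X, Y) \le 2L_{\mathrm{true}}\, r\big)\Big).$$

Taking expectation with respect to $\mathcal{D}$ completes the bound.

Central Absolute Moment of $f^{X,*}$. As the function $f^{X,*}$ is fixed, using centering and then the Hoeffding bound, we can directly bound the remaining term. With probability at least $(1 - \delta)$ we have

$$\begin{aligned}
&\Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f^{X,*}(x'), y')\Big| \\
&= \Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\big[\ell(f^{X,*}(X'), Y') - \ell(f^{X,*}(X), Y)\big] - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \big(\ell(f^{X,*}(x'), y') - \ell(f^{X,*}(X), Y)\big)\Big| \\
&\le \mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big)\sqrt{\frac{\ln(2/\delta)}{|\mathcal{R}^{X}|}}.
\end{aligned}$$

Taking expectation similarly to the previous case, we obtain

$$\begin{aligned}
\mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big|\mathbb{E}_{(X',Y')\sim\mathcal{D}^{X,r}}\,\ell(f^{X,*}(X'), Y') - \frac{1}{|\mathcal{R}^{X}|}\sum_{(x',y')\in\mathcal{R}^{X}} \ell(f^{X,*}(x'), y')\Big|
&\le \mathbb{E}_{(X,Y)\sim\mathcal{D}}\Big[\mathcal{G}_{\max}\big((X, Y); \mathcal{R}^{X}\big)\sqrt{\frac{\ln(2/\delta)}{|\mathcal{R}^{X}|}}\Big] \\
&\le M_r(L_{\mathrm{loc}}; \ell, f_{\mathrm{true}}, \mathcal{F}^{\mathrm{loc}})\sqrt{\frac{\ln(2/\delta)}{N(r, \delta)}} + 4\delta L_{\ell}\,\|\mathcal{F}^{\mathrm{loc}}\|_{\infty}.
\end{aligned}$$

This concludes the proof of Theorem 3.4.

B.3 BOUNDING THE RADEMACHER COMPLEXITY $\mathfrak{R}_{\mathcal{R}^{X}}(\mathcal{G}(X, Y))$

We now derive bounds on the Rademacher complexity of the class $\mathcal{G}(X, Y)$. We use covering number based bounds for that purpose. We start by relating it to the covering number of the $\mathcal{F}^{\mathrm{loc}}$ function class. Finally, we provide a bound for the class of functions residing in a bounded-norm reproducing kernel Hilbert space. We will use $\mathcal{G}_{\max}(X, Y)$ instead of $\mathcal{G}_{\max}((X, Y); \mathcal{R}^{X})$ when the context is clear. Similar to $\mathcal{G}(X, Y)$, we define the function class $\mathcal{G} = \{\ell(\gamma_f(\cdot, \cdot)) : f \in \mathcal{F}^{\mathrm{loc}}\}$, which does not depend on the locality centered around $(X, Y)$. On a set $S \subseteq \mathcal{X} \times \mathcal{Y}$ we can define $\mathcal{G}_{\max}(S) = \sup_{g\in\mathcal{G}} \sup_{(x',y')\in S} |g(x', y')|$.

Lemma B.3. Under Assumption 3.2 we have, for any retrieved set $\mathcal{R}^{X}$ within radius $r$ of $X$ and for any $p \ge 1$,

$$\mathfrak{R}_{\mathcal{R}^{X}}\big(\mathcal{G}(X, Y)\big) \le \inf_{\epsilon\in[0,\,\mathcal{G}_{p,\max}(X,Y)/2]} \Big(4\epsilon + \frac{12}{\sqrt{|\mathcal{R}^{X}|}} \int_{\epsilon}^{\mathcal{G}_{p,\max}(X,Y)/2} \sqrt{\log\Big(\frac{2\,\mathcal{G}_{\max}}{\nu}\Big)\,\log N_p(\nu/2, \mathcal{G}, \mathcal{R}^{X})}\; d\nu\Big).$$

Furthermore, we have

$$\mathfrak{R}_{\mathcal{R}^{X}}\big(\mathcal{G}(X, Y)\big) \le \inf_{\epsilon\in[0,\,\mathcal{G}_{\max}(X,Y)/2]} \Big(4\epsilon + \frac{12}{\sqrt{|\mathcal{R}^{X}|}} \int_{\epsilon}^{\mathcal{G}_{\max}(X,Y)/2} \sqrt{\log N_{\infty}\big(\nu/2, \mathcal{G}, \mathcal{R}^{X} \cup \{(X, Y)\}\big)}\; d\nu\Big).$$

Proof.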
Given the set R X , and some function g ∈ G(X, Y ) let us define for p ≥ 1 g p,R X = 1 |R X | (x ,y )∈R X |g(x , y )| p 1/p . Then, we have G p,max (X, Y ); R X = max g∈G g p,R X for all g ∈ G(X, Y ). For the sake of brevity we will use G p,max (X, Y ) in place of G p,max (X, Y ); R X . Note that we have from previous definition G max (X, Y ) = G ∞,max (X, Y ) ≥ G p,max (X, Y ) for any p ≥ 1. Thus using the Chaining method (Shalev-Shwartz & Ben-David, 2014, Chapter 27) we can bound the Radamacher complexity as R R X G(X, Y ) ≤ inf ∈[0,Gp,max(X,Y )/2] 4 + 12 √ |R X | Gp,max(X,Y )/2 log N p (ν, G(X, Y ), R X )dν . To finish the proof we need to show, for p ≥ 1 N p (ν, G(X, Y ), R X ) ≤ N p (ν/2, G, R X )N p (ν/2, G, {(X, Y )}). First we fix any p ≥ 1. Let U (a set of real numbers) be a ν/2 cover (in p norm) of G with respect to {(X, Y )}. We have N p (ν, G(X, Y ), R X ) ≤ 2Gmax ν for any p ≥ 1 and any ν > 0. Further, let Ũ be a ν/2 cover of G with respect to R X . Note for any ũ ∈ Ũ we have ũ ∈ R |R X | . Now, we fix any g ∈ G. We have at least one ũ ∈ Ũ, and û ∈ U such that 1 |R X | (x ,y )∈R X |g (x , y ) -ũ(x , y )| p 1/p ≤ ν/2, and |g (X, Y ) -û| ≤ ν/2. Therefore, 1 |R X | (x ,y )∈R X | g (x , y ) -g (X, Y ) -ũ(x , y ) -û | p 1/p = 1 |R X | (x ,y )∈R X | g (x , y ) -ũ(x , y ) + û -g (X, Y ) | p 1/p ≤ 1 |R X | (x ,y )∈R X |g (x , y ) -ũ(x , y )| p 1/p + |û -g (X, Y )| ≤ ν/2 + ν/2 ≤ ν The first inequality follows by applying Minkowski's inequality. Whereas, for the second inequality we apply Jensen's inequality for (•) 1/p being a concave function for p ≥ 1, and applying the appropriate scaling. Therefore, given the covers Ũ and U , we can construct the set U with entries u ∈ R |R X | as: U := {u = (ũ(x, y) -û) : ũ ∈ Ũ, û ∈ U}. In particular, |U | = | U|| Ũ|. As the choice of g ∈ G and (x , y ) ∈ R X were arbitrary, we have U to be the cover of G(X, Y ). For p = ∞ we can specialize the bound. 
In particular, consider U to be a ν/2 cover (in ∞ norm) of G with respect to R X ∪ {(X, Y )}. Then U := {u = (ũ(x, y) -û(X, Y )) : ũ ∈ U} creates a (normalized) ∞ cover for G with respect to R X . This is true because 1 |R X | (x ,y )∈R X |g (x , y )- ũ(x , y )| p 1/p ≤ |g -ũ| ∞ = ν/2 and |û -g (X, Y )| ≤ |g -ũ| ∞ = ν/2. This concludes the proof. The first term in the above Lemma is similar to the Chaining based Rademacher bounds (Shalev-Shwartz & Ben-David, 2014, Chapter 28) for G, but the (in inf and in the integral) varies in [0, G max (X, Y )] instead of [0, G max ]. For small r we have G max (X, Y ) << G max , which can be leveraged to give tight bounds in certain situations. Example: F loc ≡ ∞ -bounded RKHS (Zhang, 2004) : Let us consider the setting of Zhang (2004) . In this setting, given some Reproducing Kernel Hilbert Space (RKHS) H, and a function f ∈ H, we can define the function f (•) = f • h x where for some h ∈ H. We further define the set of functions with bounded norm H A = { f (•) ∈ H : f H sup x∈X h x H ≤ A}. Finally, our local function class can be defined as F loc = H |Y| A = {f (•) : f y (•) ∈ H A , ∀y ∈ Y}. We have F loc ∞ = A. Recall that loss function for any y ∈ Y is given as (γ f (x, y)), for any f ∈ F loc . We also have for all y ∈ Y, | (γ f (x, y)) -(γ f (x, y))| ≤ 2L sup y |f y (x) -f y (x)| (Zhang, 2004, Assumption 15) with γ A = 2L ). Given the above setting, following Lemma 17 in Zhang ( 2004)foot_1 , we have for a universal constant c log N ∞ (2L ν, G, R X ∪ {(X, Y )}) ≤ c|Y| F loc 2 ∞ ln(2 + F loc ∞ /ν) + ln(|R X | + 1) ν 2 . This gives us the following bound for the Rademacher complexity of F loc R R X ≤ O |Y|L F loc ∞ ln(|R X |+1) 3/2 √ |R X | . ( ) Proof of Equation (24). Without optimizing over above, we plug in = Gmax(X,Y ) √ |R X | . 
We obtain R R X G(X, Y ) ≤ 4Gmax(X,Y ) √ |R X | + 12 √ |R X | Gmax(X,Y )/2 Gmax(X,Y ) √ |R X | log N ∞ ν/2, G, R X ∪ {(X, Y )} dν ≤ 4Gmax(X,Y ) √ |R X | + 48 √ c|Y|L F loc ∞ √ |R X | Gmax(X,Y )/2 Gmax(X,Y ) √ |R X | ln(2 + 4L F loc ∞ /ν) + ln(|R X | + 1) ν 2 dν ≤ 4Gmax(X,Y ) √ |R X | + 48 √ c|Y|L F loc ∞ √ |R X | Gmax(X,Y )/2 Gmax(X,Y ) √ |R X | ln((Gmax(X,Y )+4L F loc ∞ )/ν)+ln(|R X |+1) ν 2 dν ≤ 4Gmax(X,Y ) √ |R X | + 48 √ c|Y|L F loc ∞ √ |R X | 1/2 1 √ |R X | ln((1+4L F loc ∞/Gmax(X,Y ))/ν )+ln(|R X |+1) ν 2 dν ≤ 4Gmax(X,Y ) √ |R X | + 32 √ c|Y|L F loc ∞ √ |R X | ln (1 + 4L F loc ∞ /G max (X, Y )) |R X | + ln(|R X | + 1) 3/2 We use x ln(a/x) + b/xdx = -2/3(ln(a/x) + b) 3/2 for the final inequality, and ignore the negative part. Example: F loc ≡ 2 bounded RKHS (Lei et al., 2019) : We consider a fixed kernel K(x, x ) = φ(x), φ(x ) for x, x ∈ X, and let H K be the RKHS induced by K. Let us define the p,q norm for the vectors W = (w 1 , w 2 , . . . , w |Y| ) ∈ H |Y| K as (w 1 , . . . , w |Y| ) p,q = ( w 1 p , . . . , w |Y| p ) q . For some norm bound Λ > 0, the local hypothesis space is defined as F loc = {f (•) : f y (•) = w y , φ(•) , w y ∈ H K , ∀y ∈ Y, (w 1 , . . . , w |Y| ) 2,2 ≤ Λ}.

Recall that we have the loss function class

G = { (γ f (•, •)) : f ∈ F loc }, where the loss function (•) is assumed to be L-Lipschitz continuous w.r.t. ∞ norm. Given the retrieved set R X for some positive integer n ≥ 1, FX after Equation (8) in Lei et al. (2019) induced by R X .foot_2 Let the worst case Rademacher complexity of a function class F over n points be defined as R n (F). Also, for a set S let B(S) = max (x,y)∈S sup W : W 2,2 ≤Λ w y , φ(x) . We have from Theorem 23 in Lei et al. (2019) that the covering number is bounded as follows: for any set S = {(x i , y i ) : i = 1, . . . , n} of size n ≥ 1, for any ε > 4LR n|Y| FX log N ∞ ε, G, S ≤ 16n|Y|L 2 (R n|Y| FX ) 2 ε 2 log 2en|Y| B(S)L ε . Furthermore, from equation ( 18) in Lei et al. (2019) we have for any set Λ max (x,y)∈S φ(x) 2 √ 2n|Y| ≤ R n|Y| FX ≤ Λ max (x,y)∈S φ(x) 2 √ n|Y| . Therefore, we have for all ε ≥ 4L Λ max (x,y)∈S φ(x) 2 √ 2n|Y| log N ∞ ε, G, S ≤ 16 max (x,y)∈S φ(x) 2 2 Λ 2 L 2 ε 2 log 2en|Y| B(S)L ε . Plugging this covering number in in our Rademacher bound with ≥ 4L Λ max (x,y)∈S φ(x) 2 √ 2(|R X |+1)|Y| and taking S = R X ∪ {(X, Y )} we get R R X G(X, Y ) ≤ inf ∈[0,Gmax(X,Y )/2] 4 + 12 √ |R X | Gmax(X,Y )/2 log N ∞ (ν/2, G, R X ∪ {(X, Y )})dν ≤ 16 max (x,y)∈R X ∪{(X,Y )} φ(x) 2 ΛL 2(|R X | + 1)|Y| + 12 × 16 max (x,y)∈R X ∪{(X,Y )} φ(x) ΛL |R X | × × Gmax(X,Y )/2 4LΛ max (x,y)∈R X ∪{(X,Y )} φ(x) 2 √ 2(|R X |+1)|Y| 1 ν log 4e(|R X |+1)|Y| B(R X ∪{(X,Y )})L ν dν ≤ 16 max (x,y)∈R X ∪{(X,Y )} φ(x) 2 ΛL 2(|R X | + 1)|Y| + 8 × 16 max (x,y)∈R X ∪{(X,Y )} φ(x) ΛL |R X | × × log 4 √ 2eL B(R X ∪{(X,Y )})(|R X |+1)|Y| √ (|R X |+1)|Y| 4LΛ max (x,y)∈R X ∪{(X,Y )} φ(x) 2 3/2 ≤ 16 max (x,y)∈R X ∪{(X,Y )} φ(x) 2 ΛL 2(|R X | + 1)|Y| + 8 × 16 max (x,y)∈R X ∪{(X,Y )} φ(x) ΛL |R X | × × log √ 2e (|R X | + 1)|Y| 3/2 3/2 In the final inequality we use the fact that B(R X ∪ {(X, Y )}) ≤ max (x,y)∈R X ∪{(X,Y )} φ(x) 2 sup W : W 2,2 ≤Λ W 2,∞ ≤ max (x,y)∈R X ∪{(X,Y )} φ(x) 2 Λ Therefore, the final bound on the Rademacher complexity can be given as  R R X ≤ 
O L F loc ∞ ln(|Y||R X |) 3/2 √ |R X | . ( f A = σ L (A L σ L-1 (A L-1 σ L-2 (. . . A 1 x)) for x ∈ X where A = (A 1 , A 2 , . . . , A L ) is the sequence of weight matrices. The matrix A l ∈ R d l-1 ×d l for l = 1 to L, with d L = |Y|, and d 0 = d given X ⊆ R d . Furthermore, σ l (•) : R d l → R d l denotes the non-linearity (including pooling and activation), σ l -s are taken to be 1-Lipschitz, and σ l (0) = 0. We assume that the A l matrix is initialized at M l , for each l = 1 to L. We consider the local function class F loc = {f A : A l -M l 2,1 ≤ b l , A l σ ≤ s l , ∀l ≤ l ≤ L -1}. Furthermore, we have for any f ∈ F loc and any x ∈ X the function (f (x), y) → (γ f (•, •)) is 2L -Lipschitz. Therefore, for a fixed set S, we have from Theorem 3.3 in Bartlett et al. (2017) that the covering number of the G = { (γ f (•, •)) : f A ∈ F loc } is given as log N 2 ε, G, S ≤ 4L 2 B 2 ln(2d 2 max ) ε 2 L l=1 s l 2 L l=1 (b l /s l ) 2/3 3/2 = R ε 2 , where d max = max L l=1 d l , 1 |S| x∈S x 2 2 ≤ B, R = 4L 2 B 2 ln(2d 2 max ) L l=1 s l 2 L l=1 (b l /s l ) 2/3 3/2 . Using a the covering number based bound on Rademacher complexity we obtain R R X G(X, Y ) ≤ inf ∈[0,G2,max(X,Y )/2] 4 + 12 √ |R X | G2,max(X,Y )/2 log( 4L B L l=1 s l ν ) log N 2 ν/2, G, R X dν ≤ inf ∈[0,G2,max(X,Y )/2] 4 + 12 √ |R X | Gmax(X,Y )/2 log( 4L B L l=1 s l ν ) R ν 2 dν ≤ inf ∈[0,G2,max(X,Y )/2] 4 + 8 √ R √ |R X | log 3/2 ( 4L B L l=1 s l ) -8 √ R √ |R X | log 3/2 ( 8L B L l=1 s l G2,max(X,Y ) ) ≤ 4G2,max(X,Y ) √ |R X | + 8 √ R √ |R X | log 3/2 ( 4L B L l=1 s l √ |R X | G2,max(X,Y ) ) -8 √ R √ |R X | log 3/2 ( 8L B L l=1 s l G2,max(X,Y ) ) C PROOFS FOR SECTION 3.2 This section focuses on providing a proof of Proposition 3.5. It follows the proof technique of (Foster et al., 2019, Eq. (9) ). Before presenting the proof of Proposition 3.5, we need to introduce a slight variation of the Rademacher complexity for data-dependent hypothesis set. Let Z = X × Y. 
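Let $R = \{z_j^{R}\}$, $T = \{z_j^{T}\} \in \mathcal{Z}^m$ be two $m$-sized samples and $\sigma \in \{+1, -1\}^m$ be a vector of independent Rademacher variables. Now define $R^{T,\sigma} = \{z_j^{R^{T,\sigma}}\} \in \mathcal{Z}^m$ such that

$$z_j^{R^{T,\sigma}} = \begin{cases} z_j^{R}, & \text{if } \sigma_j = 1, \\ z_j^{T}, & \text{if } \sigma_j = -1, \end{cases}$$

i.e., $R^{T,\sigma}$ is obtained by replacing the $i$-th element of $R$ by the $i$-th element of $T$ iff $\sigma_i = -1$. Let $U \in \mathcal{Z}^{n-m}$ be an $(n-m)$-sized sample; for $R \in \mathcal{Z}^m$, $S_R = U \cup R \in \mathcal{Z}^n$.

$$\mathfrak{R}_{U,R,T}(\mathcal{H}) = \frac{1}{m}\,\mathbb{E}_{\sigma}\Big[\sup_{h\in\mathcal{H}(S_{R^{T,\sigma}})} \sum_{i=1}^{m} \sigma_i h(z_i^{T})\Big], \qquad \mathfrak{R}_{U}(\mathcal{H}) = \frac{1}{m}\,\mathbb{E}_{R,T\sim\mathcal{D}^m,\,\sigma}\Big[\sup_{h\in\mathcal{H}(S_{R^{T,\sigma}})} \sum_{i=1}^{m} \sigma_i h(z_i^{T})\Big].$$

C.1 PROOF OF PROPOSITION 3.5

We are now ready to establish the proof of Proposition 3.5. As discussed above, we extend the proof technique of (Foster et al., 2019, Eq. (9)) to obtain this result. Our setting differs from that of Foster et al. (2019), as the local ERM objective only depends on the retrieved samples $\mathcal{R}^{x}$ while the function class of interest $\mathcal{F}_S = \mathcal{F} \circ \Phi_S$ in (14) depends on the entire training set $S$ via the representation $\Phi_S$. We suitably modify the proof techniques of Foster et al. (2019) to handle this difference. Let $|\mathcal{R}^{x}| := m$ and $U = S \setminus \mathcal{R}^{x}$. For $R, T \in \mathcal{Z}^m$, we define

$$\Xi(R, T) = \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(\underbrace{\mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big]}_{:=\,R_{\ell}(f;\,\mathcal{D}^{x,r})} - \underbrace{\frac{1}{m}\sum_{(x',y')\in T} \ell(f(x'), y')}_{:=\,\widehat{R}_{\ell}(f;\,T)}\Big) = \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; T)\Big).$$

Note that we are interested in bounding

$$\Xi(\mathcal{R}^{x}, \mathcal{R}^{x}) = \sup_{f\in\mathcal{F}\circ\Phi_{S}} \Big(\underbrace{\mathbb{E}_{(X',Y')\sim\mathcal{D}^{x,r}}\big[\ell(f(X'), Y')\big]}_{R_{\ell}(f;\,\mathcal{D}^{x,r})} - \underbrace{\frac{1}{m}\sum_{(x',y')\in\mathcal{R}^{x}} \ell(f(x'), y')}_{\widehat{R}_{\ell}(f;\,\mathcal{R}^{x})\,=\,\widehat{R}^{x}_{\ell}(f)}\Big),$$

where we have used the fact that $U \cup \mathcal{R}^{x} = S$. Towards this, we first establish that $\Xi(R, R)$ satisfies the $\big(\frac{M_{\ell}}{m} + 2\Delta L L_{\ell,1}\big)$-bounded difference property, i.e., for $R, R' \in \mathcal{Z}^m$ that only differ in one element, we have

$$\Xi(R, R) - \Xi(R', R') \le \frac{M_{\ell}}{m} + 2\Delta L L_{\ell,1}.$$

Note that

$$\Xi(R, R) - \Xi(R', R') \le \underbrace{\Xi(R, R) - \Xi(R, R')}_{\mathrm{I}} + \underbrace{\Xi(R, R') - \Xi(R', R')}_{\mathrm{II}}.$$

Now, we will separately bound the two terms on the RHS. Let $z = (x, y) \in R \setminus R'$ and $z' = (x', y') \in R' \setminus R$. Thus, we have the following bound on the first term.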
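$$\begin{aligned}
\mathrm{I} = \Xi(R, R) - \Xi(R, R')
&= \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R)\Big) - \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\Big) \\
&\le \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big[\big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R)\big) - \big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\big)\Big] \\
&\le \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big[R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R) - R_{\ell}(f; \mathcal{D}^{x,r}) + \widehat{R}_{\ell}(f; R')\Big] \\
&= \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big[\widehat{R}_{\ell}(f; R') - \widehat{R}_{\ell}(f; R)\Big] \\
&= \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \frac{1}{m}\Big(\ell(f(x'), y') - \ell(f(x), y)\Big) \le \frac{M_{\ell}}{m},
\end{aligned}$$

where the last inequality follows from our boundedness assumption on the loss function $\ell$. Now we move to term II. Towards this, note that it follows from the definition of the supremum that, for any $\epsilon > 0$, there exists $\tilde{f} \in \mathcal{F}\circ\Phi_{U\cup R}$ such that

$$\sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\Big) - \epsilon \le R_{\ell}(\tilde{f}; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(\tilde{f}; R'). \tag{31}$$

Let $\tilde{f} = \tilde{g} \circ \Phi_{U\cup R} \in \mathcal{F}\circ\Phi_{U\cup R}$ and $f' = \tilde{g} \circ \Phi_{U\cup R'} \in \mathcal{F}\circ\Phi_{U\cup R'}$. Note that, for any $(x, y) \in \mathcal{Z}$,

$$\begin{aligned}
\big|\ell(\tilde{f}(x), y) - \ell(f'(x), y)\big|
&= \big|\ell(\tilde{g}\circ\Phi_{U\cup R}(x), y) - \ell(\tilde{g}\circ\Phi_{U\cup R'}(x), y)\big| \\
&\overset{(i)}{\le} L_{\ell,1}\,\big\|\tilde{g}\circ\Phi_{U\cup R}(x) - \tilde{g}\circ\Phi_{U\cup R'}(x)\big\|_{\infty} \\
&\le L_{\ell,1}\,\big\|\tilde{g}\circ\Phi_{U\cup R}(x) - \tilde{g}\circ\Phi_{U\cup R'}(x)\big\|_{2} \\
&\overset{(ii)}{\le} L_{\ell,1}\,L\,\big\|\Phi_{U\cup R}(x) - \Phi_{U\cup R'}(x)\big\|_{2} \\
&\overset{(iii)}{\le} L_{\ell,1}\,L\,\Delta,
\end{aligned} \tag{32}$$

where we use the $L_{\ell,1}$-Lipschitzness of $\ell$ w.r.t. the $\|\cdot\|_{\infty}$ norm, the $L$-Lipschitzness of $\tilde{g}$, and the $\Delta$-sensitivity of the representation $\Phi$ in (i), (ii), and (iii), respectively. Now, we have

$$\begin{aligned}
\mathrm{II} = \Xi(R, R') - \Xi(R', R')
&= \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\Big) - \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R'}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\Big) \\
&\overset{(i)}{\le} R_{\ell}(\tilde{f}; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(\tilde{f}; R') + \epsilon - \sup_{f\in\mathcal{F}\circ\Phi_{U\cup R'}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f; R')\Big) \\
&\le R_{\ell}(\tilde{f}; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(\tilde{f}; R') + \epsilon - \Big(R_{\ell}(f'; \mathcal{D}^{x,r}) - \widehat{R}_{\ell}(f'; R')\Big) \\
&= \Big(R_{\ell}(\tilde{f}; \mathcal{D}^{x,r}) - R_{\ell}(f'; \mathcal{D}^{x,r})\Big) - \Big(\widehat{R}_{\ell}(\tilde{f}; R') - \widehat{R}_{\ell}(f'; R')\Big) + \epsilon \\
&\le \Big|R_{\ell}(\tilde{f}; \mathcal{D}^{x,r}) - R_{\ell}(f'; \mathcal{D}^{x,r})\Big| + \Big|\widehat{R}_{\ell}(\tilde{f}; R') - \widehat{R}_{\ell}(f'; R')\Big| + \epsilon \\
&\overset{(ii)}{\le} 2L_{\ell,1}\,L\,\Delta + \epsilon,
\end{aligned} \tag{33}$$

where (i) and (ii) follow from (31) and (32), respectively. Now, since $\epsilon$ in (31) can be chosen arbitrarily small, it follows from (29), (30), and (33) that

$$\Xi(R, R) - \Xi(R', R') \le \frac{M_{\ell}}{m} + 2\Delta L L_{\ell,1},$$

i.e., $\Xi(R, R)$ indeed satisfies the $\big(\frac{M_{\ell}}{m} + 2\Delta L L_{\ell,1}\big)$-bounded difference property.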
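Now, it follows from McDiarmid's inequality that, for $\delta > 0$, we have with probability at least $1 - \delta$:

$$\Xi(\mathcal{R}^{x}, \mathcal{R}^{x}) \le \mathbb{E}\big[\Xi(\mathcal{R}^{x}, \mathcal{R}^{x})\big] + \big(M_{\ell} + 2\Delta L L_{\ell,1}\, m\big)\sqrt{\frac{\log(1/\delta)}{2m}}$$

or

$$\sup_{f\in\mathcal{F}\circ\Phi_{S}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}^{x}_{\ell}(f)\Big) \le \mathbb{E}_{\mathcal{R}^{x}}\Big[\sup_{f\in\mathcal{F}\circ\Phi_{S}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}^{x}_{\ell}(f)\Big)\Big] + \big(M_{\ell} + 2\Delta L L_{\ell,1}\, m\big)\sqrt{\frac{\log(1/\delta)}{2m}}. \tag{34}$$

Now, the first statement of Proposition 3.5 follows from (34) and the fact that $m = |\mathcal{R}^{x}|$. It follows from the proof steps in (Foster et al., 2019, Section E.1) that

$$\mathbb{E}_{\mathcal{R}^{x}}\Big[\sup_{f\in\mathcal{F}\circ\Phi_{S = U\cup\mathcal{R}^{x}}} \Big(R_{\ell}(f; \mathcal{D}^{x,r}) - \widehat{R}^{x}_{\ell}(f)\Big)\Big] \le 2\,\mathfrak{R}_{U}(\ell \circ \mathbf{F}),$$

where $\mathbf{F} = \{\mathcal{F}\circ\Phi_{U\cup R}\}_{R\in\mathcal{Z}^m}$ and $\mathfrak{R}_{U}$ is defined in (27). This completes the proof of Proposition 3.5.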
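The two ingredients of the argument above can be checked numerically. The sketch below is only an illustration under assumed values (the constants `M`, `Delta`, `L`, `L_l1`, `delta`, and the uniform losses are hypothetical, not taken from the paper): it verifies that replacing one of `m` bounded losses moves the empirical risk by at most `M / m` (the term I bound for a fixed function), and that plugging the per-coordinate constant `c = M/m + 2*Delta*L*L_l1` into the standard one-sided McDiarmid deviation `c * sqrt(m * log(1/delta) / 2)` gives the deviation term `(M + 2*Delta*L*L_l1*m) * sqrt(log(1/delta) / (2m))`.

```python
import math
import random


def empirical_risk(losses):
    """Average loss over a retrieved sample."""
    return sum(losses) / len(losses)


def mcdiarmid_deviation(c, m, delta):
    """One-sided McDiarmid deviation for m coordinates with bounded difference c each."""
    return c * math.sqrt(m * math.log(1.0 / delta) / 2.0)


rng = random.Random(1)
M, m = 1.0, 50  # hypothetical loss bound and retrieved-set size
losses = [rng.uniform(0.0, M) for _ in range(m)]  # losses of a fixed f on R

# Replace one element at a time (R' differs from R in one element).
worst_gap = 0.0
for j in range(m):
    swapped = list(losses)
    swapped[j] = rng.uniform(0.0, M)
    gap = abs(empirical_risk(losses) - empirical_risk(swapped))
    worst_gap = max(worst_gap, gap)

# Deviation term with the bounded-difference constant from the proof.
Delta, L, L_l1, delta = 0.01, 1.0, 1.0, 0.05  # hypothetical constants
c = M / m + 2 * Delta * L * L_l1
deviation = mcdiarmid_deviation(c, m, delta)
closed_form = (M + 2 * Delta * L * L_l1 * m) * math.sqrt(math.log(1 / delta) / (2 * m))
```

The check covers a single fixed function only; the supremum over the data-dependent class and the sensitivity term II are what the proof handles analytically.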

APPROACH

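As introduced in Sec. 2.3, our objective is to learn a function $f : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^{*} \to \mathbb{R}^{|\mathcal{Y}|}$. For a given instance $x$, such a function can leverage its neighboring set $\mathcal{R}^{x} \in (\mathcal{X} \times \mathcal{Y})^{*}$ to improve the prediction on $x$. In this work, we restrict ourselves to a sub-family of such retrieval-based methods that first map $\mathcal{R}^{x} \sim \mathcal{D}^{x,r}$ to $\hat{\mathcal{D}}^{x,r}$, an empirical estimate of the local distribution $\mathcal{D}^{x,r}$, which is subsequently utilized to make a prediction for $x$. In particular, the scorers of interest are of the form

$$(x, \mathcal{R}^{x}) \mapsto f(x, \hat{\mathcal{D}}^{x,r}) = \big(f_1(x, \hat{\mathcal{D}}^{x,r}), \ldots, f_{|\mathcal{Y}|}(x, \hat{\mathcal{D}}^{x,r})\big) \in \mathbb{R}^{|\mathcal{Y}|},$$

where $f_y(x, \hat{\mathcal{D}}^{x,r})$ denotes the score assigned to the $y$-th class. Thus, assuming that $\Delta_{\mathcal{X}\times\mathcal{Y}}$ denotes the set of distributions over $\mathcal{X} \times \mathcal{Y}$, we restrict to a suitable function class in $\{f : \mathcal{X} \times \Delta_{\mathcal{X}\times\mathcal{Y}} \to \mathbb{R}^{|\mathcal{Y}|}\}$. Note that, given a surrogate loss $\ell : \mathbb{R}^{|\mathcal{Y}|} \times \mathcal{Y} \to \mathbb{R}$ and a scorer $f$, the empirical risk $\widehat{R}^{\mathrm{ex}}_{\ell}(f)$ and the population risk $R^{\mathrm{ex}}_{\ell}(f)$ take the following form:

$$\widehat{R}^{\mathrm{ex}}_{\ell}(f) = \frac{1}{n}\sum_{i\in[n]} \ell\big(f(x_i, \hat{\mathcal{D}}^{x_i,r}), y_i\big) \quad \text{and} \quad R^{\mathrm{ex}}_{\ell}(f) = \mathbb{E}_{(X,Y)\sim\mathcal{D}}\big[\ell\big(f(X, \mathcal{D}^{X,r}), Y\big)\big].$$

Note that the general framework for learning in the extended feature space $\tilde{\mathcal{X}} := \mathcal{X} \times \Delta_{\mathcal{X}\times\mathcal{Y}}$ provides a very rich class of functions. In this paper, we focus on a specific form of learning methods in the extended feature space that uses kernel methods. The method as well as its analysis is obtained by adapting the work on utilizing kernel methods for domain generalization (Blanchard et al., 2011; Deshmukh et al., 2019).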

D.1 KERNEL-BASED CLASSIFICATION

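Before introducing a kernel method for classification, we need to define a suitable kernel $k : \tilde{\mathcal{X}} \times \tilde{\mathcal{X}} \to \mathbb{R}$ on the extended feature space $\tilde{\mathcal{X}} := \mathcal{X} \times \Delta_{\mathcal{X}\times\mathcal{Y}}$. Towards this, let $k_{\mathcal{Z}}$ be a kernel over $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$. Assuming that $\mathcal{H}_{k_{\mathcal{Z}}}$ is the reproducing kernel Hilbert space (RKHS) associated with $k_{\mathcal{Z}}$, we can define a kernel mean embedding (Smola et al., 2007) $\Psi : \Delta_{\mathcal{Z}} \to \mathcal{H}_{k_{\mathcal{Z}}}$ as follows:

$$\Psi(P) = \int_{\mathcal{Z}} k_{\mathcal{Z}}(z, \cdot)\, dP(z). \tag{38}$$

For an empirical distribution $\hat{\mathcal{D}}^{x,r}$ defined by $\mathcal{R}^{x}$, the kernel embedding in (38) takes the following form.

$$\Psi(\hat{\mathcal{D}}^{x,r}) = \frac{1}{|\mathcal{R}^{x}|}\sum_{(x',y')\in\mathcal{R}^{x}} k_{\mathcal{Z}}\big((x', y'), \cdot\big). \tag{39}$$

Now, using a kernel $k_{\mathcal{X}}$ over $\mathcal{X}$ and a kernel-like function $\kappa$ over $\Psi(\Delta_{\mathcal{Z}})$, we define a desired kernel $k : \tilde{\mathcal{X}} \times \tilde{\mathcal{X}} \to \mathbb{R}$ as follows:

$$k(\tilde{X}_1, \tilde{X}_2) = k\big((X_1, \mathcal{D}^{X_1,r}), (X_2, \mathcal{D}^{X_2,r})\big) = k_{\mathcal{X}}(X_1, X_2) \cdot \kappa\big(\Psi(\mathcal{D}^{X_1,r}), \Psi(\mathcal{D}^{X_2,r})\big). \tag{40}$$

Let $\mathcal{H}_k$ be the RKHS corresponding to the kernel $k$ in (40), and

$$f^{\mathrm{ex}} = \arg\min_{f\in\mathcal{H}_k^{|\mathcal{Y}|}} \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\tilde{x}_i), y_i\big) + \lambda \cdot \Omega(f), \tag{41}$$

where $\tilde{x}_i = (x_i, \hat{\mathcal{D}}^{x_i,r})$ and $\Omega(f) := \|f\|^2_{\mathcal{H}_k^{|\mathcal{Y}|}} := \sum_{y\in\mathcal{Y}} \|f_y\|^2_{\mathcal{H}_k}$. It follows from the representer theorem that the solution of (41) takes the form $f^{\mathrm{ex}}(\cdot) = \sum_{i\in[n]} \alpha_i\, k\big((x_i, \hat{\mathcal{D}}^{x_i,r}), \cdot\big)$. One can apply multiclass extensions of SVMs to learn the weights $\{\alpha_i\}$ (Deshmukh et al., 2019). Next, we focus on studying the generalization behavior of the scorer $f^{\mathrm{ex}}$ recovered in (41).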
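The kernel construction above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: instances are scalars, `k_x` is a Gaussian kernel, `k_z` multiplies a Gaussian kernel on instances with an indicator kernel on labels, and `kappa` is taken to be a Gaussian-type function of the RKHS distance between the two mean embeddings. All function names and bandwidths are hypothetical. The inner product between two empirical embeddings is the average of `k_z` over all pairs of retrieved examples, so the embeddings never need to be formed explicitly.

```python
import math


def k_x(x1, x2, bw=1.0):
    """Gaussian kernel on instances (an assumed choice for k_X)."""
    return math.exp(-((x1 - x2) ** 2) / (2 * bw ** 2))


def k_z(z1, z2, bw=1.0):
    """Kernel on labeled pairs: Gaussian on instances times label agreement."""
    (x1, y1), (x2, y2) = z1, z2
    return k_x(x1, x2, bw) * (1.0 if y1 == y2 else 0.0)


def embedding_inner(R1, R2):
    """Inner product of two empirical mean embeddings: mean of k_Z over all pairs."""
    return sum(k_z(a, b) for a in R1 for b in R2) / (len(R1) * len(R2))


def kappa(R1, R2, bw=1.0):
    """Gaussian-type kernel-like function on embeddings (an assumed choice)."""
    sq_dist = (embedding_inner(R1, R1)
               - 2 * embedding_inner(R1, R2)
               + embedding_inner(R2, R2))
    return math.exp(-max(sq_dist, 0.0) / (2 * bw ** 2))


def extended_kernel(x1, R1, x2, R2):
    """Product kernel on the extended feature space: k_X times kappa."""
    return k_x(x1, x2) * kappa(R1, R2)


# Two toy retrieved sets around the same query point.
R_a = [(0.1, 0), (0.2, 0), (0.3, 1)]
R_b = [(0.15, 0), (0.25, 1)]
k_ab = extended_kernel(0.2, R_a, 0.2, R_b)
k_ba = extended_kernel(0.2, R_b, 0.2, R_a)
k_aa = extended_kernel(0.2, R_a, 0.2, R_a)
```

A Gram matrix built from `extended_kernel` over the training pairs could then be passed to any kernel classifier that accepts precomputed kernels; that step is omitted here.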

D.2 GENERALIZATION BOUNDS FOR KERNEL-BASED CLASSIFICATION

Before presenting a generalization bound for kernel-based classification over the extended feature space X, we state the three key assumptions that are utilized in our analysis. Furthermore, assume that sup (x,y) (x, y) := M ≤ ∞. Assumption D.2. Kernels k X , k Z , and κ are bounded by M k X , M k Z , and M κ , respectively. Assumption D.3. Let H k Z and H κ be the RKHS associated with k Z and κ, respectively. Then, the canonical feature map ϕ κ : H k Z → H κ is α-Hölder continuous with α ∈ (0, 1], i.e., ϕ κ (h 1 ) -ϕ κ (h 2 ) Hκ ≤ L • h 1 -h 2 α H k Z ∀h 1 , h 2 ∈ {h ∈ H k Z : h H k Z ≤ M k Z } (43) The following result states our generalization bound for the kernel-based classification method described in Sec. D.1. Theorem D.4. Let 0 ≤ δ ≤ 1 and Assumptions D.1-D.3 hold. Furthermore, let N (r, δ) be as defined in (8). Then, for any B > 0, the following holds with probability at least 1 -3δ sup f ∈F k B R ex (f ) -R ex (f ) ≤ 32 log 2L ,1 BMκM k X n -1 2 1 + log 3 2 √ 2n|Y| + L ,1 L M k X B M k Z 2 log( n δ ) N (r, δ n ) + M k Z 1 N (r, δ n ) + 4M k Z log( n δ ) 3N (r, δ n ) α + M log( 1 δ ) 2n , where F k B = f = (f 1 , . . . , f |Y| ) ∈ H |Y| k : Ω(f ) ≤ B 2 and M := M + L ,1 BM k X M κ . Before presenting the proof of Theorem D.4, we state two key results from the literature that are used in our analysis. Proposition D.5 (Steinwart & Christmann (2008) ). Let (Ω, A, P ) be a probability space, H be a separable Hilbert space, and M > 0. Let η 1 , . . . , η m : Ω → H be m independent H-valued random variables satisfying η j ∞ ≤ M , for all j ∈ [m]. The, for δ > 0, the following holds with probability at least 1 -δ. Then the Rademacher complexity of the induced function class • F k B := { • f : f ∈ F k B } satisfies R S • F k B := E σi sup f ∈F k B 1 n i∈[n] σ i f (x i ), y i ≤ 16L ,1 log 2B sup x∈ X k(x, x)n -1 2 |Y| 1 2 - 1 max{2,p} 1 + log 3 2 √ 2n|Y| . ( ) Note that σ = (σ 1 , . . . , σ n ) denotes n i. 
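By combining (48) and (49), we obtain that

$$\big|f_y(x_i, \hat{\mathcal{D}}^{x_i,r}) - f_y(x_i, \mathcal{D}^{x_i,r})\big| \le L\, M_{k_{\mathcal{X}}} \cdot \|f_y\|_{\mathcal{H}_k} \cdot \big\|\Psi(\hat{\mathcal{D}}^{x_i,r}) - \Psi(\mathcal{D}^{x_i,r})\big\|^{\alpha}_{\mathcal{H}_{k_{\mathcal{Z}}}}. \tag{51}$$

Now, Hoeffding's inequality in Hilbert spaces (cf. Proposition D.5) implies that, for $i \in [n]$, the following holds with probability at least $1 - \delta$.

$$\begin{aligned}
\big\|\Psi(\hat{\mathcal{D}}^{x_i,r}) - \Psi(\mathcal{D}^{x_i,r})\big\|_{\mathcal{H}_{k_{\mathcal{Z}}}}
&= \Big\|\frac{1}{|\mathcal{R}^{x_i}|}\sum_{(x',y')\in\mathcal{R}^{x_i}} k_{\mathcal{Z}}\big((x', y'), \cdot\big) - \mathbb{E}_{\mathcal{D}^{x_i,r}}\big[k_{\mathcal{Z}}\big((X', Y'), \cdot\big)\big]\Big\|_{\mathcal{H}_{k_{\mathcal{Z}}}} \\
&\le M_{k_{\mathcal{Z}}}\sqrt{\frac{2\log(1/\delta)}{|\mathcal{R}^{x_i}|}} + M_{k_{\mathcal{Z}}}\sqrt{\frac{1}{|\mathcal{R}^{x_i}|}} + \frac{4 M_{k_{\mathcal{Z}}}\log(1/\delta)}{3|\mathcal{R}^{x_i}|}.
\end{aligned} \tag{52}$$

It follows from (51) and (52) that, for each $i \in [n]$,

$$\big|f_y(x_i, \hat{\mathcal{D}}^{x_i,r}) - f_y(x_i, \mathcal{D}^{x_i,r})\big| \le L\, M_{k_{\mathcal{X}}} \cdot \|f_y\|_{\mathcal{H}_k} \cdot \Bigg(M_{k_{\mathcal{Z}}}\sqrt{\frac{2\log(\frac{1}{\delta})}{|\mathcal{R}^{x_i}|}} + M_{k_{\mathcal{Z}}}\sqrt{\frac{1}{|\mathcal{R}^{x_i}|}} + \frac{4 M_{k_{\mathcal{Z}}}\log(\frac{1}{\delta})}{3|\mathcal{R}^{x_i}|}\Bigg)^{\alpha} \quad \forall\, y \in \mathcal{Y} \tag{53}$$

holds with probability at least $1 - \delta$. Next, taking a union bound over $i \in [n]$ implies that the following holds for all $i \in [n]$ and $y \in \mathcal{Y}$ with probability at least $1 - \delta$.

$$\big|f_y(x_i, \hat{\mathcal{D}}^{x_i,r}) - f_y(x_i, \mathcal{D}^{x_i,r})\big|$$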



As stated, we require the local-regularity condition to hold for each x. This can be relaxed to hold with high probability at the cost of a more involved exposition.

We correct for a typographical error in Zhang (2004), where n ≡ |R X | appears in the denominator of the bound presented in Lemma 17. Theorem 4 of Zhang (2002) shows that this is a typographical error. Indeed, the covering number is not supposed to decrease with an increasing number of points.

We need FX only to state some theorems in Lei et al. (2019). We refer interested readers to Lei et al. (2019) for the details.



Figure 1: An illustration of a retrieval-based classification model. Given an input instance x, similar to an instance-based model, it retrieves similar (labeled) examples R x = {(x j , y j )} j from training data. Subsequently, it processes (potentially via a nonparametric method) input instance along with the retrieved examples to make the final prediction ŷ = f (x, R x ).

Figure 2: Performance of local ERM with size of retrieved set across models of different complexity.

over Sobolev spaces. Minimax risks over more general classes were studied by Ibragimov & Has'minskii (2013) and Donoho & Liu (1988), among others, for estimating an entire function. However, none of these works provides finite-sample generalization bounds, which we obtain in this work.

then we simply use $\mathcal{N}_p(\epsilon, \mathcal{F}, S)$ to denote the corresponding $\ell_p$-covering number.

B PROOFS FOR SECTION 3.1

B.1 PROOF OF LEMMA 3.3

the function class is parameterized by $(X, Y)$. Let us define some quantities of the function class on a set $S \subseteq \mathcal{X} \times \mathcal{Y}$ as

$$G_{\max}((X, Y); S) = \sup_{g \in \mathcal{G}(X,Y)} \; \sup_{(x', y') \in S} |g(x', y')|.$$

By centering each function $f \in \mathcal{F}^{\mathrm{loc}}$ at the point $(X, Y)$, we can transform the generalization over the function class $\mathcal{F}^{\mathrm{loc}}$ to the generalization over the function class $\mathcal{G}(X, Y)$.

Example: $\mathcal{F}^{\mathrm{loc}}$ ≡ L-layer fully connected deep neural network (DNN) (Bartlett et al., 2017): Following Bartlett et al. (2017), we consider an L-layer deep neural network (DNN)

Note that, following this notation, we have S R T,σ = U ∪ R T,σ . For S ∈ Z n , let H(S) be a data dependent function class (hypothesis set), which does not depend on the ordering of the elements in S. Definition C.1 (Rademacher complexity for data-dependent function class). Let H = {H(S)} S∈Z n be a family of data dependent function classes. Given R = {z R j∈[m] }, T = {z T j∈[m] } ∼ D m and U = {z U m+i } i∈[n-m] , the empirical Rademacher complexity R U,R,T (H) and Rademacher complexity R U,m (H) are defined as follows.

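Assumption D.1. The loss function $\ell : \mathbb{R}^{|\mathcal{Y}|} \times \mathcal{Y} \to \mathbb{R}$ is $L_{\ell,1}$-Lipschitz w.r.t. the first argument, i.e.,

$$|\ell(s_1, y) - \ell(s_2, y)| \le L_{\ell,1} \cdot \|s_1 - s_2\|_\infty \quad \forall s_1, s_2 \in \mathbb{R}^{|\mathcal{Y}|} \text{ and } y \in \mathcal{Y}. \tag{42}$$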

6. (Deshmukh et al., 2019;Lei et al., 2019) Let Z = X × Y be (extended) input and output space pair and S = z1 , . . . , zn . Let H k be a RKHS defined on X, with k being the associated kernel. Let F k B = (f 1 , . . . , f |Y| ) : f y ∈ H k ∀y ∈ Y and y∈Y : R |Y| × Y → R be a Lipschitz function in its first argument, i.e., | (s 1 , y) -(s 2 , y)| ≤ L ,1 s 1 -s 2 ∞ ∀s 1 , s 2 ∈ R |Y| and y ∈ Y.

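i.d. Rademacher random variable.

Proof of Theorem D.4. Note that

$$\sup_{f \in \mathcal{F}^k_B} \Big( \hat{R}_{\mathrm{ex}}(f) - R_{\mathrm{ex}}(f) \Big) = \sup_{f \in \mathcal{F}^k_B} \Big( \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i, \hat{\mathcal{D}}^{x_i,r}), y_i\big) - \mathbb{E}_{(X,Y) \sim \mathcal{D}}\, \ell\big(f(X, \mathcal{D}^{X,r}), Y\big) \Big)$$

$$\le \underbrace{\sup_{f \in \mathcal{F}^k_B} \Big( \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i, \hat{\mathcal{D}}^{x_i,r}), y_i\big) - \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i, \mathcal{D}^{x_i,r}), y_i\big) \Big)}_{\text{Term-I}} + \underbrace{\sup_{f \in \mathcal{F}^k_B} \Big( \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i, \mathcal{D}^{x_i,r}), y_i\big) - \mathbb{E}_{(X,Y) \sim \mathcal{D}}\, \ell\big(f(X, \mathcal{D}^{X,r}), Y\big) \Big)}_{\text{Term-II}}. \tag{46}$$

By Assumption D.1, for each $f \in \mathcal{F}^k_B$,

$$\frac{1}{n} \sum_{i=1}^{n} \Big( \ell\big(f(x_i, \hat{\mathcal{D}}^{x_i,r}), y_i\big) - \ell\big(f(x_i, \mathcal{D}^{x_i,r}), y_i\big) \Big) \le \frac{L_{\ell,1}}{n} \sum_{i=1}^{n} \big\| f(x_i, \hat{\mathcal{D}}^{x_i,r}) - f(x_i, \mathcal{D}^{x_i,r}) \big\|_\infty \le L_{\ell,1} \cdot \max_{y \in \mathcal{Y}} \max_{i \in [n]} |f_y(x_i, \hat{\mathcal{D}}^{x_i,r}) - f_y(x_i, \mathcal{D}^{x_i,r})|. \tag{47}$$

It follows from the reproducing property of the kernel $k$ that, for any $y \in \mathcal{Y}$,

$$|f_y(x_i, \hat{\mathcal{D}}^{x_i,r}) - f_y(x_i, \mathcal{D}^{x_i,r})| = \big| \big\langle f_y,\; k((x_i, \hat{\mathcal{D}}^{x_i,r}), \cdot) - k((x_i, \mathcal{D}^{x_i,r}), \cdot) \big\rangle \big| \le \|f_y\|_{\mathcal{H}_k} \cdot \big\| k((x_i, \hat{\mathcal{D}}^{x_i,r}), \cdot) - k((x_i, \mathcal{D}^{x_i,r}), \cdot) \big\|_{\mathcal{H}_k}. \tag{48}$$

Note that

$$\big\| k((x_i, \hat{\mathcal{D}}^{x_i,r}), \cdot) - k((x_i, \mathcal{D}^{x_i,r}), \cdot) \big\|_{\mathcal{H}_k} = \Big( k((x_i, \hat{\mathcal{D}}^{x_i,r}), (x_i, \hat{\mathcal{D}}^{x_i,r})) + k((x_i, \mathcal{D}^{x_i,r}), (x_i, \mathcal{D}^{x_i,r})) - 2 k((x_i, \hat{\mathcal{D}}^{x_i,r}), (x_i, \mathcal{D}^{x_i,r})) \Big)^{1/2}$$

$$= \Big( k_{\mathcal{X}}(x_i, x_i) \big[ \kappa(\Psi(\hat{\mathcal{D}}^{x_i,r}), \Psi(\hat{\mathcal{D}}^{x_i,r})) + \kappa(\Psi(\mathcal{D}^{x_i,r}), \Psi(\mathcal{D}^{x_i,r})) - 2 \kappa(\Psi(\hat{\mathcal{D}}^{x_i,r}), \Psi(\mathcal{D}^{x_i,r})) \big] \Big)^{1/2}$$

$$= \sqrt{k_{\mathcal{X}}(x_i, x_i)}\, \big\| \kappa(\Psi(\hat{\mathcal{D}}^{x_i,r}), \cdot) - \kappa(\Psi(\mathcal{D}^{x_i,r}), \cdot) \big\|_{\mathcal{H}_\kappa} \le M_{k_{\mathcal{X}}} \big\| \kappa(\Psi(\hat{\mathcal{D}}^{x_i,r}), \cdot) - \kappa(\Psi(\mathcal{D}^{x_i,r}), \cdot) \big\|_{\mathcal{H}_\kappa} \tag{49}$$

$$= M_{k_{\mathcal{X}}} \big\| \phi_\kappa(\Psi(\hat{\mathcal{D}}^{x_i,r})) - \phi_\kappa(\Psi(\mathcal{D}^{x_i,r})) \big\|_{\mathcal{H}_\kappa} \le L M_{k_{\mathcal{X}}} \cdot \big\| \Psi(\hat{\mathcal{D}}^{x_i,r}) - \Psi(\mathcal{D}^{x_i,r}) \big\|^{\alpha}_{\mathcal{H}_{k_{\mathcal{Z}}}}. \tag{50}$$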

• H k be the norm associated with H


Recall that, for each $i \in [n]$, we have $|\mathcal{R}^{x_i}| \ge N(r, \delta)$ with probability at least $1 - \delta$ (cf. (8)). Using union bound, we have, with probability at least $1 - \delta$. Thus, the following holds for all $i \in [n]$ and $y \in \mathcal{Y}$ with probability at least $1 - 2\delta$. By using $\|f_y\|_{\mathcal{H}_k} \le B$ and combining (47) with (55), we obtain that holds with probability at least $1 - 2\delta$.

Bounding the term-II in (46). Note that Using the Assumptions D.1 and D.2 and the fact that Now, it follows from the Azuma-McDiarmid's inequality that the following holds with probability at least $1 - \delta$. Using the standard symmetrization procedure, we get that where $\sigma = (\sigma_1, \ldots, \sigma_n)$ denotes n i.i.d. Rademacher random variables and $\mathfrak{R}_S(\ell \circ \mathcal{F}^k_B)$ denotes the Rademacher complexity of the function class Now, using Proposition D.6 with $p = 2$ and Assumption D.2, we have Now, by combining (57), (58), and (59), we obtain that with probability at least $1 - \delta$ Finally, combining (46), (56) and (60) completes the proof.
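To get a feel for how the retrieval-dependent term of Theorem D.4 behaves, the following minimal sketch evaluates the Hoeffding-type radius from (52) and raises it to the power α. It assumes the square-root reading of (52), i.e. M√(2 log(1/δ)/m) + M√(1/m) + 4M log(1/δ)/(3m); the function names and the default constants (all set to 1) are hypothetical and chosen only for illustration.

```python
import math


def hoeffding_radius(m, delta, M=1.0):
    """Hilbert-space Hoeffding radius for m retrieved points (assumed form of (52))."""
    log_term = math.log(1.0 / delta)
    return (M * math.sqrt(2.0 * log_term / m)
            + M * math.sqrt(1.0 / m)
            + 4.0 * M * log_term / (3.0 * m))


def retrieval_term(num_retrieved, n, delta, alpha,
                   L_loss=1.0, L_holder=1.0, M_kX=1.0, B=1.0, M_kZ=1.0):
    """Second summand of the bound: L_{l,1} * L * M_kX * B * radius^alpha.

    The union bound over the n training points replaces delta by delta / n.
    All constants default to 1 purely for illustration.
    """
    radius = hoeffding_radius(num_retrieved, delta / n, M_kZ)
    return L_loss * L_holder * M_kX * B * radius ** alpha


if __name__ == "__main__":
    # The term shrinks as the number of retrieved points grows.
    for m in (10, 100, 1000, 10000):
        print(m, round(retrieval_term(m, n=10000, delta=0.05, alpha=0.5), 4))
```

Under these assumptions the term decays roughly like m^(-α/2) in the number of retrieved points m, so a smaller Hölder exponent α slows the decay.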

E ADDITIONAL DETAILS FOR EXPERIMENTS

E.1 SYNTHETIC

Task and data. We consider the task of binary classification on mixtures using synthetic data. In particular, we assume $k = 100$ clusters in a $D = 10$-dimensional space. Each cluster is specified by a mean parameter $\mu_i \in \mathbb{R}^D \sim \mathrm{Uniform}(-10, 10)$ and a classification weight vector $w_i$. We randomly generate a train set of $n = 10000$ points as follows. To generate a labeled example $(x_j, y_j)$, $j \in [n]$: 1) select a cluster $i$ uniformly at random, and 2) sample $x_j \sim \mathcal{N}(\mu_i, I)$ and set its label $y_j = \mathrm{sign}(w_i^T (x_j - \mu_i))$. Additionally, we generate another set of points as the test set using the same procedure.

Methods. As baselines, we consider models of various complexity, ranging from a simple linear classifier, to support vector machines with a polynomial kernel (of degree 3) and with a radial basis function (RBF) kernel, to a multi-layer perceptron (MLP) of two layers. For retrieval-based models, we consider each of the above methods as the local model to fit on retrieved data points via the local ERM framework (Sec. 3). Additionally, we report a simple kNN baseline. We compare all these methods using classification accuracy on the held-out test set. We repeat all the experiments 10 times.

Observations. In Figure 3, we observe the tradeoff of varying the size of the retrieved set (as dictated by the neighborhood radius) on the performance of the proposed algorithms. When the number of retrieved samples is small, the local methods have lower accuracy, which is due to large generalization error. When the retrieved set is large, the local methods fail to minimize the loss effectively due to the lack of model capacity. This effect is more pronounced for simpler function classes such as linear classifiers than for RBF or polynomial classifiers.
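The data-generating procedure from the "Task and data" paragraph of this subsection can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' code: the function name `make_mixture` is hypothetical, and because the text does not state the distribution of the weight vectors w_i, the sketch assumes they are drawn from a standard normal.

```python
import random


def make_mixture(k=100, dim=10, n=10000, seed=0):
    """Sample clustered binary-classification data as described in the text."""
    rng = random.Random(seed)
    # Cluster means: each coordinate drawn from Uniform(-10, 10).
    means = [[rng.uniform(-10.0, 10.0) for _ in range(dim)] for _ in range(k)]
    # ASSUMPTION: the law of w_i is not given in the text; N(0, I) is used here.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    xs, ys = [], []
    for _ in range(n):
        i = rng.randrange(k)                        # 1) pick a cluster uniformly
        mu, w = means[i], weights[i]
        x = [m + rng.gauss(0.0, 1.0) for m in mu]   # 2) x ~ N(mu_i, I)
        score = sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mu))
        ys.append(1 if score >= 0 else -1)          # y = sign(w_i^T (x - mu_i))
        xs.append(x)
    return means, weights, xs, ys


if __name__ == "__main__":
    means, weights, xs, ys = make_mixture(k=5, dim=3, n=1000, seed=1)
    print(len(xs), sum(1 for y in ys if y == 1))
```

Calling `make_mixture` twice with different seeds for the points but shared means and weights would be needed to reproduce the train/test split; the sketch keeps a single call for brevity.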

E.2 CIFAR-10

Task and data. We consider the task of binary classification on real image data for object detection. In particular, we consider a subset of the CIFAR-10 dataset restricted to images from the "Cat" and "Dog" classes. We randomly partition the data into a train set of $n = 10000$ points and the remaining 2000 points for test. We perform 10-fold cross-validation.

Methods. We consider a subset of the methods from Appendix E.1. In particular, we only consider a simple linear classifier and a multi-layer perceptron (MLP) of two layers. For retrieval-based models, we consider each of the above methods as the local model to fit on retrieved data points via the local ERM framework (Sec. 3). The retrieval is done using L2 distance in the input space directly (no features are extracted). Additionally, we report a simple kNN baseline. We compare all these methods using classification accuracy on the held-out test set. We repeat all the experiments 10 times.

Observations. Similar to Figure 3, Figure 4 exhibits a tradeoff, where varying the size of the retrieved set (as dictated by the neighborhood radius) impacts the performance of the proposed algorithms. When the number of retrieved samples is small, the local methods have lower accuracy, which is due to large generalization error; when the number of retrieved samples is large, a simple local function class incurs a large approximation error.
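The local ERM pipeline with L2 retrieval in the input space can be illustrated with a small standard-library sketch. It is not the authors' implementation: all function names are hypothetical, the local model is a linear classifier trained with the logistic loss by plain gradient descent (the text does not specify the loss or optimizer for the linear model), and returning `None` on an empty retrieved set is an arbitrary choice.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def retrieve(x, train_x, train_y, radius):
    """Return the labeled training points within L2 distance `radius` of x."""
    return [(xi, yi) for xi, yi in zip(train_x, train_y)
            if l2_distance(x, xi) <= radius]


def fit_linear(neighbors, steps=200, lr=0.1):
    """Local ERM: minimize the average logistic loss on the retrieved set."""
    dim = len(neighbors[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in neighbors:
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # d/dscore of log(1 + exp(-margin)) is -y / (1 + exp(margin)).
            g = -yi / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(dim):
                grad_w[j] += g * xi[j]
            grad_b += g
        w = [wj - lr * gj / len(neighbors) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(neighbors)
    return w, b


def local_erm_predict(x, train_x, train_y, radius):
    """Retrieve neighbors of x, fit a local linear model, and classify x."""
    neighbors = retrieve(x, train_x, train_y, radius)
    if not neighbors:
        return None  # ASSUMPTION: no prediction when nothing is retrieved
    w, b = fit_linear(neighbors)
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    train_x = [[-2.0, 0.0], [-1.5, 0.5], [-1.0, -0.5],
               [1.0, 0.5], [1.5, -0.5], [2.0, 0.0]]
    train_y = [-1, -1, -1, 1, 1, 1]
    print(local_erm_predict([1.2, 0.0], train_x, train_y, radius=1.0))
```

The neighborhood radius plays the role described in the observations: a small radius yields few retrieved points per query, while a large radius forces a single simple model to fit a broad region.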

E.3 IMAGENET

Task and data. We consider the task of 1000-way image classification on the ImageNet ILSVRC-12 dataset. We use the standard train-test split, which has $n = 1281167$ points for training and 50000 points for test. Given the large computational cost, we could only run each experiment once.

Methods. We compare the proposed local ERM (Sec. 3) to the state-of-the-art (SoTA) single model published for this task, which is from the most recent CVPR 2022 (Zhai et al., 2022). For the local parametric model, we use a small MobileNetV3 architecture (Howard et al., 2019) with 4.01M parameters and 156 MFLOPs compute cost. Contrast this with the SoTA model ViT-G/14, which has 1.84B parameters and 938 GFLOPs compute cost. Following standard practice in the literature, we use unsupervised learned features from ALIGN (Jia et al., 2021) to do image retrieval using L2 distance. For solving the local ERM, we fine-tune a MobileNetV3 model, which has been pretrained on ImageNet, on the retrieved set using the Adam optimizer with a linear decay schedule. Additionally, we report a simple kNN baseline. We compare all these methods using classification accuracy on the held-out test set.

Observations. In Figure 5, we see that local ERM with a small MobileNet-V3 model achieves a top-1 accuracy of 82.78, whereas a regularly trained MobileNet-V3 model achieves a top-1 accuracy of only 65.80. The result is also very competitive with the SoTA of 90.45, which is obtained with a much larger model. Thus, the results suggest that the simple local ERM framework (analyzed in our work) is able to demonstrate the utility of retrieval-based models. In particular, it allows a realistic small-sized model to attain very competitive numbers on the popular ImageNet benchmark. Furthermore, as pointed out at the end of Sec. 3.2, using the global representation from ALIGN embeddings helps the simplest linear model outperform MobileNet-V3 working directly on the image input, thereby showcasing the benefits of endowing local ERM with a global representation.

