INTERVAL BOUND INTERPOLATION FOR FEW-SHOT LEARNING WITH FEW TASKS
Anonymous

Abstract

Few-shot learning aims to transfer the knowledge acquired from training on a diverse set of tasks to unseen tasks from the same task distribution, with a limited amount of labeled data. The underlying requirement for effective few-shot generalization is to learn a good representation of the task manifold. This becomes more difficult when only a limited number of tasks are available for training. In such a few-task few-shot setting, it is beneficial to explicitly preserve the local neighborhoods from the task manifold and exploit them to generate artificial tasks for training. To this end, we introduce the notion of interval bounds from the provably robust training literature to few-shot learning. The interval bounds are used to characterize neighborhoods around the training tasks. These neighborhoods can then be preserved by minimizing the distance between a task and its respective bounds. We then use a novel strategy to artificially form new tasks for training by interpolating between the available tasks and their respective interval bounds. We apply our framework to both model-agnostic meta-learning and prototype-based metric-learning paradigms. The efficacy of the proposed approach is evident from the improved performance on several datasets from diverse domains in comparison to recent methods.

1. INTRODUCTION

Few-shot learning problems deal with diverse tasks consisting of subsets of data drawn from the same underlying data manifold along with associated labels. The joint distribution of data and corresponding labels which governs the sampling of such tasks is often called the task distribution (Finn et al., 2017; Yao et al., 2022) . Consequently, few-shot learning methods attempt to leverage the knowledge acquired by training on a large pool of such tasks to easily generalize to unseen tasks from the same distribution, using only a few labeled examples. We hereafter refer to the support of the task distribution as the task manifold which is distinct from but closely-related to the data manifold associated with the data distribution. Since the unseen tasks are sampled from the same underlying manifold governing the task distribution, we should ideally learn a good representation of the task manifold by preserving the neighborhoods from the high-dimensional manifold in the lower-dimensional feature embedding (Tenenbaum et al., 2000; Roweis & Saul, 2000; Van der Maaten & Hinton, 2008) . However, the labels associated with a task can define any arbitrary partitioning of the data. Therefore, we can attempt to preserve the neighborhood for a task by simply conserving the neighborhoods for the corresponding subset of the data manifold in the feature embedding learned by the few-shot learner. This facilitates effective generalization to new tasks using a limited amount of labeled data by only updating the classifier as the learned feature embedding would likely require very little adaptation. However, existing few-shot learning methods lack an explicit mechanism for achieving this. Further, real-world few-shot learning scenarios like rare disease detection may not have the large number of training tasks required for effective learning, due to various constraints such as data collection costs, privacy concerns, and/or data availability in newer domains (Yao et al., 2022) . 
In such scenarios, few-shot learning methods are prone to overfit the training tasks, thus limiting their ability to generalize to unseen tasks. Therefore, in this work, we develop a method to explicitly constrain the feature embedding in an attempt to preserve neighborhoods from the high-dimensional task manifold, and to construct artificial tasks within these neighborhoods in the feature space, to improve the performance when a limited number of training tasks are available. The proposed approach relies on characterizing the neighborhoods from the high-dimensional task manifold and propagating them through the network with the intent to preserve the task neighborhoods in the feature space. We achieve this by employing the concept of interval bounds from the provably robust training literature (Gowal et al., 2019; Morawiecki et al., 2020), i.e. the axis-aligned bounds for the activations in each layer, obtained using interval arithmetic (Sunaga, 1958). Concretely, as shown in Figure 1, we first define a small ϵ-neighborhood for each few-shot training task and then use Interval Bound Propagation (IBP; Gowal et al., 2019) to obtain the bounding box around the mapping of the corresponding neighborhood in the feature embedding space f θ S given by the first S layers of the learner f θ . We then explicitly attempt to preserve the ϵ-neighborhoods by minimizing the distance between a task and its respective interval bounds, in addition to optimizing the few-shot classification objective. We further devise a mechanism to construct artificial tasks by interpolating between a task and its corresponding IBP bounds. It is important to note that this setup is distinct from provably robust training for few-shot learning in that we do not attempt to minimize (or even calculate) the worst-case classification loss.
While training the learner f θ to minimize the classification loss L CE on the query set D q i , we additionally attempt to minimize the losses L LB and L U B , forcing the ϵ-neighborhood to be compact in the embedding space as well.
Figure 2: (a) Depending on how flat the task-manifold embedding is at the layer where interpolation is performed, inter-task interpolation may create artificial tasks either close to the task manifold (green cross) or away from it (red box). (b) The proposed interval bound-based task interpolation creates artificial tasks by combining an original task with one of its interval bounds (yellow ball). Such artificial tasks are likely to be in the vicinity of the task manifold as the interval bounds are forced to be close to the task embedding by the losses L LB and L U B .
Various methods have been proposed to mitigate the few-task few-shot problem using approaches such as explicit regularization (Jamal & Qi, 2019; Yin et al., 2019), intra-task augmentation (Lee et al., 2020; Ni et al., 2021; Yao et al., 2021), and inter-task interpolation to construct new artificial tasks (Yao et al., 2022). While inter-task interpolation has been shown to be the most effective among these existing approaches, it suffers from the limitation that the artificially created tasks may be generated away from the task manifold depending on the curvature of the feature embedding space, as there is no natural way to select pairs of tasks which are close to each other on the manifold (Figure 2 (a)). The interval bounds obtained using IBP, on the other hand, are likely to be close to the original task embedding as we explicitly minimize the distance between a task and its interval bounds. Thus, using them for interpolation is likely to keep the generated tasks close to the manifold (Figure 2 (b)).
In essence, the key contributions made in this article advance the existing literature in the following ways: (1) In Section 4.1, we present, for the first time, a novel method to synergize few-shot learning with interval bound propagation (Gowal et al., 2019) so as to explicitly lend the ability to preserve task neighborhoods in the feature embedding space of the few-shot learner. (2) In Section 4.2, we propose the interval bound-based task interpolation technique, which can create new tasks (as opposed to augmenting each individual task (Lee et al., 2020; Ni et al., 2021; Yao et al., 2021)) by interpolating between a task sampled from the task distribution and its interval bounds. (3) Unlike existing inter-task interpolation methods that require paired tasks for interpolation (Yao et al., 2022), our framework can generate new tasks from only a single task. This allows the proposed framework to be seamlessly integrated with existing few-shot learning paradigms. In Section 5, we empirically demonstrate the effectiveness of our proposed approach on both gradient-based meta-learning and prototype-based metric-learning on few-task real-world datasets from multiple domains, in comparison to recent methods. Finally, we make concluding remarks and discuss future directions of research in Section 6.

2. RELATED WORKS

Few-shot learning: The aim of few-shot learning is to generalize to new tasks using only a few examples (Wang et al., 2020) through three major strategies. First, one can augment the tasks at the data level (Hariharan & Girshick, 2017). Second, the hypothesis space can be constrained at the model level (Snell et al., 2017). Third, the hypothesis search strategy can be improved at the algorithm level (Finn et al., 2017). The problem becomes even more difficult when there is a scarcity of training tasks, i.e. in the few-task scenario. We take the route of Yao et al. (2022) to offer a novel task augmentation strategy that can work in conjunction with both algorithm-level meta-learning and model-level metric-learning methods.
Provably robust training of neural networks: A way to build robust neural networks is to find a differentiable upper bound on the verifiable violation of specifications. Such upper bounds can then be directly optimized alongside the original loss (Mirman et al., 2018; Raghunathan et al., 2018; Wong et al., 2018). IBP (Gowal et al., 2019) follows this direction by explicitly minimizing the worst-case loss inside the ϵ-neighborhood of an input for an arbitrary network with some architectural constraints. In our work, however, instead of building robust networks, we repurpose IBP to characterize ϵ-neighborhoods and learn better representations, such that generalization to new tasks by a few-shot learner becomes easier. Moreover, the bounds of the ϵ-neighborhood obtained through IBP give us a direct way to construct new artificial tasks when the number of available tasks is small.
Manifold learning: Traditional methods like ISOMAP (Tenenbaum et al., 2000), LLE (Roweis & Saul, 2000), t-SNE (Van der Maaten & Hinton, 2008), etc. aim to represent high-dimensional data in a lower-dimensional space while preserving the local neighborhoods (Abukmeil et al., 2021).
Recent manifold learning approaches mostly employ generative neural networks such as deep belief networks (Lee et al., 2009), variational auto-encoders (Connor et al., 2021; Kumar & Poole, 2020), flow-based approaches (Brehmer & Cranmer, 2020; Caterini et al., 2021), etc. In a similar spirit, we repurpose IBP to define ϵ-neighborhoods for few-shot learning tasks and constrain the learned feature embedding to preserve the said neighborhoods.
Task augmentation: To train on datasets with a limited number of tasks, some works directly impose regularization on the few-shot learner (Jamal & Qi, 2019; Yin et al., 2019). Another line of work performs data augmentation on the individual tasks (Lee et al., 2020; Ni et al., 2021; Yao et al., 2021). Finally, a third direction is to employ inter-task interpolation to mitigate task scarcity (Yao et al., 2022). Our approach is similar to the third category in that we directly create new artificial tasks. However, we differ from all of the above-mentioned methods in that we undertake neither intra-task augmentation nor inter-task interpolation.

3. PRELIMINARIES

In a few-shot learning problem, we deal with tasks T i ∼ p(T ). Each task T i is associated with a dataset D i = (X i , Y i ), that we further subdivide into a support set D s i = (X s i , Y s i ) = {(x s i,r , y s i,r )} Ns r=1 and a query set D q i = (X q i , Y q i ) = {(x q i,r , y q i,r )} Nq r=1 . Given a learning model f θ , where θ denotes the model parameters, few-shot learning algorithms attempt to learn θ to minimize the loss on the query set D q i for each of the sampled tasks using the data-label pairs from the corresponding support set D s i . Thereafter, the trained model f θ and the support set D s j for new tasks T j can be used to perform inference on the corresponding query set D q j . In the following subsection, we discuss gradient-based meta-learning while the prototype-based metric-learning is detailed in Appendix A. Gradient-based meta-learning: In gradient-based meta-learning, the aim is to learn initial parameters θ * such that a typically small number of gradient update steps using the data-label pairs in the support set D s i results in a model f ϕi that performs well on the query set of task T i . During the meta-training stage, first, a base learner is trained on multiple support sets D s i , and the performance of the resulting models f ϕi is evaluated on the corresponding query sets D q i . The meta-learner parameters θ are then updated such that the expected loss of the base learner on query sets is minimized. In the meta-testing stage, the final meta-trained model f θ * is fine-tuned on the support set D s j for the given test task T j to obtain the adapted model f ϕj that can then be used for inference on the corresponding query set D q j . 
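To make the bi-level structure concrete, here is a minimal numpy sketch of one meta-update for a linear model with squared loss and a single inner gradient step (the toy model, loss, and all names are our own simplifications, not the paper's actual setup):

```python
import numpy as np

def maml_step(theta, tasks, inner_lr=0.1, outer_lr=0.01):
    """One outer update: adapt theta on each support set with one gradient
    step, then update theta using the gradient of the query-set loss."""
    meta_grad = np.zeros_like(theta)
    for Xs, ys, Xq, yq in tasks:
        # Inner loop: one gradient step on the support set (squared loss).
        grad_s = 2 * Xs.T @ (Xs @ theta - ys) / len(ys)
        phi = theta - inner_lr * grad_s
        # Outer gradient, differentiating through the inner step:
        # for a linear model, d(phi)/d(theta) = I - inner_lr * 2 Xs^T Xs / n.
        grad_q = 2 * Xq.T @ (Xq @ phi - yq) / len(yq)
        jac = np.eye(len(theta)) - inner_lr * 2 * Xs.T @ Xs / len(ys)
        meta_grad += jac.T @ grad_q
    return theta - outer_lr * meta_grad / len(tasks)
```

Repeating `maml_step` over sampled tasks drives down the post-adaptation query loss, which is exactly the outer objective formalized next.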
Considering Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) as an example, the bi-level optimization of gradient-based meta-learning is formulated as:
$\theta^* \leftarrow \arg\min_{\theta} \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \left[ \mathcal{L}(f_{\phi_i}; \mathcal{D}^q_i) \right], \quad \text{where } \phi_i = \theta - \eta_0 \nabla_{\theta} \mathcal{L}(f_{\theta}; \mathcal{D}^s_i),$ (1)
while η 0 denotes the inner-loop learning rate used by the base learner to train on D s i for task T i , and L is the loss function, which is usually the cross-entropy loss for classification problems:
$\mathcal{L}_{CE} = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \big[ -\textstyle\sum_r \log p(y^q_{i,r} \mid x^q_{i,r}, f_{\phi_i}) \big].$ (2)
A key requirement for effective few-shot generalization to new tasks for both gradient-based meta-learning and prototype-based metric-learning is to learn a good embedding of the high-dimensional manifold characterizing the task distribution p(T ), i.e. the task manifold. Ideally, the learned embedding should conserve the neighborhoods from the high-dimensional task manifold (Tenenbaum et al., 2000; Roweis & Saul, 2000). Hence, in the following subsection, we discuss Interval Bound Propagation (IBP) (Gowal et al., 2019) that can be employed to define a neighborhood around a task.
Interval bound propagation: Let us consider a neural network f θ consisting of a sequence of transformations h l (l ∈ {1, 2, · · · , L}) for each of its L layers. We start from an initial input z 0 = x to the network, along with a lower bound $\underline{z}_0(\epsilon) = x - \mathbf{1}\epsilon$ and an upper bound $\overline{z}_0(\epsilon) = x + \mathbf{1}\epsilon$ for an ϵ-neighborhood around the input x. In each of the subsequent layers l ∈ {1, 2, · · · , L} of the network, we get an activation z l = h l (z l-1 ). IBP uses interval arithmetic to obtain the corresponding axis-aligned bounds of the form $\underline{z}_l(\epsilon) \le z_l \le \overline{z}_l(\epsilon)$ on the activations for the l-th layer. Given the specific differentiable transformation h l , interval arithmetic yields corresponding differentiable lower and upper bound transformations $\underline{z}_l(\epsilon) = \underline{h}_l(\underline{z}_{l-1}(\epsilon), \overline{z}_{l-1}(\epsilon))$ and $\overline{z}_l(\epsilon) = \overline{h}_l(\underline{z}_{l-1}(\epsilon), \overline{z}_{l-1}(\epsilon))$, as described in Appendix C.
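For affine layers followed by a monotone activation like ReLU, the bound transformations have a simple closed form. A hedged numpy sketch for fully-connected layers (the paper's networks are convolutional, but the interval arithmetic is identical; helper names are ours):

```python
import numpy as np

def ibp_affine(lb, ub, W, b):
    """Propagate the box [lb, ub] through z -> W z + b."""
    mu, rad = (ub + lb) / 2, (ub - lb) / 2   # center and radius of the box
    mu2 = W @ mu + b
    rad2 = np.abs(W) @ rad                   # radius grows by |W|
    return mu2 - rad2, mu2 + rad2

def ibp_monotone(act, lb, ub):
    """Elementwise monotone activations are applied to both bounds."""
    return act(lb), act(ub)

def ibp_forward(x, eps, layers):
    """Bounds on the output for every input in the eps-box around x."""
    relu = lambda z: np.maximum(z, 0)
    lb, ub = x - eps, x + eps
    for W, b in layers:
        lb, ub = ibp_monotone(relu, *ibp_affine(lb, ub, W, b))
    return lb, ub
```

Any point of the input box provably lands inside the returned bounds; this soundness property is what IBP inherits from interval arithmetic.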
The interval arithmetic ensures that each of the coordinates $\underline{z}_{l,c}(\epsilon)$ and $\overline{z}_{l,c}(\epsilon)$ of $\underline{z}_l(\epsilon)$ and $\overline{z}_l(\epsilon)$ respectively, satisfies the conditions:
$\underline{z}_{l,c}(\epsilon) = \min_{\underline{z}_{l-1}(\epsilon) \le z \le \overline{z}_{l-1}(\epsilon)} e_c^T h_l(z) \quad \text{and} \quad \overline{z}_{l,c}(\epsilon) = \max_{\underline{z}_{l-1}(\epsilon) \le z \le \overline{z}_{l-1}(\epsilon)} e_c^T h_l(z),$ (3)
where e c is the standard c-th basis vector. Further extending to multiple layers, such as f θ S having the first S layers of f θ , the individual transformations $\underline{h}_l$ and $\overline{h}_l$ for l ∈ {1, 2, · · · , S} can be composed to obtain the corresponding functions $\underline{f}_{\theta_S}$ and $\overline{f}_{\theta_S}$, such that $\underline{z}_S(\epsilon) = \underline{f}_{\theta_S}(z_0, \epsilon)$ and $\overline{z}_S(\epsilon) = \overline{f}_{\theta_S}(z_0, \epsilon)$.
Our proposed method aims to enable the learner f θ to learn a feature embedding that attempts to preserve the ϵ-neighborhoods in the task manifold. Consider the network $f_\theta = f_{\theta_{L-S}} \circ f_{\theta_S}$, where S (≤ L) is a user-specified layer number that demarcates the boundary between the portion f θ S of the model that focuses on feature representation and the subsequent portion f θ L-S responsible for the classification. For a given training task T i , the Euclidean distances between the embedding f θ S (x q i,r ) for the query instances and their respective interval bounds $\underline{f}_{\theta_S}(x^q_{i,r}, \epsilon)$ and $\overline{f}_{\theta_S}(x^q_{i,r}, \epsilon)$ are a measure of how well the ϵ-neighborhood is preserved in the learned feature embedding:
$\mathcal{L}_{LB} = \frac{1}{N_q} \sum_{r=1}^{N_q} \| f_{\theta_S}(x^q_{i,r}) - \underline{f}_{\theta_S}(x^q_{i,r}, \epsilon) \|_2^2$ (4)
$\mathcal{L}_{UB} = \frac{1}{N_q} \sum_{r=1}^{N_q} \| f_{\theta_S}(x^q_{i,r}) - \overline{f}_{\theta_S}(x^q_{i,r}, \epsilon) \|_2^2.$ (5)
To ensure that the small ϵ-neighborhoods get mapped to small interval bounds by the feature embedding f θ S , we can minimize the losses L LB and L U B in addition to the classification loss L CE in (2). Notice that the losses L LB and L U B are never used for the support instances x s i,r .
Figure 3: Dynamic weights for MAML+IBP on miniImageNet when γ is set to 1 for ease of visualization.
Attempting to minimize a naïve sum of the three losses can cause some issues. For example, weighing the classification loss L CE too high essentially reduces the proposed method to vanilla few-shot learning. On the other hand, assigning very high weights to the interval losses L LB and/or L U B may diminish learnability as the preservation of ϵ-neighborhoods gets precedence over classification performance. Moreover, such static weighting approaches are not capable of adapting to (and consequently mitigating) situations where one of the losses comes to unduly dominate the others.
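Given query embeddings and their propagated bounds, the two interval losses are plain mean squared distances; a one-to-one numpy transcription (variable names ours):

```python
import numpy as np

def interval_losses(feat, feat_lb, feat_ub):
    """L_LB and L_UB: mean squared Euclidean distances between the query
    embeddings and their lower / upper interval bounds."""
    l_lb = np.mean(np.sum((feat - feat_lb) ** 2, axis=1))
    l_ub = np.mean(np.sum((feat - feat_ub) ** 2, axis=1))
    return l_lb, l_ub
```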
Thus, we minimize a convex weighted sum L of the three losses: $\mathcal{L}(t) = w_{CE}(t)\mathcal{L}_{CE}(t) + w_{LB}(t)\mathcal{L}_{LB}(t) + w_{UB}(t)\mathcal{L}_{UB}(t)$, where t denotes the current training step and w e (t) is the weight for the corresponding loss L e , e ∈ {CE, LB, U B}, at the t-th training step, which is dynamically calculated based on a softmax across the current values of the three losses:
$w_e(t) = \frac{\exp(\mathcal{L}_e(t)/\gamma)}{\sum_{e' \in \{CE, LB, UB\}} \exp(\mathcal{L}_{e'}(t)/\gamma)}.$
The hyperparameter γ controls the relative importance of the losses. If any of the losses becomes too large, the dynamic weighting scheme strives to restore balance by assigning a very high weight to the concerned loss, thus prioritizing its minimization over that of the other losses. The changes in the dynamic weights over training steps for IBP-aided MAML (hereafter called MAML+IBP) using the "4-CONV" network (Vinyals et al., 2016) on the miniImageNet dataset (Vinyals et al., 2016) are illustrated in Figure 3. We can observe that while there is a clear ordering to the magnitude of the weights (and therefore the corresponding losses) throughout the entire training run, the weights are in fact able to adapt to changes in loss values to maintain the status quo among the different losses. We compare MAML+IBP against contenders including (1) a variant that applies virtual adversarial training (Miyato et al., 2018) to the feature embedding through the first S layers, (2) MAML+GL that uses the distance between the original query set and its perturbed (by additive Gaussian noise) version as an extra loss, and (3) MAML+ULBL that considers the distance between the upper and lower interval bounds as an additional loss (Morawiecki et al., 2020) (further details in Appendix F.3). We see that MAML+IBP achieves higher 5-way 1-shot classification accuracy than the 5 contenders on the miniImageNet and tieredImageNet (Ren et al., 2018) datasets.
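The dynamic weighting above is just a temperature-scaled softmax over the current loss values; a small self-contained sketch (names ours):

```python
import numpy as np

def dynamic_weights(losses, gamma=1.0):
    """Softmax over loss values: a loss that grows large receives a large
    weight, so its minimization is prioritized (gamma = temperature)."""
    scaled = np.asarray(losses, dtype=float) / gamma
    scaled -= scaled.max()          # subtract max for numerical stability
    w = np.exp(scaled)
    return w / w.sum()
```

The total loss is then the dot product of these weights with the three losses; a large gamma flattens the weights toward uniform, recovering a plain average.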
Moreover, we also illustrate that the feature embedding learned by IBP-aided training exhibits better intra-task compactness in terms of the mean Euclidean distance from the nearest neighbor of the same class for 100 query instances from 600 tasks, in the feature space characterized by f θ S . Recent works (Ni et al., 2021; Yao et al., 2022) have shown that augmenting the training data with artificial tasks can improve performance in domains with a limited number of tasks. Therefore, while IBP-aided training improves the performance of vanilla MAML (as well as other baselines, detailed in Appendix F.3), we are particularly interested in the added advantage that it lends by facilitating the generation of artificial tasks within the neighborhoods defined by the interval bounds.

4.2. INTERVAL BOUND-BASED TASK INTERPOLATION

Since minimizing the additional losses L LB and L U B is expected to ensure that the ϵ-neighborhood around a task is mapped to a small interval in the feature embedding space, artificial tasks formed within such intervals are naturally expected to be close to the task manifold. Therefore, we create additional artificial tasks by interpolating between an original task and one of its corresponding interval bounds (i.e., either the upper or the lower interval bound). In other words, for a training task T i , a corresponding artificial task T ′ i is characterized by a support set D s ′ i = {(H s ′ i,r , y s i,r )} Ns r=1 in the embedding space. The artificial support instances H s ′ i,r are created as:
$H^{s\prime}_{i,r} = (1 - \lambda_k) f_{\theta_S}(x^s_{i,r}) + (1 - \nu_k)\,\lambda_k\, \underline{f}_{\theta_S}(x^s_{i,r}, \epsilon) + \nu_k\,\lambda_k\, \overline{f}_{\theta_S}(x^s_{i,r}, \epsilon),$
where k denotes the class to which x s i,r belongs, λ k ∈ [0, 1] is sampled from a Beta distribution Beta(α, β), and the random choice of ν k ∈ {0, 1} dictates which of the two bounds is chosen for each class. The labels y s i,r for the artificial task remain identical to those of the original task. The query set D q ′ i for the artificial task is constructed analogously. We then minimize the mean of the additional classification loss L ′ CE for the artificial task T ′ i and the classification loss L CE for the original task T i on the query instances (and also the support instances in case of meta-learning). As a reminder, the losses L LB and L U B are also additionally minimized for the query instances. The complete IBP-based task interpolation or Interval Bound Interpolation (IBI) training setup is illustrated in Figure 4 in Appendix B. Since IBI does not play any part during the testing phase, the testing recipe remains identical to that of vanilla few-shot learning. The detailed pseudocode of MAML+IBI (along with the IBI variant of ProtoNet) can be found in Appendix B.
Theoretical analysis: The data X i (i = 1, 2, · · · , N ) for tasks T i can be thought of as i.i.d.
observations from a marginal distribution P X defined on a compact subset X of R d (d ≥ 1), paired with corresponding Y i drawn from the marginal distribution P Y . The map f θ S is bestowed with the task of producing a lower-dimensional representation of the input X. Let us denote the embedding space by H ⊆ R κ , given that κ ≤ d. The spaces X and H are endowed with the l 2 norm for simplicity and conformity to our convention. One may observe that $f_{\theta_S} = h_S \circ \cdots \circ h_2 \circ h_1$, where in general h l (z) = σ(A l z + b l ), given that A l ∈ R d l+1 ×d l and b l ∈ R d l+1 , l = 1, · · · , S. The function σ denotes the activation (such as ReLU), applied component-wise. Evidently, in our notation, d 1 = d and d S+1 = κ. With this setup, we proceed to the theoretical analysis of our approach. Please find the detailed proofs in Appendix C.
Definition 1 (Perturbation). Given any x 1 ∈ X , an ε-perturbation corresponding to x 1 is the set of points x 1 (ε) ⊂ X such that ∥x 1 − x 2 ∥ = ε for all x 2 ∈ x 1 (ε); ε > 0.
For the particular choice of the l 2 norm, Definition 1 characterizes an ε-perturbation as a hollow ball of radius $\varepsilon = \epsilon\sqrt{d}$ around a given point.
Lemma 1 (Lipschitz networks ensure bounded IBP). Let $\underline{x}$ and $\overline{x}$ be ε-perturbations of x ∼ P X for an ε > 0 (i.e. $\underline{x}, \overline{x} \in x(\varepsilon)$). Given that the activation σ is Lipschitz continuous (such as ReLU) with constant c σ > 0, there exists a constant D = D(c σ ; A 1 , A 2 , · · · , A S ; ε) such that $\underline{f}_{\theta_S}(x, \varepsilon)$ and $\overline{f}_{\theta_S}(x, \varepsilon)$ will at most be $\bar{\varepsilon}$-perturbed versions of f θ S (x), where $\bar{\varepsilon} = \varepsilon D$.
The minimization objective of IBI can be rephrased as L = L CE + ω 1 L LB + ω 2 L U B , where ω 1 , ω 2 ≥ 0 are Lagrangian multipliers. The forthcoming result, however, relies on the constrained formulation of the objective, given as min{L CE } subject to L LB ≤ t 1 and L U B ≤ t 2 , where t 1 , t 2 ≥ 0.
This is motivated by the fact that the constrained formulation yields solutions upper-bounding the ones obtained using its Lagrangian counterpart (Boyd & Vandenberghe, 2004, Chapter 5). Lemma 1 implies that the two losses (L U B and L LB ) appearing in the constraints can always be made arbitrarily small, depending upon ε. As such, in the constrained regime, the remaining problem is to show that the multi-task sample classification loss can indeed be dealt with.
Theorem 1 (Generalization bound). Let P be the joint distribution of (f θ S (X), Y ), supported on H × R. In the multi-task regime, let I denote the set of tasks, each consisting of N samples. Define $\hat{R}(N, |I|) = \mathbb{E}_{\mathcal{T}_i \sim \hat{p}(\mathcal{T})} \mathbb{E}_{(X_j, Y_j) \sim \hat{p}(\mathcal{T}_i)} [\mathcal{L}_{CE}(f_{\theta_{L-S}}(H^*_j), Y_j)]$ and $R = \mathbb{E}_{\mathcal{T}_i \sim p(\mathcal{T})} \mathbb{E}_{(X_j, Y_j) \sim \mathcal{T}_i} [\mathcal{L}_{CE}(f_{\theta_{L-S}}(f_{\theta_S}(X_j)), Y_j)]$. For a bounded loss function L CE : R × R → [0, a] (a ≥ 0), if the neural network-induced map f θ L-S is such that $\|\nabla f_{\theta_{L-S}}(\cdot)\| < \infty$, we ensure that
$\hat{R}(N, |I|) - R - \lambda \precsim 2^{\frac{L-S+1}{2}} \sqrt{\log(2\kappa + 2)} \left( \frac{1}{\sqrt{N}} + \frac{1}{\sqrt{|I|}} \right) + \sqrt{\frac{\log(2|I|/\delta)}{N}} + \sqrt{\frac{\log(2/\delta)}{|I|}}$
holds with probability at least 1 − δ, where $\lambda = \lambda(\varepsilon, \bar{\varepsilon})$.
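The constant in Lemma 1 can be traced layer by layer. A crude back-of-envelope composition bound under the stated Lipschitz assumption (our own sketch, not the appendix proof; interval arithmetic replaces each A_l by |A_l|, which only changes the constant):

```latex
% One layer: h_l(z) = \sigma(A_l z + b_l) with \sigma Lipschitz (constant c_\sigma)
\|h_l(z_1) - h_l(z_2)\| \le c_\sigma \, \|A_l\|_2 \, \|z_1 - z_2\|.
% Composing the S feature layers, for any x_1, x_2 with \|x_1 - x_2\| = \varepsilon:
\|f_{\theta_S}(x_1) - f_{\theta_S}(x_2)\|
  \le \Big( c_\sigma^S \prod_{l=1}^{S} \|A_l\|_2 \Big) \varepsilon,
% so D = c_\sigma^S \prod_{l=1}^{S} \|A_l\|_2 is one admissible choice,
% giving \bar{\varepsilon} = \varepsilon D in Lemma 1.
```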

5. EXPERIMENTS

Experiment protocol: The experiments are conducted on few-task few-shot image classification datasets, viz. a subset of the miniImageNet dataset called miniImageNet-S (Yao et al., 2022), and two medical image datasets, namely DermNet-S (Yao et al., 2022) and ISIC (Codella et al., 2018; Li et al., 2020). We begin our experiments with a few analyses and ablations to better understand the properties of our proposed method. We then empirically demonstrate the effectiveness of our proposed IBI method on the gradient-based meta-learning method MAML (Finn et al., 2017) as well as the prototype-based metric-learner ProtoNet (Snell et al., 2017) to show that IBI can be seamlessly integrated with multiple few-shot learning paradigms. For our experiments, we employ the commonly used "4-CONV" network (Vinyals et al., 2016) as well as the larger ResNet-12 network (Lee et al., 2019) to demonstrate the scalability of the proposed method (further details on scalability in Appendix E). We perform 5-way 1-shot and 5-way 5-shot classification on all the above datasets (except ISIC, where we use 2-way classification problems, similar to Yao et al. (2021), due to the lack of sufficient training classes). Further discussion on the datasets and implementation details of IBI, along with the choice of hyperparameters, can be found in the Appendix. In all reported cases, the average median distance is calculated with features after the third block.
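The N-way K-shot episodes used throughout the experiments can be sampled with a generic routine like the following (a sketch with our own names; the actual data pipeline follows Yao et al. (2022)):

```python
import numpy as np

def sample_episode(X, y, n_way=5, k_shot=1, n_query=15, rng=None):
    """Sample one few-shot task: pick n_way classes, then disjoint
    support (k_shot per class) and query (n_query per class) sets."""
    rng = rng if rng is not None else np.random.default_rng()
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    Xs, ys, Xq, yq = [], [], [], []
    for label, c in enumerate(classes):        # relabel classes as 0..n_way-1
        idx = rng.permutation(np.where(y == c)[0])
        Xs.append(X[idx[:k_shot]])
        ys += [label] * k_shot
        Xq.append(X[idx[k_shot:k_shot + n_query]])
        yq += [label] * n_query
    return np.concatenate(Xs), np.array(ys), np.concatenate(Xq), np.array(yq)
```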

Ablation studies on task interpolation:

We undertake an ablation study to highlight the importance of generating artificial tasks using IBP bound-based interpolation by comparing IBI with (1) inter-task interpolation on images, (2) inter-task interpolation in the feature embedding learned by f θ S , (3) Worst-Case Loss (WCL) on the ϵ-neighborhood (Gowal et al., 2019) along with IBP losses, (4) inter-task interpolation while minimizing ULBL (Morawiecki et al., 2020), (5) Gaussian noise-based perturbation (GA) in the image space with IBP losses, (6) Gaussian noise-based perturbation in the feature embedding space f θ S with IBP losses, (7) MLTI (Yao et al., 2022), which performs MixUp (Zhang et al., 2017) at randomly chosen layers of the learner, and (8) IBP bound-based interpolation without minimizing L U B and L LB while only optimizing L CE (more results in Appendix F.3). We perform the ablation study on 5-way 1-shot classification with the "4-CONV" network on miniImageNet-S, ISIC, and DermNet-S. From Table 2, we observe that IBI performs best in all cases. Moreover, inter-task interpolation at the same fixed layer S as IBI, as well as at randomly selected task-specific layers in MLTI, shows worse performance, demonstrating the superiority of the proposed interval bound-based interpolation mechanism. Further, it is interesting to observe that IBI, when performed without minimizing L U B and L LB , performs the worst. This behavior is not unexpected, as the neighborhoods are no longer guaranteed to be preserved by the learned embedding in this case, thus potentially resulting in the generation of out-of-manifold artificial tasks. To further check whether the tasks generated by IBI indeed follow the task distribution, we undertake an additional comparison based on the similarity of the artificial tasks with the corresponding original tasks.
Concretely, we define the distance between a task and its artificial counterpart as the median of the pairwise distances between the corresponding data instances in the two tasks. If an artificial task is created by combining two tasks, à la MLTI (Yao et al., 2022), we consider the smaller of the two median distances. We observe from Table 2 that the average median distance over 600 tasks is smaller for the proposed method compared to MLTI, as well as inter-task interpolation in the feature embedding learned by f θ S . This indicates that the tasks generated by IBI are more likely to lie close to the original task distribution.
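The interpolation rule of Section 4.2 and the median-distance diagnostic above can both be sketched in a few lines of numpy (function names are ours; we mirror the paper's class-wise shared λ_k and ν_k):

```python
import numpy as np

def ibi_interpolate(feat, feat_lb, feat_ub, classes, alpha=0.5, beta=0.5, rng=None):
    """Mix each embedded instance with one randomly chosen interval bound,
    sharing lambda_k and the bound choice nu_k across each class.
    Equivalent to (1-lam)*f + (1-nu)*lam*f_lb + nu*lam*f_ub."""
    rng = rng if rng is not None else np.random.default_rng()
    out = np.empty_like(feat)
    for k in np.unique(classes):
        lam = rng.beta(alpha, beta)      # lambda_k in [0, 1]
        nu = rng.integers(0, 2)          # nu_k: 0 -> lower bound, 1 -> upper
        bound = feat_ub if nu else feat_lb
        mask = classes == k
        out[mask] = (1 - lam) * feat[mask] + lam * bound[mask]
    return out                           # labels stay those of the original task

def task_distance(task_a, task_b):
    """Median of the instance-wise Euclidean distances between two tasks."""
    return np.median(np.linalg.norm(task_a - task_b, axis=1))
```

Since each output is a convex combination of an embedding and one of its bounds, the artificial instances stay inside the interval box, which is what keeps them near the task manifold when L LB and L U B are small.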

Importance of dynamic loss weighting:

To validate the usefulness of softmax-based dynamic weighting of the three losses for both IBP and IBI, we first find the average weights for each loss in a dynamic weight run and then plug in the respective values as static weights for new runs. All experiments in Table 3 are conducted on the miniImageNet-S dataset. From the upper half of Table 3 , we can see that the three average weights are always distinct with a definite trend in that L CE gets maximum importance followed by L U B while L LB contributes very little to the total loss L. This may be due to the particular "4-CONV" architecture used in this study which employs ReLU activations, thus implicitly limiting the spread of the lower bound (Gowal et al., 2019) . Further, the average weights of IBP and IBI are similar for a particular learner highlighting their commonalities, while they are distinct over different learners stressing their learner-dependent behavior. Further, in the lower half of Table 3 , we explore the effect of using static weights as well as the transferability of the loss weights across learners. In all cases, the softmax-based dynamic weighting outperforms static weighting, thus demonstrating the importance of dynamic weighting. Results on few-task few-shot classification problems: For evaluating the few-shot classification performance of IBI in few-task situations, we compare against the regularization-based meta-learning methods TAML (Jamal & Qi, 2019), Meta-Reg (Yin et al., 2019) , and Meta-Dropout (Lee et al., 2020) for MAML. We also compare against data augmentation-based methods like MetaMix (Yao et al., 2021) , Meta-Maxup (Ni et al., 2021) , and MLTI (Yao et al., 2022) for both MAML and ProtoNet. The results in Table 4 show that in keeping with the observation in Table 1 , IBP without task interpolation can improve upon the corresponding baselines. The incorporation of IBP-based task interpolation in IBI generally improves the results even further. 
Overall, we observe that both IBP and IBI outperform the other competitors, with the largest gains being observed for the ISIC dataset. 

Cross-domain transferability analysis:

The miniImageNet-S and DermNet-S datasets both allow 5-way 1-shot classification. Moreover, miniImageNet-S contains images from natural scenes while DermNet-S consists of medical images. Therefore, we undertake a cross-domain transferability study in Table 5. We summarize the Accuracy values obtained by a source model trained on DermNet-S but tested on miniImageNet-S and vice-versa (denoted DS → mIS and mIS → DS, respectively). We can see that in most cases the IBP variant is able to improve upon the corresponding baseline. Further, the interpolation-based methods, i.e. MLTI …
Results excerpt (miniImageNet-S 1-shot / 5-shot, ISIC 1-shot / 5-shot, DermNet-S 1-shot / 5-shot):
MAML+MetaReg (Yin et al., 2019; Yao et al., 2022): 38.35% / 51.74% | 58.57% / 68.45% | 45.01% / 60.92%
TAML (Jamal & Qi, 2019; Yao et al., 2022): 38.70% / 52.75% | 58.39% / 66.09% | 45.73% / 61.14%
MAML+Meta-Dropout (Lee et al., 2020; Yao et al., 2022): 38.32% / 52.53% | 58.40% / 67.32% | 44.30% / 60.86%
MAML+MetaMix (Yao et al., 2021; 2022): 39.43% / 54.14% | 60.34% / 69.47% | 46.81% / 63.52%
MAML+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 39…

6. CONCLUSION AND FUTURE WORKS

In this paper, we explore the utility of IBP beyond its originally intended usage of building and verifying classifiers that are provably robust against adversarial attacks. In summary, we identify the potential of IBP to conserve a neighborhood from the input image space to the learned feature space through the layers of a deep neural network, by minimizing the distances of the feature embedding from the two bounds. This can be effective in few-shot classification problems for obtaining feature embeddings where task neighborhoods are preserved, thus enabling easy adaptability to unseen tasks. Further, since interpolating between training tasks and their corresponding IBP bounds can yield artificial tasks with a higher chance of lying on the task manifold, we exploit this property of IBP to prevent overfitting to seen tasks in the few-task scenario. The resulting IBI training scheme is shown to be effective in both the meta-learning and metric-learning paradigms of few-shot learning.
We demonstrate in our results that IBI can be effectively scaled to relatively large networks like ResNet-12, as we typically only need to apply IBP to a few initial layers (see Appendix E). However, this still adds an extra computational cost (see Appendix E for a comparative study) which scales linearly with the number of layers subjected to IBP. Therefore, to limit the additional complexity and computational cost, a future direction of research may be to investigate the applicability of more advanced provably robust training methods that yield more efficient and tighter bounds (Lyu et al., 2021). Moreover, few-shot learners can also be improved with adaptive hyperparameters (Baik et al., 2020), feature reconstruction (Lee & Chung, 2021), knowledge distillation (Tian et al., 2020), embedding propagation (Rodríguez et al., 2020), etc. Thus, it may be interesting to observe the performance gains from these orthogonal techniques when coupled with IBI.
However, this may not be a straightforward endeavor, given the complex dynamic nature of such frameworks.

REPRODUCIBILITY STATEMENT

We have included the pseudo-codes and the PyTorch-based Python implementation of the proposed method in Appendices B and G, respectively. The description of all datasets used in this study, along with other key implementation details, is available in Appendices D and E. The hyperparameter settings for the different algorithms, along with their tuning strategy, are listed in Appendix F. For the theoretical analyses, complete proofs are provided in Appendix C. A copy of the code is available in Appendix G, while the same can also be found at https://anonymous.4open.science/r/maml-ibp-ibi-D072/.

A PROTOTYPE-BASED METRIC-LEARNING

Metric-based few-shot learning aims to obtain a feature embedding of the task manifold suitable for non-parametric classification. Prototype-based metric-learning, specifically the Prototypical Network (ProtoNet) (Snell et al., 2017), assigns a query point to the class having the nearest (in terms of Euclidean distance) prototype in the learned embedding space. Given the model f_θ and a task T_i, we first compute the class prototypes {c_k}_{k=1}^K as the mean of f_θ(x_{i,r}^s) over the instances x_{i,r}^s belonging to class k:

$c_k = \frac{1}{N_s} \sum_{(x_{i,r}^s, y_{i,r}^s) \in D_i^{s,k}} f_\theta(x_{i,r}^s),$    (9)

where D_i^{s,k} ⊂ D_i^s represents the subset of N_s support samples from class k. Given a sample x_{i,r}^q from the query set, the probability p(y_{i,r}^q = k | x_{i,r}^q) of assigning it to the k-th class is calculated using the distance function d(·,·) between the representation f_θ(x_{i,r}^q) and the prototype c_k:

$p(y_{i,r}^q = k \mid x_{i,r}^q, f_\theta) = \frac{\exp(-d(f_\theta(x_{i,r}^q), c_k))}{\sum_{k'} \exp(-d(f_\theta(x_{i,r}^q), c_{k'}))}.$    (10)

Thereafter, the parameters θ of the model f_θ can be trained by minimizing the cross-entropy loss (2). In the testing stage, each query sample x_{j,r}^q is assigned to the class with the maximal probability, i.e., y_{j,r}^q = arg max_k p(y_{j,r}^q = k | x_{j,r}^q).
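The prototype and class-probability computations of Eqs. (9) and (10) can be sketched in a few lines of NumPy. The function names `prototypes` and `class_probs` are our own; the paper's actual implementation in Appendix G is in PyTorch, and we use the squared Euclidean distance for d(·,·).

```python
import numpy as np

def prototypes(support_emb, support_lbl, n_classes):
    # c_k: mean embedding of the support instances of class k (Eq. 9)
    return np.stack([support_emb[support_lbl == k].mean(axis=0)
                     for k in range(n_classes)])

def class_probs(query_emb, protos):
    # p(y = k | x): softmax over negative squared Euclidean distances (Eq. 10)
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# A toy 2-way task with 2-D embeddings:
s_emb = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]])
s_lbl = np.array([0, 0, 1, 1])
protos = prototypes(s_emb, s_lbl, 2)
probs = class_probs(np.array([[0.1, 0.1]]), protos)   # query near class 0
```

The query embedding closest to a class prototype receives the largest probability for that class, matching the nearest-prototype assignment used at test time.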

B ALGORITHMS OF MAML AND PROTONET COUPLED WITH IBP AND IBI

The following Figure 4 illustrates a schematic diagram for the training of the IBP and IBI variants. The steps for MAML+IBP/IBI and ProtoNet+IBP/IBI are presented in Algorithm 1 and Algorithm 2, respectively. Please consult the main paper for the various notations and equations used in the algorithms.

Remark 1. The way in which the training support set D_i^s informs the loss calculation on the corresponding query set D_i^q differs between the MAML and ProtoNet variants. For MAML, a limited number of training steps on the support set is undertaken to obtain the model f_φi, and the loss is then calculated on the query set; for ProtoNet, the support set is used to calculate the prototypes {c_k}_{k=1}^K for the loss calculation on the query set.

Under review as a conference paper at ICLR 2023

Algorithm 1 IBP/IBI for MAML training
Requires: Task distribution p(T), batch size B, learning rates η_0 and η_1, interval coefficient ϵ.
1: Randomly initialize the meta-learner parameters θ.
2: while not converged do
3:   Sample a batch of B tasks from the distribution p(T).
4:   For IBI, randomly sample an index 1 ≤ m ≤ B to perform the interpolation.
5:   for all i ∈ {1, 2, ..., B} do
6:     Initialize the base learner to the meta-learner state.
7:     Sample a support set D_i^s of data-label pairs {(x_{i,r}^s, y_{i,r}^s)}_{r=1}^{N_s} from task T_i.
8:     Calculate the classification loss L_CE using f_θ(x_{i,r}^s) and y_{i,r}^s.
9:     if i = m then
10:      Generate interpolated support and query instances H_{i,r}^{s'} and H_{i,r}^{q'} using (8).
11:      Calculate the classification loss L'_CE using f_{θ_{L−S}}(H_{i,r}^{s'}) and y_{i,r}^s.
12:      Set L_CE = (1/2)(L_CE + L'_CE).
13:    end if
14:    Update the base learner parameters to φ_i = θ − η_0 ∇_θ L_CE.
15:    Sample a query set D_i^q of data-label pairs {(x_{i,r}^q, y_{i,r}^q)}_{r=1}^{N_q} from task T_i.
16:    Calculate the classification loss L_CE with f_{φ_i}(x_{i,r}^q) and y_{i,r}^q.
17:    Calculate L_LB and L_UB using (4) and (5), respectively.
18:    if i = m then
19:      Calculate the classification loss L'_CE using f_{φ_{L−S}}(H_{i,r}^{q'}) and y_{i,r}^q.
20:      Set L_CE = (1/2)(L_CE + L'_CE).
21:    end if
22:    Calculate L by accumulating L_CE, L_LB, and L_UB using (6).
23:  end for
24:  Update the meta-learner parameters θ = θ − η_1 (1/B) Σ_{i=1}^{B} ∇_θ L.
25: end while

Algorithm 2 IBP/IBI for ProtoNet training
Requires: Task distribution p(T), learning rate η, interval coefficient ϵ.
1: Randomly initialize the learner parameters θ.
2: while not converged do
3:   For IBI, randomly select if interpolation is to be performed.
4:   Sample a support set D_i^s of data-label pairs {(x_{i,r}^s, y_{i,r}^s)}_{r=1}^{N_s} from task T_i.
5:   Calculate the features f_{θ_L}(x_{i,r}^s) and find the prototypes {c_k}_{k=1}^K using (9).
6:   if interpolation is to be performed then
7:     Generate interpolated support and query instances H_{i,r}^{s'} and H_{i,r}^{q'} using (8).
8:     Calculate the features f_{θ_{L−S}}(H_{i,r}^{s'}) and find the prototypes {c'_k}_{k=1}^K.
9:   end if
10:  Sample a query set D_i^q of data-label pairs {(x_{i,r}^q, y_{i,r}^q)}_{r=1}^{N_q} from task T_i.
11:  Calculate the loss L_CE using (10) and (2).
12:  Calculate L_LB and L_UB using (4) and (5).
13:  if interpolation is to be performed then
14:    Calculate the classification loss L'_CE with f_{θ_{L−S}}(H_{i,r}^{q'}), {c'_k}_{k=1}^K, and y_{i,r}^q by (10) and (2).
15:    Set L_CE = (1/2)(L_CE + L'_CE).
16:  end if
17:  Calculate L by accumulating L_CE, L_LB, and L_UB using (6).
18:  Update the learner parameters θ = θ − η ∇_θ L.
19: end while
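The interpolation step of the algorithms (the generation of H' via Eq. (8)) can be sketched as follows. This is a NumPy illustration under our own naming; the mixing coefficient is drawn from a Beta(α, β) distribution, as described in the hyperparameter settings of Appendix F, and the bound is chosen uniformly at random between the IBP lower and upper bound of the embedding.

```python
import numpy as np

def interpolate_with_bound(feat, feat_lo, feat_hi, alpha=0.5, beta=0.5, rng=None):
    """H' = (1 - lam) * f_S(x) + lam * bound, where `bound` is the IBP lower
    or upper bound of the embedding, picked uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)                   # lam in [0, 1]
    bound = feat_lo if rng.random() < 0.5 else feat_hi
    return (1.0 - lam) * feat + lam * bound

rng = np.random.default_rng(0)
f  = np.array([0.5, 0.2])    # embedding f_S(x)
lo = np.array([0.4, 0.1])    # IBP lower bound
hi = np.array([0.6, 0.3])    # IBP upper bound
h = interpolate_with_bound(f, lo, hi, rng=rng)
```

Because the result is a convex combination of the embedding and one of its bounds, the artificial instance H' always stays inside the bounding box, which is why such tasks tend to lie near the task manifold.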

C DETAILED THEORETICAL ANALYSIS

Interval bound propagation for networks with affine layers: Let us assume a network $f$ with $L$ layers, where the $0$-th layer denotes the initial input, and consider a layer $l \leq L$ that is not the $0$-th input layer. The $0$-th layer of $f$ takes the input along with its perturbed counterparts, as shown in Section 3 of the main paper. Let the activation, upper bound, and lower bound at the end of the $(l-1)$-th layer be $z_{l-1}$, $\overline{z}_{l-1}$, and $\underline{z}_{l-1}$, respectively. If the $l$-th layer performs an affine transformation (such as a convolutional, fully connected, or batch normalization layer) followed by a monotonic activation function $\sigma$ (such as ReLU, sigmoid, or tanh), i.e., $z_l = \sigma(A_l z_{l-1} + b_l)$, then as per Gowal et al. (2019), we can calculate the interval bounds for the $l$-th layer as

$\underline{z}_l = \sigma(\mu_l - \psi_l), \qquad \overline{z}_l = \sigma(\mu_l + \psi_l),$

where $\psi_l = |A_l| \psi_{l-1}$ and $\mu_l = A_l \mu_{l-1} + b_l$, given $\mu_{l-1} = \frac{\overline{z}_{l-1} + \underline{z}_{l-1}}{2}$ and $\psi_{l-1} = \frac{\overline{z}_{l-1} - \underline{z}_{l-1}}{2}$.

Lemma 1 (Lipschitz networks ensure bounded IBP). Let $\overline{x}$ and $\underline{x}$ be $\varepsilon$-perturbations of $x \sim P_X$ for an $\varepsilon > 0$ (i.e., $\overline{x}, \underline{x} \in x(\varepsilon)$). Given that the activation $\sigma$ is Lipschitz continuous (such as ReLU) with constant $c_\sigma > 0$, there exists a constant $D = D(c_\sigma; A_1, A_2, \cdots, A_S; \varepsilon)$ such that $\overline{f}_{\theta_S}(x, \varepsilon)$ and $\underline{f}_{\theta_S}(x, \varepsilon)$ will at most be $\tilde{\varepsilon}$-perturbed versions of $f_{\theta_S}(x)$, where $\tilde{\varepsilon} = \varepsilon D$.

Proof. Given $x_1, x_2 \in \mathcal{X}$,

$\|h_1(x_1) - h_1(x_2)\| = \|\sigma(A_1 x_1 + b_1) - \sigma(A_1 x_2 + b_1)\| \leq c_\sigma \|A_1 (x_1 - x_2)\| \leq c_\sigma \|A_1\| \|x_1 - x_2\|,$    (13)

where $\|A_1\| = \sup_{\|x\|=1} \|A_1 x\|$. The first inequality in (13) is due to the Lipschitz continuity of $\sigma$; commonly used activation functions, such as ReLU, satisfy this condition (in particular, for ReLU, $c_\sigma = 1$). As such, the map $h_1$ also turns out to be Lipschitz continuous. A similar argument proves that $h_l$, $l = 2, \cdots, S$, all follow the same trait. As a result, $f_{\theta_S}$ also becomes Lipschitz continuous with the accompanying constant $(c_\sigma A)^S$, where $A = \max_l \{\|A_l\|\}$.
The recurrence relation of the extremities in IBP, as suggested by Gowal et al. (2019), can be written as:

$\overline{f}_{\theta_l}(x, \varepsilon) = \sigma\Big(\frac{A_l + |A_l|}{2} \overline{f}_{\theta_{l-1}}(x, \varepsilon) + \frac{A_l - |A_l|}{2} \underline{f}_{\theta_{l-1}}(x, \varepsilon) + b_l\Big),$ and
$\underline{f}_{\theta_l}(x, \varepsilon) = \sigma\Big(\frac{A_l - |A_l|}{2} \overline{f}_{\theta_{l-1}}(x, \varepsilon) + \frac{A_l + |A_l|}{2} \underline{f}_{\theta_{l-1}}(x, \varepsilon) + b_l\Big),$

where the $|\cdot|$ operator results in a matrix with all elements replaced by their corresponding absolute values, and $l = 1, 2, \ldots, S$. Thus,

$\|\overline{f}_{\theta_l}(x,\varepsilon) - f_{\theta_l}(x)\| = \Big\|\sigma\Big(\frac{A_l+|A_l|}{2}\overline{f}_{\theta_{l-1}}(x,\varepsilon) + \frac{A_l-|A_l|}{2}\underline{f}_{\theta_{l-1}}(x,\varepsilon) + b_l\Big) - \sigma\big(A_l f_{\theta_{l-1}}(x) + b_l\big)\Big\|$
$\leq c_\sigma \Big\|\frac{A_l+|A_l|}{2}\overline{f}_{\theta_{l-1}}(x,\varepsilon) + \frac{A_l-|A_l|}{2}\underline{f}_{\theta_{l-1}}(x,\varepsilon) - A_l f_{\theta_{l-1}}(x)\Big\|$
$= c_\sigma \Big\|\frac{A_l+|A_l|}{2}\big(\overline{f}_{\theta_{l-1}}(x,\varepsilon) - f_{\theta_{l-1}}(x)\big) + \frac{A_l-|A_l|}{2}\big(\underline{f}_{\theta_{l-1}}(x,\varepsilon) - f_{\theta_{l-1}}(x)\big)\Big\|$
$\leq c_\sigma \Big(\Big\|\frac{A_l+|A_l|}{2}\big(\overline{f}_{\theta_{l-1}}(x,\varepsilon) - f_{\theta_{l-1}}(x)\big)\Big\| + \Big\|\frac{|A_l|-A_l}{2}\big(f_{\theta_{l-1}}(x) - \underline{f}_{\theta_{l-1}}(x,\varepsilon)\big)\Big\|\Big)$
$\leq c_\sigma \Big(\Big\|\frac{A_l+|A_l|}{2}\Big\| \big\|\overline{f}_{\theta_{l-1}}(x,\varepsilon) - f_{\theta_{l-1}}(x)\big\| + \Big\|\frac{|A_l|-A_l}{2}\Big\| \big\|f_{\theta_{l-1}}(x) - \underline{f}_{\theta_{l-1}}(x,\varepsilon)\big\|\Big).$

Observe that, in particular, for $l = 1$,

$\|\overline{f}_{\theta_1}(x,\varepsilon) - f_{\theta_1}(x)\| \leq c_\sigma \Big(\Big\|\frac{A_1+|A_1|}{2}\Big\| \|\overline{x} - x\| + \Big\|\frac{|A_1|-A_1}{2}\Big\| \|x - \underline{x}\|\Big) = c_\sigma \varepsilon \Big(\Big\|\frac{A_1+|A_1|}{2}\Big\| + \Big\|\frac{|A_1|-A_1}{2}\Big\|\Big) = \varepsilon_1,$ say,

i.e., the deviation in the first layer can be made arbitrarily small based on $\varepsilon$. The quantity $\|f_{\theta_1}(x) - \underline{f}_{\theta_1}(x,\varepsilon)\|$ can be shown to be upper bounded using a similar argument. In other words, both $\overline{f}_{\theta_1}(x,\varepsilon)$ and $\underline{f}_{\theta_1}(x,\varepsilon)$ are at most $\varepsilon_1$-perturbed from $f_{\theta_1}(x)$. By the method of induction, we eventually get a $D = D(c_\sigma; A_1, A_2, \cdots, A_S; \varepsilon) > 0$ for which the lemma holds.

Theorem 1 (Generalization bound). Let $\mathbb{P}$ be the joint distribution of $(f_{\theta_S}(X), Y)$, supported on $\mathcal{H} \times \mathbb{R}$. In the multi-task regime, let $I$ denote the set of tasks, each consisting of $N$ samples. Define $R(N, |I|) = \mathbb{E}_{T_i \sim \hat{p}(T)} \mathbb{E}_{(X_j, Y_j) \sim \hat{p}(T_i)} [L_{CE}(f_{\theta_{L-S}}(H^*_j), Y_j)]$ and $R = \mathbb{E}_{T_i \sim p(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(f_{\theta_S}(X_j)), Y_j)]$.
For a bounded loss function $L_{CE}: \mathbb{R} \times \mathbb{R} \to [0, a]$ $(a \geq 0)$, if the neural network-induced map $f_{\theta_{L-S}}$ is such that $\|\nabla f_{\theta_{L-S}}(\cdot)\| < \infty$, we ensure that

$R(N, |I|) - R - \tilde{\lambda} \lesssim 2^{\frac{L-S+1}{2}} \sqrt{\log(2\kappa + 2)} \Big(\frac{1}{\sqrt{N}} + \frac{1}{\sqrt{|I|}}\Big) + \sqrt{\frac{\log(2|I|/\delta)}{N}} + \sqrt{\frac{\log(2/\delta)}{|I|}}$

holds with probability at least $1 - \delta$, where $\tilde{\lambda} = \tilde{\lambda}(\tilde{\varepsilon}, \lambda)$.

Proof. Before beginning with the proof, we point out that, based on Definition 1, given $\tilde{\varepsilon} > 0$ and $x \in \mathcal{X}$, any $x' \in x(\tilde{\varepsilon})$ can be written as $x' = x + \eta(\tilde{\varepsilon})$. For example, in the simplest case, $\eta(\tilde{\varepsilon})$ can be a function in the family $\pm \tilde{\varepsilon}\mathbf{1}$. Thus, in the case of IBI, $\overline{f}_{\theta_S}(x_i, \varepsilon)$ and $\underline{f}_{\theta_S}(x_i, \varepsilon)$ can both be expressed as $f_{\theta_S}(x_i) + \eta(\tilde{\varepsilon})$ with corresponding $\eta(\tilde{\varepsilon})$. In essence, $H^*_i = (1 - \lambda) f_{\theta_S}(x_i) + \lambda (f_{\theta_S}(x_i) + \eta(\tilde{\varepsilon}))$, where $\lambda \in [0, 1]$. Now, we can observe that

$f_{\theta_{L-S}}(H^*_i) = f_{\theta_{L-S}}\big((1-\lambda) f_{\theta_S}(x_i) + \lambda(f_{\theta_S}(x_i) + \eta(\tilde{\varepsilon}))\big) = f_{\theta_{L-S}}\big(f_{\theta_S}(x_i) + \lambda \eta(\tilde{\varepsilon})\big) \approx f_{\theta_{L-S}}(f_{\theta_S}(x_i)) + \lambda \nabla f_{\theta_{L-S}}(f_{\theta_S}(x_i))^\top \eta(\tilde{\varepsilon}),$    (14)

where $\eta(\tilde{\varepsilon}) \in \mathbb{R}^{\kappa}$, with $\tilde{\varepsilon}$ as mentioned in Lemma 1. We obtain (14) by using the Taylor expansion of $f_{\theta_{L-S}}$ up to the first order. Given that $\|\nabla f_{\theta_{L-S}}(\cdot)\| < \infty$, the second term $\lambda \nabla f_{\theta_{L-S}}(f_{\theta_S}(x_i))^\top \eta(\tilde{\varepsilon})$ can be made arbitrarily small; the higher-order terms in the expansion all follow suit, which justifies their omission. Now,

$\Big|\frac{1}{N}\sum_{i=1}^{N} L_{CE}(f_{\theta_{L-S}}(H^*_i), y_i) - \int_{\mathcal{H} \times \mathbb{R}} L_{CE}(f_{\theta_{L-S}}(x), y)\, d\mathbb{P}(x, y)\Big|$
$\leq \Big|\frac{1}{N}\sum_{i=1}^{N} \big(L_{CE}(f_{\theta_{L-S}}(H^*_i), y_i) - L_{CE}(f_{\theta_{L-S}}(f_{\theta_S}(x_i)), y_i)\big)\Big| + \Big|\frac{1}{N}\sum_{i=1}^{N} L_{CE}(f_{\theta_{L-S}}(f_{\theta_S}(x_i)), y_i) - \int_{\mathcal{H} \times \mathbb{R}} L_{CE}(f_{\theta_{L-S}}(x), y)\, d\mathbb{P}(x, y)\Big|.$    (15)

Since our networks use ReLU activations, the map induced by $f_{\theta_{L-S}}$ can be shown to be continuous. Given that $\mathcal{H}$ is compact, the output space also becomes compact. Restricted to such a space, the cross-entropy loss $L_{CE}$ (similarly, the regularized cross-entropy loss) turns out to be Lipschitz continuous.
Consequently, $|L_{CE}(f_{\theta_{L-S}}(H^*_i), y_i) - L_{CE}(f_{\theta_{L-S}}(f_{\theta_S}(x_i)), y_i)| \leq c_L \|f_{\theta_{L-S}}(H^*_i) - f_{\theta_{L-S}}(f_{\theta_S}(x_i))\| = \tilde{\lambda}(\tilde{\varepsilon}, \lambda)$, where $c_L > 0$ is the Lipschitz constant associated with $L_{CE}$. Without loss of generality, we can construct the map $f_{\theta_{L-S}}$ such that $\|f_{\theta_{L-S}}\| \leq 1$. Now, in case there are $|I|$ tasks involved, namely $\{T_i\}_{i=1}^{|I|}$ (i.e., the multi-task regime), the population risk turns out to be

$R = \mathbb{E}_{T_i \sim p(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} \big[L_{CE}\big(f_{\theta_{L-S}}(f_{\theta_S}(X_j)), Y_j\big)\big] = \mathbb{E}_{T_i \sim p(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} \big[L_{CE}\big(f_{\theta_{L-S}}(H_j), Y_j\big)\big].$

We are interested in observing the deviation of the same from the realized risk. In other words,

$|R(N, |I|) - R| \leq \underbrace{|R(N, |I|) - J|}_{(i)} + \underbrace{|J - R|}_{(ii)},$

where $J = \mathbb{E}_{T_i \sim \hat{p}(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)]$ and $\hat{p}$ is the empirical counterpart of the task distribution. Using Jensen's inequality, (i) can be upper bounded by

$\mathbb{E}_{T_i \sim \hat{p}(T)} \big|\mathbb{E}_{(X_j, Y_j) \sim \hat{p}(T_i)} [L_{CE}(f_{\theta_{L-S}}(H^*_j), Y_j)] - \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)]\big|$
$\leq \tilde{\lambda} + \mathbb{E}_{T_i \sim \hat{p}(T)} \big|\mathbb{E}_{(X_j, Y_j) \sim \hat{p}(T_i)} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)] - \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)]\big|,$    (18)

where we utilize the arguments (15) and (16) to reach (18). Using the union bound based on the $|I|$ tasks on top of Corollary 3.14 of Wojtowytsch & E (2020), we can show that the second term on the right-hand side of (18) becomes $\lesssim 2^{\frac{L-S+1}{2}} \sqrt{\frac{\log(2\kappa+2)}{N}} + a \sqrt{\frac{\log(2|I|/\delta)}{2N}}$, with probability at least $1 - \delta$. To put a deterministic upper bound on (ii), let us first define the class of functions $\mathcal{G} = \big\{g : g(T) = \mathbb{E}_{(f_{\theta_S}(X), Y) \sim \mathbb{P}} [L_{CE}(f_{\theta_{L-S}}(H), Y)];\ f_{\theta_{L-S}} \in \mathcal{W}^{L-S}\big\}$, where $\mathcal{W}^{L-S}$ is the function space induced by networks with $L-S$ hidden layers (Wojtowytsch & E, 2020).
Let us now calculate the Rademacher complexity of the class of functions $\mathcal{G}$:

$\mathrm{Rad}\big(\mathcal{G}, \{T_i\}_{i=1}^{|I|}\big) = \mathbb{E}_{\xi} \Big[\sup_{g \in \mathcal{G}} \frac{1}{|I|} \sum_{i=1}^{|I|} \xi_i g(T_i)\Big] = \mathbb{E}_{\xi} \Big[\sup_{g \in \mathcal{G}} \frac{1}{|I|} \sum_{i=1}^{|I|} \xi_i \mathbb{E}_{T_i} [L_{CE}(f_{\theta_{L-S}}(H), Y)]\Big]$
$\leq \mathbb{E}_{T_i} \mathbb{E}_{\xi} \Big[\sup_{g \in \mathcal{G}} \frac{1}{|I|} \sum_{i=1}^{|I|} \xi_i L_{CE}(f_{\theta_{L-S}}(H), Y)\Big]$    (19)
$\leq c_L \mathbb{E}_{T_i} \mathbb{E}_{\xi} \Big[\sup_{f_{\theta_{L-S}} \in \mathcal{W}^{L-S}} \frac{1}{|I|} \sum_{i=1}^{|I|} \xi_i f_{\theta_{L-S}}(H)\Big]$    (20)
$\leq c_L 2^{\frac{L-S+1}{2}} \sqrt{\frac{\log(2\kappa+2)}{|I|}},$    (21)

where (20) is due to the Lipschitz property of $L_{CE}(\cdot, y)$ [Lemma 26.9 of Shalev-Shwartz & Ben-David (2014) or Theorem 7 of Meir & Zhang (2003)]. We arrive at (21) using Lemma 3.13 of Wojtowytsch & E (2020). The inequality (19) is based on the fact that $\sup_{u \in \mathcal{U}} \mathbb{E}[u(X)] \leq \mathbb{E}[\sup_{u \in \mathcal{U}} u(X)]$, given that the expectation exists for the class of functions $\mathcal{U}$ and random variable $X$. Thus, we obtain the deterministic bound on (ii) given by

$\big|\mathbb{E}_{T_i \sim \hat{p}(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)] - \mathbb{E}_{T_i \sim p(T)} \mathbb{E}_{(X_j, Y_j) \sim T_i} [L_{CE}(f_{\theta_{L-S}}(H_j), Y_j)]\big| \lesssim 2^{\frac{L-S+1}{2}} \sqrt{\frac{\log(2\kappa + 2)}{|I|}} + \sqrt{\frac{\log(2/\delta)}{|I|}},$

which holds with probability at least $1 - \delta$. The bounds on (i) and (ii) together prove the theorem.
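The affine-layer recurrence at the start of this appendix can be illustrated numerically. Below is a NumPy sketch (our own, not the paper's PyTorch implementation) of one IBP step through an affine layer followed by ReLU, together with a soundness check that a concrete input inside the box maps inside the propagated box:

```python
import numpy as np

def ibp_affine_relu(z_lo, z_hi, A, b):
    """Propagate [z_lo, z_hi] through z -> relu(A @ z + b) via IBP:
    mu_l = A mu_{l-1} + b, psi_l = |A| psi_{l-1},
    bounds = sigma(mu_l - psi_l) and sigma(mu_l + psi_l)."""
    mu = (z_hi + z_lo) / 2.0
    psi = (z_hi - z_lo) / 2.0
    mu_out = A @ mu + b
    psi_out = np.abs(A) @ psi
    return np.maximum(mu_out - psi_out, 0.0), np.maximum(mu_out + psi_out, 0.0)

# Soundness check on a small example:
A = np.array([[1.0, -1.0], [2.0, 0.5]])
b = np.array([0.1, -0.2])
x = np.array([0.3, 0.7])
eps = 0.05
lo, hi = ibp_affine_relu(x - eps, x + eps, A, b)
y = np.maximum(A @ x + b, 0.0)   # exact output for the box center
```

Since ReLU is monotonic, applying it elementwise to the pre-activation bounds preserves their validity, which is exactly why the monotonicity assumption appears in the recurrence above.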

D DETAILS OF DATASETS USED IN THIS STUDY

miniImageNet: The miniImageNet dataset (Vinyals et al., 2016) is a commonly used subset of ImageNet (Deng et al., 2009), containing 100 classes with 600 images per class; following the commonly used split, 64 classes are used for meta-training, 16 for meta-validation, and 20 for meta-testing.

E IMPLEMENTATION DETAILS

Scheduling of ϵ: In their paper, Gowal et al. (2019) recommended starting with an initial perturbation ϵ_0 = 0 and gradually increasing it to the intended perturbation ϵ over the training steps. In our case, we follow a similar approach for scheduling the value of the perturbation ϵ_t at the t-th training step. We have observed that a fast increase in perturbation usually slows down training, while a very slow increment fails to aid the learner. From extensive experimentation with various scheduling techniques such as linear, cosine, etc., we have found that the following strategy works well in practice. If the maximum allowed number of training steps is set to T, then for step t the perturbation ϵ_t is calculated as:

ϵ_t = (t / (0.9 T)) ϵ if t ≤ ⌈0.9 T⌉, and ϵ_t = ϵ if t > ⌈0.9 T⌉.

In essence, we linearly increase ϵ_t starting from 0 up to ϵ over 90% of the maximum training steps T and keep it fixed at ϵ for the remainder of the training.

Frequency of interpolation for IBI variants: Performing IBP bound-based interpolation for every task during training may not be beneficial and may instead mislead the learner. For MAML, we have seen that performing interpolation once in every batch of B tasks aids the training process. In the case of ProtoNet, we have found that performing IBP bound-based interpolation with a 25% probability results in the best outcome.

Modifications to the network architecture: We have used two networks for our experiments, namely "4-CONV" and ResNet-12. The "4-CONV" network can be seamlessly integrated with IBI for both MAML and ProtoNet. This network consists of 4 blocks, each having a convolution, batch normalization, max pooling, and ReLU in sequential order; IBI can be performed after any one of the blocks. The ResNet-12 network also consists of 4 blocks, where each block (except the first one) receives inputs from (1) the output of the preceding block, and (2) the input of the preceding block through a skip connection.
While the idea of applying IBI after any of the blocks seems appealing, the presence of skip connections may hinder a straightforward integration of IBI in this case. To understand how ResNet-12 can be customized to accommodate MAML+IBP (and consequently MAML+IBI), we undertake an ablation study on the miniImageNet-S dataset for a 5-way 1-shot classification problem, as described in Table 6. We can observe that in our initial hyperparameter tuning experiment, MAML+IBP cannot match the performance of vanilla MAML on ResNet-12. Moreover, the performance gap increases as IBP is applied deeper into the network. This may be explained by the fact that the interval bounds become progressively looser as they propagate through the network. Thus, with increasing depth, the magnitude of the bound losses (especially L_UB, as the ReLU activations prevent L_LB from becoming too large) will largely outscale the classification loss and consequently affect convergence (see Remark 2). Applying IBP after only the first block still fails to achieve parity with the baseline because IBP induces a distortion in the feature space due to its regularization effect. While the sequential part of the blocks after IBP can adapt to this distortion owing to their complexity, the simpler skip paths cannot do so. Hence, the effect of the distortion keeps propagating to the deeper blocks via skip connections. To aid the network in such a situation, we investigate three approaches to modify the skip connection immediately after the block(s) subjected to IBP, viz. (1) remove the skip connection for the subsequent block, (2) introduce additional layers in the skip connection for the subsequent block to make it deeper and more complex, or (3) use a skip connection after one or more of the initial sub-block(s) (each consisting sequentially of one convolution, one batch normalization, and one ReLU layer) of the next block.
Among the three approaches, we empirically found that MAML+IBP (and consequently MAML+IBI) performs best when the skip connection starts after the second sub-block in block 2. Due to the comparatively powerful learning strategy of ProtoNet, no such modifications to ResNet-12 are necessary for ProtoNet+IBI.

Remark 2 (Scalability of IBI). IBP (and consequently IBI) requires the propagation of the two interval bounds along with the input data. This introduces a computational overhead, especially in deeper networks. However, in practice, even in a deeper network we may not need to perform IBP except in the initial few layers, as the bound losses will otherwise overwhelm the classification loss and consequently impact convergence. To demonstrate this, we plot the losses (up to 5000 training steps for ease of visualization) in Figure 5 for MAML+IBI using a ResNet-12 network for 5-way 1-shot miniImageNet-S classification, when IBP is applied up to blocks 1-4. We can see that the three losses have comparable scales only when IBP is applied after block 1. In all other cases, L_UB heavily dominates the total loss; yet, due to its sheer magnitude, the optimizer is unable to minimize it. Thus, in practice, IBP should be limited to only a few initial layers in deeper networks. Consequently, IBI easily scales to deeper networks despite the computational overhead.

Remark 3. To show that the IBP and IBI variants scale as well as their vanilla counterparts, we list the actual training costs in Table 7 in terms of the average time in seconds to execute a single training step of each algorithm. All the experiments are performed in the same environment using an RTX 3090 GPU. From Table 7, we can observe that, in the case of MAML, the IBI and IBP variants take only about 40%-70% additional time when "4-CONV" is used. The difference in cost reduces further if ResNet-12 is used as the backbone.
This is expected, as we only need to apply IBP in the first few layers of ResNet-12 to gain its full advantage. For ProtoNet, the increase in computational cost for the proposed techniques is observed to be slightly larger than that for MAML. Table 8 describes the hyperparameters used in vanilla MAML, MAML+IBP, and MAML+IBI, while Table 9 describes the hyperparameters used in vanilla ProtoNet, ProtoNet+IBP, and ProtoNet+IBI.
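The ϵ-schedule described at the beginning of this appendix can be sketched in a few lines of plain Python. The function name `epsilon_schedule` is our own illustrative choice:

```python
import math

def epsilon_schedule(t, T, eps_max):
    """Linearly ramp the perturbation from 0 to eps_max over the first 90%
    of the T training steps, then hold it constant at eps_max."""
    if t > math.ceil(0.9 * T):
        return eps_max
    return (t / (0.9 * T)) * eps_max

# Ramp over the first 900 of 1000 steps, then hold:
values = [epsilon_schedule(t, 1000, 0.1) for t in (0, 450, 900, 1000)]
```

The schedule is monotonically non-decreasing, so the bound losses grow gradually and do not destabilize the early phase of training.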

F.2 HYPERPARAMETER SEARCH SPACE AND TUNING

For hyperparameter tuning, we employ a grid search. In Table 10, we list the search spaces for each of the hyperparameters used in MAML+IBP and MAML+IBI. Moreover, in Table 11, we also detail the search spaces for each of the hyperparameters used in ProtoNet+IBP and ProtoNet+IBI. For all other learners used in Tables 1 and 2 of the main paper, the results are either taken from the corresponding article or reproduced using the originally recommended hyperparameter settings. In Tables 12 and 13, we report the optimal dataset-specific hyperparameters for MAML+IBP and MAML+IBI. Similarly, Tables 14 and 15 detail the optimal dataset-specific hyperparameter choices for ProtoNet+IBP and ProtoNet+IBI. For Table 3 in the main paper, the methods using static weights share the same hyperparameter settings as their dynamically weighted counterparts, except for γ, which is not used in the static-weight runs. For Table 4 in the main paper, all the MAML variants use the same settings as vanilla MAML. Further, for all the different interpolation strategies, a Beta distribution is used with the choices of α and β matching those of the MAML+IBI settings.

Contenders in Motivating Example:

For the contenders in Table 1, the settings are as follows: 1. MAML+SN on f_θS: This variant of MAML applies Spectral Normalization (Miyato et al., 2018) up to the S-th layer of the "4-CONV" network. Here, similar to MAML+IBP, the value of S is set to 3. 2. MAML+SN on f_θ: Here, Spectral Normalization is applied to the full network. 3. MAML+GL: In this variant, we calculate a Gaussian regularization loss instead of IBP. Here, we pass the query set along with its perturbed version and attempt to minimize their distance after the S-th layer alongside L_CE. The extra loss L_GL can be expressed as follows:

$L_{GL} = \frac{1}{N_q} \sum_{r=1}^{N_q} \|f_{\theta_S}(x_{i,r}^q) - f_{\theta_S}(x_{i,r}^q + \zeta)\|_2^2,$

…(Yao et al., 2021), as well as the inter-task interpolation method MLTI (Yao et al., 2022). In the case of metric-learners, we compare against the vanilla ProtoNet in addition to other notable methods like MatchingNet (Vinyals et al., 2016), RelationNet (Sung et al., 2018), IMP (Allen et al., 2019), and GNN (Satorras & Estrach, 2018). We also compare against ProtoNet coupled with data augmentation methods such as MetaMix, Meta-Maxup, and MLTI, as done in Yao et al. (2022). While Yao et al. (2022) had to modify the training strategy of the canonical ProtoNet to accommodate the changes introduced by MetaMix, Meta-Maxup, and MLTI, the flexibility of IBP and IBI imposes no such requirements. We summarize the findings in Table 18. We can observe that either IBP or IBI or both achieve better Accuracy than the competitors in all cases. The slightly better performance of IBP with ProtoNet seems to imply that IBP-based task interpolation is often unnecessary for ProtoNet when a large number of tasks is available.

Table 18 (5-way 1-shot / 5-way 5-shot Accuracy):
MAML (Finn et al., 2017): 48.70±1.75% / 63.11±0.91%
Meta-SGD (Li et al., 2017): 50.47±1.87% / 64.03±0.94%
Reptile (Nichol et al., 2018): 49.97±0.32% / 65.99±0.58%
LLAMA (Grant et al., 2018): 49.40±0.84% / -
R2-D2 (Bertinetto et al., 2019): 49.50±0.20% / 65.40±0.20%
TAML (Jamal & Qi, 2019; Yao et al., 2022): 46.40±0.82% / 63.26±0.68%
BOIL (Oh et al., 2021): 49.61±0.16% / 66.45±0.37%
MAML+Meta-Reg (Yin et al., 2019; Yao et al., 2022): 47.02±0.77% / 63.19±0.69%
MAML+Meta-Dropout (Lee et al., 2020; Yao et al., 2022): 47.47±0.81% / 64.11±0.71%
MAML+MetaMix (Yao et al., 2021; 2022): 47.81±0.78% / 64.22±0.68%
MAML+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 47.68±0.79% / 63.51±0.75%
MAML+MLTI (Yao et al., 2022): 48…
RelationNet (Sung et al., 2018): 50.44±0.82% / 65.32±0.70%
IMP (Allen et al., 2019): 49.60±0.80% / 68.10±0.80%
GNN (Satorras & Estrach, 2018): 49.02±0.98% / 63.50±0.84%
ProtoNet (Snell et al., 2017): 49.42±0.78% / 68.20±0.66%
ProtoNet*+MetaMix (Yao et al., 2021; 2022): 47.21±0.76% / 64.38±0.67%
ProtoNet*+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 47.33±0.79% / 64.43±0.69%
ProtoNet*+MLTI (Yao et al., 2022): 48…
Meta-SGD (Li et al., 2017): 48.97±0.21% / 66.47±0.21%
Reptile (Nichol et al., 2018): 49.97±0.32% / 65.99±0.58%
BOIL (Oh et al., 2021): 49…
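The Gaussian regularization loss L_GL used by the MAML+GL contender above can be sketched as follows. This is a NumPy illustration; `embed_fn` stands in for the embedding f_θS, and the function name and signature are our own assumptions rather than the paper's implementation.

```python
import numpy as np

def gaussian_reg_loss(embed_fn, x, sigma, rng=None):
    """L_GL: mean squared L2 distance between the embeddings of the query
    instances and their Gaussian-perturbed versions (zeta ~ N(0, sigma^2))."""
    rng = np.random.default_rng(0) if rng is None else rng
    zeta = rng.normal(0.0, sigma, size=x.shape)
    diff = embed_fn(x + zeta) - embed_fn(x)
    return float((diff ** 2).sum(axis=-1).mean())

# Sanity check: with zero noise the loss vanishes.
x = np.ones((4, 3))
loss0 = gaussian_reg_loss(lambda v: 2.0 * v, x, sigma=0.0)
loss1 = gaussian_reg_loss(lambda v: 2.0 * v, x, sigma=0.5)
```

Unlike IBP, this loss only penalizes the displacement caused by a single sampled perturbation rather than bounding the whole ϵ-neighborhood.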

Notes on contenders used in Table 2:

The extra parameter settings required for the contenders in Table 2 are as follows: 1. MAML+WCL: Here, given a task, its worst-case loss in the ϵ-neighborhood (Gowal et al., 2019) is added to the original loss. In essence, this acts similarly to augmentation with the worst-case logits. We tune the relative contributions of the original task and the worst-case task to the final L_CE following the recommendations made by Gowal et al. (2019).

2. MAML+GA (image space):

Here, the original task is perturbed with Gaussian noise to form the augmented task in the image space. The noise is sampled from a Gaussian with mean 0 and standard deviation σ = ϵ/2. The value of σ is scheduled similarly to ϵ. 3. MAML+GA (at the f_θS feature space): Here, the embedding of the original task after f_θS is perturbed with Gaussian noise. As in the image space, the mean of the normal distribution used for sampling noise can be set to 0. However, finding a good σ may not be straightforward, as the f_θS feature space is continuously updating. In our implementation, we take σ as half of the median distance between the original task and its bounds over a MAML+IBI run.

The corresponding full result tables are presented in Tables 20 and 21. Moreover, the full version of Table 5 in the main paper is presented in Table 22.


Table 19: Full version of Table 2 for performance comparison of MAML+IBI against 11 augmentation strategies, in the 5-way 1-shot setting. The results are reported in terms of mean Accuracy over 600 tasks along with the 95% confidence intervals.

MAML+Meta-Reg (Yin et al., 2019; Yao et al., 2022): 38.35±0.76% 51.74±0.68% - -
TAML (Jamal & Qi, 2019; Yao et al., 2022): 38.70±0.77% 52.75±0.70% - -
MAML+Meta-Dropout (Lee et al., 2020; Yao et al., 2022): 38.32±0.75% 52.53±0.69% - -
MAML+MetaMix (Yao et al., 2021; 2022): 39.43±0.77% 54.14±0.73% 42.26±0.75% 54.65±0.87%
MAML+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 39.28±0.77% 53.02±0.72% 41.97±0.78% 53.92±0.85%
MAML+MLTI (Yao et al., 2022): 41…

MAML+Meta-Reg (Yin et al., 2019; Yao et al., 2022): 58.57±0.94% 68.45±0.81% - -
TAML (Jamal & Qi, 2019; Yao et al., 2022): 58.39±1.00% 66.09±0.71% - -
MAML+Meta-Dropout (Lee et al., 2020; Yao et al., 2022): 58.40±1.02% 67.32±0.92% - -
MAML+MetaMix (Yao et al., 2021; 2022): 60.34±1.03% 69.47±0.60% 62.06±1.77% 72.18±1.75%
MAML+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 58.68±0.86% 69.16±0.61% 61.64±1.81% 72.04±1.79%
MAML+MLTI (Yao et al., 2022): 61…

MAML+Meta-Reg (Yin et al., 2019; Yao et al., 2022): 45.01±0.83% 60.92±0.69% - -
TAML (Jamal & Qi, 2019; Yao et al., 2022): 45.73±0.84% 61.14±0.72% - -
MAML+Meta-Dropout (Lee et al., 2020; Yao et al., 2022): 44.30±0.84% 60.86±0.73% - -
MAML+MetaMix (Yao et al., 2021; 2022): 46.81±0.81% 63.52±0.73% 51.40±0.89% 64.82±0.87%
MAML+Meta-Maxup (Ni et al., 2021; Yao et al., 2022): 46.10±0.82% 62.64±0.72% 50.82±0.85% 64.24±0.86%
MAML+MLTI (Yao et al., 2022): 48.03±0…

Excerpts from the implementation in Appendix G:

    # Get an argument parser for a training script.
    # Perform training or evaluation as per need.
    if args.only_evaluation is False:
        train(learner, train_set, val_set, output_file, output_folder,
              **train_kwargs(args))
    else:
        assert args.checkpoint is not None, \
            'For evaluating without training please provide a checkpoint'
        print('Evaluating...')

    # A functional forward that will actually be used for all requirements.
    # It only uses functionals, thus explicitly needs the weights to be passed.
    # The functionals can use the regular layer function or their IBP form as required.

    # Element-wise multiplier
    multiplier = torch.rsqrt(var + eps)
    multiplier = multiplier * weight
    offset = (-multiplier * mean) + bias
    multiplier = multiplier.unsqueeze(0).unsqueeze(2).unsqueeze(3)
    offset = offset.unsqueeze(0).unsqueeze(2).unsqueeze(3)
    # Because the scale might be negative, we need to apply a strategy similar to linear
    u = (input_p + input_n) / 2
    r = (input_p - input_n) / 2

    logits = self.net.functional_forward(inputs, fast_weight, None, False, None, None, None)
    fast_loss = F.cross_entropy(logits, labels)
    fast_gradients = torch.autograd.grad(fast_loss, fast_weight.values())
    # Split a few-shot task into a train and a test set.



Figure 1: Illustration of the proposed interval bound propagation-aided few-shot learning setup (best viewed in color): We use interval arithmetic to define a small ϵ-neighborhood around a training task T i sampled from the task distribution p(T ). IBP is then used to obtain the bounding box around the mapping of the said neighborhood in the embedding space f θ S given by the first S layers of the learner f θ . While training the learner f θ to minimize the classification loss L CE on the query set D qi , we additionally attempt to minimize the losses L LB and L U B , forcing the ϵ-neighborhood to be compact in the embedding space as well.

Figure 2: Interval bound-based task interpolation (best viewed in color): (a) Existing inter-task interpolation methods create new artificial tasks by combining pairs of original tasks (blue ball). However, depending on how flat the task-manifold embedding is at the layer where interpolation is performed, the artificial tasks may either be created close to the task manifold (green cross) or away from the task manifold (red box). (b) The proposed interval bound-based task interpolation creates artificial tasks by combining an original task with one of its interval bounds (yellow ball). Such artificial tasks are likely to be in the vicinity of the task manifold, as the interval bounds are forced to be close to the task embedding by the losses L LB and L U B .
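As a concrete sketch of panel (b), the following simplified single-embedding version of the interpolation (the name, defaults, and Beta parameters are illustrative; the appendix's `intra_class_mixup` does the same per class, in batch form) mixes an embedding with one randomly chosen interval bound. Because the result is a convex combination, it always stays inside the box [z − ϵ, z + ϵ]:

```python
import numpy as np

def ibi_interpolate(z, z_lb, z_ub, beta_a=0.5, beta_b=0.5, rng=None):
    # Mix an embedding z with one of its interval bounds (picked at
    # random), using a Beta-distributed mixing coefficient as in mixup.
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(beta_a, beta_b)
    bound = z_ub if rng.integers(0, 2) == 1 else z_lb
    return (1.0 - lam) * z + lam * bound
```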

Figure 4: Interval bound propagation-based few-shot training (best viewed in color): For each query data-label pair (x, y) in a given training task T i , we start by defining an ϵ-neighborhood [x − 1ϵ, x + 1ϵ] around x. The bounding box [f θ S (x, ϵ), f θ S (x, ϵ)] around the embedding f θ S (x) after the first S layers of the learner is found using IBP. In addition to the classification loss L CE , we also minimize the losses L LB and L U B , which respectively measure the distances of f θ S (x) to f θ S (x, ϵ) and f θ S (x, ϵ). A softmax across the three loss values is used to dynamically calculate the convex weights for the losses, so as to prioritize the minimization of the dominant loss(es) at any given training step. For IBP-based interpolation, artificial tasks T ′ i are created with instances H ′ formed by interpolating both the support and query instances with their corresponding lower or upper bounds. The mean of the classification loss L CE for T i and the corresponding extra loss L ′ CE for T ′ i is minimized.
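The dynamic convex weighting mentioned in the caption is simply a softmax over the current loss values. A minimal sketch (the function name and temperature default are illustrative; the appendix code does the equivalent with `F.softmax(concat_loss / self.softmax_temp, dim=0)`):

```python
import numpy as np

def dynamic_loss_weights(losses, temperature=1.0):
    # Softmax over the current loss values: the weights are convex
    # (non-negative, sum to one) and the currently dominant loss
    # receives the largest weight at this training step.
    z = np.asarray(losses, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()
```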

learner parameters θ = θ − η (1/B) Σ_{i=1}^{B} ∇_θ L.
25: end while

Algorithm 2 IBP/IBI for ProtoNet training
Requires: Task distribution p(T ), learning rate η, interval coefficient ϵ.
1: Randomly initialize the learner parameters θ.
2: while not converged do
3:
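The outer-loop update θ = θ − η (1/B) Σ_{i=1}^{B} ∇_θ L — averaging the per-task gradients over a meta-batch of size B before descending — can be sketched as (names are illustrative):

```python
import numpy as np

def meta_update(theta, task_grads, eta):
    # One outer-loop step: average the per-task gradients over the
    # meta-batch and take a single gradient-descent step of size eta.
    avg_grad = np.mean(np.stack(task_grads), axis=0)
    return theta - eta * avg_grad
```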

Figure 5: In the four plots of losses against training steps above, the blue, green, red, and magenta lines respectively denote IBI applied after blocks 1, 2, 3, and 4 of ResNet-12, without any additional modifications. (a) Plot of L UB in log scale for ease of visualization. (b) Plot of L LB . (c) Plot of L CE . (d) Plot of L in log scale for ease of visualization.

G CODES

Helper code: "image_data_process.py"

import os
import random
from PIL import Image
import numpy as np
import torch

def read_dataset(data_dir, val_presence=True):
    # Read the image dataset.
    if val_presence is True:
        return tuple(read_classes(os.path.join(data_dir, x)) for x in ['train', 'val', 'test'])
    else:
        return tuple(read_classes(os.path.join(data_dir, x)) for x in ['train', 'test', 'test'])

def read_classes(dir_path):
    # Read the class directories in a train/val/test directory.
    # Images should be in ".jpg" format.
    return [ImageProcessClass(os.path.join(dir_path, f)) for f in os.listdir(dir_path)]

class ImageProcessClass:
    # Loading and using the image dataset.
    # To use these APIs, you should prepare a directory that
    # contains three sub-directories: train, test, and val.
    def __init__(self, dir_path):
        self.dir_path = dir_path
        self.cache = {}

    def sample(self, num_images):
        # Sample images (as pytorch tensors) from the class.
        names = [f for f in os.listdir(self.dir_path) if f.endswith('.jpg')]
        random.shuffle(names)
        images = []
        for name in names[:num_images]:
            images.append(self.read_image(name))
        return images

    def read_image(self, name):
        # For reading images and transformations as necessary.
        # Image resolution is set to 84x84.
        if name in self.cache:
            return self.cache[name]
        with open(os.path.join(self.dir_path, name), 'rb') as in_file:
            img = Image.open(in_file).resize((84, 84)).convert('RGB')
            img = np.array(img).astype('float32') / 0xff
            img = np.rollaxis(img, 2, 0)
        self.cache[name] = torch.tensor(img)
        return self.read_image(name)

Main code: "run_learner.py"

import random
import os
import sys
import numpy as np
import torch
import argparse
from img_data_process import read_dataset
from datetime import datetime
from copy import deepcopy
from src.models import NetworkModel
from src.eval_model import bulk_evaluate
from src.train_model import train
from src.learners import Learner

def argument_parser():

join(f'{k}={v}' for k, v in vars(args).items()), file=fp)
device = torch.device('cuda')
# Instantiate the dataset.
train_set, val_set, test_set = read_dataset(DATA_DIR, val_presence)
# Instantiate the learner
model = NetworkModel(args.classes)
learner = Learner(model, device, **model_kwargs(args))

assert args.checkpoint is not None, 'For evaluating without training please provide a checkpoint'
print('Evaluating ...')
res_file = output_folder + '/' + 'test_performance' + time_string + '.txt'
with open(res_file, 'a+') as fp:
    print('Evaluation checkpoint: ' + args.checkpoint, file=fp)
checkpoint_model = torch.load(args.checkpoint, map_location='cuda:0')
learner.net.load_state_dict(checkpoint_model['model_state'])
learner.meta_optim.load_state_dict(checkpoint_model['meta_optim_state'])
train_accuracy, val_accuracy, test_accuracy = [], [], []
train_cnf, val_cnf, test_cnf = [], [], []
for ii in range(args.test_iters):
    train_acc, train_div = bulk_evaluate(learner, train_set, **evaluate_kwargs(args))
    val_acc, val_div = bulk_evaluate(learner, val_set, **evaluate_kwargs(args))
    test_acc, test_div = bulk_evaluate(learner, test_set, **evaluate_kwargs(args))
    train_accuracy.append(train_acc)
    val_accuracy.append(val_acc)
    test_accuracy.append(test_acc)
    train_cnf.append(train_div)
    val_cnf.append(val_div)
    test_cnf.append(test_div)
    with open(res_file, 'a+') as fp:
        print('Test iteration: ' + str(ii + 1), file=fp)
        print('Train accuracy: ' + str(train_accuracy[-1]) + ' +/- ' + str(train_cnf[-1]), file=fp)
        print('Validation accuracy: ' + str(val_accuracy[-1]) + ' +/- ' + str(val_cnf[-1]), file=fp)
        print('Test accuracy: ' + str(test_accuracy[-1]) + ' +/- ' + str(test_cnf[-1]) + '\n', file=fp)
save_path = output_folder + '/' + 'results' + '.npz'
train_accuracy = np.array(train_accuracy)
val_accuracy = np.array(val_accuracy)
test_accuracy = np.array(test_accuracy)
train_cnf = np.array(train_cnf)
val_cnf = np.array(val_cnf)
test_cnf = np.array(test_cnf)
np.savez(save_path, train_accuracy=train_accuracy, val_accuracy=val_accuracy, test_accuracy=test_accuracy, train_confidence=train_cnf, val_confidence=val_cnf, test_confidence=test_cnf)

"src/models.py"

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# A regular 4-CONV network
class NetworkModel(nn.Module):
    def __init__(self, k_way):
        # Initialize the network layers.
        super(NetworkModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=(1, 1))
        self.batch1 = nn.BatchNorm2d(64, track_running_stats=False)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=(1, 1))
        self.batch2 = nn.BatchNorm2d(64, track_running_stats=False)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=(1, 1))
        self.batch3 = nn.BatchNorm2d(64, track_running_stats=False)
        self.conv4 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=(1, 1))
        self.batch4 = nn.BatchNorm2d(64, track_running_stats=False)
        self.lin1 = nn.Linear(64 * 5 * 5, k_way)

    def forward(self, x):
        # A forward function only for reference.
        x = F.relu(F.max_pool2d(self.batch1(self.conv1(x)), 2))
        x = F.relu(F.max_pool2d(self.batch2(self.conv2(x)), 2))
        x = F.relu(F.max_pool2d(self.batch3(self.conv3(x)), 2))
        x = F.relu(F.max_pool2d(self.batch4(self.conv4(x)), 2))
        x = x.view(-1, 64 * 5 * 5)
        x = self.lin1(x)
        return x

    def functional_forward(self, x, weight_dict, layer_index=None, mixup_flag=None, k_way=None, beta_a=None, beta_b=None):

        # A functional forward that will actually be used for all requirements.
        # It only uses functionals and thus explicitly needs the weights to be passed.
        # The functionals can use the regular layer function or their IBP form as required.
        robust = True
        if layer_index is None:
            y, robust = None, False
        # Block 1
        x = robust_conv_forward(x, weight_dict['conv1.weight'], weight_dict['conv1.bias'], stride=1, padding=(1, 1), robust=robust)
        x = robust_batchnorm_forward(x, weight_dict['batch1.weight'], weight_dict['batch1.bias'], robust=robust)
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = F.relu(x)
        if layer_index == 1:
            y, x, robust = intra_class_mixup(x, mixup_flag, k_way, beta_a, beta_b)
        # Block 2
        x = robust_conv_forward(x, weight_dict['conv2.weight'], weight_dict['conv2.bias'], stride=1, padding=(1, 1), robust=robust)
        x = robust_batchnorm_forward(x, weight_dict['batch2.weight'], weight_dict['batch2.bias'], robust=robust)
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = F.relu(x)
        if layer_index == 2:
            y, x, robust = intra_class_mixup(x, mixup_flag, k_way, beta_a, beta_b)
        # Block 3
        x = robust_conv_forward(x, weight_dict['conv3.weight'], weight_dict['conv3.bias'], stride=1, padding=(1, 1), robust=robust)
        x = robust_batchnorm_forward(x, weight_dict['batch3.weight'], weight_dict['batch3.bias'], robust=robust)
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = F.relu(x)
        if layer_index == 3:
            y, x, robust = intra_class_mixup(x, mixup_flag, k_way, beta_a, beta_b)
        # Block 4
        x = robust_conv_forward(x, weight_dict['conv4.weight'], weight_dict['conv4.bias'], stride=1, padding=(1, 1), robust=robust)
        x = robust_batchnorm_forward(x, weight_dict['batch4.weight'], weight_dict['batch4.bias'], robust=robust)
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = F.relu(x)
        if layer_index == 4:
            y, x, robust = intra_class_mixup(x, mixup_flag, k_way, beta_a, beta_b)
        # Map to number of classes.
        x = x.view(-1, 64 * 5 * 5)
        x = F.linear(x, weight=weight_dict['lin1.weight'], bias=weight_dict['lin1.bias'])
        return y, x

def robust_conv_forward(x, weight, bias, stride, padding, robust):
    # Convolution function that can propagate interval bounds.
    if robust is False:
        # Regular convolution
        x = F.conv2d(x, weight, bias, stride, padding)
        return x
    # Convolution propagating interval bounds.
    b_size = x.shape[0] // 3
    input_p = x[:b_size]
    input_o = x[b_size:2 * b_size]
    input_n = x[2 * b_size:]
    u = (input_p + input_n) / 2
    r = (input_p - input_n) / 2
    out_u = F.conv2d(u, weight, bias, stride, padding)
    out_r = F.conv2d(r, torch.abs(weight), None, stride, padding)
    out_o = F.conv2d(input_o, weight, bias, stride, padding)
    return torch.cat([out_u + out_r, out_o, out_u - out_r], 0)

def robust_batchnorm_forward(x, weight, bias, robust):
    # Batch normalization function that can propagate interval bounds.
    if robust is False:
        # Regular batch normalization.
        x = F.batch_norm(x, running_mean=None, running_var=None, weight=weight, bias=bias, training=True)
        return x
    # Batch normalization propagating interval bounds.
    b_size = x.shape[0] // 3
    eps = 1e-5
    input_p = x[:b_size]
    input_o = x[b_size:2 * b_size]
    input_n = x[2 * b_size:]
    # Equivalent to input_o.mean((0, 2, 3))
    mean = input_o.transpose(0, 1).contiguous().view(input_o.shape[1], -1).mean(1)
    var = input_o.transpose(0, 1).contiguous().view(input_o.shape[1], -1).var(1, unbiased=False)
    # Element-wise multiplier
    multiplier = torch.rsqrt(var + eps)
    multiplier = multiplier * weight
    offset = (-multiplier * mean) + bias
    multiplier = multiplier.unsqueeze(0).unsqueeze(2).unsqueeze(3)
    offset = offset.unsqueeze(0).unsqueeze(2).unsqueeze(3)
    # Because the scale might be negative, we need to apply a strategy similar to linear
    u = (input_p + input_n) / 2
    r = (input_p - input_n) / 2

    out_u = torch.mul(u, multiplier) + offset
    out_r = torch.mul(r, torch.abs(multiplier))
    out_o = torch.mul(input_o, multiplier) + offset
    return torch.cat([out_u + out_r, out_o, out_u - out_r], 0)

def intra_class_mixup(y, mixup_flag, k_way, beta_a, beta_b):
    # Perform interval bound interpolation
    robust = False
    b_size = y.shape[0] // 3
    u = y[:b_size]
    l = y[2 * b_size:]
    o = y[b_size:2 * b_size]
    if mixup_flag is True:
        num_shots = b_size // k_way
        mixup_params = np.repeat(np.random.beta(beta_a, beta_b, k_way), num_shots)
        mixup_params = torch.tensor(mixup_params, dtype=torch.float).view(b_size, 1, 1, 1).to(y.device)
        rand_ext = np.repeat(np.random.randint(0, 2, k_way), num_shots)
        rand_ext = torch.tensor(rand_ext, dtype=torch.float).view(b_size, 1, 1, 1).to(y.device)
        mixup_params_c = 1 - mixup_params
        rand_ext_c = 1 - rand_ext
        ulo = (mixup_params_c * o + mixup_params * rand_ext * u + mixup_params * rand_ext_c * l)
        x = torch.cat([o, ulo], 0)
        return y, x, robust
    return y, o, robust

Learner: "src/learners.py"

import random
import torch
import torch.nn.functional as F
import numpy as np
import torch.optim as optim
from copy import deepcopy
from collections import OrderedDict

class Learner:
    # Base Learner class for MAML and IBP/IBI variants.
    def __init__(self, model, device, update_lr, meta_step_size, beta_a, beta_b, softmax_temp):
        # Initialization.
        self.device = device
        self.net = model.to(self.device)
        self.meta_optim = optim.Adam(self.net.parameters(), lr=meta_step_size)
        self.update_lr = update_lr
        self.beta_a, self.beta_b = beta_a, beta_b
        self.softmax_temp = softmax_temp

    def train_step(self, dataset, order=None, num_classes=None, num_shots=None, meta_shots=None, inner_iters=None, meta_batch_size=None, ibp_epsilon=None, mixup=False, ibp_layers=None):
        # Training function for MAML, MAML+IBP, and MAML+IBI learners.
        # For record keeping
        upper_loss_rec, lower_loss_rec, task_loss_rec, total_loss_rec = 0, 0, 0, 0
        # Triggers FOMAML and variants if required.
        create_graph, retain_graph = True, True
        if order == 1:
            create_graph, retain_graph = False, False
        self.meta_optim.zero_grad()
        if mixup is True:
            random_task = np.random.randint(0, meta_batch_size)
        # Iterate over tasks in a meta-batch
        for task_ind in range(meta_batch_size):
            fast_weight = OrderedDict(self.net.named_parameters())

            inputs = torch.stack(inputs).to(self.device)
            labels = torch.tensor(labels).to(self.device)
            # Fix ordering
            labels, sort_index = torch.sort(labels)
            inputs = inputs[sort_index]
            inputs_cat = torch.cat([inputs + ibp_epsilon, inputs, inputs - ibp_epsilon], 0)
            # Fast adaptation steps
            for _ in range(inner_iters):
                if mixup is True and task_ind == random_task:
                    # For MAML+IBI,
                    _, logits = self.net.functional_forward(inputs_cat, fast_weight, ibp_layers, mixup, num_classes, self.beta_a, self.beta_b)
                    b_size = logits.shape[0] // 2
                    logits_o = logits[:b_size]
                    logits_ulo = logits[b_size:]
                    fast_loss = (F.cross_entropy(logits_o, labels) + F.cross_entropy(logits_ulo, labels)) / 2
                else:
                    # For MAML and MAML+IBP,
                    _, logits = self.net.functional_forward(inputs, fast_weight, None, False, None, None, None)
                    fast_loss = F.cross_entropy(logits, labels)
                fast_gradients = torch.autograd.grad(fast_loss, fast_weight.values(), create_graph=create_graph)
                fast_weight = OrderedDict((name, param - self.update_lr * grad_param) for ((name, param), grad_param) in zip(fast_weight.items(), fast_gradients))
            # Query set
            inputs, labels = zip(*test_set)
            inputs = torch.stack(inputs).to(self.device)
            labels = torch.tensor(labels).to(self.device)
            # Fix ordering
            labels, sort_index = torch.sort(labels)
            inputs = inputs[sort_index]
            if ibp_layers is None:
                # Vanilla MAML,
                _, logits = self.net.functional_forward(inputs, fast_weight, None, False, None, None, None)
                total_loss = F.cross_entropy(logits, labels)
                task_loss_rec = task_loss_rec + total_loss.item()
                total_loss_rec = total_loss_rec + total_loss.item()
            else:
                # For MAML+IBP and MAML+IBI
                inputs_cat = torch.cat([inputs + ibp_epsilon, inputs, inputs - ibp_epsilon], 0)
                if mixup is True and task_ind == random_task:
                    # For MAML+IBI
                    ibp_estimate, logits = self.net.functional_forward(inputs_cat, fast_weight, ibp_layers, mixup, num_classes, self.beta_a, self.beta_b)
                    b_size = logits.shape[0] // 2
                    logits_o = logits[:b_size]
                    logits_ulo = logits[b_size:]
                    task_loss = (F.cross_entropy(logits_o, labels) + F.cross_entropy(logits_ulo, labels)) / 2
                else:
                    # For MAML+IBP
                    ibp_estimate, logits = self.net.functional_forward(inputs_cat, fast_weight, ibp_layers, False, None, None, None)
                    task_loss = F.cross_entropy(logits, labels)
                # Find the propagated bounds
                b_size = ibp_estimate.shape[0] // 3
                ibp_estimate_u = ibp_estimate[:b_size]
                ibp_estimate_o = ibp_estimate[b_size:2 * b_size]
                ibp_estimate_l = ibp_estimate[2 * b_size:]
                # Calculate $\mathcal{L}_{UB}$ and $\mathcal{L}_{LB}$.
                upper_loss = F.mse_loss(ibp_estimate_u, ibp_estimate_o)
                lower_loss = F.mse_loss(ibp_estimate_l, ibp_estimate_o)
                # Dynamic weighting of losses
                concat_loss = torch.cat([task_loss.unsqueeze(0), upper_loss.unsqueeze(0), lower_loss.unsqueeze(0)], 0)
                weights = F.softmax(concat_loss / self.softmax_temp, dim=0)
                total_loss = torch.sum(concat_loss * weights)
                # Record keeping
                upper_loss_rec = upper_loss_rec + upper_loss.item()
                lower_loss_rec = lower_loss_rec + lower_loss.item()
                task_loss_rec = task_loss_rec + task_loss.item()
                total_loss_rec = total_loss_rec + total_loss.item()
            total_loss.backward(retain_graph=retain_graph)
        # Averaging the loss over meta batches
        for params in self.net.parameters():
            params.grad = params.grad / meta_batch_size
        # update the meta learner parameters
        self.meta_optim.step()
        self.meta_optim.zero_grad()
        upper_loss_rec = upper_loss_rec / meta_batch_size
        lower_loss_rec = lower_loss_rec / meta_batch_size
        task_loss_rec = task_loss_rec / meta_batch_size
        total_loss_rec = total_loss_rec / meta_batch_size
        return upper_loss_rec, lower_loss_rec, task_loss_rec, total_loss_rec

    def evaluate(self, dataset, num_classes, num_shots, inner_iters):
        # Run a single evaluation of the model.
        # Preserve currently trained model.
        old_state = deepcopy(self.net.state_dict())
        fast_weight = OrderedDict(self.net.named_parameters())
        train_set, test_set = split_train_test(sample_mini_dataset(dataset, num_classes, num_shots + 1))
        # Support set
        inputs, labels = zip(*train_set)
        inputs = (torch.stack(inputs)).to(self.device)
        labels = (torch.tensor(labels)).to(self.device)
        # Fast adaptation
        for _ in range(inner_iters):
            _, logits = self.net.functional_forward(inputs, fast_weight, None, False, None, None, None)
            fast_loss = F.cross_entropy(logits, labels)
            fast_gradients = torch.autograd.grad(fast_loss, fast_weight.values())

            fast_weight = OrderedDict((name, param - self.update_lr * grad_param) for ((name, param), grad_param) in zip(fast_weight.items(), fast_gradients))
        # Query set
        inputs, labels = zip(*test_set)
        inputs = (torch.stack(inputs)).to(self.device)
        labels = (torch.tensor(labels)).to(self.device)
        # Infer
        _, logits = self.net.functional_forward(inputs, fast_weight, None, False, None, None, None)
        test_preds = (F.softmax(logits, dim=1)).argmax(dim=1)
        # Accuracy
        num_correct = torch.eq(test_preds, labels).sum()
        # Return network to original state for safety.
        self.net.load_state_dict(old_state)
        return num_correct.item()

def sample_mini_dataset(dataset, num_classes, num_shots):
    # Sample a few-shot task from a dataset.
    shuffled = list(dataset)
    random.shuffle(shuffled)
    for class_idx, class_obj in enumerate(shuffled[:num_classes]):
        for sample in class_obj.sample(num_shots):
            yield (sample, class_idx)

def mini_batches(samples, batch_size, num_batches):
    # Generate mini-batches from some data.
    samples = list(samples)
    cur_batch = []
    batch_count = 0
    while True:
        random.shuffle(samples)
        for sample in samples:
            cur_batch.append(sample)
            if len(cur_batch) < batch_size:
                continue
            yield cur_batch
            cur_batch = []
            batch_count += 1
            if batch_count == num_batches:
                return

def split_train_test(samples, test_shots=1):
    # Split a few-shot task into a train and a test set.

    train_set = list(samples)
    test_set = []
    labels = set(item[1] for item in train_set)
    for _ in range(test_shots):
        for label in labels:
            for i, item in enumerate(train_set):
                if item[1] == label:
                    del train_set[i]
                    test_set.append(item)
                    break
    if len(test_set) < len(labels) * test_shots:
        raise IndexError('not enough examples of each class for test set')
    return train_set, test_set

Training function: "src/train_model.py"

import os
import numpy as np
import torch
from .learners import Learner

def train(learner, train_set, val_set, model_output_file=None, model_save_path=None, order=None, num_classes=None, num_shots=None, meta_shots=None, inner_iters=None, meta_batch_size=None, meta_iters=None, eval_inner_iters=None, eval_interval=None, eval_interval_sample=None, ibp_epsilon=None, mixup=False, ibp_layers=None):
    # Train a model on a dataset.
    train_accuracy, val_accuracy = [], []
    upper_loss_store, lower_loss_store, task_loss_store, total_loss_store = [], [], [], []
    # Loop over the training steps.
    for i in range(meta_iters + 1):
        # Find current value of interval coefficient.
        cur_ibp_epsilon = eps_scheduler(i, meta_iters, ibp_epsilon)
        # Train the learner for a step.
        upper_loss, lower_loss, task_loss, total_loss = learner.train_step(train_set, order=order, num_classes=num_classes, num_shots=num_shots, meta_shots=meta_shots, inner_iters=inner_iters, meta_batch_size=meta_batch_size, ibp_epsilon=cur_ibp_epsilon, mixup=mixup, ibp_layers=ibp_layers)
        # Record losses
        upper_loss_store.append(upper_loss)
        lower_loss_store.append(lower_loss)
        task_loss_store.append(task_loss)
        total_loss_store.append(total_loss)
        if i % eval_interval == 0:
            # Perform intermediate evaluation.
            total_correct = 0
            for _ in range(eval_interval_sample):
                total_correct = total_correct + learner.evaluate(train_set, num_classes=num_classes, num_shots=num_shots, inner_iters=eval_inner_iters)
            train_accuracy.append(total_correct / (eval_interval_sample * num_classes))
            save_path = model_save_path + '/intermediate' + str(i) + 'model.pt'
            torch.save({'model_state': learner.net.state_dict(), 'meta_optim_state': learner.meta_optim.state_dict()}, save_path)
            total_correct = 0
            for _ in range(eval_interval_sample):
                total_correct = total_correct + learner.evaluate(val_set, num_classes=num_classes, num_shots=num_shots, inner_iters=eval_inner_iters)
            val_accuracy.append(total_correct / (eval_interval_sample * num_classes))
            with open(model_output_file, 'a+') as fp:
                print('batch %d: train=%f val=%f' % (i, train_accuracy[-1], val_accuracy[-1]), file=fp)
            # Intermediate record keeping.
            res_save_path = model_save_path + '/' + 'intermediate_accuracies.npz'
            loss_save_path = model_save_path + '/' + 'intermediate_losses.npz'
            np.savez(res_save_path, train_accuracy=np.array(train_accuracy), val_accuracy=np.array(val_accuracy))
            np.savez(loss_save_path, upper_loss=np.array(upper_loss_store), lower_loss=np.array(lower_loss_store), task_loss=np.array(task_loss_store), total_loss=np.array(total_loss_store))

def eps_scheduler(i, meta_iters, ibp_epsilon):
    # Schedule the value of interval coefficient.
    if i < meta_iters * 0.9:
        return (i / (meta_iters * 0.9)) * ibp_epsilon
    return ibp_epsilon

Inference function: "src/eval_model.py"

"""Helpers for evaluating models."""
import numpy as np
from .learners import Learner

def bulk_evaluate(learner, dataset, num_classes=5, num_shots=5, eval_inner_iters=10, num_samples=1000):
    # For evaluating the learner on a set of tasks.
    total_correct = []
    for _ in range(num_samples):
        total_correct.append(learner.evaluate(dataset, num_classes=num_classes, num_shots=num_shots, inner_iters=eval_inner_iters))
    total_accuracies = np.array(total_correct) / num_classes
    test_accuracy = total_accuracies.sum() / num_samples
    test_cnf = np.std(total_accuracies)
    # For confidence interval 0.95
    # z_score = 1.96
    # test_cnf = z_score * (test_cnf / (np.sqrt(num_samples)))
    return test_accuracy, test_cnf

Effect of IBP on MAML for miniImageNet and tieredImageNet datasets in terms of 5-way 1-shot Accuracy and intra-task compactness

Ablation on task interpolation strategies in terms of mean Accuracy and average median distance between original and interpolated tasks over 600 tasks on miniImageNet-S (mIS), ISIC and DermNet-S (DS).

Average loss weights for the two proposed methods, and a comparison of the static weighting and dynamic weighting versions, including transferability of static weight values across variants.

Transferability comparison of MAML and ProtoNet, with their MLTI, IBP, and IBI variants in terms of Accuracy over 600 tasks.

Performance comparison of the proposed method with baselines and contending algorithms in terms of mean Accuracy over 600 tasks.

for evaluating few-shot classifiers. The dataset contains a total of 100 classes, each containing 600 images of resolution 84 × 84 × 3. Following the directives of Vinyals et al. (2016), out of the total 100 classes, 64 are kept in the Training set, 16 are retained for Validation, and the remaining 20 classes are used for testing. In (Ren et al., 2018), the authors proposed a new, larger subset of ImageNet (Deng et al., 2009) to address the limitations of miniImageNet. In miniImageNet, it is not ensured that the classes used for training are distinct from those contained in the Test set. Evidently, this carries a risk of information leakage and may not provide a fair evaluation of the few-shot classifier. As a remedy, Ren et al. (2018) proposed to go higher in the class hierarchy of ImageNet. This enables tieredImageNet to use higher-level categories in the Training, Validation, and Test sets, maintaining significant diversity among the three. In essence, a total of 608 ImageNet leaf-level classes are considered, which can be categorized into 34 groups. Among these 34 higher-level groups, 20 are used for training, 6 are kept for validation, and the remaining 8 are included in the Test set. The Validation and Test sets are kept the same as those used in miniImageNet. The dataset, after discarding duplicates, contains more than 22,000 medical images spread across 625 classes of dermatological diseases. Following the preprocessing suggested by Prabhu et al. (2019), the authors of (Yao et al., 2022) created DermNet-S by first extracting the 203 classes containing more than 30 images each. Then, from the long-tailed data distribution of the 203 disease classes, the top 30 largest classes are kept for training, while the 53 smallest classes are used for meta-testing. The images are resized to 84 × 84 × 3 to match the resolution of miniImageNet. We follow the same dataset construction strategy in our case.
Moreover, we use random classes not included in the Training or Test set as the Validation set. The complete lists of classes in the Training and Test sets are as follows: Following Yao et al. (2022), for "ISIC 2018: Skin Lesion Analysis Towards Melanoma Detection" (Codella et al., 2018; Li et al., 2020), we select the third task, where 10,015 medical images are categorized into seven classes based on lesion type. We first resize the images to 84 × 84 × 3 to match the miniImageNet resolution. Then, among the seven classes in the ISIC dataset, we select the 4 classes containing the highest number of samples for training, while using the rest for meta-testing, as per the directives of Yao et al. (2022). Since there are only 4 classes in the Training set, setting the number of ways to 2 results in six possible class combinations per task. This consequently offers an extreme few-task scenario. For hyperparameter tuning, random classes are used as a Validation set, following the cross-validation-based approach employed in (Yao et al., 2022). The lists of classes in the Training and Test sets are as follows:
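The count of six possible 2-way tasks from the 4 ISIC training classes follows from C(4, 2) = 6, i.e., the number of unordered class pairs. A quick sketch (the class names here are placeholders, not the actual ISIC labels):

```python
from itertools import combinations

# Hypothetical placeholder names for the 4 ISIC training classes.
train_classes = ["lesion_a", "lesion_b", "lesion_c", "lesion_d"]

# Every distinct 2-way training task is an unordered pair of classes.
two_way_tasks = list(combinations(train_classes, 2))
print(len(two_way_tasks))  # -> 6
```

With so few distinct class combinations, the training task distribution is extremely narrow, which is exactly the few-task regime the paper targets.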

Ablation study of ResNet-12 modifications for MAML+IBP on miniImageNet-S in terms of mean Accuracy over 600 tasks with 95% confidence interval.

Actual computational cost in seconds for IBP and IBI variants of MAML and ProtoNet with "4-CONV" and ResNet-12 backbones.

Descriptions of hyperparameters used in vanilla MAML, MAML+IBP, and MAML+IBI.

Descriptions of hyperparameters used in vanilla ProtoNet, ProtoNet+IBP, and ProtoNet+IBI.

Optimal hyperparameter settings for ProtoNet+IBP and ProtoNet+IBI in 5-shot settings when the "4-CONV" network is used.

Optimal hyperparameter settings for MAML+IBP/IBI and ProtoNet+IBP/IBI when "ResNet-12" is used as the network.

Effect of IBP on MAML for miniImageNet and tieredImageNet datasets in terms of 5-way 1-shot Accuracy and intra-task compactness. This is the full version of Table 1.

Performance comparison of the two proposed methods with baselines and competing algorithms on miniImageNet and tieredImageNet datasets. The results are reported in terms of mean Accuracy over 600 tasks with 95% confidence interval.

Table 2 is detailed in Table 19. The full version of Table 4 in the main paper is provided here across the following tables.

Full results for MAML variants on miniImageNet-S, DermNet-S, and ISIC in Table 4 of the main paper. All results are reported in terms of Accuracy over 600 tasks along with 95% confidence intervals.

Full results for ProtoNet variants on miniImageNet-S, ISIC, and DermNet-S in Table 4 of the main paper. All results are reported in terms of Accuracy over 600 tasks along with 95% confidence intervals.

Full results for Table 5, describing the transferability comparison of MAML and ProtoNet with their MLTI, IBP, and IBI variants. All results are reported in terms of Accuracy over 600 tasks along with the 95% confidence intervals. Here, A → B indicates that the model trained on dataset A is tested on dataset B.

output_folder = args.dataset + '_' + args.algorithm + '_output_folder_' + time_string
output_file = output_folder + '/' + 'log_' + time_string + '.txt'

annex

Finn et al. (2017).
Inner loop iterations: Set to 5 following Finn et al. (2017).
Inner loop learning rate η0: Set to 0.01 following Finn et al. (2017).
Meta-step size η1: Set to 0.001 following Finn et al. (2017).
Meta-batch size B: Set to 4 following Finn et al. (2017).
Meta-iterations T: Set to 60000 for miniImageNet and tieredImageNet following Finn et al. (2017); set to 50000 for miniImageNet-S, DermNet-S, and ISIC following Yao et al. (2022).
Evaluation iterations: Set to 10 following Finn et al. (2017).

Additional hyperparameters introduced in MAML+IBP

Interval coefficient ϵ Searched in the set {0.05, 0.1, 0.2}.

Softmax coefficient γ

Searched in the set {0.01, 1, 10}.

Layer S

For the "4-CONV" learner containing 4 blocks of Convolution, Batch normalization, Max pooling, and ReLU, S is searched at the block level in the set {1, 2, 3, 4}. For example, S = 2 means IBP losses are calculated after the second block. For the "ResNet-12" network the ablation study in Appendix E provides the optimum choice of S.
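The block-level split described above can be sketched as follows. This is a minimal illustration under our own assumptions about a standard "4-CONV" backbone (the module layout, names, and block ordering are ours, not the paper's code); the point is how a user-specified S separates the embedding f_θS, on which the interval bounds are computed, from the remaining layers:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One "4-CONV" block: Conv -> BatchNorm -> ReLU -> MaxPool
    # (a standard ordering; ReLU and max pooling commute here).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

# The four blocks of the "4-CONV" learner.
blocks = nn.ModuleList([conv_block(3, 64)] +
                       [conv_block(64, 64) for _ in range(3)])

S = 2  # user-specified split: IBP losses are computed after block S

def forward_split(x, S):
    # f_theta_S: the first S blocks produce the embedding z_S on
    # which the interval bounds are computed.
    for block in blocks[:S]:
        x = block(x)
    z_S = x
    # f_theta_{L-S}: the remaining blocks complete the forward pass.
    for block in blocks[S:]:
        x = block(x)
    return z_S, x

x = torch.randn(2, 3, 84, 84)  # two 84 x 84 x 3 inputs
z_S, out = forward_split(x, S)
print(tuple(z_S.shape))  # -> (2, 64, 21, 21)
```

Setting S = 4 would place the bounds at the final embedding; the ablation referenced above tunes this choice per backbone.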

Additional hyperparameters introduced in MAML+IBI

α and β The search space contains three pairs of choices, (0.1, 1), (0.25, 1), and (0.5, 0.5), where each tuple lists the values of α and β in order.

Additional hyperparameters introduced in ProtoNet+IBI

α and β The search space contains three pairs of choices, (0.1, 1), (0.25, 1), and (0.5, 0.5), where each tuple lists the values of α and β in order.

where ζ ∼ N(0, σ), and the standard deviation σ is scheduled similarly to ϵ, starting from 0 and slowly increasing to ϵ/2.

4. MAML+ULBL: Following Morawiecki et al. (2020), we replace the two bound losses with a single one that calculates the distance between the upper and lower interval bounds. The loss L_ULBL in this case can be written as:

L_ULBL = ‖z̄_S(ϵ) − z_S(ϵ)‖₂

The full version of Table 1 is provided in the following Table 17.

# Create directory for storing results and initiate logging.
if os.path.exists(os.path.join(DATA_DIR, 'val')):
    val_presence = True
    print("Validation set is present.")
else:
    val_presence = False
    print("Validation set is not found. Exiting.")
    sys.exit()
time_string = datetime.now().strftime("%m%d%Y %H:%M:%S")
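The MAML+ULBL bound loss described above (the distance between the upper and lower interval bounds of the layer-S embedding) can be sketched in PyTorch as follows; the function name and the per-sample-L2-norm-averaged-over-the-batch reduction are our assumptions, not the paper's exact implementation:

```python
import torch

def ulbl_loss(upper_s, lower_s):
    # Single bound loss following the ULBL description: the distance
    # between the upper and lower interval bounds z_bar_S(eps) and
    # z_underbar_S(eps). Reduction (per-sample L2 norm, averaged over
    # the batch) is an assumption.
    diff = upper_s - lower_s
    return diff.flatten(start_dim=1).norm(dim=1).mean()

upper = torch.tensor([[1.0, 2.0]])
lower = torch.tensor([[0.0, 0.0]])
print(ulbl_loss(upper, lower))  # -> tensor(2.2361), i.e. sqrt(1 + 4)
```

Since IBP guarantees the upper bound dominates the lower bound elementwise, shrinking this distance tightens the ϵ-neighborhood of the embedding, which is why a single loss can stand in for the two separate bound losses.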

