BAYESIAN FEW-SHOT CLASSIFICATION WITH ONE-VS-EACH PÓLYA-GAMMA AUGMENTED GAUSSIAN PROCESSES

Abstract

Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Pólya-Gamma augmentation and the one-vs-each softmax approximation (Titsias, 2016) that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks.

1. INTRODUCTION

Few-shot classification (FSC) is a rapidly growing area of machine learning that seeks to build classifiers able to adapt to novel classes given only a few labeled examples. It is an important step towards machine learning systems that can successfully handle challenging situations such as personalization, rare classes, and time-varying distribution shift. The shortage of labeled data in FSC leads to uncertainty over the parameters of the model, known as model uncertainty or epistemic uncertainty. If model uncertainty is not handled properly in the few-shot setting, there is a significant risk of overfitting. In addition, FSC is increasingly being used for risk-averse applications such as medical diagnosis (Prabhu, 2019) and human-computer interfaces (Wang et al., 2019) where it is important for a few-shot classifier to know when it is uncertain.

Bayesian methods maintain a distribution over model parameters and thus provide a natural framework for capturing this inherent model uncertainty. In a Bayesian approach, a prior distribution is first placed over the parameters of a model. After data is observed, the posterior distribution over parameters is computed using Bayesian inference. This elegant treatment of model uncertainty has led to a surge of interest in Bayesian approaches to FSC that infer a posterior distribution over the weights of a neural network (Finn et al., 2018; Yoon et al., 2018; Ravi & Beatson, 2019).

Although conceptually appealing, there are several practical obstacles to applying Bayesian inference directly to the weights of a neural network. Bayesian neural networks (BNNs) are expensive from both a computational and memory perspective. Moreover, specifying meaningful priors in parameter space is known to be difficult due to the complex relationship between weights and network outputs (Sun et al., 2019). Gaussian processes (GPs) instead maintain a distribution over functions rather than model parameters.
The prior is directly specified by a mean and covariance function, which may be parameterized by deep neural networks. When used with Gaussian likelihoods, GPs admit closed-form expressions for the posterior and predictive distributions. They exchange the computational drawbacks of BNNs for cubic scaling with the number of examples. In FSC, where the number of examples is small, this is often an acceptable trade-off. When applying GPs to classification with a softmax likelihood, the non-conjugacy of the GP prior renders posterior inference intractable. Many approximate inference methods have been proposed to circumvent this, including variational inference and expectation propagation. In this paper we investigate a particularly promising class of approaches that augment the GP model with a set of auxiliary random variables, such that when they are marginalized out the original model is recovered (Albert & Chib, 1993; Girolami & Rogers, 2006; Linderman et al., 2015). Such augmentation-based approaches typically admit efficient Gibbs sampling procedures for generating posterior samples which, when combined with Fisher's identity (Douc et al., 2014), can be used to optimize the parameters of the mean and covariance functions. In particular, augmentation with Pólya-Gamma random variables (Polson et al., 2013) makes inference tractable in logistic models. Naively, this is useful for handling binary classification, but in this paper we show how to extend Pólya-Gamma augmentation to multiple classes by using the one-vs-each softmax approximation (Titsias, 2016), which can be expressed as a product of logistic sigmoids. We further show that the one-vs-each approximation can be interpreted as a composite likelihood (Lindsay, 1988; Varin et al., 2011), a connection which to our knowledge has not been made in the literature.
In this work, we make several contributions:
• We show how the one-vs-each softmax approximation (Titsias, 2016) can be interpreted as a composite likelihood consisting of pairwise conditional terms.
• We propose a novel GP classification method that combines the one-vs-each softmax approximation with Pólya-Gamma augmentation for tractable inference.
• We demonstrate competitive classification accuracy of our method on standard FSC benchmarks and challenging domain transfer settings.
• We propose several new benchmarks for uncertainty quantification in FSC, including calibration, robustness to input noise, and out-of-episode detection.
• We demonstrate improved uncertainty quantification of our method on the proposed benchmarks relative to standard few-shot baselines.

2. RELATED WORK

Our work is related to both GP methods for handling non-conjugate classification likelihoods and Bayesian approaches to few-shot classification. We summarize relevant work here.

2.1. GP CLASSIFICATION

Non-augmentation approaches. There are several classes of approaches for applying Gaussian processes to classification. The most straightforward method, known as least squares classification (Rifkin & Klautau, 2004), treats class labels as real-valued observations and performs inference with a Gaussian likelihood. The Laplace approximation (Williams & Barber, 1998) constructs a Gaussian approximate posterior centered at the posterior mode. Variational approaches (Titsias, 2009; Matthews et al., 2016) maximize a lower bound on the log marginal likelihood. In expectation propagation (Minka, 2001; Kim & Ghahramani, 2006; Hernandez-Lobato & Hernandez-Lobato, 2016), local Gaussian approximations to the likelihood are fitted iteratively to minimize KL divergence from the true posterior.

Augmentation approaches. Augmentation-based approaches introduce auxiliary random variables such that the original model is recovered when they are marginalized out. Girolami & Rogers (2006) propose a Gaussian augmentation for multinomial probit regression. Linderman et al. (2015) utilize Pólya-Gamma augmentation (Polson et al., 2013) and a stick-breaking construction to decompose a multinomial distribution into a product of binomials. Galy-Fajou et al. (2020) propose a logistic-softmax likelihood for classification and use Gamma and Poisson augmentation in addition to Pólya-Gamma augmentation in order to perform inference.

2.2. FEW-SHOT CLASSIFICATION

Meta-learning. A common approach to FSC is meta-learning, which seeks to learn a strategy to update neural network parameters when faced with a novel learning task. The Meta-learner LSTM (Ravi & Larochelle, 2017) learns a meta-level LSTM to recurrently output a new set of parameters for a base learner. MAML (Finn et al., 2017) learns initializations of deep neural networks that perform well on task-specific losses after one or a few steps of gradient descent by backpropagating through the gradient descent procedure itself. LEO (Rusu et al., 2019) performs meta-learning in a learned low-dimensional latent space from which the parameters of a classifier are generated.

Metric learning. Metric learning approaches learn distances such that input examples can be meaningfully compared. Siamese Networks (Koch, 2015) learn a shared embedding network along with a distance layer for computing the probability that two examples belong to the same class. Matching Networks (Vinyals et al., 2016) use nonparametric classification in the form of attention over nearby examples, which can be interpreted as a form of soft k-nearest neighbors in the embedding space. Prototypical Networks (Snell et al., 2017) make predictions based on distances to the nearest class centroids. Relation Networks (Sung et al., 2018) instead learn a more complex neural network distance function on top of the embedding layer.

Bayesian Few-shot Classification. More recently, Bayesian FSC approaches that attempt to infer a posterior over task-specific parameters have appeared. Grant et al. (2018) reinterpret MAML as an approximate empirical Bayes algorithm and propose LLAMA, which optimizes the Laplace approximation to the marginal likelihood. Bayesian MAML (Yoon et al., 2018) instead uses Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) to approximate the posterior distribution over model parameters.
VERSA (Gordon et al., 2019) uses amortized inference networks to obtain an approximate posterior distribution over task-specific parameters. ABML (Ravi & Beatson, 2019) uses a few steps of Bayes by Backprop (Blundell et al., 2015) on the support set to produce an approximate posterior over network parameters. CNAPs (Requeima et al., 2019) modulates task-specific Feature-wise Linear Modulation (FiLM) (Perez et al., 2018) layer parameters as the output of an adaptation network that takes the support set as input.

GPs for Few-shot Learning. There have been relatively few works applying GPs to few-shot learning. Tossou et al. (2020) consider Gaussian processes in the context of few-shot regression with Gaussian likelihoods. Deep Kernel Transfer (DKT) (Patacchiola et al., 2020) uses Gaussian processes with least squares classification to perform few-shot classification and learns covariance functions parameterized by deep neural networks. More recently, Titsias et al. (2020) apply GPs to meta-learning by maximizing the mutual information between the query set and a latent representation of the support set.

3. BACKGROUND

In this section we first review Pólya-Gamma augmentation for binary classification and the one-vs-each approximation before we introduce our method in Section 4.

3.1. PÓLYA-GAMMA AUGMENTATION

The Pólya-Gamma augmentation scheme was originally introduced to address Bayesian inference in logistic models (Polson et al., 2013). Suppose we have a vector of logits $\boldsymbol{\psi} \in \mathbb{R}^N$ with corresponding binary labels $\mathbf{y} \in \{0, 1\}^N$. The logistic likelihood is

$$p(\mathbf{y} \mid \boldsymbol{\psi}) = \prod_{i=1}^{N} \sigma(\psi_i)^{y_i} (1 - \sigma(\psi_i))^{1 - y_i} = \prod_{i=1}^{N} \frac{(e^{\psi_i})^{y_i}}{1 + e^{\psi_i}}, \tag{1}$$

where $\sigma(\cdot)$ is the logistic sigmoid function. Let the prior over $\boldsymbol{\psi}$ be Gaussian: $p(\boldsymbol{\psi}) = \mathcal{N}(\boldsymbol{\psi} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$. In Bayesian inference, we are interested in the posterior $p(\boldsymbol{\psi} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\psi})\, p(\boldsymbol{\psi})$, but the form of (1) does not admit analytic computation of the posterior due to non-conjugacy. The main idea of Pólya-Gamma augmentation is to introduce auxiliary random variables $\boldsymbol{\omega}$ to the likelihood such that the original model is recovered when $\boldsymbol{\omega}$ is marginalized out:

$$p(\mathbf{y} \mid \boldsymbol{\psi}) = \int p(\boldsymbol{\omega})\, p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega})\, d\boldsymbol{\omega}.$$

Conditioned on $\omega_i \sim \mathrm{PG}(1, 0)$, the batch likelihood is proportional to a diagonal Gaussian (see Section A for a full derivation):

$$p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega}) \propto \prod_{i=1}^{N} e^{-\omega_i \psi_i^2 / 2}\, e^{\kappa_i \psi_i} \propto \mathcal{N}(\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} \mid \boldsymbol{\psi}, \boldsymbol{\Omega}^{-1}), \tag{2}$$

where $\kappa_i = y_i - 1/2$ and $\boldsymbol{\Omega} = \mathrm{diag}(\boldsymbol{\omega})$. The conditional distribution over $\boldsymbol{\psi}$ given $\mathbf{y}$ and $\boldsymbol{\omega}$ is now tractable:

$$p(\boldsymbol{\psi} \mid \mathbf{y}, \boldsymbol{\omega}) \propto p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega})\, p(\boldsymbol{\psi}) \propto \mathcal{N}(\boldsymbol{\psi} \mid \tilde{\boldsymbol{\Sigma}}(\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} + \boldsymbol{\kappa}), \tilde{\boldsymbol{\Sigma}}), \tag{3}$$

where $\tilde{\boldsymbol{\Sigma}} = (\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Omega})^{-1}$. The conditional distribution of $\boldsymbol{\omega}$ given $\boldsymbol{\psi}$ and $\mathbf{y}$ can also be easily computed:

$$p(\omega_i \mid y_i, \psi_i) \propto \mathrm{PG}(\omega_i \mid 1, 0)\, e^{-\omega_i \psi_i^2 / 2} \propto \mathrm{PG}(\omega_i \mid 1, \psi_i), \tag{4}$$

where the last expression follows from the exponential tilting property of Pólya-Gamma random variables. This suggests a Gibbs sampling procedure in which iterates $\boldsymbol{\omega}^{(t)} \sim p(\boldsymbol{\omega} \mid \mathbf{y}, \boldsymbol{\psi}^{(t-1)})$ and $\boldsymbol{\psi}^{(t)} \sim p(\boldsymbol{\psi} \mid \mathbf{y}, \boldsymbol{\omega}^{(t)})$ are drawn sequentially until the Markov chain reaches its stationary distribution, which is the joint posterior $p(\boldsymbol{\psi}, \boldsymbol{\omega} \mid \mathbf{y})$. Fortunately, efficient samplers for the Pólya-Gamma distribution have been developed (Windle et al., 2014) to facilitate this.
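To make the binary-case procedure concrete, the following is a minimal NumPy sketch of the Gibbs sampler. It draws approximate Pólya-Gamma variates by truncating the infinite Gamma convolution given in (17) of Section A; this is slightly biased low, and production code would instead use the exact sampler of Windle et al. (2014). All function and variable names are our own illustration, not the paper's implementation.

```python
import numpy as np

def sample_pg(b, c, n_terms=200, rng=None):
    # Approximate draw from PG(b, c) by truncating the Gamma convolution:
    # omega = (1 / (2 pi^2)) * sum_k Ga(b, 1) / ((k - 1/2)^2 + c^2 / (4 pi^2))
    rng = np.random.default_rng() if rng is None else rng
    c = np.atleast_1d(np.asarray(c, dtype=float))
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(b, 1.0, size=(c.size, n_terms))
    denom = (k - 0.5) ** 2 + (c.reshape(-1, 1) ** 2) / (4 * np.pi ** 2)
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

def gibbs_pg_logistic(y, mu, Sigma, n_iter=500, rng=None):
    # Alternate omega | psi  (eq. 4)  and  psi | y, omega  (eq. 3)
    rng = np.random.default_rng() if rng is None else rng
    kappa = y - 0.5
    Sigma_inv = np.linalg.inv(Sigma)
    psi = rng.multivariate_normal(mu, Sigma)   # initialize from the prior
    samples = []
    for _ in range(n_iter):
        omega = sample_pg(1.0, psi, rng=rng)
        Sigma_tilde = np.linalg.inv(Sigma_inv + np.diag(omega))
        mean = Sigma_tilde @ (Sigma_inv @ mu + kappa)
        psi = rng.multivariate_normal(mean, Sigma_tilde)
        samples.append(psi)
    return np.array(samples)
```

For reference, PG(1, 0) has mean 1/4, which the truncated sampler reproduces closely.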

3.2. ONE-VS-EACH APPROXIMATION TO SOFTMAX

The one-vs-each (OVE) approximation (Titsias, 2016) was formulated as a lower bound to the softmax likelihood in order to handle classification over a large number of output classes, where computation of the normalizing constant is prohibitive. We employ the OVE approximation not to deal with extreme classification, but rather due to its compatibility with Pólya-Gamma augmentation, as we shall soon see. The one-vs-each approximation can be derived by first rewriting the softmax likelihood as follows:

$$p(y = i \mid \mathbf{f}) \triangleq \frac{e^{f_i}}{\sum_j e^{f_j}} = \frac{1}{1 + \sum_{j \neq i} e^{-(f_i - f_j)}}, \tag{5}$$

where $\mathbf{f} \triangleq (f_1, \ldots, f_C)$ are the logits. Since in general $\prod_k (1 + \alpha_k) \geq 1 + \sum_k \alpha_k$ for $\alpha_k \geq 0$, the softmax likelihood (5) can be bounded as follows:

$$p(y = i \mid \mathbf{f}) \geq \prod_{j \neq i} \frac{1}{1 + e^{-(f_i - f_j)}} = \prod_{j \neq i} \sigma(f_i - f_j), \tag{6}$$

which is the OVE lower bound. This expression avoids the normalizing constant and factorizes into a product of pairwise sigmoids, which is amenable to Pólya-Gamma augmentation for tractable inference.
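As a quick numerical sanity check (our own illustration, not from the paper), the product-of-sigmoids expression can be verified to lower-bound the exact softmax, with equality in the two-class case:

```python
import numpy as np

def softmax_prob(f, i):
    # exact softmax probability of class i (shifted for numerical stability)
    e = np.exp(f - f.max())
    return e[i] / e.sum()

def ove_bound(f, i):
    # one-vs-each lower bound: product of pairwise sigmoids sigma(f_i - f_j)
    diffs = f[i] - np.delete(f, i)
    return np.prod(1.0 / (1.0 + np.exp(-diffs)))
```

When C = 2 the bound is tight, since the softmax itself reduces to a single pairwise sigmoid.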

4. ONE-VS-EACH PÓLYA-GAMMA GPS

In this section, we first show how the one-vs-each (OVE) approximation can be interpreted as a pairwise composite likelihood. We then introduce our method for GP-based Bayesian few-shot classification, which brings together OVE and Pólya-Gamma augmentation in a novel combination.

4.1. OVE AS A COMPOSITE LIKELIHOOD

Titsias (2016) showed that the OVE approximation shares the same global optimum as the softmax maximum likelihood, suggesting a close relationship between the two. We show here that in fact OVE can be interpreted as a pairwise composite likelihood version of the softmax. Composite likelihoods (Lindsay, 1988; Varin et al., 2011) are a type of approximate likelihood often employed when the exact likelihood is intractable or otherwise difficult to compute. Given a collection of marginal or conditional events $\{E_1, \ldots, E_K\}$ and parameters $\mathbf{f}$, a composite likelihood is defined as:

$$L_{\mathrm{CL}}(\mathbf{f} \mid y) \triangleq \prod_{k=1}^{K} L_k(\mathbf{f} \mid y)^{w_k}, \tag{7}$$

where $L_k(\mathbf{f} \mid y) \propto p(y \in E_k \mid \mathbf{f})$ and $w_k \geq 0$ are arbitrary weights. In order to make the connection to OVE, it will be useful to let the one-hot encoding of the label $y$ be denoted as $\mathbf{y} \in \{0, 1\}^C$. Define a set of $C(C-1)/2$ pairwise conditional events $E_{ij}$, one for each pair of classes $i \neq j$, indicating the event that the model's output matches the target label for classes $i$ and $j$ conditioned on all the other classes:

$$p(y \in E_{ij} \mid \mathbf{f}) \triangleq p(y_i, y_j \mid \mathbf{y}_{\neg ij}, \mathbf{f}), \tag{8}$$

where $\neg ij$ denotes the set of classes not equal to either $i$ or $j$. This expression resembles the pseudolikelihood (Besag, 1975), but instead of a single conditional event per output site, the expression in (8) considers all pairs of sites. Stoehr & Friel (2015) explored similar composite likelihood generalizations of the pseudolikelihood in the context of random fields. Now suppose that $y_c = 1$ for some class $c \notin \{i, j\}$. Then $p(y_i, y_j \mid \mathbf{y}_{\neg ij}, \mathbf{f}) = 1$ due to the one-hot constraint. Otherwise either $y_i = 1$ or $y_j = 1$. In this case, assume without loss of generality that $y_i = 1$ and $y_j = 0$, and thus

$$p(y_i, y_j \mid \mathbf{y}_{\neg ij}, \mathbf{f}) = \frac{e^{f_i}}{e^{f_i} + e^{f_j}} = \sigma(f_i - f_j). \tag{9}$$

The composite likelihood defined in this way with unit component weights is therefore

$$L_{\mathrm{OVE}}(\mathbf{f} \mid \mathbf{y}) = \prod_i \prod_{j \neq i} p(y_i, y_j \mid \mathbf{y}_{\neg ij}, \mathbf{f}) = \prod_i \prod_{j \neq i} \sigma(f_i - f_j)^{y_i}. \tag{10}$$
Alternatively, we may simply write $L_{\mathrm{OVE}}(\mathbf{f} \mid y = i) = \prod_{j \neq i} \sigma(f_i - f_j)$, which is identical to the OVE bound (6).
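The reduction from pairwise conditionals to the OVE bound can be checked numerically. The sketch below (our own illustration) evaluates each pairwise conditional term and confirms that their product over all unordered class pairs collapses to the product of sigmoids involving only the true class:

```python
import numpy as np

def pairwise_conditional(i, j, label, f):
    # p(y_i, y_j | y_{-ij}, f): equals 1 when the true class lies outside
    # {i, j}; otherwise a two-way softmax between classes i and j
    if label not in (i, j):
        return 1.0
    a, b = (i, j) if label == i else (j, i)
    return np.exp(f[a]) / (np.exp(f[a]) + np.exp(f[b]))

def composite_likelihood(label, f):
    # product over all unordered pairs of classes, unit weights
    C = len(f)
    L = 1.0
    for i in range(C):
        for j in range(i + 1, C):
            L *= pairwise_conditional(i, j, label, f)
    return L
```

Only the C - 1 pairs containing the true class contribute non-trivial factors, recovering the one-vs-each form.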

4.2. GP CLASSIFICATION WITH THE OVE LIKELIHOOD

We now turn our attention to GP classification. Suppose we have access to examples $\mathbf{X} \in \mathbb{R}^{N \times D}$ with corresponding one-hot labels $\mathbf{Y} \in \{0, 1\}^{N \times C}$, where $C$ is the number of classes. We consider the logits jointly as a single vector $\mathbf{f} \triangleq (f_1^1, \ldots, f_N^1, f_1^2, \ldots, f_N^2, \ldots, f_1^C, \ldots, f_N^C)$ and place an independent GP prior on the logits for each class: $f^c(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$. Therefore we have $p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K})$, where $\mu_i^c = m(\mathbf{x}_i)$ and $\mathbf{K}$ is block diagonal with $K_{ij}^c = k(\mathbf{x}_i, \mathbf{x}_j)$ for each block $\mathbf{K}^c$. The Pólya-Gamma integral identity used to derive (2) does not have a multi-class analogue, and thus a direct application of the augmentation scheme to the softmax likelihood is nontrivial. Instead, we propose to directly replace the softmax with the OVE-based composite likelihood function from (10) with unit weights. The posterior over $\mathbf{f}$ when using OVE as the likelihood function can be expressed as:

$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}) \propto p(\mathbf{f} \mid \mathbf{X}) \prod_{i=1}^{N} \prod_{c \neq y_i} \sigma(f_i^{y_i} - f_i^c), \tag{11}$$

to which Pólya-Gamma augmentation can be applied, as we show in the next section. Our motivation for using a composite likelihood therefore differs from the traditional motivation, which is to avoid the use of a likelihood function which is intractable to evaluate. Instead, we employ a composite likelihood because it makes posterior inference tractable when coupled with Pólya-Gamma augmentation. Prior work on Bayesian inference with composite likelihoods has shown that the composite posterior is consistent under fairly general conditions for correctly specified models (Miller, 2019) but can produce overly concentrated posteriors (Pauli et al., 2011; Ribatet et al., 2012), since each component likelihood event is treated as independent when in reality there may be significant dependencies. Nevertheless, we show in Section 5 that in practice our method exhibits competitive accuracy and strong calibration relative to baseline few-shot learning algorithms.
We leave further theoretical analysis of the OVE composite posterior and its properties for future work. Compared to choices of likelihoods used by previous approaches, there are several reasons to prefer OVE. Relative to the Gaussian augmentation approach of Girolami & Rogers (2006), Pólya-Gamma augmentation has the benefit of fast mixing and the ability of a single value of ω to capture much of the marginal distribution over function values. The stick-breaking construction of Linderman et al. (2015) induces a dependence on the ordering of classes, which leads to undesirable asymmetry. Finally, the logistic-softmax likelihood of Galy-Fajou et al. (2020) requires three augmentations and careful learning of the mean function to avoid a priori underconfidence (see Section F.1 for more details).

4.3. POSTERIOR INFERENCE VIA GIBBS SAMPLING

We now describe how we perform tractable posterior inference in our model with Gibbs sampling. Define the matrix $\mathbf{A} \triangleq \text{OVE-MATRIX}(\mathbf{Y})$ to be a $CN \times CN$ sparse block matrix with $C$ row partitions and $C$ column partitions. Each block $\mathbf{A}^{cc'}$ is a diagonal $N \times N$ matrix defined as follows:

$$\mathbf{A}^{cc'} \triangleq \mathrm{diag}(\mathbf{Y}_{\bullet c'}) - \mathbb{1}[c = c']\, \mathbf{I}_N,$$

where $\mathbf{Y}_{\bullet c'}$ denotes the $c'$th column of $\mathbf{Y}$. Now the binary logit vector $\boldsymbol{\psi} \triangleq \mathbf{A}\mathbf{f} \in \mathbb{R}^{CN}$ will have entries equal to $f_i^{y_i} - f_i^c$ for each unique combination of $c$ and $i$, of which there are $CN$ in total. The OVE composite likelihood can now be written as

$$L(\boldsymbol{\psi} \mid \mathbf{Y}) = 2^N \prod_{j=1}^{NC} \sigma(\psi_j), \tag{12}$$

where the $2^N$ term arises from the $N$ cases in which $\psi_j = 0$ due to comparing the ground truth logit with itself. Analogous to (2), the likelihood of $\boldsymbol{\psi}$ conditioned on $\boldsymbol{\omega}$ and $\mathbf{Y}$ is proportional to a diagonal Gaussian:

$$L(\boldsymbol{\psi} \mid \mathbf{Y}, \boldsymbol{\omega}) \propto \prod_{j=1}^{NC} e^{-\omega_j \psi_j^2 / 2}\, e^{\kappa_j \psi_j} \propto \mathcal{N}(\boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} \mid \boldsymbol{\psi}, \boldsymbol{\Omega}^{-1}), \tag{13}$$

where $\kappa_j = 1/2$ and $\boldsymbol{\Omega} = \mathrm{diag}(\boldsymbol{\omega})$. By exploiting the fact that $\boldsymbol{\psi} = \mathbf{A}\mathbf{f}$, we can express the likelihood in terms of $\mathbf{f}$ and write down the conditional composite posterior as follows:

$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) \propto \mathcal{N}(\boldsymbol{\Omega}^{-1}\boldsymbol{\kappa} \mid \mathbf{A}\mathbf{f}, \boldsymbol{\Omega}^{-1})\, \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K}) \tag{14}$$
$$\propto \mathcal{N}(\mathbf{f} \mid \tilde{\boldsymbol{\Sigma}}(\mathbf{K}^{-1}\boldsymbol{\mu} + \mathbf{A}^{\top}\boldsymbol{\kappa}), \tilde{\boldsymbol{\Sigma}}), \tag{15}$$

where $\tilde{\boldsymbol{\Sigma}} = (\mathbf{K}^{-1} + \mathbf{A}^{\top}\boldsymbol{\Omega}\mathbf{A})^{-1}$, an expression remarkably similar to (3). Analogous to (4), the conditional distribution over $\boldsymbol{\omega}$ given $\mathbf{f}$ and the data becomes $p(\boldsymbol{\omega} \mid \mathbf{Y}, \mathbf{f}) = \mathrm{PG}(\boldsymbol{\omega} \mid 1, \mathbf{A}\mathbf{f})$. The primary computational bottleneck of posterior inference lies in sampling $\mathbf{f}$ from (15). Since $\tilde{\boldsymbol{\Sigma}}$ is a $CN \times CN$ matrix, a naive implementation has complexity $O(C^3 N^3)$. By utilizing the matrix inversion lemma and Gaussian sampling techniques summarized in Doucet (2010), this can be brought down to $O(C N^3)$. Details may be found in Section B.
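The block structure of A can be sketched directly. The following dense construction (our own illustration; for small N and C only, since the paper stores A as sparse) builds the block matrix and shows that the resulting ψ = Af stacks the pairwise logit differences:

```python
import numpy as np

def ove_matrix(Y):
    # Build the CN x CN block matrix A with blocks
    # A^{c c'} = diag(Y[:, c']) - 1[c == c'] * I_N, so that (A f) stacks
    # f_i^{y_i} - f_i^c over classes c and examples i (class-major order).
    N, C = Y.shape
    I = np.eye(N)
    blocks = [[np.diag(Y[:, cp]) - (c == cp) * I for cp in range(C)]
              for c in range(C)]
    return np.block(blocks)
```

With f stored class-major as a (C, N) array, entry (c, i) of the reshaped product equals f[y_i, i] - f[c, i], matching the definition of ψ above.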

4.4. LEARNING COVARIANCE HYPERPARAMETERS FOR FEW-SHOT CLASSIFICATION

We now describe how we apply OVE Pólya-Gamma augmented GPs to few-shot classification. We assume the standard episodic few-shot setup in which one observes a labeled support set S = (X, Y). Predictions must then be made for a query example (x*, y*). We consider a zero-mean GP prior over the class logits $f^c(\mathbf{x}) \sim \mathcal{GP}(0, k_{\theta}(\mathbf{x}, \mathbf{x}'))$, where θ are learnable parameters of our covariance function. These could include traditional hyperparameters such as lengthscales or the weights of a deep neural network as in deep kernel learning (Wilson et al., 2016). We consider two objectives for learning hyperparameters of the covariance function: the marginal likelihood (ML) and the predictive likelihood (PL). Marginal likelihood measures the likelihood of the hyperparameters given the observed data and is intuitively appealing from a Bayesian perspective. On the other hand, many standard FSC methods optimize for predictive likelihood on the query set (Vinyals et al., 2016; Finn et al., 2017; Snell et al., 2017). Both objectives marginalize over latent functions, thereby making full use of our Bayesian formulation. The details of these objectives and how we compute gradients can be found in Section C. Our learning algorithm for both marginal and predictive likelihood may be found in Section D. Details of computing the posterior predictive distribution p(y* | x*, X, Y, ω) may be found in Section E. Finally, details of our chosen "cosine" kernel may be found in Section H.
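For illustration, one plausible form of a "cosine" kernel on deep embeddings is a normalized dot product; this is a sketch under that assumption, since the paper's exact form (including any learned output scale) is given in its Section H, which is not reproduced here:

```python
import numpy as np

def cosine_kernel(X1, X2, eps=1e-8):
    # A plausible "cosine" kernel on embedding vectors: normalized dot
    # product, giving values in [-1, 1] and unit "prior variance" on the
    # diagonal. (Assumed form; not necessarily the paper's exact kernel.)
    X1n = X1 / (np.linalg.norm(X1, axis=1, keepdims=True) + eps)
    X2n = X2 / (np.linalg.norm(X2, axis=1, keepdims=True) + eps)
    return X1n @ X2n.T
```

A kernel of this kind bounds the prior variance of each logit, which is convenient when the inputs are unnormalized neural network features.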

5. EXPERIMENTS

In this section, we present our results on few-shot classification both in terms of accuracy and uncertainty quantification. Additional results comparing the one-vs-each composite likelihood to the softmax, logistic softmax, and Gaussian likelihoods may be found in Section F. One of our aims is to compare methods based on uncertainty quantification. We therefore developed new benchmark evaluations and tasks: few-shot calibration, robustness, and out-of-episode detection. In order to empirically compare methods, we could not simply borrow the accuracy results from other papers, but instead needed to train each of these baselines ourselves. For all baselines except Bayesian MAML, ABML, and Logistic Softmax GP, we ran the code from Patacchiola et al. (2020) and verified that the accuracies matched closely to their reported results. We have made PyTorch code for our experiments publicly available.

5.1. FEW-SHOT CLASSIFICATION

For our few-shot classification experiments, we follow the training and evaluation protocol of Patacchiola et al. (2020). We train both 1-shot and 5-shot versions of our model in four different settings: Caltech-UCSD Birds (CUB) (Wah et al., 2011), mini-ImageNet with the split proposed by Ravi & Larochelle (2017), as well as two cross-domain transfer tasks. The first transfer task entails training on mini-ImageNet and testing on CUB, and the second measures transfer from Omniglot (Lake et al., 2011) to EMNIST (Cohen et al., 2017). Experimental details and an overview of the baselines we used can be found in Section G. Classification results are shown in Tables 1 and 2. We find that our proposed Pólya-Gamma OVE GPs yield strong classification results, outperforming the baselines in five of the eight scenarios.

5.2. UNCERTAINTY QUANTIFICATION THROUGH CALIBRATION

We next turn to uncertainty quantification, an important concern for few-shot classifiers. When used in safety-critical applications such as medical diagnosis, it is important for a machine learning system to defer when there is not enough evidence to make a decision. Even in non-critical applications, precise uncertainty quantification helps practitioners in the few-shot setting determine when a class has an adequate amount of labeled data or when more labels are required, and can facilitate active learning.

Published as a conference paper at ICLR 2021

We chose several commonly used metrics for calibration. Expected calibration error (ECE) (Guo et al., 2017) measures the expected binned difference between confidence and accuracy. Maximum calibration error (MCE) is similar to ECE but measures the maximum difference instead of the expected difference. Brier score (BRI) (Brier, 1950) is a proper scoring rule computed as the squared error between the output probabilities and the one-hot label. For a recent perspective on metrics for uncertainty evaluation, please refer to Ovadia et al. (2019). The results for representative approaches on 5-shot, 5-way CUB can be found in Figure 1. Our OVE PG GPs are the best calibrated overall across the metrics.

Figure 1 (caption): … and Brier Score (BRI) for 5-shot 5-way tasks on CUB (additional calibration results can be found in Appendix I). Metrics are computed on 3,000 random tasks from the test set. The last two plots are our proposed method.
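For concreteness, the ECE and Brier computations described above can be sketched in a few lines. The equal-width binning scheme and bin count below are our own choices (Guo et al. (2017) use a similar equal-width binning):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=15):
    # ECE: average over equal-width confidence bins of |accuracy - confidence|,
    # weighted by the fraction of examples falling in each bin
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return ece

def brier_score(probs, labels):
    # mean squared error between predicted probabilities and one-hot labels
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```

MCE is the same binned computation with a maximum over bins in place of the weighted average.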

5.3. ROBUSTNESS TO INPUT NOISE

Input examples for novel classes in FSC may have been collected under conditions that do not match those observed at training time. For example, labeled support images in a medical diagnosis application may come from a different hospital than the training set. To mimic a simplified version of this scenario, we investigate robustness to input noise. We used the imagecorruptions package (Michaelis et al., 2019) to apply Gaussian noise, impulse noise, and defocus blur to both the support and query sets of episodes at test time and evaluated both accuracy and calibration. We used a corruption severity of 5 (severe) and evaluated across 1,000 randomly generated tasks on the three datasets involving natural images. The robustness results for Gaussian noise are shown in Figure 2. Full quantitative results tables for each noise type may be found in Section J. We find that in general Bayesian approaches tend to be robust due to their ability to marginalize over hypotheses consistent with the support labels. Our approach is one of the top performing methods across all settings.

5.4. OUT-OF-EPISODE DETECTION

Finally, we measure performance on out-of-episode detection, another application in which uncertainty quantification is important. In this experiment, we used 5-way, 5-shot support sets at test time but incorporated out-of-episode examples into the query set. Each episode had 150 query examples: 15 from each of 5 randomly chosen in-episode classes and 15 from each of 5 randomly chosen out-of-episode classes. We then computed the AUROC of binary outlier detection using the negative of the maximum logit as the score. Intuitively, if none of the support classes assign a high logit to the example, it can be classified as an outlier. The results are shown in Figure 3. Our approach generally performs the best across the datasets.
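The scoring rule and detection metric above can be sketched as follows (our own illustration; the AUROC is computed here via the rank-statistic formulation rather than a library call):

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    # Probability that a random positive (outlier) scores above a random
    # negative, counting ties as 1/2 (the Mann-Whitney U formulation of AUROC)
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def outlier_score(logits):
    # score for outlier detection: negative of the maximum logit per example
    return -np.max(logits, axis=1)
```

Perfectly separated scores yield an AUROC of 1.0, while indistinguishable scores yield 0.5.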

6. CONCLUSION

In this work, we have proposed a Bayesian few-shot classification approach based on Gaussian processes. Our method replaces the ordinary softmax likelihood with a one-vs-each pairwise composite likelihood and applies Pólya-Gamma augmentation to perform inference. This allows us to model class logits directly as function values and efficiently marginalize over uncertainty in each few-shot episode. Modeling functions directly enables our approach to avoid the dependence on model size that posterior inference in weight-space-based models inherently has. Our approach compares favorably to baseline FSC methods under a variety of dataset and shot configurations, including dataset transfer. We also demonstrate strong uncertainty quantification, robustness to input noise, and out-of-episode detection. We believe that Bayesian modeling is a powerful tool for handling uncertainty and hope that our work will lead to broader adoption of efficient Bayesian inference in the few-shot scenario.

A DERIVATION OF PÓLYA-GAMMA AUGMENTED LOGISTIC LIKELIHOOD

In this section, we show the derivation for the augmented logistic likelihood presented in Section 3.1. First, recall the logistic likelihood:

$$p(\mathbf{y} \mid \boldsymbol{\psi}) = \prod_{i=1}^{N} \sigma(\psi_i)^{y_i} (1 - \sigma(\psi_i))^{1 - y_i} = \prod_{i=1}^{N} \frac{(e^{\psi_i})^{y_i}}{1 + e^{\psi_i}}, \tag{16}$$

where $\sigma(\cdot)$ is the logistic sigmoid function. We have a Gaussian prior $p(\boldsymbol{\psi}) = \mathcal{N}(\boldsymbol{\psi} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ and introduce Pólya-Gamma auxiliary random variables $\boldsymbol{\omega}$ to the likelihood such that the original model is recovered when $\boldsymbol{\omega}$ is marginalized out: $p(\mathbf{y} \mid \boldsymbol{\psi}) = \int p(\boldsymbol{\omega})\, p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega})\, d\boldsymbol{\omega}$. The Pólya-Gamma distribution $\omega \sim \mathrm{PG}(b, c)$ can be written as an infinite convolution of Gamma distributions:

$$\omega \stackrel{D}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{\mathrm{Ga}(b, 1)}{(k - 1/2)^2 + c^2/(4\pi^2)}. \tag{17}$$

The following integral identity holds for $b > 0$:

$$\frac{(e^{\psi})^a}{(1 + e^{\psi})^b} = 2^{-b} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega, \tag{18}$$

where $\kappa = a - b/2$ and $\omega \sim \mathrm{PG}(b, 0)$.
Specifically, when $a = y$ and $b = 1$, we recover an individual term of the logistic likelihood (16):

$$p(y \mid \psi) = \frac{(e^{\psi})^y}{1 + e^{\psi}} = \frac{1}{2} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega, \tag{19}$$

where $\kappa = y - 1/2$ and $\omega \sim \mathrm{PG}(1, 0)$. Conditioned on $\boldsymbol{\omega}$, the batch likelihood is proportional to a diagonal Gaussian:

$$p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega}) \propto \prod_{i=1}^{N} e^{-\omega_i \psi_i^2 / 2}\, e^{\kappa_i \psi_i} \propto \mathcal{N}(\boldsymbol{\Omega}^{-1}\boldsymbol{\kappa} \mid \boldsymbol{\psi}, \boldsymbol{\Omega}^{-1}), \tag{20}$$

where $\kappa_i = y_i - 1/2$ and $\boldsymbol{\Omega} = \mathrm{diag}(\boldsymbol{\omega})$. The conditional distribution over $\boldsymbol{\psi}$ given $\mathbf{y}$ and $\boldsymbol{\omega}$ is now tractable:

$$p(\boldsymbol{\psi} \mid \mathbf{y}, \boldsymbol{\omega}) \propto p(\mathbf{y} \mid \boldsymbol{\psi}, \boldsymbol{\omega})\, p(\boldsymbol{\psi}) \propto \mathcal{N}(\boldsymbol{\psi} \mid \tilde{\boldsymbol{\Sigma}}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \boldsymbol{\kappa}), \tilde{\boldsymbol{\Sigma}}), \tag{21}$$

where $\tilde{\boldsymbol{\Sigma}} = (\boldsymbol{\Sigma}^{-1} + \boldsymbol{\Omega})^{-1}$.

B EFFICIENT GIBBS SAMPLING

The Gibbs conditional distribution over $\mathbf{f}$ is given by:

$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}, \boldsymbol{\omega}) = \mathcal{N}(\mathbf{f} \mid \tilde{\boldsymbol{\Sigma}}(\mathbf{K}^{-1}\boldsymbol{\mu} + \mathbf{A}^{\top}\boldsymbol{\kappa}), \tilde{\boldsymbol{\Sigma}}), \tag{22}$$

where $\tilde{\boldsymbol{\Sigma}} = (\mathbf{K}^{-1} + \mathbf{A}^{\top}\boldsymbol{\Omega}\mathbf{A})^{-1}$. Naively sampling from this distribution requires $O(C^3 N^3)$ computation since $\tilde{\boldsymbol{\Sigma}}$ is a $CN \times CN$ matrix. Here we describe a method for sampling from this distribution that requires $O(C N^3)$ computation instead. First, we note that (22) can be interpreted as the conditional distribution $p(\mathbf{f} \mid \mathbf{z} = \boldsymbol{\Omega}^{-1}\boldsymbol{\kappa})$ resulting from the following marginal distribution $p(\mathbf{f})$ and conditional $p(\mathbf{z} \mid \mathbf{f})$:

$$p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K}), \tag{23}$$
$$p(\mathbf{z} \mid \mathbf{f}) = \mathcal{N}(\mathbf{z} \mid \mathbf{A}\mathbf{f}, \boldsymbol{\Omega}^{-1}), \tag{24}$$

where we have made implicit the dependence on $\mathbf{X}$, $\mathbf{Y}$, and $\boldsymbol{\omega}$ for brevity of notation. Equivalently, the distribution over $\mathbf{f}$ and $\mathbf{z}$ can be represented by the partitioned Gaussian

$$\begin{bmatrix} \mathbf{f} \\ \mathbf{z} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ \mathbf{A}\boldsymbol{\mu} \end{bmatrix}, \begin{bmatrix} \mathbf{K} & \mathbf{K}\mathbf{A}^{\top} \\ \mathbf{A}\mathbf{K} & \mathbf{A}\mathbf{K}\mathbf{A}^{\top} + \boldsymbol{\Omega}^{-1} \end{bmatrix} \right). \tag{25}$$

C LEARNING OBJECTIVES

Marginal Likelihood (ML). The gradient of the log marginal likelihood can be estimated by posterior samples $\boldsymbol{\omega} \sim p_{\theta}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$. In practice, we use a stochastic training objective based on samples of $\boldsymbol{\omega}$ from Gibbs chains. We use Fisher's identity (Douc et al., 2014) to derive the following gradient estimator:

$$\nabla_{\theta} \mathcal{L}_{\mathrm{ML}} = \int p_{\theta}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\, \nabla_{\theta} \log p_{\theta}(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, d\boldsymbol{\omega} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta} \log p_{\theta}(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}^{(m)}),$$

where $\boldsymbol{\omega}^{(1)}, \ldots, \boldsymbol{\omega}^{(M)}$ are samples from the posterior Gibbs chain. As suggested by Patacchiola et al. (2020), who applied GPs to FSC via least-squares classification, we merge the support and query sets during learning to take full advantage of the available data within each episode.

Predictive Likelihood (PL). The log predictive likelihood for a query example $\mathbf{x}_*$ is:

$$\mathcal{L}_{\mathrm{PL}}(\theta; \mathbf{X}, \mathbf{Y}, \mathbf{x}_*, y_*) \triangleq \log p_{\theta}(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}) = \log \int p(\boldsymbol{\omega})\, p_{\theta}(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega})\, d\boldsymbol{\omega}. \tag{40}$$

We use an approximate gradient estimator again based on posterior samples of $\boldsymbol{\omega}$:

$$\nabla_{\theta} \mathcal{L}_{\mathrm{PL}} \approx \int p_{\theta}(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\, \nabla_{\theta} \log p_{\theta}(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega})\, d\boldsymbol{\omega} \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{\theta} \log p_{\theta}(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}^{(m)}). \tag{41}$$

We note that this is not an unbiased estimator of the gradient, but find it works well in practice.
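The equivalence between the precision form (22) and conditioning in the partitioned Gaussian (25) can be checked numerically. The small dense verification below is our own code, not the paper's $O(CN^3)$ implementation; it computes the conditional mean and covariance both ways and confirms they agree (a consequence of the matrix inversion lemma):

```python
import numpy as np

def conditional_via_precision(mu, K, A, Omega, kappa):
    # Precision form: mean = S (K^{-1} mu + A^T kappa), S = (K^{-1} + A^T Omega A)^{-1}
    K_inv = np.linalg.inv(K)
    S = np.linalg.inv(K_inv + A.T @ Omega @ A)
    return S @ (K_inv @ mu + A.T @ kappa), S

def conditional_via_partition(mu, K, A, Omega, kappa):
    # Partitioned-Gaussian form: condition f on z = Omega^{-1} kappa
    Omega_inv = np.linalg.inv(Omega)
    z = Omega_inv @ kappa
    M = A @ K @ A.T + Omega_inv          # marginal covariance of z
    gain = K @ A.T @ np.linalg.inv(M)    # "Kalman gain"
    return mu + gain @ (z - A @ mu), K - gain @ A @ K
```

The second form is the basis of the cheaper sampler: it only requires solves against the observation-space covariance rather than inverting the full joint precision.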

D LEARNING ALGORITHM

Our learning algorithm for both marginal and predictive likelihood is summarized in Algorithm 1.

Algorithm 1 One-vs-Each Pólya-Gamma GP Learning
Input: Objective L ∈ {L_ML, L_PL}, task distribution T, number of parallel Gibbs chains M, number of steps T, learning rate η.
Initialize hyperparameters θ randomly.
repeat
  Sample S = (X, Y), Q = (X*, Y*) ∼ T
  if L = L_ML then X ← X ∪ X*, Y ← Y ∪ Y* end if
  A ← OVE-MATRIX(Y)
  for m = 1 to M do
    ω_0^(m) ∼ PG(1, 0), f_0^(m) ∼ p_θ(f | X)
    for t = 1 to T do
      ψ_t^(m) ← A f_{t−1}^(m)
      ω_t^(m) ∼ PG(1, ψ_t^(m))
      f_t^(m) ∼ p_θ(f | X, Y, ω_t^(m))
    end for
  end for
  if L = L_ML then
    θ ← θ + (η / M) Σ_{m=1}^{M} ∇_θ log p_θ(Y | X, ω_T^(m))
  else
    θ ← θ + (η / M) Σ_{m=1}^{M} Σ_j ∇_θ log p_θ(y_j* | x_j*, S, ω_T^(m))
  end if
until convergence

E POSTERIOR PREDICTIVE DISTRIBUTION

The posterior predictive distribution for a query example x* conditioned on ω is:

$$p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) = \int p(y_* \mid \mathbf{f}_*) \, p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) \, d\mathbf{f}_*, \tag{42}$$

where f* are the query example's logits. The predictive distribution over f* can be obtained by noting that ψ and the query logits are jointly Gaussian:

$$\begin{bmatrix} \boldsymbol{\psi} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left( \mathbf{0}, \begin{bmatrix} \mathbf{A}\mathbf{K}\mathbf{A}^\top + \boldsymbol{\Omega}^{-1} & \mathbf{A}\mathbf{K}_* \\ (\mathbf{A}\mathbf{K}_*)^\top & \mathbf{K}_{**} \end{bmatrix} \right),$$

where K* is the NC × C block-diagonal matrix with blocks K_θ(X, x*) and K** is the C × C diagonal matrix with diagonal entries k_θ(x*, x*). The predictive distribution becomes:

$$p(\mathbf{f}_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{Y}, \boldsymbol{\omega}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*),$$

where

$$\boldsymbol{\mu}_* = (\mathbf{A}\mathbf{K}_*)^\top (\mathbf{A}\mathbf{K}\mathbf{A}^\top + \boldsymbol{\Omega}^{-1})^{-1} \boldsymbol{\Omega}^{-1} \boldsymbol{\kappa} \quad \text{and} \quad \boldsymbol{\Sigma}_* = \mathbf{K}_{**} - (\mathbf{A}\mathbf{K}_*)^\top (\mathbf{A}\mathbf{K}\mathbf{A}^\top + \boldsymbol{\Omega}^{-1})^{-1} \mathbf{A}\mathbf{K}_*.$$

With p(f* | x*, X, Y, ω) in hand, the integral in (42) can easily be computed numerically for each class c by forming the corresponding OVE linear transformation matrix A_c and then performing 1D Gauss-Hermite quadrature on each dimension of N(ψ*_c | A_c μ*, A_c Σ* A_cᵀ).
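The quadrature step reduces to 1D expectations of sigmoids under Gaussians. The following generic sketch (our own helper, not the paper's implementation) shows the standard change of variables that maps E[f(ψ)] for ψ ~ N(m, s²) onto the Gauss-Hermite rule:

```python
import numpy as np

def gauss_hermite_expectation(f, mean, std, n_nodes=60):
    """Approximate E[f(psi)] for psi ~ N(mean, std^2) by Gauss-Hermite quadrature.

    The substitution psi = mean + sqrt(2) * std * x turns the expectation into
    (1/sqrt(pi)) * sum_i w_i * f(mean + sqrt(2) * std * x_i).
    """
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return np.sum(w * f(mean + np.sqrt(2.0) * std * x)) / np.sqrt(np.pi)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# By symmetry of the sigmoid, this expectation is exactly 0.5 at zero mean.
print(gauss_hermite_expectation(sigmoid, 0.0, 2.0))
```

For the OVE predictive, each factor σ(ψ_c) contributes one such 1D expectation per dimension of N(ψ*_c | A_c μ*, A_c Σ* A_cᵀ).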

F DETAILED COMPARISON OF LIKELIHOODS

In this section we seek to better understand the behaviors of the softmax, OVE, logistic softmax, and Gaussian likelihoods for classification. For convenience, we summarize the forms of these likelihoods in Table 3.

Table 3: Likelihoods used in Section F.

    Likelihood               L(f | y = c)
    Softmax                  exp(f_c) / Σ_{c'} exp(f_{c'})
    Gaussian                 Π_{c'} N(2·1[c' = c] - 1 | µ = f_{c'}, σ² = 1)
    Logistic Softmax (LSM)   σ(f_c) / Σ_{c'} σ(f_{c'})
    One-vs-Each (OVE)        Π_{c' ≠ c} σ(f_c - f_{c'})

F.1 HISTOGRAM OF CONFIDENCES

We sampled logits f_c ~ N(0, 1) and plotted a histogram and kernel density estimate of the maximum output probability max_c p(y = c | f) for each of the likelihoods shown in Table 3, with C = 5. The results are shown in Figure 4. Logistic softmax is a priori underconfident: it places little probability mass on confidences above 0.4. This may be due to the sigmoid function, which squashes large values of f. The Gaussian and OVE likelihoods are a priori overconfident in that they place a large amount of probability mass on confident outputs. Note that this is not a complete explanation, because GP hyperparameters such as the prior mean or the Gaussian likelihood variance may be able to compensate for these imperfections to some degree. Indeed, we found it helpful to learn a constant mean for the logistic softmax likelihood, as mentioned in Section G.2.
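The four likelihoods in Table 3 can be written compactly for a single logit vector. The sketch below uses our own function names; it follows the forms in the table directly:

```python
import numpy as np

def softmax_lik(f, c):
    """Softmax: exp(f_c) / sum_c' exp(f_c')."""
    e = np.exp(f - np.max(f))  # shift for numerical stability
    return e[c] / np.sum(e)

def gaussian_lik(f, c, sigma2=1.0):
    """Product of Gaussians with target +1 for the true class and -1 otherwise."""
    targets = np.where(np.arange(len(f)) == c, 1.0, -1.0)
    dens = np.exp(-0.5 * (targets - f) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.prod(dens)

def logistic_softmax_lik(f, c):
    """Logistic softmax: sigmoid(f_c) / sum_c' sigmoid(f_c')."""
    s = 1.0 / (1.0 + np.exp(-f))
    return s[c] / np.sum(s)

def ove_lik(f, c):
    """One-vs-each: prod over c' != c of sigmoid(f_c - f_c')."""
    diffs = f[c] - np.delete(f, c)
    return np.prod(1.0 / (1.0 + np.exp(-diffs)))

f = np.array([2.0, -1.0, 0.5])
print([ove_lik(f, c) for c in range(3)])
```

Note that for C = 2 the OVE likelihood coincides exactly with the softmax, since σ(f_1 - f_2) = exp(f_1) / (exp(f_1) + exp(f_2)); the two differ only for C > 2.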

F.2 LIKELIHOOD VISUALIZATION

In order to visualize the various likelihoods under consideration, we consider a trivial classification task with a single observed example. We assume that there are three classes (C = 3) and that the single example belongs to the first class (y = 1). We place the following prior on f = (f_1, f_2, f_3)ᵀ:

$$p(\mathbf{f}) = \mathcal{N}\!\left( \mathbf{f} \;\middle|\; \boldsymbol{\mu} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \right).$$

In other words, the prior for f_1 and f_2 is a standard normal and f_3 is clamped at zero (for ease of visualization). The likelihoods are plotted in Figure 5 and the corresponding posteriors are plotted in Figure 6. The mode of each posterior distribution is similar, but each differs slightly in shape: Gaussian is more peaked about its mode, while logistic softmax is more spread out. One-vs-Each is similar to softmax, but is slightly more elliptical.

F.3 2D IRIS EXPERIMENTS

We also conducted experiments on a 2D version of the Iris dataset (Fisher, 1936), which contains 150 examples across 3 classes. The first two features of the dataset (sepal length and width) were retained. We used a zero-mean GP prior and an RBF kernel k(x, x') = exp(-½ d(x, x')²), where d(·, ·) is Euclidean distance. We considered training set sizes of 1, 2, 3, 4, 5, 10, 15, 20, 25, and 30 examples per class. For each training set size, we performed GP inference on 200 randomly generated train/test splits and compared the predictions across the Gaussian, logistic softmax, and one-vs-each likelihoods. Predictions at a test point x* were made by applying the (normalized) likelihood to the posterior predictive mean f*. The predictive probabilities for each likelihood are shown in Figure 7. The ELBO is computed by treating each likelihood's posterior q(f | X, Y) as an approximation to the softmax posterior p(f | X, Y):

ELBO(q) = E_q[log p(f | X)] + E_q[log p(Y | f)] - E_q[log q(f | X, Y)] = log p(Y | X) - KL(q(f | X, Y) || p(f | X, Y)).

Even though direct computation of the softmax posterior p(f | X, Y) is intractable, computing the ELBO is tractable. A larger ELBO indicates a lower KL divergence to the softmax posterior. One-vs-Each performs well for accuracy, Brier score, and ELBO across the training set sizes. Gaussian performs best on expected calibration error through 15 examples per class, beyond which one-vs-each is better.
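The identity ELBO(q) = log p(Y | X) - KL(q || p) can be verified numerically in one dimension. The sketch below is our own toy setup (not the Iris experiment): a conjugate Gaussian model where the exact posterior and marginal likelihood are known in closed form, so that plugging the exact posterior in as q should make the grid-quadrature ELBO equal the log marginal likelihood (KL = 0).

```python
import numpy as np

# Toy 1D conjugate model: prior f ~ N(0, 1), likelihood y | f ~ N(f, 1).
# Then the posterior is f | y ~ N(y/2, 1/2) and the marginal is y ~ N(0, 2).
y = 0.7

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

f = np.linspace(-10.0, 10.0, 100_001)
h = f[1] - f[0]
log_q = log_normal(f, y / 2.0, 0.5)  # exact posterior used as q
q = np.exp(log_q)

# ELBO(q) = E_q[log p(f)] + E_q[log p(y|f)] - E_q[log q(f)], via grid quadrature.
integrand = q * (log_normal(f, 0.0, 1.0) + log_normal(y, f, 1.0) - log_q)
elbo = np.sum(integrand) * h

log_evidence = log_normal(y, 0.0, 2.0)
print(elbo, log_evidence)  # equal here because KL(q || posterior) = 0
```

With an approximate q (as in the Iris comparison, where q comes from a non-softmax likelihood), the computed ELBO would fall below the log evidence by exactly the KL gap.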

G FEW-SHOT EXPERIMENTAL DETAILS

Here we provide further details on the experimental setup for our few-shot classification experiments, which is based on the protocol of Patacchiola et al. (2020).

G.1 DATASETS

We used the four dataset scenarios described below. The first three are the same as those used by Chen et al. (2019) and the fourth was proposed by Patacchiola et al. (2020).

• CUB. Caltech-UCSD Birds (CUB) (Wah et al., 2011) consists of 200 classes and 11,788 images. A split of 100 training, 50 validation, and 50 test classes was used (Hilliard et al., 2018; Chen et al., 2019).
• mini-Imagenet. The mini-Imagenet dataset (Vinyals et al., 2016) consists of 100 classes with 600 images per class. We used the split proposed by Ravi & Larochelle (2017), which has 64 classes for training, 16 for validation, and 20 for test.
• mini-Imagenet→CUB. This cross-domain transfer scenario takes the training split of mini-Imagenet and the validation & test splits of CUB.
• Omniglot→EMNIST. We use the same setup as proposed by Patacchiola et al. (2020). Omniglot (Lake et al., 2011)

G.2 FEW-SHOT CLASSIFICATION BASELINES

Here we explain the few-shot baselines in greater detail.

• Feature Transfer (Chen et al., 2019)
• Baseline++ (Chen et al., 2019) is similar to Feature Transfer except that it uses a cosine distance module prior to the softmax during fine-tuning.
• Matching Networks (Vinyals et al., 2016) can be viewed as a soft form of k-nearest neighbors: it computes attention over the support examples and sums to form a predictive distribution over classes.
• Prototypical Networks (Snell et al., 2017) computes class means (prototypes) and forms a predictive distribution based on Euclidean distance to the prototypes. It can be viewed as a Gaussian classifier operating in an embedding space.
• MAML (Finn et al., 2017) performs one or a few steps of gradient descent on the support set and then makes predictions on the query set, backpropagating through the gradient descent procedure. For this baseline, we simply quote the classification accuracy reported by Patacchiola et al. (2020).
• RelationNet (Sung et al., 2018), rather than using a predefined distance metric as in Matching Networks or Prototypical Networks, learns a deep distance metric as the output of a neural network that takes as input the latent representations of both examples. It is trained to minimize the squared error of its output predictions.
• Deep Kernel Transfer (DKT) (Patacchiola et al., 2020) relies on least-squares classification (Rifkin & Klautau, 2004) to maintain tractability of Gaussian process posterior inference. In DKT, a separate binary classification task is formed for each class in one-vs-rest fashion by treating labels in {-1, +1} as continuous targets. We include the results of DKT with the cosine kernel as implemented by Patacchiola et al. (2020), which is parameterized slightly differently from the version we used in (47):

    k_cos-dkt(x, x'; θ, α, ν) = softplus(α) · softplus(ν) · g_θ(x)ᵀ g_θ(x') / (‖g_θ(x)‖ ‖g_θ(x')‖).
• Bayesian MAML (Yoon et al., 2018) relies on Stein Variational Gradient Descent (SVGD) (Liu & Wang, 2016) to obtain an approximate posterior distribution in weight space. We compare to both the non-chaser version, which optimizes the cross-entropy of query predictions, and the chaser version, which optimizes the mean squared error between the approximate posterior on the support set and the approximate posterior on the merged support & query set. The non-chaser version is therefore related to predictive likelihood methods, while the chaser version is more analogous to marginal likelihood methods. For the non-chaser version, we used 20 particles and 1 step of adaptation at both train and test time. For the chaser version, we also used 20 particles; at train time the chaser took 1 step and the leader 1 additional step, and at test time we used 5 steps of adaptation. Due to the slow performance of this method, we followed the advice of Yoon et al. (2018) and only performed adaptation on the final layer of weights, which may help explain the drop in performance relative to MAML. The authors released TensorFlow code for regression only, so we reimplemented this baseline for classification in PyTorch.
• Amortized Bayesian Meta-Learning (ABML) (Ravi & Beatson, 2019) performs a few steps of Bayes-by-backprop (Blundell et al., 2015) in order to infer a fully factorized approximate posterior over the weights. The authors did not release code, so we implemented our own version of ABML in PyTorch. We found the weighting of the inner and outer KL divergences to be important for achieving good performance. We took the negative log likelihood to be the mean cross entropy and used an inner KL weight of 0.01 and an outer KL weight of 0.001; these values were arrived at through a small amount of hyperparameter tuning on the Omniglot→EMNIST dataset. We used α = 1.0 and β = 0.01 for the Gamma prior over the weights. We only applied ABML to the weights of the network; the biases were learned as point estimates. We used 4 steps of adaptation and took 5 samples when computing expectations (any more than this did not fit into GPU memory). We used the local reparameterization trick (Kingma et al., 2015) and flipout (Wen et al., 2018) when computing expectations in order to reduce variance. To match the architecture used by Ravi & Beatson (2019), we trained this baseline with 32 filters throughout the classification network. We trained each 1-shot ABML model for 800 epochs and each 5-shot ABML model for 600 epochs, as learning had not converged within the epoch limits specified in Section G.3.
• Logistic Softmax GP (Galy-Fajou et al., 2020) is a multi-class Gaussian process classification method that relies on the logistic softmax likelihood. Galy-Fajou et al. (2020) did not consider the few-shot setting, but we use the same objectives described in Section 4.4 to adapt this method to FSC. In addition, we used the cosine kernel (see Section H for a description) that we found to work best with our OVE PG GPs. For this method, we found it important to learn a constant mean function (rather than a zero mean) in order to improve calibration.

G.3 TRAINING DETAILS

All methods employed the commonly used Conv4 architecture (Vinyals et al., 2016). We train both marginal likelihood and predictive likelihood versions of our models. For Pólya-Gamma sampling we use the PyPólyaGamma package. During training, we use a single Gibbs step (T = 1). For evaluation, we run until T = 50. In both training and evaluation, we use M = 20 parallel Gibbs chains to reduce variance.

Normalized RBF Kernel. Finally, we consider a normalized RBF kernel similar in spirit to the cosine kernel:

    k_rbf-norm(x, x'; θ, α, ℓ) = exp(α) · exp( -1/(2 exp(ℓ)²) · ‖ g_θ(x)/‖g_θ(x)‖ - g_θ(x')/‖g_θ(x')‖ ‖² ).

The results of our Pólya-Gamma OVE GPs with different kernels can be found in Tables 5 and 6. In general, we find that the cosine kernel works best overall, with the exception of Omniglot→EMNIST, where RBF does best.
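When PyPólyaGamma is unavailable, approximate PG(1, c) draws can be generated from the truncated sum-of-Gammas representation of Polson et al. (2013). The sketch below is our own illustration (the package itself uses a more sophisticated exact sampler); it checks the draws against the known mean E[PG(1, c)] = tanh(c/2)/(2c):

```python
import numpy as np

def polya_gamma_approx(c, n_samples, n_terms=200, rng=None):
    """Approximate PG(1, c) draws via a truncated sum-of-Gammas representation:

        omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),

    with g_k ~ Gamma(1, 1). Truncation slightly underestimates the mean.
    """
    rng = rng if rng is not None else np.random.default_rng()
    k = np.arange(1, n_terms + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2)
    g = rng.gamma(1.0, 1.0, size=(n_samples, n_terms))
    return (g / denom).sum(axis=1) / (2.0 * np.pi ** 2)

def pg_mean(c):
    """E[PG(1, c)] = tanh(c/2) / (2c), with limit 1/4 as c -> 0."""
    return 0.25 if c == 0 else np.tanh(c / 2.0) / (2.0 * c)

rng = np.random.default_rng(0)
samples = polya_gamma_approx(2.0, 20_000, rng=rng)
print(samples.mean(), pg_mean(2.0))
```

This is illustrative only and too slow for training; in the experiments every ω update draws one PG variable per entry of ψ, so an exact, vectorized sampler matters.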

I ADDITIONAL CALIBRATION RESULTS

In Figure 9 , we include calibration results for mini-Imagenet and Omniglot→EMNIST. They follow similar trends to the results presented in Section 5.2.

J QUANTITATIVE ROBUSTNESS TO INPUT NOISE RESULTS

In this section we include quantitative results for the robustness-to-input-noise experiments presented in Figure 2. Results for Gaussian noise are shown in Table 7, impulse noise in Table 8, and defocus blur in Table 9.



Footnotes:
1. See in particular Appendix C of Linderman et al. (2015) for a detailed explanation of this phenomenon.
2. Code: https://github.com/jakesnell/ove-polya-gamma-gp
3. PyPólyaGamma: https://github.com/slinderman/pypolyagamma



Figure 1: Reliability diagrams, expected calibration error (ECE), maximum calibration error (MCE), and Brier Score (BRI) for 5-shot 5-way tasks on CUB (additional calibration results can be found in Appendix I). Metrics are computed on 3,000 random tasks from the test set. The last two plots are our proposed method.

Figure 2: Accuracy (↑) and Brier Score (↓) when corrupting both support and query with Gaussian noise on 5-way 5-shot tasks. Quantitative results may be found in Appendix J.

Figure 3: Average AUROC (↑) for out-of-episode detection. The AUC is computed separately for each episode and averaged across 1,000 episodes. Bars indicate a 95% bootstrapped confidence interval.

Figure 4: Histogram and kernel density estimate of confidence for randomly generated function samples f_c ~ N(0, 1). Normalized output probabilities were computed for C = 5 and a histogram of max_c p(y = c | f) was computed over 50,000 randomly generated simulations.

Figure 5: Plot of L(f | y = 1), where f_3 is clamped to 0. The Gaussian likelihood penalizes configurations far away from (f_1, f_2) = (1, -1). Logistic softmax is much flatter compared to softmax and has visibly different contours. One-vs-Each is visually similar to the softmax but penalizes (f_1, f_2) near the origin slightly more.

Figure 6: Plot of the posterior p(f | y = 1), where f_3 is clamped to 0. The mode of each posterior distribution is similar, but each differs slightly in shape. Gaussian is more peaked about its mode, while logistic softmax is more spread out. One-vs-Each is similar to softmax, but is slightly more elliptical.

The predictive probabilities in Figure 7 are shown for a randomly generated train/test split with 30 examples per class. Test predictive accuracy, Brier score, expected calibration error, and evidence lower bound (ELBO) results across various training set sizes are shown in Figure 8.

Figure 7: Training points (colored points) and maximum predictive probability for various likelihoods on the Iris dataset. The Gaussian likelihood produces more warped decision boundaries than the others. Logistic softmax tends to produce lower confidence predictions, while one-vs-each produces larger regions of greater confidence than the others.

Figure 8: Comparison across likelihoods in terms of test predictive accuracy, Brier score, expected calibration error (computed with 10 bins), and ELBO. Results are averaged over 200 randomly generated splits for each training set size (1, 2, 3, 4, 5, 10, 15, 20, 25, and 30 examples per class). Error bars indicate 95% confidence intervals.

Feature Transfer involves first training an off-line classifier on the training classes and then training a new classification layer on the episode.

    Method                               CUB 1-shot     CUB 5-shot     mini-Imagenet 1-shot   mini-Imagenet 5-shot
    (name truncated)                     … ± 0.40       70.91 ± 0.32   40.88 ± 0.25           58.19 ± 0.17
    Logistic Softmax GP + Cosine (ML)    60.23 ± 0.54   74.58 ± 0.25   46.75 ± 0.20           59.93 ± 0.31
    Logistic Softmax GP + Cosine (PL)    60.07 ± 0.29   78.14 ± 0.07   47.05 ± 0.20           66.01 ± 0.25
    OVE PG GP + Cosine (ML) [ours]       63.98 ± 0.43   77.44 ± 0.18   50.02 ± 0.35           64.58 ± 0.31
    OVE PG GP + Cosine (PL) [ours]       …              …              …                      …

Average accuracy and standard deviation (percentage) on 5-way cross-domain FSC, with the same experimental setup as in Table 1. Baseline results (through DKT) are from Patacchiola et al. (2020).

    Method                               Omniglot→EMNIST 1-shot   Omniglot→EMNIST 5-shot   mini-Imagenet→CUB 1-shot   mini-Imagenet→CUB 5-shot
    RelationNet                          75.62 ± 1.00             87.84 ± 0.27             37.13 ± 0.20               51.76 ± 1.48
    MAML                                 72.68 ± 1.85             83.54 ± 1.79             34.01 ± 1.25               48.83 ± 0.62
    DKT + Cosine                         73.06 ± 2.36             88.10 ± 0.78             40.22 ± 0.54               55.65 ± 0.05
    Bayesian MAML                        63.94 ± 0.47             65.26 ± 0.30             33.52 ± 0.36               51.35 ± 0.16
    Bayesian MAML (Chaser)               55.04 ± 0.34             54.19 ± 0.32             36.22 ± 0.50               51.53 ± 0.43
    ABML                                 73.89 ± 0.24             87.28 ± 0.40             31.51 ± 0.32               47.80 ± 0.51
    Logistic Softmax GP + Cosine (ML)    62.91 ± 0.49             83.80 ± 0.13             36.41 ± 0.18               50.33 ± 0.13
    Logistic Softmax GP + Cosine (PL)    70.70 ± 0.36             86.59 ± 0.15             36.73 ± 0.26               56.70 ± 0.31
    OVE PG GP + Cosine (ML) [ours]       68.43 ± 0.67             86.22 ± 0.20             39.66 ± 0.18               55.71 ± 0.31
    OVE PG GP + Cosine (PL) [ours]       77.00 ± 0.50             87.52 ± 0.19             37.49 ± 0.11               57.23 ± 0.31

)

Omniglot consists of 1,623 classes, each with 20 examples, and is augmented by rotations of 90 degrees to create 6,492 classes, of which 4,114 are used for training. The EMNIST dataset (Cohen et al., 2017), consisting of 62 classes, is split into 31 training and 31 test classes.

The Conv4 architecture is specified in detail in Table 4; all methods used it except ABML, which used 32 filters throughout. All of our experiments used the Adam (Kingma & Ba, 2015) optimizer with learning rate 10⁻³. During training, all models used epochs consisting of 100 randomly sampled episodes. A single gradient descent step on the encoder network and relevant hyperparameters is made per episode. All 1-shot models are trained for 600 epochs and 5-shot models are trained for 400 epochs, except for ABML, which was trained for an extra 200 epochs. Each episode contained 5 classes (5-way) and 16 query examples. At test time, 15 query examples are used for each episode. Early stopping was performed by monitoring accuracy on the validation set. The validation set was not used for retraining.

Classification accuracy for Pólya-Gamma OVE GPs (our method) using different kernels. Cosine is overall the best, followed closely by linear. RBF-based kernels perform worse, except on the Omniglot→EMNIST dataset. Evaluation is performed on 5 randomly generated sets of 600 test episodes. The standard deviation of the mean accuracy is also shown. ML = Marginal Likelihood, PL = Predictive Likelihood.

    Kernel              Objective   CUB 1-shot     CUB 5-shot     mini-Imagenet 1-shot   mini-Imagenet 5-shot
    (name truncated)    ML          … ± 0.32       78.71 ± 0.08   50.26 ± 0.31           64.84 ± 0.39
    Cosine              PL          60.11 ± 0.26   79.07 ± 0.05   48.00 ± 0.24           67.14 ± 0.23
    Linear              PL          60.44 ± 0.39   78.54 ± 0.19   47.29 ± 0.31           66.66 ± 0.36
    RBF                 PL          56.18 ± 0.69   77.96 ± 0.19   48.06 ± 0.28           66.66 ± 0.39
    RBF (normalized)    PL          59.78 ± 0.34   78.42 ± 0.13   47.51 ± 0.20           66.42 ± 0.36

Cross-domain classification accuracy for Pólya-Gamma OVE GPs (our method) using different kernels. The experimental setup is the same as in Table 5.

    Kernel              Objective   Omniglot→EMNIST 1-shot   Omniglot→EMNIST 5-shot   mini-Imagenet→CUB 1-shot   mini-Imagenet→CUB 5-shot
    Cosine              ML          … ± 0.67                 86.22 ± 0.20             39.66 ± 0.18               55.71 ± 0.31
    Linear              ML          72.42 ± 0.49             88.27 ± 0.20             39.61 ± 0.19               55.07 ± 0.29
    RBF                 ML          78.05 ± 0.38             88.98 ± 0.16             36.99 ± 0.07               51.75 ± 0.27
    RBF (normalized)    ML          75.51 ± 0.47             88.86 ± 0.16             38.42 ± 0.16               54.20 ± 0.13
    Cosine              PL          77.00 ± 0.50             87.52 ± 0.19             37.49 ± 0.11               57.23 ± 0.31
    Linear              PL          75.87 ± 0.43             88.77 ± 0.10             36.83 ± 0.27               56.46 ± 0.22
    RBF                 PL          74.62 ± 0.35             89.87 ± 0.13             35.06 ± 0.25               55.12 ± 0.21
    RBF (normalized)    PL          76.01 ± 0.31             89.42 ± 0.16             37.50 ± 0.28               56.80 ± 0.39

Accuracy (%) and Brier Score when applying Gaussian noise corruption of severity 5 to both the support and query set of test-time episodes. Results were evaluated across 1,000 randomly generated 5-shot 5-way tasks.

Accuracy (%) and Brier Score when applying impulse noise corruption of severity 5 to both the support and query set of test-time episodes. Results were evaluated across 1,000 randomly generated 5-shot 5-way tasks.

Accuracy (%) and Brier Score when applying defocus blur corruption of severity 5 to both the support and query set of test-time episodes. Results were evaluated across 1,000 randomly generated 5-shot 5-way tasks.

ACKNOWLEDGMENTS

We would like to thank Ryan Adams, Ethan Fetaya, Mike Mozer, Eleni Triantafillou, Kuan-Chieh Wang, and Max Welling for helpful discussions. JS also thanks SK T-Brain for supporting him on an internship that led to precursors of some ideas in this paper. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (https://www.vectorinstitute.ai/partners). This project is supported by NSERC and the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DoI/IBC) contract number D16PC00003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/IBC, or the U.S. Government.

Published as a conference paper at ICLR 2021

The conditional distribution p(f | z) is given as:

$$p(\mathbf{f} \mid \mathbf{z}) = \mathcal{N}\!\left(\mathbf{f} \mid \boldsymbol{\Sigma}(\mathbf{K}^{-1}\boldsymbol{\mu} + \mathbf{A}^\top \boldsymbol{\Omega} \mathbf{z}),\ \boldsymbol{\Sigma}\right), \qquad \boldsymbol{\Sigma} = (\mathbf{K}^{-1} + \mathbf{A}^\top \boldsymbol{\Omega} \mathbf{A})^{-1}.$$

Note that p(f | z = Ω⁻¹κ) recovers our desired Gibbs conditional distribution from (22).

An efficient approach to conditional Gaussian sampling is due to Hoffman & Ribak (1991) and is described with greater clarity by Doucet (2010). The procedure is as follows:

1. Sample f₀ ~ p(f) and z₀ ~ p(z | f₀).
2. Return f = f₀ + KAᵀ(AKAᵀ + Ω⁻¹)⁻¹(Ω⁻¹κ − z₀) as the sample from p(f | z = Ω⁻¹κ).

K is block diagonal and thus sampling from p(f) requires O(CN³) time. Af can be computed in O(CN) time, since each entry is the difference between f_i^{y_i} and f_i^c for some i and c. Overall, step 1 requires O(CN³) time.

We now show how to compute f from step 2 in O(CN³) time. We first expand (AKAᵀ + Ω⁻¹)⁻¹ using the Woodbury matrix identity:

$$(\mathbf{A}\mathbf{K}\mathbf{A}^\top + \boldsymbol{\Omega}^{-1})^{-1} = \boldsymbol{\Omega} - \boldsymbol{\Omega}\mathbf{A}(\mathbf{K}^{-1} + \mathbf{A}^\top \boldsymbol{\Omega} \mathbf{A})^{-1}\mathbf{A}^\top \boldsymbol{\Omega}.$$

We substitute into the expression for f:

$$\mathbf{f} = \mathbf{f}_0 + \mathbf{K}\left(\mathbf{v} - \mathbf{A}^\top \boldsymbol{\Omega} \mathbf{A} (\mathbf{K}^{-1} + \mathbf{A}^\top \boldsymbol{\Omega} \mathbf{A})^{-1} \mathbf{v}\right), \tag{30}$$

where we have defined v ≜ AᵀΩ(Ω⁻¹κ − z₀).

Define Y† to be the CN × N matrix produced by vertically stacking diag(Y_{·c}) for c = 1, …, C, and let W† be the CN × N matrix produced by vertically stacking diag((ω₁^c, …, ω_N^c)ᵀ). AᵀΩA may then be written as follows:

$$\mathbf{A}^\top \boldsymbol{\Omega} \mathbf{A} = \mathbf{D} - \mathbf{S}\mathbf{P}\mathbf{S}^\top, \tag{31}$$

where D ≜ Ω, S ≜ [Y†  W†] is CN × 2N, and P is the symmetric 2N × 2N matrix

$$\mathbf{P} \triangleq \begin{bmatrix} -\bar{\boldsymbol{\Omega}} & \mathbf{I}_N \\ \mathbf{I}_N & \mathbf{0} \end{bmatrix}, \qquad \bar{\boldsymbol{\Omega}} \triangleq \mathrm{diag}\!\left(\textstyle\sum_c \omega_1^c, \ldots, \sum_c \omega_N^c\right).$$

Substituting (31) into (30):

$$\mathbf{f} = \mathbf{f}_0 + \mathbf{K}\left(\mathbf{v} - (\mathbf{D} - \mathbf{S}\mathbf{P}\mathbf{S}^\top)(\mathbf{K}^{-1} + \mathbf{D} - \mathbf{S}\mathbf{P}\mathbf{S}^\top)^{-1} \mathbf{v}\right).$$

Now we expand (K⁻¹ + D − SPSᵀ)⁻¹ using the Woodbury identity once more:

$$(\mathbf{K}^{-1} + \mathbf{D} - \mathbf{S}\mathbf{P}\mathbf{S}^\top)^{-1} = \mathbf{E} - \mathbf{E}\mathbf{S}(\mathbf{S}^\top \mathbf{E} \mathbf{S} - \mathbf{P}^{-1})^{-1}\mathbf{S}^\top \mathbf{E},$$

where E ≜ (K⁻¹ + D)⁻¹ is block diagonal and can therefore be computed in O(CN³) time. Note that (SᵀES − P⁻¹) is a 2N × 2N matrix and thus can be inverted in O(N³) time. The overall complexity is therefore O(CN³).
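The equivalence between the direct Gibbs conditional and the Hoffman-Ribak form can be checked numerically. The sketch below (generic NumPy, with small random matrices standing in for K, A, and Ω; not the paper's code) verifies that the conditional mean and covariance from the partitioned Gaussian match Σ(K⁻¹μ + Aᵀκ) and Σ = (K⁻¹ + AᵀΩA)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(0)
d_f, d_z = 4, 3  # small stand-ins for the CN-dimensional quantities

B = rng.normal(size=(d_f, d_f))
K = B @ B.T + np.eye(d_f)              # SPD stand-in for the kernel matrix
A = rng.normal(size=(d_z, d_f))        # stand-in for the OVE matrix
Omega = np.diag(rng.uniform(0.5, 2.0, d_z))
mu = rng.normal(size=d_f)
kappa = rng.normal(size=d_z)

# Direct form of the Gibbs conditional (eq. 22).
Sigma = np.linalg.inv(np.linalg.inv(K) + A.T @ Omega @ A)
mean_direct = Sigma @ (np.linalg.inv(K) @ mu + A.T @ kappa)

# Hoffman-Ribak form: condition the joint Gaussian over (f, z) on z = Omega^{-1} kappa.
M = A @ K @ A.T + np.linalg.inv(Omega)
gain = K @ A.T @ np.linalg.inv(M)
mean_hr = mu + gain @ (np.linalg.inv(Omega) @ kappa - A @ mu)
cov_hr = K - gain @ A @ K

print(np.abs(mean_hr - mean_direct).max(), np.abs(cov_hr - Sigma).max())
```

Sampling then follows step 2 above: draw f₀ ~ N(μ, K) and z₀ ~ N(Af₀, Ω⁻¹), and return f₀ + gain @ (Ω⁻¹κ − z₀), which has exactly this mean and covariance.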

C MARGINAL LIKELIHOOD AND PREDICTIVE LIKELIHOOD OBJECTIVES

Marginal Likelihood (ML). The log marginal likelihood can be written as follows:

$$\mathcal{L}_{\mathrm{ML}}(\theta; \mathbf{X}, \mathbf{Y}) \triangleq \log p_\theta(\mathbf{Y} \mid \mathbf{X}) = \log \int p(\boldsymbol{\omega}) \, p_\theta(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega}) \, d\boldsymbol{\omega}.$$

H EFFECT OF KERNEL CHOICE ON CLASSIFICATION ACCURACY

In this section, we examine the effect of kernel choice on classification accuracy for our Pólya-Gamma OVE GPs.

Cosine Kernel. In the main paper, we showed results for the following kernel, which we refer to as the "cosine" kernel due to its resemblance to cosine similarity:

    k_cos(x, x'; θ, α) = exp(α) · g_θ(x)ᵀ g_θ(x') / (‖g_θ(x)‖ ‖g_θ(x')‖),   (47)

where g_θ(·) is a deep neural network that outputs a fixed-dimensional encoded representation of the input and α is the scalar log output scale. Both θ and α are treated as hyperparameters and learned simultaneously as shown in Algorithm 1. We found that this kernel works well for a range of datasets and shot settings. We note that the use of cosine similarity is reminiscent of the Baseline++ method of Chen et al. (2019), which computes the softmax over cosine similarities to class weights.

Here we consider three additional kernels: linear, RBF, and normalized RBF.

Linear Kernel. The linear kernel is defined as follows:

    k_lin(x, x'; θ) = g_θ(x)ᵀ g_θ(x') / D,

where D is the output dimensionality of g_θ(x). We apply this dimensionality scaling because the dot product between g_θ(x) and g_θ(x') may be large depending on D.

RBF Kernel. The RBF (also known as squared exponential) kernel can be defined as follows:

    k_rbf(x, x'; θ, α, ℓ) = exp(α) · exp( -‖g_θ(x) - g_θ(x')‖² / (2 exp(ℓ)²) ),

where ℓ is the log lengthscale parameter (as with α, we learn ℓ alongside θ).
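The kernels above can be sketched as follows. These are our own minimal NumPy versions operating directly on encoded feature matrices (rows standing in for g_θ(x)), rather than the networks used in the experiments:

```python
import numpy as np

def cosine_kernel(G1, G2, alpha):
    """k = exp(alpha) * <g, g'> / (||g|| ||g'||); rows of G1, G2 are encodings."""
    U1 = G1 / np.linalg.norm(G1, axis=1, keepdims=True)
    U2 = G2 / np.linalg.norm(G2, axis=1, keepdims=True)
    return np.exp(alpha) * U1 @ U2.T

def linear_kernel(G1, G2):
    """k = <g, g'> / D, scaled by the encoding dimensionality D."""
    return G1 @ G2.T / G1.shape[1]

def rbf_kernel(G1, G2, alpha, log_ell):
    """k = exp(alpha) * exp(-||g - g'||^2 / (2 exp(log_ell)^2))."""
    sq = ((G1[:, None, :] - G2[None, :, :]) ** 2).sum(-1)
    return np.exp(alpha) * np.exp(-sq / (2.0 * np.exp(log_ell) ** 2))

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 8))  # 5 encoded examples, D = 8
K_cos = cosine_kernel(G, G, alpha=0.1)
print(np.diag(K_cos))  # every diagonal entry equals exp(alpha)
```

Note that the cosine and RBF kernels bound the prior variance at exp(α) regardless of encoding norm, whereas the linear kernel's scale grows with the encodings; this is the motivation for the 1/D scaling above.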

