BAYESIAN FEW-SHOT CLASSIFICATION WITH ONE-VS-EACH PÓLYA-GAMMA AUGMENTED GAUSSIAN PROCESSES

Abstract

Few-shot classification (FSC), the task of adapting a classifier to unseen classes given a small labeled dataset, is an important step on the path toward human-like machine learning. Bayesian methods are well-suited to tackling the fundamental issue of overfitting in the few-shot scenario because they allow practitioners to specify prior beliefs and update those beliefs in light of observed data. Contemporary approaches to Bayesian few-shot classification maintain a posterior distribution over model parameters, which is slow and requires storage that scales with model size. Instead, we propose a Gaussian process classifier based on a novel combination of Pólya-Gamma augmentation and the one-vs-each softmax approximation (Titsias, 2016) that allows us to efficiently marginalize over functions rather than model parameters. We demonstrate improved accuracy and uncertainty quantification on both standard few-shot classification benchmarks and few-shot domain transfer tasks.

1. INTRODUCTION

Few-shot classification (FSC) is a rapidly growing area of machine learning that seeks to build classifiers able to adapt to novel classes given only a few labeled examples. It is an important step towards machine learning systems that can successfully handle challenging situations such as personalization, rare classes, and time-varying distribution shift. The shortage of labeled data in FSC leads to uncertainty over the parameters of the model, known as model uncertainty or epistemic uncertainty. If model uncertainty is not handled properly in the few-shot setting, there is a significant risk of overfitting. In addition, FSC is increasingly being used for risk-averse applications such as medical diagnosis (Prabhu, 2019) and human-computer interfaces (Wang et al., 2019), where it is important for a few-shot classifier to know when it is uncertain.

Bayesian methods maintain a distribution over model parameters and thus provide a natural framework for capturing this inherent model uncertainty. In a Bayesian approach, a prior distribution is first placed over the parameters of a model. After data is observed, the posterior distribution over parameters is computed using Bayesian inference. This elegant treatment of model uncertainty has led to a surge of interest in Bayesian approaches to FSC that infer a posterior distribution over the weights of a neural network (Finn et al., 2018; Yoon et al., 2018; Ravi & Beatson, 2019).

Although conceptually appealing, there are several practical obstacles to applying Bayesian inference directly to the weights of a neural network. Bayesian neural networks (BNNs) are expensive from both a computational and memory perspective. Moreover, specifying meaningful priors in parameter space is known to be difficult due to the complex relationship between weights and network outputs (Sun et al., 2019). Gaussian processes (GPs) instead maintain a distribution over functions rather than model parameters.
The prior is directly specified by a mean and covariance function, which may be parameterized by deep neural networks. When used with Gaussian likelihoods, GPs admit closed-form expressions for the posterior and predictive distributions. They exchange the computational drawbacks of BNNs for cubic scaling in the number of examples; in FSC, where the number of examples is small, this is often an acceptable trade-off.

When applying GPs to classification with a softmax likelihood, the non-conjugacy of the GP prior renders exact posterior inference intractable. Many approximate inference methods have been proposed to circumvent this, including variational inference and expectation propagation. In this paper we investigate a particularly promising class of approaches that augment the GP model with a set of auxiliary random variables such that, when they are marginalized out, the original model is recovered (Albert & Chib, 1993; Girolami & Rogers, 2006; Linderman et al., 2015). Such augmentation-based approaches typically admit efficient Gibbs sampling procedures for generating posterior samples, which, when combined with Fisher's identity (Douc et al., 2014), can be used to optimize the parameters of the mean and covariance functions. In particular, augmentation with Pólya-Gamma random variables (Polson et al., 2013) makes inference tractable in logistic models. Naively, this handles only binary classification, but in this paper we show how to extend Pólya-Gamma augmentation to multiple classes by using the one-vs-each softmax approximation (Titsias, 2016), which can be expressed as a product of logistic sigmoids. We further show that the one-vs-each approximation can be interpreted as a composite likelihood (Lindsay, 1988; Varin et al., 2011), a connection which to our knowledge has not been made in the literature.
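To make the one-vs-each approximation concrete, the sketch below (an illustrative NumPy snippet, not code from this paper) computes the exact softmax probability and the one-vs-each product of pairwise sigmoids, and checks numerically that the latter lower-bounds the former, as shown by Titsias (2016):

```python
import numpy as np

def softmax_prob(f, y):
    """Exact softmax probability of class y given a logit vector f."""
    e = np.exp(f - f.max())  # shift for numerical stability
    return e[y] / e.sum()

def one_vs_each_prob(f, y):
    """One-vs-each approximation: product over c != y of sigma(f_y - f_c)."""
    diffs = f[y] - np.delete(f, y)
    return np.prod(1.0 / (1.0 + np.exp(-diffs)))

rng = np.random.default_rng(0)
f = rng.normal(size=5)  # random logits for 5 classes
for y in range(5):
    # The one-vs-each product is always a lower bound on the softmax,
    # since prod(1 + x_c) >= 1 + sum(x_c) for nonnegative x_c.
    assert one_vs_each_prob(f, y) <= softmax_prob(f, y) + 1e-12
```

Because each factor is a logistic sigmoid of a difference of function values, each factor is individually amenable to Pólya-Gamma augmentation, which is the property the proposed method exploits.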
In this work, we make several contributions:

• We show how the one-vs-each softmax approximation (Titsias, 2016) can be interpreted as a composite likelihood consisting of pairwise conditional terms.

• We propose a novel GP classification method that combines the one-vs-each softmax approximation with Pólya-Gamma augmentation for tractable inference.

• We demonstrate competitive classification accuracy of our method on standard FSC benchmarks and challenging domain transfer settings.

• We propose several new benchmarks for uncertainty quantification in FSC, including calibration, robustness to input noise, and out-of-episode detection.

• We demonstrate improved uncertainty quantification of our method on the proposed benchmarks relative to standard few-shot baselines.

2. RELATED WORK

Our work is related to both GP methods for handling non-conjugate classification likelihoods and Bayesian approaches to few-shot classification. We summarize relevant work here.

2.1. GP CLASSIFICATION

Non-augmentation approaches. There are several classes of approaches for applying Gaussian processes to classification. The most straightforward method, known as least squares classification (Rifkin & Klautau, 2004), treats class labels as real-valued observations and performs inference with a Gaussian likelihood. The Laplace approximation (Williams & Barber, 1998) constructs a Gaussian approximate posterior centered at the posterior mode. Variational approaches (Titsias, 2009; Matthews et al., 2016) maximize a lower bound on the log marginal likelihood. In expectation propagation (Minka, 2001; Kim & Ghahramani, 2006; Hernández-Lobato & Hernández-Lobato, 2016), local Gaussian approximations to the likelihood are fitted iteratively to minimize KL divergence from the true posterior.

Augmentation approaches. Augmentation-based approaches introduce auxiliary random variables such that the original model is recovered when they are marginalized out. Girolami & Rogers (2006) propose a Gaussian augmentation for multinomial probit regression. Linderman et al. (2015) utilize Pólya-Gamma augmentation (Polson et al., 2013) and a stick-breaking construction to decompose a multinomial distribution into a product of binomials. Galy-Fajou et al. (2020) propose a logistic-softmax likelihood for classification and use Gamma and Poisson augmentation in addition to Pólya-Gamma augmentation in order to perform inference.
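For reference, the identity underlying these Pólya-Gamma schemes (Polson et al., 2013) expresses a logistic-type likelihood as a Gaussian mixture over an auxiliary variable:

$$
\frac{\left(e^{\psi}\right)^{a}}{\left(1+e^{\psi}\right)^{b}}
= 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2},
$$

where $p(\omega)$ is the density of a Pólya-Gamma random variable $\omega \sim \mathrm{PG}(b, 0)$. Conditioned on $\omega$, the likelihood is Gaussian in $\psi$, restoring conjugacy with a GP prior and enabling the closed-form Gibbs updates used by the augmentation approaches above.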

