CALIBRATING TRANSFORMERS VIA SPARSE GAUSSIAN PROCESSES

Abstract

Transformer models have achieved profound success in prediction tasks across a wide range of applications in natural language processing, speech recognition and computer vision. Extending this success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in a Transformer to calibrate its uncertainty. SGPA replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.

1. INTRODUCTION

Deep learning has delivered significant accuracy improvements in prediction tasks for computer vision, speech recognition and natural language processing (He et al., 2015; Graves et al., 2013; Vaswani et al., 2017). In particular, Transformers (Vaswani et al., 2017) based on multi-head attention (MHA) have gained popularity in recent years. With Transformers being deployed in many downstream applications (Vaswani et al., 2017; Dosovitskiy et al., 2021; Brown et al., 2020), it is crucial to prevent the poor robustness that often comes from erratic, highly confident outputs of these models (Guo et al., 2017b; Mukhoti et al., 2020). This requires calibrated uncertainty quantification for Transformers, which is much less well studied at the time of this work, and the gap raises concerns about using Transformers for safety-critical tasks that require rational and risk-averse decision making under uncertainty. Regarding uncertainty quantification, Bayesian inference is a powerful and principled framework for building probabilistic models that support rational prediction and decision-making under uncertainty (Gal, 2016). Significant progress has been made in applying (approximate) Bayesian inference methods to quantify uncertainty in fully-connected, convolutional and recurrent neural networks (Blundell et al., 2015; Gal & Ghahramani, 2016; Zhang et al., 2019; Ritter et al., 2021). Initial efforts have been made to extend these techniques to Transformers, but with mixed results (Tran et al., 2019; Xue et al., 2021). On the other hand, Gaussian processes (GPs) are gold-standard methods for tasks requiring reliable function-space uncertainty estimates (Rasmussen & Williams, 2006; Wilson et al., 2020). Researchers have proposed integrating deep learning ideas into GP model design, including deep kernel learning (Wilson et al., 2016) and deep GPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017).
Still, these models have yet to be scaled to modern deep learning tasks such as large-scale image classification and language modelling. In this work, we propose sparse Gaussian process attention (SGPA), a novel uncertainty quantification technique for attention-based models (e.g., Transformers), which leverages techniques from sparse variational Gaussian processes (SVGP) (Snelson & Ghahramani, 2005; Hensman et al., 2013) for improved uncertainty estimates. Our work presents the following insights and contributions:

• Our key observation is that kernel-based attention (Tsai et al., 2019) is equivalent to the posterior mean of an SVGP. This inspires us to extend SVGP to Transformers for uncertainty estimation.
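The key observation above can be illustrated numerically. In an SVGP, the posterior mean at test inputs q is m(q) = K_qz K_zz^{-1} m_u, where z are the inducing inputs and m_u the inducing-output mean. Identifying the keys with the inducing inputs z and choosing m_u = K_zz v, this mean reduces exactly to kernel attention K(q, k) v. The following sketch (our own illustrative NumPy code, not the authors' implementation; the RBF kernel and all shapes are assumptions for the example) checks this identity:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    # Squared-exponential kernel: a valid symmetric PSD kernel.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
T, d = 5, 4
q = rng.normal(size=(T, d))   # queries (test inputs)
k = rng.normal(size=(T, d))   # keys, playing the role of inducing inputs z
v = rng.normal(size=(T, 1))   # values

# Kernel attention (unnormalised): F = K(q, k) v
F_attn = rbf_kernel(q, k) @ v

# SVGP posterior mean: m(q) = K_qz K_zz^{-1} m_u.
# Choosing m_u = K_zz v makes the two coincide exactly.
K_zz = rbf_kernel(k, k)
m_u = K_zz @ v
F_svgp = rbf_kernel(q, k) @ np.linalg.solve(K_zz, m_u)

assert np.allclose(F_attn, F_svgp)
```

This is only the mean: the practical appeal of the SVGP view is that it additionally provides a posterior covariance over the attention outputs, which standard attention discards.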

2. PRELIMINARIES

Here, we briefly review multi-head self-attention (MHSA) and sparse variational Gaussian processes (SVGP), on which our method is built.

2.1. MULTI-HEAD SELF-ATTENTION (MHSA)

Given T queries q ∈ R^{T×d_q}, keys k ∈ R^{T×d_k} (d_k = d_q) and values v ∈ R^{T×d_v}, dot-product attention (Vaswani et al., 2017) is computed as follows, using a nonlinear activation function ω (typically softmax):

F = ω(qk^⊤ / √d_k) v.   (1)

The dot product qk^⊤ measures the similarities between the queries and keys. For self-attention the keys are simply set equal to the queries, i.e., k = q.

Transformers use multi-head self-attention (MHSA), which modifies dot-product self-attention as follows. Assume H attention heads are in use. Given T inputs s ∈ R^{T×d_s} to the MHSA block, we first project them to the queries for each head h with a projection matrix W_q^h ∈ R^{d_s×d_q}: q^h = s W_q^h. We obtain the keys k^h and values v^h analogously, using matrices W_k^h ∈ R^{d_s×d_k} and W_v^h ∈ R^{d_s×d_v} respectively. Typically the same d_q = d_k = d_v is used for all heads. The output F^h of head h is then obtained by plugging q^h, k^h and v^h into eq. (1). Lastly, the attention outputs from all heads are combined with the output projection matrix W_F ∈ R^{(Hd_v)×(Hd_v)}:

F = concat(F^1, ..., F^H) W_F.   (2)

In Transformers multiple layers of MHSA may be in use, where the output of the (l-1)th MHSA layer is further processed by a non-linear function G_{ϕ_l} (parameterised by an MLP) to obtain the input to the lth MHSA layer, i.e., s_l = G_{ϕ_l}(F_{l-1}). See Figure 1a for an illustration of an MHSA block in a Transformer model (excluding the combination projection step of eq. (2)).
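The computation in eqs. (1) and (2) can be sketched directly. Below is a minimal NumPy implementation (our own illustrative code, with ω = softmax and d_q = d_k = d_v; variable names and the random dimensions are assumptions for the example):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(s, Wq, Wk, Wv, WF):
    # s: (T, d_s); Wq/Wk/Wv: per-head projection matrices; WF: (H*d_v, H*d_v)
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        q, k, v = s @ Wq_h, s @ Wk_h, s @ Wv_h
        d_k = k.shape[-1]
        A = softmax(q @ k.T / np.sqrt(d_k))       # eq. (1) with omega = softmax
        heads.append(A @ v)                       # per-head output F^h: (T, d_v)
    return np.concatenate(heads, axis=-1) @ WF    # eq. (2): concat heads, project

rng = np.random.default_rng(0)
T, d_s, d_k, H = 6, 8, 4, 2
Wq = [rng.normal(size=(d_s, d_k)) for _ in range(H)]
Wk = [rng.normal(size=(d_s, d_k)) for _ in range(H)]
Wv = [rng.normal(size=(d_s, d_k)) for _ in range(H)]
WF = rng.normal(size=(H * d_k, H * d_k))
out = mhsa(rng.normal(size=(T, d_s)), Wq, Wk, Wv, WF)
print(out.shape)  # (6, 8)
```

For self-attention as used here, queries and keys are produced from the same input sequence s; SGPA will later replace the scaled dot product q k^⊤ inside eq. (1) with a valid symmetric kernel.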

2.2. SPARSE VARIATIONAL GAUSSIAN PROCESS (SVGP) WITH DEEP KERNEL

A Gaussian process (GP) (Rasmussen & Williams, 2006) is a distribution over functions f with an infinite-dimensional index set X (the domain of f). In the Bayesian inference framework, a GP prior over

