CALIBRATING TRANSFORMERS VIA SPARSE GAUSSIAN PROCESSES

Abstract

Transformer models have achieved profound success in prediction tasks in a wide range of applications in natural language processing, speech recognition and computer vision. Extending the Transformer's success to safety-critical domains requires calibrated uncertainty estimation, which remains under-explored. To address this, we propose Sparse Gaussian Process attention (SGPA), which performs Bayesian inference directly in the output space of multi-head attention blocks (MHAs) in Transformers to calibrate their uncertainty. It replaces the scaled dot-product operation with a valid symmetric kernel and uses sparse Gaussian process (SGP) techniques to approximate the posterior processes of MHA outputs. Empirically, on a suite of prediction tasks on text, images and graphs, SGPA-based Transformers achieve competitive predictive accuracy, while noticeably improving both in-distribution calibration and out-of-distribution robustness and detection.

1. INTRODUCTION

Significant improvements in predictive accuracy have been achieved in computer vision, speech recognition and natural language processing using deep learning (He et al., 2015; Graves et al., 2013; Vaswani et al., 2017). In particular, Transformers (Vaswani et al., 2017) based on multi-head attention (MHA) have gained popularity in recent years. With Transformers being deployed in many downstream applications (Vaswani et al., 2017; Dosovitskiy et al., 2021; Brown et al., 2020), it is crucial to prevent the poor robustness that often comes from erratic, high-confidence outputs of these models (Guo et al., 2017b; Mukhoti et al., 2020). This requires calibrated uncertainty quantification for Transformers, which is much less well-studied at the time of this work, and it raises concerns about using Transformers for safety-critical tasks that require rational and risk-averse decision making under uncertainty. Regarding uncertainty quantification, Bayesian inference is a powerful and principled framework for building probabilistic models for rational prediction and decision-making under uncertainty (Gal, 2016). Significant progress has been made in applying (approximate) Bayesian inference methods to quantify uncertainty in fully-connected, convolutional and recurrent neural networks (Blundell et al., 2015; Gal & Ghahramani, 2016; Zhang et al., 2019; Ritter et al., 2021). Initial efforts have been made to extend these techniques to Transformers, but with mixed results (Tran et al., 2019; Xue et al., 2021). On the other hand, Gaussian processes (GPs) are the gold-standard methods for tasks requiring reliable function-space uncertainty estimates (Rasmussen & Williams, 2006; Wilson et al., 2020). Researchers have proposed to integrate deep learning ideas into GP model design, including deep kernel learning (Wilson et al., 2016) and deep GPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017).
Still, these models have yet to be scaled to modern deep learning tasks such as large-scale image classification and language modelling. In this work, we propose sparse Gaussian process attention (SGPA), a novel uncertainty quantification technique for attention-based models (e.g., Transformers), which leverages techniques from sparse variational Gaussian processes (SVGP) (Snelson & Ghahramani, 2005; Hensman et al., 2013) for improved uncertainty estimates. Our work presents the following insights and contributions:
• Our key observation is that kernel-based attention (Tsai et al., 2019) is equivalent to the posterior mean of an SVGP. This inspires us to extend SVGP to Transformers for uncertainty estimation. The resulting Transformer based on our SGPA approach can be viewed as a sparse deep GP (Salimbeni & Deisenroth, 2017) with a deep kernel in use for each GP layer.
• We address the computational inefficiency of a naive extension of SVGP to multi-head self-attention with decoupled inducing-point techniques (Salimbeni et al., 2018), making SGPA scalable to the deep learning tasks that Transformers are applied to.
• Empirically, on a variety of vision, NLP and graph prediction tasks, and compared with baselines, SGPA-based Transformers improve considerably on in-distribution calibration, out-of-distribution (OOD) robustness, and OOD detection, while achieving competitive accuracy against Transformers with standard (Vaswani et al., 2017) or kernel attention (Tsai et al., 2019).

2. BACKGROUND

The attention mechanism, first introduced in Graves et al. (2013), has become the core building block of Transformer models. In this work, we consider Transformers using multi-head self-attention (MHSA) as in Vaswani et al. (2017); Dosovitskiy et al. (2021). Here, we briefly review MHSA and sparse variational Gaussian processes, on which our method is built.

2.1. MULTI-HEAD SELF-ATTENTION (MHSA)

Given T queries q ∈ R^{T×d_q}, keys k ∈ R^{T×d_k} (d_k = d_q) and values v ∈ R^{T×d_v}, dot-product attention (Vaswani et al., 2017) is computed as follows, using a nonlinear activation function ω:

F = ω(qk^⊤)v.    (1)

The dot product qk^⊤ measures the similarities between the queries and keys. For self-attention the keys are simply set to be equal to the queries, i.e., k = q. Transformers use multi-head self-attention (MHSA), which modifies dot-product self-attention as follows. Assume H attention heads are in use. Given T inputs s ∈ R^{T×d_s} to the MHSA block, we first project them to the queries for each head h with a projection matrix W_q^h ∈ R^{d_s×d_q}: q^h = sW_q^h. We obtain the keys k^h and values v^h accordingly by projections with matrices W_k^h ∈ R^{d_s×d_k} and W_v^h ∈ R^{d_s×d_v} respectively. Typically the same d_q = d_k = d_v is used for all heads. The head's output F^h is then obtained by plugging q^h, k^h and v^h into eq. (1). Lastly, the attention outputs from all heads are combined with the output projection matrix W_F ∈ R^{(Hd_v)×(Hd_v)}:

F = concat(F^1, …, F^H) W_F.    (2)

In Transformers multiple layers of MHSA may be in use, where the output of the (l−1)th MHSA layer is further processed by a non-linear function G_{ϕ_l} (parameterised by an MLP) to obtain the input to the lth MHSA layer, i.e., s_l = G_{ϕ_l}(F^{l−1}). See Figure 1a for an illustration of an MHSA block in a Transformer model (excluding the combination projection step of eq. (2)).
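As a concrete reference, eqs. (1) and (2) can be sketched in a few lines of NumPy (a minimal illustration with softmax as the activation ω; the function names and toy shapes are ours, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(s, W_q, W_k, W_v, W_F):
    """Multi-head self-attention on a sequence s of shape (T, d_s).

    W_q, W_k, W_v are lists of H per-head projection matrices;
    W_F is the (H*d_v, H*d_v) output projection of eq. (2)."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(W_q, W_k, W_v):
        q, k, v = s @ Wq_h, s @ Wk_h, s @ Wv_h
        A = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T, T) attention weights
        heads.append(A @ v)                          # eq. (1) per head
    return np.concatenate(heads, axis=-1) @ W_F      # eq. (2)

# toy usage: T=5 tokens, d_s=8, H=2 heads with d_q = d_k = d_v = 4
rng = np.random.default_rng(0)
T, d_s, H, d = 5, 8, 2, 4
W_q = [rng.standard_normal((d_s, d)) for _ in range(H)]
W_k = [rng.standard_normal((d_s, d)) for _ in range(H)]
W_v = [rng.standard_normal((d_s, d)) for _ in range(H)]
W_F = rng.standard_normal((H * d, H * d))
F = mhsa(rng.standard_normal((T, d_s)), W_q, W_k, W_v, W_F)
```

Each head produces a (T, d_v) output; concatenation and the W_F projection combine them into the final (T, Hd_v) block output.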

2.2. SPARSE VARIATIONAL GAUSSIAN PROCESS (SVGP) WITH DEEP KERNEL

A Gaussian process (GP) (Rasmussen & Williams, 2006) is a distribution over functions f with an infinite-dimensional index set X (the domain of f). In the Bayesian inference framework, a GP prior over f is specified by a mean function (often set to zero) and a covariance function parameterised by a kernel K_ψ(·,·) with hyperparameters ψ. Specifically, the marginal distribution of the function values f evaluated on any finite set of inputs X = [x_1, …, x_N]^⊤, x_n ∈ X, is Gaussian:

Prior: f ∼ GP(0, K_ψ(·,·))  ⇒  p(f|X) = N(0, K_XX),  [K_XX]_{i,j} = K_ψ(x_i, x_j).    (3)

Given training data (X, y) and a Gaussian likelihood p(y|f) = N(f, σ²I), the posterior process is also a GP, and the posterior predictive distribution of f* evaluated at test inputs X* is:

p(f*|X*, X, y) = N(K_{X*X}(K_XX + σ²I)^{-1} y, K_{X*X*} − K_{X*X}(K_XX + σ²I)^{-1} K_{XX*}).    (4)

Unfortunately, with non-Gaussian likelihoods (e.g., for classification) or when the number of training datapoints N is large, the posterior process is intractable. Still, we can approximate the posterior process with a GP, and a popular approach is the sparse variational Gaussian process (SVGP) (Titsias, 2009; Hensman et al., 2013), which uses a small number M of inducing points (Z, u) = {(z_m, u_m)}_{m=1}^M to summarise the training data and, to some degree, replaces the terms involving X, y in eq. (4) with the inducing points. A detailed introduction to SVGP is provided in Appendix B.1. In short, it uses the properties of GPs to augment the prior as p(f, u|X, Z), a Gaussian with zero mean and covariance given by the kernel matrix computed on [X, Z], and defines the approximate posterior process as:

p(f*, f, u|Z, X*, X, y) ∝ p(y|f) p(f*, f|u, Z, X*, X) p(u|Z)
≈ q(f*, f, u|Z, X*, X) := p(f*, f|u, Z, X*, X) q(u),  q(u) := N(m_u, S_u).    (5)

Notice that the exact posterior and the approximate posterior share the conditional distribution p(f*, f|u, Z, X*, X).
This simplifies the evidence lower-bound (ELBO) objective for optimising the variational parameters m_u, S_u and the kernel hyperparameters ψ to

L_ELBO = E_{q(f|X,Z)}[log p(y|f)] − KL(q(u)||p(u|Z)).    (6)

Since q(u) and p(u|Z) are both Gaussian, the second term can be evaluated analytically. For non-Gaussian likelihoods, we resort to Monte-Carlo estimation of the first term. In prediction, the approximate posterior predictive distribution of f* evaluated on test inputs X* becomes:

q(f*|X*, Z) = ∫ p(f*, f|u, Z, X*, X) q(u) du df
= N(K_{X*Z} K_ZZ^{-1} m_u, K_{X*X*} + K_{X*Z} K_ZZ^{-1}(S_u − K_ZZ) K_ZZ^{-1} K_{ZX*}).    (7)

Note that the computations of both the ELBO (eq. (6)) and the approximate posterior predictive distribution (eq. (7)) require inverting K_ZZ only. Since we usually use a small number of inducing points (M ≪ N), the computational cost of SVGP, O(NM² + M³), is significantly lower than the O(N³) cost of the full GP resulting from the inversion of K_XX (c.f. eq. (4)). One way to bring the expressiveness of DNNs into GPs is to parameterise the kernel function with a DNN, so that the network weights become part of the hyperparameters of a deep kernel (Wilson et al., 2016). Given a regular base kernel K_base(·,·), such as the RBF kernel, we first map the inputs X to a feature space using a DNN h_θ(X), then apply the base kernel to the DNN features of the inputs: K_deep(·,·) = K_base(h_θ(·), h_θ(·)).
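To make the SVGP predictive distribution of eq. (7) concrete, here is a minimal NumPy sketch (our own illustration with an RBF kernel; not the paper's code). Setting S_u = K_ZZ recovers the prior variance, which gives a handy sanity check:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # squared-exponential base kernel between row-vector inputs
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def svgp_predict(X_star, Z, m_u, S_u, jitter=1e-8):
    """Mean and covariance of q(f*|X*, Z) as in eq. (7)."""
    K_ss = rbf(X_star, X_star)
    K_sz = rbf(X_star, Z)
    K_zz = rbf(Z, Z) + jitter * np.eye(len(Z))
    A = K_sz @ np.linalg.inv(K_zz)            # K_{X*Z} K_ZZ^{-1}
    mean = A @ m_u
    cov = K_ss + A @ (S_u - K_zz) @ A.T
    return mean, cov

# usage: M=4 inducing locations in 1-D, 6 test points
rng = np.random.default_rng(1)
Z = np.array([[-1.5], [-0.5], [0.5], [1.5]])
X_star = np.linspace(-2, 2, 6)[:, None]
m_u = rng.standard_normal(4)
mean, cov = svgp_predict(X_star, Z, m_u, rbf(Z, Z))  # S_u = K_ZZ: prior variance
```

Only the M×M matrix K_ZZ is inverted, which is the source of the O(NM² + M³) cost quoted above.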

3. SPARSE GAUSSIAN PROCESS ATTENTION

We propose Sparse Gaussian Process Attention (SGPA) to perform approximate Bayesian inference for Transformer-based models. The key idea is to replace the softmax operation in scaled dotproduct attention (Vaswani et al., 2017) with a kernel (Tsai et al., 2019) , and connect the resulting attention to the mean of an SVGP. This insight allows us to apply SVGP equations for uncertainty estimation, and we further introduce decoupled inducing points to improve computational efficiency.

3.1. ATTENTION AS THE MEAN OF A SPARSE VARIATIONAL GAUSSIAN PROCESS

Standard Transformers use attention blocks based on the scaled dot-product (Vaswani et al., 2017). Given queries q, keys k and values v, scaled dot-product (SDP) attention is given as follows:

SDP-Attention: F = softmax(qk^⊤/√d_k) v,    (8)

where d_k is the dimension of the keys. Since attention involves measuring the similarity between q and k, one can replace softmax(qk^⊤/√d_k) in eq. (8) with a kernel Gram matrix K_qk ([K_qk]_{i,j} = K(q_i, k_j)) computed using a valid symmetric kernel K(·,·), which we refer to as kernel attention or K-Attention for short:

K-Attention: F = K_qk v.    (9)

Recall that the posterior mean of SVGP in eq. (7) is m = K_XZ K_ZZ^{-1} m_u when evaluated on the training inputs (X* = X). Now we reparameterise the variational mean parameter of SVGP as [v]_{:,d} := K_ZZ^{-1} m_u for each dimension d of v, and define the queries and keys as the input locations and inducing point locations: q := X, k := Z. By doing so, each dimension of the output of a kernel attention block can be identified with the posterior mean of an SVGP. This allows us to extend the toolbox of Gaussian processes and their scalable approximations to quantify uncertainty in Transformers in the following sections.
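This equivalence is easy to check numerically: with v := K_ZZ^{-1} m_u, q := X and k := Z, the K-Attention output K_qk v coincides with the SVGP posterior mean K_XZ K_ZZ^{-1} m_u. A small sketch (ours, with an RBF kernel standing in for the symmetric attention kernel):

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(2)
T, M, d = 6, 4, 3
q = rng.standard_normal((T, d))        # queries = input locations X
k = rng.standard_normal((M, d))        # keys    = inducing locations Z
m_u = rng.standard_normal((M, 1))      # SVGP variational mean

# SVGP posterior mean: K_XZ K_ZZ^{-1} m_u
K_kk = rbf(k, k) + 1e-10 * np.eye(M)
svgp_mean = rbf(q, k) @ np.linalg.solve(K_kk, m_u)

# kernel attention with reparameterised values v := K_ZZ^{-1} m_u
v = np.linalg.solve(K_kk, m_u)
attn_out = rbf(q, k) @ v               # K-Attention, eq. (9)

assert np.allclose(svgp_mean, attn_out)
```

In practice the Transformer parameterises v directly (as the projected values), so no inversion is needed for the mean.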

3.2. STANDARD SGPA & ITS INEFFICIENCY FOR SELF-ATTENTION

Observing the equivalence between K-Attention and the SVGP mean, a natural idea for uncertainty estimation is to apply SVGP techniques to compute approximate posterior variances. In detail, we introduce a set of variational covariance parameters S ∈ R^{T×T×d_v} (with T the number of keys/inducing inputs), and optimise them using the ELBO (eq. (6)). This procedure returns the mean and covariance for each dimension d of the posterior attention output as:

m_d = K_qk [v]_{:,d},  Σ_d = K_qq + K_qk (K_kk^{-1} [S]_{:,:,d} K_kk^{-1} − K_kk^{-1}) K_kq.    (10)

In this way, we fit an SVGP to each dimension of the attention outputs independently: for each dimension d, an SVGP given by eq. (10) is fitted using the same kernel but with different variational mean ([v]_{:,d}) and covariance ([S]_{:,:,d}) parameters. We name this approach standard SGPA and provide a visualisation of the operations in Figure 1b. Unfortunately, standard SGPA becomes computationally inefficient when applied to Transformers based on multi-head self-attention. In each attention layer the keys for head h, k^h, are obtained by passing the output of the previous layer through a neural network. Moreover, the projection matrices for queries and keys need to be tied (i.e., W_q^h = W_k^h := W_qk^h) to obtain a valid symmetric kernel (Tsai et al., 2019). As a result, the queries and keys in a self-attention layer are the same; more importantly, they are input-dependent, i.e., k^h = sW_k^h, and they vary as the input sequence to the Transformer changes. Therefore, to extend the standard SVGP framework (eq. (10)) to self-attention, the covariance parameters S^h need to be input-dependent as well to accommodate the varying inducing inputs k^h. A naive idea would parameterise S^h by linear projection, e.g., vec(L^h) = sW_s^h for one head, where L^h is the Cholesky factor of S^h (see Figure 1b).
This would incur a memory cost of O(T²) per head even if we tie the variational covariances across output dimensions, and a run-time cost of O(T³) per head per input sequence for inverting K_{k^h k^h}, as k^h is input-dependent. Therefore standard SGPA is both memory- and run-time-inefficient, especially for long input sequences.

3.3. IMPROVING TIME & MEMORY EFFICIENCIES VIA DECOUPLED SGPA

We propose to address the aforementioned inefficiency issues by extending the orthogonally decoupled sparse Gaussian process approximation (Salimbeni et al., 2018) to self-attention. In addition to the input-dependent (or "amortised") keys/inducing inputs k^h, which we will call k_a^h from now on, for each head h we also incorporate M_g "global" keys/inducing inputs k_g^h that are shared across all input sequences. The main idea is to compute the variance of the sparse GP using the global keys only, so that the variational parameters for the S^h matrix become independent of the input sequences. Indeed, following the derivations presented in Appendix B.2, we can compute the mean and covariance for each output dimension d of each head as (we drop the superscript h for more concise notation):

m_d = K_{qk_a}[v_a]_{:,d} − K_{qk_g} K_{k_g k_g}^{-1} K_{k_g k_a}[v_a]_{:,d} + K_{qk_g}[v_g]_{:,d},
Σ_d = K_qq + K_{qk_g} K_{k_g k_g}^{-1}([S_g]_{:,:,d} − K_{k_g k_g}) K_{k_g k_g}^{-1} K_{k_g q},    (11)

where v_g ∈ R^{M_g×d_v}, S_g ∈ R^{M_g×M_g×d_v} are the variational parameters associated with the global keys k_g, and v_a ∈ R^{T×d_v} is computed via the projection v_a = sW_v. We name this approach decoupled SGPA and illustrate it in Figure 1c. Compared to standard SGPA (eq. (10), where k_a^h in decoupled SGPA plays the role of k^h in standard SGPA), the posterior mean of decoupled SGPA involves two extra terms that account for the effect of the global inducing points. More importantly, the posterior variances of the two SGPA methods differ only in the keys/inducing inputs in use (input-dependent keys k^h versus global keys k_g^h), and this brings in the key advantage of decoupled SGPA. As the posterior covariance in eq. (11) only involves the global inducing points, the variational covariance no longer needs to be input-dependent, and (the Cholesky factor of) S_g^h can be parameterised freely.
Now the number of parameters for the covariance part is of order O(M_g²) (vs O(T²) in standard SGPA), and the matrix inversion pays a one-off cost of O(M_g³) (vs O(T³) for every input sequence). Notice that we are free to choose the number of global inducing points M_g, and in practice we find M_g = O(T_avg/H) is usually sufficient, where T_avg is the average length of the training input sequences. In Table 1, we summarise the time complexity (with batch size B) and the additional memory (number of parameters) required for SGPA in one head of a Transformer. We also include maximum likelihood estimation (MLE) for reference (note that the memory complexity of MLE does not depend on the input sequence length T).

Table 1:
Method | Time | Additional memory
MLE | O(BT²) | —
Standard SGPA | O(BT³) | O(T²)
Decoupled SGPA | O(BT²M_g + M_g³) | O(M_g²)

As the time and memory savings are significant, we mainly evaluate decoupled SGPA in our experiments, and in the rest of the main text we refer to decoupled SGPA simply as SGPA.
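Under our reading of eq. (11), the per-dimension posterior of decoupled SGPA can be sketched as follows (our own NumPy illustration with an RBF base kernel and hypothetical toy sizes; not the paper's implementation). Note that only the M_g × M_g matrix K_{k_g k_g} is ever inverted:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def decoupled_sgpa(q, k_a, k_g, v_a, v_g, S_g, jitter=1e-8):
    """Mean and covariance of one output dimension of one head, eq. (11)."""
    K_qa, K_qg = rbf(q, k_a), rbf(q, k_g)
    K_gg = rbf(k_g, k_g) + jitter * np.eye(len(k_g))
    K_ga = rbf(k_g, k_a)
    K_gg_inv = np.linalg.inv(K_gg)           # M_g x M_g: the only inversion
    m = K_qa @ v_a - K_qg @ K_gg_inv @ K_ga @ v_a + K_qg @ v_g
    Sigma = rbf(q, q) + K_qg @ K_gg_inv @ (S_g - K_gg) @ K_gg_inv @ K_qg.T
    return m, Sigma

# usage: T=5 tokens (amortised keys k_a = q for self-attention), M_g=3 global keys
rng = np.random.default_rng(3)
q = rng.standard_normal((5, 2))
k_g = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
v_a = rng.standard_normal(5)
v_g = rng.standard_normal(3)
S_g = rbf(k_g, k_g)                          # S_g = K_gg recovers the prior variance
m, Sigma = decoupled_sgpa(q, q, k_g, v_a, v_g, S_g)
```

Because S_g is attached only to the global keys, its Cholesky factor can be stored as a free parameter, giving the O(M_g²) memory figure in Table 1.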

3.4. TRANSFORMER BASED ON DECOUPLED SGPA

So far we have presented SGPA for uncertainty quantification in a multi-head self-attention module. When applied to Transformer models, multiple layers of attention blocks are in use, and in the following we describe the construction of a Transformer model based on decoupled SGPA. Note that, as SGPA is equivalent to a sparse GP, the Transformer model presented below can be viewed as a sparse approximation to a deep GP (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) with a deep kernel in each layer. Our Transformer architecture mostly follows that of Vaswani et al. (2017). The input to the lth SGPA layer is the output of the previous SGPA layer, F^{l−1} ∈ R^{T×d_{l−1}}. We first process the input with a non-linear mapping G_{ϕ_l}: R^{d_{l−1}} → R^{d_l}, and then perform projections to obtain the queries, amortised & global keys, and values. Specifically, for each head h:

q^{l,h} = k_a^{l,h} = G_{ϕ_l}(F^{l−1}) W_qk^{l,h},  k_g^{l,h} = G_{ϕ_l}(Z_g^{l,h}) W_qk^{l,h},  v_a^{l,h} = G_{ϕ_l}(F^{l−1}) W_v^{l,h},    (12)

where Z_g^{l,h} ∈ R^{M_g×d_{l−1}} are the global inducing locations of the lth layer, defined on the same space as F^{l−1}. Then we apply a base kernel K_base(·,·) to compute the kernel matrices. This is equivalent to using a deep kernel defined on the space of F^{l−1}, where the parameters of G_{ϕ_l} are viewed as the hyperparameters of the deep kernel. Lastly, with variational parameters (v_g^{l,h}, S_g^{l,h}) associated with the global inducing locations Z_g^{l,h}, we obtain m_d^{l,h} and Σ_d^{l,h} using eq. (11). We then propagate uncertainty to the next layer by generating samples of the output of each head using the reparameterization trick as in Salimbeni & Deisenroth (2017):

[F_h^l]_{:,d} = m_d^{l,h} + (Σ_d^{l,h})^{1/2} ε_d^{l,h},  ε_d^{l,h} ∼ N(0, I).    (13)

The final output F^l ∈ R^{T×d_l} of this SGPA layer is obtained by linear combination in the same way as in standard Transformers (see eq. (2)).
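The reparameterised sample in eq. (13) requires a matrix square root of Σ_d^{l,h}; a Cholesky factor is the standard choice. A minimal sketch (ours) of sampling one output dimension of one head, together with an empirical check of the moments:

```python
import numpy as np

def sample_head_dim(m, Sigma, rng, jitter=1e-8):
    """Draw [F_h^l]_{:,d} = m + Sigma^{1/2} eps with eps ~ N(0, I), as in eq. (13)."""
    L = np.linalg.cholesky(Sigma + jitter * np.eye(len(m)))
    return m + L @ rng.standard_normal(len(m))

# usage: check the empirical moments against (m, Sigma)
rng = np.random.default_rng(4)
m = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + np.eye(3)            # a valid (positive definite) covariance
samples = np.stack([sample_head_dim(m, Sigma, rng) for _ in range(20000)])
emp_mean = samples.mean(0)             # should approach m
emp_cov = np.cov(samples.T)            # should approach Sigma
```

Because the sample is a differentiable function of (m, Σ), gradients of the ELBO flow through the sampling step, which is what makes end-to-end training of the stacked SGPA layers possible.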
The ELBO objective for training the variational & kernel parameters is derived following the deep GP and additive GP (Duvenaud et al., 2011) approaches. The key idea is that, as each head in MHSA with SGPA is a (sparse) GP, the final output F^l can also be viewed as a weighted summation of (sparse) GPs, which is again a GP (Duvenaud et al., 2011). This allows us to perform variational approximation on each of the heads before the final combination, instead of a direct approximation on the F^l process (Sun et al., 2021). Assuming the approximate posterior q for {F_h^l}_{h=1}^H factorises over h, the corresponding ELBO with input sequence F^0 := X is (derivations in Appendix B.3):

L_ELBO = E_{q(F^L | F^0, {k_g^{l,h}}_{l=1,h=1}^{L,H})}[log p(Y|F^L)]
− Σ_{l=1}^L Σ_{h=1}^H E_{q(F^{l−1} | F^0, {k_g^{j,h}}_{j=1,h=1}^{l−1,H})}[KL(q(u_{a∪g}^{l,h} | k_g^{l,h}, F^{l−1}) || p(u_{a∪g}^{l,h} | k_g^{l,h}, F^{l−1}))].

In practice, we resort to Monte-Carlo estimation of L_ELBO, with samples of function values generated by iteratively passing through each layer using the reparameterization trick (eq. (13)).

4. EXPERIMENTS

We evaluate SGPA on prediction tasks across modalities, with the following experimental set-up.
• Datasets: CIFAR10 & CIFAR100 (image classification (Krizhevsky et al., 2009), CV tasks); CoLA (linguistic acceptability prediction (Warstadt et al., 2019), NLP task) and IMDB (sentiment analysis (Maas et al., 2011), NLP task).
• Network architectures: We use Vision Transformers (ViT (Dosovitskiy et al., 2021)) for the CV tasks. For kernel attention we use the exponential kernel (Tsai et al., 2019) and the ARD-RBF kernel (Rasmussen & Williams, 2006) for the NLP and CV tasks respectively. Scaled dot-product (SDP) attention based Transformers are also evaluated. As in Tsai et al. (2019), we find kernel attention tends to outperform SDP attention in most tasks considered, so we do not include the results of SDP attention in the main text; these results can be found in the tables in Appendix G.
• Baselines: We compare our approach with the following "single-model" methods: maximum likelihood estimation (MLE), and Bayesian inference methods including mean-field variational inference (MFVI) (Blundell et al., 2015), Monte-Carlo Dropout (MCD) (Gal & Ghahramani, 2016), Kronecker-factored last-layer Laplace approximation (KFLLLA) (Kristiadi et al., 2020), and Spectral-normalized Neural Gaussian Process (SNGP) (Liu et al., 2020). For tasks where a validation set is used, we also consider temperature scaling (TS) (Guo et al., 2017a) and use the validation set as the calibration set. For the CV tasks, we also consider ensemble methods: we compare the SGPA ensemble (SGPAE) with deep ensembles (DE) (Lakshminarayanan et al., 2017). We do not consider ensemble models for the NLP tasks since we use different train-(valid)-test splits in different runs for them.
• Evaluations & metrics: We consider three evaluation set-ups: in-distribution performance, out-of-distribution (OOD) robustness and OOD detection.
The metrics on the test set include a predictive accuracy metric for each task and uncertainty calibration metrics such as the negative predictive log-likelihood (NLL), expected calibration error (ECE) and maximum calibration error (MCE) (Guo et al., 2017b). We report the mean ± two standard errors for each metric obtained from 5 independent runs. For the OOD detection tasks we consider the areas under the ROC & precision-recall curves (AUROC & AUPR, respectively), and we report the average ranks in terms of AUROC and AUPR over all 6 OOD detection tasks for each method. For fair comparison, within each task all models are trained using the same architecture and optimisation setting. All models are trained from scratch without pre-training. We include the experimental details in Appendix E. Results are also presented in tables in Appendix G.
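As a concrete reference for the OOD detection metrics, the AUROC of an uncertainty score (we use predictive entropy as the score later, in Section 4.3) can be computed as below. This is our own sketch, not the paper's evaluation pipeline; AUROC is obtained via the rank-based Mann-Whitney statistic rather than by sweeping thresholds explicitly:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of predictive probabilities, used as an OOD score."""
    return -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=1)

def auroc(scores_ood, scores_in):
    """AUROC of 'score is higher for OOD', via the Mann-Whitney U statistic."""
    all_scores = np.concatenate([scores_ood, scores_in])
    ranks = all_scores.argsort().argsort() + 1.0   # 1-based ranks (ties ignored for brevity)
    n_o, n_i = len(scores_ood), len(scores_in)
    u = ranks[:n_o].sum() - n_o * (n_o + 1) / 2.0
    return u / (n_o * n_i)

# toy check: confident in-distribution vs uniform (high-entropy) OOD predictions
p_in = np.array([[0.95, 0.05]] * 50)
p_ood = np.array([[0.5, 0.5]] * 50)
score = auroc(predictive_entropy(p_ood), predictive_entropy(p_in))  # perfect separation: 1.0
```

A score of 0.5 corresponds to a detector no better than chance; 1.0 means the entropy of every OOD input exceeds that of every in-distribution input.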

4.1. IN-DISTRIBUTION CALIBRATION

We report the evaluation results on in-distribution test data for image classification (CIFAR10 & CIFAR100, without data augmentation), sentiment analysis (IMDB), and linguistic acceptability (CoLA) in the first, second, third and fourth rows of Figure 2 respectively. For the CoLA dataset, predictive performance is measured by the Matthews correlation coefficient (MCC) (Matthews, 1975) instead of accuracy, as in Warstadt et al. (2019). All "single-model" calibration methods considered tend to improve calibration, except on sentiment analysis, where KFLLLA fails in the sense that it achieves calibration worse even than MLE (although KFLLLA achieves the best calibration for linguistic acceptability (CoLA), its performance is unstable across tasks). Although MFVI tends to achieve the lowest calibration errors, it severely underfits the data in all experiments. This is undesirable, as improvement in calibration should not come at the price of a noticeable drop in predictive correctness. As a counter-example, one can achieve perfect calibration (zero ECE) by predicting the marginal class probabilities, but this prediction is useless in practice. For image classification on CIFAR100, KFLLLA achieves lower ECE than SGPA; however, it achieves worse NLL and the worst MCE among all methods. Overall, SGPA achieves the best performance among the "single-model" baselines: it consistently achieves better calibration across all tasks while maintaining competitive (or even better, on IMDB) predictive accuracy. Compared with "single-model" methods, both ensemble methods, DE and SGPAE, achieve much better predictive accuracy. SGPAE noticeably outperforms DE in terms of calibration while maintaining competitive predictive accuracy.
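The counter-example above (a marginal-probability predictor is "perfectly calibrated" yet useless) is easy to reproduce with a standard binned ECE estimator. The sketch below is our own implementation of the usual equal-width-bin ECE, not the paper's evaluation code:

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    conf = probs.max(axis=1)                       # predicted confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():                           # weight = fraction of points in bin
            err += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return err

# a predictor that always outputs the marginal class probabilities of a
# balanced binary problem: zero ECE, yet its predictions carry no information
labels = np.array([0, 1] * 50)
marginal = np.full((100, 2), 0.5)
print(ece(marginal, labels))  # 0.0 — "perfectly calibrated" but useless
```

With all confidences at 0.5 and accuracy also 0.5, the single occupied bin contributes zero error, illustrating why calibration metrics must be read together with accuracy.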

4.2. ROBUST PREDICTION ON OUT-OF-DISTRIBUTION DATA

Next we evaluate the performance of SGPA under distribution shift, for both the linguistic acceptability task (CoLA) and the image classification tasks (CIFAR10 & CIFAR100). The OOD data for CoLA is introduced by the authors of Warstadt et al. (2019), while for the CIFAR datasets we use the corrupted CIFAR datasets (CIFAR10-C and CIFAR100-C) (Hendrycks & Dietterich, 2019) as the OOD data, which contain noisy CIFAR images with different types of distortions applied to their clean counterparts at different skew intensities. Note that we do not consider TS for OOD tasks in this and the next subsection since, as a Frequentist method, it is proposed to calibrate uncertainty on in-distribution data only. We report the results for CoLA in Figure 3: some baselines achieve lower calibration errors, but they achieve worse predictive accuracy than SGPA. In particular, MFVI again underfits the data. For the OOD robustness test on image classification, we compute metrics against skew intensity on the corrupted CIFAR datasets and report them in Figure 4. Again the same story holds: among "single-model" methods, SGPA outperforms MLE, MCD and SNGP in terms of calibration without hurting accuracy. MFVI achieves lower calibration errors than SGPA, but pays the price of underfitting, especially when the skew intensity is small. The performance of KFLLLA is not as stable as that of SGPA: for CIFAR10-C, SGPA achieves better calibration than KFLLLA; for CIFAR100-C, KFLLLA achieves the best NLL and ECE but the worst MCE. Ensemble methods again achieve better accuracy than "single-model" methods, and SGPAE still outperforms DE in terms of calibration while achieving similar accuracy.

4.3. OUT-OF-DISTRIBUTION DETECTION

Lastly we consider OOD detection tasks on Transformer models trained for image classification, to further evaluate the quality of the uncertainty estimates. Here we use the predictive entropy to score each input (from either in-distribution or OOD data) and decide "in/out" if the entropy is smaller/greater than a specified threshold. Varying the threshold of this detection task allows us to compute both the receiver operating characteristic (ROC) and the precision-recall (PR) curves, and we use the areas under the curves, i.e., AUROC and AUPR, for performance evaluation. For each of the two CIFAR datasets, we take the other CIFAR dataset, SVHN and mini-ImageNet as OOD datasets, so that we construct 6 OOD detection tasks in total. For each method, we report its average ranks over these tasks.

5. RELATED WORK

One existing approach (2020) only considers finetuning with variational attention, and Cinquin et al. (2021) only considers experiments on synthetic or simple datasets with shallow networks, where the variational distribution fitted over the attention weights is shared across data, which might be too restrictive for complex problems. Moreover, they find that a data-dependent variational distribution over attention weights can even hurt the performance of their approaches. Liu et al. (2020) consider performing Bayesian inference directly over the Transformer output by fitting a GP to the last-layer output (Bradshaw et al., 2017). This approach can be viewed as a GP model with a deep kernel defined by the Transformer. Instead, SGPA fits a deep GP, so that uncertainty is propagated through each attention layer of the Transformer. In addition, Liu et al. (2020) propose to preserve a distance-awareness property for the deep kernel. Note that this distance-preserving trick is orthogonal to ours and can also easily be integrated into SGPA. Related GP methods. The ELBO of SGPA is similar to that in Sun et al.
(2021), who also propose to independently approximate the posterior of each additive component in an additive GP (Duvenaud et al., 2011). The difference is in the kernel design: Sun et al. (2021) aim to decompose a given kernel function into orthogonal "kernel bases", while in SGPA we use the same type of kernel for each attention head but with different kernel hyperparameters. Our approach is also related to sparse-within-sparse Gaussian processes (SWSGP) (Tran et al., 2021; Jafrasteh et al., 2022), which allow adaptive inducing points for each data point (similar to the input-dependent keys k_a in SGPA). The connection between SGPA and SWSGP is discussed further in Appendix D.

6. CONCLUSION AND FUTURE WORK

We have proposed SGPA to directly perform approximate Bayesian inference over the outputs of attention blocks in Transformers. Compared with other baselines, we showed that Transformers based on SGPA achieve a better balance between predictive accuracy and calibration. Furthermore, the improved quality of uncertainty estimation provided by SGPA has proven useful in maintaining robustness under distribution shift and in out-of-distribution detection. Future work will investigate the following directions. First, masked pre-training (Devlin et al., 2019), which has proven crucial for downstream tasks with standard Transformers, may also improve the performance of Transformers based on SGPA. In this work we were not able to consider pre-training due to its high computational cost, and since SGPA replaces the scaled dot-product with a valid kernel, there is no existing pre-trained backbone that can be directly used for downstream fine-tuning. Second, many tasks using Transformers, such as neural machine translation, require autoregressive prediction using an encoder-decoder architecture (Vaswani et al., 2017; Brown et al., 2020), and we will adapt SGPA to the decoder as well. Lastly, we will investigate introducing the distance-preserving trick for hidden mappings (Liu et al., 2020) into SGPA, which has been shown to be useful in regularising the parameters in deep kernel learning.

B DERIVATIONS

B.1 ELBO DERIVATION FOR SVGP

Here we review the derivation of the ELBO for standard SVGP (Titsias, 2009; Hensman et al., 2013). With M inducing point pairs {(z_m, u_m)}_{m=1}^M, the prior distribution of [f, u]^⊤ is (block rows of the covariance separated by ';'):

p(f, u|X, Z) = N(0, [K_XX, K_XZ; K_ZX, K_ZZ]).    (15)

With the prior conditional matching assumption (see eq. (5)), the approximate posterior conditional distribution of the function values f at the training inputs X given the inducing points u is the same as the prior conditional distribution:

q(f|u, Z, X) = p(f|u, Z, X).    (16)

Under this assumption, q(f, u|Z, X) = p(f|u, Z, X) q(u), where q(u) = N(m_u, S_u). With observation likelihood p(y|f), the ELBO simplifies as follows (the conditional p(f|u, Z, X) cancels between numerator and denominator):

L_ELBO = E_{q(f,u|Z,X)}[log (p(y, f, u|Z, X) / q(f, u|Z, X))]
= E_{q(f,u|Z,X)}[log (p(y|f) p(f|u,Z,X) p(u|Z)) / (p(f|u,Z,X) q(u))]
= ∫ (∫ p(f|u,Z,X) q(u) du) log p(y|f) df + ∫∫ q(u) log(p(u|Z)/q(u)) p(f|u,Z,X) df du
= E_{q(f|X,Z)}[log p(y|f)] − KL(q(u)||p(u|Z)).    (17)

Here q(f|X,Z) = ∫ p(f|u,Z,X) q(u) du is Gaussian and is given as:

q(f|X,Z) = N(K_XZ K_ZZ^{-1} m_u, K_XX + K_XZ K_ZZ^{-1}(S_u − K_ZZ) K_ZZ^{-1} K_ZX).    (18)

With a Gaussian likelihood, the first term of the ELBO can be evaluated analytically; otherwise we estimate it with Monte-Carlo samples f ∼ q(f|X,Z). The second term is a KL divergence between two Gaussian distributions and thus admits a closed form:

KL(q(u)||p(u|Z)) = (1/2) [Tr(K_ZZ^{-1} S_u) + m_u^⊤ K_ZZ^{-1} m_u + log(|K_ZZ|/|S_u|) − M].

In standard SGPA the ELBO objective remains almost the same, except that, as the variational mean is reparameterised to v := K_ZZ^{-1} m_u, the mean of q(f|X,Z) becomes K_XZ v, and the quadratic term in KL(q(u)||p(u|Z)) becomes v^⊤ K_ZZ v.
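The closed-form KL above is straightforward to implement; a numerically safer version uses Cholesky-based log-determinants. This is our own sketch, not the paper's code:

```python
import numpy as np

def kl_q_p(m_u, S_u, K_zz):
    """KL(N(m_u, S_u) || N(0, K_zz)), the closed form above."""
    M = len(m_u)
    K_inv = np.linalg.inv(K_zz)
    # log-determinants via Cholesky factors for numerical stability
    logdet_K = 2.0 * np.log(np.diag(np.linalg.cholesky(K_zz))).sum()
    logdet_S = 2.0 * np.log(np.diag(np.linalg.cholesky(S_u))).sum()
    return 0.5 * (np.trace(K_inv @ S_u) + m_u @ K_inv @ m_u
                  + logdet_K - logdet_S - M)

# sanity check: q = p gives zero KL
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
K_zz = A @ A.T + np.eye(4)
print(kl_q_p(np.zeros(4), K_zz, K_zz))  # ≈ 0.0
```

Shifting the variational mean away from zero (or shrinking S_u) makes the KL strictly positive, which is the regularisation effect of the second ELBO term.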

B.2 ORTHOGONALLY DECOUPLED SVGP

The orthogonally decoupled SVGP (Salimbeni et al., 2018) can be interpreted as an SVGP (Titsias, 2009) with a structured variational distribution over the inducing points. Two sets of inducing points are in use: $\{(z_a^{(m)}, u_a^{(m)})\}_{m=1}^{M_a}$ and $\{(z_g^{(m)}, u_g^{(m)})\}_{m=1}^{M_g}$. Consider a structured Gaussian variational distribution over $u := u_{a \cup g} = u_a \cup u_g$, with variational mean and covariance given as

$$m_u = \begin{bmatrix} K_{Z_a Z_g} K_{Z_g Z_g}^{-1} m_g + m_a \\ m_g \end{bmatrix}, \quad S_u = \begin{bmatrix} K_{Z_a Z_a} + K_{Z_a Z_g} K_{Z_g Z_g}^{-1} (S_g - K_{Z_g Z_g}) K_{Z_g Z_g}^{-1} K_{Z_g Z_a} & K_{Z_a Z_g} K_{Z_g Z_g}^{-1} S_g \\ S_g K_{Z_g Z_g}^{-1} K_{Z_g Z_a} & S_g \end{bmatrix}.$$

Plugging these $m_u$ and $S_u$ into eq. (18) and cancelling terms, we obtain the posterior of the orthogonally decoupled SVGP over $f$, a Gaussian with mean $m_f$ and covariance $\Sigma_{ff}$:

$$m_f = (K_{X Z_a} - K_{X Z_g} K_{Z_g Z_g}^{-1} K_{Z_g Z_a})(K_{Z_a Z_a} - K_{Z_a Z_g} K_{Z_g Z_g}^{-1} K_{Z_g Z_a})^{-1} m_a + K_{X Z_g} K_{Z_g Z_g}^{-1} m_g,$$
$$\Sigma_{ff} = K_{XX} + K_{X Z_g} K_{Z_g Z_g}^{-1} (S_g - K_{Z_g Z_g}) K_{Z_g Z_g}^{-1} K_{Z_g X}.$$

If we further reparameterise $(K_{Z_a Z_a} - K_{Z_a Z_g} K_{Z_g Z_g}^{-1} K_{Z_g Z_a})^{-1} m_a$ as $v_a$ and $K_{Z_g Z_g}^{-1} m_g$ as $v_g$, we arrive at the final expressions used in decoupled SGPA (see eq. (11)):

$$m_f = K_{X Z_a} v_a - K_{X Z_g} K_{Z_g Z_g}^{-1} K_{Z_g Z_a} v_a + K_{X Z_g} v_g, \quad \Sigma_{ff} = K_{XX} + K_{X Z_g} K_{Z_g Z_g}^{-1} (S_g - K_{Z_g Z_g}) K_{Z_g Z_g}^{-1} K_{Z_g X}.$$

B.3 ELBO FOR TRANSFORMERS BASED ON SGPA

An $L$-layer Transformer based on SGPA is a deep GP (Damianou & Lawrence, 2013), and we train it using the doubly stochastic variational inference framework (Salimbeni & Deisenroth, 2017). For each input sequence $F^0 := X$, the joint distribution over $Y$, $\{F^l\}_{l=1}^L$ and $\{u_{a\cup g}^{l,h}\}_{l=1,h=1}^{L,H}$ is

$$p(Y, \{F^l\}_{l=1}^L, \{u_{a\cup g}^{l,h}\}_{l=1,h=1}^{L,H} \mid F^0) = p(Y \mid F^L) \prod_{l=1}^L p(F^l \mid \{u_{a\cup g}^{l,h}\}_{h=1}^H, F^{l-1})\, p(\{u_{a\cup g}^{l,h}\}_{h=1}^H \mid \{k_g^{l,h}\}_{h=1}^H, F^{l-1}),$$

where $p(\{u_{a\cup g}^{l,h}\}_{h=1}^H \mid \{k_g^{l,h}\}_{h=1}^H, F^{l-1}) = \prod_{h=1}^H p(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1})$, since we assume the prior over the inducing points factorises across the heads in each layer.
Note that the amortised keys $k_a^{l,h}$ depend on $F^{l-1}$ in a deterministic manner, so we drop the amortised key terms from the conditioning. Assuming prior conditional matching (i.e., $q(F^l \mid \{u_{a\cup g}^{l,h}\}_{h=1}^H, F^{l-1}) = p(F^l \mid \{u_{a\cup g}^{l,h}\}_{h=1}^H, F^{l-1})$), the joint approximate posterior over $\{F^l\}_{l=1}^L$ and $\{u_{a\cup g}^{l,h}\}_{l=1,h=1}^{L,H}$ is

$$q(\{F^l\}_{l=1}^L, \{u_{a\cup g}^{l,h}\}_{l=1,h=1}^{L,H} \mid F^0) = \prod_{l=1}^L p(F^l \mid \{u_{a\cup g}^{l,h}\}_{h=1}^H, F^{l-1})\, q(\{u_{a\cup g}^{l,h}\}_{h=1}^H \mid \{k_g^{l,h}\}_{h=1}^H, F^{l-1}),$$

where $q(\{u_{a\cup g}^{l,h}\}_{h=1}^H \mid \{k_g^{l,h}\}_{h=1}^H, F^{l-1}) = \prod_{h=1}^H q(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1})$, since we let the approximate distribution over $\{u_{a\cup g}^{l,h}\}_{h=1}^H$ also factorise across heads. The ELBO is derived in the same manner as the single-layer GP case (eq. (17)); again the conditional distribution terms in $q$ and $p$ cancel, which simplifies the ELBO to

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(F^L \mid F^0, \{k_g^{l,h}\}_{l=1,h=1}^{L,H})}[\log p(Y \mid F^L)] - \sum_{l=1}^L \sum_{h=1}^H \mathbb{E}_{q(F^{l-1} \mid F^0, \{k_g^{j,h}\}_{j=1,h=1}^{l-1,H})}\left[\mathrm{KL}\big(q(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1}) \,\|\, p(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1})\big)\right],$$

where

$$q(F^l \mid F^0, \{k_g^{j,h}\}_{j=1,h=1}^{l,H}) = \int \prod_{j=1}^{l} p(F^j \mid \{u_{a\cup g}^{j,h}\}_{h=1}^H, F^{j-1})\, q(\{u_{a\cup g}^{j,h}\}_{h=1}^H \mid \{k_g^{j,h}\}_{h=1}^H, F^{j-1})\, du_{a\cup g}^{1:l,1:H}\, dF^{1:l-1}.$$

Both terms in the ELBO can be estimated with samples generated iteratively through each layer using the reparameterization trick.
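The layer-wise sampling just described can be sketched as follows. This is a minimal NumPy illustration of the one-sample, doubly stochastic ELBO estimate; `posterior` here is a stand-in for the per-layer SGPA posterior (returning diagonalised per-dimension means and variances), and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer(F_prev, posterior):
    """Draw one reparameterised sample F_l ~ q(F_l | F_{l-1}).
    `posterior` maps F_{l-1} to per-dimension means/variances of shape [T, D]."""
    mean, var = posterior(F_prev)
    eps = rng.standard_normal(mean.shape)
    return mean + np.sqrt(var) * eps

def elbo_single_sample(F0, posteriors, log_lik, kl_total):
    """One-sample ELBO estimate: propagate a single draw through all L layers,
    score the likelihood at the top, then subtract the summed KL terms."""
    F = F0
    for posterior in posteriors:     # layers l = 1, ..., L
        F = sample_layer(F, posterior)
    return log_lik(F) - kl_total
```

In training, one such sample per input sequence is enough for an unbiased (if noisy) gradient estimate, which matches the single-sample setting described in Appendix E.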
For the second "regularisation" term, the KL divergence inside the expectation admits a simplified form, since an independent decoupled SVGP is fitted for each attention output dimension $d$:

$$\mathrm{KL}\big(q(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1}) \,\|\, p(u_{a\cup g}^{l,h} \mid k_g^{l,h}, F^{l-1})\big) = \frac{1}{2} \sum_{d=1}^{D} \Big\{ [v_a^{l,h}]_{:,d}^\top \big(K_{k_a^{l,h} k_a^{l,h}} - K_{k_a^{l,h} k_g^{l,h}} K_{k_g^{l,h} k_g^{l,h}}^{-1} K_{k_g^{l,h} k_a^{l,h}}\big) [v_a^{l,h}]_{:,d} + [v_g^{l,h}]_{:,d}^\top K_{k_g^{l,h} k_g^{l,h}} [v_g^{l,h}]_{:,d} + \mathrm{Tr}\big([S_g^{l,h}]_{:,:,d} K_{k_g^{l,h} k_g^{l,h}}^{-1}\big) - \log \big|[S_g^{l,h}]_{:,:,d}\big| + \log \big|K_{k_g^{l,h} k_g^{l,h}}\big| - M_g \Big\},$$

where $D$ is the total number of attention output dimensions and $M_g$ is the number of global inducing points for each head.

The posterior mean and covariance for decoupled SGPA based on Cheng & Boots (2017) are given as follows:

$$m_d = K_{q k_a} [v_a]_{:,d}, \qquad \Sigma_d = K_{qq} + K_{q k_g} K_{k_g k_g}^{-1} \big([S_g]_{:,:,d} - K_{k_g k_g}\big) K_{k_g k_g}^{-1} K_{k_g q}.$$

For ViTs trained on CIFAR10 and CIFAR100 with data augmentation, we report in-distribution calibration and out-of-distribution robustness results in Figures 7 and 8, respectively. Although some "single-model" methods can achieve lower ECE or MCE than DE and SGPAE in some cases, DE and SGPAE consistently outperform them in terms of accuracy and NLL, and SGPAE again achieves the best overall performance. Among "single-model" methods, MFVI still underfits the data; for the other methods, data augmentation improves performance, with SNGP achieving relatively low accuracy. The difference between SGPA and the other "single-model" baselines becomes smaller, perhaps due to the strong regularisation from data augmentation. Still, SGPA performs more robustly, as it generally returns smaller error bars than TS, MCD, and KFLLLA.

In Table 4 we report each method's average rank in terms of AUROC and AUPR over 6 OOD detection tasks. Ensemble methods again outperform "single-model" methods, with SGPAE achieving the best performance. Among "single-model" methods, SGPA achieves the best AUROC while KFLLLA achieves the best AUPR.
In Figures 9 and 10 in Appendix A, we further plot the AUROC and AUPR values of all methods within each task.

C.3 GRAPH PROPERTY REGRESSION WITH ZINC DATASET

For graph property regression, we assume a Laplace likelihood with a trainable scale parameter $b$ (i.e., the density of the observation likelihood is $g(y \mid f) = \frac{1}{2b} \exp\left(-\frac{|y - f|}{b}\right)$, where $f$ is the scalar function value output by the Transformer). We compute mean absolute error (MAE), root-mean-square error (RMSE) and negative log-likelihood (NLL) to evaluate the models, with results presented in Figure 11. Moreover, we use the predictive variances as scores and evaluate OOD detection performance in Figure 12. Note that MLE is useless for OOD detection in this case since it produces homogeneous predictive variances for all instances. We use a synthetic OOD dataset generated from the test set: for each test instance, we remove the existing edges from the adjacency matrix and add edges between nodes that are not originally connected.

D DISCUSSION

Consequently, SGPA-based Transformers cannot perfectly model the correlation between input sequences; instead, they can only provide marginal uncertainty for each input sequence. Nevertheless, empirically we found that such correlation might not be critical in applications such as text or image classification. One way to interpret SGPA is as a sparse-within-sparse Gaussian process (SWSGP) (Tran et al., 2021; Jafrasteh et al., 2022), which allows adaptive inducing points for each input. Suppose the index set (in our case the embedding space) is $\chi$, and $p(Z)$ is a distribution over $M$-element subsets of $\chi$ (i.e., each random draw gives $M$ inducing locations from $\chi$). The joint prior and approximate posterior become $p(f, u, Z) = p(f \mid u)\, p(u \mid Z)\, p(Z)$ and $q(u, Z) = q(u \mid Z)\, p(Z)$, respectively. In Tran et al. (2021), for each input $x$ they propose to select its $M$ nearest neighbours among the training inputs as inducing locations, so that $q(Z)$ is a delta distribution conditioned on $x$, and $q(u \mid Z)$ is the marginal variational distribution over function values evaluated at the selected inducing locations.
In contrast, for each input sequence x, in layer l, the inducing locations used by a Transformer based on SGPA consist of both input-dependent ones ({k l,h a } H h=1), which are obtained from x using a neural network as in Jafrasteh et al. (2022), and global inducing locations {k l,h g } H h=1, which are shared across all input sequences. Note that the input-dependent inducing points used at test time may not have been encountered during training; instead, we rely on the learned neural network to amortise them (Jafrasteh et al., 2022). Therefore the fitted mean function may not be consistent when a test sequence includes tokens far away from the tokens in the training sequences (as v a may not be consistent). Empirically, we found this inconsistency issue to be minor for in-distribution test sequences. Furthermore, we argue that for OOD inputs, although the fitted posterior mean might be unreliable, the uncertainty still increases, since the posterior covariance is fully determined by the global inducing points, which exhibit no inconsistency issue. Intuitively, the global keys {k l,h g } L,H l=1,h=1, shared across all input sequences, play a similar role to the inducing locations in standard SVGP: they summarise the training set but focus on the uncertainty behaviour only. As a result, the posterior variance in eq. (11) still increases for queries that are less similar to the global keys as measured by the kernel. By propagating the uncertainty through each layer, we can still obtain increased uncertainty for input sequences that are very different from the training data, so that users can still be notified "when the model does not know" (Gal, 2016). However, unlike for standard kernels such as the RBF, it is less obvious how a deep kernel measures similarity, and some advantages of Bayesian inference are still lost, as we do not know what prior (or inductive bias) the deep kernel represents.
Therefore, in future work we will investigate how to inject meaningful inductive biases into deep kernels. For datasets where Euclidean distance is meaningful, we can adopt the distance-preserving techniques of SNGP (Liu et al., 2020) to enforce an inductive bias in deep kernels. For other types of datasets, however, this remains an interesting open question that we are keen to explore.

E EXPERIMENTAL DETAILS

Training settings shared across experiments. For MLE and MCD, we initially considered dropout rates of 0.1 and 0.2, but in our preliminary image classification experiments models trained with dropout rate 0.1 consistently outperformed those trained with 0.2 in terms of accuracy. We therefore use dropout rate 0.1 for all methods except MFVI. For each layer, we use a mean-pooling strategy. The non-linear mapping $G_{\phi_l}$ at each attention layer is parameterised by a 2-layer MLP as in Vaswani et al. (2017). For models with kernel-based attention, we use the exponential kernel (Tsai et al., 2019) for sentiment analysis and linguistic acceptability,

$$k(x, x') = \sigma_f^2 \exp\left(\sum_{j=1}^D \frac{x_j x'_j}{\sigma_j^2}\right),$$

and the ARD-RBF kernel (Rasmussen & Williams, 2006) for image classification and graph property regression,

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{j=1}^D \frac{(x_j - x'_j)^2}{\sigma_j^2}\right),$$

where $D$ is the dimension of $x$ and $x'$, $\sigma_f^2$ is the output variance, and $\sigma_j$ is the length-scale of the $j$-th dimension. For MFVI, MCD, SNGP and SGPA, predictive uncertainty is estimated using 10 Monte Carlo samples.

• Initialization: we train all models from scratch (i.e., all parameters are randomly initialized without any pre-training). Apart from the global inducing point parameters, the parameters are the same as in standard Transformers and are initialized via the default method of the deep learning platform (we use PyTorch (Paszke et al., 2019)). Each dimension of the global inducing locations and of the global variational mean is randomly initialized from a standard Gaussian. For the Cholesky factor used to parameterise the global variational covariance, each element of the lower triangular part and each element of the log-diagonal is randomly initialized from a standard Gaussian.

• Optimization: all models are trained with the ADAM optimiser (Kingma & Ba, 2015), and for each input sequence in a batch, we draw a single sample to estimate the ELBO (eq. 14).
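The two kernels above can be written down directly. This is a minimal per-pair sketch (the function names and the uniform default length-scales are illustrative, not the paper's code):

```python
import numpy as np

def exp_kernel(x, xp, sf2=1.0, ls=None):
    """Exponential (dot-product) kernel: sf2 * exp(sum_j x_j x'_j / ls_j^2)."""
    ls = np.ones_like(x) if ls is None else ls
    return sf2 * np.exp(np.sum(x * xp / ls ** 2))

def ard_rbf(x, xp, sf2=1.0, ls=None):
    """ARD-RBF kernel: sf2 * exp(-0.5 * sum_j (x_j - x'_j)^2 / ls_j^2)."""
    ls = np.ones_like(x) if ls is None else ls
    return sf2 * np.exp(-0.5 * np.sum((x - xp) ** 2 / ls ** 2))
```

Note that the ARD-RBF kernel evaluates to $\sigma_f^2$ whenever $x = x'$, whereas the exponential kernel grows with the (scaled) inner product of its inputs.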
In our experiments, we observed no optimization issues: the ELBO consistently increases during training without significant spikes.

Sentiment analysis with IMDB (Maas et al., 2011). We consider 5 different splits, each with 35,000 training, 5,000 validation, and 10,000 test instances. The maximum number of tokens in each input sequence is 512. Architecture-wise, we use a Transformer with 1 MHSA layer and 8 attention heads, with embedding and hidden dimensions both equal to 128. For SGPA we use 50 global inducing points per head. We train all models (except the post-hoc methods, TS and KFLLLA) for 20 epochs with batch size 32 and an initial learning rate of 0.001 that decays linearly to 0.0001. The best model is selected based on the validation accuracy computed every training epoch.

Linguistic acceptability with CoLA (Warstadt et al., 2019). The 516 OOD samples provided by the original dataset are used to assess the models' OOD robustness. Within each of the 5 independent runs, the remaining 9,078 in-distribution samples are randomly split into 7,262 training and 1,816 in-distribution test instances. We use a Transformer with 2 MHSA layers, each with 4 attention heads, an embedding dimension of 128 and a hidden dimension of 256. For the input embeddings, we use ELMo-style representations (Peters et al., 2018). For SGPA we use 5 global inducing points per head. We train all models (except the post-hoc method KFLLLA) for 50 epochs with batch size 32 and an initial learning rate of 0.0005 that decays linearly to 0.00001, and we use the model from the final epoch for evaluation.

Image classification with CIFAR10 and CIFAR100 (Krizhevsky et al., 2009). For both datasets, we randomly split the original training set into 45,000 training and 5,000 validation instances, and test on the original 10,000 test instances. The input images are tokenised with a patch size of 4 × 4.
For CIFAR10 without data augmentation, we use a ViT (Dosovitskiy et al., 2021) with 5 MHSA layers, each with 4 attention heads and a hidden dimension of 128. For all other experiments, we use a ViT with 6 MHSA layers, each with 4 attention heads and a hidden dimension of 256. We train all models (except the post-hoc methods, TS and KFLLLA) except SGPA for 600 epochs with batch size 100 and an initial learning rate of 0.0005 that decays linearly to 0.00001. For SGPA, we use 32 global inducing points per head; we use the parameters from the 100th epoch of MLE to initialize the deep-kernel hyperparameters and continue training for 500 epochs. The best model is selected based on the validation accuracy computed every 10 epochs. For experiments with data augmentation, we use the same data augmentation strategy (3-Augment) as in Touvron et al. (2022).

Graph property regression with ZINC (Dwivedi et al., 2020). The results, presented in Appendix C.3, are averaged over 3 independent runs. We use the same split as in Dwivedi et al. (2020), resulting in 10,000 training, 1,000 validation, and 1,000 test instances. Instead of applying graph-specific modifications to the network architecture, we use the feature engineering technique proposed in Kim et al. (2022) to transform each graph into a sequence of input embeddings. We consider Transformers with 8 layers and 8 attention heads, with embedding and hidden dimensions both equal to 80. For SGPA we use 10 global inducing points per head. We train all models for 500 epochs with batch size 64 and an initial learning rate of 0.0004 that decays linearly to 0.000002. The best model is selected based on the validation accuracy computed at the end of every 10 epochs.
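The ZINC experiments in Appendix C.3 are evaluated under the Laplace likelihood with trainable scale $b$ described there. The per-instance NLL that this implies is a one-liner (a sketch; the function name is illustrative):

```python
import numpy as np

def laplace_nll(y, f, b):
    """Per-instance negative log-likelihood under Laplace(f, b):
    -log p(y|f) = -log( exp(-|y-f|/b) / (2b) ) = log(2b) + |y - f| / b."""
    return np.log(2.0 * b) + np.abs(y - f) / b
```

Because the NLL is linear in $|y - f|$, minimising it with a fixed scale $b$ is equivalent to minimising the MAE, which is why MAE and NLL are both natural metrics for this likelihood.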

F RUNNING TIME

We analyse the wall-clock training and inference time of SGPA here. In Table 5, we present the computational time for a single batch at the inference stage with 10 Monte Carlo samples for CoLA (batch size = 227) and CIFAR10 (batch size = 200); results were obtained using a single Nvidia RTX 2080 Ti GPU card. For SGPA, we first pay a one-off cost to invert the kernel matrices of the global inducing points ($K_{k_g k_g}^{-1}$). Once this is done, we treat them as constant matrices and plug them into eq. (11). When generating samples to pass to the next layer, we diagonalise the covariance $\Sigma_d$ in eq. (11) to avoid the costly computation of its Cholesky factor. The computational cost depends on the number of global inducing points used. For CoLA, we used only 5 global inducing points per head, and the relative difference between the inference times of MCD and SGPA is smaller than for CIFAR10, where we use 32 global inducing points per head. Note that we have not done extensive hyperparameter tuning over the number of global inducing points; SGPA can likely still work well with fewer. For example, for CIFAR10 we recently trained an SGPA model with 16 global inducing points per head and did not observe a considerable performance drop (Accuracy: 0.7790, NLL: 0.7259, ECE: 0.0119, MCE: 0.0819). In this case, the inference time is further reduced from 0.986 s to 0.807 s.
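The two cost-saving devices mentioned above can be sketched as follows. This is an illustrative NumPy fragment, not the paper's implementation; the helper names are hypothetical.

```python
import numpy as np

def cache_global_inverse(Kgg, jitter=1e-6):
    # One-off cost: invert the global-inducing-point kernel matrix once,
    # then reuse the result as a constant at every forward pass.
    return np.linalg.inv(Kgg + jitter * np.eye(len(Kgg)))

def sample_diagonalised(mean, Sigma, rng):
    # Diagonalise the covariance: sample with per-token variances only,
    # avoiding the O(T^3) Cholesky factorisation of the full Sigma.
    std = np.sqrt(np.clip(np.diag(Sigma), 0.0, None))
    return mean + std * rng.standard_normal(mean.shape)
```

The diagonalised sampler preserves each marginal variance but discards cross-token correlations within the sample, trading some fidelity for speed.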

G RESULTS IN TABLES

We present numerical results (mean ± standard error) for all experiments in tables. In-distribution results are shown in Tables 7 to 13, OOD robustness results in Tables 14 to 30, and OOD detection results in Tables 31 to 35.

          0.6729±0.0069  0.6310±0.0076  0.6054±0.0057  0.7597±0.0038  0.6859±0.0014  0.9155±0.0005
KFLLLA    0.6297±0.0059  0.5920±0.0071  0.6733±0.0159  0.8101±0.0086  0.6915±0.0057  0.9194±0.0017
SNGP      0.6448±0.0016  0.6080±0.0020  0.6764±0.0096  0.8093±0.0063  0.6823±0.0023  0.9161±0.0009
SGPA      0.6712±0.0052  0.6297±0.0057  0.6454±0.0147  0.7854±0.0099  0.6911±0.0028  0.9170±0.0009
DE        0.6848±0.0000  0.6377±0.0000  0.7388±0.0000  0.8425±0.0000  0.7394±0.0000  0.9300±0.0000
SGPAE     0.6925±0.0000  0.6461±0.0000  0.6961±0.0000  0.8213±0.0000  0.7398±0.0000  0.9304±0.0000



Figure 1: Illustration of one head (h) of multi-head self attention in one layer of (a) vanilla Transformer, (b) Transformer based on standard SGPA and (c) Transformer based on decoupled SGPA.

Figure 2: Test set accuracy (or MCC for CoLA) & calibration metrics of Transformers or ViTs trained on CIFAR10 (1st row), CIFAR100 (2nd row), IMDB (3rd row) and CoLA (4th row).

Figure 4: Test set accuracy & calibration metrics on CIFAR10-C (top row) and CIFAR100-C (bottom row) against skew intensity of corruption for ViTs trained on corresponding clean data.

Figure 6: AUROC (top) and AUPR (bottom) for OOD detection using ViTs trained on CIFAR100.

COMPARISON BETWEEN STANDARD SGPA AND DECOUPLED SGPA

In our preliminary experiments, we compare the performance of ViTs based on standard SGPA versus decoupled SGPA for image classification on CIFAR10 without data augmentation. We also consider decoupled SGPA based on Cheng & Boots (2017). There is no difference in the expressiveness of the basis functions between decoupled SGPA based on Cheng & Boots (2017) and decoupled SGPA based on Salimbeni et al. (2018) (the version shown in the main text). However, Salimbeni et al. (2018) tends to demonstrate faster convergence due to the orthogonal decomposition of the basis in the mean function.
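The difference between the two decoupled parameterisations is only in how the posterior mean is assembled from the amortised and global bases. A minimal sketch, with hypothetical names and the kernel matrices passed in precomputed:

```python
import numpy as np

def mean_salimbeni(Kxa, Kxg, Kga, Kgg_inv, v_a, v_g):
    # Orthogonally decomposed basis (Salimbeni et al., 2018), as in eq.(11):
    # the amortised component is orthogonalised against the global basis.
    return Kxa @ v_a - Kxg @ Kgg_inv @ Kga @ v_a + Kxg @ v_g

def mean_cheng_boots(Kxa, v_a):
    # Fully decoupled basis (Cheng & Boots, 2017): the mean uses only the
    # amortised inducing points; the covariance uses the global ones.
    return Kxa @ v_a
```

Both spans are equally expressive, but the orthogonalised mean in the first variant is what the comparison above credits with faster convergence.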

Figure 8: Accuracy and calibration metrics on CIFAR10-C (top row) and CIFAR100-C (bottom row) for ViTs trained on corresponding clean data with data augmentation.

Figure 10: AUROC (top) and AUPR (bottom) metrics for OOD detection using ViTs trained on CIFAR100 with data augmentation.

Figure 12: AUROC and AUPR metrics for OOD detection using Transformers trained on ZINC.

Tsai et al. (2019) generalised SDP-Attention by replacing softmax(qk⊤/√d) with a kernel function.

Complexity comparison for standard and decoupled SGPA.

the MCC and calibration metrics for the OOD test on CoLA. The observations are similar to those of the in-distribution test: SGPA outperforms MLE, MCD, and SNGP in terms of NLL and calibration errors while achieving improved accuracy. MFVI and KFLLLA achieve lower

Average ranks of different methods in terms of AUROC and AUPR over 6 OOD detection tasks.

• In Appendix C.1, we find that, in addition to the parameter-inefficiency problem, Transformers based on standard SGPA also suffer from underfitting. Compared with decoupled SGPA, standard SGPA achieves significantly worse accuracy on the CIFAR10 classification task.

• In Appendix C.2, we report results for image classification with data augmentation. While MFVI and SNGP underfit the data, both accuracy and calibration improve for the other methods.

Tran et al. (2019) and Xue et al. (2021) propose to perform approximate posterior inference using MFVI in weight space for a subset of layers in Transformers. However, in our experiments we find that this type of approach underfits the data. This underfitting pathology has been theoretically confirmed for weight-space MFVI (Foong et al., 2020; Coker et al., 2022). Another line of research proposes to perform VI over the attention matrices directly (Fan et al., 2020; Cinquin et al., 2021). However, Fan et al. (

Boyang Xue, Jianwei Yu, Junhao Xu, Shansong Liu, Shoukang Hu, Zi Ye, Mengzhe Geng, Xunying Liu, and Helen Meng. Bayesian transformer language models for speech recognition. arXiv:2102.04754, 2021.

Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2019.

ViTs based on standard SGPA and on decoupled SGPA based on Cheng & Boots (2017) achieve worse performance than decoupled SGPA based on Salimbeni et al. (2018). In particular, standard SGPA considerably underfits the data. We therefore only consider decoupled SGPA based on Salimbeni et al. (2018) for the rest of the experiments.

Test set accuracy and NLL of ViTs based on standard and two variants of decoupled SGPA, trained on CIFAR10 without data augmentation.

Average ranks of different methods (trained with data augmentation) in terms of AUROC and AUPR over 6 OOD detection tasks.

SGPA and MCD achieve much better results than MLE and MFVI. For this task, the difference in performance between SGPA and MCD is negligible. However, compared with MCD, SGPA performs more robustly, as it returns smaller error bars. Ensemble methods outperform "single-model" methods in OOD detection, with SGPAE achieving the best result. Interestingly, for in-distribution calibration, they achieve worse performance than SGPA and MCD.

The computational time (in s) for a single batch at the inference stage with 10 Monte Carlo samples for CoLA (batch size = 227) and CIFAR10 (batch size = 200) (results obtained using a single Nvidia RTX 2080 Ti GPU card).

The training time (in s) of one epoch for SGPA and MLE on CoLA (batch size = 32) and CIFAR10 (batch size = 100) (results obtained using a single Nvidia RTX 2080 Ti GPU card).

In-distribution performance: sentiment analysis with IMDB

Accuracy of CIFAR10-C for ViTs trained on clean data without data augmentation.

Accuracy of CIFAR10-C for ViTs trained on clean data with data augmentation.

Accuracy of CIFAR100-C for ViTs trained on clean data without data augmentation.

Accuracy of CIFAR100-C for ViTs trained on clean data with data augmentation.

NLL of CIFAR10-C for ViTs trained on clean data without data augmentation.

NLL of CIFAR10-C for ViTs trained on clean data with data augmentation.

NLL of CIFAR100-C for ViTs trained on clean data without data augmentation.

NLL of CIFAR100-C for ViTs trained on clean data with data augmentation.

ECE of CIFAR10-C for ViTs trained on clean data without data augmentation.

ECE of CIFAR10-C for ViTs trained on clean data with data augmentation.

ECE of CIFAR100-C for ViTs trained on clean data without data augmentation.

ECE of CIFAR100-C for ViTs trained on clean data with data augmentation.

MCE of CIFAR10-C for ViTs trained on clean data without data augmentation.

MCE of CIFAR10-C for ViTs trained on clean data with data augmentation.

MCE of CIFAR100-C for ViTs trained on clean data without data augmentation.

MCE of CIFAR100-C for ViTs trained on clean data with data augmentation.

AUROC and AUPR metrics for OOD detection using ViTs trained on CIFAR10 without data augmentation.

AUROC and AUPR metrics for OOD detection using ViTs trained on CIFAR10 with data augmentation.

          0.7908±0.0034  0.7523±0.0028  0.8748±0.0023  0.9338±0.0015  0.8148±0.0027  0.9555±0.0005
MFVI      0.7009±0.0043  0.6530±0.0045  0.7652±0.0179  0.8468±0.0109  0.7506±0.0073  0.9379±0.0027
MCD       0.7995±0.0038  0.7614±0.0030  0.8754±0.0040  0.9304±0.0029  0.8282±0.0029  0.9595±0.0007
KFLLLA    0.7989±0.0041  0.7617±0.0036  0.8884±0.0036  0.9415±0.0022  0.8273±0.0034  0.9591±0.0007
SNGP      0.7889±0.0092  0.7543±0.0104  0.8776±0.0044  0.9326±0.0032  0.8125±0.0072  0.9551±0.0020
SGPA      0.8007±0.0014  0.7630±0.0013  0.8746±0.0061  0.9281±0.0043  0.8319±0.0014  0.9600±0.0004
DE        0.8230±0.0000  0.7869±0.0000  0.9195±0.0000  0.9575±0.0000  0.8555±0.0000  0.9666±0.0000
SGPAE     0.8264±0.0000  0.7917±0.0000  0.9004±0.0000  0.9452±0.0000  0.8606±0.0000  0.9677±0.0000

AUROC and AUPR metrics for OOD detection using ViTs trained on CIFAR100 without data augmentation.

AUROC and AUPR metrics for OOD detection using ViTs trained on CIFAR100 with data augmentation.

AUROC and AUPR metrics for OOD detection using Transformers trained on ZINC.

ACKNOWLEDGMENTS

We would like to thank Harrison B. Zhu and Zijing Ou at Imperial College London for their valuable comments on the manuscript. We also would like to thank the anonymous reviewers: their valuable feedback and helpful discussions during the rebuttal helped improve the paper.

ETHICS STATEMENT

We believe that this work in its current state has minimal ethical implications: this research involves no human subjects and no data or domain where privacy, discrimination or fairness is a concern. However, as with most research in machine learning, new methods could be applied to datasets and domains where such issues do arise and thereby cause negative ethical impacts, but we do not think SGPA is more concerning than any other method in this regard.

REPRODUCIBILITY STATEMENT

Example code can be found at: https://github.com/chenw20/SGPA. Details of the experimental set-up can be found in Appendix E.

