MONOTONICITY AND DOUBLE DESCENT IN UNCERTAINTY ESTIMATION WITH GAUSSIAN PROCESSES

Abstract

The quality of many modern machine learning models improves as model complexity increases, an effect that has been quantified, for predictive performance, with the non-monotonic double descent learning curve. Here, we address the overarching question: is there an analogous theory of double descent for models which estimate uncertainty? We provide a partially affirmative and partially negative answer in the setting of Gaussian processes (GP). Under standard assumptions, we prove that higher model quality for optimally tuned GPs (including uncertainty prediction) under marginal likelihood is realized for larger input dimensions, and therefore exhibits a monotone error curve. After showing that marginal likelihood does not naturally exhibit double descent in the input dimension, we highlight related forms of posterior predictive loss that do exhibit non-monotonicity. Finally, we verify empirically that our results hold for real data, beyond our considered assumptions, and we explore consequences involving synthetic covariates.

1. INTRODUCTION

With the recent success of overparameterized and nonparametric models for many predictive tasks in machine learning (ML), the development of the corresponding uncertainty quantification (UQ) has unsurprisingly become a topic of significant interest. Naïve approaches for forward propagation of error and other methods for inverse uncertainty problems typically apply Monte Carlo methods under a Bayesian framework (Zhang, 2021). However, the large-scale nature of many ML problems of interest results in significant computational challenges. One of the most successful approaches for solving inverse uncertainty problems is the use of Gaussian processes (GP) (Williams & Rasmussen, 2006). GPs are now frequently used for many predictive tasks, including time-series analysis (Roberts et al., 2013) and classification (Williams & Barber, 1998; Williams & Rasmussen, 2006), and are also fundamental to Bayesian optimization (Hebbal et al., 2019; Frazier, 2018), although extending Bayesian optimization to high-dimensional settings remains challenging (Binois & Wycoff, 2021).

Although the theoretical understanding of the predictive capacity of high-dimensional ML models continues to advance rapidly, a parallel rigorous theory for UQ is comparatively lagging. The prominent heuristic in modern ML that larger models will typically perform better has become almost axiomatic. However, it is only more recently that this heuristic has been represented in the theory through the characterisation of benign overfitting (Bartlett et al., 2020). In particular, the double descent curve extends the bias-variance tradeoff curve to account for improving performance with higher model complexity (Belkin et al., 2019; Wang et al., 2021; Derezinski et al., 2020b); see Figure 1 (right).
Typically, these arguments involve applications of random matrix theory (Edelman & Rao, 2005; Paul & Aue, 2014), notably the Marchenko-Pastur law, which concerns limits of spectral distributions in large-data/large-dimension regimes.

While the predictive performance of ML models can improve as model size increases, it is not clear whether the same is true for predictions of model uncertainty. Several common measures of model quality that incorporate inverse uncertainty quantification are Bayesian in nature, the most prominent being the marginal likelihood and various forms of posterior predictive loss. It is well known that Bayesian methods can perform well in high dimensions (De Roos et al., 2021), even outperforming their low-dimensional counterparts when properly tuned (Wilson & Izmailov, 2020). To close this theory-practice gap, an analogous formulation of double descent curves in the setting of uncertainty quantification is desired.

Marginal likelihood and posterior distributions are often intractable for arbitrary models (e.g., Bayesian neural networks (Goan & Fookes, 2020)); however, their explicit forms are well known for GPs (Williams & Rasmussen, 2006). GPs are nonparametric, and most kernels used in practice induce infinite-dimensional feature spaces, so model complexity can be difficult to quantify (although some notions of kernel dimension have been proposed (Zhang, 2005; Alaoui & Mahoney, 2015)). Nevertheless, it is generally expected that accurately fitting a GP to data lying in higher-dimensional spaces requires training on a larger dataset. This curse of dimensionality has been justified using error estimates (von Luxburg & Bousquet, 2004) and verified empirically (Spigler et al., 2020). However, under appropriate setups, predictive performance has been demonstrated to improve with larger input dimension (Liu et al., 2021).
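As a quick numerical illustration of the Marchenko-Pastur law invoked above (our own sketch, not part of the paper; all variable names are ours), one can compare the eigenvalue histogram of a sample covariance matrix of iid Gaussian data against the theoretical limiting density:

```python
import numpy as np

def marchenko_pastur_pdf(x, ratio, sigma2=1.0):
    """Marchenko-Pastur density for aspect ratio d/n < 1 and population variance sigma2."""
    lam_minus = sigma2 * (1 - np.sqrt(ratio)) ** 2   # lower edge of the support
    lam_plus = sigma2 * (1 + np.sqrt(ratio)) ** 2    # upper edge of the support
    pdf = np.zeros_like(x, dtype=float)
    inside = (x > lam_minus) & (x < lam_plus)
    pdf[inside] = np.sqrt((lam_plus - x[inside]) * (x[inside] - lam_minus)) / (
        2 * np.pi * sigma2 * ratio * x[inside])
    return pdf

rng = np.random.default_rng(0)
n, d = 2000, 500                              # large data / large dimension, d/n = 0.25
X = rng.standard_normal((n, d))
eigs = np.linalg.eigvalsh(X.T @ X / n)        # spectrum of the sample covariance

hist, edges = np.histogram(eigs, bins=40, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
theory = marchenko_pastur_pdf(centers, ratio=d / n)
mass = theory.sum() * (centers[1] - centers[0])   # should be close to 1
```

For d/n = 0.25 the spectrum concentrates on roughly [0.25, 2.25], matching the predicted support edges (1 ± √(d/n))².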
Here, we consider whether the same is true for the marginal likelihood and posterior predictive metrics. Our main results (see Theorem 1 and Proposition 1) are summarized as follows.

• Monotonicity: For two optimally regularized scalar GPs, both fit to a sufficiently large set of iid normalized and whitened input-output pairs, the better performing model under marginal likelihood is the one with larger input dimension.

• Double Descent: For sufficiently small temperatures, GP posterior predictive metrics exhibit double descent if and only if the mean squared error for the corresponding kernel regression task exhibits double descent (see Liang & Rakhlin (2020); Liu et al. (2021) for sufficient conditions).

Figure 1 illustrates characteristics of monotone and double descent error curves. Along the way, we identify optimal choices of temperature (which can be interpreted as noise in the data) under a tempered posterior setup; see Table 1 for a summary. Our results highlight that the common curse-of-dimensionality heuristic can be bypassed through an empirical Bayes procedure. Furthermore, the performance of optimally regularized GPs (under several metrics) can be improved with additional covariates, including synthetic ones. Our theory is supported by experiments performed on large real datasets. Additional experiments, including the effect of ill-conditioned inputs, alternative data distributions, and the choice of underlying kernel, are presented in Appendix A. Details of the setup for each experiment are listed in Appendix G.

2. BACKGROUND

2.1 GAUSSIAN PROCESSES

A Gaussian process is a stochastic process $f$ on $\mathbb{R}^d$ such that for any set of points $x_1, \ldots, x_k \in \mathbb{R}^d$, the vector $(f(x_1), \ldots, f(x_k))$ is distributed as a multivariate Gaussian random vector (Williams & Rasmussen, 2006, §2.2). Gaussian processes are completely determined by their mean and covariance functions: if for any $x, x' \in \mathbb{R}^d$, $\mathbb{E} f(x) = m(x)$ and $\mathrm{Cov}(f(x), f(x')) = k(x, x')$, then we say that $f \sim \mathcal{GP}(m, k)$. Inference for GPs is informed by Bayes' rule: letting $(X_i, Y_i)_{i=1}^n$ denote a collection of iid input-output pairs, we impose the assumption that $Y_i = f(X_i) + \epsilon_i$ where each $\epsilon_i \sim \mathcal{N}(0, \gamma)$,



Figure 1: Illustrations of monotone (left) and double descent (right) error curves.

Table 1: Behavior of UQ performance metrics and optimal posterior temperature γ.

