STABLE OPTIMIZATION OF GAUSSIAN LIKELIHOODS

Abstract

Uncertainty-aware modeling has emerged as a key component in modern machine learning frameworks. The de-facto standard approach adopts heteroscedastic Gaussian distributions and minimizes the negative log-likelihood (NLL) under observed data. However, optimizing this objective turns out to be surprisingly intricate, and the current state-of-the-art reports several instabilities. This work breaks down the optimization problem, initially focusing on non-contextual settings where convergence can be analyzed analytically. We show that (1) in this learning scheme, the eigenvalues of the predictive covariance define stability in learning, and (2) the coupling of gradients and predictions builds up errors in both mean and covariance if either is poorly approximated. Building on these insights, we propose Trustable, a novel optimizer that methodically overcomes these instabilities by combining systematic update restrictions in the form of trust regions with structured, tractable natural gradients. We demonstrate in several challenging experiments that Trustable outperforms current optimizers in regression with neural networks in terms of the NLL, MSE, and further performance metrics. Unlike other optimizers, Trustable yields an improved and more stable fit and can also be applied to multivariate outputs with full covariance matrices.

1. INTRODUCTION

Generating models capable of quantifying uncertainty is crucial in modern machine learning applications. These uncertainties arise, among other reasons, from underlying stochastic processes, inherent fuzziness of the data, or partial observability (Hüllermeier, 2021). Efficient and reliable strategies for predicting uncertainties allow us to better understand the underlying data, interpret how well a model fares, and guide the decisions we make based on a model. Conventionally, uncertainty is modeled using a probabilistic approach. That is, a model induces, directly or indirectly, a probability distribution whose parameters describe uncertainty. A well-established technique for estimating aleatoric uncertainty, i.e., uncertainty that accounts for inherent randomness in an experiment, predicts distributions via neural networks (NNs). NNs are an extensively studied approach (Seitzer et al., 2022; Detlefsen et al., 2019; Abdar et al., 2021; Hüllermeier, 2021), primarily due to their ability to learn non-linear and complex relationships. Under the assumption of Gaussian distributed targets, the de-facto standard for various use cases, these NNs infer the Gaussian mean and covariance conditioned on the respective input context. A common objective for training such networks is the negative log-likelihood (NLL), θ* = arg min_θ L_NLL(θ, D) with L_NLL(θ, D) := −∑_i log p_θ(x_i, y_i), where θ are the parameters of the NN and D := {x_i, y_i}_i is the training data. Typically, optimization involves computing some form of gradient ∇_θ L_NLL(θ, D). Yet, these Gaussian NLL gradient updates were shown to be often unstable and hard to use (Guo et al., 2017; Hüllermeier, 2021). This is, for example, typically the case when a large portion of the training data lies in highly certain subregions: gradients can explode (Takahashi et al., 2018) or the fit remains sub-par (Seitzer et al., 2022).
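To make this instability concrete, consider the univariate case, where the NLL gradients are available in closed form: the gradient with respect to the mean scales with the inverse variance, so highly certain predictions inflate the mean update. A minimal NumPy sketch (illustrative only; the function names are ours, not part of any referenced implementation):

```python
import numpy as np

def gaussian_nll(mu, var, y):
    """Univariate Gaussian negative log-likelihood."""
    return 0.5 * np.log(2.0 * np.pi * var) + (y - mu) ** 2 / (2.0 * var)

def nll_grads(mu, var, y):
    """Closed-form NLL gradients w.r.t. mean and variance."""
    dmu = (mu - y) / var  # scales with 1/var: explodes for certain predictions
    dvar = 0.5 / var - (y - mu) ** 2 / (2.0 * var ** 2)
    return dmu, dvar

# The same small residual yields vastly different mean gradients
# depending on the predicted variance:
for var in (1.0, 1e-2, 1e-4):
    dmu, _ = nll_grads(mu=0.0, var=var, y=0.1)
    print(f"var={var:g}  dL/dmu={dmu:.1f}")
```

With a residual of 0.1, shrinking the variance from 1 to 10⁻⁴ magnifies the mean gradient by four orders of magnitude, matching the exploding-gradient behavior reported by Takahashi et al. (2018).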
Hence, implementations across all domains rely on pragmatic techniques (Seitzer et al., 2022; Garnelo et al., 2018; Schulman et al., 2017a) to stabilize NLL-based learning. Unfortunately, these techniques also impair model expressivity and degrade prediction quality. For example, full covariance matrices in higher dimensions are largely avoided (Gopinath, 1998; Fiszeder & Orzeszko, 2021) due to their inherent instability during training: an update must keep all predicted covariances positive-definite at all times and remain numerically stable. Most implementations, therefore, do not model correlations but only predict a diagonal Gaussian distribution. Furthermore, clamping the variance into empirically well-performing regions (Raffin et al., 2021; Garnelo et al., 2018) is a classic makeshift patch to avoid small or large variances, for which gradient descent with the Gaussian NLL becomes unstable (Seitzer et al., 2022). In this case, however, the target distribution's true variance cannot be learned if it lies outside the clamped region. Other approaches for compensating destructive gradients include a different variance parametrization (Raffin et al., 2021), objective adjustments (Seitzer et al., 2022), or a change to the optimization strategy (Detlefsen et al., 2019; Choi et al., 2004; Takahashi et al., 2018). Despite all these efforts, reliably predicting large ranges of variances and correlations in data remains challenging. In this work, we examine the underlying cause of these shortcomings and open up new ways to further improve NLL-based learning for a wide variety of models.
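As an illustration of the clamping workaround, the sketch below predicts a log-variance and clips it into a fixed interval before exponentiating; the bounds are hypothetical stand-ins for the empirically tuned values used in practice:

```python
import numpy as np

# Hypothetical bounds; real implementations tune these empirically.
LOG_VAR_MIN, LOG_VAR_MAX = -4.0, 4.0

def clamped_variance(raw_log_var):
    """Map an unconstrained log-variance prediction to a bounded variance.

    True variances outside [exp(LOG_VAR_MIN), exp(LOG_VAR_MAX)] can never
    be represented, which is exactly the expressivity loss discussed above.
    """
    log_var = np.clip(raw_log_var, LOG_VAR_MIN, LOG_VAR_MAX)
    return np.exp(log_var)
```

A prediction of raw_log_var = −10 is silently mapped to the boundary variance exp(−4), so a target distribution that certain can never be fit exactly.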

Summary of Contributions.

We first demonstrate that in an update step with the NLL, the covariance plays a distinct role in the convergence behavior. We extend the idea from Seitzer et al. (2022) and argue that the eigenvalues of the predictive covariance define stability in learning. Furthermore, the coupling of gradients and predictions is detrimental, as it destabilizes training and can build up errors in both mean and covariance if either is poorly approximated. We then introduce an optimization strategy dubbed Trustable (see Figure 1) that methodically overcomes coupling and instabilities. Through structured, tractable natural gradients (NGs) (Lin et al., 2021), we efficiently obtain more informative, decoupled gradients, and we systematically constrain the update step via Trust Region Projection Layers (TRPLs) (Otto et al., 2021) for stability. Finally, we empirically find that Trustable displays an improved and less volatile fit in terms of the NLL, mean squared error (MSE), and Wasserstein L2 distance (W2) for non-contextual and contextual regression with full covariance matrices in higher dimensions.
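As intuition for why natural gradients decouple the update, consider again the univariate Gaussian: preconditioning the Euclidean gradient with the inverse Fisher information F = diag(1/σ², 1/(2σ⁴)) cancels the inverse-variance scaling. The sketch below is a one-dimensional illustration only, not the structured natural gradient of Lin et al. (2021); the function names are ours:

```python
import numpy as np

def euclidean_grads(mu, var, y):
    """Vanilla univariate Gaussian NLL gradients; note the 1/var coupling."""
    return (mu - y) / var, 0.5 / var - (y - mu) ** 2 / (2.0 * var ** 2)

def natural_grads(mu, var, y):
    """Precondition with the inverse Fisher information
    F^{-1} = diag(var, 2*var^2) to remove the destructive scaling."""
    dmu, dvar = euclidean_grads(mu, var, y)
    return var * dmu, 2.0 * var ** 2 * dvar
```

The natural gradient in the mean direction reduces to mu − y, independent of the predicted variance, so highly certain predictions no longer inflate the mean update.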

2. PRELIMINARIES

In supervised learning, the optimization algorithm receives a labeled set of training data D := {x_i, y_i}_{i∈[0;N]} ⊂ X × Y, where the instance x_i and the target y_i are distributed according to a complex, usually intractable distribution p(x, y). Our goal is to approximate the conditional distribution p(y|x) over some data, including its aleatoric uncertainty. A common assumption is that each outcome y is distributed according to some Gaussian distribution, which in the multivariate case is given by N(μ_x, Σ_x), where μ_x ∈ R^d represents the mean and Σ_x ∈ R^{d×d}_{++} the covariance. The covariance is a symmetric positive-definite matrix.
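One standard construction to guarantee positive definiteness is to predict d(d+1)/2 unconstrained parameters and map them to a Cholesky factor with a strictly positive diagonal; Σ = L Lᵀ is then positive definite by construction. The sketch below is our own illustrative parametrization, not the one used by any specific reference:

```python
import numpy as np

def covariance_from_params(params, d):
    """Map d*(d+1)/2 unconstrained parameters to a symmetric
    positive-definite covariance Sigma = L @ L.T via a Cholesky
    factor whose diagonal is made strictly positive with exp."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = params              # fill the lower triangle
    L[np.diag_indices(d)] = np.exp(np.diag(L))  # enforce a positive diagonal
    return L @ L.T
```

For d = 2 and params = (0, 0.3, 0), the diagonal entries exp(0) = 1 make L unit lower-triangular, and the resulting Σ carries the off-diagonal correlation term 0.3 while remaining positive definite.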



Figure 1: A schematic of Trustable for heteroscedastic prediction of Gaussians. We retrieve the exact natural gradient of the target distributions via local parametrizations (Lin et al., 2021) and regress on the closed-form trust-region-projected new distributions (Otto et al., 2021). The local, loss-manifold-aware natural gradient cancels out destructive gradient scalings, while trust regions stabilize the sensitive update, allowing for stable learning of full covariances.


