STABLE OPTIMIZATION OF GAUSSIAN LIKELIHOODS

Abstract

Uncertainty-aware modeling has emerged as a key component in modern machine learning frameworks. The de-facto standard approach adopts heteroscedastic Gaussian distributions and minimizes the negative log-likelihood (NLL) under observed data. However, optimizing this objective turns out to be surprisingly intricate, and the current state of the art reports several instabilities. This work breaks down the optimization problem, initially focusing on non-contextual settings where convergence can be analyzed analytically. We show that (1) in this learning scheme, the eigenvalues of the predictive covariance determine the stability of learning, and (2) the coupling of gradients and predictions builds up errors in both mean and covariance if either is poorly approximated. Building on these insights, we propose Trustable, a novel optimizer that overcomes these instabilities methodically by combining systematic update restrictions in the form of trust regions with structured, tractable natural gradients. We demonstrate in several challenging experiments that Trustable outperforms current optimizers in regression with neural networks in terms of NLL, MSE, and further performance metrics. Unlike other optimizers, Trustable yields an improved and more stable fit and can also be applied to multivariate outputs with full covariance matrices.

1. INTRODUCTION

Generating models capable of quantifying uncertainty is crucial in modern machine learning applications. These uncertainties arise, among other sources, from underlying stochastic processes, inherent fuzziness of the data, or partial observability (Hüllermeier, 2021). Efficient and reliable strategies for predicting uncertainties allow us to better understand the underlying data, to interpret how well a model fares, and to guide decisions we make based on a model. Conventionally, uncertainty is modeled using a probabilistic approach. That is, a model induces, directly or indirectly, a probability distribution whose parameters describe the uncertainty. A well-established technique for estimating aleatoric uncertainty, i.e., uncertainty that accounts for inherent randomness in an experiment, predicts distributions via neural networks (NNs). NNs are an extensively studied approach (Seitzer et al., 2022; Detlefsen et al., 2019; Abdar et al., 2021; Hüllermeier, 2021), primarily due to their ability to learn non-linear and complex relationships. Under the assumption of Gaussian-distributed targets, the de-facto standard for various use cases, these NNs infer Gaussian mean and covariance conditioned on the respective input context. A common objective for training such networks is the negative log-likelihood (NLL), given by θ* = arg min_θ L_NLL(θ, D) := arg min_θ −Σ_i log p_θ(x_i, y_i), where θ are the parameters of the NN and D := {x_i, y_i}_i is the training data. Typically, optimization involves computing some form of gradient ∇_θ L_NLL(θ, D). Yet, these Gaussian NLL gradient updates were shown to be often unstable and hard to use (Guo et al., 2017; Hüllermeier, 2021). This is, for example, typically the case if a large portion of the training data indicates highly certain subregions: gradients can explode (Takahashi et al., 2018) or fit only sub-par (Seitzer et al., 2022).
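The gradient explosion in certain subregions can be seen directly from the univariate case: for a Gaussian NLL, the gradient with respect to the predicted mean is (μ − y)/σ², so a fixed residual produces an arbitrarily large gradient as the predicted variance shrinks. The following minimal sketch (an illustration, not the paper's method; all function names are ours) computes this analytic gradient for decreasing variances:

```python
import numpy as np

def gaussian_nll(mu, sigma2, y):
    """Per-sample negative log-likelihood of y under N(mu, sigma2)."""
    return 0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

def grad_mu(mu, sigma2, y):
    """Analytic gradient of the NLL w.r.t. the predicted mean:
    d/dmu [0.5 * (y - mu)^2 / sigma2] = (mu - y) / sigma2."""
    return (mu - y) / sigma2

# A fixed, small residual of 0.1 ...
y, mu = 1.0, 0.9
# ... yields gradients that blow up as the predicted variance shrinks.
for sigma2 in (1.0, 1e-2, 1e-4):
    print(f"sigma^2 = {sigma2:g}: dNLL/dmu = {grad_mu(mu, sigma2, y):g}")
```

With σ² = 1 the gradient magnitude is 0.1; with σ² = 1e-4 it is 1000, even though the fit error is unchanged. This is the coupling between mean and covariance updates that the pragmatic stabilization techniques cited below try to suppress.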
Hence, implementations across all domains rely on pragmatic techniques (Seitzer et al., 2022; Garnelo et al., 2018; Schulman et al., 2017a) for stabilizing uncertainty output and gradient training for

