AMORTIZED CONDITIONAL NORMALIZED MAXIMUM LIKELIHOOD

Abstract

While deep neural networks provide good performance for a range of challenging tasks, calibration and uncertainty estimation remain major challenges. In this paper, we propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation, calibration, and out-of-distribution robustness with deep networks. Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle, but is computationally intractable to evaluate exactly for all but the simplest of model classes. We propose to use approximate Bayesian inference techniques to produce a tractable approximation to the CNML distribution. Our approach can be combined with any approximate inference algorithm that provides tractable posterior densities over model parameters. We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.

1. INTRODUCTION

Current machine learning methods provide unprecedented accuracy across a range of domains, from computer vision to natural language processing. However, in many high-stakes applications, such as medical diagnosis or autonomous driving, rare mistakes can be extremely costly, and thus effective deployment of learned models requires not only high expected accuracy, but also a way to measure the certainty in a model's predictions in order to assess risk and allow the model to abstain from making decisions when there is low confidence in the prediction. While deep networks offer excellent prediction accuracy, they generally do not provide the means to accurately quantify their uncertainty. This is especially true on out-of-distribution inputs, where deep networks tend to make overconfident incorrect predictions (Ovadia et al., 2019). In this paper, we tackle the problem of obtaining reliable uncertainty estimates under distribution shift.

Most prior work approaches the problem of uncertainty estimation from the standpoint of Bayesian inference. By treating parameters as random variables with some prior distribution, Bayesian inference can compute posterior distributions that capture a notion of epistemic uncertainty and allow us to quantitatively reason about uncertainty in model predictions. However, computing accurate posterior distributions becomes intractable for very complex models like deep neural networks, and current approaches require highly approximate inference methods that fall short of the promise of full Bayesian modeling in practice.

Bayesian methods also have a deep connection with the minimum description length (MDL) principle, a formalization of Occam's razor that recasts learning as performing efficient lossless data compression and has been widely used as a motivation for model selection techniques. Codes corresponding to maximum-a-posteriori estimators and Bayes marginal likelihoods have been commonly used within the MDL framework.
However, other coding schemes have been proposed in MDL centered around achieving different notions of minimax optimality. Interpreting coding schemes as predictive distributions, such methods can directly inspire prediction strategies that give conservative predictions and do not suffer from excessive overconfidence, thanks to their minimax formulation. One such predictive distribution is the conditional normalized maximum likelihood (CNML) model (Grünwald, 2007; Rissanen and Roos, 2007; Roos et al., 2008), also known as sequential NML or predictive NML (Fogel and Feder, 2018b). To make a prediction on a new input, CNML considers every possible label and tries to find the model that best explains that label for the query point together with the training set. It then uses the corresponding model to assign a probability to each label and normalizes to obtain a valid probability distribution. Intuitively, instead of relying on a learned model to extrapolate from the training set to the new (potentially out-of-distribution) input, CNML can obtain more reasonable predictive distributions by asking "given the training data, which labels would make sense for this input?"

While CNML provides compelling minimax regret guarantees, practical instantiations have been exceptionally difficult, because computing predictions for a test point requires retraining the model on the test point concatenated with the entire training set. With large models like deep neural networks, this can potentially require hours of training for every prediction. In this paper, we propose amortized CNML (ACNML), a tractable and practical algorithm for approximating CNML using approximate Bayesian inference. ACNML avoids the need to optimize over large datasets during inference by using an approximate posterior in place of the training set.
We demonstrate that our proposed approach is substantially more feasible and computationally efficient than prior techniques for using CNML predictions with deep neural networks and compares favorably to a number of prior techniques for uncertainty estimation on out-of-distribution inputs.
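To make the naive CNML procedure concrete, the following sketch retrains a simple one-parameter logistic model once per candidate label and normalizes the resulting scores. The model class, dataset, and optimizer settings here are illustrative assumptions for exposition only, not the setup used with deep networks in this paper; they serve to show why exact CNML requires one full refit per label.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_mle(X, y, lr=0.05, steps=5000):
    # gradient descent to the maximum-likelihood weight of a
    # one-parameter logistic model p(y=1|x) = sigmoid(w*x)
    w = 0.0
    for _ in range(steps):
        grad = np.sum((sigmoid(w * X) - y) * X)
        w -= lr * grad
    return w

def cnml_predict(X_train, y_train, x_query):
    # CNML: for each candidate label, refit the MLE on the training set
    # plus the hypothetically labeled query, score that label for the
    # query under the refit model, then normalize into a distribution
    scores = []
    for label in (0, 1):
        w = fit_mle(np.append(X_train, x_query), np.append(y_train, label))
        p1 = sigmoid(w * x_query)
        scores.append(p1 if label == 1 else 1.0 - p1)
    scores = np.array(scores)
    return scores / scores.sum()

X_train = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])
y_train = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
probs = cnml_predict(X_train, y_train, 2.0)  # distribution over labels {0, 1}
```

Note that each prediction requires one optimization per candidate label over the full augmented dataset; this is the cost that ACNML amortizes away by substituting an approximate posterior for the training set.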

2. MINIMUM DESCRIPTION LENGTH: BACKGROUND AND PRELIMINARIES

ACNML is motivated by the minimum description length (MDL) principle, which can be used to derive a connection between optimal codes and prediction. We begin with a review of the MDL principle and discuss the challenges in implementing minimax codes that motivate our method. For more comprehensive treatments of MDL, we refer the reader to Grünwald (2007) and Rissanen (1989).

Minimum description length. The MDL principle states that any regularities in a dataset can be exploited to compress it, and hence learning is reformulated as losslessly transmitting the data with the fewest number of bits (Rissanen, 1989; Grünwald, 2007). Simplicity is thus formalized as the length of the resulting description. MDL was originally formulated in a generative setting where the goal is to code arbitrary data, and we present a brief overview in this setting. The results translate to a supervised learning setting, which corresponds to transmitting the labels after assuming either a fixed coding scheme for the inputs or that the inputs are known beforehand. While MDL is typically described in terms of code lengths, we can in general associate codes with probability distributions, with the code length of an object corresponding to its negative log-likelihood under that probability distribution (Cover and Thomas, 2006).

Normalized Maximum Likelihood. Let $\hat{\theta}(x_{1:n})$ denote the maximum likelihood estimator for a sequence of data $x_{1:n}$ over all $\theta \in \Theta$. For any $x_{1:n} \in \mathcal{X}^n$ and distribution $q$ over $\mathcal{X}^n$, we can define a regret relative to the model class $\Theta$ as
$$\mathcal{R}(q, \Theta, x_{1:n}) \stackrel{\text{def}}{=} \log p_{\hat{\theta}(x_{1:n})}(x_{1:n}) - \log q(x_{1:n}).$$
This regret corresponds to the excess number of bits $q$ uses to encode $x_{1:n}$ compared to the best distribution in $\Theta$, namely $p_{\hat{\theta}(x_{1:n})}$. We can then define the normalized maximum likelihood (NML) distribution with respect to $\Theta$ as
$$p_{\text{NML}}(x_{1:n}) = \frac{p_{\hat{\theta}(x_{1:n})}(x_{1:n})}{\sum_{\bar{x}_{1:n} \in \mathcal{X}^n} p_{\hat{\theta}(\bar{x}_{1:n})}(\bar{x}_{1:n})} \qquad (2)$$
when the denominator is finite.
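For small discrete model classes, the NML distribution defined above can be computed exactly by brute-force enumeration of the normalizing (Shtarkov) sum. The sketch below does so for a Bernoulli model class over binary sequences; this toy example is our own illustration, not part of the paper's method.

```python
import numpy as np
from itertools import product

def bernoulli_nml(n):
    # For a Bernoulli model class, the MLE for a sequence is its
    # empirical mean; Python's 0 ** 0 == 1 handles the boundary cases.
    seqs = list(product([0, 1], repeat=n))

    def max_lik(seq):
        k = sum(seq)
        theta = k / n
        return theta ** k * (1 - theta) ** (n - k)

    liks = np.array([max_lik(s) for s in seqs])
    normalizer = liks.sum()  # the Shtarkov sum over all sequences in X^n
    # p_NML assigns each sequence its maximized likelihood, normalized
    return liks / normalizer, np.log(normalizer)

p_nml, minimax_regret = bernoulli_nml(4)
```

Because every sequence is scored by its own maximized likelihood and divided by the same constant, the regret of $p_{\text{NML}}$ equals the log of the normalizer on every sequence; this equalizer property is what underlies its minimax optimality. The enumeration is exponential in $n$, which hints at why exact (C)NML is intractable for rich model classes.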
The NML distribution can be shown to achieve minimax regret (Shtarkov, 1987; Rissanen, 1996):
$$p_{\text{NML}} = \operatorname*{argmin}_{q} \max_{x_{1:n} \in \mathcal{X}^n} \mathcal{R}(q, \Theta, x_{1:n}). \qquad (3)$$
This corresponds, in a sense, to an optimal coding scheme for sequences of known fixed length.

Conditional NML. Instead of making predictions across entire sequences at once, we can adapt NML to the setting where we make predictions about the next data point based on the previously seen data, resulting in conditional NML (CNML) (Rissanen and Roos, 2007; Grünwald, 2007; Fogel and Feder, 2018a). While several variations on CNML exist, we consider the following:
$$p_{\text{CNML}}(x_n \mid x_{1:n-1}) \propto p_{\hat{\theta}(x_{1:n})}(x_n).$$
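For intuition about this CNML variant, it can be worked out in closed form for a Bernoulli model class: appending each candidate next symbol, refitting the MLE (the empirical mean), scoring that symbol under the refit model, and normalizing recovers Laplace's rule of succession, $(k+1)/(m+2)$ for $k$ ones among $m$ previous symbols. A minimal sketch of this illustrative example (not the paper's model class):

```python
def cnml_bernoulli_next(x_prev):
    # Score each candidate next symbol under the Bernoulli MLE refit on
    # the extended sequence (theta_hat = empirical mean), then normalize.
    n = len(x_prev) + 1            # length of the extended sequence
    k = sum(x_prev)                # ones observed so far
    score_one = (k + 1) / n        # p(1) under theta_hat(x_prev + [1])
    score_zero = (n - k) / n       # p(0) under theta_hat(x_prev + [0])
    return score_one / (score_one + score_zero)  # p_CNML(x_n = 1 | x_prev)

p_one = cnml_bernoulli_next([1, 1, 0])  # (2+1)/(3+2) = 0.6
```

Even with no data at all, the prediction is a uniform 1/2 rather than a degenerate extrapolation, which previews the conservative, hedged behavior that motivates using CNML for out-of-distribution inputs.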

