UNCERTAINTY ESTIMATION IN AUTOREGRESSIVE STRUCTURED PREDICTION

Abstract

Uncertainty estimation is important for ensuring the safety and robustness of AI systems. While most research in the area has focused on unstructured prediction tasks, limited work has investigated general uncertainty estimation approaches for structured prediction. Thus, this work aims to investigate uncertainty estimation for autoregressive structured prediction tasks within a single unified and interpretable probabilistic ensemble-based framework. We consider uncertainty estimation for sequence data at the token level and complete sequence level; interpretations for, and applications of, various measures of uncertainty; and discuss both the theoretical and practical challenges associated with obtaining them. This work also provides baselines for token-level and sequence-level error detection, and sequence-level out-of-domain input detection on the WMT'14 English-French and WMT'17 English-German translation and LibriSpeech speech recognition datasets.

1. INTRODUCTION

Neural Networks (NNs) have become the dominant approach in numerous applications (Simonyan & Zisserman, 2015; Mikolov et al., 2013; 2010; Bahdanau et al., 2015; Vaswani et al., 2017; Hinton et al., 2012) and are being widely deployed in production. As a consequence, predictive uncertainty estimation is becoming an increasingly important research area, as it enables improved safety in automated decision making (Amodei et al., 2016). Important advancements have been the definition of baseline tasks and metrics (Hendrycks & Gimpel, 2016) and the development of ensemble approaches, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017)¹. Ensemble-based uncertainty estimates have been successfully applied to detecting misclassifications, out-of-distribution inputs and adversarial attacks (Carlini & Wagner, 2017; Smith & Gal, 2018; Malinin & Gales, 2019), and to active learning (Kirsch et al., 2019). Crucially, they allow total uncertainty to be decomposed into data uncertainty, the intrinsic uncertainty associated with the task, and knowledge uncertainty, which is the model's uncertainty in the prediction due to a lack of understanding of the data (Malinin, 2019)². Estimates of knowledge uncertainty are particularly useful for detecting anomalous and unfamiliar inputs (Kirsch et al., 2019; Smith & Gal, 2018; Malinin & Gales, 2019; Malinin, 2019). Despite recent advances, most work on uncertainty estimation has focused on unstructured tasks, such as image classification. Meanwhile, uncertainty estimation within a general, unsupervised, probabilistically interpretable ensemble-based framework for structured prediction tasks, such as language modelling, machine translation (MT) and speech recognition (ASR), has received little attention.
Previous work has examined bespoke supervised confidence estimation techniques for each task separately (Evermann & Woodland, 2000; Liao & Gales, 2007; Ragni et al., 2018; Chen et al., 2017; Koehn, 2009; Kumar & Sarawagi, 2019), which construct an "error-detection" model on top of the original ASR/NMT system. While useful, these approaches suffer from a range of limitations. Firstly, they require token-level supervision, typically obtained via minimum edit-distance alignment to a ground-truth transcription (ASR) or translation (NMT), which can itself be noisy. Secondly, such token-level supervision is generally inappropriate for translation, as it does not account for the validity of re-arrangements. Thirdly, we are unable to determine whether an error is due to knowledge or data uncertainty. Finally, the error-detection model is itself subject to the pitfalls of the original system: domain shift, noise, etc. Thus, unsupervised uncertainty-estimation methods are more desirable. Recently, initial investigations into unsupervised uncertainty estimation for structured prediction have appeared. The nature of data uncertainty for translation tasks was examined in (Ott et al., 2018a). Estimation of sequence-level and word-level uncertainty via Monte-Carlo Dropout ensembles has been investigated for machine translation (Xiao et al., 2019; Wang et al., 2019; Fomicheva et al., 2020). However, these works focus on machine translation, consider only a small range of ad-hoc uncertainty measures, provide limited theoretical analysis of their properties and do not make their limitations explicit. Furthermore, they do not identify or tackle the challenges in estimating uncertainty that arise from the exponentially large output space. Finally, to our knowledge, no work has examined uncertainty estimation for autoregressive ASR models. This work examines uncertainty estimation for structured prediction tasks within a general, probabilistically interpretable ensemble-based framework.
The five core contributions are as follows. First, we derive information-theoretic measures of both total uncertainty and knowledge uncertainty at both the token level and the sequence level, make explicit the challenges involved and state the assumptions made. Second, we introduce a novel uncertainty measure, reverse mutual information, which has a set of attributes desirable for structured uncertainty estimation. Third, we examine a range of Monte-Carlo approximations to sequence-level uncertainty. Fourth, for structured tasks there is a choice of how an ensemble of models can be combined; we examine how this choice impacts predictive performance and the derived uncertainty measures. Fifth, we explore the practical challenges associated with obtaining uncertainty estimates for structured prediction tasks and provide performance baselines for token-level and sequence-level error detection, and out-of-domain (OOD) input detection, on the WMT'14 English-French and WMT'17 English-German translation datasets and the LibriSpeech ASR dataset.

2. UNCERTAINTY FOR STRUCTURED PREDICTION

In this section we develop an ensemble-based uncertainty estimation framework for structured prediction and introduce a novel uncertainty measure. We take a Bayesian viewpoint on ensembles, as it yields an elegant probabilistic framework within which interpretable uncertainty estimates can be obtained. The core of the Bayesian approach is to treat the model parameters θ as random variables and to place a prior p(θ) over them in order to compute a posterior p(θ|D) via Bayes' rule, where D is the training data. Unfortunately, exact Bayesian inference is intractable for neural networks, and it is necessary to consider an explicit or implicit approximation q(θ) to the true posterior p(θ|D) to generate an ensemble. A number of different approaches to generating ensembles have been developed, such as Monte-Carlo Dropout (Gal & Ghahramani, 2016) and Deep Ensembles (Lakshminarayanan et al., 2017). An overview is available in (Ashukha et al., 2020; Ovadia et al., 2019).

Consider an ensemble of models {P(y|x; θ^(m))}_{m=1}^{M} sampled from an approximate posterior q(θ), where each model captures the mapping between variable-length input sequences x = {x_1, …, x_T} ∈ X and target sequences y = {y_1, …, y_L} ∈ Y, with x_t ∈ {w_1, …, w_V} and y_l ∈ {ω_1, …, ω_K}. The predictive posterior is obtained by taking the expectation over the ensemble:

  P(y|x, D) = E_{q(θ)}[P(y|x, θ)] ≈ (1/M) Σ_{m=1}^{M} P(y|x, θ^(m)),   θ^(m) ∼ q(θ) ≈ p(θ|D)   (1)

The total uncertainty in the prediction of y is given by the entropy of the predictive posterior:

  H[P(y|x, D)] = E_{P(y|x,D)}[−ln P(y|x, D)] = −Σ_{y∈Y} P(y|x, D) ln P(y|x, D)   (2)

The sources of uncertainty can be decomposed via the mutual information I between θ and y:

  I[y, θ | x, D] = E_{q(θ)}[E_{P(y|x,θ)}[ln(P(y|x, θ) / P(y|x, D))]] = H[P(y|x, D)] − E_{q(θ)}[H[P(y|x, θ)]]   (3)
  (Knowledge Uncertainty)                                              (Total Uncertainty)   (Expected Data Uncertainty)

Mutual information (MI) is a measure of 'disagreement' between the models in the ensemble, and therefore a measure of knowledge uncertainty (Malinin, 2019). It can be expressed as the difference between the entropy of the predictive posterior and the expected entropy of the individual models in the ensemble.

¹ An in-depth comparison of ensemble methods was conducted in (Ashukha et al., 2020; Ovadia et al., 2019).
² Data and knowledge uncertainty are sometimes also called aleatoric and epistemic uncertainty.
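As a concrete numerical illustration of the decomposition in Eq. (3), the following sketch computes total uncertainty, expected data uncertainty and mutual information for a toy ensemble of categorical distributions at a single prediction step. The ensemble probabilities here are invented for illustration, not taken from a trained model:

```python
import numpy as np

def entropy(p, axis=-1):
    # H[p] = -sum_y p(y) ln p(y)
    return -(p * np.log(p)).sum(axis=axis)

# Toy ensemble of M = 3 categorical predictive distributions over K = 3 tokens.
ensemble_probs = np.array([[0.7, 0.2, 0.1],
                           [0.5, 0.3, 0.2],
                           [0.6, 0.3, 0.1]])

# Predictive posterior: average over the ensemble, Eq. (1).
posterior = ensemble_probs.mean(axis=0)

total_uncertainty = entropy(posterior)                      # H[P(y|x, D)]
expected_data_uncertainty = entropy(ensemble_probs).mean()  # E_q(θ)[H[P(y|x, θ)]]
knowledge_uncertainty = total_uncertainty - expected_data_uncertainty  # MI, Eq. (3)

# MI is non-negative and vanishes only when all ensemble members agree.
print(total_uncertainty, expected_data_uncertainty, knowledge_uncertainty)
```

Note that MI is computed as a difference of two entropies, so it is sensitive to how accurately each term can be estimated; this becomes important for sequences, where the expectations run over an exponentially large output space.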

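For structured outputs there is additionally a choice, noted in the contributions above, of how the ensemble is combined: one can average the token-level distributions and then take the product along the sequence (a product of expectations), or compute each model's sequence probability first and then average over models (an expectation of products). A minimal sketch, with randomly generated token distributions standing in for trained models and a single fixed decoding path, shows that the two combinations generally differ:

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, K = 5, 4, 10  # models, sequence length, vocabulary size

# Per-model token distributions P(y_l | y_<l, x, θ^(m)) along one decoding path,
# random here purely for illustration.
logits = rng.normal(size=(M, L, K))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

y = rng.integers(0, K, size=L)  # a hypothesis sequence

# Token-level combination: average the token distributions across models first,
# then take the product over the sequence positions.
tok_post = probs.mean(axis=0)                       # shape (L, K)
p_token_level = np.prod(tok_post[np.arange(L), y])

# Sequence-level combination: compute each model's sequence probability first,
# then average over the ensemble.
p_seq = np.prod(probs[:, np.arange(L), y], axis=1)  # shape (M,)
p_sequence_level = p_seq.mean()

print(p_token_level, p_sequence_level)
```

Both quantities are valid probabilities of the hypothesis y, but they induce different predictive posteriors and therefore different uncertainty measures; the body of the paper examines how this choice impacts predictive performance.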
