REVISITING EXPLICIT REGULARIZATION IN NEURAL NETWORKS FOR RELIABLE PREDICTIVE PROBABILITY Anonymous

Abstract

From the statistical learning perspective, complexity control via explicit regularization is necessary for improving the generalization of overparameterized models, since it deters the memorization of intricate patterns that exist only in the training data. However, the impressive generalization performance of overparameterized neural networks with only implicit regularization challenges the importance of explicit regularization. Furthermore, explicit regularization does not prevent neural networks from memorizing unnatural patterns, such as random labels. In this work, we revisit the role and importance of explicit regularization methods for the generalization of the predictive probability, not just the generalization of the 0-1 loss. Specifically, we analyze possible causes of poor predictive probability and identify that regularization of predictive confidence is required during training. We then empirically show that explicit regularization significantly improves the reliability of the predictive probability, enabling better predictive uncertainty representation and preventing the overconfidence problem. Our findings present a new direction for improving the predictive probability quality of deterministic neural networks, which can be an efficient and scalable alternative to Bayesian neural networks and ensemble methods.

1. INTRODUCTION

As deep learning models have become pervasive in real-world decision-making systems, the importance of producing a reliable predictive probability is increasing. In this paper, we call a predictive probability reliable if it is well-calibrated and precisely represents uncertainty about its predictions. Calibrated behavior refers to the ability to match the predictive probability of an event to the long-term frequency of the event's occurrence (Dawid, 1982). A reliable predictive probability benefits many downstream tasks such as anomaly detection (Malinin & Gales, 2019), classification with rejection (Lakshminarayanan et al., 2017), and exploration in reinforcement learning (Gal & Ghahramani, 2016). More importantly, deep learning systems with more reliable predictive probability can provide better feedback to users, explaining what is going on, when their predictions become uncertain, and when unexpected anomalies arise. Unfortunately, neural networks are prone to overconfidence and lack the ability to represent uncertainty, and this problem has become a fundamental concern in the deep learning community.

Bayesian methods have an innate ability to produce reliable predictive probability. Specifically, they express a probability distribution over parameters, in which uncertainty in the parameter space is automatically determined by the data (MacKay, 1992; Neal, 1993). Uncertainty in prediction can then be represented by aggregating predictions from different parameter configurations and summarizing them with rich statistics such as entropy and mutual information. From this perspective, deterministic neural networks, which select a single parameter configuration and thus cannot provide such rich information, naturally lack the ability to represent uncertainty. However, the automatic determination of parameter uncertainty in the light of data, i.e., posterior inference, comes with prohibitive computational costs.
Therefore, the mainstream approach for improving predictive probability quality has been the efficient adoption of the Bayesian principle into neural networks (Gal & Ghahramani, 2016; Ritter et al., 2018; Teye et al., 2018; Joo et al., 2020a). Recent works (Lakshminarayanan et al., 2017; Müller et al., 2019; Thulasidasan et al., 2019) have discovered the hidden gems of label smoothing (Szegedy et al., 2016), mixup (Zhang et al., 2018), and adversarial training (Goodfellow et al., 2015), which improve calibration performance and the ability to represent uncertainty. These findings present a new possibility of improving the reliability of the predictive probability without changing the deterministic nature of neural networks. This direction is appealing because it can be applied in a plug-and-play fashion to existing building blocks. This means that such methods inherit the scalability, computational efficiency, and surprising generalization performance of deterministic neural networks, with which Bayesian neural networks often struggle (Wu et al., 2019; Osawa et al., 2019; Joo et al., 2020a).

Motivated by these observations, we investigate a general direction from the regularization perspective to mitigate the unreliable predictive probability problem, rather than proposing new constructive heuristics or discovering hidden properties of specific methods. Our main contribution is twofold. First, we present a new direction for alleviating unreliable predictive behavior that is readily applicable, computationally efficient, and scalable to large-scale models compared to Bayesian neural networks or ensemble methods. Second, our findings provide a novel view of the role of explicit regularization in deep learning, namely that it improves the reliability of the predictive probability.

2.1. BACKGROUND

We consider a classification problem with i.i.d. training samples $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ drawn from an unknown distribution $P_{x,y}$ whose corresponding tuple of random variables is $(x, y)$. We denote $\mathcal{X}$ as an input space and $\mathcal{Y}$ as a set of categories $\{1, 2, \dots, K\}$. Let $f^W : \mathcal{X} \to \mathcal{Z}$ be a neural network with parameters $W$, where $\mathcal{Z} = \mathbb{R}^K$ is a logit space. On top of the logit space, the softmax $\sigma : \mathbb{R}^K \to \Delta^{K-1}$ normalizes the exponential of the logits: $\phi_k^W(x) = \frac{\exp(f_k^W(x))}{\sum_i \exp(f_i^W(x))}$, where we let $\phi_k^W(x) = \sigma_k(f^W(x))$ for brevity. $\sigma_k(f^W(x))$ is often interpreted as the predictive probability that the label of $x$ belongs to class $k$ (Bridle, 1990). The probabilistic interpretation of neural network outputs gives the natural minimization objective for classification, the cross-entropy between the predictive probability and the one-hot encoded label: $\ell_{CE}(y, \phi^W(x)) = -\sum_k \mathbb{1}_y(k) \log \phi_k^W(x)$, where $\mathbb{1}_A(\omega)$ is an indicator function taking one if $\omega \in A$ and zero otherwise. By minimizing the cross-entropy (or equivalently, maximizing the log-likelihood) with stochastic gradient descent (SGD) (Robbins & Monro, 1951) or its variants, modern neural networks achieve surprising generalization performance.

As the demand for neural networks in real-world decision-making grows, reliable predictive probability has been of interest in the machine learning community. One important quality of predictive probability is calibrated behavior. Specifically, based on the notion of calibration in the classical forecasting problem (Dawid, 1982), a perfectly calibrated model can be defined as follows: $p(y = k \mid \phi^W(x) = p) = p_k, \; \forall p \in \Delta^{K-1}, \; k \in \{1, 2, \dots, K\}$. Note that a calibrated model need not be one producing $\phi_k^W(x) = p(y = k \mid x)$. In practice, the expected calibration error (ECE) (Naeini et al., 2015) is widely used to measure calibration performance.
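As a minimal sketch of the quantities just defined, the snippet below computes the softmax predictive probability $\phi^W(x)$ and the cross-entropy loss for a toy logit vector (the helper names are ours, not from the paper; the log-sum-exp shift is a standard numerical-stability trick):

```python
import numpy as np

def softmax(logits):
    """Map logits f^W(x) in R^K to the probability simplex."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(one_hot_label, probs, eps=1e-12):
    """l_CE(y, phi^W(x)) = -sum_k 1_y(k) * log phi_k^W(x)."""
    return -np.sum(one_hot_label * np.log(probs + eps), axis=-1)

# Toy example: K = 3 classes, true class is 0.
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)
label = np.array([1.0, 0.0, 0.0])
loss = cross_entropy(label, probs)
```

Since the label is one-hot, the loss reduces to the negative log-probability assigned to the true class, matching the maximum-likelihood view mentioned above.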
ECE on a dataset $\mathcal{D}_T$ can be computed by binning predictions into $M$ groups based on their confidences and then averaging the per-bin calibration gaps: $\mathrm{ECE} = \sum_{i=1}^{M} \frac{|G_i|}{|\mathcal{D}_T|} \left| \mathrm{acc}(G_i) - \mathrm{conf}(G_i) \right|$, where $G_i = \{x : (i-1)/M < \max_k \phi_k^W(x) \le i/M, \; x \in \mathcal{D}_T\}$; $\mathrm{acc}(G_i)$ and $\mathrm{conf}(G_i)$ are the average accuracy and confidence of the predictions in group $G_i$, respectively.
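The binned ECE estimator can be sketched as follows (a simplified implementation under our own naming; `n_bins` plays the role of $M$, and the bin edges follow the $((i-1)/M, i/M]$ partition):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: sum_i |G_i|/|D_T| * |acc(G_i) - conf(G_i)|.

    probs:  (N, K) array of predictive probabilities phi^W(x).
    labels: (N,) integer class labels.
    """
    confidences = probs.max(axis=1)        # max_k phi_k^W(x)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece, n = 0.0, len(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = accuracies[in_bin].mean()    # acc(G_i)
            conf = confidences[in_bin].mean()  # conf(G_i)
            ece += in_bin.sum() / n * abs(acc - conf)
    return ece

# Sanity check: one-hot, always-correct predictions are perfectly calibrated.
probs = np.eye(3)
labels = np.array([0, 1, 2])
ece = expected_calibration_error(probs, labels)
```

A model that is always correct with confidence 1.0 lands entirely in the top bin with acc = conf, so its ECE is exactly zero; miscalibrated models accumulate the weighted per-bin gaps instead.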



Throughout this paper, the confidence (or predictive confidence) at $x$ refers to $\max_k \phi_k^W(x)$, which is different from the confidence in the statistics literature.

