REVISITING EXPLICIT REGULARIZATION IN NEURAL NETWORKS FOR RELIABLE PREDICTIVE PROBABILITY

Anonymous

Abstract

From the statistical learning perspective, complexity control via explicit regularization is necessary for improving the generalization of overparameterized models, since it deters the memorization of intricate patterns that exist only in the training data. However, the impressive generalization performance of overparameterized neural networks trained with only implicit regularization challenges the importance of explicit regularization. Furthermore, explicit regularization does not prevent neural networks from memorizing unnatural patterns, such as random labels. In this work, we revisit the role and importance of explicit regularization methods for the generalization of the predictive probability, not just the generalization of the 0-1 loss. Specifically, we analyze possible causes of poor predictive probability and identify that regularization of predictive confidence is required during training. We then empirically show that explicit regularization significantly improves the reliability of the predictive probability, enabling better predictive uncertainty representation and preventing the overconfidence problem. Our findings suggest a new direction for improving the predictive probability quality of deterministic neural networks, which can be an efficient and scalable alternative to Bayesian neural networks and ensemble methods.

1. INTRODUCTION

As deep learning models have become pervasive in real-world decision-making systems, the importance of producing a reliable predictive probability is increasing. In this paper, we call a predictive probability reliable if it is well-calibrated and precisely represents uncertainty about its predictions. Calibrated behavior refers to the ability to match the predictive probability of an event to the long-term frequency of the event's occurrence (Dawid, 1982). A reliable predictive probability benefits many downstream tasks such as anomaly detection (Malinin & Gales, 2019), classification with rejection (Lakshminarayanan et al., 2017), and exploration in reinforcement learning (Gal & Ghahramani, 2016). More importantly, deep learning systems with more reliable predictive probability can provide better feedback to users, explaining what is going on, flagging situations in which their predictions become uncertain, and surfacing unexpected anomalies. Unfortunately, neural networks are prone to overconfidence and lack the ability to represent uncertainty, and this problem has become a fundamental concern in the deep learning community.

Bayesian methods have an innate ability to produce reliable predictive probability. Specifically, they express a probability distribution over parameters, in which uncertainty in the parameter space is automatically determined by the data (MacKay, 1992; Neal, 1993). Uncertainty in prediction can then be represented by aggregating predictions from different parameter configurations and summarizing them with rich quantities such as entropy and mutual information. From this perspective, deterministic neural networks, which select a single parameter configuration and thus cannot provide such rich information, naturally lack the ability to represent uncertainty. However, the automatic determination of parameter uncertainty in the light of data, i.e., posterior inference, comes with prohibitive computational costs.
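To make the aggregation step above concrete, the following is a minimal sketch (not code from the paper) of how predictions from several parameter configurations can be summarized with entropy and mutual information; the function name `predictive_uncertainty` and the toy probability arrays are illustrative assumptions.

```python
import numpy as np

def predictive_uncertainty(member_probs):
    """Summarize predictions aggregated over parameter configurations.

    member_probs: array of shape (M, K), the softmax outputs of M parameter
    configurations (e.g. posterior samples) over K classes.
    Returns (total, expected, mutual_info):
      total       = entropy of the averaged prediction, H[mean_m p_m]
      expected    = average per-member entropy, mean_m H[p_m]
      mutual_info = total - expected (the disagreement between members)
    """
    eps = 1e-12  # guard against log(0)
    mean_p = member_probs.mean(axis=0)
    total = -np.sum(mean_p * np.log(mean_p + eps))
    expected = -np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1))
    return total, expected, total - expected

# Members that agree: low mutual information (little parameter uncertainty).
agree = np.array([[0.90, 0.05, 0.05],
                  [0.88, 0.07, 0.05]])
# Members that disagree: high mutual information, even though each member
# is individually confident.
disagree = np.array([[0.90, 0.05, 0.05],
                     [0.05, 0.90, 0.05]])
```

A deterministic network corresponds to `M = 1`, for which the mutual-information term is identically zero, which is one way to see why a single parameter configuration cannot express this kind of uncertainty.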
Therefore, the mainstream approach for improving the quality of the predictive probability has been an efficient adoption of the Bayesian principle into neural networks (Gal & Ghahramani, 2016; Ritter et al., 2018; Teye et al., 2018; Joo et al., 2020a). Recent works (Lakshminarayanan et al., 2017; Müller et al., 2019; Thulasidasan et al., 2019) have discovered the hidden gems of label smoothing (Szegedy et al., 2016), mixup (Zhang et al., 2018), and
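For reference, the two regularizers just cited can be sketched in a few lines; this is a generic illustration of the standard formulations, not the paper's implementation, and the function names and the `alpha`/`lam` parameters are assumptions for the example.

```python
import numpy as np

def smooth_labels(labels, num_classes, alpha=0.1):
    """Label smoothing (Szegedy et al., 2016): replace each one-hot target
    with a mixture of the one-hot vector and the uniform distribution,
    discouraging fully confident (probability-one) predictions."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def mixup_pair(x1, y1, x2, y2, lam):
    """mixup (Zhang et al., 2018): train on convex combinations of input
    pairs and their one-hot labels; lam is typically drawn from Beta(a, a)."""
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

Both methods regularize the targets rather than the weights, so the network is explicitly penalized for extreme confidence, which is the property relevant to the predictive-probability discussion above.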

