MOMENT DISTRIBUTIONALLY ROBUST PROBABILISTIC SUPERVISED LEARNING

Abstract

Probabilistic supervised learning assumes that the ground truth is itself a distribution rather than a single label, as in classical settings. Common approaches learn with a proper composite loss and obtain probability estimates via an invertible link function. Typical links such as the softmax yield restrictive and problematic uncertainty certificates. In this paper, we propose to predict conditional label distributions directly, from first principles in distributionally robust optimization, based on an ambiguity set defined by feature moment divergence. We derive generalization bounds for our method under mild assumptions. We illustrate how to manipulate penalties for underestimation and overestimation. Our method can be easily incorporated into neural networks for end-to-end representation learning. Experimental results on datasets with probabilistic labels illustrate the flexibility, effectiveness, and efficiency of this learning paradigm.

1. INTRODUCTION

The goal of classical supervised learning is point estimation, predicting a single target from the label domain given features, usually without justifying the confidence of the prediction. The outcome distribution of an event can be inherently uncertain, and in some scenarios it is more desirable than a point prediction. For example, weather forecasts that express the uncertainty of events such as rain occurring are more sensible than binary-valued predictions, and a uniform distribution over the outcomes of a fair die roll is more sensible than guessing an integer at random. On one hand, the predicted distribution quantifies label uncertainty and is thus more informative than a point prediction, which is widely studied in weakly supervised learning (Yoshida et al., 2021), boosting (Friedman et al., 2000) and optimal treatment (Leibovici et al., 2000). On the other hand, the ground truth may naturally come with multiple targets, possibly of different importance. For instance, a human face image can express multiple emotions, gene expression levels vary over the course of a biological experiment, and many annotators might disagree over a highly ambiguous instance. In these settings, each predefined label is part of the ground truth as long as it has positive probability under the true distribution. Hence, it is natural to use probabilistic labels in both training and inference when the ground truth is no longer a point. In the literature, the task of predicting full distributions from features is called probabilistic supervised learning (Gressmann et al., 2018). A probabilistic supervised learning task comes with a probabilistic loss functional that quantitatively measures the utility of the prediction (Bickel, 2007). Williamson et al. (2016) propose a composite multiclass loss that separates properness and convexity.
They illuminate the connection between classification calibration (Tewari & Bartlett, 2007) and properness (Gneiting & Raftery, 2007; Dawid, 2007), which represent Fisher consistency for classification and for probability estimation, respectively. A proper loss is minimized when predictions match the true underlying probability, which implies classification calibration, but not vice versa. Among proper losses, the logarithmic loss (Good, 1952) severely penalizes underestimation of rare outcomes, assessing the "surprise" of the predictor in an information-theoretic sense; the Brier score, originally proposed for evaluating weather forecasts (Brier, 1950), is useful for assessing prediction calibration; and the spherical scoring rule (Bickel, 2007) is used when a distribution with lower entropy is desired. A single proper loss is sometimes insufficient for scenarios that elicit optimistic or pessimistic predictions for decision making under practical concerns (Elsberry, 2002; Chapman, 2012). For example, underestimating disastrous events may yield very low utility, motivating more pessimistic predictions. It is therefore desirable for a proper loss to be flexible in its penalties for deviated predictions, combining the statistical properties of multiple losses. Deep neural networks typically adopt the softmax function to predict a legal distribution. However, softmax renormalizes the logits and thereby assumes that they follow a logistic distribution (Bendale & Boult, 2016); it is poor at calibration, uncertainty quantification and robustness against overfitting (Joo et al., 2020). The inverse of the canonical link function in Williamson et al. (2016) can be used to recover probabilities but commonly resembles softmax (Zou et al., 2008). In this paper, we propose a probabilistic supervised learning method from first principles in distributionally robust optimization (DRO) for general proper losses that realize desired prediction properties.
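To make the contrast between these proper losses concrete, the following sketch evaluates the logarithmic loss, the Brier score, and a (negated) spherical score on the same prediction. The function names and the example distribution are illustrative, not from the paper; the formulas are the standard definitions of these scoring rules.

```python
import numpy as np

def log_loss(p, y):
    """Logarithmic loss: -log of the probability assigned to the realized outcome y."""
    return -np.log(p[y])

def brier_score(p, y):
    """Brier score: squared Euclidean distance between p and the one-hot truth."""
    e = np.zeros_like(p)
    e[y] = 1.0
    return np.sum((p - e) ** 2)

def spherical_score(p, y):
    """Negated spherical scoring rule (lower is better): -p[y] / ||p||_2."""
    return -p[y] / np.linalg.norm(p)

# A rare outcome (y = 2, predicted with probability 0.01) is realized.
p = np.array([0.69, 0.30, 0.01])
print(log_loss(p, 2))        # large: log loss heavily penalizes underestimating rare events
print(brier_score(p, 2))     # bounded: penalty stays moderate even for rare outcomes
print(spherical_score(p, 2)) # rewards concentrated (low-entropy) predictions
```

Note that the log loss is unbounded as `p[y] -> 0`, whereas the Brier score is bounded by 2; this is exactly the asymmetry in penalizing underestimation that motivates combining losses.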
Instead of specifying a parametric distribution, our method starts from a minimax learning problem in which the predictor non-parametrically minimizes the most adverse risk over all distributions in an ambiguity set defined by empirical feature moments. The ambiguity set represents our uncertainty about the underlying distribution. By strong duality, we show that the primal DRO problem is equivalent to a regularized empirical risk minimization (ERM) problem. The regularization arises naturally from the ambiguity set instead of being explicitly imposed. The ERM form also allows us to derive generalization bounds and make inferences on unseen data. We characterize a set of solutions for general proper losses satisfying certain mild conditions, and an efficient algorithm for a weighted sum of two common strictly proper losses. We conduct experiments on real-world datasets by adapting our method to end-to-end differentiable learning. We defer all technical proofs to the appendix. Contributions. Our contributions are summarized as follows. (1) We propose a distributionally robust probabilistic supervised learning method. (2) We characterize the solutions to the proposed method and present an efficient algorithm for specific losses. (3) We incorporate our method into neural networks and perform an extensive empirical study on real-world data.
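The minimax-to-ERM equivalence described above can be sketched schematically as follows. The notation here (feature map $\phi$, hypothesis class $\mathcal{H}$, radius $\epsilon$, and regularizer $\Omega$) is generic and assumed for illustration; the paper's precise ambiguity set and dual are developed in later sections.

$$
\min_{h \in \mathcal{H}} \; \max_{Q \in \mathcal{U}} \;
  \mathbb{E}_{(x,y) \sim Q}\!\left[\, \ell(h(x), y) \,\right],
\qquad
\mathcal{U} = \Bigl\{ Q \,:\,
  \Bigl\| \mathbb{E}_{Q}[\phi(x, y)] - \tfrac{1}{n}\textstyle\sum_{i=1}^{n} \phi(x_i, y_i) \Bigr\| \le \epsilon
\Bigr\}.
$$

By strong Lagrangian duality, the inner maximization over $\mathcal{U}$ collapses into a penalty term, yielding a regularized ERM problem of the form

$$
\min_{h \in \mathcal{H}} \;
  \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i) \;+\; \epsilon \, \Omega(h),
$$

where the regularizer $\Omega$ is induced by the norm defining the ambiguity set rather than chosen by hand.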

1.1. RELATED WORK

Model assessment of probabilistic models via predictive likelihood has been studied in Bayesian models (Gelman et al., 2014), probabilistic forecasting (Gneiting & Raftery, 2007), machine learning (Masnadi-Shirazi & Vasconcelos, 2009), conditional density estimation (Sugiyama et al., 2010), information theory (Reid & Williamson, 2011) and representation learning (Dubois et al., 2020). A comprehensive framework for probabilistic supervised learning can be found in Gressmann et al. (2018). Techniques developed to explicitly tackle multiclass probabilistic classification include multiclass logistic regression (Collins et al., 2002), support vector machines (Lyu et al., 2019; Wang et al., 2019), learning from noisy labels (Zhang et al., 2021), weakly supervised learning (Yoshida et al., 2021), and neural networks (Papadopoulos, 2013; Gast & Roth, 2018). Multilabel classification, aimed at predicting multiple classes with equal importance, has been analyzed by Cheng et al. (2010) and Geng (2016) in a general probabilistic setting. Note that confidence calibration (Guo et al., 2017) has a different objective from probabilistic supervised learning. Fisher consistency results have been established for classification losses (Tewari & Bartlett, 2007), structured losses (Ciliberto et al., 2016; Nowak et al., 2020), proper losses (Williamson et al., 2016) and Fenchel-Young losses (Blondel et al., 2020). The emerging field of DRO has led to learning methods with ambiguity sets defined by feature moments (Farnia & Tse, 2016; Mazuelas et al., 2020), ϕ-divergence (Duchi & Namkoong, 2019) and the Wasserstein distance (Shafieezadeh-Abadeh et al., 2019). The moment-based ambiguity set adopted in this work originates from maximum entropy (Cortes et al., 2015; Mazuelas et al., 2022), with similar work studying classification (Asif et al., 2015; Fathony et al., 2016) and structured prediction (Fathony et al., 2018a; b).

