MOMENT DISTRIBUTIONALLY ROBUST PROBABILISTIC SUPERVISED LEARNING

Abstract

Probabilistic supervised learning assumes that the ground truth is itself a distribution rather than a single label, as in the classic setting. Common approaches learn with a proper composite loss and obtain probability estimates via an invertible link function. Typical links such as the softmax yield restrictive and problematic uncertainty certificates. In this paper, we propose to predict conditional label distributions directly, from first principles in distributionally robust optimization, based on an ambiguity set defined by feature moment divergence. We derive generalization bounds under mild assumptions and illustrate how to manipulate penalties for underestimation and overestimation. Our method can be easily incorporated into neural networks for end-to-end representation learning. Experimental results on datasets with probabilistic labels illustrate the flexibility, effectiveness, and efficiency of this learning paradigm.

1. INTRODUCTION

The goal of classical supervised learning is point estimation: predicting a single target from the label domain given features, usually without justifying the confidence. The outcome distribution of an event can be inherently uncertain, and in some scenarios is more desirable than a point prediction. For example, weather forecasts that express the uncertainty of events such as rain occurring are more sensible than binary-valued predictions, and predicting a uniform distribution over the outcomes of a fair die roll is more sensible than guessing an integer at random. On one hand, a predicted distribution quantifies label uncertainty and is thus more informative than a point prediction, a property widely studied in weakly supervised learning (Yoshida et al., 2021), boosting (Friedman et al., 2000), and optimal treatment (Leibovici et al., 2000). On the other hand, the ground truth may naturally come with multiple targets, possibly of different importance. For instance, there can be multiple emotions in a human face image, gene expression levels vary over time in biological experiments, and many annotators might disagree over a highly ambiguous instance. In such settings, each predefined label is part of the ground truth as long as it has positive probability in the true distribution. Hence, it is natural to use probabilistic labels in both training and inference when the ground truth is no longer a point.

In the literature, the task of predicting full distributions from features is called probabilistic supervised learning (Gressmann et al., 2018). A probabilistic supervised learning task comes with a probabilistic loss functional that quantitatively measures the utility of a prediction (Bickel, 2007). Williamson et al. (2016) propose a composite multiclass loss that separates properness and convexity.
They illuminate the connection between classification calibration (Tewari & Bartlett, 2007) and properness (Gneiting & Raftery, 2007; Dawid, 2007), which represent Fisher consistency for classification and for probability estimation, respectively. A proper loss is minimized when the prediction matches the true underlying probability, which implies classification calibration, but not vice versa. Among proper losses, the logarithmic loss (Good, 1952) severely penalizes underestimation of rare outcomes, assessing the "surprise" of the predictor in an information-theoretic sense; the Brier score, originally proposed for evaluating weather forecasts (Brier, 1950), is useful for assessing prediction calibration; and the spherical scoring rule (Bickel, 2007) is used when a distribution with lower entropy is desired. A single proper loss is sometimes insufficient for scenarios that call for optimistic or pessimistic predictions in decision making with practical concerns (Elsberry, 2002; Chapman, 2012). For example, underestimating disastrous events may provide very low utility, motivating more pessimistic predictions.
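To make the three proper losses above concrete, the following minimal Python sketch (our illustration, not an implementation from any of the cited works) evaluates each on a predicted distribution `p` given an observed class `y`; the spherical rule is written in loss form (lower is better).

```python
import math

def log_loss(p, y):
    """Logarithmic loss: -log of the probability assigned to the true class y.
    Blows up as p[y] -> 0, severely penalizing underestimation of rare outcomes."""
    return -math.log(p[y])

def brier_score(p, y):
    """Brier score: squared distance between the predicted distribution and
    the one-hot indicator of the observed class y."""
    return sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))

def spherical_loss(p, y):
    """Spherical rule as a loss: 1 - p[y] / ||p||_2.
    Favors lower-entropy predictions that concentrate mass on the right class."""
    norm = math.sqrt(sum(p_k ** 2 for p_k in p))
    return 1.0 - p[y] / norm

# All three are proper: the expected loss over y ~ q is minimized at p = q.
p = [0.7, 0.2, 0.1]  # predicted distribution over three classes
print(log_loss(p, 0), brier_score(p, 0), spherical_loss(p, 0))
```

A quick numerical check of properness, say for the Brier score, is to compare the expected loss under a true distribution q at p = q against any other candidate p; the former is never larger.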

