MOMENT DISTRIBUTIONALLY ROBUST PROBABILISTIC SUPERVISED LEARNING

Abstract

Probabilistic supervised learning assumes the ground truth itself is a distribution instead of a single label, as in classic settings. Common approaches learn with a proper composite loss and obtain probability estimates via an invertible link function. Typical links such as the softmax, however, yield restrictive and problematic uncertainty estimates. In this paper, we propose to predict conditional label distributions directly, from first principles in distributionally robust optimization, based on an ambiguity set defined by feature moment divergence. We derive generalization bounds for the method under mild assumptions and illustrate how to manipulate the penalties for underestimation and overestimation. Our method can be easily incorporated into neural networks for end-to-end representation learning. Experimental results on datasets with probabilistic labels illustrate the flexibility, effectiveness, and efficiency of this learning paradigm.

1. INTRODUCTION

The goal of classical supervised learning is point estimation, i.e., predicting a single target from the label domain given features, usually without justifying the confidence of the prediction. The outcome distribution of an event can be inherently uncertain and, in some scenarios, more desirable than a point prediction. For example, weather predictions that express the uncertainty of events such as rain occurring are more sensible than binary-valued predictions, while predicting a uniform distribution over the outcomes of a fair die roll is more sensible than guessing an integer value at random. On the one hand, the predicted distribution quantifies label uncertainty and is thus more informative than a point prediction, which is widely studied in weakly supervised learning (Yoshida et al., 2021), boosting (Friedman et al., 2000) and optimal treatment (Leibovici et al., 2000). On the other hand, the ground truth naturally comes with multiple targets, possibly with different degrees of importance. For instance, there can be multiple emotions in an image of a human face, gene expression levels in biological experiments vary over time, and many annotators might disagree over a highly ambiguous instance. In these settings, each predefined label is part of the ground truth as long as it has positive probability under the true distribution. Hence, it is natural to use probabilistic labels in both training and inference when the ground truth is no longer a point. In the literature, the task of predicting full distributions from features is called probabilistic supervised learning (Gressmann et al., 2018). A probabilistic supervised learning task comes with a probabilistic loss functional that quantitatively measures the utility of the prediction (Bickel, 2007). Williamson et al. (2016) propose a composite multiclass loss that separates properness and convexity.
They illuminate the connection between classification calibration (Tewari & Bartlett, 2007) and properness (Gneiting & Raftery, 2007; Dawid, 2007), which represent Fisher consistency for classification and for probability estimation, respectively. A proper loss is minimized when predictions match the true underlying probability, which implies classification calibration, but not vice versa. Among proper losses, the logarithmic loss (Good, 1952) severely penalizes underestimation of rare outcomes and assesses the "surprise" of the predictor in an information-theoretic sense; the Brier score, originally proposed for evaluating weather forecasts (Brier, 1950), is useful for assessing prediction calibration; and the spherical scoring rule (Bickel, 2007) is used when a distribution with lower entropy is desired. A single proper loss is sometimes not sufficient for scenarios that elicit optimistic or pessimistic predictions for decision making with practical concerns (Elsberry, 2002; Chapman, 2012). For example, underestimating disastrous events may provide very low utility, motivating more pessimistic predictions. It is therefore desirable for a proper loss to be flexible in its penalties for deviated predictions, combining the statistical properties of multiple losses. Deep neural networks typically adopt the softmax function to predict a legal distribution. However, softmax renormalizes the logits and therefore implicitly assumes that they follow a logistic distribution (Bendale & Boult, 2016). It is poor at calibration, uncertainty quantification and robustness against overfitting (Joo et al., 2020). The inverse of the canonical link function in Williamson et al. (2016) can be used to recover probabilities but commonly resembles softmax (Zou et al., 2008). In this paper, we propose a probabilistic supervised learning method from first principles in distributionally robust optimization (DRO) for general proper losses that realizes desired prediction properties.
Instead of specifying a parametric distribution, it starts with a minimax learning problem in which the predictor non-parametrically minimizes the most adverse risk over all distributions in an ambiguity set defined by empirical feature moments. The ambiguity set represents our uncertainty about the underlying distribution. By strong duality, we show that the primal DRO problem is equivalent to a regularized empirical risk minimization (ERM) problem. The regularization arises naturally from the ambiguity set instead of being explicitly imposed. The ERM form also allows us to derive generalization bounds and make inferences on unseen data. We characterize a set of solutions for general proper losses satisfying certain mild conditions and provide an efficient algorithm for a weighted sum of two common strictly proper losses. We conduct experiments on real-world datasets by adapting our method to end-to-end differentiable learning. We defer all technical proofs to the appendix. Contributions. Our contributions are summarized as follows. (1) We propose a distributionally robust probabilistic supervised learning method. (2) We characterize the solutions to the proposed method and present an efficient algorithm for specific losses. (3) We incorporate our method into neural networks and perform an extensive empirical study on real-world data.

1.1. RELATED WORK

Model assessment of probabilistic models via predictive likelihood has been studied in Bayesian models (Gelman et al., 2014) , probabilistic forecasting (Gneiting & Raftery, 2007) , machine learning (Masnadi-Shirazi & Vasconcelos, 2009) , conditional density estimation (Sugiyama et al., 2010) , information theory (Reid & Williamson, 2011) and representation learning (Dubois et al., 2020) . A comprehensive framework for probabilistic supervised learning can be found in Gressmann et al. (2018) . Techniques developed to explicitly tackle multiclass probabilistic classification include multiclass logistic regression (Collins et al., 2002) , support vector machines (Lyu et al., 2019; Wang et al., 2019) , learning from noisy labels (Zhang et al., 2021) , weakly supervised learning (Yoshida et al., 2021) , and neural networks (Papadopoulos, 2013; Gast & Roth, 2018) . Multilabel classification, aimed at predicting multiple classes with equal importance, has been analyzed by Cheng et al. (2010) and Geng (2016) in a general probabilistic setting. Note that confidence calibration (Guo et al., 2017) has a different objective from probabilistic supervised learning. Fisher consistency results have been established for classification losses (Tewari & Bartlett, 2007) , structured losses (Ciliberto et al., 2016; Nowak et al., 2020 ), proper losses (Williamson et al., 2016) and Fenchel-Young losses (Blondel et al., 2020) . The emerging field of DRO has led to learning methods with ambiguity sets defined by feature moments (Farnia & Tse, 2016; Mazuelas et al., 2020) , ϕ-divergence (Duchi & Namkoong, 2019) and the Wasserstein distance (Shafieezadeh-Abadeh et al., 2019) . The moment-based ambiguity set adopted in this work originates from maximum entropy (Cortes et al., 2015; Mazuelas et al., 2022) , with similar work studying classification (Asif et al., 2015; Fathony et al., 2016) and structured prediction (Fathony et al., 2018a; b) .

2.1. NOTATIONS

We adopt the following notations by convention. A bold letter x denotes a vector whereas a normal letter x represents a scalar. x_i or (x)_i stands for the i-th coordinate of x. We denote random variables with capitalization (e.g., X or X) and sets with calligraphic capitalization (e.g., X, A). We denote by [n] the set {1, 2, . . . , n}. |•| means the absolute value of a scalar or the cardinality of a set, depending on the context. The ℓ_p norm of a vector is defined as ∥x∥_p ≜ (Σ_i |x_i|^p)^{1/p}. The indicator function of a subset S of a set X is a mapping I_S : X → {0, 1} such that I_S(x) = 1 if x ∈ S and I_S(x) = 0 otherwise. I(•) is adopted for events so that I(S) = 1 if event S occurs and I(S) = 0 otherwise. We write δ_z for the Dirac point measure at z ∈ Z. The probability simplex of (d + 1)-dimensional vectors is denoted ∆^d, whose superscript is omitted when the context is clear. We denote by P(Z) the set of all probability distributions on a set Z.

2.2. PROBABILISTIC LOSS FUNCTIONALS

Figure 1: Losses as functions of P_Y(1), with Q_Y(2) = Q_Y(3) = 0.2 and P_Y(2) = P_Y(3) as P_Y(1) varies. Each loss is normalized to cross (1, 0) and (0.5, 0.5) according to the binary case with a hard label. Best viewed in color.

A loss function measures the quality of a prediction associated with an event. Scoring rules are widely adopted to assess probabilistic predictions, but can be naturally translated to loss functions by appropriate negation and normalization. To illustrate some examples, we consider a decision problem in which y ∈ Y is an outcome and P_Y ∈ P(Y) is a predicted distribution over Y. We denote by p_Y ≜ (P_Y(y))_{y∈Y}^⊺ a vector of probabilities. The zero-one loss is defined for deterministic prediction so that a penalty of 1 is incurred whenever y′ and y differ: ℓ_01(y′, y) ≜ I(y′ ≠ y). It extends to probabilistic predictions as ℓ_01(P_Y, y) ≜ 1 - P_Y(y). The cost-sensitive loss for multiclass classification is similarly defined with a confusion cost matrix C ∈ R_+^{|Y|×|Y|}: ℓ_cs(P_Y, y) ≜ Σ_{i∈Y} P_Y(i) C_{iy}. The multiclass Brier loss, based on the Brier score or quadratic scoring rule, measures the mean squared difference between P_Y and y: ℓ_br(P_Y, y) ≜ Σ_{y′} (P_Y(y′) - I(y′ = y))². The logarithmic loss, also called the log-likelihood loss, incurs a rapidly increasing penalty as the predicted probability of the target event approaches zero: ℓ_log(P_Y, y) ≜ -ln P_Y(y). The spherical scoring rule can be interpreted as the spherical projection of the true belief onto the prediction vector. To use it as a loss function, we define ℓ_sp(P_Y, y) ≜ 1 - P_Y(y)/∥p_Y∥_2. For ease of exposition, we define L(P, Q) := Σ_y Q_Y(y) ℓ(P_Y, y), where ℓ(•, •) : P(Y) × Y → R_+ is a probabilistic loss function as illustrated above. A loss L is called proper if L(Q, Q) ≤ L(P, Q) for all P, Q, and called strictly proper if Q is the unique minimizer of L(•, Q). Figure 1 provides a graphical comparison of the above losses for prediction with three classes.
We can infer that the zero-one loss is an improper loss.
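As a concrete illustration, the losses above can be sketched in a few lines of NumPy. The helper names (`expected_loss`, etc.) are ours, not from any released code; the final assertions replay the properness comparison numerically: the Brier loss rewards reporting Q itself, while the zero-one loss rewards collapsing onto the mode.

```python
import numpy as np

def zero_one(p, y):
    # Probabilistic zero-one loss: 1 - P_Y(y).
    return 1.0 - p[y]

def brier(p, y):
    # Multiclass Brier loss: sum_{y'} (P_Y(y') - I(y' = y))^2.
    e = np.zeros_like(p)
    e[y] = 1.0
    return np.sum((p - e) ** 2)

def log_loss(p, y):
    # Logarithmic loss: -ln P_Y(y).
    return -np.log(p[y])

def spherical(p, y):
    # Spherical loss: 1 - P_Y(y) / ||p_Y||_2.
    return 1.0 - p[y] / np.linalg.norm(p)

def expected_loss(loss, p, q):
    # L(P, Q) = sum_y Q_Y(y) * loss(P_Y, y).
    return sum(q[y] * loss(p, y) for y in range(len(q)))

q = np.array([0.6, 0.2, 0.2])
# Brier is (strictly) proper: reporting Q itself minimizes expected loss.
assert expected_loss(brier, q, q) < expected_loss(brier, np.array([0.8, 0.1, 0.1]), q)
# Zero-one is improper: the point mass on the mode beats the honest report.
mode = np.array([1.0, 0.0, 0.0])
assert expected_loss(zero_one, mode, q) < expected_loss(zero_one, q, q)
```

The same `expected_loss` helper works for any of the losses defined above, which is what the definition of L(P, Q) abstracts.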

2.3. PROBABILISTIC SUPERVISED LEARNING

We study the probabilistic supervised learning task where we are given n training samples {(x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(n), y^(n))} drawn i.i.d. from a distribution P on the joint space X × Y, in which X is a feature space and Y is a univariate finite discrete label space. A probabilistic multiclass loss function L : P(Y) × P(Y) → R_+ is given. The goal of ERM is to learn from the samples a mapping h : X → P(Y) that minimizes the empirical L-risk of h: h* ∈ arg min_{h∈H} R^L_{P_emp}(h) := E_{P_emp_X} L(h(X), P_emp_{Y|X}), where P_emp_{X,Y} represents the empirical distribution and H is a hypothesis space. Here we assume x may be accompanied by a probabilistic label, obtained by aggregating instances that share the same x^(i). In this way, both learning and inference are accomplished in a general setting subsuming classical supervised learning.
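To make the setup concrete, here is a minimal sketch (with our own helper names, not the paper's codebase) of how repeated instances sharing the same x are aggregated into an empirical probabilistic label, and how the empirical L-risk of a predictor h is then evaluated:

```python
from collections import Counter, defaultdict
import numpy as np

def probabilistic_labels(samples, n_classes):
    # Aggregate instances sharing the same x into an empirical
    # conditional label distribution P_emp(Y | x).
    counts = defaultdict(Counter)
    for x, y in samples:
        counts[x][y] += 1
    return {x: np.array([c[y] for y in range(n_classes)]) / sum(c.values())
            for x, c in counts.items()}

def empirical_risk(h, samples, n_classes, loss):
    # R^L_{P_emp}(h) = E_{P_emp_X} L(h(x), P_emp_{Y|x}), with
    # L(P, Q) = sum_y Q(y) loss(P, y) and P_emp_X the empirical marginal.
    labels = probabilistic_labels(samples, n_classes)
    n, marg = len(samples), Counter(x for x, _ in samples)
    return sum((marg[x] / n) * sum(q[y] * loss(h(x), y) for y in range(n_classes))
               for x, q in labels.items())

samples = [("a", 0), ("a", 0), ("a", 1), ("b", 2)]
labels = probabilistic_labels(samples, 3)
assert np.allclose(labels["a"], [2/3, 1/3, 0.0])   # "a" seen with labels {0, 0, 1}
# A uniform predictor under the log loss incurs risk ln 3 regardless of labels.
uniform = lambda x: np.full(3, 1/3)
assert np.isclose(empirical_risk(uniform, samples, 3, lambda p, y: -np.log(p[y])), np.log(3))
```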

3. METHOD

We now present our formulation for learning with general multiclass probabilistic losses. We provide theoretical results of consistency and generalization. We study the solution for general proper losses in our formulation and develop an efficient algorithm for two typical proper losses.

3.1. FORMULATION

We consider a continuous proper loss L to be optimized under the unknown distribution P_true. We assume that a class-sensitive feature function ϕ : X × Y → R^d that maps a data point to a d-dimensional feature vector is given. Examples include the multi-vector representation and class-dependent TF-IDF scores. Choosing a good ϕ is a representation learning problem, but as we will discuss in Section 3.4, it is not a concern once our method is incorporated into neural networks as a layer. Intuitively, the elements of the vector ϕ(x, y) can be regarded as scores indicating how well the label y matches the feature x. For example, with a linear hypothesis h_w(x, y) = ⟨w, ϕ(x, y)⟩, a good parameter vector w* should yield ⟨w*, ϕ(x, y)⟩ > ⟨w*, ϕ(x, y′)⟩ =⇒ P(x, y) > P(x, y′). Instead of specifying a parametric form of predictions, we adopt a minimax statistical learning formulation:

min_{P_{Y|X} ∈ P(Y)} max_{Q ∈ A(P_emp)} E_{Q_X} L(P_{Y|X}, Q_{Y|X}),   (2)

where A(P_emp) := {Q : Q ∈ P(X × Y) ∧ P_emp_X = Q_X ∧ ∥E_{P_emp}[ϕ(•, •)] - E_Q[ϕ(•, •)]∥ ≤ ε}. The ambiguity set is different from those in Wiesemann et al. (2014) and Farnia & Tse (2016) due to the inequality and the feature mapping, respectively. The minimization over the function space H is replaced by directly minimizing over P(Y) for each x ∈ X. The probabilistic predictions are chosen to minimize the worst-case risk evaluated over a set of distributions in an ambiguity set defined by the empirical distribution P_emp and the feature mapping ϕ. The ambiguity set A(P_emp) includes distributions that share the same marginal on X and are no more than ε away from P_emp in terms of feature moment divergence. Note that given any feature function ϕ, the ambiguity set is a compact convex set. Conceptually, we restrict the support of Q on X to be the same as that of the empirical distribution for convenience in both algorithm design and theoretical analysis.
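The ambiguity set can be made concrete with a small sketch. We use the multi-vector representation mentioned above and assume, purely for illustration, a uniform empirical marginal over the training points and the ℓ2 norm as the moment divergence (the framework allows other norms):

```python
import numpy as np

def phi(x, y, n_classes):
    # Multi-vector representation: the feature x is copied into the
    # block of coordinates belonging to class y, zeros elsewhere.
    d = len(x)
    v = np.zeros(d * n_classes)
    v[y * d:(y + 1) * d] = x
    return v

def feature_moment(X, cond, n_classes):
    # E_Q[phi(X, Y)] for a Q whose marginal Q_X is uniform over the rows
    # of X and whose conditionals Q_{Y|x} are the rows of cond.
    return np.mean([sum(cond[i][y] * phi(X[i], y, n_classes)
                        for y in range(n_classes)) for i in range(len(X))], axis=0)

def in_ambiguity_set(X, p_emp, q, n_classes, eps):
    # Q lies in A(P_emp) iff it shares the marginal on X (implicit here)
    # and its feature moment is within eps of the empirical one in norm.
    gap = feature_moment(X, p_emp, n_classes) - feature_moment(X, q, n_classes)
    return np.linalg.norm(gap) <= eps

X = np.array([[1.0, 0.0], [0.0, 2.0]])
p_emp = np.array([[1.0, 0.0], [0.0, 1.0]])   # observed hard labels
q = np.array([[0.9, 0.1], [0.1, 0.9]])       # a candidate adversary
assert in_ambiguity_set(X, p_emp, p_emp, 2, eps=0.0)
assert in_ambiguity_set(X, p_emp, q, 2, eps=0.5)
```

Only the conditional label distributions vary inside A(P_emp); the marginal on X is pinned to the empirical one, exactly as in the definition above.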
Minimizing the worst-case risk while allowing a certain amount of label uncertainty makes this method inherently robust. It can also be shown to be equivalent to a dual-norm regularized ERM problem: Proposition 1 ((Li et al., 2022)). The distributionally robust probabilistic supervised learning problem based on moment divergence in Eq. (2) can be rewritten as

min_θ E_{P_emp_X} [ min_P max_Q L(P_{Y|X}, Q_{Y|X}) + θ⊺(E_{Q_{Y|X}} ϕ(X, Y) - E_{P_emp_{Ỹ|X}} ϕ(X, Ỹ)) + ε∥θ∥_* ],   (3)

where the inner objective defines L_adv(θ, P_emp_{Ỹ|X}), θ ∈ R^d is the vector of Lagrange multipliers, and ∥•∥_* is the dual norm of ∥•∥. We give a proof sketch here. Both P(Y) and A(P_emp) are non-empty closed convex sets. Since we assume L is continuous and proper, we know that L(•, Q) is quasi-convex for every Q and L(P, •) is concave for every P by definition. Eq. (2) is therefore a quasi-convex-concave problem and strong duality holds (Sion, 1958). The regularization is obtained via the Lagrangian and the Fenchel conjugate. It is well known that continuous proper losses are quasi-convex, for example the Brier score, the logarithmic score, the spherical score, the Winkler score, the ranked probability score, etc. However, some improper (possibly discrete and non-convex) losses can also be quasi-convex in the predicted distribution (e.g., the zero-one loss). In contrast, surrogate classification losses are usually convex in a parameter space that is easy to work with, for example, the multiclass hinge loss (Weston & Watkins, 1998), ℓ_ww(ψ, y) = Σ_{y′ ≠ y} max{0, 1 + ψ_{y′} - ψ_y}, and the multiclass logistic loss (Nelder & Wedderburn, 1972), ℓ_log(ψ, y) = ln(Σ_{y′} exp(ψ_{y′})) - ψ_y, where ψ ∈ R^{|Y|} is a vector of class scores. From a game theoretic point of view, our formulation in Eq.
(2) is equivalent to a two-player zero-sum game in which the predictor player chooses a distribution to minimize the expected game payoff, while the adversary player chooses one to maximize the game value subject to certain statistical properties of the training data (Grünwald et al., 2004). In the dual problem (Eq. (3)), the Lagrange multipliers parameterize the payoff function of an augmented game and provide a new payoff function on unseen data for constructing predictors.

3.2. STATISTICAL PROPERTIES

It is well known that minimizing strictly proper losses leads to Fisher consistent probability estimation (Williamson et al., 2016). However, minimization of the surrogate risk in Eq. (3) may induce a sub-optimal classifier because of misalignment between the surrogate loss L_adv and the original loss L. Fisher consistency provides desirable statistical implications for a surrogate loss such that minimizing it yields an estimator that also minimizes the original loss. The adversarial surrogate loss L_adv is endowed with an additional regularization term and reduces to a Fenchel-Young loss (Blondel et al., 2020) when ε = 0. Corollary 2. When ε = 0, L_adv is Fisher consistent with respect to L. Namely, for any x, P^{θ*_true}_{Y|x} ∈ arg min_{P_{Y|x}} L(P_{Y|x}, P^true_{Y|x}) is the Bayes optimal probabilistic prediction made by θ*_true, the solution of Eq. (3) under P_true. The prediction made by θ is P^θ_{Y|X} ∈ arg min_P max_Q L(P_{Y|X}, Q_{Y|X}) + E_{Q_{Y|X}} θ⊺ϕ(X, Y). The consistency result guarantees that the learned probabilistic prediction rules yield the Bayes optimal risk, as ERM with proper losses does in the ideal setting with true distributions and all measurable functions. Also note that the conclusion holds for all quasi-convex losses. Basic generalization bounds on the true risk of DRO methods can be derived from measure concentration. That approach depends on the choice of ambiguity sets, may suffer from a dimensionality issue, and is not appropriate for the ambiguity sets defined by low-order moments in this paper. Thus, we take an alternative approach following Farnia & Tse (2016) to prove excess out-of-sample risk bounds. We assume ε > 0 to ensure boundedness of ∥θ∥_*. We establish the following theorem under mild boundedness assumptions on features and losses: Theorem 3 ((Li et al., 2022)).
Given n samples, a non-negative multiclass probabilistic loss L(•, •) such that |L(•, •)| ≤ K, a feature function ϕ(•, •) such that ∥ϕ(•, •)∥ ≤ B, and a positive ambiguity level ε > 0, then for any 0 < δ ≤ 1, with probability at least 1 - δ, the following excess true worst-case risk bound holds:

max_{Q∈A(P_true)} R^L_Q(θ*_emp) - max_{Q∈A(P_true)} R^L_Q(θ*_true) ≤ (4KB / (ε√n)) (1 + (3/2) √(ln(4/δ)/2)),   (4)

where θ*_emp and θ*_true are the optimal parameters learned in Eq. (3) under the empirical distribution P_emp and the true distribution P_true, respectively. The original risk of θ under Q is R^L_Q(θ) := E_{Q_X} L(P^θ_{Y|X}, Q_{Y|X}). Theorem 3 improves upon the results of Asif et al. (2015) and Fathony et al. (2016), which only show qualitative bounds. Under positive regularization, this bound gives the rate of uniform convergence of the true worst-case risk of the estimator θ*_emp, learned through the empirical distribution P_emp, to the true worst-case risk of the ideal estimator θ*_true learned under P_true. Although the empirical estimator is obtained from a finite set of samples, Theorem 3 clarifies the roles that the ambiguity set A(•), the feature function ϕ(•, •), the loss function L(•, •) and the ambiguity parameter ε play in upper bounding the excess out-of-sample worst-case risk. Intuitively, a larger ε rejects more hypotheses with large dual norms, which are more sensitive, whereas the worst-case risk scales with the range of the loss and feature functions.

3.3. ALGORITHM

Since L(•, •) is a continuous quasiconvex-concave function, a saddle point in Eq. (3) given θ must have a zero derivative with respect to P and Q:

Σ_y Q_{Y|x}(y) ∂ℓ(P_{Y|x}, y)/∂P_{Y|x}(y′) + Z_{P_{Y|x}} = 0,   (5)
ℓ(P_{Y|x}, y) + θ⊺ϕ(x, y) + Z_{Q_{Y|x}} = 0,   (6)

where Z_{P_{Y|x}} is the Lagrange multiplier for the simplex constraint Σ_y P_{Y|x}(y) = 1, and similarly for Z_{Q_{Y|x}}. Note that Z_{Q_{Y|x}} is constant over y given x. If ℓ is local, i.e., ℓ(P_{Y|x}, y) is independent of P_{Y|x}(y′) for y′ ≠ y, and if ℓ(•, y) is monotone in P_{Y|x}(y) > 0 (without simplex constraints) with range R, which is the case for the logarithmic loss, then Eq. (6) always has a solution, and the system of equations over all y together with the simplex constraint Σ_y P_{Y|x}(y) = 1 has a unique solution. Without assumptions on the boundedness of ℓ and θ⊺ϕ, Eq. (6) can be ill-posed. Given P*_{Y|x} from Eq. (6), the solution Q*_{Y|x} to Eq. (5) exists and is unique iff the bordered matrix

[ ∂ℓ(P_{Y|x}, 1)/∂P_{Y|x}(1)    ···  ∂ℓ(P_{Y|x}, |Y|)/∂P_{Y|x}(1)    1 ]
[ ⋮                                  ⋱    ⋮                               ⋮ ]
[ ∂ℓ(P_{Y|x}, 1)/∂P_{Y|x}(|Y|)  ···  ∂ℓ(P_{Y|x}, |Y|)/∂P_{Y|x}(|Y|)  1 ]
[ 1                                  ···  1                               0 ]

is nonsingular. By assuming locality and positivity, there exists a unique solution Q*_{Y|x}. One benefit of the proposed method is that users only need to focus on solving Eq. (6) and Eq. (5) for proper losses, whereas Williamson et al. (2016) additionally require a canonical link function for convexity. Next we show how the system of equations can always be solved for specific losses. We consider an additive combination of the multiclass Brier loss and the logarithmic loss, both of which are continuous strictly proper losses. As indicated by Figure 1, these losses differ primarily in how they penalize the prediction probability of the ground-truth label as it goes to zero or one. The Brier loss exhibits quadratic growth, while the logarithmic loss has a vertical asymptote for labels the predictor considers increasingly unlikely to the point of impossibility.
They impose different penalties for underestimation and overestimation of the desired prediction. A trade-off between the log loss and the Brier loss thus provides flexibility to control the cost of misalignment between the prediction and the observation. See the appendix for a discussion on including the ranked probability score and other specific losses. We employ this kind of loss in our DRO method and present an efficient algorithm that can be implemented in practice. With only a slight loss of generality, and for computational reasons, we assume a fixed positive weight on the log loss. To begin with, the mixture loss is ℓ_mix(P_{Y|x}, y) = -ln P_{Y|x}(y) + β(1 - 2P_{Y|x}(y) + Σ_{y′} P²_{Y|x}(y′)), with derivative ∂ℓ_mix(P_{Y|x}, y)/∂P_{Y|x}(y) = -1/P_{Y|x}(y) - 2β + 2βP_{Y|x}(y). The scalar β weights the contribution of the Brier loss to this additive combination, controlling the sensitivity of the predictor to underestimation. The adversarial surrogate of this mixture loss is Fisher consistent as a direct corollary. Methods that simply mix the predictions of classifiers designed separately for logarithmic loss minimization and Brier loss minimization may be appealing for their simplicity, but are demonstrably sub-optimal. For example, with the logistic loss, logistic regression provides a natural parametric form for the predictor that equates loss minimization with data likelihood maximization. Although the Brier loss is not local, the additional sum of quadratic terms Σ_{y′} P²_{Y|x}(y′) is constant across all y. Therefore Eq. (6) has a closed-form expression in terms of the Lambert W function. Furthermore, the sum over y of all Q_{Y|x}(y) cancels out, leaving terms dependent only on the same y. Thus Eq. (5) simplifies into an expression of Q in terms of P, and normalizing Q solves for Z_P, yielding the following proposition: Proposition 4.
The DRO method for the probabilistic loss combining the logarithmic loss and β times the Brier loss has a solution P*_{Y|X} for the predictor parameterized by θ, defined by the following system of equations: for all x ∈ X there exists C ∈ R such that for all y ∈ Y,

P*_{Y|x}(y) = exp(C + θ⊺ϕ(x, y) - W_0(2β e^{C + θ⊺ϕ(x, y)})),   (7)

where C is a constant depending on θ and x but independent of y, and W_0(•) is the principal branch of the Lambert W function. The corresponding adversary Q*_{Y|X} is defined as

Q*_{Y|x}(y) = (2βP*²_{Y|x}(y) + Z_{P_{Y|x}} P*_{Y|x}(y)) / (1 + 2βP*_{Y|x}(y)),

where Z_{P_{Y|x}} is fixed by the normalization Σ_y Q*_{Y|x}(y) = 1, i.e., Z_{P_{Y|x}} = (1 - Σ_y 2βP*²_{Y|x}(y)/(1 + 2βP*_{Y|x}(y))) / (Σ_y P*_{Y|x}(y)/(1 + 2βP*_{Y|x}(y))). The sub-gradient ∂L_adv/∂θ ≜ E_{P_emp_X}(E_{Q*_{Y|X}}[ϕ(X, Y)] - E_{P_emp_{Y|X}}[ϕ(X, Y)]) + ∂ε∥θ∥_*/∂θ can then be leveraged to optimize θ. The above steps are summarized in Algorithm 1.
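A minimal sketch of the Proposition 4 solution follows; Algorithm 1's exact procedure may differ. Here the Lambert W function is computed by Newton's method to stay dependency-free (`scipy.special.lambertw` would also work), and the normalizing constant C is found by bisection, exploiting the fact that P* is monotonically increasing in C:

```python
import numpy as np

def lambert_w0(z, iters=50):
    # Principal branch of the Lambert W function for z >= 0, solved by
    # Newton's method on w * exp(w) = z (log1p gives a safe upper start).
    w = np.log1p(z)
    for _ in range(iters):
        ew = np.exp(w)
        w -= (w * ew - z) / (ew * (w + 1.0))
    return w

def predictor(scores, beta, tol=1e-12):
    # P*(y) = exp(C + s_y - W0(2*beta*exp(C + s_y))), with C chosen by
    # bisection so that P* sums to one (P* increases monotonically in C).
    def p_of(C):
        z = C + scores
        return np.exp(z - lambert_w0(2.0 * beta * np.exp(z)))
    lo, hi = -30.0, 30.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if p_of(mid).sum() < 1.0 else (lo, mid)
    return p_of(0.5 * (lo + hi))

def adversary(p, beta):
    # Q*(y) = (2*beta*P*(y)^2 + Z*P*(y)) / (1 + 2*beta*P*(y)), with the
    # multiplier Z fixed by the normalization sum_y Q*(y) = 1.
    denom = 1.0 + 2.0 * beta * p
    Z = (1.0 - np.sum(2.0 * beta * p ** 2 / denom)) / np.sum(p / denom)
    return (2.0 * beta * p ** 2 + Z * p) / denom

scores = np.array([1.0, 0.0, -1.0])   # theta^T phi(x, y) for each class
p = predictor(scores, beta=1.0)
q = adversary(p, beta=1.0)
assert np.isclose(p.sum(), 1.0) and np.isclose(q.sum(), 1.0)
assert p[0] > p[1] > p[2]             # higher score, higher probability
```

One can verify the fixed-point condition directly: ln P*(y) + 2βP*(y) - θ⊺ϕ(x, y) is the same constant for all y, which is exactly Eq. (6) for the mixture loss up to the terms absorbed into C.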

3.4. DIFFERENTIABLE LEARNING

By taking advantage of deep neural networks, our method can jointly optimize the data representation and the Lagrange multipliers: min_{θ,ϕ} E_{P_emp_X} L_adv(θ, P_emp_{Ỹ|X}), enjoying the benefits of end-to-end representation learning without manually searching for a good feature mapping ϕ. More off-the-shelf mini-batch training tools can be leveraged as well. We show how to use our DRO method as a loss layer in neural network training. A network for supervised learning typically ends with a linear classification layer without activation. Assume the penultimate layer outputs ϕ(x); the last layer then outputs a |Y|-dimensional vector ψ(x) = [(θ^(1))⊺ϕ(x), . . . , (θ^(|Y|))⊺ϕ(x)]. This is essentially equivalent to adopting a multi-vector representation to construct ϕ. Specifically, given x ∈ R^d and y ∈ [|Y|], the resulting feature vector v = ϕ(x, y) ∈ R^{d|Y|} satisfies v_{(y-1)d+i} = x_i for i ∈ [d] and v_j = 0 otherwise. Therefore taking ψ(x) as the input is sufficient for computing P*_{Y|x} and Q*_{Y|x}. In this way, our method is a loss layer without learnable parameters, which backpropagates the sub-derivative of the loss with respect to ψ(x) to the linear classification layer: E_{P_emp_X}(q_{Y|X} - p_emp_{Y|X}) ∈ ∂L_adv/∂ψ(x). Recall that q and p_emp are the probability vectors of Q and P_emp. The sub-gradient with respect to θ is added at the classification layer.
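A pure-NumPy stand-in for such a loss layer is sketched below. The name `adv_loss_layer` and the ε = 0, β = 1 defaults are our illustrative choices; in practice p* and q* come from the Proposition 4 solver, and the returned gradient is what gets handed to the autograd engine at ψ(x):

```python
import numpy as np

def adv_loss_layer(psi, p_star, q_star, p_emp, beta=1.0):
    # Forward pass: value of the augmented game at the saddle point
    # (eps = 0), L_mix(P*, Q*) + psi . (q* - p_emp), using the
    # beta-mixture of log and Brier losses from Section 3.3.
    mix = -np.dot(q_star, np.log(p_star)) \
          + beta * (1.0 - 2.0 * np.dot(q_star, p_star) + np.sum(p_star ** 2))
    loss = mix + np.dot(q_star - p_emp, psi)
    # Backward pass: the sub-gradient w.r.t. the class scores psi(x) is
    # simply q* - p_emp, backpropagated to the linear classification layer.
    grad = q_star - p_emp
    return loss, grad

# With a uniform saddle point and matching labels the gradient vanishes.
u = np.full(2, 0.5)
loss, grad = adv_loss_layer(np.zeros(2), u, u, u)
assert np.allclose(grad, 0.0) and np.isclose(loss, np.log(2) + 0.5)
```

Wrapping this pair in a `torch.autograd.Function` (forward returning `loss`, backward returning `grad`) is the natural way to plug it into the PyTorch training loop used in the experiments.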

4. EXPERIMENTS

In the experiments, we consider as the performance measure the L-risk R^L_P(h), also called the expected generalization loss. The mixture loss ℓ_mix of the log loss and the Brier loss is adopted. The normalized generalization loss (1/(1+β)) R^L_{P_test}(h) is estimated on the test set distribution P_test_{X,Y}. We compare our adversarial learning approach against neural network models with the softmax and the spherical softmax functions as the final normalization layer (Laha et al., 2018). All the baseline methods are able to make use of probabilistic labels in both training and testing. We adopt a three-layer neural network for all the methods, which share the same number of parameters. To make the comparison fairer, we set ε = 0 so that the final classification layer is unregularized. The baselines compute the target loss ℓ_mix based on the probability outputs obtained by applying their normalization to the logits. We implement all the methods using PyTorch (Paszke et al., 2019). We use Adam (Kingma & Ba, 2014) for optimization. The number of hidden units is set to 50. The number of training steps is set to 500 with a batch size of 64. We set β = 1. Default values are used for unmentioned hyperparameters. We conduct experiments on several real-world datasets, including corel5k (Duygulu et al., 2002), flags (Gonçalves et al., 2013), Stackex chess (Charte et al., 2015), GpositivePseAAC and GnegativePseAAC (Xu et al., 2016), with statistics reported in Table 1. The ground truth labels in these datasets are either originally probabilistic or converted to a uniform distribution for multi-label classification datasets. At the beginning of each run, we randomly choose 80% of the dataset as the training set and the remaining 20% for evaluation. We further take 20% of the training set as the validation set to determine the best parameters for final testing. We repeat the above process 10 times for each dataset on a laptop with a 2.7 GHz Quad-Core Intel Core i7 CPU.
All the methods take less than 1 minute per run in wall-clock time. The results in Table 1 show that on most of the adopted datasets our proposed method either performs best or achieves performance similar to the best method with no statistically significant difference. For sensitivity analysis, we fix a random split of the Stackex chess dataset and vary β with the other settings unchanged. The experiments are repeated 10 times. As shown in Figure 2, the expected loss of our method on the test set is slightly better than that of the baselines. For better illustration, we cut [0.1, 0.5] off the x-axis because the softmax baseline and our method are indistinguishable without scaling. Additionally, we study the robustness of our approach by introducing noise into the training set of the Stackex chess dataset, again repeated 10 times. To this end, for each instance x, with probability p_noise, we replace the ground truth by a random distribution from P(Y). We vary p_noise from 0 to 0.8. As seen in Figure 2, our method is slightly better when p_noise < 0.4. All the methods become vulnerable under large p_noise, possibly because of the backbone neural network model.

5. DISCUSSION AND CONCLUSION

We proposed a moment-based distributionally robust learning framework for probabilistic supervised learning under mild assumptions, showed its equivalence to dual-norm regularization of a surrogate loss, presented its out-of-sample guarantees, developed efficient algorithms for typical continuous proper losses, incorporated the proposed method into differentiable learning, and conducted experiments on several real-world datasets. We aim to shed light on this more general supervised learning setting (Gressmann et al., 2018) and provide a more expressive way of quantifying prediction uncertainty. A drawback of the proposed method is that solving the saddle-point problem can be difficult for some complicated losses, whereas neural networks equipped with a softmax layer make use of automatic differentiation to avoid this issue. Interesting directions for future investigation include generalizing the learning framework to conditional density estimation and considering ambiguity sets defined by higher-order moments.

A TECHNICAL PROOFS

Proposition 1. The distributionally robust probabilistic supervised learning problem based on moment divergence in Eq. (2) can be rewritten as

min_θ E_{P_emp_X} [ min_P max_Q L(P_{Y|X}, Q_{Y|X}) + θ⊺(E_{Q_{Y|X}} ϕ(X, Y) - E_{P_emp_{Ỹ|X}} ϕ(X, Ỹ)) + ε∥θ∥_* ] =: min_θ E_{P_emp_X} L_adv(θ, P_emp_{Ỹ|X}),

where θ ∈ R^d is the vector of Lagrange multipliers and ∥•∥_* is the dual norm of ∥•∥.

Proof. Recall the primal problem min_{P_{Y|X}∈P(Y)} max_{Q∈A(P_emp)} E_{Q_X} L(P_{Y|X}, Q_{Y|X}), where A(P_emp) := {Q : Q ∈ P(X × Y) ∧ P_emp_X = Q_X ∧ ∥E_{P_emp}[ϕ(•, •)] - E_Q[ϕ(•, •)]∥ ≤ ε}. Note the feature function ϕ(•, •) is fixed and given. The constraint sets P(Y) and A(P_emp) are convex. The objective L(P, Q) is quasi-convex in P by (Williamson et al., 2016) and concave in Q because it is affine in Q. Therefore strong duality holds by Sion's minimax theorem (Sion, 1958):

max_{Q∈A(P_emp)} min_{P_{Y|X}∈P(Y)} E_{Q_X} L(P_{Y|X}, Q_{Y|X}).

Let C := {u : ∥u - E_{P_emp} ϕ(•, •)∥ ≤ ε} and rewrite the problem with this constraint:

sup_{Q,u} min_P E_{P_emp_X} L(P_{Y|X}, Q_{Y|X}) - I_C(u)   s.t.   u = E_{P_emp_X} E_{Q_{Y|X}} ϕ(X, Y).

The dual problem obtained by relaxing the equality constraint is

sup_{Q,u} min_θ min_P E_{P_emp_X} L(P_{Y|X}, Q_{Y|X}) - I_C(u) + θ⊺ E_{P_emp_X} E_{Q_{Y|X}} ϕ(X, Y) - θ⊺u,

where θ is the vector of Lagrange multipliers. Given X = x, the optimizations over Q_{Y|x} and P_{Y|x} can be done independently. Again by strong duality, we can rearrange the terms:

min_θ E_{P_emp_X} min_P max_Q L(P_{Y|X}, Q_{Y|X}) + θ⊺ E_{Q_{Y|X}} ϕ(X, Y) + sup_u (-I_C(u) - θ⊺u).

The dual norm ∥•∥_* of the norm ∥•∥ is defined as ∥z∥_* := sup{z⊺x : ∥x∥ ≤ 1}, with which the optimization over u simplifies:

sup_u -I_C(u) - θ⊺u = sup_{u∈C} -θ⊺u = sup_{e:∥e∥≤1} -θ⊺(E_{P_emp} ϕ(•, •) - εe) = -θ⊺ E_{P_emp} ϕ(•, •) + ε∥θ∥_*.

Plugging this back into the dual problem, we obtain

min_θ E_{P_emp_X} min_P max_Q L(P_{Y|X}, Q_{Y|X}) + θ⊺(E_{Q_{Y|X}} ϕ(X, Y) - E_{P_emp_{Ỹ|X}} ϕ(X, Ỹ)) + ε∥θ∥_*.

Corollary 2. When ε = 0, L_adv is Fisher consistent with respect to L.
Namely, for any $x$, $P^{\theta^*_{\mathrm{true}}}_{Y|x} \in \arg\min_{P_{Y|x}} L(P_{Y|x}, P^{\mathrm{true}}_{Y|x})$ is the Bayes optimal probabilistic prediction made by $\theta^*_{\mathrm{true}}$, the solution of Eq. (3) under $P^{\mathrm{true}}$.

Proof. Our setting differs from Nowak et al. (2020) in that we use a distribution as the ground truth. Taking $y^*(\mu)$ to be the gold-standard probabilistic prediction and $\mathcal{Y}$ to be the set of all possible probabilistic predictions in Proposition C.2 of Nowak et al. (2020), we have $P^{\theta^*_{\mathrm{true}}}_{\hat Y|x} \in \mathrm{Conv}\big(\arg\min_{P_{\hat Y|x}} L(P_{Y|x}, P^{\mathrm{true}}_{Y|x})\big)$. Because $L$ is assumed continuous and proper, any convex combination of minimizers is also a minimizer. Therefore, $P^{\theta^*_{\mathrm{true}}}_{\hat Y|x} \in \arg\min_{P_{\hat Y|x}} L(P_{Y|x}, P^{\mathrm{true}}_{Y|x})$. $\square$

Theorem 3. Given $n$ samples, a non-negative multiclass probabilistic loss $L(\cdot,\cdot)$ such that $|L(\cdot,\cdot)| \le K$, a feature function $\phi(\cdot,\cdot)$ such that $\|\phi(\cdot,\cdot)\| \le B$, and a positive ambiguity level $\varepsilon > 0$, then for any $0 < \delta \le 1$, with probability at least $1-\delta$, the following excess true worst-case risk bound holds:
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{emp}}) - \max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}}) \le \frac{4KB}{\varepsilon\sqrt{n}}\left(1 + \frac{3}{2}\sqrt{\frac{\ln(4/\delta)}{2}}\right),$$
where $\theta^*_{\mathrm{emp}}$ and $\theta^*_{\mathrm{true}}$ are the optimal parameters learned in Eq. (3) under the empirical distribution $P^{\mathrm{emp}}$ and the true distribution $P^{\mathrm{true}}$, respectively. The original risk of $\theta$ under $Q$ is $R^L_Q(\theta) := \mathbb{E}_{Q_X} L(P^\theta_{Y|X}, Q_{Y|X})$ with prediction $P^\theta_{Y|X} \in \arg\min_P\max_Q L(P_{Y|X}, Q_{Y|X}) + \mathbb{E}_{Q_{Y|X}}\theta^\top\phi(X,Y)$.

Proof. Define the adversarial surrogate risk of $\theta$ with respect to $P$ as
$$R^S_P(\theta) := \mathbb{E}_{P_X}\,\min_P\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P_{\tilde Y|X}}\phi(X,\tilde Y)\big) + \varepsilon\|\theta\|_*.$$
Let $\theta^*_{\mathrm{true}} \in \arg\min_\theta R^S_{P^{\mathrm{true}}}(\theta)$ and $\theta^*_{\mathrm{emp}} \in \arg\min_\theta R^S_{P^{\mathrm{emp}}}(\theta)$ be the optimal parameters learned with $P^{\mathrm{true}}_{X,Y}$ and $P^{\mathrm{emp}}_{X,Y}$, respectively. Given $x$, define the decoded prediction by $\theta$ as $P^\theta_{Y|x} \in \arg\min_P\max_Q L(P_{Y|X}, Q_{Y|X}) + \theta^\top\mathbb{E}_{Q_{Y|X}}\phi(X,Y)$. Let the original risk of loss $L$ under some distribution $Q$ be $R^L_Q(\theta) := \mathbb{E}_{Q_X} L(P^\theta_{Y|X}, Q_{Y|X})$.
According to Proposition 1, for any fixed decoded prediction, we similarly have
$$\max_{Q\in\mathcal{A}(P^{\mathrm{emp}})} \mathbb{E}_{Q_X} L\big(P^\theta_{Y|X}, Q_{Y|X}\big) = \min_\theta\, \mathbb{E}_{P^{\mathrm{emp}}_X}\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{emp}}_{\tilde Y|X}}\phi(X,\tilde Y)\big) + \varepsilon\|\theta\|_*.$$
We start by looking at the worst-case risks of $\theta^*_{\mathrm{true}}$ and $\theta^*_{\mathrm{emp}}$:
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{emp}}) = \min_\theta\, \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{emp}}}_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta\|_*$$
$$\le \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{emp}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{emp}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{emp}}\|_*,$$
where the inequality holds because $\theta^*_{\mathrm{emp}}$ is not necessarily the minimizer. Similarly for $\theta^*_{\mathrm{true}}$,
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}}) \le \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{true}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{true}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{true}}\|_*.$$
On the other hand,
$$\mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{true}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{true}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{true}}\|_*$$
$$= \min_\theta\, \mathbb{E}_{P^{\mathrm{true}}_X}\min_P\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta\|_*$$
$$= \min_P\min_\theta\, \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta\|_*$$
$$\le \min_\theta\, \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{true}}}_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta\|_* = \max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}}),$$
where the first equality holds by the definition of $\theta^*_{\mathrm{true}}$. The above two inequalities imply the equality
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}}) = \mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{true}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{true}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{true}}\|_*.$$
Therefore,
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{emp}}) - \max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}})$$
$$\le \Big(\mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{emp}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{emp}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{emp}}\|_*\Big)$$
$$- \Big(\mathbb{E}_{P^{\mathrm{true}}_X}\max_Q\, L\big(P^{\theta^*_{\mathrm{true}}}_{Y|X}, Q_{Y|X}\big) + \theta^{*\top}_{\mathrm{true}}\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) + \varepsilon\|\theta^*_{\mathrm{true}}\|_*\Big).$$
The main idea is thus to use a uniform convergence bound.
First, by substituting $Q = P^{\mathrm{true}}$, note that
$$\min_P\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P^{\mathrm{true}}_{Y|X}}\phi(X,Y)\big) \ge \min_P\, L\big(P_{Y|X}, P^{\mathrm{true}}_{Y|X}\big) \ge 0.$$
We can then upper bound the norm of any optimal solution $\theta^*_{\mathrm{true}}$ or $\theta^*_{\mathrm{emp}}$ as follows:
$$0 + \varepsilon\|\theta^*_{\mathrm{true}}\|_* \le R^S_{P^{\mathrm{true}}}(\theta^*_{\mathrm{true}}) \le R^S_{P^{\mathrm{true}}}(0) \le \mathbb{E}_{P^{\mathrm{true}}_X}\, L\big(P_{Y|X}, Q_{Y|X}\big) \le K \implies \|\theta^*_{\mathrm{true}}\|_* \le \frac{K}{\varepsilon}.$$
Let $\psi(X,Y) := \theta^\top\phi(X,Y)$ and $\psi_x := (\psi(x,Y))_{Y\in\mathcal{Y}}$. Define
$$f(\theta, P) := \mathbb{E}_{P_X}\,\min_P\max_Q\, L\big(P_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P_{\tilde Y|X}}\phi(X,\tilde Y)\big)$$
$$= \mathbb{E}_{P_X}\,\max_Q\, L\big(P^\theta_{Y|X}, Q_{Y|X}\big) + \theta^\top\big(\mathbb{E}_{Q_{Y|X}}\phi(X,Y) - \mathbb{E}_{P_{\tilde Y|X}}\phi(X,\tilde Y)\big)$$
$$= \mathbb{E}_{P_X}\,\max_Q\, L\big(P^\theta_{Y|X}, Q_{Y|X}\big) + \big(\mathbb{E}_{Q_{Y|X}}\psi(X,Y) - \mathbb{E}_{P_{\tilde Y|X}}\psi(X,\tilde Y)\big) =: g(\psi, P).$$
Let $q_x \in \Delta$ be the probability vector of $Q_{Y|x}$ and $e_Y$ be the standard basis vector with $Y$-th entry equal to 1. We have that for any $(x, Y)$,
$$\frac{\partial}{\partial\psi_x}\, g\big(\psi, \delta_{(x,Y)}\big) \subseteq \mathrm{Conv}\big(\{q_x - e_Y : q_x\in\Delta\}\big) \implies \Big\|\frac{\partial}{\partial\psi_x}\, g\big(\psi, \delta_{(x,Y)}\big)\Big\|_1 \le \max_{q_x\in\Delta}\|q_x - e_Y\|_1 \le 2,$$
where $\delta_{(x,Y)}$ is the Dirac point measure. Hence $g(\psi, P)$ is 2-Lipschitz with respect to the $\ell_1$ norm. By assumption, $\|\phi(\cdot,\cdot)\| \le B$. This further implies
$$f\big(\theta_1, \delta_{(x_1,Y_1)}\big) - f\big(\theta_2, \delta_{(x_2,Y_2)}\big) \le \frac{4KB}{\varepsilon}\qquad \forall\, \theta_1, \theta_2, x_1, x_2, Y_1, Y_2 \text{ s.t. } \|\theta_i\|_* \le \frac{K}{\varepsilon},\ i = 1, 2.$$
We then follow the proof of Theorem 3 in Farnia & Tse (2016). According to Theorem 26.12 in Shalev-Shwartz & Ben-David (2014), by uniform convergence, for any $\delta \in (0, 2]$, with probability at least $1 - \frac{\delta}{2}$,
$$f(\theta^*_{\mathrm{emp}}, P^{\mathrm{true}}) - f(\theta^*_{\mathrm{emp}}, P^{\mathrm{emp}}) \le \frac{4KB}{\varepsilon\sqrt{n}}\left(1 + \sqrt{\frac{\ln(4/\delta)}{2}}\right).$$
According to the definition of $\theta^*_{\mathrm{emp}}$ as the minimizer under $P^{\mathrm{emp}}$, the following inequality holds:
$$f(\theta^*_{\mathrm{emp}}, P^{\mathrm{emp}}) + \varepsilon\|\theta^*_{\mathrm{emp}}\|_* - f(\theta^*_{\mathrm{true}}, P^{\mathrm{emp}}) - \varepsilon\|\theta^*_{\mathrm{true}}\|_* \le 0.$$
Since $\theta^*_{\mathrm{true}}$ does not depend on the samples, by Hoeffding's inequality, with probability at least $1 - \delta/2$,
$$f(\theta^*_{\mathrm{true}}, P^{\mathrm{emp}}) - f(\theta^*_{\mathrm{true}}, P^{\mathrm{true}}) \le \frac{2KB}{\varepsilon\sqrt{n}}\sqrt{\frac{\ln(4/\delta)}{2}}.$$
Applying the union bound to the above three inequalities, with probability at least $1-\delta$, we have
$$f(\theta^*_{\mathrm{emp}}, P^{\mathrm{true}}) + \varepsilon\|\theta^*_{\mathrm{emp}}\|_* - f(\theta^*_{\mathrm{true}}, P^{\mathrm{true}}) - \varepsilon\|\theta^*_{\mathrm{true}}\|_* \le \frac{4KB}{\varepsilon\sqrt{n}}\left(1 + \frac{3}{2}\sqrt{\frac{\ln(4/\delta)}{2}}\right).$$
Combined with Inequality (10), we conclude with the following excess risk bound:
$$\max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{emp}}) - \max_{Q\in\mathcal{A}(P^{\mathrm{true}})} R^L_Q(\theta^*_{\mathrm{true}}) \le \frac{4KB}{\varepsilon\sqrt{n}}\left(1 + \frac{3}{2}\sqrt{\frac{\ln(4/\delta)}{2}}\right). \qquad\square$$

Proposition 4. The DRO method for a probabilistic loss based on the logarithmic loss plus $\beta$ times the Brier loss has a solution $P^*_{Y|X}$ for the predictor parameterized by $\theta$ defined by the following system of equations: $\forall x\in\mathcal{X}$, $\exists C\in\mathbb{R}$, $\forall y\in\mathcal{Y}$,
$$P^*_{Y|x}(y) = \exp\Big(C + \theta^\top\phi(x, y) - W_0\big(2\beta e^{C + \theta^\top\phi(x,y)}\big)\Big),$$
where $C$ is a constant depending on $\theta$ and $x$ but independent of $y$, and $W_0(\cdot)$ is the principal branch of the Lambert $W$ function. The corresponding adversary $Q^*_{Y|X}$ is given by
$$Q^*_{Y|x}(y) = \frac{2\beta P^{*2}_{Y|x}(y) + Z_{P_{Y|x}}\, P^*_{Y|x}(y)}{1 + 2\beta P^*_{Y|x}(y)}.$$
Since $2\beta e^{\theta\cdot\phi(x,y) + C^*_{\theta,x}} \ge 0$, the principal branch $W_0$ of the Lambert $W$ function is always applicable. Also, by the identity $e^{-W(x)} = W(x)/x$, we have
$$P_Y(y) = \exp\Big(C^*_{\theta,x} + \theta^\top\phi(x, y) - W_0\big(2\beta e^{C^*_{\theta,x} + \theta^\top\phi(x,y)}\big)\Big)\qquad \forall y.$$
Let $P^*_Y$ (for a given $\theta$) be a solution to this set of equations that also satisfies $\sum_y P^*_Y(y) = 1$. By Eq. (5), the optimality conditions can be stacked into a linear system in $\big(Q_Y(1), Q_Y(2), \ldots, Q_Y(|\mathcal{Y}|), Z_{P_Y}\big)^\top$ with right-hand side $\big(C_1, C_2, \ldots, C_{|\mathcal{Y}|}, 1\big)^\top$. This is not an unreduced Hessenberg matrix. However, notice that as $Z_{P_Y}$ increases, $Q_Y(|\mathcal{Y}|)$ also increases by the penultimate equation. This in turn increases $Q_Y(|\mathcal{Y}|-1)$ according to the third-from-last equation. Therefore, the solution $Q^*_Y$ without the simplex constraint increases monotonically as $Z_{P_Y}$ increases. We can use the bisection method again to find the $Q^*_Y$ that also satisfies the simplex constraint.
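A minimal NumPy sketch of the Proposition 4 predictor under these formulas might look as follows; `lambert_w0`, `drp_prediction`, and the score values are illustrative names and inputs (not from the paper), and the normalization constant $C$ is found by bisection as described above:

```python
import numpy as np

def lambert_w0(x, iters=60):
    """Principal branch of the Lambert W function for x >= 0, via Newton's method."""
    w = np.log1p(x)  # reasonable initial guess on [0, inf)
    for _ in range(iters):
        ew = np.exp(w)
        w = w - (w * ew - x) / (ew * (1.0 + w))
    return w

def drp_prediction(scores, beta, tol=1e-10):
    """Solve 2*beta*P*(y) = W0(2*beta*e^{C + s_y}) with sum_y P*(y) = 1 by
    bisection over the constant C; `scores` holds s_y = theta^T phi(x, y)."""
    def total_mass(C):
        return np.sum(lambert_w0(2.0 * beta * np.exp(C + scores))) / (2.0 * beta)

    lo, hi = -40.0, 40.0  # total_mass is increasing in C, so the root is bracketed
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    C = 0.5 * (lo + hi)
    p = lambert_w0(2.0 * beta * np.exp(C + scores)) / (2.0 * beta)
    return p / p.sum()  # tiny renormalization for numerical safety

scores = np.array([1.0, 0.0, -1.0])  # hypothetical theta^T phi(x, y) values
p_star = drp_prediction(scores, beta=0.5)
# Stationarity: ln P*(y) + 2*beta*P*(y) - s_y = C must not depend on y
resid = np.log(p_star) + 2 * 0.5 * p_star - scores
```

Given $P^*$ and $Z_{P_{Y|x}}$, the adversary $Q^*$ then follows from the closed form above.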



In the literature, the zero-one loss is sometimes defined as $\ell_{01}(P_Y, y) := I\big(y \notin \arg\max_{y'} P_Y(y')\big)$, which is proper but discontinuous and not strictly proper.
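To see why this zero-one loss is proper but not strictly proper: its expected value under $Q$ is $1 - Q(\arg\max_{y'} P_Y(y'))$, so any prediction sharing the mode of $Q$ is a minimizer, not only $Q$ itself. A small illustration with made-up distributions:

```python
import numpy as np

def zero_one(p, y):
    # l01(P_Y, y) = I(y not in argmax_{y'} P_Y(y')); assumes a unique argmax
    return float(np.argmax(p) != y)

def expected_loss(p, q):
    # E_{y ~ Q} l01(p, y) = 1 - Q(argmax p)
    return sum(q[y] * zero_one(p, y) for y in range(len(q)))

q = np.array([0.5, 0.3, 0.2])                        # true label distribution
truthful = expected_loss(q, q)                       # reporting Q itself
blunt = expected_loss(np.array([1.0, 0.0, 0.0]), q)  # any P with the same mode

# Both attain the minimum 1 - max_y Q(y): proper, but not uniquely minimized
assert abs(truthful - 0.5) < 1e-12 and abs(blunt - 0.5) < 1e-12
```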



Figure 1: The expected value of four loss functions for three classes with $Q_Y(1) = 0.6$ and $Q_Y(2) = Q_Y(3) = 0.2$; $P_Y(2) = P_Y(3)$ as $P_Y(1)$ varies. Each loss is normalized to cross $(1, 0)$ and $(0.5, 0.5)$ according to the binary case with a hard label. Best viewed in color.

When the ambiguity radius $\varepsilon$ is zero, a conclusion of consistency can be drawn based on Nowak et al. (2020); Blondel et al. (2020) and our assumption that the ground truth is probabilistic: Corollary 2 ((Li et al., 2022)). When $\varepsilon = 0$, $L^{\mathrm{adv}}$ is Fisher consistent with respect to $L$; namely, for any $x$, $P^{\theta^*_{\mathrm{true}}}_{Y|x} \in \arg\min_{P_{Y|x}} L(P_{Y|x}, P^{\mathrm{true}}_{Y|x})$.

Figure 2: Normalized generalization losses with different coefficients or noise levels. Left: varying $\beta$ in $[0.1, 10.0]$. Right: varying the probability of contamination in $[0, 0.8]$. The x-axis of the left subfigure is in logarithmic scale. Best viewed in color.

Here $I_C(\cdot)$ is the indicator function with $I_C(x) = 0$ if $x \in C$ and $+\infty$ otherwise. The simplex constraints of $P$ and $Q$ are omitted.

Here
$$Z_{P_{Y|x}} = \frac{1 - \sum_y 2\beta P^{*2}_{Y|x}(y)\big/\big(1 + 2\beta P^*_{Y|x}(y)\big)}{\sum_y P^*_{Y|x}(y)\big/\big(1 + 2\beta P^*_{Y|x}(y)\big)}.$$

Proof. Recall the saddle-point optimality conditions:
$$\sum_y Q_Y(y)\,\frac{\partial\ell(P_Y, y)}{\partial P_Y(y')} + Z_{P_Y} = 0,\qquad \ell(P_Y, y) + \theta^\top\phi(x, y) + Z_{Q_Y} = 0.$$
Dependence on $x$ is omitted when the context is clear. Substituting $\ell_{\mathrm{mix}}$ (here the log loss plus $\beta$ times the Brier loss) yields
$$Q_Y(y)\Big({-\frac{1}{P_Y(y)}} - 2\beta\Big) + 2\beta P_Y(y) + Z_{P_Y} = 0,$$
$$-\ln P_Y(y) + \beta\Big(1 - 2P_Y(y) + \sum_{y'} P^2_Y(y')\Big) + \theta^\top\phi(x, y) + Z_{Q_Y} = 0.$$
Note that $C := \beta + \beta\sum_{y'} P^2_Y(y') + Z_{Q_Y}$ is constant across all $y$ given $\theta, x$. Thus for fixed $\theta, x$, we have for some $C^*_{\theta,x}$,
$$C^*_{\theta,x} + \theta\cdot\phi(x, y) = \ln P_Y(y) + 2\beta P_Y(y)\qquad \forall y\in\mathcal{Y},$$
which is equivalent to
$$2\beta P_Y(y)\, e^{2\beta P_Y(y)} = 2\beta\, e^{\theta\cdot\phi(x,y) + C^*_{\theta,x}}.$$
By the definition of the Lambert $W$ function, $2\beta P_Y(y) = W\big(2\beta e^{\theta\cdot\phi(x,y) + C^*_{\theta,x}}\big)$.

$P^*_Y$ and $Q^*_Y(y)$ are positive because $P^*_Y \in \mathcal{P}(\mathcal{Y})$ is a solution. $\square$

B MORE LOSSES

The discrete ranked probability score assumes an ordering relationship in $\mathcal{Y}$, i.e., $\mathcal{Y} := \{1, 2, \ldots, |\mathcal{Y}|\}$. The score can be written as
$$\ell_{\mathrm{rp}}(P_Y, y) = \sum_{i=1}^{|\mathcal{Y}|}\Big[\sum_{j=1}^{i} P_Y(j) - I(i \ge y)\Big]^2.$$
The mixture loss of the log loss, Brier loss, and ranked probability loss can be written as
$$\ell_{\mathrm{mix}}(P_Y, y) = -\ln P_Y(y) + \beta\Big(1 - 2P_Y(y) + \sum_{y'} P^2_Y(y')\Big) + \alpha\,\ell_{\mathrm{rp}}(P_Y, y).$$
Substituting this loss into the saddle-point optimality conditions yields
$$Q_Y(y)\Big({-\frac{1}{P_Y(y)}} - 2\beta\Big) + 2\beta P_Y(y) + 2\alpha\sum_{i=y}^{|\mathcal{Y}|}\sum_{j=1}^{i} P_Y(j) + Z_{P_Y} - 2\alpha\Big(|\mathcal{Y}| - y + 1 - \sum_{i=y+1}^{|\mathcal{Y}|}(i - y)\,Q_Y(i)\Big) = 0 \qquad (11)$$
$$-\ln P_Y(y) + \beta\Big(1 - 2P_Y(y) + \sum_{y'} P^2_Y(y')\Big) + \theta^\top\phi(x, y) + Z_{Q_Y} + \alpha\Big(|\mathcal{Y}| - y + 1 + 2\sum_{i=1}^{y-1}\sum_{j=1}^{i} P_Y(j)\Big) = 0,$$
where the remaining ranked-probability terms that do not depend on $y$ are, together with $Z_{Q_Y}$, constant across all $y$. By absorbing them into the constant $C$, we also observe that the equation for $y$ only depends on $P_Y(y')$ for $y' < y$ through the term $\sum_{i=1}^{y-1}\sum_{j=1}^{i} P_Y(j)$. Therefore, $P^*_Y(y)$ can be found in increasing order of $y$ from 1 to $|\mathcal{Y}|$. Given $P^*_Y$, we can then consider Eq. (11) in matrix form, as above.
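The ranked probability score compares cumulative distributions, which is direct to compute with cumulative sums; a short sketch (0-based array indices standing in for labels $1, \ldots, |\mathcal{Y}|$):

```python
import numpy as np

def ranked_probability(p, y):
    """l_rp(P_Y, y) = sum_i [sum_{j<=i} P_Y(j) - I(i >= y)]^2 with labels 1..|Y|."""
    cdf_p = np.cumsum(p)
    cdf_y = (np.arange(1, len(p) + 1) >= y).astype(float)  # CDF of the point mass at y
    return float(np.sum((cdf_p - cdf_y) ** 2))

p = np.array([0.2, 0.5, 0.3])
score = ranked_probability(p, 1)
assert abs(score - ((0.2 - 1.0) ** 2 + (0.7 - 1.0) ** 2 + 0.0)) < 1e-12

# Unlike the label-ordering-blind log and Brier losses, it penalizes mass
# placed far from y more than mass placed nearby
near = ranked_probability(np.array([0.0, 1.0, 0.0]), 1)
far = ranked_probability(np.array([0.0, 0.0, 1.0]), 1)
assert near < far
```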

Dataset statistics and normalized generalization losses with 95% confidence intervals on each dataset. The best results are indicated in bold. † indicates statistical significance with paired t-test (p < 0.05).

