MODELING THE SECOND PLAYER IN DISTRIBUTIONALLY ROBUST OPTIMIZATION

Abstract

Distributionally robust optimization (DRO) provides a framework for training machine learning models that are able to perform well on a collection of related data distributions (the "uncertainty set"). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization problem exactly tractable, such as f-divergence balls. In this paper, we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set. However, while conceptually simple, this approach poses a number of implementation and optimization challenges. To circumvent these issues, we propose a relaxation of the KL-constrained inner maximization objective that makes the DRO problem more amenable to gradient-based optimization of large-scale generative models, and develop model selection heuristics to guide hyper-parameter search. On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines.

1. INTRODUCTION

Machine learning models trained with empirical risk minimization (ERM) are able to achieve high aggregate performance on data sampled from their training distribution. However, they often exhibit drops in accuracy when confronted with data from domains that are under-represented in their training data, such as those of different topic (Gururangan et al., 2020), sociolect (Blodgett et al., 2016), accent (Amodei et al., 2016) or writer age (Hovy & Søgaard, 2015) in language processing tasks, or skin color (Grother et al., 2019) or lighting (Georghiades et al., 2001) in image processing tasks. This is a particularly egregious issue in applications where higher error rates can have far-reaching negative implications, such as the silencing of underrepresented minorities in toxicity detection systems (Dixon et al., 2018) or disparity-amplifying feedback loops in credit rating models (Fuster et al., 2018).

This behaviour often arises from the objective function of ERM, where the parameters θ of the model are learned by minimizing the expectation of a loss function $\ell$ under a data distribution p (or, in practice, an associated empirical data distribution $\hat{p}$):

$$\mathcal{L}_{\text{ERM}}(\theta) = \mathbb{E}_{(x,y)\sim \hat{p}}\, \ell(x, y; \theta). \qquad (1)$$

When the model encounters data sampled from a different distribution $q_{\text{test}} \neq p$, performance can suffer significantly. Distributionally robust optimization (DRO) (Ben-Tal et al., 2013b) provides a natural solution to this issue by replacing the expected risk under a single distribution p with the worst expected risk over a pre-determined family of distributions $\mathcal{Q}$ (the "uncertainty set"):

$$\mathcal{L}_{\text{DRO}}(\theta) = \max_{q \in \mathcal{Q}} \mathbb{E}_{(x,y)\sim q}\, \ell(x, y; \theta). \qquad (2)$$

If $\mathcal{Q}$ contains $q_{\text{test}}$, the DRO objective upper bounds the expected risk under $q_{\text{test}}$. However, a priori knowledge of possible test distributions is not always available or easy to acquire. For example, training a model to be robust to some demographic attributes ($\mathcal{Q} = \{q_{\text{demographic } 1}, q_{\text{demographic } 2}, \ldots\}$) requires collecting and annotating data with the necessary information, an expensive and ethically fraught endeavour. In the absence of such information, one has to resort to defining the uncertainty set analytically, drawing on one's intuition of what constitutes a possible test distribution given the observed training distribution, such as using moment constraints (Delage & Ye, 2010; Nguyen et al., 2020), f-divergence (Ben-Tal et al., 2013a; Hu & Hong, 2013; Faury et al., 2020) or Wasserstein/IPM (Sinha et al., 2018; Husain, 2020) balls, or coarse-grained mixture models (Oren et al., 2019; Hu et al., 2018). However, the need to keep the inner supremum in Eq. (2) tractable limits the possible choices.

In this paper, we propose that the uncertainty set be instead defined as a family of parametric generative models. The resulting DRO objective (§2) is a differentiable game with two players: the original model $\ell(x, y; \theta)$ and a model of its worst-case distribution $q_\psi(x, y)$, the titular "second player", which we hereafter refer to as the adversary. Using this formulation, which we call Parametric DRO (P-DRO), allows for more flexibility in the choice of the adversary's architecture (and so the uncertainty set). Unfortunately, finding a solution of this game via direct application of simultaneous gradient descent (Singh et al., 2000) is difficult (Balduzzi et al., 2018). In particular, direct gradient descent on the uncertainty set suffers from instability due to the large variance of the gradients (Greensmith et al., 2004), and hyper-parameter selection is not straightforward. To address these challenges, we make two main contributions (§3): first, we propose a new relaxation of the DRO game's inner maximization problem (with KL constraints). The resulting objective is more amenable to simultaneous gradient updates than the original zero-sum game and significantly improves training stability, while still yielding useful adversaries.
Second, we develop a principled approach for selecting hyper-parameters: we leverage the learned adversaries to decide which of any two given models trained with P-DRO is more robust than the other. We conduct in-depth experiments analyzing the effect of our proposed changes on both a toy task and a more realistic, yet still synthetic, sentiment classification task (§4). Finally, we show that in the more realistic setting of toxicity detection, P-DRO yields models that are more robust to changes in demographic groups, even though these groups are unknown at training time, opening up applications in combatting dataset bias (§5).

2. PARAMETERIZING THE UNCERTAINTY SET

Consider a model parameterized by $\theta \in \mathbb{R}^{d_{\text{model}}}$. Minimizing the DRO objective described in Eq. (2) over the uncertainty set $\mathcal{Q}$ turns the optimization problem into the min-max (or zero-sum) game

$$\min_{\theta \in \mathbb{R}^{d_{\text{model}}}} \max_{q \in \mathcal{Q}} \mathbb{E}_{(x,y)\sim q}\, \ell(x, y; \theta). \qquad (3)$$

The first player controls the parameters θ, whilst the second player controls the worst-case distribution q. In the absence of explicit information on groups of interest (such as demographics, domain, etc.), an adequate choice of the uncertainty set $\mathcal{Q}$ is critical to the success of DRO. This is in fact very much an active area of research (Sinha et al., 2018; Duchi & Namkoong, 2018; Oren et al., 2019; see Rahimian & Mehrotra (2019) for a survey). $\mathcal{Q}$ must be sufficiently large to contain test distributions of interest, but if it is too large it may contain "adversarial" distributions on which no model can perform well. Moreover, the design of $\mathcal{Q}$ is also circumscribed by the necessity of keeping the min-max problem tractable, particularly in the context of stochastic optimization. In Hu & Hong (2013) and Duchi et al. (2016) for example, the choice of f-divergence balls allows the use of duality arguments to reformulate (3) as a more manageable min-min problem. Others, like Hu et al. (2018) or Oren et al. (2019), propose using mixture models, the simplicity of which enables them to solve the inner maximization problem efficiently.

Instead, we propose to explicitly model the second player in the DRO game as a parametric model $q_\psi$ of the data. Of course, not all parameterizations $\psi \in \mathbb{R}^{d_{\text{adv}}}$ of a given generative model represent useful distributions, and we require that the adversary stay "close" to the underlying true data distribution p. As a measure of distance between $q_\psi$ and p, we choose the KL divergence (Kullback & Leibler, 1951) due to its wide acceptance in the machine learning community, as well as its appealing properties in the context of DRO. The KL upper bound, κ, is left as a parameter to be decided by the experimenter. We refer to the resulting DRO formulation as Parametric DRO:

$$\min_\theta \max_{\psi:\ \mathrm{KL}(q_\psi \| p) \le \kappa} \underbrace{\mathbb{E}_{(x,y)\sim q_\psi}\, \ell(x, y; \theta)}_{\mathcal{L}_{\text{P-DRO}}(\theta, \psi)}. \qquad (4)$$

3. OPTIMIZING P-DRO

The min-max problem in Eq. (4) belongs to a class of games called "differentiable games" (another famous representative being generative adversarial networks (Goodfellow et al., 2014)). We can search for a solution of this game with simultaneous gradient descent (Singh et al., 2000), i.e. by simultaneously updating θ and ψ with $-\nabla_\theta \mathcal{L}_{\text{P-DRO}}$ and $\nabla_\psi \mathcal{L}_{\text{P-DRO}}$ respectively. Unfortunately, in general, there is no theoretical guarantee that simultaneous gradient descent will converge to a Nash equilibrium (Balduzzi et al., 2018), nor that any such equilibrium even exists if the objective is non-convex in θ (or non-concave in ψ). The success of GANs and the follow-up literature (Wang et al., 2019) serves as an encouraging example that gradient-based methods can yield useful solutions despite the pessimistic theoretical results. In this section, we discuss difficulties that arise when optimizing θ and ψ jointly, and propose modifications of the objective to address them.

3.1. TRAINING THE MODEL θ

We could train the model θ by taking negative gradient steps on $\mathbb{E}_{(x,y)\sim q_\psi}\, \ell(x, y; \theta)$. This gradient can be estimated by sampling examples from $q_\psi$ and averaging the gradient of their losses. Unfortunately, this objective requires that $q_\psi$ be well-behaved at all iterations, as it is the only source of supervision for θ. If $q_\psi$ is initialized incorrectly or begins producing unrealistic (x, y), the quality of θ degrades as it begins to learn a predictor on invalid training examples from $q_\psi$. As an alternative, we opt to compute the gradients for θ with importance sampling, i.e. rewriting $\mathcal{L}_{\text{P-DRO}}$ as $\mathbb{E}_{(x,y)\sim p} \frac{q_\psi(x,y)}{p(x,y)} \ell(x, y; \theta)$, which ensures that all (x, y) samples are drawn from the training set itself. Unfortunately, the true density p is unknown to us. As an approximation, we replace $\frac{q_\psi(x,y)}{p(x,y)}$ with the likelihood ratio between $q_\psi$ and the maximum likelihood estimate of p, $q_{\psi_0} := \arg\max_{q_\psi} \mathbb{E}_{(x,y)\sim p} \log q_\psi(x, y)$. This changes the min-max problem to

$$\min_\theta \max_{\psi:\ \mathrm{KL}(q_\psi \| p) \le \kappa} \underbrace{\mathbb{E}_{(x,y)\sim p} \frac{q_\psi(x, y)}{q_{\psi_0}(x, y)} \ell(x, y; \theta)}_{\mathcal{L}_{\text{model}}}. \qquad (5)$$

This becomes a simple expected loss objective, which we can estimate by sampling from the empirical distribution $\hat{p}$. In experiments, we find that with this formulation we are able to train robust θ even when $q_\psi$ is only a mediocre generative model (see Appendix C.2). To further stabilize training at the beginning of the optimization process, we initialize ψ with $\psi_0$, making the objective exactly the same as ERM for the first gradient step.
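As a rough illustration of Eq. (5), the sketch below computes the importance-weighted minibatch loss from per-example losses and log-densities. The function name `model_loss` and the toy numbers are illustrative assumptions, not taken from the released code; in practice the log-densities would be produced by the adversary and by the fixed maximum likelihood model.

```python
import math

def model_loss(losses, log_q_psi, log_q_psi0):
    """Minibatch estimate of L_model in Eq. (5).

    losses[i]     : loss of the model on training example i
    log_q_psi[i]  : log-density of example i under the current adversary
    log_q_psi0[i] : log-density under the MLE model q_psi0 (stand-in for p)
    """
    # Importance weight q_psi / q_psi0, computed from log-densities.
    weights = [math.exp(lq - lq0) for lq, lq0 in zip(log_q_psi, log_q_psi0)]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Toy numbers (hypothetical).
losses = [0.2, 1.5, 0.7]
log_q0 = [-3.0, -4.0, -2.5]

erm = sum(losses) / len(losses)
at_init = model_loss(losses, log_q0, log_q0)                # adversary == psi_0
shifted = model_loss(losses, [-3.0, -3.0, -2.5], log_q0)    # up-weights example 1
```

When the adversary equals ψ_0, every weight is 1 and the estimate coincides with the ERM average, which is the behaviour described above for the first gradient step.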

3.2. TRAINING THE ADVERSARY ψ

According to Eq. (5) the adversary ψ must maximize $\mathbb{E}_{(x,y)\sim q_\psi} \frac{p(x,y)}{q_{\psi_0}(x,y)} \ell(x, y; \theta)$ within a KL ball of fixed radius. This is challenging for several reasons: first, enforcing the bound is intractable for complex families of adversaries, where e.g. projecting onto the KL ball is a difficult optimization problem of its own. Second, maximizing the expectation with respect to the parameters of the distribution $q_\psi$ is prone to instability due to large gradient variance (Greensmith et al., 2004).

Lagrangian Relaxation. To address the first difficulty, we loosen the strict KL constraint and instead consider the Lagrangian relaxation $\mathbb{L}$:

$$\mathbb{L}(\psi, \tau) = \mathbb{E}_{(x,y)\sim q_\psi} \frac{p(x, y)}{q_{\psi_0}(x, y)} \ell(x, y; \theta) - \tau \left(\mathrm{KL}(q_\psi \| p) - \kappa\right). \qquad (6)$$

We fix the Lagrangian multiplier τ > 0 and treat it as a "temperature" hyper-parameter. With some reorganization (which we develop in Appendix A.1), we can show that

$$\mathbb{L}(\psi, \tau) = -\tau\, \mathrm{KL}(q_\psi \| q^*_{\tau,\theta}) + C, \qquad (7)$$

where $q^*_{\tau,\theta} \propto p(x, y)\, e^{\frac{p(x,y)}{q_{\psi_0}(x,y)} \frac{\ell(x,y;\theta)}{\tau}}$ and C is constant in ψ. In other words, maximizing $\mathbb{L}$ in ψ is equivalent to minimizing the KL divergence between $q_\psi$ and $q^*_{\tau,\theta}$. One difficulty with this objective is that $q^*_{\tau,\theta}$ depends upon the unknown probability density p(x, y). We avoid this problem by treating the density ratio $\frac{p(x,y)}{q_{\psi_0}(x,y)}$ as a constant, which is closely related to assumptions that have been used successfully in past formulations of DRO (Oren et al., 2019). Empirically, we find that incorporating $q_{\psi_0}$ as a surrogate for p is a serviceable approximation, as demonstrated in Section 4.

Reversing the KL. Minimizing the KL divergence in this direction is difficult for several reasons. First, it entails optimizing an expectation in $q_\psi$ over ψ, which is difficult due to the large variance of the gradients (Greensmith et al., 2004). Second, computing this KL necessitates access to the true theoretical density p(x, y) in order to compute $q^*_{\tau,\theta}(x, y)$ in the argument of the expectation, but this quantity is unknown in practice. To sidestep these issues, we elect to minimize the reverse direction $\mathrm{KL}(q^*_{\tau,\theta} \| q_\psi)$ instead. Because the KL divergence is not symmetric, this is a rather crude approximation, the implications of which are discussed in Norouzi et al. (2016). However, we find that this approach dramatically stabilizes the gradient dynamics while still yielding good adversaries, as observed empirically in Section 4.4. Discarding the entropy term (constant in ψ), the resulting problem is equivalent to minimizing

$$\mathcal{L}_{\text{adv}}(\psi, \tau) := -\frac{1}{Z_{\tau,\theta}} \mathbb{E}_{p}\, e^{\frac{\ell(x,y;\theta)}{\tau}} \log q_\psi(x, y) \qquad (8)$$

in ψ, where $Z_{\tau,\theta} = \mathbb{E}_p\, e^{\frac{\ell(x,y;\theta)}{\tau}}$ is the normalizer of $q^*$. In this case, we can estimate the expectation by substituting the empirical distribution $\hat{p}$ for p.

Computing the Normalizer. Approximating the inverse normalizer $\frac{1}{Z_{\tau,\theta}}$ in a minibatch yields a biased estimator. On the other hand, computing $Z_{\tau,\theta}$ over the entire training data at each step is prohibitive since it requires computing the loss of every single example. As a middle ground, we keep a running normalizer $\tilde{Z}_k$ computed from the average of the normalizers over a fixed number k of consecutive minibatches. In other words, if $B_i$ and $\theta_i$ denote the minibatch and model parameters at step i respectively, the normalizer at step t will be

$$\tilde{Z}_k = \frac{1}{\sum_{i=t-k}^{t} |B_i|} \sum_{i=t-k}^{t} \sum_{(x,y)\in B_i} e^{\frac{\ell(x,y;\theta_i)}{\tau}}. \qquad (9)$$

If k is too low, there is a risk of under-estimating the normalizer, especially if the distribution of weights contains infrequent high-weight samples. On the other hand, if k is too high there is a risk of using "stale" weights in the normalizer. In experiments, we treat k as a hyper-parameter.
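The interaction between the adversary loss of Eq. (8) and the running normalizer can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the window holds the last k minibatches (the exact index range of the running average may differ by one), and the exponentials are computed directly rather than in log space, which a practical implementation would likely need for small τ.

```python
import math
from collections import deque

class RunningNormalizer:
    """Running estimate of Z: mean of exp(loss / tau) over the last k minibatches."""

    def __init__(self, k):
        self.sums = deque(maxlen=k)    # per-minibatch sum of exp(loss / tau)
        self.sizes = deque(maxlen=k)   # per-minibatch size |B_i|

    def update(self, losses, tau):
        self.sums.append(sum(math.exp(l / tau) for l in losses))
        self.sizes.append(len(losses))
        return sum(self.sums) / sum(self.sizes)

def adversary_loss(losses, log_q_psi, z, tau):
    """Minibatch estimate of L_adv: -(1/Z) * mean(exp(loss/tau) * log q_psi)."""
    weighted = sum(math.exp(l / tau) * lq for l, lq in zip(losses, log_q_psi))
    return -weighted / (z * len(losses))

# Toy usage with tau = 1 (hypothetical values).
norm = RunningNormalizer(k=2)
z1 = norm.update([0.0, 0.0], tau=1.0)
z2 = norm.update([math.log(3.0), math.log(3.0)], tau=1.0)
z3 = norm.update([0.0, 0.0], tau=1.0)   # the first minibatch leaves the window
l_adv = adversary_loss([0.0, 0.0], [-1.0, -3.0], z=1.0, tau=1.0)
```

Minimizing this quantity in ψ amounts to weighted maximum likelihood training of the adversary, with high-loss examples receiving exponentially larger weight.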

3.3. OPTIMAL STOPPING

When should one stop training a model with P-DRO? In ERM it is customary to stop training after the empirical risk, periodically evaluated on a held-out validation dataset, stops decreasing. This is particularly important to prevent over-fitting to the training data. However, it is not an appropriate criterion for P-DRO, since the model is not trained to minimize empirical risk in the first place. A more pertinent choice is to compare the robust validation losses

$$\mathcal{L}_{\text{robust,valid}}(\theta) = \max_{q_\psi \in \mathcal{Q}} \underbrace{\frac{1}{|D_{\text{valid}}|} \sum_{(x,y)\in D_{\text{valid}}} \frac{q_\psi(x, y)}{q_{\psi_0}(x, y)} \ell(x, y; \theta)}_{:= \mathcal{L}_{\text{valid}}(\theta, \psi)}. \qquad (10)$$

However, finding the inner supremum for each of the T evaluation checkpoints $\theta_1, \ldots, \theta_T$ is expensive as it requires solving T independent optimization problems. Instead, we leverage the existence of adversaries $\psi_t$ associated with each model $\theta_t$, as well as the initial adversary $\psi_0$, and take the maximum over the T + 1 adversaries $\{\psi_0, \ldots, \psi_T\}$. Since our relaxation of the P-DRO objective loosens the KL constraint, we need to weed out adversaries which might violate it. Specifically, we estimate $\mathrm{KL}(q_\psi \| p) = \mathbb{E}_p \left[\frac{q_\psi}{p} \log \frac{q_\psi}{p}\right]$ on the validation set, using $q_\psi / q_{\psi_0}$ as a stand-in for $q_\psi / p$, and reject all adversaries for which the result is greater than a threshold, which we set to log 10 based on preliminary experiments detailed in Appendix C.1. We refer to this stopping criterion as Minmax.

Computing the full min-max necessitates keeping track of T models and T + 1 adversaries, which is unwieldy when the model is large. As a solution, we propose an approximation, Greedy-Minmax, in which we only keep one best model $\theta^*$. At each evaluation step T, we compare $\theta_T$ to $\theta^*$, and update $\theta^*$ to whichever achieves lower robust validation loss over the T + 1 adversaries $\psi_0, \ldots, \psi_T$. By keeping track of only one additional model, and using the weights $\frac{q_{\psi_t}(x_i, y_i)}{q_{\psi_0}(x_i, y_i)}$ of individual examples in $D_{\text{valid}}$ as sufficient statistics for computing the loss against each adversary, Greedy-Minmax can be achieved with space complexity $2 d_{\text{model}} + T |D_{\text{valid}}|$, which is much more efficient than the $T (d_{\text{model}} + d_{\text{adv}})$ of Minmax.
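The Greedy-Minmax procedure described above can be sketched in a few lines. This is an illustrative reading rather than the released implementation: the names `GreedyMinmax`, `kl_estimate` and `robust_loss` are hypothetical, each adversary is represented only by its per-example validation weights q_ψt/q_ψ0, and the toy weights are not normalized.

```python
import math

KL_THRESHOLD = math.log(10)  # rejection threshold quoted in the text

def kl_estimate(ratios):
    """Validation estimate of KL(q_psi || p), using q_psi/q_psi0 as the ratio."""
    return sum(r * math.log(r) for r in ratios) / len(ratios)

def robust_loss(losses, all_ratios):
    """Max over admissible adversaries of the weighted validation loss."""
    admissible = [r for r in all_ratios if kl_estimate(r) <= KL_THRESHOLD]
    return max(sum(w * l for w, l in zip(r, losses)) / len(losses)
               for r in admissible)

class GreedyMinmax:
    def __init__(self, n_valid):
        # psi_0 has weight 1 on every example, so it is always admissible.
        self.ratios = [[1.0] * n_valid]
        self.best_step, self.best_losses = None, None

    def step(self, t, losses, ratios):
        """Register checkpoint t; return the step of the best model so far."""
        self.ratios.append(ratios)
        if (self.best_losses is None or
                robust_loss(losses, self.ratios) < robust_loss(self.best_losses, self.ratios)):
            self.best_step, self.best_losses = t, losses
        return self.best_step

selector = GreedyMinmax(n_valid=4)
s1 = selector.step(1, [0.1, 0.1, 2.0, 2.0], [0.5, 0.5, 1.5, 1.5])
# The second adversary is far from the data and is rejected by the KL filter.
s2 = selector.step(2, [0.6, 0.6, 0.7, 0.7], [0.01, 0.01, 50.0, 50.0])
# Lower average loss than checkpoint 2, but worse under adversary 1.
s3 = selector.step(3, [0.0, 0.0, 1.2, 1.2], [1.0, 1.0, 1.0, 1.0])
```

In this toy run the third checkpoint has the lowest average validation loss, yet the second is retained because it is better against the worst admissible adversary.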

3.4. HYPER-PARAMETER SELECTION

Our proposed P-DRO method relies on 3 different hyper-parameters (in addition to the model's hyper-parameters): the adversary learning rate λ, the temperature τ and the size of the renormalizing window k. As a consequence, we need a reliable criterion for deciding which of two configurations is better. This model comparison bears many similarities with the stopping problem described above. Therefore, we resort to a similar solution: given two models $\theta_1$, $\theta_2$ trained with P-DRO, and their respective adversaries $\{\psi^1_0, \ldots, \psi^1_T\}$, $\{\psi^2_0, \ldots, \psi^2_T\}$ (for instance, the adversaries associated with $\theta_1$ and $\theta_2$ at periodic checkpoints during training), we select the best model following

$$\theta^* = \arg\min_{\theta \in \{\theta_1, \theta_2\}} \max_{\psi \in \{\psi^1_0, \ldots, \psi^1_T, \psi^2_0, \ldots, \psi^2_T\}} \mathcal{L}_{\text{valid}}(\theta, \psi). \qquad (11)$$

4. EXPERIMENTAL ANALYSIS OF P-DRO

Before moving on to a real-world scenario in Section 5, we first demonstrate that P-DRO is able to learn robust models in a synthetic Natural Language Processing (NLP) task, and perform ablation studies to examine the importance of the various modifications described in Section 3.

4.1. EXPERIMENTAL SETTING

For analysis purposes, we design a simple NLP task amenable to DRO. We specifically choose NLP as a domain due to the striking success of language models as generative models of textual data (Sundermeyer et al., 2012; Radford et al., 2018), which can be used to model the uncertainty set. We base our task on the binary version of the Stanford Sentiment Treebank dataset (SST-2; Socher et al. (2013)), which we modify to introduce a spurious correlation. Specifically, we introduce a distractor token to some sentences. The distractor we use consists of prepending "so , " to the sentence ("i hated this movie" → "so , i hated this movie"), which does not change the underlying sentiment. The resulting samples can be categorized in 4 "groups" depending on their label (positive or negative) and the presence or absence of the distractor. In particular, we add this distractor to 95% of the negative reviews and 5% of the positive reviews in the training and validation set, so that the presence of the distractor strongly correlates with negative sentiment (a similar construction is proposed by Utama et al. (2020)). In the test data, we modify 50% of all sentences for each class equitably to ensure that there is enough data in each group, but we report "average" test accuracy by re-weighting the group accuracies to mimic the training distribution. We call this modified task BiasedSST. For the classifier, we train a simple one-layer BiLSTM model with embedding/hidden dimension 300. For the adversary, we adopt an auto-regressive transformer model based on the successful GPT-2 language model architecture but with 6 layers, a dimension of 512 and 8 attention heads (we experiment with a smaller, LSTM-based adversary in Appendix C.2). In order to model the input-output pair (x, y), we prepend a special label-specific token to sentences before running them through the language model.
We train the model with Adam (Kingma & Ba, 2014) and the adversary with vanilla stochastic gradient descent (which we found more stable in experiments). We refer to Appendix B for specific details of the experimental setting. We train 7 models with P-DRO on BiasedSST using different hyper-parameters for the adversary. We start from the configuration $\lambda = 10^{-4}$, τ = 0.01, k = 5, and for each hyper-parameter we run a configuration with a smaller and a higher value, keeping all other hyper-parameters the same. We train for 50 epochs and select the best model using the strategies described in Section 3.
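The construction of the biased training and validation sets described in this section can be sketched as below. This is a hedged illustration, not the authors' preprocessing script: the function names, the label convention (0 for negative, 1 for positive) and the seeded sampling are assumptions.

```python
import random

def add_distractor(sentence):
    """Prepend the distractor "so , " to a tokenized, lower-cased sentence."""
    return "so , " + sentence

def make_biased(examples, p_negative=0.95, p_positive=0.05, seed=0):
    """Add the distractor to a label-dependent fraction of the examples.

    examples: list of (sentence, label) pairs with label 0 = negative,
    1 = positive (assumed convention). Returns (sentence, label, flag)
    triples, where flag records whether the distractor was added.
    """
    rng = random.Random(seed)
    biased = []
    for sentence, label in examples:
        p = p_negative if label == 0 else p_positive
        flag = rng.random() < p
        biased.append((add_distractor(sentence) if flag else sentence, label, flag))
    return biased

train = make_biased([("i hated this movie", 0), ("a great film", 1)])
```

The (label, flag) pair identifies which of the 4 groups each example belongs to; for the test set, the same function could be called with both probabilities set to 0.5.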

4.2. P-DRO CAN LEARN ROBUST MODELS

We also compare three other approaches. First, to appreciate how well the model could perform if the groups were known at training time, we train with Group-DRO on the oracle groups using an exponentiated-gradient-based online algorithm (Oracle DRO; Sagawa et al. (2020)). Second, we implement Topic CVaR (Oren et al., 2019), a method for DRO on NLP where the uncertainty set is determined by mixtures of a topic model. Finally, we compare to non-parametric DRO with a Kullback-Leibler (KL) constrained uncertainty set (Hu & Hong, 2013; Hu et al., 2018), which we adapt to fit our online mini-batch training setting (NonParam). We refer to Appendix B.3 for details and hyper-parameters of the baselines. We report the worst-case ("robust") accuracy over all groups on the test set, as well as the average accuracy, in Table 1 (we report the mean and standard deviation over 5 runs). We find that Topic-CVaR, NonParam and P-DRO are all more robust than ERM, but P-DRO outperforms the other two by close to 30 and 7 points respectively, achieving 52% of Oracle DRO's robust accuracy, while not leveraging any information on the oracle groups.
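For concreteness, the two reported metrics can be computed as in the sketch below: robust accuracy is the minimum per-group accuracy, and the "average" accuracy re-weights group accuracies by training-set proportions. The function names, the toy predictions and the group proportions (which assume balanced classes) are illustrative assumptions.

```python
def group_accuracies(preds, labels, groups):
    """Accuracy of the predictions within each group."""
    correct, total = {}, {}
    for p, y, g in zip(preds, labels, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(p == y)
    return {g: correct[g] / total[g] for g in total}

def robust_accuracy(acc):
    """Worst-case accuracy over all groups."""
    return min(acc.values())

def reweighted_average(acc, train_proportions):
    """Average accuracy, re-weighted to mimic the training distribution."""
    return sum(train_proportions[g] * acc[g] for g in acc)

# A toy classifier that predicts "negative" (0) exactly when "so ," is present.
preds = [0, 0, 1, 1, 1, 0]
labels = [0, 0, 0, 1, 1, 1]
groups = ["neg+so", "neg+so", "neg", "pos", "pos", "pos+so"]

# Hypothetical training proportions, assuming balanced classes.
proportions = {"neg+so": 0.475, "neg": 0.025, "pos": 0.475, "pos+so": 0.025}

acc = group_accuracies(preds, labels, groups)
robust = robust_accuracy(acc)
average = reweighted_average(acc, proportions)
```

A classifier that relies entirely on the distractor scores well on the re-weighted average while failing completely on the two minority groups, which is why the two metrics are reported side by side.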

4.3. OPTIMAL STOPPING AND HYPER-PARAMETER SELECTION ABLATION

To understand the importance of the optimal stopping and hyper-parameter selection strategy described in Section 3.3, we perform an ablation on the BiasedSST dataset comparing 4 strategies:

• Average: models are selected based on their average zero-one loss (i.e. error rate) on the unmodified validation set. This is the baseline stopping criterion.
• Minmax: selection based on the adversaries (as described in Section 3.3), with and without the KL constraint, as well as its variant Greedy-Minmax for stopping.
• Oracle: in this setting the groups are known (in the validation set), and models are selected based on their error rate on the worst performing group. This is the optimal criterion for the group-DRO setting we are considering.

To compare stopping criteria, we only consider one set of hyper-parameters: $\lambda = 10^{-4}$, k = 5 and τ = 0.01. From the robust validation accuracies reported in Table 2a, we first observe that Average stopping results in a robust accuracy of 0, highlighting the necessity for a suitable stopping criterion. We find that Minmax, especially with a KL constraint, is a much better strategy, recovering ≈ 60% of the performance achievable with Oracle stopping. Notably, the Greedy-Minmax variant which we use in practice reaches very close results (< 1 point difference) despite keeping track of only 2 out of the 50 model checkpoints at any time.

To understand the effectiveness of the Minmax strategy for selecting hyper-parameters, we take the models trained in Section 4.1, but select the best hyper-parameters using the different strategies described above. Results, shown in Table 2b, confirm that Minmax (with the KL constraint) is a better choice than Average for selecting hyper-parameters, even though the improvement is not as striking as for stopping.

Finally, we investigate the importance of modifying the adversary's objective as described in Section 3.2.
For this experiment, we devise a simpler toy task on which directly training the constrained DRO objective is possible. Specifically, we consider the two-dimensional binary classification problem pictured in Figure 2. The training data consists of 10,000 points partitioned in two normally distributed "domains" with a 1:50 sampling ratio and different classification boundaries. We train a logistic regression model, which cannot perfectly fit the training data and must trade off between accuracy on each domain. For the sake of simplicity, we only model the input variables x as isotropic normal distributions with fixed variance: the adversaries' parameter $\psi \in \mathbb{R}^2$ represents the location of the Gaussian (we fix the variance to the empirical variance of the data). (In other words, we set $q_\psi(x, y) = p(y \mid x)\, q_\psi(x)$, where $p(y \mid x)$ is the true conditional, which cancels out in the ratio $\frac{q_\psi(x,y)}{q_{\psi_0}(x,y)}$.) We compare 3 different versions of P-DRO: first, naive simultaneous gradient descent on the zero-sum game, without any constraint on the adversary (bare P-DRO), then the same, but with an approximation of the explicit KL constraint between $q_\psi$ and $q_{\psi_0}$ (+KL constraint; see Appendix A.2 for more details). Finally, we report results using our relaxation and the KL reversal described in Section 3.2 (+$\mathcal{L}_{\text{adv}}$). For each setting, we report the average and robust accuracy with mean and standard deviation over 10 runs. For the KL constraint and the relaxation, we report the best results among 4 values of the KL bound κ and the temperature τ respectively.

4.4. IMPORTANCE OF L_ADV

In Table 3, we observe that bare P-DRO is too unstable and systematically diverges. The addition of a KL constraint mitigates this behaviour, but the zero-sum objective is still unstable, as evidenced by the high standard deviations. Finally, we find that the addition of $\mathcal{L}_{\text{adv}}$ stabilizes the training process greatly, leading to consistently high robust accuracy.

5. P-DRO IN PRACTICE: CASE STUDY OF TOXICITY DETECTION

In this section, we demonstrate the effectiveness of P-DRO in the more realistic setting of toxicity detection, the task of recognizing various forms of toxic language (e.g. hate speech or offensive language). Identifying abuse online is a crucial challenge, and has garnered much interest in the NLP community (Schmidt & Wiegand, 2017; Fortuna & Nunes, 2018). However, recent work (Sap et al., 2019) has shown that there is a strong correlation between toxic labels and the presence of certain markers of dialects of English spoken by minority groups. This correlation is in turn amplified by hate speech classifiers trained on such data, leading to biased predictions. Our results on BiasedSST suggest that P-DRO can provide one solution to preventing models from absorbing spurious correlations present in their training data, even in the absence of protected attributes (such as language variety here).

5.1. EXPERIMENTAL SETTING

Following Sap et al. (2019) and Xia et al. (2020), we perform experiments on two datasets: DWMW17 (Davidson et al., 2017), a corpus of 25K tweets classified in three categories: hate speech (6%), offensive (76%) and neither (18%), and FDCL18 (Founta et al., 2018), a 100k-sized dataset, also collected from Twitter and annotated with an additional spam label, with the following breakdown by categories: hateful (5%), abusive (27%), normal (54%) and spam (14%). The released version of these datasets does not contain information on the dialect of each user. In order to be able to evaluate our models, and to train an Oracle DRO baseline, we follow Sap et al. (2019) and use annotations provided by the dialect classifier described in Blodgett et al. (2016) to label each example as one of four English varieties: White-aligned, African American, Hispanic, and Other. Note that, as these are automatically obtained labels, the groups may not exactly correspond to the actual racial sociolects; however, Sap et al. (2019) report that they correlate highly with self-reported race, and they serve as a useful proxy in the absence of manual annotation. We formulate the group-DRO problem by separating each dataset into independent groups identified by both language variety and label, for a total of 12 and 16 groups for DWMW17 and FDCL18 respectively. Some of these groups are severely under-represented in the test set. In order to make our robust accuracy results reliable yet still representative of the under-represented groups, we combine groups that contain fewer than 100 samples into a single group to compute robust test accuracies. On DWMW17, we train the same BiLSTM model as described in Section 4.1. To illustrate the applicability of P-DRO to other model architectures, we pick BERT (Devlin et al., 2018), a large-scale pre-trained model, as a classifier on FDCL18. In both cases, we adopt the Transformer architecture described in Section 4.1 as the adversary.
We train the adversary with a temperature of τ = 0.01 and a normalizing window k = 10. To demonstrate the efficacy of automatic hyper-parameter selection in the P-DRO setting, we delegate the choice of the adversary's learning rate λ to grid-search, training 3 models with $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}\}$ and selecting the best using the Minmax criterion described in Section 3.4. We also report numbers for Oracle DRO and Topic CVaR. Results are averaged over 5 runs, each with a different random seed.
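The group-merging step used when computing robust test accuracy in this setting can be sketched as follows. The function names, the label given to the merged group and the toy group sizes are hypothetical; only the threshold of 100 samples comes from the text.

```python
from collections import Counter

def merge_small_groups(groups, min_size=100, merged_name="merged-small"):
    """Relabel every group with fewer than min_size samples as one merged group."""
    sizes = Counter(groups)
    return [g if sizes[g] >= min_size else merged_name for g in groups]

def worst_group_accuracy(correct, groups):
    """Minimum over groups of the fraction of correct predictions."""
    hits, totals = Counter(), Counter()
    for c, g in zip(correct, groups):
        totals[g] += 1
        hits[g] += int(c)
    return min(hits[g] / totals[g] for g in totals)

# Toy test set: group "a" is large, groups "b" and "c" are under-represented.
groups = ["a"] * 150 + ["b"] * 60 + ["c"] * 50
correct = ([True] * 150
           + [True] * 30 + [False] * 30
           + [True] * 10 + [False] * 40)

raw = worst_group_accuracy(correct, groups)
merged = worst_group_accuracy(correct, merge_small_groups(groups))
```

Merging pools the small groups' examples, so the reported worst-case figure is estimated from a larger sample while still reflecting the under-represented groups.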

5.2. CAN P-DRO PRODUCE MORE ROBUST MODELS?

Table 4a reports the robust test accuracies of all models on both tasks. Importantly, except for Oracle DRO, none of the methods compared here requires any knowledge of the groups, in either the training or the validation data. We observe that in both settings P-DRO is able to achieve higher robust accuracy than ERM, Topic-CVaR and NonParam. This suggests that P-DRO is a useful option when no group information whatsoever is available. However, in practice, it may be feasible to annotate at least a small amount of data with group information. To emulate this scenario, we perform the same experiment, but assume that group annotations are available on the validation data, which we use to determine optimal stopping and hyper-parameters. Results for this setting are reported in Table 4b. We find that, while the use of robust validation accuracy yields more robust models even for ERM (especially on FDCL18), P-DRO is still the best alternative that does not require group annotation on the training data.

6. IMPLICATIONS AND OUTLOOK

We have shown that there is promise in using parametric families of neural generative models for defining the uncertainty set in distributionally robust optimization. While we only perform experiments on NLP tasks, this approach can, in theory, be applied in any modality and in future work we hope to pursue this direction. In such cases where good quality generative models are unavailable, or such model cannot produce densities efficiently, an interesting direction would be to model the likelihood ratio q ψ /p directly. This alternative formulation poses different implementation challenges, and we leave it as a promising avenue for future research. Mengzhou Xia, Anjalie Field, and Yulia Tsvetkov. Demoting racial bias in hate speech detection. In Proceedings of the 9th International Workshop on Natural Language Processing for Social Media (SocialNLP), pp. 7-14, 2020. URL https://www.aclweb.org/anthology/2020. socialnlp-1.2.

A DERIVATIONS

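A.1 REORGANIZING THE LAGRANGIAN L(ψ, τ)

Let us write the Lagrangian $\mathbb{L}$ explicitly:

$$
\begin{aligned}
\mathbb{L}(\psi, \tau) &= \mathbb{E}_{(x,y)\sim q_\psi} \frac{p(x, y)}{q_{\psi_0}(x, y)} \ell(x, y; \theta) - \tau \left(\mathrm{KL}(q_\psi \| p) - \kappa\right) && (12) \\
&= \mathbb{E}_{(x,y)\sim q_\psi} \frac{p(x, y)}{q_{\psi_0}(x, y)} \ell(x, y; \theta) - \tau\, \mathbb{E}_{(x,y)\sim q_\psi} \log \frac{q_\psi(x, y)}{p(x, y)} + \tau \kappa && (13) \\
&= \tau\, \mathbb{E}_{(x,y)\sim q_\psi} \log \left( \frac{p(x, y)\, e^{\frac{p(x,y)}{q_{\psi_0}(x,y)} \frac{\ell(x,y;\theta)}{\tau}}}{q_\psi(x, y)} \right) + \tau \kappa && (14) \\
&= \tau \left(\kappa - \mathrm{KL}(q_\psi \| q^*_{\tau,\theta})\right) + \tau \log \mathbb{E}_{(x,y)\sim p}\, e^{\frac{p(x,y)}{q_{\psi_0}(x,y)} \frac{\ell(x,y;\theta)}{\tau}} && (15)
\end{aligned}
$$

The last term is τ times the log-normalizer of $q^*_{\tau,\theta}$ and does not depend on ψ. This last step requires that the log moment generating function of $\ell$ under p exist for τ. In most scenarios we consider, $\ell$ is typically the negative log likelihood of a neural network model, which is generally bounded. Therefore the moment generating function is defined everywhere. Note that the KL term is the only one dependent on ψ, therefore maximizing $\mathbb{L}$ for ψ is equivalent to maximizing $-\mathrm{KL}(q_\psi \| q^*_{\tau,\theta})$, in other words minimizing $\mathrm{KL}(q_\psi \| q^*_{\tau,\theta})$.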

A.2 ENFORCING THE KL CONSTRAINT IN THE TOY SETTING

Even in this simplest setting, the exact KL between $q_\psi$ (a Gaussian) and p (a mixture of Gaussians) does not have an analytical expression (Hershey & Olsen, 2007). Instead, we fall back on enforcing the KL constraint between $q_\psi$ and $q_{\psi_0}$, both isotropic Gaussians with the same standard deviation. Let μ and $\mu_0 \in \mathbb{R}^2$ denote their respective means, and σ > 0 their standard deviation. In this context, their KL divergence reduces to:

$$\mathrm{KL}(q_\psi \| q_{\psi_0}) = \mathrm{KL}(q_{\psi_0} \| q_\psi) = \frac{1}{2\sigma^2} \|\mu - \mu_0\|^2.$$

In other words, the KL divergence is proportional to the squared Euclidean distance between the distributions' means. We use this fact to project ψ (in the KL sense) onto $B_\kappa = \{\psi' \mid \mathrm{KL}(q_{\psi'} \| q_{\psi_0}) \le \kappa\}$. For ψ outside $B_\kappa$ (points already inside the ball are left unchanged), the projection is

$$\mathrm{proj}_{B_\kappa}(\psi) := \arg\min_{\psi' \in B_\kappa} \mathrm{KL}(q_\psi \| q_{\psi'}) = \psi_0 + \frac{\sqrt{2\kappa}\, \sigma}{\|\psi - \psi_0\|} (\psi - \psi_0).$$
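A minimal sketch of this projection for the toy setting is given below, assuming the adversary parameter is simply the mean of an isotropic Gaussian with fixed standard deviation. The function names and numerical values are illustrative assumptions.

```python
import math

def kl_isotropic(mu, mu0, sigma):
    """KL between two isotropic Gaussians with equal standard deviation sigma."""
    return sum((a - b) ** 2 for a, b in zip(mu, mu0)) / (2 * sigma ** 2)

def project_onto_kl_ball(mu, mu0, sigma, kappa):
    """Project the mean mu onto {mu' : KL(q_mu' || q_mu0) <= kappa}."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu, mu0)))
    radius = math.sqrt(2 * kappa) * sigma   # KL <= kappa  <=>  dist <= radius
    if dist <= radius:
        return list(mu)                     # already inside the ball
    scale = radius / dist
    return [b + scale * (a - b) for a, b in zip(mu, mu0)]

mu0, sigma, kappa = [0.0, 0.0], 2.0, 0.5
outside = project_onto_kl_ball([3.0, 4.0], mu0, sigma, kappa)
inside = project_onto_kl_ball([1.0, 0.0], mu0, sigma, kappa)
```

With these values the ball has Euclidean radius 2, so the point (3, 4) is rescaled to (1.2, 1.6), which sits exactly on the boundary KL = κ, while (1, 0) is returned unchanged.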

B EXPERIMENTAL DETAILS

We describe some of the experimental settings for our NLP experiments in more detail. Further details can be found in our code release: https://github.com/pmichel31415/P-DRO.

B.1 MODEL SETTINGS

In all experiments, we split the text into sub-word tokens using the tokenizer described by Devlin et al. (2018). During training, we sample minibatches that contain at most 64 sentences or 2500 tokens, whichever is greater, in order to prevent GPU memory overflow in case of long sentences. We train all models with Adam (Kingma & Ba, 2014) with an initial learning rate of $2 \times 10^{-5}$, which we decay linearly at each step until the end of training. We validate the models every epoch. For BERT, we start from the bert-base-uncased checkpoint.
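As an illustration of the learning rate schedule, the sketch below computes a per-step linearly decayed rate. It assumes the rate reaches zero at the final step, which the text does not state explicitly; the function name and the step counts in the usage lines are hypothetical.

```python
def linear_decay_lr(step, total_steps, initial_lr=2e-5):
    """Learning rate after `step` updates, decayed linearly to zero at `total_steps`.

    The zero end point is an assumption; only the initial rate comes from the text.
    """
    clipped = min(max(step, 0), total_steps)   # keep the step inside [0, total_steps]
    return initial_lr * (1.0 - clipped / total_steps)

# Usage with a hypothetical budget of 1000 updates.
for step in (0, 500, 1000):
    print(step, linear_decay_lr(step, 1000))
```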

B.2 ADVERSARY SETTINGS

In all experiments, we use a Transformer model based on the GPT-2 architecture (Radford et al., 2019) to serve as the adversary. In order to initialize the adversary (to obtain $\psi_0$), we first pre-train the model on a generic, relatively large language modeling dataset, WikiText-103 (Merity et al., 2017). We also use a batch size of 64 samples or 2500 tokens, and train with Adam for 10 epochs, with a fixed learning rate of $3 \times 10^{-4}$. Then, we fine-tune this model on each dataset, this time minimizing the negative log-likelihood of the (x, y) pair (by introducing the special "[label]" token as described in Section B), using the same hyper-parameters but a smaller learning rate ($10^{-5}$). We find that, due to the small to medium size of the datasets under consideration, this LM pre-training step helps achieve lower error on the generative modeling task.

B.3.1 TOPIC CVAR

To train the topic model for Topic CVaR, we first pre-process the text by removing all punctuation, URLs and user mentions (for Twitter data). Importantly, we remove stop-words for our toxicity experiments but not for our BiasedSST experiment. This is because the distractor token we use ("so") belongs to most English stop-word lists, and removing it would completely prevent the topic model from picking up on the groups of interest. We then estimate the parameters of the model with Gensim (https://radimrehurek.com/gensim/) and use similar settings as Oren et al. (2019) (α = 0.1, β = 1.0), setting the number of topics to 10. For both Oracle DRO and Topic CVaR, we use the algorithm proposed in Sagawa et al. (2020) to estimate the worst-case group (either oracle group or topic in Topic CVaR) online during training. We perform grid-search over {1, 0.1, 0.01} to find the best learning rate for the group weights update. For Oracle DRO, the best model is simply selected by robust validation accuracy. For Topic CVaR, unless specified otherwise, we select the model with the lowest worst-case error over all topics.
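The online group-weight update can be sketched as follows. This is a simplified illustration in the spirit of Sagawa et al. (2020), assuming an exponentiated-gradient step on per-group average losses; the function names, the three groups, and the loss values are hypothetical, and the actual implementation may differ in its details.

```python
import numpy as np

def update_group_weights(weights, group_losses, lr):
    """One exponentiated-gradient step: up-weight groups with higher loss, then renormalize."""
    scaled = np.asarray(weights, dtype=float) * np.exp(lr * np.asarray(group_losses, dtype=float))
    return scaled / scaled.sum()

def weighted_loss(weights, group_losses):
    """Training objective for the model: group losses averaged under the current weights."""
    return float(np.dot(weights, group_losses))

# Usage with three hypothetical groups (oracle groups or topics) and illustrative losses.
weights = np.full(3, 1.0 / 3.0)
group_losses = np.array([0.2, 1.0, 0.5])
weights = update_group_weights(weights, group_losses, lr=1.0)
print(weights, weighted_loss(weights, group_losses))
```

The `lr` argument plays the role of the learning rate for the group weights update that is grid-searched over {1, 0.1, 0.01}: larger values concentrate the weights on the worst-performing group more quickly.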

B.3.2 NONPARAM

In the KL-constrained non-parametric setting, the min-max problem reads

$$\min_\theta \; \max_{q \;\mathrm{s.t.}\; \mathrm{KL}(q \,\|\, p) \le \kappa} \; \mathbb{E}_{(x,y)\sim q}\, \ell(x, y, \theta). \tag{16}$$

Here, $\kappa$ is the desired radius of the KL ball, and is treated as a hyper-parameter. The inner maximum has an analytical solution of the form $q^*_\theta = \frac{1}{Z_{\theta,\tau^*}}\, p(x,y)\, e^{\frac{\ell(x,y;\theta)}{\tau^*}}$, with $Z_{\theta,\tau^*} = \mathbb{E}_p\, e^{\frac{\ell(x,y;\theta)}{\tau^*}}$ and $\tau^*$ such that

$$\mathrm{KL}(q^* \,\|\, p) = \mathbb{E}_p\left[\frac{e^{\frac{\ell(x,y;\theta)}{\tau^*}}}{Z_{\theta,\tau^*}}\left(\frac{\ell(x,y;\theta)}{\tau^*} - \log Z_{\theta,\tau^*}\right)\right] = \kappa.$$

Note that computing both $Z_{\theta,\tau^*}$ and $\mathrm{KL}(q^* \,\|\, p)$ requires taking expectations over $p$. In our setting, where $\ell(x,y;\theta)$ is the output of a large neural network, we cannot afford to take this expectation over the entire training data at each step. Instead, we fall back to taking the average over each minibatch. We find $\tau^*$ with binary search in $\log_{10}$ space within the $[10^{-10}, 10^{10}]$ interval and clip to the lowest or highest value should the result lie outside the search interval.

$\sum_{x,y \in D_{\text{valid}}} \frac{q_\psi(x,y)}{p(x,y)} \log \frac{q_\psi(x,y)}{p(x,y)}$. Similarly to Section 3, we approximate the (unknown) likelihood ratio $\frac{q_\psi(x,y)}{p(x,y)}$ with $\frac{q_\psi(x,y)}{q_{\psi_0}(x,y)}$. We want to reject all adversaries where this approximated KL is greater than some threshold, $\kappa_{\text{valid}}$, but how do we choose a good value for $\kappa_{\text{valid}}$? Consider an adversary which selects a fraction of the validation data of size $\alpha |D_{\text{valid}}|$ for some $\alpha \in (0, 1]$. In such a case, the likelihood ratio is $1/\alpha$ on this subset and 0 everywhere else, and the resulting KL estimate will be $\log \frac{1}{\alpha}$. In other words, choosing a threshold of $\kappa_{\text{valid}}$ means allowing the adversary to potentially select any subset containing at least a fraction $e^{-\kappa_{\text{valid}}}$ of the original data. Our heuristic choice, $\log 10$, corresponds to allowing subsets of size at least 10% of $|D_{\text{valid}}|$. Of course, this is only a heuristic because the adversary can reweight the validation set non-uniformly.

To assess the effect of $\kappa_{\text{valid}}$ on Greedy-Minmax, we compute the average robust validation error of the selected model across 5 runs for 3 different values of the adversary's learning rate. Results on BiasedSST, depicted in Figure 3, show that adversaries with a higher learning rate are more sensitive to the choice of threshold, but all values of $\kappa_{\text{valid}}$ between $\log 5$ and $\log 20$ seem to work for these settings.
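Two computations from the passages above can be sketched together: the minibatch search for the temperature in the non-parametric baseline, and the subset argument behind the validation KL threshold. This is a hedged illustration, not the released code. It assumes the minibatch is weighted uniformly, that the KL estimate is normalized by the size of the validation set, and that bisection is an acceptable stand-in for the binary search; all function names and numeric inputs are hypothetical.

```python
import numpy as np

def kl_tilted_vs_uniform(losses, tau):
    """Minibatch KL(q* || p) with q*_i proportional to exp(loss_i / tau) and uniform p."""
    a = np.asarray(losses, dtype=float) / tau
    a = a - a.max()                               # shift for numerical stability
    log_q = a - np.log(np.sum(np.exp(a)))         # log of the normalized weights
    return float(np.sum(np.exp(log_q) * log_q) + np.log(a.size))

def find_tau(losses, kappa, lo=-10.0, hi=10.0, iters=100):
    """Bisection on log10(tau) within [lo, hi]; the KL decreases as tau grows."""
    if kl_tilted_vs_uniform(losses, 10.0 ** lo) <= kappa:
        return 10.0 ** lo                         # clip to the lowest value
    if kl_tilted_vs_uniform(losses, 10.0 ** hi) >= kappa:
        return 10.0 ** hi                         # clip to the highest value
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_tilted_vs_uniform(losses, 10.0 ** mid) > kappa:
            lo = mid                              # too peaked: raise the temperature
        else:
            hi = mid
    return 10.0 ** (0.5 * (lo + hi))

def kl_from_ratios(ratios):
    """Average of r * log(r) over the validation set, with 0 * log(0) taken as 0."""
    r = np.asarray(ratios, dtype=float)
    positive = r > 0
    return float(np.sum(r[positive] * np.log(r[positive])) / r.size)

# Temperature search on a hypothetical minibatch of four losses.
batch_losses = [0.1, 0.5, 2.0, 3.0]
tau_star = find_tau(batch_losses, kappa=0.1)
print(tau_star, kl_tilted_vs_uniform(batch_losses, tau_star))

# Subset adversary: ratio 1/alpha on a fraction alpha = 10% of 1000 validation points.
ratios = np.zeros(1000)
ratios[:100] = 10.0
print(kl_from_ratios(ratios), np.log(10.0))
```

The second part makes the heuristic concrete: an adversary concentrating uniformly on 10% of the validation data sits exactly at the log 10 threshold.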

C.2 P-DRO EXPERIMENTS WITH AN LSTM ADVERSARY

We replicate the BiasedSST experiments in Section 4, but this time using a smaller generative model, which is unlikely to generate good samples. Specifically, we use a one-layer LSTM model (Hochreiter & Schmidhuber, 1997) with embedding and hidden dimension 256. We only perform grid-search over $\lambda \in \{10^{-5}, 10^{-4}, 10^{-3}\}$ and select the best with Minmax. Once pre-trained on the BiasedSST dataset, this model achieves a perplexity of 227.0, more than 4 times worse than the transformer model we use in other experiments (49.8). However, as evidenced by its robust accuracy displayed in Table 5, P-DRO is still able to learn a robust model. We take this as evidence that the re-weighting introduced in Section 3 helps stabilize training even when $q_\psi$ is not a perfect model of the data.

We study the influence of the 3 hyper-parameters $\tau$ (temperature), $k$ (size of the renormalization window) and $\lambda$ (learning rate of the adversary) on the performance of P-DRO. All experiments are run on the BiasedSST dataset, and the analysis proceeds as follows: starting from the configuration $\tau = 0.01$, $k = 5$ and $\lambda = 10^{-4}$, we vary each of the hyper-parameters independently. We report two numbers for each configuration: robust accuracy of the best model using Greedy-Minmax stopping and using Oracle stopping. The latter is useful to disentangle the effect of the stopping criterion. As seen in the results shown in Table 6, we find that $\tau$ has the least effect on robust accuracies. While the renormalization window parameter $k$ has some effect on optimal stopping, the best robust accuracy achieved by the model (with oracle stopping) varies little. We observe the adversary's learning rate $\lambda$ to be the most sensitive hyper-parameter, which is why we restrict our grid-search to $\lambda$ in Section 5.



Code to reproduce our experiments can be found at https://github.com/pmichel31415/P-DRO

For instance: $\mathrm{KL}(q \,\|\, p) < +\infty$ implies that $q$ stays within the support of $p$.

Nash equilibria (Osborne & Rubinstein, 1994) can be thought of as the game-theoretic analog of global minima in optimization.

Note that substituting the empirical distribution $\hat{p}$ for $p$ poses issues here, because $q_\psi$ is not absolutely continuous with respect to $\hat{p}$.

For instance, the optimum of the reverse KL doesn't necessarily match that of the forward KL within the parametric confusion set $Q$.

To simplify notation, this additional constraint is implicit in the rest of this section.

https://radimrehurek.com/gensim/



Figure 1: Summary of P-DRO: At every step of training, (x, y) pairs are sampled from the data distribution p and fed to both the model θ and the adversary ψ. For every sample, the model produces loss values $\ell(x, y; \theta)$ and the adversary produces densities $q_\psi(x, y)$. Both are combined into $\mathcal{L}_{\text{model}}$ and $\mathcal{L}_{\text{adv}}$, which are used to update θ and ψ, respectively, via simultaneous gradient updates.

Figure 2: A toy classification task.

et al. (2018) for details) with Z θ,τ * = E p e (x,y;θ) τ * and τ * such that KL(q

Figure 3: Evolution of the robust validation accuracy of the model selected by Greedy-Minmax as a function of the KL threshold $\kappa_{\text{valid}}$.

Average and robust accuracies on BiasedSST. Underlining indicates statistically significant difference compared to ERM (p < 0.05)

Effect of different optimal stopping and hyper-parameter selection strategies on robust validation accuracy.

Ablation of P-DRO to train the linear model on the toy task. We report accuracy on both domains, as well as robust accuracy.

Robust test accuracy on the DWMW17 and FDCL18 toxicity detection tasks. Oracle DRO 74.50 ± 1.74 65.79 ± 0.76 55.23 ± 3.97 72.43 ± 2.61

Average and robust accuracies on BiasedSST when P-DRO is trained with an LSTM adversary. Underlining indicates statistically significant difference compared to ERM (p < 0.05)

Effect of hyper-parameters on robust validation accuracy on BiasedSST

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous reviewers for their insightful feedback which helped improve the paper to its current version. In addition, this paper greatly benefited from discussion and feedback from various colleagues at CMU, in particular Chunting Zhou, Haohan Wang, Zachary Lipton and Zico Kolter. This work was supported by a Facebook Sponsored Research Award and by the DARPA GAILA project (award HR00111990063). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsors.

