MODELING THE SECOND PLAYER IN DISTRIBUTIONALLY ROBUST OPTIMIZATION

Abstract

Distributionally robust optimization (DRO) provides a framework for training machine learning models that perform well on a collection of related data distributions (the "uncertainty set"). This is done by solving a min-max game: the model is trained to minimize its maximum expected loss among all distributions in the uncertainty set. While careful design of the uncertainty set is critical to the success of the DRO procedure, previous work has been limited to relatively simple alternatives that keep the min-max optimization problem exactly tractable, such as f-divergence balls. In this paper, we argue instead for the use of neural generative models to characterize the worst-case distribution, allowing for more flexible and problem-specific selection of the uncertainty set. However, while simple conceptually, this approach poses a number of implementation and optimization challenges. To circumvent these issues, we propose a relaxation of the KL-constrained inner maximization objective that makes the DRO problem more amenable to gradient-based optimization of large-scale generative models, and develop model selection heuristics to guide hyper-parameter search. On both toy settings and realistic NLP tasks, we find that the proposed approach yields models that are more robust than comparable baselines.¹

1. INTRODUCTION

Machine learning models trained with empirical risk minimization (ERM) are able to achieve high aggregate performance on data sampled from their training distribution. However, they often exhibit drops in accuracy when confronted with data from domains that are under-represented in their training data, such as those of a different topic (Gururangan et al., 2020), sociolect (Blodgett et al., 2016), accent (Amodei et al., 2016) or writer age (Hovy & Søgaard, 2015) in language processing tasks, or skin color (Grother et al., 2019) or lighting (Georghiades et al., 2001) in image processing tasks. This is a particularly egregious issue in applications where higher error rates can have far-reaching negative implications, such as the silencing of under-represented minorities in toxicity detection systems (Dixon et al., 2018) or disparity-amplifying feedback loops in credit rating models (Fuster et al., 2018). This behaviour often arises from the objective function of ERM, where the parameters θ of the model are learned by minimizing the expectation of a loss function ℓ under a data distribution p (or, in practice, the associated empirical data distribution p̂):

L_ERM(θ) = E_(x,y)∼p̂ [ℓ(x, y; θ)].   (1)

When the model encounters data sampled from a different distribution q_test ≠ p, performance can suffer significantly. Distributionally robust optimization (DRO; Ben-Tal et al., 2013b) provides a natural solution to this issue by replacing the expected risk under a single distribution p with the worst expected risk over a pre-determined family of distributions Q (the "uncertainty set"):

L_DRO(θ) = max_{q∈Q} E_(x,y)∼q [ℓ(x, y; θ)].   (2)

If Q contains q_test, the DRO objective upper-bounds the expected risk under q_test. However, a priori knowledge of possible test distributions is not always available or easy to acquire. For example, training a model to be robust to some demographic attributes (Q = {q_demographic_1, q_demographic_2, . . .}) requires collecting and annotating data with the necessary information, an expensive and ethically fraught endeavour. In the absence of such information, one has to resort to defining the uncertainty set analytically, drawing on one's intuition of what constitutes a possible test distribution given the observed training distribution, such as using moment constraints (Delage & Ye, 2010; Nguyen et al., 2020), f-divergence balls (Ben-Tal et al., 2013a; Hu & Hong, 2013; Faury et al., 2020), Wasserstein/IPM balls (Sinha et al., 2018; Husain, 2020), or coarse-grained mixture models (Oren et al., 2019; Hu et al., 2018). However, the need to keep the inner supremum in Eq. (2) tractable limits the possible choices.

In this paper, we propose that the uncertainty set instead be defined as a family of parametric generative models. The resulting DRO objective (§2) is a differentiable game with two players: the original model ℓ(x, y; θ) and a model of its worst-case distribution q_ψ(x, y), the titular "second player", which we hereafter refer to as the adversary. Using this formulation, which we call Parametric DRO (P-DRO), allows for more flexibility in the choice of the adversary's architecture (and hence of the uncertainty set). Unfortunately, finding a solution to this game via direct application of simultaneous gradient descent (Singh et al., 2000) is difficult (Balduzzi et al., 2018). In particular, direct gradient descent over the uncertainty set suffers from instability due to the large variance of the gradients (Greensmith et al., 2004), and hyper-parameter selection is not straightforward.

To address these challenges, we make two main contributions (§3). First, we propose a new relaxation of the DRO game's KL-constrained inner maximization problem. The resulting objective is more amenable to simultaneous gradient updates than the original zero-sum game and significantly improves training stability, while still yielding useful adversaries. Second, we develop a principled approach for selecting hyper-parameters: we leverage the learned adversaries to decide which of any two given models trained with P-DRO is more robust. We conduct an in-depth set of experiments analyzing the effect of our proposed changes on both a toy task and a more realistic, yet still synthetic, sentiment classification task (§4). Finally, we show that in the more realistic setting of toxicity detection, P-DRO yields models that are more robust to changes in demographic groups, even though these groups are unknown at training time, opening up applications in combating dataset bias (§5).
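To make the contrast between the ERM objective of Eq. (1) and the DRO objective of Eq. (2) concrete, the following minimal numpy sketch computes both for the special case where the uncertainty set is a finite collection of group distributions (each q_k the empirical distribution of one group). All names and the toy numbers are illustrative, not the paper's implementation:

```python
import numpy as np

def erm_loss(losses, weights):
    # Eq. (1): expected loss under the (empirical) training distribution p-hat.
    return np.average(losses, weights=weights)

def dro_loss(losses, group_ids):
    # Eq. (2) with a finite uncertainty set Q = {q_1, ..., q_K}: the objective
    # is the worst (largest) per-group expected loss.
    groups = np.unique(group_ids)
    return max(losses[group_ids == g].mean() for g in groups)

# Toy example: a majority group with low loss, a minority group with high loss.
losses = np.array([0.1, 0.2, 0.1, 1.5, 1.4])
group_ids = np.array([0, 0, 0, 1, 1])
weights = np.ones_like(losses) / len(losses)

print(erm_loss(losses, weights))    # aggregate risk looks benign
print(dro_loss(losses, group_ids))  # worst-group risk exposes the gap
```

A model optimized for the first quantity can look good in aggregate while performing poorly on the minority group; the second quantity makes that failure mode explicit in the training objective.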

2. PARAMETERIZING THE UNCERTAINTY SET

Consider a model with loss ℓ(x, y; θ) parameterized by θ ∈ R^d_model. Minimizing the DRO objective described in Eq. (2) over the uncertainty set Q turns the optimization problem into the min-max (or zero-sum) game

min_{θ∈R^d_model} max_{q∈Q} E_(x,y)∼q [ℓ(x, y; θ)].   (3)

The first player controls the parameters θ, whilst the second player controls the worst-case distribution q. In the absence of explicit information on groups of interest (such as demographics, domain, etc.), an adequate choice of the uncertainty set Q is critical to the success of DRO, and is in fact very much an active area of research (Sinha et al., 2018; Duchi & Namkoong, 2018; Oren et al., 2019; see Rahimian & Mehrotra, 2019 for a survey). Q must be sufficiently large to contain test distributions of interest, but if it is too large it may contain "adversarial" distributions on which no model can perform well. Moreover, the design of Q is also circumscribed by the necessity of keeping the min-max problem tractable, particularly in the context of stochastic optimization. In Hu & Hong (2013) and Duchi et al. (2016), for example, the choice of f-divergence balls allows the use of duality arguments to reformulate Eq. (3) as a more manageable min-min problem. Others, like Hu et al. (2018) or Oren et al. (2019), propose using mixture models, whose simplicity enables them to solve the inner maximization problem efficiently.

Instead, we propose to explicitly model the second player in the DRO game as a parametric model q_ψ of the data. Of course, not all parameterizations ψ ∈ R^d_adv of a given generative model represent useful distributions, and we require that the adversary stay "close" to the underlying true data distribution p. As a measure of distance between q_ψ and p, we choose the KL divergence (Kullback & Leibler, 1951) due to its wide acceptance in the machine learning community, as well as its appealing properties.

¹Code to reproduce our experiments can be found at https://github.com/pmichel31415/P-DRO
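The two-player dynamics can be sketched with simultaneous gradient updates on a KL-regularized version of Eq. (3). The snippet below is a deliberately simplified stand-in for the paper's method: the adversary q_ψ is a categorical softmax reweighting of the n training points (rather than a neural generative model), the KL constraint is replaced by a penalty with coefficient tau, and the inner model is logistic regression. All function names and hyper-parameter values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(psi):
    z = np.exp(psi - psi.max())
    return z / z.sum()

def pdro_sketch(X, y, tau=2.0, lr=0.1, steps=1000):
    """Simultaneous gradient play on the KL-penalized game
        min_theta max_psi  E_{q_psi}[loss(x, y; theta)] - tau * KL(q_psi || p_hat),
    where q_psi = softmax(psi) reweights the n training points, p_hat is the
    uniform empirical distribution, and labels y are in {-1, +1}."""
    n, d = X.shape
    theta, psi = np.zeros(d), np.zeros(n)
    p_hat = np.full(n, 1.0 / n)
    for _ in range(steps):
        margins = y * (X @ theta)
        losses = np.log1p(np.exp(-margins))   # per-example logistic loss
        q = softmax(psi)                      # adversary's current distribution
        # Model player: descend the gradient of E_q[loss].
        grad_theta = -(q * y * sigmoid(-margins)) @ X
        theta -= lr * grad_theta
        # Adversary: ascend the gradient of E_q[loss] - tau * KL(q || p_hat),
        # via the softmax Jacobian: dJ/dpsi_j = q_j * (b_j - E_q[b]).
        b = losses - tau * np.log(q / p_hat)
        psi += lr * (q * (b - q @ b))
    return theta, softmax(psi)
```

On a sample where one "minority" point is persistently misclassified, the adversary ends up placing more than the uniform 1/n mass on it, so the model's updates pay increased attention to that point; larger tau pulls q_ψ back toward the empirical distribution and recovers ERM in the limit. This toy setting also illustrates why the paper's setting is harder: with a neural generative adversary, the expectation under q_ψ must itself be estimated, and the resulting score-function gradients are high-variance, motivating the relaxation proposed in §3.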

