TILTED EMPIRICAL RISK MINIMIZATION

Abstract

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework: tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.

1. INTRODUCTION

Many statistical estimation procedures rely on the concept of empirical risk minimization (ERM), in which the parameter of interest, θ ∈ Θ ⊆ ℝ^d, is estimated by minimizing an average loss over the data:

R(θ) := (1/N) Σ_{i∈[N]} f(x_i; θ).    (1)

While ERM is widely used and has nice statistical properties, it can perform poorly in situations where average performance is not an appropriate surrogate for the problem of interest. Significant research has thus been devoted to developing alternatives to traditional ERM for diverse applications, such as learning in the presence of noisy/corrupted data (Jiang et al., 2018; Khetan et al., 2018), performing classification with imbalanced data (Lin et al., 2017; Malisiewicz et al., 2011), ensuring that subgroups within a population are treated fairly (Hashimoto et al., 2018; Samadi et al., 2018), or developing solutions with favorable out-of-sample performance (Namkoong & Duchi, 2017).

In this paper, we suggest that deficiencies in ERM can be flexibly addressed via a unified framework, tilted empirical risk minimization (TERM). TERM encompasses a family of objectives, parameterized by a real-valued hyperparameter, t. For t ∈ ℝ \ {0}, the t-tilted loss (TERM objective) is given by:

R̃(t; θ) := (1/t) log( (1/N) Σ_{i∈[N]} e^{t f(x_i; θ)} ).    (2)

TERM generalizes ERM, as the 0-tilted loss recovers the average loss, i.e., R̃(0; θ) = R(θ).¹ It also recovers other popular alternatives such as the max-loss (t → +∞) and the min-loss (t → −∞) (Lemma 2). For t > 0, the objective is a common form of exponential smoothing, used to approximate the max (Kort & Bertsekas, 1972; Pee & Royset, 2011). Variants of tilting have been studied in several contexts (…et al., 2018). However, despite the rich history of tilted objectives, they have not seen widespread use in machine learning.

* Equal contribution.
¹ R̃(0; θ) is defined in (14) via the continuous extension of R̃(t; θ).
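To make the definition in (2) concrete, the following is a minimal, numerically stable sketch of the t-tilted loss in Python (the function name and example losses are illustrative, not from the paper); the log-sum-exp shift avoids overflow for large |t|:

```python
import numpy as np

def tilted_risk(losses, t):
    """t-tilted loss from (2): (1/t) * log( mean(exp(t * losses)) ).

    Computed with a log-sum-exp shift for numerical stability; the
    t = 0 case is the ordinary average loss, i.e., ERM as in (1)."""
    losses = np.asarray(losses, dtype=float)
    if t == 0.0:
        return losses.mean()
    z = t * losses
    m = z.max()  # shift so the largest exponent is exp(0)
    return (m + np.log(np.mean(np.exp(z - m)))) / t

losses = [0.1, 0.2, 0.3, 5.0]  # one large "outlier" loss
print(tilted_risk(losses, 0.0))     # average loss (ERM)
print(tilted_risk(losses, 100.0))   # approaches the max-loss
print(tilted_risk(losses, -100.0))  # approaches the min-loss
```

As t sweeps from −∞ to +∞, the value moves monotonically from the min-loss through the mean to the max-loss, matching the limits stated above.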
In this work, we aim to bridge this gap by: (i) rigorously studying the objective in a general form, and (ii) exploring its utility for a number of ML applications. Surprisingly, we find that this simple extension to ERM is competitive for a wide range of problems.

To highlight how the TERM objective can help with issues such as outliers or imbalanced classes, we discuss three motivating examples below, which are illustrated in Figure 1.

(a) Point estimation: As a first example, consider determining a point estimate from a set of samples that contain some outliers. We plot an example 2D dataset in Figure 1a, with data centered at (1, 1). Using traditional ERM (i.e., TERM with t = 0) recovers the sample mean, which can be biased towards outlier data. By setting t < 0, TERM can suppress outliers by reducing the relative impact of the largest losses (i.e., points that are far from the estimate) in (2). A specific value of t < 0 can in fact approximately recover the geometric median, as the objective in (2) can be viewed as approximately optimizing specific loss quantiles (a connection which we make explicit in Section 2). In contrast, if these 'outlier' points are important to estimate, setting t > 0 will push the solution towards a point that aims to minimize variance, as we prove more rigorously in Section 2, Theorem 4.

(b) Linear regression: A similar interpretation holds for the case of linear regression (Figure 1b). As t → −∞, TERM finds a line of best fit while ignoring outliers. However, this solution may not be preferred if we have reason to believe that these 'outliers' should not be ignored. As t → +∞, TERM recovers the min-max solution, which aims to minimize the worst loss, thus ensuring the model is a reasonable fit for all samples (at the expense of possibly being a worse fit for many).
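A minimal sketch of the point-estimation example, under the assumption of squared-error losses f(x_i; θ) = ||x_i − θ||²/2 (the dataset, step size, and function name below are illustrative): gradient descent on the tilted risk with t < 0 downweights the outliers' gradients and lands near (1, 1), whereas t = 0 yields the outlier-biased sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
# inliers near (1, 1) plus a few outliers far away
data = np.vstack([rng.normal([1.0, 1.0], 0.1, size=(50, 2)),
                  rng.normal([8.0, 8.0], 0.1, size=(5, 2))])

def term_estimate(data, t, steps=2000, lr=0.1):
    """Gradient descent on the t-tilted risk of squared-error losses.

    The tilted gradient is a weighted mean of per-sample gradients with
    weights proportional to exp(t * f_i); t < 0 downweights large losses."""
    theta = data.mean(axis=0)  # start from the ERM solution (sample mean)
    for _ in range(steps):
        f = 0.5 * ((data - theta) ** 2).sum(axis=1)  # per-sample losses
        w = np.exp(t * (f - f.max()))                # stable tilt weights
        w /= w.sum()
        grad = (w[:, None] * (theta - data)).sum(axis=0)
        theta -= lr * grad
    return theta

print(term_estimate(data, t=0.0))   # ~ sample mean, pulled toward outliers
print(term_estimate(data, t=-2.0))  # ~ (1, 1): outliers suppressed
```

With t = 0 the weights are uniform and the iterate stays at the sample mean; with t = −2 the outliers' weights are vanishingly small, so the estimate tracks the inlier cluster.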
Similar criteria have been used, e.g., in defining notions of fairness (Hashimoto et al., 2018; Samadi et al., 2018). We explore several use-cases involving robust regression and fairness in more detail in Section 5.

(c) Logistic regression: Finally, we consider a binary classification problem using logistic regression (Figure 1c). For t ∈ ℝ, the TERM solution varies from the nearest cluster center (t → −∞), to the logistic regression classifier (t = 0), towards a classifier that magnifies the misclassified data (t → +∞). We note that it is common to modify logistic regression classifiers by adjusting the decision threshold from 0.5, which is equivalent to moving the intercept of the decision boundary. This is fundamentally different from what TERM offers (where the slope changes). As we show in Section 5, this added flexibility affords TERM competitive performance on a number of classification problems, such as those involving noisy data, class imbalance, or a combination of the two.

Contributions. In this work, we explore TERM as a simple, unified framework to flexibly address various challenges with empirical risk minimization. We first analyze the objective and its solutions, showcasing the behavior of TERM with varying t (Section 2). Our analysis provides novel connections between tilted objectives and superquantile methods. We develop efficient methods for solving TERM (Section 4), and show via numerous case studies that TERM is competitive with existing, problem-specific state-of-the-art solutions (Section 5). We also extend TERM to handle compound issues, such as the simultaneous existence of noisy samples and imbalanced classes (Section 3). Our results demonstrate the effectiveness and versatility of tilted objectives in machine learning.
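The first-order methods mentioned above rest on a simple identity: the gradient of the tilted objective (2) is an average of per-sample gradients reweighted by softmax(t · losses). A minimal sketch, checked against a finite-difference gradient on a hypothetical 1-D quadratic problem (the data points a and the evaluation point theta are illustrative):

```python
import numpy as np

def tilted_grad(losses, grads, t):
    """Gradient of the t-tilted risk: a reweighted average of the
    per-sample gradients, with weights softmax(t * losses)."""
    z = t * np.asarray(losses, dtype=float)
    w = np.exp(z - z.max())  # stable softmax weights
    w /= w.sum()
    return grads.T @ w

# Hypothetical 1-D problem with per-sample losses f_i(theta) = (theta - a_i)^2 / 2
a = np.array([0.0, 1.0, 4.0])
theta, t = 0.5, 2.0

def tilted_risk(th):
    z = t * 0.5 * (th - a) ** 2
    return (z.max() + np.log(np.mean(np.exp(z - z.max())))) / t

analytic = tilted_grad(0.5 * (theta - a) ** 2, (theta - a).reshape(-1, 1), t)[0]
eps = 1e-6
numeric = (tilted_risk(theta + eps) - tilted_risk(theta - eps)) / (2 * eps)
```

Because the tilted gradient is just a reweighted average, any ERM-style batch or stochastic solver can be adapted to TERM by maintaining these softmax weights.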



Figure 1: Toy examples illustrating TERM as a function of t: (a) finding a point estimate from a set of 2D samples, (b) linear regression with outliers, and (c) logistic regression with imbalanced classes. While positive values of t magnify outliers, negative values suppress them. Setting t = 0 recovers the original ERM objective (1).

