TILTED EMPIRICAL RISK MINIMIZATION

Abstract

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework: tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.

1. INTRODUCTION

Many statistical estimation procedures rely on the concept of empirical risk minimization (ERM), in which the parameter of interest, $\theta \in \Theta \subseteq \mathbb{R}^d$, is estimated by minimizing an average loss over the data:

$$R(\theta) := \frac{1}{N} \sum_{i \in [N]} f(x_i; \theta).$$

In this paper, we suggest that deficiencies in ERM can be flexibly addressed via a unified framework, tilted empirical risk minimization (TERM). TERM encompasses a family of objectives, parameterized by a real-valued hyperparameter, $t$. For $t \in \mathbb{R} \setminus \{0\}$, the $t$-tilted loss (TERM objective) is given by:

$$\widetilde{R}(t; \theta) := \frac{1}{t} \log\left(\frac{1}{N} \sum_{i \in [N]} e^{t f(x_i; \theta)}\right).$$

TERM generalizes ERM, as the 0-tilted loss recovers the average loss, i.e., $\widetilde{R}(0; \theta) = R(\theta)$.¹ It also recovers other popular alternatives, such as the max-loss ($t \to +\infty$) and the min-loss ($t \to -\infty$) (Lemma 2). For $t > 0$, the objective is a common form of exponential smoothing, used to approximate the max (Kort & Bertsekas, 1972; Pee & Royset, 2011). Variants of tilting have been studied in several contexts,



¹ $\widetilde{R}(0; \theta)$ is defined in (14) via the continuous extension of $\widetilde{R}(t; \theta)$.
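The aggregation behavior of the $t$-tilted loss can be sketched numerically. The helper below is an illustrative NumPy implementation (not the authors' code) that evaluates $\widetilde{R}(t; \theta)$ on a vector of per-example losses $f(x_i; \theta)$, using the log-sum-exp trick so that large $|t|$ does not overflow, and handling $t = 0$ via the continuous extension (the plain average):

```python
import numpy as np

def tilted_loss(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * losses))).

    For t = 0, returns the continuous extension, i.e. the average loss.
    Uses a log-sum-exp shift for numerical stability at large |t|.
    """
    losses = np.asarray(losses, dtype=float)
    if t == 0:
        return losses.mean()
    z = t * losses
    m = z.max()  # shift so the largest exponent is 0 before exponentiating
    return (m + np.log(np.mean(np.exp(z - m)))) / t

losses = [0.1, 0.5, 2.0]
# t -> -inf approaches the min-loss, t = 0 gives the average loss,
# and t -> +inf approaches the max-loss:
for t in (-100, 0, 100):
    print(t, tilted_loss(losses, t))
```

As the printed values suggest, sweeping $t$ interpolates smoothly between the min-loss, the average loss, and the max-loss, which is the tuning knob the framework exploits.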

While ERM is widely used and has nice statistical properties, it can perform poorly in situations where average performance is not an appropriate surrogate for the problem of interest. Significant research has thus been devoted to developing alternatives to traditional ERM for diverse applications, such as learning in the presence of noisy/corrupted data (Jiang et al., 2018; Khetan et al., 2018), performing classification with imbalanced data (Lin et al., 2017; Malisiewicz et al., 2011), ensuring that subgroups within a population are treated fairly (Hashimoto et al., 2018; Samadi et al., 2018), or developing solutions with favorable out-of-sample performance (Namkoong & Duchi, 2017).

