TILTED EMPIRICAL RISK MINIMIZATION

Abstract

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified frameworktilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variancereduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic firstorder optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.

1. INTRODUCTION

Many statistical estimation procedures rely on the concept of empirical risk minimization (ERM), in which the parameter of interest, θPΘĎR d , is estimated by minimizing an average loss over the data: Rpθq :" 1 N ÿ iPrN s f px i ; θq . ( ) While ERM is widely used and has nice statistical properties, it can perform poorly in situations where average performance is not an appropriate surrogate for the problem of interest. Significant research has thus been devoted to developing alternatives to traditional ERM for diverse applications, such as learning in the presence of noisy/corrupted data (Jiang et al., 2018; Khetan et al., 2018) , performing classification with imbalanced data (Lin et al., 2017; Malisiewicz et al., 2011) , ensuring that subgroups within a population are treated fairly (Hashimoto et al., 2018; Samadi et al., 2018) , or developing solutions with favorable out-of-sample performance (Namkoong & Duchi, 2017) . In this paper, we suggest that deficiencies in ERM can be flexibly addressed via a unified framework, tilted empirical risk minimization (TERM). TERM encompasses a family of objectives, parameterized by a real-valued hyperparameter, t. For t P R z0 , the t-tilted loss (TERM objective) is given by: r Rpt; θq :" 1 t log ˆ1 N ÿ iPrN s e tf pxi;θq ˙. (2) TERM generalizes ERM as the 0-tilted loss recovers the average loss, i.e., r Rp0, θq"Rpθq. 1 It also recovers other popular alternatives such as the max-loss (tÑ`8) and min-loss (tÑ´8) (Lemma 2). For tą0, the objective is a common form of exponential smoothing, used to approximate the max (Kort & Bertsekas, 1972; Pee & Royset, 2011) . Variants of tilting have been studied in several contexts, ˚Equal contribution. 1 r Rp0; θq is defined in ( 14) via the continuous extension of Rpt; θq. including robust regression (Wang et al., 2013) ptă0q, importance sampling (Wainwright et al., 2005) , sequential decision making (Howard & Matheson, 1972; Nass et al., 2019) , and large deviations theory (Beirami et al., 2018) . However, despite the rich history of tilted objectives, they have not seen widespread use in machine learning. In this work, we aim to bridge this gap by: (i) rigorously studying the objective in a general form, and (ii) exploring its utility for a number of ML applications. Surprisingly, we find that this simple extension to ERM is competitive for a wide range of problems. To highlight how the TERM objective can help with issues such as outliers or imbalanced classes, we discuss three motivating examples below, which are illustrated in Figure 1 . (a) Point estimation: As a first example, consider determining a point estimate from a set of samples that contain some outliers. We plot an example 2D dataset in Figure 1a , with data centered at (1,1). Using traditional ERM (i.e., TERM with t " 0) recovers the sample mean, which can be biased towards outlier data. By setting t ă 0, TERM can suppress outliers by reducing the relative impact of the largest losses (i.e., points that are far from the estimate) in (2). A specific value of t ă 0 can in fact approximately recover the geometric median, as the objective in (2) can be viewed as approximately optimizing specific loss quantiles (a connection which we make explicit in Section 2). In contrast, if these 'outlier' points are important to estimate, setting t ą 0 will push the solution towards a point that aims to minimize variance, as we prove more rigorously in Section 2, Theorem 4. (b) Linear regression: A similar interpretation holds for the case of linear regression (Figure 2b ). As t Ñ ´8, TERM finds a line of best while ignoring outliers. However, this solution may not be preferred if we have reason to believe that these 'outliers' should not be ignored. As t Ñ `8, TERM recovers the min-max solution, which aims to minimize the worst loss, thus ensuring the model is a reasonable fit for all samples (at the expense of possibly being a worse fit for many). Similar criteria have been used, e.g., in defining notions of fairness (Hashimoto et al., 2018; Samadi et al., 2018) . We explore several use-cases involving robust regression and fairness in more detail in Section 5. (c) Logistic regression: Finally, we consider a binary classification problem using logistic regression (Figure 2c ). For t P R, the TERM solution varies from the nearest cluster center (tÑ´8), to the logistic regression classifier (t"0), towards a classifier that magnifies the misclassified data (tÑ`8). We note that it is common to modify logistic regression classifiers by adjusting the decision threshold from 0.5, which is equivalent to moving the intercept of the decision boundary. This is fundamentally different than what is offered by TERM (where the slope is changing). As we show in Section 5, this added flexibility affords TERM with competitive performance on a number of classification problems, such as those involving noisy data, class imbalance, or a combination of the two. Contributions. In this work, we explore TERM as a simple, unified framework to flexibly address various challenges with empirical risk minimization. We first analyze the objective and its solutions, showcasing the behavior of TERM with varying t (Section 2). Our analysis provides novel connections between tilted objectives and superquantile methods. We develop efficient methods for solving TERM (Section 4), and show via numerous case studies that TERM is competitive with existing, problemspecific state-of-the-art solutions (Section 5). We also extend TERM to handle compound issues, such as the simultaneous existence of noisy samples and imbalanced classes (Section 3). Our results demonstrate the effectiveness and versatility of tilted objectives in machine learning.

2. TERM: PROPERTIES & INTERPRETATIONS

To better understand the performance of the t-tilted losses in (2), we provide several interpretations of the TERM solutions, leaving the full statements of theorems and proofs to the appendix. We make no distributional assumptions on the data, and study properties of TERM under the assumption that the loss function forms a generalized linear model, e.g., L 2 loss and logistic loss (Appendix D). However, we also obtain favorable empirical results using TERM with other objectives such as deep neural networks and PCA in Section 5, motivating the extension of our theory beyond GLMs in future work. °0.75 °0.50 °0.25 0.00 0.25 0.50 0.75 1.00 1.25 1.50 µ 0.0 0.5 1.0 e R(t; µ) f1(µ) = (µ + 0.2) 2 , f2(µ) = (µ °0.2) 2 + 0.1, f3(µ) = (µ °1.2) 2

max-loss avg-loss min-loss median-loss

Figure 2 : TERM objectives for a squared loss problem with N " 3. As t moves from ´8 to `8, t-tilted losses recover min-loss, avgloss, and max-loss. TERM is smooth for all finite t and convex for positive t. General properties. We begin by noting several general properties of the TERM objective (2). Given a smooth f px; θq, the t-tilted loss is smooth for all finite t (Lemma 4). If f px; θq is strongly convex, the t-tilted loss is strongly convex for t ą 0 (Lemma 5). We visualize the solutions to TERM for a toy problem in Figure 2 , which allows us to illustrate several special cases of the general framework. As discussed in Section 1, TERM can recover traditional ERM (t"0), the max-loss (tÑ`8), and the min-loss (tÑ´8). As we demonstrate in Section 5, providing a smooth tradeoff between these specific losses can be beneficial for a number of practical use-casesboth in terms of the resulting solution and the difficulty of solving the problem itself. Interestingly, we additionally show that the TERM solution can be viewed as a smooth approximation to a superquantile method, which aims to minimize quantiles of losses such as the median loss. In Figure 2 , it is clear to see why this may be beneficial, as the median loss (orange) can be highly non-smooth in practice. We make these rough connections more explicit via the interpretations below. (Interpretation 1) Re-weighting samples to magnify/suppress outliers. As discussed via the toy examples in Section 1, the TERM objective can be tuned (using t) to magnify or suppress the influence of outliers. We make this notion rigorous by exploring the gradient of the t-tilted loss in order to reason about the solutions to the objective defined in (2). Lemma 1 (Tilted gradient, proof in Appendix B). For a smooth loss function f px; θq, ∇ θ r Rpt;θq" ÿ iPrN s w i pt;θq∇ θ f px i ;θq, where w i pt;θq:" e tf pxi;θq ř jPrN s e tf pxj ;θq " 1 N e tpf pxi;θq´r Rpt;θqq . (3) From this, we can observe that the tilted gradient is a weighted average of the gradients of the original individual losses, where each data point is weighted exponentially proportional to the value of its loss. Note that t " 0 recovers the uniform weighting associated with ERM, i.e., w i pt; θq " 1{N . For positive t, it magnifies the outliers-samples with large losses-by assigning more weight to them, and for negative t, it suppresses the outliers by assigning less weight to them. (Interpretation 2) Tradeoff between average-loss and min/max-loss. To put Interpretation 1 in context and understand the limits of TERM, a benefit of the framework is that it offers a continuum of solutions between the min and max losses. Indeed, for positive values of t, TERM enables a smooth tradeoff between the average-loss and max-loss (as we demonstrate in Figure 10 , Appendix I). Hence, TERM can selectively improve the worst-performing losses by paying a penalty on average performance, thus promoting a notion of uniformity or fairness (Hashimoto et al., 2018) . On the other hand, for negative t, the solutions achieve a smooth tradeoff between average-loss and min-loss, which can have the benefit of focusing on the 'best' losses, or ignoring outliers (Theorem 3, Appendix D). Theorem (Formal statement and proof in Appendix D, Theorem 3). Let θptq be the minimizer of r Rpt; θq, referred to as t-tilted solution. Then, for t ą 0, max-loss, p Rp θptqq, is non-increasing with t while the average loss, Rp θptqq, is non-decreasing with t. (Interpretation 3) Empirical bias/variance tradeoff. Another key property of the TERM solutions is that the empirical variance of the loss across all samples decreases as t increases (Theorem 4). Hence, by increasing t, it is possible to trade off between optimizing the average loss vs. reducing variance, allowing the solutions to potentially achieve a better bias-variance tradeoff for generalization (Bennett, 1962; Hoeffding, 1994; Maurer & Pontil, 2009) (Figure 10 , Appendix I). We use this property to achieve better generalization in classification in Section 5. We also prove that the cosine similarity between the loss vector and the all-ones vector monotonically increases with t (Theorem 5), which shows that larger t promotes a more uniform performance across all losses and can have implications for fairness defined as representation disparity (Hashimoto et al., 2018 ) (Section 5.2). Theorem (Formal statement and proof in Appendix D, Theorem 4). Let f pθq :" pf px 1 ; θqq, . . . , f px N ; θqq be the loss vector for parameter θ. Then, the variance of the vector f p θptqq is non-increasing with t while its average, i.e., Rp θptqq, is non-decreasing with t. (Interpretation 4) Approximate superquantile method. Finally, we show that TERM is related to superquantile-based objectives, which aim to minimize specific quantiles of the individual losses that exceed a certain value (Rockafellar et al., 2000) . For example, optimizing for 90% of the individual losses (ignoring the worst-performing 10%) could be a more reasonable practical objective than the pessimistic min-max objective. Another common application of this is to use the median in contrast to the mean in the presence of noisy outliers. As we discuss in Appendix G, superquantile methods can be reinterpreted as minimizing the k-loss, defined as the k-th smallest loss of N (i.e., 1-loss is the min-loss, N -loss is the max-loss, pN ´1q{2-loss is the median-loss). While minimizing the k-loss is more desirable than ERM in many applications, the k-loss is non-smooth (and generally non-convex), and is challenging to solve for large-scale problems (Jin et al., 2020; Nouiehed et al., 2019b) . Theorem (Formal statement and proof in Appendix G, Theorem 10). The quantile of the losses that exceed a given value is upper bounded by a smooth function of the TERM objective. Further, the t-tilted solutions are good approximate solutions of the superquantile (k-loss) optimization.

3. TERM EXTENDED: HIERARCHICAL MULTI-OBJECTIVE TILTING

We also consider an extension of TERM that can be used to address practical applications requiring multiple objectives, e.g., simultaneously achieving robustness to noisy data and ensuring fair performance across subgroups. Existing approaches typically aim to address such problems in isolation. To handle multiple objectives with TERM, let each sample x be associated with a group g P rGs, i.e., x P g. These groups could be related to the labels (e.g., classes in a classification task), or may depend only on features. For any t, τ P R, we define multi-objective TERM as: 

4. SOLVING TERM

To solve TERM, we suggest batch and stochastic variants of traditional first-order gradient-based optimization methods. TERM in the batch setting (Batch TERM) is summarized in Algorithm 1 in the context of solving multi-objective hierarchical TERM (4) for full generality. The main steps include computing the tilted gradients of the hierarchical objective defined in (4). Note that Batch TERM with t " τ reduces to solving the sample-level tilted objective (2). We also provide a stochastic variant in Algorithm 2, Appendix H. At a high level, at each iteration, group-level tilting is addressed by choosing a group based on the tilted weight vector estimated via stochastic dynamics. Sample-level tilting is then incorporated by re-weighting the samples in a uniformly drawn mini-batch. We find that these methods perform well empirically on a variety of tasks (Section 5). We defer readers to Appendix H for general properties of TERM (smoothness, convexity) that may vary with t and affect the convergence of gradient-based methods used to solve the objective.

5. TERM IN PRACTICE: USE CASES

In this section, we showcase the flexibility, wide applicability, and competitive performance of the TERM framework through empirical results on a variety of real-world problems such as handling outliers (Section 5.1), ensuring fairness and improving generalization (Section 5.2), and addressing compound issues (Section 5.3). Despite the relatively straightforward modification TERM makes to traditional ERM, we show that t-tilted losses not only outperform ERM, but either outperform or are competitive with state-of-the-art, problem-specific tailored baselines on a wide range of applications. We provide implementation details in Appendix J. All code, datasets, and experiments are publicly available at github.com/litian96/TERM. For experiments with positive t (Section 5.2), we tune t P t0.1, 0.5, 1, 5, 10, 50, 100, 200u on the validation set. In our initial robust regression experiments, we find that the performance is robust to various t's, and we thus use a fixed t " ´2 for all experiments involving negative t (Section 5.1 and Section 5.3). For all values of t tested, the number of iterations required to solve TERM is within 2ˆthat of standard ERM.

5.1. MITIGATING NOISY OUTLIERS

We begin by investigating TERM's ability to find robust solutions that reduce the effect of noisy outliers. We note that we specifically focus on the setting of 'robustness' involving random additive noise; the applicability of TERM to more adversarial forms of robustness would be an interesting direction of future work. We do not compare with approaches that require additional clean validation data (e.g., Hendrycks et al., 2018; Ren et al., 2018; Roh et al., 2020; Veit et al., 2017) , as such data can be costly to obtain in practice. Robust regression. We first consider a regression task with noise corrupted targets, where we aim to minimize the root mean square error (RMSE) on samples from the Drug Discovery dataset (Diakonikolas et al., 2019; Olier et al., 2018) . The task is to predict the bioactivities given a set of chemical compounds. We compare against linear regression with an L 2 loss, which we view as the 'standard' ERM solution for regression, as well as with losses commonly used to mitigate outliers-the L 1 loss and Huber loss (Huber, 1964) . We also compare with consistent robust regression (CRR) (Bhatia et al., 2017) and STIR (Mukhoty et al., 2019) , recent state-of-the-art methods specifically designed for label noise in robust regression. In this particular problem, TERM is equivalent to exponential squared loss, studied in (Wang et al., 2013) . We apply TERM at the sample level with an L 2 loss, and generate noisy outliers by assigning random targets drawn from N p5, 5q on a fraction of the samples. In Table 1 , we report RMSE on clean test data for each objective and under different noise levels. We also present the performance of an oracle method (Genie ERM) which has access to all of the clean data samples with the noisy samples removed. Note that Genie ERM is not a practical algorithm and is solely presented to set the expected performance limit in the noisy setting. The results indicate that TERM is competitive with baselines on the 20% noise level, and achieves better robustness with moderate-to-extreme noise. We observe similar trends in scenarios involving both noisy features and targets (Appendix I.2). CRR tends to run slowly as it scales cubicly with the number of dimensions (Bhatia et al., 2017) , while solving TERM is roughly as efficient as ERM. Note that the outliers considered here are unstructured with random noise, and not adversarial. This makes it possible for the methods to find the underlying structure of clean data even if the majority of the samples are noisy outliers. To gain more intuition on these cases, we also generate synthetic two-dimensional data points and test the performance of TERM under 0%, 20%, 40%, and 80% noise for linear regression. TERM with t " ´2 performs well in all noise levels (Figure 11 and 12 in Appendix I.2). However, as might be expected, in Figure 14 (Appendix I .2) we show that TERM may overfit to noisy samples when the noise is structured and the noise values are large (e.g., 80%). Robust classification. It is well-known that deep neural networks can easily overfit to corrupted labels (e.g., Zhang et al., 2017) . While the theoretical properties we study for TERM (Section 2) do not directly cover objectives with neural network function approximations, we show that TERM can be applied empirically to DNNs to achieve robustness to noisy training labels. MentorNet (Jiang et al., 2018 ) is a popular method in this setting, which learns to assign weights to samples based on feedback from a student net. Following the setup in Jiang et al. (2018) , we explore classification on CIFAR10 (Krizhevsky et al., 2009) when a fraction of the training labels are corrupted with uniform noise-comparing TERM with ERM and several state-of-the-art approaches (Krizhevsky et al., 2009; Kumar et al., 2010; Ren et al., 2018; Zhang & Sabuncu, 2018) . As shown in Table 2 , TERM performs competitively with 20% noise, and outperforms all baselines in the high noise regimes. We use MentorNet-PD as a baseline since it does not require clean validation data. In Appendix I.2, we show that TERM also matches the performance of MentorNet-DD, which requires clean validation data. To help reason about the performance of TERM, we also explore a simpler, two-dimensional logistic regression problem in Figure 13 , Appendix I.2, finding that TERM with t=´2 is similarly robust across the considered noise regimes. Low-quality annotators. It is not uncommon for practitioners to obtain human-labeled data for their learning tasks from crowd-sourcing platforms. However, these labels are usually noisy in part due to the varying quality of the human annotators. Given a collection of labeled samples from crowd-workers, we aim to learn statistical models that are robust to the potentially low-quality annotators. As a case study, following the setup of (Khetan et al., 2018) , we take the CIFAR-10 dataset and simulate 100 annotators where 20 of them are hammers (i.e., always correct) and 80 of them are spammers (i.e., assigning labels uniformly at random). We apply TERM at the annotator group level in (4), which is equivalent to assigning annotator-level weights based on the aggregate value of their loss. As shown in Figure 3 , TERM is able to achieve the test accuracy limit set by Genie ERM, i.e., the ideal performance obtained by completely removing the known outliers. We note in particular that the accuracy reported by (Khetan et al., 2018) (0.777) is lower than TERM (0.825) in the same setup, even though their approach is a two-pass algorithm requiring at least to double the training time. We provide full empirical details and investigate additional noisy annotator scenarios in Appendix I.2.

5.2. FAIRNESS AND GENERALIZATION

In this section, we show that positive values of t in TERM can help promote fairness (e.g., via learning fair representations), and offer variance reduction for better generalization. Fair principal component analysis (PCA). We explore the flexibility of TERM in learning fair representations using PCA. In fair PCA, the goal is to learn low-dimensional representations which are fair to all considered subgroups (e.g., yielding similar reconstruction errors) (Kamani et al., 2019; Samadi et al., 2018; Tantipongpipat et al., 2019) . Despite the non-convexity of the fair PCA problem, we apply TERM to this task, referring to the resulting objective as TERM-PCA. We tilt the same loss function as in (Samadi et al., 2018) : f pX; U q " 1 |X| ´}X ´XU U J } 2 F ´}X ´X} 2 F ¯ , where X P R nˆd is a subset (group) of data, U P R dˆr is the current projection, and X P R nˆd is the optimal rank-r approximation of X. Instead of solving a more complex min-max problem using semi-definite programming as in (Samadi et al., 2018) , which scales poorly with problem dimension, we apply gradient-based methods, re-weighting the gradients at each iteration based on the loss on each group. In Figure 4 , we plot the aggregate loss for two groups (high vs. low education) in the Default Credit dataset (Yeh & Lien, 2009) for different target dimensions r. By varying t, we achieve varying degrees of performance improvement on different groups-TERM (t " 200) recovers the min-max results of (Samadi et al., 2018 ) by forcing the losses on both groups to be (almost) identical, while TERM (t " 10) offers the flexibility of reducing the performance gap less aggressively. Handling class imbalance. Next, we show that TERM can reduce the performance variance across classes with extremely imbalanced data when training deep neural networks. We compare TERM with several baselines which re-weight samples during training, including assigning weights inversely proportional to the class size (InverseRatio), focal loss (Lin et al., 2017) , HardMine (Malisiewicz et al., 2011), and LearnReweight (Ren et al., 2018) . Following (Ren et al., 2018) , the datasets are composed of imbalanced 4 and 9 digits from MNIST (LeCun et al., 1998) . In Figure 5 , we see that TERM obtains similar (or higher) final accuracy on the clean test data as the state-of-the-art methods. We note that compared with LearnReweight, which optimizes the model over an additional balanced validation set and requires three gradient calculations for each update, TERM neither requires such balanced validation data nor does it increase the per-iteration complexity. Improving generalization via variance reduction. A common alternative to ERM is to consider a distributionally robust objective, which optimizes for the worst-case training loss over a set of distributions, and has been shown to offer variance-reduction properties that benefit generalization (e.g., Namkoong & Duchi, 2017; Sinha et al., 2018) . While not directly developed for distributional robustness, TERM also enables variance reduction for positive values of t (Theorem 4), which can be used to strike a better bias-variance tradeoff for generalization. We compare TERM with several baselines including robustly regularized risk (RobustRegRisk) (Namkoong & Duchi, 2017) , linear SVM (Ren et al., 2018) , LearnRewight (Ren et al., 2018) , FocalLoss (Lin et al., 2017) , and HRM (Leqi et al., 2019) . The results and detailed discussions are presented in Appendix I.2.

5.3. SOLVING COMPOUND ISSUES: HIERARCHICAL MULTI-OBJECTIVE TILTING

Finally, in this section, we focus on settings where multiple issues, e.g., class imbalance and label noise, exist in the data simultaneously. We discuss two possible instances of hierarchical multiobjective TERM to tackle such problems. One can think of other variants in this hierarchical tilting space which could be useful depending on applications at hand. However, we are not aware of other prior work that aims to simultaneously handle multiple goals, e.g., suppressing noisy samples and addressing class imbalance, in a unified framework without additional validation data. We explore the HIV-1 dataset (Rögnvaldsson, 2013), as in Section 5.2. We report both overall accuracy and accuracy on the rare class in four scenarios: (a) clean and 1:4, the original dataset that is naturally slightly imbalanced with rare samples represented 1:4 with respect to the common class; (b) clean and 1:20, where we subsample to introduce a 1:20 imbalance ratio; (c) noisy and 1:4, which is the original dataset with labels associated with 30% of the samples randomly reshuffled; and (d) noisy and 1:20, where 30% of the labels of the 1:20 imbalanced dataset are reshuffled. In Table 3 , hierarchical TERM is applied at the sample level and class level (TERM sc ), where we use the sample-level tilt of τ "´2 for noisy data. We use class-level tilt of t"0.1 for the 1:4 case and t"50 for the 1:20 case. We compare against baselines for robust classification and class imbalance (discussed previously in Sections 5.1 and 5.2), where we tune them for best performance (Appendix J). Similar to the experiments in Section 5.1, we avoid using baselines that require clean validation data (e.g., Roh et al., 2020) . While different baselines perform well in their respective problem settings, TERM is far superior to all baselines when considering noisy samples and class imbalance simultaneously (rightmost column in Table 3 ). Finally, in the last row of Table 3 , we simulate the noisy annotator setting of Section 5.1 assuming that the data is coming from 10 annotators, i.e., in the 30% noise case we have 7 hammers and 3 spammers. In this case, we apply hierarchical TERM at both class and annotator levels (TERM ca ), where we perform the higher level tilt at the annotator (group) level and the lower level tilt at the class level (with no sample-level tilting). We show that this approach can benefit noisy/imbalanced data even further (far right, Table 3 ), while suffering only a small performance drop on the clean and noiseless data (far left, Table 3 ).

6. RELATED WORK

Alternate aggregation schemes: exponential smoothing/superquantile methods. A common alternative to the standard average loss in empirical risk minimization is to consider a min-max objective, which aims to minimize the max-loss. Min-max objectives are commonplace in machine learning, and have been used for a wide range of applications, such as ensuring fairness across subgroups (Hashimoto et al., 2018; Mohri et al., 2019; Samadi et al., 2018; Stelmakh et al., 2019; Tantipongpipat et al., 2019) , enabling robustness under small perturbations (Sinha et al., 2018) , or generalizing to unseen domains (Volpi et al., 2018) . As discussed in Section 2, the TERM objective can be viewed as a minimax smoothing (Kort & Bertsekas, 1972; Pee & Royset, 2011) with the added flexibility of a tunable t to allow the user to optimize utility for different quantiles of loss similar to superquantile approaches (Laguel et al., 2021; Rockafellar et al., 2000) , directly trading off between robustness/fairness and utility for positive and negative values of t (see Appendix G for these connections). However, the TERM objective remains smooth (and efficiently solvable) for moderate values of t, resulting in faster convergence even when the resulting solutions are effectively the same as the min-max solution or other desired quantiles of the loss (as we demonstrate in the experiments of Section 5). Such smooth approximations to the max often appear through LogSumExp functions, with applications in geometric programming (Calafiore & El Ghaoui, 2014, Sec. 9.7), and boosting (Mason et al., 1999; Shen & Li, 2010) . Interestingly, Cohen et al. introduce Simnets (Cohen & Shashua, 2014; Cohen et al., 2016) , with a similar exponential smoothing operator, though for a differing purpose of achieving layer-wise operations between sum and max in deep neural networks. Alternate loss functions. Rather than modifying the way the losses are aggregated, as in (smoothed) min-max or superquantile methods, it is also quite common to modify the losses themselves. For example, in robust regression, it is common to consider losses such as the L 1 loss, Huber loss, or general M -estimators (Holland & Ikeda, 2019) as a way to mitigate the effect of outliers (Bhatia et al., 2015) . (Wang et al., 2013) studies a similar exponentially tilted loss for robust regression, though it is limited to the squared loss and only corresponds to tă0. Losses can also be modified to address outliers by favoring small losses (Yu et al., 2012; Zhang & Sabuncu, 2018) or gradient clipping (Menon et al., 2020) . On the other extreme, the largest losses can be magnified to encourage focus on hard samples (Li et al., 2020b; Lin et al., 2017; Wang et al., 2016) , which is a popular approach for curriculum learning. Constraints could also be imposed to promote fairness (Baharlouei et al., 2020; Donini et al., 2018; Hardt et al., 2016; Rezaei et al., 2020; Zafar et al., 2017) . Ignoring the log portion of the objective in (2), TERM can be viewed as an alternate loss function exponentially shaping the loss to achieve both of these goals with a single objective, i.e., magnifying hard examples with t ą 0 and suppressing outliers with t ă 0. In addition, we show that TERM can even achieve both goals simultaneously with hierarchical multi-objective optimization (Section 5.3). Sample re-weighting schemes. Finally, there exist approaches that implicitly modify the underlying ERM objective by re-weighting the influence of the samples themselves. These re-weighting schemes can be enforced in many ways. A simple and widely used example is to subsample training points in different classes. Alternatively, one can re-weight examples according to their loss function when using a stochastic optimizer, which can be used to put more emphasis on "hard" examples (Jiang et al., 2019; Katharopoulos & Fleuret, 2017; Leqi et al., 2019; Shrivastava et al., 2016) . Re-weighting can also be implicitly enforced via the inclusion of a regularization parameter (Abdelkarim et al., 2020), loss clipping (Yang et al., 2010) , or modelling crowd-worker qualities (Khetan et al., 2018) . Such an explicit re-weighting has been explored for other applications (e.g., Chang et al., 2017; Gao et al., 2015; Jiang et al., 2018; Lin et al., 2017; Ren et al., 2018; Shu et al., 2019) , though in contrast to these methods, TERM is applicable to a general class of loss functions, with theoretical guarantees. TERM is equivalent to a dynamic re-weighting of the samples based on the values of the objectives (Lemma 1), which could be viewed as a convexified version of loss clipping. We compare to several sample re-weighting schemes empirically in Section 5.

7. CONCLUSION

In this paper, we examined tilted empirical risk minimization (TERM) as a flexible extension to the ERM framework. We explored, both theoretically and empirically, TERM's ability to handle various known issues with ERM, such as robustness to noise, class imbalance, fairness, and generalization, as well as more complex issues like the simultaneous existence of class imbalance and noisy outliers. Despite the straightforward modification TERM makes to traditional ERM objectives, the framework consistently outperforms ERM and delivers competitive performance with state-of-the-art, problemspecific methods on a wide range of applications. Our work highlights the effectiveness and versatility of tilted objectives in machine learning. Building on the analyses and empirical study provided herein, in future work, it would be interesting to investigate generalization bounds for TERM as a function of t, and to derive theoretical convergence guarantees for our proposed stochastic solvers.

A NOTATION & ASSUMPTIONS

In this section, we provide the notation and the assumptions that are used throughout our theoretical analyses. The results in this paper are derived under one of the following four assumptions: Assumption 1 (Smoothness condition). We assume that for i P rN s, loss function f px i ; θq is in differentiability class C 1 (i.e., continuously differentiable) with respect to θ P Θ Ď R d . Assumption 2 (Strong convexity condition). We assume that Assumption 1 is satisfied. In addition, we assume that for any i P rN s, f px i ; θq is in differentiability class C 2 (i.e., twice differentiable with continuous Hessian) with respect to θ. We further assume that there exist β min , β max P R `such that for i P rN s and any θ P Θ Ď R d , β min I ĺ ∇ 2 θθ J f px i ; θq ĺ β max I, ( ) where I is the identity matrix of appropriate size (in this case d ˆd). We further assume that there does not exist any θ P Θ, such that ∇ θ f px i ; θq " 0 for all i P rN s. Assumption 3 (Generalized linear model condition (Wainwright & Jordan, 2008) ). We assume that Assumption 2 is satisfied. We further assume that the loss function f px; θq is given by f px; θq " Apθq ´θJ T pxq, ( ) where Ap¨q is a convex function such that there exists β max such that for any θ P Θ Ď R d , β min I ĺ ∇ 2 θθ J Apθq ĺ β max I. We also assume that ÿ iPrN s T px i qT px i q J ą 0. This nest set of assumptions become the most restrictive with Assumption 3, which essentially requires that the loss be the negative log-likelihood of an exponential family. While the assumption is stated using the natural parameter of an exponential family for ease of presentation, the results hold for a bijective and smooth reparameterization of the exponential family. Assumption 3 is satisfied by the commonly used L 2 loss for regression and logistic loss for classification (see toy examples (b) and (c) in Figure 1 ). While the assumption is not satisfied when we use neural network function approximators in Section 5.1, we observe favorable numerical results motivating the extension of these results beyond the cases that are theoretically studied in this paper. In the sequel, many of the results are concerned with characterizing the t-tilted solutions defined as the parametric set of solutions of t-tiled losses by sweeping t P R, θptq P arg min θPΘ r Rpt; θq, where Θ Ď R d is an open subset of R d . We state an assumption on this set below. Assumption 4 (Strict saddle property (Definition 4 in (Ge et al., 2015) )). We assume that the set arg min θPΘ r Rpt; θq is non-empty for all t P R. Further, we assume that for all t P R, r Rpt; θq is a "strict saddle" as a function of θ, i.e., for all local minima, ∇ 2 θθ J r Rpt; θqą0, and for all other stationary solutions, λ min p∇ 2 θθ J r Rpt; θqq ă 0, where λ min p¨q is the minimum eigenvalue of the matrix. We use the strict saddle property in order to reason about the properties of the t-tilted solutions. In particular, since we are solely interested in the local minima of r Rpt; θq, the strict saddle property implies that for every θptq P arg min θPΘ r Rpt; θq, for a sufficiently small r, for all θ P Bp θptq, rq, ∇ 2 θθ J r Rpt; θq ą 0, where Bp θptq, rq denotes a d-ball of radius r around θptq. We will show later that the strict saddle property is readily verified for t P R `under Assumption 2.

B BASIC PROPERTIES OF THE TERM OBJECTIVE

In this section, we provide the basic properties of the TERM objective. Proof of Lemma 1. Lemma 1, which provides the gradient of the tilted objective, has been studied previously in the context of exponential smoothing (see (Pee & Royset, 2011, Proposition 2.1)). We provide a brief derivation here under Assumption 1 for completeness. We have: ∇ θ r Rpt; θq " ∇ θ $ & % 1 t log ¨1 N ÿ iPrN s e tf pxi;θq ‚, . -  Hence, the proof follows by putting together ( 20) and ( 23). The proof proceeds similarly to t Ñ ´8 for t Ñ `8 and is omitted for brevity. Note that Lemma 2 has been previously observed in (Cohen & Shashua, 2014) . This lemma also implies that r θp0q is the ERM solution, r θp`8q is the min-max solution, and r θp´8q is the min-min solution. Lemma 3 (Tilted Hessian and strong convexity for t P R `). Under Assumption 2, for any t P R, ∇ 2 θθ J r Rpt; θq " t ÿ iPrN s p∇ θ f px i ; θq ´∇θ r Rpt; θqqp∇ θ f px i ; θq ´∇θ r Rpt; θqq J e tpf pxi;θq´r Rpt;θqq (24) `ÿ iPrN s ∇ 2 θθ J f px i ; θqe tpf pxi;θq´r Rpt;θqq . In particular, for all θ P Θ and all t P R `, the t-tilted objective is strongly convex.  The proof of the first part is completed by differentiating again with respect to θ, followed by algebraic manipulation. To prove the second part, notice that the term in ( 24) is positive semi-definite, whereas the term in ( 25) is positive definite and lower bounded by β min I (see Assumption 2, Eq. ( 5)). Lemma 4 (Smoothness of r Rpt; θq in the vicinity of the final solution θptq). For any t P R, let βptq be the smoothness parameter in the vicinity of the final solution: βptq :" sup θPBp θptq,rq λ max ´∇2 θθ J r Rpt; θq ¯, where ∇ 2 θθ J r Rpt; θq is the Hessian of r Rpt; θq at θ, λ max p¨q denotes the largest eigenvalue, and Bpθ, rq denotes a d-ball of radius r around θ. Under Assumption 2, for any t P R, r Rpt; θq is a βptq-smooth function of θ. Further, for t P R ´, at the vicinity of θptq,  βptq ă β max , ď β max , where we have used the fact that the term in ( 24) is negative semi-definite for t ă 0, and that the term in ( 25) is positive definite for all t with smoothness bounded by β max (see Assumption 2, Eq. ( 5)). For t P R `, following Lemma 3 and Weyl's inequality (Weyl, 1912)  where we have used the fact that no solution θ exists that would make all f i 's vanish (Assumption 2). Under the strict saddle property (Assumption 4), it is known that gradient-based methods would converge to a local minimum (Ge et al., 2015) , i.e., θptq would be obtained using gradient descent (GD). The rate of convergence of GD scales linearly with the smoothness parameter of the optimization landscape, which is characterized by Lemma 4. Lemma 5 (Strong convexity of r Rpt; θq in R `). Under Assumption 2, for any t P R `, r Rpt; θq is a strongly convex function of θ. That is for t P R `, ∇ 2 θθ J r Rpt; θq ą β min I. (40) Proof. The result follows by invoking Lemma 3 with t P R `, and considering (5) (Assumption 2). This lemma also implies that under Assumption 2, the strict saddle assumption (Assumption 4) is readily verified. Lemma (Gradient of Λ with θ). For all t P R and all θ P Θ,  ∇ In particular, by taking the limit as t Ñ `8, lim tÑ`8 B Bτ r Rpt; θpτ qq ď 0. ( ) Theorem 1 is concerned with characterizing the impact that TERM solutions for different t P R have on the objective r Rpτ ; θptqq for some fixed τ P R. Recall that τ " ´8 recovers the min-loss, τ " 0 is the average-loss, and τ " `8 is the max-loss. By definition, if t " τ , θpτ q is the minimizer of r Rpτ ; θptqq. Theorem 1 shows that for t P p´8, τ q the objective is decreasing; while for t P pτ, `8q the objective increasing. Recall that for any fixed τ P R, r Rpτ ; θq is also related to the k-th smallest loss of the population (Appendix G). Hence, the solution θptq is approximately minimizing the kptq-th smallest loss where kptq is increasing from 1 to N by sweeping t in p´8, `8q. Theorem 4 (Variance reduction). Let f pθq :" pf px 1 ; θqq, . . . , f px N ; θqq. For any u P R N , let meanpuq :" 1 N ÿ iPrN s u i , varpuq :" 1 N ÿ iPrN s pu i ´meanpuqq 2 . ( ) Then, under Assumption 3 and Assumption 4, for any t P R, B Bt ! varpf p θptqqq ) ă 0. ( ) Proof. Recall that f px i ; θq " Apθq ´θJ T px i q. Thus, meanpf q " 1 N ÿ iPrN s f px i ; θq " Apθq ´1 N θ J ÿ iPrN s T px i q " Apθq ´Mp0; θq (105) Consequently, varpf pθqq " 1 N ÿ iPrN s ¨f px i ; θq ´1 N ÿ jPrN s f px j ; θq ‚2 (106) " 1 N ÿ iPrN s ¨θJ T px i q ´1 N θ J ÿ jPrN s T px j q ‚2 (107) " 1 N θ J ¨ÿ iPrN s pT px i q ´1 N ÿ jPrN s T px j qqpT px i q ´1 N ÿ jPrN s T px j qq J ‚θ (108) " θ J V 0 θ, where V 0 " Vp0; θq " 1 N ÿ iPrN s pT px i q ´1 N ÿ jPrN s T px j qqpT px i q ´1 N ÿ jPrN s T px j qq J . Hence, B Bτ ! varpf p θpτ qqq ) " ˆB Bτ θpτ q ˙J ∇ θ ! varpf p θpτ qqq ) " 2 ˆB Bτ θpτ q ˙J V 0 θpτ q (112) " ´2θ J pτ qVpτ ; θpτ qq ´∇2 θθ Ap θpτ qq `τ Vpτ ; θpτ qq ¯´1 V 0 θpτ q (113) ă 0, completing the proof. Theorem 5 (Cosine similarity of the loss vector and the all-ones vector increases with t). For u, v P R N , let cosine similarity be defined as spu, vq :" u J v }u} 2 }v} 2 . ( ) Let f pθq :" pf px 1 ; θqq, . . . , f px N ; θqq and let 1 N denote the all-1 vector of length N . Then, under Assumption 3 and Assumption 4, for any t P R, B Bt ! spf p θptqq, 1 N q ) ą 0. (116) Proof. Notice that spf pθq, 1 N q " 1 N ř iPrN s f px i ; θq b 1 N ř iPrN s f 2 px i ; θq . ( ) Let M 0 :" Mp0; θq and V 0 :" Vp0; θq. Hence, 1 N ÿ iPrN s f px i ; θq " Apθq ´θJ M 0 , ( ) 1 N ÿ iPrN s f 2 px i ; θq " pApθq ´θJ M 0 q 2 `θJ V 0 θ (119) Notice that ∇ θ s 2 pf pθq, 1 N q ( " ∇ θ $ ' & ' % ´1 N ř iPrN s f px i ; θq ¯2 1 N ř iPrN s f 2 px i ; θq , / . / - (120) " ∇ θ " pApθq ´θJ M 0 q 2 pApθq ´θJ M 0 q 2 `θJ V 0 θ * (121) " 2pApθq ´θJ M 0 qp∇ θ Apθq ´M0 qθ J V 0 θ ´2pApθq ´θJ M 0 q 2 V 0 θ ppApθq ´θJ M 0 q 2 `θJ V 0 θq 2 " 2pApθq ´θJ M 0 q `θJ p∇ θ Apθq ´M0 q ´Apθq `θJ M 0 ˘V0 θ ppApθq ´θJ M 0 q 2 `θJ V 0 θq 2 (123) " 2pApθq ´θJ M 0 q `θJ ∇ θ Apθq ´Apθq ˘V0 θ ppApθq ´θJ M 0 q 2 `θJ V 0 θq 2 (124) " ´2pApθq ´θJ M 0 q 2 V 0 θ ppApθq ´θJ M 0 q 2 `θJ V 0 θq 2 . ( ) Hence, B Bτ ! s 2 pf p θpτ qq, 1 N q ) " ˆB Bτ θpτ q ˙J ∇ θ ! s 2 pf p θpτ qq, 1 N q ) ( ) " ´θ J pτ qVpτ ; θpτ qq ´∇2 θθ Ap θpτ qq `τ Vpτ ; θpτ qq ¯´1 ˆ´2pAp θpτ qq ´θpτ q J M 0 q 2 ´pAp θpτ qq ´θpτ q J M 0 q 2 `θpτ q J V 0 θ ¯2 V 0 θpτ q (127) ą 0, completing the proof. 151) follows from the chain rule and Lemma 10. Hence, t 2 gpt; θq is an increasing function of t for t P R `, and a decreasing function of t for t P R ´, taking its minimum at t " 0. Hence, t 2 gpt; θq ě 0 for all t P R. This implies that gpt; θq ě 0 for all t P R, which in conjunction with (145) implies the statement of the theorem. Definition 4 (Optimal tilted objective). Let the optimal tilted objective be defined as r F ptq :" r Rpt; θptqq. Theorem 8 (Optimal tilted objective is increasing with t). Under Assumption 3, for all t P R, and all θ P Θ, B Bt r F ptq " B Bt r Rpt; θptqq ě 0. Proof. Notice that for all θ, and all P R  which completes the proof.

F CONNECTIONS BETWEEN TERM AND EXPONENTIAL TILTING

Here we provide connections between TERM and exponential tilting, a concept previously explored in the context of importance sampling and the theory of large deviations (Beirami et al., 2018; Dembo & Zeitouni, 2009; Wainwright et al., 2005) . To do so, suppose that X is drawn from distribution pp¨q. Let us study the distribution of random variable Y " f pX; θq. Let Λ Y ptq be the cumulant generating function (Dembo & Zeitouni, 2009, Sectiom 2.2) . That is Λ Y ptq :" log `Ep e tY (˘( 158) " log ´Ep ! e tf pX;θq )¯. Now, suppose that x 1 , . . . , x N are drawn i.i.d. from pp¨q. Note that this distributional assumption is made solely for providing intuition on the tilted objectives and is not needed in any of the proofs in this paper. Hence, r Rpt; θq can be viewed as an empirical approximation to the cumulant generating function: Λ Y ptq « t r Rpt; θq. Hence, r Rpt; θq provides an approximate characterization of the distribution of f pX; θq. Thus, minimizing r Rpt; θq is approximately equivalent to minimizing the complementary cumulative distribution function (CDF) of f pX; θq. In other words, this is equivalent to minimizing P tf pX; θq ą au for some a, which is a function of t. In the next section, we will explore these connections with tail probabilities dropping the distributional assumptions, effectively drawing connections between superquantile methods and TERM.

G TERM AS AN APPROXIMATE SUPERQUANTILE METHOD

For all a P R, let Qpa; θq denote the quantile of the losses that are no smaller than a, i.e.,

Qpa; θq

:" 1 N ÿ iPrN s I tf px i ; θq ě au , where It¨u is the indicator function. Notice that Qpa; θq P 0, 1 N , . . . , 1 ( quantifies the fraction of the data for which loss is at least a. In this section, we further assume that f is such that f px i ; θq ě 0 for all θ. Suppose that we are interested in choosing θ in a way that for a given a P R, we minimize the fraction of the losses that are larger than a. That is Q 0 paq :" min θ Qpa; θq " Qpa; θ 0 paqq, where θ 0 paq :" arg min θ Qpa; θq. This is a non-smooth non-convex problem and solving it to global optimality is very challenging. In this section, we argue that TERM provides a reasonable approximate solution to this problem, which is computationally feasible. Notice that we have the following simple relation: Lemma 20. If a ă r F p´8q then Q 0 paq " 1. Further, if a ą r F p`8q then Q 0 paq " 0, where r F p¨q is defined in Definition 4, and is reproduced here: r F p´8q " lim tÑ´8 r Rpt; θptqq " min θ min iPrN s f px i ; θq, r F p`8q " lim tÑ`8 r Rpt; θptqq " min θ max iPrN s f px i ; θq. Next, we present our main result on the connection between the superquantile method and TERM. Theorem 9. For all t P R, and all θ, and all a P p r F p´8q, r F p`8qq,foot_1  Qpa; θq ď r Qpa; t, θq :" e r Rpt;θqt ´e r F p´8qt e at ´e r F p´8qt . Proof. We have Qpa; θq " 1 N ř iPrN s e pa´r F p´8qqtI " f px i ;θq´Ă F p´8q a´Ă F p´8q ě1 * ´1 e pa´r F p´8qqt ´1 (167) ď 1 N ř iPrN s e pf pxi;θq´r F p´8qqt ´1 e pa´r F p´8qqt ´1 (168) " e r Rpt;θqt ´e r F p´8qt e at ´e r F p´8qt , where (167) follows from Lemma 21, (168) follows from Lemma 22, the fact that e tx is strictly increasing (resp. decreasing) for t ą 0 (resp. t ă 0) and pe pa´r F p´8qqt ´1q is positive (resp. negative) for t ą 0 (resp. t ă 0), and (169) follows from definition. Lemma 21. For all t P R, and all θ,foot_2  Qpa; θq " 1 N ř iPrN s e pa´r F p´8qqtItf pxi;θqěau ´1 e pa´r F p´8qqt ´1 . ( ) Proof. The proof is completed following this identity: 1 N ÿ iPrN s e pa´r F p´8qqItf pxi;θqěaut " Qpa; θqe pa´r F p´8qqt `p1 ´Qpa; θqq. Lemma 22. For x ě 0, we have Itx ě 1u ď x. Theorem 9 directly leads to the following result. Theorem 10. For all a P p r F p´8q, r F p`8qq, we have Q 0 paq ď Q 1 paq ď Q 2 paq ď Q 3 paq " inf tPR # e r F ptqt ´e r F p´8qt e at ´e r F p´8qt + , where Q 1 paq :" inf tPR Qpa; θptqq Q 2 paq :" Qpa; θp r tpaqqq (174) Q 3 paq :" r Qpa; r tpaq, θp r tpaqqq and r tpaq :" arg inf tPR ! r Qpa; t, θptqq ) " arg inf tPR # e r F ptqt ´e r F p´8qt e at ´e r F p´8qt + . Proof. The only non-trivial step is to show that Q 2 paq ď Q 3 paq. Following Theorem 9, Q 2 paq " Qpa; θp r tpaqq (177) ď inf tPR r Qpa; t, θptqq " Q 3 paq, which completes the proof. Theorem 10 motivates us with the following approximation on the solutions of the superquantile method. Approximation 1. For all a P p r F p´8q, r F p`8qq, Qpa; θ 0 paqq " Q 0 paq « Q 2 paq " Qpa; θp r tpaqq, and hence, θp r tpaq is an approximate solution to the superquantile optimization problem. While we have not characterized how tight this approximation is for a P p r F p´8q, r F p`8qq, we believe that Approximation 1 provides a reasonable solution to the superquantile optimization problem in general. This is evidenced empirically when the approximation is evaluated on the toy examples of Figure 1 , and compared with the global solutions of the superquantile method. The results are shown in Figure 6 . As can be seen, Q 0 paq « Q 2 paq as suggested by Approximation 1. Also, we can see that while the bound in Theorem 10 is not tight, the solution that is obtained from solving it results in a good approximation to the superquantile optimization. Finally, we draw connections between these results and the k-loss. Notice that minimizing Qpa; θq for a fixed a is equivalent to minimizing a for a fixed Qpa; θq. If we fix Qpa; θq " pN ´kq{N, minimizing a would be equivalent to minimizing the k-loss. Formally, let R pkq pθq be the k-th order statistc of the loss vector. Hence, R pkq is the k-th smallest loss, and particularly R p1q pθq " q Rpθq, R pN q pθq " p Rpθq. (182) 0.2 0.4 0.6 0.8 1.0 quantiles -4 -3 -2 -1 0 a Q i (a), point estimation Q 0 (a) Q 1 (a) Q 2 (a) Q 3 (a) (a) Numerical results showing the bounds Q 1 paq, Q 2 paq, and Q 3 paq for Q 0 paq on the point estimation example.  a Q i (a), linear regression Q 0 (a) Q 1 (a) Q 2 (a) Q 3 (a) (b) Numerical results showing the bounds Q 1 paq, Q 2 paq, and Q 3 paq for Q 0 paq on the linear regression example. Figure 6 : Q 1 paq and Q 2 paq are close to Q 0 paq, which indicates that the solution obtained from solving Q 3 paq (which is Q 2 paq) is a tight approximation of the globally optimal solution of Q 0 paq. Thus, for any k P rN s, we define R pkq :" min θ R pkq pθq. (183) θ ˚pkq :" arg min θ R pkq pθq. Note that R p1q " r F p´8q, R pN q " r F p`8q. (186) Theorem 9 directly implies the following result: Corollary 11. For all k P t2, . . . , N ´1u, and all t P R `: ˇˇe pR pkq pθq´r F p´8qqt ´1ˇˇˇď ˆN N ´k ˙ˇˇe p r Rpt;θq´r F p´8qqt ´1ˇˇˇ. Proof. We proceed by setting Qpa; θq " N ´k N and a " R pkq pθq in Theorem 9, which implies the result. While the bound is left implicit in Corollary 11, we can obtain an explicit bound if we only consider t P R `(i.e., we are interested in k-losses for larger k): Corollary 12. For all k P t2, . . . , N ´1u, and all t P R `: R pkq pθq ď r F p´8q `1 t log ˜ep r Rpt;θq´r F p´8qqt ´k N 1 ´k N ¸. ( ) Proof. The statement follows by algebraic manipulation of Corollary 11. Three toy examples (Figure 1 ) Algorithm 3 Robust Regression (Table 1 ) Algorithm 3 Robust Classification (Table 2 ) Algorithm 4 Low-quality Annotators (Figure 3 ) Algorithm 2 (τ " 0) Fair PCA (Figure 4 ) Algorithm 1 (τ " 0) Class Imbalance (Figure 5 ) Algorithm 2 (τ " 0) Variance Reduction (Table 8 ) Algorithm 1 (τ " 0) Hierarchical TERM (Table 3 ) Algorithm 1 Table 4 : Applications and their corresponding solvers.

H.1 CONVERGENCE WITH t

First, we note that t-tilted losses are βptq-smooth for all t. In a small neighborhood around the tilted solution, βptq is bounded for all negative t and moderately positive t, whereas it scales linearly with t as t Ñ `8, which has been previously studied in the context of exponential smoothing of the max (Kort & Bertsekas, 1972; Pee & Royset, 2011) . We prove this formally in Appendix B, Lemma 4, but it can also be observed visually via the toy example in Figure 2 . Based on this, we provide a convergence result below for Algorithm 3. Theorem 13. Under Assumption 2, there exist C 1 , C 2 ă 8 that do not depend on t such that for any t P R `, setting the step size α " 1 C1`C2t , after m iterations: r Rpt, θ m q ´r Rpt, θptqq ď ˆ1 ´βmin C 1 `C2 t ˙m ´r Rpt, θ 0 q ´r Rpt, θptqq ¯. Proof. First observe that by Lemma 5, r Rpt, θq is β min -strongly convex for all t P R `. Next, notice that by Lemma 4, there exist C 1 , C 2 ă 8 such that r Rpt; θq has a pC 1 `C2 tq-Lipschitz gradient for all t P R `. Then, the result follows directly from (Karimi et al., 2016) Theorem 13 indicates that solving TERM to a local optimum using gradient-based methods will tend to be as efficient as traditional ERM for small-to-moderate values of t (Jin et al., 2017) , which we corroborate via experiments on multiple real-world datasets in Section 5. This is in contrast to solving for the min-max solution, which would be similar to solving TERM as t Ñ `8 (Kort & Bertsekas, 1972; Ostrovskii et al., 2020; Pee & Royset, 2011) . Second, recall that the t-tilted loss remains strongly convex for t ą 0, so long as the origi-nal loss function is strongly convex. On the other hand, for sufficiently large negative t, the t-tilted loss becomes non-convex. Hence, while the t-tilted solutions for positive t are unique, the objective may have multiple (spurious) local minima for negative t even if the original loss function is strongly convex. For negative t, we seek the solution for which the parametric set of t-tilted solutions obtained by sweeping t P R remains continuous (as in Figure 1a-c ). To this end, for negative t, we solve TERM by smoothly decreasing t from 0 ensuring that the solutions form a continuum in R d . Despite the non-convexity of TERM with t ă 0, we find that this approach produces effective solutions to multiple real-world problems in Section 5. Additionally, as the objective remains smooth, it is still relatively efficient to solve. We plot the convergence with t on a toy problem in Figure 8 .

I ADDITIONAL EXPERIMENTS

In this section we provide complete experimental results showcasing the properties of TERM (Appendix I.1) and the use-cases covered in Section 5 (Appendix I.2). Details on how the experiments themselves were executed are provided in Appendix J.

I.1 EXPERIMENTS TO SHOWCASE PROPERTIES OF TERM

Recall that in Section 2, Interpretation 1 is that TERM can be tuned to re-weight samples to magnify or suppress the influence of outliers. In Figure 9 below, we visually show this effect by highlighting the samples with the largest weight for t Ñ `8 and t Ñ ´8 on the logistic regression example previously described in Figure 1 . When t Ñ `8 (left), a few misclassified samples have the largest weights and are highlighted. On the other hand, for negative values of t, TERM suppresses the effect of the outliers, and as t Ñ ´8 (right), samples with the smallest losses hold the the largest weights. Interpretation 2 is concerned with smooth tradeoffs between the average-loss and max/min-loss. In Figure 10 below, we show that (1) tilted solutions with positive t's achieve a smooth tradeoff between average-loss and max-loss, (2) similarly, negative t's result in a smooth tradeoff between average-loss and min-loss, and (3) increasing t from ´8 to `8 reduces the variance of the losses. 

I.2 COMPLETE CASE STUDIES

Here we provide complete results obtained from applying TERM to a diverse set of applications. We either present full metrics of the empirical results discussed in Section 5, or provide additional experiments demonstrating the effects of TERM in new settings. Robust regression. In Section 5.1, we focused on noise scenarios with random label noise. Here, we present results involving both feature noise and target noise. We investigate the performance of TERM on two datasets (cal-housing (Pace & Barry, 1997) and abalone (Dua & Graff, 2019 )) used in (Yu et al., 2012) . Both datasets have features with 8 dimensions. We generate noisy samples following the setup in (Yu et al., 2012) -sampling 100 training samples, and randomly corrupting 5% of them by multiplying their features by 100 and multiply their targets by 10,000. From Table 5 below, we see that TERM significantly outperforms the baseline objectives in the noisy regime on both datasets. We also provide results on synthetic data across different noise levels in two settings. In Figure 11 , the mean of the noise is different from the mean of the clean data, and in Figure 12 , the mean of two groups of data are the same. Similarly, TERM (t " ´2) can effectively remove outliers in the presence of random noise. Figure 11 : Robust regression on synthetic data. In the presence of random noise, TERM with negative t's (blue, t " ´2) can fit structured clean data at all noise levels, while ERM (purple) and TERM with positive t's (red) overfit to corrupted data. We color inliers in green and outliers in brown for visualization purposes. In the presence of random noise with the same mean as that of clean data, TERM with negative t's (blue) can still surpass outliers in all cases, while ERM (purple) and TERM with positive t's (red) overfit to corrupted data. While the performance drops for 80% noise, TERM can still learn useful information, and achieves much lower error than ERM. Robust classification. Recall that in Section 5.1, for classification in the presence of label noise, we only compare with baselines which do not require clean validation data. In Table 6 below, we report the complete results of comparing TERM with all baselines, including MentorNet-DD (Jiang et al., 2018) which needs additional clean data. In particular, in contrast to the other methods, MentorNet-DD uses 5,000 clean validation images. TERM is competitive with can even exceed the performance of MentorNet-DD, even though it does not have access to this clean data. To interpret the noise more easily, we provide a toy logistic regression example with synthetic data here. In Figure 13 , we see that TERM with t " ´2 (blue) can converge to the correct classifier under 20%, 40%, and 80% noise. Figure 13 : Robust classification using synthetic data. On this toy problem, we show that TERM with negative t's (blue) can be robust to random noisy samples. The green line corresponds to the solution of the generalized cross entropy (GCE) baseline (Zhang & Sabuncu, 2018) . Note that on this toy problem, GCE is as good as TERM with negative t's, despite its inferior performance on the real-world CIFAR10 dataset. (Adversarial or structured noise.) As a word of caution, we note that the experiments thus far have focused on random noise; as one might expect, TERM with negative t's could potentially overfit to outliers if they are constructed in an adversarial way. In the examples shown in Figure 14 , under 40% noise and 80% noise, TERM has a high error measured on the clean data (green dots). Low-quality annotators. In Section 5.1, we demonstrate that TERM can be used to mitigate the effect of noisy annotators, and we assume each annotator is either always correct, or always uniformly increases the accuracy on the worst-performing device (similar to q-FFL (Li et al., 2020b )) while obtaining a similar average accuracy (Table 7 ). assigning random labels. Here, we explore a different and possibly more practical scenario where there are four noisy annotators who corrupt 0%, 20%, 40%, and 100% of their data by assigning labels uniformly at random, and there is one additional adversarial annotator who always assigns wrong labels. We assume the data points labeled by each annotator do not overlap, since (Khetan et al., 2018) show that obtaining one label per sample is optimal for the data collectors under a fixed annotation budget. We compare TERM with several baselines: (a) training without the data coming from the adversarial annotator, (b) training without the data coming from the worst two annotators, and (c) training with all the clean data combined (Genie ERM). The results are shown in Figure 15 . We see that TERM outperforms the strong baselines of removing one or two noisy annotators, and closely matches the performance of training with all the available clean data. Fair federated learning. Federated learning involves learning statistical models across massively distributed networks of remote devices or isolated organizations (Li et al., 2020a; McMahan et al., 2017) . Ensuring fair (i.e., uniform) performance distribution across the devices is a major concern in federated settings (Li et al., 2020b; Mohri et al., 2019) , as using current approaches for federated learning (FedAvg (McMahan et al., 2017) ) may result in highly variable performance across the network. (Li et al., 2020b) consider solving an alternate objective for federated learning, called q-FFL, to dynamically emphasize the worst-performing devices, which is conceptually similar to the goal of TERM, though it is applied specifically to the problem of federated learning and limited to the case of positive t. Here, we compare TERM with q-FFL in their setup on the vehicle dataset (Duarte & Hu, 2004) consisting of data collected from 23 distributed sensors (hence 23 devices). We tilt the L 2 regularized linear SVM objective at the device level. At each communication round, we re-weight the accumulated local model updates from each selected device based on the weights estimated via Algorithm 2. From Figure 16 , we see that similar to q-FFL, TERM (t " 0.1) can also significantly promote the accuracy on the worst device while maintaining the overall performance. The statistics of the accuracy distribution are reported in Table 7 below. 0.173 (0.003) q-FFL (q " 5) 0.862 (0.065) 0.704 (0.033) 0.929 (0.006) 0.064 (0.011) TERM (t " 0.1) 0.853 (0.061) 0.707 (0.021) 0.926 (0.006) 0.061 (0.006) Improving generalization via variance reduction. We compare TERM (applied at the class-level as in (4), with logistic loss) with robustly regularized risk (RobustRegRisk) as in (Namkoong & Duchi, 2017) on the HIV-1 (Dua & Graff, 2019; Rögnvaldsson, 2013) dataset originally investigated K DISCUSSION Our proposed work provides an alternative to empirical risk minimization (ERM), which is ubiquitous throughout machine learning. As such, our framework (TERM) could be widely used for applications both positive and negative. However, our hope is that the TERM framework will allow machine learning practitioners to easily modify the ERM objective to handle practical concerns such as enforcing fairness amongst subgroups, mitigating the effect of outliers, and ensuring robust performance on new, unseen data. One potential downside of the TERM objective is that if the underlying dataset is not well-understood, incorrectly tuning t could have the unintended consequence of magnifying the impact of biased/corrupted data in comparison to traditional ERM. Indeed, critical to the success of such a framework is understanding the implications of the modified objective, both theoretically and empirically. The goal of this work is therefore to explore these implications so that it is clear when such a modified objective would be appropriate. In terms of the use-cases explored with the TERM framework, we relied on benchmark datasets that have been commonly explored in prior work (e.g., Samadi et al., 2018; Tantipongpipat et al., 2019; Yang et al., 2010; Yu et al., 2012) . However, we note that some of these common benchmarks, such as cal-housing (Pace & Barry, 1997) and Credit (Yeh & Lien, 2009) , contain potentially sensitive information. While the goal of our experiments was to showcase that the TERM framework could be useful in learning fair representations that suppress membership bias and hence promote fairer performance, developing an understanding for-and removing-such membership biases requires a more comprehensive treatment of the problem that is outside the scope of this work.



When the argument of the max-loss or the min-loss is not unique, for the purpose of differentiating the loss function, we define p Rpθq as the average of the individual losses that achieve the maximum, and q Rpθq as the average of the individual losses that achieve the minimum. We define the RHS at t " 0 via continuous extension. We define the RHS at t " 0 via continuous extension.



Figure 1: Toy examples illustrating TERM as a function of t: (a) finding a point estimate from a set of 2D samples, (b) linear regression with outliers, and (c) logistic regression with imbalanced classes. While positive values of t magnify outliers, negative values suppress them. Setting t"0 recovers the original ERM objective (1).

Figure 3: TERM (t"´2) completely removes the impact of noisy annotators, reaching the performance limit set by Genie ERM.

Figure 5: TERM (t"100) is competitive with state-of-the-art methods for classification with imbalanced classes.

Figure 7: Correctness of Algorithm 4. The y-axis is the normalized distance between θ obtained by running Alg. 4 and the optimal solution θ ˚via the full batch method. For different values of t, θ can be arbitrarily close to θ ˚.

Samples with the largest weights when t Ñ ´8.

Figure 9: For positive values of t, TERM focuses on the samples with relatively large losses (rare instances).

Figure 10: The tradeoffs between the average-loss and the max/min-loss offered by TERM on the point estimation (top) and logistic regression (bottom) toy examples presented in Figure1, empirically validating Theorems 1-4. Positive values of t trade the average-loss for the max-loss, while negative values of t trade the average-loss for the min-loss. Increasing t from ´8 to `8 results in the reduction of loss variance, allowing the solution to tradeoff between bias/variance and potentially improve generalization.

Figure 12: In the presence of random noise with the same mean as that of clean data, TERM with negative

Figure14: TERM with negative t's (blue) cannot fit clean data if the noisy samples (brown) are adversarial or structured in a manner that differs substantially from the underlying true distribution.

Figure 15: TERM achieves higher test accuracy than the baselines, and can match the performance of Genie ERM (i.e., training on all the clean data combined).

Figure 16: TERM FL (t " 0.1) significantly

TERM is competitive with robust regres-

TERM is competitive with robust classification

Hierarchical TERM can address both class imbalance and noisy samples. Duchi, 2017) 0.844 (.010) 0.939 (.004) 0.622 (.011) 0.906 (.005) 0.634 (.014) 0.907 (.006) 0.051 (.014) 0.792 (.012) FocalLoss (Lin et al., 2017) 0.834 (.013) 0.937 (.004) 0.806 (.020) 0.918 (.003) 0.638 (.008) 0.908 (.005) 0.565 (.027) 0.890 (.009) TERM sc 0.840 (.010) 0.937 (.004) 0.836 (.018) 0.921 (.002) 0.852 (.010) 0.924 (.004) 0.778 (.008) 0.900 (.005) TERM ca 0.844 (.014) 0.938 (.004) 0.834 (.021) 0.918 (.003) 0.846 (.015) 0.933 (.003) 0.806 (.020) 0.901 (.010)

Lemma (Gradient of Λ with respect to t and θ). For all t P R and all θ P Θ, , we characterize some of the general properties of the solutions of TERM objectives. Note that these properties are established under Assumptions 3 and 4.Proof. The proof follows from definition and the assumption that Θ is an open set.

∇ θ H pwpt; θqq " ∇ θ `Λpt; θq ´tApθq `tθ J Mpt; θq ˘(135)" t∇ θ Apθq ´tMpt; θq ´t∇ θ Apθq `tMpt; θq ´t2 Vpt; θqθ

Input: t, α, λ while stopping criteria not reached do sample minibatch B uniformly at random from rN s compute the loss f px; θq and gradient ∇ θ f px; θq for all x P B

[Theorem 1].Figure8: As t Ñ `8, the objective becomes less smooth in the vicinity of the final solution, hence suffering from slower convergence. For negative values of t, TERM converges fast due to the smoothness in the vicinity of solutions despite its non-convexity.

An alternative noise setup involving both feature noise and label noise. Similarly, TERM with t ă 0

A complete comparison including two MentorNet variants. TERM is able to match the performance of

Both q-FFL and TERM can encourage more uniform accuracy distributions across the devices in federated networks while maintaining similar average performance.

ACKNOWLEDGEMENTS

; and to Meisam Razaviyayn (USC) for helpful discussions and pointers to exponential smoothing . The work of TL and VS was supported in part by the National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.We are grateful to Arun Sai Suggala and Adarsh Prasad (CMU) for their helpful comments on robust regression; to Zhiguang Wang, Dario Garcia Garcia, Alborz Geramifard, and other members of Facebook AI for productive discussions and feedback and pointers to prior work (Cohen & Shashua, 2014; Cohen et al., 2016; Rockafellar et al., 2000; Wang et al., 2016) ; and to Meisam Razaviyayn (USC) for helpful discussions and pointers to exponential smoothing (Kort & Bertsekas, 1972; Pee & Royset, 2011 ), Value-at-Risk (Nouiehed et al., 2019a; Rockafellar & Uryasev, 2002) , and general properties of gradient-based methods in non-convex optimization problems (Ge et al., 2015; Jin et al., 2017; 2020; Ostrovskii et al., 2020) . The work of TL and VS was supported in part by the National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.

APPENDIX

In this appendix we provide full statements and proofs of the analyses presented in Section (Appendix A-G); details on the methods we propose for solving TERM (Appendix H); complete empirical results and details of our empirical setup (Appendix I-J), and a discussion on the broader impacts (both positive and negative) of TERM and the research herein (Appendix K). We provide a table of contents below for easier navigation. 

C HIERARCHICAL MULTI-OBJECTIVE TILTING

We start by stating the hierarchical multi-objective tilting for a hierarchy of depth 3. While we don't directly use this form, it is stated to clarify the experiments in Section 5 where tilting is done at class level and annotator level, and the sample-level tilt value could be understood to be 0. pT px i q ´Mpt; θqqpT px i q ´Mpt; θqq J e ´tθ J T pxiq´Γpt;θq .Lemma 8. For all t P R, we have Vpt; θq ą 0.Next we state a few key relationships that we will use in our characterizations. The proofs are straightforward and omitted for brevity. Lemma 9 (Partial derivatives of Γ). For all t P R and all θ P Θ,Lemma 10 (Partial derivatives of M). For all t P R and all θ P Θ,The next few lemmas characterize the partial derivatives of the cumulant generating function. In particular, by taking the limit as t Ñ 0,completing the proof.Theorem 3 (Average-vs. min-loss tradeoff). Under Assumption 3 and Assumption 4, for any t P R´,Proof of Theorem 3. 

H ALGORITHMS FOR SOLVING TERM

In the main text, we present TERM in the batch setting (Algorithm 1). Here we provide the stochastic variants of the solvers in the context of hierarchical multi-objective tilting (see Eq. ( 4)).There are a few points to note about stochastic solvers (Algorithm 2):1. It is intractable to compute the exact normalization weights for the samples in the minibatch. Hence, we use r R g,τ , a term that incorporates stochastic dynamics, to follow the tilted objective for each group g, which is used for normalizing the weights as in (3).2. While we sample the group from which we draw the minibatch, for small number of groups, one might want to draw one minibatch per each group and weight the resulting gradients accordingly.3. The second last line in Algorithm 2, concerning the update of r R g,τ is not a trivial linear averaging. Instead, we use a tilted averaging to ensure an unbiased estimator (if θ is not being updated).Algorithm 2: Stochastic TERM Initialize : r R g,τ " 0 @g P rGs Input: t, τ, α, λ while stopping criteria not reached do sample g on rGs from a Gumbel-Softmax distribution with logits r R g,τ `1 t log |g| and temperature 1 t sample minibatch B uniformly at random within group g compute the loss f px; θq and gradient ∇ θ f px; θq for allThe stochatic algorithm above requires roughly the same time/space complexity as mini-batch SGD, and thus scales similarly for large-scale problems. TERM for the non-hierarchical cases can be recovered from Algorithm 1 and 2 by setting the inner-level tilt parameter τ " 0. For completeness, we also describe them here. Algorithm 3 is the sample-level tilting algorithm in the batch setting, and Algorithm 4 is its stochastic variant.Algorithm 3: Batch Non-Hierarchical TERM Input: t, α while stopping criteria not reached do compute the loss f px i ; θq and gradient ∇ θ f px i ; θq for all i P rN s r Rpt; θq Ð t-tilted loss (2) on all i P rN s w i pt; θq Ð e tpf pxi;θq´r Rpt;θqq θ Ð θ ´α N ř iPrN s w i pt; θq∇ θ f px i ; θq end In order to verify the correctness of Algorithm 4, we plot the distance of the solution θ to the solution θ ˚obtained by running the full gradient method (Algorithm 1) in terms of the number of iterations. In Figure 7 , we see that θ can be arbitrarily close to θ ˚, and Algorithm 4 with t ‰ 0 converges similarly to mini-batch SGD with t " 0. As mentioned in the main text, theoretically analyze the convergence of stochastic solvers would be interesting direction of future work. The challenges would be to characterize the tightness of the estimator r R to the true risk R at each iteration leveraging the proposed tilted averaging structure.We summarize the applications we solve TERM for and the algorithms we use in Table 4 below on the left. by (Namkoong & Duchi, 2017) . We examine the accuracy on the rare class (Y " 0), the common class (Y " 1), and overall accuracy.The mean and standard error of accuracies are reported in Table 8 . RobustRegRisk and TERM offer similar performance improvements compared with other baselines, such as linear SVM, Learn-Rewight (Ren et al., 2018) , FocalLoss (Lin et al., 2017) , and HRM (Leqi et al., 2019) . For larger t, TERM achieves similar accuracy in both classes, while RobustRegRisk does not show similar trends by sweeping its hyperparameters. It is common to adjust the decision threshold to boost the accuracy on the rare class. We do this for ERM and RobustRegRisk and optimize the threshold so that ERM `and RobustRegRisk `result in the same validation accuracy on the rare class as TERM (t " 50). TERM achieves similar performance to RobustRegRisk `, without the need for an extra tuned hyperparameter. 

J EXPERIMENTAL DETAILS

We first describe the datasets and models used in each experiment presented in Section 5, and then provide a detailed setup including the choices of hyperparameters. All code and datasets are publicly available at github.com/litian96/TERM.

J.1 DATASETS AND MODELS

We apply TERM to a diverse set of real-world applications, datasets, and models.In Section 5.1, for regression tasks, we use the drug discovery data extracted from (Diakonikolas et al., 2019) which is originally curated from (Olier et al., 2018) and train linear regression models with different losses. There are 4,085 samples in total with each having 411 features. We randomly split the dataset into 80% training set, 10% validation set, and 10% testing set. For mitigating noise on classification tasks, we use the standard CIFAR-10 data and their standard train/val/test partitions along with a standard inception network (Szegedy et al., 2016) . For experiments regarding mitigating noisy annotators, we again use the CIFAR-10 data and their standard partitions with a ResNet20 model. The noise generation procedure is described in Section 5.1.In Section 5.2, for fair PCA experiments, we use the complete Default Credit data to learn lowdimensional approximations and the loss is computed on the full training set. We follow the exact data processing steps described in the work (Samadi et al., 2018) we compare with. There are 30,000 total data points with 21-dimensional features (after preprocessing). Among them, the high education group has 24,629 samples and the low education group has 5,371 samples. For class imbalance experiments, we directly take the unbalanced data extracted from MNIST (LeCun et al., 1998) used in (Ren et al., 2018) . When demonstrating the variance reduction of TERM, we use the HIV-1 dataset (Rögnvaldsson, 2013) as in (Namkoong & Duchi, 2017) and randomly split it into 80% train, 10% validation, and 10% test set. There are 6,590 total samples and each has 160 features. We report results based on five such random partitions of the data. We train logistic regression models (without any regularization) for this binary classification task for TERM and the baseline methods. We also investigate the performance of a linear SVM.In Section 5.3, the HIV-1 data are the same as that in Section 5.2. We also manually subsample the data to make it more imbalanced, or inject random noise, as described in Section 5.3.

J.2 HYPERPARAMETERS

Selecting t. In Section 5.2 where we consider positive t's, we select t from a limited candidate set of t0.1, 0.5, 1, 5, 10, 50, 100, 200u on the held-out validation set. For initial robust regression experiments, RMSE changed by only 0.08 on average across t; we thus used t " ´2 for all experiments involving noisy training samples (Section 5.1 and Section 5.3).Other parameters. For all experiments, we tune all other hyperparameters (the learning rates, the regularization parameters, the decision threshold for ERM `, ρ for (Namkoong & Duchi, 2017) , α and γ for focal loss (Lin et al., 2017) ) based on a validation set, and select the best one. For experiments regarding focal loss (Lin et al., 2017) , we select the class balancing parameter (α in the original focal loss paper) from rangep0.05, 0.95, 0.05q and select the main parameter γ from t0.5, 1, 2, 3, 4, 5u.We tune ρ in (Namkoong & Duchi, 2017) such that ρ n is selected from t0.5, 1, 2, 3, 4, 5, 10u where n is the training set size. All regularization parameters including regularization for linear SVM are selected from t0.0001, 0.01, 0.1, 1, 2u. For all experiments on the baseline methods, we use the default hyperparameters in the original paper (or the open-sourced code).We summarize a complete list of main hyperparameter values as follows.Section 5.1:• Robust regression. The threshold parameter δ for Huber loss for all noisy levels is 1, the corruption parameter k for CRR is: 500 (20% noise), 1000 (40% noise), and 3000 (80% noise); and TERM uses t " ´2. • Robust classification. The results are all based on the default hyperparameters provided by the open-sourced code of MentorNet (Jiang et al., 2018) , if applicable. We tune the q parameter for generalized cross entropy (GCE) from t0.4, 0.8, 1.0u and select a best one for each noise level.For TERM, we scale t linearly as the number of iterations from 0 to -2 for all noise levels. • Low-quality annotators. For all methods, we use the same set of hyperparameters. The initial step-size is set to 0.1 and decayed to 0.01 at epoch 50. The batch size is 100.Section 5.2:• Fair PCA. We use the default hyperparameters and directly run the public code of (Samadi et al., 2018) to get the results on the min-max fairness baseline. We use a learning rate of 0.001 for our gradient-based solver for all target dimensions. • Handling class imbalance. We take the open-sourced code of LearnReweight (Ren et al., 2018) and use the default hyperparameters for the baselines of LearnReweight, HardMine, and ERM. We implement focal loss, and select α " 0.05, γ " 2. • Variance reduction. The regularization parameter for linear SVM is 1. γ for focal loss is 2. We perform binary search on the decision thresholds for ERM `and RobustRegRisk `, and choose 0.26 and 0.49, respectively.Section 5.3:• We tune the q parameter for GCE based on validation data. We use q " 0, 0, 0.7, 0.3 respectively for the four scenarios we consider. For RobustlyRegRisk, we use ρ n " 10 (where n is the training sample size) and we find that the performance is not sensitive to the choice of ρ. For focal loss, we tune the hyperparameters for best performance and select γ " 2, α "0.5, 0.1, 0.5, and 0.2 for four scenarios. We use t " ´2 for TERM in the presence of noise, and tune the positive t's based on the validation data. In particular, the values of tilts under four cases are: (0, 0.1), (0, 50), (-2, 5), and (-2, 10) for TERM sc and (0.1, 0), (50, 0), (1, -2) and (50, -2) for TERM ca .

