HOW DOES OVERPARAMETRIZATION AFFECT PERFORMANCE ON MINORITY GROUPS?

Abstract

The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at the more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.

1. INTRODUCTION

Traditionally, the goal of machine learning (ML) is to optimize the average or overall performance of ML models. The relentless pursuit of this goal eventually led to the development of deep neural networks, which achieve state-of-the-art performance in many application areas. A prominent trend in the development of such modern ML models is overparameterization: the models are so complex that they are capable of perfectly interpolating the training data. There is a large body of work showing that overparameterization improves the performance of ML models in a variety of settings (e.g., ridgeless least squares (Hastie et al., 2019), random feature models (Belkin et al., 2019; Mei and Montanari, 2019b), and deep neural networks (Nakkiran et al., 2019)).

However, as ML models find their way into high-stakes decision-making processes, other aspects of their performance (besides average performance) are coming under scrutiny. One aspect that is particularly relevant to the fairness and safety of ML models is their performance on traditionally disadvantaged demographic groups. There is a troubling line of work showing that ML models that perform well on average may perform poorly on minority groups of training examples. For example, Buolamwini and Gebru (2018) show that commercial gender classification systems, despite achieving low classification error on average, tend to misclassify dark-skinned people. In the same spirit, Wilson et al. (2019) show that pedestrian detection models, despite performing admirably on average, have trouble recognizing dark-skinned pedestrians.

The literature also examines the effect of model size on worst-group generalization error. Sagawa et al. (2020) find that increasing model size beyond the threshold of zero training error can have a negative impact on test error for minority groups because the model learns spurious correlations. They show that subsampling the majority groups is far more successful than upweighting the minority groups in reducing worst-group error. Pham et al. (2021) conduct more extensive experiments to investigate the influence of model size on worst-group error under various neural network architectures and model parameter initialization configurations. They discover that increasing model size either improves or does not harm worst-group test performance across all settings. Idrissi et al. (2021) recommend using simple methods, i.e., subsampling and reweighting to balance classes or groups, before venturing into more complicated procedures. They suggest that newly developed robust optimization approaches for worst-group error control (Sagawa et al., 2019; Liu et al., 2021) can be computationally demanding, and that there is no strong (statistically significant) evidence of their advantage over these simple methods.
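To make the two "simple methods" discussed above concrete, the following is a minimal sketch of group-balanced subsampling (drop majority-group examples until all group sizes match) and of inverse-frequency reweighting (so each group contributes equally to a weighted ERM objective). The function and argument names are illustrative and not taken from any of the cited papers.

```python
import numpy as np

def subsample_to_balance_groups(X, y, groups, rng=None):
    """Subsample each group down to the size of the smallest group.

    Returns the subsampled (X, y, groups); the result has equal group sizes,
    so plain ERM on it weights every group equally.
    """
    rng = np.random.default_rng(rng)
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=n_min, replace=False)
        for g in labels
    ])
    return X[keep], y[keep], groups[keep]

def reweight_groups(groups):
    """Per-example weights inversely proportional to group size.

    Weights are normalized to average 1, and each group's total weight is
    equal, so a weighted ERM objective treats all groups symmetrically.
    """
    groups = np.asarray(groups)
    labels, counts = np.unique(groups, return_counts=True)
    weight_of = {g: len(groups) / (len(labels) * c)
                 for g, c in zip(labels, counts)}
    return np.array([weight_of[g] for g in groups])
```

Subsampling discards majority-group data, while reweighting keeps all examples; the empirical findings summarized above suggest the two are not interchangeable in the overparameterized regime.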

