DR-FAIRNESS: DYNAMIC DATA RATIO ADJUSTMENT FOR FAIR TRAINING ON REAL AND GENERATED DATA

Anonymous

Abstract

Fair visual recognition has become critical for preventing demographic disparity. A major cause of model unfairness is the imbalanced representation of different groups in training data. Recently, several works have aimed to alleviate this issue using generated data. However, these approaches often use generated data simply to equalize the amount of data across groups, which is not optimal for achieving high fairness due to differences in learning difficulty and generated data quality across groups. To address this issue, we propose a novel adaptive sampling approach that leverages both real and generated data for fairness. We design a bilevel optimization that finds the optimal data sampling ratios among groups and between real and generated data while training a model. The ratios are dynamically adjusted considering both the model's accuracy and its fairness. To efficiently solve our non-convex bilevel optimization, we propose a simple approximation to the solution given by the implicit function theorem. Extensive experiments show that our framework achieves state-of-the-art fairness and accuracy on the CelebA and ImageNet People Subtree datasets. We also observe that our method adaptively relies less on the generated data when its quality is poor.

1. INTRODUCTION

Model fairness in visual recognition is becoming essential to prevent discriminatory predictions across demographic groups. Recently, numerous unfairness issues have been reported (Wang et al., 2020; Najibi, 2020), and several fair image classification approaches have been proposed that do not discriminate against specific groups such as gender, age, or skin color (Ramaswamy et al., 2021; Roh et al., 2021). With the rapid progress in deep generative learning (Karras et al., 2020; Dhariwal & Nichol, 2021), a new research direction has emerged that improves fairness by augmenting training data with generated data. Recent breakthroughs in generative learning make generated data practical enough to use in real-world applications (OpenAI, 2022), and many high-quality pre-trained generative models are now open to the public (Rombach et al., 2022), which obviates the need to train such models ourselves. Thus, generated data is increasingly used to improve model performance, including fairness. From a fairness perspective, generated data complements real data by making it more diverse. For example, if a specific group's data is collected from a limited data source that does not cover the full data distribution, that group may be discriminated against in model training due to the resulting bias (Mehrabi et al., 2021). In this case, generated data can be used to supplement that underrepresented group. However, most fair training approaches that use generated data simply generate similar amounts of data across groups (Ramaswamy et al., 2021; Choi et al., 2020), which may not be optimal for improving group fairness metrics such as equalized odds (Hardt et al., 2016) and demographic parity (Feldman et al., 2015). Such suboptimality could originate from 1) the differences in learning difficulty across groups and 2) the potential bias (i.e., typically in the form of missing modes) and quality issues in the generated data that can hurt the accuracy and fairness of the model under training.
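To make the two group-fairness criteria concrete, the following sketch computes the demographic parity gap and the equalized odds gap for binary predictions and a binary sensitive attribute. The function names and the max-over-groups gap formulation are our illustrative choices, not notation taken from the paper:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    # Gap in positive prediction rate P(y_hat = 1 | a) across groups a.
    rates = [y_pred[group == a].mean() for a in np.unique(group)]
    return max(rates) - min(rates)

def equalized_odds_gap(y_true, y_pred, group):
    # Largest gap across groups in P(y_hat = 1 | a, y), taken over both
    # label values (i.e., gaps in true-positive and false-positive rates).
    gaps = []
    for y in (0, 1):
        mask = y_true == y
        rates = [y_pred[mask & (group == a)].mean() for a in np.unique(group)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0])
group = np.array([0, 0, 0, 1, 1, 1])
dp = demographic_parity_gap(y_pred, group)
eo = equalized_odds_gap(y_true, y_pred, group)
```

Generating equal amounts of data per group balances group sizes but does not directly minimize either gap, which is the suboptimality the paragraph above points out.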
Therefore, it is essential to find the right mix of generated and real data for the best accuracy and fairness. In this paper, we harness the potential of both real and generated data via adaptive sampling to improve group fairness while minimizing accuracy degradation. To this end, we design a new sampling approach called Dr-Fairness (Dynamic Data Ratio Adjustment for Fairness) that adaptively adjusts data ratios among groups and between real and generated data over iterations, as in Figure 1a.
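The dynamic-ratio idea above can be sketched as maintaining one sampling weight per (group, real/generated) pool and updating it from a per-pool validation signal. This is a simplified multiplicative-weights illustration under our own assumptions (names, update rule, and the `val_loss` signal are hypothetical); Dr-Fairness itself derives the ratios from a bilevel optimization, not from this heuristic:

```python
import numpy as np

# One sampling weight per (group, source) pool; "gen" denotes generated data.
groups, sources = ["g0", "g1"], ["real", "gen"]
weights = {(g, s): 1.0 for g in groups for s in sources}

def sampling_ratios(weights):
    # Normalize weights into the per-iteration sampling ratios.
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def update(weights, val_loss, lr=0.5):
    # Upweight pools whose validation signal is above average, e.g. pools
    # belonging to groups the model still fits poorly; downweight the rest
    # (e.g., a generated pool whose samples are low quality).
    mean = sum(val_loss.values()) / len(val_loss)
    for k in weights:
        weights[k] *= np.exp(lr * (val_loss[k] - mean))
    return sampling_ratios(weights)

# Example: group g1's real data is under-fit, so its ratio should grow.
val_loss = {("g0", "real"): 0.2, ("g0", "gen"): 0.2,
            ("g1", "real"): 0.8, ("g1", "gen"): 0.4}
ratios = update(weights, val_loss)
```

Repeating this update each iteration lets the mix drift away from a fixed 50/50 real/generated split whenever one pool helps fairness more than another.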

