UNDERSTANDING THE ROLE OF IMPORTANCE WEIGHTING FOR DEEP LEARNING

Abstract

The recent paper by Byrd & Lipton (2019), based on empirical observations, raises a major concern about the impact of importance weighting for over-parameterized deep learning models. They observe that, as long as the model can separate the training data, the impact of importance weighting diminishes as training proceeds. Nevertheless, a rigorous characterization of this phenomenon has been lacking. In this paper, we provide formal characterizations and theoretical justifications for the role of importance weighting with respect to the implicit bias of gradient descent and margin-based learning theory. We reveal both the optimization dynamics and the generalization performance under deep learning models. Our work not only explains the various novel phenomena observed for importance weighting in deep learning, but also extends to settings where the weights are optimized as part of the model, which applies to a number of topics under active research.

1. INTRODUCTION

Importance weighting is a standard tool for estimating a quantity under a target distribution when only samples from some source distribution are accessible. It has drawn extensive attention in the statistics and machine learning communities. Causal inference for deep learning relies heavily on propensity score weighting, which applies off-policy optimization with counterfactual estimators (Gilotte et al., 2018; Jiang & Li, 2016), modelling with observational feedback (Schnabel et al., 2016; Xu et al., 2020), and learning from controlled interventions (Swaminathan & Joachims, 2015). Importance weighting methods are also applied to characterize distribution shifts for deep learning models (Fang et al., 2020), with modern applications in areas such as domain adaptation (Azizzadenesheli et al., 2019; Lipton et al., 2018) and learning from noisy labels (Song et al., 2020). Other uses include curriculum learning (Bengio et al., 2009) and knowledge distillation (Hinton et al., 2015), where the weights characterize the model's confidence in each sample. To reduce the discrepancy between the source and target distributions during model training, a standard routine is to minimize a weighted risk (Rubinstein & Kroese, 2016). Many techniques have been developed to this end, and a common strategy is to re-weight the classes proportionally to the inverse of their frequencies (Huang et al., 2016; 2019; Wang et al., 2017). For example, Cui et al.

* The work was done when the author was with Walmart Labs.
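The weighted-risk routine described above can be sketched concretely. Below is a minimal NumPy illustration of re-weighting classes inversely to their frequencies and minimizing the resulting weighted empirical risk; the function names and the normalization choice (mean per-sample weight equal to one) are illustrative assumptions, not prescribed by any of the cited works.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Weight each class proportionally to the inverse of its frequency,
    normalized so the per-sample weights average to 1 over the training set."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = 1.0 / counts
    weights *= len(labels) / np.sum(weights[labels])
    return weights

def weighted_risk(probs, labels, class_weights):
    """Weighted empirical risk: mean cross-entropy loss, with each sample's
    loss scaled by the weight of its class."""
    sample_losses = -np.log(probs[np.arange(len(labels)), labels])
    return np.mean(class_weights[labels] * sample_losses)

# Imbalanced toy data: class 0 appears four times as often as class 1.
labels = np.array([0, 0, 0, 0, 1])
probs = np.full((5, 2), 0.5)  # an uninformative classifier
w = inverse_frequency_weights(labels, 2)
loss = weighted_risk(probs, labels, w)
```

Under this normalization, the rare class receives four times the weight of the frequent class, so minimizing the weighted risk mimics training on a class-balanced source distribution.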

