DATA DRIFT CORRECTION VIA TIME-VARYING IMPORTANCE WEIGHT ESTIMATOR

Abstract

Real-world deployment of machine learning models is challenging because data evolves over time. While no model can cope with data that evolves in an arbitrary fashion, methods can be designed to address changes that follow some pattern. This paper addresses the situation where data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the data distribution. Such an estimator allows the training method to selectively sample past data: not just data similar to the present, as a standard importance weight estimator would select, but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general, and we demonstrate several ways of implementing it that exploit known structure in the evolution of the data. We evaluate this approach on a variety of problems, ranging from supervised learning tasks (multiple image classification datasets) where the data undergoes a sequence of gradual shifts of our design, to reinforcement learning tasks (robotic manipulation and continuous control) where the data shifts organically as the policy or the task changes.

1. INTRODUCTION

Real-world machine learning performance often drops during deployment when test data no longer stem from the same distribution from which the training data were sampled. Many tools have thus been developed to address data distribution shift¹ (Heckman, 1979; Shimodaira, 2000; Huang et al., 2006; Bickel et al., 2007; Sugiyama et al., 2007b; 2008; Gretton et al., 2008), often using an estimate of the Radon-Nikodym derivative between the two distributions (also known as a propensity score or importance weight (Rosenbaum & Rubin, 1983)) to re-weight the training data so that its weighted distribution better matches the test data (Agarwal et al., 2011; Wen et al., 2014; Reddi et al., 2015b; Chen et al., 2016; Fakoor et al., 2020a;b; Tibshirani et al., 2019). These methods mostly consider offline settings with a single training dataset and a single test dataset. In many applications, however, data for supervised learning is continuously collected from a constantly evolving distribution (e.g., due to a pandemic (Callaway, 2020)), so that the single train/test paradigm no longer applies. Settings in which the data distribution drifts gradually over time, rather than changing abruptly, are particularly common (Shalev-Shwartz, 2012). Assuming observations are not statistically dependent over time (as in time series), how best to train models in such streaming/online learning settings remains a key question (Lu et al., 2018). Two basic options are: fitting/updating the model only on recent data (which is statistically suboptimal if the distribution has not significantly shifted) or fitting the model on all previous observations (which yields biased estimates if shift has occurred). Here we consider a less crude approach in which past data are weighted during training with continuous-valued weights that vary over time.
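To make the standard two-sample reweighting concrete, here is a minimal, hypothetical sketch (not code from this paper) of the classifier-based approach to density-ratio estimation: a probabilistic classifier trained to distinguish test from training samples recovers the importance weight through Bayes' rule, w(x) = P(test | x) / P(train | x) · n_train / n_test. The function name and toy data are our own illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(x_train, x_test):
    """Estimate w(x) = p_test(x) / p_train(x) with a domain classifier.

    Train a classifier to separate test (label 1) from train (label 0)
    samples; its predicted probabilities give the density ratio via
    Bayes' rule, up to the class-prior correction n_train / n_test.
    """
    X = np.vstack([x_train, x_test])
    y = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_test))])
    clf = LogisticRegression().fit(X, y)
    p = clf.predict_proba(x_train)[:, 1]  # P(test | x) at training points
    return p / (1.0 - p) * len(x_train) / len(x_test)

# Toy example: training data centered at 0, test data shifted to mean 1.
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=(500, 1))
x_te = rng.normal(1.0, 1.0, size=(500, 1))
w = importance_weights(x_tr, x_te)
# Training points lying closer to the test mean receive larger weights.
```

These weights can then be passed, e.g., as per-sample weights to any weighted empirical-risk minimizer so that the reweighted training distribution better matches the test distribution.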
Our proposed estimator of these weights generalizes standard two-sample propensity scores, allowing the training process to selectively emphasize past data collected at time t based on the distributional similarity between the present and time t.
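As a simplistic illustration of this idea (a sketch under our own assumptions, not the paper's actual estimator), one can fit a separate domain classifier between each past batch and the current batch, yielding per-sample weights for every time step; the more a past batch has drifted from the present, the more uneven its weights become. All names below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def per_step_weights(past_batches, current_batch):
    """For each past time step t, estimate importance weights of that
    batch relative to the current distribution, using one domain
    classifier per batch. Returns a list of per-sample weight arrays."""
    weights = []
    n_cur = len(current_batch)
    for x_t in past_batches:
        X = np.vstack([x_t, current_batch])
        y = np.concatenate([np.zeros(len(x_t)), np.ones(n_cur)])
        p = LogisticRegression().fit(X, y).predict_proba(x_t)[:, 1]
        weights.append(p / (1.0 - p) * len(x_t) / n_cur)
    return weights

# Gradual drift: batch means move from 0 toward the current mean 1.5,
# so older batches resemble the current distribution less.
rng = np.random.default_rng(1)
batches = [rng.normal(m, 1.0, size=(300, 1)) for m in (0.0, 0.5, 1.0)]
current = rng.normal(1.5, 1.0, size=(300, 1))
ws = per_step_weights(batches, current)
# Weights for the oldest batch are the most spread out (largest drift).
```

In this naive form each past batch is treated independently; exploiting the gradual structure of the drift, as the proposed estimator does, would additionally share information across time steps.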



¹Covariate, label, and concept shifts are all referred to as distribution shift in this paper wherever the meaning is clear from context.

