DATA DRIFT CORRECTION VIA TIME-VARYING IMPORTANCE WEIGHT ESTIMATOR

Abstract

Real-world deployment of machine learning models is challenging when data evolves over time, and data does evolve over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we may be able to design methods that address it. This paper addresses situations where data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the data distribution. Such an estimator allows the training method to selectively sample past data: not just data from the past that is similar to the present, as a standard importance weight estimator would select, but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general. We demonstrate several ways of implementing it that exploit known structure in the evolution of the data. We evaluate this approach on a variety of problems, ranging from supervised learning tasks (multiple image classification datasets), where the data undergoes a sequence of gradual shifts of our design, to reinforcement learning tasks (robotic manipulation and continuous control), where the data shifts organically as the policy or the task changes.

1. INTRODUCTION

Real-world machine learning performance often drops during deployment when test data no longer stem from the same distribution from which previous training data were sampled. Many tools have thus been developed to address data distribution shift¹ (Heckman, 1979; Shimodaira, 2000; Huang et al., 2006; Bickel et al., 2007; Sugiyama et al., 2007b; 2008; Gretton et al., 2008), often using an estimate of the Radon-Nikodym derivative (Rosenbaum & Rubin, 1983) between the two distributions (also known as a propensity score or importance weight) to re-weight the training data so that its weighted distribution better matches the test data (Agarwal et al., 2011; Wen et al., 2014; Reddi et al., 2015b; Chen et al., 2016; Fakoor et al., 2020b;a; Tibshirani et al., 2019). These methods mostly consider offline settings with one training and one test dataset. In many applications, however, data for supervised learning is continuously collected from a constantly evolving distribution (e.g., due to a pandemic (Callaway, 2020)), so the single train/test paradigm no longer applies. Settings in which the data distribution drifts gradually over time (rather than changing abruptly) are particularly ubiquitous (Shalev-Shwartz, 2012). Assuming observations are not statistically dependent over time (as in time-series), how best to train models in such streaming/online learning settings remains a key question (Lu et al., 2018). Two basic options are: fitting/updating the model only on recent data (which is statistically suboptimal if the distribution has not shifted significantly), or fitting the model to all previous observations (which yields biased estimates if shift has occurred). Here we consider a less crude approach in which past data are weighted during training with continuous-valued weights that vary over time.
Our proposed estimator of these weights generalizes standard two-sample propensity scores, allowing the training process to selectively emphasize past data collected at time t based on the distributional similarity between the present and time t. We evaluate our proposed method in supervised and reinforcement learning settings involving a sequence of gradually changing tasks with slow repeating patterns. In such settings, the model must not only continuously adapt to changes in the environment/task but also learn how to select past data that may have become more relevant for the current task. Comprehensive experiments demonstrate that our method can effectively detect shifts in the data and statistically account for such changes during both learning and inference.

2. RELATED WORK

Given its importance in real-world applications, the problem of learning from shifting distributions has been widely studied. Much past work has focused on a single shift between training and test data (Lu et al., 2021; Wang & Deng, 2018; Fakoor et al., 2020b), as well as on restricted forms of shift involving changes only in the features (Sugiyama et al., 2007a; Reddi et al., 2015a), in the labels (Lipton et al., 2018; Garg et al., 2020; Alexandari et al., 2020), or in the underlying relationship between the two (Zhang et al., 2013; Lu et al., 2018). Approaches that handle distributions evolving over time have been considered in the literature on concept drift (Gomes et al., 2019; Souza et al., 2020), reinforcement learning, where the behavior policy shifts relative to the target policy (Schulman et al., 2015; Wang et al., 2016; Fakoor et al., 2020a), (meta) online learning (Shalev-Shwartz, 2012; Finn et al., 2019; Harrison et al., 2020; Wu et al., 2021), and task-free continual/incremental learning (Aljundi et al., 2019; He et al., 2019); to our knowledge, however, existing methods in these settings do not employ time-varying data weights as we propose here. Time-varying data weights have been considered in survival analysis, which models longitudinal observations subject to censoring (Lu, 2005; Cox, 1972), but our weights are implemented differently (via deep learned estimates) and are utilized for general supervised learning and reinforcement learning under drift.

3. APPROACH

Consider a standard learning problem where training data are drawn from a probability distribution p(x) while test data follow a different distribution q(x). Our goal is to build a model that predicts equally well on both distributions. We can do so by observing that

$$\mathbb{E}_{x \sim q(x)}[\ell(x)] = \int \frac{dq(x)}{dp(x)}\,\ell(x)\,dp(x) = \mathbb{E}_{x \sim p(x)}\!\left[\frac{dq(x)}{dp(x)}\,\ell(x)\right] = \mathbb{E}_{x \sim p(x)}\!\left[\beta(x)\,\ell(x)\right], \tag{1}$$

where ℓ(x) is any function, say the loss of our model. The propensity score β(x) = dq(x)/dp(x) is the importance ratio; it can also be seen as the Radon-Nikodym derivative (Resnick, 2013) of the two distributions. The propensity score measures how likely it is that a sample x came from the distribution q rather than from p. We will call β(x) the "standard propensity score". Since the densities q(x) and p(x) are often unknown, a binary classifier (Agarwal et al., 2011) can be used to estimate β(x) from samples drawn from p and q. In particular, we create a dataset D = {(x_i, z_i)}_{i=1}^N for the binary classifier such that z_i = 1 if x_i ∼ p and z_i = -1 if x_i ∼ q, with half of the training examples having z = 1 (x ∼ p) and the other half z = -1 (x ∼ q). We fit the binary classifier by solving

$$\min_{\theta}\ \frac{1}{N} \sum_{(x_i, z_i) \in D} \log\big(1 + \exp(-z_i\, g_\theta(x_i))\big), \tag{2}$$

where g_θ can be either a linear or a non-linear model parameterized by θ.
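As a concrete sketch of this classifier-based estimation (our illustration, not code from the paper: a synthetic 1-D Gaussian pair for p and q, and a hand-rolled gradient-descent fit of a linear g_θ), the logistic objective (2) can be implemented as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example: training distribution p = N(0, 1), test q = N(1, 1).
xp = rng.normal(0.0, 1.0, size=(5000, 1))   # samples from p, label z = +1
xq = rng.normal(1.0, 1.0, size=(5000, 1))   # samples from q, label z = -1

X = np.vstack([xp, xq])
z = np.concatenate([np.ones(len(xp)), -np.ones(len(xq))])
Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias feature

# Minimize (1/N) sum_i log(1 + exp(-z_i g_theta(x_i))) for a linear g_theta
# by plain gradient descent.
theta = np.zeros(2)
for _ in range(2000):
    s = 1.0 / (1.0 + np.exp(z * (Xb @ theta)))       # sigmoid(-z * g_theta(x))
    grad = -(Xb * (z * s)[:, None]).mean(axis=0)     # gradient of the mean loss
    theta -= 0.5 * grad

# With z = +1 for p-samples, z = -1 for q-samples, and balanced classes,
# g_theta(x) approximates log(p(x)/q(x)), so the propensity score
# beta(x) = dq/dp is recovered as exp(-g_theta(x)).
def beta(x):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return np.exp(-(xb @ theta))
```

For this Gaussian pair the true ratio is β(x) = exp(x − 0.5), so the estimator should return values below 1 for x near the mean of p and above 1 for x near the mean of q.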

3.1. TIME-VARYING IMPORTANCE WEIGHT ESTIMATOR

Let us now consider a time-varying probability distribution p_t(x) with t = 1, . . . , T. We assume that we have recorded tuples D = {(x_i, t_i)}_{i=1}^m with each x_i ∼ p_{t_i}(x). Our goal is to build a model that predicts well on data from p_T(x) using the historical data D; that is, we seek to minimize

$$\mathbb{E}_{x \sim p_T(x)}[\ell(x)]. \tag{3}$$
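For intuition about this time-varying objective, the following sketch (again our illustration, under assumed simplifications: a synthetic drifting Gaussian stream, and one separately fitted linear classifier per past time step, rather than a single estimator conditioned on t as the paper proposes) re-weights each historical sample by an estimate of β_t(x) = dp_T(x)/dp_t(x):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical drifting stream: at time t, x ~ N(0.2 * t, 1).
T = 10
stream = [(t, rng.normal(0.2 * t, 1.0, size=(3000, 1))) for t in range(1, T + 1)]
x_now = stream[-1][1]                          # samples from the current p_T

def fit_log_ratio(x_p, x_q, iters=2000, lr=0.5):
    """Linear logistic fit of g(x) ~= log(p(x)/q(x)), with z = +1 for p-samples."""
    X = np.vstack([x_p, x_q])
    z = np.concatenate([np.ones(len(x_p)), -np.ones(len(x_q))])
    Xb = np.hstack([X, np.ones((len(X), 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(z * (Xb @ theta)))
        theta -= lr * -(Xb * (z * s)[:, None]).mean(axis=0)
    return theta

# beta_t(x) = dp_T(x)/dp_t(x): weight for a historical sample (x, t) so that the
# weighted average of past losses estimates E_{x ~ p_T}[loss(x)], as in Eq. (3).
weights = {}
for t, x_t in stream[:-1]:
    theta = fit_log_ratio(x_t, x_now)          # g ~= log(p_t / p_T)
    g = np.hstack([x_t, np.ones((len(x_t), 1))]) @ theta
    weights[t] = np.exp(-g)                    # exp(-g) ~= p_T / p_t

# Old samples that happen to look like p_T receive large weight, while samples
# typical only of the old distribution are down-weighted.
```

Fitting T − 1 separate classifiers is the naive baseline this construction suggests; the appeal of a time-varying estimator is that a single model of β(x, t) can share statistical strength across time steps and exploit patterns in how the distribution drifts.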



¹ Covariate, label, and concept shifts are all referred to as distribution shift in this paper wherever it is clear from context.




