PREDICTOR-CORRECTOR ALGORITHMS FOR STOCHASTIC OPTIMIZATION UNDER GRADUAL DISTRIBUTION SHIFT

Abstract

Time-varying stochastic optimization problems frequently arise in machine learning practice (e.g., gradual domain shift, object tracking, strategic classification). Often, the underlying process that drives the distribution shift is continuous in nature. We exploit this continuity by developing predictor-corrector algorithms for time-varying stochastic optimization that anticipate changes in the underlying data generating process through a predictor-corrector term in the update rule. The key challenge is the estimation of the predictor-corrector term; a naive approach based on sample-average approximation may lead to non-convergence. We develop a general moving-average based method to estimate the predictor-corrector term and provide error bounds for the iterates, both with exact and with noisy access to queries of the relevant derivatives of the loss function. Furthermore, we show (theoretically and empirically in several examples) that our method outperforms non-predictor-corrector methods that do not anticipate changes in the data generating process.

1. INTRODUCTION

Stochastic optimization is a basic problem in modern machine learning (ML) theory and practice. Although there is a voluminous literature on stochastic optimization (Agarwal et al., 2014; Moulines & Bach, 2011; Bottou, 2003; 2012; Bottou & Bousquet, 2007), most prior works consider a time-invariant stochastic optimization problem in which the data generating distribution does not change over time. However, there is an abundance of real examples in which the underlying optimization problem is time-varying; these can be broadly divided into two categories. The first kind arises due to exogenous variation in the data generating process. A concrete example is the object tracking problem, in which an observer receives noisy signals regarding the position of a moving object, and the goal is to infer the trajectory of the object. The second kind of time-varying optimization problem arises due to endogenous variation in the data generating process. Examples here include strategic classification (Dong et al., 2018; Hardt et al., 2016) and performative prediction (Perdomo et al., 2020; Mendler-Dünner et al., 2020; Brown et al., 2022). Although there are a few recent papers on time-varying stochastic optimization (e.g., Cutler et al. (2021); Nonhoff & Müller (2020); Dixit et al. (2019; 2018)), they model the temporal drift as discrete, precluding them from exploiting smoothness in the drift. This leads to worse asymptotic tracking error, depending on the magnitude of the temporal drift of the optimal solution (e.g., see Popkov (2005); Zavlanos et al. (2012); Zhang et al. (2009); Ling & Ribeiro (2013) and references therein). In this paper, we focus on time-varying stochastic optimization problems in which the temporal drift is driven by a continuous-time process. We leverage the smoothness of the temporal drift to develop predictor-corrector (PC) methods for stochastic optimization that anticipate future changes in the data generating process to improve the asymptotic tracking error (ATE) (see Definition 2.1).
The main benefit of such methods is a smaller tracking error compared to stochastic optimization algorithms that do not leverage the smoothness of the temporal drift. A primary challenge in stochastic time-varying optimization is properly accounting for the temporal drift through a PC term in the update rule. The noise in the stochastic setting makes naive estimates of the PC term unstable and may lead to non-convergence. One of our main contributions is a general method for estimating the PC term. We complement the methodological contributions with theoretical results showing that PC stochastic optimization algorithms inherit the benefits of their non-stochastic counterparts for time-varying problems. In particular, we show that PC stochastic optimization algorithms have smaller asymptotic tracking error (ATE) than their non-PC counterparts. We also demonstrate the superiority of PC algorithms empirically on time-varying linear regression and target tracking applications. The rest of the paper is organized as follows: in Sections 2 and 3, we present the algorithm, highlight the difference between predictor-corrector algorithms and time-non-adaptive algorithms such as simple gradient descent, and establish bounds on the ATE of both kinds of algorithms. In Section 4, we present three concrete instances of time-varying stochastic optimization problems driven by underlying gradual distribution shifts and derive the details of PC algorithms for these problems. Section 5 concludes.

2. FRAMEWORK AND ALGORITHM

In a typical learning problem, we have $n$ samples $X_1, \ldots, X_n \sim P$ and, based on the data, we estimate some parametric (or non-parametric) functional of the underlying distribution $\theta^\star = \nu(P)$ by minimizing some loss function $\ell(\theta, X)$. A standard assumption for consistent estimation of $\theta^\star$ is that $R(\theta) \triangleq E[\ell(X, \theta)]$ is uniquely minimized at $\theta^\star$. In a time-varying framework, we assume that the data generating distribution $P \equiv P_t$ changes with time, and so does the parameter of interest $\theta^\star_t = \nu(P_t)$, along with the (time-varying) risk function $R(\theta, t) \triangleq E_{P_t}[\ell(X, \theta)]$. As a concrete example, consider the object tracking problem studied in Patra et al. (2020): suppose we have installed $n$ sensors at positions $x_1, \ldots, x_n$ (which remain fixed over time), and let $\theta^\star_t$ be the location of the target object at time $t$. At each time, we receive noisy feedback from the sensors regarding the position of the target, i.e., $y_{i,t} = \|x_i - \theta^\star_t\|_2 + \epsilon_{it}$ for $1 \le i \le n$, where the $\epsilon_{it}$ are i.i.d. (over $i$ and $t$) with mean zero and variance $\tau^2$. Denote by $y_t \in \mathbb{R}^n$ the vector of observations from the $n$ sensors at time $t$. A natural approach to estimating $\theta^\star_t$ is to minimize the squared error loss $\ell(y_t, \theta) = \sum_{i=1}^n (y_{i,t} - \|x_i - \theta\|_2)^2$, which yields the risk function $R(\theta, t) = \sum_{i=1}^n E[(y_{i,t} - \|x_i - \theta\|_2)^2] = n\tau^2 + \sum_{i=1}^n (\|x_i - \theta^\star_t\|_2 - \|x_i - \theta\|_2)^2$. From the risk function, it is immediate that under very mild assumptions on $x_1, \ldots, x_n$ (i.e., that they are in general position, as discussed in Appendix A.1) we have
$$\theta^\star(t) = \arg\min_\theta R(\theta, t), \quad t \ge 0. \tag{2.1}$$
We use the notations $\theta^\star(t)$ and $\theta^\star_t$ interchangeably. From the above formulation, it is immediate that we observe only one sample/incident at each time point. Therefore, if $\theta^\star_t$ behaves erratically over time, there is no hope of learning its evolution pattern from the data, and it is imperative to assume some smoothness on the target $\theta^\star_t$.
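A small numerical instance of the sensor model above may help fix ideas (the sensor positions, target location, and noise level here are illustrative choices, not from the paper):

```python
import numpy as np

# Toy instance of the object tracking model: n fixed sensors at x_i
# observe y_{i,t} = ||x_i - theta_star_t||_2 + eps_{it}, eps ~ (0, tau^2).
rng = np.random.default_rng(1)
n, tau = 5, 0.1
x = rng.uniform(-1, 1, size=(n, 2))    # fixed sensor positions (illustrative)
theta_star_t = np.array([0.3, -0.2])   # true target location at time t

# One batch of noisy distance observations at time t.
y_t = np.linalg.norm(x - theta_star_t, axis=1) + tau * rng.standard_normal(n)

def loss(theta):
    """Squared-error loss l(y_t, theta) = sum_i (y_{i,t} - ||x_i - theta||_2)^2."""
    return np.sum((y_t - np.linalg.norm(x - theta, axis=1)) ** 2)

# The loss is small (pure noise, about n*tau^2 in expectation) at the
# true location and larger at a point far from the target.
print(loss(theta_star_t), loss(np.array([1.0, 1.0])))
```

Minimizing this loss over $\theta$ for each query time then recovers an estimate of the target location, consistent with (2.1).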
Our proposed method exploits the smoothness of $\theta^\star_t$ to improve its estimation over time. We now formulate the problem: we assume that the underlying distribution $\{P_t\}$ changes continuously with time $t$. As statisticians, we query the model at discrete time steps (say at times $\{kh\}_{k \in \mathbb{N}}$, where $h$ is the time step that controls the query frequency) and observe $n$ samples $\{X_{it}\}_{1 \le i \le n}$ from the distribution at that time. As a consequence, we have a sequential batch of data with which we aim to estimate $\theta^\star_{kh}$, i.e., the parameter at the time of the query. Note that to estimate the parameter at time $t$, one may use all previous data points. We evaluate the quality of our estimator using the asymptotic tracking error (ATE) defined below.



Code: https://github.com/smaityumich/concept-drift.




