PREDICTOR-CORRECTOR ALGORITHMS FOR STOCHASTIC OPTIMIZATION UNDER GRADUAL DISTRIBUTION SHIFT

Abstract

Time-varying stochastic optimization problems frequently arise in machine learning practice (e.g., gradual domain shift, object tracking, strategic classification). Often, the underlying process that drives the distribution shift is continuous in nature. We exploit this underlying continuity by developing predictor-corrector algorithms for time-varying stochastic optimization that anticipate changes in the underlying data generating process through a predictor-corrector term in the update rule. The key challenge is the estimation of the predictor-corrector term; a naive approach based on sample-average approximation may lead to non-convergence. We develop a general moving-average based method to estimate the predictor-corrector term and provide error bounds for the iterates, both with exact and with noisy access to queries of the relevant derivatives of the loss function. Furthermore, we show (theoretically and empirically in several examples) that our method outperforms non-predictor-corrector methods that do not anticipate changes in the data generating process.

1. INTRODUCTION

Stochastic optimization is a basic problem in modern machine learning (ML) theory and practice. Although there is a voluminous literature on stochastic optimization (Agarwal et al., 2014; Moulines & Bach, 2011; Bottou, 2003; 2012; Bottou & Bousquet, 2007), most prior work considers a time-invariant stochastic optimization problem in which the data generating distribution does not change over time. However, there is an abundance of real examples in which the underlying optimization problem is time-varying; these can be broadly divided into two categories. The first kind arises due to exogenous variation in the data generating process. A concrete example is the object tracking problem, in which an observer receives noisy signals regarding the position of a moving object, and the goal is to infer the trajectory of the object. The second kind of time-varying optimization problem arises due to endogenous variation in the data generating process. Examples here include strategic classification (Dong et al., 2018; Hardt et al., 2016) and performative prediction (Perdomo et al., 2020; Mendler-Dünner et al., 2020; Brown et al., 2022). Although there are a few recent papers on time-varying stochastic optimization (e.g., Cutler et al. (2021); Nonhoff & Müller (2020); Dixit et al. (2019; 2018)), they model the temporal drift as discrete, precluding them from exploiting smoothness in the drift. This leads to a worse asymptotic tracking error, depending on the magnitude of the temporal drift of the optimal solution (e.g., see Popkov (2005), Zavlanos et al. (2012), Zhang et al. (2009), Ling & Ribeiro (2013), and references therein). In this paper, we focus on time-varying stochastic optimization problems in which the temporal drift is driven by a continuous-time process.
We leverage the smoothness of the temporal drift to develop predictor-corrector (PC) methods for stochastic optimization that anticipate future changes in the data generating process to improve the asymptotic tracking error (ATE) (see Definition 2.1). The main benefit of such methods is a smaller tracking error compared to other stochastic optimization algorithms that do not leverage the smoothness of the temporal drift. A primary challenge in stochastic time-varying optimization is properly accounting for the temporal drift through a PC term in the update rule. The noise in the stochastic setting makes naive estimates of the PC term unstable and may lead to non-convergence. One of our main contributions is a general way of estimating the PC term. We complement the methodological contributions with theoretical results showing that PC stochastic optimization algorithms inherit the benefits of their non-stochastic counterparts for time-varying problems. In particular, we show that PC stochastic optimization algorithms have a smaller asymptotic tracking error (ATE) than their non-PC counterparts. We also demonstrate the superiority of PC algorithms empirically in time-varying linear regression and target tracking applications. The rest of the paper is organized as follows. In Sections 2 and 3, we present the algorithm, highlight the difference between the predictor-corrector based algorithm and time non-adaptive algorithms such as simple gradient descent, and present theoretical bounds on the ATE of both kinds of algorithms. In Section 4, we present three concrete instances of time-varying stochastic optimization problems driven by underlying gradual distribution shifts and derive the details of the PC algorithms for these problems. Section 5 concludes.

2. FRAMEWORK AND ALGORITHM

In a typical learning problem, we have n samples X_1, …, X_n ∼ P and, based on the data, we estimate some parametric (or non-parametric) functional of the underlying distribution θ⋆ = ν(P) by minimizing a loss function ℓ(X, θ). A standard assumption for consistent estimation of θ⋆ is that R(θ) ≜ E[ℓ(X, θ)] is uniquely minimized at θ⋆. In a time-varying framework, we assume that the data generating distribution P ≡ P_t changes with time, and so does the parameter of interest θ⋆_t = ν(P_t), along with the (time-varying) risk function R(θ, t) ≜ E_{P_t}[ℓ(X, θ)]. As a concrete example, consider the object tracking problem studied in Patra et al. (2020): suppose we have installed n sensors at positions x_1, …, x_n (which remain fixed over time), and let θ⋆_t be the location of the target object at time t. At each time, we get noisy feedback from the sensors regarding the position of the target, i.e., y_{i,t} = ∥x_i − θ⋆_t∥² + ϵ_{i,t} for 1 ≤ i ≤ n, where the ϵ_{i,t} are iid (over i and t) with mean zero and variance τ². Denote by y_t ∈ ℝⁿ the vector of observations from the n sensors at time t. A natural approach to estimating θ⋆_t is to minimize the squared error loss ℓ(y_t, θ) = Σ_{i=1}^n (y_{i,t} − ∥x_i − θ∥²)², which yields the risk function

R(θ, t) = Σ_{i=1}^n E[(y_{i,t} − ∥x_i − θ∥²)²] = nτ² + Σ_{i=1}^n (∥x_i − θ⋆_t∥² − ∥x_i − θ∥²)².

From the risk function, it is immediate that under very mild assumptions on x_1, …, x_n (i.e., they are in general position, as discussed in Appendix A.1) we have

θ⋆(t) = argmin_θ R(θ, t),  t ≥ 0. (2.1)

We use the notations θ⋆(t) and θ⋆_t interchangeably. From the above formulation, it is also immediate that we observe only one sample/incident at each time point. Hence, if θ⋆_t behaves erratically over time, there is no hope of learning the evolution pattern from the data, and it is imperative to assume some smoothness on the target function θ⋆_t.
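To fix ideas, the squared-error loss of the sensor model above can be computed in a few lines. This is a minimal sketch with arbitrary toy sensor positions and a noiseless reading, not code from the paper:

```python
import numpy as np

def sensor_loss(theta, xs, ys):
    """Squared-error loss l(y_t, theta) = sum_i (y_{i,t} - ||x_i - theta||^2)^2."""
    dists_sq = np.sum((xs - theta) ** 2, axis=1)  # ||x_i - theta||^2 per sensor
    return np.sum((ys - dists_sq) ** 2)

# Toy example: 4 sensors at the unit-square corners, object at theta_star.
xs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta_star = np.array([0.3, 0.7])
ys = np.sum((xs - theta_star) ** 2, axis=1)  # noiseless y_i = ||x_i - theta*||^2
print(sensor_loss(theta_star, xs, ys))  # 0.0 at the true position
```

With noisy readings the loss is minimized near, but not exactly at, θ⋆, which is what drives the risk decomposition R(θ, t) = nτ² + Σ_i (∥x_i − θ⋆_t∥² − ∥x_i − θ∥²)².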
Our proposed method exploits the smoothness of θ⋆_t to improve its estimation over time. We now formulate the problem: we assume that the underlying distribution {P_t} changes continuously with time t. As statisticians, we query the model at discrete time steps (say at times {kh}_{k∈ℕ}, where h is the time step, which controls the frequency of queries) and observe n samples {X_{i,t}}_{1≤i≤n} from the distribution at that time. As a consequence, we have sequential batches of data with which we aim to estimate θ⋆_{kh}, i.e., the parameter at the time of the query. Note that to estimate the parameter at time t one may use all previous data points. We evaluate the quality of our estimator using the asymptotic tracking error (ATE), defined below.

Definition 2.1 (Asymptotic tracking error (ATE)). Let the true dynamic parameter {θ⋆_t, t ≥ 0} be sequentially estimated as {θ̂_{kh}, k ∈ ℕ} over the time grid {kh : k ∈ ℕ}, where h > 0 is the time step. Then the asymptotic tracking error is defined as ATE(θ̂) = lim sup_{k→∞} ∥θ̂_{kh} − θ⋆_{kh}∥_2.

As mentioned in the Introduction, we compare the performance of a time-adjusted (PC) gradient descent method to a time-unadjusted (GD) one. As will become evident, both methods require the evaluation of certain derivatives of the risk function. To motivate the predictor-corrector (PC) algorithm, we start by deriving the prediction-correction term when the optimizer has access to the exact gradients of the (time-varying) cost function. The optimality of θ⋆_t implies

g(t) = ∇_θ R(θ⋆_t, t) = 0 for all t. (2.2)

Thus g′(t) = 0, i.e.,

∇_θθ R(θ⋆_t, t) θ̇⋆_t + ∇_θt R(θ⋆_t, t) = 0, (2.3)

where θ̇⋆_t is the temporal drift of the optimal solution θ⋆_t. We see that θ⋆_t satisfies the ODE:

θ̇⋆_t = −{∇_θθ R(θ⋆_t, t)}⁻¹ ∇_θt R(θ⋆_t, t). (2.4)

We interpret the right side of this ODE as a prediction of the change in θ⋆_t. This suggests modifying the update rule of stochastic optimization algorithms to account for the predicted change in θ⋆_t.
This leads to the update rule

θ̂_{(k+1)h} = θ̂_{kh} − η ∇̂_θ R(θ̂_{kh}, kh) − h {∇̂_θθ R(θ̂_{kh}, kh)}⁻¹ ∇̂_θt R(θ̂_{kh}, kh),

where η > 0 is a learning rate, h is a (time) step, and ∇̂_θ R(θ, t), ∇̂_θθ R(θ, t), ∇̂_θt R(θ, t) are estimates of ∇_θ R(θ, t), ∇_θθ R(θ, t), ∇_θt R(θ, t), respectively. We summarize the stochastic PC algorithm in Algorithm 1.

Algorithm 1 Stochastic predictor-corrector based method

Require: step size h > 0, learning rate η > 0, estimated gradients ∇̂_θ R(θ, kh), Hessians ∇̂_θθ R(θ, kh), and time derivatives of the gradient ∇̂_θt R(θ, kh) for k ∈ ℕ
1: Initialize θ̂_0 at some value.
2: for k ≥ 0 do
3:   Update θ̂_{(k+1)h} = θ̂_{kh} − η ∇̂_θ R(θ̂_{kh}, kh) − h {∇̂_θθ R(θ̂_{kh}, kh)}⁻¹ ∇̂_θt R(θ̂_{kh}, kh)
4: end for

To motivate the benefit of accounting for the predicted change in θ⋆_t, we present a brief comparison of the tracking error of the PC algorithm with that of simple stochastic gradient descent without the correction term. For simplicity of exposition, we study the tracking error of the two algorithms in the non-stochastic setting. We begin by showing a lower bound on the tracking error of gradient descent. We suspect this result is known to experts, but to the best of our knowledge it does not appear in the literature.

Theorem 2.2 (Lower bound). There exists an R(θ, t) satisfying Assumption 3.1 such that the gradient descent algorithm with time step h > 0 and learning rate η > 0 satisfies the following: there exists a c > 0 such that lim inf_{k→∞} ∥θ̂_{kh} − θ⋆_{kh}∥_2 ≥ c h/η.

As we shall see in the subsequent section (see Theorem 3.3), the tracking error of the PC algorithm in the non-stochastic setting is O(L″h²/(Mη)). We restate this special case of Theorem 3.3 here for the reader's convenience.

Corollary 2.3 (Non-stochastic version of Theorem 3.3). Assume that Assumptions 3.1 and 3.2 hold and that R(θ, t) is twice differentiable in both coordinates. Then, with access to the true gradients, the sequence of estimates in Algorithm 1 with step size h > 0 and learning rate η satisfies

lim sup_{k→∞} ∥θ̂_{kh} − θ⋆_{kh}∥ ≤ L″h²/(Mη + hM′),

where M′ = sup_{θ,t} ∥{∇_θθ R(θ, t)}⁻¹ ∇_θt R(θ, t)∥_op, M = sup_{θ, t≥0} ∥∇_θ R(θ, t)∥, and L″ = sup_{t≥0} ½∥θ̈⋆(t)∥ are finite.

To compare the rates of the PC algorithm and gradient descent, we set a learning rate such that h/η → 0 as h → 0. We see from Theorem 2.2 that the ATE of gradient descent cannot converge faster than h/η.
By comparison, the ATE of the PC algorithm in the non-stochastic setting (see Corollary 2.3) converges to zero at the rate h²/η (since L″h²/(Mη + hM′) ≍ L″h²/(Mη) as long as h/η → 0), which is faster than h/η. Hence, we conclude that the ATE of the predictor-corrector update converges at a faster rate than the ATE of the gradient descent update. A high-level explanation for this distinction is the following: gradient descent does not account for the underlying smoothness of θ⋆(t) when performing the time updates, whereas the predictor-corrector method calibrates for this smoothness through the adjustment term −h{∇_θθ R(θ̂_t, t)}⁻¹ ∇_θt R(θ̂_t, t). One of the main challenges in implementing the stochastic PC algorithm is obtaining estimates of ∇_θ R(θ, t), ∇_θθ R(θ, t), and ∇_θt R(θ, t):

1. Estimation of the gradient: ∇_θ R(θ, t) also appears in the SGD update rule; it is typically estimated with sample average approximation.

2. Estimation of the PC term: ∇_θθ R(θ, t) and ∇_θt R(θ, t) arise due to the presence of the PC term in the stochastic PC update rule. Although it is possible to construct unbiased sample average approximations of each of them individually, obtaining an unbiased estimate of the overall PC term is generally not possible due to the presence of non-linearity: the PC term is the product of the inverse of the Hessian matrix and the cross derivative with respect to the parameter and time. Fortunately, the stochastic PC algorithm is robust against biases in the estimate of the PC term. That said, naively estimating the PC term with sample average approximation can lead to non-convergence of the stochastic PC algorithm (see Remark 3.5 for details). In Section 4, we present two ways of estimating/evaluating ∇_θθ R(θ, t) and ∇_θt R(θ, t) that ensure the stochastic PC algorithm converges.
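The contrast between the two update rules is easy to see on a toy problem. The sketch below (not from the paper) uses the time-varying quadratic risk R(θ, t) = ½∥θ − θ⋆(t)∥², for which the gradient, Hessian, and cross-derivative are available in closed form; the drift path, step size, and learning rate are illustrative choices:

```python
import numpy as np

def theta_star(t):
    return np.array([np.sin(t), np.cos(t)])  # smooth drift path (toy choice)

def run(h=0.01, eta=0.5, T=2000, pc=True):
    """Track theta_star with plain GD (pc=False) or Algorithm 1 (pc=True).
    For R(theta, t) = 0.5 * ||theta - theta_star(t)||^2:
      grad_theta R = theta - theta_star(t), Hessian = I,
      grad_{theta,t} R = -d/dt theta_star(t)."""
    theta = np.zeros(2)
    for k in range(T):
        t = k * h
        step = eta * (theta - theta_star(t))       # eta * grad_theta R
        if pc:
            drift = np.array([np.cos(t), -np.sin(t)])  # d/dt theta_star(t)
            step += h * (-drift)  # h * H^{-1} grad_{theta,t} R = -h * drift
        theta = theta - step
    return np.linalg.norm(theta - theta_star(T * h))

print(run(pc=False), run(pc=True))  # PC attains a much smaller tracking error
```

The GD iterate lags the moving optimum by roughly h/η ∥θ̇⋆∥, while the PC iterate cancels the first-order drift, in line with Theorem 2.2 and Corollary 2.3.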
In the next section, we elucidate how errors in the estimates of ∇_θ R(θ, t), ∇_θθ R(θ, t), and ∇_θt R(θ, t) affect the asymptotic tracking error of the stochastic PC algorithm.

3. THEORETICAL PROPERTIES OF THE STOCHASTIC PC ALGORITHM

We begin by stating our assumptions on the problem. We assume that the risk function is strongly convex with respect to θ. This assumption implies that θ⋆(t) is uniquely identified at each time t and is crucial in studying the convergence of the ATE.

Assumption 3.1. R(θ, t) is μ-strongly convex with respect to θ, i.e., for any θ_1, θ_2 and t ≥ 0:

R(θ_2, t) ≥ R(θ_1, t) + (θ_2 − θ_1)ᵀ ∇_θ R(θ_1, t) + (μ/2)∥θ_2 − θ_1∥²_2.

We next assume that, as a function of t, θ⋆(t) is smooth, which is naturally satisfied in numerous examples including dynamic least squares recovery, object tracking, etc.

Assumption 3.2. The function θ⋆ ∈ C²_{ℝ^d}([0, ∞)), i.e., θ⋆ is twice continuously differentiable with respect to time and its second derivative is uniformly bounded over time.

As will be seen in the subsequent theorems, the bounds on the ATE of these stochastic algorithms depend on the errors in the estimation of the pertinent derivatives, defined as follows:

ξ_t = ∇̂_θ R(θ, t) − ∇_θ R(θ, t)

denotes the estimation error of the gradient, and

ζ_t = {∇̂_θθ R(θ̂_{t_k}, t_k)}⁻¹ ∇̂_θt R(θ̂_{t_k}, t_k) − {∇_θθ R(θ⋆_{t_k}, t_k)}⁻¹ ∇_θt R(θ⋆_{t_k}, t_k)

represents the error in the adjustment term for the temporal drift. We use σ_ξ (resp. σ_ζ) to denote an upper bound on sup_t E[∥ξ_t∥] (resp. sup_t E[∥ζ_t∥]). These bounds may or may not be functions of h, depending on the application. Note that we do not assume the estimates of the derivatives are unbiased. Below we present our main theorems bounding the ATE of stochastic gradient descent and of Algorithm 1 in terms of σ_ξ, σ_ζ, the learning rate η, and the step size h.

Theorem 3.3 (Stochastic predictor-corrector method). The update sequence of the stochastic predictor-corrector method presented in Algorithm 1 yields the following bound on the ATE:

lim sup_{k→∞} E[∥θ̂_{kh} − θ⋆_{kh}∥] ≤ L″h²/(ηM + hM′) + ησ_ξ + hσ_ζ

for any small η > 0.
When σ_ξ and σ_ζ are independent of h, the choice η = h implies that the ATE of the stochastic predictor-corrector method is at most of order h. Note that we do not require the noise to have zero mean; we merely require it to have finite second moment. This is slightly more general than the usual stochastic optimization setting, in which the noise is assumed to have mean zero. That said, in most of the applications we have in mind, the noise comes from approximating expectations by sample means, so the noise is mean zero. To see the benefits of the PC algorithm in the stochastic setting, we compare its ATE with that of stochastic gradient descent. Recall the stochastic gradient descent update rule:

θ̂_{(k+1)h} = θ̂_{kh} − η ∇̂_θ R(θ̂_{kh}, kh) (3.1)

for some estimate of the gradient. Its ATE is known (Cutler et al., 2021), but we restate it here to facilitate comparison.

Theorem 3.4 (Stochastic gradient descent, Cutler et al. (2021)). The update sequence of the stochastic gradient method satisfies

lim sup_{k→∞} E[∥θ̂_{t_k} − θ⋆_{t_k}∥] ≤ Lh/(μη) + ησ_ξ

for any small η > 0. Minimizing the right-hand side with respect to η yields

lim sup_{k→∞} E[∥θ̂_{t_k} − θ⋆_{t_k}∥] ≤ 2√(Lhσ_ξ/μ).

Therefore, when σ_ξ does not depend on h, the rate is O(√h). If the error variances are independent of h, simple stochastic gradient descent yields a bound of order √h on the ATE, whereas the time-adjusted predictor-corrector method yields a bound of order h, implying the superiority of the latter for small step sizes. This superiority continues to hold even when the variances depend on h, as will be evident in the applications of the subsequent section.

Remark 3.5 (Naive estimation of the PC term fails). In practice, it is straightforward to obtain estimates of ∇_θ R(θ, t) (e.g., by sample average approximation), but it is less straightforward to estimate the cross-derivative term ∇_θt R(θ, t).
A naive application of first-order finite differences to estimate the cross-derivative term leads to poor tracking performance, because this estimate induces a σ_ζ term that is O(h⁻¹). Indeed, we have

∇̂_θt R(θ, kh) = [∇̂_θ R(θ, kh) − ∇̂_θ R(θ, (k−1)h)] / h = [∇_θ R(θ, kh) − ∇_θ R(θ, (k−1)h)] / h + (ξ_{kh} − ξ_{(k−1)h}) / h.

As long as R(θ, t) is smooth with respect to t, the first term is ∇_θt R(θ, t) + O(h). But the second term is generally O(h⁻¹), e.g., if the errors over time are independent. Plugging this into the bound on the tracking error in Theorem 3.3, we see that the term involving σ_ζ no longer vanishes with h, leading to a vacuous O(1) bound. In Section 4, we use moving average schemes to obtain more accurate estimates of ∇_θt R(θ, t) to avoid this pitfall (e.g., for the least squares recovery problem in §4.1, see Equation (4.4) and Lemma 4.3 for its estimation and error analysis).
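The O(h⁻¹) noise amplification in Remark 3.5 is easy to see numerically: differencing two independent noisy gradient evaluations and dividing by h inflates the noise standard deviation by a factor of order 1/h. A toy sketch with i.i.d. Gaussian gradient noise (the noise scale is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_cross_derivative_noise(h, sigma=0.1, n=100_000):
    """Empirical std of the noise term (xi_kh - xi_(k-1)h) / h
    for iid N(0, sigma^2) gradient errors at consecutive query times."""
    xi_now = rng.normal(0.0, sigma, n)
    xi_prev = rng.normal(0.0, sigma, n)
    return np.std((xi_now - xi_prev) / h)

# Theoretical value is sigma * sqrt(2) / h: halving h doubles the noise.
print(naive_cross_derivative_noise(0.1))    # ~ 1.41
print(naive_cross_derivative_noise(0.01))   # ~ 14.1
```

Multiplying this estimate by h inside the PC term leaves an O(1) error, which is exactly why the moving-average schemes of Section 4 are needed.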

4. APPLICATIONS

In this section, we present three concrete time-varying optimization problems; all are characterized by a gradual underlying distribution shift. This allows us to leverage the predictor-corrector (PC) method to improve tracking of the optimal trajectory. In some applications (e.g., the strategic classification example), there is a model for the distribution shift, so it is possible to evaluate the PC term exactly. In other applications, there is no such model, so it is necessary to approximate the PC term. We use a generic finite-difference approach in §4.1 and §4.3 to approximate the PC term. This approach is generally applicable, but it may not be optimal in applications in which the underlying distribution shift exhibits higher orders of smoothness.

4.1. LEAST SQUARES RECOVERY WITH FIXED DESIGN MATRIX

We first demonstrate the performance of the predictor-corrector method and compare it with the gradient descent method in a linear regression model. We observe y_{kh} ≜ {Y_{kh,j}}_{j=1}^n at time kh for k ∈ ℕ and some fixed step size h, where the observations are modeled as y_t = Xθ⋆_t + ϵ_t. Here X ∈ ℝ^{n×d} is a fixed, time-invariant design matrix, and the coordinates of ϵ_t are i.i.d. with mean 0 and variance τ². The parameter of interest is the function θ⋆ : [0, ∞) → ℝ^d. We consider a low dimensional scenario (i.e., d < n) and assume that the columns of X are in general position, i.e., XᵀX is invertible. This immediately implies the following:

Lemma 4.1. For any t ∈ [0, ∞), θ⋆_t is the unique solution of the least squares problem

θ⋆_t = argmin_{θ∈ℝ^d} R(θ, t) = argmin_{θ∈ℝ^d} (1/2n) E[∥y_t − Xθ∥²],

where R is the risk function and the expectation is taken with respect to the distribution of y_t.

The proof of the Lemma can be found in Appendix A. We now compare the performance of the stochastic gradient descent method (3.1) and the PC method (Algorithm 1). The gradient of the risk function with respect to θ is ∇_θ R(θ, t) = (1/n) Xᵀ(Xθ − Xθ⋆_t). We propose a moving-average based technique to estimate the gradient of the risk function, i.e., for any time t we define

∇̂_θ R(θ, t) = (1/n) Xᵀ(Xθ − Σ_{i=0}^{m−1} α_i y_{t−ih}),  ξ_t = ∇̂_θ R(θ, t) − ∇_θ R(θ, t). (4.1)

The optimal choice of the moving window length m and the weights {α_i}_{i=0}^{m−1} rests on a careful analysis of the bias-variance trade-off in the estimation error ξ_t. First, note that ξ_t can be decomposed into two terms:

ξ_t = (1/n) XᵀX (θ⋆_t − Σ_{i=0}^{m−1} α_i θ⋆_{t−ih}) − (1/n) Σ_{i=0}^{m−1} α_i Xᵀ ϵ_{t−ih} ≜ A + B.

We can expand the bias term A via a two-step Taylor expansion:

θ⋆_t − Σ_{i=0}^{m−1} α_i θ⋆_{t−ih} = θ⋆_t (1 − Σ_{i=0}^{m−1} α_i) − h θ̇⋆(t) Σ_{i=0}^{m−1} i α_i + (h²/2) Σ_{i=0}^{m−1} i² α_i θ̈⋆(t̃_i)

for some t̃_i ∈ [t − ih, t]. The following lemma presents an optimal scheme for choosing {m; α_i, i = 0, …, m − 1}: Lemma 4.2 (Gradient estimate).
For any fixed m, choosing the weights {α_i} as

(α_0, …, α_{m−1}) ∈ argmin { Σ_{i=0}^{m−1} a_i² : Σ_{i=0}^{m−1} a_i = 1 and Σ_{i=0}^{m−1} i a_i = 0, a_i ∈ ℝ }, (4.2)

we obtain the following:

1. ∥A∥²_2 = a_0² × O(m⁴h⁴), where a_0² = max_{t≥0} ∥(XᵀX/n) θ̈⋆(t)∥²_2,
2. E[∥B∥²_2] = b_0² × O(m⁻¹), where b_0² = τ² tr(XᵀX/n²).

Therefore, taking m = O(h^{−4/5} {b_0/a_0}^{2/5}), we have

E[∥ξ_t∥] ≤ ∥A∥_2 + {E[∥B∥²_2]}^{1/2} = O(a_0 m² h²) + O(b_0 m^{−1/2}) = O(a_0^{1/5} b_0^{4/5} h^{2/5}). (4.3)

The proof of this lemma is deferred to Appendix A; it is established there that the optimal weights are α_i = 2(2m − 1 − 3i)/{m(m + 1)}. Using the error bound of Lemma 4.2 in Theorem 3.4 yields, for small η, an ATE bounded by O(h/η) + O(ηh^{2/5}). The right-hand side is minimized by taking η = h^{3/10}, which yields an asymptotic tracking error of order O(h^{7/10}).

For the predictor-corrector method (Algorithm 1), we additionally need to estimate the Hessian and the time derivative of the gradient. The Hessian is constant (XᵀX/n) and known (as we assume X is known). To estimate ∇_θt R(θ, t) = −{XᵀX/n} θ̇⋆_t, we again resort to a moving-average based method, i.e., we set

∇̂_θt R(θ, t) = −(1/n) Xᵀ Σ_{i=0}^{p−1} β_i y_{t−ih} (4.4)

for some choice of p and weights {β_i}_{i=0}^{p−1}, where the weights and the window size are obtained from a bias-variance trade-off:

ζ_t = {XᵀX/n}⁻¹ (∇̂_θt R(θ, t) − ∇_θt R(θ, t)) = −Σ_{i=0}^{p−1} β_i θ⋆_{t−ih} + θ̇⋆_t − {XᵀX/n}⁻¹ (Xᵀ/n) Σ_{i=0}^{p−1} β_i ϵ_{t−ih} ≜ C + D.

The following lemma presents the optimal choice of the weights and window length, and consequently an error bound: Lemma 4.3.
(Estimate of time derivative) For any p, if we choose the weights β_i as

(β_0, …, β_{p−1}) ∈ argmin { Σ_{i=0}^{p−1} b_i² : Σ_{i=0}^{p−1} b_i = 0 and Σ_{i=0}^{p−1} i b_i = −1/h, b_i ∈ ℝ }, (4.5)

we obtain the error bounds stated in Appendix A; in particular, the optimal weights are β_i = 6(p − 1 − 2i)/{p(p² − 1)h}.

In the left plot of Figure 1, we observe that the predictor-corrector based method (denoted by PC) outperforms the gradient-descent based method (denoted by GD) in terms of tracking error for all moderately large t (here, t ≥ 1) and for all choices of h ∈ {10⁻², 10⁻³, 10⁻⁴, 10⁻⁵}. As predicted by our theory, the limiting error of the predictor-corrector based method decreases faster in h than that of the gradient descent based method. The right side of Figure 1 establishes this phenomenon by comparing the performance at t = 3 for several choices of h. As ∥θ̂_t − θ⋆_t∥ is a random quantity, error bars (over 10 Monte Carlo iterations) are provided to quantify the variability of the tracking error, which turns out to be small relative to the difference in mean tracking error between the methods.
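Both closed-form weight schemes (the α_i of Lemma 4.2 and the β_i of Lemma 4.3) are minimum-norm solutions of the constrained programs (4.2) and (4.5), which can be checked directly by solving the programs in closed form. A small verification sketch (the values of m, p, h are arbitrary):

```python
import numpy as np

def min_norm_weights(k, b):
    """Least-norm solution of: min ||a||^2  s.t.  sum_i a_i = b[0], sum_i i*a_i = b[1],
    computed as a = A^T (A A^T)^{-1} b for the 2 x k constraint matrix A."""
    A = np.vstack([np.ones(k), np.arange(k)])
    return A.T @ np.linalg.solve(A @ A.T, np.asarray(b, dtype=float))

m, p, h = 10, 8, 0.01  # arbitrary window lengths and time step

# Gradient weights (4.2): sum alpha_i = 1, sum i*alpha_i = 0.
i = np.arange(m)
alpha = 2 * (2 * m - 1 - 3 * i) / (m * (m + 1))
print(np.allclose(alpha, min_norm_weights(m, [1.0, 0.0])))      # True

# Time-derivative weights (4.5): sum beta_i = 0, sum i*beta_i = -1/h.
j = np.arange(p)
beta = 6 * (p - 1 - 2 * j) / (p * (p ** 2 - 1) * h)
print(np.allclose(beta, min_norm_weights(p, [0.0, -1.0 / h])))  # True
```

Minimizing the squared norm of the weights minimizes the variance of the moving-average noise term, while the two linear constraints kill the zeroth- and first-order bias in the Taylor expansion.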

4.2. STRATEGIC CLASSIFICATION

In strategic classification, samples correspond to agents who change their features strategically to affect the output of the ML model (e.g., scammers modifying their scam to skirt a scam detector). To keep things simple, we assume the ML model is a binary classifier and that a positive output is advantageous for the agents. Following Hardt et al. (2016), we assume the agents maximize utility and have full information about the classifier. Thus an agent with features x changes their features by solving the (expected) utility maximization problem

x⁺(x) ≜ argmax_{x′∈𝒳} u₊ f(x′) − c(x, x′),

where u₊ > 0 is the utility from a positive output, f(x) is the predicted probability of a positive output at x from the ML model, u₊ f(x) is the expected utility at x, and c(x, x′) > 0 encodes the cost of changing features from x to x′. Let p₁(·, t) be the probability density function (pdf) of the positive class conditional at time t, and let f_t be the ML model deployed at time t. The agents respond strategically to f_t; i.e., an agent with features x changes their features to x⁺(x) (their label remains unchanged). The resulting change in the class conditional satisfies the continuity equation

∂_t p₁ + ∇ · (v p₁) = 0,

where ∇ ≜ [∂_{x₁} … ∂_{x_d}] is the spatial gradient operator and v is the vector field v(x) ≜ x⁺(x) − x. Similarly, the pdf of the negative class conditional also satisfies the continuity equation. This change in the distribution of agent features leads to a time-varying optimization problem:

θ(t) ∈ argmin_θ f(θ, t) ≜ π ∫_𝒳 ℓ(f_θ(x), 1) dP₁(x) + (1 − π) ∫_𝒳 ℓ(f_θ(x), 0) dP₀(x),

where π ≜ P{Y = 1} is the fraction of the positive class and ℓ is a loss function for the classification task.
Interchanging differentiation and integration freely, we see that it is possible to estimate ∇_θ f(θ, t) and ∇_θθ f(θ, t) empirically:

∇_θ f(θ, t) = π ∫_𝒳 ∇_θ ℓ(f_θ(x), 1) dP₁(x) + (1 − π) ∫_𝒳 ∇_θ ℓ(f_θ(x), 0) dP₀(x),
∇_θθ f(θ, t) = π ∫_𝒳 ∇²_θ ℓ(f_θ(x), 1) dP₁(x) + (1 − π) ∫_𝒳 ∇²_θ ℓ(f_θ(x), 0) dP₀(x).

Similarly, it is possible to evaluate ∇_θt f(θ, t):

∇_θt f(θ, t) = π ∫_𝒳 ∇_θ ℓ(f_θ(x), 1) ∂_t p₁(x, t) dx + (1 − π) ∫_𝒳 ∇_θ ℓ(f_θ(x), 0) ∂_t p₀(x, t) dx
= −π ∫_𝒳 ∇_θ ℓ(f_θ(x), 1) ∇ · (v(x) p₁(x, t)) dx − (1 − π) ∫_𝒳 ∇_θ ℓ(f_θ(x), 0) ∇ · (v(x) p₀(x, t)) dx
= π ∫_𝒳 ∇_x ∇_θ ℓ(f_θ(x), 1) v(x) p₁(x, t) dx + (1 − π) ∫_𝒳 ∇_x ∇_θ ℓ(f_θ(x), 0) v(x) p₀(x, t) dx,

where we appealed to the continuity equation in the second step and to Green's identities in the third step. Unlike the other two applications in this section, the PC term can be computed exactly here (without resorting to finite-difference approximation).
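The agents' best response x⁺(x) can be computed numerically for any differentiable model f and cost c. Below is a toy sketch with a logistic classifier and the quadratic cost c(x, x′) = ½∥x − x′∥²; the classifier weights, utility u₊, and optimizer settings are all illustrative assumptions, not specified in the paper:

```python
import numpy as np

def f(x, w):
    """Predicted probability of a positive output (logistic model, toy choice)."""
    return 1.0 / (1.0 + np.exp(-x @ w))

def best_response(x, w, u_plus=2.0, lr=0.1, steps=500):
    """Approximate x+(x) = argmax_{x'} u_plus * f(x') - c(x, x')
    by gradient ascent, for the quadratic cost c(x, x') = 0.5 * ||x - x'||^2."""
    xp = x.copy()
    for _ in range(steps):
        p = f(xp, w)
        # Gradient of the agent's utility: u_plus * sigma'(w^T x') * w - (x' - x).
        grad = u_plus * p * (1 - p) * w - (xp - x)
        xp = xp + lr * grad
    return xp

w = np.array([1.0, -1.0])   # toy classifier weights
x = np.array([0.0, 0.0])    # original agent features
xp = best_response(x, w)
print(f(xp, w) > f(x, w))   # True: gamed features raise the predicted probability
```

Applying this map to samples from the class conditionals gives the vector field v(x) = x⁺(x) − x that enters the continuity equation and hence the exact PC term above.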

4.3. OBJECT TRACKING

Our third application is the object tracking problem proposed and analyzed by Patra et al. (2020). Assume we have n sensors placed at positions {X_i}_{i=1}^n in ℝ^d, and let θ⋆(t) denote the position of the object (that we aim to track) at time t. At any given time, we observe a noisy version of some monotone function of the distance of the object from the sensors, i.e., we observe

Y_{i,t} = f(∥X_i − θ⋆_t∥²) + ϵ_{i,t} for all 1 ≤ i ≤ n.

In this example, we assume f is known. The case of unknown f is more complicated and beyond the scope of this paper. The risk function for estimating θ⋆_t under the quadratic loss is

R(θ, t) = E[(1/2n) Σ_{i=1}^n (Y_{i,t} − f(∥X_i − θ∥²))²] = σ²/2 + (1/2n) Σ_{i=1}^n (f(∥X_i − θ∥²) − f(∥X_i − θ⋆_t∥²))².

Note that the risk function here is not strongly convex (so it does not satisfy the assumptions of Theorem 3.3), but, as we shall see, the PC algorithm nevertheless outperforms standard first-order methods. The gradient and the Hessian, which are required for the GD and PC methods, can easily be estimated using sample averages (exact expressions are presented in Appendix B). The time derivative of the gradient is

∇_θt R(θ, t) = −(2/n) Σ_{i=1}^n f′(∥X_i − θ∥²)(θ − X_i) (d/dt) f(∥X_i − θ⋆_t∥²).

To estimate (d/dt) f(∥X_i − θ⋆_t∥²), we again resort to the moving-average procedure

∂̂_t f(∥X_i − θ⋆_t∥²) = Σ_{j=0}^{p−1} β_j Y_{i,t−jh},

where p and {β_j}_{j=0}^{p−1} are chosen to balance the bias-variance trade-off. For notational simplicity, we drop the index i and define g(t) = f(∥X_i − θ⋆_t∥²). From the relation Y_{i,t} = g(t) + ϵ_{i,t} and Assumption 3.2, we have

Σ_{j=0}^{p−1} β_j Y_{i,t−jh} = Σ_{j=0}^{p−1} β_j {g(t) − jh g′(t) + (j²h²/2) g″(t̃_j)} + Σ_{j=0}^{p−1} β_j ϵ_{i,t−jh}.

Now, if the sequence {β_j}_{j=0}^{p−1} satisfies Σ_j β_j = 0 and Σ_j j β_j = −1/h, then we have

Σ_{j=0}^{p−1} β_j Y_{i,t−jh} − g′(t) = (Σ_j j² β_j) O(h²) + Σ_{j=0}^{p−1} β_j ϵ_{i,t−jh}.

The variance of the error term is σ²_ϵ Σ_j β_j². Therefore, we choose {β_j} and p by minimizing Σ_j β_j² subject to the above constraints.
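The bias-variance reasoning above can be checked numerically: applying the β filter to noisy observations recovers g′(t) up to a small bias plus averaged-out noise. A toy sketch (the function g, the noise level, and the grid parameters are illustrative choices, not the paper's simulation settings):

```python
import numpy as np

rng = np.random.default_rng(1)
p, h, sigma = 200, 0.001, 0.5  # window length, time step, noise std (toy values)
j = np.arange(p)
beta = 6 * (p - 1 - 2 * j) / (p * (p ** 2 - 1) * h)  # sum beta_j = 0, sum j*beta_j = -1/h

g = np.sin   # stand-in for g(t) = f(||X_i - theta*_t||^2); g'(1) = cos(1)
t = 1.0
ests = []
for _ in range(2000):
    Y = g(t - j * h) + rng.normal(0.0, sigma, p)  # noisy readings Y_{i, t-jh}
    ests.append(beta @ Y)                          # moving-average derivative estimate

print(np.mean(ests))  # close to g'(1) = cos(1) ~ 0.54, up to the O(h^2 * sum j^2 beta_j) bias
```

A single estimate is noisy (variance σ²_ϵ Σ_j β_j²), but the bias term is small, which is exactly the trade-off the constrained minimization of Σ_j β_j² balances.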
A calculation similar to that of §4.1 (Lemma 4.3) yields p = O(h^{−3/4}) and β_j = 6(p − 1 − 2j)/{p(p² − 1)h} for j = 0, …, p − 1.

Simulation study. We consider n = 11² = 121 sensors placed on the grid {−1, −0.8, …, 0.8, 1}² ⊂ [−1, 1]². The moving object takes the path θ⋆_t = (sin(2πt), cos(2πt)), and we let f(x) = x. We generate noisy observations of the object as y_{i,t} = ∥X_i − θ⋆_t∥² + ϵ_{i,t}, where ϵ_{i,t} ∼ N(0, 1/4). Runtime analysis for the two methods can be found in Figure 2 in Appendix C. Figure 1 (third and fourth plots from the left) presents the tracking error of both methods (the gradient descent based method (resp. the predictor-corrector based method) is denoted GD (resp. PC)) for four choices of h ∈ {10⁻², 10⁻³, 10⁻⁴, 10⁻⁵}. The superiority of the PC method for all t ≥ 1 and for various choices of h is evident from the third plot from the left in Figure 1. In the last plot, we compare the limiting performance of both methods (taking the error at t = 3 as the limiting error) for several choices of h. The superior performance of PC corroborates our theoretical finding that the tracking error of PC converges faster in h than that of GD. We also show error bars (over 10 Monte Carlo iterations) of the random tracking error ∥θ̂_t − θ⋆_t∥ at t = 3 in the right-most plot, indicating that the variability is small relative to the difference in mean tracking error between the methods.

5. CONCLUSION

We developed predictor-corrector algorithms for stochastic time-varying optimization. These algorithms leverage smoothness in the temporal drift to anticipate changes to the optimal solution. We showed that these algorithms have smaller asymptotic tracking errors than their non-predictor-corrector counterparts and demonstrated their efficacy in three applications. Although we focused on first-order algorithms in this paper, the predictor-corrector term in the update rules of our PC algorithms can be incorporated into the update rules of other algorithms (e.g., Newton-type methods). We hope that the benefits of first-order PC algorithms motivate others to study PC versions of other algorithms.



Code: https://github.com/smaityumich/concept-drift.



Figure 1: The two left-most figures show the performance of gradient descent and the PC method for time-varying linear regression. The left-most figure presents the tracking error of the GD and PC methods for various choices of h on the time interval t ∈ [0, 3]. The second figure from the left compares the performance at a fixed t = 3 for different h, to illustrate how the performance of the methods varies with h. The two right-most figures show the performance of gradient descent and the PC method for object tracking. The second figure from the right shows the tracking error of the object tracking model for the GD and PC methods for different choices of h. The y-axis represents the tracking error and the x-axis the time interval t ∈ (0, 3).

ACKNOWLEDGMENTS

This paper is based upon work supported by the National Science Foundation (NSF) under grants no. 2027737 and 2113373.

ANNEX

Setting p = (d₀/c₀)^{1/4} h^{−3/4} in Lemma 4.3, we obtain the bounds on the error stated in Appendix A; the β_j in the lemma take the value 6(p − 1 − 2j)/{p(p² − 1)h}, as elaborated in the proof of the lemma (see Appendix A). Furthermore, we established in the analysis of the gradient descent method that the error of the gradient estimation is of order σ_ξ = O(h^{2/5}) (Lemma 4.2). Using these rates in Theorem 3.3, we conclude that the predictor-corrector method with step size η has an ATE of order O(h²/{η + h}) + O(ηh^{2/5}) + O(h^{5/4}). The bound is minimized at η = h^{4/5}, and consequently the tracking error is of order O(h^{6/5}). Therefore, the predictor-corrector method yields a faster rate (in terms of the step size h) than the gradient based method.

Simulation studies: We compare the tracking performance of the gradient descent based and the predictor-corrector based methods for the regression model via simulation. We set d = 2, n = 40. The rows {X_i}_{i=1}^n of the design matrix are generated independently from N(0, I₂) and remain fixed over time. The true parameter is θ⋆_t = (sin(2πt), cos(2πt)), and at time t the observation y_t ∈ ℝⁿ is generated as y_t = Xθ⋆_t + ϵ_t, where ϵ_t ∼ N(0, 0.5 I_n). Runtime analysis for the gradient descent based and the predictor-corrector based methods can be found in Figure 2 in Appendix C. The tracking performance of the methods over the time interval t ∈ (0, 3) and the effect of h on the limiting error (i.e., the tracking error at some large t) are presented in Figure 1.

Michael M. Zavlanos, Alejandro Ribeiro, and George J. Pappas. Network integrity in mobile robotic networks. IEEE Transactions on Automatic Control, 58(1):3-18, 2012.

Yunong Zhang, Ke Chen, and Hong-Zhou Tan. Performance analysis of gradient neural network exploited for online time-varying matrix inversion. IEEE Transactions on Automatic Control, 54(8):1940-1945, 2009.

