ROBUST MULTIVARIATE TIME-SERIES FORECASTING: ADVERSARIAL ATTACKS AND DEFENSE MECHANISMS

Abstract

This work studies the threat of adversarial attacks on multivariate probabilistic forecasting models and viable defense mechanisms. Our studies discover a new attack pattern that negatively impacts the forecasting of a target time series by making strategic, sparse (imperceptible) modifications to the past observations of a small number of other time series. To mitigate the impact of such attacks, we develop two defense strategies. First, we extend a randomized smoothing technique previously developed for classification to the multivariate forecasting scenario. Second, we develop an adversarial training algorithm that learns to create adversarial examples and, at the same time, optimizes the forecasting model to improve its robustness against such adversarial simulation. Extensive experiments on real-world datasets confirm that our attack schemes are powerful and that our defense algorithms are more effective than baseline defense mechanisms.

1. INTRODUCTION

Understanding the robustness of time-series models has been a long-standing issue with applications across many disciplines such as climate change (Mudelsee, 2019), financial market analysis (Andersen et al., 2005; Hallac et al., 2017), down-stream decision systems in retail (Böse et al., 2017), resource planning for cloud computing (Park et al., 2019; 2020), and optimal control of vehicles (Kim et al., 2020). In particular, the notion of robustness captures how sensitive the model output is when authentic data is (potentially) perturbed with noise. In practice, as observations are often corrupted by measurement noise, it is important to develop statistical forecasting models that are less sensitive to such noise (Brown, 1957; Brockwell & Davis, 2009; Taylor & Letham, 2018) or more stable against outliers that might arise from such corruption (Connor et al., 1994; Gelper et al., 2010; Liu & Zhang, 2021; Wang & Tsay, 2021). However, these approaches have not considered the possibility of adversarial noise, which is strategically created to mislead the model rather than sampled from a known distribution. As a matter of fact, vulnerabilities to such adversarial noise have been previously pointed out in classification (Szegedy et al., 2013; Goodfellow et al., 2014b). In practice, it has been shown that human-imperceptible adversarial perturbations can alter the classification outcomes of a deep learning (DL) model, revealing a severe threat to many safety-critical systems. As such a risk is associated with the high capacity of DL to fit complex data patterns, we postulate that similar threats might also occur in forecasting, where modern DL-based forecasting models (Rangapuram et al., 2018; Salinas et al., 2020; Lim et al., 2020; Wang et al., 2019; Park et al., 2022) have become the dominant approach.
For example, to mislead the forecasting of a particular stock, adversaries might attempt to alter features external to the stock's financial valuation so as to maximize the gap between predictions of its value on authentic and altered features. The feasibility of such an adversarial attack has recently been demonstrated with tweet messages (Xie et al., 2022) on a text-based stock forecasting model. Motivated by these real scenarios, we propose to investigate such adversarial threats on more practical forecasting models whose predictions are based on more precise features, e.g., the valuations of other stock indices. Intuitively, rather than releasing adverse information to alter the sentiment about the target stock on social media, the adversaries can instead invest in, and hence adversely change the valuation of, a selected subset of stock indices (not including the target stock), which is arguably harder to detect. Interestingly, despite being seemingly plausible given the vast literature on adversarial attacks for classification models, formulating such an imperceptible attack under a multivariate forecasting setup is not straightforward. This is due to several differences between forecasting and classification, particularly the unique characteristics of time series, e.g., multi-step predictions, correlation across multiple time series, and probabilistic predictions. These differences open up the question of how adversarial perturbations and robustness should be defined more properly in the time series setting. Although there have been a few recent studies in this direction based on randomized smoothing (Yoon et al., 2022), these approaches are all restricted to univariate forecasting, where the attack has to make adverse alterations directly to the target time series.
Thus, under the less studied multivariate time-series forecasting setup, it remains unclear whether a target time series can instead be attacked by perturbing the other correlated time series, and whether such adversarial threats are defensible. In particular, as illustrated in the stock forecasting example above, multivariate scenarios admit new regimes of sparse and indirect cross-time-series attacks, which are more effective and realistic than the direct attacks in univariate cases. In order to understand whether such new regimes of attack exist and can be defended against, we raise three questions: 1. Indirect Attack. Can we mislead the prediction of some target time series via perturbations on the other time series? 2. Sparse Attack. Can such perturbations be sparse and non-deterministic so as to be less perceptible? 3. Robust Defense. Can we defend against those indirect and imperceptible attacks? Here we summarize our technical contributions by answering the questions above. Regarding indirect attack, we provide a general framework of adversarial attack on multivariate time series (see Section 3.1). Then, we devise a deterministic attack (see Section 3.2) on a state-of-the-art probabilistic multivariate forecasting model. The attack changes the model's prediction on the target time series by adversely perturbing a subset of the other time series. This is achieved by formulating the perturbation as the solution of an optimization task with packing constraints. Regarding sparse attack, we develop a non-deterministic attack (see Section 3.3) that adversely perturbs a stochastic subset of time series related to the target time series, which makes the attack less perceptible. This is achieved via a stochastic and continuous relaxation of the above packing constraint, which is shown (see Section 5) to be more effective than the deterministic attack in certain cases.
Moreover, unlike the deterministic attack, its differentiability makes it suitable to be directly integrated as part of a differentiable defense mechanism that can be optimized via gradient descent in an end-to-end fashion, as discussed later in Section 4.2. Regarding robust defense, we propose two defense mechanisms. First, we adapt randomized smoothing to the new multivariate forecasting setup with a robust certificate. Second, we devise a defense mechanism (see Section 4.2) by solving a mini-max optimization task which minimizes the maximum expected damage caused by a probabilistic attack that continually updates the generation of its adverse perturbations in response to the model updates. Their effectiveness is demonstrated across extensive experiments in Section 5. Furthermore, our experiments in Section 5.3 demonstrate that attacks designed for univariate cases cannot be reused as effective attacks on multivariate forecasting models, which highlights the importance and novelty of our studies. The code to reproduce our experiment results can be found at https://github.com/awslabs/gluonts/tree/dev/src/gluonts/nursery/robust-mts-attack.

2. RELATED WORK

Deep Probabilistic Forecasting Models. To model the uncertainty, various probabilistic models have been proposed, from distributional outputs (Salinas et al., 2020; de Bézenac et al., 2020; Rangapuram et al., 2018) to distribution-free quantile-based outputs (Park et al., 2022; Gasthaus et al., 2019; Kan et al., 2022). In multivariate cases, Salinas et al. (2019) generalized DeepAR (Salinas et al., 2020) to multivariate settings and adopted a low-rank Gaussian copula process to tackle the high-dimensionality challenge.

Adversarial Attack. Despite their success on various tasks, deep neural networks are especially vulnerable to adversarial attacks (Szegedy et al., 2013), in the sense that even imperceptible adversarial noise can lead to a completely different prediction. In computer vision, many adversarial attack schemes have been proposed; see Goodfellow et al. (2014b); Madry et al. (2018) for attacking image classifiers and Dai et al. (2018) for attacking graph-structured data. In the field of time series there is much less literature, and even so, most existing studies on the adversarial robustness of MTS models (Mode & Hoque, 2020; Harford et al., 2020) are restricted to regression and classification settings. Alternatively, Yoon et al. (2022) studied adversarial attacks on probabilistic forecasting models, but only in univariate settings.

Adversarial Robustness and Certification. Against adversarial attacks, an extensive body of work has been devoted to quantifying model robustness and to defense mechanisms. For instance, Fast-Lin/Fast-Lip (Weng et al., 2018) recursively computes local Lipschitz constants of a neural network, and PROVEN (Weng et al., 2019) certifies robustness in a probabilistic manner. Recently, randomized smoothing, proposed by Cohen et al. (2019); Li et al. (2019) as a defense approach with certification guarantees, has gained increasing popularity as a way to enhance model robustness. In the time series setting, Yoon et al. (2022) adopted the randomized smoothing technique for univariate forecasting models and developed the corresponding theory. However, we are not aware of any prior work on randomized smoothing for multivariate probabilistic models.

3. ADVERSARIAL ATTACK STRATEGIES

We provide a generic framework for sparse and indirect adversarial attacks under a multivariate setting in Section 3.1. A deterministic attack for this task is then introduced in Section 3.2, followed by a stochastic attack derived in Section 3.3.

Notations. Denote the d-dimensional multivariate time series at time t by x_t ∈ R^d, with the observation of the i-th time series written as x_{i,t} = [x_t]_i. We denote x = {x_t}_{t=1}^T ∈ R^{d×T} and z = {x_{T+t}}_{t=1}^τ ∈ R^{d×τ} as the recent T historical observations and the next τ steps of future values, respectively. A probabilistic forecaster p_θ with parameterization θ then takes the history x to predict z, i.e., z ∼ p_θ(· | x). We denote the set [d] = {1, . . . , d} and the perturbation to the i-th time series as δ_i = ([δ_t]_i)_{t=1}^T.

3.1. FRAMEWORK ON SPARSE AND INDIRECT ADVERSARIAL ATTACK

Following the notation convention of Dang-Nhu et al. (2020), given an adversarial prediction target t_adv and historical input x to the forecaster p_θ(z | x), we design a perturbation matrix δ such that the perturbed input x + δ drives a statistic χ(z) as close as possible to t_adv. That is, we find δ such that the distance between E_{z|x+δ}[χ(z)] and t_adv is minimized. Here, χ(z) is an arbitrary function of interest and t_adv an adversarial target value of the same dimension. We focus on scenarios where the perturbed prediction is far away from the original prediction, obtained by properly choosing χ(·) and t_adv. For example, by choosing χ(z) = z and t_adv = 100z, the adversary's target is to design an attack that increases the prediction 100-fold. Thus, suppose the adversaries want to mislead the forecasting of the time series in a subset I ⊂ [d], denoted z_I. Let χ be a statistic function of interest that concerns only the time series in I, i.e., χ(z) = χ(z_I). To make the attack less perceptible, we impose the following sparse and indirect constraints. First, the perturbation δ cannot be applied directly to the target time series in I and may only be applied indirectly to a small subset of I^c = [d] \ I. In other words, we restrict δ_I = 0 and s(δ) = |{i ∈ I^c : δ_i ≠ 0}| ≤ κ with sparsity level κ ≤ d. Lastly, to avoid outlier detection, we also cap the energy of the attack such that the value of the perturbation at any coordinate is no more than a pre-defined threshold η. To sum up, the sparse and indirect attack δ can be found by solving

  minimize_{δ ∈ R^{T×d}}  F(δ) ≜ E_{p_θ(z|x+δ)} ‖χ(z) − t_adv‖²₂     (3.1)
  subject to  ‖δ‖_max ≤ η,  s(δ) ≤ κ,  δ_I = 0,

where ‖δ‖_max = max_{t,i} |[δ_t]_i| is the element-wise maximum norm. As such, small values of κ and η imply a less perceptible attack. However, solving this problem directly is intractable due to the discrete cardinality constraint on s(δ).
To sidestep this, we develop two approximations in the subsequent sections which correspond to our deterministic and non-deterministic attack strategies.

3.2. DETERMINISTIC ATTACK

Here we present an approximate solution, inspired by the ideas in Croce & Hein (2019). We first obtain an intermediate solution δ̃ through projected gradient descent (PGD) until convergence,

  δ̃ ← Π_{B_∞(0,η)}(δ̃ − α ∇_δ̃ F(δ̃)),     (3.2)

where α ≥ 0 is a step size and Π_{B_∞(0,η)} is the projection onto the ℓ∞-norm ball with radius η, which allows a simple element-wise clipping: Π_{B_∞(0,η)}([δ̃_t]_i) = sign([δ̃_t]_i)·η if |[δ̃_t]_i| > η, else [δ̃_t]_i. With this intermediate non-sparse δ̃, we retrieve the final sparse perturbation δ by solving

  minimize_{δ ∈ R^{T×d}} ‖δ − δ̃‖_F  subject to  s(δ) ≤ κ,  δ_I = 0.     (3.3)

It turns out that (3.3) can be solved analytically. Given δ̃, we compute the absolute perturbation added to each row i, p_i = Σ_{t=1}^T |[δ̃_t]_i| for i ∈ [d] \ I, and sort these values in descending order π: p_{π₁} ≥ · · · ≥ p_{π_d}. Finally, we construct the solution δ with δ_{π_i} = δ̃_{π_i} if i ≤ κ, else 0.

Remark. ∇_δ F(δ) involves the gradient of an expectation, which does not have a closed-form solution. To overcome this intractability, we adopt the re-parameterized sampling approach used in Dang-Nhu et al. (2020) and Yoon et al. (2022).
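The two-stage procedure above (PGD with ℓ∞ clipping, then the analytic top-κ row projection) can be sketched as follows. This is a minimal numpy illustration with our own function names; the loss gradient is abstracted behind a hypothetical `grad_fn` callable, which in the paper's setting would come from re-parameterized sampling through p_θ. Here δ is stored as a (T, d) array, so time series are columns.

```python
import numpy as np

def clip_linf(delta, eta):
    # Projection onto the l-infinity ball B_inf(0, eta): element-wise clipping.
    return np.clip(delta, -eta, eta)

def sparsify_topk(delta, kappa, target_cols):
    # Closed-form solution of Eq. (3.3): keep the kappa time series (columns)
    # with the largest total absolute perturbation p_i = sum_t |[delta_t]_i|;
    # zero out the rest and the target set I (no direct attack on the target).
    delta = delta.copy()
    delta[:, list(target_cols)] = 0.0            # delta_I = 0
    p = np.abs(delta).sum(axis=0)                # per-series perturbation mass
    keep = np.argsort(-p)[:kappa]                # top-kappa in descending order
    mask = np.zeros(delta.shape[1], dtype=bool)
    mask[keep] = True
    delta[:, ~mask] = 0.0
    return delta

def deterministic_attack(grad_fn, shape, eta, kappa, target_cols,
                         steps=50, alpha=0.01):
    # PGD iterations of Eq. (3.2) followed by the sparse projection of Eq. (3.3).
    delta = np.zeros(shape)
    for _ in range(steps):
        delta = clip_linf(delta - alpha * grad_fn(delta), eta)
    return sparsify_topk(delta, kappa, target_cols)
```

As a sanity check, running this with a toy quadratic loss F(δ) = ‖δ − G‖² (gradient 2(δ − G)) produces a perturbation that respects all three constraints: ‖δ‖_max ≤ η, at most κ non-zero columns, and zeros on the target set.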

3.3. PROBABILISTIC ATTACK

To make the attack even less perceptible, we further show in this section an alternative approximation that results in a probabilistic sparse attack, which makes adverse alterations to a non-deterministic set of coordinates (i.e., time series and time steps). As shown in our experiments, this non-determinism appears to make the attack stronger and harder to detect. To achieve this, we view the sparse attack vector as a random vector drawn from a distribution with a differentiable parameterization. The core challenge is how to configure such a distribution whose support is guaranteed to lie within the space of sparse vectors. To this end, we propose the sparse layer, a distributional output combining a standard normal and a Dirac density. The output of this layer satisfies a relaxed sparse support condition (see Theorem 3.2).

Algorithm 1 Deterministic Adversarial Attack
input: pre-trained model p_θ(z | x), observation x, and other parameters:
  • statistic χ(·), adversarial target t_adv, target set I ⊂ [d]
  • attack energy η, sparsity constraint κ, PGD iterations n, and step size α ≥ 0
output: perturbation matrix δ ∈ R^{T×d} s.t. ‖δ‖_max ≤ η, s(δ) ≤ κ, δ_I = 0
1. initialize δ = 0
for iteration 1, 2, . . . , n do
  2. compute the expected loss F(δ) using Eq. (3.1)
  3. update δ via PGD in Eq. (3.2)
end for
4. for i ∉ I, compute p_i = Σ_{t=1}^T |[δ_t]_i|
5. sort p_i in descending order π = (π₁, . . . , π_d): p_{π₁} ≥ p_{π₂} ≥ · · · ≥ p_{π_d}
6. set δ_{π_{κ+1}} = δ_{π_{κ+2}} = · · · = δ_{π_d} = 0 and δ_I = 0
Return δ.

Sparse Layer. A sparse layer is defined as a distributional output q(δ | x; β, γ) over δ with independent rows, such that its sample (the probabilistic attack) δ ∼ q(δ | x; β, γ) = Π_i q_i(δ_i | x; β, γ) satisfies the sparsity condition E[s(δ)] ≤ κ and δ_I = 0.
With δ_i denoting the i-th row (time series) of δ and sparsity level κ, each factor distribution q_i(δ_i | x; β, γ), parameterized by β and γ, is defined as

  q_i(δ_i | x; β, γ) ≜ r_i(γ) · q′_i(δ_i | x; β) + (1 − r_i(γ)) · D(δ_i),     (3.4)

where r_i(γ) ≜ κ γ_i^{1/2} · (Σ_{j=1}^d γ_j)^{−1/2} / √d, D(δ_i) = I(δ_i = 0) is the Dirac density, and q′_i(δ_i | x; β) is a Gaussian N(μ(x; β), σ²(x; β)). The combination weight r_i(γ) is the probability that row δ_i is active (non-zero); equivalently, 1 − r_i(γ) is the probability mass of the event δ_i = 0. Intuitively, this means the choice of {r_i(γ)}_{i=1}^d controls the row sparsity of the random matrix δ, which can be calibrated to enforce E[s(δ)] ≤ κ. We show in Theorem 3.1 how samples can be drawn from the combined density in (3.4).

Theorem 3.1. Let δ′_i ∼ q′_i(· | x; β) and u_i ∼ N(0, 1) for i = 1, . . . , d. Define δ_i = δ′_i · I(u_i ≤ Φ^{−1}(r_i(γ))). Then δ_i ∼ q_i(· | x; β, γ). Here, q_i(· | x; β, γ) is given in (3.4) and Φ^{−1} is the inverse cumulative distribution function of the standard normal. We provide the proof in the appendix.

For implementation, let q′_i(· | x; β) be a distribution over dense vectors, e.g., N(μ(β), σ²(β)I), and draw u_i ∼ N(0, 1) for i ∈ [d]. We construct a binary mask m_i = I(u_i ≤ Φ^{−1}(r_i(γ))), i ∈ [d], where r_i(γ) is defined above. Next, for each i ∈ [d], we draw δ′_i from q′_i(· | x; β) and obtain δ_i = δ′_i ∘ m_i, where ∘ denotes element-wise multiplication. Finally, we set δ_I = 0. Theorem 3.2 proves that δ sampled from (3.4) meets the constraint E[s(δ)] ≤ κ. Put together, Theorems 3.1 and 3.2 enable differentiable optimization of a sparse attack, as desired.

Theorem 3.2. Let δ ∼ q(· | x; β, γ). Then E[s(δ)] ≤ κ.

Remark. We can also obtain a direct high-probability sparsity guarantee on s(δ) by applying Theorem 3.2 with the smaller budget cκ for c ∈ (0, 1): by the Markov inequality, with probability at least 1 − c, we have s(δ) ≤ E[s(δ)]/c ≤ cκ/c = κ.
We provide the proof of Theorem 3.2 in Appendix C.

Optimizing the Sparse Layer. The differentiable parameterization of the above sparse layer can therefore be optimized for maximum attack impact by minimizing the expected distance between the attacked statistic and the adversarial target:

  min_{β,γ} H(β, γ) ≜ E_{δ∼q(·|x;β,γ)} E_{z∼p_θ(z|x+δ)} ‖χ(z) − t_adv‖²₂.     (3.5)

This attack is probabilistic in two ways. First, the magnitude of the perturbation δ is a random variable drawn from the distribution q(· | x). Second, the non-zero components of the mask depend on the random Gaussian samples, which brings another degree of non-determinism into the design, making the attack less perceptible and harder to detect. See Algorithm 4 in Appendix A for the implementation.

Remark. The above probabilistic sparse attack has three important advantages. First, by viewing the attack vector as a random variable drawn from a learnable distribution instead of a fixed parameter to be optimized, we avoid solving the NP-hard problem (3.1), as is usually approached in previous literature (Croce & Hein, 2019). Second, our approach introduces multiple degrees of non-determinism into the attack vector, apparently making it stealthier and more powerful (see Section 5). Last, unlike the deterministic attack, which has two separate, decoupled approximation stages that cannot be optimized end-to-end due to the non-convex and non-differentiable constraint in (3.1), the probabilistic attack model is entirely differentiable. Therefore, it can be directly integrated as part of a differentiable defense mechanism that can be optimized via gradient descent in an end-to-end fashion; see Section 4.2 for more details. Again, we adopt the re-parameterized sampling approach to compute the gradient of the expectation in Eq. (3.5).
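The sampling scheme of Theorems 3.1 and 3.2 can be sketched in a few lines of numpy. This is a minimal illustration with fixed (non-learned) μ, σ, γ and our own function names; in the actual attack these parameters would be optimized through Eq. (3.5), and the final step δ_I = 0 is omitted here for brevity. Note that the mask event u_i ≤ Φ^{−1}(r_i(γ)) with u_i ∼ N(0, 1) makes each row active with probability exactly r_i(γ).

```python
import numpy as np
from statistics import NormalDist  # stdlib: gives Phi^{-1} via inv_cdf

def r_weights(gamma, kappa):
    # r_i(gamma) = kappa * gamma_i^{1/2} / (sqrt(d) * (sum_j gamma_j)^{1/2}).
    # By Cauchy-Schwarz, sum_i r_i(gamma) <= kappa, so E[s(delta)] <= kappa.
    gamma = np.asarray(gamma, dtype=float)
    d = len(gamma)
    return kappa * np.sqrt(gamma) / (np.sqrt(d) * np.sqrt(gamma.sum()))

def sample_sparse_attack(mu, sigma, gamma, kappa, rng):
    # Reparameterized sample from the sparse layer (Theorem 3.1):
    # delta_i = delta'_i * I(u_i <= Phi^{-1}(r_i(gamma))), u_i ~ N(0,1),
    # with a dense Gaussian draw delta' of shape (T, d); series are columns.
    r = r_weights(gamma, kappa)
    u = rng.standard_normal(len(r))
    thresh = np.array([NormalDist().inv_cdf(min(max(ri, 1e-12), 1 - 1e-12))
                       for ri in r])
    mask = (u <= thresh).astype(float)          # P(mask_i = 1) = r_i(gamma)
    dense = mu + sigma * rng.standard_normal(mu.shape)
    return dense * mask[None, :]                # zero out inactive series
```

With uniform γ, r_i(γ) = κ/d for every series, and the empirical average of s(δ) over many draws concentrates near Σ_i r_i(γ) ≤ κ, matching Theorem 3.2.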

4. DEFENSE MECHANISMS AGAINST ADVERSARIAL ATTACKS

Adversarial attacks on probabilistic forecasting models have previously been investigated only under the univariate time series setting (Dang-Nhu et al., 2020; Yoon et al., 2022). Beyond basic data augmentation (Wen et al., 2020), we develop more effective defense mechanisms to enhance model robustness via randomized smoothing (Section 4.1) and a mini-max defense using the sparse layer (Section 4.2).

4.1. RANDOMIZED SMOOTHING DEFENSE

Randomized smoothing (RS) (Cohen et al., 2019) is a post-training defense technique. To the best of our knowledge, it has never been considered in the multivariate setting; we apply RS to our multivariate forecaster z(x) ∼ p_θ(z | x), which maps x to a random vector z(x) distributed by p_θ(z | x). Let P_z(z(x) ⪯ r) denote the CDF of this random outcome vector, where ⪯ denotes element-wise inequality. The RS version

  g_σ(x) = E_ϵ[z(x + ϵ)]     (4.1)

of z(x), with noise level σ > 0 and ϵ ∼ N(0, σ²I), is a random vector whose CDF is defined as

  P_{g_σ}(g_σ(x) ⪯ r) ≜ E_{ϵ∼N(0,σ²I)} P_z(z(x + ϵ) ⪯ r),     (4.2)

where we abuse the notation ϵ ∼ N(0, σ²I) to indicate that the (scalar) entries of the matrix ϵ are independently and identically distributed by N(0, σ²). Computing the output of the smoothed forecaster g_σ(x) is intractable in general, since the integration of z(x + ϵ) against N(0, σ²I) cannot be done analytically. However, it can be approximated to arbitrarily high accuracy via Monte Carlo sampling with a sufficiently large number of samples. See Algorithm 2 for a detailed implementation.

Algorithm 2 Randomized Smoothing
input: forecaster z(x) ∼ p_θ(z | x) and:
  • number of samples n
  • noise level σ
  • input x
output: g_σ(x)
initialize g_σ(x) = 0
for e = 1, 2, . . . , n do
  1. sample [ϵ_t^{(e)}]_i ∼ N(0, σ²) ∀(t, i)
  2. g_σ(x) ← g_σ(x) + (1/n) · z(x + ϵ^{(e)})
end for

Algorithm 3 Mini-max Defense
input: dataset D of (x, z) pairs and parameters:
  • sparsity constraint κ for q in Eq. (4.6)
  • number of optimization iterations n
output: forecasting model z(x) ∼ p_θ(z | x)
for e = 1, 2, . . . , n do
  1. Fix θ, minimize −Σ_{(x,z)∼D} ℓ_g(ϕ; x, z, θ) with respect to ϕ (see Eq. (4.5))
  2. Fix ϕ, maximize Σ_{(x,z)∼D} ℓ_p(θ; x, z, ϕ) with respect to θ (see Eq. (4.6))
end for

For the randomized smoothing version g_σ of the base forecaster z(x) ∼ p_θ(z | x), we establish a robustness guarantee, or certificate, in the following theorem. Theorem 4.1 (Robust Certificate).
Given an input x, let g_σ(x) be defined as in Eq. (4.1). Let G(r) = P_{g_σ}(g_σ(x) ⪯ r) and G_δ(r) = P_{g_σ}(g_σ(x + δ) ⪯ r). For any δ, we have

  sup_{r∈R^d} |G(r) − G_δ(r)| ≤ (√d / σ) · ‖δ‖_F.     (4.3)

This shows that the difference between the CDFs of the smoothed forecaster on authentic and perturbed inputs, i.e., g_σ(x) and g_σ(x + δ), is guaranteed to be no more than O(‖δ‖_F). We defer the formal proof to Appendix C.

Remark. Unlike Theorem 1 in Yoon et al. (2022), which only applies to univariate cases, our Theorem 4.1 provides a more general robustness guarantee since it covers the multivariate setting. Moreover, Theorem 1 in Yoon et al. (2022) only holds as δ → 0, whereas our Theorem 4.1 holds for any δ.
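Algorithm 2 amounts to a plain Monte Carlo average over Gaussian input perturbations. A minimal sketch (our own naming) that treats the forecaster as a black-box function of the history:

```python
import numpy as np

def smoothed_forecast(forecaster, x, sigma, n, rng):
    # Monte Carlo estimate of g_sigma(x) = E_eps[z(x + eps)], eps ~ N(0, sigma^2 I):
    # average n forecasts computed on independently perturbed inputs.
    total = None
    for _ in range(n):
        eps = sigma * rng.standard_normal(x.shape)
        z = forecaster(x + eps)
        total = z if total is None else total + z
    return total / n
```

Because the injected noise has zero mean, smoothing a linear forecaster leaves its output essentially unchanged; the benefit appears for the nonlinear deep forecasters considered in the paper, where the averaging damps the effect of small adversarial perturbations.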

4.2. MINI-MAX DEFENSE

As discussed in Section 3.3, the sparse layer is differentiable, making it suitable to be directly integrated as part of a differentiable defense mechanism that can be optimized via gradient descent in an end-to-end fashion. To fix ideas, with a sparse layer q(· | x; ϕ) having parameters ϕ = (β, γ) as in Eq. (3.4), we propose to train the forecaster by minimizing the worst-case loss caused by q(· | x; ϕ):

  min_ϕ max_θ Σ_{(x,z)∼D} [ℓ_p(θ; x, z, ϕ) − ℓ_g(ϕ; x, z, θ)].     (4.4)

Here ℓ_g(ϕ; x, z, θ) is a function of ϕ conditioned on (x, z, θ), while ℓ_p(θ; x, z, ϕ) is a function of θ conditioned on (x, z, ϕ), defined as

  ℓ_g(ϕ; x, z, θ) ≜ E_{q(δ|x;ϕ)} E_{p_θ(z′|x+δ)} ‖z′ − z‖²,     (4.5)
  ℓ_p(θ; x, z, ϕ) ≜ E_{q(δ|x;ϕ)} [log p_θ(z | x + δ)],     (4.6)

where the expectation is taken over δ ∼ q(δ | x; ϕ) with q given by Eq. (3.4), and each pair (x, z) represents a training data point in our dataset with x = {x_t}_{t=1}^T and z = {x_{T+t}}_{t=1}^τ. Solving Eq. (4.4) therefore means finding a stable state in which the model parameters are tuned to perform best under the worst-case attack, while the attack parameters are tuned to generate the most impact against the best-defended model. This can be achieved by alternating between (1) minimizing −Σ ℓ_g in Eq. (4.5) with respect to ϕ = (β, γ) and (2) maximizing Σ ℓ_p in Eq. (4.6) with respect to θ. We call this defense mechanism a mini-max defense. We note that similar ideas have been exploited in deep generative models such as GAN (Goodfellow et al., 2014a) and WGAN (Arjovsky et al., 2017). See Algorithm 3 for a detailed description.

Remark. Unlike the sparse layer used in the attack, the sparse layer used to simulate a mock attack in our defense strategy does not have access to the actual attack sparsity parameter κ or the set of target time series I. Hence, we set the sparsity κ as a tuning parameter and skip the last step of the sparse layer described in Section 3.3, where δ_I is set to 0.
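Alternating updates of this kind can be illustrated on a toy scalar problem: a linear "forecaster" θ·(x + δ) under a bounded scalar attack δ ∈ [−η, η], with a squared error standing in for both loss terms. This is our own simplification for illustration only, not the paper's training code; it shows the alternating scheme of Algorithm 3, where the attacker ascends and the defender descends the same loss.

```python
import numpy as np

def minimax_train(x, z, eta, iters=300, lr_attack=0.05, lr_model=0.05):
    # Toy mini-max defense: delta (the analog of the attack parameters phi)
    # maximizes the squared error; theta minimizes it, alternating each step.
    theta, delta = 0.0, 0.0
    for _ in range(iters):
        err = theta * (x + delta) - z
        # attacker: gradient *ascent* on the loss, projected onto [-eta, eta]
        delta = float(np.clip(delta + lr_attack * 2 * err * theta, -eta, eta))
        err = theta * (x + delta) - z
        # defender: gradient descent on the same loss
        theta -= lr_model * 2 * err * (x + delta)
    return theta, delta

def worst_case_loss(theta, x, z, eta):
    # Exact worst case over [-eta, eta]: the quadratic loss is convex in the
    # perturbation, so the maximum is attained at an endpoint.
    return max((theta * (x + s * eta) - z) ** 2 for s in (-1.0, 1.0))
```

Training drives the worst-case loss far below its value at initialization, which is the qualitative behavior Eq. (4.4) aims for in the full model.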

5. EXPERIMENTS

We conduct numerical experiments to demonstrate the effectiveness of our proposed indirect sparse attack on multivariate probabilistic forecasting models and to compare various defense mechanisms.

5.1. EXPERIMENT SETUPS

Dataset. We include Traffic (Asuncion & Newman, 2007), Electricity (Asuncion & Newman, 2007), Taxi (Taxi & Commission, 2015), and Wiki (Lai, 2017). See Appendix B.1 for more information.

Multivariate Forecaster. We consider DeepVAR (Salinas et al., 2020), a state-of-the-art multivariate probabilistic model, with an implementation available in pytorch-ts (Rasul, 2021).

Defense Setups. Following Yoon et al. (2022), we use relative noise in both data augmentation (DA) and randomized smoothing (RS). That is, given a sequence of observations x = ([x_t]_i)_{i,t} ∈ R^{d×T}, we draw i.i.d. noise samples [ϵ_t]_i ∼ N(0, σ²) and produce the noisy input [x_t]_i ← [x_t]_i (1 + [ϵ_t]_i). In DA, we train the model on the noisy input. In RS, the base model is likewise trained on the noisy input with noise level σ. The noise level σ remains the same across DA and RS.

Metrics. We adopt the weighted quantile loss (wQL) to measure performance (see Appendix B.3).
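The relative-noise perturbation used for DA and RS base-model training can be sketched as follows (function name ours):

```python
import numpy as np

def relative_noise_augment(x, sigma, rng):
    # Relative noise: [x_t]_i <- [x_t]_i * (1 + [eps_t]_i), eps ~ N(0, sigma^2)
    # i.i.d., so the injected noise scales with each observation's magnitude.
    eps = sigma * rng.standard_normal(x.shape)
    return x * (1.0 + eps)
```

Scaling the noise by the observation itself keeps the perturbation comparable across series with very different magnitudes, which matters for datasets like Electricity where consumption levels vary widely across customers.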

5.2. EXPERIMENT RESULTS

Traffic, Electricity, and More Datasets. The averaged wQL loss is reported in Table 1 and Table 2 for the Traffic and Electricity datasets, respectively. The attacks include both deterministic and probabilistic ones, for both single and multiple target time series and time horizons. Besides, we plot wQL under both attacks against the sparsity level to better visualize the effect of different types of attack.

Message 1: the sparse, indirect attack is effective, and becomes more effective as κ increases. The experiments verify the effectiveness of the sparse indirect attack; that is, one can attack the prediction of a time series without directly attacking its own history. For example, in Table 2, under the deterministic attack, the average wQL is increased by 20% by attacking only one out of the nine remaining time series (there are 10 in total, but the target time series is excluded). Moreover, attacking half of the time series increases the average wQL by 102%! This observation is even more noticeable under the probabilistic attack: the average wQL can be increased by 215% with 50% of the time series attacked. Besides, the wQL loss increases as κ increases, which is further evidence that the sparse indirect attack is effective.

Message 2: the probabilistic attack is more effective than the deterministic attack, especially for small κ. In general, the average wQL increases as κ increases, and the probabilistic attack appears to be more effective than the deterministic one; see Figure 4a and Table 2. For example, under no defense with κ = 7, the probabilistic attack causes a 50% larger wQL loss than the deterministic one.

Message 3: Randomized Smoothing (RS) and Mini-Max are more robust than Data Augmentation (DA). As can be seen in Figure 4b, Table 2, and Table 1, all three defense methods bring robustness to the forecasting model. Data augmentation and randomized smoothing work well under small κ; the mini-max defense achieves comparable performance to them under small κ and outperforms them under large κ.

5.3. NON-TRANSFERABILITY BETWEEN UNIVARIATE AND MULTIVARIATE CASES

Section 5.2 above verifies the effectiveness of sparse indirect attacks on multivariate forecasting models. In this subsection, we investigate the transferability from univariate attacks to multivariate attacks. Specifically, we study whether the adversarial perturbation generated by univariate models can be transferred to multivariate models as an indirect attack. In the experiments, we choose κ = 1, and the other parameters are the same as described in Section 5.1. It turns out that TS 5 is selected by Algorithm 1 to harm the prediction of TS 1 when κ = 1. Thus, we use the technique in Dang-Nhu et al. (2020); Yoon et al. (2022) to generate a univariate attack on TS 5 from DeepAR. Note that only the history of TS 5 has been adversely altered. The attacked TS 5 is then fed into the DeepVAR model.

Experiment Result. The averaged wQL loss is reported in Table 12 in Appendix D. For better visualization, the history of TS 5 and the prediction of TS 1 are plotted in Figure 3a and Figure 3b, respectively. From the experiment results in Table 12, we observe that the multivariate attack is 3x more effective than the univariate attack, which is another reason why multivariate attacks are worth investigating.

6. CONCLUSION

In this work, we investigate the existence of sparse indirect attacks on multivariate time series forecasting models. We propose both a deterministic approach and a novel probabilistic approach to finding effective adversarial attacks. Besides, we adapt the randomized smoothing technique from image classification and univariate time series to our framework and design a mini-max optimization to effectively defend against the attacks delivered by our attackers. To the best of our knowledge, this is the first work to study sparse indirect attacks on multivariate time series and to develop corresponding defense mechanisms, which could inspire a future research direction.

A PROBABILISTIC ATTACK ALGORITHM

Algorithm 4 Probabilistic Adversarial Attack
input: pre-trained model p_θ(z | x), observation x, and other parameters:
  • statistic χ(·), adversarial target t_adv, target set I ⊂ [d]
  • attack energy η, sparsity constraint κ, number of iterations n
output: perturbation matrix δ ∈ R^{T×d} s.t. ‖δ‖_max ≤ η, E[s(δ)] ≤ κ, δ_I = 0
1. randomly initialize a sparse layer q(· | x; β, γ)
for iteration 1, 2, . . . , n do
  2. compute the expected loss H(β, γ) using Eq. (3.5)
  3. update β, γ via a first-order optimization method
end for
4. draw δ ∼ q(· | x; β, γ) and return

B DETAILS ON THE EXPERIMENT SETTING

B.1 DATASETS

• Traffic: hourly occupancy rate, between 0 and 1, of 963 San Francisco car lanes.
• Taxi: traffic time series of New York taxi rides taken at 1214 locations every 30 minutes from January 2015 to January 2016, considered to be heterogeneous. We use the taxi-30min dataset provided by GluonTS.
• Wiki: daily page views of 2000 Wikipedia pages used in Gasthaus et al. (2019).
• Electricity: hourly electricity consumption time series from 370 customers.

Traffic. We target the prediction χ(z) = (x_{1,T+τ−1}, x_{1,T+τ}, x_{5,T+τ−1}, x_{5,T+τ}). We choose prediction length τ = 24, context length T = 4τ = 96, and sparsity level κ = 1, 3, 5, 7, 9.

Wiki. We target the prediction χ(z) = x_{1,T+τ}. We choose prediction length τ = 30, context length T = 4τ = 120, and sparsity level κ = 1, 3, 5, 7, 9.

Electricity & Taxi. We target the prediction of the first time series at the last prediction time step, i.e., target time series I = {1} and attacked time horizon H = {τ}, so χ(z) = x_{1,T+τ}. We choose prediction length τ = 24, context length T = 4τ = 96, and sparsity level κ = 1, 3, 5, 7, 9.

For all experiments, we train a DeepVAR with rank 5. The attack energy η = c₁ max |x| is proportional to the largest element of the past observations in magnitude, where c₁ is set to 0.5.
For the adversarial target t_adv, we first draw a prediction x̄ from the un-attacked model p_θ(· | x) and choose t_adv = c₂ x̄ for constants c₂ = 0.5 and 2.0. Following the process in Yoon et al. (2022), we report the largest error produced by these choices of constants. Unless otherwise stated, the number of sample paths drawn from the prediction distribution is n = 100, used to compute the quantiles q^{(α)}_{i,t}. In the mini-max defense, the sparsity level of the sparse layer is set to 5 in all cases. For the noise level σ in DA and RS, we select it via a validation set; it turns out that no σ is uniformly better than the others across different sparsity levels. Thus, σ = 0.1 is chosen in the empirical evaluation. For an ablation study on the effect of σ, see Table 6 in Appendix B.4.

B.3 METRICS

We measure the performance of a model under attack with a metric popular for probabilistic forecasting models: the weighted quantile loss (wQL), defined as

wQL(α) = 2 · Σ_{i,t} [α max(x_{i,t} − q^{(α)}_{i,t}, 0) + (1 − α) max(q^{(α)}_{i,t} − x_{i,t}, 0)] / Σ_{i,t} |x_{i,t}|,

where α ∈ (0, 1) is a quantile level and q^{(α)}_{i,t} is the predicted α-quantile of x_{i,t}. In practical applications, under-prediction and over-prediction may incur different costs, which makes a quantile-based metric such as wQL a natural choice for probabilistic forecasting models. In the subsequent sections, we average wQL over the quantile levels α = 0.1, 0.2, . . . , 0.9 and report this averaged wQL.
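The wQL definition above can be sketched in a few lines of NumPy; the arrays and values below are illustrative, and the averaging over α = 0.1, . . . , 0.9 matches the evaluation protocol.

```python
import numpy as np

def wql(x, q, alpha):
    """Weighted quantile loss: x are actuals, q the predicted alpha-quantiles
    (arrays of matching shape), per the wQL formula above."""
    num = alpha * np.maximum(x - q, 0.0) + (1.0 - alpha) * np.maximum(q - x, 0.0)
    return 2.0 * num.sum() / np.abs(x).sum()

x = np.array([2.0, 4.0, 6.0])
assert wql(x, x, 0.5) == 0.0                 # perfect quantile forecast

# Averaged wQL over alpha = 0.1, ..., 0.9, as reported in the experiments.
alphas = np.arange(1, 10) / 10
avg = float(np.mean([wql(x, x - 0.5, a) for a in alphas]))
```

Under-prediction by a constant 0.5 gives wQL(α) = 0.25α here, so the asymmetry of the metric across quantile levels is easy to verify numerically.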

B.4 MORE RESULTS

To measure the performance of a forecasting model, other metrics such as the Weighted Absolute Percentage Error (WAPE) and the Weighted Squared Error (WSE) are also considered in a large body of literature. For completeness, we present their definitions:

WAPE = |predicted value / true value − 1| = (1 / (|I||H|)) Σ_{i∈I, h∈H} | (1/n) Σ_{j=1}^n x̂^j_{T+h,i} / x_{T+h,i} − 1 |,

WSE = (predicted value / true value − 1)^2 = (1 / (|I||H|)) Σ_{i∈I, h∈H} ( (1/n) Σ_{j=1}^n x̂^j_{T+h,i} / x_{T+h,i} − 1 )^2,

where x̂^j_{T+h,i} is the j-th sample path predicted by the forecasting model. We report WAPE, WSE and wQL under deterministic and probabilistic attacks on the Electricity dataset in Table 4 and Table 5. Table 6 reports the effect of choosing different values of σ in data augmentation and randomized smoothing. Next, we report the wQL loss on the Taxi and Wiki datasets under both types of attack in Tables 7, 8, 9 and 10 respectively. We also report the attack effects for different attacked time stamps H in Table 11.
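A minimal NumPy sketch of the two definitions, assuming pred_paths stacks the n sample paths for the |I||H| targeted (series, horizon) pairs; the toy numbers are illustrative:

```python
import numpy as np

def wape_wse(pred_paths, truth):
    """WAPE and WSE as defined above: mean over targeted (series, horizon)
    pairs of |mean-of-sample-paths / truth - 1| and its square.
    pred_paths: (n, m) sample paths for m pairs; truth: (m,)."""
    ratio = pred_paths.mean(axis=0) / truth - 1.0
    return np.abs(ratio).mean(), (ratio ** 2).mean()

truth = np.array([10.0, 20.0])
paths = np.array([[11.0, 18.0],              # n = 2 sample paths
                  [9.0, 18.0]])
wape, wse = wape_wse(paths, truth)
```

Here the sample-path means are (10, 18), so the per-pair ratios are (0, −0.1), giving WAPE = 0.05 and WSE = 0.005.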

C DETAILED PROOFS

Proof of Lemma 3.1. We can compute

P(δ_i = 0) = 1 − P(u_i ≤ Φ^{−1}(r_i(γ))) = 1 − r_i(γ). (C.1)

That is, with probability 1 − r_i(γ), δ_i = 0; equivalently, δ_i is distributed according to a degenerate probability measure with Dirac density D(δ_i) concentrated at 0. On the other hand, with probability r_i(γ), δ_i is distributed as q′_i(·|x; β). Combining the two cases, it follows that δ_i is distributed as a mixture of q′_i(·|x; β) and D(δ_i) with weights r_i(γ) and 1 − r_i(γ), respectively.

Proof of Lemma 3.2. By the construction of r_i(γ),

E[s(δ)] = Σ_{i=1}^d E[ I(u_i ≤ Φ^{−1}(r_i(γ))) ] = Σ_{i=1}^d P(u_i ≤ Φ^{−1}(r_i(γ))) = Σ_{i=1}^d r_i(γ) = (κ/√d) · Σ_{i=1}^d γ_i^{1/2} / (Σ_{i=1}^d γ_i)^{1/2} ≤ κ,

where the last inequality follows from the Cauchy–Schwarz inequality, which completes the proof.

Let F_x(r) ≜ P(z(x) ⪯ r). Consider

sup_{r∈R^d} |G(r) − G_δ(r)|
= sup_{r∈R^d} | ∫_{ϵ∈R^{d×T}} (F_{x+ϵ}(r) − F_{x+δ+ϵ}(r)) p_σ(ϵ) dϵ |
= sup_{r∈R^d} | ∫_{ϵ∈R^{d×T}} F_ϵ(r) (p_σ(ϵ − x) − p_σ(ϵ − x − δ)) dϵ |
= sup_{r∈R^d} | ∫_{ϵ∈R^{d×T}} ∫_0^1 F_ϵ(r) ∇p_σ(ϵ − x − tδ) · δ dt dϵ |
= sup_{r∈R^d} | ∫_0^1 ∫_{ϵ∈R^{d×T}} F_ϵ(r) (δ · (ϵ − x − tδ)/σ^2) p_σ(ϵ − x − tδ) dϵ dt |
= (1/σ) sup_{r∈R^d} | ∫_0^1 ∫_{ϵ∈R^{d×T}} F_{x+tδ+ϵ}(r) . . .
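The two lemmas can be checked numerically. The sketch below uses illustrative dimensions and takes q′_i to be a unit-variance Gaussian (an assumption); it samples from the mixture of Lemma 3.1, using that P(u_i ≤ Φ^{−1}(r_i)) = r_i for u_i ∼ N(0, 1) so the keep indicator is simply Bernoulli(r_i), and verifies the expected-sparsity bound of Lemma 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions and sparsity budget (illustrative values).
d, kappa = 8, 2.0
gamma = rng.uniform(0.5, 1.5, size=d)        # sparsity parameters gamma_i > 0
beta = rng.normal(size=d)                    # location parameters of q'_i

# r_i(gamma) in the form appearing in the proof of Lemma 3.2.
r = (kappa / np.sqrt(d)) * np.sqrt(gamma) / np.sqrt(gamma.sum())

# Lemma 3.1: delta_i is a mixture -- 0 w.p. 1 - r_i, else drawn from q'_i.
n_mc = 200_000
keep = rng.random((n_mc, d)) < r             # Bernoulli(r_i) keep indicator
delta = keep * rng.normal(beta, 1.0, size=(n_mc, d))

# Lemma 3.2: E[s(delta)] = sum_i r_i <= kappa (by Cauchy-Schwarz).
mean_sparsity = keep.sum(axis=1).mean()
assert abs(mean_sparsity - r.sum()) < 0.05
assert r.sum() <= kappa + 1e-12
```

The empirical zero-fraction of each coordinate of delta also matches 1 − r_i, which is exactly the mixture weight asserted by Lemma 3.1.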

D NON-TRANSFERABILITY OF ATTACKS BETWEEN UNIVARIATE AND MULTIVARIATE FORECASTERS

We study the transferability of attacks from univariate to multivariate models. Specifically, if an attack is generated on the same subset of time series (excluding the target time series) using a univariate model and then fed into a multivariate model, can it indirectly harm the prediction of the target time series? Below, we report the experimental results of the univariate and multivariate attacks. For better illustration, we also plot the history of the perturbed time series used in this comparison, that is, the electricity usage of the fifth user over t = 1, . . . , 264.

F ATTACK AND DEFENSE ON DETERMINISTIC MODEL

In this section, we conduct parallel experiments to examine the attack and defense effects on deterministic (non-probabilistic) forecasting models. We implement a state-of-the-art Informer model (Zhou et al., 2021) on the Traffic dataset, with hyper-parameters set exactly as in Table 1 for a fair comparison. As the Informer model is non-probabilistic, we only report the MSE loss in Table 13.

G LONGER PREDICTION LENGTH

In this section, we evaluate the attack and defense effects on the Informer model with a longer prediction length. We set the prediction length τ = 168; the other hyper-parameters are exactly the same as those in Table 1 and Table 13. The experimental results are reported in Table 14.



Forecasting Models. Recent decades have witnessed tremendous progress in DNN-based forecasting models. Given the temporal dependency of time series data, RNN- and CNN-based architectures have proved successful for time series forecasting tasks; see Rangapuram et al. (2018); Lim et al. (2020); Wang et al. (2019); Salinas et al. (2020) and Oord et al. (2016); Bai et al.

Figure 1: Illustration: an attacker misleads the prediction of time series (TS) 1 at time 288 by indirectly attacking TS 5. The left plot shows the authentic (orange) and perturbed (blue) versions of TS 5; the right plot shows the no-attack (orange) and under-attack (blue) predictions for TS 1. The ground truth (green) is also plotted for comparison. No alteration is made to TS 1, yet its prediction at the attack time step (t = 288) is adversely altered in the under-attack (blue) setting, which can push the prediction of TS 1 significantly away from the ground truth.

with target dimension 10. For more details on the model parameters, see Appendix B.2. Data Augmentation (DA) and Randomized Smoothing (RS). Following the convention in Dang-Nhu et al. (2020); Yoon et al. (

Figure 2: Plots of (a) averaged wQL under sparse indirect attack against the sparsity level on electricity dataset. The underlying model is a clean DeepVAR without defense. Target time series I = {1} and attacked time stamp H = {τ }; and (b) & (c) averaged wQL under different defense mechanisms on electricity dataset for deterministic & probabilistic attack respectively.

Figure 2. More experiment results with error bars on additional datasets can be found in Appendix B.4.

Figure 3: Plots of (a) authentic (orange), DeepAR-attacked (blue) and DeepVAR-attacked (green) versions of time series (TS) 5; and (b) ground-truth (orange), no-attack (blue), under-DeepAR-attack (red) and under-DeepVAR-attack (green) predictions for TS 1. The shaded area is the attacker's target range. Compared to the clean prediction, the value of TS 1 at the attack time step (t = 288) was adversely altered by the DeepVAR attack (green) but only slightly altered by the DeepAR attack (red). The wQL loss is 0.288 under no attack, 0.322 under the DeepAR attack, and 0.390 under the DeepVAR attack.

Figure 4: Plots of the original time series x (blue) vs. the noisy time series x̃ (orange) under noise levels σ = 0.1, 0.2, 0.3, where x̃ = x(1 + σ · ε) with ε ∼ N(0, 1).

Average wQL on Traffic dataset under deterministic and probabilistic attack. Target time series I = {1, 5} and attacked time stamp H = {τ -1, τ }. Smaller is better.

Average wQL on Electricity dataset under deterministic and probabilistic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Summary of statistics of the datasets used, including prediction length τ , domain, frequency, dimension, and time steps.

Metrics on Electricity dataset under deterministic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Metrics on Electricity dataset under probabilistic attack using the sparse layer. Target time series I = {1} and attacked time stamp H = {τ}. Smaller is better.

Average wQL on Electricity dataset under deterministic attack. The defense is data augmentation and randomized smoothing with varying σ = 0.1, 0.2, 0.3. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Average wQL on Taxi dataset under deterministic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Average wQL on Taxi dataset under probabilistic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Average wQL on Wiki dataset under deterministic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Average wQL on Wiki dataset under probabilistic attack. Target time series I = {1} and attacked time stamp H = {τ }. Smaller is better.

Average wQL on Traffic dataset under deterministic attack. Target time series I = {1} and attacked time stamp H = {1}. Smaller is better.

Transferring the attack from DeepAR to DeepVAR. Target time series I = {1} and attacked time stamp H = {τ}. Clean DeepAR and DeepVAR models are used. Averaged wQL is reported.

Average MSE on Traffic dataset under deterministic attack on Informer model with τ = 24. Target time series I = {1, 5} and attacked time stamp H = {τ -1, τ }. Smaller is better.

Average MSE on Traffic dataset under deterministic attack on Informer model with τ = 168. Target time series I = {1, 5} and attacked time stamp H = {τ -1, τ }. Smaller is better.

