MIA: A Framework for Certifiably Robust Time-Series Classification and Forecasting Against Temporally-Localized Perturbations

Abstract

Recent literature demonstrates that time-series forecasting and classification models are sensitive to input perturbations. However, defenses for time-series models remain relatively under-explored. In this paper, we propose Masking Imputing Aggregation (MIA), a plug-and-play framework that equips an arbitrary deterministic time-series model with certified robustness against temporally-localized perturbations (also known as ℓ0-norm localized perturbations), which is, to our knowledge, the first ℓ0-norm defense for time-series models. Our main insight is to slide an occluding mask across the input series, guaranteeing that for an arbitrary localized perturbation there exists at least one mask that completely occludes the perturbed area, so that the prediction on this masked series is certifiably unaffected. MIA is flexible: it works even when we only have query access to the pretrained model. To further validate the effectiveness of MIA, we compare it to two baselines extended from prior randomized smoothing approaches. Extensive experiments show that MIA yields stronger robustness.

1. Introduction

Time series forecasting and classification (TSF/TSC) have been widely applied to help businesses make informed decisions and plans (Miyato et al., 2017; Zhou et al., 2019; Schlegl et al., 2019; Park et al., 2018). However, a wide range of literature demonstrates that time-series models are vulnerable to adversarial input perturbations (Connor et al., 1994; Gelper et al., 2010; Ding et al., 2022; Yang et al., 2020; Dang-Nhu et al., 2020; Oregi et al., 2018; Han et al., 2020), e.g., an elaborately designed imperceptible perturbation can control the prediction (Karim et al., 2020; Fawaz et al., 2019). So far, related literature has mainly focused on detecting outliers (Ruff et al., 2018; Yairi et al., 2017); the adversarial robustness of time-series models is relatively under-explored, especially ℓ0-norm robustness, e.g., Yoon et al. (2022) only explore the ℓ2-norm adversarial robustness of probabilistic forecasting models. In the present work, we focus on robustness against temporally-localized perturbations, as powerful attacks of this kind already exist (Yang et al., 2022). Generally, defenses can be divided into two types: heuristic defenses and certified defenses. Heuristic defenses can yield better empirical robustness but lack robustness guarantees. As the experience in image classification shows (Athalye et al., 2018; Carlini & Wagner, 2017; Athalye & Carlini, 2018), heuristic defenses can become useless when confronted with newly designed adaptive attacks, e.g., Athalye et al. (2018) leverage the Backward Pass Differentiable Approximation technique to successfully circumvent almost all the heuristic defenses of that time. To end this "cat and mouse" game between adaptive attacks and heuristic defenses, the concept of certified defense was proposed, with unbreakable robustness certificates.
Current certified defenses can produce robustness certificates but often require the user to retrain the base model from scratch, e.g., Yoon et al. (2022); Li et al. (2020); Cohen et al. (2019) retrain the base model because these defenses perform poorly on naturally-trained models. The requirement for retraining brings additional challenges in real-world deployments. In addition, certified defenses for sequence-based data are quite under-explored, since almost all certified defenses focus on matrix-based data (e.g., images).

Figure 1: Overview of the MIA pipeline. Given a series x_{1:t_0}, MIA first masks different periods of x_{1:t_0} to construct the masked series x_{1:t_0} ⊙ M^{(k)}, k = 0, ..., M. MIA then imputes the masked series with the imputation model G(·) and classifies the imputed series with the pretrained model. If the predictions on all the imputed series are Class 0, MIA returns Class 0 with the guarantee that the output is unaffected by the perturbation; otherwise MIA returns Abstain.

To address these issues, in this paper we propose Masking Imputing Aggregation (MIA), a flexible framework that arms an arbitrary deterministic TSF/TSC model with robustness certificates against temporally-localized perturbations. Unlike prior defenses that require retraining, MIA only requires an imputation model for recovering the masked areas, which can be easily learned in an unsupervised setting. Specifically, MIA works as follows: 1) masking: MIA first constructs masked series by sliding a mask across the input series; 2) imputing: MIA imputes the masked series with the imputation model; 3) aggregation (checking agreement): MIA returns a class only if the pretrained model outputs the same prediction for all the imputed series, and otherwise returns Abstain. With these three steps, we can guarantee that every non-abstaining prediction from MIA is unaffected by the perturbation.
Furthermore, we compare MIA to two baselines extended from randomized smoothing, as randomized smoothing has achieved widespread success in defending against different adversarial attacks. Our contributions are: 1) We propose MIA, a plug-and-play framework that arms an arbitrary TSF/TSC model with certified robustness against temporally-localized perturbations, which is to our knowledge the first ℓ0-norm certified defense in the time series domain. 2) We propose randomized masked training, a specialized algorithm for training the imputation model of MIA, to further boost its performance. 3) We compare MIA to two baseline methods on three aspects. Robustness: extensive experiments on different datasets validate the superior robustness of MIA. Practicality: MIA is more practical as it is plug-and-play and does not require retraining. Inference cost: the inference time of MIA is comparable to that of the two baselines.

2. Related Work

Heuristic defenses for time-series models. Prior works on robust TSF/TSC can be divided into two general categories: outlier detection and deep learning. The former filters outliers in a statistical way, including k-Means clustering (Yang et al., 2017), one-class SVM clustering (Schölkopf et al., 2001), Kalman filters (de Bézenac et al., 2020) and support vector data description (Tax & Duin, 2004). The latter leverages the strong representation ability of neural networks to recover the perturbed series, including robust feature-based approaches (Guo et al., 2016; Yang & Fan, 2022), reconstruction-based methods (Li et al., 2021; 2019; Xu et al., 2018; Schlegl et al., 2019), GNN-based methods (Zhao et al., 2020; Deng & Hooi, 2021), association discrepancy (Xu et al., 2022), and LSTM-based methods (Hundman et al., 2018; Tariq et al., 2019). However, these empirical methods lack robustness guarantees, meaning they may become ineffective once a new adaptive attack is found. For that reason, certified defenses are crucial, because their mathematical robustness certificates are permanently unbreakable.

Certified adversarial defenses. In the field of image classification, there has been much work on certified defenses, including randomized smoothing (Cohen et al., 2019; Salman et al., 2020), convex polytope (Wong & Kolter, 2018), CROWN-IBP (Zhang et al., 2019) and Lipschitz bounding (Cisse et al., 2017). Among them, the ℓ0-norm defenses include derandomized smoothing (Levine & Feizi, 2020a), randomized ablation (Levine & Feizi, 2020b; Zhang et al., 2020) and a series of mask-based defenses (Xiang & Mittal, 2021; McCoyd et al., 2020; Han et al., 2021; Xiang et al., 2021; 2022). In stark contrast, certified defenses for time-series data are rarely explored.
To our knowledge, (Yoon et al., 2022) and (Li et al., 2020) are the only two defenses that produce ℓ2-norm robustness certificates for time-series models, but a common downside is that both additionally require retraining the base model over Gaussian-augmented samples, which imposes substantial additional training cost.

3. Preliminaries

Time series classification (TSC). Time series classification is modeled as follows: given a t_0-length series (denoted by x_{1:t_0} = [x_1, x_2, ..., x_{t_0}]), the TSC model returns a class, f(·): x_{1:t_0} → y.

Time series forecasting (TSF). Given the "past observations" x_{1:t_0}, the forecasting model returns the "future values", f(·): x_{1:t_0} → x_{t_0+1:t_0+τ}. In this paper we mainly focus on the classic and commonly studied short-term forecasting setting (Ke et al., 2017), which forecasts a single time point, f(·): x_{1:t_0} → R (not necessarily the next point x_{t_0+1}). The short-term forecasting problem is sufficiently representative, as the long-term forecasting problem f(x_{1:t_0}) → x_{t_0+1:t_0+τ} can be decomposed into τ short-term forecasting subproblems, in which the i-th (i = 1, ..., τ) forecaster predicts the (t_0 + i)-th time point. We discuss multivariate tasks later in this paper.

Definition 1 (Temporally-localized perturbation δ_{[t_adv+1 : t_adv+L_adv]}). In a temporally-localized perturbation attack, the adversary is allowed to perturb an arbitrary subseries subject to a given ℓ0-norm constraint. Let L_adv be the ℓ0-norm constraint on the localized perturbation. All perturbed series satisfying the constraint can be formulated as:

x_{1:t_0} + δ_{[t_adv+1 : t_adv+L_adv]} = x_{1:t_0} + [0, ..., 0, δ_{t_adv+1}, ..., δ_{t_adv+L_adv}, 0, ..., 0]
                                       = [x_1, ..., x_{t_adv+1} + δ_{t_adv+1}, ..., x_{t_adv+L_adv} + δ_{t_adv+L_adv}, ..., x_{t_0}]   (1)

where the perturbed subseries spans positions t_adv+1 through t_adv+L_adv, and the unbold δ_t refers to the single perturbation value added to the t-th time point. t_adv + 1 and t_adv + L_adv are the starting and ending points of the perturbation, respectively, which explicitly restricts its ℓ0 norm to ∥δ_{[t_adv+1 : t_adv+L_adv]}∥_{ℓ0} = (t_adv + L_adv) − (t_adv + 1) + 1 = L_adv.

Significance of temporally-localized perturbations. The temporally-localized perturbation is especially representative of real-world scenarios.
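Concretely, the perturbation of Definition 1 can be instantiated in a few lines. The following is a minimal NumPy sketch; the function name and the 0-indexed slicing convention are ours, not the paper's:

```python
import numpy as np

def localized_perturbation(x, t_adv, L_adv, delta_vals):
    """Add a temporally-localized perturbation to the points t_adv+1 .. t_adv+L_adv
    of the paper's 1-indexed notation (0-indexed slice [t_adv : t_adv + L_adv])."""
    assert len(delta_vals) == L_adv
    delta = np.zeros_like(x)
    delta[t_adv:t_adv + L_adv] = delta_vals
    return x + delta

x = np.arange(10, dtype=float)
x_pert = localized_perturbation(x, t_adv=3, L_adv=2, delta_vals=[0.5, -0.5])
# The l0 norm of the perturbation equals L_adv:
print(int(np.count_nonzero(x_pert - x)))  # 2
```

All points outside the window are left untouched, which is exactly the ℓ0-norm constraint the definition formalizes.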
Temporally-localized perturbations can represent short-term volatility and local anomalies, both of which can be regarded as normal data with a temporally-localized perturbation added. Resistance to short-term volatility is important in long-term forecasting/prediction, where the long-term value is considered unaffected by short-term fluctuations. A typical example is the well-known investment philosophy of "Value Investing" (Piotroski, 2000), in which the "intrinsic value" of a business is considered robust to short-term volatility. Moreover, detecting local anomalies is practically useful in real-world scenarios. For instance, detecting a time interval of abnormal heart rate in electronic health records is a local anomaly detection problem, and the same approach applies to detecting abnormal network traffic in IoT time-series data. Furthermore, to highlight the risk of temporally-localized perturbations, we empirically show in the Appendix how much an ℓ0-norm perturbation can change the output of an undefended forecaster. We also compare the attacking performance of ℓ0-norm perturbations to that of ℓ2-norm perturbations, and the empirical results suggest that forecasting models might be more sensitive to ℓ0-norm perturbations.

4. Proposed Framework: Masking Imputing Aggregation

4.1. Pipeline Overview

MIA consists of three steps: 1) masking; 2) imputing; 3) aggregation (checking agreement).

1. Masking. We denote a mask by M_{[u:v]}, where x_{1:t_0} ⊙ M_{[u:v]} replaces the values of x_{u:v} within x_{1:t_0} with zeros. Let L_mask be the size of the mask. Given a series x_{1:t_0} and the ℓ0 norm L_adv of the temporally-localized perturbation, we slide the mask through the input series with step size α = L_mask − L_adv + 1, obtaining the following masked series¹:

x_{1:t_0} ⊙ M_{[1+kα : min(L_mask+kα, t_0)]},  k = 0, ..., ⌈(t_0 − L_mask)/α⌉,  where α = L_mask − L_adv + 1   (2)

¹⌈c⌉ returns the smallest integer larger than or equal to c.

Algorithm 1: Masking Imputing Aggregation.

Input: the pretrained TSF/TSC model f(·), the imputation model G(·), the input series x_{1:t_0}, the mask size L_mask, the length of the temporally-localized perturbation L_adv, and the discretization parameter ∆ (for the TSF task).

Compute the masking step size α ← L_mask − L_adv + 1;
Generate the masked series by sliding the L_mask-sized mask: x_{1:t_0} ⊙ M_{[1+kα : min(t_0, L_mask+kα)]}, k = 0, ..., ⌈(t_0 − L_mask)/α⌉;
Impute the masked series with the imputation model: x^{(k)}_{1:t_0} = G(x_{1:t_0} ⊙ M_{[1+kα : min(t_0, L_mask+kα)]});
Compute the output y^{(k)} for each imputed series: y^{(k)} = f(x^{(k)}_{1:t_0}) for the TSC task, or y^{(k)} = f_dis(x^{(k)}_{1:t_0}) for the TSF task, where f_dis(·) is computed as in Eq. (5);
if y^{(0)} = y^{(1)} = ... = y^{(⌈(t_0−L_mask)/α⌉)} then Output: y^{(0)}; else Output: Abstain.

We set the step size to L_mask − L_adv + 1 to guarantee that every temporally-localized perturbation of length L_adv can be fully covered by at least one mask; taking min(L_mask + kα, t_0) prevents the mask from exceeding t_0.

2. Imputing. Our second step is to recover the masked values with the imputation model G(·):

x^{(k)}_{1:t_0} = G(x_{1:t_0} ⊙ M_{[1+kα : min(t_0, L_mask+kα)]}),  k = 0, 1, ..., ⌈(t_0 − L_mask)/α⌉   (3)

This step makes x^{(k)}_{1:t_0} approximate a normal time series, so that the pretrained model performs similarly on these imputed series. We discuss G(·) later in this section.
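The masking and aggregation logic of Algorithm 1 can be sketched as follows. This is a simplified sketch for the TSC case, assuming the pretrained classifier `f` and the imputer `G` are given as Python callables over NumPy arrays; all names are illustrative:

```python
import numpy as np
from math import ceil

def mia_predict(f, G, x, L_mask, L_adv):
    """Masking-Imputing-Aggregation for TSC: return the unanimous label or 'Abstain'."""
    t0 = len(x)
    alpha = L_mask - L_adv + 1            # step size covers every L_adv-length window
    n_masks = ceil((t0 - L_mask) / alpha) + 1
    preds = []
    for k in range(n_masks):
        masked = x.copy()
        masked[k * alpha: min(L_mask + k * alpha, t0)] = 0.0  # Step 1: mask one period
        preds.append(f(G(masked)))                            # Steps 2-3: impute, classify
    return preds[0] if all(p == preds[0] for p in preds) else "Abstain"
```

For the TSF task, `f` would be replaced by its discretized version f_dis from Eq. (5) before checking agreement.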

3. Aggregation (Checking Agreement).

We input the imputed series x^{(k)}_{1:t_0} into the pretrained model f(·). If the pretrained model's outputs on all x^{(k)}_{1:t_0} reach agreement unanimously, the MIA classifier f_MIA(x_{1:t_0}) outputs this unanimously approved label/prediction; otherwise it outputs Abstain to alert that the input series might have been attacked by a temporally-localized perturbation:

f_MIA(x_{1:t_0}) = f(x^{(0)}_{1:t_0})  if f(x^{(0)}_{1:t_0}) = f(x^{(1)}_{1:t_0}) = ... = f(x^{(⌈(t_0−L_mask)/α⌉)}_{1:t_0}); Abstain otherwise   (4)

Discretization technique for MIA on TSF. We note that TSF models essentially never forecast exactly identical values on different series, so MIA would output Abstain all the time on TSF. To address this, we substitute the original pretrained forecaster f(·) with its discretized version f_dis(·) in Eq. (4), where f_dis(x^{(k)}_{1:t_0}), k = 0, ..., ⌈(t_0 − L_mask)/α⌉, is computed as follows:

f_dis(x^{(k)}_{1:t_0}) = ∆ · ⌊f(x^{(k)}_{1:t_0})/∆⌋   (5)

where ∆ is a discretization parameter that controls the trade-off between the discretization error and the success rate of achieving agreement. As ∆ decreases, the discretized forecasts retain more information from the original forecasts, while the agreement rate decreases. For example, if we take ∆ = 0.5, f_dis(x^{(k)}_{1:t_0}) rounds the value of f(x^{(k)}_{1:t_0}) down to the nearest multiple of 0.5.

4.2. Discussion on the Mask Size L_mask

The only requirement of Masking (Step 1) is to ensure that for an arbitrary temporally-localized perturbation of length L_adv, there always exists a mask occluding that perturbation. Thus a prerequisite is L_mask ≥ L_adv. We can control the trade-off between the imputation quality and the inference cost with L_mask. As we increase L_mask, the imputation quality decreases, since the number of missing values increases. Meanwhile, the number of masked series decreases, so the inference cost is reduced.
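Stepping back to the discretization in Eq. (5): it is a one-liner, and the following sketch (function name ours) makes the effect concrete:

```python
import math

def f_dis(forecast: float, delta: float) -> float:
    """Discretize a real-valued forecast onto a grid of size delta (Eq. 5):
    f_dis = delta * floor(f / delta)."""
    return delta * math.floor(forecast / delta)

# With delta = 0.5, forecasts are snapped down to the nearest multiple of 0.5,
# so nearby forecasts from different imputed series can agree exactly:
print(f_dis(3.74, 0.5), f_dis(3.9, 0.5))  # 3.5 3.5
```

Larger ∆ merges more forecasts onto the same grid point (raising the agreement rate) at the price of larger discretization error.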
In the extreme case where L_mask = t_0, the masked series all equal 0_{1:t_0}, and MIA always outputs f(G(0_{1:t_0})) regardless of the input series: the imputation quality is extremely poor, while the inference cost is the smallest. The practical implementation of MIA is shown in Algorithm 1.

Remark 1 (MIA on Probabilistic Models). We notice that a line of time-series forecasting models are probabilistic (e.g., DeepAR (Salinas et al., 2020)); they model the forecast as a random distribution q[y | x_{1:t_0}] rather than a single value:

f(x_{1:t_0}) = E_{q[x_{t_0+1} | x_{1:t_0}]}[x_{t_0+1}]   (6)

The exact forecast value of a probabilistic model is inaccessible (prior works perform Monte-Carlo inference for approximation), which makes applying MIA to probabilistic models challenging. Although we can utilize the Clopper-Pearson method (Clopper & Pearson, 1934) to estimate the discretized forecasts f_dis(x_{1:t_0}) at a given confidence level, the inference cost of the confidence interval estimation would be expensive.²

4.3. Robustness Certificate of MIA

Proposition 1 (Robustness Certificate of MIA). The forecast/label (not Abstain) returned by Algorithm 1 cannot be changed by any temporally-localized perturbation whose ℓ0 norm is no larger than L_adv (see proof in Appendix).

Remark 2 (Robustness Certificate). The robustness certificate is for f_MIA(x_{1:t_0}) rather than f(x_{1:t_0}), because it is almost impossible to derive a certificate for a pretrained model without any assumptions. Our aggregation does not allow any tolerance, because the certificate would no longer hold once a disagreement is allowed. Note that with Masking (Step 1) we can only guarantee that there exists one masked series that is unaffected, while all other masked series may retain the perturbed area. If we allowed a single disagreer, the ensemble prediction would be totally under the adversary's control, because all except one masked series can be perturbed (and the only unaffected one would become the disagreer).
We point out that the certificate also holds for multivariate TSC/TSF. We can easily apply MIA to multivariate tasks by repeating Masking (Step 1) and Imputing (Step 2) on each variable.

4.4. Training Imputation Model G(•)

The performance of MIA highly depends on the imputation model G(·). There already exists much work on time series imputation (Cao et al., 2018; Du et al., 2022; Moritz & Bartz-Beielstein, 2017; Fortuin et al., 2020; Luo et al., 2019; Yozgatligil et al., 2013). However, these imputation models aim to recover scattered missing values, which is not what we need. To train an imputation model to recover consecutive missing values, we propose the randomized masked training algorithm, which minimizes the MSE loss over the masked noisy series:

E_{δ_{[1:t_0]} ∼ N(0, σ²)} [ (1/(C+1)) Σ_{k=0}^{C} ∥G((x_{1:t_0} + δ_{[1:t_0]}) ⊙ M_{[1+kα : min(L_mask+kα, t_0)]}) − x_{1:t_0}∥²₂ ]   (7)

where C = ⌈(t_0 − L_mask)/α⌉ and δ_{[1:t_0]} ∼ N(0, σ²) is a Gaussian noise series whose entries are i.i.d. samples from a Gaussian distribution. We add Gaussian noise specifically to make the imputation model robust to random noise and to avoid overfitting, since prior works (Foster et al., 1992; Passalis et al., 2021; Hwang et al., 1998) show that time series data is generally noisy. We emphasize that we do not add any noise at the inference stage.
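One Monte-Carlo evaluation of the objective in Eq. (7) can be sketched as follows. This is a pure-NumPy sketch with illustrative names; in practice G is a neural imputer and this quantity is minimized with a stochastic optimizer over fresh noise draws:

```python
import numpy as np
from math import ceil

def masked_training_loss(G, x, L_mask, L_adv, sigma, rng):
    """One Monte-Carlo evaluation of Eq. (7): average squared error of imputing
    each masked noisy copy of x back to the clean series."""
    t0 = len(x)
    alpha = L_mask - L_adv + 1
    C = ceil((t0 - L_mask) / alpha)
    noisy = x + rng.normal(0.0, sigma, size=t0)   # i.i.d. Gaussian noise per entry
    losses = []
    for k in range(C + 1):
        masked = noisy.copy()
        masked[k * alpha: min(L_mask + k * alpha, t0)] = 0.0  # occlude one period
        losses.append(np.sum((G(masked) - x) ** 2))
    return float(np.mean(losses))
```

With σ = 0 and an identity "imputer", the loss simply counts the squared magnitude of the masked entries, which makes the role of each term easy to verify.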

4.5. Comparison to Randomized Smoothing Defenses

Randomized smoothing (Cohen et al., 2019) is a well-known model-agnostic method in the field of certified defenses, which has been applied to defend against various types of attacks and achieves strong certified robustness in the respective fields. Comparing MIA to randomized smoothing therefore better demonstrates the advantages of our method. We extend two image-specific randomized smoothing defenses, Derandomized Smoothing (DS) (Levine & Feizi, 2020a) and Randomized Ablation (RA) (Levine & Feizi, 2020b), to the time series domain as baselines.

Derandomized smoothing for time-series models. In the time-series version of DS, given a time series x_{1:t_0} and the base classifier f(·), DS (denoted by f_DS) classifies as follows³:

f_DS(x_{1:t_0}) = argmax_{y ∈ Y} Σ_{x_sub ∈ Sub(x_{1:t_0}, η)} I{f(x_sub) = y}   (8)

where Sub(x_{1:t_0}, η) consists of the subsequences x_{1:η}, x_{η+1:2η}, x_{2η+1:3η}, ..., x_{t_0−η+1:t_0}. We first let the base classifier make predictions on these subsequences, and then f_DS(x_{1:t_0}) outputs the majority label. The prediction ŷ is robust if

Σ_{x_sub ∈ Sub(x_{1:t_0}, η)} I{f(x_sub) = ŷ} − max_{y ≠ ŷ} Σ_{x_sub ∈ Sub(x_{1:t_0}, η)} I{f(x_sub) = y} > 2(η + L_adv − 1)   (9)

Randomized ablation for time-series models. RA (denoted by f_RA(·)) classifies as follows:

f_RA(x_{1:t_0}) = argmax_{y ∈ Y} Pr_{x_sub ∼ Sample(x_{1:t_0}, η)}[f(x_sub) = y]   (10)

where x_sub ∼ Sample(x_{1:t_0}, η) randomly samples η time points without replacement to construct the subseries x_sub and ablates all other points. f_RA(x_{1:t_0}) returns the label that f(·) is most likely to assign to x_sub.

²We notice that a recent work (Yoon et al., 2022) derives robustness certificates for probabilistic forecasters, but our definitions of robustness are different: Yoon et al. (2022) bound the local Lipschitz constant, while our objective is much stricter, aiming to guarantee that the forecast is invariant under the perturbation.
³I{·} is the indicator function.
ŷ = f_RA(x_{1:t_0}) is robust if

Pr_{x_sub ∼ Sample(x_{1:t_0}, η)}[f(x_sub) = ŷ] > 3/2 − binom(t_0 − L_adv, η) / binom(t_0, η)   (11)

where binom(n, k) denotes the binomial coefficient.

Comparison to DS and RA. We note that the pretrained models of DS and RA make predictions on subseries f(x_sub) instead of normal series. Since the data distribution of the subseries is fundamentally different from that of normal data, we can expect these two defenses to perform poorly on naturally-trained models; their base classifiers therefore need to be trained from scratch on the subseries. In stark contrast, MIA is a plug-and-play framework that can be directly applied to pretrained TSF/TSC models. In MIA, the main cost of the training stage is preparing the imputation model, which can be trained in an unsupervised manner, saving us from labeling the data. Furthermore, we empirically show that MIA attains significantly better robustness than DS and RA in Section 5.
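For concreteness, the block-splitting and majority vote of the time-series DS baseline (Eq. (8)) can be sketched as follows. This simplified sketch assumes η divides t_0 (the paper's last block is end-aligned instead); all names are ours:

```python
import numpy as np
from collections import Counter

def f_ds(f, x, eta):
    """Derandomized smoothing for time series (Eq. 8): classify each length-eta
    contiguous block independently and return the majority label."""
    blocks = [x[i:i + eta] for i in range(0, len(x) - eta + 1, eta)]
    votes = Counter(f(b) for b in blocks)
    return votes.most_common(1)[0][0]
```

A localized perturbation of length L_adv can touch only a bounded number of blocks, which is what the margin condition in Eq. (9) exploits.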

5. Experiments

Experimental setup. We evaluate MIA on both TSC and TSF datasets. The TSF datasets include Exchange Rate, Traffic and UCI Electricity (Alexandrov et al., 2019), and the TSC datasets include DistalPhalanxTW, MiddlePhalanxTW and ProximalPhalanxTW (Ismail Fawaz et al., 2019a). We use MLP-Mixer (Tolstikhin et al., 2021), MLP and LSTM (Hochreiter & Schmidhuber, 1997) as the pretrained models. Our experiments are conducted on the clean trainsets, following the common setting of certified adversarial defenses (Yoon et al., 2022; Li et al., 2020; Cohen et al., 2019; Chiang et al., 2020; Zhang et al., 2019). Unless otherwise specified, we use MLP-Mixer as the base model for MIA, DS and RA, and set ∆ = 1.5. The experiments are conducted on CPU (16 Intel(R) Xeon(R) Gold 5222 CPU @ 3.80GHz) and GPU (one NVIDIA RTX 2080 Ti). More details are deferred to the Appendix.

Evaluation metrics. For TSC, we evaluate a defense by its certified accuracy (CA) under the temporally-localized perturbation, defined as the fraction of test samples that are correctly classified and certifiably robust to the perturbation. For TSF, we evaluate a defense by the forecasting rate (FR), mean square error (MSE) and mean absolute error (MAE). FR is the fraction of test samples on which MIA outputs a forecast instead of Abstain. MSE/MAE measures the mean square/absolute error between the MIA forecasts (Abstain excluded) and the ground truth. We defer the evaluation on multivariate tasks to the Appendix due to space limitations.

5.1. Comparison to Peer Methods

Comparison on TSC. Comparison on inference time. Table 3 compares the inference time of the three defenses, averaged over the three TSC datasets. We observe that the inference time of MIA is larger than that of DS, but significantly smaller than that of RA. MIA's larger inference time relative to DS is due to the cost of running the imputation model, while RA's large inference time stems from its confidence interval estimation⁶. We also observe that the inference time of MIA increases with L_adv and decreases with L_mask, because the number of masked series is ⌈(t_0 − L_mask)/(L_mask − L_adv + 1)⌉ + 1. We defer the inference time analysis on the TSF datasets to the Appendix.

5.2. Analysis of MIA on Different Pretrained Models

Table 4 and Table 5 report the performance of MIA on different pretrained models, where the model forecasts the next 24 points. We observe that MIA (∆ = 1.0, 1.5) consistently lowers the MSE compared to that of the original pretrained models, suggesting that MIA can also serve as an effective plugin for performance improvement. Specifically, MIA improves the forecasting performance by filtering out untrustworthy forecasts, sacrificing availability (a decrease in FR) for lower MSE as well as certified robustness, which is a common trade-off in the field of certified defenses (Cohen et al., 2019; Levine & Feizi, 2020a;b; Liu et al., 2021; Han et al., 2021). We can control the trade-off between MSE and FR via ∆, as decreasing ∆ reduces both MSE and FR.

5.3. Analysis on Imputation Model of MIA and Ablation Study

Impact of training algorithm and L_mask. FR decreases as L_mask increases, because the imputation quality decreases with L_mask, making it harder to reach agreement unanimously. Fig. 2c shows that the impact of L_mask on CA is not significant: although increasing L_mask reduces the imputation quality, it simultaneously reduces the number of masked series.

Impact of ∆. Fig. 2d and Fig. 2e report the impact of ∆. As ∆ increases, we observe that MSE and FR increase, validating our statement about ∆ in Section 4.1.

⁶Following (Levine & Feizi, 2020b), we take 100,000 samples for the confidence interval estimation.

6. Conclusion

In this paper, we propose the first framework that certifiably defends time-series models against ℓ0-norm localized perturbations. Notably, MIA is a plug-and-play defense that can be easily applied to any pretrained TSF/TSC model. The only requirement for deploying MIA is to train an imputation model, which we have explored extensively in this work. Moreover, our extensive experiments validate the effectiveness of MIA. We hope our work can inspire more studies on the ℓ0-norm robustness of time-series models. Interesting future work includes applying MIA to probabilistic models.

A Proof for Proposition 1

Proposition 1 (Robustness Certificate of MIA, restated). The forecast/label (not Abstain) returned by Algorithm 1 cannot be changed by any temporally-localized perturbation whose ℓ0 norm is no larger than L_adv.

Proof. We prove the robustness certificate for MIA (TSC); the proof for MIA (TSF) is analogous. Assume for contradiction that the adversary has changed the classification result of MIA from y_1 to y_2 via a temporally-localized perturbation δ (of ℓ0 norm L_adv). For notational simplicity, we denote M_{[1+kα : min(L_mask+kα, t_0)]} by M^{(k)}, k = 0, ..., ⌈(t_0 − L_mask)/α⌉, and δ_{[t_adv+1 : t_adv+L_adv]} by δ^{(t_adv)}, t_adv = 0, ..., t_0 − L_adv. Then we have:

f(x_{1:t_0} ⊙ M^{(0)}) = f(x_{1:t_0} ⊙ M^{(1)}) = ... = y_1   (12)
f((x_{1:t_0} + δ^{(t_adv)}) ⊙ M^{(0)}) = f((x_{1:t_0} + δ^{(t_adv)}) ⊙ M^{(1)}) = ... = y_2   (13)

(For brevity, f here denotes the composition of the imputation model and the pretrained classifier.) Our next step is to prove that there exists a mask M^{(m̂)}, m̂ ∈ {0, ..., ⌈(t_0 − L_mask)/α⌉}, that occludes the perturbation. Specifically, we show that the mask with index m̂ = ⌊t_adv/α⌋ covers the perturbation.

First, we show that the mask M^{(⌊t_adv/α⌋)} exists by proving ⌊t_adv/α⌋ ≤ ⌈(t_0 − L_mask)/α⌉:

t_adv/α − (t_0 − L_mask)/α ≤ (t_0 − L_adv − t_0 + L_mask)/(L_mask − L_adv + 1) = (L_mask − L_adv)/(L_mask − L_adv + 1) < 1   (14)

Second, we show that M^{(⌊t_adv/α⌋)} covers the perturbation by comparing the starting/ending points of the mask and of the perturbation δ. For the starting points, since α⌊t_adv/α⌋ ≤ t_adv,

(α⌊t_adv/α⌋ + 1)_{Mask} − (t_adv + 1)_{Perturbation} ≤ 0   (15)

For the ending points,

(α⌊t_adv/α⌋ + L_mask)_{Mask} − (t_adv + L_adv)_{Perturbation} = α⌊t_adv/α⌋ + (L_mask − L_adv + 1) − (t_adv + 1) = α(⌊t_adv/α⌋ + 1) − (t_adv + 1) ≥ 0   (16)

where the last inequality holds because α(⌊t_adv/α⌋ + 1) > t_adv and both sides are integers. As M^{(m̂)} occludes the perturbation, x_{1:t_0} ⊙ M^{(m̂)} = (x_{1:t_0} + δ^{(t_adv)}) ⊙ M^{(m̂)}, so Eqs. (12) and (13) give y_1 = y_2, contradicting the assumption y_1 ≠ y_2. This completes the proof.
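The covering argument above can also be checked exhaustively for small sizes. The following brute-force sanity check (ours, not part of the paper) verifies that the mask with index ⌊t_adv/α⌋ always occludes the perturbed window:

```python
from math import ceil

def mask_covers_all(t0, L_mask, L_adv):
    """For every perturbation start t_adv, check that the mask with index
    floor(t_adv / alpha) exists and fully covers [t_adv + 1, t_adv + L_adv] (1-indexed)."""
    alpha = L_mask - L_adv + 1
    k_max = ceil((t0 - L_mask) / alpha)
    for t_adv in range(0, t0 - L_adv + 1):
        k = t_adv // alpha
        start, end = 1 + k * alpha, min(L_mask + k * alpha, t0)
        if not (k <= k_max and start <= t_adv + 1 and end >= t_adv + L_adv):
            return False
    return True

# Holds for every valid configuration (L_adv <= L_mask <= t0) in a small grid:
print(all(mask_covers_all(t0, Lm, La)
          for t0 in range(4, 30)
          for La in range(1, t0)
          for Lm in range(La, t0 + 1)))  # True
```

This mirrors the two inequalities of the proof: the mask starts no later than the perturbation and ends no earlier.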

B Empirical Evaluation on Risk of Temporally-Localized Perturbations

To support our statement about the risk of temporally-localized perturbations, we specifically propose an algorithm for generating the temporally-localized perturbations. We then evaluate the attack performance visually.

B.1 Restated Definition of Temporally-Localized Perturbation

Definition 2 (Temporally-localized perturbation, restated). A temporally-localized perturbation perturbs consecutive time points of x_{1:t_0} subject to an ℓ0-norm constraint. The perturbed series is x_{1:t_0} + δ_{[t_adv+1 : t_adv+L_adv]}, as in Definition 1.   (19)

B.2 Generating Temporally-Localized Perturbations

The objective of our algorithm is to maximize the MSE between the original forecasts and the perturbed forecasts with respect to the ℓ0-norm constraint. Specifically, given the forecasting model f(x_{1:t_0}) → x_{t_0+1:t_0+τ}, our objective can be formulated as:

argmax_δ ∥f(x_{1:t_0} + δ) − f(x_{1:t_0})∥²₂   (20)

where δ corresponds to the perturbation defined in Eq. (19). The problem of computing the temporally-localized perturbation can be decomposed into two sub-problems: P1) search for the period [t_adv+1, t_adv+L_adv] to perturb; P2) with the period [t_adv+1, t_adv+L_adv] fixed, compute the perturbation values δ_{t_adv+1}, ..., δ_{t_adv+L_adv}. Solving P2 is not hard: once the period is fixed, we can compute the perturbation values via projected gradient descent (PGD) by maximizing the following loss:

max_{δ_{t_adv+1}, ..., δ_{t_adv+L_adv}} ∥f(x_{1:t_0} + δ) − f(x_{1:t_0})∥²₂   (21)

The main challenge is then to determine which period to perturb. We solve P1 by enumerating all possible perturbation positions [t_adv+1 : t_adv+L_adv], t_adv = 0, ..., t_0 − L_adv, computing the corresponding attack for each, and finally returning the one with the largest MSE loss among the t_0 − L_adv + 1 perturbations. However, in practice we found that computing the perturbation values (P2) for a fixed period is hard to converge, as the ℓ2 norm of the temporally-localized perturbation approaches ∞. We believe that a perturbation attack with infinite ℓ2 norm is meaningless in practice.
For the sake of practicality, we additionally impose an ℓ2-norm constraint on the temporally-localized perturbation, besides the ℓ0-norm constraint, for sub-problem P2:

max_{δ_{t_adv+1}, ..., δ_{t_adv+L_adv}} ∥f(x_{1:t_0} + δ) − f(x_{1:t_0})∥²₂  subject to ∥δ∥₂ ≤ ϵ   (22)

where ϵ is the preset upper bound on the ℓ2 norm of the perturbation.

B.3 Empirical Evaluation of Temporally-Localized Perturbations

B.4 Quantifying the risk of temporally-localized perturbations. Table 8 quantifies the risk of temporally-localized perturbations by computing the MSE between the clean forecasts and the perturbed forecasts w.r.t. the ℓ0-norm (L_adv) and ℓ2-norm (ϵ) constraints. We observe that the MLP-Mixer model provides the highest empirical robustness among the three models, which partially explains why MLP-Mixer outperforms the other models under MIA. In particular, we further compare MLP to MLP+MIA (∆ = 1.5) in Table 4 of the main paper. Specifically, the MSE of MLP under temporally-localized perturbations (ϵ = 3.0) is 1.472 at L_adv = 5% and 1.686 at L_adv = 10%, while MLP+MIA attains 0.146 and 0.144, respectively. MIA reduces the MSE to roughly one tenth of the original, indicating that MIA can effectively prevent the forecasting results from being influenced by temporally-localized perturbations.

B.5 Comparison of ℓ0-norm and ℓ2-norm perturbations. We measure the relative gap between the two attacks as (MSE_{ℓ0} − MSE_{ℓ2}) / MSE_{ℓ2} × 100%.

B.6 Visualizing the risk of temporally-localized perturbations. Fig. 3 illustrates the effect of temporally-localized perturbations on the forecasting results. We observe that temporally-localized perturbations with L_adv = 10% can significantly change the forecasts.
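The PGD step for sub-problem P2 under the ℓ2 constraint of Eq. (22) amounts to gradient ascent restricted to the perturbed window, followed by a projection onto the ℓ2 ball. A minimal sketch, assuming the caller supplies the gradient of the attack loss (e.g., via autograd); the toy gradient in the usage example is purely illustrative:

```python
import numpy as np

def project_l2(delta_window, eps):
    """Project the perturbation values onto the l2 ball of radius eps (Eq. 22)."""
    norm = np.linalg.norm(delta_window)
    return delta_window if norm <= eps else delta_window * (eps / norm)

def pgd_localized(loss_grad, x, t_adv, L_adv, eps, step=0.1, iters=50):
    """Projected gradient ascent restricted to the window [t_adv, t_adv + L_adv)
    (0-indexed): ascend the attack loss, then project after every step.
    loss_grad(z) must return the gradient of the attack loss w.r.t. the input z."""
    delta = np.zeros_like(x)
    window = slice(t_adv, t_adv + L_adv)
    for _ in range(iters):
        g = loss_grad(x + delta)
        delta[window] += step * g[window]             # only the allowed window moves
        delta[window] = project_l2(delta[window], eps)
    return x + delta

# Toy usage with an illustrative gradient (for loss = sum(z), the gradient is all ones);
# the perturbation is driven to the l2 boundary:
grad = lambda z: np.ones_like(z)
x = np.zeros(10)
x_adv = pgd_localized(grad, x, t_adv=2, L_adv=3, eps=1.0)
print(round(float(np.linalg.norm(x_adv - x)), 6))  # 1.0
```

Sub-problem P1 then wraps this in a loop over t_adv = 0, ..., t_0 − L_adv and keeps the perturbation with the largest loss.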

C.1 Dataset Information

Table 9 shows the details of each dataset, including context length, forecasting length (for TSF datasets) and number of classes. Traffic: hourly occupancy rate, between 0 and 1, of 963 San Francisco car lanes (Salinas et al., 2019). Electricity: hourly time series of the electricity consumption of 370 customers (Salinas et al., 2019). Exchange: daily exchange rate between 8 currencies (Salinas et al., 2019). DistalPhalanxTW, MiddlePhalanxTW, ProximalPhalanxTW⁷: this series of 11 classification problems was created as part of Luke Davis's PhD titled "Predictive Modelling of Bone Ageing". They are designed to test the efficacy of hand and bone outline detection and whether these outlines can help in bone age prediction. Note that these problems are aligned by subject, and hence can be treated as a multi-dimensional TSC problem. The final three bone classification problems, DistalPhalanxTW, MiddlePhalanxTW and ProximalPhalanxTW, involve predicting the Tanner-Whitehouse score (as labelled by a human expert) from the outline.

Data Pre-Processing

We pre-process the input series with scipy.signal.savgol_filter (window length 15, polyorder 5) on both the training and testing datasets. Besides, we normalize each input series with its mean value and standard deviation, so that each normalized input series has mean 0 and standard deviation 1. We use this instance normalization on both the training and test sets.

7 https://timeseriesclassification.com/description.php

Models. We use the classical forecasting and classification models MLP, MLP-Mixer (Tolstikhin et al., 2021), GRU (Cho et al., 2014), LSTM (Hochreiter & Schmidhuber, 1997), FCN (Ismail Fawaz et al., 2019b) and ResNet-18 (He et al., 2016) as the pretrained models. We show the architectures of these models in Tables 10, 11, 12 and 13.

Training. We uniformly adopt the Adam optimizer (Kingma & Ba, 2014) with lr = 0.0001, β = (0.9, 0.999), ε = 10⁻⁸, weight decay = 0 and epochs = 20 for all the pretrained models.
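The pre-processing pipeline described above can be sketched as follows; this is a minimal version assuming a 1-D numpy series of length at least 15, and `preprocess` is an illustrative name, not from the paper.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(series):
    """Savitzky-Golay smoothing (window length 15, polyorder 5)
    followed by instance normalization to zero mean and unit std."""
    smoothed = savgol_filter(series, window_length=15, polyorder=5)
    return (smoothed - smoothed.mean()) / smoothed.std()

x = np.sin(np.linspace(0.0, 10.0, 100))
z = preprocess(x)   # z has mean 0 and standard deviation 1
```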

Comparison to peer methods on Mean Absolute Error (MAE).

We consider a multivariate series X = [x^(1)_{1:t_0}, x^(2)_{1:t_0}, ..., x^(d_0)_{1:t_0}]^T with channels d = 1, 2, ..., d_0. MIA extends to multivariate series as follows:

1. We generate the masks in the same way as Masking (Step 1). The main difference is the way we mask the multivariate series with the univariate mask. Masking the multivariate series X with the mask M_[u:v] is computed channel-wise:

X ⊙ M_[u:v] = [x^(1)_{1:t_0} ⊙ M_[u:v], x^(2)_{1:t_0} ⊙ M_[u:v], ..., x^(d_0)_{1:t_0} ⊙ M_[u:v]]^T.

2. With the imputation model G(·), Imputing (Step 2) for the multivariate series X is likewise computed channel-wise:

G(X ⊙ M_[u:v]) = [G(x^(1)_{1:t_0} ⊙ M_[u:v]), G(x^(2)_{1:t_0} ⊙ M_[u:v]), ..., G(x^(d_0)_{1:t_0} ⊙ M_[u:v])]^T.

3. We aggregate all the predictions of the imputed multivariate series in the same way as Aggregation (Step 3) for univariate series.

Evaluation of MIA on multivariate tasks. For the multivariate time series forecasting (MTSF) task, we follow Wu et al. (2021) and evaluate our MIA framework on four datasets: ETTh2 (Zhou et al., 2021), ETTm2 (Zhou et al., 2021), Weather 8 and Illness 9. The corresponding results are presented in Tables 22 to 29.

G Imputation Models

For the imputation model architectures, we consider SAITS, Transformer, BRITS and MLP-Mixer. For SAITS (Du et al., 2022), we use the code from PyPOTS 10 and set d_model = 32, n_layers = 2, d_inner = 16, n_head = 4, d_k = 8, d_v = 8. For Transformer (Vaswani et al., 2017), we set d_model = 32, n_layers = 2, d_inner = 16, n_head = 4, d_k = 8, d_v = 8. For BRITS (Cao et al., 2018), we set h_hidden = 32. For MLP-Mixer (Tolstikhin et al., 2021), we use the same structure as in Table 12. For training, we use the Adam optimizer with lr = 0.0001, β = (0.9, 0.999), ε = 10⁻⁸, weight decay = 0 and train each model for 30 epochs.

G.1 Choice of Imputation Model Architecture

Table 14 compares the imputation quality of different imputation models on Traffic, quantified by the mean square error (MSE) between the imputed series and the original series. We observe that MLP-Mixer consistently outperforms the other three models across different datasets and L_adv, indicating the superior imputation ability of the MLP-Mixer architecture.
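Steps 1 and 2 above broadcast the univariate mask over the d_0 channels. A numpy sketch, using an identity imputer as a stand-in for the imputation model G (all names are illustrative, not from the paper):

```python
import numpy as np

def mask_multivariate(X, u, v):
    """Step 1 for multivariate series: occlude positions u..v-1 of every
    channel of X (shape (d0, t0)) with the univariate mask M_[u:v]."""
    mask = np.ones(X.shape[1])
    mask[u:v] = 0.0
    return X * mask                      # broadcast over the d0 channels

def impute_multivariate(X_masked, G):
    """Step 2: apply the (univariate) imputation model G channel-wise."""
    return np.stack([G(channel) for channel in X_masked])

X = np.arange(12, dtype=float).reshape(3, 4)   # d0 = 3 channels, t0 = 4
Xm = mask_multivariate(X, u=1, v=3)
Xi = impute_multivariate(Xm, G=lambda s: s)    # identity G for illustration
```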



8 https://www.bgc-jena.mpg.de/wetter/
9 https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
10 https://github.com/WenjieDu/PyPOTS



Figure 2: Top: impact of L mask on TSC dataset ProximalPhalanxTW. Bottom: impact of ∆ on TSF dataset Traffic (L adv = 3%).

Figure 3: The effect of temporally-localized perturbations (L_atk = 10% and ε = 3.0) on different datasets. Clean and Perturbed refer to the normal input series and the perturbed input series respectively. Rows 1-3: Traffic. Rows 4-5: Electricity. Rows 6-7: Exchange rate. The red background denotes the location of the perturbation. The blue background denotes the output series.

(DistalPhalanxTW)  Comparison among three defenses on a TSC dataset.

(Exchange)  Comparison among three certified defenses on TSF dataset.

Comparison of the inference time (millisecond) of three defenses on TSC datasets.











MLP-Mixer. An interesting observation is that the certified accuracy of DS and RA stays constant across different L_adv. The reason is that the probability score of DS/RA models often concentrates on a single class, causing most classifications (both correct and wrong) of DS and RA to be highly robust. The results show that the certified accuracy of MIA is more than twice that of DS and RA across different L_adv. The reason is that the pretrained model of MIA classifies the masked series, while the base model in DS/RA classifies the subseries. MIA can attain a higher certified accuracy because the masked series contains much more information (t_0 − L_mask unmasked time points) than the subseries (η sampled time points).

Comparison on TSF. Table 2 reports FR and MSE of the three defenses on Exchange, where the model predicts the next 30 values. Here we utilize the discretization technique to make the TSF task feasible for DS and RA. The table shows that MIA offers a significantly higher FR than DS and RA, implying

(Electricity) (c₁, c₂%) reports (MSE, FR%) of MIA on different pretrained models.

Comparison of different training algorithms on 3 TSC datasets.

Comparison of four imputation models. The best results are shown in bold-face.

δ_{[t_adv+1 : t_adv+L_adv]}   subject to   ∥δ_{[t_adv+1 : t_adv+L_adv]}∥₀ ≤ L_adv

(Traffic) Evaluation of the MSE between the clean forecasts and the perturbed forecasts. The temporally-localized perturbations are generated subject to different ℓ0-norm constraints (L_adv = 2%, 5%, 10%) and ℓ2-norm constraints (ε = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5).

Dataset information for TSF and TSC.

Comparing the attacking performance of the ℓ0 attack to the ℓ2 attack. We compare the ℓ0 attack and the ℓ2 attack under norm constraints β (attack rate) on the time series forecasting task. Results are shown in Tables 30, 31 and 32. The values in the tables are calculated as (MSE_{ℓ0} − MSE_{ℓ2}) / MSE_{ℓ2} × 100%.
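The relative-improvement metric used in Tables 30-32 is the percentage gap between the two attacks' MSEs. A trivial helper illustrating the computation (the function name is illustrative, not from the paper):

```python
def relative_improvement(mse_l0, mse_l2):
    """(MSE_l0 - MSE_l2) / MSE_l2 * 100: positive values mean the
    l0 attack perturbs the forecast more than the l2 attack."""
    return (mse_l0 - mse_l2) / mse_l2 * 100.0

# Example: the l0 attack yields MSE 1.10 vs 1.00 for the l2 attack.
gap = relative_improvement(1.10, 1.00)   # about 10.0
```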

MLP structure

The structure of the MLP-Mixer block; we stack 4 MLP-Mixer blocks to construct the MLP-Mixer for forecasting and imputation.

The structure of the Fully Convolutional Network (FCN) for TSC.

Comparison of imputation quality (MSE between the imputed series and the original series) of different imputation models when imputing masked series of different mask lengths L_mask = 5%, 10%, 15%, 20%. For a fixed L_mask, we first construct the t₀ − L_mask + 1 masked series of mask length L_mask and compute the average imputation MSE over these t₀ − L_mask + 1 masked series. Bold indicates the best among the four generators.
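The averaging procedure in the caption above can be sketched as follows; `impute` stands in for the imputation model G, and the zero-filling imputer in the example is purely illustrative.

```python
import numpy as np

def average_imputation_mse(series, impute, L_mask):
    """Slide a mask of length L_mask across the series, impute each of the
    t0 - L_mask + 1 masked series, and average the per-series MSE between
    the imputed values and the original series."""
    t0 = len(series)
    errors = []
    for u in range(t0 - L_mask + 1):
        mask = np.ones(t0)
        mask[u:u + L_mask] = 0.0
        imputed = impute(series * mask, mask)
        errors.append(np.mean((imputed - series) ** 2))
    return float(np.mean(errors))

# Trivial imputer that leaves masked positions filled with zero.
zero_impute = lambda masked, mask: masked
avg = average_imputation_mse(np.ones(10), zero_impute, L_mask=3)  # -> 0.3
```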

(Exchange)  Comparison among three certified defenses on TSF dataset.

(Traffic) The performance of MIA on different pretrained models. (c₁, c₂%) reports (MAE, FR%) of MIA. Baseline is the MAE of the pretrained model without MIA. The lowest MAE and the highest FR for each pretrained model are shown in bold-face.








(Electricity) (c₁, c₂%) reports (MAE, FR%) of MIA on different pretrained models.

Table 16 and Table 17 report MAE and FR of MIA on Traffic, Electricity and Exchange, as a supplement to Table 4 and Table 5 in the main paper. We observe that MIA consistently reduces the MAE of the pretrained models.

Table 18: (Exchange) (c₁, c₂%) reports (MSE, FR%) of MIA at different L_atk and L_def.

(Exchange) (c₁, c₂%) reports (MAE, FR%) of MIA at different L_adv and L_mask.

Table 18 and Table 19 evaluate MSE/MAE of MIA on Exchange. We observe that MIA moderately increases the MSE/MAE of the pretrained models because of the information loss introduced by the discretization technique. Specifically, the MSE/MAE of the pretrained models is commonly much smaller than on Traffic and Electricity because the Exchange dataset is much simpler. This better forecasting performance implies that the discretization technique discards relatively more information. On Exchange, the information loss plays a more conspicuous role than the filtering function of MIA, causing the increase in MSE/MAE. This suggests that the discretization technique might lower the forecasting performance when the pretrained models are already precise enough.

Table 20 reports MSE and FR of masked training and random training on the TSF dataset Traffic. Similar to the comparison on the TSC datasets (Table 6 in the main paper), our masked training consistently achieves lower MSE than random training across different imputation models.

(Traffic) Comparison between mask methods. (c₁, c₂%) reports (MSE, TPR%) of MIA across L_def = 2%, 5%, 10%, L_atk = 2%, 5%, 10%, and Δ = 1.0, 1.2, 1.5. MSE (c₁) reports the forecasting performance.

Comparison of four imputation models. The best results are shown in bold-face.

Tables 22 to 29. Extensive experiments demonstrate that on multivariate tasks MIA behaves similarly to how it behaves on univariate forecasting tasks.

(ETTh2) Evaluate MAE of MIA with pretrained models on multi-variate time series forecasting (MTSF).

(ETTh2) Evaluate MSE of MIA with different pretrained models on multi-variate time series forecasting (MTSF).






The table reports the relative improvement of the ℓ0-norm perturbation over the ℓ2-norm perturbation on the MSE between the original forecast and the perturbed forecast (averaged over 128 randomly selected samples). A positive value indicates that our ℓ0-norm perturbation outperforms the ℓ2-norm perturbation. For fairness, the ℓ0- or ℓ2-norm of the perturbation is restricted to be no larger than β times the average ℓ0- or ℓ2-norm over all the testing samples. (Table rows: relative improvements in accuracy for MLP-Mixer, MLP and ResNet-18 at 5%, 10% and 15%.)

