HIDDEN MARKOV MIXTURE OF GAUSSIAN PROCESS FUNCTIONAL REGRESSION: UTILIZING MULTI-SCALE STRUCTURE FOR TIME-SERIES FORECASTING

Abstract

The mixture of Gaussian process functional regressions (GPFRs) assumes that a batch of time-series or sample curves is generated by independent random processes with different temporal structures. In real situations, however, these structures switch among one another in a random manner over a longer time scale, so the assumption of independent curves does not hold in practice. To remove this limitation, we propose the hidden Markov based GPFR mixture model (HM-GPFR), which describes the curves with temporal structure at both a fine and a coarse level. Specifically, the temporal structure is modeled by Gaussian processes at the fine level and by a hidden Markov process at the coarse level, so the whole model can be regarded as a random process with state-switching dynamics. To further enhance robustness, we place priors on the model parameters and develop the Bayesian hidden Markov based GPFR mixture model (BHM-GPFR). Experimental results demonstrate that the proposed methods achieve both high prediction accuracy and good interpretability.

1. INTRODUCTION

The time-series considered in this paper has a multi-scale structure with a coarse level and a fine level. We have observations $(\mathbf{y}_1, \dots, \mathbf{y}_T)$, where each $\mathbf{y}_t = (y_{t,1}, \dots, y_{t,L})$ is itself a time-series of length $L$. The whole time-series is arranged as
$$y_{1,1}, y_{1,2}, \dots, y_{1,L},\; y_{2,1}, y_{2,2}, \dots, y_{2,L},\; \dots,\; y_{T,1}, y_{T,2}, \dots, y_{T,L}. \quad (1)$$
The subscripts of $\{\mathbf{y}_t\}_{t=1}^T$ are called coarse level indices, while the subscripts of $\{y_{t,i}\}_{i=1}^L$ are called fine level indices. Throughout this paper, we take the electricity load dataset as a concrete example. It consists of $T = 365$ consecutive daily records, and each day contains $L = 96$ samples recorded every quarter-hour. In this example, the coarse level indices denote days, while the fine level indices correspond to a time resolution of 15 minutes. The aim is to forecast both short-term and long-term electricity loads based on historical records. There may be partial observations $y_{T+1,1}, \dots, y_{T+1,M}$ with $M < L$, so the entire observed time-series has the form
$$y_{1,1}, \dots, y_{1,L},\; y_{2,1}, \dots, y_{2,L},\; \dots,\; y_{T,1}, \dots, y_{T,L},\; y_{T+1,1}, \dots, y_{T+1,M}. \quad (2)$$
The task is to predict a future response $y_{t^*,i^*}$, where $t^* \geq T+1$ and $1 \leq i^* \leq L$ are positive integers. The coarse level and the fine level provide different structural information about the data generation process. At the coarse level, each $\mathbf{y}_t$ can be regarded as a time-series, and there is a certain cluster structure (Shi & Wang, 2008; Wu & Ma, 2018) underlying these time-series $\{\mathbf{y}_t\}_{t=1}^T$: we can divide them into groups such that the time-series within each group share a similar evolving trend. In the electricity load dataset, such groups correspond to different electricity consumption patterns. We use $z_t$ to denote the cluster label of $\mathbf{y}_t$.
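The multi-scale arrangement in Equation (1) amounts to reshaping a flat series of $T \cdot L$ samples into a $T \times L$ matrix whose rows are the coarse-level time-series. A minimal NumPy sketch, with placeholder values rather than real load data:

```python
import numpy as np

# Sketch of the multi-scale arrangement in Equation (1): a flat series
# of T*L samples is reshaped into T coarse-level rows (days) of L
# fine-level samples each (96 quarter-hourly loads).
T, L = 365, 96
flat = np.arange(T * L, dtype=float)   # stand-in for y_{1,1}, ..., y_{T,L}

Y = flat.reshape(T, L)                 # Y[t - 1] is the time-series y_t

# y_{2,3} (second day, third quarter-hour) sits at flat index L*1 + 2
assert Y[1, 2] == flat[L * 1 + 2]
```

Row-major reshaping preserves the ordering of Equation (1), so fine-level index $i$ of day $t$ maps to flat index $L(t-1) + (i-1)$.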
At the fine level, the observations $\{y_{t,i}\}_{i=1}^L$ can be regarded as a realization of a stochastic process whose properties are determined by the cluster label $z_t$. The mixture of Gaussian process functional regressions (mix-GPFR) model (Shi & Wang, 2008; Shi & Choi, 2011) is powerful for analyzing functional or batch data, and it is applicable to the multi-scale time-series forecasting task. Mix-GPFR assumes there are $K$ Gaussian process functional regression (GPFR) (Shi et al., 2007) components, and associated with each $\mathbf{y}_t$ there is a latent variable $z_t$ indicating which GPFR component generates $\mathbf{y}_t$. Since GPFR is good at capturing temporal dependency, this model successfully utilizes the structural information at the fine level. However, the temporal information at the coarse level is totally ignored, since mix-GPFR assumes $\{z_t\}_{t=1}^T$ are i.i.d. In this work, we propose to model the temporal dependency at the coarse level with a hidden Markov model, which characterizes the switching dynamics of $z_1, \dots, z_T$ by a transition probability matrix. We refer to the proposed model as HM-GPFR. Mix-GPFR is able to effectively predict $y_{T+1,M+1}, \dots, y_{T+1,L}$ when $M > 0$: to predict these responses, we must determine the cluster label $z_{T+1}$ from the observations $y_{T+1,1}, \dots, y_{T+1,M}$; otherwise, we do not know which evolving pattern governs $\mathbf{y}_{T+1}$. If there is no observation on day $T+1$ (i.e., $M = 0$), mix-GPFR fails to identify the stochastic process that generates $\mathbf{y}_{T+1}$. For the same reason, mix-GPFR is not suitable for long-term forecasting ($t^* > T+1$). On the other hand, HM-GPFR is able to infer $z_{t^*}$ for any $t^*$ from the transition probabilities of the hidden Markov model, even when $M = 0$. Therefore, HM-GPFR makes use of coarse level temporal information and solves the cold start problem of mix-GPFR.
Besides, when a new day's records $\mathbf{y}_{T+1}$ have been fully observed, one needs to re-train a mix-GPFR model to utilize $\mathbf{y}_{T+1}$, whereas HM-GPFR can adjust its parameters incrementally without retraining.

2. RELATED WORKS

The Gaussian process (GP) (Rasmussen & Williams, 2006) is a powerful non-parametric Bayesian model. GPs have been applied to time-series forecasting in (Girard et al., 2002; Brahim-Belhouari & Bermak, 2004; Girard & Murray-Smith, 2005). Shi et al. proposed the GPFR model to process batch data (Shi et al., 2007). To effectively model multi-modal data, a mixture structure was further introduced into GPFR, yielding the mix-GPFR model (Shi & Wang, 2008; Shi & Choi, 2011). GP-related methods for electricity load prediction have been evaluated thoroughly in (Wu & Ma, 2018; Li et al., 2019; Cao et al., 2021). However, in these works daily records are treated as i.i.d. samples, and the temporal information at the coarse level is ignored. Multi-scale time-series models were proposed in (Ferreira et al., 2006; Ferreira & Lee, 2007b;a), and further developments in this direction have been achieved in recent years. The time-series considered in this work differs from those multi-scale time-series in that, at the coarse level, there is no aggregated observation formed from the samples at the fine level. In this paper, we mainly emphasize the multi-scale structure of the time-series.

3.1. HIDDEN MARKOV MODEL

For a sequence of observations $\{\mathbf{y}_t\}_{t=1}^T$, the hidden Markov model (HMM) (Rabiner & Juang, 1986; Elliott et al., 2008) assumes there is a hidden state variable $z_t$ associated with each $\mathbf{y}_t$. The sequence of hidden states $\{z_t\}_{t=1}^T$ forms a homogeneous Markov process. Usually, $\{z_t\}_{t=1}^T$ are categorical variables taking values in $\{1, \dots, K\}$, and the transition dynamics are governed by $\mathbb{P}(z_t = l \mid z_{t-1} = k) = P_{kl}$. There are $K$ groups of parameters $\{\theta_k\}_{k=1}^K$, and $z_t = k$ indicates that the observation $\mathbf{y}_t$ is generated by $\mathbb{P}(\mathbf{y}; \theta_k)$. The goal of learning is to identify the parameters and infer the posterior distribution of the hidden states $\{z_t\}_{t=1}^T$. Usually, the HMM is learned with the Baum-Welch algorithm (Baum & Petrie, 1966; Baum et al., 1970), which can be regarded as an EM algorithm specifically designed around the forward-backward algorithm. Once the model has been trained, we are able to simulate the future behavior of the system.
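Once a transition matrix has been learned, future hidden states can be simulated (or their marginals propagated) from the last inferred state. A minimal sketch, where the 3-state matrix `P` is illustrative rather than learned from data:

```python
import numpy as np

# Sketch: simulating future HMM states from a learned transition matrix,
# and propagating the s-step-ahead marginal as a row of P^s.
# P is an illustrative 3-state transition matrix, not a learned one.
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

def simulate_states(z0, steps):
    """Draw a future state path z_{t+1}, ..., z_{t+steps} given z_t = z0."""
    path, z = [], z0
    for _ in range(steps):
        z = rng.choice(3, p=P[z])
        path.append(int(z))
    return path

# Marginal distribution 3 steps ahead of state 0: row 0 of P^3.
marginal_3_ahead = np.linalg.matrix_power(P, 3)[0]
assert np.isclose(marginal_3_ahead.sum(), 1.0)
```

Sampling gives one plausible trajectory; the matrix power gives the full distribution over states at a future step, which is what HM-GPFR uses for long-horizon prediction.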

3.2. GAUSSIAN PROCESS FUNCTIONAL REGRESSIONS

A Gaussian process is a stochastic process any finite-dimensional distribution of which is a multivariate Gaussian distribution. Its properties are determined by the mean function $\mu(\cdot)$ and the covariance function $c(\cdot,\cdot)$. Suppose we have a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^L$, and the relationship between input and output is given by a function $Y$, i.e., $Y(x_i) = y_i$. Let $\mathbf{x} = [x_1, x_2, \dots, x_L]^T$ and $\mathbf{y} = [y_1, y_2, \dots, y_L]^T$; then we assume $\mathbf{y} \mid \mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{C})$, where $\boldsymbol{\mu} = [\mu(x_1), \dots, \mu(x_L)]^T$ and $C_{ij} = c(x_i, x_j)$. In machine learning, the mean and covariance functions are usually parameterized. Here, we use the squared exponential covariance function (Rasmussen & Williams, 2006; Shi & Choi, 2011; Wu & Ma, 2018)
$$c(x_i, x_j; \boldsymbol{\theta}) = \theta_1^2 \exp\left(-\frac{\theta_2^2 (x_i - x_j)^2}{2}\right) + \theta_3^2 \delta_{ij},$$
where $\delta_{ij}$ is the Kronecker delta and $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]$. The mean function is modeled as a linear combination of B-spline basis functions (Shi et al., 2007; Shi & Choi, 2011). Suppose we have $D$ B-spline basis functions $\{\phi_d(x)\}_{d=1}^D$. Let $\mu(x) = \sum_{d=1}^D b_d \phi_d(x)$, let $\boldsymbol{\Phi}$ be an $L \times D$ matrix with $\Phi_{id} = \phi_d(x_i)$, and let $\mathbf{b} = [b_1, b_2, \dots, b_D]^T$; then $\mathbf{y} \mid \mathbf{x} \sim \mathcal{N}(\boldsymbol{\Phi}\mathbf{b}, \mathbf{C})$. From the function perspective, this model can be denoted as $Y(x) \sim \mathrm{GPFR}(x; \mathbf{b}, \boldsymbol{\theta})$. We can use such Gaussian processes to model the multi-scale time-series considered in this paper; the key point is to transform the multi-scale time-series into a batch dataset. For each coarse level index $t$, we construct a dataset $\mathcal{D}_t = \{(x_{t,i}, y_{t,i})\}_{i=1}^L$, where $x_{t,i}$ is the sampling time of the $i$-th sample in the time-series $\mathbf{y}_t$. Let $Y_t$ be the function underlying $\mathcal{D}_t$, i.e., $Y_t(x_{t,i}) = y_{t,i}$. Then $\{\mathcal{D}_t\}_{t=1}^T$ can be regarded as independent realizations of a GPFR, which assumes $Y_t(x) \overset{\text{i.i.d.}}{\sim} \mathrm{GPFR}(x; \mathbf{b}, \boldsymbol{\theta})$.

Without loss of generality, we may assume $x_{t,i} = i$, so that $\Phi_{id} = \phi_d(i)$ and $C_{ij} = c(i, j; \boldsymbol{\theta})$ do not depend on the coarse level index $t$. It is therefore equivalent to assume $\{\mathbf{y}_t\}_{t=1}^T$ are independently and identically distributed as $\mathcal{N}(\boldsymbol{\Phi}\mathbf{b}, \mathbf{C})$. To learn the parameters $\mathbf{b}$ and $\boldsymbol{\theta}$, we apply Type-II maximum likelihood estimation (Rasmussen & Williams, 2006; Shi & Choi, 2011). As for prediction, given a new record $\{(x_{t^*,i}, y_{t^*,i})\}_{i=1}^M$, we want to predict the output $y_{t^*,i^*}$ at $x_{t^*,i^*}$, where $M < i^* \leq L$. From the definition of a Gaussian process, $y_{t^*,i^*}$ also obeys a Gaussian distribution (Rasmussen & Williams, 2006). Let $\mathbf{x}_* = [x_{t^*,1}, \dots, x_{t^*,M}]^T$, $\mathbf{y}_* = [y_{t^*,1}, \dots, y_{t^*,M}]^T$, $\boldsymbol{\mu}_* = [\mu(x_{t^*,1}), \dots, \mu(x_{t^*,M})]^T$, and $[\mathbf{C}_*]_{ij} = c(x_{t^*,i}, x_{t^*,j})$. Then the mean of $y_{t^*,i^*}$ is $\mu(x_{t^*,i^*}) + c(x_{t^*,i^*}, \mathbf{x}_*)\mathbf{C}_*^{-1}(\mathbf{y}_* - \boldsymbol{\mu}_*)$, and its variance is $c(x_{t^*,i^*}, x_{t^*,i^*}) - c(x_{t^*,i^*}, \mathbf{x}_*)\mathbf{C}_*^{-1}c(\mathbf{x}_*, x_{t^*,i^*})$. Note that if $M = 0$, the prediction is simply $\mathcal{N}(\mu(x_{t^*,i^*}), c(x_{t^*,i^*}, x_{t^*,i^*}))$, which equals the prior distribution of $y_{t^*,i^*}$ and fails to utilize any temporal dependency on recent observations. In the electricity load prediction example, this means we can only effectively predict a new day's electricity loads when we already have the first few observations of that day. In practice, however, it is very common to predict a new day's electricity loads from scratch.
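The predictive equations above can be sketched as follows. This is a minimal illustration with a polynomial basis standing in for the B-spline basis and hand-picked hyper-parameters $\theta = (\theta_1, \theta_2, \theta_3)$; none of these values come from the paper:

```python
import numpy as np

# Sketch of the GPFR predictive mean and variance, using a polynomial
# basis as a stand-in for B-splines and illustrative hyper-parameters.
th1, th2, th3 = 1.0, 0.1, 0.05

def cov(a, b):
    """Squared exponential kernel; noise term added on the diagonal
    when computing the covariance of a set with itself."""
    d2 = (a[:, None] - b[None, :]) ** 2
    K = th1**2 * np.exp(-th2**2 * d2 / 2.0)
    if a.shape == b.shape and np.allclose(a, b):
        K += th3**2 * np.eye(len(a))
    return K

def mean(x, b_coef):
    Phi = np.vander(x, len(b_coef), increasing=True)  # basis stand-in
    return Phi @ b_coef

b_coef = np.array([0.5, 0.2, -0.01])
x_obs = np.arange(1.0, 6.0)               # first M = 5 samples of a day
y_obs = mean(x_obs, b_coef) + 0.1 * np.sin(x_obs)
x_new = np.array([6.0])                   # the point x_{t*, i*}

C = cov(x_obs, x_obs)
c_vec = cov(x_new, x_obs)                 # 1 x M cross-covariance
sol = np.linalg.solve(C, y_obs - mean(x_obs, b_coef))
pred_mean = mean(x_new, b_coef) + c_vec @ sol
pred_var = cov(x_new, x_new) - c_vec @ np.linalg.solve(C, c_vec.T)
```

With $M = 0$ the correction term vanishes and the prediction collapses to the prior $\mathcal{N}(\mu(x_{t^*,i^*}), c(x_{t^*,i^*}, x_{t^*,i^*}))$, which is exactly the cold-start failure described above.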

3.3. THE MIXTURE OF GAUSSIAN PROCESS FUNCTIONAL REGRESSIONS

GPFR implicitly assumes that all $\{\mathbf{y}_t\}_{t=1}^T$ are generated by the same stochastic process, which is rarely the case in practice. In real applications, they may come from different signal sources, so a single GPFR is not flexible enough to model all the time-series, especially when there is a variety of evolving trends. In the electricity load dataset, for example, the records for winter and summer are very likely to have significantly different trends and shapes. To solve this problem, Shi et al. (Shi & Wang, 2008; Shi & Choi, 2011) introduced a mixture structure into GPFR and proposed the mixture of Gaussian process functional regressions (mix-GPFR). In mix-GPFR, there are $K$ GPFR components with different parameters $\{\mathbf{b}_k, \boldsymbol{\theta}_k\}_{k=1}^K$, and the mixing proportion of the $k$-th component is $\pi_k$. Intuitively, mix-GPFR uses $K$ different signal sources or evolving patterns to describe temporal data with different temporal properties. For each $\mathbf{y}_t$ there is an associated latent indicator variable $z_t \in \{1, 2, \dots, K\}$, and $z_t = k$ indicates that $\mathbf{y}_t$ is generated by the $k$-th GPFR component. The generative process of mix-GPFR is
$$z_t \overset{\text{i.i.d.}}{\sim} \mathrm{Categorical}(\pi_1, \pi_2, \dots, \pi_K), \qquad Y_t(x) \mid z_t = k \sim \mathrm{GPFR}(x; \mathbf{b}_k, \boldsymbol{\theta}_k). \quad (5)$$
Let $\mathbf{C}_k \in \mathbb{R}^{L \times L}$ be the covariance matrix computed from $\boldsymbol{\theta}_k$, i.e., $[\mathbf{C}_k]_{ij} = c(i, j; \boldsymbol{\theta}_k)$; then the above is equivalent to $\mathbf{y}_t \mid z_t = k \sim \mathcal{N}(\boldsymbol{\Phi}\mathbf{b}_k, \mathbf{C}_k)$. Due to the latent variables, parameter learning in mix-GPFR relies on the EM algorithm (Dempster et al., 1977; Shi & Wang, 2008). For prediction, the $K$ GPFR components first make predictions individually, and these predictions are then weighted by the posterior probabilities $\mathbb{P}(z_{t^*} = k \mid \mathbf{y}_{t^*}; \mathbf{b}_k, \boldsymbol{\theta}_k)$. Note that if $M = 0$, then $\mathbb{P}(z_{t^*} = k \mid \mathbf{y}_{t^*}; \mathbf{b}_k, \boldsymbol{\theta}_k) = \pi_k$, which equals the mixing proportion and again fails to utilize recent observations. Therefore, mix-GPFR also suffers from the cold start problem.
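The prediction-weighting step can be sketched as follows, with isotropic Gaussian likelihoods and illustrative numbers standing in for the GPFR components; nothing here is learned from real data:

```python
import numpy as np

# Sketch of mix-GPFR prediction weighting: component forecasts are
# combined by posterior responsibilities computed from the first M
# observations of a new day. All numbers are illustrative stand-ins.
K, M = 3, 4
pi = np.array([0.5, 0.3, 0.2])                    # mixing proportions
means = [np.full(M, m) for m in (1.0, 2.0, 3.0)]  # Phi[:M] @ b_k stand-ins
var = 0.1                                         # isotropic noise stand-in
y_partial = np.array([1.9, 2.1, 2.0, 1.8])        # y_{t*,1..M}

def gauss_loglik(y, mu, var):
    """Log-density of an isotropic Gaussian N(mu, var * I)."""
    return -0.5 * np.sum((y - mu) ** 2 / var + np.log(2 * np.pi * var))

logpost = np.log(pi) + np.array([gauss_loglik(y_partial, mu, var)
                                 for mu in means])
logpost -= logpost.max()                          # numerical stability
post = np.exp(logpost); post /= post.sum()        # P(z = k | y_{1:M})

comp_preds = np.array([1.0, 2.0, 3.0])            # each component's forecast
final_pred = float(post @ comp_preds)
```

With `M = 0` there is no likelihood term, so `post` degenerates to `pi`, which is exactly the cold-start behavior noted above.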

4.1. HIDDEN MARKOV BASED GAUSSIAN PROCESS FUNCTIONAL REGRESSION MIXTURE MODEL

Similar to mix-GPFR, the hidden Markov based Gaussian process functional regression mixture model assumes the time-series are generated by $K$ signal sources. The key difference is that the signal source may switch between consecutive observations at the time resolution of the coarse level. The temporal structure at the coarse level is characterized by the transition dynamics of $\{z_t\}_{t=1}^T$, and the temporal dependency at the fine level is captured by Gaussian processes. Precisely,
$$z_1 \sim \mathrm{Categorical}(\pi_1, \pi_2, \dots, \pi_K), \qquad \mathbb{P}(z_t = l \mid z_{t-1} = k) = P_{kl}, \quad t = 2, 3, \dots, T,$$
$$Y_t(x) \mid z_t = k \sim \mathrm{GPFR}(x; \mathbf{b}_k, \boldsymbol{\theta}_k), \quad t = 1, 2, \dots, T.$$
Here, $\boldsymbol{\pi} = [\pi_1, \pi_2, \dots, \pi_K]$ is the initial state distribution and $\mathbf{P} = [P_{kl}]_{K \times K}$ is the transition probability matrix. We refer to this model as HM-GPFR. In GPFR and mix-GPFR, the observations $\{\mathbf{y}_t\}_{t=1}^T$ are modeled as independent and exchangeable realizations of stochastic processes, so the temporal structure at the coarse level is destroyed. In HM-GPFR, by contrast, consecutive $\mathbf{y}_{t-1}, \mathbf{y}_t$ are connected through the transition dynamics of their latent variables $z_{t-1}, z_t$, which is more suitable for time-series data. For example, if today's electricity loads are very high, it is unlikely that tomorrow's electricity loads will be extremely low. The learning algorithm for HM-GPFR is based on the EM algorithm and is derived in the appendix. After the parameters have been learned, we set $\hat{z}_t = \arg\max_{k=1,\dots,K} \gamma_t(k)$ and regard $\{\hat{z}_t\}_{t=1}^T$ as deterministic. For prediction, we consider two cases: $t^* = T+1$ and $t^* > T+1$. When $t^* = T+1$, the latent variable $z_{T+1}$ is determined by both the conditional transition probability $z_{T+1} \mid \hat{z}_T$ and the partial observations $\mathbf{y}_{T+1}$. More precisely, suppose $\hat{z}_T = l$; then
$$\omega_k = \mathbb{P}(z_{T+1} = k \mid \mathcal{T}, \mathbf{y}_{T+1}, \hat{z}_T = l; \Theta) \propto \hat{P}_{lk}\, \mathcal{N}\!\left(\mathbf{y}_{T+1}; \boldsymbol{\Phi}[1{:}M,:]\,\hat{\mathbf{b}}_k, \mathbf{C}_k[1{:}M, 1{:}M]\right), \quad (7)$$
where the square brackets denote the slicing operation.
If $M = 0$, then $\omega_k = \hat{P}_{lk}$ is determined by the last hidden state and the transition dynamics, which is more informative than the mixing proportions used by mix-GPFR. Suppose the prediction of the $k$-th component is $y^{(k)}_*$; then the final prediction is $\sum_{k=1}^K \omega_k y^{(k)}_*$. We next consider the case $t^* > T+1$, where the main difference is the posterior distribution of $z_{t^*}$. In this case, we need to apply the transition probability matrix recursively. First, we calculate the distribution of $z_{T+1}$ according to Equation (7). Then, by the Markov property,
$$\omega_k = \mathbb{P}(z_{t^*} = k \mid \mathcal{T}, \mathbf{y}_{T+1}, \hat{z}_T = l; \Theta) \propto \sum_{m=1}^K \mathbb{P}(z_{T+1} = m \mid \mathcal{T}, \mathbf{y}_{T+1}, \hat{z}_T = l; \Theta)\, \big[\hat{\mathbf{P}}^{\,t^*-T-1}\big]_{mk}. \quad (8)$$
The final prediction is again $\sum_{k=1}^K \omega_k y^{(k)}_* = \sum_{k=1}^K \omega_k \boldsymbol{\Phi}[i^*,:]\,\mathbf{b}_k$.

Figure 1: Graphical models of (a) HM-GPFR and (b) BHM-GPFR.
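The recursion in Equation (8) can be sketched as follows; the transition matrix and the posterior over $z_{T+1}$ are illustrative stand-ins, not learned values:

```python
import numpy as np

# Sketch of Equation (8): for a horizon t* > T + 1, the component
# weights come from propagating the distribution of z_{T+1} through
# the transition matrix (t* - T - 1) times.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
w1 = np.array([0.6, 0.3, 0.1])     # posterior over z_{T+1} (stand-in)

def horizon_weights(w1, P, steps_beyond):
    """omega_k for t* = T + 1 + steps_beyond."""
    return w1 @ np.linalg.matrix_power(P, steps_beyond)

w = horizon_weights(w1, P, 5)
assert np.isclose(w.sum(), 1.0)
```

As the horizon grows, these weights converge toward the stationary distribution of the chain, so very long-range predictions tend toward a fixed weighted average of the component mean functions, consistent with the behavior reported in the experiments.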

4.2. BAYESIAN HIDDEN MARKOV BASED GAUSSIAN PROCESS FUNCTIONAL REGRESSION MIXTURE MODEL

One drawback of HM-GPFR is that it has many parameters and thus risks overfitting. In this section, we develop a fully Bayesian treatment of HM-GPFR. We place a Gaussian prior $\mathcal{N}(\mathbf{m}_b, \boldsymbol{\Sigma}_b)$ on the B-spline coefficients $\{\mathbf{b}_k\}_{k=1}^K$. For the transition probabilities, let $\mathbf{p}_k = [P_{k1}, P_{k2}, \dots, P_{kK}]^T$ be the vector of probabilities of transitioning from state $k$ to the other states; we assume $\mathbf{p}_k$ obeys a Dirichlet prior $\mathrm{Dir}(a_0, \dots, a_0)$. The generative process of Bayesian HM-GPFR is
$$\mathbf{b}_k \sim \mathcal{N}(\mathbf{m}_b, \boldsymbol{\Sigma}_b), \qquad \mathbf{p}_k \sim \mathrm{Dir}(a_0, \dots, a_0), \qquad k = 1, 2, \dots, K,$$
$$z_1 \sim \mathrm{Categorical}(\pi_1, \pi_2, \dots, \pi_K), \qquad \mathbb{P}(z_t = l \mid z_{t-1} = k) = P_{kl}, \quad t = 2, 3, \dots, T,$$
$$Y_t(x) \mid z_t = k \sim \mathrm{GPFR}(x; \mathbf{b}_k, \boldsymbol{\theta}_k), \quad t = 1, 2, \dots, T. \quad (9)$$
The detailed learning algorithm is presented in the appendix. After learning, we set the latent variables $\Omega$ to their maximum a posteriori (MAP) estimates. Specifically, $\hat{\mathbf{b}}_k = \mathbf{m}_k$, $\hat{P}_{kl} = a_{kl} / \sum_{m=1}^K a_{km}$, and $\hat{z}_t = \arg\max_{k=1,2,\dots,K} \gamma_t(k)$. The rest of the prediction procedure is the same as for HM-GPFR.
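The point estimate of the transition matrix can be sketched as follows; the Dirichlet parameters $a_{kl}$ are illustrative stand-ins for values produced by the learning algorithm:

```python
import numpy as np

# Sketch of the post-learning point estimate P_hat_{kl} = a_{kl} / sum_m a_{km}:
# each transition row is the normalized Dirichlet parameter vector.
# The matrix a is an illustrative stand-in, not a learned quantity.
a = np.array([[5.0, 1.0, 2.0],
              [0.5, 7.5, 2.0],
              [1.0, 1.0, 8.0]])
P_hat = a / a.sum(axis=1, keepdims=True)
assert np.allclose(P_hat.sum(axis=1), 1.0)
```

Normalizing the Dirichlet parameters row by row yields a valid stochastic matrix, which is then used for prediction exactly as in HM-GPFR.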

5.1. EXPERIMENT SETTINGS

In this section, we use an electricity load dataset issued by the State Grid of China for a city in northwest China. The dataset records electricity loads every 15 minutes, so there are 96 records per day. Using the electricity load records of 2010 for training, we predict the subsequent $S$-step electricity loads in a time-series prediction fashion, where $S = 1, 2, 3, 4, 5, 10, 20, 30, 50, 80, 100, 200, 500, 1000$. This setting allows both short-term and long-term predictions to be evaluated. For a more comprehensive and accurate assessment, we roll the time-series for 100 rounds: on top of the electricity loads of 2010, the $r$-th round also puts the first $(r-1)$ records of 2011 into the training set. In each round, we predict the subsequent $S$-step electricity loads. In the $r$-th round, suppose the ground truths are $y_1, y_2, \dots, y_S$ and the predictions are $\hat{y}_1, \hat{y}_2, \dots, \hat{y}_S$; we use the Mean Absolute Percentage Error (MAPE) to evaluate the prediction results,
$$\mathrm{MAPE}_r = \frac{1}{S}\sum_{s=1}^S \frac{|y_s - \hat{y}_s|}{|y_s|}.$$
For the overall evaluation, we report the average over the 100 rounds, $\mathrm{MAPE} = \frac{1}{100}\sum_{r=1}^{100} \mathrm{MAPE}_r$. Since the algorithms are influenced by randomness, we repeat each algorithm for 10 runs and report the averaged results. We compare HM-GPFR and BHM-GPFR with other time-series forecasting methods:
• Classical time-series forecasting methods: auto-regressive (AR), moving average (MA), auto-regressive moving average (ARMA), auto-regressive integrated moving average (ARIMA), and seasonal auto-regressive moving average (SARMA).
• Machine learning methods: long short-term memory (LSTM), feedforward neural network (FNN), support vector regression (SVR), and the enhanced Gaussian process mixture model (EGPM).
• GPFR-related methods: the mixture of Gaussian process functional regressions (mix-GPFR), the mixture of Gaussian processes with nonparametric mean functions (mix-GPNM), and the Dirichlet process based mixture of Gaussian process functional regressions (DPM-GPFR).
Detailed parameter settings of the comparison methods are given in the appendix. The main parameters of HM-GPFR and BHM-GPFR are the number of components $K$ and the number of B-spline basis functions $D$; we set $K = 5$ and $D = 30$.
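The rolling MAPE evaluation can be sketched as follows, with random stand-ins for the ground truths and forecasts; only the evaluation arithmetic mirrors the setup described above:

```python
import numpy as np

# Sketch of the rolling evaluation: MAPE_r for each of 100 rounds,
# then the overall average. Ground truths and predictions are random
# stand-ins for the S-step forecasts, not real electricity loads.
rng = np.random.default_rng(1)

def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

S, R = 10, 100
mapes = []
for r in range(R):
    y_true = rng.uniform(50, 100, size=S)              # stand-in loads
    y_pred = y_true * (1 + rng.normal(0, 0.05, size=S))  # ~5% rel. error
    mapes.append(mape(y_true, y_pred))
overall = float(np.mean(mapes))
```

In the actual protocol, round $r$ would retrain or update the model on 2010 plus the first $r-1$ records of 2011 before forecasting; the sketch only reproduces the scoring.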

5.2. PERFORMANCE EVALUATION AND MODEL EXPLANATION

The prediction results of the various methods on the electricity load dataset are shown in Table 1. From the table, we can see that the prediction accuracy of classical time-series forecasting methods decreases significantly as the prediction step increases. Among them, SARMA outperforms AR, MA, ARMA, and ARIMA, because SARMA takes the periodicity of the data into consideration and can fit the data more effectively. The machine learning methods LSTM, FNN, SVR, and EGPM exhibit a similar phenomenon: when $S$ is small, the prediction accuracy is high, and when $S$ is large, the prediction accuracy is low. This observation indicates that these methods are not suitable for long-term prediction. In addition, machine learning methods are sensitive to parameter settings. For example, the results of FNN and SVR are better when $L = 4$, which is close to SARMA, while the long-term prediction accuracy of EGPM decreases significantly when $L$ is relatively large. It is challenging to set hyper-parameters appropriately in practice. When making long-term predictions, classical time-series methods and machine learning methods need to recursively predict subsequent values based on previously estimated values, which causes the accumulation and amplification of errors. In contrast, GPFR-related methods first make predictions according to the mean function and then finely correct these predictions based on observed data. The mean function part can better describe the evolution law of the data, which enables us to utilize the historical and structural information in the data more effectively. Mix-GPFR, mix-GPNM, and DPM-GPFR obtain results similar to SARMA in long-term prediction, and can even achieve the best results in short-term prediction. This observation demonstrates the effectiveness of GPFR-related methods. However, these methods cannot handle long-term prediction tasks well due to the "cold start" problem.
Overall, the performance of the proposed HM-GPFR and BHM-GPFR is more comprehensive. For short- and medium-term prediction, the results of HM-GPFR and BHM-GPFR are slightly worse than those of mix-GPFR, mix-GPNM, and DPM-GPFR, but they still enjoy significant advantages over the other comparison methods. In terms of long-term forecasting, HM-GPFR and BHM-GPFR outperform mix-GPFR, mix-GPNM, and DPM-GPFR, which shows that considering the multi-scale temporal structure between daily electricity load time-series can effectively improve the accuracy of long-term forecasting. In addition, BHM-GPFR is generally better than HM-GPFR, which shows that placing prior distributions on the parameters and learning in a fully Bayesian way can further increase the robustness of the model and improve the prediction accuracy. HM-GPFR and BHM-GPFR also have strong interpretability. Specifically, the estimated hidden variables $\{\hat{z}_t\}_{t=1}^T$ obtained after training divide the daily electricity load records into $K$ categories according to their evolution laws. Each evolution pattern can be represented by the mean function of the corresponding GPFR component, and these evolution patterns transition to one another with certain probabilities, as characterized by the transition probability matrix of the model. In Figure 3, we visualize the evolution patterns and transition laws learned by HM-GPFR and BHM-GPFR. We call the evolution law corresponding to the mean function represented by the orange curve (at the top of the figure) mode 1, and label the five evolution modes mode 1 to mode 5 in clockwise order. Combined with the practical application background, some meaningful laws can be found from the learned models. Examples are as follows:
• The electricity load of mode 1 is the lowest. Besides, mode 1 is relatively stable: when the system is in this evolution pattern, it stays in this state at the next step with probability about 0.5.
In the case of a state transition, the probability of transferring to the mode with the second lowest load (mode 2 in Figure 3a and mode 3 in Figure 3b) is high, while the probability of transferring to the mode with the highest load (mode 5 in Figure 3a and modes 2 and 5 in Figure 3b) is relatively low;
• The evolution laws of mode 2 and mode 5 in Figure 3b are very similar, but their probabilities of transferring to other modes differ. From the perspective of electricity load alone, both can be regarded as the mode with the highest load. When the system is in the mode with the highest load (mode 5 in Figure 3a and modes 2 and 5 in Figure 3b), the probability of remaining in this state at the next step is the same as that of transferring to the mode with the lowest load (mode 1);
• When the system is in the mode with the second-highest load (mode 3 in Figure 3a and mode 4 in Figure 3b), the probability of remaining in this state at the next step is low, while the probabilities of transferring to the modes with the lowest and highest loads are high.
These laws help us understand the algorithm, have a certain guiding significance for production practice, and can be further analyzed in combination with expert knowledge. The case $S = 1$ in Table 1 is the most common in practical applications, i.e., one-step-ahead rolling forecasting. As discussed in Section 4.1, when making rolling predictions, HM-GPFR and BHM-GPFR can adjust the model incrementally after collecting new data without retraining. The results of one-step-ahead rolling prediction of HM-GPFR and BHM-GPFR on the electricity load dataset are shown in Figure 4. It can be seen that the predicted values of HM-GPFR and BHM-GPFR are very close to the ground truths, indicating that they are effective for rolling prediction. In the figure, the color of each point is the weighted average of the colors corresponding to the modes in Figure 3, weighted by $\omega_k$.
Note that there are color changes in some electricity load curves in Figures 4a and 4b. Taking the time-series in Figure 4a around the range 1100-1200 as an example: when there are few observations on that day, HM-GPFR believes that the day's electricity load evolution pattern most likely belongs to mode 3. As observations accumulate, the model comes to believe that the pattern belongs to mode 5, and then tends toward mode 3 again. This shows that HM-GPFR and BHM-GPFR can adjust the value of $z_{t^*}$ in a timely manner according to the latest information during rolling prediction.

5.3. ABLATION STUDY

In this section, we mainly compare HM-GPFR and BHM-GPFR with mix-GPFR, mix-GPNM, and DPM-GPFR to explore the impact of introducing the coarse-grained temporal structure on prediction performance. The MAPEs reported in Table 1 are averaged over $r = 1, \dots, 100$, while in this section we pay special attention to the case $r = 1$. In this case, the observed data are the electricity load records of 2010, and there are no partial observations on January 1, 2011 (i.e., $M = 0$ in Equation (2)). Therefore, mix-GPFR, mix-GPNM, and DPM-GPFR encounter the cold-start problem. Table 2 reports the MAPE of these methods at different prediction steps when $r = 1$. It can be seen from the table that the prediction accuracy of HM-GPFR and BHM-GPFR is higher than that of mix-GPFR, mix-GPNM, and DPM-GPFR at almost every step, which shows that coarse-grained temporal information is helpful for improving prediction performance, and that using a Markov chain to model the transition law of electricity load evolution patterns makes effective use of this information. Figure 5 further shows the multi-step prediction results of these methods on the electricity load dataset. This is also the "cold start" case ($r = 1$), and we predict the electricity loads of the next 10 days (960 time points in total). It can be seen from the figure that these methods can effectively utilize the periodic structure of the time-series, and the prediction results show periodicity, but the predictions of HM-GPFR and BHM-GPFR differ slightly from those of the other methods. Due to the "cold start" problem, the predictions of mix-GPFR, mix-GPNM, and DPM-GPFR are the same for every day, i.e., $\hat{\mathbf{y}}_{T+1} = \hat{\mathbf{y}}_{T+2} = \cdots = \hat{\mathbf{y}}_{T+10}$, while HM-GPFR and BHM-GPFR use coarse-grained temporal information when making predictions and adjust the predicted values for each day accordingly.
Compared with the predicted values of the other methods, it can be seen from the figure that the predicted values of HM-GPFR and BHM-GPFR on the first day are higher, and as the step size increases, the predicted values tend toward the weighted average of the mean functions of the GPFR components.

6. CONCLUSION

In this paper, we have proposed the concept of multi-scale time-series, which have temporal structures at two granularities. We established the HM-GPFR model for multi-scale time-series forecasting and designed an effective learning algorithm. In addition, we placed priors on the model parameters to obtain the more robust BHM-GPFR model. Compared with conventional GPFR-related methods (mix-GPFR, mix-GPNM, DPM-GPFR), the proposed methods can effectively use the temporal information at both the fine and coarse levels, alleviate the "cold start" problem, and perform well in both short-term and long-term prediction. HM-GPFR and BHM-GPFR not only achieve high prediction accuracy but also have good interpretability: combined with the practical problem background and domain knowledge, we can interpret the state transition laws learned by the model.

A LEARNING ALGORITHMS OF THE PROPOSED METHODS

A.1 HM-GPFR

Due to the latent variables $\{z_t\}_{t=1}^T$, we apply the EM algorithm to learn the HM-GPFR model. We write $\mathcal{T} = \{\mathcal{D}_t\}_{t=1}^T$ for the observations, $\Theta = \{P_{kl}\}_{k,l=1}^K \cup \{\pi_k, \mathbf{b}_k, \boldsymbol{\theta}_k\}_{k=1}^K$ for all parameters, and $\Omega = \{z_t\}_{t=1}^T$ for all latent variables. First, the complete-data log-likelihood is
$$\mathcal{L}(\Theta; \mathcal{T}, \Omega) = \sum_{k=1}^K \mathbb{I}(z_1 = k)\log\pi_k + \sum_{t=1}^{T-1}\sum_{k=1}^K\sum_{l=1}^K \mathbb{I}(z_{t+1} = l, z_t = k)\log P_{kl} + \sum_{t=1}^T\sum_{k=1}^K \mathbb{I}(z_t = k)\log \mathbb{P}(\mathbf{y}_t; \mathbf{b}_k, \boldsymbol{\theta}_k). \quad (10)$$
In the E-step of the EM algorithm, we need to take the expectation of Equation (10) with respect to the posterior distribution of the latent variables $\Omega$ to obtain the Q-function. However, it is not necessary to calculate $\mathbb{P}(\Omega \mid \mathcal{T}; \Theta)$ explicitly, which is a categorical distribution with $K^T$ possible values; it suffices to obtain $\mathbb{P}(z_{t+1} = l, z_t = k \mid \mathcal{T}; \Theta)$ and $\mathbb{P}(z_t = k \mid \mathcal{T}; \Theta)$. We first introduce some notation:
$$\alpha_t(k) = \mathbb{P}(\mathbf{y}_1, \dots, \mathbf{y}_t, z_t = k; \Theta), \qquad \beta_t(k) = \mathbb{P}(\mathbf{y}_{t+1}, \dots, \mathbf{y}_T \mid z_t = k; \Theta),$$
$$\gamma_t(k) = \mathbb{P}(z_t = k \mid \mathcal{T}; \Theta), \qquad \xi_t(k, l) = \mathbb{P}(z_t = k, z_{t+1} = l \mid \mathcal{T}; \Theta). \quad (11)$$
The key point is to calculate $\gamma_t(k)$ and $\xi_t(k,l)$. Note that
$$\gamma_t(k) \propto \mathbb{P}(z_t = k, \mathcal{T}; \Theta) = \alpha_t(k)\beta_t(k), \qquad \xi_t(k,l) \propto \alpha_t(k)\, P_{kl}\, \mathcal{N}(\mathbf{y}_{t+1}; \boldsymbol{\Phi}\mathbf{b}_l, \mathbf{C}_l)\, \beta_{t+1}(l). \quad (12)$$
Therefore, the problem boils down to calculating $\alpha_t(k)$ and $\beta_t(k)$, which can be derived recursively by the forward-backward algorithm. By the definition of $\alpha_t(k)$, we have
$$\alpha_1(k) = \pi_k\, \mathcal{N}(\mathbf{y}_1; \boldsymbol{\Phi}\mathbf{b}_k, \mathbf{C}_k), \qquad \alpha_t(k) = \sum_{l=1}^K \alpha_{t-1}(l)\, P_{lk}\, \mathcal{N}(\mathbf{y}_t; \boldsymbol{\Phi}\mathbf{b}_k, \mathbf{C}_k). \quad (13)$$
Similarly, by the definition of $\beta_t(k)$, we have
$$\beta_T(k) = 1, \qquad \beta_t(k) = \sum_{l=1}^K P_{kl}\, \mathcal{N}(\mathbf{y}_{t+1}; \boldsymbol{\Phi}\mathbf{b}_l, \mathbf{C}_l)\, \beta_{t+1}(l). \quad (14)$$
To summarize, in the E-step we first use Equations (13) and (14) to calculate $\alpha_t(k)$ and $\beta_t(k)$ recursively based on the current parameters, and then calculate $\gamma_t(k)$ and $\xi_t(k,l)$ according to Equation (12).
The Q-function is given by
$$Q(\Theta) = \sum_{k=1}^K \gamma_1(k)\log\pi_k + \sum_{t=1}^{T-1}\sum_{k=1}^K\sum_{l=1}^K \xi_t(k,l)\log P_{kl} + \sum_{t=1}^T\sum_{k=1}^K \gamma_t(k)\log \mathcal{N}(\mathbf{y}_t; \boldsymbol{\Phi}\mathbf{b}_k, \mathbf{C}_k). \quad (15)$$
In the M-step, we maximize $Q$ with respect to the parameters. The parameters $\{\pi_k\}_{k=1}^K$ and $\{P_{kl}\}_{k,l=1}^K$ can be optimized in closed form,
$$\pi_k = \frac{\gamma_1(k)}{\sum_{l=1}^K \gamma_1(l)}, \qquad P_{kl} = \frac{\sum_{t=1}^{T-1}\xi_t(k,l)}{\sum_{t=1}^{T-1}\sum_{m=1}^K \xi_t(k,m)}. \quad (16)$$
The parameters $\{\mathbf{b}_k, \boldsymbol{\theta}_k\}_{k=1}^K$ cannot be solved in closed form, and we apply gradient ascent to optimize $Q(\Theta)$ with gradients
$$\frac{\partial Q(\Theta)}{\partial \boldsymbol{\theta}_k} = \frac{1}{2}\sum_{t=1}^T \gamma_t(k)\,\mathrm{tr}\!\left[\left(\mathbf{C}_k^{-1}(\mathbf{y}_t - \boldsymbol{\Phi}\mathbf{b}_k)(\mathbf{y}_t - \boldsymbol{\Phi}\mathbf{b}_k)^T \mathbf{C}_k^{-1} - \mathbf{C}_k^{-1}\right)\frac{\partial \mathbf{C}_k}{\partial \boldsymbol{\theta}_k}\right], \qquad \frac{\partial Q(\Theta)}{\partial \mathbf{b}_k} = \sum_{t=1}^T \gamma_t(k)\,\boldsymbol{\Phi}^T \mathbf{C}_k^{-1}(\mathbf{y}_t - \boldsymbol{\Phi}\mathbf{b}_k). \quad (17)$$
The complete procedure is summarized in Algorithm 1. When the partial observations $y_{T+1,1}, \dots, y_{T+1,M}$ become complete as we collect more data, we can adjust the parameters incrementally without retraining the model: we simply continue the EM iterations from the current parameters until convergence is reached again.

Algorithm 1: The EM algorithm for learning HM-GPFR.

Initialize parameters Θ;

while not converged do
    // E-step
    α_1(k) = π_k N(y_1; Φb_k, C_k) for k = 1, ..., K;
    for t = 2, ..., T do
        for k = 1, ..., K do
            α_t(k) = Σ_{l=1}^K α_{t−1}(l) P_{lk} N(y_t; Φb_k, C_k);
        end
    end
    β_T(k) = 1 for k = 1, ..., K;
    for t = T−1, ..., 1 do
        for k = 1, ..., K do
            β_t(k) = Σ_{l=1}^K P_{kl} N(y_{t+1}; Φb_l, C_l) β_{t+1}(l);
        end
    end

A.2 BHM-GPFR

We still use the EM algorithm to learn the parameters of the BHM-GPFR model. However, this case is more complicated since there are more latent variables. The complete-data log-likelihood is
$$\mathcal{L}(\Theta;\mathcal{T},\Omega) = \sum_{k=1}^K \log\mathcal{N}(b_k; m_b, \Sigma_b) + \sum_{k=1}^K\sum_{l=1}^K (a_0-1)\log P_{kl} + \sum_{k=1}^K \mathbb{I}(z_1=k)\log\pi_k + \sum_{t=1}^{T-1}\sum_{k=1}^K\sum_{l=1}^K \mathbb{I}(z_{t+1}=l, z_t=k)\log P_{kl} + \sum_{t=1}^T\sum_{k=1}^K \mathbb{I}(z_t=k)\log\mathcal{N}(y_t;\Phi b_k, C_k). \tag{18}$$
Compared with Equation (10), the first two terms are extra due to the prior distributions. In the E-step of the EM algorithm, we need to take the expectation of Equation (18) with respect to the posterior distribution of the latent variables. However, the posterior of $\Omega$ is intractable since $\{b_k\}_{k=1}^K$, $\{p_k\}_{k=1}^K$ and $\{z_t\}_{t=1}^T$ are correlated. We therefore use variational inference and try to find an optimal approximation of $P(\Omega\,|\,\mathcal{T};\Theta)$ with a simple form. We adopt the mean-field family, which factorizes the joint distribution of $\Omega$ into a product of independent factors,
$$Q(\Omega) = \prod_{k=1}^K Q(b_k)\prod_{k=1}^K Q(p_k)\,Q(z). \tag{19}$$
Similar to the HM-GPFR case, $Q(z)$ is a categorical distribution with $K^T$ possible values, but we do not need to calculate $Q(z)$ explicitly; we only need $\gamma_t(k) = Q(z_t=k)$ and $\xi_t(k,l) = Q(z_{t+1}=l, z_t=k)$. According to variational inference theory, we update $Q(b_k)$, $Q(p_k)$ and $Q(z)$ alternately until convergence. For $Q(b_k)$,
$$Q(b_k) \propto \exp\Big(\mathbb{E}_{\prod_k Q(p_k)\, Q(z)}[\mathcal{L}(\Theta;\mathcal{T},\Omega)]\Big) = \exp\Big(\mathbb{E}_{Q(z)}\Big[\log\mathcal{N}(b_k; m_b, \Sigma_b) + \sum_{t=1}^T \mathbb{I}(z_t=k)\log\mathcal{N}(y_t;\Phi b_k, C_k)\Big]\Big)$$
$$\propto \exp\Big(-\frac{1}{2}(b_k-m_b)^\top\Sigma_b^{-1}(b_k-m_b) - \frac{1}{2}\sum_{t=1}^T\gamma_t(k)\,(y_t-\Phi b_k)^\top C_k^{-1}(y_t-\Phi b_k)\Big). \tag{20}$$
By completing the square, we obtain that the approximate posterior of $b_k$ is $\mathcal{N}(m_k, \Sigma_k)$ with
$$\Sigma_k = \Big(\Sigma_b^{-1} + \sum_{t=1}^T\gamma_t(k)\,\Phi^\top C_k^{-1}\Phi\Big)^{-1}, \qquad m_k = \Sigma_k\Big(\Sigma_b^{-1}m_b + \sum_{t=1}^T\gamma_t(k)\,\Phi^\top C_k^{-1}y_t\Big). \tag{21}$$
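The $Q(b_k)$ update is a standard Gaussian posterior computation: the posterior precision is the prior precision $\Sigma_b^{-1}$ plus the responsibility-weighted data precision. A minimal NumPy sketch, assuming $C_k^{-1}$ is passed in precomputed and fixed within the update:

```python
import numpy as np

def posterior_b(Phi, C_inv, Y, gamma_k, m_b, Sigma_b):
    """Mean-field update Q(b_k) = N(m_k, Sigma_k) (Equation 21).

    Phi: (L, D) basis matrix; C_inv: (L, L) inverse of C_k;
    Y: (T, L) daily observations; gamma_k: (T,) responsibilities gamma_t(k).
    """
    Sb_inv = np.linalg.inv(Sigma_b)
    # precision = prior precision + sum_t gamma_t(k) Phi^T C_k^{-1} Phi
    prec = Sb_inv + gamma_k.sum() * (Phi.T @ C_inv @ Phi)
    Sigma_k = np.linalg.inv(prec)
    # mean pulls the prior mean toward the responsibility-weighted data
    m_k = Sigma_k @ (Sb_inv @ m_b + Phi.T @ C_inv @ (gamma_k @ Y))
    return m_k, Sigma_k
```

With all responsibilities zero the update returns the prior $(m_b, \Sigma_b)$, and adding data can only shrink the posterior covariance in the Loewner order, which is a quick sanity check.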
For $Q(p_k)$,
$$Q(p_k) \propto \exp\Big(\mathbb{E}_{\prod_k Q(b_k)\, Q(z)}[\mathcal{L}(\Theta;\mathcal{T},\Omega)]\Big) = \exp\Big(\sum_{l=1}^K(a_0-1)\log P_{kl} + \sum_{t=1}^{T-1}\sum_{l=1}^K \xi_t(k,l)\log P_{kl}\Big) = \prod_{l=1}^K P_{kl}^{\,a_0 + \sum_{t=1}^{T-1}\xi_t(k,l) - 1}. \tag{22}$$
Therefore, the approximate posterior of $p_k$ is $\mathrm{Dir}(a_{k1},\ldots,a_{kK})$ with $a_{kl} = a_0 + \sum_{t=1}^{T-1}\xi_t(k,l)$. For $Q(z)$,
$$\log Q(z) = \sum_{k=1}^K \mathbb{I}(z_1=k)\log\pi_k + \sum_{t=1}^{T-1}\sum_{k=1}^K\sum_{l=1}^K \mathbb{I}(z_{t+1}=l,z_t=k)\,\mathbb{E}_{Q(p_k)}[\log P_{kl}] + \sum_{t=1}^T\sum_{k=1}^K \mathbb{I}(z_t=k)\,\mathbb{E}_{Q(b_k)}[\log\mathcal{N}(y_t;\Phi b_k, C_k)] + \mathrm{const}. \tag{23}$$
Note that this equation has exactly the same form as Equation (10); thus, we can use the forward-backward algorithm to obtain $\gamma_t(k)$ and $\xi_t(k,l)$. To see this, let
$$\tilde{P}_{kl} = \exp\big(\mathbb{E}_{Q(p_k)}[\log P_{kl}]\big) = \exp\Big(\psi(a_{kl}) - \psi\Big(\sum_{l'=1}^K a_{kl'}\Big)\Big),$$
$$\tilde{\mathcal{P}}(y_t; m_k, \Sigma_k, \theta_k) = \exp\big(\mathbb{E}_{Q(b_k)}[\log\mathcal{N}(y_t;\Phi b_k, C_k)]\big) = \mathcal{N}(y_t;\Phi m_k, C_k)\exp\Big(-\frac{1}{2}\mathrm{tr}\big(\Sigma_k\Phi^\top C_k^{-1}\Phi\big)\Big); \tag{24}$$
then Equation (23) can be rewritten as
$$\log Q(z) = \sum_{k=1}^K \mathbb{I}(z_1=k)\log\pi_k + \sum_{t=1}^{T-1}\sum_{k=1}^K\sum_{l=1}^K \mathbb{I}(z_{t+1}=l,z_t=k)\log\tilde{P}_{kl} + \sum_{t=1}^T\sum_{k=1}^K \mathbb{I}(z_t=k)\log\tilde{\mathcal{P}}(y_t;m_k,\Sigma_k,\theta_k) + \mathrm{const}. \tag{25}$$
To obtain $\gamma_t(k)$ and $\xi_t(k,l)$, we run the Baum-Welch (forward-backward) recursions with the surrogate quantities $\pi_k$, $\tilde{P}_{kl}$, and $\tilde{\mathcal{P}}(y_t;m_k,\Sigma_k,\theta_k)$. Taking the expectation of Equation (18) with respect to the approximate posterior $Q(\Omega)$, the Q-function is
$$Q(\Theta) = \sum_{k=1}^K \mathbb{E}_{Q(b_k)}[\log\mathcal{N}(b_k;m_b,\Sigma_b)] + \sum_{k=1}^K\gamma_1(k)\log\pi_k + \sum_{t=1}^T\sum_{k=1}^K\gamma_t(k)\,\mathbb{E}_{Q(b_k)}[\log\mathcal{N}(y_t;\Phi b_k, C_k)]$$
$$= \sum_{k=1}^K\Big(\log\mathcal{N}(m_k;m_b,\Sigma_b) - \frac{1}{2}\mathrm{tr}\big(\Sigma_k\Sigma_b^{-1}\big)\Big) + \sum_{k=1}^K\gamma_1(k)\log\pi_k + \sum_{t=1}^T\sum_{k=1}^K\gamma_t(k)\Big(\log\mathcal{N}(y_t;\Phi m_k, C_k) - \frac{1}{2}\mathrm{tr}\big(\Sigma_k\Phi^\top C_k^{-1}\Phi\big)\Big). \tag{26}$$
Maximizing $Q(\Theta)$ with respect to $\pi_k$, $m_b$ and $\Sigma_b$, we obtain
$$\pi_k = \frac{\gamma_1(k)}{\sum_{l=1}^K\gamma_1(l)}, \qquad m_b = \frac{1}{K}\sum_{k=1}^K m_k, \qquad \Sigma_b = \frac{1}{K}\sum_{k=1}^K\Big[\Sigma_k + (m_k-m_b)(m_k-m_b)^\top\Big]. \tag{27}$$
The parameters $\{\theta_k\}_{k=1}^K$ cannot be solved in closed form, and we apply the gradient ascent algorithm to optimize $Q(\Theta)$. The gradient of $Q(\Theta)$ with respect to $\theta_k$ is
$$\frac{\partial Q(\Theta)}{\partial\theta_k} = \frac{1}{2}\sum_{t=1}^T\gamma_t(k)\,\mathrm{tr}\Big(C_k^{-1}S_{t,k}C_k^{-1}\frac{\partial C_k}{\partial\theta_k}\Big), \qquad S_{t,k} = (y_t-\Phi m_k)(y_t-\Phi m_k)^\top + \Phi\Sigma_k\Phi^\top - C_k. \tag{28}$$
The complete algorithm is summarized in Algorithm 2.
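The trace-form gradient in Equation (28) can be checked numerically against finite differences of the corresponding terms of $Q$. The sketch below uses the illustrative isotropic choice $C_k = e^{\theta} I$, which is only for demonstration; in the model, $C_k$ is built from the GP kernel parameters $\theta_k$.

```python
import numpy as np

def grad_Q_theta(theta, Phi, m_k, Sigma_k, Y, gamma_k):
    """Gradient of Q w.r.t. theta_k (Equation 28), for the illustrative
    covariance C_k = exp(theta) * I so that dC_k/dtheta = C_k."""
    L = Y.shape[1]
    C = np.exp(theta) * np.eye(L)
    C_inv = np.exp(-theta) * np.eye(L)
    dC = C                                   # dC/dtheta for this choice
    g = 0.0
    for y, g_t in zip(Y, gamma_k):
        r = y - Phi @ m_k
        S = np.outer(r, r) + Phi @ Sigma_k @ Phi.T - C   # S_{t,k}
        g += 0.5 * g_t * np.trace(C_inv @ S @ C_inv @ dC)
    return g
```

The gradient combines the usual Gaussian log-likelihood term with the extra $-\tfrac{1}{2}\mathrm{tr}(\Sigma_k\Phi^\top C_k^{-1}\Phi)$ correction from the variational posterior of $b_k$.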

B.1 DETAILED EXPERIMENT SETTINGS

For AR, MA, ARMA, ARIMA, and SARMA, we set the model order L in {4, 8, 12}. For SARMA, the seasonal length is set to 96 since there are 96 records per day, which implicitly assumes that the overall time-series is periodic in days. LSTM, NN, SVR, and EGPM transform the time-series prediction problem into a regression problem: the latest L observations are used as input features to predict the value at the next time point, and the regression method is trained on such input-output pairs. In the experiments, we set L in {4, 12, 24, 48}. The feed-forward neural network (NN) has two hidden layers with 10 and 5 neurons, respectively. The kernel function in SVR is the Gaussian kernel, whose scale parameter is selected adaptively by cross-validation. The number of components for EGPM is set in {3, 5, 10}. In addition, we use the recursive strategy for multi-step prediction. For mix-GPFR, mix-GPNM, and DPM-GPFR, we first convert the time-series data into curve datasets and then use these methods to make predictions. The number of components K in mix-GPFR and mix-GPNM is set to 5, and the number of B-spline basis functions D in mix-GPFR and DPM-GPFR is set to 30.
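The recursive multi-step strategy used for the one-step regressors can be sketched as follows; `model.predict` is a hypothetical one-step interface introduced only for illustration:

```python
import numpy as np

def recursive_forecast(model, history, L, n_steps):
    """Recursive multi-step strategy: predict one step ahead from the
    latest L observations, append the prediction to the window, repeat.
    `model` is any object with a predict(window) -> float method."""
    window = list(history[-L:])
    preds = []
    for _ in range(n_steps):
        y_hat = model.predict(np.asarray(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]   # slide window, feed prediction back
    return preds
```

Because each predicted value is fed back as an input, one-step errors compound over the horizon, which is why the multi-scale models are expected to help at longer step lengths.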

B.2 CLUSTERING STRUCTURE

The estimated values of the latent variables ẑ_i also indicate the evolution mode of the data of the i-th day. Figure 6 visualizes some training data with different colors indicating different evolution modes, so the multi-scale structure of the electricity load time-series can be seen intuitively. From the learned transition probabilities, we can obtain the stationary distribution of the Markov chain (z_1, z_2, ..., z_N), which is [0.4825, 0.2026, 0.0513, 0.1124, 0.1513]^⊤ for HM-GPFR and [0.4501, 0.0427, 0.2992, 0.1381, 0.0700]^⊤ for BHM-GPFR. The proportion of each mode in Figure 6 is roughly consistent with the stationary distribution.
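The stationary distribution reported above is the left eigenvector of the learned transition matrix with eigenvalue 1. A minimal sketch of this computation:

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a transition matrix P: the left
    eigenvector of P for eigenvalue 1, normalized to sum to one."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()
```

For an irreducible chain the Perron eigenvector has entries of one sign, so the normalization by the sum also fixes the sign returned by the eigensolver.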

B.3 MULTI-STEP PREDICTION UNDER COLD-START SETTING

To see more clearly the role of the Markov-chain structure of the hidden variables in the cold-start setting, Figures 7 and 8 show the predicted electricity loads of HM-GPFR and BHM-GPFR for the next five days, ŷ_{N+1}, ..., ŷ_{N+5}, together with the distributions of the latent variables z_{N+1}, ..., z_{N+5} conditioned on ẑ_N = k. As the figures show, HM-GPFR and BHM-GPFR produce different predictions for each day's electricity load, adjusted according to the transition probabilities between evolution modes. For example, in Figure 7, when ẑ_N = 1 the electricity load on that day is low, and the prediction of HM-GPFR for the (N+1)-th day is also low; when ẑ_N = 5 the electricity load on that day is higher, and the prediction for the (N+1)-th day is also higher. Figure 8 exhibits a similar phenomenon. In addition, as i* increases, P(z_{i*}) quickly converges to the stationary distribution of the Markov chain, and the predicted value ŷ_{i*} tends to the weighted average of the mean functions of the GPFR components. In conclusion, these phenomena demonstrate that HM-GPFR and BHM-GPFR can effectively use the coarse-level temporal structure to adjust each day's prediction.
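The convergence of P(z_{i*}) described above is just repeated application of the transition matrix to a point mass. A minimal sketch:

```python
import numpy as np

def latent_forecast(P, z_hat, n_days):
    """Distributions P(z_{N+1}), ..., P(z_{N+n}) conditioned on z_N = z_hat,
    obtained by repeatedly applying the transition matrix P. As the horizon
    grows these converge to the stationary distribution of the chain."""
    K = P.shape[0]
    dist = np.eye(K)[z_hat]          # point mass on the current mode
    out = []
    for _ in range(n_days):
        dist = dist @ P              # one-day-ahead latent distribution
        out.append(dist.copy())
    return out
```

The multi-step prediction for day N+i is then the corresponding mixture of the component mean functions, weighted by these distributions.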

B.4 SENSITIVITY OF HYPER-PARAMETERS

There are two main hyper-parameters in HM-GPFR and BHM-GPFR: the number of B-spline basis functions D and the number of GPFR components K. Here we mainly focus on the selection of K. We vary K in {3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 50}, train HM-GPFR and BHM-GPFR respectively, and report the results in Table 3.

(Algorithm 2, E-step and M-step fragment)
    α_1(k) = π_k P̃(y_1; m_k, Σ_k, θ_k);
    for t = 2, ..., T do
        for k = 1, ..., K do
            α_t(k) = Σ_{l=1}^K α_{t−1}(l) P̃_{lk} P̃(y_t; m_k, Σ_k, θ_k);
        end
    end
    for k = 1, ..., K do
        Σ_k = (Σ_b^{−1} + Σ_{t=1}^T γ_t(k) Φ^⊤ C_k^{−1} Φ)^{−1};
        m_k = Σ_k (Σ_b^{−1} m_b + Σ_{t=1}^T γ_t(k) Φ^⊤ C_k^{−1} y_t);
        for l = 1, ..., K do
            a_kl = a_0 + Σ_{t=1}^{T−1} ξ_t(k, l);
        end
    end
    // M-step
    Σ_b = (1/K) Σ_{k=1}^K [Σ_k + (m_k − m_b)(m_k − m_b)^⊤];  m_b = (1/K) Σ_{k=1}^K m_k;
    for k = 1, ..., K do
        π_k = γ_1(k) / Σ_{l=1}^K γ_1(l);
        Use gradient ascent to optimize Q(Θ) with respect to θ_k according to Equation (28);
    end

Table 3: Sensitivity with respect to K (MAPE; columns correspond to increasing prediction step lengths).

HM-GPFR
K = 3:  0.84%  1.01%  1.19%  1.35%  1.52%  2.35%  3.85%  4.82%  6.36%  8.48%   9.63%  10.71%  9.43%  6.78%  6.75%
K = 4:  0.9%   1.09%  1.27%  1.44%  1.61%  2.46%  4.02%  5.09%  6.6%   8.57%   9.68%  10.71%  9.43%  6.78%  6.75%
K = 5:  0.93%  1.12%  1.3%   1.48%  1.66%  2.51%  4.07%  5.18%  6.79%  8.8%    9.83%  10.76%  9.49%  6.82%  6.77%
K = 6:  1.09%  1.32%  1.57%  1.81%  2.04%  3.13%  4.77%  5.83%  7.3%   9.04%  10.03%  10.9%   9.59%  6.88%  6.8%
K = 7:  1.06%  1.27%  1.49%  1.7%   1.91%  2.87%  4.5%   5.66%  7.28%  9.17%  10.13%  10.88%  9.58%  6.87%  6.79%
K = 8:  0.97%  1.16%  1.36%  1.54%  1.73%  2.57%  4.14%  5.32%  6.99%  8.95%   9.92%  10.81%  9.57%  6.88%  6.79%
K = 9:  1.1%   1.33%  1.56%  1.79%  2.01%  3.06%  4.77%  5.93%  7.49%  9.3%   10.29%  10.99%  9.6%   6.88%  6.8%
K = 10: 1.18%  1.41%  1.65%  1.88%  2.11%  3.22%  4.97%  6.05%  7.53%  9.39%  10.45%  11.17%  9.71%  6.95%  6.83%
K = 15: 1.25%  1.48%  1.72%  1.94%  2.17%  3.29%  5.03%  6.12%  7.63%  9.42%  10.5%   11.22%  9.71%  6.94%  6.82%
K = 20: 1.31%  1.54%  1.77%  2.0%   2.22%  3.33%  5.07%  6.14%  7.65%  9.57%  10.78%  11.52%  9.85%  7.01%  6.86%
K = 30: 1.37%  1.62%  1.87%  2.12%  2.38%  3.62%  5.45%  6.5%   7.98%  9.76%  10.92%  11.57%  9.86%  7.01%  6.86%
K = 50: 1.47%  1.72%  1.99%  2.25%  2.5%   3.7%   5.5%   6.62%  8.29%  10.35% 11.66%  12.01%  10.06% 7.12%  6.91%

BHM-GPFR
K = 3:  0.85%  1.02%  1.19%  1.36%  1.52%  2.35%  3.87%  4.85%  6.37%  8.46%   9.6%   10.7%   9.47%  6.84%  6.82%
K = 4:  0.78%  0.93%  1.07%  1.18%  1.29%  1.86%  2.82%  3.49%  4.8%   6.96%   8.23%   9.91%  9.04%  6.68%  6.85%
K = 5:  0.77%  0.92%  1.07%  1.18%  1.3%   1.89%  2.88%  3.59%  4.89%  6.88%   8.04%   9.85%  9.21%  6.94%  7.15%
K = 6:  0.8%   0.96%  1.1%   1.23%  1.36%  2.02%  3.17%  3.97%  5.32%  7.22%   8.32%   9.91%  9.3%   7.01%  7.18%
K = 7:  0.79%  0.95%  1.1%   1.22%  1.33%  1.94%  3.01%  3.79%  5.12%  6.89%   7.99%   9.76%  9.34%  7.18%  7.39%
K = 8:  0.78%  0.94%  1.08%  1.19%  1.31%  1.89%  2.94%  3.71%  5.03%  6.74%   7.79%   9.7%   9.49%  7.46%  7.7%
K = 9:  0.78%  0.93%  1.07%  1.18%  1.29%  1.86%  2.86%  3.61%  4.92%  6.69%   7.77%   9.73%  9.52%  7.53%  7.8%
K = 10: 0.82%  0.98%  1.13%  1.26%  1.4%   2.11%  3.29%  4.09%  5.37%  7.01%   8.04%   9.94%  9.8%   7.86%  8.12%
K = 15: 0.79%  0.94%  1.07%  1.18%  1.29%  1.84%  2.86%  3.64%  4.95%  6.66%   7.7%    9.89%  9.96%  8.25%  8.6%
K = 20: 0.79%  0.94%  1.07%  1.17%  1.28%  1.83%  2.83%  3.6%   4.88%  6.51%   7.5%    9.95%  10.32% 8.88%  9.31%
K = 30: 0.8%   0.95%  1.07%  1.18%  1.29%  1.83%  2.82%  3.58%  4.86%  6.52%   7.53%  10.04%  10.46% 9.07%  9.52%
K = 50: 0.83%  0.98%  1.11%  1.22%  1.33%  1.88%  2.9%   3.68%  4.96%  6.5%    7.46%  10.14%  10.71% 9.45%  9.9%

For HM-GPFR, prediction performance tends to deteriorate as K increases: MAPE increases significantly for short-term prediction, while the change is smaller for long-term prediction. As K grows, the number of parameters in the model also grows, and the model tends to over-fit. For BHM-GPFR, long-term prediction performance degrades significantly as K increases, while the short- and medium-term results change little. This shows that by introducing prior distributions on the parameters, BHM-GPFR can prevent over-fitting to a certain extent. In addition, we note that for K ≤ 10 the differences between the results for different K are not significant, so values in this range are a more reasonable choice.
From the perspective of application, we set K = 5 in the experiments, which takes both the expressive ability and the interpretability of the model into consideration. (The numerical arrays extracted here list, for each initial mode ẑ_N = k with k = 1, ..., 5, the latent-state distributions P(z_N), P(z_{N+1}), ..., P(z_{N+5}): each sequence starts from a point mass on mode k and converges toward the stationary distribution; see Figure 7.)



Figure 1: An illustration of multi-scale time-series.

Figure 2: Probabilistic graphical models of HM-GPFR and BHM-GPFR.

Figure 3: Evolving law of electricity loads and transition dynamics learned by HM-GPFR and BHM-GPFR.

Figure 4: One-step-ahead rolling prediction results of HM-GPFR and BHM-GPFR.

Figure 5: Multi-step prediction results of mix-GPFR, mix-GPNM, DPM-GPFR, HM-GPFR and BHM-GPFR.

(Algorithm 1, continued)
    γ_t(k) ∝ α_t(k) β_t(k);
    for l = 1, ..., K do
        ξ_t(k, l) ∝ α_t(k) P_{kl} N(y_{t+1}; Φb_l, C_l) β_{t+1}(l);
    end
    // M-step
    Update π_k and P_{kl} in closed form; use gradient ascent to optimize Q(Θ) with respect to θ_k and b_k according to Equation (17);
    Update C_k with the new parameters θ_k;
end

(Algorithm 2, continued)
    β_t(k) = Σ_{l=1}^K P̃_{kl} P̃(y_{t+1}; m_l, Σ_l, θ_l) β_{t+1}(l);
    γ_t(k) ∝ α_t(k) β_t(k);
    for l = 1, ..., K do
        ξ_t(k, l) ∝ α_t(k) P̃_{kl} P̃(y_{t+1}; m_l, Σ_l, θ_l) β_{t+1}(l);
    end
    // Update the variational posteriors Q(b_k) and Q(p_k)
    for k = 1, ..., K do

Figure 6: Training time-series are divided into different evolving laws based on the learning results of HM-GPFR and BHM-GPFR.

Figure 7: Estimated values ŷ_{N+1}, ..., ŷ_{N+5} and distributions of z_N, ..., z_{N+5} of HM-GPFR under ẑ_N = k, where k = 1, 2, 3, 4, 5.

Figure 8: Estimated values ŷ_{N+1}, ..., ŷ_{N+5} and distributions of z_N, ..., z_{N+5} of BHM-GPFR under ẑ_N = k, where k = 1, 2, 3, 4, 5.

Table 1: MAPE of various methods on the electricity loads dataset under different step lengths and parameter settings.

Table 2: MAPE of GP-related methods under the cold-start setting (r = 1).

Table 3: Sensitivity of HM-GPFR and BHM-GPFR with respect to the number of components K.

(Columns of Figures 7 and 8: ẑ_N; mean function of the N-th day; predicted values for i* = N+1, ..., N+5; distributions P(z_N), P(z_{N+1}), ..., P(z_{N+5}).)


Algorithm 2: The Variational EM algorithm for learning BHM-GPFR.

Initialize parameters Θ;

while not converged do
    // Variational E-step
    Initialize variational parameters {m_k, Σ_k, a_k1, a_k2, ..., a_kK}_{k=1}^K;
    while not converged do
        // Calculate surrogate parameters

