CAUSAL PROBABILISTIC SPATIO-TEMPORAL FUSION TRANSFORMERS IN TWO-SIDED RIDE-HAILING MARKETS

Abstract

Accurate spatio-temporal prediction in large-scale systems is valuable in many real-world applications, such as weather forecasting, retail forecasting, and urban traffic forecasting. Most existing methods for multi-horizon, multi-task, and multi-target prediction select important predictors via their correlations with the responses of interest, so many of the resulting forecasting models are not causal and are poorly interpretable. The aim of this paper is to develop a collaborative causal spatio-temporal fusion transformer, named CausalTrans, to establish the collaborative causal effects of predictors on multiple forecasting targets, such as supply and demand in ride-sharing platforms. Specifically, we integrate causal attention with the Conditional Average Treatment Effect (CATE) estimation method in causal inference. Moreover, we propose a novel and fast multi-head attention that replaces softmax with a Taylor expansion, reducing time complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of nodes in a graph. We further design a spatial graph fusion mechanism to significantly reduce the number of parameters. We conduct a wide range of experiments to demonstrate the interpretability of causal attention, the effectiveness of the model components, and the time efficiency of CausalTrans. In these experiments, CausalTrans achieves up to 15% error reduction compared with various baseline methods.

1. INTRODUCTION

This paper is motivated by a collaborative probabilistic forecasting problem for both supply and demand in two-sided ride-hailing platforms, such as Uber and DiDi. Collaborative supply and demand relationships are common in various two-sided markets, such as Amazon, Airbnb, and eBay. We consider two-sided ride-hailing platforms as an example. In this case, supply and demand denote the number of online drivers and the number of call orders, respectively, on the platform at a specific time in a city. Major factors for demand include rush hours, weekdays, weather conditions, the transportation network, points of interest, and holidays. For instance, if it rains during peak hours on a weekday, demand increases dramatically and stays high for a certain period. Major factors for supply include weather, holidays, traffic conditions, weekdays, and the platform's dispatching and repositioning policies. Moreover, supply tends to gradually cover areas with many unsatisfied orders, that is, the distribution of supply tends to match that of demand. We are interested in establishing collaborative causal forecasting models for demand and supply by using various predictors (or covariates).

Although many learning methods have been developed to address various collaborative prediction tasks, such as spatio-temporal traffic flow prediction (Zhu & Laptev, 2017; Du et al., 2018; Zhang et al., 2019b; Ermagun & Levinson, 2018; Luo et al., 2019), multivariate prediction (Bahadori et al., 2014; Liang et al., 2018), multi-task prediction (Tang et al., 2018; Chen et al., 2018; Chandra et al., 2017), multi-view prediction (Yao et al., 2018), and multi-horizon prediction (Lim et al., 2019; Yu et al., 2020), these existing methods primarily select important predictors via their correlations with responses, leading to many forecasting models with poor interpretability. In contrast, we propose CausalTrans, a Collaborative Spatio-temporal Fusion Transformer that generates causal probabilistic multi-horizon forecasts.

To the best of our knowledge, this is the first work that captures collaborative causal effects of external covariates on multiple forecasting targets. Building such models is not only essential to enhancing forecasting performance, but also helps the platform use its policies to match the distribution of supply with that of demand in two-sided markets. Our major contributions in the CausalTrans framework are summarized as follows:

• We design causal attention based on double machine learning (Chernozhukov et al., 2018) with two-layer fully connected neural networks, and successfully apply it to various large-scale time series forecasting problems. We conduct a wide range of experiments on real-world datasets with multiple covariates and demonstrate that CausalTrans with causal attention outperforms many baseline models in various ride-hailing scenarios.
• We propose a spatial fusion mechanism based on graph attention networks (GAT) (Veličković et al., 2017) to gather local regions and enhance robustness, as adjacent regions tend to share similar supply and demand patterns.
• We propose an approximate, time-efficient Taylor expansion attention to replace softmax in the multi-head attention of Transformers (Vaswani et al., 2017), such that time complexity reduces from $O(V^2)$ to $O(V)$. We carry out two groups of experiments, with three and five attention heads, to verify this efficiency improvement.

2. RELATED WORK

There is a large body of literature on vehicle flow forecasting (Zhu & Laptev, 2017; Bahadori et al., 2014; Tang et al., 2018; Lim et al., 2019; Yao et al., 2018). We selectively review several major methods as follows. Zhu & Laptev (2017) treat time series forecasting as a two-step procedure consisting of offline pre-training and online forecasting. The offline pre-training step is an encoder-decoder framework for compressing sequential features and extracting principal components, whereas the second step gives explainable prediction changes under external variables. Bahadori et al. (2014) proposed a unified low-rank tensor learning framework for multivariate spatio-temporal analysis by combining various attributes of spatio-temporal data, including spatial clustering and shared variable structure. For multi-step traffic flow prediction, Tang et al. (2018) proposed a spatio-temporal multi-task collaborative learning model to extract and learn shared information among multiple prediction tasks collaboratively. For example, this model combines spatial features collected from offline observation stations with inherent information shared between blended time granularities. Lim et al. (2019) proposed a temporal fusion transformer (TFT) to capture temporal correlations at each position, which is similar to the self-attention mechanism and is expected to capture both long-term and short-term dependencies. Yao et al. (2018) proposed a deep multi-view spatio-temporal network (DMVST-Net), including a temporal view (modeling the correlation between historical and future demand by LSTM (Gers & Schmidhuber, 2001)), a spatial view (modeling local spatial correlation by CNN), and a contextual view (modeling regional correlations in local temporal patterns). Overall, all of the above methods improve time series fitting by learning and predicting correlations across multiple spatio-temporal perspectives, targets, and tasks.

However, those methods lack a convincing account of "how and to what extent external variables affect supply and demand". Good demand forecasting involves not only historical demand targets, but also various current external variables (e.g., weather conditions, traffic conditions, holidays, and driver repositioning). Historical demand observations were themselves affected by historical external factors, so demand forecasting based only on correlations between variables is hardly convincing. Furthermore, supply forecasting is empirically affected by the distribution of demand in addition to current external variables. Establishing causal relationships between (supply, demand) and multiple external variables is therefore critically important for accurate supply and demand forecasting.

3. METHODOLOGY

We introduce the CausalTrans framework to efficiently establish the collaborative causal effects of multiple predictors on spatio-temporal supply and demand below.

3.1. COLLABORATIVE SUPPLY AND DEMAND FORECASTING

We consider all related observations, including supply, demand, and external variables, collected in a city. Each day is divided into 24 hourly segments and a city is divided into non-overlapping hexagonal regions (side length ranges from 600 to 1000 meters). The complete data consist of demand $x_v(t) \in \mathbb{R}$, supply $y_v(t) \in \mathbb{R}$, and dynamic covariates $z_v(t) \in \mathbb{R}^z$, where $t$ is a specific hour segment and $v \in V$ is a specific hexagon in the set of hexagonal regions, denoted as $V$. Dynamic covariates include weather, holidays, social events, points of interest (POI), and government policies. Weather features consist of temperature (°C), rainfall (mm), wind level, and PM2.5 (mg/m^3). Holiday features are represented by one-hot boolean vectors, including seasons, weekdays, and national and popular holidays, such as Christmas Day. POI features are represented by the counts of various locations, including traffic stations, business districts, communities, hospitals, and schools. More detailed cases of collaborative supply and demand are provided in Appendix A.

The problem of interest is to use all available observations in $\{(x_v(:t), y_v(:t), z_v(:t)), v \in V\}$ to predict $\{(x_v(t+1 : t+\tau_{\max}), y_v(t+1 : t+\tau_{\max})), v \in V\}$, where $\tau_{\max}$ is a pre-specified time length, $x_v(t_1 : t_2)$ and $y_v(t_1 : t_2)$ are the demand and supply vectors from time point $t_1$ to time point $t_2$, and $x_v(:t_2)$ and $y_v(:t_2)$ are the demand and supply vectors from the earliest time point to time point $t_2$. The demand $x_v$ may depend on historical supply $y_v$ from several weeks (or even longer) ago. However, within the most recent several weeks (the training period), based on our understanding of the ride-sharing business, demand $x_v$ is primarily influenced by its own recent historical patterns. Based on the above description, we formulate the learning problem of collaborative demand and supply forecasting as follows:

$$P\big(x_v(t+1 : t+\tau_{\max}) \mid x_v(:t),\; z_v(:t+\tau_{\max})\big), \tag{1}$$

$$P\big(y_v(t+1 : t+\tau_{\max}) \mid y_v(:t),\; x_v(:t+\tau_{\max}),\; z_v(:t+\tau_{\max})\big), \tag{2}$$

where $P(\cdot \mid \cdot)$ is a conditional distribution. In (1), it is assumed that $x_v(t+1 : t+\tau_{\max})$ is primarily affected by historical demands in $x_v(:t)$ and external covariates in $z_v(:t+\tau_{\max})$. Furthermore, in (2), it is assumed that future supplies in $y_v(t+1 : t+\tau_{\max})$ are primarily affected by historical supplies in $y_v(:t)$, demand patterns in $x_v(:t+\tau_{\max})$, and external covariates in $z_v(:t+\tau_{\max})$. Comparing (1) with (2), we assume that the distribution of supply during $[t+1, t+\tau_{\max}]$ is driven by the historical and current distributions of demand, in addition to the historical information in $y_v(:t)$ and external covariates in $z_v(:t+\tau_{\max})$.
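To make the conditioning sets in (1) and (2) concrete, the following minimal sketch slices one region's series into the inputs and target of each problem. The function names, the dictionary keys, the 0-based inclusive time index, and the toy numbers are all assumptions made for illustration; they are not taken from the paper.

```python
def demand_sample(x, z, t, tau_max):
    """Inputs and target of problem (1), using a 0-based, inclusive index t."""
    return {
        "history": x[: t + 1],                 # x(:t)
        "covariates": z[: t + tau_max + 1],    # z(:t + tau_max)
        "target": x[t + 1 : t + tau_max + 1],  # x(t+1 : t+tau_max)
    }


def supply_sample(x, y, z, t, tau_max):
    """Inputs and target of problem (2); demand enters up to t + tau_max."""
    return {
        "history": y[: t + 1],                 # y(:t)
        "demand": x[: t + tau_max + 1],        # x(:t + tau_max)
        "covariates": z[: t + tau_max + 1],    # z(:t + tau_max)
        "target": y[t + 1 : t + tau_max + 1],  # y(t+1 : t+tau_max)
    }


# Hypothetical hourly series for a single region (illustrative values only).
demand = [5, 7, 9, 8, 6, 4]
supply = [4, 6, 8, 8, 7, 5]
rainfall = [0, 0, 1, 1, 0, 0]

d = demand_sample(demand, rainfall, t=2, tau_max=3)
s = supply_sample(demand, supply, rainfall, t=2, tau_max=3)
```

The only structural difference between the two samples is the extra `demand` entry, which mirrors the assumption that supply is driven by demand while demand is driven by its own history and the covariates.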

3.2. PROBABILISTIC FORECASTING

Most time series forecasting methods produce deterministic point forecasts, which can vary widely and lack robustness because of variation in the covariates and in the training process. To enhance forecasting reliability, we adopt the quantile loss function with the Poisson distribution as our final optimization function. Empirically, following Salinas et al. (2020), Wen et al. (2017), Li et al. (2019), and Lim et al. (2019), we choose three quantile points $q \in Q = \{10\%, 50\%, 90\%\}$, in which the gap between the forecasts at the 90% and 10% quantiles can be regarded as the confidence interval. Taking demand forecasting at time point $t$ as an example, the final quantile loss function is given by

$$L_Q = \sum_{x_t \in \Omega} \sum_{q \in Q} \sum_{\tau=1}^{\tau_{\max}} \frac{QL_q(x_t, \hat{x}^q_{t-\tau})}{M \cdot \tau_{\max}}, \tag{3}$$

where $QL_q(x_t, \hat{x}^q_t) = \{q - I(x_t \le \hat{x}^q_t)\}(x_t - \hat{x}^q_t)$, $\Omega$ is the training dataset, $\tau_{\max}$ is the maximum prediction step, and $I(\cdot)$ is an indicator function. For a fair comparison, given the test dataset $\tilde{\Omega}$, we employ the q-risk (Salinas et al., 2020; Lim et al., 2019; Li et al., 2019), denoted as $R_q$, to evaluate the risk level of each quantile point as follows:

$$R_q = \frac{2 \sum_{x_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{\max}} QL_q(x_t, \hat{x}^q_{t-\tau})}{\sum_{x_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{\max}} |x_t|}. \tag{4}$$

There are at least two advantages of using the quantile loss function. First, the quantile loss function is more robust and stable than the mean square error or the hinge loss, especially when forecasting targets have large variation. Second, we can modify external covariates to change the confidence interval of causal attention and analyze real-world cases. Our CausalTrans is a novel combination of causal estimators and the encoder-decoder architecture.
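As a small worked example of the quantile loss $QL_q$ and the q-risk $R_q$ defined above, the sketch below evaluates both on a single series with one forecast per observation, so the sum over prediction steps is omitted. The function names and the toy values are assumptions made for illustration only.

```python
def quantile_loss(actual, predicted, q):
    """Pinball loss QL_q for one observation at quantile level q in (0, 1)."""
    indicator = 1.0 if actual <= predicted else 0.0
    return (q - indicator) * (actual - predicted)


def q_risk(actuals, predictions, q):
    """Normalized quantile loss: 2 * sum(QL_q) / sum(|actual|)."""
    total_loss = sum(quantile_loss(a, p, q) for a, p in zip(actuals, predictions))
    total_abs = sum(abs(a) for a in actuals)
    return 2.0 * total_loss / total_abs


# Hypothetical hourly demand and 90%-quantile forecasts (illustrative values).
demand = [10.0, 20.0, 30.0]
forecast_p90 = [12.0, 18.0, 30.0]
risk_p90 = q_risk(demand, forecast_p90, 0.9)
```

At $q = 0.9$ an under-forecast of 2 units costs 1.8, while an over-forecast of 2 units costs only 0.2, which is what pushes the fitted 90% quantile above most observations.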

3.3. CAUSAL TRANSFORMER FRAMEWORK

Figure 1 shows the overview of the CausalTrans framework. The three key novel contributions of CausalTrans are fast spatial graph fusion, causal attention, and temporal attention units. First, from the spatial perspective, CausalTrans gathers a set of graph attention kernels (GAT) by using assignment scores extracted from temporal patterns. Moreover, we apply the first-order Taylor expansion to the multi-head attention of the transformer to reduce time complexity from quadratic to linear. Second, from the temporal perspective, causal attention based on sufficient historical observations is trained offline to evaluate the causal weights on peak time slots, weather conditions, and holidays, which are denoted as $\theta_T$, $\theta_W$, and $\theta_H$, respectively, under diverse spatio-temporal conditions. Furthermore, we simplify three seasonal perspectives (week, month, and holidays) to represent multi-view position encoding (MVPE). Third, temporal attention is used to fill the gap between encoder and decoder, in which we add a sequence mask to ensure that the prediction at time point $t$ only uses observations earlier than $t$. Masked-out scores are set to $-\infty$ so that the weights of illegal connections become zero. In the following subsections, we introduce the main components of CausalTrans, fast spatial graph fusion and causal attention, and show how they work together as a causal spatio-temporal predictor. For notational simplicity, we describe those components for forecasting demand $x_v$ only, and avoid repeating the same components for supply $y_v$.
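The sequence mask described above can be illustrated with a minimal sketch: scores of disallowed positions are replaced by $-\infty$ before the softmax, so their attention weights come out as exactly zero. The function name and the toy score matrix are hypothetical, and the sketch assumes that position $i$ may attend to positions $j \le i$ (including itself), which is one common convention and not necessarily the exact rule used in CausalTrans.

```python
import math


def masked_attention_weights(scores):
    """Row-wise softmax in which position i only attends to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Replace scores of future positions by -inf (the "mask out" value).
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        row_max = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw attention scores for three time steps.
scores = [
    [0.5, 1.0, 2.0],
    [0.1, 0.1, 3.0],
    [1.0, 2.0, 3.0],
]
weights = masked_attention_weights(scores)
```

Even though the largest raw score in the second row belongs to a future position, that position receives zero weight after masking.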

3.4. FAST SPATIAL GRAPH FUSION ATTENTION

In this subsection, we describe the fast graph fusion attention unit based on region clustering and fast multi-head attention. See 2019a)), we use GAT to extract contextual features in huge graphs. However, directly applying GAT to large-scale forecasting problems is challenging, so we design spatial fusion subgraphs that share local supply and demand information. Moreover, we build our framework on transformers (Vaswani et al., 2017). Transformers have been the state-of-the-art structure in various natural language processing (NLP) tasks (Wolf et al., 2019; Wang et al., 2019) and in time series forecasting, owing to their strength in long-term feature extraction and parallel computing. However, the multi-head attention in transformers becomes a key bottleneck for time efficiency. We design an approximate Taylor expansion attention that replaces the softmax function to accelerate matrix products. More detailed results on fast attention can be found in Appendices C.2 and C.3. We briefly describe the fast spatio-temporal fusion graph attention procedure below.

First, let $X_t$ be the spatio-temporal demand feature matrix of all grids $V$ before time $t$. The temporal patterns of $V$ are represented as assignment scores given by

$$C = (c_{x_v,k}) = \big[\sigma_s(\sigma_r(X_t W_v) W_t)\big]_{\mathrm{Batch}}, \tag{5}$$

where $[\cdot]_{\mathrm{Batch}}$ is the mean operator over the batch, $k$ indexes a $K$-dimensional cluster vector, $v \in V$, $W_v$ and $W_t$ are, respectively, spatial and temporal weight matrices corresponding to $X_t$, and $\sigma_s(\cdot)$ and $\sigma_r(\cdot)$ are the sigmoid and ReLU activation functions, respectively. Second, we use the $k$-th spatial learner $G_k(x_v)$ to extract spatial features of the sequential data $x_v$ in grid $v$, and the summed outputs of the $K$ clusters are given as follows:

$$h_v = \sum_{k \in K} G_k(x_v)\, c_{x_v,k}. \tag{6}$$

The softmax function is used to get attention weights among regions as follows:

$$\alpha_v = \sum_{v' \in N_v} \alpha_{v,v'} \cdot x_{v'} = \frac{\sum_{v' \in N_v} \exp\big(\sigma_r(a^T [W \cdot x_v \,\|\, W \cdot x_{v'}])\big) \cdot x_{v'}}{\sum_{v' \in N_v} \exp\big(\sigma_r(a^T [W \cdot x_v \,\|\, W \cdot x_{v'}])\big)}, \tag{7}$$

where $\alpha_{v,v'}$ is the correlation weight between $v$ and $v'$, $a$ and $W$ are network parameters, the superscript $T$ denotes the transpose of a vector or matrix, $N_v = \{v' \mid v' \in V, v' \neq v\}$ is the neighboring region set of region $v$, and $[\cdot \| \cdot]$ is the concatenation operation. In (7), the time complexity of computing $\exp(\sigma_r(a^T [W \cdot x_v \| W \cdot x_{v'}]))$ is $O(V^2)$. Specifically, the exponent operation in $\exp(a^T \cdot W) \cdot X$ of the softmax function limits the efficiency of attention. Moreover, the cluster number satisfies $K \ll V$, and the time complexity of $a^T W X$ is $O(K^2 \cdot V) \approx O(V)$. Many recent studies find that linear attention is feasible for tasks whose primary focus is on short-term dependence. More details are discussed in Appendix D.

Our novel linear attention is easy to implement and interpret. It follows from the Taylor expansion that $\exp(a^T W) \approx 1 + a^T W$ when $a^T W$ is small. Analogous to the self-attention in the original Transformer (Vaswani et al., 2017), where the approximate mean and variance of $QK^T / \sqrt{d_k}$ are 0 and 1, respectively, $a^T W$ here is limited to small values. We introduce $L_2$ normalization to ensure a small $a^T W$ and $1 + a^T W \ge 0$, such that

$$\exp(a^T W) \approx T(a^T W) = 1 + \Big(\frac{a}{\|a\|_2}\Big)^T \Big(\frac{W}{\|W\|_2}\Big), \tag{8}$$

where $T$ is an approximate Taylor expansion. Equation (8) reduces to inner products, which are easy to parallelize and have linear time complexity. Finally, $\alpha_v$ can be transformed into

$$\alpha_v = \frac{\sum_{v' \in N_v} T\big(\sigma_r(a^T [W \cdot x_v \,\|\, W \cdot x_{v'}])\big) \cdot x_{v'}}{\sum_{v' \in N_v} T\big(\sigma_r(a^T [W \cdot x_v \,\|\, W \cdot x_{v'}])\big)}. \tag{9}$$
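The reason a first-order Taylor weight gives linear time can be seen in a small sketch. It is written in the generic query/key/value form of attention rather than the GAT form of (9), which is a simplifying assumption, and all function names and toy inputs are hypothetical. With weights of the form $1 + \hat{q}^T \hat{k}$ on $L_2$-normalized vectors, the sums over keys can be computed once and reused for every query, so no pairwise weight matrix is needed.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def taylor_attention_quadratic(queries, keys, values):
    """Reference version: one explicit weight per (query, key) pair."""
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    d_v = len(values[0])
    outputs = []
    for q in q_hat:
        weights = [1.0 + dot(q, k) for k in k_hat]  # each weight lies in [0, 2]
        total = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) / total
                        for d in range(d_v)])
    return outputs


def taylor_attention_linear(queries, keys, values):
    """Same output, but key and value summaries are computed only once."""
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    n, d_k, d_v = len(keys), len(keys[0]), len(values[0])
    value_sum = [sum(v[d] for v in values) for d in range(d_v)]
    key_sum = [sum(k[i] for k in k_hat) for i in range(d_k)]
    # kv[i][d] = sum over keys j of k_hat[j][i] * values[j][d]
    kv = [[sum(k[i] * v[d] for k, v in zip(k_hat, values)) for d in range(d_v)]
          for i in range(d_k)]
    outputs = []
    for q in q_hat:
        denom = n + dot(q, key_sum)
        outputs.append([(value_sum[d] + sum(q[i] * kv[i][d] for i in range(d_k))) / denom
                        for d in range(d_v)])
    return outputs


# Hypothetical toy inputs: three positions, two feature dimensions.
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
keys = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_quadratic = taylor_attention_quadratic(queries, keys, values)
out_linear = taylor_attention_linear(queries, keys, values)
```

Both functions return the same numbers; the second one touches each key only while building `key_sum` and `kv`, so its cost grows linearly in the number of positions for fixed feature dimensions.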

3.5. CAUSAL ATTENTION MECHANISM

Many external covariates causally change the distribution of demand and supply, as shown in Figures 2 and 3 of the supplementary document. Meanwhile, many existing works focus on finding the correlation between external covariates and forecasting targets. For example, Li et al. (2019) designed causal convolution to enhance the locality of attention, whereas Lim et al. (2019) added variable selection networks and a gate mechanism to train attention weights. These two studies (Lim et al., 2019; Li et al., 2019) calculate correlations among variables, but not causal effects under counterfactual conditions. Statistically, this issue can be regarded as a heterogeneous treatment effect (HTE) problem. See Figure 1 (c) for the architecture of causal attention. To the best of our knowledge, causal attention methods for HTE have not been proposed for large-scale spatio-temporal forecasting problems.

First, we briefly describe the conditional average treatment effect (CATE) (Abrevaya et al., 2015). We again take the demand vector $x_v(t_1 : t_2)$ (abbreviated as $x$ in the following) of grid $v$ from time point $t_1$ to time point $t_2$ as an example, and let $X$ denote a set of such $x$. The treatments we consider include weather (rainfall, temperature, and wind level), peak time slots, and holidays. Let $x(s)$ be the target variable under treatment $s \in S$, and let $z$ be a vector of other covariates. The HTE for comparing two treatment levels $s_0$ and $s_1$ is defined as

$$\tau(s_0, s_1; z) = E[X(s_1) - X(s_0) \mid z]. \tag{10}$$

If the treatment $s$ is continuous, then the treatment effect is defined to be $E[\nabla_s X(s) \mid z]$, where $\nabla_s = \partial / \partial s$. To estimate treatment effects without bias, we propose a causal attention module based on double machine learning (DML) (Chernozhukov et al., 2017) with two-layer non-parametric fully connected neural networks.

Specifically, we assume

$$X(S) = \theta(z) \cdot S + g_0(z) + \epsilon \quad \text{and} \quad S = g_1(z) + \eta, \tag{11}$$

where $\epsilon$ and $\eta$ are independent random variables such that $E[\epsilon \mid z] = E[\eta \mid z] = 0$, $g_0(\cdot)$ and $g_1(\cdot)$ are two non-parametric neural networks, and $\theta(z)$ is the constant marginal CATE. Letting $\tilde{X} = X - E(X \mid z)$ and $\tilde{S} = S - E(S \mid z)$, we get

$$\tilde{X} = X - E(X \mid z) = \theta(z) \cdot \{S - E(S \mid z)\} + \epsilon = \theta(z) \cdot \tilde{S} + \epsilon. \tag{12}$$

Therefore, we can compute $\theta(z)$ by solving

$$\hat{\theta}(z) = \arg\min_{\theta \in \Theta} E_n\big[(\tilde{X} - \theta(z) \cdot \tilde{S})^2\big], \tag{13}$$

where $E_n$ denotes the empirical expectation.

A large historical data source covers many kinds of experimental environments and treatments. According to Algorithm 1, given the time series $x_v(:t)$ at grid $v$ ($v$ is dropped for readability) and a treatment $s_1 \in S$, we loop over the historical timeline and search for two treatment levels $s_0$ and $s_1$ to construct the AB groups $\{x(t_0) \mid s_0\}$ and $\{x(t_1) \mid s_1\}$. Then, we construct the AA groups $\{x(t_0 - \tau : t_0)\}$ and $\{x(t_1 - \tau : t_1)\}$ by a look-back window of the same length $\tau$ before $t_0$ and $t_1$, and make sure that both are stationary processes with equal mean ($P_{\mathrm{KPSS}} > 0.05$ in the KPSS test (Shin & Schmidt, 1992) and $P_{\mathrm{T\text{-}Test}} > 0.05$ in the t-test on the first-order differences of both AA groups). Based on the selected AA/AB groups, we employ DML to estimate causal attention. In our method, the trained causal attention $\theta$ is inserted into the transformer, and regions in the same cluster share a global $\theta$.
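The residual-on-residual idea behind (12) and (13) can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's estimator: the nuisance functions $E(X \mid z)$ and $E(S \mid z)$ are replaced by per-category means of a single discrete covariate instead of two-layer neural networks, $\theta$ is treated as a single constant, and no cross-fitting is performed. All names and the synthetic numbers are hypothetical.

```python
from collections import defaultdict


def group_means(values, groups):
    """Mean of `values` within each category of `groups` (stand-in for E[. | z])."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def dml_theta(outcome, treatment, covariate):
    """Least-squares slope of residualized outcome on residualized treatment."""
    x_mean = group_means(outcome, covariate)
    s_mean = group_means(treatment, covariate)
    x_res = [x - x_mean[z] for x, z in zip(outcome, covariate)]
    s_res = [s - s_mean[z] for s, z in zip(treatment, covariate)]
    return sum(s * x for s, x in zip(s_res, x_res)) / sum(s * s for s in s_res)


def naive_slope(outcome, treatment):
    """Plain regression slope that ignores the covariate, for comparison."""
    n = len(outcome)
    x_bar, s_bar = sum(outcome) / n, sum(treatment) / n
    num = sum((s - s_bar) * (x - x_bar) for s, x in zip(treatment, outcome))
    return num / sum((s - s_bar) ** 2 for s in treatment)


# Synthetic data: demand = baseline(z) + 2 * rainfall, with no noise.
covariate = ["peak", "peak", "peak", "night", "night", "night"]
treatment = [0.0, 1.0, 3.0, 0.0, 2.0, 5.0]
baseline = {"peak": 100.0, "night": 20.0}
outcome = [baseline[z] + 2.0 * s for z, s in zip(covariate, treatment)]

theta_hat = dml_theta(outcome, treatment, covariate)
theta_naive = naive_slope(outcome, treatment)
```

On this toy data the residualized estimate recovers the true coefficient of 2, while the naive slope is negative because heavier rainfall happens to coincide with the low-baseline night slot.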
Algorithm 1 Causal Attention Algorithm with DML
Input: demand matrix $x(:t)$ at a grid $v$ before time $t$, and three kinds of treatments: weekday and hour slots $T(:t) = \{T_w(:t), T_h(:t)\}$, weather vectors $W(:t)$, and holiday one-hot vectors $H(:t)$
Output: causal effect coefficients $\theta_T$ for $T(:t)$, $\theta_W$ for $W(:t)$, and $\theta_H$ for $H(:t)$
1: Take $\theta_T$ as an example, and initialize the AA group and AB group on $T(:t)$ as $T_{AA} = T_{AB} = \{\}$
2: for all $\{T_w(t_0), T_w(t_1)\} \in \{Mon, Tue, \ldots, Sun\}$, $\{T_h(t_0), T_h(t_1)\} \in \{1, \ldots, 24\}$ do
3:   if $T_w(t_0) = T_w(t_1)$, $T_h(t_0) = T_h(t_1)$, $P_{\mathrm{T\text{-}Test}}(x(t_0), x(t_1)) < 0.05$ then
4:     for all $t_0' \in \{: t_0\}$ and $t_1' \in \{: t_1\}$ do
5:       Calculate 1st-order differences $\Delta x(t_0' : t_0)$ and $\Delta x(t_1' : t_1)$
6:       if $P_{\mathrm{KPSS}}(\Delta x(t_0' : t_0))$, $P_{\mathrm{KPSS}}(\Delta x(t_1' : t_1))$ and $P_{\mathrm{T\text{-}Test}}(\Delta x(t_0' : t_0), \Delta x(t_1' : t_1)) > 0.05$ then
7:         $T_{AA}$.append([($x(t_0' : t_0)$, $x(t_1' : t_1)$)])
8:         $T_{AB}$.append([($x(t_0)$, $x(t_1)$)])

4. EXPERIMENTS

4.1. DATASETS

Electricity. Electricity contains the hourly univariate electricity consumption of 370 customers. Following Salinas et al. (2020), weekly observations before $t$ are used as inputs to predict the series for the next 24 hours.

Traffic. Traffic contains the hourly univariate occupancy rate of 963 San Francisco Bay Area freeways, where the look-back rolling window and prediction step are the same as for Electricity.

Retail. Retail is the Favorita Grocery Sales Dataset from a Kaggle competition (Lim et al., 2019), including daily metadata with diverse products, stores, and external variables. To compare with state-of-the-art methods (Lim et al., 2019; Salinas et al., 2020), historical observations across 90 days are used to forecast product sales in the next 30 days.

Ride-hailing. The Ride-hailing dataset contains real supply, demand, and various metadata at the hourly and hexagonal-grid scale between June 2018 and June 2020 in two big cities (city A and city B), obtained from a ride-hailing company.

The first 70%, the next 10%, and the remaining 20% of each dataset are used for training, validation, and testing, respectively. We group the first two datasets into the univariate group and the last two datasets into the multivariate group.
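The 70%/10%/20% partition stated above can be written as a short helper. This sketch assumes that the split is taken in time order over a single series and uses integer percentages to avoid floating-point rounding at the boundaries; the function name and the toy index are hypothetical.

```python
def chronological_split(series, train_pct=70, valid_pct=10):
    """Split a time-ordered sequence into train, validation, and test parts."""
    n = len(series)
    train_end = n * train_pct // 100
    valid_end = train_end + n * valid_pct // 100
    return series[:train_end], series[train_end:valid_end], series[valid_end:]


# Hypothetical index of 100 consecutive hourly observations.
hours = list(range(100))
train, valid, test = chronological_split(hours)
```

Keeping the three parts contiguous in time means that every validation and test observation lies strictly after the training period.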

4.2. BENCHMARKS

In this section, two families of forecasting methods, iterative methods and multi-horizon methods, are compared in a wide range of experiments. For our method CausalTrans, a pre-defined search space is used to determine the optimal hyperparameters. Experimental details are included in Appendix B. Iterative 

4.3. RESULTS AND DISCUSSION

We adopt the quantile loss as the optimization function, and compare results by the q-risks $R_{50}$ and $R_{90}$ at the 50% and 90% quantile points. More detailed descriptions of probabilistic forecasting are provided in Subsection 3.2. Table 1 includes the $R_{50}$/$R_{90}$ losses of all forecasting methods for the Electricity and Traffic datasets. The Electricity dataset has no covariates and lacks spatial information, whereas the Traffic dataset has spatial information but no multiple covariates. We observe that ConvTrans and TFT are comparable with each other and both outperform all other methods. We believe that, compared with TFT, ConvTrans is able to take advantage of the spatial information in the Traffic dataset. This is not the case for the Electricity dataset. Table 2 and Table 3 include the $R_{50}$ and $R_{90}$ losses of all multi-horizon methods in the multivariate group. We consider both one-day and seven-day predictions and optimize the hyperparameters of all methods by using grid search. We have several important observations. First, for the one-day prediction, iterative DeepAR outperforms Seq2Seq and MQRNN due to the use of the Poisson distribution and weather conditions. Second, for the spatial baselines DMVST and ST-MGCN, the $R_{50}$ and $R_{90}$ losses increase with longer forecasting horizons, as such methods may overfit biased weights of external covariates. Third, CausalTrans outperforms all other competing methods, primarily due to the use of the causal estimator DML. For instance, compared with the second-best method, CausalTrans yields at most 9.3% lower $R_{50}$ and 15.2% lower $R_{90}$ on the Ride-hailing (7d, city A, Supply) dataset. Fourth, CausalTrans achieves lower losses when forecasting supply than when forecasting demand, since we explicitly model the causal relationship between supply and demand in (2). Fifth, as expected, unlike the one-day prediction, the seven-day prediction focuses on unbiased distribution estimation in order to alleviate error accumulation.
This point of view is further reinforced by the results of the ablation study reported in Appendix C.2, and causal attention is visualized in Appendix C.1.

5. CONCLUSION

Based on causal inference theory, we develop the CausalTrans framework to address collaborative supply and demand forecasting in large-scale two-sided markets. We design a fast multi-head attention that reduces the computational complexity to nearly linear, $O(V)$. CausalTrans achieves performance similar to TFT on the two datasets in the univariate group and outperforms all competing methods, including TFT, in the nine different experiments for the multivariate group. In particular, for our Ride-hailing datasets, CausalTrans achieves up to 15% error reduction compared with various baseline methods. In the future, we will continue to integrate causal inference with existing deep learning methods to deal with large-scale spatio-temporal forecasting problems.

A RIDE-HAILING DATASET DETAILS

Taking city A as an example, supply, demand, delta, and rainfall trends (January 1st, 2018 to January 1st, 2020) are plotted at daily scale in Figure 2. We conclude that the variance of demand is larger than that of supply, especially in rainy rush hours. Taking August 17th, 2018 in city A as another example in Figure 3, we observe that the delta in the dark red regions does not persist for long, as the spatio-temporal supply is changed by the corresponding demand and by the repositioning of drivers. The ride-hailing platform also releases strategies to promote orders. Collaborative demand and supply implies that the distribution of supply corresponds to the distribution of demand.

B TRAINING DETAILS

Empirically, we determine the optimal hyperparameters by random search over a pre-defined space. For reproducibility, we include the essential hyperparameters for our Ride-hailing dataset in Table 4.

C INTERPRETABILITY CASES

In this section, we analyze the impacts of the essential components in CausalTrans and focus on what causal attention learns. First, since causal demand and supply are hardly assembled with unbiased estimation (Figure 2), we demonstrate attention-based interpretability in instance-specific significant events such as frequent rainfall, holidays, and peak time slots. Second, we perform ablation analysis on the target probabilistic distribution PoissonOutput, causal attention with DML and Uplift, FastAttention, and SpatialFusion. Finally, we compare the speed improvements in multi-head attention on CPU (Intel Xeon E5-2630 2.20GHz) and GPU (Tesla P40), respectively.

C.1 CAUSAL ATTENTION VISUALIZATION

As one of the most essential components, causal attention employs difference stationary tests and double machine learning to estimate coefficients θ(s) of treatment effects. In this section, we visualize causal attention distribution through sample-specific cases, including rainfall, weekdays, and time slots. Frequent rainfall is the most significant weather event for demand as described in Section A. Unlike with plenty of rainfall events, there are only a dozen of holidays in one year. If sequential context before one holiday fails to pass Kpss stationary test, causal estimator would not to be applied in training attention weights. Large-scale dataset is the fundamental to our method. For the diverse peek time slots, Section 3.1 concludes that demand and supply distributes different at commuting peeks and night hours. In addition, seasonal fluctuation and government's policies (e.g. traffic restriction in National Day) are considerable factors. Rainfall. Take demand forecasting at an anonymous region in city A as an example, treatment is rainfall s, target is demand x, and other covariates z include regional id, time slots and holidays. For convenience, we select a group of adjacent AB groups from sufficient rainfall cases to give an interpretation. In Figure 4 , we backtrack rainfall treatments to fix AB Group 2, and search AB Group 1 by controlling similar covariates. Similarity means that both first-order differences are stationary, and then we construct a group of simple randomized controlled experiments. Given estimated θ(z) by running DML, we plot the distribution of causal attention on the right side of green line. In practice, large amounts of increasing data would enhance the robustness of causal evaluation iteratively. Figure 4 : Causal attention and probabilistic demand forecasting treated by rainfall in some days at a anonymous region in city A. A time slice means an hour. 
The black solid line is real demand time series, and orange solid line is temporal differences in next 24 hour slices. The green vertical line means a starting point of forecasting, and subsequent green filled areas describe a confidence interval between quantile 10% and 90%. According to causal attention, "AA group 1" and "AA group 2" (two red filled areas) are regarded as comparable contexts, as the first-order difference of both groups passes kpss (Kwiatkowski et al., 1992) stationary test (P > 0.05). "AB group 1" and "AB group 2" (two purple filled areas) is control group and treatment group, respectively. Based on homologous hypothetical controlled experiments, causal attention with future weather information would be added in forecasting to learn inferences. Collaborative demand and supply. As described in Section 3.1 and equation ( 2), the distribution of supply is driven by the spatio-temporal patterns of demand. Similar with above Rainfall analysis, we take another anonymous region in city A as an example. In this case, forecasting target is supply, causal treatment is demand, and external variables include weather, time slots and holidays. According to Algorithm 1, our method needs to construct AB groups and corresponding lookback AA groups from large-scale historical data. For both AB controlled experiments, the average demands of AB and AA groups should be significant different, while supply is unlimited. In AA experiments, we empirically suggest that the time span maintains for at least one day. We trace back data to the past, but selected AA groups should satisfy randomization grouping hypothesis passed by t-test. Such periods with stable supply are abundant in recent years, which implies that we can easily find proper evaluation dataset for diverse regions. In Figure 5 Furthermore, spatial fusion shows tiny improvement (+0.3% on average) in Table 5 . 
We believe that spatial fusion aggregates adjacent hexagonal grids, which reduces statistical noise in both demand and supply. For instance, in some cases, the boundary (usually around 800 meters) between adjacent grids splits a large demand hotspot (e.g., a large shopping mall), resulting in some noise when counting supply and demand. Spatial fusion can reduce the influence of such noise while improving the probabilistic forecasting performance. According to Table 5, the longer the forecasting horizon (e.g., 7 days versus 1 day), the more significant the gain from using spatial fusion. We consider spatial fusion a trick for enhancing the robustness of forecasting. The hyperparameter of spatial fusion is K, the number of clusters used in the k-means method. In this paper, we set K ∈ {3, 4, 5}. More ablation analysis of K is shown in Table 6.

Table 6: Ablation analysis ($R_{50}$ losses) of various cluster numbers K in SpatialFusion. Each value represents the $R_{50}$ loss for a specific K, and percentages in brackets are loss variations. Roughly speaking, seven-day prediction needs a bigger K than one-day prediction. We conclude that a longer prediction range needs more heterogeneous patterns. The optimal K values on the various Ride-hailing subdatasets are shown in Table 4.

K = 2 K = 3 K = 4 K = 5 K = 6 Ride-hailing (1d

C.3 TIME EFFICIENCY IMPROVEMENT

One of the innovations proposed in this paper is to shorten the running time of attention without degrading the overall quantile loss. Because the experiment cycle is long, we choose a representative dataset, namely one-day demand prediction in city A. The data size of city A is large enough to yield robust attention weights. On this dataset, we are only interested in how the running time decreases as the number of heads in multi-head attention decreases. As shown in Figure 6, with 3 heads, the running-time reductions of CPU(20), GPU(1) and GPU(2) relative to softmax attention are 58%, 70%, and 68%, respectively. Similarly, with 5 heads, the corresponding reductions are 49%, 58% and 60%, respectively. The exact time complexity is $O(K^2 V)$ (see Section 3.4), so the smaller K is, the shorter the running time. In summary, the proposed time-efficient attention is significantly faster than the default softmax attention.

$\mathrm{softmax}(Q^{\top} K) = \phi(Q)^{\top} \varphi(K)$, where Q and K are the query and key matrices, respectively. For instance, Katharopoulos et al. (2020) construct a kernel function with the basis function $\phi(x) = \varphi(x) = \mathrm{elu}(x) + 1$ and reduce the computational complexity from $O(N^2)$ to $O(N)$, but this result is established only on image datasets. Shen et al. (2018) further explore a series of kernel forms to dissect the Transformer's attention. They propose a new variant of the Transformer's attention by modeling the input as a product of symmetric kernels. This approach changes the calculation order of softmax, which is equivalent to the basis functions $\varphi(x) = \mathrm{softmax}(x)$ and $\phi(x) = e^{x}$. The second one is to modify the definition of attention. Child et al. (2019) develop sparse factorizations of the attention matrix, which reduce the computation to $O(N \sqrt{N})$, but the attention hyperparameters are very hard to initialize and the actual efficiency is hard to guarantee. Kitaev et al.
(2020) propose Reformer, which replaces dot-product attention with one that uses locality-sensitive hashing, changing its complexity from $O(N^2)$ to $O(N \log N)$, where N is the length of the sequence. Furthermore, they use reversible residual layers instead of standard residuals, which allows activations to be stored only once in the training process instead of L times, where L is the number of layers. However, Reformer is difficult to implement and to apply to different tasks. Wang et al. (2020) demonstrate that the self-attention mechanism can be approximated by a low-rank matrix, and further propose the Linformer mechanism to reduce the overall self-attention complexity to $O(N)$. Linformer uses two additional matrices E and F to project K and V, respectively, in order to get $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q (E K)^{\top}) F V$. However, the MLM experiment in Linformer does not need to extract long-term dependence and therefore cannot verify its linear time complexity for capturing long-term attention. Eliminating redundant vectors from self-attention is another key design idea. Goyal et al. (2020) exploit redundancy pertaining to word-vectors, and propose PoWER-BERT to achieve up to 4.5x reduction in inference time over BERT with <1% loss in accuracy on the standard GLUE benchmark. Similarly, Dai et al. (2020) propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one, and hence reduces the computation cost. Finally, for our approximate Taylor expansion of softmax attention, if the feature maps (i.e., Q, K and V in self-attention) meet the positive definite and normalization conditions and the task focuses on short-term dependence, then our linear attention is useful.
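To make the kernel factorization discussed above concrete, the following is a minimal sketch of linear attention with the feature map elu(x) + 1 attributed to Katharopoulos et al. (2020). It is an illustration only and is not the Taylor-expansion attention of CausalTrans, whose exact form is given in Section 3.4. The function names are hypothetical, and the sketch assumes a row-per-token layout (Q and K are N x d lists, V is N x m) and uses only the Python standard library.

```python
import math


def elu_plus_one(x):
    """Feature map elu(x) + 1, which is strictly positive for every x."""
    return x + 1.0 if x > 0 else math.exp(x)


def feature_map(matrix):
    """Apply the feature map element-wise to a list-of-lists matrix."""
    return [[elu_plus_one(v) for v in row] for row in matrix]


def quadratic_attention(Q, K, V):
    """Reference version: forms all N x N similarities, O(N^2) in N."""
    pq, pk = feature_map(Q), feature_map(K)
    out = []
    for q in pq:
        weights = [sum(a * b for a, b in zip(q, k)) for k in pk]
        z = sum(weights)  # normalizer, positive because the feature map is
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same output, but the key-value summary is built once, O(N) in N."""
    pq, pk = feature_map(Q), feature_map(K)
    d, m = len(pk[0]), len(V[0])
    kv = [[0.0] * m for _ in range(d)]  # d x m summary: sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                   # sum_j phi(k_j), used to normalize
    for k, v in zip(pk, V):
        for i in range(d):
            k_sum[i] += k[i]
            for j in range(m):
                kv[i][j] += k[i] * v[j]
    out = []
    for q in pq:
        z = sum(q[i] * k_sum[i] for i in range(d))
        out.append([sum(q[i] * kv[i][j] for i in range(d)) / z
                    for j in range(m)])
    return out
```

Both functions compute the same normalized weighted average of the rows of V; only the order of the matrix products differs, which is what removes the quadratic dependence on the sequence length.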



Ride-hailing supply and demand variables approximately follow the Poisson distribution.

https://www.kaggle.com/c/favorita-grocery-sales-forecasting/



Figure 1: An overview of the CausalTrans framework. Demand and supply are trained separately in sequence. (a) The framework consists of three essential components: Fast S.F. (fast graph spatial fusion), C.A. (causal attention), and T.A. (temporal attention). Moreover, we employ the average quantile loss over the quantiles {10%, 50%, 90%} to optimize the probabilistic forecast distributions. (b) The Fast S.F. consists of self-clustering with GAT and fast attention. (c) The C.A. applies offline-trained causal weights θ to online treatment evaluations. (d) The T.A. aims to keep ordering self-attentions.
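The average quantile loss mentioned in the caption can be illustrated with the standard quantile (pinball) loss. The sketch below is an assumption about its general form, not the authors' training code: the function names are hypothetical, and any normalization used for the reported R 50 and R 90 values is not shown in this section and is therefore omitted.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss of forecasts y_pred at quantile level q."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


def average_quantile_loss(y_true, preds_by_quantile):
    """Average the pinball loss over all quantile levels in the mapping.

    preds_by_quantile maps a quantile level (e.g. 0.1, 0.5, 0.9) to the
    list of forecasts produced for that level.
    """
    losses = [pinball_loss(y_true, preds, q)
              for q, preds in preds_by_quantile.items()]
    return sum(losses) / len(losses)
```

For example, with q = 0.9 an under-forecast of two units costs nine times as much as an over-forecast of two units, which pushes the 90% forecast above most observations.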

Figure 1 (b) for the architecture of Fast S.F.. Since GAT has achieved impressive results in traffic forecasts (Park et al. (2019); Kosaraju et al. (2019); Zhang et al. (

Figure 2: Supply, demand, delta and rainfall trends in city A from January 1st, 2018 to January 1st, 2020. On August 17th, 2018 (Friday), demand increased significantly under heavy rain during evening peak hours, and the available supply fell far short of the explosive demand. At the night peak hours on September 24th, 2018 (Mid-Autumn Festival), the number of drivers decreased and demand increased due to family reunions. During the National Day holiday (from October 1st to 7th, 2018), commuting drivers and passengers both decreased, resulting in a bilateral decline of supply and demand. On December 7th, 2018 (Friday), heavy snow and low temperatures stimulated potential demand. At the beginning of New Year's Eve in 2019, people were eager to reunite with their families, resulting in low supply and high demand. After that, supply and demand gradually became balanced.

Figure 3: The delta heat maps of city A at different hours on August 17th, 2018. Panels (a), (b) and (c) show delta maps at the peak hours 17:00, 18:00 and 19:00, respectively. Delta values are normalized to (-100, 0) for data privacy, where -100 and 0 denote the maximum and minimum delta values, respectively. As the evening peak progressed, supply was gradually matched with demand along the direction of the red arrows in panels (a), (b) and (c).

Figure 5: Causal attention and probabilistic supply forecasting treated by demand in some days at an anonymous region in city A. The black solid line is the real supply time series, while the other lines and areas are similar to those in Figure 4. AA group 1 and AA group 2 (two red filled areas) are selected by causal attention, and both AA groups have analogous rainfall and the same weekday and time slots. The mean difference between the two AA groups is not significant in a t-test. As rainfall can change demand, and demand in turn changes supply, we estimate the collaborative causal effects of demand by adding the causal effects of rainfall. Fortunately, there are no rainfall or holiday factors in the evaluation periods (right-hand side of the green vertical line), and therefore we can visualize the pure causal attention distributed in (0, 1).

(a) Multi-head = 3. (b) Multi-head = 5.

Figure 6: Time efficiency improvements on one-day demand prediction in city A. The numbers in brackets denote the number of logical cores in a chip. CPU(20) uses multiprocessing to accelerate matrix multiplication. The comparison of GPU(1) and GPU(2) aims to demonstrate the possibility of applying powerful GPUs in the real world. The results differ considerably between (a) 3 heads and (b) 5 heads in multi-head attention. According to equation (9) and the bar plots above, the fewer the heads, the shorter the running time. Each running time is averaged over three independent experiments.

Do DML on the $T_{AA}$ and $T_{AB}$ datasets and estimate the treatment coefficients $\theta_T$
14: Repeat from Step 2 and estimate $\theta_W$ and $\theta_H$ by different DML.
15: return $\theta_T$, $\theta_W$, and $\theta_H$

Traffic, Retail 2 and Ride-hailing) in our experiments as follows.

$R_{50}$/$R_{90}$ losses on the electricity and traffic datasets in the univariate group, where † denotes the results obtained from Li et al. (2019).

$R_{50}$ losses on the retail and ride-hailing datasets. Percentages in brackets are loss reductions between CausalTrans and the second best result. † denotes results from Li et al. (2019).

methods. Iterative methods generate multi-step predictions with step-by-step rolling windows, where the results of previous steps are used as inputs to the next step. Typically, iterative methods include DeepAR†, Deep State Space Models (DeepState†) (Rangapuram et al., 2018), ARIMA† (Zhang, 2003), ETS (Jain & Mallick, 2017) and TRMF (Yu et al., 2016).

$R_{90}$ losses on the retail and ride-hailing datasets. Percentages in brackets are loss reductions between CausalTrans and the second best result. † denotes results from Li et al. (2019).

Optimal hyperparameters on Ride-hailing dataset.

, the trained causal attention demonstrates how the causal weights of demand are reflected in supply forecasting. Additionally, more novel causal modules similar to equation (2) can be designed to enhance interpretability and robustness, and such modules support end-to-end training in CausalTrans as well.

C.2 ABLATION ANALYSIS

This subsection focuses on the performance of CausalTrans when some components are excluded. The proposed essential components are the PoissonOutput trick, Causal Attention (C.A.), FastAttention and SpatialFusion. C.A. can be implemented by different causal algorithms, such as DML and Uplift (Künzel et al., 2019). As shown in Table 5, we list the $R_{50}$ (50% quantile point) losses on the previous eight Ride-hailing datasets. Table 5 demonstrates that C.A. (DML) outperforms all other components, and that causal supply can be clearly influenced by causal demand. Finally, neither FastAttention nor SpatialFusion is harmful to forecasting performance.

Ablation analysis ($R_{50}$ losses) of various components in CausalTrans. Each value represents the $R_{50}$ loss after eliminating a specific component, and percentages in brackets are loss variations. For the Traffic dataset, PoissonOutput and SpatialFusion are positive components. For the Ride-hailing dataset, the importance of each component varies across tasks. For PoissonOutput, forecasting demand and forecasting supply are not significantly different, but a longer prediction step leads to a bigger loss increment. Removing PoissonOutput increases the loss by 2% on average. The most essential component C.

