BSTT: A BAYESIAN SPATIAL-TEMPORAL TRANSFORMER FOR SLEEP STAGING

Abstract

Sleep staging is helpful in assessing sleep quality and diagnosing sleep disorders. However, adequately capturing the temporal and spatial relations of the brain during sleep remains a challenge. In particular, existing methods cannot adaptively infer the spatial-temporal relations of the brain under different sleep stages. In this paper, we propose a novel Bayesian spatial-temporal relation inference neural network, named Bayesian spatial-temporal transformer (BSTT), for sleep staging. Our model is able to adaptively infer brain spatial-temporal relations during sleep for spatial-temporal feature modeling through a well-designed Bayesian relation inference component. Meanwhile, our model also includes a spatial transformer for extracting brain spatial features and a temporal transformer for capturing temporal features. Experiments show that our BSTT outperforms state-of-the-art baselines on the ISRUC and MASS datasets. In addition, visual analysis shows that the spatial-temporal relations inferred by BSTT have certain interpretability for sleep staging.

1. INTRODUCTION

Sleep staging is essential for assessing sleep quality and diagnosing sleep disorders. Sleep specialists typically classify sleep stages based on the AASM sleep standard and polysomnography (PSG) recordings to aid in diagnosis. The AASM standard not only provides criteria for determining each sleep period, but also documents conversion rules between different sleep stages, which are known as sleep transition rules, to help sleep specialists identify sleep stages when sleep transitions occur. However, manual sleep staging is time-consuming, and the classification results are greatly affected by the specialist's professional level and subjectivity (Supratak et al., 2017). Therefore, automatic classification methods are applied to sleep staging to improve efficiency. Traditional machine learning methods use hand-designed features for sleep staging, which improves the efficiency of staging to a certain extent (Fraiwan et al., 2012). However, the accuracy of traditional machine learning methods relies heavily on feature engineering and feature selection, which still requires a lot of expert knowledge. To address the above problems, deep learning methods have been applied to sleep staging and achieve satisfactory classification performance (Phan et al., 2019; Jia et al., 2022a; b). Most early deep learning methods focus on the temporal information of the sleep data, utilizing convolutional neural networks (CNN) and recurrent neural networks (RNN) to capture temporal features for sleep staging (Jain & Ganesan, 2021; Perslev et al., 2019). In addition, some studies have shown that the spatial topology of the brain behaves differently in different sleep stages (Khanal, 2019), which means that both the temporal and spatial relations of the brain are important during sleep. Therefore, some researchers try to use the spatial and temporal characteristics of the brain for sleep staging (Jia et al., 2020b; Phan et al., 2022; Jia et al., 2020a).
Although the above methods achieve good classification performance, modeling spatial and temporal relations remains challenging. Specifically, for the modeling of temporal relations, some approaches attempt to capture sleep transition rules to support the identification of specific sleep stages. However, it is difficult for these methods to explicitly demonstrate the relation of different sleep time slices in accordance with the AASM sleep standard. Besides, for the modeling of spatial relations, most methods employ spatial convolution operations to extract spatial features of the brain, which is insufficient because it may ignore the spatial topology of the brain (Zhou et al., 2021a; Perslev et al., 2019). A few studies utilize the spatial topology and temporal relation information of the brain for sleep staging via graph convolutional networks, but the constructed brain networks still lack interpretability to a certain extent (Jia et al., 2020b). To address the above challenges, we propose a novel model called Bayesian spatial-temporal transformer (BSTT) for sleep staging. The proposed model integrates the transformer and Bayesian relation inference in a unified framework. Specifically, we design the spatial-temporal transformer architecture, which can capture the temporal and spatial features of the brain. Besides, we propose the Bayesian relation inference component, which comes in two forms: Bayesian temporal relation inference and Bayesian spatial relation inference. Therefore, it can infer the spatial-temporal relations of objects and generate relation intensity graphs. The main contributions of our BSTT are summarized as follows:

• We design a Bayesian relation inference component which can adaptively infer the spatial-temporal relations of the brain during sleep, in the service of capturing spatial-temporal relations.

• We apply the spatial-temporal transformer architecture to simultaneously model spatial-temporal relations. It can effectively capture the spatial-temporal features of the brain and enhance the model's ability to model spatial-temporal relations.

• Experimental results show that the proposed BSTT achieves state-of-the-art performance on multiple sleep staging datasets. Visual analysis shows that our model has a certain degree of interpretability for sleep staging.

2. RELATED WORK

Identifying sleep stages plays an important role in diagnosing and treating sleep disorders. Earlier, support vector machines (SVM) and random forests (RF) were used for sleep staging (Fraiwan et al., 2012). However, these methods need hand-crafted features, which require a lot of prior knowledge. Currently, deep learning methods have become the primary approach for sleep staging. Early deep learning methods extract temporal features of sleep signals for classification. The earliest methods are based on CNN models (Tsinalis et al., 2016; Chambon et al., 2018; Eldele et al., 2021). In addition, RNN models have gradually been used for sleep staging (Phan et al., 2019; Perslev et al., 2019; Phan et al., 2018). For example, Phan et al. propose a deep bidirectional RNN model with an attention mechanism for single-channel EEG (Phan et al., 2018). They then design an end-to-end hierarchical RNN architecture for capturing different levels of EEG signal features (Phan et al., 2019). Some studies combine CNN with RNN (Supratak & Guo, 2020; Guillot & Thorey, 2021; Dong et al., 2017; Jia et al., 2021b). Further, several studies have shown the importance of brain spatial relations for sleep staging (Khanal, 2019; Sakkalis, 2011). Some researchers try to model the spatial-temporal characteristics of sleep data. For example, Jia et al. propose an adaptive deep learning model for sleep staging, in which a spatial-temporal graph convolutional network is used to extract spatial features and capture transition rules (Jia et al., 2020b). They also propose a multi-view spatial-temporal graph convolutional network based on domain generalization, which models the multi-view spatial characteristics of the brain (Jia et al., 2021a). Although the above models achieve good classification performance, they do not adequately model spatial-temporal properties or effectively reason about and capture spatial-temporal relations.
Therefore, our method attempts to model spatial-temporal relations using Bayesian inference, combined with state-of-the-art transformer architectures for sleep staging.

3. PRELIMINARIES

The proposed model processes data from T successive sleep epochs and predicts the label of the middle epoch. Each sleep epoch is defined as x ∈ R^{C×N}, where C represents the number of channels of the sleep epoch (i.e., the EEG channels in this work; since EEG signals from different channels are recorded from different regions of the brain, spatial relations are contained among these channels) and N represents the number of sampling points in a sleep epoch. The input sequence of sleep epochs is defined as x = {x_1, x_2, ..., x_T}, where x_i denotes a sleep epoch (i ∈ [1, 2, ..., T]) and T is the number of sleep epochs. The sleep staging problem is defined as learning an artificial neural network F, based on the Bayesian spatial-temporal transformer, which can infer the spatial-temporal relations of the input sleep epoch sequence x and map it to the corresponding sleep stage Y, where Y is the classification result of the middle epoch.
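As a toy, shape-level sketch of this sleep-staging formulation (all sizes are illustrative values, not the paper's settings):

```python
import numpy as np

# Toy shapes for the problem formulation: T sleep epochs, each with C channels
# and N sampled points; the model predicts the stage of the middle epoch.
T, C, N = 5, 6, 3000                   # illustrative values
x = np.zeros((T, C, N))                # input sequence of sleep epochs
middle = T // 2                        # index of the epoch whose label is predicted
print(x.shape, middle)   # (5, 6, 3000) 2
```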

4. BAYESIAN SPATIAL-TEMPORAL TRANSFORMER

We propose a novel model named Bayesian spatial-temporal transformer (BSTT) for sleep staging. The core ideas of our model are summarized as follows:

• Infer the spatial-temporal relations of the brain based on a Bayesian inference method.

• Design the Bayesian transformer architecture to capture the spatial-temporal features of the brain.

• Integrate the Bayesian relation inference components and the transformer architecture into a unified framework which can classify sleep stages effectively.

The overall model is carefully designed to accurately classify different sleep stages.

4.1. ARCHITECTURE

The overall architecture of the proposed BSTT is shown in Figure 1. The EEG signals are first encoded by the embedding layer. The spatial-temporal relations are then inferred and modeled by the Bayesian spatial-temporal transformer module. Specifically, the Bayesian spatial-temporal transformer includes a Bayesian spatial transformer and a Bayesian temporal transformer. The Bayesian spatial transformer reasons about spatial relations in the brain and captures spatial features. The Bayesian temporal transformer reasons about the temporal relations of consecutive sleep epochs and captures temporal features. Finally, the classification layer predicts the sleep stage.
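The pipeline above can be sketched at the shape level; the blocks below are trivial stand-ins (simple means and random linear maps), not the paper's actual modules, and all sizes are illustrative:

```python
import numpy as np

# Hypothetical shapes: B batches, T sleep epochs, C EEG channels, V embedding dim.
B, T, C, V, n_classes = 2, 5, 6, 16, 5
rng = np.random.default_rng(0)

def embed(x):            # embedding layer: raw signal -> per-channel embeddings
    return x @ rng.standard_normal((x.shape[-1], V))

def spatial_block(e):    # stand-in for the Bayesian spatial transformer
    return e.mean(axis=2)            # aggregate over the channel axis

def temporal_block(e):   # stand-in for the Bayesian temporal transformer
    return e.mean(axis=1)            # aggregate over the epoch axis

def classify(f):         # classification layer
    logits = f @ rng.standard_normal((V, n_classes))
    return logits.argmax(axis=-1)

x = rng.standard_normal((B, T, C, 300))   # 300 samples per epoch (toy value)
y = classify(temporal_block(spatial_block(embed(x))))
print(y.shape)  # (2,)
```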

4.1.1. BAYESIAN RELATION INFERENCE

Capturing the spatial-temporal relations of brain signals during sleep can better serve the sleep staging task. However, because the spatial-temporal relations of sleep are difficult to infer, current research models them insufficiently. Inspired by the deep graph random process (DGP) proposed in recent research (Huang et al., 2020), we propose the Bayesian relation inference method. Bayesian relation inference is the core component of our model; it can infer relations between each pair of object nodes and build relation intensity graphs. In this article, an object node represents the embedding of an EEG channel or the embedding of a certain time slice. The construction of the relation intensity graph is divided into the following steps:

Step 1: Edge embedding initialization. We first apply a linear neural network f_θ to generate edge embeddings:

E_e = f_θ(E_n[i, j]),  i, j ∈ [1, n], i ≠ j   (1)

where E_n ∈ R^{B×Nv×V} represents the input n node objects, E_n[i, j] ∈ R^{B×Ne×2V} represents the concatenation of two node embeddings, and E_e ∈ R^{B×Ne×E} represents the generated edge embeddings.

Step 2: Edge embedding coupling. Coupling is one of the key steps in relation inference; its purpose is to obtain a summary graph of the relations between object nodes (Huang et al., 2020). Due to the uncertainty of the spatial-temporal relations of the brain, we assume that each edge of the summary graph M is a summary of Binomial distributions with n → ∞ and λ → 0, i.e., m_{i,j} ∼ B(n, λ). Drawing from the variational recurrent neural network (VRNN) model, the parameters of the approximate posteriors are estimated using a recurrent neural network (RNN) to encode features. However, in contrast to VRNN, the Bayesian relation inference component contains an approximate posterior q(m | E_e) whose inference and sampling cannot be solved in a computationally feasible manner due to its infinite n.
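Step 1 can be sketched as follows; the linear map f_theta is reduced to a single weight matrix, and all names and sizes are illustrative rather than the paper's:

```python
import numpy as np

# Toy sketch of Step 1: build edge embeddings from all ordered node pairs (i != j).
rng = np.random.default_rng(0)
B, Nv, V, E = 2, 4, 8, 6                 # batch, nodes, node dim, edge dim
W = rng.standard_normal((2 * V, E))      # parameters of the linear map f_theta

nodes = rng.standard_normal((B, Nv, V))
pairs = [(i, j) for i in range(Nv) for j in range(Nv) if i != j]
# concatenate the two node embeddings of every pair, then apply f_theta
concat = np.stack([np.concatenate([nodes[:, i], nodes[:, j]], axis=-1)
                   for i, j in pairs], axis=1)      # (B, Ne, 2V)
edges = concat @ W                                   # (B, Ne, E)
print(edges.shape)   # (2, 12, 6) since Ne = Nv*(Nv-1) = 12
```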
By the De Moivre–Laplace theorem (Sheynin, 1977) and DGP (Huang et al., 2020), we can subject these edge embeddings to a coupling transformation as follows:

n_{i,j} = ζ(L_mean(E_{e,i,j})) + ε   (2)

σ_{i,j} = ζ(L_std(E_{e,i,j}))   (3)

m_{i,j} = (1 + 2·n_{i,j}·σ²_{i,j} − √(1 + 4·n²_{i,j}·σ⁴_{i,j})) / 2   (4)

where ζ(·) is the softplus function, E_e is the edge embedding generated in the first step, ε is a very small constant, L_mean(·) and L_std(·) are implemented by neural networks that estimate the mean and standard deviation respectively, m_{i,j} ∈ M is the approximation of the Binomial edge variable in the summary graph, and M ∈ R^{B×Ne} is the approximation of the summary graph, which strengthens the representation of real spatial or temporal relations.

Step 3: Sleep relation intensity calculation. The final step strengthens the edge information and generates a relation intensity graph for downstream tasks. Existing research shows that the relation intensities of the brain spatial-temporal network during sleep are sparse (Razi et al., 2017). Therefore, generating a sparse graph based on the edge embeddings not only highlights the representation of key relations but also better matches the actual situation. We employ a Gaussian graph transformation approach which produces a sparse sleep relation intensity graph G. The specific calculation is defined as follows:

α_{i,j} = m_std_{i,j} × ε_{i,j} + m_{i,j}   (5)

s_{i,j} = m_mean_{i,j} × α_{i,j} + α_std_{i,j} × σ_mean_{i,j} × ε′_{i,j}   (6)

ᾱ_{i,j} = s_{i,j} × α_{i,j}   (7)

α̂_{i,j} = ζ(L(ᾱ_{i,j}))   (8)

where m_{i,j} ∈ M is the approximation of the edges of the summary graph obtained in Step 2, ε and ε′ are standard Gaussian random variables of the same dimension as M, α ∈ R^{B×Ne} is the Gaussian edge representation, S ∈ R^{B×Ne} is the task-related Gaussian variable, σ_{i,j} is calculated in Eq. (3), the subscripts std and mean denote the standard deviation and the mean value, ᾱ is the Gaussian transformation map, α̂ ∈ R^{B×Ne} is the final sleep relation intensity graph, ζ(·) is the softplus function, and L(·) is a linear layer.
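A minimal numerical sketch of Steps 2 and 3, following the equations as they can be read from the text; the per-edge mean/std statistics are simplified to scalars here and the final linear map L(·) is omitted, so this illustrates the shape of the computation rather than the paper's implementation:

```python
import numpy as np

def softplus(x):                      # the zeta(.) function in Eqs. (2)-(4)
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
Ne, E, eps = 12, 6, 1e-6
edges = rng.standard_normal((Ne, E))
W_mean, W_std = rng.standard_normal((E,)), rng.standard_normal((E,))

# Eqs. (2)-(4): couple each edge embedding into a summary-graph edge m_ij
n = softplus(edges @ W_mean) + eps
sigma = softplus(edges @ W_std)
m = (1 + 2 * n * sigma**2 - np.sqrt(1 + 4 * n**2 * sigma**4)) / 2

# Step 3 sketch: reparameterised Gaussian transformation -> sparse intensity graph
mu, sd = m.mean(), m.std()                      # simplified scalar statistics
alpha = sd * rng.standard_normal(Ne) + m        # cf. Eq. (5)
s = mu * alpha + np.sqrt(np.abs(alpha)) * sd * rng.standard_normal(Ne)  # cf. Eq. (6)
alpha_hat = softplus(s * alpha)                 # cf. Eqs. (7)-(8), L(.) omitted
print(alpha_hat.shape)   # (12,)
```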
Afterwards, we utilize an attention-based method to convert the node embeddings into feature embeddings according to the sleep relation intensity graph, which forms the output of Bayesian relation inference. The specific calculation is as follows:

E_out = f_GAL(E_n, α̂)   (9)

where f_GAL(·) is the graph attention layer, E_n denotes the node embeddings of the input objects, α̂ is the sleep relation intensity graph, and E_out ∈ R^{B×Nv×Vout} is the output node embeddings.
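A minimal sketch of this readout, with the graph attention layer f_GAL reduced to a plain intensity-weighted average of neighbour embeddings (the real layer also applies learned attention parameters):

```python
import numpy as np

# Weight each neighbour's embedding by the (row-normalised) relation intensity
# and sum, producing one feature embedding per node. Toy sizes throughout.
rng = np.random.default_rng(0)
Nv, V = 4, 8
nodes = rng.standard_normal((Nv, V))
intensity = np.abs(rng.standard_normal((Nv, Nv)))   # toy relation intensity graph
np.fill_diagonal(intensity, 0.0)

weights = intensity / intensity.sum(axis=1, keepdims=True)  # normalise per node
out = weights @ nodes                                       # (Nv, V) feature embeddings
print(out.shape)   # (4, 8)
```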

4.1.2. LEARNING OF BAYESIAN RELATION INFERENCE

We adopt variational inference to jointly optimise the Bayesian relation inference component. Inspired by VRNN (Chung et al., 2015), we use the evidence lower bound (ELBO) for joint learning and inference. The details of how variational inference fits into our model are shown in Appendix 3. Specifically, we use two random variables that need to be optimised to describe the same random process data. The resulting objective is to minimise the negative ELBO:

Σ_{i=1}^{M} KL(q(Ã, S | X_{0:i}) ∥ p(Ã, S | X_{0:i})) − E_{Ã,S}[log P(Y_i | X_i, Ã, S)]

where S is the task-related Gaussian variable, Ã is the Gaussian graph embedding, q(Ã, S | X_{0:i}) is the approximate posterior, and p(Ã, S | X_{0:i}) is the prior distribution. Since every variable in S is affected by Ã in Eq. (6), the KL term can be further written as:

Σ_{(i,j)∈Ẽ} [ KL(B(n, λ_{i,j}) ∥ B(n, λ^{(0)}_{i,j})) + E_{α_{i,j}} KL(N(α_{i,j}·μ_{i,j}, α_{i,j}·σ²_{i,j}) ∥ N(α_{i,j}·μ^{(0)}_{i,j}, α_{i,j}·σ^{(0)2}_{i,j})) ]

The second term can be computed in closed form, while the first term is hard to calculate because n → ∞. According to Theorem 2 of DGP (Huang et al., 2020), it can be converted into an easy-to-solve value to approximate the calculation.
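The Gaussian KL term of this objective has a standard closed form; a small sketch with toy values and diagonal Gaussians:

```python
import numpy as np

# Closed-form KL divergence between diagonal Gaussians N(mu_q, var_q) and
# N(mu_p, var_p), summed over dimensions. Values below are illustrative.
def kl_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

mu_q, var_q = np.array([0.0, 1.0]), np.array([1.0, 1.0])
mu_p, var_p = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(round(kl_gauss(mu_q, var_q, mu_p, var_p), 3))  # 0.5
```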

4.1.3. BAYESIAN TRANSFORMER MODULE

Transformer shows convincing results in various sequence modeling tasks (Li et al., 2021; Luo et al., 2021; Zhou et al., 2021b). However, the traditional transformer cannot reason about the relation between each pair of object nodes, which leaves the attention graphs it generates short on interpretability. Besides, the accuracy of the traditional transformer is not good enough in some medical scenarios. The Bayesian relation inference component we propose can infer spatial-temporal relations efficiently. Hence, we integrate the Bayesian relation inference component described in Section 4.1.1 with the transformer in a unified framework to simultaneously reason about and model the spatial-temporal features of sleep EEG data.

Bayesian spatial transformer. To better construct the spatial functional connectivity of the brain and capture spatial features, we design the Bayesian spatial transformer. It contains two components, a Bayesian spatial relation inference component and a position-wise feed-forward network, of which the core is the Bayesian relation inference component. Specifically, the input of the Bayesian spatial transformer S is the embeddings of n spatial nodes. First, we add position encoding to the input to introduce position information:

S̃ = S + P_ep   (12)

where S ∈ R^{(B×Nt)×Ns×V} is the input spatial node embedding, P_ep ∈ R^{(B×Nt)×Ns×V} is the position encoding matrix, and S̃ ∈ R^{(B×Nt)×Ns×V} is the position-encoded spatial node embedding. For the position encoding matrix, we follow the groundbreaking work of Vaswani et al. (2017) and compute it with sine and cosine functions. We design a multi-head spatial Bayesian relation inference component, which can reason about spatial relations, to improve the representation learning ability of the model. The details of the Bayesian relation inference component are described in Section 4.1.1. The position-encoded node embeddings are encoded into embeddings with spatial features by the multi-head spatial Bayesian relation inference component followed by the feed-forward layer:

S′ = f_FNN(f_BSRI(S̃))

where f_BSRI(·) is the multi-head spatial Bayesian relation inference component and f_FNN(·) is the position-wise feed-forward network. The Bayesian temporal transformer mirrors this design along the temporal dimension and outputs temporal feature embeddings T′. Finally, the classification layer produces the prediction:

Y = f_C(T′)

where f_C(·) is the classification layer and Y is the classification result of BSTT.
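The sinusoidal position encoding P_ep follows Vaswani et al. (2017); a minimal sketch with toy sizes (6 spatial nodes, embedding dimension 16):

```python
import numpy as np

# Sinusoidal position encoding: even dimensions use sin, odd dimensions cos,
# with wavelengths forming a geometric progression up to 10000.
def positional_encoding(n_pos, d_model):
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

S = np.zeros((6, 16))                 # toy spatial node embeddings
S_tilde = S + positional_encoding(6, 16)
print(S_tilde.shape)   # (6, 16)
```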

5. EXPERIMENTS

To verify the effectiveness of the Bayesian spatial-temporal transformer, we evaluate it on the Institute of Systems and Robotics, University of Coimbra (ISRUC) and Montreal Archives of Sleep Studies-SS3 (MASS-SS3) datasets.

5.1. DATASET

The ISRUC dataset contains PSG recordings from 100 adult subjects. Each PSG recording contains 6 EEG channels, 6 EOG channels, 3 EMG channels, and 1 ECG channel. The MASS-SS3 dataset contains PSG recordings from 62 adult subjects. Each PSG recording contains 20 EEG channels, 2 EOG channels, 3 EMG channels, and 1 ECG channel. The recordings are divided into time slices of 30 s, each corresponding to one sleep epoch. Sleep specialists divide these time slices into five distinct sleep stages (W, N1, N2, N3, and REM) according to the AASM standard. Motion artifacts at the beginning and end of each subject's recording are marked as unknown; following the previous study (Supratak et al., 2017), we remove these recordings.
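The 30-second segmentation can be sketched as follows; the sampling rate here is an assumed toy value, not necessarily the datasets' actual rate:

```python
import numpy as np

# Slice a continuous multi-channel recording into 30-second sleep epochs.
fs = 100                                  # assumed sampling frequency in Hz
recording = np.zeros((6, fs * 30 * 10))   # 6 EEG channels, 10 epochs long
n_epochs = recording.shape[1] // (fs * 30)
epochs = recording[:, :n_epochs * fs * 30].reshape(6, n_epochs, fs * 30)
epochs = epochs.transpose(1, 0, 2)        # (n_epochs, channels, samples)
print(epochs.shape)   # (10, 6, 3000)
```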

5.2. EXPERIMENT SETTINGS

We evaluate our model using k-fold cross-subject validation to ensure that the experimental results are reliable. We set k = 5 in order to test all recordings efficiently. The Adam and Adadelta optimizers are used for the MASS-SS3 and ISRUC datasets, respectively. We use multi-channel EEG data for sleep staging to better capture the brain network structure: on the ISRUC dataset we use all 6 EEG channels, and on the MASS-SS3 dataset we use 19 EEG channels. To comprehensively evaluate the Bayesian spatial-temporal transformer and all baseline methods, we report accuracy (ACC), F1-score, and Cohen's kappa. Details of the evaluation metrics and baseline models are given in Appendix 2. To verify the effectiveness of each component of the Bayesian spatial-temporal transformer, we conduct ablation experiments to determine the impact of each module on the model's performance. Specifically, we design three variants of the Bayesian spatial-temporal transformer, including:

• Bayesian Spatial Transformer (BST), which removes the Bayesian temporal transformer module to determine the impact of modeling temporal relations on model performance.

• Bayesian Temporal Transformer (BTT), which removes the Bayesian spatial transformer module to determine the impact of modeling spatial relations on model performance.

• Spatial-Temporal Transformer (STT), which removes the relational inference component to determine the impact of Bayesian relational inference on model performance.

5.3. EXPERIMENT ANALYSIS

Figure 2 demonstrates that the performance of the variant models degrades after removing any single component or module. Among them, the removal of the relational inference component has the greatest impact on performance, which shows the importance of introducing relational inference to the sleep staging task. In addition, modeling only the spatial relations or only the temporal relations of the data also leads to a decrease in performance. It can be seen that modeling both the spatial and temporal relations is helpful for the sleep staging task.
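The subject-wise 5-fold protocol described in Section 5.2 can be sketched as follows; all epochs of a subject stay in the same fold, so evaluation is cross-subject (subject IDs are illustrative):

```python
import numpy as np

# Subject-wise 5-fold split over e.g. the 100 ISRUC subjects.
subjects = np.arange(100)
rng = np.random.default_rng(0)
rng.shuffle(subjects)
folds = np.array_split(subjects, 5)

for k, test_subjects in enumerate(folds):
    # every subject is either entirely in the train set or entirely held out
    train_subjects = np.setdiff1d(subjects, test_subjects)
    assert len(test_subjects) == 20 and len(train_subjects) == 80
print("folds:", [len(f) for f in folds])   # folds: [20, 20, 20, 20, 20]
```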

5.5. VISUAL ANALYSIS

To verify that the proposed Bayesian relational inference module can infer the spatial-temporal relations during sleep, we visualize and analyze the generated relational inference graphs.

5.5.1. VISUAL ANALYSIS OF SPATIAL RELATION INFERENCE

Some studies have shown that the functional connectivity of the brain varies across sleep stages (Nguyen et al., 2018). In order to analyze the role of the Bayesian spatial inference component of our model, we visualize the spatial relation intensity graphs between EEG channels in different sleep periods, as shown in Figure 3. The positions of the nodes in the figure correspond to the positions of the electrodes that record the EEG signals, and each edge is the relation intensity between a pair of electrodes. We notice that during the NREM period, brain connectivity is significantly stronger in light sleep (N1, N2) than in deep sleep (N3). A previous study revealed that during light sleep, cerebral blood flow (CBF) and cerebral metabolic rate (CMR) are only about 3% to 10% lower than in wakefulness, while during deep sleep these indexes show a significant decrease of 25% to 44% (Madsen & Vorstrup, 1991). Synaptic connection activity is directly correlated with CBF and CMR, which is consistent with our connection intensity graphs. Madsen also reported that the level of brain synaptic activity during the REM period is similar to that of the wake period, which matches our experimental findings. (Bassi et al., 2009), which is consistent with our experimental results. Figure 4 also reports that when a sleep transition occurs, the relation intensity between the t and t + 1 time slices is usually stronger. Similarly, the temporal intensity between the t − 2 and the last three time slices is usually weaker than during stable periods, which is conducive to the interpretation of sleep transitions.
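The stage-wise graphs visualized in Figure 3 amount to averaging the inferred intensity graphs over the epochs of each stage; a toy sketch with synthetic data (stage labels 0–4 standing in for W, N1, N2, N3, REM):

```python
import numpy as np

# Average the per-epoch relation-intensity graphs within each sleep stage.
rng = np.random.default_rng(0)
n_epochs, Nv = 50, 6
graphs = np.abs(rng.standard_normal((n_epochs, Nv, Nv)))  # toy intensity graphs
stages = np.repeat(np.arange(5), 10)                      # 10 epochs per stage

mean_graphs = {int(s): graphs[stages == s].mean(axis=0) for s in np.unique(stages)}
print(len(mean_graphs), mean_graphs[0].shape)   # 5 (6, 6)
```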

6. CONCLUSION

We propose a novel Bayesian spatial-temporal transformer model for sleep staging. To the best of our knowledge, this is the first attempt to combine Bayesian relational inference with a spatial-temporal transformer for sleep staging. Our BSTT constructs spatial-temporal relations through Bayesian relational inference and applies the transformer architecture to capture brain spatial-temporal features for sleep staging. The results show that BSTT can effectively improve model performance and achieve state-of-the-art results. In addition, visual analysis shows that the relation intensity graphs generated by the Bayesian relation inference have certain interpretability, which is consistent with existing research and helps to reveal the potential working mechanism of our model. Besides, the proposed BSTT is a general framework which can infer the spatial-temporal relations of EEG data and perform satisfactory data forecasting. In the future, the proposed method can be used for other EEG tasks, such as emotion recognition or motor imagery classification.



* indicates the significant differences between our model and other models (p < 0.05).



Figure 2: Ablation experiment results of the Bayesian spatial-temporal transformer on the ISRUC dataset.

Figure 3: The graph shows the average of the brain spatial intensity over time. The spatial relation during the WAKE, REM and N1 periods is strong, while that during the N2 and N3 periods is weak.

Figure 4: Intensity graphs of the temporal relation during different sleep periods and when sleep transitions occur.





Tables 1 and 2 indicate that the proposed model achieves the best performance compared to the other baseline methods on both datasets. Specifically, MCNN and MMCNN utilize CNN models to automatically extract sleep features, while RNN-based methods such as DeepSleepNet and TinySleepNet focus on the temporal context in sleep data and model the multi-level temporal characteristics of the sleep process for sleep staging. Further, GraphSleepNet and ST-Transformer simultaneously model the spatial-temporal relations during sleep and achieve satisfactory results. However, GraphSleepNet and ST-Transformer cannot adequately reason about the spatial-temporal relations, which limits their classification performance to a certain extent. Our Bayesian ST-Transformer uses the multi-head Bayesian relation inference component to infer spatial-temporal relations and thus models spatial and temporal relations better. Therefore, the proposed model achieves the best classification performance on both datasets.

Comparison of Bayesian spatial-temporal transformer and baselines on ISRUC dataset

Comparison of Bayesian spatial-temporal transformer and baselines on MASS dataset

7. REPRODUCIBILITY STATEMENT AND ETHICS STATEMENT

We provide an open-source implementation of our BSTT and the other baseline models. The code of BSTT is available at: https://github.com/YuchenLiu1225/BSTT/tree/main/BSTT. Please check Appendix 7 for links to the baseline methods. The authors do not foresee any negative social impacts of this work. All authors disclosed no relevant relationships.

