JOINT LEARNING OF FULL-STRUCTURE NOISE IN HIERARCHICAL BAYESIAN REGRESSION MODELS

Abstract

We consider hierarchical Bayesian (type-II maximum likelihood) models for observations with latent variables for source and noise, where both hyperparameters need to be estimated jointly from data. This problem has applications in many imaging domains, including biomagnetic inverse problems. Crucial factors influencing the accuracy of source estimation are not only the noise level but also its correlation structure; however, existing approaches have not addressed the estimation of noise covariance matrices with full structure. Here, we consider the reconstruction of brain activity from electroencephalography (EEG). This inverse problem can be formulated as a linear regression with independent Gaussian scale mixture priors for both the source and noise components. As a departure from classical sparse Bayesian learning (SBL) models, where across-sensor observations are assumed to be independent and identically distributed, we consider Gaussian noise with full covariance structure. Using Riemannian geometry, we derive an efficient algorithm for updating both source and noise covariance along the manifold of positive definite matrices. Using the majorization-minimization framework, we demonstrate that our algorithm has guaranteed and fast convergence. We validate the algorithm both in simulations and with real data. Our results demonstrate that the novel framework significantly improves upon state-of-the-art techniques in the real-world scenario where the noise is indeed non-diagonal and fully structured.

1. INTRODUCTION

Having precise knowledge of the noise distribution is a fundamental requirement for obtaining accurate solutions in many regression problems (Bungert et al., 2020). In many applications, however, it is impossible to estimate this noise distribution separately, as distinct "noise-only" (baseline) measurements are not feasible. An alternative, therefore, is to design estimators that jointly optimize over the regression coefficients as well as over parameters of the noise distribution. This has been pursued both in (penalized) maximum-likelihood settings (here referred to as Type-I approaches) (Petersen & Jung, 2020; Bertrand et al., 2019; Massias et al., 2018) and in hierarchical Bayesian settings (referred to as Type-II) (Wipf & Rao, 2007; Zhang & Rao, 2011; Hashemi et al., 2020; Cai et al., 2020a). Most contributions in the literature are, however, limited to the estimation of a diagonal noise covariance (i.e., independent noise between different measurements) (Daye et al., 2012; Van de Geer et al., 2013; Dalalyan et al., 2013; Lederer & Muller, 2015). A diagonal noise covariance is a limiting assumption in practice, as the noise interference in many realistic scenarios is highly correlated across measurements and thus has non-trivial off-diagonal elements. This paper develops an efficient optimization algorithm for jointly estimating the posterior of the regression parameters as well as the noise distribution. More specifically, we consider linear regression with Gaussian scale mixture priors on the parameters and a full-structure multivariate Gaussian noise. We cast the problem as a hierarchical Bayesian (type-II maximum-likelihood) regression problem, in which the variance hyperparameters and the noise covariance matrix are optimized by maximizing the Bayesian evidence of the model.
Using Riemannian geometry, we derive an efficient algorithm for jointly estimating the source and noise covariances along the manifold of positive definite (P.D.) matrices. To highlight the benefits of our proposed method in practical scenarios, we consider the problem of electromagnetic brain source imaging (BSI). The goal of BSI is to reconstruct brain activity from magneto- or electroencephalography (M/EEG), which can be formulated as a sparse Bayesian learning (SBL) problem. Specifically, it can be cast as a linear Bayesian regression model with independent Gaussian scale mixture priors on the parameters and noise. As a departure from classical SBL approaches, here we specifically consider Gaussian noise with full covariance structure. Prominent sources of correlated noise in this context are, for example, eye blinks, heart beats, muscular artifacts, and line noise. Other realistic examples of the need for such full-structure noise models can be found in the areas of array processing (Li & Nehorai, 2010) and direction of arrival (DOA) estimation (Chen et al., 2008). Algorithms that can accurately estimate noise with full covariance structure are expected to achieve more accurate regression models and predictions in this setting.

2. TYPE-II BAYESIAN REGRESSION

We consider the linear model Y = LX + E, in which a forward or design matrix, L ∈ R^{M×N}, maps a set of coefficients or source components, X, to the measurements, Y. Depending on the setting, the problem of estimating X given L and Y is called an inverse problem in physics, a multitask regression problem in machine learning, or a multiple measurement vector (MMV) recovery problem in signal processing (Cotter et al., 2005). Adopting signal processing terminology, the measurement matrix Y ∈ R^{M×T} captures the activity of M sensors at T time instants, y(t) ∈ R^{M×1}, t = 1, ..., T, while the source matrix, X ∈ R^{N×T}, consists of the unknown activity of N sources at the same time instants, x(t) ∈ R^{N×1}, t = 1, ..., T. The matrix E = [e(1), ..., e(T)] ∈ R^{M×T} represents T time instances of zero-mean Gaussian noise with full covariance Λ, e(t) ∈ R^{M×1} ∼ N(0, Λ), t = 1, ..., T, which is assumed to be independent of the source activations. In this paper, we focus on M/EEG-based brain source imaging (BSI), but the proposed algorithm can be used in general regression settings, in particular for sparse signal recovery (Candès et al., 2006; Donoho, 2006) with a wide range of applications (Malioutov et al., 2005). The goal of BSI is to infer the underlying brain activity X from the EEG/MEG measurements Y given a known forward operator, called the lead field matrix L. As the number of sensors is typically much smaller than the number of locations of potential brain sources, this inverse problem is highly ill-posed. This problem is addressed by imposing prior distributions on the model parameters and adopting a Bayesian treatment.
This can be performed either through Maximum-a-Posteriori (MAP) estimation (Type-I Bayesian learning) (Pascual-Marqui et al., 1994; Gorodnitsky et al., 1995; Haufe et al., 2008; Gramfort et al., 2012; Castaño-Candamil et al., 2015) or, when the model has unknown hyperparameters, through Type-II Maximum-Likelihood estimation (Type-II Bayesian learning) (Mika et al., 2000; Tipping, 2001; Wipf & Nagarajan, 2009; Seeger & Wipf, 2010; Wu et al., 2016). In this paper, we focus on Type-II Bayesian learning, which assumes a family of prior distributions p(X|Θ) parameterized by a set of hyperparameters Θ. These hyperparameters can be learned from the data along with the model parameters using a hierarchical Bayesian approach (Tipping, 2001; Wipf & Rao, 2004) through the maximum-likelihood principle:

Θ^II := arg max_Θ p(Y|Θ) = arg max_Θ ∫ p(Y|X, Θ) p(X|Θ) dX .  (1)

Here we assume a zero-mean Gaussian prior with full covariance Γ for the underlying source distribution, x(t) ∈ R^{N×1} ∼ N(0, Γ), t = 1, ..., T. Just as most other approaches, Type-II Bayesian learning makes the simplifying assumption of statistical independence between time samples. This leads to the following expressions for the distributions of the sources and measurements:

p(X|Γ) = ∏_{t=1}^{T} p(x(t)|Γ) = ∏_{t=1}^{T} N(x(t)|0, Γ)  (2)
p(Y|X) = ∏_{t=1}^{T} p(y(t)|x(t)) = ∏_{t=1}^{T} N(y(t)|Lx(t), Λ) .  (3)

The parameters of the Type-II model, Θ, are the unknown source and noise covariances, i.e., Θ = {Γ, Λ}. The unknown parameters Γ and Λ are optimized based on the current estimates of the source and noise covariances in an alternating iterative process. Given initial estimates of Γ and Λ, the posterior distribution of the sources is a Gaussian of the form (Sekihara & Nagarajan, 2015)

p(X|Y, Γ) = ∏_{t=1}^{T} N(μ_x(t), Σ_x) , where  (4)
μ_x(t) = Γ L^⊤ Σ_y^{-1} y(t)  (5)
Σ_x = Γ − Γ L^⊤ Σ_y^{-1} L Γ  (6)
Σ_y = Λ + L Γ L^⊤ .
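As an illustration, the posterior mean and covariance in equations 4 to 6 amount to a few lines of linear algebra. The sketch below uses random toy data; all sizes, variable names, and values are our own choices for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 8, 20, 50                          # sensors, sources, time samples (toy sizes)

L = rng.standard_normal((M, N))              # lead field / forward matrix
Gamma = np.diag(rng.uniform(0.5, 2.0, N))    # source covariance (diagonal here)
A = rng.standard_normal((M, M))
Lam = A @ A.T + M * np.eye(M)                # full-structure noise covariance (P.D.)
Y = rng.standard_normal((M, T))              # measurements y(1), ..., y(T)

# Statistical model covariance of the data: Sigma_y = Lam + L Gamma L^T
Sigma_y = Lam + L @ Gamma @ L.T

# Posterior means mu_x(t) = Gamma L^T Sigma_y^{-1} y(t), computed for all t at once
Mu_x = Gamma @ L.T @ np.linalg.solve(Sigma_y, Y)

# Posterior covariance Sigma_x = Gamma - Gamma L^T Sigma_y^{-1} L Gamma
Sigma_x = Gamma - Gamma @ L.T @ np.linalg.solve(Sigma_y, L @ Gamma)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the numerically preferable way to apply Σ_y^{-1}.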
(7)

The estimated posterior parameters μ_x(t) and Σ_x are then in turn used to update Γ and Λ as the minimizers of the negative log of the marginal likelihood p(Y|Γ, Λ), which is given by (Wipf et al., 2010):

L_II(Γ, Λ) = −log p(Y|Γ, Λ)
           = log|Σ_y| + (1/T) ∑_{t=1}^{T} y(t)^⊤ Σ_y^{-1} y(t)
           = log|Λ + LΓL^⊤| + (1/T) ∑_{t=1}^{T} y(t)^⊤ (Λ + LΓL^⊤)^{-1} y(t) ,  (8)

where |·| denotes the determinant of a matrix. This process is repeated until convergence. Given the final solution of the hyperparameters Θ^II = {Γ^II, Λ^II}, the posterior source distribution is obtained by plugging these estimates into equations 4 to 6.
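The Type-II objective in equation 8 can be evaluated directly, which is useful for monitoring convergence. A minimal sketch (the function and variable names are our own):

```python
import numpy as np

def type2_nll(Y, L, Gamma, Lam):
    """Negative log marginal likelihood L_II(Gamma, Lam) of equation 8:
    log|Sigma_y| + (1/T) sum_t y(t)^T Sigma_y^{-1} y(t),
    with Sigma_y = Lam + L Gamma L^T (Gaussian normalization constants dropped)."""
    T = Y.shape[1]
    Sigma_y = Lam + L @ Gamma @ L.T
    sign, logdet = np.linalg.slogdet(Sigma_y)
    if sign <= 0:
        raise ValueError("Sigma_y must be positive definite")
    # elementwise product + sum computes sum_t y(t)^T Sigma_y^{-1} y(t)
    quad = np.sum(Y * np.linalg.solve(Sigma_y, Y)) / T
    return logdet + quad
```

`np.linalg.slogdet` avoids the overflow that a naive `log(det(...))` would risk for larger sensor counts.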

3. PROPOSED METHOD: FULL-STRUCTURE NOISE (FUN) LEARNING

Here we propose a novel and efficient algorithm, full-structure noise (FUN) learning, which is able to learn the full covariance structure of the noise jointly within the Bayesian Type-II regression framework. We first formulate the algorithm in its most general form, in which both the noise distribution and the prior have full covariance structure. Later, we make the simplifying assumption of independent source priors, leading to the pruning of the majority of sources. This effect, which has also been referred to as automatic relevance determination (ARD) or sparse Bayesian learning (SBL), is beneficial in our application of interest, namely the reconstruction of parsimonious sets of brain sources underlying experimental EEG measurements. Note that the Type-II cost function in equation 8 is non-convex and thus non-trivial to optimize. A number of iterative schemes such as majorization-minimization (MM) (Sun et al., 2017) have been proposed to address this challenge. Following the MM scheme, we first construct convex surrogate functions that majorize L_II(Γ, Λ) in each iteration of the optimization algorithm. Then, we show the minimization equivalence between the constructed majorizing functions and equation 8. This result is presented in the following theorem:

Theorem 1. Let Λ^k and Σ_y^k be fixed values obtained in the k-th iteration of the optimization algorithm minimizing L_II(Γ, Λ). Then, optimizing the non-convex Type-II ML cost function in equation 8, L_II(Γ, Λ), with respect to Γ is equivalent to optimizing the following convex function, which majorizes equation 8:

L_source^conv(Γ, Λ^k) = tr((C_S^k)^{-1} Γ) + tr(M_S^k Γ^{-1}) ,  (9)

where C_S^k and M_S^k are defined as:

C_S^k := (L^⊤ (Σ_y^k)^{-1} L)^{-1} ,  M_S^k := (1/T) ∑_{t=1}^{T} x^k(t) x^k(t)^⊤ .
(10)

Similarly, optimizing L_II(Γ, Λ) with respect to Λ is equivalent to optimizing the following convex majorizing function:

L_noise^conv(Γ^k, Λ) = tr((C_N^k)^{-1} Λ) + tr(M_N^k Λ^{-1}) ,  (11)

where C_N^k and M_N^k are defined as:

C_N^k := Σ_y^k ,  M_N^k := (1/T) ∑_{t=1}^{T} (y(t) − L x^k(t))(y(t) − L x^k(t))^⊤ .  (12)

Proof. The proof is presented in Appendix A.

We continue by considering the optimization of the cost functions L_source^conv(Γ, Λ^k) and L_noise^conv(Γ^k, Λ) with respect to Γ and Λ, respectively. Note that in the case of source covariances with full structure, the solution of L_source^conv(Γ, Λ^k) with respect to Γ lies on the (N² + N)/2-dimensional Riemannian manifold of positive definite (P.D.) matrices. This consideration enables us to invoke efficient methods from Riemannian geometry (see Petersen et al., 2006; Berger, 2012; Jost & Jost, 2008), which ensure that the solution at each step of the optimization is contained within the lower-dimensional solution space. Specifically, in order to optimize the source covariance, the algorithm calculates the geometric mean between the previously obtained statistical model source covariance, C_S^k, and the source-space sample covariance matrix, M_S^k, in each iteration. Analogously, to update the noise covariance estimate, the algorithm calculates the geometric mean between the model noise covariance, C_N^k, and the empirical sensor-space residual covariance, M_N^k. The update rules obtained from this procedure are presented in the following theorem:

Theorem 2. The cost functions L_source^conv(Γ, Λ^k) and L_noise^conv(Γ^k, Λ) are both strictly geodesically convex with respect to the P.D. manifold, and their optimal solutions with respect to Γ and Λ, respectively, can be attained according to the following two update rules:

Γ^{k+1} ← (C_S^k)^{1/2} ((C_S^k)^{-1/2} M_S^k (C_S^k)^{-1/2})^{1/2} (C_S^k)^{1/2} ,  (13)
Λ^{k+1} ← (C_N^k)^{1/2} ((C_N^k)^{-1/2} M_N^k (C_N^k)^{-1/2})^{1/2} (C_N^k)^{1/2} .  (14)

Proof. A detailed proof can be found in Appendix B.
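Both update rules are instances of the same matrix geometric mean, which can be computed with symmetric eigendecompositions. A minimal numpy sketch (helper names are our own):

```python
import numpy as np

def sqrtm_pd(P):
    """Principal square root of a symmetric P.D. matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * np.sqrt(w)) @ V.T

def geometric_mean(C, M):
    """C^{1/2} (C^{-1/2} M C^{-1/2})^{1/2} C^{1/2}: midpoint of the geodesic
    joining the P.D. matrices C and M, as used for the Gamma and Lambda updates."""
    C_half = sqrtm_pd(C)
    C_half_inv = np.linalg.inv(C_half)
    G = C_half @ sqrtm_pd(C_half_inv @ M @ C_half_inv) @ C_half
    return (G + G.T) / 2   # symmetrize against floating-point round-off
```

A convenient numerical sanity check is that the geometric mean G is the unique P.D. solution of the Riccati equation G C^{-1} G = M.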
Convergence of the resulting algorithm is shown in the following theorem.

Theorem 3. Optimizing the non-convex Type-II ML cost function in equation 8, L_II(Γ, Λ), with the alternating update rules for Γ and Λ in equations 13 and 14 leads to an MM algorithm with guaranteed convergence.

Proof. A detailed proof can be found in Appendix C.

While Theorems 1-3 reflect a general joint learning algorithm, the assumption of sources with full covariance structure is often relaxed in practice. The next section will shed light on this important simplification by making a formal connection to SBL algorithms.

3.1. SPARSE BAYESIAN LEARNING WITH FULL NOISE MODELING

In brain source imaging, the assumption of full source covariance is often relaxed. Even if, technically, most parts of the brain are active at all times, and the concurrent activations of different brain regions can never be assumed to be fully uncorrelated, there are many experimental settings in which it is reasonable to assume only a small set of independent brain sources. Such sparse solutions are physiologically plausible in task-based analyses, where only a fraction of the brain's macroscopic structures is expected to be consistently engaged. A common strategy in this case is to model independent sources through a diagonal covariance matrix. In the Type-II Bayesian learning framework, this simplification interestingly leads to sparsity of the resulting source distributions, as, at the optimum, many of the estimated source variances are zero. This mechanism is known as sparse Bayesian learning and is closely related to the more general concept of automatic relevance determination. Here, we adopt the SBL assumption for the sources, leading to Γ-updates previously described in the BSI literature under the name Champagne (Wipf & Nagarajan, 2009) . As a novelty and main focus of this paper, we here equip the SBL framework with the capability to jointly learn full noise covariances through the geometric mean based update rule in equation 14. In the SBL framework, the N modeled brain sources are assumed to follow independent univariate Gaussian distributions with zero mean and distinct unknown variances γ n : x n (t) ∼ N (0, γ n ), n = 1, . . . , N . In the SBL solution, the majority of variances is zero, thus effectively inducing spatial sparsity of the corresponding source activities. For FUN learning, we also impose a diagonal structure on the source covariance matrix, Γ = diag(γ), where γ = [γ 1 , . . . , γ N ] . 
Algorithm 1: Full-structure noise (FUN) learning
Input: The lead field matrix L ∈ R^{M×N} and the measurement vectors y(t) ∈ R^{M×1}, t = 1, ..., T.
Result: The estimated prior source variances [γ_1, ..., γ_N]^⊤, the noise covariance Λ, and the posterior mean μ_x(t) and covariance Σ_x of the sources.
Set random initial values for Λ and γ = [γ_1, ..., γ_N]^⊤, and construct Γ = diag(γ).
Repeat
  Calculate the statistical covariance Σ_y = Λ + L Γ L^⊤.
  Calculate the posterior mean as μ_x(t) = Γ L^⊤ Σ_y^{-1} y(t).
  Calculate the posterior covariance as Σ_x = Γ − Γ L^⊤ Σ_y^{-1} L Γ.
  Update the source variances γ_n according to equation 15 and the noise covariance Λ according to equation 14.
Until convergence.

By constraining Γ in equation 9 to the set of diagonal matrices, W, we can show that the update rule in equation 13 for the source variances simplifies to the following form:

γ_n^{k+1} ← √( [M_S^k]_{n,n} / [(C_S^k)^{-1}]_{n,n} ) = √( ( (1/T) ∑_{t=1}^{T} (x_n^k(t))² ) / ( L_n^⊤ (Σ_y^k)^{-1} L_n ) )  for n = 1, ..., N ,  (15)

where L_n denotes the n-th column of the lead field matrix. Interestingly, equation 15 is identical to the update rule of the Champagne algorithm. A detailed derivation of equation 15 can be found in Appendix D. Summarizing, the FUN learning approach, just like Champagne and other SBL algorithms, assumes independent Gaussian sources with individual variances (thus, diagonal source covariances), which are updated through equation 15. Departing from the classical SBL setting, which assumes the noise distribution to be known, FUN models noise with full covariance structure, which is updated using equation 14. Algorithm 1 summarizes these update rules. Note that various recent Type-II noise learning schemes for diagonal noise covariance matrices (Hashemi et al., 2020; Cai et al., 2020a) that are rooted in the concept of SBL can also be derived as special cases of FUN learning by assuming diagonal source and noise covariances, i.e., Γ, Λ ∈ W. Specifically, imposing diagonal structure on the noise covariance matrix Λ of the FUN algorithm results in noise variance update rules identical to those derived in Cai et al.
(2020a) for heteroscedastic noise, and in Hashemi et al. (2020) for homoscedastic noise. We explicitly demonstrate this connection in Appendix E. Here, we note that heteroscedasticity refers to the common phenomenon that measurements are contaminated with non-uniform noise levels across channels, whereas homoscedasticity assumes uniform noise levels.
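Putting the pieces together, the FUN iteration of Algorithm 1 can be sketched compactly. The following is a simplified illustration under our own naming, initialization, and convergence choices, not the authors' reference implementation:

```python
import numpy as np

def _sqrtm_pd(P):
    # principal square root of a symmetric PSD matrix
    w, V = np.linalg.eigh(P)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def _geom_mean(C, M):
    # C^{1/2} (C^{-1/2} M C^{-1/2})^{1/2} C^{1/2}  (equation 14)
    Ch = _sqrtm_pd(C)
    Chi = np.linalg.inv(Ch)
    G = Ch @ _sqrtm_pd(Chi @ M @ Chi) @ Ch
    return (G + G.T) / 2

def fun_learning(Y, L, n_iter=200, tol=1e-8):
    """Sketch of FUN learning: diagonal source variances updated via the
    Champagne-style rule (eq. 15), full-structure noise covariance updated
    via the geometric mean of Sigma_y and the residual covariance (eq. 14)."""
    M_sens, T = Y.shape
    N = L.shape[1]
    gamma = np.ones(N)
    Lam = np.eye(M_sens)
    for _ in range(n_iter):
        Sigma_y = Lam + (L * gamma) @ L.T            # Lam + L diag(gamma) L^T
        Sy_inv = np.linalg.inv(Sigma_y)
        X = (gamma[:, None] * L.T) @ Sy_inv @ Y      # posterior means mu_x(t)
        # source variance update (eq. 15)
        num = np.mean(X ** 2, axis=1)                # [M_S]_{n,n}
        den = np.sum(L * (Sy_inv @ L), axis=0)       # L_n^T Sigma_y^{-1} L_n
        gamma_new = np.sqrt(num / np.maximum(den, 1e-32))
        # noise covariance update (eq. 14)
        R = Y - L @ X
        Lam = _geom_mean(Sigma_y, (R @ R.T) / T)
        if np.linalg.norm(gamma_new - gamma) <= tol * (np.linalg.norm(gamma) + 1e-32):
            gamma = gamma_new
            break
        gamma = gamma_new
    return gamma, Lam, X
```

In this sketch the diagonal structure of Γ is exploited so that no N×N matrices are ever formed, which matters when N is in the thousands, as in the lead fields used below.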

4. NUMERICAL SIMULATIONS AND REAL DATA ANALYSIS

Source, Noise and Forward Model: We simulated a sparse set of N_0 = 5 active brain sources placed at random positions on the cortex. To simulate the electrical neural activity of these sources, T = 200 identically and independently distributed (i.i.d.) points were sampled from a Gaussian distribution, yielding sparse source activation vectors x(t). The resulting source distribution, represented as X = [x(1), ..., x(T)], was projected to the EEG sensors through application of the lead field matrix as the forward operator: Y_signal = LX. The lead field matrix, L ∈ R^{58×2004}, was generated using the New York Head model (Huang et al., 2016), taking into account the realistic anatomy and electrical tissue conductivities of an average human head. Further details regarding forward modeling are provided in Appendix F. Gaussian additive noise was randomly sampled from a zero-mean normal distribution with full covariance matrix Λ: e(t) ∈ R^{M×1} ∼ N(0, Λ), t = 1, ..., T. This setting is further referred to as full-structure noise. Note that we also generated noise with diagonal covariance matrix, referred to as heteroscedastic noise, in order to investigate the effect of model violation on reconstruction performance. The noise matrix E = [e(1), ..., e(T)] ∈ R^{M×T} was normalized by its Frobenius norm and added to the signal matrix Y_signal as follows:

Y = Y_signal + ((1 − α) ‖Y_signal‖_F / (α ‖E‖_F)) E ,

where α determines the signal-to-noise ratio (SNR) in sensor space. Precisely, the SNR is obtained as SNR = 20 log_10(α/(1 − α)). In the subsequently described experiments the following values of α were used: α = {0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.65, 0.7, 0.8}, which correspond to the following SNRs: {−7.4, −5.4, −3.5, −1.7, 0, 1.7, 5.4, 7.4, 12} (dB). MATLAB codes for producing the results in the simulation study are uploaded here.
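The noise-mixing step and the resulting SNR can be reproduced in a few lines. The sketch below uses toy stand-in data and our own variable names:

```python
import numpy as np

rng = np.random.default_rng(42)
M, T = 58, 200
Y_signal = rng.standard_normal((M, T))      # stands in for L X

# full-structure noise: e(t) ~ N(0, Lam) for a random P.D. covariance Lam
A = rng.standard_normal((M, M))
Lam = (A @ A.T) / M + np.eye(M)
E = np.linalg.cholesky(Lam) @ rng.standard_normal((M, T))

alpha = 0.5                                 # SNR = 20 log10(alpha / (1 - alpha)) = 0 dB
scale = (1 - alpha) * np.linalg.norm(Y_signal, 'fro') / (alpha * np.linalg.norm(E, 'fro'))
Y = Y_signal + scale * E

snr_db = 20 * np.log10(alpha / (1 - alpha))
```

By construction, the Frobenius norm of the added noise equals (1 − α)/α times that of the signal, which is exactly what the stated SNR formula measures.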
Evaluation Metrics and Simulation Set-up: We applied the full-structure noise learning approach to the synthetic datasets described above to recover the locations and time courses of the active brain sources. In addition to our proposed approach, two further Type-II Bayesian learning schemes, namely Champagne with homo- and heteroscedastic noise learning (Hashemi et al., 2020; Cai et al., 2020a), were included as benchmarks with respect to source reconstruction performance and noise covariance estimation accuracy. Source reconstruction performance was evaluated according to the earth mover's distance (EMD) (Rubner et al., 2000), the error in the reconstruction of the source time courses, the average Euclidean distance (EUCL) (in mm) between each simulated source and the best (in terms of absolute correlation) matching reconstructed source, and finally the F1-measure score (Chinchor & Sundheim, 1993). A detailed definition of the evaluation metrics is provided in Appendix F. To evaluate the accuracy of the noise covariance matrix estimation, the following two metrics were calculated: the Pearson correlation between the original and reconstructed noise covariance matrices, Λ and Λ̂, denoted by Λ_sim, and the normalized mean squared error (NMSE) between Λ and Λ̂, defined as NMSE = ‖Λ̂ − Λ‖_F² / ‖Λ‖_F². Note that NMSE measures the reconstruction of the true scale of the noise covariance matrix, while Λ_sim is scale-invariant and hence only quantifies the overall structural similarity between simulated and estimated noise covariance matrices. Each simulation was carried out 100 times using different instances of X and E, and the mean and standard error of the mean (SEM) of each performance measure across repetitions were calculated. Convergence of the optimization programs for each run was assumed if the relative change of the Frobenius norm of the reconstructed sources between subsequent iterations was less than 10^{-8}.
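The two noise-covariance metrics can be written down directly; the sketch below also illustrates why Λ_sim is scale-invariant while NMSE is not (the function name is our own):

```python
import numpy as np

def noise_cov_metrics(Lam_true, Lam_hat):
    """Pearson correlation between the entries of the two matrices
    (structural similarity, scale-invariant), and
    NMSE = ||Lam_hat - Lam_true||_F^2 / ||Lam_true||_F^2 (scale-sensitive)."""
    sim = np.corrcoef(Lam_true.ravel(), Lam_hat.ravel())[0, 1]
    nmse = (np.linalg.norm(Lam_hat - Lam_true, 'fro') ** 2
            / np.linalg.norm(Lam_true, 'fro') ** 2)
    return sim, nmse
```

For instance, for an estimate that is off only by a factor of two, Λ̂ = 2Λ, the similarity is a perfect 1 while the NMSE equals 1, i.e., a 100% scale error.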
A maximum of 1000 iterations was carried out if no convergence was reached beforehand. Figure 1 shows two simulated datasets with five active sources in the presence of full-structure noise (upper panel) as well as heteroscedastic noise (lower panel) at 0 dB SNR. Topographic maps depict the locations of the ground-truth active brain sources (first column) along with the source reconstruction results of three noise learning schemes assuming noise with homoscedastic (second column), heteroscedastic (third column), and full (fourth column) structure. For each algorithm, the estimated noise covariance matrix is also plotted above the topographic map. Source reconstruction performance was measured in terms of EMD and time course correlation (Corr) and is summarized in the table next to each panel. In addition, the accuracy of the noise covariance matrix reconstruction was measured in terms of Λ_sim and NMSE; results are included in the same table. Figure 1 (upper panel) allows for a direct comparison of the estimated noise covariance matrices obtained from the three different noise learning schemes. It can be seen that FUN learning can better capture the overall structure of the ground-truth full-structure noise, as evidenced by lower NMSE and higher structural similarity (Λ_sim) compared to the heteroscedastic and homoscedastic algorithm variants, which are only able to recover a diagonal matrix while constraining the off-diagonal elements to zero. This behaviour results in higher spatial and temporal accuracy (lower EMD and time course error) for FUN learning compared to competing algorithms assuming diagonal noise covariance. This advantage is also visible in the topographic maps. The lower panel of Figure 1 presents analogous results for the setting where the noise covariance is generated according to a heteroscedastic model.
Note that the superior spatial and temporal reconstruction performance of the heteroscedastic noise learning algorithm compared to the full-structure scheme is expected here because the simulated ground-truth noise is indeed heteroscedastic. The full-structure noise learning approach, however, provides fairly reasonable performance in terms of EMD, time course correlation (Corr), and Λ_sim, even though it is designed to estimate a full-structure noise covariance matrix. The convergence behaviour of all three noise learning variants is also illustrated in Figure 1. Note that the full-structure noise learning approach eventually reaches lower negative log-likelihood values in both scenarios, namely full-structure and heteroscedastic noise. Figure 2 compares the three algorithm variants, assuming homoscedastic, heteroscedastic (green), and full-structure (blue) noise covariances, for a range of 10 SNR values. The upper panel presents the evaluation metrics for the setting where the simulated noise covariance has full structure, while the lower panel depicts the same metrics for simulated noise with heteroscedastic (diagonal) covariance. Concerning the first setting, FUN learning consistently outperforms its homoscedastic and heteroscedastic counterparts according to all evaluation metrics, in particular in low-SNR settings. Consequently, as the SNR decreases, the gap between FUN learning and the two other variants increases. Conversely, heteroscedastic noise learning shows an improvement over FUN learning according to all evaluation metrics when the simulated noise is indeed heteroscedastic. However, the magnitude of this improvement is not as large as that observed for the setting where the noise covariance is generated according to a full-structure model and then estimated using the FUN approach.

Analysis of Auditory Evoked Fields (AEF): Figure 3 shows the reconstructed sources of the auditory evoked fields (AEF) versus the number of trials from a single representative subject using the FUN learning algorithm.
Further details on this dataset can be found in Appendix G. We tested the reconstruction performance of FUN learning with the number of trials limited to 1, 2, 12, 63, and 120. Each reconstruction was performed 30 times, with the specific trials chosen as a random subset of all available trials. As the subplots for different trial counts demonstrate, the FUN learning algorithm is able to correctly localize bilateral auditory activity to Heschl's gyrus, the characteristic location of the primary auditory cortex, from only a few trials or even a single trial.

5. DISCUSSION

This paper focused on sparse regression within the hierarchical Bayesian regression framework and its application in EEG/MEG brain source imaging. To this end, we developed an algorithm that is, however, suitable for a much wider range of applications. What is more, the same concepts used here for full-structure noise learning could be employed in other contexts where hyperparameters such as kernel widths in Gaussian process regression (Wu et al., 2019) or dictionary elements in the dictionary learning problem (Dikmen & Févotte, 2012) are to be inferred. Besides, the FUN learning algorithm may also prove useful in practical scenarios in which model residuals are expected to be correlated, e.g., probabilistic canonical correlation analysis (CCA) (Bach & Jordan, 2005), spectral independent component analysis (ICA) (Ablin et al., 2020), wireless communication (Prasad et al., 2015; Gerstoft et al., 2016; Haghighatshoar & Caire, 2017; Khalilsarai et al., 2020), robust portfolio optimization in finance (Feng et al., 2016), graph learning (Kumar et al., 2020), thermal field reconstruction (Flinth & Hashemi, 2018), and brain functional imaging (Wei et al., 2020). Noise learning has also attracted attention in functional magnetic resonance imaging (fMRI) (Cai et al., 2016; Shvartsman et al., 2018; Cai et al., 2019b; 2020b; Wei et al., 2020), where various models such as matrix-normal (MN), factor analysis (FA), and Gaussian-process (GP) regression have been proposed. The majority of noise learning algorithms in the fMRI literature rely on the EM framework, which is quite slow in practice and offers convergence guarantees only under strong conditions. In contrast to these existing approaches, our proposed framework not only applies to the models considered in these papers but also benefits from theoretically proven convergence guarantees.
To be more specific, we showed in this paper that FUN learning is an instance of the wider class of majorization-minimization (MM) algorithms, for which provable and fast convergence is guaranteed. It is worth emphasizing our contribution within the MM optimization context as well. In many MM implementations, surrogate functions are minimized using an iterative approach. Our proposed algorithm, however, obtains a closed-form solution of the surrogate function in each step, which further improves its efficiency. Several related works assume more complex spatiotemporal noise covariance structures. A common limitation of these works, however, is that the noise level is not estimated as part of the source reconstruction problem on task-related data but from separate noise recordings. Our proposed algorithm substantially differs in this respect, as it learns the noise covariance jointly with the brain source distribution. Note that the idea of jointly estimating brain source activity and noise covariance has been previously proposed for Type-I learning methods (Massias et al., 2018; Bertrand et al., 2019). In contrast to these Type-I methods, FUN is a Type-II method, which learns the prior source distribution as part of the model fitting. Type-II methods have been reported to yield consistently superior results compared to Type-I methods (Owen et al., 2012; Cai et al., 2019a; 2020a; Hashemi et al., 2020). Our numerical results show that the same holds for FUN learning, which performs on par with or better than existing variants from the Type-II family (including conventional Champagne) in this study. We plan to provide a formal comparison of the performance of noise learning within Type-I and Type-II estimation in our future work. While being broadly applicable, our approach is also limited by a number of factors.
Although Gaussian noise distributions are commonly justified, it would be desirable to also include more robust (e.g., heavy-tailed) non-Gaussian noise distributions in our framework. Another limitation is that the superior performance of the full-structure noise learning technique comes at the expense of higher computational complexity compared to the variants assuming homoscedastic or heteroscedastic structure. Besides, signals in real-world scenarios often lie in a lower-dimensional space than the original high-dimensional ambient space due to correlations that inherently exist in the structure of the data. Therefore, imposing physiologically plausible constraints on the noise model, e.g., low-rank or Toeplitz structure, not only provides side information that can be leveraged for the reconstruction but also reduces the computational cost in two ways: a) by reducing the number of parameters, and b) by taking advantage of efficient implementations using circular embeddings and the fast Fourier transform (Babu, 2016). Exploring efficient ways to incorporate these structural assumptions within a Riemannian framework is another direction of future work.

6. CONCLUSION

This paper proposes an efficient optimization algorithm for jointly estimating Gaussian regression parameter distributions as well as Gaussian noise distributions with full covariance structure within a hierarchical Bayesian framework. Using the Riemannian geometry of positive definite matrices, we derived an efficient algorithm for jointly estimating source and noise covariances. The benefits of our proposed framework were evaluated within an extensive set of experiments in the context of the electromagnetic brain source imaging inverse problem, and showed significant improvement upon state-of-the-art techniques in the realistic scenario where the noise has full covariance structure. The performance of our method was further assessed through a real-data analysis of the auditory evoked field (AEF) dataset.

A PROOF OF THEOREM 1

Proof. We start the proof by recalling equation 8:

L_II(Γ, Λ) = −log p(Y|Γ, Λ) = log|Σ_y| + (1/T) ∑_{t=1}^{T} y(t)^⊤ Σ_y^{-1} y(t) .

The upper bound on the log|Σ_y| term can be directly inferred from the concavity of the log-determinant function and its first-order Taylor expansion around the value from the previous iteration, Σ_y^k, which provides the following inequality (Sun et al., 2017, Example 2):

log|Σ_y| ≤ log|Σ_y^k| + tr((Σ_y^k)^{-1}(Σ_y − Σ_y^k))
        = log|Σ_y^k| + tr((Σ_y^k)^{-1} Σ_y) − tr((Σ_y^k)^{-1} Σ_y^k) .  (17)

Note that the first and last terms in equation 17 do not depend on Γ; hence, they can be ignored in the optimization procedure. Now, we decompose Σ_y into two terms, each of which contains either only the noise or only the source covariance:

tr((Σ_y^k)^{-1} Σ_y) = tr((Σ_y^k)^{-1}(Λ + LΓL^⊤)) = tr((Σ_y^k)^{-1} Λ) + tr((Σ_y^k)^{-1} LΓL^⊤) .  (18)

In the next step, we decompose the second term of equation 8, (1/T) ∑_{t=1}^{T} y(t)^⊤ Σ_y^{-1} y(t), into two terms, each of which is a function of either only the noise or only the source covariance.
To this end, we exploit the following relationship between sensor- and source-space covariances:

$$\frac{1}{T}\sum_{t=1}^{T}\mathbf{y}(t)^{\top}\boldsymbol{\Sigma}_{\mathbf{y}}^{-1}\mathbf{y}(t) = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}^{k}(t)^{\top}\boldsymbol{\Gamma}^{-1}\mathbf{x}^{k}(t) + \frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right)^{\top}\boldsymbol{\Lambda}^{-1}\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right)\;. \tag{19}$$

By combining equation 18 and equation 19, rearranging terms, and ignoring all terms that do not depend on $\boldsymbol{\Gamma}$, we have:

$$\mathcal{L}^{\mathrm{II}}(\boldsymbol{\Gamma}) \le \operatorname{tr}\!\left[(\boldsymbol{\Sigma}_{\mathbf{y}}^{k})^{-1}\mathbf{L}\boldsymbol{\Gamma}\mathbf{L}^{\top}\right] + \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}^{k}(t)^{\top}\boldsymbol{\Gamma}^{-1}\mathbf{x}^{k}(t) + \mathrm{const} = \operatorname{tr}\!\left[(\mathbf{C}_{S}^{k})^{-1}\boldsymbol{\Gamma}\right] + \operatorname{tr}\!\left[\mathbf{M}_{S}^{k}\boldsymbol{\Gamma}^{-1}\right] + \mathrm{const} = \mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^{k}) + \mathrm{const}\;, \tag{20}$$

where $\mathbf{C}_{S}^{k} = \left(\mathbf{L}^{\top}(\boldsymbol{\Sigma}_{\mathbf{y}}^{k})^{-1}\mathbf{L}\right)^{-1}$ and $\mathbf{M}_{S}^{k} = \frac{1}{T}\sum_{t=1}^{T}\mathbf{x}^{k}(t)\mathbf{x}^{k}(t)^{\top}$. The constant terms in equation 20 do not depend on $\boldsymbol{\Gamma}$; hence, they can be ignored in the optimization procedure. This proves the equivalence of equation 8 and equation 9 when the optimization is performed with respect to $\boldsymbol{\Gamma}$. The equivalence of equation 8 and equation 11 can be shown analogously, with the difference that we only focus on the noise-related terms in equation 18 and equation 19:

$$\mathcal{L}^{\mathrm{II}}(\boldsymbol{\Lambda}) \le \operatorname{tr}\!\left[(\boldsymbol{\Sigma}_{\mathbf{y}}^{k})^{-1}\boldsymbol{\Lambda}\right] + \frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right)^{\top}\boldsymbol{\Lambda}^{-1}\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right) + \mathrm{const} = \operatorname{tr}\!\left[(\mathbf{C}_{N}^{k})^{-1}\boldsymbol{\Lambda}\right] + \operatorname{tr}\!\left[\mathbf{M}_{N}^{k}\boldsymbol{\Lambda}^{-1}\right] + \mathrm{const} = \mathcal{L}^{\mathrm{conv}}_{\mathrm{noise}}(\boldsymbol{\Gamma}^{k},\boldsymbol{\Lambda}) + \mathrm{const}\;, \tag{21}$$

where $\mathbf{C}_{N}^{k} = \boldsymbol{\Sigma}_{\mathbf{y}}^{k}$ and $\mathbf{M}_{N}^{k} = \frac{1}{T}\sum_{t=1}^{T}\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right)\left(\mathbf{y}(t)-\mathbf{L}\mathbf{x}^{k}(t)\right)^{\top}$. The constant terms in equation 21 do not depend on $\boldsymbol{\Lambda}$; hence, they can again be ignored in the optimization procedure. Summarizing, we have shown that optimizing equation 8 is equivalent to optimizing $\mathcal{L}^{\mathrm{conv}}_{\mathrm{noise}}(\boldsymbol{\Gamma}^{k},\boldsymbol{\Lambda})$ and $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^{k})$, which concludes the proof.
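The first-order bound in equation 17 is easy to verify numerically. The sketch below is our own illustration (not part of the original derivation): it draws random positive definite matrices and checks both the majorization property and the tightness of the bound at the expansion point.

```python
import numpy as np

def logdet(S):
    """Numerically stable log-determinant of a P.D. matrix."""
    return np.linalg.slogdet(S)[1]

def logdet_surrogate(S, S_k):
    """First-order Taylor expansion of log|S| around S_k (the bound in Eq. 17)."""
    return logdet(S_k) + np.trace(np.linalg.inv(S_k) @ (S - S_k))

def random_pd(m, rng):
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)   # well-conditioned P.D. matrix

rng = np.random.default_rng(0)
S, S_k = random_pd(6, rng), random_pd(6, rng)

# concavity of log-det: the tangent plane majorizes the function ...
assert logdet(S) <= logdet_surrogate(S, S_k) + 1e-9
# ... and the bound is tight at the expansion point S_k
assert np.isclose(logdet(S_k), logdet_surrogate(S_k, S_k))
```

Because the bound is linear in $\boldsymbol{\Sigma}_{\mathbf{y}}$, it separates additively over the noise and source terms, which is exactly what equation 18 exploits.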

B PROOF OF THEOREM 2

Before presenting the proof, the following definitions and propositions are required:

Definition 4 (Geodesic path). Let $\mathcal{M}$ be a Riemannian manifold, i.e., a differentiable manifold whose tangent space is endowed with an inner product that defines local Euclidean structures. A geodesic between two points $p_0, p_1 \in \mathcal{M}$ is the shortest connecting path between those two points along the manifold, $\zeta_l(p_0, p_1) \in \mathcal{M}$ for $l \in [0,1]$, where $l=0$ and $l=1$ define the start and end points of the path, respectively. In the current context, $\zeta_l(p_0, p_1)$ defines a geodesic curve on the positive definite (P.D.) manifold joining two P.D. matrices $\mathbf{P}_0, \mathbf{P}_1 \succ 0$. The specific pairs of matrices we will deal with are $\{\mathbf{C}_S^k, \mathbf{M}_S^k\}$ and $\{\mathbf{C}_N^k, \mathbf{M}_N^k\}$.

Definition 5 (Geodesic on the P.D. manifold). Geodesics on the manifold of P.D. matrices, denoted by $\mathcal{S}^{++}$, can be shown to form a cone within the embedding space. Assume two P.D. matrices $\mathbf{P}_0, \mathbf{P}_1 \in \mathcal{S}^{++}$. Then, for $l \in [0,1]$, the geodesic curve joining $\mathbf{P}_0$ to $\mathbf{P}_1$ is defined as (Bhatia, 2009, Chapter 6):

$$\xi_l(\mathbf{P}_0, \mathbf{P}_1) = \mathbf{P}_0^{1/2}\left(\mathbf{P}_0^{-1/2}\mathbf{P}_1\mathbf{P}_0^{-1/2}\right)^{l}\mathbf{P}_0^{1/2}\;, \quad l \in [0,1]\;. \tag{22}$$

Note that $\mathbf{P}_0$ and $\mathbf{P}_1$ are obtained as the start and end points of the geodesic path by choosing $l=0$ and $l=1$, respectively. The midpoint of the geodesic, obtained by setting $l = 1/2$, is called the geometric mean. Note that, according to Definition 5, the following equality holds:

$$\xi_l(\boldsymbol{\Gamma}_0, \boldsymbol{\Gamma}_1)^{-1} = \left[\boldsymbol{\Gamma}_0^{1/2}\left(\boldsymbol{\Gamma}_0^{-1/2}\boldsymbol{\Gamma}_1\boldsymbol{\Gamma}_0^{-1/2}\right)^{l}\boldsymbol{\Gamma}_0^{1/2}\right]^{-1} = \boldsymbol{\Gamma}_0^{-1/2}\left(\boldsymbol{\Gamma}_0^{1/2}\boldsymbol{\Gamma}_1^{-1}\boldsymbol{\Gamma}_0^{1/2}\right)^{l}\boldsymbol{\Gamma}_0^{-1/2} = \xi_l(\boldsymbol{\Gamma}_0^{-1}, \boldsymbol{\Gamma}_1^{-1})\;. \tag{23}$$

Definition 6 (Geodesic convexity). Let $p_0$ and $p_1$ be two arbitrary points on a subset $\mathcal{A}$ of a Riemannian manifold $\mathcal{M}$.
Then a real-valued function $f : \mathcal{A} \to \mathbb{R}$ is called geodesically convex (g-convex) if the following relation holds:

$$f(\zeta_l(p_0, p_1)) \le (1-l)f(p_0) + l f(p_1)\;, \tag{24}$$

where $l \in [0,1]$ and $\zeta_l(p_0, p_1)$ denotes the geodesic path connecting $p_0$ and $p_1$ as defined in Definition 4. Thus, in analogy to classical convexity, $f$ is g-convex if every geodesic $\zeta_l(p_0, p_1)$ of $\mathcal{M}$ between $p_0, p_1 \in \mathcal{A}$ lies in the g-convex set $\mathcal{A}$. A set $\mathcal{A} \subset \mathcal{M}$ is called g-convex if any geodesic joining an arbitrary pair of its points lies completely in $\mathcal{A}$.

Remark 7. G-convexity is a generalization of classical (linear) convexity to non-Euclidean (non-linear) geometry and metric spaces. It is therefore straightforward to show that all convex functions in Euclidean geometry are also g-convex, with the geodesics between pairs of points being simply line segments: $\zeta_l(p_0, p_1) = (1-l)p_0 + lp_1$.

For the sake of brevity, we omit a detailed theoretical introduction of g-convexity and only borrow a result from Zadeh et al. (2016); Sra & Hosseini (2015). Interested readers are referred to Wiesel et al. (2015, Chapter 1) for a gentle introduction to this topic, and to Papadopoulos (2005, Chapter 2); Rapcsak (1991); Ben-Tal (1977); Liberti (2004) for a more comprehensive treatment.

Proof. We only show the proof for $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$. The proof for $\mathcal{L}^{\mathrm{conv}}_{\mathrm{noise}}(\boldsymbol{\Gamma}^k,\boldsymbol{\Lambda})$ proceeds analogously and is therefore omitted for brevity. We proceed in two steps. First, we limit our attention to P.D. manifolds, express equation 24 in terms of geodesic paths and functions that lie on this particular space, and show that $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$ is strictly g-convex on this specific domain. In the second step, we derive the update rules proposed in equation 13 and equation 14.
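The geodesic formula of Definition 5 and the inversion identity in equation 23 can be checked numerically. The following sketch is our own illustration; `powm` is a helper of ours that computes fractional powers of symmetric P.D. matrices via the eigendecomposition.

```python
import numpy as np

def powm(P, a):
    """Fractional power of a symmetric P.D. matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * w**a) @ V.T

def geodesic(P0, P1, l):
    """xi_l(P0, P1): geodesic on the P.D. manifold (Definition 5, Eq. 22)."""
    return powm(P0, 0.5) @ powm(powm(P0, -0.5) @ P1 @ powm(P0, -0.5), l) @ powm(P0, 0.5)

def random_pd(m, rng):
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)

rng = np.random.default_rng(1)
P0, P1 = random_pd(5, rng), random_pd(5, rng)

assert np.allclose(geodesic(P0, P1, 0.0), P0)   # start point (l = 0)
assert np.allclose(geodesic(P0, P1, 1.0), P1)   # end point (l = 1)

# inversion identity (Eq. 23): xi_l(P0, P1)^{-1} = xi_l(P0^{-1}, P1^{-1})
l = 0.3
lhs = np.linalg.inv(geodesic(P0, P1, l))
rhs = geodesic(np.linalg.inv(P0), np.linalg.inv(P1), l)
assert np.allclose(lhs, rhs)
```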

B.1 PART I: PROVING G-CONVEXITY OF THE MAJORIZING COST FUNCTIONS

We consider geodesics along the P.D. manifold by setting $\zeta_l(p_0,p_1)$ to $\xi_l(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)$ as presented in Definition 5, and define $f(\boldsymbol{\Gamma}) = \operatorname{tr}[(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma}] + \operatorname{tr}[\mathbf{M}_S^k\boldsymbol{\Gamma}^{-1}]$, representing the cost function $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$. We now show that $f(\boldsymbol{\Gamma})$ is strictly g-convex on this specific domain. For continuous functions as considered in this paper, fulfilling equation 24 with strict inequality for $l = 1/2$ is sufficient to prove strict g-convexity:

$$\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)\right] + \operatorname{tr}\!\left[\mathbf{M}_S^k\,\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)^{-1}\right] < \frac{1}{2}\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma}_0\right] + \frac{1}{2}\operatorname{tr}\!\left[\mathbf{M}_S^k\boldsymbol{\Gamma}_0^{-1}\right] + \frac{1}{2}\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma}_1\right] + \frac{1}{2}\operatorname{tr}\!\left[\mathbf{M}_S^k\boldsymbol{\Gamma}_1^{-1}\right]\;. \tag{26}$$

Given $(\mathbf{C}_S^k)^{-1} \in \mathcal{S}^{++}$ and the operator inequality (Bhatia, 2009, Chapter 4)

$$\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1) \prec \frac{1}{2}\boldsymbol{\Gamma}_0 + \frac{1}{2}\boldsymbol{\Gamma}_1\;, \tag{27}$$

we have:

$$\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)\right] < \frac{1}{2}\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma}_0\right] + \frac{1}{2}\operatorname{tr}\!\left[(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma}_1\right]\;, \tag{28}$$

which is derived by multiplying both sides of equation 27 by $(\mathbf{C}_S^k)^{-1}$ and taking the trace on both sides. Similarly, we can write the operator inequality for $\{\boldsymbol{\Gamma}_0^{-1}, \boldsymbol{\Gamma}_1^{-1}\}$ using equation 23 as:

$$\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)^{-1} = \xi_{1/2}(\boldsymbol{\Gamma}_0^{-1},\boldsymbol{\Gamma}_1^{-1}) \prec \frac{1}{2}\boldsymbol{\Gamma}_0^{-1} + \frac{1}{2}\boldsymbol{\Gamma}_1^{-1}\;. \tag{29}$$

Multiplying both sides of equation 29 by $\mathbf{M}_S^k \in \mathcal{S}^{++}$ and applying the trace operator on both sides leads to:

$$\operatorname{tr}\!\left[\mathbf{M}_S^k\,\xi_{1/2}(\boldsymbol{\Gamma}_0,\boldsymbol{\Gamma}_1)^{-1}\right] < \frac{1}{2}\operatorname{tr}\!\left[\mathbf{M}_S^k\boldsymbol{\Gamma}_0^{-1}\right] + \frac{1}{2}\operatorname{tr}\!\left[\mathbf{M}_S^k\boldsymbol{\Gamma}_1^{-1}\right]\;. \tag{30}$$

Summing up equation 28 and equation 30 proves equation 26 and concludes the first part of the proof.
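The strict midpoint inequality in equation 26 can also be confirmed numerically. In this sketch of ours, randomly drawn P.D. matrices stand in for $(\mathbf{C}_S^k)^{-1}$ and $\mathbf{M}_S^k$, and the geodesic midpoint strictly beats the average of the endpoint costs:

```python
import numpy as np

def powm(P, a):
    w, V = np.linalg.eigh(P)
    return (V * w**a) @ V.T

def geodesic_mid(P0, P1):
    """Geometric mean xi_{1/2}(P0, P1) of two P.D. matrices."""
    return powm(P0, 0.5) @ powm(powm(P0, -0.5) @ P1 @ powm(P0, -0.5), 0.5) @ powm(P0, 0.5)

def random_pd(m, rng):
    A = rng.standard_normal((m, m))
    return A @ A.T + m * np.eye(m)

rng = np.random.default_rng(2)
m = 5
C_inv, M = random_pd(m, rng), random_pd(m, rng)   # stand-ins for (C_S^k)^{-1}, M_S^k
G0, G1 = random_pd(m, rng), random_pd(m, rng)     # two distinct candidate covariances

f = lambda G: np.trace(C_inv @ G) + np.trace(M @ np.linalg.inv(G))
mid = geodesic_mid(G0, G1)
assert f(mid) < 0.5 * f(G0) + 0.5 * f(G1)   # strict g-convexity at the midpoint (Eq. 26)
```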

B.2 PART II: DETAILED DERIVATION OF THE UPDATE RULES IN EQUATIONS 13 AND 14

We now present the second part of the proof by deriving the update rules in equations 13 and 14. Since the cost function $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$ is strictly g-convex, its optimal solution in the $k$-th iteration is unique. More concretely, the optimum can be derived analytically by taking the derivative of equation 9 and setting the result to zero:

$$\nabla\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k) = (\mathbf{C}_S^k)^{-1} - \boldsymbol{\Gamma}^{-1}\mathbf{M}_S^k\boldsymbol{\Gamma}^{-1} = \mathbf{0}\;,$$

which results in

$$\boldsymbol{\Gamma}(\mathbf{C}_S^k)^{-1}\boldsymbol{\Gamma} = \mathbf{M}_S^k\;.$$

This is a Riccati equation, whose solution is the geometric mean between $\mathbf{C}_S^k$ and $\mathbf{M}_S^k$ (Davis et al., 2007; Bonnabel & Sepulchre, 2009):

$$\boldsymbol{\Gamma}^{k+1} = (\mathbf{C}_S^k)^{1/2}\left((\mathbf{C}_S^k)^{-1/2}\mathbf{M}_S^k(\mathbf{C}_S^k)^{-1/2}\right)^{1/2}(\mathbf{C}_S^k)^{1/2}\;.$$

The update rule for the full noise covariance matrix can be derived analogously:

$$\boldsymbol{\Lambda}^{k+1} = (\mathbf{C}_N^k)^{1/2}\left((\mathbf{C}_N^k)^{-1/2}\mathbf{M}_N^k(\mathbf{C}_N^k)^{-1/2}\right)^{1/2}(\mathbf{C}_N^k)^{1/2}\;.$$

Remark 8. Note that the obtained update rules are closed-form solutions of the surrogate cost functions in equations 9 and 11, which stands in contrast to conventional majorization-minimization algorithms (see Section C in the appendix), which require iterative procedures within each step of the optimization. Deriving the update rules in equation 13 and equation 14 concludes the second part of the proof of Theorem 2.
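As a quick numerical sanity check (our own illustration, with random stand-ins for $\mathbf{C}_S^k$ and $\mathbf{M}_S^k$), the geometric mean indeed solves the Riccati equation and zeroes the gradient of the surrogate:

```python
import numpy as np

def powm(P, a):
    w, V = np.linalg.eigh(P)
    return (V * w**a) @ V.T

def geometric_mean(C, M):
    """Gamma = C^{1/2} (C^{-1/2} M C^{-1/2})^{1/2} C^{1/2} (the update in Eq. 13)."""
    return powm(C, 0.5) @ powm(powm(C, -0.5) @ M @ powm(C, -0.5), 0.5) @ powm(C, 0.5)

rng = np.random.default_rng(3)
m = 5
A = rng.standard_normal((m, m)); C = A @ A.T + m * np.eye(m)
B = rng.standard_normal((m, m)); M = B @ B.T + m * np.eye(m)

G = geometric_mean(C, M)
C_inv, G_inv = np.linalg.inv(C), np.linalg.inv(G)

assert np.allclose(G @ C_inv @ G, M)              # Riccati equation holds
assert np.allclose(C_inv - G_inv @ M @ G_inv, 0)  # gradient vanishes at the optimum
```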

C PROOF OF THEOREM 3

In the following, we provide a proof of Theorem 3 by showing that the alternating update rules for Γ and Λ in equation 13 and equation 14 are guaranteed to converge to a local minimum of the Bayesian Type-II cost function in equation 8. In particular, we will prove that FUN learning is an instance of the general class of majorization-minimization (MM) algorithms, for which this property follows by construction. To this end, we first briefly review the theoretical concepts behind the majorization-minimization algorithmic framework (Hunter & Lange, 2004; Razaviyayn et al., 2013; Jacobson & Fessler, 2007; Wu et al., 2010).

C.1 REQUIRED CONDITIONS FOR MAJORIZATION-MINIMIZATION ALGORITHMS

MM encompasses a family of iterative algorithms for optimizing general non-linear cost functions. The main idea behind MM is to replace the original cost function in each iteration by an upper bound, also known as the majorizing function, whose minimum is easy to find. The MM class covers a broad range of common optimization algorithms, such as convex-concave procedures (CCCP) and proximal methods (Sun et al., 2017, Section IV; Mjolsness & Garrett, 1990; Yuille & Rangarajan, 2003; Lipp & Boyd, 2016). Such algorithms have been applied in various domains, such as brain source imaging (Hashemi & Haufe, 2018; Bekhti et al., 2018; Cai et al., 2020a; Hashemi et al., 2020), wireless communication systems with massive MIMO technology (Masood et al., 2016; Haghighatshoar & Caire, 2017; Khalilsarai et al., 2020), and non-negative matrix factorization (Fagot et al., 2019). Interested readers are referred to Sun et al. (2017) for an extensive list of MM applications. Within the MM framework, the problem of minimizing a continuous function $f(\mathbf{u})$ within a closed convex set $\mathcal{U} \subset \mathbb{R}^n$,

$$\min_{\mathbf{u}} f(\mathbf{u}) \quad \text{subject to} \quad \mathbf{u} \in \mathcal{U}\;,$$

can be summarized as follows. First, construct a continuous surrogate function $g(\mathbf{u}|\mathbf{u}^k)$ that majorizes, or upper-bounds, the original function $f(\mathbf{u})$ and coincides with $f(\mathbf{u})$ at a given point $\mathbf{u}^k$:

[A1] $g(\mathbf{u}^k|\mathbf{u}^k) = f(\mathbf{u}^k)\;, \quad \forall\, \mathbf{u}^k \in \mathcal{U}$

[A2] $g(\mathbf{u}|\mathbf{u}^k) \ge f(\mathbf{u})\;, \quad \forall\, \mathbf{u}, \mathbf{u}^k \in \mathcal{U}\;.$

Constraining the source covariance to be diagonal, the optimization in equation 9 can be rewritten as follows:

$$\boldsymbol{\gamma}^{k+1} = \operatorname*{arg\,min}_{\boldsymbol{\gamma},\, \boldsymbol{\Lambda}=\boldsymbol{\Lambda}^k}\; \operatorname{diag}\!\left[(\mathbf{C}_S^k)^{-1}\right]^{\top}\boldsymbol{\gamma} + \operatorname{diag}\!\left[\mathbf{M}_S^k\right]^{\top}\boldsymbol{\gamma}^{-1} \;=:\; \mathcal{L}^{\mathrm{diag}}_{\mathrm{source}}(\boldsymbol{\gamma}|\boldsymbol{\gamma}^k)\;, \tag{35}$$

where $\boldsymbol{\gamma}^{-1} = [\gamma_1^{-1}, \ldots, \gamma_N^{-1}]^{\top}$ is defined as the element-wise inversion of $\boldsymbol{\gamma}$. The optimization with respect to the scalar source variances is then carried out by taking the derivative of equation 35 with respect to $\gamma_n$, for $n = 1, \ldots, N$, and setting it to zero:

$$\frac{\partial}{\partial\gamma_n}\left(\left[(\mathbf{C}_S^k)^{-1}\right]_{n,n}\gamma_n + \left[\mathbf{M}_S^k\right]_{n,n}\gamma_n^{-1}\right) = \left[(\mathbf{C}_S^k)^{-1}\right]_{n,n} - \frac{1}{\gamma_n^2}\left[\mathbf{M}_S^k\right]_{n,n} = 0 \quad \text{for } n = 1, \ldots, N\;,$$

where $\mathbf{L}_n$ denotes the $n$-th column of the lead field matrix.
This yields the following update rule:

$$\gamma_n^{k+1} \leftarrow \sqrt{\frac{\left[\mathbf{M}_S^k\right]_{n,n}}{\left[(\mathbf{C}_S^k)^{-1}\right]_{n,n}}} = \sqrt{\frac{\frac{1}{T}\sum_{t=1}^{T}\left(x_n^k(t)\right)^2}{\mathbf{L}_n^{\top}(\boldsymbol{\Sigma}_{\mathbf{y}}^{k})^{-1}\mathbf{L}_n}} \quad \text{for } n = 1, \ldots, N\;,$$

which is identical to the update rule of Champagne (Wipf & Nagarajan, 2009).
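In code, the diagonal update only touches the diagonals of $\mathbf{M}_S^k$ and $(\mathbf{C}_S^k)^{-1}$. A sketch of ours (variable names and random test data are hypothetical) against a random lead field:

```python
import numpy as np

rng = np.random.default_rng(4)
M_sensors, N_sources, T = 8, 20, 100
L = rng.standard_normal((M_sensors, N_sources))        # lead field
X = rng.standard_normal((N_sources, T))                # current source estimates x^k(t)
A = rng.standard_normal((M_sensors, M_sensors))
Sigma_y = A @ A.T + M_sensors * np.eye(M_sensors)      # stand-in for Sigma_y^k
Sy_inv = np.linalg.inv(Sigma_y)

M_S_diag = np.mean(X**2, axis=1)                       # [M_S^k]_{n,n} = (1/T) sum_t x_n^k(t)^2
CS_inv_diag = np.einsum('mn,mp,pn->n', L, Sy_inv, L)   # [(C_S^k)^{-1}]_{n,n} = L_n^T Sy^{-1} L_n
gamma_new = np.sqrt(M_S_diag / CS_inv_diag)            # Champagne update (Eq. 15)

# the update zeroes the scalar derivative of the surrogate for every n
assert np.allclose(CS_inv_diag - M_S_diag / gamma_new**2, 0)
```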

E DERIVATION OF CHAMPAGNE WITH HETEROSCEDASTIC NOISE LEARNING AS A SPECIAL CASE OF FUN LEARNING

Similar to Appendix D, we start by constraining Λ to the set of diagonal matrices $\mathcal{W}$: $\boldsymbol{\Lambda} = \operatorname{diag}(\boldsymbol{\lambda})$, where $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_M]^{\top}$. We continue by reformulating the constrained optimization with respect to the noise covariance matrix,

$$\boldsymbol{\Lambda}^{k+1} = \operatorname*{arg\,min}_{\boldsymbol{\Lambda}\in\mathcal{W},\, \boldsymbol{\Gamma}=\boldsymbol{\Gamma}^k}\; \operatorname{tr}\!\left[(\mathbf{C}_N^k)^{-1}\boldsymbol{\Lambda}\right] + \operatorname{tr}\!\left[\mathbf{M}_N^k\boldsymbol{\Lambda}^{-1}\right]\;,$$

as follows:

$$\boldsymbol{\lambda}^{k+1} = \operatorname*{arg\,min}_{\boldsymbol{\lambda},\, \boldsymbol{\Gamma}=\boldsymbol{\Gamma}^k}\; \operatorname{diag}\!\left[(\mathbf{C}_N^k)^{-1}\right]^{\top}\boldsymbol{\lambda} + \operatorname{diag}\!\left[\mathbf{M}_N^k\right]^{\top}\boldsymbol{\lambda}^{-1} \;=:\; \mathcal{L}^{\mathrm{diag}}_{\mathrm{noise}}(\boldsymbol{\lambda}|\boldsymbol{\lambda}^k)\;, \tag{37}$$

where $\boldsymbol{\lambda}^{-1} = [\lambda_1^{-1}, \ldots, \lambda_M^{-1}]^{\top}$ is defined as the element-wise inversion of $\boldsymbol{\lambda}$. The optimization with respect to the scalar noise variances then proceeds by taking the derivative of equation 37 with respect to $\lambda_m$, for $m = 1, \ldots, M$, and setting it to zero:

$$\frac{\partial}{\partial\lambda_m}\left(\left[(\mathbf{C}_N^k)^{-1}\right]_{m,m}\lambda_m + \left[\mathbf{M}_N^k\right]_{m,m}\lambda_m^{-1}\right) = \left[(\mathbf{C}_N^k)^{-1}\right]_{m,m} - \frac{1}{\lambda_m^2}\left[\mathbf{M}_N^k\right]_{m,m} = 0\;.$$

F PSEUDO-EEG SIGNAL GENERATION

Our simulation setting is an adaptation of the EEG inverse problem, where brain activity is to be reconstructed from simulated pseudo-EEG data (Haufe & Ewald, 2016).

Forward modeling: Populations of pyramidal neurons in the cortical gray matter are known to be the main drivers of the EEG signal (Hämäläinen et al., 1993; Baillet et al., 2001). Here, we use a realistic volume conductor model of the human head to model the linear relationship between primary electrical source currents generated within these populations and the resulting scalp surface potentials captured by EEG electrodes. The lead field matrix, L ∈ R^(58×2004), was generated using the New York Head model (Huang et al., 2016), taking into account the realistic anatomy and electrical tissue conductivities of an average human head, and was computed using the finite element method. In this model, 2004 dipolar current sources were placed evenly on the cortical surface, and 58 sensors were considered. Note that the orientation of all source currents was fixed to be perpendicular to the cortical surface, so that only scalar source amplitudes needed to be estimated.

Evaluation metrics: Source reconstruction performance was evaluated according to the following metrics. First, the earth mover's distance (EMD) (Rubner et al., 2000; Haufe et al., 2008) was used to quantify spatial localization accuracy. The EMD measures the cost needed to transform two probability distributions defined on the same metric domain (in this case, distributions of the true and estimated sources defined in 3D Euclidean brain space) into each other. EMD scores were normalized to [0, 1]. Second, the error in the reconstruction of the source time courses was measured.
To this end, the Pearson correlation between all pairs of simulated and reconstructed (i.e., those with non-zero activations) source time courses was computed; we report the mean of the absolute correlations obtained for each source after optimally matching simulated and reconstructed sources based on maximal absolute correlation. We also report the localization error as the average Euclidean distance (EUCL, in mm) between each simulated source and the best-matching (in terms of absolute correlation) reconstructed source. For assessing the recovery of the true support, we further compute F1-measure scores (Chinchor & Sundheim, 1993; van Rijsbergen, 1979):

$$F_1 = \frac{2\,\mathrm{TP}}{P + \mathrm{TP} + \mathrm{FP}}\;,$$

where P denotes the number of truly active sources, while TP and FP are the numbers of true and false positive predictions, respectively. Note that perfect support recovery, i.e., $F_1 = 1$, is only achieved when there is a perfect correspondence between the ground-truth and estimated support.
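As a concrete illustration of the support-recovery metric (our own sketch, with hypothetical source index sets):

```python
def f1_measure(true_support, est_support):
    """F1 = 2*TP / (P + TP + FP), as defined above."""
    true_s, est_s = set(true_support), set(est_support)
    TP = len(true_s & est_s)          # correctly detected active sources
    FP = len(est_s - true_s)          # spurious detections
    P = len(true_s)                   # number of truly active sources
    return 2 * TP / (P + TP + FP)

assert f1_measure({3, 17, 42}, {3, 17, 42}) == 1.0       # perfect support recovery
assert f1_measure({3, 17, 42}, {3, 99}) == 2 / (3 + 1 + 1)  # one hit, one false alarm
```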



Repeat:
1. Calculate C_S^k and M_S^k based on equation 10, and update γ_n for n = 1, ..., N based on equation 15.
2. Calculate C_N^k and M_N^k based on equation 12, and update Λ based on equation 14.
Until the stopping condition is satisfied.
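The loop above can be sketched end-to-end as follows. This is our own minimal re-implementation under simplifying assumptions (unit initialization, fixed iteration count, a small ridge on the residual covariance, no convergence monitoring), not the authors' reference code:

```python
import numpy as np

def powm(P, a):
    w, V = np.linalg.eigh(P)
    return (V * np.clip(w, 1e-12, None)**a) @ V.T

def geometric_mean(C, M):
    """Riemannian noise update (Eq. 14): geometric mean of C and M."""
    return powm(C, 0.5) @ powm(powm(C, -0.5) @ M @ powm(C, -0.5), 0.5) @ powm(C, 0.5)

def fun_learning(Y, L, n_iter=20):
    """Sketch of FUN learning: diagonal source variances, full noise covariance."""
    M_ch, T = Y.shape
    N = L.shape[1]
    gamma = np.ones(N)
    Lam = np.eye(M_ch)
    for _ in range(n_iter):
        Sy = Lam + (L * gamma) @ L.T               # model covariance Sigma_y
        Sy_inv = np.linalg.inv(Sy)
        X = (gamma[:, None] * L.T) @ Sy_inv @ Y    # posterior mean of the sources
        # source update (Eq. 15)
        gamma = np.sqrt(np.mean(X**2, axis=1) /
                        np.einsum('mn,mp,pn->n', L, Sy_inv, L))
        # noise update (Eq. 14) with C_N = Sigma_y and M_N the residual covariance
        Sy = Lam + (L * gamma) @ L.T
        R = Y - L @ X
        M_N = R @ R.T / T + 1e-10 * np.eye(M_ch)
        Lam = geometric_mean(Sy, M_N)
    return gamma, Lam

rng = np.random.default_rng(5)
L = rng.standard_normal((6, 15))
Y = rng.standard_normal((6, 200))
gamma, Lam = fun_learning(Y, L)
assert gamma.shape == (15,) and np.all(gamma > 0)
assert np.allclose(Lam, Lam.T) and np.all(np.linalg.eigvalsh(Lam) > 0)
```

Note that the learned Λ stays symmetric positive definite by construction, since the geometric mean of two P.D. matrices is itself P.D.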

Figure 2 shows the EMD, the time course reconstruction error, the EUCL, and the F1-measure score incurred by three different noise learning approaches assuming homoscedastic (red), heteroscedastic (green), and full-structure (blue) noise covariances.

Figure 3: Auditory evoked field (AEF) localization results versus the number of trials for one representative subject using the FUN learning algorithm. All reconstructions show focal sources at the expected locations in the left (L: top panel) and right (R: bottom panel) auditory cortex. Thus, even a limited number of trials does not substantially affect the reconstruction results of the FUN learning algorithm.

This yields the following update rule:

$$\lambda_m^{k+1} \leftarrow \sqrt{\frac{\left[\mathbf{M}_N^k\right]_{m,m}}{\left[(\mathbf{C}_N^k)^{-1}\right]_{m,m}}} = \sqrt{\frac{\frac{1}{T}\sum_{t=1}^{T}\left(y_m(t) - \left[\mathbf{L}\mathbf{x}^k(t)\right]_m\right)^2}{\left[(\boldsymbol{\Sigma}_{\mathbf{y}}^{k})^{-1}\right]_{m,m}}} \quad \text{for } m = 1, \ldots, M\;, \tag{38}$$

which is identical to the update rule of Champagne with heteroscedastic noise learning as presented in Cai et al. (2020a).
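A scalar sketch of this update (our own illustration, with random stand-ins for $\boldsymbol{\Sigma}_{\mathbf{y}}^{k}$ and the residuals):

```python
import numpy as np

rng = np.random.default_rng(6)
M_ch, T = 8, 100
A = rng.standard_normal((M_ch, M_ch))
Sy = A @ A.T + M_ch * np.eye(M_ch)        # stand-in for C_N^k = Sigma_y^k
R = rng.standard_normal((M_ch, T))        # residuals y(t) - L x^k(t)

M_N_diag = np.mean(R**2, axis=1)          # [M_N^k]_{m,m}
CN_inv_diag = np.diag(np.linalg.inv(Sy))  # [(C_N^k)^{-1}]_{m,m}
lam_new = np.sqrt(M_N_diag / CN_inv_diag) # heteroscedastic update (Eq. 38)

# each lambda_m zeroes the corresponding scalar derivative of the surrogate
assert np.allclose(CN_inv_diag - M_N_diag / lam_new**2, 0)
```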

Figure 4: Accuracy of the noise covariance matrix reconstruction achieved by three different noise learning approaches assuming homoscedastic (red), heteroscedastic (green), and full-structure (blue) noise covariances. The ground-truth noise covariance matrix is either full-structure (upper row) or heteroscedastic diagonal (lower row). Performance is assessed in terms of the Pearson correlation between the entries of the original and reconstructed noise covariance matrices, Λ and Λ̂, denoted by Λ_sim; shown is the similarity error 1 − Λ_sim (left column). Further, the normalized mean squared error (NMSE) between Λ̂ and Λ, defined as NMSE = ||Λ̂ − Λ||²_F / ||Λ||²_F, is reported (right column).

Anqi Wu, Oluwasanmi Koyejo, and Jonathan W. Pillow. Dependent relevance determination for smooth and structured sparse regression. J. Mach. Learn. Res., 20(89):1-43, 2019.

Tong Tong Wu, Kenneth Lange, et al. The MM alternative to EM. Statistical Science, 25(4):492-505, 2010.


Second, starting from an initial value $\mathbf{u}^0$, generate a sequence of feasible points $\mathbf{u}^1, \mathbf{u}^2, \ldots, \mathbf{u}^k, \mathbf{u}^{k+1}$ as solutions of a series of successive simple optimization problems, where

[A3] $\mathbf{u}^{k+1} := \operatorname*{arg\,min}_{\mathbf{u}\in\mathcal{U}} g(\mathbf{u}|\mathbf{u}^k)\;.$

If a surrogate function fulfills conditions [A1]-[A3], then the value of the cost function $f$ decreases in each iteration: $f(\mathbf{u}^{k+1}) \le f(\mathbf{u}^k)$. For the smooth functions considered in this paper, we further require that the derivatives of the original and surrogate functions coincide at $\mathbf{u}^k$. We can then formulate the following theorem (Theorem 9).

Thus, both gradients coincide at $\boldsymbol{\Sigma}_{\mathbf{y}}^{k}$ by construction. Now, we prove that [A3] is satisfied by showing that $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$ reaches its global minimum in each MM iteration. This is guaranteed if $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$ can be shown to be convex or g-convex with respect to Γ. To this end, we first require the following proposition:

Proposition 10. Any local minimum of a g-convex function over a g-convex set is a global minimum.

Proof. A detailed proof is presented in Rapcsak (1991, Theorem 2.1).

Given the proof presented in Appendix B.1, we can conclude that equation 20 is g-convex; hence, any local minimum of $\mathcal{L}^{\mathrm{conv}}_{\mathrm{source}}(\boldsymbol{\Gamma},\boldsymbol{\Lambda}^k)$ is a global minimum according to Proposition 10. This proves that condition [A3] is fulfilled and completes the proof that the optimization of equation 8 with respect to Γ using the convex surrogate cost function in equation 9 leads to an MM algorithm. For the sake of brevity, we omit the proof for the optimization with respect to Λ based on the convex surrogate function in equation 11, $\mathcal{L}^{\mathrm{conv}}_{\mathrm{noise}}(\boldsymbol{\Gamma}^k,\boldsymbol{\Lambda})$, as it can be carried out analogously.
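Condition [A1] and the gradient-matching requirement can be checked numerically for the log-det surrogate used in Appendix A. This is a self-contained check of ours; the directional derivatives are compared by central finite differences along a random symmetric direction:

```python
import numpy as np

def f(S):
    return np.linalg.slogdet(S)[1]              # original term: log|Sigma|

def g(S, S_k):
    """Surrogate: first-order expansion of log|Sigma| at Sigma_k (Eq. 17)."""
    return f(S_k) + np.trace(np.linalg.inv(S_k) @ (S - S_k))

rng = np.random.default_rng(7)
m = 5
A = rng.standard_normal((m, m)); S_k = A @ A.T + m * np.eye(m)
D = rng.standard_normal((m, m)); D = D + D.T    # random symmetric direction

assert np.isclose(f(S_k), g(S_k, S_k))          # [A1]: surrogate is tight at S_k
t = 1e-6
df = (f(S_k + t * D) - f(S_k - t * D)) / (2 * t)       # directional derivative of f
dg = (g(S_k + t * D, S_k) - g(S_k - t * D, S_k)) / (2 * t)
assert np.isclose(df, dg, rtol=1e-4, atol=1e-6)        # gradients coincide at S_k
```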

D DERIVATION OF CHAMPAGNE AS A SPECIAL CASE OF FUN LEARNING

We start the derivation of the update rule in equation 15 by constraining Γ to the set of diagonal matrices $\mathcal{W}$: $\boldsymbol{\Gamma} = \operatorname{diag}(\boldsymbol{\gamma})$, where $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_N]^{\top}$. We continue by rewriting the constrained optimization with respect to the source covariance matrix.

To evaluate the accuracy of the noise covariance matrix estimation, the following two metrics were calculated: the Pearson correlation between the original and reconstructed noise covariance matrices, Λ and Λ̂, denoted by Λ_sim, and the normalized mean squared error (NMSE) between Λ̂ and Λ, defined as NMSE = ||Λ̂ − Λ||²_F / ||Λ||²_F. The similarity error was then defined as one minus the Pearson correlation: 1 − Λ_sim. Note that the NMSE measures the reconstruction of the true scale of the noise covariance matrix, while Λ_sim is scale-invariant and hence only quantifies the overall structural similarity between simulated and estimated noise covariance matrices.

Evaluating the accuracy of the noise covariance matrix estimation: Figure 4 depicts the accuracy with which the noise covariance matrix is reconstructed by three different noise learning approaches assuming noise with homoscedastic (red), heteroscedastic (green), and full (blue) structure. The ground-truth noise covariance matrix had either full (upper row) or heteroscedastic (lower row) structure. Performance was measured in terms of similarity error and NMSE. Similar to the trend observed in Figure 2, full-structure noise learning leads to better noise covariance estimation accuracy (lower NMSE and similarity error) when the ground-truth noise has full structure, while superior reconstruction performance is achieved by heteroscedastic noise learning when the true noise covariance is heteroscedastic.
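The two evaluation metrics can be written compactly (our own sketch; `Lam_hat` stands for Λ̂). The assertions illustrate the scale-invariance point made above: rescaling Λ̂ leaves the similarity error at zero but drives the NMSE to one.

```python
import numpy as np

def nmse(Lam_hat, Lam):
    """Normalized mean squared error ||Lam_hat - Lam||_F^2 / ||Lam||_F^2."""
    return np.linalg.norm(Lam_hat - Lam, 'fro')**2 / np.linalg.norm(Lam, 'fro')**2

def similarity_error(Lam_hat, Lam):
    """1 - Pearson correlation between the entries of the two matrices."""
    r = np.corrcoef(Lam_hat.ravel(), Lam.ravel())[0, 1]
    return 1.0 - r

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
Lam = A @ A.T + 6 * np.eye(6)

assert nmse(Lam, Lam) == 0.0
assert np.isclose(similarity_error(2 * Lam, Lam), 0.0)  # structure metric: scale-invariant
assert np.isclose(nmse(2 * Lam, Lam), 1.0)              # scale metric: penalizes rescaling
```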

G FURTHER DETAILS ON AUDITORY EVOKED FIELDS (AEF) DATASET

The MEG data used in this article were acquired in the Biomagnetic Imaging Laboratory at the University of California San Francisco (UCSF) with a CTF Omega 2000 whole-head MEG system from VSM MedTech (Coquitlam, BC, Canada) at a sampling rate of 1200 Hz. The lead field for each subject was calculated with NUTMEG (Dalal et al., 2004) using a single-sphere head model (two spherical orientation lead fields) and an 8 mm voxel grid; each column was normalized to have unit norm. The neural responses of one subject to an auditory evoked fields (AEF) stimulus were localized. The AEF response was elicited with single 600 ms duration tones (1 kHz) presented binaurally, and 120 trials were collected. The data were first digitally filtered from 1 to 70 Hz to remove artifacts and DC offset, time-aligned to the stimulus, and then averaged across the following numbers of trials: {1, 2, 12, 63, 120}. The pre-stimulus window was selected to be −100 ms to 5 ms and the post-stimulus window to be 60 ms to 180 ms, where 0 ms is the onset of the tone (Wipf et al., 2010; Dalal et al., 2011; Owen et al., 2012; Cai et al., 2019a).

