AUTOBAYES: AUTOMATED BAYESIAN GRAPH EXPLORATION FOR NUISANCE-ROBUST INFERENCE

Abstract

Learning data representations that capture task-related features, but are invariant to nuisance variations, remains a key challenge in machine learning. We introduce an automated Bayesian inference framework, called AutoBayes, that explores different graphical models linking classifier, encoder, decoder, estimator and adversarial network blocks to optimize nuisance-invariant machine learning pipelines. AutoBayes also enables learning disentangled representations, where the latent variable is split into multiple pieces to impose various relationships with the nuisance variation and task labels. We benchmark the framework on several public datasets, and provide analysis of its capability for subject-transfer learning with/without variational modeling and adversarial training. We demonstrate a significant performance improvement with ensemble learning across explored graphical models.

1. INTRODUCTION

The great advancement of deep learning techniques based on deep neural networks (DNN) has enabled more practical design of human-machine interfaces (HMI) through the analysis of the user's physiological data (Faust et al., 2018), such as electroencephalogram (EEG) (Lawhern et al., 2018) and electromyogram (EMG) (Atzori et al., 2016). However, such biosignals are highly prone to variation depending on the biological states of each subject (Christoforou et al., 2010). Hence, frequent calibration is often required in typical HMI systems. To resolve this issue, subject-invariant methods (Özdenizci et al., 2019b), employing adversarial training (Makhzani et al., 2015; Lample et al., 2017; Creswell et al., 2017) with the Conditional Variational AutoEncoder (A-CVAE) (Louizos et al., 2015; Sohn et al., 2015) shown in Fig. 1(b), have emerged to reduce user calibration for realizing successful HMI systems. Compared to a standard DNN classifier C in Fig. 1(a), integrating additional functional blocks for encoder E, nuisance-conditional decoder D, and adversary A networks offers excellent subject-invariant performance. The DNN structure can potentially be extended with more functional blocks and more latent nodes, as shown in Fig. 1(c). However, such a DNN architecture design relies on human effort and insight to determine the block connectivity of the DNNs. Automation of hyperparameter and architecture exploration in the context of AutoML (Ashok et al., 2017; Brock et al., 2017; Cai et al., 2017; He et al., 2018; Miikkulainen et al., 2019; Real et al., 2017; 2020; Stanley & Miikkulainen, 2002; Zoph et al., 2018) can facilitate DNN design suited for nuisance-invariant inference. Nevertheless, without proper reasoning, most of the search space for link connectivity is pointless. In this paper, we propose a systematic automation framework called AutoBayes, which searches for the best inference graph model associated with a Bayesian graph model (also known as a Bayesian network) well suited to reproduce the training datasets. The proposed method automatically formulates various Bayesian graphs by factorizing the joint probability distribution over the data, class label, subject identification (ID), and inherent latent representations. Given the Bayesian graphs, meaningful inference graphs are generated through the Bayes-Ball algorithm (Shachter, 2013), pruning redundant links to achieve high-accuracy estimation. To promote robustness against nuisance variations such as inter-subject/session factors, the explored Bayesian graphs provide reasoning for the use of adversarial training with/without variational modeling and latent disentanglement. We demonstrate that AutoBayes achieves excellent performance across various public datasets, in particular with an ensemble stacking of multiple explored graphical models.

2. KEY CONTRIBUTIONS

At the core of our methodology is the consideration of graphical models that capture the probabilistic relationship between random variables representing the data features X, task labels Y, nuisance variation labels S, and (potential) latent representations Z. The ultimate goal is to infer the task label Y from the measured data feature X, which is hindered by the presence of nuisance variations (e.g., inter-subject/session variations) that are (partially) labelled by S. One may use a standard DNN to classify Y given X as shown in Fig. 1(a), without explicitly involving S or Z. Although A-CVAE in Fig. 1(b) may offer nuisance-robust performance through adversarial disentanglement of S from the latent Z, there is no guarantee that such a model performs well across different datasets. This is exemplified in Fig. 2, where A-CVAE outperforms the standard DNN model for some datasets (QMNIST, Stress, ErrP) but not for the others. This may be because the underlying probabilistic relationship of the data varies across datasets. Our proposed framework can construct justifiable models, achieving higher performance for every dataset, as demonstrated in Fig. 2. We verify that a significant gain is attainable with ensemble methods over the different Bayesian graphs explored by AutoBayes. For example, our method with a relatively shallow architecture achieves 99.61% accuracy, close to state-of-the-art performance, on the QMNIST dataset.

Algorithm 1 Pseudocode for AutoBayes Framework
Require: Node set V = [Y, X, S1, S2, . . . , Sn, Z1, Z2, . . . , Zm], where Y denotes task labels, X is measurement data, S = [S1, S2, . . . , Sn] are (potentially multiple) semi-supervised nuisance variations, and Z = [Z1, Z2, . . . , Zm] are (potentially multiple) latent vectors
Ensure: Semi-supervised training/validation datasets
1: for all permutations of node factorization from Y to X do

The main contributions of this paper over the existing works are five-fold as follows:
• AutoBayes automatically explores potential graphical models inherent to the data by combinatorial pruning of dependency assumptions (edges) and then applies Bayes-Ball to examine various inference strategies, rather than blindly exploring hyperparameters of DNN blocks.
• AutoBayes offers a solid rationale for how to connect multiple DNN blocks to impose conditioning and adversary censoring for the task classifier, feature encoder, decoder, nuisance indicator and adversary networks, based on an explored Bayesian graph.
• The framework is also extensible to multiple latent representations and nuisance factors.
• Besides fully-supervised training, AutoBayes can automatically build relevant graphical models suited for semi-supervised learning.
• Multiple graphical models explored by AutoBayes can be efficiently exploited to improve performance by ensemble stacking.
We note that this paper relates to existing literature in AutoML, variational Bayesian inference (Kingma & Welling, 2013; Sohn et al., 2015; Louizos et al., 2015), adversarial training (Goodfellow et al., 2014; Dumoulin et al., 2016; Donahue et al., 2016; Makhzani et al., 2015; Lample et al., 2017; Creswell et al., 2017), and Bayesian networks (Nie et al., 2018; Njah et al., 2019; Rohekar et al., 2018), as addressed in more detail in Appendix A.1. Nonetheless, AutoBayes is a novel framework that diverges from AutoML, which mostly employs architecture tuning at a micro level. Our work focuses on exploring neural architectures at a macro level, which is not an arbitrary diversion but a necessary interlude. Our method focuses on the relationships between the connections in a neural network's architecture and the characteristics of the data (Minsky & Papert, 2017).
In addition to macro-level structure learning of the Bayesian network, our approach provides a new perspective on how to involve adversarial blocks and how to exploit multiple models for ensemble stacking.
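The macro-level graph exploration can be made concrete with a few lines of code: starting from the full-chain generative graph over (Y, S, Z, X), every subset of surviving edges is one hypothetical Bayesian graph for AutoBayes to examine. The sketch below is our own illustration (the edge list and function names are not from the paper):

```python
from itertools import combinations

# Full-chain generative graph over (Y, S, Z, X): every forward edge present,
# matching the factorization p(y) p(s|y) p(z|s,y) p(x|z,s,y).
FULL_CHAIN_EDGES = [
    ("Y", "S"), ("Y", "Z"), ("Y", "X"),
    ("S", "Z"), ("S", "X"),
    ("Z", "X"),
]

def enumerate_bayesian_graphs(edges=FULL_CHAIN_EDGES):
    """Yield every sub-graph of the full chain obtained by cutting edges.

    Each returned edge set is one hypothetical Bayesian graph (a set of
    independence assumptions) to be paired with pruned inference graphs.
    """
    for k in range(len(edges) + 1):
        for kept in combinations(edges, k):
            yield set(kept)

graphs = list(enumerate_bayesian_graphs())
print(len(graphs))  # 2^6 = 64 candidate graphs in the 4-node case
```

In practice only a handful of these 64 candidates are meaningful (the models A through K considered later); the point is that the hypothesis space is small enough to enumerate exhaustively at the macro level.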

3. AUTOBAYES

AutoBayes Algorithm: The overall procedure of the AutoBayes algorithm is described in the pseudocode of Algorithm 1. AutoBayes automatically constructs non-redundant inference factor graphs for a given hypothetical Bayesian graph assumption through the use of the Bayes-Ball algorithm.

Figure 3: Bayes-Ball algorithm basic rules (Shachter, 2013). Conditioned nodes are shaded.

The Bayes-Ball algorithm (Shachter, 2013) facilitates automatic pruning of redundant links in inference factor graphs through the analysis of conditional independence. Fig. 3 shows the ten Bayes-Ball rules for identifying conditional independence. Given a Bayesian graph, we can determine whether two disjoint sets of nodes are conditionally independent given other nodes through a graph separation criterion. Specifically, an undirected path is active if a Bayes ball can travel along it without encountering a stop symbol in Fig. 3. If there are no active paths between two nodes when some conditioning nodes are shaded, then those random variables are conditionally independent.

Figure 4: (a) Full-chain Bayesian graph; (b) Z-first inference factor graph; (c) S-first inference factor graph.

Graphical Models: We here focus on 4-node graphs. Let p(y, s, z, x) denote the joint probability distribution underlying the datasets for the four random variables Y, S, Z, and X. The chain rule yields the following factorization for a generative model from Y to X (note that at most 4! factorization orders exist, including useless ones such as the reversed direction from X to Y):

p(y, s, z, x) = p(y) p(s|y) p(z|s, y) p(x|z, s, y),   (1)

which is visualized in the Bayesian graph of Fig. 4(a). The probability conditioned on X can then be factorized, e.g., as follows (among 3! different orders of inference factorization for four-node graphs):

p(y, s, z|x) = p(z|x) p(s|z, x) p(y|s, z, x)   (Z-first inference)
             = p(s|x) p(z|s, x) p(y|z, s, x),  (S-first inference)   (2)

which are marginalized to obtain the likelihood: p(y|x) = E_{s,z|x}[p(y, s, z|x)]. The above two scheduling strategies in (2) are illustrated as factor graph models in Figs. 4(b) and (c), respectively. The graphical models in Fig. 4 do not impose any assumption about independence potentially inherent in the datasets and hence are the most generic. However, depending on the underlying independence in the datasets, we may be able to prune some edges in those graphs. For example, if the data follow the simple Markov chain Y-X, independent of S and Z as shown in Fig. 5(a), all links except the one between X and Y become unreasonable in the inference graphs of Figs. 4(b) and (c), which justifies the standard classifier model in Fig. 1(a). This implies that more complicated inference models such as A-CVAE can be unnecessarily redundant, depending on the dataset. This motivates us to consider an extended AutoML framework that automatically explores the best pair of inference factor graph and corresponding Bayesian graph model matching the dataset statistics, beyond micro-scale hyperparameter tuning.
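The graph-separation criterion that Bayes-Ball decides is d-separation, and for the small graphs considered here it can be checked directly by enumerating undirected paths and testing each for activation. The following is our own minimal sketch, not the authors' code:

```python
def descendants(node, edges):
    """All descendants of `node` in the DAG given directed edges (u, v)."""
    out, stack = set(), [node]
    while stack:
        u = stack.pop()
        for (a, b) in edges:
            if a == u and b not in out:
                out.add(b); stack.append(b)
    return out

def d_separated(a, b, given, edges, nodes):
    """True iff a and b are d-separated given `given`: no undirected path
    between them is active under the Bayes-Ball stopping rules."""
    given = set(given)
    adj = {n: set() for n in nodes}
    for (u, v) in edges:
        adj[u].add(v); adj[v].add(u)

    def paths(u, target, visited):
        if u == target:
            yield [u]; return
        for w in adj[u] - visited:
            for rest in paths(w, target, visited | {w}):
                yield [u] + rest

    def active(path):
        for i in range(1, len(path) - 1):
            prev, mid, nxt = path[i - 1], path[i], path[i + 1]
            collider = (prev, mid) in edges and (nxt, mid) in edges
            if collider:
                # a collider passes the ball only if it (or a descendant) is observed
                if mid not in given and not (descendants(mid, edges) & given):
                    return False
            elif mid in given:  # chain/fork nodes block the ball when observed
                return False
        return True

    return not any(active(p) for p in paths(a, b, {a}))

# Model E of the paper: Y -> Z, Z -> X, S -> X
E_EDGES = [("Y", "Z"), ("Z", "X"), ("S", "X")]
NODES = ["Y", "S", "Z", "X"]
print(d_separated("Z", "S", [], E_EDGES, NODES))     # True: Z and S marginally independent
print(d_separated("Z", "S", ["X"], E_EDGES, NODES))  # False: observing X couples them
```

This reproduces the key structural facts used later: Z is marginally independent of S (motivating adversarial censoring), yet the two become dependent once the collider X is observed.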

Methodology:

AutoBayes begins by exploring potential Bayesian graphs, cutting links of the full-chain graph in Fig. 4(a) to impose possible (conditional) independence. We then apply the Bayes-Ball algorithm to each hypothetical Bayesian graph to examine conditional independence under different inference strategies, e.g., the full-chain Z-first/S-first inference graphs in Figs. 4(b)/(c). Applying Bayes-Ball justifies a reasonable pruning of the links in the full-chain inference graphs, and also identifies potential adversarial censoring when Z is independent of S. This process automatically constructs the connectivity of inference, generative, and adversary blocks with sound reasoning.

Figure 5: Candidate Bayesian graph models: (a) Model A through (i) Model I over (Y, S, Z, X), and (j) Model J and (k) Model K with two latent vectors Z1 and Z2.

Consider an example where the data adhere to the following factorization:

p(y, s, z, x) = p(y) p(s) p(z|y) p(x|z, s),

obtained by cancelling conditional dependencies from the full-chain case in (1): S is independent of Y; Z is independent of S given Y; and X is independent of Y given (Z, S). This corresponds to the Bayesian graphical model illustrated in Fig. 5(e). Applying the Bayes-Ball algorithm to this Bayesian graph yields the following conditional probability for the Z-first inference strategy in (2):

p(y, s, z|x) = p(z|x) p(s|z, x) p(y|z).

The corresponding factor graph is given in Fig. 6(c). Note that Bayes-Ball also reveals that there is no marginal dependency between Z and S, which provides the reason to use adversarial censoring to suppress nuisance information S in the latent space Z. Consequently, by combining the Bayesian graph and factor graph, we automatically obtain the A-CVAE model in Fig. 1(b). AutoBayes thus justifies A-CVAE under the assumption that the data follow the Bayesian model E in Fig. 5(e). As the true generative model is unknown, AutoBayes explores different Bayesian graphs as in Fig. 5 to search for the most relevant model. Our framework is readily applicable to graphs with more than 4 nodes representing multiple Y, S, and Z. Models J and K in Fig. 5 are such examples, having multiple latent factors Z1 and Z2. Although the search space of AutoBayes grows rapidly with the number of nodes, most realistic datasets do not require a large number of neural network blocks for macro-level optimization. See Appendix A.2 for more detailed descriptions of some Bayesian graph models used to construct factor graphs as in Fig. 6. Also see the discussion of graphical models suitable for semi-supervised learning in Appendix A.4.

Training: Given a pair of generative graph and inference graph, the corresponding DNN structures are trained. For example, for the generative graph model K in Fig. 5(k), one relevant inference graph Kz in Fig. 6(k) results in the overall network structure shown in Fig. 7, where an adversary network is attached because Z2 is (conditionally) independent of S. This 5-node graph model justifies a recent work on partially disentangled A-CVAE by Han et al. (2020). Each factor block is realized by a DNN, e.g., parameterized by θ for p_θ(z1, z2|x), and all networks except the adversarial network are optimized to minimize the corresponding loss functions, including L(ŷ, y), as follows:

min_{θ,φ,ψ,µ} max_η E[ L(ŷ, y) + λ_s L(ŝ, s) + λ_x L(x′, x) + λ_z KL(q_θ(z1, z2|x) ∥ N(0, I)) − λ_a L(ŝ′, s) ],   (5)

with (z1, z2) = p_θ(x), ŷ = p_φ(z1, z2), ŝ = p_ψ(z1), x′ = p_µ(z1, z2), ŝ′ = p_η(z2), where λ_* denotes a regularization coefficient, KL is the Kullback-Leibler divergence, and the adversary network p_η(s′|z2) is trained to minimize L(ŝ′, s) in an alternating fashion (see the Adversarial Regularization paragraph below). The training objective can be formally understood from a likelihood maximization perspective, in a manner that can be seen as a generalization of the VAE Evidence Lower Bound (ELBO) concept (Kingma & Welling, 2013).
Specifically, it can be viewed as the maximization of a variational lower bound of the likelihood p_Φ(x, y, s) that is implicitly defined and parameterized by the networks, where Φ represents the collective parameters of the network modules (e.g., Φ = (φ, ψ, µ) in the example of equation 5) that specify the generative model p_Φ(x, y, s|z), which implies the likelihood p_Φ(x, y, s) given by

p_Φ(x, y, s) = ∫ p_Φ(x, y, s|z) p(z) dz.

Figure 6: Inference factor graphs derived by Bayes-Ball: (a) Model Dz, (b) Model Ds, (c) Model Ez, (d) Model Es, (e) Model Fz, (f) Model Fs, (g) Model Gz, (h) Model Gs, (i) Model Jz, (j) Model Js, (k) Model Kz, (l) Model Ks.

Figure 7: Overall network structure for model Kz: encoder p_θ(z1, z2|x), classifier p_φ(y|z1, z2), nuisance estimator p_ψ(s|z1), adversary p_η(s′|z2), and decoder p_µ(x′|z1, z2), with losses L(ŷ, y), L(ŝ, s), L(ŝ′, s), and L(x′, x).

However, since this expression is generally intractable, we introduce q_θ(z|x, y, s) as a variational approximation of the posterior p_Φ(z|x, y, s) implied by the generative model (Kingma & Welling, 2013; Ranganath et al., 2014):

(1/n) Σ_{i=1}^n log p_Φ(x_i, y_i, s_i)
  = (1/n) Σ_{i=1}^n E[ log p_Φ(x_i, y_i, s_i|z_i) − log(q_θ(z_i|x_i, y_i, s_i)/p(z_i)) + log(q_θ(z_i|x_i, y_i, s_i)/p_Φ(z_i|x_i, y_i, s_i)) ]
  ≈ (1/n) Σ_{i=1}^n [log p_Φ(x_i, y_i, s_i|z_i)] − KL(q_θ(z|x, y, s) ∥ p(z)) + KL(q_θ(z|x, y, s) ∥ p_Φ(z|x, y, s))
  ≥ (1/n) Σ_{i=1}^n [log p_Φ(x_i, y_i, s_i|z_i)] − KL(q_θ(z|x, y, s) ∥ p(z)),   (7)

where the samples z_i ∼ q_θ(z|x_i, y_i, s_i) are drawn for each training tuple (x_i, y_i, s_i), and the final inequality follows from the non-negativity of the KL divergence.
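The bound above can be sanity-checked numerically on a fully discrete toy model, where the evidence, the true posterior, and all KL terms are computable exactly (the alphabet sizes and variable names below are illustrative choices of ours, with the observation (x, y, s) collapsed into a single discrete variable v):

```python
import numpy as np

rng = np.random.default_rng(1)
nZ, nV = 4, 6  # discrete latent z and collapsed observation v = (x, y, s)

def rand_dist(*shape):
    """Random probability table normalized over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

pZ = rand_dist(nZ)        # prior p(z)
pVgZ = rand_dist(nZ, nV)  # generative model p(v|z)
qZgV = rand_dist(nV, nZ)  # variational posterior q(z|v)

v = 2                                    # one observation
pV = (pZ[:, None] * pVgZ).sum(axis=0)    # evidence p(v)
pZgV = pZ * pVgZ[:, v] / pV[v]           # true posterior p(z|v)

q = qZgV[v]
elbo = (q * np.log(pVgZ[:, v])).sum() - (q * np.log(q / pZ)).sum()
gap = (q * np.log(q / pZgV)).sum()       # KL(q(z|v) || p(z|v))

print(np.log(pV[v]) >= elbo)                   # True: ELBO lower-bounds the evidence
print(np.isclose(np.log(pV[v]), elbo + gap))   # True: log p(v) = ELBO + KL gap
```

This confirms the two facts the derivation relies on: the bound holds, and its slack is exactly the KL divergence between the variational and true posteriors.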
Ultimately, the minimization of our training loss function corresponds to the maximization of the lower bound in (7), which corresponds to maximizing the likelihood of our implicit generative model, while also optimizing the variational posterior q_θ(z|x, y, s) toward the actual posterior for the latent representation p_Φ(z|x, y, s), since the gap in the bound is given by KL(q_θ(z|x, y, s) ∥ p_Φ(z|x, y, s)). Further factoring of log p_Φ(x, y, s|z) yields the multiple loss terms and network modules.

Adversarial Regularization: We can utilize adversarial censoring when Z and S should be marginally independent, e.g., as in Fig. 1(b) and Fig. 7, in order to reinforce the learning of a representation Z that is disentangled from the nuisance variations S. This is accomplished by introducing an adversarial network that aims to maximize a parameterized approximation q(s|z) of the likelihood p(s|z), while this likelihood is also incorporated into the loss for the other modules with a negative weight. The adversarial network, by maximizing the log-likelihood log q(s|z), essentially maximizes a lower bound of the mutual information I(S; Z); hence the main network is regularized with an additional term that corresponds to minimizing this estimate of the mutual information. This follows since the log-likelihood maximized by the adversarial network is given by

E[log q(s|z)] = I(S; Z) − H(S) − KL(p(s|z) ∥ q(s|z)),

where the entropy H(S) is constant.

Ensemble Learning: We further introduce ensemble methods to make the best use of all Bayesian graph models explored by the AutoBayes framework, without discarding lower-performing models. Stacked generalization works by stacking the predictions of the base learners in a higher-level learning space, where a meta learner corrects the predictions of the base learners (Wolpert, 1992). After training the base learners, we assemble the posterior probability vectors of all base learners together to improve the prediction.
We compare the predictive performance of logistic regression (LR) and a shallow multi-layer perceptron (MLP) as ensemble meta learners to aggregate all inference models. See Appendix A.5 for a more detailed description of stacked generalization.
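A minimal stacked-generalization sketch in the spirit of this setup is shown below; synthetic posteriors and a NumPy softmax-regression meta learner stand in for the actual base models and the LR/MLP meta learners (all sizes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: 3 base learners, 4 classes, 200 samples. Each base learner
# emits a posterior probability vector per sample; stacking concatenates them.
n, n_classes, n_base = 200, 4, 3
y = rng.integers(0, n_classes, n)
base_posteriors = []
for _ in range(n_base):
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), y] += rng.uniform(0.5, 1.5, n)  # weak, noisy signal
    base_posteriors.append(softmax(logits))
X = np.hstack(base_posteriors)  # meta-learner features: stacked posteriors

# Meta learner: multinomial logistic regression fit by gradient descent
W = np.zeros((X.shape[1], n_classes))
Y1h = np.eye(n_classes)[y]
for _ in range(500):
    P = softmax(X @ W)
    W -= 0.5 * X.T @ (P - Y1h) / n

ens_acc = (softmax(X @ W).argmax(axis=1) == y).mean()
base_acc = max((p.argmax(axis=1) == y).mean() for p in base_posteriors)
print(base_acc, ens_acc)  # the meta learner typically matches or beats the best base learner
```

The design mirrors the text: each base learner contributes its full posterior vector (not just a hard decision), so the meta learner can exploit calibrated disagreement between models.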

4. EXPERIMENTAL EVALUATION

Datasets: We experimentally demonstrate the performance of AutoBayes on publicly available datasets listed in Table 1. Note that they cover a wide variety of data sizes, dimensionalities, subject scales, and class levels, as well as sensor modalities including image, EEG, EMG, and electrocorticography (ECoG). See more detailed information on each dataset in Appendix A.6.

Model Implementation: All models were trained with a minibatch size of 32 using the Adam optimizer with an initial learning rate of 0.001. The learning rate is halved whenever the validation loss plateaus. A compact convolutional neural network (CNN) with 4 layers is employed as the encoder network E to extract features from C × T data.

Results: For the dataset in Fig. 8(a), the best-performing graph is the standard classifier model A, whereas the rest of the models underperform. We observe a large gap of 1.0% between the best and worst models, with a standard deviation of 0.23% across all Bayesian graph models. This indicates a potential risk that one particular model may lose up to 1.0% accuracy if we do not explore different models. Similar behavior with large deviations can be seen for the other datasets, as shown in Fig. 8(b). The best inference strategy depends strongly on the dataset; in particular, the best model for one dataset does not perform best for the others. This suggests that we must consider different inference strategies for each target dataset, and AutoBayes provides such an adaptive framework across datasets. More detailed results are found in Appendix A.8. Remarkably, the ensemble of base learners further enhances the performance regardless of whether LR or MLP is chosen as the meta learner, as illustrated in Fig. 2 across all datasets. For some low-performing datasets such as ErrP, MI and Faces (Noisy), ensemble learning significantly improves the accuracy, by 15.3%, 19.3% and 13.2% respectively, at the expense of more storage and computational resources.
Exploring different models has a significant benefit in improving nuisance robustness, as shown in Fig. 9(a), where box-whisker plots present the quartile distribution of the subject variation for the Stress dataset with |S| = 20 users. We observe that the standard classification (Model A) has a wider distribution: the best subject achieves an accuracy greater than 96%, whereas the worst-case user falls below 82%. Except for Model A, the other models from B to Kz take the subject ID (S) into consideration to extract nuisance-robust features, which leads to a significant improvement in the worst-case user performance, not only in the mean or median. The ensemble stacking further improves robustness to subject variation, achieving a worst-case user performance of at least 94%. Additional per-user results are found in Appendix A.9. Despite the performance gain, the nuisance-robust models tend to have higher complexity. Fig. 9(b) shows the trade-off between accuracy and space complexity. Here, we varied the number of hidden layers and hidden nodes for models A, B, and Js to adjust the space complexity. The Pareto front over the finite set of DNN configurations is indicated with lines. The standard classifier model A has superior performance only in low-complexity regimes, and does not improve beyond 95% accuracy even with increased complexity. The Pareto front of AutoBayes is thus better than that of the individual models in higher-accuracy regimes. See Appendix A.10 for an additional analysis of time complexity. We finally compare the performance of AutoBayes with the benchmark competitor models from (Byerly et al., 2020; Han et al., 2020; Özdenizci et al., 2019c;b;a) in Table 2. AutoBayes outperforms the state of the art on all datasets except QMNIST. Consequently, we see a great advantage in AutoBayes's exploration of different graphical models.
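The Pareto front in the accuracy-complexity trade-off can be extracted with a simple dominance test; the sketch below is ours, and the (parameter-count, accuracy) points are purely hypothetical illustrations, not results from the paper:

```python
def pareto_front(points):
    """Return the Pareto-optimal subset of (complexity, accuracy) pairs:
    a model is kept unless some other model is no more complex and no less
    accurate, with at least one strict improvement."""
    front = []
    for (c, a) in points:
        dominated = any(
            (c2 <= c and a2 >= a) and (c2 < c or a2 > a)
            for (c2, a2) in points
        )
        if not dominated:
            front.append((c, a))
    return sorted(set(front))

# Hypothetical (parameter count, accuracy) pairs for a few model configurations
models = [(1e4, 0.90), (5e4, 0.93), (5e4, 0.91), (2e5, 0.95), (8e5, 0.95)]
print(pareto_front(models))  # [(10000.0, 0.9), (50000.0, 0.93), (200000.0, 0.95)]
```

Here (5e4, 0.91) is dropped because an equally sized model is more accurate, and (8e5, 0.95) is dropped because the same accuracy is reached with fewer parameters, mirroring how the fronts in Fig. 9(b) are constructed over the finite set of DNN configurations.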
Even for QMNIST, the AutoBayes meta-MLP model, achieving 99.61% accuracy, ranks 17th on the published leaderboard. Note that performing better than 99.84% is nearly impossible, since some digits are illegible or mislabeled. Also note that we did not specifically design the AutoBayes architecture for image classification but for spatio-temporal signal applications, and the hyperparameters were not fully optimized. AutoBayes can be readily integrated with AutoML to optimize any hyperparameters of the individual DNN blocks. Nevertheless, as our primary objective was to show a proof-of-concept benefit from graphical model exploration alone, we leave a more rigorous optimization of DNN parameters such as network depths, widths, activations, and augmentation as future work.

5. CONCLUSION AND FUTURE WORK

We proposed a new concept called AutoBayes, which explores various Bayesian graph models to facilitate searching for the best inference strategy suited for nuisance-robust deep learning. With the Bayes-Ball algorithm, our method can automatically construct reasonable link connections among classifier, encoder, decoder, nuisance estimator and adversary DNN blocks. As a proof-of-concept analysis, we demonstrated the benefit of AutoBayes on various public datasets. We observed a large performance gap between the best and worst graph models, implying that using one particular model without graph exploration risks poor classification results. In addition, the best model for one dataset does not always perform best on different data, which encourages the use of AutoBayes for adaptive model generation given a target dataset. We further improved performance, approaching state-of-the-art accuracy, by exploiting the multiple graphical models explored by AutoBayes through ensemble stacking. The ensemble AutoBayes offers a significant gain in nuisance robustness by improving the worst-case user performance. Even though additional computations are required, we showed that AutoBayes still achieves a superior Pareto front in the trade-off between complexity and accuracy. We are extending the AutoBayes framework to integrate AutoML for optimizing the hyperparameters of each DNN block. How to handle the exponentially growing search space of possible Bayesian graphs with the number of random variables remains challenging future work; it will likely require more sophisticated metrics, such as the Bayesian information criterion, for efficient graph exploration.

APPENDICES A.1 RELATED WORK

We note that this paper relates to existing literature as follows.

• AutoML: Searching DNN models with hyperparameter optimization has been intensively investigated in a framework called AutoML (Ashok et al., 2017; Brock et al., 2017; Cai et al., 2017; He et al., 2018; Miikkulainen et al., 2019; Real et al., 2017; 2020; Stanley & Miikkulainen, 2002; Zoph et al., 2018). The automated methods include architecture search (Zoph et al., 2018; Real et al., 2017; He et al., 2018; Real et al., 2020), learning rule design (Bayer et al., 2009; Jozefowicz et al., 2015), and augmentation exploration (Cubuk et al., 2019; Park et al., 2019). Most work uses either evolutionary optimization or a reinforcement learning framework to adjust hyperparameters or to construct a network architecture from pre-selected building blocks. Miconi (2016) gradually increases the size of an RNN starting from a single node by incorporating structural parameters into model training, optimized along with the model weights. Zoph & Le (2016) use reinforcement learning to find the optimal neural network architecture based on an actor-critic framework: an LSTM serves as controller and critic to explore the hyperparameter configuration of each layer (number of filters, kernel size and stride), based on the validation error of the output architecture, which serves as the reward. The recent AutoML-Zero (Real et al., 2020) considers an extension that precludes human knowledge and insights for fully automated design from scratch.

• Variational Bayesian Inference: The VAE (Kingma & Welling, 2013) introduced variational Bayesian inference methods incorporating autoassociative architectures, where generative and inference models can be learned jointly. This method was extended with the CVAE (Sohn et al., 2015), which introduces a conditioning variable that can represent nuisance variations, and a regularized VAE (Louizos et al., 2015), which considers disentangling the nuisance variable from the latent representation.

• Adversarial Training: The concept of adversarial networks was introduced with Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) and has been adopted in myriad applications. The simultaneously discovered Adversarially Learned Inference (ALI) (Dumoulin et al., 2016) and Bidirectional GAN (BiGAN) (Donahue et al., 2016) propose an adversarial approach to training an autoencoder. Adversarial training has also been combined with the VAE to regularize and/or disentangle the latent representations (Makhzani et al., 2015; Lample et al., 2017; Creswell et al., 2017).

• Bayesian Network Structure Learning: Deep Bayesian networks (Nie et al., 2018; Njah et al., 2019; Rohekar et al., 2018) have been studied to learn probabilistic relationships between random variables. Learning the structure of a Bayesian network is a long-studied problem, addressed by, e.g., recovery algorithms (Rebane & Pearl, 2013), scoring methods (Campos, 2006), and constraint-based methods (Scutari, 2014; Pearl et al., 2000). Scoring methods commonly use the posterior probability of the Bayesian network given the training data, such as the Bayesian information criterion (BIC). Although the complexity of an exhaustive search is super-exponential in the number of variables, recent approaches (Cussens et al., 2017) showed the capability to learn the structure of Bayesian networks with up to 100 variables using integer programming. Constraint-based methods use conditional independence tests between pairs of variables, commonly a mutual information test or Student's t-test for correlation. All constraint-based methods entail three phases: (i) learning the Markov blanket of each variable, (ii) learning the neighbors (parents and children) of each variable, which identifies the arcs present in the Bayesian network, and (iii) establishing arc directions.

Compared to the existing AutoML literature, our method provides a more systematic framework to explore justifiable network architectures at a macro level. Although Bayesian networks have been studied for designing DNN architectures, our method extends them to realize nuisance robustness by reasonably involving adversarial networks. In addition, ensemble stacking is introduced into the AutoML framework, so that multiple explored architectures can be reused to improve performance over every individual model.
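As a concrete example of the conditional independence tests used by constraint-based methods, a plug-in conditional mutual information estimate on discrete samples can be written in a few lines. This is our own illustrative sketch (variable names, noise levels, and thresholds are assumptions, not from any cited method):

```python
import numpy as np

rng = np.random.default_rng(3)

def conditional_mutual_information(a, b, c, na, nb, nc):
    """Plug-in estimate of I(A; B | C) in nats from discrete samples."""
    p = np.zeros((na, nb, nc))
    np.add.at(p, (a, b, c), 1.0)   # empirical joint counts
    p /= p.sum()
    pc = p.sum(axis=(0, 1))        # p(c)
    pac = p.sum(axis=1)            # p(a, c)
    pbc = p.sum(axis=0)            # p(b, c)
    mask = p > 0
    ratio = p * pc[None, None, :] / (pac[:, None, :] * pbc[None, :, :])
    return float((p[mask] * np.log(ratio[mask])).sum())

# Chain A -> C -> B with 10% flip noise: A ⟂ B | C should (nearly) hold
n = 20000
a = rng.integers(0, 2, n)
c = a ^ (rng.random(n) < 0.1).astype(int)   # noisy copy of A
b = c ^ (rng.random(n) < 0.1).astype(int)   # depends on A only through C
print(conditional_mutual_information(a, b, c, 2, 2, 2))  # near 0
print(conditional_mutual_information(a, c, b, 2, 2, 2))  # clearly positive
```

A structure learner would compare such estimates against a significance threshold (accounting for the plug-in estimator's positive bias at finite sample size) to decide whether to cut an edge.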

A.2 BAYESIAN GRAPH AND INFERENCE MODELS

Given measurement data, we never know the true joint probability beforehand, and therefore we shall assume one of several possible generative models. AutoBayes aims to explore such potential graph models to match the measurement distributions. As the maximum possible number of graphical models is huge even for a four-node case involving Y, S, Z and X, we restrict our focus to the few meaningful graphs-of-interest shown in Fig. 5. Each Bayesian graph corresponds to an assumed factorization of the joint probability, obtained by cancelling conditional dependencies from the full-chain case in equation 1 (the p(x|···) term specifies the generative model of X); for instance, the two-latent models factorize as

p(y, s, z1, z2, x) = p(y) p(s) p(z1|s) p(z2|y) p(x|z2, z1),       Model-J
p(y, s, z1, z2, x) = p(y) p(s) p(z1|s) p(z2|z1, y) p(x|z2, z1),   Model-K   (9)

Blue-colored terms correspond to the blue arrows in Fig. 5 for the generative graph of decoder networks. Depending on the assumed Bayesian graph, the relevant inference strategy varies, as some variables may be conditionally independent, which enables pruning links in the inference factor graphs. As shown in Fig. 6, a reasonable inference graph model can be automatically generated by the Bayes-Ball algorithm (Shachter, 2013) for each hypothetical Bayesian graph inherent in the datasets. Specifically, the conditional probability p(y, s, z|x) can be obtained for each model as below.

Bayesian Graph Model A (Direct Markov): The simplest model relating X and Y is a single Markov chain without any dependency on S and Z, shown in the Bayesian graph of Fig. 5(a). This model assumes that the data are nuisance-invariant. In this case, there is no reason to employ complicated inference models such as A-CVAE, since most factors become independent:

p(y, s, z|x) = p(z) p(s) p(y|x).

We hence should use a standard classification method, as in Fig. 1(a).

Bayesian Graph Model B: For Model B in Fig. 5(b), where a latent Z mediates between Y and X, the Bayes-Ball algorithm yields

p(y, s, z|x) = p(z|x) p(s) p(y|z).

Note that this model assumes independence between Z and S, and thus adversarial censoring (Makhzani et al., 2015; Creswell et al., 2017; Lample et al., 2017) can make it more robust against nuisance. This model is hence based on A-VAE.

Bayesian Graph Model C (Subject-Dependent):

We may model the case where the data X directly depends on subject S and task Y, as shown in Fig. 5(c). For this case, the Bayes-Ball yields the corresponding inference models:

p(y, s, z|x) = p(s|x)p(z|\cancel{s}, \cancel{x})p(y|s, \cancel{z}, x), Model-Cs
p(y, s, z|x) = p(y|x)p(s|y, x)p(z|\cancel{s}, \cancel{y}, \cancel{x}). Model-Cy (10)

Note that this model does not depend on Z, and thus the Z-first inference strategy reduces to the S-first model. As a reference, we here consider an additional Y-first inference strategy to evaluate the difference.

Bayesian Graph Model D (Latent Summary): Another graphical model is shown in Fig. 5(d), where a latent space bridges all other random variables. The Bayes-Ball yields the inference models given in equation 11.

Bayesian Graph Model J (Disentangled Latent): We can also generalize the Bayesian graph with more vertices by considering multiple latent vectors. We here focus on two such graph models with two latent spaces, shown in Figs. 5(j) and (k). These models are of the same class as model D, except that the single latent Z is disentangled into two parts, Z_1 and Z_2, associated with S and Y, respectively. Given the Bayesian graph of Fig. 5(j), the Bayes-Ball yields several inference strategies, including the following two models:

p(y, s, z_1, z_2|x) = p(z_1, z_2|x)p(s|z_1, \cancel{z_2}, \cancel{x})p(y|\cancel{s}, \cancel{z_1}, z_2, \cancel{x}), Model-Jz
p(y, s, z_1, z_2|x) = p(s|x)p(z_1|s, x)p(z_2|\cancel{s}, z_1, x)p(y|\cancel{s}, \cancel{z_1}, z_2, \cancel{x}), Model-Js

which are shown in Figs. 6(i) and (j). Note that Z_2 is marginally independent of the nuisance variable S, which encourages the use of adversarial training to be robust against subject/session variations.

Bayesian Graph Model K (Conditionally Disentangled Latent): Another modified model, shown in Fig. 5(k), links Z_1 and Z_2 and yields the following inference models:

p(y, s, z_1, z_2|x) = p(z_1, z_2|x)p(s|z_1, \cancel{z_2}, \cancel{x})p(y|\cancel{s}, z_1, z_2, \cancel{x}), Model-Kz
p(y, s, z_1, z_2|x) = p(s|x)p(z_1|s, x)p(z_2|\cancel{s}, z_1, x)p(y|\cancel{s}, z_1, z_2, \cancel{x}), Model-Ks

as shown in Figs. 6(k) and (l).
The major difference from model J lies in the fact that the inference graph must use Z_1 along with Z_2 to infer Y.

A.3 BACKGROUND ON VARIATIONAL BAYESIAN INFERENCE

Variational AE: AutoBayes may automatically construct an autoencoder architecture when latent variables are involved, e.g., for model E in Fig. 5(e). In this case, Z represents a stochastic node to marginalize out for X reconstruction and Y inference, and hence a VAE is required. In contrast to vanilla autoencoders, the VAE uses variational inference by assuming a marginal distribution p(z) for the latent. In the variational approach, we reparameterize Z from a prior distribution, such as the normal distribution, in order to marginalize. Depending on the Bayesian graph model, we can also consider semi-supervision on S (i.e., incorporating a reconstruction loss for S) as a conditioning variable. Conditioning on Y and/or S should remain consistent with the graphical model assumptions. Since the VAE is a special case of the CVAE, we describe the more general CVAE in detail below.

Conditional VAE: When X directly depends on S or Y along with Z in the Bayesian graph, AutoBayes gives rise to the CVAE architecture, e.g., for models E/F/G/H/I in Fig. 5. For those generative models, the decoder DNN feeds S or Y as a conditioning parameter. Even for other Bayesian graphs, the S-first inference strategy requires a conditional encoder in the CVAE, e.g., models Ds/Es/Fs/Gs/Js/Ks in Fig. 6, where the latent Z depends on S. Consider the case where S serves as the conditioning variable in a data model with the factorization p(s, x, z) = p(s)p(z)p(x|s, z), where we directly parameterize p(x|s, z), set p(z) to something simple (e.g., an isotropic Gaussian), and leave p(s) arbitrary (since it will not be used directly). The CVAE is trained by maximizing the likelihood of data tuples (s, x) with respect to p(x|s) = ∫ p(x|s, z)p(z) dz, which is intractable to compute exactly given the potential complexity of the parameterization of p(x|s, z).
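To make the pieces of the bound that replaces this intractable integral concrete, the following NumPy sketch evaluates a single-sample Monte-Carlo ELBO for a diagonal-Gaussian posterior. The closed-form Gaussian KL term is standard; the reconstruction log-likelihood value fed to `elbo` is a hypothetical placeholder, not a quantity from the paper.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(log_px_given_sz, mu, log_var):
    """Single-sample ELBO: E_q[log p(x|s,z)] - KL(q(z|s,x) || p(z))."""
    return log_px_given_sz - gaussian_kl(mu, log_var)

mu, log_var = np.zeros(4), np.zeros(4)  # encoder output q(z|s,x) = N(0, I)
print(gaussian_kl(mu, log_var))         # 0.0: no KL penalty when q equals the prior
print(elbo(-3.2, mu, log_var))          # -3.2 with a placeholder reconstruction term
```

When q(z|s, x) drifts away from the prior, the KL term grows and the bound tightens around reconstructions that actually use the latent code.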
While it would be possible to approximate the integration by sampling Z, the crux of the VAE approach is to utilize a variational lower bound on the likelihood that involves a variational approximation of the posterior p(z|s, x) implied by the generative model. With q(z|s, x) representing the variational approximation of the posterior, the Evidence Lower Bound (ELBO) is given by

log p(x|s) ≥ E_{z∼q(z|s,x)}[log p(x|s, z)] − KL(q(z|s, x) ‖ p(z)).

The variational posterior may also be decomposed into parameterized components, e.g., q(s, z|x) = q(s|x)q(z|s, x), as in the S-first models shown in Fig. 6. Such a decomposition also enables semi-supervised training, which is convenient when some of the variables, such as the nuisance variations, are not always labeled. For data tuples that include s, the likelihood q(s|x) can be directly optimized, and the given value of s is used as an input to the computation of q(z|s, x). For tuples where s is missing, the component q(s|x) can instead generate an estimate of s to be input to q(z|s, x). We further discuss semi-supervised learning and the sampling methods for categorical nuisance variables in Appendix A.4 below.

A.4 SEMI-SUPERVISED LEARNING: CATEGORICAL SAMPLING

Graphical Models for Semi-Supervised Learning: Nuisance values S such as subject ID or session ID may not always be available for typical physiological datasets, in particular during the testing phase of an HMI system deployed with new users, requiring semi-supervised methods. We note that some graphical models are well suited to such semi-supervised training. For example, among the Bayesian graph models in Fig. 5, models C/E/G/I require the nuisance S to reproduce X. If no ground-truth labels of S are available, we need to marginalize S across all possible categories for the decoder DNN D. Even for other Bayesian graphs, the corresponding inference factor graphs in Fig.
6 may not be convenient for semi-supervised settings. Specifically, models Ez/Fz/Jz/Kz infer S at an end node, whereas the other inference models use the inferred S for subsequent inference of other variables. If S is missing or unknown, as in a semi-supervised setting, those inference graphs having S at a middle node are inconvenient, since we must sample over all possible nuisance categories. For instance, model Kz shown in Fig. 7 does not need S marginalization, and is thus readily applicable to semi-supervised datasets.

Variational Categorical Reparameterization: To deal with the issue of categorical sampling, we can use the Gumbel-Softmax reparameterization trick (Jang et al., 2016), which enables a differentiable approximation of one-hot encoding. Let [π_1, π_2, . . . , π_{|S|}] denote a target probability mass function for the categorical variable S, and let g_1, g_2, . . . , g_{|S|} be independent and identically distributed samples drawn from the Gumbel distribution Gumbel(0, 1).foot_1 Then, generate an |S|-dimensional vector ŝ = [ŝ_1, ŝ_2, . . . , ŝ_{|S|}] according to

ŝ_k = exp((log(π_k) + g_k)/τ) / Σ_{i=1}^{|S|} exp((log(π_i) + g_i)/τ),

where τ > 0 is a softmax temperature. As τ approaches 0, samples from the Gumbel-Softmax distribution become one-hot and the distribution becomes identical to the target categorical distribution.

Ensemble generalization works by stacking the predictions of the base learners in a higher-level learning space, where a meta learner, denoted M̃, corrects the predictions of the base learners (Wolpert, 1992). After training the base learners, we assemble the posterior probability vectors of all base learners: P_y(x_n) = {P_ky(x_n)} and P_s(x_n) = {P_ks(x_n)}, where k = 1 : 37.
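Returning to the categorical reparameterization above, a minimal NumPy sketch follows; the probability vector and the temperature values are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(pi, tau, rng):
    """Differentiable relaxation of a one-hot draw from the categorical pi."""
    g = -np.log(rng.exponential(size=len(pi)))  # g ~ Gumbel(0, 1) via g = -log(e)
    logits = (np.log(pi) + g) / tau
    z = np.exp(logits - logits.max())           # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])
soft = gumbel_softmax(pi, 1.0, rng)   # diffuse sample at high temperature
hard = gumbel_softmax(pi, 0.01, rng)  # nearly one-hot as tau -> 0
```

Because the sample is a deterministic, differentiable function of the noise g, gradients can flow through ŝ into the decoder during training.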
The meta learner M̃ is trained using the predictions from all base learners as input attributes, D^in_train = {(P_y(x_n), P_s(x_n))}, and the correct labels as outputs, D^out_train = {(y_n, s_n)}, where n = 1 : N_train. The hold-out set D_test is used to measure the classification performance of both the base and meta learners. To make the best use of the base learners, we compare the predictive performance of a logistic regression (LR) model and a shallow MLP as the meta learner in Table 2.
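The stacking procedure can be sketched as follows. The synthetic posteriors, toy labels, and the plain gradient-descent logistic regression are our own stand-ins for the 37 base learners and the LR meta learner compared in Table 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy setup: K base learners, each emitting 2-class posteriors.
K, n_cls, n = 3, 2, 200
X_meta = rng.random((n, K * n_cls))            # stacked posteriors as meta features
y = (X_meta[:, 1] > X_meta[:, 0]).astype(int)  # toy labels aligned with learner 1

def fit_meta_lr(X, y, n_cls, lr=0.5, steps=500):
    """Multinomial logistic regression as the stacking meta learner (plain GD)."""
    W = np.zeros((X.shape[1], n_cls))
    Y = np.eye(n_cls)[y]
    for _ in range(steps):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)       # softmax over classes
        W -= lr * X.T @ (p - Y) / len(X)        # cross-entropy gradient step
    return W

W = fit_meta_lr(X_meta, y, n_cls)
acc = ((X_meta @ W).argmax(axis=1) == y).mean()
print(acc)
```

The meta learner simply reweights the base posteriors, so a well-calibrated base learner dominates the learned weights while poor learners are down-weighted.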

A.6 DATASETS DESCRIPTION

We used publicly available physiological datasets as well as the benchmark MNIST, as follows. The parameters of the datasets are also summarized in Table 1.

• QMNIST: A hand-written digit image dataset extending MNIST with additional label information, including a writer ID number (Yadav & Bottou, 2019).foot_3 There are |S| = 539 writers for classifying |Y| = 10 digits from grayscale 28 × 28 pixel images over 60,000 training samples. An additional 297 writers provide 10,000 test samples.

• Stress: A physiological dataset considering neurological stress level (Birjandtalab et al., 2016).foot_4 It consists of multi-modal biosignals for |Y| = 4 discrete stress states from |S| = 20 healthy subjects, including physical/cognitive/emotional stress as well as relaxation. The data were collected by C = 7 sensors, i.e., electrodermal activity, temperature, three-dimensional acceleration, heart rate, and arterial oxygen level. For each stress state, a corresponding task of 5 minutes (i.e., T = 300 time samples with 1 Hz down-sampling) was assigned to subjects for a total of 4 trials.

• RSVP: An EEG-based typing interface using the rapid serial visual presentation (RSVP) paradigm (Orhan et al., 2012).foot_4 |S| = 10 healthy subjects participated in experiments over three sessions performed on different days. The dataset consists of 41,400 epochs of C = 16 channel EEG data for T = 128 samples, collected by a g.USBamp biosignal amplifier with active electrodes during RSVP keyboard operations. There are |Y| = 4 labels for emotion elicitation, resting-state, or motor imagery/execution tasks.

• MI: The PhysioNet EEG Motor Imagery (MI) dataset (Goldberger et al., 2000).foot_6 Excluding irregular timestamps, the dataset consists of |S| = 106 subjects' EEG data. During the experiments, subjects were instructed to perform cue-based motor execution/imagery tasks while C = 64 channels were recorded at a sampling rate of 160 Hz.
Focusing on motor imagery tasks, we use the EEG data from the three-second post-cue interval (i.e., T = 480 time samples). Subjects performed |Y| = 4-class tasks: right hand, left hand, both hands, or both feet motor imagery. This resulted in a total of 90 trials per subject.

• ErrP: An error-related potential (ErrP) front-central EEG dataset (Margaux et al., 2012).foot_7 The dataset consists of EEG data recorded from |S| = 16 healthy subjects participating in an offline P300 spelling task, where visual feedback of the inferred letter is provided to the user at the end of each trial for 1.3 seconds to monitor evoked brain responses to erroneous decisions made by the system. EEG data were recorded from C = 56 channels in epochs of 1.25 seconds at a sampling rate of 200 Hz (i.e., T = 250). Across five recording sessions, each subject performed a total of 340 trials. Since it was an offline copy-spell task, binary |Y| = 2 labels were provided as erroneous or correct feedback.

• Faces Basic: An implanted electrocorticography (ECoG) array dataset for visual stimulus experiments (Miller et al., 2015; 2016).foot_7 ECoG arrays were implanted on the subtemporal cortical surface of |S| = 14 epilepsy patients. |Y| = 2 classes of grayscale images, either faces or houses, were displayed rapidly in random sequence for 400 ms each, with black-screen intervals of 400 ms. The ECoG potentials were measured with respect to a scalp reference and ground at a sampling rate of 1000 Hz. Subjects performed a basic face-versus-house discrimination task. There were 3 sessions for each patient, with 50 house pictures and 50 face pictures in each run, for a total of 4,100 samples. We analyze the first C = 31 channels for T = 400. Reusing this public dataset requires the ethics statement information.foot_8

• Faces Noisy: The implanted ECoG array dataset for visual stimulus experiments (Miller et al., 2015; 2017).
The experiment is similar to the Faces Basic dataset, except that the pictures of faces and houses are randomly scrambled. There are |S| = 7 subjects with C = 39 channels. See the ethics statement before reusing the dataset.foot_9

• ASL: An EMG dataset for finger gesture identification for American Sign Language (ASL) postures (Günay et al., 2019). |S| = 5 healthy, right-handed subjects participated in experiments with surface EMG (Delsys Inc. Trigno) recorded at 2 kHz from |C| = 16 lower-arm muscles. Subjects shaped their right hand into the letters and numbers of the ASL posture set, presented as pictures on a computer screen (|Y| = 33 postures, 3 trials per posture). Dynamic letters 'J' and 'Z' were omitted, along with the number '0', which is visually identical to the letter 'O'. The participants were given 2 seconds to form the posture, 6 seconds to maintain it, and 2 seconds to rest between trials. The signal is decimated to T = 100.

A.7 DNN MODEL PARAMETERS

For 2D datasets, we use a deep CNN for the encoder E and decoder D blocks. For the classifier C, nuisance estimator N, and adversary A, we use a multi-layer perceptron (MLP) with three layers, whose hidden node counts are double the input dimension. We also use batch normalization (BN) and ReLU activation, as listed in Table 3. Note that for tabular data such as the Stress dataset, the CNN was replaced with a 3-layer MLP having ReLU activation and dropout with a ratio of 20%. Likewise, the MLP classifier was replaced with a CNN for 2D input cases such as in model A. The number of latent dimensions was chosen as |Z| = 64. When we need to feed S along with 2D data X into the CNN encoder, such as in model Ds, the dimension mismatch poses a problem. We address this issue by using one linear layer to project S into the temporal dimension of X and another linear layer to project it into the spatial dimension of X. The outer product of the two projected vectors is concatenated as an additional input channel. We use λ* = 0.01 for the regularization coefficient. We leave hyperparameter exploration integrating AutoML and AutoBayes as future work.
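The dimension-matching trick for feeding S into the 2D CNN encoder can be sketched as follows. The array sizes and the random projection weights are illustrative assumptions; in the actual model, the two projections would be trained linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: X has 1 input channel of C spatial x T temporal dims.
C, T, S_dim = 16, 128, 10

W_spatial  = rng.standard_normal((S_dim, C))  # linear layer: S -> spatial axis
W_temporal = rng.standard_normal((S_dim, T))  # linear layer: S -> temporal axis

def s_channel(s_onehot):
    """Outer product of the two projections gives a (C, T) map that can be
    concatenated to X as one extra input channel."""
    return np.outer(s_onehot @ W_spatial, s_onehot @ W_temporal)

x = rng.standard_normal((1, C, T))            # X as a single-channel 2D array
s = np.eye(S_dim)[3]                          # one-hot subject ID
x_aug = np.concatenate([x, s_channel(s)[None]], axis=0)
print(x_aug.shape)  # (2, 16, 128)
```

The augmented tensor keeps the CNN's 2D input layout while still exposing the subject condition at every spatial-temporal location.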

A.8 PERFORMANCE RESULTS

The additional results for all datasets are listed in Table 4. The results suggest that the best inference strategy highly depends on the dataset: the best model on one dataset does not perform best on another; e.g., the non-variational model Is was best for the ASL dataset, while the variational model Ds was best for the RSVP dataset. This suggests that we should consider different inference strategies for each target dataset, and AutoBayes provides such an adaptive framework. Also note that the reconstruction loss may not be a good indicator for selecting the graph model. In addition, a huge performance gap between the best and worst models was observed for some datasets. For example, a task accuracy of 76.4% was achieved with the non-variational model Dz for the Faces (Noisy) dataset, whereas the variational model B offers 51.4%. This implies a potential risk that one particular model cannot achieve good performance if we do not explore different models.

A.9 SUBJECT VARIATION PERFORMANCE

For the Stress dataset, there are |S| = 20 subjects. As shown in Fig. 9(a), we demonstrated that AutoBayes can improve robustness against the nuisance variation, i.e., the subject ID S. In Fig. 10, we show that the task classification accuracy highly depends on the subject ID S. Here, the box-whisker plots show the accuracy distribution over the different models from A to Kz. Outliers are identified by a whisker factor of 2.4 with respect to the inter-quartile range. It is seen that some users (e.g., S = 8) have superior performance, whereas the classification task is harder for other users (e.g., S = 6). AutoBayes can resolve such nuisance variation by linking the adversarial block for S-independent latent variables Z to generate subject-invariant features.
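The sign convention of this adversarial linking can be summarized in a toy sketch; the loss values below are placeholders, with λ* = 0.01 taken from Appendix A.7.

```python
# Placeholder per-batch losses for the classifier C, decoder D, and adversary A.
task_loss, recon_loss, adv_loss = 0.70, 1.20, 2.30
lam = 0.01  # lambda* = 0.01, as in Appendix A.7

# Encoder E and classifier C minimize the task and reconstruction terms while
# MAXIMIZING the adversary's loss for predicting S from Z, pushing the latent
# features toward subject invariance; A is updated separately to minimize
# adv_loss on the (detached) latent Z.
encoder_objective = task_loss + lam * recon_loss - lam * adv_loss
print(encoder_objective)
```

The minus sign is what makes a well-performing adversary (low adv_loss) penalize the encoder, so the two networks settle on a latent code from which S is hard to recover.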



For example, in speech recognition, nuisance factors such as the speaker's attributes and the recording environment may change the task accuracy. In image recognition, ambient light conditions and image sensor conditions may become inherent nuisance factors. In the context of this paper, nuisance variations mainly refer to subject identities and biological states during recording sessions for physiological data learning.

The Gumbel(0, 1) distribution can be sampled by drawing e ∼ Exp(1) and computing g = −log(e).

QMNIST dataset: https://github.com/facebookresearch/qmnist
Stress dataset: https://physionet.org/content/noneeg/1.0.0/
RSVP dataset: http://hdl.handle.net/2047/D20294523
MI dataset: https://physionet.org/physiobank/database/eegmmidb/
ErrP dataset: https://www.kaggle.com/c/inria-bci-challenge/
Faces dataset: https://exhibits.stanford.edu/data/catalog/zk881ps0522
ASL dataset: http://hdl.handle.net/2047/D20294523

Ethics statement: All patients participated in a purely voluntary manner, after providing informed written consent, under experimental protocols approved by the Institutional Review Board of the University of Washington (#12193). All patient data was anonymized according to IRB protocol, in accordance with HIPAA mandate. These data originally appeared in the manuscript "Spontaneous Decoding of the Timing and Content of Human Object Perception from Cortical Surface Recordings Reveals Complementary Information in the Event-Related Potential and Broadband Spectral Change" published in PLoS Computational Biology in 2016 (Miller et al., 2016) and in the manuscript "Face percept formation in human ventral temporal cortex" published in Journal of Neurophysiology in 2017 (Miller et al., 2017).



Figure 1: Inference methods to classify Y given data X under latent Z and semi-labeled nuisance S.

Figure 4: Full-chain Bayesian graph and inference models for Z-first or S-first factorizations.

Figure 5: Example Bayesian graphs for data generative models under automatic exploration. Blue arrows indicate the generative graph for decoder networks. A thick-circled S specifies the requirement of an S-conditional decoder, which is less convenient when learning unlabeled nuisance datasets.

Figure 6: Z-first and S-first inference graph models relevant for generative models D-G, J, and K. Green arrows indicate feature extraction graph for encoder networks. Thick circled S specifies the end node of inference, which is convenient when learning unlabeled nuisance datasets.

Figure 7: Overall network structure for pairing generative model K and inference model Kz.

Figure 8: Task classification accuracy across different graphical models (with standard deviation).

Figure 9: Task classification accuracy for Stress dataset.


p(y, s, z|x) = p(z|x)p(s|z, \cancel{x})p(y|s, z, \cancel{x}), Model-Dz
p(y, s, z|x) = p(s|x)p(z|s, x)p(y|z, s, \cancel{x}), Model-Ds (11)

whose graphical models are depicted in Figs. 6(a) and (b), respectively.

Bayesian Graph Model E (Task-Summary Latent): Another graphical model involving latent variables is shown in Fig. 5(e), where the latent space only summarizes Y. The Bayes-Ball yields the following inference models:

p(y, s, z|x) = p(z|x)p(s|z, x)p(y|z, \cancel{s}, \cancel{x}), Model-Ez
p(y, s, z|x) = p(s|x)p(z|s, x)p(y|\cancel{s}, z, \cancel{x}), Model-Es (12)

which are illustrated in Figs. 6(c) and (d). Note that the generative model E has no marginal dependency between Z and S, which motivates adversarial censoring to suppress nuisance information S in the latent space Z. In addition, because the generative model of X depends on both Z and S, it is justified to employ the A-CVAE classifier shown in Fig. 1(b).

Bayesian Graph Model F (Subject-Summary Latent): Consider Fig. 5(f), where the latent variable summarizes the subject information S. The Bayes-Ball provides the inference graphs shown in Figs. 6(e) and (f), which respectively correspond to:

p(y, s, z|x) = p(z|x)p(s|z, \cancel{x})p(y|\cancel{s}, x, z), Model-Fz
p(y, s, z|x) = p(s|x)p(z|s, x)p(y|x, \cancel{s}, z). Model-Fs (13)

Bayesian Graph Model G: Letting the joint distribution follow model G in Fig. 5(g), we obtain the following inference models via the Bayes-Ball:

p(y, s, z|x) = p(z|x)p(s|z, x)p(y|s, z, \cancel{x}), Model-Gz
p(y, s, z|x) = p(s|x)p(z|s, x)p(y|z, s, \cancel{x}), Model-Gs (14)

whose graphical models are described in Figs. 6(g) and (h). Note that the inference model Gs in Fig. 6(h) is identical to the inference model Ds in Fig. 6(b). Although the inference graphs Gs and Ds are identical, the generative model of X differs, as shown in Figs. 5(g) and (d). Specifically, the VAE decoder for model G must feed S along with the variational latent space Z, and thus using a CVAE is justified for model G but not for model D. This difference in the generative models can make a different impact on inference performance even though the inference graph itself is identical.

Bayesian Graph Models H and I: Both generative models H and I, shown in Figs. 5(h) and (i), have the fully-connected inference strategies given in (2), whose graphs are shown in Figs. 4(b) and (c), respectively, since no useful conditional independency can be found with the Bayes-Ball. Analogous to the relation of models Ds and Gs, the inference graph can be identical for the Bayesian graphs H and I, whereas the generative model of X differs, as shown in Figs. 5(h) and (i).


Figure 10: Task classification accuracy across subject ID for Stress dataset.

Figure 11: Task classification accuracy as a function of time complexity for Stress dataset.

Table 1: Parameters of the public datasets under investigation.

Task classification performance of AutoBayes compared to state-of-the-art.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710, 2018.

The temperature τ is usually decreased across training epochs as an annealing technique, e.g., with exponential decay.

A.5 ENSEMBLE LEARNING: STACKED GENERALIZATION

To achieve higher predictive performance, we construct ensembles from the output posterior class probabilities of all graphical models. Let D_0 = {(x_n, y_n, s_n) | n = 1 : N} denote a dataset, where x_n is a data instance, y_n is the task label, s_n is the nuisance (subject) label, and N is the number of samples in the dataset. We randomly split the data into a training set D_train and a validation set D_test. Given 37 graphical models, which we call base learners, we induce a decision algorithm M_k, for k = 1, . . . , 37, by invoking the k-th graphical model on the data in D_train. For each x_n in D_train, graphical model M_k generates a class probability vector for task and nuisance label prediction. Let P_ky(x_n) = {P(y_1|x_n), . . . , P(y_i|x_n), . . . , P(y_{N_y}|x_n)} denote the posterior probability distribution over the N_y task labels and P_ks(x_n) = {P(s_1|x_n), . . . , P(s_i|x_n), . . . , P(s_{N_s}|x_n)} the posterior probability distribution over the N_s nuisance labels produced by model M_k given data instance x_n.

DNN model parameters in Fig. 7. Conv(h, w)-c/g denotes a 2D convolution layer with a kernel size of (h, w), c output channels, and g groups. FC(h) denotes a fully-connected layer with h output nodes. BN denotes batch normalization.


Under review as a conference paper at ICLR 2021 

