LEARNING DOMAIN-AGNOSTIC REPRESENTATION FOR DISEASE DIAGNOSIS

Abstract

In clinical environments, image-based diagnosis should be robust to multi-center samples. Toward this goal, a natural approach is to capture only disease-related features. However, such features are often entangled with center effects, which prevents robust transfer to unseen centers/domains. To disentangle disease-related features, we first leverage structural causal modeling to explicitly model disease-related features and center effects, which are provably disentangled from each other. Guided by this, we propose a novel Domain Agnostic Representation Model (DarMo) based on the variational auto-encoder. To facilitate disentanglement, we design domain-agnostic and domain-aware encoders to capture disease-related features and varied center effects, respectively, and incorporate a domain-aware batch normalization layer. In addition, we constrain the disease-related features to predict both the disease label and clinical attributes by incorporating a Graph Convolutional Network (GCN) into our decoder. The effectiveness and utility of our method are demonstrated by its superior performance over other methods on both public and in-house datasets.

1. INTRODUCTION

A major barrier to deploying current deep learning systems for medical imaging diagnosis lies in their non-robustness to distributional shift between internal and external cohorts (Castro et al., 2020; Ma et al., 2022; Lu et al., 2022), which commonly exists among multiple healthcare centers (e.g., hospitals) due to differences in image acquisition protocols. For example, image appearance can vary considerably across scanner models, parameter settings, and data preprocessing, as shown in Fig. 1 (a, b, c). Such a shift can deteriorate the performance of trained models, as manifested by a nearly 6.7% AUC drop of the empirical risk minimization (ERM) method from internal cohorts (source domain, in distribution) to external cohorts (unseen domain, out of distribution), as shown in Fig. 1 (bar graph).

To resolve this problem, existing studies learn task-related features (Castro et al., 2020; Kather et al., 2022; Wang et al., 2021b) from multiple environments of data. Although the learned representation can capture lesion-related information, it is not guaranteed that such features are disentangled from the center effect, i.e., variations in image distributions due to domain differences in acquisition protocols (Fang et al., 2020; Du et al., 2019; Garg et al., 2021). The mixture of such variations leads to biases in the learned features and final predictions. Therefore, a key question in robustness is: in which way can the disease-related features be disentangled from the center effect?

Recently, Sun et al. (2021) showed that task-related features can be disentangled from others, but this requires that the input X and the output Y are generated simultaneously. However, this requirement is often not satisfied in disease prediction scenarios; e.g., Y can refer to ground-truth disease labels acquired from pathological examination, which can affect lesion patterns in the image X.

To achieve this disentanglement, we build our model in Fig. 2 (b) via structural causal modeling (SCM), which can effectively encode prior knowledge beyond data with hidden variables and causal relations. As shown, we introduce $v_{ma}$ and $v_{mi}$ to denote, respectively, the macroscopic and microscopic parts of disease-related features that are often employed in clinical diagnosis. Specifically, the macroscopic features encode morphology-related attributes (Surendiran & Vadivel, 2012) of lesion areas, as summarized by the American College of Radiology (ACR) (Sickles et al., 2013), while the microscopic features are hard to observe but reflect subtle patterns of lesions. Taking the mammogram in Fig. 2 (a) as an illustration, the macroscopic features refer to the margins, shapes, and spiculations of the masses, while the microscopic features refer to the textures and the curvatures of contours (Ding et al., 2020a). As these disease-related patterns vary between malignant and benign cases, they are determined by the disease status $y$, and we correspondingly have $y \to (v_{ma}, v_{mi})$ in Fig. 2 (b). Besides, $v_{ma}$ differs from $v_{mi}$ in that it is related to the clinical attributes $A$, which are easy to observe from the image. In addition to disease-related features, we also introduce $v_d$ to account for domain gaps caused by the center effect in the image. Note that given the image X (i.e., conditioning on X), $v_d$ is correlated with $(v_{ma}, v_{mi})$, making them entangled with each other. This entanglement can cause bias and thus unstable prediction behavior when the model is transferred to unseen centers/domains.

Equipped with this causal modeling, we can observe that the distributional shift of data is mainly accounted for by the variation of $v_d$ across domains. Moreover, we can theoretically prove that when this variation is diverse enough, the disease-related features can be disentangled from the center effect. To the best of our knowledge, we are the first in the literature on imaging diagnosis to prove that this disentanglement is possible.
Inspired by this result, we propose a disentangling learning framework, dubbed Domain Agnostic Representation Model (DarMo), to disentangle disease-related features for prediction. Specifically, we adopt a variational auto-encoder framework and decompose the encoder into domain-agnostic and domain-aware branches, which respectively encode disease-related information $(v_{ma}, v_{mi})$ and the domain effect $v_d$. To account for the variation of $v_d$ across domains, we incorporate a domain-aware batch normalization (BN) layer into the domain-aware encoder to capture the effect in each domain. To capture disease-related information, we use disease labels to supervise $(v_{ma}, v_{mi})$ and additionally constrain $v_{ma}$ to reconstruct clinical attributes, using a Graph Convolutional Network (GCN) to model relations among attributes.

To verify the utility and effectiveness of our method, we apply it to mammogram benign/malignant classification. Here the clinical attributes are those related to the masses, which are summarized in ACR (Sickles et al., 2013) and are easy to obtain. We consider four datasets (one public and three in-house) collected from different sources. The results on unseen domains show that our method can outperform others by 6.2%. Besides, our learned disease-related features successfully encode the information on the lesion areas.

In summary, our contributions are mainly three-fold: a) We leverage SCM to encode medical prior knowledge, equipped with which we theoretically show that the disease-related features can be disentangled from the domain effect; b) We propose a novel DarMo framework with domain-agnostic and domain-aware encoders, which facilitates the disentanglement of disease-related features from the center effect to achieve robust prediction; c) Our model achieves state-of-the-art performance in terms of robustness to distributional shifts across domains in breast cancer diagnosis.
It considers multiple domains (centers) and aims to improve the diagnosis performance in unseen domains. However, previous methods suffer a dramatic performance decrease when tested on data from a different domain with a different bias (Ilse et al., 2020; Sathitratanacheewin et al., 2020; Zhang et al., 2022a). Thus, such models are not robust enough for the actual task (Azulay & Weiss, 2020; Cheng et al., 2022). An intuitive idea for bridging domain gaps among multiple centers is to learn a domain-agnostic representation. Existing progress can be roughly divided into three classes: (i) Learning domain-specific constraints, e.g., (Chattopadhyay et al., 2020) aim to learn domain-specific masks, but this fails on medical images because masks are not suitable for distinguishing different domains. (ii) Disentanglement-based methods, e.g., (Ilse et al., 2020) model three independent latent subspaces for the domain, the class, and the residual variations, respectively. They do not make use of the medical attribute knowledge, which is important in our mammogram classification. (iii) Designing invariance constraints, e.g., (Arjovsky et al., 2019; Zhang et al., 2022b) aim to learn invariant representations across environments by minimizing the Invariant Risk Minimization term. (Ganin et al., 2016) and (Li et al., 2018) use an adversarial approach, with the former performing domain-adversarial training to ensure a closer match between the source and the target distributions. The lack of disentanglement and of guidance from medical prior knowledge limits their performance on unseen domains.

In the following, we first introduce our causal model that incorporates medical priors regarding heterogeneous data from multiple domains in Sec. 3.1. With this modeling, we show that the domain-agnostic causal features can be disentangled from domain-aware features if we can fit the distribution of each domain well. Guided by this result, in Sec. 3.2 we propose a variational auto-encoder (VAE)-based method as a generative model to fit these distributions, so as to learn causal features for disease prediction.

We decompose the latent factors $v$ of the input image $x$ into domain-agnostic causal features $(v_{ma}, v_{mi})$ that are determined by the disease status $y$, and other domain-aware features $v_d$ affected by the domain variable $d$. For the domain-agnostic causal features, we further denote $v_{ma}$ as macroscopic features that generate the clinical attributes $A$ (such as shapes and margins (Sickles et al., 2013; Wang et al., 2021b; Zhao et al., 2022)) that are normally utilized by clinicians for disease prediction, and $v_{mi}$ as microscopic features (such as textures and curvatures of contours (Ding et al., 2020a)) that may be difficult to observe but can encode the high-frequency patterns of lesions. The features $v_d$ can encode biases introduced during the image acquisition process by different centers/medical devices.

[Figure panel: A Malignant Mass]

If we directly train a predictor $p(y|x)$ using a neural network, the representation extracted from $x$ can entangle the causal features $(v_{ma}, v_{mi})$ and the center effects $v_d$, because conditioning on $x$ induces a spurious path from $v_d$ to $(v_{ma}, v_{mi})$, making them correlated with each other. Such an entanglement makes it hard to generalize well to new centers' data. Specifically, if we denote by $S$ the representation learned from the training domains' data, then the distribution of $S$ in the diseased group can be affected by $v_d$, which is domain-aware. Therefore, this distribution can change considerably on another domain's data, which may make it difficult to discriminate the diseased group from the normal one in terms of the distribution of $S$.

To remove this domain dependency, it is desirable to disentangle causal features from domain-aware features. Indeed, this disentanglement can be achieved by acquiring data from multiple domains with diverse distributions. Specifically, the difference between $(v_{ma}, v_{mi})$ and $v_d$ mainly lies in whether the feature is domain-invariant. The diversity among domains can thus provide a clue for identifying invariant information, i.e., $(v_{ma}, v_{mi})$ in our scenario, as shown in the following theorem:

Theorem 3.1 (Informal). Suppose that multiple domains are diverse enough. Then, as long as we can fit each domain's distribution well, for each image $x \leftarrow f_x(v^\star_{mi}, v^\star_{ma}, v^\star_d)$, the learned factors $(\tilde{v}_{mi}, \tilde{v}_{ma}, \tilde{v}_d)$ satisfy $\tilde{v}_{mi} = h_{mi}(v^\star_{mi})$, $\tilde{v}_{ma} = h_{ma}(v^\star_{ma})$, $\tilde{v}_d = h_d(v^\star_d)$ for some $h_{mi}$, $h_{ma}$, $h_d$.

Remark 3.1. The diversity condition means that the dependency of $v_d$ on $d$ and of $(v_{ma}, v_{mi})$ on $y$ is strong enough, which is shown in the appendix to hold generically. This theorem states that as long as we can fit the data well, we can identify each factor, in particular the domain-agnostic causal features $(v_{ma}, v_{mi})$, up to a transformation that does not depend on $v_d$.
In this regard, the learned domain-agnostic causal features are disentangled from domain effects. Guided by this analysis, we propose a variational auto-encoder (VAE)-based method as a generative model to fit data from each center. For each domain $d$ with $n_d$ samples, the loss is

$$\ell_d(q_d, p^d_\theta) = -\frac{1}{n_d}\sum_{i=1}^{n_d}\left[\log p^d_\theta(y_i, A_i \mid x_i) + \mathbb{E}_{q_{\psi^d}(v \mid x_i)} \log\frac{p^d_\theta(x_i, v)}{q_{\psi^d}(v \mid x_i)}\right]$$
$$= -\frac{1}{n_d}\sum_{i=1}^{n_d}\left[\log \mathbb{E}_{q_{\psi^d}(v \mid x_i)}\big(p_\theta(A_i \mid v_{ma})\, p_\theta(y_i \mid v_{ma}, v_{mi})\big) + \mathbb{E}_{q_{\psi^d}(v \mid x_i)}\big(\log p_\theta(x_i \mid v)\big) - \mathrm{KL}\big(q_{\psi^d}(v \mid x_i),\, p^d_\theta(v)\big)\right], \quad (1)$$

with $q_{\psi^d}(v \mid x)$ learned to approximate $p^d_\theta(v \mid x)$. To optimize the loss, we need to parameterize the prior models $p^d_\theta(v_{ma}, v_{mi}, v_d) := p_\theta(v_d \mid d)\, p(v_{ma}, v_{mi})$, the inference models $q_{\psi^d}(v \mid x)$ (i.e., the encoder), and the generative models $p_\theta(x \mid v_{ma}, v_{mi}, v_d)$, $p_\theta(A \mid v_{ma})$, $p_\theta(y \mid v_{ma}, v_{mi})$ (i.e., the decoder). In the following, we introduce our implementation of these models. As illustrated in Fig. 3, we propose a two-branch encoder: a Domain-Agnostic Encoder to extract $(v_{ma}, v_{mi})$ and a Domain-Aware Encoder to extract $v_d$. For the latter, we incorporate a domain-aware BN to capture the variation across multiple domains. With the learned causal features, we use a graph convolutional network to capture relations among clinical attributes.

Domain-Aware Prior Models. Following the causal graph in Fig. 2, we factorize $p^d_\theta(v_{ma}, v_{mi}, v_d) = p(v_{ma}, v_{mi})\, p_\theta(v_d \mid d)$, where $p(v_{ma}, v_{mi})$ is modeled as an isotropic Gaussian, while $p_\theta(v_d \mid d)$ is domain-aware and is parameterized as a Multilayer Perceptron (MLP) that takes the one-hot encoded vector $d \in \mathbb{R}^m$ as input.

Domain-Aware/Agnostic Inference Models. To disentangle the causal features $(v_{ma}, v_{mi})$ from domain effects, we adopt a mean-field approximation to factorize $q_{\psi^d}(v_d, v_{ma}, v_{mi} \mid x)$ as $q_{\psi_1}(v_{ma}, v_{mi} \mid x)\, q_{\psi^d_2}(v_d \mid x)$, with $q_{\psi_1}(v_{ma}, v_{mi} \mid x)$ and $q_{\psi^d_2}(v_d \mid x)$ respectively implemented via a domain-agnostic disease-relevant encoder (DADR) and a domain-aware disease-irrelevant encoder (DADI). This parameterization is inspired by the domain-invariant and domain-variant properties of $(v_{ma}, v_{mi})$ and $v_d$, respectively. By attributing the domain-aware effects to the feature $v_d$ while sharing the parameters $\psi_1$ of the domain-agnostic encoder across all centers, the domain-aware effects can be removed from the learned macroscopic and microscopic information, leading to robust generalization across domains. With shared parameters of the domain-agnostic encoder, we have

$$p^d_\theta(y, A \mid x) = \int p_\theta(A \mid v_{ma})\, p_\theta(y \mid v_{ma}, v_{mi})\, q_{\psi_1}(v_{ma}, v_{mi} \mid x)\, dv_{mi}\, dv_{ma},$$

which hence does not depend on the domain index $d$. To reflect the variety of domain effects, the domain-aware encoder contains a Domain-Aware Layer (DAL), which is composed of $m$ batch-normalization (BN) layers with parameters $(\gamma_d, \beta_d)$ for each center:

$$\tilde{f}_d = \mathrm{BN}_{\gamma_d, \beta_d}(f) = \gamma_d \hat{f} + \beta_d, \qquad \hat{f} = \frac{f - \mu_B}{\sqrt{\delta_B^2 + \epsilon}},$$

where $\hat{f}$ denotes the features normalized by the mini-batch mean $\mu_B$ and variance $\delta_B^2$.

Disease-Attribute Generative Models. To learn $v_{ma}, v_{mi}, v_d$, we constrain them to recover $x$ and to predict $A$ and $y$, respectively via $p_{\theta_x}(x \mid v)$, $p_{\theta_A}(A \mid v_{ma})$ and $p_{\theta_y}(y \mid v_{ma}, v_{mi})$. Specifically, to capture macroscopic patterns in $v_{ma}$, we constrain it to estimate the clinical attributes $A$, which include macroscopic information such as shape, margins, lobulation, etc.
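The Domain-Aware Layer above can be illustrated with a minimal sketch. It assumes a single scalar feature channel and plain Python lists in place of tensors; the class name `DomainAwareBN`, the identity initialization, and the example values are hypothetical and are not taken from the authors' implementation.

```python
import math


class DomainAwareBN:
    """Illustrative domain-aware layer: shared normalization, per-domain (gamma, beta)."""

    def __init__(self, num_domains, eps=1e-5):
        # One scale/shift pair per domain (assumed identity initialization).
        self.gamma = [1.0] * num_domains
        self.beta = [0.0] * num_domains
        self.eps = eps

    def __call__(self, features, domain):
        # Normalize by the mini-batch mean and (biased) variance.
        n = len(features)
        mean = sum(features) / n
        var = sum((f - mean) ** 2 for f in features) / n
        normalized = [(f - mean) / math.sqrt(var + self.eps) for f in features]
        # Apply the scale and shift that belong to this batch's domain.
        return [self.gamma[domain] * f + self.beta[domain] for f in normalized]


layer = DomainAwareBN(num_domains=3)
layer.gamma[1], layer.beta[1] = 2.0, 0.5  # pretend domain 1 has learned these values
out_d0 = layer([1.0, 2.0, 3.0], domain=0)
out_d1 = layer([1.0, 2.0, 3.0], domain=1)
```

The same mini-batch is normalized identically for both domains; only the affine parameters differ, which is how the layer attributes domain-specific variation to $(\gamma_d, \beta_d)$.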
As correlations among clinical attributes can be helpful for disease diagnosis, we propose to reconstruct $A$ via a Graph Convolutional Network (GCN).

InH1                               ○       •       •       •
InH2                               •       ○       •       •
InH3                               •       •       ○       •
DDSM                               •       •       •       ○
ERM (He et al., 2016)            0.822   0.758   0.735   0.779
(Chen et al., 2019)              0.877   0.827   0.804   0.830
Guided-VAE (Ding et al., 2020b)  0.872   0.811   0.779   0.811
IAIA-BL (Barnett et al., 2021)   0.861   0.803   0.767   0.782
ICADx (Kim et al., 2018)         0.882   0.802   0.777   0.826
(Li et al., 2019)                0.848   0.794   0.769   0.815
DANN (Ganin et al., 2016)        0.857   0.811   0.781   0.813
MMD-AAE (Li et al., 2018)        0.860   0.783   0.770   0.786
DIVA (Ilse et al., 2020)         0.865   0.809   0.784   0.813
IRM (Arjovsky et al., 2019)      0.889   0.830   0.795   0.829
(Chattopadhyay et al., 2020)     0.851   0.796   0.772   0.797
DDG (Zhang et al., 2022b)        0.867   0.811   0.778   0.802
EFDM (Zhang et al., 2022c)       0

4. EXPERIMENTS

Datasets and Implementation. To evaluate the effectiveness of our model, we apply it to mammogram mass benign/malignant classification, which has drawn increasing attention recently (Wang et al., 2021a; Zhao et al., 2018; Wang et al., 2021b; Lei et al., 2020; Wang et al., 2020; 2022) due to its clinical use. The public dataset (DDSM (Bowyer et al., 1996)) and the three in-house datasets (InH1, InH2, and InH3) that we use come from different centers (centers 4, 1, 2, and 3, respectively). Different medical devices, different regions/countries, and different image formats cause domain gaps. We randomly split each dataset into training, validation, and testing sets with an 8:1:1 patient-wise ratio. The network inputs are resized to 224 × 224 and augmented with random horizontal flips. To verify the effectiveness of multi-center benign/malignant diagnosis, we report our performance on the external cohort (unseen domains) in Tab. 1, where training data and testing data come from different domains. To reduce the effect of randomness, we run each experiment 10 times and report the average values. To further validate our method, we also report internal cohort results (source domain, i.e., the same domain as the training domain) for each dataset, which can be seen as the upper bound for that dataset. For a fair comparison, the sizes of the training sets are kept the same across all settings. Image-wise Area Under the Curve (AUC) is used as the evaluation metric. More details on the datasets and implementation are given in the Appendix.
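A patient-wise 8:1:1 split, as described above, can be sketched as follows. The helper name `patient_wise_split`, the image and patient identifiers, and the fixed seed are assumptions made for illustration; the point is only that patients, not images, are shuffled and partitioned, so that no patient contributes images to more than one subset.

```python
import random


def patient_wise_split(image_to_patient, ratios=(8, 1, 1), seed=0):
    """Split images into train/val/test so that no patient spans two subsets."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    total = sum(ratios)
    n_train = len(patients) * ratios[0] // total
    n_val = len(patients) * ratios[1] // total
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every image follows its patient into exactly one subset.
    return {
        name: sorted(img for img, pid in image_to_patient.items() if pid in members)
        for name, members in groups.items()
    }


# Hypothetical toy data: 20 patients with two images each.
images = {f"img{i:02d}": f"patient{i // 2}" for i in range(40)}
splits = patient_wise_split(images)
```

Splitting at the patient level avoids leaking images of the same patient between training and testing, which would otherwise inflate the image-wise AUC.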

4.1. RESULTS

Compared Baselines. We compare our model with the following methods: a) ERM (He et al., 2016)

Results & Analysis on external cohorts (unseen domains). To verify the effectiveness of our method under unseen domains (out-of-distribution), we train our model on the combination of three datasets from three different centers and test on the external cohort (another, unseen dataset from a different center). As shown in Tab. 1 (Lines 1-18), our method achieves state-of-the-art results under unseen domains in all settings. Specifically, the first six lines are methods based on different representation learning techniques, which we extend to our domain generalization task. The next seven lines are methods aimed at domain generalization. (Li et al., 2019) generates more data under the current domain; the larger amount of data improves the performance compared with ERM (He et al., 2016), but augmenting only the current domain greatly limits its domain generalization ability. (Chattopadhyay et al., 2020) learns domain-specific masks (Clipart, Sketch, Painting); however, the gap that exists among medical images cannot be bridged through mask learning. DANN (Ganin et al., 2016), DDG (Zhang et al., 2022b), EFDM (Zhang et al., 2022c) and MMD-AAE (Li et al., 2018) design distance constraints between the source and the target distributions. However, simple distance constraints do not target the features that matter for cancer diagnosis and are not robust enough. The advantage of Guided-VAE (Ding et al., 2020b) and DIVA (Ilse et al., 2020) over the methods mentioned above may be due to their use of disentanglement learning. IRM (Arjovsky et al., 2019) learns invariant representations across environments via Invariant Risk Minimization. However, the lack of guidance from disentanglement learning limits its performance. Guided-VAE (Ding et al., 2020b) introduces attribute prediction, which improves its performance beyond that of DIVA (Ilse et al., 2020).
The improvements in ICADx (Kim et al., 2018) and Guided-VAE (Ding et al., 2020b) indicate the importance of guidance from attribute learning. Although ICADx (Kim et al., 2018) uses the attributes during learning, it fails to model correlations between attributes and diagnosis, which limits its performance. With further exploration of attributes via GCN, our method outperforms ICADx (Kim et al., 2018) and Guided-VAE (Ding et al., 2020b). Compared to (Chen et al., 2019) and IAIA-BL (Barnett et al., 2021), which also implement attribute learning, we additionally employ disentanglement learning with a variance regularizer, which can help to identify invariant disease-related features during prediction.

Comparisons on internal cohorts (source domains). We further compute the in-distribution AUC of every single dataset under internal cohorts (Tab. 2). Comparing Tab. 2 with Tab. 1, our method shows stable performance, while other methods drop considerably on external cohorts. We argue that, owing to our mechanism for domain generalization, the performance of our method on external cohorts (out-of-distribution) is on par with that on internal cohorts (in-distribution). For example, when testing on DDSM (Bowyer et al., 1996), the performance of our model trained on InH1+InH2+InH3 is comparable to that of our model trained on DDSM itself.

4.2. ABLATION STUDY

Ablation study on each component. To verify the effectiveness of each component in DarMo, we evaluate several variant models on external cohorts, as shown in Tab. 4 in the appendix. To reduce the impact of the particular combination of training domains, we also train our model under different training combinations and show the results in the Appendix. The results indicate that the influence of the choice of domains is not obvious and that three domains are sufficient to achieve comparable results.

Ablation study on the ratio of using domain-aware layers. To study the effect of the proportion of DAL, we replace the original BN layers with DAL in different ratios. The results are shown in Tab. 1, Lines 18-21; specifically, 1/3 means that only 1/3 of the BN layers in the network are replaced, and so forth. As shown, under a lower ratio the performance drops due to the poorer domain interpretability, while a higher ratio yields better performance.

Ablation study on Domain-Aware Mechanism. To investigate the proposed domain-aware BN layer in depth, we analyze various ways of handling multiple domains, as follows: a) Multiple Encoders (ME). Since the irrelevant encoder contains the information of domain environments, an intuitive idea is to use multiple irrelevant encoders, so that each domain directly has its own irrelevant encoder. b) Grouped Layer (GL). To reduce the number of parameters of ME, we consider several groups of blocks, with each group containing two blocks of the same structure. Each group activates only one block at a time, and different domains correspond to different block combinations. The number of groups is set to $n$ such that $2^n = m$, where $m$ denotes the number of domains; if $m$ is not a power of 2, we use the smallest $m' > m$ that satisfies $2^n = m'$. Thus each domain corresponds to one combination obtained by choosing one block from each group. c) Domain-Aware BN Layer (DAL). To further reduce the number of parameters and achieve domain generalization, we propose the domain-aware BN layer for each domain. The scaling and shifting parameters in each layer are learned adaptively. We conduct experiments under the mechanisms above, and the results are shown in Tab. 1, Lines 18-23. The three mechanisms have comparable performance. Since BN can usually be used as an effective measure for domain adaptation (Ioffe & Szegedy, 2015), DAL can be slightly better than the others with lighter computation, especially compared to ME.
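The block-combination rule of the Grouped Layer (GL) variant can be made concrete with a small sketch. It assumes that domains are indexed from 0 and that each domain's combination is simply the binary code of its index; the function name `grouped_layer_assignment` and this particular encoding are illustrative choices, not details given in the text.

```python
def grouped_layer_assignment(num_domains):
    """Map each domain index to one block choice (0 or 1) per group."""
    # Smallest n with 2**n >= num_domains, using at least one group.
    n_groups = max(1, (num_domains - 1).bit_length())
    # Group g uses block 0 or 1 according to bit g of the domain index.
    return {
        d: tuple((d >> g) & 1 for g in range(n_groups))
        for d in range(num_domains)
    }


assignment = grouped_layer_assignment(4)
# Two groups suffice for four domains; each domain gets a distinct combination.
```

Under this encoding, $m$ domains need only $2n = 2\lceil \log_2 m \rceil$ blocks, whereas ME needs $m$ full encoders.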

4.3. PREDICTION ACCURACY OF ATTRIBUTES

We argue that attributes can guide benign/malignant classification. In the current domain generalization task, under external cohorts (unseen domains), we also calculate the attribute prediction accuracy of our method and of other representative attribute-based methods in Tab. 3. Our method achieves the best attribute prediction accuracy among these methods in the out-of-distribution setting.

4.4. VISUALIZATION

We visualize the reconstruction results of all latent factors and the predicted attributes of the current image in Fig. 4 to validate that our model can successfully disentangle the latent factors $v_{ma}$, $v_{mi}$, and $v_d$. Since the DADI encoder is partially domain-dependent, the validation set (left in Fig. 4) and the training sets are from the same domains, whereas the testing set (right in Fig. 4) is from a different, unseen domain. As we can see, the disease-related features $v_{ma} + v_{mi}$ mainly reflect disease-related information, since they mainly reconstruct the lesion regions without mixing in other content. The disease-irrelevant features $v_d$ mainly capture the contour of the breast, the pectoralis, and other irrelevant glands, without lesion information. It is worth noting that the white dots on the image, which are caused by the imaging device, are captured by $v_d$, as the visualization shows. This means that, through its domain generalization ability, our method can successfully disentangle the irrelevant part and prevent it from influencing the disease prediction. Moreover, the macroscopic features $v_{ma}$ capture the macroscopic attributes of the lesions, e.g., shape and density, while the microscopic features $v_{mi}$ learn properties such as global context, texture, or other features that are not visible but are related to disease classification. These results further indicate the effectiveness and interpretability of DarMo.

5. CONCLUSION

We propose a novel Domain Agnostic Representation Model (DarMo) for domain generalization in medical diagnosis, in order to achieve robustness across multiple centers. We evaluate our method on both public and in-house datasets. The results demonstrate the effectiveness of DarMo. In future work, we will try to generalize this method to other medical imaging problems such as lung cancer, liver cancer, etc.

6. ACKNOWLEDGEMENT

This work was supported by MOST-2022ZD0114900, NSFC-62061136001, and the Hong Kong Research Grants Council through the General Research Fund (Grant 17207722).

A FORMAL DESCRIPTION OF THEOREM 3.1

In this section, we present the formal version and the proof of Theorem 3.1, which claims the disentanglement between disease-related features and center effects. In the following, we first introduce the model assumptions, followed by the definition of disentanglement; finally, we present the formal version of Theorem 3.1 and its proof.

Model Assumptions and Notations. According to the causal graph in Fig. 2, the joint distribution over $(y, v_d, v_{mi}, v_{ma}, A, x)$ given each domain can be factorized into conditional factors (Pearl, 2009; Schölkopf et al., 2021):

$$p(y, v_d, v_{mi}, v_{ma}, A, x \mid d) = p(y)\, p(v_d \mid d)\, p(v_{mi} \mid y)\, p(v_{ma} \mid y)\, p(x \mid v_d, v_{mi}, v_{ma})\, p(A \mid v_{ma}).$$

In the following, we introduce the assumption on each conditional factor. Specifically, for the latent variables $v_d, v_{mi}, v_{ma}$, we assume that $v_{mi} \mid y$, $v_{ma} \mid y$ and $v_d \mid d$ belong to the following exponential families:

$$p(v_d \mid d) := p_{T^{v_d}, \Gamma^{v_d}_d}(v_d \mid d), \quad p(v_{mi} \mid y) := p_{T^{v_{mi}}, \Gamma^{v_{mi}}_y}(v_{mi} \mid y), \quad p(v_{ma} \mid y) := p_{T^{v_{ma}}, \Gamma^{v_{ma}}_y}(v_{ma} \mid y),$$

where

$$p_{T^u, \Gamma^u_o}(u \mid o) = \prod_{i=1}^{q_u} \exp\Big(\sum_{j=1}^{k_u} T^u_{i,j}(u_i)\, \Gamma^u_{o,i,j} + B_i(u_i) - C^u_{o,i}\Big),$$

for any $u \in \{v_{mi}, v_{ma}\}$ with $o = y$, and for $u = v_d$ with $o = d$. Here $\{T^u_{i,j}(u_i)\}$ and $\{\Gamma^u_{o,i,j}\}$ denote the sufficient statistics and natural parameters, while $\{B_i\}$ and $\{C^u_{o,i}\}$ denote the base measures and the normalizing constants that ensure the distribution integrates to 1. Let

$$T^u(u) := [T^u_1(u_1), \ldots, T^u_{q_u}(u_{q_u})] \in \mathbb{R}^{k_u \times q_u}, \quad T^u_i(u_i) := [T^u_{i,1}(u_i), \ldots, T^u_{i,k_u}(u_i)], \ \forall i \in [q_u],$$
$$\Gamma^u_o := [\Gamma^u_{o,1}, \ldots, \Gamma^u_{o,q_u}] \in \mathbb{R}^{k_u \times q_u}, \quad \Gamma^u_{o,i} := [\Gamma^u_{o,i,1}, \ldots, \Gamma^u_{o,i,k_u}], \ \forall i \in [q_u].$$

This assumption is widely adopted in the literature on causal representation learning and causal learning (Khemakhem et al., 2020; Sun et al., 2021).

For $x$ and $A$, we assume the following additive noise model (ANM): $x = f_x(v_{mi}, v_{ma}, v_d) + \varepsilon_x$, $A = f_A(v_{ma}) + \varepsilon_A$, where $\varepsilon_x, \varepsilon_A$ denote the exogenous variables of $x$ and $A$, respectively.
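As a worked instance of the exponential-family assumption above, the following sketch writes a univariate Gaussian in that form. The choice of a Gaussian with $q_u = 1$ and $k_u = 2$ is an illustration only; the text does not state which family is used in practice.

```latex
% Univariate Gaussian u | o ~ N(mu_o, sigma_o^2), written in the form
% p(u|o) = exp( sum_j T_j(u) Gamma_{o,j} + B(u) - C_o ), with q_u = 1, k_u = 2.
\begin{align*}
\log p(u \mid o)
  &= -\frac{(u-\mu_o)^2}{2\sigma_o^2} - \frac{1}{2}\log(2\pi\sigma_o^2) \\
  &= \underbrace{u}_{T_1(u)} \cdot \underbrace{\frac{\mu_o}{\sigma_o^2}}_{\Gamma_{o,1}}
   + \underbrace{u^2}_{T_2(u)} \cdot \underbrace{\Big(-\frac{1}{2\sigma_o^2}\Big)}_{\Gamma_{o,2}}
   + \underbrace{0}_{B(u)}
   - \underbrace{\Big(\frac{\mu_o^2}{2\sigma_o^2} + \frac{1}{2}\log(2\pi\sigma_o^2)\Big)}_{C_o}.
\end{align*}
```

Here the sufficient statistics $T_1(u) = u$ and $T_2(u) = u^2$ are linearly independent, and changing $o$ (the domain $d$ or the label $y$) changes only the natural parameters $\Gamma_{o,j}$ and the normalizing constant $C_o$.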

Definition of

$$\tilde{T}([\tilde{f}_x^{-1}]_I(x)) = M_{v_{mi}}\, T([f_x^{-1}]_I(x)) + b_{v_{mi}}, \quad (6)$$
$$\tilde{T}([\tilde{f}_x^{-1}]_A(x)) = M_{v_{ma}}\, T([f_x^{-1}]_A(x)) + b_{v_{ma}}, \quad (7)$$
$$\tilde{T}([\tilde{f}_x^{-1}]_D(x)) = M_{v_d}\, T([f_x^{-1}]_D(x)) + b_{v_d}, \quad (8)$$

where $I$, $A$, $D$ denote the spaces of the latent variables $v_{mi}$, $v_{ma}$, $v_d$, respectively. Correspondingly, for $f_x^{-1}(x)$ that transforms $x$ into the latent space $(I, A, D)$, $[f_x^{-1}]_I(x)$, $[f_x^{-1}]_A(x)$ and $[f_x^{-1}]_D(x)$ respectively denote the elements of $f_x^{-1}(x)$ in the spaces $I$, $A$ and $D$.

Remark A.1. This definition is a variation of the A-identifiability of Khemakhem et al. (2020) and the identifiability in Sun et al. (2021), which means that the latent variables can be determined up to an affine transformation with an invertible transformation matrix. Specifically, for any $x \leftarrow f_x(v^*_{mi}, v^*_{ma}, v^*_d)$, $[f_x^{-1}]_I(x)$, $[f_x^{-1}]_A(x)$ and $[f_x^{-1}]_D(x)$ respectively return the true latent variables $v^*_{mi}$, $v^*_{ma}$, $v^*_d$. Then, if the model $\tilde{\theta}$ can perfectly fit the joint distribution over each domain, i.e., $p_\theta(x, A \mid d)$, then $[\tilde{f}_x^{-1}]_I(x)$, $[\tilde{f}_x^{-1}]_A(x)$ and $[\tilde{f}_x^{-1}]_D(x)$ can recover $v^*_{mi}$, $v^*_{ma}$, $v^*_d$ up to linear transformations with invertible matrices $M_{v_{mi}}$, $M_{v_{ma}}$ and $M_{v_d}$.

Formal Version of Theorem 3.1. We present the formal version of Theorem 3.1 as follows:

Theorem A.2 (Formal version of Theorem 3.1). Under the causal model in Fig. 2 with Eq. 3 and Eq. 5, for any $\theta$, we have that $v_{mi}, v_d, v_{ma}$ are disentangled, under the following assumptions:
1. The characteristic functions of $\varepsilon_x, \varepsilon_A$ are almost everywhere nonzero.
2. $f_x, f_A$ are bijective functions.
3. The sufficient statistics are differentiable almost everywhere; besides, $\{T^u_{i,j}\}_{1 \le j \le k_u}$ are linearly independent in $I$, $A$ or $D$ for each $i \in [q_u]$ and any $u = v_{mi}, v_{ma}, v_d$.
4. There exist $d_1, \ldots, d_m$ and $y_1, \ldots, y_K$ such that $\big[[\Gamma^{v_d}_{d_2} - \Gamma^{v_d}_{d_1}]^\top, \ldots, [\Gamma^{v_d}_{d_m} - \Gamma^{v_d}_{d_1}]^\top\big]^\top \in \mathbb{R}^{m \times (q_{v_d} \times k_{v_d})}$ and $\big[[\Gamma^{u=v_{mi},v_{ma}}_{y_2} - \Gamma^{u=v_{mi},v_{ma}}_{y_1}]^\top, \ldots, [\Gamma^{u=v_{mi},v_{ma}}_{y_K} - \Gamma^{u=v_{mi},v_{ma}}_{y_1}]^\top\big]^\top \in \mathbb{R}^{m \times (q_u \times k_u)}$ have full column rank.

Remark A.2. These assumptions are widely adopted in the literature on independent component analysis and representation learning (Khemakhem et al., 2020; Sun et al., 2021; Li et al., 2021). Assumptions 1-3 are easy to satisfy. Specifically, the characteristic functions of $\varepsilon_x$ and $\varepsilon_A$ are almost everywhere nonzero for most discrete variables (such as binomial, Poisson, geometric) and continuous variables (such as Gaussian, Student-t). For assumption 2, since it has been empirically verified (Kramer, 1991) that the extracted low-dimensional embedding is able to recover the original image, it is natural for $f_x$ to be bijective. The bijectivity of $f_A$ ensures the disentanglement of $v_{ma}$ (up to affine transformation), as similarly adopted in Li et al. (2021). Assumption 4 requires that the distributions across domains and disease statuses are diverse enough, which is easy to satisfy. Based on this assumption, $m$ (the number of domains) and $K$ (the number of disease statuses) are respectively required to be larger than the dimension of $v_d$ and the dimension of $(v_{mi}, v_{ma})$. This suggests that we should collect data from as many domains as possible, although empirically we find that three domains are enough to achieve disentanglement and generalization (as shown in Tab. 1 and Tab. 2, where out-of-domain performance is comparable to in-distribution performance). For the disease label, we could use a finer-grained label, e.g., the breast cancer stage (D'Orsi et al., 2018), although empirically we find that a binary benign/malignant label is able to disentangle disease-related features.

Proof. For simplicity, we denote $p(u \mid o) := p_{T^u, \Gamma^u_o}(u \mid o)$. Since $p_\theta(x \mid d, y) = p_{\tilde{\theta}}(x \mid d, y)$, we have

$$\int p_{f_x}(x \mid v_{mi}, v_{ma}, v_d)\, p(v_{mi}, v_{ma} \mid y)\, p(v_d \mid d)\, dv_{mi}\, dv_{ma}\, dv_d = \int p_{\tilde{f}_x}(x \mid v_{mi}, v_{ma}, v_d)\, \tilde{p}(v_{mi}, v_{ma} \mid y)\, \tilde{p}(v_d \mid d)\, dv_{mi}\, dv_{ma}\, dv_d.$$
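According to the chain rule of changing from $(v_{mi}, v_{ma}, v_d)$ to $\bar{x} := f_x(v_{mi}, v_{ma}, v_d)$, we have that

$$\int p_{\varepsilon_x}(x - \bar{x})\, p(f_x^{-1}(\bar{x}) \mid d, y)\, J_{f^{-1}}(\bar{x})\, d\bar{x} = \int p_{\varepsilon_x}(x - \bar{x})\, \tilde{p}(\tilde{f}_x^{-1}(\bar{x}) \mid d, y)\, J_{\tilde{f}^{-1}}(\bar{x})\, d\bar{x},$$

where $J_f(x)$ denotes the Jacobian matrix of $f$ at $x$. Denote $p'(\bar{x} \mid d, y) := p(f_x^{-1}(\bar{x}) \mid d, y)\, J_{f^{-1}}(\bar{x})$. Applying the Fourier transformation to both sides, we have $F[p'](\omega)\, \varphi_{\varepsilon_x}(\omega) = F[\tilde{p}'](\omega)\, \varphi_{\varepsilon_x}(\omega)$, where $\varphi_{\varepsilon_x}$ denotes the characteristic function of $\varepsilon_x$. Since it is almost everywhere nonzero, we have that $F[p'](\omega) = F[\tilde{p}'](\omega)$, which means that $p'(\bar{x} \mid d, y) = \tilde{p}'(\bar{x} \mid d, y)$. This is equivalent to the following:

$$\log \mathrm{vol} J_{f_x}(x) + \sum_{u = v_{mi}, v_{ma}} \sum_{i=1}^{q_u} \Big(\log B_i([f_{x,i}^{-1}]_U(x)) - \log C^u_i(y) + \sum_{j=1}^{k_u} T^u_{i,j}([f_{x,i}^{-1}]_U(x))\, \Gamma^u_{i,j}(y)\Big) + \sum_{i=1}^{q_{v_d}} \Big(\log B_i([f_{x,i}^{-1}]_D(x)) - \log C^{v_d}_i(d) + \sum_{j=1}^{k_{v_d}} T^{v_d}_{i,j}([f_{x,i}^{-1}]_D(x))\, \Gamma^{v_d}_{i,j}(d)\Big)$$
$$= \log \mathrm{vol} J_{\tilde{f}_x}(x) + \sum_{u = v_{mi}, v_{ma}} \sum_{i=1}^{q_u} \Big(\log \tilde{B}_i([\tilde{f}_{x,i}^{-1}]_U(x)) - \log \tilde{C}^u_i(y) + \sum_{j=1}^{k_u} \tilde{T}^u_{i,j}([\tilde{f}_{x,i}^{-1}]_U(x))\, \tilde{\Gamma}^u_{i,j}(y)\Big) + \sum_{i=1}^{q_{v_d}} \Big(\log \tilde{B}_i([\tilde{f}_{x,i}^{-1}]_D(x)) - \log \tilde{C}^{v_d}_i(d) + \sum_{j=1}^{k_{v_d}} \tilde{T}^{v_d}_{i,j}([\tilde{f}_{x,i}^{-1}]_D(x))\, \tilde{\Gamma}^{v_d}_{i,j}(d)\Big). \quad (9)$$

Subtracting Eq. 9 with $y = y_1$ from Eq. 9 with $y = y_k$ for $k \neq 1$, we have that

$$\sum_{u = v_{mi}, v_{ma}} \big\langle T^u([f_x^{-1}]_U(x)),\, \bar{\Gamma}^u(y_k) \big\rangle + \sum_i \log \frac{C^u_i(y_1)}{C^u_i(y_k)} = \sum_{u = v_{mi}, v_{ma}} \big\langle \tilde{T}^u([\tilde{f}_x^{-1}]_U(x)),\, \bar{\tilde{\Gamma}}^u(y_k) \big\rangle + \sum_i \log \frac{\tilde{C}^u_i(y_1)}{\tilde{C}^u_i(y_k)},$$

for all $k \in [m]$, where $\bar{\Gamma}(y) = \Gamma(y) - \Gamma(y_1)$. Denote $\tilde{b}_u(k) = \sum_{u = v_{mi}, v_{ma}} \sum_{i}^{q_u} \log \frac{\tilde{C}^u_i(y_1)\, C^u_i(y_k)}{\tilde{C}^u_i(y_k)\, C^u_i(y_1)}$ for $k \neq 1$. Similarly, by subtracting Eq. 9 with $d = d_1$ from Eq. 9 with $d = d_l$ for $l \neq 1$, we have

$$\big\langle T^{v_d}([f_x^{-1}]_D(x)),\, \bar{\Gamma}^{v_d}(d_l) \big\rangle + \sum_i \log \frac{C^{v_d}_i(d_1)}{C^{v_d}_i(d_l)} = \big\langle \tilde{T}^{v_d}([\tilde{f}_x^{-1}]_D(x)),\, \bar{\tilde{\Gamma}}^{v_d}(d_l) \big\rangle + \sum_i \log \frac{\tilde{C}^{v_d}_i(d_1)}{\tilde{C}^{v_d}_i(d_l)},$$

for all $l \in [m]$, where $\bar{\Gamma}(d) = \Gamma(d) - \Gamma(d_1)$.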
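Denote $\tilde b_{v_d}(l) = \sum_i \log \frac{\tilde C^{v_d}_i(d_1)\, C^{v_d}_i(d_l)}{\tilde C^{v_d}_i(d_l)\, C^{v_d}_i(d_1)}$ for $l \neq 1$. We then have:

$$\bar\Gamma^{v_d,\top} T^{v_d}([f_x^{-1}]_D(x)) = \tilde{\bar\Gamma}^{v_d,\top} \tilde T^{v_d}([\tilde f_x^{-1}]_D(x)) + \tilde b_{v_d}, \qquad (12)$$

$$\bar\Gamma^{v_{mi},\top} T^{v_{mi}}([f_x^{-1}]_I(x)) + \bar\Gamma^{v_{ma},\top} T^{v_{ma}}([f_x^{-1}]_A(x)) = \tilde{\bar\Gamma}^{v_{mi},\top} \tilde T^{v_{mi}}([\tilde f_x^{-1}]_I(x)) + \tilde{\bar\Gamma}^{v_{ma},\top} \tilde T^{v_{ma}}([\tilde f_x^{-1}]_A(x)) + \tilde b_{v_{ma}} + \tilde b_{v_{mi}}. \qquad (13)$$

Similarly, we also have $p'(\bar A|y) = \tilde p'(\bar A|y)$, which means that

$$\begin{aligned}
&\log \mathrm{vol} J_{f_A}(A) + \sum_{i=1}^{q_{v_{ma}}} \Big( \log B_i([f_A^{-1}]_{A,i}(A)) - \log C^{v_{ma}}_i(y) + \sum_{j=1}^{k_{v_{ma}}} T^{v_{ma}}_{i,j}([f_A^{-1}]_{A,i}(A))\, \Gamma^{v_{ma}}_{i,j}(y) \Big) \\
={}& \log \mathrm{vol} J_{\tilde f_A}(A) + \sum_{i=1}^{q_{v_{ma}}} \Big( \log \tilde B_i([\tilde f_A^{-1}]_{A,i}(A)) - \log \tilde C^{v_{ma}}_i(y) + \sum_{j=1}^{k_{v_{ma}}} \tilde T^{v_{ma}}_{i,j}([\tilde f_A^{-1}]_{A,i}(A))\, \tilde\Gamma^{v_{ma}}_{i,j}(y) \Big),
\end{aligned}$$

which gives

$$\bar\Gamma^{v_{ma},\top} T^{v_{ma}}([f_A^{-1}]_A(A)) = \tilde{\bar\Gamma}^{v_{ma},\top} \tilde T^{v_{ma}}([\tilde f_A^{-1}]_A(A)) + \tilde b_{v_{ma}}.$$

Denote $v := [x^\top, A^\top]^\top$, $\varepsilon := [\varepsilon_x^\top, \varepsilon_A^\top]^\top$, and $h(v) = [[f_x^{-1}]_I(x)^\top, [f_A^{-1}]_A(A)^\top]^\top$. Applying the same trick as above, we have

$$\bar\Gamma^{v_{ma},\top} T^{v_{ma}}([f_x^{-1}]_A(x)) = \tilde{\bar\Gamma}^{v_{ma},\top} \tilde T^{v_{ma}}([\tilde f_x^{-1}]_A(x)) + \tilde b_{v_{ma}}. \qquad (16)$$

Combining Eq. 12, 13, 16, we have

$$\bar\Gamma^{v_d,\top} T^{v_d}([f_x^{-1}]_D(x)) = \tilde{\bar\Gamma}^{v_d,\top} \tilde T^{v_d}([\tilde f_x^{-1}]_D(x)) + \tilde b_{v_d},$$

$$\bar\Gamma^{v_{ma},\top} T^{v_{ma}}([f_x^{-1}]_A(x)) = \tilde{\bar\Gamma}^{v_{ma},\top} \tilde T^{v_{ma}}([\tilde f_x^{-1}]_A(x)) + \tilde b_{v_{ma}}, \qquad (18)$$

$$\bar\Gamma^{v_{mi},\top} T^{v_{mi}}([f_x^{-1}]_I(x)) = \tilde{\bar\Gamma}^{v_{mi},\top} \tilde T^{v_{mi}}([\tilde f_x^{-1}]_I(x)) + \tilde b_{v_{mi}}.$$

Applying the same trick as in (Sun et al., 2021, Theorem 7.9), which relies on assumptions 3 and 4, we have that $(\bar\Gamma^{u,\top})^{-1} \tilde{\bar\Gamma}^{u,\top}$ is invertible for $u = v_{mi}, v_{ma}, v_d$. The proof is completed by setting $M_u$, $b_u$ in Def. A.1 as $(\bar\Gamma^{u,\top})^{-1} \tilde{\bar\Gamma}^{u,\top}$ and $\tilde b_u$ for $u = v_{mi}, v_{ma}, v_d$.

B OBJECTIVE FUNCTION

B.1 FINAL LOSS

Our final loss function is the summation of the loss in Eq.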
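1, i.e., $\sum_d \ell_d(q^d, p^d_\theta)$, where each $\ell_d(q^d, p^d_\theta)$ is:

$$\begin{aligned}
\ell_d(q^d, p^d_\theta) &= -\frac{1}{n_d} \sum_{i=1}^{n_d} \left[ \log p^d_\theta(y_i, A_i|x_i) + \mathbb{E}_{q_{\psi_d}(v|x_i)} \log \frac{p^d_\theta(x_i, v)}{q_{\psi_d}(v|x_i)} \right] \\
&= \underbrace{-\frac{1}{n_d} \sum_{i=1}^{n_d} \log \mathbb{E}_{q_{\psi_d}(v|x_i)} \big( p_\theta(A_i|v_{ma})\, p_\theta(y_i|v_{ma}, v_{mi}) \big)}_{\text{prediction of } A,\, y} \; \underbrace{- \frac{1}{n_d} \sum_{i=1}^{n_d} \mathbb{E}_{q_{\psi_d}(v|x_i)} \big( \log p_\theta(x_i|v) \big)}_{\text{reconstruction loss}} \; \underbrace{+ \frac{1}{n_d} \sum_{i=1}^{n_d} \mathrm{KL}\big( q_{\psi_d}(v|x_i) \,\|\, p^d_\theta(v) \big)}_{\text{KL divergence}}.
\end{aligned}$$

The first term is the cross-entropy loss for $A$ and $y$: for each sample $x_i$, we first generate $v_{mi}, v_{ma}$ from $x_i$ via $q_{\psi_d}(v|x_i)$, then feed $(v_{mi}, v_{ma})$ into $p_\theta(y_i|v_{ma}, v_{mi})$ and $v_{ma}$ into $p_\theta(A_i|v_{ma})$ to predict $y$ and $A$, respectively. The second and third terms are the reconstruction loss and the KL divergence loss of the VAE, respectively.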
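The way the prediction, reconstruction, and KL terms combine for a single sample can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes diagonal Gaussian posterior and prior, a unit-variance Gaussian decoder (so the reconstruction term reduces to a squared error up to a constant), a single Monte Carlo sample, and the standard VAE convention that the KL penalty is added. All function names are hypothetical.

```python
import math


def bernoulli_nll(target, prob, eps=1e-12):
    """Cross entropy of one binary target under a predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_loss(y, y_hat, attrs, attrs_hat, x, x_rec, mu_q, logvar_q, mu_p, logvar_p):
    """Per-sample loss: prediction of (A, y) + reconstruction + KL (assumed forms)."""
    prediction = bernoulli_nll(y, y_hat) + sum(
        bernoulli_nll(g, g_hat) for g, g_hat in zip(attrs, attrs_hat)
    )
    # Assumption: unit-variance Gaussian decoder, constant terms dropped.
    reconstruction = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_rec))
    kl = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    return prediction + reconstruction + kl
```

Averaging `sample_loss` over the $n_d$ samples of a domain and summing over domains gives a loss of the same shape as $\sum_d \ell_d$; in this sketch the domain dependence enters only through the prior parameters `mu_p`, `logvar_p`.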

B.2 DERIVATION OF OUR OBJECTIVE

The log-likelihood over the observations $(x, y, A)$ in the Bayesian network is given by:

$$\log p(x, y, A; \theta) = \log p(x; \theta) + \log p(A|x; \theta) + \log p(y|x, A; \theta),$$

which forms the learning objective of our problem. Next, we detail how the loss functions used for optimization are derived from this likelihood.

For the log-likelihood $\log p(x; \theta)$ of each domain, the ELBO gives a lower bound:

$$\begin{aligned}
\log p(x; \theta) &= \mathrm{KL}\big(q(v_d, v_{mi}, v_{ma}|x) \,\|\, p(v_d, v_{mi}, v_{ma}|x)\big) + \mathbb{E}_{q(\cdot|x)} \log p(v_d, v_{mi}, v_{ma}, x) - \mathbb{E}_{q(\cdot|x)} \log q(v_d, v_{mi}, v_{ma}|x) \\
&\geq \mathbb{E}_{q(\cdot|x)} \log \frac{p(v_d, v_{mi}, v_{ma}, x)}{q(v_d, v_{mi}, v_{ma}|x)} \\
&= -\mathrm{KL}\big(q(v_d, v_{mi}, v_{ma}|x) \,\|\, p(v_d, v_{mi}, v_{ma})\big) + \mathbb{E}_{q(\cdot|x)} \log p(x|v_d, v_{mi}, v_{ma}), \qquad (21)
\end{aligned}$$

where $q(\cdot|x)$ denotes $q(v_d, v_{mi}, v_{ma}|x)$ for simplicity. Specifically, we use $\theta$ to parameterize $p(v_d, v_{mi}, v_{ma}, x)$ and $\phi$ to parameterize $q(v_d, v_{mi}, v_{ma}|x)$. The prior joint distribution $p_\theta(v_d, v_{mi}, v_{ma}, x)$ can be factorized into $p^d_\theta(v_d)\, p_\theta(v_{mi}, v_{ma})\, p^d_\theta(x|v_d, v_{mi}, v_{ma})$. Under the mean-field approximation, the posterior $q_\phi(v_d, v_{mi}, v_{ma}|x)$ can be factorized into $q^d_\phi(v_d|x)\, q_\phi(v_{mi}, v_{ma}|x)$. Note that the index $d$ is added since $v_d$ and $x$ are domain-variant while $v_{mi}, v_{ma}$ are domain-invariant. The final two terms in Eq. 21 are the KL loss and the reconstruction loss in the loss functions.

For the conditional log-likelihood $\log p(A|x; \theta)$, we have:

$$\log p(A|x; \theta) = \log \int p_\theta(A|v_{ma})\, p_\theta(v_{ma}|x)\, dv_{ma}, \qquad (22)$$

where $p_\theta(v_{ma}|x)$ is re-parameterized by the posterior model $q_\phi(v_{ma}|x)$ in the variational framework above. Under one-time sampling, we have $\log p(A|x; \theta) = \log p_\theta(A|v_{ma})\, p_\theta(v_{ma}|x)$.
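The last step of the ELBO, and the effect of the two factorizations stated in this subsection, can be written out explicitly. The following is a worked sketch with the shorthand $v := (v_d, v_{mi}, v_{ma})$ (a notation introduced here only for brevity); it uses nothing beyond the factorized prior, the mean-field posterior, and the additivity of the KL divergence over independent factors.

```latex
\begin{aligned}
\mathbb{E}_{q(v|x)} \log \frac{p(v, x)}{q(v|x)}
  &= \mathbb{E}_{q(v|x)} \log p(x|v) + \mathbb{E}_{q(v|x)} \log \frac{p(v)}{q(v|x)} \\
  &= \mathbb{E}_{q(v|x)} \log p(x|v) - \mathrm{KL}\big(q(v|x) \,\|\, p(v)\big), \\[4pt]
\mathrm{KL}\big(q_\phi(v|x) \,\|\, p_\theta(v)\big)
  &= \mathrm{KL}\big(q^d_\phi(v_d|x) \,\|\, p^d_\theta(v_d)\big)
   + \mathrm{KL}\big(q_\phi(v_{mi}, v_{ma}|x) \,\|\, p_\theta(v_{mi}, v_{ma})\big).
\end{aligned}
```

Under these assumptions the KL loss splits into a domain-specific part for $v_d$ and a shared part for $(v_{mi}, v_{ma})$.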
Since the different attributes in $A$ are independent, and each attribute $g_i \in A$, $i \in [C]$ (with $[C] := \{1, \dots, C\}$), follows a Bernoulli distribution, we can rewrite the log-likelihood as:

$$\log p(A|x; \theta) = \log p(g_1, \cdots, g_C|x; \theta) = \log \prod_{i=1}^{C} p(g_i|x; \theta) = \log \prod_{i=1}^{C} \hat g_i^{\,g_i} (1 - \hat g_i)^{1 - g_i} = \sum_{i=1}^{C} \big[ g_i \log \hat g_i + (1 - g_i) \log(1 - \hat g_i) \big],$$

where $\hat g_i$ denotes the probability that the sample $x$ contains attribute $i$ (i.e., $g_i$ being 1) under the prediction of $p_\theta(A|v_{ma})\, q_\phi(v_{ma}|x)$. Thus we derive the multi-label loss for the Graph Convolutional Network.

For the conditional log-likelihood $\log p(y|x, A; \theta)$, we have:

$$\log p(y|x, A; \theta) = \log \int p_\theta(y|v_{mi}, v_{ma})\, p_\theta(v_{mi}, v_{ma}|x, A)\, dv_{mi}\, dv_{ma},$$

where $p_\theta(v_{mi}, v_{ma}|x, A)$ is re-parameterized by the posterior model $q_\phi(v_{mi}, v_{ma}|x)$ in the variational framework above. Under one-time sampling, we have $\log p(y|x, A; \theta) = \log p_\theta(y|v_{mi}, v_{ma})\, q_\phi(v_{mi}, v_{ma}|x)$. Since $y$ follows a Bernoulli distribution, we can rewrite the log-likelihood as:

$$\log p(y|x, A; \theta) = \log \hat y^{\,y} (1 - \hat y)^{(1 - y)} = y \log \hat y + (1 - y) \log(1 - \hat y),$$

where $\hat y$ denotes the probability of $y$ being 1 under the prediction of $p_\theta(y|v_{mi}, v_{ma})\, q_\phi(v_{mi}, v_{ma}|x)$. Thus we derive the loss function for the binary classification of benign/malignant.
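The product-to-sum step in the attribute log-likelihood can be checked numerically. The snippet below is an illustrative sketch with made-up probabilities; the function names are hypothetical and the predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def log_likelihood_product_form(g, g_hat):
    """log prod_i g_hat_i^{g_i} * (1 - g_hat_i)^{1 - g_i}."""
    product = 1.0
    for gi, pi in zip(g, g_hat):
        product *= (pi ** gi) * ((1.0 - pi) ** (1 - gi))
    return math.log(product)


def log_likelihood_sum_form(g, g_hat):
    """sum_i g_i * log g_hat_i + (1 - g_i) * log(1 - g_hat_i)."""
    return sum(
        gi * math.log(pi) + (1 - gi) * math.log(1.0 - pi)
        for gi, pi in zip(g, g_hat)
    )


# Three binary attributes with hypothetical predicted probabilities.
g = [1, 0, 1]
g_hat = [0.9, 0.2, 0.6]
assert abs(log_likelihood_product_form(g, g_hat) - log_likelihood_sum_form(g, g_hat)) < 1e-12
```

The sum form is the one usually implemented, since the product of many probabilities underflows quickly; its negative is the multi-label binary cross entropy.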

C ABLATION STUDY ON EACH COMPONENT

The variants are interpreted as follows: a) DADI indicates whether the DADI encoder is used during the reconstruction phase, while DAL indicates whether domain-aware layers are used to distinguish multiple domains in the DADI encoder; b) Attribute Learning indicates how attributes are predicted: × means that attributes are not predicted, multi-task means that a fully connected layer predicts the multiple attributes, and $L_{gcn}$ means that our Disease-Attribute Generative Model predicts the attributes; c) $v_{mi}$ indicates whether the latent factor $v_{mi}$ is split out for disentanglement in training; d) Medical Image Decoder indicates whether the reconstruction loss is used in training. As shown in Tab. 4, every component is effective. It is worth noting that using a naive GCN also leads to a boost of around 5% on average. Such a result demonstrates that the attributes can guide the 

D MORE ABLATION STUDY

We also explore the impact of the combination of training domains by trying different training combinations for unseen test domains, taking testing on DDSM (Bowyer et al., 1996) as an example. As shown in Tab. 5, the more domains are used for training, the better our model performs. Because the correlations between domains differ, performance varies across combinations. However, owing to the internal mechanism of our model, the influence of the particular combination is not obvious, and three domains are sufficient to achieve comparable results. Under the setting: testing on DDSM (Bowyer et al., 1996) 

E MORE DETAILS OF DISEASE ATTRIBUTE GENERATIVE MODELS.

To capture macroscopic patterns in $v_{ma}$, we constrain it to estimate the clinical attributes $A$, which include macroscopic information such as shape, margins, and lobulation. Besides, we constrain it together with $v_{mi}$ to predict the disease label $y$, with $v_{mi}$ accounting for additional microscopic information of lesions. We note that such constraints align with the causal graph in Fig. 2, as only $v_{ma} \to A$ and $y \perp v_d \,|\, v_{ma}, v_{mi}$. Finally, we constrain all factors to reconstruct the input $x$, with $v_d$ responsible for the domain-aware effects in $x$ (Medical Image Decoder). Indeed, such asymmetric roles of $v_{ma}, v_{mi}, v_d$ in their relations with $y, A, x$ can additionally help to disentangle them from each other, on the basis of the two-branch encoder. We parameterize $p_\theta(y, A, x|v)$ as $p_{\theta_x}(x|v)$, $p_{\theta_y}(y|v_{ma}, v_{mi})$ and $p_{\theta_A}(A|v_{ma})$.

To utilize these relations, we parameterize $p_\theta(A|v_{ma})$ by a Graph Convolutional Network (GCN), which is a flexible way to capture the topological structure in the label space. Following Chen et al. (2019), we build a graph $G = (V, E)$ with twelve nodes and consider each attribute as a node, e.g., Shape-circle, Margin-clear. Each node $V_i \in V$ represents the word embedding of an attribute. Each edge $e \in E$ represents the inter-relevance between attributes. The inputs of the graph are the feature representations $H^l$ and the corresponding correlation matrix $B$, which is calculated in the same way as in Wang et al. (2021b). For the first layer, $H^0 \in \mathbb{R}^{c \times c'}$ denotes the one-hot embedding matrix of the attribute nodes, where $c$ is the number of attributes and $c'$ is the length of the embeddings. Then, the feature representation of the graph at every layer (Kipf & Welling, 2016) can be calculated via $H^{l+1} = \delta(B H^l W^l)$, where $\delta(\cdot)$ is LeakyReLU (Maas et al., 2013) and $W^l$ is the transformation matrix to be learned in the $l$-th layer. The output $\{\hat A_k\}_k = \mathrm{GCN}([\text{Causal-Encoder}(x)]_A)$ is learned to approximate the attributes $\{A_k\}_k$.
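The layer-wise propagation rule $H^{l+1} = \delta(B H^l W^l)$ can be illustrated with a small dependency-free sketch. The three-node graph, the correlation values, the weight matrix, and the LeakyReLU slope of 0.2 are all made-up assumptions for illustration; the paper does not specify them here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def leaky_relu(matrix, slope=0.2):
    """Element-wise LeakyReLU; the slope value is an assumption."""
    return [[v if v > 0 else slope * v for v in row] for row in matrix]


def gcn_layer(B, H, W, slope=0.2):
    """One graph-convolution step: H_next = LeakyReLU(B @ H @ W)."""
    return leaky_relu(matmul(matmul(B, H), W), slope)


# Toy graph with c = 3 attribute nodes, one-hot inputs (c' = 3), 2 output features.
B = [[1.0, 0.5, 0.0],   # hypothetical correlation matrix between attributes
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
H0 = [[1.0, 0.0, 0.0],  # one-hot embedding of each node
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
W0 = [[0.3, -0.1],      # hypothetical learnable weights
      [0.2, 0.4],
      [-0.5, 0.1]]
H1 = gcn_layer(B, H0, W0)  # shape: 3 nodes x 2 features
```

With one-hot inputs, the first layer simply mixes the rows of $W^0$ according to the attribute correlations in $B$, which is how correlated attributes end up with similar representations.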
F MORE DETAILS OF IMPLEMENTATION AND DATASETS

External cohorts are unseen before testing, i.e., they have not been used in the training phase. For each dataset, the regions of interest (ROIs; malignant/benign masses) are cropped based on radiologists' annotations, following Kim et al. (2018). The training/validation/testing samples we use contain 1165 ROIs from 571 patients/143 ROIs from 68 patients/147 ROIs from 75 patients in DDSM (Bowyer et al., 1996)

For a fair comparison, all methods are run under the same setting and share the same encoder backbone, i.e., ResNet34 (He et al., 2016). Meanwhile, the decoder is the deconvolution network of the encoder. For attribute annotations, in DDSM (Bowyer et al., 1996) the annotations can be parsed from the ".OVERLAY" file, whose third line contains annotations for the types, shapes, and margins of masses. In our in-house datasets, attribute annotations are made by three senior doctors and verified by one director doctor. We implement all models with PyTorch and use Adam for optimization. The weight hyperparameter β of the variance regularizer is 1 in our experiments. The clinical attributes are circle, oval, irregular, circumscribed, obscured, ill-defined, is-lobulated, not-lobulated, is-spiculated, and not-spiculated. We add additional benign and malignant nodes to learn the correlation between combinations of attributes and benign/malignant.

For the compared baselines, we directly load the published code of ERM (He et al., 2016), Chen et al. (Chen et al., 2019), DANN (Ganin et al., 2016), MMD-AAE (Li et al., 2018), DIVA (Ilse et al., 2020), IRM (Arjovsky et al., 2019) and Prithvijit et al. (Chattopadhyay et al., 2020) during the test stage, while we re-implement Guided-VAE (Ding et al., 2020b), ICADx (Kim et al., 2018) and Li et al. (Li et al., 2019) because their source code has not been published.
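The ten clinical attributes plus the two added benign/malignant nodes give the twelve graph nodes. A minimal sketch of how such a node vocabulary could be turned into a multi-hot target vector is shown below; the node ordering, the function name, and the choice to include the benign/malignant nodes in the target are assumptions made for illustration, not details confirmed by the paper.

```python
ATTRIBUTES = [
    "circle", "oval", "irregular",                  # mass shape
    "circumscribed", "obscured", "ill-defined",     # mass margin
    "is-lobulated", "not-lobulated",
    "is-spiculated", "not-spiculated",
]
LABEL_NODES = ["benign", "malignant"]               # additional nodes
NODES = ATTRIBUTES + LABEL_NODES                    # twelve graph nodes in total


def encode_nodes(present):
    """Return a multi-hot vector over NODES for the given set of node names."""
    unknown = set(present) - set(NODES)
    if unknown:
        raise ValueError(f"unknown node names: {sorted(unknown)}")
    return [1 if name in present else 0 for name in NODES]


# Hypothetical annotation of one ROI.
target = encode_nodes({"oval", "circumscribed", "not-lobulated", "not-spiculated", "benign"})
```

A vector like `target` is what a multi-label cross-entropy loss over the GCN outputs would be compared against.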

G TEST SET OF DDSM

To make it convenient for later work, we publish the list of our test split on the public dataset DDSM (Bowyer et al., 1996). 



Figure 1: Domain differences among multiple centers (Cases a, b, c) and AUC evaluation of Ours/ERM (trained by Empirical Risk Minimization) on internal/external cohorts. Cases a, b, c: similar cases from different centers (red rectangles: lesion areas). Bar graph: on the external cohort (unseen domain), ERM suffers a large drop in AUC, whereas our proposed method remains stable.

Domain-Agnostic Representation Learning for Disease Diagnosis. The multi-center study is important for clinical diagnosis Liu et al. (2020); Kather et al. (2022); Castro et al. (2020); Pollard et al. (

Figure 2: Causal graph of our model. $v_{ma}$, $v_{mi}$, $v_d$ respectively denote macroscopic, microscopic, and center-dependent features; $(v_{ma}, v_{mi})$ are associated with the disease status $y$, and $v_d$ is affected by the domain variable $d$.

Figure 3: Overview of our VAE-based method, which is composed of a two-branch encoder: the Domain-Agnostic Disease-Relevant Encoder (DADR), which extracts macroscopic features $v_{ma}$ and microscopic features $v_{mi}$, and the Domain-Aware Disease-Irrelevant Encoder (DADI), which extracts domain-specific effects $v_d$. In DADI, images from different centers are fed into their corresponding domain-aware layers to model the variation of domain effects. In DADR, we implement graph convolution (Disease-Attribute Generative Model) to capture relations among clinical attributes.

Training and Testing. With the above parameterizations, we optimize the prior parameters $\theta_{pri} := \{\theta_d\}$, the inference parameters $\psi := \{\psi_d\}$ with $\psi_d := (\psi_1, \psi^d_2)$ such that $\psi^d_2$ includes $(\gamma_d, \beta_d)$ and the parameters of other layers that do not depend on $d$, and the generative parameters $\theta_{gen} := (\theta_x, \theta_y, \theta_A)$ via $L(\theta_{pri}, \psi, \theta_{gen}) := \sum_d \ell_d(\theta_d, \psi_d, \theta_{gen})$ with $\ell_d$ defined in Eq. 1. During the testing stage, for a new image $x$, we first extract the causal features $(v_{ma}, v_{mi})$ from $x$, followed by prediction via $p_{\theta_y}(y|v_{ma}, v_{mi})$.
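The role of the domain-specific pair $(\gamma_d, \beta_d)$ can be illustrated with a small sketch of a domain-aware normalization layer. This is an assumption-laden toy, not the paper's layer: it handles a single feature channel, uses batch statistics only (no running averages), and the class name and interface are hypothetical.

```python
import math


class DomainAwareNorm:
    """Batch normalization with one (gamma, beta) pair per training domain."""

    def __init__(self, num_domains, eps=1e-5):
        self.gamma = [1.0] * num_domains  # domain-specific scale
        self.beta = [0.0] * num_domains   # domain-specific shift
        self.eps = eps

    def __call__(self, values, domain):
        """Normalize a batch of scalars, then apply the affine map of `domain`."""
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = self.gamma[domain] / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta[domain] for v in values]


layer = DomainAwareNorm(num_domains=3)
layer.gamma[1], layer.beta[1] = 2.0, 1.0      # pretend domain 1 learned these
batch = [1.0, 2.0, 3.0]
out_domain0 = layer(batch, domain=0)          # zero mean, unit scale
out_domain1 = layer(batch, domain=1)          # same batch, different affine map
```

Only `gamma` and `beta` are indexed by the domain; every other parameter of the encoder would be shared across domains, which matches the split between domain-dependent and domain-independent parameters described in the text.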

directly trains the classifier via ResNet34 by Empirical Risk Minimization; b) Chen et al. (Chen et al., 2019) achieves multi-label classification with GCN for attributes prediction; c) Guided-VAE

Figure 4: Visualization on validation (internal) and test (external) cohorts. Red rectangles: lesion regions; green rectangles: white dots introduced by the imaging machine. Each row: the reconstruction from a different latent variable. Validation: the 1st and 4th columns are from center 2, the 2nd column is from center 1, and the 3rd column is from center 3. Test: all columns are from center 4. Note that there are no reconstruction results for $v_d$ at the test stage because the test domains have no corresponding domain-aware encoders.

Disentanglement. With such model assumptions and formulations, we introduce our formal definition of disentanglement. First, we denote $\theta := \{T^{v_{mi}}, T^{v_d}, T^{v_{ma}}, \Gamma^{v_{mi}}_y, \Gamma^{v_{ma}}_y, \Gamma^{v_d}_d, f_x, f_A\}$ in the above models. We define disentanglement as follows: Definition A.1 (Disentanglement of Latent Space). We say that $v_{mi}, v_{ma}, v_d$ are disentangled from each other under $\theta$ if, for any $\tilde\theta := \{\tilde T^{v_{mi}}, \tilde T^{v_d}, \tilde T^{v_{ma}}, \tilde\Gamma^{v_{mi}}_y, \tilde\Gamma^{v_{ma}}_y, \tilde\Gamma^{v_d}_d, \tilde f_x, \tilde f_A\}$ that gives rise to the same observational distributions, i.e., $p_\theta(x, A, y|d) = p_{\tilde\theta}(x, A, y|d)$ for any $x$, $y$, $A$ and $d$, there exist invertible matrices $M_{v_{mi}}, M_{v_{ma}}, M_{v_d}$ and vectors $b_{v_d}, b_{v_{mi}}, b_{v_{ma}}$ such that:

There exist $m$ different values of the domain variable $d$ (i.e., $d_1, \dots, d_m$) and $K$ different values of the disease label $y$ (i.e., $y_1, \dots, y_K$

Figure 5: Distribution of lesions' characteristics in each center (dataset).

, 684 ROIs from 292 patients/87 ROIs from 38 patients/83 ROIs from 33 patients in InH1, 840 ROIs from 410 patients/104 ROIs from 50 patients/105 ROIs from 52 patients in InH2, and 565 ROIs from 271 patients/70 ROIs from 33 patients/70 ROIs from 34 patients in InH3. The distribution of lesions' characteristics in each center (dataset) we use is shown in Fig. 5. And the distribution of ages in each center we use is shown in Fig. 6.

Figure 6: Distributions of ages in each center (dataset).

ign_09_case4085, cancer_02_case0112, cancer_15_case3398, benign_03_case1435, cancer_01_case3027, cancer_07_case1114,
cancer_03_case1070, benign_03_case1432, cancer_06_case1182, cancer_05_case0140, benign_12_case1947, benign_12_case1922,
cancer_05_case0210, cancer_08_case1403, cancer_05_case0173, benign_01_case0235, benign_02_case1317, benign_11_case1836,
cancer_05_case0222, cancer_08_case1532, benign_06_case0372, cancer_02_case0077, benign_11_case1855, cancer_05_case0139,
benign_08_case1786, cancer_07_case1159, cancer_10_case1573, cancer_05_case0181, benign_09_case4038, cancer_05_case0192,
benign_06_case0363, cancer_06_case1122, benign_01_case3113, benign_09_case4003, benign_06_case0367, cancer_12_case4139,
cancer_14_case1985, cancer_05_case0183, cancer_10_case1642, cancer_05_case0206, cancer_03_case1007, cancer_12_case4108,
cancer_09_case0340, benign_07_case1412, cancer_05_case0085, benign_09_case4065, benign_03_case1363, benign_09_case4027,
benign_10_case4016, benign_13_case3433, benign_09_case4090

AUC evaluation of public/in-house datasets on external cohorts (unseen domains), i.e., training and testing data are from different domains (⃝: domains for testing, •: domains for training).

AUC evaluation of public/in-house datasets on internal cohorts (source domains, i.e., in-distribution: training and testing data are from the same domains. -: do not use, : the same domain for training and testing). Li et al.(Li et al., 2019) improve performance by generating more

Overall Prediction Accuracy (ACC) of Multi Attributes (Mass shapes, Mass margins) on external cohorts (unseen domains, i.e., out-of-distribution: training and testing data are from different domains). Testing names are noted in the table.

Specifically, we implement the methods that aim at representation learning on internal cohorts, i.e., training and testing on data from the same domain. Such in-distribution results can serve as upper bounds for our generalization method. To adapt our proposed mechanism to the in-distribution situation, we use our two-branch network without domain-aware BN layers to extract the features $v_{ma}$, $v_{mi}$, $v_d$, since the training data come from only one center (Ours-single), i.e., one domain without domain influence. As shown in Tab. 2, based on the disentanglement mechanism and the guidance of attribute learning, Ours-single still achieves state-of-the-art performance. We argue that the disentangling mechanism with the guidance of attributes helps effective learning of disease-related features under a single domain. Results in Tab. 2 can be seen as the upper bounds of each setting in Tab. 1. Our results in Tab. 1 are slightly lower than the results in Tab. 2, by 0.4% to 2.7%.

Ablation Study: AUC evaluation of public/in-house datasets on external cohorts (unseen domains, i.e., out-of-distribution: training and testing data are from different domains, testing on InH1/InH2/InH3/DDSM while training on the other three). Testing names are noted in the table.

Ablation study on the combination of training data sets. Take testing on the public dataset DDSM as an example. (OOD settings)

-related features. Meanwhile, disentanglement learning also brings a noticeable improvement, possibly because disease-related features can be identified more easily through disentanglement learning, without mixing in other information. Moreover, Lines 5-6 validate that disease-related features can be disentangled better with the guidance of exploring attributes. Lines 7-8 validate that distinguishing multiple domains improves the generalization performance.

AUC of testing on data set InH1/InH2/InH3/DDSM while training on InH1+InH2+InH3.

