MULTI-DOMAIN LONG-TAILED LEARNING BY AUGMENTING DISENTANGLED REPRESENTATIONS

Abstract

There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors by balanced augmentation of hidden representations over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods under both subpopulation shift and domain shift.

Multi-Domain Imbalanced Learning. Multi-domain long-tailed learning is a natural extension of classical long-tailed learning, where the overall data distribution is drawn from a set of domains D = {1, . . . , D} and each domain d is associated with a class-imbalanced dataset {(x_i, y_i, d)}_{i=1}^{N_d} drawn from a domain-specific distribution P_d. Following (Albuquerque et al., 2019; Koh et al., 2021), both the training and test distributions can be formulated as mixture distributions over the domain space D, i.e., P^tr = Σ_{d=1}^{D} η^tr_d P^tr_d and P^ts = Σ_{d=1}^{D} η^ts_d P^ts_d. The corresponding training and test domains are D^tr = {d ∈ D | η^tr_d > 0} and D^ts = {d ∈ D | η^ts_d > 0}, respectively, where η^tr_d and η^ts_d denote the mixture proportions of domain d in the training and test distributions.



Figure 1: Both subpopulation shift and domain shift settings are illustrated.

Deep classification models can struggle when the number of examples per class varies dramatically (Beery et al., 2020; Zhang et al., 2021). This long-tailed setting arises frequently in practice, for example in wildlife recognition (Beery et al., 2020). Classifiers tend to be biased towards majority classes and perform poorly on class-balanced test distributions, i.e., when there is a shift in the label distribution between training and test. Existing approaches address the long-tailed problem by modifying the data sampling strategy (Chawla et al., 2002; Zhang & Pfister, 2021), adjusting the loss function for different classes (Cao et al., 2019; Hong et al., 2021), or augmenting minority classes (Chou et al., 2020; Zhong et al., 2021). Unlike these works, which focus on single-domain long-tailed learning, we study multi-domain long-tailed learning, where each domain has its own long-tailed distribution. Take wildlife recognition as an example (Figure 1): images of wildlife are collected from various locations; the distribution over species (classes) at each location is typically imbalanced, and the class distribution also varies between locations. In multi-domain long-tailed classification, classifiers need to handle distribution shift amidst class imbalance. Here, we focus on two types of distribution shift: subpopulation shift and domain shift. In subpopulation shift, we train a model on data from multiple domains and evaluate it on a test set with balanced domain-class pairs. In the wildlife recognition example, species are often concentrated at only a few locations, creating a spurious correlation between the label (species) and the domain (location). A machine learning model trained on the entire population may fail on the test set when this correlation no longer holds. In domain shift, we expect the trained model to generalize well to completely new test domains.
For example, in wildlife recognition, we train a model on data from a fixed set of training locations and then deploy the model to new test locations. Prior long-tailed classification methods work well in single-domain settings, but may perform poorly when the test data comes from underrepresented or novel domains. Meanwhile, invariant learning approaches alleviate cross-domain performance gaps by learning representations or predictors that are invariant across domains (Arjovsky et al., 2019; Li et al., 2018). Yet, these approaches are mostly evaluated in class-balanced settings, where models must be trained on plenty of examples from each class even if augmentation strategies are applied (Yao et al., 2022); see a detailed discussion in Appendix B. With multi-domain long-tailed data, learning a class-unbiased, domain-invariant model is not trivial, since the imbalance can exist within a domain or across domains. We aim to address these challenges in this work with a novel method named TALLY (mulTi-domAin Long-tailed learning with baLanced representation reassemblY). TALLY performs balanced augmentation over domains and classes by decomposing and reassembling example pairs, combining the class-relevant semantic information of one example with the domain-associated nuisances of another (Zhou et al., 2022). Specifically, TALLY first decouples the representation of each example into semantic information and nuisances with instance normalization. To further mitigate the effects of nuisances, we average out domain information over examples of the same class to construct class prototype representations. Each semantic representation is then linearly interpolated with the corresponding class prototype, yielding a prototype-enhanced semantic representation. The domain-associated factors are similarly interpolated with class-agnostic domain factors to improve training stability and remove noise.
Finally, TALLY produces augmented representations for training by reassembling the prototype-enhanced semantic representations and domain-associated nuisances among examples. To further achieve balanced augmentation, we propose a selective balanced sampling strategy to draw example pairs for augmentation. Concretely, for each pair, the label of one example is uniformly sampled from all classes and the domain of the other example is uniformly sampled from all domains. In this way, TALLY encourages the model to learn a class-unbiased invariant predictor. In summary, our major contributions are: we investigate and formalize an important yet less explored problem, multi-domain long-tailed learning, and propose an effective augmentation algorithm called TALLY that simultaneously addresses the class-imbalance issue and learns domain-invariant predictors. We empirically demonstrate the effectiveness of TALLY under subpopulation shift and domain shift. We observe that TALLY outperforms both prior single-domain long-tailed learning and domain-invariant learning approaches, with a 5.18% error decrease over all datasets. Furthermore, TALLY captures stronger invariant predictors than prior invariant learning approaches.

2. FORMULATIONS AND PRELIMINARIES

Long-Tailed Learning. In this paper, we investigate the setting where one predicts the class label y ∈ C based on the input feature x ∈ X, where C = {1, . . . , C}. Given a machine learning model f parameterized by θ and a loss function ℓ, empirical risk minimization (ERM) trains the model by minimizing the average loss over all training examples:

min_θ E_{(x,y)∼P^tr} [ℓ(f_θ(x), y)],   (1)

which works well when the label distribution is approximately uniform. In long-tailed learning, however, the label distribution is long-tailed: a small proportion of classes have many examples, while the remaining classes are associated with only a few. Assume {(x_i, y_i)}_{i=1}^{N} is a training set sampled from the training distribution and the number of examples per class is {n_1, . . . , n_C}, where Σ_{c=1}^{C} n_c = N. In long-tailed learning, all classes are sorted by cardinality (i.e., n_1 ≥ · · · ≥ n_C) and the imbalance ratio ρ is defined as ρ = n_1 / n_C > 1. The same definitions are used for the test set {(x_i, y_i)}_{i=1}^{N^ts}. Under a class-imbalanced training distribution, a vanilla ERM model tends to perform poorly on minority classes, whereas we expect the model to perform consistently well on all classes. Hence we typically assume the test distribution is class-balanced (i.e., ρ^ts = 1).
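As a toy illustration of the definitions above, the snippet below computes the imbalance ratio ρ = n_1 / n_C from a set of class counts. The counts are invented for illustration and do not come from any dataset in the paper:

```python
import numpy as np

def imbalance_ratio(class_counts):
    """rho = n_1 / n_C, with classes sorted by cardinality (descending)."""
    counts = np.sort(np.asarray(class_counts, dtype=float))[::-1]
    return counts[0] / counts[-1]

# Hypothetical counts n_c for C = 4 classes.
counts = [500, 120, 40, 10]
rho = imbalance_ratio(counts)  # 500 / 10 = 50.0
```

A class-balanced test set corresponds to `imbalance_ratio(...) == 1.0`.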

Figure 2: Overall pipeline of TALLY. Pre-layers produce hidden representations, which are disentangled into semantic factors z(s) and nuisances (µ(s), σ(s)) (Representation Disentanglement), enhanced with the class prototype r_c and the domain prototype (u_d, v_d), reassembled (Representation Reassembly), and fed to the post-layers (Representation Augmentation). The label y_i of one example is sampled uniformly from C and the domain d_j of the other example is sampled uniformly from D.

3. MULTI-DOMAIN LONG-TAILED LEARNING WITH BALANCED REPRESENTATION REASSEMBLY

To improve robustness in multi-domain long-tailed learning, we would like a method that can learn class-unbiased, domain-invariant representations. To accomplish this, we introduce TALLY, which performs balanced augmentation over classes and domains. The key idea motivating TALLY is that every example can be decomposed into class-relevant semantic information and domain-associated nuisances that should be ignored by an ideal classifier. Here, following (Zhou et al., 2022), nuisances are defined as "class-agnostic" transformations that apply similarly to all classes, such as image style and background changes in image classification. TALLY consists of three stages, as outlined in Figure 2: representation disentanglement, representation reassembly, and representation augmentation.

3.1. REPRESENTATION DISENTANGLEMENT AND REASSEMBLY

Motivated by style transfer (Huang & Belongie, 2017), we use instance normalization (InstanceNorm (Ulyanov et al., 2016)) to perform the required disentanglement of semantic and nuisance information. Concretely, given an example (x, y, d), we denote the hidden representation at layer r as s = f^r(x) ∈ R^{C×H×W}, where C, H, and W denote the channel, height, and width dimensions, respectively. Ignoring affine parameters, InstanceNorm normalizes the example as:

z(s) = InstanceNorm(s) = (s − µ(s)) / σ(s),   (2)

where µ(·), σ(·) ∈ R^C are the mean and standard deviation computed across the spatial dimensions of s:

µ(s) = (1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} s[:, h, w],   σ(s) = sqrt( (1/HW) Σ_{h=1}^{H} Σ_{w=1}^{W} (s[:, h, w] − µ(s))^2 ).   (3)

Following Huang & Belongie (2017), we treat the normalized representation z(s) as the semantic representation, and regard µ(s) and σ(s) as the domain-associated nuisances. Note that we adopt a warm-start strategy of running vanilla ERM for the first few epochs to ensure reliable disentanglement. After decoupling representations, we produce an augmented representation from a pair of examples (x_i, y_i, d_i) and (x_j, y_j, d_j) by swapping semantic representations and domain-associated nuisances:

s̃ = σ(s_j) · (s_i − µ(s_i)) / σ(s_i) + µ(s_j),   ỹ = y_i.   (4)
Since the semantic content of the augmented representation s̃ comes from example (x_i, y_i, d_i), we label the augmented example with ỹ = y_i. By reassembling disentangled representations, we can augment representations for minority domains or minority classes. In this process of representation disentanglement and reassembly, finding a suitable strategy for sampling examples from the training distribution is crucial to solving the domain-class imbalance problem. In single-domain long-tailed learning, up-sampling examples from minority classes is a classical yet effective method. In multi-domain long-tailed learning, the most straightforward extension is up-sampling examples from minority domain-class groups, which we call balanced sampling here. In practice, for each example (x_i, y_i, d_i), the label y_i and domain d_i are drawn from a joint uniform distribution over all domain-class combinations, i.e., (y_i, d_i) ∼ Uniform(C × D).
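The disentangle-and-reassemble step (Eqns. 2-4) can be sketched in a few lines of numpy. This is a simplified, batch-free sketch that assumes a single (C, H, W) feature map per example; the actual method operates on hidden tensors inside the network:

```python
import numpy as np

def disentangle(s, eps=1e-5):
    """Split a feature map s of shape (C, H, W) into a semantic factor z
    and per-channel nuisance statistics (mu, sigma) via InstanceNorm."""
    mu = s.mean(axis=(1, 2), keepdims=True)          # Eqn. 3, per-channel mean
    sigma = s.std(axis=(1, 2), keepdims=True) + eps  # Eqn. 3, per-channel std
    z = (s - mu) / sigma                             # Eqn. 2, semantic factor
    return z, mu, sigma

def reassemble(s_i, s_j):
    """Eqn. 4: keep the semantics of s_i, apply the nuisances of s_j."""
    z_i, _, _ = disentangle(s_i)
    _, mu_j, sigma_j = disentangle(s_j)
    return sigma_j * z_i + mu_j

# Two toy (C, H, W) feature maps standing in for hidden representations.
rng = np.random.default_rng(0)
s_i, s_j = rng.normal(size=(2, 8, 4, 4))
s_aug = reassemble(s_i, s_j)  # labeled with y_i
```

By construction, the reassembled map carries the per-channel spatial statistics (nuisances) of `s_j` while keeping the normalized content of `s_i`.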

3.2. SELECTIVE BALANCED SAMPLING

However, to transfer knowledge between different domain-class groups in TALLY, such a sampling strategy may overemphasize minority domain-class groups. In Figure 3, we illustrate three domain-class groups from OfficeHome-LT, a long-tailed variant of the OfficeHome dataset (Venkateswara et al., 2017). To augment a minority group (e.g., the fan-clipart pair), balanced sampling tends to repeatedly draw examples from the same minority group. This is undesirable for two reasons: first, it limits sample diversity in knowledge transfer; second, as shown in Figure 3, minority groups typically perform worse than majority groups, which may make the knowledge transfer less reliable. Hence, we propose a selective balanced sampling strategy in TALLY. Concretely, for a pair of examples (x_i, y_i, d_i) and (x_j, y_j, d_j), the label y_i of example i is uniformly sampled from all classes (y_i ∼ Uniform(C)) and the domain d_j of example j is uniformly sampled from all domains (d_j ∼ Uniform(D)). The illustration shows that selective balanced sampling has a higher chance of diversifying the sample selection when transferring domain and class information.
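The selective balanced sampling rule above can be sketched as follows. The index-building helper and the toy `(x, y, d)` tuples are our own scaffolding for illustration, not part of the paper:

```python
import random
from collections import defaultdict

def make_indices(data):
    """Group example indices by class and by domain; data is a list of (x, y, d)."""
    by_class, by_domain = defaultdict(list), defaultdict(list)
    for idx, (_, y, d) in enumerate(data):
        by_class[y].append(idx)
        by_domain[d].append(idx)
    return by_class, by_domain

def sample_pair(data, by_class, by_domain, rng=random):
    # Example i: draw a class uniformly over C, then an example of that class.
    y = rng.choice(list(by_class))
    i = rng.choice(by_class[y])
    # Example j: draw a domain uniformly over D, then an example of that domain.
    d = rng.choice(list(by_domain))
    j = rng.choice(by_domain[d])
    return i, j  # semantics come from i, nuisances from j; the label is y_i
```

Note that i is chosen without regard to its domain and j without regard to its class, which is exactly what distinguishes this rule from joint (y, d)-balanced sampling.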

3.3. PROTOTYPE-GUIDED INVARIANT LEARNING

Since the semantic representation z(s) (Eqn. 2) should contain only class-relevant information, it should ideally be domain-invariant. However, per-instance statistics can be noisy, and instance normalization may not perfectly disentangle the semantic information from the domain-related nuisances. To improve robustness, we can "average out" domain information over many examples of the same class from different domains. Merely averaging over examples, however, would remove the diversity that distinguishes different examples of the same class. We balance diversity and domain-invariance by interpolating z(s) with the corresponding class prototype representation. We define the class prototype representation r_c as the average semantic representation over all examples belonging to class c, regardless of domain:

r_c = (1/n_c) Σ_{i=1}^{n_c} z(s_i) = (1/n_c) Σ_{i=1}^{n_c} (s_i − µ(s_i)) / σ(s_i).   (5)

For each example (x, y, d) with y = c, we obtain the prototype-enhanced semantic representation by linearly interpolating z(s) with the corresponding class prototype r_c:

z′(s) = λ_c z(s) + (1 − λ_c) r_c,   (6)

where λ_c ∼ Beta(α_c, α_c) is the interpolation coefficient. This class prototype-based interpolation captures invariant knowledge while keeping the diversity of instance-level semantic representations when swapping information. We also desire that the disentangled µ(s) and σ(s) (Eqn. 2) contain only domain-related nuisance information. However, for similar reasons as with z(s), they may still contain some class-related semantic information, which we would like to remove by "averaging out." In this case, we remove semantic information by averaging over examples from different classes within the same domain:

u_d = (1/n_d) Σ_{i=1}^{n_d} µ(s_i),   v_d = (1/n_d) Σ_{i=1}^{n_d} σ(s_i),   (7)

where n_d denotes the number of training examples in domain d. Then, for each example, we linearly interpolate its domain-associated nuisances with the above class-agnostic nuisances:

µ′(s) = λ_d µ(s) + (1 − λ_d) u_d,   σ′(s) = λ_d σ(s) + (1 − λ_d) v_d,   (8)

where the interpolation ratio is λ_d ∼ Beta(α_d, α_d). In practice, we update the class prototype r_c and the class-agnostic nuisances u_d and v_d with momentum updating, denoting their values at epoch t as r_c^{(t)}, u_d^{(t)}, and v_d^{(t)}, respectively. By replacing the original semantic representation and domain-associated nuisances in Eqn. 4 with the prototype-guided ones, we obtain the enhanced augmented representation:

s̃′ = σ′(s_j) z′(s_i) + µ′(s_j),   ỹ′ = y_i.   (9)

Finally, we replace the original training data with the augmented examples and reformulate the optimization in Eqn. 1 as:

min_θ E_{(x_i,y_i),(x_j,y_j)∼P^tr} [ℓ(f_θ^{L−r}(s̃′), ỹ′)],   (10)

where f^{L−r} denotes the post-layers after layer r. It is also worth pointing out that TALLY can be combined with any class-imbalanced loss (e.g., Focal, LDAM). We summarize the overall framework of TALLY in Algorithm 1.
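The prototype-guided enhancement and the momentum updates can be sketched in numpy. This is a toy, shape-simplified sketch: the momentum coefficient `gamma` and the flat per-example vectors are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def enhance(z_i, mu_j, sigma_j, r_c, u_d, v_d, alpha_c=0.5, alpha_d=0.5):
    """Prototype-guided augmentation: interpolate semantics with the class
    prototype (Eqn. 6) and nuisances with the domain statistics (Eqn. 8),
    then reassemble (Eqn. 9)."""
    lam_c = rng.beta(alpha_c, alpha_c)
    lam_d = rng.beta(alpha_d, alpha_d)
    z_hat = lam_c * z_i + (1 - lam_c) * r_c
    mu_hat = lam_d * mu_j + (1 - lam_d) * u_d
    sigma_hat = lam_d * sigma_j + (1 - lam_d) * v_d
    return sigma_hat * z_hat + mu_hat

def momentum_update(old, new, gamma=0.9):
    """Epoch-wise momentum update for r_c, u_d, v_d."""
    return gamma * old + (1 - gamma) * new
```

The augmented representation is then fed to the post-layers and labeled with y_i, exactly as in the plain reassembly of Eqn. 4.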

4. EXPERIMENTS

In this section, we conduct extensive experiments to answer the following questions: Q1: How does TALLY perform relative to prior invariant learning and single-domain long-tailed learning approaches under subpopulation shift and domain shift? Q2: Since it is straightforward to combine invariant learning with imbalanced-data strategies, how does TALLY compare with such combinations? Q3: What effect does incorporating the prototype representations (Eqn. 9) have, compared with naive representation swapping (Eqn. 4)? Q4: Can TALLY produce models with greater domain invariance? We compare TALLY to two categories of algorithms. The first category includes single-domain long-tailed learning methods: Focal (Lin et al., 2017), LDAM (Cao et al., 2019), CRT (Kang et al., 2020), MiSLAS (Zhong et al., 2021), RIDE (two experts) (Wang et al., 2020a), PaCo (Cui et al., 2021), and Remix (Chou et al., 2020). The second category includes approaches for improving robustness to distribution shift: IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), LISA (Yao et al., 2022), MixStyle (Zhou et al., 2020b), DDG (Zhang et al., 2022), and BODA (Yang et al., 2022), where BODA studies multi-domain long-tailed learning by adding a regularizer on domain-class pairs. Following Yang et al. (2022), we use a ResNet-50 for all algorithms, and detail the baselines and evaluation metrics in Appendix C. All hyperparameters are selected via cross-validation.

4.1. EVALUATION ON LONG-TAILED VARIANTS OF DOMAIN GENERALIZATION BENCHMARKS

Datasets. Many standard domain-generalization benchmarks are not long-tailed, while standard imbalanced-classification datasets tend to be single-domain. We curate four multi-domain long-tailed datasets by modifying four existing domain-generalization benchmarks: VLCS (Fang et al., 2013), PACS (Li et al., 2017), OfficeHome (Venkateswara et al., 2017), and DomainNet (Peng et al., 2019). We modify these datasets by removing training examples so that each domain has a long-tailed label distribution (overall imbalance ratio: 50) and call the resulting datasets VLCS-LT, PACS-LT, OfficeHome-LT, and DomainNet-LT. See Appendix D.1 for more details. Evaluation Protocol. We evaluate performance under both subpopulation shift and domain shift. In subpopulation shift, the test set is balanced across both domains and classes, i.e., each domain-class pair contains the same number of test examples. In domain shift, we use the classical domain generalization setting (Zhang et al., 2022): we alternately use one domain as the test domain and the rest as training domains, and average results over all combinations. Appendix D.1 details the statistics and training class distribution for each multi-domain long-tailed dataset. The hyperparameters α_c and α_d of the Beta distribution are set to 0.5 and the warm-start epoch T_0 is set to 7. We list all hyperparameters in Appendix D.2. Results. The overall performance of TALLY and prior methods under subpopulation shift is reported in Table 1 (left). For subpopulation shift, we report the average performance over all domains; the full results are presented in Appendix D.3. In addition, Figure 4 shows performance broken down by class size for OfficeHome-LT and DomainNet-LT, where we split all classes into five levels according to their cardinality. We compare TALLY with ERM and the four strongest baselines (LDAM, CRT, MiSLAS, BODA).
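A long-tailed variant of a balanced benchmark can be curated as sketched below. The exponential decay profile is our assumption for illustration; the paper only states the overall imbalance ratio of 50, not the exact per-class profile used:

```python
import numpy as np

def long_tailed_counts(n_max, num_classes, rho=50):
    """Per-class training-set sizes decaying exponentially from n_max down
    to roughly n_max / rho (an assumed profile, common in LT benchmarks)."""
    c = np.arange(num_classes)
    counts = n_max * rho ** (-c / (num_classes - 1))
    return np.floor(counts).astype(int)

# Hypothetical example: 7 classes, head class capped at 100 examples.
counts = long_tailed_counts(n_max=100, num_classes=7, rho=50)
```

One would then subsample each class of each domain down to these counts to obtain the `-LT` variant.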
The results show that TALLY's improvements arise mainly from larger gains on smaller classes rather than uniform gains across the board, indicating that it is particularly well-suited to class-imbalanced problems. Results of Domain Shift. Table 1 (right) shows the domain shift results. We first find that ERM works relatively well compared to invariant learning approaches in most cases. This is expected, since we evaluate performance on unseen domains, and similar observations have been reported on prior domain shift benchmarks (Koh et al., 2021). Second, single-domain long-tailed learning methods boost the performance of ERM in most cases, showing that class imbalance is still an important issue under domain shift. Even so, as in the subpopulation shift setting, TALLY consistently outperforms prior approaches, indicating its efficacy in enhancing robustness to domain shift. Finally, we also provide a comparison on the standard (balanced) benchmark versions in Appendix G, where TALLY achieves results comparable to state-of-the-art methods. We report the relative improvement of each combination over the vanilla methods in Figure 6. Here, we use OfficeHome-LT and DomainNet-LT to evaluate subpopulation shift, and TerraInc and iWildCam to evaluate domain shift (Appendix F.3). We see that applying loss up-weighting or up-sampling to performant invariant learning approaches does improve their performance, as evidenced by Figure 5. Nonetheless, the consistent improvements from TALLY indicate the importance of considering domain-class pair information to achieve balanced augmentation.

How do prototypes benefit invariant learning?

We analyze the effects of prototypes in alleviating domain-associated nuisances. Specifically, we compare TALLY with three variants: (1) without any prototype information (None); (2) applying only the class prototype (C Only); (3) applying only the class-agnostic nuisances (D Only). We report the results in Figure 6 (full results in Appendix F.4). We observe that adding the class prototype does improve performance. The class-agnostic domain factors also benefit performance to some extent. In summary, TALLY outperforms its variants, demonstrating the effectiveness of prototype representations in mitigating domain-associated nuisances. We also analyze and compare the domain invariance of classifiers trained by ERM, TALLY, and other invariant learning approaches. Following (Yang et al., 2022; Yao et al., 2022), we measure the lack of domain invariance as the accuracy of domain prediction (I_acc) and as the pairwise divergence of unscaled logits (I_kl). Specifically, for the accuracy of domain prediction, we perform logistic regression on top of the unscaled logits to predict the domain. For the pairwise divergence, we use kernel density estimation to estimate the probability density function P(h_{c,d}) of logits from domain-class pair (c, d) and calculate the KL divergence between the logit distributions of different pairs. Formally, I_kl is defined as

I_kl = (1 / (|C||D|^2)) Σ_{c∈C} Σ_{d′,d∈D} KL(P(h_{c,d}) ∥ P(h_{c,d′})).

We report the results on OfficeHome-LT and DomainNet-LT below.
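The I_kl metric can be sketched with a simple grid-based Gaussian KDE. Treating logits as 1-D, using a fixed bandwidth, and normalizing densities over a discrete grid are simplifying assumptions of this sketch, not the paper's exact estimator:

```python
import numpy as np

def kde(samples, grid, bw=0.5):
    """Gaussian kernel density estimate of `samples`, normalized over `grid`."""
    d = (grid[:, None] - samples[None, :]) / bw
    p = np.exp(-0.5 * d ** 2).sum(axis=1)
    return p / (p.sum() + 1e-12)

def kl(p, q, eps=1e-12):
    """Discrete KL divergence between two grid-normalized densities."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def i_kl(logits_by_pair, grid):
    """Average pairwise KL over domain-class pairs; logits_by_pair maps
    (c, d) to a 1-D array of logits for that pair."""
    classes = {c for c, _ in logits_by_pair}
    domains = {d for _, d in logits_by_pair}
    total = sum(
        kl(kde(logits_by_pair[(c, d)], grid), kde(logits_by_pair[(c, d2)], grid))
        for c in classes for d in domains for d2 in domains
    )
    return total / (len(classes) * len(domains) ** 2)
```

Identical logit distributions across domains yield I_kl near zero (perfect invariance); larger values indicate more domain-dependent logits.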

5. RELATED WORK

Long-Tailed Learning. Training a well-performing machine learning model on class-imbalanced data has been widely studied, and many approaches have been proposed, including over-sampling minority classes or under-sampling majority classes (Chawla et al., 2002; Estabrooks et al., 2004; Kang et al., 2020; Liu et al., 2008; Zhang & Pfister, 2021), adjusting loss functions or logits for different classes during training (Cao et al., 2019; Cui et al., 2019; Hong et al., 2021; Jamal et al., 2020; Lin et al., 2017), transferring knowledge from head classes to tail classes (Wang et al., 2017; Liu et al., 2020; Yin et al., 2019; Zhou et al., 2022), directly augmenting tail classes (Chou et al., 2020; Kang et al., 2020; Zhong et al., 2021), and ensembling models with different sampling or loss-weighting strategies (Xiang et al., 2020; Zhou et al., 2020a). Unlike single-domain imbalanced learning, Yang et al. (2022) targets the multi-domain imbalanced learning scenario by encouraging invariant representation learning with a domain-class calibrated regularizer. However, BODA focuses on subpopulation shift with an imbalanced distribution within each domain, while the overall distribution over classes remains relatively balanced. TALLY instead studies more kinds of distribution shift and takes a conceptually different direction, alleviating domain-associated nuisances via balanced augmentation. It relaxes the explicit constraint on internal representations and achieves stronger empirical performance. Domain Generalization and Out-of-Distribution Robustness.
To improve out-of-distribution robustness, one line of work aims to learn domain-invariant representations by 1) minimizing the discrepancy of feature representations across all training domains (Li et al., 2018; Sun & Saenko, 2016; Tzeng et al., 2014; Zhou et al., 2020b); 2) leveraging domain augmentation to generate more training domains and improve consistency among domains (Shu et al., 2021; Wang et al., 2020b; Xu et al., 2020; Yan et al., 2020; Yue et al., 2019; Zhou et al., 2020c); or 3) disentangling feature representations into semantic and domain-varying parts and minimizing the semantic differences across training domains (Robey et al., 2021; Zhang et al., 2022). Another line of work learns invariant predictors with regularizers, including minimizing the variance of risks across domains (Krueger et al., 2021) and encouraging a predictor that performs well over all domains (Ahuja et al., 2021; Arjovsky et al., 2019; Guo et al., 2021; Khezeli et al., 2021). Apart from explicit regularizers, data augmentation is another promising approach for learning invariant predictors (Yao et al., 2022; Zhou et al., 2020b). Unlike previous augmentation methods that require sufficient training examples per class to learn invariance (see the detailed discussion in Appendix B), TALLY tackles the class-imbalance issue in domain generalization and employs a domain-balanced augmentation strategy to learn class-unbiased invariant representations.

6. CONCLUSION

In this paper, we investigate multi-domain imbalanced learning, a natural extension of classical single-domain imbalanced learning. We propose a novel balanced augmentation algorithm called TALLY to achieve robust imbalanced learning under distribution shift. To generate more examples, TALLY introduces a prototype-enhanced disentanglement procedure for separating semantic and nuisance information, and then mixes the enhanced semantic and domain-associated nuisance information among examples. The results on four synthetic and two real-world datasets demonstrate its effectiveness over existing imbalanced classification and invariant learning techniques. In the future, we plan to conduct theoretical studies to better understand how TALLY works.

A ADDITIONAL INFORMATION FOR TALLY

A.1 ADDITIONAL DISCUSSION OF SELECTIVE BALANCED SAMPLING

In this section, we detail our explanation of Figure 3 and show why selective balanced sampling is the better strategy for TALLY. In Figure 3, there are three domain-class groups: (Fan, Clipart), (Fan, Art), and (Computer, Clipart), containing 1, 13, and 83 examples, respectively. We specifically focus on augmenting examples from the minority group, i.e., the (Fan, Clipart) group. In this case, the examples (x_i, y_i, d_i) and (x_j, y_j, d_j) need to contribute the class semantic information (i.e., Fan) and the domain information (i.e., Clipart), respectively, in the representation augmentation module. Balanced sampling in the multi-domain long-tailed setting up-samples examples from minority domain-class groups: concretely, the label y_i and domain d_i of each example (x_i, y_i, d_i) are jointly sampled from Uniform(C × D). If we employed this original balanced sampling in the representation augmentation module of TALLY, then (y_i, d_i) ∼ Uniform(C × D) and (y_j, d_j) ∼ Uniform(C × D). To augment the (Fan, Clipart) group, the class semantic information from (x_i, y_i, d_i) then has probability 1/2 of coming from the original (Fan, Clipart) group and probability 1/2 of coming from the (Fan, Art) group. Similarly, the domain information has probability 1/2 of coming from the original (Fan, Clipart) group and probability 1/2 of coming from the (Computer, Clipart) group. Thus, examples from the minority group (Fan, Clipart) would be repeatedly sampled during augmentation, which is what we want to avoid. Instead, with selective balanced sampling, the label y_i of example i is uniformly sampled from all classes (y_i ∼ Uniform(C)), and the domain d_j of example j is uniformly sampled from all domains (d_j ∼ Uniform(D)).
Thus, to augment the (Fan, Clipart) group, the class semantic information from example (x_i, y_i, d_i) has probability 1/14 of coming from the original (Fan, Clipart) group and probability 13/14 of coming from the (Fan, Art) group, because domain information is not considered when sampling example (x_i, y_i, d_i). Similarly, under selective balanced sampling, the domain information has probability 1/84 of coming from the original (Fan, Clipart) group and probability 83/84 of coming from the (Computer, Clipart) group. To sum up, selective balanced sampling provides more diverse and effective knowledge transfer.
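These probabilities can be checked mechanically; the group sizes below are taken directly from the Figure 3 example:

```python
from fractions import Fraction

# Group sizes: (Fan, Clipart) = 1, (Fan, Art) = 13, (Computer, Clipart) = 83.
fan_clipart, fan_art, computer_clipart = 1, 13, 83

# Selective balanced sampling: example i is drawn uniformly among the 14 Fan
# examples, and example j uniformly among the 84 Clipart examples.
p_semantic_from_minority = Fraction(fan_clipart, fan_clipart + fan_art)           # 1/14
p_nuisance_from_minority = Fraction(fan_clipart, fan_clipart + computer_clipart)  # 1/84
```

Compare these with the 1/2 probabilities under joint balanced sampling: the minority group is sampled far less often, so transfer draws on more diverse examples.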

A.2 ALGORITHM OF THE TESTING STAGE OF TALLY

In this section, we summarize the testing stage of TALLY in Alg. 

B ADDITIONAL DISCUSSION OF RELATED WORKS

In this section, we provide additional discussion of related works. Specifically, we would like to point out why data-interpolation-based domain generalization approaches (e.g., LISA (Yao et al., 2022), mixup) cannot improve performance under a long-tailed distribution. Taking LISA as an example, we adopt intra-label LISA in this paper, which is more suitable for domain shift as mentioned in (Yao et al., 2022). Intra-label LISA learns domain-invariant predictors by interpolating examples with the same label but from different domains, which can aggravate the label imbalance issue.

C DETAILS OF BASELINES

In this paper, we compare TALLY with two types of approaches: long-tailed classification methods and invariant learning approaches. We detail these methods here. Long-Tailed Classification Methods. We compare TALLY with Focal (Lin et al., 2017), LDAM (Cao et al., 2019), CRT (Kang et al., 2020), MiSLAS (Zhong et al., 2021), and Remix (Chou et al., 2020). Focal and LDAM up-weight the loss for minority classes. CRT uses an up-sampling strategy to fine-tune the classifier. MiSLAS and Remix modify vanilla mixup (Zhang et al., 2018) to make it suitable for long-tailed distributions. Invariant Learning. We further compare TALLY with invariant learning approaches, i.e., IRM (Arjovsky et al., 2019), GroupDRO (Sagawa et al., 2020), and LISA (Yao et al., 2022).

D.2 DETAILED EXPERIMENTAL SETUPS AND HYPERPARAMETERS

In this section, we detail how we split the training and test sets in the synthetic datasets under both subpopulation shift and domain shift. In subpopulation shift, the training distribution for each domain is long-tailed and the test distribution is domain-class balanced, i.e., the number of examples in each domain-class pair is the same. For domain shift in the synthetic datasets, following (Peng et al., 2019; Shi et al., 2021), we hold out one domain as the test domain and use the remaining domains for training. All baselines and TALLY use the same evaluation protocol. We list the hyperparameters for the four synthetic datasets in Table 5.

D.3 FULL RESULTS

The full results of subpopulation shift are reported in Table 6. For domain shift, we report the per-domain results for VLCS-LT, PACS-LT, OfficeHome-LT, and DomainNet-LT in Tables 7, 8, 9, and 10, respectively. In the domain shift scenario of VLCS-LT, although TALLY only performs best on VOC2007 (VLCS), its results are more stable than those of other approaches, leading to the best average performance.

E.2 DETAILED EXPERIMENTAL SETUPS AND HYPERPARAMETERS

In this section, we detail how we split the training and test sets of the real-world datasets with domain shifts. Specifically, we follow Koh et al. (2021) and use their split for iWildCam, where a set of camera locations is held out for testing and the remaining locations are used for training. For TerraInc, we adopt the same strategy to split training and test domains, since TerraInc also focuses on wildlife recognition and its data distribution is similar to that of iWildCam. All baselines and TALLY use the same evaluation protocol. We list the hyperparameters for both TerraInc and iWildCam in Table 11.

E.3 FULL RESULTS

The full results on Real-world Data are reported in Table 12 . 

F ADDITIONAL RESULTS OF ANALYSIS

In this section, we conduct two additional analyses to better understand how TALLY works, and provide additional results for the analyses in the main paper. The comparison with balanced fine-tuning is reported in Table 13, which covers performance under subpopulation shift on OfficeHome-LT and DomainNet-LT and under domain shift on TerraInc and iWildCam. According to the results, TALLY still outperforms all variants of balanced fine-tuning, indicating its effectiveness in addressing the multi-domain long-tailed learning problem by augmenting disentangled representations.

Table 13: Performance comparison between TALLY and balanced fine-tuning. FT means fine-tuning. Here, worst performance means the class-balanced worst-domain accuracy.

In Figure 8 and Table 15, we report the full results of combining long-tailed learning and invariant learning approaches: Figure 8 shows the results under domain shift, and Table 15 lists all results with standard deviations over three seeds.

F.4 FULL RESULTS OF THE EFFECT OF PROTOTYPES

We report the full results of the prototype analysis in Table 16 . 

G RESULTS ON STANDARD DOMAIN GENERALIZATION BENCHMARKS

In this section, we present an additional comparison on standard domain generalization benchmarks. Note that the data distributions in these standard benchmarks are not long-tailed and are therefore not the focus of this paper; the goal is simply to compare our approach with other domain generalization methods. In Tables 18-21, we present results on four standard benchmarks: VLCS, PACS, OfficeHome, and DomainNet, respectively. Results for all algorithms except TALLY are copied directly from Gulrajani & Lopez-Paz (2021) and Yang et al. (2022). In Table 22, we summarize all results and compare the different approaches. According to the results, TALLY achieves performance comparable to state-of-the-art domain generalization approaches.



Figure 1: Illustration of imbalanced class distributions across domains in iWildCam, a wildlife recognition benchmark (Beery et al., 2020). Both subpopulation shift and domain shift settings are illustrated.

TALLY assumes that domain-associated nuisance information can be transferred among examples. It leverages this to perform augmentation by transferring domain-specific nuisance factors between classes with a novel selective balanced sampling strategy. In practice, TALLY first disentangles examples into latent semantic and nuisance factors. Then, we introduce a domain-agnostic prototype representation for each class in order to eliminate the nuisance information, whereby the semantic representation of each example is linearly interpolated with the corresponding prototype representation. Finally, TALLY reassembles the semantic and nuisance factors of different examples to produce augmented representations.

3.1 REPRESENTATION DISENTANGLEMENT AND REASSEMBLY

As described above, TALLY reassembles augmented examples from pairs of examples by combining the semantic representation of one with the domain-related nuisance factors of the other.
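As a rough, hypothetical sketch of this disentangle-and-reassemble step, one can treat feature statistics as the domain-related nuisance and the normalized residual as the semantic part (an AdaIN/MixStyle-style instantiation over 1-D feature vectors; the function names and the optional prototype interpolation with coefficient `lam` are our illustrative assumptions, not TALLY's exact implementation):

```python
from statistics import mean, pstdev

def disentangle(feat):
    # Nuisance = feature statistics (mean, std); semantic = normalized residual.
    mu = mean(feat)
    sigma = pstdev(feat) or 1.0  # guard against constant features
    return [(f - mu) / sigma for f in feat], (mu, sigma)

def reassemble(semantic, nuisance):
    # Re-dress a semantic representation with another example's nuisance.
    mu, sigma = nuisance
    return [sigma * s + mu for s in semantic]

def augment(feat_i, feat_j, prototype=None, lam=0.5):
    # Keep the semantic content of x_i, borrow the nuisance of x_j; optionally
    # pull the semantic part toward the domain-agnostic class prototype.
    sem_i, _ = disentangle(feat_i)
    _, nui_j = disentangle(feat_j)
    if prototype is not None:
        sem_p, _ = disentangle(prototype)
        sem_i = [lam * a + (1 - lam) * b for a, b in zip(sem_i, sem_p)]
    return reassemble(sem_i, nui_j)
```

By construction, the augmented representation inherits the first- and second-order statistics (the stand-in for domain nuisance) of the second example.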

Figure 3: Illustration of selective balanced sampling. Transition probabilities between different pairs are visualized. Detailed discussion is in Appendix A.1.

Algorithm 1 TALLY Training Process
Require: learning rate η; warm-start epochs T_0; prototype momentum γ; model f_θ(·) with hidden representation f^r_θ(·) at layer r; dataset D^tr = {(x, y, d)}
1: Initialize domain-agnostic prototypes {r_c^(0)}_{c=1}^C and class-agnostic statistics {(u^(0), …)}
2: Warm up f_θ with ERM for t < T_0
3: for t = T_0 to T do
4:   y_i ~ Uniform(C), d_j ~ Uniform(D)   ▷ Randomly sample a class and a domain
5:   (x_i, y_i, d_i) ~ {D^tr | y = y_i}, (x_j, y_j, d_j) ~ {D^tr | d = d_j}
6:   (s_i, s_j) ← (f^r(x_i), f^r(x_j))
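The sampling step of the training process (lines 4-5) and the prototype momentum update can be sketched as follows. This is a simplified illustration: the dataset is a flat list of `(x, y, d)` triples and all function names are ours:

```python
import random

def selective_balanced_pair(dataset, classes, domains, rng=random):
    # Lines 4-5: the semantic example is drawn class-uniformly, while the
    # nuisance example is drawn domain-uniformly, regardless of group sizes.
    y_i = rng.choice(classes)   # y_i ~ Uniform(C)
    d_j = rng.choice(domains)   # d_j ~ Uniform(D)
    ex_i = rng.choice([e for e in dataset if e[1] == y_i])
    ex_j = rng.choice([e for e in dataset if e[2] == d_j])
    return ex_i, ex_j

def update_prototype(proto, semantic, gamma):
    # Momentum update of a domain-agnostic class prototype (momentum γ):
    # averaging semantic representations over time washes out domain effects.
    return [gamma * p + (1 - gamma) * s for p, s in zip(proto, semantic)]
```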

Figure 6: Analysis of prototype-guided invariant learning. C Only and D Only represent only using the class prototype representation or class-agnostic domain factors, respectively.

It is worth noting that the representation augmentation module is only used during the training stage.

Algorithm 2 TALLY Testing Process
Require: model f_θ*(·) with learned parameters θ*; test dataset D^ts
1: Feed all test examples {(x_i, y_i, d_i)}_{i=1}^{n_ts} to the model f_θ*(·) and obtain the predicted label of each example (x, y, d) as ŷ = f_θ*(x)
2: Evaluate and report the performance based on the predicted values and the ground truth.

To see why LISA changes the label distribution, consider a simple example. Assume we have two classes and two domains, and the ratio of training examples among the four domain-class pairs is (y_1, d_1) : (y_1, d_2) : (y_2, d_1) : (y_2, d_2) = 100 : 200 : 80 : 5. The label imbalance ratio is 300 : 85 ≈ 3.52. Applying LISA, we essentially have 100 × 200 = 20000 example pairs in class 1 (y_1) and 80 × 5 = 400 example pairs in class 2 (y_2), which roughly leads to a new label imbalance ratio of 20000 : 400 = 50 > 3.52. The comparison between the numbers of example pairs available for interpolation does not precisely reflect the training label imbalance ratio under LISA, but it shows that LISA changes the label distribution to some extent.

C ADDITIONAL DETAILS OF GENERAL EXPERIMENTAL SETUPS

C.1 DETAILED BASELINE DESCRIPTIONS

Figure 7: Long-tailed training distributions for all synthetic datasets. Here, the x-axis represents sorted class indices.

FINE-TUNING ERM ON BALANCED DATASETS

In this section, we conduct an additional experiment to compare TALLY with a simpler approach: training ERM on an imbalanced dataset and fine-tuning it on a corresponding balanced dataset. Here, we adopt three variants of balanced datasets for fine-tuning: (1) a class-balanced dataset, where the number of examples is the same across classes; (2) a domain-balanced dataset, where the number of examples is the same across domains; and (3) a domain-class balanced dataset, where the number of examples is the same across each domain-class group. The results are reported in Table 13.
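The three balanced fine-tuning sets can be obtained by per-group down-sampling; a minimal sketch, assuming examples are `(x, y, d)` triples (the helper names are ours):

```python
import random
from collections import defaultdict

def balanced_subset(dataset, group_key, seed=0):
    # Down-sample so every group defined by group_key is equally large
    # (each group keeps as many examples as the smallest group).
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in dataset:
        groups[group_key(ex)].append(ex)
    n = min(len(g) for g in groups.values())
    return [ex for g in groups.values() for ex in rng.sample(g, n)]

# The three fine-tuning variants described above:
def class_balanced(ds):        return balanced_subset(ds, lambda e: e[1])
def domain_balanced(ds):       return balanced_subset(ds, lambda e: e[2])
def domain_class_balanced(ds): return balanced_subset(ds, lambda e: (e[2], e[1]))
```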

Figure 8: Domain shift results (Macro F1) of the comparison between TALLY and variants of two domain generalization approaches (CORAL, MixStyle), where we replace the losses of them with class re-weighting or re-sampling ones.

An illustration of TALLY. Left: the overall approach produces augmented representations from a pair of examples x_i and x_j. At a chosen layer, it mixes their semantic and nuisance information to create an augmented representation. Right: in detail, the augmentation step disentangles hidden representations.

We consider two categories of distribution shifts: subpopulation shift and domain shift. In subpopulation shift, the test domains have been observed during training, but the test distribution is class-balanced and domain-balanced, i.e., D^ts ⊆ D^tr and η_d^ts = 1/|D^ts| for all d ∈ D^ts. In domain shift, the test domains are disjoint from the training domains, i.e., D^tr ∩ D^ts = ∅.
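The two settings can be checked mechanically from the domain sets alone; a small sketch of the definitions above:

```python
def shift_type(train_domains, test_domains):
    # Subpopulation shift: every test domain was seen in training (D_ts ⊆ D_tr).
    # Domain shift: training and test domains are disjoint (D_tr ∩ D_ts = ∅).
    tr, ts = set(train_domains), set(test_domains)
    if ts <= tr:
        return "subpopulation shift"
    if not (tr & ts):
        return "domain shift"
    return "mixed"
```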

Results of subpopulation shifts and domain shifts on synthetic data. Domain-class balanced accuracy is reported. See full table with standard deviation in Appendix D.3. We bold the best results and underline the second best results. OH-LT and DN-LT represent OfficeHome-LT and DomainNet-LT, respectively.

The full table with standard deviations is provided in Table 6 of Appendix D.3. The following key observations can be made from Table 1 (left). We first observe that most single-domain re-weighting approaches (e.g., Focal, LDAM) consistently outperform multi-domain learning approaches (e.g., GroupDRO, CORAL) in most scenarios, indicating that class imbalance is likely more detrimental than domain imbalance. This observation is not surprising, since all domains are observed during both training and testing in subpopulation shift problems. Even so, TALLY consistently outperforms all methods, decreasing error by 7.29% and verifying its effectiveness in improving robustness to subpopulation shift. It is particularly noteworthy that TALLY shows superior performance compared with BODA, an invariant learning approach for multi-domain long-tailed learning. These results indicate that designing regularizers suitable for datasets from diverse domains can be challenging; instead, balanced augmentation is capable of improving robustness by transferring domain nuisances between examples.

Performance w.r.t. Class Size. We split all classes into five levels. XL and XS represent the largest and smallest classes, respectively.

Results of domain shifts on real-world data. We report the full results in Appendix E.3.

We report the results over all test domains in Table 2. The conclusions are largely consistent with those from Sec. 4.1: TALLY consistently improves over all baselines and enhances the robustness of multi-domain long-tailed learning. Additionally, data-interpolation-based invariant learning approaches (e.g., LISA) hurt performance compared with ERM. This is not a surprise, because examples from large classes dominate the interpolation process, which essentially exacerbates the class imbalance (see Appendix B for more discussion). The superiority of TALLY over prior augmentation techniques is further evidence of the effectiveness of balanced augmentation.
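The pair-counting argument behind this effect (detailed in Appendix B) can be reproduced in a few lines, using the hypothetical 100 : 200 : 80 : 5 domain-class example from that appendix:

```python
def imbalance_ratio(class_sizes):
    # Ratio between the largest and smallest class.
    return max(class_sizes.values()) / min(class_sizes.values())

# Domain-class pair sizes from the Appendix B example.
n = {("y1", "d1"): 100, ("y1", "d2"): 200, ("y2", "d1"): 80, ("y2", "d2"): 5}

# Original label distribution: 300 vs. 85 examples.
orig = {y: sum(v for (yy, _), v in n.items() if yy == y) for y in ("y1", "y2")}

# Intra-label LISA interpolates cross-domain pairs within each class, so the
# pool of candidate pairs per class is the product of its two domain sizes.
pairs = {"y1": n[("y1", "d1")] * n[("y1", "d2")],
         "y2": n[("y2", "d1")] * n[("y2", "d2")]}
```

The imbalance ratio grows from roughly 3.53 over raw examples to 50 over interpolation pairs, illustrating how the interpolation pool skews further toward the head class.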

Invariance Analysis of TALLY. OH-LT and DN-LT represent OfficeHome-LT and DomainNet-LT, respectively.

Smaller I_acc and I_kl values indicate representations that are more invariant with respect to the labels. The results show that TALLY does lead to greater domain invariance than prior invariant learning approaches (e.g., BODA). Finally, we compare the proposed selective balanced sampling in TALLY with domain-class balanced sampling. As discussed in Sec. 3.2, for an example pair (x_i, y_i, d_i) and (x_j, y_j, d_j), selective balanced sampling draws y_i ~ Uniform(C) and d_j ~ Uniform(D), while traditional balanced sampling draws (y_i, d_i), (y_j, d_j) ~ Uniform(C, D). The results for subpopulation shift on OfficeHome-LT and DomainNet-LT and for domain shift (Macro-F1) on TerraInc and iWildCam are reported in Table 4 (see Appendix F.5 for full results), indicating the effectiveness of selective balanced sampling in transferring knowledge across domains and classes.

Hyperparameters for experiments on synthetic data.

Full results of subpopulation shifts on long-tailed variants of domain generalization benchmarks. The standard deviation is computed across three seeds.

Domain shift results on VLCS-LT.

Domain shift results on PACS-LT.

Domain shift results on OfficeHome-LT.

Domain shift results on DomainNet-LT.

Full Results of Domain Shifts on Real-world Data.

Hyperparameters for experiments on real-world data.

In this section, we analyze the compatibility of TALLY. Since TALLY only augments disentangled representations during the training stage, it can easily be combined with other long-tailed learning approaches. Specifically, we combine TALLY with PaCo and RIDE in this analysis and report the results in Table 14. We observe that incorporating TALLY significantly improves performance over vanilla PaCo and RIDE. Nevertheless, the original TALLY already shows competitive performance compared with TALLY+PaCo and TALLY+RIDE.

Compatibility Analysis of TALLY. Worst performance represents the class-balanced worst-domain accuracy.

Full results of the comparison between TALLY and variants of two representative domain generalization approaches (CORAL, MixStyle). Worst means class-balanced worst domain accuracy in subpopulation shift.

Full results of the analysis of prototype-guided invariant learning. C Only and D Only represent only using class prototype representation or class-agnostic domain factors, respectively. Worst means class-balanced worst domain accuracy in subpopulation shift.

Full results of comparison between sampling strategies. Worst means class-balanced worst domain accuracy in subpopulation shift.

Comparison on the standard VLCS benchmark.

Method | Caltech101 | LabelMe | SUN09 | VOC2007 | Avg
… | … ± 0.1 | 66.1 ± 1.2 | 73.4 ± 0.3 | 77.5 ± 1.2 | 78.8
MMD | 97.7 ± 0.1 | 64.0 ± 1.1 | 72.8 ± 0.2 | 75.3 ± 3.3 | 77.5
DANN | 99.0 ± 0.3 | 65.1 ± 1.4 | 73.1 ± 0.3 | 77.2 ± 0.6 | 78.6
CDANN | 97.1 ± 0.3 | 65.1 ± 1.2 | 70.7 ± 0.8 | 77.1 ± 1.5 | 77.5
MTL | 97.8 ± 0.4 | 64.3 ± 0.3 | 71.5 ± 0.7 | 75.3 ± 1.7 | 77.2
SagNet | 97.9 ± 0.4 | 64.5 ± 0.5 | 71.4 ± 1.3 | 77.5 ± 0.5 | 77.8
ARM | 98.7 ± 0.2 | 63.6 ± 0.7 | 71.3 ± 1.2 | 76.7 ± 0.6 | 77.6
VREx | 98.4 ± 0.3 | 64.4 ± 1.4 | 74.1 ± 0.4 | 76.2 ± 1.3 | 78.3
RSC | 97.9 ± 0.1 | 62.5 ± 0.7 | 72.3 ± 1.2 | 75.6 ± 0.8 | 77.1
BODA | 98.1 ± 0.3 | 64.5 ± 0.4 | 74.3 ± 0.3 | 78.0 ± 0.6 | 78.5
TALLY (ours) | 97.5 ± 0.5 | 67.2 ± 1.1 | 73.8 ± 0.5 | 79.2 ± 0.9 | 78.8

Comparison on the standard PACS benchmark.

Comparison on the standard OfficeHome benchmark.

Method | Art | Clipart | Product | Real-World | Avg
… | … ± 0.7 | 52.4 ± 0.3 | 75.8 ± 0.1 | 76.6 ± 0.3 | 66.5
IRM | 58.9 ± 2.3 | 52.2 ± 1.6 | 72.1 ± 2.9 | 74.0 ± 2.5 | 64.3
GroupDRO | 60.4 ± 0.7 | 52.7 ± 1.0 | 75.0 ± 0.7 | 76.0 ± 0.7 | 66.0
Mixup | 62.4 ± 0.8 | 54.8 ± 0.6 | 76.9 ± 0.3 | 78.3 ± 0.2 | 68.1
MLDG | 61.5 ± 0.9 | 53.2 ± 0.6 | 75.0 ± 1.2 | 77.5 ± 0.4 | 66.8
CORAL | 65.3 ± 0.4 | 54.4 ± 0.5 | 76.5 ± 0.1 | 78.4 ± 0.5 | 68.…
… | … ± 1.4 | 51.4 ± 0.3 | 74.8 ± 1.1 | 75.1 ± 1.3 | 65.5
BODA | 65.4 ± 0.1 | 55.4 ± 0.3 | 77.1 ± 0.1 | 79.5 ± 0.3 | 69.3
TALLY (ours) | 64.2 ± 0.5 | 55.1 ± 0.8 | 78.0 ± 1.1 | 79.2 ± 0.5 | 69.1

Comparison on the standard DomainNet benchmark.

Method | Clipart | Infograph | Painting | Quickdraw | Real | Sketch | Avg
… | … ± 0.4 | 18.8 ± 0.3 | 46.7 ± 0.3 | 12.2 ± 0.4 | 59.6 ± 0.1 | 58.1 ± 0.3 | 40.9
IRM | 42.3 ± 3.1 | 15.0 ± 1.5 | 38.3 ± 4.3 | 10.9 ± 0.5 | 48.2 ± 5.2 | 48.5 ± 2.8 | 33.9
GroupDRO | 40.1 ± 0.6 | 17.5 ± 0.4 | 33.8 ± 0.5 | 9.3 ± 0.3 | 51.6 ± 0.4 | 47.2 ± 0.5 | 33.3
Mixup | 48.2 ± 0.5 | 18.5 ± 0.5 | 44.3 ± 0.5 | 12.5 ± 0.4 | 55.8 ± 0.3 | 55.7 ± 0.3 | 39.2
MLDG | 50.2 ± 0.4 | 19.1 ± 0.3 | 45.8 ± 0.7 | 13.4 ± 0.3 | 59.6 ± 0.2 | 59.1 ± 0.2 | 41.2
CORAL | 50.1 ± 0.6 | 19.7 ± 0.2 | 46.6 ± 0.3 | 13.4 ± 0.4 | 59.8 ± 0.2 | 59.2 ± 0.1 | 41.5
MMD | 28.9 ± 11.9 | 11.0 ± 4.6 | 26.8 ± 11.3 | 8.7 ± 2.1 | 32.7 ± 13.8 | 32.1 ± 13.3 | 23.4
DANN | 46.8 ± 0.6 | 18.3 ± 0.1 | 44.2 ± 0.7 | 11.8 ± 0.1 | 55.5 ± 0.4 | 53.1 ± 0.2 | 38.3
CDANN | 45.9 ± 0.5 | 17.3 ± 0.1 | 43.7 ± 0.9 | 12.1 ± 0.7 | 56.2 ± 0.4 | 54.6 ± 0.4 | 38.3
MTL | 49.2 ± 0.1 | 18.5 ± 0.4 | 46.0 ± 0.1 | 12.5 ± 0.1 | 59.5 ± 0.3 | 57.9 ± 0.5 | 40.6
SagNet | 48.8 ± 0.2 | 19.0 ± 0.2 | 45.3 ± 0.3 | 12.7 ± 0.5 | 58.1 ± 0.5 | 57.7 ± 0.3 | 40.3
ARM | 43.5 ± 0.4 | 16.3 ± 0.5 | 40.9 ± 1.1 | 9.4 ± 0.1 | 53.4 ± 0.4 | 49.7 ± 0.3 | 35.5
VREx | 42.0 ± 3.0 | 16.0 ± 1.5 | 35.8 ± 4.6 | 10.9 ± 0.3 | 49.6 ± 4.9 | 47.3 ± 3.5 | 33.6
RSC | 47.8 ± 0.9 | 18.3 ± 0.5 | 44.4 ± 0.6 | 12.2 ± 0.2 | 55.7 ± 0.7 | 55.0 ± 1.2 | 38.9
BODA | 51.3 ± 0.3 | 20.5 ± 0.7 | 48.0 ± 0.1 | 13.8 ± 0.6 | 60.6 ± 0.4 | 62.1 ± 0.4 | 42.7
TALLY (ours) | 50.5 ± 0.2 | 19.7 ± 0.1 | 47.7 ± 0.6 | 14.1 ± 0.3 | 60.0 ± 0.2 | 60.1 ± 0.5 | 42.0

Domain shift results over all four benchmarks.

