BRANCH-TRAIN-MERGE: EMBARRASSINGLY PARALLEL TRAINING OF EXPERT LANGUAGE MODELS

Abstract

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of language models (LMs). We show it is possible to independently train subparts of a new class of LMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LMs. BTM learns a set of independent EXPERT LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training on new domains, and then merging the resulting models back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities compared to GPT-style transformer LMs when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; ensembles with random data splits do not perform well. Our results suggest that extreme parallelism could be used to efficiently scale LMs in future work.

1. INTRODUCTION

Training and inference in language models (LMs) typically require access to supercomputers that can achieve the massive multi-node synchronization required to compute model activations and gradients (Brown et al., 2020; Fedus et al., 2022; Zhang et al., 2022). We develop a new class of LMs that is instead embarrassingly parallel: different parts of the model are independently trained on different subsets of the data, with no need for multi-node training or inference (Figure 1).

Our new ELMFOREST (footnote 1) model consists of a set of EXPERT LMs (ELMs), each specialized to a distinct domain in the training corpus, e.g., scientific or legal text. ELMs are independently functional LMs with no shared parameters, unlike recent mixture-of-experts models that only specialize the transformer feedforward layers to domains (Gururangan et al., 2022). ELMs can be added or removed at any time to update data coverage, ensembled to generalize to new domains, or parameter averaged to collapse back to a single LM for more efficient inference.

We also present the Branch-Train-Merge (BTM) algorithm for learning ELMs. BTM repeatedly expands the ELMFOREST by adding one or more new ELMs in parallel. Each new ELM is first branched by initializing a new LM with an average of the parameters of the most relevant LMs in the current set, then further trained on new domains with a standard cross-entropy loss, and finally merged into the ELMFOREST (Figure 2). The ELMFOREST is initialized with a single LM, trained on heterogeneous data to establish strong shared representations for future domain specialization.

When evaluated in- and out-of-domain, ELMFORESTs trained with BTM outperform GPT-style transformer LMs, a domain-specialized mixture-of-experts baseline (Gururangan et al., 2022), and non-domain-based ensemble baselines across a range of computational budgets, up to 1.3B parameters per ELM trained for 7000 GPU-hours in aggregate (§4.2). These gains are biggest for ELMFOREST ensembles, which activate all experts during inference, but also hold when we combine experts with parameter averaging. We also conduct an extensive analysis of these results: expert specialization to domains is crucial (§5.1), while the compute budget allocation (§5.2) and the choice of training data for the initial ELM (§5.3) are much less so. We release our code and models publicly.

2. ELMFORESTS

We define an ELMFOREST to be a set of EXPERT LMs (ELMs), each independently trained to specialize to a different subset of a corpus. ELMs are inspired by the experts in earlier MoE models (Jacobs et al., 1991), but we define ours to be domain specialists and specialize the full LM instead of components. We follow Gururangan et al. (2022) and define domains by provenance, or the source of the document (e.g., legal document, computer science research paper), which yields simple and interpretable corpus segmentations useful for identifying ELMs in our experiments (footnote 3). Potential extensions to multi-lingual, -modal, -task, or other types of data splits are left for future work. ELMs remain independent throughout training and inference, enabling the functions below.

2.1. ADDING AND REMOVING ELMS

Existing control techniques to steer LMs towards (Keskar et al., 2019; Gururangan et al., 2020; Dathathri et al., 2020) or away (Welleck et al., 2019) from certain behaviors tend to be expensive, require retraining the model, or do not provide strong guarantees on test-time behavior (Gehman et al., 2020). In contrast, ELMFORESTs allow for explicit inference-time application of constraints on the provenance of training data. We can modify the domain coverage of an ELMFOREST at any time by incorporating new ELMs specialized to different domains or removing existing ELMs in the set.
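As an illustration of this add-and-remove behavior, the following is a minimal sketch of an ELMFOREST as a container of independent experts keyed by provenance domain. The class `ELMForest`, the type alias `ExpertLM`, and the toy lambda experts are hypothetical names introduced only for this example; they are not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an expert LM: maps a token history to a
# next-token distribution. A real ELM would be a full transformer LM.
ExpertLM = Callable[[List[str]], Dict[str, float]]


@dataclass
class ELMForest:
    """Toy container: one independent expert per provenance domain."""
    experts: Dict[str, ExpertLM] = field(default_factory=dict)

    def add(self, domain: str, expert: ExpertLM) -> None:
        # Adding an expert never touches the parameters of the others.
        self.experts[domain] = expert

    def remove(self, domain: str) -> ExpertLM:
        # Removing an expert drops that domain's specialist from inference.
        return self.experts.pop(domain)

    def domains(self) -> List[str]:
        return sorted(self.experts)


forest = ELMForest()
forest.add("legal", lambda history: {"court": 0.7, "cell": 0.3})
forest.add("science", lambda history: {"court": 0.1, "cell": 0.9})
forest.remove("legal")
print(forest.domains())
```

Because the experts share no parameters, both operations are plain dictionary updates; no gradient step or retraining is involved in this sketch.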

2.2. INFERENCE FROM ELMFORESTS

ELMFORESTs support two inference modes, which trade off efficiency for performance.

Output Ensembling. In the first method, we ensemble the output probabilities of multiple ELMs. This allows us to generalize to texts of unknown domain. We use the cached prior method proposed in Gururangan et al. (2022). Consider the probabilistic view of language modeling, where we estimate $p(X_t \mid x_{<t})$. We introduce a domain variable $D$ alongside each sequence. Then the next-step conditional distribution on the history $x_{<t}$ is:

$$p(X_t \mid x_{<t}) = \sum_{j=1}^{k} p(X_t \mid x_{<t}, D = j) \cdot p(D = j \mid x_{<t})$$

We estimate a domain posterior, or the probability of a sequence belonging to each of the $k$ domains, using Bayes' rule:

$$p(D = j \mid x_{<t}) = \frac{p(x_{<t} \mid D = j) \cdot p(D = j)}{p(x_{<t})} = \frac{p(x_{<t} \mid D = j) \cdot p(D = j)}{\sum_{j'=1}^{k} p(x_{<t} \mid D = j') \cdot p(D = j')}$$

ELMs are used to compute the likelihood over contexts given a domain label. To compute the cached prior, we maintain an exponential moving average (EMA) of posterior probabilities over domains, updated only at the end of each sequence block:

$$p(D = j) = \sum_{i=1}^{N} \lambda^{i} \cdot p(D = j \mid x^{(i)}_{<T})$$

Following Gururangan et al. (2022), we use $N = 100$ sequences (of length $T = 1024$ each) of development data, and set the EMA decay $\lambda = 0.3$. We fix this prior at test time for each domain. Output ensembling naively requires a forward pass through all ELMs in the ELMFOREST, but we observe in practice that the domain posterior is sparse, which suggests that top-k selection of EXPERT LMs can reduce inference-time costs with negligible effects on performance.

Parameter Averaging. Alternatively, we also use weighted parameter averaging (Izmailov et al., 2018; Wortsman et al., 2022; Matena & Raffel, 2021) to collapse the ELMFOREST into a single LM, using the cached prior from above as averaging coefficients. This operation keeps inference cost constant regardless of how many ELMs are added to the set.

Footnote 3: See §A.1 for a discussion on the possible limitations of this domain definition.
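The output-ensembling computation of §2.2 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, experts are represented only by the log-likelihoods and next-token distributions they produce, the cached prior is normalized to sum to one (a step added here that does not change the posterior, since Bayes' rule renormalizes anyway), and sequence i = 1 is assumed to be the first element of the list.

```python
import math
from typing import Dict, List


def domain_posterior(log_liks: List[float], prior: List[float]) -> List[float]:
    """Bayes' rule in log space; assumes every prior entry is > 0."""
    scores = [ll + math.log(p) for ll, p in zip(log_liks, prior)]
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def ensemble_next_token(expert_dists: List[Dict[str, float]],
                        posterior: List[float]) -> Dict[str, float]:
    """Mix the experts' next-token distributions with the posterior."""
    vocab = set().union(*expert_dists)
    return {tok: sum(w * dist.get(tok, 0.0)
                     for dist, w in zip(expert_dists, posterior))
            for tok in vocab}


def cached_prior(posteriors: List[List[float]], decay: float = 0.3) -> List[float]:
    """Decay-weighted sum of per-sequence posteriors, then normalized."""
    k = len(posteriors[0])
    raw = [sum(decay ** i * post[j]
               for i, post in enumerate(posteriors, start=1))
           for j in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


# Two toy experts and the log-likelihood each assigns to the same history.
dists = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}]
posterior = domain_posterior([math.log(0.02), math.log(0.01)], [0.5, 0.5])
mixture = ensemble_next_token(dists, posterior)
prior = cached_prior([[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the first expert explains the history twice as well as the second, so it receives two thirds of the posterior mass and dominates the mixture. The same weights could serve as averaging coefficients for the parameter-averaging mode.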

3. BRANCH-TRAIN-MERGE (BTM)

BRANCH-TRAIN-MERGE training of ELMFOREST models is incremental and embarrassingly parallel (Figure 1). EXPERT LMs are trained fully independently, starting from a seed LM (Figure 2).

3.1. THE BRANCH-TRAIN-MERGE ITERATION

Each BRANCH-TRAIN-MERGE iteration begins with an existing ELMFOREST $E = \{\theta_i\}_{i=1}^{k}$. Each ELM $\theta_i$ represents a corresponding domain $d_i$ in the dataset of $k$ domains $D_E = \{d_i\}_{i=1}^{k}$ modeled by $E$. We first describe the inductive case of $k > 0$, then describe how to train the initial model $\theta_0$ (§3.2).

Step 1 (Branch): Sprouting a new ELM. The new ELM parameters are a function of the current expert set $E$. Given some vector of weights $w = \{w_1, w_2, ..., w_k\}$ over the existing experts $\theta_1, \theta_2, ..., \theta_k$, we can initialize the new expert with the weighted parameter average $\theta_{k+1} \leftarrow \sum_{i=1}^{k} w_i \theta_i$.

Step 2 (Train): Growing the ELM. We train the new ELM $\theta_{k+1}$ on data domain $d_{k+1}$ with the log-likelihood objective. None of the existing ELMs in $E$ are involved in the training of the new ELM. We also refer to this step as branched training to distinguish it from other training regimens.

Step 3 (Merge): Transplanting the ELM. We merge the new ELM $\theta_{k+1}$, which now represents the domain $d_{k+1}$, into $E$ to create an updated set: $E = E \cup \{\theta_{k+1}\}$ and $D_E = D_E \cup \{d_{k+1}\}$.

These operations can be parallelized to add multiple ELMs in a batch or iterated for many batches.
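The three steps can be sketched on toy parameter vectors as follows. This is a hedged illustration under stated assumptions: `Params` is a flat list of floats standing in for a full set of LM parameters, `train_fn` is a hypothetical placeholder for branched training on the new domain, and the function names are introduced only for this example.

```python
from typing import Callable, Dict, List, Sequence

Params = List[float]  # toy stand-in for a full set of LM parameters


def branch(experts: Sequence[Params], weights: Sequence[float]) -> Params:
    """Step 1: initialize the new ELM as a weighted parameter average."""
    assert experts and len(experts) == len(weights)
    dim = len(experts[0])
    return [sum(w * theta[d] for w, theta in zip(weights, experts))
            for d in range(dim)]


def btm_iteration(forest: Dict[str, Params],
                  weights: Dict[str, float],
                  new_domain: str,
                  train_fn: Callable[[Params, str], Params]) -> Dict[str, Params]:
    names = list(forest)
    init = branch([forest[n] for n in names], [weights[n] for n in names])
    trained = train_fn(init, new_domain)  # Step 2: no existing ELM is touched
    merged = dict(forest)                 # Step 3: add the new ELM to the set
    merged[new_domain] = trained
    return merged


# Hypothetical "training" that just shifts every parameter by 1.0.
def toy_train(params: Params, domain: str) -> Params:
    return [p + 1.0 for p in params]


forest = {"legal": [0.0, 2.0], "science": [4.0, 6.0]}
new_forest = btm_iteration(forest, {"legal": 0.25, "science": 0.75},
                           "code", toy_train)
```

Since each call only reads the existing experts, several such iterations for different new domains could run concurrently and be merged afterwards, which is the sense in which the procedure is embarrassingly parallel.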

3.2. STEP 0 (INITIALIZATION): SEEDING THE ELMFOREST

In the first iteration of BTM, $E = \emptyset$; we have no ELMs in the set to branch from. Instead of initializing the first ELMs randomly, we hypothesize that ELM performance is boosted by branching from pretrained LM parameters, since multi-phase adaptive pretraining is an effective way to develop domain-specific language models (Gururangan et al., 2020), and model merging techniques work best with models that have a shared initialization (Izmailov et al., 2018; Frankle et al., 2020). To this end, we perform a seed phase, training a seed LM $\theta_{seed}$ on some data corpus $d_{seed}$, which can be used to initialize the first batch of ELMs in the set.

4. CORE EXPERIMENTS AND RESULTS

We first compare BTM training to compute-matched baselines, to carefully measure the efficiency gains. We use the simplest form of BTM, training on a set of k = 8 domains.

4.1. EXPERIMENTAL SETUP

Data. We use data from Gururangan et al. (2022), which consists of 8 diverse training and 8 evaluation (primarily English-language) domains, ranging from web text and U.S. court opinions to GitHub and COVID-19 research papers. Details are in Appendix Table 7.

Model hyperparameters

The model architecture is a randomly-initialized LM with the GPT-3 (Brown et al., 2020) architecture implemented in Fairseq (Ott et al., 2019). We use 125M (small), 350M (medium), 750M (large), and 1.3B (xl) parameter models. Following Brown et al. (2020), we use the GPT-2 (Radford et al., 2019) vocabulary of 50,264 BPE types, and train with 1,024-token sequences, across document boundaries. We prepend a beginning-of-document token to each document.

Compared Models

• TRANSFORMER-LM: The first baseline is a standard transformer LM, implemented with distributed data parallelism (Li, 2021). This is identical to the DENSE model from Gururangan et al. (2022), in which data from each domain is balanced. We find, in line with Gururangan et al. (2022), that balancing data domains achieves better performance than training without data balancing (Appendix Table 9).
• DEMIX: We follow the training procedure of Gururangan et al. (2022), where feedforward layers in the transformer are trained to specialize as domain experts, resulting in better domain specialization and generalization than other sparsely activated (e.g., MoE) models.
• ELMFOREST: We first conduct a seed phase to initialize the ensemble with LM parameters (§3.2), then conduct branched training on the ELMs (§3.1), all of which are initialized with the seed LM. Between the seed and branched phases, we continue training from the saved optimizer state.

These models are compute-matched, since computation is typically the limiting factor in model training. Like other sparse models (Fedus et al., 2022; Lepikhin et al., 2021; Gururangan et al., 2022), ELMFORESTs decouple compute and parameters; we can train many more parameters at the same computational cost as the equivalent TRANSFORMER-LM. Total parameter counts are in Table 1.

Training settings. To disentangle variations in GPU speed, we use the number of updates as our computational budget: 80k, 32k, 24k, and 12k updates on 16, 32, 64, and 128 GPUs in parallel for the 125M, 350M, 750M, and 1.3B parameter TRANSFORMER-LM and DEMIX baselines, respectively. We use the same GPU counts for the seed phase in the ELMFOREST. For branched training, we divide these GPU counts equally among the ELMs; for example, the 1.3B parameter per GPU ELMFOREST uses 16 GPUs for each of the 8 ELMs.

For all models, we fix the learning rate at 0.0005 with a polynomial (linear) decay learning rate schedule and 8% warmup, which we found to be optimal for most settings after a large grid search. We use a batch size of 16 for each GPU, with gradient accumulation of 32 steps, and train with fp16. We train on NVIDIA V100 32GB GPUs.
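The compute-matching arithmetic in the training settings can be checked with a short script. This is a sketch under one assumption that the text does not state explicitly: after the seed phase, each ELM runs the remaining updates on its equal share of the GPUs. The function names and the `SETTINGS` table layout are hypothetical; the GPU counts and update budgets are the ones listed above.

```python
from typing import Dict, Tuple

# (GPUs, total updates) per model scale, as listed in the training settings.
SETTINGS: Dict[str, Tuple[int, int]] = {
    "125M": (16, 80_000),
    "350M": (32, 32_000),
    "750M": (64, 24_000),
    "1.3B": (128, 12_000),
}


def gpus_per_elm(total_gpus: int, num_elms: int) -> int:
    """Divide the GPU pool equally among ELMs for branched training."""
    if total_gpus % num_elms != 0:
        raise ValueError("GPU count must divide evenly among ELMs")
    return total_gpus // num_elms


def split_updates(total_updates: int, seed_fraction: float) -> Tuple[int, int]:
    """Split the update budget into (seed updates, branched updates)."""
    seed = round(total_updates * seed_fraction)
    return seed, total_updates - seed


def gpu_updates(total_gpus: int, total_updates: int,
                seed_fraction: float, num_elms: int) -> int:
    """GPU-updates consumed by the seed phase plus all branched ELMs."""
    seed, branched = split_updates(total_updates, seed_fraction)
    per_elm = gpus_per_elm(total_gpus, num_elms)
    return total_gpus * seed + num_elms * per_elm * branched


for scale, (gpus, updates) in SETTINGS.items():
    used = gpu_updates(gpus, updates, seed_fraction=0.5, num_elms=8)
    print(scale, gpus_per_elm(gpus, 8), used == gpus * updates)
```

Under this assumption the ELMFOREST consumes the same number of GPU-updates as the baseline at every scale, regardless of the seed fraction; only the amount of cross-GPU synchronization differs.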

4.2. PERFORMANCE COMPARISONS

Our results are shown in Table 1. At these model scales, ELMFORESTs trained with the BTM procedure outperform both the sparsely trained DEMIX LM and the densely trained TRANSFORMER-LM baselines. The improvements in performance we observe over DEMIX layers suggest that isolation of all LM parameters results in better specialization of domain experts.

4.3. EFFICIENCY COMPARISONS

Training ELMFORESTs requires less inter-GPU communication than TRANSFORMER-LM or DEMIX models, since no synchronization occurs between GPUs assigned to different ELMs. This results in higher updates per second and therefore shorter training times. ELMFOREST training jobs were also scheduled and run more quickly, and with less preemption, than the TRANSFORMER-LM and DEMIX training jobs at the same overall budget.

4.4. ELMFOREST PARAMETER AVERAGE

While ELMFOREST substantially improves performance at lower training cost relative to the TRANSFORMER-LM, it comes at the price of a larger model size and higher associated inference costs when ensembling. Here, we explore an alternative way to combine experts to improve generalization with no additional inference costs relative to the TRANSFORMER-LM baseline: parameter averaging. Given some weight vector $w$ over $k$ ELMs $\{\theta_1, ..., \theta_k\}$, we define a single model such that all of its parameters are a weighted average of the ELM parameters, according to $w$: $\theta = \sum_{i=1}^{k} w_i \theta_i$. For $w$, we consider a number of options:

Uniform: We set $w$ to be uniform, i.e., $w_i = 1/k$. This setting disregards the relevance of each ELM to the target domain, assuming all ELMs should contribute equally to the average.

Argmax: We set $w$ to be an indicator vector corresponding to the maximum probability in the domain posterior (§2.2), thus activating only the estimated best-performing ELM.

Posterior: We set $w$ to be the domain posterior (§2.2), computed on the validation set.

Results on the evaluation domains are in Table 3 (footnote 4). Using uniform weights underperforms all baselines, even TRANSFORMER-LMs, highlighting the importance of domain relevance in output ensembling and parameter averaging ELMs. Using the argmax ELM outperforms uniform averaging for small models, but not larger models. Weighting the average with the domain posterior outperforms all other techniques, and consistently improves performance over TRANSFORMER-LMs at no additional inference cost. Though parameter averaging achieves lower performance than output ensembling, the lower inference costs and simplicity of deployment may make averaging the preferred inference technique for resource-constrained applications. Due to its superior performance, we report results for ELMFOREST with output ensembling, unless otherwise noted.

Surprisingly, we observe poor performance of model averaging at the 125M scale. We see later that the amount of compute allocated to the seed phase critically affects the viability of ELMFOREST parameter averaging (§5.2). With sufficient seed training, parameter averaging outperforms TRANSFORMER-LM at all scales.
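The three weighting options of §4.4 can be illustrated on toy parameters. The sketch below is hypothetical rather than taken from the released code: each ELM is a dictionary from a made-up parameter name to a list of floats, and the posterior is supplied by hand instead of being estimated on validation data.

```python
from typing import Dict, List

ParamDict = Dict[str, List[float]]  # toy stand-in for a model state dict


def uniform_weights(k: int) -> List[float]:
    return [1.0 / k] * k


def argmax_weights(posterior: List[float]) -> List[float]:
    """Indicator vector on the most probable domain."""
    best = max(range(len(posterior)), key=lambda j: posterior[j])
    return [1.0 if j == best else 0.0 for j in range(len(posterior))]


def average_parameters(experts: List[ParamDict],
                       weights: List[float]) -> ParamDict:
    """Weighted average of every parameter tensor across ELMs."""
    assert len(experts) == len(weights)
    averaged: ParamDict = {}
    for name in experts[0]:
        size = len(experts[0][name])
        averaged[name] = [
            sum(w * e[name][d] for w, e in zip(weights, experts))
            for d in range(size)
        ]
    return averaged


experts = [{"layer.w": [1.0, 3.0]}, {"layer.w": [3.0, 5.0]}]
posterior = [0.75, 0.25]  # assumed domain posterior for the target data

uniform_model = average_parameters(experts, uniform_weights(2))
argmax_model = average_parameters(experts, argmax_weights(posterior))
posterior_model = average_parameters(experts, posterior)
```

All three variants return a single set of parameters of the same size as one ELM, which is why inference cost stays constant; they differ only in how much each expert contributes.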

5. ANALYSIS

In §4, we fix the training setup to conduct a controlled comparison of BTM to baseline methods. We now analyze the importance of various training and inference decisions on language modeling performance.

5.1. ELMFOREST OUTPERFORMS PARAMETER-MATCHED ENSEMBLES

We compare our method to other LM ensembles to study the effect of increased parameter counts:

Random Ensemble (seed init). A set of LMs trained on random data splits, to assess the importance of ELM domain specialization. We pool the training and development sets of the 8 train domains, divide them into 8 random splits, then conduct BTM on those splits, with 50% seed training.

ELMFOREST (random init). An ELMFOREST trained with BTM where all ELMs are randomly initialized, to assess the effect of seed training. This is equivalent to setting the seed training compute budget to zero updates. We fix the random initialization across models.

ELMFOREST (seed init). The ELMFOREST setting of §4. We conduct BTM on the 8 train domains, and dedicate 50% of the updates in the budget to seed training and 50% to branched ELM training.

Results with the largest models are in Table 4 (footnote 5). ELMFOREST (random init) nearly matches ELMFOREST (seed init) on training domains but performs poorly on evaluation domains. The random ensemble is consistently worse than both variants of ELMFOREST, showing that the performance improvement is not only due to ensembling or increased total model size (footnote 6).

5.2. ELMFOREST PERFORMANCE IS ROBUST TO SEED LM TRAINING COMPUTE ALLOCATION

The ELMFOREST (random init), which has no seed training, underperforms ELMFOREST (seed init) in §5.1, indicating that seed training is essential. On the other hand, TRANSFORMER-LM, equivalent to 100% seed training, also underperforms ELMFOREST (seed init) in §4, which suggests the importance of branched ELM training. We now study the changes to performance when we vary the portion of the compute budget dedicated to seed training. We control for the total compute budget (across seed and branched training).

Our results, in Figure 3, show that the optimal amount of seed training is about 40-60% of the total budget. At both ends of the full range, performance deteriorates, approaching the ELMFOREST (random init) and TRANSFORMER-LM performance (at 0% and 100% seed training, respectively). However, as little as 10% seed training yields strong gains over the ELMFOREST (random init) and TRANSFORMER-LM. This suggests that the majority of BTM training may focus on branched training to dramatically reduce computational costs (§4.3). The optimal share of compute to use towards each training phase likely depends on many factors, including the total compute budget. We leave a more thorough study of this trend to future work.

Parameter averaging fails for an ELMFOREST trained from random initialization (i.e., with no seed phase), resulting in perplexities in the thousands. Since these ELMs still share the same random initialization, we conclude that it is important to seed the ELMFOREST with a shared set of partially trained LM parameters. We observe that parameter averaging performance on training domains is relatively robust to seed training. On evaluation domains, however, the smallest scale ELMFOREST does not achieve optimal performance until about 60% or more of the updates are dedicated to seed training. This explains the poor performance of the 125M parameter scale ELMFOREST average on evaluation domains in Table 3. ELMFOREST parameter averaging at 50% seed training outperforms TRANSFORMER-LM baselines on evaluation domains at the 350M, 750M, and 1.3B parameter scales (Table 3).

5.3. ELMFOREST PERFORMANCE IS ROBUST TO THE CHOICE OF SEED TRAINING CORPUS

We compare the effects of using different training corpora for seed training in BTM. Here, we fix the compute budget allocations studied in §5.2 so that 50% of updates are allocated to seed training and 50% to branched training. As seen in Table 5, our experiments using the most diverse corpora for seed training resulted in the best performance, but even seed training on only JavaScript code yielded better results than the compute-matched TRANSFORMER-LM baseline, and far better than the ELMFOREST (random init) models in Table 1, which use identical random initialization. This suggests that initializing ELMs with parameters of any model checkpoint is critical.

Domain forgetting through ELM removal is mostly robust to seed training corpus. We evaluate the empirical effect of removing ELMs to reduce the influence of specific training domains at inference time (e.g., those that contain stale or harmful text). In Appendix Table 11, we show the performance of ELMFOREST ensembles on the training domains when using all ELMs (from Table 5), and compare to performance when removing the relevant ELM; e.g., to evaluate on REDDIT, we keep active all ELMs except the REDDIT ELM. Removal of an ELM from an ELMFOREST guarantees that the domain will be forgotten, if the seed training corpus did not include that domain's data (§2.1; footnote 7). We find that performance does indeed degrade appreciably on ELMFORESTs when removing the ELM, indicating that ELMFORESTs are capable of effectively forgetting a data domain without any gradient updates to the parameters.
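The removal experiment can be pictured as dropping one expert and redistributing its ensemble weight. The sketch below assumes that the remaining weights are simply renormalized, which is what Bayes' rule yields when the domain set is restricted and the relative priors are kept; whether the prior is re-estimated after removal is not stated in this section. The function name and the toy weights are hypothetical.

```python
from typing import Dict


def remove_domain(weights: Dict[str, float], domain: str) -> Dict[str, float]:
    """Drop one ELM's weight and renormalize the rest to sum to one."""
    kept = {d: w for d, w in weights.items() if d != domain}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("no remaining ELM has positive weight")
    return {d: w / total for d, w in kept.items()}


# Toy posterior over three experts when evaluating on Reddit-like text.
weights = {"reddit": 0.5, "legal": 0.25, "code": 0.25}
without_reddit = remove_domain(weights, "reddit")
print(without_reddit)
```

No expert parameters change in this operation; the Reddit specialist is absent from the mixture and the other experts absorb its share, which is why the perplexity increase on the removed domain is the quantity of interest in Appendix Table 11.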

6. RELATED WORK

Sparsely activated language models have been considered in a few forms (Evci et al., 2020; Mostafa & Wang, 2019; Dettmers & Zettlemoyer, 2019), but most related to this work is the Mixture-of-Experts (MoE) model (Jacobs et al., 1991). Recent MoE models (Shazeer et al., 2017) have been proposed with token-based routing (Lepikhin et al., 2021; Fedus et al., 2022; Lewis et al., 2021; Roller et al., 2021). Of this line of work, ours is most closely related to DEMix layers (Gururangan et al., 2022), which replace transformer feedforward layers with domain experts. Similarly, Pfeiffer et al. (2022) develop a multilingual expert model with language-specific routing, and Kudugunta et al. (2021) develop a multi-task expert model with task-specific routing.

Previous work explored extending the capacity of a model with additional specialized parameters (e.g., adapters; Houlsby et al., 2019; Pfeiffer et al., 2020; Ben Zaken et al., 2022). Our approach is simpler, as our ELMs each consist of an entire model which requires no additional and no shared parameters. Future work may explore combining ELMs with adapters.

Ensemble methods are classic techniques in machine learning (Breiman, 1996; Freund, 1995; Wolpert, 1992). Similar to our work, recent work has considered growing ensembles, in which new models are trained sequentially on streaming data (Caccia et al., 2021).

Our averaging mechanism is inspired by the model merging techniques in the vision and NLP literature (Wortsman et al., 2022; Izmailov et al., 2018; Matena & Raffel, 2021). Our posterior weighted average is highly related to Bayesian model averaging techniques used in classic ensembling methods (Fragoso et al., 2018). Model averaging has also been explored for federated learning (McMahan et al., 2017), where different models are trained to fit privacy-sensitive data on different devices and merged. Future work may explore variations of ELM weighted averages.

The importance of the seed training as a critical warm-up phase for BTM aligns with findings that model merging only works when models share part of their optimization trajectory (Frankle et al., 2020; Entezari et al., 2022). Similar to seed training, Nie et al. (2021) propose dense-to-sparse gating, where mixture-of-experts routing mechanisms are gradually sparsified during the course of training.

7. CONCLUSION

We introduce BTM training, a new algorithm to train an ELMFOREST, which contains many EXPERT LMs that can be added and removed, ensembled, or parameter averaged at any time for efficient scaling and rapid customization. Our extensive experiments show that ELMFOREST ensembles trained with BTM outperform baselines at no additional training cost. Additionally, parameter averaged ELMFORESTs closely approach ELMFOREST ensemble performance while enabling substantially cheaper inference. These results provide compelling evidence for the promise of scaling language models with many smaller, independently trained ELMs.

Footnote 1: Expert Language Models For Efficient Sparse Training.
Footnote 2: URL anonymized for review.
Footnote 4: We display similar findings with training domains in Appendix Table 8.
Footnote 5: We display similar findings with smaller models in Appendix Table 10.
Footnote 6: We speculate that the random ensemble is poor because its constituent models make correlated errors during evaluation (Gontijo-Lopes et al., 2022).
Footnote 7: The actual effect of ELM removal on model performance may depend on the overlap between the removed domain and other training domains.

Figure 1: Fully Synchronized vs. Embarrassingly Parallel Training (§3). (a) In fully synchronized data-parallel training of a TRANSFORMER-LM, all parameters are synchronized across all GPUs. This synchronization incurs hefty cross-node communication costs. (b) In embarrassingly parallel training (our work), individual models are trained on each domain, eliminating expensive cross-node parameter synchronization between those models.

Figure 2: BTM training process overview (§3). In the seed phase (Step 0), an LM is trained on one data corpus. We branch, or copy, the parameters k times (Step 1), and continue to train each copy on a unique data domain, resulting in k ELMs (Step 2), which are merged into the ELMFOREST (Step 3). After the seed phase, ELMs are fully disconnected, with no communication between them.

Table 1: ELMFORESTs trained with BTM outperform all baselines across multiple model scales (§4.2). Average test-set perplexity (↓) for each model scale across the 8 training, 8 evaluation, and 16 data domains. Total parameters are shown for each model and scale. At 125M parameters per GPU, we show the mean and standard deviation over 8 random seeds. For BTM, we show results with 50% of compute dedicated to the seed phase.

Table 3: ELMs can be combined through parameter averaging (§4.4). Average test-set perplexity across the 8 evaluation domains, from the models in Table 1, comparing techniques to collapse ELMFOREST into a single LM. The relatively poor performance of ELMFOREST parameter averaging for the 125M parameter model is investigated (and improved) in §5.2.

Table 4: Domain expert ensemble outperforms random split ensemble (§5.1). Average test-set perplexity (↓) for our largest model scales across the 8 training, 8 evaluation, and all 16 domains. We show similar results for the 125M and 350M parameter scale models in Appendix Figure

Figure 3: ELMFOREST ensembling performance is robust to most seed training compute allocations (§5.2). Test perplexity averaged across the 8 training (left) or 8 evaluation (right) domains (from §4.1) when fixing total compute budget but varying the portion allocated to seed training.

Table 5: ELMFOREST ensembling performance is robust to seed training corpus (§5.3). Test-set perplexity averages on the 8 training, 8 evaluation, and all 16 data domains, using different training corpora for seed LM training. All models are of the 125M parameters per GPU scale.

Table 11: The ability to reduce the influence of domains through ELM removal is (mostly) robust to seed training corpus (§5.3). We present the average test perplexity for the 8 train domains in ELMFORESTs where all ELMs are active. We vary the seed training corpora. In parentheses, we show the increase in perplexity when the ELM trained to specialize on each domain is removed at inference time. Large increases are desired and suggest the ease of removing (e.g., stale or harmful) data from the ELMFOREST's distribution after training.

Table 7: Multi-domain data corpus used in §4 and §5. Details of this corpus, both training and evaluation domains, including the size of our training and evaluation (i.e., validation and test) data in whitespace-separated tokens. We borrow these datasets from Gururangan et al. (2022). † indicates datasets we de-identify with regexes in Table 6. REDDIT was de-identified by Xu et al. (2021); we use their version. Meta researchers did not collect any of the Reddit or Twitter data and the data was not collected on behalf of Meta.

Table 8: Performance of ELM parameter averaging on training domains (§4.4). Average test-set perplexity across the 8 training domains, from the models in Table 1, comparing techniques to collapse ELMFOREST into a single LM. As with evaluation domain results in the main paper, parameter averaging (with posterior weights) generally yields better average perplexities than TRANSFORMER-LM at no additional inference cost, but underperforms ELMFOREST ensembling, which uses more effective parameters and is included for comparison as a lower bound.

The seed phase is vital to our ability to parameter average ELMs (§5.2). Test perplexity averaged across the 8 training (left) and 8 evaluation (right) domains when averaging ELMFOREST with different seed training compute allocations for the 125M and 350M parameter LMs. Seed training is critical to reasonable averaging performance.


Table 6: De-identification schema. We de-identify text using the regexes provided in the above links for the categories listed.

A APPENDIX

A.1 LIMITATIONS

The definition of a domain. The nature of domains in NLP is a matter of active research. Textual domains reflect language variation that stems from factors such as vocabulary differences (Blitzer et al., 2006), sociolinguistic (Biber, 1988) or demographic (Rickford, 1985; Blodgett et al., 2016) variables, community membership (Lucy & Bamman, 2021), end-tasks (Gururangan et al., 2020), or temporal shifts (Lazaridou et al., 2021; Luu et al., 2021). In this work, we follow Gururangan et al. (2022) and define domains by provenance, or the source of the document. Provenance labels yield simple and interpretable segmentations of a corpus, which are useful for identifying ELMs in our experiments. However, other methods for discovering domains, including unsupervised techniques (Aharoni & Goldberg, 2020; Chronopoulou et al., 2022), may yield better expert assignments. We leave experimentation with other definitions of domain for future work.

Domain posterior data requirement. To calculate the domain posteriors used for our ensembling and parameter averaging weights, we assume access to a small additional sample of data to estimate the vector w. While it is easy to imagine that extra data may be available for most applications to estimate the posterior, future work may explore the possibility of eliminating this requirement.

Other distributed training baselines. Our TRANSFORMER-LM baseline is implemented with distributed data parallelism. Model-parallel, fully sharded data-parallel, and other distributed training strategies (Artetxe et al., 2021) confer different scaling patterns that may change the conclusions that we report in this work. However, we expect that BTM will provide strong efficiency gains against these alternatives.

Harms of language models. BTM results in an LM whose test-time behaviors can be controlled with much stronger guarantees after training, due to the isolation of domains in ELMs. However, ELMFORESTs exposed to large datasets scraped from the Internet may contain toxic language (e.g., hate speech) that is difficult to identify with coarse provenance domain labels, and may nevertheless result in harmful output from the ELMs (Gehman et al., 2020). Future work may explore recipes for training and deploying ELMFORESTs to better support user safety.

