AANG: AUTOMATING AUXILIARY LEARNING

Abstract

Auxiliary objectives, supplementary learning signals introduced to aid learning on data-starved or highly complex end-tasks, are commonplace in machine learning. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuition for how and when these objectives improve end-task performance has also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization on the end-task. This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we demonstrate that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines across continued training experiments on a pre-trained model on 5 NLP tasks.



1. INTRODUCTION

The auxiliary learning paradigm, where we augment a primary objective with extra learning signals to boost end-task performance, is a staple of many machine learning (ML) domains. In natural language processing (NLP), well-known models like SpanBERT (Joshi et al., 2020) and RoBERTa (Liu et al., 2019b) are trained on masked language modelling (MLM) auxiliary objectives (Devlin et al., 2018) before fine-tuning on the end-task. And for speech processing and reinforcement learning (RL), Oord et al. (2018) introduced the popular contrastive predictive coding objective, which achieved state-of-the-art performance in many settings when multi-tasked with the end-task. Despite these successes and many more, research into devising such objectives has progressed in a very local, objective-by-objective manner (Raffel et al., 2019; Clark et al., 2020; Grill et al., 2020; Chen et al., 2020). Auxiliary objectives are constructed by hand-design and without much overarching structure, relying on the experience and intuition of a select group of researchers versed in making appropriate design choices. Unfortunately, this status quo not only creates a technical barrier of entry for exploring auxiliary objectives in new domains but also, by virtue of its incremental nature, limits the rate at which new objectives are discovered and investigated.

To address the above challenges, this paper presents a framework for automatically generating and utilizing a large set of candidate auxiliary objectives. Our framework is seeded by the following key observation: leading auxiliary objectives across multiple domains can be viewed as making different design decisions within a 4-stage pipeline: Input Data (D) → Input Transformation (T) → Model Representation (R) → Output (O). For instance, in RL, a common auxiliary objective is to predict the environment's forward dynamics (Agrawal et al., 2016; Hafner et al., 2019).
To construct this objective, the current task state-action pair (D) is corrupted (T) and then passed through the model to produce a latent representation (R), which is finally used to predict the next state (O). Similarly, in NLP, the XLNet (Yang et al., 2019) objective, which performs language modelling on a randomly factorized permutation of the input, can be written within our taxonomy as {D = Out-of-Domain, T = No-op, R = Random-Factorized, O = Next Token}. These two examples (along with others listed in Figure 1) fall within a class we term named objectives: objectives that have been previously proposed in the auxiliary learning literature.

[Figure: Our framework in the context of NLP. We decompose named objectives within our four-stage taxonomy {D, T, R, O}. By taking the Cartesian product of choices across stages, we reproduce named objectives and discover new ones.]

Decomposing named objectives within our taxonomy provides a unified view of the auxiliary learning landscape. From this vantage point, it becomes clear that there are many unexplored combinations of the various primitives used across named objectives. This suggests a simple formula for automatically generating a large set of candidate objectives: take the Cartesian product of the design decisions across stages (Figure 2). Using this compositional process, not only can we reconstruct existing named objectives, we can also generate new combinations. This overcomes the tedium of implementing each objective independently, since we can reuse a small set of simple stage-wise primitives.

Generating a large set of objectives raises the natural question of how to efficiently select the ones most helpful to a given end-task. Instead of leaving this to practitioner intuition, we develop principled guidelines by theoretically studying the impact of auxiliary learning on a particular end-task.
Specifically, using arguments based on algorithmic stability (Hardt et al., 2016; Bousquet & Elisseeff, 2002), we derive end-task generalization error bounds that depend on the choice of auxiliary task. This contributes to existing theory (Saunshi et al., 2020; Xie et al., 2021) on how auxiliary learning impacts the end-task by suggesting a new candidate mechanism: auxiliary learning results in more stable optimization end-points in the sense of Bousquet & Elisseeff (2002), which in theory improves generalization of the final model.

Guided by our theory, we introduce AANG (Automating Auxiliary LearniNG), an efficient, structure-aware algorithm for adaptively combining a set of related objectives to improve generalization on a specific end-task. AANG incorporates the following prescriptions from our theory: (i) auxiliary tasks that are more similar to the end-task are desirable; given a set of objectives, AANG learns adaptive weights to bring the composite objective closer to the end-task. (ii) In general, more auxiliary data is better; AANG maximizes the effective amount of data used in training by using all the generated objectives instead of taking task-specific subsets.

To empirically validate our method for automatically generating and utilizing auxiliary objectives, we experiment on five NLP tasks. We do so in the widely-used setting of continued pretraining (Gururangan et al., 2020; Aghajanyan et al., 2021; Dery et al., 2021b; Zhang et al., 2022), where a model trained with a single auxiliary objective on large-scale data is further trained on end-task related data. Without introducing any external data or architectural modifications, variants of AANG outperform strong and widely used baselines on 4 out of 5 tasks. AANG achieves an average improvement of 4.2% over standard fine-tuning of RoBERTa across our chosen tasks. We believe our results will spur further research into automating auxiliary learning across a variety of settings.
Notably, while we focus on NLP when discussing the space of auxiliary objectives (Section 3) and in our empirical evaluation (Section 6), our theoretical results (Section 4) and AANG itself are domain-agnostic.

2. RELATED WORK

To properly scope this work, we define auxiliary learning as training a model on alternative objectives with the goal of improving performance on some primary end-task. Auxiliary learning is an instantiation of transfer learning (Caruana, 1997; Baxter, 2000; Ruder et al., 2019). It covers the pretrain-then-finetune paradigm (Huh et al., 2016; Devlin et al., 2018; Schneider et al., 2019; Gururangan et al., 2020) as well as end-task aware multitasking approaches (Lin et al., 2019; Dery et al., 2021a;b). Whilst auxiliary objectives may be meta-learned (Liu et al., 2019a; Navon et al., 2020), such objectives are out of the scope of this paper for simplicity, since incorporating them would require further complicating our design space.

This work bears many parallels to the area of neural architecture search (NAS) (Stanley & Miikkulainen, 2002; Zoph & Le, 2016; Roberts et al., 2021). Whilst we seek to automate auxiliary learning, the objective of NAS is to automate the discovery of the right neural architecture given a specific end-task. Search spaces of candidate architectures are created by taking the Cartesian product of architecture design choices across the depth of the network. The design of suitable architectural search spaces for a variety of settings has been an active area of research (Tan & Le, 2019; Howard et al., 2019; Dao et al., 2020; Roberts et al., 2021). To develop AANG, we borrow ideas from the NAS literature on efficient algorithms for sifting through spaces of architectures. Mirroring the popular differentiable NAS method DARTS (Liu et al., 2018), we perform a continuous relaxation over the search space of objectives, allowing for efficient search by gradient descent. We also use a factored approach to model relationships between objectives that share primitives. This is inspired by recent work on stochastic-relaxation weight sharing (Dong & Yang, 2019; Li et al., 2020).
As a theoretical contribution, this work derives an end-task aware generalization error bound for auxiliary learning. Our bound is built on that of Hardt et al. (2016) , who derive generalization bounds for parametric models trained with stochastic gradient descent (SGD). To derive their bounds, they leverage the concept of algorithmic stability introduced by Bousquet & Elisseeff (2002) . Informally, a randomized algorithm is uniformly stable if changing a single training data point in the given samples does not change its end-point too much. Said change is characterized as the average difference in predictions between the two learned models. Stability implies generalization in expectation (Hardt et al., 2016; Kuzborskij & Lampert, 2018) .
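The informal notion of uniform stability above can be stated precisely. A standard formulation, following Bousquet & Elisseeff (2002) and Hardt et al. (2016) (notation ours), is:

```latex
% A randomized algorithm $A$ is $\epsilon$-uniformly stable if, for all
% training sets $S, S'$ of size $n$ that differ in at most one example,
\sup_{z}\;\mathbb{E}_{A}\!\left[\, f\!\left(A(S); z\right) - f\!\left(A(S'); z\right)\right] \;\le\; \epsilon .
```

Hardt et al. (2016) show that an ε-uniformly stable algorithm has expected generalization gap at most ε, which is the sense in which stability implies generalization here.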

3. AUTOMATICALLY GENERATING AUXILIARY OBJECTIVES

To begin, we take a high-level view of the landscape of named objectives. Using running examples from NLP, we propose the following coarse structure for the sequence of choices made in the hand-design of auxiliary objectives:

1. Data, D: Auxiliary objective pipelines begin with a choice of input data. Options range from heterogeneous out-of-domain data (Radford et al., 2019), through in-domain data with respect to the final end-task (Beltagy et al., 2019), to the task data itself (Gururangan et al., 2020). They may even include data outside the modality of the end-task.
2. Input Transformation, T: Many auxiliary objectives are self-supervised with respect to their input data. They corrupt or transform the input and then reconstruct it in whole or in part. For example, input text tokens can be masked, replaced, or deleted. Operations can also be aggregated, as in BERT-Op: mask 80% of selected tokens and randomly replace 50% of the remaining (Devlin et al., 2018; Liu et al., 2019b).
3. Representation, R: After transformation, representations of the input data can be computed from a given model in different ways. A chosen token's representation can depend on only its left context (Left-to-Right) (Radford et al., 2018) or only its right context (Right-to-Left) (Peters et al., 2018). It could also depend on the representations of a randomly selected permutation of other tokens (Random-Factorized) (Yang et al., 2019).
4. Output, O: Finally, representations obtained from the previous stage are fed into a loss function to produce a final output. The choice of output loss is usually coupled with the choice of transformation made in stage 2. Choices include, but are not restricted to, denoising tokens, predicting the next token, or predicting the TF-IDF (Term Frequency-Inverse Document Frequency) of a token.

The above taxonomy {D → T → R → O} is expansive enough to cover a range of named auxiliary objectives of interest in NLP (Figure 1).
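As an illustration of a T-stage primitive, the sketch below implements a BERT-Op-style corruption: select a fraction of tokens, mask most of them, and randomly replace half of the rest. All function and parameter names here are ours, chosen for illustration; the exact fractions follow the description in the text.

```python
import random

def bert_op_transform(tokens, mask_token="<mask>", vocab=None,
                      select_p=0.15, mask_p=0.8, replace_p=0.5, seed=0):
    """Illustrative T-stage primitive in the spirit of BERT-Op:
    select ~select_p of tokens; mask a mask_p fraction of those and
    randomly replace a replace_p fraction of the remainder.
    Returns the corrupted sequence and (index, token) reconstruction
    targets for the O stage."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "a", "cat", "dog"]
    out, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < select_p:
            targets.append((i, tok))          # reconstruction target (O stage)
            r = rng.random()
            if r < mask_p:
                out.append(mask_token)        # masked
            elif r < mask_p + (1 - mask_p) * replace_p:
                out.append(rng.choice(vocab)) # randomly replaced
            else:
                out.append(tok)               # kept unchanged
        else:
            out.append(tok)
    return out, targets

corrupted, targets = bert_op_transform(
    "the quick brown fox jumps over the lazy dog".split(), seed=3)
```

The same interface could host the other transformations named above (delete, replace-only, no-op), which is what makes the stage-wise primitives reusable across objectives.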
For example, we can write any member of the GPT series (Radford et al., 2018; 2019; Brown et al., 2020), which performs left-to-right language modelling on out-of-domain data, as {D = Out-of-Domain, T = No-op, R = Left-To-Right, O = Next Token}. We can summarize the pre-existing choices within each design stage to obtain a unique set of options. For example, we can reduce the set of model representation types used by the objectives enumerated in Figure 1 to the unique set R = {Bi-directional, Left-To-Right, Right-To-Left, Random-Factorized}. Having summarized the list of primitives within each stage, a simple formula for generating a space of auxiliary objectives becomes apparent: take the Cartesian product of the design choices at each stage (see Figure 2). In general, given an instance of our taxonomy, we can construct the space of objectives A = D × T × R × O. A special case is setting O to the end-task supervised output E_O. This leads to the family F_{D=E_D}^{O=E_O}, which is a subset of F_{D=E_D} and includes many objectives like predicting the end-task signal from corrupted input data (Carreras et al., 2003; Charniak, 1997). In Section 6, we will introduce a search space of objectives that leverages task augmentation.
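The Cartesian-product construction above can be sketched in a few lines. The stage-wise primitive lists below are illustrative subsets of those discussed in the text, not the paper's exact search space:

```python
from itertools import product

# Stage-wise primitives summarized from named objectives (illustrative subsets).
D = ["out-of-domain", "in-domain", "task-data"]
T = ["no-op", "mask", "replace", "delete"]
R = ["bidirectional", "left-to-right", "right-to-left", "random-factorized"]
O = ["denoise", "next-token", "tf-idf"]

# The space of objectives A = D x T x R x O.
search_space = list(product(D, T, R, O))

# Named objectives reappear as specific tuples, e.g. a GPT-style objective:
gpt_style = ("out-of-domain", "no-op", "left-to-right", "next-token")
```

Even these small per-stage lists compose into 144 candidate objectives, which is why implementing a handful of primitives per stage is far cheaper than hand-implementing each objective independently.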

4. THE IMPACT OF AUXILIARY LEARNING ON END-TASK GENERALIZATION

In this section, we reduce reliance on practitioner intuition by deriving a set of guiding principles for effectively utilizing the automatically generated objectives from Section 3. Auxiliary learning influences the end-task through both training and generalization error. Previous theory has largely focused on characterizing the impact on end-task training error. Liu et al. (2021), for example, show that end-task agnostic pre-training can create a performance gap in training error compared to training with the end-task alone. The size of this gap depends on how dissimilar the pre-training auxiliary objective is from the end-task. They introduce the following assumption (which we borrow) to formalize their notion of task similarity:

Assumption A.1: Let f_e represent the end-task objective and f_a the auxiliary objective. There exists ∆ ≥ 0 such that ∥∇f_a(θ) − ∇f_e(θ)∥ ≤ ∆ for all θ, where θ represents all the parameters of the model. Smaller ∆ implies f_a is more similar to the primary task f_e.

Liu et al. (2021) bound the end-task agnostic training error gap to be logarithmic in ∆. Unlike training error, end-task generalization error has gone unstudied in the auxiliary learning setting. Bounding the generalization error not only adds to our theoretical understanding of the impact of auxiliary learning but also provides insights to guide algorithm design. To arrive at a bound, we adapt the technique of Hardt et al. (2016), who derive a generalization bound for training on only the end-task via stochastic gradient descent. We consider the end-task aware setting, where the end-task is multi-tasked with the auxiliary objective; this setting has recently been shown to improve end-task performance over the pretrain-then-finetune paradigm (Dery et al., 2021a;b; Yao et al., 2021).

Auxiliary learning with dynamic sampling: We are given an auxiliary objective f_a(·; z) ∈ [0, 1] with N_a samples S_a = (z_1, . . . , z_{N_a}) drawn from the distribution D_a.
f_a can either be a single objective or a weighted linear combination of objectives: f_a = Σ_k w_k f_a^k. At any iteration of SGD, we sample either the end-task function f_e or the auxiliary objective f_a according to probabilities λ_e, λ_a ∈ [0, 1] with λ_e + λ_a = 1. Given the chosen objective, we sample a data-point and perform a stochastic gradient step on it. We now present our bound in this setting.

Theorem 4.1 (Auxiliary learning with Dynamic Sampling). Assume that f_e(·; z_e), f_a(·; z_a) ∈ [0, 1] are both L-Lipschitz with β_e- and β_a-smooth loss functions respectively. Consider that we have N' = N_e + N_a total samples, where f_e and f_a have N_e and N_a samples respectively, and let r_e = N_e/N' be the fraction of the available data represented by the end-task. Suppose that we run stochastic gradient descent for T steps with monotonically non-increasing step sizes α_t ≤ c/t, dynamically sampling the tasks according to λ_e and λ_a. Then, with respect to f_e, the generalization error is bounded by:

    ϵ_gen ⪅ (1/N') · (2cL(L + ∆))^{1/(cλ*β* + 1)} · (γT)^{1 − 1/(cλ*β* + 1)},   where γ = λ_e/r_e    (1)

Here β* = min{β_e, β_a} and λ* is the weighting of the function with the smaller smoothness.

Proof. See Appendix E for the full proof and Appendix F for further discussion.

As a detailed inspection of the proof shows, we derive Equation 1 by appealing to algorithmic stability (Bousquet & Elisseeff, 2002; Hardt et al., 2016; Kuzborskij & Lampert, 2018) (Section 2). To our knowledge, ours is the first work to present an algorithmic-stability view that formally explains how auxiliary learning influences end-task performance. Equation 1 surfaces the following prescriptions about learning with auxiliary tasks:

P1: Smaller ∆ improves ϵ_gen. The more similar the auxiliary objective is to the end-task (under Assumption A.1), the lower the generalization error.
P2: Larger N' leads to smaller ϵ_gen. Since we usually have a fixed amount of task data N_e, we can increase N' by adding more auxiliary data N_a.
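The dynamic-sampling procedure analyzed in this section can be sketched as follows. This is a toy scalar instantiation under our own choices (quadratic losses, function and parameter names ours), not the paper's experimental setup:

```python
import random

def dynamic_sampling_sgd(grad_e, grad_a, data_e, data_a,
                         lam_e=0.5, steps=100, c=0.1, theta=0.0, seed=0):
    """SGD with dynamic task sampling: at each step t, pick the end-task
    with probability lam_e (else the auxiliary task), draw one sample from
    the chosen task's data, and take a gradient step with step size c/t
    (monotonically non-increasing, as the theorem requires)."""
    rng = random.Random(seed)
    for t in range(1, steps + 1):
        if rng.random() < lam_e:
            g = grad_e(theta, rng.choice(data_e))   # end-task step
        else:
            g = grad_a(theta, rng.choice(data_a))   # auxiliary step
        theta -= (c / t) * g
    return theta

# Toy usage: quadratic losses f(theta; z) = (theta - z)^2, so grad = 2(theta - z).
grad = lambda theta, z: 2.0 * (theta - z)
theta_T = dynamic_sampling_sgd(grad, grad, data_e=[1.0], data_a=[1.2],
                               lam_e=0.5, steps=200, c=0.1)
```

When the auxiliary data distribution is close to the end-task's (small ∆ in the sense of Assumption A.1), the auxiliary steps pull θ in nearly the same direction as end-task steps while effectively enlarging N', which is the intuition behind Prescriptions P1 and P2.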

5. END-TASK AWARE SEARCH OF STRUCTURED OBJECTIVE SPACES

Algorithm 1 AANG
Input: search space A; factor vectors {W_All, W_D, W_T, W_R, W_O}; end-task E; end-task weight λ_e; initial model params θ_0 ∈ R^D
repeat
    Sample a batch of n objectives K^n ∼ A
    # Construct the weighting w^n of objectives in K^n
    for k = 1 to n do
        (d, t, r, o) = [K^n_k].stages
        w^n_k ∝ exp(W_All(d, t, r, o) + W_D[d] + W_T[t] + W_R[r] + W_O[o])
    end for
    # Get losses from batches of data
    L_A(K^n, w^n) = Σ_{k=1}^n w^n_k L_k
    L_total = λ_e L_E + (1 − λ_e) L_A
    # Get gradients and update factors
    θ_{t+1}, {∇_{w^n}, ∇_{λ_e}} ← META-TARTAN(θ_t, E, L_total)
    Update {W_All, W_D, W_T, W_R, W_O} using ∇_{w^n}
    Update λ_e using ∇_{λ_e}
until done
Return: θ_T

Guided by Section 4, we build a practical method for exploring a set of objectives A. Whilst the dynamic sampling setting described in Section 4 is amenable to theoretical analysis, we make a few practical changes to it. First, instead of performing alternating gradient descent by sampling f_a, f_e according to λ_a, λ_e, we use these as multitask weights and perform joint training. Joint training has been found to produce superior results compared to alternating optimization when leveraging auxiliary objectives (Aghajanyan et al., 2021). We perform gradient descent on the following total loss, which interpolates between the end-task and the auxiliary loss: L_total = λ_e L_E + (1 − λ_e) L_K, where K is a chosen subset of A. Second, as indicated in Section 4, given K we can write the set as a single objective f_a = Σ_{k∈K} w_k f_a^k. By Prescription P1, we want to choose {w_k} such that f_a has a small ∆ with respect to the end-task f_e. We would also like to set λ_e such that the bound on ϵ_gen is minimized. Whilst a closed form exists for the optimal weightings λ_e, {w_k}, it depends on quantities like {∆_k}, {β_a^k}, and L that are hard to estimate. We therefore propose to learn λ_e and {w_k} in an online, data-driven way. To do this, we build on top of the META-TARTAN algorithm proposed by Dery et al. (2021b).
META-TARTAN is a meta-learning algorithm that learns adaptive weights for different auxiliary tasks in a way that prioritizes end-task generalization. It learns {w_k} by minimizing the loss on the end-task validation set: ∂L_E^val/∂w_k ≈ −(∇_θ L_{f_a^k})^T (∇_θ L_E^val). This corresponds to learning {w_k} such that (∇_θ f_a)^T (∇_θ f_e) is maximized, which minimizes one of the terms contributing to ∆ and thus works toward Prescription P1. We can similarly learn λ_e to minimize the end-task validation loss. For a more detailed discussion of META-TARTAN, please see Appendix B.

So far, we have introduced independent weights {w_k} for each objective. This is sufficient in the case of unrelated objectives. However, the objectives in A share an underlying structure. We exploit this by using a factored approach to model each w_k. We introduce a factor vector for each of the 4 stages introduced in Section 3: W_D ∈ R^|D|, W_T ∈ R^|T|, W_R ∈ R^|R| and W_O ∈ R^|O|. This ties together the weights of objectives that share primitives. To capture the fact that an objective can be more than the sum of its parts, we also introduce an independent weight for each objective: W_All ∈ R^{|D|×|T|×|R|×|O|}. For the objective k generated by the composition of operations {d ∈ D, t ∈ T, r ∈ R, o ∈ O}, its weighting is computed as: w_k ∝ exp(W_All(d, t, r, o) + W_D[d] + W_T[t] + W_R[r] + W_O[o]). Our factored approach not only allows us to share information between objectives but also lets us analyze which stages and primitives are most important to a particular end-task after training is completed (Section 7).

Prescription P2 from Section 4 advocates for introducing as much auxiliary data as possible. As such, instead of fixing a specific subset throughout training for a particular end-task, we propose to utilize all the objectives in A. This also avoids the combinatorial explosion that comes with exploring subsets of A at a time.
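The factored weighting can be sketched as a softmax over per-stage factor scores plus an objective-specific term. The dictionary-based representation below is our own simplification of the factor vectors:

```python
import math

def objective_weights(sampled, W_all, W_d, W_t, W_r, W_o):
    """Normalized weights for a batch of sampled objectives.
    Each objective is a (d, t, r, o) tuple of stage primitives; its logit
    is the sum of the four shared stage-factor scores plus an
    objective-specific term W_all (0 if never set for that tuple)."""
    logits = [W_all.get(k, 0.0) + W_d[k[0]] + W_t[k[1]] + W_r[k[2]] + W_o[k[3]]
              for k in sampled]
    m = max(logits)                          # subtract max for a stable softmax
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

sampled = [("task-data", "mask", "l2r", "denoise"),
           ("task-data", "no-op", "l2r", "next-token")]
W_d = {"task-data": 0.0}
W_t = {"mask": 0.0, "no-op": 0.0}
W_r = {"l2r": 0.0}
W_o = {"denoise": 0.0, "next-token": 0.0}
weights = objective_weights(sampled, {}, W_d, W_t, W_r, W_o)
```

Because the stage factors are shared, a gradient update to, say, W_t["mask"] shifts the weight of every objective that uses masking, sampled or not; that property is what makes sub-sampling tolerable in the next paragraph.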
|A| can be large, and descending on all of A at once can be computationally prohibitive. As an efficient workaround, at each training step we sample a subset of A for execution with META-TARTAN. Our samples are drawn from all of A, so any objective can be used at any timestep. Because we model each w_k via a factored approach, even if an objective is not sampled, its weight is implicitly updated. Our approach is reminiscent of stochastic-relaxation weight sharing (Pham et al., 2018; Dong & Yang, 2019; Li et al., 2020), where sampled architectural primitives result in updates to shared model weights that can be used by other primitives that are not sampled. We coalesce all the ideas introduced so far into Algorithm 1, which we dub AANG (Automating Auxiliary LearniNG). At a high level, given an end-task E:

1. We generate a space of auxiliary objectives A by leveraging the taxonomy discussed in Section 3. A may contain auxiliary tasks that can improve our performance on E.
2. We leverage MAML-style (Finn et al., 2017) meta-learning to adaptively weight the objectives in A based on measuring each objective's influence on E's validation set loss.
3. We make our algorithm scalable by sub-sampling the tasks in A. By exploiting the underlying structure of the objectives in A via a factored approach to modeling task weights, we reduce the impact of the inexact sub-sampling.

6. EXPERIMENTAL SETTING

Our exploration of auxiliary learning has made the following transitions from the status quo: manual to automated, single-task to multitask, end-task agnostic to end-task aware. In this section, we set up experiments to validate these deviations from the standard. We focus on continued pre-training (Gururangan et al., 2020; Aghajanyan et al., 2021), in which we perform further auxiliary learning on an already pre-trained model. We favor this setting over pre-training from scratch (Liu et al., 2019b; Yang et al., 2019) not only because it is a more computationally feasible arena for experimentation but also because it is more relevant to modern ML systems, where building upon pre-trained models is the norm (Qiu et al., 2020; Du et al., 2020).

Model Details and Datasets: We use a pre-trained RoBERTa_base (Liu et al., 2019b) as the shared model base. We implement each auxiliary objective as a separate head on top of this shared base. For classification-based objectives, the output head is a 2-layer multi-layer perceptron (MLP) that receives representations of the special classification token [CLS] (Devlin et al., 2018) from RoBERTa_base. For sequence generation objectives, we make a copy of the pre-trained output layer of RoBERTa_base for each task. Table 4 in Appendix C provides details of the 5 datasets used. All datasets are low-resource classification tasks. Not only are these datasets more amenable to meta-learning from a computational standpoint, but low-resource tasks also benefit the most from auxiliary learning. We also choose these tasks because they feature in previous work which we use as baselines (Gururangan et al., 2020; Dery et al., 2021b).

Baselines and Search Spaces: The following methods are end-task agnostic baselines. By end-task agnostic, we mean that they do not multitask with the end-task; finetuning on the end-task occurs after training on the auxiliary objective.

1. RoBERTa (Liu et al., 2019b): We simply finetune a pre-trained RoBERTa_base on the end-task.
2. TAPT (Gururangan et al., 2020): Continue training RoBERTa_base with masked language modelling on the end-task data itself before finetuning on the end-task.

The following named objectives are end-task aware baselines that use META-TARTAN (Dery et al., 2021b).
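The shared-base-plus-heads layout described above can be sketched with a toy encoder. This is a minimal numpy stand-in under our own naming and sizing, not the actual RoBERTa implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedBaseWithHeads:
    """Toy sketch of the multi-head setup: one shared encoder (standing in
    for RoBERTa-base) feeding a separate 2-layer MLP head per auxiliary
    objective. All names and dimensions here are illustrative."""

    def __init__(self, d_model=16, d_hidden=8, n_classes=2):
        self.d_model, self.d_hidden, self.n_classes = d_model, d_hidden, n_classes
        self.W_base = rng.standard_normal((d_model, d_model)) * 0.1
        self.heads = {}

    def add_head(self, name):
        # Each objective gets its own 2-layer MLP on top of the shared base.
        self.heads[name] = (
            rng.standard_normal((self.d_model, self.d_hidden)) * 0.1,
            rng.standard_normal((self.d_hidden, self.n_classes)) * 0.1,
        )

    def forward(self, name, x):
        h = np.tanh(x @ self.W_base)          # shared [CLS]-like representation
        W1, W2 = self.heads[name]
        return np.maximum(h @ W1, 0.0) @ W2   # objective-specific MLP head

model = SharedBaseWithHeads()
for objective in ["mlm", "tfidf", "next_token"]:
    model.add_head(objective)

logits = model.forward("mlm", rng.standard_normal((4, 16)))  # batch of 4
```

Because every objective's loss backpropagates through the same W_base, auxiliary gradients shape the shared representation while each head stays objective-specific.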

7. RESULTS AND DISCUSSION

In this section, we experimentally validate our case for automating the creation of auxiliary objectives and using them in an end-task aware multitask fashion.

7.1. GOING A LONG WAY WITHOUT EXTERNAL DATA

We first consider the setting where we rely solely on end-task data (task augmentation) and work with the AANG-TD search space, which has 24 objectives. Table 2 shows that automatically generating auxiliary objectives from only task data and using them appropriately is productive.

End-task awareness is key: From Table 2, methods that are end-task aware yield over 1.12% average improvement over those that are end-task agnostic, even under the most generous comparison (GPT-style 79.84% vs. task-agnostic TAPT 78.72%). Knowing the end-task means that at each iteration, AANG can make informed gradient updates by adapting task weights so the resulting auxiliary task better aligns with the end-task (Prescription P1). Amongst the single-task objectives, BERT-style performs best. We posit that this is because RoBERTa was trained from scratch on a similar objective, so this objective represents minimal shift in the training distribution.

Adaptive multi-task auxiliary learning improves performance: We compare single-task end-task aware auxiliary learning to its multitask variant. Table 2 shows that multitasking our 3 different types of language modelling tasks results in improved average performance over using the tasks individually (81.12% for BERT-style versus 81.55% for combining the three single-task objectives). We get our best performance when we multitask the 24 auxiliary objectives automatically generated with our framework using AANG-TD. Boosting the number of objectives from 3 to 24 resulted in a 0.66% improvement in average performance across tasks. This is in line with Prescription P2 from Section 4, since we are increasing the effective amount of auxiliary data. We further posit that introducing more auxiliary objectives also serves to implicitly regularize the end-task during training.

7.2. INCORPORATING EXTERNAL DATA

For the ACL-ARC task, we experiment with introducing auxiliary tasks based on external data. AANG-TD+ED has 40 tasks, 16 of which are based on domain data.
We introduce CS domain data (from the S2ORC dataset (Lo et al., 2019)) that is n = 10× the size of the task data. From Figure 3, we see that AANG-TD+ED makes better use of domain data than end-task aware training using only the BERT-style objective on task data (TAPT) and domain data (DAPT) jointly, as in Dery et al. (2021b). However, AANG-TD+ED (73.70) does not significantly improve over AANG-TD (73.26) on the ACL-ARC task (Figure 3). This might seem at odds with Prescription P2, since the TD+ED search space introduces more data. Note, however, that the AANG search algorithm is approximate, and with a larger search space it can be harder to find composite tasks with a small ∆, as suggested by Prescription P1. We posit that more external data than n = 10× is needed to see marked improvements that offset our inexact search of the space of composite functions; such scales are outside our computational budget.

7.3. WHY DOES AANG WORK?

To better understand why our auxiliary learning pipeline improves end-task performance, we perform multiple ablations under AANG-TD.

Static versus dynamic weighting: We ablate the impact of using static task weights throughout training, as against adaptive task weights. Just as with AANG, we sub-sample n tasks from the search space at every iteration (n is cross-validated exactly as for AANG; see Appendix D). Each sampled task's weight is initialized to 1/n and remains unchanged throughout training. This is the Static Multitask-TD baseline in Table 2. AANG-TD improves upon the static multitask baseline by over 1.1% on average. With adaptive weighting, AANG down-weights objectives that are harmful to the end-task whilst up-weighting relevant ones (Prescription P1). However, static weighting is more compute-friendly since we do not have to calculate task-weight meta-gradients. This compute-vs-performance trade-off is left for practitioners to resolve based on their available resources.

Impact of the number of sampled objectives: Due to computational constraints, AANG sub-samples the set of generated objectives. Whilst this sampling can introduce approximation error when inferring task weightings, it also introduces stochasticity which can help regularize the learned model. From Table 3 (Appendix A), we find that for some tasks (ACL-ARC and SCIERC) sampling a larger number of tasks helps; SE-2016-6 and CHEMPROT, on the other hand, benefit from a smaller number of sampled tasks. Our recommendation is that the number of sampled tasks be cross-validated on a per-task basis.

Learned task-weight trajectories: AANG learns interesting trajectories for weighting design-stage primitives. From Table 2, the fact that AANG-TD roughly matches the best single-task performance (72.46 (±1.65) versus 72.70 (±0.60) for BERT-style) on the SE-2016-6 task suggests that it may be learning to mostly up-weight this task.
Figure 4 shows the learned weight trajectories. Notably, combinations from our taxonomy (Figure 2) such as simple input reconstruction were discovered to have relevant impact on the end-tasks. This means AANG can automatically surface new, previously unexplored objectives relevant to the end-task.

8. LIMITATIONS AND CONCLUSION

Our work has some limitations that we leave for future work. First, because AANG relies on meta-learning, it incurs extra compute relative to simple multitasking: we must independently compute meta-gradients for each auxiliary task, requiring O(n) forward-backward operations for n sampled tasks compared to O(1) for static multitasking. In Table 2, we show that our static Multitask-TD method outperforms all other non-task-adaptive methods by ≈ 2.4% and is thus a viable alternative when runtime is a significant constraint. Secondly, AANG as presented is an approximate algorithm, primarily due to sub-sampling the space of tasks. Thus, as mentioned in Section 7.2, we do not get as much gain as desired when our search space becomes larger. We leave finding an efficient exact search algorithm for future exploration.

A. MORE ABLATION TABLES

B. META-TARTAN

META-TARTAN (Dery et al., 2021b) is a MAML-style (Finn et al., 2017) meta-learning algorithm that learns to adaptively weight a given set of tasks based on their influence on the end-task validation performance. META-TARTAN achieves this by formulating the following bi-level optimization problem:

    θ*, w* = argmin_{θ ∈ g(θ_0), w} L_E(θ)    where    θ_0 = argmin_θ L_total(θ, w) = argmin_θ [w_E L_E(θ) + Σ_{T_i ∈ A} w_i L_{T_i}(θ)]

Note that E is the end-task and A is the set of auxiliary tasks. Since this bi-level problem is difficult to solve directly, Dery et al. (2021a) relax it into an alternating optimization problem where task weights are updated based on the 1-step improvement to the validation performance of the end-task:

    ∂L_E^val(θ_{t+1}(w)) / ∂w_i ≈ −β (∇L_{T_i})^T ∇L_E^val(θ_t)

To prevent this relaxation from finding the trivial solution of upweighting solely the end-task, Dery et al.
(2021b) introduce a special dev-head which they use for estimating the meta-gradient:

∂L_{T*}^val(θ*(w)) / ∂w_i ≈ -β (∇_θ L_{T_i})^T ∇_θ L_E^val([θ_body; ϕ*]_t)

where ϕ*_t is the special dev-head and θ_body is the body of the model. For even more details about META-TARTAN, please see Section 3 of Dery et al. (2021b). Though we leverage META-TARTAN, compared to Dery et al. (2021b), we make three distinct contributions to the field of auxiliary learning. We list them below.
1. Novel Problem Formulation: As far as we are aware, we are the first to formulate the problem of automated auxiliary learning. Specifically, we presented an approach for automatically constructing a suite of auxiliary objectives based on existing objectives. Please note that Dery et al. (2021b) perform auxiliary learning with only the DAPT/TAPT variants of the BERT objective. They effectively assume that the search space of objectives (the 2 they explore) is given beforehand. Our approach automatically creates the search space.
2. Theoretical Novelty: To the best of our knowledge, we are the first work to explore why auxiliary learning improves primary-task performance via algorithmic stability. Dery et al. (2021b), in introducing META-TARTAN, do not attempt to give a theoretical characterization of why the algorithm improves end-task performance.
3. Algorithmic Improvements to META-TARTAN: Please note that META-TARTAN as presented in Dery et al. (2021b) was used with only 2 auxiliary tasks. When scaling to more tasks, using META-TARTAN naively becomes computationally prohibitive: on a search space of N tasks, META-TARTAN requires O(N) computation per step. We improve upon this by sub-sampling k ≪ N tasks per step, which reduces the compute overhead to O(k).
To account for the impact of sub-sampling as an approximation, we introduced the factorised modelling of task weights which allows sharing of information between auxiliary tasks that might themselves be related. 
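As a concrete illustration, the sketch below combines the task sub-sampling and factorised-weight ideas in plain Python. The function names, the softmax parameterization, and the toy gradients are our assumptions for illustration, not the paper's exact implementation:

```python
import math

# Factor logits, one vector per stage of the {D, T, R, O} taxonomy.
factors = {'D': [0.0, 0.0], 'T': [0.0, 0.0, 0.0],
           'R': [0.0, 0.0, 0.0], 'O': [0.0, 0.0]}

def objective_weights(objectives, factors):
    """Factorised weights: an objective's logit is the sum of its four
    stage-factor entries, so objectives sharing a primitive share signal."""
    logits = [factors['D'][d] + factors['T'][t] + factors['R'][r] + factors['O'][o]
              for d, t, r, o in objectives]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def update_factors(objectives, factors, task_grads, val_grad, beta=0.1):
    """Meta-gradient-style step: each sampled objective's gradient alignment
    with the end-task validation gradient is pushed into its stage factors
    (a sketch of the relaxation above, not the paper's exact update)."""
    for (d, t, r, o), g in zip(objectives, task_grads):
        align = beta * sum(gi * vi for gi, vi in zip(g, val_grad))
        factors['D'][d] += align
        factors['T'][t] += align
        factors['R'][r] += align
        factors['O'][o] += align

# Sub-sample k << N objectives and take one weight-update step.
sampled = [(0, 0, 0, 0), (0, 1, 2, 1), (1, 2, 1, 1)]
val_grad = [1.0, 0.0]                      # toy end-task validation gradient
task_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
update_factors(sampled, factors, task_grads, val_grad)
w = objective_weights(sampled, factors)    # the aligned objective gains weight
```

After one step, the objective whose gradient aligns with the validation gradient is weighted above the orthogonal one, which in turn sits above the anti-aligned one.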

D MORE TRAINING DETAILS

We run each hyper-parameter configuration across 3 seeds {0, 1, 2}. We use a batch size of 128 for all end-tasks except H.PARTISAN, where we use a batch size of 64. The auxiliary task batch-size, aux bsz, is shared across the n sub-sampled auxiliary objectives according to each objective's weight. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with weight decay of 0.01 for all experiments. We copy the end-task agnostic baseline results from Dery et al. (2021b) when available. We use the hyper-parameters specified for TAPT in Gururangan et al. (2020) to train for the SE-2016-6 task. All models were trained on one of two types of GPUs: NVIDIA A100 or NVIDIA A6000. All models fit within a single GPU. We used gradient accumulation to expand the effective batch sizes used for our experiments.
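One plausible way to realize the proportional sharing of aux bsz across the sampled objectives is sketched below; the flooring-plus-remainder rounding scheme is our assumption, as the paper does not specify it:

```python
def allocate_aux_batch(aux_bsz, weights):
    """Split the shared auxiliary batch among the n sampled objectives in
    proportion to their current weights (illustrative rounding scheme)."""
    sizes = [int(aux_bsz * w) for w in weights]
    # hand any remainder left over from flooring to the top-weighted objective
    top = max(range(len(weights)), key=lambda i: weights[i])
    sizes[top] += aux_bsz - sum(sizes)
    return sizes

sizes = allocate_aux_batch(128, [0.5, 0.3, 0.2])
```

The per-objective sizes always sum to the shared budget, and more highly weighted objectives receive more examples per step.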

E GENERALIZATION ERROR BOUND FOR END-TASK AWARE TRAINING

E.1 DEFINITIONS

Here, the expectation is taken only over the internal randomness of A. We denote by ϵ_stab(A, N_e) the infimum over all ϵ for which the above holds.

Definition E.1. A function f : Ω → ℝ is L-Lipschitz if for all u, v ∈ dom(f): ∥f(u) - f(v)∥ ≤ L∥u - v∥. Note that L-Lipschitz implies bounded gradients: ∥∇f(w)∥ ≤ L for all w.

Definition E.2. A function f : Ω → ℝ is β-smooth if for all u, v ∈ Ω: ∥∇f(u) - ∇f(v)∥ ≤ β∥u - v∥.

Definition E.3. An update rule G is σ-bounded if sup_{w ∈ Ω} ∥w - G(w)∥ ≤ σ.
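These two constants can be sanity-checked numerically for a concrete one-dimensional function; the grid-based estimator below is purely illustrative and is not a proof of either property:

```python
import math

def estimate_constants(grad, lo=-5.0, hi=5.0, n=2001):
    """Grid estimates of a Lipschitz bound on f (max |f'|) and a smoothness
    bound (max secant slope of f') over [lo, hi]; illustrative only."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    gs = [grad(x) for x in xs]
    L = max(abs(g) for g in gs)
    beta = max(abs(gs[i + 1] - gs[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1))
    return L, beta

# f(w) = sqrt(1 + w^2) has f'(w) = w / sqrt(1 + w^2) and |f''(w)| <= 1,
# so both estimated constants should come out just under 1.
L, beta = estimate_constants(lambda w: w / math.sqrt(1.0 + w * w))
```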

E.2 RELEVANT THEOREMS

Theorem E.1 (Uniform Stability implies Generalization in Expectation). Let algorithm A be ϵ-uniformly stable. Then:

ϵ_gen(A, N_e) = E_{S,A}[ R_S[A(S)] - R[A(S)] ] ≤ ϵ_stab(A, N_e)

For the full proof, see Theorem 2.2 of Hardt et al. (2016).

Theorem E.2 (Stochastic Gradient Method is Stable). Assume that f_e(·; z) ∈ [0, 1] is an L-Lipschitz and β_e-smooth loss function for every z. Suppose that we run SGM for T steps with monotonically non-increasing step sizes α_t ≤ c/t. Then SGM has uniform stability with:

ϵ_sgm ≤ ( (1 + 1/q) / (N_e - 1) ) · (2cL²)^{1/(q+1)} · T^{q/(q+1)}  where  q = β_e c

We can simplify this to only the terms involving T and N_e:

ϵ_sgm ⪅ T^{1 - 1/(cβ_e + 1)} / N_e    (6)

Proof. For the full proof, see Theorem 3.12 of Hardt et al. (2016).

E.3 GROWTH FUNCTIONS

Lemma E.3 (Growth Recursion Under Dynamic Sampling). We consider the stochastic gradient update rule G : Ω → Ω with G_f(w) = w - α∇f(w). Fix an arbitrary sequence of updates G_{f₁}, ..., G_{f_T} and another G'_{f₁}, ..., G'_{f_T}. Let w₀ = w′₀ be a starting point in Ω, given f : Ω → ℝ, and define δ_t = E_{f₁...f_t ∼ P_λ} ∥w_t - w′_t∥, where w_t, w′_t are defined recursively through:

w_t = G_{f_t}(w_{t-1})    w′_t = G'_{f_t}(w′_{t-1})    t ≥ 0

Then we have the recurrence relation:

δ₀ = 0
δ_{t+1} ≤ min{ (1 + αλ₁β₁)δ_t + αλ₂(∆ + 2L),  (1 + α(λ₁β₁ + λ₂β₂))δ_t }    if G_{f_t} = G'_{f_t}
δ_{t+1} ≤ δ_t + 2σ_t    if G_{f_t}, G'_{f_t} are σ-bounded

Note that P_λ is a distribution over the support {f₁, f₂} according to probabilities {λ₁, λ₂ | λ₁ + λ₂ = 1}, and {f₁, f₂} have smoothness β₁, β₂ respectively.

Proof. The second bound on δ_{t+1} is taken directly from Lemma 2.5 of Hardt et al. (2016).
We now derive the first half of the first bound:

δ_{t+1} = E_{f₁...f_{t+1} ∼ P_λ} ∥w_{t+1} - w′_{t+1}∥
= E[ λ₁∥G_{f₁}(w_t) - G'_{f₁}(w′_t)∥ + λ₂∥G_{f₂}(w_t) - G'_{f₂}(w′_t)∥ ]
= E[ λ₁∥w_t - α∇f₁(w_t) - w′_t + α∇f₁(w′_t)∥ + λ₂∥w_t - α∇f₂(w_t) - w′_t + α∇f₂(w′_t)∥ ]
≤ E∥w_t - w′_t∥ + αE[ λ₁∥∇f₁(w′_t) - ∇f₁(w_t)∥ + λ₂∥∇f₂(w′_t) - ∇f₂(w_t)∥ ]    (triangle inequality)
(without loss of generality, let β₁ ≤ β₂)
≤ δ_t + αλ₁β₁ E∥w_t - w′_t∥ + αλ₂ E∥∇f₂(w′_t) - ∇f₂(w_t)∥    (smoothness of f₁)
= (1 + αλ₁β₁)δ_t + αλ₂ E∥∇f₂(w′_t) - ∇f₁(w′_t) + ∇f₁(w′_t) - ∇f₂(w_t)∥    (add zero)
≤ (1 + αλ₁β₁)δ_t + αλ₂( ∥∇f₂(w′_t) - ∇f₁(w′_t)∥ + ∥∇f₁(w′_t) - ∇f₂(w_t)∥ )    (triangle inequality)
≤ (1 + αλ₁β₁)δ_t + αλ₂( ∆ + ∥∇f₁(w′_t)∥ + ∥∇f₂(w_t)∥ )    (Assumption A.1; triangle inequality)
≤ (1 + αλ₁β₁)δ_t + αλ₂( ∆ + 2L )    (L-Lipschitz functions)

To obtain the second half of the first bound:

δ_{t+1} = E_{f₁...f_{t+1} ∼ P_λ} ∥w_{t+1} - w′_{t+1}∥
≤ E∥w_t - w′_t∥ + αE[ λ₁∥∇f₁(w′_t) - ∇f₁(w_t)∥ + λ₂∥∇f₂(w′_t) - ∇f₂(w_t)∥ ]    (as above)
≤ δ_t + αE[ λ₁β₁∥w_t - w′_t∥ + λ₂β₂∥w_t - w′_t∥ ]    (smoothness)
= δ_t + α(λ₁β₁ + λ₂β₂)δ_t
= (1 + α(λ₁β₁ + λ₂β₂))δ_t

E.4 STABILITY OF DYNAMIC SAMPLING

We repeat the description of our Auxiliary Learning with Dynamic Sampling setting
here for ease of access.

Setting: We are given an auxiliary objective f_a(·; z) ∈ [0, 1] with N_a samples S_a = (z₁, ..., z_{N_a}) from the distribution D_a. At any iteration of SGD, we sample a choice of either the end-task function f_e or the auxiliary objective f_a according to the probabilities λ_e, λ_a with λ_e + λ_a = 1. Given the chosen objective, we sample a data-point and perform stochastic gradient descent (SGD) based on the sampled data-point. An equivalent way to instantiate this procedure is to create S_A by drawing N′ = N_e + N_a total samples from the end-task and auxiliary task according to P_λ. S′_A is then created by replacing 1 end-task sample in S_A. At each step, a sample is drawn, z_i ∼ P_{S_A} and z′_i ∼ P_{S′_A}, and a gradient step is taken on the function corresponding to the set the sample was drawn from.

Lemma E.4 (Stability of Dynamic Sampling). We denote the outputs of T steps of SGM on S_A and S′_A, with the dynamically sampled functions, as w_T and w′_T respectively. Then, for every z_e ∈ Z_e and every t₀ > 0, under both the random update rule and the random permutation rule, we have:

E| f_e(w_T; z) - f_e(w′_T; z) | ≤ (γt₀/N′) sup_{w, z_e} f_e(w; z_e) + L·E[δ_T | δ_{t₀} = 0]

where N′ = N_e + N_a and γ = λ_e·N′/N_e.

Proof. Let E = 1[δ_{t₀} = 0] denote the event that δ_{t₀} = 0. We have:

E| f_e(w_T; z) - f_e(w′_T; z) | = P{E}·E[ |f_e(w_T; z) - f_e(w′_T; z)| | E ] + P{E^c}·E[ |f_e(w_T; z) - f_e(w′_T; z)| | E^c ]
≤ E[ |f_e(w_T; z) - f_e(w′_T; z)| | E ] + P{E^c} · sup_{w, z_e} f_e(w; z_e)    (because f_e is non-negative)
≤ L·E[ ∥w_T - w′_T∥ | E ] + P{E^c} · sup_{w, z_e} f_e(w; z_e)    (because f_e is L-Lipschitz)

We now proceed to bound P{E^c}. Let i* ∈ [N′] denote the position in which S_A, S′_A differ, and consider the random variable I assuming the index of the first time step in which SGM uses the example z_{i*}^e.
Note that when I > t₀, we must have δ_{t₀} = 0, since the two samples are identical up until this point.

P{E^c} = P{δ_{t₀} ≠ 0} ≤ P{I ≤ t₀}

Using the selection rule specified above (sample either f_e or f_a according to the probabilities λ_e, λ_a, then sample uniformly from the selected task's data), we have:

P{I ≤ t₀} = Σ_{t=1}^{t₀} P{I = t} = Σ_{t=1}^{t₀} λ_e·(1/N_e) = λ_e t₀ / N_e = γt₀ / N′

Theorem E.5 (Stability Bound on Dynamic Sampling). Assume that f_e(·; z_e), f_a(·; z_a) ∈ [0, 1] are L-Lipschitz, and β_e- and β_a-smooth loss functions respectively. Consider that we have N′ = N_e + N_a total samples, where f_e and f_a have N_e and N_a samples respectively. Suppose that we run SGM for T steps with monotonically non-increasing step sizes α_t ≤ c/t, dynamically sampling the tasks according to λ_e and λ_a. Then, with respect to f_e, SGM has uniform stability with:

ϵ_stab ≤ (1 + 1/(cβ̃)) · ( 2γL²c/(N′ - γ) + ρLc )^{1/(cβ̃ + 1)} · ( γT/N′ )^{cβ̃/(1 + cβ̃)}

where γ = λ_e N′/N_e. Given that β* = min{β_e, β_a} and λ* is the corresponding weighting of the function with smaller smoothness, the pair (β̃, ρ) can be, depending on which one gives a tighter bound:

(β̃, ρ)₁ = ( λ*β*, (1 - λ*)(∆ + 2L) )    or    (β̃, ρ)₂ = ( λ_eβ_e + λ_aβ_a, 0 )

When (β̃, ρ)₁ gives the tighter bound, we can simplify to:

ϵ_gen ⪅ ∆^{1/(1 + cλ*β*)} · ( γT/N′ )^{1 - 1/(cλ*β* + 1)}

as presented in Section 4. Let Ψ_T = E[δ_T | δ_{t₀} = 0]. We will bound Ψ_T as a function of t₀ and then minimize over t₀. Note the following:

• At any step t, with probability 1 - γ/N′, the sample selected is the same in both S_A and S′_A. In this case G_{f_t} = G'_{f_t}, and we use the corresponding expansivity rule from Lemma E.3. This gives:

δ_{t+1} ≤ min{ (1 + α_t λ*β*)δ_t + α_t(1 - λ*)(∆ + 2L),  (1 + α_t(λ_eβ_e + λ_aβ_a))δ_t }

where β* = min{β_e, β_a} and λ* is the corresponding weighting of the function with smaller smoothness.
To avoid deriving the bound independently for each case, we perform a variable substitution that captures the two cases:

δ_{t+1} ≤ (1 + α_t β̃)δ_t + α_t ρ

with β̃ ∈ { λ*β*, λ_eβ_e + λ_aβ_a } and ρ ∈ { (1 - λ*)(∆ + 2L), 0 }. We can present the final bound in terms of these variables, which can be substituted depending on the minimizer.

• With probability γ/N′, the selected example is different. Note that in this case, we know that we are evaluating the end-task function f_e. We use that both G_{f_t} and G'_{f_t} are (σ_t = α_t L)-bounded according to Lemma E.3, since f_e is L-Lipschitz.

Combining the above, we have:

Ψ_{t+1} ≤ (1 - γ/N′)[ (1 + α_t β̃)Ψ_t + α_t ρ ] + (γ/N′)[ Ψ_t + 2α_t L ]
= [ γ/N′ + (1 - γ/N′)(1 + α_t β̃) ]Ψ_t + 2γα_t L/N′ + α_t(1 - γ/N′)ρ
= [ 1 + (1 - γ/N′)α_t β̃ ]Ψ_t + α_t( 2γL + (N′ - γ)ρ )/N′
≤ [ 1 + (1 - γ/N′)(c/t)β̃ ]Ψ_t + c( 2γL + (N′ - γ)ρ )/(tN′)
≤ exp( (1 - γ/N′)(c/t)β̃ )Ψ_t + c( 2γL + (N′ - γ)ρ )/(tN′)    (using 1 + x ≤ exp(x) for all x)
= exp( (1 - γ/N′)(c/t)β̃ )Ψ_t + cρ̃/(tN′)    where ρ̃ = 2γL + (N′ - γ)ρ

Unrolling this recursion, upper bounding the sum over t with an integral, and dropping negative terms:

Ψ_T ≤ ( cρ̃ / (N′ · cβ̃(1 - γ/N′)) ) (T/t₀)^{cβ̃(1 - γ/N′)} = ( ρ̃ / (β̃(N′ - γ)) ) (T/t₀)^{cβ̃(1 - γ/N′)} ≤ ( ρ̃ / (β̃(N′ - γ)) ) (T/t₀)^{cβ̃}

Plugging this bound back into Equation 8 and using the fact that f_e ∈ [0, 1]:

E| f_e(w_T; z) - f_e(w′_T; z) | ≤ γt₀/N′ + ( Lρ̃ / (β̃(N′ - γ)) ) (T/t₀)^{cβ̃}

Letting q* = cβ̃, we can minimize the R.H.S. by setting:

t₀ = ( N′Lcρ̃ / (γ(N′ - γ)) )^{1/(q* + 1)} · T^{q*/(q* + 1)}

Plugging this in gives us:

E| f_e(w_T; z) - f_e(w′_T; z) | ≤ (1 + 1/(cβ̃)) · (1/N′) · ( N′Lc(2γL + (N′ - γ)ρ)/(N′ - γ) )^{1/(cβ̃ + 1)} · (γT)^{cβ̃/(1 + cβ̃)}
= (1 + 1/(cβ̃)) · ( 2γL²c/(N′ - γ) + ρLc )^{1/(cβ̃ + 1)} · (γT/N′)^{cβ̃/(1 + cβ̃)}

In going from the first line to the second, we consider the setting where ∆ ≫ 2L. This is a case where the auxiliary task is sufficiently different from the primary task. Some observations about this setting:

1. Smaller ∆ implies the auxiliary task is similar to the main task, and leads to improving the bound.
2. Dependence of the bound on N′ is a bit more nuanced.
Note that increasing N′ increases γ unless we reduce λ_e appropriately. Remember that λ_e is the rate at which we sample the primary task. Thus, if we add more auxiliary data but still sample the primary task at the original rate, then we are effectively ignoring the extra auxiliary data.
3. It might be tempting to assume that we can get arbitrary improvements in this setting by setting λ_e = 0. However, note that whilst this might reduce the generalization error, it means that we see none of the end-task, which would result in a large increase in the training error.
4. Note that β̃ = λ*β* ≤ β_e always holds, so we improve the dependence on T compared to Theorem E.2.
5. We can optimize λ_e, λ_a to minimize ϵ_stab^{aux-dyn}.
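Observation 1 can be checked numerically against the simplified bound. The helper below drops all constants and assumes λ* = 1 - λ_e (i.e. the auxiliary objective is the smoother one), so it is a qualitative sketch rather than the exact Theorem E.5 expression:

```python
def dyn_sampling_bound(lam_e, T, N_e, N_a, c, beta_star, delta):
    """Simplified form of the dynamic-sampling stability bound with all
    multiplicative constants dropped; lambda* = 1 - lam_e is an
    illustrative assumption, not part of the theorem statement."""
    N_prime = N_e + N_a
    gamma = lam_e * N_prime / N_e          # gamma = lambda_e * N' / N_e
    q = c * (1.0 - lam_e) * beta_star      # q = c * lambda* * beta*
    return delta ** (1.0 / (1.0 + q)) * (gamma * T / N_prime) ** (q / (q + 1.0))

# Observation 1: a smaller task distance Delta tightens the bound,
# all else held fixed.
close = dyn_sampling_bound(0.5, T=10_000, N_e=1_000, N_a=9_000,
                           c=0.1, beta_star=1.0, delta=0.5)
far = dyn_sampling_bound(0.5, T=10_000, N_e=1_000, N_a=9_000,
                         c=0.1, beta_star=1.0, delta=5.0)
```

One could likewise grid-search λ_e with this helper to explore observation 5, keeping in mind that the bound says nothing about the training error that suffers as λ_e → 0.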



Our ideas could be applied to domains like RL or computer vision (CV), where a similar dissection of existing objectives can be performed. Although our taxonomy is quite expansive, it does not consider other elements of objective creation such as the choice of model architecture, optimizer settings, etc. This paper presents a procedure for automating the creation of auxiliary objectives. We showed, theoretically, how auxiliary learning impacts end-task generalization. This resulted in prescriptions that informed the design of AANG, an algorithm that searches the space of generated objectives in an end-task-aware, multitask fashion. Our experiments show that AANG is a promising first step toward automating auxiliary learning.



Figure 1: We present the decomposition of some auxiliary objectives in NLP within our framework.

Figure 2: Our framework in the context of NLP. We decompose named objectives within our four-stage taxonomy: {D, T, R, O}. By taking the cartesian product of choices across stages, we reproduce named objectives and discover new ones.

but utilize only 1 auxiliary task. Each auxiliary objective is multi-tasked with the end-task.
1. GPT-style: We perform end-task aware training with a denoising auxiliary objective based on left-to-right causal masking for computing representations. {I = End-task data, T = No-op, R = Left-To-Right, O = Denoise Token}.
2. XLNET-style: This is a denoising auxiliary objective that uses randomized masking for computing representations. {I = End-task data, T = No-op, R = Random-factorized, O = Denoise Token}.
3. BERT-style / TAPT: Denoising inputs corrupted via BERT-Op: 80% masking and 10% random replacement. {I = End-task data, T = BERT-Op, R = Bi-directional, O = Denoise Token}. Please note that this baseline is equivalent to META-TARTAN as introduced in Dery et al. (2021b).

Figure 3: AANG effectively leverages out-of-task data. P-values (in brackets) are comparisons to (Dery et al., 2021b).

Figure 4: Learned trajectories for AANG-TD for run instances of SE-2016-6 and SCIERC tasks.

Figure 5: Top ranked objectives (averaged weight) early in training (left) and later in training (right). Early in training, objectives based on the self-supervised output O = {DENOISE} are highly weighted, but later, objectives based on supervised signal, O = {Task}, play a larger role. AANG rediscovers the common practice of training on self-supervised objectives before introducing supervised ones. It is also interesting to note that many newly generated objectives (outside of the 3 named single-task baselines in Table 2), such as simple input reconstruction, were discovered to have relevant impact on the end-tasks. This means AANG can automatically surface new, previously unexplored objectives relevant to the end-task.

Consider the following general setting. There is an unknown distribution D_e over examples from some space Z. We receive a sample S = (z₁, ..., z_{N_e}) of N_e examples drawn i.i.d. from D_e. Our goal is to find a model w, parameterizing the function f_e, with small population risk, defined as:

Definition E.4 (Population Risk). R[w] = E_{z ∼ D_e} f_e(w; z)

Definition E.5 (Empirical Risk). Since we have a finite number of samples, we can only compute the empirical risk, which is: R_S[w] = (1/N_e) Σ_i f_e(w; z_i)

Let A be a potentially randomized algorithm (such as Stochastic Gradient Descent) that is a function of S, such that w = A(S).

Definition E.6 (Generalization Error). ϵ_gen(A, N_e) = E_{S,A}[ R_S[A(S)] - R[A(S)] ]

Definition E.7 (Uniform Stability). A randomized algorithm A is ϵ-uniformly stable if for all data sets S, S′ ∈ Z with |S| = |S′| = N_e such that S and S′ differ in at most one example, we have:

sup_z E_A| f_e(A(S); z) - f_e(A(S′); z) | ≤ ϵ

Proof. Let S_A, S′_A be two samples of size N′ = N_e + N_a, as described in Lemma E.4. Consider the gradient updates G_{f₁}, ..., G_{f_T} and G'_{f₁}, ..., G'_{f_T} induced by running SGM on samples S_A and S′_A respectively. Let w_T and w′_T denote the corresponding outputs of SGM. By Lemma E.4 we have:

E| f_e(w_T; z) - f_e(w′_T; z) | ≤ (γt₀/N′) sup_{w, z_e} f_e(w; z_e) + L·E[δ_T | δ_{t₀} = 0]

β̃ ∈ { λ*β*, λ_eβ_e + λ_aβ_a }    ρ ∈ { (1 - λ*)(∆ + 2L), 0 }

We can choose whichever of the pairs for (β̃, ρ) minimizes the bound.

F DISCUSSION OF GENERALIZATION ERROR BOUNDS

F.1 WHAT DOES THEOREM E.5 SAY?

We consider the setting where β̃ = λ*β* and ρ = (1 - λ*)(∆ + 2L). Assuming the ρ term dominates, Equation 12 in this setting is:

F_{D=E_D} not only includes the pre-existing TAPT (Gururangan et al., 2020) but also unexplored objectives like task-data-dependent variants of XLNET, ELMO, etc. Auxiliary learning with F_{D=E_D} can be seen as a relaxed form of data augmentation, which we dub task augmentation. Whilst data augmentation requires applying transformations that preserve the data-point's label, task augmentation has no such restriction and thus offers greater flexibility in terms of specifying {T, R, O}. We can also reason about expanding particular stages to include new primitives. Any supervised loss can be added to the output stage, O, allowing us to potentially explore auxiliary objectives based on supervised signals like NER or POS tagging.
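The cartesian-product construction behind these search spaces is mechanical to implement. In the sketch below, the primitive names are illustrative placeholders rather than the paper's exact identifiers, and only a subset of each stage is listed:

```python
from itertools import product

# Illustrative stage primitives for the {D, T, R, O} taxonomy (a subset;
# the paper's full search space uses more primitives per stage).
STAGES = {
    'D': ['task-data', 'domain-data'],
    'T': ['no-op', 'bert-op', 'replace'],
    'R': ['bi-directional', 'left-to-right', 'random-factorized'],
    'O': ['denoise-token', 'task-label'],
}

def generate_objectives(stages):
    """Enumerate the cartesian product of stage choices: every tuple is a
    candidate auxiliary objective in the taxonomy."""
    return list(product(stages['D'], stages['T'], stages['R'], stages['O']))

space = generate_objectives(STAGES)
# 2 * 3 * 3 * 2 = 36 candidates; named objectives fall out as special cases,
# e.g. BERT-style/TAPT ~ ('task-data', 'bert-op', 'bi-directional', 'denoise-token')
```

Most tuples in the product are unnamed, previously unexplored objectives, which is exactly the surplus that the search algorithm then has to weigh against the end-task.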

AANG-TD (task data) has 24 objectives and is based on only end-task data. AANG-TD+ED (task data + external data) has 40 objectives and uses both end-task and in-domain data.

Our framework and AANG on tasks using only task data. Without using any external data, we are able to get significant average performance improvement over baselines. Superscripts are p-values from paired t-tests (best multitask versus best single-task).

The vertical black lines indicate the point of best validation set performance. AANG responds to over-fitting by down-weighting objectives based on the output loss being over-fit to. Thus, after several iterations, the objective that dominates when the validation performance is at its highest (black vertical line) gets down-weighted in response to it becoming saturated. Which tasks are important, and when are they important? We study which tasks are most highly weighted early in training (first 10% of the learning trajectory) and later in training (last 50%). We aggregate statistics across 3 datasets.



Varying number of sampled objectives per-iteration.

Specifications of datasets used to evaluate our methods.

AANG-TD specific Hyper-parameters

AANG-TD+ED specific Hyper-parameters. Learning rate for factor vectors: {W_All, W_I, W_T, W_R, W_O}.

META-TARTAN Hyper-parameters for single-task auxiliary tasks. Learning rate used for further training of RoBERTa base.

META-TARTAN introduces a dev-head which is trained sporadically during training to estimate the meta-gradients. We use the following hyper-parameters for training this dev-head: we sample 32 examples (8 examples in the case of H.PARTISAN) and perform full-batch gradient descent with a learning rate of 1e-2 for 10 iterations. The dev-head is trained with the AdamW optimizer with weight decay set to 0.1.
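A dependency-free sketch of such a dev-head fit is shown below: plain full-batch gradient descent on a logistic head over random stand-in features, whereas the paper itself uses AdamW with weight decay 0.1; all names and data here are ours:

```python
import math
import random

def fit_dev_head(feats, labels, lr=1e-2, steps=10):
    """Full-batch gradient descent on a logistic head over frozen
    features (illustrative stand-in for the dev-head fit)."""
    dim = len(feats[0])
    w = [0.0] * dim
    n = len(labels)
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(feats, labels):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j in range(dim):
                grad[j] += (p - y) * x[j] / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(32)]  # 32 sampled examples
y = [1.0 if x[0] > 0 else 0.0 for x in X]   # feature 0 carries the toy label
w = fit_dev_head(X, y)                       # head weight on feature 0 grows positive
```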

9. ACKNOWLEDGEMENTS

This work was supported in part by DSO National Laboratories, an ENS-CFM Data Science Chair, DARPA FA875017C0141, the National Science Foundation grants IIS1705121, IIS1838017, IIS2046613 and IIS-2112471, an Amazon Web Services Award, a Facebook Faculty Research Award, funding from Booz Allen Hamilton Inc., and a Block Center Grant. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of these funding agencies. We are grateful for helpful feedback from Uri Alon, Patrick Fernandes, Joon Sik Kim, Han Guo, Victor Akinwande and Clara Na.

