AANG: AUTOMATING AUXILIARY LEARNING

Abstract

Auxiliary objectives, supplementary learning signals introduced to aid learning on data-starved or highly complex end-tasks, are commonplace in machine learning. Whilst much work has been done to formulate useful auxiliary objectives, their construction is still an art which proceeds by slow and tedious hand-design. Intuition for how and when these objectives improve end-task performance has also had limited theoretical backing. In this work, we present an approach for automatically generating a suite of auxiliary objectives. We achieve this by deconstructing existing objectives within a novel unified taxonomy, identifying connections between them, and generating new ones based on the uncovered structure. Next, we theoretically formalize widely-held intuitions about how auxiliary learning improves generalization on the end-task. This leads us to a principled and efficient algorithm for searching the space of generated objectives to find those most useful to a specified end-task. With natural language processing (NLP) as our domain of study, we demonstrate that our automated auxiliary learning pipeline leads to strong improvements over competitive baselines in continued-training experiments on a pre-trained model across 5 NLP tasks.



Auxiliary objectives are constructed by hand-design and without much overarching structure, relying on the experience and intuition of a select group of researchers versed in making appropriate design choices. Unfortunately, this status quo not only creates a technical barrier to entry for exploring auxiliary objectives in new domains but also, by virtue of its incremental nature, limits the rate at which new objectives are discovered and investigated. To address these challenges, this paper presents a framework for automatically generating and utilizing a large set of candidate auxiliary objectives. Our framework is seeded by the following key observation: leading auxiliary objectives across multiple domains can be viewed as making different design decisions within a four-stage pipeline: Input Data (D) → Input Transformation (T) → Model Representation (R) → Output (O). For instance, in RL, a common auxiliary objective is to predict the environment's forward dynamics (Agrawal et al., 2016; Hafner et al., 2019). To construct this objective, the current task state-action pair (D) is corrupted (T) and then passed through the model to produce a latent representation (R), which is finally used to predict the next state (O). Similarly, in NLP, the XLNet (Yang et al., 2019) objective, which performs language modelling on a randomly factorized permutation of the input, can be written within our taxonomy as {D = Out-of-Domain, T = No-op, R = Random-Factorized, O = Next Token}. These two examples (along with others listed in Figure 1) fall within a class we term named objectives: objectives that have been previously proposed in the auxiliary learning literature.

Figure 2: Our framework in the context of NLP. Example decompositions and generated objectives:
TAPT = {Task data → BERT-Op → Bidirectional → Denoise Token}
GPT = {Out-of-domain → No-Op → Left-to-Right → Next Token}
New-Obj1 = {Task data → BERT-Op → Left-to-Right → Denoise Token}
New-Obj2 = {In-domain → No-Op → Random Factorized → TF-IDF} …
We decompose named objectives within our four-stage taxonomy: {D, T, R, O}. By taking the Cartesian product of choices across stages, we reproduce named objectives and discover new ones.

Decomposing named objectives within our taxonomy provides a unified view of the auxiliary learning landscape. From this vantage point, it becomes clear that there are many unexplored combinations of the various primitives used across named objectives. This presents a simple formula for automatically generating a large set of candidate objectives: take the Cartesian product of the design decisions across the stages (Figure 2). Using this compositional process, not only can we reconstruct existing named objectives, we can also generate new combinations. This overcomes the tedium of implementing each objective independently, since we can reuse a small set of simple stage-wise primitives. Generating a large set of objectives raises the natural question of how to efficiently select the most helpful ones for a given end-task. Instead of leaving this to practitioner intuition, we develop principled guidelines by theoretically studying the impact of auxiliary learning on a particular end-task. Specifically, using arguments based on algorithmic stability (Hardt et al., 2016; Bousquet & Elisseeff, 2002), we derive end-task generalization error bounds that depend on the choice of auxiliary task. This contributes to existing theory (Saunshi et al., 2020; Xie et al., 2021) on how auxiliary learning impacts the end-task by suggesting a new candidate mechanism: auxiliary learning results in more stable optimization end-points in the sense of Bousquet & Elisseeff (2002), which in theory improves generalization of the final model. Guided by our theory, we introduce AANG (Automating Auxiliary LearniNG), an efficient, structure-aware algorithm for adaptively combining a set of related objectives to improve generalization on a specific end-task.
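The generation step above can be sketched in a few lines. The stage options below are illustrative examples drawn from the objectives discussed in this section; the actual primitive sets used in the framework are an assumption for this sketch.

```python
# Minimal sketch: generate candidate auxiliary objectives by taking the
# Cartesian product of design choices across the four taxonomy stages.
# The stage options listed here are illustrative, not the exhaustive sets.
from itertools import product

STAGES = {
    "D": ["Task data", "In-domain", "Out-of-domain"],              # Input Data
    "T": ["No-Op", "BERT-Op"],                                      # Input Transformation
    "R": ["Bidirectional", "Left-to-Right", "Random Factorized"],   # Model Representation
    "O": ["Next Token", "Denoise Token", "TF-IDF"],                 # Output
}

def generate_objectives(stages):
    """Enumerate every combination of stage-wise primitives as a dict."""
    keys = list(stages)
    return [dict(zip(keys, combo)) for combo in product(*(stages[k] for k in keys))]

objectives = generate_objectives(STAGES)
print(len(objectives))  # 3 * 2 * 3 * 3 = 54 candidate objectives

# Named objectives fall out as particular combinations, e.g. TAPT:
tapt = {"D": "Task data", "T": "BERT-Op", "R": "Bidirectional", "O": "Denoise Token"}
assert tapt in objectives
```

Each objective only requires implementing a handful of stage-wise primitives once, after which every combination comes for free.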
AANG incorporates the following prescriptions from our theory: (i) auxiliary tasks that are more similar to the end-task are desirable; given a set of objectives, AANG learns adaptive weights to bring the composite objective closer to the end-task. (ii) In general, more auxiliary data is better; AANG maximizes the effective amount of data used in training by using all the generated objectives instead of taking task-specific subsets. To empirically validate our method for automatically generating and utilizing auxiliary objectives, we experiment on five NLP tasks. We do so in the widely used setting of continued pretraining (Gururangan et al., 2020; Aghajanyan et al., 2021; Dery et al., 2021b; Zhang et al., 2022), where a model trained with a single auxiliary objective on large-scale data is further trained on end-task-related data. Without introducing any external data or architectural modifications, variants of AANG outperform strong and widely used baselines on 4 out of 5 tasks, and AANG achieves an average improvement of 4.2% over standard fine-tuning of RoBERTa across our chosen tasks. We believe our results will spur further research into automating auxiliary learning across a variety of settings. Notably, while we focus on NLP when discussing the space of auxiliary objectives (Section 3) and in our empirical evaluation (Section 6), our theoretical results (Section 4) and AANG itself are domain-agnostic. Our ideas could be applied to domains like RL or computer vision (CV), where a similar dissection of existing objectives can be performed.
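Prescription (i) can be illustrated with a toy sketch of adaptive objective weighting: maintain one weight per generated objective, form a softmax-weighted composite loss over all of them (so all data is used, per prescription (ii)), and shift weight toward objectives whose training signal is more similar to the end-task. The specific update rule below (logits nudged by a similarity score, e.g. gradient cosine similarity) is an illustrative stand-in, not AANG's exact algorithm.

```python
# Toy sketch of adaptive weighting over a set of generated auxiliary
# objectives. All names and the update rule are hypothetical illustrations.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def combine_losses(aux_losses, logits):
    """Softmax-weighted composite of all auxiliary losses (uses every
    objective, rather than a hard task-specific subset)."""
    weights = softmax(logits)
    return sum(w * l for w, l in zip(weights, aux_losses)), weights

def update_logits(logits, similarities, lr=1.0):
    """Nudge weight toward objectives more similar to the end-task
    (similarity could be, e.g., cosine similarity of gradients)."""
    return [x + lr * s for x, s in zip(logits, similarities)]

# Usage: three objectives; the second is the most end-task aligned.
logits = [0.0, 0.0, 0.0]
for _ in range(10):
    logits = update_logits(logits, similarities=[0.1, 0.8, -0.2])
composite, weights = combine_losses([2.0, 1.5, 3.0], logits)
print(max(range(3), key=lambda i: weights[i]))  # objective 1 dominates
```

Because the weighting is soft, no objective's data is discarded outright; less helpful objectives are merely down-weighted over the course of training.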

