AN EVOLUTIONARY APPROACH TO DYNAMIC INTRODUCTION OF TASKS IN LARGE-SCALE MULTITASK LEARNING SYSTEMS

Abstract

Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Similarly, continual learning, which adds a temporal dimension to multitask learning, is often focused on the study of common pitfalls such as catastrophic forgetting rather than being studied at large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method capable of generating large-scale multitask models that support the dynamic addition of new tasks. The generated multitask models are sparsely activated and integrate a task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We demonstrate empirically that the proposed method can jointly solve and achieve competitive results on 69 public image classification tasks, for example improving the state of the art on a competitive benchmark such as CIFAR-10 by achieving a 15% relative error reduction compared to the best model trained on public data.

1. INTRODUCTION

The success of machine learning continues to grow as it finds new applications in areas as diverse as language generation (Brown et al., 2020), visual art generation (Ramesh et al., 2021), chip design (Mirhoseini et al., 2020), protein folding (Senior et al., 2020) and competitive sports (Silver et al., 2016; Vinyals et al., 2019). The vast majority of machine learning models are designed and trained for a single task and specific data modality, and are often trained starting with randomly initialized parameters, or with limited knowledge transfer from a pre-trained model. While this paradigm has shown great success, it uses a large amount of computational resources, and does not leverage knowledge transfer from many related tasks in order to achieve higher performance and efficiency. The work presented in this paper is based on the intuition that significant advances can be enabled by dynamic, continual learning approaches capable of achieving knowledge transfer across a very large number of tasks. The method described in this paper can dynamically incorporate new tasks into a large running system, can leverage pieces of a sparse multitask ML model to achieve improved quality for new tasks, and can automatically share pieces of the model among related tasks. This method can enhance quality on each task, and also improve efficiency in terms of convergence time, amount of training examples, energy consumption and human engineering effort. The ML problem framing proposed by this paper can be interpreted as a generalization and synthesis of the standard multitask and continual learning formulations, since an arbitrarily large set of tasks can be solved jointly and, over time, the set of tasks can be extended with a continuous stream of new tasks. Furthermore, it lifts the distinction between a pretraining task and a downstream task.
As new tasks are incorporated, the system searches for how to combine the knowledge and representations already present in the system with new model capacity in order to achieve high quality for each new task. Knowledge acquired and representations learned while solving a new task are available for use by any future task or for continued learning on existing tasks. We refer to the proposed method as "mutant multitask network" or µ2Net. This method generates a large-scale multitask network that jointly solves multiple tasks to achieve increased quality and efficiency for each. It can continuously expand the model by allowing the dynamic addition of new tasks. The more accumulated knowledge that is embedded into the system via learning on previous tasks, the higher the quality of the solutions for subsequent tasks. Furthermore, new tasks can be solved with increasing efficiency in terms of reducing the newly-added parameters per task. The generated multitask model is sparsely activated as it integrates a task-based routing mechanism that guarantees bounded compute cost per task as the model expands. The knowledge learned from each task is compartmentalized in components that can be reused by multiple tasks. As demonstrated through experiments, this compartmentalization technique avoids the common problems of multitask and continual learning models, such as catastrophic forgetting, gradient interference and negative transfer. The exploration of the space of task routes and the identification of the subset of prior knowledge most relevant for each task is guided by an evolutionary algorithm designed to dynamically adjust the exploration/exploitation balance without the need for manual tuning of meta-parameters. The same evolutionary logic is employed to dynamically tune the hyperparameters of the multitask model components.
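The evolutionary search over task routes can be illustrated with a minimal sketch. The representation of a route as one component index per layer, the `mutate`/`evolve` functions, and the scoring interface are illustrative assumptions made for this example, not the paper's implementation:

```python
import random

NUM_LAYERS = 4  # toy depth; each route picks one component per layer

def mutate(parent_path, num_components, mutation_prob=0.25):
    """Copy the parent's route; with some probability, point a layer at a
    freshly cloned component (represented here by a new component index).
    Simplification: component ids are only committed if the child wins."""
    child = list(parent_path)
    new_components = 0
    for layer in range(NUM_LAYERS):
        if random.random() < mutation_prob:
            child[layer] = num_components + new_components
            new_components += 1
    return child, new_components

def evolve(score_fn, generations=20, population=8):
    """Greedy evolutionary loop: mutate the current best route, keep a
    child only if it scores higher (exploitation), while mutations keep
    injecting new components (exploration)."""
    num_components = 1  # start from a single shared component per layer
    best = [0] * NUM_LAYERS
    best_score = score_fn(best)
    for _ in range(generations):
        for path, added in (mutate(best, num_components)
                            for _ in range(population)):
            score = score_fn(path)
            if score > best_score:
                best, best_score = path, score
                num_components += added  # keep only the winner's clones
    return best, best_score
```

In the real system the score would come from training and validating the candidate model on the new task; here any function of the route can stand in for it.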

2. RELATED WORK

The main novelty of the presented work is to propose and demonstrate a method that jointly provides all of the following properties: 1) the ability to continually learn from an unbounded stream of tasks; 2) automated selection and reuse of prior knowledge and representations learned for previous tasks when solving new tasks; 3) a search over the space of possible model architectures that allows the system to dynamically extend its capacity and structure without requiring random initialization; 4) automatic tuning of the hyperparameters of both the generated models and the evolutionary method, including the ability to learn schedules for each hyperparameter rather than just constant values; 5) the ability to optimize for any reward function, including non-differentiable factors; 6) immunity from catastrophic forgetting, negative transfer and gradient interference; 7) the ability to extend any pre-existing pre-trained model, including extending its architecture and automatically adapting the domain on which such a model has been trained to other domains; 8) a flexible access control list mechanism that allows expression of a variety of privacy policies, including constraining the use or influence of task-specific data to just a single task or to a subset of tasks for which data or higher-level representation use should be permitted. Different lines of research have focused on distinct subsets of the many topics addressed by the proposed method. In this section we highlight a few cornerstone publications; refer to Appendix A for an extended survey. Different methods have been proposed to achieve dynamic architecture extensions (Chen et al., 2016; Cai et al., 2018), some also focusing on an unbounded stream of tasks (Yoon et al., 2018), or achieving immunity from catastrophic forgetting (Rusu et al., 2016; Li & Hoiem, 2018; Rosenfeld & Tsotsos, 2020).
Unlike our work, these techniques rely on static heuristics and patterns to define the structural extensions, rather than a more open-ended learned search process. Neural architecture search (NAS) (Zoph & Le, 2017) methods aim to modularize the architectural components in search spaces whose exploration can be automated with reinforcement learning or evolutionary approaches (Real et al., 2019; Maziarz et al., 2018). More efficient (but structurally constrained) parameter-sharing NAS techniques (Pham et al., 2018; Liu et al., 2019a) create a connection with routing methods (Fernando et al., 2017) and sparse activation techniques, which enable the decoupling of model size growth from compute cost growth (Shazeer et al., 2017; Du et al., 2021). Evolutionary methods have also been applied with success to hyperparameter tuning (Jaderberg et al., 2017). Cross-task knowledge transfer has gained popularity, especially through transfer learning from a model pre-trained on a large amount of data for one or a few general tasks, and then fine-tuned on a small amount of data for a related downstream task. This approach has been shown to be very effective in a wide variety of problems and modalities (Devlin et al., 2019; Dosovitskiy et al., 2021). Large-scale models have recently achieved novel transfer capabilities such as few/zero-shot learning (Brown et al., 2020). More complex forms of knowledge transfer such as multitask training or continual learning often lead to interesting problems such as catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999), negative transfer (Rosenstein, 2005; Wang et al., 2019) or gradient interference.
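As a concrete illustration of how task-based sparse activation decouples model size from per-task compute, consider the following toy sketch. The component functions and routes are invented for illustration; in a real model each component would be a neural network layer rather than a scalar function:

```python
def forward(x, components, route):
    """Apply one chosen component per layer. Compute cost is len(route),
    independent of how many components are stored in the pool, so the
    pool can keep growing as tasks are added without increasing the
    per-task cost."""
    for layer, idx in enumerate(route):
        x = components[layer][idx](x)
    return x

# Toy component pool: each layer holds multiple alternative components.
components = [
    [lambda v: v + 1, lambda v: v + 10],  # layer 0 options
    [lambda v: v * 2, lambda v: v * 3],   # layer 1 options
]

route_task_a = [0, 0]  # task A uses the "base" components
route_task_b = [1, 0]  # task B owns layer 0 but shares layer 1 with task A

print(forward(1, components, route_task_a))  # (1 + 1) * 2 = 4
print(forward(1, components, route_task_b))  # (1 + 10) * 2 = 22
```

Because each task's parameters live in its own routed components, training task B's private component cannot overwrite what task A relies on, which is the intuition behind the compartmentalization-based immunity to catastrophic forgetting.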

