HYPERGRID TRANSFORMERS: TOWARDS A SINGLE MODEL FOR MULTIPLE TASKS

Abstract

Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. This approach incurs a higher overall parameter cost, along with the technical maintenance burden of serving multiple models. Learning a single multi-task model that performs well on all tasks is a challenging yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks to control its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections, which help specialize regions of the weight matrices for different tasks. To construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter-efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.

1. INTRODUCTION

Learning a single multi-task model that performs well across multiple targeted tasks is an attractive proposition for many reasons (Kaiser et al., 2017; Ruder, 2017; Clark et al., 2019b). Although extremely challenging, this paradigm enables substantial savings in overall parameter cost and eliminates the need to maintain multiple models in production (Stickland and Murray, 2019). However, achieving state-of-the-art performance on natural language understanding benchmarks today (Wang et al., 2018; 2019) still relies on fine-tuning a new model for every single task. This methodology is infeasible in many situations. Moreover, certain tasks rely on an extensive ensemble of models and/or task-specific fine-tuning tricks (Liu et al., 2019b; Devlin et al., 2018; Clark et al., 2020). The single-task fine-tuning paradigm is well established as the dominant approach (Raffel et al., 2019), since training multiple tasks with a single set of parameters can be problematic in many ways, such as catastrophic forgetting (French and Chater, 2002; McCloskey and Cohen, 1989; McClelland et al., 1995; Kirkpatrick et al., 2017) or the inherent difficulty of finding a model that is consistently good across all tasks (Clark et al., 2019b; Wu et al., 2020). Inevitable conflicts between tasks, and the difficulty of fitting all of them within a fixed set of parameters, also make multi-task co-training challenging. In this paper, we propose a new Transformer architecture, the HyperGrid Transformer, for efficient modeling of multiple tasks within a single set of model parameters. HyperGrid Transformers rely on a hypernetwork-based (Ha et al., 2016) module that performs grid-wise decomposable hyper projections. This module is task-conditioned and dynamically learns to generate the weights of the feed-forward layers of the Transformer model. Overall, our eventual goal is to dispense with task-specific fine-tuning tricks altogether.
While neural networks typically maintain the same set of parameters for all input instances, the proposed HyperGrid Transformer introduces instance-specific parameters by conditioning on the current input. This setup enables our model to learn a task-specific reparameterization for each input instance, which mitigates several challenges of multi-task co-training.
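The core mechanism can be illustrated with a minimal sketch. Under our assumptions (the exact parameterization in the paper differs in detail), a small gating grid is composed from a local task-specific vector and a global input-conditioned vector via an outer product; each cell of the grid is then expanded to gate a block of a shared feed-forward weight matrix, so that different tasks specialize different regions of the same weights. All names and dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 8, 16        # toy dimensions, chosen for illustration
grid_rows, grid_cols = 4, 4  # size of the small gating grid

# Base feed-forward weight, shared across all tasks.
W = rng.standard_normal((d_model, d_ff))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hypergrid_weight(W, local_vec, global_vec):
    """Modulate W with a block-wise gating grid.

    local_vec  -- task-specific state, shape (grid_rows,)
    global_vec -- task-agnostic state derived from the input,
                  shape (grid_cols,)
    """
    # The outer product composes the local and global states into a
    # small (grid_rows x grid_cols) grid of gating values in (0, 1).
    grid = sigmoid(np.outer(local_vec, global_vec))
    # Expand each grid cell into a block covering a region of W, so a
    # given task up- or down-weights whole regions of the weight matrix.
    gate = np.kron(grid, np.ones((W.shape[0] // grid_rows,
                                  W.shape[1] // grid_cols)))
    return W * gate

# One learned local vector per task; one global vector per input instance.
task_vec = rng.standard_normal(grid_rows)
global_vec = rng.standard_normal(grid_cols)
W_task = hypergrid_weight(W, task_vec, global_vec)  # same shape as W
```

Because the grid is far smaller than the weight matrix it gates, the per-task overhead is only `grid_rows + grid_cols` extra parameters per layer, which is what makes a single shared model across many tasks parameter-efficient.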

