HYPERGRID TRANSFORMERS: TOWARDS A SINGLE MODEL FOR MULTIPLE TASKS

Abstract

Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. This approach incurs a higher overall parameter cost, along with a higher technical maintenance burden for serving multiple models. Learning a single multi-task model that performs well on all tasks is a challenging yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections, which help specialize regions in weight matrices for different tasks. To construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter-efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.

1. INTRODUCTION

Learning a single multi-task model that performs well across multiple targeted tasks is an attractive proposition for many reasons (Kaiser et al., 2017; Ruder, 2017; Clark et al., 2019b). Although extremely challenging, this paradigm enables substantial savings in overall parameter costs and eliminates the need to maintain multiple models in production (Stickland and Murray, 2019). However, achieving state-of-the-art performance on natural language understanding benchmarks today (Wang et al., 2018; 2019) still relies on fine-tuning a new model for every single task. This methodology is infeasible in many situations. Moreover, certain tasks rely on an extensive ensemble of models and/or task-specific fine-tuning tricks (Liu et al., 2019b; Devlin et al., 2018; Clark et al., 2020). Single-task fine-tuning is well established as the dominant paradigm (Raffel et al., 2019), since training multiple tasks with a single set of parameters can be problematic in many ways, such as catastrophic forgetting (French and Chater, 2002; McCloskey and Cohen, 1989; McClelland et al., 1995; Kirkpatrick et al., 2017) or the inherent difficulty of finding a model that is consistently good on all tasks (Clark et al., 2019b; Wu et al., 2020). Inevitable task conflicts and the difficulty of fitting all tasks within a fixed parameter budget also pose challenges for multi-task co-training. In this paper, we propose a new Transformer architecture, the HyperGrid Transformer, for efficient modeling of multiple tasks within a single set of model parameters. HyperGrid Transformers rely on a hypernetwork-based (Ha et al., 2016) module that performs grid-wise decomposable hyper-projections. This module is task-conditioned and dynamically learns to generate the weights of the feed-forward layers of the Transformer model. Our eventual goal is to dispense with task-specific fine-tuning tricks altogether.
While neural networks typically maintain the same set of parameters for all input instances, the proposed HyperGrid Transformer introduces instance-specific parameters by conditioning on the current input. This setup enables our model to learn a task-specific reparameterization for each input instance, which mitigates several challenges of multi-task co-training. Our proposed HyperGrid module belongs to the family of hypernetworks (Ha et al., 2016), in which a side network is responsible for generating the weights of the main network. In our case, task-conditioned hypernetworks provide greater flexibility and expressiveness for capturing the dynamics of multiple tasks within a single set of parameters. Specifically, we introduce two novel algorithmic improvements over existing methods. First, we introduce the notion of grid-wise projections, in which we assume a structural layout over vanilla projection layers. For each input sample, our projections dynamically control the parameters in a grid-wise, region-specific manner. This structural segmentation of feed-forward layers is similar in spirit to mixture-of-experts gating (Shazeer et al., 2017), albeit at a lower level; standard hypernetworks, in contrast, only consider row-wise re-weighting of weight matrices. Second, we introduce decomposable hyper-projections. The key idea is to learn rich compositional and pairwise interactions between dual hypernetworks. We adopt a dual setup and explore different hypernetwork composition variants. In particular, we introduce a novel local-global setup, which composes a local, instance- and task-specific hyper-projection with a task-agnostic global state embedding. This setup is not only highly expressive and flexible but also serves as a factorization into local and global components. To the best of our knowledge, our work is the first to explore this setup for learning conditional parameters.
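To make the grid-wise idea concrete, the following is an illustrative sketch, not the exact formulation defined later in the paper: a local (task-conditioned) vector and a global (task-agnostic) vector are composed via an outer product into a small gating grid, which is then tiled block-wise so that each region of the weight matrix receives its own scalar gate. The function name, the choice of sigmoid gates, and the tiling scheme are our own illustrative assumptions.

```python
import numpy as np


def gridwise_gate(local_vec, global_vec, d_out, d_in):
    """Illustrative grid-wise gating (a simplification of the method
    described in the text, not its exact form).

    local_vec:  (r,) local, task-conditioned vector
    global_vec: (c,) global, task-agnostic vector
    Returns a (d_out, d_in) gating matrix in which each of the r x c
    grid cells gates a contiguous block of the weight matrix.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Outer product composes the two vectors into an (r, c) gating grid.
    grid = np.outer(sigmoid(local_vec), sigmoid(global_vec))
    r, c = grid.shape
    assert d_out % r == 0 and d_in % c == 0
    # Tile each grid cell over a (d_out//r, d_in//c) block of the weights.
    return np.kron(grid, np.ones((d_out // r, d_in // c)))
```

A gated weight matrix is then simply `W * gridwise_gate(...)`, so each block of `W` is scaled by one entry of the local-global grid rather than each row being scaled independently.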
Finally, we conduct extensive experiments on GLUE/SuperGLUE. With a single model trained to fit all GLUE and SuperGLUE tasks at once, our proposed model matches the performance of individually fine-tuned state-of-the-art Text-to-Text Transformer (T5) (Raffel et al., 2019) models. Moreover, our single model also outperforms strong baselines that employ ensembling and other task-specific tricks (Liu et al., 2019b; Clark et al., 2020).

Our Contributions

The contributions of this paper can be summarized as follows:
• We propose HyperGrid Transformers, a form of hypernetwork-based Transformer that learns task-conditioned dynamic weights for its feed-forward layers.
• The key novelty behind HyperGrid Transformers is the factorization of local and global components for weight generation. Our weight generation is grid-wise and imbues the model with a structural layout.
• We conduct extensive experiments on natural language understanding benchmarks (GLUE/SuperGLUE). With a single model, we match the state-of-the-art T5 model that is fine-tuned in a per-task fashion (multiple models), resulting in 16x parameter savings.

2. HYPERGRID TRANSFORMERS

This section outlines the key idea of the proposed algorithm.

2.1. HYPERGRID MODULE

HyperGrid operates on weight matrices (linear transformations), i.e., Y = Wx + b. In a hypernetwork formulation, instead of letting W be free weights, we generate W using a parameterized side network H(·):

Y = H(x) x + b,    (1)

where H(x) ∈ R^{d_f × d_m}. In the case where the input x is a single vector in R^{d_m}, we may parameterize H(·) with a simple feed-forward layer:

H(x) = (σ(Ux) 1^⊤) ⊙ W,    (2)

where 1 is a column vector of ones, σ is the sigmoid activation function, U ∈ R^{d_f × d_m}, and ⊙ denotes element-wise multiplication. The key idea is that the hypernetwork generates a vector σ(Ux) ∈ R^{d_f} that is broadcast (via the outer product with 1) and multiplied element-wise with W, acting as a row-wise scaling of W. We can also reduce U to R^{n × d_m}, where d_f mod n = 0, and repeat the resulting n-dimensional vector d_f/n times to recover the original dimension d_f. These methods only scale one dimension of W (e.g., row-wise). We now consider methods beyond simple row-wise weight scaling.
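The row-wise scaling of Eq. (2), together with the parameter-reduction trick of using a smaller U and repeating its output, can be sketched as follows. This is a minimal sketch for a single input vector; the function name and argument layout are our own, and the dimension convention (W ∈ R^{d_f × d_m}) follows the equations above.

```python
import numpy as np


def hypergrid_rowwise(x, W, b, U, n=None):
    """Row-wise hypernetwork scaling of W, as in Eq. (2) (a sketch).

    x: (d_m,) input vector
    W: (d_f, d_m) base weight matrix
    b: (d_f,) bias
    U: (d_f, d_m) hypernetwork weights, or (n, d_m) in the reduced case
    n: if given, the n-dim gate is repeated d_f // n times to reach d_f
       (the parameter-reduction trick described in the text).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    gate = sigmoid(U @ x)                 # sigma(Ux): shape (d_f,) or (n,)
    if n is not None:
        d_f = W.shape[0]
        assert d_f % n == 0
        gate = np.repeat(gate, d_f // n)  # tile the gate back up to d_f
    W_prime = gate[:, None] * W           # (sigma(Ux) 1^T) elementwise W
    return W_prime @ x + b                # Y = H(x) x + b
```

Note that `gate[:, None] * W` realizes the broadcast of Eq. (2): each of the d_f rows of W is scaled by one entry of σ(Ux), which is exactly the row-wise scaling that the remainder of this section generalizes to grid-wise control.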

