HYPERGRID TRANSFORMERS: TOWARDS A SINGLE MODEL FOR MULTIPLE TASKS

Abstract

Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that performs well on all tasks is a challenging yet attractive proposition. In this paper, we propose HyperGrid Transformers, a new Transformer architecture that leverages task-conditioned hypernetworks for controlling its feed-forward layers. Specifically, we propose a decomposable hypernetwork that learns grid-wise projections, which help to specialize regions in weight matrices for different tasks. To construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We conduct an extensive set of experiments on GLUE/SuperGLUE. On the SuperGLUE test set, we match the performance of the state-of-the-art while being 16 times more parameter efficient. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.

1. INTRODUCTION

Learning a single multi-task model that performs well across multiple targeted tasks is an attractive proposition for many reasons (Kaiser et al., 2017; Ruder, 2017; Clark et al., 2019b). Although extremely challenging, this paradigm enables substantial savings in overall parameter costs and eliminates the need to maintain multiple models in production (Stickland and Murray, 2019). However, achieving state-of-the-art performance on natural language understanding benchmarks today (Wang et al., 2018; 2019) still relies on fine-tuning a new model for every single task. This methodology is infeasible in many situations. Moreover, certain tasks rely on an extensive ensemble of models and/or task-specific fine-tuning tricks (Liu et al., 2019b; Devlin et al., 2018; Clark et al., 2020).
The single-task fine-tuning paradigm is well established as the dominant approach (Raffel et al., 2019), as training multiple tasks with a single set of parameters can be problematic in many ways, such as catastrophic forgetting (French and Chater, 2002; McCloskey and Cohen, 1989; McClelland et al., 1995; Kirkpatrick et al., 2017) or the inherent difficulty of finding a model that is consistently good for all tasks (Clark et al., 2019b; Wu et al., 2020). Inevitable task conflicts and the difficulty of fitting all tasks within a fixed set of parameters are also challenging problems for multi-task co-training. In this paper, we propose a new Transformer architecture, the HyperGrid Transformer, for efficient modeling of multiple tasks within a single set of model parameters. HyperGrid Transformers rely on a hypernetwork-based (Ha et al., 2016) module that performs grid-wise decomposable hyper-projections. This module is task-conditioned and dynamically learns to generate the weights of the feed-forward layers of the Transformer model. Overall, our eventual goal is to dispense with task-specific fine-tuning tricks altogether. While neural networks typically maintain the same set of parameters for all input instances, the proposed HyperGrid Transformer introduces instance-specific parameters by conditioning on the current input. This setup enables our model to learn a task-specific reparameterization for each input instance, which mitigates several challenges of multi-task co-training.



Our proposed HyperGrid module belongs to the family of hypernetworks (Ha et al., 2016), in which a side network is responsible for weight generation for the main network. In our case, task-conditioned hypernetworks provide greater flexibility and expressiveness for capturing the dynamics of multiple tasks within a single set of parameters. Specifically, we introduce two novel algorithmic improvements over existing methods. First, we introduce the notion of grid-wise projections, in which we assume a structural layout over vanilla projection layers. For each input sample, our projections dynamically control the parameters in a grid-wise, region-specific manner. The structural segmentation of feed-forward layers is similar in spirit to mixture-of-experts gating (Shazeer et al., 2017), albeit at a lower level. Standard hypernetworks, by contrast, only consider row-wise re-weighting of weight matrices. Second, we introduce decomposable hyper-projections. The key idea is to learn rich compositional and pairwise interactions between dual hypernetworks. A dual setup is adopted, in which we explore different hypernetwork composition variants. We introduce a novel local-global setup, which composes a local, instance- and task-specific hyper-projection with a task-agnostic global state embedding. This setup is not only highly expressive and flexible but also serves as a factorization into local and global components. To the best of our knowledge, our work is the first to explore this setup with respect to learning conditional parameters. Finally, we conduct extensive experiments on GLUE/SuperGLUE. Our proposed model matches the performance of individually fine-tuned state-of-the-art Text-to-Text Transformer (T5) (Raffel et al., 2019) models with a single model learned to fit all GLUE and SuperGLUE tasks at once.
Moreover, our single model also outperforms strong baselines that employ ensembling and other task-specific tricks (Liu et al., 2019b; Clark et al., 2020).

Our Contributions The contributions of this paper can be summarized as follows:

• We propose HyperGrid Transformers, a form of hypernetwork-based Transformer that learns task-conditioned dynamic weights for its feed-forward layers.

• The key novelty behind HyperGrid Transformers is the factorization of local and global components for weight generation. Our weight generation is grid-wise and imbues the model with a structural layout.

• We conduct extensive experiments on natural language understanding benchmarks (GLUE/SuperGLUE). With a single model, we match the state-of-the-art T5 model that is fine-tuned in a per-task fashion (multiple models), resulting in 16x parameter savings.

2. HYPERGRID TRANSFORMERS

This section outlines the key idea of the proposed algorithm.

2.1. HYPERGRID MODULE

HyperGrid operates on weight matrices (linear transformations), i.e., Y = WX + b, where W ∈ R^{d_m × d_f}. In a hypernetwork formulation, instead of letting W be free weights, we generate W using a parameterized side network H(·):

Y = WX + b, where W = H(X).

In the case where X is a single vector in R^{d_m}, we may parameterize H(·) with a simple feed-forward layer:

H(X) = (σ(UX) 1^T) ⊙ W,    (2)

where 1 is a column vector of ones, σ is the sigmoid activation function, and U ∈ R^{d_m × d_f}. The key idea is that the hypernetwork generates a vector UX ∈ R^{d_f} that is broadcast (multiplied by 1^T) and multiplied elementwise with W, acting as a row-wise scaling of W. We are also able to reduce U to R^{d_m × n}, where d_f mod n = 0, and repeat the resulting vector d_f/n times to recover the original dimension d_f. These methods only consider scaling one dimension of W (e.g., row-wise). We now consider methods beyond simple row-wise weight scaling.
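The row-wise scaling above can be sketched in NumPy. This is a minimal sketch: function names are our own, and the shape convention (W maps d_m inputs to d_f outputs, so the generated gate scales the d_f output rows) is an assumption made to keep the code consistent.

```python
import numpy as np

def row_wise_hypernet(x, W, U):
    """Row-wise hypernetwork scaling, H(x) = (sigmoid(U x) 1^T) * W.

    Shape convention for this sketch (an assumption, not the paper's exact notation):
      x: (d_m,) input vector
      W: (d_f, d_m) main weights (maps d_m -> d_f)
      U: (d_f, d_m) hypernetwork weights producing a d_f-dim gate
    Returns the gated weights, shape (d_f, d_m).
    """
    gate = 1.0 / (1.0 + np.exp(-(U @ x)))   # sigmoid(U x), shape (d_f,)
    return gate[:, None] * W                # broadcast: scale each row of W

def layer(x, W, U, b):
    """Y = H(x) x + b, using the generated weights in place of free weights."""
    return row_wise_hypernet(x, W, U) @ x + b
```

Because the gate is a sigmoid in (0, 1), every row of W is softly attenuated, never amplified or sign-flipped.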

2.1.1. DECOMPOSABLE GRIDWISE PROJECTIONS

In our method, we propose grid-wise projections that segment W into a grid, i.e., blocks of size (d_m/d_r) × (d_f/d_c). We generate blocks via the outer product of two projected vectors L_r X ∈ R^{d_r} and L_c X ∈ R^{d_c}. Note that d_r and d_c are user-specified hyperparameters that control the grid size for the fan-in and fan-out of the output matrix. For simplicity, we consider divisible blocks where d_r < d_m, d_m mod d_r = 0, and d_c < d_f, d_f mod d_c = 0. In this case:

H(X) = ψ(σ((L_r X)(L_c X)^T)) ⊙ W,    (3)

where (L_r X)(L_c X)^T ∈ R^{d_r × d_c} and ψ(·) is a repeat-vector function that repeats its input d_m/d_r times on the row axis and d_f/d_c times on the column axis. We name this approach the L2 variant, short for Local-Local Gridwise Projection.
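The grid-wise gate can be sketched as follows. This is a minimal sketch under our own shape assumptions: the two local projections are implemented as matrices mapping the input to the row and column factors, and the upsampling ψ is implemented with a Kronecker product against an all-ones block.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gridwise_gate_L2(x, Ur, Uc, d_rows, d_cols):
    """L2 (Local-Local) grid-wise gate over a (d_rows, d_cols) weight matrix.

    Hypothetical helper; shapes are our own convention:
      x:  (d_m,) input vector
      Ur: (d_r, d_m) local projection producing the row factor L_r x
      Uc: (d_c, d_m) local projection producing the column factor L_c x
    """
    lr = Ur @ x                              # (d_r,)
    lc = Uc @ x                              # (d_c,)
    grid = sigmoid(np.outer(lr, lc))         # (d_r, d_c) grid of gates
    d_r, d_c = grid.shape
    assert d_rows % d_r == 0 and d_cols % d_c == 0
    # psi: upsample each grid cell into a (d_rows/d_r) x (d_cols/d_c) block
    return np.kron(grid, np.ones((d_rows // d_r, d_cols // d_c)))
```

The gated weights are then `gridwise_gate_L2(x, Ur, Uc, *W.shape) * W`, so each block of W is scaled by one shared gate value.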

Composition between Local and Global Factors

The decomposable grid-wise projections learn L_r and L_c from X, which makes them conditioned on local, instance-wise information. Here, we postulate that it may be beneficial for either L_r or L_c to be a global embedding. Keeping L_c as a global, trainable embedding can be formulated as:

H(X) = ψ(σ((L_r X) G_c^T)) ⊙ W,    (4)

where G_c ∈ R^{d_c}. In this case, L_r is conditioned on X, the specific input sample, while G_c remains consistent across all input samples. Hence, the outer product is essentially a rich dyadic composition between local and global factors.

Local-Global and Global-Local It is easy to see that there are two ways of composing L and G. The above method considers the Local-Global approach, where the fan-in uses a local hypernetwork and the fan-out uses a trainable global embedding. An alternative that flips this around to a Global-Local composition is evaluated in our experiments. Namely, this can be expressed as:

H(X) = ψ(σ(G_r (L_c X)^T)) ⊙ W.    (5)
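The two compositions can be sketched side by side. As before this is a minimal sketch under assumed shapes; `Gc` and `Gr` stand in for the trainable global embeddings, and the local projections are plain matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _upsample(grid, d_rows, d_cols):
    """psi: repeat each grid cell into a block covering the full weight shape."""
    d_r, d_c = grid.shape
    return np.kron(grid, np.ones((d_rows // d_r, d_cols // d_c)))

def gridwise_gate_LG(x, Ur, Gc, d_rows, d_cols):
    """Local-Global gate: the row factor Ur @ x is generated from the input,
    while the column factor Gc is one trainable vector shared by all inputs."""
    return _upsample(sigmoid(np.outer(Ur @ x, Gc)), d_rows, d_cols)

def gridwise_gate_GL(x, Gr, Uc, d_rows, d_cols):
    """Global-Local gate: the flipped composition, with a global row factor
    Gr and an input-conditioned column factor Uc @ x."""
    return _upsample(sigmoid(np.outer(Gr, Uc @ x)), d_rows, d_cols)
```

In the LG variant, changing the input moves the whole grid along the row axis while the column pattern set by Gc stays fixed, which is the local/global factorization the text describes.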

2.2. DYNAMIC WEIGHT GENERATION WITH HYPERGRID

This section describes how we use task-conditioned hypernetworks to influence and generate the parameters for HyperGrid Transformers.

Task Conditioning The local network part of HyperGrid, L, is learned via a task embedding T ∈ R^{d_m}, which provides a task identifier and information to the hypernetwork. In HyperGrid Transformers, we first apply self-attention over the task embedding by concatenating it with the input sequence. This is described as:

T = MHSA([T; X])[0],    (6)

where [;] denotes concatenation on the length dimension and MHSA(·) is the multi-head self-attention function. The input sequence X interacts with the task embedding to generate T, which is used in our hypernetwork module.

Weight Gating The HyperGrid module is added at the position-wise feed-forward layers of the Transformer model. More specifically, we equip the second position-wise FFN, after the ReLU activation, with HyperGrid. There are several reasons for doing so. First, in most Transformer implementations, the fan-out of this layer is typically scaled up to very large values (Raffel et al., 2019); hence, influencing this layer has the greatest potential to benefit the Transformer model. Second, early experiments that applied HyperGrid to both position-wise feed-forward layers yielded no substantial improvements. Hence, we opt to only modify the second position-wise FFN of the Transformer model.
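The placement described above (gating only the second projection of the position-wise FFN) can be sketched as follows. This is a minimal sketch with assumed shapes; `gate2` stands in for a task-conditioned grid-wise gate from Section 2.1, so its construction is hypothetical plumbing here.

```python
import numpy as np

def ffn_with_hypergrid(x, W1, b1, W2, b2, gate2):
    """Position-wise FFN with HyperGrid applied to the second projection only.

    Assumed shapes for this sketch:
      x: (d_m,); W1: (d_f, d_m); W2: (d_m, d_f); gate2: (d_m, d_f).
    gate2 would come from a task-conditioned hypernetwork (not built here).
    """
    h = np.maximum(W1 @ x + b1, 0.0)   # first projection + ReLU, untouched
    return (gate2 * W2) @ h + b2       # only the second projection is gated
```

With a d_f that is 4x (or more) of d_m, as in T5, W2 dominates the FFN's parameter count, which is the stated motivation for gating this layer.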

Initialization

In our experiments, we take advantage of an existing pretrained model (Raffel et al., 2019) and add the additional HyperGrid parameters, which are fine-tuned along with the rest of the network. The overall formulation of the HyperGrid-enhanced Transformer can be written as:

Y_i = H_i(X_{i-1}, W_i) + W_i X_{i-1},

where i denotes layer i. We construct a new HyperGrid (with non-shared parameters) for each layer. Since W has been pretrained, we also add a residual connection of the original W_i X_{i-1} computation to the mixture.
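The residual formulation can be sketched in two lines. This is a minimal sketch: `gate` stands in for the (elementwise) output of the HyperGrid module, so the gated term plays the role of H_i while the second term is the untouched pretrained transformation.

```python
import numpy as np

def hypergrid_layer_with_residual(x, W, gate):
    """Y = (gate * W) x + W x: the gated path plus a residual of the original
    pretrained map W x, so a freshly initialized gate perturbs rather than
    replaces the pretrained computation."""
    return (gate * W) @ x + W @ x
```

With a zero gate the layer falls back exactly to the pretrained W x, which is why the residual is helpful when the HyperGrid parameters start untrained.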

3. EXPERIMENTAL RESULTS

We conduct experiments on GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019) which are consolidated benchmarks of multiple challenging NLP and NLU tasks. While most of the work in this area has been focused on achieving good task-specific performance, our work focuses on trying to get good performance with a single model on all GLUE and SuperGLUE tasks. Therefore, most of our experiments are conducted on a mixture of all GLUE and SuperGLUE tasks.

3.1. EXPERIMENTAL RESULTS

In this section, we discuss the empirical results of our experiments. Further details about the experimental setup can be found in the appendix. Pertaining to parameter counts (reported as θ in our experiments), this is the total parameter cost for serving the entire suite of tasks. If a single model (fine-tuned on a single task) costs X parameters, then the cost of serving N tasks in a multi-model setup is NX.
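A worked instance of this accounting, with illustrative numbers that are our own assumptions (a ~220M-parameter base model and 16 served tasks across GLUE and SuperGLUE):

```python
# Illustrative serving-cost accounting. The specific numbers below are
# assumptions for the sketch, not figures taken from a results table.
per_task_params = 220_000_000   # X: one fine-tuned model (T5-Base scale)
num_tasks = 16                  # N: tasks to serve

multi_model_cost = num_tasks * per_task_params   # N * X, one model per task
single_model_cost = per_task_params              # one shared model; HyperGrid
                                                 # adds comparatively few extras

print(multi_model_cost // single_model_cost)     # -> 16
```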

3.1.1. SINGLE MODEL VERSUS MULTIPLE MODELS

These experiments investigate different model settings, namely using (1) multiple models (MM) or (2) a single model (SM) for all tasks. The MM approach trains a fresh model on every single task and retains the best checkpoint for each task. Meanwhile, the SM model is trained on all tasks at once. We compare them against HGT, our single-model HyperGrid Transformer approach.

Details about Baselines The Single Model baseline is the identical T5 model without our HyperGrid layers, which serves as a fair comparison to our model and enables us to observe the effect of HyperGrid directly. The multi-model baseline is obtained by fine-tuning T5 directly on each task and reporting results from the best checkpoint of each task. This is a very strong baseline because the model is also allowed to choose a specific best checkpoint for each task, which the single-model approach cannot do. As for the sampling strategy in the single-model approaches (baseline and ours), we use a proportionate mix of tasks according to the number of samples in each task/dataset.

Results Table 2 reports the results of our experiments on the GLUE and SuperGLUE benchmarks. The first key observation is that the single-model (SM) approach is outperformed by the multi-model (MM) approach. This is a well-known phenomenon, and the multi-model setup is therefore generally adopted when the absolute best score is desired on every single task. The interesting result is that our approach comes rather close to the multi-model performance, while the multi-model setup requires 16x the parameters to fit both GLUE and SuperGLUE. Given that our goal is to bridge the performance gap between a single model and multiple models for multiple tasks, we find this result considerably successful. Moreover, we observe that our single-model approach outperforms the multi-model baseline by +0.6% on average across 8 tasks.
We observe similar trends as on the GLUE benchmark. Naturally, the best result comes from the multi-model setup, which involves fine-tuning a specialized model for each task. The gap between multi-model and single-model is 74.8 versus 73.6. Our approach bridges this gap, improving the single-model score to 74.5, competitive with the multi-model approach.

3.1.2. PERFORMANCE GAINS ACROSS MODEL SIZES

We investigate the gains of the proposed HyperGrid over the base model across various sizes of the T5 model. For models larger than Base, we train with 64 TPU V3 chips for 200K steps and select the best checkpoint for all tasks based on the benchmark score. Overall, on a macro-average of 18 tasks, we find an overall +1.0% improvement across all three sizes. These results show that the performance gains scale with model size.

3.1.3. EFFECT OF MODELING CHOICES

To ascertain the effectiveness of our approach, we test different architectural variants of HyperGrid Transformers, along with other architectural variants considered during model development.

Setup We evaluate all four model variants of HyperGrid Transformers (L, L2, GL, and LG). For the other architectural variants, we were mainly interested in whether a hypernetwork setup (weight gating) is better than gating on the output representations (details can be found in the supplementary material). For the base setting, we ran the baseline T5 model (single-model) four times and report the mean and standard deviation of the runs. When assessing the performance gain of our method, we compare against the max run of the baseline runs and report relative performance gains/losses against this max baseline score. We also conduct ablation studies on the four composition types on the large models [0]. In the large setting, we find that the LG model performs best, while the L and L2 variants perform similarly to the baseline.

Is Output Gating Better? The other architectural variants (OutGate) do not perform well and generally incur a net loss in performance compared to the baseline. As such, we ascertain that gating on weights is more effective than gating on the output representations. This verifies that our hypernetwork-based approach is indeed effective as opposed to simple task-conditioned output gating.

3.1.4. EFFECT OF GRID SIZE ON PERFORMANCE

We investigate the effect of grid size (fan-in and fan-out) in our proposed HyperGrid method. The purpose of this experiment is to discover how fine-grained or coarse-grained the hypernetwork should be. Notably, smaller values of d_r, d_c signify a more coarse-grained control of the Transformer weights.

Setup We searched d_r (fan-in) and d_c (fan-out) in the ranges {4, 8, 16, 32, 128, 256} and {8, 16, 32, 128, 256} respectively, and report results on GLUE + SuperGLUE (macro-average) while varying a single value. When varying d_r, we took the average over all d_c runs and plot the max, mean, and min. Likewise, when varying d_c, we took the average over all d_r runs. We report scores across the L2, LG, and GL variants of HyperGrid.

Findings pertaining to Grid Size Figures 3 to 8 illustrate performance across varied grid sizes. We observe a clear trend. For most settings, a small fan-out (d_c) works well (e.g., 32), as noted by many spikes around this region. For fan-in (d_r), a smaller value also works well. However, performance improves again at higher fan-out d_c values (e.g., > 128). Trends are quite consistent across all three variants that we considered. These results suggest that a coarser grid may be more effective, as the regions within the grid become larger.

3.1.5. COMPARISON AGAINST STATE-OF-THE-ART

Our HyperGrid Transformers achieve competitive performance with the state-of-the-art using a single model; parameter cost here refers to the total number of parameters used to fit all GLUE and SuperGLUE tasks.

Setup We run experiments with 3B and 11B HyperGrid Transformer models in the multi-task setup (GLUE + SuperGLUE) [1]. We initialize with the T5 pre-trained checkpoints. Since this is a relatively expensive run, we only train the single-model HyperGrid once, using a 32 × 128 grid with the LG (Local-Global) setting.
For GLUE, we compare against baselines reported in (Clark et al., 2020), which include models such as BERT (Devlin et al., 2018), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019b), and XLNet (Yang et al., 2019). Note that all of these models are ensembles and rely heavily on task-specific fine-tuning strategies. More details can be found in the supplementary material.

Results on Test Set

We find that our single model approach can achieve highly competitive results on both GLUE and SuperGLUE. Our model achieves a strong performance of 88.9 on SuperGLUE, matching the reported T5 results while having 16 times fewer total parameters. On GLUE, the performance gap is also small, almost matching the T5 model at 89.4 versus 89.7. The gap on the base model remains similar at 88.2 versus 88.5. On SuperGLUE, our 3B model achieves 84.7, a respectable score that matches the performance of RoBERTa ensembles fine-tuned individually with task specific tricks (Liu et al., 2019b) .

4. RELATED WORK

Multi-task learning (MTL) (Caruana, 1997) is a long-standing research problem. Learning a single unified model that does well on multiple tasks is an uphill battle given well-known problems such as catastrophic forgetting (Kirkpatrick et al., 2017). As such, learning a large number of tasks with a single set of model parameters is an extremely challenging endeavour. Moreover, the disproportionate amount of data per task is also potentially problematic (Lee et al., 2017; Pfeiffer et al., 2020), resulting in models that overfit on high-resource tasks but underfit on low-resource tasks. Early work in multi-task NLP typically considered a hierarchical taxonomy of tasks (Hashimoto et al., 2016) in which a clear hierarchy exists, such as POS → chunking → entailment. The Joint Many-Task (JMT) model explores an incremental and hierarchical paradigm for building multi-task NLP models. Similarly, (Sanh et al., 2019) proposed a hierarchical multi-task model based on the intuition of low-level and high-level tasks. Another line of recent work casts all tasks as a question answering problem (McCann et al., 2018) and uses an interpolated pointer-generator (See et al., 2017) mechanism for generating 'answers'. Exploiting task relatedness as a means of improving model quality has been frequently explored. In relatively recent work, (Liu et al., 2019a) proposed MT-DNN, a multi-task deep neural network that shares parameters between several NLP tasks. The model achieves strong performance on the GLUE benchmark. However, MT-DNN simply leverages MTL as a form of pretraining and uses task-specific models for final evaluation. The recent T5 (Text-to-Text Transfer Transformer) model (Raffel et al., 2019) frames all NLP problems as Seq2Seq (Sutskever et al., 2014) problems. However, the best results are again obtained by task-specific fine-tuning.
Orthogonal to other research efforts, (Clark et al., 2019b) proposed Born-Again Multi-task networks (BAM), a clever way to obtain a single multi-task network via knowledge distillation. (Stickland and Murray, 2019) proposed Projected Attention Layers for task-specific fine-tuning of BERT (Devlin et al., 2018). (Zaremoodi et al., 2018) proposed Adaptive Knowledge Sharing [2] for low-resource neural machine translation. Our work is related to the literature surrounding hypernetworks (Ha et al., 2016), which have been found to be useful in areas such as continual learning (von Oswald et al., 2019). Learning task-adaptive parameters to avoid catastrophic forgetting has also been a go-to strategy for continual learning (Yoon et al., 2019). Outside of the NLP domain, flexible parameter sharing approaches are also dominant strategies for learning multi-task models (Ma et al., 2018; 2019).

6. SUPPLEMENTARY MATERIAL

6.1. DATASETS

6.1.1. GLUE

The datasets in GLUE (Wang et al., 2018) include MRPC (Microsoft Research Paraphrase Corpus) (Dolan and Brockett, 2005), QQP (Quora Question Pairs) (Iyer et al., 2017), the Semantic Textual Similarity Benchmark (STSB) (Cer et al., 2017), MNLI (Multi-Genre Natural Language Inference) (Williams et al., 2018), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2005), and the Winograd Schema Challenge WNLI (Levesque et al., 2012). More details can be found at https://github.com/tensorflow/datasets/blob/master/docs/catalog/glue.md.

6.1.2. SUPERGLUE

The datasets in SuperGLUE (Wang et al., 2019) are BoolQ (Boolean Questions) (Clark et al., 2019a), CB (Commitment Bank) (De Marneffe et al., 2019), CoPA (Choice of Plausible Alternatives) (Roemmele et al., 2011), MultiRC (Multi-Sentence Reading Comprehension) (Khashabi et al., 2018), ReCoRD (Reading Comprehension with Commonsense Reasoning) (Zhang et al., 2018), RTE (Recognizing Textual Entailment) (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), WiC (Word-in-Context) (Pilehvar and Camacho-Collados, 2018), and WSC (Winograd Schema Challenge) (Levesque et al., 2012). We use Tensorflow Datasets for loading and preprocessing these datasets. More details can be found at https://github.com/tensorflow/datasets/blob/master/docs/catalog/super_glue.md.

6.2. EXPERIMENT SETTINGS

This section describes most of the hyperparameter settings for our experiments.

6.3. DATASETS AND EXPERIMENTAL SETUP

Our experiments are built upon the existing state-of-the-art model, T5. We run most of our experiments using the base T5 setting, which comprises 220M parameters. We fine-tune for a maximum of 100K steps. We initialize our models with the released pretrained checkpoints [3]. Our implementation is in Mesh Tensorflow (Shazeer et al., 2018). We consider the following setups for the baseline T5 model. First, we compare with the T5 results reported in the original paper (Raffel et al., 2019) [4]. These results are denoted T5†. Second, we compare with T5 (PTFT), which stands for pretrain-finetune. In this setup, we fine-tune a T5 model for each task individually, following common practice. Finally, we compare with T5 (MTL), which is a fair comparison with T5 without HyperGrid. In this setting, T5 is co-trained and results are reported from a single model checkpoint selected by the best overall GLUE dev score. Note that in the MTL setting, we co-train GLUE and SuperGLUE within the same model.

Experiments for Base Models

For all experiments with base models, we train models for 100K steps with a batch size of 128. We use the en mix mixture which samples each task proportionately to the number of examples in the dataset. Learning rate is a constant 0.001 with Adafactor (Shazeer and Stern, 2018) . All results for baselines are reported with scores at the last checkpoint. During fine-tuning, the embeddings are not fine-tuned. Experiments are run with 16 TPU V3 chips and are typically completed in about 8 to 10 hours.

Experiments with Large Models

We increased the training for large models to 200K steps and pick the best checkpoint for all models based on the best GLUE score. Experiment and hyperparameter settings remain identical, although we use 64 TPU V3 chips for fine-tuning, which typically takes about 12 hours to complete.

Experiments with 3B and 11B Models For these large models, we only use one or two HyperGrid configurations (32 × 128 or 32 × 256, in LG mode) for fine-tuning. We submit each model only once to the leaderboard [5]. Fine-tuning hyperparameters remain identical. We pick a single checkpoint based on the best GLUE score. The 3B model is fine-tuned using 64 TPU V3 chips, and the 11B model with 128 TPU V3 chips.

6.4. COMPARING WITH OUTPUT GATING

One of the model architecture variants we compared with is Output Gating. It can be formulated as:

Y = max(WX + b, 0) ⊙ (σ(UX) 1^T).

Compared to HyperGrid, which gates the weights of the ReLU layer, output gating directly gates the ReLU layer outputs. We can apply either the basic projection method (Equation (2)) or the grid-wise projection method with block-wise projection on the layer outputs. There are two key differences: (1) Output gating applies sigmoid gating on the ReLU layer outputs, while HyperGrid applies sigmoid gating on the weights before the ReLU function. Output gating is similar to a Mixture-of-Experts architecture that concatenates the expert outputs. (2) Based on this formulation, the full grid-based projection cannot be applied to output gating.
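The contrast between the two gating placements can be sketched directly. This is a minimal sketch with our own shape conventions; both functions use the simple row-wise gate so the only difference is whether the sigmoid acts before or after the linear map and ReLU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gating(x, W, b, U):
    """OutGate baseline: Y = relu(W x + b) * sigmoid(U x).
    The gate acts after the nonlinearity, on the output representation."""
    return np.maximum(W @ x + b, 0.0) * sigmoid(U @ x)

def weight_gating(x, W, b, U):
    """HyperGrid-style gating (row-wise variant): the sigmoid gate scales
    the weights before the linear map and the ReLU."""
    gated_W = sigmoid(U @ x)[:, None] * W
    return np.maximum(gated_W @ x + b, 0.0)
```

Because output gating multiplies an already-computed activation vector, only per-output scaling is possible, which is why the full grid over both the fan-in and fan-out of W has no analogue in that formulation.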

6.5. FURTHER ARCHITECTURAL ABLATIONS

We include more architectural ablations to supplement the results and findings of the paper.

Ablation Architectures

We run experiments for 6 different ablations. Ablations (1)-(4) concern where to apply HyperGrid: (1) applies it to the entire network (all QKV projections + both FFNs), (2) applies it to both position-wise FFNs, (3) applies it to the 1st FFN only, and (4) applies it to the 2nd FFN only. For (5) and (6), we evaluate different weighting schemes for HyperGrid: (5) projects HyperGrid to a scalar value by pooling (in essence, instead of upsampling the grid, we downsample the grid into a scalar value). As for (6), this is the setting where the row and column size of the HyperGrid is the same as the FFN.

Results

Our ablation studies show that the best place to apply HyperGrid is the 2nd FFN only. Using HyperGrid on all the layers, apart from slowing down the network, also degrades performance. This can be observed because the difference between (1) and (2) is the addition of QKV HyperGrids to the model. From (5) and (6), we also note that the Scalar HyperGrid performs poorly, worse than the baseline model. Finally, the Max-Grid HyperGrid performs reasonably well; however, it incurs a large cost because the hypernetwork is now much larger (as compared to learning small grids and upsampling).



Footnotes:

[0] Due to the increased cost of training large models, we performed only a small number of ablations on large models.

[1] Since we did not co-train with the WNLI dataset due to issues stated in (Raffel et al., 2019), we simply report T5 results on WNLI. To be fair, we ignore WNLI parameter counts for all baseline models.

[2] The authors of (Raffel et al., 2019) explored this approach but did not find it to be satisfactory.

[3] https://github.com/google-research/text-to-text-transfer-transformer.

[4] This model is not directly comparable as it used fewer pretraining steps. No dev score results on a comparable setup are reported. We report this score for the sake of completeness.

[5] Discounting submissions that turn out to be incomplete or erroneous.



Figure 2: Illustration of the proposed HyperGrid architecture.

We note that the parameter counts added by HyperGrid are relatively negligible since d_r and d_c are small. In the LG setting, the model adds d_m·d_r + d_c parameters at each layer. In the GL setting, the added parameter cost is d_r + d_f·d_c. The most expensive option is L2, where the added cost is d_m·d_r + d_f·d_c. These added parameter costs are often negligible for large Transformer models.
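Plugging illustrative dimensions into these per-layer formulas makes the scale concrete. The dimensions below are assumptions for the sketch (a T5-Base-like FFN fan-in/fan-out with the 32 × 128 grid used in the large-scale runs), not values quoted from a table.

```python
# Added parameters per layer for each HyperGrid variant, using the
# formulas from the text. Dimensions are illustrative assumptions.
d_m, d_f = 768, 3072   # FFN fan-in / fan-out (T5-Base-like)
d_r, d_c = 32, 128     # grid size

lg_cost = d_m * d_r + d_c          # Local-Global
gl_cost = d_r + d_f * d_c          # Global-Local
l2_cost = d_m * d_r + d_f * d_c    # Local-Local (most expensive)

print(lg_cost, gl_cost, l2_cost)   # 24704 393248 417792
```

Even the most expensive variant adds well under half a million parameters per layer, which is small next to the roughly d_m·d_f ≈ 2.4M parameters of the FFN projection itself.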

Figure 3: fan-in on the L2 setting.

Experimental results on GLUE dev set.

Experimental results on SuperGLUE dev set.

Effect of HyperGrid Transformers across all model sizes. HyperGrid improves single-model co-training consistently over different model sizes. The improvement on SuperGLUE is greater than on GLUE.

Ablation Study. OG stands for Output Gating.

Test set performance on GLUE (Wang et al., 2018). Models with * are large ensembles. All models are single-task fine-tuned except ours. Parameter costs are reported considering ensembles and the cost required to fit all of GLUE and SuperGLUE.

Test set performance on SuperGLUE

More results (SuperGLUE + GLUE) studies on applying HyperGrid to different parts of the Transformer.

5. CONCLUSION

We proposed HyperGrid Transformers, a new Transformer architecture that leverages Grid-wise Decomposable Hyper Projections (HyperGrid), a hypernetwork-based projection layer for task-conditioned weight generation. We learn and fit all GLUE and SuperGLUE tasks within the same set of model parameters and achieve results competitive with the same state-of-the-art model individually fine-tuned on each and every task. On GLUE/SuperGLUE, this efficient single-model method results in 16x fewer parameters.

