LEARNING WITHOUT FORGETTING: TASK AWARE MULTI-TASK LEARNING FOR MULTI-MODALITY TASKS

Abstract

Existing Multi-Task Learning (MTL) strategies, such as joint or meta-learning, focus on shared learning and leave little to no room for task-specific learning. This creates the need for a distinct shared pretraining phase followed by a task-specific finetuning phase. The finetuning phase produces a separate model for each task, where improving the performance on a particular task necessitates forgetting some of the knowledge acquired on the other tasks. Humans, on the other hand, perform task-specific learning in synergy with general domain-based learning. Inspired by these learning patterns in humans, we propose a simple yet generic task-aware framework that can be incorporated into existing MTL strategies. The proposed framework computes task-specific representations to modulate the model parameters during MTL. It thus performs both shared and task-specific learning in a single phase, resulting in a single model for all the tasks. This single model achieves significant performance gains over existing MTL strategies. For example, we train a model on Speech Translation (ST), Automatic Speech Recognition (ASR), and Machine Translation (MT) tasks using the proposed task-aware multitask learning approach. This single model achieves a BLEU score of 28.64 on ST MuST-C English-German, a WER of 11.61 on ASR TEDLium v3, and a BLEU score of 23.35 on MT WMT14 English-German. This sets a new state-of-the-art (SOTA) on the ST task while outperforming existing end-to-end ASR systems, with competitive performance on the MT task.

1. INTRODUCTION

The process of Multi-Task Learning (MTL) on a set of related tasks is inspired by the patterns displayed in human learning. It involves a pretraining phase over all the tasks, followed by a finetuning phase. During pretraining, the model tries to grasp the knowledge shared across all the tasks involved, while in the finetuning phase, task-specific learning is performed to improve performance. However, as a result of the finetuning phase, the model forgets the information about the other tasks that it learnt during pretraining. Humans, on the other hand, are less susceptible to forgetting and retain existing knowledge/skills while mastering a new task. For example, a polyglot who masters a new language learns to translate from that language without losing the ability to translate the others. Moreover, the lack of task-based flexibility and the separation into pretraining and finetuning phases cause gaps in the learning process for the following reasons:

Role Mismatch: Consider an MTL system trained to perform the Speech Translation (ST), Automatic Speech Recognition (ASR), and Machine Translation (MT) tasks. The encoder block plays a very different role in the standalone ASR, MT, and ST models, so we cannot expect a single encoder to perform well on all the tasks without any cues to identify/use task information. Moreover, the resulting discrepancy between pretraining and finetuning hampers the MTL objective.

Task Awareness: At each step of MTL, the model tries to optimize for the task at hand. For tasks like ST and ASR with the same source language, it is impossible for the model to identify the task and alter its parameters accordingly, hence necessitating a finetuning phase. A few such examples are provided in Table 1. Humans, on the other hand, grasp the task they have to perform by means of context or explicit cues.

Table 1: Issues due to the lack of task information in MTL strategies. The MTL model, unable to identify the task, produces the output corresponding to another task, either completely or partially (En: English, De: German, Ro: Romanian). Task-aware output is the output obtained from our proposed approach.

Although MTL strategies help the finetuned models perform better than models trained directly on those tasks, their applicability is limited to finding a good initialization point for the finetuning phase. Moreover, keeping a separate model for each task increases the memory requirements, which is detrimental in low-resource settings. In order to achieve the goal of jointly learning all the tasks, similar to humans, we need to perform shared learning in synergy with task-specific learning. Previous approaches such as Raffel et al. (2019) trained a joint model for a set of related text-to-text tasks by providing the task information along with the inputs during the joint learning phase. However, providing explicit task information is not always desirable; consider, e.g., the automatic multilingual speech translation task, where a seamless user experience requires the model to extract the task information implicitly. Thus, a holistic joint learning strategy requires a generic framework that learns task-specific information without any explicit supervision. In this work, we propose a generic framework that can be easily integrated into existing MTL strategies and can extract task-based characteristics. The proposed approach aligns existing MTL approaches with human learning processes by incorporating task information into the learning process and removing the issues related to forgetting. We design a modulation network for learning the task characteristics and modulating the parameters of the model during MTL. As discussed above, the task information may or may not be explicitly available during training.
Hence, we propose two different designs of the task modulation network to learn the task characteristics: one uses explicit task identities, while the other uses examples from the task as input. The model, coupled with the modulation network, jointly learns all the tasks and, at the same time, performs task-specific learning. The proposed approach tackles forgetting by keeping a single model for all the tasks, thereby avoiding the expensive finetuning phase. Having a single model for all the tasks also reduces memory requirements, improving suitability for low-resource devices. To evaluate the proposed framework, we conduct two sets of experiments. First, we include the task information during MTL on text-to-text tasks to show the effect of task information. Second, we train a model on highly confounding tasks with different modalities and end goals. Our proposed framework allows the model to learn the task characteristics without any explicit supervision, and hence to train a single model that performs well on all the tasks. The main contributions of this work are as follows:

• We propose an approach to tackle the forgetting that occurs during the finetuning phase of existing MTL strategies.

• Our model, without any finetuning, achieves superior performance on all the tasks, which alleviates the need to keep separate task-specific models.

• Our proposed framework is generic enough to be used with any MTL strategy involving tasks with multiple modalities.
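The explicit-identity variant of task modulation can be sketched as follows. This is a minimal, hypothetical instantiation (all module and parameter names are ours, not from the paper), assuming a FiLM-style scheme in which a task embedding, looked up from an explicit task identity, produces per-feature scale and shift vectors that modulate a layer's hidden states:

```python
import torch
import torch.nn as nn

class TaskModulation(nn.Module):
    """Hypothetical FiLM-style task modulation: an explicit task identity
    is embedded and mapped to per-feature scale/shift vectors that
    modulate a layer's hidden states during multi-task training."""

    def __init__(self, num_tasks: int, hidden_dim: int):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, hidden_dim)
        self.to_scale = nn.Linear(hidden_dim, hidden_dim)
        self.to_shift = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim); task_id: (batch,)
        t = self.task_embed(task_id)                  # (batch, hidden_dim)
        scale = 1.0 + self.to_scale(t).unsqueeze(1)   # centred around identity
        shift = self.to_shift(t).unsqueeze(1)
        return scale * hidden + shift                 # broadcast over seq_len

# Usage: e.g. task 0 = ST, 1 = ASR, 2 = MT (an arbitrary assignment here)
mod = TaskModulation(num_tasks=3, hidden_dim=16)
h = torch.randn(2, 5, 16)
out = mod(h, torch.tensor([0, 2]))
assert out.shape == h.shape
```

The example-based variant would replace the embedding lookup with an encoder that pools a few input examples from the task into the same conditioning vector, so that no explicit task label is needed at inference time.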

2. TASK-AWARE MULTITASK LEARNING

An overview of our proposed approach is shown in Figure 1.

