THE NEGATIVE PRETRAINING EFFECT IN SEQUENTIAL DEEP LEARNING AND THREE WAYS TO FIX IT

Anonymous authors
Paper under double-blind review

Abstract

Negative pretraining is a prominent sequential learning effect in neural networks whereby a pretrained model obtains worse generalization performance on a target task than a model trained from scratch. We conceptualize the ingredients of this problem setting and examine the negative pretraining effect experimentally through three interventions that remove it. First, acting on the learning process, raising the learning rate after pretraining can yield even better results than training directly on the target task. Second, at the task level, we intervene by discretizing the change in data distribution into many small steps from source to target task instead of "jumping" to the target task. Finally, at the model level, resetting network biases to larger values likewise removes negative pretraining effects, albeit to a smaller degree. With these intervention experiments, we aim to provide new evidence toward understanding the subtle influences that neural network training and pretraining can have on final generalization performance on a target task.

1. INTRODUCTION

Lifelong learning (Thrun & Mitchell, 1995; Silver et al., 2013) holds both the promise of benefiting from past experiences and the challenge of having to continually adapt to new problem settings. The characteristics of this intriguing sequential learning problem make it a balancing act between retaining knowledge and adapting to new experiences, referred to as the stability-plasticity dilemma (Mermillod et al., 2013). Similar to a coach or teacher, we want to understand when a sequence of tasks helps or hinders further learning and what the underlying mechanisms are that influence this process, especially given that the cost of training very large models such as GPT-3 (Brown et al., 2020) from scratch has increased considerably (Sharir et al., 2020). We coin this setting of training a model on a sequence of tasks to maximize its performance on a target task the sequential learning problem. The existing literature on sequential learning sheds some light on the characteristics of this problem setting. In some cases, following a sequence of learning tasks can lead to better results than simply training on the target task from scratch (Bengio et al., 2009). Achille et al. (2017) and Wang et al. (2019) contrast this picture by highlighting that neural networks can suffer from pretraining on some tasks. When training data is corrupted for a sufficiently long period, even switching back to clean data does not allow the model to recover its original performance (Maennel et al., 2020). Unfortunately, without a from-scratch comparison, it may not even be apparent that a model has been corrupted in this way.

Contribution:

We formalize the basic ingredients of sequential learning as following a path on a learning manifold, as depicted in Fig. 1, and provide three ways to remove the negative pretraining effect that can occur for certain task changes in neural network training, as depicted in Fig. 2. We establish experimentally, for a variety of task changes and datasets, that final performance can depend strongly on how the chosen model is trained, the sequence of training tasks, and how the model is perturbed during the training process. First, we demonstrate that increasing the learning rate after pretraining can remove negative pretraining effects. Second, we show that continuously changing from source to target task instead of "jumping" can yield significant performance gains without incurring a penalty. Third, we show that resetting only the biases for each task in the training sequence appears to decrease negative pretraining effects. Finally, we speculate on plausible explanations for the observed effects, such as qualitatively different training behavior during the "high learning rate phase" (Lewkowycz et al., 2020).
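To make the task-level intervention concrete, the following is a minimal sketch of gradually mixing source- and target-task data over a sequence of intermediate steps rather than switching abruptly. The helpers `interpolation_schedule` and `mixed_batch` are hypothetical illustrations of the general idea, not the paper's actual experimental protocol.

```python
import numpy as np

def interpolation_schedule(n_steps):
    # Mixing coefficients alpha moving from mostly source task
    # (small alpha) to the target task (alpha = 1.0) in n_steps
    # equal increments, instead of jumping directly to alpha = 1.0.
    return np.linspace(0.0, 1.0, n_steps + 1)[1:]

def mixed_batch(source_x, target_x, alpha, rng):
    # Assemble a training batch whose rows are drawn from the target
    # task with probability alpha and from the source task otherwise.
    take_target = rng.random(len(source_x)) < alpha
    return np.where(take_target[:, None], target_x, source_x)
```

At each schedule step, the model would be trained for some number of iterations on batches produced by `mixed_batch` with the current `alpha`, so that the effective data distribution drifts smoothly toward the target task.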

