THE NEGATIVE PRETRAINING EFFECT IN SEQUENTIAL DEEP LEARNING AND THREE WAYS TO FIX IT

Anonymous authors
Paper under double-blind review

Abstract

Negative pretraining is a sequential learning effect in neural networks whereby a pretrained model generalizes worse on a target task than a model trained on that task from scratch. We conceptualize the ingredients of this problem setting and examine the negative pretraining effect experimentally through three interventions that remove it. First, acting on the learning process, raising the learning rate after pretraining can yield even better results than training directly on the target task. Second, at the task level, we intervene by discretizing the change in data distribution from source to target task into many small steps instead of "jumping" to the target task. Finally, at the model level, resetting network biases to larger values likewise removes negative pretraining effects, albeit to a smaller degree. With these intervention experiments, we aim to provide new evidence to help understand the subtle influences that neural network training and pretraining can have on final generalization performance on a target task.

1. INTRODUCTION

Lifelong learning (Thrun & Mitchell, 1995; Silver et al., 2013) holds both the promise of benefiting from past experiences and the challenge of having to continually adapt to new problem settings. The characteristics of this intriguing sequential learning problem make it a balancing act between retaining knowledge and adapting to new experiences, referred to as the stability-plasticity dilemma (Mermillod et al., 2013). Similar to a coach or teacher, we want to understand when a sequence of tasks helps or hinders further learning and what the underlying mechanisms are that influence this process, especially given that the cost of training very large models from scratch, such as GPT-3 (Brown et al., 2020), has increased considerably (Sharir et al., 2020). We coin this setting of training a model on a sequence of tasks to maximize its performance on a target task the sequential learning problem. The existing literature on sequential learning sheds some light on the characteristics of this problem setting. In some cases, following a sequence of learning tasks can lead to better results than simply training on the target task from scratch (Bengio et al., 2009). Achille et al. (2017) and Wang et al. (2019) contrast this picture by highlighting that neural networks can suffer from pretraining on some tasks. When training data is corrupted for a sufficiently long period, even switching back to clean data does not allow the model to recover its original performance (Maennel et al., 2020). Unfortunately, without a from-scratch comparison, it may not even be apparent that a model has been corrupted.

Contribution:

We formalize the basic ingredients of sequential learning as following a path on a learning manifold, as depicted in Fig. 1, and provide three ways to remove the negative pretraining effect that can occur for certain task changes in neural network training, as depicted in Fig. 2. We establish experimentally, for a variety of task changes and datasets, that final performance can be highly dependent on the way the chosen model is trained, the sequence of training tasks, and how the model is perturbed during the training process. First, we demonstrate that increasing the learning rate after pretraining can remove negative pretraining effects. Second, we show that continuously changing from source to target task instead of "jumping" can reap significant performance benefits and does not incur a penalty. Third, we show that resetting only the biases for each task in the training process appears to decrease negative pretraining effects. Finally, we speculate on plausible reasons, such as qualitatively different training behavior during the "high learning rate phase" (Lewkowycz et al., 2020), that may explain the observed effects.
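The first intervention can be illustrated with a minimal, purely hypothetical sketch: plain-Python SGD on two toy quadratic losses standing in for source and target tasks, where the learning rate is raised when switching to the target task. All names and values here are illustrative and not the paper's actual experimental setup.

```python
def sgd(w, grad_fn, lr, steps):
    """Plain SGD: repeatedly step the parameter against the gradient."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w

# Toy stand-ins for source and target tasks (hypothetical; the paper
# studies image-classification task changes, not 1-d quadratics).
grad_source = lambda w: 2.0 * (w - 1.0)   # source loss minimized at w = 1
grad_target = lambda w: 2.0 * (w + 2.0)   # target loss minimized at w = -2

w0 = 0.0
w_pre = sgd(w0, grad_source, lr=0.01, steps=200)    # pretraining phase
w_fin = sgd(w_pre, grad_target, lr=0.1, steps=200)  # larger lr on target
```

The key point of the sketch is only structural: the learning rate passed to the second call is deliberately larger than during pretraining, mirroring the intervention on the learning process.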

2. RELATED WORK

Sequential learning involves choosing a sequence of tasks and learning processes and following this learning path to improve performance on one or more target tasks. This strongly resembles the objectives of lifelong learning (Thrun & Mitchell, 1995), continual learning (Lesort et al., 2019), and meta-learning (Thrun & Pratt, 2012; Schaul & Schmidhuber, 2010; Hochreiter et al., 2001), and connects to many time-dependent learning aspects of neural networks. Some approaches are geared towards task similarity (et al., 2014; Ritter et al., 2018) and others towards dissimilarity (Farquhar & Gal, 2018; Nguyen et al., 2019). Self-paced learning (Jiang et al., 2015) extends curriculum learning by letting the model itself influence the sequence of tasks (Graves et al., 2017). The order of tasks, the curriculum, is generally an input hand-designed by human experimenters and often based on heuristics to guide the scheduling (e.g., it may be easier to make an agent jump over small gaps before moving to larger ones (Heess et al., 2017)). Meta-learning (Schaul & Schmidhuber, 2010; Vilalta & Drissi, 2002) is similarly related, focusing on speeding up or enhancing performance on a target task by pretraining on other tasks to first learn to learn.



Figure 1: In the context of the negative pretraining effect, e.g. when pretraining on blurred images before training on unblurred images (Achille et al., 2017), we explore the effects of interventions on "learning paths", shown as the blue path on the learning manifold M, on the generalization performance of neural networks. A point on the learning manifold is defined as a learning task τ := (p(x, y), L), composed of a data distribution over inputs and labels and a loss function, and a learning process ω, which determines how a model f is adapted given a task τ. Tasks can vary by changing any part of this definition, such as when the data distribution is changed along a blurring nuisance, as depicted above with cat images. Learning processes can be adapted by changing their update equation, e.g. via learning rate changes η. The learning path T determines the model path from the initial model f_0 to the final trained model f_final. Via three interventions in the learning path we show that varying neural network learning paths can result in the removal of negative pretraining effects.
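The formalization in the caption can be mirrored directly in code. The following is a minimal sketch under our own naming assumptions (the classes `LearningTask` and `LearningProcess` are hypothetical and only encode the definitions τ := (p(x, y), L) and ω from the caption):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class LearningTask:
    """tau := (p(x, y), L): a data distribution over (x, y) plus a loss."""
    sample: Callable[[], Tuple[float, float]]   # draws one (x, y) pair from p(x, y)
    loss: Callable[[float, float], float]       # L(prediction, label)

@dataclass
class LearningProcess:
    """omega: the update rule's hyperparameters, e.g. the learning rate."""
    lr: float
    steps: int

# A learning path T is a sequence of (task, process) pairs traversed
# from the initial model f_0 to the final trained model f_final.
LearningPath = List[Tuple[LearningTask, LearningProcess]]
```

Under this representation, a task change is a new `LearningTask` entry on the path, while an intervention on the learning process (e.g. a learning rate change η) alters only the paired `LearningProcess`.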

Time-dependent learning: We are aware of several time-dependent phenomena in training neural networks, such as catastrophic forgetting (French, 1999), critical learning periods (Achille et al., 2017), and time sensitivity of regularization and data augmentation (Golatkar et al., 2019). Achille et al. (2017) show that by dropping out certain frequencies during the beginning of training, networks are unable to recover full performance, even when those frequencies are reintroduced. Similarly, Liu et al. (2019) show that first optimizing on random labels ruins performance on clean data shown indefinitely to the same network afterwards. Recent information-theoretic analysis (Shwartz-Ziv & Tishby, 2017) suggests that neural networks have distinct learning phases: first, a phase where network parameters grow in information content, before, in a second phase, self-regularizing and pruning away irrelevant information. More commonly known time-dependent or sequential heuristics are learning rate schedules (Darken & Moody, 1991), cyclical learning rates (Smith, 2017), regularization annealing such as dropout annealing (Rennie et al., 2014), student-teacher model transfer (Vicente & Caticha, 1997), and pretraining (Erhan et al., 2009).
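As a concrete instance of such a sequential heuristic, a triangular cyclical learning rate in the spirit of Smith (2017) can be sketched as follows. The function name and parameters are our own illustrative choices, not an API from any particular library:

```python
def triangular_clr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclical learning rate: ramps linearly from base_lr
    up to max_lr over half_cycle steps, then back down, repeating."""
    cycle_pos = step % (2 * half_cycle)   # position within the current cycle
    frac = cycle_pos / half_cycle         # runs from 0 to 2 over one cycle
    if frac > 1.0:
        frac = 2.0 - frac                 # descending half of the triangle
    return base_lr + (max_lr - base_lr) * frac

# Example schedule over 200 steps with a cycle length of 100 steps.
schedule = [triangular_clr(s, base_lr=0.01, max_lr=0.1, half_cycle=50)
            for s in range(200)]
```

Such schedules make the learning process ω itself time-dependent, which is exactly the lever our first intervention acts on.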

