DISTANCE-BASED REGULARISATION OF DEEP NETWORKS FOR FINE-TUNING

Abstract

We investigate approaches to regularisation during fine-tuning of deep neural networks. First, we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state-of-the-art fine-tuning competitors and penalty-based alternatives, which we show do not directly constrain the radius of the search space.

1. INTRODUCTION

The ImageNet Large Scale Visual Recognition Challenges have resulted in a number of neural network architectures that obtain high accuracy when trained on large datasets of labelled examples (He et al., 2016; Tan and Le, 2019; Russakovsky et al., 2015). Although these models have been shown to achieve excellent performance on these benchmarks, in many real-world scenarios such volumes of data are not available and one must resort to fine-tuning an existing model: taking the weights from a model trained on a large dataset to initialise the weights of a model that will be trained on a small dataset. The assumption is that the weights from the pre-trained model provide a better initialisation than randomly generated weights. Approaches to fine-tuning are typically ad hoc, requiring one to experiment with many problem-dependent tricks, and often a process that works for one problem will not work for another. Transforming fine-tuning from an art into a well-principled procedure is therefore an attractive prospect.

This paper investigates, from both a theoretical and an empirical point of view, the impact of different regularisation strategies when fine-tuning a pre-trained network for a new task. Existing fine-tuning regularisers focus on augmenting the cross-entropy loss with terms that indirectly or directly penalise the distance the fine-tuned weights move from the pre-trained values. The intuition behind this seems sensible (the closer the fine-tuned weights are to the pre-trained weights, the less information is forgotten about the source dataset), but it is not obvious how this idea should be translated into an effective algorithm. One should expect the choice of distance metric to be quite important, yet existing methods exclusively make use of Euclidean distance (Li et al., 2019; 2018) without any theoretical or empirical justification for that choice.
These methods achieve only a small improvement in performance over standard fine-tuning, and it is reasonable to expect that using a metric more suited to the weight space of neural networks would lead to greater performance. Moreover, while the use of penalty terms to regularise neural networks is well established, the impact of using penalties versus constraints as regularisers has not been well studied in the context of deep learning. In order to study the generalisation error of fine-tuned models, we derive new bounds on the empirical Rademacher complexity of neural networks based on the distance the trained weights move from their initial values. In contrast to existing theory (e.g., Neyshabur et al. (2018); Bartlett et al. (2017);
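The distinction between penalty-based and constraint-based regularisation discussed above can be made concrete with a minimal sketch. The function names, the update rule, and the `radius` hyper-parameter below are illustrative assumptions for exposition, not the paper's exact algorithm: the penalty style adds a term pulling the weights towards the pre-trained values but places no hard bound on how far they can drift, whereas the constraint style projects the weights back onto a Euclidean ball around the pre-trained weights after every update, so the distance bound holds exactly.

```python
import numpy as np

def penalty_grad(w, w0, task_grad, alpha):
    """Gradient of loss(w) + (alpha/2) * ||w - w0||^2.

    The penalty discourages, but does not bound, the distance from the
    pre-trained weights w0: a large task gradient can still move w
    arbitrarily far away.
    """
    return task_grad + alpha * (w - w0)

def project_to_ball(w, w0, radius):
    """Project w onto the Euclidean ball of the given radius centred on
    the pre-trained weights w0, enforcing ||w - w0|| <= radius exactly.
    """
    delta = w - w0
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return w
    return w0 + delta * (radius / norm)

def constrained_sgd_step(w, w0, task_grad, lr, radius):
    """One projected-gradient step: plain SGD update, then projection
    back into the constraint ball around w0."""
    return project_to_ball(w - lr * task_grad, w0, radius)
```

The projection is cheap (one norm computation per step), and unlike a penalty it directly controls the radius of the hypothesis class that the generalisation bound depends on.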

