DISTANCE-BASED REGULARISATION OF DEEP NETWORKS FOR FINE-TUNING

Abstract

We investigate approaches to regularisation during fine-tuning of deep neural networks. First, we provide a neural network generalisation bound based on Rademacher complexity that uses the distance the weights have moved from their initial values. This bound has no direct dependence on the number of weights and compares favourably to other bounds when applied to convolutional networks. Our bound is highly relevant for fine-tuning, because providing a network with a good initialisation based on transfer learning means that learning can modify the weights less, and hence achieve tighter generalisation. Inspired by this, we develop a simple yet effective fine-tuning algorithm that constrains the hypothesis class to a small sphere centred on the initial pre-trained weights, thus obtaining provably better generalisation performance than conventional transfer learning. Empirical evaluation shows that our algorithm works well, corroborating our theoretical results. It outperforms both state-of-the-art fine-tuning competitors and penalty-based alternatives that we show do not directly constrain the radius of the search space.

1. INTRODUCTION

The ImageNet Large Scale Visual Recognition Challenges have resulted in a number of neural network architectures that obtain high accuracy when trained on large datasets of labelled examples (He et al., 2016; Tan and Le, 2019; Russakovsky et al., 2015). Although these models achieve excellent performance on these benchmarks, in many real-world scenarios such volumes of data are not available and one must resort to fine-tuning an existing model: taking the weights from a model trained on a large dataset to initialise the weights of a model that will be trained on a small dataset. The assumption is that the weights from the pre-trained model provide a better initialisation than randomly generated weights. Approaches to fine-tuning are typically ad hoc, requiring one to experiment with many problem-dependent tricks, and a process that works for one problem will often not work for another. Transforming fine-tuning from an art into a well-principled procedure is therefore an attractive prospect.

This paper investigates, from both a theoretical and an empirical point of view, the impact of different regularisation strategies when fine-tuning a pre-trained network for a new task. Existing fine-tuning regularisers focus on augmenting the cross-entropy loss with terms that directly or indirectly penalise the distance the fine-tuned weights move from the pre-trained values. The intuition behind this seems sensible: the closer the fine-tuned weights are to the pre-trained weights, the less information about the source dataset is forgotten. It is not obvious, however, how this idea should be translated into an effective algorithm. One should expect the choice of distance metric to be quite important, yet existing methods exclusively use Euclidean distance (Li et al., 2019; 2018) without any theoretical or empirical justification for that choice.
These methods achieve only a small improvement in performance over standard fine-tuning, and it is reasonable to expect that a metric better suited to the weight space of neural networks would lead to greater performance. Moreover, while the use of penalty terms to regularise neural networks is well established, the impact of using penalties versus constraints as regularisers has not been well studied in the context of deep learning. In order to study the generalisation error of fine-tuned models, we derive new bounds on the empirical Rademacher complexity of neural networks based on the distance the trained weights move from their initial values. In contrast to existing theory (e.g., Neyshabur et al. (2018); Bartlett et al. (2017); Long and Sedghi (2019)), we do not resort to covering numbers or make use of distributions over models to make these arguments. By deriving two bounds that utilise different distance metrics but are proved with the same techniques, we are able to conduct a controlled theoretical comparison of which metric one should use as the basis for a fine-tuning regularisation scheme. Our findings show that a metric based on the maximum absolute row sum (MARS) matrix norm is a more suitable measure of distance in the parameter space of convolutional neural networks than Euclidean distance. Additionally, we challenge the notion that a penalty term encouraging the fine-tuned weights to lie near the pre-trained values is the best way to restrict the effective hypothesis class. We demonstrate that the equivalence of penalty and constraint methods in the case of linear models (Oneto et al., 2016) does not carry over to deep learning. As a result, using projected stochastic subgradient methods to constrain the distance the weights in each layer move from their initial settings can lead to improved performance.
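To make the MARS-based distance concrete: the MARS norm of a matrix is its maximum absolute row sum (the operator norm induced by the ℓ∞ vector norm), so the corresponding distance between fine-tuned and pre-trained weights is the MARS norm of their difference. The following is a minimal NumPy sketch of this quantity; the function names are ours, chosen for illustration, not taken from the paper:

```python
import numpy as np

def mars_norm(m):
    # Maximum absolute row sum: for each row, sum the absolute values of
    # its entries, then take the largest of those row sums.
    return np.max(np.sum(np.abs(m), axis=1))

def mars_distance(w, w0):
    # Distance of the fine-tuned weights w from the pre-trained weights w0,
    # measured with the MARS norm instead of the Euclidean (Frobenius) norm.
    return mars_norm(w - w0)

w0 = np.array([[1.0, 2.0], [0.0, 1.0]])
w  = np.array([[1.5, 2.0], [0.0, 3.0]])
# |w - w0| has row sums [0.5, 2.0], so the MARS distance is 2.0.
print(mars_distance(w, w0))
```

Note that, unlike the Frobenius norm, this measure is insensitive to the number of rows in the weight matrix: a single badly perturbed row dominates the distance, which matches the intuition that MARS-based bounds avoid direct dependence on layer width.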
We propose several regularisation methods, with the aim of both corroborating the theoretical analysis with empirical evidence and improving the performance of fine-tuned networks. One of these approaches is a penalty-based method that regularises the distance from initialisation according to the MARS-based distance metric. The other two techniques make use of efficient projection functions to enforce constraints on the Euclidean and MARS distances between the pre-trained and fine-tuned weights throughout training. The experimental results demonstrate that projected subgradient methods improve performance over penalty terms, and that the widely used Euclidean metric is typically not the best choice for measuring distances in network parameter space.
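The constraint-based approach described above can be sketched as a projected subgradient step: take an ordinary gradient step, then project the weights back onto a ball of fixed radius centred on the pre-trained weights. The sketch below shows the Euclidean case, where the projection is a simple rescaling of the displacement; it is our own minimal illustration under that assumption, not the paper's implementation (the MARS case would instead require projecting each row onto an ℓ₁ ball):

```python
import numpy as np

def project_onto_ball(w, w0, radius):
    # Euclidean projection of w onto the ball of the given radius centred
    # on the pre-trained weights w0. If w is already inside the ball it is
    # returned unchanged; otherwise the displacement w - w0 is rescaled to
    # have length exactly `radius`.
    delta = w - w0
    norm = np.linalg.norm(delta)
    if norm <= radius:
        return w
    return w0 + delta * (radius / norm)

def projected_sgd_step(w, w0, grad, lr, radius):
    # One projected (sub)gradient step: ordinary update followed by
    # projection, so the iterate never leaves the constraint set.
    return project_onto_ball(w - lr * grad, w0, radius)
```

Because the projection is applied after every step, the hypothesis class is constrained throughout training, unlike a penalty term, which only discourages (but does not prevent) large excursions from the initial weights.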

2. RELATED WORK

The idea of developing an algorithm to restrict the distance of weights from some unbiased set of reference weights has been explored in various forms to improve the performance of fine-tuned networks. Li et al. (2018) presented the L²-SP regulariser, which adds a term to the objective function that penalises the squared L² distance of the trained weights from the initial weights. This builds on an idea originally used in domain adaptation, where it was applied to linear support vector machines (Yang et al., 2007). The subsequent work of Li et al. (2019) follows the intuition that the features produced by the fine-tuned network should not differ too much from the pre-trained features. They also use Euclidean distance, but to measure the distance between feature vectors rather than weights. The idea is extended to incorporate an attention mechanism that weights the importance of each channel, and the method is implemented by adding a penalty term to the standard objective function. In contrast to these approaches, we solve a constrained optimisation problem rather than adding a penalty, and we demonstrate that the MARS norm is more effective than the Euclidean norm for measuring distance in weight space.

Many recent meta-learning algorithms also make use of the idea that keeping fine-tuned weights close to their initial values is desirable. However, these approaches typically focus on developing methods for learning the initial weights, rather than working with pre-specified initial weights. The model-agnostic meta-learning approach (Finn et al., 2017) does this by simulating few-shot learning tasks during the meta-learning phase in order to find a good set of initial weights for a neural network. Once the learned algorithm is deployed, it adapts to new few-shot learning tasks by fine-tuning the initial weights for a small number of iterations. Denevi et al.
(2018) propose a modified penalty term for ridge regression where, instead of penalising the distance of the parameters from zero, they are regularised towards a bias vector. This bias vector is learned over the course of solving least-squares problems on a collection of related tasks. Denevi et al. (2019) extend this approach to a fully online setting and a more general family of linear models.

Previous work investigating the generalisation performance of neural networks based on the distance the weights have travelled from their initial values has done so with the aim of explaining why existing methods for training models work well. Bartlett et al. (2017) present a bound, derived via covering numbers, showing that the generalisation performance of fully connected networks is controlled by the distance of the trained weights from the initial weights. Their bound makes use of a metric that scales with the number of units in the network, which means that if the result is extended to a class of simple convolutional networks then the generalisation performance will scale with the resolution of the feature maps. A similar bound can also be proved through the use of PAC-Bayesian analysis (Neyshabur et al., 2018). One can make use of different metrics and techniques for applying

