TEACHING WITH COMMENTARIES

Abstract

Effective training of deep neural networks can be challenging, and there remain many open questions on how to best learn these models. Recently developed methods to improve neural network training examine teaching: providing learned information during the training process to improve downstream model performance. In this paper, we take steps towards extending the scope of teaching. We propose a flexible teaching framework using commentaries, learned meta-information helpful for training on a particular task. We present gradient-based methods to learn commentaries, leveraging recent work on implicit differentiation for scalability. We explore diverse applications of commentaries, from weighting training examples, to parameterising label-dependent data augmentation policies, to representing attention masks that highlight salient image regions. We find that commentaries can improve training speed and/or performance, and provide insights about the dataset and training process. We also observe that commentaries generalise: they can be reused when training new models to obtain performance benefits, suggesting a use-case where commentaries are stored with a dataset and leveraged in future for improved model training.

1. INTRODUCTION

Training, regularising, and understanding complex neural network models is challenging. There remain central open questions on making training faster and more data-efficient (Kornblith et al., 2019; Raghu et al., 2019a;b), ensuring better generalisation (Zhang et al., 2016), and improving transparency and robustness (Bau et al., 2017; Madry et al., 2017). A promising approach for addressing these questions is learning to teach (Zhu, 2015), in which learned auxiliary information about a task is provided to a neural network to inform the training process and help downstream objectives. Examples include providing auxiliary training targets (Liu et al., 2019; Navon et al., 2020; Pham et al., 2020) and reweighting training examples to emphasise important datapoints (Fan et al., 2020; Jiang et al., 2018; Ren et al., 2018; Shu et al., 2019). Learning to teach approaches have achieved promising results in vision and language applications (Jiang et al., 2018; Ren et al., 2018; Shu et al., 2019; Hu et al., 2019) using a handful of specific modifications to the training process.

In this paper, we take steps towards generalising these approaches, introducing a flexible and effective learning to teach framework using commentaries. Commentaries represent learned meta-information helpful for training a model on a task, and once learned, they can be reused as is to improve the training of new models. We demonstrate that commentaries can be used for applications ranging from speeding up training to gaining insights into the neural network model. Specifically, our contributions are:

1. We formalise the notion of commentaries, providing a unified framework for learning meta-information that can be used to improve network training and examine model learning.
2. We present gradient-based methods to learn commentaries by optimising a network's validation loss, leveraging recent work in implicit differentiation to scale to larger models.
3. We use commentaries to define example-weighting curricula, a common method of teaching neural networks. We show that these learned commentaries hold interpretable insights, lead to speedups in training, and improve performance on few-shot learning tasks.
4. We define data augmentation policies with label-dependent commentaries, obtaining insights into the design of effective augmentation strategies and improved performance on benchmark tasks compared to baselines.
5. We parameterise commentaries as attention masks to find important regions of images. Through qualitative and quantitative evaluation, we show that these masks identify salient image regions and can be used to improve the robustness of neural networks to spurious background correlations.
6. We show that learned commentaries can generalise: when training new models, reusing learned commentaries can lead to learning speed and performance improvements. This suggests a use-case for commentaries: being stored with a dataset and leveraged to improve the training of new models.
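To give a flavour of a label-dependent data augmentation commentary, the sketch below uses a learned per-class augmentation strength, with additive Gaussian noise standing in for a richer augmentation policy. All names here are illustrative, not the paper's implementation:

```python
import numpy as np

def augment(x, y, phi, rng):
    """Illustrative label-dependent augmentation commentary: phi[y] is a
    (hypothetical) learned per-class augmentation strength. Additive
    Gaussian noise stands in for a richer augmentation policy."""
    return x + phi[y] * rng.standard_normal(x.shape)

# Suppose the commentary has learned that class 1 benefits from stronger
# augmentation than class 0 (values are made up for illustration):
phi = np.array([0.0, 0.3])
rng = np.random.default_rng(0)
x = np.ones((4, 4))
x_aug = augment(x, 1, phi, rng)  # class-1 inputs are perturbed more strongly
```

Because the augmentation strength enters the student's training loss differentiably, `phi` can be updated with the same gradient-based procedures described in Section 2.1.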

2. TEACHING WITH COMMENTARIES

Definition: We define a commentary to be learned information helpful for (i) training a model on a task, or (ii) providing insights on the learning process. We envision that commentaries, once learned, could be stored alongside a dataset and reused as is to assist in the training of new models. Appendix A explores a simple instantiation of commentaries for Celeb-A (Liu et al., 2015) to provide intuition about the structures that commentaries can encode.

Formally, let t(x, y, i; φ) denote a commentary that is a function of a data point x, prediction target y, and iteration of training i, with parameters φ. The commentary may be represented in a tabular fashion, with an entry for every combination of input arguments, or using a neural network that takes these arguments as inputs. The commentary is used to train a student network n(x; θ) with parameters θ.
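To make the tabular parameterisation concrete, here is a minimal sketch of a commentary t(y, i; φ) with one learnable entry per (class, training-phase) pair; the class/phase granularity and the softplus output are illustrative choices, not the paper's implementation:

```python
import numpy as np

class TabularCommentary:
    """Illustrative tabular commentary t(y, i; phi): one learnable entry per
    (target class, training phase) pair. A neural parameterisation would
    replace the table with a small network taking (x, y, i) as input."""

    def __init__(self, n_classes, n_phases, total_iters):
        self.phi = np.zeros((n_classes, n_phases))  # commentary parameters
        self.n_phases = n_phases
        self.total_iters = total_iters

    def __call__(self, y, i):
        # Bucket the training iteration into a coarse phase index.
        phase = min(i * self.n_phases // self.total_iters, self.n_phases - 1)
        # Softplus keeps the emitted value (e.g. an example weight) positive.
        return np.log1p(np.exp(self.phi[y, phase]))
```

During student training, `TabularCommentary(n_classes, n_phases, T)(y, i)` would supply, say, a per-example weight for the loss; the entries of `phi` are what the methods of Section 2.1 optimise.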

2.1. LEARNING COMMENTARIES

We now describe algorithms to learn commentaries. Throughout, we denote the training set as $\mathcal{D}_T$, the validation set as $\mathcal{D}_V$, and the loss function (e.g. cross-entropy) as $\mathcal{L}$. With $\theta$ denoting the parameters of the student network and $\phi$ denoting the commentary parameters, we let $\hat{\theta}, \hat{\phi}$ be the respective optimised parameters. We seek $\hat{\phi}$ such that the student network's validation loss $\mathcal{L}_V$ is minimised. Since the commentary is used during the training of the student network, $\mathcal{L}_V$ implicitly depends on $\phi$, enabling the use of gradient-based optimisation algorithms to find $\hat{\phi}$. By optimising the commentary parameters over the entire trajectory of student learning, we encourage the commentary to be effective when used to train new student networks. This supports the goal of the commentary being stored with the dataset and reused in future model learning.

Algorithm 2: Large-Scale Commentary Learning with Implicit Differentiation: When training the student model has a large memory footprint, backpropagating through training to obtain exact commentary parameter gradients is too memory-expensive. We therefore leverage the Implicit Function Theorem (IFT) and an efficient inverse-Hessian approximation to obtain approximate gradients, following Lorraine et al. (2020). The gradient of the validation loss w.r.t. the commentary parameters can be expressed as:

$$\frac{\partial \mathcal{L}_V}{\partial \phi} = \frac{\partial \mathcal{L}_V}{\partial \hat{\theta}} \times \frac{\partial \hat{\theta}}{\partial \phi}. \tag{3}$$

The first term on the right-hand side of equation 3 is simple to compute, but the second term is expensive. Under fixed-point and regularity assumptions on the student and commentary parameters $\left(\hat{\theta}(\phi), \hat{\phi}\right)$, the IFT allows expressing this second term $\frac{\partial \hat{\theta}}{\partial \phi}$ as the following product:

$$\frac{\partial \hat{\theta}}{\partial \phi} = -\left[\frac{\partial^2 \mathcal{L}_T}{\partial \theta \, \partial \theta^T}\right]^{-1} \times \frac{\partial^2 \mathcal{L}_T}{\partial \theta \, \partial \phi^T}\,\bigg|_{\hat{\theta}(\phi)},$$
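The IFT hypergradient can be checked on a toy problem where every term is scalar and analytic. The sketch below (all functions and the Neumann truncation length are illustrative, not the paper's implementation) approximates the inverse Hessian with a truncated Neumann series, as in the inverse-Hessian approximation referenced above:

```python
# Toy check of the IFT hypergradient with scalar student parameter theta and
# scalar commentary parameter phi. Here L_T = 0.5*(theta - phi)^2 and
# L_V = 0.5*(theta - target)^2, so the analytic answer is dL_V/dphi = phi - target.
def dLT_dtheta(theta, phi):        # training-loss gradient
    return theta - phi

def d2LT_dtheta2(theta, phi):      # Hessian of L_T w.r.t. theta
    return 1.0

def d2LT_dtheta_dphi(theta, phi):  # mixed second derivative of L_T
    return -1.0

def dLV_dtheta(theta, target=3.0): # validation-loss gradient
    return theta - target

def ift_hypergradient(theta_star, phi, neumann_steps=20, lr=0.5):
    """Approximate dL_V/dphi at the fitted student parameters theta_star,
    using a truncated Neumann series for the inverse Hessian:
    H^{-1} ~ lr * sum_k (I - lr*H)^k."""
    v = dLV_dtheta(theta_star)
    p, acc = v, 0.0
    for _ in range(neumann_steps):
        acc += lr * p                               # accumulate v @ H^{-1}
        p = p * (1.0 - lr * d2LT_dtheta2(theta_star, phi))
    # IFT: dtheta_hat/dphi = -H^{-1} @ d2L_T/(dtheta dphi)
    return -acc * d2LT_dtheta_dphi(theta_star, phi)

phi = 1.0
theta_star = phi  # minimiser of L_T for this toy problem
print(ift_hypergradient(theta_star, phi))  # ~ phi - target = -2.0
```

The Neumann series converges here because the (scaled) Hessian has spectral radius below one; in practice the number of series terms trades accuracy against compute.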



Code at https://github.com/googleinterns/commentaries



Backpropagation Through Training: When student network training has a small memory footprint, we optimise commentary parameters by iterating the following process, detailed in Algorithm 1: (1) train a student and store the computation graph during training; (2) compute the student's validation loss; (3) calculate the gradient of this loss w.r.t. the commentary parameters by backpropagating through training; (4) update commentary parameters using gradient descent.
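On a toy problem, steps (1)-(4) can be sketched end-to-end. Below, the derivative of the trained student parameter w.r.t. the commentary parameter is accumulated forward-mode by hand, which computes the same quantity that reverse-mode autodiff through the stored computation graph would; the toy losses and all names are illustrative:

```python
def train_and_hypergrad(phi, theta0=0.0, c=1.0, target=1.0, lr=0.1, steps=50):
    """Unroll student training and differentiate through it w.r.t. a scalar
    commentary parameter phi. Toy setting: phi acts as an example weight on
    the training gradient (theta - c), and L_V = 0.5*(theta - target)^2."""
    theta, dtheta_dphi = theta0, 0.0
    for _ in range(steps):
        grad = phi * (theta - c)  # commentary-weighted training gradient
        # Chain rule through the update theta <- theta - lr*phi*(theta - c),
        # using the pre-update theta:
        dtheta_dphi = (1 - lr * phi) * dtheta_dphi - lr * (theta - c)
        theta = theta - lr * grad
    dLV_dtheta = theta - target   # validation-loss gradient at the end of training
    return theta, dLV_dtheta * dtheta_dphi

theta_final, hypergrad = train_and_hypergrad(0.5)
# hypergrad < 0 here: increasing phi speeds convergence toward the target,
# so gradient descent on phi would raise the example weight.
```

A gradient-descent step on `phi` with this hypergradient is exactly step (4) of the procedure above; the memory cost grows with `steps`, which is why Algorithm 2 switches to implicit differentiation for large students.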

