A SIMPLE APPROACH TO DEFINE CURRICULA FOR TRAINING NEURAL NETWORKS

Abstract

In practice, a sequence of mini-batches generated by uniformly sampling examples from the entire dataset is used for training neural networks. Curriculum learning is a training strategy that sorts the training examples by their difficulty and gradually exposes them to the learner. In this work, we propose two novel curriculum learning algorithms and empirically show their improvements in performance with convolutional and fully-connected neural networks on multiple real image datasets. Our dynamic curriculum learning algorithm tries to reduce the distance between the network weight and an optimal weight at any training step by greedily sampling examples with gradients that are directed towards the optimal weight. The curriculum ordering determined by our dynamic algorithm achieves a training speedup of ∼45% in our experiments. We also introduce a new task-specific curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points in natural image datasets. We show that this new approach yields a mean training speedup of ∼43% in the experiments we perform. Further, we use our algorithms to investigate why curriculum learning works. Based on our study, we argue that curriculum learning removes noisy examples from the initial phases of training and gradually exposes them to the learner, acting like a regularizer that helps improve the generalization ability of the learner.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951) is a simple yet widely used algorithm for machine learning optimization. There have been many efforts to improve its performance. A number of such directions, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014), improve upon SGD by fine-tuning its learning rate, often adaptively. However, Wilson et al. (2017) have shown that the solutions found by adaptive methods generalize worse even for simple overparameterized problems. Reddi et al. (2019) introduced AMSGrad hoping to solve this issue, yet there remains a performance gap between AMSGrad and SGD in terms of the ability to generalize (Keskar & Socher, 2017). Further, Choi et al. (2019) show that more general optimizers such as Adam and RMSProp never underperform SGD when all their hyperparameters are carefully tuned. Hence, SGD remains one of the main workhorses of the ML optimization toolkit. SGD proceeds by stochastically making unbiased estimates of the gradient on the full data (Zhao & Zhang, 2015). However, this approach does not match the way humans typically learn various tasks. We learn a concept faster if we are presented the easy examples first and are then gradually exposed to examples of increasing complexity, based on a curriculum. An orthogonal extension to SGD (Weinshall & Cohen, 2018) that shows some promise in improving its performance is to choose examples according to a specific strategy driven by cognitive science: this is curriculum learning (CL) (Bengio et al., 2009), wherein the examples are shown to the learner based on a curriculum.

1.1. RELATED WORKS

Bengio et al. (2009) formalize the idea of CL in a machine learning framework where the examples are fed to the learner in an order based on their difficulty. The notion of difficulty of examples has not really been formalized, and various heuristics have been tried: Bengio et al. (2009) use manually crafted scores, self-paced learning (SPL) (Kumar et al., 2010) uses the loss values with respect to the learner's current parameters, and CL by transfer learning uses the loss values with respect to a pre-trained learner to rate the difficulty of examples. Among these works, what makes SPL particular is that it uses a dynamic CL strategy, i.e., the preferred ordering is determined dynamically over the course of the optimization. However, SPL does not really improve the performance of deep learning models, as noted in Fan et al. (2018). Similarly, Loshchilov & Hutter (2015) use a function of rank based on latest loss values for online batch selection for faster training of neural networks. Katharopoulos & Fleuret (2018) and Chang et al. (2017) perform importance sampling to reduce the variance of stochastic gradients during training. Graves et al. (2017) and Matiisen et al. (2020) propose teacher-guided automatic CL algorithms that employ various supervised measures to define dynamic curricula. The most recent works in CL show its advantages in reinforcement learning (Portelas et al., 2020; Zhang et al., 2020). The recent work by Weinshall & Cohen (2018) introduces the notion of an ideal difficulty score to rate the difficulty of examples based on their loss values with respect to the set of optimal hypotheses. They theoretically show that for linear regression, the expected rate of convergence at a training step t for an example monotonically decreases with its ideal difficulty score. This is practically validated by Hacohen & Weinshall (2019), who sort the training examples based on the performance of a network trained through transfer learning. However, there is a lack of theory to show that CL improves the performance of a completely trained network. Thus, while CL indicates that it is possible to improve the performance of SGD by a judicious ordering, both the theoretical insights as well as concrete empirical guidelines to create this ordering remain unclear. While previous CL works employ tedious methods to score the difficulty level of the examples, Hu et al. (2020) use the number of audio sources to determine difficulty for audio-visual learning, and Liu et al. (2020) use the norm of word embeddings as a difficulty measure for CL in neural machine translation. In light of these recent works, we discuss the idea of using task-specific statistical (unsupervised) measures to score examples, making it easy to perform CL on real image datasets without the aid of any pre-trained network.

1.2. OUR CONTRIBUTIONS

Our work proposes two novel algorithms for CL. We perform a thorough empirical study of our algorithms and provide further insights into why CL works. Our contributions are as follows:

• We propose a novel dynamic curriculum learning (DCL) algorithm to study the behaviour of CL. DCL is not a practical CL algorithm, since it requires knowledge of a reasonable local optimum and needs to compute the gradients of the full data after every training epoch. DCL uses the gradient information to define a curriculum that minimizes the distance between the current weight and a desired local minimum. However, this simplicity in the definition of DCL makes it easier to analyze its performance formally.

• Our DCL algorithm generates a natural ordering for training the examples. Previous CL works have demonstrated that exposing a part of the data initially and then gradually exposing the rest is a standard way to set up a curriculum. We use two variants of our DCL framework to show that it is not just the subset of data exposed to the model that matters, but also the ordering within the exposed data partition. We also analyze how DCL is able to serve as a regularizer and improve the generalization of networks.

• We contribute a simple, novel, and practical CL approach for image classification tasks that orders examples in a completely unsupervised manner using statistical measures. Our insight is that statistical measures could be associated with the difficulty of examples in real data. We empirically analyze our argument for using statistical scoring measures (especially standard deviation) over permutations of multiple datasets and networks. Additionally, we study why CL based on standard deviation scoring works, using our DCL framework.
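As an illustration of the greedy selection idea behind DCL, the sketch below scores each example by how well its SGD update direction (-g_i) aligns with the direction from the current weight towards a reference weight w*, and picks the top-scoring examples. This is a minimal sketch under assumptions, not the paper's exact algorithm: the name `select_batch` is illustrative, and access to a reasonable w* (e.g., from a previously trained run) is assumed, as noted in our first contribution.

```python
import numpy as np

def select_batch(grads, w, w_star, batch_size):
    """Greedy DCL-style selection (illustrative sketch).

    grads: (n, d) array of per-example gradients at the current weight w.
    Selects the examples whose SGD update direction (-g_i) is most aligned
    with the desired direction w_star - w, i.e. maximizes <-g_i, w_star - w>.
    """
    direction = w_star - w          # where we want the weight to move
    scores = -grads @ direction     # alignment of each example's update with it
    return np.argsort(-scores)[:batch_size]

# Toy usage: w = (1, 1), target w* = origin, three candidate gradients.
grads = np.array([[1.0, 1.0],     # -g points straight at w*: best aligned
                  [-1.0, -1.0],   # -g points away from w*: worst
                  [0.5, 0.0]])    # partially aligned
picked = select_batch(grads, np.array([1.0, 1.0]), np.zeros(2), batch_size=2)
```

Examples picked this way move the weight towards w* fastest at the current step, which is the greedy criterion the abstract describes.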

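The "expose a subset first, then gradually the rest" scheme mentioned in our second contribution is commonly realized by a pacing function that maps the training step to the number of (easiest-first) sorted examples available to the sampler. The linear schedule below is purely illustrative; the name `pacing` and the parameter `start_frac` are assumptions, and the paper's actual pacing function is not specified in this section.

```python
def pacing(step, total_steps, n_examples, start_frac=0.2):
    """Number of easiest-first sorted examples exposed at a given step.

    Linear pacing (illustrative): start with `start_frac` of the data
    and grow linearly to the full dataset by the end of training.
    """
    frac = start_frac + (1.0 - start_frac) * (step / total_steps)
    return max(1, int(frac * n_examples))

# Usage: with 1000 examples, training starts with 200 and ends with all 1000.
sizes = [pacing(s, 100, 1000) for s in (0, 50, 100)]
```

A mini-batch at step t would then be sampled from the first `pacing(t, ...)` examples of the difficulty-sorted dataset.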

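The unsupervised statistical scoring in our third contribution can be sketched as follows: each image receives a score from a simple statistic (standard deviation or histogram entropy of its pixel intensities), and the curriculum order is obtained by sorting on that score. The function names are illustrative, pixel values in [0, 1] are assumed, and treating lower-score images as easier is an assumption for this sketch; the direction actually used in our experiments is established empirically.

```python
import numpy as np

def std_score(img):
    """Difficulty score: standard deviation of the image's pixel intensities."""
    return float(np.std(img))

def entropy_score(img, bins=256):
    """Difficulty score: Shannon entropy of the pixel-intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                       # drop empty bins before taking logs
    return float(-(p * np.log2(p)).sum())

def curriculum_order(images, score=std_score):
    """Indices sorted by score, lowest (assumed easiest) first."""
    return np.argsort([score(im) for im in images])
```

Note that neither measure needs labels or a pre-trained network, which is what makes this ordering fully unsupervised.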