A SIMPLE APPROACH TO DEFINE CURRICULA FOR TRAINING NEURAL NETWORKS

Abstract

In practice, neural networks are trained on a sequence of mini-batches generated by uniformly sampling examples from the entire dataset. Curriculum learning is a training strategy that sorts the training examples by their difficulty and gradually exposes them to the learner. In this work, we propose two novel curriculum learning algorithms and empirically show that they improve the performance of convolutional and fully-connected neural networks on multiple real image datasets. Our dynamic curriculum learning algorithm tries to reduce the distance between the network weight and an optimal weight at any training step by greedily sampling examples whose gradients are directed towards the optimal weight. The curriculum ordering determined by our dynamic algorithm achieves a training speedup of ∼45% in our experiments. We also introduce a new task-specific curriculum learning strategy that uses statistical measures such as standard deviation and entropy to score the difficulty of data points in natural image datasets. This approach yields a mean training speedup of ∼43% in the experiments we perform. Further, we use our algorithms to study why curriculum learning works. Based on our study, we argue that curriculum learning removes noisy examples from the initial phases of training and gradually exposes them to the learner, acting like a regularizer that improves the generalization ability of the learner.

1. INTRODUCTION

Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951) is a simple yet widely used algorithm for machine learning optimization. There have been many efforts to improve its performance. Several of these, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014), improve upon SGD by fine-tuning its learning rate, often adaptively. However, Wilson et al. (2017) have shown that the solutions found by adaptive methods generalize worse even for simple overparameterized problems. Reddi et al. (2019) introduced AMSGrad hoping to solve this issue. Yet there is a performance gap between AMSGrad and SGD in terms of the ability to generalize (Keskar & Socher, 2017). Further, Choi et al. (2019) show that more general optimizers such as Adam and RMSProp can never underperform SGD when all their hyperparameters are carefully tuned. Hence, SGD still remains one of the main workhorses of the ML optimization toolkit. SGD proceeds by stochastically making unbiased estimates of the gradient on the full data (Zhao & Zhang, 2015). However, this approach does not match the way humans typically learn various tasks. We learn a concept faster if we are presented with the easy examples first and are then gradually exposed to examples of greater complexity, based on a curriculum. An orthogonal extension to SGD (Weinshall & Cohen, 2018) that shows some promise in improving its performance is to choose examples according to a specific strategy driven by cognitive science. This is curriculum learning (CL) (Bengio et al., 2009), wherein the examples are shown to the learner based on a curriculum.

1.1 RELATED WORKS

Bengio et al. (2009) formalize the idea of CL in a machine learning framework where the examples are fed to the learner in an order based on their difficulty. The notion of difficulty of examples

