VARIANCE BASED SAMPLE WEIGHTING FOR SUPERVISED DEEP LEARNING

Anonymous

Abstract

In the context of supervised learning of a function by a Neural Network (NN), we claim and empirically justify that a NN yields better results when the distribution of the training data set focuses on regions where the function to learn is steeper. We first translate this assumption into a mathematically workable form using Taylor expansion. Theoretical derivations then allow us to construct a methodology that we call Variance Based Sample Weighting (VBSW). VBSW uses the local variance of the labels to weight the training points. This methodology is general, scalable, cost effective, and significantly increases the performance of a large class of NNs for various classification and regression tasks on image, text and multivariate data. We highlight its benefits with experiments involving NNs ranging from shallow linear NNs to ResNet (He et al., 2015).

1. INTRODUCTION

When a Machine Learning (ML) model is used to learn from data, the distribution of the training data set can have a strong impact on its performance. More specifically, in the context of Deep Learning (DL), several works have hinted at the importance of the training set. In Bengio et al. (2009); Matiisen et al. (2017), the authors exploit the observation that a human will benefit more from easy examples than from harder ones at the beginning of a learning task. They construct a curriculum, inducing a change in the distribution of the training data set that makes a Neural Network (NN) achieve better results on an ML problem. With a different approach, Active Learning (Settles, 2012) dynamically modifies the distribution of the training data by selecting the data points that will make the training more efficient. Finally, in Reinforcement Learning, the distribution of experiences is crucial for the agent to learn efficiently. Nonetheless, the challenge of finding a good distribution is not specific to ML. Indeed, in the context of Monte Carlo estimation of a quantity of interest based on a random variable X ∼ dP_X, Importance Sampling owes its efficiency to the construction of a second random variable X̃ ∼ dP_X̃ that is used instead of X to improve the estimation of this quantity. Jie & Abbeel (2010) even make a connection between the success of likelihood ratio policy gradients and importance sampling, which shows that ML and Monte Carlo estimation, both distribution based methods, are closely linked. In this paper, we leverage the importance of the training set distribution to improve the performance of NNs in supervised DL. This task can be formalized as approximating a function f with a model f_θ parametrized by θ. We build a new distribution from the training points and their labels, based on the observation that f_θ needs more data points to approximate f in the regions where it is steep.
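The link between local steepness and the labels can be made concrete with a first-order Taylor expansion: near a point x0, f(x) ≈ f(x0) + f'(x0)(x - x0), so the variance of the labels in a small neighborhood of x0 is approximately f'(x0)² Var(X). The toy check below (our own illustration, not from the paper; the function sin(5x) and the neighborhood radius are arbitrary choices) verifies this numerically:

```python
import numpy as np

# Labels of f(x) = sin(5x) sampled in a small neighborhood of x0.
# First-order Taylor expansion predicts Var(f(X)) ~ f'(x0)^2 * Var(X),
# i.e. local label variance estimates the local steepness of f.
rng = np.random.default_rng(0)

def g(x):
    return np.sin(5 * x)

def g_prime(x):  # analytical derivative of g, known for this toy example
    return 5 * np.cos(5 * x)

x0, h = 0.0, 0.01                          # neighborhood center and radius
xs = rng.uniform(x0 - h, x0 + h, 100_000)  # points in the neighborhood
label_var = np.var(g(xs))                  # empirical variance of the labels
taylor_var = g_prime(x0) ** 2 * np.var(xs) # first-order Taylor prediction
ratio = label_var / taylor_var             # ~1 when the expansion is accurate
```

For a small enough neighborhood the ratio is close to 1, which is why local label variance can serve as a practical, derivative-free estimator of steepness.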
We use the Taylor expansion of a function f, which links the local behaviour of f to its derivatives, to build this distribution. We show that, up to a certain order and locally, variance is an estimator of the Taylor expansion. This allows constructing a methodology called Variance Based Sample Weighting (VBSW), which weights each training data point using the local variance of its neighbors' labels to simulate the new distribution. Sample weighting has already been explored in many works and for various goals: Kumar et al. (2010); Jiang et al. (2015) use it to prioritize easier samples for the training, Shrivastava et al. (2016) for hard example mining, Cui et al. (2019) to avoid class imbalance, or Liu & Tao (2016) to solve the noisy label problem. In this work, the weights' construction relies on a more general claim that can be applied to any data set and whose goal is to improve the performance of the model. VBSW is general, because it can be applied to any supervised ML problem based on a loss function. In this work we specifically investigate the application of VBSW to DL. In that case, VBSW is applied within the feature space of a pre-trained NN. We validate VBSW for DL by obtaining performance improvements on various classification and regression tasks: on text, from the GLUE benchmark (Wang et al., 2019), on images, from MNIST (LeCun & Cortes, 2010) and Cifar10 (Krizhevsky et al.), and on multivariate data, from the UCI ML repository (Dua & Graff, 2017), for several models ranging from linear regression to Bert (Devlin et al., 2019) or ResNet20 (He et al., 2015). As a highlight, we obtain up to 1.65% classification improvement on Cifar10 with a ResNet. Finally, we conduct analyses on the complementarity of VBSW with other weighting techniques and on its robustness. Contributions: (i) We present and investigate a new approach to the learning problem, based on the variations of the function f to learn. (ii) We construct a new simple, scalable, versatile and cost effective methodology, VBSW, that exploits these findings in order to boost the performance of a NN. (iii) We validate VBSW on various ML tasks.
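The weighting step can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' exact implementation: the neighborhood size k, the brute-force nearest-neighbor search, and the mean-one normalization of the weights are all choices we make here for simplicity.

```python
import numpy as np

def vbsw_weights(X, y, k=10):
    """Weight each training point by the variance of the labels of its
    k nearest neighbors (brute-force kNN; fine for small data sets)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors (incl. self)
    w = y[nn].var(axis=1)                  # local variance of neighbor labels
    return w / w.mean()                    # normalize to mean 1 (our choice)

# Toy usage: y = tanh(20x) is steep near x = 0 and flat near x = +/-1,
# so points near the center should receive larger weights.
X = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)
y = np.tanh(20 * X[:, 0])
w = vbsw_weights(X, y, k=10)
```

The resulting weights would then simply multiply a per-sample loss, e.g. `np.mean(w * (pred - y) ** 2)` for a weighted mean squared error.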

2. RELATED WORKS

Active Learning - Our methodology is based on the consideration that not every sample brings the same amount of information. Active Learning (AL) exploits the same idea, in the sense that it adapts the training strategy to the problem by introducing a data point selection rule. In (Gal et al., 2017), the authors introduce a methodology based on Bayesian Neural Networks (BNN) to adapt the selection of points used for the training. Using the variational properties of BNNs, they design a rule that focuses the training on points that will reduce the prediction uncertainty of the NN. In (Konyushkova et al., 2017), the construction of the selection rule is itself cast as an ML problem. See (Settles, 2012) for a review of more classical AL methods. While AL selects the data points, and thus modifies the distribution of the initial training data set, VBSW is applied independently of the training, so the weights cannot change throughout the training. Examples Weighting - VBSW can be categorized as an example weighting algorithm. The idea of weighting the data set has already been explored in different ways and for different purposes. While curriculum learning (Bengio et al., 2009; Matiisen et al., 2017) starts the training with easier examples, self-paced learning (Kumar et al., 2010; Jiang et al., 2015) downscales harder examples. However, some works have shown that focusing on harder examples at the beginning of the learning can accelerate it. In (Shrivastava et al., 2016), hard example mining is performed to give more importance to harder examples by selecting them first. Example weighting is used in (Cui et al., 2019) to tackle the class imbalance problem by weighting rarer, hence harder, examples. On the contrary, in (Liu & Tao, 2016) it is used to solve the noisy label problem by focusing on cleaner, hence easier, examples. All these ideas show that, depending on the application, example weighting can be performed in opposite manners.
Some works aim at going beyond this opposition by proposing more general methodologies. In (Chang et al., 2017), the authors use the variance of the prediction of each point throughout the training to decide whether it should be up-weighted or not. A meta-learning approach is proposed in (Ren et al., 2018), where the authors choose the weights through an optimization loop included in the training. VBSW stands out from the previously mentioned example weighting methods because it is built on a more general assumption: that a model simply needs more points to learn more complicated functions. Its effect is to improve the performance of a NN, without solving data set specific problems like class imbalance or noisy labels. Importance Sampling - Some of the previously mentioned methods use importance sampling to design the weights of the data set or to correct the bias induced by the sample selection (Katharopoulos & Fleuret, 2018). Here, we construct a new distribution that could be interpreted as an importance distribution. However, we weight the data points to simulate this distribution, not to correct a bias induced by it. Generalization Bounds - Generalization bounds for the learning theory of NNs have motivated many works, most of which are reviewed in (Jakubovitz et al., 2018). In Bartlett et al. (1998); Bartlett et al. (2019), the authors focus on the VC-dimension, a measure which depends on the number of parameters of NNs. Arora et al. (2018) introduce a compression approach that aims at reducing the number of model parameters to investigate its generalization capacities. PAC-Bayes analysis constructs generalization bounds using a priori and a posteriori distributions over the possible models. It is




