VARIANCE BASED SAMPLE WEIGHTING FOR SUPERVISED DEEP LEARNING

Anonymous

Abstract

In the context of supervised learning of a function by a Neural Network (NN), we claim and empirically justify that a NN yields better results when the distribution of the data set focuses on regions where the function to learn is steeper. We first translate this assumption into a mathematically workable form using Taylor expansion. Theoretical derivations then allow us to construct a methodology that we call Variance Based Sample Weighting (VBSW). VBSW uses the local variance of the labels to weight the training points. This methodology is general, scalable, cost effective, and significantly improves the performance of a large class of NNs on various classification and regression tasks involving image, text and multivariate data. We highlight its benefits with experiments involving NNs ranging from shallow linear networks to ResNet (He et al., 2016).

1. INTRODUCTION

When a Machine Learning (ML) model is used to learn from data, the distribution of the training data set can have a strong impact on its performance. More specifically, in the context of Deep Learning (DL), several works have hinted at the importance of the training set. In Bengio et al. (2009); Matiisen et al. (2017), the authors exploit the observation that, at the beginning of a learning task, a human benefits more from easy examples than from harder ones. They construct a curriculum, inducing a change in the distribution of the training data set that makes a Neural Network (NN) achieve better results on an ML problem. With a different approach, Active Learning (Settles, 2012) dynamically modifies the distribution of the training data by selecting the data points that will make the training more efficient. Finally, in Reinforcement Learning, the distribution of experiences is crucial for the agent to learn efficiently. Nonetheless, the challenge of finding a good distribution is not specific to ML. Indeed, in the context of Monte Carlo estimation of a quantity of interest based on a random variable X ∼ dP_X, Importance Sampling owes its efficiency to the construction of a second random variable, X̄ ∼ dP_X̄, that is used instead of X to improve the estimation of this quantity. Jie & Abbeel (2010) even make a connection between the success of likelihood ratio policy gradients and importance sampling, which shows that ML and Monte Carlo estimation, both distribution based methods, are closely linked. In this paper, we leverage the importance of the training set distribution to improve the performance of NNs in supervised DL. This task can be formalized as approximating a function f with a model f_θ parametrized by θ. We build a new distribution from the training points and their labels, based on the observation that f_θ needs more data points to approximate f in the regions where it is steep.
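To make the analogy concrete, the following is a minimal sketch of Importance Sampling (illustrative only, not from the paper): the expectation of a quantity of interest under a distribution P_X is estimated by sampling from a second distribution and reweighting each sample by the likelihood ratio. The specific densities and the quantity f(x) = x² are our own choices for the example.

```python
# Estimate E_p[f(X)] for p = N(3, 1) by sampling from a proposal
# q = N(0, 2) and reweighting each sample by the likelihood ratio p(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):
    # density of N(3, 1), the target distribution
    return np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2.0 * np.pi)

def q_pdf(x):
    # density of N(0, 4), the proposal distribution (std. dev. 2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2.0 * np.pi))

f = lambda x: x ** 2  # quantity of interest; true value E_p[X^2] = 1 + 3^2 = 10

x = rng.normal(0.0, 2.0, size=200_000)  # draw from the proposal q
w = p_pdf(x) / q_pdf(x)                 # likelihood ratios
estimate = np.mean(w * f(x))            # importance-sampling estimator, close to 10
```

A well-chosen proposal concentrates samples where f(x)·p(x) is large, which is exactly the intuition the paper transfers to training set distributions.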
We use the Taylor expansion of a function f, which links the local behaviour of f to its derivatives, to build this distribution. We show that, locally and up to a certain order, variance is an estimator of the Taylor expansion. This allows us to construct a methodology called Variance Based Sample Weighting (VBSW), which weights each training data point using the local variance of its neighbors' labels to simulate the new distribution. Sample weighting has already been explored in many works and for various goals: Kumar et al. (2010); Jiang et al. (2015) use it to prioritize easier samples for the training, Shrivastava et al. (2016) for hard example mining, Cui et al. (2019) to avoid class imbalance, and Liu & Tao (2016) to solve the noisy label problem. In this work, the weights' construction relies on a more general claim that can be applied to any data set and whose goal is to improve the performance of the model.
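The weighting idea can be sketched as follows. This is a hypothetical implementation written for illustration (the function name, the choice of k-nearest-neighbor neighborhoods, and the normalization are our assumptions, not the paper's exact procedure): each training point is weighted by the empirical variance of the labels of its k nearest neighbors, so points lying in steep regions of f receive larger weights.

```python
# Illustrative sketch of variance-based sample weighting (hypothetical
# implementation; names and details are ours, not the paper's).
import numpy as np

def local_variance_weights(X, y, k=5):
    """Weight each sample by the variance of its k nearest neighbors' labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(X)
    weights = np.empty(n)
    for i in range(n):
        # squared Euclidean distances from point i to all points
        d = np.sum((X - X[i]) ** 2, axis=-1)
        neighbors = np.argsort(d)[:k]      # k closest points (includes i itself)
        weights[i] = np.var(y[neighbors])  # local variance of the labels
    # normalize so the weights average to 1 (keeps the loss scale unchanged)
    return weights * n / np.sum(weights)

# Example: y = f(x) is steep near x = 0 and flat elsewhere, so weights peak there.
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.tanh(20.0 * x).ravel()
w = local_variance_weights(x, y, k=5)
# w can then be plugged into a weighted training loss, e.g.
# loss = np.mean(w * (y_pred - y) ** 2)
```

The brute-force distance loop is O(n²) and only for clarity; a k-d tree or approximate nearest-neighbor search would be used at scale.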

