VARIANCE BASED SAMPLE WEIGHTING FOR SUPERVISED DEEP LEARNING

Anonymous

Abstract

In the context of supervised learning of a function by a Neural Network (NN), we claim and empirically justify that a NN yields better results when the distribution of the data set focuses on regions where the function to learn is steeper. We first translate this assumption into a mathematically workable form using Taylor expansion. Theoretical derivations then allow us to construct a methodology that we call Variance Based Samples Weighting (VBSW). VBSW uses the local variance of the labels to weight the training points. This methodology is general, scalable, cost effective, and significantly improves the performance of a large class of NNs for various classification and regression tasks on image, text and multivariate data. We highlight its benefits with experiments involving NNs ranging from a shallow linear NN to ResNet (He et al., 2015).

1. INTRODUCTION

When a Machine Learning (ML) model is used to learn from data, the distribution of the training data set can have a strong impact on its performance. More specifically, in the context of Deep Learning (DL), several works have hinted at the importance of the training set. In Bengio et al. (2009); Matiisen et al. (2017), the authors exploit the observation that, at the beginning of a learning task, a human benefits more from easy examples than from harder ones. They construct a curriculum, inducing a change in the distribution of the training data set that makes a Neural Network (NN) achieve better results on an ML problem. With a different approach, Active Learning (Settles, 2012) dynamically modifies the distribution of the training data by selecting the data points that will make the training more efficient. Finally, in Reinforcement Learning, the distribution of experiments is crucial for the agent to learn efficiently. Nonetheless, the challenge of finding a good distribution is not specific to ML. Indeed, in the context of Monte Carlo estimation of a quantity of interest based on a random variable X ∼ dP_X, Importance Sampling owes its efficiency to the construction of a second random variable, X̄ ∼ dP_X̄, that is used instead of X to improve the estimation of this quantity. Jie & Abbeel (2010) even make a connection between the success of likelihood ratio policy gradients and importance sampling, which shows that ML and Monte Carlo estimation, both distribution based methods, are closely linked. In this paper, we leverage the importance of the training set distribution to improve the performance of NNs in supervised DL. This task can be formalized as approximating a function f with a model f_θ parametrized by θ. We build a new distribution from the training points and their labels, based on the observation that f_θ needs more data points to approximate f on the regions where it is steep.
We use the Taylor expansion of a function f, which links the local behaviour of f to its derivatives, to build this distribution. We show that, up to a certain order and locally, variance is an estimator of the Taylor expansion. This allows constructing a methodology called Variance Based Samples Weighting (VBSW) that weights each training data point using the local variance of its neighbors' labels to simulate the new distribution. Sample weighting has already been explored in many works and for various goals. Kumar et al. (2010); Jiang et al. (2015) use it to prioritize easier samples during training, Shrivastava et al. (2016) for hard example mining, Cui et al. (2019) to avoid class imbalance, and Liu & Tao (2016) to solve the noisy label problem. In this work, the construction of the weights relies on a more general claim that can be applied to any data set and whose goal is to improve the performance of the model. VBSW is general, because it can be applied to any supervised ML problem based on a loss function. In this work we specifically investigate the application of VBSW to DL. In that case, VBSW is applied within the feature space of a pre-trained NN. We validate VBSW for DL by obtaining performance improvements on various tasks: classification and regression of text, from the GLUE benchmark (Wang et al., 2019), of images, from MNIST (LeCun & Cortes, 2010) and Cifar10 (Krizhevsky et al.), and of multivariate data, from the UCI ML repository (Dua & Graff, 2017), for several models ranging from linear regression to Bert (Devlin et al., 2019) or ResNet20 (He et al., 2015). As a highlight, we obtain up to a 1.65% classification improvement on Cifar10 with a ResNet. Finally, we conduct analyses on the complementarity of VBSW with other weighting techniques and on its robustness. Contributions: (i) We present and investigate a new approach to the learning problem, based on the variations of the function f to learn.
(ii) We construct a simple, scalable, versatile and cost effective methodology, VBSW, that exploits these findings in order to boost the performance of a NN. (iii) We validate VBSW on various ML tasks.

2. RELATED WORKS

Active Learning - Our methodology is based on the consideration that not every sample brings the same amount of information. Active learning (AL) exploits the same idea, in the sense that it adapts the training strategy to the problem by introducing a data point selection rule. In (Gal et al., 2017), the authors introduce a methodology based on Bayesian Neural Networks (BNN) to adapt the selection of points used for the training. Using the variational properties of BNNs, they design a rule to focus the training on points that will reduce the prediction uncertainty of the NN. In (Konyushkova et al., 2017), the construction of the selection rule is treated as an ML problem itself. See (Settles, 2012) for a review of more classical AL methods. While AL selects the data points, and thus modifies the distribution of the initial training data set, VBSW is applied before and independently of the training, so the weights do not change throughout the training. Examples Weighting - VBSW can be categorized as an example weighting algorithm. The idea of weighting the data set has already been explored in different ways and for different purposes. While curriculum learning (Bengio et al., 2009; Matiisen et al., 2017) starts the training with easier examples, self-paced learning (Kumar et al., 2010; Jiang et al., 2015) downscales harder examples. However, some works have shown that focusing on harder examples at the beginning of the learning can accelerate it. In (Shrivastava et al., 2016), hard example mining is performed to give more importance to harder examples by selecting them primarily. Example weighting is used in (Cui et al., 2019) to tackle the class imbalance problem by weighting rarer, so harder, examples. On the contrary, in (Liu & Tao, 2016) it is used to solve the noisy label problem by focusing on cleaner, so easier, examples. All these ideas show that depending on the application, example weighting can be performed in opposite manners.
Some works aim at going beyond this opposition by proposing more general methodologies. In (Chang et al., 2017), the authors use the variance of the prediction of each point throughout the training to decide whether it should be weighted or not. A meta learning approach is proposed in (Ren et al., 2018), where the authors choose the weights through an optimization loop included in the training. VBSW stands out from the previously mentioned example weighting methods because it is built on a more general assumption: that a model simply needs more points to learn more complicated functions. Its effect is to improve the performance of a NN, without solving data set specific problems like class imbalance or noisy labels. Importance Sampling - Some of the previously mentioned methods use importance sampling to design the weights of the data set or to correct the bias induced by the sample selection (Katharopoulos & Fleuret, 2018). Here, we construct a new distribution that could be interpreted as an importance distribution. However, we weight the data points to simulate this distribution, not to correct a bias induced by it. Generalization Bound - Generalization bounds for the learning theory of NNs have motivated many works, most of which are reviewed in (Jakubovitz et al., 2018). In Bartlett et al. (1998; 2019), the authors focus on the VC-dimension, a measure which depends on the number of parameters of the NN. Arora et al. (2018) introduce a compression approach that aims at reducing the number of model parameters to investigate its generalization capacities. PAC-Bayes analysis constructs generalization bounds using a priori and a posteriori distributions over the possible models. It is investigated for example in Neyshabur et al. (2018), Bartlett et al. (2017) and Neyshabur et al. (2017); Xu & Mannor (2012) link PAC-Bayes theory to the notion of sharpness of a NN, i.e. its robustness to small perturbations.
While the sharpness of the model is often mentioned in the previous works, our bound includes the derivatives of f, which can be seen as an indicator of the sharpness of the function to learn. Although it uses elements of previous works, like the Lipschitz constant of f_θ, our work does not aim at tightening or improving the already existing generalization bounds; it only emphasizes the intuition that a NN needs more points to capture sharper functions. In a sense, it investigates the robustness to perturbations in the input space, not in the parameter space.

3. A NEW TRAINING DISTRIBUTION BASED ON TAYLOR EXPANSION

In this section, we first illustrate why a NN may need more points where f is steep by deriving a generalization bound that involves the derivatives of f. Then, using Taylor expansion, we build a new training distribution that improves the performance of a NN on simple functions.

3.1. PROBLEM FORMULATION

We formalize the supervised ML task as approximating a function f : S ⊂ ℝ^{n_i} → ℝ^{n_o} with an ML model f_θ parametrized by θ, where S is a measured sub-space of ℝ^{n_i} depending on the application. To this end, we are given a training data set of N points, {X_1, ..., X_N} ∈ S, drawn from X ∼ dP_X, and their point-wise values, or labels, {f(X_1), ..., f(X_N)}. The parameters θ have to be found in order to minimize an integrated loss function J_X(θ) = E_X[L(f_θ(X), f(X))], with L : ℝ^{n_o} × ℝ^{n_o} → ℝ the loss function. The data allow estimating J_X(θ) by Ĵ_X(θ) = (1/N) Σ_{i=1}^N L(f_θ(X_i), f(X_i)). Then, an optimization algorithm is used to find a minimum of Ĵ_X(θ) w.r.t. θ.
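As an illustration, the empirical risk Ĵ_X(θ) above can be sketched in a few lines. The function names and the crude linear "model" below are illustrative placeholders, not the models used in the paper.

```python
import numpy as np

def empirical_risk(f_theta, f, X, loss):
    """Monte Carlo estimate (1/N) * sum_i L(f_theta(X_i), f(X_i))."""
    return np.mean([loss(f_theta(x), f(x)) for x in X])

# Squared L2 loss, as assumed later in Section 3.2.
l2 = lambda yhat, y: (yhat - y) ** 2

X = np.linspace(0.0, 1.0, 100)   # training points on S = [0, 1]
f = np.tanh                      # function to learn
f_theta = lambda x: x            # a crude stand-in model
risk = empirical_risk(f_theta, f, X, l2)
```

An optimizer would then adjust the parameters of f_theta to drive this quantity down; the perfect model f_theta = f attains risk zero.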

3.2. INTUITION BEHIND TAYLOR EXPANSION

In the following, we illustrate the intuition with a Generalization Bound (GB) that includes the derivatives of f, provided that these derivatives exist. The goal of the approximation problem is to generalize to points not seen during the training. The generalization error J̃_X(θ) = J_X(θ) − Ĵ_X(θ) thus needs to be as small as possible. Let S_i, i ∈ {1, ..., N}, be sub-spaces of S such that S = ∪_{i=1}^N S_i with the S_i pairwise disjoint and X_i ∈ S_i. Suppose that L is the squared L2 error, n_i = 1, f is differentiable and f_θ is K_θ-Lipschitz. Provided that |S_i| < 1, we show that

J̃_X(θ) ≤ Σ_{i=1}^N (|f'(X_i)| + K_θ)² |S_i|³ / 4 + O(|S_i|⁴),    (1)

where |S_i| is the volume of S_i. The proof can be found in Appendix B. We see that on the regions where |f'(X_i)| is higher, the quantity |S_i| has a stronger impact on the GB. This idea is illustrated in Figure 1. Since |S_i| can be seen as a metric for the local density of the data set (the smaller |S_i| is, the denser the data set is), the GB can be reduced more efficiently by adding more points around X_i in these regions. The bound also involves K_θ, the Lipschitz constant of the NN, which has the same impact as |f'(X_i)|. It also illustrates the link between the Lipschitz constant and the generalization error, which has been pointed out by several works like (Gouk et al., 2018), (Bartlett et al., 2017) and (Qian & Wegman, 2019).

Figure 1: Illustration of the GB. The maximum error (the GB), at order O(|S_i|⁴), is obtained by comparing the maximum variations of f_θ and the first order approximation of f, whose trends are given by K_θ and f'(X_i). We understand visually that because |f'(X_1)| and |f'(X_3)| are higher than |f'(X_2)|, the GB is improved more efficiently by reducing S_1 and S_3 than S_2.

Note that equation 1 only gives indications for n_i = 1. Indeed, this GB only has illustration purposes. Its goal is to motivate the metric described in the next section, which is based on Taylor expansion and therefore involves derivatives of order n > 1.

3.3. A TAYLOR EXPANSION BASED METRIC

In this paragraph, we build a metric involving the derivatives of f. Using a Taylor expansion of order n and supposing that f is n times differentiable (multi-index notation):

f(x + ε) =_{ε→0} Σ_{0≤|k|≤n} ε^k ∂^k f(x) / k! + o(‖ε‖^n).

The quantity f(x + ε) − f(x) gives an indication of how much f changes around x. By neglecting the orders above ε^n, it is then possible to find the regions of interest by focusing on Df^n, given by

Df^n(x) = ‖ Σ_{1≤|k|≤n} ε^k · Vect(|∂^k f(x)|) / k! ‖²,    (2)

where Vect(X) denotes the vectorization of a tensor X and ‖·‖² the squared L2 norm. Note that Df^n is evaluated using |∂^k f(x)| instead of ∂^k f(x) so that the derivatives do not cancel each other. f is steeper and more irregular in the regions where x → Df^n(x) is higher. To focus the training set on these regions, one can use {Df^n(X_1), ..., Df^n(X_N)} to construct a probability density function (pdf) and sample new data points from it. This sampling is evaluated and validated in Appendix A for conciseness. Based on these experiments, we choose n = 2, i.e. we use {Df²(X_1), ..., Df²(X_N)}. The good results obtained, presented in Appendix A, confirm our observation and motivate its application to complex DL problems.
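A minimal numerical sketch of Df² for a one-dimensional function can make the metric concrete. Finite differences stand in for the exact derivatives, and the step sizes h and eps are hypothetical choices, not values prescribed by the method.

```python
import numpy as np

def df2(f, x, eps=1e-1, h=1e-4):
    """Sketch of the Taylor-based metric Df^2 at x for a 1-D function:
    Df^2(x) = (eps*|f'(x)| + (eps^2/2)*|f''(x)|)^2, with absolute values
    on the derivatives so that they do not cancel each other.
    Derivatives are approximated by central finite differences of step h."""
    d1 = (f(x + h) - f(x - h)) / (2 * h)          # f'(x)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2  # f''(x)
    return (eps * abs(d1) + eps**2 / 2 * abs(d2)) ** 2

# The metric is larger where the function is steeper:
steep = df2(np.tanh, 0.0)   # tanh is steepest at 0
flat = df2(np.tanh, 3.0)    # and nearly flat at 3
```

Sampling training points proportionally to this quantity concentrates them on the steep region around 0, as the section advocates.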

4. VARIANCE BASED SAMPLES WEIGHTING (VBSW)

4.1. PRELIMINARIES

The new distribution cannot always be used as is, because we do not have access to f. Problem 1: {Df²(X_1), ..., Df²(X_N)} cannot be evaluated, since it requires computing the derivatives of f. Moreover, it assumes that f is differentiable, which is often not true. Problem 2: even if {Df²(X_1), ..., Df²(X_N)} could be computed and new points sampled, we could not obtain the labels of these new points to complete the training data set.

Problem 1: unavailability of derivatives. To overcome problem 1, we construct a new metric based on statistical estimation. In this paragraph, n_i > 1 but n_o = 1. The following derivations can be extended to n_o > 1 by applying them to f element-wise and then summing across the n_o dimensions. Let ε̂ ∼ N(0, εI_{n_i}), with ε ∈ ℝ⁺ and I_{n_i} the identity matrix of dimension n_i. We claim that Var(f(x + ε̂)) = Df²(x) + O(ε^{3/2}). The demonstration can be found in Appendix B. Using the unbiased estimator of the variance, we thus define new indices D̂f²(x) by

D̂f²(x) = 1/(k−1) Σ_{i=1}^k (f(x + ε̂_i) − f̄(x))²,    (3)

with {ε̂_1, ..., ε̂_k} k samples of ε̂ and f̄(x) the empirical mean of the f(x + ε̂_i). The metric D̂f²(x) tends to Var(f(x + ε̂)) as k → ∞, and Var(f(x + ε̂)) = Df²(x) + O(ε^{3/2}), so D̂f²(x) is a biased estimator of Df²(x), with bias O(ε^{3/2}). Hence, when ε → 0, D̂f²(x) becomes an unbiased estimator of Df²(x). It is possible to compute D̂f²(x) from any set of points centered around x. Therefore, we compute D̂f²(X_i) for each i ∈ {1, ..., N} using the set S_k(X_i) of the k nearest neighbors of X_i:

D̂f²(X_i) = 1/(k−1) Σ_{X_j ∈ S_k(X_i)} (f(X_j) − (1/k) Σ_{X_l ∈ S_k(X_i)} f(X_l))².    (4)

The advantages of this formulation are twofold. First, D̂f² can even be applied to non-differentiable functions. Second, all we need is {f(X_1), ..., f(X_N)}. In other words, the points used by D̂f²(X_i) are those used for the training of the NN. Finally, while the definition of Df²(x) is local, the definition of D̂f²(x) holds for any ε.
Note that equation 4 can even be applied when the data points are too sparse for the nearest neighbors of X_i to be considered close to X_i. It can thus be seen as a generalization of Df²(x), towards which it tends locally. Problem 2: unavailability of new labels. To tackle problem 2, recall that the goal of the training is to find θ* = argmin_θ Ĵ_X(θ), with Ĵ_X(θ) = (1/N) Σ_i L(f(X_i), f_θ(X_i)). With the new distribution based on the previous derivations, the procedure is different. Since the training points are sampled using Df², we no longer minimize Ĵ_X(θ), but Ĵ_X̄(θ) = (1/N) Σ_i L(f(X̄_i), f_θ(X̄_i)), with X̄ ∼ dP_X̄ the new distribution. However, Ĵ_X̄(θ) estimates J_X̄(θ) = ∫_S L(f(x), f_θ(x)) dP_X̄. Let p_X(x)dx = dP_X and p_X̄(x)dx = dP_X̄ be the pdfs of X and X̄ (note that p_X̄ ∝ Df²). Then,

J_X̄(θ) = ∫_S L(f(x), f_θ(x)) (p_X̄(x) / p_X(x)) dP_X.

The straightforward Monte Carlo estimator for this expression of J_X̄(θ) is

Ĵ_{X,2}(θ) = (1/N) Σ_i L(f(X_i), f_θ(X_i)) p_X̄(X_i) / p_X(X_i) ∝ (1/N) Σ_i L(f(X_i), f_θ(X_i)) D̂f²(X_i) / p_X(X_i).

Thus, J_X̄(θ) can be estimated with the same points as Ĵ_X(θ) by weighting them with w_i = D̂f²(X_i) / p_X(X_i).
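The nearest-neighbor variance of equation 4 (leaving the division by p_X aside) can be sketched as follows. The brute-force neighbor search is for clarity only; the specific data set below is a synthetic illustration.

```python
import numpy as np

def vbsw_weights(X, y, k=10):
    """Sketch of equation 4: for each X_i, the weight is the unbiased
    empirical variance of the labels of its k nearest neighbors."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    # Brute-force k-NN; a KD-Tree would be used for large N.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]          # indices of the k neighbors
    return np.var(y[knn], axis=1, ddof=1)        # ddof=1 -> 1/(k-1) factor

# Labels oscillate fast on the left half, are constant on the right:
X = np.linspace(0.0, 1.0, 200)[:, None]
y = np.where(X[:, 0] < 0.5, np.sin(40 * X[:, 0]), 1.0)
w = vbsw_weights(X, y, k=10)
```

As expected, the weights are large where the labels vary quickly and vanish where they are locally constant, which is exactly the behaviour the weighting scheme exploits.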

4.2. HYPERPARAMETERS OF VBSW

The expression of w_i involves D̂f²(X_i), whose estimation has been the goal of the previous sections. However, it also involves p_X, the distribution of the data. Just like f, p_X is unknown. Estimating p_X is a challenging task in itself, and standard density estimation techniques such as k-nearest neighbors or Gaussian Mixture density estimation led to extreme estimated values of p_X(X_i) in our experiments. Therefore, we only apply w_i = D̂f²(X_i), as a first order approximation. In practice, we re-scale the weights to lie between 1 and m, a hyperparameter. As a result, VBSW has two hyperparameters: m and k. Their effects and interactions are studied and discussed in Sections 5.1 and 5.4.
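The rescaling step is a simple affine map of the raw variance scores onto [1, m]; a minimal sketch (the degenerate constant-score case is handled with an assumed all-ones fallback):

```python
import numpy as np

def rescale_weights(v, m):
    """Affinely rescale raw variance scores v so that min(v) -> 1 and
    max(v) -> m (the VBSW hyperparameter). Constant scores map to ones."""
    v = np.asarray(v, float)
    span = v.max() - v.min()
    if span == 0.0:
        return np.ones_like(v)
    return 1.0 + (m - 1.0) * (v - v.min()) / span

w = rescale_weights([0.0, 2.0, 4.0], m=5.0)  # -> [1.0, 3.0, 5.0]
```

Every point thus keeps at least its baseline weight of 1, and m controls how strongly the steep regions dominate the loss.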

4.3. VBSW FOR DEEP LEARNING

We specified that the local variance could be computed using already existing points. This statement implies finding the nearest neighbors of each point. In extremely high dimensional spaces like image spaces, the curse of dimensionality makes nearest neighbors spurious. In addition, the structure of the data may be highly irregular, and the concept of nearest neighbor misleading. Thus, it may be irrelevant to evaluate Df² directly on this data. One of the strengths of DL is to construct good representations of the data, embedded in lower dimensional latent spaces. For instance, in Computer Vision, the deeper layers of Convolutional Neural Networks (CNNs) represent more abstract features. We leverage this representational power of NNs and simply apply our methodology within this latent feature space.

Algorithm 1: Variance Based Samples Weighting (VBSW) for Deep Learning
1: Inputs: k, m, M
2: Train M on the training set {(1/N, X_1), ..., (1/N, X_N)}, {(1/N, f(X_1)), ..., (1/N, f(X_N))}
3: Construct M* by removing the last layer of M
4: Compute {D̂f²(M*(X_1)), ..., D̂f²(M*(X_N))} using equation 4
5: Construct a new training data set {(w_1, M*(X_1)), ..., (w_N, M*(X_N))}
6: Train f_θ on {(w_1, f(X_1)), ..., (w_N, f(X_N))} and add it to M*. The final model is M_f = f_θ ∘ M*

Variance Based Samples Weighting (VBSW) for DL is recapitulated in Algorithm 1. Here, M is the initial NN whose feature space is used to project the training data set and apply VBSW. Line 1: m and k are hyperparameters that can be chosen jointly with all other hyperparameters, e.g. using a random search. Line 2: the initial NN, M, is trained as usual. The notation {(1/N, X_1), ..., (1/N, X_N)} is equivalent to {X_1, ..., X_N}, because all the weights are the same (1/N). Line 3: the last fully connected layer is discarded, resulting in a new model M*, and the training data set is projected into the feature space.
Lines 4-5: equation 4 is applied to compute the weights w_i, which are used to weight the projected data set. To perform the nearest neighbor search, we use a KD-Tree (Bentley, 1975). Line 6: the last layer is re-trained (which is often equivalent to fitting a linear model) using the weighted data set, and added to M* to obtain the final model M_f. As a result, M_f is a composition of the already trained model M* and of f_θ trained on the weighted data set.
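The whole pipeline of Algorithm 1 can be sketched end to end in numpy. Two simplifications are assumed here: a hypothetical fixed random feature map stands in for the pre-trained M*, and the last-layer retraining is done by closed-form weighted least squares, which is equivalent to training a linear f_θ under a squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for M*: the pre-trained network without its last layer
# (here a fixed random feature map; in practice, a trained NN).
P = rng.normal(size=(1, 16))
def m_star(X):
    return np.tanh(X @ P)

# Training data: steep in the middle, flat on the sides.
X = np.linspace(-2.0, 2.0, 300)[:, None]
y = np.tanh(5 * X[:, 0])

# Lines 3-5: project the data set, compute the local-variance weights
# (equation 4 via brute-force 10-NN in the feature space), rescale to [1, m].
Z = m_star(X)
d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
v = np.var(y[np.argsort(d, axis=1)[:, :10]], axis=1, ddof=1)
m = 10.0
w = 1.0 + (m - 1.0) * (v - v.min()) / (v.max() - v.min())

# Line 6: re-train the last layer, i.e. fit a weighted linear model on Z.
A = np.hstack([Z, np.ones((len(Z), 1))])           # features + bias column
theta, *_ = np.linalg.lstsq(A * np.sqrt(w)[:, None],
                            y * np.sqrt(w), rcond=None)

def m_f(X):
    """Final model M_f = f_theta o M*."""
    Zx = np.hstack([m_star(X), np.ones((len(X), 1))])
    return Zx @ theta
```

With a keras model, the same steps would amount to truncating the trained model, computing w_i on the projected data set and passing them as `sample_weight` when re-fitting the last layer.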

5. EXPERIMENTS

We first test this methodology on toy datasets with linear models and small NNs. Then, to illustrate how general VBSW can be, we consider various tasks in image classification, text regression and classification. Finally, we study the robustness of VBSW and its complementarity with other sample weighting techniques.

5.1. TOY EXPERIMENTS

VBSW is studied on the Double Moon (DM) classification, Boston Housing (BH) regression and Breast Cancer (BC) classification data sets. On DM, VBSW provides a cleaner decision boundary than the baseline; the corresponding pictures, as well as the results of Table 1, show the improvement obtained with VBSW. For the BH data set, a linear model is trained, and for the BC data set, an MLP of 1 layer and 30 units, with a train-validation split of 80%-20%. Both models are trained with ADAM (Kingma & Ba, 2014). Since these data sets are small and the models are light, we study the effects of the choice of m and k on the performances. Moreover, BH is a regression task and BC a classification task, which allows studying the effect of the hyperparameters more extensively. We train the models for a grid of 20 × 20 different values of m and k. These hyperparameters seem to have a different impact on performances for classification and regression. In both cases, low values of m yield better results, but in classification, low values of k are better, unlike in regression. Details and visualizations of this experiment can be found in Appendix C. The best results obtained with this study are compared to the best results of the same models trained without VBSW in Table 1.

5.2. MNIST AND CIFAR10

For MNIST, we train 40 LeNet 5 models, i.e. with 40 different random seeds, and then apply VBSW for 10 different random seeds, with the ADAM optimizer and categorical cross-entropy loss. Note that in the following, ADAM is used with the default parameters of its keras implementation. We record the best value obtained from the 10 VBSW trainings. The same procedure is followed for Cifar10, except that we train a ResNet20 for 50 random seeds, with data augmentation and learning rate decay. The gain due to VBSW at the i-th random seed is acc(M_f^i) − acc(M), with acc the accuracy and M_f^i the VBSW model trained at that seed. The results statistics are gathered in Table 2, which also displays statistics about the gain due to VBSW for each model. The results on MNIST, for all statistics and for the gain, are significantly better for VBSW than for the baseline. For Cifar10, we get a 0.3% accuracy improvement for the best model and up to a 1.65% accuracy gain, meaning that among the 50 ResNet20s, there is one whose accuracy has been improved by 1.65% by VBSW. Note that applying VBSW took less than 15 minutes on a laptop with an i7-7700HQ CPU. A visualization of the samples that were weighted with the highest w_i is given in Figure 3.

5.3. RTE, STS-B AND MRPC

For this application, unlike in the previous experiments, we do not pre-train the Bert NN ourselves, since it has been originally built for Transfer Learning purposes: it is meant to be used as is and then fine-tuned on any NLP data set (see Devlin et al., 2019). However, because of the small size of the data sets and the high number of model parameters, we chose not to fine-tune the Bert model, and only to use the representations of the data sets in its feature space to apply VBSW. More specifically, we use tiny-bert (Turc et al., 2019), which is a lighter version of the initial Bert NN. We train the linear model with tensorflow, to be able to add the trained model on top of the Bert model and obtain a unified model. RTE and MRPC are classification tasks, so we use the binary cross-entropy loss function to train our model. STS-B is a regression task, so the model is trained with the Mean Squared Error. All the models are trained with the ADAM optimizer. For each task, we compare the training of the linear model with VBSW and without VBSW (baseline). The results obtained with VBSW are better overall, except for the Pearson correlation on STS-B, which is slightly worse than the baseline.

5.4. ROBUSTNESS OF VBSW

VBSW relies on statistical estimation: the weights are based on local empirical variances, evaluated using k points. In addition, they are rescaled using the hyperparameter m. Section 5.1 and Appendix C show that many different combinations of m and k, and therefore many different values of the weights, improve the error. This behavior suggests that VBSW is quite robust to weight approximation errors. We also assess the robustness of VBSW to label noise. To that end, we train a ResNet20 on Cifar10 with four different noise levels: we randomly change the labels of p% of the training points for four different values of p (10, 20, 30 and 40). We then apply VBSW 30 times and evaluate the performances of the obtained NNs on a clean test set. The results are gathered in Table 4. They show that VBSW is still effective despite label noise, which could be explained by two observations. First, the weights of VBSW rely on statistical estimation, so perturbations of the labels might have a limited impact on the weights' values. Second, as mentioned previously, VBSW is robust to weight approximation errors, so perturbations of the weights due to label noise may not critically hurt the method. Although VBSW is robust to label noise, note that its goal is not to address the noisy label problem, as discussed in Section 2. It may be more effective to use a sampling technique specifically tailored for this situation - possibly jointly with VBSW, as in Section 5.5.

5.5. COMPLEMENTARITY OF VBSW

Existing sample weighting techniques can be used jointly with VBSW by training the initial NN M with the first sample weighting algorithm, and then applying VBSW on its feature space. To illustrate this feature, we compare VBSW with the recently introduced Active Bias (AB) (Chang et al., 2017). AB dynamically weights the samples based on the variance of the prediction probability of each point throughout the training. Here, we study the combined effects of AB and VBSW for the training of a ResNet20 on Cifar10. The gain per model, defined as acc(M_f^i) − acc(M) with acc the accuracy and M_f^i the VBSW model trained at the i-th random seed, is lower for AB + VBSW than for VBSW alone. An explanation might be that AB already improves the NN performances compared to vanilla training, so there is less room for accuracy improvement by VBSW in that case.

6. DISCUSSION & FUTURE WORK

Previous experiments demonstrate the performance improvements that VBSW can bring in practice. In addition to these results, several advantages can be pointed out. • VBSW is validated on several different tasks, which makes it quite versatile. Moreover, the problem of the high dimensionality and irregularity of f, which often arises in DL problems, is alleviated by focusing on the latent space of NNs. This makes VBSW scalable. As a result, VBSW can be applied to complex NNs, such as ResNet, a complex CNN, or Bert, for various ML tasks. Its application to more diverse ML models is a perspective for future works. • The validation presented in this paper supports an original view of the learning problem that involves the local variations of f. The studies of Appendix A, which use the derivatives of the function to learn to sample a more efficient training data set, support this approach as well. • VBSW extends this original view to problems where the derivatives of f are not accessible, and sometimes not even defined. Indeed, VBSW stems from Taylor expansion, which is specific to differentiable functions, but in the end it can be applied regardless of the regularity of f. • Finally, this method is cost effective. In most cases, it allows quickly improving the performance of a NN using a regular CPU. In terms of energy consumption, it is better than carrying out a whole new training with a wider and deeper NN. We first approximated p_X to be uniform, because we could not approximate it correctly. This approximation still led to an efficient methodology, but VBSW may benefit from a finer approximation of p_X, which is among our perspectives. Finally, the KD-Tree and even Approximate Nearest Neighbors algorithms struggle when the data set is too big. One possibility to overcome this problem would be to parallelize their execution. We only considered the cases where we have no access to f. However, there are ML applications where we do.
For instance, in numerical simulations for physical sciences, computational economics or climatology, ML can be used for various reasons, e.g. sensitivity analysis, inverse problems or speeding up computer codes (Zhu et al., 2019; Winovich et al., 2019; Feng et al., 2018). In this context, data come from numerical models, so the derivatives of f are accessible and could be used directly. Appendix A contains examples of such possible applications.

7. CONCLUSION

Our work is based on the observation that, in supervised learning, a function f is more difficult to approximate by a NN in the regions where it is steeper. We mathematically translated this intuition and derived a generalization bound to illustrate it. Then, we constructed an original method, Variance Based Samples Weighting (VBSW), that uses the variance of the training samples to weight the training data set and boost the model's performance. VBSW is simple to use and to implement, because it only requires computing statistics on the input space. In Deep Learning, applying VBSW to the data set projected in the feature space of an already trained NN reduces its error by simply retraining its last layer. Although specifically investigated in Deep Learning, this method is applicable to any loss function based supervised learning problem, and is scalable, cost effective, robust and versatile. It is validated on several applications, such as the GLUE benchmark with Bert, for text classification and regression, and Cifar10 with a ResNet20, for image classification.

APPENDIX A TAYLOR BASED SAMPLING

In this part, we empirically verify that using Taylor expansion to construct a new training distribution has a beneficial impact on the performances of a NN. To this end, we construct a methodology, which we call Taylor Based Sampling (TBS), that generates a new training data set based on the metric introduced in Section 3.3. First, we recall the formula of this metric:

Df^n(x) = ‖ Σ_{1≤|k|≤n} ε^k · Vect(|∂^k f(x)|) / k! ‖².    (2)

To focus the training set on the regions of interest, i.e. regions of high Df^n, we use this metric to construct a probability density function (pdf). This is possible since Df^n(x) ≥ 0 for all x ∈ S: it only remains to normalize it, and in practice it is enough to consider a distribution d ∝ Df^n. Here, to approximate d we use a Gaussian Mixture Model (GMM) with pdf d_GMM, which we fit to {Df^n(X_1), ..., Df^n(X_N)} using the Expectation-Maximization (EM) algorithm. Then, N̄ new data points {X̄_1, ..., X̄_N̄} can be sampled, with X̄ ∼ d_GMM. Finally, we compute the labels {f(X̄_1), ..., f(X̄_N̄)}, add them to {f(X_1), ..., f(X_N)} and train the NN on the whole data set.
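The TBS pipeline can be sketched in one dimension. For brevity, the EM-fitted GMM is replaced here by a kernel-density stand-in (a mixture with one Gaussian per resampled center, an assumption of this sketch rather than the paper's Algorithm 2), while the resampling proportional to the metric and the rejection step inside S are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

def taylor_based_sampling(metric, X, n_new, sigma=0.02, lo=0.0, hi=1.0):
    """TBS sketch (1-D): draw centers from d ∝ Df^n over the initial grid X,
    smooth them with Gaussian noise of scale sigma, and reject samples that
    fall outside S = [lo, hi]."""
    scores = metric(X)
    p = scores / scores.sum()                       # discrete version of d
    new = []
    while len(new) < n_new:
        centers = rng.choice(X, size=n_new, p=p)    # resample ∝ Df^n
        samples = centers + sigma * rng.normal(size=n_new)
        new += [s for s in samples if lo <= s <= hi]  # rejection step
    return np.array(new[:n_new])

# A Df^2-like metric for the Runge-type function 1/(1+25x^2) on [0, 1]:
# its derivative magnitude peaks near the steep region at small x.
runge_metric = lambda x: np.abs(-50 * x / (1 + 25 * x**2) ** 2) + 1e-6
X = np.linspace(0.0, 1.0, 200)
new_pts = taylor_based_sampling(runge_metric, X, n_new=100)
```

The new points concentrate where the function is steep, which is the intended effect of sampling from d ∝ Df^n.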

TAYLOR BASED SAMPLING (TBS)

TBS is described in Algorithm 2. Line 1: the choice criterion for ε, the number of Gaussian distributions n_GMM and N̄ is to avoid sparsity of {X̄_1, ..., X̄_N̄} over S. Line 2: without a priori information on f, we sample the first points uniformly in a subspace S. Lines 3-6: we construct {Df^n(X_1), ..., Df^n(X_N)}, and then d, to be able to sample points accordingly. Line 7: because the support of a GMM is not bounded, some points can be sampled outside S. We discard these points and sample until all points are inside S. This rejection method is equivalent to sampling points from a truncated GMM. Lines 8-9: we compute the labels and add the new points to the initial data set.

Algorithm 2: Taylor Based Sampling (TBS)
1: Inputs: ε, N, N̄, n_GMM, n
2: Sample {X_1, ..., X_N}, with X ∼ U(S)
3: for 0 ≤ k ≤ n do
4:   Compute {∂^k f(X_1), ..., ∂^k f(X_N)}
5: Compute {Df^n(X_1), ..., Df^n(X_N)} using equation 2
6: Approximate d with a GMM using the EM algorithm to obtain a density d_GMM
7: Sample {X̄_1, ..., X̄_N̄} from d_GMM, using the rejection method to sample inside S
8: Compute {f(X̄_1), ..., f(X̄_N̄)}
9: Add {f(X̄_1), ..., f(X̄_N̄)} to {f(X_1), ..., f(X_N)}

APPLICATION TO SIMPLE FUNCTIONS

Table 6: Comparison between BS and TBS. The metrics used are the L2 and L∞ errors, displayed with a 95% confidence interval.

Sampling | L2 error | L∞ error
f: Runge (×10⁻²)
  BS  | 1.45 ± 0.62 | 5.31 ± 0.86
  TBS | 1.13 ± 0.73 | 3.87 ± 0.48
f: tanh (×10⁻¹)
  BS  | 1.39 ± 0.67 | 2.75 ± 0.78
  TBS | 0.95 ± 0.50 | 2.25 ± 0.61

In order to illustrate the benefits of TBS compared to a uniform, basic sampling (BS), we apply it to two simple functions: the hyperbolic tangent and the Runge function. We chose these functions because they are differentiable and have a clear distinction between flat and steep regions. These functions are displayed in Figure 4, as well as the map x → Df²(x). The NNs are trained with the Adam optimizer (Kingma & Ba, 2014), with the default hyperparameters of its tensorflow implementation, and the Mean Squared Error loss function.
We first sample {X_1, ..., X_N} according to a regular grid. To compare the two methods, we add N' additional points sampled using BS to create the BS data set, and then N' other points sampled with TBS to construct the TBS data set. As a result, each data set has the same number of points (N + N'). We repeated the method for several values of n, n_GMM and ε, and finally selected n = 2, n_GMM = 3 and ε = 10⁻³. Table 6 summarizes the L2 and the L∞ norms of the error of f_θ, obtained at the end of the training phase for N + N' = 16, with N = N' = 8. These norms are estimated using the same test data set of 1000 points. The values are the means of 40 independent experiments, displayed with a 95% confidence interval. These results illustrate the benefits of TBS over BS. Table 6 shows that TBS improves the L∞ error of the NN more markedly than its L2 error. This may explain the good results of VBSW for classification. Indeed, for a classification task, the accuracy will not be very sensitive to small output variations, since the output is rounded to 0 or 1. However, a high error can induce a misclassification, and the reduction in L∞ norm limits this risk.
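The intuition that the criterion concentrates sampling on steep regions can be checked numerically for the Runge function: Df² evaluated on the steep flank dominates its value in the flat tail. This is a finite-difference sketch (ε̄ = 10⁻³ as selected above); the helper name is illustrative.

```python
import numpy as np

def df2(f, x, eps=1e-3, h=1e-4):
    """Second-order Taylor criterion eps*f'(x)^2 + 0.5*eps^2*f''(x)^2,
    with derivatives approximated by central finite differences."""
    d1 = (f(x + h) - f(x - h)) / (2 * h)
    d2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2
    return eps * d1**2 + 0.5 * eps**2 * d2**2

runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)
steep = df2(runge, 0.2)   # steep flank of the Runge function
flat = df2(runge, 0.9)    # nearly flat tail
# steep >> flat, so TBS draws more points near x = 0.2 than near x = 0.9
```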

APPLICATION TO AN ODE SYSTEM

We apply TBS to a more realistic case: the approximation of the solution of the Bateman equations, an ODE system

∂_t u(t) = v σ_a · η(t) u(t),
∂_t η(t) = v Σ_r · η(t) u(t),

with initial conditions u(0) = u_0, η(0) = η_0, and with u ∈ R⁺, η ∈ (R⁺)^M, σ_a^T ∈ R^M, Σ_r ∈ R^{M×M}. Here, f: (u_0, η_0, t) → (u(t), η(t)). For physical applications, M ranges from tens to thousands. We consider the particular case M = 1, so that f: R³ → R², with f(u_0, η_0, t) = (u(t), η(t)). The advantage of M = 1 is that we have access to an analytic, cheap-to-compute solution for f. Of course, this particular case can also be solved using a classical ODE solver, which allows us to test the method end to end. It can thus be generalized to higher dimensions (M > 1). All NN trainings have been performed in Python, with TensorFlow (Abadi et al., 2015). We used a fully connected NN with hyperparameters chosen using a simple grid search. The final values are: 2 hidden layers, ReLU activation function, and 32 units for each layer, trained with the Mean Squared Error (MSE) loss function using the Adam optimization algorithm with a batch size of 50000, for 40000 epochs and on N + N' = 50000 points, with N = N'. We first trained the model with a uniform sampling (BS) (N' = 0), and then with TBS for several values of n, n_GMM and ε, with ε of the form ε̄ × (1, 1, 1), to be able to find good values. We finally selected ε̄ = 5 × 10⁻⁴, n = 2 and n_GMM = 10. The data points used in this case have been sampled with an explicit Euler scheme. This experiment has been repeated 50 times to ensure statistical significance of the results. Table 7 summarizes the MSE, i.e. the L2 norm of the error of f_θ, and the L∞ norm, with L∞(θ) = max_{X∈S} |f(X) − f_θ(X)|, obtained at the end of the training phase. This last metric is important because the goal in computational physics is not only to be accurate on average, which is measured with the MSE, but to be accurate over the whole input space S.
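The explicit Euler scheme mentioned above can be sketched for the M = 1 system. This sketch assumes an absorption (minus) sign on the u equation, which the extracted equations may have dropped; v, σ_a, Σ_r and the step count are illustrative values. Under that sign convention, Σ_r u + σ_a η is a linear invariant of the system, and explicit Euler preserves linear invariants exactly up to rounding, which gives a convenient sanity check.

```python
def bateman_m1(u0, eta0, t_end, v=1.0, sigma_a=0.5, sigma_r=0.3,
               n_steps=10000):
    """Explicit Euler integration of the M = 1 Bateman system:
        du/dt   = -v * sigma_a * eta * u   (absorption, assumed sign)
        deta/dt =  v * sigma_r * eta * u
    Returns (u(t_end), eta(t_end))."""
    dt = t_end / n_steps
    u, eta = u0, eta0
    for _ in range(n_steps):
        du = -v * sigma_a * eta * u * dt
        deta = v * sigma_r * eta * u * dt
        u, eta = u + du, eta + deta
    return u, eta

u_t, eta_t = bateman_m1(u0=1.0, eta0=1.0, t_end=10.0)
# sigma_r * u + sigma_a * eta stays equal to its initial value
```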
These norms are estimated using the same test data set of N_test = 50000 points. The values are the means of the 50 independent experiments, displayed with a 95% confidence interval. These results reflect an error reduction of 6.6% for L2 and of 45.3% for L∞, which means that TBS mostly improves the L∞ error of f_θ. Moreover, the L∞ error confidence intervals do not intersect, so the gain is statistically significant for this norm. Figure 5(1a) shows how the NN performs for an average prediction. Figure 5(1b) illustrates the benefits of TBS relative to BS on the L∞ error. These two figures confirm the previous observation about the gain in L∞ error. Finally, Figure 5(2a) displays (u_0, η_0) → max_{0≤t≤10} D^n(u_0, η_0, t) and shows that D^n increases when u_0 → 0. TBS hence focuses on this region. Note that for the readability of these plots, the values are capped at 0.10; otherwise, only a few points with high D^n would be visible. Figure 5(2b) displays (u_0, η_0) → g_{θ_BS}(u_0, η_0) − g_{θ_TBS}(u_0, η_0), with g_θ: (u_0, η_0) → max_{0≤t≤10} ‖f(u_0, η_0, t) − f_θ(u_0, η_0, t)‖²_2, where θ_BS and θ_TBS denote the parameters obtained after a training with BS and TBS, respectively. It can be interpreted as the error reduction achieved with TBS. The highest error reduction occurs in the expected region. Indeed, more points are sampled where D^n is higher. The error is slightly increased in the rest of S, which could be explained by a sparser sampling of this region.
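The AEG and AEL quantities reported in Table 1 (defined below in terms of the error difference Z = g_{θ_BS} − g_{θ_TBS}) reduce to the means of the positive and negative parts of Z over the integration grid. A NumPy sketch, with a toy grid of values:

```python
import numpy as np

def aeg_ael(z):
    """AEG = E[Z * 1_{Z>0}], AEL = E[Z * 1_{Z<0}] for a grid of values of
    Z(u0, eta0) = g_BS - g_TBS (positive Z = error reduced by TBS)."""
    z = np.asarray(z, dtype=float)
    aeg = np.mean(np.where(z > 0, z, 0.0))
    ael = np.mean(np.where(z < 0, z, 0.0))
    return aeg, ael

# toy grid of error differences
aeg, ael = aeg_ael([1.0, -2.0, 3.0, 0.0])
# -> aeg = 1.0, ael = -0.5
```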
However, as summarized in Table 1, the average error loss (AEL) of TBS is around six times lower than the average error gain (AEG), with AEG = E_{u_0,η_0}[Z 1_{Z>0}] and AEL = E_{u_0,η_0}[Z 1_{Z<0}], where Z(u_0, η_0) = g_{θ_BS}(u_0, η_0) − g_{θ_TBS}(u_0, η_0). In practice, AEG and AEL are estimated using uniform grid integration, and averaged over the 50 experiments.

Since f_θ is K_θ-Lipschitz, for x ∈ S_i,

f_θ(X_i) − K_θ|x − X_i| ≤ f_θ(x) ≤ f_θ(X_i) + K_θ|x − X_i|,
−f_θ(X_i) − K_θ|x − X_i| ≤ −f_θ(x) ≤ −f_θ(X_i) + K_θ|x − X_i|.

Hence, writing A_i(x) = f(x) − f_θ(x) and using the Taylor expansion of f around X_i,

f(X_i) + f'(X_i)(x − X_i) + ½ f''(X_i)(x − X_i)² − f_θ(X_i) − K_θ|x − X_i| + O((x − X_i)³)
  ≤ A_i(x) ≤
f(X_i) + f'(X_i)(x − X_i) + ½ f''(X_i)(x − X_i)² − f_θ(X_i) + K_θ|x − X_i| + O((x − X_i)³),

so that

A_i(x) ≤ f(X_i) − f_θ(X_i) + f'(X_i)(x − X_i) + ½ f''(X_i)(x − X_i)² + K_θ|x − X_i| + O((x − X_i)³).

And finally, using the triangular inequality,

A_i(x) ≤ |f(X_i) − f_θ(X_i)| + |f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i| + O(|x − X_i|³).

Now, ‖·‖ being the squared L2 norm:

J_X(θ) = Σ_{i=1}^N ∫_{S_i} (f(X_i) + f'(X_i)(x − X_i) + ½ f''(X_i)(x − X_i)² − f_θ(x) + O(|x − X_i|³))² dP_X,

J_X(θ) ≤ Σ_{i=1}^N ∫_{S_i} (|f(X_i) − f_θ(X_i)| + |f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i| + O(|x − X_i|³))² dP_X,

= Σ_{i=1}^N ∫_{S_i} [ |f(X_i) − f_θ(X_i)|² + 2|f(X_i) − f_θ(X_i)|(|f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i|) + (|f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i|)² + O(|x − X_i|³) ] dP_X,

= Σ_{i=1}^N ∫_{S_i} [ |f(X_i) − f_θ(X_i)|² + 2|f(X_i) − f_θ(X_i)|(|f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i|) + |f'(X_i)|²|x − X_i|² + 2K_θ|f'(X_i)||x − X_i|² + K_θ²|x − X_i|² + O(|x − X_i|³) ] dP_X,

= Σ_{i=1}^N ∫_{S_i} [ |f(X_i) − f_θ(X_i)|² + 2|f(X_i) − f_θ(X_i)|(|f'(X_i)||x − X_i| + ½ |f''(X_i)||x − X_i|² + K_θ|x − X_i|) + (|f'(X_i)| + K_θ)²|x − X_i|² + O(|x − X_i|³) ] dP_X.

Hornik's theorem (Hornik et al., 1989) states that, given a norm ‖·‖_{p,µ} such that ‖f‖^p_{p,µ} = ∫_S |f(x)|^p dµ(x), with dµ a probability measure, for any ε > 0 there exists θ such that, for a Multi Layer Perceptron f_θ,

‖f(x) − f_θ(x)‖^p_{p,µ} < ε.

This theorem grants that for any ε > 0, with dµ = Σ_{i=1}^N (1/N) δ(x − X_i), there exists θ such that

‖f(x) − f_θ(x)‖¹_{1,µ} = Σ_{i=1}^N (1/N) |f(X_i) − f_θ(X_i)| ≤ ε,
‖f(x) − f_θ(x)‖²_{2,µ} = Σ_{i=1}^N (1/N) (f(X_i) − f_θ(X_i))² ≤ ε.   (8)

Let us introduce i* such that i* = argmin_i |S_i|. Note that for any i ∈ {1, ..., N}, O(|S_{i*}|⁴) is O(|S_i|⁴). Now, let us choose ε such that ε = O(|S_{i*}|⁴). Then, equation 8 implies that

|f(X_i) − f_θ(X_i)| = O(|S_i|⁴),
(f(X_i) − f_θ(X_i))² = O(|S_i|⁴),
Ĵ_X(θ) = ‖f(x) − f_θ(x)‖²_{2,µ} = O(|S_i|⁴).

Thus, we have J̃_X(θ) = J_X(θ) − Ĵ_X(θ) = J_X(θ) + O(|S_i|⁴) and therefore,

J̃_X(θ) ≤ Σ_{i=1}^N ∫_{S_i} (|f'(X_i)| + K_θ)² |x − X_i|² dP_X + O(|S_i|⁴).

Finally,

J̃_X(θ) ≤ Σ_{i=1}^N (|f'(X_i)| + K_θ)² |S_i|³/3 + O(|S_i|⁴).

We see that on the regions where |f'(X_i)| + K_θ is higher, the quantity |S_i| (the volume of S_i) has a stronger impact on the GB. Then, since |S_i| can be seen as a metric for the local density of the data set (the smaller |S_i| is, the denser the data set is), the Generalization Bound (GB) can be reduced more efficiently by adding more points around X_i in these regions. This bound also involves K_θ, the Lipschitz constant of the NN, which has the same impact as |f'(X_i)|. It also illustrates the link between the Lipschitz constant and the generalization error, which has been pointed out by several works, for instance Gouk et al. (2018), Bartlett et al. (2017) and Qian & Wegman (2019).

PROBLEM 1: UNAVAILABILITY OF DERIVATIVES (SECTION 4.1)

In this paragraph, we consider n_i > 1 but n_o = 1. The following derivations can be extended to n_o > 1 by applying them to f element-wise. Let ε ∼ N(0, ε̄ I_{n_i}) with ε̄ ∈ R⁺ and ε = (ε_1, ..., ε_{n_i}), i.e. ε_i ∼ N(0, ε̄).
Using a Taylor expansion of f at order 2 gives

f(x + ε) = f(x) + ∇_x f(x) · ε + ½ εᵀ · H_x f(x) · ε + O(‖ε‖₂³),

with ∇_x f(x) and H_x f(x) the gradient and the Hessian of f w.r.t. x. We now compute Var(f(x + ε)) and make Df²(x) = ε̄ ‖∇_x f(x)‖²_F + ½ ε̄² ‖H_x f(x)‖²_F appear in its expression, to establish a link between these two quantities:

Var(f(x + ε)) = Var(f(x) + ∇_x f(x) · ε + ½ εᵀ · H_x f(x) · ε + O(‖ε‖₂³)),
             = Var(∇_x f(x) · ε + ½ εᵀ · H_x f(x) · ε) + O(‖ε‖₂³).

Since ε_i ∼ N(0, ε̄), x = (x_1, ..., x_{n_i}), and with ∂²f/∂x_i∂x_j(x) the cross derivatives of f w.r.t. x_i and x_j,

∇_x f(x) · ε + ½ εᵀ · H_x f(x) · ε = Σ_{i=1}^{n_i} ε_i ∂f/∂x_i(x) + ½ Σ_{j=1}^{n_i} Σ_{k=1}^{n_i} ε_j ε_k ∂²f/∂x_j∂x_k(x).

In this expression, we have to assess three quantities: Cov(ε_{i1}, ε_{i2}), Cov(ε_i, ε_j ε_k) and Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}).

First, since (ε_1, ..., ε_{n_i}) are i.i.d.,

Cov(ε_{i1}, ε_{i2}) = Var(ε_i) = ε̄ if i_1 = i_2 = i, 0 otherwise.

To assess Cov(ε_i, ε_j ε_k), three cases have to be considered.
• If i = j = k, because E[ε_i³] = 0, Cov(ε_i, ε_j ε_k) = Cov(ε_i, ε_i²) = E[ε_i³] − E[ε_i]E[ε_i²] = 0.
• If i = j or i = k (we consider i = k, and the result holds for i = j by commutativity), Cov(ε_i, ε_j ε_k) = Cov(ε_i, ε_i ε_j) = E[ε_i² ε_j] − E[ε_i]E[ε_i ε_j] = E[ε_i²]E[ε_j] = 0.
• If i ≠ j and i ≠ k, ε_i and ε_j ε_k are independent and so Cov(ε_i, ε_j ε_k) = 0.

Finally, to assess Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}), four cases have to be considered:
• If j_1 = j_2 = k_1 = k_2 = i, Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}) = Var(ε_i²) = 2ε̄².
• If j_1 = k_1 = i and j_2 = k_2 = j, with i ≠ j, Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}) = Cov(ε_i², ε_j²) = 0 since ε_i² and ε_j² are independent.
• If j_1 = j_2 = j and k_1 = k_2 = k, with j ≠ k, Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}) = Var(ε_j ε_k) = Var(ε_j) Var(ε_k) = ε̄².
• If j_1 differs from k_1, j_2 and k_2, Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}) = E[ε_{j1} ε_{k1} ε_{j2} ε_{k2}] − E[ε_{j1} ε_{k1}]E[ε_{j2} ε_{k2}] = E[ε_{j1}]E[ε_{k1} ε_{j2} ε_{k2}] − E[ε_{j1}]E[ε_{k1}]E[ε_{j2} ε_{k2}] = 0.

All other possible cases can be assessed using the previous results, the commutativity of the product and the symmetry of the Cov operator.
Hence,

Var(∇_x f(x) · ε + ½ εᵀ · H_x f(x) · ε)
= Σ_{i1=1}^{n_i} Σ_{i2=1}^{n_i} ∂f/∂x_{i1}(x) ∂f/∂x_{i2}(x) Cov(ε_{i1}, ε_{i2}) + ¼ Σ_{j1=1}^{n_i} Σ_{k1=1}^{n_i} Σ_{j2=1}^{n_i} Σ_{k2=1}^{n_i} ∂²f/∂x_{j1}∂x_{k1}(x) ∂²f/∂x_{j2}∂x_{k2}(x) Cov(ε_{j1} ε_{k1}, ε_{j2} ε_{k2}),
= Σ_{i=1}^{n_i} ε̄ (∂f/∂x_i(x))² + ½ Σ_{j=1}^{n_i} Σ_{k=1}^{n_i} ε̄² (∂²f/∂x_j∂x_k(x))²,
= ε̄ ‖∇_x f(x)‖²_F + ½ ε̄² ‖H_x f(x)‖²_F,
= Df²(x).
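The identity Var(f(x + ε)) = ε̄‖∇_x f(x)‖²_F + ½ε̄²‖H_x f(x)‖²_F derived above is exact for a quadratic f, since the Taylor remainder then vanishes, so it can be verified by Monte Carlo. A sketch with an arbitrary quadratic (the gradient, Hessian and sample count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_i = 3
a = np.array([1.0, -2.0, 0.5])                 # gradient of f at x
B = np.array([[2.0, 0.3, 0.0],                 # symmetric Hessian of f at x
              [0.3, 1.0, -0.4],
              [0.0, -0.4, 0.5]])

eps_bar = 0.01                                 # variance of each eps_i
eps = rng.normal(0.0, np.sqrt(eps_bar), size=(200_000, n_i))

# f(x + eps) - f(x) for a quadratic f, evaluated on all Monte Carlo draws
values = eps @ a + 0.5 * np.einsum('ni,ij,nj->n', eps, B, eps)
empirical = np.var(values)

theory = eps_bar * np.sum(a**2) + 0.5 * eps_bar**2 * np.sum(B**2)
# empirical and theory agree to within Monte Carlo noise (well under 5%)
```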

PAPER HYPERPARAMETERS VALUES

The values chosen for the hyperparameters of the paper experiments are gathered in Table 8.



Figure 2: From left to right: (a) Double Moon (DM) data set. (b) Decision boundary with the baseline method. (c) Heat map of the value of w_i for each X_i (red is high and blue is low). (d) Decision boundary with the VBSW method.

For DM, Figure 2(c) shows that the points with the highest w_i (in red) are close to the boundary between the two classes. Indeed, in classification, VBSW can be interpreted as a measure of local label disagreement. We train a Multi Layer Perceptron of 1 layer of 4 units, using Stochastic Gradient Descent (SGD) and the binary cross-entropy loss function, on a 300-point training data set for 50 random seeds. In this experiment, VBSW, i.e. weighting the data set with w_i, is compared to a baseline where no weights are applied. Figure 2(b) and (d) display the decision boundary of the best run for each method.
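The weights w_i can be sketched as the local variance of the labels among the k nearest neighbours of each training point; on a toy two-cluster data set, only the points adjacent to the class boundary get non-zero weight. The helper name, the value of k and the data are illustrative, and the paper's full method additionally rescales these raw variances.

```python
import numpy as np

def local_label_variance(X, y, k=5):
    """For each point, variance of the labels of its k nearest neighbours
    (itself included): high where nearby labels disagree."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.empty(len(X))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=-1)
        nn = np.argsort(d)[:k]
        w[i] = np.var(y[nn])
    return w

# 1D toy set: class 0 on the left, class 1 on the right
X = np.array([[0.0], [0.4], [0.8], [1.2], [1.6], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = local_label_variance(X, y, k=3)
# w is zero deep inside each cluster and peaks at the two points
# adjacent to the class boundary (x = 0.8 and x = 1.2)
```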

Figure 3: Samples from Cifar10 and MNIST with high w_i. These pictures are either unusual or difficult to classify, even for a human (especially for MNIST).

Figure 4: Left: (left axis) Runge function w.r.t. x and (right axis) x → Df²(x). Points sampled using TBS are plotted on the x-axis and projected on f. Right: same as left, with the hyperbolic tangent function.


Figure 5: 1a: t → f_θ(u_0, η_0, t) for randomly chosen (u_0, η_0), for f_θ obtained with the two samplings. 1b: t → f_θ(u_0, η_0, t) for the (u_0, η_0) resulting in the highest point-wise error with the two samplings. 2a: (u_0, η_0) → max_{0≤t≤10} D^n(u_0, η_0, t). 2b: (u_0, η_0) → g_{θ_BS}(u_0, η_0) − g_{θ_TBS}(u_0, η_0).

Let D̂f²(x) denote the empirical estimator of Df²(x), as defined in equation 2 of Section 3.3 of the main document. Then D̂f²(x) → Var(f(x + ε)) as k → ∞. Since Var(f(x + ε)) = Df²(x) + O(‖ε‖₂³), D̂f²(x) is a biased estimator of Df²(x), with bias O(‖ε‖₂³). Hence, when ε̄ → 0, D̂f²(x) becomes an unbiased estimator of Df²(x).

best, mean ± se:
       4, 94.44 ± 0.78     | 99, 92.06 ± 0.66
BH     13.31, 13.38 ± 0.01 | 14.05, 14.06 ± 0.01
BC     99.12, 97.6 ± 0.34  | 98.25, 97.5 ± 0.11

The networks have been trained on 4 Nvidia K80 GPUs. The values of the hyperparameters used can be found in Appendix C. We compare the test accuracy of LeNet 5 + VBSW and ResNet20 + VBSW with the initial test accuracies of LeNet 5 and ResNet20 (baseline).
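For deep models, applying VBSW after the initial training amounts to re-fitting only the last layer with a sample-weighted loss on the frozen features. A weighted least-squares sketch of this idea, using the closed form θ = (ΦᵀWΦ)⁻¹ΦᵀWy; the feature matrix, labels and weights below are illustrative, and the paper's classifiers use a weighted cross-entropy rather than least squares:

```python
import numpy as np

def weighted_last_layer(Phi, y, w):
    """Weighted least squares on frozen features Phi: minimizes
    sum_i w_i * (phi_i . theta - y_i)^2."""
    W = np.diag(w)
    return np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)

# toy frozen features (e.g. penultimate-layer activations) and labels
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.5])
theta_uni = weighted_last_layer(Phi, y, np.ones(3))
theta_vbsw = weighted_last_layer(Phi, y, np.array([1.0, 1.0, 4.0]))
# up-weighting the third sample pulls Phi[2] @ theta closer to y[2] = 3.5
```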

RTE    62.31, 62.20 ± 0.01 | 60.99, 60.88 ± 0.01 | 61.88, 61.87 ± 0.01 | 60.98, 60.92 ± 0.01
MRPC   72.30, 71.71 ± 0.03 | 82.64, 80.72 ± 0.05 | 71.56, 70.92 ± 0.03 | 81.41, 80.02 ± 0.07

best, mean ± se for each method. For RTE, the metric used is accuracy (m1). For MRPC, metric 1 is accuracy and metric 2 is F1 score.

best, mean ± se of the training of a ResNet20 on Cifar10 for different label noise levels. These results illustrate the robustness of VBSW to label noise.



best, mean ± se of the training of 60 ResNet20s on Cifar10 for vanilla, VBSW, AB and AB + VBSW. Gain per model g is defined by g = max

Comparison between BS and TBS. Sampling | L2 error (×10⁻⁴) | L∞ (×10⁻¹) | AEG (×10⁻²) | AEL (×10⁻²)

For the Adam optimizer hyperparameters, we kept the default values of the Keras implementation. We chose these hyperparameters after simple grid searches.

Paper experiments hyperparameter values

APPENDIX B: DEMONSTRATIONS

INTUITION BEHIND TAYLOR EXPANSION (SECTION 3.2)

We want to approximate f: x → f(x), x ∈ R^{n_i}, f(x) ∈ R^{n_o}, with a NN f_θ. The goal of the approximation problem can be seen as being able to generalize to points not seen during the training. We thus want the generalization error J̃_X(θ) to be as small as possible. Given an initial data set {X_1, ..., X_N} drawn from X ∼ dP_X, the labels {f(X_1), ..., f(X_N)}, and the loss function L being the squared L2 error, recall that the integrated error J_X(θ), its estimation Ĵ_X(θ) and the generalization error J̃_X(θ) can be written

J_X(θ) = ∫_S ‖f(x) − f_θ(x)‖²_2 dP_X,   Ĵ_X(θ) = (1/N) Σ_{i=1}^N ‖f(X_i) − f_θ(X_i)‖²_2,   J̃_X(θ) = J_X(θ) − Ĵ_X(θ),

where ‖·‖_2 denotes the L2 norm. In the following, we find an upper bound for J̃_X(θ). We start by finding an upper bound for J_X(θ), and then for J̃_X(θ) using equation 7. Let S_i, i ∈ {1, ..., N}, be sub-spaces of a bounded space S such that S = ∪_{i=1}^N S_i and X_i ∈ S_i. Suppose that n_i = n_o = 1 and that f is twice differentiable. Let |S_i| = ∫_{S_i} dP_X. The volume |S| = 1 since dP_X is a probability measure, and therefore |S_i| < 1 for all i ∈ {1, ..., N}. Using a Taylor expansion at order 2, and since f_θ is K_θ-Lipschitz, we can first bound A_i(x) = f(x) − f_θ(x) on each S_i, and then bound J_X(θ).

These experiments, illustrated in Figure 6, show that the influence of m and k on the performances of the model can be different. For the BH data set, low values of k clearly lead to poorer performances. Hyperparameter m seems to have less impact, although it should be chosen not too far from its lowest value, 2. For the BC data set, on the contrary, the best performances are obtained for low values of k, while m can be chosen high. These experiments highlight that the impact of m and k can differ between classification and regression, but it could also differ depending on the data set. Hence, we recommend considering these hyperparameters like the many others involved in DL, and selecting their values using classical hyperparameter optimization techniques. This also shows that many different (m, k) pairs lead to error improvement.
This suggests that the weight approximation does not have to be exact for VBSW to be effective, as stated in Section 5.4.
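The role of m discussed above can be sketched as an affine rescaling of the raw local variances v_i into weights spanning [1, m], so that m controls the ratio between the largest and smallest sample weight. This rescaling is one natural reading of the hyperparameter, used here as an assumption rather than the paper's exact formula:

```python
import numpy as np

def rescale_weights(v, m):
    """Affine map of raw local variances v into [1, m]: the least
    informative point keeps weight 1, the most informative gets m."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    if span == 0.0:                  # all variances equal: uniform weights
        return np.ones_like(v)
    return 1.0 + (m - 1.0) * (v - v.min()) / span

w = rescale_weights([0.0, 2.0, 4.0], m=5.0)
# -> weights [1.0, 3.0, 5.0]
```

With this reading, k controls how local the variance estimate is, while m only stretches the resulting weights, which is consistent with the two hyperparameters having different effects on different data sets.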

