BATCH INVERSE-VARIANCE WEIGHTING: DEEP HETEROSCEDASTIC REGRESSION

Abstract

In model learning, when the training dataset on which the parameters are optimized and the testing dataset on which the model is evaluated are not sampled from identical distributions, we say that the datasets are misaligned. It is well-known that this misalignment can negatively impact model performance. A common source of misalignment is that the inputs are sampled from different distributions. Another source is that the label generating process used to create the training dataset is imperfect. In this work, we consider this setting and additionally assume that the label generating process is able to provide us with a quantity describing each label's contribution to the misalignment between the datasets, which we treat as privileged information. Specifically, we consider the task of regression with labels corrupted by heteroscedastic noise, and we assume that we have access to an estimate of the noise variance for each sample. We propose a general approach that incorporates this privileged information into the loss function, together with dataset statistics inferred from the mini-batch, to mitigate the impact of dataset misalignment. We then propose a specific algorithm for the heteroscedastic regression case, called Batch Inverse-Variance weighting, which adapts inverse-variance weighting for linear regression to the case of neural network function approximation. We demonstrate that this approach achieves a significant improvement in network training performance compared to baselines when confronted with high, input-independent noise.

1. INTRODUCTION

In supervised learning, a central assumption is that the samples in the training dataset, used to train the model, and the samples in the testing dataset, used to evaluate the model, are drawn from identical distributions. Formally, for input x and label y, this assumption implies that p_train(x, y) = p_test(x, y). This assumption can be decomposed as the product p_train(x) · p_train(y|x) = p_test(x) · p_test(y|x), which holds if two conditions are respected:
1. The features in both datasets are sampled from the same distribution: p_train(x) = p_test(x). When this condition is violated, the training dataset is not representative.
2. The labels in both datasets are sampled from the same conditional distribution: p_train(y|x) = p_test(y|x). If this condition is violated, the training labels are noisy.
In practice, these assumptions are not always respected, because gathering representative and precise data (including labels) can be arduous. In this case, the training and testing datasets are misaligned, and the performance of the deployed model may decrease since the training process did not actually optimize the model's parameters based on the correct data (Arpit et al., 2017; Kawaguchi et al., 2020). One possible reason for misalignment is uncertainty about the labels in the training set introduced by the labeling process. Since our objective is to optimize the performance of the model against ground-truth labels, we should consider the labels in the test dataset to have no uncertainty, even though it may be impossible to collect such a dataset in practice. As a result, p_test(y|x) is a Dirac delta function, whereas p_train(y|x) is not, since it encapsulates the uncertainty of the labelling process; this leads to misalignment. In this paper, we propose an algorithm for more efficient model training in the case where we have some information about the sample-wise misalignment.
More specifically, we examine the case of regression with a deep network where labels are corrupted by heteroscedastic noise. We assume that we have access to at least an estimate of the variance of the noise distribution that corrupted each label, information that is available if the labels are generated by a stochastic process that is capable of jointly reporting its uncertainty. We examine how knowledge of this label noise variance estimate can be used to mitigate the effect of the noise on the learning process of a deep neural network. We refer to our method as Batch Inverse-Variance (BIV); inspired by information theory, it re-weights samples using both the sample-wise variance and statistics computed over the entire mini-batch. BIV shows a strong empirical advantage over the L2 loss as well as over simple filtering of the samples based on a threshold over the variance.¹ Our claimed contributions are threefold:
1. A definition of the problem of learning with information quantifying the misalignment between datasets, for the case of heteroscedastic noisy labels in regression.
2. A general formulation of how to use the mini-batch to infer statistics of the dataset and incorporate this information into the loss function when training neural networks.
3. Batch Inverse-Variance as an instantiation of this framework, with a demonstration of its usefulness on regression tasks with labels corrupted by heteroscedastic noise.
The outline of the paper is as follows. In section 2, we describe the task of regression with heteroscedastic noisy labels and its parallels with learning with privileged information, and we explain the challenges of applying classical heteroscedastic regression methods to stochastic gradient descent. In section 3, we position our work within the existing literature on learning with noisy labels.
In section 4, we present a general framework for incorporating information about dataset misalignment into the mini-batch loss, and we introduce BIV within this framework to tackle heteroscedastic regression. In section 5, we describe the setup of the experiments we conducted to validate the benefits of BIV, and we present and analyze the results in section 6.
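To make the idea concrete before the formal treatment in section 4, the following is a minimal sketch of an inverse-variance weighted mini-batch loss. The ε regularization constant and the normalization-by-sum scheme are illustrative assumptions for this sketch, not necessarily the exact formulation derived later in the paper:

```python
import numpy as np

def biv_loss(predictions, targets, variances, eps=1e-3):
    """Sketch of a Batch Inverse-Variance weighted squared loss.

    Each sample k is weighted by 1 / (sigma^2_k + eps), and the weights
    are normalized over the mini-batch so that the loss scale remains
    comparable across batches with different noise levels.
    """
    weights = 1.0 / (variances + eps)
    weights = weights / weights.sum()  # mini-batch statistics
    return np.sum(weights * (predictions - targets) ** 2)

preds = np.array([1.0, 2.0])
targs = np.array([0.0, 0.0])

# Equal variances: reduces to an (unweighted) mean squared error.
equal = biv_loss(preds, targs, np.array([1.0, 1.0]))

# Unequal variances: the low-noise sample dominates the loss.
skewed = biv_loss(preds, targs, np.array([0.001, 10.0]))
```

With equal variances the weights are uniform, so the loss matches the ordinary mean squared error; with highly unequal variances, the gradient signal is concentrated on the reliably-labelled samples.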

2.1. HETEROSCEDASTIC NOISY LABELS IN REGRESSION

Here, we introduce how heteroscedastic noisy labels can be generated in regression and how the variance can be known. Consider an unlabelled dataset of inputs {x_i}. To label it, one must apply to each input x_i an instance of a label generator which should provide its true label y_i. This label generator has access to some features z_i correlated with x_i. We define LG_j : Z → R. When the labelling process is not exact and introduces noise on the label, the noisy label of x_i provided by LG_j is denoted ỹ_{i,j}. Noise on a measured or estimated value is often represented by a Gaussian distribution, based on the central limit theorem, as most noisy processes are the sum of several independent variables. Gaussian distributions are also mathematically convenient, although they have drawbacks, as they can only represent unimodal and symmetric noise (Thrun et al., 2006). We model:
ỹ_{i,j} = y_i + δy_{i,j}, with δy_{i,j} ~ N(0, σ²_{i,j})
Here σ²_{i,j} can be a function of z_i and LG_j, without any assumption on its dependence on one or the other. We finally assume that the label generator is able to provide an estimate of σ²_{i,j}, and is therefore redefined as LG_j : Z → R × R≥0. The training dataset is formed of triplets (x_i, σ²_{i,j}, ỹ_{i,j}), renamed (x_k, σ²_k, ỹ_k) for triplet k for simplicity. This setup describes many labelling processes, such as:
Crowd-sourced labelling: In the example case of age estimation from facial pictures, labellers Alice and Bob are given z_i = x_i, the picture of someone's face, and are asked to estimate the age of that person. Age is harder to estimate for older people (5 and 15 years of age are harder to confuse than 75 and 85), suggesting a correlation between σ²_{i,j} and z_i. But Alice and Bob may also have been given different instructions regarding the precision needed, inducing a correlation between σ²_{i,j} and LG_j.
Finally, there may be some additional interactions between z_i and LG_j: for example, Alice may know Charlie, recognize him in the picture, and label his age with a lower variance.
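The label generation process above can be simulated as follows. The ground-truth function and the variance-sampling scheme here are arbitrary choices for illustration, not part of the paper's setup:

```python
import numpy as np

def label_generator(x, rng):
    """Simulates one LG_j call: returns (noisy label, noise variance estimate).

    The label generator observes the input, draws a heteroscedastic noise
    variance sigma^2, corrupts the true label with Gaussian noise of that
    variance, and reports both the noisy label and sigma^2.
    """
    y_true = np.sin(x)               # arbitrary ground-truth labelling function
    sigma2 = rng.uniform(0.0, 0.5)   # sample-dependent noise variance
    y_noisy = y_true + rng.normal(0.0, np.sqrt(sigma2))
    return y_noisy, sigma2

rng = np.random.default_rng(0)
xs = rng.uniform(-3.0, 3.0, size=100)

# Build the training set of triplets (x_k, sigma2_k, y_noisy_k)
dataset = []
for x in xs:
    y_noisy, sigma2 = label_generator(x, rng)
    dataset.append((x, sigma2, y_noisy))
```

Each triplet matches the (x_k, σ²_k, ỹ_k) form described above, with the variance estimate stored alongside the noisy label.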



¹ Our code is available in the supplemental material and will be publicly released after the reviewing process.

