BATCH INVERSE-VARIANCE WEIGHTING: DEEP HETEROSCEDASTIC REGRESSION

Abstract

In model learning, when the training dataset on which the parameters are optimized and the testing dataset on which the model is evaluated are not sampled from identical distributions, we say that the datasets are misaligned. It is well-known that this misalignment can negatively impact model performance. A common source of misalignment is that the inputs are sampled from different distributions. Another source of misalignment is that the label generating process used to create the training dataset is imperfect. In this work, we consider this setting and additionally assume that the label generating process can quantify the contribution of each label to the misalignment between the datasets; we treat this quantity as privileged information. Specifically, we consider the task of regression with labels corrupted by heteroscedastic noise, and we assume access to an estimate of the noise variance for each sample. We propose a general approach that combines this privileged information with dataset statistics inferred from the mini-batch in the loss function to mitigate the impact of the dataset misalignment. We then propose a specific algorithm for the heteroscedastic regression case, called Batch Inverse-Variance weighting, which adapts inverse-variance weighting for linear regression to the case of neural network function approximation. We demonstrate that this approach achieves a significant improvement in network training performance compared to baselines when confronted with high, input-independent noise.
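The core idea, weighting each sample's squared error by the inverse of its estimated label-noise variance and normalizing the weights over the mini-batch, can be sketched as follows. This is a minimal NumPy illustration of classic inverse-variance weighting, not the full Batch Inverse-Variance algorithm; the function name and the stabilizing epsilon are our own choices.

```python
import numpy as np

def inverse_variance_mse(y_pred, y_true, sigma2, eps=1e-8):
    """Mini-batch MSE where each sample's squared error is weighted
    by 1 / sigma^2, with weights normalized to sum to 1 over the batch.

    y_pred, y_true : arrays of predictions and (noisy) labels
    sigma2         : per-sample estimates of the label-noise variance
    eps            : small constant to avoid division by zero
    """
    w = 1.0 / (sigma2 + eps)      # low-variance labels get high weight
    w = w / w.sum()               # batch-level normalization
    return np.sum(w * (y_pred - y_true) ** 2)
```

With equal variances this reduces to the ordinary batch mean of squared errors, while a sample with very large variance contributes almost nothing to the loss, which is the intended mitigation of noisy labels.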

1. INTRODUCTION

In supervised learning, a central assumption is that the samples in the training dataset, used to train the model, and the samples in the testing dataset, used to evaluate it, are drawn from identical distributions. Formally, for input x and label y, this assumption states that p_train(x, y) = p_test(x, y). It can be decomposed as the product p_train(x) · p_train(y|x) = p_test(x) · p_test(y|x), which holds if two conditions are respected:

1. The features in both datasets are sampled from the same distribution: p_train(x) = p_test(x). When this condition is violated, the training dataset is not representative.

2. The labels in both datasets are sampled from the same conditional distribution: p_train(y|x) = p_test(y|x). When this condition is violated, the training labels are noisy.

In practice, these conditions are not always respected, because gathering representative and precise data (including labels) can be arduous. In this case, the training and testing datasets are misaligned, and the performance of the deployed model may decrease, since the training process did not optimize the model's parameters on the correct data (Arpit et al., 2017; Kawaguchi et al., 2020). One possible cause of misalignment is uncertainty about the labels in the training set introduced by the labeling process. Since our objective is to optimize the performance of the model against ground-truth labels, we treat the labels in the test dataset as having no uncertainty, even though it may be impossible to collect such a dataset in practice. As a result, p_test(y|x) is a Dirac delta function, whereas p_train(y|x) is not, since it encapsulates the uncertainty of the labelling process; this leads to misalignment. In this paper, we propose an algorithm for more efficient model training in the case where we have some information about the sample-wise misalignment. More specifically, we examine the case of

