N-STUDENT LEARNING: AN APPROACH TO MODEL UNCERTAINTY AND COMBAT OVERFITTING

Abstract

This work presents N-Student Learning, a pseudo-label based multi-network training setup that can be applied to nearly any supervised learning architecture in order to help combat the problem of overfitting and control the way in which a network models uncertainty in the data. The effectiveness of N-Student Learning relies on the idea that a network's predictions on unseen data are largely independent of any instance-dependent noise in the labels. In N-Student Learning, each student network is assigned a subset of the training dataset such that no data point is in every student's training subset. Unbiased pseudo-labels can thus be generated for every data point in the training set by taking the predictions of appropriate student networks. Training on these unbiased pseudo-labels minimizes the extent to which each network overfits to instance-dependent noise in the data. Furthermore, based on prior knowledge of the domain, we can control how the networks learn to model uncertainty that is present in the dataset by adjusting the way that pseudo-labels are generated. While this method is largely inspired by the general problem of overfitting, a natural application is found in the problem of classification with noisy labels, a domain where overfitting is a significant concern. After developing intuition through a toy classification task, we proceed to demonstrate that N-Student Learning performs favorably on benchmark datasets when compared to state-of-the-art methods in the problem of classification with noisy labels.

1. INTRODUCTION

Overfitting is a fundamental problem in supervised classification in which a model learns properties of the training data that do not generalize to unseen data. If samples from the input space contain information that is not relevant to the task, the model may overfit by learning to use these irrelevant details for predictive purposes. Overfitting may also occur as a result of noise in the label space, in which the label provided in the dataset does not match the expected output of the model given the corresponding input. Overfitting due to label noise in its various forms will be a primary focus of this paper.

In this paper, we introduce N-Student Learning, a pseudo-label based multi-network training setup that mitigates overfitting to label noise. The idea is to relabel the dataset by taking the predictions of networks that have never seen the data. We do this by training multiple networks on different subsets of the data, so that the pseudo-labels they generate on their respective unseen subsets will be free of any instance-dependent noise. Training on these pseudo-labels results in networks that are less prone to overfitting. After introducing the architecture in Section 2, we discuss a few types of label noise that are commonly present in datasets. Using a toy classification problem in Section 3, we show the effect of the N-Student Learning setup and demonstrate that the setup can be adapted to handle different kinds of noise. Following this, in Section 4, we show that N-Student Learning performs favorably when compared to state-of-the-art methods on both artificially noisy and naturally noisy benchmark datasets.

2. METHOD

2.1 TRAINING SETUP

Supervised classification is the problem of learning a function that maps an input space X to a label space Y, given a dataset D = {(x_i, y_i)}_{i=1}^S, where x_i ∈ X, y_i ∈ Y, and S is the size of the dataset. In N-Student Learning, we train n student networks {N_i}_{i=1}^n in parallel, where each network N_i trains on a subset D_i of the dataset. In our work, it is assumed that these networks all share the same architecture. The subsets are generated under the constraint that for each sample (x_i, y_i) in the dataset, there is some network that does not train on that sample; this allows us to generate clean pseudo-labels for every sample. Formally, the following must hold: D = ⋃_{i∈{1,...,n}} D_i^c, where the superscript c denotes the set complement.

We begin training with a warmup phase, in which each N_i trains on its respective subset D_i for a small number of epochs without pseudo-labels. After the warmup phase, each student continues to train on its own subset D_i, but with a portion of the labels replaced by pseudo-labels generated by a student that has never seen the respective inputs. We define a hyperparameter p, the pseudo-label rate, which controls what proportion of the ground-truth labels is replaced by pseudo-labels. At the beginning of each epoch, we randomly select a subset D_sample of size ⌊pS⌋. Each pair (x_i, y_i) ∈ D_sample is replaced with (x_i, N_j^*(x_i)), where N_j is a network that does not train on x_i and N_j^*(x_i) is a pseudo-label generated from the prediction of N_j on x_i. After this pseudo-labeled dataset is generated, we train each student on its respective pseudo-labeled subset.

We experiment with three methods for generating pseudo-labels: hard, soft, and stochastic pseudo-labels. Hard pseudo-labels are generated by taking the argmax of the predicted logits and assigning that class as the correct label in the form of a one-hot vector.
Soft pseudo-labels are generated by using the distribution defined by the output of the softmax layer as the target vector. Lastly, we define stochastic pseudo-labels to be pseudo-labels generated by sampling from the distribution output by the network and representing the sampled class as a one-hot vector.

We restrict our experiments to cases in which each student is allocated a training subset of equal size. For 2 students, this means that each student trains on half of the data, and therefore generates pseudo-labels for the other half. For 3 students, each student can train on anywhere from one third to two thirds of the data under this equal-allocation restriction. To further constrain the choice of subsets, we only consider splits in which the fraction of the dataset per student lies on the boundary of the possible range of values. For n students, this means that each student trains on either 1/n or (n-1)/n of the data. If each student trained on less than 1/n of the dataset, part of the dataset would go unused, since |⋃_{i∈{1,...,n}} D_i| ≤ n|D_1| < |D|. If each student trained on more than (n-1)/n of the data, then each student could pseudo-label less than 1/n of the dataset, which is not enough to label the whole dataset. In the case that each student trains on 1/n of the data, the training subsets are disjoint, and for any student's subset, each of the n-1 other networks is able to pseudo-label the samples from that subset; each epoch, we choose a random network to pseudo-label these samples. If each student trains on (n-1)/n of the data, then there is only one network that can pseudo-label each sample, so there is no choice to be made.

There are a variety of ways to perform inference when training multiple networks. One option is to choose a random network after training and use it for inference. Alternatively, the students' predictions can be aggregated by averaging or multiplying class-wise before normalizing, which can improve performance.
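The three pseudo-label variants described above can be sketched as follows. This is an illustrative NumPy sketch rather than the paper's implementation; the function names and the batch-of-logits interface are our own assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_pseudo_labels(logits):
    """One-hot vectors at the argmax class of each prediction."""
    one_hot = np.zeros_like(logits, dtype=float)
    one_hot[np.arange(len(logits)), logits.argmax(axis=-1)] = 1.0
    return one_hot

def soft_pseudo_labels(logits):
    """The full softmax distribution, used directly as the target vector."""
    return softmax(logits)

def stochastic_pseudo_labels(logits, rng=None):
    """One-hot vectors for classes sampled from the softmax distribution."""
    rng = rng if rng is not None else np.random.default_rng(0)
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    for i, p in enumerate(probs):
        one_hot[i, rng.choice(len(p), p=p)] = 1.0
    return one_hot
```

Each function maps a (batch, classes) array of logits to a (batch, classes) array of targets, so any of the three can be dropped into a standard cross-entropy training loop.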

Important hyperparameters:

n: Number of students
p: Pseudo-label rate (proportion of labels replaced by pseudo-labels each epoch)
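The disjoint 1/n-per-student split and the per-epoch choice of pseudo-labelers can be sketched as below. This is a minimal sketch under our own assumptions (integer sample indices, helper names of our choosing), not the authors' code; it enforces the two constraints from the setup: every sample has a student that never trained on it, and ⌊pS⌋ samples are pseudo-labeled each epoch.

```python
import numpy as np

def split_disjoint(num_samples, n_students, seed=0):
    """Assign each sample index to exactly one student (the 1/n-per-student
    split), so every sample is unseen by the other n-1 students."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    return np.array_split(perm, n_students)

def choose_pseudo_labelers(subsets, p, seed=0):
    """Pick a random fraction p of all samples and, for each, a random student
    that did NOT train on it to generate its pseudo-label this epoch."""
    rng = np.random.default_rng(seed)
    n_students = len(subsets)
    num_samples = sum(len(s) for s in subsets)
    owner = np.empty(num_samples, dtype=int)  # which student trains on each sample
    for i, idxs in enumerate(subsets):
        owner[idxs] = i
    chosen = rng.permutation(num_samples)[: int(p * num_samples)]  # size floor(pS)
    labeler = {}
    for idx in chosen:
        others = [j for j in range(n_students) if j != owner[idx]]
        labeler[int(idx)] = int(rng.choice(others))  # random non-owner student
    return labeler
```

Calling `choose_pseudo_labelers` once per epoch resamples both which labels are replaced and which non-owner student produces each replacement, matching the per-epoch randomization described in Section 2.1.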

