MULTIPLE OUTPUT SAMPLES FOR EACH INPUT IN A SINGLE-OUTPUT GAUSSIAN PROCESS
Under review as a conference paper at ICLR 2023

Abstract

The standard Gaussian Process (GP) is formulated to consider only a single output sample for each input in the training set. Datasets for subjective tasks, such as spoken language assessment, may be annotated with output labels from multiple human raters for each input. This paper proposes to generalise the GP to allow for multiple output samples per input in the training set. This differs from a multi-output GP, because here all output samples are from the same task. The output density function is formulated as the joint likelihood of observing all output samples. Through this, the hyper-parameters are optimised using a criterion that is similar to minimising a Kullback-Leibler divergence. This is computationally cheaper than repeating the input for each output sample. The test set predictions are inferred in much the same way as in a standard GP, with a key difference being in the optimised hyper-parameters. This approach is evaluated on spoken language assessment tasks, using the public speechocean762 dataset and an internal Tamil language dataset. The results show that with the proposed method, the GP computes a test set output distribution that is more similar to the collection of reference outputs annotated by multiple human raters.

1. INTRODUCTION

The Gaussian Process (GP) (Rasmussen & Williams, 2006) expresses a prediction uncertainty that naturally increases for inputs further away from the training data. In contrast, Neural Networks (NNs) have been observed to yield overly confident predictions, even when the input is from a mismatched domain (Guo et al., 2017). This behaviour of a GP may allow better explainability of the model's predictions. Having explainable predictions of uncertainty may be especially desirable for tasks that are subjective in nature. In such subjective tasks, multiple human annotators may provide differing output labels for the same input. A collection of human annotations for the same input may therefore be interpreted as a reference of uncertainty that an automatic model should also aim to compute. In such settings, the uncertainties expressed by the model and the human annotators can be explicitly compared. However, the standard GP formulation assumes that each input in the training set is paired with only a single output, which is treated as the ground truth. This paper proposes to extend the GP formulation to accommodate situations where multiple samples of output labels for the same task are provided for each input. The hyper-parameters can be optimised and the test set predictions can be inferred, with the consideration of having multiple training set output samples, in a computationally cheaper manner than simply repeating the inputs for each output sample.

2. RELATED WORK

The multi-output GP is formulated in a multi-task framework (Yu et al., 2005; Bonilla et al., 2007). This treats the multiple outputs for each input as separate tasks. On the other hand, this paper considers a single-output GP with a single output task, where multiple output samples for each input are present for that task. This paper considers optimising the GP hyper-parameters using a criterion that is similar to minimising a distance to a reference output density function. When training a NN, the reference output can be in the form of a distribution, as opposed to a scalar or single class. Full-sum training in speech recognition (Yan et al., 1997) and handwriting recognition (Senior & Robinson, 1995) trains a NN toward the distributional reference formed by the soft forced alignment. In BLIND, NNs for Spoken Language Assessment (SLA) are trained toward the distribution represented by the scores from multiple human raters. The distributional output from one NN can also be used as a reference to train another NN toward (Li et al., 2014; Hinton et al., 2014).

3. GAUSSIAN PROCESS REGRESSION

When given a collection of N input feature vectors of dimension D, X ∈ R^{N×D}, a GP places a jointly Gaussian prior over latent variables, f ∈ R^N, as

p(f | X) = N(f; 0, K(X, X)).   (1)

Here, p(f | X) is an abbreviation of p(f = f | X = X), interpreted as the likelihood of the continuous random variables f taking the values f. A multivariate Gaussian density function with mean µ and covariance V is written N(f; µ, V). In a GP, the covariance of the latent variables is defined through pair-wise distances between the inputs, with the notion of distance defined by the kernel, K. In this paper, the squared exponential kernel is used, with kernel matrix elements defined as

k_ij(X, X) = s² exp( -(x_i - x_j)^⊤ (x_i - x_j) / (2l²) ),

where i and j are the matrix indexes, l is a length hyper-parameter, and s is a scale hyper-parameter. The GP makes the assumption that the outputs, y ∈ R^N, are conditionally independent of the inputs when given the latent variables. The outputs are Gaussian distributed about a mean of the latent variables, with a noise hyper-parameter, σ, to accommodate observational noise,

p(y | f) = N(y; f, σ²I),

where I is the identity matrix.
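As a concrete illustration, the squared exponential kernel and the Gaussian prior over the latent variables above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the data X and the hyper-parameter values s and l are arbitrary placeholders.

```python
import numpy as np

def sq_exp_kernel(X1, X2, s, l):
    """Squared exponential kernel: k_ij = s^2 exp(-||x_i - x_j||^2 / (2 l^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return s**2 * np.exp(-d2 / (2.0 * l**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))            # N = 5 inputs of dimension D = 3
K = sq_exp_kernel(X, X, s=1.0, l=1.0)  # prior covariance K(X, X)

# Draw one sample of the latent variables, f ~ N(0, K), via a Cholesky factor.
jitter = 1e-8 * np.eye(len(X))         # small diagonal term for numerical stability
f = np.linalg.cholesky(K + jitter) @ rng.normal(size=len(X))
```

Note that k_ii = s² on the diagonal, since the distance of each input to itself is zero.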

3.1. TRAINING

Training a GP involves estimating the hyper-parameters of the kernel, s and l, and of the observation noise, σ. One approach is to find the hyper-parameters that maximise the marginal log-likelihood of the training data, F = log p(y_ref | X), where the training data comprises pairs of observed features and reference outputs, y_ref. The marginal likelihood can be computed as

p(y | X) = ∫ p(y | f) p(f | X) df   (5)
         = N(y; 0, K(X, X) + σ²I).

Optimising the hyper-parameters using gradient-based methods requires the inversion of K(X, X) + σ²I (Rasmussen & Williams, 2006), which entails a number of computational operations that scales as O(N³), when using Gaussian elimination or when computing the singular value decomposition in a pseudo-inverse implementation. Algorithms, such as that of Petković & Stanimirović (2009), are able to reduce the polynomial power, but still require more than O(N²) operations.
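The marginal log-likelihood in equation (5) can be evaluated as follows. This is a minimal sketch of the objective F for fixed hyper-parameters, not the paper's full optimisation loop; the squared exponential kernel matches the definition above, and the data and hyper-parameter values are arbitrary placeholders.

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, s, l, sigma):
    """Evaluates log N(y; 0, K(X, X) + sigma^2 I) for a squared exponential kernel."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = s**2 * np.exp(-d2 / (2.0 * l**2))
    C = K + sigma**2 * np.eye(len(X))
    # The Cholesky factorisation costs O(N^3), the dominant term noted in the text.
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |C|
    n = len(y)
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
y = rng.normal(size=8)
F = gp_log_marginal_likelihood(X, y, s=1.0, l=1.0, sigma=0.1)
```

In practice the gradients of F with respect to s, l, and σ would be computed from the same Cholesky factor, so each optimisation step remains O(N³).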

3.2. INFERENCE

When performing evaluation, a test set of input feature vectors, X, is given, and the task is to infer the predicted outputs, y. Inference through a GP can be performed by first computing the density

