LEARNING WHAT NOT TO MODEL: GAUSSIAN PROCESS REGRESSION WITH NEGATIVE CONSTRAINTS

Abstract

Gaussian Process (GP) regression fits a curve on a set of datapairs, each consisting of an input point 'x' and its corresponding target regression value 'y(x)' (a positive datapair). But what if, for an input point 'x', we want to constrain the GP to avoid a target regression value 'ȳ(x)' (a negative datapair)? This requirement often arises in real-world navigation tasks, where an agent needs to avoid obstacles, such as furniture items in a room, when planning a trajectory. In this work, we propose to incorporate such negative constraints in a GP regression framework. Our approach, 'GP-NC' or Gaussian Process with Negative Constraints, fits the positive datapairs while avoiding the negative datapairs. Specifically, our key idea is to model each negative datapair with a small blob of Gaussian distribution and maximize its KL divergence from the GP. We jointly optimize the GP-NC for both the positive and negative datapairs. We empirically demonstrate that our GP-NC framework performs better than traditional GP learning, that it does not affect the scalability of Gaussian Process regression, and that it helps the model converge faster as the size of the data increases.

1. INTRODUCTION

Gaussian processes are among the most studied model classes for data-driven learning, as they are nonparametric, flexible function classes that require little prior knowledge of the process. Traditionally, GPs have found applications in various fields of research, including navigation systems (e.g., in Wiener and Kalman filters) (Jazwinski, 2007), geostatistics and meteorology (Kriging (Handcock & Stein, 1993)), and machine learning (Rasmussen, 2006). This wide range of applications can be attributed to the ability of GPs to model target uncertainty by providing a predictive variance over the target variable. Gaussian Process regression in its current construct fits only a set of positive datapairs, each consisting of an input point and its desired target regression value, to learn a distribution over a functional space. However, in some cases, more information is available in the form of datapairs where, at a particular input point, we want the GP to avoid a range of regression values during curve fitting. We designate such data as negative datapairs. An illustration where modeling such negative datapairs would be extremely beneficial is given in Fig 1. In Fig 1(b), an agent wants to model a trajectory such that it covers all the positive datapairs marked by 'x'. However, it is essential to note that the agent would run into an obstacle if it models its trajectory based only on the positive datapairs. We can handle this problem of navigating in the presence of obstacles in two ways: one is to collect a high density of positive datapairs near the obstacle; the other, more straightforward, approach is to simply mark the obstacle as a negative datapair. The former approach would unnecessarily increase the number of positive datapairs for the GP to regress and may hence run into scalability issues.
In the latter approach, however, if the point is denoted as a negative datapair with a sphere of negative influence around it, as illustrated by Fig 1(c), the new trajectory can be modeled with a smaller number of datapairs while still accounting for all obstacles. Various GP methods in their current framework lack the ability to incorporate such negative datapairs in the regression paradigm. Contributions: In this paper, we explore the concept of negative datapairs. We provide a simple yet effective GP regression framework, called GP-NC, which fits the positive datapairs while avoiding the negative datapairs. Specifically, our key idea is to model each negative datapair using a small Gaussian blob and maximize its KL divergence from the GP. Our framework can be easily incorporated into various types of GP models (e.g., exact, SVGP (Hensman et al., 2013), PPGPR (Jankowiak et al., 2019)) and works well in scalable settings too. We empirically show in §5 that the inclusion of negative datapairs in training improves both the accuracy and the convergence rate of the algorithm.

2. REVIEW OF GAUSSIAN PROCESS REGRESSION

We briefly review the basics of Gaussian Process regression, following the notation in (Wilson et al., 2015). For a more comprehensive discussion of GPs, refer to (Rasmussen, 2006). A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen, 2006). We consider a dataset D with n D-dimensional input vectors, X = {x_1, ..., x_n}, and a corresponding n × 1 vector of targets y = (y(x_1), ..., y(x_n))^T. The goal of GP regression is to learn a function f that maps elements from the input space to a target space, i.e., y(x) = f(x) + ε, where ε is i.i.d. noise. If f(x) ∼ GP(µ, k_θ), then any collection of function values f has a joint multivariate normal distribution:

$$ f = f(X) = [f(x_1), \ldots, f(x_n)]^T \sim \mathcal{N}(\mu_X, K_{X,X}), \quad (1) $$

with the mean vector and covariance matrix defined by the functions of the Gaussian Process as (µ_X)_i = µ(x_i) and (K_{X,X})_{ij} = k_θ(x_i, x_j). The kernel function k_θ of the GP is parameterized by θ. Assuming additive Gaussian noise, y(x)|f(x) ∼ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

$$ f_* \mid X_*, X, y, \theta, \sigma^2 \sim \mathcal{N}(\mathbb{E}[f_*], \mathrm{cov}(f_*)), $$
$$ \mathbb{E}[f_*] = \mu_{X_*} + K_{X_*,X}\left(K_{X,X} + \sigma^2 I\right)^{-1} y, $$
$$ \mathrm{cov}(f_*) = K_{X_*,X_*} - K_{X_*,X}\left(K_{X,X} + \sigma^2 I\right)^{-1} K_{X,X_*}. \quad (2) $$

K_{X_*,X} denotes the n_* × n covariance matrix between the GP evaluated at X_* and X; other covariance matrices follow similar conventions. µ_{X_*} is the n_* × 1 mean vector for the test points, and K_{X,X} is the n × n covariance matrix computed on the training inputs X. The hyperparameter θ implicitly affects all the covariance matrices under consideration.
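As a concrete illustration of the predictive equations above, the following is a minimal NumPy sketch (not the GPyTorch-based implementation used in the experiments) that computes E[f_*] and cov(f_*) for a zero-mean GP; the RBF kernel and lengthscale below are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel: k(x, x') = exp(-||x - x'||^2 / (2 l^2)).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def gp_predict(X, y, X_star, noise=0.1, lengthscale=1.0):
    # Predictive mean and covariance of Eq. (2), with a zero prior mean.
    K = rbf_kernel(X, X, lengthscale)
    K_s = rbf_kernel(X_star, X, lengthscale)
    K_ss = rbf_kernel(X_star, X_star, lengthscale)
    Ky = K + noise**2 * np.eye(len(X))
    alpha = np.linalg.solve(Ky, y)      # (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(Ky, K_s.T)      # (K + sigma^2 I)^{-1} K_{X,X*}
    mean = K_s @ alpha                  # E[f_*]
    cov = K_ss - K_s @ V                # cov(f_*)
    return mean, cov
```

Note that the linear systems are solved directly rather than by forming the explicit inverse, which is the numerically preferred way to evaluate Eq. (2). Far away from the training inputs, the predictive mean reverts to the prior mean and the predictive variance to the prior variance.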

2.1. GPs: LEARNING AND MODEL SELECTION

We can view the GP in terms of fitting a joint probability distribution,

$$ p(y, f \mid X) = p(y \mid f, \sigma^2)\, p(f \mid X), \quad (3) $$

and we can derive the marginal likelihood of the targets y as a function of the kernel parameters alone by integrating out the functions f in the joint distribution of Eq. (3). A nice property of the GP is that this marginal likelihood has an analytical form:

$$ \mathcal{L}(\theta) = \log p(y \mid \theta, X) = -\frac{1}{2} y^T \left(K_\theta + \sigma^2 I\right)^{-1} y - \frac{1}{2} \log\left|K_\theta + \sigma^2 I\right| - \frac{N}{2} \log(2\pi), \quad (4) $$

where we have used K_θ as shorthand for K_{X,X} given θ. The process of kernel learning is that of optimizing Eq. (4) w.r.t. θ. The first term on the right-hand side of Eq. (4) is responsible for model fitting, while the second term is a complexity penalty that maintains Occam's razor over realizable functions, as shown by (Rasmussen & Ghahramani, 2001). The marginal likelihood involves inverting and evaluating the determinant of an n × n matrix, whose naive implementation requires O(n³) computation and O(n²) storage. Approaches like Scalable Variational GP (SVGP) (Hensman et al., 2013) and parametric GPR (PPGPR) (Jankowiak et al., 2019) have proposed approximations that lead to much better scalability. Please refer to Appendix A for details.
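The analytical form of Eq. (4) can be sketched in a few lines of NumPy; the Cholesky-based evaluation below is a standard numerically stable way to compute the inverse-quadratic and log-determinant terms (an illustrative sketch, not the paper's code):

```python
import numpy as np

def log_marginal_likelihood(y, K, noise):
    # Eq. (4): -1/2 y^T (K + s^2 I)^{-1} y - 1/2 log|K + s^2 I| - (N/2) log(2 pi).
    n = len(y)
    Ky = K + noise**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                    # stable inverse and log-det
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log|K + s^2 I|
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
```

The Cholesky factorization costs the same O(n³) as the naive inverse but avoids the numerical pitfalls of computing determinants and inverses explicitly, which is why scalable approximations such as SVGP are needed for large n.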

3. GP REGRESSION WITH NEGATIVE DATAPAIRS

As shown in Fig. 1 , we want the model to avoid certain negative datapairs in its trajectory. In other words, we want the trajectory of the Gaussian Process to have a very low probability of passing through these negative datapairs. In this section, we will first formalize the functional form of the negative datapairs and then subsequently describe our framework called GP-NC regression.

3.1. DEFINITION OF POSITIVE & NEGATIVE DATAPAIRS

Positive datapairs: The set of datapairs through which the GP should pass are defined as positive datapairs. We assume a set of n datapairs (input, positive target) with D-dimensional input vectors X = {x_1, ..., x_n} and a corresponding n × 1 vector of target regression values y = {y(x_1), ..., y(x_n)}. Negative datapairs: The set of datapairs which the GP should avoid (obstacles) are defined as negative datapairs. We assume a set of m datapairs (input, negative target) with D-dimensional input vectors X̄ = {x̄_1, ..., x̄_m} and a corresponding set of negative targets ȳ = {ȳ(x̄_1), ..., ȳ(x̄_m)}. The sample value of the GP at input x̄_i, given by f(x̄_i), should be far from the negative target regression value ȳ(x̄_i). Note that a particular input x can be in both the positive and the negative datapair sets. This happens when, at a particular input, we want the GP regression value to be close to its positive target y(x) and far from its negative target ȳ(x).

3.2. FUNCTIONAL REPRESENTATION OF NEGATIVE DATAPAIRS

For our framework, we first obtain a functional representation of the negative datapairs. We define a Gaussian distribution around each negative datapair, q(ȳ|x̄) ∼ N(ȳ(x̄), σ²_neg), with mean equal to the negative target value ȳ(x̄) and variance σ²_neg, which is a hyperparameter. The Gaussian blob can also be thought of as the area of influence of the negative datapair, with σ_neg indicating the spread of its influence.
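A minimal sketch of this functional representation: the log-density of one negative blob. The default σ_neg = 1.2 is borrowed from the Fig. 2 hyperparameters as an illustrative value, and the function name is hypothetical:

```python
import math

def negative_blob_logpdf(value, y_bar, sigma_neg=1.2):
    # Log-density of the Gaussian "area of influence" q ~ N(y_bar, sigma_neg^2)
    # placed around one negative target y_bar; sigma_neg (a hyperparameter)
    # controls the spread of the blob's influence.
    return (-0.5 * math.log(2 * math.pi * sigma_neg**2)
            - (value - y_bar) ** 2 / (2 * sigma_neg**2))
```

The density peaks at the negative target and decays with distance, so a GP sample far from ȳ(x̄) falls outside the blob's region of influence.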

3.3. GP-NC REGRESSION FRAMEWORK

The aim of our GP-NC regression framework is to simultaneously fit the GP regression on the positive datapairs (X) and avoid the negative datapairs (X̄), i.e., use them as negative constraints (NC). The former is achieved by maximizing the marginal likelihood given in Eq. (4). To avoid the negative datapairs, we want our GP model to adjust its distribution curve so that samples drawn from the predictive GP distribution do not lie in the influence region of the negative datapairs. To this end, we propose to fit the GP regression model on the positive datapairs while maximizing the Kullback-Leibler (KL) divergence between the distribution of the GP regression model and the Gaussian distributions defined over the negative datapairs. Mathematically, we want to maximize the following KL divergence:

$$ \Delta = \arg\max_\theta D_{KL}\left(p(y \mid \theta, X) \,\|\, q(\bar{y} \mid \bar{X})\right). \quad (5) $$

We choose to maximize the D_KL term in the p → q direction, as this fixes the negative-datapair distribution q(ȳ|X̄) as the reference probability distribution. Since the KL divergence is unbounded, the following section describes a practical workaround to maximizing it.

Algorithm 1: GP-NC regression. We alternately update the negative log-likelihood and the KL divergence term with respect to the kernel parameters θ. For different GP methods, the appropriate log-likelihood term (NLL) can be plugged in.

Input: Datapairs {X, y}+, {X̄, ȳ}−
Parameters: GP kernel parameters θ
Hyperparameters: σ_neg, λ
while not converged do
    NLL = −log p(y|θ, X);  θ ← minimize(NLL);
    KL_div = λ · log D_KL(p(ŷ|θ, X̄) || N(ȳ, σ²_neg));  θ ← maximize(KL_div)
end
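The net effect of the two alternating updates can be summarized by a single scalar objective combining the positive-pair NLL with the log-KL repulsion from the negative blobs; the toy helper below (hypothetical names, not the paper's code) shows how a larger KL from the negative blobs lowers the loss, so the optimizer is pushed away from them:

```python
import math

def gp_nc_loss(nll, kl_values, lam=0.1):
    # Combined GP-NC objective: positive-pair NLL minus lambda * log(D_KL)
    # summed over the negative-pair Gaussian blobs. Pushing the GP away from
    # the blobs (larger KL values) strictly lowers the loss.
    return nll - lam * sum(math.log(kl) for kl in kl_values)
```

In the alternating scheme of Algorithm 1, the two terms of this loss are optimized in turn rather than jointly, but the direction of each update is the same as descending this combined objective.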

3.3.1. MAXIMIZING KL DIVERGENCE USING THE LOGARITHM TRICK

Eq. (5) increases the distance between the GP distribution and the negative-datapair distribution by maximizing the KL divergence. However, the KL divergence is an unbounded function, i.e., D_KL ∈ [0, ∞). Implementing the divergence directly in the form of Eq. (5) can therefore create problems for the gradient updates and convergence. We also want to maximize the marginal log-likelihood of Eq. (4) and the Δ term simultaneously. This raises a problem of mismatched scales between the marginal log-likelihood term and the D_KL term, as the values of the D_KL divergence will be significantly higher. The gradient update would thus be dominated by the Δ term; in essence, the model would fixate more on avoiding the negative datapairs than on fitting the curve to the positive datapairs. We also observed this empirically. Hence, to suppress the gradient update from the Δ term, we encapsulate Eq. (5) in a logarithm. This turns out to be beneficial in multiple ways. Firstly, maximizing the D_KL term is equivalent to maximizing log(D_KL), as log is a monotonically increasing function. Secondly, and more importantly, the magnitude of the Δ term becomes comparable to that of the marginal log-likelihood term, which makes convergence stable.

3.3.2. GP-NC: LEARNING AND MODEL SELECTION

We apply a log function to the D_KL given in Eq. (5) and write the combined objective function for our GP-NC regression:

$$ \mathcal{L}(\theta) = \arg\min_\theta \left[ -\log p(y \mid \theta, X) - \lambda \log D_{KL}\left(p(y \mid \theta, X) \,\|\, q(\bar{y} \mid \bar{X})\right) \right], \quad (6) $$

where p(y|θ, X) is the marginal likelihood term representing the model fitted on the observed datapoints. The parameter λ is the tradeoff hyperparameter between curve fitting and avoidance of the negative datapoints, i.e., how relaxed the negative constraints can be. We already know the analytical form of the log-likelihood term from Eq. (4). We now focus on the log(D_KL) term. Since both the likelihood p(y|θ, X) and the negative-datapair distributions q(ȳ|X̄) are modeled using Gaussians, we can simply use the analytical form of the KL divergence between two Gaussian distributions:

$$ D_{KL}(p, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}, \quad (7) $$

where p and q are Gaussian distributions N(µ₁, σ₁²) and N(µ₂, σ₂²), respectively. The D_KL term adjusts the mean and variance of the likelihood p(y|θ, X) with respect to the fixed blobs of Gaussian distributions around the negative datapairs. Specifically, we take p ≡ N(µ₁, σ₁²) ≡ p(y|θ, X) and q ≡ N(µ₂, σ₂²) ≡ q(ȳ|X̄) in Eq. (7). Referring to Eq. (2), µ₁ = E[f_*] and σ₁² = cov(f_*), which contain the parameters θ of the GP that are optimized; µ₂ and σ₂ correspond to the hyperparameters of the Gaussian distributions representing the negative datapairs and are constant. Algorithm (1) gives an overview of the training of GP-NC regression. A note on the difference between GP-NC regression and general classification settings: in GP-NC regression, the boundary for every negative datapair is optimized independently of the others, whereas in classification settings all the negative points belong to a class and jointly affect the decision boundary of the GP for that class.
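Eq. (7) translates directly into code; a small sketch for univariate Gaussians, where the σ arguments are standard deviations:

```python
import math

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), the analytical form of Eq. (7).
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2.0 * sigma2**2)
            - 0.5)
```

As expected, the divergence is zero for identical Gaussians and grows as the GP's predictive mean µ₁ moves toward or away from the fixed blob mean µ₂, which is exactly the quantity the log(D_KL) term in Eq. (6) pushes to be large.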

3.4. SPARSE GAUSSIAN PROCESSES WITH NEGATIVE DATAPOINTS

In Appendix A, we show that it is straightforward to modify the class of scalable and sparse GP regression models to account for the negative datapairs in their formulation. In particular, we review the SVGP model by (Hensman et al., 2013), a popular scalable implementation of GPs. We also investigate the recent parametric Gaussian Process regressor (PPGPR) by (Jankowiak et al., 2019). We evaluate the performance of these methods with our GP-NC framework in the experiments section.

4. RELATED WORKS

Classical GP: To the best of our knowledge, the classical GP regression introduced in (Rasmussen, 2006) and many subsequent works focus primarily on positive datapairs for curve fitting. Even without the concept of negative datapairs, GP regression methods have been widely used for obstacle-aware navigation, which is one of the relevant applications for evaluating our GP-NC framework. GPs for navigation: GPs are extensively used in the field of navigation and are often a component of path planning algorithms. (Ellis et al., 2009) used GP regression to model pedestrian trajectories using positive datapairs. (Aoude et al., 2013) used a heuristic-based approach over GP regression to incorporate dynamic changes and environmental constraints in the surroundings. Their solution, named RR-GP, builds a learned motion pattern model by combining the flexibility of GPs with the efficiency of RRT-Reach, a sampling-based reachability computation. Obstacle trajectory GP predictions are conditioned on dynamically feasible paths identified from the reachability analysis, yielding more accurate predictions of future behavior. (Goli et al., 2018) introduced the use of GP regression for long-term location prediction for collision avoidance in a Connected Vehicle (CV) environment. The GPs model the trajectories of the vehicles using historical data; the data collected from vehicles, together with GPR models received from infrastructure, are then used to predict the future trajectories of vehicles in the scene. (Meera et al., 2019) designed an Obstacle-aware Adaptive Informative Path Planning (OA-IPP) algorithm for target search in cluttered environments using UAVs. This method uses a GP to detect obstacles/targets, which the UAV obtains by marking a dense set of points (positive datapairs) around the obstacles. (Hewing et al., 2020; Yuan & Kitani, 2019) are some of the works using sampling-based techniques for trajectory prediction.
(Choi et al., 2015) is one of the few works that tries to incorporate the concept of negative datapairs in the classical GP construct, by introducing a leveraged parameter in the kernel function. The authors demonstrate that the ability to incorporate negative targets increases the efficiency of trajectory predictions. However, this approach fundamentally differs from ours in how the negative datapairs are incorporated: ours maximizes the KL divergence between ŷ(x) and ȳ(x), while theirs utilizes an additional leveraged parameter in the kernel function. Besides, our approach is more scalable, as the size of the covariance matrix does not increase to accommodate the negative datapairs. Scalable GP: A naïve implementation of GP regression is not scalable to large datasets, as model selection and inference require O(n³) computation and O(n²) storage. Since the GP-NC framework is quite generic and works for GP methods at various scales, we highlight a few of these methods: (Hensman et al., 2013; Dai et al., 2014; Gal et al., 2014) are some of the well-known scalable methods suitable for our framework, as they use stochastic gradient descent. Negative datapairs in other domains: The concept of negative datapairs has been extensively utilized in self-supervised learning. Applications include learning word embeddings (Mikolov et al., 2013; Mnih & Kavukcuoglu, 2013), image representations (He et al., 2020; Misra & Maaten, 2020; Feng et al., 2019), video representations (Sermanet et al., 2018; Fernando et al., 2017; Misra et al., 2016; Harley et al., 2020), etc. In these works, negative and positive samples are created as pseudo-labels to train a neural network to learn deep representations of the inputs.

5. EXPERIMENTS

We compared various GP regression models in their classical form (using only positive datapairs) with their corresponding GP-NC regression models using our negative-constraints framework. We used Negative Log-Likelihood (NLL) and Root Mean Squared Error (RMSE) as our evaluation metrics. We compared our framework on a synthetic dataset and six real-world datasets. Throughout our experiments, we found that for every GP model, the GP-NC regression framework outperforms its corresponding classical GP regression setting. We used GPyTorch (Gardner et al., 2018) to implement all the GP models (ours + baselines). We use a zero mean and the RBF kernel for the GP prior in all models unless mentioned otherwise.

5.1. SYNTHETIC DATASET: VISUALIZING THE GP-NC REGRESSION FRAMEWORK

We aim to visualize the GP-NC regression framework using a toy dataset. We sampled 400 positive datapairs from a sinusoidal function and randomly sampled 15 negative datapairs, as represented in Fig. 2. We trained a sparse SVGP model to regress a curve on the positive datapairs using the classical GP framework and using GP-NC with negative constraints. For the top figures of Fig. 2, we trained SVGP with 80 inducing points, all at the starting location of the training input range. For the bottom figures of Fig. 2, we randomly sampled 10 inducing points from the range of training inputs. SVGP with a constant mean and an RBF kernel for the GP prior was used. After training the SVGP model in both settings for 100 epochs, we obtain the curves depicted in the left-side figures of Fig. 2. The inability to incorporate the information provided by the negative datapairs in the classical GP construct hinders its ability to fit the data well, as is patently visible in the left figures: the mean and predictive variance not only miss some positive datapairs but also engulf the negative datapairs in the confidence region, which is undesirable. Our GP-NC framework re-calibrates the curve by integrating the information provided by the negative datapairs, as seen in the right-hand figures of Fig. 2. As evident from the figure, the additional information from a few negative datapairs helps the model fit the positive datapairs better in addition to avoiding most of the negative datapairs. We can tune the value of λ in Eq. (6) to balance the weight the GP gives to positive versus negative datapairs; decreasing λ reduces the influence of the negative datapairs. Notice that the curve learned by our approach fits well even with sub-optimal inducing points.
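A toy dataset of this kind could be generated as follows; the exact sampling scheme of the experiment is not specified above, so the input range and the negative-offset distribution below are illustrative assumptions:

```python
import numpy as np

def make_toy_data(n_pos=400, n_neg=15, seed=0):
    # Positive pairs: noisy samples from a sinusoid (assumed noise level 0.1).
    # Negative pairs: points offset away from the curve by at least 0.5,
    # standing in for obstacles the regression should avoid.
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 4.0 * np.pi, n_pos)
    y = np.sin(x) + 0.1 * rng.normal(size=n_pos)
    x_neg = rng.uniform(0.0, 4.0 * np.pi, n_neg)
    y_neg = np.sin(x_neg) + rng.choice([-1.0, 1.0], n_neg) * rng.uniform(0.5, 1.0, n_neg)
    return (x, y), (x_neg, y_neg)
```

The positive pairs trace the sinusoid while every negative pair sits a visible distance off the curve, mirroring the black-vs-red layout of Fig. 2.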

5.2. TRAJECTORY PREDICTION USING THE GP-NC REGRESSION FRAMEWORK

We want to model an agent's trajectory using a GP regression model that takes the agent's present location (x, y) as input and predicts the agent's next location (x̂, ŷ). For this set of experiments, we synthesize a 2D virtual traffic scene, shown in Fig. 4. The road contains pitfalls, roadblocks, accidents, etc. that need to be avoided; these are represented as red diamonds in the figure. We designate these as negative targets that must be avoided to ensure traffic safety. There are a total of 10 negative datapairs in the scene. Next, we sample 250 observed coordinates on the 2D virtual path for modeling the future trajectory of the agent. We trained a classical SVGP model and our SVGP-NC model to predict the trajectory of the agent. For both models we utilize a constant mean and an RBF kernel for the GP prior, and both were trained for 100 epochs. It can be observed from Fig. 4(a) that the classical GP framework lacks the ability to incorporate negative datapairs, which results in a loosely fitted GP model. On the other hand, when trained with the additional constraints given by the negative datapairs, GP-NC fits a tighter curve on the observed datapairs while avoiding all the negative datapairs, as shown in Fig. 4(b). Moreover, this set of experiments makes it easy to demonstrate the impact of the λ value from Eq. (6) on the GP regression: decreasing λ reduces the influence of the negative datapairs, as shown in Fig. 4(c).

5.3. REAL-WORLD DATASETS

We evaluated our GP-NC framework on six real-world datasets, with the number of datapoints ranging over N ∼ 1500, 5000, 15000, 50000, 450000 and the number of input dimensions d ∈ [3, 127]. Five of the six datasets are from the UCI repository (Dua & Graff, 2017) (Wine quality - red, white, Elevators, Protein, and 3DRoad), while the sixth is from a Kaggle competition (Prudential life insurance risk assessment). These datasets contain two different kinds of regression targets, namely discrete and continuous variables. For a discrete variable, the values lie within a certain range, e.g., integer values in [0, 10]; for a continuous variable, the target regression value can be any real number. Prudential risk assessment and Wine quality (red, white) are discrete-target datasets, while Elevators, Protein, and 3DRoad have continuous targets. Random shuffling technique for creating negative datapairs for GP-NC: As we are only given positive regression target values in these datasets, we create pseudo-negative regression targets by randomly shuffling the labels and pairing them with the inputs to create negative datapairs. This generates a set of valid negative datapairs since, given the input x, only y(x) can be associated as the true regression value/label; we can therefore treat whatever label we obtain by random shuffling as a negative target. We compared our model against the standard baselines of Exact GP (Gardner et al., 2019; Wang et al., 2019) and sparse GP methods like SVGP (Hensman et al., 2013) and PPGPR (Jankowiak et al., 2019). For training the sparse methods, we used 1000 inducing points. We trained the models using the Adam optimizer with a learning rate of 0.1 for 400 epochs on each dataset. For the GP-NC framework, we used 200 negative datapairs. We maintained consistency across all models in terms of a constant mean and an RBF kernel for the GP prior. Fig.
3 compares the negative log-likelihood values of various GP regression methods, in both the classical and GP-NC frameworks, on all six real datasets. The orange dots represent our methods, while the blue dots depict the baselines. It can be observed that the GP-NC framework outperforms the classical GP framework; methods like SVGP-NC and Exact GP-NC perform on average 0.2 nats better than the baseline SVGP and Exact GP, respectively. It is interesting to note that including negative datapairs via the 'random shuffling technique' is quite effective, yielding observable gains in model performance. Table 1 reports the runtime difference ∆t, averaged over 10 runs. It can be observed from Table 1 that ∆t depends on the size of the dataset and also on the type of target regression variable. For Exact GP with discrete target variables, ∆t does not increase much with the size of the dataset; for continuous target variables, there is a considerable increase. For the SVGP model, the increase in ∆t can be attributed to the size of the dataset. Overall, the average increase in training time is not very significant. Thus, our experiments indicate that the penalty term added to the likelihood in Eq. (6) does not significantly affect the scalability of current scalable GP architectures. Figure 5 shows the RMSE plots for the Exact GP, SVGP, and PPGPR models, with the classical GP juxtaposed against the same models under the negative-constraint GP-NC framework. It can be observed from the plots that our GP-NC models converge faster, and to better values, than the classical GP models on the six univariate real-world datasets. In addition, it is interesting to note that as the size of the data increases, the convergence curve of the GP-NC model becomes steeper.
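The 'random shuffling technique' described above can be sketched as follows; the function name and sampling details are illustrative, not the paper's code:

```python
import numpy as np

def make_negative_pairs(X, y, m, seed=0):
    # Pseudo-negative targets via random shuffling: sample m inputs and pair
    # each with a label drawn from the shuffled target vector, on the working
    # assumption that any label other than the true y(x) is a negative target.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    shuffled = rng.permutation(y)
    return X[idx], shuffled[idx]
```

With a small chance a shuffled label coincides with the true one, a stricter variant could resample such collisions, but the simple shuffle already matches the construction described above.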

6. CONCLUSION

We presented a novel and generic Gaussian Process regression framework, GP-NC, that incorporates negative constraints while fitting the curve on the positive datapairs. Our key idea was to place small blobs of Gaussian distribution on the negative datapairs; then, while fitting the GP regression on the positive datapairs, our GP-NC framework simultaneously maximizes the KL divergence from the negative datapairs. Our work highlights the benefits of modeling negative datapairs for GPs, and our experiments support the effectiveness of our approach. We hope that this successful realization of the concept of negative datapairs for GP regression will be useful in a variety of applications.



Figure 1: An illustration of our problem setup. (a) top view of the room where the agent wants to travel to a particular location while avoiding obstacles; (b) the agent has been given the location of the positive datapairs that are needed to be covered in its trajectory. Since the number of these observed points is low, the agent is not able to avoid the obstruction (coffee table) while forecasting its course; (c) the agent is given both the positive datapairs which it needs to reach along with negative datapairs (area of influence is given by shaded red region) that should be avoided during the modeling of future trajectory.

Figure 2: Visualizing the GP-NC regression framework: The figures compare how the SVGP regression fits using the classical GP framework (left) vs. the GP-NC framework (right). The aim is to fit the regression targets marked in 'black' (positive datapairs) and avoid the targets marked in 'red' (negative datapairs). The classical GP framework only uses the positive datapairs, whereas our proposed GP-NC framework uses both the positive and negative datapairs for fitting the regression curve. The points in 'yellow' are the inducing points used to fit the GP. We used two inducing-point settings. Top figures: the inducing points were all placed at the start of the curve. Bottom figures: the inducing points were randomly sampled from the whole range of training inputs. For the GP-NC framework (right), the hyperparameters were λ = 0.1 and σ_neg = 1.2.


Figure 4: Trajectory prediction with the GP-NC regression framework: The figures compare trajectory prediction in a 2D virtual environment using the classical GP framework (a) vs. the GP-NC framework (b, c). The car is navigating through the forest, and our aim is to avoid the roadblocks marked in 'red' while maintaining the car's proximity to the 'black' trajectory markers. The classical GP framework only uses the positive datapairs, whereas our proposed GP-NC framework uses both the positive and negative datapairs for predicting the agent's trajectory. (b) depicts the GP-NC framework with hyperparameter λ = 1; (c) depicts the GP-NC framework with hyperparameter λ = 0.1.

Figure 5: RMSE plots on real world data (top -Exact GP; middle -SVGP; bottom -PPGPR): Plots show the test RMSE for six univariate regression datasets (lower is better). Models are fitted by using cross validation on training data. Convergence of GP-NC framework is consistently faster than its classical GP counterpart for all the models.

Table 1: Runtime comparison of the classical GP and GP-NC (with negative datapairs) frameworks on different datasets. ∆t is the runtime difference between the GP model in the GP-NC framework and in the classical GP framework. We used the GPU-accelerated GP implementation of the GPyTorch library. Columns: Dataset | Size of data | Type of target variable | ∆t (Exact GP) | ∆t (sparse SVGP).



APPENDIX

A GP-NC FOR SCALABLE GP METHODS

We can replace the NLL term in Algorithm (1) with the log likelihood of the different scalable GP methods. Since we have a scalable implementation of the D_KL update, the entire algorithm scales well with the input data size. It is straightforward to plug the class of scalable and sparse GP regression models into the likelihood term of Algorithm (1) to account for the negative datapairs in their formulation. In particular, we review the SVGP model by (Hensman et al., 2013), a popular scalable implementation of GPs. We also investigate the recent parametric Gaussian Process regressor (PPGPR) by (Jankowiak et al., 2019). In this section, we follow the notations given in the respective works and reproduce the derivations of their log likelihood functions for the sake of completeness.

A.1 SVGP REGRESSION MODEL

(Hensman et al., 2013) proposed the Scalable Variational GP (SVGP) method. The key technical innovation was the development of inducing point methods, which we now review. By introducing inducing variables u that depend on variational parameters {z_m}_{m=1}^M, where M = dim(u) ≪ N and each z_m ∈ R^d, we augment the GP prior as follows:

$$ p(f \mid u) = \mathcal{N}\!\left(f \mid K_{NM} K_{MM}^{-1} u,\; \tilde{K}_{NN}\right), \qquad p(u) = \mathcal{N}(u \mid 0, K_{MM}). $$

We then appeal to Jensen's inequality and lower bound the log joint density over the targets and inducing variables:

$$ \log p(y, u) \geq \sum_{i=1}^{N} \left\{ \log \mathcal{N}\!\left(y_i \mid \mu_f(x_i), \sigma_{\mathrm{obs}}^2\right) - \frac{(\tilde{K}_{NN})_{ii}}{2\sigma_{\mathrm{obs}}^2} \right\} + \log p(u), \quad (8) $$

where µ_f(x_i) is the predictive mean function and K̃_NN is given by

$$ \tilde{K}_{NN} = K_{NN} - K_{NM} K_{MM}^{-1} K_{MN}. $$

The essential characteristics of Eqn. 8 are that: i) it replaces expensive computations involving K_NN with cheaper computations like K_MM^{-1} that scale as O(M³); and ii) it is amenable to data subsampling, since the log likelihood and trace terms factorize as sums over datapoints (y_i, x_i).

A.1.1 SVGP LIKELIHOOD FUNCTION

SVGP proceeds by introducing a multivariate Normal variational distribution q(u) = N(m, S). The parameters m and S are optimized using the ELBO (evidence lower bound), which is the expectation of Eqn. 8 w.r.t. q(u) plus an entropy term H[q(u)]:

$$ \mathcal{L}_{\mathrm{svgp}} = \mathbb{E}_{q(u)}\!\left[ \sum_{i=1}^{N} \left\{ \log \mathcal{N}\!\left(y_i \mid \mu_f(x_i), \sigma_{\mathrm{obs}}^2\right) - \frac{(\tilde{K}_{NN})_{ii}}{2\sigma_{\mathrm{obs}}^2} \right\} \right] - \mathrm{KL}\left(q(u) \,\|\, p(u)\right), $$

where KL denotes the Kullback-Leibler divergence and µ_f(x_i) is the predictive mean function. L_svgp, which depends on m, S, Z, σ_obs, and the various kernel hyperparameters θ, can then be maximized with gradient methods. We refer to the resulting GP regression method as SVGP.

A.2 PPGPR-NC REGRESSION MODEL: LIKELIHOOD FUNCTION

Jankowiak et al. (2019) recently proposed a parametric Gaussian Process regressor (PPGPR). We defer the reader to Section 3.2 of their paper for details about their likelihood function.

