Exercise: Gaussian processes

Original by Carl Rasmussen and Manon Kok for CUED course 4f13. This version adapted by Damon Wischik.

In this assignment you will use Gaussian process modelling, via the Python package sklearn.gaussian_process, which is roughly equivalent to the GPML package for MATLAB. See the appendix at the end of this document for a walkthrough.

What to submit. Your answers should contain an explanation of what you do, and 2–4 central commands to achieve it. The focus of your answer should be interpretation: explain what the numerical values and graphs you produce mean, and why they are as they are. The text of your answer to each question should be no more than a paragraph or two. Marks will be awarded based on the clarity and insight in your explanations.

DO NOT SUBMIT FULL SOURCE CODE, unless it is as an appendix. Do not repeat the question text in your answers. If you submit your answers as a Jupyter notebook, structure the notebook in two sections: a section at the top for the examiner to read with just your answers and trimmed code snippets, and a section at the bottom with all your working code.

Question (a)

Load data from https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/cw1a.mat. Consider a Gaussian process with a squared exponential covariance function,

ν**2 * gp.kernels.RBF(length_scale=λ) + gp.kernels.WhiteKernel(noise_level=σ),

and maximize the log marginal likelihood starting with hyperparameters λ=np.exp(-1), ν=1, σ=1. Show the 95% predictive error bars. Comment on the predictive error bars and the optimized hyperparameters.

Question (b)

The 95% predictive error bars depict the total uncertainty about a new readout $$ Y_\text{new} = F(x_\text{new}) + N(0,\sigma^2) $$ conditional on the dataset. Here $F$ is the underlying Gaussian process, and $\sigma$ is the white noise standard deviation. Update your plot from part (a) to show how much of the total uncertainty is because we're uncertain about $F$, and how much is uncertainty due to white noise.

Question (c)

Show that by initializing the hyperparameters differently, you can find a different local optimum for the hyperparameters. Try a range of values and show the fit. In your fit, show both the 95% predictive error bars and the 95% error bars for just the posterior uncertainty about the underlying random function. Explain what is going on. Which fit is best, and why?

Fit an RBF kernel without white noise, and plot the fit. Explain what you see.

Question (d)

Now train a GP with a periodic covariance function, using gp.kernels.ExpSineSquared plus white noise. Show the fit. Comment on the behaviour of the error bars, compared to your fit from (a). Do you think the data-generating mechanism was really periodic? Why or why not?

Question (e)

Consider a Gaussian process with the following covariance function:

1 * gp.kernels.ExpSineSquared(length_scale=np.exp(-0.5), periodicity=1) 
    * gp.kernels.RBF(length_scale=np.exp(2))

Sample several different functions from this distribution, plot them, and explain their behaviour. Don't add noise to the function values, i.e. don't add gp.kernels.WhiteKernel() to the kernel. Note: you can sample a set of values using GaussianProcessRegressor.sample_y.
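For reference, a minimal sketch of drawing and plotting prior samples (the grid of input points and the number of samples are arbitrary choices, not part of the question):

import numpy as np
import matplotlib.pyplot as plt
import sklearn.gaussian_process as gp

kernel = 1 * gp.kernels.ExpSineSquared(length_scale=np.exp(-0.5), periodicity=1) \
           * gp.kernels.RBF(length_scale=np.exp(2))
model = gp.GaussianProcessRegressor(kernel=kernel)
xnew = np.linspace(-5, 5, 300).reshape(-1, 1)                   # arbitrary grid of input points
samples = model.sample_y(xnew, n_samples=4, random_state=0)     # draws from the prior, since no data has been fitted
plt.plot(xnew, samples)
plt.show()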

The choice of kernel dictates what outputs are likely. By looking at sample functions, you gain a sense of the expressive power of Gaussian process models. Ultimately, you should be able to look at a scatterplot and know straight away what type of kernel would be a good fit.

Question (f)

Load https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/cw1e.mat. This data has two-dimensional input and scalar output. Visualise the data, e.g. using the code snippets in the appendix. Consider two different Gaussian process models of the data, one using covariance function RBF(length_scale=[l1,l2]) and the other using the sum of two such RBF terms. (Make sure to break symmetry, e.g. by choosing the initial hyperparameters randomly, and include white noise.)

Compare the two models. How do the data fits compare? How do the marginal likelihoods compare? What is your interpretation? Which of the two is better?

When length_scale is a list, the RBF kernel uses a separate length-scale parameter for each dimension of the input space. In GPML this is known as 'Squared Exponential with Automatic Relevance Determination', covSEard. It is a useful tool for learning which inputs matter for prediction: an input with a short length-scale is very important, while one whose length-scale grows very long (compared to the spread of the data) is largely ignored.

Question (g)

Load https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/twoseries.csv. This data consists of two time series, where $t$ is time, $k$ is the index of the series, and $y$ is the output. The data has two-dimensional input, but the second dimension is categorical rather than numerical, so it doesn't make sense to use the RBF kernel. Instead, consider the model $$ Y_{t,k} = Z_t + E_k + \operatorname{Normal}(0,\sigma^2) $$ where $Z_t$ is a shared underlying RBF Gaussian process, and $E_1$ and $E_2$ are independent $\operatorname{Normal}(0,\rho^2)$ offsets, giving $$ \operatorname{Cov}(Y_{t,k}, Y_{t',k'}) = \nu^2 \exp \Bigl( -\frac{(t-t')^2}{2\ell^2}\Bigr) + \rho^2 \delta_{k k'} + \sigma^2 \delta_{t t'}\delta_{k k'}. $$ Implement a kernel for this model, fit it, and plot your predicted values for the two time series.

Appendix

Gaussian processes in Python

Here is a simple example of how to use sklearn.gaussian_process. For full details, see the documentation.

Let's consider a simple Gaussian process model: a prior on the space of Gaussian processes, with mean 0, and with covariance function $$ k(x, x') = \nu^2 \exp \Bigl( -\frac{(x-x')^2}{2\ell^2}\Bigr) $$ (this function is called RBF in sklearn.gaussian_process and covSEiso in GPML). Suppose that the data model is
$$ y \mid x, f \sim \operatorname{Normal}\bigl(f(x), \sigma^2 I\bigr) $$ where $x$ and $y$ are vectors and $f(x)$ means $(f(x_1),\dots,f(x_n))$. In this model, $\nu$, $\ell$ and $\sigma$ are hyperparameters. Another way to write out this entire model is with a single covariance function, $$ k(x, x') = \nu^2 \exp \Bigl( -\frac{(x-x')^2}{2\ell^2}\Bigr) + \sigma^2 \delta_{x x'}. $$

In the gaussian_process package, a Gaussian process model is specified by a kernel object. (Kernel is another name for covariance function.) The package has a library of kernels, each implemented as a Python class, and we can create composite kernels by adding together and multiplying kernel objects.
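As a concrete illustration, the covariance function above could be built as follows (the hyperparameter values here are placeholders):

import sklearn.gaussian_process as gp

nu, ell, sigma2 = 1.0, 1.0, 0.1     # placeholders: signal std, length-scale, noise variance
# nu^2 * RBF(ell) models the underlying function; WhiteKernel contributes the noise term
kernel = nu**2 * gp.kernels.RBF(length_scale=ell) + gp.kernels.WhiteKernel(noise_level=sigma2)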

To extract parameters from a kernel, use get_params(). You can also use set_params to set the parameters for a kernel.
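For example, with the composite kernel sketched above (the exact nested parameter names depend on how the kernel was composed; inspect get_params() to see them):

kernel.get_params()                          # dict of parameters, with nested names such as 'k1__k2__length_scale'
kernel.set_params(k2__noise_level=0.01)      # set a parameter by its nested name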

Machine learning functions are implemented via the class GaussianProcessRegressor, which is initialized with a kernel object. It has methods for learning hyperparameters and making predictions. We can access a model's kernel with GaussianProcessRegressor.kernel.

To learn hyperparameters, use GaussianProcessRegressor.fit(x,y). It requires the input coordinates (the $\boldsymbol{x}_i$) to be an array with one row per observation. In the example below we have one-dimensional input, so we have to reshape it to be a column vector. We can access the fitted kernel, and thence the fitted parameters, with model.kernel_.
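A minimal sketch (assuming x and y are one-dimensional numpy arrays holding the observations):

import numpy as np
import sklearn.gaussian_process as gp

kernel = 1.0 * gp.kernels.RBF(length_scale=1.0) + gp.kernels.WhiteKernel(noise_level=1.0)
model = gp.GaussianProcessRegressor(kernel=kernel)
model.fit(x.reshape(-1, 1), y)                  # inputs reshaped to one row per observation
print(model.kernel_)                            # kernel with fitted hyperparameters
print(model.log_marginal_likelihood_value_)     # log marginal likelihood at the fitted values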

To make predictions with a fitted model, call GaussianProcessRegressor.predict(x). As before, x should be a column vector.
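Continuing the sketch above, predictions with 95% error bars might look like this (the prediction grid is illustrative):

import matplotlib.pyplot as plt

xnew = np.linspace(x.min(), x.max(), 200).reshape(-1, 1)
mean, sd = model.predict(xnew, return_std=True)    # predictive mean and standard deviation
plt.plot(xnew, mean)
plt.fill_between(xnew.ravel(), mean - 1.96*sd, mean + 1.96*sd, alpha=0.3)
plt.scatter(x, y)
plt.show()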

Custom kernels

If there is no standard kernel that captures the covariance function we want, it's easy to implement a custom kernel. Below is a reimplementation of the RBF kernel, which has covariance function $$ k(x, x') = \exp \Bigl( -\frac{(x-x')^2}{2\ell^2}\Bigr). $$ The main method is __call__, which computes a covariance matrix, and is invoked by

k = MyRBFKernel(length_scale=2.718)
x = np.array([1,2,4])[:, np.newaxis]
y = np.array([1.5,3])[:, np.newaxis]
k(x,y)     # returns a matrix m_{i,j} = kernel(x_i, y_j)

The __init__ constructor should accept arguments for each hyperparameter, and there should be hyperparameter_X properties that describe each hyperparameter. The simple implementation below requires $x\in\mathbb{R}$, but the source code for the built-in RBF kernel allows multidimensional features $\boldsymbol{x}\in\mathbb{R}^d$ with anisotropic length-scale $\ell\in\mathbb{R}^d$. It is also more efficient.
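A minimal sketch of such a reimplementation, for scalar inputs only (bounds handling and input validation are kept to a bare minimum):

import numpy as np
from sklearn.gaussian_process.kernels import Kernel, Hyperparameter

class MyRBFKernel(Kernel):
    # k(x, x') = exp(-(x - x')^2 / (2 * length_scale^2)), for scalar inputs x
    def __init__(self, length_scale=1.0, length_scale_bounds=(1e-5, 1e5)):
        self.length_scale = length_scale
        self.length_scale_bounds = length_scale_bounds

    @property
    def hyperparameter_length_scale(self):
        # declares length_scale as a tunable hyperparameter (optimized on a log scale)
        return Hyperparameter("length_scale", "numeric", self.length_scale_bounds)

    def __call__(self, X, Y=None, eval_gradient=False):
        X = np.asarray(X)
        Y = X if Y is None else np.asarray(Y)
        sqdist = (X[:, 0][:, np.newaxis] - Y[:, 0][np.newaxis, :]) ** 2
        K = np.exp(-0.5 * sqdist / self.length_scale ** 2)
        if eval_gradient:
            # gradient of K with respect to log(length_scale), shape (n, n, 1)
            return K, (K * sqdist / self.length_scale ** 2)[:, :, np.newaxis]
        return K

    def diag(self, X):
        # k(x, x) = 1 for every input
        return np.ones(np.asarray(X).shape[0])

    def is_stationary(self):
        return True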

Plotting a function of two variables

Here are some ways we might plot the data from part (f).
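For instance (assuming the inputs are in an n×2 array X and the outputs in a vector y; the variable names are placeholders):

import matplotlib.pyplot as plt

# 3D scatter plot of the output against the two input dimensions
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')

# alternatively, a 2D scatter with the output encoded as colour
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.colorbar(label='y')
plt.show()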