

Gaussian processes are very cool

Welcome to the tutorial! In this short piece, we are going to explore the fundamentals of Gaussian processes (GPs) through an applied example. I was motivated to write this piece after recently finishing the seminal text in the field, Gaussian Processes for Machine Learning. While it is an incredibly detailed resource, its treatment is highly technical and layered with advanced mathematics. Moral of the story: do not be misled by the word "Gaussian" into thinking GPs are simple. They are complex objects, and a deep understanding of them requires a solid handle on linear algebra and calculus. However, our goal here is to build an intuition for them and to understand why they are so absurdly flexible and useful. For this tutorial, we are going to use a Python library called GPy, which offers a simple and intuitive code design for beginners.

Caveat: GPs are very complex and many other excellent detailed guides are available. This article seeks to provide a very short introduction in the conversational style of usual Orbisant articles with easy-to-follow code.

Quick conceptual introduction to Gaussian processes

In Bayesian inference we typically specify a prior for a particular parameter, such as the coefficient of a predictor in a regression model. In a GP, however, we instead specify a prior over functions. Let's break that down for a moment. We are specifying the space of entire functions that are likely to have generated the data we have observed. To make it a bit clearer, consider the specification of a Gaussian (normal) distribution:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

The normal distribution is completely parameterised by its mean and variance (or standard deviation). A GP extends this same idea to the next layer up, meaning it is parameterised by a mean function and a covariance function. Mathematically, we can write this as:

$$f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)$$

While the parameters are now functions rather than single numbers, notice that the basic structure is still the same as our simple normal distribution above.

By extension, in a simple bivariate setting, we can generate mean predictions from the model for a given value of x, with some error/uncertainty (sigma) associated with them:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

The information contained in the above is very important. One of the core characteristics of a GP is that the joint distribution of any finite collection of its random variables is a multivariate Gaussian - which is much more interpretable to us than many alternative distributions. This makes otherwise difficult problems in statistics and machine learning analytically tractable. Moreover, one way to visualise this (more on this later) is to imagine the shaded credible intervals on the resultant plot from a GP as predictions from a sample of the infinitely many different functions that exist within the space specified by your mean and covariance functions.

However, strangely enough, the mean function for most purposes actually contributes very little to a GP's generalisability - it's all in the covariance function! So, we typically just set the mean function to 0. But boy, there are a lot of choices of covariance functions...
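
Before surveying those choices, here is a minimal sketch of the "prior over functions" idea in code, assuming GPy, NumPy, and matplotlib are installed. It uses GPy's RBF kernel (which we will meet properly in the next section), and the grid size, number of samples, and variable names such as X_grid are just illustrative choices. Each draw from the multivariate Gaussian is one entire function evaluated on the grid.

import numpy as np
import GPy
from matplotlib import pyplot as plt

# Grid of input points at which to evaluate each sampled function
X_grid = np.linspace(-5., 5., 100).reshape(-1, 1)

# Zero mean vector and an RBF covariance matrix over the grid
kernel = GPy.kern.RBF(input_dim = 1)
mean = np.zeros(X_grid.shape[0])
cov = kernel.K(X_grid) + 1e-8 * np.eye(X_grid.shape[0]) # Small jitter for numerical stability

# Each row of samples is one whole function drawn from the zero-mean GP prior
samples = np.random.multivariate_normal(mean, cov, size = 5)
plt.plot(X_grid, samples.T)

Swapping the kernel in this sketch for a different one changes the character of the sampled curves entirely, which is exactly the point of the next section.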

Covariance functions, kernels, and connections to other models

The choice of covariance function is the crux of GP modelling, just like the specification of informative priors in Bayesian inference. This is often the point where subject matter expertise reigns supreme. There are a lot of choices available to us (see here for an overview). The most common one is known as the squared exponential (SE) kernel or "radial basis function" (RBF). The latter may sound familiar, as it is often the default kernel for common machine learning algorithms such as support vector machines (SVMs). Mathematically, this kernel is defined as:

$$k(x, x') = \sigma^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right)$$

Here, the l (lengthscale) parameter defines how long the "wiggles" in your function are, while sigma-squared (the variance) specifies the average distance of the function away from its mean. As a final note, while it may be tempting to use any function under the sun as a covariance function in a GP, you cannot do this! There is a range of rules these functions must follow - in particular, they must always produce a valid (positive semi-definite) covariance matrix (see here for more).
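
To connect the formula to code, here is a minimal sketch of the SE/RBF kernel written directly in NumPy (the helper name squared_exponential and the small check grid are purely illustrative). For 1-D inputs with matching hyperparameters, it should produce the same covariance matrix as GPy's built-in RBF kernel.

import numpy as np
import GPy

def squared_exponential(X1, X2, variance = 1., lengthscale = 1.):
    # Pairwise squared distances between the 1-D inputs
    sqdist = (X1.reshape(-1, 1) - X2.reshape(1, -1)) ** 2
    # sigma^2 * exp(-(x - x')^2 / (2 * l^2))
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Compare against GPy's RBF kernel on a small grid of points
X_check = np.linspace(-3., 3., 5).reshape(-1, 1)
K_numpy = squared_exponential(X_check, X_check)
K_gpy = GPy.kern.RBF(input_dim = 1, variance = 1., lengthscale = 1.).K(X_check)
print(np.allclose(K_numpy, K_gpy)) # Expect True if the parameterisations match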

A comparison of different kernels

There is an excellent graphic in the GPy documentation that depicts some fundamental differences between a range of common kernels. Let's recreate it here and take a look.

import GPy
import matplotlib
from matplotlib import pyplot as plt

matplotlib.rcParams['figure.figsize'] = (8, 5)

# Draw a 3x3 matrix of plots, one per kernel

figure, axes = plt.subplots(3, 3, figsize = (10, 10), tight_layout = True)

kerns = [GPy.kern.RBF(1), GPy.kern.Exponential(1), GPy.kern.Matern32(1),
         GPy.kern.Matern52(1), GPy.kern.Brownian(1), GPy.kern.Bias(1),
         GPy.kern.Linear(1), GPy.kern.PeriodicExponential(1), GPy.kern.White(1)]

for k, a in zip(kerns, axes.flatten()):
    k.plot(ax = a, x = 1)
    a.set_title(k.name.replace('_', ' '))

Quite a lot of different shapes! You can begin to intuit when some might be more useful than others for certain applications. For example, on periodic or seasonal time series data, a periodic exponential kernel might make sense as a starting point. Moreover, you can even combine kernels! In the time series example, you might also want to model other components in the signal, and combining the periodic exponential with an RBF or linear kernel might be a way to do this. At their core, Bayesian inference and GPs are all about flexibility - they exist to help you model the underlying statistical process that is likely to have generated your data. These tools just happen to let you weave in your subject matter expertise more explicitly as a prior.
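
As a quick illustration of the combination idea, here is a minimal sketch (the specific kernel choices and variable names are purely illustrative) showing how GPy lets you add or multiply kernel objects to build a richer covariance function.

import GPy

# A periodic component for the seasonal signal and a smooth RBF component for the trend
periodic = GPy.kern.PeriodicExponential(input_dim = 1)
smooth = GPy.kern.RBF(input_dim = 1)

# Kernels can be added (independent effects) or multiplied (interacting effects)
additive = periodic + smooth
multiplicative = periodic * smooth

print(additive) # Prints the combined kernel and its hyperparameters

The resulting combined object is itself a kernel, so it can be passed to a GPy model just like any single kernel.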

An applied example

So much math! Let's put it into visuals and code to help make sense of what is going on.

To start with, let's look at an example with only 10 datapoints. Many other models would struggle to fit such limited data at all, and most people would not even consider modelling such a small N in the first place. But what if we wanted to get a basic understanding of some potential processes that might have generated this data? One (terrible) option would be to fit increasingly high-order polynomials, which would return very high fit statistics but, as we know, would generalise and extrapolate horrendously and ruin any chance of sound statistical inference. Another option (surprise, surprise) is to use a GP.

For this example, we are going to generate a set of 10 random numbers between -5 and +5 as our X variable, and then take the sine of those values to get our Y variable.

import numpy as np

np.random.seed(123) # Fix NumPy's random seed for reproducibility
X = np.random.uniform(-5., 5., (10, 1)) # Generate a column vector of 10 X values
Y = np.sin(X) # Generate the corresponding Y values

We can then specify a very basic and uninformed GP using only two lines of code, followed by a third to automatically generate a plot with 95% credible intervals. Here we are just using the basic RBF kernel with all hyperparameters set to 1. Nothing too fancy.

kernel = GPy.kern.RBF(input_dim = 1, variance = 1., lengthscale = 1.)
m = GPy.models.GPRegression(X, Y, kernel)
fig = m.plot()

Not a bad start! Evidently we have produced some estimates that aren't too far from what we might have expected even had we not explicitly known the data was generated from a sine wave. But can we do better? Let's increase the variance hyperparameter to something a bit drastic, like 5, and see what happens.

kernel2 = GPy.kern.RBF(input_dim = 1, variance = 5., lengthscale = 1.)
m2 = GPy.models.GPRegression(X, Y, kernel2)
fig2 = m2.plot()

Our mean and credible interval clearly follow the datapoints a little more closely now, and at more extreme x values away from the datapoints in both directions, the credible interval widens to reflect the fact that the model is less sure what the estimates should be, due to the lack of data. Pretty neat.
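
We can also see this numerically rather than just visually. Here is a small sketch, assuming the m2 model from above is still in memory: ask it for predictions at a point near the training data and at one well outside the observed range, and compare the predictive variances. The exact numbers will depend on the random X values drawn, but the distant point should show a much larger variance.

import numpy as np

# One point near the data and one far outside the observed range
X_new = np.array([[0.], [15.]])
mean_pred, var_pred = m2.predict(X_new)

print(mean_pred) # Predicted means at the two points
print(var_pred) # Predictive variances - expect the second to be much larger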

Now what if we keep these parameters but increase the lengthscale slightly, say to 2? Remembering that lengthscale affects the "wiggliness" of the kernel, we should expect a slightly smoother output here.

kernel3 = GPy.kern.RBF(input_dim = 1, variance = 5., lengthscale = 2.)
m3 = GPy.models.GPRegression(X, Y, kernel3)
fig3 = m3.plot()

Just as we expected! At this point, we could probably take random draws from the model posterior and produce data that is somewhat representative of the ten datapoints we started with. This is useful in practice for applications such as simulation and sensitivity analysis. But something else we can do is optimise the model's hyperparameters with respect to the data. This will produce a very tight fit to the data. It works by maximum likelihood, which chooses the hyperparameter values that maximise the (marginal) likelihood of the observed data. In cases with a lot of data, this approach makes sense. However, in ours, with so little data, it will likely produce far too tight a fit. Let's take a look.

m4 = GPy.models.GPRegression(X, Y, kernel3)
m4.optimize(messages = True)
fig4 = m4.plot()

Probably way too tight! Sure, it looks appealing on this data, but how well might the model generalise to new data that's very far out on either side of the x-axis? Probably not very well. However, we can see rapidly widening intervals at the extremes of x, which suggests that the model may be able to capture this high uncertainty if we extrapolated. But would it reasonably extrapolate to a sinusoid shape? Maybe not (follow-up post on this coming soon!). Apart from that, the tightness of the intervals near the data does not really capture our inherent uncertainty given how few observations we have.
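
If you want to probe that extrapolation question yourself, here is a minimal sketch (again assuming the optimised model m4 is still in memory; the range and names like X_far are illustrative) that asks for predictions and a few posterior function draws well beyond the original -5 to +5 range. Whether those draws look anything like a sine wave is exactly the question for the follow-up post.

import numpy as np

# Input points well beyond the range the model was trained on
X_far = np.linspace(10., 15., 6).reshape(-1, 1)

# Predictive mean and variance at the far-away points
mean_far, var_far = m4.predict(X_far)

# A handful of sample functions drawn from the optimised model's posterior
samples_far = m4.posterior_samples_f(X_far, size = 5)

print(mean_far)
print(var_far) # Expect large variances this far from the data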

Additional thoughts

While we considered only an extremely basic example here, I want to reiterate the utility of GPs. They are so flexible that they can be used for classification, regression, and even time series modelling, and can fit an almost unfathomably wide range of real-world and synthetic datasets. However, as always in statistics, caution is advised when using GPs, and it is strongly recommended that careful thought go into specifying the parameters of the model.

Resources for further reading

If you are mathematically inclined, definitely check out the seminal Rasmussen and Williams textbook. Otherwise, some other lighter but still very useful reading includes: