Gaussian processes for non-Gaussian likelihoods


Welcome to the first instalment in Trent's Bayesian Bank! As many readers of this website know, I (Trent) am a huge proponent of Bayesian inference and all of the tools available to Bayesians, having authored some myself (see PosteriorPlots.jl for an example). Today we are going to explore a less commonly demonstrated use of a particular Bayesian method known as the Gaussian process (GP)---using it to model data which are not Gaussian in nature. Examples of such data include binomial data (e.g. two-class classification problems) and count data (e.g. the number of people who enrolled in a Bachelor of Science at a university). Of course, there are many more types of data, but these two are particularly common in applied settings. Now, right off the bat you might be wondering "but Trent, how can we use something with 'Gaussian' in its name to understand non-Gaussian data?!" That is a completely fair question, so let's start with a (brief and non-technical) review of what a GP is and how we can extend its foundations to model virtually any type of data we might encounter. Indeed, this flexibility is a big part of what has made GPs such a valuable tool in probabilistic machine learning.

NOTE: GPs are very mathematically complex. For a more rigorous understanding, please refer to the seminal textbook Gaussian processes for machine learning by Rasmussen and Williams, this excellent interactive Gaussian process visualiser, this other excellent interactive Gaussian process visualiser, and this great detailed breakdown by Michael Betancourt.

What is a Gaussian process?

A Gaussian process is an extension of the multivariate Gaussian distribution, which itself is a generalisation of the standard univariate Gaussian distribution. Recall that a univariate Gaussian is completely parameterised by two quantities: a mean and a standard deviation (the square root of the variance). A multivariate Gaussian builds upon this by specifying the joint distribution of two or more Gaussians through a mean vector (with an entry for each dimension) and a covariance matrix (holding the variance of each dimension on the diagonal and the covariance between dimensions on the off-diagonals). A Gaussian process continues this pattern into the function domain. That is, we are no longer drawing individual values from a distribution; we are drawing entire functions! More specifically, a Gaussian process is defined as a collection of random variables, any finite subset of which has a multivariate Gaussian distribution. This is not only computationally convenient (in the sense that some linear algebra can take us to our sought-after posterior distribution for a new set of prediction points), but also aids interpretability.

Gaussian process regression and classification are probabilistic, non-parametric methods. That doesn't mean they don't have parameters--rather, they have infinitely many! We specify Bayesian prior distributions over entire functions instead of over individual parameter values, as is the case in most other Bayesian regression settings. To help that make sense, take a look at the following regression equation:
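y = f(x) + \epsilon

where \epsilon is a Gaussian observation noise term.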

In a simple linear regression setting, the function f(x) would simply be a linear combination of our predictor variables and regression coefficients. But in the case of GPs, we do not assume any particular functional form; instead, we place a GP prior over functions, specified through a mean function and a covariance function:
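f(x) \sim \mathcal{GP}\left(m(x),\, k(x, x')\right)

where m(x) is the mean function and k(x, x') is the covariance function.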

GPs in practice (i.e. in a regression setting) are parameterised by a mean function (which we often just assume to be zero) and a covariance function, as outlined above. The covariance function, or "kernel", is the crux of GP modelling. The choice of covariance function, its hyperparameters, and the priors we place on those hyperparameters drive most of the GP's capacity to model the data of interest. This is because the covariance function encodes how similarity between any two inputs (x and x') translates into similarity between their outputs---that is, we want values that are "close" in the input domain to be close in our output space. The most common choice of kernel is the squared exponential, sometimes referred to as a "radial basis function" in other popular machine learning techniques such as support vector machines. The equation for this kernel function is presented below. Note that the kernel maps the relationship between inputs x and new points x', and that we have two hyperparameters to adjust: \sigma^{2} (variance scaling factor; i.e. how far our values deviate from the mean) and \ell (lengthscale; i.e. how "smooth" or "wiggly" the function is).
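k(x, x') = \sigma^{2} \exp\left(-\frac{(x - x')^{2}}{2\ell^{2}}\right)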

However, there are many more covariance functions than the radial basis function. Examples include Matérn kernels, linear kernels, and periodic kernels, among many others. The radial basis function is typically the one many people have already encountered somewhere in their statistical modelling journey, and so it makes a useful reference point.

Now why is this distribution-over-functions approach useful? Well, often in practice we don't actually know what the underlying data-generating function is. So instead of committing to tight assumptions and pushing ahead with a potentially inappropriate functional form, we can draw values from many different plausible functions and observe how well the collective predictions perform against the data.

Also note that the mathematics of GPs is much more involved than what has been shown here. As a Bayesian method, there is a substantial amount of calculus involved in the computation of the posterior distribution, and of course a large amount of linear algebra, given that we are working with vectors and matrices. Please see the Rasmussen and Williams textbook for all the beautiful details.

How can we model non-Gaussian data with GPs?

Just as we can extend our old friend linear regression to model all manner of non-Gaussian data using generalised linear models (GLMs) with link functions, we can adopt a similar approach for GPs. We can use the standard GP architecture outlined above and convert the resulting functional output to the domain of our data by applying a link function through the likelihood. In the GP literature, this approach is referred to as specifying a "latent GP".
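For a count outcome, for example, the latent GP construction looks something like this (writing \lambda_i for the Poisson rate at input x_i):

y_i \sim \text{Poisson}(\lambda_i), \quad \log(\lambda_i) = f(x_i), \quad f \sim \mathcal{GP}\left(m(x), k(x, x')\right)

The GP does its usual work on the latent function f, and the log link plus the Poisson likelihood translate that latent function into the world of counts.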

To make it more concrete, let's take a look at an example using the probabilistic programming language Stan where we are going to fully code up our own GP.

Example: Count data

Let's take a look at some count data. Remember that count data is discrete (and non-negative), not continuous. This is because we cannot have decimals or negatives of a count. We can only have 0 or a positive integer. Typically, count data is modelled using a Poisson or negative binomial distribution, both of which are explicitly designed for such data. They are equally applicable in the case of GPs. Note that a Poisson distribution assumes that the mean and variance are equivalent, which is usually violated in practice---typically our count data is overdispersed, meaning we need an additional parameter to capture this dispersion. That's where negative binomial distributions come in. However, to keep things simple here, we'll stick to the Poisson model.

To test our model and highlight the power of GPs to learn all manner of non-linear functions, let's simulate some wild count data that has a sinusoidal shape. The linear models most people gravitate to in a statistical modelling context (other than perhaps a generalised additive model with splines and smoothing functions) would be inappropriate for such data. We'll first load the relevant packages and then simulate.

carbon-11.png
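In case the screenshot is hard to read, here is a minimal sketch of what such a simulation might look like (the seed, sample size, and exact functional form here are illustrative assumptions rather than the values in the image above):

```r
library(dplyr)
library(ggplot2)

set.seed(123)                            # illustrative seed
N1 <- 20                                 # number of observed data points
x1 <- seq(1, 10, length.out = N1)        # predictor values
lambda <- exp(2 + 1.5 * sin(x1))         # sinusoidal latent rate (assumed form)
y1 <- rpois(N1, lambda)                  # draw Poisson counts

sim_data <- tibble(x = x1, y = y1)
ggplot(sim_data, aes(x = x, y = y)) +
  geom_point() +
  labs(x = "x", y = "Count")
```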

The plot of our data looks like this:

rpois.png

Note the distinct non-linearity we have simulated. We have gone for such data to illustrate the power of GPs in modelling all manner of relationships.

With our data ready, we can now specify our GP. The code for this is below. It's quite long, especially if you have not written Stan code before, so let's break it down by section (data, transformed data, parameters, transformed parameters, model, generated quantities). Also note the use of ; at the end of every statement and the use of // for commenting--this is because Stan programs are translated into C++, and the language borrows much of its syntax.

data

This is the "block" where we tell Stan all of the things we wish to feed in as inputs. This can include data, values for parameters and anything else you want to provide that's external to the model machinery directly. Here, I am telling Stan that we have sample sizes for our training data (N1) and the data we want to make predictions on (N2). These are integers as they are counts, and we can't have less than 1 observation so we fix the lower bound. Similarly, we are feeding in the response variable vector of our counts as y1 where the bottom is bounded by 0. x1 and x2 refer to the predictor variables.

Note that we can easily extend our model to the multivariate domain by changing the dimensionality of x1 and x2 to reflect the number of predictors in our data/input matrix. Neat!

transformed data

This block is an interesting one. Here, we can apply any changes we wish to make to the data specified in the data block above, or set up variables that will not change while the program runs. I am declaring a quantity delta, which we will use later on for numerical stability, and setting up a vector x to hold all the predictor values in the training and test data that we will use in our kernel function later.
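Something along these lines (again a sketch in the style of the Stan user's guide):

```stan
transformed data {
  real delta = 1e-9;            // small jitter for numerical stability
  int<lower=1> N = N1 + N2;     // total number of input points
  array[N] real x;              // training and test inputs stacked together
  for (n1 in 1:N1) x[n1] = x1[n1];
  for (n2 in 1:N2) x[N1 + n2] = x2[n2];
}
```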

parameters

This block is one of the more straightforward ones in Stan. We declare all the parameters that we want our program to estimate. Here I am declaring the parameters of the covariance function to be called in the next block and quantities that will be estimated as part of the GP.
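A sketch of the block; the names rho, alpha, a, and eta are my assumed labels for the lengthscale, marginal standard deviation, intercept, and latent standard normal draws, and the screenshots may name them differently:

```stan
parameters {
  real<lower=0> rho;            // lengthscale of the covariance function
  real<lower=0> alpha;          // marginal (signal) standard deviation
  real a;                       // constant mean on the log scale
  vector[N] eta;                // standard normal draws for the non-centred GP
}
```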

Note that any parameters that do not have a fixed value are given prior distributions later on in the model block and estimated during sampling. More on that soon.

transformed parameters

This is another spicy block. Plenty of what seems like black magic can occur here. In the case of a GP, this block contains most of the heavy lifting. We are declaring our covariance function (the squared exponential from earlier) and feeding in the parameters we declared above. We then add that small delta quantity we specified earlier to the diagonal of the covariance matrix, as it provides some numerical stability to the model. Finally, we call a Cholesky decomposition, which factorises our covariance matrix into the product of a lower triangular matrix and its transpose, a form that is far more efficient to work with. In Cholesky we trust!
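A sketch of how that might look, reusing the assumed parameter names from above:

```stan
transformed parameters {
  vector[N] f;                  // latent GP values at all input points
  {
    matrix[N, N] K;             // covariance matrix over all inputs
    matrix[N, N] L_K;           // its lower-triangular Cholesky factor
    K = gp_exp_quad_cov(x, alpha, rho);   // squared exponential kernel (cov_exp_quad in older Stan)
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;          // jitter on the diagonal
    }
    L_K = cholesky_decompose(K);
    f = L_K * eta;              // implies f ~ multi_normal(0, K)
  }
}
```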

model

Finally! We are at the point where we declare how our model samples. I have divided this block into two sections for visual clarity: (i) priors; and (ii) the likelihood. Notice how, in the priors section, we specify a probability distribution for each of the parameters we declared earlier? This is where we can provide subject matter expertise or prior information about the distributions those parameters are likely drawn from. The likelihood section highlights the power of GPs. Simply switching what would have been a multivariate normal statement to a poisson_log likelihood parameterised by the latent GP defined above converts our model estimates to the target distribution, just as a link function does for GLMs.
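A sketch of the block; the specific priors shown here are illustrative placeholders rather than recommendations (or necessarily the ones in the screenshots):

```stan
model {
  // Priors (illustrative choices; substitute your own domain knowledge)
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();
  eta ~ std_normal();

  // Likelihood: Poisson with a log link, parameterised by the latent GP
  y1 ~ poisson_log(a + f[1:N1]);
}
```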

generated quantities

Final block! This block is pretty wild, as you can write almost anything conceivable here and Stan will try to estimate it. In our case, we want to estimate y values for our test data using our GP evaluated at the test x values. Again, we want to draw from a Poisson distribution, so the correct function to call in Stan in this block is poisson_log_rng. Note the difference from the call to poisson_log in the model block. The differences between these functions, and which block each should be called in, are detailed in the incredibly thorough Stan documentation.
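A sketch:

```stan
generated quantities {
  array[N2] int y2;             // posterior predictive counts at the test inputs
  for (n2 in 1:N2) {
    y2[n2] = poisson_log_rng(a + f[N1 + n2]);
  }
}
```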

Some comments on building probabilistic models

Okay, so now we understand the different components of our model. Notice how explicit modelling in Stan is compared to just fitting linear regressions or GLMs (or any frequentist model, for that matter) in any programming language? This is extremely important, as it highlights the conscious and thoughtful decisions that need to be made by the analyst when constructing probabilistic models. While this seems more difficult and/or tedious, it also leads to more informed and nuanced modelling decisions, with the goal of modelling the underlying generative statistical process (no matter how complex) always front-of-mind. This contrasts with the typical frequentist model-building approach, which is something akin to "getting as close as you can with the tools you have". In probabilistic programming languages, you do not face such limitations--you can feasibly code up almost any model you can conceive of. This becomes increasingly important as the complexity of the modelling scenario increases, such as in the case of GPs, mixed effects models, and stochastic differential equations.

With that mild rant done, let's fit the model and evaluate some outputs. This won't be an exhaustive exercise (and in fact will include no diagnostics) as full Bayesian model evaluation is a whole other post by itself. We can fit the Stan model in R with the following:

carbon-6.png
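For reference, a minimal sketch of the kind of call involved, assuming rstan, a hypothetical file name poisson_gp.stan, and the simulated objects from the earlier sketch (the test grid x2 is an assumption too):

```r
library(rstan)

x2 <- seq(1, 10, length.out = 50)        # test inputs on a finer grid

stan_data <- list(
  N1 = length(x1), x1 = x1, y1 = y1,
  N2 = length(x2), x2 = x2
)

mod <- stan(
  file = "poisson_gp.stan",              # hypothetical file containing the model
  data = stan_data,
  chains = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.99)     # bumped up from the default of 0.8
)
```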

With our model successfully fit and no Stan warnings (if we did not adjust adapt_delta for this model we would have gotten some warnings about the effective sample size being too low), let's go ahead and pull out our median and 90% credible interval estimates for each x value. Note that most of this code is just wrangling the output matrix into a format that allows us to join back the original x data (as Stan returns indices, not values, as column headers). The important conceptual parts are the group_by, summarise, and ungroup statements at the end.

carbon-8.png
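A sketch of what that wrangling might look like using the assumed names from the earlier sketches (the screenshot will differ in the details):

```r
library(tidyr)

posterior_summary <- as.data.frame(mod, pars = "y2") %>%
  pivot_longer(everything(), names_to = "parameter", values_to = "draw") %>%
  mutate(index = as.integer(gsub("y2\\[|\\]", "", parameter)),  # recover the index from "y2[i]"
         x = x2[index]) %>%                                     # join back the original x values
  group_by(x) %>%
  summarise(lower  = quantile(draw, 0.05),
            median = quantile(draw, 0.5),
            upper  = quantile(draw, 0.95)) %>%
  ungroup()
```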

We can now plot our model predictions against the actual data using the following code:
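A minimal sketch, assuming the posterior_summary and sim_data objects from the earlier sketches:

```r
ggplot(posterior_summary, aes(x = x)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "steelblue", alpha = 0.3) +  # 90% credible interval
  geom_line(aes(y = median), colour = "steelblue") +                               # posterior median
  geom_point(data = sim_data, aes(x = x, y = y)) +                                 # observed counts
  labs(x = "x", y = "Count")
```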

And the plot looks like this:

poisson-gp.png

Pretty darn good for absolutely no informative prior specification of hyperparameters or anything else! Our quantiles nicely wrap around the actual data points and the median seems to trace a sensible path through the middle that is close to the points themselves. The ribbon looks a little janky because we only have 20 data points--if we had more it would continue to smooth out until none of the lines looked like glued-together straight lines. Evidently our basic model does pretty well on some non-linear count data.

If we wanted to properly evaluate the model, we would typically go through a process of prior and posterior predictive checks, leave-one-out cross-validation, assessments of chain convergence, and other methods. See here for some visual examples. For now, let's just pull out the posterior distributions of our model parameters using the following code:
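A minimal sketch using bayesplot, with the parameter names assumed from the model sketch above:

```r
library(bayesplot)

mcmc_areas(as.array(mod), pars = c("rho", "alpha", "a"))
```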

This produces the following plot:

params.png

Bonus Example: Classification

As a bonus, I'll show you how easy it is to adapt the code we used above to everyone's favourite task: binary classification! Here is a quick example of some code for the classic Titanic dataset, which provides some information on the passengers and whether or not they survived the sinking. Feel free to run this yourself and try to pull out posterior summaries. Notice that the only things we changed are the likelihood function (from poisson_log to bernoulli_logit) and the input data y vector, which is now constrained to be an integer between 0 and 1.

carbon.png
carbon-2.png
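For reference, the key changes relative to the Poisson sketches above might look something like this (again using my assumed names rather than the exact code in the screenshots; the unchanged parts are elided):

```stan
data {
  // ... as before, except the response is now a binary survival indicator
  array[N1] int<lower=0, upper=1> y1;
}
model {
  // ... same priors and latent GP construction as before
  y1 ~ bernoulli_logit(a + f[1:N1]);
}
generated quantities {
  array[N2] int y2;             // posterior predictive survival draws
  for (n2 in 1:N2) {
    y2[n2] = bernoulli_logit_rng(a + f[N1 + n2]);
  }
}
```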

Final thoughts

Thanks for making it through! GPs are a very mathematically complex but very powerful technique for statistical learning, and have shown immense utility in a broad range of important real-world applications. I hope this post has motivated you to learn more or think about ways that GPs may help you in your own modelling/statistical analysis work. If you do want to learn more, please check out some of the resources I linked at the start of the post.