Based on the Distill article by Jochen Görtler, Rebecca Kehlbeck, and Oliver Deussen; translated by Yi Bai, Zhang Qian, and Wang Shuting.

Gaussian processes allow us to make predictions about data by incorporating prior knowledge. Their most intuitive application is regression. In the original article, the authors use several interactive diagrams to explain Gaussian processes, so that readers can build an intuition for how they work and how they can be adapted to different types of data.

Introduction

Even if you have read books on machine learning, you may never have come across Gaussian processes. And if you have, a refresher on the basics never hurts. The purpose of this article is to introduce Gaussian processes to the reader and to make the mathematics behind them intuitive and easy to understand.

Gaussian processes are a useful and versatile tool in the machine learning toolbox [1]. They allow us to incorporate prior knowledge when making predictions about data. Their most intuitive application is regression, for example in robotics, but they can also be extended to classification and clustering tasks. A quick refresher: the goal of regression is to find a function that describes a given set of data points as closely as possible, a process called fitting a function to the data. For a given set of training points there are potentially infinitely many functions that fit them. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions [1]. The mean of this probability distribution then represents the most probable characterization of the data. Furthermore, the probabilistic approach lets us incorporate the confidence of the prediction into the regression result.

First, we will explore the mathematical foundations of Gaussian process regression. You can use the interactive figures and hands-on examples to understand them; they help explain the contribution of each component and demonstrate the flexibility of Gaussian processes. Hopefully, after reading this article, you will have an intuitive understanding of how Gaussian processes work and how you can adapt them to different types of data.

Multivariate Gaussian distribution

Before we can explore Gaussian processes, we need to understand their mathematical underpinnings. As the name suggests, the Gaussian distribution (also known as the normal distribution) is the basic building block of Gaussian processes. We are most interested in the multivariate Gaussian distribution, in which each random variable is normally distributed and the joint distribution is also Gaussian. In general, a multivariate Gaussian distribution is defined by a mean vector μ and a covariance matrix Σ.

The mean vector μ describes the expected value of the distribution; each of its components describes the mean of the corresponding dimension. Σ models the variance of each dimension and determines how the different random variables are correlated. The covariance matrix is always symmetric and positive semi-definite [4]. The diagonal of Σ consists of the variance σ_ii of the i-th random variable, while the off-diagonal elements σ_ij describe the correlation between the i-th and j-th random variables.

Let's say the random vector X follows a normal distribution, X ∼ N(μ, Σ). The covariance matrix Σ describes the shape of the distribution and is defined in terms of the expected value E:

Σ = Cov(X_i, X_j) = E[(X_i − μ_i)(X_j − μ_j)^T]
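As a small illustration (not part of the original article), the following NumPy sketch draws samples from a two-dimensional Gaussian with a hand-picked μ and Σ and checks that the empirical covariance of the samples recovers Σ. The specific numbers are arbitrary.

```python
import numpy as np

# Draw samples from a 2D Gaussian and check that the empirical covariance
# of the samples recovers the covariance matrix we specified.
mu = np.array([0.0, 1.0])                # mean vector
Sigma = np.array([[1.0, 0.7],            # variances on the diagonal,
                  [0.7, 2.0]])           # covariances off the diagonal

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=100_000)

print(samples.mean(axis=0))              # close to mu
print(np.cov(samples.T))                 # close to Sigma
```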

Visually, the distribution is centered around the mean, and its shape is determined by the covariance matrix. The following figure shows the influence of these parameters on a two-dimensional Gaussian distribution. The variances of the individual random variables sit on the diagonal of the covariance matrix, while the off-diagonal values show the covariance between them.

This is an interactive figure: you can adjust the variance along each dimension and the correlation between the two random variables by dragging the three handles across the graph. The purple area marks the high-probability region of the distribution.

Gaussian distributions are widely used to model the real world: sometimes as a surrogate when the original distribution is unknown, and sometimes because of the central limit theorem. In the following, we explain how to manipulate Gaussian distributions and how to extract useful information from them.

Marginalization and conditioning

The Gaussian distribution has a nice algebraic property: it is closed under conditioning and marginalization. That is, the result of either of these operations is again a Gaussian distribution, which makes many problems in statistics and machine learning tractable. Next, we will look at these two operations more closely, as they form the basis of Gaussian processes.

Marginalization and conditioning both operate on a subset of the original distribution, and we will use the following notation:

P(X, Y) ∼ N(μ, Σ) with μ = (μ_X, μ_Y) and Σ = [[Σ_XX, Σ_XY], [Σ_YX, Σ_YY]]

where X and Y represent subsets of the original random variables, μ_X and μ_Y are the corresponding parts of the mean vector, and Σ_XX, Σ_XY, Σ_YX, Σ_YY are the corresponding blocks of the covariance matrix.

Through marginalization we can extract partial information from a multivariate probability distribution. Given a normal distribution P(X, Y) over the vectors of random variables X and Y, the marginal distributions are

X ∼ N(μ_X, Σ_XX)
Y ∼ N(μ_Y, Σ_YY)

The interpretation is straightforward: each subset X and Y depends only on its corresponding entries in μ and Σ. Thus, to marginalize out a random variable from a Gaussian distribution, we can simply drop the corresponding entries of μ and Σ.

In terms of probability densities, if we are only interested in the density of X = x, we have to consider all possible values of Y and integrate them out:

p_X(x) = ∫ p_{X,Y}(x, y) dy = ∫ p_{X|Y}(x | y) p_Y(y) dy
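The following sketch (our addition, with an arbitrary three-dimensional example) shows that marginalizing a Gaussian really does come down to slicing μ and Σ; no explicit integration is needed.

```python
import numpy as np

# Hypothetical joint Gaussian over (X1, X2, Y); the numbers are arbitrary.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

# Marginalizing out Y amounts to keeping only the entries that belong to X:
idx_X = [0, 1]
mu_X = mu[idx_X]
Sigma_XX = Sigma[np.ix_(idx_X, idx_X)]
# X ~ N(mu_X, Sigma_XX); no integral has to be evaluated explicitly.
print(mu_X, Sigma_XX, sep="\n")
```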

Another operation that is central to Gaussian processes is conditioning, which is used to determine the probability distribution of one set of variables given another. Like marginalization, this operation is closed: the resulting distribution is again Gaussian. Conditioning is the cornerstone of Gaussian processes, since it makes Bayesian inference possible. It is defined as

X | Y ∼ N(μ_X + Σ_XY Σ_YY⁻¹ (Y − μ_Y), Σ_XX − Σ_XY Σ_YY⁻¹ Σ_YX)

Note that the new mean depends on the value of the conditioning variable, while the new covariance matrix does not.
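Here is a small sketch of the conditioning formula in NumPy; the function name condition and the example numbers are our own and are not part of the original article.

```python
import numpy as np

def condition(mu_X, mu_Y, S_XX, S_XY, S_YY, y):
    """Mean and covariance of X | Y = y for a jointly Gaussian (X, Y)."""
    solve = np.linalg.solve                      # avoids forming S_YY^-1 explicitly
    mu_cond = mu_X + S_XY @ solve(S_YY, y - mu_Y)
    S_cond = S_XX - S_XY @ solve(S_YY, S_XY.T)
    return mu_cond, S_cond

# Bivariate example: condition the first variable on the second being 2.0.
mu_c, S_c = condition(np.array([0.0]), np.array([1.0]),
                      np.array([[1.0]]), np.array([[0.7]]),
                      np.array([[2.0]]), y=np.array([2.0]))
print(mu_c, S_c)   # mean 0.35, variance 0.755
```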

Now that we have the necessary formulas, we need a way to understand these two operations visually. Although marginalization and conditioning apply to multivariate distributions of any dimension, it is easiest to use the two-dimensional distribution shown in the figure below as an example. Marginalization can be understood as integrating (accumulating) the Gaussian distribution along one of its dimensions, which matches the general definition of a marginal distribution. Conditioning also has a nice geometric interpretation: we can think of it as making a cut through the multivariate distribution, yielding a Gaussian distribution with fewer dimensions.

In the middle is a bivariate normal distribution. On the left is the marginalization of this distribution with respect to Y, which is similar to accumulating the density along the Y-axis. The distribution on the right is conditioned on a given value of X, which is like making a cut through the original distribution. You can modify the Gaussian distribution and the conditioning value by dragging the points in the diagram.

Gaussian process

Having reviewed the basic properties of multivariate Gaussian distributions, we can now combine them to define Gaussian processes and show how they can be used to solve regression problems.

First, we move from the continuous view to a discrete representation of a function: rather than finding an implicit function, we are interested in predicting the function values at specific points, which we call test points X. Correspondingly, we denote the training data by Y. The key idea behind Gaussian processes is that all of these function values are drawn from a multivariate Gaussian distribution. This means that the joint probability distribution P(X, Y) spans the space of possible function values for the function we want to predict. The joint distribution of test and training data has dimensionality |X| + |Y|.

To perform regression on the training data, we approach the problem with Bayesian inference. The core idea of Bayesian inference is to update the current hypothesis as new information becomes available. For Gaussian processes, this information is the training data. We are therefore interested in the conditional probability P(X|Y). Finally, recall that Gaussian distributions are closed under conditioning, so P(X|Y) is also normally distributed.

Now we have the basic framework of a Gaussian process, except for one thing: how do we set up this distribution, that is, define the mean μ and the covariance matrix Σ? This is done with a kernel function k, which is discussed in the next section. But before that, let's recall how the function can be estimated using a multivariate Gaussian distribution. The example below contains ten test points at which we want to predict the function.

This is another interactive diagram in the original article.

In a Gaussian process, we treat each test point as a random variable, and the dimensionality of the multivariate Gaussian distribution matches the number of random variables. Since we want to predict the function values at |X| = N test points, the corresponding multivariate Gaussian distribution is N-dimensional. Making a prediction with a Gaussian process ultimately boils down to drawing samples from this distribution: the i-th component of a sampled vector is the function value at the i-th test point.
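To make this sampling view concrete, here is a minimal sketch (our addition). The covariance is left as an identity-matrix placeholder, since building a proper covariance from a kernel is the topic of the next section.

```python
import numpy as np

# Each test point corresponds to one dimension of the multivariate Gaussian.
X_test = np.linspace(-5, 5, 10)        # N = 10 test points
N = len(X_test)

mu = np.zeros(N)
Sigma = np.eye(N)                      # placeholder covariance; a kernel-based one follows later

rng = np.random.default_rng(1)
f_sample = rng.multivariate_normal(mu, Sigma)   # one draw = one candidate function
print(X_test[3], f_sample[3])          # the i-th entry is the function value at the i-th test point
```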

Kernel function

Recall that in order to set up the distribution we want, we need to define μ and Σ. In Gaussian processes, we often assume μ = 0, which simplifies the formulas needed for conditioning. This assumption loses no generality: even if μ ≠ 0, we can simply add μ back to the resulting function values after the prediction. Setting μ is therefore easy; the more interesting part is the other parameter of the distribution.

The clever step of a Gaussian process lies in how we set up the covariance matrix Σ. The covariance matrix not only describes the shape of the distribution, it ultimately determines the characteristics of the function that we want to predict. We generate the covariance matrix by evaluating a kernel k, often also called the covariance function, pairwise on all the test points. The kernel receives two points as input and returns a scalar that measures the similarity between them.

We evaluate this function for each pair of test points to obtain the covariance matrix, as illustrated in the figure below. To get a better intuition for the role of the kernel, think about what the entries of the covariance matrix describe: the entry Σ_ij describes how much the i-th and j-th points influence each other. This is consistent with the definition of the multivariate Gaussian distribution, where Σ_ij defines the correlation between the i-th and j-th random variables. Since the kernel describes the similarity between function values, it controls the possible shape of the fitted function. Note that when we choose a kernel, we need to make sure that the matrix it generates satisfies the properties of a covariance matrix.

Kernels are widely used in machine learning, for example in support vector machines. The reason is that they allow us to measure similarity beyond the standard Euclidean (L2) distance; many kernels implicitly embed the input points in a higher-dimensional space to measure similarity there. The following figure shows some common kernels for Gaussian processes. For each kernel, the covariance matrix is generated from N = 25 linearly spaced points in the range [-5, 5]. The entries of the matrix show the covariance between pairs of points, with values between 0 and 1.

The figure above shows various kernels that can be used with Gaussian processes. Each kernel has different parameters, and you can change their values by dragging the sliders. When you click on a slider, you can see how the current parameter influences the kernel function on the right side of the figure.

Kernels can be separated into stationary and non-stationary kernels. Stationary kernels, such as the radial basis function (RBF) kernel or the periodic kernel, are translation-invariant: the covariance between two points depends only on their relative position. Non-stationary kernels, such as the linear kernel, do not have this restriction and depend on the absolute positions of the points. The stationarity of the RBF kernel can be seen in the diagonal bands of its covariance matrix (figure below). Increasing the length-scale parameter makes the bands wider, because points that are farther apart become more correlated with each other. The periodic kernel has an additional parameter p that determines the period, i.e. the distance between consecutive repetitions of the function. In contrast, the parameter c of the linear kernel allows us to change the point at which the sampled functions intersect.
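The sketch below implements common textbook parameterizations of these three kernels; the exact parameter names and forms used in the article's interactive figure may differ.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length=1.0):
    """Stationary: depends only on the squared distance between the points."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return sigma**2 * np.exp(-0.5 * sqdist / length**2)

def periodic_kernel(xa, xb, sigma=1.0, length=1.0, p=2.0):
    """Stationary: depends only on the distance, repeating with period p."""
    d = np.abs(xa[:, None] - xb[None, :])
    return sigma**2 * np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / length**2)

def linear_kernel(xa, xb, sigma_b=0.5, sigma_v=1.0, c=0.0):
    """Non-stationary: depends on the absolute positions relative to the offset c."""
    return sigma_b**2 + sigma_v**2 * (xa[:, None] - c) * (xb[None, :] - c)

X = np.linspace(-5, 5, 25)             # N = 25 linearly spaced points, as in the figure
K_rbf = rbf_kernel(X, X)               # note the diagonal band structure
K_per = periodic_kernel(X, X)
K_lin = linear_kernel(X, X)
```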

There are many other kernels that describe different classes of functions and can be used to model the desired shape of the function. Duvenaud's Automatic Model Construction with Gaussian Processes gives a good overview of different kernels and is worth a look. We can also combine several kernels, but more on that later.

Prior distribution

Let's get back to our original regression task. As mentioned earlier, a Gaussian process defines a probability distribution over possible functions. Because this is a multivariate Gaussian distribution, these functions are also normally distributed. We usually assume μ = 0; for now, let's also assume that we have not yet observed any training data. In the framework of Bayesian inference, this distribution is called the prior distribution P(X).

If we have not yet observed any training examples, the distribution revolves around μ = 0, in line with our original assumption. The dimensionality of the prior distribution equals the number of test points N = |X|, and we use the kernel function to build its N × N covariance matrix.

We saw examples of different kernels in the previous section. Since the kernel determines the entries of the covariance matrix, it also determines which types of functions are more probable within the space of all possible functions. The prior distribution does not incorporate any additional information yet, which makes it an excellent opportunity to show the influence of the kernel on the distribution of functions. The figure below shows samples of potential functions drawn from prior distributions generated with different kernels.

Click on the figure to draw a series of successive samples from a Gaussian process with the selected kernel. After each draw, the previous samples fade into the background. Over time, you can see that the sampled functions are distributed normally around the mean.

By adjusting the parameters you can control the shape of the resulting functions, which also changes the confidence of the predictions. The variance σ is a parameter common to all kernels; if you decrease it, the sampled functions concentrate more tightly around the mean μ. For the linear kernel, setting σ_b = 0 yields a set of functions that intersect exactly at the point c, while setting σ_b = 0.2 introduces some uncertainty, so the sampled functions only pass roughly near c.
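As a rough sketch (our addition), the following code draws functions from the prior of a Gaussian process with an RBF kernel; lowering the sigma argument concentrates the samples around the zero mean, mirroring the behavior described above. The parameter values are illustrative.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length=1.0):
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / length**2)

# Prior: mu = 0 and Sigma built from the kernel; no training data involved yet.
X_test = np.linspace(-5, 5, 50)
K = rbf_kernel(X_test, X_test, sigma=1.0) + 1e-8 * np.eye(len(X_test))  # jitter for stability

rng = np.random.default_rng(2)
prior_samples = rng.multivariate_normal(np.zeros(len(X_test)), K, size=5)
# Each row of prior_samples is one candidate function; try sigma=0.2 to see
# the samples cluster more tightly around the mean.
```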

Posterior distribution

So what happens when we observe training data? Let's return to the model of Bayesian inference: it tells us how to incorporate the additional information into our model in order to obtain the posterior distribution P(X|Y). Let's take a closer look at how this is used in Gaussian processes.

First, we form the joint distribution P(X, Y) of the test points X and the training points Y, which is a multivariate Gaussian distribution of dimension |Y| + |X|. As shown in the figure below, we concatenate the training points and test points to compute the corresponding covariance matrix.

Next, we apply the operation we defined earlier for Gaussian distributions: using conditioning, we derive P(X|Y) from P(X, Y). The dimensionality of the new distribution equals the number of test points N, and it is again a normal distribution. Note that both the mean and the covariance change through conditioning: X|Y ∼ N(μ′, Σ′), as detailed in the section on marginalization and conditioning. Intuitively, the training points constrain the set of candidate functions to those that pass through the training points.

Adding training points (■) changes the dimensionality of the multivariate Gaussian distribution. The covariance matrix is generated by evaluating the kernel pairwise on all points. The result is a twelve-dimensional distribution. After conditioning, we obtain a distribution that describes our prediction of the function for given values of x.

Similar to the prior distribution, we can obtain a prediction by sampling from this distribution. However, because sampling involves randomness, the results will not necessarily fit the data well. To improve the prediction, we can use another basic operation of the Gaussian distribution.

By marginalizing out each remaining random variable, we can extract, for the i-th test point, the corresponding mean function value μ′_i and standard deviation σ′_i = √Σ′_ii. In contrast to the prior distribution, where we set μ = 0 and the mean carried no real information, the distribution obtained by conditioning on the joint distribution of test and training data generally has a non-zero mean, μ′ ≠ 0. Extracting μ′ and σ′ not only gives us a more meaningful prediction, it also tells us the confidence of the predicted values.
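A compact sketch of this conditioning step and of extracting the per-point mean and standard deviation, assuming a zero prior mean and a noise-free RBF kernel; the helper names gp_posterior and rbf_kernel and the training data are our own, not from the article.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length=1.0):
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / length**2)

def gp_posterior(X_train, Y_train, X_test, kernel=rbf_kernel, jitter=1e-8):
    """Condition the joint Gaussian on the training data; returns mu' and Sigma'."""
    K_YY = kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    K_XY = kernel(X_test, X_train)
    K_XX = kernel(X_test, X_test)
    solve = np.linalg.solve
    mu_post = K_XY @ solve(K_YY, Y_train)             # prior mean assumed to be 0
    Sigma_post = K_XX - K_XY @ solve(K_YY, K_XY.T)
    return mu_post, Sigma_post

# Example with made-up training data:
X_train = np.array([-4.0, 0.0, 3.0])
Y_train = np.array([-2.0, 1.0, 0.5])
X_test = np.linspace(-5, 5, 10)

mu_post, Sigma_post = gp_posterior(X_train, Y_train, X_test)
std_post = np.sqrt(np.diag(Sigma_post))               # per-point uncertainty
# mu_post[i] is the prediction at X_test[i]; the uncertainty shrinks near the training points.
```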

The figure below shows an example of such a conditioned distribution. At first, no training points have been observed, so the predicted mean stays at 0 and the standard deviation is the same for every test point. By hovering over the covariance matrix, you can see how each point influences the current test point. As long as no training points have been observed, only neighboring points influence each other.

Training points can be activated by clicking on them, which leads to a constrained distribution. This change is reflected in the entries of the covariance matrix, and it shifts the mean and standard deviation of the predicted function. As expected, the uncertainty of the prediction is small in regions close to the training data and grows as we move further away from it.

When no training data is activated, the figure shows the prior distribution of a Gaussian process with an RBF kernel. When the cursor hovers over the covariance matrix, an opacity gradient shows the influence of a single function value on its neighbors. The distribution changes once we observe training data; the training points can be activated by clicking on them. The Gaussian process is then constrained to assign higher probability to functions that intersect these points. The best explanation of the training data is given by the updated mean function.

In the constrained covariance matrix we can see that the correlation between neighboring points is affected by the training data. If a predicted point lies on a training point, there is no correlation with other points, so the function must pass directly through it. Predicted values further away are also influenced by the training data, to a degree that depends on their distance.

Combining different kernels

As explained earlier, the strength of a Gaussian process lies in the choice of the kernel. This allows experts to introduce domain knowledge into the process and makes Gaussian processes flexible enough to capture trends in the training data. For example, by choosing a suitable bandwidth for the RBF kernel, an expert can control the smoothness of the resulting function.

A great advantage of kernels is that they can be combined to form a more specialized kernel. This lets an expert in a particular field encode even more information, making the predictions more accurate. One common way to combine kernels is to multiply them. Consider the case of two kernels, for example an RBF kernel k_rbf and a periodic kernel k_per. They are combined as follows:

k*(t, t′) = k_rbf(t, t′) · k_per(t, t′)
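Here is a sketch of this kernel product; the kernel definitions are repeated so the snippet runs on its own, and the parameter values are illustrative.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma=1.0, length=1.0):
    return sigma**2 * np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / length**2)

def periodic_kernel(xa, xb, sigma=1.0, length=1.0, p=2.0):
    d = np.abs(xa[:, None] - xb[None, :])
    return sigma**2 * np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / length**2)

def combined_kernel(xa, xb):
    """Elementwise product of an RBF and a periodic kernel, as described in the text."""
    return rbf_kernel(xa, xb) * periodic_kernel(xa, xb)

X = np.linspace(-5, 5, 25)
K_star = combined_kernel(X, X)   # still a valid covariance matrix (Schur product theorem)
```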

In the figure below (an interactive figure in the original article), the raw training data shows an upward trend with periodic deviations. Using only a linear kernel gives an ordinary linear regression of these points. At first glance, the RBF kernel seems to approximate the points accurately, but since it is stationary, it always returns to the mean μ = 0 far away from the observed training data. This makes predictions for very early or very late dates inaccurate. Only by combining several kernels can we capture both the periodicity of the data and its trend toward a non-zero mean. This approach can be used, for example, to analyze weather data.

By clicking the checkboxes, you can combine different kernels into a new Gaussian process. Only by combining multiple kernels can we capture the characteristics of more complex training data.

Conclusion

After reading this article, you should have an overall picture of Gaussian processes and a better understanding of how they work. As we have seen, Gaussian processes offer a flexible framework for regression, and several extensions make them even more general. When dealing with real-world data, we often find that measurements are subject to uncertainty and errors. A Gaussian process lets us choose a kernel function suited to our data and incorporate that uncertainty into the prediction. For example, McHutchon et al. [7] extended Gaussian processes to handle inputs that contain noise.

Although we have mostly discussed Gaussian processes in the context of regression, they can also be used for other tasks, such as model peeling and hypothesis testing. By comparing the effect of different kernels on a data set, domain experts can embed additional knowledge by combining kernels appropriately or choosing suitable parameters for them. Since such experts are not always available, there is also research on using deep learning [8, 9] to learn specialized kernel functions from the given data. In addition, several papers [10, 11] explore the connection between Bayesian inference, Gaussian processes, and deep learning.