Oisín Fitzgerald, May 2021
A look at section 6.4 of:
Bishop C.M. (2006). Pattern recognition and machine learning. Springer.
https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/
Basically this post goes through (Bayesian) linear regression from a Gaussian process (function space) point of view, with some example Julia code to make things concrete.
Update (10/11/2021): Deleted "estimating the hyperparameters" section for now as it was too short and had no examples.
The dominant approach to solving regression problems in machine learning today is finding the parameters of a model that minimise a loss function by optimally combining a set of basis vectors. These basis vectors can be the original data $x_i$ or some transformation $\phi(x_i)$, where $(y_i, x_i)$ is the output-input pair and $x_i$ is of length $p$ (the number of features). For example:
Linear regression: find the best set of weights $w$ that minimise the mean square error, giving us predictions $\hat{y}_i = w^T x_i$.
Deep learning: at the other extreme of complexity we can think of deep learning as learning both the basis vectors and the weights. In a network with $L$ layers the outputs feeding into the final layer and the final layer's weights are $z_i^{(L-1)}$ and $w^{(L)}$, giving us $\hat{y}_i = {w^{(L)}}^T z_i^{(L-1)}$.
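To make the parameter-fitting view concrete, here is a minimal least squares sketch in Julia (the data and coefficients are made up purely for illustration):

using Random, LinearAlgebra

# ordinary least squares on made-up data
Random.seed!(1)
xls = range(-1, 1, length=50)
Xls = [ones(50) xls]                            # design matrix with a column of ones for the intercept
yls = 0.5 .+ 2.0 .* xls .+ 0.1 .* randn(50)     # a made-up line plus noise
w = Xls \ yls                                   # weights that minimise the mean square error
ŷ = Xls * w                                     # predictions ŷᵢ = wᵀxᵢ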
With Gaussian processes we are going to switch from thinking in terms of locating which parameters are most likely to have generated the data to considering the data as a finite sample from a function that has particular properties. The parameter space and function space viewpoints are not conflicting; for example, for linear regression:
Parameter space view: $y(x) = w^T x$ is a combination of basis functions, with the weights $w$ drawn from a multivariate normal distribution.
Function space view: $y(x)$ is a sample from a family of functions where any finite sample of points $y(x_1), \dots, y(x_N)$ follows a multivariate normal distribution.
To fully see the connection let's go from the parameter space view to the function space view for linear regression. The model is

$$y_i = w^T x_i$$

In matrix form the above is written as $y = Xw$, with each row of the matrix $X$ made up of the individual observations $x_i$, each a vector of length $p + 1$, the number of features plus one (to have an intercept term). The prior distribution on our weights reflects a lack of knowledge about the process

$$w \sim \mathcal{N}(0, \alpha^{-1} I)$$
For example, if there is one input we have $w = (w_0, w_1)^T$ and, setting $\alpha = 1$ (arbitrarily), the prior looks like the graph below.
using Plots, Random, Distributions, LinearAlgebra
plotlyjs()
Random.seed!(1)
α = 1.0                          # precision of the prior on the weights
d = MvNormal([0,0], (1/α)*I)     # prior w ~ N(0, α⁻¹I)
W0 = range(-1, 1, length=100)
W1 = range(-1, 1, length=100)
p_w = [pdf(d, [w0,w1]) for w0 in W0, w1 in W1]
contourf(W0, W1, p_w, color=:viridis,xlab="w0",ylab="w1",title="Prior: weight space")
Since we treat the input features (the x's) as constants this implies a prior distribution for the output

$$y = Xw \sim \mathcal{N}(0, \alpha^{-1} X X^T)$$

since $\mathbb{E}[Xw] = X\,\mathbb{E}[w] = 0$ and $\operatorname{cov}[Xw] = X\operatorname{cov}[w]X^T = \alpha^{-1} X X^T$.
From the function space view we can randomly sample functions from this prior, evaluated at a finite set of points.
Random.seed!(1)
x1 = range(-1, 1, length=100)
X = [repeat([1],100) x1]
d = MvNormal(repeat([0],100), (1/α)*X*transpose(X) + 1e-10*I)   # prior y ~ N(0, α⁻¹XXᵀ), with a small jitter since XXᵀ is rank 2
p = plot(x1,rand(d),legend=false,seriestype=:line,title="Prior: function space",xlabel="x",ylabel="y")
for i in 1:20
    plot!(p,x1,rand(d),legend=false,seriestype=:line)
end
p
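To see that the two views really do coincide, here is a small sketch (reusing x1, X and α from above; the seed is arbitrary) that draws weights from the parameter space prior $w \sim \mathcal{N}(0, \alpha^{-1}I)$ and plots the implied lines $Xw$; the result looks just like the function space samples above.

Random.seed!(2)
dw = MvNormal([0.0, 0.0], (1/α)*I)       # parameter space prior on w
pw = plot(legend=false, title="Prior: weight space samples", xlabel="x", ylabel="y")
for i in 1:20
    plot!(pw, x1, X*rand(dw), seriestype=:line)   # each draw of w gives a line y = Xw
end
pw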
The matrix $K = \alpha^{-1} X X^T$ is made up of elements $K_{ij} = k(x_i, x_j) = \alpha^{-1} x_i^T x_j$, with $k(\cdot,\cdot)$ the kernel function. Notice that the kernel function returns the variance for $i = j$ and the covariance between $y(x_i)$ and $y(x_j)$ otherwise. Also note that we are talking here about the covariance between observations, not features: $K$ is an $N \times N$ matrix and so can be quite large. There are many potential kernel functions other than $\alpha^{-1} x_i^T x_j$ but that's for another day.
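As a quick sketch (using the X and α from above), the kernel matrix can be built elementwise and checked against the matrix form:

kernel(xa, xb) = (1/α) * dot(xa, xb)      # linear kernel implied by the prior on w
K = [kernel(X[i,:], X[j,:]) for i in 1:size(X,1), j in 1:size(X,1)]   # covariance between observations
maximum(abs.(K - (1/α)*X*transpose(X)))   # ≈ 0: the same matrix used for sampling above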
We have a prior on $y$ and then we observe some data. Let's assume there is noise in the data so we observe

$$y_i = w^T x_i + \epsilon_i$$

with $\epsilon_i \sim \mathcal{N}(0, \beta)$ random noise that is independent between observations and $y_i$ the observed output values for input features $x_i$.
Random.seed!(1)
n = 10
x1 = range(-1, 1, length=n)
X = [repeat([1],n) x1]
β = 0.01                                                    # noise variance
d = MvNormal(repeat([0],n), (1/α)*X*transpose(X) + β*I)     # y ~ N(0, α⁻¹XXᵀ + βI)
y = rand(d)
p = scatter(x1,y,legend=false,title="Observed data",xlabel="x",ylabel="y")
At this point in practice we could estimate the noise parameter $\beta$, but let's come back to that. For now assume we know that $\beta = 0.01$. It is worth remembering that there are no weights giving us the intercept, slope etc., but we can still sample from our distribution of $y$, $w_0$ or $w_1$ given the observed data. Because our interest is in predicting for new observations we'd like to estimate the posterior for $y(x_*)$ for any future input $x_*$. It turns out the posterior for $y(x_*)$ for any $x_*$ is another normal distribution, which is coded below.
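Written out (following Bishop section 6.4, adapted to this post's notation where $\beta$ is the noise variance): with $C = \alpha^{-1}XX^T + \beta I$ the covariance of the observed $y$, $k$ the vector with elements $k_i = k(x_i, x_*)$ and $c = k(x_*, x_*) + \beta$, the predictive distribution for a new input $x_*$ is

$$p(y_* \mid y) = \mathcal{N}\left(k^T C^{-1} y,\; c - k^T C^{-1} k\right)$$

which is what the loop below computes, one new input at a time.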
p = scatter(x1,y,legend=false,
title="Posterior: function space",xlabel="x",ylabel="y")
# new X's over which to predict
xs = range(-1, 1, length=100)
Xs = [repeat([1],100) xs]
ys = zeros(100)
# get ready to construct posterior
σ2 = zeros(100)
C = (1/α)*X*transpose(X) + β*I
Cinv = inv(C)
# one prediction at a time
for i in 1:100
    k = (1/α) * (X * Xs[i,:])               # kᵢ = k(xᵢ, x*), covariance with the new input
    c = (1/α) * (Xs[i,:]' * Xs[i,:]) + β    # c = k(x*, x*) + β
    ys[i] = (k' * Cinv) * y                 # posterior mean
    σ2[i] = c - (k' * Cinv) * k             # posterior variance
end
plot!(p,xs,ys, ribbon=(2*sqrt.(σ2),2*sqrt.(σ2)), lab="estimate")
plot!(p,xs,ys)
# noise free samples from the posterior
# all predictions at once
m = (1/α) * (Xs * X') * Cinv * y                                   # posterior mean at the new inputs
CV = (1/α) * (Xs * Xs') - (1/α)^2 * (Xs * X') * Cinv * (X * Xs')   # posterior covariance
CV = Symmetric(CV) + 1e-10*I     # symmetrise and add jitter for numerical stability
d = MvNormal(m, CV)
for i in 1:20
    plot!(p,xs,rand(d),legend=false,seriestype=:line)
end
p
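A final implementation note: inv(C) is fine for $n = 10$, but for larger datasets an explicit inverse is usually avoided in favour of a Cholesky factorisation of $C$, which is faster and more numerically stable. A minimal sketch (same variables as above, same result):

ch = cholesky(Symmetric(C))            # C = α⁻¹XXᵀ + βI, factorised once
m_chol = (1/α) * (Xs * X') * (ch \ y)  # posterior mean without forming inv(C)
maximum(abs.(m_chol - m))              # ≈ 0: matches the inverse-based computation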