Gaussian processes and linear regression

Oisín Fitzgerald, May 2021

A look at section 6.4 of:

Bishop C.M. (2006). Pattern recognition and machine learning. Springer.

https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/

Basically this post goes through (Bayesian) linear regression from a Gaussian process (function space) point of view, with some example Julia code to make things concrete.

Update (10/11/2021): Deleted "estimating the hyperparameters" section for now as it was too short and had no examples.

Overview

The dominant approach to solving regression problems in machine learning today is finding the parameters $w$ of a model $M_w$ that minimise a loss function $L$ by optimally combining a set of basis vectors. These basis vectors can be the original data $x_n$ or some transformation $z_n = \phi(x_n)$, where $(y_n, x_n)$ is the $n^{th}$ output-input pair, $n \in \{1,...,N\}$, and $x_n$ is of length $p$ (the number of features). For example:

  • Linear regression: find the best set of weights $w$ that minimise the squared error $\left\Vert Y - X w \right\Vert^2$, giving us predictions $y_n = w^t x_n$ (see the short sketch after this list).

  • Deep learning: at the other extreme of complexity we can think of deep learning as learning both the basis vectors and the weights. In a network with $L$ layers the outputs and weights of the final layer are $z_L$ and $w_L$, giving us $y_n = w_L^t z_L(x_n)$.
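
As a minimal sketch of the parameter space view, here is ordinary least squares in Julia on some made-up data (the variable names and simulated data are purely illustrative):

using LinearAlgebra, Random

Random.seed!(1)
N = 50
x = randn(N)
X = [ones(N) x]                  # design matrix with an intercept column
w_true = [0.5, 2.0]
Y = X * w_true + 0.1 * randn(N)  # simulated targets

# least squares: the weights minimising ||Y - Xw||²
w = X \ Y                        # equivalent to inv(X'X) * X'Y, but more stable
ŷ = X * w                        # fitted values yₙ = wᵗxₙ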

With Gaussian processes we are going to switch from thinking in terms of locating which parameters are most likely to have generated the data to considering the data as a finite sample from a function with particular properties. The parameter and function space viewpoints are not conflicting; for example, for linear regression:

  1. Parameter space view: $y$ is a combination of basis functions with the weights drawn from a multivariate normal distribution.

  2. Function space view: $y(x_n)$ is a sample from a family of functions where any finite sample of points $\{y_1,...,y_N\}$ follows a multivariate normal distribution.

From the parameter to function space view

To fully see the connection let's go from the parameter space view to the function space view for linear regression. The model is

$$y(x_n) = w^t x_n$$

In matrix form the above is written as $Y = X w$, with each row of the $N \times (p+1)$ matrix $X$ being one of the $N$ individual observations $x^t_n$, each a vector of length $p+1$: the number of features plus one for an intercept term. The prior distribution on our weights $w$ reflects a lack of knowledge about the process

$$w \sim N(0,\alpha^{-1}I)$$

For example, if there is one input we have $w = (w_0, w_1)^t$ and, setting $\alpha = 1.0$ (arbitrarily), the prior looks like the graph below.

using Plots, Random, Distributions, LinearAlgebra
plotlyjs()
Random.seed!(1)
# prior on the weights: w ~ N(0, (1/α)I)
α = 1.0
d = MvNormal([0,0], (1/α)*I)
# evaluate the prior density over a grid of (w0, w1) values
W0 = range(-1, 1, length=100)
W1 = range(-1, 1, length=100)
p_w = [pdf(d, [w0,w1]) for w0 in W0, w1 in W1]
contourf(W0, W1, p_w, color=:viridis,xlab="w0",ylab="w1",title="Prior: weight space")

Since we treat input features (the x's) as constants this implies a prior distribution for the output

$$y \sim N(0,\alpha^{-1} X X^t)$$
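
This is just the standard result for a linear transformation of a Gaussian: since $y = Xw$ and the $x$'s are treated as constants,

$$\text{cov}(y) = \text{cov}(Xw) = X\,\text{cov}(w)\,X^t = \alpha^{-1} X X^t.$$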

From the function space view we can randomly sample functions at finite spacings $\mathfrak{X} = \{x_1,...,x_N\}$ from the prior.

Random.seed!(1)
x1 = range(-1, 1, length=100)
X = [repeat([1],100) x1]   # design matrix: intercept column plus the input x
# prior over function values y = Xw: N(0, (1/α)XXᵗ); the 1e-10*I jitter keeps the covariance positive definite
d = MvNormal(repeat([0],100), (1/α)*X*transpose(X) + 1e-10*I)
p = plot(x1,rand(d),legend=false,seriestype=:line,title="Prior: function space",xlabel="x",ylabel="y")
# draw and plot 20 more functions from the prior
for i in 1:20
    plot!(p,x1,rand(d),legend=false,seriestype=:line)
end

The matrix $K = \text{cov}(y) = \alpha^{-1} X X^t$ is made up of elements $K_{nm} = k(x_n,x_m) = \frac{1}{\alpha}x_n^t x_m$, with $k(x,x')$ the kernel function. Notice that the kernel function $k(x,x')$ returns the variance when $x = x'$ and the covariance between $x$ and $x'$ otherwise. Note also that we are talking here about the covariance between observations, not features: $K$ is an $N \times N$ matrix and so can be quite large. There are many potential kernel functions other than $k = x^t x$ but that's for another day.
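
To make the kernel view concrete, here is a small sketch (the helper name linear_kernel is just for illustration) that builds $K$ element by element from the kernel function and checks it against the matrix form used above:

# linear kernel: k(x, x') = (1/α) xᵗx'
linear_kernel(xa, xb, α) = (1/α) * dot(xa, xb)

# build K elementwise and check it matches (1/α)XXᵗ
K = [linear_kernel(X[n,:], X[m,:], α) for n in 1:size(X,1), m in 1:size(X,1)]
maximum(abs.(K - (1/α)*X*transpose(X)))   # ≈ 0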

Modelling data with straight lines

We have a prior on $y$ and then we observe some data. Let's assume there is noise in the data so we observe

$$t_n = y(x_n) + \epsilon_n$$

with $\epsilon_n \sim N(0,\beta)$ random noise (so $\beta$ is the noise variance) that is independent between observations, and $t = \{t_1,...,t_N\}$ the observed output values for the input features $x_n$.
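
Marginalising over the prior on $y$, any finite set of noisy observations is then also multivariate normal, $t \sim N(0, C)$, with covariance matrix

$$C = \alpha^{-1} X X^t + \beta I, \qquad C_{nm} = k(x_n,x_m) + \beta \delta_{nm},$$

which is exactly the matrix constructed in the code below.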

Random.seed!(1)
n = 10
x1 = range(-1, 1, length=n)
X = [repeat([1],n) x1]
β = 0.01                      # noise variance (assumed known)
# the noisy targets t ~ N(0, (1/α)XXᵗ + βI)
d = MvNormal(repeat([0],n), (1/α)*X*transpose(X) + β*I)
y = rand(d)                   # y holds the observed targets t_n
p = scatter(x1,y,legend=false,title="Observed data",xlabel="x",ylabel="y")

At this point in practice we could estimate the noise parameter $\beta$, but let's come back to that. For now assume we know that $\beta = 0.01$. It is worth remembering that there are no weights giving us the intercept, slope etc., but we can still sample from the distribution of $y|t$ or $t_*|t$ given the observed data. Because our interest is in predicting for new observations we'd like the posterior $p(t_*|t,x,x_*)$ for any future input $x_*$. It turns out the posterior for any $t_*$ is another normal distribution, which is coded below.
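
For reference, the predictive mean and variance that the code below computes are the standard Gaussian process regression results (Bishop, section 6.4.2), written here with $\beta$ as the noise variance:

$$m(x_*) = k^t C^{-1} t, \qquad \sigma^2(x_*) = c - k^t C^{-1} k$$

where $C$ is the covariance matrix defined above, $k$ is the vector with elements $k_n = k(x_n, x_*)$ and $c = k(x_*, x_*) + \beta$.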

p = scatter(x1,y,legend=false,
            title="Posterior: function space",xlabel="x",ylabel="y")

# new X's over which to predict
xs = range(-1, 1, length=100)
Xs = [repeat([1],100) xs]
ys = zeros(100)

# get ready to construct the posterior
σ2 = zeros(100)
C = (1/α)*X*transpose(X) + β*I
Cinv = inv(C)

# one prediction at a time
for i in 1:100
    k = (1/α) * X * Xs[i,:]               # k(x_n, x*) for each training point
    c = (1/α) * Xs[i,:]' * Xs[i,:] + β    # k(x*, x*) plus noise variance
    ys[i] = (k' * Cinv) * y               # predictive mean
    σ2[i] = c - (k' * Cinv) * k           # predictive variance
end
plot!(p,xs,ys, ribbon=(2*sqrt.(σ2),2*sqrt.(σ2)))

# noise free samples from the posterior
# all predictions at once
Ksx = (1/α) * Xs * X'                     # k(x*, x_n)
Kss = (1/α) * Xs * Xs'                    # k(x*, x*')
m = Ksx * Cinv * y                        # posterior mean
CV = Kss - Ksx * Cinv * Ksx'              # posterior covariance
CV = Symmetric(CV) + 1e-10*I              # jitter for numerical stability
d = MvNormal(m, CV)
for i in 1:20
    plot!(p,xs,rand(d),legend=false,seriestype=:line)
end
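
As a final check on the parameter/function space equivalence, the same predictive mean can be recovered from the weight space view. Under the prior $w \sim N(0,\alpha^{-1}I)$ and noise variance $\beta$, the posterior mean of the weights is $m_w = \beta^{-1}(\alpha I + \beta^{-1}X^t X)^{-1} X^t t$ (the standard Bayesian linear regression result, see Bishop section 3.3). A quick sketch confirming the two views give the same predictions:

# weight space view: posterior over w given the same prior and noise model
A = α*I(2) + (1/β) * X' * X          # posterior precision of w
m_w = (1/β) * (A \ (X' * y))         # posterior mean of w

# predictions from the weight space view match the function space mean m
maximum(abs.(Xs * m_w - m))          # ≈ 0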