So far, given a training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, we fitted linear functions of the form $f(x; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x)$, where $\mathbf{w} \in \mathbb{R}^D$ and $\boldsymbol{\phi}(x)$ is a feature vector built from the input. However, in many real-world applications, given an input $x$, there is often uncertainty regarding the value of the output $y$. This makes it difficult to formulate a training set as we did above.

For instance, say we’d like to predict the strength of concrete given the amount of sand that was used in the mixture. To collect training data, we pick some sand amounts $x_1, \dots, x_N$. Say that for each $x_n$ we build concrete 10 times and measure the concrete strengths $y_n^{(1)}$ to $y_n^{(10)}$ that result from using that much sand. It is very realistic to get 10 different values, as there may be other factors that we did not control. For instance, perhaps we mistakenly used slightly different amounts of water each time. Or the weather might have been warmer on some days, resulting in quicker strengthening compared to other times.

Such a scenario is illustrated below. For each input $x_n$, we have outputs $y_n^{(1)}, \dots, y_n^{(10)}$ with error bar $\sigma$. That is, the standard deviation of the measurements at each input is $\sigma$.

Of course, we could simply average the measurements at each input to get one value, i.e. build the training pair $(x_n, \bar{y}_n)$. But what if we wanted a framework that allows us to embed this uncertainty about the specific response $y$ given an input $x$?

All three of the lines above are plausible fits to the data. Due to the measurement noise, we just don’t know what the “correct” fit is. We would like an “intelligent agent” that can answer questions like:

  • what are the possible fits to this data, and how uncertain are we about each such fit?
  • given a new input location, what are possible outputs at that location, and how uncertain are we about each such output?

That is, we would like a formal way of answering such questions that allows us to express our uncertainty about a particular answer. We can use probability theory for this.

Representing uncertainty

Uncertainty about predictions

Given

  • a function $f(\cdot; \mathbf{w})$, i.e. we know both the family of the parametric function and its parameters $\mathbf{w}$;
  • an input $x$,

we can represent our uncertainty about the output $y$ by assuming that $y$ is drawn from a normal distribution centred at $f(x; \mathbf{w})$, with standard deviation $\sigma_y$.

That is, we assume

$$y = f(x; \mathbf{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_y^2).$$

That is, $y \sim \mathcal{N}(f(x; \mathbf{w}), \sigma_y^2)$.

Alternatively, $y \mid x, \mathbf{w} \sim \mathcal{N}(f(x; \mathbf{w}), \sigma_y^2)$. That is, $p(y \mid x, \mathbf{w}) = \mathcal{N}(y;\, f(x; \mathbf{w}),\, \sigma_y^2)$. This is also known as the observation model. Here, $f(\cdot; \mathbf{w})$ is an unapplied function parametrised by $\mathbf{w}$.
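As a small illustration, here is a minimal sketch of sampling from this observation model. The particular linear form and the values of `w1`, `w2`, and `sigma_y` below are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example model f(x; w) = w1 + w2 * x with known parameters.
w1, w2 = 0.3, 0.5
sigma_y = 0.15  # assumed observation noise standard deviation

def sample_observations(x, n_samples=5):
    """Draw noisy outputs y ~ N(f(x; w), sigma_y^2) at input x."""
    mean = w1 + w2 * x
    return rng.normal(loc=mean, scale=sigma_y, size=n_samples)

print(sample_observations(0.2))  # several plausible outputs at x = 0.2
```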

Maximum Likelihood

A small note now. We have already embedded some uncertainty in our pipeline. Mainly, for a given model $f(\cdot; \mathbf{w})$ and an input $x_n$, we assumed the output $y_n$ is normally distributed with mean $f(x_n; \mathbf{w})$ and standard deviation $\sigma_y$. We also assumed that the standard deviation is constant for all outputs. We can formulate the likelihood of the outputs given the inputs and the model:

$$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{n=1}^N p(y_n \mid x_n, \mathbf{w}) = \prod_{n=1}^N \mathcal{N}\!\left(y_n;\, f(x_n; \mathbf{w}),\, \sigma_y^2\right).$$

In general, the likelihood of some quantity is a term used when we consider the probability of the data as a function of that quantity.

One thing we can do is fit the parameters $\mathbf{w}$ to maximise the likelihood. This is known as the maximum likelihood approach to model fitting. It is equivalent to minimising the negative log-likelihood, for which the computations are easier:

$$-\log p(\mathbf{y} \mid X, \mathbf{w}) = -\sum_{n=1}^N \log \mathcal{N}\!\left(y_n;\, f(x_n; \mathbf{w}),\, \sigma_y^2\right).$$

Here, we have used the fact that:

$$\mathcal{N}(y;\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right),$$

so:

$$-\log p(\mathbf{y} \mid X, \mathbf{w}) = \frac{1}{2\sigma_y^2} \sum_{n=1}^N \left(y_n - f(x_n; \mathbf{w})\right)^2 + \frac{N}{2} \log\!\left(2\pi\sigma_y^2\right).$$

Conclusion: having assumed a constant noise variance $\sigma_y^2$, we end up fitting the parameters simply by minimising the squared error, exactly as done earlier in the course. Of course, we could choose a non-constant variance $\sigma_n^2$, in which case we cannot pull the resulting $\frac{1}{2\sigma_n^2}$ term out of the sum. We call $\frac{1}{\sigma_n^2}$ the precision. In this case, the squared difference of each sample would be weighted by its individual precision.
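To make this equivalence concrete, here is a minimal sketch, assuming a linear model $f(x; \mathbf{w}) = w_1 + w_2 x$ and some synthetic data: with a constant noise variance, maximising the likelihood is just ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed "true" line, for illustration only.
x = np.linspace(-1, 1, 20)
y = 0.3 + 0.5 * x + rng.normal(scale=0.15, size=x.shape)

# Design matrix with a column of ones for the intercept.
Phi = np.stack([np.ones_like(x), x], axis=1)

# Maximum likelihood under constant Gaussian noise = least squares fit.
w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w_ml)  # estimated (w1, w2), close to (0.3, 0.5)
```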

But maximum likelihood is not … Bayesian.

Uncertainty about the model

Let us restrict the model family to functions of the form $f(x; \mathbf{w}) = w_1 + w_2 x$, where $\mathbf{w} = (w_1, w_2)^\top$, i.e. straight lines. So, if we knew the parameters $\mathbf{w}$, we could compute the function for any $x$ and simulate what typical observations at that location would look like. However, when learning from data, we don’t know the parameters $\mathbf{w}$.

The prior over models

We do not know the parameters, but we may have some prior belief about what they should be. Prior, as in: a belief that we hold before seeing any data. For instance, we may know that the correlation between the amount of sand and concrete strength should be positive, i.e. the function should have a positive slope. Indeed, we intuitively expect the strength to grow as we put more sand in, not decrease.

We represent our prior beliefs and uncertainty about the parameters with a distribution $p(\mathbf{w})$ over the space of parameters $\mathbf{w}$. This way we express uncertainty about the function itself. Since we have already fixed the function family, the function is now fully specified by the parameters.

For example, we could set:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w};\, \mathbf{w}_0, V_0), \qquad \mathbf{w}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad V_0 = 0.4^2\, I.$$

That is, we think the following functions are all reasonable models we could consider. To produce a line in the plot below, we sampled parameters $(w_1, w_2)$ from the distribution above and plotted the resulting function $w_1 + w_2 x$.

%config InlineBackend.figure_format = 'svg'
 
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal
 
np.set_printoptions(precision=2)
plt.rcParams.update({'font.size': 12, 'text.usetex': True})
 
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].set(xlabel=r"$w_1$", ylabel=r"$w_2$")
 
w_0 = np.array([0, 0])
V_0 = 0.4**2 * np.identity(2)
 
dist = multivariate_normal(w_0, V_0)
x1, x2 = np.mgrid[-1:1:0.1, -1:1:0.1]
 
points = np.dstack([x1, x2])
ax[0].contourf(x1, x2, dist.pdf(points))
 
x = np.linspace(-1, 1)
colors = ['red', 'green', 'blue', 'yellow', 'gray']
for color in colors:
    w1, w2 = dist.rvs()
    ax[0].plot([w1], [w2], 'o', c=color)
    ax[1].plot(x, w1 + w2 * x, c=color)
 
plt.show()


The posterior over models

We formulated some prior beliefs. At this point, consider the following experiment. We go ahead and build concrete using various amounts of sand $x_n$. For each amount of sand $x_n$, we assume the true concrete strength is given by the parametric function $f(x_n; \mathbf{w})$. We have a prior over what that function may be.

Now imagine that, for each amount of sand $x_n$, we build concrete several times. Under perfect conditions, the resulting concrete should have the strength $f(x_n; \mathbf{w})$ every time. However, several things may go wrong: weather conditions, such as temperature and humidity, may vary across builds. As a result, for the same amount of sand $x_n$, we may get slightly different concrete strengths every time. We express our uncertainty about the concrete strength with an observation model $p(y \mid x, \mathbf{w})$.

In the linear regression case, we assumed $y \mid x, \mathbf{w} \sim \mathcal{N}(f(x; \mathbf{w}), \sigma_y^2)$. Note that we assume this standard deviation is the same for all amounts of sand $x_n$, hence we called it $\sigma_y$, not $\sigma_{y_n}$.

Via the experiment above we can construct a training dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ of pairs. Note that here $y_n$ could be the average observed strength across several concrete-building experiments given the amount of sand $x_n$.

We now formulate the posterior distribution over the weight space. This is a distribution over models, since the weights fully specify the model, given the prior assumption that we are working with the family of linear models. This distribution reflects our beliefs and uncertainty about the weights after having observed the training examples in $\mathcal{D}$. Using Bayes’ rule:

$$p(\mathbf{w} \mid \mathbf{y}, X) = \frac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}.$$

Here, $p(\mathbf{w})$ is the prior over the parameters that we saw before. $p(\mathbf{y} \mid X, \mathbf{w})$ is called the likelihood of the parameters. Intuitively, it measures how likely it is that our data was generated using these specific parameters.

In general, the likelihood of some quantity is a term used when we consider the probability of the data as a function of that quantity. The likelihood function is not a probability distribution, as it need not integrate (or sum) to one over the parameters. Intuitively, it could be that there are 100 parameter settings and, under each setting, our data would be generated with the same probability 0.1. That sums up to 10, not to 1.

In our case, we assume the inputs $X$ are fixed quantities that do not affect our beliefs about the parameters. With these assumptions in place, we write:

$$p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w}) = \left[\prod_{n=1}^N \mathcal{N}\!\left(y_n;\, f(x_n; \mathbf{w}),\, \sigma_y^2\right)\right] \mathcal{N}(\mathbf{w};\, \mathbf{w}_0, V_0),$$

where the constant of proportionality is $1 / p(\mathbf{y} \mid X)$, which does not depend on $\mathbf{w}$.

At a high level, Bayesian inference updates the beliefs we held before seeing the data (the prior), so that parameters that are compatible with the data (high likelihood) become more probable under our new beliefs (the posterior).

How do we compute the posterior? Luckily, our prior is a conjugate prior to the likelihood. That is, a prior for which the product of prior and likelihood gives a distribution of the same functional form as the prior. In our case, the prior is a Gaussian, which is conjugate to the likelihood, also a Gaussian.

So, we have:

$$p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w};\, \mathbf{w}_N, V_N).$$

After grunt work we get:

$$V_N = \sigma_y^2 \left(\sigma_y^2 V_0^{-1} + \Phi^\top \Phi\right)^{-1}, \qquad \mathbf{w}_N = V_N \left(V_0^{-1} \mathbf{w}_0 + \frac{1}{\sigma_y^2} \Phi^\top \mathbf{y}\right),$$

where $\Phi$ is the design matrix whose $n$-th row is $\boldsymbol{\phi}(x_n)^\top = [1, x_n]$ and $\mathbf{y}$ is the vector of observed outputs. Recall that $\mathbf{w}_0$ and $V_0$ are the mean vector and covariance matrix of the prior, respectively. That is, the mean and covariance after seeing 0 data points.

Here is that grunt work:

$$\log p(\mathbf{w} \mid \mathcal{D}) = -\frac{1}{2\sigma_y^2} (\mathbf{y} - \Phi\mathbf{w})^\top (\mathbf{y} - \Phi\mathbf{w}) - \frac{1}{2} (\mathbf{w} - \mathbf{w}_0)^\top V_0^{-1} (\mathbf{w} - \mathbf{w}_0) + \text{const}$$

$$= -\frac{1}{2} \mathbf{w}^\top \left(\frac{1}{\sigma_y^2}\Phi^\top\Phi + V_0^{-1}\right) \mathbf{w} + \mathbf{w}^\top \left(\frac{1}{\sigma_y^2}\Phi^\top\mathbf{y} + V_0^{-1}\mathbf{w}_0\right) + \text{const}.$$

Notice how a normal distribution with mean $\mathbf{w}_N$ and covariance $V_N$ would expand:

$$\log \mathcal{N}(\mathbf{w};\, \mathbf{w}_N, V_N) = -\frac{1}{2} \mathbf{w}^\top V_N^{-1} \mathbf{w} + \mathbf{w}^\top V_N^{-1} \mathbf{w}_N + \text{const}.$$

By comparing the terms of this with the previous expression for the posterior we can see:

$$V_N^{-1} = \frac{1}{\sigma_y^2}\Phi^\top\Phi + V_0^{-1} \quad\Longrightarrow\quad V_N = \sigma_y^2 \left(\sigma_y^2 V_0^{-1} + \Phi^\top\Phi\right)^{-1}.$$

And

$$V_N^{-1} \mathbf{w}_N = \frac{1}{\sigma_y^2}\Phi^\top\mathbf{y} + V_0^{-1}\mathbf{w}_0 \quad\Longrightarrow\quad \mathbf{w}_N = V_N \left(V_0^{-1}\mathbf{w}_0 + \frac{1}{\sigma_y^2}\Phi^\top\mathbf{y}\right).$$
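The formulas above translate directly into code. A minimal sketch (the helper name `posterior_params` is my own, introduced just for this example):

```python
import numpy as np

def posterior_params(Phi, y, w_0, V_0, sigma_y):
    """Posterior mean w_N and covariance V_N of the weights.

    Phi: (N, D) design matrix, y: (N, 1) outputs,
    w_0: (D, 1) prior mean, V_0: (D, D) prior covariance,
    sigma_y: observation noise standard deviation.
    """
    V_0_inv = np.linalg.inv(V_0)
    V_N = sigma_y**2 * np.linalg.inv(sigma_y**2 * V_0_inv + Phi.T @ Phi)
    w_N = V_N @ (V_0_inv @ w_0 + Phi.T @ y / sigma_y**2)
    return w_N, V_N
```

The same update appears, inlined, in the example code at the end of this post.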

Bayesian prediction

So far we formalised:

  • our prior beliefs about the parameters, i.e. our beliefs before seeing any data;
  • our posterior beliefs about the parameters, i.e. our beliefs after seeing some data.

We now focus on prediction. That is, given a new input $x^*$, not present in the training set $\mathcal{D}$, we would like to express our uncertainty regarding possible outputs $y^*$ in the form of a distribution $p(y^* \mid x^*, \mathcal{D})$.

Before looking at linear regression in particular, let us discuss a general strategy for solving such prediction problems.

General Strategy

We often can’t immediately write down a mathematical expression for $p(y^* \mid x^*, \mathcal{D})$. So, by using the sum rule, we introduce “stuff” $\boldsymbol{\theta}$ (model parameters and/or latent variables) that helps us define the problem.

If we know how to predict $y^*$ when told the extra stuff $\boldsymbol{\theta}$, we can find the predictive distribution: weight the predictions for each possible stuff, $p(y^* \mid x^*, \boldsymbol{\theta}, \mathcal{D})$, by the posterior probability of the stuff, $p(\boldsymbol{\theta} \mid \mathcal{D})$:

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \boldsymbol{\theta}, \mathcal{D})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}.$$

We get the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ from Bayes’ rule.

Note: sometimes the extra stuff summarizes everything we need to know to make a prediction, and the data is then irrelevant, i.e. $p(y^* \mid x^*, \boldsymbol{\theta}, \mathcal{D}) = p(y^* \mid x^*, \boldsymbol{\theta})$.

Predictions for linear regression

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \mathbf{w}, \mathcal{D})\, p(\mathbf{w} \mid x^*, \mathcal{D})\, d\mathbf{w} = \int p(y^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}.$$

To get the last line, notice that:

  • given the parameters $\mathbf{w}$, the distribution of $y^*$ is not dependent on $\mathcal{D}$ anymore, only on $x^*$;
  • the posterior distribution of the parameters is not dependent on the new input $x^*$.

Plugging in the known forms of the posterior and of the predictive distribution for given parameters, the expression becomes:

$$p(y^* \mid x^*, \mathcal{D}) = \int \mathcal{N}\!\left(y^*;\, \boldsymbol{\phi}(x^*)^\top\mathbf{w},\, \sigma_y^2\right)\, \mathcal{N}(\mathbf{w};\, \mathbf{w}_N, V_N)\, d\mathbf{w}.$$

How do we compute this integral…? Analytically it’s quite tedious. Instead, we can note the following.

We have assumed that, given a function $f(\cdot; \mathbf{w})$ and an input $x^*$, the output $y^*$ is a random variable, distributed as $\mathcal{N}(f(x^*; \mathbf{w}), \sigma_y^2)$. This is the same as saying

$$y^* = f(x^*; \mathbf{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_y^2).$$

For linear regression, we know $f(x^*; \mathbf{w}) = \boldsymbol{\phi}(x^*)^\top\mathbf{w}$, i.e. $y^* = \boldsymbol{\phi}(x^*)^\top\mathbf{w} + \varepsilon$. We know $\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}(\mathbf{w}_N, V_N)$, so $\boldsymbol{\phi}(x^*)^\top\mathbf{w} \mid \mathcal{D} \sim \mathcal{N}\!\left(\boldsymbol{\phi}(x^*)^\top\mathbf{w}_N,\, \boldsymbol{\phi}(x^*)^\top V_N\, \boldsymbol{\phi}(x^*)\right)$, because $\boldsymbol{\phi}(x^*)^\top\mathbf{w}$ is a linear combination of Gaussians (each $w_i$ is a Gaussian).

So, since $\varepsilon$ is independent Gaussian noise with variance $\sigma_y^2$, we have:

$$y^* \mid x^*, \mathcal{D} \sim \mathcal{N}\!\left(\boldsymbol{\phi}(x^*)^\top\mathbf{w}_N,\; \boldsymbol{\phi}(x^*)^\top V_N\, \boldsymbol{\phi}(x^*) + \sigma_y^2\right).$$
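If you don’t fully trust the algebra, the result is easy to check numerically: sample many weight vectors from the posterior, push each through $\boldsymbol{\phi}(x^*)^\top\mathbf{w} + \varepsilon$, and compare the empirical mean and variance with the closed form. A minimal sketch, using made-up example values for $\mathbf{w}_N$, $V_N$, and $\sigma_y$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed example posterior and noise level, for illustration only.
w_N = np.array([0.3, 0.5])
V_N = np.array([[0.02, 0.01],
                [0.01, 0.03]])
sigma_y = 0.15

phi_star = np.array([1.0, 0.4])  # features of a new input x* = 0.4

# Monte Carlo: sample weights from the posterior, then add observation noise.
w_samples = rng.multivariate_normal(w_N, V_N, size=100_000)
y_samples = w_samples @ phi_star + rng.normal(scale=sigma_y, size=100_000)

# Closed form: N(phi*^T w_N, phi*^T V_N phi* + sigma_y^2).
print(y_samples.mean(), phi_star @ w_N)
print(y_samples.var(), phi_star @ V_N @ phi_star + sigma_y**2)
```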

Bayesian decision making

Given a test input $x^*$, the Bayesian approach says that we don’t know exactly what the output $y^*$ will be. It does not return a single value, but a distribution over possible values. However, we sometimes need to provide a point estimate $\hat{y}$.

We can provide such a point estimate by writing down a loss function $L(\hat{y}, y^*)$ that quantifies how bad it is to report $\hat{y}$ when the true output turns out to be $y^*$, and then choosing the estimate that minimises the expected loss under the predictive distribution:

$$\hat{y} = \arg\min_{\hat{y}} \; \mathbb{E}_{p(y^* \mid x^*, \mathcal{D})}\!\left[L(\hat{y}, y^*)\right].$$
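For example, under the squared loss $L(\hat{y}, y^*) = (\hat{y} - y^*)^2$ (one common choice, not the only one), the expected loss

$$\mathbb{E}_{p(y^* \mid x^*, \mathcal{D})}\!\left[(\hat{y} - y^*)^2\right] = \hat{y}^2 - 2\hat{y}\,\mathbb{E}[y^*] + \mathbb{E}\!\left[(y^*)^2\right]$$

is a quadratic in $\hat{y}$; setting its derivative to zero gives $\hat{y} = \mathbb{E}[y^*] = \boldsymbol{\phi}(x^*)^\top\mathbf{w}_N$, the predictive mean. The absolute loss would instead pick the predictive median, which for a Gaussian predictive distribution coincides with the mean.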

Summary

  • Dataset $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$;
  • Observation model $p(y \mid x, \boldsymbol{\theta})$;
  • Prior over models $p(\boldsymbol{\theta})$;
  • Posterior over models $p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$;
  • Predictions $p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\, d\boldsymbol{\theta}$.

For linear regression:

  • Assume $f(x; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x)$, with $\boldsymbol{\phi}(x) = [1, x]^\top$;
  • Observation model $p(y \mid x, \mathbf{w}) = \mathcal{N}\!\left(y;\, \mathbf{w}^\top\boldsymbol{\phi}(x),\, \sigma_y^2\right)$. We wrote $\mathbf{w}$ instead of $\boldsymbol{\theta}$ because we have the prior belief that the true model has the form $\mathbf{w}^\top\boldsymbol{\phi}(x)$. As such, the model is fully specified by the weights;
  • Prior over models $p(\mathbf{w}) = \mathcal{N}(\mathbf{w};\, \mathbf{w}_0, V_0)$;
  • Posterior over models $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w};\, \mathbf{w}_N, V_N)$, where $V_N = \sigma_y^2\left(\sigma_y^2 V_0^{-1} + \Phi^\top\Phi\right)^{-1}$ and $\mathbf{w}_N = V_N\left(V_0^{-1}\mathbf{w}_0 + \frac{1}{\sigma_y^2}\Phi^\top\mathbf{y}\right)$;
  • Predictions $p(y^* \mid x^*, \mathcal{D}) = \mathcal{N}\!\left(y^*;\, \boldsymbol{\phi}(x^*)^\top\mathbf{w}_N,\, \boldsymbol{\phi}(x^*)^\top V_N\, \boldsymbol{\phi}(x^*) + \sigma_y^2\right)$.

As an example, consider $\mathbf{w}_0 = (0, 0)^\top$, $V_0 = 0.4^2 I$.

fig, ax = plt.subplots(7, 3, figsize=(8, 16))
 
# =========================================================================== #
# Prior
w_0 = np.array([
    [0], 
    [0]
])
V_0 = 0.4**2 * np.identity(2)
prior = multivariate_normal(w_0.flatten(), V_0)
 
x1, x2 = np.mgrid[-1:1:0.01, -1:1:0.01]
points = np.dstack([x1, x2])
 
ax[0, 0].contourf(x1, x2, prior.pdf(points))
 
x = np.linspace(-1, 1)
for _ in range(7):
    w1, w2 = prior.rvs()
    ax[0, 1].plot(x, w1 + w2 * x, c="gray")
 
# =========================================================================== #
# Data points
Phi = np.array([
  [ 1,  -0.3],
  [ 1,  -0.2],
  [ 1,   0. ],
  [ 1,   0.2],
  [ 1,   0.3],
  [ 1,  -0.8]
])
Y = np.array([0.5, 0.2, 0.3, 0.6, 0.5, 0])[:, None]
 
# =========================================================================== #
# Posterior
sigma_y = 0.15 # uncertainty about y, given x
V_0_inv = np.linalg.inv(V_0)
 
for n in range(1, len(Phi) + 1):
    V_N = sigma_y**2 * np.linalg.inv(
        sigma_y**2 * V_0_inv +
        Phi[:n].T.dot(Phi[:n]))
    w_N = V_N.dot(V_0_inv.dot(w_0) + 
                  1/(sigma_y**2) * Phi[:n].T.dot(Y[:n]))
    posterior = multivariate_normal(w_N.flatten(), V_N)
    ax[n, 0].contourf(x1, x2, posterior.pdf(points))
    
    for _ in range(7):
        w1, w2 = posterior.rvs()
        ax[n, 1].plot(x, w1 + w2 * x, c="gray")
    
    ax[n, 1].plot(Phi[:n, 1], Y[:n, 0], "o", c="blue")
    
    # Predictive
    x_star = np.linspace(-1, 1, 20)[:, None]
    Phi_star = np.concatenate([x_star**0, x_star], axis=1)
    means = (Phi_star @ w_N).flatten()
    variances = np.einsum('ij,jk,ik->i', Phi_star, V_N, Phi_star) + sigma_y**2
    stdevs = np.sqrt(variances)
 
    ax[n, 2].plot(x_star, means)
    ax[n, 2].fill_between(
        x_star.flatten(), means - stdevs, means + stdevs, alpha=1,
        facecolor='lightsteelblue'
    )
 
for row in range(7):
    ax[row, 0].set(xlabel=r"$w_1$", ylabel=r"$w_2$", xlim=(-1, 1), ylim=(-1, 1))
    ax[row, 1].set(xlabel=r"$x$", ylabel=r"$w_1 + w_2 x$", xlim=(-1, 1), ylim=(-1, 1))
    ax[row, 2].set(xlim=(-1, 1), ylim=(-1, 1))
fig.tight_layout(pad=2)
plt.show()
