In the Bayesian viewpoint, we formulate linear regression using
probability distributions rather than point estimates. The response, y,
is not estimated as a single value, but is assumed to be drawn from a
probability distribution. The model for Bayesian Linear Regression
with the response sampled from a normal distribution is:
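$$y \sim N(\beta^{T}X,\ \sigma^{2}I)$$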
The output, y, is generated from a normal (Gaussian) distribution characterized by a mean and variance. The mean for linear regression is the transpose of the weight matrix multiplied by the predictor matrix. The variance is the square of the standard deviation, σ (multiplied by the identity matrix because this is a multi-dimensional formulation of the model).
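To make the generative view concrete, here is a minimal NumPy sketch of drawing the response from this distribution; the sample size, weight values, and noise level are made-up numbers for illustration, not values from this post:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                        # illustrative sample size and number of predictors
X = rng.normal(size=(n, d))          # predictor matrix
beta = np.array([1.5, -2.0, 0.5])    # assumed weights for this sketch
sigma = 1.0                          # standard deviation of the noise

# Each response is drawn from a normal distribution with mean X @ beta
# and variance sigma^2 (covariance sigma^2 * I across observations).
y = rng.normal(loc=X @ beta, scale=sigma)
```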
The aim of Bayesian Linear Regression is not to find the single “best”
value of the model parameters, but rather to determine the posterior
distribution for the model parameters. Not only is the response
generated from a probability distribution, but the model parameters
are assumed to come from a distribution as well. The posterior
probability of the model parameters is conditional upon the training
inputs and outputs:
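$$P(\beta \mid y, X) = \frac{P(y \mid \beta, X)\, P(\beta \mid X)}{P(y \mid X)}$$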
Here, P(β|y, X) is the posterior probability distribution of the model
parameters given the inputs and outputs. This is equal to the likelihood
of the data, P(y|β, X), multiplied by the prior probability of the
parameters and divided by a normalization constant. This is a simple
expression of Bayes' Theorem, the fundamental underpinning of
Bayesian Inference:
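$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Normalization}}$$

As a concrete (and simplified) illustration, the sketch below computes this posterior in closed form with NumPy, assuming a zero-mean Gaussian prior on the weights and a known noise standard deviation; the data and prior scale are invented for the example, and in practice the posterior is usually approximated by sampling rather than computed analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=n)

sigma = 1.0                          # assumed known noise standard deviation
prior_cov = 10.0 ** 2 * np.eye(d)    # zero-mean Gaussian prior on the weights (illustrative scale)

# P(beta | y, X) is proportional to P(y | beta, X) * P(beta): with a Gaussian
# prior and Gaussian likelihood the posterior is also Gaussian, with the
# covariance and mean computed below.
post_precision = X.T @ X / sigma ** 2 + np.linalg.inv(prior_cov)
post_cov = np.linalg.inv(post_precision)
post_mean = post_cov @ (X.T @ y / sigma ** 2)

print("posterior mean of the weights:", post_mean)
print("posterior standard deviations:", np.sqrt(np.diag(post_cov)))
```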
Let’s stop and think about what this means. In contrast to OLS, we
have a posterior distribution for the model parameters that is
proportional to the likelihood of the data multiplied by
the prior probability of the parameters. Here we can observe the two
primary benefits of Bayesian Linear Regression.
1. Priors: If we have domain knowledge, or a guess for what the
model parameters should be, we can include them in our model,
unlike in the frequentist approach, which assumes everything