# Normal Distribution - Bayesian Analysis of The Normal Distribution - With Unknown Mean and Variance

With Unknown Mean and Variance

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows with unknown mean μ and variance σ2, the a combined (multivariate) conjugate prior is placed over the mean and variance, consisting of a normal-inverse-gamma distribution. Logically, this originates as follows:

1. From the analysis of the case with unknown mean but known variance, we see that the update equations involve sufficient statistics computed from the data consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
2. From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data consisting of the number of data points and sum of squared deviations.
3. Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
4. To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: Shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: As the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
5. This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Note that each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
6. This leads immediately to the normal-inverse-gamma distribution, which is defined as the product of the two distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.

The priors are normally defined as follows:

begin{align} p(mu|sigma^2; mu_0, n_0) &sim mathcal{N}(mu_0,sigma_0^2/n_0) \ p(sigma^2; nu_0,sigma_0^2) &sim Ichi^2(nu_0,sigma_0^2) = IG(nu_0/2, nu_0sigma_0^2/2) end{align}

The update equations can be derived, and look as follows:

begin{align} bar{x} &= frac{1}{n}sum_{i=1}^n x_i \ mu_0' &= frac{n_0mu_0 + nbar{x}}{n_0 + n} \ n_0' &= n_0 + n \ nu_0' &= nu_0 + n \ nu_0'{sigma_0^2}' &= nu_0 sigma_0^2 + sum_{i=1}^n (x_i-bar{x})^2 + frac{n_0 n}{n_0 + n}(mu_0 - bar{x})^2 \ end{align}

The respective numbers of pseudo-observations just add the number of actual observations to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new "interaction term" needs to be added to take care of the additional error source stemming from the deviation between prior and data mean.

Proof is as follows.

The prior distributions are

begin{align} p(mu|sigma^2; mu_0, n_0) &sim mathcal{N}(mu_0,sigma_0^2/n_0) = frac{1}{sqrt{2pifrac{sigma^2}{n_0}}} expleft(-frac{n_0}{2sigma^2}(mu-mu_0)^2right) \ &propto (sigma^2)^{-1/2} expleft(-frac{n_0}{2sigma^2}(mu-mu_0)^2right) \ p(sigma^2; nu_0,sigma_0^2) &sim Ichi^2(nu_0,sigma_0^2) = IG(nu_0/2, nu_0sigma_0^2/2) \ &= frac{(sigma_0^2nu_0/2)^{nu_0/2}}{Gamma(nu_0/2)}~ frac{expleft}{(sigma^2)^{1+nu_0/2}} \ &propto {(sigma^2)^{-(1+nu_0/2)}} expleft \ end{align}

Therefore, the joint prior is

begin{align} p(mu,sigma^2; mu_0, n_0, nu_0,sigma_0^2) &= p(mu|sigma^2; mu_0, n_0),p(sigma^2; nu_0,sigma_0^2) \ &propto (sigma^2)^{-(nu_0+3)/2} expleft end{align}

The likelihood function from the section above with known variance, and writing it in terms of variance rather than precision, is:

begin{align} p(mathbf{X}|mu,sigma^2) &= left(frac{1}{2pisigma^2}right)^{n/2} expleft \ &propto {sigma^2}^{-n/2} expleft \ end{align}

where

Therefore, the posterior is (dropping the hyperparameters as conditioning factors):

begin{align} p(mu,sigma^2|mathbf{X}) & propto p(mu,sigma^2) , p(mathbf{X}|mu,sigma^2) \ & propto (sigma^2)^{-(nu_0+3)/2} expleft {sigma^2}^{-n/2} expleft \ &= (sigma^2)^{-(nu_0+n+3)/2} expleft \ &= (sigma^2)^{-(nu_0+n+3)/2} expleft \ & propto (sigma^2)^{-1/2} expleft \ & quadtimes (sigma^2)^{-(nu_0/2+n/2+1)} expleft \ & = mathcal{N}_{mu|sigma^2}left(frac{n_0mu_0 + nbar{x}}{n_0 + n}, frac{sigma^2}{n_0+n}right) cdot {rm IG}_{sigma^2}left(frac12(nu_0+n), frac12left(nu_0sigma_0^2 + S + frac{n_0 n}{n_0+n}(mu_0-bar{x})^2right)right) . \ end{align}

In other words, the posterior distribution has the form of a product of a normal distribution over times an inverse gamma distribution over, with parameters that are the same as the update equations above.