Variational Bayesian Methods - A More Complex Example

A More Complex Example

Imagine a Bayesian Gaussian mixture model described as follows:

begin{align} mathbf{pi} & sim operatorname{SymDir}(K, alpha_0) \ mathbf{Lambda}_{i=1 dots K} & sim mathcal{W}(mathbf{W}_0, nu_0) \ mathbf{mu}_{i=1 dots K} & sim mathcal{N}(mathbf{mu}_0, (beta_0 mathbf{Lambda}_i)^{-1}) \ mathbf{z} & sim operatorname{Mult}(1, mathbf{pi}) \ mathbf{x}_{i=1 dots N} & sim mathcal{N}(mathbf{mu}_{z_i}, {mathbf{Lambda}_{z_i}}^{-1}) \ K &= text{number of mixing components} \ N &= text{number of data points} end{align}

Note:

• SymDir is the symmetric Dirichlet distribution of dimension, with the hyperparameter for each component set to . The Dirichlet distribution is the conjugate prior of the categorical distribution or multinomial distribution.
• is the Wishart distribution, which is the conjugate prior of the precision matrix (inverse covariance matrix) for a multivariate Gaussian distribution.
• Mult is a multinomial distribution over a single observation (equivalent to a categorical distribution). The state space is a "one-of-K" representation, i.e. a -dimensional vector in which one of the elements is 1 (specifying the identity of the observation) and all other elements are 0.
• is the Gaussian distribution, in this case specifically the multivariate Gaussian distribution.

The interpretation of the above variables is as follows:

• is the set of data points, each of which is a -dimensional vector distributed according to a multivariate Gaussian distribution.
• is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components for, as described above.
• is the mixing proportions for the mixture components.
• and specify the parameters (mean and precision) associated with each mixture component.

The joint probability of all variables can be rewritten as

where the individual factors are

begin{align} p(mathbf{X}|mathbf{Z},mathbf{mu},mathbf{Lambda}) & = prod_{n=1}^N prod_{k=1}^K mathcal{N}(mathbf{x}_n|mathbf{mu}_k,mathbf{Lambda}_k^{-1})^{z_{nk}} \ p(mathbf{Z}|mathbf{pi}) & = prod_{n=1}^N prod_{k=1}^K pi_k^{z_{nk}} \ p(mathbf{pi}) & = frac{Gamma(Kalpha_0)}{Gamma(alpha_0)^K} prod_{k=1}^K pi_k^{alpha_0-1} \ p(mathbf{mu}|mathbf{Lambda}) & = mathcal{N}(mathbf{mu}_k|mathbf{m}_0,(beta_0 mathbf{Lambda}_k)^{-1}) \ p(mathbf{Lambda}) & = mathcal{W}(mathbf{Lambda}_k|mathbf{W}_0, nu_0) end{align}

where

begin{align} mathcal{N}(mathbf{x}|mathbf{mu},mathbf{Sigma}) & = frac{1}{(2pi)^{D/2}} frac{1}{|mathbf{Sigma}|^{1/2}} exp {-frac{1}{2}(mathbf{x}-mathbf{mu})^{rm T} mathbf{Sigma}^{-1}(mathbf{x}-mathbf{mu}) } \ mathcal{W}(mathbf{Lambda}|mathbf{W},nu) & = B(mathbf{W},nu) |mathbf{Lambda}|^{(nu-D-1)/2} exp left(-frac{1}{2} operatorname{Tr}(mathbf{W}^{-1}mathbf{Lambda}) right) \ B(mathbf{W},nu) & = |mathbf{W}|^{-nu/2} (2^{nu D/2} pi^{D(D-1)/4} prod_{i=1}^{D} Gamma(frac{nu + 1 - i}{2}))^{-1} \ D & = text{dimensionality of each data point} end{align}

Assume that .

Then

begin{align} ln q^*(mathbf{Z}) &= operatorname{E}_{mathbf{pi},mathbf{mu},mathbf{Lambda}} + text{constant} \ &= operatorname{E}_{mathbf{pi}} + operatorname{E}_{mathbf{mu},mathbf{Lambda}} + text{constant} \ &= sum_{n=1}^N sum_{k=1}^K z_{nk} ln rho_{nk} + text{constant} end{align}

where we have defined

Exponentiating both sides of the formula for yields

Requiring that this be normalized ends up requiring that the sum to 1 over all values of, yielding

where

In other words, is a product of single-observation multinomial distributions, and factors over each individual, which is distributed as a single-observation multinomial distribution with parameters for .

Furthermore, we note that

which is a standard result for categorical distributions.

Now, considering the factor, note that it automatically factors into due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.

Then,

begin{align} ln q^*(mathbf{pi}) &= ln p(mathbf{pi}) + operatorname{E}_{mathbf{Z}} + text{constant} \ &= (alpha_0 - 1) sum_{k=1}^K ln pi_k + sum_{n=1}^N sum_{k=1}^K r_{nk} ln pi_k + text{constant} end{align}

Taking the exponential of both sides, we recognize as a Dirichlet distribution

where

where

Finally

Grouping and reading off terms involving and, the result is a Gaussian-Wishart distribution given by

given the definitions

begin{align} beta_k &= beta_0 + N_k \ mathbf{m}_k &= frac{1}{beta_k} (beta_0 mathbf{m}_0 + N_k {bar{mathbf{x}}}_k) \ mathbf{W}_k^{-1} &= mathbf{W}_0^{-1} + N_k mathbf{S}_k + frac{beta_0 N_k}{beta_0 + N_k} ({bar{mathbf{x}}}_k - mathbf{m}_0)({bar{mathbf{x}}}_k - mathbf{m}_0)^{rm T} \ nu_k &= nu_0 + N_k \ N_k &= sum_{n=1}^N r_{nk} \ {bar{mathbf{x}}}_k &= frac{1}{N_k} sum_{n=1}^N r_{nk} mathbf{x}_n \ mathbf{S}_k &= frac{1}{N_k} sum_{n=1}^N (mathbf{x}_n - {bar{mathbf{x}}}_k) (mathbf{x}_n - {bar{mathbf{x}}}_k)^{rm T} end{align}

Finally, notice that these functions require the values of, which make use of, which is defined in turn based on, and . Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:

$begin{array}{rcccl} operatorname{E}_{mathbf{mu}_k,mathbf{Lambda}_k} &&&=& Dbeta_k^{-1} + nu_k (mathbf{x}_n - mathbf{m}_k)^{rm T} mathbf{W}_k (mathbf{x}_n - mathbf{m}_k) \ ln {tilde{Lambda}}_k &equiv& operatorname{E} &=& sum_{i=1}^D psi left(frac{nu_k + 1 - i}{2}right) + D ln 2 + ln |mathbf{Lambda}_k| \ ln {tilde{pi}}_k &equiv& operatorname{E}left &=& psi(alpha_k) - psileft(sum_{i=1}^K alpha_iright) end{array}$

These can be converted from proportional to absolute values by normalizing over so that the corresponding values sum to 1.

Note that:

1. The update equations for the parameters, and of the variables and depend on the statistics, and, and these statistics in turn depend on .
2. The update equations for the parameters of the variable depend on the statistic, which depends in turn on .
3. The update equation for has a direct circular dependence on, and as well as an indirect circular dependence on, and through and .

This suggests an iterative procedure that alternates between two steps:

1. An E-step that computes the value of using the current values of all the other parameters.
2. An M-step that uses the new value of to compute new values of all the other parameters.

Note that these steps correspond closely with the standard EM algorithm to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. ; the computation of the statistics, and corresponds closely to the computation of corresponding "soft-count" statistics over the data; and the use of those statistics to compute new values of the parameters corresponds closely to the use of soft counts to compute new parameter values in normal EM over a Gaussian mixture model.