# Bayesian Inference in Phylogeny - Metropolis-coupled MCMC (Geyer)

If the target distribution has multiple peaks separated by low valleys, the Markov chain may have difficulty moving from one peak to another. As a result, the chain may get stuck on one peak, and the resulting samples will not approximate the posterior density correctly. This is a serious practical concern for phylogeny reconstruction: multiple local peaks are known to exist in tree space during heuristic tree search under the maximum parsimony (MP), maximum likelihood (ML), and minimum evolution (ME) criteria, and the same can be expected for stochastic tree search using MCMC. Many strategies have been proposed to improve the mixing of Markov chains in the presence of multiple local peaks in the posterior density. One of the most successful algorithms is Metropolis-coupled MCMC (or MC³).

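In this algorithm, $m$ chains are run in parallel with different stationary distributions $\pi_j(\cdot)$, $j = 1, 2, \ldots, m$. The first, $\pi_1 = \pi$, is the target density, while $\pi_j$, $j = 2, 3, \ldots, m$, are chosen to improve mixing. For example, one can choose incremental heating of the form:

$$\pi_j(\theta) = \pi(\theta)^{1/[1 + \lambda (j - 1)]}, \qquad \lambda > 0,$$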

so that the first chain is the cold chain with the correct target density, while chains $2, 3, \ldots, m$ are heated chains. Note that raising the density $\pi(\cdot)$ to the power $1/T$ with $T > 1$ has the effect of flattening out the distribution, similar to heating a metal. In such a distribution, it is easier to traverse between peaks (separated by valleys) than in the original distribution. After each iteration, a swap of states between two randomly chosen chains is proposed through a Metropolis-type step. Let $\theta^{(j)}$ be the current state in chain $j$, $j = 1, 2, \ldots, m$. A swap between the states of chains $i$ and $j$ is accepted with probability:

$$\alpha = \min\left(1, \frac{\pi_i(\theta^{(j)})\,\pi_j(\theta^{(i)})}{\pi_i(\theta^{(i)})\,\pi_j(\theta^{(j)})}\right)$$
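The heating schedule and swap test can be sketched in a few lines. The sketch below assumes incremental heating, so that chain $j$ samples from $\pi(\theta)^{\beta_j}$ with $\beta_j = 1/[1 + \lambda(j-1)]$. Under that assumption the log of the swap ratio for chains $i$ and $j$ simplifies to $(\beta_i - \beta_j)\,[\log \pi(\theta^{(j)}) - \log \pi(\theta^{(i)})]$. The function names `inverse_temperatures` and `swap_acceptance` are illustrative and do not come from any particular phylogenetics package.

```python
import math

def inverse_temperatures(m, lam):
    """Return beta_j = 1 / (1 + lam * (j - 1)) for chains j = 1, ..., m.

    The first entry is 1.0 (the cold chain); later entries shrink towards 0.
    """
    if m < 1 or lam <= 0:
        raise ValueError("need m >= 1 and lam > 0")
    # range(m) yields j - 1 = 0, 1, ..., m - 1
    return [1.0 / (1.0 + lam * k) for k in range(m)]

def swap_acceptance(log_pi_at_i, log_pi_at_j, beta_i, beta_j):
    """Probability of accepting a swap between the states of chains i and j.

    log_pi_at_i: log of the unheated target at the current state of chain i
    log_pi_at_j: log of the unheated target at the current state of chain j
    """
    # log of pi_i(theta_j) pi_j(theta_i) / (pi_i(theta_i) pi_j(theta_j))
    # when pi_k = pi ** beta_k
    log_ratio = (beta_i - beta_j) * (log_pi_at_j - log_pi_at_i)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)
```

With this form, a swap that hands the colder chain the state of higher target density is always accepted, while the reverse swap is accepted with probability below one. Working with log densities avoids underflow, which matters for phylogenetic likelihoods that are typically very small.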

At the end of the run, only the output from the cold chain is used; output from the heated chains is discarded. Heuristically, the heated chains will visit the local peaks rather easily, and swapping states between chains will let the cold chain occasionally jump valleys, leading to better mixing. However, if $\pi_i(\theta)/\pi_j(\theta)$ is unstable, proposed swaps will seldom be accepted. This is the reason for using several chains which differ only incrementally. (See Figure 3.3 (to be added)).
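The whole procedure can be illustrated on a toy problem. The following sketch is not a phylogenetic sampler. It assumes a one-dimensional target with two well-separated peaks standing in for tree space, a Gaussian random-walk proposal within each chain, and incremental heating with exponent $1/[1 + \lambda(j-1)]$ for chain $j$. The names `log_target` and `mc3` and all parameter defaults are hypothetical choices made for the illustration.

```python
import math
import random

def log_target(x):
    """Unnormalised log density of a toy two-peak target (peaks at -4 and +4)."""
    a = -0.5 * ((x + 4.0) / 0.5) ** 2
    b = -0.5 * ((x - 4.0) / 0.5) ** 2
    hi = max(a, b)
    # log(exp(a) + exp(b)), computed stably
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def mc3(n_iter, m=6, lam=1.0, step=1.0, seed=0):
    """Run m coupled chains and return the samples of the cold chain only."""
    rng = random.Random(seed)
    betas = [1.0 / (1.0 + lam * k) for k in range(m)]  # betas[0] = 1 is cold
    states = [-4.0] * m                                # all chains start on the left peak
    logps = [log_target(x) for x in states]
    cold_samples = []
    for _ in range(n_iter):
        # Within-chain Metropolis update; chain k targets pi ** betas[k].
        for k in range(m):
            proposal = states[k] + rng.gauss(0.0, step)
            lp = log_target(proposal)
            # 1 - random() lies in (0, 1], so the log is always defined
            if math.log(1.0 - rng.random()) < betas[k] * (lp - logps[k]):
                states[k], logps[k] = proposal, lp
        # Propose a swap of states between two randomly chosen chains.
        if m >= 2:
            i, j = rng.sample(range(m), 2)
            log_ratio = (betas[i] - betas[j]) * (logps[j] - logps[i])
            if math.log(1.0 - rng.random()) < log_ratio:
                states[i], states[j] = states[j], states[i]
                logps[i], logps[j] = logps[j], logps[i]
        # Output from the heated chains is discarded.
        cold_samples.append(states[0])
    return cold_samples
```

Calling `mc3(20000)` and comparing the result with `mc3(20000, m=1)`, which reduces to a single ordinary Metropolis chain, is one way to see the effect of coupling: the single chain is expected to stay near its starting peak, whereas the coupled cold chain should visit both.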

An obvious disadvantage of the algorithm is that $m$ chains are run and only one chain is used for inference. For this reason, MC³ is ideally suited for implementation on parallel machines, since each chain will in general require the same amount of computation per iteration.