Biostatistics Advance Access published online on August 20, 2009
Biostatistics, doi:10.1093/biostatistics/kxp033
Bayesian mixture modeling using a hybrid sampler with application to protein subfamily identification
Department of Biostatistics, University of Washington, Seattle, WA 98105, USA
Department of Statistics and Department of Biostatistics, University of Washington, Seattle, WA 98105, USA (jonno@u.washington.edu)
Department of Biostatistics, University of Washington, Seattle, WA 98105, USA
Predicting protein function is essential to advancing our knowledge of biological processes. This article focuses on discovering functional diversification within a protein family. A Bayesian mixture approach is proposed that models a protein family as a mixture of profile hidden Markov models. For a given mixture size, posterior inference is obtained with a hybrid Markov chain Monte Carlo sampler that combines Gibbs sampling steps with hierarchical clustering–based split/merge proposals. Inference for mixture size is based on comparing the integrated likelihoods. The choice of priors is critical to the performance of the procedure. Through simulation studies, we show that 2 priors based on independent data sets allow correct identification of the mixture size, both when the data are homogeneous and when the data are generated from a mixture. We illustrate our method using 2 sets of real protein sequences.
Keywords: Clustering; Hybrid sampler; Markov chain Monte Carlo; Mixture models; Protein subfamily
* To whom correspondence should be addressed.
Received February 4, 2009; revised July 9, 2009; accepted for publication July 27, 2009.