Biostatistics Advance Access originally published online on February 5, 2008
Biostatistics 2008 9(3):540-554; doi:10.1093/biostatistics/kxm051
Mixture models with multiple levels, with application to the analysis of multifactor gene expression data
Department of Statistics, Rutgers University, 501 Hill Center, Piscataway, NJ 08854, USA
rebecka{at}stat.rutgers.edu

Department of Statistics, Department of Biostatistics and Medical Bioinformatics, University of Wisconsin-Madison, 1300 University Avenue, Madison, WI 53706, USA
* To whom correspondence should be addressed.
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose 2 model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both "within" and "between" cluster profiles. We explore various flexible "within-cluster" parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse "between-cluster" representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.
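As a rough illustration of what model selection in model-based clustering involves, the sketch below fits ordinary 1-D Gaussian mixtures by plain EM and compares candidate cluster counts. It is not the multilevel model or the profile EM procedure proposed in the paper: the use of BIC as the selection criterion, the synthetic data, and the helper names `fit_mixture` and `bic` are all assumptions made for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by plain EM (hypothetical helper).

    Returns (log_likelihood, weights, means, variances).
    """
    n = len(data)
    ordered = sorted(data)
    # Deterministic start: spread the initial means over the data quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) + 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: weighted updates of mixing proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid degenerate components

    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)) + 1e-300)
        for x in data
    )
    return loglik, weights, means, variances


def bic(loglik, k, n):
    """BIC with (k - 1) weights + k means + k variances free parameters; lower is better."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


# Synthetic data (illustrative only): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(6.0, 1.0) for _ in range(100)]

scores = {}
for k in (1, 2, 3):
    loglik, weights, means, variances = fit_mixture(data, k)
    scores[k] = bic(loglik, k, len(data))
    print(k, round(scores[k], 1))
```

The candidate with the lowest score would be preferred under this criterion. The approach described in the abstract goes further by also selecting sparse within-cluster parameterizations and a different number of clusters at different levels of an experimental factor, which this toy example does not attempt.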
Keywords: Clustering; Gene expression; Mixture model; Model selection; Profile expectation–maximization
Received November 20, 2007; revised August 28, 2007; revised November 26, 2007; accepted for publication November 27, 2007.
