Biostatistics Advance Access published online on April 12, 2006

Biostatistics, doi:10.1093/biostatistics/kxj029
© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org
Received July 28, 2005
Revised February 28, 2006
Accepted March 8, 2006

Article

Are clusters found in one dataset present in another dataset?

Amy V. Kapp 1 * and Robert Tibshirani 2

1 Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA
2 Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford, CA

* To whom correspondence should be addressed.
Amy V. Kapp, E-mail: akapp{at}stanford.edu


Abstract

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be reproducible and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high.

Here we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the in-group proportion (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the in-group proportion is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and WADP score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure, all of which use the in-group proportion but differ in the way the null distributions are generated. We find that the in-group proportion is the best measure of prediction accuracy and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called clusterRepro, available through The Comprehensive R Archive Network (http://cran.r-project.org).
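The idea of classifying new data to previously defined clusters and then scoring each cluster can be illustrated with a minimal Python sketch. The abstract does not spell out the definition of the in-group proportion, so the sketch assumes it is the fraction of observations assigned to a cluster whose nearest neighbor is assigned to the same cluster, with Pearson correlation as the similarity measure. The function names and toy data are hypothetical; this is not the clusterRepro implementation, and it omits the null distributions used by the validation procedure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def classify_to_centroids(samples, centroids):
    """Assign each new sample to the index of its most correlated centroid."""
    return [
        max(range(len(centroids)), key=lambda k: pearson(s, centroids[k]))
        for s in samples
    ]

def in_group_proportion(samples, labels):
    """Assumed IGP: per cluster, the share of members whose nearest
    neighbor (most correlated other sample) has the same label."""
    size, same = {}, {}
    for i, s in enumerate(samples):
        others = (j for j in range(len(samples)) if j != i)
        nearest = max(others, key=lambda j: pearson(s, samples[j]))
        size[labels[i]] = size.get(labels[i], 0) + 1
        same[labels[i]] = same.get(labels[i], 0) + (labels[nearest] == labels[i])
    return {u: same[u] / size[u] for u in size}

# Hypothetical centroids from the original dataset and samples from a new one.
centroids = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
new_samples = [
    [1.1, 2.0, 2.9, 4.2],
    [0.9, 2.2, 3.1, 3.8],
    [4.1, 3.0, 2.2, 0.8],
    [3.9, 2.8, 1.9, 1.2],
    [1.0, 3.0, 2.0, 4.0],
]

labels = classify_to_centroids(new_samples, centroids)
igp = in_group_proportion(new_samples, labels)
print(labels, igp)
```

Under this assumed definition, a cluster whose members in the new dataset are each other's nearest neighbors scores close to 1, while a cluster whose members sit closer to samples from other clusters scores lower.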

Keywords: Cluster validation; in-group proportion; prediction accuracy; breast cancer subtypes.


