Skip Navigation


Biostatistics Advance Access originally published online on April 12, 2006
Biostatistics 2007 8(1):9-31; doi:10.1093/biostatistics/kxj029
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
8/1/9    most recent
kxj029v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Kapp, A. V.
Right arrow Articles by Tibshirani, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kapp, A. V.
Right arrow Articles by Tibshirani, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Are clusters found in one dataset present in another dataset?

Amy V. Kapp*

Department of Statistics, Stanford University, Stanford, CA 94305-4065, USA akapp{at}stanford.edu

Robert Tibshirani

Department of Health Research and Policy and Department of Statistics, Stanford University, Stanford, CA, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is the more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).

Keywords: Breast cancer subtypes; Cluster validation; In-group proportion; Prediction accuracy


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
As the name suggests, cluster validation is concerned with "assessing the validity of classifications that have been obtained from the application of clustering procedure" (Gordon, 1999Go). In general, cluster validation procedures define a cluster quality measure (e.g. a measure of isolation or a measure of cohesion) and determine how likely given values of that measure are to occur under a null model of no structure. Either graph theory or Monte Carlo simulations can be used to find the null distribution of the cluster quality values.

Interest in cluster validation has been re-ignited by the need for gauging the significance of gene and array clusters in microarray studies. The majority of the literature has centered on determining which clustering procedure to use and on determining how many clusters are present in a microarray dataset (Datta and Datta, 2003Go; Chen and others, 2002Go; Kerr and Churchill, 2001Go; Yeung and others, 2001Go; Levine and Domany, 2001Go).

Although many of these papers used a cluster quality measure based on within-cluster and/or between-cluster variance, three papers (Dudoit and Fridlyand, 2002Go; Dudoit and others, 2002Go; Tibshirani and Walther, 2005Go) used prediction error to evaluate the quality of clusters. When the true classifications of the test dataset were known, as in Dudoit and others (2002), the estimate of prediction error was the proportion of correct classifications in the test dataset. When the true classifications are unknown, as in Tibshirani and Walther (2005)Go, cluster quality can be estimated by how well training centroids predict test set co-memberships, i.e. pairs of observations classified to the same cluster. Instead of concentrating on a single measure of prediction accuracy, Dudoit and Fridlyand (2002)Go compared a variety of indices to measure the agreement between the training set partition and the test set partition.

Despite their differences, all three papers argued that the use of a measure of test set clusters defined by a classifier made from the training data is the most appropriate approach to cluster validation when the aim of analyzing the microarray data is to identify reproducible clusters of genes or arrays with similar expression profiles. The genes or samples of a microarray dataset are partitioned into clusters and used to build a classifier which is applied to new data. If the new data classified to a cluster are like the samples already present in the cluster (molecularly or clinically), then the cluster is validated because it is reproducible and may be biologically significant.

In other words, a classifier built using previously defined clusters is used to predict which new data have certain molecular or clinical characteristics in common with the other members of the cluster. A cluster is validated if enough predictions are correct because accurate predictions mean the cluster is present in the new data. Therefore, when the goal of a study is the identification of reproducible clusters, validation is related to prediction accuracy which is defined to be the proportion of data whose predicted classifications are identical to the true classifications.

This paper extends the idea of using prediction accuracy (or strength) from validating the number of clusters or the choice of clustering method to validating individual clusters found in a new dataset. First, a new cluster quality measure is proposed. The "in-group proportion" (IGP) is similar to the measure of co-memberships in Tibshirani and Walther (2005)Go. It is defined to be the proportion of observations classified to a cluster whose nearest neighbor is also classified to the same cluster.

The IGP also resembles a cluster quality measure proposed by Bailey and Dubes in 1982. Their "measure of cohesion" was defined for a random graph with m edges and n vertices: FormulaFormula, where Formula was the set of Formula edges in the graph. If C is made up of observations in the same cluster and Formula is the set of edges that connect observations in C to their nearest neighbors, then in certain situations Formula is equal to the product of the IGP and the total number of observations in the cluster (m). For example, consider three points on the real line: 0, 1, and Formula. If Formula, Formula is as defined above, and Euclidean distance is used, then Formula and the IGP is Formula.

In the subsequent sections, a more explicit description of the IGP and four other cluster quality measures are presented, after which a cluster validation procedure is proposed. Four different versions of the cluster validation procedure are described. In all versions, the IGPs for all the clusters in a new dataset identified by centroids built on a previous dataset are computed and then compared to an appropriate null distribution to obtain p-values. The null distributions of IGPs, however, are generated differently in each version of the cluster validation procedure. Finally, simulations and real datasets are used to compare the IGP with four other cluster quality measures and to compare the four versions of the cluster validation procedure.


    2. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
Before describing the cluster quality measures and validation procedure, some basic definitions must be established. We let A be an Formula matrix of microarray data where m is the number of features (genes) and n is the number of samples (arrays). We assume that a subset of the samples of A have been partitioned into p groups (labeled Formula) and C is the Formula matrix of the centroids. The uth column of C is made by averaging over the features (rows) of the samples in A classified to group u. If X is an Formula matrix of microarray data independent of A, then all the samples (columns) of X can be classified to one of the p groups or to a "below-cutoff group" using C and a cutoff (c). The function Formula is defined to be the Pearson's (centered) correlation for vectors x and y and Formula is the class label for sample j of X:

Formula (2.1)

Since Formula is a measure of correlation not distance, Formula and a Formula near 1 means x and y are close together. Thus, every sample of X is classified to the the group whose centroid with which it most highly correlates. If the maximum correlation for a sample and any of the centroids is less than c, the sample receives the class label 0. The below-cutoff group is composed of all the samples i of X for which Formula. A cluster quality measure is subsequently computed for each group to which at least one array of X is classified.

2.1 The IGP

We propose a new cluster quality measure based on the idea of prediction accuracy. The IGP (Figure 1) is defined to be the proportion of samples in a group whose nearest neighbors are also in the same group. In other words, the IGP quantifies how often points near each other are predicted to belong to the same group. Define Formula for each columnn of X, and let u be the class label for all the samples whose Formula, then

Formula (2.2)


Figure 1
View larger version (5K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Using Euclidean distance, the IGP for the circles is 0.8 and that for the triangles is 0.6.1

 
For the jth sample of X, Formula is j's nearest neighbor and so Formula is the proportion of samples in class u whose nearest neighbor is also in class u.

If a distance function is used instead of Pearson's (centered) correlation coefficient, then the above definitions still hold with min replacing max, argmin replacing argmax, Formula replacing Formula, and Formula replacing Formula.

2.2 Other cluster quality measures

The IGP is not the only measure of cluster quality. Chen and others (2002) described several others. Since Chen and others (2002) defined the "homogeneity score" (HS), separation score (SS), "silhouette width" (SW), and "weighted average discrepant pairs" (WADP) for entire clusterings as opposed to individual clusters, we slightly modified the scores to apply to an individual cluster. First, the HS is defined to be the average correlation between a cluster's centroid and the members of the cluster. If Formula and Formula is the number of elements in Formula, then the HS for cluster u is: Formula.

The SS for cluster u is the weighted average of the correlation between the uth cluster's centroid and every other centroid: Formula.

Next, we assume that Formula and let Formula be the average dissimilarity (or distance) between sample j and the other samples in Formula and Formula be the average dissimilarity (or distance) between sample j and the samples not in Formula. The SW for cluster u is thus defined: Formula.

Since Pearson's (centered) correlation coefficient is a measure of similarity, Formula Pearson's (centered) correlation | was used as the measure of dissimilarity to compute the SWs and every time a measure of distance was required. Therefore, for each member of a cluster u, the discrepancy between the average value of Formula Pearson's (centered) correlation | between that member and the other members of the cluster and the average Formula Pearson's (centered) correlation | between the member and the members outside of the cluster is calculated and then divided by the maximum of those two quantities. The SW for cluster u is the average of these quotients over all the members of cluster u.

Finally, the WADP score measures the consistency of a classifier when the samples are subject to small perturbations. We generate an Formula matrix of Gaussian random variables: Formula where Formula for Formula and Formula. R is added to X and the samples of X are reclassified. For each cluster, we calculate the number of sample pairs that were in the same cluster in the original classification, but not in the same cluster after reclassification. That quantity is divided by the total number of sample pairs originally in the cluster. This process is repeated many times and the WADP score for cluster u is the average of these ratios taken over the perturbations of X. The value of Formula is specified by the user and has a large impact on the WADP score. If Formula is too small, the WADP score is always 0; if Formula is too large, the WADP score is close to 1. In this paper, Formula was always chosen to be large enough for the WADP scores to vary between groups.

2.3 Null distribution generation

To validate the groups found in X, the IGPs of those groups are compared to a null distribution of IGPs. Four different versions of the same procedure are used in this paper to generate null distributions of IGPs. The basic null distribution generation procedure repeatedly generates an m Formula p centroid matrix (Formula), computes Formula in the way described above with Formula replacing C, and calculates the IGPs for the groups in Formula. Each version of the null distribution generation procedure generates Formula differently.

Version 1 permutes each row of C to get Formula.
Version 2 permutes the rows of A, hierarchically clusters the columns (average linkage), automatically cuts the dendogram to make p groups, and averages over the rows of the arrays with the same group labels to get Formula.
Version 3 transforms C to get Formula (transformation described below).
Version 4 transforms A (transformation described below), hierarchically clusters the columns (average linkage), automatically cuts the dendogram to make p groups, and averages over the rows of the arrays with the same group labels to get Formula.

The first two versions assume independence of the genes in the centroids or raw data. As many microarray studies have demonstrated, however, genes are not completely independent. Therefore, the centroids produced by Versions 1 and 2 may not be near the data. To remedy this problem, the third and fourth versions permute the samples within the box aligned with their principal components. This transformation increases the chance that the centroids are near the data without being too similar to the actual data or actual centroids which would bias the p-values towards 1.

  1. Let Formula be the singular value decomposition of W. (W can be either C or A.)
  2. Define Formula.
  3. Permute the columns of Formula to obtain Formula.
  4. Let Formula.
  5. Substitute Z for W.

The null distribution generation methods were designed to produce centroids that are placed randomly in the data. As a consequence, the groups defined by the centroids most likely are not high-quality clusters. Thus, the null distributions are composed of IGPs that come from groups of data that are not high-quality clusters. Since a cluster of high-quality will have an IGP close to 1, the p-value of a group is the fraction of the null distribution IGPs that are as close or closer to 1 than the group's actual IGP. In other words, the null hypothesis is that a group is not a high-quality cluster and it is rejected if the actual IGP of the group is high enough (i.e. close enough to 1).

A group of data with a significant p-value is a high-quality cluster. In addition, that group of data corresponds to a cluster in an independent dataset (the one in which the original centroids were formed). Therefore, a significant p-value means a high-quality cluster (as opposed to a group of data near each other) corresponding to the original cluster was found in an independent dataset. Hence, the cluster is reproducible and thus validated.

Since the IGPs depend on the size of the group, Formula is compared only to the IGPs from the null distribution generation procedure that come from groups of the same size. When a cutoff is used, the below-cutoff group is compared to the IGPs of all the below-cutoff groups obtained from the null distribution procedure because the sizes of the generated below-cutoff groups so rarely match the size of the actual below-cutoff group.

Nothing about the null distribution generation procedure is specific to the IGP. Therefore, the overall cluster validation method and its four versions could be used with any of the cluster quality measures described in Section 2.2. In light of the results presented in Section 3.1, however, we only compared the null distribution generation versions for the IGP.


    3. SIMULATIONS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
Results from five simulations are presented: two (Simulation 1 and Simulation 2) were done to compare the five cluster quality measures described in Section 2.2 and two (Simulation 3, Simulation 4, and Simulation 5) were done to compare the different versions of the null distribution generation procedure described in Section 2.3. All the simulations used Pearson's (centered) correlation coefficient and datasets of 300 observations. The details and results of the simulations are described in Sections 3.1 and 3.2.

3.1 Comparison of cluster quality measures

The datasets for Simulation 1 and Simulation 2 were generated in the same fashion. First, a single vector of length 500 was defined: Formula such that Formula. Then, the Formula (Formula) were defined to be random samples of size 50 drawn without replacement from the set Formula. Using P and the Formula's, a Formula matrix (Q), which can be thought of as the matrix of true centroids, was defined:

Formula (3.1)

The Formula were independent identically distributed Formula. To produce the data matrix of observations, the variable Formula was defined to be a random sample of size 100 drawn without replacement from the set Formula for Formula. Thus, the data matrix (R) was defined:

Formula (3.2)

The Formula were independent identically distributed Formula.

R is like a matrix of microarray data where the 500 rows are genes (features) and the 300 columns are arrays (samples). Each column of R was classified to one of four groups: columns 1–50 were the first group, columns 51–100 were the second group, columns 101–200 were the third group, and columns 201–300 were the fourth group. The Formula matrix, Formula, was found by averaging over the columns of R which were generated from the same column of Q: Formula, where Formula, Formula, and Formula.

In Simulation 1, Formula for all u, but Formula and Formula. In Simulation 2, Formula for all j, but Formula and Formula. In other words, if Q is thought of as the centroid matrix, then in Simulation 1, as Formula increased the centroids moved further apart from one another, but the correlations between the data and the centroids remained constant. Furthermore, in Simulation 2, as Formula increased the data moved further away from their centroids, but the between-centroid correlations remained constant.

For each value of Formula in Simulation 1 and Formula in Simulation 2, 100 datasets (R) were generated in the manner described above. For each of the datasets, the IGPs, HSs, SSs, SWs, and WADP scores (Formula) were computed using two different classifications. One was the true classification:


Formula (3.3)

The other was the estimated classification: Estimated classification of Formula.

For both the true classifications and the estimated classifications, the average values for all five cluster quality measures are presented in Figures 2 and 3.


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 These graphs show the results of Simulation 1. The horizontal axis on each graph is the value of Formula used to generate each Q matrix. The vertical axis is the cluster quality measure. The lines trace out the average cluster quality measure when the true classifications were used; the solid points are the average cluster quality measure values when the estimated classifications were used. The averages from observations generated using Formula, Formula, Formula, and Formula are circles (solid line), diamonds (dashed line), triangles (dotted line), and squares (dashed and dotted line), respectively.

 

Figure 3
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 These graphs show the results of Simulation 2. The horizontal axis on each graph is the value of Formula used to generate each data matrix (R) of 300 observations. The vertical axis is the cluster quality measure. The lines trace out the cluster quality measure when the true classifications were used; the solid points are the average cluster quality measure values when the estimated classifications were used. The averages for the observations generated using Formula, Formula, Formula, and Formula are circles (solid line), diamonds (dashed line), triangles (dotted line), and squares (dashed and dotted line), respectively.

 
A cluster is said to be "isolated" if the members of the cluster are very different from the members of other clusters and a cluster is said to be "cohesive" if the members of the cluster are very similar to each other (Gordon, 1999Go). In contrast, a measure of prediction accuracy needs to quantify how likely a point classified to one cluster is to have been classified to another cluster because prediction accuracy is the proportion of data whose predicted classifications are identical to the true classifications. If a cluster is both isolated and cohesive, its members are unlikely to be classified to another cluster. Therefore, a measure of prediction accuracy will be sensitive to both isolation and cohesion.

In the context of the above simulations, this implies an appropriate cluster quality measure should consistently increase (or decrease) as Formula increased in Simulation 1 (causing the between-column correlation of Q to decrease) and as Formula increased in Simulation 2 (causing the correlation between the columns of R and their associated columns of Q to decrease). In Figure 2, the true classification curves of all the cluster quality measures consistently increased or decreased with Formula: the IGP, HS, and SW increased while the SS and WADP score decreased. Although the HS's true classification curves increased, the scale of the increase over the range of Formula was so small (0.78–0.79) that the HS was basically unchanged. In other words, as the groups become more isolated, the HS was constant. Therefore, the HS is not an appropriate measure of isolation.

In Figure 3, all the true classification curves for the IGP, HS, and WADP score decreased with the increase in Formula. In contrast, the true classification curves for the SS were constant and two of the SW true classification curves changed direction, first decreasing and then increasing. Therefore, the SS and SW are not appropriate measures of cohesion.

Based upon the true classification curves, only the IGP and WADP score were appropriate cluster quality measures. The WADP score had two major drawbacks as a cluster quality measure, however. First, the estimated WADP scores differed greatly from the true WADP scores in Figure 2 for Formula. Second, the WADP score required us to choose the value of Formula. The true IGPs were different from the estimated IGPs for Formula, but the difference was not as great. In addition, the IGP did not require values for parameters to be chosen before calculating the score. Thus, these simulations demonstrate that the IGP was a better cluster quality measure than the HS, SS, SW, and WADP score.

These simulations also show a cluster's IGP depended upon the cluster's size, average correlation between members, and average correlation between each member and the centroids. In Figure 2 IGP graph, the two groups of 50 observations traced out similar IGP true classification curves (black and green) that differed from those of the two groups of 100 observations (red and blue). In Figure 3 IGP graph, however, all four groups traced out different IGP true classification curves.

Finally, the IGP of a cluster depended upon the composition of the entire population. When the third and fourth groups were removed from Simulation 1 and Simulation 2, the IGPs for the first and second groups increased when the clusters were not very cohesive or not very isolated (Figure 4). (In each of the 100 repetitions, the third and fourth columns of Q were removed, columns 101–300 were removed from R, the true classifications for the first 50 columns of R were 1, and the true classifications for the second 50 columns of R were 2.) Therefore, the IGP of a cluster that is not very isolated or cohesive will decrease in the presence of other clusters.


Figure 4
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 The IGP averages for the first and second groups in the absence of the third and fourth groups for Simulation 1 (left) and Simulation 2 (right). The lines trace out the average IGPs for the true classifications; the solid points are the average IGPs for the estimated classifications. (Left) The IGP averages of observations generated by Formula and Formula are circles (solid line) and diamonds (dashed line), respectively. (Right) The IGP averages of observations generated by Formula and Formula are circles (solid line) and diamonds (dashed line), respectively.

 
3.2 Comparison of null distribution generation versions

To apply the null distribution generation method proposed in Section 2.3, two independent datasets are required. The first is the one in which the clusters are initially identified and upon which the centroids are formed, and the second is the one whose columns are classified using these centroids. Hence, pairs of independent R matrices were made repeatedly for Simulations 3–5 which compare the null distribution generations versions. In Simulation 3, both R matrices were made identically to the R matrix in Simulation 1. In Simulation 4, both R matrices were made identically to the R matrix in Simulation 2. In Simulation 5, both R matrices were made like the R matrix in Simulation 1 with one important difference. For Formula, not all Formula and Formula were independent. After the Q matrix was generated and before the R matrix was generated, the following transformation was performed: Formula for Formula and Formula.

In Simulations 3–5, the true classifications of both R matrices were the same. The true classifications were used to make Formula from one R only by averaging over the rows of columns with the same classifications. These centroids were then applied to the other R matrix that was not used to make the centroids.

Unlike Simulations 1 and 2, only 20 datasets were generated for each standard deviation value (either Formula or Formula). Although 20 times is not as many as one would like, it was large enough to see differences between each of the null distribution generation methods and complete the simulations within a realistic time frame.

In all three simulations, all four null distribution methods were applied and p-values were computed as described in Section 2.3. Each null distribution of IGPs was made by generating 500 centroid matrices (Formula). For each standard deviation value (either Formula or Formula) and column of Q, the average and standard error of the p-values of the 20 repetitions were computed. They are presented in Figures 57.


Figure 5
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 The results of Simulation 3 are shown in these graphs. The horizontal axis of every graph is the value of Formula used to generate the elements of Q; the vertical axis of every graph is the p-value. The average p-values are plotted with their corresponding standard error bars. Results from Version 1 are represented by solid lines; results from Version 3, by dashed lines; and results from Version 4, by dotted and dashed lines. Results for data generated from Formula are in the first row; results for data generated from Formula are in the second row; results for data generated from Formula are in the third row; and results for data generated from Formula are in the fourth row (Formula).

 

Figure 7
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 The results of Simulation 5 are shown in these graphs. The horizontal axis of every graph is the value of Formula used to generate the elements of Q; the vertical axis of every graph is the p-value. The average p-values are plotted with their corresponding standard error bars. Results from Version 1 are represented by solid lines; results from Version 3, by dashed lines; and results from Version 4, by dotted and dashed lines. Results for data generated from Formula are in the first row; results for data generated from Formula are in the second row; results for data generated from Formula are in the third row; and results for data generated from Formula are in the fourth row (Formula).

 
Version 3 and Version 4 consistently produced p-values while Version 2 did not produce any p-values and Version 1 did not always produce p-values (the second and third panels in the first column of Figure 6). Version 2 did not produce any p-values because none of the groups induced by any Formula was the same size as any of the actual groups. This occurs when not all columns of Formula are near the data, e.g. Formula for all j and a single Formula. Therefore, Version 3 and Version 4 (transform versions) were applicable to more situations than Version 1 and Version 2 (no transform versions). Moreover, Version 3 and Version 4 produced very similar p-values. Finally, as expected, in all three null distribution generation versions, more isolated (as Formula increased in Figures 5 and 7) or more cohesive (as Formula decreased in Figure 6) clusters had lower p-values.


Figure 6
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 The results of Simulation 4 are shown in these graphs. The horizontal axis of every graph is the value of Formula used to generate the elements of R; the vertical axis of every graph is the p-value. The average p-values are plotted with their corresponding standard error bars. Results from Version 1 are represented by solid lines; results from Version 3, by dashed lines; and results from Version 4, by dotted and dashed lines. Results for data generated from Formula are in the first row; results for data generated from Formula are in the second row; results for data generated from Formula are in the third row; and results for data generated from Formula are in the fourth row (Formula).

 
As in Simulation 1 IGP results (top panel of Figure 2), Simulation 3 p-values for data groups of the same size were similar to each other, but different for data groups of a different size. The first and second rows of Figure 5 are similar to each other but different from the third and fourth rows, and the third and fourth rows of Figure 5 are similar to each other but different from the first and second rows. Specifically, the groups of size 50 had higher p-values than the groups of size 100 over Formula. This difference in p-values between groups of different sizes corresponded to a difference between the IGPs of the smaller groups and the larger groups (Figure 2, top graph). Furthermore, the dependence within the elements of Formula and Formula for Formula had the most impact when the centroids were not isolated (Formula). For Formula, the curves in Figure 7 greatly resembled those for Figure 5.

The results of Simulation 4 (Figure 6) were similar to the IGP results for Simulation 2 (Figure 3): the curves for the four groups all differed. Even though the relationship between the groups' IGP curves is not the same as the relationship between the groups' p-value curves, higher IGPs tend to correspond to lower p-values. Therefore, if the third and fourth groups were absent from Simulations 3 and 4, the p-values of the first and second groups would probably be lower over the regions where the first and second groups are not very isolated or cohesive. When a group is not very isolated or cohesive, the presence of additional groups will lower the p-value of the group.

In these simulations, both R matrices were identically generated which means the number of columns of both R were the same (300 columns). In real situations, however, the sample sizes of two independent datasets are not always the same. To determine the effect sample size had upon p-values a final simulation, whose results are not shown, was conducted. Pairs of R matrices were generated where Formula for all i and Formula. While the number of columns of the R matrix used to define the centroids remained constant, the number of columns of R to which the centroids were applied varied. The proportions the classes within the latter R matrix remained constant, however.

The number of columns of the second R ranged from 300 to 30. For all four groups and all three null distribution generation versions, the p-values fluctuated over this range for number of columns of R. Sometimes the p-values were lower for smaller sample sizes; sometimes the p-values were higher for smaller sample sizes. Therefore, the relationship between sample size and p-values is not easily summarized.


    4. APPLICATION TO BREAST CANCER DATA
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
Although breast cancer is the most common form of cancer to affect women, men are also affected, although in smaller numbers. Like other cancers, the seven stages of breast cancer are based upon the development of the disease (National Cancer Institute, 2004Go). Stage 0 is the least severe stage and is characterized by the appearance of non-invasive tumors in the breast. In contrast, Stage IV is the most severe and this advanced stage is characterized by breast tumors which have spread to the underarm, internal lymph nodes and beyond (Breastcancer.org, 2004Go).

In practice, breast cancer stages currently are based entirely upon clinical parameters. The invention of microarrays, however, created the possibility of identifying subtypes using gene expression profiles instead of tumor characteristics. Previous studies (Sørlie and others, 2001Go, Sørlie and others, 2003Go, Perou and others, 2000Go) have done just that. Sørlie and others (2003)Go analyzed 122 microarrays (115 of which were from malignant breast cancer tissues, seven of which were from non-malignant tissues) and identified five subtypes: two luminal-like (luminal A and luminal B), one ERBB2-overexpressing (ERBB2Formula), one basal-like (basal), and one normal breast tissue-like (normal-like). The five subtypes were identified in a semi-supervised way: the samples were hierarchically clustered on 534 "intrinsic" genes but the five groups are not identified by cutting the tree at a certain height or specifying the number of groups. As in Sørlie and others (2003)Go, we will refer to this dataset as the Norway/Stanford dataset. The Norway/Stanford data are available from the Stanford Microarray Database (http://genome-www.stanford.edu/MicroArray/). A link to them is provided on the webpage: http://genome-www.stanford.edu/breast_cancer/.

Each of the five subtypes was characterized by a centroid made by averaging the gene expression values across all the samples belonging to a subtype. These centroids were then used to classify breast cancer microarrays from independent datasets: van't Veer and others (2002) (shared 461 of the 534 intrinsic genes) and West and others (2001) (shared 222 of the 534 intrinsic genes). The Pearson's (centered) correlation coefficient was computed for each sample and each centroid. The sample was classified to the group whose centroid had the maximum correlation with the sample. In addition, a correlation cutoff of 0.1 was used, so samples which did not have a correlation of at least 0.1 with any of the centroids were not classified to any subtype (labeled "below-cutoff") (Table 1).


View this table:
[in this window]
[in a new window]

 
Table 1 These are the summary statistics for the sample groups in the van't Veer and others and West and others datasets formed by classification using the Norway/Stanford centroids. The size of each group, average Pearson's (centered) correlation coefficients between the members of each group, and the standard error of the Pearson's (centered) correlation coefficients between the members of each group both with and without the 0.1 cutoff are shown

 
In Sections 4.1 and 4.2, we go a few steps further with the Norway/Stanford, van't Veer and others and West and others datasets. They are all used to compare the five cluster quality measures and to compare the four cluster validation versions. In addition, these comparisons lead us to validate three of the five subtypes.

4.1 Comparison of cluster quality measures

As in Sørlie and others (2003)Go, the Norway/Stanford centroids were applied to the van't Veer and others and West and others datasets to classify the samples. After classifying all the samples in the van't Veer and others dataset (no cutoff was used), the five cluster quality measures were computed for each subtype which contained at least one sample (Formula and data perturbed 500 times). The procedure was repeated for the West and others dataset. The results for both datasets are presented in Table 2.


View this table:
[in this window]
[in a new window]

 
Table 2 These are the values of the five cluster quality measures of groups defined by application of Norway/Stanford centroids to van't Veer and others and West and others datasets. The WADP score was computed using 500 perturbation matrices whose entries are from a normal distribution (mean was 0 and standard deviation was 1)

 
For the IGP, HS, and SW, positive value is directly related to cluster quality. For the WADP score, clusters whose scores are closer to zero are of higher quality. For the SS, clusters whose scores are closer to Formula are of higher quality. Using this information to rank the subtypes from highest quality to lowest quality, we saw that the SW and WADP score ranked the West and others groups in the same order. In every other case, however, the subtypes were ranked differently by each of the cluster quality measures. Nevertheless, in four of the five cases, the basal subtype had the best score. For the SS, the Luminal A centroid is the most isolated, but the basal-like centroid is the second-most isolated. Therefore, while none of the measures was equivalent, the cohesive and isolated basal-like cluster stood out when using any of the five cluster quality measures, especially when the IGP, HS, SW, or WADP score was used.

4.2 Comparison of null distribution generation versions

All four versions of the null distribution generation procedure were applied to the van't Veer and others and West and others datasets twice, first without a cutoff and then with a cutoff of 0.1 (Tables 7–10GoGoGo). As in Section 2, the group of samples for which Formula was called the below-cutoff group.


View this table:
[in this window]
[in a new window]

 
Table 7 Estimated p-values for van't Veer and others data by null distribution generation version (no cutoff). The size of the null distribution is also given (n)

 

View this table:
[in this window]
[in a new window]

 
Table 8 Estimated p-values for van't Veer and others data by null distribution generation version (0.1 cutoff). The size of the null distribution is also given (n)

 

View this table:
[in this window]
[in a new window]

 
Table 9 Estimated p-values for West and others data by null distribution generation version (no cutoff). The size of the null distribution is also given (n)

 

View this table:
[in this window]
[in a new window]

 
Table 10 Estimated p-values for West and others data by null distribution generation version (0.1 cutoff). The size of the null distribution is also given (n)

 
In other words, the Norway/Stanford microarray data were A and the five centroids made from the dataset comprised C. First, C was used to classify the samples of the van't Veer and others dataset (Formula and Formula), then C was used to classify the West and others dataset (Formula and Formula). In both cases, no cutoff was used (Formula) and a 0.1 cutoff was used (Formula).

The minimum number of permutations used to generate the null distributions was 2500; the maximum was 250 000. The number of permutations was chosen so that at least 100 permutations would be used to compose the null distributions for each group. (NB recall that the null distributions depend upon the size of the group.) In only one case were fewer than 100 permutations used: the basal group when the van't Veer and others raw data were permuted (Version 2) with 0.1 cutoff. Out of 50 500 permutations, a group of size 35 only occurred three times. At this rate, over one million permutations would have been necessary to get 100 IGPs for groups of size 35.

When the Norway/Stanford centroids were applied to the van't Veer and others and West and others datasets both with and without a 0.1 cutoff, the basal-like subtype had the highest IGPs. In the West and others dataset, no samples were classified to the normal breast tissue-like subtype. In the van't Veer and others dataset, this subtype had the lowest IGPs. The IGPs for the ERBB2Formula, luminal A, and luminal B subtypes varied. For these three subtypes, however, at least one IGP was above 0.75.

The IGPs for the samples not classified to any group when a cutoff was used were 0 and 0.84 for the West and others and van't Veer and others datasets, respectively. More than 20 samples were not classified in the van't Veer and others dataset when the cutoff was used.

The groups were much more cohesive in the West and others dataset than in the van't Veer and others dataset. In the West and others dataset, the nearest neighbor of a sample classified to one subtype was classified to only two subtypes if it was classified at all. In contrast, in the van't Veer and others dataset, the nearest neighbor could have been classified to any subtype. The one exception was the normal breast tissue-like subtype. No normal breast tissue-like samples had a nearest neighbor classified to the basal-like subtype. The reverse also held true (Tables 3–6GoGoGo).


View this table:
[in this window]
[in a new window]

 
Table 3 Each entry is the number of samples classified to the row subtype and whose nearest neighbors were classified to the column subtype. The right-hand most column is the proportion of samples in the row subtype whose nearest neighbors were also classified to the same subtype

 

View this table:
[in this window]
[in a new window]

 
Table 4 Each entry is the number of samples assigned to the row group and whose nearest neighbors were assigned to the column group. The right-hand most column is the proportion of samples in the row subtype whose nearest neighbors were also classified to the same subtype

 

View this table:
[in this window]
[in a new window]

 
Table 5 Each entry is the number of samples classified to the row subtype and whose nearest neighbors were classified to the column subtype. The right-hand most column is the proportion of samples in the row subtype whose nearest neighbors were also classified to the same subtype

 

View this table:
[in this window]
[in a new window]

 
Table 6 Each entry is the number of samples assigned to the row group and whose nearest neighbors were assigned to the column group. The right-hand most column is the proportion of samples in the row subtype whose nearest neighbors were also classified to the same subtype

 
In every case except for that in which the null distribution was generated by permuting the centroids with a cutoff of 0.1, the estimated p-value for the basal-like subtype was less than 0.05. The estimated p-values for the luminal B subtype were all below 0.05 in the West and others dataset. In addition, the ERBB2Formula subtype's p-values were below 0.05 in every case using the West and others data and the cutoff. When a cutoff was not used when the Norway/Stanford centroids were applied to the West and others data, the ERBB2Formula p-values were below 0.05 in two cases and below 0.10 in the other two cases. Only the basal-like subtype was validated at the Formula level by any version of the null distribution generation procedure applied to the van't Veer and others data.

Although the estimated p-values varied widely across the datasets and across versions of the null distribution generation procedure, three trends were evident. First, the p-values were higher when the 0.1 cutoff was used because the van't Veer and others and West and others samples were not close to the centroids. When compared to the null distributions made without a cutoff, the null distributions made with the cutoff were skewed toward 1.0 (Figure 8). Second, the versions which applied the transformation before permuting yielded smaller p-values than the versions that did not apply the transformation and just permuted. Third, the p-values obtained from a centroid version were very similar to the p-values obtained from the equivalent version using the raw data.


Figure 8
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 These are boxplots of the IGPs for eight different null distribution generation methods. Each version of the null distribution generation procedure was twice applied (with and without cutoff) to both datasets for three different group sizes (5, 10, and 20). Each even-numbered method and its odd-numbered neighbor to the left are identical except that the even-numbered method uses the 0.1 cutoff and the odd-numbered method does not use a cutoff. All the even-numbered plots are skewed toward 1.0 when compared to its left-hand neighbor. Null distribution generation method labels: (1) permute centroids without cutoff, (2) permute centroids with 0.1 cutoff, (3) permute raw data without cutoff, (4) permute raw data with 0.1 cutoff, (5) transform centroids without cutoff, (6) transform centroids with 0.1 cutoff, (7) transform raw data without cutoff, and (8) transform raw data with 0.1 cutoff.

 

    5. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 
Although the IGP, HS, SS, SW, and WADP score all measure cluster quality, they are not equivalent. In Simulation 1 (Figure 2), the average HS for the true classifications was unaffected by the change in correlation between the columns of Q. Therefore, the HS does not capture any information about the isolation of the centroids. On the other hand, the SS only captured information about the correlation between centroids. In Simulation 2 (Figure 3), the average SS for the true classifications was unaffected by the change in correlation between members of a groups and their centroids. In addition, in Simulation 2, two of the SW true classification curves increased while two of the SW true classification curves decreased between Formula. Therefore, the HS, SS, and SW are poor choices for a cluster quality measure.

In both Simulation 1 and Simulation 2, the WADP score true classification curves differed greatly from the estimated classification curves for some values of Formula or Formula. This in addition to the requirement that we choose a value for Formula prevents the WADP score from being a good cluster quality measure.

Consequently, the IGP is the best choice for a cluster quality measure. First, the true classification curves for all groups increased as Formula increased and Formula decreased. Second, the true classification curves are close to the estimated classification curves. Third, the IGP did not require us to choose a parameter value.

Nevertheless, when all five cluster quality measures were applied to the breast cancer datasets, the basal subtype was given the best score by four of the five measures (Table 2). In both datasets, the basal subtype had the highest IGP, HS, and SW and had the lowest WADP score. The differences between these four scores appeared when a cluster was not very cohesive or isolated. Thus, if all the clusters in the datasets are of high quality, which of the four is used in the validation procedure may not matter.

Since the quality of the clusters may not be known beforehand, the IGP was used to compare the four versions of the null distribution generation procedure. Simulations 3–5 (Figures 57) showed when a null distribution generation version produced p-values, they were lower for more isolated or more cohesive clusters. In all three simulations, Version 2 did not yield any p-values and Version 1 tended to produce more conservative p-values than Versions 3 and 4. Also, the p-values from Version 3 and Version 4 were very similar. In addition, dependence between the rows of two of the columns of the centroid matrix (C) had the most impact when the clusters were not isolated (Formula). When the rows of C were not completely independent and the clusters were not isolated, the average p-value for a cluster was lower than the average p-value for the same cluster when all the rows of C were independent, even for the columns of C that were generated identically in Simulation 3 and Simulation 5. The difference in the p-values between Simulation 3 and Simulation 5 was most dramatic for the clusters of larger size.

Similar conclusions were seen when the null distribution generation versions were compared using real data (Tables 710). Not only were the p-values produced by Version 3 and Version 4 very similar for the breast cancer data but also were the p-values produced by Version 1 and Version 2. In other words, a p-value from a raw data version was close to the p-value from the corresponding centroid version. Furthermore, as was seen when Version 1 and Version 3 p-values were compared in the simulations, a p-value from a version that did not use a transformation was more conservative than the p-value from the corresponding transformation version. Moreover, the p-values from the versions using a 0.1 cutoff were more conservative than the versions not using a cutoff (Figure 8). Most likely, this is due to the IGP of each group without a cutoff being as high or higher than the IGP of the group with the 0.1 cutoff in all but one case (West and others ERBB2Formula). Based upon these results, a cutoff should not be used because it may reduce the quality of a cluster when quality is measured by the IGP.

Of the cluster quality measures considered, the IGP was the best at quantifying how likely a point was to be assigned to a different cluster. Of the null distribution generation versions, Versions 3 and 4 most reliably generated p-values. Although the two versions generated very similar p-values, Version 3 is superior to Version 4. Version 3 takes less time to implement, does not use raw data which may be unavailable, and does not require the user to make a choice about which hierarchical clustering method to use.

Therefore, using the IGP with the transform centroids version of the null distribution generation procedure without a cutoff (Version 3, Formula, and Formula), Section 4 shows that only the ERBB2+1, luminal B, and basal-like groups are reproducible and potentially biologically significant.

An implementation of null distribution generation Version 3 without a cutoff is available on-line through CRAN (http://cran.r-project.org) in the clusterRepro package.

As this breast cancer application demonstrates, the cluster validation method proposed here has the potential to be very useful. For a cluster found in datasets independent of the one in which it was defined, we believe this method (using the IGP and null distribution generation Version 3 without a cutoff) reliabily and efficiently gauges the significance of the cluster's reproducibility.


    ACKNOWLEDGMENTS
 
Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.


    FOOTNOTES
 
1 ERBB2+ was validated on the West and others dataset and only two or three samples were classified to this subtype so this result is somewhat suspect. Back


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATIONS
 4. APPLICATION TO BREAST...
 5. DISCUSSION
 REFERENCES
 

    Bailey TA and Dubes R. (1982) Cluster validity profiles. Pattern Recognition 15:61–83.

    Breastcancer ORG. (2004) Stages of Breast Cancer, http://www.breastcancer.org/cmn_sta_idx.html.

    Chen G, Jaradat SA, Banerjee N, Tanaka TS, Ko MSH, Zhang MQ. (2002) Evaluation and comparison of clustering algorithms in analyzing es cell gene expression data. Statistica Sinica 12:241–62.

    Datta S and Datta S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19:459–66.[Abstract/Free Full Text]

    Dudoit S and Fridlyand J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3:research0036.1–21.

    Dudoit S, Fridlyand J, Speed TP. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97:77–87.[CrossRef]

    Gordon AD. (1999) Classification(Chapman & Hall, Boca Raton, FL).

    Kerr MK and Churchill GA. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences of the Unites States of America 98:8961–5.

    Levine E and Domany E. (2001) Resampling method for unsupervised estimation of cluster validity. Neural Computation 13:2573–93.[CrossRef][Web of Science][Medline]

    National Cancer Institute. (2004) Staging: Questions and Answers, http://www.cancer.gov/cancertopics/factsheet/Detection/staging.

    Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA. and others. (2000) Molecular portraits of human breast tumours. Nature 406:747–52.[CrossRef][Medline]

    Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, van de Rijn M, Jeffrey SS. and others. (2001) Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences of the United States of America 98:10869–74.[Abstract/Free Full Text]

    Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S. and others. (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences of the United States of America 100:8418–23.[Abstract/Free Full Text]

    Tibshirani R and Walther G. (2005) Cluster validation by prediction strength. Journal of Computational and Graphical Statistics 14:511–28.[CrossRef]

    van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT. and others. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530–6.[CrossRef][Medline]

    West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olsen JA Jr, Marks JR, Nevins JR. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America 98:11462–7.[Abstract/Free Full Text]

    Yeung KY, Haynor DR, Ruzzo WL. (2001) Validating clustering for gene expression data. Bioinformatics 17:309–18.[Abstract/Free Full Text]

    Received July 28, 2005; revised February 28, 2006; accepted for publication March 8, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    The OncologistHome page
    J. S. Ross, C. Hatzis, W. F. Symmans, L. Pusztai, and G. N. Hortobagyi
    Commercialized Multigene Predictors of Clinical Outcome for Breast Cancer
    Oncologist, May 1, 2008; 13(5): 477 - 493.
    [Abstract] [Full Text] [PDF]


    Home page
    JNCI J Natl Cancer InstHome page
    L. Lusa, L. M. McShane, J. F. Reid, L. De Cecco, F. Ambrogi, E. Biganzoli, M. Gariboldi, and M. A. Pierotti
    Challenges in Projecting Clustering Results Across Gene Expression Profiling Datasets
    J Natl Cancer Inst, November 21, 2007; 99(22): 1715 - 1723.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    8/1/9    most recent
    kxj029v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Kapp, A. V.
    Right arrow Articles by Tibshirani, R.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Kapp, A. V.
    Right arrow Articles by Tibshirani, R.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?