Skip Navigation


Biostatistics Advance Access originally published online on April 13, 2006
Biostatistics 2007 8(1):101-117; doi:10.1093/biostatistics/kxj036
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
8/1/101    most recent
kxj036v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Dobbin, K. K.
Right arrow Articles by Simon, R. M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dobbin, K. K.
Right arrow Articles by Simon, R. M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Published by Oxford University Press 2006.

Sample size planning for developing classifiers using high-dimensional DNA microarray data

Kevin K. Dobbin* and Richard M. Simon

Biometric Research Branch, National Cancer Institute, 6130 Executive, Boulevard, Rockville, MD 20852, USA dobbinke{at}mail.nih.gov

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
Many gene expression studies attempt to develop a predictor of pre-defined diagnostic or prognostic classes. If the classes are similar biologically, then the number of genes that are differentially expressed between the classes is likely to be small compared to the total number of genes measured. This motivates a two-step process for predictor development, a subset of differentially expressed genes is selected for use in the predictor and then the predictor constructed from these. Both these steps will introduce variability into the resulting classifier, so both must be incorporated in sample size estimation. We introduce a methodology for sample size determination for prediction in the context of high-dimensional data that captures variability in both steps of predictor development. The methodology is based on a parametric probability model, but permits sample size computations to be carried out in a practical manner without extensive requirements for preliminary data. We find that many prediction problems do not require a large training set of arrays for classifier development.

Keywords: Gene expression; Microarrays; Prediction; Predictive inference; Sample size


    1. INTRODUCTION/MOTIVATION
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
The goal of many gene expression studies is the development of a predictor that can be applied to future biological samples to predict phenotype or prognosis from expression levels (Golub and others, 1999). This paper addresses the question of how many samples are required to build a good predictor of class membership based on expression profiles. Determining appropriate sample size is important as available clinical samples are often either very limited or costly to acquire and assay. As microarray studies move from the laboratory toward the clinic, the reason for developing the predictor is increasingly to assist with medical decisions (Paik and others, 2004), and the consequences of having a predictor that performs poorly because the sample size was too small can be serious.

Few methods have been published for developing genomic classifiers. Most publications on sample size determination for microarray studies are limited to the objective of identifying genes which are differentially expressed among the pre-defined classes. These have been reviewed by Dobbin and Simon (2005)Go. Hwang and others (2002) addressed the objective of sample size planning for testing the global null hypothesis that no genes are differentially expressed, which is equivalent to testing the null hypothesis that no classifier performs better than chance. However, a sample size sufficient for rejecting the global null hypothesis may not be sufficient for identifying a good classifier. Mukherjee and others (2003) developed a learning curve estimation method that is applicable to the development of predictors but requires that extensive data already be available so that the learning curve parameters can be estimated. Fu and others (2005) developed a martingale stopping rule for determining when to stop adding cases, but it assumes that the predictor is developed sequentially one case at a time and does not provide an estimate of the number of cases needed.

The high dimensionality of microarray data, combined with the complexity of gene regulation, make any statistical model for the data potentially controversial. This has led some authors to avoid modeling the expression data directly, and instead model the general abstract learning process (Mukherjee and others, 2003; Fu and others, 2005). But a model for gene expression does not need to be exactly correct to be useful, and to provide insights into the classification problem that other more abstract approaches do not. We do not assert that the model presented here is exactly correct, but a useful oversimplification. Such oversimplifications are not uncommon in sample size determination methodologies: for example, models used to estimate sample sizes in clinical trials are often simpler than the planned data analysis. The simpler model is likely to have lower sensitivity and specificity, resulting in conservative sample size estimates. The simpler model also has the advantage that the resulting calculations will be more transparent, whereas sample size calculations based on the more complex model may be opaque and unconvincing.

A novel contribution of this paper is the integration of dimension-reduction into the framework of the normal model to calculate sample size for high-dimensional genomic data. We develop a novel methodology for calculating the significance level Formula to be used in gene selection that will produce a predictor with the best resulting expected correct classification rate. We present methods for sample size calculation when one class is under-represented in the population. We also present novel results on how the size of the fold-change for differentially expressed genes, the noise level, and the number of differentially expressed genes, affect predictor development and performance.

Section 2 presents the predictive objective that will be used to drive sample size calculation. Section 3 presents the probability model for microarray data and the optimal classification rates. In Section 4, sample size algorithms are developed. In Section 5, the accuracy of the approximation formulas used in Section 4 are assessed, as well as their robustness to violations of model assumptions; also, the effect of model parameter combinations on sample size requirements and correct classification rates are examined. In Section 6, the robustness of the methodologies are assessed by application to a number of synthetic and real-world datasets that violate the model assumptions. In Section 7, results and recommendations are summarized.


    2. THE SAMPLE SIZE OBJECTIVE
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
In a traditional class comparison study, the sample size objective is to achieve a certain power, say 95%, to detect a specified difference between the class means when testing at an Formula significance level. Under the usual class comparison model assumptions, an adequate sample size will exist because the power goes to 1 as the sample size goes to infinity.

By analogy to the class comparison case, one might wish in the class prediction setting to establish a sample size large enough to ensure that the probability of correct classification (PCC) will be above, say, 95%. There are at least two problems with this objective. The first problem is that the probability of correct classification will depend on what samples are chosen for the training set; in other words, the probability of correct classification will not be a fixed quantity, but will be a random variable with some variance. Hence, for any sample size there will likely be some positive probability that the correct classification rate is below 95%. So we will instead consider the "expected value" of this random variable, where the expectation is taken over all possible sample sets of the same size in the population.

The second problem is that, unlike the power in a class comparison study (which always goes to 1 as the sample size goes to infinity), the PCC in a class prediction study will not necessarily go to 1 as the sample size goes to infinity. This is because for any two populations there may be areas in the predictor space where samples from each class overlap, so that class membership cannot be determined with confidence in these areas. An extreme example would be two identically distributed populations, where no predictor can be expected to do better than a coin toss (50%). Lachenbruch (1968)Go solved this problem by framing the question as: how large a sample size is required for the resulting predictive function "to have an error rate within Formula of the optimum value?" For example, a Formula of Formula ensures that the expected probability of correct classification for the predictor will be within 10% of the best possible predictor. We will use an objective equivalent to Lachenbruch's, namely: determine the sample size required to ensure that the expected1 correct classification probability for a predictor developed from training data is within Formula of the optimal expected correct classification probability for a linear classification problem2.

We would also note that although we will focus attention on this objective, the formulas developed here could also be used to ensure sensitivity and/or specificity above a specified target.


    3. THE PROBABILITY MODEL
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
The general framework for the class prediction problem is that in some population of interest, individuals can be split into k disjoint classes, Formula. The classes may correspond to different outcome groups (e.g. relapse in 5 years versus no relapse in 5 years), different phenotypes (e.g. adenocarcinoma versus squamous cell carcinoma), etc. A predictive model will be developed based on a training set T. For each individual in the training set, one observes that individual's class membership, and a data vector Formula, the gene expression vector; the goal of the training set experiment is to develop a predictor of class membership based on gene expression, and possibly an estimate of the predictor's performance. Our goal is to determine a sample size for the training set.

Consider a two-class problem, with the gene expression data vector denoted by Formula, which consists of normalized, background-corrected, log-transformed gene expression measurements. To simplify notation and presentation, let the first m elements of the data vectors represent the differentially expressed genes3, the remaining Formula elements of the data vectors represent undifferentially expressed genes, and each differentially expressed gene be centered around zero. The probability model for this two-class prediction problem is

Formula (3.1)

The mean vector has the form Formula, where Formula represent the differentially expressed gene means.

The model stipulates that within each class, expression vectors follow a multivariate normal distribution with the same covariance matrix. It will be also assumed that differentially expressed genes are independent of undifferentially expressed genes. If Formula is singular, then some genes are linear combinations of other genes (see, e.g. Rao, 1973Go, pp 527–8). Put another way, there are "redundant" genes with expression that is completely determined by the expression of other genes. Having these type of redundant genes is analogous to having an overparameterized linear model, and a model reduction transformation (Hocking, 1996Go, p 81) can eliminate the redundant genes, resulting in a non-singular covariance matrix. We will imagine these redundant genes have been eliminated, so that the covariance matrix Formula is non-singular4. Marginal normality of the gene expression vectors may be considered reasonable for properly transformed and normalized data, although multivariate normality may be more questionable. However, in taking a model-based approach one must make some assumption and one weaker than multivariate normality is unlikely to lead to a tractable solution. The assumption of independence between differentially expressed and non-differentially expressed genes is not critical and is mainly made for mathematical convenience. Violations of this assumption will be evaluated.

A key issue is the size of the Formula relative to the biological and technical variation present. A relatively small Formula would correspond to a differentially expressed gene that is nearly uninformative—i.e. that will be of little practical use for prediction. We discuss these types of genes with numerous examples below and show that nearly uninformative genes, even if there are many of them (50–200 among thousands of genes), are generally of little use for predictor construction and sample size estimates should be based on more informative genes. If the biological reality is that all the differentially expressed genes are nearly uninformative, then we will see that no good predictor will result from the microarray analysis.

It will simplify presentation to assume that each differentially expressed gene has a common variance, and each undifferentially expressed gene has a common variance. In practice, genes are likely to have different variances. But the relationship between fold-change and gene variance determines the statistical power in the gene selection phase of predictor development. In order to keep this relationship intuitive, it is important to have a single variance estimate rather than a range of variance estimates. The single variance parameter can be considered a mean or median variance, in which case the targeted power will be achieved on average, with higher power for some genes and lower for others. More conservatively, the 90th percentile of the gene variances can be used, in which case the targeted power will be achieved even for genes exhibiting the highest variation (which may be the ones of most potential interest) across the population.

3.1 Notation

Throughout, Formula will denote the marginal expected probability of correct classification taken over all samples of size n in the population. Formula will denote the expected PCC for an optimal linear classifier.

In the population of interest, the proportion in class Formula is Formula, and the proportion in class Formula is FormulaFormula. The covariance matrix can be written in a partitioned form, with notation Formula where Formula is an Formula correlation matrix for the differentially expressed genes and Formula is the Formula correlation matrix for the undifferentially expressed genes.

3.2 Optimal classification rates for the model

The optimal classification rule and rate will depend on the proportion in class Formula in the population. For a two-class problem with equal covariance matrices, if the model parameters are known, then the optimal normal-based linear discriminant rule, that is, the Bayes rule, is known and the classification rate of this classifier can be determined. In Appendix A, it is shown that

Formula

where Formula is the natural logarithm, and Formula is the cumulative distribution function for a standard normal random variable. When Formula, so that each class is equally represented in the population, this simplifies to

Formula (3.2)

Note that Formula is the Mahalanobis' distance between the class means, making this result closely related to that of Lachenbruch (1968)Go.

In the special case when Formula, and Formula, it is shown in Appendix A that an upper bound on the the best probability of correct classification is

Formula (3.3)

where Formula is the smallest eigenvalue of Formula, which is 1 if genes are independent.


    4. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
Consider linear classifiers of the form: classify in class Formula if Formula and in class Formula otherwise. The vector Formula is estimated from the training set, and will depend on how genes are selected for inclusion in the predictor, and what prediction method is used. We will take a simple approach to predictor construction which does not weight the importance of the individual genes in the predictor. Each element of Formula is 0 or 1; 1 indicates a gene determined to be differentially expressed by the hypothesis test of Formula versus Formula; 0 indicates a gene determined not to be differentially expressed. This simple predictor is likely to have lower sensitivity and specificity than more sophisticated ones that assign weights to individual genes and we would by no means recommend people to use it. But, the sample sizes that we calculate this way should tend to be conservative (large).

Consider the hypothesis tests for gene selection described in the previous paragraph. These are tests of differential expression for each of the p genes. With each hypothesis test is associated a specificity, which will be denoted Formula, and is the probability of correctly identifying a gene that is not differentially expressed; and also a sensitivity or power, which will be denoted Formula, and is the probability of correctly identifying a gene as differentially expressed when in fact it is differentially expressed by a specified amount (Formula). These hypothesis tests could be based on many different statistics. The calculations here will use two-sample t-tests.

4.1 Formulas for PCC(n)

Each differentially expressed gene will be assumed to be differentially expressed by Formula, and a common variance Formula for genes will be assumed. An approximate lower bound for the expected probability of correct classification is derived in Appendix B, and is

Formula

where Formula is the largest eigenvalue of the population correlation matrix. When the other parameters are fixed, Formula reaches a minimum at Formula. When Formula, so that the two classes are equally represented, this simplifies to (Appendix B)

Formula

In the special case when Formula, Formula.

Note that Formula is the power associated with the gene-specific hypothesis tests that each gene is not differentially expressed among the classes, and the term under the final root sign is the true discovery rate (TDR)5.

In fact, power calculations can be used to eliminate Formula from the equation (see Appendix C).

4.2 Sample size determination

Recall that the objective is to find a sample size that will ensure that Formula, where Formula is a pre-specified constant. The calculation can be based on the general formula

Formula (4.1)

Note that this formula will only guarantee that the overall PCC is within the specified bound, but the PCC for the individual classes may differ. In particular, the rarer subgroup may have much poorer PCC. A more stringent approach is discussed below in Section 4.4. If we assume that Formula and that genes are independent, then the simpler formula

Formula (4.2)

can be used to determine the sample size. Ideally, one would want to eliminate m from the equation, so that the number of differentially expressed genes need not be stipulated. One can do this by maximizing the difference over m. As m gets large, the distance between the class means increases, and Formula goes to zero (but the difference is not always strictly decreasing). So the maximum m value should be a low integer, and most likely Formula. To ensure that the difference is less than Formula, one can use

Formula (4.3)

This gives rise to the following algorithm for sample size determination:

  1. Initialize Formula.
  2. Formula.
  3. For Formula, use (C.1) to find the optimal Formula for each m, by a linear search over Formula. Then plug these Formulas into (4.3) to get an upper bound on Formula, call this Formula.
  4. Is Formula? If no, return to step 2. If yes, continue to step 5.
  5. Use a sample of size n, with Formula from each class.

If we do not assume gene independence, then it will be necessary to estimate extreme eigenvalues of the correlation matrix. Based on simulation with block diagonal compound symmetric covariance matrices (not shown), we suggest the method of Ledoit and Wolf (2004)Go, which seems to perform reasonably well compared to others we examined. One could then plug these estimates into the equation

Formula (4.4)

4.3 Sample size for a test set

Once a predictor has been developed, an estimate of its performance is required. Such estimates are calculated either using a separate testing set of samples that were not used at all in the predictor development or by cross-validation applied to the training set. Advantages to having an independent test set are discussed in Simon and others (2003).

How best to split a collection of samples into a test set and training set is addressed in Molinaro and others (2005). Alternatively, one can use the methods developed here to determine the sample size required for the training set, then base the sample size for the test set on the width of the resulting confidence interval for the PCC. This is a valid approach because the independence of the test set implies that a binomial random variable correctly models the number of correct classifications. If L is the interval length, then the sample size formula6 for the test set is just Formula, where Formula is the estimated correct classification rate. For example, if Formula, then Formula results in sample sizes of Formula, respectively.

4.4 Controlling the PCC in each class

We have presented methods for controlling the overall probability of correct classification. In some cases, one may want to control the PCC for each class individually. If one class is under-represented, then the PCC using the optimal cut-point will be lower in the under-represented class.

Formula

where Formula is the proportion in the under-represented class and Formula is the power to detect a differentially expressed gene given Formula, n, and Formula. This formula can be used to develop methods similar to those we have presented to determine sample size.

A simpler rough approach is to let Formula be the sample size calculated by the methods we have presented for the case when half the population is from each class, for example, based on (4.3). Then, use a sample size of

Formula

to ensure that the expected number from the under-represented class is the same as it would be if both classes were equally represented.

If Formula is close to 0 or 1, then this approach may lead to very large sample sizes. One option in this case is to oversample from the rarer class, so that for example half the samples are from each class. But resulting estimates of classification error will depend on external estimates of the proportion in each class, so oversampling is most appropriate for preliminary studies.


    5. RESULTS
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
We first ran an extensive simulation to verify that the approximations used in the course of the derivation of the equation for the PCC produced good estimates. Table 1 shows the estimated probabilities of correct classification based on (4.4), and compares these to estimates of the population values based on Monte Carlo for a variety of combinations of values of effect size, Formula, number of genes affected, m, and sample size, n. For the Monte Carlo-based estimates, data were generated according to the model specifications, then predictors developed as outlined in Section 4, and finally prediction accuracy assessed on an independent test set. As can be seen, the equation-based estimates are close to the Monte Carlo-based population estimates and, when different, tend to be conservative.


View this table:
[in this window]
[in a new window]

 
Table 1 The expected PCC estimate is generally either accurate or conservative (underestimates the true PCC). Here, the number of genes is p = 10 000, 2{delta}/{sigma} is the effect size for differentially expressed genes, n is the number of samples in the training set, m is the number of differentially expressed genes, and {alpha} is the significance level used for gene selection. Monte Carlo PCC estimates were calculated by generating a predictor based on a simulated sample, then generating 100 datasets from the same populations and calculating the prediction error rate of the predictor; this entire process was repeated 100 times and the average correct classification proportions appear in the table

 
We next consider choice of Formula, the significance level for gene selection. Formula can be chosen to maximize the resulting PCC. Plots of the Formula as a function of Formula for varying values of n are shown in Figure 1. The plots show that there is an optimal Formula value on the interval. As the sample size n increases, the Formula that maximizes the probability of correct classification decreases. This trend is intuitively reasonable because Formula determines the cutoff value used in the gene selection step, and as the sample size gets large the power associated with small Formula's will increase. Also note from the plots that the larger the effect size (Formula), the smaller the Formula value that will maximize Formula for fixed n. This makes sense since larger effect sizes will be easier to detect, so that one can afford to use more stringent Formula's to reduce the false-positive rate.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Plots of the estimated PCC as a function of Formula, plotted for various values of n, based on (C.1). In each plot, "Effect" is defined as Formula, Formula is the number of differentially expressed genes, and p = 10 000 is the number of genes in vector Formula. The sample sizes are, from left to right, Formula, Formula, and Formula. The maximum points on the plots (indicated by vertical lines) are, for the top row, Formula, Formula, and Formula and for the bottom row, Formula, Formula, and Formula.

 
Figure 2 shows the sample sizes required as a function of Formula, for Formula and Formula differentially expressed genes. Here, Formula. The sample sizes were optimized7 over Formula, and will ensure the expected Formula is within Formula of the best possible prediction rule. For a fixed effect size Formula per gene, Formula is an easier classification problem than Formula, because the distance between the class means is Formula. So the sample size requirements are smaller for Formula. For an effect size of 1.5 (e.g. Formula for a two-fold change in expression, and Formula), a sample size around 60 appears adequate.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 Plot of effect size (Formula) versus sample size. Optimal Formula used for determination of sample size. The sample sizes ensure that the average PCC is within Formula of the best possible correct classification probability. Gene independence is assumed. Half the population is from each group, so that Formula.

 
We next turned to the question of the relationship between effect size, Formula, the number of differentially expressed genes m, and the Mahalonobis distance between the class means (which determines the best possible classification rate). Table 2 shows examples where the effect size is small, so that each gene is nearly uninformative. For example, when the effect size is Formula, and the number of informative genes is 200 or less, then one of two bad things will happen: either even an optimal predictor will have a poor Formula, so that predictions will be unreliable or development of a predictor with Formula within some reasonable bound will require prohibitively large sample sizes (500 or more). In the latter case, while a good predictor exists, the problem is too hard practically to work out with this technology. The table also shows that as the effect size gets larger, these issues go away. Hence, it is critical that at least some genes have a reasonably large effect size in order to develop a predictor. Therefore, for sample size calculations, one should assume that some genes do have reasonable effect sizes.


View this table:
[in this window]
[in a new window]

 
Table 2 Impact of small effect sizes on the correct classification rate. PCC({infty}) is the best possible correct classification rate, PCC(500) and PCC(200) are the correct classification rates with samples of sizes 500 and 200, respectively. For small effect size 0.2, a strong classifier exists only if many genes are differentially expressed; but even with many differentially expressed genes and a sample size of 500, the estimator performs poorly. For somewhat larger effect size 0.4, fewer differentially expressed genes are required for a good classifier to exist, but even n = 200 samples do not result in estimates within 10% of the optimal classification rate. For effect size 0.6, the situation is more amenable if running a large sample is feasible. For reference, if {sigma} = 0.5, then an effect size of 0.6 corresponds to a 22{delta} = 20.3 = 1.23 fold-change. Gene independence assumed here

 
These considerations also lead to the following recommendation: when using conservative sample size approaches with Formula, the estimated effect size Formula should correspond to the estimated largest effect size for informative genes. This is the approach we used in Section 6 and seemed to perform adequately. Note also that we assume that p is in the 1000 to 22 000 range, that there are Formula–200 or so genes differentially expressed. When either p or m values fall outside these assumptions, then our conclusions may not be valid. For example, microarrays with 10's or 100's of genes represented would require amendment of this approach.

Figure 3 shows plots of the probability of correct classification as a function of n. The lines represent different values of Formula. One can see the Formula values approaching their asymptotic Formula values as n approaches 100.


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Left plot is Formula and right plot is m = 5. p = 10 000. Plot of sample size versus PCC for various values of the effect size Formula. Gene independence is assumed. Formula uses optimal Formula. Population assumed evenly split between the classes, so Formula.

 

    6. EXAMPLES USING SYNTHETIC AND REAL MICROARRAY DATA
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
We tested the robustness of these methods using synthetic microarray data that violates the model assumptions, and real microarray data that likely do as well. Results are presented in Table 3. See table caption for detailed description of analysis.


View this table:
[in this window]
[in a new window]

 
Table 3 Robustness evaluation of PCC estimates. Gene selection based on {alpha} = 0.001, with support vector machine (SVM) and nearest neighbor (1-NN). The sample size is n, with n/2 taken from each class. All equation-based Formula estimates use m = 1. Synthetic datasets: generated from multivariate normal distribution with three differentially expressed genes with effect sizes given. p = 1000 genes with block diagonal correlation structure compound symmetric with 20 blocks of 50 genes each and within-block correlation {rho}, and each gene with variance 1. True PCC(n) for each classification algorithm estimated using n for classifier development and remainder for PCC(n) estimation, with 10 replicates of a sample size of 200, then averaging correct classifications. PCC({infty}) estimated using cross-validation on a sample size of 200. Real datasets: taken from Golub and others (1999) and Rosenwald and others (2002). Predictor developed using n for classifier development and remainder for PCC(n) estimation. Process repeated 30(20) times for Golub (Rosenwald) datasets, and average probabilities of correct classification presented. Effect size 2{delta}/{sigma} was estimated from the complete data using empirical Bayes methods. See Appendix E for details. PCC({infty}) estimated based on cross-validation on complete datasets. Analyses were performed using BRB ArrayTools developed by Dr. Richard Simon and Amy Peng Lam

 
The estimates of Formula and Formula based on our method are uniformly conservative across all datasets. The Formula estimates are very conservative when applied to the Golub dataset. This might be expected since there are likely many genes with large effect sizes in these different tumor types, so that our assumption that there is only Formula informative gene makes the problem much harder than it really is. The Formula estimates are extremely conservative not only for the Golub dataset but also for several others as well. This is offset somewhat in the Formula estimates since both quantities are biased in the same direction, causing some of the bias to cancel out of the difference.

One must be somewhat careful in the interpretation of Table 3, because although for smaller sample sizes like Formula the Formula and Formula estimates, which are based on means over multiple simulations, appear uniformly conservative, there was also significant variation observed in the performance of classifiers in different simulations. For example, while the mean for the synthetic dataset on the first row of the table was 0.92 using SVMs, the worst classifier had an estimated correct classification rate of 0.79.

We applied both the gene independence-based method and the Ledoit and Wolfe eigenvalue estimation method to two other real microarray datasets (not shown) and found that resulting sample size estimates were similar in the two approaches, and in the 40–60 samples range.

In conclusion, our method tends to be conservative in estimating Formula, Formula, and Formula for the datasets we examined, sometimes very conservative. So the method should be lead to adequate sample sizes while sometimes producing larger sample size estimates than are truly required. For example, our formulas are likely to significantly overestimate the required sample size for classification problems involving morphologically diverse tissues that are expected to have many differentially expressed genes with large effect sizes. In these cases, it may be advisable to follow the guidelines in Mukherjee and others (2003) instead.


    7. CONCLUSION
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 
We have presented novel methods for determining sample size for building predictors of class membership from high-dimensional microarray data. These methods take into account variability both in predictor construction and gene selection. These methods require only that two quantities be specified: the size of the fold-change relative to the variation in gene expression and the number of genes on the microarrays. We presented an alternative approach based on eigenvalue estimation. We investigated the robustness of our method on synthetic datasets that violated our model assumptions and on publicly available microarray datasets.

We found that sample sizes in the range of 20–30 per class may be adequate for building a good predictor in many cases. These results are similar to Mukherjee and others (2003). In general, we found that the sample size requirements for prediction are relatively small. We showed that the reason for this is that if a good gene-expression-based predictor exists in the population, then it is likely that some genes exhibit significant differential expression between the classes relative to the biological noise present. Hence, adequate power to identify these genes can be achieved without a large number of microarrays. One drawback of our approach is that it controls the expected probability of correct classification to be within some tolerance of the optimal, but does not control the actual probability of correct classification; for small sample sizes, the probability of correct classification may be highly variable depending on what samples are chosen for the training set, so that our method may not give adequate control and should be used with caution in these situations.

We identified scenarios in which either no good predictor exists or it is practically not feasible to construct one. No good predictor may exist when differential expression is small relative to biological variation—and this may be the case even when as many as 100 genes are differentially expressed. We further found that even if enough genes are differentially expressed to construct a reasonable predictor (in theory), if the fold-changes for differentially expressed genes are uniformly small relative to the biological variation, then identification of differentially expressed genes and construction of a good predictor will probably not be feasible.

We investigated both the cases when each class is equally represented in the population and when they are not equally represented. We presented methods for controlling the overall PCC and for controlling the PCC in the worst group in these situations (sensitivity and specificity).

The eigenvalue-based estimation method presented is quite preliminary and we would in general recommend that the independence assumption approach be used instead. One problem with the eigenvalue approach is that there will be uncertainty about the eigenvalue estimates. Another problem is that these formulas are theoretical worst-case-scenarios bounds that may in fact be much worse than reality, and therefore lead to too conservative estimate for the sample size. Additionally, Table 3 suggests that the gene independence method may be conservative enough already, even when it is violated.

A method for finding the optimal significance level Formula to use for gene selection when developing a predictor was presented, and approaches to determining sample size for a testing set discussed. The optimal significance levels Formula tend to be on par with those generally used in microarray experiments (e.g. Formula). We further showed that the probability of correct classification depends critically on the power and the true discovery rate, so that gene selection methods that control the false discovery rate should produce good predictors.


    APPENDICES
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 

A.1 Appendix A

Assume (3.1) where Formula is positive definite, and has the form Formula, where Formula indicates an Formula covariance matrix. For an x randomly selected from the population of interest Formula, define Formula. Then, Formula follows.

The linear prediction rule which results in the best probability of correct classification classifies x in Formula if (see, e.g. Enis and Geisser, 1974Go)

Formula

and classifies x in Formula otherwise. If Formula, then the rule reduces to: classify x in Formula if Formula and classify x in Formula otherwise.

The vector Formula is a linear combination of normal random variables, and so is normally distributed:

Formula

Therefore, the PCC for this optimal classifier is

Formula

If Formula, then this simplifies to

Formula

An extremal property of eigenvalues is that Formula (Schott, 1997Go, Theorem 3.15). If Formula and Formula, then it follows that

Formula

If we further assume that the covariance matrix for the differentially expressed genes has the form Formula, so that differentially expressed genes are independent, then

Formula

A.2 Appendix B

Assume that each gene has the same variance Formula, and all differentially expressed genes have the same fold-change, Formula. The linear predictor w developed on some training set consists of zeros and ones. To simplify notation, let Formula.

Formula

Now, Formula. Also, if Formula is the largest eigenvalue of the correlation matrix of the genes, then Formula (see, for example, Schott, 1997Go, Theorem 3.15). So Formula. Thus,

Formula

If Formula, then Formula and

Formula

If genes are also independent, so that Formula, then Formula and

Formula

A.3 Appendix C

In this appendix, we present a method for finding the optimal Formula for fixed n.

Fix a sample size n, and assume this is an even number. First, note that the distance between the class means under the current model is Formula. Now, use the normal approximation sample size formula to solve for Formula. Recall that notationally Formula where Formula is the inverse cumulative distribution function for a central T distribution with Formula degrees of freedom

Formula

This results in

Formula (C.1)

where Formula is the cumulative distribution function for a Student's t distribution with Formula degrees of freedom. Equation (C.1) can be used to select an Formula significance level that will maximize Formula.

A.4 Appendix D

In Appendix A, it was shown that

Formula

In Appendix B, it was shown that

Formula

so that

Formula

Under the assumption Formula and Formula, we have

Formula

A.5 Appendix E

The empirical Bayes method used in Table 3 is presented. Let Formula where Formula is the median (across genes) of the estimated pooled within-class variance. Let Formula be the sample variance of the estimated effect sizes, Formula. Then, Formula and Formula. See, e.g. Carlin and Louis (1996). The largest estimated effect size was used.


    ACKNOWLEDGMENTS
 
Conflict of Interest: None declared.


    FOOTNOTES
 
1 The expectation is taken over all training samples of size n in the population. Back

2 Under the assumptions of the homogeneous variance multivariate normal model. Back

3 A differentially expressed gene is defined as a gene with different average expression in the different classes. Back

4 Note that assuming Formula is non-singular is not the same as assuming the estimated covariance matrix S is non-singular. S will usually be singular because of the "large p small n" issue, i.e. because there are many more genes than samples. Back

5 Technically, this is the approximate TDR, the expected value of the true discovery proportion (TDP), which is one minus the false discovery proportion (FDP). Let Formula be the number of false discoveries when the sample size is n, and Formula the number of true discoveries. Then, Formula. Back

6 The formula is valid when Formula and Formula, which must be verified. Back

7 For these plots, the optimal Formula was calculated for all even sample sizes from 2 to 100, and then for each possible sample size the corresponding optimal Formula was used to estimate the mean Formula. Back


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION/MOTIVATION
 2. THE SAMPLE SIZE...
 3. THE PROBABILITY MODEL
 4. METHODS
 5. RESULTS
 6. EXAMPLES USING SYNTHETIC...
 7. CONCLUSION
 APPENDICES
 REFERENCES
 

    Carlin BP and Louis TA. (1996) Bayes and Empirical Bayes Methods for Data Analysis(Chapman and Hall, London).

    Dobbin K and Simon R. (2005) Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27–38.[Abstract]

    Enis P and Geisser S. (1974) Optimal predictive linear discriminants. Annals of Statistics 2:403–10.

    Fu WJ, Dougherty ER, Mallick B, Carroll RJ. (2005) How many samples are needed to build a classifier: a general sequential approach. Bioinformatics 21:63–70.[Abstract/Free Full Text]

    Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. and others. (1999) Molecular classification of cancer: class discovery and class prediction by expression monitoring. Science 286:531–7.[Abstract/Free Full Text]

    Hocking RR. (1996) Methods and Applications of Linear Models: Regression and the Analysis of Variance(John Wiley and Sons, New York).

    Hwang D, Schmitt WA, Stephanopoulos G, Stephanopoulos G. (2002) Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 18:1184–93.[Abstract/Free Full Text]

    Lachenbruch PA. (1968) On expected probabilities of misclassification in discriminant analysis, necessary sample size, and a relation with the multiple correlation coefficient. Biometrics 24:823–34.[CrossRef][Web of Science]

    Ledoit O and Wolf M. (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88:365–411.[CrossRef][Web of Science]

    Molinaro AM, Simon R, Pfeiffer RM. (2005) Prediction error estimation: a comparison of resampling methods. Bioinformatics 21:3301–7.[Abstract/Free Full Text]

    Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, Golub TR, Mesirov JP. (2003) Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology 10:119–42.[CrossRef][Web of Science][Medline]

    Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T. and others. (2004) A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. New England Journal of Medicine 351:2817–26.[Abstract/Free Full Text]

    Rao CR. (1973) Linear Statistical Inference and its Applications 2nd edition (John Wiley and Sons, New York).

    Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Giltnane JM. and others. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine 346:1937–47.[Abstract/Free Full Text]

    Schott JR. (1997) Matrix Analysis for Statistics(John Wiley and Sons, New York).

    Simon R, Radmacher MD, Dobbin K, Mcshane LM. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute USA 95:14–8.

    Received October 21, 2005; revised March 31, 2006; accepted for publication April 7, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BiostatisticsHome page
    P. de Valpine, H.-M. Bitter, M. P. S. Brown, and J. Heller
    A simulation-approximation approach to sample size planning for high-dimensional classification studies
    Biostat., July 1, 2009; 10(3): 424 - 435.
    [Abstract] [Full Text] [PDF]


    Home page
    BiostatisticsHome page
    M. A. van de Wiel, J. Berkhof, and W. N. van Wieringen
    Testing the prediction error difference between 2 predictors
    Biostat., July 1, 2009; 10(3): 550 - 560.
    [Abstract] [Full Text] [PDF]


    Home page
    BiostatisticsHome page
    K. K. Dobbin
    A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions
    Biostat., April 1, 2009; 10(2): 282 - 296.
    [Abstract] [Full Text] [PDF]


    Home page
    CirculationHome page
    K. V. Ballman
    Genetics and Genomics: Gene Expression Microarrays
    Circulation, October 7, 2008; 118(15): 1593 - 1597.
    [Full Text] [PDF]


    Home page
    Clin. Cancer Res.Home page
    S. L. George
    Statistical Issues in Translational Cancer Research
    Clin. Cancer Res., October 1, 2008; 14(19): 5954 - 5958.
    [Abstract] [Full Text] [PDF]


    Home page
    Clin. Cancer Res.Home page
    R. Simon
    The Use of Genomics in Clinical Trial Design
    Clin. Cancer Res., October 1, 2008; 14(19): 5984 - 5993.
    [Abstract] [Full Text] [PDF]


    Home page
    JCOHome page
    J. A. Sparano and S. Paik
    Development of the 21-Gene Assay and Its Application in Clinical Practice and Clinical Trials
    J. Clin. Oncol., February 10, 2008; 26(5): 721 - 728.
    [Abstract] [Full Text] [PDF]


    Home page
    Clin. Cancer Res.Home page
    K. K. Dobbin, Y. Zhao, and R. M. Simon
    How Large a Training Set is Needed to Develop a Classifier for Microarray Data?
    Clin. Cancer Res., January 1, 2008; 14(1): 108 - 114.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    8/1/101    most recent
    kxj036v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Dobbin, K. K.
    Right arrow Articles by Simon, R. M.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Dobbin, K. K.
    Right arrow Articles by Simon, R. M.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?