Biostatistics Advance Access originally published online on January 22, 2007
Biostatistics 2007 8(4):744-755; doi:10.1093/biostatistics/kxm002

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

A moment-based method for estimating the proportion of true null hypotheses and its application to microarray gene expression data

Yinglei Lai

Department of Statistics and Biostatistics Center, The George Washington University, Washington, DC 20052, USA

ylai{at}gwu.edu


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A MOMENT-BASED ESTIMATION...
 3. SIMULATIONS AND APPLICATIONS
 4. DISCUSSION
 REFERENCES
 
Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. When these variables are simultaneously screened by a statistical test, it is necessary to adjust for multiple hypothesis testing. The false discovery rate has been proposed and widely used to address this issue. A related problem is the estimation of the proportion of true null hypotheses. The long-standing difficulty with this problem is the identifiability of the nonparametric model. In this study, we propose a moment-based method coupled with sample splitting for estimating this proportion. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the identifiability problem and achieves its optimal performance. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that identifiability can be achieved. Theoretical aspects of the approximation error are discussed. The proposed estimation method is completely nonparametric and simple, with an explicit formula. Simulation studies show that the proposed method performs favorably compared with existing methods. Two microarray gene expression data sets are considered as applications.

Keywords: Microarray; Moment estimator; Proportion of true null hypotheses


    1. INTRODUCTION
 
Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. These data include microarray gene expression data (Hedenfalk and others, 2001), mass spectrometry data (Wu and others, 2003), and nuclear magnetic resonance spectral data (Tadesse and others, 2005). The sample sizes of these data sets are usually small because of their relatively high costs. These data sets can be collected for multiple sample groups, and a typical interest is to identify variables that significantly distinguish these groups, such as normal against disease groups. Statistically, we conduct a multisample comparison test for each of the measured variables. Because numerous variables are simultaneously screened, it is necessary to adjust for multiple hypothesis testing. The false discovery rate (FDR) has been proposed and widely used to address this issue (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). It evaluates the proportion of false positives among the identified positives. To efficiently evaluate FDRs, it is necessary to obtain an accurate estimate of the proportion of true null hypotheses $\pi_0$. For microarray data, this is equivalent to estimating the proportion of differentially expressed genes. This quantity is also crucial for the sample-size calculation in microarray experiment designs (Jung, 2005; Wang and Chen, 2004).

Many statistical methods have been proposed to estimate $\pi_0$, such as a mixture model proposed by Allison and others (2002), QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). These methods are not always efficient. They may give accurate estimation results in some cases but fail in other cases. If the distributions of test statistics or the related p-value distributions can be specified in parametric forms for both the null and the alternative hypotheses, then a model-based estimation approach, such as the mixture model proposed by Allison and others (2002) or BUM proposed by Pounds and Morris (2003), should perform favorably. However, it is generally difficult to validate these distribution assumptions, especially when sample sizes are small. For the nonparametric approach, a long-standing difficulty is model identifiability (a unique solution for the model parameters), because observations are sampled from a mixture of the distributions under the null and the alternative hypotheses. QVALUE (Storey and Tibshirani, 2003) and SPLOSH (Pounds and Cheng, 2004) first smooth the empirical p-value distribution and then estimate an upper bound of $\pi_0$. LBE proposed by Dalmasso and others (2005) estimates the upper bound of $\pi_0$ through a moment-based method. Recently, Pawitan and others (2005a,b) discussed the bias in the estimation of $\pi_0$ and the influence of sample sizes.

Moment-based estimation methods usually require no independence assumptions. Explicit formulas can generally be derived. The requirement of large sample sizes, which is necessary for the statistical efficiency of these methods, limits their usefulness in practice. However, when estimating $\pi_0$ for "omics" data, the sample size is the number of variables and is usually large. Therefore, we consider a moment-based method coupled with sample splitting for estimating $\pi_0$. By splitting the sample, we are able to understand the p-value distribution under different hypotheses by establishing the conditional independence structure of the joint p-value distribution. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the model identifiability problem and achieves its optimal performance. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that model identifiability can be achieved. The proposed method is completely nonparametric and simple, with an explicit formula.

In the following sections, we first propose the method for estimating $\pi_0$. Theoretical aspects of the approximation error are also presented. Then, we present analysis results for several simulated and experimental data sets to compare the performances of the proposed method and the other existing methods. Finally, the advantages and disadvantages of the proposed method are discussed.


    2. A MOMENT-BASED ESTIMATION METHOD
 

2.1 Motivation

A typical situation when multiple hypothesis testing is performed for omics data (microarray data, mass spectrometry data, etc.) is that numerous p values are generated. A proportion of these p values are consistent with the null hypothesis and the rest are consistent with the alternative hypothesis. Our interest in this study is to estimate $\pi_0$, the proportion of true null hypotheses. To provide an illustrative example for our proposed method, we simulate 2 independent data sets. Both data sets have the same 3000 variables and 2 sample groups with 5 samples in each group. In each data set, the first 1200 variables are independently simulated from the normal distribution Formula and Formula for the first and the second sample groups, respectively (40% nonnull), and the remaining 1800 variables are independently simulated from the normal distribution Formula for both groups (60% null). p values from the 2-sample Student's t-test are calculated for these simulated variables.

The marginal histograms in Figure 1(a) illustrate the p-value distributions based on one data set. These histograms show the identifiability problem that arises when estimating $\pi_0$. Although the null distribution is known to be uniform on $[0, 1]$, the nonnull distribution is unknown. Without imposing any parametric or other assumptions on the nonnull distribution, we cannot obtain a unique solution for $\pi_0$ if only one data set is considered.


Fig. 1. (a) Scatter plot with marginal histograms for paired p values based on 2 independently simulated data sets (see Section 2.1 for details), in which the grey and black dots represent variables consistent with the null and the alternative hypotheses, respectively, and the dashed lines represent the proportion of true null hypotheses. (b) An artificial example for the data division scheme (see Procedure 1 for details), in which grey and black colors represent the first and the second sub data sets, respectively. (c,d) Estimation results based on the microarray gene expression data sets for (c) the breast cancer and (d) the blood studies. N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the p-value histograms, the lines with different characters represent the original estimates from different methods. The boxplots are based on the bootstrap estimates from different methods.

 
However, if we have 2 independent data sets such that both data sets contain the same variables, then the pairs of p values can be obtained for all variables, and these pairs are actually conditionally independent. The scatter plot in Figure 1(a) gives an illustration. From this plot, one may realize that it is possible to solve the identifiability problem and obtain a unique solution for $\pi_0$ under certain conditions.

In the following subsections, we first introduce an estimation method when 2 independent data sets are available. When there is only one data set, we propose a procedure to generate 2 independent data sets. A bootstrap procedure for confidence intervals and some theoretical aspects are also discussed.

2.2 Two data sets

We first consider 2 independent data sets. Both data sets contain the same m variables and g sample groups. Their sample sizes may be different. Test statistics are chosen to test some specific hypotheses for each variable, such as $H_0$: the variable has the same population means in different sample groups versus $H_1$: the variable has different population means in different sample groups. (For simplicity, we skip the mathematical description of data structure and the related test statistics.) The goal is to estimate $\pi_0$, the proportion of variables consistent with the null hypothesis.

Suppose a test statistic T is chosen to test a specified hypothesis. Without loss of generality, we assume that T is continuous. For each variable, we can obtain 2 corresponding p values from the 2 data sets. For data set k, Formula, the p value Formula follows a uniform distribution Formula under the null hypothesis Formula. Under the alternative hypothesis Formula, there may be various distribution components (except Formula) for the p-value distribution. We use Formula to denote the set containing the indices representing different nonnull distribution components.

Generally, the set I may contain many different components (Formula, where Formula is the number of elements in I). We propose that the null component and the different nonnull components can be approximated by 2 components: a null component and a nonnull component. Under this approximation, there is an approximated proportion of true null hypothesis Formula, which may be different from Formula (however, if Formula, then Formula). Considering the moments of p values, we have

Formula

Formula, Formula, and Formula are the expected values of p value following the null, nonnull, and marginal distributions in data set k, Formula, respectively. Formula is the expected value of the product of Formula and Formula under the marginal joint distribution. Note that Formula because the null distribution is known as Formula. Furthermore, Formula, Formula, and Formula can be estimated from the data (using the corresponding sample moments). Then, there are only 3 unknown parameters: Formula, Formula, and Formula. With the above 3 equations, we can obtain an explicit formula

Formula
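Written out in notation assumed here for illustration (it may differ from the paper's own symbols), let $\mu_k$ be the marginal mean p value in data set $k$, $\nu_k$ the nonnull mean, $\mu_{12}$ the mean of the product of the paired p values, and $\pi_0'$ the approximated null proportion. Under the 2-component approximation, with the null mean equal to 1/2 and the 2 p values conditionally independent within each component, the 3 moment equations and their solution can be sketched as:

```latex
\begin{align*}
\mu_1    &= \tfrac{1}{2}\,\pi_0' + (1-\pi_0')\,\nu_1, \\
\mu_2    &= \tfrac{1}{2}\,\pi_0' + (1-\pi_0')\,\nu_2, \\
\mu_{12} &= \tfrac{1}{4}\,\pi_0' + (1-\pi_0')\,\nu_1\nu_2 .
\end{align*}
% Rearranging the first two equations and forming the covariance:
\begin{align*}
\tfrac{1}{2}-\mu_k &= (1-\pi_0')\bigl(\tfrac{1}{2}-\nu_k\bigr), \qquad k = 1, 2, \\
\mu_{12}-\mu_1\mu_2 &= \pi_0'(1-\pi_0')\bigl(\tfrac{1}{2}-\nu_1\bigr)\bigl(\tfrac{1}{2}-\nu_2\bigr), \\
\pi_0' &= \frac{\mu_{12}-\mu_1\mu_2}{\mu_{12}-\tfrac{1}{2}(\mu_1+\mu_2)+\tfrac{1}{4}} .
\end{align*}
```

The last line follows because the covariance divided by $(\tfrac{1}{2}-\mu_1)(\tfrac{1}{2}-\mu_2)$ equals $\pi_0'/(1-\pi_0')$. The solution requires $\nu_k \neq \tfrac{1}{2}$, that is, a nonnull mean that differs from the null mean.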

The mathematical proof is given as Lemma 1 in supplementary material available at Biostatistics online. Therefore, an estimator for Formula is proposed as

Formula (2.1)

where Formula is the calculated p value of the jth variable in data set k, Formula, Formula. Boundary constraints are imposed since the proportion $\pi_0$ must be within $[0, 1]$.
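A minimal sketch of such an estimator follows. It assumes that the closed form is the solution of the 3 moment equations described above (null mean 1/2, conditional independence of the paired p values), with population moments replaced by sample moments. The function name `moment_pi0` and the handling of a nonpositive denominator are choices made for this illustration and are not taken from the paper.

```python
from statistics import fmean


def moment_pi0(p1, p2):
    """Moment-based estimate of the proportion of true nulls from paired p values.

    p1[j] and p2[j] are the p values of variable j in data sets 1 and 2.
    Solving the three moment equations gives
        pi0 = (m12 - m1*m2) / (m12 - (m1 + m2)/2 + 1/4),
    where m1, m2 are the mean p values and m12 is the mean of the products.
    """
    if len(p1) != len(p2) or len(p1) == 0:
        raise ValueError("p1 and p2 must be non-empty and of equal length")
    m1 = fmean(p1)
    m2 = fmean(p2)
    m12 = fmean([a * b for a, b in zip(p1, p2)])
    numerator = m12 - m1 * m2
    denominator = m12 - 0.5 * (m1 + m2) + 0.25
    if denominator <= 0.0:
        # Convention of this sketch only: the ratio is undefined, so report
        # the boundary value 1 (no usable evidence of nonnull variables).
        return 1.0
    # Boundary constraints: the proportion must lie within [0, 1].
    return min(1.0, max(0.0, numerator / denominator))
```

As a quick check, 4 null pairs on the grid {0.25, 0.75} x {0.25, 0.75} combined with 4 nonnull pairs all equal to (0.1, 0.1) reproduce the population moments of a 50/50 mixture exactly, and the function returns 0.5 up to rounding.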

2.3 One data set

To estimate $\pi_0$ for a given data set, which contains m variables and g sample groups, we can first divide the data set into 2 parts and then use the method described above. The following procedure is proposed.

PROCEDURE 1

1) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
2) With a given test statistic T, calculate the p value for each part;
3) Repeat steps 1 and 2 for all variables and obtain the set of paired p values;
4) Use (2.1) to estimate $\pi_0$;
5) Repeat steps 1–4 R times and obtain R estimates of $\pi_0$;
6) Return the median of these R estimates.
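The steps of Procedure 1 can be sketched as follows. This is an illustration under stated assumptions, not the author's R code: the estimator `moment_pi0` is the closed-form solution of the moment equations as reconstructed for this sketch, and a 2-sample z-test with known unit variance (`z_test_pvalue`) stands in for the t-test so that the example needs only the Python standard library. The names `split_group`, `procedure1`, and `simulate`, and the simulation settings, are hypothetical.

```python
import random
from statistics import NormalDist, fmean, median


def moment_pi0(p1, p2):
    """Closed-form moment estimate from paired p values, clipped to [0, 1]."""
    m1, m2 = fmean(p1), fmean(p2)
    m12 = fmean([a * b for a, b in zip(p1, p2)])
    den = m12 - 0.5 * (m1 + m2) + 0.25
    if den <= 0.0:  # convention of this sketch only
        return 1.0
    return min(1.0, max(0.0, (m12 - m1 * m2) / den))


def z_test_pvalue(groups):
    """Two-sided 2-sample z-test p value, assuming known unit variance."""
    x, y = groups
    z = (fmean(x) - fmean(y)) / (1.0 / len(x) + 1.0 / len(y)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def split_group(values, rng):
    """Randomly split one group's observations into 2 near-equal parts."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def procedure1(data, pvalue_fn, R=25, seed=0):
    """data[j] is a list of sample groups (lists of observations) for variable j."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(R):                                     # step 5
        p1, p2 = [], []
        for groups in data:                                # step 3
            parts = [split_group(g, rng) for g in groups]  # step 1, per variable
            p1.append(pvalue_fn([a for a, _ in parts]))    # step 2, first part
            p2.append(pvalue_fn([b for _, b in parts]))    # step 2, second part
        estimates.append(moment_pi0(p1, p2))               # step 4
    return median(estimates)                               # step 6


def simulate(m, n, pi0, shift, seed=1):
    """Toy data: m variables, 2 groups of n; nonnull variables get a mean shift."""
    rng = random.Random(seed)
    n_null = round(m * pi0)
    data = []
    for j in range(m):
        mu = 0.0 if j < n_null else shift
        data.append([[rng.gauss(0.0, 1.0) for _ in range(n)],
                     [rng.gauss(mu, 1.0) for _ in range(n)]])
    return data
```

A call such as `procedure1(simulate(3000, 10, 0.6, 1.5), z_test_pvalue)` runs the whole procedure on toy data with a homogeneous alternative. Because `pvalue_fn` is passed in as an argument, a t-test p-value function can be substituted without changing the rest of the sketch.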

There may be complicated dependence structures among the different variables in the data set. We perform the data division step (step 1) separately for each variable to reduce the impact of these dependence structures (see Figure 1(b) for an illustration). Although the proposed method is moment based and does not require any independence assumptions, it is still necessary to reduce this impact so that the estimation can be more statistically efficient. Because different random divisions of the data set result in different estimates, we repeat steps 1–4 R times to obtain a resample distribution of estimates. (In this study, we use $R = 25$. Based on some simulation studies [data not shown], 25 is an appropriate choice for the balance between estimation accuracy and computational burden.) Then, the median is reported for robustness.

2.4 Confidence interval

Theoretically, we can apply the delta method (Casella and Berger, 2002, p. 240) to obtain formulas for the large sample variance and confidence intervals. However, these formulas may be invalid because of complicated dependence structures among the variables in omics data. Therefore, we use the bootstrap method (Efron, 1979) to obtain confidence intervals. For QVALUE, BUM, SPLOSH, and LBE, we can simply repeat sampling p values and estimating $\pi_0$ B times to obtain a resample distribution. For the proposed method, a resample distribution of estimates can be similarly obtained by the following procedure.

PROCEDURE 2

1) Run the following 3 steps R times to obtain R sets of paired p values:
a) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
b) With a given test statistic T, calculate the p value for each part;
c) Repeat steps a and b for all variables and obtain the set of paired p values.

2) Sample m integer numbers Formula with replacement from the set Formula with probability Formula.
3) Perform the following 2 steps for each set of paired p values:
  • Form a new set by selecting Formulath paired p values;
  • Use (2.1) to estimate $\pi_0$.

4) Record the median of these R estimates of $\pi_0$.
5) Return a resample distribution by repeating steps 2–4 B times.
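Steps 2–5 of Procedure 2 can be sketched as follows, taking the R sets of paired p values from step 1 as input. The estimator `moment_pi0` is the closed-form solution of the moment equations as reconstructed for this sketch. The function names and the percentile interval are assumptions made here for illustration, since the procedure specifies only that a resample distribution is returned.

```python
import random
from statistics import fmean, median


def moment_pi0(p1, p2):
    """Closed-form moment estimate from paired p values, clipped to [0, 1]."""
    m1, m2 = fmean(p1), fmean(p2)
    m12 = fmean([a * b for a, b in zip(p1, p2)])
    den = m12 - 0.5 * (m1 + m2) + 0.25
    if den <= 0.0:  # convention of this sketch only
        return 1.0
    return min(1.0, max(0.0, (m12 - m1 * m2) / den))


def bootstrap_medians(paired_sets, B=200, seed=0):
    """paired_sets holds R pairs (p1, p2) of equal-length p-value lists (step 1)."""
    rng = random.Random(seed)
    m = len(paired_sets[0][0])
    resample = []
    for _ in range(B):                                    # step 5
        # Step 2: m indices drawn uniformly with replacement.
        idx = [rng.randrange(m) for _ in range(m)]
        # Step 3: the same indices are applied to every set of paired p values.
        estimates = [
            moment_pi0([p1[i] for i in idx], [p2[i] for i in idx])
            for p1, p2 in paired_sets
        ]
        resample.append(median(estimates))                # step 4
    return resample


def percentile_interval(resample, level=0.95):
    """Percentile interval; the interval type is an assumption of this sketch."""
    ordered = sorted(resample)
    lo = int((1.0 - level) / 2.0 * (len(ordered) - 1))
    hi = len(ordered) - 1 - lo
    return ordered[lo], ordered[hi]
```

Reusing one index draw across all R sets keeps each bootstrap replicate tied to a single resampled collection of variables, which is how steps 2 and 3 are worded.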

2.5 Approximation error

The proposed estimation method is derived based on the approximated Formula. It is necessary to study the approximation error. We can show that

Formula (2.2)

where Formula is the expected value of p value following the nonnull distribution component Formula. The mathematical proof is given as Lemma 2 in supplementary material available at Biostatistics online.

The approximation will be close if Formula for all Formula and any Formula. An ideal case is that all p values from the alternative hypothesis follow only one distribution (Formula). In this situation, we have Formula for all Formula and any Formula, and therefore Formula.

The approximation will also be close if Formula for all Formula and any Formula. An ideal case is that the number of samples in each group goes to infinity, in which we have Formula for all Formula and any Formula, and therefore Formula.

To better understand the approximation error when the p values from the alternative hypothesis are heterogeneously distributed, we have the following discussion. If the number of samples in each group in the first data set is the same as the corresponding one in the second data set, then we have Formula for all Formula and

Formula

Since moment estimators are generally asymptotically efficient, Formula will be asymptotically overestimated. An upper bound can be further derived:

Formula

Based on this upper bound, the following conclusions can be drawn:

  • The approximation error depends on the "factor" (the smaller the better). It will be small if Formula. The estimation bias will be larger if Formula is closer to 0 (or if the proportion of differentially expressed genes is larger).
  • The approximation error depends on the "numerator" (the smaller the better). It will be small if Formula or, equivalently, Formula for all Formula. This case has been discussed above.
  • The approximation error depends on the "denominator" (the larger the better). For p values from the alternative hypothesis, we have Formula. Since Formula, Formula. Therefore, the approximation error will be small if Formula for all Formula. This case has also been discussed above.
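The direction of the approximation error can be checked with a short derivation. The following is a sketch derived here from the mixture model of Section 2.2, not a transcription of (2.2) or of the upper bound. The notation is assumed for illustration: nonnull component $i \in I$ has proportion $\pi_i$ with $\sum_{i \in I} \pi_i = 1 - \pi_0$, the 2 data sets have equal group sample sizes so that component $i$ has the same mean p value $\mu_i$ in both, and $d_i = \tfrac{1}{2} - \mu_i$.

```latex
\begin{align*}
\tfrac{1}{2} - \mu_k &= \sum_{i \in I} \pi_i d_i, \qquad
\mu_{12} - \tfrac{1}{2}(\mu_1 + \mu_2) + \tfrac{1}{4} = \sum_{i \in I} \pi_i d_i^2, \\
\pi_0' &= 1 - \frac{\bigl(\sum_{i \in I} \pi_i d_i\bigr)^2}{\sum_{i \in I} \pi_i d_i^2}, \\
\pi_0' - \pi_0
  &= \frac{(1-\pi_0)\sum_{i \in I} \pi_i d_i^2 - \bigl(\sum_{i \in I} \pi_i d_i\bigr)^2}
          {\sum_{i \in I} \pi_i d_i^2}
   = \frac{\tfrac{1}{2}\sum_{i \in I}\sum_{j \in I} \pi_i \pi_j (d_i - d_j)^2}
          {\sum_{i \in I} \pi_i d_i^2}
   \;\ge\; 0 .
\end{align*}
```

Under these assumptions the difference is nonnegative, it vanishes when all $d_i$ are equal (a homogeneous alternative), and its denominator grows as the nonnull mean p values move away from 1/2, which agrees with the qualitative conclusions listed above.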


    3. SIMULATIONS AND APPLICATIONS
 

3.1 Comparison with other methods

A typical application of the proposed method is to estimate the proportion of differentially expressed genes in a given microarray gene expression data set. This proportion is actually $1 - \pi_0$. Therefore, it is equivalent to estimating $\pi_0$, which is the proportion of nondifferentially expressed genes. Many statistical methods have been proposed to estimate $\pi_0$, such as QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). In this section, we compare the proposed method with these existing statistical methods through simulations and applications. The simulations are conducted based on a microarray gene expression data set for a breast cancer study. We use the 2-sample Student's t-test for hypothesis testing. For the experimental data set, we observe from Quantile–Quantile plots that the p values given by the t-distribution and the permutation procedure are consistent (data not shown). Therefore, we choose to use the t-distribution to assess p values because it gives unique results.

Statistical efficiencies can be compared in simulation studies since we know the truth. With a given $\pi_0$, we repeat simulation and estimation procedures $B = 100$ times. Note that the proposed method requires much more computation time than these existing methods because of its repetition of random data division ($R = 25$). Although $B = 100$ is a relatively small number, it is adequate to compare the performances of different methods. The root mean square error (RMSE), Bias, and standard deviation (SD) are used to compare different methods (estimators) including the proposed one. For an estimator $\hat{\pi}_0$, let $\hat{\pi}_0^{(i)}$ be the calculated estimate in the ith simulation. The Bias, SD, and RMSE are defined as: Formula and Formula
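The 3 summary measures can be computed as in the following sketch, which uses the standard textbook definitions. The divisor B − 1 in the SD and the divisor B in the RMSE are assumptions of this example, as is the function name `summarize`.

```python
from math import sqrt
from statistics import fmean, stdev


def summarize(estimates, truth):
    """Return (bias, sd, rmse) of repeated estimates against a known true value."""
    bias = fmean(estimates) - truth          # mean estimate minus the truth
    sd = stdev(estimates)                    # sample SD, divisor B - 1 (assumption)
    rmse = sqrt(fmean([(e - truth) ** 2 for e in estimates]))
    return bias, sd, rmse
```

For example, `summarize([0.5, 0.7], 0.5)` gives a bias of 0.1 and an SD and RMSE that both equal the square root of 0.02.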

3.2 Simulation studies

Configurations.

In general, there are complicated dependence structures in a microarray gene expression data set. Therefore, we conduct the following simulation studies with covariance matrices constructed based on a microarray gene expression data set (the first data set in Section 3.3). A gene expression data set is simulated with Formula genes and 2 sample groups with sample sizes Formula (simulation studies 1 and 2) or 50 (simulation study 3). Data are simulated from normal distributions with an assumed proportion Formula of differentially expressed genes. Genes are grouped into 30 blocks with 100 genes in each block. For each block, we randomly select 100 genes from the experimental data set and calculate the correlation matrices Formula and Formula in the first and the second groups, respectively. For blocks of differentially expressed genes, we simulate data from the normal distributions Formula and Formula for the first and the second sample groups, respectively. For the remaining blocks, we simulate data from the normal distributions Formula and Formula for the first and the second sample groups, respectively. Here, 0 and Formula are (random) vectors. For each configuration, we repeat simulation and estimation procedures Formula times. Different statistical methods are used to estimate Formula. We run QVALUE, BUM, SPLOSH, and LBE with their default settings. For the proposed method, we divide each sample group into 2 parts with equal sample sizes: Formula for simulation studies 1 and 2, Formula for simulation study 3. The results are summarized in Figure 2 in which RMSE, Bias, and SD are compared. We also compare boxplots of the estimation results from different methods when Formula.


Fig. 2. Estimation results based on the simulation studies 1 (left panel), 2 (middle panel), and 3 (right panel). The RMSEs (a–c), Biases (d–f), and SDs (g–i) (y-axes) against the true proportions (x-axes) are plotted. The solid lines with black dots, solid, dashed, dotted, and dot-dashed lines represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the boxplots (j–l) of estimated proportions, N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively, and the dashed lines represent the true value.

 
Results.

The first simulation study is to consider the situation that there is only one p-value distribution component for differentially expressed genes. We fix Formula and let Formula. Generally, the sample size of a microarray data set is relatively small. Therefore, we set Formula. As shown in Figure 2, for Formula around 0.2, only BUM gives smaller RMSEs than the proposed method. For other values of Formula, the proposed method gives the lowest RMSEs. Note that the behavior of BUM is not stable. It gives the highest RMSEs when Formula or Formula. For different values of Formula, the proposed method consistently gives relatively low biases and the second lowest SDs.

The second simulation study considers a general situation in which p values of differentially expressed genes may follow different distribution components. We randomly sample Formula from a uniform distribution Formula and let Formula and Formula. As shown in Figure 2, for Formula, the proposed method gives the lowest RMSEs. For Formula around 0.2, only BUM gives lower RMSEs than the proposed method. Note again that the behavior of BUM is not stable. It gives the highest RMSEs when Formula. For Formula around 0.1, QVALUE gives the lowest RMSEs, and the proposed method gives slightly higher RMSEs. For different values of Formula, the proposed method consistently gives relatively low biases and the second lowest SDs.

The third simulation study is to consider the situation that the sample size of a microarray data set is relatively large. Therefore, we set Formula. We still consider a general situation that p values of differentially expressed genes may follow different distribution components. We randomly sample Formula from a uniform distribution Formula and let Formula. As shown in Figure 2, the proposed method always gives the lowest RMSEs and biases and the second lowest SDs for different values of Formula.

Other simulations.

Simulations for other configurations are also considered. Generally, the proposed method can give comparably favorable performances. However, if the sample size is very small (e.g. Formula), the proposed method will give poor performances. This is not surprising. If the sample size of a given data set is very small, then the sample size of a divided subset will be even smaller, which significantly reduces the power to detect differential expressions. This fact has also been discussed by Pawitan and others (2005a,b). Therefore, while enjoying the model identifiability through data division, we lose certain statistical efficiency in estimations.

3.3 Applications

The above theoretical and simulation studies show the favorable performances of the proposed method especially when (i) the sample size is relatively large, (ii) the p values from the alternative hypothesis are homogeneously distributed, or (iii) the proportion of differentially expressed genes is relatively small. In practice, it is difficult to find a microarray data set for the second or the third situation. However, there are many microarray data sets with relatively large sample sizes.

We consider 2 data sets for applications. The first one is the famous microarray gene expression data set for a breast cancer study. Hedenfalk and others (2001) used microarrays to compare 3226 gene expression profiles between 7 BRCA1 samples and 8 BRCA2 samples. The data set is publicly available at http://research.nhgri.nih.gov/microarray/NEJM_Supplement. A total of 56 genes were filtered out, because they had one or more expression measurements exceeding 20, which were considered not trustworthy (Storey and Tibshirani, 2003). Therefore, 3170 gene expression measurements for 15 samples are used in this study.

The second data set has a relatively large sample size. Wiestner and others (2003) used lymphochips to compare 12 447 gene expression profiles between 79 Ig-mutated and 28 Ig-unmutated samples with chronic lymphocytic leukemia. The data set is publicly available at http://llmpp.nih.gov/cll/. We use the k-nearest neighbors method (R package impute; Troyanskaya and others, 2001) to impute the missing values in the data set.

We use different statistical methods to estimate Formula. QVALUE, BUM, SPLOSH, and LBE are run with their default settings. For the proposed method, we divide the data set into 2 subsets: Formula for the first data set and Formula for the second data set. We bootstrap Formula times to obtain the resample distributions of estimates (see Section 2 for details). Since the p values from the null hypothesis follow a uniform distribution Formula, Formula is expected to be under the curve of underlying empirical p-value distribution. (Formula, where Formula, Formula, and f are the null, nonnull, and marginal distributions of p value, respectively.)
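The reason the estimate should lie under the p-value density can be written out in one line. The notation is assumed here for illustration: $f_0$, $f_1$, and $f$ denote the null, nonnull, and marginal densities of the p value, as in the parenthetical remark above.

```latex
\[
f(p) = \pi_0 f_0(p) + (1-\pi_0) f_1(p) \;\ge\; \pi_0 f_0(p) = \pi_0,
\qquad 0 \le p \le 1,
\]
```

since $f_0 \equiv 1$ for the uniform null distribution and $f_1 \ge 0$. An estimate that rises above the lowest part of the p-value histogram is therefore likely to overestimate the null proportion.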

For the first data set, Figure 1(c) shows a histogram of p values and boxplots to compare estimates from different methods. Only the proposed method and BUM give estimates under the histogram. The proposed method gives the smallest estimated $\pi_0$. Among these 5 methods, BUM gives a relatively small variance and the other 4 give comparatively high variances. However, from the simulation studies (e.g. boxplots in Figure 2), some confidence intervals given by BUM do not contain the true value and are not meaningful. Therefore, the proposed method may give more reliable estimation results.

For the second data set, Figure 1(d) shows a histogram of p values and boxplots to compare estimates from different methods. Not only does the proposed method give the smallest estimates, but its whole boxplot also lies under the histogram. Furthermore, its variance is relatively small among these 5 methods.

In the above simulation studies and applications, the variances of BUM are always the lowest among these 5 estimation methods. This comes from the simple model of BUM: the mixture of a beta distribution and a uniform distribution. However, it is difficult to validate this model in practice.


    4. DISCUSSION
 
In the problem of estimating the proportion of true null hypotheses, the number of variables is the sample size of the study. Microarrays and other high-throughput technologies enable us to collect measurements for a large number of variables. With these data, moment-based estimation methods can be considered, because they are generally asymptotically efficient. In this study, we proposed a moment-based estimation method coupled with sample splitting and discussed its theoretical properties. The simulation studies and the applications to microarray data showed that the proposed method performed favorably compared with the other existing methods. Since the t-test requires at least 2 samples in each group, the proposed method cannot be applied when a group sample size is less than 4. In such a situation, other statistical methods, such as QVALUE, should be considered. From the above analyses, we observe that each method achieves its optimal performance only in certain situations. New methods for estimating $\pi_0$ are being proposed (Langaas and others, 2005). More comprehensive reviews and systematic comparisons of different $\pi_0$-estimation methods are needed.

We recently proposed a likelihood-based method coupled with an EM algorithm for estimating $\pi_0$ (Lai, 2006). Random data division was also used to achieve the model identifiability. Through simulations and applications to microarray gene expression data, we showed the favorable performances of this method (Lai, 2006). However, there are 2 disadvantages: (i) The method is likelihood based and assumes independence among different genes, which is unlikely to be true because genes interact with each other during cellular processes. (ii) The method uses an EM algorithm, which may provide unreliable estimation when the likelihood function is not regular. The moment-based method proposed in this study requires no independence assumption. In addition to its favorable performances, it is completely nonparametric and simple with an explicit formula to give a unique solution.

A future research topic is to generalize the proposed method so that estimation efficiencies can be further improved. As shown in the simulation studies, the estimation variance tends to increase when the true proportion increases (Figure 2). In the second simulation study, with a heterogeneous alternative, there is a considerable estimation bias when the true proportion is relatively small (Figure 2). It is necessary to pursue both theoretical and simulation studies so that more efficient estimation methods can be developed.


    ACKNOWLEDGMENTS
 
I am grateful to Prof. Tapan Nayak, the editors, associate editors, and the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by a start-up fund from the George Washington University and the National Institutes of Health grant DK-75004. The R code is available at http://home.gwu.edu/~ylai/research/RDPM. Conflict of Interest: None declared.


    REFERENCES

    Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee C-K, Prolla TA, Weindruch R. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis (2002) 39:1–20.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (1995) 57:289–300.

    Casella G, Berger RL. Statistical Inference (2002) 2nd edition. Pacific Grove, CA: Duxbury.

    Dalmasso C, Broët P, Moreau T. A simple procedure for estimating the false discovery rate. Bioinformatics (2005) 21:660–668.

    Efron B. Bootstrap methods: another look at the jackknife. Annals of Statistics (1979) 7:1–26.

    Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, and others. Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine (2001) 344:539–548.

    Jung S-H. Sample size for FDR-control in microarray data analysis. Bioinformatics (2005) 21:3097–3104.

    Lai Y. A statistical method for estimating the proportion of differentially expressed genes. Computational Biology and Chemistry (2006) 30:193–202.

    Langaas M, Lindqvist BH, Ferkingstad E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B (2005) 67:555–572.

    Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics (2005a) 21:3017–3024.

    Pawitan Y, Murthy KRK, Michiels S, Ploner A. Bias in the estimation of false discovery rate in microarray studies. Bioinformatics (2005b) 20:3865–3872.

    Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics (2004) 20:1737–1745.

    Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics (2003) 19:1236–1242.

    Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America (2003) 100:9440–9445.

    Tadesse MG, Ibrahim JG, Vannucci M, Gentleman R. Wavelet thresholding with Bayesian false discovery rate control. Biometrics (2005) 61:25–35.

    Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.

    Wang S-J, Chen JJ. Sample size for identifying differentially expressed genes in microarray experiments. Journal of Computational Biology (2004) 11:714–726.

    Wiestner A, Rosenwald A, Barry TS, Wright G, Davis RE, Henrickson SE, Zhao H, Ibbotson RE, Orchard JA, Davis Z, and others. ZAP-70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood (2003) 101:4944–4951.

    Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics (2003) 19:1636–1643.

    Received August 4, 2006; revised January 5, 2007; accepted for publication January 17, 2007.

