

Biostatistics Advance Access originally published online on December 6, 2005
Biostatistics 2006 7(2):167-181; doi:10.1093/biostatistics/kxj009

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

A tail strength measure for assessing the overall univariate significance in a dataset

Jonathan Taylor

Department of Statistics, Stanford University, Stanford, CA 94305, USA jtaylor@stat.stanford.edu

Robert Tibshirani*

Department of Health Research and Policy and Department of Statistics, Stanford University, Stanford, CA 94305, USA tibs@stanford.edu

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. TAIL STRENGTH
 3. ESTIMATES OF VARIANCE
 4. EXAMPLES
 5. DISCUSSION
 APPENDIX
 REFERENCES
 
We propose an overall measure of significance for a set of hypothesis tests. The ‘tail strength’ is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.

Keywords: Multiple testing; p-value


    1. INTRODUCTION
 
Dave et al. (2004) published a study correlating the expression of 44 928 clones from microarrays with patient survival in follicular lymphoma (FL). The authors derived a multivariate Cox model for the data, and reported that it was highly predictive in an independent test set. Tibshirani (2005) reanalyzed these data, casting considerable doubt on the reproducibility of the findings.

The left panel of Figure 1 shows the ordered Cox scores $T_{(k)}$ for each gene (see, e.g. Kalbfleisch and Prentice, 1980). These are the partial likelihood score statistics, and are plotted against the expected (null) order statistics $E[T_{(k)}]$, where the expectation is estimated by repeated permutations of the patient labels. We see that there is little deviation from the expected values. The right panel shows a similar plot for the leukemia data of Golub et al. (1999). This problem compares two disease classes, so the scores $T_{(k)}$ are the ordered two-sample t-statistics. There are many more large values than we would expect to see by chance. Perhaps this is why the Golub dataset has become the most common testing ground for authors proposing new methods for microarray analysis.


 
Fig. 1. Test statistics (one per gene) from the FL data (left) and leukemia data (right). Each plot shows the observed test statistics versus the expected order statistics under the null hypothesis.

 
In the reanalysis of the Dave et al. (2004) data, it became clear that if there was predictive power in this dataset, it was very subtle. As seen in the left panel of Figure 1, very few genes seem to exhibit univariate effects. From this experience, we felt it would be useful to have a general quantitative measure of the univariate strength of a large set of predictors, that is, the marginal effects. Such a measure could be routinely reported as an indication of the predictive strength in a dataset. Of course, such a measure would not capture any multivariate or interactive effects that might be present.

In this paper, we propose a measure of the overall statistical significance in testing the global hypothesis of no gene effects. We call it the ‘tail strength’. We derive its asymptotic distribution and illustrate its use on a number of real datasets. We also relate our measure to the false discovery rate (FDR) and the area under the ROC curve.


    2. TAIL STRENGTH
 

2.1 Definition

We first define our measure based on a set of p-values. Later, we give an equivalent form in terms of test statistics. We assume that we have null hypotheses $H_{0i}$ and associated p-values $p_i$, $i = 1, 2, \ldots, m$. The global null hypothesis states that $H_{0i}$ holds for all $i$, and we assume that under this hypothesis, the $p_i$ are i.i.d. $U[0, 1]$ random variables.

Let the ordered p-values be $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$. We define the tail strength as

$$\mathrm{TS}(p_1, \ldots, p_m) = \frac{1}{m} \sum_{k=1}^{m} \left(1 - p_{(k)}\,\frac{m+1}{k}\right). \tag{2.1}$$

Now under the global null hypothesis, each $p_k$ has a uniform distribution, so that the expected value of the $k$th smallest p-value $p_{(k)}$ is $k/(m + 1)$ and TS has expectation zero. The tail strength measures the deviation of each p-value from its expected value: $p_{(k)} < k/(m + 1)$ causes the term $1 - p_{(k)}(m+1)/k$ to be positive. Thus, large positive values of TS indicate evidence against the global null hypothesis, that is, they indicate that there are more small p-values than we would expect by chance. Note also that the particular form of TS gives more weight to the lowest p-values, so that it is most sensitive to deviations in the tail. In contrast, for example, the sum of successive differences in p-values would not have this property.
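
As a minimal sketch of definition (2.1), the tail strength can be computed directly from a list of p-values. The function name `tail_strength` and the toy inputs are illustrative choices, not part of the original analysis.

```python
def tail_strength(pvalues):
    """Tail strength of a collection of p-values, following definition (2.1)."""
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(pvalues)  # p_(1) <= ... <= p_(m)
    # k-th term is 1 - p_(k) * (m + 1) / k, with k counted from 1
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


# Toy input: p-values sitting exactly at their null expectations k / (m + 1),
# so every term of the sum vanishes (up to floating-point rounding).
m = 9
at_expectation = [k / (m + 1) for k in range(1, m + 1)]
print(tail_strength(at_expectation))

# Halving every p-value makes each term equal to 1/2, hence a positive TS.
print(tail_strength([p / 2 for p in at_expectation]))
```

The second call illustrates the sign convention: p-values smaller than their null expectations push TS above zero.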

There is a Bayesian model for this setting that we will find useful in our later analysis. Given a prior ‘null’ probability $\pi_0$ and an alternative distribution $F_1$, the Bayesian model for observing $m$ p-values is the following: for $1 \le i \le m$ independently

  1. generate a Bernoulli indicator $H_{0,i}$ with $\mathrm{Prob}(H_{0,i} = 0) = \pi_0$;
  2. if $H_{0,i} = 0$, generate $p_i \sim \mathrm{Unif}(0, 1)$, else generate $p_i \sim F_1$.
Under this model, it is easy to see that the p-values are unconditionally i.i.d. with distribution

$$F(x) = \pi_0\, x + (1 - \pi_0)\, F_1(x), \qquad 0 \le x \le 1.$$

Further, without any constraint on $F_1$, the parameter $\pi_0$ is obviously unidentifiable.
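
The two-step generative model above can be sketched in a few lines. The alternative used here, F1(x) = sqrt(x) (obtained by squaring a uniform draw), and the names `simulate_pvalues` and `draw_from_f1` are hypothetical choices made only for illustration; the paper does not specify F1.

```python
import random


def simulate_pvalues(m, pi0, draw_alternative, rng):
    """Draw m p-values: Unif(0, 1) with probability pi0, otherwise from F1."""
    pvalues = []
    n_null = 0
    for _ in range(m):
        if rng.random() < pi0:          # null hypothesis holds
            pvalues.append(rng.random())
            n_null += 1
        else:                           # alternative: draw from F1
            pvalues.append(draw_alternative(rng))
    return pvalues, n_null


def draw_from_f1(rng):
    # Hypothetical alternative: if U ~ Unif(0, 1) then U**2 has cdf sqrt(x),
    # which concentrates mass near zero.
    return rng.random() ** 2


rng = random.Random(0)
pvals, n_null = simulate_pvalues(10000, 0.8, draw_from_f1, rng)

# Compare the empirical cdf with the mixture pi0 * x + (1 - pi0) * F1(x) at x = 0.25
x = 0.25
empirical = sum(p <= x for p in pvals) / len(pvals)
theoretical = 0.8 * x + 0.2 * x ** 0.5
print(empirical, theoretical)
```

Only the mixture F is observable from the p-values; the same F could arise from a different pair (π0, F1), which is the identifiability issue noted in the text.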

The most common application of tail strength is likely to be in assessing the univariate (marginal) effect of a set of predictors. Suppose we have predictors $x_{ij}$, $j = 1, 2, \ldots, m$, and response variable $y_i$, for observations $i = 1, 2, \ldots, N$. Letting $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ and $\mathbf{x}_j = (x_{1j}, \ldots, x_{Nj})$, we form a test statistic for each predictor:

$$T_j = T(\mathbf{x}_j, \mathbf{y}), \qquad j = 1, 2, \ldots, m. \tag{2.2}$$

Thus, $T_j$ measures the univariate effect of the $j$th predictor on the response. The two-sample t-statistic is a simple example.

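There is an equivalent form of tail strength, expressed in terms of the $T_j$. Suppose we have a null distribution $\mathrm{Prob}_0$ for these statistics, derived from a set of permutations or asymptotic theory. This yields a set of p-values

$$p_{(k)} = \mathrm{Prob}_0\bigl(|T| \ge |T_{(k)}|\bigr), \tag{2.3}$$

where $|T_{(1)}| \ge |T_{(2)}| \ge \cdots \ge |T_{(m)}|$ are the test statistics ordered by decreasing absolute value, so that the largest statistic yields the smallest p-value. Then

$$\mathrm{TS} = \frac{1}{m} \sum_{k=1}^{m} \frac{k - (m+1)\,\mathrm{Prob}_0\bigl(|T| \ge |T_{(k)}|\bigr)}{k}. \tag{2.4}$$

Each term is the proportion by which the number of test statistics at least as large as $|T_{(k)}|$, namely $k$, exceeds the expected number when testing at value $T_{(k)}$.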
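
A rough sketch of this route from test statistics to tail strength follows, assuming the null distribution is approximated by pooling statistics from permutations (one of the two options mentioned above). The helper names and the toy numbers are hypothetical.

```python
import bisect


def pooled_null_pvalues(observed, null_stats):
    """Two-sided p-values: fraction of pooled null statistics with |T*| >= |T_j|."""
    null_abs = sorted(abs(t) for t in null_stats)
    n = len(null_abs)
    pvalues = []
    for t in observed:
        # bisect_left gives the index of the first null value >= |t|,
        # so n minus that index counts null values at least as extreme.
        n_at_least = n - bisect.bisect_left(null_abs, abs(t))
        pvalues.append(n_at_least / n)
    return pvalues


def tail_strength(pvalues):
    """Tail strength (2.1) computed from p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


# Toy example: a handful of observed statistics and a small pooled null sample
observed = [3.1, -0.4, 2.6, 0.9]
null_stats = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, -0.5, 0.5, 1.5, -1.5]
pvals = pooled_null_pvalues(observed, null_stats)
print(pvals, tail_strength(pvals))
```

With this simple estimator a p-value can be exactly zero when an observed statistic exceeds every null value; whether to add a small-sample correction is a design choice left open here.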

For the FL and leukemia datasets, TS equals –0.027 and 0.655, respectively. Hence, the FL p-values are slightly larger than we would expect under the uniform distribution. In contrast, the leukemia genes are highly significant. The value 0.655 for the leukemia data indicates that there are (on average) 65.5% more significant test statistics than we would expect by chance.

The left panel of Figure 2 shows the tail strength measure applied to some simulated microarray data. There are 1000 genes (features) and 20 samples; all measurements are standard $N(0, 1)$, except for the first 100 genes in the second 10 samples, which were generated as $N(\Delta, 1)$. The plot shows tail strength divided by its asymptotic standard error $1/\sqrt{m}$ from 100 realizations at each of the seven different values of $\Delta$. We see that TS has the desired behavior: it is centered around zero when the overall null hypothesis holds ($\Delta = 0$) and then becomes more and more positive as $\Delta$ increases.


 
Fig. 2. Simulated microarray data: 1000 genes and 20 samples, and we wish to compare the first 10 samples to the second 10. In the left panel, all measurements were generated as standard Gaussian, except for the first 100 genes in samples 11–20, which have mean $\Delta$. Shown are 100 realizations of tail strength divided by its asymptotic standard error $1/\sqrt{m}$ at each value of $\Delta$. A horizontal line is drawn at the upper 95% point 1.645. In the right panel, the genes were not independent but were generated with pairwise correlation 0.5 before $\Delta$ was added. Hence, the resulting p-values are exchangeable (but not independent) under the null hypothesis.

 
In the right panel, the genes are not independent but have pairwise correlation 0.5 before $\Delta$ is added. Hence, the resulting p-values are exchangeable (but not independent) under the null hypothesis. We see that the expectation of tail strength behaves as in the independent case; however, its variance seems to be inflated. We address this issue in Section 3.
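
The independent-gene simulation of the left panel can be imitated with a short sketch. Two simplifications are assumptions of this example rather than the paper's procedure: each gene is tested with a z-statistic that treats the unit variance as known (instead of a two-sample t-statistic), and tail strength is standardized by the null approximation 1/sqrt(m). Only a single realization is drawn per value of the shift.

```python
import math
import random


def tail_strength(pvalues):
    """Tail strength (2.1) computed from p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


def two_sided_normal_pvalue(z):
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))


def simulate_standardized_ts(delta, m=1000, n_per_group=10, n_shifted=100, seed=0):
    """One realization of TS * sqrt(m) for the two-group design described above."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 / n_per_group)  # sd of the difference in group means
    pvalues = []
    for gene in range(m):
        shift = delta if gene < n_shifted else 0.0
        group1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group2 = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]
        z = (sum(group2) - sum(group1)) / n_per_group / scale
        pvalues.append(two_sided_normal_pvalue(z))
    return tail_strength(pvalues) * math.sqrt(m)


for delta in (0.0, 2.0):
    print(delta, simulate_standardized_ts(delta))
```

Under these assumptions the standardized value should behave roughly like a standard normal draw when the shift is zero and move well above 1.645 once the shift is large.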

2.2 Relationship to FDR

The quantity TS is closely related to the FDR (Benjamini and Hochberg, 1995; Efron et al., 2001; Storey, 2002; Efron and Tibshirani, 2002; Genovese and Wasserman, 2002).

We first review the FDR. Table 1 displays the various outcomes when testing m null hypotheses H0i, 1 ≤ i ≤ m. The quantity V is the number of false positives (type I errors), while R is the total number of hypotheses rejected, which depends on the testing procedure.


 
Table 1. Possible outcomes from m hypothesis tests

 
FDR (Benjamini and Hochberg, 1995) is defined as the expected value of $(V/R) \times 1\{R > 0\}$. If the decision rule is a thresholding rule, then one can define the following plug-in estimate of FDR at p-value $x$ (Storey, 2002; Storey et al., 2004)

$$\widehat{\mathrm{FDR}}(x) = \frac{x}{\hat F_m(x)}, \tag{2.5}$$

where

$$\hat F_m(x) = \frac{1}{m}\,\#\{i : p_i \le x\} \tag{2.6}$$

is the empirical cumulative distribution function of the p-values $p_1, \ldots, p_m$.

In Efron and Tibshirani (2002) and Storey (2002), it is shown that under the Bayesian model of Section 2.1,

Formula 6

For extensions to large samples, see Storey et al. (2004).

Finally, we can derive the relationship between tail strength and FDR. Looking at the plug-in estimate (2.5), it is easy to see that

$$\mathrm{TS} = \frac{1}{m} \sum_{k=1}^{m} \left(1 - \frac{m+1}{m}\,\widehat{\mathrm{FDR}}(p_{(k)})\right). \tag{2.7}$$

Figure 3 gives a graphical interpretation of the simple relationship (2.7): TS measures a weighted area under the curve $1 - \widehat{\mathrm{FDR}}(x)$, evaluating this function at the observed p-values $p_k$. Hence, the faster $1 - \widehat{\mathrm{FDR}}(x)$ goes to one (that is, the faster $\widehat{\mathrm{FDR}}(x)$ drops to zero) as $x \downarrow 0$, the higher the TS. Further, the tighter the p-values are bunched up near 0, the larger the TS.
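
The link between tail strength and the plug-in FDR can be checked numerically. The sketch below assumes the plug-in estimate takes the form x divided by the empirical cdf at x (that is, with the proportion of true nulls set to one) and that the p-values are distinct, so that the empirical cdf at the k-th smallest p-value equals k/m. All function names are illustrative.

```python
def tail_strength(pvalues):
    """Tail strength (2.1) computed from p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


def fdr_plugin(pvalues, x):
    """Assumed plug-in FDR estimate at threshold x: x / F_m(x), with pi0 = 1."""
    m = len(pvalues)
    n_rejected = sum(p <= x for p in pvalues)
    return x * m / max(n_rejected, 1)  # guard against an empty rejection set


def tail_strength_via_fdr(pvalues):
    """Average of 1 - (m + 1) / m * FDR_hat(p_(k)) over the observed p-values."""
    m = len(pvalues)
    return sum(1.0 - (m + 1) / m * fdr_plugin(pvalues, p) for p in pvalues) / m


pvals = [0.03, 0.20, 0.50, 0.01, 0.90]
ts_direct = tail_strength(pvals)
ts_fdr = tail_strength_via_fdr(pvals)
print(ts_direct, ts_fdr)
```

The factor (m + 1)/m appears because the definition compares each ordered p-value with k/(m + 1) rather than k/m; for large m it is negligible.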


 
Fig. 3. A graphical description of tail strength. It measures the area under the curve $1 - \widehat{\mathrm{FDR}}(x)$, evaluating this function at the observed p-values $p_k$ (x values).

 
Another way of seeing that TS can be phrased directly in terms of the test statistics (as in (2.4)) comes from the fact that the expression $\widehat{\mathrm{FDR}}(p_{(k)})$ in (2.7) can be computed on the scale of the test statistics or the p-values. Therefore, TS is unchanged under any one-to-one transformation of the p-values, and is not tied to the choice of test statistic used to test each null hypothesis $H_{0i}$, $1 \le i \le m$.

When the p-values are i.i.d. with distribution $F$, the following result, proven in the Appendix, is therefore not surprising:

$$\mathrm{TS} \to \int_0^1 \bigl(1 - \mathrm{FDR}(x)\bigr)\, dF(x) \quad \text{as } m \to \infty, \tag{2.8}$$

where

$$\mathrm{FDR}(x) = \frac{x}{F(x)} \tag{2.9}$$

is the (asymptotic) population FDR, with the unknown proportion of true null hypotheses $\pi_0$ set to one. In other words, the tail strength statistic estimates the average amount by which the true FDR function falls below its null value of one, with the average computed with respect to the true distribution of p-values.
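
A worked case may help fix ideas. Take the hypothetical family F(x) = x^a on [0, 1] with 0 < a <= 1 (so that F(x) >= x), and take the population FDR with the null proportion set to one to be x/F(x), as described above. Averaging 1 - x/F(x) with respect to F gives the integral of (1 - x^(1-a)) a x^(a-1) over [0, 1], which equals 1 - a. The Monte Carlo sketch below compares this value with the tail strength of simulated p-values; names and parameter values are illustrative.

```python
import random


def tail_strength(pvalues):
    """Tail strength (2.1) computed from p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


def simulate_power_pvalues(m, a, rng):
    # If U ~ Unif(0, 1), then U ** (1 / a) has cdf F(x) = x ** a on [0, 1].
    return [rng.random() ** (1.0 / a) for _ in range(m)]


rng = random.Random(1)
a = 0.5
ts = tail_strength(simulate_power_pvalues(20000, a, rng))
print(ts, 1 - a)  # simulated tail strength next to the limiting value 1 - a
```

Setting a = 1 recovers the uniform case, for which the limiting value is zero.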

If $F$ is stochastically dominated by Unif(0, 1), then TS is asymptotically normal with variance

$$\mathrm{Var}(\mathrm{TS}) \approx \frac{C(F)}{m}, \tag{2.10}$$

where $C(F) \le 1$ if $F(x) \ge x$ for each $x$ in $[0, 1]$. Hence, we have the asymptotic approximation

$$\mathrm{Var}(\mathrm{TS}) \approx \frac{1}{m}. \tag{2.11}$$

In Section 3, we examine the accuracy of this approximation.
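
Under the global null with independent p-values, the approximation can be probed with a small Monte Carlo experiment. The sketch reads the approximation as Var(TS) ≈ 1/m, which is an assumption of this example; the replication counts are arbitrary.

```python
import math
import random
import statistics


def tail_strength(pvalues):
    """Tail strength (2.1) computed from p-values."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m


def null_sd_ratio(m=500, n_rep=400, seed=0):
    """Empirical sd of TS over null replicates, divided by 1 / sqrt(m)."""
    rng = random.Random(seed)
    values = [tail_strength([rng.random() for _ in range(m)]) for _ in range(n_rep)]
    return statistics.stdev(values) * math.sqrt(m)


ratio = null_sd_ratio()
print(ratio)  # values near 1 support the 1 / sqrt(m) standard error
```

This check covers only the independent case; as Section 3 discusses, dependence among genes can inflate the variance well beyond this value.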

Note that the quantity $m_1 = m - m_0$ measures how many non-null genes there are in the dataset. Various authors have studied this as a measure of univariate strength (cf. Benjamini and Hochberg, 2000; Storey et al., 2004). However, this does not really measure how different the non-null p-values are from Unif(0, 1). Further, in the Bayesian model described earlier, this parameter is not identifiable without some constraint on the alternatives. In contrast, tail strength is identifiable and measures how far the non-null p-values are from Unif(0, 1).

The asymptotic behavior of tail strength is summarized in the following theorem.

THEOREM 2.1 Under the Bayesian model of Storey (2002), suppose that $F(x) \ge x$. Then, if $F$ has density $f$, as $m \to \infty$, TS is asymptotically normally distributed with mean

Formula 11

and variance

Formula 11

The proof appears in the Appendix.

2.3 Relationship of TS to area under the ROC curve

In the diagnostic testing literature (cf. Hanley and McNeil, 1982; Pepe, 2003), the ROC curve is used to discriminate between two samples. Such a curve can also be constructed to compare a sample of test statistics to a given null distribution. A commonly used summary of the ROC curve is the area under the ROC curve. In the two-sample setting, the area is essentially equivalent to the Mann–Whitney test statistic (Hanley and McNeil, 1982). This measure places equal weight on departures from Unif(0, 1) without focusing on the most interesting region, the ‘tail’ of the test statistics. One solution is to only look at the area under the ROC curve up to some false positive level $t_0$ (Pepe, 2003), but the choice of $t_0$ is somewhat arbitrary. Here we show that the tail strength measure is related to a weighted area under such an ROC curve, weighted to accentuate the tail of the test statistics.

It is well known (Hanley and McNeil, 1982) that for two independent samples Formula 11 and Formula 11 the expected area under the empirical ROC curve

Formula 11

is

Formula 11

The measure TS is also closely related to the area under the ROC curve (Pepe, 2003; Hanley and McNeil, 1982). Let

Formula 11

be the population ROC curve reflected along the line y = x. Suppose that X ~ Unif(0, 1) and Y ~ F, then

Formula 11

and this quantity is $1/2$ if $F = \mathrm{Unif}(0, 1)$. This suggests that the area under the (ROC) curve

Formula 12(2.12)

is a measure of departure from uniformity. It is positive whenever F(x) ≤ x or whenever the p-values are stochastically dominated by Unif(0, 1). This quantity places equal weight on the differences for all values of x with no focus on the tail. One way to adjust it is to insert a weight into the expression (2.12)

Formula 13(2.13)

The choice $w(x) = x^{-1}$ corresponds to TS in the asymptotic setting. In finite samples, the above integral is of course replaced by a Riemann sum.

The partial AUC proposed by Pepe (2003) also attempts to accentuate the tail

Formula 13

Though the axes are reversed, this is equivalent to choosing a weight

Formula 13

while (2.12) corresponds to the constant weight $w(x) = 1$. Setting $w(x) = x^{-1}$, which puts more weight on the tail, yields TS.


    3. ESTIMATES OF VARIANCE
 
To use the tail strength measure in practice, we need a reasonably accurate estimate of its variance. Formula (2.11) is very simple, but assumes both independence of the genes and the global null hypothesis.

To assess the accuracy of formula (2.11), we did a simulation experiment. To ensure that the correlation structure was realistic, we used the gene expression data from Rieger et al. (2004), consisting of 12 625 genes and 58 samples in two classes. We constructed three scenarios: in the null scenario, datasets were created by permuting the sample labels (leaving the expression data intact); in the first non-null scenario, we first permuted the sample labels and then added 2000 units to the first 500 genes (2000 was about the largest average difference in group means in the actual data); and in the second non-null scenario, we added a random amount $u_i \sim N(0, 200^2)$ to 2000 genes $i$ chosen at random.

We carried out 100 realizations of this experiment, and the results are shown in Table 2. The quantity ‘sd’ is the actual standard deviation of tail strength over the 100 realizations; the second column is the asymptotic standard error from (2.11). The third column, the permutation-based standard error, is a different estimate, one that starts with the permutation values used in the original computation of tail strength. We compute tail strength from successive blocks of 20 permutations, and then compute the standard deviation of the resulting tail strength values. This estimate explicitly assumes that the null hypothesis is true. We see that the asymptotic standard error is much too small, in general, because of the lack of independence of the genes. On the other hand, the permutation-based estimate is reasonably accurate under both the null and non-null scenarios. As we might expect, it is somewhat conservative (too large) under the non-null setup, but this is acceptable in practice. Alternatively, one could use a (non-null) bootstrap process to estimate the standard error, but this would require a great deal more computation. This might be an impediment to routine use of the tail strength measure, so we use the permutation-based estimate in the real data examples in this paper.


 
Table 2. Results of a simulation experiment to assess estimates of the standard error of the tail strength measure. ‘sd’ is the actual standard deviation of tail strength over the 100 realizations; the second column is the asymptotic standard error; the permutation-based standard error is obtained by computing tail strength from successive blocks of 20 permutations, and then computing their standard deviation

 

    4. EXAMPLES
 
Figure 4 shows the tail strength measure and asymptotic 90% confidence intervals, applied to 12 different datasets. The datasets are summarized in Table 3. The first nine datasets are from microarray studies, and all report positive findings. Most of these are described in Dettling (2004), where some comparative analyses are also performed.


 
Fig. 4. Tail strength measure computed on some biological datasets. Shown are the TS measures along with 90% confidence intervals.

 

 
Table 3. Summary of datasets for Figure 4

 
The remaining datasets are from neuroimaging studies. The datasets ‘aud-over’ and ‘aud-sent’ are from an auditory functional magnetic resonance imaging study (Taylor and Worsley, 2005), with aud-over being overall activation and aud-sent a measure of hemodynamic delay (Liao et al., 2002) in response to different sentences. The dataset ‘dtiTS’ comes from a diffusion tensor imaging dataset, studying pediatric differences in white matter in dyslexic and control cases (Deutsch et al., 2005); the p-values reflect local differences in direction of white matter fiber tracts and were studied in Schwartzmann et al. (2005).

All the datasets (except for FL) show significant (non-zero) tail strengths of various degrees. For the subset of classification problems among these studies, Table 4 compares the estimated tail strength with the misclassification rate from the nearest shrunken centroid classifier (Tibshirani et al., 2001); results from other classifiers, given in Dettling (2004), are quite similar. The error rates were computed by repeated (2/3, 1/3) train-test splits of the data, except for the ‘skin’ data, which uses 14-fold cross-validation.


 
Table 4. Tail strengths and misclassification rates (test set or cross-validated), for the classification problems in Table 3. Classification was done using nearest shrunken centroids

 
There is one interesting (qualitative) discrepancy in Table 4: the multiclass ‘brain’ dataset shows very different behavior in tail strength and misclassification rate. The tail strength is high (0.82), but the misclassification rate seems poor (23.5%). The test statistic for each gene is an F-statistic, the ratio of between-class to within-class variance. Figure 5 shows the ordered test statistics versus their expected values under the null hypothesis. There is clearly more variation than we would expect by chance.


 
Fig. 5. Brain data: ordered statistics (F-statistics) versus their expected values under the null hypothesis.

 
There are some possible explanations for the seeming discrepancy between tail strength and classification rate in the brain example. First note that with five classes, the base error rate is 80%, so that the value 23.5% is actually a substantial reduction in this rate. In addition, there are only 42 cases in this dataset, so that the training set on which the classifiers were trained had only 28 cases on average. For the five classes, the classwise error rates were 15, 6, 6, 20 and 57%. We computed the tail strengths for each class versus the rest (based on a two-sample t-statistic): they were 0.39, 0.53, 0.67, 0.54 and 0.32. Hence, class 5 has both a high error rate and a lower tail strength. It seems that the overall tail strength, based on the F-statistic for all five classes, fails to capture the difficulty in predicting class 5.


    5. DISCUSSION
 
The tail strength measure is potentially useful for assessing the overall statistical significance of a set of hypothesis tests. It gives a quantitative measure of the overall strength of evidence against the global null hypothesis of ‘no association’ between a large set of features and an outcome of interest. We suggest that the tail strength could be routinely reported in such studies, to give the reader a crude idea of the degree of departure from the no association null in a complex dataset.

In statistics, there is of course a long history and a substantial literature in the area of multiple hypothesis testing. With the flurry of applications in genomics, there has been a resurgence of interest in this area (see, e.g. Dudoit et al., 2003, for a summary). Our work has a close relationship to the FDR approach to multiple testing, as we have shown in Section 2.2. Recent work of Efron (2005) (Section 5) considers quantities similar to tail strength, based on a local version of the FDR.

Another concept that seems connected to tail strength is the higher criticism of Donoho and Jin (2004), generalizing an idea introduced by Tukey (1976). They define

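$$\mathrm{HC}_m = \max_{1 \le k \le \alpha_0 m} \frac{\sqrt{m}\,\bigl(k/m - p_{(k)}\bigr)}{\sqrt{p_{(k)}\,\bigl(1 - p_{(k)}\bigr)}} \tag{5.14}$$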

for some $\alpha_0 > 0$. This statistic is designed as an overall summary of the p-values, and they prove that it is optimal for detecting certain sparse patterns of p-values. They also show that the asymptotic $\alpha$ percentile for $\mathrm{HC}_m$ is of the size $\sqrt{2 \log \log m}$. We attempted some numerical comparisons of $\mathrm{HC}_m$ with tail strength on the datasets in this paper, but these were not successful. The presence of some very small p-values made the denominator very small and caused the statistic to get very large. In addition, it was not clear how to choose $\alpha_0$ and the significance cutpoint in finite samples. We leave this comparison for future study.
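
To make the sensitivity to tiny p-values concrete, here is a minimal sketch of the higher criticism statistic. It assumes the standard Donoho and Jin form, a maximum over the smallest alpha0 * m ordered p-values of sqrt(m) (k/m - p_(k)) / sqrt(p_(k) (1 - p_(k))); the default alpha0 = 0.5 and the handling of p-values equal to 0 or 1 are choices of this example.

```python
import math


def higher_criticism(pvalues, alpha0=0.5):
    """Higher criticism over the smallest alpha0 * m ordered p-values (assumed form)."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    k_max = max(1, math.floor(alpha0 * m + 1e-9))
    best = -math.inf
    for k in range(1, k_max + 1):
        p = ordered[k - 1]
        denom = math.sqrt(p * (1.0 - p))
        if denom == 0.0:
            continue  # term undefined at p = 0 or p = 1; skipped in this sketch
        best = max(best, math.sqrt(m) * (k / m - p) / denom)
    return best


moderate = higher_criticism([0.01, 0.5, 0.9], alpha0=1.0)
extreme = higher_criticism([1e-12, 0.5, 0.9], alpha0=1.0)
print(moderate, extreme)
```

Replacing the smallest p-value by a value close to zero shrinks the denominator of the first term and inflates the statistic by several orders of magnitude, which is the behavior described above.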

In summary, the tail strength measure proposed here is simple to compute, with no parameters that require adjustment. It must be stressed, however, that it does not measure all the interesting structure that might be present in a dataset. When applied to univariate association measures, it does not capture interactions or multivariate effects that might exist.


    APPENDIX
 

A.1 Asymptotic properties of tail strength

In the FDR setting, previous work has shown that examination of the limiting behavior of (estimates of) FDR and local FDR is useful in understanding what the various techniques are doing in a population setting. In this section, we carry out a similar analysis for TS, and give the proof of Theorem 2.1.

We can write

Formula 14

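Under $H_0$, the spacings of order statistics are distributed as

$$\bigl(p_{(1)},\; p_{(2)} - p_{(1)},\; \ldots,\; 1 - p_{(m)}\bigr) \overset{d}{=} \left(\frac{\xi_1}{\sum_{j=1}^{m+1} \xi_j},\; \ldots,\; \frac{\xi_{m+1}}{\sum_{j=1}^{m+1} \xi_j}\right),$$

where $\xi_i \sim \mathrm{Exp}(1)$ are i.i.d. exponential random variables.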
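
The exponential-spacings representation of uniform order statistics gives a direct way to simulate ordered null p-values without sorting. The sketch below uses m + 1 standard exponential draws normalized by their total; the function name is illustrative.

```python
import random


def uniform_order_statistics_via_spacings(m, rng):
    """Ordered Unif(0, 1) sample of size m built from m + 1 Exp(1) spacings."""
    xi = [rng.expovariate(1.0) for _ in range(m + 1)]
    total = sum(xi)
    ordered = []
    running = 0.0
    for value in xi[:m]:            # the last spacing is the gap from p_(m) to 1
        running += value
        ordered.append(running / total)
    return ordered


rng = random.Random(0)
sample = uniform_order_statistics_via_spacings(5, rng)
print(sample)
```

Because each ordered value is a normalized partial sum of independent terms, statistics that are linear in the ordered p-values, such as tail strength, are natural candidates for a central limit argument.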

This suggests that TS should be asymptotically normally distributed, at least under $H_0$, because it is a sum of approximately independent random variables. In fact, TS is also asymptotically normally distributed when the p-values are identically distributed with distribution $F$, as in the Bayesian model of Storey (2002). We could alternatively assume that some fixed proportion $\pi_0$ of the p-values are i.i.d. Unif(0, 1), and the remaining are i.i.d. from some distribution $F_1$, such that

$$F(x) = \pi_0\, x + (1 - \pi_0)\, F_1(x).$$

This is the mixture model used in the development of local FDR by Efron et al. (2001) and Efron and Tibshirani (2002). This assumption would not likely change the essence of our main result, only complicate the proofs.

We begin by expressing (2.1) in yet another way, in terms of quantile processes. Let

$$Q_m(x) = p_{(k)} \quad \text{for } \frac{k-1}{m} < x \le \frac{k}{m}, \qquad k = 1, \ldots, m, \tag{A.1}$$

be the quantile process of the p-values and

$$Q(x) = F^{-1}(x) = \inf\{u : F(u) \ge x\} \tag{A.2}$$

be the population quantile function.

Given the definition of $Q_m$, it is not hard to see that

$$Q_m(k/m) = p_{(k)}, \qquad k = 1, \ldots, m.$$

Using this fact, the expression (2.1) takes the form of a Riemann sum

$$\mathrm{TS} = \frac{1}{m} \sum_{k=1}^{m} \left(1 - \frac{m+1}{m}\,\frac{Q_m(k/m)}{k/m}\right). \tag{A.3}$$

Such an expression is simpler to analyze than (2.1), using some results from the theory of quantile processes.

If $Q$ is Riemann integrable, then this expression converges to

$$\int_0^1 \left(1 - \frac{Q(x)}{x}\right) dx.$$

If, further, $F$ has a density, then making the substitution $u = Q(x)$ we see that this expression is equal to (2.8).

The result we will use from the theory of quantile processes (Barrio, 2004) is the following: under $H_0$,

$$\sqrt{m}\,\bigl(Q_m(x) - x\bigr) \xrightarrow{d} B(x), \tag{A.4}$$

where $(B(x))_{0 \le x \le 1}$ is a standard Brownian bridge, that is, a continuous Gaussian process on $[0, 1]$ with mean 0 and covariance function

$$\mathrm{Cov}\bigl(B(x), B(y)\bigr) = \min(x, y) - xy. \tag{A.5}$$

Suppose the p-values are i.i.d. with distribution $F$, where $F$ is twice differentiable with strictly positive density $f$ on $(0, 1)$; then (Barrio, 2004)

$$\sqrt{m}\,\bigl(Q_m(x) - Q(x)\bigr) \xrightarrow{d} \frac{B(x)}{f(Q(x))}. \tag{A.6}$$

This suggests that

Formula 20

A straightforward application of Theorem 1 of Shorack (1972), combined with the comments above, suffices to prove Theorem 2.1.

REMARK A.1 Actually, $F$ need not even have a density for the central limit theorem above to hold, though the expected value will be changed slightly. If $F$ has density $f$, then under the hypothesis $F(x) \ge x$

Formula 20

so the variance under H0 is an upper bound.


    ACKNOWLEDGMENTS
 
We would like to thank Brad Efron for helpful comments, and editors and referees whose comments substantially improved this paper. Robert Tibshirani was partially supported by National Science Foundation grant DMS-9971405 and National Institutes of Health contract N01-HV-28183.


    REFERENCES
 

    ALIZADEH, A., EISEN, M., DAVIS, R. E., MA, C., LOSSOS, I., ROSENWAL, A., BOLDRICK, J., SABET, H., TRAN, T., YU, X. et al. (2000). Identification of molecularly and clinically distinct substypes of diffuse large B cell lymphoma by gene expression profiling. Nature 403, 503–511.[CrossRef][Medline]

    ALON, U., BARKAI, N., NOTTERMAN, D., GISH, K., YBARRA, S., MACK, D. AND LEVINE, A. (1999). Broad patterms of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceeding of the National Academy of Sciences of the United States of America 96, 6745–6750.

    BARRIO, E. (2004). Empirical and Quantile Processes in the Asymptotic Theory of Goodness of Fit Tests. http://www.eio.uva.es/ems/Goodness_of_fit-Laredo_2004.pdf.

    BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 85, 289–300.

    BENJAMINI, Y. AND HOCHBERG, Y. (2000). On the adaptive control of the false discovery fate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics 25, 60–83.

    DAVE, S. S., WRIGHT, G., TAN, B., ROSENWALD, A., GASCOYNE, R. D., CHAN, W. C., FISHER, R. I., BRAZIEL, R. M., RIMSZA, L. M., GROGAN, T. M. et al. (2004). Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. The New England Journal of Medicine 351, 2159–2169.[Abstract/Free Full Text]

    DETTLING, M. (2004). Bagboosting for tumor classification with gene expression data. Bioinformatics 20, 3583–3593.[Abstract/Free Full Text]

    DEUTSCH, G. K., DOUGHERTY, R. F., BAMMER, R., SIOK, W. T., GABRIELI, J. D. AND WANDELL, B. (2005). Correlations between white matter microstructure and reading performance in children. Cortex 41, 354–363.[ISI][Medline]

    DONOHO, D. AND JIN, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962–994.[CrossRef]

    DUDOIT, S., SHAFFER, J. P. AND BOLDRICK, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18, 71–103.[CrossRef]

    EFRON (2005). Local false discovery rates. Technical Report, Stanford University.

    EFRON, B. AND TIBSHIRANI, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 1, 70–86.

    EFRON, B., TIBSHIRANI, R., STOREY, J. AND TUSHER, V. (2001). Empirical bayes analysis of a microarray experiment. Journal of The American Statistical Association 96, 1151–1160.[CrossRef]

    GENOVESE, C. AND WASSERMAN, L. (2002). Operating characteristics and extensions of the FDR procedure. Journal of the Royal Statistical Society Series B 64, 499–517.[CrossRef]

    GOLUB, T., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J. P., COLLER, H., LOH, M. L., DOWNING, J. R., CALIGIURI, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–536.[Abstract/Free Full Text]

    HANLEY, J. A. AND MCNEIL, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.[Abstract/Free Full Text]

    KALBFLEISCH, J. AND PRENTICE, R. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.

    KHAN, J., WEI, J. S., RINGNÉR, M., SAAL, L. H., LADANYI, M., WESTERMANN, F., BERTHOLD, F., SCHWAB, M., ANTONESCU, C. R., PETERSON, C. et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679.

    LIAO, C., WORSLEY, K., POLINE, J.-B., ASTON, J., DUNCAN, G. AND EVANS, A. (2002). Estimating the delay of the response in fMRI data. Neuroimage 16, 593–606.

    PEPE, M. S. (2003). Partial AUC estimation and regression. Biometrics 59, 614–623. http://www.blackwell-synergy.com/doi/abs/10.1111/1541-0420.00071.

    POMEROY, S., TAMAYO, P., GAASENBEEK, M., STURLA, L., ANGELO, M., MCLAUGHLIN, M., KIM, J., GOUMNEROVA, L., BLACK, P., LAU, C. et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442.

    RIEGER, K., HONG, W., TUSHER, V., TANG, J., TIBSHIRANI, R. AND CHU, G. (2004). Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage. Proceedings of the National Academy of Sciences of the United States of America 101, 6634–6640.

    ROSENWALD, A., WRIGHT, G., CHAN, W. C., CONNORS, J. M., CAMPO, E., FISHER, R. I., GASCOYNE, R. D., MULLER-HERMELINK, H. K., SMELAND, E. B. AND STAUDT, L. M. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. The New England Journal of Medicine 346, 1937–1947.

    SCHWARTZMANN, A., DOUGHERTY, R. AND TAYLOR, J. (2005). Cross-subject comparison of principal diffusion direction maps. Magnetic Resonance in Medicine 53, 1423–1431.

    SHORACK, G. R. (1972). Functions of order statistics. Annals of Mathematical Statistics 43, 412–427.

    SINGH, D., FEBBO, P., ROSS, K., JACKSON, D., MANOLA, J., LADD, C., TAMAYO, P., RENSHAW, A., D'AMICO, A., RICHIE, J. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209.

    STOREY, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B 64, 479–498.

    STOREY, J. D., TAYLOR, J. E. AND SIEGMUND, D. O. (2004). Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society Series B 66, 187–205.

    TAYLOR, J. AND WORSLEY, K. (2005). Analysis of hemodynamic delay in the FIAC data. 11th Annual Meeting of the Organization for Human Brain Mapping, Toronto, June 12–16, 2005.

    TIBSHIRANI, R. (2005). Immune signatures in follicular lymphoma. The New England Journal of Medicine 352, 1496–1497.

    TIBSHIRANI, R., HASTIE, T., NARASIMHAN, B. AND CHU, G. (2001). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences of the United States of America 99, 6567–6572.

    TUKEY, J. (1976). T13 N: The Higher Criticism. Course notes, Statistics 411. Princeton University.

    Received August 25, 2005; revised November 29, 2005; accepted for publication December 1, 2005.




