Biostatistics Advance Access originally published online on September 12, 2006
Biostatistics 2007 8(2):500-504; doi:10.1093/biostatistics/kxl025
Published by Oxford University Press 2006.

Multiple comparisons distortions of parameter estimates

Neal O. Jeffries

MSC 1430, 10 Center Drive, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA neal.jeffries{at}nih.gov


    SUMMARY
In experiments involving many variables, investigators typically use multiple comparisons procedures to determine differences that are unlikely to be the result of chance. However, investigators rarely consider how the magnitude of the greatest observed effect sizes may have been subject to bias resulting from multiple testing. These questions of bias become important to the extent investigators focus on the magnitude of the observed effects. As an example, such bias can lead to problems in attempting to validate results, if a biased effect size is used to power a follow-up study. An associated important consequence is that confidence intervals constructed using standard distributions may be badly biased. A bootstrap approach is used to estimate and adjust for the bias in the effect sizes of those variables showing strongest differences. This bias is not always present; some principles showing what factors may lead to greater bias are given and a proof of the convergence of the bootstrap distribution is provided.

Keywords: Bootstrap; Effect size; Multiple comparisons


    1. INTRODUCTION
Most considerations involving multiple comparisons problems focus upon the increased probability of false-positive errors when the null hypothesis is true. Here the focus is upon the distorting effects of multiple comparisons in evaluating those variables judged to show the strongest effects or differences between groups. These distortions may be present both when the null hypothesis of no difference is true, as well as when it is false.

This bias matters in several circumstances. If a power analysis for a follow-up study is based on an overestimated effect size, the study will likely be underpowered. Further, a follow-up study may be difficult to mount, in which case the preliminary study provides the best available point estimate, and this estimate should be deflated if bias is present. In genetic epidemiology marker studies, there is sometimes interest in assessing the strength of a marker's association; if the strength is low, it may not be worth performing a fine-mapping or other follow-up study. Also, confidence intervals for the point estimate will reflect the degree of bias affecting the estimate.
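The power consequence can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes a two-sided two-sample z-test (normal approximation) on a standardized mean difference, and the values `d_observed` and `d_true`, like the function names, are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample z-test
    on a standardized mean difference d (normal approximation)."""
    z = NormalDist()
    return 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2

def achieved_power(d_true, n, alpha=0.05):
    """Approximate power of the same test with n per group when the true
    standardized difference is d_true (the negligible far tail is ignored)."""
    z = NormalDist()
    return z.cdf(d_true * math.sqrt(n / 2) - z.inv_cdf(1 - alpha / 2))

d_observed = 1.0  # hypothetical inflated estimate from the screening study
d_true = 0.7      # hypothetical true effect size

# Size the follow-up study as if the inflated estimate were correct.
n_follow_up = math.ceil(n_per_group(d_observed))

print("n per group:", n_follow_up)
print("power if the estimate were unbiased:", achieved_power(d_observed, n_follow_up))
print("power at the true effect:", achieved_power(d_true, n_follow_up))
```

Under these assumed numbers, the follow-up study is sized for 80% power at the inflated estimate but achieves considerably less at the smaller true effect.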

Recent work by genetic epidemiologists (e.g. Sun and Bull, 2005; Siegmund, 2002) has focused upon this bias problem in the context of estimating the presence and magnitude of genetic marker effects in genome-wide scans. The former study examines bootstrap and cross-validation approaches similar to that proposed here, though in the present work more attention is paid to estimating the entire distribution of overestimation and determining confidence intervals. The latter reference puts forth an analytical approach that is highly dependent upon the genetic model that is assumed and therefore appears to be restricted largely to genetic marker studies. Also, both of these papers posit that the overestimation arises from truncation bias related to the threshold for declaring significance; here a different presentation of bias is described, one given without reference to a significance threshold. The basic idea is that observed outcomes are composed of random and deterministic components. Under some circumstances, the fact that one outcome performs best may suggest that the random component for that outcome was unusually beneficial, and this "good luck" is associated with overestimation of the true effect. Determining when such circumstances exist and measuring their distortion are the focus of this paper.


    2. ILLUSTRATIONS OF THE PROBLEM
An elementary two-group t-test applied to a number of variables will be used to illustrate some principles. A simple examination of gene expression differences between healthy and diseased individuals could give rise to such a design.

We assume that each of the two groups has n individuals, a total of G variables (e.g. genes) are measured, and for variable j we denote the n response measures as $X_{ij}$ in group 1 and $Y_{ij}$ in group 2 for $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, G\}$. Let $\bar{X}_j = n^{-1}\sum_{i=1}^{n} X_{ij}$ and $\bar{Y}_j = n^{-1}\sum_{i=1}^{n} Y_{ij}$ denote the group means, $\sigma_j$ denote the common standard deviation for $X_{ij}$ and $Y_{ij}$, and $s_j$ denote the pooled estimate of $\sigma_j$. If $\mu_j = E X_{ij} - E Y_{ij}$ denotes the average difference for the jth variable, then the t-statistic may be written as

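$$t_j = \frac{\bar{X}_j - \bar{Y}_j}{s_j\sqrt{2/n}} \qquad (2.1)$$

$$\phantom{t_j} = \frac{\bar{X}_j - \bar{Y}_j - \mu_j}{s_j\sqrt{2/n}} + \frac{\mu_j}{s_j\sqrt{2/n}} \qquad (2.2)$$

$$\phantom{t_j} = \tau_j + \frac{\mu_j}{s_j\sqrt{2/n}}. \qquad (2.3)$$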

Here $\tau_j = (\bar{X}_j - \bar{Y}_j - \mu_j)/(s_j\sqrt{2/n})$ is a realization of a random variable with a t-distribution having $2n - 2$ degrees of freedom. The distinction between $t_j$ and $\tau_j$ is that the latter has a t-distribution (centered about 0) regardless of whether the null hypothesis, $\mu_j = 0$, is true. One sees in (2.3) that the degree to which the sample difference, $\bar{X}_j - \bar{Y}_j$, exceeds the true difference, $\mu_j$, is associated with the magnitude of $\tau_j$. This degree of overestimation expressed by $\tau_j$ is of primary interest in this paper. Of interest is the distribution of $\tau_j$ when it corresponds to a gene with an extreme $t_j$ value. Let $r_1, r_2, \ldots, r_G$ denote the indices associated with the smallest to the largest t-statistics so that $t_{r_1} \le t_{r_2} \le \cdots \le t_{r_G}$. While $\tau_j$ has a t-distribution marginally, this is generally no longer the case when we condition on j corresponding to an extreme $\tau_j$ value. It is impossible to know in general the distribution of

$$\tau_{r_1} \quad \text{or} \quad \tau_{r_G} \qquad (2.4)$$

and hence the degree to which $\bar{X}_{r_1} - \bar{Y}_{r_1}$ underestimates $\mu_{r_1}$ (or $\bar{X}_{r_G} - \bar{Y}_{r_G}$ overestimates $\mu_{r_G}$). Situations in which $E[\tau_{r_G}] > 0$ are consistent with $E[\bar{X}_{r_G} - \bar{Y}_{r_G} - \mu_{r_G}] > 0$, and in this way reflect bias in the estimated mean or effect size. As an aside, it should be noted that the distribution of $\tau_{r_G}$ may also be driven by small values of the $s_j$ terms.
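The positive shift in the distribution of the selected variable's statistic can be seen in a short simulation. The following is a minimal sketch, not the author's code: it assumes independent unit-variance normal responses, uses the pooled standard error $s_j\sqrt{2/n}$ for equal group sizes, and the function names `t_and_tau` and `mean_tau_of_winner` are hypothetical.

```python
import math
import random
import statistics

def t_and_tau(x, y, mu):
    """Return (t, tau) for one variable: t is the usual two-sample statistic,
    tau is the same statistic centered at the true difference mu."""
    n = len(x)
    diff = statistics.fmean(x) - statistics.fmean(y)
    # With equal group sizes the pooled variance is the mean of the two sample variances.
    s = math.sqrt((statistics.variance(x) + statistics.variance(y)) / 2)
    se = s * math.sqrt(2 / n)
    return diff / se, (diff - mu) / se

def mean_tau_of_winner(effects, n=14, n_sim=400, seed=1):
    """Average tau of the variable with the largest t-statistic over n_sim data sets."""
    rng = random.Random(seed)
    taus = []
    for _ in range(n_sim):
        stats = []
        for mu in effects:
            x = [rng.gauss(mu, 1.0) for _ in range(n)]   # group 1, mean mu (assumed normal)
            y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # group 2, mean 0
            stats.append(t_and_tau(x, y, mu))
        _, tau_winner = max(stats)  # tuples compare on t first, so this selects the largest t
        taus.append(tau_winner)
    return statistics.fmean(taus)

effects = [k / 10 for k in range(1, 11)]  # G = 10 effect sizes evenly spaced in (0, 1]
print("average tau of the selected variable:", mean_tau_of_winner(effects))
```

Each individual $\tau_j$ averages zero, so an average for the selected variable that sits clearly above zero reflects the selection effect described above.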

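In terms of confidence intervals, naive application of a t-test distribution can be misleading, e.g. the interval for $\mu_{r_G}$ given by $\bar{X}_{r_G} - \bar{Y}_{r_G} \pm t_{2n-2,\,1-\alpha/2}\, s_{r_G}\sqrt{2/n}$, where $t_{2n-2,\,1-\alpha/2}$ is the $1-\alpha/2$ quantile of the t-distribution with $2n - 2$ degrees of freedom, is likely to systematically overestimate $\mu_{r_G}$; this will be demonstrated in simulations below.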

As $\mu_{r_G}$ is unknown, one cannot directly estimate the distribution of

$$\tau_{r_G} = \frac{\bar{X}_{r_G} - \bar{Y}_{r_G} - \mu_{r_G}}{s_{r_G}\sqrt{2/n}}. \qquad (2.5)$$

The bootstrap techniques popularized by Efron (1979) will be used to develop alternative confidence intervals that compensate for the bias described above. To proceed, one first constructs a bootstrap sample from the $X_{ij}, Y_{ij}$ by sampling individuals with replacement from these two-group data in a stratified manner, i.e. sampling from the $X_{i\cdot} = \{X_{i1}, \ldots, X_{iG}\}$ and $Y_{i\cdot} = \{Y_{i1}, \ldots, Y_{iG}\}$ separately. One samples each individual's entire data, not the individual variables separately. From here, one obtains bootstrap samples $X^*_{ij}$ and $Y^*_{ij}$ and can compute associated bootstrap statistics $\bar{X}^*_j, \bar{Y}^*_j, s^*_j, t^*_j$. For a particular bootstrap sample designated by the * superscript, let $r^*_1, r^*_2, \ldots, r^*_G$ order the t-statistics, $t^*_j$, i.e. $t^*_{r^*_1} \le t^*_{r^*_2} \le \cdots \le t^*_{r^*_G}$. Then compute

$$\tau^*_{r^*_G} = \frac{(\bar{X}^*_{r^*_G} - \bar{Y}^*_{r^*_G}) - (\bar{X}_{r^*_G} - \bar{Y}_{r^*_G})}{s^*_{r^*_G}\sqrt{2/n}} \qquad (2.6)$$

or $\tau^*_{r^*_1}$ or any other ordered $\tau^*$ of interest. One may produce and process many bootstrap samples in this way and obtain an empirical distribution of $\tau^*_{r^*_G}$. The hope is that the unknown distribution of $\tau_{r_G}$ may be approximated by that of $\tau^*_{r^*_G}$. In considering the terms

$$\tau_{r_G} = \frac{\bar{X}_{r_G} - \bar{Y}_{r_G} - \mu_{r_G}}{s_{r_G}\sqrt{2/n}} \quad \text{and} \quad \tau^*_{r^*_G} = \frac{(\bar{X}^*_{r^*_G} - \bar{Y}^*_{r^*_G}) - (\bar{X}_{r^*_G} - \bar{Y}_{r^*_G})}{s^*_{r^*_G}\sqrt{2/n}} \qquad (2.7)$$

the idea is that the degree to which $\bar{X}_{r_G} - \bar{Y}_{r_G}$ exceeds $\mu_{r_G}$ can be approximated by the degree to which $\bar{X}^*_{r^*_G} - \bar{Y}^*_{r^*_G}$ exceeds $\bar{X}_{r^*_G} - \bar{Y}_{r^*_G}$. In other words, $r^*_G$ is treated like $r_G$, $\bar{X}_{r^*_G} - \bar{Y}_{r^*_G}$ like $\mu_{r_G}$, and $\bar{X}^*_{r^*_G} - \bar{Y}^*_{r^*_G}$ like $\bar{X}_{r_G} - \bar{Y}_{r_G}$. Once an empirical distribution of $\tau^*_{r^*_G}$ is obtained, denoted by $F^*$, one may use the percentiles $F^{*-1}$ to create a $100(1-\alpha)\%$ confidence interval for $\mu_{r_G}$, i.e.

$$\left[\, \bar{X}_{r_G} - \bar{Y}_{r_G} - F^{*-1}(1-\alpha/2)\, s_{r_G}\sqrt{2/n},\;\; \bar{X}_{r_G} - \bar{Y}_{r_G} - F^{*-1}(\alpha/2)\, s_{r_G}\sqrt{2/n} \,\right] \qquad (2.8)$$

where $\bar{X}_{r_G} - \bar{Y}_{r_G}$ and $s_{r_G}$ are observed from the original data. For this procedure, a second-order bootstrap may also be employed to improve the approximation of $F^{-1}$ by $F^{*-1}$, where $F$ denotes the cdf of $\tau_{r_G}$. This involves creating a further number of bootstrap samples and associated statistics from each first-level bootstrap sample. Details of this nested percentile approach and an R-program implementing it are available at http://data.ninds.nih.gov/Jeffries/multcomps/index.htm.
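The first-level resampling scheme can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the R-program referenced above: it implements only the single-level bootstrap (no nested second-order step), assumes equal group sizes, uses a simple order-statistic rule for the percentiles, and the names `col_stats` and `bootstrap_ci_top` are hypothetical.

```python
import math
import random
import statistics

def col_stats(X, Y):
    """Per-variable (mean difference, standard error) using the pooled sd.
    X, Y: lists of individuals, each a list of G responses; equal group sizes assumed."""
    n, G = len(X), len(X[0])
    out = []
    for j in range(G):
        xj = [row[j] for row in X]
        yj = [row[j] for row in Y]
        diff = statistics.fmean(xj) - statistics.fmean(yj)
        s = math.sqrt((statistics.variance(xj) + statistics.variance(yj)) / 2)
        out.append((diff, s * math.sqrt(2 / n)))
    return out

def top_index(stats):
    """Index of the variable with the largest t-statistic."""
    return max(range(len(stats)), key=lambda j: stats[j][0] / stats[j][1])

def bootstrap_ci_top(X, Y, alpha=0.10, B=500, seed=0):
    """Interval for the true difference of the top-ranked variable, built from
    the bootstrap distribution of the centered statistic of each resample's winner."""
    rng = random.Random(seed)
    obs = col_stats(X, Y)
    r_G = top_index(obs)
    taus = []
    for _ in range(B):
        # Stratified resampling: whole individuals are drawn within each group.
        Xb = [X[rng.randrange(len(X))] for _ in range(len(X))]
        Yb = [Y[rng.randrange(len(Y))] for _ in range(len(Y))]
        bs = col_stats(Xb, Yb)
        r_star = top_index(bs)  # the winner is re-selected in every bootstrap sample
        d_star, se_star = bs[r_star]
        # Center at the ORIGINAL-sample difference of the bootstrap winner.
        taus.append((d_star - obs[r_star][0]) / se_star)
    taus.sort()
    lo_q = taus[math.floor((alpha / 2) * (B - 1))]
    hi_q = taus[math.ceil((1 - alpha / 2) * (B - 1))]
    d, se = obs[r_G]
    return r_G, d, (d - hi_q * se, d - lo_q * se)

# Hypothetical data: n = 14 per group, G = 10, effects evenly spaced in (0, 1].
rng = random.Random(42)
effects = [k / 10 for k in range(1, 11)]
X = [[rng.gauss(mu, 1.0) for mu in effects] for _ in range(14)]
Y = [[rng.gauss(0.0, 1.0) for _ in effects] for _ in range(14)]
r_G, diff, (lo, hi) = bootstrap_ci_top(X, Y)
print("selected variable:", r_G, "observed difference:", diff, "interval:", (lo, hi))
```

The key design point is that the winner is re-selected within each bootstrap sample, so the resampled statistic carries the same selection effect as the original one; centering a fixed variable's statistic would not reproduce it.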

Table 1 indicates how this approach works in terms of confidence intervals for $\mu_{r_G}$ in a simulation context. Here n = 14 in each group and G = 444. Each variable has an effect size chosen from the set {2/444, 4/444, ..., 886/444, 888/444}. The idea is that this may correspond to about 2% of approximately 22200 genes being differentially expressed; the genes/variables are constructed as independent for convenience. Rather than performing tests on all 22200 variables, attention was restricted to those differentially expressed to make the second-order bootstrap computations feasible. Values of $\bar{X}_{r_G} - \bar{Y}_{r_G}$ and $\mu_{r_G}$ were obtained from each simulation. Further, the two-stage bootstrap algorithm was implemented and confidence intervals of varying nominal coverage were constructed. Table 1 gives the characteristics of the coverage of these bootstrap intervals and intervals constructed using the naive t-statistic approach. The results show that the naive t-statistic intervals very often fail to cover $\mu_{r_G}$ and that the bootstrap approach performs better. Also worth noting is that the bootstrap intervals are about 20% longer. The greater width is not the primary reason the bootstrap covers better; the improvement is instead due to the overestimation correction, as expanding the t-statistic regions by 20% will increase the coverage probabilities to only 5%, 19%, 36%, and 55% for the four different intervals.


Table 1. Confidence interval characteristics for $\mu_{r_G}$ with n = 14, G = 444, effect sizes evenly spaced in (0, 2], 1000 simulations

 
A second set of simulations was run with a much smaller number of variables/genes, to evaluate overestimation when a more modest number of comparisons is involved. Here n = 14, there are G = 10 independent variables, and the effect sizes are chosen at evenly spaced intervals between 0 and 1. Table 2 provides confidence interval characteristics for the bootstrap and t-statistic approaches. The naive approach performs better than in Table 1, though some distortion is still present. In this case, the bootstrap intervals are of comparable length with good coverage characteristics.


Table 2. Confidence interval characteristics for $\mu_{r_G}$ with n = 14, G = 10, effect sizes evenly spaced in (0, 1], 1000 simulations

 
A third set of simulations (see Table 3) was run to show that when overestimation is not a problem, the bootstrap approach yields coverage and confidence interval lengths comparable to those produced by the naive approach. In these simulations, one of the 10 effect sizes was chosen to be 3 and the other nine were set to 0. In all 1000 simulations, the variable with the large effect size generated the largest t-statistic, and in this case there was no overestimation problem.
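The contrast between a dominant effect and closely spaced effects can be explored with a small simulation of the naive interval alone. The sketch below is an illustration under assumptions, not a reproduction of the tables: responses are taken to be independent unit-variance normals, the critical value 2.0555 for 26 degrees of freedom is hard-coded from standard t tables (so it is valid only for n = 14), and `diff_and_se` and `simulate` are hypothetical names.

```python
import math
import random

# 97.5th percentile of the t-distribution with 2n - 2 = 26 df; only valid for n = 14.
T_CRIT_26 = 2.0555

def diff_and_se(x, y):
    """Mean difference and pooled-sd standard error for equal group sizes."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (n - 1)
    return mx - my, math.sqrt((vx + vy) / 2) * math.sqrt(2 / n)

def simulate(effects, n_sim=1000, seed=2):
    """For the variable with the largest t-statistic in each simulated data set,
    record naive 95% interval coverage, the centered statistic tau, and whether
    variable 0 was the one selected."""
    n = 14  # fixed because T_CRIT_26 assumes 26 degrees of freedom
    rng = random.Random(seed)
    covered = first_wins = 0
    tau_sum = 0.0
    for _ in range(n_sim):
        best = None
        for j, mu in enumerate(effects):
            x = [rng.gauss(mu, 1.0) for _ in range(n)]
            y = [rng.gauss(0.0, 1.0) for _ in range(n)]
            d, se = diff_and_se(x, y)
            if best is None or d / se > best[0]:
                best = (d / se, j, d, se, mu)
        _, j, d, se, mu = best
        first_wins += (j == 0)
        tau_sum += (d - mu) / se
        covered += (abs(d - mu) <= T_CRIT_26 * se)
    return {"coverage": covered / n_sim,
            "mean_tau": tau_sum / n_sim,
            "first_wins": first_wins / n_sim}

dominant = simulate([3.0] + [0.0] * 9)            # one large effect, nine nulls
spread = simulate([k / 10 for k in range(1, 11)])  # effects evenly spaced in (0, 1]
print("dominant effect:", dominant)
print("evenly spaced effects:", spread)
```

When one effect dominates, the same variable is selected in nearly every data set, so selection carries almost no information about the random component and the centered statistic averages close to zero.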


Table 3. Confidence interval characteristics for $\mu_{r_G}$ with n = 14, G = 10, effect sizes are {3, 0, 0, 0, 0, 0, 0, 0, 0, 0}, 1000 simulations

 
From the data in Tables 1, 2, and 3, some generalizations may be drawn. First, the naive estimate often performs badly—particularly as the number of variables/genes grows. The coverage probabilities in Table 1 give some indication of how poorly the common naive approach performs under circumstances that may not be atypical in a microarray context. Table 2 shows that problems still remain for the naive approach when the number of variables is reduced. In Table 3, one sees that in some circumstances when there is little overestimation, the bootstrap and naive approaches perform appropriately and similarly.


    3. CONCLUSION
In multiple testing situations, when one examines the variable/test with largest observed effect size, it is more likely that there exists a large random component that leads to an overestimation of the associated effect size. This bias is potentially present whenever attention is focused upon the most extreme results among a number of tests. The problem is not alleviated by corrections made to address the number or proportion of Type I errors. While multiple comparisons corrections and false discovery rate approaches affect the choice of which variables may reflect significant changes, they do not address distortions in the associated magnitudes of change. Such problems become important when (1) an estimate of effect size is used to power a follow-up study, (2) results are compared across different studies and discrepancies are found in the strength of those variables showing greatest differences, or (3) further action is contemplated based on an initial study (e.g. a follow-up fine-mapping study). Further, the traditional confidence intervals may be badly biased in such circumstances. The results indicate that the bootstrap approach may be able to distinguish instances when such bias does and does not exist, and that the associated confidence intervals are less prone to overestimation than those derived from a naive t-statistic approach.

Supplemental materials at http://data.ninds.nih.gov/Jeffries/multcomps/index.htm provide (1) a more complete description of the two-stage bootstrap approach used here, (2) asymptotic justification for applying the bootstrap in these situations, and (3) analysis showing which factors exacerbate this bias. Mitigating factors are increased sample size and greater distinction among the most extreme effect sizes.


    REFERENCES

    Efron B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1–26.

    Siegmund D. (2002) Upward bias in estimation of genetic effects. American Journal of Human Genetics 71:1183–1188.

    Sun L and Bull SB. (2005) Reduction of selection bias in genome-wide studies by resampling. Genetic Epidemiology 28:352–367.

    Received February 3, 2006; revised September 1, 2006; accepted for publication September 8, 2006.

