


Biostatistics Advance Access published online on September 14, 2007

Biostatistics, doi:10.1093/biostatistics/kxm032

Published by Oxford University Press 2007.

Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies

Mitchell H. Gail*

Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Boulevard, EPS 8032, Bethesda, MD 20892-7244, USA gailm{at}mail.nih.gov

Ruth M. Pfeiffer

Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA

William Wheeler and David Pee

Information Management Services, Rockville, MD, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. Introduction
 2. Materials and methods
 3. Results
 4. Discussion
 Funding
 REFERENCES
 
Some case–control genome-wide association studies (CCGWASs) select promising single nucleotide polymorphisms (SNPs) by ranking corresponding p-values, rather than by applying the same p-value threshold to each SNP. For such a study, we define the detection probability (DP) for a specific disease-associated SNP as the probability that the SNP will be "T-selected," namely have one of the top T largest chi-square values (or smallest p-values) for trend tests of association. The corresponding proportion positive (PP) is the fraction of selected SNPs that are true disease-associated SNPs. We study DP and PP analytically and via simulations, both for fixed and for random effects models of genetic risk, that allow for heterogeneity in genetic risk. DP increases with genetic effect size and case–control sample size and decreases with the number of nondisease-associated SNPs, mainly through the ratio of T to N, the total number of SNPs. We show that DP increases very slowly with T, and the increment in DP per unit increase in T declines rapidly with T. DP is also diminished if the number of true disease SNPs exceeds T. For a genetic odds ratio per minor disease allele of 1.2 or less, even a CCGWAS with 1000 cases and 1000 controls requires T to be impractically large to achieve an acceptable DP, leading to PP values so low as to make the study futile and misleading. We further calculate the sample size of the initial CCGWAS that is required to minimize the total cost of a research program that also includes follow-up studies to examine the T-selected SNPs. A large initial CCGWAS is desirable if genetic effects are small or if the cost of a follow-up study is large.

Keywords: Case–control study; Detection probability; Genetic association; Genome-wide association study; Ranking and selection; Whole genome scan


    1. Introduction
 
Case–control genome-wide association studies (CCGWASs) are used to detect associations of disease with genetic markers (single nucleotide polymorphisms [SNPs]) across the genome by comparing individuals with disease (cases) to disease-free individuals (controls). Several factors can lead to false-positive associations in CCGWASs, including population stratification and measurement error (Clayton and others, 2005). However, an overriding concern is the play of chance, when only a few true disease–associated SNPs are sought amidst the multitude of nondisease-associated SNPs.

One approach to control for multiplicity is to set stringent criteria for declaring an association statistically significant. For example, one might use the Bonferroni inequality to control the experiment-wise type I error rate. Two-stage designs have been proposed to reduce the amount of genotyping required while protecting the experiment-wise type I error rate (Skol and others, 2006). Less stringent but objective criteria such as the false discovery rate (Benjamini and Hochberg, 1995) have also been advocated.

Some investigators use p-values in a CCGWAS to rank and select promising SNPs for future study and are not concerned about the frequentist error control properties of the selection procedure. For example, the Cancer Genetic Markers of Susceptibility (CGEMS) project was designed to detect SNPs associated with prostate cancer (http://cgems.cancer.gov/). About 550 000 tagging SNPs were analyzed in an initial set of 1172 cases and 1157 controls (Yeager and others, 2007). About 28 000 most promising SNPs will be studied further in a second stage. These SNPs will be subjected to further winnowing in subsequent stages, leaving only 25–50 SNPs that will be regarded as sufficiently promising to warrant independent laboratory and epidemiologic investigations to attempt to establish a causal connection to disease. Altogether, data from 8400 cases and 8400 controls will be used.

The purpose of this paper is to investigate the probability that a specific disease SNP will be selected for further study in a CCGWAS and the probability that a selected SNP will be a true disease SNP. To be precise, we say that a SNP is "T-selected" or simply "selected" if its associated chi-square test statistic (or p-value) is among the top T chi-square test statistic values (or T lowest p-values). We call the probability that the test statistic for a specific disease SNP will be among the top T chi-square values in the sample the detection probability (DP). We also calculate PP, the proportion positive, namely the fraction of selected SNPs that are true disease–associated SNPs. The chi-square test we consider is a 2-sided trend test with additive (codominant) modeling of the SNP genotype (Armitage, 1955; Sasieni, 1997). Such a test does not require knowing whether the minor or major allele is associated with disease (Devlin and Roeder, 1999; Pfeiffer and Gail, 2003). We examine DP and PP both for the usual trend test based on the scores of the log-likelihood and for a Wald test.

It is computationally prohibitive to generate case–control data and perform 500 000 separate logistic analyses repeatedly. For the 1-stage design in which all SNPs are genotyped for every case and control, we develop asymptotic theory that allows us to study DP and PP in simulations, both for the score test and for the Wald test, and we also show how to calculate DP and PP analytically.

Our data give practical guidance as to required sample sizes and numbers of top ranks T needed to yield a high DP. We also study the effects of these parameters on PP, and on other factors crucial to designing a CCGWAS, including the rapid decrease in incremental DP per unit increase in T as T increases. Our findings give insight into how resources should be allocated between the CCGWAS and subsequent studies needed to follow up on the T-selected SNPs. In Section 4, we relate our work to the literature.


    2. Materials and methods
 

2.1 Study design and data for simulations

We consider a population-based case–control design with \(n\) cases and \(n\) controls selected, respectively, at random from all cases and all controls in the source population. We assume that risk of disease is influenced by \(M\) out of \(N\) SNPs under study. For simulations, at each SNP \(i = 1, 2, \ldots, N\), we randomly and independently select a minor allele frequency, \(\eta_i\), from the 299 686 minor allele frequencies that are 0.05 or greater in CGEMS (https://caintegrator.nci.nih.gov/cgems/downloadSetup.do). This minor allele frequency distribution had mean 0.2673, standard deviation 0.12, minimum 0.05, maximum 0.50, and quartiles 0.15 (25%), 0.26 (median), and 0.38 (75%). In each replicate of the simulations described below, minor allele frequencies were reassigned to each SNP in this way. CGEMS SNPs were chosen to be "tagging SNPs." Therefore, in the simulations in this paper, we regard the genotypes at the \(N\) SNPs as statistically independent in the source population, even though there may be correlation among nearby tagging SNPs.

2.2 Logistic models

Let \(X_i = 0, 1,\) or \(2\) be the number of minor alleles at locus \(i\), let \(X = (X_1, X_2, \ldots, X_N)'\), and let \(Y = 1\) or \(0\) for diseased or nondiseased subjects. Suppose SNPs \(1, \ldots, M\) are associated with disease, while SNPs \(M + 1, \ldots, N\) are not. Suppose in the source population, the probability of disease is given by \(\mathrm{logit}\{P(Y = 1|X)\} = \mu + \sum_{i=1}^{N} \beta_i X_i\), where \(\mathrm{logit}(u) = \log\{u/(1 - u)\}\). For rare diseases (or for more common diseases over a confined age range such as 10 years), \(P(Y = 1|X) = \exp(\mu + \sum_{i=1}^{N} \beta_i X_i)\). Assuming \(X_1\) is independent of \(X_2, X_3, \ldots, X_N\), we find \(P(Y = 1|X_1) = \exp(\mu^* + \beta_1 X_1)\), where \(\mu^* = \mu + \log\{E\exp(\sum_{i=2}^{N} \beta_i X_i)\}\) and \(E\) is the expectation operator. For the case–control population, it follows that

\[ \mathrm{logit}\{P(Y = 1|X_1)\} = \mu^{**} + \beta_1 X_1, \tag{2.1} \]

where \(\mu^{**} = \mu^* + \log(\pi_1/\pi_0)\), \(\pi_1\) is the proportion of cases in the source population that are in the case–control study, and \(\pi_0\) is the analogous proportion for controls.

2.3 Models for the genetic effect ß

For nondisease-associated SNPs, \(\beta_i = 0\). For disease-associated SNPs, we consider a random effects model and a fixed effects model for \(\beta_i\), \(i = 1, \ldots, M\). Under the random effects model, each \(\beta_i\), \(i = 1, \ldots, M\), is drawn independently from a normal distribution with mean 0 and variance \(\tau^2\). As \(E|\beta_i| = \tau(2/\pi)^{1/2} = 0.798\tau\), large values of \(\tau^2\) correspond to large average effects of disease SNPs. We also consider fixed effects models \(\beta_i = \beta\), for \(i = 1, \ldots, M\), for a fixed \(\beta\).
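The relation \(E|\beta_i| = 0.798\tau\) can be checked numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are hypothetical and chosen only for illustration. It converts a target mean absolute log odds ratio into \(\tau\) and confirms the result by Monte Carlo.

```python
import math
import random

# Mean of |Z| for a standard normal Z: sqrt(2/pi), about 0.798.
HALF_NORMAL_MEAN = math.sqrt(2.0 / math.pi)


def tau_for_mean_abs_beta(mean_abs_beta):
    """Return tau such that E|beta| = mean_abs_beta when beta ~ N(0, tau^2)."""
    return mean_abs_beta / HALF_NORMAL_MEAN


def simulated_mean_abs_beta(tau, draws=200_000, seed=1):
    """Monte Carlo estimate of E|beta| for beta ~ N(0, tau^2)."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0.0, tau)) for _ in range(draws)) / draws


# Random effects scale matched to a fixed effect of log(1.2).
tau = tau_for_mean_abs_beta(math.log(1.2))
print(round(HALF_NORMAL_MEAN, 3), tau, simulated_mean_abs_beta(tau))
```

This is the matching used when a random effects model is compared with a fixed effects model having \(\beta\) equal to the mean absolute random effect.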

2.4 Properties of the parameter estimates and chi-square tests

Maximum likelihood estimation for a cohort study applied to case–control data with (2.1) yields a fully efficient estimate of \(\beta_1\) and a consistent variance estimate, \(\widehat{\mathrm{Var}}(\hat\beta_1)\) (Prentice and Pyke, 1979). We show that if genotypes are independent in the source population, as we assume, and if the disease is rare, then genotypes are independent in the samples of cases and of controls. For simplicity, we prove independence for 2 genotypes, but the result extends to \(N\) genotypes. Consider a fixed set of parameters \(\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_N)'\). Let \(\rho_{ki} = P(X_i = k)\) be the probability that the genotype at locus \(i\) has \(k\) minor alleles in the source population. Under independence of genotypes and the rare disease assumption, \(P(Y = 1|X_i;\mu_i^*,\beta_i) = \exp(\mu_i^* + \beta_i X_i)\) in the source population. Thus,

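\[ P(X_i = k|Y = 1) = \frac{\rho_{ki}\exp(\beta_i k)}{\sum_{l=0}^{2}\rho_{li}\exp(\beta_i l)} \]

in the case–control sample. Likewise,

\[ P(X_j = k'|Y = 1) = \frac{\rho_{k'j}\exp(\beta_j k')}{\sum_{l'=0}^{2}\rho_{l'j}\exp(\beta_j l')} \]

and

\[ P(X_i = k, X_j = k'|Y = 1) = \frac{\rho_{ki}\rho_{k'j}\exp(\beta_i k + \beta_j k')}{\sum_{l=0}^{2}\sum_{l'=0}^{2}\rho_{li}\rho_{l'j}\exp(\beta_i l + \beta_j l')} = P(X_i = k|Y = 1)P(X_j = k'|Y = 1). \]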

Thus, \(X_i\) and \(X_j\) are independent, not only in controls but also in cases.
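The independence of genotypes among cases can be verified numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes the rare-disease form in which the probability of disease is proportional to \(\exp(\beta_i X_i + \beta_j X_j)\), uses arbitrary example genotype frequencies, and all function names are hypothetical. It builds the joint genotype distribution of two loci among cases and compares it with the product of its marginals.

```python
import math
from itertools import product


def case_joint_distribution(rho_i, rho_j, beta_i, beta_j):
    """Joint distribution of (X_i, X_j) among cases, assuming independent
    genotypes in the source population and P(Y=1|X) proportional to
    exp(beta_i * x_i + beta_j * x_j) (rare-disease approximation)."""
    weights = {
        (k, l): rho_i[k] * rho_j[l] * math.exp(beta_i * k + beta_j * l)
        for k, l in product(range(3), repeat=2)
    }
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}


def max_departure_from_independence(joint):
    """Largest absolute gap between the joint and the product of marginals."""
    marg_i = [sum(joint[(k, l)] for l in range(3)) for k in range(3)]
    marg_j = [sum(joint[(k, l)] for k in range(3)) for l in range(3)]
    return max(abs(p - marg_i[k] * marg_j[l]) for (k, l), p in joint.items())


# Arbitrary example genotype frequencies and log odds ratios.
rho_i = [0.64, 0.32, 0.04]
rho_j = [0.49, 0.42, 0.09]
joint = case_joint_distribution(rho_i, rho_j, math.log(1.2), math.log(1.5))
print(max_departure_from_independence(joint))
```

The gap is zero up to rounding error because the exponential risk factorizes across loci; it would not vanish exactly under a logistic risk for a common disease.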

It follows that each SNP can be analyzed separately based on (2.1), resulting in independent chi-square statistics across SNPs. One chi-square test for \(\beta_i = 0\) is the Wald statistic \(C_i = \hat\beta_i^2/\widehat{\mathrm{Var}}(\hat\beta_i)\). A second chi-square test for the \(i\)th SNP is the score test (Armitage, 1955) \(CS_i = U_i^2/\mathrm{Var}_0(U_i)\), where \(U_i = 0.5(\sum_{\text{cases}} X_i - \sum_{\text{controls}} X_i)\), and \(\mathrm{Var}_0(U_i)\) is the null variance of \(U_i\) (supplementary Appendix available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]). In the expression for \(U_i\), the index \(i\) in \(X_i\) refers to the locus and not to the cases or controls.

2.5 Simulations and ranking criteria

We use the asymptotic normal distributions of \(\hat\beta_i\) and \(U_i\) (see supplementary Appendix available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]) to generate realizations of these quantities rapidly in GAUSS (Aptec Systems, 2005). For given \(\beta_i\), we calculate the asymptotic variance \(\sigma_i^2(\beta_i)\) of \(\hat\beta_i\) by taking the expectation of the prospective information matrix, \(I\), from (2.1), with respect to the retrospective sampling distribution. Letting

Formula

We obtain \(\mu_i^{**}\) in these equations from \(0.5 = 0.5\sum_{x=0}^{2}(f_{xi} + g_{xi})P(Y = 1|X_i = x;\mu_i^{**},\beta_i)\).

Conditional on Formula. For nondisease loci and for disease loci under the fixed effects disease model, the conditional variances equal the unconditional variances. Although the results in Sections 2.4 and 2.5 hold for general \(\rho_{ki}\), we assume Hardy–Weinberg equilibrium in simulations. Thus, \(\rho_{0i} = (1 - \eta_i)^2\), \(\rho_{1i} = 2\eta_i(1 - \eta_i)\), and \(\rho_{2i} = \eta_i^2\).

For each of the various parameter settings, we generate \(\mathrm{NSIM} = 10\,000\) independent simulations. From the previous theory, we can generate \(C_i\) as follows. For the fixed effects model, set \(\beta_i = \beta\) for \(i = 1, 2, \ldots, M\) and \(\beta_i = 0\) for \(i = M + 1, \ldots, N\). For the random effects model, draw \(\beta_i\) from the normal distribution \(N(0, \tau^2)\) for \(i = 1, 2, \ldots, M\). For \(i = M + 1, \ldots, N\), set \(\beta_i = 0\). Under either model, draw an independent random allele frequency \(\eta_i\) for each SNP, and conditional on the values of \(\beta_i\) and \(\eta_i\), compute \(\sigma_i^2(\beta_i)\). Then draw \(\hat\beta_i\) from \(N(\beta_i, \sigma_i^2(\beta_i))\) and compute \(\hat\beta_i^2/\sigma_i^2(\beta_i)\). This quantity has the same asymptotic distribution as \(C_i\). A similar approach is used to generate a quantity that is asymptotically equivalent to \(CS_i\), as described in the supplementary Appendix available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

Define \(I(m, \mathrm{ISIM}, T) = 1\) if \(\mathrm{rank}(C_m) > N - T\), namely SNP \(m\) has a test statistic \(C_m\) that is in the top \(T\) ranks of the \(N\) ranked values of \(C_i\) in simulation ISIM, and 0 otherwise. Here, \(m\) indexes one of the \(M\) disease SNPs. The DP is estimated by

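\[ \widehat{\mathrm{DP}} = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{\mathrm{NSIM}}\sum_{\mathrm{ISIM}=1}^{\mathrm{NSIM}} I(m, \mathrm{ISIM}, T). \tag{2.2} \]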

The inner summation divided by NSIM is the proportion of simulations in which \(C_m\) is in the top \(T\) ranks. Because the disease SNPs are exchangeable under either the random effects or the fixed effects model, the average of this quantity over \(m\) yields the average probability that a disease SNP will be found in the top \(T\) ranks. Thus, \(\widehat{\mathrm{DP}}\) is an estimate of the probability that a given disease SNP will have an associated Wald test in the top \(T\) ranks. By exchanging the order of summation in (2.2), we see that \(\widehat{\mathrm{DP}}\) has an alternative interpretation as the average proportion of disease SNPs that are in the top \(T\) ranks. Therefore, PP can be estimated from \(M\,\widehat{\mathrm{DP}}/T\). We evaluate the rankings of the score test by replacing \(C_m\) with \(CS_m\) and \(C_i\) with \(CS_i\) in the definition of \(I(m, \mathrm{ISIM}, T)\). We study \(T\) = 25, 50, 100, 250, 500, 1000, 5000, 10 000, and 25 000, which, when divided by the total number of SNPs, \(N = 500\,000\), yields respective fractions 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.01, 0.02, and 0.05.
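The ranking estimator of DP can be illustrated on a small scale. The Python sketch below is only a simplified stand-in for the GAUSS simulations: it assumes every disease SNP shares a single noncentrality parameter supplied directly by the user (rather than being derived from \(\beta_i\), \(\eta_i\), and the information matrix), uses far fewer SNPs than \(N = 500\,000\) so that it runs quickly, and the function name and arguments are hypothetical.

```python
import random


def estimate_dp(n_snps, n_disease, top_t, ncp, n_sim=200, seed=0):
    """Monte Carlo estimate of DP: the average proportion of disease SNPs
    whose 1-df chi-square statistic falls in the top `top_t` of `n_snps`.

    Assumption: each disease statistic is (Z + sqrt(ncp))^2 and each
    nondisease statistic is Z^2, with independent standard normal Z.
    """
    rng = random.Random(seed)
    shift = ncp ** 0.5
    hits = 0
    for _ in range(n_sim):
        disease = [(rng.gauss(0.0, 1.0) + shift) ** 2 for _ in range(n_disease)]
        null = [rng.gauss(0.0, 1.0) ** 2 for _ in range(n_snps - n_disease)]
        # The T-th largest statistic is the cutoff for being "T-selected".
        cutoff = sorted(disease + null, reverse=True)[top_t - 1]
        hits += sum(1 for c in disease if c >= cutoff)
    return hits / (n_sim * n_disease)


dp_hat = estimate_dp(n_snps=2000, n_disease=1, top_t=25, ncp=6.5)
pp_hat = 1 * dp_hat / 25  # PP estimate M * DP / T with M = 1, T = 25
print(dp_hat, pp_hat)
```

Because only the ratio of \(T\) to the number of SNPs matters to a first approximation, a reduced-size run like this one is useful for checking code but does not reproduce the reported results.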

2.6 Analytic calculation of DP

For all nondisease-associated SNPs, the asymptotic distribution of the Wald test and of the score test is a central chi-square distribution with 1 degree of freedom, \(F\), even if their allele frequencies vary. For disease-associated SNPs, however, the asymptotic distributions of these test statistics vary under the random effects model, or if their allele frequencies vary. For fixed \(\eta_i\) and \(\beta_i\), let \(G_i\) be the distribution of \(C_i\) for \(i = 1, 2, \ldots, M\). Consider a particular disease SNP, without loss of generality, SNP 1. Let \(g_1(c)\) be the density of \(C_1\), and let \(H_1(c)\) be the event that \(C_1\) is in the interval \([c, c + dc)\). Define \(H_2(m;c,M)\) to be the event that \(m\) of the remaining \(M - 1\) disease SNPs have \(C_i\) values greater than \(c\), and let \(H_3(T - m - 1;c,m)\) be the event that no more than \(T - 1 - m\) nondisease SNPs have \(C_i\) values greater than \(c\). Note that the intersection of these 3 events implies that \(C_1\) is in the top \(T\) ranks. Thus, conditionally on \(\eta_i\) and \(\beta_i\), DP is given by

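\[ \mathrm{DP} = \int_0^\infty g_1(c)\sum_{m=0}^{\min(M-1,\,T-1)} P\{H_2(m;c,M)\}\sum_{j=0}^{T-1-m}\binom{N-M}{j}\{1 - F(c)\}^{j}F(c)^{N-M-j}\,dc, \tag{2.3} \]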

where \(F\) is the central chi-square distribution with 1 degree of freedom. Equation (2.3) is the integral over \(c\) of \(P\{H_1(c)\}P\{H_2(m;c,M)|H_1(c)\}P\{H_3(T - m - 1;c,m)|H_1(c),H_2(m;c,M)\}\). If \(M = 1\), \(P\{H_2(m = 0;c,M)\} = G_1(c)\) and \(P\{H_2(m = 1;c,M)\} = 1 - G_1(c)\). If \(M > 1\), the following recursion, similar to that in Gail and others (1979), can be used to calculate the quantity \(P\{H_2(m;c)|c\}\) in (2.3): \(P\{H_2(m;c,M)\} = G_M(c)P\{H_2(m;c,M - 1)\} + \{1 - G_M(c)\}P\{H_2(m - 1;c,M - 1)\}\). Initial conditions are \(P\{H_2(m = 0;c,M = 0)\} = 1\), \(P\{H_2(m = 1;c,M = 0)\} = 0\), and \(P\{H_2(m = -1;c,M)\} = 0\) for all \(M\). The unconditional value of DP is obtained by averaging (2.3) over the distribution of \(\eta_i\) and, under the random effects model, over \(\beta_i\), for \(i = 1, 2, \ldots, M\).

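If all the disease loci have the same distribution, \(G(c)\), with density \(g(c)\), (2.3) simplifies to

\[ \mathrm{DP} = \int_0^\infty g(c)\sum_{m=0}^{\min(M-1,\,T-1)}\binom{M-1}{m}\{1 - G(c)\}^{m}G(c)^{M-1-m}\sum_{j=0}^{T-1-m}\binom{N-M}{j}\{1 - F(c)\}^{j}F(c)^{N-M-j}\,dc. \tag{2.4} \]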

Expressions (2.4), (2.5), and (2.6) apply equally to \(CS_i\). We used (2.4) with a fixed allele frequency and common fixed effect or a common random effect for all disease loci to check the simulation procedures and to compute DP and PP in these special cases. For fixed \(\eta\) and \(\beta\), \(C_i\) has distribution \(G(c) = G^*(c;\beta^2/\sigma^2(\beta))\), where \(G^*\) is a noncentral chi-square distribution with 1 degree of freedom and noncentrality \(\beta^2/\sigma^2(\beta)\). For the random effects model, \(G\) is the average of \(G^*\) over the distribution of \(\beta\). For fixed \(\eta\) and \(\beta\), \(CS_i\) has distribution \(G(c) = G^*(c\{\mathrm{Var}_{\beta_i = 0}(U_i)/\mathrm{Var}_{\beta_i}(U_i)\};\{\delta(\beta)\}^2/\mathrm{Var}_{\beta}(U_i))\), where \(\delta(\beta)\) and \(\mathrm{Var}_{\beta}(U_i)\) are given in the supplementary Appendix available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). For the random effects model, \(G\) is again the average of \(G^*\) over the distribution of \(\beta\).
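For readers who wish to experiment, the analytic DP for a common noncentral chi-square distribution \(G\) can be evaluated by direct numerical integration. The Python sketch below rests on our reading of the events \(H_1\), \(H_2\), and \(H_3\): DP is the integral over \(c\) of the density of \(C_1\), times a binomial probability that \(m\) of the other \(M - 1\) disease statistics exceed \(c\), times the binomial probability that at most \(T - 1 - m\) of the \(N - M\) nondisease statistics exceed \(c\), summed over \(m\). It is not the authors' GAUSS code; the function names, the integration grid, and the change of variable \(s = \sqrt{c}\) are our own choices.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def _binom_cdf_table(n, p, kmax):
    """[P(Bin(n, p) <= k) for k = 0..kmax], in log space; requires kmax < n."""
    if p <= 0.0:
        return [1.0] * (kmax + 1)
    if p >= 1.0:
        return [0.0] * (kmax + 1)
    log_pmf = n * math.log1p(-p)               # log P(Bin = 0)
    log_odds = math.log(p) - math.log1p(-p)
    cdf, total = [], 0.0
    for k in range(kmax + 1):
        total += math.exp(log_pmf)
        cdf.append(min(total, 1.0))
        log_pmf += math.log((n - k) / (k + 1)) + log_odds
    return cdf


def dp_exact(n_snps, n_disease, top_t, ncp, grid=2000):
    """Trapezoid-rule DP when all disease SNPs share noncentrality `ncp`."""
    delta = math.sqrt(ncp)
    n_null = n_snps - n_disease
    m_max = min(n_disease - 1, top_t - 1)
    h = (delta + 10.0) / grid
    total = 0.0
    for step in range(grid + 1):
        s = step * h                            # s = sqrt(c)
        dens = _STD.pdf(s - delta) + _STD.pdf(s + delta)     # density of sqrt(C_1)
        big_g = _STD.cdf(s - delta) - _STD.cdf(-s - delta)   # G(c)
        p_null = 2.0 * _STD.cdf(-s)                          # 1 - F(c)
        null_cdf = _binom_cdf_table(n_null, p_null, top_t - 1)
        inner = 0.0
        for m in range(m_max + 1):
            p_m = (math.comb(n_disease - 1, m)
                   * (1.0 - big_g) ** m * big_g ** (n_disease - 1 - m))
            inner += p_m * null_cdf[top_t - 1 - m]
        weight = 0.5 if step in (0, grid) else 1.0
        total += weight * dens * inner * h
    return total


print(dp_exact(n_snps=500_000, n_disease=1, top_t=25, ncp=6.51))
```

The noncentrality 6.51 in the usage line is an illustrative value only; in the paper the noncentrality follows from \(\beta\), \(\eta\), and \(n\) through \(\sigma^2(\beta)\).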

For the fixed effects model with fixed allele frequencies \(\eta_i = \eta\), \(G\) for the Wald test has noncentrality parameter \(\mathrm{NCP} = \beta^2/\sigma^2(\beta)\). If \(\eta\) varies in a fixed effects model or if \(\beta\) arises from a random effects model, there is no parameter, NCP, that characterizes \(G\) for the Wald test. For descriptive purposes for such cases, we use the "approximate NCP" defined by \(\beta^2E\{1/\sigma^2(\beta)\}\), in which the expectation is estimated analytically for fixed \(\eta_i = \eta\) and empirically in simulations otherwise.

An excellent approximation to (2.4) for T > 20 that requires no integration is

Formula (2.5)

where \(q_m = F^{-1}\{(N - M - T + m)/(N - M + 1)\}\). For \(M = 1\), this approximation is very nearly

\[ \mathrm{DP} \approx 1 - G\{F^{-1}(1 - T/N)\}. \tag{2.6} \]
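A quick way to explore the single-disease-SNP case is the quantile approximation in which SNP 1 is selected when its statistic exceeds the upper \(T/N\) quantile of the central chi-square distribution, so that DP is roughly \(1 - G\{F^{-1}(1 - T/N)\}\). The Python sketch below implements that reading. Two points are assumptions of ours and not taken from the paper: the closed form of the approximation itself, and the helper `rough_ncp`, which uses the local approximation \(\beta^2 n \eta(1 - \eta)\) for the noncentrality in place of \(\beta^2/\sigma^2(\beta)\). All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def central_chi2_quantile(p):
    """F^{-1}(p) for a central chi-square with 1 degree of freedom."""
    return _STD.inv_cdf((1.0 + p) / 2.0) ** 2


def noncentral_chi2_cdf(c, ncp):
    """G(c) for a 1-df noncentral chi-square, using C = (Z + sqrt(ncp))^2."""
    s, delta = math.sqrt(c), math.sqrt(ncp)
    return _STD.cdf(s - delta) - _STD.cdf(-s - delta)


def dp_approx_single(n_snps, top_t, ncp):
    """Approximate DP for M = 1: 1 - G{F^{-1}(1 - T/N)}."""
    q = central_chi2_quantile(1.0 - top_t / n_snps)
    return 1.0 - noncentral_chi2_cdf(q, ncp)


def rough_ncp(beta, n, eta):
    """Assumed local approximation to the noncentrality (ours, not the paper's)
    for n cases and n controls and minor allele frequency eta."""
    return beta ** 2 * n * eta * (1.0 - eta)


ncp = rough_ncp(math.log(1.2), n=1000, eta=0.2673)
for top_t in (25, 500, 25_000):
    print(top_t, round(dp_approx_single(500_000, top_t, ncp), 3))
```

With the allele frequency fixed at 0.2673, the values this produces should be of the same order as the simulated DPs reported for \(\beta = \log(1.2)\) and \(n = 1000\), although they need not agree closely because the simulations average over the CGEMS allele frequency distribution.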

2.7 Analytic calculation of PP

The PP is the expected number of true disease SNPs among the T-selected SNPs, divided by T, namely

\[ \mathrm{PP} = M \times \mathrm{DP}/T. \tag{2.7} \]

Thus, PP can be calculated analytically from the previous calculations for DP.
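As a small illustration, the Python sketch below computes PP as the expected number of selected disease SNPs, \(M \times \mathrm{DP}\), divided by \(T\), which assumes exchangeable disease SNPs sharing a common DP. The function names are hypothetical. Since DP cannot exceed 1, PP is bounded above by \(M/T\), which is why PP must fall once \(T\) exceeds \(M\).

```python
def proportion_positive(n_disease, dp, top_t):
    """PP = (expected number of selected disease SNPs) / T = M * DP / T."""
    return n_disease * dp / top_t


def pp_upper_bound(n_disease, top_t):
    """Largest possible PP, attained when DP = 1."""
    return min(1.0, n_disease / top_t)


# With M = 100 disease SNPs and T = 25 000 selected SNPs, PP cannot exceed 0.004.
print(pp_upper_bound(100, 25_000))
print(proportion_positive(100, 0.9, 25_000))
```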


    3. Results
 

3.1 Detection probability

Figure 1 shows estimates of DP for the fixed effects model from 10 000 simulations for the score test with \(M = 1\) and \(N = 500\,000\) for a case–control study with \(n = 1000\) cases and \(n = 1000\) controls; results are also shown for \(n = 8000\) cases and controls. For \(n = 1000\), DP is near 1.0 for \(\beta = \log(2.0)\) even for \(T = 25\) (not shown). For \(\beta = \log(1.2)\), DP is only 0.07 at \(T = 25\) and rises gradually to 0.69 at \(T = 25\,000\). Because \(T\) is plotted on a log scale, it is apparent that DP increases very slowly with increasing \(T\). For \(\beta = \log(1.2)\) and \(n = 8000\) cases and controls, DP = 0.96 at \(T = 25\) and DP = 0.999 at \(T = 25\,000\). Approximate NCP values are shown for the 8 loci in Figure 1.


 
Fig. 1. Estimated DP for the score test and for the fixed effects model plotted against \(\log(T)\) based on 10 000 simulations with minor allele frequencies drawn at random from the CGEMS distribution. Other parameters are \(N = 500\,000\), \(M = 1\) disease allele, and \(n = 1000\) cases and controls (not bold) or \(n = 8000\) cases and controls (bold). Approximate NCPs are computed as \(\beta^2\) times the average value of \(1/\sigma^2(\beta)\) in the simulations.

 
Figure 2 shows estimates of DP for the random effects model. Even though values of \(\tau\) were chosen such that the mean absolute value of \(\beta\) in the random effects model equaled the value of \(\beta\) in the corresponding fixed effects model, the DP of the score test is less in the random effects model for large \(\tau\) (and large NCP). For example, with \(n = 1000\) and \(\tau = \log(2.0)/0.798\), the DP increases from 0.71 at \(T = 25\) to 0.85 at \(T = 25\,000\), whereas, in the fixed effects model with \(\beta = \log(2.0)\), the DP is 1.000 for \(T = 25\) (locus not shown). For small NCP, the DP of the random effects model exceeds that of a fixed effects model with comparable NCP.


 
Fig. 2. Estimated DP for the score test and for the random effects model plotted against \(\log(T)\) based on 10 000 simulations with minor allele frequencies drawn at random from the CGEMS distribution. Other parameters are \(N = 500\,000\), \(M = 1\) disease allele, and \(n = 1000\) cases and controls (not bold) or \(n = 8000\) cases and controls (bold). Approximate NCPs are computed as \(\beta^2\) times the average value of \(1/\sigma^2(\beta)\) in the simulations, where \(\beta = \log(1.1)\), \(\log(1.2)\), \(\log(1.3)\), \(\log(1.5)\), or \(\log(2)\) correspond to the values of the standard deviation of the random effects distribution, \(\tau = \beta/0.798\).

 
Figures 1 and 2 can be used to assess DP for other \(\beta\) values by interpolation. One can also interpolate based on NCP values; in particular, for other sample sizes \(n^*\) and genetic effects \(\beta^*\), estimate DP from some choice of \(n\) and \(\beta\) in Figure 1 satisfying \(n^*(\beta^*)^2 = n\beta^2\). The approximations (2.5) and (2.6) indicate that DP depends on \(T\) and \(N\) mainly through \(T/N\). Unreported numerical examples show that this is also true for simulated estimates, such as those in Figures 1 and 2. Thus, Figures 1 and 2 can be used for other values, say \(N^*\) and \(T^*\), by referring to the value of \(T\) that satisfies \(T/N = T^*/N^*\). Results for the Wald test are visually indistinguishable from those in Figures 1 and 2 for the score test and are not shown.
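The two rescaling rules in this paragraph amount to simple arithmetic, sketched below in Python. The sketch assumes the matching conditions are a sample size times squared effect that is held constant and an equal selected fraction \(T/N\); the function names are hypothetical.

```python
import math


def equivalent_figure_beta(n_star, beta_star, n_fig):
    """Effect size beta to look up at the figure's sample size n_fig so that
    n_fig * beta^2 equals n_star * beta_star^2."""
    return beta_star * math.sqrt(n_star / n_fig)


def equivalent_figure_t(t_star, n_snps_star, n_snps_fig=500_000):
    """Value of T on the figure's scale that satisfies T/N = T*/N*."""
    return t_star * n_snps_fig / n_snps_star


# A study with 4000 cases and controls and log odds ratio log(1.1) corresponds
# to the n = 1000 curves at beta = 2 * log(1.1), an odds ratio of 1.21.
beta_fig = equivalent_figure_beta(4000, math.log(1.1), 1000)
print(beta_fig, math.exp(beta_fig))

# Selecting the top 50 of 1 000 000 SNPs corresponds to T = 25 in the figures.
print(equivalent_figure_t(50, 1_000_000))
```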

The number of competing disease loci \(M\) has little effect on DP provided \(M < T\), but for \(M > T\), DP declines sharply with increasing \(M\). This phenomenon was demonstrated in simulations that allow for variable allele frequencies (data not shown) as well as by analytic results (scenarios 4–6 in Table 1) for the random effects model with \(\tau = \beta/0.798 = \log(1.2)/0.798\), \(n = 8000\), and \(\eta\) fixed at the mean of the CGEMS distribution, 0.2673. DP for \(M = 100\) is much reduced compared to \(M = 1\) or \(M = 10\) when \(T < 100\), and DP is much less for \(M = 10\) than for \(M = 1\) with \(T = 1\). Similar results hold for the fixed effects model (supplementary Table S1 available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]).



 
Table 1. DP and PP for the Wald test and for the random effects model†

 
For SNPs with fixed minor allele frequencies \(\eta_i = \eta\), DP is lower for \(\eta = 0.05\) than for \(\eta = 0.50\), as seen from scenarios 8 and 9 in Table 1 for the random effects model and supplementary Table S1 available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) for the fixed effects model. From supplementary Appendix equation (A.8) available at Biostatistics online (http://www.biostatistics.oxfordjournals.org), the NCP for \(\eta = 0.5\) is greater than that for \(\eta = 0.05\) by a factor \((0.5)^2/\{(0.05)(0.95)\} = 5.26\).
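The factor 5.26 equals the ratio of genotype variances under Hardy–Weinberg equilibrium, which the Python sketch below reproduces. That the NCP is proportional to the genotype variance, \(2\eta(1 - \eta)\), is our assumption here, consistent with the stated factor; equation (A.8) itself is in the supplementary Appendix and is not reproduced. Function names are hypothetical.

```python
def hwe_frequencies(eta):
    """Genotype probabilities for 0, 1, 2 minor alleles under Hardy-Weinberg."""
    return ((1.0 - eta) ** 2, 2.0 * eta * (1.0 - eta), eta ** 2)


def genotype_variance(eta):
    """Variance of the minor allele count X under Hardy-Weinberg equilibrium."""
    freqs = hwe_frequencies(eta)
    mean = sum(k * f for k, f in enumerate(freqs))
    return sum((k - mean) ** 2 * f for k, f in enumerate(freqs))


ratio = genotype_variance(0.5) / genotype_variance(0.05)
print(round(ratio, 2))  # 5.26
```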

Very large values of \(T\) may be needed to ensure a DP above 0.5 with \(n = 1000\) (Figures 1 and 2). However, it may not be feasible to evaluate many loci in independent data or to develop functional assays or knockout systems to confirm their importance. It is therefore important to study the increment in DP per unit increase in \(T\) in order to determine how rapidly the returns from increasing \(T\) diminish. Figure 3, based on (2.4), plots the increment in DP for the random effects model from \(T - 1\) to \(T\) against \(\log(T)\), for \(T = 2, 3, \ldots, 500\) and for various choices of \(\tau = \beta/0.798\) and sample sizes \(n = 1000\) or \(n = 8000\) cases and controls. For \(M = 1\) (not bold loci), with \(n = 8000\) and high NCP, there is little to be gained by increasing \(T\) beyond 100; for \(M = 1\) and \(n = 1000\) with smaller NCP, little is gained by increasing \(T\) beyond 500. For \(M = 100\) (bold loci) and \(n = 8000\) with large NCP, increases in \(T\) are worthwhile for \(T \le 100\), but not beyond \(T = 200\); for \(M = 100\) and \(n = 1000\) with smaller NCP, values of \(T\) in the range 200–500 still yield some incremental increases in DP. Increments in DP decrease even more rapidly for the fixed effects model (supplementary Figure S1 available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]).


 
Fig. 3. Increment in DP per unit increase in \(T\) for the Wald test and for the random effects model, plotted against \(\log(T)\) for various values of the standard deviation of the random effects distribution, \(\tau = \beta/0.798\), for \(n = 1000\) or 8000 cases and controls, and for \(M = 1\) disease locus (not bold) and \(M = 100\) disease loci (bold). Approximate NCPs are shown and were computed as \(\beta^2\) times the expectation over \(\beta\) of \(1/\sigma^2(\beta)\). We assume \(N = 500\,000\) and calculate the DP from (2.4) for minor allele frequency fixed at \(\eta = 0.2673\), the mean of the CGEMS distribution.

 
3.2 Proportion positive

Factors influencing PP are depicted in Figure 4 for the random effects model, based on (2.4) and (2.7). PP is higher for \(M = 100\) than for \(M = 1\), for every choice of \(\beta\), \(n\), and \(T\). For \(T < M\), a PP near 1.0 is achieved for DP near 1.0, namely with large NCP. As anticipated from (2.7), PP decreases as \(T\) increases for \(T > M\) (Figure 4). For \(M = 100\) (bold loci) and \(T < M\), PP is near 1.0 for \(n = 8000\) (large NCP and hence large DP), but falls to below \(M/T = 0.004\) at \(T = 25\,000\). Even for \(n = 1000\), for \(M = 100\) and \(T = 10\), PP is near 1.0. Thus, if there are \(M = 100\) disease loci, the chance that a randomly selected SNP in the top 10 selected SNPs is a disease SNP is nearly 1.0. For \(M = 1\) (not bold loci), PP falls rapidly for \(T > 1\). If the NCP (and therefore the DP) is large, the PP is near 0.7 for \(T = 1\), but falls to below \(1/T\) as \(T\) increases. Figure 4 and Table 1 illustrate that raising \(T\) to increase DP decreases PP for \(T > M\). The PP curves for a fixed effects model (supplementary Figure S2 available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]) have shapes similar to those for the random effects model (Figure 4). For large comparable NCP values, the fixed effects model has higher PP, but for models with comparable low NCP values, the random effects model has higher PP (compare Figure 4 with supplementary Figure S2 available at Biostatistics online [http://www.biostatistics.oxfordjournals.org] and Table 1 with supplementary Table S1 available at Biostatistics online [http://www.biostatistics.oxfordjournals.org]).


 
Fig. 4. PP on log scale for the Wald test and for the random effects model plotted against \(\log(T)\) for \(M = 1\) (not bold) or \(M = 100\) (bold) and for various values of the standard deviation of the random effects distribution, \(\tau = \beta/0.798\), and numbers of cases and controls, \(n = 1000\) or 8000. Approximate NCPs are shown and were computed as \(\beta^2\) times the expectation over \(\beta\) of \(1/\sigma^2(\beta)\). Other parameters are \(N = 500\,000\) and the minor allele frequency, fixed at \(\eta = 0.2673\), the mean of the CGEMS distribution.

 
3.3 Designs to minimize total cost

Suppose the total cost to identify and confirm the importance of a potential disease SNP is \(2nC_1 + C_2T\), where \(C_1\) is the cost to recruit and genotype a subject for a CCGWAS and \(C_2\) is the cost of a confirmatory study on each of the T-selected candidate SNPs. We call \(C_1\) the "initial study cost" and \(C_2\) the "follow-up cost," which might require laboratory investigations of functionality or confirmatory association studies. As \(n\) increases, the initial study cost increases linearly, but the number \(T\) required to achieve a desired DP decreases, reducing the follow-up cost. Expressed in units of \(C_1\), the total cost is \(2n + (C_2/C_1)T\). We consider cost ratios \(C_2/C_1 = 10\) or 1000. For example, if the initial study cost is \(C_1\) = $1000 per subject, a laboratory study to check the functionality of a locus might cost \(C_2\) = $10 000, and a confirmatory epidemiologic study might cost $1 000 000, corresponding to cost ratios 10 and 1000. As an example, we consider fixed effects models with \(\beta = \log(1.2)\) (bold loci) or \(\beta = \log(1.3)\) (not bold loci), with \(M = 1\) or \(M = 100\), with minor allele frequency fixed at the mean CGEMS value, \(\eta = 0.2673\), and with \(N = 500\,000\). For each fixed \(n\), we can invert (2.4) to find the \(T\) that gives DP = 0.9, and for various cost ratios find that value of \(n\) that minimizes total cost (Figure 5). Total costs are higher for \(\beta = \log(1.2)\), because larger values of \(n\) and \(T\) are needed to yield DP = 0.9 (Figure 1). For \(\beta = \log(1.3)\) and \(M = 1\), the values of \((n, T)\) that minimized total cost were (1948, 38) and (2711, 1), respectively, for cost ratios 10 and 1000; the corresponding total costs were 4276 and 6422, in units of \(C_1\). For \(\beta = \log(1.3)\) and \(M = 100\), the values of \((n, T)\) that minimized total cost were (1944, 128) and (2598, 91) with respective total costs 5168 and 96 196. The optimal CCGWAS size is little affected by having \(M = 100\) instead of \(M = 1\), but the total cost can be much larger because larger numbers, \(T\), are needed to assure DP = 0.9.
For \(\beta = \log(1.2)\) and \(M = 1\), the values of \((n, T)\) that minimized total cost were (3810, 77) and (5673, 1) with respective total costs 8390 and 12 346. If \(M = 100\) instead, the results are (3801, 168) and (5436, 91) with respective total costs 9282 and 101 872. These calculations indicate that the optimal \(n\) increases with decreasing genetic effect size and with increasing cost ratio. The optimal \(T\) increases with decreasing genetic effect and with the number of disease genes, \(M\), but decreases with increasing cost ratio.
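The reported total costs follow directly from the cost function \(2n + (C_2/C_1)T\). The short Python sketch below recomputes them from the reported optimal \((n, T)\) pairs; it does not redo the optimization, which requires inverting (2.4), and the function and variable names are hypothetical.

```python
def total_cost(n, top_t, cost_ratio):
    """Total cost in units of C1: 2n subjects plus cost_ratio per selected SNP."""
    return 2 * n + cost_ratio * top_t


# (odds ratio, M, cost ratio C2/C1, optimal n, optimal T, reported total cost)
REPORTED = [
    (1.3, 1, 10, 1948, 38, 4276),
    (1.3, 1, 1000, 2711, 1, 6422),
    (1.3, 100, 10, 1944, 128, 5168),
    (1.3, 100, 1000, 2598, 91, 96196),
    (1.2, 1, 10, 3810, 77, 8390),
    (1.2, 1, 1000, 5673, 1, 12346),
    (1.2, 100, 10, 3801, 168, 9282),
    (1.2, 100, 1000, 5436, 91, 101872),
]

for odds_ratio, m, ratio, n, top_t, reported in REPORTED:
    cost = total_cost(n, top_t, ratio)
    print(odds_ratio, m, ratio, cost, cost == reported)
```

All eight reported totals agree with the formula, which also makes the trade-off visible: at cost ratio 1000 with \(M = 100\), the follow-up term \(1000 \times 91\) dominates the genotyping term.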


 
Fig. 5. Total cost, in units of \(C_1\), the cost per subject in the CCGWAS, plotted as a function of \(n\), the numbers of cases and controls for the Wald test and for a fixed effects model. Cost ratios \(C_2/C_1 = 10\) and 1000 are studied, where \(C_2\) is the cost of a follow-up study, and the fixed effects were \(\beta = \log(1.2)\) (bold) or \(\beta = \log(1.3)\) (not bold). Other parameters were \(N = 500\,000\), \(M = 1\) disease allele, and the minor allele frequency, fixed at \(\eta = 0.2673\), the mean of the CGEMS distribution.

 

    4. Discussion
 
We studied DP, the probability that a given disease locus will be T-selected based on the largest T chi-square values (or corresponding smallest p-values). DP has the alternative interpretation as the proportion of exchangeable disease loci that will be T-selected. We have shown how DP can be calculated based on the underlying logistic risk model in the source population and on the case–control sampling. We calculated DP analytically as the average of (2.3) over the distribution of allele frequencies, and, for random effects models that allow for heterogeneity of disease SNP effects, over the corresponding log odds ratios. To study realistic designs, we simulated the results in Figures 1 and 2 based on the distribution of allele frequencies in CGEMS. We showed that if genotypes are independent in the source population and the disease is rare, chi-square tests for individual SNPs are independent, and we presented asymptotic theory that permits one to generate realizations for simulations rapidly.

A number of factors affect DP, especially the magnitude of the disease SNP effect, \(\beta\) (Figures 1 and 2). DP depends on the number of nondisease loci, \(N - M\), but mainly through the ratio \(T/N\); large \(N\) may require very large \(T\) to ensure high DP. Competition among multiple disease loci (\(M > 1\)) can reduce DP, but this effect is only large when \(M > T\). DP is less in SNPs with small minor allele frequencies.

Our assumption that the SNP genotypes are independent is valid if the original N = 500000 tagging SNPs are independent. Zaykin and Zhivotovsky (2005) showed that correlations of p-values within linkage disequilibrium blocks of SNPs or among such blocks have little effect on selection probabilities similar to DP, because such correlations do not extend beyond a small portion of the genome. It is likely, therefore, that DP is also little affected by such correlations.

Our work complements and differs from that of Zaykin and Zhivotovsky (2005). First, our criterion, DP, is not the same as the criteria they evaluated, namely the probability that all disease SNPs would have p-values lower than that of the ith smallest p-value among nondisease SNPs (their equation A.6) and the probability that at least one disease SNP will have a p-value below that of the ith smallest p-value among nondisease SNPs (their equation A.3). Although each of these criteria may be useful in some circumstances, we believe that DP gives the best overall assessment of the ability to detect disease SNPs. Second, we relate DP to the parameters in a logistic risk model with case–control sampling.

Satagopan and others (2004) studied ranking procedures in 1- and 2-stage designs and computed the probability that at least a desired number, m, of the M disease SNPs would have normally distributed trend tests exceeding the largest normally distributed trend test among nondisease SNPs. This criterion differs from DP except in the case T = M = m = 1. Moreover, the calculations assume that it is known whether the minor or major allele is positively associated with disease. Our results are based on chi-square statistics that are invariant to this polarity, and are therefore more suited to exploratory CCGWASs.

Wacholder and others (2004) made recommendations regarding alpha levels and power required to control the "false-positive report probability" (FPRP), namely the probability that a genetic variant selected on the basis of rejecting the null hypothesis of no association would be a false discovery. The FPRP concept, however, does not require any consideration of how many SNPs are examined. Each is considered on its own. In contrast, in the present paper we define "selection" not in terms of a hypothesis test but in terms of a ranking of chi-square tests (or p-values) for all SNPs. Because the rank selection criterion depends not only on the chi-square test for a given SNP but also on the chi-square tests for all other SNPs, we have used the terms DP and PP rather than the analogous terms, sensitivity or true positive fraction (Pepe, 2003) and positive predictive value (Vecchio, 1966), that are used for diagnostic tests in independent subjects.

Our methods give insight into aspects of CCGWAS design. Low PP was found for studies with modest effect sizes or case–control sample sizes, illustrating the futility of trying to identify disease loci from a CCGWAS with small n. Even for n = 1000 cases and controls, one requires T = 4710 to achieve DP = 0.5 for a fixed effect β = log(1.2) with M = 1 and fixed η = 0.2673. The corresponding PP = 0.0001. For M = 100, T = 4758 is required to attain DP = 0.5, and PP = 0.0105 in this case. Thus, the vast majority of selected SNPs will be false positives. If n = 8000 cases and controls are used instead and M = 1, then T = 1 yields DP = 0.992 and the corresponding PP = 0.992. For M = 100, T = 50 is required to yield DP = 0.5, and the corresponding PP = 1.00. Thus, large samples are needed to detect a disease SNP with odds ratio 1.2 reliably with a value of T that is practical for follow-up studies. CCGWASs with only a few hundred cases and controls will have low DP unless T is extravagantly large, in which case the PP will be very small, necessitating too many follow-up studies to be feasible.
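The PP values quoted above are consistent with reading PP as the expected fraction of the T selected SNPs that are disease SNPs, that is, M × DP / T when the M disease SNPs are exchangeable. This identity is inferred from the definitions rather than quoted from this section, and the helper name `proportion_positive` is hypothetical. A minimal check of the arithmetic:

```python
def proportion_positive(dp, M, T):
    """Expected fraction of the T selected SNPs that are disease SNPs.

    Assumes the M disease SNPs are exchangeable, so the expected number
    selected is M * dp.
    """
    return M * dp / T


if __name__ == "__main__":
    # (DP, M, T) combinations taken from the paragraph above.
    designs = [
        ("n = 1000, M = 1", 0.5, 1, 4710),
        ("n = 1000, M = 100", 0.5, 100, 4758),
        ("n = 8000, M = 1", 0.992, 1, 1),
        ("n = 8000, M = 100", 0.5, 100, 50),
    ]
    for label, dp, M, T in designs:
        print(f"{label}: PP = {proportion_positive(dp, M, T):.4f}")
```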

Our methods also provide information on how to allocate resources in a research program that supports both discovery of promising SNPs in an initial CCGWAS and follow-up studies to evaluate initial leads among the T-selected SNPs. When follow-up costs for each selected SNP are 10 or 1000 times the cost of a subject in the CCGWAS, or when genetic effects are small, then the initial CCGWAS should be large, in order to limit the number of SNPs that require follow-up (Figure 5). Even larger n would be recommended for the model of random genetic effects (unreported data).
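The trade-off can be illustrated with simple arithmetic. The sketch below assumes, as our reading of the Figure 5 caption rather than a formula given in this section, that total cost in units of C1 is the cost of genotyping 2n subjects plus T follow-up studies at C2/C1 each. The function name `total_cost` is hypothetical, and the two (n, T) pairs are those quoted in this Discussion for β = log(1.2) and M = 1; note that they correspond to different DP values (0.5 and 0.992), so the comparison understates the advantage of the larger study.

```python
def total_cost(n, T, cost_ratio):
    """Total cost in units of C1 (assumed form, see lead-in).

    n: number of cases (and of controls), so 2 * n subjects are genotyped;
    T: number of selected SNPs sent to follow-up;
    cost_ratio: C2 / C1, the cost of one follow-up study per unit subject cost.
    """
    return 2 * n + T * cost_ratio


if __name__ == "__main__":
    for cost_ratio in (10, 1000):
        small = total_cost(n=1000, T=4710, cost_ratio=cost_ratio)
        large = total_cost(n=8000, T=1, cost_ratio=cost_ratio)
        print(f"C2/C1 = {cost_ratio}: n = 1000 costs {small}, n = 8000 costs {large}")
```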

We have extended the simulation methods to compare 1- and 2-stage designs. For a fixed effects model with M = 1 and β = log(1.2), a 1-stage design with n = 8000 cases and controls has DP = 0.948 with T = 10. If instead T = 25000 SNPs are selected at stage 1 based on n = 1000 cases and controls, then among the top 10 SNPs in a second stage based on n = 7000 independent cases and controls, the DP = 0.677. For a random effects model with τ = log(1.2)/0.798, the corresponding DP estimates are 0.596 for the 1-stage design and 0.485 for the 2-stage design. Thus, the 2-stage design with n = 1000 cases and controls in the first stage can lead to an appreciable loss in DP.


    Funding
 
Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute.


    ACKNOWLEDGMENTS
 
Conflict of Interest: None declared.


    REFERENCES
 

    Aptech Systems. The Gauss System, Version 6 (2005) Maple Valley, WA: Aptech Systems, Inc.

    Armitage P. Tests for linear trends in proportions and frequencies. Biometrics (1955) 11:375–386.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate—a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B-Methodological (1995) 57:289–300.

    Clayton DG, Walker NM, Smyth DJ, Pask R, Cooper JD, Maier LM, Smink LJ, Lam AC, Ovington NR, Stevens HE. and others. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nature Genetics (2005) 37:1243–1246.

    Devlin B, Roeder K. Genomic control for association studies. Biometrics (1999) 55:997–1004.

    Gail MH, Weiss GH, Mantel N, O'Brien SJ. Solution to the generalized birthday problem with application to allozyme screening for cell-culture contamination. Journal of Applied Probability (1979) 16:242–251.

    Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction (2003) New York: Oxford University Press.

    Pfeiffer RM, Gail MH. Sample size calculations for population- and family-based case-control association studies on marker genotypes. Genetic Epidemiology (2003) 25:136–148.

    Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika (1979) 66:403–411.

    Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics (1997) 53:1253–1261.

    Satagopan JM, Venkatraman ES, Begg CB. Two-stage designs for gene-disease association studies with sample size constraints. Biometrics (2004) 60:589–597.

    Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics (2006) 38:209–213.

    Vecchio TJ. Predictive value of a single diagnostic test in unselected populations. New England Journal of Medicine (1966) 274:1171–1173.

    Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. Journal of the National Cancer Institute (2004) 96:434–442.

    Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N. and others. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics (2007) 39:645–649.

    Zaykin DV, Zhivotovsky LA. Ranks of genuine associations in whole-genome scans. Genetics (2005) 171:813–823.

    Received May 1, 2007; revised July 23, 2007; accepted for publication July 27, 2007.





