Biostatistics Advance Access published online on November 13, 2007
Biostatistics, doi:10.1093/biostatistics/kxm039
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Genetic model selection in two-phase analysis for case–control association studies
Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, Bethesda, MD 20892-7931, USA
Department of Statistical Science, Southern Methodist University, 3225 Daniel Avenue, PO Box 750332, Dallas, TX 75275-0332, USA ngh{at}mail.smu.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
The Cochran–Armitage trend test (CATT) is well suited for testing association between a marker and a disease in case–control studies. When the underlying genetic model for the disease is known, the CATT optimal for the genetic model is used. For complex diseases, however, the genetic models of the true disease loci are unknown. In this situation, robust tests are preferable. We propose a two-phase analysis with model selection for the case–control design. In the first phase, we use the difference of Hardy–Weinberg disequilibrium coefficients between the cases and the controls for model selection. Then, an optimal CATT corresponding to the selected model is used for testing association. The correlation of the statistics used for selection and the test for association is derived to adjust the two-phase analysis with control of the Type-I error rate. The simulation studies show that this new approach has greater efficiency robustness than the existing methods.
Keywords: Cochran–Armitage trend test; Disease risk; Efficiency robustness; Hardy–Weinberg disequilibrium; SNP
| 1. INTRODUCTION |
|---|
|
|
|---|
The case–control design is often employed for testing association between a marker and a disease. A single-nucleotide polymorphism (SNP) is a marker with 2 alleles, denoted as A and B, where A is the at-risk one. Its genotypes are denoted as AA, AB, or BB. Testing genetic association is equivalent to finding any general association between the outcome and the observed genotypes in a 2 x 3 contingency table, for which Pearson's chi-squared test can be used. It has an asymptotic chi-squared distribution with 2 degrees of freedom (df) under the null hypothesis. However, as A is the risk allele, an individual with genotype AA is more likely to have disease than an AB individual, who in turn is more likely to have disease than a BB individual. The Cochran–Armitage trend test (CATT) (Cochran, 1954
Wittke-Thompson and others (2005)
studied the directions (signs) of the Hardy–Weinberg disequilibrium (HWD) coefficients when Hardy–Weinberg equilibrium (HWE) holds in the population and used these to confirm the underlying genetic model. We further show that HWD coefficients can be used to divide the parameter space into 4 unique regions, from which genetic models can be selected. The selection of genetic models based on the above theory is, however, robust to departure from HWE in the population. Next, we propose a two-phase analysis for genetic association with model selection. In the first phase, we apply the difference of HWD coefficients between the cases and the controls to classify the underlying genetic model into 3 categories: the recessive region, additive/multiplicative region, or dominant region. In the second phase, we apply the appropriate CATT, optimal for the selected model, to test genetic association. Such two-phase selection-testing analysis has been studied by Hogg (1974)
and extensively studied in clinical trials (e.g. Thall and others, 1988
), but not in genetic analysis. We further derive and adjust the asymptotic correlation between the 2 analyses so that the two-phase analysis with model selection has an approximately correct size. Extensive simulation studies are also conducted to compare this new approach with the existing approaches. Finally, for illustration, we apply the two-phase analysis with model selection to real data.
| 2. NOTATION AND MODEL |
|---|
|
|
|---|
Denote the allele frequencies in cases and controls as p = Pr(A|case) and q = Pr(A|control) and the genotypes as g0 = BB, g1 = AB, and g2 = AA. Assume that r cases and s controls are independently sampled. The observed counts for genotypes (g0, g1, g2) are (r0, r1, r2) in cases and (s0, s1, s2) in controls. The observed genotype counts (r0, r1, r2) and (s0, s1, s2) follow multinomial distributions (r; p0, p1, p2) and mul(s; q0, q1, q2), respectively, where pi = Pr(gi|case) and qi = Pr(gi|control) for i = 0, 1, 2. The null hypothesis of no association can be stated as H0 : pi = qi for i = 0, 1, 2. Denote the disease prevalence as k = Pr(case). Then, pi = Pr (gi)fi / k and qi = Pr(gi)(1 – fi)/(1 – k), where fi = Pr(case|gi) > 0 is the penetrance and
. The null hypothesis is equivalent to H0: f0 = f1 = f2 = k. A genetic model is referred to as a relationship between 2 genotype relative risks, denoted as
1 = f1 /f0 and
2 = fs/f0. The null and alternative hypotheses can be equivalently stated as H0:
1 =
2 = 1 against H1:
2
1
1 with at least one inequality. Then, under H0, p = q. The recessive (REC), additive (ADD), multiplicative (MUL), and dominant (DOM) models refer to
1 = 1,
1 = (
2 + 1)/2,
2 = 
, and
1 =
2, respectively.
CATT models association between the outcome as a linear trend in genotypes. In a logistic regression model for case–control data with a single covariate (genotype) coded with scores (0, x, 1) for (g0, g1, g2), where 0
x
1, the score statistic is equivalent to the CATT statistic (Sasieni, 1997
)
![]() | (2.1) |
where ni = ri + si for i = 0, 1, 2, n = r + M, and (x0, x1, x2) = (0, x, 1). Under H0, for a given x
[0, 1], Zx asymptotically follows a standard normal distribution N(1, 0). Sasieni (1997)
, Slager and Schaid (2001)
, and Zheng and others (2003)
showed that the optimal choices of x for the REC, ADD (MUL), and DOM models are x = 0, 1/2, and 1, respectively.
In genetic association studies, departure from HWE in cases has also been used to test genetic association in the case–control design (Nielsen and others, 1998
). However, using departure from HWE in cases as a test statistic has lower power for the additive model and no power at all for the multiplicative model (Nielsen and others, 1998
; Deng and others, 2000
; Song and Elston, 2003
, 2006
). In general, Pr(g0) = {Pr(B)}2 + FPr(A)Pr(B), Pr(g1) = 2Pr(A)Pr(B)(1 – F), and Pr(g2) = {Pr(A)}2 + FPr(A)Pr(B), where F is the Wright coefficient of inbreeding and HWE holds in the population if and only if F = 0.
| 3. TWO-PHASE ANALYSIS WITH GENETIC MODEL SELECTION |
|---|
|
|
|---|
HWE is usually tested using the HWD coefficients (Weir, 1996
), denoted as
. When HWE holds in the population,
, under which Wittke-Thompson and others (2005)
studied the directions of
in cases (
) and controls (
) and their relations with the underlying genetic model. They considered an additive model as
regardless of the reference penetrance
, but here we use the definition
.
Additional relations among
,
, and
are as follows: (i)
does not imply that
or
and vice versa; (ii) under the null hypothesis of no association, it follows from
that
. This implies that, when HWE holds in the population
, genetic association can be tested under the null hypothesis
. On the other hand, under the alternative hypothesis of association, whether or not HWE holds in the population cannot be tested using the retrospective case–control data. Song and Elston (2006)
considered a Hardy–Weinberg disequilibrium trend test (HWDTT) based on
. Here, following Wittke-Thompson and others (2005)
, we assume that HWE holds in the population and use the HWDTT for genetic model selection. The sensitivity of departure from HWE is examined empirically in Section B of the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). Substituting
and
into
and
with
, one obtains the following:
and
. Wittke-Thompson and others (2005)
proved that
and
under the REC model (
),
and
under the DOM model (
), and
and
under the MUL model (
). Under the ADD model (
), it can be shown similarly that both
and
are negative. The 4 genetic models with the directions of
are plotted in Figure 1. The open triangle, formed by REC and DOM in Figure 1, indicates the null and alternative hypotheses. The vertex
corresponds to the null hypothesis. The signs of
for the 4 models are indicated with the labels of the genetic models. The curve CD corresponds to
, the condition used to determine the sign of
. It follows from Figure 1 that the 4 genetic models plus the curve CD divide the alternative hypothesis into 4 mutually exclusive open regions:
,
,
, and
(boundaries are excluded). The signs of
in the 4 open regions can be uniquely determined when HWE holds in the population (see Section A of the supplementary material available at Biostatistics online). Based on the above analysis, the difference of HWD coefficients between cases and controls can be used to classify the REC and DOM models. For example, the REC and DOM models imply that
and
, respectively.
|
When the genetic model is unknown,
has the greatest efficiency robustness among the family of statistics
against model misspecification and is often used in applications (Sasieni, 1997
; Slager and Schaid, 2001
; Freidlin and others, 2002
; Zheng and others, 2005
). Other robust tests considered by Freidlin and others (2002)
are
and the MERT given by
, where
, is a consistent estimate of the asymptotic null correlation between
and
. The MERT and MAX have been shown to have high-efficiency robustness in hypothesis testing in survival and ordered categorical data, and genetic data analyses when the model underlying the data is uncertain, but a family of scientifically plausible models and their normally distributed tests are available (Gastwirth, 1966
, 1985
; Davies, 1977
; Zucker and Lakatos, 1990
; Whittemore and Tu, 1998
; Freidlin and others, 1999
; Gastwirth and Freidlin, 2000
; Shih and Whittemore, 2001
).
We propose a two-phase analysis where we classify the underlying genetic model in the first phase followed by testing association using the optimal CATT for the selected model in the second phase. Song and Elston (2006)
considered testing association using the test statistic
|
|
where
and
are the estimates of
and
, respectively, with
and
.
From Figure 1 and the signs of
for the 4 genetic models and regions, it seems that the ADD and MUL models cannot be distinguished by the signs as
have the same sign
under the 2 models. However,
is asymptotically optimal for both ADD and MUL models under local alternatives. Thus, it is not necessary to distinguish them. On the other hand, the ADD (MUL) model may be distinguished from the REC and DOM models by the signs of
. Thus, in the first phase, we classify the genetic model in the recessive region (
) if
, the dominant region (
) if
, and in the middle region (
and
) containing the ADD or MUL models otherwise, where
is a prespecified real number. In the second phase, we choose
as a test statistic for association, given by
if
,
if
, and
if otherwise, where we choose
. The performance of the above model selection is examined and the results are reported in Section B of the supplementary material available at Biostatistics online. When
, the probabilities to select correct REC/DOM and ADD/MUL models are greater than 65% and 85% and are robust to departure from HWE.
To correct the size of
analytically, the correlation between the two phases of the analysis needs to be derived. In Section C of the supplementary material available at Biostatistics online, we obtain
,
, and
, where
under
. Note that
and
are asymptotically independent.
Suppose in the second phase the null hypothesis corresponding to the jth SNP (
) is rejected at the level
using the normal distribution. Then, to control the overall level at the given
in the two-phase analysis with model selection for M SNPs, we have 
and
, where
is the upper
th percentile of
. Using the bivariate normal distribution of (Zx, ZHWDTT) with correlation
x,
* satisfies
![]() | (3.1) |
where
,
, and
. Given
and M and under HWE in the population, this shows that
* only asymptotically depends on p (=q, under
), which can be estimated consistently using the observed data (Weir, 1996
) by
. The supplementary material available at Biostatistics online (Section E) includes a SAS IML macro to calculate
*. Table 1 (the last column) reports the simulated levels of significance using the adjusted significance level (
*) for
. The adjusted level
* was calculated for each simulated data set. The sizes of the analytically adjusted level (
*) based on (3.1) are not significantly different from the nominal level. For rare allele frequency (
), the level
* leads to a conservative test. The last part of Table 1 also shows that departure from HWE does not inflate the Type-I error rates of the 4 genotype-based test statistics (see comments of Sasieni, 1997
).
|
| 4. NUMERICAL RESULTS |
|---|
|
|
|---|
In Section D of the supplementary material available at Biostatistics online, we reported simulation studies to compare power for various approaches. Our results show that the two-phase analysis with model selection (
) has greater efficiency robustness than MAX, although their power difference is about 2–3%. The
has at least 95% efficiency under ADD/MUL and 100–225% (100–123%) under REC (DOM) relative to the CATT.
To illustrate the use of the two-phase model selection, we consider the top 2 SNPs in a genome-wide case–control study for age-related macular degeneration (AMD) presented in Klein and others (2005)
. After the quality control of genotyping, there were 103 611 SNPs left for testing association. Each SNP was tested at the level
. The top 2 SNPs (rs380390 and rs1329428) with smallest p-values were found in a close region and were further studied by biology and haplotype analysis. The results show that this region is associated with AMD. For both SNPs, the underlying genetic models are uncertain.
The data set for the top SNP (rs380390) was given by
for cases and
for controls. The CATT with the additive score is
with unadjusted p-value
. The MERT is
with unadjusted p-value
. Further, we have
and its simulated p-value is
based on
replicates, which indicates that the association is marginally significant. For the model selection, using (3.1), we obtained the adjusted significance level
and
. This suggests that we use
with p-value
. After Bonferroni correction, the adjusted p-value is
. Thus, the SNP is significantly associated with the AMD after Bonferroni correction. Note that the two-phase analysis selected the largest CATT in the first phase.
The second top SNP (rs1329428) was given by
for cases and
for controls. Using (3.1), the adjusted significance level is
. The CATT
with unadjusted p-value
. On the other hand,
with unadjusted p-value
. The MAX is equal to
and its simulated p-value is
. Since
, we select the CATT with the recessive score in the second phase and
with unadjusted p-value
, which is not significant after Bonferroni correction. In fact, none of the 4 nonadaptive robust tests reached statistical significance; however, the two-phase test selected the appropriate CATT.
| 5. DISCUSSION |
|---|
|
|
|---|
In testing genetic association between a SNP and a disease using a case–control design, we studied a new two-phase analysis, where an underlying genetic model is selected in the first phase based on the difference of HWD coefficients between cases and controls. Then, in the second phase, we use the CATT for association with the score optimal for the selected model. Since two-phase analyses are correlated, we derived their asymptotic null correlations and used them to adjust the level in the two-phase analysis to control the Type-I error rate.
Our simulation results demonstrate that this new approach compares favorably with other existing approaches, including the CATT with the additive score (Sasieni, 1997
) and the robust tests (MERT and MAX) of Freidlin and others (2002)
. When the underlying genetic model is uncertain, it has the greatest efficiency robustness among all the tests considered here. Thus, it is useful for genetic data analysis using case–control data. Since it has been shown that MAX and likelihood ratio tests have comparable power, we did not compare our approach to the likelihood ratio test.
One assumption in this article is that HWE holds in the population. Although model selection can be uniquely determined when HWE holds in the population, our empirical results showed that a moderate departure from HWE had little effect on the power performance of the model selection under various genetic models. Nor did departure of HWE inflate Type-I error rates. It is known that case–control designs for genetic association may be affected by population stratifications and/or cryptic relatedness (Devlin and Roeder, 1999
; Pritchard and Rosenberg, 1999
; Satten and others, 2001
; Zheng and others, 2005
; Gorroochurn and others, 2006
). This needs to be examined in the future. In addition, we chose a fixed threshold value as the upper 95th normal percentile in the model selection phase. Data-driven threshold levels may further improve the power. This also needs future research.
| ACKNOWLEDGMENTS |
|---|
The authors gratefully acknowledge the editor, the associate editor, and the 2 anonymous referees for their constructive comments. The authors would also like to thank Prof. William R. Schucany and Ms Julia Kozlitina for their suggestions which resulted in a much improved version of the manuscript.
Conflict of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Armitage P. Tests for linear trends in proportions and frequencies. Biometrics (1955) 11:375–386.[Medline]
Cochran WG. Some methods for strengthening the common chi-square tests. Biometrics (1954) 10:417–451.[Medline]
Davies RB. Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika (1977) 64:247–254.
Deng HW, Chen WM, Recker RR. QTL fine mapping by measuring and testing for Hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of population. American Journal of Human Genetics (2000) 66:1027–1045.[CrossRef][Web of Science][Medline]
Devlin B, Roeder K. Genomic control for association studies. Biometrics (1999) 55:997–1004.[CrossRef][Web of Science][Medline]
Freidlin B, Podgor MJ, Gastwirth JL. Efficiency robust tests for survival or ordered categorical data. Biometrics (1999) 55:883–886.[CrossRef][Web of Science][Medline]
Freidlin B, Zheng G, Li Z, Gastwirth JL. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Human Heredity (2002) 53:146–152.[CrossRef][Web of Science][Medline]
Gastwirth JL. On robust procedures. Journal of the American Statistical Association (1966) 61:929–948.[CrossRef][Web of Science]
Gastwirth JL. The use of maximin efficiency robust tests in combining contingency tables and survival analysis. Journal of the American Statistical Association (1985) 80:380–384.[CrossRef][Web of Science]
Gastwirth JL, Freidlin B. On power and efficiency robust linkage tests for affected sibs. Annals of Human Genetics (2000) 64:443–453.[CrossRef][Web of Science][Medline]
Gorroochurn P, Heiman GA, Hodge SE, Greenberg DA. Centralizing the non- central chi-square: a new method to correct for population stratification in genetic case-control association studies. Genetic Epidemiology (2006) 30:277–289.[CrossRef][Web of Science][Medline]
Graubard BI, Korn EL. Choice of column scores for testing independence in ordered 2 x K contingency tables. Biometrics (1987) 43:471–476.[CrossRef][Web of Science][Medline]
Hogg RV. Adaptive robust procedures—partial review and some suggestions for future applications and theory. Journal of the American Statistical Association (1974) 69:909–923.[CrossRef][Web of Science]
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, Sangiovanni JP, Mane SM, Mayne ST, et al. Complement factor H polymorphism in aged-related macular degeneration. Science (2005) 308:385–389.
Nielsen DM, Ehm MG, Weir BS. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. American Journal of Human Genetics (1998) 63:1531–1540.[CrossRef][Web of Science][Medline]
Pritchard MJ, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. American Journal of Human Genetics (1999) 65:220–228.[CrossRef][Web of Science][Medline]
Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics (1997) 53:1253–1261.[CrossRef][Web of Science][Medline]
Satten GA, Flanders WD, Yang Q. Account for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. American Journal of Human Genetics (2001) 68:466–477.[CrossRef][Web of Science][Medline]
Shih M, Whittemore AS. Allele-sharing among affected relatives: nonparametric methods for identifying genes. Statistical Methods in Medical Research (2001) 10:27–55.
Slager SL, Schaid DJ. Case-control studies of genetic markers: power and sample size approximations for Armitage's test for trend. Human Heredity (2001) 52:149–153.[CrossRef][Web of Science][Medline]
Song K, Elston RC. The Hardy-Weinberg disequilibrium (HWD) measure and test statistics for a disease-susceptibility locus with multiple alleles allowing for an inbreeding coefficient (F). Genetica (2003) 119:269–293.[CrossRef][Web of Science][Medline]
Song K, Elston RC. A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Statistics in Medicine (2006) 25:105–126.[CrossRef][Web of Science][Medline]
Thall PF, Simon R, Ellenberg SS. Two-stage selection and testing designs for comparative clinical trials. Biometrika (1988) 75:303–310.
Wang K, Sheffield VC. A constrained-likelihood approach to marker-trait association studies. American Journal of Human Genetics (2005) 77:768–780.[CrossRef][Web of Science][Medline]
Weir BS. Genetic Data Analysis II: Methods for Discrete Population Genetic Data (1996) Sunderland, MA: Sinauer Associations Inc.
Whittemore AS, Tu I-P. Simple, robust linkage tests for affected sibs. American Journal of Human Genetics (1998) 62:1228–1242.[CrossRef][Web of Science][Medline]
Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departure from Hardy-Weinberg equilibrium. American Journal of Human Genetics (2005) 76:967–986.[CrossRef][Web of Science][Medline]
Zheng G, Freidlin B, Gastwirth JL. Comparison of robust tests for genetic association using case-control studies (2nd special issue in honor of E.L. Lehmann). IMS Lecture Notes—Monograph Series (2006) 49:253–265.[CrossRef]
Zheng G, Freidlin B, Li Z, Gastwirth JL. Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical Journal (2003) 45:335–348.[CrossRef][Web of Science]
Zheng G, Freidlin B, Li Z, Gastwirth JL. Genomic control for association studies under various genetic models. Biometrics (2005) 61:186–192.[CrossRef][Web of Science][Medline]
Zucker DM, Lakatos E. Weighted log rank type statistics for comparing survival curves when there is a time lag in the effectiveness of treatment. Biometrika (1990) 77:853–864.
Received April 16, 2007; revised July 25, 2007; revised October 1, 2007; accepted for publication October 12, 2007.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


