Skip Navigation



Biostatistics Advance Access published online on May 8, 2007

Biostatistics, doi:10.1093/biostatistics/kxm011
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
9/1/81    most recent
kxm011v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Chen, Y.-H.
Right arrow Articles by Carroll, R. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chen, Y.-H.
Right arrow Articles by Carroll, R. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Retrospective analysis of haplotype-based case–control studies under a flexible model for gene–environment association

Yi-Hau Chen

Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, People's Republic of China

Nilanjan Chatterjee*

Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Boulevard, EPS 8038, Rockville, MD 20852, USA chattern{at}mail.nih.gov

Raymond J. Carroll

Department of Statistics, Texas A&M University, TAMU 3143, College Station, TX 77843-3143, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
Genetic epidemiologic studies often involve investigation of the association of a disease with a genomic region in terms of the underlying haplotypes, that is the combination of alleles at multiple loci along homologous chromosomes. In this article, we consider the problem of estimating haplotype–environment interactions from case–control studies when some of the environmental exposures themselves may be influenced by genetic susceptibility. We specify the distribution of the diplotypes (haplotype pair) given environmental exposures for the underlying population based on a novel semiparametric model that allows haplotypes to be potentially related with environmental exposures, while allowing the marginal distribution of the diplotypes to maintain certain population genetics constraints such as Hardy–Weinberg equilibrium. The marginal distribution of the environmental exposures is allowed to remain completely nonparametric. We develop a semiparametric estimating equation methodology and related asymptotic theory for estimation of the disease odds ratios associated with the haplotypes, environmental exposures, and their interactions, parameters that characterize haplotype–environment associations and the marginal haplotype frequencies. The problem of phase ambiguity of genotype data is handled using a suitable expectation–maximization algorithm. We study the finite-sample performance of the proposed methodology using simulated data. An application of the methodology is illustrated using a case–control study of colorectal adenoma, designed to investigate how the smoking-related risk of colorectal adenoma can be modified by "NAT2," a smoking-metabolism gene that may potentially influence susceptibility to smoking itself.

Keywords: Case–control studies; EM algorithm; Gene–environment interactions; Haplotype; Semiparametric methods


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
Genetic epidemiologic studies often involve investigation of the association between a disease and a candidate genomic region of biologic interest. Typically, in such studies, genotype information is obtained on multiple loci that are known to harbor genetic variations within the region of interest. An increasingly popular approach for analysis of such multilocus genetic data are haplotype-based regression methods, where the effect of a genomic region on disease risk is modeled through "haplotypes," the combinations of alleles (gene variants) at multiple loci along individual homologous chromosomes. It is believed that association analysis based on haplotypes, which can efficiently capture inter-loci interactions as well as "indirect association" due to "linkage disequilibrium" of the haplotypes with unobserved causal variant(s), can be more powerful than more traditional locus-by-locus methods (Schaid, 2004Go).

A technical problem for haplotype-based regression analysis is that in traditional epidemiologic studies, the haplotype information for the study subjects is not directly observable. Instead, locus-specific genotype data are observed, which contain information on the pair of alleles a subject carries on his/her pair of homologous chromosomes at each of the individual loci but does not provide the "phase information," that is which combinations of alleles appear across multiple loci along the individual chromosomes. In general, the genotype data of a subject will be phase ambiguous whenever the subject is heterozygous at 2 or more loci. Statistically, the lack of phase information can be viewed as a special missing data problem.

Recently, a variety of methods have been developed for haplotype-based analysis of case–control data using the logistic regression model (Zhao and others, 2003Go; Lake and others, 2003Go; Epstein and Satten, 2003Go; Satten and Epstein, 2004Go; Spinka and others, 2005Go; Lin and Zeng, 2006Go; Chatterjee and others, 2006Go). Two classes of methods, namely, "prospective" and "retrospective" have evolved. Prospective methods ignore the retrospective nature of the case–control design. In the classical setting, without any missing data, justification of prospective analysis of case–control data relies on the well-known result about the equivalence of prospective and retrospective likelihoods under a semiparametric model that allows the distribution of the underlying covariates to remain completely nonparametric (Andersen, 1970; Prentice and Pyke, 1979Go) . Even with missing data, the equivalence of the prospective and retrospective likelihood may hold, provided the covariate distribution is allowed to remain unrestricted (Roeder and others, 1996Go). For haplotype-based genetic analysis, however, complete nonparametric treatment of the covariates, including haplotypes, may not be possible due to intrinsic identifiability issues for the phase-ambiguous genotype data (Epstein and Satten, 2003Go). Thus, in this setting, the proper retrospective analysis of case–control data requires special attention.

An attractive feature of the retrospective likelihood is that it can enhance efficiency of case–control analysis by directly incorporating certain type of covariate distributional constraints that are natural for genetic epidemiologic studies. The assumptions of Hardy–Weinberg equilibrium (HWE) and gene– environment independence are 2 prime examples of such constraints. The HWE model, which specifies simple relationships between "allele" and "genotype" frequencies at a given chromosomal locus or between haplotype and diplotype (pair of haplotypes on homologous chromosomes) frequencies across multiple loci, is a natural law for a random mating large stable population. Often, it is also natural to assume that a subject's genetic susceptibility, a factor which is determined at birth, is independent of his/her subsequent environmental exposures. However, if these assumptions are violated in some situations, then retrospective methods can produce serious bias in odds ratio estimates (see, e.g. Satten and Epstein, 2004Go; Chatterjee and Carroll, 2005Go; Spinka and others, 2005Go). Thus, there is a need for alternative flexible models for specifying the joint distribution of genetic and environmental covariates that could be used to assess the sensitivity of the retrospective methods to underlying assumptions as well as to develop alternative robust methods.

Both Satten and Epstein (2004)Go and Lin and Zeng (2006)Go have described retrospective maximum likelihood analysis of case–control data under flexible population genetics models that can relax the HWE assumption. Moreover, Lin and Zeng considered a model that allows the conditional distribution of environmental exposure given unphased genotypes to remain completely nonparametric, but they assumed conditional independence between haplotypes and the environmental factors given the unphased genotypes. If, however, haplotypes are the underlying biologic units through which a mechanism of gene is determined, then it is more natural to allow for direct association between haplotypes and environmental exposures. Moreover, if such association could exist, then quantifying the association between haplotypes and certain type of environmental exposures, such as lifestyle and behaviorial factors, would be of scientific interest.

In this article, we propose methods for retrospective analysis of case–control data using a novel model for the gene–environment distribution that can account for direct association between haplotypes and environmental exposures. The model is developed in Section 2. We assume a standard logistic regression model to specify the disease risk conditional on diplotypes and environmental exposures. In addition, we assume a polytomous logistic regression model for specifying the population distribution of the diplotypes conditional on the environmental exposures, with the intercept parameters of the model specified in such a way that the "marginal" distribution of the diplotypes can follow certain population genetic constraints such as HWE. Moreover, by exploiting the equivalence of prospective and retrospective odds ratios under the polytomous regression model, we further incorporate certain constraints on the diplotype-exposure odds ratio parameters that could reflect specific "mode of effects" for the haplotypes. We allow the marginal distribution of the environmental exposure to remain completely nonparametric.

Under the proposed modeling framework, we then describe in Section 3 a "semiparametric" estimating equation method for inference about the finite-dimensional parameters of interest, namely the disease odds ratios, haplotype frequencies, and haplotype-exposure odds ratios. We develop a suitable expectation-maximization (EM) algorithm to account for the phase-ambiguity problem. We study asymptotic theory of the proposed estimator under the underlying semiparametric setting.

In Section 4, we assess the finite-sample performance of the proposed estimator based on case–control data that were simulated utilizing haplotype patterns and frequencies obtained from a real study. In Section 5, we apply the proposed methodology to a case–control study of colorectal adenoma to investigate whether certain haplotypes in the smoking metabolism gene, NAT2, could modify smoking-related risk of colorectal adenoma and whether the same haplotypes could influence an individual's susceptibility to smoking as well. Section 6 contains concluding remarks. All technical details are in an appendix. A SAS macro is available from the Web site http://www.stat.sinica.edu.tw/yhchen/download.htm.


    2. NOTATIONS AND PROPOSED MODEL
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
For haplotype-based studies, the underlying genetic covariate for a subject is defined by "diplotypes," that is, the 2 haplotypes the individual carries in his/her pair of homologous chromosomes, where each haplotype is the combination of alleles at the loci of interest along an individual chromosome. Following the notation developed in Spinka and others (2005)Go, let the diplotype status for a subject be Hdi = (H1,H2), where H1 and H2 denote the constituent haplotypes. We assume that there are J possible haplotypes indexed by hj for j = 1,...,J. The diplotypes are then indexed by hFormula = (hj1,hj2), j1 = 1,...J1, j2 = 1,...,J2. The diplotype data, however, is not directly observable. Instead, for each subject, the multilocus genotype data G is observed, which contains information on the pair of alleles the individual carries at each individual locus but does not provide the phase information, that is which combination of alleles appears along each of the individual chromosomes. Thus, the same genotype data G could be consistent with multiple diplotypes. We will denote C(G) to be the set of all possible diplotypes that are consistent with the genotype data G.

Given the diplotype data Hdi and a set of environmental covariate X, we assume that the risk of the disease is given by the logistic regression model

Formula (2.1)

for some known function m(·,ß1). Often one further imposes structural assumptions on the odds ratio parameters ß1 by modeling the effect of the diplotypes through constituent haplotypes according to a "dominant," "additive," or "recessive" mode of effect (Wallenstein and others, 1998Go). For example, a logistic regression model which assumes an additive effect for each copy of a haplotype corresponds to

Formula (2.2)

where ßX is the main effect of X, ßhjk is the main effect of haplotype hjk, k = 1,2, and ßhjk:X is the interaction effect of X with haplotype hjk, k = 1,2. Such modeling may be necessary due to identifiability considerations (Epstein and Satten, 2003Go) and is desirable when the effects of the haplotypes themselves are of direct scientific interest.

Unlike Spinka and others (2005)Go, who assumed independence of Hdi and X, we assume a general polytomous logistic regression for the conditional distribution of Hdi given X:

Formula (2.3)

where hFormula = (hjFormula,hjFormula) is a chosen reference diplotype. Observe that model (2.3) allows association between Hdi and X through the regression parameters {gamma}1j1j2. Let {gamma}0 and {gamma}1 denote the vectorized forms for the parameters {gamma}0j1j2 and {gamma}1j1j2. Let qhap(hdi|x,{gamma}0,{gamma}1) denote pr(Hdi = hdi|X = x) as defined by model (2.3). We allow the marginal distribution of X, denoted by F(x), to remain completely unspecified. If Hdi were directly observable, then, in principle, no further assumptions are necessary, and one can estimate {gamma}0 and {gamma}1 together with the odds ratio parameters of the disease risk using the profile likelihood approach developed by Chatterjee and Carroll (2005)Go. In the presence of phase ambiguity, however, the diplotypes being not directly observable, further constraints on the parameters {gamma}0 and {gamma}1 are needed for the purpose of identifiability. In the following, we show how certain natural genetic models can be used to impose these constraints.

Given that genetic susceptibility may influence environmental exposures and not vice versa, for causal interpretation of parameters it is more natural to consider a model for the environmental exposures given the diplotypes. However, the odds ratios associated with the distributions [X|H] and [H|X] being the same, the parameters in {gamma}1 can be interpreted as measures of "diplotype effects" on the distribution of exposure. Thus, it is natural to specify the {gamma}1 parameters according to certain mode of effects of the underlying haplotypes. For example, assuming an additive effect for the haplotypes, one can write {gamma}1j1j2 = {gamma}1,j1 + {gamma}1,j2, which allows the diplotype effects to be determined by a reduced set of "haplotype effect" parameters {gamma}1,j; in this case, {gamma}1 would denote the vectorized form for the parameters {gamma}1,j. Similarly, other commonly used models, such as dominant or recessive models, could be used to impose natural constraints on the {gamma}1 parameters in model (2.3). We also observe that the parametric model (2.3), combined with the nonparametric distribution F(x), imposes a semiparametric model on the distribution of [X|H] with a density

Formula

This class of semiparametric models includes the parametric submodel where X|Hdi = hdi follows a multivariate normal distribution with mean µhdi and common variance–covariance matrix {Sigma}. In this case, it is easy to see that {gamma}1hdi = (µhdi µhFormula)T{Sigma} 1, which is a measure of the shift in the mean of the distribution of X due to differences in the diplotypes.

The parameter {gamma}0 in model (2.3) defines the population diplotype frequencies for a baseline value of the exposure X. It is common to use population genetics models, such as HWE, to specify a relationship between diplotype and haplotype frequencies. However, observe that if the diplotypes can influence certain environmental exposures, then the frequencies of the diplotypes within exposure categories may not follow the HWE constraints although the underlying population, as a whole, may be in HWE. Thus, the population-level marginal haplotype-pair distribution is assumed to follow HWE and is characterized by the parameters {theta} = ({theta}2,...,{theta}J) so that

Formula (2.4)

where h1 denotes the chosen reference haplotype and {theta}1 = 0. Let

Formula

be the marginal frequency for the diplotype hdi. Recall that in the proposed model, {gamma}0 is defined as an implicit function of {gamma}1, {theta}, and F(x) through the relationship

Formula (2.5)

Note that F is left unspecified, and hence the model propoised is semiparametric.


    3. SEMIPARAMETRIC ESTIMATING EQUATION INFERENCE
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 

3.1 Estimation with known haplotypes

In what follows, where there can be no confusion, we will write h for hdi.

Let H(x) = exp(x)/{1 + exp(x)} be the logistic distribution function. Write the risk model probability as

Formula

Recall that qhap(h|X,{gamma}0,{gamma}1) = pr(Hdi = h|X;{gamma}0,{gamma}1) is the conditional model of Hdi given X that is specified as in (2.3).

To start with, consider the ideal case that the phase information is known so that Hdi is observed. Since F is treated nonparametrically, assume that F is discrete and has mass {delta}k at xk, k = 1,...,K, where {x1,...,xK} are the distinct values of X that are observed in the case–control sample. Let ndkh be the number of subjects in the sample with (D = d,X = xk,Hdi = h). Ignoring the dependence of {gamma}0 on F tentatively, the log-likelihood of the case–control data can then be written as

Formula

Maximizing l with respect to {delta} for fixed values of {omega} = (ß,{gamma}0,{gamma}1) then leads to

Formula (3.1)

and the profile log-likelihood

Formula (3.2)

where

Formula

and

Formula

with B = (ß0,ß1,{kappa})T and {kappa} = ß0 + log(n1/n0) – log{pr(D = 1)/pr(D = 0)}. The calculation is similar to that in Chatterjee and Carroll (2005)Go.

As noted by Chatterjee and Carroll (2005)Go, the parameter ß0 is separable from {kappa} and hence is theoretically identifiable. In practice, however, there is usually little information about ß0 available in the observed data, and hence the information matrix is nearly singular. One way to bypass this problem is to use external information on the disease prevalence pr(D = 1), while another way is to use the rare-disease approximation when the disease is rare. The estimation method described below can be applied to both the 2 cases of pr(D = 1) being known and the rare-disease approximation being made, with suitable definitions on B and S(d,h,x,B,{gamma}0,{gamma}1). When pr(D = 1) is known, {kappa} depends on ß0 only, hence here we define B = (ß0,ß1)T. When the disease is rare so that

Formula (3.3)

we have

Formula

Note that ß0 does not appear in this expression, and hence we define B = ({kappa},ß1)T.

Our goal is to estimate the parameters (B,{theta},{gamma}1) based on the profile log-likelihood (3.2), where {gamma}0 is defined as an implicit function of ({theta},{gamma}1,F) through (2.5), and we write {gamma}0 = G({theta},{gamma}1,F). Let {Omega} = (B,{gamma}1,{gamma}0), {Omega}* = {B,{gamma}1,G({theta},{gamma}1,F)}, and {Phi} = (B,{gamma}1,{theta}). Let L{Omega}(·) and L{Phi}(·) be, respectively, the derivatives of L(·) with respect to {Omega} and {Phi}, and G{theta} and G{gamma}1 the derivatives of G(·) with respect to {theta} and {gamma}1. We then have

Formula

Explicit expressions for G{gamma}1 and G{theta} are given in Appendix C. Also, the information matrix is given by

Formula

where I{Omega}{Omega} = E( – L{Omega}{Omega}), with L{Omega}{Omega} the second derivative of L with respect to {Omega}; note that the terms involving second derivatives of {Omega}* do not appear in the information matrix because E(L{Omega}) = 0, which is a direct consequence of the Lemmas A.1 and A.2 in Appendix A. We propose to obtain the estimate of {Phi} by solving the estimating equation

Formula (3.4)

where we have substituted an estimate Formula for F in G(·); that is, for each fixed value of ({theta},{gamma}1), we solve {gamma}0 from

Formula (3.5)

One convenient choice of Formula is the empirical estimate Formula, which is given by

Formula

for the case where pr(D = 1) is known, where Formula and Formula are the empirical distributions of X in the case and control samples, and is given by Formula for the case where the rare-disease assumption can be made. An alternative choice of Formula would be the profile likelihood estimate (3.1). Numerical calculations not given here show that the latter choice requires more computational efforts while yielding results very similar to those given by the empirical estimate Formula.

3.2 Estimation with ambiguous haplotype data

Now, we turn to the more practical case where the haplotype data cannot be observed directly and must be inferred from the unphased genotype data, that is, the haplotype information may be subject to ambiguity. In this case, we apply an EM-like algorithm to the "complete data" estimating equation (3.4). Let Gi denote the observed unphased genotype of subject i and C(Gi) the set of diplotypes that are consistent with Gi. When only Gi instead of HFormula is observed for each subject, we propose to obtain the estimate Formula for {Phi} = (B,{gamma}1,{theta}) as the solution of the weighted version of (3.4):

Formula (3.6)

where using the short-hand notation that Formula, the weights are given by

Formula (3.7)

The limiting version of the weights is given as

Formula (3.8)

Solving the estimating equation (3.6) can be implemented simply by an EM-like algorithm as follows: starting with an initial value for {Phi} and hence an initial value for {gamma}0, we

(i) calculate the weights Formula from (3.7) and
(ii) solve (3.6) to obtain an updated estimate of {Phi} using the weights Formula given in (i); note that within this step we also need to solve (3.5) to obtain updated value of {gamma}0. The algorithm is iterated between the 2 steps until convergence. Note that the weights Formula are only used in solving {Phi} from (3.6) and are not required in solving {gamma}0 from (3.5).

3.3 Asymptotic theory

Make the following series of definitions. Expectations denoted as Ecc(·) are taken under the case–control sampling design, that is, for any random vector Y, Ecc(Y) = {sum}FormulaµdE(Y|D = d), where µd = plim nd/n, d = 0,1. Then, define

Formula (3.9)

Note that the second derivative of {Omega}* does not appear in Formula since Formula, and the last identity in (3.9) is given by Lemma 1.3 in Appendix A.

Define Formula to be the mass of Formula, which is equal to {sum}Formula{pi}dI(Di = d)/nd if {pi}d = pr(D = d) is known and is equal to I(Di = 0)/n0 when the rare-disease approximation is used. Let qhap(X,{gamma}1,{gamma}0) = {qhap(h|X,{gamma}1,{gamma}0)} be the vector collection over h of qhap(h|X,{gamma}1,{gamma}0) for all diplotypes except the reference diplotype, and let qHWE({theta}) be defined similarly. Define

Formula

where Formula is the obvious submatrix of Formula.

THEOREM 3.1 Let

Formula

Suppose that Formula exists and the matrix Formula is invertible. Then, Formula is asymptotically normal with mean zero and covariance matrix

Formula (3.10)

REMARK 3.2 The asymptotic variance Formula can be readily estimated by replacing each component matrix with its empirical counterpart. Lemma 1.3 gives useful expressions to facilitate this computation.

REMARK 3.3 In our numerical experiments, the estimated covariance based on formula (3.10) is very close to that based on the "naive" covariance estimate obtained by naively treating the estimating equation (3.6) as a genuine score equation; namely, treating the Formula plugged into G(·) as the true covariate distribution F. In this case, by applying Proposition 1(ii) in Chatterjee and Carroll (2005)Go, the naive covariance estimate can be obtained simply as the empirical counterpart of the matrix Formula, where Formula. Whether this naive estimate performs well in general is unknown, and we suggest using the estimate based on (3.10).


    4. SIMULATIONS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 

4.1 Finite-sample performance under correct model

In this section, we study the finite-sample performance of the proposed estimator using simulated data generated under the proposed modeling framework. We simulated haplotypes following published data (Epstein and Satten, 2003Go) on haplotype patterns and frequencies for 5 single-nucleotide polymorphisms (SNPs) in a putative susceptibility gene for diabetes (see Table 1). The simulations involved a single environmental covariate X, assumed to follow a standard normal distribution in the population. Given X, the diplotypes (haplotype pair) for an individual were generated from a polytomous logistic regression of the form (2.3), where the diplotype-specific odds ratios were further specified according to an additive model of the form {gamma}1j1j2 = {gamma}1,j1 + {gamma}1,j2, where j1 and j2 denote the index for 14 haplotypes shown in Table 1. We assume {gamma}1,4 = {gamma}1,5 = – 0.4 and {gamma}1,12 = 0.4, and all the other {gamma}1,j = 0. The parameters {gamma}0j1j2's in model (2.3) are then specified in such a way that the marginal diplotype distribution follow HWE with haplotype frequencies given in Table 1.


View this table:
[in this window]
[in a new window]

 
Table 1. Marginal haplotype frequencies and grouped haplotypes in the simulation. Here, j is index for the original haplotypes while j' is index for the grouped haplotypes

 
For generating disease outcome, we chose the haplotype "01100" (j = 5) to be causal and used the logistic model

Formula

where Z(5) denotes the number of the copies of the causal haplotype contained in Hdi. The true value of the parameter vector (ß0,ßH,ßX,ßHX) was set to ( – 3.0,0.2,0.1,0.3). A case–control sample with 600 controls and 600 cases was then sampled. The results were based upon 1000 simulated data sets.

When analyzing the data, we only used the unphased genotype information. We did not assume the causal haplotype to be known. Thus, in both the disease-risk model (2.1) and the diplotype-frequency model (2.3), we choose the most common haplotype "10011" as a reference and estimated a separate regression parameter for each of the non-referent haplotypes. Since rare haplotypes may lead to unreliable estimates of the associated regression parameters, when estimating ß and {gamma}1, rare haplotypes with frequency < 1% are grouped into the reference haplotype. The resulting 8-grouped haplotypes are labeled as hj', j' = 2,...,8; see Table 1 for details about how the haplotypes are grouped.

In each simulation, we obtain 2 sets of estimates from the proposed method, one using the rare-disease approximation (3.3) and the other using the known value of the population disease prevalence. Results shown in Table 2 show that both sets of estimates are essentially unbiased. Also, the standard error estimates are in close agreement with the true values, and the coverage probabilities are close to the nominal value (95%). As expected, the estimates for {theta} and {gamma}1 are generally more efficient using external information on the disease prevalence than when using the rare-disease approximation, but no such efficiency gain is observed for the parameters ß in the disease-risk model. Similar conclusion can be drawn from the simulations with a Bernoulli covariate (success probability = 0.5), showing the applicability of the proposed method to the categorical covariate. Detailed results for this latter set of simulations are included in the supplementary material available at Biostatistics online.


View this table:
[in this window]
[in a new window]

 
Table 2. Simulation results for the case with an additive genetic law. Here, mean is the mean over 1000 simulated data sets, SE is the standard deviation of the estimates, Formula is the mean of the estimated standard deviation of the parameter estimates, CP is the coverage probability of the 95% confidence interval, ß's denote risk parameters, {theta}'s characterize marginal haplotype frequencies, and {gamma}'s denote the haplotype–environment association parameters

 
4.2 Model robustness

Here, we consider a simulation study where we generate the data in such a way that the polytomous model for diplotype frequencies may not exactly hold. The main goal is to give an indication of the robustness of the estimate of the association parameters (ß) from the proposed method when the model for [Hdi|X] is misspecified.

Following the argument of causality in Section 2, if [X|Hdi] follows a normal distribution with constant variance, then the polytomous model is exact. So, to show a modest violation of the polytomous model, for given diplotype we generate the data on X from

Formula

where the diplotype data are again generated from the distribution in Table 1, {epsilon} is a t-distribution (d.f. = 3) truncated at ±5, {lambda}4 = {lambda}5 = – 1.2, {lambda}12 = 1.2, and all the other {lambda}j are 0. The disease status data are generated from the same logistic model as in the previous simulation. The simulated data on 600 cases and 600 controls are then analyzed with the proposed method, where the analysis models for the disease risk, [Hdi|X], and the marginal diplotype distribution are specified the same as those in the previous simulation. As a comparison, we also fit to the simulated data a model with the haplotype–environment (H-X) independence assumption, i.e. pr(Hdi|X) = pr(Hdi) = qHWE(Hdi), using the method proposed by Spinka and others (2005)Go. The rare-disease approximation is made when applying both the 2 methods.

The results shown in Table 3 reveal that, for the estimation of the association parameters ß, the proposed method may be quite robust to modest misspecification of the model for [Hdi|X]. On the other hand, the estimates from the H-X independence method does result in substantial bias, especially for parameters corresponding to haplotypes for which [X|Hdi] have nonzero mean; for example, the estimate for the interaction parameter between h5 and X is severely biased with the H-X independence method. The estimates for the marginal haplotype-frequency parameters {theta} seem to be robust to misspecification of [Hdi|X] for both the 2 methods.


View this table:
[in this window]
[in a new window]

 
Table 3. Simulation results for the case of a misspecified conditional diplotype distribution given covariates. Here, mean is the mean over 1000 simulated data sets, SE is the standard deviation of the estimates, Formula is the mean of the estimated standard deviation of the parameter estimates, CP is the coverage probability of the 95% confidence interval, ß's denote risk parameters, {theta}'s characterize marginal haplotype frequencies, and {gamma}'s denote the haplotype–environment association parameters

 

    5. CASECONTROL STUDY OF COLORECTAL ADENOMA STUDY, NAT2 HAPLOTYPE, AND SMOKING
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
We illustrate the proposed modeling and estimating methodologies with an application to a case–control study of colorectal adenoma, a precursor of colorectal cancer. The study involved 628 prevalent advanced adenoma cases and 635 gender-matched controls, selected from the screening arm of the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial at the National Cancer Institute, USA (Gohagan and others, 2000Go; Moslehi and others, 2006Go). One of the main objectives of this study is to assess whether smoking-related risk of colorectal adenoma may be modified by certain haplotypes in NAT2, a gene known to be important in metabolism of smoking-related carcinogens. In addition, since NAT2 is involved in the smoking metabolism pathway, potentially it can influence an individual's addiction to smoking. Thus, it was also of interest to identify potential haplotypes that could influence an individual's susceptibility to smoking.

Genotype data were available on 6 SNPs. We initially applied the EM algorithm proposed by Li and others (2003)Go for haplotype-frequency estimation to derive 7 common haplotypes with estimated frequency greater than 0.5%, which are then included in our association analysis with the most frequent haplotype served as the reference haplotype. Subjects were categorized as "never," "former," or "current" smokers. We fit a logistic regression model (2.1) assuming an additive effect for each haplotype other than the reference one; see (2.2). The haplotype–environment interaction terms include only those for the haplotype "101010" with "Smk1" and "Smk2," 2 dummy variables for former and never smokers, because they are the only promising interactions according to preliminary analysis. The disease-risk model was further adjusted for "age," recorded in years, and "gender." A polytomous logistic regression (2.3) is specified for the conditional distribution of diplotypes given the environmental covariates Smk1 and Smk2 with the marginal diplotype distribution being specified by the HWE constraints. The main parameters of interest include the disease-haplotype odds ratio parameters ß1, the haplotype–environment odds ratio parameters {gamma}1, and the marginal haplotype frequencies in the whole population. The marginal distribution for the environmental covariates is left unspecified. For estimation of regression parameters ß and {gamma}1, we grouped haplotypes with frequency less than 2% into the reference haplotype. The rare-disease approximation was made in deriving the estimating equation, and the EM algorithm proposed in Section 3 is utilized to accommodate the unphased genotype data.

Results from this application are displayed in Table 4. It is clear that current smokers can have significantly elevated risk for colorectal adenoma relative to nonsmokers, adjusting for gender and age. Relative to the reference haplotype "001100," all the other haplotypes are associated with reduced risk for colorectal adenoma, but the statistical evidence is not significant. However, the significance of the interaction 101010xSmk2 suggests that smoking-related risk of adenoma was much reduced for carriers of the haplotype 101010 than non-carriers. The finding is consistent with previous laboratory and epidemiologic studies that have identified the haplotype 101010, known as "NAT2*4," as a rapid metabolizer for smoking-related carcinogens. The estimates for the parameter {gamma}1 for the conditional diplotype distribution reveal that the susceptibility to smoking seems not to be influenced by any haplotypes we considered. Finally, the estimates for the marginal haplotype frequencies derived from the estimates of {theta} are quite close to those obtained by the EM algorithm of Li and others (2003)Go applied to the genotype data of the controls.


View this table:
[in this window]
[in a new window]

 
Table 4 Results of the colorectal adenoma study. Haplotypes 001010 and 001110 are grouped into the reference haplotype in the disease-risk and conditional diplotype distribution models

 
To check if the analysis is sensitive to model specification for the conditional distribution of diplotypes given the environmental covariates, we further fit the model (2.3) with various choices of the environmental covariates. The results (not shown) for the association parameters ß and the marginal haplotype frequencies are fairly consistent across the analyses.


    6. CONCLUDING REMARKS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
The model we have proposed for gene–environment association is suitable when the underlying haplotypes of a genomic region may causally influence the environmental exposure(s) under study. The model, however, requires special treatment for environmental factors, such as ethnicity or geographic region(s), which may be associated with the genomic region under study, not because of any causal relationship but merely due to population stratification. Suppose, in addition to the main environmental exposure X, there is a set of environmental factors S which could be used to divide the underlying population into K strata that are likely to be genetically heterogenous. In such a situation, a natural model for describing the association between diplotypes Hdi and environmental factors W = (X,S) is given by

Formula (6.1)

where the stratum-specific intercept parameters {gamma}0j1j2(S) should be specified in such a way that the diplotype frequencies, after marginalized over X, follow population genetics constraints, such as HWE, within each stratum defined by S. The disease-risk model could be also extended to include S as a risk factor. The proposed estimating equation methodology can be easily modified to estimate the gene–environment interaction and association parameters of interest under these extended models.


    APPENDIX A: BASIC LEMMAS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
The following lemmas are required to derive the asymptotic distribution of the proposed estimator. Lemma 1.1 below is in fact Lemma 3 of Chatterjee and Carroll (2005)Go.

LEMMA A.1 Under the case–control sampling design where the total sample size n = n1 + n0 tends to infinity but the sampling proportions for the cases and controls, that is, n1/n and n0/n, remain fixed at µ1 and µ0, we have for any measurable function M(D,Hdi,X) of data (D,Hdi,X),

Formula

where E*(·|X) denotes the expectation with respect to the joint distribution of (D,Hdi) given X defined by

Formula (A.1)

and {lambda}(x) = {sum}d{sum}hS(d,h,x,B,{gamma}1,{gamma}0).

Lemma A.2 below provides an explicit expression for the estimating function Formula.

LEMMA A.2 Write

Formula

Then

Formula (A.2)

Proof: By definition,

Formula

and direct calculation yields

Formula

which proves the result.

Lemma A.3 provides explicit forms for the information matrices.

LEMMA A.3 Let I{Omega}{Omega} = Ecc{ – L{Omega}{Omega}(D,Hdi,X,{Omega})}, where L{Omega}{Omega} is the second derivative of L(·) with respective to {Omega} and Formula. Then

Formula

Proof: The first identity has been given in Lemma 4 of Chatterjee and Carroll (2005)Go. To show the second identity, applying the chain rule we have

Formula (A.3)

The first term of (A.3) equals

Formula

By the definition of w(h,{Omega}) given in (3.8), it easy to see that

Formula

where the joint density p*(D,Hdi|X,{Omega}) is defined in (A.1). Recalling further that L{Omega}(D,h,X,{Omega}) = {partial}logp*(D,Hdi = h|X,{Omega})/{partial}{Omega} and Formula, we have the identity

Formula

Hence, the second term of (A.3) leads to

Formula

The desired result thus follows.


    APPENDIX B: PROOF OF THEOREM
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
We will first obtain the asymptotic expansion of the proposed estimating equation (3.6), by which the large sample distribution theory for Formula can be derived immediately from the central limit theorem.

For our estimator Formula, a standard Taylor series expansion of (3.6) yields

Formula (B.1)

In view of (2.5) and (3.5), let {gamma}0 = G({theta},{gamma}1,F) and Formula. Recall that Formula is solved from (3.5):

Formula

Making a further Taylor series expansion, we have

Formula

Note that

Formula

Hence,

Formula

An explicit expression for Q is given in Appendix C. Consequently, the middle 2 terms in the expansion (B.1) reduce to

Formula

Therefore, we have

Formula

Note that in the last equality above, we have used the fact that

Formula

and {sum}FormulaE{K(X,Di)||Di} = 0, which follow directly from Lemmas 1.1 and 1.2. This completes the proof.


    APPENDIX C: EXPRESSIONS FOR THE DERIVATIVES OF G({theta},{gamma}1,F) AND Formula
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 
Recall that G({theta},{gamma}1,F) is the solution of {gamma}0 to (2.5). Here, we define {gamma}1 to be the vectorized form for the diplotype-effect parameters subject to certain mode of effects (e.g. additive effect) of haplotypes, and define {eta} to be the vectorized form for the full set of diplotype effects {{gamma}1j1j2}. Differentiating both sides of (2.5) with respect to {gamma}1, we have

Formula

Let

Formula

and Q = {int}Q(x)dF(x). Then

Formula

Similarly, letting

Formula

we have

Formula

The derivatives of Formula can be obtained by replacing dF in the above quantities with Formula.


    ACKNOWLEDGMENTS
 
Chen's research was supported by the National Science Council of the People's Republic of China (NSC 95-2118-M-001-022-MY3). Chatterjee's research was supported by the Intramural Research Program of the National Cancer Institute. Carroll's research was supported by a grant from the National Cancer Institute (CA57030) and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106). A SAS macro is available from the Web site http://www.stat.sinica.edu.tw/yhchen/download.html. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. NOTATIONS AND PROPOSED...
 3. SEMIPARAMETRIC ESTIMATING...
 4. SIMULATIONS
 5. CASE-CONTROL STUDY OF...
 6. CONCLUDING REMARKS
 APPENDIX A: BASIC LEMMAS
 APPENDIX B: PROOF OF...
 APPENDIX C: EXPRESSIONS FOR...
 REFERENCES
 

    Andersen JB. Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society, Series B (1970) 32:283–301.

    Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environmental interactions. Biometrika (2005) 92:399–418.[Abstract/Free Full Text]

    Chatterjee N, Chen J, Spinka C, Carroll RJ. Comment on the paper likelihood based inference on haplotype effects in genetic association studies by D.J. Lin and D. Zeng. Journal of the American Statistical Association (2006) 101:108–110.[CrossRef][Web of Science]

    Epstein M, Satten G. Inference of haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics (2003) 73:1316–1329.[CrossRef][Web of Science][Medline]

    Gohagan JK, Prorok PC, Hayes RB, Kramer BS. The Prostate, LungColorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: History, organization, and status. Controlled Clinical Trials (2000) 21:251S–272S.[CrossRef][Web of Science][Medline]

    Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Human Heredity (2003) 55:56–65.[CrossRef][Web of Science][Medline]

    Li SS, Khalid N, Carlson C, Zhao LP. Estimating haplotype frequencies and standard errors for multiple single nucleotide polymorphisms. Biostatistics (2003) 4:513–522.[Abstract]

    Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies (with discussion). Journal of the American Statistical Association (2006) 101:89–118.[CrossRef][Web of Science]

    Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfield J, Hein DW, Hayes RB. Cigarette smoking, n-acetyltransferase genes and the risk of advanced colorectal adenoma. Pharmacogenomics (2006) 7:819–829.[CrossRef][Web of Science][Medline]

    Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika (1979) 66:403–411.[Abstract/Free Full Text]

    Roeder K, Carroll RJ, Lindsay BG. A nonparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association (1996) 91:722–732.[CrossRef][Web of Science]

    Satten GA, Epstein MP. Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genetic Epidemiology (2004) 27:192–201.[CrossRef][Web of Science][Medline]

    Schaid DJ. Evaluating associations of haplotypes with traits. Genetic Epidemiology (2004) 27:348–364.[CrossRef][Web of Science][Medline]

    Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genetic Epidemiology (2005) 29:108–127.[CrossRef][Web of Science][Medline]

    Wallenstein S, Hodge SE, Weston A. Logistic regression model for analyzing extended haplotype data. Genetic Epidemiology (1998) 15:173–181.[CrossRef][Web of Science][Medline]

    Zhao LP, Li SS, Khalid NA. Method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. American Journal of Human Genetics (2003) 72:1231–1250.[CrossRef][Web of Science][Medline]

    Received October 13, 2006; revised March 2, 2007; accepted for publication March 14, 2007.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    9/1/81    most recent
    kxm011v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Chen, Y.-H.
    Right arrow Articles by Carroll, R. J.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Chen, Y.-H.
    Right arrow Articles by Carroll, R. J.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?