Biostatistics Advance Access originally published online on February 24, 2006
Biostatistics 2006 7(3):486-502; doi:10.1093/biostatistics/kxj021
Efficient semiparametric estimation of haplotype-disease associations in case-cohort and nested case-control studies
Department of Biostatistics, CB# 7420, University of North Carolina, Chapel Hill, NC 27599-7420, USA lin{at}bios.unc.edu
Department of Epidemiology, CB# 7435, University of North Carolina, Chapel Hill, NC 27599-7420, USA
Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Estimating the effects of haplotypes on the age of onset of a disease is an important step toward the discovery of genes that influence complex human diseases. A haplotype is a specific sequence of nucleotides on the same chromosome of an individual and can only be measured indirectly through the genotype. We consider cohort studies which collect genotype data on a subset of cohort members through case-cohort or nested case-control sampling. We formulate the effects of haplotypes and possibly time-varying environmental variables on the age of onset through a broad class of semiparametric regression models. We construct appropriate nonparametric likelihoods, which involve both finite- and infinite-dimensional parameters. The corresponding nonparametric maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Consistent variance-covariance estimators are provided, and efficient and reliable numerical algorithms are developed. Simulation studies demonstrate that the asymptotic approximations are accurate in practical settings and that case-cohort and nested case-control designs are highly cost-effective. An application to a major cardiovascular study is provided.
Keywords: Age of onset; Association studies; Censoring; Haplotype effects; Nonparametric likelihood; Proportional hazards; Semiparametric efficiency; Single nucleotide polymorphisms; Survival data
| 1. INTRODUCTION |
|---|
Complex human diseases, such as cancer, diabetes, schizophrenia, and coronary heart disease (CHD), are affected by multiple genetic and environmental factors. Recent sequencing of the human genome and advances in genotyping technologies have spurred an enormous interest in genetic association studies which explore the relationships between complex diseases and single nucleotide polymorphisms (SNPs). SNPs are single-base variations in the genetic code that occur about every 1000 bases along the 3 billion bases of the human genome. A specific combination of nucleotides at a series of nearby SNPs on the same chromosome of an individual is called a haplotype. The use of haplotypes can yield more powerful tests of genetic associations than the use of single SNPs, especially when the disease-predisposing SNPs are not directly measured or when there are strong interactions of multiple SNPs on the same chromosome (Akey et al., 2001).
Current genotyping technologies cannot separate the two homologous chromosomes of an individual. Consequently, only the unphased genotype, i.e. the combination of the two homologous haplotypes, is directly observable. Several methods have been proposed for inferring individual haplotypes and for estimating haplotype-specific relative risks based on unphased genotype data from case-control studies (see Schaid, 2004, for a recent review).
Cohort studies offer several advantages over case-control studies (Breslow and Day, 1987, pp. 11-20). First, the age of onset carries more information about the etiology of a complex disease than the disease status. Second, selection and information biases inherent in case-control studies can usually be eliminated in cohort studies. Third, the cohort design enables one to investigate a full range of diseases and related traits in a single study.
Cohort studies are major undertakings, involving long-term follow-up of many individuals. Fortunately, there are a number of cohort studies that have already been assembled for other purposes and have repositories of stored specimens that would allow the individuals to be genotyped for candidate genes of interest. Examples include the Cardiovascular Health Study (Fried et al., 1991), the Women's Health Initiative (Johnson et al., 1999), and the Atherosclerosis Risk in Communities (ARIC) Study (The ARIC Investigators, 1989).
Lin (2004) showed how to perform the Cox regression analysis of haplotype-disease associations with genotype data in cohort studies. The genotype data are required to be available on all cohort members. Despite the continuing improvement in genotyping efficiency, it is still prohibitively expensive to genotype a large cohort. An efficient compromise is to employ the case-cohort or nested case-control design (Kalbfleisch and Prentice, 2002, Section 11.4), so that only a subset of the cohort members need to be genotyped. In fact, the case-cohort design was recently employed in the ARIC study, which is an epidemiologic cohort study of 15 792 individuals aged 45-64 years to investigate the etiology of atherosclerosis and other diseases. There is a large body of literature on the Cox regression for case-cohort and nested case-control designs (see Kulich and Lin, 2004; Nan, 2004; Scheike and Juul, 2004; Scheike and Martinussen, 2004; and the references therein). None of the existing work, however, deals with the additional complexity due to haplotype uncertainty.
In the present paper, we study semiparametric estimation of haplotype-disease associations in case-cohort and nested case-control studies. The fact that the genotype data are available only on a biased subset of the cohort members poses considerable challenges in making inference about haplotype-disease associations. We propose a broad class of semiparametric regression models to formulate the effects of haplotype configurations and possibly time-dependent environmental factors on the age of onset of disease. We derive appropriate likelihoods for these models and establish the asymptotic properties of the resultant maximum likelihood estimators. We develop efficient and stable numerical algorithms to implement the corresponding inference procedures. We apply the proposed methods to the aforementioned ARIC study, which motivated this work.
| 2. INFERENCE PROCEDURES |
|---|
Let T be the time to disease occurrence, H the pair of homologous haplotypes, and G the corresponding genotype. If we denote the two possible alleles of each SNP by the values 0 versus 1, then H is a pair of ordered sequences of zeros and ones and G, which is the sum of the two sequences in H, is an ordered sequence of zeros, ones, and twos. Although we are interested in the association between H and T, we only observe G directly.
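To make this coding concrete, the following Python sketch forms the unphased genotype as the element-wise sum of two haplotypes and enumerates the haplotype pairs that are compatible with a given genotype. The function names `genotype_of` and `compatible_pairs` are illustrative and do not come from the paper, and the pairs are treated as unordered here purely for compactness.

```python
from itertools import product

def genotype_of(h1, h2):
    """Unphased genotype: element-wise sum of two 0/1 haplotype sequences."""
    return tuple(a + b for a, b in zip(h1, h2))

def compatible_pairs(genotype):
    """All unordered haplotype pairs whose element-wise sum equals the genotype."""
    # Heterozygous sites (value 1) are the only positions where phase is ambiguous.
    het = [j for j, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed: 0 -> allele 0 on both, 2 -> allele 1 on both.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for j, b in zip(het, bits):
            h1[j], h2[j] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)
```

A genotype with m heterozygous sites (m at least 1) has 2^(m-1) compatible unordered pairs under this sketch, which is why haplotype uncertainty grows quickly with the number of SNPs.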
Under case-cohort and nested case-control designs, a subset of individuals is selected for genotyping. We allow the possibility that some other expensive discrete time-independent covariates, denoted by W, are also measured in this subset only. Additional covariates of interest X, possibly time dependent, are measured on all cohort members.
The time to disease occurrence will be censored if the individual has not developed the disease of interest by the end of the study or is withdrawn from the study prematurely. Let C denote the potential censoring time. We assume coarsening at random. That is, the event C = t is independent of (T, H, W) conditional on $\{X(s): s \le t\}$ and $T \ge t$.
Suppose that we have a cohort of n individuals. We collect the data $(Y_i, \Delta_i, X_i)$ ($i = 1, \ldots, n$), where $Y_i = \min(T_i, C_i)$, $\Delta_i = I(T_i \le C_i)$, and $I(\cdot)$ is the indicator function. We also measure G and W for a subset of the cohort, which is selected by the case-cohort or nested case-control sampling.
Under the case-cohort sampling, we randomly select a subcohort from the full cohort. The selection probabilities depend on the observed event histories and possibly on covariates that are always measured. Let $R_i$ indicate by the values 1 versus 0 whether the ith individual is selected. We assume missing at random in that $P(R_i = 1 \mid T_i, C_i, H_i, W_i, X_i) = P(R_i = 1 \mid Y_i, \Delta_i, X_i)$.
The observed data can be represented as $(Y_i, \Delta_i, X_i, R_i, R_iG_i, R_iW_i)$ ($i = 1, \ldots, n$).
Let S(G) denote the set of all haplotype pairs that are compatible with genotype G. Then the observed-data likelihood function can be written as
![]() |
where $\lambda_T$ and $\lambda_C$ pertain to the conditional hazard functions of T and C, respectively, and $f_X(t)$ pertains to the conditional density of X(t). Thus, the observed-data likelihood function concerning the distribution of T given (H, X, W) is proportional to
![]() | (2.1) |
where f is the conditional density of
Under the nested case-control sampling, a small number of the individuals who are at risk at the time of disease occurrence of a case are selected for genotyping. The probability of selection at time t for an individual may depend on the observed past history
The observed data can be represented as
(i = 1, ..., n), where
and Si(t) indicates whether the ith individual is selected for genotyping at time t.
To motivate the likelihood construction, we pretend that all the random variables are discrete. Then the observed-data likelihood is
![]() | (2.2) |
If the ith individual is never selected for genotyping, i.e. $S_i(t) = 0$ for all $t \le Y_i$, then $R_i = 0$ and
only contains the information of Ti
t and
so the likelihood contribution from this individual is the same as the likelihood of
If the ith individual is selected, i.e. $S_i(t_0) = 1$ for some $t_0 \le Y_i$, then
contains the information of Ti
t and
for t < t0 and becomes the information of Ti
t, Gi, Wi, and
for t
t0, so the contribution from this individual to (2.2) is the same as the likelihood of
Thus, the likelihood function concerning the distribution of T given (H, W, X) is exactly the same as (2.1), in which $R_i \equiv \max\{S_i(t): t \le Y_i\}$ indicates whether the ith individual is ever selected for genotyping.
REMARK 2.1 In the above derivation, the sampling is assumed to be independent among individuals. We may relax this assumption by allowing the sampling at time t to depend on the observed history at t of all individuals so that sampling without replacement can be accommodated. The likelihood function remains the same.
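The nested case-control selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `nested_case_control`, the default of two controls per case, the rule that every case is itself genotyped, and simple random sampling without replacement within each risk set are choices made here, not details taken from the derivation above. The returned indicator plays the role of the "ever selected" indicator, i.e. the maximum over time of the selection indicators.

```python
import random

def nested_case_control(times, events, m=2, seed=0):
    """Return a 0/1 list: 1 if the subject is ever selected for genotyping.

    times  : observed follow-up times (event or censoring)
    events : 1 if the subject is a case, 0 if censored
    m      : number of controls drawn per case (assumed design parameter)
    """
    rng = random.Random(seed)
    n = len(times)
    selected = [0] * n
    for i in range(n):
        if not events[i]:
            continue
        selected[i] = 1  # assumption: the case itself is genotyped
        # Risk set: other subjects still under observation at the case's event time.
        at_risk = [j for j in range(n) if j != i and times[j] >= times[i]]
        for j in rng.sample(at_risk, min(m, len(at_risk))):
            selected[j] = 1  # a sampled control may become a case later on
    return selected
```

Because controls are drawn from everyone still at risk, a subject sampled as a control for an early case can later appear as a case, which is consistent with the single "ever selected" indicator used in the likelihood.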
The conditional hazard function $\lambda(t \mid H, X, W)$ represents the effects of the haplotype pair and environmental factors on the risk of disease, which can be formulated by a variety of parametric and semiparametric models. We propose the following class of semiparametric transformation models in terms of the cumulative hazard function:
$$\Lambda(t \mid H, X, W) = Q\left(\int_0^t e^{\beta^{\mathrm T} Z(H, X(s), W)}\, d\Lambda(s)\right), \tag{2.3}$$
where $\Lambda(t)$ is an unknown increasing function with $\Lambda(0) = 0$, $Z(H, X(t), W)$ is a specified function of H, X(t), and W, and Q is a thrice-differentiable transformation with $Q(0) = 0$ and $Q'(x) > 0$. Here and in the sequel, $g'(x) = dg(x)/dx$ and $g''(x) = d^2g(x)/dx^2$. We may use the class of Box-Cox transformations $Q(x) = \{(1 + x)^r - 1\}/r$ ($r > 0$) or the class of logarithmic transformations $Q(x) = r_1 \log(1 + r_2 x)$ ($r_1 > 0$, $r_2 > 0$). The choices of $Q(x) = x$ and $Q(x) = \log(1 + x)$ yield the proportional hazards and proportional odds models, respectively.
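The two transformation classes are easy to evaluate numerically. The short Python sketch below (function names are illustrative) implements both and can be used to confirm that the Box-Cox class with r = 1 gives the proportional hazards choice Q(x) = x, while the logarithmic class with r1 = r2 = 1 gives the proportional odds choice Q(x) = log(1 + x).

```python
import math

def box_cox(x, r):
    """Box-Cox transformation Q(x) = {(1 + x)^r - 1} / r, for r > 0."""
    return ((1.0 + x) ** r - 1.0) / r

def logarithmic(x, r1, r2):
    """Logarithmic transformation Q(x) = r1 * log(1 + r2 * x), for r1, r2 > 0."""
    return r1 * math.log1p(r2 * x)

def survival(x, Q):
    """Survival probability exp{-Q(x)} when the transformed cumulative hazard is Q(x)."""
    return math.exp(-Q(x))
```

For example, `survival(x, lambda v: logarithmic(v, 1.0, 1.0))` equals 1/(1 + x), the familiar proportional odds form.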
Nonidentifiability arises if the joint distribution of the haplotype pair is totally unrestricted. Lin (2004) assumed Hardy-Weinberg equilibrium such that $P(H = (h_k, h_l)) = \pi_k\pi_l$ ($k, l = 1, \ldots, K$), where $\pi_k$ is the marginal probability that the haplotype is $h_k$ and K is the number of possible haplotypes. We consider the following one-parameter extension:
$$P(H = (h_k, h_l)) = (1 - \rho)\pi_k\pi_l + \delta_{kl}\rho\pi_k, \tag{2.4}$$
where $\delta_{kl} = 1$ if k = l and 0 otherwise, and $\rho$ is the inbreeding coefficient. Although the actual disequilibrium may not conform exactly to (2.4), this extension allows more robust inference than the standard Hardy-Weinberg equilibrium assumption.
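As a numerical illustration, the sketch below tabulates the haplotype-pair probabilities under the one-parameter extension, taking its form to be (1 - rho) pi_k pi_l + rho pi_k when k = l and (1 - rho) pi_k pi_l otherwise, which is the form implied by the mixture representation H = B Q1 + (1 - B) Q2 used in Appendix A. The function name and the example frequencies are hypothetical.

```python
def pair_probabilities(pi, rho):
    """Matrix of P(H = (h_k, h_l)) under the inbreeding extension of HWE.

    pi  : marginal haplotype frequencies (should sum to one)
    rho : inbreeding coefficient; rho = 0 recovers Hardy-Weinberg equilibrium
    """
    K = len(pi)
    return [
        [(1.0 - rho) * pi[k] * pi[l] + (rho * pi[k] if k == l else 0.0)
         for l in range(K)]
        for k in range(K)
    ]

# Hypothetical frequencies for three haplotypes.
probs = pair_probabilities([0.5, 0.3, 0.2], rho=0.025)
```

Each row of the matrix sums to the corresponding marginal frequency, so the extension changes the amount of homozygosity without changing the haplotype frequencies themselves.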
Under (2.3) and (2.4), the observed-data likelihood function concerning the parameters of interest $\theta \equiv (\beta, \rho, \pi_1, \ldots, \pi_K)$ and $\Lambda$ takes the form
![]() | (2.5) |
Simplifications arise under certain conditions. If there is no W, then (2.5) will not contain any term involving W. If X is independent of (W, H), then the conditional density of X can be dropped out of (2.5) due to factorization. In the sequel, we focus on the most common situation in which W does not exist and X is independent of H.
We propose to estimate $\theta$ and $\Lambda$ by the nonparametric maximum likelihood method. The maximum of (2.5) does not exist if $\Lambda$ is restricted to be absolutely continuous. Thus, we allow $\Lambda$ to be right continuous and maximize the following function:
![]() | (2.6) |
where $\Lambda\{Y_i\}$ denotes the jump size of $\Lambda$ at $Y_i$. The maximization is tantamount to maximizing (2.6) over $\theta$ and the $\Lambda\{Y_i\}$ associated with $\Delta_i = 1$ and can be carried out through the EM algorithm described in Appendix A.
Let $\theta_0$ and $\Lambda_0$ denote the true values of $\theta$ and $\Lambda$, and $\hat\theta$ and $\hat\Lambda$ the maximum likelihood estimators. We show in Appendix B that $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ weakly converges to a zero-mean Gaussian process and that the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ achieves the semiparametric efficiency bound (Bickel et al., 1993, Chapter 3). We can estimate the limiting covariance function of $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ by regarding (2.6) as a parametric likelihood with $\theta$ and the $\Lambda\{Y_i\}$ associated with $\Delta_i = 1$ as the parameters and inverting the observed information matrix for those parameters. We can also estimate the covariance matrix of $\hat\theta$ by the profile likelihood method (Murphy and van der Vaart, 2000). The profile log-likelihood function can be calculated via the EM algorithm, in which $\theta$ is held fixed.
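The profile likelihood variance estimator amounts to measuring the curvature of the profile log-likelihood at its maximum by a second-order numerical difference. The Python sketch below illustrates the idea on a deliberately simple stand-in, the mean of a normal sample with the variance profiled out, rather than on the likelihood (2.6); the function names, the toy data and the step size are all assumptions made for the illustration.

```python
import math

def profile_loglik(mu, xs):
    """Profile log-likelihood of a normal mean, with the variance maximized out."""
    n = len(xs)
    s2 = sum((x - mu) ** 2 for x in xs) / n  # maximizer over the nuisance variance
    return -0.5 * n * (math.log(2.0 * math.pi * s2) + 1.0)

def curvature_variance(pl, theta_hat, eps):
    """Variance estimate from the second difference of a profile log-likelihood."""
    second_diff = pl(theta_hat + eps) - 2.0 * pl(theta_hat) + pl(theta_hat - eps)
    return -eps ** 2 / second_diff

# Toy data (hypothetical).
xs = [1.2, 0.7, 2.1, 1.5, 0.9, 1.8, 1.1, 1.4]
mu_hat = sum(xs) / len(xs)
var_hat = curvature_variance(lambda m: profile_loglik(m, xs), mu_hat, eps=1e-3)
```

In this toy model the curvature-based estimate reproduces the usual maximum likelihood variance of the sample mean. In the setting of this paper each evaluation of the profile log-likelihood would instead require running the EM algorithm with the Euclidean parameter held fixed.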
| 3. NUMERICAL RESULTS |
|---|
We are currently evaluating common genetic polymorphisms which, in combination with exposure to tobacco smoking, may affect the risk of atherosclerosis and its clinical sequelae. An average of six polymorphisms, selected on the basis of their prevalence and functional significance, expression in relevant tissues, evaluation in previous studies, and biological plausibility within 19 genes involved in activation, detoxification, oxidative stress, and DNA repair pathways, are being evaluated in a well-characterized, bi-ethnic cohort of 15 792 men and women under active follow-up since 1987-1989 as part of the ARIC study. Four endpoints quantifying subclinical atherosclerosis and validated clinical atherosclerotic events are being studied under the case-cohort design.
So far, we have genotyped five SNPs in XRCC1, a major base excision repair gene. We considered all incident CHD cases occurring between 1987 and 2001. A subcohort was selected by stratified random sampling with different proportions of participants drawn from eight age-sex-race strata. Genotyping was conducted using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Cigarette smoking history was obtained through an interviewer-administered questionnaire.
We focus on the Caucasian sample, which consists of 11 526 individuals, 774 cases, and a subcohort of 698 controls. Cigarette smoking status is known for 11 519 participants. The five SNPs are missing in 12%, 6%, 10%, 12%, and 6% of the case-cohort sample. The minor allele frequencies are 0.34, 0.40, 0.37, 0.41, and 0.36. There are nine haplotypes with estimated frequencies of higher than 0.5% in the sample. The frequencies for haplotypes (00100, 00110, 01001, 01100, 01110, 10110, 11001, 11100, 11110) are estimated at 0.012, 0.158, 0.096, 0.063, 0.012, 0.227, 0.276, 0.148, and 0.008, and the inbreeding coefficient is estimated at 0.025.
We fit separate models comparing each haplotype in turn with all others. Each model includes haplotype, smoking status (ever smoke = 1, never smoke = 0), two dummy variables contrasting Minnesota and Washington to North Carolina, gender and age at the baseline, as well as the interaction between smoking and haplotype. The effects of the haplotype pair are assumed to be additive (Lin, 2004). The results for the estimation of the haplotype effects and haplotype-smoking interactions under these models are summarized in Table 1. The individuals with haplotype 00110 appear to have a significantly higher risk of CHD as compared to the individuals without this haplotype. No estimate was obtained for haplotype 00100 due to numerical instability. There is no convincing evidence for interactions.
We also compare haplotype 00110 with the other five common haplotypes in a single model, and the estimation results are shown in Table 2. There is some evidence that haplotype 00110 is associated with higher risk of CHD than all other common haplotypes, especially haplotypes 01001, 10110, and 11001. The likelihood ratio statistic for testing the global null hypothesis of no haplotype effects and no haplotype-smoking interactions has an observed $\chi^2$ value of 15.45 with 10 degrees of freedom, yielding a p-value of 0.116.
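The reported p-value can be checked directly. For an even number of degrees of freedom the upper-tail probability of the chi-square distribution has a closed form (a finite Poisson sum), so the short Python sketch below needs only the standard library; the function name is illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi2_df > x) for a positive even df (closed form)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^j / j! for j = 0, ..., df/2 - 1.
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(15.45, 10)  # close to the reported 0.116
```

The 10 degrees of freedom correspond to five haplotype effects and five haplotype-smoking interactions when haplotype 00110 is compared with the other five common haplotypes.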
We conducted extensive simulation studies to examine the finite-sample properties of the proposed methods. We considered five SNPs and generated genotypes according to the observed haplotype distribution of the ARIC data. We focused on the effect of haplotype 01100 and its interaction with a Bernoulli environmental variable with 0.6 success probability, mimicking cigarette smoking in the ARIC data. We generated time to disease occurrence from either the proportional hazards model or the proportional odds model with baseline hazard function of 0.14t and with additive haplotype effects. The individuals were selected for genotyping by case-cohort or nested case-control sampling with two controls per case. The proportions of missingness for the five SNPs among those selected for genotyping were the same as in the ARIC study. We generated censoring times from the uniform [0, 5] distribution truncated at 1. Approximately 90% of the observations were censored.
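One way to generate onset times of this kind is by inverting the survival function. The Python sketch below is an illustration under stated assumptions rather than the simulation code used for the study: it takes the baseline hazard 0.14t to mean a baseline cumulative hazard of 0.07t^2, assumes time-independent covariates summarized by a linear predictor `eta`, and uses the survival function exp{-Q(e^eta 0.07 t^2)} with Q(x) = x for proportional hazards and Q(x) = log(1 + x) for proportional odds. The function name `onset_time` is hypothetical, and the censoring mechanism is not reproduced.

```python
import math

def onset_time(u, eta, model="ph"):
    """Invert S(t) = exp{-Q(e^eta * 0.07 * t^2)} at a uniform draw u in (0, 1)."""
    e = -math.log(u)          # unit-exponential draw, equals Q(e^eta * 0.07 * t^2)
    if model == "ph":         # Q(x) = x, so the inverse of Q is the identity
        x = e
    elif model == "po":       # Q(x) = log(1 + x), so the inverse is exp(e) - 1
        x = math.expm1(e)
    else:
        raise ValueError("model must be 'ph' or 'po'")
    return math.sqrt(x / (0.07 * math.exp(eta)))
```

Feeding the function independent uniform draws, for instance from `random.random()`, yields onset times from the chosen model; censoring would then be applied separately.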
Table 3 summarizes the results of the simulation studies for n = 2000 and with various combinations of parameter values. The parameter estimators seem to have little bias. The profile likelihood method provides accurate estimators of the variances. The Wald tests have proper type I error rates, and the confidence intervals have reasonable coverage probabilities. The relative efficiencies of the case-cohort and nested case-control designs are generally between 80% and 90% for estimating haplotype effects and haplotype-environment interactions and over 95% for estimating environmental effects. Thus, these designs are highly cost-effective since only 30% of the entire cohort is genotyped. The case-cohort design appears to be slightly more efficient than the nested case-control design; however, 2-4% of selected controls became cases later on, so the total number of individuals genotyped is slightly smaller under the nested case-control design than under the case-cohort design.
| 4. REMARKS |
|---|
The results presented in Section 3.1 represent some preliminary findings from a major ongoing investigation. We are currently genotyping additional SNPs in the XRCC1 gene and examining 18 other genes using the methods proposed here. The full results will be reported elsewhere.
In practice, the true model is unknown. Thus, one will need to explore several possible models. Since the proposed methods are likelihood based, we can apply model selection criteria such as the Akaike information criterion (AIC) (Akaike, 1985) to determine the best model. Our experience shows that AIC performs well in this kind of setting (see Lin, 2004).
It is assumed in (2.6) that there is no W and X is independent of H. This assumption is reasonable in most genetic studies. It is easy to remove this assumption if X is time independent and discrete because then the general likelihood function given in (2.5) just involves some discrete probability functions. If X contains one or two time-independent continuous components, we can still estimate the conditional density function of X nonparametrically.
We have assumed that W is discrete and time independent. If W is continuous and possibly time dependent but X is discrete and time independent, we will replace
in (2.5) with
However, if both X and W are continuous, it is necessary to parameterize the distribution; nevertheless, nonparametric estimation is possible if X and W have one continuous component each.
If one is not interested in haplotypes, the likelihood given in (2.5) simplifies greatly. There will be no summation over H, and H will disappear from all expressions. The theoretical results will continue to hold, and the EM algorithm will still apply, although the parameters will not include $\rho$ and $\pi_k$. Scheike and Juul (2004) and Scheike and Martinussen (2004) studied maximum likelihood estimation in the proportional hazards model under case-cohort and nested case-control designs (without the additional complexities due to haplotype uncertainty and missing genotype data) but did not provide theoretical justifications for the asymptotic results. The asymptotic theory derived in the present paper covers those situations. Note that the aforementioned challenge in dealing with continuous covariates still exists even when one is not interested in haplotypes. In fact, this challenge tends to be less severe in genetic studies because genes are discrete and are usually independent of other covariates.
This paper is focused on case-cohort and nested case-control designs, while the recent paper of Lin and Zeng (2006) is concerned with other commonly used study designs. A nontechnical description of the methods developed in the two papers was provided by Lin et al. (2005). The software is available at http://www.bios.unc.edu/
lin.
| APPENDIX A |
|---|
Write $H = BQ_1 + (1 - B)Q_2$, where B is a Bernoulli variable with success probability $\rho$ and $Q_1$ and $Q_2$ are discrete variables with $P(Q_1 = (h_k, h_k)) = \pi_k$ and $P(Q_2 = (h_k, h_l)) = \pi_k\pi_l$ ($k, l = 1, \ldots, K$). We introduce a subject-specific frailty $\xi$ with density $\psi(\xi)$ such that
![]() | (A.1) |
Then the observed-data likelihood function under the transformation model is equivalent to the likelihood function under the proportional hazards frailty model: the conditional hazard function of T given $(B, Q_1, Q_2, \xi)$ is
By treating $(B, Q_1, Q_2, \xi)$ as missing data, we obtain the following complete-data likelihood function:
![]() | (A.2) |
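The frailty device can be illustrated numerically. The sketch below assumes that the frailty density is tied to the transformation through the Laplace-transform relation exp{-Q(x)} = integral of exp(-x xi) psi(xi) over xi > 0; this is our reading of (A.1), whose display is not reproduced here. Under that reading, the logarithmic transformation Q(x) = r1 log(1 + r2 x) corresponds to a gamma frailty with shape r1 and scale r2, because the Laplace transform of that gamma density is (1 + r2 x)^(-r1). The function names and the integration settings are illustrative.

```python
import math

def gamma_density(xi, shape, scale):
    """Density of a gamma variable with the given shape and scale."""
    return xi ** (shape - 1.0) * math.exp(-xi / scale) / (math.gamma(shape) * scale ** shape)

def laplace_transform(x, shape, scale, upper=80.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-x*xi) * psi(xi) over xi > 0."""
    h = upper / steps
    return h * sum(
        math.exp(-x * (k + 0.5) * h) * gamma_density((k + 0.5) * h, shape, scale)
        for k in range(steps)
    )

# Proportional odds case: Q(x) = log(1 + x), i.e. r1 = r2 = 1 (unit-mean exponential frailty).
lhs = math.exp(-math.log1p(1.5))
rhs = laplace_transform(1.5, shape=1.0, scale=1.0)
```

The two quantities agree up to numerical integration error, which is the sense in which a transformation model can be rewritten as a proportional hazards model with a random multiplicative frailty.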
In the M-step of the EM algorithm, we maximize the conditional expectation of the logarithm of (A.2) given the observed data. Let $\hat E_i$ denote the conditional expectation given the ith observation $(Y_i, X_i, \Delta_i, R_i, R_iG_i)$. Then $\rho$ and $\pi_k$ are updated by the following formulas:
![]() |
In addition, we update ß by solving the following equation:
![]() | (A.3) |
and update $\Lambda$ by the step function with jump sizes
![]() | (A.4) |
Note that (A.3) and (A.4) are reminiscent of the partial likelihood (Cox, 1972) score equation and the Breslow (1972) estimator.
In light of (A.3) and (A.4), we calculate the conditional expectations in the form of E[
i
(Bi, Q1i, Q2i)|Yi, Xi,
i, Ri, RiGi] in the E-step. We can avoid numerical integration over
i in these calculations. Define
In view of (A.2), the conditional density of
given (Bi, Q1i, Q2i) and (Yi, Xi,
i, Ri, RiGi) is proportional to
so that
![]() |
By differentiating (A.1) with respect to x, we obtain
![]() |
It follows that
![]() |
Consequently,
![]() |
According to (A.2), the conditional density of (Bi, Q1i, Q2i) given the observed data is proportional to gi(Bi, Q1i, Q2i), where
![]() |
Thus, E[
i
(Bi, Q1i, Q2i)|Yi, Xi,
i, Ri, RiGi] is equal to
![]() |
for individuals with Ri = 1 and is equal to
![]() |
for individuals with Ri = 0.
| APPENDIX B |
|---|
We impose the following conditions:
- (C.1) Both X(t) and $Z(H, X(t))$ have bounded total variations in $[0, \tau]$ with probability one, where $\tau$ corresponds to the end of the study.
- (C.2) There exists a positive constant a such that with probability one,
and
- (C.3) If
for t
[0,
] and k = 1, ..., K with probability one, then ß1 = ß2 and µ1(t) = µ2(t).
- (C.4) |ß0|
c0 for some known constant c0, and
0(t) is continuous and positive for t
[0,
].
- (C.5) Q(x) satisfies one of the two conditions:
- (C.5.1) for any positive constant c0,
- (C.5.2) there exist some constants r1, r2 > 0 such that Q(x) = r1 log( 1 + r2x).
We state the asymptotic results in three theorems. The above conditions are assumed to hold in the theorems. The first theorem states the consistency, weak convergence, and asymptotic efficiency.
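THEOREM B.1 With probability one, $|\hat\theta - \theta_0| + \sup_{t \in [0, \tau]} |\hat\Lambda(t) - \Lambda_0(t)| \to 0$. In addition, $n^{1/2}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ weakly converges to a zero-mean Gaussian process in $R^d \times l^\infty[0, \tau]$, where d is the dimension of $\theta$ and $l^\infty[0, \tau]$ is a normed space consisting of all the bounded functions and the norm is defined as the supremum norm on $[0, \tau]$. Furthermore, the limiting covariance matrix of $n^{1/2}(\hat\theta - \theta_0)$ achieves the semiparametric efficiency bound.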
The second theorem justifies the estimation of the limiting covariance function of
by the inverse information matrix.
THEOREM B.2 Let V(h1, h2) be the limiting variance of the random variablewhere h1 is a d-vector and h2 is a bounded function. The estimator
uniformly in (h1, h2) in probability, where hn consists of h1 and the values of h2(Yi) associated with
i = 1, and
is the negative Hessian matrix of
with respect to
and the
{Yi} associated with
i = 1.
The last theorem justifies the use of the profile log-likelihood pln(
)
max
log Ln(
,
) in estimating the limiting covariance matrix of 
THEOREM B.3 For any d-vector h1 with norm one,
in probability, where
n = O(n1/2) and
is the limiting covariance matrix of
The proofs of these theorems involve advanced mathematical tools from empirical process theory (van der Vaart and Wellner, 1996) and semiparametric efficiency theory (Bickel et al., 1993). We outline here the main arguments. The detailed proofs are available from the authors.
Proof of Theorem B.1. We first prove the consistency under Condition (C.5.1). The proof consists of three steps.
Step 1: We show the existence of
or equivalently the finiteness of the jump sizes of
The logarithm of (2.6), denoted by ln(
,
), is bounded by
![]() | (B.1) |
where O(1) denotes some positive constant and M is a constant satisfying
![]() |
Such an M exists under Conditions (C.1) and (C.4). It then follows from Condition (C.5.1) that (B.1) will diverge if
{Yi} is infinite for some i.
Step 2: We show that with probability one,
is bounded for any n. Let
where
Clearly,
![]() | (B.2) |
Since P(
= 0, Y =
) > 0, (B.2) will be negative if
n diverges. Thus,
n is bounded, which implies that
is bounded.
Step 3: By Helly's selection theorem, we can choose a subsequence such that
and
with probability one. It remains to show that
* =
0 and
* =
0. Note that
![]() | (B.3) |
where
![]() |
![]() |
and
![]() |
In view of (B.3), we construct another step function
with
By the GlivenkoCantelli theorem,
uniformly converges to
0, and
is absolutely continuous with respect to
with the derivative converging uniformly to d
*(t)/d
0(t). Since
the KullbackLeibler information of (
*,
*) with respect to (
0,
0) is non-negative, so that (2.6) has the same value almost surely whether (
,
) = (
*,
*) or (
0,
0). Setting n = 1, Gi = 2hk, Ri = 1, and
i = 1 and integrating Yi from y to
, we obtain
![]() |
By comparing this equation with the one obtained from (2.6) with n = 1, Gi = 2hk, Ri = 1,
i = 0, and Yi =
, we have
![]() |
The choice of y = 0 yields that
*
k* + (1
*)
k*2 =
0
0k + (1
0)
0k2, which entails that
* =
0 and
k* =
0k. In addition,
implying that
![]() |
It then follows from Condition (C.6) that ß* = ß0 and
* =
0. Hence,
and
almost surely. Since
0 is continuous, the weak convergence of
can be strengthened to the convergence uniformly in [0,
].
If Q(·) satisfies Condition (C.5.2) instead of (C.5.1), we need to modify Step 2. It follows from (B.3) that
![]() |
![]() | (B.4) |
By partitioning [0,
] into a sequence of intervals as in Zeng et al. (2005)
and examining the two terms on the right-hand side of (B.4) when Yi lies in each partition, we can show that the right-hand side of (B.4) is negative if
diverges. Thus,
must be bounded.
To derive the asymptotic distribution of
we apply Theorem 3.3.1 of van der Vaart and Wellner (1996)
to the score operators for
and
Except for the invertibility of the derivative operator of the score operator, all the conditions in Theorem 3.3.1 can be verified via empirical process theory (see Zeng et al., 2005
). The derivative operator is invertible if the information operator is one-to-one. Thus, we wish to show that if a score function along the path
is zero, then h1 = 0 and h2 = 0. For Ri = 1 and Gi = 2hk, the score equation is
![]() | (B.5) |
where (h1ß, h1
, h1k) are the components of h1 associated with (ß0,
0,
0k) and
Setting Y = 0 yields that h1
= h1k = 0. This result implies that (B.5) is a homogeneous integral equation for
so that
Thus, h2 = 0 and h1ß = 0. It then follows from Theorem 3.3.1 of van der Vaart and Wellner (1996)
that
weakly converges to a zero-mean Gaussian process. Furthermore, we can use the arguments of Zeng et al. (2005)
to show that
is asymptotically efficient.
Proof of Theorem B.2. This proof follows from the arguments given in the proof of Theorem 3 of Zeng et al. (2005). The details are omitted.
Proof of Theorem B.3. We can verify the conditions in Theorem 1 of Murphy and van der Vaart (2000). In particular, we can construct the least favorable submodel by using the invertibility of the information operator shown in the proof of Theorem B.1. The details are omitted.
| ACKNOWLEDGMENTS |
|---|
This research was supported by the National Institutes of Health. The authors thank two referees for their timely reviews and helpful comments. Conflict of Interest: None declared.
| REFERENCES |
|---|
AKAIKE, H. (1985). Prediction and entropy. In Atkinson, A. C. and Fienberg, S. E. (eds), A Celebration of Statistics. New York: Springer, pp. 1-24.
AKEY, J., JIN, L. AND XIONG, M. (2001). Haplotypes vs. single marker linkage disequilibrium tests: what do we gain? European Journal of Human Genetics 9, 291-300.
BICKEL, P. J., KLAASSEN, C. A. J., RITOV, Y. AND WELLNER, J. A. (1993). Efficient and Adaptive Estimation in Semiparametric Models. Baltimore, MD: Johns Hopkins University Press.
BRESLOW, N. E. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B 34, 216-217.
BRESLOW, N. E. AND DAY, N. E. (1987). Statistical Methods in Cancer Research: The Design and Analysis of Cohort Studies. Lyon: International Agency for Research on Cancer.
COX, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187-220.
FRIED, L. P., BORHANI, N. O., ENRIGHT, P., FURBERG, C. D., GARDIN, J. M., KRONMAL, R. A., KULLER, L. H., MANOLIO, T. A., MITTELMARK, M. B., NEWMAN, A. et al. (1991). The Cardiovascular Health Study: design and rationale. Annals of Epidemiology 1, 263-276.
JOHNSON, S. R., ANDERSON, G. L., BARAD, D. H. AND STEFANICK, M. L. (1999). The Women's Health Initiative: rationale, design, and progress report. Journal of the British Menopause Society 5, 155-159.
KALBFLEISCH, J. D. AND PRENTICE, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd edition. Hoboken, NJ: Wiley.
KULICH, M. AND LIN, D. Y. (2004). Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association 99, 832-844.
LIN, D. Y. (2004). Haplotype-based association analysis in cohort studies of unrelated individuals. Genetic Epidemiology 26, 255-264.
LIN, D. Y. AND ZENG, D. (2006). Likelihood-based inference on haplotype effects in genetic association studies (with discussion). Journal of the American Statistical Association 101, 89-118.
LIN, D. Y., ZENG, D. AND MILLIKAN, R. (2005). Maximum likelihood estimation of haplotype effects and haplotype-environment interactions in association studies. Genetic Epidemiology 29, 299-312.
MORRIS, R. W. AND KAPLAN, N. L. (2002). On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology 23, 221-233.
MURPHY, S. A. AND VAN DER VAART, A. W. (2000). On profile likelihood. Journal of the American Statistical Association 95, 449-465.
NAN, B. (2004). Efficient estimation for case-cohort studies. The Canadian Journal of Statistics 32, 403-419.
SCHAID, D. J. (2004). Evaluating associations of haplotypes with traits. Genetic Epidemiology 27, 348-364.
SCHAID, D. J., ROWLAND, C. M., TINES, D. E., JACOBSON, R. M. AND POLAND, G. A. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. American Journal of Human Genetics 70, 425-434.
SCHEIKE, T. H. AND JUUL, A. (2004). Maximum likelihood estimation for Cox's regression model under nested case-control sampling. Biostatistics 5, 193-206.
SCHEIKE, T. H. AND MARTINUSSEN, T. (2004). Maximum likelihood estimation for Cox's regression model under case-cohort sampling. Scandinavian Journal of Statistics 31, 283-293.
THE ARIC INVESTIGATORS (1989). The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology 129, 687-702.
VAN DER VAART, A. W. AND WELLNER, J. A. (1996). Weak Convergence and Empirical Processes. New York: Springer.
ZAYKIN, D. V., WESTFALL, P. H., YOUNG, S. S., KARNOUB, M. A., WAGNER, M. J. AND EHM, M. G. (2002). Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Human Heredity 53, 79-91.
ZENG, D., LIN, D. Y. AND YIN, G. (2005). Maximum likelihood estimation for the proportional odds model with random effects. Journal of the American Statistical Association 100, 470-483.
Received December 5, 2005; accepted for publication February 7, 2006.