Biostatistics Advance Access originally published online on February 24, 2006
Biostatistics 2006 7(3):486-502; doi:10.1093/biostatistics/kxj021

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Efficient semiparametric estimation of haplotype-disease associations in case–cohort and nested case–control studies

D. Zeng and D. Y. Lin*

Department of Biostatistics, CB# 7420, University of North Carolina, Chapel Hill, NC 27599-7420, USA lin{at}bios.unc.edu

C. L. Avery and K. E. North

Department of Epidemiology, CB# 7435, University of North Carolina, Chapel Hill, NC 27599-7420, USA

M. S. Bray

Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. INFERENCE PROCEDURES
 3. NUMERICAL RESULTS
 4. REMARKS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
Estimating the effects of haplotypes on the age of onset of a disease is an important step toward the discovery of genes that influence complex human diseases. A haplotype is a specific sequence of nucleotides on the same chromosome of an individual and can only be measured indirectly through the genotype. We consider cohort studies which collect genotype data on a subset of cohort members through case–cohort or nested case–control sampling. We formulate the effects of haplotypes and possibly time-varying environmental variables on the age of onset through a broad class of semiparametric regression models. We construct appropriate nonparametric likelihoods, which involve both finite- and infinite-dimensional parameters. The corresponding nonparametric maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Consistent variance–covariance estimators are provided, and efficient and reliable numerical algorithms are developed. Simulation studies demonstrate that the asymptotic approximations are accurate in practical settings and that case–cohort and nested case–control designs are highly cost-effective. An application to a major cardiovascular study is provided.

Keywords: Age of onset; Association studies; Censoring; Haplotype effects; Nonparametric likelihood; Proportional hazards; Semiparametric efficiency; Single nucleotide polymorphisms; Survival data


    1. INTRODUCTION
 
Complex human diseases, such as cancer, diabetes, schizophrenia, and coronary heart disease (CHD), are affected by multiple genetic and environmental factors. Recent sequencing of the human genome and advances in genotyping technologies have spurred an enormous interest in genetic association studies which explore the relationships between complex diseases and single nucleotide polymorphisms (SNPs). SNPs are single-base variations in the genetic code that occur about every 1000 bases along the 3 billion bases of the human genome. A specific combination of nucleotides at a series of nearby SNPs on the same chromosome of an individual is called a haplotype. The use of haplotypes can yield more powerful tests of genetic associations than the use of single SNPs, especially when the disease-predisposing SNPs are not directly measured or when there are strong interactions of multiple SNPs on the same chromosome (Akey et al., 2001; Morris and Kaplan, 2002; Schaid et al., 2002; Zaykin et al., 2002; Schaid, 2004).

Current genotyping technologies cannot separate the two homologous chromosomes of an individual. Consequently, only the unphased genotype, i.e. the combination of the two homologous haplotypes, is directly observable. Several methods have been proposed for inferring individual haplotypes and for estimating haplotype-specific relative risks based on unphased genotype data from case–control studies (see Schaid, 2004, for a recent review).

Cohort studies offer several advantages over case–control studies (Breslow and Day, 1987, pp. 11–20). First, the age of onset carries more information about the etiology of a complex disease than the disease status. Second, selection and information biases inherent in case–control studies can usually be eliminated in cohort studies. Third, the cohort design enables one to investigate a full range of diseases and related traits in a single study.

Cohort studies are major undertakings, involving long-term follow-up of many individuals. Fortunately, there are a number of cohort studies that have already been assembled for other purposes and have repositories of stored specimens that would allow the individuals to be genotyped for candidate genes of interest. Examples include the Cardiovascular Health Study (Fried et al., 1991), the Women's Health Initiative (Johnson et al., 1999), and the Atherosclerosis Risk in Communities (ARIC) Study (The ARIC Investigators, 1989).

Lin (2004) showed how to perform the Cox regression analysis of haplotype-disease associations with genotype data in cohort studies. The genotype data are required to be available on all cohort members. Despite the continuing improvement in genotyping efficiency, it is still prohibitively expensive to genotype a large cohort. An efficient compromise is to employ the case–cohort or nested case–control design (Kalbfleisch and Prentice, 2002, Section 11.4), so that only a subset of the cohort members need to be genotyped. In fact, the case–cohort design was recently employed in the ARIC study, which is an epidemiologic cohort study of 15 792 individuals aged 45–64 years to investigate the etiology of atherosclerosis and other diseases. There is a large body of literature on the Cox regression for case–cohort and nested case–control designs (see Kulich and Lin, 2004; Nan, 2004; Scheike and Juul, 2004; Scheike and Martinussen, 2004; and the references therein). None of the existing work, however, deals with the additional complexity due to haplotype uncertainty.

In the present paper, we study semiparametric estimation of haplotype-disease associations in case–cohort and nested case–control studies. The fact that the genotype data are available only on a biased subset of the cohort members poses considerable challenges in making inference about haplotype-disease associations. We propose a broad class of semiparametric regression models to formulate the effects of haplotype configurations and possibly time-dependent environment factors on the age of onset of disease. We derive appropriate likelihoods for these models and establish the asymptotic properties of the resultant maximum likelihood estimators. We develop efficient and stable numerical algorithms to implement the corresponding inference procedures. We apply the proposed methods to the aforementioned ARIC study, which motivated this work.


    2. INFERENCE PROCEDURES
 
Let T be the time to disease occurrence, H the pair of homologous haplotypes, and G the corresponding genotype. If we denote the two possible alleles of each SNP by the values 0 versus 1, then H is a pair of ordered sequences of zeros and ones and G, which is the sum of the two sequences in H, is an ordered sequence of zeros, ones, and twos. Although we are interested in the association between H and T, we only observe G directly.

Under case–cohort and nested case–control designs, a subset of individuals is selected for genotyping. We allow the possibility that some other expensive discrete time-independent covariates, denoted by W, are also measured in this subset only. Additional covariates of interest X, possibly time dependent, are measured on all cohort members.

The time to disease occurrence will be censored if the individual has not developed the disease of interest by the end of the study or is withdrawn from the study prematurely. Let C denote the potential censoring time. We assume coarsening at random. That is, the event C = t is independent of (T, H, W) conditional on {X(s): s ≤ t} and T ≥ t.

Suppose that we have a cohort of n individuals. We collect the data Formula (i = 1, ..., n), where Yi = min(Ti, Ci), Δi = I(Ti ≤ Ci), Formula and I(·) is the indicator function. We also measure G and W for a subset of the cohort, which is selected by case–cohort or nested case–control sampling.

Under the case–cohort sampling, we randomly select a subcohort from the full cohort. The selection probabilities depend on the observed event histories and possibly on covariates that are always measured. Let Ri indicate by the values 1 versus 0 whether the ith individual is selected. We assume missing at random in that Formula The observed data can be represented as Formula (i = 1, ..., n).
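The selection indicator Ri can be illustrated with a short Python sketch. It assumes, purely for illustration, that every case is genotyped and that each non-case enters the subcohort independently with a common probability; the function name and the `subcohort_prob` parameter are hypothetical and not taken from the paper, which allows more general selection probabilities.

```python
import random


def case_cohort_selection(delta, subcohort_prob, seed=0):
    """Return selection indicators R_i (1 = genotyped, 0 = not genotyped).

    delta: list of event indicators (1 = case, 0 = censored).
    subcohort_prob: assumed common sampling probability for non-cases.
    Assumption of this sketch: all cases are selected with probability one.
    """
    if not 0.0 <= subcohort_prob <= 1.0:
        raise ValueError("subcohort_prob must lie in [0, 1]")
    rng = random.Random(seed)
    selected = []
    for d in delta:
        if d == 1:
            selected.append(1)  # cases are always genotyped in this sketch
        else:
            selected.append(1 if rng.random() < subcohort_prob else 0)
    return selected
```

Because the selection depends only on the observed event indicator, this sampling scheme satisfies the missing-at-random requirement described above.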

Let S(G) denote the set of all haplotype pairs that are compatible with genotype G. Then the observed-data likelihood function can be written as
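As an illustration of the set S(G), the following Python sketch enumerates the ordered haplotype pairs whose element-wise sum equals a given genotype. The function name is hypothetical, and the treatment of a missing SNP (coded as None, compatible with any allele combination) is an assumption of this sketch rather than a detail stated in the text.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate ordered haplotype pairs (h1, h2) with h1 + h2 == genotype.

    genotype: sequence with entries 0, 1, or 2 (allele counts per SNP);
    None marks a missing SNP (assumption: any allele pair is compatible).
    """
    per_locus = []
    for g in genotype:
        if g == 0:
            per_locus.append([(0, 0)])
        elif g == 2:
            per_locus.append([(1, 1)])
        elif g == 1:
            # heterozygous locus: the phase is unknown
            per_locus.append([(0, 1), (1, 0)])
        elif g is None:
            per_locus.append([(0, 0), (0, 1), (1, 0), (1, 1)])
        else:
            raise ValueError("genotype entries must be 0, 1, 2, or None")
    pairs = []
    for combo in product(*per_locus):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs
```

A genotype with m heterozygous SNPs and no missing values yields 2^m ordered pairs, which is why the phase ambiguity grows quickly with the number of SNPs.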

Formula

where λT and λC pertain to the conditional hazard functions of T and C, respectively, and fX(t) pertains to the conditional density of X(t). Thus, the observed-data likelihood function concerning the distribution of T given (H, X, W) is proportional to

Formula 2(2.1)

where f is the conditional density of Formula 2

Under the nested case–control sampling, a small number of the individuals who are at risk at the time of disease occurrence of a case are selected for genotyping. The probability of selection at time t for an individual may depend on the observed past history Formula 2 The observed data can be represented as Formula 2 (i = 1, ..., n), where Formula 2 and Si(t) indicates whether the ith individual is selected for genotyping at time t.

To motivate the likelihood construction, we pretend that all the random variables are discrete. Then the observed-data likelihood is

Formula 2(2.2)

If the ith individual is never selected for genotyping, i.e. Si(t) = 0 for all t ≤ Yi, then Δi = 0 and Formula 2 only contains the information of Ti ≥ t and Formula 2 so the likelihood contribution from this individual is the same as the likelihood of Formula 2 If the ith individual is selected, i.e. Si(t) = 1 for some t0 ≤ Yi, then Formula 2 contains the information of Ti ≥ t and Formula 2 for t < t0 and becomes the information of Ti ≥ t, Gi, Wi, and Formula 2 for t ≥ t0, so the contribution from this individual to (2.2) is the same as the likelihood of Formula 2 Thus, the likelihood function concerning the distribution of T given (H, W, X) is exactly the same as (2.1), in which Ri ≡ max {Si(t): t ≤ Yi} indicates whether the ith individual is ever selected for genotyping.
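The nested case–control selection and the resulting indicator Ri can be sketched in Python as follows. The sketch assumes simple random sampling of a fixed number of controls from the risk set at each case's event time; the function name and the `controls_per_case` parameter are hypothetical.

```python
import random


def nested_case_control_selection(times, delta, controls_per_case=2, seed=0):
    """Return R_i = 1 if individual i is ever selected for genotyping.

    times: observed times Y_i; delta: event indicators (1 = case).
    At each case's event time, controls are drawn at random from the
    individuals still at risk (Y_j >= Y_i, j != i). Every case is genotyped.
    """
    rng = random.Random(seed)
    n = len(times)
    selected = [0] * n
    # visit the cases in the order of their event times
    for i in sorted(range(n), key=lambda k: times[k]):
        if delta[i] != 1:
            continue
        selected[i] = 1
        at_risk = [j for j in range(n) if j != i and times[j] >= times[i]]
        m = min(controls_per_case, len(at_risk))
        for j in rng.sample(at_risk, m):
            selected[j] = 1  # R_j is the maximum of S_j(t) over t <= Y_j
    return selected
```

An individual sampled as a control may later become a case, so the number of distinct genotyped individuals can be smaller than the number of cases times (1 + controls per case).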

REMARK 2.1 In the above derivation, the sampling is assumed to be independent among individuals. We may relax this assumption by allowing the sampling at time t to depend on the observed history at t of all individuals so that sampling without replacement can be accommodated. The likelihood function remains the same.

The conditional hazard function Formula 2 represents the effects of the haplotype pair and environmental factors on the risk of disease, which can be formulated by a variety of parametric and semiparametric models. We propose the following class of semiparametric transformation models in terms of the cumulative hazard function:

Formula 2(2.3)

where Λ(t) is an unknown increasing function with Λ(0) = 0, Formula 2 is a specified function of H, X(t), and W, and Q is a three-times differentiable transformation with Q(0) = 0 and Q'(x) > 0. Here and in the sequel, g'(x) = dg(x)/dx and g''(x) = d²g(x)/dx². We may use the class of Box–Cox transformations Q(x) = {(1 + x)^r – 1}/r (r > 0) or the class of logarithmic transformations Q(x) = r1 log(1 + r2x) (r1 > 0, r2 > 0). The choices of Q(x) = x and Q(x) = log(1 + x) yield the proportional hazards and proportional odds models, respectively.
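The two transformation classes are easy to write down in code. The Python sketch below is only an illustration (the function names are hypothetical); it evaluates Q for the Box–Cox and logarithmic classes so that the special cases named above can be checked directly.

```python
import math


def box_cox(x, r):
    """Box-Cox transformation Q(x) = {(1 + x)^r - 1} / r, for r > 0."""
    if r <= 0:
        raise ValueError("r must be positive")
    return ((1.0 + x) ** r - 1.0) / r


def logarithmic(x, r1, r2):
    """Logarithmic transformation Q(x) = r1 * log(1 + r2 * x), r1, r2 > 0."""
    if r1 <= 0 or r2 <= 0:
        raise ValueError("r1 and r2 must be positive")
    return r1 * math.log1p(r2 * x)
```

With r = 1 the Box–Cox transformation reduces to Q(x) = x (proportional hazards), and with r1 = r2 = 1 the logarithmic transformation reduces to Q(x) = log(1 + x) (proportional odds). Both satisfy Q(0) = 0 and are strictly increasing on x ≥ 0.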

Nonidentifiability arises if the joint distribution of the haplotype pair is totally unrestricted. Lin (2004) assumed Hardy–Weinberg equilibrium such that P(H = (hk, hl)) = πkπl (k, l = 1, ..., K), where πk is the marginal probability that the haplotype is hk and K is the number of possible haplotypes. We consider the following one-parameter extension:

P(H = (hk, hl)) = (1 – ρ)πkπl + δklρπk    (2.4)

where δkl = 1 if k = l and 0 otherwise, and ρ is the inbreeding coefficient. Although the actual disequilibrium may not conform exactly to (2.4), this extension allows more robust inference than the standard Hardy–Weinberg equilibrium assumption.
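To make the role of the inbreeding coefficient concrete, the Python sketch below evaluates the haplotype-pair probabilities in the form implied by the mixture representation of Appendix A, namely (1 – ρ)πkπl + δklρπk. The function name is hypothetical, and the numerical inputs are the ARIC estimates quoted in Section 3.1, used here only as example values.

```python
def haplotype_pair_probability(k, l, pi, rho):
    """P(H = (h_k, h_l)) = (1 - rho) * pi_k * pi_l + delta_kl * rho * pi_k.

    pi: list of marginal haplotype frequencies (should sum to one).
    rho: inbreeding coefficient; rho = 0 gives Hardy-Weinberg equilibrium.
    """
    same = 1.0 if k == l else 0.0
    return (1.0 - rho) * pi[k] * pi[l] + same * rho * pi[k]


# Example values: estimated frequencies and inbreeding coefficient from Section 3.1.
aric_pi = [0.012, 0.158, 0.096, 0.063, 0.012, 0.227, 0.276, 0.148, 0.008]
aric_rho = 0.025
total = sum(
    haplotype_pair_probability(k, l, aric_pi, aric_rho)
    for k in range(len(aric_pi))
    for l in range(len(aric_pi))
)
```

Summing over the second haplotype returns the marginal frequency πk for any ρ, so the extension changes the proportion of homozygous pairs without changing the marginal haplotype distribution.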

Under (2.3) and (2.4), the observed-data likelihood function concerning the parameters of interest θ ≡ (β, ρ, π1, ..., πK) and Λ takes the form

Formula 2(2.5)

Simplifications arise under certain conditions. If there is no W, then (2.5) will not contain any term involving W. If Formula 2 is independent of (W, H), then the conditional density of Formula 2 can be dropped out of (2.5) due to factorization. In the sequel, we focus on the most common situation in which W does not exist and X is independent of H.

We propose to estimate θ and Λ by the nonparametric maximum likelihood method. The maximum of (2.5) does not exist if Λ is restricted to be absolutely continuous. Thus, we allow Λ to be right continuous and maximize the following function:

Formula 2(2.6)

where Λ{Yi} denotes the jump size of Λ at Yi. The maximization is tantamount to maximizing (2.6) over θ and the Λ{Yi} associated with Δi = 1 and can be carried out through the EM algorithm described in Appendix A.

Let θ0 and Λ0 denote the true values of θ and Λ, and Formula 2 and Formula 2 the maximum likelihood estimators. We show in Appendix B that Formula 2 weakly converges to a zero-mean Gaussian process and that the limiting covariance matrix of Formula 2 achieves the semiparametric efficiency bound (Bickel et al., 1993, Chapter 3). We can estimate the limiting covariance function of Formula 2 by regarding (2.6) as a parametric likelihood with θ and the Λ{Yi} associated with Δi = 1 as the parameters and inverting the observed information matrix for those parameters. We can also estimate the covariance matrix of Formula 2 by the profile likelihood method (Murphy and van der Vaart, 2000). The profile log-likelihood function can be calculated via the EM algorithm, in which θ is held fixed.


    3. NUMERICAL RESULTS
 

3.1 ARIC study

We are currently evaluating common genetic polymorphisms which, in combination with exposure to tobacco smoking, may affect the risk of atherosclerosis and its clinical sequelae. An average of six polymorphisms, selected on the basis of their prevalence and functional significance, expression in relevant tissues, evaluation in previous studies, and biological plausibility within 19 genes involved in activation, detoxification, oxidative stress, and DNA repair pathways, are being evaluated in a well-characterized, bi-ethnic cohort of 15 792 men and women under active follow-up since 1987–1989 as part of the ARIC study. Four endpoints quantifying subclinical atherosclerosis and validated clinical atherosclerotic events are being studied under the case–cohort design.

So far, we have genotyped five SNPs in XRCC1, a major base excision repair gene. We considered all incident CHD cases occurring between 1987 and 2001. A subcohort was selected by stratified random sampling with different proportions of participants drawn from eight age–sex–race strata. Genotyping was conducted using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Cigarette smoking history was obtained through an interviewer-administered questionnaire.

We focus on the Caucasian sample, which consists of 11 526 individuals, 774 cases, and a subcohort of 698 controls. Cigarette smoking status is known for 11 519 participants. The five SNPs are missing in 12%, 6%, 10%, 12%, and 6% of the case–cohort sample. The minor allele frequencies are 0.34, 0.40, 0.37, 0.41, and 0.36. There are nine haplotypes with estimated frequencies of higher than 0.5% in the sample. The frequencies for haplotypes (00100, 00110, 01001, 01100, 01110, 10110, 11001, 11100, 11110) are estimated at 0.012, 0.158, 0.096, 0.063, 0.012, 0.227, 0.276, 0.148, and 0.008, and the inbreeding coefficient is estimated at 0.025.

We fit separate models comparing each haplotype in turn with all others. Each model includes haplotype, smoking status (ever smoke = 1, never smoke = 0), two dummy variables contrasting Minnesota and Washington to North Carolina, gender and age at the baseline, as well as the interaction between smoking and haplotype. The effects of the haplotype pair are assumed to be additive (Lin, 2004). The results for the estimation of the haplotype effects and haplotype–smoking interactions under these models are summarized in Table 1. The individuals with haplotype 00110 appear to have a significantly higher risk of CHD as compared to the individuals without this haplotype. No estimate was obtained for haplotype 00100 due to numerical instability. There is no convincing evidence for interactions.


 
Table 1. Estimates of haplotype effects and haplotype–smoking interactions for the ARIC study based on separate models

 
We also compare haplotype 00110 with the other five common haplotypes in a single model, and the estimation results are shown in Table 2. There is some evidence that haplotype 00110 is associated with higher risk of CHD than all other common haplotypes, especially haplotypes 01001, 10110, and 11001. The likelihood ratio statistic for testing the global null hypothesis of no haplotype effects and no haplotype–smoking interactions has an observed χ² value of 15.45 with 10 degrees of freedom, yielding a p-value of 0.116.
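The reported p-value can be checked from the chi-squared survival function. For an even number of degrees of freedom this function has a closed form, which the following Python sketch evaluates using only the standard library; the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability P(chi2_df > x) for even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Likelihood ratio statistic and degrees of freedom quoted in the text.
p_value = chi2_sf_even_df(15.45, 10)
```

The ten degrees of freedom correspond to five haplotype effects and five haplotype–smoking interactions relative to the reference haplotype 00110.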


 
Table 2. Estimates of haplotype effects and haplotype–smoking interactions for the ARIC study based on a full model with haplotype 00110 as the reference

 
3.2 Simulation studies

We conducted extensive simulation studies to examine the finite-sample properties of the proposed methods. We considered five SNPs and generated genotypes according to the observed haplotype distribution of the ARIC data. We focused on the effect of haplotype 01100 and its interaction with a Bernoulli environmental variable with 0.6 success probability, mimicking cigarette smoking in the ARIC data. We generated time to disease occurrence from either the proportional hazards model or the proportional odds model with baseline hazard function of 0.14t and with additive haplotype effects. The individuals were selected for genotyping by case–cohort or nested case–control sampling with two controls per case. The proportions of missingness for the five SNPs among those selected for genotyping were the same as in the ARIC study. We generated censoring times from the uniform [0, 5] distribution truncated at 1. Approximately 90% of the observations were censored.

Table 3 summarizes the results of the simulation studies for n = 2000 and with various combinations of parameter values. The parameter estimators seem to have little bias. The profile likelihood method provides accurate estimators of the variances. The Wald tests have proper type I error rates, and the confidence intervals have reasonable coverage probabilities. The relative efficiencies of the case–cohort and nested case–control designs are generally between 80% and 90% for estimating haplotype effects and haplotype–environment interactions and over 95% for estimating environmental effects. Thus, these designs are highly cost-effective since only 30% of the entire cohort is genotyped. The case–cohort design appears to be slightly more efficient than the nested case–control design; however, 2–4% of selected controls became cases later on, so the total number of individuals genotyped is slightly smaller under the nested case–control design than under the case–cohort design.


 
Table 3. Summary statistics for the simulation studies†

 

    4. REMARKS
 
The results presented in Section 3.1 represent some preliminary findings from a major ongoing investigation. We are currently genotyping additional SNPs in the XRCC1 gene and examining 18 other genes using the methods proposed here. The full results will be reported elsewhere.

In practice, the true model is unknown. Thus, one will need to explore several possible models. Since the proposed methods are likelihood based, we can apply model selection criteria such as the Akaike information criterion (AIC) (Akaike, 1985) to determine the best model. Our experience shows that AIC performs well in this kind of setting (see Lin, 2004).
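As a small illustration of AIC-based selection, the Python sketch below computes AIC = –2 log L + 2k for a set of candidate models and picks the smallest value. The function names, the model labels, and the log-likelihood values in the usage note are hypothetical; how the parameter count k should be defined for the infinite-dimensional Λ is not specified here and is left to the user.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


def select_by_aic(candidates):
    """Return the label of the candidate with the smallest AIC.

    candidates: dict mapping a model label to (log_likelihood, n_parameters).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))
```

For example, `select_by_aic({"additive": (-1502.3, 8), "dominant": (-1504.9, 8)})` returns `"additive"`, because the two models have the same number of parameters and the first has the larger log-likelihood (both values are made up for illustration).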

It is assumed in (2.6) that there is no W and X is independent of H. This assumption is reasonable in most genetic studies. It is easy to remove this assumption if X is time independent and discrete because then the general likelihood function given in (2.5) just involves some discrete probability functions. If X contains one or two time-independent continuous components, we can still estimate the conditional density function of X nonparametrically.

We have assumed that W is discrete and time independent. If W is continuous and possibly time dependent but X is discrete and time independent, we will replace Formula 2 in (2.5) with Formula 2 However, if both X and W are continuous, it is necessary to parameterize the distribution; nevertheless, nonparametric estimation is possible if X and W have one continuous component each.

If one is not interested in haplotypes, the likelihood given in (2.5) simplifies greatly. There will be no summation over H, and H will disappear from all expressions. The theoretical results will continue to hold, and the EM algorithm will still apply, although the parameters will not include ρ and πk. Scheike and Juul (2004) and Scheike and Martinussen (2004) studied maximum likelihood estimation in the proportional hazards model under case–cohort and nested case–control designs (without the additional complexities due to haplotype uncertainty and missing genotype data) but did not provide theoretical justifications for the asymptotic results. The asymptotic theory derived in the present paper covers those situations. Note that the aforementioned challenge in dealing with continuous covariates still exists even when one is not interested in haplotypes. In fact, this challenge tends to be less severe in genetic studies because genes are discrete and are usually independent of other covariates.

This paper is focused on case–cohort and nested case–control designs, while the recent paper of Lin and Zeng (2006) is concerned with other commonly used study designs. A nontechnical description of the methods developed in the two papers was provided by Lin et al. (2005). The software is available at http://www.bios.unc.edu/~lin.


    APPENDIX A
 

EM algorithm

Write H = BQ1 + (1 – B)Q2, where B is a Bernoulli variable with success probability ρ and Q1 and Q2 are discrete variables with P(Q1 = (hk, hk)) = πk and P(Q2 = (hk, hl)) = πkπl (k, l = 1, ..., K). We introduce a subject-specific frailty ξ with density φ(ξ) such that

Formula 7(A.1)

Then the observed-data likelihood function under the transformation model is equivalent to the likelihood function under the proportional hazards frailty model: the conditional hazard function of T given (B, Q1, Q2, ξ) is Formula 7 By treating (B, Q1, Q2, ξ) as missing data, we obtain the following complete-data likelihood function:
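The frailty representation can be checked numerically in a special case. The displayed form of (A.1) is not recoverable from this page, so the Python sketch below assumes the usual Laplace-transform relation exp{–Q(x)} = ∫ exp(–xξ)φ(ξ)dξ over ξ > 0. Under that assumption, the logarithmic transformation Q(x) = r1 log(1 + r2x) corresponds to a gamma frailty with shape r1 and scale r2, a standard fact about the gamma Laplace transform. All function names are hypothetical.

```python
import math


def gamma_density(xi, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    return xi ** (shape - 1.0) * math.exp(-xi / scale) / (
        math.gamma(shape) * scale ** shape
    )


def laplace_transform(x, shape, scale, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the integral of exp(-x*xi)*phi(xi) on (0, upper)."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        xi = (j + 0.5) * h
        total += math.exp(-x * xi) * gamma_density(xi, shape, scale)
    return total * h


def exp_minus_q(x, r1, r2):
    """exp{-Q(x)} for the logarithmic transformation Q(x) = r1 * log(1 + r2 * x)."""
    return math.exp(-r1 * math.log1p(r2 * x))
```

With r1 = r2 = 1 the frailty is a unit exponential variable, which is the frailty counterpart of the proportional odds model under the assumed form of (A.1).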

Formula 8(A.2)

In the M-step of the EM algorithm, we maximize the conditional expectation of the logarithm of (A.2) given the observed data. Let Formula 8 denote the conditional expectation given the ith observation (Yi, Xi, Δi, Ri, RiGi). Then ρ and πk are updated by the following formulas:

Formula 8

In addition, we update β by solving the following equation:

Formula 9(A.3)

and update Λ by the step function with jump sizes

Formula 10(A.4)

Note that (A.3) and (A.4) are reminiscent of the partial likelihood (Cox, 1972) score equation and the Breslow (1972) estimator.
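For orientation, the classical Breslow estimator that (A.4) resembles can be sketched in a few lines of Python. This is the standard complete-data version for a proportional hazards model with a single time-independent covariate, not the EM update itself, whose exact form is not recoverable from this page; the function name is hypothetical.

```python
import math


def breslow_jumps(times, delta, covariates, beta):
    """Jump sizes of the classical Breslow estimator of the cumulative hazard.

    At each event time t, the jump equals the number of events at t divided
    by the sum of exp(beta * z_j) over individuals still at risk (Y_j >= t).
    Returns a dict mapping event time to jump size.
    """
    risk_scores = [math.exp(beta * z) for z in covariates]
    jumps = {}
    for t, d in zip(times, delta):
        if d != 1:
            continue
        denom = sum(s for y, s in zip(times, risk_scores) if y >= t)
        # tied events at the same time accumulate into one jump
        jumps[t] = jumps.get(t, 0.0) + 1.0 / denom
    return jumps
```

With beta = 0 the jumps reduce to the Nelson–Aalen increments, one over the number at risk for each event.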

In light of (A.3) and (A.4), we calculate the conditional expectations in the form of E[ξiω(Bi, Q1i, Q2i)|Yi, Xi, Δi, Ri, RiGi] in the E-step. We can avoid numerical integration over ξi in these calculations. Define Formula 10 In view of (A.2), the conditional density of ξ given (Bi, Q1i, Q2i) and (Yi, Xi, Δi, Ri, RiGi) is proportional to Formula 10 so that

Formula 10

By differentiating (A.1) with respect to x, we obtain

Formula 10

It follows that

Formula 10

Consequently,

Formula 10

According to (A.2), the conditional density of (Bi, Q1i, Q2i) given the observed data is proportional to gi(Bi, Q1i, Q2i), where

Formula 10

Thus, E[ξiω(Bi, Q1i, Q2i)|Yi, Xi, Δi, Ri, RiGi] is equal to

Formula 10

for individuals with Ri = 1 and is equal to

Formula 10

for individuals with Ri = 0.


    APPENDIX B
 

Asymptotic results

We impose the following conditions:

(C.1) Both X(t) and Formula 10(H,X(t)) have bounded total variations in [0, τ] with probability one, where τ corresponds to the end of the study.
(C.2) There exists a positive constant a such that with probability one, Formula 10 and Formula 10
(C.3) If Formula 10 for t ∈ [0, τ] and k = 1, ..., K with probability one, then β1 = β2 and μ1(t) = μ2(t).
(C.4) |β0| ≤ c0 for some known constant c0, and λ0(t) is continuous and positive for t ∈ [0, τ].
(C.5) Q(x) satisfies one of the two conditions:
(C.5.1) for any positive constant c0, Formula 10
(C.5.2) there exist some constants r1, r2 > 0 such that Q(x) = r1 log( 1 + r2x).

We state the asymptotic results in three theorems. The above conditions are assumed to hold in the theorems. The first theorem states the consistency, weak convergence, and asymptotic efficiency.

THEOREM B.1 With probability one,

Formula 10

In addition, Formula 10 weakly converges to a zero-mean Gaussian process in R^d × l^∞[0, τ], where d is the dimension of θ and l^∞[0, τ] is the normed space of all bounded functions on [0, τ] equipped with the supremum norm. Furthermore, the limiting covariance matrix of Formula 10 achieves the semiparametric efficiency bound.

The second theorem justifies the estimation of the limiting covariance function of Formula 10 by the inverse information matrix.

THEOREM B.2 Let V(h1, h2) be the limiting variance of the random variable Formula 10 where h1 is a d-vector and h2 is a bounded function. The estimator Formula 10 uniformly in (h1, h2) in probability, where hn consists of h1 and the values of h2(Yi) associated with Δi = 1, and Formula 10 is the negative Hessian matrix of Formula 10 with respect to θ and the Λ{Yi} associated with Δi = 1.

The last theorem justifies the use of the profile log-likelihood pln(θ) ≡ max_Λ log Ln(θ, Λ) in estimating the limiting covariance matrix of Formula 10

THEOREM B.3 For any d-vector h1 with norm one,

Formula 10

in probability, where εn = O(n^(–1/2)) and Ω is the limiting covariance matrix of Formula 10

The proofs of these theorems involve advanced mathematical tools from empirical process theory (van der Vaart and Wellner, 1996) and semiparametric efficiency theory (Bickel et al., 1993). We outline here the main arguments. The detailed proofs are available from the authors.

Proof of Theorem B.1. We first prove the consistency under Condition (C.5.1). The proof consists of three steps.

Step 1: We show the existence of Formula 10 or equivalently the finiteness of the jump sizes of Formula 10 The logarithm of (2.6), denoted by ln(θ, Λ), is bounded by

Formula 11(B.1)

where O(1) denotes some positive constant and M is a constant satisfying

Formula 11

Such an M exists under Conditions (C.1) and (C.4). It then follows from Condition (C.5.1) that (B.1) will diverge if Λ{Yi} is infinite for some i.

Step 2: We show that with probability one, Formula 11 is bounded for any n. Let Formula 11 where Formula 11 Clearly,

Formula 12(B.2)

Since P(Δ = 0, Y = τ) > 0, (B.2) will be negative if ψn diverges. Thus, ψn is bounded, which implies that Formula 12 is bounded.

Step 3: By Helly's selection theorem, we can choose a subsequence such that Formula 12 and Formula 12 with probability one. It remains to show that θ* = θ0 and Λ* = Λ0. Note that

Formula 13(B.3)

where

Formula 13


Formula 13

and

Formula 13

In view of (B.3), we construct another step function Formula 13 with Formula 13 By the Glivenko–Cantelli theorem, Formula 13 uniformly converges to Λ0, and Formula 13 is absolutely continuous with respect to Formula 13 with the derivative converging uniformly to dΛ*(t)/dΛ0(t). Since Formula 13 Formula 13 the Kullback–Leibler information of (θ*, Λ*) with respect to (θ0, Λ0) is non-negative, so that (2.6) has the same value almost surely whether (θ, Λ) = (θ*, Λ*) or (θ0, Λ0). Setting n = 1, Gi = 2hk, Ri = 1, and Δi = 1 and integrating Yi from y to τ, we obtain

Formula 13

By comparing this equation with the one obtained from (2.6) with n = 1, Gi = 2hk, Ri = 1, Δi = 0, and Yi = τ, we have

Formula 13

The choice of y = 0 yields that ρ*πk* + (1 – ρ*)πk*² = ρ0π0k + (1 – ρ0)π0k², which entails that ρ* = ρ0 and πk* = π0k. In addition, Formula 13 implying that

Formula 13

It then follows from Condition (C.3) that β* = β0 and Λ* = Λ0. Hence, Formula 13 and Formula 13 almost surely. Since Λ0 is continuous, the weak convergence of Formula 13 can be strengthened to the convergence uniformly in [0, τ].

If Q(·) satisfies Condition (C.5.2) instead of (C.5.1), we need to modify Step 2. It follows from (B.3) that

Formula 13

Thus,

Formula 15(B.4)

By partitioning [0, τ] into a sequence of intervals as in Zeng et al. (2005) and examining the two terms on the right-hand side of (B.4) when Yi lies in each partition, we can show that the right-hand side of (B.4) is negative if Formula 15 diverges. Thus, Formula 15 must be bounded.

To derive the asymptotic distribution of Formula 15, we apply Theorem 3.3.1 of van der Vaart and Wellner (1996) to the score operators for Formula 15 and Formula 15. Except for the invertibility of the derivative operator of the score operator, all the conditions in Theorem 3.3.1 can be verified via empirical process theory (see Zeng et al., 2005). The derivative operator is invertible if the information operator is one-to-one. Thus, we wish to show that if a score function along the path Formula 15 is zero, then h1 = 0 and h2 = 0. For Ri = 1 and Gi = 2hk, the score equation is

Formula 16(B.5)

where (h1ß, h1{rho}, h1k) are the components of h1 associated with (ß0, {rho}0, {pi}0k) and Formula 16. Setting Y = 0 yields that h1{rho} = h1k = 0. This result implies that (B.5) is a homogeneous integral equation for Formula 16, so that Formula 16. Thus, h2 = 0 and h1ß = 0. It then follows from Theorem 3.3.1 of van der Vaart and Wellner (1996) that Formula 16 weakly converges to a zero-mean Gaussian process. Furthermore, we can use the arguments of Zeng et al. (2005) to show that Formula 16 is asymptotically efficient.

Proof of Theorem B.2. This proof follows from the arguments given in the proof of Theorem 3 of Zeng et al. (2005). The details are omitted.

Proof of Theorem B.3. We can verify the conditions in Theorem 1 of Murphy and van der Vaart (2000). In particular, we can construct the least favorable submodel by using the invertibility of the information operator shown in the proof of Theorem 1. The details are omitted.


    ACKNOWLEDGMENTS
 
This research was supported by the National Institutes of Health. The authors thank two referees for their timely reviews and helpful comments. Conflict of Interest: None declared.


    REFERENCES

    AKAIKE, H. (1985). Prediction and entropy. In Atkinson, A. C. and Fienberg, S. E. (eds), A Celebration of Statistics. New York: Springer, pp. 1–24.

    AKEY, J., JIN, L. AND XIONG, M. (2001). Haplotypes vs. single marker linkage disequilibrium tests: what do we gain? European Journal of Human Genetics 9, 291–300.

    BICKEL, P. J., KLAASSEN, C. A. J., RITOV, Y. AND WELLNER, J. A. (1993). Efficient and Adaptive Estimation in Semiparametric Models. Baltimore, MD: Johns Hopkins University Press.

    BRESLOW, N. E. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B 34, 216–217.

    BRESLOW, N. E. AND DAY, N. E. (1987). Statistical Methods in Cancer Research: The Design and Analysis of Cohort Studies. Lyon: International Agency for Research on Cancer.

    COX, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187–220.

    FRIED, L. P., BORHANI, N. O., ENRIGHT, P., FURBERG, C. D., GARDIN, J. M., KRONMAL, R. A., KULLER, L. H., MANOLIO, T. A., MITTELMARK, M. B., NEWMAN, A. et al. (1991). The Cardiovascular Health Study: design and rationale. Annals of Epidemiology 1, 263–276.

    JOHNSON, S. R., ANDERSON, G. L., BARAD, D. H. AND STEFANICK, M. L. (1999). The Women's Health Initiative: rationale, design, and progress report. Journal of the British Menopause Society 5, 155–159.

    KALBFLEISCH, J. D. AND PRENTICE, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd edition. Hoboken, NJ: Wiley.

    KULICH, M. AND LIN, D. Y. (2004). Improving the efficiency of relative-risk estimation in case-cohort studies. Journal of the American Statistical Association 99, 832–844.

    LIN, D. Y. (2004). Haplotype-based association analysis in cohort studies of unrelated individuals. Genetic Epidemiology 26, 255–264.

    LIN, D. Y. AND ZENG, D. (2006). Likelihood-based inference on haplotype effects in genetic association studies (with discussion). Journal of the American Statistical Association 101, 89–118.

    LIN, D. Y., ZENG, D. AND MILLIKAN, R. (2005). Maximum likelihood estimation of haplotype effects and haplotype-environment interactions in association studies. Genetic Epidemiology 29, 299–312.

    MORRIS, R. W. AND KAPLAN, N. L. (2002). On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles. Genetic Epidemiology 23, 221–233.

    MURPHY, S. A. AND VAN DER VAART, A. W. (2000). On profile likelihood. Journal of the American Statistical Association 95, 449–465.

    NAN, B. (2004). Efficient estimation for case-cohort studies. The Canadian Journal of Statistics 32, 403–419.

    SCHAID, D. J. (2004). Evaluating associations of haplotypes with traits. Genetic Epidemiology 27, 348–364.

    SCHAID, D. J., ROWLAND, C. M., TINES, D. E., JACOBSON, R. M. AND POLAND, G. A. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. American Journal of Human Genetics 70, 425–434.

    SCHEIKE, T. H. AND JUUL, A. (2004). Maximum likelihood estimation for Cox's regression model under nested case-control sampling. Biostatistics 5, 193–206.

    SCHEIKE, T. H. AND MARTINUSSEN, T. (2004). Maximum likelihood estimation for Cox's regression model under case-cohort sampling. Scandinavian Journal of Statistics 31, 283–293.

    THE ARIC INVESTIGATORS (1989). The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology 129, 687–702.

    VAN DER VAART, A. W. AND WELLNER, J. A. (1996). Weak Convergence and Empirical Processes. New York: Springer.

    ZAYKIN, D. V., WESTFALL, P. H., YOUNG, S. S., KARNOUB, M. A., WAGNER, M. J. AND EHM, M. G. (2002). Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Human Heredity 53, 79–91.

    ZENG, D., LIN, D. Y. AND YIN, G. (2005). Maximum likelihood estimation for the proportional odds model with random effects. Journal of the American Statistical Association 100, 470–483.

    Received December 5, 2005; accepted for publication February 7, 2006.

