Skip Navigation


Biostatistics Advance Access originally published online on December 20, 2005
Biostatistics 2006 7(3):387-398; doi:10.1093/biostatistics/kxj014
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrowOA All Versions of this Article:
7/3/387    most recent
kxj014v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Hsu, L.
Right arrow Articles by Gorfine, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Hsu, L.
Right arrow Articles by Gorfine, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org.

Multivariate survival analysis for case–control family data

Li Hsu*

Program in Biostatistics and Biomathematics, Public Health Sciences Division, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, M2-B500, PO Box 19024, Seattle, WA 98109-1024, USA lih{at}fhcrc.org

Malka Gorfine

Department of Mathematics, Bar Ilan University, Ramat-Gan 52900, Israel and Faculty of Industrial Engineering and Management, Technion—Israel Institute of Technology, Technion City, Haifa 32000, Israel

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 
Multivariate survival data arise from case–control family studies in which the ages at disease onset for family members may be correlated. In this paper, we consider a multivariate survival model with the marginal hazard function following the proportional hazards model. We use a frailty-based approach in the spirit of Glidden and Self (1999) to account for the correlation of ages at onset among family members. Specifically, we first estimate the baseline hazard function nonparametrically by the innovation theorem, and then obtain maximum pseudolikelihood estimators for the regression and correlation parameters plugging in the baseline hazard function estimator. We establish a connection with a previously proposed generalized estimating equation-based approach. Simulation studies and an analysis of case–control family data of breast cancer illustrate the methodology's practical utility.

Keywords: Case–control family study; Cox proportional hazards model; Familial aggregation; Frailty model; Innovation theorem


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 
There is an increasing interest in the application of multivariate survival analysis techniques to population-based case–control studies for estimating the marginal hazard function and the dependencies of correlated ages at disease onset (Li et al., 1998Go; Shih and Chatterjee, 2002Go; Hsu et al., 2004Go). An interesting example is a breast cancer study conducted at the Fred Hutchinson Cancer Research Center (Malone et al., 1996Go). In this study, the breast cancer incidence cases were ascertained from the Surveillance, Epidemiology, and End Results registry, which is a set of geographically defined, population-based cancer registries in the United States. The controls were obtained through random digit dialing and were matched with cases on the case diagnosis ages and county of residence. First- and second-degree female relatives of cases and controls were identified, and risk factor and outcome information were subsequently collected on them. Researchers were interested in understanding how various risk factors (including candidate genes) affect the age-dependent breast cancer risk and whether these risk factors partly explain the pattern of familial resemblance in ages at diagnosis for breast cancer.

The ages at onset of (K + 1) family members can be denoted by T0, T1, ..., TK with marginal hazard function

Formula

and Formula for k = 0, 1, ..., K. The index 0 refers to the proband, and 1, ..., K refers to the K relatives. We loosely define a proband as the individual because of whom the family is ascertained. The proband in this situation therefore can be either diseased (case) or nondiseased (control). Note that this is different from the traditional definition of probands as diseased. The commonly used Cox proportional hazards model (Cox, 1972Go) specifies that the hazard function for the age at onset Tk associated with a p x 1 vector of covariates Zk takes the form of

Formula 1(1.1)

where {lambda}0(·) is an unspecified baseline hazard function and ß0 is a p x 1 vector of unknown regression parameters. One of the most commonly studied joint survival distribution of ages at onset is the Clayton–Oakes model (Clayton, 1978Go; Oakes, 1989Go), which can be written as

Formula 2(1.2)

and Sk(tk;Zk) = exp{–{Lambda}0(t)exp(ß'0Zk)}. The parameter {theta}0 indexes the intrafamily dependence. When {theta}0 = 0, the joint survival distribution degenerates to the product of marginal survival functions and the ages at onset are independent.

Much work has been done for prospectively collected data. Wei et al. (1989)Go proposed an estimating equation-based approach to estimate {Lambda}0(t) and ß. Fixing the marginal hazard function at these estimates, Shih and Louis (1995)Go and Glidden (2000)Go further estimated Formula 2 by maximizing the joint density (1.2) for the multivariate survival times. Shih and Chatterjee (2002)Go subsequently generalized this approach to case–control family survival data. The cumulative baseline hazard estimate, Formula 20, was obtained through marginalized conditional hazard function for the relatives on the proband data. Advantages of this approach include relatively simple computation and robustness to the misspecification of high-order dependencies among family members. However, the approach may lose efficiency by assuming independence among relatives in the presence of a large number of relatives.

Another approach is to use the frailty model, where the relatives share a common (latent) frailty. Many researchers, for example Nielsen et al. (1992)Go, Klein (1992)Go, Murphy (1995Go, 1996Go), and Parner (1998)Go, have studied the estimation and inference for this model. A comprehensive review can be found in the survival analysis books by Andersen et al. (1993)Go and Hougaard (2000)Go. Hsu et al. (2004)Go generalized this approach to the case–control family data. All these approaches focus on estimating the parameter conditional on the latent frailty. In some situations, when the population-averaged risk is of interest, the marginalized hazard function following the integration of the latent frailty distribution may be sensitive to the misspecification of the assumed joint model. However, an appealing feature of the frailty model-based approach is that it naturally accommodates the dependency among all the relatives of the same relation by assuming a shared frailty. Intuitively, this will improve the efficiency in estimating the strength of dependencies.

To circumvent the dependence of the marginalized hazard function on the dependence parameter, Glidden and Self (1999)Go formulated the conditional hazard function so that the marginalized hazard function obeys the Cox model (1.1); they proposed a modified expectation maximization (EM) algorithm for parameter estimation. More recently, Pipper and Martinussen (2004)Go used a generalized estimating equation (GEE)-based approach and provided a large sample theory to their estimators. They showed that the efficiency of the regression coefficient estimators increases with the strength of the dependence and the cluster size.

In this paper, we propose a general framework extending the approach of Pipper and Martinussen to the noncohort case–control study design and to accommodating varying degrees of the dependence strength among different types of relatives. When there is only one relative per proband, the proposed approach is the same as that of Shih and Chatterjee (2002)Go. We will describe the method in Section 2 and the simulation results in Section 3. Section 4 illustrates the method in real data. We conclude with final remarks in Section 5.


    2. METHOD
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 

2.1 Notation and model

Consider n independent case–control families. Let Tik and Cik be the failure time and the censoring time, respectively, for the kth individual in the ith family, k = 0, 1, ..., Ki, i = 1, ..., n. When k = 0, the index refers to the proband. All other k values refer to the relatives of the proband. We assume that Ki is relatively small with respect to n, and K = maxiKi is bounded. For each individual, one can only observe Xik = Tik {wedge} Cik and {Delta}ik = I(Tik ≤ Cik), where {wedge} is minimum and I(·) is an indicator function. Let Zik be a p x 1 vector of bounded covariates. For simplicity, we assume that each family has the same set of relatives, K. The presence or absence of any particular type of relatives is assumed to be independent of the observed data. The absence of these relatives can be denoted by setting Cik = 0. Naturally, such cases will not contribute to any of the proposed likelihoods. Denote X'i = (Xi1, ..., XiK), {Delta}'i = ({Delta}i1, ..., {Delta}iK), and Zi = (Z'i1, ..., Z'iK), for i = 1, ..., n.

The K relatives can be divided into m disjoint sets, where each set is defined by the genealogical relationship between the relatives and the proband. For example, in the breast cancer study, the female relatives ascertained for each proband were mothers, sisters, grandmothers, and paternal and maternal aunts. The sets then can be defined accordingly. Denote vik the set to which the ikth individual belongs, and it takes value 1, ..., M, where M is the total number of the disjoint sets. Let v'i = (vi1, ..., viK). For the jth set in the ith family, let {omega}ij be the unobservable random effect with the distribution function F{theta}j(·), j = 1, ..., M. Assume that conditional on {omega}ij and Zi, the failure times of the relatives in the jth set, are independent and jointly they are independent of the censoring times of these relatives. We also assume that the censoring times are noninformative of {omega}ij. Technically, the distributions can be different for each set of relatives. For simplicity, we assume F(·) the same for all sets, but the dependence parameter {theta}j takes different values for each set. Moreover, we leave the joint distribution of {omega}'i = ({omega}i1, ..., {omega}iM) unspecified.

Assuming a multiplicative effect {omega}ij on the hazard function, the conditional hazard function for the ikth individual takes the form of

Formula 3(2.1)

The hazard function {alpha}ik(t) has a one-to-one relationship with the marginal hazard function in (1.1). This relationship can be expressed as

Formula 3

For the Clayton–Oakes model (1.2), which is equivalent to a gamma frailty model with Formula 3 the relationship can be written explicitly as

Formula 3

Note that an added advantage of formulating the marginal hazard function as in (1.1) is that there will be no ambiguity on how to estimate the marginal distribution of an individual who is related to different types of relatives with different {theta} values.

In the counting process notation, Yik = I(Xik ≥ t) and Nik(t) = {Delta}ikI(Xik ≤ t), t isin [0, {tau}], where {tau} is the maximum follow-up time. Note that Nik(t) is a counting process with marginal intensity function Yik(t){lambda}(t;Zik), where {lambda}(t;Zik) satisfies (1.1).

2.2 An estimation procedure

In this section, we describe a frailty model approach for estimating the parameters {ß0, {Lambda}0(t), {theta}1, ..., {theta}M} from case–control family data. The data consist of iid random elements (Xi, {Delta}i, Zi, vi, {omega}i)(i = 1, ..., n) if {omega} were observed. We assume that the effect of the covariates on the age at onset is subject specific, i.e. Pr(Xij, {Delta}ij|Zi) = Pr(Xij, {Delta}ij|Zij).

The likelihood function for case–control family data can be written as

Formula 3

where the conditioning event describes the ascertainment scheme of case–control sampling. Following the conditional probability rule, this likelihood can be further decomposed into

Formula 3

One can see that the first part of the likelihood contains information from the relatives' data (ages at onset and risk factors Xi) conditional on the proband ages at onset and risk factor data. The second part is the conventional retrospective likelihood for case–control data without families. This suggests that one can obtain consistent parameter estimates of ß from either part of the likelihood. However, the information for the dependencies of ages at onset among family members as well as the baseline hazard function is largely from the first part of the likelihood, the data of the relatives. In what follows we describe each part of the likelihood in detail, starting with the second part of the likelihood, the likelihood function of the case–control data.

Let Formula 3 where the subscript cc stands for case and control. Assuming cases and controls are age matched and there are Q matched sets, Formula 3cc can be replaced by a conditional likelihood (Prentice and Breslow, 1978Go):


Formula

where Sq is the sum of the covariate vectors of the dq cases in the qth stratum and Rq is the set of all selections of dq individuals chosen without replacement from those in the qth stratum. This is equivalent to a stratified partial likelihood function with each stratum corresponding to a matched set of cases and controls. The standard estimation procedure applies. When cases and controls are not matched on age, one can follow the Bayes rule to estimate the parameter. However, this approach requires an explicit modeling of the covariate distribution in the population and a stronger assumption than conditional independence for the censoring time. That is, it needs to be completely independent of the failure time.

Next we show how to model the first part of the likelihood Formula 3, which represents contributions from the relatives of the probands. In order to obtain a full likelihood function, one needs to assume a joint distribution for {omega}i1, ..., {omega}iM. Such an approach may not be appealing because it is more difficult to perform model diagnostic procedure for a joint distribution than for a univariate distribution, and any violations will potentially invalidate the parameter estimates. Moreover, the computation can be cumbersome and intensive, which limits the general use of the method. We therefore propose a pseudolikelihood approach, a product of the likelihood function for each of the J sets of relatives. The idea is similar to the GEE (Liang and Zeger, 1986Go), in that the joint distribution of {omega}i1, ..., {omega}iM is left unspecified. Denote the pseudolikelihood by Formula 3R, where R stands for relatives, and we can write

Formula 3

Under the assumptions that the censoring is noninformative of {omega} values and the conditional independence of censoring and failure times given {omega}i values and the covariates, the likelihood function

Formula 3

If a parametric model for the baseline hazard function {lambda}0(t) is assumed, one may simply utilize the maximum likelihood machinery to obtain the estimates of the parameters. However, a parametric modeling of {lambda}0(t) may not be able to fully capture the often arbitrary age-dependent trend of the hazard function. It is desirable to estimate {lambda}0(t) in a nonparametric fashion. At time t, define the observed history Formula 3 including the history of events for all relatives in the jth (j = 1, ..., M) set up to time t and the history of events for all probands up to time {tau}. The Formula 3 intensity of Nik(t) is given by replacing {omega}ivik by its conditional expectation with respect to the history Formula 3 i.e.

Formula 4(2.2)

where Formula 4 Simple algebraic calculations show that when there is only one relative per proband, the intensity Formula 4 has the same expression as the conditional hazard function (5) in Shih and Chatterjee (2002)Go, provided that the copula model, namely, the joint distribution of multivariate failure times, can be induced by frailties. Oakes (1989)Go gave a comprehensive account of such bivariate survival models and noted exceptions: for example, the bivariate survival distribution introduced by Gumbel (1961)Go cannot be a frailty distribution even though it is a copula model.

The structure of (2.2) suggests a Nelson–Aalen type of estimator for Formula 40. Using the Clayton–Oakes model (1.2), the intensity function (2.2) can be expressed as Yik(t){lambda}0(t)Hik(t), where Formula 4 Treating Hik(t) as the time-dependent risk score as in Shih and Chatterjee (2002)Go, we can write the Nelson–Aalen type of nonparametric estimator as

Formula 5(2.3)

Using the Clayton–Oakes model (1.2), Formula 5 has an explicit form given by

Formula 6(2.4)

for j = 1, ..., M. Note that Formula 6 involves {Lambda}0 for the proband observation times Xi0 that can be greater than time t, i.e. Hik(t) is not predictable in the usual martingale sense. An iterative procedure is needed for obtaining Formula 60(t), t isin [0, {tau}]. Large sample theory for this kind of problem is generally not available. However, our experience with simulated data sets shows that the baseline hazard function estimator Formula 60(t) in (2.3) converges and, moreover, the biases are few, if there are any.

Estimation of ß and {theta} values is straightforward once we know Formula 60. We construct a profile likelihood Formula 6R x Formula 6cc in ß and {theta} values by replacing {Lambda}0 by Formula 60 in (2.3). Standard nonlinear optimization routine can be used to obtain the maximum profile likelihood estimates for ß and {theta}. We term this approach as the ‘joint’ analysis of cluster members because it utilizes the information for all family members without breaking them down into pairs, as in the following marginal approach.

We further extend this approach by applying again the GEE (Liang and Zeger, 1986Go) to relative–proband pairs. Specifically, the baseline hazard estimate is still in the form of (2.3) except now that Formula 6 When there is only one set of relatives, i.e. {theta}vik{equiv}{theta}, plugging Formula 6 into (2.3) yields

Formula 6

where Formula 6 The baseline hazard estimate Formula 6 is in exactly the same form as (7) in Shih and Chatterjee (2002)Go. Estimates of ß and {theta} values can be similarly obtained by maximizing either the pseudolikelihood that uses the data on relative–proband pairs or the full likelihood that incorporates the joint distribution of all relatives in the same cluster with a plug-in estimate of {Lambda}0(t). Advantages of the GEE-based approach, especially for the estimates obtained based on the relative–proband pairs, are easy to compute and robust to the misspecification of higher (>2)-order correlations among family members. However, it may suffer an efficiency loss compared to the pseudolikelihood approach that partitions the relatives into clusters. Besides, it seems natural to assume that relatives with the same relation to the proband share a common frailty.


    3. A SIMULATION STUDY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 
We conducted a simulation study to evaluate the finite sample properties of the joint and GEE-based approaches. Matched case–control family data were generated, mimicking the case–control family study as seen in practice. Specifically, for each individual we generated a continuous variable Z from the standard normal distribution N(0, 1) as an individual-specific risk factor. We also generated a common frailty value {omega} from the gamma distribution with scale and shape parameters {theta}–1, where {theta} quantified the strength of the dependence among family members. This common frailty value was shared by all relatives within the same family. Let ß be the log hazard ratio of Z after averaging over the frailty distribution. The marginal hazard function under the Cox proportional hazards model can be written as

Formula 6

where {lambda}0(t) follows a Weibull distribution with density babtb–1exp{–abxb}, where a > 0 and b > 0. The failure time for an individual with covariate Z and frailty value {omega} can be generated by

Formula 6

where u is a random variable from a uniform [0,1]. The censoring distribution is normal with mean 60 years of age and standard deviation 20 years. We choose a = 0.01 and b = 4.6 for the baseline hazard function. Together with the censoring, this yields a censoring of about 80% and the mean age at onset about 65 years old. This is in line with the age at onset of breast cancer in the general population with a slightly lower censoring than observed. We also generate censoring times from normal with mean 90 years of age and standard deviation 20 years, yielding about 60% censoring. This is to examine how the proposed methods perform when the censoring percentage is lower than 80%.

We generated 20 000 individuals as the base population. Arbitrarily assigning the first individual of each family as the index subject, we randomly select n = 200 diseased subjects and match each diseased subject with an unaffected subject whose censoring time is within 1 year of the age at onset for the disease subject. This sampling results in a total of 200 case families and 200 control families that are matched on age.

We considered simulation situations that were a combination of the following parameter choices. We let {theta} = 1 or 3 so that the strength of the dependence parameter ranges from moderate to strong. The regression coefficient ß takes value log(2), giving the relative risk of covariate 2. We repeated each simulation scenario 100 times. Results are summarized by the sample mean and sample standard deviation of Formula 6, {Formula 60(t), t = 60, 70, or 80}, and Formula 6 over these 100 simulated data sets.

Table 1 shows a summary of marginal parameter estimates of Formula 6 and Formula 60 and the dependence parameter estimate Formula 6. Both approaches, the joint and GEE, yielded virtually unbiased estimates of all parameters under most situations considered. When the number of relatives is two, {theta} = 3, and 80% censoring rate, the dependence estimates Formula 6 are slightly biased upward. This is due to a few simulated data sets that have large Formula 6 values. When we take the median, the corresponding estimates for the GEE and the joint analyses are 3.02 and 3.01, respectively. For all situations considered, we did not encounter any convergence problem in any of the simulated data sets.


View this table:
[in this window]
[in a new window]
 
Table 1. Summary of marginal parameter estimates of Formula 6 and Formula 60 and the dependence parameter estimate Formula 6. Each cell is the mean (standard error) of the estimate over 100 simulated data sets. The first lines of the entry are estimates from the GEE-based approach and the second lines are estimates from the joint distribution approach

 
We see that Formula 6 and Formula 6 are more efficient for the joint analysis than the GEE-based estimators. This is especially pronounced in situations with a high degree of correlation and relatively large number of relatives. The gain in efficiency for Formula 60 from the joint analysis is more moderate compared to that for Formula 6 and Formula 6. This suggests that the efficiency gain may be mostly from the likelihood estimation of ß and {theta}, rather than the joint estimation of Formula 60. We subsequently performed a simulation study using GEE-based approach for {Lambda}0 estimate, but using the joint distribution for all family members for estimating ß and {theta}. The simulation results show that the new estimators do (almost) equally well as the joint analysis (results not shown).

We examined the performance of the bootstrap method for estimating the standard errors of the parameter estimates. Given the time demand for the bootstrap method, we only examined one situation, which is closest to our real data, that is, {theta} = 3 and the censoring rate of about 80%. We used 50 bootstrap samples for each simulated data set, where the sampling unit was each matched case–control set with their relatives. A very small amount of Gaussian noise, N(0, 0.1), was added to the observational times in the bootstrap samples in order to break the ties. Table 2 shows a summary of sample means of bootstrap standard deviations and corresponding estimated coverage probabilities of nominal 95% confidence intervals, along with sample means and standard deviations of parameter estimates. The bootstrap standard deviations are generally close to sample standard deviations. The coverage probabilities for the joint approach maintain about 95% confidence intervals, whereas the coverage probabilities for the GEE approach are slightly underestimated.


View this table:
[in this window]
[in a new window]
 
Table 2. Summary of bootstrap standard deviations (SDs) and corresponding estimated coverage probabilities of nominal 95% confidence intervals (CIs) under the situation of two relatives per family with 80% censoring. The first lines are estimates from the GEE approach and the second lines are estimates from the joint distribution approach

 

    4. ILLUSTRATION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 
The case–control family study of early-onset breast cancer (Malone et al., 1996Go) recruited a set of population-based cases and controls and their first- and second-degree relatives to study the familial aggregation of age at diagnosis of breast cancer. A primary goal of the study was to determine whether the age at diagnosis of breast cancer in probands was associated with the age at diagnosis of breast cancer in relatives. For illustrative purpose, we only use a subset of the data: probands (cases and controls) and their mothers and aunts. There are a total of 820 cases and 942 controls with 1762 mothers and 5015 aunts. Among these, 904 probands have two aunts or fewer, 731 have three to six aunts, and 127 have more than six aunts. In these relatives, 155 mothers (mean: 53 years old; standard deviation: 12 years old) and 256 aunts (mean: 54 years old; standard deviation: 12 years old) have breast cancer.

We take both the joint (Joint) and GEE-based (GEE) analyses with a common baseline hazard function for all individuals. In the joint analysis, the baseline hazard function is estimated by using (2.3) and (2.4), so that all family members are used without being broken down into relative–proband pairs. In the GEE-based analysis, both baseline hazard function and parameters ß and {theta} are estimated under a working assumption that all relative–probands were independent. We also analyze the data by estimating the baseline hazard function with relative–proband pairs but estimating ß and {theta} with the full set of relatives. We call this analysis the GEE2 analysis. Table 3 summarizes the dependency parameter estimates Formula 61 for mothers, Formula 62 for aunts, and Formula 60 at 40, 50, 60, 70, and 80 years old. Fifty bootstrap samples with families as sampling units are used to calculate the standard errors of the parameter estimates.


View this table:
[in this window]
[in a new window]
 
Table 3. Analysis of a case–control family study of breast cancer. Fifty bootstrap samples were used to calculate the standard errors (SEs) of parameter estimates

 
In the joint analysis, the dependency of ages at onset between probands and their mothers is strong ({theta} = 2.524), whereas the dependency between probands and their aunts is much weaker ({theta} = 0.705). Under the assumed gamma frailty model, (1 + {theta}) can also be interpreted as the constant relative risk of the hazard rate for the relative to develop breast cancer at age t2 given the proband having the breast cancer at age t1 compared to that given the proband being cancer free at age t1. In other words, our results for mother–proband relation implied that the mothers of cases had an average 2.524 + 1 = 3.524–fold increased risk of developing breast cancer compared to the mothers of controls. In contrast, the aunts of cases had a much reduced increased risk, 0.705 + 1 = 1.705, than that of controls. Both dependence estimates were significantly greater than zero, i.e. independence, at level 0.05. We also obtained the baseline hazard function estimate and its point-wide standard errors. The cancer-free probability can therefore be estimated by taking exp{–Formula 60(t)}. Using this formula, we estimated from our data that the cancer-free probability by age 80 for a woman was about 0.92 with 95% confidence interval (0.91, 0.94).

The GEE2 analysis yields dependence estimates similar to the joint analysis with the joint analysis moderately more efficient. The GEE analysis, however, yields much lower dependence estimates, especially for the aunt, compared to either GEE2 or joint estimates. This is in part due to the fact that in this data set, aunts from larger aunt-ship sizes tend to have better disease-free probability than aunts from small aunt-ship sizes. The GEE, i.e. marginal approach that breaks down the large aunt-ship by aunt–proband pairs, includes many pairs that have fewer breast cancer cases in aunts. We excluded the aunt-ships having larger than six aunts and reran the GEE analysis. The dependence estimates for mother and aunts are 2.287 (standard error 0.459) and 0.648 (standard error 0.191), much closer to the dependence estimates from GEE2 or joint analysis.


    5. SOME CONCLUDING REMARKS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 
In this paper, we present a general framework for handling the family size >2 and establish a connection with a previous marginal approach proposed by Shih and Chatterjee (2002)Go. Simulation results show that the efficiency gain by the proposed method is most pronounced with a high degree of correlation and with large family size. The real data analysis also shows a potential efficiency gain for the joint approach. Like any other (pseudo)likelihood-based method, the proposed approach may be sensitive to the misspecification of the joint distribution for the entire family. The lack of robustness is somewhat eased by partitioning relatives into clusters, where each cluster consists of only the relatives with same relation to the proband, allowing for different dependency parameters (or even different distribution) for each cluster. In addition, the computational demand for the joint analysis is greater than the GEE type of marginal approach.

The proposed approach involves a hypothesized frailty shared by relatives with the same relation to the proband. We note that the frailty here is a variation between families, not a variation between individuals. It is introduced for the purpose of accommodating the unexplained dependency of ages at onset among family members. In other words, the latent frailty is like a common covariate that is missing for all the relatives in the same cluster. One can potentially introduce an additional individual-level frailty to account for the heterogeneity among individuals or missing covariates at individual level. In this situation, Heckman and Singer (1984)Go showed theoretically that both the frailty distribution and the baseline hazard function can be identified in the presence of other observed covariates, but estimating them from the data can be problematic. We also note that the marginal hazard function conditional on the observed covariates, the focus of the paper, can perhaps be considered as the hazard function averaging over all unmeasured individual risk factors.

The shared frailty model for each set of relatives may not fully capture the relations among the relatives in the set. For example, in the set of paternal aunts, the dependency of these aunts on the proband can be hypothesized sharing a common frailty; however, the dependency among aunts, who are sisters among themselves, may share a common frailty that is not the same as aunt–proband relation. A possible solution is to consider a hierarchical frailty distribution. This is shown in the following simplified derivation. Let T1 and T2 be the failure times for paternal aunts and T0 be the failure time of the proband. Further more, let {omega}1 and {omega}2 be the frailties shared among aunts (i.e. sisters) and shared between aunts and the proband, respectively. Then the joint survival distribution of T1, T2, and T0 can be written as

Formula 6

A bivariate distribution of {omega}1 and {omega}2 needs to be assumed. However, the corresponding conditional hazard function under such bivariate distributions may be difficult to obtain. Other approaches such as additive gamma frailty model (Parner, 1998Go) and multiple level multivariate survival model (Bandeen-Roche and Liang, 1996Go) could also be employed to deal with multivariate frailties.

We provide a large sample theory for Formula 6, Formula 6, and Formula 60 under a general frailty model with cohort family data, provided that there are finite moments for the frailty distribution (Gorfine et al., 2005). However, the theory developed for the cohort data cannot be directly applied to case–control family data, in part, because Formula 6(t) in (2.4) at time t involves proband observational times that can be greater than t. Currently, we rely on bootstrap methods to draw inferences on parameter estimates. Further work is needed to show the large sample properties of the proposed and other similar estimators (Shih and Chatterjee, 2002Go).


    ACKNOWLEDGMENTS
 
This work is supported in part by grants from the National Institutes of Health (AG14358 and CA 53996) and the United States-Israel Binational Science Foundation.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHOD
 3. A SIMULATION STUDY
 4. ILLUSTRATION
 5. SOME CONCLUDING REMARKS
 REFERENCES
 

    ANDERSEN, P. K., BORGAN, O., GILL, R. D. AND KEIDING, N. (1993). Statistical Models Based on Counting Processes. New York: Springer.

    BANDEEN-ROCHE, K. J. AND LIANG, K.-Y. (1996). Modeling failure-time associations in data with multiple levels of clustering. Biometrika 83, 29–39.[Abstract/Free Full Text]

    CLAYTON, D. G. (1978). A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika 65, 141–151.[Abstract/Free Full Text]

    COX, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34, 187–220.

    GLIDDEN, D. V. (2000). A two-stage estimator of the dependence parameter for the Clayton-Oakes model. Lifetime Data Analysis 6, 141–156.[Medline]

    GLIDDEN, D. V. AND SELF, S. G. (1999). Semiparametric likelihood estimation in the Clayton-Oakes failure time model. Scandinavian Journal of Statistics 26, 363–372.[CrossRef]

    GORFINE, M., ZUCKER, D. M. AND HSU, L. (2006). Prospective survival analysis with a general semiparametric shared frailty model—a pseudo full likelihood approach. Biometrika (in press).

    GUMBEL, E. J. (1961). Bivariate logistic distribution. Journal of the American Statistical Association 56, 335–349.[CrossRef]

    HECKMAN, J. AND SINGER, B. (1984). The identifiability of the proportional hazard model. Review of Economic Studies 81, 231–241.

    HOUGAARD, P. (2000). Analysis of Multivariate Survival Data. New York: Springer.

    HSU, L., CHEN, L., GORFINE, M. AND MALONE, K. (2004). Semiparametric estimation of marginal hazard function from case–control family studies. Biometrics 60, 936–944.[CrossRef][Web of Science][Medline]

    KLEIN, J. P. (1992). Semiparametric estimation of random effects using the Cox model based on the EM algorithm. Biometrics 48, 795–806.[CrossRef][Web of Science][Medline]

    LI, H., YANG, P. AND SCHWARTZ, A. G. (1998). Analysis of age at onset data from case–control family studies. Biometrics 54, 1030–1039.[CrossRef][Web of Science][Medline]

    LIANG, K. Y. AND ZEGER, S. L. (1986). Longitudinal analysis using generalized linear models. Biometrika 73, 13–22.[Abstract/Free Full Text]

    MALONE, K. E., DALING, J. R., WEISS, N. S., MCKNIGHT, B., WHITE, E. AND VOIGT, L. F. (1996). Family history and survival of young women with invasive breast carcinoma. Cancer 78, 1417–1425.[CrossRef][Web of Science][Medline]

    MURPHY, S. A. (1995). Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics 22, 712–731.

    MURPHY, S. A. (1996). Asymptotic theory for the frailty model. American Statistician 23, 183–214.[CrossRef]

    NIELSEN, G., ANDERSEN, P., GILL, R. AND SORENSEN, T. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19, 25–43.

    OAKES, D. (1989). Bivariate survival models induced by frailties. Journal of the American Statistical Association 84, 487–493.

    PARNER, E. (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26, 183–214.[CrossRef]

    PIPPER, C. B. AND MARTINUSSEN, T. (2004). An estimating equation for parametric shared frailty models with marginal additive hazards. Journal of the Royal Statistical Society B 66, 207–220.[CrossRef]

    PRENTICE, R. L. AND BRESLOW, N. E. (1978). Retrospective studies and failure time models. Biometrika 65, 153–158.[Abstract/Free Full Text]

    SHIH, J. H. AND CHATTERJEE, N. (2002). Analysis of survival data from case–control family studies. Biometrics 58, 502–509.[CrossRef][Web of Science][Medline]

    SHIH, J. H. AND LOUIS, T. A. (1995). Inferences on the association parameter in copula models for bivariate survival data. Biometrics 51, 1384–1399.[CrossRef][Web of Science][Medline]

    WEI, L. J., LIN, D. Y. AND WEISSFELD, L. (1989). Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association 84, 1065–1073.[CrossRef]

    Received July 27, 2005; revised December 9, 2005; accepted for publication December 16, 2005.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrowOA All Versions of this Article:
    7/3/387    most recent
    kxj014v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Hsu, L.
    Right arrow Articles by Gorfine, M.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Hsu, L.
    Right arrow Articles by Gorfine, M.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?