Biostatistics Advance Access originally published online on May 25, 2005
Biostatistics 2005 6(4):633-652; doi:10.1093/biostatistics/kxi033
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Regression analysis of longitudinal binary data with time-dependent environmental covariates: bias and efficiency
Department of Biostatistics, Vanderbilt University, S-2323 Medical Center North, Nashville, TN 37232-2158, USA jonathan.schildcrout{at}vanderbilt.edu
Department of Biostatistics, University of Washington, F-600 Health Sciences Building, Campus Mail Stop 357232, Seattle, WA 98195-7232, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Generalized estimating equations (Liang and Zeger, 1986) is a widely used, moment-based procedure to estimate marginal regression parameters. However, a subtle and often overlooked point is that valid inference requires the mean for the response at time t to be expressed properly as a function of the complete past, present, and future values of any time-varying covariate. For example, with environmental exposures it may be necessary to express the response as a function of multiple lagged values of the covariate series. Despite the fact that multiple lagged covariates may be predictive of outcomes, researchers often focus interest on parameters in a cross-sectional model, where the response is expressed as a function of a single lag in the covariate series. Cross-sectional models yield parameters with simple interpretations and avoid issues of collinearity associated with multiple lagged values of a covariate. Pepe and Anderson (1994), showed that parameter estimates for time-varying covariates may be biased unless the mean, given all past, present, and future covariate values, is equal to the cross-sectional mean or unless independence estimating equations are used. Although working independence avoids potential bias, many authors have shown that a poor choice for the response correlation model can lead to highly inefficient parameter estimates. The purpose of this paper is to study the biasefficiency trade-off associated with working correlation choices for application with binary response data. We investigate data characteristics or design features (e.g. cluster size, overall response association, functional form of the response association, covariate distribution, and others) that influence the small and large sample characteristics of parameter estimates obtained from several different weighting schemes or equivalently working covariance models. We find that the impact of covariance model choice depends highly on the specific structure of the data features, and that key aspects should be examined before choosing a weighting scheme.
Keywords: ALR; Biasvariance trade-off; GEE; Longitudinal data; Marginal model
| 1. INTRODUCTION |
|---|
|
|
|---|
The Childhood Asthma Management Program (CAMP) is a multicenter clinical trial whose primary aim is the evaluation of long-term effects of daily inhaled anti-inflammatory medication use on asthma status and lung growth in children with mild to moderate persistent asthma (CAMP Research Group, 2000
The relationship between asthma symptoms and daily ambient PM concentrations could be flexibly characterized using a distributed lag model that includes multiple lags in the covariate series. If Yi is an ni -vector of binary responses, Yit, for subject i, Xi is the corresponding vector of PM concentrations, and g is the link function (which we assume to be logit unless otherwise specified), then
is a J th-order distributed lag model. One challenge in direct use of this model lies in ultimately summarizing the PM effect described by the vector of coefficients. While a seemingly natural option would be to isolate the effect of the j th-lag,
j + 1, PM effects are believed to be small and daily PM concentrations are correlated making such an adjusted effect difficult to detect. Another summary measure,
is appealing because the effect of the entire past covariate series is captured in this overall summary. However, a major drawback to the use of
* is that in many cities, PM concentrations are not measured on a daily basis (see Table 2 in Samet et al., 2000
), and so this summary is not directly estimable unless a separate model for the missing PM values is constructed. In this study, we focus on a commonly used alternative summary which also captures the effect of the entire PM series: the coefficient ßj + 1 in the cross-sectional model E(Yit|Xitj) = g1(ß0 + ßj + 1Xitj). Like
*, ßj + 1 is approximately a sum of
values; however, the weights in this sum are derived from the day-to-day correlation among PM concentrations. For example, an underlying linear distributed lag model
induces a cross-sectional model E(Yit|Xitj) = ß0 + ßj + 1Xitj, where
X(|jk|) = corr(Xitj, Xitk) assuming the exposure process is mean zero and stationary. For nonlinear models a similar approximate relationship exists. In the case of a logistic second-order distributed lag model, when the covariate series is normally distributed, the parameter ß1 can be approximated with cß(
1 + 
2), where
= corr(Xit, Xit1) and cß is a constant described in Zeger et al. (1988)
. The cross-sectional model parameter is appealing because it has a simple interpretation and is a naturally weighted sum of parameters from an underlying distributed lag model. The coefficient ßj + 1 captures the direct effect of Xitj as well as the indirect effects of other lags in the PM series. Though
* is also a useful summary measure, the unweighted sum of serial covariate effects (e.g. equal weights to all lagged effects) represents a somewhat artificial contrast in the exposure series where for each day t, t1,..., tJ1, the exposure is elevated by one unit. The parameter ßj + 1 characterizes the increase in risk associated with the observable difference in exposure on day tj of one unit, Xitj = (x + 1) versus Xitj = x, which represents both a scientifically meaningful and an observable exposure contrast. The summary ßj + 1 is a valid measure of association between exposure and outcome even when underlying etiologic models may have distributed lags.
Though cross-sectional modeling can be considered simple and descriptive, there are analytical issues associated with valid and efficient estimation using longitudinal data. Pepe and Anderson (1994)
showed that biased parameter estimates for time-varying covariates will likely result if a nonindependence, working correlation matrix is used in a generalized estimating equations (GEE) approach unless the cross-sectional mean (CSM), µit = E(Yit|Xit), is equal to the full covariate conditional mean (FCCM), E(Yit|Xi) = E(Yit|Xi1,Xi2,...,Xini), that describes how responses depend on past, present, and future covariate values. A simulation study with continuous response data and an exogenous covariate process is reported in Diggle et al. (2002)
and supports the conclusions of Pepe and Anderson. Emond et al. (1997)
and Pan et al. (2000)
provide analytic calculations for bias with continuous response data, but with different data-generating models (i.e. transition models).
Other authors have studied the efficiency of GEE estimates when the FCCM is properly specified but the association among responses has been misspecified. Independence weighted estimates (GEE-Ind) have been shown to be nearly fully efficient for cluster-level covariate parameters with equal-sized clusters (McDonald, 1993
; Fitzmaurice, 1995
; Mancl and Leroux, 1996
), though some efficiency losses may be observed when clusters vary in size (Mancl and Leroux, 1996
). However, misspecification of the response association model can result in substantial losses in efficiency for estimates of time-varying covariate parameters (e.g. Zhao et al., 1992
; Fitzmaurice et al., 1993
; Fitzmaurice, 1995
; Mancl and Leroux, 1996
; Sutradhar and Das, 2000
). In particular, GEE-Ind can be highly inefficient relative to weighting schemes that incorporate correlation among responses. In the special case where the time-varying covariate is mean balanced (i.e.
), GEE-Ind is as efficient as exchangeable weighted estimates (GEE-Exch) (Mancl and Leroux, 1996
), thus clarifying the apparently discrepant results reported by McDonald (1993)
and Sutradhar and Das (1999)
who found GEE-Ind and GEE-Exch estimates to be equally efficient when the cluster-varying covariate was balanced, and therefore mean balanced. In addition to covariate mean balance, other characteristics of the covariate distribution (e.g. intracluster correlation) and the effect size can influence the efficiency of estimates under a misspecified association model (e.g. Fitzmaurice, 1995
; Mancl and Leroux, 1996
; Sutradhar and Das, 2000
).
In this study, we explore the biasefficiency trade-off with estimating CSM model parameters using GEE. In Section 2, we introduce the model from which inference is drawn. In Section 3, we briefly overview moment-based estimation procedures for marginal mean models with binary outcomes. We explore the biasefficiency trade-off in Section 4 using analytical calculations, numerical calculations, and simulations. In Section 5, we analyze the CAMP data reported in Yu et al. (2000)
and we provide concluding discussion and recommendations in Section 6.
| 2. MARGINAL REGRESSION MODELS WITH TIME-VARYING COVARIATES |
|---|
|
|
|---|
Let Yi = (Yi1,..., Yini) be a binary response vector for cluster or subject i
{1, 2,..., N}, where ni is the dimension of the vector, and let Xi = (Xi1, , Xini) be the ni x p design matrix consisting of time-varying and time-invariant covariates. Marginal regression models express the mean response as a function of covariate values, µi = E(Yi|Xi) = g1(Xiß), where g is the link function. The logit link, which is the canonical link function for binary response data, will be used throughout this paper, although other link functions (e.g. probit) could also be used. The vector, ß, is a p x 1 parameter which we consider the target of estimation.
If Xi is a single time-varying covariate vector, it is important to consider the potential for a lagged relationship between Xi and Yi. For example, Yit may depend on one or more of the past covariate values, Xitj, j
{0, 1,..., t1}. Subject matter considerations are crucial for determining what functions of past exposure are most relevant since the lag time from exposure to health effect is determined by the underlying biological processes.
In this study, we only consider statistically exogenous covariates (see Hernan et al., 2001
). With a time-varying exogenous exposure, the FCCM is the expected response as a function of the entire covariate history, µit
E(Yit|Xi1,..., Xit). In contrast, the CSM is the expected response expressed as a function of a single lag in the covariate,
where j can take on any value from 0 to t1.
We study the bias versus efficiency trade-off inherent in estimating CSM regression parameters for time-varying covariates when the data-generating model involves a more complicated dependence on covariate history.
For simplicity, we assume that the FCCM model is given by µit = g1(
0 +
1xit +
2xit1), while interest is in estimating the induced CSM regression parameter, ß1, in the regression
In order to characterize the relationship between the FCCM and the CSM, we study a time-dependent covariate vector Xi which is a first-order, auto-regressive Gaussian process with autocorrelation parameter
. With this particular data-generating model we can calculate the induced value of the CSM parameter vector, ß1. Let Z =
2(Xit1
Xit), then
and
![]() |
where
denotes the probability density function for a mean 0, normally distributed random variable. Neuhaus et al. (1991)
showed that for the logistic link function, if
|ß0|
|
0|, and |ß1|
|
1 + 
2| and these discrepancies grow with
Zeger et al. (1988)
provided reasonable approximations for ß0 and ß1: ß0
0cß and ß1
(
1 + 
2)cß, where
and
Using these results, we are able to approximate the induced CSM from the FCCM specified above and can therefore study the magnitude of bias and/or inefficiency associated with different estimation methods. The reduced CSM remains approximately linear in xit, with exact linearity obtaining for certain models such as with the probit link.
| 3. ESTIMATION FOR MODELS BASED ON MARGINAL MOMENTS |
|---|
|
|
|---|
Paired Estimating Equations (paired EE) (Prentice, 1988
![]() | (3.1) |
![]() | (3.2) |
where, in the first-order equation,
ß is a parameter vector describing the relationship between covariates and the link-transformed expected response, Vi = Vi(
) is the working covariance matrix, and
is a parameter vector describing dependence among responses. In the second-order equation, Si is the
vector of cross products {YisYit}s
t,
i = E(Si),
and Wi is the
estimate of the covariance matrix for Si. Since Si is a vector of cross products, the off-diagonals in Wi involve fourth-order probabilities. These are considered a nuisance, and therefore, it is common to set the off-diagonals of Wi equal to 0.
Alternating Logistic Regression or ALR (Carey et al., 1993
) is a paired EE procedure that uses (3.1) to estimate the mean model, but as suggested by Lipsitz et al. (1991)
, the association model is parameterized in terms of log odds ratios (LOR) rather than correlations. The pairwise LOR,
is a natural and unconstrained measure of association for two binary variables. In ALR, the pairwise odds ratios are allowed to follow a regression structure given by the parameter vector
, that is estimated with a logistic regression of Yis on Yit for the
pairs of responses,
![]() | (3.3) |
Given the values of µis
E(Yis|Xi), µit
E(Yit|Xi), and
ist, the induced pairwise correlations can be calculated to update the working covariance matrix in (3.1) at each successive iteration. For the purpose of this study we assume response dependence is isotropic (i.e. only depends on time separations) and we denote LOR(
t) =
ist, where
t = |ts|.
| 4. BIAS, EFFICIENCY, AND MEAN SQUARE ERROR |
|---|
|
|
|---|
In this section, we discuss the impact that a working covariance choice can have on CSM parameter estimates. The logit link function will be considered throughout our analysis. In Section 4.1, we numerically characterize design features or data characteristics that influence asymptotic bias, and in Section 4.2 we evaluate the features that influence efficiency and mean square error (MSE). In Section 4.3, we report on simulations that study the operating characteristics of estimation methods applied in finite samples. We conclude with a summary of the results.
Throughout this section the FCCM is given by µit = E(Yit|Xi = xi) = g1(
0 +
1xit +
2xit1), where g = logit. The CSM is then
and ß1 is the target of estimation. We study a number of design features to characterize the operating characteristics of ß1 estimates obtained using different working association models. These include effect size (
1), relative effect size of the omitted covariate (
2/
1), overall or average response association, functional form of the response association (rate of serial decay), cluster size, variation in cluster size, number of independent clusters, and the distribution of the time-varying, exogenous covariate.
We will use acronyms for estimates based on various weighting schemes or working association models. Independence {LOR(
t) = 0}, exchangeable {LOR(
t) =
0,e}, serial {LOR(
t) =
1,s
tk}, mixture {LOR(
t) =
0,m +
1,m
tk}, and smooth {LOR(
t) expressed as a smooth function of time} weighting scheme estimates will be referred to as GEE-Ind, GEE-Exch, GEE-Serial, GEE-Mixture, and GEE-Smooth, respectively.
To explore the extent of bias attributable to using covariance weighted GEE with CSM models, we first detail numerical approximations which permit evaluation of several design aspects. Let
i
{1,..., N}, be the first-order estimating function for the CSM model. Assume
where
and
is the working covariance matrix;
= E{
1,i(ß)};
= E{
1,i'(ß)};
; and cluster size is fixed and equal across clusters. Then by the multivariate central limit theorem (Rao, 1973
Using a first-order Taylor series expansion of
1,i(
) about ß, and noting that by definition
Thus, for N sufficiently large,
Our goal is to calculate the bias in estimates of ß1 by calculating the corresponding univariate values of
and
. In general, when canonical link functions are used, and when an AR(1), time-varying covariate is exogenous to the response process,
and
are given by
![]() | (4.4) |
![]() | (4.5) |
where
and rist is the (s, t) th element of the inverse working correlation matrix (see Appendix for details). In the case where a diagonal working covariance matrix is used, rist = 0,
s
t, showing that for GEE-Ind estimates,
= 0, and therefore estimates of ß1 are asymptotically unbiased, while for other weighting schemes bias will likely incur. From (4.4) and (4.5), we are able to calculate bias analytically for linear models with continuous response data and numerically for generalized linear models with binary response data.
It is useful to first consider the continuous response scenario using the identity link function and commonly used working correlation structures before proceeding to a binary response with the logit link. Solutions to the asymptotic bias in
1 for the linear model are analytically obtained and provide insight into the results for binary response scenarios. Assume Var(Yit|Xit) =
*2,
t
{1, 2,..., ni}, Xi is a mean-0, AR(1), Gaussian process with autocorrelation parameter
, and
E and
AR are the limiting values for the working correlation parameter estimates from GEE-Exch and GEE-AR(1), respectively. The asymptotic biases from GEE-Exch and GEE-AR(1) are given by (see Appendix for details):
![]() | (4.6) |
![]() | (4.7) |
Of note, the bias in estimates of ß1 will most likely have the opposite sign of
2, which means that if both
1 and
2 are the same sign, biases will generally be in the form of an attenuation towards 0. The magnitude of bias in both estimators grows with
(i.e. with overall response association). With GEE-Exch, asymptotic bias is inversely related to cluster size as the numerator in (4.6) is O(n) and the denominator is O(n2). Finally, these estimators are asymptotically unbiased and consistent if responses are independent (
= 0) or if the covariate is cluster level (
= 1).
In other exponential family models, even with canonical link functions, bias is not easily obtained in closed form; however, it can be evaluated numerically by approximating
and
in (4.4) and (4.5) using bivariate Gaussian quadrature because for all s and t, (Xis, Xit)T follows a standardized bivariate normal distribution with correlation,
|ts|.
In our numerical evaluations, we parameterize the response association model directly in terms of LOR rather than correlations. We calculate percent bias, 100 x (
1ß1,truth)/|ß1,truth|, incurred with three working association models: (1) exchangeable, LOR(
t) =
0,e; (2) serial, LOR(
t) =
1,s
t0.6; and (3) mixture LOR(
t) =
0,m +
1,m
t0.6, where
1,m/
0,m = 1.5. The serial association structure is a special case of that suggested by Fitzmaurice and Lipsitz (1995)
. In this subsection, we consider scenarios where fitted working response association models equal the true association models (e.g. no association model misspecification).
In our calculations, the decay rate component, k, was set to 0.6. LOR association models induce correlation models, and for illustration we relate the serial LOR association model described above to an induced AR(1) correlation model. Specifically, when marginal means are equal to 0.45,
1,s = 4.8, and k = 0.6, the serial LOR association model induces a Pearson correlation model that closely approximates the short-range behavior of an AR(1) or exponential decay with autocorrelation parameter equal to 0.82. Most serial association models we study induce correlation models that decay more slowly than AR(1). The mixture association model also has a serial decay component; however, the long-range association parameter,
0,m, is greater than 0, and therefore, the rate of serial decay is slower for the mixture than for the serial association model.
To make comparisons among the weighting schemes, we set response associations equal at time separations of 3 days,
t
|ts| = 3. Thus, although we consider data generated with purely serial or with a mixture of serial and long-range association, we attempt to calibrate scenarios by fixing LOR at 3-day time separations at a common value. We study percent bias as a function of overall response association (Figure 1a), cluster size (Figure 1b), strength of autocorrelation in the exposure,
(Figure 1c),
2/
1 (Figure 1d), and the sign and magnitude of
1 (Figure 1e). The vertical lines in Figure 1 depict the value which was held fixed for each of other subfigures. For example, in all calculations not involving cluster size, we fixed cluster size at n = 20 (see Figure 1b).
|
We summarize Figure 1 with the following observations:
- GEE-Exch yielded the smallest biases in magnitude, followed by GEE-Mixture and GEE-Serial yielded the largest biases which were as high as 44%.
- Figure 1a shows that bias grew with the overall response association, and severe biases were incurred for GEE-Serial even when there was only a modest response association.
- Biases in GEE-Mixture and GEE-Serial estimates did not change substantially for cluster sizes between 5 and 50 (Figure 1b). Bias from GEE-Exch dropped for larger clusters, which is consistent with the analytic bias determined for a continuous response.
- For a large range of values of the covariate autocorrelation,

(0, 0.7), the bias remained nearly constant (Figure 1c). As
approached 1, and the covariate resembled a cluster-level covariate, and the bias approached 0 with all estimators.
- Figure 1d shows that when
1 was held fixed, bias grew with
2.
- Figure 1e displays bias as a function of the sign and magnitude of
1, where
2/
1 was fixed at 0.7. The magnitude of
1 had an impact on percent bias where larger values of
1 lead to larger (absolute) percent biases. The signs of
1 and
2 were also crucial. If
1 and
2 are of the same sign, we observed that the estimates of ß1 were attenuated towards 0.
Regarding the final point, it should be noted that we have also done calculations using more rapidly decaying serial association models where we observed that estimates of ß1 can be consistent for a parameter value with the opposite sign of the true value. The scenario under which risk of bias larger than ß1 was greatest was when
2/
1 was large (
0.7) and when there was a large short-term association that decayed rapidly with time separations (k
0.8). Such a scenario may arise when there is a relatively large amount of response dependence and when a rapidly decaying serial association model is imposed. In such situations, the short-term dependence estimate would be very large.
As a final note, we showed earlier in this section that with continuous response data, bias in ß1 is the opposite sign of
2. In the scenarios studied here with binary response data, this is also the case.
In this section, we evaluate design features that influence large sample relative efficiency and MSE of ß1 estimates for the CSM,
when the FCCM model is logit(µit) = 0.2 +
1xit +
2xit1. Specific data characteristics that we evaluate include: response association, LOR(
t)
{0.50 + 1.00 x
t0.6, 0.75 + 1.25 x
t0.6}; effect size,
1
{0.1, 0.5}; relative effect size of the omitted lagged covariate,
2/
1
{0.0, 0.3, 0.7, 1.0} ; cluster size, n
{20, 50}; and the number of independent clusters N
{75, 150}.
To approximate asymptotic operating characteristics, calculations were done using Monte Carlo methods with a large number of clusters. Even with a large number of clusters, error is associated with Monte Carlo methods, so average bias and variance were calculated from four replicates of N = 5000 clusters, and estimation properties were then based upon those averages.
Asymptotic variance was smaller when response association models were used. GEE-Exch variances were 1739% smaller than GEE-Ind, and GEE-Mixture and GEE-Serial variances were 2847% smaller. However, as shown in Section 4.1, biased estimates can result for CSM parameter estimates when a nonindependence weighting scheme is imposed. The MSE is an optimality criterion that incorporates both bias and variance. With a sufficiently large number of clusters, bias remains constant while variance declines with the addition of independent clusters. Therefore, the choice of weighting scheme should depend on the number of clusters available for analysis.
Table 1 displays relative MSEs using GEE-Ind as a reference for N = 75 and 150 clusters. A few clear patterns emerge. With small effect sizes (
1 = 0.1), GEE-Exch outperformed GEE-Ind, with MSEs reduced by 2439 %. With larger effect sizes, the optimality of GEE-Exch versus GEE-Ind depended on the ratio of
2 to
1, and the cluster size. GEE-Exch was more favorable with fewer clusters, and when the ratio of
2 to
1 was smaller. GEE-Mixture, which captures the true response association most closely, was optimal compared to other schemes when the effect size,
1, was small and when
2/
1 = 0.3 ; however, there are clearly scenarios in which one would not want to use response dependence weighting (e.g. when
1 = 0.5). GEE-Serial was the least desirable estimator in most situations. Although variances were similar to GEE-Mixture , substantially more bias was incurred when using GEE-Serial. This was also observed in Section 4.1.2. According to the MSE, weighted GEE estimators were more favorable than GEE-Ind with fewer independent clusters; smaller clusters (n); a smaller effect size (
1); and a smaller relative value for the omitted covariate parameter (
2/
1). The effect that overall response association and rate of serial decay in the working association model have on weighted estimates depends upon the potential for bias (i.e. the magnitude of
2/
1). When the potential for bias is large, the risk of bias resulting from modeling the association structure appropriately may well outweigh potential for efficiency gains by doing so.
|
In our simulations we study bias, efficiency, and MSE of CSM parameter estimates in finite samples. This is in contrast to the previous sections which were based on asymptotic calculations. Additionally, we examine the influence of two data features that were not examined previously: variation in cluster size (ni
n) and differing between-subject covariate distributions. We primarily focus on the CSM parameter corresponding to the time-varying covariate, Xit, but we also now include two cluster-level covariates: a continuous covariate, Xi
N(0, 1), and a group or treatment covariate, Gi, that takes on the values 0 or 1 each with probability 0.5. The FCCM model used to generate the binary response data was logit(µit) =
0 +
1xit +
2xit1 +
3xi +
4gi, where the dependence model was given by LOR(
t) =
0 +
1
t0.6 and
t = |ts|. To generate the correlated binary response data with mean and association models specified above, we used methods similar to those described in Emrich and Piedmonte (1991)
.
We examined data and model characteristics to investigate the validity and the practical implications of using GEE-Ind, GEE-Exch, GEE-Mixture, and GEE-Serial estimators with the CSM model
Several aspects were held fixed across the simulations: LOR(
t) = 0.25 + 1.75 x
t0.6,
= 0.4,
0 = 0.2,
1 = 0.1,
3 = 0.2, and
4 = 0.4. We varied (1) cluster size, ni
50 or ni uniformly distributed between 20 and 80; (2)
2 = {0, 0.03, 0.07, 0.10}; and (3) the number of independent clusters, N = {100, 200}.
The time-varying covariates studied in previous sections have been stochastically mean balanced. That is,
with c equal to 0. With small clusters, this leads to substantial variation in
between clusters, while with large clusters, the variation is relatively small. In an additional simulation study, we explored operating characteristics of estimators when we either forced the covariate to be mean balanced or when we forced it to be mean imbalanced. To force mean balance, we subtracted
from Xit and Xit1 before generating response data, and to force mean imbalance, we added a cluster-level random exposure level (Fi) to Xit and Xit1, where Fi
N(0, 1/2). Note that by subtracting
or by adding Fi to the cluster-varying covariate, we do not preserve the auto-regressive structure.
For cluster-level covariates, design features did not appear to influence operating characteristics of CSM parameter estimates. All estimates showed little bias and in general were equally efficient, though if the clusters varied in size, we observed up to a 7% larger variance for GEE-Ind compared to GEE-Mixture. In contrast, and consistent with results from the previous sections, the impact of covariance weighting choice depends on specific design features when time-varying covariate parameters are of interest. Under different data scenarios, different weighting schemes are recommended.
Table 2 displays the results from the first set of simulations in which we compared MSE, percent bias, and variance of the four weighting schemes under various designs. Whether cluster size was fixed at ni = 50 or was random ni
U(20, 80) had little impact on the optimality of weighting schemes. MSEs with GEE-Exch were smaller than with GEE-Ind in all designs studied. GEE-Exch was optimal for
2
{0.07, 0.10}, while the weighting schemes that incorporated the serial decay of response association were optimal when
2
{0, 0.03}.
|
As expected, sample size had a clear impact on MSE comparisons when
2
0. For example, when
2 = 0.07 and N = 100, GEE-Mixture had a slightly smaller MSE than GEE-Ind (MSE ratio = 0.96), while for N = 200 it was nearly 50 % larger (MSE ratio = 1.48).
GEE-Ind and GEE-Exch yielded approximately unbiased parameter estimates for all scenarios studied. This is consistent with our analytical and numerical calculations. Since the cluster sizes studied were large and the time-varying covariate was stochastically mean-0 balanced, biases with GEE-Exch were negligible. GEE-Mixture and GEE-Serial were shown to be severely biased when
2 = 0.1. Bias dropped rapidly with
2 and both estimators were unbiased when
2 = 0 (i.e. the Pepe and Anderson criterion was satisfied). The GEE-Mixture estimator was less biased than the GEE-Serial estimator because the long-range component
0, m was greater than 0, and therefore, the decay rate was slower than it was with pure serial decay where the long-range component was 0.
Similar to the results in Section 4.2 GEE-Ind yielded estimators with the largest variances. GEE-Mixture and GEE-Serial were most efficient and yielded variances that were 3545 % smaller than GEE-Ind. GEE-Exch tended to be more efficient than GEE-Ind with variance decreases of approximately 20 %.
Table 3 shows the MSE, percent bias, and variance estimates from the four weighting schemes where
2 = 0.07 and N = 100. When the time-varying covariate was forced to be mean balanced, GEE-Ind and GEE-Exch yielded almost identical results in all situations studied. This is consistent with the variance results given by Mancl and Leroux (1996)
|
According to MSEs, GEE-Exch was the optimal choice in all scenarios studied. Estimates were reasonably robust to mean-imbalanced data both in terms of bias and variance. GEE-Ind variance estimates grew rapidly with imbalanced data, as did biases for GEE-Mixture and GEE-Serial. Note that
2 = 0.07, which is a situation where GEE-Mixture and GEE-Serial are known to be quite biased. Different conclusions would be drawn if
2 was small.
We explored operating characteristics of CSM, time-varying covariate parameter estimates for several weighting schemes in a setting where the CSM did not equal the FCCM (
2
0), and where the response association had both serial components and long-range components. Long series of binary response data were examined.
A key issue is the trade-off of bias versus efficiency. In general, GEE-Ind estimates were unbiased. If the covariate distribution was mean balanced, nearly mean balanced, and mean imbalanced, GEE-Exch estimates were unbiased, negligibly biased, and biased, respectively. GEE-Mixture and GEE-Serial estimates were biased in all situations where the CSM did not equal the FCCM. In contrast, GEE-Mixture and GEE-Serial produced estimates with smallest variance, followed by GEE-Exch, while GEE-Ind produced estimates with the largest variance. The degree to which estimates were biased or inefficient depended upon several design features.
Effect size,
1, and the ratio,
2/
1, had a large impact on bias. If
2 was nonzero (FCCM
CSM), biased estimates resulted, and the magnitude of the bias grew with
2/
1. Bias has the opposite sign of
2, which will often result in an attenuation towards 0 for estimates of ß1. Also, holding
2/
1 fixed, bias grew with larger |
1| values.
When the functional form of the working dependence model included a serial component, bias was more severe when the decay rate was more rapid. In addition to the functional form of the association model, the degree of overall response association was crucial. For all working association models, biases were more severe when there was a large amount of overall response association. Therefore, in scenarios similar to those studied here, the most severe bias will result from imposing a rapidly decaying working serial association model when there is a high degree of overall or average response association.
The covariate distribution also played a key role in determining the degree of bias. For a broad range of autocorrelation parameter values,
(0, 0.7), weighted estimates were susceptible to substantial bias. As the autocorrelation parameter approached 1, the covariate resembled a time-invariant covariate and estimates were nearly unbiased. In addition to the covariate autocorrelation, mean-imbalanced time-varying covariate parameter estimates suffered substantially more bias for nonindependence weighting than corresponding mean-balanced designs.
An independence working association model will rarely (if ever) be the optimal choice for minimizing the variance of estimators for a longitudinal data setting. As relatively inefficient as GEE-Ind was shown to be when the time-varying covariate was stochastically mean balanced, it was even less efficient when the covariate was mean imbalanced.
| 5. EXAMPLE: CAMP |
|---|
|
|
|---|
CAMP was a randomized clinical trial that recruited 1041 children with mild to moderate, persistent asthma in eight North American cities to evaluate the long-term effects of daily inhaled anti-inflammatory medication use on asthma status and lung growth. During the screening phase, children recorded asthma symptoms on daily diary cards. Yu et al. (2000)
When investigating the short-term relationship between PM and symptoms from an observational study where subjects participated at differing times of year, confounding due to seasonal factors must be considered. One way to partially control for seasonal confounding is to decompose PMit into two components
and
where
is subject i's average exposure to PM concentrations. The association between symptoms and
is generally severely confounded by seasonal factors, and therefore is not of scientific interest. Additional control for unmeasured seasonal factors is often done using smooth functions of calendar date. The decomposition described above, which has the advantage that the time-varying covariate was mean balanced, was used in Yu et al. (2000)
.
Figure 2 shows point estimates and confidence intervals corresponding to three regression models. The first model included the 0-, 1-, and 2-day lags in PM concentrations simultaneously, while the second and third included the 0- and 1-day lags and only the 0-day lag, respectively. The first and second models could represent the FCCM model while the third is a CSM model. All estimates shown resulted from GEE-Ind controlling for age, gender, sensitivity to a methacholine challenge, race, day of week, temperature, temperature squared, and a linear spline in calendar date (to represent season). Notice that although either the 0- or the 1-day lag in PM concentrations may be cross-sectionally associated with symptoms, by estimating simultaneous effects we could naively be misled to conclude that there was no relationship between PM concentrations and symptoms. The error bar on the right shows that the lag-0 concentration was significantly associated with symptoms, while the panel on the left shows that modeling of the 2-day lag was unnecessary. In the middle panel note that estimates for the 0- and 1-day lags were similar in magnitude indicating that weighting may in fact be susceptible to bias if CSM regression inference is desired.
|
Figure 3 displays the LORelogram (Heagerty and Zeger, 1998
t) =
0, e ; (2) serial, LOR(
t) =
1, s
tk ; (3) mixture of serial plus exchangeable components, LOR(
t) =
0, m +
1, m
tk ; and (4) smooth, a natural spline with knots placed at 5, 10, 20, 30, 45, 60, and 80 day separations (
t). In Figure 3(a) the decay rate parameter k was set to 0.5, while in Figure 3(b) it was set to 0.35 to allow slower serial decay. It appears that both the serial and the mixture association models approximate smooth decay reasonably well. The exchangeable association model estimate is very useful for estimating an average or overall association among responses in a cluster. We see that the average or overall LOR association is approximately 1.
|
We compared six weighting schemes to estimate the CSM model parameters: (1) GEE-Ind; (2) GEE-Exch; (3) GEE-Serial; (4) GEE-Mixture; (5) GEE-Smooth, LOR(
t) is a smooth function of time separations; and (6) GEE-AR(1). We set k to 0.5 and also to 0.35 to allow for slower serial decay. The results, shown in Table 4, are consistent with patterns revealed in Section 4. GEE-Ind and GEE-Exch produced similar parameter estimates, 0.161 and 0.158, respectively, though exchangeable standard errors were smaller (0.065 versus 0.070). The results from Section 4 revealed that in situations where the cluster-varying covariate was mean balanced, GEE-Ind and GEE-Exch estimators had very similar operating characteristics. A possible explanation for the modest difference between the two estimates is that time-varying confounders (e.g. season, temperature, and day of week) were not mean balanced like the PM covariate, and therefore differences between these estimators could result.
|
Estimates and standard errors from weighting schemes that incorporated a serial component were smaller than those from GEE-Ind and GEE-Exch. With k = 0.5, the GEE-Serial estimate (0.087) was much smaller than the GEE-Ind estimate (0.161), as was the standard error (0.054 versus 0.070). When k was set to 0.35, the apparent bias dropped, which may be attributed to the slower decay rate. For k = 0.5 the GEE-Mixture estimate was larger than the GEE-Serial estimate because
0, m was estimated to be greater than 0, and therefore the rate of decay was slower. However, when k = 0.35, the estimate of
0, m was slightly negative, and the GEE-Mixture and the GEE-Serial estimates were nearly identical (0.109 and 0.111, respectively). Estimates and standard errors from GEE-Smooth fell between those from GEE-Exch and GEE-Mixture, and estimates and standard errors from GEE-AR(1) were similar to the rapidly decaying GEE-Serial. | 6. DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
In this study, we explored a biasvariance trade-off when the goal is to estimate CSM model parameters. We focused on a long series of binary response data where the response association had short-range serial and long-range components. Independence weighting estimates are unbiased but can be relatively inefficient. Exchangeable weighting estimates have similar operating characteristics to independence weighting estimates if the time-varying covariate is mean balanced, but are otherwise modestly biased and more efficient. Estimates that incorporate the serial component to the response association can be susceptible to substantial bias, but potentially with the smallest variance. As we discuss below, in deciding upon a working association model, exploratory data analysis may be crucial.
A number of data features influence the operating characteristics of CSM parameter estimates. Primary consideration should be paid to the effect size of any lagged covariate that is omitted from the regression model. If it is at least half as large in magnitude as the effect size for the lag of interest, independence or exchangeable weighting should generally be used. Otherwise, a working association model that resembles the true association should be used because the impact of bias may be substantially smaller than the resulting gain in efficiency.
The direction and magnitude of bias are fairly predictable. Bias in point estimates is generally the opposite sign of the coefficient for the omitted lagged covariate. Thus, in many plausible scenarios (
2 and
1 are the same sign), bias will be an attenuation towards 0. As a word of caution, if the omitted lag effect size is nearly as large as the target of inference, there is a substantial overall response association, and the working association model includes a rapidly decaying serial component, then parameter estimates may in fact be consistent for a value that has the opposite sign of the true value. The commonly used AR(1) working association model could be susceptible to this problem.
If scientifically reasonable, such as in the case with the CAMP analysis, we recommend using a mean-balanced covariate as opposed to a mean-imbalanced one. Parameter estimates based on nonindependence weighting are less biased and are as efficient when the covariate is mean balanced compared to when it is mean imbalanced. On the other hand, independence weighting estimates are unbiased irrespective of the covariate distribution but appear more efficient when the covariate is mean balanced.
In general, we recommend exploring an FCCM model to assess whether the lag of choice is a reasonable one. If other lag effects are larger than the chosen target of inference, severe bias may result from using nonindependence weighting. Once a choice of lag has been made, a LORelogram that uses flexible functions (e.g. natural splines) of time separations can be used to explore the form of the response association model. By modeling the association structure reasonably well, improvements in efficiency can be made. However, if the association among responses decays rapidly and the CSM is not equal to the FCCM, we recommend using a working association model that decays more slowly than the estimated serial decay model to avoid the potential for severely biased estimates. Finally, if scientifically reasonable, we recommend forcing mean balance in time-varying covariates (e.g. by centering) since estimates may be less biased if nonindependence weighting is used, and smaller variance estimates can be obtained with independence weighting estimates.
As a final note, although we have chosen to estimate CSM parameters directly, a valid indirect strategy could conceivably be implemented. First, likelihood-based or covariance weighting could be used to estimate the FCCM regression parameters. Second, a model for the exposure process would be constructed such that averages over the multivariate exposure distribution could be taken. Finally, the induced cross-sectional (or single lag) associations could then be estimated based on both the FCCM model, and the model for the exposure process via E[Yit | Xtj] = E{E[Yit | Xit, Xit1, Xit2, ...] | Xitj}, where the full conditional mean is averaged over the conditional distribution of the covariate process given Xitj. If such an approach was likelihood based then fully efficient estimates of cross-sectional associations may be obtained. However, compared to the direct method that we study, an indirect approach requires the additional modeling of the exposure, and may be sensitive to model misspecification.
| APPENDIX A |
|---|
|
|
|---|
To obtain
and 
given in (4.4) and (4.5), we note that if Xi is a standardized AR(1) random vector, when s>t,
![]() |
and when s<t1,
![]() |
Calculation.
The GEE and ALR estimating equation that is solved to calculate the estimate of ß1 is given by
where

and Ri is the working correlation matrix corresponding to cluster i. When we take the expectation of (a.1), we note 
s
t, when independence weighting is used, the expectations of (a.2) and (a.3) are equal to 0, and therefore GEE-Ind yields unbiased parameter estimates. Next, we take the expectation of (a.2) where s>t,
![]() |
The FCCM is E(Yit|Xit, Xit1), so E(Yit|Xit, Xit1, Xis) = E(Yit|Xit, Xit1), and since Xi is AR(1), for s>t, Xit1|Xit, Xis
Xit1|Xit, and
![]() |
Thus, the expectation of (a.2) is 0. The value of
is therefore given by the the expectation of (a.3).
![]() |
where,
![]() |
and Z =
2{Xit1 E(Xit1|Xit, Xis)}. This implies Z|Xit, Xis
N{0, Var(Xit1|Xit, Xis)}, and since
is given by
calculation.
![]() |
If the response is continuous, there is no meanvariance relationship (
), and the identity link function is used, the calculation of
and
simplifies. It can be shown that
![]() |
If
is the limiting value for the response correlation parameter, then it is straightforward to calculate asymptotic bias for the exchangeable and AR(1) response correlation structures:
![]() |
![]() |
| ACKNOWLEDGMENTS |
|---|
The authors gratefully acknowledge The Childhood Asthma Management Program, supported by contracts for the National Heart, Lung and Blood Institute, for permission to use the asthma symptom data. This research was supported by grant HL72966 from the National Institutes of Health.
| REFERENCES |
|---|
|
|
|---|
-
CAMP RESEARCH GROUP (2000). Long-term effects of budesonide or nedocrimil in children with asthma. New England Journal of Medicine 343, 10541063.
CAREY, V., ZEGER, S. L. AND DIGGLE, P. (1993). Modelling multivariate binary data with alternating logistic regressions. Biometrika 80, 517526.
CROWDER, M. (1986). On consistency and inconsistency of estimating equations. Econometric Theory 2, 305330.
CROWDER, M. (1995). On the use of a working correlation matrix in using generalised linear models for repeated measures. Biometrika 82, 407410.
DIGGLE, P., HEAGERTY, P. J., LIANG, K.-Y. AND ZEGER, S. L. (2002). Analysis of Longitudinal Data. New York: Oxford University Press.
EMOND, M. J., RITZ, J. AND OAKES, D. (1997). Bias in GEE estimates from misspecified models for longitudinal data. Communications in Statistics, Part ATheory and Methods 26, 1532.
EMRICH, L. J. AND PIEDMONTE, M. R. (1991). A method for generating high-dimensional multivariate binary variates. The American Statistician 45, 302304.[CrossRef]
FITZMAURICE, G. M. (1995). A caveat concerning independence estimating equations with multivariate binary data. Biometrics 51, 309317.[CrossRef][Web of Science][Medline]
FITZMAURICE, G. M., LAIRD, N. M. AND ROTNITZKY, A. G. (1993). Regression models for discrete longitudinal responses. Statistical Science 8, 284299.
FITZMAURICE, G. M. AND LIPSITZ, S. R. (1995). A model for binary time series data with serial odds ratio patterns. Applied Statistics 44, 5161.
HEAGERTY, P. J. AND ZEGER, S. L. (1998). Lorelogram: a regression approach to exploring dependence in longitudinal categorical responses. Journal of the American Statistical Association 93, 150162.[CrossRef]
HERNÁN, M. A., BRUMBACK, B. AND ROBINS, J. M. (2001). Marginal structural models to estimate the joint causal effect of nonrandomized treatments. Journal of the American Statistical Association 96, 440448.[CrossRef][Web of Science]
LIANG, K.-Y. AND ZEGER, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 1322.
LIPSITZ, S. R., LAIRD, N. M. AND HARRINGTON, D. P. (1991). Generalized estimating equations for correlated binary data: using the odds ratio as a measure of association. Biometrika 78, 153160.
MANCL, L. A. AND LEROUX, B. G. (1996). Efficiency of regression estimates for clustered data. Biometrics 52, 500511.[CrossRef][Web of Science][Medline]
MCDONALD, B. W. (1993). Estimating logistic regression parameters for bivariate binary data. Journal of the Royal Statistical Society, Series B, Methodological 55, 391397.
NEUHAUS, J. M., KALBFLEISCH, J. D. AND HAUCK, W. W. (1991). A comparison of cluster-specific and population-averaged approaches for analyzing correlated binary data. International Statistical Review 59, 2535.[Web of Science]
PAN, W., LOUIS, T. A. AND CONNETT, J. E. (2000). A note on marginal linear regression with correlated response data. The American Statistician 54, 191195.[CrossRef]
PEPE, M. S. AND ANDERSON, G. L. (1994). A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics, Part BSimulation and Computation 23, 939951.
PRENTICE, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44, 10331048.[CrossRef][Web of Science][Medline]
RAO, C. R. (1973). Linear Statistical Inference and Its Applications. New York: John Wiley and Sons.
SAMET, J. S., DOMINICI, F., CURRIERO, F. C., COURSAC, I. AND ZEGER, S. L. (2000). Fine particulate air pollution and mortality in 20 U.S. cities, 19871994. The New England Journal of Medicine 343, 17421749.
SUTRADHAR, B. C. AND DAS, K. (1999). On the efficiency of regression estimators in generalised linear models for longitudinal data. Biometrika 86, 459465.
SUTRADHAR, B. C. AND DAS, K. (2000). On the accuracy of efficiency of estimating equation approach. Biometrics 56, 622625.[CrossRef][Web of Science][Medline]
WANG, Y.-G. AND CAREY, V. (2003). Working correlation structure misspecification, estimation and covariate design: implications for generalized estimating equations performance. Biometrika 90, 2941.
YU, O. C., SHEPPARD, L., LUMLEY, T., KOENIG, J. Q. AND SHAPIRO, G. G. (2000). Effects of ambient air pollution on symptoms of asthma in seattle-area children enrolled in the camp study. Environmental Health Perspectives 108, 12091214.[Web of Science][Medline]
ZEGER, S. L., LIANG, K.-Y. AND ALBERT, P. S. (1988). Models for longitudinal data: a generalized estimating equation approach . Biometrics 44, 10491060. (Corr: V45 p347.)
ZHAO, L. P., PRENTICE, R. L. AND SELF, S. G. (1992). Multivariate mean parameter estimation by using a partly exponential model. Journal of the Royal Statistical Society, Series B, Methodological 54, 805811.
Received January 15, 2004; revised November 17, 2004; revised May 12, 2005; accepted for publication May 19, 2005.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
J. H Fowler and N. A Christakis Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study BMJ, December 4, 2008; 337(dec04_2): a2338 - a2338. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. S. Schildcrout and P. J. Heagerty On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates Biostat., October 1, 2008; 9(4): 735 - 749. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. A. Christakis and J. H. Fowler The Collective Dynamics of Smoking in a Large Social Network N. Engl. J. Med., May 22, 2008; 358(21): 2249 - 2258. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. A. Christakis and J. H. Fowler The Spread of Obesity in a Large Social Network over 32 Years N. Engl. J. Med., July 26, 2007; 357(4): 370 - 379. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. S. Schildcrout, L. Sheppard, T. Lumley, J. C. Slaughter, J. Q. Koenig, and G. G. Shapiro Ambient Air Pollution and Asthma Exacerbations in Children: An Eight-City Analysis Am. J. Epidemiol., September 15, 2006; 164(6): 505 - 517. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
























