Biostatistics Advance Access originally published online on June 29, 2006
Biostatistics 2007 8(2):337-344; doi:10.1093/biostatistics/kxl013
On the equivalence of case-crossover and time series methods in environmental epidemiology
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205-2179, USA ylu{at}jhsph.edu
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205-2179, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
The case-crossover design was introduced in epidemiology 15 years ago as a method for studying the effects of a risk factor on a health event using only cases. The idea is to compare a case's exposure immediately prior to or during the case-defining event with that same person's exposure at otherwise similar "reference" times. An alternative approach to the analysis of daily exposure and case-only data is time series analysis. Here, log-linear regression models express the expected total number of events on each day as a function of the exposure level and potential confounding variables. In time series analyses of air pollution, smooth functions of time and weather are the main confounders. Time series and case-crossover methods are often viewed as competing methods. In this paper, we show that case-crossover using conditional logistic regression is a special case of time series analysis when there is a common exposure such as in air pollution studies. This equivalence provides computational convenience for case-crossover analyses and a better understanding of time series models. Time series log-linear regression accounts for overdispersion of the Poisson variance, while case-crossover analyses typically do not. This equivalence also permits model checking for case-crossover data using standard log-linear model diagnostics.
Keywords: Air pollution; Case-crossover design; Environmental epidemiology; Log-linear model; Overdispersion; Poisson regression; Time series
| 1. INTRODUCTION |
|---|
The case-crossover design was introduced in epidemiology 15 years ago as a method for studying the effects of a risk factor on a health event using only cases (Maclure, 1991). Maclure (1991) originally proposed that only the intervals before the one in which the event occurred can be used for reference. Greenland (1996) and Navidi (1998) pointed out that this choice produces a biased odds ratio estimate in the presence of a secular trend. As an alternative, Navidi (1998) proposed the "full-stratum" design, in which all intervals other than the event interval can be used for reference. Bateson and Schwartz (1999) suggested a "symmetrical bidirectional" reference window that uses control intervals equidistant shortly before and after the event to control for bias induced by long-term and seasonal trends. Lumley and Levy (2000) and Janes and others (2005b) have shown that in the bidirectional design, conditional logistic regression (CLR) gives "overlap" biased estimates of the odds ratio because the reference windows are not chosen independently of the event time. They favor the use of prespecified reference windows or "time-stratified designs" (TSDs).
The substantial statistical interest in case-crossover designs reflects their common application in many subspecialties of epidemiology, including cardiovascular disease (e.g. Koton and others, 2004), HIV (e.g. Schneider and others, 2005), accidents (Hagel and others, 2005), and health service quality assessment (e.g. Polevoi and others, 2005). The number of papers per year that include "case-crossover" in their title or keywords, as identified in a Science Citation Index search, grew from 4 in 1993 to 66 in 2005.
This work is motivated by our group's research on the effects of air pollution on morbidity or mortality, where the case-crossover method is especially popular (e.g. Dominici and others, 2004, 2006; Wellenius and others, 2005; Zanobetti and Schwartz, 2005). Case-crossover methods are used to estimate the relative rate of events per unit increase in exposure, controlling for potential confounding variables through matching. For example, Zanobetti and Schwartz (2005) applied CLR to data from each of 21 regions to study the relative risk of emergency room admission for myocardial infarction associated with PM10 exposure (particulate matter 10 µm or smaller in aerodynamic diameter). This application and many others like it are characterized by the fact that the exposure for a given day is assumed to be the same for all persons.
An alternative approach to the analysis of daily exposure and case-only data is time series analysis (e.g. Kedem and Fokianos, 2002). Here, log-linear regression models express the expected total number of events on each day as a function of the exposure level and potential confounding variables. In time series analyses of air pollution, smooth functions of time and weather are the main confounders. The smooth function of time is typically modeled using a flexible parametric or nonparametric curve to represent longer term trends in the outcome due to changes in the population, its health behaviors, and services and to represent seasonality. Zeger and others (2006) and Bell and others (2004) present overviews of time series methods in general and with application to air pollution epidemiology specifically.
The current understanding is that case-crossover methods control for potential confounding "by design" while time series methods control by modeling (Bateson and Schwartz, 2001; Janes and others, 2005b; Mittleman, 2005; Zanobetti and Schwartz, 2005). In this way, case-crossover analysis apparently avoids the need to control through statistical modeling.
The relative merits of time series and case-crossover studies have been discussed in several recent papers in the environmental epidemiology literature. For example, Checkoway and others (2000) selected the case-crossover approach as an alternative to time series methods in order to make causal inferences about air pollution effects. Bateson and Schwartz (1999, 2001) demonstrated that strong confounding by seasonality could be controlled by design in the case-control approach.
In this paper, we demonstrate that when exposure is common to the cohort at each time, as in air pollution studies, the case-crossover approach is an application of log-linear time series analysis rather than an alternative approach. This equivalence has previously been noted in special cases by Levy and others (2001) and by Janes and others (2005a). We show how the choice of reference intervals in the case-crossover design is equivalent to the choice of estimator for the confounding function of time in the time series analysis. Given this correspondence, we offer an alternate perspective on bias of inferences from case-crossover designs. We show that inferences from case-crossover designs based upon CLR do not account for overdispersion as is routinely done in time series analyses. The connection of case-crossover and time series analyses also sheds some new light on the time series applications.
| 2. GENERAL FRAMEWORK |
|---|
Let $X_{it}$ be the exposure for person $i$ in interval $t$, $t = 1,\ldots,T$, and let $Y_{it}$ indicate whether subject $i$ has the event in interval $t$ (1, event; 0, no event). Assume that the outcome $Y_{it} = 1$ is rare and that the probability that subject $i$ fails in interval $t$ is given by the relative risk model

$$\lambda_i(t, X_{it}) = \lambda_{0it}\exp(\beta X_{it}) = \lambda_{0i}\exp(\gamma_{it})\exp(\beta X_{it}). \tag{2.1}$$

Each subject is assumed to have his/her own baseline risk $\lambda_{0it}$ at time $t$ consisting of two parts: $\lambda_{0i}$ is a constant frailty for person $i$ and $\exp(\gamma_{it})$ is the effect of unmeasured time-varying factors on his/her risk. The exposure $X_{it}$ is assumed to have a common effect on each individual, as quantified by the log relative risk $\beta$.
For air pollution and other similar studies, the population is assumed to have common exposure during each interval so that $X_{it} = X_t$.
Denote the population from which cases arise by $\mathcal{P}$; hence, the observed number of events in interval $t$ is $Y_t = \sum_{i \in \mathcal{P}} Y_{it}$. The expected number of events is the sum over the population of the individual risks,

$$\mu_t = E(Y_t) = \sum_{i \in \mathcal{P}} \lambda_{0it}\exp(\beta X_t) = \exp(\beta X_t + S_t), \tag{2.2}$$

where $\exp(S_t) = \sum_{i \in \mathcal{P}} \lambda_{0it} = \sum_{i \in \mathcal{P}} \lambda_{0i}\exp(\gamma_{it})$. The target of inference is the regression coefficient $\beta$, the common log relative rate of the event per unit change in the exposure. $S_t$ is a nuisance function that is the log of the total population baseline risk on each day $t$. The total risk integrates across the population the individual baseline risks and behaviors such as exercise, smoking, and seeking health care. It also represents factors that affect the population as a whole, such as influenza epidemics or improved medical services. In time series analysis, $S_t$ is assumed to be a smooth function of time and is modeled with parametric or nonparametric curves such as regression or smoothing splines (e.g. Kelsall and others, 1997). Because $S_t$ is not the scientific focus, most time series investigators examine the sensitivity of inferences about the exposure relative risk $\beta$ to the choice of model for $S_t$ (e.g. Dominici and others, 2004).
To estimate $\beta$ and $S_t$ jointly, we assume that $Y_t$ follows a log-linear model with mean $E(Y_t) = \mu_t$ and $\mathrm{Var}(Y_t) = \phi\mu_t$. For any chosen estimator $\hat{S}_t$ of $S_t$, we obtain the estimate $\hat{\beta}$ by solving the following estimating equation:

$$\sum_{t=1}^{T} X_t\left\{Y_t - \exp(\beta X_t + \hat{S}_t)\right\} = 0. \tag{2.3}$$

Note that $\hat{\beta}$ will depend on the estimate of the nuisance function $\hat{S}_t$, which also depends on $\beta$, so that joint estimation typically involves iteration. We choose the estimate of $\beta$ that makes the observed number of events $Y_t$ on each day $t$ on average equal to the model-based predicted value $\hat{\mu}_t = \exp(\hat{\beta} X_t + \hat{S}_t)$. Inferences about $\beta$ are made robust to the Poisson assumption by allowing the variance of the data to exceed its mean using the method of "quasi-likelihood" or by using a robust variance estimator (Liang and Zeger, 1986; McCullagh and Nelder, 1989; White, 1982; Zeger, 1988).
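As a rough illustration of how an estimating equation of the form (2.3) can be solved, the sketch below fits a log-linear model by Newton-Raphson when the nuisance function is given a simple parametric form. The simulated series, the single annual sine/cosine pair standing in for the smooth function of time, and the function name `fit_loglinear` are all assumptions made for this example; they are not the models used in the applications cited above.

```python
import numpy as np

def fit_loglinear(y, X, n_iter=100, tol=1e-10):
    """Solve sum_t x_t (y_t - mu_t) = 0, with mu_t = exp(x_t' coef), by Newton-Raphson."""
    coef = np.zeros(X.shape[1])
    coef[0] = np.log(y.mean())  # column 0 of X is assumed to be the intercept
    for _ in range(n_iter):
        mu = np.exp(X @ coef)
        score = X.T @ (y - mu)               # left-hand side of the estimating equation
        info = X.T @ (X * mu[:, None])       # Poisson information matrix
        step = np.linalg.solve(info, score)
        coef += step
        if np.max(np.abs(step)) < tol:
            break
    return coef

# Simulated daily series; every number here is hypothetical.
rng = np.random.default_rng(0)
T = 730
t = np.arange(T)
x = rng.normal(size=T)                                  # common exposure X_t
S = np.log(20.0) + 0.3 * np.sin(2 * np.pi * t / 365)    # smooth nuisance function S_t
beta_true = 0.05
y = rng.poisson(np.exp(S + beta_true * x))

# Parametric model for S_t: intercept plus one annual harmonic; exposure is the last column.
X = np.column_stack([np.ones(T),
                     np.sin(2 * np.pi * t / 365),
                     np.cos(2 * np.pi * t / 365),
                     x])
coef = fit_loglinear(y, X)
beta_hat = coef[-1]
residual_score = X.T @ (y - np.exp(X @ coef))   # should be numerically zero at the solution
```

Because the nuisance terms and the exposure coefficient are updated together in each Newton step, this is one simple way to carry out the joint iteration described above; a smoother-based estimate of the nuisance function would instead alternate between the two.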
In the case-crossover approach, the exposure of a case in the event interval is compared to the exposures from a set of reference periods. We denote the event interval for subject $i$ by $t_i$ and the set of reference periods by $W(t_i)$. For example, for day 10, $W(10)$ might be $\{8, 9, 10, 11, 12\}$. The key assumption of a case-crossover design is that the time-varying effect $\gamma_{ij}$ is constant for all $j$ within the reference window $W(t_i)$, i.e. $\gamma_{ij} = \gamma_{ij'}$ for $j, j' \in W(t_i)$.
Conditional on an individual being a case within a prespecified reference window $W(t_i)$, the probability $p_{it_i}$ that subject $i$ fails at time $t_i$ is

$$p_{it_i} = \frac{\exp(\beta X_{it_i})}{\sum_{j \in W(t_i)} \exp(\beta X_{ij})}, \tag{2.4}$$

which is free of the time-constant effect $\lambda_{0i}$ and the time-varying effects $\gamma_{ij}$, by the case-crossover assumption that $\gamma_{ij}$ is constant for all $j$ within the reference window $W(t_i)$.
As Janes and others (2005b) have pointed out, this probability is not correct if the reference window depends on $t$, e.g. in the symmetric bidirectional design (SBD). However, (2.4) can still be used to construct an estimating equation for $\beta$.
If we assume subjects are independent, the likelihood function is

$$L(\beta) = \prod_{i} \frac{\exp(\beta X_{it_i})}{\sum_{j \in W(t_i)} \exp(\beta X_{ij})}. \tag{2.5}$$

The estimating equation for $\beta$ is

$$\sum_{i} \left\{ X_{it_i} - \frac{\sum_{j \in W(t_i)} X_{ij}\exp(\beta X_{ij})}{\sum_{j \in W(t_i)} \exp(\beta X_{ij})} \right\} = 0. \tag{2.6}$$

This estimating equation is the sum over subjects of the difference between each subject's exposure at the index time $t_i$ and a weighted average of exposures at all times in the reference window $W(t_i)$ (Janes and others, 2005a). By solving (2.6), we estimate $\beta$ by the value that on average makes the relative-risk-weighted average of exposures on reference days equal to the exposure on the event days.
If we assume common exposure, $X_{it} = X_t$, (2.6) can be rewritten as

$$\sum_{t=1}^{T} X_t \left\{ Y_t - \sum_{m \in \widetilde{W}(t)} Y_m \frac{\exp(\beta X_t)}{\sum_{k \in W(m)} \exp(\beta X_k)} \right\} = 0. \tag{2.7}$$

In the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org), we give the derivation of (2.7). Here, $\widetilde{W}(t) = \{m : t \in W(m)\}$ is the set of days containing day $t$ in their reference window. For the SBD and TSD, but not more generally, this set is identical to the reference set for day $t$ itself, that is, $\widetilde{W}(t) = W(t)$.

In (2.7), $\sum_{m \in \widetilde{W}(t)} Y_m \exp(\beta X_t) / \sum_{k \in W(m)} \exp(\beta X_k)$ is the weighted average of numbers of events across days $m$ that have day $t$ in their reference window. The weight for $Y_m$ is the probability of having an event on day $t$ given the reference window $W(m)$.
The case-crossover equation (2.7) is a special case of the time series equation (2.3) in which $S_t$ is estimated by a weighted average of the observed numbers of events for those intervals $m$ that include interval $t$ in their reference windows. The weights are determined by the conditional probabilities that an event occurs in $t$ given that it occurs within the window.
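The rewriting of the subject-level score (2.6) as the day-level score (2.7) under common exposure can be checked numerically. The sketch below is only an illustration under assumed inputs: the reference window (days within two days of the event day, clipped at the ends of the series), the simulated exposures and event days, and the function names are all hypothetical. Clipping makes the windows asymmetric near the ends, so the set of days that reference day t is not always the same as the window of day t itself.

```python
import numpy as np

def window(t, T, radius=2):
    """Hypothetical reference window: days within `radius` of day t, clipped to the series."""
    return np.arange(max(0, t - radius), min(T - 1, t + radius) + 1)

def score_subjects(beta, event_days, x):
    """Subject-level score as in (2.6), with common exposure: one term per case."""
    T = len(x)
    total = 0.0
    for ti in event_days:
        W = window(ti, T)
        w = np.exp(beta * x[W])
        total += x[ti] - np.sum(x[W] * w) / np.sum(w)
    return total

def score_counts(beta, y, x):
    """Day-level score as in (2.7): one term per day, using daily counts."""
    T = len(x)
    fitted = np.zeros(T)
    for m in range(T):
        W = window(m, T)
        w = np.exp(beta * x[W])
        fitted[W] += y[m] * w / np.sum(w)   # day m contributes to every day t in W(m)
    return np.sum(x * (y - fitted))

rng = np.random.default_rng(1)
T = 60
x = rng.normal(size=T)                        # common exposure on each day
event_days = rng.integers(0, T, size=200)     # event day for each of 200 cases
y = np.bincount(event_days, minlength=T)      # daily event counts
beta = 0.3

s_subjects = score_subjects(beta, event_days, x)
s_counts = score_counts(beta, y, x)
```

The two scores agree for any value of `beta`, not only at the solution, since the second is just a reordering of the sums in the first.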
Two special cases are worth considering further: TSD and SBD. For TSD, time is divided a priori into strata $s(t) = 1,\ldots,S$. The reference window for day $t$ is the set of days in its stratum (Lumley and Levy, 2000). Levy and others (2001) previously pointed out that the time-stratified case-crossover design leads to the same estimate as obtained from a Poisson regression with dummy variables indicating the strata. The score equation can be written as

$$\sum_{t=1}^{T} X_t \left\{ Y_t - \frac{\sum_{m \in W(t)} Y_m}{\sum_{k \in W(t)} \exp(\beta X_k)} \exp(\beta X_t) \right\} = 0, \tag{2.8}$$

where $\hat{\mu}_t = \exp(\beta X_t) \sum_{m \in W(t)} Y_m / \sum_{k \in W(t)} \exp(\beta X_k)$ is the expected number of events on day $t$. Note that $\sum_{m \in W(t)} Y_m / \sum_{k \in W(t)} \exp(\beta X_k)$ is the maximum likelihood estimator of $\exp(S_{s(t)})$. The smooth function of time is assumed to be a step function with a separate level of population baseline risk for each prespecified stratum. Whether to expect the total population baseline risk to change abruptly at each stratum boundary as assumed in this design is a question specific to each application. However, if it does not, assuming $S_t$ is a step function may introduce bias in the estimator of the pollution log relative risk $\beta$.
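The correspondence between the time-stratified design and Poisson regression with stratum indicators can be illustrated on simulated data. In the sketch below, the 28-day strata, the simulated counts, and the helper names `stratum_levels` and `clr_score_tsd` are assumptions made for the example. The Poisson model is fitted by Newton-Raphson, and the time-stratified score of the form (2.8) is then evaluated at the fitted exposure coefficient.

```python
import numpy as np

# Simulated data; all values are hypothetical.
rng = np.random.default_rng(2)
n_strata, width = 12, 28
T = n_strata * width
strata = np.arange(T) // width                    # stratum index s(t) for each day
x = rng.normal(size=T)                            # common exposure
levels = rng.normal(np.log(15.0), 0.2, size=n_strata)
y = rng.poisson(np.exp(levels[strata] + 0.1 * x))

# Poisson regression with one indicator per stratum plus the exposure (last column).
D = (strata[:, None] == np.arange(n_strata)[None, :]).astype(float)
X = np.column_stack([D, x])
coef = np.append(np.full(n_strata, np.log(y.mean())), 0.0)
for _ in range(100):
    mu = np.exp(X @ coef)
    step = np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    coef += step
    if np.max(np.abs(step)) < 1e-12:
        break
beta_hat = coef[-1]

def stratum_levels(beta):
    """Closed-form baseline level per stratum: events divided by summed relative risks."""
    events = np.bincount(strata, weights=y, minlength=n_strata)
    risk = np.bincount(strata, weights=np.exp(beta * x), minlength=n_strata)
    return events / risk

def clr_score_tsd(beta):
    """Time-stratified score: sum over days of X_t times (Y_t minus fitted mean)."""
    mu_hat = stratum_levels(beta)[strata] * np.exp(beta * x)
    return np.sum(x * (y - mu_hat))

score_at_poisson_mle = clr_score_tsd(beta_hat)    # numerically zero
```

The fitted stratum coefficients from the Poisson regression, once exponentiated, also match the closed-form stratum levels, which is the step-function estimate of the baseline described above.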
In the SBD, symmetric control days close to the event time are used. As the simplest example, define the controls as the days immediately before and after the event day. Then the score equation can be written as

$$\sum_{t} X_t \left\{ Y_t - \sum_{m=t-1}^{t+1} Y_m \frac{\exp(\beta X_t)}{\sum_{k=m-1}^{m+1} \exp(\beta X_k)} \right\} = 0.$$

This is equivalent to using a locally weighted running-mean smoother to estimate $S_t$ in time series analysis.
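A minimal sketch of the smoother implied by this simplest SBD is given below, assuming controls one day before and one day after the event day. The function name `sbd_baseline` is hypothetical, and the treatment of the first and last days of the series is left open: only interior days, for which all the windows involved lie inside the series, are returned.

```python
import numpy as np

def sbd_baseline(y, x, beta):
    """Baseline exp(S_t) implied by the one-day symmetric bidirectional design.

    Returns values for days t = 2, ..., T-3 (0-indexed), where every window
    {m-1, m, m+1} with m in {t-1, t, t+1} lies inside the series.
    """
    e = np.exp(beta * x)
    denom = np.convolve(e, np.ones(3), mode="valid")      # sum of exp(beta X_k), k in {m-1, m, m+1}
    ratio = y[1:-1] / denom                               # Y_m divided by its window total
    return np.convolve(ratio, np.ones(3), mode="valid")   # sum over m in {t-1, t, t+1}

# Hypothetical data to exercise the function.
rng = np.random.default_rng(4)
T = 50
x = rng.normal(size=T)
y = rng.poisson(10.0, size=T)

baseline = sbd_baseline(y, x, beta=0.2)
running_mean = np.convolve(y, np.ones(3) / 3, mode="valid")[1:-1]
```

When `beta` is zero the weights are all equal and the result reduces to the ordinary three-day running mean of the counts; for nonzero `beta` the weights depend on the exposures, which is the sense in which the running mean is locally weighted.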
| 3. DISCUSSION |
|---|
This paper has shown that the CLR estimating equation used to obtain the case-crossover estimate of relative risk is a special case of the time series log-linear model estimating equation when exposure is common across subjects in each interval. Time series and case-crossover analyses simply offer different parameterizations for $S_t$.
The time-stratified case-crossover design is equivalent to Poisson regression with indicator variables for strata (Levy and others, 2001). The smooth function of time $S_t$ is assumed to be a step function with different levels of total population baseline risk for each stratum. The symmetric bidirectional case-crossover design is equivalent to Poisson regression using a locally weighted running-mean smoother to estimate $S_t$.
The equivalence of the case-crossover and time series methods improves our understanding of both methods and provides computational convenience. Most case-crossover analyses use CLR for estimation. When the number of time intervals and the number of controls for each case are large (e.g. full-stratum design), standard CLR is computationally inefficient by comparison with Poisson regression.
Each case-crossover design corresponds to a model (or estimator) for $S_t$. The equivalence of case-crossover and time series methods permits model checking for case-crossover data using standard log-linear model diagnostic tools (McCullagh and Nelder, 1989).
When the same estimating equation is used for a time series and a case-crossover analysis, that is, when the same estimator of $S_t$ is used, the two methods can give different standard errors. This is because time series analysis allows for overdispersion of the Poisson variance, while the case-crossover analysis uses the exact Poisson variance to calculate the standard error. In some applications, the Poisson assumption may not be valid.
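The difference in standard errors can be sketched for the time-stratified design, where both analyses share the same point estimate. The example below is an illustration under assumptions: the counts are simulated with extra day-to-day variation from a gamma multiplier, the overdispersion parameter is estimated by the Pearson statistic divided by its residual degrees of freedom (one common choice, not necessarily the one used in any particular analysis), and the function name `tsd_fit` is hypothetical.

```python
import numpy as np

def tsd_fit(y, x, strata, n_iter=50):
    """Time-stratified estimate of beta with model-based and overdispersion-scaled SEs."""
    n_strata = strata.max() + 1
    n_s = np.bincount(strata, weights=y, minlength=n_strata)   # events per stratum

    def pieces(beta):
        e = np.exp(beta * x)
        w = e / np.bincount(strata, weights=e, minlength=n_strata)[strata]  # within-stratum weights
        mu = n_s[strata] * w                                                # fitted daily means
        xbar = np.bincount(strata, weights=w * x, minlength=n_strata)
        x2bar = np.bincount(strata, weights=w * x ** 2, minlength=n_strata)
        info = np.sum(n_s * (x2bar - xbar ** 2))    # information for beta
        score = np.sum(x * (y - mu))                # time-stratified score
        return score, info, mu

    beta = 0.0
    for _it in range(n_iter):
        score, info, _mu = pieces(beta)
        step = score / info
        beta += step
        if abs(step) < 1e-12:
            break
    _score, info, mu = pieces(beta)
    # Pearson estimate of the overdispersion parameter (an assumed choice of estimator).
    phi = np.sum((y - mu) ** 2 / mu) / (len(y) - n_strata - 1)
    se_model = np.sqrt(1.0 / info)            # uses the exact Poisson variance
    se_scaled = np.sqrt(phi) * se_model       # allows Var(Y_t) = phi * mu_t
    return beta, se_model, se_scaled, phi

# Simulated overdispersed counts; all values are hypothetical.
rng = np.random.default_rng(3)
T, width = 728, 28
strata = np.arange(T) // width
x = rng.normal(size=T)
extra = rng.gamma(shape=5.0, scale=0.2, size=T)    # unmeasured day-to-day variation, mean 1
y = rng.poisson(20.0 * np.exp(0.05 * x) * extra)

beta_hat, se_model, se_scaled, phi_hat = tsd_fit(y, x, strata)
```

With data generated this way the estimated overdispersion is well above 1, so the scaled standard error is noticeably larger than the model-based one even though the point estimate is identical.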
This connection also informs our interpretation of time series analysis. For example, in Dominici and others (2004), time series models are used to estimate a PM effect on daily mortality. The degrees of freedom to estimate $S_t$ with a regression spline are allowed to vary ninefold from 2.3 to 21 degrees of freedom per year, yet the standard error of the pollution effect changes little. For matched case-control studies, there is little change in the standard error when the number of exactly matched controls per case is beyond roughly four (McCullagh and Nelder, 1989). In a case-crossover design, this corresponds to four control days per event day, or equivalently 90 degrees of freedom per year, which is much greater than the entire range included by Dominici and others (2004). This point only considers precision; the actual choice of degrees of freedom is obviously a trade-off between bias and precision.
The connection between case-crossover estimates obtained by CLR and by time series methods is an example of the connection between logistic and log-linear Poisson regression (McCullagh and Nelder, 1989). A related connection is between Poisson regression and the Cox proportional hazards model with time-invariant covariates estimated by CLR, as discussed by Clayton (1988, 1991). The Cox model is approximated by a log-linear Poisson model for the number of events in small intervals of follow-up time. The number of events is regressed on the covariates plus indicator variables for bins with log person-time in each bin as an offset. Clayton exploits this connection to develop Bayesian formulations of frailty and other extensions of the basic Cox model.
This connection is also apparent in our work whereby a hazard model with individual frailties is the basis for a log-linear regression for the binned event counts. The overdispersion in our time series model reflects the influence of unmeasured causes of mortality that vary over time in a manner that is not accounted for by the assumed model for $S_t$. These factors are population analogues to frailties in the survival context.
In this paper, we focused only on exposures common to all subjects. In many applications of the case-crossover design, exposures vary among subjects. The connection between case-crossover and time series methods in this case is a topic for further study.
| ACKNOWLEDGMENTS |
|---|
The authors are grateful for partial support from the National Institute for Environmental Health Sciences (NIEHS) grant ES012054-03 and the NIEHS Center in Urban Environmental Health grant P30 ES 03819. Conflict of Interest: None declared.
| REFERENCES |
|---|
Bateson TF and Schwartz J. (1999) Control for seasonal variation and time trend in case-crossover studies of acute effects of environmental exposures. Epidemiology 10:539-44.
Bateson TF and Schwartz J. (2001) Selection bias and confounding in case-crossover of environmental time-series data. Epidemiology 12:654-61.
Bell ML, Samet JM, Dominici F. (2004) Time-series studies of particulate matter. Annual Review of Public Health 25:247-80.
Checkoway H, Levy D, Sheppard L, Kaufman J, Koenig J, Siscovick D. (2000) A case-crossover analysis of fine particulate matter air pollution and out-of-hospital sudden cardiac arrest. Research Report 99 (Health Effects Institute, Boston, MA).
Clayton D. (1988) The analysis of event history data: a review of progress and outstanding problems. Statistics in Medicine 7:819-41.
Clayton DG. (1991) A Monte Carlo method for Bayesian inference in frailty models. Biometrics 47:467-85.
Dominici F, McDermott A, Hastie TJ. (2004) Improved semiparametric time series models of air pollution and mortality. Journal of the American Statistical Association 99:938-48.
Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, Samet JM. (2006) Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. Journal of the American Medical Association 295:1127-34.
Greenland S. (1996) Confounding and exposure trends in case-crossover and case-time-control designs. Epidemiology 7:231-9.
Hagel BE, Pless IB, Goulet C, Platt RW, Robitaille Y. (2005) Effectiveness of helmets in skiers and snowboarders: case-control and case crossover study. British Medical Journal 330:281-3.
Janes H, Sheppard L, Lumley T. (2005a) Case-crossover analyses of air pollution exposure data: referent selection strategies and their implications for bias. Epidemiology 16:717-26.
Janes H, Sheppard L, Lumley T. (2005b) Overlap bias in the case-crossover design, with application to air pollution exposures. Statistics in Medicine 24:285-300.
Kedem B and Fokianos K. (2002) Regression Models for Time Series Analysis (John Wiley & Sons, Inc, Hoboken).
Kelsall JE, Samet JM, Zeger SL, Xu J. (1997) Air pollution and mortality in Philadelphia, 1974-1988. American Journal of Epidemiology 146:750-62.
Koton S, Tanne D, Bornstein NM, Green MS. (2004) Triggering risk factors for ischemic stroke: a case-crossover study. Neurology 63:2006-10.
Levy D, Lumley T, Sheppard L, Kaufman J, Checkoway H. (2001) Referent selection in case-crossover analyses of acute health effects of air pollution. Epidemiology 12:186-92.
Liang KY and Zeger SL. (1986) Longitudinal data-analysis using generalized linear-models. Biometrika 73:13-22.
Lumley T and Levy D. (2000) Bias in the case-crossover design: implications for studies of air pollution. Environmetrics 11:689-704.
Maclure M. (1991) The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology 133:144-53.
McCullagh P and Nelder JA. (1989) Generalized Linear Models 2nd edition (Chapman & Hall, London).
Mittleman MA. (2005) Optimal referent selection strategies in case-crossover studies: a settled issue. Epidemiology 16:715-6.
Navidi W. (1998) Bidirectional case-crossover designs for exposures with time trends. Biometrics 54:596-605.
Polevoi SK, Quinn JV, Kramer NR. (2005) Factors associated with patients who leave without being seen. Academic Emergency Medicine 12:232-6.
Schneider MF, Gange SJ, Margolick JB, Detels R, Chmiel JS, Rinaldo C, Armenian HK. (2005) Application of case-crossover and case-time-control study designs in analyses of time-varying predictors of T-cell homeostasis failure. Annals of Epidemiology 15:137-44.
Wellenius GA, Bateson TF, Mittleman MA, Schwartz J. (2005) Particulate air pollution and the rate of hospitalization for congestive heart failure among Medicare beneficiaries in Pittsburgh, Pennsylvania. American Journal of Epidemiology 161:1030-6.
White H. (1982) Maximum likelihood estimation of misspecified models. Econometrica 50:1-26.
Zanobetti A and Schwartz J. (2005) The effect of particulate air pollution on emergency admissions for myocardial infarction: a multicity case-crossover analysis. Environmental Health Perspectives 113:978-82.
Zeger SL. (1988) A regression model for time series of counts. Biometrika 75:621-9.
Zeger SL, Irizarry RA, Peng RD. (2006) On time series analysis of public health and biomedical data. Annual Review of Public Health 27:57-79.
Received March 11, 2006; revised May 31, 2006; revised June 15, 2006; revised June 21, 2006; accepted for publication June 27, 2006.