Biostatistics Advance Access originally published online on March 23, 2007
Biostatistics 2007 8(4):772-783; doi:10.1093/biostatistics/kxm004
© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

The effect of miss-specified baseline characteristics on inference for longitudinal trends in linear mixed models

Geert Verbeke* and Steffen Fieuws

Biostatistical Centre, Katholieke Universiteit Leuven, U.Z. St.-Rafaël. Kapucijnenvoer 35, B-3000 Leuven, Belgium geert.verbeke{at}med.kuleuven.be

* To whom correspondence should be addressed.


    SUMMARY
The main advantage of longitudinal studies is that they can distinguish changes over time within individuals (longitudinal effects) from differences among subjects at the start of the study (baseline characteristics, cross-sectional effects). Often, especially in observational studies, longitudinal trends are studied after correction for many potentially important baseline differences between subjects. We show that, in the context of linear mixed models, inference for longitudinal trends is in general biased if a wrong model for the baseline characteristics is used. However, we will argue that this bias is small in most practical situations and completely vanishes in the special case of a growth curve model for complete balanced data. In the latter case, inference for longitudinal trends is completely independent of additional baseline covariates that might have been omitted from the model.

Keywords: Baseline characteristics; Growth curve model; Linear mixed model; Longitudinal data; Longitudinal trends


    1. INTRODUCTION
In medical science, studies are often designed to investigate the changes in a specific parameter which is measured repeatedly over time in the participating subjects. Such studies are in contrast to cross-sectional studies where the response of interest is measured only once for each individual. As pointed out by Diggle and others (2002, Section 1.4), one advantage of longitudinal studies is that they can distinguish changes over time within individuals (longitudinal effects) from differences among people in their baseline values (cross-sectional effects).

Often, especially in observational studies, a lot of baseline variability between subjects is observed, and one wishes to study longitudinal trends, correcting for important baseline covariates that explain (a part of) this variability. For example, Diggle and others (2002, Section 9.3) used longitudinal data on 250 children to investigate the evolution of the risk for respiratory infection and its relation to vitamin A deficiency, adjusting for factors like gender, season, and age at entry in the study. Brant and others (1992) analyzed repeated measures of systolic blood pressure from 955 healthy males. Their models included cross-sectional linear and quadratic effects for age at first visit, as well as the factors obesity and birth cohort.

Note that the longitudinal trends are often the parameters of interest, whereas the parameters modeling the baseline differences can often be viewed as a nuisance. It is therefore important to investigate to what extent inferences for the longitudinal effects are influenced by the inclusion of baseline covariates. In this paper, we study this problem in the context of the linear mixed model. Because many commercial software packages now make such models easy to fit, linear mixed models are widely applied for the analysis of continuous longitudinal data. In Section 2, the general model will be introduced, and an expression will be derived for the bias in the longitudinal inference if an important baseline covariate has been omitted from the model. In Section 3, we focus on the special but important case of the growth curve model for complete balanced data. Afterward, in Section 4, the bias under the general model will be examined in detail, and arguments will be given for why this bias will be small in many practical applications. Finally, Section 5 presents a brief discussion of the main results.


    2. THE GENERAL LINEAR MIXED MODEL
Let $y_{ij}$ denote the response for subject $i$ measured at occasion $j$, $i = 1,\ldots,N$, $j = 1,\ldots,n_i$. Further, $y_i = (y_{i1},\ldots,y_{in_i})'$ denotes the vector of all repeated measurements for subject $i$. The linear mixed model (Laird and Ware, 1982) assumes that $y_i$ satisfies

$$y_i = X_i\beta + Z_i b_i + \varepsilon_i, \tag{2.1}$$

in which $X_i$ and $Z_i$ are $(n_i \times p)$ and $(n_i \times q)$ matrices of known covariates, $\beta$ is a $p$-dimensional vector of population-average regression coefficients called fixed effects, and $b_i$ is a $q$-dimensional vector of subject-specific regression coefficients, assumed to be normally distributed with mean vector $0$ and covariance matrix $D$. The residual components $\varepsilon_i$ are assumed to be independent $N(0, \sigma^2 I_{n_i})$, where $I_{n_i}$ is the identity matrix of dimension $n_i$, and independent of the random effects $b_i$. Finally, let $\alpha$ denote the vector of variance components, that is, $\sigma^2$ as well as all elements in $D$.

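Marginally, $y_i \sim N(X_i\beta, V_i)$ with $V_i(\alpha) = Z_i D Z_i' + \sigma^2 I_{n_i}$, where it is explicitly denoted that $V_i$ depends on the variance components in $\alpha$. Conditionally on $\alpha$, the maximum likelihood (ML) estimator for $\beta$ and its covariance matrix are

$$\hat\beta(\alpha) = \Big(\sum_{i=1}^{N} X_i' W_i X_i\Big)^{-1} \sum_{i=1}^{N} X_i' W_i y_i, \qquad \mathrm{var}(\hat\beta) = \Big(\sum_{i=1}^{N} X_i' W_i X_i\Big)^{-1}, \tag{2.2}$$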

where $W_i$ equals $V_i^{-1}$. Inference immediately follows from the normality of $\hat\beta$. In practice, $\alpha$ is not known and is replaced in (2.2) by its ML or restricted maximum likelihood (REML) estimator, and this additional uncertainty is accounted for by replacing the normal sampling distribution for $\hat\beta$ by an appropriate t distribution. Full details on estimation and inference under the linear mixed model are given in Verbeke and Molenberghs (2000).
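As a minimal sketch of the estimator just described, the following NumPy code computes the weighted least-squares form $\hat\beta = (\sum_i X_i' W_i X_i)^{-1} \sum_i X_i' W_i y_i$ with $W_i = V_i^{-1}$ for known variance components. The function name `gls_fixed_effects` and the toy design (four subjects, random intercepts and slopes, noise-free responses) are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def gls_fixed_effects(X_list, Z_list, y_list, D, sigma2):
    """Fixed-effects estimate and covariance for known variance components."""
    p = X_list[0].shape[1]
    XtWX = np.zeros((p, p))
    XtWy = np.zeros(p)
    for X, Z, y in zip(X_list, Z_list, y_list):
        V = Z @ D @ Z.T + sigma2 * np.eye(len(y))  # marginal covariance V_i
        W = np.linalg.inv(V)                       # weight matrix W_i
        XtWX += X.T @ W @ X
        XtWy += X.T @ W @ y
    cov = np.linalg.inv(XtWX)
    return cov @ XtWy, cov

# Hypothetical unbalanced toy data: the responses are noise-free,
# so the estimator must reproduce beta_true exactly.
beta_true = np.array([2.0, -0.5])
X_list, Z_list, y_list = [], [], []
for n_i in (3, 5, 4, 6):
    t = np.arange(n_i, dtype=float)
    X = np.column_stack([np.ones(n_i), t])  # intercept and time
    X_list.append(X)
    Z_list.append(X.copy())                 # random intercept and slope
    y_list.append(X @ beta_true)

D = np.array([[4.0, 0.5], [0.5, 1.0]])
beta_hat, cov_beta = gls_fixed_effects(X_list, Z_list, y_list, D, sigma2=1.0)
```

In practice the variance components would be replaced by ML or REML estimates obtained from mixed-model software; the sketch only covers the conditional step.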

In practice, $X_i$ is usually of the form $X_i = [X_i^C | X_i^L]$, where $X_i^C = 1_{n_i} a_i'$ for some $s$-dimensional vector $a_i$ with baseline characteristics for subject $i$ and $1_{n_i}$ denotes the $n_i$-dimensional vector containing only ones. Similarly, $\beta$ can be subdivided as $\beta' = [\beta_C', \beta_L']$, where $\beta_C$ consists of cross-sectional effects modeling the baseline differences between subjects and $\beta_L$ contains the longitudinal effects modeling evolutions over time. A similar decomposition holds for $Z_i = [1_{n_i} | Z_i^L]$, where the first column of ones corresponds to the subject-specific intercepts, while $Z_i^L$ corresponds to subject-specific longitudinal effects.

The aim of this paper is to study the effect of omitting a cross-sectional covariate $X_i^* = c_i 1_{n_i}$ for a series of known constants $c_i$, $i = 1,\ldots,N$, on the estimation and inference for $\beta_L$. Suppose that the correct linear mixed model for our data is (2.1), extended with the additional cross-sectional component $X_i^*\beta^*$. However, the model fitted is (2.1). How does this affect results for $\beta_L$? The effect can be twofold. First, conditionally on $\alpha$, omission of $X_i^*$ can lead to biased estimates or biased estimated standard errors. Second, there could be an indirect effect through the estimation of $\alpha$, estimates of which are plugged into the expressions in (2.2).

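Conditionally on $\alpha$, the bias in $\hat\beta$ can easily be seen to be

$$E(\hat\beta) - \beta = \Big(\sum_{i=1}^{N} X_i' W_i X_i\Big)^{-1} \sum_{i=1}^{N} X_i' W_i X_i^* \beta^*, \tag{2.3}$$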

while the original expression remains for the covariance matrix. The indirect effect through the estimation of $\alpha$ is less easy to derive since, in general, no analytic expressions exist for $\hat\alpha$. Note, however, that this can potentially affect the bias (2.3) as well as standard errors obtained from (2.2).
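The conditional bias can be checked numerically. Because the estimator is linear in the responses, applying it to the expected response under the true model, $X_i\beta + X_i^*\beta^*$, and subtracting $\beta$ should reproduce the closed-form bias $(\sum_i X_i' W_i X_i)^{-1} \sum_i X_i' W_i X_i^* \beta^*$. The sketch below assumes a random-intercepts model with hypothetical parameter values and an unbalanced design; none of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d11, sigma2 = 4.0, 1.0          # assumed variance components
beta = np.array([10.0, 1.5])    # assumed intercept and time slope
beta_star = 0.8                 # assumed effect of the omitted covariate

XtWX = np.zeros((2, 2))
XtWx_star = np.zeros(2)
XtWmean = np.zeros(2)
for _ in range(50):
    n_i = rng.integers(2, 8)                    # unbalanced: 2 to 7 visits
    c_i = rng.normal(50.0, 10.0)                # omitted baseline covariate
    t = np.arange(1, n_i + 1, dtype=float)
    X = np.column_stack([np.ones(n_i), t])
    x_star = np.full(n_i, c_i)
    V = d11 * np.ones((n_i, n_i)) + sigma2 * np.eye(n_i)
    W = np.linalg.inv(V)
    XtWX += X.T @ W @ X
    XtWx_star += X.T @ W @ x_star
    XtWmean += X.T @ W @ (X @ beta + beta_star * x_star)

# Closed-form conditional bias versus estimator applied to the true mean.
bias_formula = np.linalg.solve(XtWX, XtWx_star) * beta_star
bias_direct = np.linalg.solve(XtWX, XtWmean) - beta
```

The second element of `bias_formula` is the bias in the longitudinal (time) effect, which is the quantity of interest in the remainder of the paper.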

In Section 3, we will study the effect of omitting $X_i^*$ on the bias and the standard errors for the special but important case of the growth curve model for complete and balanced data, where analytic expressions can be found for $\hat\alpha$.


    3. THE GROWTH CURVE MODEL FOR COMPLETE BALANCED DATA
Consider the case where a fixed number of observations $n$ is obtained for all subjects at fixed time points. Further, we assume a growth curve model as discussed by Laird and others (1987) and Lange and Laird (1989), that is, $X_i = X \otimes a_i' = [1_n | T] \otimes a_i'$, where $1_n$ is the $n$-dimensional vector of ones, $T$ is the $(n \times (r-1))$ design matrix modeling the slopes, and $a_i$ is again the vector with $s$ subject characteristics used to explain variability between subjects in intercepts and slopes. Typically, the $r-1$ columns in $T$ consist of hierarchical orders of polynomials of the time points at which the $n$ measurements have been taken. However, more general models can be considered as well, for example, piecewise linear models. Without loss of generality, we assume that the columns of $T$ are centered, that is, $T'1_n = 0$. Note that the assumption of complete balanced data implies that centering a specific column of $T$ only requires subtracting a fixed constant from all elements in that column, while the same constant applies to all subjects in the data set. This affects the cross-sectional components but leaves the regression coefficients corresponding to the time-varying covariates $T \otimes a_i'$ unchanged.

The growth curve model assumes that the same covariates are used to model differences in the intercepts as well as differences in time effects. Furthermore, it is assumed that $Z_i = Z = [1_n, Z_L]$ for all $i$, where the columns in $Z_L$ are the first $q-1$ columns of $T$, $q \le r$. This automatically leads to a so-called well-formulated model (Morrell and others, 1997) which does not include any polynomial effects or interactions unless all hierarchically lower-order terms have been included as well. Finally, $\beta_C$ is $s$-dimensional, and $\beta_L$ is $(r-1)s$-dimensional and is obtained from stacking the $(r-1)$ $s$-dimensional fixed effects modeling linear, quadratic, cubic, ... evolutions underneath each other.

Let $Y$, $B$, $E$, and $A$ be the $(n \times N)$ matrix, the $(q \times N)$ matrix, the $(n \times N)$ matrix, and the $(s \times N)$ matrix with $y_i$, $b_i$, $\varepsilon_i$, and $a_i$ as columns, respectively. Finally, let $\psi$ be the $(r \times s)$ matrix such that $\mathrm{vec}(\psi') = \beta$, where the "vec" operator stacks the columns of its argument one underneath the other. Note that the original cross-sectional effects $\beta_C$ are then in the first row of $\psi$, while the elements of $\beta_L$ are in the subsequent rows. The original model (2.1) can then be rewritten as

$$Y = X\psi A + ZB + E.$$
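The matrix form rests on the identity $(X \otimes a_i')\,\mathrm{vec}(\psi') = X\psi a_i$, which gives the mean of subject $i$ as the $i$th column of $X\psi A$. A quick numerical check, with arbitrary (assumed) dimensions and random values for $\psi$ and $a_i$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, s = 4, 3, 2
t = np.array([1.0, 2.0, 3.0, 4.0])

# Centered linear and quadratic time columns, so that T'1_n = 0.
T = np.column_stack([t - t.mean(), t**2 - (t**2).mean()])  # n x (r - 1)
X = np.column_stack([np.ones(n), T])                       # n x r

psi = rng.normal(size=(r, s))   # row 1: cross-sectional, rows 2..r: longitudinal
a_i = rng.normal(size=s)        # baseline characteristics of one subject

beta = psi.reshape(-1)                 # vec(psi'): rows of psi stacked
X_i = np.kron(X, a_i.reshape(1, -1))   # X (kron) a_i', an n x (r s) matrix

mean_stacked = X_i @ beta       # X_i beta, as in model (2.1)
mean_matrix = X @ psi @ a_i     # column i of X psi A
```

Both vectors agree, and the first $s$ entries of `beta` are the cross-sectional effects, matching the statement that $\beta_C$ sits in the first row of $\psi$.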

Lange and Laird (1989, Section 5.1) have then shown the following expressions for the (restricted) ML estimators for $\psi$, $D$, and $\sigma^2$ as well as for the covariance matrix of the estimator of $\psi$:

$$\hat\psi = (X'X)^{-1} X' Y A' (AA')^{-1}, \tag{3.4}$$


Formula (3.5)


Formula (3.6)


Formula (3.7)

where the constants $\xi$ and $\zeta$ equal $N(n-q)$ and $N$, respectively, in case of ML estimation and are equal to $N(n-q) - s(r-q)$ and $N-s$, respectively, in case of REML estimation, and where $J_q$ equals the $(r \times q)$ matrix $[I_q | 0]'$.

If the full model, with the additional cross-sectional covariate $X_i^* = c_i 1_{n_i}$, were fitted, the same expressions (3.4)–(3.6) would remain valid, but with $Y$ replaced by $Y - 1_n c'\beta^*$, with $c$ the $N$-dimensional vector with baseline characteristics $c_i$, and with $\beta^*$ replaced by its estimator $\hat\beta^*$. In Section 3.1, we will show that, conditionally on the variance components $D$ and $\sigma^2$, fitting this full model does not affect the estimators for the longitudinal components in $\psi$ and hence also yields the same standard errors. Hence, any effect of omitting $X_i^*$ from the model is an indirect effect, through the estimation of the variance components $D$ and $\sigma^2$. We handle this in Section 3.2.

3.1 Variance components known

When the full model is fitted, it follows from (3.4) that the estimator for the longitudinal effects in $\psi$ is

$$(T'T)^{-1} T' (Y - 1_n c'\beta^*) A' (AA')^{-1}, \tag{3.8}$$

with $\beta^*$ replaced by its estimator $\hat\beta^*$. Since $T'1_n = 0$, we have that (3.8) is independent of $\beta^*$, showing that the same expression would be obtained when fitting the model without the additional covariate $X_i^*$, and hence also the same standard errors (conditionally on $D$ and $\sigma^2$).
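The invariance can be illustrated numerically. The sketch below simulates complete balanced data at time points 1, 5, and 12 with a centered linear time effect, then computes the weighted least-squares fit for known $D$ and $\sigma^2$ twice: once omitting the baseline covariate $c_i$ and once including it. All parameter values, the group indicator used as $a_i$, and the helper `gls` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, s = 30, 2
t = np.array([1.0, 5.0, 12.0])
n = len(t)
T = (t - t.mean()).reshape(-1, 1)      # centered linear time, r = 2
X = np.column_stack([np.ones(n), T])
Z = X.copy()                           # random intercepts and slopes, q = 2
D = np.array([[9.0, 0.3], [0.3, 0.2]])
sigma2 = 2.5
W = np.linalg.inv(Z @ D @ Z.T + sigma2 * np.eye(n))  # same for all subjects
beta = np.array([15.0, -2.0, 0.3, -0.4])             # [beta_C, beta_L]
beta_star = 0.8

def gls(designs, responses):
    """Weighted least squares with the common known weight matrix W."""
    XtWX = sum(Xi.T @ W @ Xi for Xi in designs)
    XtWy = sum(Xi.T @ W @ yi for Xi, yi in zip(designs, responses))
    return np.linalg.solve(XtWX, XtWy)

reduced, full, ys = [], [], []
for i in range(N):
    a_i = np.array([1.0, float(i % 2)])      # intercept and group indicator
    c_i = rng.normal(50.0, 10.0)             # baseline covariate to be omitted
    Xi = np.kron(X, a_i.reshape(1, -1))      # columns: s cross-sectional, s longitudinal
    b_i = rng.multivariate_normal(np.zeros(2), D)
    y = Xi @ beta + beta_star * c_i + Z @ b_i + rng.normal(0.0, np.sqrt(sigma2), n)
    reduced.append(Xi)
    full.append(np.column_stack([Xi, np.full(n, c_i)]))
    ys.append(y)

coef_reduced = gls(reduced, ys)
coef_full = gls(full, ys)
slopes_reduced = coef_reduced[s:2 * s]   # longitudinal effects, covariate omitted
slopes_full = coef_full[s:2 * s]         # longitudinal effects, covariate included
```

The two slope vectors coincide up to rounding error, whereas the cross-sectional coefficients change substantially when $c_i$ is left out.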

3.2 Variance components not known

In practice, $D$ and $\sigma^2$ are not known and need to be replaced by ML or REML estimators, which might be influenced by whether or not the additional cross-sectional covariate is included in the model. Note that this would not affect $\hat\beta_L$ since (3.8) is independent of the variance components. Further, it follows from (3.7) that

Formula

in which $D_{22}$ is the covariance matrix of the random slopes, that is, the matrix obtained from omitting the first column and row of $D$. We will now show that $\hat\sigma^2$ and $\hat D_{22}$ do not depend on whether or not $X_i^*$ is included in the fitted model, yielding no effect at all on the estimation or on the inference for the longitudinal effects in $\beta_L$.

When fitting the full model, it follows from (3.5) and (3.6) that

Formula

again with $\beta^*$ replaced by $\hat\beta^*$. However, since $(I_n - Z(Z'Z)^{-1}Z')1_n = 0$ and $Z_L'1_n = 0$, the estimators for $\sigma^2$ and $D_{22}$ are independent of whether or not $X_i^*$ is included in the fitted model.
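The two identities used in this argument are easy to verify. The sketch below checks them for the time points 1, 5, and 12 with a centered linear random slope, and shows that quantities depending on $Y$ only through $(I_n - Z(Z'Z)^{-1}Z')Y$ or $Z_L'Y$ are unchanged when $Y$ is shifted by $1_n c'\beta^*$. The data matrix and the value of $\beta^*$ are arbitrary assumptions, and the two summaries are stand-ins rather than the estimators themselves.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.array([1.0, 5.0, 12.0])
n, N = len(t), 6
ones = np.ones(n)
Z_L = (t - t.mean()).reshape(-1, 1)     # centered, so Z_L' 1_n = 0
Z = np.column_stack([ones, Z_L])
M = np.eye(n) - Z @ np.linalg.inv(Z.T @ Z) @ Z.T   # projects out the columns of Z

Y = rng.normal(size=(n, N))             # arbitrary response matrix
c = rng.normal(50.0, 10.0, size=N)      # omitted baseline covariate values
beta_star = 0.8
Y_shifted = Y - beta_star * np.outer(ones, c)      # Y - 1_n c' beta*

# Summaries that depend on Y only through M Y or Z_L' Y.
resid_ss = np.trace(Y.T @ M @ Y)
resid_ss_shifted = np.trace(Y_shifted.T @ M @ Y_shifted)
slope_part = Z_L.T @ Y
slope_part_shifted = Z_L.T @ Y_shifted
```

Since the shift lies entirely in the direction of $1_n$, it is annihilated by both $M$ and $Z_L'$.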

3.3 Example: activities of daily living data

So far, we can conclude that in the case of complete balanced data, if a growth curve model is fitted, the estimation as well as inference for the longitudinal fixed effects in $\beta_L$ are not affected by the possible omission of cross-sectional covariates, whether or not these affect the mean response. As an illustration, we consider the analysis of data that have been collected as a part of a doctoral research project in the Centre for Health Services and Nursing Research of the Katholieke Universiteit Leuven in Belgium. Details about the project can be found in Milisen (1999) and Milisen and others (1998), and the data are available as supplementary material at Biostatistics online, http://www.biostatistics.oxfordjournals.org. The aim of this part of the project was to study the relation between the postoperative evolution of the functional status of elderly hip fracture patients and their preoperative neuropsychiatric status. The functional status was measured 1 day, 5 days, and 12 days after the surgery, using the Katz index of activities of daily living (ADL; Katz and Akpom, 1976; Brorsson and Äsberg, 1984; Törnquist and others, 1990), which is an ordinal score ranging from 6 to 24, with high ADL values indicating strong dependence of the patient on others. In total, we have data on 54 patients, 17 of whom were classified as neuropsychiatric prior to the surgery. An extensive analysis of the ADL data is given by Verbeke and others (2006), where several techniques for baseline correction are illustrated and compared. Here, we will focus on one specific model to illustrate the effect of the omission of important baseline covariates. Let $Y$ denote the ADL response taken at time points $t = 1, 5, 12$. We then use the following linear mixed model to compare the postoperative average evolution in ADL between the two neuro groups:

Formula

We first focus on the 35 patients for whom all 3 scheduled ADL measurements have been taken. The estimation and inference results for the fixed slopes $\beta_{11}$ and $\beta_{21}$ are shown in the top part of Table 1. We obtain a significant difference in the average evolutions over time (p = 0.0018), with no significant time trend for the neuropsychiatric patients and a significant average improvement for the patients who were not neuropsychiatric preoperatively. The variance of the random intercepts $b_{0i}$ was estimated as 9.1989, which is large compared to the within-subject error variability $\sigma^2$, estimated as 2.4494. We therefore repeated the analysis, now correcting for baseline differences, which explain a large amount of the variability present at baseline. Corrections were made for the age of the patient and the preoperative housing situation of the patient, where housing situation can be (1) on his/her own, (2) with a partner or family, or (3) in a nursing home. The results for the fixed longitudinal effects are also summarized in Table 1. The corrections for age and housing situation were significant (p = 0.0032 and p = 0.0477, respectively). However, as predicted by the general theory developed earlier, exactly the same inferences as under the original model are obtained for $\beta_{11}$ and $\beta_{21}$.


Table 1. ADL data: REML estimates, associated standard errors, and Wald tests for fixed longitudinal effects, with and without correction for baseline characteristics, based on only those subjects for which all 3 scheduled ADL measurements are available as well as based on all subjects

 
We repeated the analyses after adding the 19 subjects for whom only the first 2 measurements are available. The results are shown in the bottom part of Table 1. As expected, we obtain different, more efficient results when compared to our initial analyses. But we now also obtain slight differences when we compare the corrected and uncorrected inferences for $\beta_{11}$ and $\beta_{21}$.

Apart from a formal test for significance of the baseline covariates "age" and "housing situation," an overall indication for goodness-of-fit, such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC), would be helpful. However, standard model fitting in linear mixed models is based on REML estimation, which implies that likelihoods and statistics derived from them, such as AIC and BIC, are no longer comparable between models with different mean structures (Verbeke and Molenberghs, 2000, Section 6.4).


    4. THE BIAS IN THE UNBALANCED CASE
In general, omission of a cross-sectional covariate will lead to biased estimates for the longitudinal effects in the linear mixed model. Conditionally on the variance components in $\alpha$, the bias is given by (2.3). In practice, however, estimation of the variance components in (2.3) can also be affected by the omission of the covariate, and it is not clear what this implies for the bias of the longitudinal effects. In Section 4.1, we will therefore perform a small-scale simulation study based on a large, highly unbalanced data set. Afterward, in Section 4.2, theoretical arguments will be given for the findings from the simulations.

4.1 Example: hearing data

As an example of the effect of the assumed cross-sectional model on the estimation of longitudinal trends, we will analyze data from the Baltimore Longitudinal Study of Aging (BLSA), which is an ongoing multidisciplinary study begun in 1958 to study normal human aging (Shock and others, 1984). Participants in the BLSA return approximately every 2 years for 3 days of biomedical and psychological examinations. Among many other variables, hearing threshold sound pressure levels (in decibels) were measured at 11 different frequencies (varying from 125 to 8000 Hz) on both ears, yielding a maximum of 22 observations per visit. This was done by means of a soundproof chamber and a Bekesy audiometer. Previous analyses of these hearing thresholds can be found in Brant and Fozard (1990), Morrell and Brant (1991), Pearson and others (1995), Verbeke and others (2001), and Verbeke and Molenberghs (2000, Chapter 13).

For our purposes, we now consider all available hearing thresholds for 500 Hz from male BLSA participants without otologic disease, unilateral hearing loss, or evidence of noise-induced hearing loss, and we restrict our analysis to measurements for the left ear only. In total, we have 3089 observations from 680 males, some of whom were followed for over 22 years. The number of observations per subject varies from 1 to 15. Following Verbeke and others (2001), we use the following linear mixed model:

Formula (4.9)

in which $t_{ij}$ is the time point (in decades from entry in the study) at which the $j$th measurement is taken for the $i$th subject, and where $\mathrm{Age}_{i1}$ is the age (in decades) of the subject at entry in the study. Pearson and others (1995) found evidence for the presence of a learning effect from the first visit to subsequent visits. This is taken into account by the extra time-varying covariate $\mathrm{Visit1}_{ij}$, defined to be one at the first measurement and zero for all other visits. Finally, the $b_{i1}$, $b_{i2}$, and $\varepsilon_{ij}$ are the classical random intercepts, slopes, and error components, respectively. Table 2 shows the REML estimates and the associated standard errors for all parameters in the marginal model corresponding to model (4.9).


Table 2. Hearing data: (a) REML estimates (standard errors) for the parameters in the marginal linear mixed model corresponding to (4.9), with and without inclusion of a cross-sectional quadratic effect of age. (b) Bias and MSE obtained from the analyses of 1000 data sets simulated from model (4.9), with 2 different values for $\beta_3$ and analyzed without the cross-sectional quadratic age effect

 
To study the effect of misspecifying the cross-sectional model on the estimation of the longitudinal model, we refitted model (4.9), not including the cross-sectional quadratic age effect. The results are also shown in Table 2. We conclude that removing cross-sectional terms from the model inflates the random-intercepts variance $d_{11}$, but the estimates of the average longitudinal trends ($\beta_4$, $\beta_5$, and $\beta_6$) are only slightly affected.

In order to investigate whether this holds more generally, we simulated 1000 data sets from model (4.9) with the true parameter values given by the REML estimates obtained from the original data set (first column of results in Table 2). Each data set was then analyzed using the correct model but with the quadratic age effect left out (i.e. restricting $\beta_3$ to be zero). The bias and the mean squared error (MSE) are also reported in Table 2. We find substantial biases in the cross-sectional effects and variance components related to the cross-sectional between-subject differences, but very little bias in the longitudinal effects and associated variance components. Note that the bias in the cross-sectional effects is a direct consequence of the lack of orthogonality between the omitted covariate and the other cross-sectional covariates in the model. Similar results are known for linear models for independent data (e.g. Winer, 1970; Ramsey, 1969). Further, removing baseline covariates from the model increases the amount of unexplained baseline variability between subjects. Hence, omitting an important cross-sectional covariate increases the random-intercepts variance $d_{11}$.

Expression (2.3) suggests that bias will depend on the magnitude of $\beta_3$, that is, on how important the omitted covariate is. We therefore repeated the simulation, still sampling from model (4.9), but now with $\beta_3 = 35$, that is, 100 times larger than the initial value of 0.35. The results are again shown in Table 2. While we get huge biases for the parameters that express baseline differences between subjects, we still get minor effects for the parameters related to longitudinal effects. In Section 4.2, theoretical arguments will be given for why such small biases in longitudinal effects are to be expected.

4.2 Small bias

Although omission of important cross-sectional covariates will yield biased inferences for longitudinal effects when dealing with unbalanced data, one can show that this bias will, in most cases, be small. Indeed, it follows from (2.3) that the bias in $\hat\beta_L$ is of the form $\hat\gamma_L\beta^*$, in which $\hat\gamma_L$ is obtained as follows. Consider the linear model which regresses the omitted covariates $X_i^*$ on the covariates that were included in the original fitted model, that is, $X_i^* = X_i^C\gamma_C + X_i^L\gamma_L + e_i$. We then have that $\hat\gamma_L$ is the weighted least squares estimator for $\gamma_L$, with weight matrices $W_i$ equal to the inverse fitted covariance matrices, obtained from fitting the original incorrect model (2.1).

Since the $X_i^*$ contain cross-sectional covariates and the $X_i^L$ contain only time-varying covariates, $\gamma_L$ equals zero. It follows from the theory of generalized estimating equations (GEEs; Liang and Zeger, 1986) that our weighted least squares estimator $\hat\gamma_L$ is unbiased and consistent for $\gamma_L$ and therefore will converge to zero for increasing sample sizes. The key assumption in order for GEE to be applicable is that the time points at which measurements have been taken are completely unrelated to the omitted covariates $X_i^*$. In our analysis of the hearing data, this would not be satisfied if, for example, older subjects were followed for shorter periods of time than younger study participants. The weighted least squares estimator $\hat\gamma_L$ is then no longer unbiased nor consistent, unless the inverse weights are the correct covariance matrices of the omitted covariates $X_i^*$, which is very unlikely to be the case. Hence, $\hat\gamma_L$ is then not necessarily close to zero, even in large samples.

Fortunately, whenever the omitted covariates $X_i^*$ explain a lot of variability in the response values $y_i$ (i.e. whenever $|\beta^*|$ is relatively large), we still have that $\hat\gamma_L$ will be small, even in situations where there is a relation between the time points at which measurements have been taken and the omitted covariates. Indeed, if the $X_i^*$ explain a lot of variability, omitting them from the model will yield fitted covariance matrices $V_i$ close to compound symmetry, that is, of the form $V_i \approx d_{11} 1_{n_i} 1_{n_i}' + \sigma^2 I_{n_i}$, but with large $d_{11}$, relative to $\sigma^2$. Note that this was already illustrated in our simulation results summarized in Table 2. This implies that the weight matrices in the calculation of $\hat\gamma_L$ are approximately given by $W_i = (I_{n_i} - 1_{n_i}(1_{n_i}'1_{n_i})^{-1}1_{n_i}')/\sigma^2$, which, apart from a constant, equals the projection matrix orthogonal to $1_{n_i}$. We then have that Formula approximately satisfies

Formula

implying that Formula approximately satisfies

Formula

Since $W_i X_i^L$ is the vector of centered longitudinal covariates for the $i$th subject, it does not equal zero. It follows that $\hat\gamma_L \approx 0$, resulting in a small bias for the longitudinal effects.
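The approximation of the weight matrices can be made concrete. For a compound-symmetric $V_i = d_{11} 1_{n_i} 1_{n_i}' + \sigma^2 I_{n_i}$, the sketch below compares $V_i^{-1}$ with the scaled centering projection $(I_{n_i} - 1_{n_i}1_{n_i}'/n_i)/\sigma^2$ for increasing $d_{11}$. The helper name `cs_weight` and the chosen values of $n_i$ and $\sigma^2$ are illustrative assumptions.

```python
import numpy as np

def cs_weight(n_i, d11, sigma2):
    """Inverse of a compound-symmetric covariance matrix."""
    V = d11 * np.ones((n_i, n_i)) + sigma2 * np.eye(n_i)
    return np.linalg.inv(V)

n_i, sigma2 = 5, 5.0
# Projection orthogonal to the vector of ones, scaled by 1 / sigma2.
P = (np.eye(n_i) - np.ones((n_i, n_i)) / n_i) / sigma2

# Largest elementwise gap between the exact weight and the projection.
gaps = {d11: np.abs(cs_weight(n_i, d11, sigma2) - P).max()
        for d11 in (1.0, 10.0, 100.0, 1e4)}
```

The gap equals $1/\{n_i(\sigma^2 + n_i d_{11})\}$ in every entry, so it shrinks toward zero as the random-intercepts variance grows relative to $\sigma^2$.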

To illustrate these theoretical arguments, we have calculated $\hat\gamma_L$ for a variety of situations. The context is a linear mixed model with random intercepts and linear time effects, that is, $Z_i = 1_{n_i}$, $X_i^C = 1_{n_i}$, and $X_i^L$ is a column of time points at which measurements have been taken. The omitted cross-sectional covariates $X_i^* = c_i 1_{n_i}$ are obtained from sampling the $N$ (equal to the number of subjects) values $c_i$ according to $c_i \sim N(50, 10^2)$. Measurements are available at time points $t_{ij} = 1,\ldots,n_i$, where $n_i$ is the integer closest to $\kappa c_i/5 + (1-\kappa)\delta$, for $\delta$ sampled from a uniform distribution between 0 and 15. When this would yield a value $n_i < 1$, we replace it by $n_i = 1$. Finally, $\kappa$ is a prespecified value on the $[0,1]$ interval. For $\kappa = 0$, we have that the time points at which measurements are taken are completely independent of the omitted covariate $X_i^*$. When $\kappa$ equals 1, we have a deterministic relation between the omitted covariate and the time points at which measurements are taken. The residual variance $\sigma^2$ was set equal to 5, and the following scenarios were used: $N = 10, 20, \ldots, 1000$, $d_{11} = 1, 5, 10, 100$, and $\kappa = 0, 0.5, 1$. The results are shown in Figure 1.
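A calculation of this kind can be sketched as follows. The function `gamma_L_hat` is a hypothetical helper that draws one design under the stated scheme and returns the weighted least-squares coefficient of the time column when $c_i 1_{n_i}$ is regressed on an intercept and time, using compound-symmetric weights. The seed, the single draw per scenario, and the reduced grid of scenarios are simplifying assumptions and do not reproduce Figure 1.

```python
import numpy as np

def gamma_L_hat(N, d11, kappa, sigma2=5.0, seed=0):
    """Time coefficient from regressing c_i 1_{n_i} on [1, t] with weights V_i^{-1}."""
    rng = np.random.default_rng(seed)
    c = rng.normal(50.0, 10.0, size=N)          # omitted covariate values
    delta = rng.uniform(0.0, 15.0, size=N)
    n = np.rint(kappa * c / 5.0 + (1.0 - kappa) * delta).astype(int)
    n = np.maximum(n, 1)                        # at least one measurement
    XtWX = np.zeros((2, 2))
    XtWc = np.zeros(2)
    for c_i, n_i in zip(c, n):
        t = np.arange(1, n_i + 1, dtype=float)
        X = np.column_stack([np.ones(n_i), t])
        W = np.linalg.inv(d11 * np.ones((n_i, n_i)) + sigma2 * np.eye(n_i))
        XtWX += X.T @ W @ X
        XtWc += X.T @ W @ np.full(n_i, c_i)
    return np.linalg.solve(XtWX, XtWc)[1]

# One draw per scenario on a reduced grid of kappa and d11 values.
results = {(kappa, d11): gamma_L_hat(500, d11, kappa)
           for kappa in (0.0, 0.5, 1.0)
           for d11 in (1.0, 100.0)}
```

With $\kappa = 1$ the coefficient is clearly nonzero for $d_{11} = 1$ and much closer to zero for $d_{11} = 100$, in line with the argument that a large random-intercepts variance downweights the between-subject information that carries the bias.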


Fig. 1. Calculated bias $\hat\gamma_L$ in the case of (a) no relation, (b) a mild relation, and (c) a deterministic relation between the omitted covariate and the time points at which measurements are taken, for various sample sizes and for various random-intercepts variabilities.

 
As can be expected based on our earlier theoretical arguments, we have that the bias in longitudinal effects converges to zero for increasing sample sizes as long as the time points at which measurements are taken are completely independent of the omitted cross-sectional covariate values (Figure 1(a)). When this independence does not hold, we get biased estimates for the longitudinal effects, even in large samples, but this bias is smaller as the random-intercepts variability gets larger, that is, as the omitted covariates explain a lot of the variability between subjects at baseline.


    5. DISCUSSION
When analyzing longitudinal data, the primary interest is in studying how subjects evolve over time and what subject characteristics affect this evolution. Often, especially in observational studies, a lot of baseline variability is observed between subjects. In this paper, we have addressed the question of whether or not one should then aim at including baseline covariates to explain such differences between subjects at the beginning of the study. In general, omitting baseline covariates will imply biased inferences for the longitudinal time trends of interest, except in the special but important case of the growth curve model for complete balanced data. Orthogonality of effects in balanced linear regression or analysis of variance models is well-known (e.g. Winer, 1970; Ramsey, 1969). Several papers have also studied the effect of omitting covariates in more general models such as generalized linear models (Neuhaus and Jewell, 1993; Neuhaus, 1998) and nonlinear regression models (Gail and others, 1984). Here, we have considered the case of multivariate regression models with very specific covariance structures, that is, covariances implied by random-effects models, and we have focused on the inference as well as on estimation, but attention was restricted to longitudinal effects only. In linear models for balanced data, orthogonality properties often ensure that omission of important covariates does not affect the estimation of the other effects in the model, but this usually does not hold for the associated standard errors, due to the increase of the residual variance. In our context, we have shown that such an orthogonality argument also holds for the standard errors as long as only baseline covariates are omitted.

The effect of omitting covariates in linear models for correlated data has also been studied by Palta and Yao (1991). This setting was more general in the sense that the omitted covariates were not restricted to baseline covariates. However, focus was on bias and MSE, not on actual equality of the parameter estimates and standard errors. Also, their results only applied to models with compound symmetric covariance structures and were based on normality assumptions for the covariates. Both assumptions are unlikely to hold in many general longitudinal settings, such as those we have considered here.

In the case of unbalanced data, where the orthogonality properties no longer hold, we have developed theoretical arguments explaining why, in many applications, the bias resulting from omitting important baseline covariates can be expected to be rather small. Still, if one wishes to rule out completely any bias due to a misspecified cross-sectional component of a linear mixed model, conditional linear mixed models can be applied (Verbeke and others, 2001; Verbeke and Molenberghs, 2000, Chapter 13) to obtain inferences for longitudinal effects which are completely independent of the cross-sectional part of the model. Further, it should be emphasized that our results only apply in the context of linear mixed models for continuous data. Whenever nonlinear mixed models (Davidian and Giltinan, 1995; Vonesh and Chinchilli, 1997) or generalized linear mixed models (Pinheiro and Bates, 2000) are used, bias due to omitted cross-sectional covariates may be much more severe. See Chao and others (1997) for some results in the context of GEEs for binary correlated data.
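
The idea behind removing the cross-sectional component can be sketched for the simplest case. The code below is an illustrative assumption, not the conditional linear mixed model procedure of Verbeke and others (2001): it considers only a random-intercept model with independent residual errors, for which transforming each subject's data by contrasts orthogonal to the intercept amounts to centering time and response within each subject. The function name `within_subject_slope` and all simulated values are hypothetical.

```python
import numpy as np

def within_subject_slope(times_list, y_list):
    """Least-squares time slope after centering time and response within each subject."""
    num = 0.0
    den = 0.0
    for t, y in zip(times_list, y_list):
        t_c = t - t.mean()   # removes the subject-specific intercept column
        y_c = y - y.mean()   # removes every subject-constant contribution to y
        num += t_c @ y_c
        den += t_c @ t_c
    return num / den

rng = np.random.default_rng(1)
n_subjects = 40

# Unbalanced design: each simulated subject has between 2 and 5 measurement times.
times_list = [
    np.sort(rng.uniform(0.0, 5.0, size=rng.integers(2, 6)))
    for _ in range(n_subjects)
]
x = rng.normal(size=n_subjects)  # baseline covariate

# Responses with no baseline effect ...
y_list = [1.0 + 0.5 * t + rng.normal(scale=0.3, size=t.size) for t in times_list]
# ... and the same responses with an arbitrary (even nonlinear) baseline effect added.
y_shifted = [y + 3.0 * x[i] + x[i] ** 2 for i, y in enumerate(y_list)]

slope_a = within_subject_slope(times_list, y_list)
slope_b = within_subject_slope(times_list, y_shifted)
print("slope without / with baseline effect:", slope_a, slope_b)
```

Because any subject-constant term is removed by the centering, the two slope estimates coincide even though the data are unbalanced and the baseline effect is never modeled.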


    ACKNOWLEDGMENTS
 
We gratefully acknowledge Dr Larry Brant (Gerontology Research Center and The Johns Hopkins University, Baltimore, USA) and Dr Koen Milisen (Center for Nursing Research, Katholieke Universiteit Leuven, Leuven, Belgium) for providing us with the hearing data and the ADL data, respectively. Conflict of Interest: None declared.


    REFERENCES

    Brant LJ, Fozard JL. Age changes in pure-tone hearing thresholds in a longitudinal study of normal human aging. Journal of the Acoustical Society of America (1990) 88:813–820.

    Brant LJ, Pearson JD, Morrell CH, Verbeke G. Statistical methods for studying individual change during aging. Collegium Antropologicum (1992) 16:359–369.

    Brorsson B, Äsberg K. Katz index of independence in ADL: reliability and validity in short-term care. Scandinavian Journal of Rehabilitative Medicine (1984) 16:125–132.

    Chao W-H, Palta M, Young T. Effect of omitted confounders on the analysis of correlated binary data. Biometrics (1997) 53:678–689.

    Davidian M, Giltinan DM. Nonlinear models for repeated measurement data (1995) London: Chapman & Hall.

    Diggle PJ, Heagerty P, Liang KY, Zeger SL. Analysis of longitudinal data (2002) Oxford: Clarendon Press.

    Gail MH, Wieand S, Piantadosi S. Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika (1984) 71:431–444.

    Katz S, Akpom CA. A measure of primary sociobiological functions. International Journal of Health Services (1976) 6:493–507.

    Laird NM, Lange N, Stram D. Maximum likelihood computations with repeated measures: application of the EM algorithm. Journal of the American Statistical Association (1987) 82:97–105.

    Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics (1982) 38:963–974.

    Lange N, Laird NM. The effect of covariance structure on variance estimation in balanced growth-curve models with random parameters. Journal of the American Statistical Association (1989) 84:241–247.

    Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika (1986) 73:13–22.

    Milisen K. An intervention study for delirium in elderly hip fracture patients [PhD thesis]. (1999) Belgium: Katholieke Universiteit Leuven (unpublished).

    Milisen K, Abraham IL, Broos PLO. Postoperative variation in neurocognitive and functional status in elderly hip fracture patients. Journal of Advanced Nursing (1998) 27:59–67.

    Morrell CH, Brant LJ. Modelling hearing thresholds in the elderly. Statistics in Medicine (1991) 10:1453–1464.

    Morrell CH, Pearson JD, Brant LJ. Linear transformations of linear mixed-effects models. The American Statistician (1997) 51:338–343.

    Neuhaus JM. Estimation efficiency with omitted covariates in generalized linear models. Journal of the American Statistical Association (1998) 93:1124–1129.

    Neuhaus JM, Jewell NP. A geometric approach to assess bias due to omitted covariates in generalized linear models. Biometrika (1993) 80:807–815.

    Palta M, Yao T-J. Analysis of longitudinal data with unmeasured confounders. Biometrics (1991) 47:1355–1369.

    Pearson JD, Morrell CH, Gordon-Salant S, Brant LJ, Metter EJ, Klein LL, Fozard JL. Gender differences in a longitudinal study of age-associated hearing loss. Journal of the Acoustical Society of America (1995) 97:1196–1205.

    Pinheiro JC, Bates DM. Mixed effects models in S and S-plus (2000) New York: Springer.

    Ramsey JB. Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society, Series B (1969) 31:350–371.

    Shock NW, Greullich RC, Andres R, Arenberg D, Costa PT, Lakatta EG, Tobin JD. Normal Human Aging: The Baltimore Longitudinal Study of Aging. In: Superintendent of Documents (1984) Washington, DC: U.S. Government Printing Office. 84–2450.

    Törnquist K, Lövgren M, Sölderfeldt B. Sensitivity, specificity and predictive value in Katz's and Barthel's ADL indices applied on patients in long term nursing care. Scandinavian Journal of Caring Sciences (1990) 4:99–106.

    Verbeke G, Fieuws S, Lesaffre E, Kato BS, Foreman MD, Broos PLO, Milisen K. A comparison of procedures to correct for baseline differences in the analysis of continuous longitudinal data: a case study. Applied Statistics (2006) 55:93–102.

    Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. In: Springer Series in Statistics (2000) New York: Springer.

    Verbeke G, Spiessens B, Lesaffre E. Conditional linear mixed models. The American Statistician (2001) 55:25–34.

    Vonesh EF, Chinchilli VM. Linear and nonlinear models for the analysis of repeated measurements (1997) New York: Marcel Dekker Inc.

    Winer BJ. Statistical principles in experimental design (1970) England: McGraw-Hill.

    Received August 2, 2006; revised November 7, 2006; accepted for publication February 7, 2007.

