Biostatistics Advance Access originally published online on April 14, 2005
Biostatistics 2005 6(3):395-403; doi:10.1093/biostatistics/kxi017

© The Author 2005. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oupjournals.org.

On the relation between initial value and slope

K. Byth*

NHMRC Clinical Trials Centre, University of Sydney, Locked Bag 77, Camperdown, NSW 1450, Australia and Westmead Millennium Institute, Westmead Hospital, Westmead, NSW 2145, Australia. kbyth{at}ctc.usyd.edu.au

D. R. Cox

Nuffield College, Oxford, OX1 1NF, UK

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A STUDY OF...
 3. THEORETICAL DISCUSSION
 4. RESULTS FOR HUNTINGTON'S...
 5. DISCUSSION
 REFERENCES
 
Suppose measurements of a particular feature are collected at baseline and at a number of subsequent time points and that for each individual there is a roughly linear trend in time. This paper takes three approaches to testing whether there is a relation between the initial value and the slope. It also considers whether the initial value for an individual is a useful predictor of the slope for that individual. The problems are formulated in terms of regression models with random coefficients. The solutions are illustrated using data from an observational study of clinical correlates of disability and progression in Huntington's disease.

Keywords: Huntington's disease; Initial value; Longitudinal data; Regression model with random coefficients; Regression to the mean; Slope


    1. INTRODUCTION
 
Suppose that patients are measured for a particular feature at baseline (t = 0) and at a number of subsequent time points and that for each patient there is a roughly linear trend in time. Is there a relation between the initial value and the slope, and is the initial value for an individual a useful predictor of the slope for that individual? Any analysis must take account of the phenomenon of regression to the mean, first noted by Galton (1869). For a recent account from a number of perspectives, see the series of papers edited by Senn (1997) in the issue of Statistical Methods in Medical Research devoted to this topic.

The present paper was motivated by a specific application to be described in Section 2. Three versions of the problem are given and formulated in terms of a regression model with random coefficients; see, for example, Laird and Ware (1982). A theoretical discussion more from first principles is outlined in Section 3 and the conclusions from the specific study are sketched in Section 4. Section 5 discusses some further developments.

The multilevel model approach to longitudinal data analysis is well known and our suggested solution is implicit in its formulation; see, for example, Burton et al. (1998), Omar et al. (1999), Verbeke and Molenberghs (2000), or Leyland and Goldstein (2001). Some care, however, is needed in applying the approach. For example, we are unaware of papers in the medical statistical literature where such a model has been correctly used to address the questions posed above. Even when a multilevel model is properly fitted, it is easy to overlook the effect of regression to the mean in interpretation.

In this paper we identify three different formulations of the issue and indicate solutions which avoid the regression to the mean effect.


    2. A STUDY OF HUNTINGTON'S DISEASE
 
The data used to illustrate the above ideas were collected as part of an observational study of clinical correlates of disability and progression in Huntington's disease (Mahant et al., 2003). Patients with a definite diagnosis of Huntington's disease were assessed at their initial and subsequent routine clinic visits for motor, cognitive, and functional impairment using the Unified Huntington's Disease Rating Scale (UHDRS) (Huntington Study Group, 1996). We concentrate here on the total motor score (TMS) and the total functional capacity (TFC) of the UHDRS. The TMS can take integer values from 0 (normal) to 124 (maximally abnormal) while the TFC takes integer values between 0 (maximal disability) and 13 (normal).

Our study population consists of 83 patients who had more than one clinic visit and were followed up for at least 1 year at Westmead Hospital, a major Sydney teaching hospital and tertiary referral center. The median duration of follow-up was 5.2 years (interquartile range, 3.1–6.8 years). The median number of visits per patient was 9 (interquartile range, 5–12). The median baseline TMS and TFC scores were 38 (interquartile range, 26–55) and 8 (interquartile range, 5–10), respectively. Figure 1 illustrates typical TMS and TFC profiles for a random selection of 10 patients.



 
Fig. 1. Profile plots of the TMS and TFC scores over time for 10 randomly selected subjects.

 

    3. THEORETICAL DISCUSSION
 
3.1. Formulation

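For the ith patient (i = 1, ..., k), suppose observations $Y_{ij}$ (j = 0, ..., $r_i$) are available at times $t_{ij} \ge 0$, where $t_{i0} = 0$ defines the time origin for individual i and is taken at the first observation. We take as an initial model

$$Y_{ij} = \mu_i + \gamma_i t_{ij} + \varepsilon_{ij}, \tag{3.1}$$

where the $\varepsilon_{ij}$ are uncorrelated random terms of zero mean and variance $\sigma_\varepsilon^2$ and $(\mu_i, \gamma_i)$ are, respectively, the expected value at $t_{ij} = 0$ and the slope for the ith individual. We write

$$\mu_i = \mu + \zeta_i, \qquad \gamma_i = \gamma + \eta_i, \tag{3.2}$$

where $(\zeta_i, \eta_i)$ are random terms of zero mean and

$$\operatorname{var}(\zeta_i) = \sigma_\zeta^2, \qquad \operatorname{var}(\eta_i) = \sigma_\eta^2, \qquad \operatorname{cov}(\zeta_i, \eta_i) = \sigma_{\zeta\eta} = \rho_{\zeta\eta}\,\sigma_\zeta\sigma_\eta.$$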
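As a concrete illustration of the random-coefficient formulation, the following minimal sketch draws patient profiles with a patient-specific intercept and slope plus independent noise. Normal distributions, the function name `simulate_patients`, and all numerical values are assumptions made for illustration only; they are not taken from the study.

```python
import math
import random


def simulate_patients(k, times, mu, gamma, sd_zeta, sd_eta, rho, sd_eps, seed=0):
    """Draw k patient profiles from a random intercept and slope model.

    Returns a list of (mu_i, gamma_i, ys) tuples, where ys[j] is the
    observation at times[j]. Normal errors are an added assumption.
    """
    rng = random.Random(seed)
    patients = []
    for _ in range(k):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        zeta = sd_zeta * z1
        # Build eta so that corr(zeta, eta) = rho.
        eta = sd_eta * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        mu_i, gamma_i = mu + zeta, gamma + eta
        ys = [mu_i + gamma_i * t + rng.gauss(0, sd_eps) for t in times]
        patients.append((mu_i, gamma_i, ys))
    return patients


# Hypothetical parameter values, chosen only for illustration.
profiles = simulate_patients(k=5, times=[0, 1, 2, 3], mu=40.0, gamma=3.0,
                             sd_zeta=15.0, sd_eta=1.5, rho=0.0, sd_eps=6.0)
for mu_i, gamma_i, ys in profiles:
    print(round(mu_i, 1), round(gamma_i, 2), [round(y, 1) for y in ys])
```

Each printed row shows the underlying intercept and slope for one simulated patient followed by the noisy observations; the observed baseline value is the first entry of the list and differs from the underlying intercept by the noise term.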

We now distinguish different versions of the problem sketched in Section 1. In the first, we are interested in the relation between the expected value $\mu_i$ at time zero and the slope $\gamma_i$. This relation is summarized in $\rho_{\zeta\eta}$ or by the regression coefficient

$$\beta_{\gamma\mu} = \sigma_{\zeta\eta}/\sigma_\zeta^2 = \rho_{\zeta\eta}\,\sigma_\eta/\sigma_\zeta. \tag{3.3}$$

Especially if the $\varepsilon_{ij}$ represent primarily measurement error or uninteresting ‘noise,’ the dependence of interest may be best encapsulated in $\beta_{\gamma\mu}$. Note that the conditional variance of $\eta_i$ given $\zeta_i$ is $\sigma_\eta^2(1 - \rho_{\zeta\eta}^2)$.

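A second possibility, directly relevant for empirical prediction, is to relate $\gamma_i$ to $Y_{i0}$. Under assumptions (3.1) and (3.2), the regression coefficient of $\gamma_i$ on $Y_{i0}$ is

$$\beta_{\gamma Y_0} = \sigma_{\zeta\eta}/(\sigma_\zeta^2 + \sigma_\varepsilon^2). \tag{3.4}$$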
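The difference between regressing the slope on the underlying initial mean and regressing it on the observed initial value can be made concrete with a few lines of arithmetic. The sketch below assumes the usual covariance algebra for the random-coefficient model (the covariance of intercept and slope divided by the relevant variance); the function name and the parameter values are hypothetical.

```python
def slope_regression_coefficients(sd_zeta, sd_eta, rho, sd_eps):
    """Return (coefficient on mu_i, coefficient on Y_i0, residual slope variance)."""
    cov_zeta_eta = rho * sd_zeta * sd_eta
    beta_gamma_mu = cov_zeta_eta / sd_zeta ** 2
    # Var(Y_i0) = sd_zeta^2 + sd_eps^2, because Y_i0 = mu_i + eps_i0.
    beta_gamma_y0 = cov_zeta_eta / (sd_zeta ** 2 + sd_eps ** 2)
    resid_var = sd_eta ** 2 * (1 - rho ** 2)
    return beta_gamma_mu, beta_gamma_y0, resid_var


# Hypothetical values: noise variance equal to the between-patient variance.
b_mu, b_y0, resid = slope_regression_coefficients(sd_zeta=2.0, sd_eta=0.5,
                                                  rho=-0.4, sd_eps=2.0)
print(b_mu, b_y0, resid)
```

The coefficient on the observed baseline is the coefficient on the underlying mean shrunk by the factor $\sigma_\zeta^2/(\sigma_\zeta^2 + \sigma_\varepsilon^2)$, so with equal variances, as in this made-up example, it is halved.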

A different approach to assessing dependence on the baseline value is to consider Y0 purely as an explanatory variable.

3.2. Statistical analysis

The model (3.1)–(3.2) can be fitted by PROC MIXED in SAS or lme in R or S-PLUS, the defining parameters being estimated preferably by restricted maximum likelihood (REML). In particular, estimates of $\beta_{\gamma\mu}$ and of the coefficient $\beta_{\gamma Y_0}$ in (3.4) can be found and confidence intervals (CIs) obtained via the asymptotic standard errors. We now give a discussion from first principles which has the advantage of showing explicitly the relation with regression to the mean and with the contributions of individual patients.

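For the analysis to study $\beta_{\gamma\mu}$, we start by fitting linear least squares regressions to the data from each patient. For the ith patient this gives $\hat\mu_i$ and $\hat\gamma_i$. Direct evaluation under the model (3.1)–(3.2) gives the covariance matrix of $(\hat\mu_i, \hat\gamma_i)$ as

$$\begin{pmatrix} \sigma_\zeta^2 & \sigma_{\zeta\eta} \\ \sigma_{\zeta\eta} & \sigma_\eta^2 \end{pmatrix} + \frac{\sigma_\varepsilon^2}{(r_i + 1)(\overline{t_i^2} - \bar t_i^{\,2})} \begin{pmatrix} \overline{t_i^2} & -\bar t_i \\ -\bar t_i & 1 \end{pmatrix},$$

where $\bar t_i$ and $\overline{t_i^2}$ are averages of the $t_{ij}$ and of the $t_{ij}^2$ for the ith individual. Note that if $\sigma_{\zeta\eta} = 0$, that is, if baseline mean and slope are uncorrelated, the covariance of the two estimates is negative, a manifestation of regression to the mean.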
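The negative covariance between fitted intercept and fitted slope can be checked by simulation. The sketch below is an illustration under assumed normal errors and made-up parameter values: true intercepts and slopes are generated independently, a least squares line is fitted to each simulated patient, and the sample covariance of the fitted pairs is compared with the standard least squares value $-\sigma_\varepsilon^2 \bar t / \sum_j (t_j - \bar t)^2$ for a common set of visit times.

```python
import random


def ols_line(ts, ys):
    """Least squares intercept and slope for one patient."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / sxx
    return y_bar - slope * t_bar, slope


def fitted_covariance(k, times, sd_zeta, sd_eta, sd_eps, seed=1):
    """Sample covariance of fitted (intercept, slope) when the true pair is independent."""
    rng = random.Random(seed)
    fits = []
    for _ in range(k):
        mu_i = rng.gauss(0, sd_zeta)     # true intercept (population mean set to 0)
        gamma_i = rng.gauss(0, sd_eta)   # true slope, drawn independently of mu_i
        ys = [mu_i + gamma_i * t + rng.gauss(0, sd_eps) for t in times]
        fits.append(ols_line(times, ys))
    m_bar = sum(m for m, _ in fits) / k
    g_bar = sum(g for _, g in fits) / k
    return sum((m - m_bar) * (g - g_bar) for m, g in fits) / (k - 1)


times = [0, 1, 2, 3, 4]                  # hypothetical common visit times
t_bar = sum(times) / len(times)
sxx = sum((t - t_bar) ** 2 for t in times)
sd_eps = 5.0
theory = -sd_eps ** 2 * t_bar / sxx      # covariance induced purely by noise
empirical = fitted_covariance(2000, times, sd_zeta=2.0, sd_eta=0.5, sd_eps=sd_eps)
print(theory, round(empirical, 2))
```

Although no association was built into the simulated intercepts and slopes, the fitted values are negatively associated, which is why a naive plot of fitted slope against fitted intercept cannot be read at face value.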

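A pooled estimate of the parameter $\sigma_\varepsilon^2$ is obtained from the residual mean square after fitting separate regression lines for each patient. Various simple, if somewhat inefficient, estimates of the remaining parameters are now available. In particular, we may calculate the sums of squares and products about the mean of the individual $(\hat\mu_i, \hat\gamma_i)$ and equate these to their expectations. Writing $\hat\mu_.$ and $\hat\gamma_.$ for the unweighted means over patients, $\bar t_i$ and $\overline{t_i^2}$ for the averages of the $t_{ij}$ and of the $t_{ij}^2$ for the ith individual, and $d_i = (r_i + 1)(\overline{t_i^2} - \bar t_i^{\,2})$, this gives for the expectations of, respectively, the sum of squares of the $\hat\mu_i$, the $\hat\gamma_i$ and the sum of products of $\hat\mu_i$, $\hat\gamma_i$:

$$E\Big\{\sum_i (\hat\mu_i - \hat\mu_.)^2\Big\} = (k - 1)\Big(\sigma_\zeta^2 + \frac{\sigma_\varepsilon^2}{k}\sum_i \frac{\overline{t_i^2}}{d_i}\Big), \tag{3.5}$$

$$E\Big\{\sum_i (\hat\gamma_i - \hat\gamma_.)^2\Big\} = (k - 1)\Big(\sigma_\eta^2 + \frac{\sigma_\varepsilon^2}{k}\sum_i \frac{1}{d_i}\Big), \tag{3.6}$$

$$E\Big\{\sum_i (\hat\mu_i - \hat\mu_.)(\hat\gamma_i - \hat\gamma_.)\Big\} = (k - 1)\Big(\sigma_{\zeta\eta} - \frac{\sigma_\varepsilon^2}{k}\sum_i \frac{\bar t_i}{d_i}\Big). \tag{3.7}$$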
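The moment calculation just described can be sketched in code. This is an illustration rather than the authors' implementation: it assumes that the expected sums of squares and products about the unweighted mean equal $(k-1)/k$ times the sum over patients of the corresponding variances or covariances of the fitted pairs, and it uses the standard least squares sampling variances for each patient's intercept and slope. The function names, the simulated data, and the parameter values are hypothetical.

```python
import math
import random


def moment_estimates(patients):
    """Method-of-moments estimates from per-patient least squares fits.

    patients: list of (times, values) pairs with at least two distinct times each.
    Returns (var_eps, var_zeta, var_eta, cov_zeta_eta).
    """
    fits, rss_total, df_total = [], 0.0, 0
    a_sum = b_sum = c_sum = 0.0
    for ts, ys in patients:
        n = len(ts)
        t_bar, y_bar = sum(ts) / n, sum(ys) / n
        sxx = sum((t - t_bar) ** 2 for t in ts)
        slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / sxx
        intercept = y_bar - slope * t_bar
        fits.append((intercept, slope))
        rss_total += sum((y - intercept - slope * t) ** 2 for t, y in zip(ts, ys))
        df_total += n - 2
        # Sampling (co)variances of the fitted line, per unit of var_eps.
        a_sum += sum(t * t for t in ts) / (n * sxx)   # intercept variance
        b_sum += 1.0 / sxx                            # slope variance
        c_sum += t_bar / sxx                          # minus the covariance
    k = len(fits)
    var_eps = rss_total / df_total                    # pooled residual mean square
    m_bar = sum(m for m, _ in fits) / k
    g_bar = sum(g for _, g in fits) / k
    ss_m = sum((m - m_bar) ** 2 for m, _ in fits)
    ss_g = sum((g - g_bar) ** 2 for _, g in fits)
    sp_mg = sum((m - m_bar) * (g - g_bar) for m, g in fits)
    var_zeta = ss_m / (k - 1) - var_eps * a_sum / k
    var_eta = ss_g / (k - 1) - var_eps * b_sum / k
    cov_zeta_eta = sp_mg / (k - 1) + var_eps * c_sum / k   # undo the noise-induced bias
    return var_eps, var_zeta, var_eta, cov_zeta_eta


def simulate(k, times, sd_zeta, sd_eta, rho, sd_eps, seed=2):
    """Simulated data with normal errors; an assumption made for this sketch."""
    rng = random.Random(seed)
    data = []
    for _ in range(k):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        mu_i = 10.0 + sd_zeta * z1
        gamma_i = -1.0 + sd_eta * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        data.append((times, [mu_i + gamma_i * t + rng.gauss(0, sd_eps) for t in times]))
    return data


# True values in this made-up example: 4.0, 9.0, 1.0 and -1.5.
estimates = moment_estimates(simulate(1500, [0, 1, 2, 3, 4], 3.0, 1.0, -0.5, 2.0))
print([round(e, 2) for e in estimates])
```

The correction added to the sum of products is what removes the regression to the mean effect: without it the estimated covariance between intercept and slope would be biased downward by the noise in the fitted lines.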

As noted above, the version of the problem in which the regression of slope on the observed baseline value is of interest can be approached via the above analysis, taking the coefficient $\beta_{\gamma Y_0}$ in (3.4) as the primary parameter of interest. An alternative approach, which assumes less, is as follows. We use the baseline value as an explanatory variable, and hence to be treated conditionally when modeling $Y_{ij}$ for t > 0. This covers the possibility that the baseline value, while a predictor of slope, is not derived from the system (3.1)–(3.2). For example, when the baseline value is measured at diagnosis it may be subject to a bias that does not, however, affect its usefulness as a predictor of slope. Thus, we fit the following linear regression models to the data for i = 1, ..., $n_2$, j = 1, ..., $r_i$, where $n_2$ is the number of subjects with $r_i \ge 2$:

(3.8)


(3.9)


(3.10)

The random variable $\delta_i$ in the model $\gamma_i = \gamma + \beta(y_{i0} - \bar y_{.0}) + \delta_i$ represents the variation in slope (t > 0) not accounted for by linear regression on $y_{i0}$. The sum of squares for estimating $\sigma_\delta^2$ is the difference between the sums of squares for fitting constants in models (3.8) and (3.9). The expected value of this difference under the model

(3.11)

is of the form where for (t > 0),

(3.12)

It is therefore possible to estimate $\sigma_\varepsilon^2$ and $\sigma_\delta^2$ and to test the significance of the regression term. Additional explanatory variables could be included in the regression term. One way to set out the calculations is in the form of an analysis of variance.
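The idea of treating the baseline value purely as an explanatory variable can be illustrated with a simplified two-stage calculation. The sketch below is a stand-in for, not a reproduction of, the analysis of variance described above: it fits an unweighted least squares slope to each simulated patient and then regresses those slopes on the observed baseline value, once with the baseline observation included in the slope fit and once using only the later visits. Normal errors, common visit times, and all names and values are assumptions.

```python
import random


def ls_slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx


def slope_on_baseline(times, profiles, include_baseline):
    """Regress each patient's fitted slope on the observed baseline value.

    With include_baseline=False the slope uses only the visits after the
    first one, so the baseline enters purely as an explanatory variable.
    """
    start = 0 if include_baseline else 1
    baseline = [ys[0] for ys in profiles]
    slopes = [ls_slope(times[start:], ys[start:]) for ys in profiles]
    return ls_slope(baseline, slopes)


rng = random.Random(4)
times = [0, 1, 2, 3, 4]                  # times[0] is the baseline visit
profiles = []
for _ in range(2000):
    mu_i = rng.gauss(40, 2)              # hypothetical intercepts
    gamma_i = rng.gauss(3, 0.5)          # slopes drawn independently of mu_i
    profiles.append([mu_i + gamma_i * t + rng.gauss(0, 5) for t in times])

naive = slope_on_baseline(times, profiles, include_baseline=True)
conditional = slope_on_baseline(times, profiles, include_baseline=False)
print(round(naive, 3), round(conditional, 3))
```

In this simulation the slope is unrelated to the initial level by construction. The coefficient that reuses the baseline observation in the slope fit is clearly negative, whereas the coefficient based on post-baseline slopes is close to zero, because the noise in the baseline reading no longer enters the slope estimate.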


    4. RESULTS FOR HUNTINGTON'S DISEASE STUDY
 
If normal error structures are assumed in (3.1)–(3.2), R or S-PLUS can be used to fit linear mixed effects models to TMS and TFC. Because TFC takes only integer values over the narrow range [0, 13], a suitably transformed version, namely

was also considered. The resulting estimates of parameters in (3.1)–(3.4) and their associated 95% CIs obtained using REML are shown in Table 1. The maximum likelihood estimates are virtually the same.


 
Table 1. Estimates* of parameters in (3.1)–(3.4) obtained using REML for TMS, TFC, and TFC transformed

 
First consider the results for TMS. Since $\hat\rho_{\zeta\eta}$ is here 0.06 with a 95% CI (–0.23, 0.34), there is no evidence of association between the individual expected value $\mu_i$ at time zero and the slope $\gamma_i$. A plot of the residuals $\eta_i$ versus $\zeta_i$ for the fitted model (3.1)–(3.2) is shown in Figure 2(a). This plot suggests no obvious departures from the underlying assumptions and appears consistent with the independence of $\eta_i$ and $\zeta_i$ and hence of the intercept $\mu_i$ and slope $\gamma_i$.



 
Fig. 2. (a) Residuals in slope $\eta_i$ versus residuals in intercept $\zeta_i$ for the fitted TMS model (3.1)–(3.2). (b) Individual regression coefficients for TMS by time (t > 0) versus baseline (t = 0) value.

 
Similarly, there is no evidence of association between the baseline value of TMS and its rate of change over subsequent visits since the estimate of the coefficient (3.4), $\hat\beta_{\gamma Y_0}$, is 0.009, 95% CI (–0.019, 0.037). The analysis of variance associated with (3.8)–(3.10) together with (3.12) yields estimates of 6.48 and 1.81 for $\sigma_\varepsilon$ and $\sigma_\delta$, respectively. These are consistent with the earlier REML estimates for the system (3.1)–(3.2) (t ≥ 0) given in Table 1. There the residual error estimate was 6.66, 95% CI (6.28, 7.07), and the estimated error for individual slopes was 1.68, 95% CI (1.26, 2.25). Figure 2(b) illustrates the individual TMS regression coefficients (for t > 0) plotted against the baseline value. This plot is consistent with the absence of appreciable association between these variables.

Now consider the TFC scores. Here $\hat\rho_{\zeta\eta}$ is –0.37, 95% CI (–0.62, –0.04), and the estimate of the coefficient (3.4), $\hat\beta_{\gamma Y_0}$, is –0.042, 95% CI (–0.074, –0.011). At first glance there seems to be reasonable evidence of a negative association between the initial value and the slope, that is, more rapid deterioration of TFC among those with better initial scores. On the transformed scale, however, $\hat\rho_{\zeta\eta}$ is –0.22, 95% CI (–0.52, 0.14), and $\hat\beta_{\gamma Y_0}$ is –0.022, 95% CI (–0.051, 0.006), both not quite significant at the 5% level. The normal probability Q–Q plots of the residuals $\varepsilon_{ij}$, $\zeta_i$, and $\eta_i$ obtained fitting model (3.1)–(3.2) to the raw and to the transformed TFC data are shown in Figure 3. There is some evidence of departure from normality which is not entirely removed by the transformation. This departure may be a result of ‘floor and ceiling effects’ due to the limited possible range of observations.



 
Fig. 3. Normal probability Q–Q plots of the residuals $\varepsilon_{ij}$, $\zeta_i$, and $\eta_i$ obtained fitting model (3.1)–(3.2) to the raw and the transformed TFC data (all patients).

 
Figure 4(a) is a plot of the residuals $\eta_i$ versus $\zeta_i$ for the model (3.1)–(3.2) fitted to the raw TFC data. Note that patients with $\zeta_i < -3$ all have $\eta_i > 0$ while those with $\zeta_i > -3$ have, on average, negative $\eta_i$. Closer examination revealed that patients corresponding to points in the top left of Figure 4(a) had TFC scores ≤3 at baseline and could deteriorate over time only to a score of zero, the lowest possible for TFC. In particular, there were two profoundly affected subjects who entered the study with zero TFC scores and remained at this level throughout (a total of 3.0 and 7.8 years). Such behavior would be expected clinically since Huntington's disease is a chronic progressive illness. These patients correspond to the larger points on the far left of Figure 4(a).
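The floor effect described here can be reproduced in a small simulation. The sketch below is a hypothetical illustration, not a model of the study data: every simulated patient has the same latent rate of decline, scores are rounded and clipped to the integer range 0 to 13, and fitted slopes are compared between patients starting near the floor and patients starting in the middle of the scale.

```python
import random


def ls_slope(ts, ys):
    """Least squares slope of ys on ts."""
    n = len(ts)
    t_bar, y_bar = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    return sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / sxx


def bounded_scores(start, slope, times, sd_eps, rng, lo=0, hi=13):
    """Linear latent trend plus noise, rounded and clipped to the integers lo..hi."""
    return [min(hi, max(lo, round(start + slope * t + rng.gauss(0, sd_eps))))
            for t in times]


rng = random.Random(3)
times = [0, 1, 2, 3, 4, 5]
low, high = [], []
for _ in range(300):
    # Every simulated patient has the same latent decline of 0.5 points per year.
    low.append(ls_slope(times, bounded_scores(rng.uniform(0, 3), -0.5, times, 0.5, rng)))
    high.append(ls_slope(times, bounded_scores(rng.uniform(7, 10), -0.5, times, 0.5, rng)))
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)
print(round(mean_low, 2), round(mean_high, 2))
```

Patients who start near zero show flatter fitted slopes than those who start mid-scale, even though the latent decline is identical, which by itself produces an apparent negative association between initial value and slope on a bounded scale.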



 
Fig. 4. Residuals in slope $\eta_i$ versus residuals in intercept $\zeta_i$ for model (3.1)–(3.2) fitted to (a) TFC (all patients) and (b) TFC transformed (omitting two profoundly ill patients with TFC ≡ 0).

 
Omission of these two patients from the analysis effectively normalized the residuals from the model of the transformed TFC scores, producing the scatterplot of $\eta_i$ versus $\zeta_i$ shown in Figure 4(b). On the transformed scale, the reduced data set yielded estimates for $\rho_{\zeta\eta}$ and for the coefficient $\beta_{\gamma Y_0}$ in (3.4) of –0.30, 95% CI (–0.60, 0.08), and 0.003, 95% CI (–0.017, 0.024), respectively. The estimated correlation between the individual rate of change of transformed TFC values and the expected transformed TFC at time zero is negative and not quite significant at the 5% level. There is no obvious association between this rate of change and the baseline transformed value.

Figure 5 shows the individual regression coefficients (for t > 0) plotted against the baseline value for (a) raw TFC scores and (b) transformed TFC. Note that the larger points at the bottom of the plots are associated with the three patients who each had only two postbaseline readings and therefore the least accurate slope estimates. There is no obvious relationship between the individual slopes for t > 0 and the baseline values for either the raw or transformed TFC scores. The estimates of $\sigma_\varepsilon$ and $\sigma_\delta$ obtained using (3.12) and the related analysis of variance are 1.430 and 0.298 for the raw TFC and 0.550 and 0.121 for the transformed TFC, respectively. These values are virtually equivalent to the earlier REML estimates for the system (3.1)–(3.2) (t ≥ 0) given in Table 1.



 
Fig. 5. Individual regression coefficients by time (t > 0) versus baseline (t = 0) value for (a) raw TFC score and (b) transformed TFC (all patients).

 

    5. DISCUSSION
 
The substantive conclusion is that for TMS there is no firm evidence of a relation between initial value and slope. For TFC, a scale with a very limited range, it is desirable both to transform the response scale to reduce the effects of end-point constraint, and also to exclude the two patients whose initial (and all subsequent) scores were zero. For the remaining patients, although there is a suggestion of a negative relationship between the estimated individual rate of change of transformed TFC values and the expected transformed TFC at time zero, the evidence for this conclusion is not decisive. Analysis of a larger set of data is desirable. It seems unlikely that, even if confirmed, the relation is strong enough for predictive purposes unless the slope for an individual patient could be shown to depend appreciably on an explanatory variable, thereby eliminating a substantial source of random variation. A final point is that in any protocol for a new study in which TFC is an important outcome variable it might be advisable either to exclude patients with a very low initial score or at least to give provision for them to be considered separately.

The analysis illustrates a number of methodological issues. Quite apart from the important need to escape entrapment by regression to the mean, there are two broad distinctions of formulation. One is that between regression on the initial underlying population mean versus regression on the initial observed value. The other is the distinction between treating the initial value as part of the underlying random system and treating the initial value as distinct, for example, related to the reasons why a patient enters the study. For these data any association with observed value at entry is so weak as to be useless for predicting slope. There is, however, some evidence of a modest relationship between the underlying initial mean and slope for TFC. In future studies, it may be worthwhile trying to measure the initial value more precisely.

The various estimates reported in Table 1 are readily available by fitting a carefully formulated multilevel model using likelihood-based methods. This approach on its own is likely to be rather dangerous, however, and may overlook important aspects of the data. Our more elementary analyses are based on fitting separate regressions to each patient, examining the results graphically and constructing an analysis of variance table. This table is closely connected to the analysis of covariance table common in the older literature. The estimate of the component of variance between regression coefficients obtained via the analysis of variance table equating mean squares to their expectation is very similar to that obtained by the slightly more efficient REML procedure.

The details of the statistical analysis illustrate a number of general points. First is the need to avoid totally a regression to the mean effect. Then there is the more subtle aspect of distinguishing between regression of slope on the observed value of the initial response and regression on some underlying notional true value. The former permits the possibility that the initial observation is predictive of slope even though it is not part of the main stochastic system. The suggested approach to analysis also highlights other potential problems such as apparently anomalous individuals. These latter issues are likely to arise quite often when multilevel random variability is present.


    ACKNOWLEDGMENTS
 
We would like to thank Elizabeth McCusker, Neil Mahant, and Shanti Graham of the Department of Neurology, Westmead Hospital, for bringing this problem to our attention and for providing the data used in the examples.


    REFERENCES
 

    BURTON, P., GURRIN, L. AND SLY, P. (1998). Extending the simple linear regression model to account for correlated responses: an introduction to generalized estimating equations and multi-level mixed modeling. Statistics in Medicine 17, 1261–1291.

    GALTON, F. (1869). Hereditary Genius. London: Macmillan.

    HUNTINGTON STUDY GROUP (1996). Unified Huntington's Disease Rating Scale: reliability and consistency. Movement Disorders 11, 136–142.

    LAIRD, N. M. AND WARE, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38, 963–974.

    LEYLAND, A. H. AND GOLDSTEIN, H. (2001). Multilevel Modelling of Health Statistics. Chichester: John Wiley and Sons.

    MAHANT, N., MCCUSKER, E. A., BYTH, K. AND GRAHAM, S. (2003). Huntington's disease: clinical correlates of disability and progression. Neurology 61, 1085–1092.

    OMAR, R. Z., WRIGHT, E. M., TURNER, R. M. AND THOMPSON, S. G. (1999). Analysing repeated measurements data: a practical comparison of methods. Statistics in Medicine 18, 1587–1603.

    SENN, S. (1997). Editorial. Statistical Methods in Medical Research 6, 99–102.

    VERBEKE, G. AND MOLENBERGHS, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.

    Received May 28, 2004; revised October 21, 2004; revised December 7, 2004; accepted for publication January 6, 2005.

