Biostatistics Advance Access originally published online on June 1, 2006
Biostatistics 2007 8(2):252-264; doi:10.1093/biostatistics/kxl006
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Statistical methods for panel data from a semi-Markov process, with application to HPV
Department of Biostatistics, Harvard University School of Public Health, 655 Huntington Avenue, Boston, MA 02115, USA mkang{at}hsph.harvard.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Continuous-time, multistate processes can be used to represent a variety of biological processes in the public health sciences; yet the analysis of such processes is complex when they are observed only at a limited number of time points. Inference methods for such panel data have been developed for time homogeneous Markov models, but there has been little research done for other classes of processes. We develop likelihood-based methods for panel data from a semi-Markov process, where transition intensities depend on the duration of time in the current state. The proposed methods account for possible misclassification of states. To illustrate the methods, we investigate a three- and a four-state models in detail and apply the results to model the natural history of oncogenic genital human papillomavirus infections in women.
Keywords: Human papillomavirus; Misclassification; Multistate process; Natural history
| 1. INTRODUCTION |
|---|
|
|
|---|
Continuous-time, multistate stochastic processes provide a useful framework for many studies of event history data (Commenges, 1999
However, in many instances, observations consist of the states of the individual processes at discrete time points, with no information about the types and times of events between observation times. For example, when state transitions of a process are silent eventssuch as the onset of an early stage of a disease before symptomsthe sample paths are observed infrequently, often resulting from diagnostic tests given during patient visits to their caregivers. Inference methods for such panel data from multistate processes have been limited. Kalbfleisch and Lawless (1985)
proposed methods for the analysis of panel data for Markov models with time homogeneous transition intensities; Kay (1986)
, Andersen (1988)
, and Gentleman and others (1994)
have applied these methods to cancer, diabetes, and HIV. Inference based on Markov models in such settings is greatly simplified, because the discrete-time process observed at prespecified time points forms a Markov process.
In many applications, however, the Markov assumption is not appropriate because the transition intensities depend on the elapsed time in the current state. For instance, in modeling the natural history of human papillomavirus (HPV), which is known to cause almost all cervical cancers, the Markov assumption would not account for the strong association between infection duration and progression to cervical abnormality (Stoler, 2000
). Sexually acquired genital HPV infections in women are often transient and recurring, and are usually resolved by the host prior to the onset of symptoms. However, epidemiology studies show that when an infection by certain HPV types persists for a long period of time, it can eventually lead to clinical conditions such as high-grade cervical intraepithelial neoplasia (CIN 2 and 3) and, ultimately, cervical cancer (Stoler, 2000
). When the underlying process is not a time homogeneous Markov process, the observed discrete-time process will in general have a complex structure. A further complication that can arise is that the observations of the states of the process are subject to misclassification.
Motivated by the HPV studies, this paper considers inference for continuous-time semi-Markov processes in settings where sample paths of individuals are observed only at a finite number of prespecified times, possibly with misclassification errors in the observed states. We show that evaluation of likelihood functions can be greatly simplified when the transition intensity from at least one of the states of the underlying process is time homogeneous. Section 2.1 introduces the underlying and the observed processes and outlines the general approach to likelihood methods. Section 3 presents the likelihood contributions in detail for a nonprogressive three-state process. Section 4 applies the proposed methods to a recent HPV trial and a simulated study.
| 2. INFERENCE FOR A K-STATE SEMI-MARKOV PROCESS |
|---|
|
|
|---|
Suppose that X(·) = {X(t),t
0} denotes a continuous-time process with K states, denoted 1,...,K. Let the random variables
0,
1,... denote the initial and subsequent consecutive states occupied by the process, and let
n represent the sojourn time between the (n 1)th and nth states, for n = 1,... . Thus, X(·) is equivalent to
|
|
The process is semi-Markov if the sequence {
0,
1,...} of consecutively occupied states forms a simple Markov chain, and the sojourn times
n between consecutively occupied states are independent random variables with distributions that depend only on the adjoining states (cf. Cox and Miller, 1977
). The probabilistic properties of a semi-Markov process can be characterized by the transition intensities, or cause-specific hazard functions, among states. Suppose that the process enters state i at its nth transition. Then the transition intensity functions out of state i, say
ij(·), are
|
|
where t denotes the elapsed time from entrance into state i, for i,j = 1,...,K, i
j, and n = 1,2,... . Such dependence on the duration (t) in the current state (
n) makes semi-Markov processes distinct from Markov processes.
The probabilistic properties of semi-Markov processes have been studied extensively (Cox and Miller, 1977
), and inferences for settings where sample paths are continuously observed or right censored can be made by extensions of methods for ordinary failure time data (cf. Lagakos and others, 1978
). Inferences for unidirectional, or progressive, processes that lead to interval-censored data have also been developed (Sternberg and Satten, 1999
). However, we consider settings where the independent realizations of the process X(·) corresponding to different subjects are observed only at a finite number of prespecified time points and the process may be bidirectional (nonprogressive). Consider the values of X(·) at the fixed times 0 = v0 < v1 < v2 <
< vM, and let X = (X0,X1,...,XM), where Xm = X(vm). Despite the simple probabilistic form of the underlying semi-Markov process X(·), the joint distribution of X0,X1,...,XM is in general complex because the time points v0,v1,...,vM do not correspond to transition times between states of the process. If the process is not progressive, the states visited in the sample path are not determined by the knowledge of the states occupied at visit times.
Inferences about the parameters
ij(·) are further complicated by misclassification errors. Rather than X, suppose that the observation for a subject is given by Y = (Y0,Y1,...,YM), where Ym
{1,2,...,K} denotes Xm subject to misclassification error. We assume that the misclassification error probabilities satisfy the conditional independence assumption
![]() | (2.1) |
That is, conditional upon the true values of X(·) at the visit times, the distribution of Ym depends only on the value of X(·) at vm. We denote these error probabilities by
ik = P(Ym = k|Xm = i).
The likelihood contribution for an individual can be written as
![]() | (2.2) |
where xm,ym
{1,...,K}, cxy = 
xm,ym, and the summation is over all possible state sequences.
If the transition intensities from at least one of the K states can be assumed to be time homogeneousthat is,
ij(t) =
ij for j = 1,...,K if and only if i
, where
is a nonempty subset of {1,...,K}then P(X = x) in the evaluation of (2.2) is nicely simplified. A consequence of this assumption is that for any times 0
t0 < t1 <
(whether or not these correspond to visit times) and any xm
for m = 0,1,...,
![]() | (2.3) |
where
(tm) denotes the time at which the process last entered state xm prior to tm. The semi-Markov nature of X(·), which ensures that the future of the process for times greater than
(tm) depends on the history of the process only through
(tm) and xm = X(
(tm)), combined with the memoryless property of sojourn times in state xm yields this stationarity property of the process at times when the process is in a state belonging to
. To illustrate how (2.3) can greatly simplify the evaluation of the finite-dimensional distributions of X(·), suppose that
= {1}. Then, for example,
![]() |
Thus, instead of having to compute the joint probability of the 8-dimensional vector [X(t0),...,X(t7)], the result in (2.3) reduces this to computing the distribution of one 3-dimensional vector and two 2-dimensional vectors. More generally, if we define,
|
|
then for any positive integer L,
![]() | (2.4) |
By taking L = M and tm = vm, the result in (2.4) simplifies the calculation of the likelihood contribution in (2.2) by allowing P(X = x) to be expressed as a product of one-step probabilities of the form
|
| (2.5) |
and of multistep probabilities of the form
|
| (2.6) |
where 0 <
1 <
2 <
correspond to differences between the original visit times v0,...,vM.
Calculations of the conditional probabilities in (2.5) and (2.6) can be difficult because of the unknown transitions in the underlying sample paths. However, they also can be simplified by applying the result in (2.4) to the probabilities corresponding to the potential underlying sample paths that yield the observed states in (2.5) and (2.6).
| 3. A THREE-STATE SEMI-MARKOV MODEL |
|---|
|
|
|---|
Consider a three-state semi-Markov model (Figure 1) where states 1 and 2 are transient while state 3 is absorbing and can be entered from either of the other states. As an example, suppose states 1, 2, and 3 represent the infection-free, currently infected, and clinical disease statuses, respectively. Then a subject who is initially uninfected can become infected for a period of time, after which the infection resolves to the infection-free status (121), or the infection leads to clinical disease (123). Alternatively, the subject could develop clinical disease from the infection-free state (13). For someone beginning in state 1, every sample path will consist of k visits (k
0) to state 2, followed by a visit to state 1 or 3 (if k
1).
|
Suppose that individuals begin in state 1 and
= {1}, so that
1j(·) =
1j for j = 2,3. The remaining transition intensities,
2j(·) for j = 1,3, are arbitrary. Let fij(·) denote the subdensity function corresponding to
ij(·); that is,
|
|
for t
0, where
i(t) = 



ij(u)du, i
j. The subdistribution function corresponding to fij(·) is denoted Fij(·), where Fij(t) = 
fij(u)du.
From (2.3)(2.6), calculation of the likelihood contribution of a subject requires consideration of at most the conditional probabilities,
|
| (3.1) |
for j = 1,2,3, where
i > 0 and 0 <
1 <
2 <
denote the distinct values of vj vi and vi < vj are the visit times. When visit times are equally spaced, say every
time units, these conditional probabilities simplify to
|
|
for j = 1,2,3 and m = 1,2,... .
Note that for each observed path, there may be infinitely many underlying sample paths. For example, for the observed sequence
, the underlying sample path of the process can be of the form
![]() |
and so on, where the unboxed states denote the unobservable state changes of the process between visit times. We now consider the probability elements for the one- and multistep conditional probabilities in (3.1) in detail.
Let P1j(k,t) denote the conditional probability that the process is in state j at time t0 + t after k visits to state 2 in the interval (t0,t0 + t), given that the process is in state 1 at time t0, for j = 1,2,3 and k = 0,1,... . Due to stationarity in (2.3), we may set t0 = 0. Note that P12(k,t) = 0 for k = 0, since there must be at least one visit to state 2. The case of j = 1 for k = 0,1,2 is depicted in Figure 2, where u represents the unknown time of transition from state 1 to 2 and z denotes that from state 2 to 1. In this notation,
|
|
| (3.2) |
The sum will be finite if sojourn times from state 2 have a guarantee time; that is, support bounded away from zero. For example, if
2j(t) = 0 for 0
t
G, then the upper limit in the sum is the greatest integer less than or equal to t/G.
For the calculation of P11(k,t), we have P11(0,t) = exp{
1(t)} and
|
|
For k > 1, the probability of k transitions to state 2 is given by a convolution of probability functions P
(1,x) and P11(k 1,t x), where
|
|
differs from P11(k,x) in that its corresponding underlying process ends at the transition time (marked by state 1* in Figure 2). Thus,
|
|
We can extend the definition of P
(1,x) to P
(k1,x), and generalize the convolution function to
|
|
where k1 + k2 = k, for k1,k2
1.
Next, let P13(k,t) denote the probabilities contributing to the calculation of P[X(t) = 3|X(0) = 1]. Since state 3 can be reached from state 1 or state 2, P13(k,t) can be expressed as the sum of P
(k,t), which denotes that state 1 immediately precedes state 3, and P
(k,t), which denotes the other possibility that transition to state 3 was made from state 2. Then (see Appendix as supplementary material available at Biostatistics online)
|
|
Having computed P[X(t0 + t) = 1|X(t0) = 1] and P[X(t0 + t) = 3|X(t0) = 1], P[X(t0 + t) = 2|X(t0) = 1] is obtained as the complement of their sum.
The ideas in one-step conditional probability calculations can easily be extended to obtain the multistep conditional probabilities in (2.6). For example, consider P[X(t2) = 1,X(t1) = 2|X(0) = 1]. Define P121(k1,k2,t1,t2) to be the conditional probability that X(t1) = 2 and X(t2) = 1, with k1 visits to state 2 in (0,t1] and k2 visits to state 2 in (t1,t2], given that X(0) = 1. Note that P121(k1,k2,t1,t2) differs from P11(k1 + k2,t) in that state 2 is known to be occupied in the process at time t1 in P121(k1,k2,t1,t2). To illustrate, a sample path corresponding to P121(1,0,t1,t2) is depicted in Figure 3. Details are presented in the Appendix as supplementary material available at Biostatistics online.
|
The expressions developed above can be combined to obtain an expression for P(X = x) and then the likelihood contribution for an individual from (2.2). The overall likelihood is obtained as the product of the contributions of individual subjects and will be a function of {
12,
13,
21(·),
23(·)}. If parametric forms are assumed for
21(·) and
23(·), the likelihood function will depend on a finite-dimensional parameter vector and standard numerical methods can be used to obtain the maximum likelihood estimator. Under the standard regularity conditions, the maximum likelihood estimators will be consistent and asymptotically normal.
To examine the adequacy of model fit, the visit patterns can be partitioned into several categories, and the observed frequencies of the categories can be compared with the estimated model-based frequencies. An overall
2 statistic would then have an approximate chi-square distribution with the degrees of freedom equal to the number of independent cells minus the number of model parameters.
| 4. ILLUSTRATIONS |
|---|
|
|
|---|
We illustrate the proposed methods with the placebo data from a completed pilot clinical trial of a candidate vaccine for HPV type 16 (Koutsky and others, 2002
|
We considered
= {1} and modeled the transition intensities for state 2 as step functions, incorporating the biological minimum time required for clearance of infection or progression to disease; that is,
![]() |
for j = 1,4. Hence,
2j(t) is dependent on time due to the corresponding guarantee time, G2j. We considered guarantee times of G21 = 5 and G24 = 6 (in months) for the HPV study. We assumed that the diagnostic tests for HPV and CIN have perfect sensitivity but may have imperfect specificity, and reestimated model parameters assuming several assumed specificities.
Estimates of
assuming various misclassification errors are presented in Table 1. The maximized likelihood decreases sharply as either specificity declines below one and indicates that the assumption of no misclassification errors provides the best model fit. The very small estimates of
13 in the presence of imperfect CIN diagnostic test would suggest that the observed prevalence of women in state 3 is almost fully explainable by misclassification, implying that HPV infections of types other than 16 do not lead to CIN 2/3 during the 30-month period of follow-up. However, this would be inconsistent with the literature which suggests that various HPV types are responsible for CIN outcomes. (cf. Liaw and others, 1999
|
We found the inverse sample information to be numerically unstable and relied on the bootstrap method using 100 bootstrapped samples. The model assuming known G21 = 5, G24 = 6, and no misclassification errors gives the following estimates of
and standard errors, based on four unknown parameters in the model:
,
,
, and
. Following the guarantee times, the estimated risk of clearance is about 28 times higher than the risk of progression to CIN after 6 months. The conditional probabilities of HPV16 infection (entering state 2) and non-HPV16-related CIN development (entering state 3) are 0.944 and 0.056, given that a woman leaves the uninfected state (state 1). Once a woman is infected with HPV16, the conditional probabilities of clearing the HPV16 infection, P(
n + 1 = 1|
n = 2), and progressing to CIN, P(
n + 1 = 4|
n = 2), are 0.969 and 0.031, respectively. The mean times to clearance and progression to HPV, conditional on the next state, are about 16 and 17 months, respectively, consistent with commonly used definitions of "persistent infection" (Ho and others, 1998
|
Model fit (Table 2) was assessed by comparing the observed frequencies with the expected frequencies of the selected visit patterns. The chosen model (Model 1) appears to fit the data adequately (p = 0.65, 8d.f.), while the goodness-of-fit results for the same model but allowing a specificity of 0.99 for the HPV16 assay (Model 2) indicate poorer fit (p = 0.04).
|
Fitting the standard Markov model (Kalbfleisch and Lawless, 1985
ij) of
(0.00067),
(0.00013),
(0.0089), and
(0.0022). The estimated transition intensities from state 1 were very similar to those from the semi-Markov methods. The resulting cumulative incidence and prevalence curves were also similar to those from the semi-Markov model over the period of observation. The corresponding goodness-of-fit
2 statistic was 6.93(p = 0.54) for the Markov model, somewhat worse than that of Model 1.
To further illustrate the value of the methods, data were simulated from a four-state semi-Markov process of Figure 4 with
= {1} and
2j(t) = a2jb2j(t G2j)b2j 1, j = 1,4 (Weibull hazard function). For the simulation,
12 = 0.025,
13 = 0.002, a21 = 0.2, a24 = 0.001, b21 = 0.5, b24 = 2, G21 = 4, G24 = 5, visits are every 6 months for 30 months and N = 5000, a sample size typical of a moderately sized Phase III vaccine trial. The simulated data produced 231 and 233 observations in states 3 and 4, respectively, by the end of the observation period, with 2252 subjects observed to be in state 1 at each visit, and yielded
,
,
,
,
, and
, assuming that G21 and G24 are known. Figure 6(a) shows that the estimated cumulative CIN incidence and HPV prevalence are very close to the true probabilities. However, when a Markov model is incorrectly assumed, the biases in such estimates increase with time (Figure 6(b)).
|
| 5. DISCUSSION |
|---|
|
|
|---|
Panel data pose challenges for semi-Markov processes because missing information about an individual's process between the observations can contribute to the probability of being in the current state. Hence, despite the wider applicability of semi-Markov processes compared to the Markov processes, research in this area has been slim. We showed that when the transition intensities from at least one of the states of the process are time homogeneous, the expression for the joint probability in the likelihood function is tractable. In our data applications, we considered a process that begins in a state in
, so that it is not necessary to take account of the duration of time that individuals have been in this state prior to the start of the study. If X(0)
, the proposed methods would still apply if either the subjects had just entered this state or the durations of time in this state prior to entrance into the study were known. Otherwise, the methods would need to be modified to account for the duration of time in the initial state prior to the start of the study. Methods similar to those in the work by Satten and Sternberg (1999)
In the HPV data example, the data came from a proof-of-principle study, thus smaller than a Phase III vaccine trial, and the number of CIN 2/3 events was small (10 events). Several Phase III HPV vaccine trials are currently underway, and are approximately five times the size of the pilot trial, and will be more suitable for the methods developed here, as the simulated data example illustrates. Without data on sexual activities and HPV types other than type 16, we assumed
= {1}, and exponential distributions with guarantee times were chosen for state 2 for model simplicity. Whereas the guarantee times limited the number of potential paths in our example, one can also limit the number of transitions that can occur in a given time interval when enumerating the potential paths. The clinical estimates for guarantee times were not clear, but we found that varying these times (from 4 to 8 months) did not change the likelihood parameter estimates substantially in our data.
The methods in the paper can be extended to allow fixed covariates in the model, for instance, to compare two treatments in a clinical trial. A one-sample model can be fit separately for each distinct covariate vector, and the overall likelihood is obtained by multiplying the likelihoods from the homogeneous groups.
| ACKNOWLEDGMENTS |
|---|
We are grateful to Joseph Heyse and Lisa Chiacchierini for their comments and to Merck Research Laboratories for providing the data used in Section 4. This research was supported by the Howard Hughes Medical Institute as part of the first author's doctoral thesis and by Grants U01 AI38855 (Kang) and AI24643 (Lagakos) from the National Institute of Allergy and Infectious Diseases. Conflicts of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Andersen P. (1988) Multistate models in survival analysis: a study of nephropathy and mortality in diabetes. Statistics in Medicine 7:66170.[Web of Science][Medline]
Andersen P and Borgan O. (1985) Counting process models for life history data: a review. Scandinavian Journal of Statistics 12:1558.
Andersen P and Keiding N. (2002) Multi-state models for event history analysis. Statistical Methods in Medical Research 11:91115.
Commenges D. (1999) Multi-state models in epidemiology. Lifetime Data Analysis 5:31527.[CrossRef][Web of Science][Medline]
Commenges D. (2002) Inference for multi-state models from interval-censored data. Statistical Methods in Medical Research 11:16782.
Cox D and Miller H. (1977) The Theory of Stochastic Processes(Chapman and Hall, London).
Gentleman R, Lawless J, Lindsey J, Yan P. (1994) Multi-state Markov models for analysing incomplete disease history data with illustrations for HIV disease. Statistics in Medicine 13:80521.[Web of Science][Medline]
Herrero R, Hildesheim A, Bratti C, Sherman ME, Hutchinson M, Morales J, Balmaceda I, Greenberg M, Alfaro M, Burk R. (2000) Population-based study of human papillomavirus infection and cervical neoplasia in rural Costa Rica. Journal of the National Cancer Institute 92:46474.
Ho GYF, Bierman R, Beardsley L, Chang CJ, Burk RD. (1998) Natural history of cervicovaginal papillomavirus infection in young women. The New England Journal of Medicine 338:4238.
Hougaard P. (1999) Multi-state models: a review. Lifetime Data Analysis 5:23964.[CrossRef][Web of Science][Medline]
Kalbfleisch J and Lawless J. (1985) The analysis of panel data under a Markov assumption. Journal of the American Statistical Association 80:86371.[CrossRef][Web of Science]
Kay R. (1986) A Markov model for analysing cancer markers and disease states in survival studies. Biometrics 42:85565.[CrossRef][Web of Science][Medline]
Koutsky L, Ault K, Wheeler C, Brown D, Barr E, Alvarez F, Chiacchierini L, Jansen K. (2002) A controlled trial of a human papillomavirus type 16 vaccine. The New England Journal of Medicine 347:164551.
Lagakos S, Sommer C, Zelen M. (1978) Semi-Markov models for partially censored data. Biometrika 65:3118.
Liaw KL, Glass AG, Manos MM, Greer CE, Scott DR, Sherman M, Burk RD, Kurman RJ, Wacholder S, Rush BB. (1999) Detection of human papillomavirus DNA in cytologically normal women and subsequent cervical squamous intraepithelial lesions. Journal of the National Cancer Institute 91:95460.
Moscicki AB, Shiboski S, Broering J, Powell K, Clayton L, Jay N, Darragh TM, Brescia R, Kanowitz S, Miller SB. (1998) The natural history of human papillomavirus infection as measured by repeated DNA testing in adolescent and young women. The Journal of Pediatrics 132:27784.[CrossRef][Web of Science][Medline]
Satten G and Sternberg M. (1999) Fitting semi-Markov models to interval-censored data with unknown initiation times. Biometrics 55:50713.[CrossRef][Web of Science][Medline]
Sternberg M and Satten G. (1999) Discrete-time nonparametric estimation for semi-Markov models of chain-of-events data subject to interval censoring and truncation. Biometrics 55:51422.[CrossRef][Web of Science][Medline]
Stoler M. (2000) Human papillomaviruses and cervical neoplasia: a model for carcinogenesis. International Journal of Gynecological Pathology 19:1628.[Web of Science][Medline]
Received October 9, 2005; revised May 18, 2006; accepted for publication May 31, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Mandel Estimating disease progression using panel data Biostat., January 11, 2010; (2010) kxp057v1. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












