Biostatistics Advance Access originally published online on April 28, 2005
Biostatistics 2005 6(4):590-603; doi:10.1093/biostatistics/kxi029
Optimal design and efficiency of two-phase case-control studies with error-prone and error-free exposure measures
Biostatistics Group, School of Epidemiology and Health Sciences, University of Manchester, Oxford Road, Manchester M13 9PT, UK rmcnamee{at}manchester.ac.uk
| SUMMARY |
|---|
This paper addresses optimal design and efficiency of two-phase (2P) case-control studies in which the first phase uses an error-prone exposure measure, Z, while the second phase measures true, dichotomous exposure, X, in a subset of subjects. Optimal design of a separate second phase, to be added to a preexisting study, is also investigated. Differential misclassification is assumed throughout. Results are also applicable to 2P cohort studies with error-prone and error-free measures of disease status but error-free exposure measures. While software based on the mean score method of Reilly and Pepe (1995
Keywords: Case-control studies; Differential misclassification; Efficiency; Exposure validation studies; Measurement error; Two-phase studies; Two-stage studies
| 1. INTRODUCTION |
|---|
Inaccurate exposure measurement can lead to biassed measures of disease-exposure relationships, but accuracy often entails a greater cost per subject than error-prone approaches. Use of an error-free but expensive measure, X, instead of a cheap, error-prone measurement, Z, will remove the bias but, if the study budget is fixed, will also reduce statistical power and precision. A third possibility is to use both Z and X in a two-phase (2P) study where, for example, Z and disease status Y are measured in phase 1 while, in the second phase, X is measured on a sample of subjects from the first phase. This paper is concerned with the optimal design of such studies for efficiency in estimating β, the log odds ratio linking Y and dichotomous X, given a fixed budget.
In general 2P studies, the set of variables measured in phase 1, W say, is incomplete; this deficiency is made up in phase 2 but only on a sample of first-phase subjects. W may be a subset of the complete set of interest, V, or, as considered here, an error-prone version of V. Data from both phases are combined for analysis. The latter, errors in variables, design may also be referred to as a study with internal exposure validation data. Various designs may be employed for the first phase, e.g. subjects may be chosen initially on the basis of disease status, as in case-control designs, or exposure status, or completely at random (Zhao and Lipsitz, 1992). This paper concentrates on first-phase case-control designs with an error-prone measure of exposure Z, followed by error-free measurement, X, in the second phase, but we show how the results apply when the first-phase design is that of a cohort study or clinical trial with error-free X but error-prone and error-free measures of disease status measured in the two phases.
Several methods exist for analyzing 2P case-control studies; see Thurigen et al. (2000) for a comprehensive review. In designing a 2P study, we need to distinguish situations where the two phases are planned together, and those where the second phase is planned separately after a first phase based on Y and Z is complete. In the latter case, here referred to as separate phase 2 (SP2) design, the second phase may be easily justified because it transforms a previous study with a biassed estimate of β into one which is unbiassed. The additional cost, compared to including all first-phase subjects in the second phase, may be low yet precision can, in some cases, be almost as good (Reilly, 1996). On the other hand, to justify a 2Ps design, here defined as a design where the two phases are planned together, one should consider whether, for a fixed budget, there is any gain in efficiency compared to a simpler, one-phase (1P) study based on Y and X alone. In a 2Ps design, the efficiency per unit cost may be increased if the balance between first- and second-phase sample sizes takes account of the relative costs per subject of Z and X, and if the first- and second-phase sampling designs are chosen optimally. In a separate second-phase design with a fixed budget, only the second-phase sampling design is open to choice; legitimate designs include simple random sampling and random sampling stratified by Y, Z, or Y and Z (Zhao and Lipsitz, 1992).
Palmgren (1987) and Greenland (1988a) addressed optimal 2Ps designs, but with restrictions which may have limited efficiency: second-phase sampling could only be stratified by Y, and the sampling fraction for cases and controls had to be equal. Palmgren proved that a yet more restricted design (equal numbers of first-phase cases and controls), but with an optimally chosen sampling fraction, could be less efficient than a one-phase (1P) study of the same cost and based on X alone. Greenland suggested that the 2Ps design might be less efficient unless the sensitivity and specificity of Z were uniformly high but no formal proof was given. Reilly and Pepe (1995) developed an approach to solve general 2P optimality problems based on their mean score analysis but, since information matrix inversion is required, software is generally needed for implementation. The software designed for this purpose (Reilly and Salim, 2000) requires first-phase pilot data on Y, Z, and X; thus, while it yields individual solutions, this empirical approach does not readily give insight into the general utility of the design.
The efficiency of a 2P case-control study depends, among other factors, on whether or not one assumes differential misclassification, i.e. that misclassification of exposure by Z varies by Y: the differential assumption increases the variance of the estimate of β (Greenland, 1988b; Palmgren, 1987). Greenland (1988b) has argued that differential misclassification should be the presumption in epidemiological studies unless there is compelling evidence against it; case-control studies in particular may be more prone to differential misclassification (Thurigen et al., 2000; Dahm et al., 1995). The mean score method and related software make no assumptions about the relationship between X and Z and thus would appear to assume differential misclassification by default, although Thurigen et al. (2000) have stated the opposite. This software gives empirical solutions to many optimal 2Ps design problems and also to SP2 design problems, given pilot first-phase data. Alternative software (Holcroft and Spiegelman, 1999), applicable only to the latter class of designs, and based on maximum likelihood analysis, assumes nondifferential misclassification and requires parameter estimates for implementation.
Optimal solutions, however derived, are in practice only as good as the parameter estimates: estimation error could undo any theoretical advantage of a design. Breslow and Cain (1988), Cain and Breslow (1988), Breslow and Holubkov (1997), and Breslow and Chatterjee (1999) proposed instead balanced phase 2 designs: for the problem here, balance is defined as equal numbers of subjects from the Y.Z strata. Using examples and simulations, these authors argued that, in general, balanced designs were near optimal, although only the last mentioned paper addressed in detail the errors in variables problem. Holcroft and Spiegelman (1999) also found that the balanced designs were nearly optimal in the SP2 examples they considered.
This paper derives formulae for the efficiency of a fully optimized, 2Ps case-control design compared with a 1P design of the same cost, assuming dichotomous exposure and that error in Z is differential. The optimal first-phase control:case ratio, ratio of first- and second-phase sizes, and optimal sampling fractions for the strata of Y.Z are established. We prove that the maximum efficiency is greater than that implied by Palmgren's (1987) restricted design but has an upper bound which is a simple function of the sensitivities and specificities of Z. The advantage of stratification by Z, in addition to Y, is quantified theoretically. Formulae for the optimal design of a SP2 study are given and used to evaluate balanced designs, enabling us to identify situations where balance is near optimal or suboptimal. The basic theory is extended to studies for estimating the interaction of X with another covariate and to situations where the roles of Y and X, in terms of design and measurement error, are reversed.
To illustrate the formulae, we consider 2P studies of the relationship between cervical cancer (Y) and herpes simplex virus 2 (HSV-2), where HSV-2 can be measured by the cheaper western blot procedure (Z) or a more refined method (X). Data from such a study were analyzed by Carroll et al. (1993) assuming differential misclassification. Here, we derive the optimal 2Ps design and the optimal SP2 design of a new investigation and consider whether the 2Ps design is efficient compared to a 1P design of the same budget. A clinical trial of treatment for ovarian cancer with surrogate and true outcome measures is also considered.
| 2. NOTATION AND ESTIMATION |
|---|
Consider the first phase of a 2P case-control study in which there are n subjects in total and the ratio of controls to cases is R0, giving n1=n/(R0+1) cases and n0=nR0/(R0+1) controls. Cases (Y=1) and controls (Y=0) are sampled independently of each other. All n subjects are classified using an error-prone surrogate for exposure, Z, which is assumed here to have two categories, 0 and 1. Corresponding results for the case where Z has more than two categories are available in the supplementary material (www.biostatistics.oupjournals.org); this could occur, for example, if Z was a score derived from a questionnaire. For the second phase, random sampling, possibly stratified by Y and Z, is used to choose subjects.
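The sampling fraction in the stratum with Y=i, Z=j is $\nu_{ij}$, and the number of subjects sampled is $m_{ij}=n_{ij}\nu_{ij}$, i=0, 1, j=0, 1, where $n_{ij}$ is the number of first-phase subjects in that stratum. The total second-phase size is $m=\sum_{i,j}m_{ij}$, where m<n. These subjects are classified as either X=0 or X=1 (exposed).
The subscripts i and j always refer to levels of Y and Z, respectively. All probabilities are conditional on Y. Let $\pi_i=\Pr\{X=1\mid Y=i\}$, i=0, 1. The objective of the study is to estimate β, the log odds ratio of the Y-X relationship which, in a case-control study, can be found from
$\beta=\log\{\pi_1(1-\pi_0)/[\pi_0(1-\pi_1)]\}$.
Let $\theta_{ij}=\Pr(Z=j\mid Y=i)$, where $\theta_{i0}+\theta_{i1}=1$, and $\lambda_{ij}=\Pr(X=1\mid Y=i,Z=j)$. Then, $\pi_i=\Pr\{X=1\mid Y=i\}=\sum_j\theta_{ij}\lambda_{ij}$. Suppose that the first-phase data give $n_{ij}$ subjects with Y=i, Z=j, $n_{i0}+n_{i1}=n_i$; then $\hat\theta_{ij}=n_{ij}/n_i$. In phase 2 suppose that, of the $m_{ij}$ subjects sampled from this stratum, $x_{ij}$ have X=1; then $\hat\lambda_{ij}=x_{ij}/m_{ij}$. Thus, $\pi_i$ can be estimated by $\hat\pi_i=\sum_j\hat\theta_{ij}\hat\lambda_{ij}$, which has variance (Cochran, 1977):
| $V(\hat\pi_i)\approx\sum_j\theta_{ij}^2\lambda_{ij}(1-\lambda_{ij})/m_{ij}+\sum_j\theta_{ij}(\lambda_{ij}-\pi_i)^2/n_i$ | (2.1) |
and β is estimated by
| $\hat\beta=\log\{\hat\pi_1(1-\hat\pi_0)/[\hat\pi_0(1-\hat\pi_1)]\}$ | (2.2) |
When misclassification by Z is differential, the distributions of $\hat\pi_1$ and $\hat\pi_0$ are independent and $V(\hat\beta)$ is therefore the sum of the variances of the separate log odds in (2.2). From the delta approximation, $V\{\log[\hat\pi_i/(1-\hat\pi_i)]\}\approx V(\hat\pi_i)/[\pi_i(1-\pi_i)]^2$; hence, (2.1) and (2.2) give
| $V(\hat\beta)\approx\sum_i[\pi_i(1-\pi_i)]^{-2}\{\sum_j\theta_{ij}^2\lambda_{ij}(1-\lambda_{ij})/m_{ij}+\sum_j\theta_{ij}(\lambda_{ij}-\pi_i)^2/n_i\}$ | (2.3) |
For the fully optimized 2Ps problem, a more useful expression is obtained by replacing $m_{ij}$ by its expectation $n_i\theta_{ij}\nu_{ij}$, $n_i$ by $nR_i/(R_0+1)$, where $R_1=1$, and $\sum_j\theta_{ij}(\lambda_{ij}-\pi_i)^2$ by $\rho_i^2\pi_i(1-\pi_i)$, where $\rho_i$ is the Pearson correlation coefficient between Z and X in group i (McNamee, 2003). This gives
| $V(\hat\beta)\approx\frac{R_0+1}{n}\sum_i\frac{1}{R_i\pi_i(1-\pi_i)}\{\sum_j\frac{\theta_{ij}\lambda_{ij}(1-\lambda_{ij})}{\nu_{ij}\pi_i(1-\pi_i)}+\rho_i^2\}$ | (2.4) |
Formulae (2.3) and (2.4) do not appear to have been published previously.
An alternative parameterization of the relationship between Z and X is sometimes preferable. Let $\alpha_i=\Pr\{Z=1\mid Y=i,X=1\}$ and $\gamma_i=\Pr\{Z=1\mid Y=i,X=0\}$, i=0, 1. Then, $\alpha_i$ and $1-\gamma_i$ are the sensitivity and specificity, respectively, of Z as a proxy for X in group i. Nondifferential misclassification implies $\alpha_1=\alpha_0$ and $\gamma_1=\gamma_0$. We will assume that Z is such that $\alpha_i-\gamma_i\ge 0$, i=0, 1; this condition ensures that $\rho_i\ge 0$, i=0, 1.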
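The two parameterizations of this section can be connected numerically. The sketch below is illustrative only: the function and argument names are hypothetical, and it assumes the standard Bayes-rule conversion from Pr(X=1), sensitivity and specificity within one disease group to Pr(Z=1), Pr(X=1 | Z=j) and the X-Z correlation.

```python
from math import sqrt

def group_parameters(prev, sens, spec):
    """For one disease group, convert Pr(X=1), sensitivity and specificity
    of Z into Pr(Z=1), Pr(X=1 | Z=0), Pr(X=1 | Z=1) and corr(X, Z)."""
    fpr = 1.0 - spec                                  # Pr(Z=1 | X=0)
    p_z1 = sens * prev + fpr * (1.0 - prev)           # Pr(Z=1)
    p_x1_given_z1 = sens * prev / p_z1
    p_x1_given_z0 = (1.0 - sens) * prev / (1.0 - p_z1)
    cov = sens * prev - prev * p_z1                   # Cov(X, Z)
    rho = cov / sqrt(prev * (1.0 - prev) * p_z1 * (1.0 - p_z1))
    return p_z1, p_x1_given_z0, p_x1_given_z1, rho

# Values quoted for the HSV-2 study in Section 6 (cases, then controls).
cases = group_parameters(prev=0.591, sens=0.784, spec=0.811)
controls = group_parameters(prev=0.440, sens=0.576, spec=0.688)
print(round(cases[3], 3), round(controls[3], 3))
```

With the Section 6 inputs, the two correlations come out close to the values 0.587 and 0.265 quoted there, which is a useful check on the conversion.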
| 3. OPTIMAL 2PS DESIGN |
|---|
Let the total budget for the study be B and the costs per subject of measurement of Z and X be c1 and c2, respectively, with c2>c1. If there are no other costs, then the total cost, C, of a 2Ps study is $c_1\sum_i n_i+c_2\sum_{i,j}m_{ij}$. Since $E(m_{ij})=n_i\theta_{ij}\nu_{ij}$, all choices of n, R0, and $\{\nu_{ij}\}$ must satisfy
| $\frac{n}{R_0+1}\sum_i R_i\,(c_1+c_2\sum_j\theta_{ij}\nu_{ij})=B$ | (3.1) |
To maximize the precision, the values of all the quantities, R0, n, and $\{\nu_{ij}\}$, should be chosen to minimize (2.4) subject to the constraint (3.1). These optimal values, and the resulting minimum variance and efficiency compared with a 1P study of the same budget, are given in Section 3.2. We also give details of some constrained optimal designs, where R0 is fixed a priori or where only stratification by Y is possible. For comparison, the variances for 1P studies based on X alone are given in Section 3.1. All proofs are given in the supplementary material.
In a single-phase case-control study based on X alone, with n1 cases and n0 controls, $V(\hat\beta)$ is approximately $\sum_i[n_i\pi_i(1-\pi_i)]^{-1}$ (e.g. Jewell, 2004). Given a control:case ratio of R0 and a budget B, this can be written as
| $V(\hat\beta)\approx\frac{c_2}{B}(R_0+1)(f_1^2+f_0^2/R_0)$ | (3.2) |
where $f_i=[\pi_i(1-\pi_i)]^{-1/2}$, i=0, 1, and n has been replaced by B/c2. If R0 is chosen to minimize (3.2) subject to the constraint (1+R0)n1=n=B/c2, then from Section 1, supplementary material:
| $R_0=f_0/f_1$ | (3.3) |
and (3.2) becomes
| $V(\hat\beta)\approx\frac{c_2}{B}(f_0+f_1)^2$ | (3.4) |
Equation (3.3) is equal to 1 under the null hypothesis that $\pi_1=\pi_0$.
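The one-phase benchmark can be evaluated directly. The following is a minimal sketch, assuming that f for each group is the reciprocal square root of p(1-p), with p the exposure prevalence in that group, that the optimal control:case ratio is f0/f1, and that the resulting variance is (c2/B)(f0+f1)^2; these readings reproduce the one-phase figures quoted in Section 6. The function name and arguments are hypothetical.

```python
from math import sqrt

def one_phase_design(prev_cases, prev_controls, budget, cost_x):
    """Optimal 1P case-control design when only X is measured."""
    f1 = 1 / sqrt(prev_cases * (1 - prev_cases))
    f0 = 1 / sqrt(prev_controls * (1 - prev_controls))
    n = budget / cost_x                  # affordable number of subjects
    r0 = f0 / f1                         # optimal control:case ratio
    n_cases = n / (r0 + 1)
    n_controls = n - n_cases
    variance = (cost_x / budget) * (f0 + f1) ** 2
    return r0, n_cases, n_controls, sqrt(variance)

# Section 6 inputs: prevalences 0.591 (cases) and 0.440 (controls),
# B = 13 600 and c2 = 100.
r0, n_cases, n_controls, se = one_phase_design(0.591, 0.440, 13600, 100)
print(round(n_cases), round(n_controls), round(se, 3))
```

Under these assumptions the design has about 68 cases and 68 controls with a standard error near 0.347, matching the comparison made in Section 6.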
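In Section 2 of the supplementary material, it is shown that the optimal values of R0, $\{\nu_{ij}\}$, and the corresponding minimum variance in an unconstrained 2Ps design are
| $R_0=f_0\rho_0/(f_1\rho_1)$ | (3.5) |
| $\nu_{ij}=\sqrt{c_1/c_2}\,f_i\sqrt{\lambda_{ij}(1-\lambda_{ij})}/\rho_i$ | (3.6) |
| $V(\hat\beta)\approx\frac{1}{B}[\sqrt{c_2}\sum_i f_ig_i+\sqrt{c_1}\sum_i f_i\rho_i]^2$ | (3.7) |
where
| $g_i=\sqrt{\alpha_i\gamma_i}+\sqrt{(1-\alpha_i)(1-\gamma_i)}$ |
The optimal value of n is found by substituting (3.5) and (3.6) into (3.1). When $\alpha_i-\gamma_i\ge 0$, both $1-g_i$ and $\rho_i$ are measures of the validity of Z, increasing from 0 to 1 as $\alpha_i$ or $1-\gamma_i$ increases. Hence, (3.7) has terms which decrease ($g_i$) and increase ($\rho_i$) with validity. However, when c2/c1 is large, the first term predominates so that variance decreases as validity increases.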
The efficiency of the optimal 2Ps design relative to the optimal 1P design is the ratio of (3.4) to (3.7); its reciprocal is
| $[(\sum_i f_ig_i+\sqrt{c_1/c_2}\sum_i f_i\rho_i)/(f_0+f_1)]^2=(\bar g+\sqrt{c_1/c_2}\,\bar\rho)^2$ | (3.8) |
and tends to $\bar g^2$, say, as $c_1/c_2\to 0$, where $\bar g=\sum_i f_ig_i/\sum_i f_i$ and $\bar\rho=\sum_i f_i\rho_i/\sum_i f_i$. The efficiency of the 2Ps design therefore increases as c2/c1 increases. However, regardless of c2/c1, efficiency can never be more than $1/\bar g^2$. When Z is a poor measure of X, the $g_i$, and hence $\bar g$, are close to unity; hence, even when c2/c1 is large, only a small reduction in variance is possible. If Z is poor and c2/c1 small, the efficiency may even be less than 1.
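How the cost ratio drives the comparison can be seen with a few lines of code. This is a sketch under stated assumptions: g for a group is taken as sqrt(sens * (1 - spec)) + sqrt((1 - sens) * spec), and the ratio of standard errors (optimal 2Ps over optimal 1P) as the f-weighted mean of g plus the f-weighted mean of the X-Z correlation divided by sqrt(c2/c1). Both readings reproduce the numbers quoted in Section 6; the function names are hypothetical.

```python
from math import sqrt

def g(sens, spec):
    """Validity term for one disease group; 1 for a useless Z, 0 for a perfect Z."""
    return sqrt(sens * (1 - spec)) + sqrt((1 - sens) * spec)

def se_ratio_2ps_vs_1p(f, g_vals, rho, cost_ratio):
    """SE(optimal 2Ps) / SE(optimal 1P) for cost_ratio = c2/c1."""
    g_bar = sum(fi * gi for fi, gi in zip(f, g_vals)) / sum(f)
    rho_bar = sum(fi * ri for fi, ri in zip(f, rho)) / sum(f)
    return g_bar + rho_bar / sqrt(cost_ratio)

# Section 6 estimates, ordered (controls, cases).
f = (2.015, 2.034)
g_vals = (g(0.576, 0.688), g(0.784, 0.811))
rho = (0.265, 0.587)
ratio_100 = se_ratio_2ps_vs_1p(f, g_vals, rho, 100)
ratio_5 = se_ratio_2ps_vs_1p(f, g_vals, rho, 5)
print(round(ratio_100, 3), round(ratio_5, 3))
```

A ratio below 1 favours the two-phase design. With these inputs the ratio is below 1 at c2/c1=100 and above 1 at c2/c1=5, in line with the roughly 7% reduction and 7% increase in SE reported in Section 6.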
The optimal values of the overall sampling fraction, E(m)/n, and the second-phase allocation fractions, $E(m_{ij})/E(m)$, can be derived from (3.5) and (3.6). From Section 2 of the supplementary material, they are
| $E(m)/n=\sqrt{c_1/c_2}\,\sum_i f_ig_i/\sum_i f_i\rho_i$ | (3.9) |
| $E(m_{ij})/E(m)=f_i\sqrt{\alpha_i^{j}(1-\alpha_i)^{1-j}\gamma_i^{j}(1-\gamma_i)^{1-j}}\,/\sum_k f_kg_k$ | (3.10) |
This characterization of optimal design, rather than one in terms of n and $\{\nu_{ij}\}$, is convenient and insightful. The optimal fraction, m/n, decreases both as c2/c1 increases and as the validity of Z increases. The optimal allocation fractions (3.10) are discussed in detail in Section 4.
In many case-control studies, R0 is fixed in advance; values in the range 2–8 are common and reflect the relative difficulty of finding cases. When R0 is fixed, the constrained optimal sampling fractions are (see Section 2 of the supplementary material):
| $\nu_{ij}^{*}=\nu_{ij}\,f_i\rho_i\sqrt{R_0+1}\,/\,[R_i(\sum_k f_k^2\rho_k^2/R_k)^{1/2}]$ | (3.11) |
where $\nu_{ij}$ is given by (3.6). The corresponding variance is
| $V(\hat\beta)\approx\frac{1}{B}[\sqrt{c_2}\sum_i f_ig_i+\sqrt{c_1}\{(R_0+1)\sum_i f_i^2\rho_i^2/R_i\}^{1/2}]^2$ | (3.12) |
Equation (3.11) yields an expression identical to (3.10) for the optimal allocation fractions $E(m_{ij})/E(m)$.
Some empirical comparisons were made between (3.10)–(3.12) and output from the optbud procedure, implemented in STATA by Reilly and Salim (2000), which also addresses the 2Ps design with a fixed budget. The comparison requires input of pilot data constructed to yield $\hat\pi_i=\pi_i$, $\hat\alpha_i=\alpha_i$, and $\hat\gamma_i=\gamma_i$, where $\pi_i$, $\alpha_i$, and $\gamma_i$ were the values used in (3.10)–(3.12). Identical results were obtained in each case. A similar comparison is not possible when R0 is a design parameter since the software is built on the assumption that the first-phase distribution of Y is already fixed.
While many authors tend to assume simple random sampling for second-phase subject selection, others such as Breslow and Chatterjee (1999) and Pepe et al. (1994) have sought to demonstrate the benefit of stratified random sampling. In particular, the benefit, if any, of stratification by Y and Z compared to Y alone has been of interest. Here, we prove a general result which quantifies this. Suppose that samples of sizes m1. and m0. are chosen randomly from first-phase Y strata, with no consideration of Z. If we replace $m_{ij}$ by its expectation $m_{i.}\theta_{ij}$ under this scheme, it can be shown (Section 2 of supplementary material) that (2.3) then has a particularly simple form:
| $V(\hat\beta)\approx\sum_i f_i^2[(1-\rho_i^2)/m_{i.}+\rho_i^2/n_i]$ | (3.13) |
It can also be shown that (3.13) is equal to equation (4) from Palmgren (1987, p. 692), which was derived using a likelihood approach, and also to equation (10) from Dahm et al. (1995, p. 2593) for a design in which the roles of X and Z were reversed. Formulae given by Greenland (1988b) do not appear to be equivalent. Palmgren then restricted the design further so that n1=n0 and $\nu_i=m_{i.}/n=\nu$, i=0, 1, and found the optimal value of $\nu$.
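In Section 2 of the supplementary material, it is shown that, without these restrictions, optimal values of R0, $\nu_0$, and $\nu_1$ for minimizing (3.13) subject to the fixed budget B are (3.5) for R0, and
| $\nu_i=\sqrt{c_1/c_2}\,\sqrt{1-\rho_i^2}/\rho_i$, i=0, 1 | (3.14) |
The reciprocal of the efficiency of this design relative to (3.4) for 1P studies with optimal R0 is
| $[(\sum_i f_i\sqrt{1-\rho_i^2}+\sqrt{c_1/c_2}\sum_i f_i\rho_i)/(f_0+f_1)]^2$ | (3.15) |
From (3.15) and (3.8), it follows that $[\sum_i f_i\sqrt{1-\rho_i^2}/\sum_i f_ig_i]^2$ is an upper bound on the efficiency of stratification by Y and Z compared to Y alone. Now for fixed $\alpha_i$ and $\gamma_i$, $\sqrt{1-\rho_i^2}$ varies with $\pi_i$, reaching its minimum $g_i$ when $\pi_i/(1-\pi_i)=\sqrt{\gamma_i(1-\gamma_i)/[\alpha_i(1-\alpha_i)]}$ (McNamee, 2003). Therefore, $\sum_i f_i\sqrt{1-\rho_i^2}\ge\sum_i f_ig_i$, proving that stratification by Z is never a disadvantage.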
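The claim that stratification by Z is never a disadvantage rests on an inequality between two per-group quantities, and that inequality can be checked numerically. The sketch below is only a numerical illustration, not a proof. It assumes the quantity for stratification by Y alone is sqrt(1 - rho^2), with rho the X-Z correlation, and the quantity for stratification by Y and Z is g = sqrt(sens * (1 - spec)) + sqrt((1 - sens) * spec); the helper names and the grid search are hypothetical.

```python
from math import sqrt

def g(sens, spec):
    return sqrt(sens * (1 - spec)) + sqrt((1 - sens) * spec)

def sqrt_one_minus_rho2(prev, sens, spec):
    """sqrt(1 - corr(X, Z)^2) for exposure prevalence `prev` in one group."""
    fpr = 1 - spec
    p_z1 = sens * prev + fpr * (1 - prev)
    rho2 = (sens - fpr) ** 2 * prev * (1 - prev) / (p_z1 * (1 - p_z1))
    return sqrt(1 - rho2)

sens, spec = 0.9, 0.8
grid = [k / 1000 for k in range(1, 1000)]          # prevalences 0.001 .. 0.999
values = [sqrt_one_minus_rho2(p, sens, spec) for p in grid]
best_prev = grid[values.index(min(values))]
print(best_prev, round(min(values), 4), round(g(sens, spec), 4))
```

On this grid the minimum over prevalence sits essentially at g, and every other prevalence gives a larger value, so the Y-only design can match the Y and Z design only at that one prevalence.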
The advantage of stratification by Z will be most marked when c2/c1 is large and Z is accurate; without the latter condition, both $g_i$ and $\sqrt{1-\rho_i^2}$ are close to 1. Also, it increases as the $\pi_i$ move towards 0 or 1, i.e. away from the value (above) that minimizes $\sqrt{1-\rho_i^2}$. To see how it also depends on the $\pi_i$, first consider a fully optimized design, i.e. with stratification by Y and Z, when $\alpha_1=0.9$, $1-\gamma_1=0.8$, and $\alpha_0=0.8$, $1-\gamma_0=0.9$: the maximum efficiency of this design compared to a 1P study is $1/\bar g^2=2.00$. Now consider the corresponding maxima for designs which have stratification by Y only, when $(\pi_1,\pi_0)$ has each of the following values: (0.02, 0.01), (0.2, 0.1), and (0.43, 0.57); the last pair was chosen to make $\sqrt{1-\rho_i^2}=g_i$, i=0, 1. From (3.15), the maximum efficiencies are 1.05, 1.50, and 2.00 and therefore the gains in efficiency from stratification by Z are 1.9, 1.3, and 1.
| 4. SP2 DESIGN: n, R0, AND m FIXED IN ADVANCE |
|---|
Suppose the second phase is planned after the first has been completed and has a separate budget B'. If the only cost is c2 per subject, then m=B'/c2 and the problem is to find allocation fractions, {aij=mij/m, i=0, 1, j=0, 1} so as to minimize (2.3). As shown in Section 1 of the supplementary material, these are
| $a_{ij}=f_i\sqrt{\alpha_i^{j}(1-\alpha_i)^{1-j}\gamma_i^{j}(1-\gamma_i)^{1-j}}\,/\sum_k f_kg_k$ | (4.1) |
which is identical to the right-hand side of (3.10). Thus, the optimal distribution of subjects among the cells of Y.Z is the same whether or not the second phase is planned separately. Also,
| $V_{\min}(\hat\beta)\approx(\sum_i f_ig_i)^2/m+\sum_i f_i^2\rho_i^2/n_i$ | (4.2) |
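The allocation rule and minimum variance for a separate second phase can be sketched as follows. The code assumes that the optimal fraction for stratum (Y=i, Z=j) is proportional to f times the square root of Pr(Z=j | X=1) Pr(Z=j | X=0) within group i, and that the minimum variance is the squared sum of f*g over groups divided by m, plus a first-phase term f^2 rho^2 / n for each group. These readings reproduce the ovarian cancer figures quoted in Section 6, where X and Y swap roles; all function and argument names are hypothetical.

```python
from math import sqrt

def sp2_design(prev, sens, spec, n, m):
    """Allocation fractions a[(i, j)] and minimum variance of the log odds
    ratio for a separate second phase of size m. prev, sens, spec and n are
    dicts keyed by group i (1 and 0)."""
    f, s, rho2 = {}, {}, {}
    for i in (0, 1):
        fpr = 1 - spec[i]
        p_z1 = sens[i] * prev[i] + fpr * (1 - prev[i])
        f[i] = 1 / sqrt(prev[i] * (1 - prev[i]))
        s[i, 0] = sqrt((1 - sens[i]) * spec[i])      # Z=0 stratum
        s[i, 1] = sqrt(sens[i] * fpr)                # Z=1 stratum
        rho2[i] = ((sens[i] - fpr) ** 2 * prev[i] * (1 - prev[i])
                   / (p_z1 * (1 - p_z1)))
    total = sum(f[key[0]] * s[key] for key in s)     # sum over groups of f*g
    alloc = {key: f[key[0]] * s[key] / total for key in s}
    variance = total ** 2 / m + sum(f[i] ** 2 * rho2[i] / n[i] for i in (0, 1))
    return alloc, variance

# Ovarian cancer trial of Section 6: groups are treatment arms X=0, 1.
alloc, variance = sp2_design(
    prev={1: 0.40, 0: 0.20}, sens={1: 0.75, 0: 0.95}, spec={1: 0.75, 0: 0.75},
    n={1: 100, 0: 100}, m=60)
print({k: round(a, 3) for k, a in sorted(alloc.items())}, round(variance, 3))
```

With these inputs the fractions are close to (0.139, 0.351, 0.255, 0.255) and the variance to 0.231, the values reported for that example.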
It is helpful to decompose the allocation strategy (4.1) into two parts: (i) the marginal allocation between cases and controls and (ii) the allocation between Z=1 and Z=0 within these groups.
From (4.1), the optimal fraction of cases among phase 2 subjects is $f_1g_1/(f_0g_0+f_1g_1)$. When g0=g1, this becomes f1/(f0 + f1), a function of $\pi_1$ and $\pi_0$ only or, equivalently, of $\pi_0$ and β only. When $\beta\ge 0$, this is in the range 0.3–0.5 over a wide range of values; see Table 1. The condition g0=g1 is less restrictive than nondifferential misclassification; for example, it includes situations where $\alpha_1=1-\gamma_0$ and $\alpha_0=1-\gamma_1$, which might be roughly true in many situations. If instead, g1>g0 (g1 < g0) so that, overall, Z performs worse (better) among cases than controls, the fraction of phase 2 subjects who are cases would need to be increased (decreased) compared to Table 1.
Now consider the optimum fraction of subjects with Z=1 among second-phase cases (controls); from (4.1) this is $\sqrt{\alpha_i\gamma_i}/g_i$ for disease group i, which depends only on $\alpha_i$ and $1-\gamma_i$ and equals 0.5 when $\alpha_i=1-\gamma_i$. On the other hand, if there is a large difference between $\alpha_i$ and $1-\gamma_i$, then optimality demands a large imbalance between Z=1 and Z=0 (Table 2); sampling should be tilted towards (away from) the Z=1 category when $\alpha_i>1-\gamma_i$ ($\alpha_i<1-\gamma_i$).
A balanced design has equal numbers of second-phase subjects from the strata formed by cross-classifying Y and Z and does not require estimates of $\pi_i$, $\alpha_i$, and $\gamma_i$ for implementation. Balance implies that aij=mij/m=0.25, i, j=0, 1; substituting these into (2.3), and assuming that n is very large compared with m so that the variance terms of order $n^{-1}$ are close to zero, gives
| $V_{\rm bal}(\hat\beta)\approx\frac{4}{m}(\sum_i f_ig_i)^2\sum_k a_k^2$ | (4.3) |
where $a_{ij}$, the optimal fraction from (4.1), has been changed to $a_k$, k=1, ..., 4, for convenience. Since $\sum_k a_k=1$, it follows from (4.2) and (4.3), again ignoring terms of order $n^{-1}$, that
| $V_{\rm bal}(\hat\beta)/V_{\min}(\hat\beta)\approx 4\sum_k a_k^2=1+4\sum_k(a_k-0.25)^2$ | (4.4) |
As discussed in Section 4.1, when $\pi_1=\pi_0$, g0=g1, and $\alpha_i=1-\gamma_i$, i=0, 1, $a_k=0.25$ for all k, so that a balanced design is optimal. In other cases, variance inflation from using a balanced design will depend on the extent of departure from these conditions. Consider separately the negative impact of each type of balance: on Y or on Z. The impact on SE($\hat\beta$) of using a design balanced on Y but with optimal allocation between Z=0 and Z=1 is shown in Table 1 as a function of $\pi_0$ and $e^{\beta}$ given β>0; the table assumes g0=g1 and ignores variance terms of order $n^{-1}$. The increase in SE($\hat\beta$) is less than 10% except when $\pi_0$ is small and exp(β) is large. However, if g1<g0, the optimal design demands fewer cases than Table 1 implies, and thus balance on Y will have a worse effect. The impact on SE($\hat\beta$) of balancing on Z, but with optimal distribution between cases and controls, is shown in Table 2; large increases occur when $\alpha_i$ and $1-\gamma_i$ are very different, especially when one of them is close to 1.
These considerations suggest that the balanced designs will be near optimal in a wide range of situations and also enable prediction of situations where there will be major inefficiency. For example, if $\pi_0=0.01$, $\pi_1=0.1$, $\alpha_0=0.50$, $1-\gamma_0=0.99$, $\alpha_1=0.99$, and $1-\gamma_1=0.50$, then (a00, a01, a10, a11)opt=(0.682, 0.069, 0.023, 0.226) and (4.4) leads to an SE ratio of 1.45. When variance terms of order $n^{-1}$ are taken into account, the negative impact of a balanced design will be less than that which (4.4) predicts.
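The example above can be reproduced in a few lines. This is a sketch under assumptions: the optimal fractions are taken as proportional to f times sqrt(Pr(Z=j | X=1) Pr(Z=j | X=0)) within each disease group, and the variance inflation of a balanced design, ignoring terms of order 1/n, as four times the sum of squared optimal fractions. Function names are hypothetical.

```python
from math import sqrt

def optimal_allocation(prev, sens, spec):
    """Optimal second-phase fractions a[(i, j)] for strata Y=i, Z=j."""
    weight = {}
    for i in (0, 1):
        f_i = 1 / sqrt(prev[i] * (1 - prev[i]))
        weight[i, 0] = f_i * sqrt((1 - sens[i]) * spec[i])
        weight[i, 1] = f_i * sqrt(sens[i] * (1 - spec[i]))
    total = sum(weight.values())
    return {key: w / total for key, w in weight.items()}

def balanced_se_ratio(alloc):
    """SE(balanced) / SE(optimal), ignoring variance terms of order 1/n."""
    return sqrt(4 * sum(a * a for a in alloc.values()))

alloc = optimal_allocation(prev={0: 0.01, 1: 0.1},
                           sens={0: 0.50, 1: 0.99},
                           spec={0: 0.99, 1: 0.50})
ratio = balanced_se_ratio(alloc)
print({k: round(a, 3) for k, a in sorted(alloc.items())}, round(ratio, 2))
```

A design whose optimal fractions are all 0.25 gives a ratio of exactly 1, so the function also shows directly how far a given optimum is from balance.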
| 5. EXTENSIONS TO OTHER PROBLEMS |
|---|
Suppose now that the goal is to estimate the interaction between X and U as represented by the term $\beta_P$ in the model, log odds $=\beta_0+\beta_XX+\beta_UU+\beta_PXU$. Besides Z, an accurately measured categorical risk factor U is ascertained in the first phase, while X is to be measured as before in the second phase on a sample of size m. The design which minimizes SE($\hat\beta_P$) is required.
Breslow and Chatterjee (1999) advocated a balanced design here also, where balance now means equal numbers in the cross-classification by Y, Z, and U. The software from Reilly and Salim (2000) provides an empirical solution to this problem when R0 is fixed and given pilot data.
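Let $\pi_{i.k}=\Pr(X=1\mid Y=i,U=k)$ and assume U has two categories 0 and 1 only. Using Bayes theorem, it is easy to show that
| $\beta_P=\mathrm{logit}\,\pi_{1.1}-\mathrm{logit}\,\pi_{0.1}-\mathrm{logit}\,\pi_{1.0}+\mathrm{logit}\,\pi_{0.0}$, with $\mathrm{logit}\,p=\log[p/(1-p)]$ | (5.1) |
and we can estimate $\pi_{i.k}$ by $\sum_j\hat\theta_{j|ik}\hat\lambda_{ijk}$, where
| $\theta_{j|ik}=\Pr(Z=j\mid Y=i,U=k)$ |
and
| $\lambda_{ijk}=\Pr(X=1\mid Y=i,Z=j,U=k)$ |
These terms can be estimated from the first and second phases, respectively.
If the four components in (5.1) are independent, then $V(\hat\beta_P)$ is simply the sum of the variances of the four components in (5.1). By analogy with variance formulae (2.3), we then have
| $V(\hat\beta_P)\approx\sum_{i,k}[\pi_{i.k}(1-\pi_{i.k})]^{-2}\{\sum_j\theta_{j|ik}^2\lambda_{ijk}(1-\lambda_{ijk})/m_{ijk}+\sum_j\theta_{j|ik}(\lambda_{ijk}-\pi_{i.k})^2/n_{ik}\}$ |
where $n_{ik}$ and $m_{ijk}$ are, respectively, the numbers of first-phase subjects with Y=i, U=k, and second-phase subjects with Y=i, Z=j, and U=k. The same methods as before (see the supplementary material) can now be applied to find constrained and unconstrained optimal designs. For a separate second-phase design, the optimal allocation fractions, expressed in terms of $\alpha_{ik}$ and $1-\gamma_{ik}$, the sensitivity and specificity, respectively, of Z when Y=i, U=k, are
| $a_{ijk}=f_{ik}\sqrt{\alpha_{ik}^{j}(1-\alpha_{ik})^{1-j}\gamma_{ik}^{j}(1-\gamma_{ik})^{1-j}}\,/\sum_{i',k'}f_{i'k'}g_{i'k'}$ | (5.2) |
where $f_{ik}=[\pi_{i.k}(1-\pi_{i.k})]^{-1/2}$ and $g_{ik}=\sqrt{\alpha_{ik}\gamma_{ik}}+\sqrt{(1-\alpha_{ik})(1-\gamma_{ik})}$. Empirical comparisons with output from the STATA optbud procedure suggest identical results.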
Consider a 2P cohort study where, in the first phase, n0 and n1 subjects are chosen from unexposed (X=0) and exposed (X=1) populations; the classification by X is error free. All are measured using an error-prone surrogate of disease, Z. In a second phase, true disease status Y is measured accurately on a sample of subjects. A clinical trial where subjects are randomly allocated to treatments X=0 or X=1 and which employs a surrogate measure of the eventual outcome Y would also have the same structure. To use the previous results without introducing a new notation, we simply reverse the roles of Y and X. Now $\pi_i=\Pr(Y=1\mid X=i)$, with β defined as before, and $\alpha_i$ and $1-\gamma_i$ are the sensitivity and specificity of Z for Y in group X=i. Assuming differential misclassification by the surrogate measure in the two X groups, and that the odds ratio is the parameter of interest, all the results derived earlier apply. In general, the differential misclassification assumption may be less valid in this setting but in the next section an example of this type, quoted by Pepe et al. (1994), is considered.
| 6. EXAMPLES |
|---|
In the study of cervical cancer and HSV analyzed by Carroll et al. (1993), the western blot method (Z) exhibited differential misclassification, the estimated sensitivity $\hat\alpha_i$ and specificity $1-\hat\gamma_i$ being 0.784 and 0.811 for cases but 0.576 and 0.688 for controls, respectively. $\pi_1$ and $\pi_0$ are estimated as 0.591 and 0.440, respectively. Their maximum likelihood analysis based on 2044 first-phase subjects classified by Z and a simple random sample of 115 subjects classified by the more refined X gave $\hat\beta=0.609$, with an SE of 0.350. Application of (2.2) and (2.3) to data in their Table 1 gives identical results.
In planning a new study to estimate β, we need estimates of $f_i$, $g_i$, and their f-weighted means $\bar g$ and $\bar\rho$. An expression for $\rho_i$ in terms of $\pi_i$, $\alpha_i$, and $\gamma_i$ is given by McNamee (2003). From the original study, we can estimate f0=2.015, f1=2.034, g0=0.964, g1=0.803, $\bar g=0.883$, $\rho_0=0.265$, $\rho_1=0.587$, and $\bar\rho=0.427$. The maximum reduction in SE($\hat\beta$) from 2Ps design is $1-\bar g$ or 12%. For finite values of c2/c1, the reduction is less and depends on $\bar\rho$. Consider two extremes: c2/c1=100 and c2/c1=5. From (3.8), there will be a reduction in SE of about 7% in the former case and an increase of 7% in the latter. In the latter case, we would conclude that a 1P study based on X is preferable. Here, we proceed to find the optimal 2Ps design when c1=1, c2=100, and B=13 600 monetary units.
Expressions (3.9) and (4.1) together with $B=n[c_1+c_2mn^{-1}]$ are easiest for calculation. From (3.9), the optimal m/n is 0.207 and so n=627, m=130. From (3.5), $R_0=f_0\rho_0/(f_1\rho_1)\approx 0.45$ and so the first phase should have 433 cases and 194 controls. From (3.10), (a00, a01, a10, a11)opt=(0.30, 0.24, 0.24, 0.22) and hence (m00, m01, m10, m11)=(40, 31, 31, 28). From (3.7), SE($\hat\beta$) is then 0.321. Use of a balanced second phase instead of the optimal mij would not reduce efficiency much: with (m00, m01, m10, m11)=(32, 33, 33, 32), SE($\hat\beta$)=0.324. The near optimality of this balanced design could have been anticipated from the discussion in Section 4.1. An optimal 1P study of the same cost would have 68 cases and 68 controls (from (3.3), $R_0=f_0/f_1\approx 0.99$) and give SE($\hat\beta$)=0.347.
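The design calculation for the cervical cancer example can be retraced step by step in code. The sketch below is not the author's software. It assumes the following readings, which reproduce the figures in this example: f is the reciprocal square root of p(1-p) for prevalence p; g = sqrt(sens * (1 - spec)) + sqrt((1 - sens) * spec); m/n = sqrt(c1/c2) times the ratio of the sum of f*g to the sum of f*rho; the control:case ratio is f0*rho0 / (f1*rho1); and the standard error is (sqrt(c2) * sum of f*g + sqrt(c1) * sum of f*rho) / sqrt(B). Names are hypothetical.

```python
from math import sqrt

def optimal_2ps_design(prev, sens, spec, c1, c2, budget):
    """Fully optimized two-phase design; dict arguments are keyed by
    disease group (1 = cases, 0 = controls)."""
    f, s, rho = {}, {}, {}
    for i in (0, 1):
        fpr = 1 - spec[i]
        p_z1 = sens[i] * prev[i] + fpr * (1 - prev[i])
        f[i] = 1 / sqrt(prev[i] * (1 - prev[i]))
        s[i, 0] = sqrt((1 - sens[i]) * spec[i])
        s[i, 1] = sqrt(sens[i] * fpr)
        rho[i] = (sens[i] - fpr) * sqrt(prev[i] * (1 - prev[i])
                                        / (p_z1 * (1 - p_z1)))
    sum_fg = sum(f[key[0]] * s[key] for key in s)
    sum_frho = sum(f[i] * rho[i] for i in (0, 1))
    m_over_n = sqrt(c1 / c2) * sum_fg / sum_frho
    n = budget / (c1 + c2 * m_over_n)
    r0 = f[0] * rho[0] / (f[1] * rho[1])             # controls per case
    return {
        "n": n, "m": m_over_n * n,
        "cases": n / (1 + r0), "controls": n * r0 / (1 + r0),
        "alloc": {key: f[key[0]] * s[key] / sum_fg for key in s},
        "se": (sqrt(c2) * sum_fg + sqrt(c1) * sum_frho) / sqrt(budget),
    }

design = optimal_2ps_design(
    prev={1: 0.591, 0: 0.440}, sens={1: 0.784, 0: 0.576},
    spec={1: 0.811, 0: 0.688}, c1=1, c2=100, budget=13600)
print(round(design["n"]), round(design["m"]), round(design["se"], 3))
```

Multiplying the allocation fractions by m and rounding gives the second-phase stratum sizes. Rerunning the function with perturbed prevalence, sensitivity or specificity shows how sensitive the design is to errors in the pilot estimates.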
Pepe et al. (1994) considered design of a 2P trial to compare chemotherapy (X = 0) and chemotherapy + radiotherapy (X = 1) in the treatment of ovarian cancer, where the outcome of interest was tumor eradication. X-ray determination was the surrogate outcome measure, Z, for all subjects, while a more accurate determination, Y, was to be made on a sample. They assumed $\pi_1=\Pr(Y=1\mid X=1)=0.40$, $\pi_0=\Pr(Y=1\mid X=0)=0.20$, where Y=1 denotes tumor eradication. Misclassification was assumed to be differential because of radiation-induced scarring: $\alpha_0=0.95$, $1-\gamma_0=0.75$, whereas $\alpha_1=0.75=1-\gamma_1$. They found the optimal SP2 design, assuming n1=n0=100 and m=60, using their mean score method. Applying (4.1) and (4.2) instead gives (m00, m01, m10, m11)/m=(0.139, 0.351, 0.255, 0.255) and $V(\hat\beta)=0.231$. These figures agree with the authors' results [which were expressed as
and $V(0.5\hat\beta)$]. The authors did not address the efficiency of 2Ps design; however, we can deduce that, since $\bar g=0.764$, the maximum reduction in SE from 2Ps design is 24%.
| 7. DISCUSSION |
|---|
This paper has concentrated on the simplest 2P, errors in variables, design for etiological studies so as to provide general insights into efficiency. For dichotomous X, we have shown that the maximum benefit for efficiency can be easily predicted from a simple function, $\bar g$, of sensitivity and specificity, and that it may be quite small. Given that the administration of 2P studies may be more difficult than 1P studies with possibly higher refusal rates (Deming, 1977
The limitations of the results should be noted: X is dichotomous although Z could be a continuous variable categorized for sampling; differential misclassification is assumed; and the cost model ignores costs other than those associated with exposure determination. The theory can be adapted to situations where any of the initial calculations from (3.6) yield an optimal sampling fraction greater than 1; this can occur when c2/c1 is small and the $\rho_i$ are small. This problem can be solved theoretically by setting the excessive fractions equal to 1 and finding new optimal values for the others subject to this constraint, but the software by Reilly and Salim (2000) provides convenient empirical solutions. Importantly, these additionally constrained optimal designs lead to greater variances than unconstrained problems, thus reducing efficiency even more compared to (3.8).
2P designs are likely to be substantially more efficient when nondifferential misclassification is correctly assumed. Empirical comparison (not shown) of minimum variances given in Table 1 of Palmgren (1987) for the nondifferential model with stratification by Y only, with the equivalent minimum derived here for the differential model, showed that the former assumption halved variance in some cases. The wide variation in efficiency between analysis methods found in comparative studies (e.g. Breslow and Chatterjee, 1999; Spiegelman and Casella, 1997; Sturmer et al., 2002) may be partly explained by this factor. The nondifferential/differential distinction has also been shown to have implications for testing H0: β=0: Palmgren (1987) found that a 2Ps design was never efficient for this task compared with 1P studies of X alone or Z alone. Accepting the differential assumption here, one might still question whether there is an estimation method more efficient than (2.1) and (2.4). There is evidence against this from the present study since (2.4) coincides with Carroll's maximum likelihood estimates in Example 1 and, in the case where there is stratification by Y only, (3.13) is equal to Palmgren's likelihood-based formula.
The addition of a second phase to a preexisting study based on Z need not be justified on the same cost grounds as a 2Ps study. The simple formulae (4.1) for the optimal allocation fractions, aij, give insight into the role played by parameters, such as sensitivity and specificity, as well as into the performance of balanced designs. They also show that optimal allocation at the second phase is independent of first-phase design. All these results are based on a differential misclassification assumption. An interesting, unanswered question is whether (4.1) would also be optimal under nondifferential misclassification.
Empirical comparisons suggest that, where there is an overlap between the optimality problems treated here and those covered by Reilly and Salim's software, the results are identical both in terms of design and standard error, provided that the pilot data supplied to the software are made to yield estimates of $\pi_i$, $\alpha_i$, and $\gamma_i$ identical to the correct values. It seems highly likely that their likelihood-based method, given appropriate distributional assumptions, is equivalent to the formulae here. The advantage of these formulae is their simplicity; also, by identifying the parameters which are the drivers of optimal design, they show how one can easily evaluate the impact of uncertainty in estimates of these parameters. However, the software covers a much wider range of problems, including missing data designs, but with the restriction that the joint distribution of the first-phase variables is fixed in advance.
| REFERENCES |
|---|
BRESLOW, N. E. AND CAIN, K. C. (1988). Logistic regression for two-stage case–control data. Biometrika 75, 11–20.
BRESLOW, N. E. AND CHATTERJEE, N. (1999). Design and analysis of two-phase studies with binary outcomes applied to Wilms tumour prognosis. Applied Statistics 48, 457–468.
BRESLOW, N. E. AND HOLUBKOV, R. (1997). Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Statistics in Medicine 16, 103–116.
CAIN, K. AND BRESLOW, N. E. (1988). Logistic regression and efficient design for two-stage studies. American Journal of Epidemiology 128, 1198–1206.
CARROLL, R. J., GAIL, M. H. AND LUBIN, J. H. (1993). Case–control studies with errors in covariates. Journal of the American Statistical Association 88, 185–199.
COCHRAN, W. G. (1977). Sampling Techniques, 3rd edition. New York: Wiley.
DAHM, P. F., GAIL, M. H., ROSENBERG, P. S. AND PEE, D. (1995). Determining the value of additional surrogate exposure information for improving the estimate of an odds ratio. Statistics in Medicine 14, 2581–2598.
DEMING, W. E. (1977). An essay on screening or on two-phase sampling, applied to surveys of a community. International Statistical Review 45, 29–37.
GREENLAND, S. (1988a). Statistical uncertainty due to mis-classification: implications for validation sub-studies. Journal of Clinical Epidemiology 41, 1167–1174.
GREENLAND, S. (1988b). Variance estimation for epidemiological effect estimates under mis-classification. Statistics in Medicine 7, 745–757.
HOLCROFT, C. A. AND SPIEGELMAN, D. (1999). Design of validation studies for estimating the odds ratio of exposure–disease relationships when exposure is misclassified. Biometrics 55, 1193–1201.
JEWELL, N. (2004). Statistics for Epidemiology. Boca Raton, FL: Chapman and Hall/CRC.
MCNAMEE, R. (2003). The efficiency of two-phase studies for prevalence estimation. International Journal of Epidemiology 32, 1072–1078.
PALMGREN, J. (1987). Precision of double sampling estimators for comparing two probabilities. Biometrika 74, 687–694.
PEPE, M. S., REILLY, M. AND FLEMING, T. R. (1994). Auxiliary outcome data and the mean score method. Journal of Statistical Planning and Inference 42, 137–160.
REILLY, M. (1996). Optimal sampling strategies for two phase studies. American Journal of Epidemiology 143, 92–100.
REILLY, M. AND PEPE, M. S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82, 299–314.
REILLY, M. AND SALIM, A. (2000). Software for mean score analysis and optimal design of 2-phase studies written in R, SPLUS and STATA. Available from http://www.mep.ki.se/biostat/software/reilly.
SPIEGELMAN, D. AND CASELLA, M. (1997). Fully parametric and semi-parametric models for common events with covariate measurement error in main study/validation study designs. Biometrics 53, 395–409.
STURMER, T., THURIGEN, D., SPIEGELMAN, D., BLETTNER, M. AND BRENNER, H. (2002). The performance of methods for correcting measurement error in case–control studies. Epidemiology 13, 507–516.
THURIGEN, D., SPIEGELMAN, D., BLETTNER, M., HEUER, C. AND BRENNER, H. (2000). Measurement error correction using validation data: a review of methods and their application in case–control studies. Statistical Methods in Medical Research 9, 447–474.
ZHAO, L. P. AND LIPSITZ, S. (1992). Design and analysis of two-stage studies. Statistics in Medicine 11, 769–782.
Received December 19, 2003; revised September 23, 2004; revised March 1, 2005; revised March 29, 2005; accepted for publication April 25, 2005.
of Z in that group (% increase in SE(β) if a fraction of 0.5 is used instead, assuming n is very large)





