Biostatistics Advance Access originally published online on January 20, 2006
Biostatistics 2006 7(3):456-468; doi:10.1093/biostatistics/kxj018

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

The optimal ratio of cases to controls for estimating the classification accuracy of a biomarker

Holly Janes*,†

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA hjanes@jhsph.edu

Margaret Pepe

Department of Biostatistics, University of Washington, and Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98195, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. THE OPTIMAL CASE-CONTROL...
 3. ESTIMATING THE OPTIMAL...
 4. THE OPTIMAL RATIO...
 5. EFFICIENCY LOSS ASSOCIATED...
 6. THE OPTIMAL CASE-CONTROL...
 7. ILLUSTRATION
 8. DISCUSSION
 APPENDIX
 REFERENCES
 
The case–control design is frequently used to study the discriminatory accuracy of a screening or diagnostic biomarker. Yet, the appropriate ratio in which to sample cases and controls has never been determined. It is common for researchers to sample equal numbers of cases and controls, a strategy that can be optimal for studies of association. However, considerations are quite different when the biomarker is to be used for classification. In this paper, we provide an expression for the optimal case–control ratio, when the accuracy of the biomarker is quantified by the receiver operating characteristic (ROC) curve. We show how it can be integrated with choosing the overall sample size to yield an efficient study design with specified power and type-I error. We also derive the optimal case–control ratios for estimating the area under the ROC curve and the area under part of the ROC curve. Our methods are applied to a study of a new marker for adenocarcinoma in patients with Barrett's esophagus.

Keywords: Case–control design; Efficiency; Power; ROC curve; Sample size; Sensitivity; Specificity


    1. INTRODUCTION
 
Recent scientific and technological innovations have produced an abundance of potential screening and diagnostic biomarkers. Screening markers, such as those for various cancers and for cardiovascular disease, are being investigated with the goal of detecting disease at an early stage, when treatment is often more effective. Accurate diagnosis of disease, on the other hand, is a fundamental first step in treating symptomatic patients. Efforts are underway to develop diagnostic tests for recently defined conditions, and faster, more accurate tests for long-established diseases. Note that the tasks of disease screening and diagnosis are fundamentally the same: the goal is to classify individuals as either diseased or non-diseased.

We use the term ‘biomarker’ to refer to a generic medical test, biochemical or otherwise, for screening or diagnosis. Rigorous evaluation of biomarkers is essential, in order to guarantee that the tests that are developed are sufficiently accurate and beneficial to the patient. In this paper, we are concerned with the efficient design of studies to evaluate biomarkers.

The case–control study is commonly employed (Pepe et al., 2001). A fixed number of cases and controls are sampled in order to determine how well the biomarker distinguishes between these two groups. The appropriate ratio in which to sample cases and controls has never been established. It is common to simply sample equal numbers of cases and controls (Etzioni et al., 1999), or, when controls are easier to obtain, to sample more controls than cases (Janes et al., 2005). In this paper, we derive the optimal case–control ratio, which enables us to design a future study with optimal case–control allocation, thereby increasing efficiency and reducing the cost of the study.

We consider continuous biomarkers with diagnostic accuracy evaluated using the receiver operating characteristic (ROC) curve. The ROC curve plays a key role in the evaluation of biomarker accuracy (Baker, 2003; Hanley, 1989; Begg, 1991; Zhou et al., 2002; Pepe, 2003). It displays the trade-off between false-positive and false-negative error rates associated with rules which classify individuals as ‘positive’ (negative) if their biomarker values are above (below) a threshold, for all possible thresholds. For population screening markers, false-positive errors are especially troublesome, and our interest is estimation of the true-positive fraction (TPF) for a fixed, low, false-positive fraction (FPF). In other cases, false-negative errors are more important, and thus we estimate the FPF corresponding to an acceptably high TPF. We focus here on the first scenario, but note that the second can be handled in a similar fashion.

We define the optimal case–control ratio as that for which the variance of the ROC estimator is minimized. In Section 2, we derive the optimal ratio and show how it can be combined with choosing the overall sample size. A non-parametric estimation method is proposed in Section 3. In Section 4, we consider how the optimal ratio changes as a function of the diagnostic accuracy of the biomarker in a classic distributional case. Section 5 demonstrates that efficiency loss associated with sub-optimal case–control allocation can be substantial. In Section 6, we derive the optimal case–control ratios for estimating two other popular summary measures, the area under the ROC curve (AUC) and the area under part of the ROC curve (pAUC). Finally, we illustrate these methods using data from a study of a marker for adenocarcinoma in patients with Barrett's esophagus.


    2. THE OPTIMAL CASE–CONTROL RATIO
 
Suppose we are planning a case–control study with $n_D$ cases and $n_{\bar D}$ controls to evaluate the diagnostic accuracy of a biomarker, $Y$. Let $Y_D$ and $Y_{\bar D}$ denote diseased and non-diseased observations with survivor functions $S_D(y) = P[Y_D > y]$ and $S_{\bar D}(y) = P[Y_{\bar D} > y]$ and densities $f_D(y)$ and $f_{\bar D}(y)$. We begin by assuming that the study has a fixed overall sample size, $N$. We seek to determine the fraction of cases in this total sample size in order to most efficiently estimate the ROC curve at a fixed acceptable FPF.

The ROC curve is a plot of the TPF (sensitivity) versus the FPF (1 – specificity) for the set of rules that define ‘test positive’ as ‘$Y > c$’, where $c$ varies from $-\infty$ to $\infty$. It can be written as (Pepe, 2003, p. 69)

$$\mathrm{ROC}(t) = S_D\bigl(S_{\bar D}^{-1}(t)\bigr), \quad t \in (0, 1). \tag{2.1}$$

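We use the empirical non-parametric ROC estimator, which estimates the constituent survivor functions empirically:

$$\widehat{\mathrm{ROC}}(t) = \hat S_D\bigl(\hat S_{\bar D}^{-1}(t)\bigr).$$

Its asymptotic variance, an approximation that performs well even in small samples, is given by (Greenhouse and Mantel, 1950; Hsieh and Turnbull, 1996; Pepe, 2003, p. 101)

$$\mathrm{var}\bigl(\widehat{\mathrm{ROC}}(t)\bigr) = \frac{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}}{n_D} + \bigl\{\mathrm{ROC}'(t)\bigr\}^2\, \frac{t(1 - t)}{n_{\bar D}}, \tag{2.2}$$

where $\mathrm{ROC}'(t) = \partial\, \mathrm{ROC}(t)/\partial t$. This variance is minimized when

$$\frac{n_D}{n_{\bar D}} = \rho_{\mathrm{opt}}(t) = \frac{1}{\mathrm{ROC}'(t)} \sqrt{\frac{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}}{t(1 - t)}}. \tag{2.3}$$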

The form of $\rho_{\mathrm{opt}}(t)$ is completely general and interesting in several respects. The ratio depends on $t$, the FPF of interest, and on the slope of the ROC curve at $t$. A steeper slope at $t$ implies a lower optimal ratio since, if a small change in $t$ has a large effect on ROC(t), more emphasis should be placed on estimating the threshold accurately, which necessitates sampling more controls. The ratio also depends on $S = \sqrt{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}/\{t(1 - t)\}}$; this function can be greater or less than 1, and we explore its form further in Section 4. Interestingly, the only case in which the optimal ratio is equal to 1 for all $t$ is when ROC(t) = t. This result is proven in the Appendix.

The optimal ratio, $\rho_{\mathrm{opt}}(t)$, is that which minimizes the variance of $\widehat{\mathrm{ROC}}(t)$ given the total sample size, $N$. This is equivalent to minimizing the size of the study for fixed $V(t)$.
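As a minimal sketch of how the optimal ratio might be evaluated in practice, the Python functions below compute the ratio as the square root of ROC(t)(1 – ROC(t))/(t(1 – t)) divided by the ROC slope, together with the two-term asymptotic variance it minimizes. The function and argument names are illustrative assumptions, not part of the original analysis, and the inputs are taken to be known or previously estimated.

```python
import math


def optimal_case_control_ratio(t, roc_t, roc_slope):
    """Case-to-control ratio minimizing the asymptotic variance of the
    empirical ROC estimate at false-positive fraction t.

    Assumed form: sqrt(ROC(t)(1 - ROC(t)) / (t(1 - t))) / ROC'(t).
    """
    if not (0.0 < t < 1.0 and 0.0 < roc_t < 1.0 and roc_slope > 0.0):
        raise ValueError("need 0 < t < 1, 0 < ROC(t) < 1 and ROC'(t) > 0")
    return math.sqrt(roc_t * (1.0 - roc_t) / (t * (1.0 - t))) / roc_slope


def roc_variance(t, roc_t, roc_slope, n_cases, n_controls):
    """Asymptotic variance of the empirical ROC estimate at t:
    a case term plus a control term scaled by the squared slope."""
    case_term = roc_t * (1.0 - roc_t) / n_cases
    control_term = roc_slope ** 2 * t * (1.0 - t) / n_controls
    return case_term + control_term


def split_sample(total_n, ratio):
    """Split a fixed total sample size into (cases, controls) so that
    cases / controls equals the given ratio (values are not rounded)."""
    n_cases = total_n * ratio / (1.0 + ratio)
    return n_cases, total_n - n_cases
```

For instance, with the hypothetical inputs t = 0.1, ROC(t) = 0.5 and a slope of 2, `optimal_case_control_ratio(0.1, 0.5, 2.0)` gives 5/6, and the variance at the corresponding split of N = 300 is smaller than at an equal split.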

The optimal case–control ratio can be combined with existing sample size calculations (Zhou et al., 2002, Chapter 6; Pepe, 2003, Chapter 8) to determine $n_D$ and $n_{\bar D}$ that optimize efficiency while guaranteeing the required power, $1 - \beta$, and type-I error, $\alpha$. Standard sample size calculations for testing $H_0$: ROC(t) ≤ $v_0$ against the alternative $H_1$: ROC(t) = $v_1$, where $t$ is the fixed FPF, $v_0$ is the minimal TPF of interest, and $v_1$ is the TPF alternative, require that (Pepe, 2003, p. 223)

$$V(t) = \left\{\frac{v_1 - v_0}{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)}\right\}^2, \tag{2.4}$$

where $V(t)$ is the asymptotic variance of $\widehat{\mathrm{ROC}}(t)$ given in (2.2) and $\Phi$ is the standard normal cumulative distribution function. These expressions, together with the optimal ratio, yield the desired values of $n_D$ and $n_{\bar D}$. Observe that no additional parameters are involved in the optimal ratio beyond those required for the standard sample size calculation.
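The following sketch shows one way the two ingredients could be combined. It assumes that the power requirement fixes the target variance at ((v1 – v0)/(Φ⁻¹(1 – α) + Φ⁻¹(1 – β)))², that the variance is evaluated at the alternative v1, and that sample sizes are rounded up; the function name and the rounding rule are our own illustrative choices.

```python
import math
from statistics import NormalDist


def optimal_sample_sizes(t, v0, v1, roc_slope, alpha=0.05, beta=0.10):
    """Return (n_cases, n_controls, ratio) for testing ROC(t) <= v0
    against ROC(t) = v1 with the variance-minimizing case-control ratio.

    Assumptions: one-sided test, asymptotic normality, variance evaluated
    at the alternative v1, sizes rounded up to whole subjects.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(1.0 - beta)
    target_variance = ((v1 - v0) / z_sum) ** 2

    case_part = v1 * (1.0 - v1)                   # numerator of the case term
    control_part = roc_slope ** 2 * t * (1.0 - t)  # numerator of the control term
    ratio = math.sqrt(case_part / control_part)    # optimal n_cases / n_controls

    # Solve case_part / (ratio * n) + control_part / n = target_variance for n.
    n_controls = (case_part / ratio + control_part) / target_variance
    n_cases = ratio * n_controls
    return math.ceil(n_cases), math.ceil(n_controls), ratio
```

With the hypothetical design values t = 0.10, v0 = 0.60, v1 = 0.80 and a slope of 1.0, the call `optimal_sample_sizes(0.10, 0.60, 0.80, 1.0)` returns 60 cases and 45 controls, a ratio of 4/3.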


    3. ESTIMATING THE OPTIMAL RATIO USING PILOT DATA
 
In this section, we propose a non-parametric method for estimating the optimal case–control ratio on the basis of case–control pilot data. Having chosen the target FPF = t, the task is to estimate ROC(t) and ROC'(t). We estimate ROC(t) non-parametrically using the empirical estimator, or a smooth version (Peng and Zhou, 2004). To estimate ROC'(t), we exploit its relationship with the likelihood ratio function, $\mathrm{LR}(y) = f_D(y)/f_{\bar D}(y)$, evaluated at $y = S_{\bar D}^{-1}(t)$:

$$\mathrm{ROC}'(t) = \frac{f_D\bigl(S_{\bar D}^{-1}(t)\bigr)}{f_{\bar D}\bigl(S_{\bar D}^{-1}(t)\bigr)} = \mathrm{LR}\bigl(S_{\bar D}^{-1}(t)\bigr).$$

We estimate ROC'(t) using a ratio of kernel density estimates for $f_D$ and $f_{\bar D}$, evaluated at $\hat S_{\bar D}^{-1}(t)$, an empirical or parametric estimate of the $(1 - t)$ quantile for $Y_{\bar D}$. In Section 7, we illustrate this estimation method using a Barrett's esophagus dataset.

Clearly, the estimates of the parameters involved in the optimal case–control ratio have uncertainty. With a small pilot study, this uncertainty may be substantial, and the ROC slope estimator in particular may be sensitive to the choice of kernel or bandwidth. In practice, we recommend that one err on the conservative side and consider a range of plausible estimates consistent with the pilot data.
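The estimation steps of this section can be sketched as follows, using only the Python standard library. The Gaussian kernel and the rule-of-thumb bandwidth (1.06 × sample standard deviation × n^(–1/5)) are assumptions made for illustration; the text does not specify a kernel or bandwidth, and all function names are hypothetical.

```python
import math
from statistics import NormalDist, stdev


def gaussian_kde(sample, bandwidth=None):
    """Return a Gaussian kernel density estimate as a function of y.
    The default bandwidth is a rule of thumb (an assumption, not the
    authors' choice)."""
    n = len(sample)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(sample) * n ** (-0.2)
    kernel = NormalDist()

    def density(y):
        return sum(kernel.pdf((y - x) / bandwidth) for x in sample) / (n * bandwidth)

    return density


def empirical_upper_quantile(sample, t):
    """Smallest observed threshold c such that the fraction of the
    sample strictly above c is at most t."""
    ordered = sorted(sample)
    n = len(ordered)
    k = math.ceil(round((1.0 - t) * n, 9)) - 1  # round() guards float noise
    return ordered[max(0, min(k, n - 1))]


def estimate_roc_and_slope(cases, controls, t, bandwidth=None):
    """Estimate ROC(t) empirically and ROC'(t) as the ratio of case to
    control density estimates at the control threshold for FPF t."""
    threshold = empirical_upper_quantile(controls, t)
    roc_hat = sum(y > threshold for y in cases) / len(cases)
    slope_hat = (gaussian_kde(cases, bandwidth)(threshold)
                 / gaussian_kde(controls, bandwidth)(threshold))
    return roc_hat, slope_hat, threshold
```

Repeating the call over a grid of bandwidths is one simple way to gauge the sensitivity discussed above before settling on a range of plausible slope estimates.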


    4. THE OPTIMAL RATIO IN A CLASSIC DISTRIBUTIONAL SETTING
 
The binormal ROC curve is the ROC curve induced by normal marker distributions, $Y_{\bar D} \sim N(0, 1)$ and $Y_D \sim N(\mu_D, \sigma_D^2)$, where without loss of generality we can assume $Y_{\bar D}$ has mean 0 and variance 1. However, it is more general; this is the ROC curve for any marker, $Y$, for which there exists a monotone function that transforms both $Y_D$ and $Y_{\bar D}$ to normality (Swets and Pickett, 1982; Swets, 1986; Hanley, 1988, 1996; Pepe, 2003, pp. 84–85). The binormal ROC is

$$\mathrm{ROC}(t) = \Phi\left(\frac{\mu_D + \Phi^{-1}(t)}{\sigma_D}\right),$$

where $\Phi$ is the standard normal cumulative distribution function. Using (2.3), the corresponding optimal ratio is

$$\rho_{\mathrm{opt}}(t) = \frac{\sigma_D\, \phi\bigl(\Phi^{-1}(t)\bigr)}{\phi\left(\frac{\mu_D + \Phi^{-1}(t)}{\sigma_D}\right)} \sqrt{\frac{\Phi\left(\frac{\mu_D + \Phi^{-1}(t)}{\sigma_D}\right)\left\{1 - \Phi\left(\frac{\mu_D + \Phi^{-1}(t)}{\sigma_D}\right)\right\}}{t(1 - t)}}, \tag{4.1}$$

where $\phi$ is the standard normal density function. We emphasize that this is the optimal ratio for estimating ROC(t) non-parametrically, as is typical in practice. Here, the binormal model simply provides a context in which to illustrate how the height and steepness of the ROC curve affect the optimal ratio.

A variety of binormal ROC curves, indexed by $\mu_D$, are shown in Figure 1. The parameter $\sigma_D$ is fixed at unity, guaranteeing a concave ROC curve. We show in Figure 2 the optimal ratio as a function of $\mu_D$, for several values of $t$, the FPF of interest. Observe that $\rho_{\mathrm{opt}}(t)$ is not a monotone function of $\mu_D$: it decreases, then increases. This is primarily due to the first component of the optimal ratio, $1/\mathrm{ROC}'(t)$, as illustrated in Figures 2(a) and (b), which display the optimal ratio and its components as functions of $\mu_D$. The optimal ratio decreases and then increases with $\mu_D$ because, for a fixed $t$, the slopes of the ROC curves increase and then decrease. The ‘turning point’, the value of $\mu_D$ at which the slopes begin to decrease, is lower for larger $t$. The factor $S$, the square-root term in (2.3), generally has the effect of amplifying $1/\mathrm{ROC}'(t)$. Hence, we can view the optimal ratio as simply being inversely proportional to the slope of the ROC curve.


Fig. 1. A range of binormal ROC curves, ROC(t) = $\Phi(\mu_D + \Phi^{-1}(t))$, indexed by $\mu_D$.

 

Fig. 2. For the binormal ROC curves in Figure 1, where ROC(t) = $\Phi(\mu_D + \Phi^{-1}(t))$, the optimal case–control ratio and its components, shown in panels (a) and (b) as functions of $\mu_D$, for various values of the fixed FPF.

 
The association shown here between $\rho_{\mathrm{opt}}(t)$ and $\mu_D$ is valid for all concave binormal models. We have found that it also holds for other concave ROC models, such as those induced by exponential or logistic models for $Y_D$ and $Y_{\bar D}$.
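The non-monotone behavior described in this section can be explored numerically. The sketch below assumes the binormal form ROC(t) = Φ((µD + Φ⁻¹(t))/σD) and the ratio formula of Section 2; the function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_roc(t, mu_d, sigma_d=1.0):
    """Binormal ROC curve evaluated at false-positive fraction t."""
    return STD_NORMAL.cdf((mu_d + STD_NORMAL.inv_cdf(t)) / sigma_d)


def binormal_roc_slope(t, mu_d, sigma_d=1.0):
    """Derivative of the binormal ROC curve with respect to t."""
    z = STD_NORMAL.inv_cdf(t)
    return STD_NORMAL.pdf((mu_d + z) / sigma_d) / (sigma_d * STD_NORMAL.pdf(z))


def binormal_optimal_ratio(t, mu_d, sigma_d=1.0):
    """Optimal case-to-control ratio under the binormal model:
    sqrt(ROC(1 - ROC) / (t(1 - t))) divided by the ROC slope."""
    roc = binormal_roc(t, mu_d, sigma_d)
    scale = math.sqrt(roc * (1.0 - roc) / (t * (1.0 - t)))
    return scale / binormal_roc_slope(t, mu_d, sigma_d)


if __name__ == "__main__":
    # Trace the ratio over a grid of mu_D values at a fixed FPF.
    for mu in (0.0, 0.75, 1.5, 2.25, 3.0, 4.0):
        print(mu, round(binormal_optimal_ratio(0.05, mu), 3))
```

At µD = 0 the curve reduces to ROC(t) = t and the ratio is 1; it then dips below 1 for moderate µD and rises above 1 for very accurate markers, consistent with the decrease-then-increase pattern noted above.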


    5. EFFICIENCY LOSS ASSOCIATED WITH SUB-OPTIMAL CASE–CONTROL ALLOCATION
 
We assess efficiency loss associated with a sub-optimal case–control ratio by comparing the variance of $\widehat{\mathrm{ROC}}(t)$ with the optimal case–control ratio to the variance with a sub-optimal ratio, assuming a constant total sample size, $N$. Letting $n_D^{o}$ and $n_D$ denote the optimal and sub-optimal numbers of cases and $V_o(t)$ and $V(t)$ the corresponding variances of $\widehat{\mathrm{ROC}}(t)$, the relative efficiency is

$$\frac{V_o(t)}{V(t)} = \frac{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}/n_D^{o} + \{\mathrm{ROC}'(t)\}^2\, t(1 - t)/(N - n_D^{o})}{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}/n_D + \{\mathrm{ROC}'(t)\}^2\, t(1 - t)/(N - n_D)}.$$

Using the optimal ratio expression, (2.3), this simplifies to

$$\frac{V_o(t)}{V(t)} = \frac{\{1 + \rho_{\mathrm{opt}}(t)\}^2}{\rho_{\mathrm{opt}}(t)^2/\gamma + 1/(1 - \gamma)}, \tag{5.1}$$

where $\gamma = n_D/N$ is the fraction of cases in the sub-optimal design. Observe that this expression is invariant with respect to $N$, and for the chosen $t$ depends only on ROC(t), ROC'(t), and $\gamma$. Figure 3 shows the loss of efficiency associated with sub-optimal case–control allocation under the binormal model, when $\mu_D$ = 0.75, 1.50, 2.25 (see the corresponding ROC curves in Figure 1). The relative efficiency $V_o(t)/V(t)$ is plotted against the percent cases, $100\gamma$, for a low FPF of t = 0.01 and a moderate value, t = 0.25. The relative efficiency is equal to 1 when the percent cases is $100\, n_D^{o}/N$ and decreases monotonically on either side. Efficiency losses can be substantial. For example, when $\mu_D$ = 2.25 and t = 0.01, 25% cases is optimal; with 50% cases, efficiency is just 80%. When t = 0.25, 60% cases is optimal; with only 40% cases, efficiency is 86%.
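The efficiency figures quoted above can be checked with a short calculation. The sketch below assumes that relative efficiency is the ratio of the minimized variance to the variance at a given case fraction for the same N, which reduces to (1 + ρ)² / (ρ²/γ + 1/(1 – γ)) with ρ the optimal ratio and γ the fraction of cases; the binormal helper and all names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def binormal_optimal_ratio(t, mu_d, sigma_d=1.0):
    """Optimal case-to-control ratio for a binormal ROC curve
    ROC(t) = Phi((mu_d + Phi^{-1}(t)) / sigma_d)."""
    z = STD_NORMAL.inv_cdf(t)
    roc = STD_NORMAL.cdf((mu_d + z) / sigma_d)
    slope = STD_NORMAL.pdf((mu_d + z) / sigma_d) / (sigma_d * STD_NORMAL.pdf(z))
    return math.sqrt(roc * (1.0 - roc) / (t * (1.0 - t))) / slope


def relative_efficiency(rho_opt, case_fraction):
    """Variance under the optimal allocation divided by the variance when
    a fraction `case_fraction` of the same total sample are cases."""
    if not 0.0 < case_fraction < 1.0:
        raise ValueError("case_fraction must lie strictly between 0 and 1")
    denominator = rho_opt ** 2 / case_fraction + 1.0 / (1.0 - case_fraction)
    return (1.0 + rho_opt) ** 2 / denominator


if __name__ == "__main__":
    for t, fraction in ((0.01, 0.50), (0.25, 0.40)):
        rho = binormal_optimal_ratio(t, 2.25)
        print(t, round(rho / (1.0 + rho), 2),
              round(relative_efficiency(rho, fraction), 2))
```

Under these assumptions the loop reproduces the values in the text for µD = 2.25: about 25% cases is optimal at t = 0.01 with 80% efficiency at an equal split, and about 60% cases is optimal at t = 0.25 with 86% efficiency at 40% cases.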


Fig. 3. The relative efficiency of a design with an optimal case–control ratio relative to a sub-optimal design, as a function of the percent cases in the sub-optimal design. Lines are drawn for three binormal models, where ROC(t) = $\Phi(\mu_D + \Phi^{-1}(t))$, and $\mu_D$ takes the values 0.75, 1.50, 2.25. In (a), the FPF is 0.01, and in (b), the FPF is 0.25.

 

    6. THE OPTIMAL CASE–CONTROL RATIOS FOR ESTIMATING THE AUC AND PAUC
 
6.1 Estimating the AUC

The AUC is a popular summary index for the ROC curve. It has several drawbacks, including the fact that it treats false-positive and false-negative errors equally and summarizes the entire ROC curve when frequently only a portion is of interest (McClish, 1989; Thompson and Zuccini, 1989; Pepe, 2003, Section 4.3). It is nevertheless commonly used and has some desirable mathematical properties (Bamber, 1975; Hanley and McNeil, 1982; Pepe, 2003, Section 4.3). We derive the optimal case–control ratio for estimating the AUC empirically. The asymptotic variance of the empirical AUC (DeLong et al., 1988) is

$$\mathrm{var}\bigl(\widehat{\mathrm{AUC}}\bigr) = \frac{\mathrm{var}\{S_{\bar D}(Y_D)\}}{n_D} + \frac{\mathrm{var}\{S_D(Y_{\bar D})\}}{n_{\bar D}}.$$

With $N$ fixed, this is minimized when

$$\frac{n_D}{n_{\bar D}} = \rho_{\mathrm{opt}}^{\mathrm{AUC}} = \sqrt{\frac{\mathrm{var}\{S_{\bar D}(Y_D)\}}{\mathrm{var}\{S_D(Y_{\bar D})\}}}. \tag{6.1}$$

This result was first pointed out by Hanley and Hajian-Tilaki (1997).

Observe that $\rho_{\mathrm{opt}}^{\mathrm{AUC}}$ depends on the variances of two random variables, called placement values (Hanley and Hajian-Tilaki, 1997; DeLong et al., 1988; Pepe, 2003, pp. 105–106; Pepe and Cai, 2004). $S_{\bar D}(Y_D)$ is the placement value of a case observation in the control population, and $S_D(Y_{\bar D})$ is the placement value of a control observation in the case population. For good discrimination, $S_{\bar D}(Y_D)$ should take small values and $S_D(Y_{\bar D})$ large values. When estimating the AUC, more cases than controls are required if the case placement values are more variable than the control placement values. To estimate this optimal ratio using pilot data, we estimate the placement values empirically (or perhaps parametrically) and then calculate the sample variances.

In contrast to estimation of the ROC curve, there are many settings where equal numbers of cases and controls are optimal for estimating the AUC. All that is required is that the case and control placement values have the same variance. If there exists a monotone increasing transformation, $g$, such that $g(Y_D)$ and $g(Y_{\bar D})$ follow a symmetric location-scale distribution with the same scale parameter, the height of the ROC curve and the value of the AUC are irrelevant: sampling equal numbers of cases and controls is optimal (see Appendix). One such example is when ROC(t) = t (in this case, the case and control placement values actually have the same distribution). This result also implies that a case–control ratio of 1 is optimal when $g(Y_D)$ and $g(Y_{\bar D})$ are both normally distributed for some function $g$ with the same scale and different location parameters. This includes all the scenarios shown in Figure 1.

We emphasize that the optimal ratio for estimating the AUC may be entirely different than the optimal ratio for estimating ROC(t), a consequence of the fact that these are two different measures of classification accuracy.

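Calculating the optimal ratio in (6.1), $\rho_{\mathrm{opt}}^{\mathrm{AUC}}$, may be combined with choosing the overall sample size, as was done for the ROC curve. Briefly, when $\mathrm{AUC}_0$ is a minimally acceptable value for the AUC, in order to achieve power $1 - \beta$ and type-I error $\alpha$ when the AUC is equal to $\mathrm{AUC}_1$, we require that (Pepe, 2003, p. 226)

$$\mathrm{var}\bigl(\widehat{\mathrm{AUC}}\bigr) = \left\{\frac{\mathrm{AUC}_1 - \mathrm{AUC}_0}{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)}\right\}^2. \tag{6.2}$$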

This expression and (6.1) determine the sample sizes.
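As an illustration of the pilot-data calculation described in this subsection, the sketch below estimates both sets of placement values empirically and takes the square root of the ratio of their sample variances. It assumes the optimal ratio is the square root of var(case placement values) over var(control placement values); the helper names are hypothetical.

```python
import math
from bisect import bisect_right
from statistics import variance


def empirical_survivor(sample):
    """Return the empirical survivor function y -> fraction of sample > y."""
    ordered = sorted(sample)
    n = len(ordered)

    def survivor(y):
        return (n - bisect_right(ordered, y)) / n

    return survivor


def auc_optimal_ratio(cases, controls):
    """Estimate the case-to-control ratio minimizing the asymptotic
    variance of the empirical AUC from pilot marker values."""
    control_survivor = empirical_survivor(controls)
    case_survivor = empirical_survivor(cases)
    # Placement of each case among the controls, and of each control
    # among the cases.
    case_placements = [control_survivor(y) for y in cases]
    control_placements = [case_survivor(y) for y in controls]
    return math.sqrt(variance(case_placements) / variance(control_placements))
```

When the case values are simply a shifted copy of a symmetric set of control values, the two sets of placement values mirror each other and the estimated ratio is 1, in line with the location-shift result stated above.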

6.2 Estimating the pAUC

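The partial area under the curve, or pAUC, is the area under a specified region of the ROC curve,

$$\mathrm{pAUC}_{t_0,t_1} = \int_{t_0}^{t_1} \mathrm{ROC}(t)\, dt.$$

It can be used to summarize the performance of the marker over an acceptable range of FPFs (McClish, 1989; Thompson and Zuccini, 1989). The non-parametric estimator of the pAUC (Dodd and Pepe, 2003) is

$$\widehat{\mathrm{pAUC}}_{t_0,t_1} = \frac{1}{n_D n_{\bar D}} \sum_{i=1}^{n_D} \sum_{j=1}^{n_{\bar D}} I\bigl[Y_{D,i} > Y_{\bar D,j},\ Y_{\bar D,j} \in (\hat q_1, \hat q_0)\bigr],$$

where $\hat q_k = \hat S_{\bar D}^{-1}(t_k)$, $k = 0, 1$. This is approximately asymptotically equivalent to (Dodd, 2001)

$$\frac{1}{n_D n_{\bar D}} \sum_{i=1}^{n_D} \sum_{j=1}^{n_{\bar D}} I\bigl[Y_{D,i} > Y_{\bar D,j},\ Y_{\bar D,j} \in (q_1, q_0)\bigr],$$

which has asymptotic variance

$$\frac{\mathrm{var}\{S_{\bar D}^{r}(Y_D)\}}{n_D} + \frac{\mathrm{var}\{S_D^{r}(Y_{\bar D})\}}{n_{\bar D}}, \tag{6.3}$$

where $S_{\bar D}^{r}(Y_D) = P[Y_{\bar D} > Y_D,\ Y_{\bar D} \in (q_1, q_0) \mid Y_D]$ and $S_D^{r}(Y_{\bar D}) = S_D(Y_{\bar D})\, I[Y_{\bar D} \in (q_1, q_0)]$ are restricted placement values (Dodd and Pepe, 2003). For fixed $N$, the variance is minimized when

$$\frac{n_D}{n_{\bar D}} = \rho_{\mathrm{opt}}^{\mathrm{pAUC}} = \sqrt{\frac{\mathrm{var}\{S_{\bar D}^{r}(Y_D)\}}{\mathrm{var}\{S_D^{r}(Y_{\bar D})\}}}. \tag{6.4}$$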

Observe that this optimal ratio depends on the FPF region of interest, and will be greater than unity if restricted case placement values are more variable than those for controls. The ratio will equal 1 if and only if the restricted placement values have equal variance. This will occur much less frequently than will equal variance of the unrestricted placement values. For example, a disease that (after some transformation) causes a simple shift in a symmetric location-scale distribution need not have restricted placement values with equal variance, even though its unrestricted placement values do.

Note that the variance in (6.3) used to derive the optimal ratio in (6.4), $\rho_{\mathrm{opt}}^{\mathrm{pAUC}}$, holds only approximately, since it does not include the variance due to estimating the two quantiles, $q_1$ and $q_0$ (Dodd and Pepe, 2003). This second component tends to be small. Nevertheless, in practice, one might check using simulations that $\rho_{\mathrm{opt}}^{\mathrm{pAUC}}$ truly minimizes the variance of the estimator.

Estimating $\rho_{\mathrm{opt}}^{\mathrm{pAUC}}$ is straightforward, using sample variances of the empirical (or parametrically estimated) restricted placement values. Calculation of this optimal ratio can also be combined with sample size calculations. When $\mathrm{pAUC}_0$ and $\mathrm{pAUC}_1$ are the null and alternative values for $\mathrm{pAUC}_{t_0,t_1}$, we require that

$$\mathrm{var}\bigl(\widehat{\mathrm{pAUC}}_{t_0,t_1}\bigr) = \left\{\frac{\mathrm{pAUC}_1 - \mathrm{pAUC}_0}{\Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)}\right\}^2$$

in order to achieve power $1 - \beta$ and type-I error $\alpha$.
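A rough sketch of the pilot-data calculation for the pAUC follows. It assumes the estimator averages, over case-control pairs, the indicator that the case exceeds the control and the control lies between the two control quantiles; the restricted placement values are taken to be the per-case and per-control averages of that indicator. The half-open interval (q1, q0], the quantile rule, and all names are implementation choices of ours.

```python
import math
from bisect import bisect_left, bisect_right
from statistics import mean, variance


def upper_quantile(sorted_sample, t):
    """Smallest observed c with at most a fraction t of the sample above c."""
    n = len(sorted_sample)
    k = math.ceil(round((1.0 - t) * n, 9)) - 1  # round() guards float noise
    return sorted_sample[max(0, min(k, n - 1))]


def pauc_optimal_ratio(cases, controls, t0, t1):
    """Return (pAUC estimate, estimated optimal case-to-control ratio)
    for the FPF range (t0, t1), from pilot marker values."""
    ordered_controls = sorted(controls)
    ordered_cases = sorted(cases)
    n_controls, n_cases = len(ordered_controls), len(ordered_cases)
    q1 = upper_quantile(ordered_controls, t1)  # lower threshold (FPF t1)
    q0 = upper_quantile(ordered_controls, t0)  # upper threshold (FPF t0)
    region = [y for y in ordered_controls if q1 < y <= q0]

    # Per-case average: share of all controls inside the region and below
    # the case. Its variance equals that of the complementary share above.
    case_values = [bisect_left(region, y) / n_controls for y in cases]
    # Per-control average: share of cases above the control if the control
    # lies in the region, and 0 otherwise.
    control_values = [
        (n_cases - bisect_right(ordered_cases, y)) / n_cases if q1 < y <= q0 else 0.0
        for y in controls
    ]
    ratio = math.sqrt(variance(case_values) / variance(control_values))
    return mean(case_values), ratio
```

Because this ignores the variability of the estimated quantiles, as noted above, a simulation over candidate ratios remains a sensible check on the value it returns.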


    7. ILLUSTRATION
 
Barrett's esophagus is a condition that develops in some patients with gastroesophageal reflux disease, in which the normal lining of the esophagus becomes replaced with a red-cell lining called ‘Barrett's esophagus’. Certain cells within this new lining progress to become a cancer of the lower esophagus, or adenocarcinoma.

The current study concerns a new potential diagnostic marker for adenocarcinoma in Barrett's esophagus patients, which we call $Y$. In a pilot study, 71 cases and 55 controls had marker measurements taken from endoscopy tissue. Our goal is to estimate the numbers of cases and controls that are needed in a future validation study of the marker.

Since work-up procedures following a positive screening test will require the patient to undergo invasive repeat endoscopies, it is important that the false-positive rate be low. The investigator seeks a marker for adenocarcinoma that has an FPF of no more than 10%. We calculate the sample sizes for estimating the ROC curve at FPFs of 5% and 10%. According to the pilot data, the empirical ROC estimates are $\widehat{\mathrm{ROC}}(0.05) = 0.28$ and $\widehat{\mathrm{ROC}}(0.10) = 0.37$ (see Figure 4[a]). We also estimate $S_{\bar D}^{-1}(t)$ empirically for t = 0.05 and t = 0.10, and evaluate the kernel density estimates for $f_D$ and $f_{\bar D}$ at these thresholds (see Figure 4[b]). The final slope estimates are 3.36 and 0.96.


Fig. 4. Barrett's esophagus data: (a) The empirical ROC curve for $Y$, for $t \in [0, 0.3]$; (b) Kernel density estimates of $Y$.

 
We next calculate the optimal sample sizes. Although not ideal, the marker will be considered minimally useful clinically if it detects 15% of cancers when the FPF is 0.05 ($v_0$ = 0.15), or 25% of cancers when the FPF is 0.10 ($v_0$ = 0.25). The observed TPFs at t = 0.05 and t = 0.10 are used as the anticipated alternative values. In order to have 90% power and 5% type-I error, we require $(n_D, n_{\bar D}) = (262, 427)$ and $(n_D, n_{\bar D}) = (235, 140)$ at t = 0.05 and t = 0.10, respectively (see Table 1). The optimal ratios at these two FPFs are 0.61 and 1.68. Interestingly, more cases are required than controls at t = 0.10, while fewer cases than controls are required at t = 0.05. The projected variance of $\widehat{\mathrm{ROC}}(t)$ in the future study can also be calculated using (2.2) (see Table 1). Observe that this variance depends on the estimates of ROC(t) and ROC'(t) and the sample sizes.
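The two optimal ratios can be checked by hand. The arithmetic below is an illustrative check that substitutes the rounded pilot estimates quoted in this section into the ratio formula, assumed here to be the square root of ROC(t)(1 – ROC(t))/(t(1 – t)) divided by ROC'(t); because the inputs are rounded, the final digit is approximate.

```latex
\begin{align*}
\rho_{\mathrm{opt}}(t) &= \frac{1}{\mathrm{ROC}'(t)}
  \sqrt{\frac{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}}{t(1 - t)}}, \\
\rho_{\mathrm{opt}}(0.05) &\approx \frac{1}{3.36}
  \sqrt{\frac{0.28 \times 0.72}{0.05 \times 0.95}}
  = \frac{1}{3.36}\sqrt{\frac{0.2016}{0.0475}}
  \approx \frac{2.060}{3.36} \approx 0.61, \\
\rho_{\mathrm{opt}}(0.10) &\approx \frac{1}{0.96}
  \sqrt{\frac{0.37 \times 0.63}{0.10 \times 0.90}}
  = \frac{1}{0.96}\sqrt{\frac{0.2331}{0.0900}}
  \approx \frac{1.609}{0.96} \approx 1.68.
\end{align*}
```

The steep slope at t = 0.05 drives the ratio below 1, while the shallower slope at t = 0.10 lets the square-root factor dominate and pushes the ratio above 1.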


Table 1. Barrett's esophagus data: optimal ratios, sample sizes, and variances

 
We also calculate the sample sizes that are required to estimate the pAUC from t = 0.0 to t = 0.1. Using sample variances of empirical restricted placement values, we calculate that the optimal case–control ratio for estimating the pAUC is 0.34. This case–control ratio is much smaller than those pertaining to estimation of the ROC curve at our two FPFs of interest. In order to calculate the raw sample sizes required, we use a null pAUC of 0.01 (which corresponds to the binormal ROC curve passing through the null values for ROC[0.05] and ROC[0.10]) and an alternative pAUC of 0.02 (the observed non-parametric estimate). In order to have 90% power and 5% type-I error, we require $n_D$ = 314 cases and $n_{\bar D}$ = 928 controls (see Table 1). The large sample size is a consequence of the fact that the null and alternative pAUC values are close together. We find that the choice of summary measure greatly affects the optimal case–control ratio and required sample sizes.


    8. DISCUSSION
 
Case–control studies have long been used in epidemiology to evaluate associations between disease and exposure. Odds ratios are typically used to quantify association. It has been demonstrated, however, that appropriate evaluation of case–control biomarker studies is fundamentally different from that of case–control association studies (Pepe et al., 2004). The odds ratio does not characterize classification accuracy. Accordingly, sample size calculations for case–control biomarker studies must be based on estimating parameters that relate to classification accuracy. Zhou et al. (2002, Chapter 6) and Pepe (2003, Chapter 8) document sample size calculations that fix the case–control ratio arbitrarily. In this paper, we develop methods that select the ratio optimally.

We have shown the optimal case–control ratio for estimating the ROC curve at a given FPF = t to be a simple function of t and the ROC curve and its derivative at t. In general, the optimal ratio is simply inversely proportional to the slope of the ROC curve at t, and the use of a sub-optimal case–control ratio can be associated with substantial loss of efficiency. We also derived the optimal case–control allocations for estimating the AUC and pAUC.

The sample size and the optimal ratio are highly dependent on the accuracy parameter chosen as the basis of the study design. This was made clear in the example of Section 7, where we observed that estimation of the pAUC required a much larger sample size than did estimation of the ROC curve at a fixed point (N = 1242 versus N = 375). We have also found in practice that estimating the AUC requires a much smaller sample size than does estimating a specific point on the ROC curve. Such small studies, however, will not have the appropriate power to draw adequate conclusions about specific points on the ROC curve. In most situations, the ROC curve is of more interest since it has more clinical meaning than AUC. This relationship between parameter and sample size is not surprising. Moreover, it reinforces the notion that the parameter choice should be based on careful consideration of the study objectives and consequences of study results.

In order to estimate the optimal ratio and the required sample sizes, estimates of the parameters are needed. These estimates incorporate uncertainty into the calculations. Unfortunately, this is so with any sample size calculation. For example, in order to design a clinical trial to evaluate treatment efficacy, we require an estimate of $\sigma^2$, the variance of the outcome. This estimate is often highly uncertain, but is the best that we can do with the available information. Similarly, calculating the case–control ratio for estimating ROC(t), as well as determining the raw sample sizes, requires an estimate of the slope of the ROC curve at t. We have proposed a non-parametric method for estimating the slope which relies on kernel density estimates, and which can be sensitive to the choice of kernel and bandwidth. We have provided suggestions for estimating the ROC slope given limited pilot data. However, there are other existing approaches to estimate the slope; one is to use the form induced by a parametric form for the ROC curve. (Interestingly, this leads to quite a different optimal ratio estimate in our example.) A thorough study of methods to estimate the slope of the ROC curve is warranted.

In calculating the optimal ratio for estimating ROC(t), we have assumed that the goal is to estimate the TPF at a given FPF = t. However, in some situations, a priority is to avoid false-negative errors, and our interest lies in calculating the FPF at a fixed large TPF = v. We note that this can easily be handled using the methods we have described by replacing t with $\mathrm{ROC}^{-1}(v)$ in all the expressions.

We have proposed methods for choosing the case–control ratio and overall sample size in a biomarker study. In some studies, only a fixed number of cases are available, and the real question is how many controls should be sampled. In such circumstances, efficiency cannot be maximized, but the number of controls can be calculated as a function of the desired power and type-I error, using standard sample size formulas.
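For the fixed-cases situation, one hedged way to carry out the calculation is to fix the target variance from the power requirement and solve the two-term variance formula for the number of controls. The sketch assumes the target variance is ((v1 – v0)/(Φ⁻¹(1 – α) + Φ⁻¹(1 – β)))² and that the variance is evaluated at the alternative v1; the function name and the convention of returning None are our own.

```python
import math
from statistics import NormalDist


def controls_for_fixed_cases(n_cases, t, v0, v1, roc_slope,
                             alpha=0.05, beta=0.10):
    """Number of controls needed to reach the target variance for the ROC
    estimate at FPF t when the number of cases is fixed.

    Returns None when the cases alone contribute more variance than the
    target allows, so no number of controls can achieve the power.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1.0 - alpha) + std_normal.inv_cdf(1.0 - beta)
    target_variance = ((v1 - v0) / z_sum) ** 2

    case_term = v1 * (1.0 - v1) / n_cases
    if case_term >= target_variance:
        return None
    control_part = roc_slope ** 2 * t * (1.0 - t)
    return math.ceil(control_part / (target_variance - case_term))
```

With the hypothetical design values t = 0.10, v0 = 0.60, v1 = 0.80 and slope 1.0, 60 available cases call for 45 controls and 100 cases for 30 controls, whereas 30 cases cannot reach 90% power however many controls are added.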

Our methods can be extended in a number of ways. First, we have sought to minimize the variance of the estimator for fixed overall sample size, or equivalently, minimized the sample size for fixed variance. An alternative approach would be to assign different costs to cases and controls, and to minimize the total cost of the study, while fixing the variance. We also focused on minimizing the asymptotic variance of the estimator. While this is sufficient for relatively large biomarker studies, it may not be adequate for small studies. We suggest using small-sample simulation studies with various case–control ratios, and selecting that which minimizes the variance of $\widehat{\mathrm{ROC}}(t)$. Our large-sample results could provide reasonable starting values for such simulations. Finally, in later phase biomarker studies, the task is to compare the new biomarker to the best available markers. Our methods may be extended to derive the optimal case–control ratio and raw sample sizes needed to compare two markers. We leave these questions for future research.


    APPENDIX
 
In terms of estimating ROC(t) at a fixed t of interest, the only case in which $\rho_{\mathrm{opt}}(t) = 1$ for all t is when ROC(t) = t. To see this, note that $\rho_{\mathrm{opt}}(t) = 1$ if and only if $\mathrm{ROC}'(t) = \sqrt{\mathrm{ROC}(t)\{1 - \mathrm{ROC}(t)\}/\{t(1 - t)\}}$. Letting ROC(t) = y, separating variables to obtain $dy/\sqrt{y(1 - y)} = dt/\sqrt{t(1 - t)}$, and integrating both sides, we have arcsin(2y – 1) = arcsin(2t – 1) + c, where c is a constant. The condition y(0) = 0 implies that c = 0, which yields the desired result.

In estimating the AUC, if there exists a monotone increasing transformation, g, such that g(YD) and g(YFormula 10) + µ have survivor function S, where S is symmetric, Formula 10 Assume without loss of generality that S is symmetric about zero, and hence S(y) = 1–S(–y) and f(y) = f(–y), where f is the density corresponding to S. Using SD(YFormula 10) = S(g(YFormula 10)), SFormula 10(YD) = S(g(YD) + µ), and E(SD(YFormula 10)) = P(g(YD) > g(YFormula 10)) = P(YD > YFormula 10), we have

Formula 10

and so

Formula 10
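One way to fill in the steps of this argument is sketched below. It is a reconstruction under stated assumptions rather than the original displays: it assumes that the large-sample variance of the empirical AUC has the form $\operatorname{var}\{S_{\bar D}(Y_D)\}/n_D + \operatorname{var}\{S_D(Y_{\bar D})\}/n_{\bar D}$, so that the variance-minimizing ratio of cases to controls is the square root of the ratio of these two variances. The variable W is a hypothetical random variable with survivor function S, and $\overset{d}{=}$ denotes equality in distribution.

```latex
% W has survivor function S and density f, symmetric about zero, so that
% g(Y_D) \overset{d}{=} W, g(Y_{\bar D}) \overset{d}{=} W - \mu, and -W \overset{d}{=} W.
\begin{align*}
S_{\bar D}(Y_D) &= S\{g(Y_D) + \mu\} \overset{d}{=} S(W + \mu)
   = 1 - S(-W - \mu) \overset{d}{=} 1 - S(W - \mu)
   \overset{d}{=} 1 - S_D(Y_{\bar D}), \\
\operatorname{var}\{S_{\bar D}(Y_D)\} &= \operatorname{var}\{1 - S_D(Y_{\bar D})\}
   = \operatorname{var}\{S_D(Y_{\bar D})\}, \\
\frac{n_D}{n_{\bar D}} &=
   \sqrt{\frac{\operatorname{var}\{S_{\bar D}(Y_D)\}}{\operatorname{var}\{S_D(Y_{\bar D})\}}} = 1 .
\end{align*}
```

The first line uses the symmetry S(y) = 1 − S(−y) together with the fact that W and −W have the same distribution; the remaining lines follow because a random variable and one minus that variable have the same variance.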


    ACKNOWLEDGMENTS
 
This work was supported by R01GM 54438 and U01CA 086368. Conflict of Interest: None declared.


    FOOTNOTES
 
† Work conducted while at the University of Washington, Department of Biostatistics, Seattle, WA, USA.


    Received March 2, 2005; revised November 28, 2005; revised January 17, 2006; accepted for publication January 17, 2006.

