Biostatistics Advance Access originally published online on March 10, 2006
Biostatistics 2006 7(4):585-598; doi:10.1093/biostatistics/kxj027
Published by Oxford University Press 2006.
Pooling biospecimens and limits of detection: effects on ROC curve analysis
Division of Epidemiology, Statistics & Prevention, NICHD, NIH, DHHS, Bethesda, MD 20892, USA and Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA
Division of Epidemiology, Statistics & Prevention, NICHD, NIH, DHHS, Bethesda, MD 20892, USA schistee{at}mail.nih.gov
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Frequently, epidemiological studies deal with two restrictions in the evaluation of biomarkers: cost and instrument sensitivity. Costs can hamper the evaluation of the effectiveness of new biomarkers. In addition, many assays are affected by a limit of detection (LOD), depending on the instrument sensitivity. Two common strategies used to cut costs include taking a random sample of the available samples and pooling biospecimens. We compare the two sampling strategies when an LOD effect exists. These strategies are compared by examining the efficiency of receiver operating characteristic (ROC) curve analysis, specifically the estimation of the area under the ROC curve (AUC) for normally distributed markers. We propose and examine a method to estimate AUC when dealing with data from pooled and unpooled samples where an LOD is in effect. In conclusion, pooling is the most efficient cost-cutting strategy when the LOD affects less than 50% of the data. However, when much more than 50% of the data are affected, utilization of the pooling design is not recommended.
Keywords: Limit of detection; Maximum likelihood; Pooling design; Receiver operating characteristics; Sampling
| 1. INTRODUCTION |
|---|
New biomarkers are continually being researched and developed to detect and prevent various chronic and acute diseases. Biomarkers are distinctive biochemical indicators of biological processes or events that help measure the progress of disease or the effects of treatment. At times, the high cost associated with evaluating these biomarkers can prohibit further investigation. For example, the cost of a single assay measuring polychlorinated biphenyl (PCB) is between $500 and $1000, so only small studies have been able to examine whether PCBs are associated with cancer and with endometriosis (Laden and others, 2001; Laden and Hunter, 1998).
A critical step in biomarker development is the evaluation of its discriminating ability in terms of receiver operating characteristic (ROC) curves (e.g. Faraggi and Reiser, 2002; Shapiro, 1999; Wieand and others, 1989). The most commonly used global index of diagnostic accuracy is the area under the ROC curve (AUC). Bamber (1975) showed that $\mathrm{AUC} = P(X > Y)$, where $X$ and $Y$ denote the marker values of a diseased and a healthy individual, respectively. This can be interpreted as the probability that in a randomly selected pair of healthy and diseased individuals, the diagnostic marker value is higher for the diseased subject. Values of AUC close to 1.0 indicate that the marker has high diagnostic accuracy while a value of 0.5 indicates a noninformative marker which does no better than a random (fair) coin toss.
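The pairwise interpretation of the AUC can be checked numerically. The sketch below is illustrative only: the sample sizes, the seed, and the parameter values (cases from N(1, 1), controls from N(0, 1)) are assumptions chosen for the example, not values from this paper. It counts the proportion of case-control pairs in which the case value is higher and compares it with the closed-form value that holds for normally distributed markers.

```python
import random
from statistics import NormalDist


def empirical_auc(cases, controls):
    """Proportion of (case, control) pairs in which the case value is higher."""
    wins = sum(1 for x in cases for y in controls if x > y)
    return wins / (len(cases) * len(controls))


random.seed(1)
# Hypothetical markers: cases ~ N(1, 1), controls ~ N(0, 1).
cases = [random.gauss(1.0, 1.0) for _ in range(400)]
controls = [random.gauss(0.0, 1.0) for _ in range(400)]

auc_pairs = empirical_auc(cases, controls)
# For two normal markers with unit variances, P(X > Y) = Phi(1 / sqrt(2)).
auc_theory = NormalDist().cdf(1.0 / 2 ** 0.5)
print(round(auc_pairs, 3), round(auc_theory, 3))
```

The two numbers should agree up to sampling error, which shrinks as the number of cases and controls grows.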
In evaluating the discriminating ability of PCBs, for example, we know that analysis is restricted by the high cost of assays and instrument sensitivity is limited by an LOD. Clearly, estimation of the AUC will be biased if we ignore the LOD issue. We propose and evaluate methodology for AUC estimation under different sampling strategies when faced with LOD and cost limitations. Two common sampling strategies used to ease cost restrictions include pooling biospecimens and taking a simple random sample. Pooling involves physically combining individual blood samples and has been found to be a useful way to cut costs and evaluate biomarkers (e.g. Faraggi and others, 2003; Liu and Schisterman, 2003; Schisterman and others, 2005; Weinberg and Umbach, 1999). The pooling strategy reasonably assumes that the measurement of the samples being pooled adequately represents the average of the individual unpooled samples, giving the sample mean the properties associated with a mean of n individual measurements. One advantage of pooling is that the amount of information per assay is increased, while the number of assays and the associated cost needed to evaluate this information remains fixed; whereas, taking a random sample of the data reduces the number of assays that need to be performed, but only uses a fraction of the available information. Generally, however, the density of measured pooled biospecimens involves the complex convolution of density functions of individual biological samples and likelihood methods based on pooled data may not be feasible. On the other hand, the density function of the random sample is the same as the original data, avoiding the complexities that come from pooling.
The effect of pooling on ROC curve analysis without an LOD effect has been investigated by Faraggi and others (2003). The authors showed that, for normally distributed markers, the estimator of the mean based on pooled data is equivalent to that based on the full sample. However, the variance estimator based on pooled data is less efficient than that based on the full sample. For example, Faraggi and others (2003) showed (in terms of efficiency of AUC estimation) that for a true value of
, 200 assays of unpooled samples are equivalent to 110 assays of pooled samples of group size 2. The authors demonstrated that the loss of efficiency due to pooling is not of practical importance for
and for
when the pooling group size is 2. However, evaluation of the AUC estimator based on pooled data subject to an LOD has not been addressed in biostatistical literature.
Schisterman and others (2005b) discussed the benefits of pooling when data are affected by an LOD in the context of estimating the mean of one population. They showed that based on normally distributed data, there is always an interval where the pooling strategy is more efficient than a random sample and sometimes even the full sample, given that inference based on the pooling design provides more numerical information. In our study, we apply maximum likelihood methodology to investigate the joint effect of pooling and LOD on AUC (i.e. the context of testing for separation of two populations). We examine the efficiency of estimation of AUC as a function of the LOD for various sampling strategies. Efficiency is analyzed by comparing the variance between the estimators based on the pooled data and the random sample. We measure loss of information by the change in root mean-squared error (RMSE) of the AUC estimate. We examine the extent of this loss via a simulation study, in which we also investigate the sensitivity of our methodology to departures from normality.
This paper is organized as follows. In Section 2, we formally present notations related to the stated problem. Section 3 presents the maximum likelihood estimator of AUC and the proposed asymptotic distribution of this estimator. Section 4 analyzes efficiency of AUC estimation, dependent on sampling strategy, for various levels of LOD. Section 5 presents a real data example. We give some concluding remarks in Section 6.
| 2. FORMALIZATION OF STATED PROBLEM |
|---|
Let X and Y denote the diagnostic marker measurements for diseased and healthy individuals, respectively. We assume that these measurements follow a normal distribution, i.e.

$$X \sim N(\mu_X, \sigma_X^2), \qquad Y \sim N(\mu_Y, \sigma_Y^2).$$

When $X$ and $Y$ are completely observed, standard estimates of the unknown parameters $\mu_X$, $\sigma_X$, $\mu_Y$, and $\sigma_Y$ are easily obtained. Hence, AUC can then be calculated by replacing the unknown parameters in the following formula with their estimated values:

$$\mathrm{AUC} = \Phi\left(\frac{\mu_X - \mu_Y}{\sqrt{\sigma_X^2 + \sigma_Y^2}}\right), \tag{2.1}$$

where $\Phi$ is the standard normal distribution function. However, when an LOD is in effect, only biomarker values above some threshold $d$ ($d$ is the value of LOD) are observed such that

$$Z_X = \begin{cases} X, & X \ge d,\\ \text{N/A}, & X < d,\end{cases} \qquad Z_Y = \begin{cases} Y, & Y \ge d,\\ \text{N/A}, & Y < d,\end{cases}$$

where N/A (not applicable) represents values less than the threshold value d. Thus, estimation of $\mu_X$, $\sigma_X$, $\mu_Y$, and $\sigma_Y$ ignoring the LOD will lead to biased results. To this end, we utilize maximum likelihood estimation (MLE) as proposed by Gupta (1952). Details of this method are discussed in (3.1). Using the MLEs for the estimation of $(\mu_X, \sigma_X)$ and $(\mu_Y, \sigma_Y)$, the AUC can then be obtained by applying (2.1).
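The following is a minimal sketch of how maximum likelihood estimates under a detection limit might be computed and then plugged into the binormal AUC formula $\Phi\left((\mu_X - \mu_Y)/\sqrt{\sigma_X^2 + \sigma_Y^2}\right)$. It is not the authors' implementation: it reaches the censored-normal MLE through EM iterations rather than by solving Gupta's likelihood equations directly, and the function names, sample sizes, seed, and parameter values are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for its pdf and cdf


def censored_normal_mle(observed, n_censored, d, iters=500):
    """EM iterations for N(mu, sigma^2) when n_censored values fell below d."""
    n = len(observed) + n_censored
    s1 = sum(observed)
    s2 = sum(z * z for z in observed)
    # Start from the naive (biased) estimates that ignore the censoring.
    mu = s1 / len(observed)
    sigma = math.sqrt(max(s2 / len(observed) - mu * mu, 1e-12))
    for _ in range(iters):
        a = (d - mu) / sigma
        lam = STD.pdf(a) / max(STD.cdf(a), 1e-300)
        # Conditional first and second moments of a value known to lie below d.
        m1 = mu - sigma * lam
        m2 = sigma ** 2 * (1 - a * lam - lam ** 2) + m1 ** 2
        mu_new = (s1 + n_censored * m1) / n
        var_new = (s2 + n_censored * m2) / n - mu_new ** 2
        mu, sigma = mu_new, math.sqrt(max(var_new, 1e-12))
    return mu, sigma


def binormal_auc(mu_x, sd_x, mu_y, sd_y):
    """AUC for two normal markers: Phi((mu_x - mu_y) / sqrt(sd_x^2 + sd_y^2))."""
    return STD.cdf((mu_x - mu_y) / math.hypot(sd_x, sd_y))


random.seed(7)
d = -0.5  # hypothetical limit of detection
x_full = [random.gauss(1.0, 1.0) for _ in range(4000)]  # cases
y_full = [random.gauss(0.0, 1.0) for _ in range(4000)]  # controls
x_obs = [v for v in x_full if v >= d]
y_obs = [v for v in y_full if v >= d]

mu_x, sd_x = censored_normal_mle(x_obs, len(x_full) - len(x_obs), d)
mu_y, sd_y = censored_normal_mle(y_obs, len(y_full) - len(y_obs), d)
auc_mle = binormal_auc(mu_x, sd_x, mu_y, sd_y)
naive_mu_y = sum(y_obs) / len(y_obs)  # ignores the LOD, so it is biased upward
print(round(mu_y, 2), round(naive_mu_y, 2), round(auc_mle, 3))
```

In this setup the naive control mean, computed only from values above the limit, sits well above the true value of 0, while the censoring-adjusted estimate should recover it closely.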
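Pooled samples are obtained by randomly grouping individuals of similar disease status into groups of size g. The grouped specimens are combined as pooled samples and are tested as single observations. The pooled sample measurements are considered to be the average of the individual samples. Consider the instance where there are n and m pooled observations available for cases and controls, with groups of size $g_X$ and $g_Y$, respectively. Let $X^p_i$ denote cases and $Y^p_l$ denote controls, such that,

$$X^p_i = \frac{1}{g_X}\sum_{k=(i-1)g_X+1}^{i g_X} X_k,\quad i = 1,\dots,n; \qquad Y^p_l = \frac{1}{g_Y}\sum_{k=(l-1)g_Y+1}^{l g_Y} Y_k,\quad l = 1,\dots,m$$

(with both n and m being integers). By using the additive property of the normal distribution, we obtain that

$$X^p_i \sim N\left(\mu_X, \frac{\sigma_X^2}{g_X}\right), \qquad Y^p_l \sim N\left(\mu_Y, \frac{\sigma_Y^2}{g_Y}\right),$$

where $\mu_X$, $\sigma_X^2$ and $\mu_Y$, $\sigma_Y^2$ are the means and variances of the individual case and control measurements. For simplicity, let $g_X$ and $g_Y$ be equal, $g_X = g_Y = g$. In a manner similar to the unpooled data, the detection limit leads to the definition of the observed sample in the form of

$$Z^p_X = \begin{cases} X^p, & X^p \ge d,\\ \text{N/A}, & X^p < d,\end{cases} \qquad Z^p_Y = \begin{cases} Y^p, & Y^p \ge d,\\ \text{N/A}, & Y^p < d.\end{cases}$$

Since the pooled data ($X^p$ or $Y^p$) follow normal distributions, the technique proposed by Gupta (1952), which corresponds to estimation of the unknown parameters, is still appropriate. Thus, AUC can be estimated by substituting the unknown parameters $\mu_X$, $\sigma_X$, $\mu_Y$, and $\sigma_Y$ with the maximum likelihood estimators based on the observed pooled samples.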
We will use the subscript j to denote whether the estimators are being computed from the full sample (
), the pooling sample (
), or the random sample (
). Thus, we specify
,
,
, and
as maximum likelihood estimators based upon
![]() |
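The averaging assumption behind pooling can be illustrated with a short simulation. This is a sketch under assumed values (mean 10, standard deviation 2, pools of size 4, and the helper name `pool` are all hypothetical): it averages consecutive groups of g independent normal draws and compares the variance of the pooled values with that of the individual values, which should differ by a factor of about g.

```python
import random
from statistics import mean, pvariance


def pool(values, g):
    """Average consecutive groups of size g; leftover values are dropped."""
    n_pools = len(values) // g
    return [mean(values[i * g:(i + 1) * g]) for i in range(n_pools)]


random.seed(3)
sigma, g = 2.0, 4
# The draws are independent, so consecutive grouping acts as random grouping.
individual = [random.gauss(10.0, sigma) for _ in range(8000)]
pooled = pool(individual, g)

var_individual = pvariance(individual)
var_pooled = pvariance(pooled)
print(len(pooled), round(var_individual, 2), round(var_pooled, 2))
```

With real specimens the individuals would be shuffled within disease status before grouping, and physical pooling may add measurement error that this idealized average does not capture.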
| 3. MLE UNDER POOLING AND LOD |
|---|
Let
denote the number of elements of sets
, where
, respectively.
Similarly, we define
as the number of unobserved measurements in these samples. Depending on j and k, the log likelihood functions based on full data, the pooled data, and the random sample are
![]() | (3.1) |
where
and
is an individual data point (not N/A) of the considered data sets where
and
Therefore, the likelihood equations are
![]() |
where
![]() |
Solving this system of equations yields the MLEs for the means $\mu$ and the standard deviations $\sigma$, adjusted for pooling and LOD. Certainly, the statistical properties of the estimators depend on the number of observations above the LOD. Since pooling reduces the variability, if $d < \mu$ the probability that a pooled measurement falls below $d$ is smaller than the probability that an individual measurement falls below $d$. Therefore, when $d < \mu$, a pooled sample is more likely to be observed than an individual sample. The situation reverses when $d > \mu$. Thus, the pooled data are less affected by an LOD when the mean is larger than the LOD (Schisterman and others, 2005b). This can be demonstrated by considering an unpooled sample, $X \sim N(0, 1)$. The pooled sample then has the following distribution, $X^p \sim N(0, 1/4)$, based on a pooling group size of $g = 4$. When $d = -1$, 16% of the unpooled observations are censored, whereas only 2% of the pooled observations are censored. When d = 1, 84% of the unpooled observations are censored, whereas 98% of the pooled observations are censored. We will further show that this gain in information for $d < \mu$ based on the pooling strategy leads to improvements in efficiency.
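The censoring percentages quoted above can be reproduced directly from the normal distribution function. The sketch assumes a standard normal marker and pools of size 4, which is the setting consistent with the quoted 16%, 2%, 84%, and 98%; the function name and its defaults are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def censored_fraction(d, mu=0.0, sigma=1.0, g=1):
    """P(measurement < d) for the mean of g draws from N(mu, sigma^2)."""
    return NormalDist(mu, sigma / sqrt(g)).cdf(d)


for d in (-1.0, 1.0):
    unpooled = censored_fraction(d)
    pooled = censored_fraction(d, g=4)
    print(d, round(unpooled, 2), round(pooled, 2))
# -1.0 0.16 0.02
# 1.0 0.84 0.98
```

Below the mean the pooled measurement is censored far less often than an individual one, and above the mean the ordering reverses, matching the argument in the text.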
In this section we examine the asymptotic distributions of the AUC estimators, which are based on the application of the maximum likelihood technique. Denote the total sample size
and assume that
|
| (3.2) |
We define the estimators of
from (2.1) in the form
, where
corresponds to estimation based upon the full data, the pooling sample, and the random sample, respectively. Subsequently,
and the asymptotic distribution of
is derived by using the following proposition.
PROPOSITION 3.1 Let (3.2) hold and
be finite. Then,
has the asymptotic (as
normal distribution with mean zero and covariance matrix
(where
is defined in Appendix). Proof is given in the Appendix.
Thus, the confidence interval (CI) for AUC is constructed using the following formula:
![]() |
where
is an estimator of
For different values of d, we graphically present
. Figures 1 (a) and (b) are based on 500 cases and 500 controls (full data), where
and
, corresponding to an
. The pool size is set to be
and is compared to the results of a simple random sample of
. As expected and shown in Figure 1 (a), estimates of AUC based on pooled data have asymptotically lower variance than the random sample until about
. For the pooled sample, this corresponds to 75% of the Xs and 92% of the Ys falling below the LOD. For the random sample, this corresponds to 63% of the Xs and 76% of the Ys falling below the LOD. In fact, in some cases (Figure 1 (b)), the variance of the
from the pooled data is smaller than the original sample. This result corresponds to the results of Schisterman and others (2005b), where an interval exists where the pooling strategy is more efficient than the full sample, given that the pooling design provides more numerical information.
Consider another application of Proposition 3.1 when we have two populations with fixed sizes
. For fixed
we then consider the asymptotic variance of
as a function of pooling size
This variance is defined by
and the sample sizes
. If
, we have that
. Hence, depending on d, the value of g that minimizes the variance of
can then be recommended. Let, for example
, and
, corresponding to an
and
.
Figure 2 plots the asymptotic variance of
, for
and
. In agreement with these graphs, the classical individual measuring of biological samples (i.e.
minimizes the variance of the maximum likelihood AUC estimator only if
or
. This result makes sense intuitively because when no LOD exists (
), or when the LOD is above the mean, the full sample contains more information than the pooled sample. For this reason, the panels in Figure 2 corresponding to
and
have a similar pattern.
| 4. SIMULATION STUDY |
|---|
A simulation study was carried out in order to examine the combined effects of pooling and LOD on the AUC estimator. Normally distributed data were generated for both cases and controls at varying levels of separation (
), with fixed
,
, and
, and mean
obtained by
.
The data were then pooled into groups of sizes (
) and an LOD was applied. LODs were defined so that a specified percentage of the control population was censored (
). Random samples of the data were also taken and the LOD was applied in the same manner. The findings of the simulation study are presented in Table 1. Following Faraggi and others (2003), we considered two general conditions regarding the availability of samples in an experimental setting. The first involves fixing the number of study subjects (
), and the second fixes the number of assays (
). Results for
and
were not included due to space limitations. We generated 5000 individual samples from each set of parameters. The relative RMSE was calculated relative to estimates based on the total population as follows:
$$\text{Rel. RMSE} = \frac{\mathrm{RMSE}_S}{\mathrm{RMSE}_T},$$

where $\mathrm{RMSE}_T$ is the RMSE for the total population and $\mathrm{RMSE}_S$ is the RMSE for pooled data with pooling size 2 or 4 when
, or the RMSE for a random sample of size
when
. Coverage was calculated by finding the percentage out of 5000 CIs for each set of conditions that contained the true AUC.
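The summary measures used in the simulation can be sketched as small helper functions. The direction of the ratio (sampling design over total population, so that values above 1 indicate a loss of efficiency) is inferred from the reported values of 1.01 and 1.42, and the estimates and intervals in the example are made-up numbers, not results from Table 1.

```python
from math import sqrt


def rmse(estimates, truth):
    """Root mean-squared error of a list of estimates around the true value."""
    return sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def relative_rmse(sample_estimates, full_estimates, truth):
    """RMSE of a sampling design divided by the RMSE of the total population."""
    return rmse(sample_estimates, truth) / rmse(full_estimates, truth)


def coverage(intervals, truth):
    """Fraction of (lower, upper) confidence intervals that contain the truth."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)


true_auc = 0.75
full = [0.74, 0.76, 0.75, 0.77]    # hypothetical estimates from full data
pooled = [0.73, 0.77, 0.75, 0.78]  # hypothetical estimates from pooled data
rel = relative_rmse(pooled, full, true_auc)
cov = coverage([(0.70, 0.80), (0.76, 0.82), (0.71, 0.79)], true_auc)
print(round(rel, 3), round(cov, 3))
```

In a full study each list would hold one AUC estimate per simulated data set, 5000 per parameter setting in the design described here.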
First, let the number of subjects available be fixed. As the pooling size $g$ increases, the number of tested samples decreases, and so does the quality of the AUC estimator. As expected, the relative RMSE increased as pool size increased. For LODs less than 60%, no considerable distinction could be made between the RMSE from unpooled data ($g = 1$) and pooled data ($g = 2$). However, when 80% of the control samples were censored, the relative loss of efficiency was about 25%. In addition, for $g = 4$, the relative loss of efficiency was three times that of pairs for all LODs. That result is to be expected when reducing the sample by 75%. The loss of efficiency between the random sample and the unpooled data was about 40%. For LODs less than 60%, pooling was consistently more efficient than random sampling. Coverage tended to decrease as AUC increased and was more conservative when more than 50% of the control samples were censored. Bias for all levels of discrimination and pooling was found to be negligible and was not included in the table due to space limitations. In terms of cost, when the number of subjects is fixed, pooling or taking a random sample will reduce cost by 50% (when $g = 2$
). Using Table 1, we can compare the efficiency of the sampling schemes for various values of AUC and LOD. For a fixed number of subjects
, if we assume that
and an LOD that affects 40% of the control subjects, the Rel. RMSE for the pooling strategy is 1.01 as compared to the full data, but 1.42 for the random sample as compared to the full data. Therefore, there is essentially no loss in efficiency when we employ the pooling strategy over the full data. However, there is a substantial gain in efficiency when we pool the samples rather than take a random sample. In this case, when we employ the pooling strategy, we cut cost by 50% and suffer essentially no loss in efficiency. If the LOD affects 80% of the controls, however, the pooling strategy is not as efficient as the full sample, and pooling is not recommended.
When the number of assays is fixed, the benefits of pooling are readily noticed. For example, using 40 pooled samples ($g = 2$) as opposed to 40 unpooled samples leads to a 30% gain in efficiency. This gain in efficiency increases as the pooling group size increases and is consistent for LODs less than 60%. These results can be particularly useful in cases where the cost of assaying significantly exceeds the cost of obtaining samples because for the same overall cost, there is a significant gain in efficiency.
The simulations thus far assumed that the samples followed normal distributions. In order to illustrate the robustness of our methodology, we performed the following Monte Carlo simulations. Let us assume that one believes the observations are normally distributed and chooses the method of AUC estimation as proposed in Section 3. However, the true diagnostic markers satisfy
![]() |
where
,
are t-distributed random variables with df degrees of freedom. Thus, the true AUC is
![]() |
For example, if
, then
, respectively, when the putative
. Here we ran 5000 repetitions of the sample
, with parameters
at each
,
(d is the value of LOD), and
(g is the value of the pool size). We examined the proposed estimation of AUC under this incorrect distributional assumption. Figure 3 corresponds to the case when
.
From these results we conclude that the proposed methodology is reasonable even when the distributional assumptions do not exactly satisfy normality. However, the accuracy of the considered estimators is poor when
(see Figure 3(b)). Note that, although the AUC estimator based on the pooled data utilizes only
observations, the efficiency of this estimator is close to the efficiency of the AUC estimator based on the full data (Figure 3(b)). Moreover, there are values of the LOD in which the estimator based on the pooling sample is the most robust and accurate. However, the bias
when
is based on pooled samples seems to be the largest for some values of d (note that the differences between the biases
are relatively small; see Figure 3(a)). This is, perhaps, partly because the assumed normal distribution of the pooled data
is less likely than the assumed normal distribution of the individual markers. Note that similar results were observed for
and 5.
| 5. EXAMPLE |
|---|
Development of a marker for cholesterol is crucial because evidence shows that cholesterol may play a contributing role in the development of coronary heart disease. The pooling strategy explained above was applied to individual cholesterol measurements on 80 volunteers. Forty of those individuals who recently survived a myocardial infarction were defined as cases; the remaining 40 subjects served as controls. In addition, the blood specimens were randomly pooled in groups of 2, for the cases and controls separately, and remeasured. Faraggi and others (2003) have shown, using the same data, that the assumption that the pooled sample measurements are the equivalent of the average of the individual case is justified. Due to the costs involved, such confirmatory evidence for the averaging assumption will generally not be available.
Distributional assumptions were also tested and found to fit well with normal assumptions. The means (± SD) in the control and case unpooled samples were 205.5 (± 42.3) and 226.8 (± 41.7), respectively. An artificial LOD was applied to the cholesterol data so that 20% of the control samples were censored. AUC was then estimated using the method previously described. Table 2 presents the estimated AUC with corresponding 90% CIs. The pooled sample and a random sample were also used to estimate the AUC. The estimator of the variance of the AUC based on the random sample was two times the estimator of the variance of the AUC estimator based on the original and pooled samples. This is consistent with findings from the simulation study. The pooled point estimate of the AUC was closer to the AUC based on full data with no LOD effect than the random sample estimate was. Upon further investigation, it was found that the pooled data had two outliers. These outliers were a result of variability introduced by the pooling process itself. The methods presented in this paper rely on the assumption that the value of the pooled sample is the average of the individual unpooled samples. It is reasonable to assume, however, that sometimes the practicality of pooling biological specimens can lead to additive pooling errors (Schisterman and others, 2005b). Care must be taken during the physical pooling process so as not to introduce additional variability. In order to complete our analysis, the outliers were removed and the analysis was repeated. The point estimate, after removing the two points, was closer to the true AUC (changed from 0.584 to 0.605). More importantly, this pooled analysis shows that cholesterol has discrimination properties, as shown by the original data. This is not the case in the random sample analysis. However, the largest improvement in the point estimation was found when we used the original data to calculate the theoretical pooled data values (mathematically pooling and not physically pooling samples). This resulted in an AUC point estimate of 0.634. The process of pooling the samples may introduce variability, and careful consideration must be taken when pooling biospecimens so that no additional error is introduced, since such error can negate the benefits of pooling.
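As a rough check on the scale of these estimates, the reported unpooled means and standard deviations can be plugged into the binormal AUC formula, $\Phi\left((\mu_X - \mu_Y)/\sqrt{\sigma_X^2 + \sigma_Y^2}\right)$, with no LOD applied. This is a back-of-the-envelope sketch based only on the summary statistics quoted in the text; the result is not a value taken from Table 2.

```python
from math import hypot
from statistics import NormalDist

mean_case, sd_case = 226.8, 41.7        # cases, as reported in the text
mean_control, sd_control = 205.5, 42.3  # controls, as reported in the text

# Standardized separation between the two groups.
delta = (mean_case - mean_control) / hypot(sd_case, sd_control)
auc = NormalDist().cdf(delta)
print(round(auc, 2))  # 0.64
```

The value of about 0.64 is in line with the point estimates discussed above (0.584, 0.605, and 0.634), all of which indicate modest discrimination.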
| 6. CONCLUSIONS |
|---|
In this paper, we have presented a method to estimate the AUC based on pooled or unpooled data affected by an LOD. We have shown that there is a significant gain in efficiency when using pooled specimens as opposed to taking a random sample. This gain in efficiency occurs when the LOD affects less than 50% of our control samples. In this case, there are more pooled observations above the LOD, and the quality of our estimator is improved. Pooling is therefore a statistically viable cost-saving approach. However, estimating AUC based on a pooled sample requires that certain distributional assumptions be met. The process of mixing biospecimens may be a potential source of additional variability. Therefore, careful attention to instrument sensitivity must be paid during the pooling process. The paper proposes the methodology for normally distributed biomarkers. However, the same approach could be developed for other distributions, such as the gamma distribution.
| Appendix |
|---|
The covariance matrix has the following form:
![]() |
and
.
Proof of Proposition 3.1 It is clear that the maximum likelihood estimator
has the asymptotic normal distribution with covariance matrix
, for
. The covariance matrix can be found by inverting the asymptotic Fisher information matrix divided by N (if
) or M (if
), as
. Thus, by applying the results proposed by Gupta (1952)
, we obtain
![]() |
where
. The estimator
of the
can be considered as a function of
. Therefore, the usual Taylor expansion around points
can be utilized for analyzing the asymptotic distribution of
. This technique is presented by Kotz and others (2003). Based on the results proposed by Kotz and others (2003), we complete the proof of Proposition 3.1.
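The Taylor-expansion step can be sketched as a standard delta-method calculation. The notation below ($\delta$, $s$, $\Sigma$) is assumed for illustration, since the original symbols did not survive extraction, and the AUC is taken to be the binormal expression $\Phi(\delta)$. Here $\Sigma$ stands for the covariance matrix of $(\hat\mu_X, \hat\sigma_X, \hat\mu_Y, \hat\sigma_Y)$ and $\varphi$ for the standard normal density.

```latex
\begin{align*}
\mathrm{AUC} &= \Phi(\delta), \qquad
\delta = \frac{\mu_X - \mu_Y}{s}, \qquad
s = \sqrt{\sigma_X^2 + \sigma_Y^2},\\[4pt]
\nabla\delta &=
\left(\frac{\partial\delta}{\partial\mu_X},\,
      \frac{\partial\delta}{\partial\sigma_X},\,
      \frac{\partial\delta}{\partial\mu_Y},\,
      \frac{\partial\delta}{\partial\sigma_Y}\right)
= \left(\frac{1}{s},\;
        -\frac{(\mu_X - \mu_Y)\,\sigma_X}{s^{3}},\;
        -\frac{1}{s},\;
        -\frac{(\mu_X - \mu_Y)\,\sigma_Y}{s^{3}}\right),\\[4pt]
\operatorname{Var}(\hat\delta) &\approx
\nabla\delta^{\top}\,\Sigma\,\nabla\delta,
\qquad
\operatorname{Var}\bigl(\widehat{\mathrm{AUC}}\bigr) \approx
\varphi(\delta)^{2}\,\operatorname{Var}(\hat\delta).
\end{align*}
```

Because the case and control samples are independent, $\Sigma$ is block diagonal in this notation, so the quadratic form splits into a case term and a control term.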
| ACKNOWLEDGMENTS |
|---|
We are grateful to the editor, associate editor, and referee for their helpful comments that clearly improved this paper. This work was supported by the Intramural Research Program of the National Institutes of Health, National Institute of Child Health and Human Development.
| REFERENCES |
|---|
Bamber DC. (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 12:387-415.
Faraggi D and Reiser B. (2002) Estimation of the area under the ROC curve. Statistics in Medicine 21:3093-106.
Faraggi D, Reiser B, Schisterman E. (2003) ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine 22:2515-27.
Finkelstein M and Verma D. (2001) Exposure estimation in the presence of nondetectable values: another look. American Industrial Hygiene Association Journal 62:195-8.
Gupta AK. (1952) Estimation of the mean and standard deviation of a normal population from a censored sample. Biometrika 39:260-73.
Hornung R and Reed L. (1990) Estimation of average concentration in the presence of nondetectable values. Applied Occupational Environmental Hygiene 5:46-51.
Kotz S, Lumelskii Y, Pensky M. (2003) The Stress-Strength Model and Its Generalizations (World Scientific, London).
Laden F, Hankinson SE, Wolff MS, Colditz GA, Willett WC, Speizer FE, Hunter DJ. (2001) Plasma organochlorine levels and the risk of breast cancer: an extended follow-up in the Nurses' Health Study. International Journal of Cancer 91:568-74.
Laden F and Hunter DJ. (1998) Environmental risk factors and female breast cancer. Annual Review of Public Health 19:101-23.
Liu A and Schisterman E. (2003) Comparison of diagnostic accuracy of biomarkers with pooled assessments. Biometrical Journal 45:631-644.
Louis GM, Weiner JM, Whitcomb BW, Sperrazza R, Schisterman EF, Lobdell DT, Crickard K, Greizerstein H, Kostyniak PJ. (2005) Environmental PCB exposure and risk of endometriosis. Human Reproduction 20:279-85.
Lubin JH, Colt JS, Camann D, Davis S, Cerhan JR, Severson RK, Bernstein L, Hartge P. (2004) Epidemiologic evaluation of measurement data in the presence of detection limits. Environmental Health Perspectives 112:1691-96.
Schisterman EF, Perkins NJ, Liu A, Bondell H. (2005a) Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology 16:73-81.
Schisterman EF, Vexler A, Liu A. (2005b) To pool or not to pool: from whether to when: applications of pooling to biospecimens with incomplete measurements. Statistics in Medicine (submitted).
Shapiro DE. (1999) The interpretation of diagnostic tests. Statistical Methods in Medical Research 8:113-34.
Weinberg CR and Umbach DM. (1999) Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics 55:718-26.
Wieand S, Gail MH, James BR, James KL. (1989) A family of non-parametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 76:585-92.
Received September 13, 2005; revised February 9, 2006; revised February 28, 2006; accepted for publication March 6, 2006.
[Figure 1: (a) curves for the full data, pooled data, and a random sample (dashed curve). (b) Interval where variance based on pooled sample is smaller than variance based on full data.]
[Figure 3: curves for the full data, pooled data, and a random sample (dashed curve), plotted against d. (a) Reference values correspond to the true values of AUC. (b) Monte Carlo estimators.]