Biostatistics Advance Access originally published online on March 23, 2007
Biostatistics 2007 8(4):800-804; doi:10.1093/biostatistics/kxm006

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

On the potential for illogic with logically defined outcomes

Xianbin Li*, Brian Caffo and Daniel Scharfstein

Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA xli{at}jhsph.edu

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MATHEMATICAL FORMULATION
 3. FOUR ESTIMATORS AND...
 4. COMPARISON OF ESTIMATORS
 5. DISCUSSION
 REFERENCES
 
Logically defined outcomes are commonly used in medical diagnoses and epidemiological research. When missing values in the original outcomes exist, the method of handling the missingness can have unintended consequences, even if the original outcomes are missing completely at random. In this note, we consider 2 binary original outcomes, which are missing completely at random. For estimating the prevalence of a logically defined "or" outcome, we discuss the properties of 4 estimators: the complete-case estimator, the available-case estimator, the maximum likelihood estimator (MLE), and a moment-based estimator. With the exception of the available-case estimator, all the estimators are consistent. The MLE exhibits superior performance and should be generally adopted.

Keywords: Available-case estimator; Complete-case estimator; Hypertension; Maximum likelihood estimator; Missing data; Moment-based estimator


    1. INTRODUCTION
 
Logically defined outcomes arise frequently in biomedical practice and research. For example, in epidemiologic studies, a common definition of hypertension requires systolic blood pressure over 140 mmHg, diastolic blood pressure over 90 mmHg, or use of antihypertensive medications (see, e.g., Nieto and others, 2000; Peppard and others, 2000; Banks and others, 2006). Estimation of the prevalence of the logically defined outcome is straightforward if there is no missing information in the original outcomes (in this example, the diagnosis criteria). However, when there is missing information in one or more original outcomes, the estimation of the prevalence is less clear.

The most straightforward approach to addressing the missing data is to discard all logical outcomes where any of the original outcomes have missing values, referred to as the complete-case analysis. However, such an approach may discard known logical outcomes. For example, if a subject has missing blood pressure measurements but is known to be taking antihypertensive medication, then he or she is hypertensive, as per the operational definition; hence their logical outcome is known, despite some of the original outcomes being missing. The approach that estimates the prevalence by computing the fraction of known logical outcomes that are positive is called the available-case analysis. At first blush, this might seem to be a better strategy; we show otherwise.

In this note, we assume that the original outcomes are missing completely at random. We show that, while the complete-case approach provides a consistent estimator of the true prevalence, the available-case approach does not. As we demonstrate, this follows because the probability of missing the logical outcome then depends on unknown original outcomes (i.e. informative missingness). We derive 2 additional consistent estimators of the prevalence: the maximum likelihood estimator (MLE) and a moment-based estimator. Both these estimators use all the available data; the moment-based estimator is more computationally tractable but less efficient.

This note is organized as follows: In Section 2, we introduce a mathematical formalization of the problem. In Section 3, we introduce the 4 estimators and present their asymptotic properties. In Section 4, we highlight the results of a simulation study. Section 5 is devoted to a summary and discussion of directions for future research.


    2. MATHEMATICAL FORMULATION
 
For ease of exposition, we only consider 2 binary (yes/no) outcomes, labeled Y(1) and Y(2). We define the associated observed data 0/1 indicators as R(1) and R(2), where R(j) equals 1 if Y(j) is observed; otherwise it is 0. The logically defined outcome Y is 1 if Y(1) = 1 or Y(2) = 1; otherwise it is 0. Mathematically,

\[
Y = 1 - \bigl(1 - Y^{(1)}\bigr)\bigl(1 - Y^{(2)}\bigr).
\]

Let πjk be the probability that Y(1) = j and Y(2) = k, and let γlm denote the probability that R(1) = l and R(2) = m, where Σj Σk πjk = 1 and Σl Σm γlm = 1, with each index ranging over {0, 1}. We assume throughout that the original outcomes are missing completely at random, i.e. (Y(1),Y(2)) is independent of (R(1),R(2)). However, the original outcomes can be dependent on each other, as can the observed data indicators.

Of scientific interest is the estimation of

\[
\mu = P(Y = 1) = 1 - \pi_{00} = \pi_{01} + \pi_{10} + \pi_{11}. \tag{2.1}
\]

In what follows, we discuss the impact of the choice of R, the observed data indicator for Y, on the estimation of µ. A complete-case analysis sets R = R*, where


\[
R^{*} = R^{(1)} R^{(2)}. \tag{2.2}
\]

The observed data indicator that uses all of the known values of Y sets R = R†, where

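\[
R^{\dagger} = R^{(1)} R^{(2)} + R^{(1)}\bigl(1 - R^{(2)}\bigr) Y^{(1)} + \bigl(1 - R^{(1)}\bigr) R^{(2)} Y^{(2)}. \tag{2.3}
\]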

This is the so-called available-case analysis. Such an approach "seems" better because it does not discard known outcomes. However, the observed data indicator now depends on the original outcomes, so this approach induces informative missingness, even if the original data are missing completely at random.
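To make the informative-missingness point concrete, the following minimal Python sketch computes the probability that the logical outcome is known under the available-case rule, separately for Y = 1 and Y = 0. It assumes the rule described above (Y is known when both original outcomes are observed, or when a single observed original outcome equals 1). The function name, the dictionary layout for π and γ, and the numerical values are illustrative assumptions, not part of the original note.

```python
def prob_known_given_y(pi, gamma):
    """Return (P(Y known | Y = 1), P(Y known | Y = 0)) under the available-case rule.

    pi[(j, k)]    = P(Y1 = j, Y2 = k)
    gamma[(l, m)] = P(R1 = l, R2 = m), with (Y1, Y2) independent of (R1, R2).
    Requires P(Y = 1) > 0.
    """
    mu = pi[(0, 1)] + pi[(1, 0)] + pi[(1, 1)]   # P(Y = 1)
    pi1_plus = pi[(1, 0)] + pi[(1, 1)]          # P(Y1 = 1)
    pi_plus1 = pi[(0, 1)] + pi[(1, 1)]          # P(Y2 = 1)
    # Y = 0 is known only when both original outcomes are observed.
    p_known_y0 = gamma[(1, 1)]
    # Y = 1 is also known when the single observed original outcome equals 1.
    partial = gamma[(1, 0)] * pi1_plus + gamma[(0, 1)] * pi_plus1
    p_known_y1 = gamma[(1, 1)] + partial / mu
    return p_known_y1, p_known_y0


# Hypothetical parameter values chosen only for illustration.
pi = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
gamma = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.4}
p1, p0 = prob_known_given_y(pi, gamma)
print("P(known | Y=1) =", round(p1, 4), " P(known | Y=0) =", round(p0, 4))
```

With these assumed values the logical outcome is known far more often when Y = 1 than when Y = 0, which is exactly the dependence of the observed data indicator on the outcome that makes the missingness informative.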


    3. FOUR ESTIMATORS AND THEIR ASYMPTOTIC PROPERTIES
 
We assume that we have n independent and identically distributed copies of O = (R(1),R(2),R(1)·Y(1),R(2)·Y(2)). We reserve the subscript i to indicate individuals when necessary. We focus on the following 4 estimators of µ:

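\[
\begin{aligned}
\text{complete-case:}\quad \hat\mu_{CC} &= \frac{\sum_{i=1}^{n} R_i^{*} Y_i}{\sum_{i=1}^{n} R_i^{*}}, \\
\text{available-case:}\quad \hat\mu_{AC} &= \frac{\sum_{i=1}^{n} R_i^{\dagger} Y_i}{\sum_{i=1}^{n} R_i^{\dagger}}, \\
\text{moment-based:}\quad \hat\mu_{MB} &= \hat\pi_{1+} + \hat\pi_{+1} - \hat\pi_{11}, \\
\text{maximum likelihood:}\quad \hat\mu_{ML} &= \hat\pi_{01} + \hat\pi_{10} + \hat\pi_{11},
\end{aligned}
\]
where, in the moment-based estimator, \(\hat\pi_{1+} = \sum_i R_i^{(1)} Y_i^{(1)} / \sum_i R_i^{(1)}\), \(\hat\pi_{+1} = \sum_i R_i^{(2)} Y_i^{(2)} / \sum_i R_i^{(2)}\), and \(\hat\pi_{11} = \sum_i R_i^{(1)} R_i^{(2)} Y_i^{(1)} Y_i^{(2)} / \sum_i R_i^{(1)} R_i^{(2)}\), and, in the last line, \(\hat\pi_{jk}\) denotes the MLE of \(\pi_{jk}\).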

The first 2 estimators are simple averages of the observed values of Y; the complete-case estimator uses only the instances where both the original outcomes are observed, while the available-case estimator uses all the available logical outcomes. In general, while the complete-case estimator is consistent, the available-case estimator converges in probability to

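\[
\mu + \frac{(1 - \mu)\,(\gamma_{10}\pi_{1+} + \gamma_{01}\pi_{+1})}{\gamma_{11} + \gamma_{10}\pi_{1+} + \gamma_{01}\pi_{+1}},
\]
where \(\pi_{1+} = \pi_{10} + \pi_{11}\) and \(\pi_{+1} = \pi_{01} + \pi_{11}\).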

The second term is a nonnegative bias; it is zero if and only if γ11 > 0 and γ10 = γ01 = 0 (i.e. no discordant missing data), or µ = 1 (i.e. the probability that both Y(1) and Y(2) are zero is zero, π00 = 0). Note that the bias can approach 1 when γ11 approaches 0 and π00 approaches 1. When there is no discordant missingness, the complete-case and available-case estimators are identical. As ratio estimators, these estimators are asymptotically normal with an asymptotic variance of the form

\[
\frac{E\bigl[R\,(Y - \mu_R)^2\bigr]}{\{E[R]\}^2}, \qquad \mu_R = \frac{E[RY]}{E[R]}, \tag{3.1}
\]

where R = R* for the complete-case estimator and R = R† for the available-case estimator. In the complete-case setting, this expression simplifies to µ(1 − µ)/γ11, which is the Bernoulli variance divided by the probability of observing a complete case. The corresponding form for the available-case estimator is more complicated and is available in an online supplement.
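The large-sample behavior described above can be checked numerically. The sketch below assumes the available-case rule from Section 2, so that the probability limit is the ratio E[R†Y]/E[R†] under missingness completely at random, and it uses the complete-case variance µ(1 − µ)/γ11 stated in the text. The function names and the dictionary layout are hypothetical.

```python
def available_case_limit(pi, gamma):
    """Return (probability limit, asymptotic bias) of the available-case estimator.

    pi[(j, k)] = P(Y1 = j, Y2 = k); gamma[(l, m)] = P(R1 = l, R2 = m).
    """
    mu = pi[(0, 1)] + pi[(1, 0)] + pi[(1, 1)]
    pi1_plus = pi[(1, 0)] + pi[(1, 1)]
    pi_plus1 = pi[(0, 1)] + pi[(1, 1)]
    # Mass of subjects whose outcome is known from a single observed "1".
    partial = gamma[(1, 0)] * pi1_plus + gamma[(0, 1)] * pi_plus1
    denom = gamma[(1, 1)] + partial              # P(logical outcome is known)
    if denom <= 0:
        raise ValueError("the logical outcome is never known")
    limit = (gamma[(1, 1)] * mu + partial) / denom
    return limit, limit - mu


def complete_case_avar(pi, gamma):
    """Asymptotic variance of the complete-case estimator: mu(1 - mu) / gamma11."""
    mu = pi[(0, 1)] + pi[(1, 0)] + pi[(1, 1)]
    return mu * (1.0 - mu) / gamma[(1, 1)]


# Hypothetical parameter values chosen only for illustration.
pi = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
gamma = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.4}
print(available_case_limit(pi, gamma), complete_case_avar(pi, gamma))
```

Setting γ10 = γ01 = 0 in this sketch returns a bias of zero, matching the no-discordant-missingness case.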

The moment-based estimator is a direct estimator based on the fact that µ = π1+ + π+1 − π11. Because the first 2 terms depend only on the individual original outcomes, this estimate makes use of more information than the complete-case estimator. Provided that γ11 > 0, it is consistent, as the first, second, and third terms converge in probability to π1+ = P(Y(1) = 1), π+1 = P(Y(2) = 1), and π11, respectively. The estimator is asymptotically normal with an asymptotic variance

Formula
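As a minimal illustration of the moment-based idea, the following Python sketch estimates each of the three terms in µ = π1+ + π+1 − π11 from the subjects for whom that term is observable. The data layout (a list of pairs with `None` marking a missing original outcome) and the function name are assumptions made for this example.

```python
def moment_estimate(data):
    """Moment-based estimate of mu = P(Y1 = 1 or Y2 = 1).

    data: list of (y1, y2) pairs with values 0, 1, or None (missing).
    """
    y1_obs = [y1 for y1, _ in data if y1 is not None]
    y2_obs = [y2 for _, y2 in data if y2 is not None]
    both = [y1 * y2 for y1, y2 in data if y1 is not None and y2 is not None]
    if not y1_obs or not y2_obs or not both:
        raise ValueError("each term needs at least one contributing subject")
    p1_plus = sum(y1_obs) / len(y1_obs)   # uses every subject with Y1 observed
    p_plus1 = sum(y2_obs) / len(y2_obs)   # uses every subject with Y2 observed
    p11 = sum(both) / len(both)           # uses complete cases only
    return p1_plus + p_plus1 - p11


# Tiny hypothetical data set: 4 complete cases, 2 partial cases, 1 fully missing.
data = [(1, 0), (0, 0), (1, 1), (0, 1), (1, None), (None, 0), (None, None)]
print(moment_estimate(data))
```

Because the three terms are estimated from different subsets of subjects, nothing in this sketch constrains the result to lie in [0, 1] in small samples.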

The final estimator is based on maximum likelihood. Since (R(1),R(2)) is ancillary (Basu, 1977) for π = (π01,π10,π11)' (i.e. the marginal distribution of (R(1),R(2)) is the same for all π ∈ Π), the MLE for π can be found by maximizing the conditional likelihood for the observed data given (R(1),R(2)). The conditional likelihood contribution for a random individual with observed data O is

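\[
\begin{aligned}
L(\pi; O) ={}& \Bigl\{ \pi_{00}^{(1-Y^{(1)})(1-Y^{(2)})}\, \pi_{01}^{(1-Y^{(1)})Y^{(2)}}\, \pi_{10}^{Y^{(1)}(1-Y^{(2)})}\, \pi_{11}^{Y^{(1)}Y^{(2)}} \Bigr\}^{R^{(1)}R^{(2)}} \\
&\times \Bigl\{ \pi_{1+}^{Y^{(1)}} (1-\pi_{1+})^{1-Y^{(1)}} \Bigr\}^{R^{(1)}(1-R^{(2)})} \\
&\times \Bigl\{ \pi_{+1}^{Y^{(2)}} (1-\pi_{+1})^{1-Y^{(2)}} \Bigr\}^{(1-R^{(1)})R^{(2)}},
\end{aligned}
\]
where \(\pi_{00} = 1 - \pi_{01} - \pi_{10} - \pi_{11}\), \(\pi_{1+} = \pi_{10} + \pi_{11}\), and \(\pi_{+1} = \pi_{01} + \pi_{11}\).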

The overall conditional likelihood is ∏i L(π;Oi), with the product taken over i = 1, ..., n. The first, second, and third factors of the conditional likelihood function, L(π;O), are the contributions from observations where both Y(1) and Y(2) are observed, where Y(1) is available and Y(2) is missing, and where Y(1) is missing and Y(2) is available, respectively. To obtain the MLE of π, one can maximize the likelihood numerically, for example, using a quasi-Newton algorithm. Operationally, it is useful to re-parameterize π in terms of ß = (ß1,ß2,ß3)', where ß1 = log{π10/(1 − π01 − π10 − π11)}, ß2 = log{π01/(1 − π01 − π10 − π11)}, and ß3 = log{π11/(1 − π01 − π10 − π11)}, to eliminate boundary constraints.
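The re-parameterization can be sketched as a pair of mappings between π and ß. The sketch below assumes the log-ratio reading in which each ß component is the log of a cell probability divided by π00 = 1 − π01 − π10 − π11, with ß1 paired with π10, ß2 with π01, and ß3 with π11. The function names are hypothetical.

```python
import math


def pi_to_beta(pi01, pi10, pi11):
    """Map interior cell probabilities to unconstrained log-ratios (b1, b2, b3)."""
    pi00 = 1.0 - pi01 - pi10 - pi11
    return (math.log(pi10 / pi00), math.log(pi01 / pi00), math.log(pi11 / pi00))


def beta_to_pi(b1, b2, b3):
    """Inverse map: any real (b1, b2, b3) gives probabilities (pi01, pi10, pi11)."""
    shift = max(0.0, b1, b2, b3)            # subtract the max for numerical stability
    e00 = math.exp(0.0 - shift)             # reference cell pi00 has log-ratio 0
    e10 = math.exp(b1 - shift)
    e01 = math.exp(b2 - shift)
    e11 = math.exp(b3 - shift)
    total = e00 + e10 + e01 + e11
    return (e01 / total, e10 / total, e11 / total)


print(beta_to_pi(*pi_to_beta(0.1, 0.2, 0.4)))
```

Any real-valued ß maps back to a valid probability vector, so an unconstrained optimizer can search over ß without leaving the parameter space.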

Assuming that the solution lies within the interior of a compact set, the MLE of π, denoted π̂, will be consistent and asymptotically normal with asymptotic variance equal to the inverse of the Fisher information matrix, which is a 3 × 3 matrix, I(π), with ith row, jth column denoted by Iij(π), where

Formula

By the invariance property, the MLE of µ is µ̂ = π̂01 + π̂10 + π̂11, the sum of the corresponding components of π̂. This estimator will be consistent and asymptotically normal with asymptotic variance found using the multivariate delta method.
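One way to carry out the maximization without an external optimizer is the EM algorithm for a partially classified 2 × 2 table. This is a substitute for, not an implementation of, the quasi-Newton approach mentioned above; it targets the same conditional likelihood, assuming that fully observed subjects contribute cell probabilities and partially observed subjects contribute marginal probabilities. The count-based data layout, the function name, and the example counts are assumptions made for this sketch.

```python
def em_mle(n, a, b, tol=1e-12, max_iter=10000):
    """EM estimate of the cell probabilities and of mu = 1 - pi00.

    n[j][k]: number of complete cases with Y1 = j, Y2 = k
    a[j]   : number of subjects with Y1 = j observed and Y2 missing
    b[k]   : number of subjects with Y2 = k observed and Y1 missing
    Subjects with both outcomes missing carry no information and are excluded.
    """
    total = sum(n[0]) + sum(n[1]) + sum(a) + sum(b)
    pi = [[0.25, 0.25], [0.25, 0.25]]            # uniform starting value
    for _ in range(max_iter):
        row = [pi[j][0] + pi[j][1] for j in (0, 1)]   # P(Y1 = j)
        col = [pi[0][k] + pi[1][k] for k in (0, 1)]   # P(Y2 = k)
        new = [[0.0, 0.0], [0.0, 0.0]]
        for j in (0, 1):
            for k in (0, 1):
                # E-step: split each partial count across the cells it may belong to.
                expected = float(n[j][k])
                if row[j] > 0:
                    expected += a[j] * pi[j][k] / row[j]
                if col[k] > 0:
                    expected += b[k] * pi[j][k] / col[k]
                # M-step: multinomial proportions of the expected counts.
                new[j][k] = expected / total
        change = max(abs(new[j][k] - pi[j][k]) for j in (0, 1) for k in (0, 1))
        pi = new
        if change < tol:
            break
    return pi, 1.0 - pi[0][0]


# Hypothetical counts: 100 complete cases plus 50 subjects with only Y1 observed.
pi_hat, mu_hat = em_mle(n=[[30, 10], [20, 40]], a=[20, 30], b=[0, 0])
print(pi_hat, mu_hat)
```

The final line applies the invariance property: the estimate of µ is obtained directly from the estimated cell probabilities as 1 − π̂00, which equals π̂01 + π̂10 + π̂11.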


    4. COMPARISON OF ESTIMATORS
 
Due to the relative simplicity of the expressions for the asymptotic variances of the complete-case and moment-based estimators, we are able to analytically compare their efficiencies. In the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org), we present a general formula for the difference between the asymptotic variances of the moment-based estimator and the complete-case estimator. We prove that, when there is some discordant missingness and the proportion of complete cases is small, the choice between the complete-case and moment-based estimators relies on whether it is better to estimate π00 or π11 with the complete cases. Specifically, the moment-based (complete-case) estimator is dramatically more efficient if π11 (π00) is further from 0.5 than π00 (π11).

These results were confirmed in a simulation study (see the supplementary material available at Biostatistics online). The study also showed the poor performance of the available-case estimator in terms of both bias and mean-squared error, except in the case where there is no discordant missingness or all the original outcomes are zero. Further, the study demonstrated the superior efficiency of the MLE relative to the other estimators.
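The qualitative contrast between the complete-case and available-case estimators can be reproduced with a small Monte Carlo sketch. This is not the authors' simulation study: the parameter values, sample size, seed, and function name below are all assumptions chosen for illustration, with the original outcomes and the observed data indicators drawn independently so that the data are missing completely at random.

```python
import random


def simulate_estimates(n, pi, gamma, seed=0):
    """Simulate n subjects under MCAR; return (complete-case, available-case) estimates."""
    rng = random.Random(seed)
    y_cells, y_weights = zip(*pi.items())
    r_cells, r_weights = zip(*gamma.items())
    cc_num = cc_den = ac_num = ac_den = 0
    for _ in range(n):
        y1, y2 = rng.choices(y_cells, weights=y_weights)[0]
        r1, r2 = rng.choices(r_cells, weights=r_weights)[0]   # drawn independently of Y
        y = max(y1, y2)                                        # logical "or" outcome
        if r1 and r2:                                          # complete case
            cc_num += y
            cc_den += 1
        # Available case: Y is known if both are observed or an observed outcome is 1.
        if (r1 and r2) or (r1 and y1 == 1) or (r2 and y2 == 1):
            ac_num += y
            ac_den += 1
    return cc_num / cc_den, ac_num / ac_den


# Hypothetical setting with true prevalence mu = 0.5 and heavy discordant missingness.
pi = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
gamma = {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.4}
cc, ac = simulate_estimates(20000, pi, gamma, seed=1)
print("complete-case:", round(cc, 3), " available-case:", round(ac, 3))
```

Under these assumed values the complete-case estimate should land near the true prevalence of 0.5, while the available-case estimate settles well above it.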


    5. DISCUSSION
 
Logically defined outcomes are commonly used in medical diagnosis and epidemiological research. Without missing values in the original outcomes, the estimation of the prevalence of the logically defined outcomes is straightforward. However, when there are missing values in some of the original outcomes, the method of handling the missingness can have unintended consequences, even if the original outcomes are missing completely at random. We believe that this potential problem is largely unknown. Complicating the issue is that standard packages differ in the default behavior of the mathematical "or" operator.
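The two conventions that an "or" operator might follow with missing operands can be written out explicitly. The sketch below is purely illustrative and makes no claim about which software package follows which rule; `None` stands for a missing value and both function names are hypothetical.

```python
def three_valued_or(y1, y2):
    """'Or' that returns 1 whenever either operand is a known 1 (available-case behavior)."""
    if y1 == 1 or y2 == 1:
        return 1
    if y1 is None or y2 is None:
        return None          # a known 0 with a missing partner stays unknown
    return 0


def strict_or(y1, y2):
    """'Or' that is missing whenever either operand is missing (complete-case behavior)."""
    if y1 is None or y2 is None:
        return None
    return max(y1, y2)


for pair in [(1, None), (0, None), (None, None), (0, 0), (1, 0)]:
    print(pair, three_valued_or(*pair), strict_or(*pair))
```

Averaging the non-missing results of the first function corresponds to the available-case analysis, and averaging those of the second corresponds to the complete-case analysis, so the default chosen by a package silently determines which estimator is computed.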

The MLE is the optimal choice, though it requires the use of numerical optimization techniques. Regardless, we would recommend its general use in these problems. We would never recommend the available-case estimator, though it does appear to be used in practice.

In this manuscript, we reduced the missing data problem to the simplest setting. For future work, more complicated logical structures involving more than 2 original outcomes should be considered. Regression modeling of logical outcomes with missing values is also a potentially fruitful area for future research.


    ACKNOWLEDGMENTS
 
Conflict of Interest: None declared.


    REFERENCES

    Banks J, Marmot M, Oldfield Z, Smith JP. Disease and disadvantage in the United States and in England. Journal of the American Medical Association (2006) 295:2037–2045.

    Basu D. On the elimination of nuisance parameters. Journal of the American Statistical Association (1977) 72:355–366.

    Nieto J, Young T, Lind B, Shahar E, Samet J, Redline S, D'Agostino R, Newman A, Lebowitz M, Pickering J. Association of sleep-disordered breathing, sleep apnea, and hypertension in a large community-based study. Journal of the American Medical Association (2000) 283:1829–1836.

    Peppard P, Young T, Palta M, Skatrud J. Prospective study of the association between sleep-disordered breathing and hypertension. New England Journal of Medicine (2000) 342:1378–1384.

    Received November 5, 2006; revised January 12, 2007; accepted for publication February 7, 2007.

