Biostatistics Advance Access published online on March 23, 2007
Biostatistics, doi:10.1093/biostatistics/kxm006
On the potential for illogic with logically defined outcomes
Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA xli{at}jhsph.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Logically defined outcomes are commonly used in medical diagnoses and epidemiological research. When missing values in the original outcomes exist, the method of handling the missingness can have unintended consequences, even if the original outcomes are missing completely at random. In this note, we consider 2 binary original outcomes, which are missing completely at random. For estimating the prevalence of a logically defined "or" outcome, we discuss the properties of 4 estimators: the complete-case estimator, the available-case estimator, the maximum likelihood estimator (MLE), and a moment-based estimator. With the exception of the available-case estimator, all the estimators are consistent. The MLE exhibits superior performance and should be generally adopted.
Keywords: Available-case estimator; Complete-case estimator; Hypertension; Maximum likelihood estimator; Missing data; Moment-based estimator
| 1. INTRODUCTION |
|---|
Logically defined outcomes arise frequently in biomedical practice and research. For example, in epidemiologic studies, a common definition of hypertension requires systolic blood pressure over 140 mmHg, diastolic blood pressure over 90 mmHg, or use of antihypertensive medications (see Nieto and others, 2000).
The most straightforward approach to addressing the missing data is to discard all logical outcomes for which any of the original outcomes is missing, an approach referred to as complete-case analysis. However, such an approach may discard known logical outcomes. For example, if a subject has missing blood pressure measurements but is known to be taking antihypertensive medication, then he or she is hypertensive, as per the operational definition; hence the logical outcome is known, despite some of the original outcomes being missing. The approach that estimates the prevalence by computing the fraction of known logical outcomes that are positive is called the available-case analysis. At first blush, this might seem to be a better strategy; we show otherwise.
In this note, we assume that the original outcomes are missing completely at random. We show that, while the complete-case approach provides a consistent estimator of the true prevalence, the available-case approach does not. As we demonstrate, this follows because the probability of missing the logical outcome then depends on unknown original outcomes (i.e. informative missingness). We derive 2 additional consistent estimators of the prevalence: the maximum likelihood estimator (MLE) and a moment-based estimator. Both these estimators use all the available data; the moment-based estimator is more computationally tractable but less efficient.
This note is organized as follows: In Section 2, we introduce a mathematical formalization of the problem. In Section 3, we introduce the 4 estimators and present their asymptotic properties. In Section 4, we highlight the results of a simulation study. Section 5 is devoted to a summary and discussion of directions for future research.
| 2. MATHEMATICAL FORMULATION |
|---|
For ease of exposition, we only consider 2 binary (yes/no) outcomes, labeled Y(1) and Y(2). We define the associated observed data 0/1 indicators as R(1) and R(2), where R(j) equals 1 if Y(j) is observed; otherwise it is 0. The logically defined outcome Y is 1 if Y(1) = 1 or Y(2) = 1; otherwise it is 0. Mathematically,
$$Y = \max\{Y^{(1)}, Y^{(2)}\} = 1 - (1 - Y^{(1)})(1 - Y^{(2)}).$$
Let $\pi_{jk}$ be the probability that Y(1) = j and Y(2) = k, and let $\rho_{lm}$ indicate the probability of R(1) = l and R(2) = m, where $\sum_{j,k} \pi_{jk} = 1$ and $\sum_{l,m} \rho_{lm} = 1$. We assume throughout that the original outcomes are missing completely at random, i.e. (Y(1),Y(2)) is independent of (R(1),R(2)). However, the outcomes can be dependent, as can the observed data indicators.
Of scientific interest is the estimation of
$$\mu = P[Y = 1] = \pi_{01} + \pi_{10} + \pi_{11} = 1 - \pi_{00}. \tag{2.1}$$
In what follows, we discuss the impact of the choice of R, the observed data indicator for Y, on the estimation of µ. A complete-case analysis sets R = R*, where
$$R^{*} = R^{(1)} R^{(2)}. \tag{2.2}$$
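The observed data indicator that uses all of the known values of Y sets $R = R^{\dagger}$, where
$$R^{\dagger} = R^{(1)} R^{(2)} + R^{(1)} (1 - R^{(2)})\, Y^{(1)} + (1 - R^{(1)})\, R^{(2)}\, Y^{(2)}. \tag{2.3}$$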
This is the so-called available-case analysis. Such an approach "seems" better because it does not discard known outcomes. However, the observed data indicator now depends on the original outcomes, so this approach induces informative missingness, even if the original data are missing completely at random.
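As a minimal sketch of these definitions, the following Python functions compute the logical outcome and the two observed data indicators for one subject. Representing a missing original outcome as `None`, and the function names themselves, are illustrative assumptions rather than anything specified in the paper.

```python
from typing import Optional


def logical_or(y1: Optional[int], y2: Optional[int]) -> Optional[int]:
    """Return Y when it is determined by the observed values, else None.

    Missing original outcomes are represented by None (an assumption
    made for this sketch).
    """
    if y1 == 1 or y2 == 1:
        return 1      # a single observed positive settles the "or"
    if y1 is None or y2 is None:
        return None   # no positive seen and something is missing: Y unknown
    return 0          # both observed and both zero


def complete_case_indicator(y1: Optional[int], y2: Optional[int]) -> int:
    """R*: 1 only when both original outcomes are observed."""
    return int(y1 is not None and y2 is not None)


def available_case_indicator(y1: Optional[int], y2: Optional[int]) -> int:
    """R-dagger: 1 whenever the logical outcome Y is known."""
    return int(logical_or(y1, y2) is not None)
```

A subject with an observed positive and a missing value, such as `(1, None)`, is dropped by the complete-case indicator but kept by the available-case indicator, whereas `(0, None)` is dropped by both. Whether a partially observed subject is kept therefore depends on the value of the outcome that was observed, which is the source of the informative missingness.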
| 3. FOUR ESTIMATORS AND THEIR ASYMPTOTIC PROPERTIES |
|---|
We assume that we have n independent and identically distributed copies of O = (R(1),R(2),R(1)·Y(1),R(2)·Y(2)). We reserve the subscript i to indicate individuals when necessary. We focus on the following 4 estimators of µ:
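$$\hat\mu_{\mathrm{CC}} = \frac{\sum_{i=1}^{n} R_i^{*} Y_i}{\sum_{i=1}^{n} R_i^{*}}, \qquad \hat\mu_{\mathrm{AC}} = \frac{\sum_{i=1}^{n} R_i^{\dagger} Y_i}{\sum_{i=1}^{n} R_i^{\dagger}},$$
$$\hat\mu_{\mathrm{MB}} = \frac{\sum_{i=1}^{n} R_i^{(1)} Y_i^{(1)}}{\sum_{i=1}^{n} R_i^{(1)}} + \frac{\sum_{i=1}^{n} R_i^{(2)} Y_i^{(2)}}{\sum_{i=1}^{n} R_i^{(2)}} - \frac{\sum_{i=1}^{n} R_i^{(1)} R_i^{(2)} Y_i^{(1)} Y_i^{(2)}}{\sum_{i=1}^{n} R_i^{(1)} R_i^{(2)}}, \qquad \hat\mu_{\mathrm{ML}} = \hat\pi_{01} + \hat\pi_{10} + \hat\pi_{11},$$
where $\hat\pi_{01}$, $\hat\pi_{10}$, and $\hat\pi_{11}$ are the maximum likelihood estimates described later in this section.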
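The first 2 estimators are simple averages of the observed values of Y; the complete-case estimator, $\hat\mu_{\mathrm{CC}}$, uses only the instances where both the original outcomes are observed, while the available-case estimator, $\hat\mu_{\mathrm{AC}}$, uses all the available logical outcomes. Write $\pi_{1+} = \pi_{10} + \pi_{11}$ and $\pi_{+1} = \pi_{01} + \pi_{11}$ for the marginal probabilities that Y(1) = 1 and Y(2) = 1. In general, while the complete-case estimator is consistent, the available-case estimator converges in probability to
$$\mu + (1 - \mu)\, \frac{\rho_{10}\pi_{1+} + \rho_{01}\pi_{+1}}{\rho_{11} + \rho_{10}\pi_{1+} + \rho_{01}\pi_{+1}}.$$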
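The second term indicates nonnegative bias. Provided that $\rho_{11} > 0$ and that $\pi_{1+}$ and $\pi_{+1}$ (the marginal probabilities that Y(1) = 1 and Y(2) = 1) are positive, it is zero if and only if $\rho_{10} = \rho_{01} = 0$ (i.e. no discordant missing data) or µ = 1 (i.e. the probability that both Y(1) and Y(2) are zero is zero, $\pi_{00} = 0$). Note that the bias approaches 1 when $\pi_{00}$ converges to 1 while $\rho_{11}$ converges to 0 quickly enough that complete cases become negligible among the available cases. When there is no discordant missingness, the complete-case and available-case estimators are identical. As ratio estimators, these estimators are asymptotically normal with an asymptotic variance of the form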
$$\frac{E[R\,(Y - \tilde\mu)^2]}{\{E[R]\}^2}, \qquad \tilde\mu = \frac{E[RY]}{E[R]}, \tag{3.1}$$
where $R = R^{*}$ for the complete-case estimator and $R = R^{\dagger}$ for the available-case estimator. In the complete-case setting, this expression simplifies to $\mu(1 - \mu)/\rho_{11}$, which is the Bernoulli variance divided by the probability of observing a complete case. The corresponding form for $\hat\mu_{\mathrm{AC}}$ is more complicated and is available in an online supplement.
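The inconsistency of the available-case estimator can be checked numerically. The sketch below assumes the probability limit obtained by taking expectations of the numerator and denominator of the available-case ratio under missingness completely at random. The dictionary layout (`pi[(j, k)]` for outcome probabilities, `rho[(l, m)]` for observation probabilities) and the function names are choices made for this example only.

```python
import random


def available_case_limit(pi, rho):
    """Probability limit of the available-case estimator.

    pi[(j, k)] = P(Y1 = j, Y2 = k); rho[(l, m)] = P(R1 = l, R2 = m).
    """
    mu = 1.0 - pi[(0, 0)]
    p1 = pi[(1, 0)] + pi[(1, 1)]  # P(Y1 = 1)
    p2 = pi[(0, 1)] + pi[(1, 1)]  # P(Y2 = 1)
    # Partially observed subjects enter only when the observed outcome is 1.
    extra = rho[(1, 0)] * p1 + rho[(0, 1)] * p2
    return (rho[(1, 1)] * mu + extra) / (rho[(1, 1)] + extra)


def draw(table, rng):
    """Draw one key from a dict that maps outcomes to probabilities."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys])[0]


def simulate(pi, rho, n, seed=0):
    """Return (complete-case, available-case) estimates from n subjects."""
    rng = random.Random(seed)
    cc_num = cc_den = ac_num = ac_den = 0
    for _ in range(n):
        y1, y2 = draw(pi, rng)
        r1, r2 = draw(rho, rng)  # drawn independently of (y1, y2): MCAR
        y = max(y1, y2)
        r_star = r1 * r2
        r_dag = r1 * r2 + r1 * (1 - r2) * y1 + (1 - r1) * r2 * y2
        cc_num += r_star * y
        cc_den += r_star
        ac_num += r_dag * y
        ac_den += r_dag
    return cc_num / cc_den, ac_num / ac_den
```

For instance, with a true prevalence of 0.4 and a substantial share of discordant missingness, the simulated complete-case estimate should settle near 0.4 while the available-case estimate settles near the larger value returned by `available_case_limit`.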
The moment-based estimator $\hat\mu_{\mathrm{MB}}$ is a direct estimator based on the fact that $\mu = \pi_{1+} + \pi_{+1} - \pi_{11}$, where $\pi_{1+} = \pi_{10} + \pi_{11}$ and $\pi_{+1} = \pi_{01} + \pi_{11}$ are the marginal probabilities that Y(1) = 1 and Y(2) = 1. Because the first 2 terms depend only on the individual original outcomes, this estimate makes use of more information than the complete-case estimator. Provided that $\rho_{11} > 0$, it is consistent as the first, second, and third terms converge in probability to $\pi_{1+}$, $\pi_{+1}$, and $\pi_{11}$, respectively. The estimator is asymptotically normal with an asymptotic variance
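$$\frac{(\pi_{1+} - 2\pi_{11})(1 - \pi_{1+})}{\rho_{1+}} + \frac{(\pi_{+1} - 2\pi_{11})(1 - \pi_{+1})}{\rho_{+1}} + \frac{\pi_{11}(1 - \pi_{11})}{\rho_{11}} + \frac{2\rho_{11}(\pi_{11} - \pi_{1+}\pi_{+1})}{\rho_{1+}\rho_{+1}},$$
where $\rho_{1+} = \rho_{10} + \rho_{11}$ and $\rho_{+1} = \rho_{01} + \rho_{11}$ are the probabilities of observing Y(1) and Y(2), respectively.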
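The moment-based estimator is simple enough to compute directly. The following is a minimal sketch, assuming each subject is stored as a pair `(y1, y2)` with `None` marking a missing original outcome; that data layout and the function name are illustrative choices.

```python
def moment_based_estimate(data):
    """Estimate mu as P(Y1 = 1) + P(Y2 = 1) - P(Y1 = 1, Y2 = 1).

    data: iterable of (y1, y2) pairs with values 0, 1, or None (missing).
    Each of the three proportions uses every subject for which the
    quantity involved is observed.
    """
    data = list(data)
    obs1 = [y1 for y1, _ in data if y1 is not None]
    obs2 = [y2 for _, y2 in data if y2 is not None]
    both = [y1 * y2 for y1, y2 in data
            if y1 is not None and y2 is not None]
    if not obs1 or not obs2 or not both:
        raise ValueError("each term needs at least one observed value")
    return (sum(obs1) / len(obs1)
            + sum(obs2) / len(obs2)
            - sum(both) / len(both))
```

Because the three proportions are computed on different subsets of subjects, nothing forces the result into [0, 1] in small samples, so in practice one may wish to truncate it.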
The final estimator is based on maximum likelihood. Since (R(1),R(2)) is ancillary (Basu, 1977) for $\theta = (\pi_{01}, \pi_{10}, \pi_{11})'$ (i.e. the marginal distribution of (R(1),R(2)) is the same for all $\theta$), the MLE for $\theta$ can be found by maximizing the conditional likelihood for the observed data given (R(1),R(2)). The conditional likelihood contribution for a random individual with observed data O is
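$$L(\theta; O) = \left\{ \pi_{11}^{Y^{(1)} Y^{(2)}}\, \pi_{10}^{Y^{(1)} (1 - Y^{(2)})}\, \pi_{01}^{(1 - Y^{(1)}) Y^{(2)}}\, \pi_{00}^{(1 - Y^{(1)})(1 - Y^{(2)})} \right\}^{R^{(1)} R^{(2)}} \left\{ \pi_{1+}^{Y^{(1)}} (1 - \pi_{1+})^{1 - Y^{(1)}} \right\}^{R^{(1)} (1 - R^{(2)})} \left\{ \pi_{+1}^{Y^{(2)}} (1 - \pi_{+1})^{1 - Y^{(2)}} \right\}^{(1 - R^{(1)}) R^{(2)}},$$
where $\pi_{00} = 1 - \pi_{01} - \pi_{10} - \pi_{11}$, $\pi_{1+} = \pi_{10} + \pi_{11}$, and $\pi_{+1} = \pi_{01} + \pi_{11}$.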
The overall conditional likelihood is $\prod_{i=1}^{n} L(\theta; O_i)$. The first, second, and third factors of the conditional likelihood function, $L(\theta; O)$, are the contributions from observations where both Y(1) and Y(2) are observed, Y(1) is available and Y(2) is missing, and Y(1) is missing and Y(2) is available, respectively. To obtain the MLE of $\theta$, one can maximize the likelihood numerically, for example, using a quasi-Newton algorithm. Operationally, it is useful to re-parameterize $\theta$ in terms of $\beta = (\beta_1, \beta_2, \beta_3)'$, where $\beta_1 = \log\{\pi_{10}/(1 - \pi_{01} - \pi_{10} - \pi_{11})\}$, $\beta_2 = \log\{\pi_{01}/(1 - \pi_{01} - \pi_{10} - \pi_{11})\}$, and $\beta_3 = \log\{\pi_{11}/(1 - \pi_{01} - \pi_{10} - \pi_{11})\}$, to eliminate boundary constraints.
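The text suggests a quasi-Newton algorithm on the re-parameterized likelihood. As an alternative that needs no external optimizer, the sketch below uses an EM iteration instead; this is a swapped-in technique, not the authors' stated method. It targets the same conditional likelihood, treating the four cell probabilities as multinomial parameters and splitting each partially observed subject across the cells compatible with what was seen. The data layout (`None` for missing) and all names are assumptions of this example.

```python
def mle_prevalence(data, iters=500, tol=1e-12):
    """EM sketch for the MLE of the cell probabilities and of mu.

    data: iterable of (y1, y2) pairs with values 0, 1, or None (missing).
    Returns (pi, mu_hat) where pi maps each cell (j, k) to its estimate.
    """
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    # Subjects with both outcomes missing carry no information on pi.
    usable = [(y1, y2) for y1, y2 in data
              if y1 is not None or y2 is not None]
    if not usable:
        raise ValueError("no subject has an observed outcome")
    pi = {c: 0.25 for c in cells}
    for _ in range(iters):
        counts = {c: 0.0 for c in cells}
        for y1, y2 in usable:
            # E-step: cells compatible with the observed part of (y1, y2).
            match = [c for c in cells
                     if (y1 is None or c[0] == y1)
                     and (y2 is None or c[1] == y2)]
            total = sum(pi[c] for c in match)
            for c in match:
                counts[c] += pi[c] / total
        # M-step: expected cell counts divided by the number of subjects.
        new_pi = {c: counts[c] / len(usable) for c in cells}
        change = max(abs(new_pi[c] - pi[c]) for c in cells)
        pi = new_pi
        if change < tol:
            break
    return pi, 1.0 - pi[(0, 0)]
```

With fully observed data the iteration reproduces the sample cell proportions after one step. With partial data, EM can converge slowly, which is one reason a quasi-Newton method on the $\beta$ scale, as described in the text, may be preferable in practice.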
Assuming that the solution lies within the interior of a compact set, the MLE of $\theta$, $\hat\theta$, will be consistent and asymptotically normal with asymptotic variance equal to the inverse of the Fisher information matrix, which is a 3x3 matrix, $I(\theta)$, with ith row, jth column denoted by $I_{ij}(\theta)$, where
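$$I_{ij}(\theta) = \rho_{11} \left( \frac{\delta_{ij}}{\theta_i} + \frac{1}{\pi_{00}} \right) + \rho_{10}\, \frac{a_i a_j}{\pi_{1+}(1 - \pi_{1+})} + \rho_{01}\, \frac{b_i b_j}{\pi_{+1}(1 - \pi_{+1})},$$
with $\delta_{ij}$ equal to 1 if i = j and 0 otherwise, $a = (0, 1, 1)'$, $b = (1, 0, 1)'$, $\pi_{00} = 1 - \pi_{01} - \pi_{10} - \pi_{11}$, $\pi_{1+} = \pi_{10} + \pi_{11}$, and $\pi_{+1} = \pi_{01} + \pi_{11}$.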
By the invariance property, the MLE of µ is $\hat\mu_{\mathrm{ML}} = \hat\pi_{01} + \hat\pi_{10} + \hat\pi_{11}$. This estimator will be consistent and asymptotically normal with asymptotic variance found using the multivariate delta method.
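Because µ is a linear function of the parameter vector, the delta-method step is short. The following is a sketch, assuming $I(\theta)$ denotes the per-observation Fisher information for $\theta = (\pi_{01}, \pi_{10}, \pi_{11})'$; the symbol $g$ is introduced here only for illustration.

```latex
% mu as a function of theta, and its gradient
g(\theta) = \pi_{01} + \pi_{10} + \pi_{11},
\qquad
\nabla g(\theta) = (1, 1, 1)'.

% Multivariate delta method: asymptotic variance of sqrt(n) (mu_hat - mu)
\mathrm{avar}(\hat\mu_{\mathrm{ML}})
  = \nabla g(\theta)'\, I(\theta)^{-1}\, \nabla g(\theta)
  = \sum_{i=1}^{3} \sum_{j=1}^{3} \left[ I(\theta)^{-1} \right]_{ij}.
```

Under these assumptions the asymptotic variance is simply the sum of all entries of the inverse information matrix, and it can be estimated by evaluating that sum at $\hat\theta$.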
| 4. COMPARISON OF ESTIMATORS |
|---|
Due to the relative simplicity of the expressions for the asymptotic variances of the complete-case and moment-based estimators, we are able to analytically compare their efficiencies. In the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org), we present a general formula for the difference between the asymptotic variances of the moment-based estimator and the complete-case estimator. We prove that, when there is some discordant missingness and the proportion of complete cases is small, the choice between the complete-case and moment-based estimators relies on whether it is better to estimate $\pi_{00}$ or $\pi_{11}$ with the complete cases. Specifically, the moment-based (complete-case) estimator is dramatically more efficient if $\pi_{11}$ ($\pi_{00}$) is further from 0.5 than $\pi_{00}$ ($\pi_{11}$). These results were confirmed in a simulation study (see the supplementary material available at Biostatistics online). The study also showed the poor performance of the available-case estimator in terms of both bias and mean-squared error, except in the case where there is no discordant missingness or all the original outcomes are zero. Further, the study demonstrated the superior efficiency of the MLE relative to the other estimators.
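The efficiency comparison between the complete-case and moment-based estimators can be explored numerically. The sketch below assumes the complete-case variance µ(1 − µ) divided by the complete-case probability, and a moment-based variance obtained here from a standard influence-function calculation; that second expression is this example's own derivation, not a formula copied from the paper or its supplement. The dictionary layout and function names are likewise illustrative.

```python
def avar_complete_case(pi, rho):
    """Bernoulli variance of Y divided by P(both outcomes observed)."""
    mu = 1.0 - pi[(0, 0)]
    return mu * (1.0 - mu) / rho[(1, 1)]


def avar_moment_based(pi, rho):
    """Asymptotic variance of the moment-based estimator (derived sketch).

    pi[(j, k)] = P(Y1 = j, Y2 = k); rho[(l, m)] = P(R1 = l, R2 = m).
    Requires rho[(1, 1)] > 0.
    """
    a = pi[(1, 0)] + pi[(1, 1)]     # P(Y1 = 1)
    b = pi[(0, 1)] + pi[(1, 1)]     # P(Y2 = 1)
    c = pi[(1, 1)]                  # P(Y1 = 1, Y2 = 1)
    r1 = rho[(1, 0)] + rho[(1, 1)]  # P(Y1 observed)
    r2 = rho[(0, 1)] + rho[(1, 1)]  # P(Y2 observed)
    r12 = rho[(1, 1)]               # P(both observed)
    return ((a - 2.0 * c) * (1.0 - a) / r1
            + (b - 2.0 * c) * (1.0 - b) / r2
            + c * (1.0 - c) / r12
            + 2.0 * r12 * (c - a * b) / (r1 * r2))
```

With few complete cases and the probability of two positives close to 0, the moment-based variance comes out smaller; swapping the roles of the two-positives and two-negatives probabilities reverses the ordering, in line with the rule stated in the text. With no missing data the two expressions coincide.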
| 5. DISCUSSION |
|---|
Logically defined outcomes are commonly used in medical diagnosis and epidemiological research. Without missing values in the original outcomes, the estimation of the prevalence of the logically defined outcomes is straightforward. However, when there are missing values in some of the original outcomes, the method of handling the missingness can have unintended consequences, even if the original outcomes are missing completely at random. We believe that this potential problem is largely unknown. Complicating the issue is that standard packages differ in the default behavior of the mathematical "or" operator.
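The statistical packages the authors have in mind are not named in the text. As an illustration of the general point only, plain Python already shows how an "or"-like computation can treat missing values in different and order-dependent ways, depending on whether missingness is coded as `None` or as a floating-point NaN.

```python
import math

nan = float("nan")

# Built-in max() compares with ">" and every comparison with NaN is False,
# so the result depends on argument order.
first_known = max(1.0, nan)    # keeps 1.0 because nan > 1.0 is False
first_missing = max(nan, 1.0)  # keeps nan because 1.0 > nan is False

# The `or` operator returns its first truthy operand, else its last operand.
missing_or_yes = None or 1     # a known positive is recovered
missing_or_no = None or 0      # the missing value is silently read as "no"
no_or_missing = 0 or None      # the missing value is propagated
```

Here `None or 0` evaluates to `0`, quietly turning an unknown logical outcome into a negative one, while `0 or None` stays missing. Neither convention was chosen with prevalence estimation in mind, which is why the handling of missing inputs should be made explicit before a logical outcome is computed.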
The MLE is the optimal choice, though it requires the use of numerical optimization techniques. Regardless, we would recommend its general use in these problems. We would never recommend the available-case estimator, though it does appear to be used in practice.
In this manuscript, we reduced the missing data problem to the simplest setting. For future work, more complicated logical structures involving more than 2 original outcomes should be considered. Regression modeling of logical outcomes with missing values is also a potentially fruitful area for future research.
| ACKNOWLEDGMENTS |
|---|
Conflict of Interest: None declared.
| REFERENCES |
|---|
Banks J, Marmot M, Oldfield Z, Smith JP. (2006) Disease and disadvantage in the United States and in England. Journal of the American Medical Association 295:2037-2045.
Basu D. (1977) On the elimination of nuisance parameters. Journal of the American Statistical Association 72:355-366.
Nieto J, Young T, Lind B, Shahar E, Samet J, Redline S, D'Agostino R, Newman A, Lebowitz M, Pickering J. (2000) Association of sleep-disordered breathing, sleep apnea, and hypertension in a large community-based study. Journal of the American Medical Association 283:1829-1836.
Peppard P, Young T, Palta M, Skatrud J. (2000) Prospective study of the association between sleep-disordered breathing and hypertension. New England Journal of Medicine 342:1378-1384.
Received November 5, 2006; revised January 12, 2007; accepted for publication February 7, 2007.