Biostatistics Advance Access originally published online on September 12, 2006
Biostatistics 2007 8(2):468-473; doi:10.1093/biostatistics/kxl024
Numerical equivalence of imputing scores and weighted estimators in regression analysis with missing covariates
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, PO Box 19024, Seattle, WA 98109-1024, USA cywang{at}fhcrc.org
Department of Statistics, Feng-Chia University, Taichung, Taiwan, Republic of China
Insightful Corporation, 1700 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Imputation, weighting, direct likelihood, and direct Bayesian inference (Rubin, 1976) are important approaches for missing data regression. Many useful semiparametric estimators have been developed for regression analysis of data with missing covariates or outcomes. It has been established that some semiparametric estimators are asymptotically equivalent, but it has not been shown that many are numerically the same. We applied some existing methods to a bladder cancer case-control study and noted that they were the same numerically when the observed covariates and outcomes are categorical. To understand the analytical background of this finding, we further show that when observed covariates and outcomes are categorical, some estimators are not only asymptotically equivalent but also actually numerically identical. That is, although their estimating equations are different, they lead numerically to exactly the same root. This includes a simple weighted estimator, an augmented weighted estimator, and a mean-score estimator. The numerical equivalence may elucidate the relationship between imputing scores and weighted estimation procedures.
Keywords: Estimating equation; Ignorable missingness; Inverse selection probability; Missing at random
| 1. INTRODUCTION |
|---|
Our methodology is motivated by prior analysis of data from a case-control study of bladder cancer conducted at the Fred Hutchinson Cancer Research Center. Eligible subjects were recruited from three western counties in Washington state. Cases were those diagnosed between January 1987 and June 1990 with invasive or noninvasive bladder cancer. This population-based case-control study was designed to address the association between bladder cancer and various risk factors. Detailed results can be found in Bruemmer and others (1996).

One covariate of interest was pack year, which is defined as the average number of cigarette packs smoked per day multiplied by the number of years one has been smoking. Of the 667 subjects available in the data, two did not have body mass index (BMI) information. Smoking year was available for all of the remaining 665 subjects. However, pack year information was missing for one case and for 38.3% of the controls. The question of interest, then, is how to estimate the odds ratios of pack year and other covariates, given that many subjects are missing pack year data. In this study, there is almost no missing information from the cases, but we will need a statistical method to adjust for the incomplete data in the controls. In this example, the discretized binary smoking year (0, if less than 30 years; 1, otherwise) may be used as a surrogate variable for pack year data. Obesity (BMI ≥ 30) (Centers for Disease Control and Prevention) is also a risk factor of interest in this study. The probabilities of non-missing data are also called selection probabilities, a term used primarily in two-stage studies to denote the probability of being selected in the second-stage sampling. Some analyses of the data show that in this example the non-missing probabilities depend on the disease outcome, obesity status, and binary smoking year level. Therefore, methods for missing data are important for the analysis of the data.

There is a rich body of literature on missing data regression; see Little and Rubin (2002) for a review. Here, we consider data where covariates may be missing at random (MAR), such that the missing data mechanism does not depend on the missing data itself. In this paper, we will show some numerical equivalence relationships between the mean-score estimator (Reilly and Pepe, 1995) and weighted estimators that were proposed in Robins and others (1994) when some data are MAR. In Section 2, we will review the mean-score estimator and a class of inverse probability estimators. Section 3 presents our findings that when observed variables are discrete, some semiparametric estimators are not only asymptotically equivalent but also numerically the same. We say that two estimators are numerically equivalent when they are numerically identical for any data set to which they are applied. Our concluding observations are detailed in Section 4.
| 2. SOME SEMIPARAMETRIC METHODS |
|---|
Let $Y$ be the outcome variable for the regression analysis of interest, $X$ be the partially missing covariate vector, $Z$ be the always observable covariate vector, and $W$ be the observable surrogate for $X$, i.e. $Y$ and $W$ are conditionally independent given $(X,Z)$. Let $\mathcal{X} = (1, X', Z')'$. The aim is to estimate the regression coefficients $\beta$ in the following assumed regression model:

$$E(Y \mid X, Z) = \mu(\beta'\mathcal{X})$$

for some function $\mu$, where $\beta = (\beta_0, \beta_1', \beta_2')'$ is a vector of parameters. For example, in linear regression, $\mu$ is the identity function, while for logistic regression $\mu(u) = (1 + e^{-u})^{-1}$. Let $\delta_i$ indicate whether $X_i$ is observed ($\delta_i = 1$) or not ($\delta_i = 0$). The complete-case subset ($\delta_i = 1$) consists of $(Y_i,X_i,Z_i,W_i)$, and the noncomplete-case subset ($\delta_i = 0$) consists of $(Y_i,Z_i,W_i)$. The probability of $X_i$ being observed, $\pi_i$ $(i = 1,\ldots,n)$, is defined to be $\mathrm{pr}(\delta_i = 1 \mid Y_i,X_i,Z_i,W_i)$, where $n$ is the number of all observations. We assume $X_i$ is MAR such that the selection probability $\mathrm{pr}(\delta_i = 1 \mid Y_i,X_i,Z_i,W_i)$ depends on $(Y_i,Z_i,W_i)$ but not on $X_i$. Let $V_i = (Y_i, Z_i', W_i')'$, then $\pi_i = \pi(Y_i,Z_i,W_i) = \pi(V_i)$.
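When there is no missing data, $\beta$ can be consistently estimated by solving the estimating equation $n^{-1}\sum_{i=1}^{n}\mathcal{X}_i\{Y_i - \mu(\beta'\mathcal{X}_i)\} = 0$. Let $S(Y_i,X_i,Z_i) = \mathcal{X}_i\{Y_i - \mu(\beta'\mathcal{X}_i)\}$ and $\tilde S(V_i) = E\{S(Y_i,X_i,Z_i) \mid V_i\}$, where $E(\cdot)$ denotes expectation. The idea of Reilly and Pepe (1995) was to replace $S(Y_i,X_i,Z_i)$ by the conditional score $\tilde S(V_i)$ when $X_i$ is missing. For discrete $V$, $\tilde S$ may be estimated nonparametrically by its empirical average using the complete-case data with the same $V_i$ value:

$$\hat S(V_i) = \frac{\sum_{k=1}^{n}\delta_k\, S(Y_k,X_k,Z_k)\, I[V_k = V_i]}{\sum_{k=1}^{n}\delta_k\, I[V_k = V_i]}, \tag{2.1}$$

where $I[\cdot]$ is the indicator function. Then, the mean-score estimator solves

$$n^{-1}\sum_{i=1}^{n}\left\{\delta_i\, S(Y_i,X_i,Z_i) + (1-\delta_i)\,\hat S(V_i)\right\} = 0.$$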
The idea behind the mean-score estimator is to replace the unobserved score by an average of some observed scores based on the observed Vi values.
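As a rough illustration of how the mean-score estimating function could be evaluated, the following Python sketch computes its left-hand side for a logistic model with scalar X and Z. The function names (`mu`, `score`, `mean_score_equation`), the record layout, and the toy records are assumptions made for this example and do not come from the paper. The sketch also assumes that every cell of V = (Y, Z, W) containing an incomplete case holds at least one complete case.

```python
import math
from collections import defaultdict

def mu(u):
    """Logistic mean function (1 + e^{-u})^{-1}."""
    return 1.0 / (1.0 + math.exp(-u))

def score(beta, y, x, z):
    """Full-data score (1, x, z)' * {y - mu(beta'(1, x, z))} for one subject."""
    design = (1.0, x, z)
    resid = y - mu(sum(b * d for b, d in zip(beta, design)))
    return [d * resid for d in design]

def mean_score_equation(beta, data):
    """Left-hand side of the mean-score estimating equation.

    data: list of (y, x, z, w, delta); x is ignored when delta == 0.
    Assumes each cell with an incomplete case has a complete case.
    """
    p = len(beta)
    sums = defaultdict(lambda: [0.0] * p)   # complete-case score sums per cell
    counts = defaultdict(int)               # complete-case counts per cell
    for y, x, z, w, delta in data:
        if delta == 1:
            cell = (y, z, w)
            s = score(beta, y, x, z)
            sums[cell] = [a + b for a, b in zip(sums[cell], s)]
            counts[cell] += 1
    total = [0.0] * p
    for y, x, z, w, delta in data:
        if delta == 1:
            contrib = score(beta, y, x, z)
        else:
            cell = (y, z, w)
            # Empirical average of complete-case scores in the same cell, as in (2.1).
            contrib = [a / counts[cell] for a in sums[cell]]
        total = [t + c for t, c in zip(total, contrib)]
    return [t / len(data) for t in total]

# Hypothetical toy records (y, x, z, w, delta); the second subject has X missing.
toy = [(1, 0.5, 0, 1, 1), (1, None, 0, 1, 0), (0, 1.5, 1, 0, 1)]
lhs = mean_score_equation((0.0, 0.0, 0.0), toy)
```

The estimator itself would be the root of this function in `beta`, found with any standard root-finding routine; the sketch stops at evaluating the estimating function.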
Rather than estimating the score, an alternative approach is to use inverse selection probability weighting to accommodate missing data. Similar to the idea of Horvitz and Thompson (1952), the simplest weighted estimator uses subjects in the validation set and applies $\{\pi(V_i)\}^{-1} = \pi_i^{-1}$ as the weight for subject $i$. Zhao and Lipsitz (1992) proposed the simple inverse probability weighted (SIPW) estimator, which solves the estimating equation $n^{-1}\sum_{i=1}^{n}(\delta_i/\pi_i)\,S(Y_i,X_i,Z_i) = 0$. We denote this estimator by SIPW1. The SIPW1 estimator is applicable in a two-stage study in which the selection probabilities of the second-stage subjects are known to the study investigators. Let $\hat E(X)$ denote the empirical mean for any $X_1,\ldots,X_n$, that is, $n^{-1}\sum_{i=1}^{n}X_i$. The SIPW1 estimating equation is $\hat E\{(\delta/\pi)\,S(Y,X,Z)\} = 0$, which has limit $E\{S(Y,X,Z)\} = 0$. Therefore, it can be easily seen that the SIPW1 estimator is consistent for $\beta$.
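The limit claimed for the SIPW1 estimating equation rests on a single conditioning step, which can be written out as a short worked equation. It uses only the MAR assumption stated earlier, namely that pr(δ = 1 | Y, X, Z, W) = π(V), with the final equality holding at the true β:

$$
E\left\{\frac{\delta}{\pi(V)}\,S(Y,X,Z)\right\}
= E\left[\frac{E(\delta \mid Y,X,Z,W)}{\pi(V)}\,S(Y,X,Z)\right]
= E\left[\frac{\pi(V)}{\pi(V)}\,S(Y,X,Z)\right]
= E\{S(Y,X,Z)\} = 0.
$$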
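In many missing data problems, data may be missing by happenstance. In this case, the true $\pi_i$ is not known, but it generally can be estimated consistently under the MAR assumption. Let $\hat\pi_i$ estimate $\pi_i$ consistently, then the SIPW estimator using the estimated $\hat\pi_i$ solves

$$n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\hat\pi_i}\,S(Y_i,X_i,Z_i) = 0. \tag{2.2}$$

Note that

$$n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\hat\pi_i}\,S(Y_i,X_i,Z_i) = n^{-1}\sum_{i=1}^{n}\frac{\delta_i}{\pi_i}\,S(Y_i,X_i,Z_i) - n^{-1}\sum_{i=1}^{n}\frac{\delta_i\,(\hat\pi_i - \pi_i)}{(\pi_i^{*})^{2}}\,S(Y_i,X_i,Z_i),$$

where $\pi_i^{*}$ is between $\pi_i$ and $\hat\pi_i$, and generally the second term is negligible, i.e. it is of order $o_p(1)$. Therefore, the SIPW estimator solving (2.2) will lead to a consistent estimator under some general conditions. We denote this estimator by SIPW2. The SIPW2 estimator is the same as the SIPW1 estimator except that it uses estimated selection probabilities.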
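Generally, when $Z$ and $W$ are discrete, misspecifying $\pi$ is not a concern. In this case, for any $v$ in the support of $V$, the nonparametric estimator is

$$\hat\pi(v) = \frac{\sum_{k=1}^{n}\delta_k\, I(V_k = v)}{\sum_{k=1}^{n} I(V_k = v)}, \tag{2.3}$$

where $I(\cdot)$ is an indicator function.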
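The nonparametric estimator (2.3) is simply the fraction of complete cases within each cell of V. The Python sketch below illustrates this on a handful of made-up records; the function name `estimate_selection_probs` and the toy cells are hypothetical, and the sketch assumes every observed cell contains at least one complete case so that the inverse weights are finite.

```python
from collections import Counter

def estimate_selection_probs(cells, deltas):
    """Empirical selection probability per cell v: complete cases / all subjects."""
    n_all = Counter(cells)
    n_obs = Counter(c for c, d in zip(cells, deltas) if d == 1)
    return {v: n_obs[v] / n_all[v] for v in n_all}

# Hypothetical cells V = (y, z, w) and missingness indicators delta.
cells = [(1, 0, 1), (1, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 1, 0)]
deltas = [1, 1, 1, 0, 0, 1]

pi_hat = estimate_selection_probs(cells, deltas)

# SIPW2 weights delta_i / pi_hat(V_i); incomplete cases receive weight 0.
weights = [d / pi_hat[c] for c, d in zip(cells, deltas)]
```

With these cell-wise estimates the weights within each cell add up to the cell size, so the weights sum to the total sample size.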
The SIPW2 estimator applies the inverse probability weights to the estimating function for subjects from the complete-case subset. However, it does not directly apply the expected estimating function for subjects from the noncomplete-case set. Therefore, it is natural to include an augmented term to gain efficiency. This idea can be generalized to the class of weighted estimators proposed in Robins and others (1994). One of the simplest estimators among the class is the following augmented inverse probability weighted (AIPW) estimator which solves
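$$n^{-1}\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi_i}\,S(Y_i,X_i,Z_i) + \left(1-\frac{\delta_i}{\pi_i}\right)\tilde S(V_i)\right] = 0. \tag{2.4}$$

We denote this specific AIPW estimator by AIPW1, which uses the true selection probabilities and true conditional scores. As mentioned earlier in this section, $\hat E\{(\delta/\pi)\,S(Y,X,Z)\}$ has limit $E\{S(Y,X,Z)\} = 0$. Also, it is easily seen that $\hat E[\{1-(\delta/\pi)\}\,\tilde S(V)]$ has limit 0 since $E[\{1-(\delta/\pi)\}\,\tilde S(V) \mid V] = 0$ for any $\beta$. Therefore, the AIPW1 estimator solving (2.4) will lead to a consistent estimator.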
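The conditional mean-zero property of the augmentation term can be checked in one line. Because the selection probability depends on the data only through V under MAR, E(δ | V) = π(V), and the conditional score is itself a function of V:

$$
E\left[\left\{1-\frac{\delta}{\pi(V)}\right\}\tilde S(V)\,\Big|\,V\right]
= \left\{1-\frac{E(\delta \mid V)}{\pi(V)}\right\}\tilde S(V)
= \left\{1-\frac{\pi(V)}{\pi(V)}\right\}\tilde S(V) = 0.
$$

The same calculation goes through with any other function of V in place of the conditional score, since only the factor in braces is averaged.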
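When both $\pi$ and $\tilde S$ are unknown, we may apply their nonparametric estimators to (2.4); the resulting estimator is denoted by AIPW2. The AIPW2 estimator is in general more practical since often the $\pi$ values are unknown. The AIPW estimator has the feature of being doubly robust. When the selection probabilities $\pi_i$ are consistently estimated, the arguments given above show that $\hat E[\{1-(\delta/\pi)\}\,A(V)]$ has limit 0 for any function $A(V)$. This is robustness against misspecification of $\tilde S$. In addition, AIPW is robust against misspecification of $\pi$. Consider the case when $\tilde S(V)$ is correctly calculated but $\pi$ is misspecified so that $\hat\pi \to \pi^{*}$, which is different from $\pi$. Then using a false $\pi^{*}$ in the AIPW estimator leads to the estimating equation $\hat E[(\delta/\pi^{*})\,S(Y,X,Z) + \{1-(\delta/\pi^{*})\}\,\tilde S(V)] = 0$. This has limit $E[(\delta/\pi^{*})\,S(Y,X,Z) + \{1-(\delta/\pi^{*})\}\,\tilde S(V)] = 0$. By direct calculation, we note that

$$
\begin{aligned}
E\left[\frac{\delta}{\pi^{*}}\,S(Y,X,Z) + \left(1-\frac{\delta}{\pi^{*}}\right)\tilde S(V)\right]
&= E\left[\frac{\pi}{\pi^{*}}\,S(Y,X,Z) + \left(1-\frac{\pi}{\pi^{*}}\right)\tilde S(V)\right] \\
&= E\left[\frac{\pi}{\pi^{*}}\,\tilde S(V) + \left(1-\frac{\pi}{\pi^{*}}\right)\tilde S(V)\right] \\
&= E\{\tilde S(V)\} = E\{S(Y,X,Z)\}.
\end{aligned}
$$

Therefore, the estimating equation for the AIPW estimator has the same limit as the full data estimating score under a misspecified $\pi$, if $\tilde S(V_i)$ is correctly calculated.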
| 3. NUMERICAL EQUIVALENCE OF MEAN-SCORE AND WEIGHTED ESTIMATORS FOR CATEGORICAL DATA |
|---|
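One question of interest is the relationship between the mean-score estimator and various weighted estimators if $V_i$ is categorical. To address this question, we now present the first equivalence result. Let $\hat\pi_i = \hat\pi(V_i)$ be the empirical average estimator of $\pi_i$ shown in (2.3), and $\hat S(V_i)$ be the empirical average estimator of $\tilde S(V_i)$ shown in (2.1). Given the observed data, let $A(V)$ be any function of $V$, then

$$
\sum_{i=1}^{n}\left[\frac{\delta_i}{\hat\pi_i}\,S(Y_i,X_i,Z_i) + \left(1-\frac{\delta_i}{\hat\pi_i}\right)A(V_i)\right]
= \sum_{i=1}^{n}\frac{\delta_i}{\hat\pi_i}\,S(Y_i,X_i,Z_i)
= \sum_{i=1}^{n}\left\{\delta_i\,S(Y_i,X_i,Z_i) + (1-\delta_i)\,\hat S(V_i)\right\}.
$$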
The first equivalence result is stronger than the first robustness property of the AIPW estimator given in Section 2. Further, no matter which augmented term is used, the AIPW2 estimator is numerically the same as the SIPW2 estimator that uses the empirical average estimator for the selection probabilities. Both are also the same as the mean-score estimator of Reilly and Pepe (1995).
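The first equivalence result is an algebraic identity in the scores, so it can be checked numerically without fitting any model. The Python sketch below does so under stated assumptions: the cell labels stand in for V, arbitrary numbers stand in for one component of S(Y_i, X_i, Z_i), and the augmentation values in `A` are chosen only for illustration. None of these toy inputs come from the paper.

```python
import math
from collections import defaultdict

n = 60
# Hypothetical toy data: three cells, each with 20 subjects and 12 complete cases.
cells = ["abc"[i % 3] for i in range(n)]
deltas = [1 if i % 5 < 3 else 0 for i in range(n)]
# Arbitrary stand-ins for the scores; only used where delta_i = 1.
scores = [math.sin(i) for i in range(n)]

n_all, n_obs, s_sum = defaultdict(int), defaultdict(int), defaultdict(float)
for v, d, s in zip(cells, deltas, scores):
    n_all[v] += 1
    n_obs[v] += d
    s_sum[v] += d * s

pi_hat = {v: n_obs[v] / n_all[v] for v in n_all}   # cell-wise estimate, as in (2.3)
s_hat = {v: s_sum[v] / n_obs[v] for v in n_all}    # cell-wise estimate, as in (2.1)

# An arbitrary augmentation function of V, chosen only for illustration.
A = {"a": 5.0, "b": -2.0, "c": 0.3}

rows = list(zip(cells, deltas, scores))
sipw2 = sum(d / pi_hat[v] * s for v, d, s in rows)
aipw2 = sum(d / pi_hat[v] * s + (1 - d / pi_hat[v]) * A[v] for v, d, s in rows)
mean_score = sum(d * s + (1 - d) * s_hat[v] for v, d, s in rows)

print(sipw2, aipw2, mean_score)
```

The three sums agree up to floating-point rounding. Because the identity holds for any values of the scores, it holds at every β, which is why the three estimating equations share the same root.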
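We now present a dual, but different, equivalence result. Let $\hat\pi_i = \hat\pi(V_i)$ and $\hat S(V_i)$ be the empirical average estimators of $\pi_i$ and $\tilde S(V_i)$ given in (2.3) and (2.1), respectively. Given the observed data, let $\pi_i^{*} = \pi^{*}(V_i)$ be any function of $V$, then

$$
\sum_{i=1}^{n}\left[\frac{\delta_i}{\pi_i^{*}}\,S(Y_i,X_i,Z_i) + \left(1-\frac{\delta_i}{\pi_i^{*}}\right)\hat S(V_i)\right]
= \sum_{i=1}^{n}\frac{\delta_i}{\hat\pi_i}\,S(Y_i,X_i,Z_i).
$$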
Recall that the second robustness of AIPW says that as long as the augmented term is consistently estimated, the augmented estimator is consistent, even if the selection probabilities are wrongly estimated. This equivalence result is stronger since it says that as long as the augmented term is estimated by the empirical average estimator, then no matter how the selection probabilities are estimated, the AIPW2 estimator is the same as the SIPW2 estimator which uses the empirical average estimates for the selection probabilities. The proofs of the two equivalence results, along with some numerical results, are posted on this journal's web site.
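The second equivalence result can be checked in the same spirit. In the Python sketch below, the helper names (`cell_estimates`, `aipw_sum`), the toy cells, the stand-in scores, and the deliberately wrong selection probabilities in `pi_star` are all assumptions made for illustration. The only requirement used is that `pi_star` is a nonzero function of the cell.

```python
import math
from collections import defaultdict

def cell_estimates(cells, deltas, scores):
    """Cell-wise empirical estimates of the selection probability and mean score."""
    n_all, n_obs, s_sum = defaultdict(int), defaultdict(int), defaultdict(float)
    for v, d, s in zip(cells, deltas, scores):
        n_all[v] += 1
        n_obs[v] += d
        s_sum[v] += d * s
    pi_hat = {v: n_obs[v] / n_all[v] for v in n_all}
    s_hat = {v: s_sum[v] / n_obs[v] for v in n_all}
    return pi_hat, s_hat

def aipw_sum(cells, deltas, scores, probs, augment):
    """Sum of (delta/p) * S + (1 - delta/p) * augment over all subjects."""
    return sum(d / probs[v] * s + (1 - d / probs[v]) * augment[v]
               for v, d, s in zip(cells, deltas, scores))

n = 60
cells = ["abc"[i % 3] for i in range(n)]            # hypothetical cells of V
deltas = [1 if i % 5 < 3 else 0 for i in range(n)]  # hypothetical missingness
scores = [math.cos(i) for i in range(n)]            # stand-ins for the scores

pi_hat, s_hat = cell_estimates(cells, deltas, scores)

# Deliberately wrong selection probabilities, any nonzero function of V.
pi_star = {"a": 0.9, "b": 0.25, "c": 0.5}

aipw_wrong_pi = aipw_sum(cells, deltas, scores, pi_star, s_hat)
sipw2 = sum(d / pi_hat[v] * s for v, d, s in zip(cells, deltas, scores))
```

Within each cell the weighted score term and the weighted part of the augmentation cancel, leaving the cell size times the cell-wise mean score, which is also what the SIPW2 sum reduces to.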
Another question of interest, but not covered by the above propositions, is whether AIPW1 and AIPW2 are the same numerically. The distribution of X given V can be obtained by the following Bayes formula:
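$$f(x \mid y, z, w) = \frac{f(y \mid x, z)\, f(x \mid z, w)}{\int f(y \mid x^{*}, z)\, f(x^{*} \mid z, w)\, dx^{*}},$$

where $f(\cdot \mid \cdot)$ denotes a conditional density and the first factor in the numerator uses the surrogate assumption that $Y$ and $W$ are conditionally independent given $(X,Z)$.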
As in Wang and Wang (2001), AIPW1 is asymptotically equivalent to SIPW2 and AIPW2. However, it is numerically different from them.
Two other AIPW estimators may be considered by hybrid designs. The first estimator is considered if the true $\pi$ is known while $\tilde S$ is estimated, and the associated AIPW estimator is denoted by AIPW3; the second if the true $\tilde S$ is known while $\pi$ is estimated, and the associated AIPW estimator is denoted by AIPW4. From the equivalence results given above, it is easily seen that AIPW3 and AIPW4 are numerically identical to AIPW2 as long as the observed data are categorical, and the $\tilde S$ in AIPW3 and $\pi$ in AIPW4 are estimated by their nonparametric estimators (2.1) and (2.3), respectively.
| 4. DISCUSSION |
|---|
We have established numerical equivalence of some semiparametric estimators. The numerical equivalence results, nevertheless, hold only when covariate data are discrete. Our numerical results show that there is no efficiency gain when using the augmentation term if nonparametric estimates of the selection probabilities are applied to the weights. For continuous observed data, kernel smoothing may be applied to estimate $\pi$ and $\tilde S$ (Wang and others, 1997; Wang and Wang, 2001). Wang and Wang (2001) established some asymptotic equivalence relationships between the mean-score estimator and weighted estimators. Note that if there are more covariates in the model, modeling the distribution of covariates is likely to encounter the curse of dimensionality. Therefore, the kernel-assisted estimators are applicable primarily when there are few continuous variables in the regression model of interest. To address this issue, partial linear modeling of $\pi$ and $\tilde S$ may be considered; see Liang and others (2004). To ease presentation, the canonical link function was assumed in Section 2. Further technical conditions are needed for a general link function, primarily for consistency of the aforementioned estimators. Nevertheless, the two numerical equivalence results still hold for discrete observed data.
| ACKNOWLEDGMENTS |
|---|
This research was partially supported by the National Institutes of Health funds AG15026, CA53996 (Wang), and CA88754 (Wang and Chao). The authors thank Barbara Bruemmer and Emily White for providing us with the case-control data, Noelle Noble for editing the paper, and the coeditor, associate editor, and referee for helpful comments. Conflict of Interest: None declared.
| REFERENCES |
|---|
Bruemmer B, White E, Vaughan T, Cheney C. (1996) Nutrient intake in relationship to bladder cancer among middle aged men and women. American Journal of Epidemiology 144:485-495.
Horvitz DG and Thompson DJ. (1952) A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47:663-685.
Liang H, Wang S, Robins JM, Carroll RJ. (2004) Estimation in partially linear models with missing covariates. Journal of the American Statistical Association 99:357-367.
Little RJA and Rubin DB. (2002) Statistical Analysis with Missing Data 2nd edition (John Wiley & Sons, New York).
Reilly M and Pepe MS. (1995) A mean-score method for missing and auxiliary covariate data in regression models. Biometrika 82:299-314.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89:846-866.
Rubin DB. (1976) Inference and missing data. Biometrika 63:581-592.
Wang CY, Wang S, Zhao LP, Ou ST. (1997) Weighted semiparametric estimation in regression analysis with missing covariate data. Journal of the American Statistical Association 92:512-525.
Wang S and Wang CY. (2001) Asymptotic comparisons of kernel assisted estimators in missing covariate regression. Statistics and Probability Letters 55:439-449.
Zhao LP and Lipsitz S. (1992) Designs and analysis of two-stage studies. Statistics in Medicine 11:769-782.
Received May 1, 2006; revised August 25, 2006; accepted for publication September 8, 2006.

