Biostatistics Advance Access originally published online on May 4, 2005
Biostatistics 2005 6(4):505-519; doi:10.1093/biostatistics/kxi031
Smooth quantile ratio estimation with regression: estimating medical expenditures for smoking-attributable diseases
Department of Biostatistics, Bloomberg School of Public Health, The Johns Hopkins University Baltimore, MD 21205-3179, USA fdominic{at}jhsph.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
The methodological development of this paper is motivated by a common problem in econometrics where we are interested in estimating the difference in the average expenditures between two populations, say with and without a disease, as a function of the covariates. For example, let Y1 and Y2 be two nonnegative random variables denoting the health expenditures for cases and controls. Smooth Quantile Ratio Estimation (SQUARE) is a novel approach for estimating $\Delta = E[Y_1] - E[Y_2]$ by smoothing across percentiles the log-transformed ratio of the two quantile functions. Dominici et al. (2005) have shown that SQUARE defines a large class of estimators of $\Delta$, is more efficient than common parametric and nonparametric estimators of $\Delta$, and is consistent and asymptotically normal. However, in applications it is often desirable to estimate $\Delta(x) = E[Y_1 \mid x] - E[Y_2 \mid x]$, that is, the difference in means as a function of x. In this paper we extend SQUARE to a regression model and we introduce a two-part regression SQUARE for estimating $\Delta(x)$ as a function of x. We use the first part of the model to estimate the probability of incurring any costs and the second part of the model to estimate the mean difference in health expenditures, given that a nonzero cost is observed. In the second part of the model, we apply the basic definition of SQUARE for positive costs to compare expenditures for the cases and controls having similar covariate profiles. We determine strata of cases and controls with similar covariate profiles by the use of propensity score matching. We then apply two-part regression SQUARE to the 1987 National Medical Expenditure Survey to estimate the difference $\Delta(x)$ between persons suffering from smoking-attributable diseases and persons without these diseases as a function of the propensity of getting the disease. Using a simulation study, we compare frequentist properties of two-part regression SQUARE with maximum likelihood estimators for the log-transformed expenditures.
Keywords: Comparing means; Health expenditures; Log-normal; Propensity scores; QQ plots; Quantile regression; Regression splines; Skewed distributions; Smoking
| 1. INTRODUCTION |
|---|
This paper is motivated by a common problem in health economics of estimating the difference in mean or total health expenditures between diseased and otherwise similar nondiseased persons as a function of covariates. In our motivating application, we study people affected by major smoking-attributable diseases: lung cancer (LC) and chronic obstructive pulmonary diseases (COPD). The nondiseased group comprises people not affected by any of the diseases mentioned above nor by other major smoking-caused illness such as cardiovascular diseases.
Let Y1 and Y2 be two nonnegative random variables representing health expenditures for the cases and controls, and let x be a vector of covariates, such as smoking, age, race, gender, and socioeconomic factors. We seek to estimate the difference $\Delta(x) = E[Y_1 \mid x] - E[Y_2 \mid x]$ as a function of the covariates.
Estimation of $\Delta(x)$ is challenging because health expenditures are very skewed toward high values, tend to have a high proportion of zeros, and the number of cases tends to be much smaller than the number of controls. Nevertheless, $\Delta(x)$ is an important target for inference in econometrics, statistics, and other disciplines (Duan, 1983; O'Brien, 1988; Fenn et al., 1996; Lin et al., 1997; Hlatky et al., 1997; Lin, 2000; Tu and Zhou, 1999; Lipscomb et al., 1999). Econometric approaches for analyses of health expenditure have been discussed extensively. Among the most common approaches are linear regression models for log-transformed dependent variables and generalized linear models (GLMs) with a logarithmic link function (Duan, 1983; Jones, 2000; Manning, 1998; Mullahy, 1998; Blough et al., 1999). GLMs estimate log E[Y|x] directly, whereas the linear regression model for the log-transformed costs estimates E[log(Y)|x]. This estimate can then be converted into an estimate of E[Y|x] by a suitable transformation that involves higher moments of the distribution of log Y (Duan, 1983). See Manning and Mullahy (2001) for a simulation-based comparison of suitable estimators of E[Y|x] under the parametric approaches described above.
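As a worked illustration of why the retransformation involves higher moments, consider the special case, assumed here only for illustration, in which the log-scale errors are normal with constant variance $\sigma^2$. The symbols $\beta$, $\varepsilon$, and $\sigma^2$ are introduced for this example and are not the paper's notation.

```latex
\log Y = x^{\top}\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2})
\;\Longrightarrow\;
E[Y \mid x] = \exp(x^{\top}\beta)\, E[e^{\varepsilon}]
            = \exp\!\left(x^{\top}\beta + \tfrac{1}{2}\sigma^{2}\right).
```

Under this assumption, exponentiating E[log(Y)|x] alone understates E[Y|x] by the factor $\exp(\sigma^2/2)$, which is why the variance of log Y enters the back-transformation.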
Dominici et al. (2005) have recently introduced a novel estimator of the mean difference of two highly skewed distributions, $\Delta = E[Y_1] - E[Y_2]$, called Smooth Quantile Ratio Estimation or SQUARE. Note that the most obvious nonparametric estimator of $\Delta$ is the sample mean difference $\bar{y}_1 - \bar{y}_2 = \int_0^1 \hat{Q}_1(p)\,dp - \int_0^1 \hat{Q}_2(p)\,dp$, which is here defined as a function of the empirical quantiles $\hat{Q}_1(p)$ and $\hat{Q}_2(p)$. The basic idea of SQUARE is to augment the empirical quantiles $\hat{Q}_1(p)$ and $\hat{Q}_2(p)$ with smoother and less variable versions obtained by smoothing the log-transformed ratio of the two quantile functions:
$$\log\left\{\frac{Q_1(p)}{Q_2(p)}\right\} = s(p), \qquad 0 < p < 1. \tag{1.1}$$
SQUARE encompasses a large class of estimators of $\Delta$ including the class of L-estimates (Serfling, 1980). For example, if s(p) interpolates the log-ratios of the order statistics, then SQUARE reduces to the sample mean difference. If s(p) is very smooth, then SQUARE reduces to the maximum likelihood estimate of $\Delta$ under a log-normal sampling distribution for Y1 and Y2 (Dominici et al., 2005; Cope, 2003). Broadly speaking, SQUARE is a semiparametric estimate of $\Delta$ which interpolates between parametric estimates (such as maximum likelihood estimates) and nonparametric estimates (such as the sample mean difference) with weights depending on the degrees of smoothness of s(p).
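To make the role of s(p) concrete, the mean difference can be rewritten in terms of a single quantile function and the log-ratio. The identity below is a short derivation added for illustration; it uses only the fact that a mean is the integral of the quantile function over (0, 1), and it writes s(p) for the log-ratio of the two population quantile functions.

```latex
\Delta = \int_0^1 Q_1(p)\,dp - \int_0^1 Q_2(p)\,dp
       = \int_0^1 Q_2(p)\left\{ e^{s(p)} - 1 \right\} dp ,
\qquad s(p) = \log\frac{Q_1(p)}{Q_2(p)} .
```

In the simplest case of a constant log-ratio, s(p) = c, this gives $\Delta = (e^{c} - 1)\,E[Y_2]$, so the whole difference is carried by one smooth summary of how the two distributions relate.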
Simulation studies (Dominici et al., 2005; Cope, 2003) have shown that SQUARE outperforms common estimators of $\Delta$, such as the sample mean difference and log-normal estimators commonly used for the analysis of skewed data (Aitchison and Shen, 1980; Zellner, 1971; Zhou et al., 1997a; Zhou and Gao, 1997; Land, 1971; Angus, 1994; Duan et al., 1983; Zhou et al., 1997b; Lipscomb et al., 1999; Andersen et al., 2000). Theoretical developments of SQUARE including proofs of consistency and asymptotic normality are detailed in Cope (2003).
In this paper we generalize SQUARE to a two-part regression model and present a detailed example of its use in the important public health problem of estimating the difference in medical expenditures between people with and without smoking-related disease, taking covariates into account. In the first part of the model, we estimate the probability of incurring any costs among the cases and the controls, P(Y1 > 0|x) and P(Y2 > 0|x). In the second part, we estimate the mean difference of the positive expenditures for the cases and the controls. In summary, we produce an estimate of the following parameter:
$$\Delta(x) = P(Y_1 > 0 \mid x)\,E[Y_1 \mid Y_1 > 0, x] - P(Y_2 > 0 \mid x)\,E[Y_2 \mid Y_2 > 0, x].$$
In the second part of the model, we use SQUARE to compare the positive expenditures for cases and for controls having similar covariate profiles. We identify these homogeneous covariate groups by using propensity score matching (Rosenbaum and Rubin, 1983). The propensity score, here denoted by e(x), is the probability of having a smoking-related disease given the following covariates: smoking dose, age, race, and socioeconomic factors.
For our analyses, we use the National Medical Expenditure Survey (NMES) (National Center For Health Services Research, 1987) supplemented by the Adult Self-Administered Questionnaire Household Survey (ASAQS). NMES and ASAQS provide data on annual medical expenditures, disease status, age, race, socioeconomic factors, and critical information on health risk behaviors such as smoking, for a representative sample of U.S. noninstitutionalized adults. A key component of our analysis is to estimate $\Delta(x)$ as a function of e(x) and to illustrate how differences in medical expenditures might vary with respect to the propensity of having the disease.
Because SQUARE is a new idea, we compare it in a simulation study to a more standard econometric approach: a two-part linear regression model for log-transformed costs. We illustrate under which sampling mechanisms two-part regression SQUARE provides a more efficient estimate of $\Delta(x)$ than parametric alternatives commonly used in the analysis of health cost data.
| 2. SMOOTH QUANTILE RATIO ESTIMATION (SQUARE) |
|---|
In this section, we briefly review the definition of SQUARE and its estimation approach. Details are in Dominici et al. (2005).
![]() | (2.1) |
The basic idea of SQUARE is to replace the empirical quantiles $\hat{Q}_1(p)$ and $\hat{Q}_2(p)$ with smoother and less variable versions obtained by smoothing the log-transformed ratio of the two quantile functions across percentiles. SQUARE estimates $\Delta$ by smoothing across percentiles the log-ratio of the empirical quantile functions.
By borrowing strength across the two samples to learn about the shape of the distribution, SQUARE produces an estimator of $\Delta$ that tends to be less variable than the sample mean difference but with small bias.
More specifically, let $\hat{Q}_1$, $\hat{Q}_2$ be the empirical quantile functions and let y1 = (y1(1), y1(2), ..., y1(n1)) and y2 = (y2(1), y2(2), ..., y2(n2)) be the order statistics of the positive medical expenditures for the cases and controls, respectively. If n1 = n2 = n, then SQUARE estimates $\Delta$ by the use of smoothed quantile functions $\tilde{Q}_1 = \hat{Q}_2 \exp(\hat{s}(p, \lambda))$ and $\tilde{Q}_2 = \hat{Q}_1 \exp(-\hat{s}(p, \lambda))$, where $\hat{s}(p, \lambda)$ is a natural cubic spline with $\lambda$ degrees of freedom, obtained by fitting the model
$$\log y_{1(i)} - \log y_{2(i)} = s(p_i, \lambda) + \epsilon_i, \qquad i = 1, \ldots, n, \tag{2.2}$$
with pi = i/(n + 1). If n = n1 < n2 as in our real application, then we replace y2 by q2, the linear interpolation of the order statistics y2(i) to the grid of points p1i = i/(n1 + 1), i = 1, ..., n1.
Notice that in our application, the total numbers of cases and controls are N1 = 188 and N2 = 9228, respectively. Among these only n1 = 118 and n2 = 2262 have nonzero expenditures. If we let $\pi_1 = P(Y_1 > 0)$ and $\pi_2 = P(Y_2 > 0)$ be the probabilities of nonzero expenditure for the cases and controls and let E[Y1|Y1 > 0] and E[Y2|Y2 > 0] be the corresponding averages of the nonzero values, then the SQUARE estimate of $\Delta = P(Y_1 > 0)E[Y_1 \mid Y_1 > 0] - P(Y_2 > 0)E[Y_2 \mid Y_2 > 0]$ is
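$$\hat{\Delta} = \hat{\pi}_1\,\frac{1}{2n}\sum_{i=1}^{2n}\tilde{y}_{1i} - \hat{\pi}_2\,\frac{1}{2n}\sum_{i=1}^{2n}\tilde{y}_{2i}, \tag{2.3}$$
where $\hat{\pi}_1$ and $\hat{\pi}_2$ are the proportions of nonzero costs among the cases and the controls, and $\tilde{y}_1$ and $\tilde{y}_2$ are two samples of size 2n, where $\tilde{y}_1 = (y_{1(i)},\ y_{2(i)}\exp(\hat{s}_i))$, $\tilde{y}_2 = (y_{2(i)},\ y_{1(i)}\exp(-\hat{s}_i))$, $i = 1, \ldots, n$, and $\hat{s}_i = \hat{s}(p_i, \lambda)$.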
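The estimation steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the function names are hypothetical, the spline with `df` degrees of freedom is built from a truncated-power natural cubic basis with `df + 1` knots at quantiles of the grid (one common convention), and each augmented sample is taken to be the observed order statistics joined with the other group's order statistics rescaled by the fitted log-ratio.

```python
import numpy as np

def natural_spline_basis(p, knots):
    """Natural cubic spline basis in truncated-power form; K knots give K columns."""
    p = np.asarray(p, dtype=float)
    K = len(knots)

    def d(k):
        num = np.maximum(p - knots[k], 0.0) ** 3 - np.maximum(p - knots[-1], 0.0) ** 3
        return num / (knots[-1] - knots[k])

    cols = [np.ones_like(p), p]
    cols += [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

def smoothed_means(y1_pos, y2_pos, df=2):
    """Means of the two augmented samples built from positive costs."""
    y1 = np.sort(np.asarray(y1_pos, dtype=float))
    y2 = np.sort(np.asarray(y2_pos, dtype=float))
    n1, n2 = len(y1), len(y2)
    p = np.arange(1, n1 + 1) / (n1 + 1)
    # Linear interpolation of the controls' order statistics onto the cases' grid.
    q2 = np.interp(p, np.arange(1, n2 + 1) / (n2 + 1), y2)
    log_ratio = np.log(y1) - np.log(q2)
    # Assumed convention: df degrees of freedom <-> df + 1 knots (intercept excluded).
    knots = np.quantile(p, np.linspace(0.0, 1.0, df + 1))
    B = natural_spline_basis(p, knots)
    coef, *_ = np.linalg.lstsq(B, log_ratio, rcond=None)
    s_hat = B @ coef
    y1_aug = np.concatenate([y1, q2 * np.exp(s_hat)])
    y2_aug = np.concatenate([q2, y1 * np.exp(-s_hat)])
    return y1_aug.mean(), y2_aug.mean()

def two_part_square(costs_cases, costs_controls, df=2):
    """Two-part estimate: P(cost > 0) times the smoothed mean of positive costs."""
    c1 = np.asarray(costs_cases, dtype=float)
    c2 = np.asarray(costs_controls, dtype=float)
    pos1, pos2 = c1[c1 > 0], c2[c2 > 0]
    pi1, pi2 = pos1.size / c1.size, pos2.size / c2.size
    m1, m2 = smoothed_means(pos1, pos2, df)
    return pi1 * m1 - pi2 * m2
```

A call such as `two_part_square(costs_cases, costs_controls, df=2)` takes the raw cost vectors, zeros included. The sketch assumes there are at least as many positive controls as positive cases, as in the application described above.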
Dominici et al. (2005) have shown that SQUARE is asymptotically normal, embraces a large class of estimators of $\Delta$ including the sample mean difference, the maximum likelihood estimate under log-normal samples, and L-estimates, and in several realistic situations has lower mean-squared error than competitors including the sample mean difference and log-normal parametric estimates.
| 3. REGRESSION SQUARE |
|---|
In our case study we are interested in estimating the difference in medical expenditures between the cases and controls as a function of their covariates, that is, we seek to estimate
$$\Delta(x) = E[Y_1 \mid x] - E[Y_2 \mid x].$$
To extend SQUARE to the regression case we assume that the log-ratio of the quantile functions is a smooth function of the percentiles given the covariates x; hence, we extend (1.1) to
$$\log\left\{\frac{Q_1(p \mid x)}{Q_2(p \mid x)}\right\} = s(p, x). \tag{3.1}$$
To control for systematic differences in covariates between the two populations, a common strategy is to group units into subclasses based on covariate values, and then to compare medical expenditures only for the cases and controls who fall in the same subclass. However, as the number of covariates increases, the number of subclasses grows exponentially (Cochran, 1965). This problem can be overcome by matching with respect to the propensity scores (Cochran and Rubin, 1973; Rubin, 1973). The propensity score in this case can be defined as the conditional probability that an individual with vector xi of observed covariates has the disease, ei(xi) = P(di = 1|xi). Rosenbaum and Rubin (1983) showed that subclassifications on the population propensity score will balance x; in other words, population subgroups of cases and controls that have similar propensity scores will have similar distributions of all their covariates.
We use the propensity score matching in the definition of regression SQUARE as follows:
- for each case i = 1, ..., N1, we construct a stratum [i] of m1 cases and m2 controls, both with propensity scores as similar to that of case i as possible, by using a nearest-neighbor matching algorithm detailed at the end of this section;
- within each propensity score stratum [i], we estimate:
  - the fractions of nonzero expenditures $\hat{\pi}_1(x_i)$ and $\hat{\pi}_2(x_i)$; and
  - the difference in average medical expenditures between the cases and the controls, $\hat{\Delta}(x_i)$, by applying the definition of SQUARE (2.3) to the m1 cases and m2 controls that belong to the ith stratum;
- we estimate $\Delta$ by averaging the SQUARE estimates across the N1 strata, that is,

$$\hat{\Delta} = \frac{1}{N_1}\sum_{i=1}^{N_1}\hat{\Delta}(x_i). \tag{3.2}$$
Matching is performed by using a modification of the nearest-neighbor matching algorithm (Rubin and Thomas, 2000), beginning with the case with lowest propensity score and proceeding to the case with highest propensity score. More specifically, let e1 = (e1(x1), ..., eN1(xN1)) be the ordered vector of propensity scores for the cases. Then, for each case i:
- we select the m1 closest cases to the ith propensity score and identify their m1 propensity scores,
- we divide the m1 propensity scores into S strata,
- within each stratum, we sample with replacement H matched controls, thus obtaining a total of S × H = m2 matched controls.
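The three matching steps above can be sketched as follows. This is an illustrative reading rather than the authors' algorithm: the function name `match_stratum` is hypothetical, the values S = 5 and H = 25 are chosen only so that S × H = 125, and the rule for which controls count as "matched" to a sub-stratum (here, the H controls nearest to the sub-stratum's median score) is an assumption, since the text does not spell it out.

```python
import numpy as np

def match_stratum(i, e_cases, e_controls, m1=50, S=5, H=25, rng=None):
    """Return (case_idx, control_idx) for the stratum built around case i."""
    rng = np.random.default_rng() if rng is None else rng
    e_cases = np.asarray(e_cases, dtype=float)
    e_controls = np.asarray(e_controls, dtype=float)

    # Step 1: the m1 cases whose propensity scores are closest to that of case i.
    case_idx = np.argsort(np.abs(e_cases - e_cases[i]), kind="stable")[:m1]

    # Step 2: order those m1 scores and split them into S sub-strata.
    case_idx = case_idx[np.argsort(e_cases[case_idx], kind="stable")]

    control_idx = []
    for block in np.array_split(case_idx, S):
        # Step 3 (assumed rule): sample H controls with replacement from the
        # H controls nearest to the sub-stratum's median propensity score.
        centre = np.median(e_cases[block])
        pool = np.argsort(np.abs(e_controls - centre), kind="stable")[:H]
        control_idx.extend(rng.choice(pool, size=H, replace=True))
    return case_idx, np.array(control_idx)
```

Looping `i` over all N1 cases and applying a stratum-level estimator to each returned pair of index sets, then averaging, would give the pooled estimate described in this section. The sketch assumes m1 is at least S so that no sub-stratum is empty.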
| 4. ANALYSIS OF MEDICAL EXPENDITURES |
|---|
In this section, we use two-part regression SQUARE to estimate the mean difference between annual Medicare expenditures for persons with LC or COPD (cases, d=1) and otherwise similar persons without these two smoking-attributable diseases and without cardiovascular disease (controls, d=0).
In our problem, the propensity score ei(xi) is an estimate of the probability that a person i has LC or COPD given his/her covariate profile xi. We estimate this risk by using the following logistic regression model (Johnson et al., 2003):
![]() | (4.1) |
where male, afro-american, and recent quit are indicators for being male, being African-American, having ever smoked, and having quit smoking within 1 year; poverty, marital status, education, census region, and seat belt use are categorical variables indicating socioeconomic status, place of residence, and propensity of an individual to take risks. The variable smoking indicates self-reported total smoking exposure (packs of cigarettes over the lifetime). We model age and smoking as natural cubic splines with three degrees of freedom. The full set of variables included in the model is listed in Table 1. Details of this modeling approach and results for the NMES data are given by Johnson et al. (2003).
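For readers who want to see the mechanics, the following sketch fits a plain logistic regression by Newton-Raphson and returns propensity scores together with their logits. It is a generic illustration, not the model of Johnson et al. (2003): the function names are hypothetical, the design matrix `X` is assumed to already contain whatever indicator and spline columns are wanted, and the tiny ridge term is added only for numerical stability.

```python
import numpy as np

def fit_logistic(X, d, n_iter=25, ridge=1e-8):
    """Maximum likelihood logistic regression of disease indicator d on X."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    d = np.asarray(d, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        W = mu * (1.0 - mu)                       # Newton weights
        grad = X.T @ (d - mu)
        hess = (X * W[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def propensity(X, beta):
    """Return (propensity score, logit of the propensity score) for each row of X."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    logit = X @ beta
    return 1.0 / (1.0 + np.exp(-logit)), logit
```

The second returned value is the score on the logit scale, which is the scale a matching step would typically work on.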
We match the propensity scores on the logistic scale (Rubin and Thomas, 2000). Figure 1 shows the average logit propensity scores for cases versus the average for controls within each matched set. The proximity of the points to the diagonal line indicates excellent performance of the matching algorithm. Some deviation occurs among the highest risk subjects where the cases are at slightly higher risk than the controls. To further assess the relative success of the propensity score model for creating balanced matched samples, Table 1 compares the observed proportions for categorical covariates and the sample means for continuous covariates between cases and controls for the matched samples. The matching appears to have performed well.
In addition to estimating the mean difference in expenditures for persons with and without disease caused by smoking, a second question is whether this difference is smaller for smokers than for nonsmokers, perhaps because one group has a tendency to seek or receive fewer services. That is, does smoking status modify the difference in medical expenditures between the cases and the controls? Table 2 shows the number of disease cases and controls for smokers (current or former) and for nonsmokers (neither current nor former). The numbers within parentheses represent the percentage of people in that cell with nonzero expenditures. The percentage of cases with nonzero expenditures is more than twice as large as for the controls (63% and 25%); this is consistent with our expectation that people with disease receive more services. These proportions are similar for smokers and nonsmokers. Because of the very small number of cases among the nonsmokers, we report the results for everyone in the sample and for the smokers.
We apply two-part regression SQUARE with $\lambda = 2$ to the NMES database and to the subset of the NMES data for smokers only. We choose $\lambda = 2$ because previous applications of SQUARE to the NMES database (Dominici et al., 2005) indicate that $\lambda = 2$ minimizes a 10-fold cross-validation criterion (Efron, 1983). Table 3 summarizes the estimated mean differences in annual Medicare expenditures for the cases and controls, with and without covariate adjustment, for everyone in the sample and for the smokers alone. We estimate $\Delta$ using four approaches. The first is a two-part regression SQUARE (T1) as defined in Section 3. The second is the weighted sample mean difference within each stratum (T2). The third and the fourth are two MLE estimators of $\Delta$ under a log-normal distribution. More specifically, the third, called MLE with propensity score (T3), calculates the MLE of $\Delta(x_i)$ within each propensity score stratum and is defined as
![]() | (4.2) |
where $(\hat{\mu}_1, \hat{\sigma}_1^2)$ and $(\hat{\mu}_2, \hat{\sigma}_2^2)$ are the sample means and variances of the log-expenditures for the m1 cases and the m2 controls, respectively. We then estimate $\Delta$ by averaging the estimates across the N1 strata.
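A stratum-level log-normal estimate of this kind can be sketched as below. The sketch relies on the standard fact that a log-normal variable with log-scale mean $\mu$ and variance $\sigma^2$ has mean $\exp(\mu + \sigma^2/2)$; combining it with the observed fraction of nonzero costs is an assumption made here to mirror the two-part structure used elsewhere in the paper, and the function names are hypothetical.

```python
import numpy as np

def lognormal_mean_mle(y_pos):
    """Plug-in log-normal mean exp(mu + sigma^2 / 2) from positive costs."""
    logy = np.log(np.asarray(y_pos, dtype=float))
    mu = logy.mean()
    sigma2 = logy.var()  # ddof=0 (ML variance); an n - 1 divisor is another common choice
    return np.exp(mu + sigma2 / 2.0)

def two_part_lognormal_difference(costs_cases, costs_controls):
    """Difference of (fraction nonzero) x (log-normal mean of positive costs)."""
    def part(costs):
        costs = np.asarray(costs, dtype=float)
        pos = costs[costs > 0]
        return (pos.size / costs.size) * lognormal_mean_mle(pos)
    return part(costs_cases) - part(costs_controls)
```

Applying `two_part_lognormal_difference` to the cases and controls of each propensity score stratum and averaging over strata would give a pooled estimate in the spirit of T3.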
The fourth approach, called MLE with two-part log-normal model (T4), provides an estimate of $\Delta$ by fitting the following model,
![]() | (4.3) |
where N = N1 + N2, n = n1 + n2, and xi is the ith row of the design matrix including all the covariates specified in the propensity score model (4.1). Under model (4.3), the MLE estimate of $\Delta$ is defined as
![]() | (4.4) |
Note that T4 takes into account the covariates by use of a regression model instead of propensity score matching and therefore does not provide an estimate of $\Delta(x_i)$ as a function of the propensity score.
When we do not adjust for the covariates, SQUARE (T1) and the weighted sample mean difference (T2) provide smaller estimates than the MLEs under a log-normal model (T3 and T4). When we adjust for the covariates, T1, T2, and T4 are similar, but T4 has a smaller bootstrap standard error suggesting greater efficiency of the MLE two-part regression model for estimating $\Delta$. T3 gives much larger estimates of $\Delta$ than the other methods. Finally, estimates for the smokers are larger. Frequentist properties of these estimators are studied more carefully in a simulation study presented in Section 4.1.
Figure 2 shows estimated probabilities of any cost (first row), estimated means of nonzero costs (second row), and estimated mean costs (third row) for the cases and controls plotted against propensity scores. The darker lines are the estimates for the smokers only. The gray polygon represents the 95% bootstrap confidence intervals. At the far right, we display the pooled estimates averaged across propensity scores with their 95% bootstrap confidence intervals.
We found that the estimated probabilities of any expenditure smoothly increase as the risk of disease increases. The probabilities of any cost are consistently higher for the cases than for the controls across propensity scores. In addition, at low propensity scores and for both the cases and the controls, the probability of any cost for the smokers is slightly smaller than for everyone. This may indicate that healthy smokers are more reluctant to seek services than the rest of the population.
Average positive expenditures are larger for the cases than for the controls. At low propensity scores and for the cases, the average positive costs for the smokers are larger than average for everyone. This indicates that, although the smokers with low propensity of disease are more reluctant to seek services than the rest of the population, if they do use any service they tend to have larger medical expenditures than the rest of the population.
Figure 3 (top) shows the estimated mean differences plotted against propensity scores. As in Figure 2, the darker lines are the estimates for the smokers only. At the far right are plotted the pooled estimates across propensity scores with their 95% bootstrap confidence intervals also reported in Table 3. The shape of the distribution of the estimated mean differences is driven by the estimates of mean costs for the cases (Figure 2). At low propensity scores, the estimated mean differences are roughly constant at approximately $3000. At moderate values of the propensity scores, the estimated mean differences are larger, reaching about $9000, whereas at high propensity scores the estimated mean differences drop to $1000. By examining the covariates for the cases within low, moderate, and high propensity score strata, we found that cases with high risk of disease tend to be older, poorer, and less educated than the other cases, raising the possibility that they have poorer access to services.
Figure 3 (bottom) shows the estimated mean differences plotted against propensity scores under four alternative propensity score matching methods. These scenarios were selected after having assessed the balance on observed covariates in the matched samples, and only scenarios that assured a reasonable balance were examined in the sensitivity analysis. The scenario 50:125 is our baseline. The other three scenarios represent more or less coarse matching, corresponding to 25:125, 25:50, and 50:50. Pooled estimates averaged across propensity scores are similar under the four scenarios. As expected, case-specific estimates are somewhat sensitive to the selection of the number of cases, leading to less smooth curves under the scenarios 25:125 and 25:50 than under the scenarios 50:125 and 50:50. However, these differences are small and all within the case-specific confidence intervals of the baseline estimates.
In this section, we implement a simulation study where we compare frequentist properties of two-part regression SQUARE (T1) to the three alternative estimators of $\Delta$ used in the data analysis: the weighted sample mean difference (T2); MLE with propensity score matching (T3); and MLE with a two-part log-linear regression model for the log-transformed costs (T4) (Duan, 1983; Mullahy, 1998; Mullahy and Manning, 1995).
We generate cost data under nonparametric and parametric sampling mechanisms as follows:
- A. Sampling from the empirical distribution of the cost data. We divide the propensity scores for the cases estimated under model (4.1) into 25 strata. Within each stratum, we first identify the matched cases and the matched controls, and then sample with replacement observations from the corresponding empirical distributions of the observed costs. Here we assume that the true value of $\Delta$ is equal to the weighted sample mean difference (T2) averaged across 1000 bootstrap samples.
- B. Sampling from a two-part linear regression model of the log-transformed costs. We generate cost data from the same two-part log-linear model used in the data analysis and defined in (4.3). Under this data-generating mechanism we assume that the true value of $\Delta$ is the MLE, which is equal to T4.
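The resampling idea behind scenario A can be sketched with a small Monte Carlo harness that redraws costs with replacement and summarizes an estimator's bias and root mean-squared error against a chosen reference value. This is a simplified, single-stratum illustration with hypothetical function names; the stratification into 25 propensity score strata and the choice of performance measures reported in the paper are not reproduced here.

```python
import numpy as np

def simulate_bias_rmse(estimator, cases, controls, truth, n_sim=200, seed=0):
    """Resample both groups with replacement and summarize estimator error."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    est = np.empty(n_sim)
    for b in range(n_sim):
        c1 = rng.choice(cases, size=cases.size, replace=True)
        c2 = rng.choice(controls, size=controls.size, replace=True)
        est[b] = estimator(c1, c2)
    bias = est.mean() - truth
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return bias, rmse

def mean_difference(c1, c2):
    """Sample mean difference, used here as the simplest estimator to plug in."""
    return c1.mean() - c2.mean()
```

Any estimator with the same two-argument signature can be passed in place of `mean_difference`, which makes it straightforward to compare several estimators on identical resampled data by reusing the same seed.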
Note that under scenario B, the presence of heteroscedasticity implies that the log-scale prediction $E[\exp(\epsilon_i)]\exp(\beta d_i + \gamma x_i)$ provides a biased estimate of $E[Y_i \mid d_i, x_i]$ and that the bias depends on the covariates $(d_i, x_i)$. This bias can be reduced by including an estimate of $E[\exp(\epsilon_i) \mid d_i, x_i]$, called the smearing coefficient (Duan, 1983; Parmigiani et al., 1997; Andersen et al., 2000). This would lead to another estimate of $\Delta$, namely,
![]() |
where the so-called smearing coefficient (Duan, 1983) is calculated as a function of the ordinary least squares residuals ri, i = 1, ..., n, of the linear regression model for the positive expenditures in (4.3). We refer to this estimator as T5.
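As a hedged illustration of the smearing idea, the sketch below fits an ordinary least squares regression to log costs, computes Duan's (1983) smearing coefficient as the average of the exponentiated residuals, and uses it to rescale back-transformed predictions. It uses a single, homoscedastic smearing factor for simplicity; a covariate-specific factor, as discussed above for the heteroscedastic case, would require averaging residuals within groups. The function name is hypothetical.

```python
import numpy as np

def smearing_prediction(X, log_y, X_new):
    """OLS on the log scale with a single Duan smearing factor for retransformation."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    X_new = np.column_stack([np.ones(len(X_new)), np.asarray(X_new, dtype=float)])
    log_y = np.asarray(log_y, dtype=float)
    coef, *_ = np.linalg.lstsq(X, log_y, rcond=None)
    resid = log_y - X @ coef
    smear = np.mean(np.exp(resid))  # smearing coefficient: mean of exp(OLS residuals)
    return smear * np.exp(X_new @ coef), smear
```

When the residuals are close to normal with variance $\sigma^2$, the smearing coefficient is close to $\exp(\sigma^2/2)$, but it does not require normality to be used.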
In summary, two-part regression SQUARE and the weighted sample mean difference (T1, T2) are nonparametric estimates of $\Delta(x_i)$ within each propensity score stratum, whereas T3 uses an MLE of $\Delta(x_i)$, also within each propensity score stratum. The last two estimators, T4 and T5, are not based on propensity score matching and therefore cannot estimate $\Delta(x_i)$ as a function of the propensity score. Instead, they estimate $\Delta$ already marginalized with respect to the covariates.
Scenario A favors propensity score matching and nonparametric estimation methods, whereas scenario B favors model-based estimation approaches. The results are summarized in Table 4. In scenario A, if the goal is to estimate $\Delta(x_i)$ then T1 and T2 outperform T3. This result suggests that MLE with propensity score matching is very inefficient when the data are not log-normal and the sample size is small. If the goal is to estimate $\Delta$, then T1, T2, T4, and T5 have similar performance.
In scenario B, if the goal is to estimate $\Delta(x_i)$ then two-part regression SQUARE (T1) outperforms both the weighted sample mean difference T2 and the MLE with propensity score matching T3. It is interesting to note that T1 is better than T3 in estimating $\Delta(x_i)$ even when the data are log-normal. If the goal is to estimate $\Delta$, then by definition T4 is the best and is similar to T5. However, T1 is the second best.
In summary, these simulation studies suggest that two-part regression SQUARE produces the most efficient estimate of $\Delta(x_i)$ with respect to nonparametric and parametric alternatives. If the goal is to estimate $\Delta$ marginalized with respect to the covariates, then SQUARE can be more variable than the MLE obtained under a two-part log-linear model.
| 5. DISCUSSION |
|---|
In this paper we have extended SQUARE, a novel estimator of the difference in means for two right-skewed distributions, to the regression case. More specifically, we control for possible imbalances in the covariates between the cases and the controls by use of propensity score matching (Rosenbaum and Rubin, 1983).
Our analysis of Medicare expenditures allows smoking status to modify the effect of disease on expenditures. We examine this effect modification by stratifying the cases and the controls with respect to their smoking status, and then by estimating SQUARE separately for smokers and all subjects. In addition, our plots of the estimated mean differences as a function of the propensity scores allow the detection of effect modification by variables that are important predictors of disease. For example, the visual inspection of Figure 3 suggests that the estimated mean differences drop from $9000 to $1000 for large propensity scores. We found that these individuals tend to be older, poorer, and less educated than the others, suggesting the hypothesis that they have poorer access to services.
Our formulation of SQUARE in the regression case is related to quantile regression (Ruppert and Carroll, 1980; Koenker, 1982; Lifson and Bhattacharyya, 1983). Regression SQUARE is a two-step procedure. We first estimate the difference in medical expenditures in a [0, 1] × [0, 1] grid of points of the values of propensity scores and percentiles, then estimate the parameter of interest $\Delta(x)$ at each value of the propensity score by smoothing across percentiles (see Figures 2 and 3). In quantile regression, we estimate the parameter of interest as a function of the covariates for a fixed percentile.
In the simulation study, we found that two-part regression SQUARE is the most suitable approach for estimating $\Delta(x_i)$ as a function of the propensity score. However, if the ultimate goal is to estimate $\Delta$ marginalized with respect to the covariates, then standard regression adjustment provides a more efficient estimate of $\Delta$ than propensity score matching. As future work, our simulation study can be extended to compare two-part regression SQUARE with respect to the more general GLM framework (McCullagh and Nelder, 1989), with an exponential conditional mean under a range of distributional assumptions including Poisson, Gamma, Weibull, and Chi-square structures (Manning and Mullahy, 2001).
| ACKNOWLEDGMENTS |
|---|
Funding for Scott L. Zeger was provided by NIMH grant R01 MH56639. Funding for Francesca Dominici was provided by NIEHS grant R01ES012054. We thank Timothy Wyant for providing data on the National Medical Expenditures Survey and Elizabeth Johnson for assistance in database development and software.
| REFERENCES |
|---|
AITCHISON, J. AND SHEN, S. M. (1980). Logistic normal distributions: some properties and uses. Biometrika 67, 261272.
ANDERSEN, C. K., ANDERSEN, K. AND KRAGH-SORENSEN, P. (2000). Cost function estimation: the choice of a model to apply to dementia. Health Economics 9, 397409.[CrossRef][Web of Science][Medline]
ANGUS, J. E. (1994). Bootstrap one-sided confidence intervals for the log-normal mean. The Statistician 43, 395401.[CrossRef]
BLOUGH, D., MADDEN, C. AND HORNBROOK, M. (1999). Modelling risk using generalized linear models. Journal of Health Economics 18, 153171.[CrossRef][Web of Science][Medline]
BREIMAN, L. AND SPECTOR, P. (1992). Submodel selection and evaluation in regression: the X-random case. International Statistical Review 60, 291319.
COCHRAN, W. G. (1965). The planning of observational studies of human populations (with discussion). Journal of the Royal Statistical Society, Series A, General 128, 234266.[CrossRef]
COCHRAN, W. G. AND RUBIN, D. B. (1973). Controlling bias in observational studies: a review. Sankhy
, Series A, Indian Journal of Statistics 35, 417446.
COPE, L. (2003). Some asymptotic properties of smooth quantile ratio estimation, Ph.D. Thesis, Department of Applied Mathematics, Johns Hopkins University, Baltimore, MD.
DOMINICI, F., COPE, L., NAIMAN, D. AND ZEGER, S. L. (2005). Smooth quantile ratio estimation (SQUARE). Biometrika (in press).
DUAN, N. (1983). Smearing estimate: a nonparametric retransformation method. Journal of the American Statistical Association 78, 605610.[CrossRef]
DUAN, N., MANNING, W. G., MORRIS, C. N. AND NEWHOUSE, J. P. (1983). A comparison of alternative models for the demand for medical care. Journal of Business and Economic Statistics 1, 115125.
EFRON, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78, 316331.[CrossRef][Web of Science]
EFRON, B. AND TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap. New York: Chapman & Hall.
FENN, P., MCGUIRE, A., BACKHOUSE, M. AND JONES, D. (1996). Modelling programme costs in economic evaluation. Journal of Health Economics 15, 115125.[CrossRef][Web of Science][Medline]
HLATKY, M., ROGERS, W., JOHNSTONE, I., BOOTHROYD, D., BROOKS, M. M., PITT, B., REEDER, G., RYAN, T., SMITH, H., WHITLOW P. et al. (1997). Medical care costs and quality of life after randomization to coronary angioplasty and coronary bypass surgery. New England Journal of Medicine 336, 9299.
JOHNSON, E., DOMINICI, F., GRISWOLD, M. AND ZEGER, S. L. (2003). Disease cases and their medical costs attributable to smoking: an analysis of the national medical expenditure survey. Journal of Econometrics 112, 135151.[CrossRef]
JONES, A. (2000). Health econometrics. In Culyer, A. and Newhouse, J. (eds), Handbook of Health Economics. Amsterdam: Elsevier.
KOENKER, R. (1982). Robust methods in econometrics. Econometric Reviews 1, 213-255.
LAND, C. E. (1971). Confidence intervals for linear functions of the normal mean and variance. The Annals of Mathematical Statistics 42, 1187-1205.
LIFSON, D. P. AND BHATTACHARYYA, B. B. (1983). Quantile regression method and its application to estimate the parameters of lognormal and other distributions. In Sen, P. K. (ed.), Contributions to Statistics: Essays in Honour of Norman L. Johnson. Amsterdam: North Holland Publishing Company, pp. 313-327.
LIN, D. (2000). Linear regression analysis of censored medical costs. Biostatistics 1, 35-47.
LIN, D. Y., FEUER, E. J., ETZIONI, R. AND WAX, Y. (1997). Estimating medical costs from incomplete follow-up data. Biometrics 53, 419-434.
LIPSCOMB, J., ANCUKIEWICZ, M., PARMIGIANI, G., HASSELBLAD, V., SAMSA, G. AND MATCHAR, D. (1999). Predicting the cost of illness: a comparison of alternative models applied to stroke. Medical Decision Making 18, S39-S56.
MANNING, W. (1998). The logged dependent variable: heteroscedasticity and the transformation problem. Journal of Health Economics 17, 283-295.
MANNING, W. G. AND MULLAHY, J. (2001). Estimating log models: to transform or not to transform. Journal of Health Economics 20, 461-494.
MCCULLAGH, P. AND NELDER, J. A. (1989). Generalized Linear Models, 2nd edition. Boca Raton, FL: Chapman & Hall.
MULLAHY, J. (1998). Much ado about two: reconsidering retransformation and the two-part model in health econometrics. Journal of Health Economics 17, 247-281.
MULLAHY, J. AND MANNING, W. (1995). Statistical issues in cost-effectiveness analysis. In Sloan, F. (ed.), Valuing Health Care: Costs, Benefits, and Effectiveness of Pharmaceutical and Other Medical Technologies. New York: Cambridge University Press.
National Center for Health Services Research (1987). National Medical Expenditure Survey. Methods II. Questionnaires and Data Collection Methods for the Household Survey and the Survey of American Indians and Alaska Natives. Rockville, MD: United States Department of Health and Human Services, Agency for Health Care Policy and Research.
O'BRIEN, P. C. (1988). Comparing two samples: extensions of the t, rank-sum, and log-rank tests. Journal of the American Statistical Association 83, 52-61.
PARMIGIANI, G., SAMSA, G., ANCUKIEWICZ, M., LIPSCOMB, J., HASSELBLAD, V. AND MATCHAR, D. (1997). Assessing uncertainty in cost-effectiveness analyses: application to a complex decision model. Medical Decision Making 17, 390-401.
ROSENBAUM, P. AND RUBIN, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.
ROSENBAUM, P. R. AND RUBIN, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 79, 516-524.
RUBIN, D. AND THOMAS, N. (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association 95, 573-585.
RUBIN, D. B. (1973). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics 29, 185-203.
RUPPERT, D. AND CARROLL, R. J. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association 75, 828-838.
SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.
SHAO, J. AND TU, D. (1995). The Jackknife and Bootstrap. New York: Springer.
TU, W. AND ZHOU, X.-H. (1999). A Wald test comparing medical cost based on log-normal distributions with zero valued costs. Statistics in Medicine 18, 2749-2761.
ZELLNER, A. (1971). Bayesian and non-Bayesian analysis of the log-normal distribution and log-normal regression. Journal of the American Statistical Association 66, 327-330.
ZHOU, X.-H. AND GAO, S. (1997). Confidence intervals for the log-normal mean. Statistics in Medicine 16, 783-790.
ZHOU, X.-H., GAO, S. AND HUI, S. L. (1997a). Methods for comparing the means of two independent log-normal samples. Biometrics 53, 1129-1135.
ZHOU, X.-H., MELFI, C. AND HUI, S. (1997b). Methods for comparison of cost data. Biometrics 53, 1129-1135.
Received November 18, 2003; revised December 27, 2004; revised January 31, 2005; revised February 25, 2005; revised March 30, 2005; accepted for publication April 27, 2005.