Skip Navigation



Biostatistics Advance Access published online on October 8, 2007

Biostatistics, doi:10.1093/biostatistics/kxm034
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
9/2/355    most recent
kxm034v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Zeng, D.
Right arrow Articles by Lin, D. Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zeng, D.
Right arrow Articles by Lin, D. Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Efficient resampling methods for nonsmooth estimating functions

Donglin Zeng and D. Y. Lin*

Department of Biostatistics, CB 7420, University of North Carolina, Chapel Hill, NC 27599-7420, USA lin{at}bios.unc.edu

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
We propose a simple and general resampling strategy to estimate variances for parameter estimators derived from nonsmooth estimating functions. This approach applies to a wide variety of semiparametric and nonparametric problems in biostatistics. It does not require solving estimating equations and is thus much faster than the existing resampling procedures. Its usefulness is illustrated with heteroscedastic quantile regression and censored data rank regression. Numerical results based on simulated and real data are provided.

Keywords: Bootstrap; Censoring; Quantile regression; Rank regression; Robustness; Variance estimation


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
The parameters of interest in biostatistics are typically estimated by minimizing a loss function or more generally by solving an estimating equation. In many nonparametric and semiparametric situations, such as Huber's (1964)Go robust estimation of location (with nonsmooth loss functions), quantile regression, and rank regression, the estimating functions are not differentiable. Then, the asymptotic variances of the parameter estimators generally involve unknown density functions and are thus difficult to estimate directly.

In such situations, it is natural to appeal to resampling techniques. The familiar bootstrap (Efron and Tibshirani, 1993Go) estimates variances by resampling from the empirical distribution function. This approach needs to be justified on a case-by-case basis and may not be appropriate in complex situations. Parzen and others (1994)Go proposed a resampling technique by equating the observed data estimating function to a random vector which generates the asymptotic distribution of the estimating function. This technique has been applied to numerous biostatistical problems (e.g. Yao and others, 1998Go; Chen and Jewell, 2001Go; Cai and others, 2006Go). Hu and Kalbfleisch (2000)Go provided a similar procedure for linear estimating functions with independent terms by bootstrapping the individual terms. For estimators that can be written as minimizers of certain U-statistics, Jin and others (2001)Go developed a resampling approach by incorporating suitable random variables into the minimand. Their approach was adapted by Jin and others (2003, 2006)Go to the rank and least squares regression with censored data.

All the aforementioned resampling procedures require solving the perturbed estimating equations or minimizing the perturbed loss functions a large number of times. This is computationally very demanding, especially for complex nonlinear functions. In addition, the perturbed estimating equations or loss functions tend to be associated with extreme solutions and are thus unstable. As a result, nonsmooth estimating functions are rarely used in practice.

In the present paper, we propose a new resampling strategy to estimate asymptotic variances of parameter estimators obtained from general nonsmooth estimating functions. Our approach only requires generation of random numbers and evaluation of estimating functions. It does not involve solving any perturbed estimating equations or minimizing any perturbed objective functions; therefore, it is far more efficient and more stable than the existing resampling methods. With our approach, variance estimation for complex nonsmooth estimating functions can be accomplished in a matter of seconds or minutes rather than hours or days. We describe the proposed approach in Section 2. We present simulation results and medical examples in Sections 3 and 4, respectively. We provide some concluding remarks in Section 5.


    2. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
Let {theta}0 denote a d-vector of parameters. We estimate {theta}0 by solving the estimating equation Un({theta}) = 0, where Un is a function based on n independent observations such that n – 1Un({theta}0)->p0. Suppose that the solution Formula exists and is consistent. Suppose also that, uniformly in a neighborhood of {theta}0,

Formula (2.1)

where Si(i = 1,...,n) are independent zero-mean random vectors, and A is a nonsingular matrix, which is the asymptotic slope of n – 1Un({theta}0). This asymptotic expansion holds for a wide variety of estimating functions and can typically be verified through empirical process arguments (van der Vaart and Wellner, 1996Go, Section 3.3). The Si are the influence functions for Un({theta}0). The dependence of Si and A on {theta}0 is suppressed. Since Formula and Formula is consistent, (2.1) implies that Formula is n1/2-consistent and Formula is asymptotically zero-mean normal with covariance matrix A – 1V(A – 1)T, where V = limn->{infty}n – 1{sum}FormulaSiSFormula. For parametric likelihood, Un({theta}0) = {sum}FormulaSi and V = – A, where Si is the score for the ith observation and A is the negative information matrix.

We give 2 examples.

EXAMPLE 1 (Heteroscedastic quantile regression). For i = 1,...,n, let Yi and Xi denote the response variable and a set of covariates for the ith subject. Assume that the 100{tau}th percentile of Yi is {alpha}0 + ßFormulaXi. We may estimate {theta}0{equiv}({alpha}0,ßFormula)T by solving the equation

Formula

where I(·) is the indicator function. The solution Formula can be obtained by minimizing the loss function

Formula

where {rho}{tau}(v) is {tau}v if v > 0 and ({tau} – 1)v if v ≤ 0. This minimization can be performed by linear programing (Koenker and D'Orey, 1987Go). Under the assumption that (Yi{alpha}0 ßFormulaXi) has a unique 100{tau}th percentile at 0 and has a continuous density function fi such that fi(0) is strictly positive, the estimator Formula is consistent and the asymptotic expansion (2.1) holds with Si = {I(Yi{alpha}0ßFormulaXi ≤ 0) – {tau}}(1,XFormula)T (Jin and others, 2001Go). The slope matrix A involves the density functions fi. Buchinsky (1995) compared various bootstrap procedures for estimating the asymptotic covariance matrix of Formula.

EXAMPLE 2 (Rank regression with censored data). Assume that

Formula (2.2)

where {varepsilon}i(i = 1,...,n) are independent and identically distributed random variables that are independent of Xi(i = 1,...,n). Suppose that Yi is subject to censoring by Ci. In survival analysis, Yi and Ci are usually expressed on the log-scale and (2.2) is referred to as the accelerated life or accelerated failure time model (Cox and Oakes, 1984Go, pp. 64–65; Kalbfleisch and Prentice, 2002Go, pp. 218–219). The data consist of Formula(i = 1,...,n), where Formula and {Delta}i = I(Yi ≤ Ci). It is assumed that Ci is independent of Yi conditional on Xi. One may estimate ß0 by the log-rank estimating equation

Formula (2.3)

It is not a trivial matter to solve this discrete equation, especially when d is large. One may use bisection search or optimization algorithms, such as simulated annealing (Lin and Geyer, 1992Go). Recently, Jin and others (2003)Go showed that linear programing can be used to obtain an approximation to the log-rank estimate. Under mild conditions (Tsiatis, 1990Go; Ying, 1993Go), expansion (2.1) holds with

Formula

where

Formula

and {Lambda}0 is the cumulative distribution function of i. In this case, direct estimation of A would require estimation of the hazard function or density function of {varepsilon}i.

It is natural to estimate V directly by Formula , where Formula is obtained from Si by replacing the unknown quantities by their sample estimators. In Example 1, only {theta}0 is unknown; in Example 2, the unknown quantities include ß0, {Gamma}0(·), {Gamma}1(·), and {Lambda}0(·). The consistency of Formula can typically be established by empirical process arguments.

When the Formula have complicated expressions, it is more convenient and perhaps more accurate to bootstrap from the data. Let UFormula({theta}) denote the estimating function based on the bootstrap sample. It follows from (2.1) that

Formula

where Mi is the number of times the ith observation appears in the bootstrap sample. Since Formula by definition, we obtain

Formula

By Lemma 3.6.15 of van der Vaart and Wellner (1996)Go, the conditional distribution of Formula given the data is asymptotically zero-mean normal with covariance matrix V provided that the remainder term in the above display is op(1) uniformly in the bootstrap samples. It is straightforward to verify the required condition for Examples 1 and 2. The bootstrap estimator of V is also denoted by Formula.

To avoid nonparametric density estimation, we propose efficient resampling procedures to estimate A and consequently the asymptotic covariance matrix of Formula. Let Formula, where Z is a zero-mean random vector independent of the data. It follows from (2.1) that

Formula

Since Formula and Formula, we have

Formula (2.4)

Thus, we propose the following resampling procedure based on the least squares.

LS method:

Step 1: Generate B realizations of Z, denoted by Z1,...,ZB.

Step 2: Calculate Formula(b = 1,...,B).

Step 3: For j = 1,...,d, calculate the least squares estimate of Formula(b = 1,...,B) on Zb(b = 1,...,B), where Ujn denotes the jth component of Un. Let Formula be the matrix whose jth row is the jth least squares estimate.

Step 4: Estimate the covariance matrix of Formula by Formula .

In many situations, A is symmetric, in which case a simpler resampling procedure can be obtained. If the covariance matrix of Z is V – 1, then (2.4) implies that Formula. The inverse of this covariance matrix is equal to A – 1V(A – 1)T when A is symmetric. Thus, we propose the following resampling procedure based on the sample variance of Formula.

SV method:

Step 1: Generate Formula(b = 1,...,B), where Zb is a zero-mean random vector with covariance matrix Formula.

Step 2: Calculate the sample covariance matrix of Formula(b = 1,...,B) and denote it by Formula.

Step 3: Estimate the covariance matrix of Formula by Formula.

Unlike the existing resampling methods, the least squares (LS) and sample variance (SV) methods do not require solving estimating equations. This is an important advantage since it is computationally intensive to solve complex nonsmooth estimating equations. Although we have suggested the possible use of bootstrap to estimate V, that procedure is different from bootstrap estimation of the variance of Formula and does not involve solving equations.


    3. SIMULATION STUDIES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
We conducted extensive simulation studies to assess the performance of the proposed resampling methods. For both the LS and the SV methods, we estimated V either by direct evaluation or by bootstrap. We set Z to Formula, where Z* is either a d-variate standard normal random vector or a d-vector of independent centerd Bernoulli random variables with equal probabilities at – 1 and 1. Thus, 8 different variants of the methods were considered.

The first set of studies mimics the simulation studies on median regression reported in Section 3 of Parzen and others (1994)Go. We generated data from the model Yi = X1i + X2i + {varepsilon}i, where X1i and X2i are independent standard normal and Bernoulli with 0.5 success probability, respectively, and {varepsilon}i is normal with mean 0 and variance |X1i|. We obtained the parameter estimates through linear programing.

The second set of studies is similar to those of Jin and others (2003)Go. We generated survival times from model (2.2) in which X1 and X2 are independent Uniform(0,1) variable and Bernoulli variable with 0.5 success probability, ß0 = (1, – 1)T, and the error distribution is either extreme-value or zero-mean normal with standard deviation 0.5. We generated censoring times from a uniform distribution to yield a censoring rate of 25%. We obtained the log-rank estimates through bisection search.

The results from the above 2 sets of studies are summarized in Tables 1 and 2. The results of Table 1 pertain to the continuous covariate. Each entry in the tables is based on 10 000 simulated data sets and B = 10000. Clearly, all 8 variants of the resampling methods work well in that the variance estimators accurately reflect the true variations and the associated confidence intervals have proper coverage probabilities. There are virtually no differences between the LS and SV methods or between the direct and bootstrap estimation of V. For the rank regression under the normal error distribution, the Bernoulli sampling appears to be slightly better than the normal sampling. For median regression, the new resampling method is approximately 100 times faster than bootstrap (with 10 000 resamples); for rank regression, it is approximately 1000 times faster.


View this table:
[in this window]
[in a new window]

 
Table 1. Simulation results for heteroscedastic median regression

 

View this table:
[in this window]
[in a new window]

 
Table 2. Simulation results for rank regression with censored data

 

    4. APPLICATIONS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
4.1. Multiple myeloma study

We applied the proposed resampling methods to a multiple myeloma study (Krall and others, 1975Go). Out of the 65 patients who were treated with alkylating agents, 48 died during the study. Following Jin and others (2003)Go, we fitted model (2.2) with hemoglobin and the logarithm of blood urea nitrogen as the covariates by using both the log-rank and the Gehan estimators. The Gehan estimator is obtained by incorporating the weight function Formula into (2.3). We considered the 8 variants of the resampling methods evaluated in the simulation studies. The differences are negligible between the LS and the SV methods and between the direct and the bootstrap methods of estimating V. The results based on the SV method and direct evaluation of V are shown in Table 3. These results are comparable to those of Jin and others (2003)Go but were obtained with much less time.


View this table:
[in this window]
[in a new window]

 
Table 3. Rank regression analysis of the myeloma data

 
4.2. Atherosclerosis Risk in Communities Study

We also applied our methods to the Atherosclerosis Risk in Communities Study (The ARIC Investigators, 1989Go), which is an epidemiologic cohort study of 15 792 subjects aged 45–64 years to investigate the etiology of atherosclerosis and other diseases. We considered all incident coronary heart disease (CHD) cases occurring between 1987 and 2001. We focused on the Caucasian sample, which consists of 11 526 subjects with 774 cases. We used model (2.2) to study the effects of 5 covariates, including smoking status (ever smoke = 1, never smoke = 0), 2 dummy variables contrasting Minnesota and Washington states to North Carolina, gender (male = 1, female = 0), and standardized age at the baseline, on the time to the occurrence of CHD. For large data sets such as this one, the methods of Jin and others (2003, 2006)Go are not computationally feasible. We used the Nelder–Mead algorithm as implemented in MATLAB to calculate the log-rank and Buckley–James estimates. The results based on the LS and SV methods with direct evaluation of V and 10 000 normal random samples are displayed in Table 4. For comparison, we also report the results of the method of Parzen and others (1994)Go with B = 10000. The standard error estimates are very similar between the LS and the SV methods, whereas those of the method of Parzen and others tend to be slightly larger. The larger standard error estimates by the method of Parzen and others are likely due to the unstabilities of the perturbed estimating equations. Indeed, the method of Parzen and others produced 7 extreme estimates in the Buckley–James estimation of the gender effect, which were excluded in the standard error calculations. For the new resampling approach, it took approximately 1 and 3 min on an IBM BladeCenter HS20 machine to estimate the standard errors for the log-rank and Buckley–James estimators, respectively, whereas the method of Parzen and others consumed 10 and 24 h, respectively.


View this table:
[in this window]
[in a new window]

 
Table 4. Accelerated failure time regression for the Atherosclerosis Risk in Communities data

 

    5. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
The existing resampling methods require solving estimating equations or minimizing loss functions repeatedly, whereas the proposed methods only involve the evaluation of estimating functions. In complex situations, such as rank regression and least squares regression with censored data, the amount of time required to evaluate an estimating function is negligible as compared to solving the corresponding estimating equation. Then, the proposed methods are orders of magnitude faster than the existing resampling methods. Despite the continuing improvement in computer power, this degree of saving is very important, especially for large data sets and for simulation studies. Adopting the proposed resampling procedures will not only enhance the utilities of many existing nonparametric and semiparametric estimators but also facilitate the development and evaluation of new methods for complex biostatistical problems.

The approach of Hu and Kalbfleisch (2000)Go does not require solving estimating equations repeatedly in order to construct confidence intervals but requires to do so for estimating the variances of parameter estimators. It is restricted to linear estimating functions with independent terms and thus would be applicable to quantile regression, but not to rank regression or Buckley–James estimation.

Our method can be viewed as a version of Monte Carlo numerical differentiation. In contrast to the usual numerical differentiation that uses fixed step sizes, the new method generates random step sizes Z, exploring a broad range of step sizes and producing stable estimates. Numerical results indicate that our method is not sensitive to the choice of the distribution of Z.

The proposed methods have very broad applications and are particularly applicable to the situations in which the method of Parzen and others has been used. We have focused our attention on nonsmooth estimating functions. In some situations, the estimating functions are differentiable, but the derivatives are difficult to calculate. Then, the proposed resampling methods would also be appealing.

The results of Section 2 continue to hold if (2.1) is replaced by the more general expansion

Formula

where G is a zero-mean random vector whose covariance matrix can be consistently estimated. Thus, the proposed resampling methods can be applied to multivariate responses, biased sampling, and time series data among others. Indeed, the n1/2 convergence rate is not essential. Furthermore, our approach can potentially be extended to semiparametric situations in which infinite-dimensional parameters are part of {theta}.


    FUNDING
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 
National Institutes of Health.


    ACKNOWLEDGMENTS
 
The authors thank the reviewers for helpful comments. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. SIMULATION STUDIES
 4. APPLICATIONS
 5. DISCUSSION
 FUNDING
 REFERENCES
 

    Buchinsky M. Estimating the asymptotic covariance-matrix for quantile regression-models—a Monte Carlo study. Journal of Econometrics (1995) 68:303–338.[CrossRef]

    Cai TX, Pepe MS, Zheng YY, Lumley T, Jenny NS. The sensitivity and specificity of markers for event times. Biostatistics (2006) 7:182–197.[Abstract/Free Full Text]

    Chen YQ, Jewell NP. On a general class of semiparametric hazards regression models. Biometrika (2001) 88:687–702.[Abstract/Free Full Text]

    Cox DR, Oakes D. Analysis of Survival Data (1984) London: Chapman and Hall.

    Efron B, Tibshirani RJ. An Introduction to the Bootstrap (1993) New York: Chapman and Hall.

    Hu F, Kalbfleisch JD. The estimating function bootstrap (with discussion). Canadian Journal of Statistics (2000) 28:449–499.[CrossRef]

    Huber PJ. Robust estimation of a location parameter. The Annals of Mathematical Statistics (1964) 35:73–101.

    Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika (2003) 90:341–353.[Abstract/Free Full Text]

    Jin Z, Lin DY, Ying Z. On the least squares regression with censored data. Biometrika (2006) 93:147–162.[Abstract/Free Full Text]

    Jin Z, Ying Z, Wei LJ. A simple resampling method by perturbing the minimand. Biometrika (2001) 88:381–390.[Abstract/Free Full Text]

    Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data (2002) 2nd edition. Hoboken, NJ: Wiley.

    Koenker R, D'Orey V. Computing regression quantiles. Applied Statistics (1987) 36:383–393.[CrossRef][Web of Science]

    Krall JM, Uthoff VA, Harley JB. A step-up procedure for selecting variables associated with survival. Biometrics (1975) 31:49–57.[CrossRef][Web of Science][Medline]

    Lin DY, Geyer CJ. Computational methods for semiparametric linear regression with censored data. Journal of Computational and Graphical Statistics (1992) 1:77–90.[CrossRef]

    Parzen MI, Wei LJ, Ying Z. A resampling method based on pivotal estimating functions. Biometrika (1994) 81:341–350.[Abstract/Free Full Text]

    ARIC Investigators The. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. American Journal of Epidemiology (1989) 129:687–702.[Abstract/Free Full Text]

    Tsiatis AA. Estimating regression parameters using linear rank tests for censored data. The Annals of Statistics (1990) 18:354–372.

    van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes (1996) New York: Springer.

    Yao Q, Wei LJ, Hogan JW. Analysis of incomplete repeated measurements with dependent censoring times. Biometrika (1998) 85:139–149.[Abstract/Free Full Text]

    Ying Z. A large sample study of rank estimation for censored regression data. The Annals of Statistics (1993) 21:76–99.

    Received January 4, 2007; revised August 24, 2007; accepted for publication August 31, 2007.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    9/2/355    most recent
    kxm034v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Zeng, D.
    Right arrow Articles by Lin, D. Y.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Zeng, D.
    Right arrow Articles by Lin, D. Y.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?