Biostatistics Advance Access originally published online on November 11, 2005
Biostatistics 2006 7(2):268-285; doi:10.1093/biostatistics/kxj006
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited
Division of Biostatistics, University of California, San Francisco, CA 94143-0560, USA mark{at}biostat.ucsf.edu
| SUMMARY |
|---|
|
|
|---|
Diffuse large-B-cell lymphoma (DLBCL) is an aggressive malignancy of mature B lymphocytes and is the most common type of lymphoma in adults. While treatment advances have been substantial in what was formerly a fatal disease, less than 50% of patients achieve lasting remission. In an effort to predict treatment success and explain disease heterogeneity clinical features have been employed for prognostic purposes, but have yielded only modest predictive performance. This has spawned a series of high-profile microarray-based gene expression studies of DLBCL, in the hope that molecular-level information could be used to refine prognosis. The intent of this paper is to reevaluate these microarray-based prognostic assessments, and extend the statistical methodology that has been used in this context. Methodological challenges arise in using patients' gene expression profiles to predict survival endpoints on account of the large number of genes and their complex interdependence. We initially focus on the Lymphochip data and analysis of Rosenwald et al. (2002). After describing relationships between the analyses performed and gene harvesting (Hastie et al., 2001a), we argue for the utility of penalized approaches, in particular least angle regression-least absolute shrinkage and selection operator (Efron et al., 2004). While these techniques have been extended to the proportional hazards/partial likelihood framework, the resultant algorithms are computationally burdensome. We develop residual-based approximations that eliminate this burden yet perform similarly. Comparisons of predictive accuracy across both methods and studies are effected using time-dependent receiver operating characteristic curves. These indicate that gene expression data, in turn, only delivers modest predictions of posttherapy DLBCL survival. We conclude by outlining possibilities for further work.
Keywords: Diffuse large-B-cell lymphoma; Gene harvesting; Least angle regression; Microarray; Proportional hazards; Time-dependent ROC curve
| 1. INTRODUCTION |
|---|
|
|
|---|
Diffuse large-B-cell lymphoma (DLBCL) is an aggressive malignancy of mature B lymphocytes. It is the most common type of lymphoma in adults with an annual incidence exceeding 25 000 cases in the United States and accounting for 40% of cases of non-Hodgkin's lymphoma. While treatment advances have been substantial in what was formerly a fatal disease, less than 50% of patients achieve lasting remission. In an effort to predict treatment success and explain disease heterogeneity five clinical features (age, tumor stage, serum lactate dehydrogenase concentration, performance status, number of extranodal disease sites) were synthesized into the International Prognostic Index (IPI). The modest predictive performance of the IPI and other variables has spawned a series of high-profile DNA microarray investigations into DLBCL, in the hope that molecular-level information could be used to refine prognosis.
DNA microarray technology, fast emerging as one of the most widely used and powerful tools for a suite of genomic applications, can profile gene expression of an organism on a whole-genome scale. Due to a variety of factors, the data structures and attendant research questions surrounding microarray studies are not amenable to standard off-the-shelf analyses. Consequently, there has been a correspondingly rapid rate of development of new statistical methodologies (see, e.g. Parmigiani et al., 2003
; Speed, 2003
). Much of this attention has focused on preprocessing, clustering, identifying differentially expressed genes, or discrimination. Less effort has been directed to regression problems where we have a linked, continuous phenotype (outcome). Here, we address the situation where the phenotype is a (right censored) survival timetime from DLBCL treatment to death.
The paper is organized as follows. The suite of DLBCL microarray studies is briefly overviewed next. Chronologically tracing this progression is informative with regard to the evolution of statistical approaches. This provides motivation for application of the gene-harvesting methodology proposed by Hastie et al. (2001a)
as detailed in Section 2. Results so obtained, along with some putative shortcomings of gene harvesting, further motivate use of least angle regression-least absolute shrinkage and selection operator (LARS-Lasso) procedures (Efron et al., 2004
). Extension of these methods to censored survival phenotypes was developed by Gui and Li (2004)
, as outlined in Section 3.1.
The resultant techniques are, however, very computationally intensive. A residual finesse strategy, described in Section 3.2, is established to mitigate these concerns. Results from applying these methods, as well as several others advanced in the literature for these data, are presented in Section 3.4. Parallel analyses straddling different microarray platforms are summarized in Section 3.5. One tool used to facilitate comparisons between methods and data sets is time-dependent ROC curves (Heagerty et al., 2000
), briefly described in Section 3.3. Finally, Section 4 offers some concluding discussion and possibilities for further work.
One of the early success stories showcasing microarray technology was the work of Alizadeh et al. (2000)
. This paper introduced the Lymphochip, a custom-designed cDNA microarray that was enriched with genes that are preferentially expressed in lymphoid cells as well as with genes implicated in processes pertinent to cancer or immunology. We reanalyze data deriving from Lymphochip experiments in subsequent sections. Substantively, Alizadeh et al. used Lymphochip gene expression data to effect molecular subtyping of DLBCL, identifying two novel and distinct forms corresponding to the cell of origin: germinal-center B-like and activated B-like, with the former exhibiting significantly better overall survival. Such molecular subtyping represents a consequential advance in explaining the considerable clinical heterogeneity implicit in the DLBCL diagnostic category that had resisted attempts based on morphologic and pre-microarray molecular parameters.
Importantly, subtype delineation did not make recourse to the linked survival data, but rather was based solely on gene expression data using only unsupervised methods. In particular, hierarchical clustering using average linkage and correlation distance (Eisen et al., 1988
) was used to generate the now familiar heat map matrix. This graphically depicts gene expression profiles across differing samples, with the gene order rearranged according to the cluster dendrogram. Informal, visual inspection of this heat map led to the identification of signaturesgene clusters possessing ostensibly coherent expression patterns that can be labeled according to the function, or cell type where expressed, of the constituent genes. Subsequently, survival differences between select signatures are evaluated and served to establish these as novel molecular subtypes.
It is worth reemphasizing the informal nature of this process. No criteria or algorithm is invoked in the initial signature selection. While the eye-brain interface is adept at such visual tasks, the dimensions and variations common to microarray studies imparts difficulty to this clearly subjective signature extraction process. The heterogeneity of clusters so extracted motivated the development of tight clustering (Tseng and Wong, 2005
). Further, the use of unsupervised methods with after-the-fact assessment of phenotype (here survival) differences has been criticized relative to direct, supervised methods (Smyth et al., 2003
). We revisit these concerns subsequently, but note here that it is possible to imbue the signature selection task with some formalism. Firstly, a method for determining an appropriate number of clusters is applied. Examples of such techniques include permutation-based (Tibshirani et al., 2001
; Dudoit and Fridlyand, 2002
) or model-based (Yeung et al., 2001
) approaches. Secondly, the putative labeling of these clusters can be pursued using tools such as GoMiner, MAPPFinder, and EASE that identify enriched functional or other categories relative to nominated, annotated databases.
Shipp et al. (2002)
use gene expression data obtained from use of Affymetrix Hu6800 oligonucleotide microarrays (Affymetrix, Santa Clara, CA). Again, they have linked survival phenotypes. However, in addition to arguing for the value of microarray-based profiling, they forcefully emphasize differences between unsupervised and supervised analyses. Failure to reproduce Alizadeh et al.'s association between cell-of-origin subtypes and survival outcome is partly ascribed to these differences. Yet, while advocating supervised methods, Shipp et al. acknowledge that their approach of reducing the survival phenotype into a dichotomy (cured versus fatal/refractory disease), as opposed to using actual survival times, is limiting. In dealing with a dichotomous outcome the machine-learning term supervised equates to the statistical term classification. The classifier employed by Shipp et al. is the so-called weighted voting which was previously used in a celebrated microarray classification study to discriminate between acute lymphocytic leukemia and acute mylogenous leukemia (Golub et al., 1999
). It has been shown (Dudoit et al., 2002
; Tibshirani et al., 2002
) that weighted voting essentially coincides with a penalized version of linear discriminant analysis wherein the sample covariance matrix is replaced by its diagonal. Some such (strong) regularization has been widely demonstrated to be effective in microarray-based prediction/classification problems on account of dimensional considerationsthe number (p) of covariates (genes) greatly exceeds the number (n) of samples (arrays). The bias-variance trade-off implications of this p >> n configuration necessitate regularization and/or simple methods and this frequently dovetails with interpretability concerns whereby interest is in selecting small subsets of important genes. We subsequently (Section 3.2) adopt a similar strategy employing regularization via L1 penalization that simultaneously confers gene selection and using diagonal approximations to enhance computational practicality.
The next article of note is that of Rosenwald et al. (2002)
. This paper not only utilizes supervised methods applied directly to survival outcomes but also does so in the context of a relatively large study comprising some 240 patients. Expression profiling again makes recourse to Lymphochip DNA microarrays. It is to this data set that our primary reanalyses are applied. A detailed description of the analytic approach adopted by Rosenwald et al., along with salient results, is provided in Section 2, where relationships with gene harvesting are indicated.
The paper by Wright et al. (2003)
was primarily concerned with attempting to reconcile the disparate findings (Alizadeh et al. versus Shipp et al.) with regard to the role (or lack thereof) of cell-of-origin subtypes. A central methodologic component is the use of compound covariates (called linear predictor scores) to discriminate between subtypes. Such a base classifier is closely connected to weighted voting and shrunken centroid approaches (see Tibshirani et al., 2002
). The latter is distinguished by its use of soft, rather than hard, threshholding in constructing the classifier, a strategy that has been advocated in other settings (Donoho and Johnstone, 1994
), and which we employ subsequently. Wright et al. demonstrated that subtypes and attendant survival differences can be elicited from the (Affymetrix) expression profiling data obtained by Shipp et al.
Finally, we summarize the work of Lossos et al. (2004)
. Motivated by the desire to devise a simple, clinically practicable prediction model for survival in DLBCL patients based on gene expression, they retreat from microarray-based expression profiling in favor of quantitative reverse-transcriptase polymerase chain reaction (RT-PCR) expression measurements applied to a select and targeted list of 36 genes. This list derives in part from the above-mentioned microarray studies. The fact that no overlap exists among the genes included in the prediction models derived by Shipp et al., and Rosenwald et al., further motivates Lossos et al. to devise an alternate prediction scheme. Univariate Cox proportional hazards models are used to associate each of the 36 genes with survival for 66 patients. A somewhat arbitrary thresholding of the associated Wald z statistic is used to select the six most significant genes. These are then included in a multivariate proportional hazards model and the resultant coefficients are used as the basis for a survival prediction model and, via stratification, the construction of prognostic groups. Validation of these predictions and groups is assessed by applying the derived six-gene model to the Shipp et al. and Rosenwald et al. microarray data. We further evaluate both the six-gene model and this approach to validation in Sections 3.4 and 3.5.
| 2. GENE HARVESTING FOR LYMPHOCHIP |
|---|
|
|
|---|
As indicated, Rosenwald et al. (2002)
The process whereby signatures are identified is, as mentioned, quite subjective. Indeed, in the concluding remarks of the article cited re the signature extraction process (Shaffer et al., 2001
), it is stated The concept of a gene expression signature is somewhat fluid and operationally defined. However, it is unclear what constitutes operationally defined hereno algorithmic or computable definitions are proffered; rather there is sole reliance on visual inspection of heat maps. Beyond the concerns with this visual approach that were stated previously and led to the development of tight clustering, there are further issues surrounding the clustering itself. Even with regard to hierarchical clustering there are choices for linkage type and distance metric, there being nothing sacrosanct about average linkage and correlation distance. Some sensitivity analysis and stability assessment of designated signatures with respect to differing specifications seems warranted. And, of course, there are numerous alternative clustering schemes.
In parallel, univariateone gene at a timeCox proportional hazards regressions are fitted to relate expression to survival. This is done on the training data (ntrain = 160). The partitioning into training and test data is somewhat unorthodox with (some) selection based on survival phenotype being applied, as opposed to random assignment. The 35 genes that achieved univariate significance, and which satisfied variability and missingness criteria, were then molded into a multivariate predictor as follows. Where possible, the 35 genes were assigned to the previously identified signatures: nine to the major histocompatibility complex (MHC) class II signature, three to the germinal-center B-cell signature, three to the proliferation signature, and six to the lymph-node signature. The remaining 14 genes were not associated with any particular gene expression signature. Expression values for those genes ascribed to each of the four signatures were then averaged (within patients) yielding four signature expression profiles. An exception was made for the MHC class II signature where only the four (of nine) most significant genes were used, due to high between-gene correlations.
These steps provided the building blocks for development of a multivariate predictor. Initially, all four of the signature expression profiles are forced into a Cox proportional hazards regression. Then forward stepwise methods were used to add genes from the set of 14 genes that were not affiliated with any of the signatures. This resulted in the inclusion of one additional gene, BMP6. The resultant linear predictor score, based on these five terms and attendant coefficients, constitutes Rosenwald et al.'s gene-expression-based survival predictor. This is validated using the test data by testing the corresponding predictor (retaining training-data-based coefficients) for association with survival, as well as by stratifying (into quartiles) based on predictor score, and testing between strata survival differences.
While this prediction strategy moves beyond the earlier unsupervised approach of Alizadeh et al. (2000)
by direct use of survival outcomes, it nonetheless retains the subjectivity and arbitrariness surrounding use of cluster-based signatures as outlined above. Furthermore, several selection criteria are invoked (univariate significance level, variability, within-signature correlation, stepwise entry significance level). Evaluating the performance characteristics of the resultant predictor requires evaluation of the entire model-building processthis is particularly consequential in the microarray setting due to dimensionality concerns (see Hastie et al., 2001b
; West et al., 2001
; Simon et al., 2003
; Pittman et al., 2004
).
We address the central concernsignature arbitrariness/subjectivityby appeal to methodology devised expressly for supervised analysis of microarray data, namely gene harvesting (Hastie et al., 2001a
). In fact, gene harvesting represents a formalization of the approach pursued by Rosenwald et al. that overcomes some of the shortcomings. Gene harvesting commences with the same hierarchical clustering scheme as used by Rosenwald et al. (2002)
to elicit signatures. But, in our application while using the same 7399 Lymphochip features, we restrict to expression data derived from the 160 DLBCL patients constituting the training data, thereby avoiding any possible data reuse concerns. Rather than visual extraction of signatures, gene harvesting treats the average expression profile from each node (cluster) in the resulting dendrogram as a potential covariate. So, if we start with p features, this procedure augments with an additional p1 expression profile that serves as a covariate for the supervised (regression) modeling. With survival outcomes, this modeling is forward stepwise Cox proportional hazards regression, akin to that employed by Rosenwald et al. (2002)
. However, rather than using significance criteria, whose known distortion in the face of greedy selection algorithms will be compounded here due to large p, gene harvesting uses cross-validation to determine the number of terms retained. Provision is also made for between-gene interactions, nonlinear effects, and biasing results toward selection of larger clusters.
Hastie et al. (2001a)
claim two advantages for this approach. Firstly, because of the familiarity of hierarchical clustering (e.g. Eisen et al., 1988
) in unsupervised analyses of microarray expression data, the usage of clusters as covariates will be convenient for interpretation. Indeed, it is arguably this familiarity that led to the signature-based approaches. Secondly, by using clusters as covariates, selection of correlated sets of genes is favored, which in turn potentially reduces overfitting. However, it has been demonstrated that gene harvesting can give rise to artifactual results (Segal et al., 2003
) as we discuss in conjunction with the results obtained. But first we give a brief overview of the gene-harvesting algorithm.
The available data from Rosenwald et al. (2002)
consist of the n x p matrix of gene expression values x = [xij], where xij is the expression level of the jth feature (j = 1, ..., p = 7399) for the ith DLBCL patient (i = 1, ..., n = 240). Each patient also provides a potentially right-censored survival outcome yi, along with a corresponding censoring indicator
i. A hierarchical clustering algorithm is applied to the expression matrix corresponding to the ntrain = 160 patients constituting the training data and, for each of the resulting clusters ck, k = 1, ..., 2p 1, the average expression profile
ck = (
1, ck,
2, ck, ...,
ntrain, ck), where
i, ck = 1/|ck |
j
ck xij is obtained. Note that we have included the individual genes (the tips/leaves of the dendrogram) as clusters (of size 1) in this formulationtheir average expression profile coinciding with the individual gene profile.
This set of 2p1 average expression profiles constitutes the covariate set (
). A forward stepwise Cox proportional hazards regression is performed as follows. Initially, the first term in the model (
) is the covariate that maximizes a modified partial log-likelihood score statistic, hereafter termed the score. The modification consists of a variance penalty used to inflate the score statistic denominator. This is motivated by stability concerns deriving from small variances (see Tusher et al., 2001
). At each subsequent stage, candidates for inclusion consist of all products between a term in
and a term in
. The term chosen for inclusion is that which most improves fit, again measured by maximizing the score, but subject to biasing in favor of selection of larger clusters. This biasing is effected by selecting the largest cluster whose score is within 100 x (1
)% of the maximal attained score, where
is a user-specified tuning parameter. The process continues until some prespecified maximum number of terms, m, have been added to the model. The number of terms retained is subsequently determined by cross-validation. However, we note that the cross-validation results presented for the illustrative gene-harvesting survival analysis in Hastie et al. (2001a)
are puzzling in that resubstitution error significantly exceeds cross-validated error. Further, cross-validation is inherently unstable for p >> n problems, as is shown later in Section 3.4.
Hastie et al. (2001a)
restrict to second-order interaction terms; i.e. product terms are limited to pairwise products. This is partly motivated by interpretational considerations and borrows from the multivariate adaptive regression spline (MARS) methodology of Friedman (1991)
. The gene-harvesting model for the hazard function
(·) for the ith patient is then
![]() | (2.1) |
Here,
0(ti) is the baseline hazard and S1 connotes the set of clusters that enter singly while S2 is the set of clusters that enter as product terms. So,
Estimation makes recourse to partial likelihood (Cox, 1972
) which we outline further in Section 3.
Results from applying gene harvesting using average linkage and Euclidean distance are presented in Table 1. Interactions were not allowed; i.e. S2 =
. Additionally, biasing toward the selection of larger clusters was employed with
= 0.1.
|
The first cluster selected, #3498 with 14 members, contains exclusively MHC class II genes of particular subtypes. We note that MHC class II was one of the four signatures chosen by Rosenwald et al. (2002)
The potential for gene harvesting to produce artifactual solutions derives from the selection of clusters for which the average gene expression profile is strongly associated with outcome, but the cluster itself is heterogeneous so that individual constituent gene profiles are not associated with outcome, and the average does not provide a meaningful summary. That this potential can be realized was demonstrated in Segal et al. (2003)
and is exacerbated by (i) the small sample sizes typically encountered in microarray studies, and (ii) gene harvesting's use of all clusters as candidate covariates, irrespective of their coherence. Here, the MHC class II cluster identified by gene harvesting was not artifactual in that there are strong correlations between all 14 constituent genes and all are individually associated with survival.
An additional concern for gene harvesting is sensitivity to the hierarchical clustering scheme and/or distance metric used. However, here the clusters chosen were essentially invariant to linkage method (average, single, complete).
Nonetheless, selection of the MHC class II cluster was predicated on biasing covariate selection in favor of larger clusters. If this biasing is turned off (
= 0), then the resultant selections are all singletons, i.e. individual genes. Moreover, the underlying forward selection strategy represents a very greedy algorithm (Efron et al., 2004
). Such greediness is likely to be detrimental to predictive performance in the microarray p >> n setting which, as previously mentioned, rewards simple methods. Indeed, in Section 3.4, we show that gene harvesting does relatively poorly in predicting survival in the test data set. One avenue toward simpler models is via regularization, as considered in Section 3.
| 3. LARS-LASSO FOR SURVIVAL PHENOTYPES |
|---|
|
|
|---|
Forward stepwise regression is an adaptive, greedy algorithm. Gene harvesting combines this with basis expansion, the addition of (numerous, derived) covariates, as constituted by the cluster-averaged expression profiles. In the microarray p >> n context, this combination proves extremely costly from the perspective of effective number of parameters or (adaptive) degrees of freedom. Methods for estimating these quantities based on simulation (Ye, 1998
0.5n, while five steps cost 0.9n. While the quantifications are certainly problem specific, the message is that microarray regression problems are preferably tackled using some form of regularization in order to contain these costs. The same conclusion has been reached for microarray classification problems (Dudoit et al., 2002
Consider a standard Cox regression model of the form
![]() | (3.1) |
where
0(t) is an unspecified baseline hazard function, ß = (ß1, ..., ßp) is the vector of the regression coefficients, and X = (X1, ..., Xp) is the vector of gene expression levels with the corresponding Lymphochip values of xi = (xi1, ..., xip) for the ith patient.
The partial likelihood (Cox, 1972
) is then
![]() | (3.2) |
where D is the set of patient indices for those experiencing the event (here death) and Rr denotes the set of indices of patients at risk at time tr. Let l(ß) = log L(ß).
In the case of continuous response Tibshirani (1996)
demonstrated that penalized least squares, using an L1 bound on the coefficients, offered advantages over the heretofore used L2 bound, as employed by ridge regression and, more recently, support vector machines (Cristianini and Shawe-Taylor, 2000
). In particular, in addition to improving on prediction accuracy through shrinkage, the nature of the L1 constraint is such that interpretation is enhanced by zeroing out many covariates. That is, the least absolute deviation (L1) constraint simultaneously effects shrinkage and selection, spawning the acronym lasso.
For survival outcomes Tibshirani (1997)
defined the lasso estimate ß via constrained partial likelihood:
![]() | (3.3) |
Here s is a tuning parameter that determines how many coefficients are nonzero. Accordingly, varying s provides a continuous form of subset regression and thereby overcomes the instability associated with conventional, discrete versions. For estimation, Tibshirani (1997)
used an iteratively reweighted least squares (IRLS)-based linearization to reformulate this constrained optimization problem as an L1 -penalized linear regression model and then applied the same quadratic programming approach used for continuous outcomes. The linearization utilizes the following components:
= ß' X; µ =
l/
; A =
2l/

T; z =
+ Aµ. A one-term Taylor series approximation for the negative log partial likelihood, l(ß) is then given by (z
)T A(z
) and the iterative estimation scheme is
- Fix s and initialize
= 0.
- Compute
, µ, A, and z based on the current value of
.
- Minimize (z
' X)T A(z
' X) subject to
|
j|
s.
- Iterate between steps 2 and 3 until
converges.
The quadratic programming approach used for the minimization in step 3 starts from the least squares solution. Hence, it does not handle the p > n case and so is inapplicable to microarray applications. This limitation, along with efficiency concerns, motivated Osborne et al. (2000)
to regard the lasso as a convex programming problem and to devise an algorithm based on homotopy methods. While the objectives of handling p > n and improving efficiency were realized, the algorithm remained problematic for microarray settings as described in Segal et al. (2003)
. A highly efficient algorithm, LARS, that includes lasso as a special case, was recently devised for the continuous outcome setting by Efron et al. (2004)
. Remarkably, LARS has the same computational cost as a standard (unconstrained) multiple least squares fit. A software package, lars() for R and Splus is available from http://www-stat.stanford.edu/
hastie/Papers/LARS/. We loosely treat LARS and lasso as interchangeable here, results being equivalent for models of reasonable size, and refer to the procedure as LARS-Lasso.
LARS-Lasso has been extended to survival outcomes by Gui and Li (2004)
. They employed a slightly modified LARS-Lasso algorithm to solve the step 3 minimization. The modification consists of mapping the minimization into the original, unweighted problem via Cholesky decomposition of A. Gui and Li applied the resultant methodology to the DLBCL Lymphochip data as analyzed by Rosenwald et al. (2002)
. They obtained interpretable results with good predictive performance (described subsequently) in concordance with LARS-Lasso goals. However, embedding the modified LARS-Lasso estimation scheme within the IRLS strategy for handling survival endpoints serves to undo much of the computational efficiency that LARS-Lasso formerly delivered. For example, run times were 8 h for a single solution and 2 days for 10-fold cross-validation of the tuning parameter s on a Dell desktop with Intel Pentium 4 3.2 GHz and 2.00 GB of RAM (J. Gui, personal communication). Next, we describe an estimation strategy that restores the computational efficiency of LARS-Lasso.
The computational burden associated with Gui and Li's (2004)
adaptation of LARS-Lasso to survival outcomes derives from embedding the (modified) LARS-Lasso algorithm within the IRLS algorithm, which in turn is required to cope with the complexities of the Cox proportional hazards model. However, at the expense of some approximation, these can be eliminated using what we term a residual finesse. This strategy involves substituting suitably chosen residuals for the survival endpoint, enabling inheritance of simple algorithms applicable to continuous outcomes and bypassing difficulties deriving from censoring. Residual finesse has been employed to adapt additive (Cox) models (Grambsch et al., 1995
; Segal et al., 1995
), multiple adaptive regression splines (MARS) (LeBlanc and Crowley, 1999
), and regression trees (LeBlanc and Crowley, 1992
; Kele
and Segal, 2002
) to censored survival outcomes. Perhaps anticipating such a strategy, Gui and Li (2004)
pose two questions/objections: (i) which residuals should be used among the competing types available for survival data and (ii) there are no guarantees that optimizing resultant residual sums of squares (RSS) improves fit to the survival data (Therneau and Grambsch, 2000
). In response to (i) we next make a case for use of deviance residuals, and defer consideration of (ii) to Section 4.
For the Rosenwald et al. (2002)
Lymphochip study, we are in the simple setting where there are no time-dependent covariates, at most one event per patient, and each patient is under observation from time t = 0 (time of treatment). The martingale residual for the ith patient then takes a simple nonintegral form and can be interpreted as the difference between observed and expected number of events. This expected number is
![]() | (3.4) |
and the martingale residual is
i =
i Êi. Deviance residuals are derived from the highly skewed martingale residuals by a normalizing transformation analogous to deviance residuals for Poisson regression:
![]() | (3.5) |
where the approximation follows from a one-term Taylor expansion (Therneau and Grambsch, 2000
).
Returning to the LARS-Lasso Cox proportional hazards estimation scheme of Gui and Li, we can show that (diag(A))i
Êi. Further, these diagonal elements of A dominate the off-diagonals of A by an order of magnitude (Tibshirani, 1997
). Also, we have that (z
' X) = AM. Accordingly, the minimization (step 3) can be approximated as
![]() | (3.6) |
the RSS of (first-order Taylor approximation) deviance residuals.
So, in order to effect the L1 constrained minimization of step 3 we initially compute null deviance residuals and use these as outcomes for the computationally efficient standard (continuous response) LARS-Lasso algorithm. Null deviance residuals can be obtained simply in Splus or R by specifying residual type and zero iterations in the call to coxph().
Typically, in the microarray-DLBCL survival literature, the evaluation of how well a predictive model performs is done as follows: (i) risk scores based on the fitted model are computed for patients in a (withheld) test data set, (ii) strata (usually two) are created based on thresholding these scores, and (iii) log-rank testing of between-strata survival differences is performed. The greater the achieved significance the more predictive the model is deemed. Limitations of this approach include not just the arbitrariness of the imposed stratification but, more importantly, the familiar shortcoming of p -values not necessarily capturing effect size/explained variation.
A more refined approach is afforded by the use of time-dependent ROC curves, proposed by Heagerty et al. (2000)
and used in the present context by Gui and Li (2004)
. Denote the predictive model as f (X) and define time-dependent sensitivity and specificity functions at cutoff point c as
![]() | (3.7) |
![]() | (3.8) |
where
(t) is the event indicator at time t. The corresponding time-dependent ROC curve at time t, ROC(t|f(X)), is then just the plot of sensitivity(c, t|f(X)) versus 1 specificity(c, t|f(X)) with cutoff point c varying. We use the analogous area under the time-dependent ROC(t|f(X)) curve, AUC(t|f(X)), as a summary in the next sections. In order to estimate the conditional probabilities in (3.7) and (3.8), accounting for possible censoring, we follow Heagerty et al. in employing a nearest neighbor estimator for the bivariate distribution function (Akritas, 1994
).
Figure 1 shows coefficient trajectories as obtained by applying LARS-Lasso to the training data with null deviance residuals as outcome on expression data as covariates. The individual trajectories correspond to individual genes and can best be interpreted when viewed as functions of increasing values of the L1 constraint s. When s = 0 there are no terms (genes) in the model. As s increases successively, more genes enter. Note that while the magnitude of the coefficients tend to increase with increasing s, there are instances of nonmonotonicity. This is due to between-gene correlation. From both an interpretative and predictive standpoint, the critical issue is determining an appropriate value of s.
|
We approached determination of s using the built-in cross-validation routines provided by lars(). However, results were unstable, showing considerable variation according to changes in the number of cross-validation folds. Gui and Li (2004)
The top four genes selected, and the order in which they are chosen, using L1 -penalized partial likelihood (Gui and Li, 2004
) coincide with the top four selections (and order) that result from use of our deviance residual finesse. Indeed, the agreement extends appreciably further (e.g. 15 out of top 20 coincide), providing some indication that the approximations employed in arriving at the deviance residual finesse are reasonable. These top four genes are itemized (in order of selection) in Table 2. Strikingly, they represent three of the four signatures identified by Rosenwald et al. (2002)
. If we extend to the top 10 genes, then 7/10 (coincident with Gui and Li's results) are representative of the signatures, with the remaining three having putative biological relevance.
|
Next, we illustrate the standard approach to evaluating the predictive performance of models using the test data. Following Li and Gui (2004) and Gui and Li (2004)
|
The gene-harvesting entry corresponds to a model with four terms. These correspond to two singleton genes, one cluster of size two, and one cluster of size three. One of the singletons is thyroxine-binding globulin precursor (AA805575 [GenBank] ) which was also chosen by the L1 -penalized approaches and is included in the germinal-center B-cell signature. The selected harvesting model did not favor inclusion of larger clusters, i.e.
= 0. Even poorer performance results from biasing toward larger clusters with, for example, an R2 of 0.012 corresponding to
= 0.1. The Bair and Tibshirani (2004) model is based on 23 genes selected by thresholding univariate Cox regression test statistics (akin to Lossos et al., 2004
R2 values are those provided by the coxph() R function and are based on partial likelihood differences as described in Nagelkerke (1991)
. We consider these further in Section 4.
To overcome the arbitrariness of stratification into low- and high-risk groups we next obtain time-dependent ROC curves for the same six models. The corresponding time-dependent areas under the ROC curves are displayed in Figure 2.
|
Both the stratified and time-dependent ROC curve analyses are consistent and support the following findings.
- The deviance-residual-based approximation to direct L1 penalization seems empirically reasonable. This is no surprise given the previously noted correspondence in gene selections. But, the agreement evidenced here is additionally affirming since the associated coefficients are estimated under different criteria (Cox partial likelihood, squared error loss).
- The best results are obtained by both L1 penalization schemes and the signatures approach. Again, this is not surprising in view of the overlap in gene selections. Further, as already illustrated with the MHC class II signature, the genes constituting the signatures themselves are strongly correlated and so the corresponding average expression profile is similar to the profile of any individual member gene.
Given that this set of methods yield the best results it is purposeful to further compare among them. Arguably, the deviance-residual-based LARS-Lasso approach is the easiest to implement and affords a more direct, less ad hoc approach than the use of signatures. That the signature approach does so well is a tribute to the perspicacity with which the signatures themselves were chosen. However, the fact the test data were used as part of this selection process is a concern. It is difficult to assess how much of a concern it is since the process itself is so subjective.
- The relatively lesser performance of the Lossos et al. (2004)
model and the Bair and Tibshirani (2004)
model comes although (or because) these methods use more genes (6 and 23, respectively) than the above methods. This is presumably attributable to the gene selection process which, in both instances, is based on thresholding univariate significance levels. That Bair and Tibshirani (2004)
observe diminished predictive performance when such univariate-based preselection is not employed testifies to the need for some form of regularization/selection. The results here suggest that gains can be achieved by pursuing this in a multivariate fashion as per LARS-Lasso. Of further note are the comparison studies undertaken by Bair and Tibshirani that, for the DLBCL data as analyzed here, show superior performance of their supervised principal components technique versus a suite of competitors. By extension, the L1 -penalized approaches perform better still.
- The worst performance belongs to gene harvesting. Quite simply we ascribe this to greediness. As mentioned previously, gene harvesting, with its basis expansion and unpenalized forward selection strategy, is extravagant when cost is measured via effective number of parameters. Of further interest is the performance disparity between gene harvesting and the Rosenwald et al. (2002)
signature approach. To the extent that the latter represents a model reachable via the former, the superior results for signatures again illustrate the perspicacity of signature selection and/or the reuse of test data.
- These performance rankings should not be overinterpreted since the overall differences as measured by the time-dependent ROC areas (Figure 2) are modest. Moreover, even the best predictive performance when (properly) assessed in this fashion is modestrecall that an ROC area of 0.5 corresponds to predicting a binary outcome by a coin toss. This perspective corrects the overly optimistic viewpoint conveyed by log-rank testing of derived high/low-risk groups (Table 3) and gives pause to use of the associated models/genes as the basis for treatment decisions or prognosis as has been advocated.
To further assess the generality of the models specified by the use of the LARS-Lasso strategy, we investigated how well methods performed when applied to gene expression data obtained under alternate platforms, for which linked, posttherapy, DLBCL survival was also available. In particular, we applied the LARS-Lasso model as developed using Lymphochip to data obtained using Affymetrix Hu6800 arrays (Shipp et al., 2002
). We then undertook the reverse operation, developing a model using Affymetrix data and evaluating against the Lymphochip study. Effecting these comparisons requires processing to convert (absolute) intensity measures as yielded by single-channel platforms such as Affymetrix to the (relative) log-ratio expression values provided by a competitive hybridization scheme (two channel arrays in general and Lymphochip in particular). Since there is no definitive way to do this, we (separately) applied two processing schemes that have been used in this setting (Wright et al., 2003
; Lossos et al., 2004
). Additionally, it is necessary to match genes (where possible) between platforms. We used the NetAffx resource (Liu et al., 2003
) and Lymphochip GenBank accessions for this purpose.
Again, we summarize results via the area under time-dependent ROC curves. The left panel of Figure 3 presents results where the evaluation data are the same Lymphochip test set as used above. We have reproduced the curve corresponding to the six-gene model of Lossos et al. (2004)
. Recall that this model was developed by selecting from genes implicated in studies using both Lymphochip and Affymetrix platforms. We have modified the model obtained using L1 -penalized proportional hazards from that graphed in Figure 2 and summarized in Table 3. The modification consists of eliminating the LC_29222 gene since this gene is not represented on the Affymetrix Hu6800 chip. Finally, we have used the Lymphochip test data to appraise a model developed using L1 -penalized proportional hazards on the Affymetrix data. The converse set of curves obtained when using the same models but evaluated on (converted) Affymetrix data are given in the right panel of Figure 3. The analysis leading to these curves employed the conversion approach of Lossos et al. (2004)
, but similar results and conclusions pertain if the conversion approach of Wright et al. (2003)
is used. Alternatively, it is possible to use the original (unconverted) data and refit the respective models using the appropriate genes. This direct approach also yielded highly similar curves. Since there were only 58 patients in this study, for the purposes of these comparisons we have not partitioned into training and test sets. Results for the six-gene L1 -penalized model developed on the same data are therefore optimistic. Some conclusions regarding the generalizability across data sets in particular, and the utility of microarray expression profiling for predicting DLBCL survival more broadly, can be drawn by contrasting the two panels.
|
The intermediary performance of the Lossos et al. (2004)
| 4. DISCUSSION |
|---|
|
|
|---|
In our reanalysis of microarray studies of gene expression with linked DLBCL survival phenotypes, a number of interesting findings emerged from both a methodologic and substantive standpoint, summarized in Sections 3.4 and 3.5, respectively. These results are perhaps surprising to the extent that (i) the performance of gene harvesting, which ostensibly generalizes the signature-based approach of Rosenwald et al. (2002)
As always, there are several avenues for further work. In order to couple the sparse variable selection attributes of L1 penalties with clustering/signature constructs, whereby sets of strongly correlated variables are either jointly included or excluded, Zou and Hastie (2003)
developed the elastic net which combines L1 and L2 penalties for continuous phenotypes. Extensions to survival outcome analogous to that devised for LARS-Lasso warrant development and exploration.
As indicated in Section 3.4, model selection via cross-validation is highly unstable in this (microarray) setting, the p >> n data configuration accentuating this well-known attribute of cross-validation. Indeed, Efron (2004)
shows that covariance-penalty-based model selection schemes (e.g. Akaike Information Criterion) can be viewed as Rao-Blackwellized versions of cross-validation and therefore are conferred with greater stability. This leads Efron to advocate the use of covariance penalty methods despite these approaches requiring additional model assumptions. However, another required ingredient is an estimate of error variance,
2. Typically, it is recommended that this be obtained from a big model. While some guidelines have been advanced for specifying such big models, these pertain to the n > p situation and breakdown for microarray problems. Trying to mimic specifications based on degrees of freedom or effective numbers of parameters (see Section 3) is circular since these also require an estimate of
2. In the particular case of LARS-Lasso we could base model size on the onset of nonmonotonicity in the coefficient traces (cf. Figure 1), although this is certainly ad hoc.
We return to the issue of minimizing residual-based R2's not equating to improving fit for censored survival data, as raised by Therneau and Grambsch (2000)
and echoed by Gui and Li (2004)
(see Section 3.2). A key consideration is how fit is measured in this context. The R2 measure pro- vided by Therneau's R function coxph, and featured in Table 3, is based on information theoretic ideas (Nagelkerke, 1991
). However, O'Quigley and Xu (2001)
show that an equivalent criterion can be expressed as a weighted R2 involving Schoenfeld residuals. Heagerty and Zheng (2003)
described analogies between the underpinnings of O'Quigley and Xu's formulation and time-dependent ROC curves. This connection between a fit criterion and (a) RSS partially validates our deviance residual finesse. The empiric agreement of the finesse with the full LARS-Lasso solution affords additional encouragement. Further analytic and empiric work to evaluate the reasonableness of the approximations used is indicated.
We view the superior performance of the L1 -penalized approaches noted in Section 3.4, as being illustrative of the principle enunciated by Rosset et al. (2004): Working in high dimension and regularizing is statistically preferable to a two-step procedure of first reducing the dimension, then fitting a model in the reduced space. Zhu and Hastie (2004)
used L2 -penalized logistic regression to pursue classification in the context of microarray cancer studies with categorical outcomes. Because the L2 penalty does not possess the built-in selection property of L1 penalties, they need to employ feature (gene) elimination strategies. We are presently investigating LARS-Lasso style L1 -penalized algorithms for such categorical outcome types and, more generally the entire class of generalized linear models, based on analogous deviance residual finesses to that used here for survival outcomes.
| ACKNOWLEDGMENTS |
|---|
Assistance received from Jiang Gui, Hongzhe Li, Trevor Hastie, Rob Tibshirani, Eric Bair, and Chuck McCulloch with respect to data/software provision and helpful discussions is greatly appreciated. Funding to pay the Open Access publication charges for this article was provided by the Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco.
| REFERENCES |
|---|
|
|
|---|
-
AKRITAS, M. G. (1994). Nearest neighbor estimation of a bivariate distribution under random censoring. Annals of Statistics 22, 12991327.
ALIZADEH, A. A., EISEN, M. B., DAVIS, R. E., MA, C., LOSSOS, I. S., ROSENWALD, A., BOLDRICK, J. C., SABET, H., TRAN, T., YU, X. et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503511.[CrossRef][Medline]
BAIR, E. AND TIBSHIRANI, R. (2004). Semi-supervised methods to predict patient survival from gene expression data. Public Library of Science: Biology 2, 511522.
COX, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society B 34, 187220.
CRISTIANINI, N. AND SHAWE-TAYLOR, J. (2000). An Introduction to Support Vector Machines. Cambridge: Cambridge University Press.
DONOHO, D. L. AND JOHNSTONE, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425455.
DUDOIT, S. AND FRIDLYAND, J. (2002). A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Biology 3, 0036.10036.21.
DUDOIT, S., FRIDLYAND, J. AND SPEED, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97, 7787.[CrossRef]
EFRON, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association 99, 619642.[CrossRef]
EFRON, B., HASTIE, T. J., JOHNSTONE, I. AND TIBSHIRANI, R. J. (2004). Least angle regression. Annals of Statistics 32, 407451.[CrossRef][ISI]
EISEN, M. B., SPELLMAN, P. T., BROWN, P. O. AND BOTSTEIN, D. (1988). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 1486314868.
FRIEDMAN, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics 19, 167.
GOLUB, T. R., SLONIM, D. K., TAMAYO, P., HUARD, C., GAASENBEEK, M., MESIROV, J. P., COLLER, H., LOH, M. L., DOWNING, J. R., CALIGIURI, M. A. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531537.
GRAMBSCH, P. M., THERNEAU, T. M. AND FLEMING, T. R. (1995). Diagnostic plots to reveal functional form for covariates in multiplicative intensity models. Biometrics 51, 14691482.[CrossRef][ISI][Medline]
GUI, J. AND LI, H. (2004). Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. http://repositories.cdlib.org/cbmb/L1Cox/.
HASTIE, T., TIBSHIRANI, R., BOTSTEIN, D. AND BROWN, P. (2001a). Supervised harvesting of expression trees. Genome Biology 2, 0003.10003.12.
HASTIE, T., TIBSHIRANI, R. AND FRIEDMAN, J. H. (2001b). The Elements of Statistical Learning. New York: Springer.
HEAGERTY, P. J., LUMLEY, T. AND PEPE, M. (2000). Time dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56, 337344.[CrossRef][ISI][Medline]
HEAGERTY, P. J. AND ZHENG, Y. (2003). Survival model predictive accuracy and ROC curves. Technical Report. Department of Biostatistics, University of Washington.
KELE
, S. AND SEGAL, M. R. (2002). Residual-based tree-structured survival analysis. Statistics in Medicine 21, 313326.[CrossRef][ISI][Medline]
LEBLANC, M. AND CROWLEY, J. (1992). Relative risk regression trees for censored survival data. Biometrics 48, 411425.[CrossRef][ISI][Medline]
LEBLANC, M. AND CROWLEY, J. (1999). Adaptive regression splines in the Cox model. Biometrics 55, 204213.[CrossRef][ISI][Medline]
LI, H. AND GUI, J. (2004). Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 20, i208i215.[Abstract]
LIU, G., LORAINE, A. E., SHIGETA, R., CLINE, M., CHENG, J., VALMEEKAM, V., SUN, S., KULP, D. AND SIANI-ROSE, M. A. (2003). NetAffx: Affymetrix probesets and annotations. Nucleic Acids Research 31, 8286.
LOSSOS, I. S., CZERWINSKI, D. K., ALIZADEH, A. A., WECHSER, M. A., TIBSHIRANI, R., BOTSTEIN, D. AND LEVY, R. (2004). Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes. New England Journal of Medicine 350, 18281837.











