Biostatistics Advance Access published online on September 14, 2007
Biostatistics, doi:10.1093/biostatistics/kxm033
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Cross-study validation and combined analysis of gene expression microarray data
Division of Biostatistics, The Hollings Cancer Center, Medical University of South Carolina, Charleston, SC 29425, USA garrettm{at}musc.edu
Division of Biostatistics, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
Department of Mathematics and Applied Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
Division of Biostatistics, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
Department of Pathology, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Investigations of transcript levels on a genomic scale using hybridization-based arrays have led to formidable advances in our understanding of the biology of many human illnesses. At the same time, these investigations have generated controversy because of the probabilistic nature of the conclusions and the surfacing of noticeable discrepancies between the results of studies addressing the same biological question. In this article, we present simple and effective data analysis and visualization tools for gauging the degree to which the findings of one study are reproduced by others and for integrating multiple studies in a single analysis. We describe these approaches in the context of studies of breast cancer and illustrate that it is possible to identify a substantial biologically relevant subset of the human genome within which hybridization results are reliable. The subset generally varies with the platforms used, the tissues studied, and the populations being sampled. Despite important differences, it is also possible to develop simple expression measures that allow comparison across platforms, studies, laboratories and populations. Important biological signals are often preserved or enhanced. Cross-study validation and combination of microarray results requires careful, but not overly complex, statistical thinking and can become a routine component of genomic analysis.
Keywords: Breast cancer; Intraclass correlation; Meta-analysis; Prinicipal components; Reliability
| 1. INTRODUCTION |
|---|
|
|
|---|
Microarray experiments simultaneously measure the transcriptional activity of a large number of genes. In recent years, hundreds of these experiments have been performed, providing important insight on gene regulation and revealing some interesting relationships between genes and disease phenotypes. Yet, because of cost and other practical limitations, most microarray studies have used a relatively small number of biological samples. As a result, cross-referencing lists of genes found to be associated with disease phenotypes in 2 separate studies usually produce relatively few genes in common (Parmigiani and others, 2004
In this scenario, 3 related statistical questions are important for making progress toward an objective assessment of the worth of microarray analysis results: (1) reliability, that is whether different measuring techniques are capturing the same biological variation, (2) validation, that is whether the conclusions of a study are supported by other similar studies, and (3) combination, that is whether more reliable conclusions can be reached by jointly analyzing multiple studies. In this paper, we develop and illustrate simple and effective statistical approaches to address these 3 questions.
Variation in measurements of gene expression includes "technological" variation, associated with limitations of the measuring technologies, and "biological" variation, due to the phenotype or experimental condition being studied, as well as natural variation of levels of gene expression in different samples of the same type (Pritchard and others, 2001
), (Enard and others, 2002
), (Oleksiak and others, 2002
). Because the determinants of both technological and biological variation tend to vary from study to study and from laboratory to laboratory, study- and laboratory-specific effects are inherent in most gene expression array datasets. Study-specific conditions may affect different genes differently, generating study "signatures." Overall, study effects can dominate the biological signal of interest (Aach and others, 2000) in a large number of genes.
There are several microarray technologies currently in use (Schena, 2000
; Southern, 2001
; Hardiman, 2002
). Although all exploit hybridization, they differ in how DNA sequences are laid on the array, the length of these sequences, and the number of samples measured in each hybridization. As a result, an important source of technological variability in gene expression measurements is the platform used. Several studies have compared measurements across platforms. Kuo and others (2002)
compared mRNA measurements from Stanford-type complementary DNA (cDNA) microarrays and Affymetrix oligonucleotide chips using the so-called "NCI 60" set of cancer cell lines (Ross and others, 2000
), (Scherf and others, 2000
). Based on correlation between matched measurements and concordance between clusters, they concluded that correlation can be poor and clusters of genes and cell lines can be discordant between the 2 technologies. They also provide evidence to indicate that sequence-specific factors influence reliability. Similar caveats about cross-platform variability have been raised by other analyses comparing Affymetrix to custom-made cDNA arrays (Yuen and others, 2002
) and Affymetrix to IncyteGenomics arrays (Kothapalli and others, 2002
).
To compare experiments that are performed on different gene expression platforms, oligonucleotide probe sets, spotted sequences, and other microarray features need to be linked. Expressed sequence tag (EST) sequencing projects have generated cDNA sequences for human, mouse, and other organisms. These are identified by an accession number in databases such as GenBank. Extensive efforts have been devoted to grouping these sequences into clusters representing a single transcript (Boguski and Schuler, 1995
), (Miller and others, 1999
), (Quackenbush and others, 2001
). UniGene, developed at the National Center for Biotechnology Information (2003)
, partitions ESTs derived from one organism into mutually exclusive clusters based on sequence homology (Boguski and Schuler, 1995
). As GenBank is growing, UniGene clustering is performed periodically, resulting in new clusters. Typically, a sequence-specific identifier (GenBank accession number) serves as a reference to the array probe sequences. A subset of the UniGene clusters is reliably linked to genes of known function such as those catalogues in RefSeq or AceView (Pruitt and others, 2005
), (Thierry-Mieg, D. and Thierry-Mieg, J., 2006
).
Because of these challenges, the first and most critical step in cross-study analysis of gene expression is to identify a subset of genes that are consistently measured across platforms. Even after 2 features are mapped to the same UniGene cluster, inconsistencies across platforms can still be substantial because of differences in hybridization efficiencies, limitations of linkage databases, splicing variation, and a number of other factors. To this end, we propose a tool, termed integrative correlation (IGC), that can be used to investigate reliability and to isolate a subset of reliable genes for further analysis. IGC was previously illustrated, though not described in any statistical detail, in Parmigiani and others (2004)
and in Hayes and others (2006)
. Here, we provide a rigorous discussion in the general case of an arbitrary number of studies and describe how to choose reliability cutoffs using false-discovery rates (FDRs).
We then turn to the assessment of reliability of gene selection in class comparison analyses, whose goal is to identify the genes that are differentially expressed across a given set of conditions or phenotypes. We propose exploratory analysis techniques based on visualizing suitably chosen standardized effect (SE) sizes. Such visualizations make it simple to place the study-specific effect sizes and their discrepancies across studies in the context of the variation across the genome. We also build on these visualizations to propose simple meta-analytic methods for selecting genes that are reproducibly associated with a phenotype of interest and for assessing the FDRs associated with this selection. While typical meta-analytic approaches focus on combination, our genome-wide implementation focuses on reliable selection, where reliability is defined in terms of both IGC and consistency of effect sizes.
To illustrate our statistical methods for assessing reliability and comparing results across studies, we will use 3 breast cancer gene expression datasets (West and others, 2001), (Huang and others, 2003
), (Sorlie and others, 2003
). These studies provide information about lymph node status of patients that we will use to determine which genes show evidence of differential expression in node-positive versus node-negative breast cancers. Node-positive tumors have been shown to be associated with decreased survival as compared to node-negative cancers (Carter and others, 1989
), (Soerjomataram and others, 2007
). Elzagheid and others (2006)
propose that lymph node–positive and -negative cancers may require separate prognosticators based on suggested differences in the associations between known biologic predictors of survival and actual survival in analyses stratified by nodal status. This suggests that a better understanding of possible molecular differences in node-positive and node-negative disease may elucidate different treatment in these groups.
| 2. DATA |
|---|
|
|
|---|
We have chosen 3 publicly available breast cancer microarray gene expression studies, all of which had information about nodal status. The first study is by West and others (2001) who selected 49 primary breast cancers from the Duke Breast Cancer SPORE frozen tissue bank. Twenty-four of the tumor samples are from women with node-positive disease. Two total RNA extractions were performed for each tumor and total RNA was pooled. The Affymetrix HuGeneFL GENECHIP arrays were used, scanned with a GENECHIP scanner, and processed using GENECHIP Expression Analysis algorithm. Data were available on 6834 genes.
The second dataset was published by Sorlie and others (2003)
. Of the 122 breast tumor samples, 79 were node positive, 34 were node negative, and nodal status was unavailable for the remaining samples. cDNA was obtained from each tumor sample and hybridized to 2-channel cDNA arrays with a common reference cell line. A total of 30 078 genes were included in the datasets available; however, there was significant missing data on genes from many of the samples. By requiring that each gene included in our analysis have missing data on no more than 20% of samples, we are left with 6199 genes. Details and data from this study can be found at www.pnas.org and http://genome-www.stanford.edu/breast_cancer/.
Huang and others (2003)
analyzed 89 heterogeneous breast tumors which were obtained at biopsy of primary tumor, banked between 1991 and 2001 and chosen based on clinical parameters. Seventy-three of these tumors were from node-positive women and 11 from node-negative women. Total RNA was extracted and synthesized to cDNA. Affymetrix human U95Av2 arrays were used for hybridization, arrays were scanned using Affymetrix GeneArray scanner and expression values calculated using Affymetrix suite 5.0 (avdif). A total of 12 625 genes were available for analysis. The Huang data can be accessed at www.thelancet.com.
Note that each dataset was preprocessed before being made publicly available.
| 3. METHODS |
|---|
|
|
|---|
The methods can be divided into 3 areas: evaluating reliability of gene expression across studies, comparing strength of evidence of gene–phenotype associations across studies, and combining effects across studies. For each dataset, the only genes of interest are the ones which are common across the studies being compared. For ease of computation, the datasets are hence subsetted to include only the common genes based on UniGene ID and are then sorted so that the order of the genes is identical across studies.
The tools for performing the analyses we demonstrate in this paper are available as an R libary called MergeMaid (Cope and others, 2004
) which can be downloaded at http://www.bioconductor.org. Functions are available for merging data, estimating correlations within and across studies, performing comparative analyses, and validating gene sets.
We use f and h to index studies (i.e. datasets), and j and k to index genes within studies, while nf refers to the number of samples within the study f. The genes which are common across the 2 studies f and h (i.e. the intersection of genes) are denoted by
fh, whereas
f is the set of all genes in study f.
When considering just one gene expression dataset, it can be difficult or impossible to determine whether or not expression levels are reliably measured. If the spot on the chip is incorrect (i.e. the sequence spotted does not correspond to the gene that is assumed to be spotted), which is not an uncommon occurrence, it will usually be consistently incorrect for all the genes in one study because the same type of chips are generally used within one study. By comparing patterns of expressions across 2 studies which use different types of chips (e.g. Affymetrix oligonucleotide chips versus spotted cDNA glass arrays), we may be able to determine if there are inconsistencies by looking at how the genes vary in relation to other genes. Additionally, if a gene shows relatively little variability across samples, we would expect it to have relatively low correlations with other genes. To assess which genes lack reliability and variability, we consider correlations between genes. However, when comparing single-channel arrays to 2-channel arrays, or comparing 2 different 2-channel arrays, it is important to consider what the reference sample of the 2-channel arrays represents. It may, in many cases, limit the conclusions that can be drawn by cross-study analyses. This is described in more detail in Section 4.4.
Gene reliability can be assessed by looking at the correlation structure across studies. For a dataset with G genes, the GxG correlation matrix describes how each gene is correlated to every other gene in the study. If we calculate the correlation matrices for 2 studies using the same set of G genes and these datasets both have comparable and reliable gene expression measures across samples, then we would expect that the 2 correlation matrices would be similar. In other words, for gene j, the correlation between the jth row of the correlation matrix in study f and the jth row of the correlation matrix in study h would be high. The correlation of correlations can be denoted by r
,
![]() | (3.1) |
where
fjg is the correlation between genes g and j in study f, and
fj is the average correlation between gene j and all the other G genes being assessed. This gene-specific measure can tell us which genes tend to be measured consistently and with agreement across studies. In general, we do not necessarily expect there to be high correlations, but we do expect that we will see overall positive trends. Note that r
is not measuring agreement of gene expression measurement. It is instead identifying if the associations between genes show similar patterns in different studies. The level of correlation comparing any 2 studies will depend on the sample covariance matrix of each study. This is discussed in more detail in Cope and others (2007)
. Genes showing negative trends or no trend suggest that the gene signals across the 2 studies are different for that particular gene, which may be due to mislabeled spots on the arrays, other chip-specific problems, or artifacts of the experimental conditions. Genes showing no trend (i.e. correlation close to zero) may lack variability in one or both studies. Variability can be measured by looking at a gene's correlation with other genes within the study: if the variability in the gene is due to signal as opposed to noise, we would expect that the gene would show a wide range of correlations with other genes. If the range of correlations for a given gene is narrow, then we conclude that the gene shows little variability. As a result, genes with a narrow range of correlation with other genes will not be useful in our meta-analysis: the variation in these genes is likely random (i.e. not related to phenotype) and they will not be helpful is distinguishing between phenotypes.
For any given gene comparison across 2 studies, we highlight 5 common scenarios for the resulting r
: (1) r
is positive and gene j shows variability in both studies f and h, (2) r
is negative and shows variability in both studies f and h, (3) r
is low (i.e. close to zero) and variability of gene j is high in both studies f and h, (4) r
shows low correlation and variability of gene j is high in only one of the studies, and (5) r
is low and variability of gene j is low in both studies. These are exhibited in Figure 1, where the 5 scenarios are shown via genes 1–5, respectively. For each panel, the correlations of gene j with each of the other genes in
fh in study f is plotted versus the correlations of gene j with each of the other genes in
fh in study h, with r
.
Of the 5 types of genes shown in Figure 1, only genes with patterns similar to gene 1 are of interest. Genes 2 and 3 show different patterns of expression in the 2 studies, suggesting lack of reliability. For gene 5, we see low variation in both studies, so that the gene is not likely to be informative for distinguishing between phenotypes. Gene 4 only shows variability in one of the 2 studies: this suggests that (a) gene 4 could be poorly or incorrectly measured in one of the studies or (b) one of the studies lacks heterogeneity of samples for gene 4 while the other study shows signficant variability across samples. Regardless of whether (a) or (b) is true, the gene does not show a comparable pattern across the 2 studies and cannot be considered consistently measured. Hence, in addition to identifying genes with inconsistent patterns in the 2 studies (i.e. like genes 2 and 3), the IGC of genes also identifies genes which do not show sufficient variability in co-regulation across samples (i.e. like genes 4 and 5). For simplicity, we refer to genes similar to gene 1 as "reliable" and genes 2, 3, 4, and 5 as "unreliable." We consider r
a gene's reliability score.
Plots like those shown in figure 1 can be made easily to see patterns of reliability across datasets for several genes. However,
fh is usually so large that a high-throughput method is necessary for determining which genes are reliable or not via a cutoff for reliability. A common approach in gene expression array analysis is to derive the null distribution for the statistic of interest and choose a threshold based on a FDR. In this case, we choose a FDR that will ensure that relatively few "unreliable" genes are declared reliable. The null distribution is estimated by creating a series of "pseudogenes" which are calculated to maintain the sample covariance matrix, but remove the systematic associations (Cope and others, 2007
). To generate data for one pseudogene, we randomly sample one gene expression value from each column of the original gene expression matrix (for a vector of length S). We then add this gene to our expression matrix (i.e. we add it as a row) and calculate its r
value using the process described above. We repeat this, sampling pseudogenes and calculating their r
values some large number of times M, and use the M values to estimate the null distribution. For any given FDR, we can estimate the appropriate cutoff that defines acceptable reliability (Tusher and others, 2001
). Once the cutoff is chosen, we can proceed in the analysis using only the genes that have been deemed reliable.
|
The approach described in Section 3.2 is attractive in its relative simplicity and ease of calculation. However, it is limited because correlation can only be calculated between 2 variables: the correlation of correlation approach described cannot be extended to more than 2 studies. As an alternative, we consider using an approach based on the intraclass correlation (ICC). Both correlation and ICC are well-established methods for assessing reliability of measures of latent constructs and are used most commonly in psychosocial research and other research areas that tend to have unobservable outcomes (e.g. mental health research). In our case, we assume that the latent variable is the true correlation between pairwise correlations across studies.
The ICC can be used to assess reliability when there are more than 2 "raters," which are studies in our application (i.e. Huang, West, and Sorlie studies). There are different ICCs, depending on the study design and particularly how the raters are selected. For our purposes, the "fixed rater" ICC is most appropriate (Shrout and Fleiss, 1979
). In the context of reliability of gene expression measures, it incorporates the variation due to gene expression and the expression unexplained, adjusting for study-specific variation. The fixed rater ICC ranges from -1 to 1 and can be easily estimated using an analysis of variance (ANOVA) approach. In our setting, we calculate estimates of variance due to genes, variance due to studies, and unexplained (i.e. error) variance. One of the key benefits of this approach is that it is extraordinarily efficient due to simple matrix multiplication.
In some situtations, ICC is similar to the Pearson correlation for S = 2. The equations for the correlation and the ICC reduce to similar quantities when written using the same notation, but in the ICC, the overall mean (taken across both quantities) replaces the individual means. Additionally, the derivation of the ICC relies on the classical test theory idea that the quantities being compared arise from a true underlying latent variable (i.e. in this case, the true correlation vector between gene j and the other genes). The ICC has a natural and well-described interpretation and is easy to calculate. It can also be applied when the Spearman correlation has been used to identify the correlation matrices of genes within studies, making its inferences more robust. However, there are other possible candidates for measuring reliability of gene expression across studies for S
2, such as the minimum, median, or average of the pairwise correlations. In our application in Section 4, we compare these measures.
In the context of gene expression, we will refer to the ICC as the integrative correlation, or IGC, which we believe is more intuitive in this setting and note that the IGC is defined for S
2. In our further discussions and applications for S = 2 and S = 3, we use the IGC to determine gene reliability. For a given gene j, the variation in the correlations between gene j and the other genes is partitioned into 3 parts: variance due to gene effects, variance due to study effects, and residual variance. Using standard ANOVA terminology, these are referred to as the gene sum of squares (SSGj), the study sum of squares (SSSj), and error sum of squares (SSEj) and can defined as follows:
![]() | (3.2) |
![]() | (3.3) |
![]() | (3.4) |
where
j is the average correlation for gene j across all genes and studies,
jg is the average correlation between gene g and gene j across the S studies,
js is the average correlation between gene g and all genes for study s, and
jgs is the correlation between gene j and gene g in study s. Note that this partitioning can be more easily understood by observing that the total variance for gene j can be written as
![]() | (3.5) |
Using the sums of squares defined in (3.2) and (3.4) above, the fixed rater IGC for gene j can be written as
|
| (3.6) |
Using (3.6), we obtain the IGC for each gene as our reliability measure. To determine the appropriate cutoff for selecting genes for further study, we consider 2 types of cutoffs: fixed intuitive cutoffs that are not data driven and cutoffs based on a derived null distribution of IGC. The null distribution is estimated using the "pseudogene" approach as described in Section 3.2. For any given FDR, we can estimate the appropriate cutoff that defines acceptable reliability. In practice, we use the null distribution to find a cutoff, but we also consider other cutoffs for IGC reliability, which could be more or less stringent.
After choosing genes to include in the analysis based on their reliability, the association between gene and phenotype can be compared across studies. This requires that information about the phenotype (e.g. cancer type, recurrence status) be available and comparable across the studies being combined. Summary statistics need to be calculated for each study, quantifying the association between gene and phenotype for each of the reliable genes. We often find this preferable to combining the raw data values across studies. In our experience, gene expression data cannot be easily combined. In addition to reasons common to other meta-analytic settings (e.g. experimental designs are different), some reasons specific to gene expression data include the following: often, the available data have been preprocessed using algorithms so that the original data cannot be recovered; reference samples may differ; and the measurement of expression might be different (e.g. log ratio of expression versus absolute expression). Others have integrated datasets, but using studies with comparable chip measurement (Jiang and others, 2004
; Shedden and Taylor, 2005
; Morris and others, 2004
; Kapp and others, 2006
) unlike our situation. Examples of authors combining gene microarray data under a variety of conditions are discussed in Section 5.
In the case of a binary phenotype variable, we can detect which genes are associated with phenotype in several ways. Logistic regression can be used, where phenotype is regressed on gene expression for each gene, and the log odds ratio quantifies the relationship. Although the idea that gene expression is "predictive" of phenotype is appealing, this approach runs into problems when gene expression perfectly predicts phenotype. When there exists a gene whose range of expressions for the 2 phenotypes do not overlap, the odds ratio is not estimable using a standard logistic regression approach.
As an alternative, a difference in means and a t-ratio approach can be used, where the average expressions in the 2 phenotypes are compared. To provide a common metric across studies, for each gene in each study, we divide the gene expression data by the standard deviation of gene expression data in the reference type before estimating the difference in means (e.g. for the lymph node status, we would consider node negative as our reference phenotype). As a result, our effect size, wj, for gene j is measured by the differences in the means in the 2 groups being compared, divided by the standard deviation of the reference group:
|
| (3.7) |
The wj value represents the "effect size" for the association between phenotype and gene expression for gene j.
To assess statistical significance of the observed differences in expression across the 2 phenotypes, we can use a significance analysis of microarrays (SAM) approach, an extenstion of the t-test. The SAM statistic (dj) is calculated for each gene j making an adjustment for variance stabilization (Tusher and others, 2001
). The SAM statistic (dj) has been widely used for detecting differentially expressed genes in gene expression microarray studies and takes the form
|
| (3.8) |
where the standardized mean expression in the 2 phenotypes for gene j are denoted by
1j and
2j, and sj is a pooled estimate of the standard error of the difference. The way in which dj differs from the t-statistic is the inclusion of s0, which is a constant chosen to minimize the coefficient of variation of dj. When the t-ratio is used in gene expression analyses, many genes tend to be deemed significant that have very small variation in expression across phenotype, but also have small standard errors. The SAM statistic adjusts for this with the addition of s0 in the denominator, picking up more genes with larger effect sizes and fewer genes with inconsequential differences.
Notice the difference between wj and dj: wj is simply a scaled version of the effect size (similar to a standardized regression coefficient), while dj is more comparable to a t-statistic, which directly provides information about statistical significance. The major computational difference is that dj is sample size adjusted because it uses a standard error in the denominator, whereas wj is not sample size adjusted, using a standard deviation in the denominator. One reason to prefer wj to dj is that when considering studies of different sizes, the values of dj are not directly comparable while the wj statistics are. After these computations have been made for each of the reliable genes in each of the studies, we can compare the results using simple scatter plots, including statistical significance (determined by SAM or another approach) as part of the display.
In many of the oncologic gene expression datasets that are collected, time to death or relapse is of primary interest. Researchers are intent on finding genes which may be predictive of good or poor prognosis. The approach used above for comparing results for binary phenotypes can be extended. Just as the logistic regression approach was mentioned for looking at binary phenotypes, a Cox proportional hazards model approach can be used for assessing associations between time to death and gene expression, where survival time or relapse time is regressed on gene expression. The estimation problem that was pointed out earlier in logistic models is less severe in the Cox model, which will accomodate any gene expression dataset that shows some variation across samples and has a sufficient number of events to be able to estimate a hazard ratio. Usually only a small number of events are required for the estimate to be finite.
Log hazard ratios and their statistical significance can be compared across studies. However, the data need to be standardized first in order that the units of the log hazard ratios are comparable. A logical approach is to subtract the row mean (or median) and then divide by the row standard deviation (or another measure of variability) for each row of the gene expression matrix. Additionally, we can use the SAM-type adjustment: we can find the log hazard ratio (hj) and its standard error (sj) for each gene, to calculate the Z-score (Zj = hj/sj) for each gene. Then, we can estimate the value of s0 that stabilizes the variability of the Z-scores and recalculate the Z-score such that
|
| (3.9) |
The SAM approach as applied to Cox regression coefficients was used by Bullinger and others (2004)
for finding gene expression profiles associated with survival in acute myeloid leukemia patients.
Continuous outcomes can be handled analogously, using linear regression where continuous phenotype is regressed on (standardized) gene expression for each gene. In the case of linear regression, it is suggested that some time be spent ensuring that a linear model is appropriate: by studying the linear regression model on several randomly chosen genes and several genes which show strong association with phenotype, the assumptions of linearity and constant variance can be explored.
The techniques we propose can generally be considered meta-analytic approaches to analyzing gene expression data. Meta-analysis is a broad area consisting of techniques for analyzing data obtained from different studies. In our methods, we do not actually combine the data across studies, but instead perform comparative analyses, making inferences based on consistency across studies, and estimate combined inferential statistics. A clear understanding of the basic tenets of meta-analysis is critical to make logicial comparisons. General reviews of meta-analysis include Normand (1999)
Just as in standard meta-analytic approaches, we can go one step further and estimate "pooled" estimates of effect sizes. For example, when looking at overall survival, we could estimate a pooled hazard ratio estimate across studies for each gene in common across a set of studies. Variation in estimates might be assumed to arise from a fixed effects model, such that, for each gene, the log hazard ratios across studies are realizations from a large population of estimates with a common mean,
j. On the other hand, a random effects approach could be assumed, where for each gene and each study, we assume a different mean (e.g.
fj). The latter is a more reasonable approach in general for combining the results from gene expression array experiments due to the many sources of heterogeneity across studies, including the variety of tissues and differences in experimental methods.
However, in the gene microarray setting, we are often in the situation of comparing the results of just a few studies. In our applied example, we have only 3 microarray studies of breast cancer with information on lymph node status. As a result, a random effects approach is not feasible and alternative methods for combining effects need to be utilized. We outline an approach for combining effects across 3 studies which can be extended for k > 3 studies. For k >> 3, we recommend considering a random effects approach as described above.
To combine results, consider first the situation where the studies to be combined have equal sample sizes and the average effect size in each study is 0. In this case, one logical approach would be simply to average the effect sizes in the 2 studies. Another approach would be to use the best-fit line based on perpendicular residuals, which treats studies equivalently (unlike a least-squares regression), estimated via the first eigenvector from a principal components analysis (PCA). In the case of just 2 studies, the best-fit line is defined by a slope equal to the ratio of the 2 elements in the first eigenvector. We can then project each point to this best-fit line and use the distance from the origin to the projected point (adjusted to preserve the metric) to represent the combined effect size. By weighting using the first principal component, we upweight studies that show good agreement with others and downweight those that do not agree with other estimates. In the case of 3 studies, we can define a1, a2, and a3 as the elements in the eigenvector for the first principal component. We use these as weights to combine effect sizes w1, w2, and w3 from the 3 studies for a combined effect wc defined as wc = (a1w1 + a2w2 + a3w3)/(a1 + a2 + a3).
In practice, we need to make some modifications to the above approach because (a) sample sizes are generally not the same and we would like to give more weight to estimates from studies with larger sample sizes and (b) the average effect size in each study is generally not equal to zero (although they are often very close to zero). Our method for combining results across 3 studies is described below:
- Define the vectors w1, w2, and w3 to be the vectors representing effect sizes in the 3 studies.
- Center w1, w2, and w3 by their means.
- Fit principal components to w1, w2, and w3 using covariance, saving the loadings of the first principal component a1, a2, and a3 (i.e. the first eigenvector).
- Estimate the combined effect:
Step 2 ensures that the fitted line intersects the origin. Step 3 uses covariance (instead of correlation) to determine which study should be weighted more heavily: studies with more variation in estimates will tend to be favored. Step 4 does 2 things: it, sample size adjusts the fitted line and it rescales the metric to preserve the original scale of the effect sizes. The vector wc now represents our combined effect size and is used for determining which genes are associated with phenotype using the combination of information from the 2 studies.
Notice that we choose to use covariance rather than correlation for estimating the principal components so as to allow studies that have more variation in their effect sizes to contribute more to the combined effect. This is a sensible approach: large variation in effect sizes suggests that measurement in the study may be better than in a study with little variation in effect sizes. However, the correlation matrix can also be used and this is compared to the approach using covariance in Section 4. As will be seen in Section 4, other approaches could be considered for combining effects, such as simple averaging of statistics. Simple averages have the drawback that they do not use any of the available information about the relative importance of studies in determining the contribution of each study to the overall effect size. For example, in Section 4, we demonstrate the difference in our combined effect size and a statistic based on the averages of SAM statistics (weighted only by sample size).
| 4. RESULTS |
|---|
|
|
|---|
The 3 studies described in Section 2 (referred to as West, Huang, and Sorlie) were analyzed for reliability, and we assessed the comparability of results with regard to the genes associated with lymph node status. For all IGCs reported, the ICC approach (described in Section 3.2) was used for estimation (i.e. none of the results use the correlation approach described in Section 3.2).
The bioconductor library MergeMaid was used for merging the 3 datasets. The MergeMaid library combines phenotype and gene expression data from multiple studies, averages multiple occurences of genes within a study before merging, keeps track of which genes are in common across studies, and sorts genes consistently across studies. We found that the 3 studies had 2789 genes in common.
Reliability was assessed for each pair of studies by calculating the pairwise and the overall IGCs. The distributions of the pairwise IGCs and their null distributions are shown in Figure 2(A–C). The FDR cutoff was set to 1%, meaning that 1% of genes deemed reliable based on the selected cutoff will not be reliable. From these plots, approximately 7% of the genes are reliable between Huang and Sorlie, 16% between West and Sorlie, and 27% between Huang and West. The distribution of the overall IGC for all 3 studies for the 2789 genes is shown in Figure 2(D). The overall IGC for the 3 studies is less conservative than the pairwise IGCs: approximately 50% of genes are defined as reliable (1401 of the 2789 genes) using the estimated null distribution and an FDR of 1%.
|
In Figure 3, we compare the behavior of the overall IGC (i.e. the IGC for all 3 studies shown in Figure 2(D)) with the average of the pairwise correlations (Figure 3(A)) and also the median, minimum, and maximum (Figures 3(B–D), respectively). These are all metrics that one may consider appropriate to use when choosing which genes to retain for analysis. The maximum, minimum, and median each has its obvious drawbacks. The maximum and minimum are underconservative and overconservative in their approach to inclusion of genes, while the median ignores the extremes. Both the IGC and the average of the correlations fully use information about the comparisons of each study, but in different ways computationally. The agreement between these 2 approaches is quite good (with R2 = 0.97). However, the agreement as a general result will depend on the variation of correlations across studies, the number of studies included, and the patterns of associations across studies.
|
Genes were categorized into 3 categories: (1) high reliability: IGC > 0.11, (2) low reliability: 0 < IGC < 0.11, and (3) unreliable: IGC < 0. We chose 0.11 as the cutoff between high and low relibility because it is the cutoff corresponding to an FDR of 1% shown in Figure 2(D). Using these categories, there were 1401 genes in the high reliability category, 1157 in the low, and 231 in the unreliable. For each of the 3 studies, effect sizes wj are estimated for each gene and plotted against each other by reliability category in Figure 4, including the Spearman correlation coefficient. The 3 columns of plots show the different levels of reliability, and the rows correspond to pairs of studies (row 1—Huang and Sorlie; row 2—West and Sorlie; and row 3—Huang and West). Our expectations are that as reliability decreases, the correlation between effect sizes should decrease. This is supported in all studies: correlations of effect sizes are highest in the highly reliable genes (0.16 and greater), the correlations are lower in the low reliability genes (correlations of approximately 0), and negative correlations in the unreliable genes (correlations of – 0.15, – 0.06, and – 0.03).
|
We also chose an IGC cutoff of 0.20, which selected 618 genes that are highly reliable. The pairwise correlations between effect size across studies were as follows: 0.23 between Sorlie and Huang, 0.20 between West and Sorlie, and 0.34 between Huang and West. With this more reliable set of genes, we have even greater agreement between effect sizes. This may seem to lead us to find a highly reliable set to investigate further and explore combined scores. However, in investigations such as these, where only a small fraction of the total number of genes are expected to be significantly related to outcome (i.e. to nodal status), we decrease our chance of finding important genes considerably. If we were only to perform further analyses on genes for which the IGC was high, we would end up with only 20% of the 2789 genes originally included in all 3 studies.
We next combined scores for the 1401 most reliable genes, weighting by the principal components loadings and study sample sizes, as described in Section 3.3. This was done twice: first using covariance and then using correlation. The standard deviations of effect sizes and the results for the first principal components are shown in Table 1.
|
Using the correlation, each of the studies receives approximately equal weights, as shown in column (2) in Table 1. This is a result of the comparability of associations between studies as seen by the correlations shown in column A of Figure 4. If the covariance is used to estimate the principal components, the weights are less balanced, with the Huang study having the majority of influence with a weight of – 0.97 as compared to the other 2 studies with weights of only – 0.20 and – 0.10. This reflects the larger standard deviation of the effect sizes in the Huang study, which is approximately twice as large as the standard deviation of effect sizes in the other 2 studies. Also note that by using the covariance, a larger proportion of variation of the effect sizes is explained by the first principal component: 61% for covariance versus 44% for correlation. This also suggests that covariance is a better approach for combining effects: the combined effects are able to explain a greater fraction of the individual results. These factors suggest that covariance is optimal compared to the correlation-based approach. However, the choice should also depend on external knowledge of the investigator. There may be a rationale for desiring more equal weights across studies. The comparison of effect sizes based on covariance and sample size–weighted SAM statistics is shown in Figure 5(A) where weights are the square root of the sample size. As can be clearly seen, these 2 statistics are correlated, although the ranges are different. In addition, there is quite a bit of variation so that creating a list of the top genes using the different approaches will yield different gene sets. The analogous results where weights for combining effect sizes are based on a PCA on the correlation matrix are shown in Figure 5(B) where we see a stronger correlation; the average SAM statistics give us a result that is more similar because the weights in the correlation-based PCA are much closer. Note that our conclusions based on the averaged SAM and the combined wc score would be different for some genes (i.e. there are several SAM statistics that are more extreme than values that we have selected as significant). We cannot argue that our approach as opposed to the weighted average of the SAM statistics has chosen the "correct" genes. However, our approach has the appealing quality that it calculates empirical weights for combining effect sizes that are chosen based on the estimated quality of the data from each study, as judged by its agreement with the other studies. We plan to investigate the superiority of the wc statistic over a weighted SAM statistic in future work.
|
Only a handful of genes were found to be significant when compared to the distribution of effect sizes in a null distribution generated using a permutation approach (i.e. nodal status was permuted and the data reanalyzed). One gene of particular interest that did arise was alcohol dehydrogenase 1C (ADH1C) which has been shown in numerous studies to be associated with breast cancer (Coutelle and others, 2004
One critical issue when combining results across microarray studies is comparability of results. In our example, we have 3 types of arrays. The Sorlie study uses 2-channel arrays with a noncancer breast cell line as the reference; the West study and the Huang study both use single-channel arrays from Affymetrix, but they are 2 different types of chips with 2 different sets of probes. As a result, large values of gene expression mean increased expression relative to the nontumor breast tissue in the Sorlie dataset, but in the Huang and West datasets it means high gene expression in absolute terms. The question is whether or not the analysis is meaningful given these differences in the definition of high expression values and whether or not the improved agreement in the West and Huang studies seen in Figure 4 is explained by this difference in meaning.
For each of the 3 datasets, we calculate the correlation matrix of genes and then compare them. It is reasonable to assume that if genes are highly correlated, then their correlation matrix should not be strongly affected by the specific definition. So, it is our contention that different reference samples or channels should not strongly affect measurement of reliability of gene expression across studies. However, we are also exploring ways to combine effect sizes across studies. This is where the interpretation of gene over (or under) expression may be limiting if it is not consistent. A large positive effect size in the Sorlie study implies that a particular gene tends to have a higher ratio of expression in node-positive versus normal tissue than in node-negative versus normal tissue. In the West and Huang studies, a large effect size implies that a gene has high expression in node-positive as compared to node-negative disease. When combining effect sizes across studies, we need to consider that, while similar, these interpretations are not the same. In this study, the differences in interpretation are not particularly problematic. However, additional care is needed when combining studies measured on 2-channel arrays where the reference sample is not the same. For example, one study may use a pooled reference containing equal quantities of samples from all tumors in the study, whereas another may use an independent sample of normal tissue, or commercially available RNA.
| 5. DISCUSSION |
|---|
|
|
|---|
Given the multitude of gene expression studies that are currently available, it is clearly of interest to make inferences based on their collective evidence. Additionally, given the concerns that gene expression data are often unreliable, it is important to be able to confirm results found in one study by looking at other similar studies. When looking at a single study, we usually do not have the ability to evaluate statistically the reliability of gene measurement and instead usually rely on gene-by-gene validations using methods such as reverse transcriptase-polymerase chain reaction among a small subset of genes found to be interesting. When multiple studies are available, statistical cross-study validation offers an effective, high-throughput alternative that should be exploited more routinely in genomic analyses.
In this article, we have described simple tools for both the comparison of results and the combination of association measures across 2 or more studies. We have focused on studies of gene expression using micorarrays because they are common, expensive, and controversy exists on their validity. Our analysis plans can be extended to other high-throughput techniques for measuring the transcriptome and the proteome. We focused on phenotype comparison issues, though reliability filters would be advisable in prediction and class discovery settings as well.
In practice, we suggest that reliability should always be considered before comparing results across studies. By filtering out genes that appear to be mismeasured or incorrectly linked, we avoid making comparisons that are not biologically relevant, and we will likely substantially reduce both the number of discordant findings and the number of falsely concordant findings. This allows us, to some extent, to separate the technology-specific and study-specific variation from the biological variation: for example, while strong biological variation can lead to small IGCs, it is far less likely to lead to negative IGCs. The latter are more likely the result of inaccurate cross-linkage or strong differences in the hybridization behavior of the sequences used as probes in different platforms.
A novel addition in this paper is the extension of the reliability assessment to more than 2 studies. In the case of 2 studies, calculating a measure to describe the overall agreement between 2 studies is straightforward. However, when looking at more than 2 studies, finding a 1-number summary of the overall agreement led us to the IGC. In practice, this approach has been useful for finding genes which are measured consistently across all studies and also allows for the possibility of finding genes which are measured consistently in only 2 of 3 studies, suggesting mismeasurement in the third study.
While individual studies may not provide convincing evidence of gene–phenotype relationships, collectively analyzing gene expression datasets with the same phenotypes may provide substantially more information. By looking at a collection of datasets, we can assess which genes appear to be reliable across studies and which show consistent trends in their association with phenotype. Many gene expression datasets are publicly available so that now there are collections of datasets, each with the goal of finding associations between phenotype and genes.
To date, many authors have compared platforms using the same samples in controlled experiments (Kothapalli and others, 2002
; Kuo and others, 2002
; Yuen and others, 2002
; Barczak and others, 2003
; Culhane and others, 2003
; Mecham and others, 2004
). And, there have been quite a few who have published approaches for combining information across samples and platforms from different studies (discussed below), but essentially no authors who have formally addressed "mismeasurement" of genes across studies. Grigoryev and others (2004)
combined gene expression across different Affymetrix chip types (U47A, U34A, U95A, and U133A) for different animal species. Their approach included matching genes using orthologs (i.e. matching genes across species) and directly combining normalized gene expression data before analysis to obtain overall measures of "statistical significance" of orthologs. While this may be an appropriate method across matched Affymetrix chip technologies, the applicability of this approach for different platforms is unlikely. Rhodes and others (2002)
meta-analyzed 4 prostate cancer gene expression studies, comparing prostate tumor tissue to benign prostate tissue. For each gene, they first computed the p-value of the test for the null hypothesis of no difference across the 2 groups. They then ranked these p-values within each study and combined the resulting ranks. Combining p-values according to their ranks is consistent in this case with combining t-statistics according to their ranks. Test statistics and their significance values depend both on information about the magnitude of the fold change across the 2 classes considered and on information about the precision with which this fold change can be measured in the experiment at hand. The latter reflects within-class variability for a gene, but also the study sample size. Therefore, it would not have been appropriate to combine p-values or t-statistics directly across studies (although their ranks can be combined). Because of this limitation, most of the combination approaches developed in medicine (Hedges and Olkin, 1985
; Hasselblad and McCrory, 1995
) are based on standard errors (SEs). In the genomic context, a SE for a 2-class comparison could be defined as the fold change divided by the within-class standard deviation. SEs have the added advantage of allowing for direct combination across studies. This approach is likely to use information more efficiently and also leads to a combined estimate of direct biological interpretation.
Some other instances of data combination in gene micoarray studies include Kapp and others (2006)
where the authors were able to combine data from 3 separate studies of breast cancer microarray data and then apply the resulting model to a validation set comprised of 2 similar studies. However, the 3 sets that were combined were all made on the same microarray platform using the same RNA preparation protocol, indexed by clone ID. Conlon and others (2006)
implement a Bayesian model that accounts for variability across studies, but still does not formally address the issue that there are quite possibly large discrepancies in what the genes are measuring so that a large effect size may be a combination of probes measuring different expression levels but agreeing by chance. Shen and others (2004)
transform microarray data to a common probability metric (the poe scale, Parmigiani and others, 2002
) and then combine data from individual studies to produce one large dataset. However, unlike our approach, this paper does not specifically address the situation of mismeasurement and lack of reproducibility across studies. Hayes and others (2006)
use the IGCs approach we address in this manuscript to combine results from 3 cohorts of patients.
Using the approaches that we have developed for comparing the 3 studies that include lymph node invasion information, we have been able to show that, in general for the reliable genes, there was decent agreeement in terms of magnitude and direction of the associations between genes and nodal status. The results seen for the unreliable genes are also sensible: they have a correlation close to zero, suggesting that IGC identifies a certain number of transcripts that are substitutes for each other and happen to be annotated to the same features. An example of this is isoforms of the same genes. Had these genes been left in the analysis, we would have dampened the estimated association between effect sizes in the 2 studies. We chose to use the effect size, defined as the difference in mean expression between the node-positive and node-negative samples, standardized by the standard deviation in the node-negative samples. However, there are many possible choices for statistics of interest to be compared. SAM statistics can be compared, although the magnitude across studies might vary significantly due to differences in sample sizes.
The R library MergeMaid has been developed for performing our proposed analyses, including matching datasets, assessing reliability of genes, and performing comparative analyses across datasets. It can be accessed freely from http://astor.som.jhmi.edu/MergeMaid. These tools can be applied generally to datasets from any gene expression platforms for which there exist a set of overlapping genes.
| FUNDING |
|---|
|
|
|---|
National Cancer Institute (CA88843); National Science Foundation (034211).
| ACKNOWLEDGMENTS |
|---|
Conflict of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Aach J, Rindone W, Church GM. Systematic management and analysis of yeast gene expression data. Genome Research (2000) 10:431–445.
Barczak A, Rodriguez M, Hanspers K, Koth L, Tai Y, Bolstad B, Speed T, Erle D. Spotted long oligonucleotide arrays for human gene expression analysis. Genome Research (2003) 13:1775–1785.
Boguski MS, Schuler GD. Establishing a human transcript map. Nature Genetics (1995) 10:369–371.[CrossRef][Web of Science][Medline]
Bullinger L, Dohner K, Bair E, Frohling S, Schlenk R, Tibshirani R, Dohner H, Pollack J. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. New England Journal of Medicine (2004) 350:1605–1615.
Carter CL, Allen C, Henson DE. Relation of tumor size, lymph node status, and survival in 24,740 breast cancer cases. Cancer (1989) 63:181–187.[CrossRef][Web of Science][Medline]
Conlon E, Song J, Liu J. Bayesian models for pooling microarray studies with multiple sources of replication. BMC Bioinformatics (2006) 7:247.[CrossRef][Medline]
Cook DJ, Sackett DL, Spitzer WO. Methodologic guidelines for systematic reviews of randomized control trials in health care from the Potsdam consultation onmeta-analysis. Journal of Clinical Epidemiology (1995) 48:167–171.[CrossRef][Web of Science][Medline]
Cope L, Garrett-Mayer E, Gabrielson E, Parmigiani G. The integrative correlation coefficient: a measure of cross-study reproducibility for gene expression array data. In: Working Paper 146 (2007) Department of Biostatistics: Johns Hopkins University. http://www.bepress.com/jhubiostat/paper146/.
Cope L, Zhong X, Garrett-Mayer ES, Parmigiani G. Mergemaid: R tools for merging and cross-study validation of gene expression data. Statistical Applications in Genetics and Molecular Biology (2004) 3. Article 27.
Coutelle C, Hohn B, Benesova M, Oneta CM, Quattrochi P, Roth HJ, Schmidt-Gayk H, Schneeweiss A, Bastert G, Seitz HK. Risk factors in alcohol associated breast cancer: alcohol dehydrogenase polymorphism and estrogens. International Journal of Oncology (2004) 25:1127–1132.[Web of Science][Medline]
Culhane A, Perriere G, Higgins D. Cross-platform comparison and visualization of gene expression data using co-inertia analysis. BMC Bioinformatics (2003) 4:1–15.[Medline]
Dumitrescu RG, Cotarla I. Understanding breast cancer risk–where do we stand in 2005? Journal of Cellular and Molecular Medicine (2005) 9:208–211.[Web of Science][Medline]
Elzagheid A, Kuopio T, Pyrhonen S, Collan Y. Lymph node status as a guide to selection of available prognostic markers in breast cancer: the clinical practice of the future? Diagnostic Pathology (2006) 1:41.[CrossRef][Medline]
Enard W, Khaitovich P, Klose J, Zöllner S, Heissig F, Giavalisco P, Nieselt-Struwe K, Muchmore E, Varki A, Ravid R, et al. Intra- and interspecific variation in primate gene expression patterns. Science (2002) 296:340–343.
Ghosh D, Barette TR, Rhodes D, Chinnaiyan AM. Statistical issues and methods for meta-analysis of microarray data: a case study in prostate cancer. Functional and Integrative Genomics (2003) 3:180–184.[CrossRef]
Grigoryev D, Ma S, Irizarry R, Ye S, Quackenbush J, Garcia J. Orthologous gene-expression profiling in multi-species models: search for candidate genes. Genome Biology (2004) 5:R34.[CrossRef][Medline]
Hardiman G. Microarray technologies—an overview. Pharmacogenomics (2002) 3:293–297.[CrossRef][Medline]
Hasselblad V, McCrory DC. Meta-analytic tools for medical decision making: a practical guide. Medical Decision Making (1995) 15:81–96.
Hayes DN, Monti S, Parmigiani G, Naoki K, Bhattacharjee A, Socinski MA, Perou C, Meyerson ML. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. Journal of Clinical Oncology (2006) 24:5079–5090.
Hedges L, Olkin I. Statistical Methods for Meta-Analysis (1985) New York: Academic Press.
Huang E, Cheng S, Dressman H, Pittman J, Tsou M, Horng C, Bild A, Iversen E, Liao M, Chen C, et al. Gene expression predictors of breast cancer outcomes. The Lancet (2003) 361:1590–1596.
Jiang H, Deng Y, Chen H-S, Tao L, Zhe Q, Chen J, Tsai C, Zhang S. Joint analysis of two microarray gene-expression data sets to select lung cancer adenocarcinoma genes. BMC Bioinformatics (2004) 5:1–12.
Kapp A, Jeffrey S, Langerod A, Borrensen-Dale A-L, Han W, Noh D-Y, Bukholm I, Nicolau M, Brown P, Tibshirani R. Discovery and validation of breast cancer subtypes. BMC Genomics (2006) 7:231.[CrossRef][Medline]
Kothapalli R, Yoder S, Mane S, Loughran T. Microarray results: how accurate are they? BMC Bioinformatics (2002) 3:1–10.[Medline]
Kuo W, Jenssen T, Butte A, Ohno-Machado L, Kohane I. Analysis of matched mrna measurements from two different microarray technologies. Bioinformatics (2002) 18:405–412.
Mecham B, Klus G, Strovel J, Augustus M, Byrne D, Bozso P, Wetmore D, Mariani T, Kohane I, Szallasi Z. Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements. Nuceic Acids Research (2004) 32:1–8.[CrossRef]
Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA. A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base. Genome Research (1999) 9:1143–1155.
Moreau Y, Aerts S, De Moor B, De Strooper B, Dabrowski M. Comparison and meta-analysis of microarray data: from the bench to the computer desk. Trends in Genetics (2003) 19:570–577.[CrossRef][Web of Science][Medline]
Morris J, Baggerly K, Wu C, Zhang L. Pooling information across different studies and oligonucleotide microarray chip types to identify prognostic genes for lung cancer. Methods in Microarray Data Analysis (2004) IV (in press).
National Center for Biotechnology Information. NCBI UniGene (2003) http://www.ncbi.nlm.nih.gov/UniGene.
Normand SL. Meta-analysis: formulating, evaluating, combining, and reporting. Statistics in Medicine (1999) 18:321–359.[CrossRef][Web of Science][Medline]
Oleksiak MF, Churchill GA, Crawford DL. Variation in gene expression within and among natural populations. Nature Genetics (2002) 32:261–266.[CrossRef][Web of Science][Medline]
Parmigiani G, Garrett E, Anbazhagan R, Gabrielson E. A statistical framework for expression-based molecular classification in cancer. Journal of the Royal Statistical Society, Series B [with discussion] (2002) 64:717–736.[CrossRef]
Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E. Cross-study comparison of gene expression data sets for the molecular classification of lung cancer. Clinical Cancer Research (2004) 10:2922–2927.
Pritchard CC, Hsu L, Delrow J, Nelson PS. Project normal: defining normal variance in mouse gene expression. Proceedings of the National Academy of Science (2001) 98:13266–13271.
Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research (2005) 33:D501–D504.
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J. Tigr gene indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Research (2001) 29:159–164.
Rhodes D, Barette T, Rubin M, Ghosh D, Chinnaiyan A. Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research (2002) 62:4427–4433.
Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, van de Rijn M, Waltham M, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics (2000) 24:227–235.[CrossRef][Web of Science][Medline]
Schena M. Microarray Biochip Technology (2000) Westborough, MA: BioTechniques Press.
Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT. A gene expression database for the molecular pharmacology of cancer. Nature Genetics (2000) 24:236–244.[CrossRef][Web of Science][Medline]
Shedden K, Taylor J. Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas. In: Methods of Microarray Data Analysis IV—Shoemaker Lin, ed. (2005) US: Springer. 121–131.
Shen R, Ghosh D, Chinnaiyan A. Prognostic meta-signature of breast cancer developed by two-stage mixture modeling of microarray data. BMC Genomics (2004) 5:94.[CrossRef][Medline]
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin (1979) 86:420–427.[CrossRef][Web of Science][Medline]
Soerjomataram I, Louwman MW, Ribot JG, Roukema JA, Coebergh JW. An overview of prognostic factors for long-term survivors of breast cancer. Breast Cancer Research and Treatment (2007) DOI: 10.1007/S10549-007-9556-1.
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al. Repeated observation of breast tumor subtypes in independent gene expression datasets. Proceedings of the National Academy of Science (2003) 100:8418–8423.
Southern EM. DNA microarrays. History and overview. Methods in Molecular Biology (2001) 170:1–15.[Medline]
Thierry-Mieg D, Thierry-Mieg J. AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biology (2006) 7(Suppl. 1). S12.
Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences (2001) 98:5116–5121.
West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR. Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences (2001) 98:11462–11467.
Yuen T, Wurmbach E, Pfeffer R, Ebersole B, Sealfon S. Accuracy and calibration of commercial oligonucleotide and custom cdna microarrays. Nucleic Acids Research (2002) 30:1–9.
Received December 22, 2007; revised May 22, 2007; revised July 16, 2007; accepted for publication August 7, 2007.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||









