Biostatistics Advance Access originally published online on May 15, 2006
Biostatistics 2007 8(1):2-8; doi:10.1093/biostatistics/kxl005
Outlier sums for differential gene expression analysis
Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford, CA 94305, USA tibs{at}stat.stanford.edu
Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford, CA 94305, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).
Keywords: Cancer; COPA; Gene expression analysis; Microarray
| 1. INTRODUCTION |
|---|
We consider methods for detecting differentially expressed genes in a set of microarray experiments. We consider the simple case of m genes measured across two experimental conditions. A number of authors have proposed methods for detecting differential gene expression; see, for example, Dudoit and others (2002).
One widely used approach to this problem is as follows. We compute a two-sample t-statistic $T_i$ for each gene, and then call a gene significant if $|T_i|$ exceeds some threshold c. Various values of c are tried, using permutations of the sample labels to estimate the false discovery rate (FDR) of the procedure for each c. A threshold c is finally chosen based on the FDR estimates and other considerations, such as the approximate number of significant genes that is desirable. This recipe roughly describes the strategy used, for example, in the significance analysis of microarrays (SAM) procedure (Tusher and others, 2001). The SAM procedure can be applied to other test statistics for a wide variety of data types, such as paired, censored, or time-course data.
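The thresholding recipe above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SAM procedure itself: the names `t_stat` and `estimate_fdr`, the default number of permutations, and the plain ratio of mean permutation calls to observed calls used as the FDR estimate are all choices made for this example.

```python
import random
import statistics

def t_stat(x, labels):
    """Usual pooled-variance two-sample t-statistic for one gene.

    x      : expression values for all samples
    labels : group label (1 or 2) for each sample
    """
    g1 = [v for v, g in zip(x, labels) if g == 1]
    g2 = [v for v, g in zip(x, labels) if g == 2]
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * statistics.variance(g1)
                  + (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    se = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (statistics.mean(g2) - statistics.mean(g1)) / se

def estimate_fdr(data, labels, c, n_perm=100, seed=0):
    """Return (number of genes called, estimated FDR) at threshold c.

    data is a list of genes, each a list of sample values.
    Assumption: FDR is estimated as the mean number of genes called
    under permuted labels divided by the number called with the real
    labels, capped at 1.
    """
    rng = random.Random(seed)
    called = sum(abs(t_stat(x, labels)) > c for x in data)
    null_counts = []
    for _ in range(n_perm):
        perm = list(labels)
        rng.shuffle(perm)  # break any link between labels and expression
        null_counts.append(sum(abs(t_stat(x, perm)) > c for x in data))
    expected_false = statistics.mean(null_counts)
    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, fdr
```

Calling `estimate_fdr` over a grid of thresholds c gives the FDR curve from which a threshold would be chosen.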
In a study of mutations in prostate cancer, Tomlins and others (2005) introduced a method called "cancer profile outlier analysis" (COPA) for detecting what they call "oncogene outliers." These are genes which show a systematic increase in expression, but only for a small number of cancer samples. They show that COPA can be more powerful than the usual t-statistic in these cases. In related work, Lyons-Weiler and others (2004) proposed the permutation percentile separability test with a similar objective: to find genes that are overexpressed only in a subset of cases.
The COPA work inspired us to study this problem and look for better ways of detecting changes that occur in a small number of samples. We introduce the "outlier-sum" statistic, and compare it to both the t-statistic and COPA in a number of examples.
| 2. THE OUTLIER-SUM STATISTIC |
|---|
Let $x_{ij}$ be the expression values for genes $i = 1, 2, \ldots, m$ and samples $j = 1, 2, \ldots, n$. We assume that the samples fall into two groups. We think of group 1 as a normal or reference group, while group 2 is a disease group. Let $C_k$ be the set of indices of the observations in group k, for $k = 1, 2$. The standard (unpaired) t-statistic is
$$T_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i}. \tag{2.1}$$
Here $\bar{x}_{ik}$ is the mean of gene i in group k and $s_i$ is the pooled within-group standard deviation of gene i.
As an alternative, we define the outlier-sum statistic as follows. Let $\mathrm{med}_i$ and $\mathrm{mad}_i$ be the median and median absolute deviation of the values for gene i. We first standardize each gene:
$$x'_{ij} = \frac{x_{ij} - \mathrm{med}_i}{\mathrm{mad}_i}. \tag{2.2}$$
This standardization puts all genes on the same scale to facilitate comparisons across genes.
Let $q_r(i)$ be the rth percentile of the $x'_{ij}$ values for gene i, and $\mathrm{IQR}(i) = q_{75}(i) - q_{25}(i)$, the interquartile range. Finally, note that values greater than the limit $q_{75}(i) + \mathrm{IQR}(i)$ are defined to be outliers in the usual statistical sense.
The outlier-sum statistic is defined to be the sum of the values in the disease group that are beyond this limit:
$$W_i = \sum_{j \in C_2} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(i) + \mathrm{IQR}(i)\right]. \tag{2.3}$$
Hence, $W_i$ is large if there are many outliers in the disease group, or a few outliers with large values. If there are no outliers, then $W_i$ is zero.
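The steps leading to (2.3) can be sketched for a single gene as follows. This is a minimal sketch under assumptions that the text does not fix: the median absolute deviation is left unscaled, percentiles use linear interpolation between order statistics, and the function names `percentile` and `outlier_sum` are invented for the example.

```python
import statistics

def percentile(values, r):
    """r-th percentile (0 to 100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def outlier_sum(x, in_disease):
    """Outlier-sum statistic for one gene.

    x          : expression values for all n samples
    in_disease : booleans, True where the sample belongs to group 2
    Assumes the MAD is nonzero (true for continuous data).
    """
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])  # unscaled MAD (assumption)
    z = [(v - med) / mad for v in x]                     # standardization (2.2)
    q25, q75 = percentile(z, 25), percentile(z, 75)
    limit = q75 + (q75 - q25)                            # q75 + IQR
    # Sum the standardized disease-group values beyond the limit (2.3).
    return sum(v for v, d in zip(z, in_disease) if d and v > limit)

# One extreme value in the disease group dominates the statistic.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flags = [False] * 5 + [True] * 5
w = outlier_sum(x, flags)
```

In this toy example the median is 5.5 and the unscaled MAD is 2.5, so the value 100 standardizes to 37.8, exceeds the limit of 2.7, and is the only term in the sum.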
As an example, we generated 1000 genes and 30 samples, all values drawn independently from a standard normal distribution. Then, we added two units to gene 1 for four of the samples in the second group. We computed the P-value: the proportion of genes with score greater than that for gene 1, in absolute value. This process was repeated 50 times. The results for both the t-statistic and the outlier-sum statistic are shown in Figure 1. We see that the outlier-sum statistic yields smaller (more significant) P-values overall.
In real applications, one might expect negative as well as positive outliers. Hence, we define
$$W'_i = \sum_{j \in C_2} x'_{ij} \cdot I\left[x'_{ij} < q_{25}(i) - \mathrm{IQR}(i)\right] \tag{2.4}$$
and set the outlier sum to the larger of $W_i$ and $W'_i$ in absolute value. We call this the "two-sided outlier-sum statistic," and illustrate its use in the skin data example of Section 4.
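A two-sided version can be sketched by computing the upper and lower sums and keeping whichever is larger in absolute value. As before, this is an illustration under assumptions: the unscaled MAD, the interpolation rule for percentiles, the lower limit taken as the 25th percentile minus the interquartile range, and the name `two_sided_outlier_sum` are choices made for the example.

```python
import statistics

def percentile(values, r):
    """r-th percentile (0 to 100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def two_sided_outlier_sum(x, in_disease):
    """Two-sided outlier sum for one gene (sign is kept)."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])  # assumed nonzero
    z = [(v - med) / mad for v in x]
    q25, q75 = percentile(z, 25), percentile(z, 75)
    iqr = q75 - q25
    disease = [v for v, d in zip(z, in_disease) if d]
    w_pos = sum(v for v in disease if v > q75 + iqr)  # positive outliers
    w_neg = sum(v for v in disease if v < q25 - iqr)  # negative outliers
    return w_pos if abs(w_pos) >= abs(w_neg) else w_neg

# A single strongly negative value in the disease group.
low = two_sided_outlier_sum([-100, 1, 2, 3, 4, 5, 6, 7, 8, 9], [True] * 10)
```

Keeping the sign lets a later step distinguish genes driven by high outliers from those driven by low ones; taking `abs` of the result gives a score for ranking.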
Note that the outlier-sum statistic is not symmetric in the classes. It explicitly looks for outliers in group 2, treating group 1 as a normal reference class. If finding outliers in group 1 is also of interest, then the procedure can be applied with groups 1 and 2 interchanged.
| 3. SIMULATION STUDY AND COMPARISON TO THE COPA STATISTIC |
|---|
We carried out a small simulation study to assess the relative performance of the t-statistic, COPA, and the outlier sum. The COPA statistic is that of Tomlins and others (2005).
We generated the data in the same way as in Figure 1. There are 1000 genes and 30 samples, all values drawn from a standard normal distribution. Then we added two units to gene 1 for k of the samples in the second group. We computed the P-value: the proportion of genes with score greater than that for gene 1, in absolute value (smaller values are better). This entire process was repeated 50 times. The values of k tried were 15, 8, 4, and 2. The median, mean, and standard deviation of the P-values are shown in Table 1.
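The simulation design just described can be sketched as below for the t-statistic and the outlier sum (COPA is omitted). This is a rough illustration, not the code behind Table 1: it assumes 15 samples per group with group 1 listed first, an unscaled MAD, linear-interpolation percentiles, and hypothetical function names throughout.

```python
import random
import statistics

def percentile(values, r):
    """r-th percentile (0 to 100), linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * r / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def t_stat(x, n1):
    """Pooled two-sample t-statistic; the first n1 samples form group 1."""
    g1, g2 = x[:n1], x[n1:]
    n2 = len(g2)
    pooled = ((n1 - 1) * statistics.variance(g1)
              + (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    se = (pooled * (1 / n1 + 1 / n2)) ** 0.5
    return (statistics.fmean(g2) - statistics.fmean(g1)) / se

def outlier_sum(x, n1):
    """Outlier sum over group 2 (samples from index n1 onward)."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    z = [(v - med) / mad for v in x]
    q25, q75 = percentile(z, 25), percentile(z, 75)
    limit = q75 + (q75 - q25)
    return sum(v for v in z[n1:] if v > limit)

def p_value(scores):
    """Proportion of genes whose |score| exceeds that of gene 1 (index 0)."""
    target = abs(scores[0])
    return sum(abs(s) > target for s in scores) / len(scores)

def simulate(k, n_genes=1000, n_samples=30, shift=2.0, n_rep=50, seed=0):
    """Return per-repetition P-values (t-statistic, outlier sum) for gene 1."""
    rng = random.Random(seed)
    n1 = n_samples // 2
    p_t, p_os = [], []
    for _ in range(n_rep):
        data = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
                for _ in range(n_genes)]
        for j in range(n1, n1 + k):  # shift k samples of group 2 for gene 1
            data[0][j] += shift
        p_t.append(p_value([t_stat(x, n1) for x in data]))
        p_os.append(p_value([outlier_sum(x, n1) for x in data]))
    return p_t, p_os

def summarize(p_values):
    """Median, mean, and standard deviation of a list of P-values."""
    return (statistics.median(p_values),
            statistics.fmean(p_values),
            statistics.stdev(p_values))
```

Calling `simulate(k)` for k in (15, 8, 4, 2) and passing each returned list to `summarize` produces the same kind of summary as Table 1, although the exact numbers depend on the seed and on the conventions assumed here.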
When $k = 15$, so that all samples in group 2 are differentially expressed, the t-statistic performs the best. It continues to win when $k = 8$. But for smaller values of k, the outlier-sum statistic yields lower P-values, and has smaller standard deviation. The COPA statistic has consistently higher P-values than the outlier sum. Table 2 shows the results when each method uses only the 15 samples from the disease class. Hence, the t-statistic in Table 2 refers to the one-sample t-statistic, and similarly for COPA and outlier sum. The outlier sum performs a little worse than it did in Table 1, but still offers some improvement over the t-statistic when the number of outliers is small. The performance of the COPA statistic is noticeably worse in the one-sample case.
These experiments suggest that the outlier-sum statistic can provide a useful alternative to the t-statistic. With real data, one can estimate FDRs of both procedures to get an idea of which procedure is most informative for the data at hand. We illustrate this in Section 4.
| 4. APPLICATION TO THE SKIN DATA VIA THE SAM METHOD |
|---|
In this example taken from Rieger and others (2004)
Figure 2 shows the outlier-sum statistic plotted against the t-statistic. These two scores are correlated but still differ substantially.
Figure 3 shows the FDR versus the number of genes called significant, as the corresponding thresholds for each statistic are varied. We see that the outlier-sum statistic has lower FDR near the right of the plot, although the FDR there may be too high for it to be useful in practice. Figure 4 shows the top 12 genes called by the outlier-sum statistic, plotted by group. The number in brackets is the rank given to that gene by the t-statistic, and we see that some of these genes are ranked quite low.
The outlier-sum statistic will appear in an upcoming version of the SAM package, available at http://www-stat.stanford.edu/~tibs/SAM.
| ACKNOWLEDGMENTS |
|---|
Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. Conflict of Interest: None declared.
| REFERENCES |
|---|
Allison D, Cui X, Page G, Sabripour M. (2006) Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics 7:55–65.
Dudoit S, Yang Y, Callow M, Speed T. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 97:111–39.
Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2:249–64.
Lyons-Weiler J, Patel S, Becich M, Godfrey T. (2004) Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 5:.
Rieger K, Hong W, Tusher V, Tang J, Tibshirani R, Chu G. (2004) Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage. Proceedings of the National Academy of Sciences of the United States of America 101:6634–40.
Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun X-W, Varambally S, Cao X, Tchinda J, Kuefer R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310:644–8.
Tusher V, Tibshirani R, Chu G. (2001) Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proceedings of the National Academy of Sciences of the United States of America 98:5116–21.
Received March 24, 2006; accepted for publication May 10, 2006.