

Biostatistics Advance Access originally published online on May 15, 2006
Biostatistics 2007 8(1):2-8; doi:10.1093/biostatistics/kxl005

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

Outlier sums for differential gene expression analysis

Robert Tibshirani*

Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford, CA 94305, USA tibs{at}stat.stanford.edu

Trevor Hastie

Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford, CA 94305, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. THE OUTLIER-SUM STATISTIC
 3. SIMULATION STUDY AND...
 4. APPLICATION TO THE...
 REFERENCES
 
We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).

Keywords: Cancer; COPA; Gene expression analysis; Microarray


    1. INTRODUCTION
We consider methods for detecting differentially expressed genes in a set of microarray experiments, focusing on the simple case of m genes measured across two experimental conditions. A number of authors have proposed methods for detecting differential gene expression; Dudoit and others (2002) and Allison and others (2006) give summaries.

One widely used approach to this problem is as follows. We compute a two-sample t-statistic $T_i$ for each gene, and then call a gene significant if $|T_i|$ exceeds some threshold c. Various values of c are tried, using permutations of the sample labels to estimate the false discovery rate (FDR) of the procedure for each c. A threshold c is finally chosen based on the estimates of FDR and other considerations, such as the approximate number of significant genes that is desirable. This recipe roughly describes the strategy used, for example, in the significance analysis of microarrays (SAM) procedure (Tusher and others, 2001). The SAM procedure can be applied to other test statistics for a wide variety of data types, such as paired, censored, or time-course data.
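The recipe above can be sketched in a few lines of NumPy. This is a simplified plug-in estimate under stated assumptions, not the SAM implementation: the function names `t_stats` and `permutation_fdr`, the label coding (1 and 2), and the default of 100 permutations are illustrative choices, and the FDR is taken as the mean number of genes called under permuted labels divided by the number called with the true labels.

```python
import numpy as np

def t_stats(x, labels):
    """Two-sample pooled-variance t-statistic for each row (gene) of x."""
    g1, g2 = x[:, labels == 1], x[:, labels == 2]
    n1, n2 = g1.shape[1], g2.shape[1]
    pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
                  + (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (g2.mean(axis=1) - g1.mean(axis=1)) / se

def permutation_fdr(x, labels, c, n_perm=100, seed=0):
    """Estimate the FDR of the rule |t| > c by permuting the sample labels.

    Returns (estimated FDR, number of genes called with the true labels).
    """
    rng = np.random.default_rng(seed)
    n_called = int(np.sum(np.abs(t_stats(x, labels)) > c))
    if n_called == 0:
        return 0.0, 0
    # Genes called under permuted labels are treated as false calls.
    false_calls = [
        np.sum(np.abs(t_stats(x, rng.permutation(labels))) > c)
        for _ in range(n_perm)
    ]
    fdr = min(1.0, float(np.mean(false_calls)) / n_called)
    return fdr, n_called
```

Calling `permutation_fdr(x, labels, c)` over a grid of thresholds c gives the kind of FDR-versus-threshold trade-off from which a final c would be chosen.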

In a study of mutations in prostate cancer, Tomlins and others (2005) introduced a method called "cancer profile outlier analysis" (COPA) for detecting what they call "oncogene outliers." These are genes that show a systematic increase in expression, but only for a small number of cancer samples. They show that COPA can be more powerful than the usual t-statistic in these cases. In related work, Lyons-Weiler and others (2004) proposed the permutation percentile separability test with a similar objective: to find genes that are overexpressed only in a subset of cases.

The COPA work inspired us to study this problem and look for better ways of detecting changes that occur in a small number of samples. We introduce the "outlier-sum" statistic, and compare it to both the t-statistic and COPA in a number of examples.


    2. THE OUTLIER-SUM STATISTIC
Let $x_{ij}$ be the expression values for genes $i = 1, 2, \ldots, m$ and samples $j = 1, 2, \ldots, n$. We assume that the samples fall into two groups. We think of group 1 as a normal or reference group, while group 2 is a disease group. Let $C_k$ be the set of indices of the observations in group k, for $k = 1, 2$. The standard (unpaired) t-statistic is

$$T_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{s_i} \qquad (2.1)$$

Here $\bar{x}_{ik}$ is the mean of gene i in group k and $s_i$ is the pooled within-group standard deviation of gene i.

As an alternative, we define the outlier-sum statistic as follows. Let $\mathrm{med}_i$ and $\mathrm{mad}_i$ be the median and median absolute deviation of the values for gene i. We first standardize each gene

$$x'_{ij} = \frac{x_{ij} - \mathrm{med}_i}{\mathrm{mad}_i} \qquad (2.2)$$

This standardization puts all genes on the same scale to facilitate comparisons across genes.

Let $q_r(i)$ be the rth percentile of the $x'_{ij}$ values for gene i, and $\mathrm{IQR}(i) = q_{75}(i) - q_{25}(i)$, the interquartile range. Finally, note that values greater than the limit $q_{75}(i) + \mathrm{IQR}(i)$ are defined to be outliers in the usual statistical sense.

The outlier-sum statistic is defined to be the sum of the values in the disease group that are beyond this limit:

$$W_i = \sum_{j \in C_2} x'_{ij} \cdot I\left[x'_{ij} > q_{75}(i) + \mathrm{IQR}(i)\right] \qquad (2.3)$$

Hence, $W_i$ is large if there are many outliers in the disease group, or a few outliers with large values. If there are no outliers, then $W_i$ is zero.
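A minimal NumPy sketch of the standardization (2.2) and the outlier sum (2.3) follows. It rests on several assumptions that are not stated in the text: the function name `outlier_sum` and the boolean mask `in_group2` are illustrative, the median absolute deviation is used without a consistency constant, percentiles use NumPy's default linear interpolation, and every gene is assumed to have a nonzero median absolute deviation.

```python
import numpy as np

def outlier_sum(x, in_group2):
    """One-sided outlier-sum statistic for each row (gene) of x.

    x         : array of shape (genes, samples)
    in_group2 : boolean array of length samples, True for disease samples
    """
    # Standardize each gene by its overall median and MAD, as in (2.2).
    med = np.median(x, axis=1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=1, keepdims=True)
    z = (x - med) / mad
    # Outlier limit per gene: 75th percentile plus the interquartile range.
    q25, q75 = np.percentile(z, [25, 75], axis=1)
    limit = (q75 + (q75 - q25))[:, None]
    # Sum the standardized disease-group values beyond the limit, as in (2.3).
    z2 = z[:, in_group2]
    return np.sum(np.where(z2 > limit, z2, 0.0), axis=1)
```

Because the limit is computed on the standardized scale, rescaling the median absolute deviation by a constant would not change which values count as outliers; it would only multiply every gene's statistic by the same factor.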

As an example, we generated 1000 genes and 30 samples, all values drawn independently from a standard normal distribution. Then, we added two units to gene 1 for four of the samples in the second group. We computed the P-value: the proportion of genes with score greater than that for gene 1, in absolute value. This process was repeated 50 times. The results for both the t-statistic and the outlier-sum statistic are shown in Figure 1. We see that the outlier-sum statistic yields smaller (more significant) P-values overall.


Fig. 1 Simulated example: P-values for gene 1 over 50 simulations.

 
In real applications, one might expect negative as well as positive outliers. Hence, we define

$$W'_i = \sum_{j \in C_2} x'_{ij} \cdot I\left[x'_{ij} < q_{25}(i) - \mathrm{IQR}(i)\right] \qquad (2.4)$$

and set the outlier sum to the larger of $W_i$ and $W'_i$ in absolute value. We call this the "two-sided outlier-sum statistic," and illustrate its use in the skin data example of Section 4.
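The two-sided version might be sketched as below, combining the sums in (2.3) and (2.4). As before, this is an illustration under assumptions rather than the authors' code: the name `two_sided_outlier_sum` and the mask `in_group2` are hypothetical, the median absolute deviation carries no consistency constant, and ties in absolute value are resolved in favor of the positive sum.

```python
import numpy as np

def two_sided_outlier_sum(x, in_group2):
    """Two-sided outlier sum per gene: whichever of the positive and
    negative outlier sums is larger in absolute value (sign retained)."""
    med = np.median(x, axis=1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=1, keepdims=True)
    z = (x - med) / mad
    q25, q75 = np.percentile(z, [25, 75], axis=1)
    iqr = q75 - q25
    upper = (q75 + iqr)[:, None]   # limit for positive outliers
    lower = (q25 - iqr)[:, None]   # limit for negative outliers
    z2 = z[:, in_group2]
    w_pos = np.sum(np.where(z2 > upper, z2, 0.0), axis=1)
    w_neg = np.sum(np.where(z2 < lower, z2, 0.0), axis=1)
    return np.where(np.abs(w_neg) > np.abs(w_pos), w_neg, w_pos)
```

Keeping the sign makes it possible to tell afterwards whether a gene was flagged for unusually high or unusually low expression in the disease group.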

Note that the outlier-sum statistic is not symmetric in the classes. It explicitly looks for outliers in group 2, treating group 1 as a normal reference class. If finding outliers in group 1 is also of interest, then the procedure can be applied with groups 1 and 2 interchanged.


    3. SIMULATION STUDY AND COMPARISON TO THE COPA STATISTIC
We carried out a small simulation study to assess the relative performance of the t-statistic, COPA, and the outlier sum. The COPA statistic (Tomlins and others, 2005) is defined as follows. All measurements for a gene are standardized by the overall median and median absolute deviation for that gene. Then the COPA statistic is the rth quantile of the standardized data in the disease group. The authors use r = 0.75, 0.90, or 0.95. In our comparison below, we use the intermediate value of 0.90. In their paper, the authors apply the procedure to data from just the disease group. However, the standardization in the first step could also make use of a normal group, if available. We try both approaches in the simulation studies below.
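The COPA statistic as described in this paragraph can be sketched as follows. The function name `copa` and the mask `in_group2` are illustrative, the median absolute deviation is taken without a consistency constant, and the quantile uses NumPy's default linear interpolation; the original COPA software may differ in these details.

```python
import numpy as np

def copa(x, in_group2, r=0.90):
    """COPA-style statistic: r-th quantile of the disease-group values after
    standardizing each gene by its overall median and MAD."""
    med = np.median(x, axis=1, keepdims=True)
    mad = np.median(np.abs(x - med), axis=1, keepdims=True)
    z = (x - med) / mad
    return np.quantile(z[:, in_group2], r, axis=1)
```

For the disease-only variant, one would pass just the disease-group columns as `x` together with an all-True mask, so that the standardization uses no reference samples.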

We generated the data in the same way as in Figure 1. There are 1000 genes and 30 samples, all values drawn from a standard normal distribution. Then we added two units to gene 1 for k of the samples in the second group. We computed the P-value: the proportion of genes with score greater than that for gene 1, in absolute value (smaller values are better). This entire process was repeated 50 times. The values of k tried were 15, 8, 4, and 2. The median, mean, and standard deviation of the P-values are shown in Table 1.
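The simulation design just described might be set up as in the sketch below. The helper names (`simulate`, `rank_p_value`, `two_sample_t`, `run_study`), the choice of the first 15 samples as group 1, and the seed are assumptions made for illustration; only the t-statistic is scored here, and any other per-gene score function with the same signature could be substituted.

```python
import numpy as np

def simulate(k, n_genes=1000, n_samples=30, shift=2.0, rng=None):
    """Standard normal data; gene 0 gets +shift in k of the group-2 samples."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((n_genes, n_samples))
    in_group2 = np.arange(n_samples) >= n_samples // 2
    shifted = np.flatnonzero(in_group2)[:k]
    x[0, shifted] += shift
    return x, in_group2

def rank_p_value(scores, gene=0):
    """Proportion of genes whose |score| exceeds that of the given gene."""
    a = np.abs(scores)
    return float(np.mean(a > a[gene]))

def two_sample_t(x, in_group2):
    """Pooled-variance two-sample t-statistic per gene."""
    g1, g2 = x[:, ~in_group2], x[:, in_group2]
    n1, n2 = g1.shape[1], g2.shape[1]
    pooled_var = ((n1 - 1) * g1.var(axis=1, ddof=1)
                  + (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (g2.mean(axis=1) - g1.mean(axis=1)) / np.sqrt(
        pooled_var * (1.0 / n1 + 1.0 / n2))

def run_study(score_fn=two_sample_t, ks=(15, 8, 4, 2), n_rep=50, seed=0):
    """Median, mean, and standard deviation of the P-value for each k."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in ks:
        p = [rank_p_value(score_fn(*simulate(k, rng=rng)))
             for _ in range(n_rep)]
        results[k] = (np.median(p), np.mean(p), np.std(p))
    return results
```

Passing a different `score_fn` to `run_study` would give the corresponding summary for another statistic on freshly simulated data.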


 
Table 1 Results of simulation study: median, mean, and standard deviation of P-values for gene 1, over 50 simulations

 
When $k = 15$, so that all samples in group 2 are differentially expressed, the t-statistic performs the best. It continues to win when $k = 8$. But for smaller values of k, the outlier-sum statistic yields lower P-values, and has smaller standard deviation. The COPA statistic has consistently higher P-values than the outlier sum.

Table 2 shows the results when each method uses only the 15 samples from the disease class. Hence, the t-statistic in Table 2 refers to the one-sample t-statistic, and similarly for COPA and outlier sum. The outlier sum performs a little worse than it did in Table 1, but still offers some improvement over the t-statistic when the number of outliers is small. The performance of the COPA statistic is noticeably worse in the one-sample case.


 
Table 2 Results of simulation study, one sample setting: median, mean, and standard deviation of P-values for gene 1, over 50 simulations

 
These experiments suggest that the outlier-sum statistic can provide a useful alternative to the t-statistic. With real data, one can estimate FDRs of both procedures to get an idea of which procedure is most informative for the data at hand. We illustrate this in Section 4.


    4. APPLICATION TO THE SKIN DATA VIA THE SAM METHOD
In this example taken from Rieger and others (2004), there are 12 625 genes and 58 cancer patients: 14 with radiation sensitivity and 44 without radiation sensitivity. We applied the outlier-sum statistic within the SAM (significance analysis of microarrays) approach (Tusher and others, 2001), using the group of 44 as the normal class. SAM estimates differential expression using the two-sample t-statistic and estimates FDRs via permutations of the class labels. Here we compare the t-statistic to the outlier-sum statistic in SAM. Since the data are from Affymetrix chips and have a wide range of expression, we first took cube roots. However, in practice a more careful preprocessing should be used, such as that provided by robust multi-chip analysis (Irizarry and others, 2003).

Figure 2 shows the outlier-sum statistic plotted against the t-statistic. These two scores are correlated but still differ substantially.


Fig. 2 Skin data: outlier-sum statistic versus the t-statistic. Note that many genes have outlier sums of zero.

 
Figure 3 shows the FDR versus the number of genes called significant, as the corresponding thresholds for each statistic are varied. We see that the outlier-sum statistic has lower FDR near the right of the plot, although the FDR there may be too high for the procedure to be useful in practice. Figure 4 shows the top 12 genes called by the outlier-sum statistic, plotted by group. The number in brackets is the rank given to that gene by the t-statistic, and we see that some of these genes are ranked quite low.
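A curve of this kind could be traced from precomputed scores roughly as follows. This is a simplified plug-in sketch, not the SAM procedure itself: the function name `fdr_curve` and the layout of `null_scores` (one row of per-gene scores for each permutation of the class labels) are assumptions, and the loop over thresholds is written for clarity rather than speed.

```python
import numpy as np

def fdr_curve(observed, null_scores):
    """Number of genes called and estimated FDR at each distinct threshold.

    observed    : array of shape (genes,), scores under the true labels
    null_scores : array of shape (n_perm, genes), scores under permuted labels
    """
    obs = np.abs(observed)
    null = np.abs(null_scores)
    # Distinct thresholds, largest first; ties (e.g. many zero outlier sums)
    # are handled by counting every gene at or above the threshold.
    thresholds = np.unique(obs)[::-1]
    n_called = np.array([np.sum(obs >= c) for c in thresholds])
    # Average count of permuted scores at or above each threshold.
    false_calls = np.array([np.mean(np.sum(null >= c, axis=1))
                            for c in thresholds])
    return n_called, np.minimum(1.0, false_calls / n_called)
```

Plotting the second returned array against the first, once for each statistic, gives a comparison in the style of Figure 3.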


Fig. 3 Skin data: FDR versus the number of genes called significant, as the corresponding thresholds for each statistic are varied.

 

Fig. 4 Plots of the expression values in each class, for the 12 genes ranked highest by the outlier-sum statistic. The points have been "jittered" in the vertical direction, for clearer viewing. The number in brackets is the rank given to that gene by the t-statistic. The red points are identified as positive outliers; the green points are negative outliers.

 
The outlier-sum statistic will appear in an upcoming version of the SAM package, available at http://www-stat.stanford.edu/~tibs/SAM.


    ACKNOWLEDGMENTS
 
Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. Conflict of Interest: None declared.


    REFERENCES

    Allison D, Cui X, Page G, Sabripour M. (2006) Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics 7:55–65.

    Dudoit S, Yang Y, Callow M, Speed T. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12:111–39.

    Irizarry R, Hobbs B, Collin F, Beazer-Barclay Y, Antonellis K, Scherf U, Speed T. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249–64.

    Lyons-Weiler J, Patel S, Becich M, Godfrey T. (2004) Tests for finding complex patterns of differential expression in cancers: towards individualized medicine. BMC Bioinformatics 5:.

    Rieger K, Hong W, Tusher V, Tang J, Tibshirani R, Chu G. (2004) Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage. Proceedings of the National Academy of Sciences of the United States of America 101:6634–40.

    Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun X-W, Varambally S, Cao X, Tchinda J, Kuefer R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310:644–8.

    Tusher V, Tibshirani R, Chu G. (2001) Significance analysis of microarrays applied to transcriptional responses to ionizing radiation. Proceedings of the National Academy of Sciences of the United States of America 98:5116–21.

    Received March 24, 2006; accepted for publication May 10, 2006.




