Skip Navigation


Biostatistics Advance Access originally published online on August 24, 2006
Biostatistics 2007 8(2):433-437; doi:10.1093/biostatistics/kxl020
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
8/2/433    most recent
kxl020v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Purutçuoglu, V.
Right arrow Articles by Wit, E.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Purutçuoglu, V.
Right arrow Articles by Wit, E.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

FGX: a frequentist gene expression index for Affymetrix arrays

Vilda Purutçuoglu and Ernst Wit*

Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK e.wit{at}lancaster.ac.uk

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MODEL FORMULATION AND...
 3. APPLICATION
 REFERENCES
 
We consider a new frequentist gene expression index for Affymetrix oligonucleotide DNA arrays, using a similar probe intensity model as suggested by Hein and others (2005), called the Bayesian gene expression index (BGX). According to this model, the perfect match and mismatch values are assumed to be correlated as a result of sharing a common gene expression signal. Rather than a Bayesian approach, we develop a maximum likelihood algorithm for estimating the underlying common signal. In this way, estimation is explicit and much faster than the BGX implementation. The observed Fisher information matrix, rather than a posterior credibility interval, gives an idea of the accuracy of the estimators. We evaluate our method using benchmark spike-in data sets from Affymetrix and GeneLogic by analyzing the relationship between estimated signal and concentration, i.e. true signal, and compare our results with other commonly used methods.

Keywords: Affymetrix; Fisher information matrix; GeneChip; Gene expression; Maximum likelihood; Probe-level analysis; Spike-in data sets


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MODEL FORMULATION AND...
 3. APPLICATION
 REFERENCES
 
High-dimensional Affymetrix oligonucleotide DNA arrays are widely used in biomedical research. Each oligonucleotide probe consists of a small string of DNA 25 base pairs specific for a gene or expressed sequence tag and is immobilized on a glass slide or array. To measure the amount of transcribed RNA, the gene targets are labeled with a dye and hybridized to the probes on the array. Each gene is defined by a set of 11 to 20 "probe pairs," coming from different parts of the gene's DNA sequence. There are two components to each probe pair: the "perfect match" (PM) measures the amount of transcribed perfectly matched target complementary to the mRNA of a specific gene, whereas the "mismatch" (MM) is supposed to measure the amount of nonspecific binding of the target by changing the 13th base pair of the probe. In order to denote the estimated gene expression level, the term "gene expression index" is widely used. There are several methods that are often used for estimating gene expression levels: MAS 5.0 (Hubbell and others, 2002Go), MBEI (dChip) (Li and Wong, 2001Go), RMA (Irizarry and others, 2003Go), GC-RMA (Wu and others, 2004Go), BGX (Hein and others, 2005), mgMOS (Liu and others, 2005Go), (Milo and others, 2003Go), and multi-mgMOS (Liu and others, 2005Go).

The GC-RMA method, an extension of RMA, is the first method to consider the idea that besides nonspecific hybridization, MM values also contain some information about the true signal S whose contamination is a fraction p. However, in practice the calculations are simplified by assuming that p is zero. The BGX method actually estimates p from the data. For estimation, a Bayesian hierarchical setup in conjunction with an Markov chain Monte Carlo approach is used. The Bayesian gene expression index is computed as the median of the posterior signal distribution, although Bayesian credibility intervals are also available.

In this paper, we consider the application of a maximum likelihood (ML) alternative to estimate the true signals under a PM and MM intensity model that is similar to BGX. In this way, we intend to reduce computational cost considerably and to maintain the efficiency of the estimators. The website http://www.maths.lancs.ac.uk/~wite/research contains both the R-code for frequentist gene expression (FGX) function as well as other supplementary material.


    2. MODEL FORMULATION AND INFERENCE
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MODEL FORMULATION AND...
 3. APPLICATION
 REFERENCES
 
A model that induces a relationship between PM and MM probes is one where both PM and MM share part of a common signal S as well as a large nonspecific hybridization component H as an offset term. Assuming lognormality for the probe intensities deals largely with the full extent of the variance heterogeneity across the intensity range, logPMij~N(Si + µH,{sigma}2) and logMMij~N(pSi + µH,{sigma}2), where j = 1,...,m is the probe indicator, Si is the true expression value for gene i = 1,...,n, p is the fraction of "specific" hybridization of the MM probe, and µH is the mean of the nonspecific hybridization random effect. The constant variance term {sigma}2 = {sigma}{varepsilon}2 + {sigma}Formula is the sum of a measurement error term {sigma}{varepsilon}2 and a nonspecific hybridization term {sigma}Formula. Since the variance components cannot be identified separately, we estimate a combined {sigma}2 term.

As averages of the log-transformed PM and MM probes are sufficient statistics for their associated underlying means and because analysis of Affymetrix data typically takes place on a probe set level, rather than an individual probe level, we consider that the available data typically consist of PMi: = {sum}FormulalogPMij/m and MMi: = {sum}FormulalogMMij/m, such that PMi~N(Si + µH,{sigma}2/m) and MMi~N(pSi + µH,{sigma}2/m). The aim is to obtain estimates for the parameters p, Si, and µH by using ML. The loglikelihood l conditional on PM = (PM1,...,PMn) and MM = (MM1,...,MMn) is given as

Formula

The maximum likelihood estimators (MLEs) of the unknown parameters are solutions of the partial derivatives of l equated to zero and of the parameters µH and S are the explicit functions of the intensities and the MLE of p,

Formula

where Formula and Formula. In order to obtain the MLE of p, we substitute Formula and Formula into {partial}l/{partial}p = – m/(2{sigma}2){sum}Formula[ – 2Si(MMipSi µH)], which gives a fourth-order polynomial equation which can be written as

Formula

where Formula, Formula, and Formula. There are three solutions for this equation. SSPM,MM > 0, i.e. a positive correlation between all the PM and MM signals, the maximum of l is found at

Formula

where Formula, Formula, and Formula. And in that case, the MLE of p is given as Formula. Note that if SSPM,MM ≤ 0, then there is no evidence in the data that the MM probes contain any information about the underlying signal. In that case, the estimate of p should be set to Formula.

We note that for estimation of the variance terms, the probe means, PMi and MMi, are not sufficient. The loss of information can be regained by reconstructing the likelihood function in terms of all data after estimating the parameters Si, p, and µH and to calculate an MLE for {sigma}2 conditional on the estimates Formula, Formula, and Formula, giving Formula.

Asymptotically, MLEs are fully efficient, i.e. they are unbiased and have minimum variance bounds. For a finite number of samples, the covariance matrix of the MLEs is given by I – 1, where I is the "observed Fisher information matrix"

Formula

where the first and second column belong to the µH and p terms, respectively, and the remaining columns denote the terms belonging to the signals (S1,...,Sn). To obtain the inverse of I, we partition the matrix after the second row and second column as given below in which A is the 2x2 submatrix at the top of the left-hand side of I, B is the 2xn submatrix at the top of the right-hand side of I, accordingly, and C is the nxn diagonal submatrix at the bottom of the right-hand side of I,

Formula

in which P = (ABC – 1BT) – 1, Q = – (C 1BTP)T, R = C – 1C – 1BTQ. Explicit formulas for the variances of Formula, useful for confidence intervals, can be found from the diagonal of R, i.e.

Formula

where

Formula


    3. APPLICATION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MODEL FORMULATION AND...
 3. APPLICATION
 REFERENCES
 
We compare FGX and BGX using the GeneLogic spike-in hgu95a data set, which is available from http://www.genelogic.com/newsroom/studies/index.cfm. The GeneLogic spike-in data set consists of 14 arrays with 11 GeneLogic spike-in probe sets and each probe set consists of 20 probes. Apart from gene CreX-3, each of the 10 spike-in genes is hybridized at various concentrations from 0.0 to 150.0 pM. All computations were in R using the affy and hgu95acdf packages for importing and handling the data.

Despite the structural similarities between the FGX and BGX models, FGX slightly outperforms BGX in terms of the so-called "slope detect," in that the slope 0.60 (0.77, if omitting low concentrations) calculated from regressing expression values on the log-concentrations are somewhat closer to 1 than the 0.50 (0.60) achieved by BGX (Hein and others, 2005, Figure 6). We note that only a slope of 1 on the log-scale corresponds to a linear relationship on the original scale. From the plots of FGX signals in Figure 1(a), it is seen that the weighted average of estimated FGX intensities of each array have larger confidence intervals than those of BGX at high concentrations (Hein and others, 2005, Figure 6).


Figure 1
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. (a) FGX estimated signal versus true signal and pointwise 95% confidence interval for 10 GeneLogic spike-in genes (excluding array 92 466, spike-in gene DapX-M, and 0.0, 0.75 pM concentrations). (b) Average intensities of 14 Affymetrix spike-in genes versus nominal log-concentrations (excluding 0.0 pM concentration) using MAS 5.0, dCHIP, RMA, GC-RMA, and FGX method using MAS 5.0, mgMOS, multi-mgMOS, and FGX method.1

 
At low concentrations, on the other hand, both methods have large confidence interval since low measurements are highly affected by the noise and the nonspecific hybridization. Most importantly, due to its explicit formulas the FGX is much faster (1 s in R) than the BGX method (70 min in C++).

In order to compare the performance of FGX to other benchmark methods, we use the Affymetrix spike-in data, available from http://affycomp.biostat.jhsph.edu/. This data set involves in 59 arrays with 14 spike-in probe sets, when excluding anomalous probe sets 33 818 and 546. Each probe set has 16 probes. These spike-in genes are measured at concentrations from 0.0 to 1024.0 pM. In Figure 1(b, c), the average estimated signal across all genes is plotted against nominal log-concentration level. It is interesting to note that FGX is the only method that estimates an effectively zero signal, when the concentrations are negligible. The reason is that by assuming that the MM probe contains some level of the true signal, the average level of nonspecific signal µH is identifiable in the FGX model. The R-squared obtained by FGX from regressing the expression values on the log-concentrations is considerably higher (0.95) than those obtained by the other methods (0.80–0.86). However, apart from these two points, which clearly pack out in favor of FGX, there is a rough correspondence between all the methods.


    ACKNOWLEDGMENTS
 
Conflict of Interests: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. MODEL FORMULATION AND...
 3. APPLICATION
 REFERENCES
 

    Hein A-MK, Richardson S, Causton HC, Ambler GK, GreenJ P. (2005) BGX: a fully Bayesian gene expression index for affymetrix geneChip data. Biostatistics 6:349–73.[Abstract/Free Full Text]

    Hubbell E, Lui WM, Mei R. (2002) Robust estimators for expression analysis. Bioinformatics 18:1585–92.[Abstract/Free Full Text]

    Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:249–64.[Abstract]

    Li C and Wong W. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences 98:31–6.[Abstract/Free Full Text]

    Liu X, Milo M, Lawrence ND, Rattray M. (2005) A tractable probabilistic model for affymetrix probe-level analysis across multiple chips. Bioinformatics 21:3637–44.[Abstract/Free Full Text]

    Milo M, Fazeli A, Niranjan M, Lawrence ND. (2003) A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochemical Society Transactions 31:1510–2.[Web of Science][Medline]

    Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. (2004) A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99:909–17.[CrossRef][Web of Science]

    Received April 13, 2006; revised July 6, 2006; accepted for publication August 2, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    8/2/433    most recent
    kxl020v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Purutçuoglu, V.
    Right arrow Articles by Wit, E.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Purutçuoglu, V.
    Right arrow Articles by Wit, E.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?