Biostatistics Advance Access originally published online on August 24, 2006
Biostatistics 2007 8(2):433-437; doi:10.1093/biostatistics/kxl020
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
FGX: a frequentist gene expression index for Affymetrix arrays
luDepartment of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK e.wit{at}lancaster.ac.uk
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
We consider a new frequentist gene expression index for Affymetrix oligonucleotide DNA arrays, using a similar probe intensity model as suggested by Hein and others (2005), called the Bayesian gene expression index (BGX). According to this model, the perfect match and mismatch values are assumed to be correlated as a result of sharing a common gene expression signal. Rather than a Bayesian approach, we develop a maximum likelihood algorithm for estimating the underlying common signal. In this way, estimation is explicit and much faster than the BGX implementation. The observed Fisher information matrix, rather than a posterior credibility interval, gives an idea of the accuracy of the estimators. We evaluate our method using benchmark spike-in data sets from Affymetrix and GeneLogic by analyzing the relationship between estimated signal and concentration, i.e. true signal, and compare our results with other commonly used methods.
Keywords: Affymetrix; Fisher information matrix; GeneChip; Gene expression; Maximum likelihood; Probe-level analysis; Spike-in data sets
| 1. INTRODUCTION |
|---|
|
|
|---|
High-dimensional Affymetrix oligonucleotide DNA arrays are widely used in biomedical research. Each oligonucleotide probe consists of a small string of DNA 25 base pairs specific for a gene or expressed sequence tag and is immobilized on a glass slide or array. To measure the amount of transcribed RNA, the gene targets are labeled with a dye and hybridized to the probes on the array. Each gene is defined by a set of 11 to 20 "probe pairs," coming from different parts of the gene's DNA sequence. There are two components to each probe pair: the "perfect match" (PM) measures the amount of transcribed perfectly matched target complementary to the mRNA of a specific gene, whereas the "mismatch" (MM) is supposed to measure the amount of nonspecific binding of the target by changing the 13th base pair of the probe. In order to denote the estimated gene expression level, the term "gene expression index" is widely used. There are several methods that are often used for estimating gene expression levels: MAS 5.0 (Hubbell and others, 2002
The GC-RMA method, an extension of RMA, is the first method to consider the idea that besides nonspecific hybridization, MM values also contain some information about the true signal S whose contamination is a fraction p. However, in practice the calculations are simplified by assuming that p is zero. The BGX method actually estimates p from the data. For estimation, a Bayesian hierarchical setup in conjunction with an Markov chain Monte Carlo approach is used. The Bayesian gene expression index is computed as the median of the posterior signal distribution, although Bayesian credibility intervals are also available.
In this paper, we consider the application of a maximum likelihood (ML) alternative to estimate the true signals under a PM and MM intensity model that is similar to BGX. In this way, we intend to reduce computational cost considerably and to maintain the efficiency of the estimators. The website http://www.maths.lancs.ac.uk/
wite/research contains both the R-code for frequentist gene expression (FGX) function as well as other supplementary material.
| 2. MODEL FORMULATION AND INFERENCE |
|---|
|
|
|---|
A model that induces a relationship between PM and MM probes is one where both PM and MM share part of a common signal S as well as a large nonspecific hybridization component H as an offset term. Assuming lognormality for the probe intensities deals largely with the full extent of the variance heterogeneity across the intensity range, logPMij
N(Si + µH,
2) and logMMij
N(pSi + µH,
2), where j = 1,...,m is the probe indicator, Si is the true expression value for gene i = 1,...,n, p is the fraction of "specific" hybridization of the MM probe, and µH is the mean of the nonspecific hybridization random effect. The constant variance term
2 = 
2 + 
is the sum of a measurement error term 
2 and a nonspecific hybridization term 
. Since the variance components cannot be identified separately, we estimate a combined
2 term.
As averages of the log-transformed PM and MM probes are sufficient statistics for their associated underlying means and because analysis of Affymetrix data typically takes place on a probe set level, rather than an individual probe level, we consider that the available data typically consist of PMi: = 
logPMij/m and MMi: = 
logMMij/m, such that PMi
N(Si + µH,
2/m) and MMi
N(pSi + µH,
2/m). The aim is to obtain estimates for the parameters p, Si, and µH by using ML. The loglikelihood l conditional on PM = (PM1,...,PMn) and MM = (MM1,...,MMn) is given as
|
|
The maximum likelihood estimators (MLEs) of the unknown parameters are solutions of the partial derivatives of l equated to zero and of the parameters µH and S are the explicit functions of the intensities and the MLE of p,
|
|
where
and
. In order to obtain the MLE of p, we substitute
and
into
l/
p = m/(2
2)
[ 2Si(MMi pSi µH)], which gives a fourth-order polynomial equation which can be written as
|
|
where
,
, and
. There are three solutions for this equation. SSPM,MM > 0, i.e. a positive correlation between all the PM and MM signals, the maximum of l is found at
|
|
where
,
, and
. And in that case, the MLE of p is given as
. Note that if SSPM,MM
0, then there is no evidence in the data that the MM probes contain any information about the underlying signal. In that case, the estimate of p should be set to
.
We note that for estimation of the variance terms, the probe means, PMi and MMi, are not sufficient. The loss of information can be regained by reconstructing the likelihood function in terms of all data after estimating the parameters Si, p, and µH and to calculate an MLE for
2 conditional on the estimates
,
, and
, giving
.
Asymptotically, MLEs are fully efficient, i.e. they are unbiased and have minimum variance bounds. For a finite number of samples, the covariance matrix of the MLEs is given by I 1, where I is the "observed Fisher information matrix"
![]() |
where the first and second column belong to the µH and p terms, respectively, and the remaining columns denote the terms belonging to the signals (S1,...,Sn). To obtain the inverse of I, we partition the matrix after the second row and second column as given below in which A is the 2x2 submatrix at the top of the left-hand side of I, B is the 2xn submatrix at the top of the right-hand side of I, accordingly, and C is the nxn diagonal submatrix at the bottom of the right-hand side of I,
|
|
in which P = (A BC 1BT) 1, Q = (C 1BTP)T, R = C 1 C 1BTQ. Explicit formulas for the variances of
, useful for confidence intervals, can be found from the diagonal of R, i.e.
![]() |
where
![]() |
| 3. APPLICATION |
|---|
|
|
|---|
We compare FGX and BGX using the GeneLogic spike-in hgu95a data set, which is available from http://www.genelogic.com/newsroom/studies/index.cfm. The GeneLogic spike-in data set consists of 14 arrays with 11 GeneLogic spike-in probe sets and each probe set consists of 20 probes. Apart from gene CreX-3, each of the 10 spike-in genes is hybridized at various concentrations from 0.0 to 150.0 pM. All computations were in R using the affy and hgu95acdf packages for importing and handling the data.
Despite the structural similarities between the FGX and BGX models, FGX slightly outperforms BGX in terms of the so-called "slope detect," in that the slope 0.60 (0.77, if omitting low concentrations) calculated from regressing expression values on the log-concentrations are somewhat closer to 1 than the 0.50 (0.60) achieved by BGX (Hein and others, 2005, Figure 6). We note that only a slope of 1 on the log-scale corresponds to a linear relationship on the original scale. From the plots of FGX signals in Figure 1(a), it is seen that the weighted average of estimated FGX intensities of each array have larger confidence intervals than those of BGX at high concentrations (Hein and others, 2005, Figure 6).
|
At low concentrations, on the other hand, both methods have large confidence interval since low measurements are highly affected by the noise and the nonspecific hybridization. Most importantly, due to its explicit formulas the FGX is much faster (1 s in R) than the BGX method (70 min in C++).
In order to compare the performance of FGX to other benchmark methods, we use the Affymetrix spike-in data, available from http://affycomp.biostat.jhsph.edu/. This data set involves in 59 arrays with 14 spike-in probe sets, when excluding anomalous probe sets 33 818 and 546. Each probe set has 16 probes. These spike-in genes are measured at concentrations from 0.0 to 1024.0 pM. In Figure 1(b, c), the average estimated signal across all genes is plotted against nominal log-concentration level. It is interesting to note that FGX is the only method that estimates an effectively zero signal, when the concentrations are negligible. The reason is that by assuming that the MM probe contains some level of the true signal, the average level of nonspecific signal µH is identifiable in the FGX model. The R-squared obtained by FGX from regressing the expression values on the log-concentrations is considerably higher (0.95) than those obtained by the other methods (0.800.86). However, apart from these two points, which clearly pack out in favor of FGX, there is a rough correspondence between all the methods.
| ACKNOWLEDGMENTS |
|---|
Conflict of Interests: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Hein A-MK, Richardson S, Causton HC, Ambler GK, GreenJ P. (2005) BGX: a fully Bayesian gene expression index for affymetrix geneChip data. Biostatistics 6:34973.
Hubbell E, Lui WM, Mei R. (2002) Robust estimators for expression analysis. Bioinformatics 18:158592.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4:24964.[Abstract]
Li C and Wong W. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences 98:316.
Liu X, Milo M, Lawrence ND, Rattray M. (2005) A tractable probabilistic model for affymetrix probe-level analysis across multiple chips. Bioinformatics 21:363744.
Milo M, Fazeli A, Niranjan M, Lawrence ND. (2003) A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochemical Society Transactions 31:15102.[Web of Science][Medline]
Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. (2004) A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 99:90917.[CrossRef][Web of Science]
Received April 13, 2006; revised July 6, 2006; accepted for publication August 2, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



