Skip Navigation


Biostatistics Advance Access originally published online on March 24, 2006
Biostatistics 2006 7(4):630-641; doi:10.1093/biostatistics/kxj032
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
7/4/630    most recent
kxj032v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Bhowmick, D.
Right arrow Articles by Ruffieux, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bhowmick, D.
Right arrow Articles by Ruffieux, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

A Laplace mixture model for identification of differential expression in microarray experiments

Debjani Bhowmick, AC Davison*, Darlene R. Goldstein and Yann Ruffieux

Ecole Polytechnique Fédérale de Lausanne, Institute of Mathematics, EPFL-FSB-IMA, Station 8, CH-1015 Lausanne, Switzerland anthony.davison{at}epfl.ch

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 
Microarrays have become an important tool for studying the molecular basis of complex disease traits and fundamental biological processes. A common purpose of microarray experiments is the detection of genes that are differentially expressed under two conditions, such as treatment versus control or wild type versus knockout. We introduce a Laplace mixture model as a long-tailed alternative to the normal distribution when identifying differentially expressed genes in microarray experiments, and provide an extension to asymmetric over- or underexpression. This model permits greater flexibility than models in current use as it has the potential, at least with sufficient data, to accommodate both whole genome and restricted coverage arrays. We also propose likelihood approaches to hyperparameter estimation which are equally applicable in the Normal mixture case. The Laplace model appears to give some improvement in fit to data, though simulation studies show that our method performs similarly to several other statistical approaches to the problem of identification of differential expression.

Keywords: Laplace distribution; Marginal likelihood; Microarray experiment; Mixture model; REML


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 
Microarrays have become an important tool for studying the molecular basis of complex disease traits and fundamental biological processes. Two-channel microarrays, such as spotted cDNA or long oligonucleotide arrays, measure relative gene expression in two samples. Once preprocessed, the data from such arrays take the form of normalized base 2 logarithm of the expression ratios. A common purpose of microarray experiments is the detection of genes that are differentially expressed under two conditions, such as treatment versus control or wild type versus knockout. Numerous statistical methods have been proposed for identification of differential expression, with new ones continuing to be introduced.

In early analyses such as Schena and others (1995Go, 1996Go), fold change between conditions exceeding a constant was used to identify differentially expressed genes. This method performs poorly, however, because it ignores the different variability of expression across genes. Such variability can be taken into account by using a t-test on the average log-fold change, but variation in gene expression is poorly estimated with small numbers of replicates, so genes with artificially low variance may be selected even if they are not truly differentially expressed. Numerous refinements have been proposed in order to reduce the numbers of false-positive and false-negative results. Tusher and others (2001)Go and Efron and others (2000)Go suggested adding a constant to the t denominator so that it does not become too small. Lönnstedt and Speed (2002)Go proposed a normal mixture model for the gene expression data and defined a log posterior odds statistic, their B-statistic, for ranking genes. Their approach was extended to linear models for more general designs by Smyth (2004)Go; these methods are implemented in the R package "limma" as part of the BioConductor project (www.bioconductor.org). Gottardo and others (2003)Go not only use a similar approach but also include a heuristic, iterative method for estimating the proportion of differentially expressed genes.

The present paper makes three main contributions: we consider robust and asymmetric variants of the mixture modeling approach, we propose two new approaches to parameter estimation, and we perform a comparative study intended to elucidate key features of such methods. The methods studied include those listed above, along with our own proposal of a Laplace mixture as a long-tailed alternative to the normal distribution for mean gene expression across replicate arrays proposed by Lönnstedt and Speed (2002)Go.

This paper is organized as follows: Section 2 describes the model, with estimation of the hyperparameters discussed in Section 3. Simulation studies outlined in Section 4 show that this method performs similarly to several other statistical approaches to the problem of identification of differential expression. In Section 5, we apply our method to a published data set and compare its results with other methods. The paper ends with a brief discussion.


    2. LAPLACE MIXTURE MODEL
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 

2.1 Symmetric gene expression

Lönnstedt and Speed (2002)Go propose the use of a mixture of a point mass at zero and a normal distribution for mean gene expression, and of an inverse gamma distribution for the single-gene variances. Gaussian variation is the exception rather than the rule in practice, however, so we consider using a Laplace distribution as a potentially more realistic longer tailed alternative to the normal law. We assume that the data are normalized base 2 logarithms of fold changes, as would be the case for two-channel arrays. It is not difficult to apply our approach to single-channel technologies, by modeling the single-channel log intensities, and using a contrast matrix to obtain corresponding log ratios.

We suppose initially that it is reasonable to expect approximately symmetric over- and underexpression, as when an array encompasses an entire genome. Arrays with selected genome coverage are considered in Section 2.2.

We model normalized relative expression measures on each of G genes independently, but in this section, we ease the notation by confining our attention to data from a single gene. Although in reality genes do interact with each other, the independence assumption is a useful simplification. The measured relative expression levels for this gene in n replicates, Formula, are taken to be independent and normally distributed with mean relative expression level Formula and variance Formula, i.e. FormulaFormula We denote the average and sample variance of the observations y by Formula and Formula.

We let Formula represent the probability that this gene is differentially expressed, and assume that in this case the mean expression level Formula has a Laplace distribution with mean zero and variance 2Formula, so that the density is given by

Formula

If the gene is not differentially expressed, its mean expression level is zero. Thus, Formula may be regarded as a random draw from the mixture distribution Formula, where Formula represents the Laplace distribution for differential expression and Formula denotes the distribution which places unit mass at Formula. Conditional on Formula, we assume that Formula, where Formula represents a type of generalized signal to noise ratio. Thus, if the variability Formula of expression levels across replicates is large, the variation of Formula will also be large. This assumption, which simplifies subsequent mathematical developments, is checkable from the data; in the cases we have examined it seems fairly reasonable. Variability of measured expression levels is not constant across genes, and we follow Lönnstedt and Speed (2002)Go in assuming an inverse gamma distribution for Formula, that is Formula , where Formula and Formula are shape and scale parameters, respectively. We use a similar notation for the gamma distribution Formula .

We treat Formula, V, Formula, and Formula as hyperparameters, and in Section 3.1 discuss their estimation.

A straightforward but involved calculation shows that the marginal posterior distribution of Formula, like the prior for Formula, is a mixture of continuous and discrete components. It may be expressed as

Formula 2(2.1)

where Formula 2 is a Dirac delta function and provided Formula 2,

Formula 2(2.2)

Here Formula 2 represents the gamma function, Formula 2,

Formula 2

and Formula 2 denotes the cumulative distribution function of a Student t variable with Formula 2 degrees of freedom. Thus h, the posterior density of Formula 2 given that a gene is differentially expressed, comprises two off-centered Student t densities placed back-to-back at the origin.

The posterior odds of differential expression may be written as

Formula 2(2.3)

The second term on the right of (2.3) is the Bayes factor for differential expression Formula 2 relative to Formula 2.

2.2 Asymmetric gene expression

Most methods for the detection of differential expression assume equal over- and underexpression, but this may be unrealistic when arrays contain genes chosen for their special interest to the investigator, or in other cases where genome coverage is restricted. We therefore consider an asymmetric extension of our mixture model.

The density of an asymmetric Laplace variable, which may be obtained as the difference of two independent exponential variables with different means, may be expressed as (Kotz and others, 2001Go, Chapter 3)

Formula 2

where Formula 2. Taking Formula 2 yields the symmetric Laplace distribution, while Formula 2 gives increasing weight to the right tail of the density, and conversely when Formula 2. We write the corresponding distribution as Formula 2, in terms of which the asymmetric Laplace mixture model for Formula 2 may be written as Formula 2. A calculation generalizing that in the symmetric case shows that when Formula 2 the posterior density of Formula 2 is again given by (2.1) and (2.2), but with

Formula 2

The posterior odds of differential expression are again given by (2.3), and the hyperparameters Formula 2, V, Formula 2, Formula 2, and Formula 2 may be estimated by an empirical Bayes procedure, as in the symmetric case.

2.3 Bayesian estimation

The discussion above relates to data from a single gene, but in applications many genes will be considered, making it useful to reduce the entire marginal posterior density of Formula 2 to a single value. One way to do this is to consider the posterior median Formula 2 of Formula 2. If the posterior probability that Formula 2 exceeds one-half, then Formula 2, whereas if the posterior probability that Formula 2 exceeds one-half, then Formula 2. If neither of these events occurs, then the point mass at Formula 2 ensures that Formula 2. Thus, the use of a posterior median to estimate Formula 2 yields a thresholding rule according to which Formula 2, if the posterior distribution of Formula 2 is too closely concentrated around zero. This method has been used for selection of non-zero components in Bayesian implementations of wavelet smoothing (Johnstone and Silverman, 2005Go).


    3. ESTIMATION PROCEDURES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 

3.1 Marginal likelihood estimation

We now suppose that measurements Formula 2 are available on gene g, for Formula 2, and that the genes behave independently. The posterior density of the expression level for gene g, Formula 2, depends on the data y and on hyperparameters Formula 2 in the symmetric case and Formula 2 in the asymmetric case. Consider the symmetric model under which

Formula

for Formula 2. A simple and in principle an efficient approach to estimating the hyperparameters is to maximize the log marginal likelihood

Formula 3(3.1)

based on the marginal densities

Formula 3

Rather than adopt such a procedure, Lönnstedt and Speed (2002)Go suggested fixing the proportion of differentially expressed genes Formula 3 at an arbitrary value and then estimating other parameters using the method of moments, while Gottardo and others (2003)Go use a heuristic approach to reduce instability in the estimates. Below we describe a two-stage likelihood procedure which can yield fairly stable estimates.

The basis of the two-stage approach is the elementary fact that under a model with Gaussian errors, the average and sample variance Formula 3 and Formula 3 are jointly sufficient statistics. That is,

Formula 3

where the terms on the right are normal and chi-square density functions. As the marginal density of Formula 3 depends only on Formula 3, hyperparameters Formula 3 of the density for Formula 3 can be estimated by maximizing the marginal likelihood obtained as the product of

Formula 3(3.2)

The resulting estimates do not take into account any information about Formula 3 contained in Formula 3, but have the useful property that they do not depend on parameters Formula 3 of the prior density for Formula 3. Estimates of Formula 3 obtained from the marginal likelihood based on (3.2) can be substituted into (3.1), which is then maximized with respect to the remaining hyperparameters. The result is a two-stage procedure, which is compared to an overall optimization of (3.1) in Section 3.2.

An important special case is where the density of Formula 3 is inverse gamma, that is, Formula 3 is gamma distributed with shape and scale parameters Formula 3 and Formula 3, respectively. Then the marginal distribution of Formula 3 is Formula 3, and estimates Formula 3, Formula 3 are readily obtained by maximum likelihood (ML) estimation based on this distribution. Moreover, the suitability of the inverse gamma distribution can be indirectly assessed by plotting the ordered Formula 3 against quantiles of the fitted F distribution. We have found that with this model, standard routines can be applied without difficulty in both optimization stages.

The above approach to estimation of the hyperparameters associated with the variances Formula 3 applies also to designed experiments, provided they are replicated, because it is independent of any superstructure for the Formula 3. It also extends to cases where the error structure is equivariant, that is, where we may write Formula 3, with the Formula 3 having a known density w. Let Formula 3 represent the ordered Formula 3. The differences of order statistics Formula 3 depend only on Formula 3 and w, and so Formula 3 may be estimated using a likelihood based on the marginal densities

Formula 3

Typically, these integrals must be obtained numerically or approximated, unlike when w is the normal density and the prior density of Formula 3 is inverse gamma.

3.2 Numerical comparison

In this section, we compare our two-stage and certain other approaches to hyperparameter estimation using simulated data. The simulation study is based on three scenarios. Initially, we generated 100 data sets each with Formula 3 genes and Formula 3 replicates from the Laplace mixture model with Formula 3 and Formula 3. An example simulated dataset is represented in Figure 1.


Figure 1
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 M (Formula 5 fold change) versus log variance plot of a simulated data set. Highlighted points are truly differentially expressed genes, with mean Formula 5 simulated from a Laplace distribution.

 
For Scenario I, we simulate data sets with parameter values chosen to match the simulations of Lönnstedt and Speed (2002)Go. Variances Formula 3 are generated from the inverse gamma distribution with parameter values Formula 3 and Formula 3 estimated from the scavenger receptor BI transgenic mice data set (Callow and others, 2000Go) given Formula 3 and Formula 3. For Scenario II, we generate data sets using parameters estimated from a data set on 150 defence-related genes on the plant Arabidopsis thaliana (Reymond and others, 2000Go). In this case, the mean and variance of the error variances were lower, being roughly 2 and 8, respectively, with corresponding parameter values Formula 3 and Formula 3. Scenario III uses parameter values intermediate between the other two, taking the mean and variance of the error variances to be 5 and 50, respectively; this yields Formula 3 and Formula 3.

For each simulated dataset, we estimated the parameters by our two-stage marginal likelihood procedure (MML) and by ordinary maximum likelihood (ML). Table 1 gives the bias and root mean square error for the three scenarios, based on 100 datasets. Both procedures converged readily and appear to perform well, but as one would expect on theoretical grounds, ordinary maximum likelihood performs at least as well as marginal likelihood estimation in most cases. One perhaps surprising aspect is the high biases observed for estimation of Formula 3, using either procedure.


View this table:
[in this window]
[in a new window]

 
Table 1 Root mean squared error (bias) (x 102) of maximum marginal likelihood (MML) and maximum likelihood (ML) estimators of the parameters of the symmetric Laplace mixture model. In all cases the proportion of differentially expressed genes is {omega} = 0.1 and the number of replicates is n = 2.

 
In a second study, we generated data from the asymmetric Laplace mixture model. For each of various scenarios, we generated 100 data sets each with 2000 genes. As this model is likely to be considered for subsets of genes selected because they are thought to be important, the probability of differential expression Formula 3 was taken to be appreciably higher than the first study (Formula 3 or 0.75). The asymmetry parameter Formula 3 was varied (Formula 3), with the remaining parameters taken to be similar to the values for the symmetric cases described above. We also considered the effect of taking Formula 3 replicates as well as setting Formula 3.

Unfortunately, with the low numbers of replicates considered here there is difficulty in estimating most parameters under full ML, especially for values of Formula 3 substantially different from 0. Even with REML estimation, typically, only the parameters Formula 3 and Formula 3 are well estimated unless Formula 3 is close to 0. For some of the parameter sets, we generated 100 replicates, and in these cases estimation appeared to be improved. Apparently, large numbers of replicates are required for acceptable parameter estimates in this more complex mixture model.


    4. COMPARATIVE STUDY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 

4.1 Methods and statistics

A number of methods have been proposed for identifying differentially expressed genes, with new ones continuing to be introduced. We compare the performance of several of these after brief descriptions of those we consider.

In order to generalize and compare our work with that of Lönnstedt and Speed (2002)Go, we include the statistics they used, both versions of our statistic, and another recently introduced method. These seven statistics and their justification are

L -stat—Laplace mixture model

AL -stat—Asymmetric Laplace mixture model

B -stat—Normal mixture model (Lönnstedt and Speed, 2002Go)

Formula 3-stat—Normal mixture (Gottardo and others, 2003Go)

M -stat—Average Formula 3 fold change

t -stat—t-statistic based on individual gene standard deviation

Pen-t -stat—Penalized t-statistic (Efron and others, 2000Go).

Note that when, as is the case here, all genes are equally replicated, the B-statistic reduces to the moderated t-statistic (Smyth, 2004Go).

4.2 Simulation studies

Simulated data sets were generated under different mixture models. For each scenario, we generated 100 data sets with 3000 genes each and Formula 3, 4, or 8 replicates, as in Section3.2. Parameter values for the simulations were chosen based on our analysis of a publicly available Arabidopsis thaliana data set, presented in Section 5.

The parameter corresponding to the proportion of differentially expressed genes (Formula 3 or p) is estimated from the data for the L- and AL-statistics, and set to 0.1 for the rest of the statistics.

Parameters for L-stat and AL-stat are estimated by the two-step procedure described in Section 3. The B-stat is computed using the "eBayes" function of the R package "limma" (Smyth, 2004Go). This package is available as part of the open source BioConductor project (Gentleman and others, 2004Go). We computed the Formula 3-stat with default values of the function "B1.bayes" in the package "amd" that works with older versions of R; similar functionality is now available through the BioConductor packages "rama" and "bridge."

We generated data from the Laplace mixture model, as described in Section3.2, with parameter values estimated by applying this model to the Arabidopsis data set.

Table 2 gives the generating parameter values, with summaries of estimates from the L and AL models for sample sizes Formula 3, 4, and 8.


View this table:
[in this window]
[in a new window]

 
Table 2 Parameter values for Laplace model simulations, using ML estimation.

 
We also simulated data sets under the normal mixture model as for the Laplace mixture model. Because the findings are very close to those for data generated under the Laplace model, we do not show detailed results. The B-statistic and L-statistic perform quite similarly, although with data generated from the normal mixture model the B-statistic performs slightly better.

4.3 Results

For each gene, the statistics are calculated for each of the 100 data sets. For a range of cutoff values for all the methods, the numbers of false-positive and false-negative results are calculated in each data set. For each cutoff value of a given statistic, the observed numbers of false-positive and false-negative genes are then averaged over the 100 data sets. For each method, a sufficient range of cutoffs was chosen to be able to observe the behavior of the receiver operating characteristic (ROC) curves over a large range. The average numbers of false-positive genes are plotted against the average numbers of false-negative genes for the statistics in variants of ROC plots. Curves falling more steeply toward the lower left corner correspond to more powerful test procedures.

A few of these plots are presented in Figure 2. The AL-statistic performed practically identical to the L-statistic, so only the curve corresponding to the L-statistic is shown, along with those of the other six statistics. We summarize our general findings from examining several such curves.


Figure 2
View larger version (12K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 ROC curves for statistics computed on data simulated under the Laplace model, 3000 genes, Formula 5, Formula 5; (a) Formula 5, Formula 5, Formula 5; (b) Formula 5, Formula 5, Formula 5.

 
Several of the methods perform broadly similarly across simulation conditions: L-stat, AL-stat, B-stat, Formula 3-stat, and Pen-t-stat. In these examples, mean Formula 3 fold change M also shows reasonably good performance as the great majority of gene-specific variances are small. Unsurprisingly, the t-statistic has poor performance for these sample sizes and cannot be recommended.

To reduce computing time in these simulations, we generated arrays consisting of a few thousand genes rather than the tens of thousands of genes more typically encountered in practice. We also examined our scenarios using 20 000 genes, varying Formula 3 from 0.01 to 0.10, in order to assess the sensitivity method performance in this more realistic situation. Within each generating scenario, relative method performance was broadly similar across values of Formula 3.


    5. EXAMPLE: Arabidopsis thaliana DATA SET
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 

5.1 Data

We present an example using data publicly available from the Stanford Microarray Database (SMD) (http://genome-www5.stanford.edu/). The experiments were carried out to compare the general effect on disease resistance RNA transcript levels of Arabidopsis thaliana infected by rhizobacterium Pseudomonas thivervalensis (strain MLG45) to axenic control plants (Cartieaux and others, 2003Go). Here we consider the experiment on plant leaves, which has slides containing 16 416 spots for four biological replicates hybridized to a common reference (SMD Experiment ID numbers 27084, 27000, 26995, and 26718). Further details are available from Cartieaux and others (2003)Go.

From the raw channel intensities, we computed print-tip loess normalized log ratios for each spot (Yang and others, 2002Go). To avoid inflated variance for low-intensity spots, we did not carry out any background correction; we also removed control spots. From these normalized log ratios, we estimated the parameters for the Laplace and Asymmetric Laplace models (Table 3). These values were used as a starting point for part of the simulation study of Section 4.


View this table:
[in this window]
[in a new window]

 
Table 3 Parameter estimates and standard errors for Laplace and Asymmetric Laplace models for the Arabidopsis thaliana data set

 
5.2 Model fit

Here we examine the fit of the Laplace mixture model to the data by comparing gene-specific means and variances of the true and simulated data.

The values of the log ratios seen in the simulated data sets tend to be larger than in the original data, although quantile-quantile (QQ) plots comparing genewise data means to simulated means indicate a reasonably good fit (data not shown).

Under the proposed mixture models, the marginal distribution of the sample variances follows an Formula 3 distribution, where n is the number of replicates. We can thus check the fit of the model to the data set variances by plotting the sample variances against the quantiles of the Formula 3 distribution. Figure 3(a) shows that the agreement seems very good overall, with some deviation for the highest few variances.


Figure 3
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Variance fits: (a) QQ plot of sample variances; (b) square of the sample means (Formula 5) versus log sample variances (Formula 5)—solid line for Laplace model, broken line for normal model.

 
To further investigate the fit of the Laplace mixture model, and compare it with that of the normal mixture model (Lönnstedt and Speed, 2002Go), we examine the relationship between the sample mean Formula 3 and the sample variance Formula 3.

Under the Laplace model, we find that

Formula 5(5.1)

which is quadratic in Formula 5, with coefficients that can be determined from n and the ML estimates of Formula 5 and V.

Under the normal mixture model (Lönnstedt and Speed, 2002Go), the mean Formula 5 for differentially expressed genes is distributed as normal with mean zero and variance Formula 5, where Formula 5. Thus, the sample mean square, Formula 5, is linear in the sample variance (Formula 5):

Formula 5(5.2)

Here the coefficient can be determined by assuming Formula 5, the default value for the B-statistic, and with V estimated by the method of moments (Lönnstedt and Speed, 2002Go).

We further investigate the relationship between the sample variances and the sample means by fitting generalized linear models. The models account for the squared means in terms of linear (normal) or linear and quadratic (Laplace) functions of the variances, using the gamma family with identity link. The quadratic effect is significant at level around 0.001, suggesting a strong preference for the Laplace model over the normal one.


    6. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 
This study makes three main contributions to statistical methodology for analysis of microarray experiments. First, we have introduced the Laplace mixture model. This model permits greater flexibility than models in current use as it has the potential, at least with sufficient data, to accommodate both whole genome and restricted coverage arrays. In addition, it also appears to provide a somewhat better fit to data. We have also proposed fundamentally sound approaches for estimating the proportion of differentially expressed genes that also appear to perform well in practice. Last, we have further characterized and compared several methods of identifying differential expression.

Our original motivation for proposing the Laplace mixture model was that the normal mixture model of Lönnstedt and Speed (2002)Go did not fit our data well. We were initially surprised to find that this lack of fit only marginally affects performance: the Laplace model fits better but the performance of both is similar, so gene rankings from these models are anticipated to be alike across a variety of data sets. Thus, although there may be some room for "fine tuning," we do not argue that the Laplace mixture should supplant the commonly used normal mixture in practice. Rather, we believe that efforts aimed at further model refinement might more profitably be focused on solving new and more challenging problems.

Our proposed procedures for estimating the proportion of differentially expressed genes seem to perform quite well in the simulations. Unlike the heuristic procedure of Gottardo and others (2003)Go, our likelihood approach does not require an arbitrary criterion for calling a gene differentially expressed. This approach could also be readily incorporated into existing normal mixture model-based software. However, our approach might in some cases give poor results, in which case the obvious work-around is to simply fix the value of Formula 5, as suggested by Lönnstedt and Speed (2002)Go.

In the comparison study, several procedures—those which in essence use some type of global shrinkage to modify the denominator—appear to perform quite similarly across simulation conditions. Some forms of local shrinkage give worse performance than global shrinkage (Kooperberg and others, 2005Go); we did not consider them here. These results underline the futility of endeavors to create substantially better procedures by further tinkering. The procedure that will be optimal in a given experiment depends on many factors which in practice are typically unknown, such as the form of the data generating distribution and the true parameter values, including the proportion of differentially expressed genes. The key aspect seems to be how the smallest and largest single-gene variances are treated.

Finally, we would like to draw attention to the issue of software. Carrying out this type of study and, more importantly, data analyses in general, depends heavily on the availability of readily usable software. Here, we mostly relied upon the open source R-based BioConductor project (Gentleman and others, 2004Go), which provides an ideal framework for software distribution as well as a large number of quality software tools for statistical genomics computing. We are in the process of preparing a more user-friendly version of our software as an R package for BioConductor. We urge those who create software intended for wide dissemination also to consider implementation in R and contribution to BioConductor.


    ACKNOWLEDGMENTS
 
This work was funded by the Swiss National Science Foundation through the National Centre for Competence in Research in Plant Survival. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LAPLACE MIXTURE MODEL
 3. ESTIMATION PROCEDURES
 4. COMPARATIVE STUDY
 5. EXAMPLE: Arabidopsis thaliana...
 6. DISCUSSION
 REFERENCES
 

    Callow MJ, Dudoit S, Gong EL, Speed TS, Rubin EM. (2000) Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research 10:2022–9.[Abstract/Free Full Text]

    Cartieaux F, Thibaud M-C, Zimmerli L, Lessard P, Sarrobert C, David P, Gherbaud A, Robaglia C, Somerville S, Nussaume L. (2003) Transcriptome analysis of Arabidopsis colonized by a plant-growth promoting rhizobacterium reveals a general effect on disease resistance. The Plant Journal 36:177–90.[CrossRef][Web of Science][Medline]

    Efron B, Tibshirani RJ, Goss V, Chu G. (2000) Microarrays and their use in a comparative experiment. Technical report. (Department of Health Research and Policy, Stanford University, Stanford, CA.).

    Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5:R80 http://genomebiology.com/2004/5/10/R80.[CrossRef][Medline]

    Gottardo R, Pannucci JA, Kuske CR, Brettin T. (2003) Statistical analysis of microarray data: a Bayesian approach. Biostatistics 4:597–620.[Abstract]

    Johnstone IM and Silverman BW. (2005) Empirical Bayes selection of wavelet thresholds. Annals of Statistics 33:1700–52.[CrossRef]

    Kooperberg C, Aragaki A, Strand AD, Olson JM. (2005) Significance testing for small microarray experiments. Statistics in Medicine 24:2281–98.[CrossRef][Web of Science][Medline]

    Kotz S, Kozubowski TJ, Podgórski K. (2001) The Laplace Distribution and Generalizations(Birkhäuser., Boston).

    Lönnstedt I and Speed TP. (2002) Replicated microarray data. Statistica Sinica 12:31–46.

    Reymond P, Weber H, Damond H, Farmer EE. (2000) Differential gene expression in response to mechanical wounding and insect feeding in Arabidopsis. Plant Cell 12:707–19.[Abstract/Free Full Text]

    Schena M, Shalon D, Davis RW, Brown PO. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–70.[Abstract/Free Full Text]

    Schena M, Shalon D, Heller R, Chai A, Brown PO, Davis RW. (1996) Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proceedings of the National Academy of Sciences of the United States of America 93:10614–9.[Abstract/Free Full Text]

    Smyth GK. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3:.

    Tusher VG, Tibshirani RJ, Chu G. (2001) Significance analysis of microarrays applied to the ionising radiation response. Proceedings of the National Academy of Sciences of the United States of America 98:5116–24.[Abstract/Free Full Text]

    Yang YH, Dudoit S, Luu P, Speed TP. (2002) Normalisation issues for cDNA microarray data. In Bittner ML, Chen Y, Dorsel AN, Dougherty ER (Eds.). Microarrays: Optical Technologies and Informatics. Proceedings of SPIE(SPIE, Bellingham, WA) pp. 141–52.

    Received December 15, 2005; revised February 18, 2006; revised March 10, 2006; accepted for publication March 23, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BiometrikaHome page
    A. C. Davison and D. Mastropietro
    Saddlepoint approximation for mixture models
    Biometrika, June 1, 2009; 96(2): 479 - 486.
    [Abstract] [PDF]


    Home page
    BioinformaticsHome page
    K. Lo and R. Gottardo
    Flexible empirical Bayes models for differential gene expression
    Bioinformatics, February 1, 2007; 23(3): 328 - 335.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    7/4/630    most recent
    kxj032v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Bhowmick, D.
    Right arrow Articles by Ruffieux, Y.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Bhowmick, D.
    Right arrow Articles by Ruffieux, Y.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?