Skip Navigation


Biostatistics Advance Access originally published online on March 30, 2007
Biostatistics 2007 8(4):805-820; doi:10.1093/biostatistics/kxm007
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow Supplementary Material
Right arrow All Versions of this Article:
8/4/805    most recent
kxm007v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Gupta, M.
Right arrow Articles by Ibrahim, J. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gupta, M.
Right arrow Articles by Ibrahim, J. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

A temporal hidden Markov regression model for the analysis of gene regulatory networks

Mayetri Gupta*, Pingping Qu and Joseph G. Ibrahim

Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA gupta{at}bios.unc.edu

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
We propose a novel hierarchical hidden Markov regression model for determining gene regulatory networks from genomic sequence and temporally collected gene expression microarray data. The statistical challenge is to simultaneously determine the groupings of genes and subsets of motifs involved in their regulation, when the groupings may vary over time, and a large number of potential regulators are available. We devise a hybrid Monte Carlo methodology to estimate parameters under 2 classes of latent structure, one arising due to the unobservable state identity of genes and the other due to the unknown set of covariates influencing the response within a state. The effectiveness of this method is demonstrated through a simulation study and an application on an yeast cell-cycle data set.

Keywords: Data augmentation; Gene expression; Transcription factor; Variable selection


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
With the increased availability of large-scale genomic data, discovering the roles of various components in the genome has become an important goal in the biomedical sciences. Many biological processes in complex organisms are regulated by combinations of genes forming a pathway or network (Wang and others, 2005Go). A fundamental question is to determine how genes interact to regulate biological pathways. Making robust statistical inference from diverse types of genomic data to discover pathways of biological function presents a formidable challenge, especially as various latent correlations exist in gene behavior, and these behaviors and relationships may change over time. The first step in gene regulation constitutes the binding of certain proteins, called transcription factors (TFs) to transcription factor binding sites (TFBSs) on the DNA sequence, which activates the genes downstream into being transcribed into messenger RNA (mRNA). The pattern common to the TFBSs for a TF is termed a "motif", and genes that are regulated by the same TF often share a common motif. A popular strategy for studying gene regulation is a 2-step procedure—searching for motifs upstream of genes after clustering the genes by similar expression patterns. However, a major problem with this approach is that if the initial clustering is inaccurate, discovery of the correct motifs may be heavily biased.

1.1 Gene regulatory processes and statistical modeling

We first describe some terminology and current approaches. A motif of length w is represented through a 4xw matrix called a position-specific weight matrix (PSWM), each column denoting the probabilities (or frequencies) of observing the 4 letters A, C, G, or T in that position (w typically ranges between 8 and 20). For example, binding sites for the yeast motif RAP1 can be represented through the PSWM (of counts):

Formula

Successful computational strategies to discover functional motifs in biological processes have mostly involved searching for conserved motifs in the "promoter" sequence adjacent to each of a set of "co-regulated" genes (Liu and others, 1995Go); for example, genes exhibiting similar expression patterns across a number of experimental conditions in a microarray experiment. More recently, approaches that incorporate more biological information, such as weighting promoters by expression-based scores before motif discovery, have been proposed (Liu and others, 2002Go). Linear model-based approaches (Conlon and others, 2003Go) relate expression values to the estimated propensity of motif occurrences (summarized by a sequence motif "score"), assuming that the presence of a motif site contributes additively to the gene expression level. These approaches assume a single set of gene expression patterns, which makes it difficult to discover motifs having a positive effect on a subset of genes and a negative or neutral effect on another, a problem exacerbated in the case of temporal gene expression. Probabilistic relational models based on the specification of multiple conditional distributions (Segal and others, 2001Go) have been used for modeling gene regulation; however, such models involve a degree of parametric complexity that hinders model validation approaches. One promising approach involved iterative model-based clustering and motif determination (Holmes and Bruno, 2000Go) but avoided a direct model-based link between expression and sequence.

1.2 A joint approach for time-dependent regulatory networks

In this article, we present a method that determines groups of genes and TFs regulating them, allowing for the pattern of regulation to change temporally. This approach addresses the discovery of motifs temporally involved in gene regulation, unlike methods which either ignore the time dependence or consider each time point separately. It may not be reasonable to assume that genes can be grouped into clusters that behave similarly as a group over time. Our model allows correlations to exist between gene measurements over different time points as well as between separate genes on the same array. Several "patterns" of gene behavior may exist in a single experiment where genes may enter or leave a particular pattern at a certain time point in the study.

We present a novel statistical approach that can integrate the information from expression measurements over time and genomic sequence data through a Bayesian hidden Markov model (HMM) framework and simultaneously uncover relationships between genes and TFs that regulate them. We generalize the regression model proposed in Conlon and others (2003)Go to a framework that can accommodate temporally varying motif effects. In our model, the temporal structure of dependence in gene behavior is modeled by a latent Markov process, where at every time point, the observed expression measurement is modeled as a state-dependent function of the time and motif covariates. Using a regression framework provides interpretability of the resulting coefficients and a systematic way to model interactions between various factors.

Unlike other 2-step clustering and motif-finding approaches, our method provides a unified framework giving simultaneous estimates of gene co-regulation and time-dependent motif effects. We first describe the general Bayesian model framework (Section 2) and present a hybrid Monte Carlo procedure for model fitting, parameter estimation, and variable selection under a 2-layered latent structure (Section 3). Defining the model within a hierarchical framework, we show that problems of parameter estimability, identifiability, and model validation can be addressed in a robust way. A Bayesian criterion for selecting the number of model states is discussed (Section 4), which is seen to outperform criteria such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC). Finally, simulation studies and an application to an yeast cell-cycle data set demonstrate the feasibility of this model in a real biological scenario (Sections 5 and 6).

1.3 An yeast cell-cycle gene expression data set

Our motivating data set is from a set of yeast microarray experiments (Spellman and others, 1998Go), an extensive study of genes regulated in a periodic manner coincident with the cell cycle. cDNA microarrays were used to analyze mRNA levels in cell cultures that had been synchronized by 3 independent methods. Gene expression was recorded as the logarithm of the ratio of expression of that gene in the sample to a baseline control. The data set was preprocessed by computing the fluorescence ratios through a local background correction (intensities of the weakest 12% of the pixels in each box) and normalized such that the average log-ratio over the course of the experiments equaled 0. Cell-cycle synchronization was inferred for genes whose mRNA levels (i) were significantly correlated with mRNA levels of genes previously known to be cell cycle regulated and (ii) showed a high periodicity based on a numerical score derived from Fourier analysis. A total of 800 genes were identified as being cell cycle regulated.

For our analysis, we restrict our focus to the common group of genes between the 3 schronization methods. In Section 6, we describe the analysis of the elutriation data set which consists of measurements of 477 genes over 12 time points corresponding to approximately 2 cell cycles. For deriving the motif scores corresponding to each gene, PSWMs and upstream promoter regions of each gene (about a 1-Kb region) were downloaded from the Saccharomyces cerevisiae promoter database (Zhu and Zhang, 1999Go). The motif scoring method is described in Section 2.2. The cell-cycle data set is available from the Stanford Microarray Database (http://genome-www5.stanford.edu/).


    2. A HIDDEN MARKOV REGRESSION MODEL
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
Now, we motivate the development of the model framework. Let Yi = (Yi1,...,YiT) (i = 1,...,N) denote the vector of T expression measurements made on gene i. Assume that each observation may have been generated from one of the K classes, indexed by the corresponding latent vector Zi = (Zi1,...,ZiT). Further, assume that we have measurements on a set of p motif covariates for each observation, the design matrix being denoted by X = (Xil)(i = 1,...,N;l = 1,...,p). Let ßk = (ßk1,...,ßkp) denote the regression coefficient vector for class k. First, assume Yij‘s are independently generated from class k (k = 1,...,K), with the relationship between Yi and Xi = (Xi1,...,Xip) given by

Formula

where {zeta}k(·) is shorthand for a function specifying the regression relationship, that is for a linear regression model, {zeta}k(Xi) = XFormulaßk. If class memberships are unknown a priori, {pi}k(1 ≤ k ≤ K) denoting the prior probability of being in class k, the Yij‘s can be assumed to be generated from a mixture of regression models, each component of the mixture encapsulating a separate regression. However, if the measurements are made over time, the observations Yi1,...,YiT are intrinsically dependent. By setting the "class membership" to account for the dependence in Yi, we capture this dependence through a HMM introducing a stochastic relationship in the latent class indicator Zi. Assuming that the dependence structure is homogeneous over time, we model the dependence in Zij (j = 1,...,T) using a transition matrix {tau} = (({tau}kl)), where

Formula (2.1)

At the initial time point (j = 1), we assume that the prior probability of states is given by {pi} = ({pi}1,...,{pi}K), where P(Zi1 = k) = {pi}k(k = 1,...,K). We denote this model as a "hidden Markov regression model" (HMRM), where the conditional distribution of Yij at time j for unit i can be written in the form

Formula (2.2)

In (2.2), we have included an extra covariate {varphi}(j) for the jth time point to account for time-dependence. Equations (2.2) and (2.1)Go completely specify the HMRM up to the functional forms of {zeta}k(·) and {varphi}(j). In Section 2.1, we describe the model components in more detail.

2.1 Modeling cell-cycle effects

The 2 goals of interest are to (i) determine the variables X influencing the observed Y and (ii) find the optimal number of classes K. We assume that the observed gene expression measurement Yij at time j is a composite of 2 effects, a "fixed" motif-dependent effect {zeta}k(Xi,{varphi}(j)) and a "random" effect {psi}ijk due to gene-specific differences in expression magnitude, where {psi}ijk|Zij = k~N(µkj,{xi}0{sigma}Formula) ({xi}0 denoting a variance inflation factor). If a sufficient number of replicate measurements are available, gene-specific variances can be modeled by using a hierarchical prior. Since such replication is often not available (as in the yeast data set), genes in a common class are currently assumed to have the same variance. Integrating out the random effect gives the marginal distribution of Yij as

Formula (2.3)

We next choose appropriate functional forms for {zeta}(·) and {varphi}(·). For regulatory networks over time, the covariates of interest are motif scores corresponding to each gene. We assign each sequence promoter region a score covariate Xil (1 ≤ l ≤ D,1 ≤ i ≤ N) with respect to the set of D motif patterns, indicating its propensity to contain 1 or more binding sites for that motif (Section 2.2).

Most previous work relating sequence effects with gene expression assume a linear relationship (Bussemaker and others, 2001Go), (, conlonetal03) which may be a simplification of the real biological model. However, for balancing the trade-off between parameter identifiability and model sophistication, in the face of increased model complexity, we also currently consider only linear covariate effects of the motifs. For the time covariate, the effect is cyclical depending on which functional cluster each gene lies within and the point of the cell cycle being observed. With a total set of D motif covariates, we assume

Formula (2.4)

where {varphi}(j) is a suitably chosen function that maps time point j into a phase of the cell cycle. The state-dependent sinusoidal coefficients allow the flexibility of gene clusters varying in amplitude of their effects as well as frequency. This may help, for example, in identifying clusters having opposing time-dependent effects over the cycle. Since the data set was partially synchronized before analysis, we limited the number of harmonic components to 1 to avoid overparametrization. More stringent tests could be carried out to determine an adequate number of harmonic components (see, e.g. Quinn, 1989Go).

Next, we summarize the general data structure. For simplifying the notation, unless indicated otherwise, the subscript ranges in the following are i = 1,...,N;j = 1,...,T; and l = 1,...,p, with p = 2 + D. Let Y(k) = {Yijµkj:Zij = k} and x = {xijl}, where xFormula = (xij1,...,xijp) = (sin{varphi}(j),cos{varphi}(j),Xi1,...,XiD). Further, let X = ((Xil)),(1 ≤ i ≤ NT){equiv}stack{xFormula,1 ≤ i ≤ N,1 ≤ j ≤ T}. X is of dimension NTxp, where each row corresponds to a realization of 1 gene (i) at 1 time point (j). We denote the subset of X corresponding to observations in state k as X(k) = {X:Zij = k}. Let u = (u1,...,up) denote a p-dimensional vector where ul = 1(0) denotes that covariate Xl is present (absent) in the model. The subset of X indexed by the "selected variables" u is X(u) = {XilisinX:ul = 1;1 ≤ i ≤ NT}. Then, the submatrix of X formed by the rows (i,j) corresponding to Zij = k, and |u| = {sum}Formulaul columns such that ul = 1, is given by XFormula = {X(u):Zij = k}. Also, for the set of coefficients ßk = ({alpha}1k,{alpha}2k,ßk1,...,ßkD)T corresponding to the subset of variables in state k, indexed by u, we use the notation ßFormula = {ßkl:ul = 1}.

2.2 Motif scoring model and covariates

Each upstream sequence is next given a score with respect to each PSWM, {Theta}j(1 ≤ j ≤ D) of width wj columns, to get a NxD covariate matrix of gene sequence scores X, where each row Xi = (Xi1,...,XiD) is the score vector for gene i (1 ≤ i ≤ N). As mentioned in Section 1.1, a PSWM is characterized as {Theta} = (({theta}jk)) (1 ≤ j ≤ 4, 1 ≤ k ≤ w), to denote a motif of w columns, where each column of the matrix {theta}k = ({theta}1k,...,{theta}4k)T denotes the relative frequencies of each letter in that position. Let {theta}0 = ({theta}01,...,{theta}04) denote the relative frequencies of the 4 nucleotides characterizing the "background" sequence (not containing motifs) assuming that the sequence is generated from a Markov process of order 0. If the probability of motif occurrence is assumed uniform over the sequence, then the score Xij for sequence i, relative to motif j, is taken to be the logarithm of the likelihood ratio between the jth motif model and the background model, assuming that any number of motif sites can occur in the sequence, with the constraint that there is no overlap between sites.

Let x[l:m] denote the sequence {xl,xl + 1,...,xm}. For the motif covariates Xij for gene i and motif j, given the PSWM {Theta}j, the likelihood ratio is Formula, where Li is the length of the upstream sequence for gene i. Calculation of the denominator is straightforward, for both an independent and a Markovian model. However, since the actual partitions of the sequence into individual sites and background are not known, evaluating the numerator (with {Theta}j fixed) involves an exponential order sum—including all possible segmentations into possible motif sites and background sequence. Fortunately, we can use a recursive technique that has been formulated in detail in Gupta and Liu (2003)Go in the context of motif discovery for more efficient computation. Let {Phi}k – 1({Theta}) be the sum of probabilities for all legitimate partitions for the partial sequence x[1:(k – 1)]. Then, we can recursively evaluate {Phi}k as

Formula (2.5)

where {rho}(x[l:m]) = P(x[l:m]|{Theta}j)1[m l + 1 = wj]P(x[l:m]|{theta}0)1 – 1[ml + 1 = wj] evaluates the probability that the previous segment is either a motif site or a letter generated from the background and 1[X = a] denotes an indicator variable that takes value 1 only if X = a. P(x[l:m]|{Theta}j) is calculated assuming that the letter frequencies in each position are independent multinomials with parameters corresponding to columns of {Theta}j. By recursively evaluating expression (2.5), the likelihood of the entire segment is calculated as {Phi}Li({Theta}) for sequence i (1 ≤ i ≤ N). We note that Conlon and others (2003)Go use a simpler version of this score under the assumption that a sequence can contain only a single motif site.

2.3 A hierarchical prior framework

In a linear regression framework, an attractive choice of prior for the regression coefficient ß is the conjugate g-prior (Zellner, 1986Go). With the error variance denoted as {sigma}2, the g-prior is of the form g{sigma}2I – 1, where I denotes the Fisher information matrix XTX and g is a scalar constant. However, when the observations belong to one of K unknown classes, that is in a mixture or HMM framework, the density is no longer from a regular exponential family and the information matrix cannot be derived in an analytical form. In our model, the entire set of regression coefficients for state k(1 ≤ k ≤ K) is denoted by ßk = ({alpha}1k,{alpha}2k,ßk1,...,ßkD)T and ß = (ß1,...,ßk). Following the general form of the g-prior, we choose a conjugate prior for ß as ßk~N(ß0,g{sigma}Formula(XTX) – 1). This approximates the "complete data" g-prior with covariance g{sigma}Formula({sum}i:Zi = kXiXFormula) – 1, which cannot be directly evaluated since Z is a latent variable. Our "modified" g-prior preserves a weakened form of the correlation structure of the likelihood, in comparison to using a simpler (e.g. independent) form, and leads to several attractive resultant properties in posterior estimates (Section 3.1). Also, by not assuming an a priori independence structure on ß, we can avoid imposing posterior independence among the TF effects, which appears biologically more meaningful. The choice of the scalar g, which controls the penalty for choosing models, is discussed further in Section 3.2.

For the other parameters, we use standard conjugate prior formulations. Let µ = ((µkj)) and {sigma}2 = ({sigma}Formula,...,{sigma}Formula)(1 ≤ j ≤ T,1 ≤ k ≤ K). We take µkj~N(mk0,vFormula), Formula~Gamma(w0/2,S0/2), {tau} = ({tau}k1,{tau}k2,...,{tau}kK)~Dirichlet({omega}k = ({omega}k1,...,{omega}kK)), and ({pi}1,...,{pi}K)~Dirichlet({alpha}0 = ({alpha}01,...,{alpha}0K)). As a hierarchical prior for u, we set ul|{eta}~Bernoulli({eta}) (1 ≤ l ≤ p) and {eta}~Beta({epsilon}1,{epsilon}2).


    3. PARAMETER ESTIMATION IN THE HMRM
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
After developing the model in Section 2, we now introduce a hybrid Monte Carlo procedure for efficient model fitting and parameter estimation. Let {theta} = (µ, ß, {sigma}2, {tau}) denote the set of all parameters in the model. To start with, assume that the total number of states K is fixed and known. The complete data posterior distribution, integrating out the random effects {psi}, is given by

Formula (3.1)

We update the model parameters through the following iterative procedure: (i) covariate selection: update u|Y, X, Z, {theta}, (ii) state updating: update Z|Y, X, u, {theta}, and (iii) parameter updating: update u|Y, X, u, Z. Details of each step are provided in the next section.

3.1 Covariate selection

The initial number p of covariates included in the model may potentially be very large. Covariates that are not significantly correlated with the response may introduce noise and result in inaccurate fitting, especially as the clusters are determined by the state-specific regression relationship between the response and the covariate. We thus include a variable selection step using the evolutionary Monte Carlo (EMC) procedure, which has been shown to be highly efficient in high-dimensional problems (Liang and Wong, 2000Go).

EMC procedure. EMC is a "population-based" Monte Carlo procedure that involves (i) sampling simultaneously from parallel Monte Carlo chains using "tempered" versions of the target distribution, ranging from lowest (target distribution) to highest ("flattest" distribution), maximizing mixing and (ii) using local Metropolis-Hastings moves of "mutation" and "crossover."

In the variable selection step, we need to compare models of dimensions |u| and |v| (|u| = {sum}Formulaul). We first calculate a marginalized likelihood integrating out the parameters whose dimensions vary with u. With conjugate priors for ßk and {sigma}Formula, we get

Formula (3.2)


Formula (3.3)

where IBeta,b(c,d) denotes the inverse ratio of the normalizing constants for the beta distributions Beta (a + c,b + d) and Beta(a,b) and Formula denotes the ratio of normalizing constants for the inverse- gamma distributions IG(a + c,b + d) and IG(a,b). IDir{omega}k(tk) denotes the ratio of normalizing constants for the Dirichlet distributions Dir(tk + {omega}k) and Dir({omega}k), where tk = (tk1,...,tkK); tkl = {sum}Formula{sum}Formula1[Zi,j – 1 = k,Zij = l] is the number of observed transitions between the states k and l. m1 = (m11,...,m1K), m1k = {sum}Formula1[Zi,1 = k], and nk = {sum}Formula{sum}Formula1[Zij = k]. Also,

Formula (3.4)

where the posterior estimates for the covariance and mean of ßk are

Formula (3.5)

Rk can be interpreted as a cluster-specific "residual" (see Appendix A in supplementary material available at Biostatistics online). Also, for a large value of g, the Rk term has the attractive property of reducing to a scaled version of the frequentist residual sum of squares in the regression framework (Appendix B in supplementary material available at Biostatistics online). Conversely, for a small value of g, implying a strongly informative prior for ß, Rk represents a sum of squares scaled by the "involutory" matrix (I 2Hk), where (I – 2Hk)T(I – 2Hk) = I (Harville, 1997Go). It is interesting to note that while YFormula(IHk)Y(k) = YFormula(I Hk)(IHk)Y(k) represents the residual sum of squares, YFormula(I – 2Hk)(I – 2Hk)Y(k) = YFormulaY(k) is the total sum of squares. Since (I – 2Hk) is not positive definite (though it is nonsingular), taking a very small value of g may lead to instability or inestimability of parameter estimates in case YFormula(I 2Hk)Y(k) is negative. Thus, it is desirable to perform a few pilot runs before selecting an appropriate g for the analysis. Further details of the EMC procedure are given in Appendix C in supplementary material available at Biostatistics online.

A remaining task is to sample the states and parameters. A Gibbs sampling procedure may be used to sample states, 1 gene at a time, successively from the conditional distributions P(Zi,t|Zi,1,...,Zi,t – 1,Zi,t + 1,...,Zi,T,X,u,{theta}). However, a more efficient method in this scenario is a recursive data augmentation step based on "forward sum-backward sampling" techniques used in the computation of the likelihood in HMMs (Gupta and Liu, 2003Go). This represents a "grouped" sampling step rather than a conditional update and has been shown to have better convergence properties than the Gibbs sampler (Liu and others, 1994Go). Since conjugate priors are used, each of the updating steps is of closed analytical form and straightforward. The state and parameter updating steps are detailed in Appendices D and E in supplementary material available at Biostatistics online.

3.2 Sensitivity of covariate selection to the choice of g

A large g in the prior for ßk results in robust posterior inference for ß but over-penalizes larger models. As g->{infty}, the Bayes factor for comparing any choice of predictors to the null model tends to 0, making the model selection procedure inconsistent (see Appendix F in supplementary material available at Biostatistics online and Bartlett, 1957). For regular families, the choice of g = n represents a "unit information prior", leading to Bayes factors that behave as the BIC (Kass and Wasserman, 1995Go). In the case of HMMs, this result does not hold as the observations are correlated. Our empirical studies suggest that the BIC tends to overestimate the number of components.

One alternative is to take a hyperprior on g. By assuming g~ Inv-{chi}2({nu}), (ßkß0)~t{nu}(·;{sigma}Formula(XTX) – 1), a scaled t distribution with {nu} degrees of freedom. The robustness of the t distribution makes it an attractive alternative to using a fixed g-prior. In the case of the marginal distribution (3.2), integrating out g is analytically intractable; however, one can sample g from its posterior distribution during the Markov chain Monte Carlo procedure. This involves adding a sampling step for g, using

Formula

In applications, a numerically stable procedure appears to be initiating the algorithm with a large g value and sampling g with other parameters once the algorithm stabilizes. In general, one must check that the diagonal elements of g(XTX) – 1 are not too small.


    4. BAYESIAN CRITERIONBASED MODEL SELECTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
To estimate the number of states K in the HMRM, a usual approach would be calculating the Bayes factor for models with different K. The HMRM involves the latent variables u and Z relating to the included covariates and hidden states. To evaluate the Bayes factor, we need to compute the marginal probabilities P(Y|M,X) = {int}{theta}{sum}u{sum}zP(Y|M, X, {theta}, Z, u)P({theta}|M)P(Z)P(u)d{theta}, for each model M, integrating out the latent variables. These sums are analytically intractable, and the computational methods for calculating the Bayes factor (Meng and Wong, 1996Go), (Chen and Shao, 1997Go), (, green95) are typically either computationally expensive or unstable due to the irregular nature of the likelihood function.

Instead, we use an alternative approach through a Bayesian goodness-of-fit statistic constructed from the posterior predictive distribution of the data. The L-measure (Ibrahim and others, 2001Go) and its calibration distribution allow the formal comparison of 2 competing models. Let Wi(i = 1,...,N) denote the future values from an imagined replicate experiment, with the same sampling distribution as Yi(i = 1,...,N) as in Section 2. The generalized L-measure (Ibrahim and others, 2001Go) is

Formula (4.1)

where the expectations are taken with respect to the posterior predictive distribution P(W|y) and 0 ≤ {nu} ≤ 1 (see Appendix G in supplementary material available at Biostatistics online for details). The 2 terms in (4.1) can be interpreted as corresponding to a variance and bias term, and hence choosing {nu} = 1 would reduce (4.1) to the matrix of Euclidean distance between y and W. An attractive feature of the L-measure approach, compared to Bayes factors, is that this criterion gives an absolute judgment of whether the selected model provides adequate fit to the data, rather than a comparison between models. To calibrate the L-measure, we compute the empirical distribution of Formula, where Formula and Formula denote the estimated trace of the L-measure for the "candidate" and "true" models. In practice, as the true model is unknown, we follow Ibrahim and others (2001)Go in replacing the true model by the "criterion-minimizing" model and generating samples Formula from the prior predictive distribution of y.


    5. SIMULATION STUDIES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
As a proof-of-principle test for the new approach, simulation studies were conducted to test the (i) consistency of the method in estimating parameters in the presence of noise and increasing dimensionality of the data and (ii) robustness of the method to misspecification of hyperparameters.

Data were generated from a 2-cluster and 5-cluster model with 400 response variables (genes) and 15 true covariates (motifs), with measurements over 18 time points. The motif scores were simulated from a set of real yeast PSWMs through the scoring function described in Section 2.2. We also generated a set of 35 dummy covariates corresponding to each gene, some of which resembled true motif scores not significantly correlated with the response and the remaining as noise. Given the motif scores and the cluster identity Zij at time point j, for gene i, the distribution of gene expression scores was given by

Formula

We tested the algorithm both by fixing g such that g(XTX) – 1 ranged between 1 and 100 and by allowing g to vary once the algorithm had achieved some stability—either way, we observed no significant variability in the posterior estimates. The next tests for consistency and robustness were done assuming that the total number of true states K was known. Autocorrelation functions for all parameters for different specifications of {xi}0 were seen to have a maximum autocorrelation of 0.05 at lag 5 (supplementary material available at Biostatistics online), so the algorithm was assumed to have converged to the stationary distribution over the range of iterations being considered (with a burn-in of about 5000). A typical run of 50 000 iterations on a data set of the size mentioned took a median CPU time of 23.43 h on a IBM BladeCenter cluster with dual Intel Xeon 2.4 GHz nodes running RedHat Linux 7.3.

5.1 Tests for consistency

We tested the performance of the algorithm under increasing levels of noise, by introducing sets of "noise" variables into the algorithm, 15, 25, and 35 variables in turn. Results are presented for the K = 5 data set; the K = 2 data set gives even more accurate results due to the decreased complexity of the model and are not shown here. Table 1 shows that increasing the number of covariates has almost no effect on the selection of the correct ones. This observation suggests that the algorithm is robust to mis-inclusion of covariates which have no effect on the response and thus should be able to efficiently select covariates even when the total dimensionality is high, as long as the true covariates have a significant effect on the response. To estimate the performance of the algorithm when unknown or unmodeled motifs interact with the pathway, we also ran a similar analysis excluding 1 and 2 true motifs from the model fitting step (Table 1). This leads to a very slight increase in the errors of variable selection. The parameter estimates are consistent, with the exception of {sigma}2, for which the informative prior appears to have a biasing effect for clusters of very small sizes.


View this table:
[in this window]
[in a new window]

 
Table 1. Comparing misclassification and variable selection rates for simulation studies, for number of states in the HMRM equaling 5. "Var sel" denotes the results of variable selection, "Sens" denotes the sensitivity or proportion of correct variables found, and "Spec" denotes specificity, the proportion of selected variables that are correctly selected. "1 – Misc(%)" is 1 minus the misclassification rate, based on state allocation by the sampler

 
5.2 Tests for robustness to hyperparameter specification

We tested the performance of the algorithm under varying amounts of noise from the random effects component {xi}0 and under varying degrees of misspecification of {xi}0. Results (Figure 1 in Appendix in supplementary material available at Biostatistics online) show that the algorithm performs consistently well for a range of values of {xi}0 (given in the log scale), but as expected, at extremely high values of {xi}0 the performance declines as more noise is introduced into the data. However, the performance is extremely robust to parameter settings—for lower true values of {xi}0, the magnitude of set values of {xi}0 appears to have no effect on either the MSEs of parameter estimates, misclassification rates, or correct selection of covariates.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The 95 % posterior intervals for the motif and time coefficients for each state in the HMRM with (1) K = 3 and (2) K = 5. On each panel, the left column gives the names of the TF motifs and the number on the right gives the posterior marginal frequency of selection of each covariate in the model. The horizontal lines give the posterior confidence intervals, with each state being denoted by a different line type (The solid, dashed, and dotted lines in panel 1 give the intervals for the regression coefficient in each of the 3 states, while the 5 line types in panel 2 represent the intervals for K = 5).

 
5.3 Model selection using the L-measure

For the simulated data set, we next applied the model selection method based on the L-measure to test whether the method could recover the true value of K. The results here are presented for the K = 2 data set. We used the posterior estimates to construct the L-measure for values of {nu} varying between 0 and 1. It can be seen that the true model with K = 2 outperforms all the other models (Table 2). To check whether the fitted model was not significantly different from the true, we constructed the calibration distribution for the difference Formula, where Formula denotes the L-measure for the true model with the true parameter values. For almost all values of {nu} (except for very small ones), it appeared that the model fits quite well. As a comparison, we also computed the AIC and BIC model selection criteria based on the likelihood calculated through recursion (Appendix D in supplementary material available at Biostatistics online) at the model parameter estimates for K = 2,3,4. As can be seen in Table 2 , while AIC and BIC tend to slightly overfit the data, the L-measure clearly picks out the correct number of states. Since the effective sample size (ESS) is an issue with the correct calculation of the BIC for the dependent data, we computed the BIC for both extremes with sample size equaling N (underestimate, assuming each gene contributes an ESS of 1) and NT (overestimate, assuming that measurements for a gene over all time points are independent); the true ESS would be between these 2 extremes. However, since overfitting occurs for both extremes, it appears that the L-measure is probably a more accurate model selection criterion in this case.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparison of model selection criteria for simulation study. The columns give the number of states K, the mode of the log-likelihood, the AIC, the BIC with sample size N (N x T), and the L-measure evaluated for {nu} = 0.5. The optimal choice of K for each criterion is given in bold, showing that only the L-measure gives the correct selection of K, while AIC and BIC overfit by choosing 3 states

 

    6. CASE STUDY: ANALYSIS OF YEAST CELL-CYCLE DATA
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
In this section, we describe the analysis of the yeast cell-cycle data set (Spellman and others, 1998Go) applying the new methodology. The HMRM model was fitted to the yeast cell-cycle data over a range of values for the total number of states K, with a starting set of 24 motif covariates, and the L-measure used to determine the optimal number of states. We chose the hyperparameter settings as suggested by our simulation studies, {xi}0 in a range of 0.1–5 and g in the range 1–100, and the results were virtually identical within this range. We report here the results for the settings of {xi}0 = 0.1, g = 100, {epsilon}1 = {epsilon}2 = 2, {omega}kk = 1, and {omega}k, k = 9 (k = 1,...,K) and data-dependent priors for S0 = Var(Y) and Formula Table 3 indicates that K = 3 or 4 is almost identical to the model with the lowest value of the L-criterion (K = 5).


View this table:
[in this window]
[in a new window]

 
Table 3. Model selection based on L-criterion for yeast cell-cycle data. Based solely on the L-measure, the optimal model choice is for K* = 5. Using the calibration distribution for D(y, {nu}) shows that models with K = 3, 4, or 5 are essentially indistinguishable (95% posterior intervals of the difference almost symmetric about 0), hence we may choose the most parsimonious model with either K = 3 or K = 5 for inference

 
6.1 Data analysis

Posterior convergence diagnostics were computed using the R CODA package. Autocorrelation plots for all parameters and trace plots of the number of selected variables (Figure 3 in supplementary material available at Biostatistics online) showed adequate convergence on taking about 50 000 iterations of the sampler. Posterior samples were used to make inference regarding the classification of genes and to judge whether there were significant effects of certain TFs on groups of genes. Figure 1 gives the motifs that showed significant effects (posterior probability of selection greater than 50%) and the 95% confidence intervals for the regression coefficient for each state, with K = 3 and 5. For almost all selected motifs, the posterior intervals do not overlap between states and in at least 1 state show a highly significant positive or negative effect on gene expression. For K = 5, significant positive effects (details in supplementary Table 2 available at Biostatistics online) are shown by the TFs GAL4, GCR1, MATalpha2, MIG1, XBP1 (state 1); CSRE, GCN4, MCM1, PHO2, UASPHR (state 2); and MCB, PHO4, RAP1, STE12, SWI5 (state 4); while significant negative effects are shown by CAR1 repressor, CSRE, MCM1, PHO4, RLM1, ROX1, STE12, UASPHR (state 1); GCR1, MCB, MIG1, RAP1, SWI5, XBP1 (state 2); CSRE, GAL4, GCN4, GCR1, MATalpha2, UASPHR (state 4); and SMP1 (state 5). State 3 overall seems to show weaker motif effects. The significant TFs picked out by K = 3 states include many of the same motifs, suggesting that the extra states are formed by subdividing some of the previous states (e.g. state 1 is similar for the 2 sets, while state 2 for the K = 5 model is similar to state 3 in the K = 3 model). By comparing to the motifs listed in Spellman and others (1998)Go, we see that a number of factors known to have strong effects on the cell cycle, such as MCB, MCM1, and SWI5, are detected by the method, as well as motifs that are active at specific time points (Table 4). By jointly modeling temporal expression with motif sequence scores, we thus can get a simultaneous picture of groups of genes regulated by certain TFs that may have opposing effects in different groups. Since the groups are allowed to change over time, this implies that we can uncover the pattern of regulation of TFs over the cell cycle, not being limited by a fixed grouping of genes.


View this table:
[in this window]
[in a new window]

 
Table 4. TFs selected by (i) stepwise regression separately over each time point and (ii) the HMRM. The " + " and " – " signs denote positive and negative motif effects, based on the 95% confidence (posterior) interval for the regression coefficient. The confidence (posterior) intervals are given in Tables 1–3 in the supplementary material available at Biostatistics online

 
6.2 Biological validation

We compared our results with previous inference on the same data set (Spellman and others, 1998Go) and also reanalyzed the data using a stepwise multiple regression approach on the lines of Conlon and others (2003)Go. The stepwise method uses each time point separately and assumes all genes in a single group without clusters. It uses the AIC and a forward–backward procedure to select significant covariates. In comparison, the HMRM detects the overall influence of motifs over time and allows genes to belong to different states at different times. Table 4 indicates that the HMRM succeeds in uncovering more motif effects that are also observed to be cluster specific. Motifs that are not found significant by HMRM are not found by stepwise regression, with the exception of TBP. The stepwise method misses known cell-cycle regulators MCB and SWI5, while MCM1 signals are weak (picked up at only 1 time point). One reason why they might not appear significant is that they have opposing effects in subgroups of genes at different time points (e.g. SWI5 has a negative effect in group 1 and a positive effect in group 3 for the K = 3 model); opposing effects nullifying the overall one. Also, although a few TFs show continual effects over a phase of the cell cycle (e.g. GCN4, MIG1), in many cases most motifs only show significant effects sparsely, as no information is borrowed over neighboring time points to judge whether the overall effect over a period of time is significant, which is biologically more meaningful.

We compared how the clusters based jointly on motif effects and gene expression correlate to groups of genes categorized by their involvement in a functional pathway over certain points in the cell cycle, as shown in Figure 7 of Spellman and others (1998)Go. Our results indicate that a number of functional groupings at particular phases of the cell cycle are shown to have high overrepresentation in certain clusters (Table 5): for example, genes involved in DNA repair (G1), DNA synthesis (G1), cell-cycle control (G1), budding (G1 and M/G1), mitosis (M), nutrition (M), and mating (M/G1). Even more promising is the fact that the analysis for gene clusters are remarkably consistent for the K = 3 and K = 5 models; the pathway-related genes are found to be grouped together at the relevant phase irrespective of the model chosen, which is important for robust inference. The only significant difference is for the budding:fatty acids pathway for which the K = 3 and K = 5 models pick out 2 different groups, which are active at different points of the cell cycle, hence are still consistent. It can be noted here that a single cluster–based analysis, such as the stepwise regression approach, does not provide us a mechanism for attempting this kind of pathway inference which can generate useful biological hypotheses for further testing.


View this table:
[in this window]
[in a new window]

 
Table 5. Functional group enrichment of genes under the models with K = 3 and K = 5. Column 1 denotes the functional pathways discovered by Spellman and others (1998)Go. Column 2 denotes the proportion of genes involved in a function that were predicted by the HMRM as being in the same state at a time point corresponding to the phase they are known to be active in the cell cycle (Column 3). The correspondence between phase and time point is determined from Spellman and others (1998)Go. Column 4 gives the gene names and Column 5 the state label in the HMRM

 

    7. DISCUSSION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 
Treating gene expression clustering as a temporal variable may help in discovering relationships between functional sequence motifs and groups of genes they regulate. Differing groups of genes may behave as a "cluster" at different time points, influenced by different groups of TFs. If 2 groups of genes are "differentially regulated" by a TF, for example, in one group the TF acts by inducing the response, while in the other it has a repressive effect, grouping the genes together will lead to losing the effect of the TF for the entire set. In order to uncover the relationships between TFs and genes that are involved in biological pathways of interest, it is thus desirable to determine how these TF effects may vary between groups and over time.

The HMRM framework allows for (i) determining covariates (motifs and phase of cell cycle) that have significant effects on the response (gene expression) (ii) covariate effects varying between states, and (iii) gene clusters varying over time. The hierarchical model framework also leads to nicely interpretable properties of the posterior estimates and asymptotic comparability to the frequentist framework. In addition, although the model as discussed here does not directly find novel motifs, a de novo motif discovery method may be formulated that uses the model predictions to generate groups of genes for motif discovery. Our approach also seems to induce a "tighter" clustering of genes, by grouping noisy genes which are not significantly affected by the covariates as a separate cluster, as seen in the yeast cell-cycle application. However, a principled way to induce tight clustering under a model-based framework still needs to be studied in detail. Our approach allows a novel, and to our belief the first, model-based method for determining groups of genes being influenced by separate sets of covariates over time. Application to yeast cell-cycle data succeeds in detecting "regulatory modules"—groups of genes regulated by sets of TFs that match previous biological knowledge. The joint modeling approach also succeeds in discovering more known functional TFs than the stepwise method in the yeast data (e.g. MCB, SWI5)—one likely reason for this is the separation of effects of groups of TFs that have differential effects on groups of genes.

Our approach to determining the number of clusters K is based on a Bayesian model choice criterion, the L-measure. The use of this approach is supported by the fact that (i) the number of biologically "interesting" clusters are within an approximately known, small range of values and (ii) inference for the significant covariates appears consistent for a small range of K around the "optimal" value. For instance, for the yeast cell-cycle data, if we choose 5 states instead of 3, the gene-cluster allocation and cluster-specific covariates remain essentially the same, with a few new genes (having weaker covariate effects) being substratified. Another promising direction for avoiding the model choice issue is by considering extensions to models such as the infinite mixture model based on the Dirichlet process, which we intend to explore further in future work.


    ACKNOWLEDGMENTS
 
This research was supported by funding from the National Institutes of Health and the Environmental Protection Agency. The authors are grateful to 2 anonymous referees and the editor, whose comments significantly improved the content and presentation in this article. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. A HIDDEN MARKOV...
 3. PARAMETER ESTIMATION IN...
 4. BAYESIAN CRITERION-BASED...
 5. SIMULATION STUDIES
 6. CASE STUDY: ANALYSIS...
 7. DISCUSSION
 REFERENCES
 

    BartlettS M. A comment on D.V. Lindley's statistical paradox. Biometrika (1957) 44:533–534.[Free Full Text]

    Bussemaker HJ, Li H, Siggia ED. Regulatory detection using correlation with expression. Nature Genetics (2001) 27:167–174.[CrossRef][Web of Science][Medline]

    Chen M-H, Shao Q-M. On Monte Carlo methods for estimating ratios of normalizing constants. The Annals of Statistics (1997) 25:1563–1594.[CrossRef]

    Conlon EM, Liu XS, Lieb JD, Liu JS. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences of the United States of America (2003) 100:3339–3344.[Abstract/Free Full Text]

    Green PJ. Reversible jump MCMC and Bayesian model determination. Biometrika (1995) 82:711–732.[Abstract/Free Full Text]

    Gupta M, Liu JS. Discovery of conserved sequence patterns using a stochastic dictionary model. Journal of the American Statistical Association (2003) 98:55–66.[CrossRef][Web of Science]

    Harville DA. Matrix Algebra from a Statistician's Perspective (1997) New York: Springer.

    Holmes I, Bruno W. Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proceedings of International Conference on Intelligent Systems for Molecular Biology (2000) 8:202–210.

    Ibrahim JG, Chen M-H, Sinha D. Criterion-based methods for Bayesian model assessment. Statistica Sinica (2001) 11:419–443.[Web of Science]

    Kass RE, Wasserman L. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. Journal of the American Statistical Association (1995) 90:928–934.[CrossRef][Web of Science]

    Liang F, Wong WH. Evolutionary Monte Carlo: applications to cp model sampling and change point problem. Statistica Sinica (2000) 10:317–342.[Web of Science]

    Liu JS, Neuwald AF, Lawrence CE. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association (1995) 90:1156–1170.[CrossRef][Web of Science]

    Liu JS, Wong WH, Kong A. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika (1994) 81:27–40.[Abstract/Free Full Text]

    Liu X, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology (2002) 20:835–839.[Web of Science][Medline]

    Meng XL, Wong W. Simulating ratios of normalising constants via a simple identity: a theoretical exploration. Statistica Sinica (1996) 6:831–860.[Web of Science]

    Quinn BG. Estimating the number of terms in a sinusoidal regression. Journal of Time Series Analysis (1989) 10:71–75.

    Segal E, Taskar B, Gasch A, Friedman N, Koller D. Rich probabilistic models for gene expression. Bioinformatics (2001) 17:S243–S252.[Abstract]

    Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell (1998) 9:3273–3297.[Abstract/Free Full Text]

    Wang W, Cherry JM, Nochomovitz Y, Jolly E, Botstein D, Li H. Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation. Proceedings of the National Academy of Sciences of the United States of America (2005) 102:1998–2003.[Abstract/Free Full Text]

    Zellner A. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti—Goel PK, Zellner A, eds. (1986) Amsterdam, The Netherlands: North-Holland. 233.

    Zhu J, Zhang M. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics (1999) 15:607–611.[Abstract/Free Full Text]

    Received May 15, 2006; revised December 1, 2006; accepted for publication February 19, 2007.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow Supplementary Material
    Right arrow All Versions of this Article:
    8/4/805    most recent
    kxm007v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Gupta, M.
    Right arrow Articles by Ibrahim, J. G.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Gupta, M.
    Right arrow Articles by Ibrahim, J. G.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?