Biostatistics Advance Access originally published online on May 28, 2007
Biostatistics 2008 9(1):51-65; doi:10.1093/biostatistics/kxm014
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Integrating quantitative information from ChIP-chip experiments into motif finding
Department of Statistics, 1300 University Avenue, University of Wisconsin, Madison, WI 53705, USA
*
Department of Statistics and Department of Biostatistics and Medical Informatics, 1300 University Avenue, University of Wisconsin, Madison, WI 53705, USA keles{at}stat.wisc.edu
* To whom correspondence should be addressed.
| SUMMARY |
|---|
Identifying binding locations of transcription factors (TFs) within long segments of noncoding DNA is a challenging task. Recent chromatin immunoprecipitation on microarray (ChIP-chip) experiments utilizing tiling arrays are especially promising for this task since they provide high-resolution genome-wide maps of the interactions between the TFs and the DNA. Data from these experiments are invaluable for characterizing DNA recognition profiles (regulatory motifs) of TFs. A 2-step paradigm is commonly used for performing motif searches based on ChIP-chip data. First, candidate bound sequences that are in the order of 500–1000 bp are identified from ChIP-chip data. Then, motif searches are performed among these sequences. These 2 steps are typically carried out in a disconnected fashion in the sense that the quantitative nature of the ChIP-chip information is ignored in the second step. More specifically, all bound regions are assumed to equally likely have the motif(s), and the motifs are assumed to reside at any position of the bound regions with equal probability. We develop a conditional two-component mixture (CTCM) model that relaxes both these common assumptions by adaptively incorporating ChIP-chip information. The performances of the new and existing methods are compared using simulated data and ChIP-chip data from recently available ENCODE studies (Consortium, 2004
Keywords: ChIP-chip data; Conditional mixture models; Piecewise constant linear regression; Prior models; Regulatory motif finding; Tiling microarrays
| 1. INTRODUCTION |
|---|
Annotation of the transcription factor (TF) binding sites in a given genome is crucial for building genome-wide regulatory networks of coding DNA to elucidate complex interactions among genes. Each TF recognizes a unique family of short sequence elements, usually between 5 and 20 bp in length. A recent technological innovation, termed ChIP-chip for chromatin immunoprecipitation coupled with microarray analysis, has enabled researchers to identify regions of a given genome that are bound by specific TFs by creating high-resolution genome-wide maps of the in vivo interactions between DNA-associated proteins and DNA. This technology has been successfully applied using both high-density oligonucleotide arrays (a.k.a. tiling arrays) (Affymetrix arrays, Cawley and others, 2004
Motif finding based on ChIP-chip data traditionally consists of 2 consecutive steps. In the first step of the ChIP-chip data analysis, one is concerned with deducing tens or hundreds of bound regions using the ChIP-chip data on millions of probes on a microarray. The noisy microarray technology and the scarce number of replicate experiments make this task challenging. Several statistical and computational methods have been developed for the identification of bound regions from ChIP-chip data (Cawley and others, 2004
, Kim and others, 2005
, Ji and Wong, 2005
, Li and others, 2005
, Kele
and others, 2006
, Kele
, 2007
). Although substantial differences exist among these methods due to modeling assumptions and the targeted array platforms, for example, single-channel or 2-channel high-density oligonucleotide arrays, they all aim to partition the tiled regions of the genome into regions that are either bound or unbound. Bound regions refer to the set of probes which, as a whole, show evidence of enrichment in the treatment sample in comparison to the control sample, whereas the unbound regions are hypothesized to be equally enriched in both groups. Due to the spatial structure of the ChIP-chip data, enrichment occurs over the interval of a set of probes rather than for singletons. As typical outcome, each method reports a set of coordinates corresponding to start and end sites of the peaks and a summary measure for each peak. These coordinates are obtained controlling the false discovery rate (Benjamini and Hochberg, 1995
) at a user-specified threshold using either an empirically driven cutoff or multiple testing procedures. The nature of the summary measures varies according to the analysis methods but all these, directly or indirectly at one point in the analysis process, assign a measure of enrichment evidence to each probe (e.g. posterior probability of binding; Ji and Wong, 2005
, Kele
, 2007
).
In the second step of the ChIP-chip data analysis, popular motif-finding algorithms such as Bailey and Elkan (1995)
and Liu and others (2001)
are used to search for enriched subsequences in these regions. Input to these algorithms are genomic sequences from the identified peak regions. There is a vast number of motif-finding methods but a large portion of these, at the core, relies on the two-component mixture (TCM) model proposed by Lawrence and Reilly (1990)
and Lawrence and others (1993)
. Let Xi,k
{A,C,G,T} denote the nucleotide at position k of sequence i and Li denote the length of this sequence. Let Xi = {Xi,k}
represent all nucleotides in sequence i. The observed data are represented by N i.i.d. random variables {X1,...,XN}. The data-generating distribution is assumed to be composed of 2 components, namely the background and the motif component. The background component is typically a low-order Markov chain, for example, third order, over the DNA alphabet. The motif model is described by a product multinomial where each position of the binding site is assumed to have an independent multinomial distribution. This resulting representation for the binding site is also referred to as a position weight matrix (PWM) model. Here, a position weight matrix refers to 4 by length of the binding site columns where each column corresponds to a multinomial distribution (Stormo, 1982
). Additionally, these models are described via a hidden random variable Z that takes on values 0 or 1 depending on whether or not the sequence under consideration has 0 or more copies of the motif of interest. Thus, the distribution of the ith sequence is given by
![]() |
where Pr(Zi = j) =
j is assumed to be the same for all the sequences under consideration. For the case where the ith sequence has a single copy of the motif, we have
![]() |
where W denotes the length of the binding site. Here, each Yij, j
{1,...,Li}, takes on the value 0 or 1 depending on whether or not the motif starts at this particular jth position. Currently, all the available methods assume a discrete uniform distribution for the start site, that is Pr(Yij = 1|Zi = 1) = 1/(Li – W + 1).
The above 2-step framework has 2 limitations that are apparent from the underlying model: (1) each region is assumed to equally likely have the motif(s) of interest and (2) motifs are assumed to reside at any position of the bound regions with equal probability. In this paper, we address relaxation of these 2 assumptions by developing an adaptive framework that integrates quantitative information from ChIP-chip experiments into motif finding. The second assumption is especially limiting if the regions considered are long. Although the binding sites localize at a particular position of the 500-to 1000-bps-long regions (namely close to the center of the peak), currently available motif-finding algorithms ignore this information and treat each position of the sequences equally likely to be a motif start site.
In the remainder of the paper, we empirically investigate the amount of quantitative information available in ChIP-chip data using real case studies. We then present a conditional two-component mixture (CTCM) model that incorporates ChIP-chip data into motif finding beyond the use of bound sequences. The performance of this modeling framework is compared with the 2-step analysis framework using simulated data and data from ChIP-chip case studies.
| 2. REGIONS WITH HIGH CHIP-CHIP SCORES ARE MORE LIKELY TO HAVE THE MOTIF(S) OF INTEREST |
|---|
To motivate our approach further, we first investigate the amount of quantitative information available in ChIP-chip data using the Stat1 and c-Jun ChIP-chip experiments from the Encyclopedia of DNA Elements (ENCODE) regions. ENCODE is a public research consortium devoted to identifying all functional elements in the human genome sequence and currently focuses on 1% of the whole human genome sequence. Snyder lab from Yale identified 345 Stat1 and 200 c-Jun binding sites in the ENCODE regions using Nimblegen tiling arrays. We scored both the Stat1 and the c-Jun bound regions by their corresponding position weight matrices from TRANSFAC (Wingender, 1994
|
| 3. CTCM MODEL |
|---|
In order to directly incorporate quantitative ChIP-chip data into motif finding, we consider modeling the sequence data conditional on the ChIP-chip scores. Before presenting our model, we will give a brief review of the TCM model (Bailey and Elkan, 1995
denotes the probability that a given position is a motif start site rather than a background site. Due to high computational complexity of the TCM model, Bailey and Elkan (1995)
denote the subsequence of length W starting at position j in sequence Xi. Let Zi,j be the indicator variable denoting whether a motif starts at position j in sequence Xi, that is, whether sequence
is a realization from the position weight matrix. The pseudo–data set is modeled as though it comes from a TCM model assuming that all the
are independent of each other.
' represents the mixing proportion in the modified model. Although the independence assumption is violated due to the overlap of the sequences, Bembom and others (2007)
conditional on the hidden variable Zi,l that denotes whether or not the lth segment in sequence i has a motif is given by
![]() |
where
0 refers to parameters of the background model and
W refers to parameters of the motif product multinomial model. Thus,
0 consists of the background multinomial probabilities p0j, j = 1,...,4, and
W consists of the product multinomial parameters pkj, k = 1,...,W and j = 1,...,4. We will let
to denote all the parameters,
0,
W,
0, in the model.
Let T = (T1,...,TN), where Ti = (Ti,1,...,Ti,Li), denote the ChIP-chip scores for each base pair in each sequence. These scores are driven by the method used to analyze the ChIP-chip data. They could correspond to the posterior probabilities of binding (Ji and Wong, 2005
, Kele
, 2007
), smoothed averages of the probe-level intensities, or simply some choice of probe-level test statistics. As noted in Section 1, current tiling array designs do not usually provide single base pair–level resolution due to monetary limitations; therefore, the extent of the available ChIP-chip information is limited by the resolution of the array used. Here, all the bases within a probe are assigned the ChIP-chip value of the probe. ChIP-chip scores for the regions in between the probes are obtained by interpolation with simple averaging. We assume that conditional on the motif occurrence and location random variables, the sequence data are independent of the ChIP-chip data. This explicitly corresponds to the assumption
. Next, consider the following conditional probability model:
![]() | (3.1) |
where
f denotes the parameters of the conditional distribution of Z given T, and the conditional independence assumption makes the first term in the sum independent of the ChIP-chip scores. Next, we consider 3 alternatives for modeling Pr(Zil = 1 | Til).
(M1).The first alternative considers a beta prior formulation for Pr(Zil) as follows:
|
| (3.2) |
|
| (3.3) |
where
is the unknown scaling factor. Then, by integrating out
il, we get
|
|
The following logistic regression model
|
| (3.4) |
models the log-odds of having a motif starting at position l as a linear function of the ChIP-chip score at that position.
A semiparametric alternative to the above formulations is a piecewise constant model. In this model, the ChIP-chip scores are partitioned into k intervals with boundaries t1,t2,...,tk,tk + 1. Within each interval, Pr(Zil = 1 | Til) takes on an unknown constant value given by
![]() |
This model is capable of approximating both the linear and the nonlinear dependencies of Z on T and is therefore a semiparametric alternative to the above beta prior and logistic regression formulation that both assume restricted functional forms.
| 3.2 Inference for the CTCM model |
|---|
Parameter estimation in all 3 formulations can be carried out using the standard expectation–maximization (EM) (Dempster and others, 1977
f. On the other hand, we can easily derive closed-form M-step updating formulas for the vector parameter
in the piecewise constant model framework. The expected full-data likelihood of the conditional TCM model at the r step of the EM algorithm is given by
![]() |
Here,
r denotes the parameter estimates after the r – 1th step of the EM algorithm. Then, the EM algorithm steps are as follows:
E-step:
![]() |
where 
= (
, 
) represents the back ground and position weight matrix parameter estimates after the r – 1th iteration.
M-step: For M3, parameter estimates of the piecewise constant regression model are given by
![]() |
where Ik = [tk,tk + 1) and
|
|
where
and
correspond to the estimates of the mixing proportions in the original and modified CTCM models. The parameter
'k is a generalization of the rate parameter
' of the TCM model. Moreover, we have
![]() |
where
![]() |
| 3.3 A technical note |
|---|
The use of M2 formulation for our CTCM model warrants some caution in especially small simulated data sets. Under the condition of perfect linear separability (e.g. sequences with and without the motif are perfectly linearly separable based on their ChIP-chip scores), the proposed EM algorithm can converge to nonfinite estimates for the vector parameter β. This is due to the fact that M-step of the EM algorithm can reduce to exactly a logistic regression log-likelihood and the problem of nonfinite estimates is a known issue (Albert and Anderson, 1984) in the fitting of logistic regression models. Due to the factorization of the likelihood, consistent estimates for the other parameters of the model are still attainable in this pathological case. We prove this property of the EM algorithm for the logistic regression CTCM model algebraically in Section 6 of the supplementary material available at Biostatistics online. From a practical point of view, this is not a concern as such perfect separability is hardly ever encountered in the ChIP-chip data from tiling arrays.
| 3.4 Model selection for the piecewise constant model |
|---|
We note that M3 conditional formulation requires the partitioning of the ChIP-chip score into bins and setting of the bin boundaries. The number of parameters in this model increases as the number of bins increase. Therefore, model selection schemes are needed to prevent overfitting. We utilize the Bayesian information criteria (BIC) to select the number of bins. Given the bin numbers, the bins are set to equal widths to cover the range of the ChIP-chip scores. As will be illustrated in our case study, the visual inspection of the histogram of the ChIP-chip scores often suggests natural bin boundaries.
| 3.5 Previous algorithms utilizing ChIP-chip data in motif finding |
|---|
Currently, there are only a small number of motif-finding approaches (Liu and others, 2002
(3,20) sequences (k regions with the highest ChIP-chip score). For each seed, all the m matches (subsequences with at least m matches to the seed) in the top k regions are found and then are used to generate a position weight matrix. Then, each matrix is evaluated based on an approximate maximum a posteriori scoring function, and the resulting top 10–50 matrices are further calibrated using the rest of the sequences that were not among the top k. Although this approach respects the ranking of the ChIP-chip scores, it does not take into account the second issue, that is motifs are more likely to reside near the center of the peak. More recently, Hong and others (2005)| 4. RESULTS |
|---|
It is now well established that signal-to-noise ratio in motif searches decreases dramatically as the lengths of the sequences become longer or the proportion of sequences with the motif(s) decreases rapidly (Keich and Pevzner, 2002
). Therefore, it is vital to take into account as much prior information as possible when performing motif finding with large number of long sequences. We compared CTCM to TCM of MEME and MDSCAN on 2 case studies. Using publicly available c-Jun and Stat1 ChIP-chip data from the ENCODE project, we generated sequence data sets containing ± 100, 300, 500 flanking bases of the mid points of the peaks (i.e. probe with the highest ChIP-chip score within the peak) and varied the number of bound regions used in the analysis as N = 20,80,200. Details of these data sets are provided in Section 1 of the online supplementary material available at Biosatistics online. With this analysis, we aim to illustrate that one can decrease the signal-to-noise ratio in the sequence data set but can preserve roughly the same performance by utilizing quantitative ChIP-chip information in an adaptive way. In the comparisons below, we feed each method the correct motif width as our focus here is not the identification of the right width. Selection of the motif width W is an integral part of motif finding and has been extensively studied in Bailey and Elkan (1995)
, Kele
and others (2003)
and Bembom and others (2007)
.
A closer look into MEME software code revealed that MEME allows sequence-specific weights that are used to smooth the E-step of the algorithm. These weights are restricted between 0 and 1 and are kept constant during the model fitting process. Although the weights are available in the code, we did not come across their explanation or use in the MEME methodology papers. Nonetheless, we decided to take advantage of these for comparison purposes since smoothing of the E-step aims to down-weight or up-weight contributions of the individual sequences via weights. Note that MEME only uses one weight per sequence, thereby does not incorporate information on likely motif start sites within a sequence. There are no guidelines for setting the weights in MEME, therefore we experimented with
and w
= Ti/maxj
{1,...,N}Tj. CTCM (M1, M2, M3), MEME, and weighted versions of MEME (MEME(w1) and MEME(w2)) were employed with a starting value that is similar to the motifs of interest. MDSCAN does not accept as input a starting position weight matrix. So, in order to make the MDSCAN comparisons fair, we considered top 5 position weight matrices estimated by MDSCAN and used the best one, for example, the one closest to the true PWM in the Euclidean distance. Mean of the componentwise squared distances between the known and the estimated position weight matrices of c-Jun is displayed in Figure 2. As depicted in this figure, CTCM has the smallest distance between the true and the estimated position weight matrices in all cases regardless of the specific CTCM model (M1, M2, or M3) used. Investigating this analysis more closely, we note that CTCM's performance is fairly constant in the c-Jun analysis, especially for the logistic regression (M2) and piecewise constant linear model (M3), as the sample size and length of the sequences are increased. On the other hand, MEME and its weighted versions and MDSCAN have increased mean squared distance especially when the length of the sequences are increased from 100 to 300 and 500 bp. This eludes to the fact that including quantitative information on the potential locations of the motifs provides compensation for the increased noise level. More specifically, utilizing ChIP-chip scores directly, CTCM reduces the search space of motif occurrences. The histogram of ChIP-chip scores and the estimates of the parameter
within each bin in the CTCM (M3) model are provided in Figure 3 for the cases N = 20, L = 100 (top panel) and N = 80, L = 300 (bottom panel) of the c-Jun analysis. In these analysis, the number of bins were set based on the visual inspection of the histograms of ChIP-chip scores. The estimates of
indicate that some of the bins could potentially be collapsed. As illustrated in these plots, the probability that a given base is a motif start site highly depends on the ChIP-chip scores. In the bottom panel of Figure 3, we observe that probe sequences with a ChIP-chip score of 1.52 or higher are twice as more likely to have a motif instance than the probe sequences with a score of 1.52 or smaller. We also note significant differences between weighted and unweighted versions of MEME where the weighted versions often have worse performance than the unweighted version. We elaborate on the potential reasons of these differences within the context of our simulations in Section 4.2. An additional observation is that the performances of the logistic regression model (M2) and piecewise constant linear model (M3) are remarkably similar in this analysis. We do not expect such a result to hold for all the analysis. However, since the logistic regression model does not require model selection over the bin sizes, it might serve as a computationally easier alternative. The best performance for the Stat1 data is also achieved with the CTCM logistic regression model M2. We refer to Section 2 of the supplementary material available at Biostatistics online for a detailed investigation of the model fit to Stat1 data.
|
|
These case studies provide the proof of the principle that including quantitative information from ChIP-chip experiments into motif finding leads to better position weight matrix estimates that are robust against increased noise level in the sequence data. Notably, all the 3 formulations we proposed, namely beta prior, logistic regression, and piecewise constant model provide better means of utilizing ChIP-chip information than ad hoc weighting schemes that are possible via MEME. Overall, even though the logistic regression model is a restrictive parametric model, it performed as better or even better than the flexible semiparametric alternative piecewise constant linear model.
In order to investigate the operating characteristics of our CTCM model framework systematically, we performed simulation studies. The data-generation processes for these simulations are described in detail in Section 4 of the online supplementary material available at Biostatistics online. We varied 2 main parameters, namely, the degeneracy of the underlying position weight matrix and the abundance of the motif within the sequences and considered 3 different position weight matrix classes based on the information content to accommodate varying degrees of degeneracy. These 3 classes, listed in Table 1, are chosen to represent the characteristics of the the known position weight matrices from the JASPAR database (Sandelin and others, 2004
). Our simulations included position weight matrices with the information content 0.965, 1.172, and 1.389 corresponding to roughly 25, 50, and 75 percentiles of the information content distribution of the matrices in the JASPAR database. All these position weight matrices have larger information contents than those of c-Jun and Stat1 position weight matrices. Average number of motifs within a sequence, a, is varied in the set {10%,30%,50%,70%,130%}. A small abundance of 10% is considered since recent case studies indicate that canonical motif might be present in as small as 10% of the identified bound regions (Bieda and others, 2006
).
|
For each scenario, we generated 50 data sets, analyzed them with different methods, and computed a receiving operating characteristic (ROC) curve for each method. In all the data sets, the motif occurrence probability based on the ChIP-chip data is generated from a logistic regression model (see supplementary Table 1 of the supplementary material available at Biostatistics online for details on the choice of the parameters). The logistic regression model was chosen as the simulation model since it fitted the case study data of Section 4.1. For each simulated data set, area under the receiving operating characteristic (AUROC) curve is computed using the trapezoid method implemented in the "R" package "caTools". Box plots of AUROC curves for the factors GA repeat binding protein (GABPA), heterodimeric bHLH protein (TAL1-TCF3), and Broad-complex_2 are provided in the top-panel plots of Figure 4 for subsets of the abundance parameter. These subsets are selected such that a wide range of information content and motif abundance combinations is spanned. All the simulation results are provided in Section 5 of the supplementary material (available at Biostatistics online, Supplementary Figures 4, 5, and 6). Although in practice the final bin size will be determined by the BIC, we display performances of different bin sizes for illustrative purposes in Figure 4. Overall, we note that modeling the motif occurrence probability within each bin improves the performance compared to MEME, which can be thought as having a single bin. In Figure 4, only the results for M3 model of CTCM with 3 different bin sizes are reported. Application of M1 and M2 leads to comparable results with M2 with a better performance since it corresponds exactly to the simulation model.
|
In Figure 4, we first observe that even though model M3 does not correspond to the simulation model generating the ChIP-chip data, it is able to provide a good approximation to the data-generating mechanism and has good operating characteristics overall. When the motif abundance (a) is low (0.1–0.5), incorporating quantitative information from ChIP-chip data gives strikingly better results with higher sensitivity and specificity than the default MEME that treats each sequence uniformly. Notably, our adaptive scheme that models the "weights" for each sequence is performing substantially better than MEME's ad hoc weighting schemes. For the conserved motif GABPA, when we increase the motif abundance, MEME, MEME with second weighting scheme, and our adaptive models are performing about the same, with ROC curves rising sharply (area under the ROC curve is close to 1). However, to our surprise, MEME with the first weighting scheme is performing significantly worse than the unweighted scheme. Since the first set of weights have much smaller numerical values compared to the second set of weights, they reduce the motif occurrence probability of each sequence. In our adaptive framework, relative values of the ChIP-chip scores rather than their absolute values gain importance. If the ChIP-chip information does not add any power to motif discovery,
values are expected to have similar estimates for different bins. This flexibility is important for robustness against the measurement and modeling error in ChIP-chip scores. We also note that for degenerate motifs such as Broad-complex_2, CTCM models are substantially better than MEME and the weighted versions of MEME even at high motif abundance cases. We perform another comparative analysis using MDSCAN which is the only motif-finding method that is tailored toward improving enrichment-based motif finding with ChIP-chip data. Since publicly available MDSCAN only reports the identified motifs but does not allow thresholding, we base our comparisons directly on the motif estimates rather than generating ROC curves. The MSEs between the true and the estimated position weight matrices (we let MDSCAN to identify 5 motifs and choose the one with the best MSE) are computed for each method. The results displayed in the bottom-panel plot of Figure 4 clearly illustrate that MDSCAN provides worse position weight matrix estimates in terms of MSE and has larger variability especially when the motif is degenerate and moderately abundant. In these calculations, the top k sequences parameter of MDSCAN is set to its true value from the data-generating mechanism. However, using top k without accounting for the variability in the ChIP-chip data does not facilitate MDSCAN to identify the motif of interest accurately. MSE levels of CTCM are significantly smaller than those of MEME and its weighted versions at lower abundance and/or low information content scenarios. As the abundance and/or the information content increase, the MSE levels of CTCM, MEME, and MEME(w2) become small and comparable whereas MEME(w1) has substantially higher MSE as expected from the ROC curves.
| 5. DISCUSSION |
|---|
We have presented a systematic way of incorporating quantitative ChIP-chip information into motif finding. Both the actual case studies and the extensive simulations suggest that principled use of ChIP-chip quantitative information compensates for the decreased signal-to-noise ratio when considering long sequences around the peak regions. Additionally, in the case of degenerate motifs with low information content, direct use of ChIP-chip data can boost the signal for the motif leading to a better characterization of the DNA recognition profile. We considered 3 models including a beta prior, logistic regression, and piecewise constant linear regression formulation. Among the 3, piecewise constant regression formulation seems favorable due to less model assumptions and computationally easier M-step. However, the logistic regression model performed as better or even better in the actual case studies. Therefore, logistic regression formulation provides an alternative without additional model selection issues, for example, bin sizes in the piecewise constant linear model. Our current implementation of the piecewise constant linear model does not impose any constraints on the ChIP-chip score-specific motif occurrence probabilities. However, a set of order constraints that enforces the bin with the highest ChIP-chip score to have the highest motif occurrence probability might be desirable. Such a constraint might prevent ChIP-chip information from being overriden by overwhelmingly abundant spurious motif occurrences. In this paper, we focused on identification of individual motifs. However, the probability of sequence data given the start site can be easily replaced with a probabilistic model of motif module as in Zhou and Wong (2004)
Software availability: The CTCM model implementation will soon become a part of the "R" sequence analysis package "cosmo" (Bembom and others, 2007
) and will be downloadable from Sündüz Kele
Web site www.stat.wisc.edu/
keles. The designated name for the R package is "SUCcESS" which stands for "Systematic Utilization of ChIP-chip Data for Estimating Sequence Signals".
| ACKNOWLEDGMENTS |
|---|
This research has been supported by a Wisconsin Alumni Research Foundation grant to Sündüz Kele
from the University of Wisconsin, Madison. Conflict of Interest: None declared. | REFERENCES |
|---|
-
Albert A, Anderson J. On the existence of maximum likelihood estimates in logistic regression models. Biometrika (1984) 71:1–10.
Bailey T, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning (1995) 21:51–80.[Web of Science]
Bembom O, Kele
S, van der Laan MJ. Supervised detection of conserved motifs in DNA sequences with cosmo. Statistical Applications is Genetics and Molecular Biology (2007) 6. Article 8. Available from http:www.bepress.com/sagmb/vol6/iss1/art8.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (1995) 57:289–300.
Bieda M, Xu X, Singer MA, Green R, Farnham PJ. Unbiased location analysis of e2f1-binding sites suggests a widespread role for e2f1 in the human genome. Genome Research (2006) 16:595–605.
Cawley S, Bekiranov S, Ng H, Kapranov P, Sekinger E, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams A, et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of non-coding RNAs. Cell (2004) 116:499–511.[CrossRef][Web of Science][Medline]
Consortium TEP. The ENCODE (ENCyclopedia of DNA elements) project. Science (2004) 306:636–640.
Dempster AP, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (1977) 39:1–38.
Hertz GZ, Stormo GD. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics (1999) 15:563–577.
Hong P, Liu XS, Zhou Q, Lu X, Liu JS, Wong WH. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics (2005) Advance access doi:10.1093/bioinformatics/bti402.
Ji H, Wong WH. TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics (2005) 21:3629–3636.
Keich U, Pevzner P. Finding motifs in the twilight zone. Bioinformatics (2002) 18:1374–1381.
Kele
S. Mixture modeling for genome-wide localization of transcription factors. Biometrics (2007) 63:10–21.[CrossRef][Web of Science][Medline]
Kele
S, van der Laan MJ, Dudoit S, Cawley SE. Multiple testing methods for ChIP-Chip high density oligonucleotide array data. Journal of Computational Biology (2006) 13:579–613.[CrossRef][Web of Science][Medline]
Kele
S, van der Laan MJ, Dudoit S, Xing B, Eisen MB. Supervised detection of regulatory motifs in DNA sequences. Statistical Applications in Genetics and Molecular Biology (2003) 2. Article 5. Available from http://www.bepress.com/sagmb/vol2/iss1/art5.
Kim TH, Barrera LO, Zheng M, Qu C, Singer M, Richmond T, Wu Y, Green RD, Ren B. A high-resolution map of active promoters in the human genome. Nature (2005) 436:876–880.[CrossRef][Medline]
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.
Lawrence CE, Reilly AA. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins (1990) 7:41–51.[CrossRef][Web of Science][Medline]
Li W, Meyer CA, Liu XS. A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics (2005) 21:i244–i282.
Liu X, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In: Proceedings of the Pacific Symposium of Biocomputing—Altman Russ B., Keith Dunker A, Hunter Lawrence, Klein Teri E., eds. (2001) : World Scientific. 127–138.
Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nature Biotechnology (2002) 20:835–839.[Web of Science][Medline]
Ren B, Robert F, Wyrick J, Aparicio O, Jennings E, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science (2000) 290:2306–2309.
Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research (2004) 32:D91–94. http://jaspar.genereg.net/.
Stormo GD. Use of the Perceptron algorithm to distinguish translational initiation sites in E.coli. Nucleic Acids Research (1982) 10:2997–3011.
Weinmann AS, Yan P, Oberley M, Huang T, Farnham P. Use of chromatin immunoprecipitation to clone novel E2F target promoters. Molecular and Cellular Biology (2001) 21:6820–6832.
Wingender E. Recognition of regulatory regions in genomic sequences. Journal of Biotechnology (1994) 35:273–280. http://www.gene-regulation.com/pub/databases.htm#transfac.[CrossRef][Web of Science][Medline]
Zhou Q, Wong WH. CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proceedings of the National Academy of Sciences of the United States of America (2004) 101:12114–12119.
Received December 11, 2006; accepted for publication March 8, 2007.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
M. Hu, J. Yu, J. M. G. Taylor, A. M. Chinnaiyan, and Z. S. Qin On the detection and refinement of transcription factor binding sites using ChIP-Seq data Nucleic Acids Res., January 6, 2010; (2010) gkp1180v1. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||














