Biostatistics Advance Access originally published online on February 22, 2006
Biostatistics 2006 7(4):569-584; doi:10.1093/biostatistics/kxj026
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Time ordering of gene coexpression
Division of Public Health Sciences, Department of Biostatistical Sciences, Wake Forest University School of Medicine, Medical Center Boulevard, MRI-3, Winston-Salem, NC 27157, USA ileng{at}wfubmc.edu
Department of Statistics, University of California, One Shields Avenue, Davis, CA 95616, USA
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Temporal microarray gene expression profiles allow characterization of gene function through time dynamics of gene coexpression within the same genetic pathway. In this paper, we define and estimate a global time shift characteristic for each gene via least squares, inferred from pairwise curve alignments. These time shift characteristics of individual genes reflect a time ordering that is derived from ob- served temporal gene expression profiles. Once these time shift characteristics are obtained for each gene, they can be entered into further analyses, such as clustering. We illustrate the proposed methodology using Drosophila embryonic development and yeast cell-cycle gene expression profiles, as well as simulations. Feasibility is demonstrated through the successful recovery of time ordering. Estimated time shifts for Drosophila maternal and zygotic genes provide excellent discrimination between these two categories and confirm known genetic pathways through the time order of gene expression. The application to yeast cell-cycle data establishes a natural time order of genes that is in line with cell-cycle phases. The method does not require periodicity of gene expression profiles. Asymptotic justifications are also provided.
Keywords: Curve alignment; Functional data analysis; Gene expression profiles; Microarray; Time dynamics; Warping
| 1. INTRODUCTION |
|---|
|
|
|---|
The thousands of genes of an organism or a cell must be expressed in a regulated manner to enable the organism or the cell to utilize the biological information contained within the genome. One of the most important dimensions of gene regulation is temporal control of gene expression within genetic pathways or biological processes, through which groups of genes are thought to be coexpressed coherently across different time periods. These processes are related to the function of the proteins encoded by these genes. In general, the timing of mRNA expression for a given gene correlates well with the function of the resultant protein (Bozdech et al., 2003
The temporal control of gene expression extends over many time scales, such as development and maturation, over the entire life span on the organismal level or cell cycles on the cellular level. For example, embryogenesis is an example of a time-sensitive biological process, whereby the embryo forms and develops. A fertilized egg of the fruit fly (Drosophila melanogaster) undergoes cleavage, blastoderm, gastrulation, germ band elongation and retraction, head involution and dorsal closure, and differentiation over about 24 h. The coordination in time and space of the expressions of maternal and zygotic genes guarantees normal embryogenesis (Lodish et al., 2000
; Weigmann et al., 2003
; Pollard and Earnshaw, 2004
). Another example is provided by the timing characteristics of the yeast cell cycle, which can be modeled as a line of dominoes or dependent pathway, whereby a gene will not start to express until certain other genes have already expressed (Pollard and Earnshaw, 2004
).
Uncovering the time ordering of gene expression could therefore be expected to aid the elucidation of gene function, and to provide a context for interpreting organismal/cellular responses to drugs, growth conditions, or environmental perturbations. This lends motivation to employ increasingly collected temporal microarray gene expression data, when aiming at detection of the time ordering of gene expression. The assumption that genes with similar temporal expression patterns are part of a common genetic pathway provides further motivation for global time-ordering analysis of gene expression profiles.
Previous time-ordering analyses have aimed at pairwise gene expression profile alignment. Butte et al. (2001)
utilized digital signal-processing tools, including power spectral densities, coherence, transfer gain, and phase shift, to find pairwise gene associations based on periodically expressed time-invariant gene profiles. Qian et al. (2001)
adopted a local clustering approach to determine optimal pairwise alignment based on dynamic programming, while Aach and Church (2001)
implemented both simple and interpolative time warping based on dynamic programming to identify an optimal time alignment of two gene expression time series. Expanding this approach, Liu and Müller (2003)
proposed a non-parametric time-warping technique to construct modes of temporal structure for a sample of gene expression profiles, adapting a time synchronization approach (Liu and Müller, 2004
). For general warping and curve alignment procedures, we refer to Wang and Gasser (1997)
and Gervini and Gasser (2004)
. Bar-Joseph et al. (2003a
, 2003b) developed yet another approach and aligned temporal data sets under varying conditions, extracting shift and stretch parameters for each data set. In a very general approach, Arkin and Ross (1995)
and Arkin et al. (1997)
devised models with time shifts for chemical reaction pathways, and proposed to determine pairwise time shifts by maximizing correlation. A related global time shift model for functional data was considered by Silverman (1995)
.
In this paper, we define and infer global time shift characteristics for genes that are based on observations of optimal pairwise curve alignments, which then are symmetrized through the minimization of a functional distance via a least squares step. We show that a conditional
distance between two curves is minimized near the true underlying pairwise time shift. The resulting global time shifts for genes lead to the proposed time-order characteristics. Further analysis such as gene grouping/clustering within the same genetic pathway can then be based on the estimated time-order characteristics in a subsequent step. The proposed methodology may play an auxiliary role in determining a genetic pathway.
The organization of the paper is as follows: We introduce pairwise alignment in Section 2, including asymptotic consistency results for identifying underlying time shifts based on minimizing
distance. The connection to global time shifts is made in Section 3, followed by simulations (Section 4) and the analysis of the time ordering for Drosophila embryonic development (Section 5) and yeast cell cycle (Section 6). The paper ends with discussion and conclusions (Section 7). Theoretical results and proofs can be found in the Appendix which is posted at the journal's website.
| 2. PAIRWISE CURVE ALIGNMENT |
|---|
|
|
|---|
Assume we observe a collection of K gene expression profiles with expression levels
measured at times
,
,
,
, with
![]() |
where
are the underlying expression profiles, which we view as realizations of a stochastic process, and
are iid errors with zero mean and finite variance. Furthermore, assume
![]() |
where
is a non-random mean curve and
are iid realizations of a stochastic process Z, s.t.
and
. It is not necessary to specify the covariance function of Z. If, on the other hand, it is known, some efficiency might be gained. The
are random time shifts and
is a small constant, so that
might be viewed as a small random perturbation. For a pair of random curves
and
, the relative time shift is
,
. Note that this pairwise time shift is symmetric in the sense that
.
We are interested in identifying the time shifts
, which are associated with specific gene profiles
, and also in the inherent ordering of the
,
. The time shifts
are characteristics of individual genes associated with their expression trajectories. They reflect the inherent time order of the ensemble of genes and therefore have a clear biological interpretation.
Given a random trajectory
, we align other trajectories
against
on the interval
for some constants
by minimizing a distance
with regard to s. We aim at the minimizers
![]() | (2.1) |
Similarly, we define
and
by exchanging i and j. We assume implicitly that
is defined for all
, for a
, by extending the domain of the
suitably beyond
. For distance d in function space, we consider the
pseudodistance
![]() | (2.2) |
All functions are pre-normalized by the transformation
, aiming to de-emphasize differences in amplitude and to emphasize differences in horizontal shift. Other distances could also be used, such as correlation and rank correlation (Arkin and Ross, 1995
; Arkin et al., 1997
; Heckman and Zamar, 2000
).
THEOREM 2.1 For two random functions
and
, let
|
|
Under Conditions (A1)(A4) (see Appendix), for sufficiently small
,
![]() |
Theorem 2.1 shows that, under suitable assumptions,
asymptotically tracks the true value
. The proof is in the Appendix.
After pairwise alignment, we obtain two matrices: the minimum distance matrix
and the relative time shift matrix
, where
and
,
. We note that the matrices
and
are generally asymmetric. The elements of S consist of the pairwise relative time shifts
, which serve as responses in a global time shift model that is discussed in the next section.
| 3. GLOBAL TIME SHIFT MODEL AND INFERENCE FOR TIME SHIFTS |
|---|
|
|
|---|
For each pair of random curves
,
,
, the relative shift of
with respect to
is expected to be close to
, if
is relatively small, since
for
. We note that, in general, for the pairwise time shifts
(2.1),
, so that a reasonable algorithm needs to include a symmetrization step. Without loss of generality, let
. For K gene expression trajectories, the equation
![]() | (3.1) |
corresponds to a linear system
![]() | (3.2) |
where
is a
vector of stacked pairwise relative time shifts, A the corresponding design matrix corresponding to (3.1), and
the shift parameter vector, where
denotes the transpose of a (column or row) vector x; specifically,
, where
is a
vector of
,
,
. Note that A is always of full rank by design.
Gene expression profiles
are often not continuously observed but are rather observed over a discrete grid of measurement times
, giving rise to discrete observations
per profile. Since the measurement times can be irregular, there seems no obvious way to obtain the distance between any two observed profiles without making distributional assumptions, which we prefer to avoid; compare Yao et al. (2005)
for the implementation of Gaussian assumptions for irregularly observed functional data. Smooth trajectories for profiles
can be obtained from the discrete measurements by applying a linear scatterplot smoother to the scatterplot
, denoted by
, when evaluated at point t. While various smoothing methods are available (Fan and Gijbels, 1996
), we use a class of kernel smoothers (Gasser and Müller, 1984
) that are well suited for our purpose; further details about these smoothers are provided in the Appendix.
THEOREM 3.1 Using kernel smoothers (A1, see Appendix), for any two random functions
and
and their smoothed estimates
and
, let
![]() | (3.3) |
Under Assumptions (A1)(A6) (see Appendix), it holds that
![]() | (3.4) |
For the proof, we refer to the Appendix. Theorem 3.1 says that the minimizers
asymptotically track the minimizing pairwise time shifts
and therefore according to Theorem 2.1 also the true shifts
up to an asymptotically small error. Together with (3.1) and (3.2), this suggests a least squares approach.
With relative time shifts
, Model (3.2) becomes
![]() | (3.5) |
where
is the vector of estimated pairwise relative time shifts obtained from (3.3) and
is the corresponding vector of errors. The least squares estimator
of
is then
![]() | (3.6) |
COROLLARY 3.2 The least squares time shifts
satisfy
![]() | (3.7) |
The proof is in the Appendix. Relative time shift estimates are practically implemented based on the aligned smoothed trajectories
,
via
|
| (3.8) |
which are entered on the right-hand side of the least squares equation (3.5). Then the least squares solution is
, where
is defined from (3.8) in analogy to
as defined in (3.2).
The gene-specific time shifts can alternatively be estimated through weighted least squares. Introducing the estimated squared distances
resulting from (3.8), we define case weights
. Then the weighted least squares solution is
![]() | (3.9) |
where
is a diagonal matrix with diagonal elements
. Updated estimates for pairwise time shifts
are then obtained by
. These estimates are symmetric as
.
In the analysis of temporal gene expression profiles, one may be interested in finding clusters of genes, which can be defined as genes with similar time shifts. Reaching high expression levels at about the same time might imply that such genes are related in function and may be involved in a common genetic pathway. Cluster analyses, based on various methods such as k-means and hierarchical or non-parametric density estimation-based clustering (Fukunaga and Hostetler, 1975
; Wong and Lane, 1983
), may be performed in a second step, based on estimated time shifts
.
| 4. SIMULATION STUDIES |
|---|
|
|
|---|
To compare the performance of distance and other similarity criteria, we conducted two small simulation studies.
Simulation 1 We generated a set of curves (
) with known non-random time shifts
,
,
, on the interval
:
![]() |
where
,
,
, and data were sampled on an equi-spaced measurement grid
on
,
. In this setup, the mean function is
,
, where
and
and
are iid errors. The number of Monte Carlo runs was
.
The normalized smoothed curves for one data set are shown in a time-ordered manner in the left panel of Figure 1. We aligned the normalized smoothed curves using distances d (2.2), and also distances based on correlation and rank correlation. For each criterion, we obtained the least squares solution for the time shifts, assuming
= 0 in the model. The performance of each criterion was measured by the observed mean squared prediction errors (MSPE) for
,
,
.
|
Both correlation and rank correlation measures (Heckman and Zamar, 2000), as alternative distances, led to larger MSPEs as compared to the proposed distances d (2.2). We also compared the least squares estimates computed with and without case weights. The introduction of case weights reduced MSPE for all criteria, especially for the proposed L2 distance d. The first part of Table 1 provides the comparisons for various similarity measures with regard to MSPE.
The right panel of Figure 1 shows estimated time order using weighted least squares based on distance d for one set of simulated curves. It is evident that the original time order is well recovered.
Simulation 2 We also carried out a second simulation to investigate the behavior for less smooth trajectories Z than those considered in Simulation 1, with Z constructed as moving averages of Gaussian white noise
with varying span sizes. The mean curve
and errors were defined as in Simulation 1. We report the results for three different levels of smoothness of Z in the second part of Table 1. The results are quite similar to those of Simulation 1. The distance d (2.2) again led to smaller MSPEs, as compared to correlation and rank correlation measures, and the introduction of case weights reduced MSPE for all criteria, especially for distance d, which emerged as the overall best distance for time alignment in both simulation studies and was used in the following applications. We conclude that our procedure is robust with regard to differing levels of smoothness of Z.
|
| 5. TIME ORDERING FOR Drosophila EMBRYONIC DEVELOPMENTAL GENES |
|---|
|
|
|---|
Arbeitman et al. (2002)
h of the fly embryo stage are displayed in Figure 2.
|
Here, it is natural to set the time shift of the first strict maternal gene as zero. The obtained time orders for the Drosophila genes are given in Table 2. The gene expression patterns are depicted in Figure 3, ordered by their estimated time shifts. The genes with complete peaks in the lower part of the figure are zygotic and the genes with half peaks in the upper half are maternal. We note that the time shift for zygotic genes starts at about 2 h after fertilization, in accordance with the time when zygotic transcription initiates the switch from maternal to zygotic control of mitotic cycles (Foe et al., 1993
|
|
We also demonstrate another application of the proposed methodology by including the anterior genes in the alignment. The anterior system is one of the four maternal systems for assuring proper polarity of the oocyte prior to fertilization. Bicoid is the principal protein organizing anterior development in Drosophila. Directional action of exuperantia, exuperantia-like, swallow, and staufen are required during the process, in which bicoid mRNA is transported along the microtubule network of the oocyte to its anterior pole (Brody, 1999
The time-order analysis also reveals that one gene [CG1624 or DPLD (dappled)] previously identified as maternal displays a relatively large lagged shift, which is a typical feature of zygotic genes. The pattern of this gene is very similar to that of zygotic genes and it likely has been misclassified in previous analysis as a maternal gene.
| 6. TIME ORDERING FOR YEAST CELL-CYCLE GENES |
|---|
|
|
|---|
We illustrate the proposed methods for a set of time course microarray expression profiles of 90 yeast genes (
factor synchronized), which have been identified as cell-cycleregulating genes using traditional methods, such as northern blot analysis (Spellman et al., 1998
|
Without employing knowledge of phase assignment, when applying our algorithm to obtain time- order characteristics for these genes, the phases were successfully recovered in the natural time order: G2/M
M/G1
G1
S
S/G2
G2/M, as illustrated by the dot plot in Figure 5. A complete list of time orders for the cell-cycle genes is in Table 3. Furthermore, we would expect that relatively small time shifts between two genes indicate that they are closely coregulated. The pairwise time-order characteristics are visualized in Figure 6. One may discern four time clusters (darker squares) corresponding to G2/M, M/G1, G1, and SS/G2 genes.
|
|
|
The gene expression profiles can be aligned in their time order as shown in Figure 7. We note that no sharp cutoff occurs between genes in different phases, i.e. there is a continuum of timing of expression. Assignment of a phase boundary gene to a certain cycle phase therefore appears subjective (see also Cooper and Shedden, 2003
|
Besides contributing to the issue of boundary genes, the time-order analysis points to a few genes as likely having been misclassified into underlying cycle phases by previous methods. These include G1 phase genes YGL225W, YDR113C, and YJL173C, for which the time shifts fall within the cluster of SS/G2 phase genes, and YDL102W, which according to its time shift belongs to the cluster of M/G1 genes. In addition, two M/G1 genes, namely, YER111C and YNR044W, fall into the cluster of G1 genes. Moreover, the time shifts of genes YDR150W and YPR141C, previously classified as S/G2, are found to be considerably closer to those of G2/M and S genes, respectively. This demonstrates the utility of the inferred time order for clustering and classification.
| 7. DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
The timing of individual gene expression offers insights into the function of the proteins they encode. A recent study on the transcriptome of the intraerythrocytic developmental cycle (IDC) of the malaria parasite Plasmodium falciparum led Bozdech et al. (2003)
Recognizing the importance of timing characteristics of gene expression, in this paper, we developed methods to uncover the underlying time ordering. We define and estimate a global time shift for each gene that is rooted in pairwise curve alignments and is obtained in a symmetrization step through minimization of a functional distance via least squares. The proposed approach is easy to implement. In practice, the interval of integration
can be determined by searching over different ranges and choosing the range that leads to a best global alignment, e.g. minimizing the sum of all pairwise distances
Although we sought to specifically emphasize the global temporal order of gene expression, finding time clusters for classifying sets of genes by their time dynamics might also be of interest in some studies and can be based at least in part on the proposed time ordering.
We demonstrated the methodology for Drosophila embryonic development and yeast cell-cycle gene expression profiles, as well as simulations. Feasibility of the approach was demonstrated by successful recovery of time ordering. While recovering time ordering of genes in itself is not sufficient to construct gene regulatory networks, it contains valuable information toward that goal and may serve as an informative guiding tool for biologists to further explore and discover relevant regulatory relationships within certain genetic pathways. For example, the timeline of early embyonic development of Drosophila has been well studied and the estimated time shifts of participating genes will assign them into various embyonic sub-stages, thereby informing about their respective functions.
| ACKNOWLEDGMENTS |
|---|
We wish to thank an associate editor and two referees for helpful comments, and Ms. Karen Klein at Office of Research, Wake Forest University School of Medicine, for assistance in text editing. This research was supported in part by National Science Foundation grants DMS03-5448 and DMS05-05537. Conflict of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Aach J and Church GM. (2001) Aligning gene expression time series with time warping algorithms. Bioinformatics 17:495508.
Arbeitman MN, Furlong EEM, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP. (2002) Gene expression during the life cycle of Drosophila melanogaster. Science 297:22702275.
Arkin A and Ross J. (1995) Statistical construction of chemical-reaction mechanisms from measured time-series. Journal of Physical Chemistry 99:970979.[CrossRef]
Arkin A, Shen P, Ross J. (1997) A test case of correlation metric construction of a reaction pathway from measurements. Science 277:12751279.
Bähler J. (2005) Cell cycle control of gene expression in budding and fission yeast. Annual Review of Genetics 39:6994.[CrossRef][Web of Science][Medline]
Bar-Joseph Z, Gerber G, Gifford DK, Jaakkola TS, Simon I. (2003a) Continuous representation of time-series gene expression data. Journal of Computational Biology 10:24.
Bar-Joseph Z, Gerber G, Simon I, Gifford DK, Jaakkola TS. (2003b) Comparing the continuous representation of time-series expression profiles to identify differently expressed genes. Proceedings of the National Academy of Sciences of the United States of America 100:1014610151.
Bozdech Z, Llinas M, Pulliam BL, Wong ED, Zhu J, Derisi JL. (2003) The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biology 1:85100.
Brody TB. (1999) The Interactive Fly: Gene networks, development and the Internet. Trends in Genetics 15:333334.[CrossRef][Web of Science][Medline]
Butte AJ, Bao L, Reis BY, Watkins TW, Kohane IS. (2001) Comparing the similarity of time-series gene expression using signal processing metrics. Journal of Biomedical Informatics 34:396405.[CrossRef][Web of Science][Medline]
Cooper S and Shedden K. (2003) Microarray analysis of gene expression during the cell cycle. Cell & Chromosome 2:112.[Medline]
Fan J and Gijbels I. (1996) Local Polynomial Modelling and its Applications(Chapman and Hall., London).
Foe VE, Odell GM, Edgar BA. (1993) Mitosis and morphogenesis in the Drosophila embryo. In Bate M and Martinez-Arias A (Eds.). The Development of Drosophila melanogaster(Cold Spring Harbor Laboratory Press., Long Island, NY).
Fukunaga K and Hostetler LD. (1975) Estimation of the gradient of a density-function, with applications in pattern recognition. IEEE Transactions on Information Theory 21:3240.[CrossRef]
Gasser T and Müller HG. (1984) Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics 11:171185.
Gervini D and Gasser T. (2004) Self-modelling warping functions. Journal of the Royal Statistical Society, Series B 66:959971.[CrossRef]
Heckman NE and Zamar RH. (2000) Comparing the shapes of regression functions. Biometrika 87:135144.
Liu XL and Müller HG. (2003) Modes and clustering for time-warped gene expression profile data. Bioinformatics 19:19371944.
Liu XL and Müller HG. (2004) Functional convex averaging and synchronization for time-warped random curves. Journal of the American Statistical Association 99:687699.
Lodish H, Berk A, Zipursky LS, Matsudaira P, Baltimore D, Darnell J. (2000) Molecular Cell Biology 4th edition (W.H. Freeman and Company, New York).
Pollard TD and Earnshaw WC. (2004) Cell Biology 1st edition (W.B. Saunders., Philadelphia, PA).
Qian J, Dolled-Filhart M, Lin J, Yu H, Gerstein M. (2001) Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identified new, biologically relevant interactions. Journal of Molecular Biology 314:10531066.[CrossRef][Web of Science][Medline]
Silverman BW. (1995) Incorporating parametric effects into functional principal components analysis. Journal of the Royal Statistical Society, Series B 57:673689.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. (1998) Comprehensive identification of cell cycle-regulated gene of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9:32733297.
Wang FM and Gasser T. (1997) Alignment of curves by dynamic time warping. Annals of Statistics 25:12511276.[CrossRef]
Weigmann K, Klapper R, Strasser T, Rickert C, Technau GM, Jäckle H, Janning W, Klämbt C. (2003) FlyMovea new way to look at development of Drosophila. Trends in Genetics 19:310311.[CrossRef][Web of Science][Medline]
Wong MA and Lane T. (1983) A k-th nearest neighbor clustering procedure. Journal of the Royal Statistical Society, Series B 45:362368.
Yao F, Müller HG, Wang JL. (2005) Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100:577590.[CrossRef]
Received August 5, 2005; revised November 28, 2005; revised February 14, 2006; accepted for publication February 16, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
R. Tang and H.-G. Muller Pairwise curve synchronization for functional data Biometrika, December 1, 2008; 95(4): 875 - 889. [Abstract] [PDF] |
||||
![]() |
D. Sahoo, D. L. Dill, R. Tibshirani, and S. K. Plevritis Extracting binary signals from microarray time-course data Nucleic Acids Res., June 28, 2007; 35(11): 3705 - 3712. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||






















