Skip Navigation


Biostatistics Advance Access originally published online on October 23, 2006
Biostatistics 2007 8(3):632-653; doi:10.1093/biostatistics/kxl035
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
8/3/632    most recent
kxl035v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Xing, B.
Right arrow Articles by Bull, S. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xing, B.
Right arrow Articles by Bull, S. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

A hierarchical clustering method for estimating copy number variation

Baifang Xing

Genetics and Genome Biology, Hospital for Sick Children, Toronto, Ontario, Canada and Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada

Celia M. T. Greenwood*

Genetics and Genome Biology, Hospital for Sick Children, Toronto, Ontario, Canada and Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada celia.greenwood{at}utoronto.ca

Shelley B. Bull

Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada and Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
Microarray technologies allow for simultaneous measurement of DNA copy number at thousands of positions in a genome. Gains and losses of DNA sequences reveal themselves through characteristic patterns of hybridization intensity. To identify change points along the chromosomes, we develop a marker clustering method which consists of 2 parts. First, a "circular clustering tree test statistic" attaches a statistic to each marker that measures the likelihood that it is a change point. Then construction of the marker statistics is followed by outlier detection approaches. The method provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform. A simulation study shows good performance in change-point detection, and cancer cell line data are used to illustrate performance when regions of true copy number changes are known.

Keywords: Array CGH; Change-point; Genomic copy number; Outlier detection; Permutation


    1. INTRODUCTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
DNA copy number alterations in tumor cells are key genetic events in the development and progression of human cancers (Tirkkonen and others, 1998Go; Lengauer and others, 1998Go; Foronzan and others, 2000Go). These alterations are typically a result of genomic events producing gains and losses of chromosomes or chromosomal regions. Losses and gains of DNA can contribute to alterations in the expression of tumor suppressor genes and oncogenes, respectively. Therefore, the identification of DNA copy number alterations in tumor genomes may help to discover critical genes associated with cancers and, eventually, to improve therapeutic approaches. A variety of techniques have been developed to screen genomes for all possible regions with DNA copy number alterations. For example, comparative genomic hybridization (CGH) (Kallioniemi and others, 1992Go, 1994Go), high-resolution array-based CGH (Pinkel and others, 1998Go; Snijders and others, 2001Go), representational difference analysis (Lisitsyn and others, 1993Go), and single nucleotide polymorphism (SNP) arrays (Zhao and others, 2004Go) can all be used to assess copy number.

The goal in analyzing DNA copy number data is to partition the whole genome into regions with the same underlying copy number and, subsequently, to quantify the copy number in each region. Therefore, locating copy number changes is an important step in the analysis of DNA copy number data. Various statistical methods have been proposed to address this problem. The software CGH-Plotter uses moving median and mean filtering, k-means clustering, and a dynamic programming algorithm (Autio and others, 2003Go). Jong and others (2003Go, 2004Go) proposed a genetic local search algorithm to maximize a likelihood with a penalty term containing the number of change points. Other proposals for change-point detection methods include a dynamic programming algorithm with a penalized likelihood (Picard and others, 2005Go) and a weighted likelihood function using adaptive weight smoothing (Hupe and others, 2004Go). Olshen and others (2004)Go proposed a modified binary segmentation procedure, called circular binary segmentation (CBS), to look for 2 breakpoints at a time in the data arranged in a circle. Wang and others (2005)Go proposed a simple but effective method based on hierarchical clustering along the chromosomes (CLAC) to select regions of interest while simultaneously controlling the false discovery rate (FDR). Fridlyand and others (2004)Go used hidden Markov models (HMMs) to estimate the underlying copy numbers (the hidden states). In Hsu and others (2005)Go, a smoothing algorithm was given that denoises the data using a wavelets approach.

A recent paper (Lai and others, 2005Go) described an extensive study comparing several of these methods and pointed out that some segmentation methods, especially the CGH segmentation method (Picard and others, 2005Go) and CBS (Olshen and others, 2004Go), appeared to perform consistently well whether the noise level was low or high. Willenbrock and Fridlyand (2005)Go also did a comparison study of HMMs, CBS, and the Hupe method and concluded that CBS has the best operational characteristics in terms of its sensitivity and FDR. In Picard's method, however, the CGH profile is assumed to follow a Gaussian signal, an assumption that may be violated, while CBS is not designed to detect single outliers. For processing a large number of high-resolution arrays, the speed of many algorithms becomes an issue. Both HMMs and CBS can be slow and difficult to apply to very high–density oligonucleotide arrays.

In this report, we seek approaches to segment the chromosomes into regions of equal copy number that can include either a single marker or multiple contiguous markers. We propose a method, which we call the "circular clustering tree test statistic" (CCTTS), and following the construction of the statistics, we apply outlier detection approaches to identify change points along the chromosomes. Our method is novel in that it provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform.

The rest of the paper is organized as follows. The CCTTS algorithm and properties are given in Section 2. In Section 3, we provide several change-point detection approaches based on the scores produced by the CCTTS algorithm. In Section 4, the proposed methods are evaluated by simulation studies. The method is applied to data from cell lines with known copy number changes in Section 5. We summarize our conclusions and discuss open questions in Section 6.


    2. METHODS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
It is clear that estimating the locations of regions with aberrant DNA copy numbers is closely connected to the problem of finding the locations of change points (Olshen and others, 2004Go). This makes change-point methods a natural framework to approach the analysis of array DNA copy number data. Let Formula be a sequence of random variables that correspond to markers positioned sequentially along the genome. In general, an index Formula is called a change point if Formula have a common distribution and Formula have a different common distribution. Intuitively, a point can be identified as a change point when the difference in signal intensity is large between 2 adjacent segments demarcated by the point. The correct choice for the sizes of the 2 adjacent segments is an important issue when the signal difference is measured. Segments that are either too small or too large will reduce the sensitivity of change-point detection methods. Our proposed method—the CCTTS—automatically gives 2 "optimal" adjacent segments (or clusters) with which to measure the difference at any point along a chromosome. The idea is to build a hierarchical cluster-style tree from the bottom upward such that the gain/normal/loss regions are separated into different branches. In this procedure, the tree building and the evaluation of statistics at any point are carried out simultaneously. Furthermore, the statistics associated with the tree fully reflect the information involved in the tree structure, in- cluding the magnitude of the difference between any 2 adjacent clusters and the order of the merging procedure. These statistics can then be used for detecting the most likely change points along a chromosome.

2.1 The CCTTS algorithm

The CCTTS algorithm uses a bottom-up strategy that generates a binary tree to represent the similarities in the observed copy number intensities along a chromosome. The algorithm begins with the data arranged in a circle, obtained by connecting the 2 ends of the chromosome together, with every observation representing a singleton cluster. At each of the Formula steps, the closest 2 adjacent clusters are merged into a single cluster, where closest is defined by the smallest difference in copy number intensities, and the differences at the new cluster boundaries are updated. Hence, after each step there is one less cluster.

Let Formula be a sequence of random variables for the Formula relative copy number intensities at N marker positions. We define a signed distance to measure the difference between any 2 adjacent clusters as

Formula (2.1)

where markers Formula are adjacent to Formula. Note that Formula is simply a normal test statistic for the data with variance 1. Here, the implicit assumption that the data have a common variance (without loss of generality, 1) can be relaxed for segmentation methods. We call index j (k) the left (right) boundary of a cluster Formula (when Formula, in a circular arrangement, the left boundary of a cluster Formula is defined to be N). The CCTTS iterative algorithm is given in detail in the following and is illustrated in Figure 1.

1) Arrange the data in a circle so that the first point is connected to the last. That means Formula and Formula. Define N clusters each consisting of one point. Let Formula be the set of indices corresponding to each of the N clusters, and define a set of statistics that will be used to decide if point i is a change point. To start, use

Formula

which is the distance between each pair of adjacent observations.

2) Let Formula. Then merge the 2 adjacent clusters Formula and Formula, remove index l from I, and update Formula and Formula in the following way:

Formula


3) In general, merge the 2 adjacent clusters Formula with the smallest Formula. Then remove index q from I. If the data still contain more than one cluster, update Formula and Formula in the following way:
  • Let LC and RC be the left and right adjacent clusters to the cluster Formula we just formed, respectively (see Figure 1).
  • Update Formula using

    Formula (2.2)


  • Update Formula using

    Formula (2.3)



4) Repeat Step (3) until one big cluster is obtained. After merging the last 2 clusters, updating S is not relevant and is not performed. At that point there will be N statistics Formula.


Figure 1
View larger version (6K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The horizontal axis describes the positions of several markers along a chromosome, and the vertical axis is the signal. Horizontal lines stand for the mean of each cluster. The 2 solid line clusters are merged to form cluster C and then statistics Formula and Formula are updated.

 
In Step (2) or (3), if there are ties for the minimum value Formula, say, Formula, then indices Formula are simultaneously removed from I, all corresponding pairs of adjacent clusters are merged, and all statistics Formula on the boundaries of these newly formed clusters are updated as in Step (2) or (3). Note that 2 boundary points may correspond to the merging of more than 2 adjacent clusters.

The simple example in Table 1 is used to illustrate the CCTTS algorithm. The first row shows a data set with 5 points. The algorithm starts by defining 5 clusters: Formula and Formula. Formula are calculated and given in the second row after connecting the data in a circle. In the first step, Formula is removed from I, and Formula and Formula are merged together to obtain 4 clusters: Formula, and Formula. To update Formula and Formula corresponding to the boundaries of the newly formed cluster, Formula (given in the second row) is compared to Formula and Formula is compared to Formula. Both Formula and Formula are updated to new values that are larger in absolute values as seen in the third row. The remaining values of Formula, are unchanged. Then, this procedure of computing Formula, updating I, merging 2 adjacent clusters together, calculating Formula, and updating Formula is repeated until one cluster is formed. The final merging order Formula matches the magnitude of the final statistics Formula (in absolute value).


View this table:
[in this window]
[in a new window]

 
Table 1. Example of CCTTS. Numbers in parentheses are the smallest absolute Si

 
In addition to treating the data as a circle, there is a substantial difference between the CCTTS algorithm and the conventional clustering approach (Wang and others, 2005Go) with respect to how Formula and Formula are updated in Step (2) or (3), for a given difference measure. In the conventional approach, Formula and Formula are simply updated by Formula and Formula, respectively, which ignores the previous statistic Formula at a marker when updating the statistic to Formula, after merging clusters Formula and Formula. This leads to a suboptimal choice of statistics. To clarify, let Formula be the difference measurement between clusters LC and Formula, and let Formula be the difference between LC and C. Suppose Formula is a true change point and Formula, then our algorithm assigns the larger statistic Formula to marker j, whereas the standard algorithm would assign the smaller statistic Formula. This also means that there can be inconsistency between the order in which indices are removed from I and the magnitude of the resulting statistics Formula. With the usual tree-building procedure, Formula and Formula may be smaller than Formula, even though position q is merged before positions j and k. The merging order indicates that Formula is less likely to be a change point than Formula and Formula. Our method of updating difference measurements Formula in the CCTTS algorithm will produce a sequence of statistics Formula, ordered by their position in the clustering process.

The CCTTS algorithm provides a way to calculate the statistics Formula and also leads to 2 "optimally sized" adjacent clusters to achieve a statistic value (see Figure 2). Updating procedures (2.2) and (2.3) keep track of 2 optimally sized adjacent clusters at each step and detect the signals from using both the mean differences and the sizes of the clusters. This makes it possible to identify two interesting cases: (i) small regions with extremely large or small values and (ii) regions of consistent gain/loss that do not deviate much from zero. The magnitude of Formula naturally reflects the likelihood that Formula is a change point in the whole data set Formula. Let Formula. Then Formula is most likely to be a change point, Formula is the second most likely change point, and so on. Several additional properties of our statistics are given in detail in Section 2.2.


Figure 2
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. (a) Simulated data Formula along a chromosome, with (b) corresponding statistics Formula computed by the CCTTS algorithm. Six true change points are marked in (a) with different shapes. Horizontal solid and dashed lines indicate the corresponding adjacent right and left clusters used when defining Formula. It can be clearly seen that change points correspond to outliers for the Formula‘s.

 
2.2 Properties

Under the null hypothesis of no change points, the final statistics Formula have the following properties:

a) If Formula are iid, then all Formula are identically distributed, but not independent.
b) If Formula come from a normal population, then all Formula are approximately identically normally distributed, but not independent.

The clustering tree built by the CCTTS algorithm is based on a circular arrangement of the data. Property (a) is a natural consequence of this circular symmetric arrangement. In Appendix A, we give a simple example of a sequence data set containing only 3 points Formula, and Formula and illustrate the functional relationships between the statistic scores Formula and the data set Formula. In general, for a given data arranged in a sequence: Formula, the method for calculating Formula is a function of Formula. So we write Formula, where the subscript 1 on S corresponds to the first location, Formula. On the other hand, for any Formula, if the data are rearranged as Formula, then, according to the previous notation, the CCTTS algorithm gives statistic Formula corresponding to the first location of Formula: Formula. Since both data sets Formula and Formula, when arranged in a circle, are exactly the same, Formula. Therefore, Formula and Formula are identically distributed because Formula are iid. Property (a) plays a significant role when analyzing non-normal data for change-point detection, since the symmetry can be exploited to reduce the number of permutations needed to produce a reference distribution.

Property (b) has been established empirically by examining normal QQ plots of simulated data. From property (a), we only need to illustrate that Formula is approximately normally distributed. Let Formula, be n data sets generated iid from Formula. The statistics Formula were computed from the Formula data set by the CCTTS algorithm. Then Formula is a sample drawn from the distribution of Formula. Figure 3 shows normal QQ plots for Formula replications with number of markers Formula and indicates that Formula is approximately normally distributed. In our simulations, we simultaneously calculated sample correlation coefficients between Formula and Formula which, because of the symmetric circular arrangement and property (a), also estimate the correlation coefficients for any 2 statistics Formula and Formula. When Formula, for different N, all these correlation coefficients were between Formula and Formula. When Formula, correlations between Formula and Formula are small (e.g. correlations between Formula and Formula are around Formula), and these statistics could be treated as approximately independent. Due to the negative correlations between Formula and Formula, a histogram of all statistics Formula jointly has larger tails than would be expected from the marginal approximate normality of any one statistic Formula. Property (b) makes it possible to analyze normal data for change-point detection easily and efficiently.


Figure 3
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Normal QQ plots with respect to Formula from Formula simulated data sets. The dash-dotted line is a robust linear fit of the sample order statistics which passes through the first and third quartiles. The slope of the solid line is 1 and represents the standard normal distribution. Simulations are applied for different data size N. The plots demonstrate approximate normality of the statistics but with larger variances than standard normal.

 

    3. CHANGE-POINT DETECTION
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
We assume that the observed Formula is a realization of the true Formula relative copy number Formula at marker i plus random noise,

Formula (3.1)

where N is the number of markers on a given chromosome and Formula are iid with mean 0. Under the null hypothesis of no change points Formula, property (a) implies that Formula have the same distribution. As discussed in Section 2.2, the problem of change-point identification can be reduced to the problem of finding extremely large Formula that deviate from the pattern of Formula. In other words, our task becomes the identification of the outliers among the scores Formula. Interpretation of identified change points must take into account the artificial circular data. For example, if the number of detected change points is one, we must conclude that the original data contain no change points, since circular data with only one change point are still one cluster. For simplicity, we will not exclude a detected change point that is the last point on a chromosome.

We discuss and introduce several different approaches for outlier detection, separately for normal and non-normal data.

3.1 Normally distributed data

We assume that the data Formula are normally distributed with a common variance. The scores Formula are computed according to the CCTTS. Under the null hypothesis of no change points, Formula are each approximately normally distributed (see property (b)). Therefore, some approaches used to identify outliers for normal data are appropriate here, even though Formula are negatively correlated between 2 adjacent Formula. To perform this procedure, we apply the simple d standard deviation method, where the value of d can be user defined.

Approach 1: outlier detection method (ODM).

Formula are sorted in decreasing absolute values: Formula. The method starts with the assumption of no outliers in Formula, then examines and determines the first most likely outliers. If outliers are detected, then they are removed from the set of scores Formula. The procedure will be repeated for the remaining scores until no more outliers can be found.
1) Set Formula. Calculate the mean m and standard deviation s of all the scores.
2) If Formula, stop. Otherwise, identify the largest l such that Formula. Then remove Formula from Formula, recalculate the mean m and standard deviation s for the remaining scores in Formula, and then update Formula.
3) In general, if Formula, stop. Otherwise, identify the largest l such that Formula.
4) Remove Formula from Formula, then recalculate the mean m and standard deviation s for the remaining scores in Formula.
5) Formula, go to Step (3).
After completing this procedure, if Formula, there are no change points; otherwise, we find change points Formula.

The value of d can be user defined based on some domain or data knowledge, or theoretically using Chebyshev's theorem (Barnett and Lewis, 1994Go). In practice, d is usually chosen between 3 and 6 (Maletic and Marcus, 2000Go). From our experience, values between Formula and 4 were found to generate the best results with fewer false positives and false negatives in general.

Under the null hypothesis of no change points, the negative correlations between Formula and Formula and approximate independence between Formula and Formula mean that a sample variance Formula from Formula overestimates the variance Formula of the marginal distribution for any one statistic Formula. Therefore, outlier detection in the outlier detection method (ODM) is conservative. Outlier detection in Formula with one or several (outliers) points removed is also expected to be conservative. However, this conservativeness is not a concern in most cases due to the following 2 facts: (i) a sample variance Formula from Formula is an asymptotically unbiased estimator of the marginal variance Formula with respect to the number of markers N (see Appendix B) and (ii) a user-defined value of d can be adjusted for outlier detection. Therefore, the negative correlations do not have much effect on ODM performance for large data sets. When the number of markers is small, we may choose a smaller value of d to adjust for the conservativeness.

3.2 Non-normal data

In the outlier detection method, we assumed that the data followed a normal distribution. For non-normal data, we will introduce 2 methods to identify change points based on permutation-derived reference distributions. The sequential detection method (SDM) is the most straightforward method to apply.

Approach 2: Sequential detection method (SDM).

Assume the null hypothesis Formula is true, that is, there are no change points. Using property (a), each statistic value Formula computed from position permuted data has the same distribution and, therefore, can be used as a contribution to the permutation distribution. For example, by doing 100 permutations on a data set of Formula markers, Formula null test statistics are obtained. This drastically reduces the number of permutations needed.
1) For the original data Formula, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values: Formula.
2) Randomly permute the order of the markers M times (Formula). For each permutation data set P, collect all absolute statistic values Formula and construct a reference distribution. Outliers are chosen if Formula is greater than the upper Formula quantile of the permutation distribution Formula. Let Formula.
3) If Formula, there are no change points. Otherwise, the data can be partitioned into k or Formula segments separated by k change points Formula. The exact number of segments (k or Formula) depends on whether Formula is chosen as a change point.
4) Steps (1) to (3) are recursively performed on each small segment until no more change points are identified. Single-point regions of gain or loss may be identified by the occurrence of 2 adjacent change points. These may occur either because there was a very small region of copy number change on the chromosome that is measured by only one marker or because there was an error in the data at one marker. Suppose a chromosome with N markers has only one single-point region of gain/loss. Any permuted data from this chromosome also has one single-point region of gain/loss. That means 2 out of the N statistic values Formula in each permutation will be relatively large, and therefore the magnitude of quantile Formula depends critically on N. It will be difficult to detect change points for small values of N. Therefore, since the SDM does not work well in this situation, we propose a modification to improve detection of single-point regions. Let s be the sample standard deviation of the original data. Suppose during Step (3) on a segment with mean Formula no change points were detected (Formula). We propose to examine the indices of Formula and Formula. If they are adjacent in the circle, then point Formula (or Formula, depending on the relative positions of Formula and Formula) may be a single-point region. If Formula, then we conclude that there are no change points on the segment; otherwise, we conclude that Formula and Formula are 2 change points.

Approach 3: backward residual method (BRM).

The SDM works well and is robust to unequal variances. However, when segments become small, the SDM permutation approach can miss some change points. For example, look at a very simple data set: Formula. It is obvious that there are 2 change points and the largest statistic, Formula, is Formula. The probability of achieving this largest value in the permutation distribution is Formula. No change points can be identified at the level Formula. Therefore, as an alternative approach, we developed a backward residual method (BRM) that uses the whole data set all the time rather than working with smaller segments. The algorithm starts in the same way as the SDM, but after some of the change points are detected, new permutations will be performed on the residual data instead of on the smaller segments.
1) As before, for the original data Formula, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values: Formula.
2) Let Formula(one segment) and Formula.
3) Randomly permute the order of the data Formula M times (Formula). For each permutation data set, collect all absolute statistic values to form a reference distribution. The threshold values can be chosen to be the upper Formula quantile of the permutation distribution, say Formula. Let Formula.
4) If Formula, stop. Otherwise, Formula (Formula detected change points or segments); denote Formula as the mean of the segment which contains observation Formula and update Formula to be the mean adjusted residual of Formula, so that Formula. Then go to Step (3).
Similar to the SDM, a modification can be made to deal with single-point regions in data sets. Let s be the sample standard deviation of the original data. In Step (4), when Formula, let Formula and Formula (Formula when Formula, Formula when Formula) be the next 2 most likely change points. Identify if there are potential single-point regions produced by combining Formula with previously detected change points or by examining Formula and Formula. Suppose this potential single-point region Formula lies in segment L. We need to decide if this is a real single-point region. If Formula (Formula is the mean of segment L), we conclude that there are no more change points. Otherwise, Formula or both Formula and Formula are identified to be new change points.

Software written in MATLAB is available from the first author.


    4. SIMULATIONS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
In this section, we investigate through Monte Carlo simulations the performance of the proposed methods. Data containing regions of gain, normal, and loss copy numbers were generated from the model Formula, where N is the sample size, Formula is the mean, and Formula is distributed as Formula. We assume one region of loss and one region of gain per chromosome. Within a loss region, all markers have the same state, with copy number reduced by 1. However, within the gain region we allow 3 possible states amplified by Formula, or 3 copies.

Following Pollack and others (2002)Go, we set Formula‘s for the copy numbers deleted by 1 and amplified by 1, 2, and 3 to be Formula, and Formula, respectively, and we use Formula for normal regions. To mimic typical data, the following quantities were generated randomly: the position of loss/gain regions, the size of loss/gain regions, the number and allocation of the 3 amplified states, and the sizes of the 3 amplified regions within the regions of gain. In order to compare our methods and the CBS method, the simulated data sets do not contain single points of loss/gain. The details of the simulation are given in the following.

1) A data set is defined to contain a normal region, a region containing a loss, and a region containing a gain in random order.
2) To generate the 3 regions, each of size m: Formula, the following parameters are used:
a) Data from normal region: Formula.
b) Data from region of loss:
i) Initially generate Formula.
ii) Generate the width w of a loss region with minimum width Formula:

Formula


iii) Locate the loss interval in the region: Formula by generating i from a uniform distribution:

Formula


iv) Let Formula and Formula

c) Data from region of gain:
i) Initially generate Formula.
ii) Generate the number of amplified states: Formula and then randomly select the magnitude of the gains by sampling without replacement n times from Formula within the region of gain.
iii) Generate the width of a region with minimum width 2 for 1 amplified state, 4 for 2 amplified states, and 6 for 3 amplified states:

Formula


iv) Locate the gain interval in the region: Formula by generating i from a uniform distribution:

Formula


v) Divide the gain region into n segments, with the length of each segment at least 2 and chosen from a uniform distribution, and then randomly assign the n amplified states (selected by (ii)) to the n segments.
vi) Let Formula, and Formula. If segment Formula corresponds to the Formula state, then Formula.


3) The whole simulated data set is produced by randomly concatenating the 3 regions generated by (2) and examples are given in Figure 4.


Figure 4
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Examples of 4 simulated data sets with Formula. Each data set is produced by randomly concatenating the 3 parts: Formula, Formula, and Formula which contain normal, loss, and gain regions, respectively. In the gain regions, (a) has 1 state amplified by 3 copies, (b) has 2 states amplified by 1 and 2 copies, (c) and (d) have all 3 states, amplified by either 1, 2, or 3 copies.

 
One hundred data sets with 180 markers were simulated for Formula and for each value of Formula. In order to compare the simulation results across different values of Formula, Formula are generated and Formula are used for different Formula. Examples with Formula from simulation are shown in Figure 4. If regions of loss and gain are in the middle, a data set with only one amplified state for a gain region leads to 4 change points, whereas a data set with all 3 amplified states leads to 6 change points. The total number of change points in 100 data sets for a given Formula is 500. We use Formula standard deviations to perform the ODM. For both the SDM and the BRM, Formula is used to define outliers.

The number of change points detected is summarized in Table 2. For comparison, the results of the CBS without pruning (Olshen and others, 2004Go) and the CLAC (Wang and others, 2005Go) are included. In order to compare the CPU time, we coded the algorithm for CBS in MATLAB instead of the existing R package. The CLAC method was included for comparison since it also involves hierarchical clustering of points. CLAC was implemented with the existing R package (available at http://www-stat.stanford.edu/~wp57/CGH-Miner/) using default parameters. Because reference samples from the normal arrays are needed in the CLAC method, we simulated 100 reference data sets using a normal distribution with zero mean and variance Formula, set to be the same as in the data to be analyzed.


View this table:
[in this window]
[in a new window]

 
Table 2. Results from applying ODM, SDM, BRM, CBS, and CLAC to the simulated data. ODM is performed with d = 3.5 standard deviation. {alpha} = 0.001 is used to define outliers for both SDM and BRM. CBS is applied without pruning

 
The simulation results in Table 2 show that the proposed methods are competitive with the CBS method without pruning. In contrast, most of the change points detected by the CLAC are not in the right locations, and many change points are missed. This may be caused by the mean smoothing step in CLAC that reduces noise in the simulation data at the expense of blurring the edges of the loss/gain regions. When Formula is small, our methods appear to perform better than CBS. When Formula is large (e.g. Formula), SDM finds a few more false positives and ODM has a few more false negatives than CBS. As Formula increases, the data sets become more noisy, and the numbers of both false negatives and false positives from all methods increase. However, once Formula reaches very large values, the data sets appear approximately equal to those without change points and the number of additional change points reflects the false-positive rate in the absence of any change points. This can be seen in the last 2 columns in Table 2, where the number of additional change points detected for Formula is less than for Formula (except for CLAC).

To further assess the true-positive and false-positive rates, we applied CCTTS-based methods in additional simulations with Formula gradually increasing toward infinity. We note that, from the viewpoint of change-point detection, analyzing a data set Formula is equivalent to analyzing Formula for any nonzero constant Formula. Therefore, for any fixed Formula, analyzing a simulation data set Formula, yields the same results as analyzing the data set Formula. As Formula, the data set becomes one without change points. In Figure 5, we plot change-point detection rates against Formula. We define true positives to be both change points correctly detected and those detected but mislocated. The results indicate that the number of change points correctly detected and the true-positive rates decrease as Formula increases. However, among the true positives, the number of change points mislocated increases when Formula is small and then decreases as Formula becomes large. Similarly, for small values of Formula, the false positives increase due to the increased noise. When Formula reaches a certain value (around 0.5), the number of false positives begins to decrease and eventually approaches a limiting value corresponding to Formula (data without change points).


Figure 5
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Results from applying ODM, SDM, and BRM to the simulation data for different values of Formula. (a) The number of change points correctly detected. (b) The number of change points mislocated. (c) True-positive rates (total numbers of change points correctly detected and mislocated divided by the number of true change points, 500). (d) The number of additional change points detected (false positives).

 
We also compared the CPU time consumed (with MATLAB software) to perform these analyses. For analyzing 100 data sets with Formula, the ODM, SDM, and BRM consumed about Formula, and Formula of the CPU time used by the CBS method, respectively.


    5. APPLICATION TO A DATA SET OF FIBROBLAST CELL LINES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
To further assess the performance of our methods, we applied them to a set of CGH data with known DNA copy number changes (Snijders and others, 2001Go). The data consisted of single experiments on 15 fibroblast cell lines. The variable used for analysis was the normalized average of Formula of the test over reference ratio. For 6 of the cell lines, the copy number alteration encompassed the whole chromosome. Therefore, we first limited our analysis to the other 9 cell lines and tested one chromosome at a time for change points. The ODM was performed with both Formula and Formula. For SDM and BRM, we used thresholds based on Formula and Formula. Results can be found in Table 3. Outcomes from applying the CBS method reported by Olshen and others (2004)Go are included for comparison.


View this table:
[in this window]
[in a new window]

 
Table 3. Results from applying ODM, SDM, BRM, and CBS to 9 cell lines with known copy number alterations

 
In Table 3, "yes" for ODM, SDM, and BRM means that the alteration was found for the particular cell line and chromosome at the given Formula level, although in some cases, the locations of the change points may be 1 or 2 positions away from the real ones. Some additional change points may also be identified, which, as we will discuss further, can be eliminated through a "pruning" procedure. "Yes" for CBS is taken from Olshen and others (2004)Go, but the localization relative to the truth was not given. "No" means that the alteration was not found. "Single" and "false" represent the number of chromosomes where change points were found that do not correspond to known alterations. The difference between "single" and "false" is that "single" includes only the chromosomes where all detected change points were single-point regions, whereas "false" includes regions that are larger than single points. Using the Formula criterion, ODM gives better results than other methods. Outcomes for both SDM and BRM show that there is a little difference between Formula and Formula. Unlike the CBS application, our methods were directly applied to the original data sets without any prior smoothing process. Therefore, single-point regions can be identified.

All the alterations for 6 of the remaining cell lines covered a whole chromosome and would not be identified by analyzing each chromosome separately. However, analysis of an entire genome may help to identify whole chromosome alterations. We applied ODM to each of these cell lines with Formula standard deviation followed by one specific pruning step: If the absolute mean difference between 2 adjacent segments around a change point detected by ODM was less than Formula, then that change point was not considered a detected change point. The results are shown in Figure 6 and clearly reveal whole chromosome alterations along each cell line.


Figure 6
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The ODM analysis of 6 whole fibroblast cell lines. Markers are arranged by chromosome. The vertical dotted lines demarcate chromosomes. The horizontal lines represent the mean of each segment.

 

    6. DISCUSSION AND CONCLUSIONS
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 
We have proposed several new approaches for assessing DNA copy number variation along a chromosome. These methods can be used to detect copy number alterations in tumors or other tissues. All approaches can be directly applied to analyze data sets without any prior smoothing process, and single-point regions (or outliers) can be identified. We applied these methods to an array CGH data set of fibroblasts with known copy number changes confirmed by spectral karyotyping and were able to identify all the expected alterations. We also applied the methods to a 100-K SNP data set of non-small cell lung carcinoma, namely cell line H2122 from Zhao and others (2005)Go. Our results were similar to Zhao and others (2005)Go for large regions, but we identified many more small regions or single-point alterations. The analysis time for the SNP data of Formula markers was very fast (less than 3 min on a 3.20-GHz PC in performing ODM). These methods are also likely to be useful in assessing normal variation. In addition, through simulation we have demonstrated that our methods can correctly detect change points that partition chromosomes into regions with the same underlying copy number.

In developing the methods, we have made the implicit assumption that residual variances (defined in (3.1)) are equal across the chromosome. In some additional simulations, we found that SDM still works well for heteroscedastic data as long as heteroscedasticity across different regions is mild. In general, if heteroscedasticity is apparent, we suggest that some pre-analysis should be performed to derive good estimates of variances. Then the signed distance measure in (2.1) could be modified to involve estimated variances, although this approach would require future investigation.

An issue that can arise with SDM is the edge effect caused by artificially connecting the first and the last points Formula and Formula in the segment Formula. That is, if the index k of an identified change point Formula is "close " to either i or j, then it might not be a real change point. The same problem occurs for the CBS method (Olshen and others, 2004Go) and can be handled in a similar way. By contrast, in the process of change-point detection, BRM does not cut the whole data set into several segments so that this problem of small segments does not exist. Also, the edge effect is a minor issue for BRM because there is only one edge to be considered.

Any statistical method for analyzing DNA copy number will detect more change points or alterations than there are real copy number changes. These false positives may be caused by local trends in the data or other reasons that are not totally understood. Therefore, a pruning procedure may be helpful to eliminate some change points or to merge adjacent segments in order to reduce the number of false positives. Pruning can be performed either from a statistical viewpoint, by controlling a sequence of sum of squares as in CBS, or from a biological viewpoint by merging 2 adjacent segments as in the HMM method (Fridlyand and others, 2004Go) and the penalized least-squares regression method (Huang and others, 2005Go). Our methods do not include this step because any pruning procedure is dominated by a threshold parameter which must be determined by the type of data produced, the level of purity of the populations, and whether type I or type II error is more undesirable. As Fridlyand and others (2004)Go point out, threshold values for comparing the means of 2 adjacent segments could be as low as Formula when analyzing primary tumors and as high as Formula when looking for genetic abnormalities homogeneously present in all cells. For the results given in Table 3, CBS includes a pruning procedure by controlling a sequence of sum of squares, while our 3 methods do not perform any pruning.

It may be noted that the critical values Formula, Formula used for SDM and BRM in both simulated and real data studies are much smaller than the ones used in CBS (Formula, Formula). This is because all our change-point detection methods are essentially based on identifying the outliers among the scores produced by CCTTS algorithm, and this is conceptually different from the CBS approach. CBS, using the likelihood ratio, identifies only the largest change point at each segmentation step, whereas our methods find as many change points as possible each time. This distinction affects the permutation distributions. Our approach constructs a reference permutation distribution Formula by collecting all absolute statistic scores, whereas CBS constructs a distribution Formula by collecting only one score with the maximum absolute value for each permutation. That means that the distribution Formula is stochastically larger than the distribution Formula and the p-value of an observed statistic with respect to Formula is smaller than Formula. The cutoffs are chosen to give comparable performance, but the interpretation is not the same.

We have shown that, for normally distributed data, the ODM is easy to perform and extremely fast. The numbers of permutations needed for both SDM and BRM are small compared with the number needed for CBS. Furthermore, more markers means that even fewer permutations are needed. Existing methods tend to analyze each chromosome separately due to the fact that chromosomes are physically separated. Our methods can also analyze an entire genome simultaneously to identify regions of copy number variation. Inferences about change points drawn from analyzing the whole genome, rather than single chromosomes, can be used to further identify whole chromosome gain or loss.


    APPENDIX A
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 

A.1. Simple example illustrating the clustering algorithm and property (a)

We consider a simple situation of a data set containing only 3 points in sequence: Formula, and Formula. Let Formula, and Formula, where D is defined by (2.1). At the beginning of the CCTTS algorithm, the initial scores are Formula, and Formula. We focus here on how to obtain the final score of Formula. According to the CCTTS algorithm, we have

1) When Formula or Formula or Formula, then more than 2 clusters are merged simultaneously. One big cluster is obtained, and Formula is not updated. So the final score is

Formula


2) When Formula, then we merge clusters Formula and Formula, and we update Formula and Formula, but not Formula. Since Formula and Formula are merged together, Formula will not be updated anymore. Therefore, the final score is

Formula


3) When Formula, then we merge clusters Formula and Formula, and we update Formula to obtain

Formula

Note that, at this stage, the data set contains 2 clusters. The next merging procedure will produce one big cluster, and hence statistic scores will not be updated anymore. Therefore, the above achieved value is the final score of Formula.

4) When Formula, then we merge clusters Formula and Formula. In a similar discussion as in (3), the final score of Formula can be obtained by

Formula


By summarizing (1), (2), (3), and (4), Formula can be mathematically expressed as

Formula

Similarly, we can obtain mathematical expressions for Formula and Formula. For example, Formula can be written as

Formula

From this simple case, we find that (i) the final statistic scores Formula from the CCTTS algorithm are functions of the original sequence data set and (ii) if Formula is written as Formula, then Formula. So, Formula and Formula are identically distributed when Formula, and Formula are iid.


    APPENDIX B
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 

B.1 Conditions for asymptotically unbiased variance estimation

Let Formula be the statistic scores computed from the CCTTS. According to property (b) of Formula being approximately identically normally distributed, negative correlations between 2 adjacent statistics, and approximate independence between statistics at least 2 positions apart, we simplify the assumptions for the joint distribution of Formula as follows:

1) Marginal distributions:

Formula


2) Correlations:

Formula


Here, the assumption of the correlation Formula is based on the circular structure. The sample variance Formula from Formula is given by

Formula

where Formula. Our goal is to prove that Formula is an asymptotically unbiased estimator of Formula with respect to N under the assumptions given in (1) and (2). This can be done by the following derivation:

Formula

Therefore,

Formula

Formula when Formula (negative correlation), and Formula as Formula.


    ACKNOWLEDGMENTS
 
This work was supported by the Mathematics of Information Technology and Complex Systems (MITACS/NCE), by National Institutes of Health grant HL68532, and Genome Canada through the Center for Applied Genomics, Toronto. Conflict of Interest: None declared.


    REFERENCES
 TOP
 SUMMARY
 1. INTRODUCTION
 2. METHODS
 3. CHANGE-POINT DETECTION
 4. SIMULATIONS
 5. APPLICATION TO A...
 6. DISCUSSION AND CONCLUSIONS
 APPENDIX A
 APPENDIX B
 REFERENCES
 

    Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A. CGH-plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics (2003) 19:1714–1715.[CrossRef][Web of Science][Medline]

    Barnett V, Lewis T. Outliers in Statistical Data (1994) New York: John Wiley and Sons.

    Foronzan F, Mahlamaki EH, Monni O, Chen Y, Veldman R, Jiang Y, Gooden GC, Ethier SP, Kallioniemi A, Kallioniemi OP. Comparative genomic hybridization analysis of 38 breast cancer cell lines: a basis for interpreting complementary DNA microarray data. Cancer Research (2000) 60:4519–4525.[Abstract/Free Full Text]

    Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain A. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis (2004) 90:132–153.[CrossRef][Web of Science]

    Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005) 6:211–226.[Abstract]

    Huang T, Wu B, Lizardi P, Zhao H. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics (2005) 21:3811–3817.[Abstract/Free Full Text]

    Hupe P, Stransky N, Thiery J-P, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:3413–3422.[Abstract/Free Full Text]

    Jong K, Marchiori E, Meijer G, Vaart AVD, Ylstra B. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics (2004) 20:3636–3637.[Abstract/Free Full Text]

    Jong K, Marchiori E, van der Vaart A, Ylstra B, Meijer G, Weiss M. Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science (2003) Volume 2611. Berlin: Springer.

    Kallioniemi A, Kallioniemi OP, Piper J, Tanner M, Stokke T, Chen L, Smith HS, Pinkel D, Gray JW, Waldman FM. Detection and mapping of amplified DNA sequences in breast cancer by comparative genomic hybridization. Proceedings of the National Academy of Sciences USA (1994) 91:2156–2160.[Abstract/Free Full Text]

    Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science (1992) 258:818–821.[Abstract/Free Full Text]

    Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (2005) 21:3763–3770.[Abstract/Free Full Text]

    Lengauer C, Kinzler K, Vogelstein B. Genetic instabilities in human cancers. Nature (1998) 396:643–649.[CrossRef][Medline]

    Lisitsyn N, Lisitsyn N, Wigler M. Cloning the differences between two complex genomes. Science (1993) 259:946–951.[Abstract]

    Maletic JI, Marcus A. Data cleansing: beyond integrity analysis. In: Proceedings of The Conference on Information Quality (IQ2000) (2000) Cambridge: MA. 200–209.

    Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557–572.[Abstract]

    Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 6:27.[CrossRef][Medline]

    Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, and others. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics (1998) 20:207–211.[CrossRef][Web of Science][Medline]

    Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A, Brown PO. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences USA (2002) 99:12963–12968.[Abstract/Free Full Text]

    Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, and others. Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics (2001) 29:263–264.[CrossRef][Web of Science][Medline]

    Tirkkonen M, Tanner M, Karhu R, Kallioniemi A, Isola J, Kallioniemi OP. Molecular cytogenetics of primary breast cancer by CGH. Genes, Chromosomes & Cancer (1998) 21:177–184.[CrossRef][Web of Science][Medline]

    Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R. A method for calling gains and losses in array CGH data. Biostatistics (2005) 6:45–58.[Abstract]

    Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics (2005) 21:4084–4091.[Abstract/Free Full Text]

    Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG, and others. Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Research (2005) 65:5561–5570.[Abstract/Free Full Text]

    Zhao XJ, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, and others. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research (2004) 64:3060–3071.[Abstract/Free Full Text]

    Received March 20, 2006; revised September 1, 2006; accepted for publication October 16, 2006.


    Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


    This article has been cited by other articles:


    Home page
    BiostatisticsHome page
    C. D. Greenman, G. Bignell, A. Butler, S. Edkins, J. Hinton, D. Beare, S. Swamy, T. Santarius, L. Chen, S. Widaa, et al.
    PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data
    Biostat., January 1, 2010; 11(1): 164 - 175.
    [Abstract] [Full Text] [PDF]


    This Article
    Right arrow Abstract Freely available
    Right arrow FREE Full Text (PDF) Freely available
    Right arrow All Versions of this Article:
    8/3/632    most recent
    kxl035v1
    Right arrow Alert me when this article is cited
    Right arrow Alert me if a correction is posted
    Services
    Right arrow Email this article to a friend
    Right arrow Similar articles in this journal
    Right arrow Similar articles in PubMed
    Right arrow Alert me to new issues of the journal
    Right arrow Add to My Personal Archive
    Right arrow Download to citation manager
    Right arrowRequest Permissions
    Right arrow Disclaimer
    Google Scholar
    Right arrow Articles by Xing, B.
    Right arrow Articles by Bull, S. B.
    Right arrow Search for Related Content
    PubMed
    Right arrow PubMed Citation
    Right arrow Articles by Xing, B.
    Right arrow Articles by Bull, S. B.
    Social Bookmarking
     Add to CiteULike   Add to Connotea   Add to Del.icio.us  
    What's this?