Biostatistics Advance Access originally published online on October 23, 2006
Biostatistics 2007 8(3):632-653; doi:10.1093/biostatistics/kxl035
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A hierarchical clustering method for estimating copy number variation
Genetics and Genome Biology, Hospital for Sick Children, Toronto, Ontario, Canada and Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada
Genetics and Genome Biology, Hospital for Sick Children, Toronto, Ontario, Canada and Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada celia.greenwood{at}utoronto.ca
Samuel Lunenfeld Research Institute of Mount Sinai Hospital, Toronto, Ontario, Canada and Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada
* To whom correspondence should be addressed.
| SUMMARY |
|---|
|
|
|---|
Microarray technologies allow for simultaneous measurement of DNA copy number at thousands of positions in a genome. Gains and losses of DNA sequences reveal themselves through characteristic patterns of hybridization intensity. To identify change points along the chromosomes, we develop a marker clustering method which consists of 2 parts. First, a "circular clustering tree test statistic" attaches a statistic to each marker that measures the likelihood that it is a change point. Then construction of the marker statistics is followed by outlier detection approaches. The method provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform. A simulation study shows good performance in change-point detection, and cancer cell line data are used to illustrate performance when regions of true copy number changes are known.
Keywords: Array CGH; Change-point; Genomic copy number; Outlier detection; Permutation
| 1. INTRODUCTION |
|---|
|
|
|---|
DNA copy number alterations in tumor cells are key genetic events in the development and progression of human cancers (Tirkkonen and others, 1998
The goal in analyzing DNA copy number data is to partition the whole genome into regions with the same underlying copy number and, subsequently, to quantify the copy number in each region. Therefore, locating copy number changes is an important step in the analysis of DNA copy number data. Various statistical methods have been proposed to address this problem. The software CGH-Plotter uses moving median and mean filtering, k-means clustering, and a dynamic programming algorithm (Autio and others, 2003
). Jong and others (2003
, 2004
) proposed a genetic local search algorithm to maximize a likelihood with a penalty term containing the number of change points. Other proposals for change-point detection methods include a dynamic programming algorithm with a penalized likelihood (Picard and others, 2005
) and a weighted likelihood function using adaptive weight smoothing (Hupe and others, 2004
). Olshen and others (2004)
proposed a modified binary segmentation procedure, called circular binary segmentation (CBS), to look for 2 breakpoints at a time in the data arranged in a circle. Wang and others (2005)
proposed a simple but effective method based on hierarchical clustering along the chromosomes (CLAC) to select regions of interest while simultaneously controlling the false discovery rate (FDR). Fridlyand and others (2004)
used hidden Markov models (HMMs) to estimate the underlying copy numbers (the hidden states). In Hsu and others (2005)
, a smoothing algorithm was given that denoises the data using a wavelets approach.
A recent paper (Lai and others, 2005
) described an extensive study comparing several of these methods and pointed out that some segmentation methods, especially the CGH segmentation method (Picard and others, 2005
) and CBS (Olshen and others, 2004
), appeared to perform consistently well whether the noise level was low or high. Willenbrock and Fridlyand (2005)
also did a comparison study of HMMs, CBS, and the Hupe method and concluded that CBS has the best operational characteristics in terms of its sensitivity and FDR. In Picard's method, however, the CGH profile is assumed to follow a Gaussian signal, an assumption that may be violated, while CBS is not designed to detect single outliers. For processing a large number of high-resolution arrays, the speed of many algorithms becomes an issue. Both HMMs and CBS can be slow and difficult to apply to very highdensity oligonucleotide arrays.
In this report, we seek approaches to segment the chromosomes into regions of equal copy number that can include either a single marker or multiple contiguous markers. We propose a method, which we call the "circular clustering tree test statistic" (CCTTS), and following the construction of the statistics, we apply outlier detection approaches to identify change points along the chromosomes. Our method is novel in that it provides a new way to build up a binary tree that can accurately capture change-point signals and is easy to perform.
The rest of the paper is organized as follows. The CCTTS algorithm and properties are given in Section 2. In Section 3, we provide several change-point detection approaches based on the scores produced by the CCTTS algorithm. In Section 4, the proposed methods are evaluated by simulation studies. The method is applied to data from cell lines with known copy number changes in Section 5. We summarize our conclusions and discuss open questions in Section 6.
| 2. METHODS |
|---|
|
|
|---|
It is clear that estimating the locations of regions with aberrant DNA copy numbers is closely connected to the problem of finding the locations of change points (Olshen and others, 2004
be a sequence of random variables that correspond to markers positioned sequentially along the genome. In general, an index
is called a change point if
have a common distribution and
have a different common distribution. Intuitively, a point can be identified as a change point when the difference in signal intensity is large between 2 adjacent segments demarcated by the point. The correct choice for the sizes of the 2 adjacent segments is an important issue when the signal difference is measured. Segments that are either too small or too large will reduce the sensitivity of change-point detection methods. Our proposed methodthe CCTTSautomatically gives 2 "optimal" adjacent segments (or clusters) with which to measure the difference at any point along a chromosome. The idea is to build a hierarchical cluster-style tree from the bottom upward such that the gain/normal/loss regions are separated into different branches. In this procedure, the tree building and the evaluation of statistics at any point are carried out simultaneously. Furthermore, the statistics associated with the tree fully reflect the information involved in the tree structure, in- cluding the magnitude of the difference between any 2 adjacent clusters and the order of the merging procedure. These statistics can then be used for detecting the most likely change points along a chromosome.
The CCTTS algorithm uses a bottom-up strategy that generates a binary tree to represent the similarities in the observed copy number intensities along a chromosome. The algorithm begins with the data arranged in a circle, obtained by connecting the 2 ends of the chromosome together, with every observation representing a singleton cluster. At each of the
steps, the closest 2 adjacent clusters are merged into a single cluster, where closest is defined by the smallest difference in copy number intensities, and the differences at the new cluster boundaries are updated. Hence, after each step there is one less cluster.
Let
be a sequence of random variables for the
relative copy number intensities at N marker positions. We define a signed distance to measure the difference between any 2 adjacent clusters as
|
| (2.1) |
where markers
are adjacent to
. Note that
is simply a normal test statistic for the data with variance 1. Here, the implicit assumption that the data have a common variance (without loss of generality, 1) can be relaxed for segmentation methods. We call index j (k) the left (right) boundary of a cluster
(when
, in a circular arrangement, the left boundary of a cluster
is defined to be N). The CCTTS iterative algorithm is given in detail in the following and is illustrated in Figure 1.
- 1) Arrange the data in a circle so that the first point is connected to the last. That means
and
. Define N clusters each consisting of one point. Let
be the set of indices corresponding to each of the N clusters, and define a set of statistics that will be used to decide if point i is a change point. To start, use 
which is the distance between each pair of adjacent observations.
- 2) Let
. Then merge the 2 adjacent clusters
and
, remove index l from I, and update
and
in the following way: 
- 3) In general, merge the 2 adjacent clusters
with the smallest
. Then remove index q from I. If the data still contain more than one cluster, update
and
in the following way:- Let LC and RC be the left and right adjacent clusters to the cluster
we just formed, respectively (see Figure 1).
- Update
using

(2.2)
- Update
using

(2.3)
- 4) Repeat Step (3) until one big cluster is obtained. After merging the last 2 clusters, updating S is not relevant and is not performed. At that point there will be N statistics
.
- 2) Let
|
In Step (2) or (3), if there are ties for the minimum value
, say,
, then indices
are simultaneously removed from I, all corresponding pairs of adjacent clusters are merged, and all statistics
on the boundaries of these newly formed clusters are updated as in Step (2) or (3). Note that 2 boundary points may correspond to the merging of more than 2 adjacent clusters.
The simple example in Table 1 is used to illustrate the CCTTS algorithm. The first row shows a data set with 5 points. The algorithm starts by defining 5 clusters:
and
.
are calculated and given in the second row after connecting the data in a circle. In the first step,
is removed from I, and
and
are merged together to obtain 4 clusters:
, and
. To update
and
corresponding to the boundaries of the newly formed cluster,
(given in the second row) is compared to
and
is compared to
. Both
and
are updated to new values that are larger in absolute values as seen in the third row. The remaining values of
, are unchanged. Then, this procedure of computing
, updating I, merging 2 adjacent clusters together, calculating
, and updating
is repeated until one cluster is formed. The final merging order
matches the magnitude of the final statistics
(in absolute value).
|
In addition to treating the data as a circle, there is a substantial difference between the CCTTS algorithm and the conventional clustering approach (Wang and others, 2005
and
are updated in Step (2) or (3), for a given difference measure. In the conventional approach,
and
are simply updated by
and
, respectively, which ignores the previous statistic
at a marker when updating the statistic to
, after merging clusters
and
. This leads to a suboptimal choice of statistics. To clarify, let
be the difference measurement between clusters LC and
, and let
be the difference between LC and C. Suppose
is a true change point and
, then our algorithm assigns the larger statistic
to marker j, whereas the standard algorithm would assign the smaller statistic
. This also means that there can be inconsistency between the order in which indices are removed from I and the magnitude of the resulting statistics
. With the usual tree-building procedure,
and
may be smaller than
, even though position q is merged before positions j and k. The merging order indicates that
is less likely to be a change point than
and
. Our method of updating difference measurements
in the CCTTS algorithm will produce a sequence of statistics
, ordered by their position in the clustering process.
The CCTTS algorithm provides a way to calculate the statistics
and also leads to 2 "optimally sized" adjacent clusters to achieve a statistic value (see Figure 2). Updating procedures (2.2) and (2.3) keep track of 2 optimally sized adjacent clusters at each step and detect the signals from using both the mean differences and the sizes of the clusters. This makes it possible to identify two interesting cases: (i) small regions with extremely large or small values and (ii) regions of consistent gain/loss that do not deviate much from zero. The magnitude of
naturally reflects the likelihood that
is a change point in the whole data set
. Let
. Then
is most likely to be a change point,
is the second most likely change point, and so on. Several additional properties of our statistics are given in detail in Section 2.2.
|
Under the null hypothesis of no change points, the final statistics
have the following properties:
- a) If
are iid, then all
are identically distributed, but not independent.
- b) If
come from a normal population, then all
are approximately identically normally distributed, but not independent.
- b) If
The clustering tree built by the CCTTS algorithm is based on a circular arrangement of the data. Property (a) is a natural consequence of this circular symmetric arrangement. In Appendix A, we give a simple example of a sequence data set containing only 3 points
, and
and illustrate the functional relationships between the statistic scores
and the data set
. In general, for a given data arranged in a sequence:
, the method for calculating
is a function of
. So we write
, where the subscript 1 on S corresponds to the first location,
. On the other hand, for any
, if the data are rearranged as
, then, according to the previous notation, the CCTTS algorithm gives statistic
corresponding to the first location of
:
. Since both data sets
and
, when arranged in a circle, are exactly the same,
. Therefore,
and
are identically distributed because
are iid. Property (a) plays a significant role when analyzing non-normal data for change-point detection, since the symmetry can be exploited to reduce the number of permutations needed to produce a reference distribution.
Property (b) has been established empirically by examining normal QQ plots of simulated data. From property (a), we only need to illustrate that
is approximately normally distributed. Let
, be n data sets generated iid from
. The statistics
were computed from the
data set by the CCTTS algorithm. Then
is a sample drawn from the distribution of
. Figure 3 shows normal QQ plots for
replications with number of markers
and indicates that
is approximately normally distributed. In our simulations, we simultaneously calculated sample correlation coefficients between
and
which, because of the symmetric circular arrangement and property (a), also estimate the correlation coefficients for any 2 statistics
and
. When
, for different N, all these correlation coefficients were between
and
. When
, correlations between
and
are small (e.g. correlations between
and
are around
), and these statistics could be treated as approximately independent. Due to the negative correlations between
and
, a histogram of all statistics
jointly has larger tails than would be expected from the marginal approximate normality of any one statistic
. Property (b) makes it possible to analyze normal data for change-point detection easily and efficiently.
|
| 3. CHANGE-POINT DETECTION |
|---|
|
|
|---|
We assume that the observed
is a realization of the true
relative copy number
at marker i plus random noise,
|
| (3.1) |
where N is the number of markers on a given chromosome and
are iid with mean 0. Under the null hypothesis of no change points
, property (a) implies that
have the same distribution. As discussed in Section 2.2, the problem of change-point identification can be reduced to the problem of finding extremely large
that deviate from the pattern of
. In other words, our task becomes the identification of the outliers among the scores
. Interpretation of identified change points must take into account the artificial circular data. For example, if the number of detected change points is one, we must conclude that the original data contain no change points, since circular data with only one change point are still one cluster. For simplicity, we will not exclude a detected change point that is the last point on a chromosome.
We discuss and introduce several different approaches for outlier detection, separately for normal and non-normal data.
We assume that the data
are normally distributed with a common variance. The scores
are computed according to the CCTTS. Under the null hypothesis of no change points,
are each approximately normally distributed (see property (b)). Therefore, some approaches used to identify outliers for normal data are appropriate here, even though
are negatively correlated between 2 adjacent
. To perform this procedure, we apply the simple d standard deviation method, where the value of d can be user defined.
are sorted in decreasing absolute values:
. The method starts with the assumption of no outliers in
, then examines and determines the first most likely outliers. If outliers are detected, then they are removed from the set of scores
. The procedure will be repeated for the remaining scores until no more outliers can be found.- 1) Set
. Calculate the mean m and standard deviation s of all the scores.
- 2) If
, stop. Otherwise, identify the largest l such that
. Then remove
from
, recalculate the mean m and standard deviation s for the remaining scores in
, and then update
.
- 3) In general, if
, stop. Otherwise, identify the largest l such that
.
- 4) Remove
from
, then recalculate the mean m and standard deviation s for the remaining scores in
.
- 5)
, go to Step (3).
- 2) If
, there are no change points; otherwise, we find change points
.
The value of d can be user defined based on some domain or data knowledge, or theoretically using Chebyshev's theorem (Barnett and Lewis, 1994
). In practice, d is usually chosen between 3 and 6 (Maletic and Marcus, 2000
). From our experience, values between
and 4 were found to generate the best results with fewer false positives and false negatives in general.
Under the null hypothesis of no change points, the negative correlations between
and
and approximate independence between
and
mean that a sample variance
from
overestimates the variance
of the marginal distribution for any one statistic
. Therefore, outlier detection in the outlier detection method (ODM) is conservative. Outlier detection in
with one or several (outliers) points removed is also expected to be conservative. However, this conservativeness is not a concern in most cases due to the following 2 facts: (i) a sample variance
from
is an asymptotically unbiased estimator of the marginal variance
with respect to the number of markers N (see Appendix B) and (ii) a user-defined value of d can be adjusted for outlier detection. Therefore, the negative correlations do not have much effect on ODM performance for large data sets. When the number of markers is small, we may choose a smaller value of d to adjust for the conservativeness.
In the outlier detection method, we assumed that the data followed a normal distribution. For non-normal data, we will introduce 2 methods to identify change points based on permutation-derived reference distributions. The sequential detection method (SDM) is the most straightforward method to apply.
Assume the null hypothesis
is true, that is, there are no change points. Using property (a), each statistic value
computed from position permuted data has the same distribution and, therefore, can be used as a contribution to the permutation distribution. For example, by doing 100 permutations on a data set of
markers,
null test statistics are obtained. This drastically reduces the number of permutations needed.- 1) For the original data
, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values:
.
- 2) Randomly permute the order of the markers M times (
). For each permutation data set P, collect all absolute statistic values
and construct a reference distribution. Outliers are chosen if
is greater than the upper
quantile of the permutation distribution
. Let
.
- 3) If
, there are no change points. Otherwise, the data can be partitioned into k or
segments separated by k change points
. The exact number of segments (k or
) depends on whether
is chosen as a change point.
- 4) Steps (1) to (3) are recursively performed on each small segment until no more change points are identified. Single-point regions of gain or loss may be identified by the occurrence of 2 adjacent change points. These may occur either because there was a very small region of copy number change on the chromosome that is measured by only one marker or because there was an error in the data at one marker. Suppose a chromosome with N markers has only one single-point region of gain/loss. Any permuted data from this chromosome also has one single-point region of gain/loss. That means 2 out of the N statistic values
in each permutation will be relatively large, and therefore the magnitude of quantile
depends critically on N. It will be difficult to detect change points for small values of N. Therefore, since the SDM does not work well in this situation, we propose a modification to improve detection of single-point regions. Let s be the sample standard deviation of the original data. Suppose during Step (3) on a segment with mean
no change points were detected (
). We propose to examine the indices of
and
. If they are adjacent in the circle, then point
(or
, depending on the relative positions of
and
) may be a single-point region. If
, then we conclude that there are no change points on the segment; otherwise, we conclude that
and
are 2 change points.
- 2) Randomly permute the order of the markers M times (
The SDM works well and is robust to unequal variances. However, when segments become small, the SDM permutation approach can miss some change points. For example, look at a very simple data set:
. It is obvious that there are 2 change points and the largest statistic,
, is
. The probability of achieving this largest value in the permutation distribution is
. No change points can be identified at the level
. Therefore, as an alternative approach, we developed a backward residual method (BRM) that uses the whole data set all the time rather than working with smaller segments. The algorithm starts in the same way as the SDM, but after some of the change points are detected, new permutations will be performed on the residual data instead of on the smaller segments.- 1) As before, for the original data
, we calculate statistic scores according to CCTTS and sort them in decreasing absolute values:
.
- 2) Let
(one segment) and
.
- 3) Randomly permute the order of the data
M times (
). For each permutation data set, collect all absolute statistic values to form a reference distribution. The threshold values can be chosen to be the upper
quantile of the permutation distribution, say
. Let
.
- 4) If
, stop. Otherwise,
(
detected change points or segments); denote
as the mean of the segment which contains observation
and update
to be the mean adjusted residual of
, so that
. Then go to Step (3).
- 2) Let
, let
and
(
when
,
when
) be the next 2 most likely change points. Identify if there are potential single-point regions produced by combining
with previously detected change points or by examining
and
. Suppose this potential single-point region
lies in segment L. We need to decide if this is a real single-point region. If
(
is the mean of segment L), we conclude that there are no more change points. Otherwise,
or both
and
are identified to be new change points. Software written in MATLAB is available from the first author.
| 4. SIMULATIONS |
|---|
|
|
|---|
In this section, we investigate through Monte Carlo simulations the performance of the proposed methods. Data containing regions of gain, normal, and loss copy numbers were generated from the model
, where N is the sample size,
is the mean, and
is distributed as
. We assume one region of loss and one region of gain per chromosome. Within a loss region, all markers have the same state, with copy number reduced by 1. However, within the gain region we allow 3 possible states amplified by
, or 3 copies.
Following Pollack and others (2002)
, we set
s for the copy numbers deleted by 1 and amplified by 1, 2, and 3 to be
, and
, respectively, and we use
for normal regions. To mimic typical data, the following quantities were generated randomly: the position of loss/gain regions, the size of loss/gain regions, the number and allocation of the 3 amplified states, and the sizes of the 3 amplified regions within the regions of gain. In order to compare our methods and the CBS method, the simulated data sets do not contain single points of loss/gain. The details of the simulation are given in the following.
- 1) A data set is defined to contain a normal region, a region containing a loss, and a region containing a gain in random order.
- 2) To generate the 3 regions, each of size m:
, the following parameters are used:- a) Data from normal region:
.
- b) Data from region of loss:
- i) Initially generate
.
- ii) Generate the width w of a loss region with minimum width
: 
- iii) Locate the loss interval in the region:
by generating i from a uniform distribution: 
- iv) Let
and
- ii) Generate the width w of a loss region with minimum width
- c) Data from region of gain:
- i) Initially generate
.
- ii) Generate the number of amplified states:
and then randomly select the magnitude of the gains by sampling without replacement n times from
within the region of gain.
- iii) Generate the width of a region with minimum width 2 for 1 amplified state, 4 for 2 amplified states, and 6 for 3 amplified states:

- iv) Locate the gain interval in the region:
by generating i from a uniform distribution: 
- v) Divide the gain region into n segments, with the length of each segment at least 2 and chosen from a uniform distribution, and then randomly assign the n amplified states (selected by (ii)) to the n segments.
- vi) Let
, and
. If segment
corresponds to the
state, then
.
- ii) Generate the number of amplified states:
- b) Data from region of loss:
- 3) The whole simulated data set is produced by randomly concatenating the 3 regions generated by (2) and examples are given in Figure 4.
- 2) To generate the 3 regions, each of size m:
|
One hundred data sets with 180 markers were simulated for
and for each value of
. In order to compare the simulation results across different values of
,
are generated and
are used for different
. Examples with
from simulation are shown in Figure 4. If regions of loss and gain are in the middle, a data set with only one amplified state for a gain region leads to 4 change points, whereas a data set with all 3 amplified states leads to 6 change points. The total number of change points in 100 data sets for a given
is 500. We use
standard deviations to perform the ODM. For both the SDM and the BRM,
is used to define outliers.
The number of change points detected is summarized in Table 2. For comparison, the results of the CBS without pruning (Olshen and others, 2004
) and the CLAC (Wang and others, 2005
) are included. In order to compare the CPU time, we coded the algorithm for CBS in MATLAB instead of the existing R package. The CLAC method was included for comparison since it also involves hierarchical clustering of points. CLAC was implemented with the existing R package (available at http://www-stat.stanford.edu/
wp57/CGH-Miner/) using default parameters. Because reference samples from the normal arrays are needed in the CLAC method, we simulated 100 reference data sets using a normal distribution with zero mean and variance
, set to be the same as in the data to be analyzed.
|
The simulation results in Table 2 show that the proposed methods are competitive with the CBS method without pruning. In contrast, most of the change points detected by the CLAC are not in the right locations, and many change points are missed. This may be caused by the mean smoothing step in CLAC that reduces noise in the simulation data at the expense of blurring the edges of the loss/gain regions. When
is small, our methods appear to perform better than CBS. When
is large (e.g.
), SDM finds a few more false positives and ODM has a few more false negatives than CBS. As
increases, the data sets become more noisy, and the numbers of both false negatives and false positives from all methods increase. However, once
reaches very large values, the data sets appear approximately equal to those without change points and the number of additional change points reflects the false-positive rate in the absence of any change points. This can be seen in the last 2 columns in Table 2, where the number of additional change points detected for
is less than for
(except for CLAC).
To further assess the true-positive and false-positive rates, we applied CCTTS-based methods in additional simulations with
gradually increasing toward infinity. We note that, from the viewpoint of change-point detection, analyzing a data set
is equivalent to analyzing
for any nonzero constant
. Therefore, for any fixed
, analyzing a simulation data set
, yields the same results as analyzing the data set
. As
, the data set becomes one without change points. In Figure 5, we plot change-point detection rates against
. We define true positives to be both change points correctly detected and those detected but mislocated. The results indicate that the number of change points correctly detected and the true-positive rates decrease as
increases. However, among the true positives, the number of change points mislocated increases when
is small and then decreases as
becomes large. Similarly, for small values of
, the false positives increase due to the increased noise. When
reaches a certain value (around 0.5), the number of false positives begins to decrease and eventually approaches a limiting value corresponding to
(data without change points).
|
We also compared the CPU time consumed (with MATLAB software) to perform these analyses. For analyzing 100 data sets with
, the ODM, SDM, and BRM consumed about
, and
of the CPU time used by the CBS method, respectively. | 5. APPLICATION TO A DATA SET OF FIBROBLAST CELL LINES |
|---|
|
|
|---|
To further assess the performance of our methods, we applied them to a set of CGH data with known DNA copy number changes (Snijders and others, 2001
of the test over reference ratio. For 6 of the cell lines, the copy number alteration encompassed the whole chromosome. Therefore, we first limited our analysis to the other 9 cell lines and tested one chromosome at a time for change points. The ODM was performed with both
and
. For SDM and BRM, we used thresholds based on
and
. Results can be found in Table 3. Outcomes from applying the CBS method reported by Olshen and others (2004)
|
In Table 3, "yes" for ODM, SDM, and BRM means that the alteration was found for the particular cell line and chromosome at the given
level, although in some cases, the locations of the change points may be 1 or 2 positions away from the real ones. Some additional change points may also be identified, which, as we will discuss further, can be eliminated through a "pruning" procedure. "Yes" for CBS is taken from Olshen and others (2004)
criterion, ODM gives better results than other methods. Outcomes for both SDM and BRM show that there is a little difference between
and
. Unlike the CBS application, our methods were directly applied to the original data sets without any prior smoothing process. Therefore, single-point regions can be identified.
All the alterations for 6 of the remaining cell lines covered a whole chromosome and would not be identified by analyzing each chromosome separately. However, analysis of an entire genome may help to identify whole chromosome alterations. We applied ODM to each of these cell lines with
standard deviation followed by one specific pruning step: If the absolute mean difference between 2 adjacent segments around a change point detected by ODM was less than
, then that change point was not considered a detected change point. The results are shown in Figure 6 and clearly reveal whole chromosome alterations along each cell line.
|
| 6. DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
We have proposed several new approaches for assessing DNA copy number variation along a chromosome. These methods can be used to detect copy number alterations in tumors or other tissues. All approaches can be directly applied to analyze data sets without any prior smoothing process, and single-point regions (or outliers) can be identified. We applied these methods to an array CGH data set of fibroblasts with known copy number changes confirmed by spectral karyotyping and were able to identify all the expected alterations. We also applied the methods to a 100-K SNP data set of non-small cell lung carcinoma, namely cell line H2122 from Zhao and others (2005)
markers was very fast (less than 3 min on a 3.20-GHz PC in performing ODM). These methods are also likely to be useful in assessing normal variation. In addition, through simulation we have demonstrated that our methods can correctly detect change points that partition chromosomes into regions with the same underlying copy number. In developing the methods, we have made the implicit assumption that residual variances (defined in (3.1)) are equal across the chromosome. In some additional simulations, we found that SDM still works well for heteroscedastic data as long as heteroscedasticity across different regions is mild. In general, if heteroscedasticity is apparent, we suggest that some pre-analysis should be performed to derive good estimates of variances. Then the signed distance measure in (2.1) could be modified to involve estimated variances, although this approach would require future investigation.
An issue that can arise with SDM is the edge effect caused by artificially connecting the first and the last points
and
in the segment
. That is, if the index k of an identified change point
is "close " to either i or j, then it might not be a real change point. The same problem occurs for the CBS method (Olshen and others, 2004
) and can be handled in a similar way. By contrast, in the process of change-point detection, BRM does not cut the whole data set into several segments so that this problem of small segments does not exist. Also, the edge effect is a minor issue for BRM because there is only one edge to be considered.
Any statistical method for analyzing DNA copy number will detect more change points or alterations than there are real copy number changes. These false positives may be caused by local trends in the data or other reasons that are not totally understood. Therefore, a pruning procedure may be helpful to eliminate some change points or to merge adjacent segments in order to reduce the number of false positives. Pruning can be performed either from a statistical viewpoint, by controlling a sequence of sum of squares as in CBS, or from a biological viewpoint by merging 2 adjacent segments as in the HMM method (Fridlyand and others, 2004
) and the penalized least-squares regression method (Huang and others, 2005
). Our methods do not include this step because any pruning procedure is dominated by a threshold parameter which must be determined by the type of data produced, the level of purity of the populations, and whether type I or type II error is more undesirable. As Fridlyand and others (2004)
point out, threshold values for comparing the means of 2 adjacent segments could be as low as
when analyzing primary tumors and as high as
when looking for genetic abnormalities homogeneously present in all cells. For the results given in Table 3, CBS includes a pruning procedure by controlling a sequence of sum of squares, while our 3 methods do not perform any pruning.
It may be noted that the critical values
,
used for SDM and BRM in both simulated and real data studies are much smaller than the ones used in CBS (
,
). This is because all our change-point detection methods are essentially based on identifying the outliers among the scores produced by CCTTS algorithm, and this is conceptually different from the CBS approach. CBS, using the likelihood ratio, identifies only the largest change point at each segmentation step, whereas our methods find as many change points as possible each time. This distinction affects the permutation distributions. Our approach constructs a reference permutation distribution
by collecting all absolute statistic scores, whereas CBS constructs a distribution
by collecting only one score with the maximum absolute value for each permutation. That means that the distribution
is stochastically larger than the distribution
and the p-value of an observed statistic with respect to
is smaller than
. The cutoffs are chosen to give comparable performance, but the interpretation is not the same.
We have shown that, for normally distributed data, the ODM is easy to perform and extremely fast. The numbers of permutations needed for both SDM and BRM are small compared with the number needed for CBS. Furthermore, more markers means that even fewer permutations are needed. Existing methods tend to analyze each chromosome separately due to the fact that chromosomes are physically separated. Our methods can also analyze an entire genome simultaneously to identify regions of copy number variation. Inferences about change points drawn from analyzing the whole genome, rather than single chromosomes, can be used to further identify whole chromosome gain or loss.
| APPENDIX A |
|---|
|
|
|---|
We consider a simple situation of a data set containing only 3 points in sequence:
, and
. Let
, and
, where D is defined by (2.1). At the beginning of the CCTTS algorithm, the initial scores are
, and
. We focus here on how to obtain the final score of
. According to the CCTTS algorithm, we have
- 1) When
or
or
, then more than 2 clusters are merged simultaneously. One big cluster is obtained, and
is not updated. So the final score is 
- 2) When
, then we merge clusters
and
, and we update
and
, but not
. Since
and
are merged together,
will not be updated anymore. Therefore, the final score is 
- 3) When
, then we merge clusters
and
, and we update
to obtain 
Note that, at this stage, the data set contains 2 clusters. The next merging procedure will produce one big cluster, and hence statistic scores will not be updated anymore. Therefore, the above achieved value is the final score of
.
- 4) When
, then we merge clusters
and
. In a similar discussion as in (3), the final score of
can be obtained by 
- 2) When
By summarizing (1), (2), (3), and (4),
can be mathematically expressed as
![]() |
Similarly, we can obtain mathematical expressions for
and
. For example,
can be written as
![]() |
From this simple case, we find that (i) the final statistic scores
from the CCTTS algorithm are functions of the original sequence data set and (ii) if
is written as
, then
. So,
and
are identically distributed when
, and
are iid.
| APPENDIX B |
|---|
|
|
|---|
Let
be the statistic scores computed from the CCTTS. According to property (b) of
being approximately identically normally distributed, negative correlations between 2 adjacent statistics, and approximate independence between statistics at least 2 positions apart, we simplify the assumptions for the joint distribution of
as follows:
- 1) Marginal distributions:

- 2) Correlations:

- 2) Correlations:
Here, the assumption of the correlation
is based on the circular structure. The sample variance
from
is given by
![]() |
where
. Our goal is to prove that
is an asymptotically unbiased estimator of
with respect to N under the assumptions given in (1) and (2). This can be done by the following derivation:
![]() |
Therefore,
![]() |
when
(negative correlation), and
as
.
| ACKNOWLEDGMENTS |
|---|
This work was supported by the Mathematics of Information Technology and Complex Systems (MITACS/NCE), by National Institutes of Health grant HL68532, and Genome Canada through the Center for Applied Genomics, Toronto. Conflict of Interest: None declared.
| REFERENCES |
|---|
|
|
|---|
-
Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A. CGH-plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics (2003) 19:17141715.[CrossRef][Web of Science][Medline]
Barnett V, Lewis T. Outliers in Statistical Data (1994) New York: John Wiley and Sons.
Foronzan F, Mahlamaki EH, Monni O, Chen Y, Veldman R, Jiang Y, Gooden GC, Ethier SP, Kallioniemi A, Kallioniemi OP. Comparative genomic hybridization analysis of 38 breast cancer cell lines: a basis for interpreting complementary DNA microarray data. Cancer Research (2000) 60:45194525.
Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain A. Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis (2004) 90:132153.[CrossRef][Web of Science]
Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics (2005) 6:211226.[Abstract]
Huang T, Wu B, Lizardi P, Zhao H. Detection of DNA copy number alterations using penalized least squares regression. Bioinformatics (2005) 21:38113817.
Hupe P, Stransky N, Thiery J-P, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics (2004) 20:34133422.
Jong K, Marchiori E, Meijer G, Vaart AVD, Ylstra B. Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics (2004) 20:36363637.
Jong K, Marchiori E, van der Vaart A, Ylstra B, Meijer G, Weiss M. Chromosomal Breakpoint Detection in Human Cancer. Lecture Notes in Computer Science (2003) Volume 2611. Berlin: Springer.
Kallioniemi A, Kallioniemi OP, Piper J, Tanner M, Stokke T, Chen L, Smith HS, Pinkel D, Gray JW, Waldman FM. Detection and mapping of amplified DNA sequences in breast cancer by comparative genomic hybridization. Proceedings of the National Academy of Sciences USA (1994) 91:21562160.
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science (1992) 258:818821.
Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics (2005) 21:37633770.
Lengauer C, Kinzler K, Vogelstein B. Genetic instabilities in human cancers. Nature (1998) 396:643649.[CrossRef][Medline]
Lisitsyn N, Lisitsyn N, Wigler M. Cloning the differences between two complex genomes. Science (1993) 259:946951.[Abstract]
Maletic JI, Marcus A. Data cleansing: beyond integrity analysis. In: Proceedings of The Conference on Information Quality (IQ2000) (2000) Cambridge: MA. 200209.
Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics (2004) 5:557572.[Abstract]
Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J. A statistical approach for array CGH data analysis. BMC Bioinformatics (2005) 6:27.[CrossRef][Medline]
Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, and others. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics (1998) 20:207211.[CrossRef][Web of Science][Medline]
Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A, Brown PO. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences USA (2002) 99:1296312968.
Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, and others. Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics (2001) 29:263264.[CrossRef][Web of Science][Medline]
Tirkkonen M, Tanner M, Karhu R, Kallioniemi A, Isola J, Kallioniemi OP. Molecular cytogenetics of primary breast cancer by CGH. Genes, Chromosomes & Cancer (1998) 21:177184.[CrossRef][Web of Science][Medline]
Wang P, Kim Y, Pollack J, Narasimhan B, Tibshirani R. A method for calling gains and losses in array CGH data. Biostatistics (2005) 6:4558.[Abstract]
Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics (2005) 21:40844091.
Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG, and others. Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Research (2005) 65:55615570.
Zhao XJ, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, and others. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Research (2004) 64:30603071.
Received March 20, 2006; revised September 1, 2006; accepted for publication October 16, 2006.
![]()
CiteULike
Connotea
Del.icio.us What's this?
This article has been cited by other articles:
![]() |
C. D. Greenman, G. Bignell, A. Butler, S. Edkins, J. Hinton, D. Beare, S. Swamy, T. Santarius, L. Chen, S. Widaa, et al. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data Biostat., January 1, 2010; 11(1): 164 - 175. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



simulated data sets. The dash-dotted line is a robust linear fit of the sample order statistics which passes through the first and third quartiles. The slope of the solid line is 1 and represents the standard normal distribution. Simulations are applied for different data size N. The plots demonstrate approximate normality of the statistics but with larger variances than standard normal.
,
, and
which contain normal, loss, and gain regions, respectively. In the gain regions, (a) has 1 state amplified by 3 copies, (b) has 2 states amplified by 1 and 2 copies, (c) and (d) have all 3 states, amplified by either 1, 2, or 3 copies.
= 0.001 is used to define outliers for both SDM and BRM. CBS is applied without pruning






