Biostatistics Advance Access originally published online on July 31, 2006
Biostatistics 2007 8(2):357-367; doi:10.1093/biostatistics/kxl015
© 2006 The Authors
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

A statistical method for chromatographic alignment of LC-MS data

Pei Wang*,†, Hua Tang, Matthew P. Fitzgibbon and Martin McIntosh

Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, M2-B500, PO Box 19204, Seattle, WA, USA pwang@fhcrc.org

Marc Coram†

Department of Statistics, University of Chicago, Chicago, IL, USA

Hui Zhang, Eugene Yi and Ruedi Aebersold

Institute for Systems Biology, Seattle, WA, USA

* To whom correspondence should be addressed.


    SUMMARY
 TOP
 SUMMARY
 1. INTRODUCTION
 2. LC-MS EXPERIMENT
 3. PETAL FOR LC-MS
 4. DATA EXAMPLE
 5. DISCUSSION
 REFERENCES
 
Integrated liquid-chromatography mass-spectrometry (LC-MS) is becoming a widely used approach for quantifying the protein composition of complex samples. The output of the LC-MS system measures the intensity of a peptide with a specific mass-charge ratio and retention time. In the last few years, this technology has been used to compare complex biological samples across multiple conditions. One challenge for comparative proteomic profiling with LC-MS is to match corresponding peptide features from different experiments. In this paper, we propose a new method, Peptide Element Alignment (PETAL), that uses raw spectrum data and detected peaks to simultaneously align features from multiple LC-MS experiments. PETAL creates spectrum elements, each of which represents the mass spectrum of a single peptide in a single scan. Peptides detected in different LC-MS data sets are aligned if they can be represented by the same elements. By considering each peptide separately, PETAL enjoys greater flexibility than time warping methods. While most existing methods process multiple data sets by sequentially aligning each data set to an arbitrarily chosen template data set, PETAL treats all experiments symmetrically and can analyze all experiments simultaneously. We illustrate the performance of PETAL on example data sets.

Keywords: Alignment; LC-MS; Regression; Retention time


    1. INTRODUCTION
An integrated system of liquid-chromatography mass-spectrometry (LC-MS) offers a versatile and high-throughput proteomics technology. In such a system, LC efficiently separates a peptide mixture (peptides are short amino acid sequences) based on hydrophobicity; thousands of peptides can then be identified and quantified using MS to address important biological questions (Mann and Aebersold, 2003).

While high precision LC-MS systems are available, bioinformatics tools remain incomplete. LC-MS systems generate massive amounts of data, representing the intensity of peptides with specific mass-charge ratios (mz) and LC column retention times (RT) (see Section 2.1 for more details). Statistical and computational methods are required to detect and quantify the intensity of each feature. A more challenging task is to compare multiple LC-MS profiles, which, for example, can be used to identify discriminating peptides between distinct biological groups. Because the sequence identifications of the peptides are often unavailable at this stage, one relies on RT and mz to match corresponding peptides across different samples. However, the retention time of a specific peptide depends on instrument conditions as well as the underlying composition of the mixture; variation in RT between experiments is often nonnegligible even when all samples are processed by the same LC-MS system. To a lesser extent, the mz of a peptide also varies as a result of instrument noise. For these reasons, a prerequisite for quantitative analysis of multiple LC-MS experiments is to align output data with respect to both RT and mz. In Section 2.2, we review two groups of existing methods. The first group aligns raw spectrum data before peak detection. These methods search for optimal warping functions to map the RT of one experiment to that of another. Since the warping function only accounts for "global" variation in RT, these methods may not always align individual peptides. The second group of alignment methods uses the detected feature lists and allows some variation in the RT of individual peptides. However, since these methods rely on the detected peaks and do not take advantage of the raw spectrum information, the alignment decisions are vulnerable to inaccuracy in the peak detection step. In addition, both groups of methods are formulated to work on data sets that are similar to each other, and may produce bias when analyzing different samples, such as cancer and noncancer serum. In order for LC-MS-based analysis to become a routine procedure in biomedical research, a computationally efficient and robust alignment procedure must be developed.

In this paper, we propose a statistical method, called "Peptide Element Alignment" (PETAL), which uses both raw spectrum data and peak detection results to simultaneously align features from multiple LC-MS experiments. PETAL first creates spectrum elements to represent the relative intensity profiles of individual peptides. It then models the variation in retention time and the instrument noise in intensity measurements that produces error in the mz values. Peptides detected in different LC-MS data sets are aligned if they are represented by the same element. By considering each peptide separately, this method offers greater flexibility than simply matching retention time between profiles. In addition, PETAL treats all experiments symmetrically and avoids the possible biases that may result from choosing one experiment as a template.

The rest of the paper is organized as follows: Section 2 provides a brief description of the LC-MS experiments. The PETAL method is described in Section 3. Section 4 is devoted to real data examples. In Section 5, we make several remarks regarding the strengths and weaknesses of our method in comparison to existing methods and discuss the choices of parameters in the model.


    2. LC-MS EXPERIMENT

2.1 Generic LC-MS experiment

Figure 1 is a cartoon of a typical LC-MS experiment. First, protein mixtures are isolated from biological samples and enzymatically digested into peptides (short amino acid sequences). The peptides are then separated by one or more steps of high-pressure LC, and are eluted into an electro-spray ion source, where they are nebulized in small, highly charged droplets. After evaporation, multiple protonated peptides enter the mass spectrometer, and a mass spectrum of the peptides eluting at each time point is taken (Mann and Aebersold, 2003). A more detailed introduction to LC-MS can be found in Liebler (2002).


Fig. 1. Outline of one LC-MS experiment. See text for details.

 
The output of an LC-MS experiment can be represented as a two-dimensional image. One dimension represents the elution time (also called retention time and denoted as RT) and the other dimension indicates the mass-charge ratio. Although RT is a continuous variable, the LC-MS system produces mass spectra at a discrete set of RT points, typically a few seconds apart. Thus, it is equivalent to represent RT by scan indices. The mass spectrum at one RT point, i.e. in a single scan, measures the abundance of peptide ions at each mz (each mass-charge ratio point). Figure 2 illustrates part of an LC-MS experiment result.


Fig. 2. Output from an LC-MS experiment. Left: Output of one LC-MS experiment in the region Formula. The horizontal axis represents the retention time and the vertical axis represents the mass-charge ratio. The color at each (mz, RT) indicates peptide intensity (scales are defined in the color bar). Each peak feature identified in previous analysis steps is labeled with a number (in black): the vertical coordinate of the number is its monoisotopic mass; the horizontal coordinate of the number is the index of the scan in which the feature is detected at the highest intensity; the value of the number indicates the estimated charge status. The plot is made with the R-package Nimbus (by Marc Coram, available at http://galton.uchicago.edu/~coram/). Right: Mass spectrum of scan 943 for Formula, which corresponds to one column of the left image. The isotopic shape suggests that this peptide has a charge of 3.

 
As shown in Figure 2, the mass spectrum of a peptide feature has a characteristic shape, consisting of multiple peaks equally spaced along mz. This shape, referred to as the isotopic pattern (or isotopic distribution), arises from the naturally occurring rare isotopes in the sample. The dominant source of the isotopic distribution in a mass spectrum is carbon-13, which accounts for about 1.1% of all naturally occurring carbon atoms. For a peptide with a charge of 1, the molecule with no carbon-13 and the molecule with one carbon-13 accumulate at locations that are one unit apart on the mass spectrum. More generally, for a peptide of charge k, the gap between two adjacent isotopic peaks is 1/k. The peak where the analyte consists of only light isotopes is called the mono-isotopic peak, indicating the ordinary mz value of this peptide. Thus, in MS experiments, each peptide can be characterized by its mz value (the position of the mono-isotope) and charge status (the isotope shape). This information can be used for peptide peak detection as well as for subsequent alignment.
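The relation between charge and isotopic peak spacing can be illustrated with a small sketch. The function names and the default spacing of one mass unit are assumptions made for illustration (the exact carbon-13 mass difference is slightly larger than one unit); this is not part of the PETAL software.

```python
def isotopic_peak_positions(mono_mz, charge, n_peaks=4, spacing=1.0):
    """Return the mz positions of the first n_peaks isotopic peaks.

    spacing is the mass gap between adjacent isotopes (one unit, as in the
    text); on the mz axis the gap becomes spacing / charge.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return [mono_mz + i * spacing / charge for i in range(n_peaks)]


def estimate_charge(peak_mzs, spacing=1.0):
    """Guess the charge from the mean gap between adjacent isotopic peaks."""
    if len(peak_mzs) < 2:
        raise ValueError("need at least two peaks to estimate a gap")
    gaps = [b - a for a, b in zip(peak_mzs, peak_mzs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return round(spacing / mean_gap)
```

For example, peaks observed at gaps of roughly one third of an mz unit would be read as a charge of 3, which is the reasoning behind the charge assignment in the right panel of Figure 2.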

2.2 Existing alignment methods

The goal of alignment is to match corresponding peptide features in the mz-scan plot (e.g. Figure 2) from different experiments in the presence of retention time variation and experimental noise. Bylund and others (2002) proposed a time warping method based on raw spectra for alignment of LC-MS data, which is a modification of the original correlated optimized warping algorithm (Nielsen and others, 1998). After choosing one file as a template, the method warps the time coordinate (RT axis) of another file to give maximal similarity between the two images. This framework was also used by Wang and others (2003), who implemented a dynamic time warping algorithm allowing every RT point to be moved. However, compared with the classical one-dimensional chromatography profiles, LC-MS data have an added dimension of mass spectral information, which makes the alignment problem more complicated. Different peptides with different mz values may have different retention time shifts between two experiments. In other words, two peptides eluting at the same time in one experiment may not necessarily elute at the same time in another experiment. Therefore, mapping only the retention time coordinates between two LC-MS files is not sufficient to provide alignment for individual peptides.

Instead of using raw spectrum data, Radulovic and others (2004) performed alignment based on the (mz, RT) values of detected features. Their method first divides the mz domain into several intervals and fits a different piece-wise linear time warping function for each mz interval. After the time warping, a "wobble" function is applied wherein a peak is allowed to move (Formula1–2% of total scan range) in order to match with the nearest adjacent peak in another file. Here, the stratification of mz achieves improved flexibility and accuracy. Since the method relies on only the (mz, RT) values of detected peptide features, it fails to take advantage of other information in the raw image (such as the isotope distribution). In addition, the wobble function may produce ambiguous findings when complex mixtures like human serum are processed, where multiple peptides may exist within the Formula1–2% window.

Recent software platforms, "msInspect" (Bellew and others, 2006), "SpecArray" (Li and others, 2005), and "MZmine" (Katajamaa and Oresic, 2005), provide alignment solutions by allowing variation in the RT of individual peptides within the detected feature lists. However, since these methods rely on the peak detection result and do not take advantage of the raw spectrum information, the alignment decisions are vulnerable to any inaccuracy in the peak detection step. Moreover, most methods process multiple data sets by sequentially aligning each to an arbitrarily chosen template profile, which may lead to unpredictable errors. Most methods work best on data sets that are similar to each other. They are likely to produce bias when analyzing samples from different disease classes such as cancer and noncancer tissues.

Other related algorithms are discussed in Listgarten and others (2005), including a hierarchical clustering method for aligning MALDI/SELDI spectra (Tibshirani and others, 2004), a multi-scale wavelet decomposition approach for aligning MALDI data along the mz axis (Randolph and Yasui, 2004), and a Hidden Markov Model for multiple alignments of time series. Prakash and others (2006) recently proposed a novel signal mapping algorithm to perform comparisons directly on the signal level of MS experiments.

To overcome the drawbacks of current methods, we propose a new alignment algorithm, PETAL, for LC-MS data. It uses both the raw spectrum data and the information of the detected peak features for peptide alignment.


    3. PETAL FOR LC-MS
In LC-MS profiles, each peptide is characterized by two things: its mass spectrum and its retention time range. The mass spectrum of one scan is a vector recording the intensity measurements along mz for peptides eluting in this scan. For any given peptide, its "element spectrum vector" is defined as the mass spectrum with no experimental noise that contains only one unit abundance of this peptide Formula, where L is the total pixel number along Formula and Formula is the intensity value at the lth pixel. One spectrum element vector Formula can be uniquely determined by the mono-isotope position and the charge status (isotopic pattern) of the corresponding peptide. In addition, we denote the theoretical retention time range of one peptide as Formula. The kth peptide can then be represented as Formula.

We define a peptide element library Formula as a collection of all possible peptides appearing in the target samples. Given a library of peptide elements, the goal of alignment can be easily achieved by matching the peak features in each profile to this common library. Peak features from different profiles matched to the same peptide element are features representing the same peptide and should be aligned.

We now introduce a loss function and seek the solution of alignment by solving an optimization problem.

3.1 Loss function

We first consider the mass spectrum of one scan. Suppose there are H different peptides Formula eluting in this scan, and the measurable abundance (the abundance of peptides that can be measured in LC-MS experiment) of Formula is Formula. The entire mass spectrum of one scan is the sum of all individual peptide spectra eluting in the scan. The observed mass spectrum Formula can therefore be represented as a linear combination of the spectrum element vectors of the H peptides: Formula, where Formula is the instrument noise and Formula is the spectrum element vector of Formula.
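The linear-combination view of a single scan can be made concrete with a small simulation. This is a sketch under stated assumptions: the pixel grid, the three-peak isotopic shape, the normalization of each element to sum to one ("unit abundance"), and the names `element_vector` and `simulate_scan` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def element_vector(n_pixels, mono_index, charge, shape, pixels_per_unit):
    """Place an isotopic shape on the mz grid and normalize it to sum to 1.

    Assumes pixels_per_unit is divisible by charge, so that the isotopic gap
    of 1/charge mz units is a whole number of pixels.
    """
    x = np.zeros(n_pixels)
    step = pixels_per_unit // charge
    for i, height in enumerate(shape):
        idx = mono_index + i * step
        if idx < n_pixels:
            x[idx] = height
    return x / x.sum()


def simulate_scan(elements, abundances, noise_sd=0.0, seed=0):
    """Observed scan = sum of abundance * element vector + Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = np.column_stack(elements)            # one column per peptide element
    beta = np.asarray(abundances, dtype=float)
    noise = rng.normal(0.0, noise_sd, size=X.shape[0])
    return X @ beta + noise
```

With `noise_sd=0`, the simulated scan is exactly the abundance-weighted sum of the element vectors, which is the noise-free part of the model described above.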

In reality, we would not know which peptides elute in an observed scan Formula. However, with a peptide element library Formula, we can estimate the peptide abundances by fitting an L1-penalized least squares regression model Formula:


Formula (3.1)

where Formula is a nonnegative parameter, t is the retention time of scan Formula, and Formula is a weight function depending on the scan retention time t as well as the theoretical retention time of each peptide Formula. The L1-norm penalty controls the total number of nonzero coefficients in the solution (Tibshirani, 1996). The weight function Formula gives a larger penalty to peptides whose theoretical retention time Formula is further from the scan retention time t, so that the corresponding predictors (Formula) are selected less often. A simple example for Formula is


Formula (3.2)

where Formula is a nonnegative parameter.
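Because the typeset equations did not survive in this copy, the general shape of the single-scan criterion described in the text can be written out as follows. The symbols are assumptions introduced here for readability and may differ from the original notation: Y is the observed scan, X_k the spectrum element vector of the kth library peptide, T_k its theoretical retention time range, \beta_k its abundance, \lambda the nonnegative penalty parameter, and w the retention-time weight function.

```latex
\hat{\beta}(Y) \;=\; \arg\min_{\beta}\;
  \Bigl\| Y - \sum_{k=1}^{K} \beta_k X_k \Bigr\|_2^2
  \;+\; \lambda \sum_{k=1}^{K} w(t, T_k)\,\lvert \beta_k \rvert
```

Any nonnegative w that grows with the distance between t and T_k fits the description in the text; the specific form used in (3.2) is not reproduced here.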

In the solution of (3.1), a nonzero estimate of Formula indicates that part of the signal in Formula matches the kth peptide in the peptide element library. Thus, if we have the proper regression models for two scans, Formula and Formula, from two different profiles, the alignment between these two scans can be achieved by comparing the coefficient sets Formula and Formula.
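A minimal numerical sketch of this idea follows. It fits a weighted L1-penalized least squares model by plain coordinate descent and compares the nonzero coefficients of two scans. The weight function `rt_weight` (1 inside the retention time range, growing linearly outside it), the solver, and all function names are illustrative assumptions, not the weight function of (3.2) or the authors' implementation.

```python
import numpy as np


def rt_weight(t, rt_range, gamma=1.0):
    """Illustrative weight: 1 inside the RT range, larger further outside."""
    lo, hi = rt_range
    dist = max(lo - t, 0.0, t - hi)
    return 1.0 + gamma * dist


def weighted_lasso(X, y, penalties, n_iter=200):
    """Minimize ||y - X b||^2 + sum_k penalties[k] * |b_k| by coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.astype(float).copy()                 # current residual
    for _ in range(n_iter):
        for k in range(p):
            if col_sq[k] == 0:
                continue
            r += X[:, k] * b[k]                # remove element k from the fit
            rho = X[:, k] @ r
            shrunk = max(abs(rho) - penalties[k] / 2.0, 0.0)
            b[k] = np.sign(rho) * shrunk / col_sq[k]   # soft threshold
            r -= X[:, k] * b[k]
    return b


def fit_scan(y, t, elements, rt_ranges, lam, gamma=1.0):
    """Estimate abundances of all library elements for one scan at time t."""
    X = np.column_stack(elements)
    penalties = np.array([lam * rt_weight(t, r, gamma) for r in rt_ranges])
    return weighted_lasso(X, np.asarray(y, dtype=float), penalties)


def matched_elements(b1, b2, tol=1e-8):
    """Library elements with nonzero coefficients in both scans."""
    s1 = {int(k) for k in np.flatnonzero(np.abs(b1) > tol)}
    s2 = {int(k) for k in np.flatnonzero(np.abs(b2) > tol)}
    return s1 & s2
```

In this sketch, a weak signal that could only be explained by an element whose retention time range is far from the scan time is shrunk to zero by the larger penalty, so it does not produce a spurious match.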

Suppose there are N profiles and each profile has Formula observed scans. Given a peptide element library Formula, we are interested in Formula satisfying


Formula (3.3)

where Formula and Formula are the mass spectrum vector and the retention time of the mth scan in the nth profile.

In most cases, the peptide element library Formula is not available at this point. We therefore also need to identify the peptide element library (Formula) that best explains all mass spectrum scans observed in the experiments (Formula). Thus, we introduce the overall "loss function":


Formula (3.4)

and search for


Formula (3.5)

where Formula is a penalty term for overall model complexity. The choices of Formula and Formula are discussed in Section 5.
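For orientation, the overall criterion described in this subsection can be sketched as below. The notation is assumed, not recovered from the original: \Theta is the peptide element library, Y_{nm} and t_{nm} are the spectrum and retention time of the mth scan in the nth profile, M_n is the number of scans in profile n, and P_\eta(\Theta) stands for the model complexity penalty, whose exact form is not reproduced here.

```latex
L(\Theta, \beta) \;=\; \sum_{n=1}^{N} \sum_{m=1}^{M_n}
  \Bigl[\, \bigl\| Y_{nm} - \sum_{k} \beta_{nmk} X_k \bigr\|_2^2
  + \lambda \sum_{k} w(t_{nm}, T_k)\,\lvert \beta_{nmk} \rvert \Bigr],
\qquad
(\hat{\Theta}, \hat{\beta}) \;=\; \arg\min_{\Theta,\,\beta}
  \bigl\{ L(\Theta, \beta) + P_\eta(\Theta) \bigr\}
```

The first expression sums the single-scan criterion over all scans of all profiles; the second adds the library-level penalty and minimizes jointly over the library and the coefficients.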

The main part of the loss function in (3.4) can also be deemed as the negative log joint likelihood of Formula under some reasonable assumptions as discussed in Section A of the supplementary material available at Biostatistics online.

Note that besides the random variation in retention time due to individual peptides, there is always some systematic retention time shifting across LC-MS experiments. Thus, we first apply a global transformation to adjust for the systematic trend and then use the adjusted time to calculate Formula. The details of the global transformation are described in Section B of the supplementary material available at Biostatistics online.

3.2 Optimization strategy

From the loss function in (3.5), we can see that if Formula is given, the optimal solution for Formula can be easily calculated with L1-regression techniques such as "lasso" (Tibshirani, 1996) and "lars" (Efron and others, 2003). Thus, our main obstacle is to find the appropriate peptide element library Formula.

It is difficult to directly search the whole vector space of element spectra. We therefore approach this problem using two steps. First, we build an initial collection of peptide elements based on all profiles subjected to alignment. This initial collection is expected to represent all peptides appearing in the experiments, but it may also contain redundant or incorrect elements. Then, we search for the subset of the initial collection that minimizes the target loss function. The details of these two steps are described below.

Initial collection.

To build the initial set, we include one peptide element for every peak feature detected in the profiles. To do so, we first need to estimate the ideal isotopic shapes. Since, at a given mass, the variation of isotopic shapes resulting from the differences between amino acid sequences is much less than the variation introduced by experimental noise, we assume that the peptides with similar mass values and at the same charge status have the same isotopic pattern. Thus, the "ideal" isotopic shape of a certain charge and mass can be approximated by averaging all feature spectra of the same charge status and similar mass values. With these empirical isotopic shapes, we make spectrum element vectors Formula based on the estimated mono-isotope positions and charge values of detected features. Details are provided in Section C of supplementary material available at Biostatistics online.
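The averaging step for empirical isotopic shapes might look like the following sketch. The input layout (mass, charge, list of isotopic peak heights), the fixed-width mass bins, and the bin width are assumptions for illustration; the procedure actually used is described in the supplementary material and may differ.

```python
from collections import defaultdict


def empirical_isotopic_shapes(features, mass_bin_width=500.0):
    """Average normalized isotopic shapes per (charge, mass bin).

    features: iterable of (mass, charge, peak_heights), where peak_heights
    lists the intensities of the isotopic peaks of one detected feature.
    Returns {(charge, mass_bin): averaged shape as a list of fractions}.
    """
    groups = defaultdict(list)
    for mass, charge, heights in features:
        total = sum(heights)
        if total <= 0:
            continue                      # skip empty or invalid features
        normalized = [h / total for h in heights]
        groups[(charge, int(mass // mass_bin_width))].append(normalized)

    shapes = {}
    for key, members in groups.items():
        n_peaks = min(len(m) for m in members)   # use peaks common to all
        shapes[key] = [
            sum(m[i] for m in members) / len(members) for i in range(n_peaks)
        ]
    return shapes
```

Normalizing each feature before averaging keeps high-intensity features from dominating the estimated shape.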

We denote this initial collection as Formula.

Subset selection.

There are two major strategies for subset selection, forward-stepwise and backward-stepwise. In this section, we focus on the backward-stepwise strategy, which enjoys a higher computational efficiency than the forward-stepwise strategy (discussed in Section D of the supplementary material available at Biostatistics online).

Backward-stepwise begins with the whole collection Formula and removes redundant elements iteratively. Instead of eliminating the redundant elements one by one as is usually done, we propose a more efficient procedure. As mentioned before, each peptide in the experiments may correspond to more than one peptide element in the initial collection contributed by different profiles. If we can cluster peptide elements in some appropriate way, such that elements representing the same peptide are grouped together, we will be able to eliminate the redundancy of multiple clusters simultaneously.

For this purpose, we apply a sparse regression approach called elastic net (Zou and Hastie, 2005), which aims to minimize the loss function Formula. The ridge penalty term encourages a grouping effect: strongly correlated predictors tend to be in or out of the model together. The lasso penalty term yields a sparser representation.
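For reference, the standard (naive) elastic net criterion of Zou and Hastie combines a lasso and a ridge penalty. In generic regression notation, with response y, design matrix X, and nonnegative tuning parameters \lambda_1 and \lambda_2 (symbols assumed here, since the original formula is not shown), it reads:

```latex
L(\lambda_1, \lambda_2, \beta) \;=\;
  \lVert y - X\beta \rVert_2^2
  \;+\; \lambda_1 \lVert \beta \rVert_1
  \;+\; \lambda_2 \lVert \beta \rVert_2^2
```

Setting \lambda_2 = 0 recovers the lasso, and setting \lambda_1 = 0 recovers ridge regression.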

The new backward-stepwise procedure is as follows:

  1. Take all M feature scans of target profiles and the initial collection of peptide elements Formula (Formula).
  2. For Formula to M, do elastic net regression for Formula with a fixed number of maximum steps. Thus, each element gets M coefficients from M regression models.
  3. Cluster elements based on the coefficient vector (length of M), such that elements representing the same peptide are grouped together. Representing each cluster with one element, we get a new set of Formula elements.
  4. Repeat steps 2–3 until Formula.
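The loop in steps 1–4 can be sketched as follows. Several choices here are assumptions rather than the authors' implementation: the elastic net is solved by coordinate descent with fixed penalties (instead of a path algorithm with a maximum number of steps), clustering is single-linkage with a Euclidean distance cutoff, each cluster is represented by one arbitrary member, and the loop stops when the number of elements no longer changes.

```python
import numpy as np


def elastic_net(X, y, l1, l2, n_iter=100):
    """Minimize ||y - X b||^2 + l1 * |b|_1 + l2 * |b|_2^2 by coordinate descent."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        for k in range(p):
            r += X[:, k] * b[k]
            rho = X[:, k] @ r
            b[k] = np.sign(rho) * max(abs(rho) - l1 / 2.0, 0.0) / (col_sq[k] + l2)
            r -= X[:, k] * b[k]
    return b


def cluster_by_distance(vectors, cutoff):
    """Single-linkage grouping: rows closer than cutoff share a label."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(vectors[i] - vectors[j]) < cutoff:
                parent[find(j)] = find(i)
    return [find(i) for i in range(n)]


def backward_stepwise(scans, elements, l1, l2, cutoff, max_rounds=10):
    """scans: (M, L) array; elements: (K, L) array. Returns kept element indices."""
    keep = np.arange(elements.shape[0])
    for _ in range(max_rounds):
        X = elements[keep].T                                  # (L, K_current)
        coefs = np.array([elastic_net(X, y, l1, l2) for y in scans])
        by_element = coefs.T                                  # length-M vector per element
        # Elements never selected by any regression are dropped directly.
        used = np.flatnonzero(np.abs(by_element).sum(axis=1) > 1e-10)
        labels = cluster_by_distance(by_element[used], cutoff)
        reps = sorted(set(labels))                            # one member per cluster
        new_keep = keep[used[reps]]
        if len(new_keep) == len(keep):
            break
        keep = new_keep
    return keep
```

Duplicate elements contributed by different profiles receive nearly identical coefficients across scans because of the ridge term, so they fall into one cluster and are collapsed to a single representative.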

We choose not to cluster directly on the original element spectrum vector space because it is not straightforward to define an appropriate distance measurement between elements that takes into consideration the meaning of the isotopic pattern and retention time. However, after we map elements to the coefficient space through regression, we can easily use Euclidean distance for clustering. In addition, the regression procedure enjoys a "selection" effect, such that incorrect basis elements will not enter the models and will be eliminated from the library directly.

The performance of the algorithm is illustrated with data examples in Section 4.


    4. DATA EXAMPLE
PETAL is applied to a data example from a spike-in experiment, and its performance is compared with the performance of two other alignment methods implemented in the publicly available software packages msInspect (Bellew and others, 2006) and SpecArray (Li and others, 2005). (A more detailed illustration of how PETAL works to solve the challenges of the alignment problem is shown in Section E of the supplementary material available at Biostatistics online, where PETAL is applied to a data example of human serum samples.)

In the spike-in experiment, three different biological samples were analyzed with LC-MS instruments†. The three samples were (1) 20 μg of a mix of four bovine glycoproteins, (2) 80 μl of normal human serum, and (3) a mixed sample of bovine glycoproteins and 80 μl of human serum at a concentration of 20 μg/ml bovine glycoproteins in human serum. Three LC-MS replicates were collected for each biological sample, which resulted in a total of nine LC-MS profiles.

Peak signals corresponding to peptide features in each LC-MS profile were first detected using both msInspect (msI) and specArray (specA). msI returns Formula peptide features for each LC-MS profile and specA returns Formula peptide features. Comparing the quality of the feature detection algorithms requires more than comparing feature counts for each individual profile. However, since the feature detection step is not the focus of this paper, we do not discuss it further here.

The alignment method of specA makes use of information computed specifically by its own feature detection methods, whereas msI uses only mz, RT, and charge information. Thus, to better characterize the advantages of the different alignment methods, we compare the performance of PETAL and the alignment method in msI using feature lists returned by msI, and we compare the performance of PETAL and the alignment method in specA using feature lists returned by specA.

We assessed the performance of alignment using two criteria: one is the efficiency of recognizing features corresponding to the same peptide, and the other is the degree of false alignment, that is, incorrectly matched features corresponding to different peptides.

First, we use replicate profiles to examine the alignment efficiency. Since the majority of the peptides in a sample should behave the same across replicate LC-MS experiments, we expect to see the majority of the features aligned across replicate profiles. In Table 1, column N3R (column N2R) shows the number (percentage) of features aligned across the three replicates (two replicates) of any biological sample by different alignment methods. We can see that PETAL recognized many more matching peptide features across replicate profiles than either msI (8058 vs. 7686) or specA (3510 vs. 3088).


Table 1. Alignment results. Column names: FD, feature detection method; Align, alignment method; TN, total number of features in all files after alignment; N3R, number of features appearing in all three replicates of any biological sample (bovine protein, human serum, or bovine + serum mixture); N2R, number of features appearing in at least two replicates of any biological sample; NBF, number of features in the bovine + serum mixture that correspond to bovine proteins (see text for details).

 
On the other hand, the same peptide should have similar intensities across replicate LC-MS experiments, and none of the alignment methods takes intensity information into consideration when matching features across different profiles. We can therefore use the correlation of intensities of aligned features between two replicate profiles to assess the alignment quality: the more falsely aligned pairs there are, the less correlated the intensities of aligned features tend to be. The correlation coefficient of log-intensities of aligned features between each replicate pair is illustrated in Figure 3 (the log scale is used to adjust for the heavy right tail of the intensity distribution). From the figure, we can see that feature pairs aligned by PETAL have more similar intensities than the feature pairs aligned by msI and specA, which suggests that PETAL achieves better alignment quality.
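The quality measure described here is the Pearson correlation of log-intensities over aligned feature pairs. A minimal sketch, assuming the aligned pairs are available as a list of positive intensity tuples (the input layout and function name are hypothetical):

```python
import math


def log_intensity_correlation(pairs):
    """Pearson correlation of log-intensities over aligned feature pairs.

    pairs: list of (intensity_in_replicate_a, intensity_in_replicate_b),
    all strictly positive, with at least two distinct values per replicate.
    """
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(b) for _, b in pairs]
    n = len(pairs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

A constant scale difference between two replicates shifts all log-intensities by the same amount and so leaves the correlation unchanged, whereas falsely aligned pairs pull it down.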


Fig. 3. Correlation coefficients of log-intensities of aligned features. The x-axis indicates the sample pair: B1–2, B1–3, and B2–3 are the three replicate pairs of the bovine protein sample; S1–2, S1–3, and S2–3 are those of the human serum sample; SB1–2, SB1–3, and SB2–3 are those of the bovine + serum sample. The y-axis represents the values of the correlation coefficients. The labels in the legend indicate the different combinations of feature detection and alignment methods (see Table 1).

 
In addition, with the spike-in design of the experiment, it is of interest to investigate how efficiently different methods detect the spiked-in bovine peptides in the bovine + serum sample. Peptide features are deemed candidate spiked-in bovine peptides if they appear in at least two replicates of the bovine + serum sample, as well as in at least two replicates of the bovine protein sample, but not in any replicates of the serum sample. The numbers of candidate bovine peptides resulting from different methods are listed in column NBF of Table 1. PETAL detects more than 55 candidate bovine peptides on both sets of feature detection results, which is almost twice the number of candidate bovine peptides detected by msI and four times the number detected by specA. (With the newly developed LTQ-FT instrument, which simultaneously provides intensity measurements and tandem mass spectrum measurements for each target peptide ion, it is possible to further validate those candidate features as bovine peptides by deriving the peptide sequence IDs from the tandem mass spectra through database searching. However, due to the limitation of the facilities, such a data set is not available at this point.)
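The selection rule for candidate bovine peptides translates directly into code. In this sketch the presence table, its keys ("B" for bovine, "S" for serum, "SB" for bovine + serum), and the function name are hypothetical; only the rule itself comes from the text.

```python
def candidate_bovine_features(presence):
    """Apply the spike-in rule to aligned features.

    presence: dict mapping feature id -> {"B": [...], "S": [...], "SB": [...]},
    where each list holds one boolean per replicate (True if detected).
    A feature is a candidate if it is seen in at least two SB replicates,
    at least two B replicates, and no S replicate.
    """
    candidates = []
    for feature_id, reps in presence.items():
        in_mix = sum(reps["SB"]) >= 2
        in_bovine = sum(reps["B"]) >= 2
        in_serum = sum(reps["S"]) > 0
        if in_mix and in_bovine and not in_serum:
            candidates.append(feature_id)
    return sorted(candidates)
```

Counting the returned identifiers for each combination of feature detection and alignment method would give the kind of number reported in column NBF.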

Overall, we conclude that with the same alignment quality as (if not better than) the other two alignment methods, PETAL achieves the highest alignment efficiency.


    5. DISCUSSION
In this paper, we introduce a new alignment method, PETAL, which uses both raw spectrum data and peak detection results to simultaneously align features from multiple LC-MS experiments. By considering each peptide separately, this method offers more flexibility than simply matching retention time between different profiles. It treats all experiments symmetrically and avoids the possible biases that may result from choosing one experiment as a template. In addition, although PETAL is based on feature lists from the peak detection procedure, the ability to consider spectrum information and jointly learn from multiple profiles enables PETAL to improve the peak detection in return.

The backward-stepwise optimization strategy, whose computational complexity is about Formula (where N is the total number of samples and K is the total number of peptide features in the study), is more efficient than the forward-stepwise strategy, whose computational complexity is about Formula. For further computational simplicity, we can divide the entire mz domain into multiple mz blocks and then conduct alignments for individual mz blocks in parallel. The L1-norm penalty parameter Formula is controlled by forcing the total number of nonzero coefficients to be smaller than Formula in each regression model. For the data example in Section 4, we used 100 mz blocks with each block averaging 5–10 mz. We choose Formula in the forward-stepwise strategy and Formula in the backward-stepwise strategy, where N is the total number of samples. The model complexity penalty parameter Formula is also controlled differently in the two strategies. For forward-stepwise, controlling Formula is equivalent to controlling the stopping constant Formula, with smaller Formula corresponding to a larger value of K. For backward-stepwise, Formula corresponds to the cutoff criterion in the clustering steps. The number of clusters represents the number of selected elements in the library (K).
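Dividing the mz domain into blocks is a simple partition of the feature list. The sketch below assumes equal-width blocks and features given as (mz, rt) tuples; the mz range, the function name, and the handling of features near block boundaries (ignored here) are illustrative assumptions.

```python
def split_into_mz_blocks(features, mz_min, mz_max, n_blocks):
    """Partition (mz, rt) features into n_blocks equal-width mz blocks.

    Features at or beyond the edges are clipped into the first or last block.
    Each block can then be aligned independently, for example in parallel.
    """
    width = (mz_max - mz_min) / n_blocks
    blocks = [[] for _ in range(n_blocks)]
    for mz, rt in features:
        idx = int((mz - mz_min) / width)
        idx = max(0, min(idx, n_blocks - 1))
        blocks[idx].append((mz, rt))
    return blocks
```

With 100 blocks over a hypothetical range of 400 to 1400 mz, each block would span 10 mz, which is the same order as the block sizes mentioned above.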

PETAL can be easily applied to the scenario where an AMT (Accurate Mass and Time Tag) database is available (Fang and others, 2006). In such cases, the peptide element library Formula can be derived directly from the AMT database, and then only the regression coefficients Formula need to be estimated. Furthermore, for LC-MS experiments with isotopic labeling, the view of scan spectra as linear combinations of peptide element spectra, together with the regression techniques discussed in this paper, can be used to accurately estimate the intensity ratio of the light versus heavy forms when the mass of the labeling material does not allow complete separation of the two forms.
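The idea of treating an observed spectrum as a linear combination of element spectra can be sketched for the light/heavy case as follows. This is a simplified illustration under stated assumptions, not the method of the paper: it uses plain (unpenalized) least squares with two known templates, and the names `fit_light_heavy` and `light_heavy_ratio`, as well as the template vectors in the usage note, are hypothetical.

```python
def fit_light_heavy(observed, light_template, heavy_template):
    """Least-squares fit of observed ~ a * light_template + b * heavy_template.

    Solves the 2x2 normal equations directly and returns (a, b).
    No penalty or non-negativity constraint is applied in this sketch.
    """
    if not (len(observed) == len(light_template) == len(heavy_template)):
        raise ValueError("all spectra must have the same length")
    s_ll = sum(l * l for l in light_template)
    s_hh = sum(h * h for h in heavy_template)
    s_lh = sum(l * h for l, h in zip(light_template, heavy_template))
    s_ly = sum(l * y for l, y in zip(light_template, observed))
    s_hy = sum(h * y for h, y in zip(heavy_template, observed))
    det = s_ll * s_hh - s_lh * s_lh
    if abs(det) < 1e-12:
        raise ValueError("templates are (nearly) collinear; ratio not identifiable")
    a = (s_ly * s_hh - s_lh * s_hy) / det
    b = (s_ll * s_hy - s_lh * s_ly) / det
    return a, b


def light_heavy_ratio(observed, light_template, heavy_template):
    """Estimated intensity ratio of the light form to the heavy form."""
    a, b = fit_light_heavy(observed, light_template, heavy_template)
    if b == 0:
        raise ValueError("heavy-form coefficient is zero; ratio undefined")
    return a / b
```

As a usage example, take made-up templates `light = [1.0, 0.5, 0.2, 0.0, 0.0]` and `heavy = [0.0, 0.0, 1.0, 0.5, 0.2]`, which overlap in the third position. Both coefficients can still be recovered because the two templates are not collinear, which is the situation the paragraph above describes as incomplete separation of the two forms.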

The R package implementing the PETAL algorithm is available at http://peiwang.fhcrc.org/research-project.html.


    ACKNOWLEDGMENTS
 
We thank M. Igra, M. Bellew, and D. May for assistance with the msInspect software; M. Brusniak, O. Vitek, and X. Li for assistance with the specArray software; R. Fang for testing the program; and A. E. Detter for helpful suggestions and proofreading of the manuscript. This work was funded by National Cancer Institute (NCI) contract 23XS144A. Martin Mcintosh was supported in part by NCI contract P50 CA83636. Hua Tang, Eugene Yi, and Ruedi Aebersold were supported in part with federal funds from National Heart, Lung, and Blood Institute, National Institutes of Health (NIH) contract N01-HV-28179, NCI, NIH contract N01-CO-12400, and grant R21-CA-114852. Conflict of Interest: None declared. Funding to pay the Open Access publication charges for this article was provided by NCI contract 23XS144A.


    FOOTNOTES
 
{dagger} Equal contributors.

{dagger} The LC-MS system consists of a Bruker Daltonics Micro-TOF mass spectrometer equipped with a home-built nanospray device. Glycopeptides were first isolated from proteins in 80 μl (Zhang and others, 2005; Zhang and others, 2003), and peptides from 5 μl of original serum were used in each MS analysis.


    REFERENCES

    Bellew M, Coram M, Igra M, Fitzgibbon M, Randolph T, Wang P, Eng J, Lin C, Goodlett D, Fang R and others. (2006) Informatics method for generating peptide arrays from high resolution LC-MS measurements from complex protein mixtures. Bioinformatics doi:10.1093/bioinformatics/btl379.

    Bylund D, Danielsson R, Malmquist G, Markides KE. (2002) Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography-mass spectrometry data. Journal of Chromatography A 961:237–44.

    Efron B, Johnstone I, Hastie T, Tibshirani R. (2003) Least angle regression. Annals of Statistics 32:407–99.

    Fang R, Elias DA, Monroe ME, Shen Y, Mcintosh M, Wang P, Goddard CD, Callister SJ, Moore RJ, Gorby YA and others. (2006) Differential label-free quantitative proteomic analysis of Shewanella oneidensis cultured under aerobic and suboxic conditions by accurate mass and time tag approach. Molecular & Cellular Proteomics 5:714–25.

    Katajamaa M and Oresic M. (2005) Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics 6:179.

    Li X, Yi E, Kemp C, Zhang H, Aebersold R. (2005) A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Molecular & Cellular Proteomics 4:1328–40.

    Liebler DC. (2002) Introduction to Proteomics (Humana Press, Totowa, NJ).

    Listgarten J, Neal RM, Roweis ST, Emili A. (2005) Multiple alignment of continuous time series. In Saul LK (Ed.), et al. Advances in Neural Information Processing Systems 17 (aka NIPS*2004) (MIT Press, Cambridge, MA).

    Mann M and Aebersold R. (2003) Mass spectrometry-based proteomics. Nature 422:198–207.

    Nielsen NP, Carstensen JM, Smedsgaard J. (1998) Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. Journal of Chromatography A 805:17–35.

    Prakash A, Mallick P, Whiteaker J, Zhang H, Paulovich A, Flory M, Lee H, Aebersold R, Schwikowski B. (2006) Signal maps for mass spectrometry-based comparative proteomics. Molecular & Cellular Proteomics 5:423–32.

    Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, Emili A. (2004) Informatics platform for global proteomic profiling and biomarker discovery using liquid-chromatography-tandem mass spectrometry. Molecular & Cellular Proteomics 3:984–97.

    Randolph TW and Yasui Y. (2004) Multiscale processing of mass spectrometry data. Biometrics 62:589–97.

    Tibshirani R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B 58:267–88.

    Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le Q. (2004) Sample classification from protein mass spectrometry by peak probability contrasts. Bioinformatics 20:3034–44.

    Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, Norton S, Kumar P, Anderle M, Becker C. (2003) Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Analytical Chemistry 75:4818–26.

    Zhang H, Yi EC, Li X-J, Mallick P, Kelly-Spratt KS, Masselon CD, Camp DG II, Smith RD, Kemp CJ, Aebersold R. (2005) High throughput quantitative analysis of serum proteins using glycopeptide capture and liquid chromatography mass spectrometry. Molecular & Cellular Proteomics 4:144–55.

    Zhang H, Li X, Martin D, Aebersold R. (2003) Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nature Biotechnology 21:660–6.

    Zou H and Hastie T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B 67:301–20.

    Received December 20, 2005; revised May 26, 2006; revised July 11, 2006; accepted for publication July 13, 2006.

