A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores

David C. Anderson, Weiqun Li, Donald G. Payan and William Stafford Noble

Journal of Proteome Research. 2(2):137-146, 2003.


Abstract

Shotgun tandem mass spectrometry-based peptide sequencing using programs such as SEQUEST allows high-throughput identification of peptides, which in turn allows identification of corresponding proteins. We have applied a machine learning algorithm, called the support vector machine, to discriminate between correctly and incorrectly identified peptides using SEQUEST output. Each peptide was characterized by SEQUEST-calculated features such as delta Cn and Xcorr, measurements such as precursor ion current and mass, and additional calculated parameters such as the fraction of matched MS/MS peaks. The trained SVM classifier performed significantly better than previous cutoff-based methods at separating positive from negative peptides. Positive and negative peptides were more readily distinguished in training set data acquired on a QTOF compared to an ion trap mass spectrometer. The use of 13 features, including four new parameters, significantly improved the separation between positive and negative peptides. Use of the support vector machine and these additional parameters resulted in a more accurate interpretation of peptide MS/MS spectra, and is an important step towards automated interpretation of peptide tandem mass spectrometry data in proteomics.

Full paper in PDF format.


Supplementary data

The following data files contain tab-separated data. The first row of each file contains the names of the thirteen features associated with each peptide. In each subsequent row, the first column contains the peptide itself, and the remaining columns contain its thirteen feature values. Each data file has an associated label file in a similar format; instead of thirteen values per peptide, a label file contains a single value in each row: 1 indicates a positive example, and -1 indicates a negative example.
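As an illustration of the layout described above, the following minimal Python sketch reads one data file and its label file using only the standard library. It is not part of the original distribution. The function names `read_features` and `read_labels`, the toy peptides, the column names, and the use of three features instead of thirteen are all hypothetical. The sketch also assumes that the label file begins with a header row and that the label is the last field of each row; check this against the actual files before relying on it.

```python
import csv
import io


def read_features(handle):
    """Read a tab-separated data file: header row, then peptide + feature values."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    peptides, rows = [], []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        peptides.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    # The header may or may not name the peptide column, so keep only
    # as many trailing names as there are feature values per row.
    n_features = len(rows[0]) if rows else len(header)
    names = header[-n_features:]
    return names, peptides, rows


def read_labels(handle):
    """Read a label file. Assumes a header row and the label (1 or -1) in the last field."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # assumption: the first row is a header
    return [int(fields[-1]) for fields in reader if fields]


# Toy input for illustration only: three features rather than thirteen,
# with made-up peptides, column names, and values.
demo_data = (
    "peptide\tXcorr\tdeltaCn\tprecursor_mass\n"
    "PEPTIDEK\t2.5\t0.25\t928.5\n"
    "SAMPLER\t1.5\t0.05\t803.5\n"
)
demo_labels = "peptide\tlabel\nPEPTIDEK\t1\nSAMPLER\t-1\n"

names, peptides, rows = read_features(io.StringIO(demo_data))
labels = read_labels(io.StringIO(demo_labels))

for peptide, label in zip(peptides, labels):
    print(peptide, "positive" if label == 1 else "negative")
```

To read real files, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects. Rows in the data file and the label file are paired here by position, which assumes both files list the peptides in the same order.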

The 47 protein sequences in Fasta format used in these experiments.

Gist is the SVM software used to perform the classification.


