Semi-supervised learning for peptide identification from shotgun proteomics datasets
Lukas Käll, Jesse D. Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss
Nature Methods 4:923 - 925, November 2007
Abstract
Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic dataset and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.
- Download the Percolator software here. For the experiments reported in the paper, Percolator version 1.01 was used.
- Supplementary data:
| | Yeast 1 | Yeast 2 | Yeast 3 | Yeast 4 | Worm | Elastase | Chymotrypsin |
|---|---|---|---|---|---|---|---|
| Spectra | 35236 | 37641 | 35414 | 35467 | 207804 | 29185 | 30340 |
| PSMs | 69705 | 74113 | 69901 | 70173 | 408322 | 57860 | 60217 |
| PSMs with q-value < 0.01 | 12691 | 12619 | 12482 | 12608 | 70152 | 4861 | 5721 |

- For the "Yeast 1" spectra, we generated 10 sets of decoy PSMs using 10 decoy databases (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
- The yeast-01 data is available in tab-delimited format.
- We used the following SEQUEST parameter file and target database for the yeast data.
- We used the following SEQUEST parameter file and target database for the worm data.
The ms2 and sqt file formats are described in McDonald et al. (2004).
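
The supplementary table reports the number of PSMs with q-value < 0.01. As a rough illustration of what such a threshold means, the Python sketch below estimates q-values from a ranked list of target and decoy PSMs. It uses one common target-decoy estimator (decoys divided by targets above each score, followed by a running minimum) and is not necessarily the exact procedure used in the paper or in Percolator. The function names `estimate_q_values` and `count_accepted_targets` and the toy scores are hypothetical.

```python
def estimate_q_values(psms):
    """Estimate q-values for (score, is_decoy) pairs; higher score = better.

    Assumption: the false discovery rate at a score threshold is estimated
    as (#decoys above threshold) / (#targets above threshold). Tied scores
    are not treated specially in this sketch.
    Returns a list of (score, is_decoy, q_value), sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR when accepting every PSM down to each rank.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR among all thresholds that still accept this
    # PSM, computed as a running minimum from the worst score upward.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def count_accepted_targets(psms, threshold=0.01):
    """Count target PSMs whose estimated q-value is below the threshold."""
    return sum(
        1 for _s, is_decoy, q in estimate_q_values(psms)
        if not is_decoy and q < threshold
    )


# Toy example with made-up scores: three target PSMs and two decoy PSMs.
toy_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
print(count_accepted_targets(toy_psms))
```

In the toy list, only the two top-scoring targets rank above the first decoy, so they are the only PSMs with an estimated q-value below 0.01.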