Semi-supervised learning for peptide identification from shotgun proteomics datasets

Lukas Käll, Jesse D. Canterbury, Jason Weston, William Stafford Noble and Michael J. MacCoss

Nature Methods 4:923-925, November 2007

Abstract

Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic dataset and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.

Download the Percolator software here. For the experiments reported in the paper, Percolator version 1.01 was used.
Supplementary data:

	Yeast 1	Yeast 2	Yeast 3	Yeast 4	Worm	Elastase	Chymotrypsin
Spectra	35236	37641	35414	35467	207804	29185	30340
PSMs	69705	74113	69901	70173	408322	57860	60217
PSMs with q-value < 0.01	12691	12619	12482	12608	70152	4861	5721

For the "Yeast 1" spectra, we generated 10 sets of decoy PSMs using 10 decoy databases (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
The yeast-01 data is available in tab-delimited format.
We used the following SEQUEST parameter file and target database for the yeast data.
We used the following SEQUEST parameter file and target database for the worm data.
The ms2 and sqt file formats are described in McDonald et al. (2004).
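
The table above reports PSMs with q-value < 0.01, and the decoy databases supply the decoy PSMs that such an estimate relies on. As a rough illustration only, the following Python sketch shows one common way to estimate q-values from a mixed list of target and decoy PSM scores. It is not taken from the Percolator source code: the function name estimate_q_values, the input layout (one score and one decoy flag per PSM, higher score meaning a better match), and the simple decoys/targets ratio used as the false discovery rate are all assumptions made for this example, and the procedure used for the paper may differ in detail.

```python
def estimate_q_values(scores, is_decoy):
    """Estimate a q-value for every PSM, returned in input order.

    Assumptions (hypothetical, for illustration): a higher score means a
    better match, and the FDR at a score threshold is approximated by
    (#decoys at or above the threshold) / (#targets at or above it).
    Tied scores are not treated specially.
    """
    # Rank PSMs from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Walk down the ranked list, tracking the estimated FDR at each rank.
    fdrs = []
    targets = 0
    decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # The q-value of a PSM is the smallest FDR at which it is still accepted,
    # i.e. the minimum FDR over its own rank and every lower rank.
    q_values = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q_values[order[rank]] = min(running_min, 1.0)
    return q_values


# Toy input: six PSMs, two of which come from a decoy database.
example_scores = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
example_is_decoy = [False, False, True, False, False, True]
example_q = estimate_q_values(example_scores, example_is_decoy)

# Count target PSMs that would be accepted at a q-value threshold of 0.01.
accepted = sum(
    1 for q, d in zip(example_q, example_is_decoy) if not d and q < 0.01
)
```

In this toy input only the two best-scoring target PSMs fall below the 0.01 threshold. A count like the last table row would come from applying the same kind of threshold to the target PSMs of a full dataset.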
