Learning peptide-spectrum alignment models for tandem mass spectrometry

John T. Halloran, Jeff A. Bilmes, and William S. Noble

Uncertainty in Artificial Intelligence (UAI). AUAI, Quebec City, Quebec, Canada, July 2014.

Abstract

We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.

Supplementary data:

Test datasets: All test datasets presented in the paper may be found here in the MS2 file format, individually gzipped. Links to specific files are available in the table below.
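For readers who want to inspect the gzipped spectra directly, the following is a minimal sketch of an MS2 reader. It assumes the common MS2 layout (H/I/D metadata lines, an "S" line carrying the scan number and precursor m/z, one or more "Z" charge lines, then whitespace-separated "m/z intensity" peak lines); the function name read_ms2 and the path mouse-01.ms2.gz are illustrative and not part of the paper's code.

```python
import gzip


def read_ms2(path):
    """Yield (scan_id, precursor_mz, charges, peaks) for each spectrum.

    Assumes the usual MS2 layout; malformed files are not handled.
    """
    # Open gzipped files transparently, plain text otherwise.
    opener = gzip.open if path.endswith(".gz") else open
    scan_id, precursor_mz, charges, peaks = None, None, [], []
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and header/metadata lines.
            if not fields or fields[0] in ("H", "I", "D"):
                continue
            if fields[0] == "S":
                # A new scan starts: emit the previous one, if any.
                if scan_id is not None:
                    yield scan_id, precursor_mz, charges, peaks
                scan_id = int(fields[1])
                precursor_mz = float(fields[3])
                charges, peaks = [], []
            elif fields[0] == "Z":
                charges.append(int(fields[1]))
            else:
                # Peak line: m/z followed by intensity.
                peaks.append((float(fields[0]), float(fields[1])))
    if scan_id is not None:
        yield scan_id, precursor_mz, charges, peaks


# Hypothetical usage:
# for scan_id, mz, charges, peaks in read_ms2("mouse-01.ms2.gz"):
#     print(scan_id, mz, charges, len(peaks))
```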

Protein databases: For each dataset, we provide the protein sequences, as well as pre-digested target and decoy peptide sets. Tryptic digestion is performed without suppression of cleavage by proline (i.e., cleave at all occurrences of R or K, regardless of a trailing P). Each decoy database is generated by shuffling the peptides in the target peptide database.

Target Protein database	Target Peptide database	Decoy Peptide database	Searched datasets
mouse_contam-nokeil.fasta	mouse_contam-nokeil.fasta	mouse_contam-nokeil-decoy.fasta	mouse-01.ms2, mouse-02.ms2, mouse-03.ms2
worm-nokeil.fasta	worm-nokeil.fasta	worm-nokeil-decoy.fasta	worm-01.ms2, worm-02.ms2, worm-03.ms2
yeast-nokeil.fasta	yeast-nokeil.fasta	yeast-nokeil-decoy.fasta	yeast-01.ms2, yeast-02.ms2, yeast-03.ms2
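The digestion and decoy rules described under "Protein databases" can be sketched as follows. This is an illustration, not the code used to build the provided files: it assumes full digestion with no missed cleavages and no length filtering, and it assumes that "shuffling" means a random permutation of all residues in each target peptide. The function names are hypothetical.

```python
import random


def tryptic_digest(protein):
    """Cleave after every K or R, even when the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder if the protein does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def shuffle_peptide(peptide, rng):
    """Return a decoy by permuting all residues (an assumed scheme)."""
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the decoys are reproducible
targets = tryptic_digest("AKPRGG")  # ['AK', 'PR', 'GG']: K followed by P is still cleaved
decoys = [shuffle_peptide(p, rng) for p in targets]
```

Each decoy has the same amino acid composition, and therefore the same mass, as its target, which is the usual motivation for shuffling rather than drawing random sequences.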

Benchmark method identification files: DRIP, Didea, MS-GFDB, SEQUEST, and OMSSA top peptide-spectrum-matches (PSMs) for all datasets may be found in this file. Each identification file is a tab-delimited file with header "Kind\tSid\tPeptide\tScore", where "Kind" identifies whether a PSM is a target (t) or decoy (d), "Sid" is the spectrum/scan identification number, "Peptide" is the peptide string, and "Score" is the pertinent score.
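As a sketch of how such an identification file might be consumed, the snippet below parses the four-column format and computes a simple decoy-to-target ratio above a score threshold. The function names are hypothetical, the ratio is a common target-decoy convention rather than necessarily what the provided benchmark code computes, and it assumes that a higher score is better, which may not hold for every tool's score.

```python
import csv
import io


def read_psms(handle):
    """Parse a tab-delimited file with header Kind, Sid, Peptide, Score."""
    psms = []
    for row in csv.DictReader(handle, delimiter="\t"):
        psms.append({
            "kind": row["Kind"],        # 't' for target, 'd' for decoy
            "sid": int(row["Sid"]),     # spectrum/scan identification number
            "peptide": row["Peptide"],
            "score": float(row["Score"]),
        })
    return psms


def estimate_fdr(psms, threshold):
    """Ratio of decoy to target PSMs scoring at or above the threshold."""
    targets = sum(1 for p in psms if p["kind"] == "t" and p["score"] >= threshold)
    decoys = sum(1 for p in psms if p["kind"] == "d" and p["score"] >= threshold)
    return decoys / targets if targets else 0.0


# Made-up rows, for illustration only.
example = io.StringIO(
    "Kind\tSid\tPeptide\tScore\n"
    "t\t1\tPEPTIDEK\t10.0\n"
    "d\t2\tKEDITPEP\t3.0\n"
    "t\t3\tAAAK\t5.0\n"
    "d\t4\tAKAA\t6.0\n"
)
psms = read_psms(example)
fdr_at_5 = estimate_fdr(psms, 5.0)  # 1 decoy / 2 targets
```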

Benchmark code: Python code to generate performance curves is provided in this file. Note that the argparse and matplotlib modules must be available in your Python installation. An example that plots benchmarks for all datasets may be found in plot.sh, which assumes that pyFiles.tgz and identFiles.tgz have been untarred in the same directory.

Training data: DRIP was trained using this spectrum file containing the tandem mass spectra in MS2 format and this PSM file containing the high-confidence PSMs. Both files originated from here and were re-processed to contain unique spectrum scan ID numbers.