GENOME 541: Intro to Computational Molecular Biology

Homework #8: Protein identification from tandem mass spectra

Due on Tuesday, June 3, before class. Turn in at Foege S220C.

In this assignment, you will write a program that implements the "EM-like" portion of ProteinProphet. The algorithm is described in

"A statistical model for identifying proteins by tandem mass spectrometry" by Nesvizhskii et al. Analytical Chemistry. 75(17):4646-4658, 2003.

In particular, you should implement Equations 8 and 9.

The program takes as input this file. This is the H. influenzae data set from the above paper. Each row in the file corresponds to a peptide-spectrum match, and the columns contain

The program should first identify peptide-spectrum matches that involve the same peptide and keep only the maximal probability for each peptide. The program should then eliminate all peptides whose probability falls below a user-specified threshold. Finally, the program should choose initial weights at random and perform the EM-like procedure. Optionally, you may want to repeat the optimization several times from different initial weights, selecting the solution that yields the largest total probability. The final output is the list of proteins, ranked by probability.
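The three steps above can be sketched as follows. This is only an illustrative outline, not a reference solution. The input layout (tuples of peptide, probability, and protein IDs) and all function names are hypothetical and must be adapted to the actual file. The two update rules in the loop are an assumption about the general form of the iteration (protein probability as one minus the product of weighted peptide "miss" probabilities, and weights proportional to the current protein probabilities); check them against Equations 8 and 9 in the paper before relying on them.

```python
import random
from collections import defaultdict


def best_probability_per_peptide(matches, threshold=0.0):
    """Keep the maximal probability per peptide, then apply the threshold.

    `matches` is an iterable of (peptide, probability, protein_ids) tuples.
    This layout is hypothetical; adapt it to the real input file.
    """
    best = {}
    proteins_of = defaultdict(set)
    for peptide, probability, protein_ids in matches:
        best[peptide] = max(probability, best.get(peptide, 0.0))
        proteins_of[peptide].update(protein_ids)
    kept = {pep: p for pep, p in best.items() if p >= threshold}
    return kept, {pep: proteins_of[pep] for pep in kept}


def em_like(peptide_prob, proteins_of, seed=0, max_iter=1000, tol=1e-9):
    """Alternate protein-probability and weight updates until they settle."""
    rng = random.Random(seed)
    # Random initial weights, normalized so each peptide's weights sum to 1.
    weights = {}
    for pep, prots in proteins_of.items():
        raw = {prot: rng.random() + 1e-6 for prot in sorted(prots)}
        total = sum(raw.values())
        weights[pep] = {prot: w / total for prot, w in raw.items()}

    protein_prob = {}
    for _ in range(max_iter):
        # Assumed protein update: P_n = 1 - prod_i (1 - w_i^n * p_i).
        miss = defaultdict(lambda: 1.0)
        for pep, per_protein in weights.items():
            for prot, w in per_protein.items():
                miss[prot] *= 1.0 - w * peptide_prob[pep]
        updated = {prot: 1.0 - m for prot, m in miss.items()}
        change = max(
            (abs(updated[prot] - protein_prob.get(prot, 0.0)) for prot in updated),
            default=0.0,
        )
        protein_prob = updated
        # Assumed weight update: w_i^n = P_n / (sum of P_m over all
        # proteins m that contain peptide i).
        for pep, per_protein in weights.items():
            total = sum(protein_prob[prot] for prot in per_protein)
            if total > 0.0:
                for prot in per_protein:
                    per_protein[prot] = protein_prob[prot] / total
        if change < tol:
            break
    return protein_prob


# Toy data (made up for illustration, not taken from the data set).
toy_matches = [
    ("PEPA", 0.5, ["A"]),
    ("PEPA", 0.9, ["A"]),
    ("SHARED", 0.8, ["A", "B"]),
    ("LOW", 0.05, ["C"]),
]
toy_peptides, toy_proteins_of = best_probability_per_peptide(toy_matches, 0.1)
toy_result = em_like(toy_peptides, toy_proteins_of, seed=1)
toy_ranking = sorted(toy_result.items(), key=lambda item: item[1], reverse=True)
```

In the toy data, protein B is supported only by a peptide it shares with A, so under the assumed updates the shared peptide's weight drifts toward A. To try several random restarts, call `em_like` with different `seed` values and compare the resulting solutions.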

You should turn in hard copies of

For the plot, you can assume that H. influenzae proteins (with identifiers beginning "gi|") are in the sample, and all other proteins are not. The plot should contain three series, corresponding to running the program with thresholds of 0%, 10% and 20%. Note that some proteins in the database are identical (or at least cannot be discriminated using the identified peptides). Your program should treat each such group of redundant protein IDs as a single protein.
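One possible way to merge redundant protein IDs is to group proteins by the exact set of identified peptides that map to them. The sketch below assumes a hypothetical `proteins_of` mapping from each peptide to the set of protein IDs containing it. The rule that a group counts as "in the sample" when any member begins with "gi|" is also an assumption, since the assignment does not say how mixed groups should be labeled.

```python
from collections import defaultdict


def group_indistinguishable_proteins(proteins_of):
    """Group protein IDs that are supported by exactly the same peptides.

    `proteins_of` maps peptide -> set of protein IDs (hypothetical layout).
    Returns a sorted list of groups, each a sorted list of protein IDs.
    """
    peptides_of = defaultdict(set)
    for peptide, protein_ids in proteins_of.items():
        for protein_id in protein_ids:
            peptides_of[protein_id].add(peptide)

    groups = defaultdict(list)
    for protein_id, peptides in peptides_of.items():
        # Proteins with identical peptide sets cannot be told apart.
        groups[frozenset(peptides)].append(protein_id)
    return sorted(sorted(members) for members in groups.values())


def is_in_sample(group):
    """Assumed labeling rule: any member with a 'gi|' prefix marks the group."""
    return any(protein_id.startswith("gi|") for protein_id in group)


# Toy mapping (made up for illustration).
toy_proteins_of = {
    "P1": {"gi|1", "gi|2"},
    "P2": {"gi|1", "gi|2", "X"},
    "P3": {"X"},
}
toy_groups = group_indistinguishable_proteins(toy_proteins_of)
toy_labels = [is_in_sample(group) for group in toy_groups]
```

Here "gi|1" and "gi|2" share the same two peptides and collapse into one group, while "X" stays separate because it is also supported by "P3". Replacing each protein ID with a label for its group before running the optimization keeps redundant IDs from competing with each other for weight.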

If you are feeling ambitious, I encourage you to experiment with alternative methods of your own devising for solving this problem, and see whether you can improve on ProteinProphet's performance.

Your program should be written in C, C++, Java, Perl or Python. If you want to use a different language, please let me know.

Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly.

It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.