GENOME 541: Intro to Computational Molecular Biology Homework #6: Protein identification from tandem mass spectra
Due on Friday, May 15 by 3:15 pm. Turn in at Foege S220B.
In this assignment, you will write a program that implements the "EM-like" portion of ProteinProphet. The algorithm is described in
"A statistical model for identifying proteins by tandem mass spectrometry" by Nesvizhskii et al. Analytical Chemistry. 75(17):4646-4658, 2003.In particular, you should implement Equations 8 and 9.
The program takes as input this file. This is the H. influenzae data set from the above paper. Each row in the file corresponds to a peptide-spectrum match, and the columns contain
- the spectrum index,
- the probability assigned by PeptideProphet,
- the peptide (including its two flanking amino acids), and
- the protein or proteins in which the peptide occurs.
The program should first identify peptide-spectrum matches that involve the same peptide, and choose the maximal probability. The program should then eliminate all peptides below a user-specified threshold. Finally, the program should choose initial weights at random and perform the EM-like procedure. To determine convergence, you can measure the mean-squared error between the probabilities from one round to the next, and ensure that this error falls below some small constant. The final output is the list of proteins, ranked by probability.
You should turn in hard copies of
- your program code,
- a plot of the number of proteins identified as a function of the false discovery rate, and
- a list of the probabilities assigned to the following proteins, using a threshold of 20%: gi|1573457|gb|AAC22137.1|, gi|1574164|gb|AAC22886.1|, gi|1574091|gb|AAC22819.1|, gi|1574819|gb|AAC23003.1|, gi|1574011|gb|AAC22644.1|, gi|1574164|gb|AAC22886.1|, gi|1573457|gb|AAC22137.1|, gi|1573567|gb|AAC22237.1|, GPN:BC014987_1, gi|1574176|gb|AAC22896.1|, gi|1574164|gb|AAC22886.1|, gi|1573528|gb|AAC22201.1|, GP:AL136929_1, and GPN:AB067484_1.
For the plot, you can assume that H. influenzae proteins (with identifiers beginning "
gi|") are in the sample, and all other proteins are not. The plot should contain three series, corresponding to running the program with thresholds of 0%, 10% and 20%. Note that some proteins in the database are identical (or at least, cannot be discriminated using the identified peptides). Your program should treat these groups of redundant protein IDs as a single protein.If you are feeling ambitious, I encourage you to experiment with alternative methods of your own devising for solving this problem, and see whether you can improve on ProteinProphet's performance.
Your program should be written in C, C++, Java, Perl or Python. If you want to use a different language, please let me know.
Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly.
It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.