Running a Simple Search Using Tide and Percolator
Now that you have your environment set
up and the two input files in your working directory, you can
conduct the search. The search process compares each spectrum
in demo.ms2 to peptides (subsequences of the
proteins in small-yeast.fasta) that are stored in a peptide index
directory, yeast-index/. Peptides whose
precursor mass is close to that of the observed spectrum are scored
against that spectrum, and the top scores are reported in the output.
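The candidate-selection idea described above can be sketched in a few lines of Python. This is only an illustration, not Tide's implementation: the function name candidates_in_window and the 3.0 Da tolerance are assumptions made for the example, and the peptide masses are copied from the example output tables shown later in this tutorial.

```python
from bisect import bisect_left, bisect_right

def candidates_in_window(peptides, observed_mass, tolerance):
    """Return sequences whose mass lies within +/- tolerance of observed_mass.

    peptides must be a list of (mass, sequence) tuples sorted by mass.
    """
    masses = [mass for mass, _ in peptides]
    lo = bisect_left(masses, observed_mass - tolerance)
    hi = bisect_right(masses, observed_mass + tolerance)
    return [sequence for _, sequence in peptides[lo:hi]]

# A tiny stand-in for a peptide index, sorted by peptide mass.
index = [
    (1301.4160, "LDVDELGDVAQK"),
    (1382.4467, "TASEFDSAIAQDK"),
    (1488.82, "NFLETVELQVGLK"),
    (3093.41, "ELESAAYDHAEPVQPEDAPQDIANDELK"),
]

# Spectrum neutral mass of scan 85 in the example output; tolerance is hypothetical.
print(candidates_in_window(index, 1489.83, 3.0))
```

Only the peptides returned by such a lookup would then be scored against the spectrum, which is why keeping the index sorted by mass makes the search fast.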
To conduct the search, we first create a peptide index using tide-index and then execute the search using tide-search.
$ crux tide-index small-yeast.fasta yeast-index
While generating the peptide index, you will see output like this:
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 2.1
INFO: Fri Feb 5 11:24:02 PST 2016
COMMAND LINE: ./crux tide-index small-yeast.fasta yeast-index
INFO: Running tide-index...
INFO: Writing results to output directory 'yeast-index'.
INFO: Reading small-yeast.fasta and computing unmodified peptides...
INFO: Writing decoy fasta...
INFO: Reading proteins
INFO: Precomputing theoretical spectra...
INFO: Elapsed time: 0.0973 s
INFO: Finished crux tide-index.
INFO: Return Code:0
This command produces the peptide index in yeast-index and also produces a directory crux-output containing the following files:
- tide-index.decoy.fasta – a set of decoy proteins, derived from the proteins in the input set,
- tide-index.params.txt – a record of all the parameters used to build the index, and
- tide-index.log.txt – a log file containing a copy of all the messages printed to the screen during indexing.
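To make the notion of a decoy protein concrete, here is a minimal Python sketch. The tutorial does not say how tide-index derives its decoys, so the shuffling shown here is an assumption (one common approach), and the helper name make_decoy is hypothetical. The decoy_ prefix mirrors the --protein-decoy-pattern decoy_ value that appears in the Percolator log in this tutorial.

```python
import random

def make_decoy(protein_id, sequence, seed=1):
    """Return (decoy_id, decoy_sequence) by shuffling the residues of sequence.

    Shuffling keeps the amino acid composition and length of the target,
    so the decoy resembles a real protein without being one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "decoy_" + protein_id, "".join(residues)

# A short fragment stands in for a full protein sequence here.
decoy_id, decoy_seq = make_decoy("YGL135W", "NFLETVELQVGLK")
print(decoy_id, decoy_seq)
```

Because a decoy has the same composition as its target, matches to decoys give an estimate of how often a spectrum matches a peptide purely by chance.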
Now you can run this command:
$ crux tide-search --compute-sp T demo.ms2 yeast-index
While the search is running, you will see output like this:
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 2.1
INFO: Fri Feb 5 11:24:23 PST 2016
COMMAND LINE: ./crux tide-search --compute-sp T demo.ms2 yeast-index
INFO: Running tide-search...
INFO: Reading index yeast-index
INFO: Converting demo.ms2 to spectrumrecords format
INFO: Reading spectra file crux-output/demo.ms2.spectrumrecords.tmp
INFO: Sorting spectra
INFO: Running search
INFO: Time per spectrum-charge combination: 0.002318 s.
INFO: Average number of candidates per spectrum-charge combination: 15.204820
INFO: Elapsed time: 0.389 s
INFO: Finished crux tide-search.
INFO: Return Code:0
The crux-output directory now contains four new files containing the search results:
- tide-search.target.txt – search results in tab-delimited format.
- tide-search.decoy.txt – search results from a decoy database in tab-delimited format.
- tide-search.params.txt – a record of all the parameters used in the search.
- tide-search.log.txt – a log file containing a copy of all the messages printed to the screen during the search.
Note that the peptide-spectrum matches (PSMs) in tide-search.target.txt are sorted by the precursor m/z value associated with the spectrum. If you want to see which PSMs got the highest XCorr scores, you can do so like this:
$ crux sort-by-column --column-type real --ascending T crux-output/tide-search.target.txt "xcorr score" > crux-output/tide-search.target.sort.txt
The first lines of the resulting sorted output file should look like this:
file | scan | charge | spectrum precursor m/z | spectrum neutral mass | peptide mass | delta_cn | sp score | sp rank | xcorr score | xcorr rank | b/y ions matched | b/y ions total | distinct matches/spectrum | sequence | cleavage type | protein id | flanking aa |
demo.ms2 | 85 | 3 | 497.618 | 1489.83 | 1488.82 | 0.936932 | 2430.56 | 1 | 5.20757 | 1 | 27 | 48 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W(27) | RN |
demo.ms2 | 118 | 3 | 1031.94 | 3092.8 | 3093.41 | 0.949698 | 2383.16 | 1 | 4.76204 | 1 | 44 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 156 | 3 | 1032.44 | 3094.3 | 3093.41 | 0.938358 | 1929.77 | 1 | 4.7476 | 1 | 43 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 18 | 3 | 1032.4 | 3094.18 | 3093.41 | 0.902056 | 1732.23 | 1 | 4.48933 | 1 | 40 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 11 | 2 | 745.269 | 1488.52 | 1488.82 | 0.822851 | 2816.6 | 1 | 4.44855 | 1 | 21 | 24 | 4 | NFLETVELQVGLK | trypsin-full-digest | YGL135W(27) | RN |
demo.ms2 | 53 | 2 | 745.749 | 1489.48 | 1488.82 | 0.935648 | 2797.78 | 1 | 4.39828 | 1 | 21 | 24 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W(27) | RN |
demo.ms2 | 38 | 3 | 1032.39 | 3094.15 | 3093.41 | 0.975741 | 1460.28 | 1 | 4.39627 | 1 | 37 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 42 | 3 | 1032.26 | 3093.76 | 3093.41 | 1.0216 | 1509.47 | 1 | 4.35676 | 1 | 38 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 111 | 3 | 1033.02 | 3096.04 | 3093.41 | 0.952727 | 1366.71 | 1 | 4.3498 | 1 | 36 | 108 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C(500) | KD |
demo.ms2 | 62 | 2 | 745.859 | 1489.7 | 1488.82 | 0.913711 | 2766.65 | 1 | 4.34057 | 1 | 21 | 24 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W(27) | RN |
The final step is to post-process the search results using Percolator. Each spectrum has been compared to many peptides and we would like to return only the best match for each spectrum. We also expect that some fraction of the spectra will not be identifiable as peptides (due to chemical noise, multiple peptides co-eluting, poor fragmentation, etc.). The analysis step filters out those spectra and ranks the matches by quality.
$ crux percolator --test-fdr 0.1 crux-output/tide-search.target.txt
While the analysis is running, you will see output like this:
INFO: CPU: pyrrolysine.gs.washington.edu
INFO: Crux version: 3.1-ccddffef5395314ce3a224ad5b34167229341653
INFO: Tue Dec 12 15:47:13 PST 2017
COMMAND LINE: ./crux percolator --test-fdr 0.1 crux-output/tide-search.target.txt
INFO: Beginning percolator.
INFO: Reading file crux-output/tide-search.target.txt
INFO: Converting input to pin format.
INFO: Parsing crux-output/tide-search.target.txt
INFO: Assigning index 0 to demo.ms2.
INFO: Parsing crux-output/tide-search.decoy.txt
INFO: There are 690 target matches and 690 decoys
INFO: Maximum observed charge is 3.
INFO: File conversion complete.
INFO: Percolator version 3.01.nightly-22-2f4e2a6, Build Date Nov 22 2017 10:07:29
INFO: Copyright (c) 2006-9 University of Washington. All rights reserved.
INFO: Written by Lukas Käll (lukall@u.washington.edu) in the
INFO: Department of Genome Sciences at the University of Washington.
INFO: Issued command:
INFO: percolator --results-peptides crux-output/percolator.target.peptides.txt --decoy-results-peptides crux-output/percolator.decoy.peptides.txt --results-psms crux-output/percolator.target.psms.txt --decoy-results-psms crux-output/percolator.decoy.psms.txt --verbose 2 --protein-decoy-pattern decoy_ --seed 1 --subset-max-train 0 --trainFDR 0.01 --testFDR 0.1 --maxiter 10 --search-input auto --no-schema-validation --protein-enzyme trypsin --post-processing-tdc crux-output/make-pin.pin
INFO: Started Tue Dec 12 15:47:13 2017
INFO: Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10
INFO: Reading tab-delimited input from datafile crux-output/make-pin.pin
INFO: Features:
INFO: lnrSp deltLCn deltCn XCorr Sp IonFrac PepLen Charge1 Charge2 Charge3 enzN enzC enzInt lnNumDSP dM absdM
INFO: Found 1380 PSMs
INFO: Separate target and decoy search inputs detected, using target-decoy competition on Percolator scores.
INFO: Train/test set contains 690 positives and 690 negatives, size ratio=1 and pi0=1
INFO: Selecting Cpos by cross-validation.
INFO: Selecting Cneg by cross-validation.
INFO: Split 1: Selected feature 2 as initial search direction. Could separate 44 training set positives in that direction.
INFO: Split 2: Selected feature 5 as initial search direction. Could separate 41 training set positives in that direction.
INFO: Split 3: Selected feature 2 as initial search direction. Could separate 32 training set positives in that direction.
INFO: Found 39 test set positives with q<0.1 in initial direction
INFO: Reading in data and feature calculation took 0.077138 cpu seconds or 0 seconds wall clock time.
INFO: ---Training with Cpos selected by cross validation, Cneg selected by cross validation, fdr=0.01
INFO: Iteration 1: Estimated 71 PSMs with q<0.1
INFO: Iteration 2: Estimated 73 PSMs with q<0.1
INFO: Iteration 3: Estimated 73 PSMs with q<0.1
INFO: Iteration 4: Estimated 73 PSMs with q<0.1
INFO: Iteration 5: Estimated 73 PSMs with q<0.1
INFO: Iteration 6: Estimated 73 PSMs with q<0.1
INFO: Iteration 7: Estimated 73 PSMs with q<0.1
INFO: Iteration 8: Estimated 73 PSMs with q<0.1
INFO: Iteration 9: Estimated 73 PSMs with q<0.1
INFO: Iteration 10: Estimated 73 PSMs with q<0.1
INFO: Learned normalized SVM weights for the 3 cross-validation splits:
INFO: Split1 Split2 Split3 FeatureName
INFO: 0.3140 0.0590 0.2375 lnrSp
INFO: 0.1862 -0.1121 0.1878 deltLCn
INFO: 0.1340 0.0416 0.3738 deltCn
INFO: 0.0642 0.3209 0.6567 XCorr
INFO: 0.7780 0.3290 1.2113 Sp
INFO: 0.2765 0.0558 -0.4342 IonFrac
INFO: 0.3376 0.0918 0.3700 PepLen
INFO: 0.0230 0.0536 0.1603 Charge1
INFO: 0.1090 0.0586 0.3298 Charge2
INFO: -0.1604 -0.1274 -0.5759 Charge3
INFO: 0.0000 0.0000 0.0000 enzN
INFO: 0.0000 0.0000 0.0000 enzC
INFO: -0.0195 -0.0015 0.0231 enzInt
INFO: 0.1450 0.0298 0.6177 lnNumDSP
INFO: -0.1984 -0.0750 -0.2956 dM
INFO: -0.1965 -0.0327 -0.1631 absdM
INFO: -1.5185 -1.1520 -2.0882 m0
INFO: Found 54 test set PSMs with q<0.1.
INFO: Selected best-scoring PSM per scan+expMass (target-decoy competition): 104 target PSMs and 62 decoy PSMs.
INFO: Tossing out "redundant" PSMs keeping only the best scoring PSM for each unique peptide.
INFO: Calculating q values.
INFO: Final list yields 0 target peptides with q<0.1.
INFO: Calculating posterior error probabilities (PEPs).
INFO: Processing took 3.472 cpu seconds or 3 seconds wall clock time.
INFO: Elapsed time: 3.35 s
INFO: Finished crux percolator.
INFO: Return Code:0
The crux-output directory will now contain eight new files:
- percolator.target.psms.txt – a list of peptide-spectrum matches (PSMs), ranked by quality,
- percolator.target.peptides.txt – a list of peptides, ranked by quality,
- percolator.decoy.psms.txt – a ranked list of decoy PSMs,
- percolator.decoy.peptides.txt – a ranked list of decoy peptides,
- percolator.pout.xml – a single XML output file containing all of the Percolator results,
- make-pin.pin.xml – an intermediate XML format file that is used by Percolator,
- percolator.params.txt – parameter file, and
- percolator.log.txt – log file.
As before, you might want to sort the Percolator output files, this time by the "percolator score" column:
$ crux sort-by-column --column-type real --ascending T crux-output/percolator.target.psms.txt "percolator score" > crux-output/percolator.target.psms.sort.txt
The beginning of the resulting percolator.target.psms.sort.txt file will look like this:
file | file_idx | scan | charge | spectrum precursor m/z | spectrum neutral mass | peptide mass | percolator score | percolator rank | percolator q-value | total matches/spectrum | sequence | cleavage type | protein id | flanking aa |
crux-output/tide-search.target.txt | 1 | 118 | 3 | 1031.9407 | 3092.8000 | 3095.2366 | 8.6190662 | 1 | 0 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C | KD |
crux-output/tide-search.target.txt | 1 | 26 | 2 | 692.3773 | 1382.7400 | 1382.4467 | 7.0481362 | 2 | 0 | 6 | TASEFDSAIAQDK | trypsin-full-digest | YLR043C | KL |
crux-output/tide-search.target.txt | 1 | 53 | 2 | 745.7473 | 1489.4800 | 1489.7318 | 6.7469611 | 3 | 0 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W | RN |
crux-output/tide-search.target.txt | 1 | 62 | 2 | 745.8572 | 1489.7000 | 1489.7318 | 6.6489625 | 4 | 0 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W | RN |
crux-output/tide-search.target.txt | 1 | 146 | 2 | 692.6772 | 1383.3400 | 1382.4467 | 6.6027999 | 5 | 0 | 5 | TASEFDSAIAQDK | trypsin-full-digest | YLR043C | KL |
crux-output/tide-search.target.txt | 1 | 131 | 2 | 745.8473 | 1489.6801 | 1489.7318 | 6.5627294 | 6 | 0 | 3 | NFLETVELQVGLK | trypsin-full-digest | YGL135W | RN |
crux-output/tide-search.target.txt | 1 | 50 | 2 | 651.2873 | 1300.5601 | 1301.4160 | 6.3757763 | 7 | 0 | 10 | LDVDELGDVAQK | trypsin-full-digest | YLR043C | KN |
crux-output/tide-search.target.txt | 1 | 42 | 3 | 1032.2606 | 3093.7600 | 3095.2366 | 5.8047724 | 8 | 0 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C | KD |
crux-output/tide-search.target.txt | 1 | 90 | 3 | 1032.0006 | 3092.9800 | 3095.2366 | 5.4775882 | 9 | 0 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C | KD |
crux-output/tide-search.target.txt | 1 | 111 | 3 | 1033.0206 | 3096.0400 | 3095.2366 | 5.4364419 | 10 | 0 | 2 | ELESAAYDHAEPVQPEDAPQDIANDELK | trypsin-full-digest | YGL009C | KD |
In this output, the PSMs are ranked by "percolator score," with higher scores indicating a higher quality match. The associated statistical confidence estimate is reported as a "percolator q-value," interpreted as the minimal false discovery rate threshold at which this match is deemed significant. In the list above, all of the matches have q-values of 0, meaning that they are highly significant.
The meanings of the remaining columns are described here. Note that when you run Percolator on your own computer, the results may be somewhat different from the ones reported here. This is because Percolator involves randomly subdividing the data in a cross-validation scheme (described in detail here).
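As a rough illustration of how target and decoy matches can be turned into a statistical confidence measure, the following Python sketch estimates q-values from a list of scored matches. It is a simplified target-decoy calculation, not Percolator's actual procedure (which also re-scores matches with a learned classifier); the function name q_values and the toy scores are made up for the example.

```python
def q_values(psms):
    """Estimate q-values for target PSMs from (score, is_decoy) pairs.

    The FDR at each target is estimated as (#decoys) / (#targets) among all
    PSMs scoring at least as well. The q-value is the smallest FDR at which
    the PSM would still be accepted.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            fdrs.append((score, min(1.0, decoys / targets)))
    # Walk from the worst score upward, keeping the minimum FDR seen so far.
    results = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        results.append((score, running_min))
    results.reverse()
    return results

# Toy data: (score, is_decoy). These values are invented for illustration.
toy = [(8.6, False), (7.0, False), (6.9, True), (6.7, False),
       (5.0, False), (4.0, True), (3.5, True), (3.0, False)]
for score, q in q_values(toy):
    print(score, q)
```

In this toy list the two best targets outscore every decoy and so receive a q-value of 0, which is the same situation as the top-ranked matches in the Percolator output of this tutorial.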