Running a Simple Search

Now that you have your environment set up and the two input files in your working directory, you can conduct the search. The search process compares each spectrum in demo.ms2 to peptides (subsequences of the proteins) in small-yeast.fasta. For each peptide a theoretical spectrum is created based on the predicted fragmentation of the peptide. The theoretical spectra are each compared to the experimental spectrum (in the ms2 file) and scored based on their similarity. These scores are reported in the output. To conduct the search, use this command.

$ crux search-for-matches demo.ms2 small-yeast.fasta
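
The comparison described above can be illustrated with a small sketch. It is not Crux's scoring code: it only builds singly charged b and y ions for a peptide from standard monoisotopic residue masses and counts how many fall near an observed peak. The function names and the 0.5 tolerance default are assumptions made for illustration, and modifications are ignored.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O, carried by every y ion


def theoretical_ions(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return ions


def count_matched_ions(peptide, observed_mz, tolerance=0.5):
    """Count theoretical ions that have an observed peak within the tolerance."""
    return sum(
        any(abs(ion - peak) <= tolerance for peak in observed_mz)
        for ion in theoretical_ions(peptide)
    )
```

A peptide of n residues yields n - 1 b ions and n - 1 y ions in this sketch, so a 12-residue peptide has 22 theoretical ions. A real search engine turns such comparisons into a similarity score rather than a simple count.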

While the search is running, you will see output like this:

INFO: Reading in ms2 file demo.ms2
INFO: Preparing protein fasta file small-yeast.fasta
INFO: Using amino acid alphabet (ACDEFGHIKLMNPQRSTVWYBUXZ).
INFO: Searching spectra

When the search has completed, it will print:

INFO: crux-search-for-matches finished

A new directory, crux-output, will have been created. Looking at the contents of the directory, you will see seven new files containing the search results. The file demo.csm contains the search results in a binary format that can be read by crux percolator or crux compute-q-values. The files search.target.sqt and search.target.txt contain the search results in human-readable plain-text formats: search.target.sqt uses the standard SQT format for spectrum-peptide matches, and search.target.txt uses tab-delimited fields. The remaining files (decoy.sqt, decoy.txt, demo-decoy-1.csm, and demo-decoy-2.csm) contain the results of searching a shuffled version of the protein database.
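
If you want to confirm that the search produced everything listed above, a short script can check the directory for you. This is a minimal sketch: the file names are taken from the paragraph above, while the helper name missing_outputs is made up for this example.

```python
from pathlib import Path

# File names as listed in the text above.
EXPECTED_FILES = [
    "demo.csm",
    "search.target.sqt",
    "search.target.txt",
    "decoy.sqt",
    "decoy.txt",
    "demo-decoy-1.csm",
    "demo-decoy-2.csm",
]


def missing_outputs(output_dir="crux-output", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in output_dir."""
    directory = Path(output_dir)
    return [name for name in expected if not (directory / name).is_file()]
```

Calling missing_outputs() from your working directory returns an empty list when all seven files are present.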

The final step is to analyze our search results. Each spectrum has been compared to many peptides and we would like to return only the best match(es) for each spectrum. We also expect that some fraction of the spectra will not be identifiable as peptides (due to chemical noise, multiple peptides co-eluting, poor fragmentation, etc.). The analysis step filters out those spectra and returns only the high-quality peptide-spectrum matches. Run this command.

$ crux percolator small-yeast.fasta
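
To give a feel for how decoy results help separate good matches from poor ones, here is a sketch of a plain target-decoy estimate. It is not Percolator's algorithm, which is considerably more involved. It assumes one list of target scores and one equally sized list of decoy scores, with higher scores being better, and the function name is hypothetical.

```python
def target_decoy_q_values(target_scores, decoy_scores):
    """Return (score, q_value) pairs for the targets, best score first."""
    labelled = [(s, True) for s in target_scores]
    labelled += [(s, False) for s in decoy_scores]
    labelled.sort(key=lambda pair: pair[0], reverse=True)

    # Walk down the ranked list and estimate the false discovery rate at
    # each target as (decoys seen so far) / (targets seen so far).
    fdrs = []
    targets = decoys = 0
    for score, is_target in labelled:
        if is_target:
            targets += 1
            fdrs.append((score, min(1.0, decoys / targets)))
        else:
            decoys += 1

    # The q-value is the smallest FDR at this threshold or any looser one,
    # so sweep from the worst score upward keeping a running minimum.
    q_values = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append((score, running_min))
    q_values.reverse()
    return q_values
```

A match with a q-value of 0.01 sits at a score threshold where roughly 1% of the accepted target matches are expected to be incorrect.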

While the analysis is running, you will see output like this:

INFO: Running percolator
INFO: Getting PSMs from ./demo.csm
INFO: Getting PSMs from ./demo-decoy-1.csm
INFO: Getting PSMs from ./demo-decoy-2.csm
INFO: Outputting matches.

When the analysis is complete, it will print:

INFO: crux percolator finished.

There is a new file called target.sqt containing the results. The beginning of the file looks like this:

H       SQTGenerator Crux
H       SQTGeneratorVersion 1.0
H       Comment Crux was written by...
H       Comment ref...
H       StartTime       Mon Mar  3 12:08:22 2008
H       EndTime
H       Database        index-yeast/small-yeast-binary-fasta
H       DBSeqLength     ?
H       DBLocusCount    56
H       PrecursorMasses average
H       FragmentMasses  mono
H       Alg-PreMasTol   3.0
H       Alg-FragMassTol 0.50
H       Alg-XCorrMode   0
H       Comment preliminary algorithm sp
H       Comment final algorithm xcorr
H       StaticMod       C=160.139
H       Alg-DisplayTop  5
H       EnzymeSpec      tryptic
H       Comment matches analyzed by percolator
S       10      10      2       0.00    server  636.34  0.00    0.00    2
M       1       1       1270.49 0.00    -1.19   0.81    7       22      K.SGLIVEIQGVQK.E        U
L       YDR093W DNF2 SGDID:S000002500, Chr IV from 631279-636117, Verified ORF, "Aminophospholipid translocase (flippase) that localizes primarily to the plasma membrane; contributes to endocytosis, protein"
S       11      11      2       0.00    server  745.27  0.00    0.00    2
M       1       1       1489.73 0.00    3.12    0.00    21      24      R.NFLETVELQVGLK.N       U
L       YGL135W RPL1B SGDID:S000003103, Chr VII from 254646-255299, Verified ORF, "N-terminally acetylated protein component of the large (60S) ribosomal subunit, nearly identical to Rpl1Ap and has similari"
S       12      12      3       0.00    server  472.56  0.00    0.00    7
M       1       3       1414.62 0.00    -1.00   0.61    12      44      K.VLEFHPFDPVSK.K        U
L       YGL008C PMA1 SGDID:S000002976, Chr VII from 482671-479915, reverse complement, Verified ORF, "Plasma membrane H+-ATPase, pumps protons out of the cell; major regulator of cytoplasmic pH and plasma m"

This file follows the SQT file format, with a few modifications. Lines beginning with H are header lines and contain general information about the run. Lines beginning with S contain information about a spectrum: the scan number (10 in the first example, listed twice), the charge (2), two placeholders, the measured m/z of the precursor ion (636.34), two more placeholders, and the number of peptides in the database that were compared to this spectrum (2).
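
The S-line layout can be read with a few lines of Python. This is a sketch based only on the field order described above. The dictionary keys are invented for this example, and treating the two leading numbers as first and last scan is an assumption.

```python
def parse_s_line(line):
    """Parse one SQT 'S' line into a dictionary of named fields."""
    fields = line.split()
    if not fields or fields[0] != "S":
        raise ValueError("not an S line: " + line)
    return {
        "first_scan": int(fields[1]),
        "last_scan": int(fields[2]),    # assumed: second copy of the scan number
        "charge": int(fields[3]),
        # fields[4] and fields[5] are placeholders
        "precursor_mz": float(fields[6]),
        # fields[7] and fields[8] are placeholders
        "num_candidates": int(fields[9]),
    }
```

Splitting on any whitespace handles both tab-separated files and space-aligned listings like the one shown above.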

Lines beginning with M give information about a peptide-spectrum match: the rank by the primary scoring function (1), the rank by the secondary scoring function (1), the calculated mass of the charged peptide (1270.49), the relative difference between the previous score and this one (delta Cn, always 0 for the best match), the primary score (here the percolator score, -1.19), the secondary score (here the q-value calculated by percolator, 0.81), the number of ions matched (7), the total number of ions predicted (22), the sequence of the peptide in the context of the protein (K.SGLIVEIQGVQK.E; the matched peptide is the sequence between the dots, and the flanking residues are shown on either side), and the manual validation code (U for unvalidated).
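
The M-line fields can be unpacked in the same way. The sketch below follows the field order given above for percolator output. The key names are hypothetical, and it assumes the sequence field always has the form X.PEPTIDE.Y.

```python
def parse_m_line(line):
    """Parse one SQT 'M' line (percolator output) into named fields."""
    fields = line.split()
    if not fields or fields[0] != "M":
        raise ValueError("not an M line: " + line)
    # The sequence looks like K.SGLIVEIQGVQK.E: flanking residue,
    # matched peptide, flanking residue.
    before, peptide, after = fields[9].split(".")
    return {
        "primary_rank": int(fields[1]),
        "secondary_rank": int(fields[2]),
        "calculated_mass": float(fields[3]),
        "delta_cn": float(fields[4]),
        "percolator_score": float(fields[5]),
        "q_value": float(fields[6]),
        "ions_matched": int(fields[7]),
        "ions_predicted": int(fields[8]),
        "peptide": peptide,
        "flanking": (before, after),
        "validation": fields[10],
    }
```

For the first match in the example file, this gives the peptide SGLIVEIQGVQK with flanking residues K and E.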

Lines beginning with L give the name of the protein locus for this peptide. If a peptide is found in more than one protein, the M line can be followed by multiple L lines.

The pattern of S, M, L lines repeats for every spectrum searched. In this example, there is only one M line reported for each spectrum, but it is possible to report more. See the customization page for information about options.
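
The repeating S, M, L pattern lends itself to a simple grouping loop. The sketch below attaches each M line to the most recent S line and each L line to the most recent M line, then lists the matches at or below a chosen q-value. The function names and the 0.01 default cutoff are assumptions for illustration, and the input is expected to be well formed (an S line before any M line).

```python
def read_sqt_records(lines):
    """Group SQT lines into a list of headers and a list of spectra."""
    headers, spectra = [], []
    for raw in lines:
        parts = raw.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        kind = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if kind == "H":
            headers.append(rest)
        elif kind == "S":
            spectra.append({"fields": rest.split(), "matches": []})
        elif kind == "M":
            spectra[-1]["matches"].append({"fields": rest.split(), "loci": []})
        elif kind == "L":
            # Locus descriptions contain spaces, so keep the text whole.
            spectra[-1]["matches"][-1]["loci"].append(rest)
    return headers, spectra


def confident_matches(spectra, max_q=0.01):
    """Return (scan, sequence, q_value) for matches with q_value <= max_q."""
    kept = []
    for spectrum in spectra:
        scan = int(spectrum["fields"][0])
        for match in spectrum["matches"]:
            q_value = float(match["fields"][5])  # secondary score column
            if q_value <= max_q:
                kept.append((scan, match["fields"][8], q_value))
    return kept
```

To try it on the real output, pass an open file, for example read_sqt_records(open("target.sqt")), adjusting the path to wherever the file was written.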

