crux-pipeline

Overview

Crux is a mass spectrometry analysis toolkit developed by the Noble and MacCoss labs. Kaipo Tamura developed crux-pipeline to package the functions of Crux in a way that is easy for members of the Department of Genome Sciences to use. The pipeline runs the following steps:

Convert raw files with MSConvert (ProteoWizard - Windows)
Hardklor+Bullseye (optional, runs by default)
Comet (default) or Tide
Percolator (optional, runs by default and required for MSDaPl upload)
MSDaPl upload (optional)

If you have questions about crux-pipeline, you can contact Kaipo (kaipot@uw.edu). General questions about Crux can be directed to crux-users@googlegroups.com.

Detailed steps

For now, users must convert their .raw files to mzML, mzXML, ms2, or cms2. This can be done using, for example, ProteoWizard. We are working on a way to get the conversion done as part of crux-pipeline.

Connect to grid.gs.washington.edu via ssh.

The pipeline script can then be called by running "crux-pipeline". Running "crux-pipeline --help" prints a usage statement with a list of options; the same listing is reproduced at the bottom of this file.

Basic usage is:

    crux-pipeline [options] <spectrum file>+ <FASTA file> <parameter file>

where spectrum files are in mzML, mzXML, ms2, or cms2 format. For example:

    crux-pipeline --msdapl-id 1000 --output-dir my_search example1.ms2 example2.ms2 example.fasta example.params

NOTE: You can pass a glob such as *.cms2 instead of listing every spectrum file. If you are running Hardklor+Bullseye, Crux looks for the cms1 files in the same location as the cms2 files.
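
Because the shell expands *.cms2 before crux-pipeline starts, it can help to look at the expanded command first. The sketch below assumes a POSIX-style shell that supports "local" (bash, dash). The function name print_cms2_command and its arguments are hypothetical and not part of crux-pipeline. It only prints the command and does not submit anything.

```shell
# Hypothetical helper, not part of crux-pipeline: print the command that a
# *.cms2 glob would expand to, without running it.
# Usage: print_cms2_command <data dir> <FASTA file> <parameter file or keyword>
print_cms2_command() {
    local data_dir="$1" fasta="$2" params="$3"

    # Replace the positional parameters with the files matched by the glob.
    set -- "$data_dir"/*.cms2

    # An unmatched glob stays as the literal pattern, which names no file.
    if [ ! -e "$1" ]; then
        echo "no .cms2 files found in $data_dir" >&2
        return 1
    fi

    echo crux-pipeline "$@" "$fasta" "$params"
}
```

For example, the following prints a single line that starts with crux-pipeline and lists each matched file in the current directory (paths containing spaces are not quoted in the printed line):

    print_cms2_command . example.fasta example.params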

You can also use predefined FASTA and/or parameter files. To see a list of available ones, run "crux-pipeline --list-fasta" or "crux-pipeline --list-param":

    $ crux-pipeline --list-param
    "high-low" -> /net/maccoss/vol2/software/bin/crux-pipeline-files/param/jarrett_high_low_params.txt
    "low-low" -> /net/maccoss/vol2/software/bin/crux-pipeline-files/param/jarrett_low_low_params.txt
    "high-high" -> /net/maccoss/vol2/software/bin/crux-pipeline-files/param/jarrett_high_high.params.txt

This shows that you may enter "high-low", "low-low", or "high-high" instead of the path to a parameter file. The premade parameter files are also a useful starting point: copy one from its listed path and edit the copy to suit your search.
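
To copy a premade file without retyping its path, the keyword can be looked up in the listing. This is a minimal sketch, assuming the --list-param output keeps the "keyword" -> path layout shown above. The name param_path_for is hypothetical.

```shell
# Hypothetical helper: read `--list-param`-style lines on standard input and
# print the path mapped to the given keyword (nothing if it is not listed).
# Assumes each entry looks like:  "keyword" -> /path/to/file
param_path_for() {
    sed -n "s|^\"$1\" -> ||p"
}
```

A possible use, where my_params.txt is a placeholder name for your editable copy:

    src=$(crux-pipeline --list-param | param_path_for high-low)
    cp "$src" my_params.txt

The copy can then be edited and passed as the <parameter file> argument.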

NOTE: Default memory is set to 4.0G for each Bullseye and Comet job and 8G for Percolator. You may need to request more memory for Percolator if you have many files (e.g., "--percolator-mem 12.0G").

Once you have started a run, the output files will go into the specified output directory (output_<timestamp> by default).

You can check the status of the run with:

    crux-pipeline -s <output directory>

Or cancel all jobs in a run with:

    crux-pipeline -c <output directory>

Once the run is complete, the output directory will contain a subdirectory called "crux-output" that holds the search results in sqt format and, if Percolator was run, the Percolator results in a file called combined-results.perc.xml. The results will also be uploaded into MSDaPl if the "--msdapl-id" option was specified.
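
A quick way to confirm these outputs from the shell is sketched below. The function names are hypothetical, and the .sqt file extension for the search results is an assumption based on the format name. Only combined-results.perc.xml is named in the text above.

```shell
# Hypothetical helpers for inspecting a run's output directory.

# Succeeds if the Percolator results file exists in <output dir>/crux-output.
has_percolator_results() {
    [ -f "$1/crux-output/combined-results.perc.xml" ]
}

# Prints the search result files, assuming they carry a .sqt extension.
list_sqt_results() {
    find "$1/crux-output" -type f -name '*.sqt' | sort
}
```

For a run written to my_search:

    has_percolator_results my_search && list_sqt_results my_search

The presence of the file does not prove that every job has finished writing, so "crux-pipeline -s" remains the way to check run status.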

crux-pipeline --help

Usage: crux-pipeline [options] <spectrum file>+ <FASTA file> <parameter file>

Commands:
--help (-h) - Displays this message.
--version (-v) - Displays the version number of this script.
--status (-s) <dir> - Displays the status of jobs for a directory.
--cancel (-c) <dir> - Cancels jobs for a directory.
--list-fasta - Displays a list of FASTA keywords available for this script.
--list-param - Displays a list of parameter file keywords available for this script.

Options:
--crux-path <path> - Specify the path to the Crux executable to be used.
--bullseye (-b) <T|F> - Run Hardklor and Bullseye. Default T
--percolator (-p) <T|F> - Run Percolator. Must be true for loading results into MSDaPl. Default T
--search-engine (-e) <comet|tide> - The search engine to run. Default comet
--msdapl-id (-m) <id> - The number of the MSDaPl project to load final results into. Default none
--msdapl-name (-n) <name> - The submitter's username for MSDaPl. Default none
--msdapl-species (-x) <id> - The taxonomy ID of the target species for MSDaPl. Default none
--msdapl-instrument (-i) <instrument> - The name of the instrument used to acquire data for MSDaPl. Default none
--msdapl-comment (-t) <comment> - Comments to be used when uploading data to MSDaPl. Default none
--output-dir (-o) <dir> - The directory where results files will be outputted. Default output_<timestamp>
--bullseye-mem <value>, --comet-mem <value>, --tide-index-mem <value>, --tide-search-mem <value>, --percolator-mem <value>, --msdapl-mem <value> - How much memory to request for a command's job. Default 2.0G

Command expected runtime:
--bullseye-rt=<value>
--comet-rt=<value>
--tide-index-rt=<value>
--tide-search-rt=<value>
--percolator-rt=<value>
--msdapl-rt=<value>
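
Because Percolator must run for results to be loaded into MSDaPl, a wrapper script could reject the conflicting combination before any jobs are submitted. This is a sketch only: check_msdapl_args is a hypothetical function, not part of crux-pipeline, and it assumes options are passed in the space-separated form used in the example earlier (for instance "--msdapl-id 1000").

```shell
# Hypothetical pre-flight check, not part of crux-pipeline.
# Fails if the arguments ask for an MSDaPl upload while turning Percolator off.
check_msdapl_args() {
    local percolator=T upload=F

    while [ "$#" -gt 0 ]; do
        case "$1" in
            --percolator|-p) percolator="${2:-}" ;;  # value follows the flag
            --msdapl-id|-m)  upload=T ;;
        esac
        shift
    done

    if [ "$upload" = T ] && [ "$percolator" = F ]; then
        echo "--msdapl-id requires Percolator; drop '--percolator F'" >&2
        return 1
    fi
}
```

Inside a wrapper script it could guard the real call:

    check_msdapl_args "$@" && crux-pipeline "$@"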