percolator

Description:

This program takes as input two sets of peptide-spectrum matches (PSMs), one set derived from searching a real peptide database (target PSMs), and the second derived from searching a shuffled or reversed version of the database (decoy PSMs). The program learns to discriminate between the target and decoy PSMs, and then uses the learned classifier to re-rank the target PSMs. Each target PSM is assigned a q-value, which is defined as the minimal false discovery rate at which this PSM is deemed significant.

An optional, second set of decoy PSMs can be supplied. In this case, the classifier is trained using the first set of decoys, and the q-values for the final ranking are computed using the second set of decoys.
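
As an illustration of the q-value definition above, the following Python sketch estimates q-values from a list of target scores and a list of decoy scores. It is not percolator's own code: the function name q_values is hypothetical, and the simple estimator used here (number of decoys divided by number of targets at or above a score threshold) is an assumption that may differ from what the program actually computes.

```python
def q_values(target_scores, decoy_scores):
    """Return one q-value per target score, in input order.

    Assumed estimator: FDR at a threshold is (#decoys at or above it) /
    (#targets at or above it), capped at 1. The q-value of a PSM is the
    smallest FDR over all thresholds that still accept that PSM.
    """
    # Pool the scores: label 1 for targets, -1 for decoys.
    pooled = [(s, 1, i) for i, s in enumerate(target_scores)]
    pooled += [(s, -1, None) for s in decoy_scores]
    # Sort by decreasing score; on ties, decoys come first (conservative).
    pooled.sort(key=lambda item: (-item[0], item[1]))

    # Walk down the ranking and record the estimated FDR at each position.
    fdrs = []
    n_targets = n_decoys = 0
    for _, label, _ in pooled:
        if label == 1:
            n_targets += 1
        else:
            n_decoys += 1
        fdrs.append(min(1.0, n_decoys / n_targets) if n_targets else 1.0)

    # The q-value is the running minimum of the FDR, taken from the bottom up.
    result = [1.0] * len(target_scores)
    running_min = 1.0
    for (_, label, index), fdr in zip(reversed(pooled), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if label == 1:
            result[index] = running_min
    return result


# Toy scores, invented for illustration only.
example = q_values([5.0, 4.0, 2.0, 1.0], [3.0, 0.5])
```

Taking the running minimum from the bottom of the ranking guarantees that q-values never increase as the score increases, which matches the "minimal false discovery rate" wording of the definition.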

Usage:

The primary usage mode is as follows:

percolator [options] <target PSMs> <decoy PSMs> [<decoy2 PSMs>]

Alternatively, one can specify the command line thus:

percolator [options] -P <substring> <mixed PSMs>

In this second mode, the given SQT file contains a mixture of target and decoy PSMs. The <substring> specifies how to tell the difference between targets and decoys. Specifically, if the protein ID in the SQT file contains the specified substring, then the PSM is labeled as a decoy; otherwise, it is assumed to be a target PSM.
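
The labeling rule of this second mode can be sketched in a few lines of Python. The function name label_psm and the example substring "random_" are hypothetical, and the use of 1 and -1 as return values simply borrows the label convention of the -g mode; how the program treats a PSM that matches several proteins is not specified here, so the sketch handles a single protein ID.

```python
def label_psm(protein_id, decoy_substring):
    """Return -1 (decoy) if the protein ID contains the substring, else 1 (target)."""
    return -1 if decoy_substring in protein_id else 1


# Hypothetical protein IDs; "random_" stands in for the -P argument.
labels = [label_psm(pid, "random_") for pid in ["YAL001C", "random_YAL001C"]]
```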

Finally, the third command line mode is

percolator [options] -g <data> <label>

In this case, the input files are tab-delimited text files, each with a one-line header. Each row corresponds to a PSM, and the first column of both files contains the PSM identifier. In the data file, all subsequent columns contain numeric features used during training. The label file has the same number of rows as the data file but only two columns: the PSM identifier and a label indicating whether the PSM is a target (1) or a decoy (-1). Optionally, a third label value (-2) can be used; PSMs carrying this label are reserved for computing the final set of q-values.
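
To make the two file layouts concrete, here is a minimal Python sketch that reads a data file and a label file and groups the PSMs by label. The function names, the column names in the headers, and the feature values are all invented for illustration. The sketch also assumes that rows are matched between the two files by PSM identifier, which the description above does not state explicitly.

```python
import csv
import io


def read_rows(handle):
    """Read a tab-delimited file, skipping the one-line header and blank rows."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # discard the header line
    return [row for row in reader if row]


def load_psms(data_handle, label_handle):
    """Group (identifier, features) pairs by label: 1, -1 or -2."""
    data_rows = read_rows(data_handle)
    label_rows = read_rows(label_handle)
    if len(data_rows) != len(label_rows):
        raise ValueError("data and label files must have the same number of rows")
    labels = {row[0]: int(row[1]) for row in label_rows}
    groups = {1: [], -1: [], -2: []}
    for row in data_rows:
        psm_id = row[0]
        features = [float(value) for value in row[1:]]
        groups[labels[psm_id]].append((psm_id, features))
    return groups


# Hypothetical file contents, held in memory for the example.
data_text = "id\tf1\tf2\npsm1\t1.5\t0.2\npsm2\t0.3\t0.9\npsm3\t0.1\t0.4\n"
label_text = "id\tlabel\npsm1\t1\npsm2\t-1\npsm3\t-2\n"
groups = load_psms(io.StringIO(data_text), io.StringIO(label_text))
```

Here groups[1] holds the targets, groups[-1] the decoys used for training, and groups[-2] the decoys reserved for the final q-values.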

Input:

Except in the -g mode described above, input files are in SQT format. To allow the merging of small data sets, any SQT file can be replaced with a meta file: a text file containing the paths of SQT files, one path per line. For reliable results, the merged runs should be generated under similar conditions.
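
The meta file layout is simple enough to show with a short Python sketch that expands a meta file into its list of SQT paths. The function name read_meta_file is hypothetical, and ignoring blank lines and surrounding whitespace is an assumption, not documented behavior of the program.

```python
def read_meta_file(meta_path):
    """Return the SQT paths listed in a meta file, one path per line."""
    paths = []
    with open(meta_path) as handle:
        for line in handle:
            path = line.strip()
            if path:  # assumption: blank lines are ignored
                paths.append(path)
    return paths
```

For example, a meta file whose two lines are run1.sqt and run2.sqt (hypothetical file names) would yield a list of those two paths, in the same order.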

Output:

The program prints four tab-delimited columns to standard output: PSM identifier, score, estimated q-value, peptide and matched protein(s). PSMs are ranked in decreasing order by score. The first line of the output is a header line.
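
A downstream script will typically keep only the PSMs below some q-value threshold. The Python sketch below does this for output captured as text. It relies only on the column order given above (identifier first, score second, q-value third) and ignores any remaining columns; the function name, the header names, and the sample rows are invented for illustration.

```python
import csv
import io


def significant_psms(output_text, max_q):
    """Return (identifier, score, q-value) for rows with q-value <= max_q."""
    reader = csv.reader(io.StringIO(output_text), delimiter="\t")
    next(reader)  # skip the header line
    kept = []
    for row in reader:
        if not row:
            continue
        psm_id, score, q_value = row[0], float(row[1]), float(row[2])
        if q_value <= max_q:
            kept.append((psm_id, score, q_value))
    return kept


# Hypothetical captured output; real header names and values may differ.
sample = (
    "id\tscore\tq-value\tpeptide\tproteins\n"
    "psm7\t3.2\t0.0\tK.PEPTIDER.A\tYAL001C\n"
    "psm2\t1.1\t0.01\tR.ANOTHERK.L\tYBR002W\n"
    "psm9\t-0.4\t0.2\tK.LOWSCORER.T\tYCL003C\n"
)
accepted = significant_psms(sample, 0.01)
```

Because the output is already ranked by decreasing score, the accepted rows keep that order.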

Options: