percolator
Description:
This program takes as input two sets of peptide-spectrum matches (PSMs), one set derived from searching a real peptide database (target PSMs), and the second derived from searching a shuffled or reversed version of the database (decoy PSMs). The program learns to discriminate between the target and decoy PSMs, and then uses the learned classifier to re-rank the target PSMs. Each target PSM is assigned a q-value, which is defined as the minimal false discovery rate at which this PSM is deemed significant.
An optional, second set of decoy PSMs can be supplied. In this case, the classifier is trained using the first set of decoys, and the q-values for the final ranking are computed using the second set of decoys.
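The q-value definition above can be illustrated with a simple target-decoy count. This is a minimal sketch, not Percolator's own estimator. It assumes that the false discovery rate at a score threshold is estimated as the number of decoys scoring at or above the threshold divided by the number of targets scoring at or above it. The function name q_values is hypothetical.

```python
def q_values(target_scores, decoy_scores):
    """Return (score, q-value) pairs for the targets, ordered by decreasing score.

    Assumption: FDR at a threshold = decoys at or above it / targets at or above it.
    """
    labeled = [(s, True) for s in target_scores] + [(s, False) for s in decoy_scores]
    # Sort by decreasing score; on ties, decoys come first so the estimate is conservative.
    labeled.sort(key=lambda item: (-item[0], item[1]))

    scores, fdrs = [], []
    n_targets = n_decoys = 0
    for score, is_target in labeled:
        if is_target:
            n_targets += 1
            scores.append(score)
            fdrs.append(min(1.0, n_decoys / n_targets))
        else:
            n_decoys += 1

    # The q-value is the minimal FDR over this threshold and all more permissive ones.
    qvals = [0.0] * len(fdrs)
    running_min = 1.0
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        qvals[i] = running_min
    return list(zip(scores, qvals))
```

For example, q_values([10, 8, 7], [9]) assigns the targets scoring 8 and 7 the same q-value of 1/3: the FDR estimate at threshold 8 is 1/2, but it drops to 1/3 once the threshold is lowered to 7, and the q-value takes the minimum.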
Usage:
The primary usage mode is as follows:
percolator [options] <target PSMs> <decoy PSMs> [<decoy2 PSMs>]
Alternatively, one can specify the command line thus:
percolator [options] -P <substring> <mixed PSMs>
In this second mode, the given SQT file contains a mixture of target and decoy PSMs. The <substring> specifies how to tell the difference between targets and decoys. Specifically, if the protein ID in the SQT file contains the specified substring, then the PSM is labeled as a decoy; otherwise, it is assumed to be a target PSM.
Finally, the third command line mode is
percolator [options] -g <data> <label>
In this case, the input files are tab-delimited files with a one-line header. Each row corresponds to a PSM, and the first column in both files contains PSM identifiers. In the data file, all subsequent columns contain numeric features used during training. The label file has the same number of rows as the data file, but only two columns: a PSM identifier, and a label indicating whether the PSM is a target (1) or a decoy (-1). Optionally, a third label can be included (-2), in which case these PSMs are reserved for use in computing the final set of q-values.
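The data and label file layout described for the -g mode can be sketched as follows. This is an illustration only: the header names ("id", "label", and the feature names), the helper names, and the requirement that identifiers match row by row are assumptions, not something the program is known to enforce.

```python
import csv
import os
import tempfile

VALID_LABELS = (1, -1, -2)  # target, decoy, decoy reserved for final q-values


def write_gist_files(trunk, rows, feature_names):
    """Write <trunk>.data and <trunk>.label from (psm_id, features, label) rows."""
    with open(trunk + ".data", "w", newline="") as data_f, \
         open(trunk + ".label", "w", newline="") as label_f:
        data_w = csv.writer(data_f, delimiter="\t", lineterminator="\n")
        label_w = csv.writer(label_f, delimiter="\t", lineterminator="\n")
        data_w.writerow(["id"] + list(feature_names))   # one-line header
        label_w.writerow(["id", "label"])               # one-line header
        for psm_id, features, label in rows:
            if label not in VALID_LABELS:
                raise ValueError(f"unexpected label {label!r} for {psm_id}")
            if len(features) != len(feature_names):
                raise ValueError(f"wrong number of features for {psm_id}")
            data_w.writerow([psm_id] + [repr(float(x)) for x in features])
            label_w.writerow([psm_id, label])


def read_gist_files(data_path, label_path):
    """Read both files back and check that they describe the same PSMs."""
    with open(data_path, newline="") as f:
        data = list(csv.reader(f, delimiter="\t"))[1:]    # skip header
    with open(label_path, newline="") as f:
        labels = list(csv.reader(f, delimiter="\t"))[1:]  # skip header
    if len(data) != len(labels):
        raise ValueError("data and label files differ in row count")
    rows = []
    for d, l in zip(data, labels):
        if d[0] != l[0]:
            raise ValueError(f"identifier mismatch: {d[0]} vs {l[0]}")
        rows.append((d[0], [float(x) for x in d[1:]], int(l[1])))
    return rows


def demo():
    """Round-trip three made-up PSMs through a temporary directory."""
    rows = [("psm_1", [2.5, 0.1], 1), ("psm_2", [0.7, 0.3], -1), ("psm_3", [0.2, 0.9], -2)]
    with tempfile.TemporaryDirectory() as tmp:
        trunk = os.path.join(tmp, "example")
        write_gist_files(trunk, rows, ["feature_a", "feature_b"])
        return read_gist_files(trunk + ".data", trunk + ".label")
```

Files written this way would then be passed as the <data> and <label> arguments of the third command line mode.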
Input:
Input files are in SQT format. However, to allow the merging of small data sets, one can replace the SQT files with meta files. Meta files are text files containing the paths of SQT files, one path per line. For successful results, the different runs should be generated under similar conditions.
Output:
The program prints four tab-delimited columns to standard output: PSM identifier, score, estimated q-value, peptide and matched protein(s). PSMs are ranked in decreasing order by score. The first line of the output is a header line.
Options:
-A classic|direct|targeted, --Algorithm classic|direct|targeted
Select the variant of the Percolator algorithm to employ. Default is the classic, iterative algorithm described in the original Nature Methods paper. Direct optimization uses a nonlinear neural network with a sigmoid loss function. The targeted approach optimizes for a particular q-value, specified using -F.
-o <filename>, --sqt-out <filename>
Create an SQT file with the specified name from the given target SQT file, replacing the XCorr value with the learned score and Sp with the negated q-value.
-s <filename>
Same as -o, but for the decoy SQT file.
-P <value>
Option for single SQT file mode, defining the name pattern used for the shuffled database. Typically set to random_seq.
-p <value>
Cpos, the penalty for mistakes made on positive examples. Set by cross validation if not specified.
-n <value>
Cneg, the penalty factor for mistakes made on negative examples. Set by cross validation if not specified.
-F <value>, --fdr <value>
Use the specified false discovery rate threshold to define positive examples in training. With -A classic, this value is chosen by cross validation if set to 0. For the targeted q-value optimization (-A targeted), this option specifies the q-value target. Default is 0.01.
-t <value>, --testFDR <value>
When performing cross validation to select Cpos and Cneg, use as a performance metric the number of accepted target PSMs at the specified FDR. Default is 0.01.
-i <value>, --maxiter <value>
Maximum number of iterations.
-m <value>
Number of matches to take into consideration per spectrum when using SQT files. Default is 1.
-G <trunk name>, --gist-out <trunk name>
Output the computed features in tab-delimited format. Two files will be created: one with the features, named <trunk name>.data, and one with the labels, named <trunk name>.label.
-g, --gist-in
Input files are given as gist files. In this case, the first argument should be the name of the data file and the second the name of the label file. Labels are interpreted as follows: 1 = positive, in both train and test sets; -1 = negative, in the train set; -2 = negative, in the test set.
-w <filename>
Output the final feature weights to the given file.
-r <filename>
Output the result file (score-ranked labels) to the given file.
-u
Use unit normalization [0-1] instead of standard deviation normalization.
-Q
Calculate quadratic feature terms.
-y
Turn off calculation of tryptic features.
-c
Replace tryptic features with chymo-tryptic features.
-e
Replace tryptic features with elastase features.
-x
Perform hyperparameter cross validation on the whole iterative procedure, rather than at each iteration step.
-q <value>, --pi0 <value>
Estimated proportion of normal PSMs generated from the null distribution. Default value is 1-(0.1/<value of -m option>).
-b, --PTM
Calculate a feature for the number of post-translational modifications.
-d, --DTASelect
Add an extra hit to each spectrum when writing the SQT file.
-R, --test-each-iteration
Measure performance on the test set at each iteration.
-a, --aa-freq
Calculate amino acid frequency features.
-v <level>, --verbose <level>
Set verbosity of output: 0 = no processing info, 5 = all. Default is 2.
-f <value>, --train-ratio <value>
Fraction of the negative data set to be used as the train set when only one negative set is provided. The remaining examples will be used as the test set. Default is 0.7.
-I, --intra-set
Turn off calculation of intra-set features.
-h, --help
Print a help message.
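As an illustration of working with the tab-delimited output described under Output, the sketch below keeps only PSMs whose estimated q-value is at or below a threshold. It assumes only what the description states: a one-line header, and the PSM identifier, score, and q-value in the first three columns. Any remaining columns are passed through untouched. The function name, the header names, and the sample values are made up for the example.

```python
import csv
import io


def accepted_psms(lines, max_q=0.01):
    """Return (psm_id, score, q_value, other_columns) for rows with q_value <= max_q."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header line
    accepted = []
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or truncated lines
        psm_id, score, q_value = row[0], float(row[1]), float(row[2])
        if q_value <= max_q:
            accepted.append((psm_id, score, q_value, row[3:]))
    return accepted


# Made-up sample in the documented column order; header names are hypothetical.
SAMPLE = (
    "PSMId\tscore\tq-value\tpeptide\tproteins\n"
    "psm_1\t2.5\t0.001\tK.PEPTIDER.A\tprotA\n"
    "psm_2\t0.3\t0.2\tR.OTHERK.L\tprotB\n"
)

if __name__ == "__main__":
    for psm in accepted_psms(io.StringIO(SAMPLE)):
        print(psm)
```

In practice the lines argument could be an open file holding the program's redirected standard output; the 0.01 default here simply mirrors the default thresholds of -F and -t.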