Protein ranking: from local to global structure in the protein
similarity network -- Supplementary data
Jason Weston, Andre Elisseeff, Dengyong Zhou, Christina Leslie and
William Stafford Noble
Abstract
Biologists regularly search databases of DNA or protein sequences for
evolutionary or functional relationships to a given query sequence.
We describe a ranking algorithm that exploits the entire network
structure of similarity relationships among proteins in a sequence
database by performing a diffusion operation on a pre-computed,
weighted network. The resulting ranking algorithm, evaluated using a
human-curated database of protein structures, is efficient and
provides significantly better rankings than a local network search
algorithm such as PSI-BLAST.
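The following is a minimal sketch of what a diffusion operation over a pre-computed, weighted network can look like. It is not the RankProp code linked further down: the update rule, the damping parameter `alpha`, the fixed iteration count, the function name `diffuse_scores` and the dense matrix representation are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative diffusion-based ranking on a small, dense weighted network.
// Assumption: K[i][j] is a non-negative similarity weight from node i to node j.
// Hypothetical update rule (not taken from the paper's code):
//   y_j <- K[query][j] + alpha * sum over i != query of K[i][j] * y_i
// The query's own score is held fixed at 1.
std::vector<double> diffuse_scores(const std::vector<std::vector<double>>& K,
                                   std::size_t query,
                                   double alpha,
                                   int iterations) {
    const std::size_t n = K.size();
    std::vector<double> y(n, 0.0);
    y[query] = 1.0;
    for (int t = 0; t < iterations; ++t) {
        std::vector<double> next(n, 0.0);
        next[query] = 1.0;
        for (std::size_t j = 0; j < n; ++j) {
            if (j == query) continue;
            double propagated = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                // Skip the query here: its contribution is the direct term below.
                if (i != query) propagated += K[i][j] * y[i];
            }
            next[j] = K[query][j] + alpha * propagated;
        }
        y.swap(next);
    }
    return y;  // higher score = ranked closer to the query
}
```

On a three-node chain in which the query is linked only to node 1 and node 1 only to node 2, node 2 ends up with a nonzero score although it has no direct edge to the query. That is the kind of global effect a purely local search would miss.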
-
Supplementary results (in PDF format).
-
Animation (GIF) of Figure 12 from the
supplement.
-
ROC50 scores for all queries and
all detection methods from the paper in plain text format.
-
Training sequences. These are given as
indexes into the SCOP set (first column), in the same order as the
FASTA file below. The third column is the superfamily each sequence
belongs to, given as a class ID, and the fourth is the fold (as defined in SCOP).
-
Testing sequences (queries), in the same format as above.
-
Training and testing sequences, in the same format as
above. Pass this file as the second argument to
eval.cpp.
-
SCOP sequence file in
FASTA format containing all sequences in SCOP version 1.59 with less
than 95% identity.
-
Swiss-Prot sequence file in FASTA format
containing all sequences in Swiss-Prot version 40 (zipped, 21 MB).
-
7329x7329 kernel matrices for the methods used in the experiments
(here are the IDs by row or column):
-
BLAST matrix, ASCII
text file, gzipped (49 MB).
-
PSI-BLAST (SCOP)
matrix using the complete 7329 examples as a database, ASCII text
file, gzipped (52 MB).
-
PSI-BLAST (SCOP+SPROT)
matrix using all SCOP+SPROT examples as a database, ASCII text file,
gzipped (9 MB).
-
108,931x108,931 PSI-BLAST score kernel
matrix for RankProp used in the experiments (342 MB). The first
7329 IDs are from SCOP; IDs from 7330 onwards are SPROT
proteins. Format: <index> <number of homologs> <indices
of homologs> <e-values of homologs>.
-
7329x108,931 PSI-BLAST score kernel
matrix for RankProp used for the queries
in the experiments (189 MB). Unlike the above file, all edges are given (not just the first 1000).
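The record format stated for the 108,931x108,931 matrix (<index> <number of homologs> <indices of homologs> <e-values of homologs>) can be read with a few lines of C++. The sketch below assumes that each record sits on one line, that fields are separated by whitespace, and that all indices come before all e-values. None of this is confirmed beyond the format string itself. The names `HomologRecord` and `parse_record` are hypothetical and are not part of the distributed code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One row of the sparse kernel file, under the layout assumed in the text above.
struct HomologRecord {
    long index = 0;                      // ID of this protein
    std::vector<long> homolog_indices;   // IDs of its listed homologs
    std::vector<double> e_values;        // e-value for each listed homolog
};

// Parses "<index> <count> <count indices> <count e-values>" from one line.
// Returns false if the line is truncated or contains non-numeric fields.
bool parse_record(const std::string& line, HomologRecord& out) {
    std::istringstream in(line);
    HomologRecord rec;
    long count = 0;
    if (!(in >> rec.index >> count) || count < 0) return false;
    for (long k = 0; k < count; ++k) {
        long idx = 0;
        if (!(in >> idx)) return false;
        rec.homolog_indices.push_back(idx);
    }
    for (long k = 0; k < count; ++k) {
        double e = 0.0;
        if (!(in >> e)) return false;    // accepts forms such as 1e-30
        rec.e_values.push_back(e);
    }
    out = rec;
    return true;
}
```

Reading the file line by line with `std::getline` and calling `parse_record` on each line keeps memory use proportional to one record at a time, which matters for a file of this size.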
- C++ code to run the experiments:
-
RankProp code (there is also a more general, command-line-driven version here).
-
Evaluation of a ranking provided by a
given distance matrix; returns the ROC50 score of each query.
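For readers unfamiliar with the score, the sketch below computes a ROC-n score from one ranked list, using one common definition: the area under the ROC curve up to the first n false positives, normalized to lie between 0 and 1. It is not eval.cpp and may differ from it in details. The function name `roc_n`, the boolean input encoding and the handling of lists with fewer than n negatives are assumptions.

```cpp
#include <cstddef>
#include <vector>

// ROC-n score for one query (ROC50 corresponds to n = 50).
// ranked_is_positive[k] is true if the k-th ranked item (best first) is a
// true homolog of the query. For each of the first n false positives, count
// the true positives ranked above it, then normalize by n * (total positives).
double roc_n(const std::vector<bool>& ranked_is_positive, std::size_t n) {
    std::size_t total_pos = 0;
    for (bool pos : ranked_is_positive) {
        if (pos) ++total_pos;
    }
    if (total_pos == 0 || n == 0) return 0.0;

    std::size_t tp = 0;   // true positives seen so far
    std::size_t fp = 0;   // false positives seen so far
    double area = 0.0;
    for (bool pos : ranked_is_positive) {
        if (pos) {
            ++tp;
        } else {
            area += static_cast<double>(tp);
            if (++fp == n) break;
        }
    }
    // Assumption: if the list holds fewer than n negatives, the missing false
    // positives are treated as ranked below every positive.
    area += static_cast<double>(n - fp) * static_cast<double>(tp);
    return area / (static_cast<double>(n) * static_cast<double>(total_pos));
}
```

With this definition, a ranking that places every positive above every negative scores 1, and one that places n negatives above all positives scores 0.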