+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++ Copyright 2008 University of Washington +++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This directory contains all of the information needed to train and test
the Philius topology predictor described in "Transmembrane topology and
signal peptide prediction using dynamic Bayesian networks" by Reynolds
et al. (manuscript submitted).

The directory structure is as follows:

    ./gmtk
    ./gmtk/bin
    ./gmtk/params
    ./gmtk/scripts
    ./python
    ./data
    ./data/faa
    ./data/faa/Phobius
    ./data/faa/SCAMPI
    ./data/faa/SignalP
    ./data/faa/SignalP-trunc
    ./data/obs
    ./data/obs/Phobius
    ./data/obs/SCAMPI
    ./data/obs/SignalP
    ./data/obs/SignalP-trunc

The contents of each of these directories are described below.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

gmtk : this subdirectory contains the GMTK binaries as well as all
       parameter files necessary to define and implement the Philius DBN

Note : the homepage for the Graphical Models Tool Kit is at
       ssli.ee.washington.edu/~bilmes/gmtk/

gmtk/bin/ contains binaries for four of the GMTK modules:

    - gmtkTriangulate : triangulates the graph and writes a *.trifile
      file, which is used by subsequent modules;

    - gmtkEMtrainNew : performs EM-training given a set of input
      observations, a triangulated graph, and other necessary
      information;

    - gmtkJT : performs full inference on a graph using the "junction
      tree" algorithm, given a set of input observations, a
      triangulated graph, and other necessary information;

    - gmtkViterbiNew : finds the Viterbi assignment to all hidden
      variables given a set of observation sequences, a triangulated
      graph, and other necessary information;

gmtk/params contains the various parameter files required by GMTK to
define a dynamic Bayesian network:

    - Philius uses 3 separate graphs: one for training
      ("model_train.str") and one for each of the two decoding stages
      ("model_test?.str"). A single graph is defined in a single
      'structure' file (with the ending ".str"). Since each .str file
      is automatically pre-processed by the C pre-processor, we can
      make use of #include commands to make the file more modular and
      more readable.

    - Before inference can be performed on a graph, it needs to be
      "triangulated". This is done by gmtkTriangulate using the
      following command, for example:

          gmtkTriangulate -strFile model_train.str

      and the output is automatically named "model_train.str.trifile".

    - Because we wish to compute posterior probabilities for certain
      individual hidden variables in the 1st decoding pass, a custom
      "trifile" is created (model_tst1.mytrifile) which defines
      additional single-variable cliques in each frame of the graph.

    - Aside from the structure file, the next most important parameter
      file is known as the "master" file. This file contains much (if
      not all) of the additional information required to do inference
      on a particular graph. Philius uses 3 separate "master" files:
      one for training, and one for each of the decoding stages. These
      files are named "model_train.master.EM", "model_test1.master.JT"
      and "model_test2.master.Viterbi".

      The master file used during training contains the definitions of
      two types of conditional probability tables (CPTs). Deterministic
      CPTs are implemented using decision trees and are non-trainable.
      Probabilistic CPTs are implemented as (N+1)-dimensional tables
      (where N is the number of parents a discrete variable has) and
      may be trainable.

    - Input observation files are specified using a "list" file -- this
      file contains a list of all of the input (or output) files. Two
      sample list files included here are "bigTrainingSet.list" and
      "smallTestSet.list".

gmtk/scripts contains two sample scripts, "runTrain.sh" and
"runTest.sh", which illustrate how to use GMTK to train and test the
Philius model.

    - runTrain.sh invokes gmtkEMtrainNew

          inputs : model_train.str (and implicitly the associated
                   trifile)
                   bigTrainingSet.list
                   model_train.master.EM
          output : learnedParams.out

    - runTest.sh invokes gmtkJT for the 1st pass, and gmtkViterbiNew
      for the 2nd pass, with the python script prepVEdata.py used in
      between

Both of these scripts should be invoked from the gmtk directory as
follows:

    ./scripts/runTrain.sh
and
    ./scripts/runTest.sh params/smallTestSet.list

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

python : this subdirectory contains a few python scripts that are
         necessary or useful for running Philius

    - faa2gmtk.py : this python script reads an input labeled FASTA
      file and writes out a set of GMTK observation files (one per
      input protein)

    - prepVEdata.py : this python script needs to be run between the
      two stages of the decoding process; it gathers the posterior
      probabilities from the 1st stage and sets up the information
      necessary for the 2nd stage

    - writeReport.py : the Viterbi-assigned state sequences are written
      out to binary files by GMTK -- this python script reads those
      binary files and writes out a human-readable report summarizing
      the predicted topology for each protein in the input test set

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

data : this subdirectory contains the Phobius, SignalP and SCAMPI
       datasets that were used in the development and testing of
       Philius

data/faa/ contains labeled FASTA files for the 3 datasets

data/obs/ contains the GMTK observation files: there is one file for
each protein, and each file contains two columns of integers: the first
column represents the amino acid, and the second column represents the
label (used only during training)

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
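
The two-column layout of the observation files described above can be
checked with a short python sketch such as the one below. It is not part
of the Philius distribution. It assumes that the observation files are
plain ASCII text with the two integers on each line separated by
whitespace; the function names parse_obs_text and read_obs_file are
hypothetical, and the integer values in the usage note are invented for
illustration (the actual amino acid and label encodings are whatever
faa2gmtk.py writes).

```python
from pathlib import Path
from typing import List, Tuple


def parse_obs_text(text: str) -> Tuple[List[int], List[int]]:
    """Split observation text into (amino acid codes, labels).

    Assumption: one residue per line, two whitespace-separated integers.
    """
    residues: List[int] = []
    labels: List[int] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 columns, got {len(fields)}"
            )
        residues.append(int(fields[0]))  # first column: amino acid
        labels.append(int(fields[1]))    # second column: label
    return residues, labels


def read_obs_file(path: str) -> Tuple[List[int], List[int]]:
    """Read one per-protein observation file from disk."""
    return parse_obs_text(Path(path).read_text())
```

For example, parse_obs_text("3 0\n7 1\n") returns ([3, 7], [0, 1]). Since
the label column is used only during training, a test-time consumer
could simply ignore the second list.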