+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++ Copyright 2008 University of Washington +++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

This directory contains all of the information needed to train and test the
Philius topology predictor described in "Transmembrane topology and signal
peptide prediction using dynamic Bayesian networks" by Reynolds et al.
(manuscript submitted).

The directory structure is as follows:

	./gmtk
		./gmtk/bin
		./gmtk/params
		./gmtk/scripts

	./python

	./data
		./data/faa
			./data/faa/Phobius
			./data/faa/SCAMPI
			./data/faa/SignalP
			./data/faa/SignalP-trunc
		./data/obs
			./data/obs/Phobius
			./data/obs/SCAMPI
			./data/obs/SignalP
			./data/obs/SignalP-trunc

The contents of each of these directories are described below.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

gmtk : this subdirectory contains the GMTK binaries as well as all parameter
files necessary to define and implement the Philius DBN

Note : the homepage for the Graphical Models Tool Kit is at 
ssli.ee.washington.edu/~bilmes/gmtk/

gmtk/bin/ contains binaries for four of the GMTK modules:
	-  gmtkTriangulate : triangulates the graph and writes a *.trifile
	   that is used by subsequent modules;
	-  gmtkEMtrainNew : performs EM-training given a set of input 
	   observations, a triangulated graph, and other necessary
	   information;
	-  gmtkJT : performs full inference on a graph using the "junction
	   tree" algorithm, given a set of input observations, a triangulated
	   graph, and other necessary information;
	-  gmtkViterbiNew : finds the Viterbi assignment to all hidden
	   variables given a set of observation sequences, a triangulated
	   graph, and other necessary information.

gmtk/params contains the various parameter files required by GMTK to define a
dynamic Bayesian network:

	- Philius uses 3 separate graphs: one for training ("model_train.str") 
	  and one for each of the two decoding stages ("model_test?.str").  
	  Each graph is defined in its own 'structure' file (with the 
	  ending ".str").  Since each .str file is automatically pre-processed 
	  by the C pre-processor, we can use #include commands to make the
	  files more modular and more readable.

	- Before inference can be performed on a graph, it needs to be 
	  "triangulated".  This is done by gmtkTriangulate using the following
	  command, for example :
		gmtkTriangulate -strFile model_train.str
	  and the output is automatically named "model_train.str.trifile".

	- Because we wish to compute posterior probabilities for certain 
	  individual hidden variables in the 1st decoding pass, a custom
	  "trifile" is created (model_tst1.mytrifile) which defines additional
	  single-variable cliques in each frame of the graph.

	- Aside from the structure file, the next most important parameter 
	  file is known as the "master" file.  This file contains much (if not
	  all) of the additional information required to do inference on a
	  particular graph.  Philius uses 3 separate "master" files, one
	  for training, and one for each of the decoding stages.  These files
	  are named "model_train.master.EM", "model_test1.master.JT" and
	  "model_test2.master.Viterbi".

	  The master file used during training contains the definitions of two
	  types of conditional probability tables (CPTs).  Deterministic CPTs
	  are implemented using decision trees and are non-trainable.  
	  Probabilistic CPTs are implemented as (N+1)-dimensional tables (where
	  N is the number of parents a discrete variable has) and may be 
	  trainable.

	- Input observation files are specified using a "list" file -- this 
	  file contains a list of all of the input (or output) files.  Two
	  sample list files included here are "bigTrainingSet.list" and 
	  "smallTestSet.list".
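
To illustrate the two CPT types described above for the training master file,
here is a minimal Python sketch.  It is not GMTK's file syntax and is not taken
from the Philius parameter files: the function names, the cardinalities and the
toy decision rule are all hypothetical.  It only shows why a variable with N
parents needs an (N+1)-dimensional table, and how a deterministic CPT differs.

```python
def uniform_cpt(parent_cards, child_card):
    """Build an (N+1)-dimensional table as nested lists.

    parent_cards holds the cardinality of each of the N parents (hypothetical
    values); the innermost list is a distribution over the child variable.
    """
    if not parent_cards:
        return [1.0 / child_card] * child_card
    return [uniform_cpt(parent_cards[1:], child_card)
            for _ in range(parent_cards[0])]


def lookup(cpt, parent_values, child_value):
    """Return P(child = child_value | parents = parent_values)."""
    table = cpt
    for value in parent_values:
        table = table[value]      # one index per parent ...
    return table[child_value]     # ... plus one for the child itself


def deterministic_child(parent_values):
    """Toy decision tree: the parents fix the child value exactly.

    Hypothetical rule, for illustration only: move to the next state when
    the second parent is 1, otherwise stay in the current state.
    """
    previous_state, advance = parent_values
    if advance == 1:
        return previous_state + 1
    return previous_state


# Two parents (cardinalities 2 and 3) and a child with 4 values: a 2 x 3 x 4 table.
cpt = uniform_cpt([2, 3], 4)
```

Each innermost row of a probabilistic table sums to one, and these rows are
what EM training would re-estimate.  A deterministic CPT has nothing to
estimate, which is why it is non-trainable.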

gmtk/scripts contains two sample scripts: "runTrain.sh" and "runTest.sh" which
illustrate how to use GMTK to train and test the Philius model.

	- runTrain.sh invokes gmtkEMtrainNew 
		inputs : model_train.str (and implicitly the associated trifile)
			 bigTrainingSet.list
			 model_train.master.EM
		output : learnedParams.out

	- runTest.sh invokes gmtkJT for the 1st pass and gmtkViterbiNew for the
	  2nd pass, with the python script prepVEdata.py run in between

Both of these scripts should be invoked from the gmtk directory as follows:
	./scripts/runTrain.sh
and	./scripts/runTest.sh params/smallTestSet.list

	
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

python : this subdirectory contains a few python scripts that are necessary
or useful for running Philius

	- faa2gmtk.py : this python script reads a labeled FASTA input file
	  and writes out a set of GMTK observation files (one per input 
	  protein)

	- prepVEdata.py : this python script needs to be run between the
	  two stages of the decoding process -- it gathers the posterior 
	  probabilities from the 1st stage and sets up the information 
	  necessary for the 2nd stage

	- writeReport.py : the Viterbi-assigned state sequences are written
	  out to binary files by GMTK -- this python script reads those
	  binary files and writes out a human-readable report summarizing
	  the predicted topology for each protein in the input test set

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

data : this subdirectory contains the Phobius, SignalP and SCAMPI datasets 
that were used in the development and testing of Philius

data/faa/ contains labeled FASTA files for the 3 datasets

data/obs/ contains the GMTK observation files: there is one file for each
protein, and each file contains two columns of integers: the first column
represents the amino acid, and the second column represents the label (used
only during training)
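
As a rough illustration of the two-column layout described above, the following
Python sketch parses the contents of one observation file into (amino acid,
label) pairs.  It assumes the files are plain text with whitespace-separated
columns, which this README does not state explicitly.  The function name is
hypothetical, and the sketch is not part of the python/ scripts.

```python
def parse_observations(text):
    """Parse two integer columns into a list of (amino_acid, label) pairs.

    Assumes plain text with one residue per line and whitespace-separated
    columns; blank lines are skipped.
    """
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 columns, found %d"
                % (line_number, len(fields))
            )
        pairs.append((int(fields[0]), int(fields[1])))
    return pairs


# Made-up example values, not taken from any of the datasets.
example = parse_observations("3 0\n17 1\n\n5 1\n")
```

The integer codes themselves are left uninterpreted here, since the mapping
from amino acids and labels to integers is defined by faa2gmtk.py.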

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++