hmmseg

Usage:  java -jar HMMSeg.jar [options] <file list> [<file list>]+
  java src/edu/washington/gs/hmmseg/HMMSeg [options] <file list> [<file list>]+
Java command-line options may also prove useful. For large datasets, for instance, it may be necessary to request a larger heap with the -Xmx option, as in java -Xmx1024m ...

Description:

Train a completely connected hidden Markov model on a given set of data, and use the trained model to segment the data. Optionally smooth the data using wavelets before segmenting.

Input:

The <file list> is a file containing a list of input data file names, one name per line. Each data file in the list must either contain one floating point number per line or be in BED format (see the --input-bed option). For example, you may have experimental data over two regions, contained in exp1.reg1.bed and exp1.reg2.bed. Then you would create a file, exp1.list, with the lines

exp1.reg1.bed
exp1.reg2.bed
and exp1.list is the file list that goes on the command line.

Optionally, additional file lists may be provided. In this case, the HMM uses a multi-dimensional Gaussian, with one dimension from each file list. Hence, the various file lists must contain the same number of file names, and the corresponding data files from different lists must also be of the same length.
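
The consistency requirement on multiple file lists can be checked before launching a run. The following is a minimal Python sketch and is not part of HMMSeg. The function names are hypothetical, and it assumes the one-number-per-line format (not BED) and that the file names in each list resolve from the current working directory.

```python
from pathlib import Path


def read_file_list(list_path):
    """Return the data file names in a file list, one per non-empty line."""
    lines = Path(list_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def count_values(data_path):
    """Count the non-empty lines (one value per line) in a data file."""
    with open(data_path) as handle:
        return sum(1 for line in handle if line.strip())


def check_file_lists(list_paths):
    """Raise ValueError if the file lists are not mutually consistent.

    Every list must name the same number of files, and the i-th files
    of all lists must hold the same number of values.
    """
    lists = [read_file_list(path) for path in list_paths]
    sizes = {len(names) for names in lists}
    if len(sizes) > 1:
        raise ValueError(f"file lists differ in number of entries: {sorted(sizes)}")
    # zip(*lists) groups the corresponding data files from each list.
    for group in zip(*lists):
        lengths = [count_values(name) for name in group]
        if len(set(lengths)) > 1:
            raise ValueError(f"length mismatch across {group}: {lengths}")
    return lists
```

Calling check_file_lists(["exp1.list", "exp2.list"]) would then either return the parsed lists or report the first mismatch; exp2.list here is an illustrative name only.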

Output:

The program randomly initializes an n-state hidden Markov model with a single Gaussian at each state, and then trains the model on the given data using expectation-maximization (EM). The resulting model is printed to standard output.

Initial model parameters are selected randomly as follows. Gaussian means are selected uniformly at random from within the range of the given data ([min,max]). Variances are selected uniformly at random between 0 and max - min. Transition probabilities are initialized uniformly at random from the interval <0,1>, and then renormalized so that the outgoing probabilities from each state sum to 1.
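
The initialization described above can be sketched in Python for the one-dimensional case. This is an illustration of the stated procedure, not HMMSeg's actual code. The function name and return layout are hypothetical, and giving the start state its own randomly drawn transition row is an assumption.

```python
import random


def random_initial_model(data, n_states, rng=None):
    """Draw random initial HMM parameters for one-dimensional data."""
    rng = rng or random.Random()
    lo, hi = min(data), max(data)

    # Means: uniform over the range of the data, [min, max].
    means = [rng.uniform(lo, hi) for _ in range(n_states)]

    # Variances: uniform between 0 and max - min.
    variances = [rng.uniform(0.0, hi - lo) for _ in range(n_states)]

    # Transitions: uniform draws, renormalized so each row sums to 1.
    # One row per emitting state plus one for the start state (assumption).
    transitions = []
    for _ in range(n_states + 1):
        row = [rng.uniform(0.0, 1.0) for _ in range(n_states)]
        total = sum(row)
        transitions.append([p / total for p in row])

    return {"means": means, "variances": variances, "transitions": transitions}
```

Passing a seeded generator, for example random.Random(0), makes the draw reproducible, which is convenient when comparing EM runs started from different initial models.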

The model file format is as follows. The first line begins with the keyword "HMMSEG", followed by the number of states (n) in the model, not counting the non-emitting start state. The rest of the file consists of n+1 blocks of 3 lines each. Within each block, the lines are:

  • The keyword "State", followed by the integer index of this state. The indices appear in increasing order, with "Start" for the start state.
  • The keyword "Emissions", followed by the mean and variance of the Gaussian at this state. The means increase as the state index increases. The start state has "Emissions none" on this line.
  • The keyword "Transitions", followed by n transition probabilities out of the current state to each of the (non-start) states of the model. These transition probabilities must sum to 1.

The end of the file contains free-form text that includes the date the model file was created, the hmmseg command line that was used, the total probability of the data, given the model, and other useful information.
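
A reader for this format might look like the following Python sketch, covering the single-Gaussian, one-dimensional case. It is not part of HMMSeg. It assumes that the fields on each line are separated by whitespace, that blank lines can be ignored, and that everything after the n+1 blocks is the free-form trailer. The function name, the returned dictionary layout and the tolerance are hypothetical.

```python
def parse_model(text, tol=1e-6):
    """Parse the header and the n+1 state blocks of a model file."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 2 or rows[0][0] != "HMMSEG":
        raise ValueError("missing HMMSEG header line")
    n = int(rows[0][1])

    states = {}
    for block in range(n + 1):
        chunk = rows[1 + 3 * block : 4 + 3 * block]
        if len(chunk) != 3:
            raise ValueError("truncated state block")
        state_row, emission_row, transition_row = chunk
        keywords = (state_row[0], emission_row[0], transition_row[0])
        if keywords != ("State", "Emissions", "Transitions"):
            raise ValueError(f"unexpected keywords in block: {keywords}")

        # The start state carries "Emissions none".
        if emission_row[1] == "none":
            emission = None
        else:
            emission = (float(emission_row[1]), float(emission_row[2]))

        transitions = [float(p) for p in transition_row[1:]]
        if len(transitions) != n:
            raise ValueError("expected n transition probabilities")
        if abs(sum(transitions) - 1.0) > tol:
            raise ValueError("transition probabilities do not sum to 1")

        # Keys are the state labels as written, e.g. "Start" or an index.
        states[state_row[1]] = {"emission": emission, "transitions": transitions}

    # Remaining rows are the free-form trailer and are not interpreted.
    return {"n_states": n, "states": states}
```

State labels are kept as strings so that the sketch makes no assumption about where the "Start" block appears or what the first integer index is.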

Standard output can be redirected to a file, and that file can then be used as the model file for future runs.

In addition to the model, the program produces a collection of output files, one per file in the <file list>. In the output files, each floating point value in the original input file is replaced by an integer state index. Each output file is named the same as the corresponding input file, with .vit or .total appended to the end, depending upon which parsing algorithm was employed. In the case of multi-dimensional inputs, the output file names are derived from the first file list. In the case of BED-format input, the output is a collection of wiggle files, one per input file with .wig appended to each file name (see the documentation for --input-bed below). If --smooth or --smooth-only is specified, additional output files are generated. See those options below.
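
Because each output file holds one state index per input value, consecutive runs of the same index correspond to segments. The Python sketch below collapses such a file into segments. It is not part of HMMSeg, and it assumes one integer per line as described above. The function names and the half-open (start, end, state) convention are hypothetical choices.

```python
from itertools import groupby


def read_states(path):
    """Read one integer state index per non-empty line."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def to_segments(states):
    """Collapse a state sequence into (start, end, state) runs.

    Positions are 0-based line offsets, and end is exclusive.
    """
    segments = []
    start = 0
    for state, run in groupby(states):
        length = sum(1 for _ in run)
        segments.append((start, start + length, state))
        start += length
    return segments
```

For a Viterbi output file named as described above, to_segments(read_states("exp1.reg1.bed.vit")) would list the contiguous regions assigned to each state; the file name is illustrative only.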

Options:

Availability and Installation Instructions:

HMMSeg is available at http://noble.gs.washington.edu/proj/hmmseg

Warning messages: None

Bugs: