Usage: java -jar HMMSeg.jar [options] <file list> [<file list>]+
java src/edu/washington/gs/hmmseg/HMMSeg [options] <file list> [<file list>]+
Java command-line options may also prove useful. For large datasets, for instance, it may be necessary to request a larger heap size with the -Xmx option, as in
java -Xmx1024m ...
Description:
Train a completely connected hidden Markov model on a given set of data, and use the trained model to segment the data. Optionally smooth the data using wavelets before segmenting.
Input:
The <file list> is a file containing a list of input data file names, one name per line. Each data file in the list either contains one floating point number per line or is in BED format (see the --input-bed option). For example, you may have experimental data over two regions, contained in exp1.reg1.bed and exp1.reg2.bed. Then you would create a file, exp1.list, with the lines
exp1.reg1.bed
exp1.reg2.bed
and exp1.list is the file list that goes in the command line.
Optionally, additional file lists may be provided. In this case, the HMM uses a multi-dimensional Gaussian, with one dimension from each file list. Hence, the various file lists must contain the same number of file names, and the corresponding data files from different lists must be of the same length.
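The consistency requirement on multiple file lists can be checked before running the program. The following is a minimal Python sketch, not part of HMMSeg: the function names are hypothetical, file names in each list are assumed to be resolvable from the current directory, and every non-empty line of a data file is assumed to hold one data value (so BED files with header lines would need extra handling).

```python
from pathlib import Path


def read_file_list(list_path):
    """Return the data file names in a file list, one name per line."""
    lines = Path(list_path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def count_values(data_path):
    """Count the non-empty lines in a data file (assumed one value per line)."""
    with open(data_path) as handle:
        return sum(1 for line in handle if line.strip())


def check_file_lists(list_paths):
    """Check that all file lists are mutually consistent.

    Raises ValueError if the lists name different numbers of files, or if
    corresponding data files differ in length. Returns the parsed lists.
    """
    lists = [read_file_list(path) for path in list_paths]
    expected = len(lists[0])
    for path, names in zip(list_paths, lists):
        if len(names) != expected:
            raise ValueError(f"{path} names {len(names)} files, expected {expected}")
    # zip(*lists) groups the i-th file of every list together.
    for group in zip(*lists):
        lengths = [count_values(name) for name in group]
        if len(set(lengths)) != 1:
            raise ValueError(f"length mismatch among {group}: {lengths}")
    return lists
```

For two lists, a call such as check_file_lists(["exp1.list", "exp2.list"]) would either return the parsed lists or raise an error naming the mismatched files (exp2.list is a hypothetical second file list).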
Output:
The program randomly initializes an n-state hidden Markov model with a single Gaussian at each state, and then trains the model on the given data using expectation-maximization (EM). The resulting model is printed to standard output. Here is an example.
Initial model parameters are selected randomly as follows. Gaussian means are selected uniformly at random from within the range of the given data ([min,max]). Variances are selected uniformly at random between 0 and max - min. Transition probabilities are initialized uniformly at random from the interval <0,1>, and then renormalized so that the outgoing probabilities from each state sum to 1.
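The initialization described above can be sketched in Python for the one-dimensional case. This is an illustration under stated assumptions, not the HMMSeg source: the function name and return layout are hypothetical, and it is assumed that the start state's outgoing transitions are drawn the same way as those of the other states.

```python
import random


def init_model(data, num_states, seed=None):
    """Randomly initialize means, variances, and transition probabilities."""
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    # Means: uniform over the range of the data, [min, max].
    means = [rng.uniform(lo, hi) for _ in range(num_states)]
    # Variances: uniform between 0 and max - min.
    variances = [rng.uniform(0.0, hi - lo) for _ in range(num_states)]
    # One row of outgoing probabilities per state, plus one row for the
    # start state (assumption). Each row is drawn uniformly at random and
    # then renormalized so that it sums to 1.
    transitions = []
    for _ in range(num_states + 1):
        row = [rng.random() for _ in range(num_states)]
        total = sum(row)
        transitions.append([p / total for p in row])
    return means, variances, transitions
```

Passing a fixed seed makes the sketch reproducible, which mirrors the purpose of the --set-seed option.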
The model file format is as follows. The first line begins with the keyword "HMMSEG," followed by the number of states (n) in the model, not counting the non-emitting start state. The rest of the file consists of n+1 blocks of 3 lines each. Within each block, the lines are:
- The keyword "State," followed by the integer index of this state. The indices appear in increasing order, with "Start" for the start state.
- The keyword "Emissions," followed by the mean and variance of the Gaussian at this state. The means increase as the state index increases. The start state has "Emissions none" on this line.
- The keyword "Transitions," followed by n transition probabilities out of the current state to each of the (non-start) states of the model. These transition probabilities must sum to 1.
The end of the file contains free-form text that includes the date the model file was created, the hmmseg command line that was used, the total probability of the data given the model, and other useful information.
If the user desires, it is possible to redirect standard output to a file and use that file as the model file for future runs.
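The model file layout described above can be read with a short parser. The following Python sketch makes several assumptions that the text does not confirm: fields are separated by white space, the keywords appear as HMMSEG, State, Emissions, and Transitions (a trailing comma is tolerated), blank lines can be ignored, and each emitting state has a single one-dimensional Gaussian. The function name and the returned dictionary layout are hypothetical.

```python
def parse_model(text):
    """Parse the text of a model file into a dictionary (sketch)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    if header[0].rstrip(",") != "HMMSEG":
        raise ValueError("not an HMMSeg model file")
    num_states = int(header[1])

    def fields(line, keyword):
        # Check the leading keyword and return the remaining tokens.
        tokens = line.split()
        if tokens[0].rstrip(",") != keyword:
            raise ValueError(f"expected {keyword!r}, found {tokens[0]!r}")
        return tokens[1:]

    states = []
    # n + 1 blocks of 3 lines each: the start state plus n emitting states.
    for block in range(num_states + 1):
        first = 1 + 3 * block
        index = fields(lines[first], "State")[0]
        emissions = fields(lines[first + 1], "Emissions")
        transitions = [float(p) for p in fields(lines[first + 2], "Transitions")]
        if len(transitions) != num_states:
            raise ValueError(f"state {index}: expected {num_states} transitions")
        # Loose tolerance, since printed probabilities may be rounded.
        if abs(sum(transitions) - 1.0) > 1e-3:
            raise ValueError(f"state {index}: transitions do not sum to 1")
        states.append({
            "index": index if index == "Start" else int(index),
            "emissions": None if emissions == ["none"]
            else (float(emissions[0]), float(emissions[1])),
            "transitions": transitions,
        })
    # Everything after the last block is free-form text.
    trailer = lines[1 + 3 * (num_states + 1):]
    return {"num_states": num_states, "states": states, "trailer": trailer}
```

A check of this kind may be useful before passing a hand-edited file to --model or --init-model, since it catches a wrong number of transition probabilities or rows that do not sum to 1.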
In addition to the model, the program produces a collection of output files, one per file in the <file list>. In the output files, each floating point value in the original input file is replaced by an integer state index. Each output file is named the same as the corresponding input file, with .vit or .total appended to the end, depending upon which parsing algorithm was employed. In the case of multi-dimensional inputs, the output file names are derived from the first file list. In the case of BED-format input, the output is a collection of wiggle files, one per input file, with .wig appended to each file name (see the documentation for --input-bed below). If --smooth or --smooth-only is specified, additional output files are generated. See those options below.
Options:
--num-states <integer>
Specify the number of states in the HMM, not including the non-emitting start and end states. The default value is 2.
--model <file>
Use the specified model, rather than learning the model parameters via EM. See the section on model file format for information on using HMMSeg output as the model file.
--output-dir <string>
Specify the directory in which output files are created. By default, output files are placed in the same directory as the input files.
--input-bed
The input files are in BED format, with the data to be segmented in the "score" column (by default, column 5; see also --column). In this case, the program will produce .wig files in wiggle format. The wiggle file contains multiple tracks, including the input data, the corresponding segmentation (with the state number as the score), and (for total probability) the associated total probabilities. In addition, the data in each BED file should be evenly spaced, and the program will print a warning message if uneven spacing is detected. However, if wavelet smoothing has been requested, even spacing becomes a hard requirement, and the program will terminate if uneven spacing is detected.
--smooth <window size>
Wavelet-smooth the data prior to segmentation. <window size> is the desired scale (in bp, for BED input) of the smoothing, and should be greater than the original scale of the input data. Larger values of <window size> result in a greater degree of smoothing. Smoothing is accomplished using the maximal overlap discrete wavelet transform (MODWT; see Percival and Walden, Wavelet Methods for Time Series Analysis, 2000). The actual scale of the smoothing is given by (2^J) * <original scale>, where the integer J is chosen to make this "dyadic scale" closest to the requested <window size>, and <original scale> is inferred from BED input or taken to be 1 for non-BED input. In addition, if --input-bed is specified, then each resulting wiggle file will contain a track with smoothed data in addition to the original data, and a BED file with each original column plus a column with smoothed data will be made for each input file; if not, a separate file, which contains the smoothed data, with .smooth appended to the original file name will be generated for each input file. If J is found to be zero or negative, then no smoothing will occur, and HMM segmentation will proceed as though --smooth were not specified.
--smooth-only <window size>
Same as --smooth, but do not perform HMM segmentation.
--output-list <file>
Specify the full names of the output files. This file must contain the same number of rows as the <file list>.
--init-model <file>
Use the given model to initialize EM training. See the section on model file format for information on using HMMSeg output as the model file.
--mean-range <min> <max>
Select the means of each state uniformly at random from within the specified range of values. Use this format for single Gaussian states. The next format should be used for multiple Gaussian states.
--mean-range "<min 1> <max 1> <...> <...> <min n> <max n>"
See the description of the previous option.
--variance-range <min> <max>
Select the variances of each state uniformly at random from within the specified range of values. Use this format for single Gaussian states. The next format should be used for multiple Gaussian states.
--variance-range "<min 1> <max 1> <...> <...> <min n> <max n>"
--maxiter <integer>
Terminate EM training after the specified number of iterations. The default value is 100.
--num-starts <integer>
Re-start the expectation-maximization training the specified number of times, from different random initializations of the model. Only report the segmentation from the model with the highest total likelihood. The default value is 1.
--parse <viterbi|totalprob|none|both>
Parse the training data using the specified algorithm. By default, the Viterbi algorithm is used. If the total probability is computed, then the output file contains two columns: the state index and the associated probability.
--column <integer>
Read the input data values from the given column number. Columns are separated by white space. The default value is column 5 for BED files and 1 otherwise.
--converge <float>
Terminate the expectation-maximization training when the difference in total probability is less than this value. The default value is 0.01.
--log <file>
Store log information in the given file. By default, status information is printed to standard error.
--verbosity [0|1|2|3|4|5]
Specify the verbosity of status information printed to standard error. The default value is 2.
--set-seed <integer>
Set the seed of hmmseg's random number generator. By default, the seed is the sum of a large, fixed number and a large number related to the system time.
Availability and Installation Instructions:
HMMSeg is available at http://noble.gs.washington.edu/proj/hmmseg
Warning messages: None
Bugs:
- The data is concatenated in the order that it appears in each file list and then treated as one observation sequence. In future releases, each data file will be treated as a separate observation sequence, which is the behavior generally expected of HMMSeg.
- Large datasets and/or a high degree of smoothing may cause Java "OutOfMemoryError"s. In that case, use the -Xmx Java command-line option to request a larger heap size, as in java -Xmx1024m ... to set the maximum heap size to 1 GB (see the documentation for your local Java implementation).