Example HMMSeg analysis

Here we present step-by-step instructions for computing the simultaneous 2-state segmentation of four ENCODE data-types, illustrated in this figure (postscript, png) appearing as Fig. 2 in the manuscript referenced above.

Raw data

The datasets are comprised of DNA replication timing, as measured by the so-called TR50 curve (University of Virginia), RNA transcription levels (Affymetrix), and abundance of two histone modifications, the activating mark H3ac (Sanger Institute) and the repressive mark H3K27me3 (UCSD). All four of these datasets were measured in HeLa cells. The raw datasets are available from the UCSC Genome Browser, where they appear in the human genome browser under links "UVa DNA Rep TR50", "Affy RNA Signal", "Sanger ChIP", and "LI ChIP Various", respectively. We simulataneously processed data from 43 ENCODE regions (data for one ENCODE region, ENm011, are not avaible for the TR50 experiment).

Data preprocessing

Wavelet and HMM processing requires that data be equally-spaced. Moreover, for multiple datasets, all data should be defined on the same set of ordinates (genomic positions).

Segmentation using HMMSeg

We perform two-state segmentation of the four 1000bp datasets at the 64kb scale as follows. Supposing we want to process the data from all 43 ENCODE regions, we create four separate file lists. For TR50 we create a file tr50.list containing the 43 lines
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm001.50bp.score.bed
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm002.50bp.score.bed
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENr334.50bp.score.bed
And similarly for rna.list, h3ac.list, and h3k27me3.list. (For test purposes, you might want to process just a single ENCODE region, in which case your .list files would each contain a single line.) The command line to segment the data is then
java -jar HMMSeg.jar --num-states 2 --input-bed --smooth 64000 --nstarts 10 --log log.txt \
  tr50.list rna.list h3ac.list h3k27me3.list