Example HMMSeg analysis
Here we present step-by-step instructions for computing the
simultaneous 2-state segmentation of four ENCODE data-types,
illustrated in this figure (postscript, png) appearing as
Fig. 2 in the manuscript referenced above.
Raw data
The datasets are comprised of DNA replication timing, as measured by
the so-called TR50 curve (University of Virginia), RNA transcription
levels (Affymetrix), and abundance of two histone modifications, the
activating mark H3ac (Sanger Institute) and the repressive mark
H3K27me3 (UCSD). All four of these datasets were measured in HeLa
cells. The raw datasets are available from the UCSC Genome Browser, where they
appear in the human genome browser under links "UVa DNA Rep TR50",
"Affy RNA Signal", "Sanger ChIP", and "LI ChIP Various",
respectively. We simulataneously processed data from 43 ENCODE
regions (data for one ENCODE region, ENm011, are not avaible for the
TR50 experiment).
Data preprocessing
Wavelet and HMM processing requires that data be equally-spaced.
Moreover, for multiple datasets, all data should be defined on the
same set of ordinates (genomic positions).
Segmentation using HMMSeg
We perform two-state segmentation of the four 1000bp datasets at the
64kb scale as follows. Supposing we want to process the data from all
43 ENCODE regions, we create four separate file lists. For TR50 we
create a file tr50.list containing the 43 lines
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm001.50bp.score.bed
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm002.50bp.score.bed
.
.
.
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENr334.50bp.score.bed
And similarly for rna.list, h3ac.list,
and h3k27me3.list. (For test purposes, you might want to
process just a single ENCODE region, in which case your .list
files would each contain a single line.) The command line to segment
the data is then
java -jar HMMSeg.jar --num-states 2 --input-bed --smooth 64000 --nstarts 10 --log log.txt \
tr50.list rna.list h3ac.list h3k27me3.list