Example HMMSeg analysis

Here we present step-by-step instructions for computing the simultaneous 2-state segmentation of four ENCODE data-types, illustrated in this figure (postscript, png) appearing as Fig. 2 in the manuscript referenced above.

Raw data

The datasets comprise DNA replication timing, as measured by the so-called TR50 curve (University of Virginia); RNA transcription levels (Affymetrix); and the abundance of two histone modifications, the activating mark H3ac (Sanger Institute) and the repressive mark H3K27me3 (UCSD). All four datasets were measured in HeLa cells. The raw datasets are available from the UCSC Genome Browser, where they appear in the human genome browser under the links "UVa DNA Rep TR50", "Affy RNA Signal", "Sanger ChIP", and "LI ChIP Various", respectively. We simultaneously processed data from 43 ENCODE regions (data for one ENCODE region, ENm011, are not available for the TR50 experiment).

Data preprocessing

Wavelet and HMM processing both require equally spaced data. Moreover, when several datasets are processed together, all of them must be defined on the same set of ordinates (genomic positions).
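
The two requirements above can be checked before running anything else. The following Python sketch is for illustration only: the function names are hypothetical and are not part of HMMSeg, and positions are assumed to be integer genomic coordinates that have already been read from the data files.

```python
def spacing_of(positions):
    """Return the common spacing if positions are equally spaced, else None."""
    if len(positions) < 2:
        return None
    step = positions[1] - positions[0]
    if step <= 0:
        return None
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev != step:
            return None
    return step


def share_ordinates(*datasets):
    """Return True if every dataset is defined on exactly the same positions."""
    first = list(datasets[0])
    return all(list(d) == first for d in datasets[1:])


# Toy positions (assumed values, not taken from the ENCODE data).
tr50 = [1000, 2000, 3000, 4000]
h3ac = [1000, 2000, 3000, 4000]
rna = [1000, 1050, 1100, 1250]  # unequal spacing: needs interpolation first

print(spacing_of(tr50))             # 1000
print(spacing_of(rna))              # None
print(share_ordinates(tr50, h3ac))  # True
```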

For each dataset we choose a nominal spacing consistent with the distance between measurements in the raw data: TR50, 50bp; RNA, 50bp; H3ac, 1000bp; H3K27me3, 500bp. The raw data are generally unequally spaced, so they are interpolated at equally spaced coordinates in one of two ways, depending on the size of the gap to be interpolated over. For gaps less than 2000bp, data are linearly interpolated using the two immediately flanking data points. For gaps larger than 2000bp, we use an adaptive loess fitting strategy: a linear loess fit is computed at the point of interpolation, using all points in a window of width 50 times the gap to be filled. The R function loess is used for this purpose, with default weights. The value of the loess fit at the point of interpolation is taken as the interpolated value there. R routines for doing this interpolation can be found here (bed.R, interp.R); see, in particular, the function UCSC2bed in bed.R. An example R script calling these R routines can be found here. The resulting datasets, in BED format, are located here:

50bp TR50 data
50bp RNA data (files starting with "AffyRnaSignal.HeLa")
1000bp H3ac data (files starting with "SangerChip.H3ac.HeLa")
500bp H3K27me3 data (files starting with "UcsdChip.H3K27me3")
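
The gap-dependent interpolation rule described above can be sketched in Python as follows. This is not the code in bed.R or interp.R. The local fit here is a hand-written tricube-weighted linear regression that stands in for R's loess and is not guaranteed to reproduce its output. Several details are assumptions made for the sketch: the window is centered on the interpolation point, a gap of exactly 2000bp is treated as a large gap, and the output grid consists of multiples of the nominal spacing. All function and constant names are hypothetical.

```python
import bisect

GAP_THRESHOLD = 2000  # bp; gaps below this are linearly interpolated
WINDOW_FACTOR = 50    # local-fit window width = 50 x gap (from the text)


def local_linear_fit(xs, ys, x0, half_width):
    """Tricube-weighted least-squares line near x0, evaluated at x0.

    A stand-in for a degree-1 loess fit, not R's loess itself.
    """
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in zip(xs, ys):
        u = abs(x - x0) / half_width
        if u >= 1.0:
            continue  # point lies outside the window
        w = (1.0 - u ** 3) ** 3
        dx = x - x0
        sw += w
        swx += w * dx
        swy += w * y
        swxx += w * dx * dx
        swxy += w * dx * y
    if sw == 0.0:
        raise ValueError("no data points inside the fitting window")
    denom = sw * swxx - swx * swx
    if denom <= 1e-12 * max(sw * swxx, 1.0):
        return swy / sw  # degenerate case: fall back to a weighted mean
    # Intercept of the weighted line in dx = x - x0, i.e. its value at x0.
    return (swxx * swy - swx * swxy) / denom


def interpolate(xs, ys, x0):
    """Interpolate sorted raw data (xs, ys) at position x0."""
    i = bisect.bisect_left(xs, x0)
    if i < len(xs) and xs[i] == x0:
        return ys[i]  # a raw measurement exists at x0
    if i == 0 or i == len(xs):
        raise ValueError("x0 lies outside the range of the raw data")
    left, right = xs[i - 1], xs[i]
    gap = right - left
    if gap < GAP_THRESHOLD:
        t = (x0 - left) / gap
        return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return local_linear_fit(xs, ys, x0, WINDOW_FACTOR * gap / 2.0)


def interpolate_on_grid(xs, ys, spacing):
    """Interpolate at every multiple of spacing within the raw data range."""
    first = -(-xs[0] // spacing) * spacing  # round up to a grid point
    return [(x, interpolate(xs, ys, x)) for x in range(first, xs[-1] + 1, spacing)]


# Toy data (assumed values) with one 3000bp gap between 200 and 3200.
raw_x = [0, 60, 130, 200, 3200, 3260]
raw_y = [1.0, 1.2, 1.5, 1.4, 2.0, 2.1]
for x, y in interpolate_on_grid(raw_x, raw_y, 500):
    print(x, round(y, 3))
```

Because the window scales with the gap, a wider gap draws on more distant points. This is one reading of what the text calls an adaptive strategy.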

Since H3ac is the coarsest of the four datasets, we choose to interpolate the other datasets at its ordinates. First, however, we smooth the fine-scale RNA data out to a scale close to 1000bp using MODWT wavelet smoothing (the la8 wavelet). The closest dyadic multiple of 50bp to 1000bp is 800bp. The wavelet smooths can therefore be computed without segmentation using HMMSeg as follows,
```
java -jar HMMSeg.jar --input-bed --smooth-only 800 [file-list]
```
where file-list contains a list of all 50bp RNA data files. Note that while the TR50 data are available at every 50bp, their effective resolution is much coarser, due to the loess smoothing used in calculating that curve. Therefore, no wavelet smoothing of TR50 is required. The results of interpolating the three other datasets (including the wavelet-smoothed RNA data) at the 1000bp coordinates of H3ac are available here.
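
As a quick check of the 800bp choice above, the following Python sketch finds the dyadic multiple (the spacing times a power of two) closest to a target scale. The function name is hypothetical and the code is not part of HMMSeg; it only reproduces the arithmetic.

```python
def closest_dyadic_scale(spacing, target):
    """Return (scale, j), where scale = spacing * 2**j is closest to target."""
    j = 0
    while spacing * 2 ** (j + 1) <= target:
        j += 1
    lower = spacing * 2 ** j
    upper = spacing * 2 ** (j + 1)
    if target - lower <= upper - target:
        return lower, j
    return upper, j + 1


print(closest_dyadic_scale(50, 1000))  # (800, 4)
```

The two candidates bracketing 1000bp are 800bp and 1600bp, and 800bp is the nearer of the two.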

Segmentation using HMMSeg

We perform two-state segmentation of the four 1000bp datasets at the 64kb scale as follows. Supposing we want to process the data from all 43 ENCODE regions, we create four separate file lists. For TR50 we create a file tr50.list containing the 43 lines

/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm001.50bp.score.bed
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENm002.50bp.score.bed
.
.
.
/home/..[path to data]../UvaDnaRepTr50.HeLa.hg17.ENr334.50bp.score.bed

And similarly for rna.list, h3ac.list, and h3k27me3.list. (For test purposes, you might want to process just a single ENCODE region, in which case your .list files would each contain a single line.) The command line to segment the data is then

```
java -jar HMMSeg.jar --num-states 2 --input-bed --smooth 64000 --nstarts 10 --log log.txt \
  tr50.list rna.list h3ac.list h3k27me3.list
```
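
The four .list files described above could be written by hand or generated with a short script. The Python sketch below is one way to do it, under stated assumptions: the data directory is a placeholder that you supply, the files for each data type share the name prefixes quoted earlier on this page, and corresponding lines of the four lists are meant to refer to the same ENCODE region. The function names are hypothetical and are not part of HMMSeg.

```python
import os

# File-name prefixes taken from the file names quoted in the text.
PREFIXES = {
    "tr50.list": "UvaDnaRepTr50.HeLa",
    "rna.list": "AffyRnaSignal.HeLa",
    "h3ac.list": "SangerChip.H3ac.HeLa",
    "h3k27me3.list": "UcsdChip.H3K27me3",
}


def write_file_list(data_dir, prefix, list_path):
    """Write the sorted paths of .bed files in data_dir that start with prefix."""
    names = sorted(
        name for name in os.listdir(data_dir)
        if name.startswith(prefix) and name.endswith(".bed")
    )
    with open(list_path, "w") as out:
        for name in names:
            out.write(os.path.join(os.path.abspath(data_dir), name) + "\n")
    return len(names)


def write_all_lists(data_dir, out_dir="."):
    """Write all four lists and check that they have the same number of lines."""
    counts = {
        list_name: write_file_list(data_dir, prefix, os.path.join(out_dir, list_name))
        for list_name, prefix in PREFIXES.items()
    }
    # Assumption: every list should cover the same regions, one file per line.
    if len(set(counts.values())) != 1:
        raise ValueError("file lists differ in length: %r" % counts)
    return counts
```

Sorting by file name keeps the regions in the same order in every list. The length check would flag, for example, an RNA list that still includes ENm011, for which no TR50 data are available.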