A dynamic Bayesian network for identifying protein-binding footprints from single molecule-based sequencing data
Xiaoyu Chen, Michael M. Hoffman, Jeff A. Bilmes, Jay R. Hesselberth and William S. Noble
Bioinformatics (Proceedings of the ISMB). 26(12):i334-i342, 2010.
Abstract
Motivation: A global map of transcription factor binding sites (TFBSs) is critical to understanding gene regulation and genome function. DNaseI digestion of chromatin coupled with massively parallel sequencing (digital genomic footprinting) enables the identification of protein-binding footprints with high resolution on a genome-wide scale. However, accurately inferring the locations of these footprints remains a challenging computational problem.
Results: We present a dynamic Bayesian network-based approach for the identification and assignment of statistical confidence estimates to protein-binding footprints from digital genomic footprinting data. The method, DBFP, allows footprints to be identified in a probabilistic framework and outperforms our previously described algorithm in terms of precision at a fixed recall. Applied to a digital footprinting data set from Saccharomyces cerevisiae, DBFP identifies 4679 statistically significant footprints within intergenic regions. These footprints are mainly located near transcription start sites and are strongly enriched for known TFBSs. Footprints containing no known motif are preferentially located proximal to other footprints, consistent with cooperative binding of these footprints. DBFP also identifies a set of statistically significant footprints in the yeast coding regions. Many of these footprints coincide with the boundaries of antisense transcripts, and the most significant footprints are enriched for binding sites of the chromatin-associated factors Abf1 and Rap1.
Source code
- Source code of DBFP in a gzipped tar file. Please refer to the README file in the package for details and help information.
- Scripts to calculate a binomial score and assign a q-value to each footprint segment, also in gzipped tar format. Before using these scripts, please note:
  - Calculating q-values requires a list of null scores as input, which you must generate yourself. The method we used to generate our null scores is described in the paper.
  - This code is less polished than the DBFP package above. In particular, you will need to replace the filenames and directories in the scripts with those on your own system.
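To illustrate why the list of null scores is needed, here is a minimal sketch of one common way to turn observed scores and null scores into q-values. It is not taken from the scripts in the package and may differ from them. The function name `empirical_qvalues` is hypothetical, and the sketch assumes that larger scores are more significant and that the proportion of true nulls is 1.

```python
from bisect import bisect_left


def empirical_qvalues(observed, null):
    """Estimate a q-value for each observed score from a list of null scores.

    Assumptions (not taken from the DBFP scripts): larger scores are more
    significant, and the proportion of true nulls is conservatively set to 1.
    """
    if not null:
        raise ValueError("a non-empty list of null scores is required")

    null_sorted = sorted(null)
    obs_sorted = sorted(observed)
    n_null, n_obs = len(null_sorted), len(obs_sorted)

    qvalues = {}
    running_min = 1.0
    # Walk thresholds from least to most significant so that each q-value is
    # the smallest estimated FDR at any threshold that still admits the score.
    for t in obs_sorted:
        if t in qvalues:
            continue
        null_ge = n_null - bisect_left(null_sorted, t)  # null scores >= t
        obs_ge = n_obs - bisect_left(obs_sorted, t)     # observed scores >= t
        # Estimated FDR: expected number of null calls / number of calls made.
        fdr = min(1.0, (null_ge / n_null) * n_obs / obs_ge)
        running_min = min(running_min, fdr)
        qvalues[t] = running_min

    # Return q-values in the same order as the input scores.
    return [qvalues[s] for s in observed]


if __name__ == "__main__":
    scores = [5.0, 1.0, 3.0]
    null_scores = [0.5, 1.5, 2.0, 4.0]
    print(empirical_qvalues(scores, null_scores))  # [0.0, 0.75, 0.375]
```

In this toy example, the highest score exceeds every null score and receives a q-value of 0, while the lowest score is exceeded by three of the four null scores and receives 0.75.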
Data files
- Tag counts
- Unmappable bases
- Intergenic regions
- MacIsaac motif hits (3180 TF binding sites)
- MacIsaac binding sites (1992 TF binding sites)
- Footprints identified in intergenic regions
- Footprints identified in coding regions