Efficient identification of DNA binding partners in a sequence database
Tobias Mann and William Stafford Noble
Bioinformatics (Proceedings of the Intelligent Systems for Molecular Biology Conference). 22:e350-e358, 2006.
The specific hybridization of complementary DNA molecules underlies many widely used molecular biology assays, including the polymerase chain reaction and various types of microarray analysis. For such an assay to work well, the primer or probe must bind to its intended target without also binding to other sequences in the reaction mixture. For any given probe or primer, potential non-specific binding partners can be identified using state-of-the-art models of DNA binding stability. Unfortunately, these models rely on computationally complex dynamic programming algorithms that are too slow to apply on a genomic scale.
We present an algorithm that efficiently scans a DNA database for short (approximately 20-30 base) sequences that will bind to a query sequence. We use a filtering approach, in which a series of increasingly stringent filters is applied to a set of candidate k-mers. The k-mers that pass all filters are then located in the sequence database using a precomputed index, and an accurate model of DNA binding stability is applied to the sequence surrounding each k-mer occurrence. This approach reduces the time to identify all binding partners for a given DNA sequence in human genomic DNA by approximately three orders of magnitude, from two days for the ENCODE regions to less than one minute for typical queries. The approach also scales to large DNA sequences: our method can scan the human genome for medium-strength binding sites to a candidate PCR primer in an average of 34.5 minutes.
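The filter-then-verify pipeline described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the function names (`build_kmer_index`, `find_binding_sites`, `toy_stability`), the GC-count filter, and the match-counting score are all hypothetical stand-ins, and the candidate set is simplified to the exact k-mers of the query's reverse complement. A real system would substitute a thermodynamic model of binding stability for `toy_stability`.

```python
from collections import defaultdict

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the strand that pairs perfectly with seq."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_kmer_index(database, k):
    """Precompute a map from each k-mer to its start positions in the database."""
    index = defaultdict(list)
    for i in range(len(database) - k + 1):
        index[database[i:i + k]].append(i)
    return index


def toy_stability(query, site):
    """Placeholder score (an assumption, not a real binding model):
    the number of positions at which site pairs with query."""
    target = reverse_complement(query)
    return sum(a == b for a, b in zip(target, site))


def find_binding_sites(query, database, index, k, filters, min_score):
    """Filter candidate k-mers, look up survivors in the index, then score
    the full-length window around each occurrence."""
    target = reverse_complement(query)
    hits = set()
    for offset in range(len(target) - k + 1):
        kmer = target[offset:offset + k]
        # Apply the filters in order; all() stops at the first failure.
        if not all(passes(kmer) for passes in filters):
            continue
        for pos in index.get(kmer, ()):
            start = pos - offset  # align the window with the whole query
            if start < 0 or start + len(query) > len(database):
                continue
            site = database[start:start + len(query)]
            score = toy_stability(query, site)
            if score >= min_score:
                hits.add((start, score))
    return sorted(hits)


# Hypothetical cheap filter: require at least two G/C bases in the k-mer.
def gc_filter(kmer):
    return sum(base in "GC" for base in kmer) >= 2


query = "AACCGGTA"
database = "GGG" + reverse_complement(query) + "AAAA"
index = build_kmer_index(database, k=4)
sites = find_binding_sites(query, database, index, k=4,
                           filters=[gc_filter], min_score=7)
```

In this sketch the expensive scoring step runs only on windows that contain an indexed k-mer surviving the filters, which is the source of the speedup the approach relies on; the index is built once and can be reused across queries.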