Assessing phylogenetic motif models for predicting transcription factor binding sites

John Hawkins, Charles E. Grant, William Stafford Noble and Timothy L. Bailey

Bioinformatics (Proceedings of the ISMB). 25(12):i339-347, 2009.

Abstract

Motivation: A variety of algorithms have been developed to predict transcription factor binding sites within the genome by exploiting the evolutionary information implicit in multiple alignments of the genomes of related species. One such approach uses an extension of the standard position-specific motif model that incorporates phylogenetic information via a phylogenetic tree and a model of evolution. However, these phylogenetic motif models have never been rigorously benchmarked to determine whether they predict transcription factor binding sites better than simple position weight matrix scanning does.

Results: We evaluate three phylogenetic motif model-based prediction algorithms, each of which uses a different treatment of gapped alignments, and we compare their prediction accuracy with that of a non-phylogenetic motif scanning approach. Surprisingly, all of these algorithms appear to be inferior to simple motif scanning when accuracy is measured using a gold standard of validated yeast transcription factor binding sites. However, the phylogenetic motif model scanners perform much better than simple motif scanning when we abandon the gold standard and consider the number of statistically significant sites predicted, using column-shuffled "random" motifs to measure significance. These results suggest that the common practice of measuring the accuracy of binding site predictors using collections of known sites may be dangerously misleading, since such collections may be missing "weak" sites, which are exactly the type of sites needed to discriminate among predictors. We then introduce a novel theoretical model of binding site evolution that includes loss-of-site events. This model allows us to estimate the total number of binding sites for a transcription factor in the genome. The results suggest that the number of true sites for a yeast transcription factor is in general several times greater than the number of known sites listed in the Saccharomyces cerevisiae Promoter Database (SCPD). Among the three phylogenetic scanning algorithms that we test, the MONKEY algorithm has the highest accuracy for predicting yeast transcription factor binding sites.
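To make the baseline concrete, the following is a minimal sketch of simple position weight matrix scanning together with a column-shuffled control motif. It is not the authors' implementation and does not model phylogeny. The toy motif, the uniform background, the score threshold and all function names are assumptions chosen for illustration only.

```python
import math
import random

BASES = "ACGT"

def log_odds_pwm(freqs, background=0.25):
    """Convert per-column base frequencies to log2-odds scores
    against a uniform background (an assumption of this sketch)."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]

def scan(pwm, seq):
    """Score every window of the sequence; return (offset, score) pairs."""
    width = len(pwm)
    return [
        (i, sum(pwm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]

def shuffle_columns(pwm, rng):
    """Permute whole motif columns to build a 'random' control motif.
    Each column is kept intact, so base composition is preserved."""
    cols = list(pwm)
    rng.shuffle(cols)
    return cols

def count_hits(pwm, seq, threshold):
    """Count windows whose score reaches the threshold."""
    return sum(1 for _, score in scan(pwm, seq) if score >= threshold)

# Hypothetical 4-column motif with consensus TGAC (0.85 for the
# consensus base, 0.05 for each other base, so no log of zero occurs).
freqs = [
    {b: (0.85 if b == consensus else 0.05) for b in BASES}
    for consensus in "TGAC"
]
pwm = log_odds_pwm(freqs)
seq = "ACTGACGTTGACA"   # toy sequence, not real yeast data
threshold = 5.0         # arbitrary cutoff for this example

hits = count_hits(pwm, seq, threshold)
shuffled = shuffle_columns(pwm, random.Random(0))
control_hits = count_hits(shuffled, seq, threshold)
print("motif hits:", hits, "control hits:", control_hits)
```

In practice, the real motif's hit count would be compared with counts from many shuffled motifs across a genome-scale sequence set to judge significance; a single shuffle on a short string, as here, only shows the mechanics.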
