A Bayesian Approach to Motif-based Protein Modeling
William Noble Grundy
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science and Cognitive Science, June 1998.
Committee in charge: Professor Charles Elkan (Chair), Professor Richard Belew, Professor Garrison Cottrell, Professor Clark Glymour, Professor Terrence Sejnowski.
The increasing stream of data produced by the Human Genome Project and similar work on other species requires sophisticated computational analysis. This dissertation describes Meta-MEME, a software toolkit for modeling families of related proteins. Meta-MEME produces probabilistic models that provide insight into the structural and functional operation of proteins, and may be used to discover functional and evolutionary relationships among proteins. In addition, the dissertation introduces Family Pairwise Search, a heuristic homology detection algorithm based upon the linear combination of multiple pairwise sequence comparison scores.
Meta-MEME combines two existing technologies, motif discovery via expectation-maximization and hidden Markov models (HMMs), to build motif-based models of protein families. A motif is a subsequence that is conserved across all or most members of a protein family. Biologically, a motif corresponds to a region of the protein that is essential for the proper functioning or structural conformation of the protein. MEME is an unsupervised motif discovery tool that, given an unaligned set of related protein or DNA sequences, builds statistical models of one or more motifs. Meta-MEME combines these motif models within a hidden Markov model framework. A Meta-MEME model improves upon the collection of individual motif models by including information about the typical order and spacing of motifs within the family.
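To make the idea of combining motif models with ordering information concrete, the following is a minimal sketch and not Meta-MEME's implementation. It assumes each motif is a list of per-column residue probabilities, a uniform background frequency of 0.05 over the 20 amino acids, and a small pseudo-probability for unlisted residues. Spacers between motifs are left unpenalized, so the sketch captures motif order but not spacing. The names `motif_log_odds` and `best_ordered_score` are hypothetical.

```python
import math

BACKGROUND = 0.05   # assumed uniform background over 20 amino acids
PSEUDO = 1e-3       # assumed probability for residues a column does not list


def motif_log_odds(motif, window):
    """Log-odds score of one motif (list of residue->probability dicts)
    against an equally long window of the sequence."""
    return sum(
        math.log(column.get(residue, PSEUDO) / BACKGROUND)
        for column, residue in zip(motif, window)
    )


def best_ordered_score(motifs, seq):
    """Best total log-odds score over all placements of the motifs in the
    given order, without overlap. Spacer regions contribute nothing."""
    n = len(seq)
    # best[j]: best score for the motifs handled so far, placed within seq[:j]
    best = [0.0] * (n + 1)
    for motif in motifs:
        width = len(motif)
        new = [-math.inf] * (n + 1)
        for j in range(width, n + 1):
            # Either the motif ends exactly at position j, or it ended earlier.
            placed = best[j - width] + motif_log_odds(motif, seq[j - width:j])
            new[j] = max(new[j - 1], placed)
        best = new
    return best[n]


# Two toy motifs; the probabilities are made up for illustration.
motifs = [
    [{"A": 0.9}, {"C": 0.9}],
    [{"G": 0.9}, {"H": 0.9}],
]
in_order = best_ordered_score(motifs, "WACWWGHW")
shuffled = best_ordered_score(motifs, "WGHWWACW")
print(in_order > shuffled)
```

A sequence containing the motifs in the expected order scores higher than one containing them in the reverse order, which is information that a collection of independent motif models cannot express.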
Meta-MEME provides two important improvements over existing protein HMMs. First, because Meta-MEME's models focus on motif regions, they are much smaller than traditional protein HMMs. This decreased size makes the models more computationally efficient and allows the models to be trained from smaller data sets. Second, the generalized topology of Meta-MEME models implies a complex model of molecular evolution, allowing for the repetition or shuffling of motif-sized elements within a single protein sequence.
The models produced by Meta-MEME provide biologists with insight into the characteristics of the given family of related proteins. Furthermore, the models may be used to search protein databases for previously unidentified homologs and to generate multiple alignments of the motif regions of the proteins. Family Pairwise Search, although lacking an explicit model and accurate statistics, is much more efficient than Meta-MEME and provides better homology detection performance.
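The linear combination of pairwise scores that underlies Family Pairwise Search can be illustrated with a short sketch. It rests on assumptions: the abstract specifies neither the pairwise comparison algorithm nor the weights, so a toy shared-k-mer count stands in for the pairwise score and equal weights (a simple average) are used by default. The names `shared_kmer_score`, `family_score` and `rank_database` are hypothetical.

```python
def shared_kmer_score(a, b, k=3):
    """Toy stand-in for a pairwise comparison score: the number of
    distinct k-mers the two sequences share (an assumption, not the
    scoring method used in the dissertation)."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return float(len(kmers_a & kmers_b))


def family_score(candidate, family, score=shared_kmer_score, weights=None):
    """Linear combination of the candidate's pairwise scores against
    every known family member. Equal weights give the mean score."""
    if not family:
        raise ValueError("family must contain at least one sequence")
    if weights is None:
        weights = [1.0 / len(family)] * len(family)
    if len(weights) != len(family):
        raise ValueError("need exactly one weight per family member")
    return sum(w * score(candidate, member) for w, member in zip(weights, family))


def rank_database(database, family):
    """Order database sequences from most to least family-like."""
    return sorted(database, key=lambda s: family_score(s, family), reverse=True)


# Made-up sequences for illustration only.
family = ["ACDEFGHIK", "ACDEFGHLK"]
database = ["WWWWWWW", "ACDEFGH"]
ranking = rank_database(database, family)
print(ranking[0])
```

No model is trained in this scheme: the family is represented directly by its member sequences, which is consistent with the description of Family Pairwise Search as lacking an explicit model.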