Classifying proteins by family using the product of correlated p-values

Timothy L. Bailey and William N. Grundy

Proceedings of the Third International Conference on Computational Molecular Biology (RECOMB99), April 11-14, 1999. pp. 10-14.


An important goal in bioinformatics is determining the homology and function of proteins from their sequences. Pairwise sequence similarity algorithms are often employed for this purpose. This paper describes a method for improving the accuracy of such algorithms using knowledge about families of proteins. The method requires a library of protein families against which to compare query sequences. A standard pairwise similarity search algorithm is used to search the database with the query, and a new variant of the Family Pairwise Search (FPS) algorithm converts the results into a list sorted by the E-values of the matches between the query and the families. The E-value of each query-family match is calculated using a statistical distribution introduced here that describes the behavior of the product of the p-values of correlated random variables. We also describe an algorithm (ESIZE) for estimating the single parameter of this distribution. This parameter summarizes the amount of correlation among the p-values being multiplied, which corresponds, in this application, to the divergence among the sequences in a family. We show empirically that the E-values reported by this variant of FPS are accurate and that the method classifies significantly more accurately than using the best pairwise p-value as the query-family match score. The new algorithm is closely related to an earlier version of FPS that combines similarity scores by averaging "bit scores." That earlier version has been shown to have superior classification performance compared with several model-based methods (motifs, HMMs), but it lacks E-values and their concomitant advantages.
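
To make the idea of scoring a query-family match by a product of p-values concrete, the sketch below computes the significance of such a product in the simpler case where the p-values are independent and uniformly distributed. This is the standard closed form for independent p-values, not the correlated distribution introduced in the paper, and it does not implement ESIZE. The function names and the conversion of a p-value to an E-value by multiplying by the number of families searched are assumptions made for illustration only.

```python
import math


def product_pvalue_independent(pvalues):
    """Return P(product of n independent Uniform(0,1) values <= observed product).

    Uses the closed form  p * sum_{i=0}^{n-1} (-ln p)^i / i!,
    where p is the observed product of the n p-values.
    Assumes independence, which the paper's distribution does NOT assume.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    # Work with logarithms so that very small products do not underflow early.
    log_product = sum(math.log(p) for p in pvalues)
    x = -log_product

    # Accumulate sum_{i=0}^{n-1} x^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term

    return math.exp(log_product) * total


def family_evalue(pvalues, num_families):
    """Hypothetical helper: scale the combined p-value by the number of families searched."""
    return num_families * product_pvalue_independent(pvalues)
```

With a single p-value the function returns that p-value unchanged. With several, the combined value is larger than the raw product, which reflects that a small product is easier to obtain by chance when more terms are multiplied. Correlation among the p-values, as among similar sequences in one family, changes the appropriate correction, and that is the case the paper's distribution addresses.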