Classification of genes using probabilistic models of microarray expression profiles

Paul Pavlidis, Christopher Tang and William Stafford Noble

Proceedings of BIOKDD 2001: Workshop on Data Mining in Bioinformatics.


Microarray expression data provide a new means of classifying genes and gene products according to their expression profiles. Numerous unsupervised and supervised learning methods have been applied to the task of discovering and learning to recognize classes of co-expressed genes. Here we present a supervised learning method based upon techniques borrowed from biological sequence analysis. The expression profile of a class of co-expressed genes is summarized in a probabilistic model similar to a position-specific scoring matrix (PSSM). This model provides insight into the expression characteristics of the gene class and yields accurate recognition performance. Because the PSSM models are generative, they are particularly useful when a biologist can identify a priori a class of co-expressed genes but is unable to identify a large collection of genes that are not co-expressed to serve as a negative training set. We validate the technique using expression data from S. cerevisiae and C. elegans.
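
To make the PSSM analogy concrete, the following is a minimal sketch of one way such a model could be built. It is not the implementation from the paper. It assumes that expression values are discretized into a few bins, that each experiment (column) gets its own distribution over bins estimated from the positive examples only, and that a gene is scored by a log-odds ratio against a background distribution. The bin edges, the pseudocount, the uniform background, the toy data, and all function names are hypothetical choices made for illustration.

```python
import math

def discretize(value, edges):
    """Return the index of the bin containing `value` (edges are sorted)."""
    return sum(value >= e for e in edges)

def train_pssm(profiles, edges, pseudocount=1.0):
    """Estimate one bin distribution per experiment from positive examples only."""
    n_bins = len(edges) + 1
    model = []
    for j in range(len(profiles[0])):
        counts = [pseudocount] * n_bins  # pseudocounts avoid zero probabilities
        for profile in profiles:
            counts[discretize(profile[j], edges)] += 1
        total = sum(counts)
        model.append([c / total for c in counts])
    return model

def log_odds_score(profile, model, edges, background):
    """Sum over experiments of log(P(bin | class) / P(bin | background))."""
    score = 0.0
    for j, value in enumerate(profile):
        b = discretize(value, edges)
        score += math.log(model[j][b] / background[b])
    return score

# Hypothetical toy data: three experiments, three bins (down, unchanged, up).
edges = [-0.5, 0.5]
positives = [[1.2, -1.0, 0.1], [0.9, -0.8, 0.0], [1.5, -1.3, -0.2]]
background = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform background

model = train_pssm(positives, edges)
score_member = log_odds_score([1.0, -1.0, 0.0], model, edges, background)
score_other = log_odds_score([-1.0, 1.0, 1.0], model, edges, background)
```

In this sketch no negative training set is needed: only the positive profiles enter `train_pssm`, and a profile resembling the class receives a positive score while a dissimilar one receives a negative score.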