Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines

Paul Pavlidis and William Noble Grundy, Columbia Genome Center and Department of Computer Science, Columbia University


A primary goal in biology is to understand the molecular machinery of the cell. The sequencing projects currently underway provide one view of this machinery. A complementary view is provided by data from DNA microarray hybridization experiments. Synthesizing the information from these disparate types of data requires the development of improved computational techniques. We demonstrate how to apply a machine learning algorithm called support vector machines to a heterogeneous data set consisting of expression data as well as phylogenetic profiles derived from sequence similarity searches against a collection of complete genomes. The two types of data provide accurate pictures of overlapping subsets of the gene functional categories present in the cell. Combining the expression data and phylogenetic profiles within a single learning algorithm frequently yields superior classification performance compared to using either data set alone. However, the improvement is not uniform across functional classes. For the data sets investigated here, 24-element phylogenetic profiles typically provide more information than 79-element expression vectors. Often, adding expression data to the phylogenetic profiles introduces more noise than information. Thus, these two types of data should only be combined when there is evidence that the functional classification of interest is clearly reflected in both data sets.
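
As a rough illustration of what combining the two data types could look like, the sketch below concatenates a 79-element expression vector with a 24-element phylogenetic profile for each gene. The abstract does not say how the data sets are joined, so concatenation is an assumption here, and the function names and random values are hypothetical stand-ins for the real data. The sketch also checks one well-known property: with a linear kernel, the kernel on concatenated vectors equals the sum of the kernels computed on each data type separately.

```python
import random

N_EXPR = 79   # expression vector length (from the abstract)
N_PHYLO = 24  # phylogenetic profile length (from the abstract)


def dot(u, v):
    """Linear kernel: plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def combine_features(expression, profile):
    """Concatenate the two vectors for one gene (assumed combination scheme)."""
    if len(expression) != N_EXPR or len(profile) != N_PHYLO:
        raise ValueError("unexpected vector length")
    return list(expression) + list(profile)


def summed_kernel(gene_a, gene_b):
    """Sum of the per-data-type linear kernels for two genes."""
    (expr_a, prof_a), (expr_b, prof_b) = gene_a, gene_b
    return dot(expr_a, expr_b) + dot(prof_a, prof_b)


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def fake_gene():
    """Synthetic placeholder data, NOT the yeast data sets."""
    expression = [rng.gauss(0.0, 1.0) for _ in range(N_EXPR)]
    profile = [rng.random() for _ in range(N_PHYLO)]
    return expression, profile


gene_a, gene_b = fake_gene(), fake_gene()
combined_a = combine_features(*gene_a)
combined_b = combine_features(*gene_b)

k_concat = dot(combined_a, combined_b)  # kernel on concatenated vectors
k_sum = summed_kernel(gene_a, gene_b)   # sum of separate kernels

print(len(combined_a))
print(abs(k_concat - k_sum) < 1e-9)
```

Under this scheme each gene is represented by 103 features, and either form of the kernel could be handed to an SVM implementation that accepts precomputed kernel values. Keeping the kernels separate makes it easy to drop one data type for functional classes where it adds more noise than information.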

Technical report

A technical report describing some of the work is available here in PDF format.

ISMB 2000 poster presentation

The contents of the poster are available here. The poster includes a great deal of data not described in the technical report, though it presents that data in little detail.


The SVM software is available for free download.


Download the yeast phylogenetic profiles. The expression data set is available from Stanford.