Kernel-based data fusion and its application to protein function prediction in yeast

Gert R. G. Lanckriet, Minghua Deng, Nello Cristianini, Michael I. Jordan and William Stafford Noble

Proceedings of the Pacific Symposium on Biocomputing, January 3-8, 2004. pp. 300-311.

Abstract

Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously described Markov random field method, and better than the SVM trained on any single type of data.
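
As a rough illustration of what combining kernel representations means, the sketch below forms a weighted sum of kernel matrices. It assumes hand-picked, non-negative weights; the method described in the abstract instead learns the combination by semidefinite programming, which this sketch does not attempt. The function name `combine_kernels`, the weights and the toy data are all hypothetical.

```python
import numpy as np


def combine_kernels(kernels, weights):
    """Return sum_i weights[i] * kernels[i] for same-shaped square kernel matrices."""
    weights = np.asarray(weights, dtype=float)
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if np.any(weights < 0):
        # Assumption of this sketch: non-negative weights, so that a sum of
        # positive semidefinite matrices is again positive semidefinite.
        raise ValueError("weights must be non-negative")
    combined = np.zeros_like(np.asarray(kernels[0], dtype=float))
    for w, K in zip(weights, kernels):
        combined += w * np.asarray(K, dtype=float)
    return combined


# Toy data (hypothetical): two linear kernels built from random features.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 3))
X2 = rng.normal(size=(6, 4))
K1, K2 = X1 @ X1.T, X2 @ X2.T

# Weights fixed by hand here; they are NOT learned as in the paper.
K = combine_kernels([K1, K2], [0.7, 0.3])
```

The combined matrix can then be passed to any SVM implementation that accepts a precomputed kernel.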

Manuscript (PDF)
Supplementary results (PDF)

Data

From the set of all 6355 yeast genes, we select the 3588 genes with known function. The following list gives an overview of all data used to produce the results in the PSB 2004 paper. All of the files are gzipped, tab-delimited text.

The labels for the 3588 genes, according to the 13 functional classifications.

The corresponding kernel matrices. Each of the following matrices is 3588 by 3588.

Centered and normalized kernel derived from the Pfam domain structure of the proteins (using presence or absence of the domain): kernel_matrix_pfamdom_cn_3588.txt.gz [used to compare to MRF method]
Normalized kernel derived from protein interaction, i.e., co-participation (or not) in a protein complex, as determined by tandem affinity purification (TAP): kernel_matrix_tap_n_3588.txt.gz [used to compare to MRF method]
Normalized kernel derived from physical protein-protein interactions: kernel_matrix_mpi_n_3588.txt.gz [used to compare to MRF method]
Normalized kernel derived from genetic interactions: kernel_matrix_mgi_n_3588.txt.gz [used to compare to MRF method]
Normalized kernel derived as a diffusion kernel from cell cycle gene expression measurements: kernel_matrix_exp_diff_n_3588.txt.gz [used to compare to MRF method]
Normalized kernel derived as a Gaussian kernel from cell cycle gene expression measurements: kernel_matrix_exp_gauss_n_3588.txt.gz [richer representation]
Centered and normalized kernel derived from the Pfam domain structure of proteins (using log E-values): kernel_matrix_pfamdom_exp_cn_3588.txt.gz [richer representation; also using additional domains of Pfam 9.0]
Centered and normalized kernel derived from the Smith-Waterman pairwise sequence comparison algorithm: kernel_matrix_sw_cn_3588.txt.gz [extra kernel matrix]
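
Several of the matrices above are described as centered and/or normalized. The sketch below shows one common way to do both, assuming centering means centering in feature space (K -> HKH with H = I - (1/n) 11^T) and normalizing means dividing K_ij by sqrt(K_ii K_jj). Whether the files were produced with exactly these formulas, and in this order, is an assumption; the function names and toy data are hypothetical.

```python
import numpy as np


def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return H @ K @ H


def normalize_kernel(K):
    """Scale a kernel matrix so that every diagonal entry equals 1.

    Assumes a strictly positive diagonal; a zero entry would cause a
    division by zero and needs separate handling.
    """
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)


# Toy data (hypothetical): a linear kernel on random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = X @ X.T

K_c = center_kernel(K)        # rows and columns now sum to zero
K_cn = normalize_kernel(K_c)  # diagonal entries now equal one
```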

The binary Pfam domain matrix used by Deng et al.

The 3588-by-5724 matrix of Pfam log E-values, computed with respect to Pfam 9.0.

The 3588-by-6349 matrix of log Smith-Waterman p-values.
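
Since all of the files are gzipped, tab-delimited text, a matrix can be read along the following lines. This is a minimal sketch that assumes the file holds only numeric values, with no header row and no column of gene names; if the real files carry labels, the parsing needs to be adjusted. The helper `load_matrix` and the demo file name are hypothetical.

```python
import gzip
import os
import tempfile

import numpy as np


def load_matrix(path):
    """Read a gzipped, tab-delimited numeric matrix into a 2-D float array."""
    with gzip.open(path, "rt") as handle:
        return np.loadtxt(handle, delimiter="\t", ndmin=2)


# Round-trip demo on a tiny stand-in file, so the sketch runs without
# downloading anything.
demo = np.array([[1.0, 0.5], [0.5, 1.0]])
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_kernel.txt.gz")
    # np.savetxt writes gzip-compressed output when the name ends in .gz
    np.savetxt(demo_path, demo, delimiter="\t")
    loaded = load_matrix(demo_path)
```

For a downloaded file, the call would be `load_matrix("kernel_matrix_tap_n_3588.txt.gz")`, and checking that the resulting shape is (3588, 3588) is a quick sanity test.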

A Matlab implementation of the algorithm proposed by Francis Bach and Gert Lanckriet to solve the multiple kernel learning problem. (See Bach, F.R., Lanckriet, G.R.G. & Jordan, M.I. (2004). Fast Kernel Learning using Sequential Minimal Optimization. Technical Report CSD-04-1307, Division of Computer Science, University of California, Berkeley.) Courtesy of Guillaume Obozinski.