GENOME 541: Intro to Computational Molecular Biology

Homework #7: Support vector machine classification of microarray data

Due by noon on Tuesday, May 23. Turn in at Foege S220C.

For this assignment, you will write a program that trains a linear support vector machine classifier to recognize classes of tissues, as characterized by DNA microarrays.

The program takes two inputs:

The program reads in the given files and uses the labeled examples to train an SVM, following the optimization procedure described by Jaakkola, Diekhans and Haussler in "A discriminative framework for detecting remote protein homologies". Do not use the radial basis kernel function that is described in the paper; instead, use a simple dot product.

The program produces as output a single, tab-delimited file. The file contains one row per experiment, and five columns per row. The first column is simply the experiment identifier. The second column contains the original label assigned to that experiment ("1", "-1" or "0"). The third column contains the weight that the SVM assigns to that experiment (or "N/A" for experiments labeled "0"). The fourth column contains a "1" or "-1," depending upon the sign of the discriminant computed by the trained SVM for that experiment. The final column contains the raw discriminant value.

Here are several published data sets, in the formats described above, for you to test your program on:

Here is the solution for the first data set above:

You should turn in hard copies of

Your program should be written in C, C++, Java or Perl. If you want to use a different language, please let me know.

Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly.

It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between assignments will result in a score of zero for both assignments.