GENOME 541: Intro to Computational Molecular Biology

Homework #5: Support vector machine classification of microarray data

Due at the beginning of class on Friday, May 8.

For this assignment, you will write a program that trains a linear support vector machine classifier to recognize classes of tissues, as characterized by DNA microarrays.

The program takes two inputs:

The program reads in the given files and uses the labeled examples to train an SVM, following the optimization procedure described by Jaakkola, Diekhans and Haussler in "A discriminative framework for detecting remote protein homologies". Do not use the radial basis kernel function that is described in the paper; instead, use a simple dot product.

The program produces as output a single, tab-delimited file. The file contains one row per experiment, and five columns per row. The first column is simply the experiment identifier. The second column contains the original label assigned to that experiment ("1", "-1" or "0"). The third column contains the weight that the SVM assigns to that experiment (or "N/A" for experiments labeled "0"), multiplied by the label. The fourth column contains a "1" or "-1," depending upon the sign of the discriminant computed by the trained SVM for that experiment. The final column contains the raw discriminant value.

Here are several published data sets, in the formats described above, for you to test your program on:

Here is the solution for the first data set above:

Don't worry too much about deciding automatically when to stop training. A reasonable strategy is to converge when the mean squared difference between the previous set of weights and the current set of weights drops below a very small value. The true convergence criteria (so-called Karush-Kuhn-Tucker criteria) are trickier to implement. An alternative approach, which is perfectly fine here, is simply to run a large fixed number of iterations.

You should turn in hard copies of

Your program should be written in C, C++, Java, Perl or Python. If you want to use a different language, please let me know.

Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly.

It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between assignments will result in a score of zero for both assignments.