GENOME 541: Intro to Computational Molecular Biology
Homework #7: Support vector machine classification of microarray data
Due by noon on Tuesday, May 23. Turn in at Foege S220C.
For this assignment, you will write a program that trains a linear support vector machine classifier to recognize classes of tissues, as characterized by DNA microarrays.
The program takes two inputs:
- A matrix of gene expression measurements. This file is tab-delimited text. Each row corresponds to an experiment, and each column corresponds to a gene. The entries in the first column are experiment identifiers (i.e., text strings), and the entries in the first row are gene identifiers. The entry in the upper left corner (row 1 and column 1) can be anything at all. All other entries in the matrix are real-valued numbers, corresponding, e.g., to log expression ratios.
- A file that specifies which experiments belong in each of the two groups. The format is the same: tab-delimited text, with labels in the first row and first column. In this file, rows correspond to experiments, and there is only a single column (in addition to the experiment label). This column contains a "1" or "-1" for each experiment, indicating which group it belongs to. Some experiments may instead carry the label "0," indicating that they should not be used during SVM training, but that they should be assigned a predicted classification.
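As an illustration of the two input formats, here is a minimal parsing sketch. It is written in Python only for brevity (your submission should still use one of the languages listed at the end of this handout), and the function names and the in-memory example values are made up for this sketch. It assumes every data row is complete and that labels are written exactly as "1", "-1" or "0".

```python
import csv
import io


def read_labeled_table(handle):
    """Read tab-delimited text with a header row and a leading label column.

    Returns (column_names, row_names, rows); each entry of rows holds the
    raw string fields of one data row, without its first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    column_names = header[1:]  # the upper-left corner entry is ignored
    row_names, rows = [], []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row_names.append(fields[0])
        rows.append(fields[1:])
    return column_names, row_names, rows


def read_expression_matrix(handle):
    """Return (gene_ids, experiment_ids, matrix of floats)."""
    gene_ids, experiment_ids, rows = read_labeled_table(handle)
    matrix = [[float(value) for value in row] for row in rows]
    return gene_ids, experiment_ids, matrix


def read_class_labels(handle):
    """Return (experiment_ids, labels), with each label in {1, -1, 0}."""
    _, experiment_ids, rows = read_labeled_table(handle)
    labels = [int(row[0]) for row in rows]
    for experiment_id, label in zip(experiment_ids, labels):
        if label not in (1, -1, 0):
            raise ValueError(f"unexpected label {label} for {experiment_id}")
    return experiment_ids, labels


# Tiny in-memory example (made-up values) standing in for a real file.
example_matrix = io.StringIO("id\tgeneA\tgeneB\nexp1\t0.5\t-1.2\nexp2\t1.0\t2.0\n")
genes, experiments, matrix = read_expression_matrix(example_matrix)
```

With real data, you would pass an open file handle instead of the `io.StringIO` object, and you would probably want to check that both files list the experiments in the same order.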
The program reads in the given files and uses the labeled examples to train an SVM, following the optimization procedure described by Jaakkola, Diekhans and Haussler in "A discriminative framework for detecting remote protein homologies". Do not use the radial basis kernel function that is described in the paper; instead, use a simple dot product.
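To make the training step concrete, here is a minimal sketch, again in Python for brevity only. It assumes the coordinate-wise update from the paper has the form weight_i = max(0, (1 - y_i * f(x_i) + weight_i * K(x_i, x_i)) / K(x_i, x_i)), where f(x) is the sum over training examples j of weight_j * y_j * K(x_j, x) and K is the dot product; please check this against the paper rather than relying on the sketch. The function names, the sweep limit and the convergence tolerance are choices made for this illustration, not part of the published procedure.

```python
def dot(u, v):
    """Linear kernel: the plain dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_svm(examples, labels, max_sweeps=1000, tolerance=1e-6):
    """Return one non-negative weight per training example.

    examples: list of feature vectors; labels: list of +1 / -1 values.
    Sweeps over the examples, updating one weight at a time, until the
    largest change in a sweep falls below tolerance (an assumed criterion).
    """
    n = len(examples)
    kernel = [[dot(examples[i], examples[j]) for j in range(n)] for i in range(n)]
    weights = [0.0] * n
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(n):
            if kernel[i][i] == 0.0:
                continue  # all-zero example: the update is undefined
            current_output = sum(
                weights[j] * labels[j] * kernel[i][j] for j in range(n)
            )
            updated = (
                1.0 - labels[i] * current_output + weights[i] * kernel[i][i]
            ) / kernel[i][i]
            updated = max(0.0, updated)  # weights may not go negative
            largest_change = max(largest_change, abs(updated - weights[i]))
            weights[i] = updated
        if largest_change < tolerance:
            break
    return weights


def discriminant(point, examples, labels, weights):
    """Raw discriminant of a point under the trained weights (no bias term)."""
    return sum(w * y * dot(x, point) for w, y, x in zip(weights, labels, examples))


# Two made-up training points on opposite sides of the origin.
toy_examples = [[2.0, 0.0], [-2.0, 0.0]]
toy_labels = [1, -1]
toy_weights = train_svm(toy_examples, toy_labels)
```

Note that only experiments labeled "1" or "-1" should be passed to training; experiments labeled "0" are scored afterwards with the discriminant. Because this formulation places no upper bound on the weights, the sweep limit matters if the training data are not linearly separable.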
The program produces as output a single, tab-delimited file. The file contains one row per experiment, and five columns per row. The first column is simply the experiment identifier. The second column contains the original label assigned to that experiment ("1", "-1" or "0"). The third column contains the weight that the SVM assigns to that experiment (or "N/A" for experiments labeled "0"). The fourth column contains a "1" or "-1," depending upon the sign of the discriminant computed by the trained SVM for that experiment. The final column contains the raw discriminant value.
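The five-column layout can be sketched as follows, in Python for brevity only. The function name, the number formatting, and the decision to report "1" when the discriminant is exactly zero are assumptions of this sketch; the handout does not specify them.

```python
def format_output_rows(experiment_ids, labels, training_weights, discriminants):
    """Build one tab-delimited line per experiment.

    training_weights maps experiment id -> SVM weight and only needs
    entries for experiments labeled 1 or -1; experiments labeled 0 get
    "N/A" in the weight column.
    """
    lines = []
    for experiment_id, label, value in zip(experiment_ids, labels, discriminants):
        if label == 0:
            weight_field = "N/A"
        else:
            weight_field = f"{training_weights[experiment_id]:.6g}"
        # Assumption: a discriminant of exactly zero is reported as class 1.
        predicted = 1 if value >= 0 else -1
        fields = [experiment_id, str(label), weight_field, str(predicted), f"{value:.6g}"]
        lines.append("\t".join(fields))
    return lines


# Made-up values for three experiments, the last one unlabeled.
example_rows = format_output_rows(
    ["exp1", "exp2", "exp3"],
    [1, -1, 0],
    {"exp1": 0.25, "exp2": 0.0},
    [1.0, -1.0, 0.5],
)
```

Writing the file is then a matter of joining the returned lines with newline characters.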
Here are several published data sets, in the formats described above, for you to test your program on:
- Golub: paper, website, data, T cell class
- Khan: paper, website, data, RMS class
- Sandberg: paper, website, data, strain class
Here is the solution for the first data set above:
You should turn in hard copies of
- the output files for the other two data sets listed above, and
- your code.
Your program should be written in C, C++, Java or Perl. If you want to use a different language, please let me know.
Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly.
It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between assignments will result in a score of zero for both assignments.