Multidimensional scaling modeling of the yeast genome
For this assignment, you will write a program that models the 3D structure of the yeast chromosome 7 using multidimensional scaling (MDS). Your code should take as input a list of Hi-C counts and output a list of the 3D positions of beads. You should convert the Hi-C counts to distances using the conversion relationship: distance(i,j) = 103 * counts(i,j)-3. Your program should take an argument controlling the number of dimensions to infer (i.e. 2D or 3D). Your progam should read a Hi-C counts file in the format:bead1 bead2 counts 5 8 5.2 ...
Use gradient descent to optimize the multidimensional scaling objective. In order to derive the gradient descent updates, you will need to find the partial derivative of the objective with respect to each variable:
You can use a learning rate of 0.001. To initialize your algorithm, place each coordinate of each bead uniformly at random in [0,1]. Terminate optimization when the difference in objective value between iterations is less than 0.001.
Some approaches to this problem place constraints on x; you do not need to use any constraints for this assignment.
To test your code, use the following synthetic data: spiral_counts.txt, spiral_true_points.txt. The points lie in a spiral in 2D. Use MDS with two dimensions to infer the positions of the points. You can use this script to plot your inferred points: plot_2D.R. It takes as input a file in the following format:x1 x2 3.4 1.2 ...
(The header line is required). You can create a plot with Rscript plot_2D.R points.txt plot.png, or you are welcome to write your own plotting script.
Run your code on pairwise distances derived from a Hi-C experiment on yeast chromosome 7: yeast_chrom7_counts.txt. This list includes only pairwise distances from significant Hi-C counts (FDR 1%), so not all pairs have observed distances. Because you can only plot two dimensions at a time, make three plots of your inferred points using plot_2D.R, one for each pair of dimensions: (0,1), (0,2) and (1,2).
- Your program's code.
- A plot from plot_2D.R of the output of your program on the synthetic spiral data set.
- Three plots from plot_2D of the output of your program on the yeast chromosome 7 data set, one for each pair of dimensions.
Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly. It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.