GS541 2015 unit 4: Gene regulation and epigenomics

Lecture slides

  • Lecture 1: Transcription factor binding (sequence): notes, slides
  • Lecture 2: Transcription factor binding (functional genomics): slides
  • Lecture 3: Combinatorial analysis of functional genomics: slides
  • Lecture 4: Chromatin 3D architecture: slides

Homework 7

Implement the EM algorithm for learning a motif PSFM. Your code should take as input a list of DNA sequences and a starting PSFM and output a PSFM for a motif enriched in the sequences.

See the original MEME paper for the details of the algorithm. You may want to pay particular attention to section 2.1, on selecting a starting PSFM, and the Appendix, which describes the likelihood calculation. Use the one-motif-per-sequence assumption; you can ignore the part about the N-motifs-per-sequence model. You can base your implementation on the following pseudocode:

Run your code on sequences taken from a ChIP-seq experiment (performed in the human lymphoblastoid cell line GM12878) on the transcription factor REST (a.k.a. NRSF): rest_chipseq_sequences.txt. Use the REST consensus motif GGAGCTGTCCATGGGTCTGA as your starting point, assigning a probability of 0.5 to the consensus letter and 0.5/3 to the other letters (as described in section 2.1 of the MEME paper). You do not to implement the rest of the MEME algorithm that involves testing multiple starting points. You can stop after 10 iterations instead of testing for convergence.

Once you learn your motif, use the online TOMTOM server to search for known motifs similar to the one you learned. Compare to the JASPAR CORE database. (Click the "?" next to the input box to see what format it expects for the PSFM.) What motifs do you find?

To test your code, you can use this file of simulated sequences: simulated_sequences.txt. Each sequence in this file has one instance of a motif with the consensus sequence AACCGGTTAACCGGTT. This motif emits the consensus letter with probability 85% and one of the other three letters with probability 5%.

Turn in:

  • Your program's code.
  • The PSFM you found for the REST sequences in the same format you input to TOMTOM.
  • A screenshot of the top result from TOMTOM.
  • How many hours your spent on this assignment.

Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly. It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.

Due by 3:15pm on Friday, May 22.

Homework 8

Model the 3D structure of the yeast chromosome 7 using multidimensional scaling (MDS). Your code should take as input a list of Hi-C counts and output a list of the 3D positions of beads. You should convert the Hi-C counts to distances using the conversion relationship: distance(i,j) = 10^3 * counts(i,j)^-3. Your program should take an argument controlling the number of dimensions to infer (i.e. 2D or 3D). Your progam should read a Hi-C counts file in the format:

bead1	bead2	counts
5	8	5.2
...

Use gradient descent on the multidimensional scaling objective:
You can use a learning rate (t in the slides) of 0.001. To initialize your algorithm, place each coordinate of each bead uniformly at random in [0,1]. Terminate optimization when the difference in objective value between iterations is less than 0.001.

The paper I presented in class placed constraints on x; for this assignment we not be using these constraints. Unconstrained optimization has been shown to perform nearly as well.

To test your code, use the following synthetic data: spiral_counts.txt, spiral_true_points.txt. The points lie in a spiral in 2D. Use MDS with two dimensions to infer the positions of the points. Use this script to plot your inferred points: plot_2D.R. It takes as input a file in the following format:

x1	x2
3.4	1.2
...
(the header line is required). Create a plot with Rscript plot_2D.R points.txt plot.png

Run your code on pairwise distances derived from a Hi-C experiment on yeast chromosome 7: yeast_chrom7_counts.txt. This list includes only pairwise distances from significant Hi-C counts (FDR 1%), so not all pairs have observed distances. Because you can only plot two dimensions at a time, make three plots of your inferred points using plot_2D.R, one for each pair of dimensions: (0,1), (0,2) and (1,2).

Turn in:

  • Your program's code.
  • A plot from plot_2D.R of the output of your program on the synthetic spiral data set.
  • Three plots from plot_2D of the output of your program on the yeast chromosome 7 data set, one for each pair of dimensions.
  • How many hours you spent on this assignment.

Please use descriptive naming in your code, as well as comments. Programs that are too difficult to read may be marked down, even if they work correctly. It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.

Due by 3:15pm on Friday, May 29.