A neural network for predicting transcription factor binding

Manu Setty and Christina Leslie. "SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps." PLOS Computational Biology. 11(5):e1004271, 2015.

The paper cited above describes a method for finding motifs in ChIP-seq data. We will use similar data to train a deep neural network to recognize ChIP-seq peaks. Specifically, the goal of the classifier is to accurately distinguish between ChIP-seq peaks and flanking non-peak regions of the genome.

This zip file contains ENCODE ChIP-seq data from 39 different transcription factors. Each data set was generated as follows:

  1. Identify the top 5000 peaks with the highest binding signal as positives (peaks).
  2. Extract the central 150bp around the summit of each peak.
  3. Extract the region 300bp upstream of each positive as a corresponding negative (flanking regions).
  4. Remove overlapping regions and regions containing N's in the sequence.
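As a rough illustration of steps 2 through 4, the sketch below computes a peak window and its matching flank from a summit position. Several details are assumptions and are not stated above: coordinates are 0-based and half-open, "upstream" means lower coordinates on the forward strand, and the flank is a window of the same width shifted by 300bp. The function names are hypothetical, and this is not the script that produced the data.

```python
def peak_and_flank(summit, width=150, offset=300):
    """Return (peak, flank) windows as (start, end) pairs.

    Assumption: 0-based, half-open coordinates; the flank is the peak
    window shifted `offset` bp toward lower coordinates.
    """
    half = width // 2
    peak = (summit - half, summit - half + width)
    flank = (peak[0] - offset, peak[1] - offset)
    return peak, flank


def overlaps(a, b):
    """True if two half-open (start, end) intervals share any base."""
    return a[0] < b[1] and b[0] < a[1]


def contains_n(seq):
    """True if the sequence contains an ambiguous base (N or n)."""
    return "N" in seq.upper()
```

With these definitions, a region would be dropped in step 4 if `overlaps` is true against another kept region or if `contains_n` is true for its sequence.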

Each ChIP-seq experiment folder includes

  1. two FASTA files containing the sequences of the training and test regions, and
  2. a label file containing the label (1 for positive/peak, 0 for negative/flank) of each region in the training set.
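A minimal sketch for loading these files is shown here. The exact layout of the label file is not described above, so the reader assumes one region per line with the label as the last whitespace-separated field; adjust it if your files differ. The function names and any paths you pass in are hypothetical.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_fasta(path):
    """Read a FASTA file into a list of (name, sequence) pairs."""
    with open(path) as handle:
        return list(parse_fasta(handle))


def parse_labels(lines):
    """Return integer labels, one per non-empty line.

    Assumption: the label is the last whitespace-separated field.
    """
    return [int(line.split()[-1]) for line in lines if line.strip()]
```

It is worth checking that the number of labels equals the number of training sequences before going further.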

You should train a deep neural network using the Python deep learning library Keras to distinguish peaks from flanks. You will train one network for each transcription factor. You should use a one-hot encoding of the DNA sequence (i.e., each nucleotide is represented using four bits: 0001 for A, 0010 for C, 0100 for G, 1000 for T). Your network should have at least two layers, which may be fully connected, convolutional, or a mixture of the two. Beyond that, you may choose how many layers to employ and how many nodes to use per layer. Your goal is to produce a network that maximizes the area under the ROC curve.
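To make the encoding and the network requirements concrete, here is one possible starting point. The bit positions follow the codes given above (A = 0001, so A sets the last of the four positions). Everything else is an illustrative assumption rather than a recommended design: the layer types and sizes, the optimizer, a fixed input length of 150bp, and a TensorFlow-backed Keras install. Keras is imported inside `build_model` so that the encoding function can be used on its own.

```python
import numpy as np

# Position of the set bit for each base, matching 0001/0010/0100/1000.
BIT_INDEX = {"A": 3, "C": 2, "G": 1, "T": 0}


def one_hot_encode(seq):
    """Encode a DNA string as a (len(seq), 4) float32 array."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        encoded[i, BIT_INDEX[base]] = 1.0  # KeyError on anything but ACGT
    return encoded


def build_model(seq_len=150):
    """Small convolutional classifier; all sizes are arbitrary choices."""
    from tensorflow import keras  # assumption: TensorFlow-backed Keras

    model = keras.Sequential([
        keras.Input(shape=(seq_len, 4)),
        keras.layers.Conv1D(32, 12, activation="relu"),
        keras.layers.GlobalMaxPooling1D(),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc")],  # tracks area under ROC
    )
    return model
```

Given a list of training sequences `seqs` and a label array `y`, the inputs could be stacked with `X = np.stack([one_hot_encode(s) for s in seqs])` and the model trained with `model.fit(X, y, validation_split=0.1, epochs=10)`. Holding out part of the training set this way gives an estimate of the AUC, since the test labels are not provided.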

Once you have your system working, you should train two networks, one for CTCF and one for CEBPD, and submit real-valued (not 0/1) predictions for each example in the test set. The resulting prediction files should have the same number of lines as the given test BED files, with one number per line.
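A small sketch of writing predictions in this one-number-per-line format follows. The function names, the output path, and the choice of six decimal places are assumptions; the only requirement stated above is one real-valued number per line, in the same order as the test examples.

```python
import numpy as np


def format_predictions(scores):
    """Return one real-valued score per line, in input order."""
    flat = np.asarray(scores, dtype=float).ravel()  # accepts (n,) or (n, 1)
    return "".join("%.6f\n" % s for s in flat)


def write_predictions(path, scores):
    """Write scores to `path`, one number per line."""
    with open(path, "w") as handle:
        handle.write(format_predictions(scores))
```

The output of a Keras model's `predict` method for a single sigmoid unit has shape `(n, 1)`, which `format_predictions` flattens before writing.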

Some additional tools you may find helpful:

Please turn in your source code, a README file that briefly describes the modeling choices you made, and prediction files for the two TFs above.

It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.

Due by 3:15 pm on Friday, April 24.

Thanks to Han Yuan in Christina Leslie's lab for sending this data set.