A neural network for predicting transcription factor binding
Manu Setty and Christina Leslie. "SeqGL Identifies Context-Dependent Binding Signals in Genome-Wide Regulatory Element Maps." PLOS Computational Biology. 11(5):e1004271, 2015.
The paper cited above describes a method for finding motifs in ChIP-seq data. We will use similar data to train a deep neural network to recognize ChIP-seq peaks. Specifically, the goal of the classifier is to accurately distinguish between ChIP-seq peaks and flanking non-peak regions of the genome.
This zip file contains ENCODE ChIP-seq data from 14 different transcription factors, all measured in the cell line K562. Each data set was generated as follows:
- Identify the top 5000 peaks with the highest binding signal as positives (peaks).
- Extract the central 150bp around the summit of each peak.
- Extract the region 300bp upstream of each positive as a corresponding negative (flanking regions).
- Remove overlapping regions and regions containing N's in the sequence.
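The data sets are already prepared for you, but the window arithmetic in the steps above can be sketched as follows. This is only an illustration under stated assumptions: coordinates are 0-based and half-open, the flank is taken to be a window of the same width shifted 300bp toward smaller coordinates, and the names `Region`, `peak_window`, `flank_window`, `overlaps` and `has_n` are hypothetical. The actual pipeline used to build the data may differ.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def peak_window(chrom: str, summit: int, width: int = 150) -> Region:
    """Window of `width` bp centered on the peak summit."""
    half = width // 2
    return Region(chrom, summit - half, summit - half + width)


def flank_window(peak: Region, offset: int = 300) -> Region:
    """Assumed negative: the same-sized window shifted `offset` bp upstream."""
    return Region(peak.chrom, peak.start - offset, peak.end - offset)


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def has_n(seq: str) -> bool:
    """True if the sequence contains an ambiguous base."""
    return "N" in seq.upper()


peak = peak_window("chr1", 1000)
flank = flank_window(peak)
print(peak, flank, overlaps(peak, flank))
```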
For 10 of the ChIP-seq experiments, the folder includes
- three FASTA files containing the sequences of the train, validation, and test regions and
- three label files containing the label (1 for positive/peak, 0 for negative/flank) of each region in the train, validation, and test sets.
For the remaining 4 TFs (CTCF, CEBPB, NRF1 and MGA), the test labels are not provided. These will be used for the evaluation of your trained models.
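A minimal sketch of loading these files is shown below. It assumes standard FASTA formatting (a `>` header line followed by one or more sequence lines) and assumes that each label file holds one integer per line, in the same order as the FASTA records. Check both assumptions against the actual files. The function names are hypothetical, and both functions take any iterable of lines, so an open file handle can be passed directly.

```python
from typing import Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (name, sequence) pairs; multi-line sequences are joined."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def read_labels(lines: Iterable[str]) -> List[int]:
    """Assumed format: one 0/1 label per non-empty line."""
    return [int(line.strip()) for line in lines if line.strip()]


# Small in-memory example; with real data, pass open(path) instead.
records = read_fasta([">region1", "acgt", "AC", ">region2", "TTTT"])
labels = read_labels(["1\n", "0\n"])
assert len(records) == len(labels)  # one label per region
```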
For each transcription factor, you should train a deep neural network using the Python deep learning library PyTorch to distinguish peaks from flanks. You should use a one-hot encoding of the DNA sequence (i.e., each nucleotide is represented using four bits, 0001 for A, 0010 for C, 0100 for G, and 1000 for T). You should use at least two layers, and you may select fully connected, convolutional or a mixture thereof. You can also choose how many layers to employ and how many nodes to use per layer. You should aim to produce a network that maximizes the area under the ROC curve (AUROC) for the validation set. You can use the 10 development TFs to help you choose model parameters.
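To make the encoding and the "at least two layers" requirement concrete, here is one possible starting point, not a recommended or reference architecture. It assumes the bit patterns above are read right to left, so A, C, G and T map to channels 0 to 3 (the channel order does not affect what the network can learn). The class name `PeakClassifier`, the helper `train_epoch`, and all hyperparameters (32 filters, kernel size 12, 64 hidden units) are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

# Channel index = position of the set bit, counted from the right (assumption).
CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot(seqs):
    """Encode equal-length DNA strings as a float tensor of shape (N, 4, L).

    Raises KeyError on characters other than A/C/G/T; the data sets
    described above exclude sequences containing N.
    """
    x = torch.zeros(len(seqs), 4, len(seqs[0]))
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq.upper()):
            x[i, CHANNEL[base], j] = 1.0
    return x


class PeakClassifier(nn.Module):
    """One convolutional layer followed by two fully connected layers."""

    def __init__(self, n_filters=32, kernel_size=12, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)  # max over sequence positions
        self.fc = nn.Sequential(
            nn.Linear(n_filters, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x):
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)    # (N, n_filters)
        return self.fc(h).squeeze(-1)   # (N,) raw logits, not probabilities


def train_epoch(model, x, y, optimizer, batch_size=64):
    """One pass over the data with binary cross-entropy; returns mean loss."""
    model.train()
    loss_fn = nn.BCEWithLogitsLoss()
    perm = torch.randperm(x.shape[0])
    total = 0.0
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start + batch_size]
        optimizer.zero_grad()
        loss = loss_fn(model(x[idx]), y[idx])
        loss.backward()
        optimizer.step()
        total += loss.item() * len(idx)
    return total / x.shape[0]
```

The model returns logits because `nn.BCEWithLogitsLoss` applies the sigmoid internally; apply `torch.sigmoid` to the outputs when you need scores between 0 and 1.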
We will evaluate your models on the test sets for which the labels are not given to you. Hence, once your system works, train four networks, one for each test factor, and select the models with the best AUROC on the respective validation sets. For each transcription factor, you should produce prediction scores (real values between 0 and 1) for its test set sequences. You should submit real-valued (not 0/1) predictions for each example in the test set. The resulting prediction files should have the same number of lines as the number of regions in the test FASTA file, with one number per region.
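As an illustration of the output format described above, the sketch below converts raw network outputs (logits) to scores between 0 and 1 and writes one score per line, after checking that the number of scores matches the number of test regions. The helper names, the six-decimal formatting, and the file name are assumptions, not requirements stated in this assignment.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def write_predictions(scores, n_regions: int, path: str) -> None:
    """Write one real-valued score per line, in test FASTA order."""
    scores = list(scores)
    if len(scores) != n_regions:
        raise ValueError(
            f"expected {n_regions} scores, got {len(scores)}"
        )
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie between 0 and 1")
    with open(path, "w") as handle:
        for s in scores:
            handle.write(f"{s:.6f}\n")


# Hypothetical usage with three test regions:
# write_predictions([sigmoid(z) for z in logits], 3, "CTCF_test_predictions.txt")
```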
Some additional tools you may find helpful:
- Resources for preprocessing datasets for PyTorch models.
- Converting arrays to one-hot-encodings that can be used by PyTorch neural networks.
- A basic tutorial for training a neural network in PyTorch.
- Code to calculate an ROC curve in Python, with an example of how to plot it.
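If you want to check AUROC values without extra dependencies, the following sketch computes the area under the ROC curve from the rank-sum (Mann-Whitney U) statistic, giving tied scores their average rank. It is not the code linked above, and the function name `auroc` is hypothetical. scikit-learn's `sklearn.metrics.roc_auc_score` computes the same quantity if that library is available to you.

```python
def auroc(labels, scores):
    """Area under the ROC curve for 0/1 labels and real-valued scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
                n_pos += 1
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```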
Please turn in a README file that briefly describes the modeling choices you made, your source code, and prediction files for the four test TFs listed above (CTCF, CEBPB, NRF1 and MGA).
It is OK to discuss programming strategies; however, the programming should be entirely your own. It is not OK to look at someone else's code or to show someone else your code. Code that has obviously been copied between class members will result in a score of zero for both assignments.
Due by 3:15 pm on Friday, May 5.
Thanks to Han Yuan in Christina Leslie's lab for sending this data set.