Predicting human nucleosome occupancy from primary sequence

Supplementary data for "Predicting human nucleosome occupancy from primary sequence"

Shobhit Gupta, Jonathan Dennis, Robert E. Thurman, Robert Kingston, John A. Stamatoyannopoulos and William Stafford Noble

PLoS Computational Biology. 4(8):e1000134, 2008.

Paper

Microarray data from Dennis et al. [2007]

Raw data: Experiments were performed on three arrays (Array 3, Array 4 and Array 6). Each locus is represented six times on each array, three times in the forward orientation and three times in the backward orientation. Two files are provided for each array, one containing the nucleosomal intensities and the other containing the genomic (background) intensities.

Array 3 (Nucleosomal / Genomic).
Array 4 (Nucleosomal / Genomic).
Array 6 (Nucleosomal / Genomic).

Weakly smoothed data: For each array, the smoothed data file contains both nucleosomal and genomic intensities.

Array 3.
Array 4.
Array 6.

Strongly smoothed data: Similar to the weakly smoothed data, each file contains both nucleosomal and genomic intensities.

Array 3.
Array 4.
Array 6.

SVM training sets. Each file is a FASTA file containing the sequences of the probes used to train the corresponding SVM. Chromosomal coordinates in NCBI Build 35 (May 2004 assembly) are also included.

Dennis: positive, negative

Dennis weak smooth: positive, negative

Dennis strong smooth: positive, negative

Ozsolak: positive, negative

Ozsolak raw: positive, negative

Ozsolak A375: positive, negative

Ozsolak MEC: positive, negative

Ozsolak IMR90: positive, negative

Ozsolak MALME: positive, negative

Ozsolak PM: positive, negative

Ozsolak MCF7: positive, negative

Ozsolak T47D: positive, negative
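
For readers who want to load these training sets, the following is a minimal sketch of a FASTA reader. It assumes only the general FASTA convention (a header line starting with ">" followed by one or more sequence lines). The exact layout of the chromosomal coordinates within each header is not specified here, so the header is returned as an unparsed string. The function name read_fasta and the file name in the usage note are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without the leading '>' and is not parsed
    further, because the coordinate layout in these files is an unknown.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A typical use would be to open a downloaded file, for example `with open("positive.fa") as fh: records = list(read_fasta(fh))`, where positive.fa stands in for whichever training set was saved locally.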

Predicted nucleosome occupancy in the ENCODE regions:

Segal model

Peckham SVM

A375 SVM

MEC SVM

Note: Predicted occupancy across the March 2006 assembly of the entire human genome is now available in the UCSC Genome Browser. Select the "Nucleosome Occupancy" track.

Top- and bottom-scoring probes:

A375 SVM 1000 top-scoring probes

MEC SVM 1000 bottom-scoring probes

A Python program fasta2matrix.py that computes feature vectors from DNA sequences. The vectors used in this study were computed with a command line of the form python fasta2matrix.py -upto -revcomp -normalize frequency 6 foo.fa.
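
To illustrate the kind of feature vector the command line above suggests, here is a minimal sketch, not the fasta2matrix.py source. It rests on three assumptions about the flags: that "-upto ... 6" means counting all k-mers for k = 1 through 6, that "-revcomp" means merging each k-mer with its reverse complement, and that "-normalize frequency" means dividing each count by the total count for that k. The names kmer_features, canonical and reverse_complement are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by a single key."""
    return min(kmer, reverse_complement(kmer))


def kmer_features(seq, max_k=6):
    """Concatenate per-k frequency vectors of canonical k-mers, k = 1..max_k.

    Assumed normalization: counts for each k are divided by the total
    number of valid k-mers of that length found in the sequence.
    """
    seq = seq.upper()
    features = []
    for k in range(1, max_k + 1):
        # Fixed, sorted set of canonical k-mers so every vector is aligned.
        keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
        counts = dict.fromkeys(keys, 0)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N, etc.
                counts[canonical(kmer)] += 1
        total = sum(counts.values())
        features.extend(counts[key] / total if total else 0.0 for key in keys)
    return features
```

Under these assumptions, each sequence maps to a vector of 2 + 10 + 32 + 136 + 512 + 2080 = 2772 entries, one per distinct k-mer after reverse-complement merging.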
