Author: | Michael M. Hoffman <mmh1 at washington dot edu> |
---|---|
Organization: | University of Washington |
Address: | Department of Genome Sciences, PO Box 355065, Seattle, WA 98195-5065, United States of America |
Copyright: | 2009 Michael M. Hoffman |
For a broad overview, see the paper:
Hoffman MM, Buske OJ, Noble WS, “The genomedata format for storing large-scale functional genomics data.” In preparation.
Michael <mmh1 at washington dot edu> can send you a copy of the latest manuscript. Please cite this paper if you use genomedata.
A simple, interactive script has been created to install genomedata (and most dependencies) on any Linux platform. Installation is as simple as downloading and running this script! For instance:
wget http://noble.gs.washington.edu/proj/genomedata/install.py
python install.py
Note
The following are prerequisites:
Note
Please send any install script bugs/issues/comments to: Orion Buske <stasis at uw dot edu>
Genomedata is a module to store and access large-scale functional genomics data in a format which is both space-efficient and allows efficient random-access.
Under the surface, genomedata is implemented as a collection of HDF5 files, but genomedata provides a transparent interface to interact with your underlying data without having to worry about the mess of repeatedly parsing large data files or having to keep them in memory for random access.
The genomedata hierarchy:
- Each Genome contains many Chromosomes
- Each Chromosome contains many Supercontigs
- Each Supercontig contains one continuous data set
- Each continuous data set is a numpy.array of floating point numbers with a column for each data track and a row for each base in the data set.
A genomedata collection contains sequence and may also contain numerical data associated with that sequence. You can easily load sequence and numerical data into a genomedata collection with the genomedata-load command (see command details additional details):
genomedata-load [-t trackname=signalfile]... [-s sequencefile]... GENOMEDATADIR
This command is a user-friendly shortcut to the typical workflow. The underlying commands are still installed and may be used if more fine-grained control is required. The commands and required ordering are:
Note
A call to h5repack after genomedata-close-data may be used to transparently compress the data.
The data in genomedata is accessed through the hierarchy described in Overview. A full Python API is also available. To appreciate the full benefit of genomedata, it is most easily used as a contextmanager:
from genomedata import Genome
[...]
genomedatadir = "/path/to/genomedata"
with Genome(genomedatadir) as genome:
[...]
Note
If used as a context manager, chromosome access is memoized. If not, chromosomes should be closed manually with Chromosome.close().
Genomedata is designed to make it easy to get to the data you want. Here are a few examples:
Get arbitrary sequence (10-bp sequence starting at chr2:1423):
>>> chromosome = genome["chr2"]
>>> seq = chromosome.seq[1423:1433]
>>> seq
array([116, 99, 99, 99, 99, 103, 103, 103, 103, 103], dtype=uint8)
>>> seq.tostring()
'tccccggggg'
Get arbitrary data (data from first 3 tracks for region chr8:999-1000):
>>> chromosome = genome["chr8"]
>>> chromosome[999:1001, 0:3] # Note the half-open, zero-based indexing
array([[ NaN, NaN, NaN],
[ 3. , 5.5, 3.5], dtype=float32)
Get data for a specific track (specified data in first 5-bp of chr1):
>>> chromosome = genome["chr1"]
>>> data = chromosome[0:5, "sample_track"]
>>> data
array([ 47., NaN, NaN, NaN, NaN], dtype=float32)
Only specified data:
>>> from numpy import isfinite
>>> data[isfinite(data)]
array([ 47.], dtype=float32)
Note
Specify a slice for the track to keep the data in column form:
>>> col_index = chromosome.index_continuous("sample_track")
>>> data = chromosome[0:5, col_index:col_index+1]
Genomedata collections can be created and loaded from the command line with the genomedata-load command.
Usage information follows, but in summary, this script takes as input:
sequence files in FASTA (.fa or .fa.gz) format
the name of the genomedata collection to create
For example, let’s say you have sequence data for chrX (chrX.fa) and chrY (chrY.fa.gz), as well as two signal tracks: high (signal.high.wig) and low (signal.low.bed.gz). You could construct a genomedata collection named mygenomedata in the current directory with the following command:
genomedata-load -s chrX.fa -s chrY.fa.gz -t high=signal.high.wig -t low=signal.low.bed.gz mygenomedata
Command-line usage information:
Usage: genomedata-load [OPTIONS] GENOMEDATADIR
--track and --sequence may be repeated to specify multiple trackname=trackfile
pairings and sequence files, respectively
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-s SEQFILE, --sequence=SEQFILE
Add the sequence data in the specified file
-t TRACK, --track=TRACK
Add data for the given track. TRACK should be
specified in the form: NAME=FILE, such as: -t
signal=signal.dat
Alternately, as described in Overview, the underlying Python and C load scripts are also accessible for more finely-grained control. This can be especially useful for parallelizing genomedata loading over a cluster.
This command adds the provided sequence files to the specified genomedatadir, creating it if it does not already exist. Sequence files should be in FASTA (.fa or .fa.gz) format. Gaps of >= 100,000 base pairs (specified as gap-length) in the reference sequence, are used to divide the sequence into supercontigs.
Usage: genomedata-load-seq [OPTION]... GENOMEDATADIR SEQFILE...
Options:
-g, --gap-length XXX: Implement this.
--version show program's version number and exit
-h, --help show this help message and exit
This command opens the specified tracknames in the genomedata object, allowing data for those tracks to be added with genomedata-load-data.
Usage: genomedata-open-data [OPTION]... GENOMEDATADIR TRACKNAME...
Options:
--version show program's version number and exit
-h, --help show this help message and exit
This command loads data from stdin into genomedata under the given trackname. The input data must be in one of these supported datatypes: WIG, BED, bedGraph. A chunk-size can be specified to control the size of hdf5 chunks (the smallest data read size, like a page size). Larger values of chunk-size can increase the level of compression, but they also increase the minimum amount of data that must be read to access a single value.
Usage: genomedata-load-data [OPTION...] GENOMEDATADIR TRACKNAME
Loads data into genomedata format
Takes track data in on stdin
-c, --chunk-size=NROWS Chunk hdf5 data into blocks of NROWS. A higher
value increases compression but slows random
access. Must always be smaller than the max size
for a dataset. [default: 10000]
-?, --help Give this help list
--usage Give a short usage message
-V, --version Print program version
Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Closes the specified genomedata object.
Usage: genomedata-close-data [OPTION]... GENOMEDATADIR
Options:
--version show program's version number and exit
-h, --help show this help message and exit
The genomedata package is designed to be used from a variety of scripting languages, but currently only exports the following Python API.
The root level of the genomedata object hierarchy.
Implemented via a file system directory. If you use this as a context manager, it will keep track of open Chromosomes and close them for you later when the context is left:
with Genome("/path/to/genomedata") as genome:
chromosome = genome["chr1"]
[...]
If not used as a context manager, you are responsible for closing chromosomes when you are done with them:
>>> genome = Genome("/path/to/genomedata")
>>> chromosome = genome["chr1"]
[...]
>>> chromosome.close()
Create a Genome object from the genomedata objects in the directory.
Parameter: | dirname (string) – directory containing any chomosome files to include (usually just the genomedata directory). |
---|
Example:
>>> genome = Genome("./genomedata.ctcf.pol2b/")
>>> genome
Genome("./genomedata.ctcf.pol2b/")
Return next chromosome, in sorted order, with memoization.
Example:
for chromosome in genome:
print chromosome.name
for supercontig, continuous in chromosome.itercontinuous():
[...]
Return a reference to a chromosome of the given name.
Parameter: | name (string) – name of the chromosome file (e.g. “chr1” if chr1.genomedata is a file in the genomedata directory) |
---|---|
Returns: | Chromosome |
Example:
>>> genome["chrX"]
Chromosome('/path/to/genomedata/chrX.genomedata')
>>> genome["chrZ"]
KeyError: 'Could not find chromosome: chrZ'
Return a vector of the maximum value for each track.
Returns: | numpy.array |
---|
Return a vector of the mean value of each track.
Returns: | numpy.array |
---|
Return the minimum value for each track.
Returns: | numpy.array |
---|
Return the number of datapoints in each track.
Returns: | a numpy.array vector with an entry for each track. |
---|
Return a vector of the sum of the values for each track.
Returns: | numpy.array |
---|
Return a vector of the sum of squared values for each track’s data.
Returns: | numpy.array |
---|
Return a vector of the variance in the data for each track.
Returns: | numpy.array |
---|
The genomedata object corresponding to data for a given chromosome.
Implemented via an HDF5 File
Usually created by keying into a Genome object with the name of a chromosome, as in:
>>> with Genome("/path/to/genomedata") as genome:
... chromosome = genome["chrX"]
... chromosome
...
Chromosome('/path/to/genomedata/chrX.genomedata')
Return next supercontig in chromosome.
Seldom used in favor of the more direct: Chromosome.itercontinuous()
Example:
>>> for supercontig in chromosome:
... supercontig # calls repr()
...
<Supercontig('supercontig_0', 0:66115833)>
<Supercontig('supercontig_1', 66375833:90587544)>
<Supercontig('supercontig_2', 94987544:199501827)>
Return the continuous data corresponding to this bp slice
Parameter: | key (<base_key>[, <track_key>]) – key must index or slice bases, but can also index, slice, or directly specify (string or list of strings) the data tracks. |
---|---|
Returns: | numpy.array |
If slice is taken over or outside a supercontig boundary, missing data is filled in with NaN’s automatically and a warning is printed.
Typical use:
>>> chromosome = genome["chr4"]
>>> chromosome[0:5] # Get all data for the first five bases of chr4
>>> chromosome[0, 0:2] # Get data for first two tracks at chr4:0
>>> chromosome[100, "ctcf"] # Get "ctcf" track value at chr4:100
Close the current chromosome file.
This only needs to be called when genomedata is not being used as a context manager. Using genomedata as a context manager makes life easy by memoizing chromosome access and guaranteeing the proper cleanup. See Genome.
Return the column index of the trackname in the continuous data.
Parameter: | trackname (string) – name of data track |
---|---|
Returns: | integer |
This is used for efficient indexing into continuous data:
>>> chromosome = genome["chr3"]
>>> col_index = chromosome.index_continuous("sample_track")
>>> data = chromosome[100:150, col_index]
although for typical use, the track can be indexed directly:
>>> data = chromosome[100:150, "sample_track"]
Return a generator over all supercontig, continuous pairs.
This is the best way to efficiently iterate over the data since all specified data is in supercontigs:
for supercontig, continuous in chromosome.itercontinuous():
print supercontig, supercontig.start, supercontig.end
[...]
Return the supercontig that contains this range if possible.
Returns: | Supercontig |
---|
Indexable with a slice or simple index:
>>> chromosome.supercontigs[100]
[<Supercontig('supercontig_0', 0:66115833)>]
>>> chromosome.supercontigs[1:100000000]
[<Supercontig('supercontig_0', 0:66115833)>, <Supercontig('supercontig_1', 66375833:90587544)>, <Supercontig('supercontig_2', 94987544:199501827)>]
>>> chromosome.supercontigs[66115833:66375833] # Between two supercontigs
[]
A container for a segment of data in one chromosome.
Implemented via a HDF5 Group
Return the underlying continuous data in this supercontig.
Returns: | numpy.array |
---|
Project chromsomal coordinates to supercontig coordinates.
Parameter: | pos (integer) – chromosome coordinate |
---|---|
Returns: | integer |
Return the genomic sequence of this supercontig.
If the index or slice spans a non-supercontig range, N’s are inserted in place of the missing data and a warning is issued.
Example:
>>> chromosome = genome["chr1"]
>>> for supercontig in chromosome:
... print repr(supercontig)
...
<Supercontig('supercontig_0', 0:121186957)>
<Supercontig('supercontig_1', 141476957:143422081)>
<Supercontig('supercontig_2', 143522081:247249719)>
>>> chromosome.seq[0:10].tostring() # Inside supercontig
'taaccctaac'
>>> chromosome.seq[121186950:121186970].tostring() # Supercontig boundary
'agAATTCNNNNNNNNNNNNN'
>>> chromosome.seq[121186957:121186960].tostring() # Not in supercontig
/net/noble/vol2/home/stasis/arch/Linux/RHEL5/i686/lib/python2.5/genomedata-0.1.7.dev_r2548-py2.5.egg/genomedata/__init__.py:709: UserWarning: slice of chromosome sequence does not overlap any supercontig (filling with 'N')
warn("slice of chromosome sequence does not overlap any"
'NNN'
There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of genomedata:
https://mailman1.u.washington.edu/mailman/listinfo/genomedata-announce
If you want to report a bug or request a feature, please do so using the Genomedata issue tracker:
http://code.google.com/p/genomedata/issues
For other support with Genomedata, or to provide feedback, please write contact the authors directly. We are interested in all comments regarding the package and the ease of use of installation and documentation.