Genomedata

Hoffman MM, Buske OJ, Noble WS. 2010. The Genomedata format for storing large-scale functional genomics data. Bioinformatics, 26(11):1458-1459; doi:10.1093/bioinformatics/btq164

Genomedata is a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data while retaining a small disk space footprint. We have also developed utilities to load data into this format. A reference implementation with Python and C components is available here under the GNU General Public License.

Installation

The easiest way to install Genomedata and its prerequisites, and to set up your environment to use them, is to run our interactive install script. Type these two commands on your Linux/Unix system*:

wget http://noble.gs.washington.edu/proj/genomedata/install.py
python install.py

To upgrade an existing Genomedata installation to the latest version, type the following command at the shell prompt:

easy_install -U genomedata

* We have only tested this software on Linux and Mac systems. We would love to extend our support to other systems in the future, and we would gladly accept any contributions toward this end. Specifically, we have successfully installed Genomedata on the following platforms:

Documentation

Genomedata is briefly described in the Bioinformatics application note cited and linked at the top of this page.

The application's documentation is available in two formats:

Source code

Version 1.2.2

In the latest release:

* Added support for adding tracks using genomedata-open-data and
  Genome.add_track_continuous().
* Added support for creating Genomedata archives without any tracks.
* Made chromosome.start and chromosome.end be based upon sequence instead
  of supercontigs.
* Made iter(chromosome) and chromosome.itercontinuous() yield supercontigs
  sorted by start index (instead of dictionary order).
* Fixed pointer dereference bug that could cause segfault in
  genomedata-load-data.
* Improved installation script robustness and clarity.

Example scripts

genomedata_random_access.py : Given genomic positions on stdin, prints the corresponding data values for a set of tracks in a Genomedata collection.
genomedata_offline_random_access.py : Similar to genomedata_random_access.py, except that the full set of input positions is first read and sorted, and the Genomedata collection is then scanned for these locations. Performance is close to constant time in the number of input positions.
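
The sort-then-scan idea behind the offline script can be illustrated with a minimal sketch. It does not use the Genomedata API or reproduce the script itself: the function name lookup_sorted, the (start, end, values) block layout, and the half-open coordinates are all assumptions made for illustration, standing in for the contiguous regions of one chromosome.

```python
def lookup_sorted(blocks, positions):
    """Look up values for many positions in a single forward pass.

    blocks: hypothetical stand-in for one chromosome's data, given as
        non-overlapping (start, end, values) tuples sorted by start,
        with half-open coordinates [start, end). Not the Genomedata API.
    positions: iterable of integer positions, in any order.
    Returns a dict mapping each position to its value, or None if the
    position falls outside every block.
    """
    results = {}
    block_index = 0
    # Sorting the queries first means the block cursor only moves forward.
    for pos in sorted(set(positions)):
        # Skip blocks that end at or before this position.
        while block_index < len(blocks) and blocks[block_index][1] <= pos:
            block_index += 1
        if block_index < len(blocks) and blocks[block_index][0] <= pos:
            start, _end, values = blocks[block_index]
            results[pos] = values[pos - start]
        else:
            results[pos] = None  # position lies in a gap between blocks
    return results


# Toy data: two blocks separated by a gap.
example_blocks = [(0, 3, [1.0, 2.0, 3.0]), (10, 12, [7.5, 8.5])]
example_result = lookup_sorted(example_blocks, [11, 1, 5, 0])
```

Because the positions are sorted before the scan, each block is visited at most once no matter how many positions are requested, which is why batching the input can pay off when the data are read from disk.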

Support

There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of Genomedata.

There is also a genomedata-users mailing list for general discussion and questions about the use of the Genomedata system.

If you want to report a bug or request a feature, please do so using the Genomedata issue tracker.

For other support with Genomedata, or to provide feedback, please e-mail Michael. We welcome all comments on the package and on the ease of use of its installation and documentation.

Michael Hoffman < mmh1 at uw period edu >