Genomedata

Hoffman MM, Buske OJ, Noble WS. 2010. The Genomedata format for storing large-scale functional genomics data. Bioinformatics, 26(11):1458-1459; doi:10.1093/bioinformatics/btq164

Genomedata is a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data while retaining a small disk space footprint. We have also developed utilities to load data into this format. A reference implementation, written in Python with C components, is available here under the GNU General Public License.

Installation

The easiest way to install Genomedata and its prerequisites, and to set up your environment to use them, is to run our interactive install script. Type these two commands on your Linux/Unix system*:

wget http://noble.gs.washington.edu/proj/genomedata/install.py
python install.py

To upgrade an existing Genomedata installation to the latest version, type the following command at the shell prompt:

easy_install -U genomedata

* We have only tested this software on Linux and Mac systems. We would love to extend our support to other systems in the future, and we would gladly accept any contributions toward this end. Specifically, we have successfully installed Genomedata on the following platforms:

Documentation

Genomedata is briefly described in the Bioinformatics application note cited and linked at the top of this page.

The application's documentation is available in two formats:

Source code

Version 1.3.3

In the latest release:

1.3.3:

* genomedata-query: new command that prints data from a Genomedata archive for your
  non-Python scripting needs (thanks to Max Libbrecht)
* genomedata-histogram: new command that prints histograms from a Genomedata archive
  (combination of a new module by Max Libbrecht and an old module by Michael Hoffman)
* genomedata-info: add "contigs" subcommand (thanks to Max Libbrecht)
* genomedata-info: friendlier error when unsupported command name used
* genomedata-load-data: friendlier errors when invalid BED3+1/bedGraph data supplied
* genomedata-load-seq: always makes chromosome and supercontig
  coordinates with unsigned 32-bit integers instead of system int
* genomedata-load-data: more detailed error message when initial file open fails
* genomedata-load-data: bugfix
* now compile with -Wextra
* doc fixes

1.3.2:

* API: now allow array of tracks. For example: chromosome[245:270, array([7, 5])]

1.3.1:

* API: now allow lists of tracks when directly accessing chromosome data, for example:
  chromosome[245:270, ["data1", "data3"]] or chromosome[245:270, [7, 5]]
* genomedata-load-seq: add --assembly option which supports AGP files,
  so that loading seq can be skipped while still dealing with assembly
  gaps properly
* genomedata-load: now supports --assembly and --sizes options
* genomedata-load-assembly: alias for genomedata-load-seq.
  genomedata-load-seq will be deprecated in the future
* genomedata-load-data: now support DOS-style line endings ("\r\n")
* genomedata-load: print genomedata-load-data error code on failure
* genomedata-load-data: print more informative messages when ignoring data
* genomedata-load: all diagnostics messages to stderr
* genomedata-load: some diagnostics now include timestamp so we can
  see where performance bottlenecks are
* genomedata-load: more descriptive error messages
* genomedata-load-seq: print more descriptive error message when
  attempting to load sequence from a non-FASTA file
* genomedata-load: fixed issue 10: now compiles on gcc 4.6.2
* docs: add links to source code
* docs: genomedata-load: sequence "option" is mandatory. In a future
  version, we should change this to an argument to reflect this.
* test: add tests for DOS-style line-endings

1.3.0:

* genomedata supercontigs are no longer guaranteed to have seq data
* add --sizes option to genomedata-load-seq, so that loading seq can be skipped
* Genome.add_track_continuous() has a significant performance
  improvement. This also means that genomedata-open-data will run much
  faster, as well as genomedata-load-data on fresh tracks
* fix bug where genomedata-load-seq didn't work
* fix bug where directory genomedata archive didn't work with only one chromosome

1.2.3:

* allow use with PyTables >=2.2
* new command: genomedata-info: "genomedata-info tracknames ARCHIVE"
  prints the tracknames for ARCHIVE
* Genome.format_version will now return 0 when files are missing a
  genomedata_format_version attribute
* Genome.__init__: future-proof to future versions of file format by throwing an error
* tests: add regression tests, lots of changes
* docs: add man pages

1.2.2:

* genomedata-load: will now support track filenames with "=" in the names
* genomedata-load: now supports UNIX glob wildcards as arguments to -s
* genomedata-load-data: allow other delimiters besides space for
  variableStep and fixedStep, allow wiggle_0 track specification
* genomedata-load-data, genomedata-load: remove unused --chunk-size option
* genomedata-close-data: fix bug where chunk_starts, chunk_ends not
  written for supercontigs with zero present data
* installation: move from path.py to forked-path
* docs: fixed small errors
* various: removed exclamation marks from error messages. It's not *that* exciting.
* some portability improvements
* tests: improve unit test interface

1.2.1:

* Fixed an installation bug where HDF5 installations later in
  LIBRARY_PATH might override those specified first, leading to
  linking errors during build.
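
The 1.3.1 and 1.3.2 notes above describe selecting several tracks at once, either by name or by integer index, when slicing a chromosome. The following is a minimal sketch of what those indexing forms mean, not the Genomedata implementation: ToyChromosome and its data are hypothetical stand-ins written in pure Python, and the real library is expected to work with NumPy arrays read from an archive rather than nested lists.

```python
# Toy stand-in for the track-selection semantics; NOT the Genomedata implementation.
class ToyChromosome:
    """Rows of per-position values, one column per named track (hypothetical class)."""

    def __init__(self, tracknames, rows):
        self.tracknames = list(tracknames)
        self.rows = [list(row) for row in rows]

    def _column(self, track):
        # Accept either a track name or an integer column index.
        if isinstance(track, str):
            return self.tracknames.index(track)
        return int(track)

    def __getitem__(self, key):
        positions, tracks = key
        # A single name or index is treated as a one-element selection.
        if isinstance(tracks, (str, int)):
            tracks = [tracks]
        columns = [self._column(track) for track in tracks]
        # Columns come back in the order requested, not in storage order.
        return [[row[col] for col in columns] for row in self.rows[positions]]


# Made-up data: three tracks over 300 positions.
chromosome = ToyChromosome(
    ["data1", "data2", "data3"],
    [[pos, pos * 10, pos * 100] for pos in range(300)],
)

by_name = chromosome[245:247, ["data1", "data3"]]
by_index = chromosome[245:247, [2, 0]]
print(by_name)
print(by_index)
```

In this sketch the returned columns follow the order of the requested tracks, so [2, 0] yields the third track first. Whether the real API behaves identically in every case should be checked against the Genomedata documentation.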

Example scripts

genomedata_random_access.py : Given genomic positions on stdin, prints the corresponding data values for a set of tracks in a Genomedata collection.
genomedata_offline_random_access.py : Similar to genomedata_random_access.py, except the full set of input positions is first read, sorted, and then Genomedata is scanned for these locations. Close to constant-time performance in the number of input positions.
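
The "read everything, sort, then scan" idea behind the offline script can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: offline_lookup is a hypothetical helper, a plain dict of lists stands in for a Genomedata archive, and restoring the results to input order is our choice rather than documented behavior of the real script.

```python
# Sketch of the sort-then-scan access pattern; the data layout is hypothetical.
def offline_lookup(positions, data):
    """Return values for (chrom, pos) queries, in the order the queries were given.

    positions: list of (chrom, pos) tuples.
    data: dict mapping chrom -> list of values indexed by position
          (a stand-in for an archive on disk).
    """
    # Sort query indices by (chrom, pos) while remembering the original order.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    results = [None] * len(positions)
    for i in order:
        chrom, pos = positions[i]
        # Visiting queries in sorted order keeps accesses within each
        # chromosome moving forward instead of jumping back and forth.
        results[i] = data[chrom][pos]
    return results


# Made-up data and queries for illustration.
data = {"chr1": [0.0, 1.5, 2.5, 3.5], "chr2": [9.0, 8.0, 7.0]}
queries = [("chr2", 1), ("chr1", 3), ("chr1", 0)]
values = offline_lookup(queries, data)
print(values)
```

With an on-disk archive, the benefit of this pattern would come from turning scattered reads into a mostly sequential pass. The in-memory stand-in here only demonstrates the ordering logic.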

Support

There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of Genomedata.

There is also a genomedata-users mailing list for general discussion and questions about the use of the Genomedata system.

If you want to report a bug or request a feature, please do so using the Genomedata issue tracker.

For other support with Genomedata, or to provide feedback, please e-mail Michael. We are interested in all comments regarding the package, its ease of installation, and its documentation.
