Hoffman MM, Buske OJ, Noble WS. 2010. The Genomedata format for storing large-scale functional genomics data. Bioinformatics, 26(11):1458-1459; doi:10.1093/bioinformatics/btq164
Genomedata is a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. A reference implementation, written in Python with C components, is available here under the GNU General Public License.
The easiest way to install Genomedata and its prerequisites, and to set up your environment to use them, is to run our interactive install script. Type these two commands on your Linux/Unix system:
wget http://noble.gs.washington.edu/proj/genomedata/install.py
python install.py
To upgrade an existing Genomedata installation to the latest version, type the following command at the shell prompt:
easy_install -U genomedata
Genomedata is briefly described in the Bioinformatics application note cited and linked at the top of this page.
The application's documentation is available in two formats:
1.3.3:
* genomedata-query: new command that prints data from a Genomedata archive for your non-Python scripting needs (thanks to Max Libbrecht)
* genomedata-histogram: new command that prints histograms from a Genomedata archive (combination of a new module by Max Libbrecht and an old module by Michael Hoffman)
* genomedata-info: add "contigs" subcommand (thanks to Max Libbrecht)
* genomedata-info: friendlier error when unsupported command name used
* genomedata-load-data: friendlier errors when invalid BED3+1/bedGraph data supplied
* genomedata-load-seq: always makes chromosome and supercontig coordinates with unsigned 32-bit integers instead of system int
* genomedata-load-data: more detailed error message when initial file open fails
* genomedata-load-data: bugfix
* now compile with -Wextra
* doc fixes

1.3.2:
* API: now allow array of tracks. For example: chromosome[245:270, array([7, 5])]

1.3.1:
* API: now allow lists of tracks when directly accessing chromosome data, for example: chromosome[245:270, ["data1", "data3"]] or chromosome[245:270, [7, 5]]
* genomedata-load-seq: add --assembly option which supports AGP files, to allow avoid loading seq while still dealing with assembly gaps properly
* genomedata-load: now supports --assembly and --sizes options
* genomedata-load-assembly: alias for genomedata-load-seq. genomedata-load-seq will be deprecated in the future
* genomedata-load-data: now support DOS-style line endings ("\r\n")
* genomedata-load: print genomedata-load-data error code on failure
* genomedata-load-data: print more informative messages when ignoring data
* genomedata-load: all diagnostics messages to stderr
* genomedata-load: some diagnostics now include timestamp so we can see where performance bottlenecks are
* genomedata-load: more descriptive error messages
* genomedata-load-seq: print more descriptive error message when attempting to load sequence from a non-FASTA file
* genomedata-load: fixed issue 10: now compiles on gcc 4.6.2
* docs: add links to source code
* docs: genomedata-load: sequence "option" is mandatory. In a future version, we should change this to an argument to reflect this.
* test: add tests for DOS-style line-endings

1.3.0:
* genomedata supercontigs are no longer guaranteed to have seq data
* add --sizes option to genomedata-load-seq, to allow avoid loading seq
* Genome.add_track_continuous() has a significant performance improvement. This also means that genomedata-open-data will run much faster, as well as genomedata-load-data on fresh tracks
* fix bug where genomedata-load-seq didn't work
* fix bug where directory genomedata archive didn't work with only one chromosome

1.2.3:
* allow use with PyTables >=2.2
* new command: genomedata-info: "genomedata-info tracknames ARCHIVE" prints the tracknames for ARCHIVE
* Genome.format_version will now return 0 when files are missing a genomedata_format_version attribute
* Genome.__init__: future-proof to future versions of file format by throwing an error
* tests: add regression tests, lots of changes
* docs: add man pages

1.2.2:
* genomedata-load: will now support track filenames with "=" in the names
* genomedata-load: now supports UNIX glob wildcards as arguments to -s
* genomedata-load-data: allow other delimiters besides space for variableStep and fixedStep, allow wiggle_0 track specification
* genomedata-load-data, genomedata-load: remove unused --chunk-size option
* genomedata-close-data: fix bug where chunk_starts, chunk_ends not written for supercontigs with zero present data
* installation: move from path.py to forked-path
* docs: fixed small errors
* various: removed exclamation marks from error messages. It's not *that* exciting.
* some portability improvements
* tests: improve unit test interface

1.2.1:
* Fixed an installation bug where HDF5 installations later in LIBRARY_PATH might override those specified first, leading to linking errors during build.
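The track-selection forms mentioned in the 1.3.1 and 1.3.2 notes above (a list of track names, a list of track indexes, or an array of indexes) can be illustrated without a Genomedata archive. The following is only a minimal sketch of that indexing style on a plain NumPy array. The `TrackTable` class, its track names, and its data are hypothetical stand-ins, not part of the Genomedata API or its implementation.

```python
import numpy as np


class TrackTable:
    """Hypothetical stand-in: rows are base positions, columns are tracks."""

    def __init__(self, tracknames, data):
        self.tracknames = list(tracknames)
        self.data = np.asarray(data, dtype=float)

    def _track_index(self, track):
        # Accept either a track name or an integer column index.
        if isinstance(track, str):
            return self.tracknames.index(track)
        return int(track)

    def __getitem__(self, key):
        rows, tracks = key
        if isinstance(tracks, (list, tuple, np.ndarray)):
            # Several tracks: resolve each one to a column index.
            cols = [self._track_index(track) for track in tracks]
        else:
            # A single track, given by name or by index.
            cols = self._track_index(tracks)
        return self.data[rows, cols]


# Made-up data: 10 positions by 4 tracks.
table = TrackTable(["data0", "data1", "data2", "data3"],
                   np.arange(40).reshape(10, 4))

by_name = table[2:5, ["data1", "data3"]]   # list of track names
by_index = table[2:5, [3, 1]]              # list of track indexes
by_array = table[2:5, np.array([3, 1])]    # array of track indexes
```

In this sketch the columns come back in the order requested, so `by_index` and `by_array` hold the same values, with the "data3" column first.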
There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of Genomedata.
There is also a genomedata-users mailing list for general discussion and questions about the use of the Genomedata system.
If you want to report a bug or request a feature, please do so using the Genomedata issue tracker.
For other support with Genomedata, or to provide feedback, please e-mail Michael. We are interested in all comments regarding the package and the ease of use of installation and documentation.