Genomedata 1.2 documentation

Michael M. Hoffman <mmh1 at washington dot edu>

wget http://noble.gs.washington.edu/proj/genomedata/install.py
python install.py
Genome
Chromosomes
Chromosome
Supercontigs
Supercontig
continuous
continuous
Supercontigs
genomedata-load [-t trackname=signalfile]... [-s sequencefile]... GENOMEDATAFILE
>chr1
taaccctaaccctaaccctaaccctaaccctaaccctaaccctaacccta
accctaaccctaaccctaaccctaaccct
>chrY
ctaaccctaaccctaaccctaaccctaaccctaaccctCTGaaagtggac
fixedStep chrom=chr1 start=5 step=1
0.372
-2.540
0.371
-2.611
0.372
-2.320
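The fixedStep sample above can be read with a few lines of Python. This is only a minimal sketch, not the loader Genomedata itself uses: `parse_fixed_step` is a hypothetical helper, and it assumes the standard wiggle convention that `start` is one-based while Genomedata indexing is zero-based and half-open.

```python
def parse_fixed_step(lines):
    """Yield (chrom, start, end, value) in zero-based, half-open coordinates."""
    chrom = None
    pos = None   # zero-based position of the next value
    step = 1
    span = 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("track"):
            continue  # skip blanks, comments, and track header lines
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"]) - 1  # wiggle starts are one-based
            step = int(fields.get("step", 1))
            span = int(fields.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line before fixedStep declaration")
        yield (chrom, pos, pos + span, float(line))
        pos += step


wigfix = """fixedStep chrom=chr1 start=5 step=1
0.372
-2.540
0.371
"""
records = list(parse_fixed_step(wigfix.splitlines()))
for record in records:
    print(record)
```

Under these assumptions, the first value lands at zero-based position 4 of chr1.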
chrY    0       12      4.67
chrY    20      23      9.24
chr1    1       3       2.71
chr1    3       6       1.61
chr1    6       24      3.14
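To see how interval records like the BED sample above map onto per-base values, the following sketch expands them into a dense list for one chromosome. `bed_to_dense` is a hypothetical helper written for illustration. Filling uncovered positions with NaN is an assumption, chosen because the array examples elsewhere in this documentation show NaN where no data is present.

```python
import math


def bed_to_dense(lines, chrom, length):
    """Expand 4-column BED lines into one value per base for one chromosome."""
    values = [math.nan] * length  # positions without data stay NaN
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] != chrom:
            continue
        start, end, value = int(fields[1]), int(fields[2]), float(fields[3])
        # BED intervals are zero-based and half-open: [start, end)
        for pos in range(start, min(end, length)):
            values[pos] = value
    return values


bed = """chr1    1       3       2.71
chr1    3       6       1.61
chr1    6       24      3.14
"""
dense = bed_to_dense(bed.splitlines(), "chr1", 8)
print(dense)
```

Position 0 is not covered by any interval in the sample, so it remains NaN.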
genomedata.test
genomedata-load -s chr1.fa -s chrY.fa.gz -t low=signal_low.wigFix \
    -t high=signal_high.bed.gz genomedata.test
genomedata-load-seq genomedata.test chr1.fa chrY.fa.gz
genomedata-open-data genomedata.test low high
genomedata-load-data genomedata.test low < signal_low.wigFix
zcat signal_high.bed.gz | genomedata-load-data genomedata.test high
genomedata-close-data genomedata.test
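When many tracks are loaded from a script, it can help to assemble the `genomedata-load` command line programmatically. The sketch below only builds the argument list, using the `-s` and `-t` options shown above. `build_load_command` is a hypothetical helper, not part of Genomedata.

```python
def build_load_command(archive, sequences, tracks):
    """Return an argument list for genomedata-load.

    sequences: iterable of sequence file names
    tracks: iterable of (trackname, datafile) pairs
    """
    cmd = ["genomedata-load"]
    for seqfile in sequences:
        cmd += ["-s", seqfile]
    for name, datafile in tracks:
        cmd += ["-t", "%s=%s" % (name, datafile)]
    cmd.append(archive)
    return cmd


cmd = build_load_command(
    "genomedata.test",
    ["chr1.fa", "chrY.fa.gz"],
    [("low", "signal_low.wigFix"), ("high", "signal_high.bed.gz")],
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.check_call(cmd)` on a machine where Genomedata is installed.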
from genomedata import Genome
[...]
gdfilename = "/path/to/genomedata/archive"
with Genome(gdfilename) as genome:
    [...]

Genome.close()
>>> chromosome = genome["chr2"]
>>> seq = chromosome.seq[1423:1433]
>>> seq
array([116,  99,  99,  99,  99, 103, 103, 103, 103, 103], dtype=uint8)
>>> seq.tostring()
'tccccggggg'

>>> chromosome = genome["chr8"]
>>> chromosome[999:1001, 0:3]  # Note the half-open, zero-based indexing
array([[ NaN,  NaN,  NaN],
       [ 3. ,  5.5,  3.5]], dtype=float32)
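Because indexing is zero-based and half-open, a region written in the one-based, inclusive style used by many genome browsers has to be shifted before slicing. A minimal sketch of that conversion follows. `to_slice` is a hypothetical helper, and the one-based inclusive input convention is an assumption about where the coordinates come from.

```python
def to_slice(start_1based, end_1based):
    """Convert a one-based, inclusive range to a zero-based, half-open slice."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return slice(start_1based - 1, end_1based)


region = to_slice(1000, 1001)      # bases 1000-1001, one-based inclusive
positions = list(range(2000))      # stand-in for a chromosome axis
print(region, positions[region])
```

The slice produced here is `999:1001`, the same base range used in the example above.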

>>> chromosome = genome["chr1"]
>>> data = chromosome[0:5, "sample_track"]
>>> data
array([ 47.,  NaN,  NaN,  NaN,  NaN], dtype=float32)

>>> from numpy import isfinite
>>> data[isfinite(data)]
array([ 47.], dtype=float32)
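Beyond dropping missing values, it is often useful to know where the defined stretches of a track lie. The sketch below finds them in a plain Python list so that it runs without an archive or NumPy. `defined_runs` is a hypothetical helper, and the sample values are made up for illustration.

```python
import math


def defined_runs(values):
    """Return zero-based, half-open (start, end) runs of finite values."""
    runs = []
    run_start = None
    for pos, value in enumerate(values):
        if not math.isfinite(value):
            # A missing value closes any run that is currently open
            if run_start is not None:
                runs.append((run_start, pos))
                run_start = None
        elif run_start is None:
            run_start = pos
    if run_start is not None:
        runs.append((run_start, len(values)))
    return runs


nan = float("nan")
track = [47.0, nan, nan, 3.0, 5.5, nan, 3.5]  # illustrative data only
runs = defined_runs(track)
print(runs)
```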

>>> col_index = chromosome.index_continuous("sample_track")
>>> data = chromosome[0:5, col_index:col_index+1]

.fa
.fa.gz
trackname=datafile
string
broad.h3k27me3
Usage: genomedata-load [OPTIONS] GENOMEDATAFILE

--track and --sequence may be repeated to specify multiple trackname=trackfile
pairings and sequence files, respectively

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SEQFILE, --sequence=SEQFILE
                        Add the sequence data in the specified file
  -t TRACK, --track=TRACK
                        Add data for the given track. TRACK should be
                        specified in the form: NAME=FILE, such as: -t
                        signal=signal.dat
genomedata-load-seq -s 'chr*.fa'
'chr*.fa'
"chr*.fa"
.fa
.fa.gz
Usage: genomedata-load-seq [OPTION]... GENOMEDATAFILE SEQFILE...

Options:
  -g, --gap-length  XXX: Implement this.
  --version         show program's version number and exit
  -h, --help        show this help message and exit
Usage: genomedata-open-data [OPTION]... GENOMEDATAFILE TRACKNAME...

Options:
  --version   show program's version number and exit
  -h, --help  show this help message and exit
Usage: genomedata-load-data [OPTION...] GENOMEDATAFILE TRACKNAME
Loads data into Genomedata format
Takes track data in on stdin

  -c, --chunk-size=NROWS     Chunk hdf5 data into blocks of NROWS. A higher
                             value increases compression but slows random
                             access. Must always be smaller than the max size
                             for a dataset. [default: 10000]
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.
Usage: genomedata-close-data [OPTION]... GENOMEDATAFILE

Options:
  --version   show program's version number and exit
  -h, --help  show this help message and exit
Usage: genomedata-erase-data [OPTION]... GENOMEDATAFILE TRACKNAME...

Erase the specified tracks from the Genomedata archive in such a way that
the track can be replaced (via genomedata-load-data).

Options:
  --version      show program's version number and exit
  -h, --help     show this help message and exit
  -v, --verbose  Print status updates and diagnostic messages
genomedata.
Genome
with Genome("/path/to/genomedata") as genome:
  chromosome = genome["chr1"]
  [...]

>>> genome = Genome("/path/to/genomedata")
>>> chromosome = genome["chr1"]
[...]
>>> genome.close()

__init__
>>> genome = Genome("./genomedata.ctcf.pol2b/")
>>> genome
Genome("./genomedata.ctcf.pol2b/")
    [...]
>>> genome.close()
>>> genome = Genome("./cat_chipseq.genomedata", mode="r")
    [...]
>>> genome.close()

__iter__
for chromosome in genome:
  print chromosome.name
  for supercontig, continuous in chromosome.itercontinuous():
    [...]

__getitem__
>>> genome["chrX"]
<Chromosome 'chrX', file='/path/to/genomedata/chrX.genomedata'>
>>> genome["chrZ"]
KeyError: 'Could not find chromosome: chrZ'

add_track_continuous
close
Genome
erase_data
format_version
isopen
maxs
means
mins
num_datapoints
num_tracks_continuous
sums
sums_squares
tracknames_continuous
vars
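The attribute names above suggest running summary statistics for each track. The source does not show how they are computed, so the following is only a sketch of one plausible relationship between them, assuming missing values (NaN) are excluded and that the variance is the population variance derived from the sum and sum of squares. `summarize` is a hypothetical helper operating on a plain list.

```python
import math


def summarize(values):
    """Compute summary statistics over the defined (non-NaN) values."""
    defined = [v for v in values if not math.isnan(v)]
    n = len(defined)
    total = sum(defined)
    total_sq = sum(v * v for v in defined)
    mean = total / n
    # Population variance from running sums (an assumption, see above)
    var = total_sq / n - mean * mean
    return {
        "num_datapoints": n,
        "sums": total,
        "sums_squares": total_sq,
        "means": mean,
        "vars": var,
        "mins": min(defined),
        "maxs": max(defined),
    }


stats = summarize([1.0, float("nan"), 2.0, 3.0])
print(stats)
```

Keeping only the count, sum, and sum of squares means such statistics can be updated as more data is loaded, without rereading earlier values.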
genomedata.
Chromosome
>>> with Genome("/path/to/genomedata") as genome:
...     chromosome = genome["chrX"]
...     chromosome
...
<Chromosome 'chrX', file='/path/to/genomedata/chrX.genomedata'>

__iter__
Chromosome.itercontinuous()
>>> for supercontig in chromosome:
...     supercontig  # calls repr()
...
<Supercontig 'supercontig_0', [0:66115833]>
<Supercontig 'supercontig_1', [66375833:90587544]>
<Supercontig 'supercontig_2', [94987544:199501827]>

__getitem__
>>> chromosome = genome["chr4"]
>>> chromosome[0:5]  # Get all data for the first five bases of chr4
>>> chromosome[0, 0:2]  # Get data for first two tracks at chr4:0
>>> chromosome[100, "ctcf"]  # Get "ctcf" track value at chr4:100

ChromosomeDirtyError
Chromosome.
attrs
Chromosome.
close
Genome.close()
Genome
Chromosome.
end
Genome.format_version
>>> chromosome.seq[chromosome.start:chromosome.end]

Chromosome.
index_continuous
>>> chromosome = genome["chr3"]
>>> col_index = chromosome.index_continuous("sample_track")
>>> data = chromosome[100:150, col_index]

>>> data = chromosome[100:150, "sample_track"]

Chromosome.
isopen
Chromosome.
itercontinuous
for supercontig, continuous in chromosome.itercontinuous():
    print supercontig, supercontig.start, supercontig.end
    [...]

Chromosome.
maxs
Genome.maxs
Chromosome.
mins
Genome.mins
Chromosome.
name
Chromosome.
num_datapoints
Genome.num_datapoints
Chromosome.
num_tracks_continuous
Chromosome.
seq
>>> chromosome = genome["chr1"]
>>> for supercontig in chromosome:
...     print repr(supercontig)
...
<Supercontig 'supercontig_0', [0:121186957]>
<Supercontig 'supercontig_1', [141476957:143422081]>
<Supercontig 'supercontig_2', [143522081:247249719]>
>>> chromosome.seq[0:10].tostring()  # Inside supercontig
'taaccctaac'
>>> chromosome.seq[121186950:121186970].tostring() # Supercontig boundary
'agAATTCNNNNNNNNNNNNN'
>>> chromosome.seq[121186957:121186960].tostring() # Not in supercontig
UserWarning: slice of chromosome sequence does not overlap any supercontig (filling with 'N')
'NNN'
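The gap-filling behaviour shown above can be pictured with a small stand-alone sketch: positions covered by a supercontig return its sequence, and everything else is filled with 'N'. `fetch_sequence` and the toy coordinates are hypothetical and are not the library's implementation.

```python
def fetch_sequence(supercontigs, start, end):
    """Return sequence for [start, end), filling uncovered bases with 'N'.

    supercontigs: list of (start, end, sequence) with non-overlapping ranges
    """
    result = ["N"] * (end - start)
    for sc_start, sc_end, sc_seq in supercontigs:
        lo = max(start, sc_start)
        hi = min(end, sc_end)
        if lo < hi:  # the query overlaps this supercontig
            result[lo - start:hi - start] = sc_seq[lo - sc_start:hi - sc_start]
    return "".join(result)


toy = [(0, 7, "agAATTC"), (12, 16, "ctga")]  # made-up supercontigs
print(fetch_sequence(toy, 0, 10))   # crosses a supercontig boundary
print(fetch_sequence(toy, 7, 10))   # entirely inside a gap
print(fetch_sequence(toy, 5, 14))   # spans a gap between two supercontigs
```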

>>> chromosome.seq[chromosome.start:chromosome.end]

Chromosome.
start
Genome.format_version
Chromosome.
sums
Genome.sums
Chromosome.
sums_squares
Genome.sums_squares
Chromosome.
supercontigs
>>> chromosome.supercontigs[100]
[<Supercontig 'supercontig_0', [0:66115833]>]
>>> chromosome.supercontigs[1:100000000]
[<Supercontig 'supercontig_0', [0:66115833]>, <Supercontig 'supercontig_1', [66375833:90587544]>, <Supercontig 'supercontig_2', [94987544:199501827]>]
>>> chromosome.supercontigs[66115833:66375833]  # Between two supercontigs
[]
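The lookups above behave like an interval-overlap query over zero-based, half-open ranges. The sketch below reproduces the three results with plain tuples. `overlapping` is a hypothetical helper, not the library's implementation, and the coordinates are copied from the example.

```python
def overlapping(supercontigs, start, end=None):
    """Return names of supercontigs that overlap [start, end)."""
    if end is None:
        end = start + 1  # a single position is a one-base range
    return [
        name
        for name, sc_start, sc_end in supercontigs
        if sc_start < end and start < sc_end
    ]


supercontigs = [
    ("supercontig_0", 0, 66115833),
    ("supercontig_1", 66375833, 90587544),
    ("supercontig_2", 94987544, 199501827),
]
print(overlapping(supercontigs, 100))
print(overlapping(supercontigs, 1, 100000000))
print(overlapping(supercontigs, 66115833, 66375833))  # between two supercontigs
```

Because ranges are half-open, a query that starts exactly at the end coordinate of one supercontig and stops at the start of the next overlaps neither.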

Chromosome.
tracknames_continuous
genomedata.
Supercontig
attrs
continuous
end
>>> supercontig.seq[supercontig.start:supercontig.end]

name
project
seq
Chromosome.seq
start
genomedata-announce
genomedata-users

Website:	http://noble.gs.washington.edu/proj/genomedata
Author:	Michael M. Hoffman <mmh1 at washington dot edu>
Organization:	University of Washington
Address:	Department of Genome Sciences, PO Box 355065, Seattle, WA 98195-5065, United States of America
Copyright:	2009 Michael M. Hoffman

Parameters:	name – name of the chromosome (e.g. “chr1” if chr1.genomedata is a file in the Genomedata archive or chr1 is a top-level group in the single-file Genomedata archive)
Returns:	Chromosome

Parameters:	key – key must index or slice bases, but can also index, slice, or directly specify (string or list of strings) the data tracks.
Returns:	numpy.array

Parameters:	trackname – name of data track
Returns:	integer

Parameters:	pos – chromosome coordinate
	bound – bound result to valid supercontig coordinates
Returns:	integer

Installation

Overview

Implementation

Creation

Example

Genomedata usage

Python interface

Basic usage

Command-line interface

genomedata-load

genomedata-load-seq

genomedata-open-data

genomedata-load-data

genomedata-close-data

genomedata-erase-data

Python API

Support
