___________________________________________________________________
Created by Ferhat Ay, Timothy Bailey and William Noble
January 19th, 2014
___________________________________________________________________

Fit-Hi-C is a tool for assigning statistical confidence estimates
to intra-chromosomal contact maps produced by genome architecture
assays.

___________________________________________________________________

HOW TO INSTALL DEPENDENCIES 

To run the fit-hi-c software, the following must be installed on a
Unix machine (tested on LinuxMint Maya and RedHat 5).

1- Python 2.7 or higher with the libraries below installed:
  a. Scipy 
  b. Numpy
  c. rpy2 (http://rpy.sourceforge.net/rpy2_download.html)

  Once these libraries are installed you can test them by typing
  "python" in a terminal window, followed by the import statements:

	import numpy as np
	from scipy import *
	import rpy2.robjects as ro

  The output should look like this:

	Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
	[GCC 4.6.3] on linux2
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import numpy as np
	>>> from scipy import *
	>>> import rpy2.robjects as ro
	>>> 
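
  The same check can be scripted. The sketch below is not part of
  fit-hi-c; the helper name missing_modules is hypothetical. It tries
  to import each module by name and returns the names that fail, so
  all missing dependencies are reported in one run.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # Any failure during import is treated as "not usable".
            # A broad catch is used because a package can be present
            # and still fail to load (for example, a missing backend).
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Modules named in the dependency list above.
    absent = missing_modules(["numpy", "scipy", "rpy2.robjects"])
    if absent:
        print("Missing: " + ", ".join(absent))
    else:
        print("All Python dependencies import cleanly.")
```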

2- R 2.11.1 or higher with the package below installed:

  fdrtool (http://strimmerlab.org/software/fdrtool/index.html)

  To install this package, start R in a terminal and enter the
  install statement:

    	install.packages("fdrtool")

  Select a CRAN mirror to proceed with the install. A successful 
  install should end with:

	....
	* DONE (fdrtool)
	The downloaded packages are in
	‘/tmp/RtmpIPfChJ/downloaded_packages’

___________________________________________________________________

HOW TO EXTRACT THE SOFTWARE AND SAMPLE DATA SETS

1- Extract the fit-hi-c.tgz file in a directory named fit-hi-c

	tar xzvf fit-hi-c.tgz

Once the dependencies above are installed and the archive is
extracted, the software is ready for use.

___________________________________________________________________


HOW TO RUN THE SOFTWARE

The tar ball contains a runall script that will run fit-hi-c on 
four sample data sets with default parameters. 

To simply run this script type

	./runall

To read the usage information, run the Python script fit-hi-c.py with -h:

	python bin/fit-hi-c.py -h

----	
Usage: fit-hi-c.py [options]

Options:
  -h, --help            show this help message and exit
  -f FRAGSFILE, --fragments=FRAGSFILE
                        midpoints (or start indices) of the fragments are read
                        from FRAGSFILE
  -i INTERSFILE, --interactions=INTERSFILE
                        interactions between fragment pairs are read from
                        INTERSFILE
  -o OUTDIR, --outdir=OUTDIR
                        where the output files will be written
  -t BIASFILE, --biases=BIASFILE
                        OPTIONAL: biases calculated by ICE for each locus are
                        read from BIASFILE
  -p NOOFPASSES, --passes=NOOFPASSES
                        OPTIONAL: number of passes after the initial (before)
                        fit. DEFAULT is 1 (after)
  -b NOOFBINS, --noOfBins=NOOFBINS
                        OPTIONAL: number of equal-occupancy (count) bins.
                        Default is 100
  -m MAPPABILITYTHRESHOLD, --mappabilityThres=MAPPABILITYTHRESHOLD
                        OPTIONAL: minimum number of hits per locus that has to
                        exist to call it mappable. DEFAULT is 1.
  -l LIBNAME, --lib=LIBNAME
                        OPTIONAL: Name of the library that is analyzed to be
                        used for plots.
  -U DISTUPTHRES, --upperbound=DISTUPTHRES
                        OPTIONAL: upper bound on the intra-chromosomal
                        distance range (unit: base pairs). DEFAULT no limit.
  -L DISTLOWTHRES, --lowerbound=DISTLOWTHRES
                        OPTIONAL: lower bound on the intra-chromosomal
                        distance range (unit: base pairs). DEFAULT no limit.
  -v, --visual          OPTIONAL: use this flag for generating plots. DEFAULT
                        is False.
  -q, --quiet           OPTIONAL: use this flag for omitting plots. DEFAULT
                        behavior.
  -V, --version         fit-hi-c version 1.0.1.  A tool for assigning
                        statistical confidence estimates to intra-chromosomal
                        contact maps produced by genome architecture assays.
                        Released on January 19, 2014.  Method developed by
                        Ferhat Ay, Timothy Bailey and William Noble.
                        Implemented by Ferhat Ay (ferhatay@uw.edu).
                        Copyright (c), 2012, University of Washington.  This
                        software is offered under an MIT license.  For
                        details: http://opensource.org/licenses/MIT

----	

To run fit-hi-c with different parameter settings or on different
data sets, follow the steps below from within the fit-hi-c directory.
The example below uses the HindIII library from the Duan et al. data.

1-  Locate your input files. There are two input files you need.

	- File with list of fragments. This will be passed with -f flag.

		-f data/fragmentLists/Duan_yeast_HindIII

	- File with list of contact counts. This will be passed with -i flag.

		-i data/contactCounts/Duan_yeast_HindIII

2- Choose a genomic distance range (mid-range) for confidence estimate
   assignments. The units are always in base pairs (bp) and NOT Kb.

	- Lower bound on mid-range distances. This will be passed with
	  -L flag.  The rule of thumb here is to avoid distances lower
	  than an average meta-fragment length. When 10 consecutive RE
	  fragments are used per meta-fragment, use at least 50000 bp.
	  To have no lower bound, omit this argument or pass -1.  In
	  our example here a resolution of 1 RE fragment is used to
	  process Duan_yeast_HindIII, and we use 20000 as the lower
	  bound.  Default value is -1.

		-L 20000

	- Upper bound on mid-range distances. This will be passed with
	  -U flag.  Like the lower bound, it can be disabled by
	  passing -1.  For this example we use 200000 as the upper
	  bound. Default value is -1.

		-U 200000

3- Choose the number of steps that will be used to assign confidence
   estimates.  For instance, if 3 is selected then the initial spline
   fit (spline-1) plus 2 steps of refinement of the null model will
   be applied. Results from each step are written to separate
   files.  This will be passed with -p flag.  Default value is 2.

		-p 3

4- Choose the number of equal-occupancy bins that will be used for
   spline fit.  This will be passed with -b flag. Default value is
   100.

		-b 200

5- Choose a prefix that will be added in front of each file name that
   fit-hi-c writes. To place the output files in a specific
   directory, give that directory with the -o option in addition to
   the prefix. The prefix will be passed with -l flag.
   For this example we want output files to be under the outputs
   directory. Default is "", meaning no prefix, with files written
   to the current directory.

		-o outputs -l "Duan_yeast_HindIII-customSettings"

6- Run the Python script fit-hi-c.py with the selected parameters.

	python bin/fit-hi-c.py -f data/fragmentLists/Duan_yeast_HindIII \
		-i data/contactCounts/Duan_yeast_HindIII -L 20000 -U 200000 \
		-p 3 -b 200 -o outputs -l "Duan_yeast_HindIII-customSettings"
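
When several libraries or parameter settings need to be run, it can
help to assemble this command in a script. The sketch below is one
possible wrapper, not part of fit-hi-c: the function name
build_command and its keyword arguments are hypothetical, while the
flags themselves (-f, -i, -o, -l, -L, -U, -p, -b) are taken from the
usage text above. It only builds the argument list; the call that
launches the run is left commented out.

```python
import subprocess


def build_command(fragments, interactions, outdir, lib,
                  lower=None, upper=None, passes=None, bins=None,
                  script="bin/fit-hi-c.py"):
    """Assemble the fit-hi-c argument list from the chosen settings."""
    cmd = ["python", script, "-f", fragments, "-i", interactions,
           "-o", outdir, "-l", lib]
    # Optional flags are added only when a value is supplied, so the
    # tool's own defaults apply otherwise.
    for flag, value in (("-L", lower), ("-U", upper),
                        ("-p", passes), ("-b", bins)):
        if value is not None:
            cmd.extend([flag, str(value)])
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "data/fragmentLists/Duan_yeast_HindIII",
        "data/contactCounts/Duan_yeast_HindIII",
        "outputs", "Duan_yeast_HindIII-customSettings",
        lower=20000, upper=200000, passes=3, bins=200)
    print(" ".join(cmd))
    # To launch the run from the fit-hi-c directory, uncomment:
    # subprocess.check_call(cmd)
```

Passing the arguments as a list avoids shell quoting problems with
file names or prefixes that contain spaces.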

___________________________________________________________________


HOW TO PREPARE INPUT FILES FOR fit-hi-c

fit-hi-c requires two main input files. As long as these files are
provided, fit-hi-c can be applied to data sets from different
experiments (Hi-C, 5C, ChIA-PET), using either a resolution that
depends on restriction fragments or fixed-size genomic windows.

-- The first file contains a list of fragments/windows/meta-fragments.
   Each line has 5 fields, separated by spaces or tabs. The first
   field is the chromosome name or number, the third field is the
   coordinate of the midpoint of the fragment on that chromosome, and
   the fourth field is the total number of observed mid-range reads
   (contact counts) that involve the specified fragment. The second
   and fifth fields can be any integer as they are not needed in most
   cases. All possible fragments need to be listed in this file. An
   example file looks like this (the quoted header line is not part
   of the input):

"chr	extraField	fragmentMid	marginalizedContactCount	mappable? (0/1)"
1	0		15000		234				1
1	0		25000		0				0
...
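
A file in this format can be checked before it is passed to fit-hi-c.
The sketch below is an illustration only: the names Fragment and
parse_fragments are hypothetical, and it assumes that all four
numeric fields are integers, as in the example lines above. It splits
each line on whitespace (so spaces and tabs both work) and rejects
lines that do not have exactly 5 fields.

```python
from collections import namedtuple

# Field names follow the quoted header in the example above.
Fragment = namedtuple("Fragment", "chrom extra mid count mappable")


def parse_fragments(lines):
    """Parse whitespace-separated fragment lines into Fragment records."""
    fragments = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError("expected 5 fields, got %d: %r"
                             % (len(fields), line))
        chrom, extra, mid, count, mappable = fields
        # The chromosome stays a string because it may be a name.
        fragments.append(Fragment(chrom, int(extra), int(mid),
                                  int(count), int(mappable)))
    return fragments


if __name__ == "__main__":
    sample = ["1\t0\t15000\t234\t1", "1 0 25000 0 0"]
    for frag in parse_fragments(sample):
        print(frag)
```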


-- The second file contains a list of mid-range contacts between the
   fragments/windows/meta-fragments listed in the first file
   above. Each fragment is represented by its chromosome and
   midpoint coordinate. Each line has 5 fields, separated by spaces
   or tabs. The first two fields identify the first fragment, the
   next two identify the second fragment, and the fifth field is the
   number of contacts between these two fragments. Only the fragment
   pairs with non-zero contact counts are listed in this file. An
   example file looks like this (the quoted header line is not part
   of the input):

"chr1	fragmentMid1	chr2	fragmentMid2	contactCount"
1	15000		1	35000				23
1	15000		1	55000				12
...
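
The fourth field of the fragments file (the per-fragment total of
contact counts) can be derived from a contact file in this format.
The sketch below shows one way to do so; it is not taken from
fit-hi-c, the function name marginal_counts is hypothetical, and it
assumes the contact file already contains only the mid-range contacts
that should be counted. Each contact is added to the totals of both
fragments it involves.

```python
from collections import defaultdict


def marginal_counts(contact_lines):
    """Sum contact counts per fragment, keyed by (chromosome, midpoint)."""
    totals = defaultdict(int)
    for line in contact_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chr1, mid1, chr2, mid2, count = fields
        count = int(count)
        # Each contact involves both of its fragments.
        totals[(chr1, int(mid1))] += count
        totals[(chr2, int(mid2))] += count
    return dict(totals)


if __name__ == "__main__":
    sample = ["1\t15000\t1\t35000\t23", "1 15000 1 55000 12"]
    for (chrom, mid), total in sorted(marginal_counts(sample).items()):
        print("%s\t%d\t%d" % (chrom, mid, total))
```

Fragments that never appear in the contact file are absent from the
result; since the fragments file must list all possible fragments,
those would be written with a count of 0.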

___________________________________________________________________

SAMPLE DATA SETS	

This tar ball includes four sample datasets. 

1- Duan_yeast_HindIII: Aggregate of all replicates generated using
   cross-linked DNA and HindIII digestion by Duan et al. This dataset
   is processed using the natural resolution of HindIII restriction
   fragments.

2- Duan_yeast_EcoRI: Aggregate of all replicates generated using
   cross-linked DNA and EcoRI digestion by Duan et al. This dataset is
   processed using the natural resolution of EcoRI restriction
   fragments.

3- Dixon_hESC_HindIII_hg18_combineFrags10_chr1: Aggregate of all
   replicates generated using human embryonic stem cell line and
   HindIII digestion by Dixon et al. This dataset is processed using
   meta-fragments that correspond to 10 consecutive HindIII
   restriction fragments. Only chromosome 1 is provided because the
   whole-genome files are large.

4- Dixon_mESC_HindIII_mm9_combineFrags10_chr1: Aggregate of all
   replicates generated using mouse embryonic stem cell line and
   HindIII digestion by Dixon et al. This dataset is processed using
   meta-fragments that correspond to 10 consecutive HindIII
   restriction fragments. Only chromosome 1 is provided because the
   whole-genome files are large.

For more data sets, or for help processing your own data with
fit-hi-c, please contact ferhatay@uw.edu.

___________________________________________________________________

OUTPUT FILES AND THEIR FORMAT

Each step of fit-hi-c, the number of which is user-defined through the
-p flag, generates two output files. For step N and library name
prefix denoted by ${PREFIX} the two output files will have the
following names:

1- ${PREFIX}.fithic_passN.txt 
2- ${PREFIX}.spline_passN.significances.txt.gz


The first file will report the results of equal occupancy binning in
five fields:

"avgGenomicDist	contactProbability	standardError	noOfLocusPairs	totalOfContactCounts"
20077	2.38e-05	2.11e-06	210	19574
20228	1.88e-05	1.44e-06	268	19662
...
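
To illustrate what equal-occupancy binning means, the sketch below
groups locus pairs by genomic distance so that each bin holds roughly
the same total contact count. It is a simplified illustration and not
the fit-hi-c implementation: the function name equal_occupancy_bins
is hypothetical, it uses only the pairs it is given, and it does not
compute the contactProbability or standardError columns, whose
derivation is not described in this document.

```python
def equal_occupancy_bins(pairs, n_bins):
    """Group (distance, count) pairs into bins of roughly equal total count.

    Returns a list of (avg_distance, n_pairs, total_count) tuples,
    ordered by increasing distance.
    """
    pairs = sorted(pairs)
    grand_total = sum(count for _, count in pairs)
    target = grand_total / float(n_bins)
    bins = []
    dist_sum, n_pairs, count_sum = 0, 0, 0
    for distance, count in pairs:
        dist_sum += distance
        n_pairs += 1
        count_sum += count
        # Close the bin once it holds its share of the contact counts.
        if count_sum >= target:
            bins.append((dist_sum / float(n_pairs), n_pairs, count_sum))
            dist_sum, n_pairs, count_sum = 0, 0, 0
    if n_pairs:
        # Leftover pairs form a final, smaller bin.
        bins.append((dist_sum / float(n_pairs), n_pairs, count_sum))
    return bins


if __name__ == "__main__":
    demo = [(100, 5), (200, 5), (300, 10), (400, 4), (500, 6)]
    for avg_dist, n_pairs, total in equal_occupancy_bins(demo, 3):
        print("%.0f\t%d\t%d" % (avg_dist, n_pairs, total))
```

Because a bin is closed only after a whole pair is added, bin totals
are approximately rather than exactly equal, as in the sample output
above (19574 and 19662).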

The second file contains exactly the same lines as the input file
that lists the mid-range contacts. That input file has 5 fields, as
described above. The output from each step appends two more columns
to these fields, namely p-value and q-value.

"chr1	fragmentMid1	chr2	fragmentMid2	contactCount	p-value	q-value"
10	100695	10	127796	11	1.000000e+00	1.000000e+00
10	104051	10	229415	12	2.544592e-02	1.202603e-01
10	104051	10	231999	15	1.506105e-03	9.463644e-03
...
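
A common next step is to keep only the contacts below a chosen
q-value cutoff. The sketch below shows one way to do this for a file
in the 7-column format above. It is not part of fit-hi-c: the
function names and the 0.05 default cutoff are assumptions made for
illustration, and the code is written for Python 3 (it opens the
gzipped file in text mode). Lines whose last two fields are not
numeric are skipped, so it works whether or not the file starts with
a header line.

```python
import gzip


def significant_contacts(lines, q_threshold=0.05):
    """Yield (chr1, mid1, chr2, mid2, count, p, q) rows with q <= q_threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        try:
            p_value, q_value = float(fields[5]), float(fields[6])
        except ValueError:
            continue  # skip a header line, if the file has one
        if q_value <= q_threshold:
            yield (fields[0], int(fields[1]), fields[2], int(fields[3]),
                   int(fields[4]), p_value, q_value)


def read_significances(path, q_threshold=0.05):
    """Read a gzipped significances file and return the rows that pass."""
    # "rt" opens the gzip stream as text (Python 3).
    with gzip.open(path, "rt") as handle:
        return list(significant_contacts(handle, q_threshold))


if __name__ == "__main__":
    sample = [
        "10\t100695\t10\t127796\t11\t1.000000e+00\t1.000000e+00",
        "10\t104051\t10\t229415\t12\t2.544592e-02\t1.202603e-01",
        "10\t104051\t10\t231999\t15\t1.506105e-03\t9.463644e-03",
    ]
    for row in significant_contacts(sample):
        print(row)
```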


___________________________________________________________________

COPYRIGHT

This software is offered under an MIT license. For details:
http://opensource.org/licenses/MIT

Copyright (c), 2012, University of Washington

Permission is hereby granted, free of charge, to any person 
obtaining a copy of this software and associated documentation 
files (the "Software"), to deal in the Software without restriction, 
including without limitation the rights to use, copy, modify, merge, 
publish, distribute, sublicense, and/or sell copies of the Software, 
and to permit persons to whom the Software is furnished to do so, 
subject to the following conditions:

The above copyright notice and this permission notice shall be 
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES 
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND 
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS 
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


___________________________________________________________________

CONTACT

For any problem or request about the software, please contact
Ferhat Ay <ferhatay@uw.edu>.