GO-TermFinder identifies the GO terms whose association with a list of genes is statistically significant, as judged by a p-value. For this analysis we use a modified version that calculates a two-tailed p-value, so that it can also identify terms that are under-represented in the list. GO-TermFinder is a Perl module, but it also provides two ready-to-run scripts that simplify the process of finding significant GO terms. analyze.pl produces a plain-text listing of the significant GO terms and their p-values. batchGOView.pl creates a plot of the GO graph with the nodes color-coded according to their p-values, and also produces an HTML table of the significant GO terms.
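To illustrate what a two-tailed test adds, the following is a minimal Python sketch of a two-tailed hypergeometric p-value. It is not the code of the modified GO-TermFinder; the function names are hypothetical, and the two-tailed convention used here (doubling the smaller tail and capping at 1) is an assumption, since the exact definition used by the modified module is not stated above.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of the n listed genes carry the term, when K of the
    N genes in the genome carry it."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def two_tailed_p(k, N, K, n):
    """Two-tailed p-value, ASSUMING the 'double the smaller tail' convention."""
    lo = max(0, n - (N - K))   # smallest possible count
    hi = min(n, K)             # largest possible count
    # Upper tail: probability of seeing k or more annotated genes (over-representation).
    upper = sum(hypergeom_pmf(i, N, K, n) for i in range(k, hi + 1))
    # Lower tail: probability of seeing k or fewer annotated genes (under-representation).
    lower = sum(hypergeom_pmf(i, N, K, n) for i in range(lo, k + 1))
    return min(1.0, 2.0 * min(upper, lower))
```

With a one-tailed (upper) test, a term that appears far less often than expected would get a p-value near 1 and never be reported; here the lower tail makes such a term significant as well. The p-values produced this way are uncorrected, so a multiple-testing correction would still be needed before comparing them with the corrected values in the summary tables.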
The inputs to analyze.pl and batchGOView.pl are files describing each aspect (function, process, and component) of the Gene Ontology graph, an annotation database assigning GO terms to genes, and a list of genes whose annotations are to be evaluated. The Gene Ontology files were obtained from the Gene Ontology Consortium. The format of the annotation database is described here.
The annotation file was built by downloading gene annotation data from Ensembl. The query to Ensembl requested the Ensembl Gene ID, Ensembl Transcript ID, External Gene ID, GO ID, GO Evidence Code, and RefSeq ID from Build 34 of the Human Genome and dbSNP121. The tab-delimited data returned by the query is here. This data is transformed into the required annotation file format by a Python script: make_go_association.py. The resulting annotation file is here. In this annotation file the annotated entities are the genes, identified by their Ensembl Gene IDs, and the Ensembl Transcript IDs are used as aliases for the genes. The annotation file contains only genes that have GO annotations. The unannotated genes are accounted for by supplying an estimate of the total number of genes for the organism. In the raw data downloaded from Ensembl there were 22291 genes, of which 12222 had GO annotations.
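The grouping step performed by make_go_association.py can be sketched roughly as follows. This is not the original script: the function name is hypothetical, the six columns are assumed to appear in the order listed in the query, and an empty GO ID field is assumed to mark a gene without annotations. Writing the result out in the required annotation file format is deliberately left out.

```python
import csv
from collections import defaultdict

def group_annotations(rows):
    """Group Ensembl rows by gene.

    rows: iterable of 6-field records in the ASSUMED order
    (gene ID, transcript ID, external gene ID, GO ID, evidence code, RefSeq ID).
    """
    all_genes = set()             # every gene seen, annotated or not
    terms = defaultdict(set)      # gene ID -> {(GO ID, evidence code)}
    aliases = defaultdict(set)    # gene ID -> {transcript IDs}
    for gene, transcript, _name, go_id, evidence, _refseq in rows:
        all_genes.add(gene)
        if not go_id:
            continue              # unannotated rows only count toward the total
        terms[gene].add((go_id, evidence))
        aliases[gene].add(transcript)
    return all_genes, terms, aliases

def read_rows(path):
    """Read the tab-delimited download (assumed to have no header row)."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")
```

Under these assumptions, len(all_genes) gives the total gene count and len(terms) the number of annotated genes, which correspond to the two figures quoted above.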
The HS sites are listed in a tab-delimited table. The Ensembl Transcript ID, RefSeq ID, and the distance from the HS site to the transcript start are listed for each site. The Python script filter_50kb.py picks out the Ensembl Transcript IDs that are within 50kb of an HS site. These IDs and the annotation file are then passed to analyze.pl and batchGOView.pl. Finally, the significant GO terms for each subset are collated into a single HTML table by the Python script make_summary_table.py.
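The distance filter can be sketched in a few lines of Python. This is an illustration rather than the contents of filter_50kb.py: the function name is hypothetical, and it assumes that each row holds the three fields in the order given above, that the distance is an integer number of base pairs which may be signed (upstream or downstream), and that exactly 50kb counts as within range.

```python
def transcripts_within(rows, max_distance=50_000):
    """Return the sorted, de-duplicated transcript IDs whose HS site lies
    within max_distance base pairs of the transcript start.

    rows: iterable of (transcript ID, RefSeq ID, distance) records (ASSUMED order).
    """
    selected = set()
    for transcript, _refseq, distance in rows:
        # abs() treats upstream and downstream sites alike (an assumption).
        if abs(int(distance)) <= max_distance:
            selected.add(transcript)
    return sorted(selected)
```

Collecting the IDs in a set means that a transcript near several HS sites is passed to analyze.pl only once.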
Summary of corrected p-values of significant GO annotations for sets of HS sites
Summary of corrected p-values of significant GO annotations for sets of HS sites (tab delimited text)