GO Term Analysis of Human HS Site/Chimp Deletion Overlaps

See chimpdel for an overview of this project.

GO-TermFinder identifies the GO terms that annotate a list of genes more often than expected by chance and reports a p-value for each. For this analysis we use a modified version that calculates a two-tailed p-value, so that it can also identify terms that are under-represented in the list. GO-TermFinder is a Perl module, but it also provides two ready-made scripts that simplify the process of finding significant GO terms. analyze.pl produces a plain-text listing of the significant GO terms and their p-values. batchGOView.pl creates a plot of the GO graph with the nodes color-coded according to their p-values, and also produces an HTML table of the significant GO terms.
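To make the idea of a two-tailed p-value concrete, here is a minimal Python sketch. It is not the code of the modified GO-TermFinder; it assumes a hypergeometric model for the number of annotated genes in the list, and it assumes the "double the smaller tail" convention, since this page does not say how the modified version defines its two-tailed value. The function names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pmf(k, total, annotated, list_size):
    """P(X = k): exactly k genes in a list of `list_size` carry the term,
    when `annotated` of the `total` genes carry it."""
    return (comb(annotated, k) * comb(total - annotated, list_size - k)
            / comb(total, list_size))


def two_tailed_p(k, total, annotated, list_size):
    """Two-tailed p-value for observing k annotated genes in the list.

    Assumption: the two-tailed value is twice the smaller of the two
    one-tailed probabilities, capped at 1. Other conventions exist.
    """
    lowest = max(0, list_size - (total - annotated))
    highest = min(list_size, annotated)
    # Over-representation tail: P(X >= k)
    upper = sum(hypergeom_pmf(i, total, annotated, list_size)
                for i in range(k, highest + 1))
    # Under-representation tail: P(X <= k)
    lower = sum(hypergeom_pmf(i, total, annotated, list_size)
                for i in range(lowest, k + 1))
    return min(1.0, 2 * min(upper, lower))
```

A call such as two_tailed_p(k, 22291, annotated, list_size) would use the gene total assumed later on this page; k, annotated, and list_size depend on the term and gene list being tested.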

The inputs to analyze.pl and batchGOView.pl are files describing each aspect (function, process, and component) of the Gene Ontology graph, an annotation database assigning GO terms to genes, and a list of genes whose annotations are to be evaluated. The Gene Ontology files were obtained from the Gene Ontology Consortium. The format of the annotation database is described here.

The annotation file was built by downloading gene annotation data from Ensembl. The query to Ensembl requested the Ensembl Gene ID, Ensembl Transcript ID, External Gene ID, GO ID, GO Evidence Code, and RefSeq ID from Build 34 of the Human Genome and dbSNP121. The tab-delimited data returned by the query is here. This data is transformed into the required annotation file format using a Python script: make_go_association.py. The resulting annotation file is here. In this annotation file the annotated entities are the genes, identified by their Ensembl Gene IDs. The Ensembl Transcript IDs are used as aliases for the genes. The annotation file contains only genes that have GO annotations. The unannotated genes are accounted for by providing an estimate of the total number of genes for the organism. In the raw data downloaded from Ensembl there were 22291 genes, of which 12222 had GO annotations.
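The grouping step described above can be sketched as follows. This is not make_go_association.py itself. It assumes the columns of the downloaded file appear in the order listed in the query (gene ID, transcript ID, external gene ID, GO ID, evidence code, RefSeq ID) and that genes without annotations have an empty GO ID field. The function name is hypothetical, and writing the result out in the annotation file format is left aside.

```python
import csv
from collections import defaultdict


def group_annotations(lines):
    """Group tab-delimited rows by Ensembl Gene ID.

    Assumed column order (not confirmed by the source):
    gene ID, transcript ID, external gene ID, GO ID, evidence code, RefSeq ID.
    Returns (go_terms, aliases, all_genes).
    """
    go_terms = defaultdict(set)   # gene ID -> GO IDs (annotated genes only)
    aliases = defaultdict(set)    # gene ID -> transcript IDs used as aliases
    all_genes = set()             # every gene seen, annotated or not
    for row in csv.reader(lines, delimiter="\t"):
        if not row or not row[0]:
            continue
        row = row + [""] * (6 - len(row))  # tolerate missing trailing fields
        gene_id, transcript_id, go_id = row[0], row[1], row[3]
        all_genes.add(gene_id)
        if transcript_id:
            aliases[gene_id].add(transcript_id)
        if go_id:
            go_terms[gene_id].add(go_id)
    return go_terms, aliases, all_genes
```

With the real download, len(all_genes) and len(go_terms) would correspond to the total and annotated gene counts reported above, provided the column assumptions hold.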

The HS sites are listed in a tab-delimited table. The Ensembl Transcript ID, RefSeq ID, and the distance from the HS site to the transcript start are listed for each site. The Python script filter_50kb.py picks out the Ensembl Transcript IDs that are within 50kb of a HS site. These IDs and the annotation file are then passed to analyze.pl and batchGOView.pl. Finally, the significant GO terms for each subset are collated into a single HTML table by the Python script make_summary_table.py.
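The distance filter can be illustrated with a short sketch. It is not filter_50kb.py. It assumes the transcript ID is in the first column and the distance, in base pairs and possibly signed, is in the third column, and it treats "within 50kb" as inclusive. The function and constant names are hypothetical.

```python
import csv

MAX_DISTANCE_BP = 50_000  # "within 50kb", taken here as inclusive (assumption)


def transcripts_near_sites(lines, max_distance=MAX_DISTANCE_BP):
    """Return the sorted, de-duplicated transcript IDs within max_distance.

    Assumed columns (not confirmed by the source):
    0 = Ensembl Transcript ID, 1 = RefSeq ID, 2 = distance in bp.
    """
    selected = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue
        transcript_id, distance_text = row[0], row[2]
        try:
            distance = int(distance_text)
        except ValueError:
            continue  # header line or missing distance
        if transcript_id and abs(distance) <= max_distance:
            selected.add(transcript_id)
    return sorted(selected)
```

The returned IDs would then be written one per line to the gene list given to analyze.pl and batchGOView.pl.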

Analyzed using annotation by gene, assuming the total number of genes = 22291
and including only sites within 50kb

Summary of corrected p-values of significant GO annotations for sets of HS sites
Summary of corrected p-values of significant GO annotations for sets of HS sites (tab delimited text)

Analyzed using annotation by transcript, assuming the total number of transcripts = 34111
and including only sites within 50kb

I was concerned that the annotation file used above treats genes as the annotated entity, whereas the HS sites are associated with transcripts. I therefore created a second annotation file that annotates the Ensembl transcripts directly, without reference to an associated gene. The resulting annotation file can be found here. Analysis using this annotation file identified as significant all but three of the GO terms found previously, and identified 40 additional terms as significant. The results are linked below.

Summary of corrected p-values of significant GO annotations for sets of HS sites
Summary of corrected p-values of significant GO annotations for sets of HS sites (tab delimited text)
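
A comparison like the one reported for the gene-based and transcript-based runs (terms lost and terms gained) can be made with plain set operations. The sketch below assumes the significant GO IDs from each run have already been read into two collections; the function name and the dictionary keys are hypothetical.

```python
def compare_term_sets(by_gene, by_transcript):
    """Compare the significant GO IDs from two analyses."""
    by_gene, by_transcript = set(by_gene), set(by_transcript)
    return {
        "shared": sorted(by_gene & by_transcript),   # significant in both runs
        "lost": sorted(by_gene - by_transcript),     # only in the gene-based run
        "gained": sorted(by_transcript - by_gene),   # only in the transcript-based run
    }
```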