University of Washington
The human epigenome has been experimentally characterized by measurements of protein binding, chromatin accessibility, methylation, and histone modification in hundreds of cell types. The result is a huge compendium of data, consisting of over ten thousand measurements for every base pair in the human genome. These data are difficult to make sense of, not only for humans, but also for computational methods that aim to detect genes and other functional elements, predict gene expression, and characterize polymorphisms. To address this challenge, we propose a deep neural network tensor factorization method, Avocado, that compresses epigenomic data into a dense, information-rich representation of the human genome. We use data from the Roadmap Epigenomics Consortium and the ENCODE Consortium to demonstrate that this approach can learn representations of the genome that are broadly useful: first, by imputing tens of thousands of tracks of epigenomic data more accurately than previous methods, and second, by showing that machine learning models that exploit the learned representations outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, and elements of 3D chromatin architecture. Our findings suggest the broad utility of Avocado's learned latent representation for computational genomics and epigenomics.
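To make the deep tensor factorization idea concrete, the following is a minimal sketch, not the actual Avocado implementation: each cell type, assay, and genomic position receives a learned embedding, and a small neural network maps the concatenated embeddings to a predicted signal value. All dimensions, layer sizes, and initializations here are illustrative assumptions.

```python
import numpy as np

# Sketch of deep tensor factorization: embeddings for cell type, assay,
# and genomic position are concatenated and passed through an MLP to
# predict an epigenomic signal value. In a real model, the embeddings
# and network weights would be learned jointly from observed tracks.

rng = np.random.default_rng(0)

n_celltypes, n_assays, n_positions = 5, 4, 100
d_cell, d_assay, d_pos = 8, 8, 8          # embedding sizes (assumed)

# Latent factors for each axis of the (cell type, assay, position) tensor.
cell_emb = rng.normal(size=(n_celltypes, d_cell))
assay_emb = rng.normal(size=(n_assays, d_assay))
pos_emb = rng.normal(size=(n_positions, d_pos))

# A two-layer MLP standing in for the deep component of the model.
d_in, d_hidden = d_cell + d_assay + d_pos, 16
W1 = rng.normal(size=(d_in, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, 1)) * 0.1
b2 = np.zeros(1)

def predict(c, a, p):
    """Predict the signal for (cell type c, assay a, genomic position p)."""
    x = np.concatenate([cell_emb[c], assay_emb[a], pos_emb[p]])
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
    return float(h @ W2 + b2)

# Imputing an unmeasured track: predict the signal at every position for
# a (cell type, assay) pair that was never directly observed.
track = np.array([predict(2, 3, p) for p in range(n_positions)])
print(track.shape)  # (100,)
```

Because the position embeddings are shared across all cell types and assays, they form a reusable representation of the genome, which is what downstream models consume in the tasks described above.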