Capturing cell-type specific compartment patterns by applying topic modeling to single-cell Hi-C data
Hyeon-Jin Kim*, Galip Gurkan Yardimci*, Giancarlo Bonora, Vijay Ramani, Jie Liu, Ruolan Qiu, Choli Lee, Jennifer Hesson, Carol B. Ware, Jay Shendure, Zhijun Duan and William Stafford Noble
Abstract
Single-cell Hi-C (scHi-C) interrogates genome-wide chromatin interactions in individual cells, allowing us to gain insights into 3D genome organization. However, the extreme sparsity of scHi-C data poses a significant barrier to analysis, limiting our ability to tease out hidden biological information. In this work, we approach this problem by applying topic modeling to scHi-C data. Topic modeling is well suited to discovering latent topics in a collection of discrete data. For our analysis, we generate nine different single-cell combinatorial indexed Hi-C (sci-Hi-C) libraries from five human cell lines (GM12878, H1Esc, HFF, IMR90, and HAP1), comprising over 19,000 cells. We demonstrate that topic modeling successfully captures cell type differences from sci-Hi-C data in the form of "chromatin topics." We further show that locus pairs in these topics are enriched for particular compartment structures.
Supplementary Data
Processed sci-Hi-C data
Library name      4DN biosample accession     sci-Hi-C .matrix files   Cell labels
H1Esc.R1          4DNEXJLO3SIH                H1Esc.R1.tar.gz          H1Esc.R1.labeled
H1Esc.R2          4DNEXJD1GLV9                H1Esc.R2.tar.gz          H1Esc.R2.labeled
H1Esc-HFF.R1      4DNEXSXHFNQU, 4DNEX6HS133B  H1Esc-HFF.R1.tar.gz      H1Esc-HFF.R1.labeled
H1Esc-HFF.R2      4DNEXSR13BV7, 4DNEXOTPTHX8  H1Esc-HFF.R2.tar.gz      H1Esc-HFF.R2.labeled
HFF-GM12878.R1    4DNEXTZ5222J, 4DNEXRQS6RMZ  HFF-GM12878.R1.tar.gz    HFF-GM12878.R1.labeled
HFF-GM12878.R2    4DNEX4QH6UPV, 4DNEXC3U5F5M  HFF-GM12878.R2.tar.gz    HFF-GM12878.R2.labeled
GM12878-IMR90.R1  4DNEXE63LNM5, 4DNEX344GLEL  GM12878-IMR90.R1.tar.gz  GM12878-IMR90.R1.labeled
IMR90-HAP1.R1     4DNEXUDB9U31, 4DNEXRK4Z69N  IMR90-HAP1.R1.tar.gz     IMR90-HAP1.R1.labeled
IMR90-HAP1.R2     4DNEXJ9KRQWC, 4DNEX1K4H8JM  IMR90-HAP1.R2.tar.gz     IMR90-HAP1.R2.labeled
Input and output data
Dataset                Resolution (kb)  Inter locus pair distance (Mb)  Cell-LP matrix  Cell labels / LP labels  Topic model object
Our sci-Hi-C data      100              10                              txt.gz          txt, txt                 rds
Our sci-Hi-C data      250              10                              txt.gz          txt                      rds
Our sci-Hi-C data      500              10                              txt.gz          txt                      rds
Our sci-Hi-C data      1000             10                              txt.gz          txt                      rds
Our sci-Hi-C data      500              3                               txt.gz          txt                      rds
Our sci-Hi-C data      500              5                               txt.gz          txt                      rds
Our sci-Hi-C data      500              15                              txt.gz          txt                      rds
Our sci-Hi-C data      500              20                              txt.gz          txt                      rds
Nagano                 500              10                              txt.gz          txt, txt                 rds
Flyamer (all cells)    500              10                              txt.gz          txt, txt                 rds
Flyamer (downsampled)  500              10                              txt.gz          txt                      rds
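To illustrate how the resolution and inter locus pair distance settings relate to a cell-by-locus-pair (cell-LP) matrix, the following is a minimal Python sketch, not the code used for the manuscript. It assumes a hypothetical input format of (cell, chrom1, pos1, chrom2, pos2) contact tuples, assumes that only intra-chromosomal pairs are kept, and assumes that the distance setting acts as a maximum separation between bins. The function name build_cell_lp_matrix and its parameters are illustrative only and do not describe the layout of the files listed above.

```python
from collections import Counter, defaultdict


def build_cell_lp_matrix(contacts, resolution_kb=500, max_distance_mb=10):
    """Count contacts per (cell, locus pair) at a fixed bin size.

    contacts: iterable of (cell, chrom1, pos1, chrom2, pos2) tuples.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    bin_size = resolution_kb * 1_000
    max_dist = max_distance_mb * 1_000_000
    counts = defaultdict(Counter)  # cell -> Counter of locus pairs

    for cell, chrom1, pos1, chrom2, pos2 in contacts:
        if chrom1 != chrom2:
            continue  # assumption: inter-chromosomal contacts are dropped
        # Order the two bins so (a, b) and (b, a) map to the same locus pair.
        bin1, bin2 = sorted((pos1 // bin_size, pos2 // bin_size))
        if (bin2 - bin1) * bin_size > max_dist:
            continue  # assumption: the distance setting is an upper cutoff
        counts[cell][(chrom1, bin1, bin2)] += 1

    cells = sorted(counts)
    locus_pairs = sorted({lp for c in counts.values() for lp in c})
    matrix = [[counts[cell][lp] for lp in locus_pairs] for cell in cells]
    return cells, locus_pairs, matrix


if __name__ == "__main__":
    toy_contacts = [
        ("cellA", "chr1", 100_000, "chr1", 1_200_000),
        ("cellA", "chr1", 1_300_000, "chr1", 200_000),   # same locus pair, reversed
        ("cellA", "chr1", 0, "chr1", 20_000_000),        # beyond 10 Mb, dropped
        ("cellB", "chr1", 0, "chr2", 0),                 # inter-chromosomal, dropped
        ("cellB", "chr2", 600_000, "chr2", 5_100_000),
    ]
    cells, locus_pairs, matrix = build_cell_lp_matrix(toy_contacts)
    print(cells)
    print(locus_pairs)
    print(matrix)
```

A matrix of this shape, with cells as rows and locus pairs as columns, is the kind of discrete count data to which a topic model can be fit; changing resolution_kb or max_distance_mb changes the set of locus pairs, which is consistent with each table row above having its own matrix and topic model object.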
Supplementary Code
Source code for the manuscript is available here.
Please email questions to william-noble@uw.edu and khj3017@uw.edu.