Capturing cell-type specific compartment patterns by applying topic modeling to single-cell Hi-C data

Hyeon-Jin Kim*, Galip Gurkan Yardimci*, Giancarlo Bonora, Vijay Ramani, Jie Liu, Ruolan Qiu, Cholie Lee, Jennifer Hesson, Carol B. Ware, Jay Shendure, Zhijun Duan and William Stafford Noble

Abstract

Single-cell Hi-C (scHi-C) interrogates genome-wide chromatin interactions in individual cells, allowing us to gain insights into 3D genome organization. However, the extremely sparse nature of scHi-C data poses a significant barrier to analysis, limiting our ability to tease out hidden biological information. In this work, we approach this problem by applying topic modeling to scHi-C data. Topic modeling is well-suited for discovering latent topics in a collection of discrete data. For our analysis, we generate nine different single-cell combinatorial indexed Hi-C (sci-Hi-C) libraries from five human cell lines (GM12878, H1Esc, HFF, IMR90, and HAP1), consisting of over 19,000 cells. We demonstrate that topic modeling successfully captures cell type differences from sci-Hi-C data in the form of "chromatin topics." We further show enrichment of particular compartment structures associated with locus pairs in these topics.


Supplementary Data

Processed sci-Hi-C data

Library name      4DN biosample accession     sci-Hi-C .matrix files   Cell labels
H1Esc.R1          4DNEXJLO3SIH                H1Esc.R1.tar.gz          H1Esc.R1.labeled
H1Esc.R2          4DNEXJD1GLV9                H1Esc.R2.tar.gz          H1Esc.R2.labeled
H1Esc-HFF.R1      4DNEXSXHFNQU, 4DNEX6HS133B  H1Esc-HFF.R1.tar.gz      H1Esc-HFF.R1.labeled
H1Esc-HFF.R2      4DNEXSR13BV7, 4DNEXOTPTHX8  H1Esc-HFF.R2.tar.gz      H1Esc-HFF.R2.labeled
HFF-GM12878.R1    4DNEXTZ5222J, 4DNEXRQS6RMZ  HFF-GM12878.R1.tar.gz    HFF-GM12878.R1.labeled
HFF-GM12878.R2    4DNEX4QH6UPV, 4DNEXC3U5F5M  HFF-GM12878.R2.tar.gz    HFF-GM12878.R2.labeled
GM12878-IMR90.R1  4DNEXE63LNM5, 4DNEX344GLEL  GM12878-IMR90.R1.tar.gz  GM12878-IMR90.R1.labeled
IMR90-HAP1.R1     4DNEXUDB9U31, 4DNEXRK4Z69N  IMR90-HAP1.R1.tar.gz     IMR90-HAP1.R1.labeled
IMR90-HAP1.R2     4DNEXJ9KRQWC, 4DNEX1K4H8JM  IMR90-HAP1.R2.tar.gz     IMR90-HAP1.R2.labeled

Input and output data

Dataset Resolution (kb) Inter locus pair distance (Mb) Cell-LP matrix Cell labels LP labels Topic model object
Our sci-Hi-C data 100 10 txt.gz txt txt rds
Our sci-Hi-C data 250 10 txt.gz txt rds
Our sci-Hi-C data 500 10 txt.gz txt rds
Our sci-Hi-C data 1000 10 txt.gz txt rds
Our sci-Hi-C data 500 3 txt.gz txt rds
Our sci-Hi-C data 500 5 txt.gz txt rds
Our sci-Hi-C data 500 15 txt.gz txt rds
Our sci-Hi-C data 500 20 txt.gz txt rds
Nagano 500 10 txt.gz txt txt rds
Flyamer (all cells) 500 10 txt.gz txt txt rds
Flyamer (downsampled) 500 10 txt.gz txt rds
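
The resolution and inter locus pair distance columns above parameterize how a cell-LP (cell by locus pair) count matrix is built. The following is a minimal sketch of one way such a matrix might be assembled, not the code used for the manuscript. The input layout (tuples of cell ID, chromosome, and two base-pair positions), the function names cell_lp_counts and to_matrix, and the reading of the distance column as a maximum allowed separation between the two loci are all assumptions made for illustration.

```python
from collections import Counter


def cell_lp_counts(contacts, resolution_kb=500, max_distance_mb=10):
    """Count intra-chromosomal contacts per (cell, locus pair).

    contacts: iterable of (cell_id, chrom, pos1, pos2) tuples with
    base-pair positions. This layout is hypothetical; it is not the
    format of the .matrix files listed above.
    """
    bin_size = resolution_kb * 1_000
    max_dist = max_distance_mb * 1_000_000
    counts = Counter()
    for cell_id, chrom, pos1, pos2 in contacts:
        # Map each position to a bin index and order the pair.
        lo, hi = sorted((pos1 // bin_size, pos2 // bin_size))
        # Assumption: drop locus pairs whose bins lie farther apart
        # than the distance cutoff.
        if (hi - lo) * bin_size > max_dist:
            continue
        counts[(cell_id, (chrom, lo, hi))] += 1
    return counts


def to_matrix(counts):
    """Turn the sparse counts into a dense cells x locus-pairs table."""
    cells = sorted({cell for cell, _ in counts})
    lps = sorted({lp for _, lp in counts})
    row = {cell: i for i, cell in enumerate(cells)}
    col = {lp: j for j, lp in enumerate(lps)}
    matrix = [[0] * len(lps) for _ in cells]
    for (cell, lp), n in counts.items():
        matrix[row[cell]][col[lp]] = n
    return cells, lps, matrix


# Toy contacts, invented for this example only.
contacts = [
    ("cell_A", "chr1", 120_000, 2_700_000),
    ("cell_A", "chr1", 300_000, 2_600_000),   # same 500 kb bins as above
    ("cell_A", "chr1", 100_000, 30_000_000),  # beyond 10 Mb; dropped
    ("cell_B", "chr2", 1_200_000, 1_900_000),
]
cells, lps, matrix = to_matrix(cell_lp_counts(contacts))
```

In this sketch the rows of the matrix correspond to cell labels and the columns to LP labels. A topic model would then treat each cell as a document and each locus pair as a word. Real data at this scale would call for a sparse matrix representation rather than nested lists.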

Supplementary Code

Source code for the manuscript is available here.


Please email questions to william-noble@uw.edu and khj3017@uw.edu.