Malladi_2020: Total functional score of enhancer elements identifies lineage-specific enhancers that drive differentiation of pancreatic cells
- keywords
Enhancer, epigenome, gene regulation, pancreas, tissue-specific transcription, transcription factor
Total functional score of enhancer elements identifies lineage-specific enhancers that drive differentiation of pancreatic cells
Abstract
Introduction
Lineage specification depends on the interactions between transcription factors (TFs) and chromatin states at enhancers
Enhancers have been shown to share several common features
Increased chromatin accessibility (as measured by DNase-seq or ATAC-seq)
Enrichment of post-translational modifications of the amino-terminal tails of core histone proteins (as assessed by ChIP-seq)
histone H3 lysine 4 monomethyl (H3K4me1)
histone H3 lysine 27 acetyl (H3K27ac)
Recent genomic assays have shown that active enhancers are bound by RNA polymerase II (Pol II) and are transcribed, producing noncoding RNAs known as enhancer RNAs (eRNAs)
Enhancer Transcription (as measured by total RNA-seq, GRO-seq, or PRO-seq) can be used in the absence of any other genomic information to predict enhancer activity
Advances in technology have facilitated the large-scale functional characterization of enhancer activity and the annotation of TF-binding sites (TFBSs) genome-wide in various cell types and tissues
Analyses that predict TFBSs
Fail to consider that such sequences occur frequently by chance throughout the genome and that TF binding is cell type specific
TFBSs, which are usually 4 to 12 nucleotides in length
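A rough worked calculation shows why short motifs match by chance so often. This is a minimal sketch, not taken from the paper: it assumes uniform base frequencies, independent positions, and a genome size of about 3.1e9 bp, and the function name is hypothetical.

```python
# Expected number of exact chance matches to one specific k-mer.
# Assumptions (not from the source): each base has probability 0.25,
# positions are independent, and the genome is ~3.1e9 bp long.

def expected_chance_matches(motif_length, genome_size=3_100_000_000, both_strands=True):
    """Return the expected count of exact matches to a single k-mer."""
    positions = genome_size - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length          # chance of a match at one position
    strands = 2 if both_strands else 1           # scan forward and reverse strands
    return positions * per_position * strands

# Compare the short and long ends of the 4 to 12 nucleotide range.
for k in (4, 8, 12):
    print(k, round(expected_chance_matches(k)))
```

Even at the long end of the range, a single 12-mer is expected to occur hundreds of times under these assumptions, so sequence matches alone cannot indicate cell-type-specific binding.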
Total Functional Score of Enhancer Elements (TFSEE)
Aimed to (1) evaluate TFSEE as an enhancer-calling algorithm and (2) understand the TF-driven transcriptional programs differentiating human embryonic stem cells (hESCs) into pancreatic cells
Figure 1.
Materials and Methods
Genomic data curation
GRO-seq, ChIP-seq, and RNA-seq data from time course differentiation of human embryonic stem cells (hESCs) to pancreatic endoderm (PE)
Analysis of ChIP-seq data
GRCh37/hg19
Aligned with Bowtie v 1.0.0
Analysis of RNA-seq data
GRCh37/hg19
Aligned with STAR v 2.4.2a
Quantification of genes with RSEM
Analysis of GRO-seq data
Trimmed to the first 36 bases to remove adapter and low-quality sequence using fastx_trimmer
GRCh37/hg19
Aligned with BWA v0.7.12
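The trimming step for GRO-seq reads amounts to keeping the first 36 bases of each read and the matching quality scores. The sketch below only illustrates that operation on one FASTQ record; it is not the fastx_trimmer implementation, and the function name and tuple layout are assumptions.

```python
# Illustrative only: keep the first `keep` bases of a FASTQ record.
# The record layout (header, sequence, separator, quality) is an assumption
# made for this sketch; the paper used fastx_trimmer for this step.

def trim_fastq_record(record, keep=36):
    """Return the record with sequence and quality cut to `keep` characters."""
    header, sequence, separator, quality = record
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must be the same length")
    return (header, sequence[:keep], separator, quality[:keep])

read = ("@read1", "ACGT" * 12, "+", "I" * 48)   # a toy 48-base read
trimmed = trim_fastq_record(read)
print(len(trimmed[1]), len(trimmed[3]))
```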
Kernel density
calculated in Python (ver. 2.7.11) using the kdeplot function from seaborn version 0.7.1
Defining Transcription Start Site (TSS) and promoters
Made TSSs for protein-coding genes using MakeGencodeTSS
Identified active promoters using H3K4me3 enrichment
RPKM cutoff of ≥1 for H3K4me3
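The RPKM cutoff can be made concrete with the standard definition (reads per kilobase of region per million mapped reads). This is a minimal sketch; the function names and the example numbers are illustrative assumptions, not values from the paper.

```python
# RPKM = reads * 1e9 / (region length in bp * total mapped reads).
# Function names and example values below are hypothetical.

def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and library size must be positive")
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

def is_active_promoter(h3k4me3_reads, region_length_bp, total_mapped_reads, cutoff=1.0):
    """Apply the >= 1 RPKM cutoff to H3K4me3 signal at a promoter."""
    return rpkm(h3k4me3_reads, region_length_bp, total_mapped_reads) >= cutoff

# Toy example: 20 reads in a 2 kb promoter window, 10 million mapped reads.
print(rpkm(20, 2_000, 10_000_000))
print(is_active_promoter(20, 2_000, 10_000_000))
```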
Enhancer calling by GRO-seq
Calling a universe of transcripts from GRO-seq data.
groHMM
Built a universe of transcripts by merging the groHMM-called transcripts from individual cell lines and stratifying the boundaries to remove overlaps and redundancies arising from the union of all transcripts.
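Building a transcript universe from per-cell-line calls can be approximated by a strand-aware interval merge. This sketch assumes plain union merging of overlapping calls; the boundary stratification used with groHMM in the paper may differ, and the function name and tuple layout are hypothetical.

```python
from collections import defaultdict

# Assumption: transcripts are (chrom, start, end, strand) tuples with
# 0-based, half-open coordinates. Overlapping calls on the same chromosome
# and strand are collapsed into one interval.

def merge_transcripts(calls):
    """Merge overlapping transcript calls pooled from all cell lines."""
    by_key = defaultdict(list)
    for chrom, start, end, strand in calls:
        by_key[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), intervals in sorted(by_key.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:                      # overlaps the current interval
                cur_end = max(cur_end, end)
            else:                                    # gap: close the current interval
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))
    return merged

pooled = [
    ("chr1", 100, 500, "+"),    # e.g. called in one cell line
    ("chr1", 400, 900, "+"),    # overlapping call from another cell line
    ("chr1", 1000, 1200, "+"),
    ("chr1", 450, 600, "-"),    # opposite strand, kept separate
]
print(merge_transcripts(pooled))
```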
Calling active enhancers using GRO-seq-defined enhancer transcripts
<9 kb in length and >3 kb away from known TSSs
protein-coding genes AND H3K4me3 peaks
Classified into
short paired eRNAs
cutoff of RPKM ≥ 0.5
short unpaired eRNAs
cutoff of RPKM ≥ 1
The comprehensive universe of expressed eRNAs (short paired and short unpaired) assembled using the cutoffs noted above for each cell line was used for further analyses.
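The filters above (length, distance from TSSs and H3K4me3 peaks, and class-specific RPKM cutoffs) can be combined into one predicate. This is a sketch under stated assumptions: excluded sites are represented as single positions per chromosome (for example TSS coordinates and H3K4me3 peak summits), whether a transcript is paired is taken as precomputed, and all names are hypothetical.

```python
from collections import namedtuple

MAX_LENGTH_BP = 9_000                    # transcripts must be shorter than this
MIN_DISTANCE_BP = 3_000                  # and farther than this from excluded sites
RPKM_CUTOFF = {True: 0.5, False: 1.0}    # keyed by "is paired"

# `paired` is assumed to be determined upstream (short paired vs unpaired eRNA).
Transcript = namedtuple("Transcript", "chrom start end rpkm paired")

def distance_to_nearest_site(transcript, sites_by_chrom):
    """Distance in bp from the transcript to the closest excluded position."""
    positions = sites_by_chrom.get(transcript.chrom, [])
    if not positions:
        return float("inf")

    def gap(pos):
        if transcript.start <= pos < transcript.end:
            return 0                      # site lies inside the transcript
        return min(abs(pos - transcript.start), abs(pos - transcript.end))

    return min(gap(p) for p in positions)

def is_active_enhancer(transcript, sites_by_chrom):
    """Apply the length, distance, and RPKM filters in turn."""
    if transcript.end - transcript.start >= MAX_LENGTH_BP:
        return False
    if distance_to_nearest_site(transcript, sites_by_chrom) <= MIN_DISTANCE_BP:
        return False
    return transcript.rpkm >= RPKM_CUTOFF[transcript.paired]

excluded = {"chr1": [1_000]}             # toy TSS / H3K4me3 position
print(is_active_enhancer(Transcript("chr1", 10_000, 12_000, 0.6, True), excluded))
print(is_active_enhancer(Transcript("chr1", 10_000, 12_000, 0.6, False), excluded))
```

The same expression level passes as a paired eRNA but fails as an unpaired one, which reflects the stricter cutoff for unpaired transcripts.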
Motif analyses for GRO-seq-defined enhancers
De novo motif analyses were performed on a 1 kb region surrounding the overlap center or the TSS using MEME
Enhancer calling by ChIP-seq
Calling active enhancers using histone modification ChIP-seq data
Built a universe of peak calls by merging the peaks from individual cell lines for histone modifications (H3K4me1 and H3K27ac)
Potential enhancers were defined as peaks that were >3 kb from known TSSs, protein-coding genes from Gencode version 19 annotations (ref. 41), and H3K4me3 peaks
RPKM cutoff of ≥1 for H3K4me1 and H3K27ac in at least one cell line
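The "at least one cell line" criterion can be read in more than one way. The sketch below assumes that both marks must reach the cutoff in the same cell line; that reading, the function name, and the toy values are assumptions rather than details confirmed by the paper.

```python
MARKS = ("H3K4me1", "H3K27ac")

def passes_in_any_cell_line(rpkm_by_cell_line, cutoff=1.0):
    """True if some cell line has every mark at or above the cutoff.

    rpkm_by_cell_line maps a cell line name to a dict of mark -> RPKM.
    Missing marks are treated as 0.0, so they fail the cutoff.
    """
    return any(
        all(marks.get(mark, 0.0) >= cutoff for mark in MARKS)
        for marks in rpkm_by_cell_line.values()
    )

# Toy peak: enriched for both marks in PE only.
peak = {
    "hESC": {"H3K4me1": 1.4, "H3K27ac": 0.2},
    "PE":   {"H3K4me1": 2.1, "H3K27ac": 1.3},
}
print(passes_in_any_cell_line(peak))
```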
Motif analyses for ChIP-seq-defined enhancers
De novo motif analyses were performed on a 1 kb region (±500 bp) surrounding the peak summit for the top 10000 enhancers using MEME
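Selecting the sequence windows passed to motif discovery can be sketched as follows. The ranking criterion for the "top 10000" enhancers is not stated in these notes, so ranking by a per-peak signal score is an assumption, as are the function name and tuple layout.

```python
# Assumption: each peak is (chrom, summit_position, score) and "top" means
# highest score. Windows are 0-based, half-open, and clipped at position 0.

def motif_windows(peaks, flank=500, top_n=10_000):
    """Return (chrom, start, end) windows of +/- `flank` bp around top summits."""
    ranked = sorted(peaks, key=lambda peak: peak[2], reverse=True)[:top_n]
    return [
        (chrom, max(0, summit - flank), summit + flank)
        for chrom, summit, _score in ranked
    ]

toy_peaks = [("chr1", 1000, 5.0), ("chr1", 200, 9.0), ("chr2", 5000, 1.0)]
print(motif_windows(toy_peaks, top_n=2))
```

The resulting windows would then be converted to FASTA sequences before being given to MEME.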
TODO Generating heatmaps and clusters
TODO Nearest neighboring gene analyses and box plots
TODO Overlapping enhancer analysis
Results
The TFSEE model
Step 1 - Method 1: enhancer calling based on enhancer transcripts defined by GRO-seq
Step 1 - Method 2: enhancer calling based on histone modification defined by ChIP-seq
Step 2: Calculating enrichment and activity profiles
Figure 2. Data processing for Total Functional Score of Enhancer Elements (TFSEE) method
5 data processing steps
Steps 3 to 5: De novo motif searching and TF expression
Calculating the TFSEE score by data integration
Figure 3. Overview of Total Functional Score of Enhancer Elements (TFSEE) method
GRO-seq expression + H3K27ac enrichment + H3K4me1 enrichment = Enhancer Activity
Enhancer Activity × Motif prediction
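The integration above can be sketched with small matrices. The sum of GRO-seq and histone-mark signal and the product with the motif prediction follow the notes; the matrix orientations (cell types by enhancers, enhancers by TFs), the final element-wise weighting by TF expression, and all names are assumptions, and any scaling or normalization is omitted.

```python
# Pure-Python matrix helpers so the sketch has no dependencies.

def add_matrices(*matrices):
    """Element-wise sum of equally shaped matrices (lists of rows)."""
    return [[sum(values) for values in zip(*rows)] for rows in zip(*matrices)]

def matmul(a, b):
    """Matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def elementwise(a, b):
    """Element-wise product of equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def tfsee_scores(gro, h3k27ac, h3k4me1, motif, tf_expression):
    """Return a cell type x TF score matrix.

    gro, h3k27ac, h3k4me1: cell types x enhancers
    motif:                 enhancers x TFs (motif prediction)
    tf_expression:         cell types x TFs (assumed final weighting)
    """
    activity = add_matrices(gro, h3k27ac, h3k4me1)   # enhancer activity
    per_tf = matmul(activity, motif)                 # activity x motif prediction
    return elementwise(per_tf, tf_expression)        # weight by TF expression

# Toy input: 2 cell types, 2 enhancers, 2 TFs.
scores = tfsee_scores(
    gro=[[1, 0], [0, 2]],
    h3k27ac=[[1, 1], [0, 0]],
    h3k4me1=[[0, 1], [1, 0]],
    motif=[[1, 0], [0, 1]],
    tf_expression=[[1, 0], [2, 3]],
)
print(scores)
```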
Comparison of enhancer calls by methods 1 and 2
greater than 84% of the enhancers identified based solely on enhancer transcription were not called based on enrichment of H3K27ac or H3K4me1
Figure 4. Comparison of approaches for genome-wide prediction of enhancers during pancreatic differentiation
Decided to focus on the enhancers identified based on enhancer transcription using GRO-seq data (method 1)
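The comparison of methods 1 and 2 reduces to asking what fraction of one call set has no overlapping call in the other. This is a minimal sketch on toy intervals, not the paper's data; the function names and the any-overlap criterion are assumptions.

```python
# Intervals are (chrom, start, end), 0-based and half-open.

def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_unique(query, reference):
    """Fraction of query intervals with no overlap in the reference set."""
    if not query:
        return 0.0
    unique = sum(1 for q in query if not any(overlaps(q, r) for r in reference))
    return unique / len(query)

# Toy call sets standing in for method 1 (GRO-seq) and method 2 (ChIP-seq).
method1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200), ("chr1", 900, 1000)]
method2 = [("chr1", 150, 300), ("chr2", 500, 600)]
print(fraction_unique(method1, method2))
```

The quadratic scan is fine for a toy example; genome-scale call sets would normally use sorted intervals or an interval tree.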
TFSEE identifies lineage-specific enhancers and their cognate TFs during pancreatic differentiation
TFSEE scores determined by using inputs from method 1
Figure 5. TFSEE identifies cell type-specific enhancers and their cognate TFs that drive gene expression during pancreatic differentiation
Heatmap of the 5 stages of pancreatic differentiation
Box plots of normalized TFSEE score for clusters identified in pancreatic differentiation
Cluster TFSEE scores
Figure 6. TFSEE-predicted TFs are enriched in pre- and late pancreatic differentiation
TF expression
Enhancer Transcription
Nearest Neighboring Gene Expression
Cluster 3 Rank Order of Enriched TFs
Cluster 4 Rank Order of Enriched TFs
TODO TFSEE scores determined using inputs from method 2.
TODO Comparison of TF identification using inputs from method 1 or method 2
Discussion
TFSEE enables analysis of driver TFs using a limited amount of data
The model identified lineage-specific TFs with as few as 5 cell types and only 2 data types: RNA-seq and ChIP-seq (for H3K4me3, H3K4me1, and H3K27ac)
A limitation of the TFSEE method is that, although the model can run with a reduced number of data types for enhancer identification, it fails to identify additional subtype- or stage-specific drivers with reduced data input
Integrating additional genomic data into TFSEE
Integrate genomic data indicating open regions of chromatin (ATAC-seq, DNase-seq, or MNase-seq)
ChromHMM could be used to annotate alternate chromatin states with additional histone modifications
Chromatin Looping data for enhancer-promoter interactions (as measured by 4C, ChIA-PET, or Hi-C)