Monica C. Sleumer (苏漠)
2012-09-19

Human Genome
• 3,101,804,739 base pairs
• 22 chromosomes plus X and Y
• 21,224 protein-coding genes
• 15,952 ncRNA genes
• 3–8% of bases are under selection
  – From comparative genomic studies
• Question: What is the genome doing?

Objectives
• Find all functional elements
  – Bound by specific proteins
  – Transcribed
  – Histone modifications
  – DNA methylation
• Use this information to annotate functional regions
  – Genes (coding and non-coding)
  – Promoters
  – Enhancers
  – Specific transcription factor binding sites
  – Silencers
  – Insulators
  – Chromatin states
• Cross-reference data from other studies
  – Comparative genomics
  – 1000 Genomes Project
  – Genome-wide association studies (GWAS)

ENCODE projects
• ENCODE pilot project: 1% of the genome, 2003-2007
• modENCODE: Drosophila and C. elegans
• ENCODE main project, 2007-2012
  – 1649 dataset-generating experiments
  – 147 cell types
  – 235 antibodies and assay protocols
  – 450 authors
  – 32 institutes
• 31 publications, 2012-09-06
  – 6 in Nature
  – 18 in Genome Research
  – 6 in Genome Biology
  – 1 in BMC Genetics
www.nature.com/encode/category/research-papers

Materials
• 147 types of human cell lines, 3 priority levels
• Tier 1 cell lines: top priority for all experiments

Name | Description | Lineage | Tissue | Karyotype
GM12878 | B-lymphocyte, lymphoblastoid, Epstein-Barr Virus, 1000 Genomes Project | mesoderm | blood | normal
H1-hESC | embryonic stem cells | inner cell mass | embryonic stem cell | normal
K562 | leukemia, 53-year-old female with chronic myelogenous leukemia | mesoderm | blood | cancer

• Tier 2 cell lines to be done after Tier 1 (next slide)
• Tier 3: any other cell lines

Tier 2 Cell Lines

Name | Description | Lineage | Tissue | Karyotype
A549 | lung carcinoma epithelium, 58-year-old caucasian male donor | endoderm | epithelium | cancer
CD20+ | B cells: RO01778 and RO01794 | mesoderm | blood | normal
CD20+_RO01778 | B cells, caucasian | mesoderm | blood | normal
CD20+_RO01794 | B cells, African American | mesoderm | blood | normal
H1-neurons | neurons derived from H1 embryonic stem cells | ectoderm | neurons | normal
HeLa-S3 | cervical carcinoma | ectoderm | cervix | cancer
HepG2 | hepatocellular carcinoma | endoderm | liver | cancer
HUVEC | umbilical vein endothelial cells | mesoderm | blood vessel | normal
IMR90 | fetal lung fibroblasts | endoderm | lung | normal
LHCN-M2 | skeletal myoblasts from pectoralis major muscle, 41 year old caucasian | mesoderm | skeletal muscle myoblast |
MCF-7 | mammary gland, adenocarcinoma | ectoderm | breast | cancer
Monocytes-CD14+ | Monocytes-CD14+, leukapheresis from RO 01746 and RO 01826 | mesoderm | monocytes | normal
SK-N-SH | neuroblastoma, 4 year old | ectoderm | brain | cancer

http://encodeproject.org/ENCODE/cellTypes.html

Methods

Method | What is sequenced
RNA-Seq | Different fractions of RNA -> sequencing
CAGE | 5’ capped RNA sequencing
RNA-PET | Sequencing 5’ cap plus poly-A tail
ChIP-seq | Chromatin immunoprecipitation of a DNA binding protein -> sequencing
DNase-seq | Cut exposed DNA with DNase I -> sequencing
FAIRE-seq | Nucleosome-depleted DNA -> sequencing
RRBS | Bisulphite treatment: unmethylated C->U -> sequencing
3C, 5C, ChIA-PET | Chromatin interactions -> sequencing

Results: RNA Sequencing
• 62% of the genome is transcribed into sequences >200 bp long
  – 5.5% of this is exon
  – 31% is intergenic (no annotated gene)
  – Remaining: intronic
• CAGE-seq: 62,403 transcription start sites (TSS)
  – 44% within 100 bp of the 5’ end of a GENCODE gene
  – Others: exons and 3’ UTRs, significance unknown
• Lots of short ncRNAs: tRNA, miRNA, snRNA etc.
• Further description: Wu Dingming, 9:30

Results: Transcribed and protein-coding regions
• GENCODE reference gene set
  – 20,687 protein-coding genes
    • 6.3 alternatively spliced transcripts on average
    • 3.9 protein isoforms on average
    • Protein-coding exons: 1.22% of the genome
    • Still more to come: unidentified peptides in mass-spec
  – 18,441 ncRNA genes
    • 8801 short ncRNA
    • 9640 long ncRNA
  – 11,224 pseudogenes
    • 863 transcribed

ChIP-Seq

Acronym | Description | Factors analysed
ChromRem | ATP-dependent chromatin complexes | 5
DNARep | DNA repair | 3
HISase | Histone acetylation, deacetylation or methylation complexes | 8
Other | Cyclin kinase associated with transcription | 1
Pol2 | Pol II subunit | 1 (2 forms)
Pol3 | Pol III-associated | 6
TFNS | General Pol II-associated factor, not site-specific | 8
TFSS | Pol II transcription factor with sequence-specific DNA binding | 87
Total | | 119

www.illumina.com/technology/chip_seq_assay.ilmn

ChIP-Seq: Histone modifications

Histone modification or variant | Signal characteristics | Association
H2A.Z | Peak | dynamic chromatin
H3K4me1 | Peak/region | enhancers and other distal elements, also downstream of transcription starts
H3K4me2 | Peak | promoters and enhancers
H3K4me3 | Peak | promoters/transcription starts
H3K9ac | Peak | promoters
H3K9me1 | Region | 5′ end of genes
H3K9me3 | Peak/region | gene repression, constitutive heterochromatin and repetitive elements
H3K27ac | Peak | gene expression, active enhancers and promoters
H3K27me3 | Region | polycomb complex, repressive domains and silent developmental genes
H3K36me3 | Region | elongation, transcribed portions of genes, 3′ regions after intron 1
H3K79me2 | Region | transcription, 5′ end of genes
H4K20me1 | Region | 5′ end of genes

Results: ChIP-Seq
• 636,336 binding regions
• 8.1% of the genome
• Sequence-specific TF ChIP-seq:
  – 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif
  – 55% of cases contained the expected motif
• Further description: Qin Zhiyi & Ma Xiaopeng, 13:30

DNase I hypersensitivity
• 2,890,000 unique hypersensitive sites (DHSs)
• 4,800,000 sites across 25 cell types
• Tier 1 and tier 2 cell types: 205,109 DHSs per cell type
• 98.5% of ChIP-seq TFBS within DHSs
• Further description: Guo Weilong 12:30, He Chao 14:30
https://www.nationaldiagnostics.com/electrophoresis/article/dnase-i-footprinting

FAIRE-seq
• Like the opposite of ChIP-seq
• Cross-link the nucleosomes to the DNA
  – But not the sequence-specific TFs
• Shear the DNA into small pieces
• Remove the protein-bound DNA
• Sequence the non-bound DNA
Gaulton KJ et al, Nature Genetics 42, 255–259 (2010) doi:10.1038/ng.530

DNA methylation
• CpG methylation: regulates gene expression
  – In promoters: gene repression
  – In genes: gene transcription
• 1,200,000 methylated CpGs in 82 cell lines and tissues
  – 96% differentially methylated, especially those in genes
• Unmethylated genic CpG islands associated with binding of P300, an enhancer-related histone acetyltransferase
• Allele-specific methylation: genomic imprinting
• Aberrant methylation in cancer cell lines
• Reproducible methylation outside CpG dinucleotides
http://www.diagenode.com/en/applications/bisulfite-conversion.php

Chromosome conformation capture
Montavon and Duboule, Trends in Cell Biology (2012) 22:7, 347–354

Results: Chromosome interactions
• Chromosome conformation capture (3C):
  – 5C: 3C-carbon copy
  – ChIA-PET
• Identified 127,417 promoter-centred chromatin interactions using ChIA-PET
  – 98% intra-chromosomal
• 2,324 promoters involved in ‘single-gene’ enhancer–promoter interactions
• 19,813 promoters were involved in ‘multi-gene’ interaction complexes spanning up to several megabases
• 50–60% of long-range interactions occurred in only one of the four cell lines
• Further discussion: Li Yanjian, 10:40

Primary Findings
• 80.4% of the human genome is doing at least one of the following:
  – Bound by a transcription factor
  – Transcribed
  – Modified histone
• 99% is within 1.7 kb of at least one of the biochemical events
• 95% within 8 kb
  of a DNA–protein interaction or DNase I footprint
• 7 chromatin states:
  – 399,124 enhancer-like regions
  – 70,292 promoter-like regions
• Correlation between transcription, chromatin marks, and TF binding
• Functional regions contain lots of SNPs
  – Disease-associated SNPs in non-coding regions tend to be in functional elements

End of Introduction

Summary of ENCODE elements
• 80.4% of the human genome is covered by at least one ENCODE-identified element
• 62% of the genome is transcribed
• 56% of the genome is associated with histone modifications
• Excluding RNA elements and broad histone elements, 44.2% of the genome is covered
  – open chromatin (15.2%)
  – transcription factor binding (8.1%)
  – 19.4% DHS or transcription factor ChIP-seq peaks across all cell lines
• 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%)
  – 4.5x the amount of protein-coding exons (1.2%)
  – 2x the amount of conserved sequence between mammals
• Estimate: 50% of DHS remain to be found
  – Based on saturation curves

Diversity vs Conservation: Interactive Figure
[Figure: Diversity plotted against Conservation]
A high-resolution map of human evolutionary constraint using 29 mammals, Nature 478, 476–482 (2011)

Conservation in Bound Motifs vs Unbound Motifs
[Figure: Diversity plotted against Conservation]
http://www.nature.com/encode/interactive-figures/nature11247_F1

Model of gene expression – histone marks

Model of gene expression – TF binding

Transcription factor co-associations

Seven major classes of genome states
CTCF (CTCF-enriched element)
  CTCF signal, no histone modifications, open chromatin, may have insulator function, enriched for cohesin components RAD21 and SMC3
E (Predicted enhancer)
  Open chromatin, H3K4me1, other enhancer-associated marks, enriched for EP300, FOS, FOSL1, GATA2, HDAC8, JUNB, JUND, NFE2, SMARCA4, SMARCB1, SIRT6 and TAL1 sites, nuclear and whole-cell RNA poly(A) signal
PF (Predicted promoter flanking)
  Regions that generally surround TSS segments
R (Predicted repressed)
  H3K27me3 polycomb-enriched regions, REST, BRF2, CEBPB, MAFK, TRIM28, ZNF274 and SETDB1 sites, or no signal at all
TSS (Predicted promoter including TSS)
  H3K4me3, open chromatin, Pol II, Pol III, short RNAs, close to TSS sites
T (Predicted transcribed)
  H3K36me3 transcriptional elongation signal, overlap with gene bodies, phosphorylated Pol II, cytoplasmic poly(A)+ RNA
WE (Weak enhancer)
  Similar to the E state, but weaker signals and weaker enrichments

Data integration and genome segmentation
[Figure labels: Enhancer, Transcribed, Repressed, TSS, RNA expression, Transcription factors]

Association between genome states and annotations
[Figure: both axes labelled Genome segment]

Enhancer validation in mouse and fish
• Enhancer from K562 cell (leukemia) drives basal promoter with reporter gene in embryonic mouse blood cells and medaka fish

Genome segment clustering
• 6 cell types

Genome cluster function
• Genome state is related to gene function

Allele-specific expression
[Figure labels: Pol II, Txn, Rpn]
• Correlation of allele-specific signal by gene by genomic segment

Genome-wide association studies
[Figure labels: Annotated disease-causing SNPs, Control SNPs, Selected TFBS tracks, Significant overlap]

Diseases
• No genes, but several TFBS near the disease-causing SNPs

Conclusions
• 80% of human genome annotated with at least one association
  – Protein-binding
  – Histone modification
  – Transcription
• ENCODE data combination
  – Model gene expression
  – Genome segmented into 7 types
    • Different in each cell line
• ENCODE data combined with other data
  – 1000 genomes: see influence of parental DNA
  – Genome-wide association studies

Discussion
• 147 types of cells, and the human body has a few thousand
• 80% functional: controversial
  – 80% of the genome is being transcribed and/or has a protein bound to it some of the time
  – Heterochromatin: tightly packed repeat sequences
  – Most of that activity isn’t particularly specific or interesting and may not have impact
  – Important not to overstate the findings
  – Ewan Birney: “cumulative occupation of 8% of the genome by
    TFs”
• Reproducibility
  – In exactly the same cell lines, same conditions, different time or place
  – Same cell lines, different conditions
  – Same cell type, different people
• Cell lines vs tissue
• Cancer vs normal
http://blogs.nature.com/news/2012/09/fighting-about-encode-and-junk.html
http://blogs.discovermagazine.com/notrocketscience/2012/09/05/encode-the-rough-guide-to-the-human-genome/

Applications
• Visible as genome tracks in the UCSC Genome Browser
• Mutations from
  – Cancer sequencing
  – GWAS
  – Find out what that part of the genome is doing
• Compare with your cancer data (RNA-seq)
• Comparative genome analysis
• Gene or pathway of interest

Online Resources
• Interactive graphics in online version of paper
• Interactive app on Nature ENCODE main page
www.nature.com/encode/
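
The "find out what that part of the genome is doing" application on the Applications slide can be pictured as an interval lookup: given a mutation position, report which annotated segments contain it. The following is only a minimal sketch under stated assumptions. The `Segment` type, the `annotate` helper, and all coordinates are hypothetical and made up for illustration; they are not real ENCODE data or any ENCODE tool's API. Positions are assumed to be 0-based with half-open intervals, as in BED files, and the state labels are taken from the seven-class segmentation listed earlier.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED-style convention, assumed here)
    end: int    # exclusive
    state: str  # one of the seven segmentation labels, e.g. "E" or "TSS"


# Hypothetical segments for illustration only; not real ENCODE coordinates.
SEGMENTS = [
    Segment("chr1", 1000, 1500, "TSS"),
    Segment("chr1", 1500, 2200, "PF"),
    Segment("chr1", 5000, 5600, "E"),
    Segment("chr2", 300, 900, "CTCF"),
]


def annotate(chrom: str, pos: int, segments=SEGMENTS) -> list[str]:
    """Return the state label of every segment that contains the position."""
    return [
        s.state
        for s in segments
        if s.chrom == chrom and s.start <= pos < s.end
    ]


if __name__ == "__main__":
    # Hypothetical mutation positions, e.g. from cancer sequencing or GWAS.
    for chrom, pos in [("chr1", 1200), ("chr1", 5100), ("chr1", 3000)]:
        hits = annotate(chrom, pos)
        print(chrom, pos, hits if hits else "no annotated element")
```

A linear scan is enough for a handful of segments. For genome-scale tracks an indexed structure such as an interval tree, or a dedicated interval tool, would be the more usual choice.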