CS273A Lecture 16: Functional Genomics
MW 12:50-2:05pm in Beckman B302
Profs: Serafim Batzoglou & Gill Bejerano
TAs: Harendra Guturu & Panos Achlioptas
http://cs273a.stanford.edu [BejeranoFall13/14]

Gene set enrichment analysis: The gene regulatory version

Cluster all genes for differential expression
[Figure: expression of genes (rows) across Experiment and Control replicates (columns), ordered from the most significantly up-regulated genes, through unchanged genes, to the most significantly down-regulated genes.]

The Gene Sets to test come from the Ontologies
[Figure: TJL-2004]

Ask about whole gene sets
[Figure: Experiment vs. Control expression with Gene Set 1, Gene Set 2 and Gene Set 3 marked. Gene set 3 is up-regulated and gene set 2 is down-regulated (ES/NES statistic).]

Combinatorial Regulatory Code
• 2,000 different proteins can bind specific DNA sequences.
• A regulatory region encodes 3-10 such protein binding sites.
• When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.
[Figure: proteins bound to protein binding sites on the DNA next to a gene.]

ChIP-Seq: first glimpses of the regulatory genome in action
[Figure: peak calling; cis-regulatory peak.]

What is the transcription factor I just assayed doing?
• Collect known literature of the form
  • Function A: Gene1, Gene2, Gene3, ...
  • Function B: Gene1, Gene2, Gene3, ...
  • Function C: ...
• Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.
• Form hypothesis and perform further experiments.
[Figure legend: cis-regulatory peak; gene transcription start site.]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
[Figure legend: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin cytoskeleton’).]
• ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]
• SRF is known as a “master regulator of the actin cytoskeleton”
• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.

Existing, gene-based method to analyze enrichment:
• Ignore distal binding events.
• Count affected genes.
• Rank by enrichment hypergeometric p-value.
  N = 8 genes in genome
  K = 3 genes annotated with the term
  n = 2 genes selected by proximal peaks
  k = 1 selected gene annotated with the term
  P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for?
Pro: A lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis.
Does it matter?
[Figure: microarray data fed to a microarray tool, next to gene regulation data fed to the same microarray tool.]

SRF Gene-based enrichment results
• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]
• SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding
• Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information
[Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets: SRF (H: Jurkat), NRSF (H: Jurkat), GABP (H: Jurkat), Stat3 (M: ESC), p300 (M: ESC), p300 (M: limb), p300 (M: forebrain), p300 (M: midbrain). Y-axis: fraction of all elements (0 to 0.7). X-axis: distance to nearest transcription start site (kb), binned 0-2, 2-5, 5-50, 50-500, > 500.]
Restricting to proximal peaks often leads to complete loss of key enrichments.

Bad Solution: Associating distal peaks brings in many false enrichments
Why bad?
• 14% of human genes are tagged ‘multicellular organismal development’.
• But 33% of base pairs have such a gene nearest upstream/downstream.
• Large “gene deserts” are often next to key developmental genes.
The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?

  Term                                    Bonferroni corrected p-value
  nervous system development              5×10^-9
  system development                      8×10^-9
  anatomical structure development        7×10^-8
  multicellular organismal development    1×10^-7
  developmental process                   2×10^-6

Real Solution: Do not convert to gene list. Analyze the set of genomic regions
[Figure legend: gene transcription start site; ontology term (‘actin cytoskeleton’); gene regulatory domain; genomic region (ChIP-seq peak).]
GREAT = Genomic Regions Enrichment of Annotations Tool
  p = 0.33 of genome annotated with the term
  n = 6 genomic regions
  k = 5 genomic regions hit annotation
  P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance.
=> Toss 2,000 random regions at genome, get NO (false) enrichments.

How does GREAT know how to assign distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
• Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site, which is extended to the basal domains of the nearest genes, but no further than 1 Mb.
• Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.

GREAT infers many specific functions of SRF from its binding profile

Top GREAT enrichments of SRF:

  Ontology          Term                      # Genes   Binomial P-value
  Gene Ontology     actin cytoskeleton        30        7×10^-9
  Gene Ontology     actin binding             31        5×10^-5
  Pathway Commons   TRAIL signaling           32        5×10^-7
  Pathway Commons   Class I PI3K signaling    26        2×10^-6
  TreeFam           FOS gene family           5         1×10^-8
  TF Targets        Targets of SRF            84        5×10^-76
  TF Targets        Targets of GABP           28        4×10^-9
  TF Targets        Targets of YY1            44        1×10^-6
  TF Targets        Targets of EGR1           23        2×10^-4

Experimental support* listed on the slide: Miano et al. 2007; Bertolotto et al. 2000; Chai & Tarnawski 2002; positive control (Targets of SRF); Poser et al. 2000; ChIP-Seq support; Natesan & Gilman 1995.
Top gene-based enrichments of SRF: top actin-related term is 28th in list.
* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
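The two enrichment tests contrasted in these slides can be checked numerically on the toy numbers given with them. The sketch below is only an illustration in plain Python, not GREAT's own code; the function names hypergeom_tail and binom_tail are made up for this example.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """Gene-based test: Pr(X >= k) when n genes are picked without
    replacement from N genes, K of which carry the annotation."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def binom_tail(k, n, p):
    """Region-based test: Pr(X >= k) when each of n genomic regions
    independently hits the annotation with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy numbers from the slides.
gene_based = hypergeom_tail(k=1, n=2, K=3, N=8)   # 18/28, about 0.643
region_based = binom_tail(k=5, n=6, p=0.33)       # about 0.017

print(f"hypergeometric P = {gene_based:.3f}")
print(f"binomial P       = {region_based:.3f}")
```

The binomial version takes the annotated fraction of the genome (p) as its null, which is why a term covering 33% of base pairs is expected to be hit by about a third of random regions without being called enriched.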
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Limb P300: I was blind and I can see
[Figure label: Gene List]

GREAT works with ANY cis-regulatory rich set
Example: GWAS Compendium set of height-associated unlinked SNPs

GREAT analysis of histone mark combinations

GREAT includes multiple ontologies
• Twenty ontologies spanning broad categories of biology
• 44,832 total ontology terms tested in each GREAT run
[Figure: the twenty ontologies with their term counts: 2,800; 5,215; 834; 6,700; 3,079; 911; 5,781; 427; 456; 150; 1,253; 288; 706; 615; 19; 222; 9; 6,857; 8,272; 238. The ontology names did not survive extraction. Credit: Michael Hiller]

Advantages of the GREAT approach
Tailored to the biology of gene regulation:
• Distal sites are incorporated, not ignored
• Variable length gene regulatory domains
• Multiple bindings next to same target gene rewarded
• Binding sites associated to (both) TSS, not gene body
• Extensive ontologies, some home-made

Algorithmic Optimization: (A) it works; (B) make it efficient

Enter GREAT.stanford.edu
• Choose genome
• Input peak list
• Hit submit!

GREAT web app: (Optional): alter association rules
http://great.stanford.edu
• Three association rule choices
• Literature-curated domains for a small subset of genes
[Figure: Lnp, Evx2 and the HoxD cluster; adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]

GREAT web app: output summary
• Ontology-specific enrichments
• Additional ontologies, term statistics, multiple hypothesis corrections, etc.
• Cool visualization opportunities!
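The default association rule described earlier (a basal domain of 5 kb upstream and 1 kb downstream of each TSS, extended towards the neighbouring genes' basal domains but no further than 1 Mb) can be sketched as follows. This is a simplified illustration for a single chromosome, not GREAT's implementation. The Gene class, the function names, the choice to measure the 1 Mb limit from the TSS, and the toy coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    tss: int      # transcription start site (bp)
    strand: str   # "+" or "-"


def basal_domain(g, up=5_000, down=1_000):
    """Basal domain: `up` bp upstream and `down` bp downstream of the TSS."""
    if g.strand == "+":
        return g.tss - up, g.tss + down
    return g.tss - down, g.tss + up


def regulatory_domains(genes, chrom_len, max_ext=1_000_000):
    """Extend each basal domain to the neighbours' basal domains,
    at most `max_ext` bp from the TSS (assumed reading of the rule)."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, g in enumerate(genes):
        lo, hi = basal[i]
        left_limit = basal[i - 1][1] if i > 0 else 0
        right_limit = basal[i + 1][0] if i + 1 < len(genes) else chrom_len
        # Extension never shrinks the basal domain itself.
        start = min(lo, max(left_limit, g.tss - max_ext))
        end = max(hi, min(right_limit, g.tss + max_ext))
        domains[g.name] = (max(start, 0), min(end, chrom_len))
    return domains


def genes_for_peak(pos, domains):
    """All genes whose regulatory domain contains the peak position."""
    return sorted(n for n, (s, e) in domains.items() if s <= pos < e)


# Toy chromosome with three hypothetical genes.
toy_genes = [Gene("A", 100_000, "+"), Gene("B", 2_500_000, "-"), Gene("C", 2_600_000, "+")]
domains = regulatory_domains(toy_genes, chrom_len=5_000_000)
print(domains)
print(genes_for_peak(2_550_000, domains))  # peak between B and C
print(genes_for_peak(1_300_000, domains))  # peak more than 1 Mb from any TSS
```

In this toy case the peak between B and C is associated with both genes, while the peak at 1.3 Mb falls outside every domain and is left unassigned.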
GREAT web app: term details page
• Genes annotated as “actin binding” with associated genomic regions
• Genomic regions annotated with “actin binding”
• Drill down to explore how a particular peak regulates Plectin and its role in actin binding

You can also submit any track straight from the UCSC Table Browser
A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help / Inquiries welcome!)

GREAT web app: export data
• HTML output displays all user selected rows and columns
• Tab-separated values also available for additional postprocessing

GREAT Web Stats
• 200-400 job submissions per day, from 8,000 IP addresses
• Over 175,000 jobs served, hundreds of citations

Adding a new species to GREAT
We need:
1. A good assembly
2. A high quality gene set
3. Good gene annotations*
*Most valuable for species with independent annotations!

Test case: early neocortex development
• Stages: E12.5, E14.5, E16.5
• Dissected the dorsal cerebral wall
• We performed p300, H3K27Ac ChIP-seq.
What have we learned?

Collect, Call Peaks: Mostly Distal
Did the experiment work?

In silico validation
Most enriched terms: what you want, where & when you want it

Are all cell populations represented?
[Figure labels: SVZ-IZ, CP, VZ] [Ayoub et al., PNAS, 2011]

What are different enhancer groups doing?

Which genes are most regulated?
[Ayoub et al, 2011]

The most densely regulated genes

Mn1 – Cryba4 gene desert

ChIP-seq is a powerful technique
Workflow: crosslink, fragment, antibody immunoprecipitation, high throughput sequencing, map to genome + find peaks.
ChIP-seq lessons:
• Transcription factors (TF) bind next to many relevant target genes. Example: SRF regulates actin cytoskeleton. Binds near 40 actin genes in immune cells.
• Most binding is distal. Example: 75% of SRF binding sites near actin genes are well over 1 kb from the TSS (up to 300 kb).
• Binding is context specific. Example: SRF also regulates muscle development. SRF is not enriched near muscle genes when assayed in immune cells.
• Many TF functions yet undiscovered.

How can we discover new TF functions?
Exciting new technologies: designer genome editing. Which TF to knock out next? Where to look for its effects?
ChIP-seq requires
• Quality antibody
• Enough cells of right context
• Sequencing costs
Seldom used in exploratory mode (ENCODE sampled 100 mostly basal TFs in very few contexts).
TF function exploration: expression perturbations, genetics screens. Which TF next? Where to look? Educated guesses.
Goal: Devise a rapid method for TF function prediction

How can we discover new TF functions?
Observations:
1. Gene function annotation is very rich. (Examples: GO, MGI expression & phenotype, etc.)
2. Conserved non-coding, likely cis-regulatory, sequence makes up 5-10% of mammalian genomes.
3. Conserved binding site prediction has become accurate. (Predict only with strong evidence, minimize false positives)
Idea: Use transcription factor tendency to bind next to many context-relevant genes to make function predictions.

Transcription Factor Function Prediction
1. Curate a rich library of TF binding site motifs.
   Pro: hundreds are now known. (inc. from Tim, Jussi’s work)
2. Predict cross-species conserved binding sites.
   Pro: careful conserved binding site prediction is accurate.
3. Search for extreme binding site concentration next to genes of any particular function.
   Pro: Leverage the observed phenomenon, and the very large body of knowledge about all (target) genes in the genome.
   Con: You will not get all binding sites of any function.
   Pro: You may predict many diverse TF functions.
Example: predict SRF to regulate: actin cytoskeleton, muscle development, ...

This is what we did: 1. Build motif library
332 motifs for 300 transcription factors (updating for many more)
[Figure label: 300 motifs]

2. Predict conserved binding sites
(. = same as human)

  Human         TTTCCCTTAAAAGGCTTAAATAAACTCACCAGTGTTTAATT
  Chimp         .........................................
  Gorilla       .........................................
  Orangutan     ...T.....................................
  Rhesus        ...T.....................................
  Tarsier       ...T..............G...........G......T...
  Mouse lemur   ...T............C........AT..TG.....C....
  Bushbaby      ...T................C....AT...G.....C....
  Tree shrew    ...T.........................TG..........
  Mouse         ...T.....................C...TG.....G.G..
  Guinea pig    ...T.........................TG........CG
  Squirrel      ...T.........................TG..........
  Rabbit        ...T......................T...G......G...
  Alpaca        ...T.....................................
  Cow           ...T........................TTG..........
  Cat           ...T..........................GAC.......A
  Microbat      ...T..........................C..........
  Megabat       ...T......................T.-.CA.........
  Hedgehog      ...T...............G.....................
  Rock hyrax    ...T.....................................
  Tenrec        ...T.....................................
  Armadillo     ...T.........................TG..........
  Sloth         ...T.........................TG..........

We in fact allow:
• Imperfect motif matches
• Binding site / alignment wobble
• Subset of species support
Guard against alignment fragmentations
Predict efficiently
Improve state of the art using “Excess conservation” scoring

Compare to shuffled motifs & weed out!
[Figure: the SRF motif next to shuffle #1, shuffle #2, shuffle #3, shuffle #4, … shuffle #10.]

Use three reference genomes rich in gene function annotations
Predicted for human, for mouse and for zebrafish

3. Predict binding site/TF functions
[Figure: SRF binding sites linked to the genes MYOG, MYF6, BMP4, NKX2-5, CAV3 and ACTA1. Legend: gene transcription start site; SRF binding site; enhancer to gene association.]
SRF must regulate muscle structure development: predicted to bind next to 157 genes (p = 7.43×10^-41)

Conservative multiple testing correction
• Control false positives when predicting function for hundreds of transcription factors, using the shuffled versions of each motif (shuffle #1 through shuffle #10 for the SRF motif).
• Remove terms with E[occurrences] ≥ 1, for example: kidney morphogenesis, absent semicircular canals, dorsal spinal cord development, …
• Retains only 5% of GREAT predictions.
• 2,543 transcription factor to function links (16% FDR)

PRISM vs. ChIP-seq
SRF terms compared between PRISM and T cell ChIP-seq:

  Term                                Status
  actin cytoskeleton                  Known
  structural constituent of muscle    Known
  dilated heart ventricles            Known
  regulation of insulin secretion     Novel

[Figure labels: actin cytoskeleton; SRF; muscle; heart.]
Every known function is supported by dozens or hundreds of novel binding sites.
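The shuffled-motif control ("Remove terms with E[occurrences] ≥ 1") can be read in more than one way, and the slides do not spell out the exact computation. The sketch below shows one possible reading, stated as an assumption: pool the predictions made for shuffled motifs, estimate how often each term would be predicted by a null motif library of the same size, and drop real predictions for terms expected at least once. The function names and the toy data are invented for illustration only.

```python
from collections import Counter


def expected_null_occurrences(shuffle_runs, n_shuffles):
    """shuffle_runs: one collection of predicted terms per shuffled-motif run,
    pooled over all motifs. With n_shuffles shuffles per motif, count / n_shuffles
    estimates how often a term appears for one null library of the same size."""
    counts = Counter(term for run in shuffle_runs for term in set(run))
    return {term: c / n_shuffles for term, c in counts.items()}


def filter_predictions(real_predictions, expected, threshold=1.0):
    """Keep (tf, term) links whose term is expected fewer than
    `threshold` times under the shuffled-motif null."""
    return [(tf, term) for tf, term in real_predictions
            if expected.get(term, 0.0) < threshold]


# Toy data: 2 motifs, each shuffled twice, giving 4 null runs (all hypothetical).
shuffle_runs = [
    ["kidney morphogenesis"],
    ["kidney morphogenesis", "actin cytoskeleton"],
    ["kidney morphogenesis"],
    [],
]
expected = expected_null_occurrences(shuffle_runs, n_shuffles=2)

real = [("SRF", "actin cytoskeleton"), ("SRF", "kidney morphogenesis")]
kept = filter_predictions(real, expected)
print(expected)
print(kept)
```

Here "kidney morphogenesis" shows up in three of the four null runs (1.5 expected occurrences), so it is removed, while "actin cytoskeleton" (0.5) is retained.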
PRISM re-discovers known functions

  TF     Function                               p-value        Target genes
  SRF    muscle structure development           7.43×10^-41    157
  GLI2   skeletal system development            7.07×10^-48    192
  CRX    retinal photoreceptor degeneration     1.30×10^-10    34
  AR     abnormal spermiogenesis                1.19×10^-6     26

Is the number of re-discovered known functions impressive?

PRISM re-discovers many known functions
[Figure: SRF and genes involved in “muscle structure development”.]

Distal binding sites are important
[Figure label: 3.6×]

PRISM predicts many novel functions

  TF      Function                              p-value        Target genes
  MYF6    abnormal pancreas development         1.67×10^-10    21
  GABPA   transcription by RNA polymerase I     3.64×10^-11    10
  GATA6   abnormal pancreas development         5.69×10^-13    23
  …

Nature Genetics, Dec 2011
[Figure: function prediction word cloud (word size correlates with its frequency in our function predictions).]

PRISM works: Experimental validation

  TF      Function                              p-value        Target genes
  MYF6    abnormal pancreas development         1.67×10^-10    21

PRISM is built for the experimentalist
http://PRISM.stanford.edu
Search:
• Human,
• Mouse, or
• Zebrafish
Search for:
• Transcription factor: e.g. “MYOG”,
• Function: e.g. “insulin”,
• Target gene: e.g. “ACTA1”, or
• Target genomic region: e.g. 9p21