Evolutionary and genomic approaches to find gene regulatory sequences

Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
Children’s Hospital of Philadelphia: Mitch Weiss
NimbleGen: Roland Green

University of Nebraska, Lincoln
February 14, 2007

Major goals of comparative genomics
• Identify all DNA sequences in a genome that are functional
– Selection to preserve function
– Adaptive selection
• Determine the biological role of each functional sequence
• Elucidate the evolutionary history of each type of sequence
• Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research

Known types of gene regulatory regions
G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.

Regulatory regions tend to be clusters of transcription factor binding sites
Sequence-specific
SV40 promoters and enhancer

Properties of known regulatory regions
• Binding sites for transcription factors, many with sequence specificity
• Clusters of binding sites
• Conventional promoters encompass major start sites for transcription
• Conserved over evolutionary time???

Structures involved in transcription are probably more complex
Middle image: Green: active transcription (Br-UTP label); Red: all nucleic acids; HeLa cell
Sides: EM spreads of transcripts
Peter R. Cook, Oxford University, http://users.path.ox.ac.uk/~pcook/images/Images.html

Domain opening is associated with movement to nonheterochromatic regions
Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev.
14: 940-950

Other possible activities for sequences involved in gene regulation
• Opening or closing a chromosomal domain
• Move a gene to or away from a transcription factory
• Control how long a gene is in a transcription factory
– Long association
    • High level expression
    • Really long gene
– Short association
    • Lower level expression
    • Rapid regulation
• Are these conserved over evolutionary time?

3 modes of evolution
Sequence matches at longer phylogenetic distances could reflect purifying selection.
Sequence differences at closer phylogenetic distances could reflect adaptive evolution.

Conservation vs. Constraint
• Conserved sequences are those that align between two species thought to be descended from a common ancestor
• Constrained sequences show evidence in their alignments of negative (purifying) selection
– E.g. change at a rate significantly slower than “neutral” DNA

Ideal cases for interpretation
[Figure: two panels plotted against position along chromosome]
• Human vs mouse: similarity. Segments under negative (purifying) selection stand out from neutral DNA. These are DNA segments with a function common to divergent species.
• Human vs rhesus: similarity and P (not neutral). Segments under positive (adaptive) selection stand out from neutral DNA. These are DNA segments in which change is beneficial to at least one of the two species.

Messages about evolutionary approaches to predicting regulatory regions
• Regulatory regions are conserved, but not all to the same phylogenetic distance.
• Incorporation of pattern and composition information along with conservation can lead to effective discrimination of functional classes (regulatory potential).
• Regulatory potential in combination with conservation of a GATA-1 binding motif is an effective predictor of enhancer activity.
• In vivo occupancy by GATA-1 suggests other activities in addition to enhancers.
• Comparison of polymorphism and divergence from closely related species can reveal regulatory regions that are under recent selection.
Finding all gene regulatory regions is a challenge for comparative genomics
• Known regulatory regions for the HBB complex
• 23 total
• 19 conserved (align) between human and mouse
• Many others show no significant difference in a measure of constraint (phastCons) from the bulk or neutral DNA

Two extremes of constraint in TRRs

ENCODE projects
• ENCODE (ENCyclopedia Of DNA Elements): consortium aiming to find function for all human DNA sequences
– Phase I focused on 1% of human DNA
– 30 Mb, 44 regions
    • About 10 regions had known genes of interest (CFTR, HOX)
    • Others were chosen to get a sampling of regions varying in gene density and alignability with mouse
• Major areas
– Genes and transcripts
– Transcriptional regulation
– Chromatin structure
– Multiple sequence alignment
– Variation in human populations

Biochemical assays for protein-binding sites in DNA
Purified protein & Naked DNA

Chromatin Immunoprecipitation: DNA sites occupied by a protein inside cells.

ChIP-on-chip to examine many sites

Putative transcriptional regulatory regions = pTRRs
• Antibodies vs 10 sequence-specific factors:
– Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A
– High resolution ChIP-chip platforms: Affymetrix and NimbleGen
– Data from several different labs in ENCODE consortium
• High likelihood hits for ChIP-chip
– 5% false discovery rate
• Supported by chromatin modification data
– Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2, H3K4me3, etc.
– DNase hypersensitive sites (DHSs) or nucleosome depleted sites
• Result: set of 1369 pTRRs

A small fraction of cis-regulatory modules are conserved from human to chicken
[Figure: time scale in millions of years: 91, 173, 310, 450]
• About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines
• Tend to regulate genes whose products control transcription and development
David King

Most pTRRs are conserved in eutherian mammals
Percentage of class that align no further than:

               pTRRs    DNase HSs    Promoters
Primates:      3%       11%          1-13%
Eutherians:    71%      70%          63%
Marsupials:    21%      14%          16-28%
Tetrapods:     4%       4%           4-7%
Vertebrates:   1%       1%           2-4%

[Figure: time scale in millions of years: 91, 173, 310, 450]
Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.

Measures of conservation and constraint capture only a subset of pTRRs
• Fraction overlapping an MCS: stringent constraint
• phastCons (background rate corrected): allows a range of constraint
• Composite alignability (background rate corrected): aligns, but no inference about purifying selection

Different measures perform better on specific functional regions
[Plot: Sensitivity vs. 1-Specificity]

Examples of clade-specific pTRRs
Regulatory potential (RP) to distinguish functional classes

Good performance of ESPERR for gene regulatory regions (RP)
James Taylor, Francesca Chiaromonte

Conservation of predicted binding sites for transcription factors
Binding site for GATA-1

Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1.
Rylski et al., Mol Cell Biol.
2003

Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang

preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids

preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome

Examples of validated preCRMs

Correlation of Enhancer Activity with RP Score
Validation status for 99 tested fragments

preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated

CACC box helps distinguish validated from nonvalidated preCRMs
All validated preCRMs and all nonvalidated preCRMs: same parameters, compare the outputs
Consensus for EKLF binding site: CCNCMCCCW
Ying Zhang

preCRMs with conserved consensus GATA-1 binding sites are usually occupied by that protein: ChIP assay

Design of ChIP-chip for occupancy by GATA-1
1. Non-overlapping tiling array with 50bp probe and 100bp resolution (NimbleGen)
2. Cover range Mouse chr7:57225996-123812258 (~70Mbp)
3.
Antibody against the ER portion of GATA-1-ER protein in rescued G1E-ER4 cells
Yong Cheng, with Mitch Weiss & Lou Dore (CHoP), Roland Green (NimbleGen)

Signals in known occupied sites in Hbb LCR
[Figure panels: HS1, HS2, HS3]
1) Cluster of high signals
2) “hill” shape of the signals

Peak Finding Programs
• TAMALPAIS
– Mark Bieda from Peggy Farnham’s lab
– Focus more on the cluster of the signals
– 4 thresholds based on number of consecutive probes with signals in the 98th or 95th percentiles
• MPEAK
– Bing Ren’s lab
– Focus more on the “hill” shape of the signal
– 4 thresholds, for a series of probes with at least one that is 3, 2.5, 2 or 1 standard deviations above the mean

ChIP-chip hits for GATA-1 occupancy
Technical replicates of ChIP-chip with antibody against GATA1-ER
• Mpeak: 275 hits in both
• TAMALPAIS: 276 hits in both
• Overlap between programs (Venn diagram): 59 Mpeak only, 216 in common, 60 TAMALPAIS only
• 321 total ChIP-chip hits

ChIP-chip hits validate at a high rate
Validation determined by quantitative PCR. 19 of the 321 hits were tested. 13 (~70%) were validated.
[Plot label: ChIP DNA]
Validation rate is similar at different thresholds.
9 regions were “hits” in only one of the two technical replicates. None were validated.

Association of WGATAR and conservation with ChIP-chip Hits
1. 249 out of the 321 (78%) have WGATAR motifs, the binding site for GATA-1.
2. Of the GATA-1 binding motifs in those 249 hits, 112 (45%) are conserved between mouse and at least one non-rodent species.
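
The TAMALPAIS-style criterion on the Peak Finding Programs slide (a run of consecutive probes with signals in a high percentile) can be illustrated with a minimal Python sketch. This is not the TAMALPAIS program itself: the function names `percentile_cutoff` and `call_peaks`, the nearest-rank percentile definition, and the default run length are assumptions made only for illustration.

```python
import math


def percentile_cutoff(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_peaks(signal, pct=98, min_run=3):
    """Return (start, end) probe index pairs, end exclusive, for every run of
    at least min_run consecutive probes whose signal is at or above the
    pct-th percentile of all probes. Illustrative only; hypothetical names."""
    cutoff = percentile_cutoff(signal, pct)
    peaks = []
    run_start = None
    for i, value in enumerate(signal):
        if value >= cutoff:
            if run_start is None:
                run_start = i          # a new run of high probes begins
        elif run_start is not None:
            if i - run_start >= min_run:
                peaks.append((run_start, i))
            run_start = None           # run ended on a low probe
    # A run may extend to the last probe on the array.
    if run_start is not None and len(signal) - run_start >= min_run:
        peaks.append((run_start, len(signal)))
    return peaks


if __name__ == "__main__":
    # Toy data: low background with one cluster of four high probes.
    toy = [i * 0.001 for i in range(100)]
    toy[40:44] = [5.0, 5.1, 5.2, 5.3]
    print(call_peaks(toy, pct=97, min_run=3))
```

Raising `min_run` or `pct` makes the call more stringent, which mirrors the idea of several thresholds; an MPEAK-style caller would instead look at the shape of the signal around a probe that exceeds a standard-deviation cutoff.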
Expected and unexpected ChIP-chip hits

Distribution of ChIP-chip hits on 70Mb of mouse chr7
Yong Cheng, Yuepin Zhou and Christine Dorman

Almost half the GATA-1 ChIP-chip hits increase expression of a transgene, K562 cells
[Bar chart: fold change over parent for each tested fragment; groups: GATA-1 occupied sites by ChIP-chip, No GATA-1]
24 validated out of 56 fragments with ChIP-chip hits tested (43%)

Conserved and nonconserved ChIP-chip hits can be active as enhancers
[Figure panels: Conserved, active; Conserved, not active; Not conserved, active]

Polymorphism as a transient phase of evolution
Slide from Dr.
Hiroshi Akashi

Test of neutrality using polymorphism and divergence data

Test for recent selection in human noncoding DNA
• McDonald-Kreitman test
• Use ancestral repeats as neutral model (MKAR test)
• Count polymorphisms in human using dbSNP126
• Count divergence of human from
– Chimpanzee (great ape, diverged from human lineage 6 Myr ago)
– Rhesus macaque (Old World monkey, diverged from human lineage 23 Myr ago)
• Tiled windows, most analysis on 10kb windows
• Compute p-value for neutrality by chi-square test
• Ratio of polymorphism to divergence ratios gives indication of direction of inferred selection
Heather Lawson, Anthropology, PSU

pTRR apparently under positive selection

A promoter distal to the beta-like globin genes has a signal for recent purifying selection

Selection on a primate-specific promoter
The distal promoter is close to the locus control region for beta-globin genes
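
The MKAR test outlined on the "Test for recent selection in human noncoding DNA" slide compares polymorphism and divergence counts in a window against the same counts in ancestral repeats. A minimal Python sketch of such a comparison follows. The 2x2 table layout, the function name `mk_chi_square`, and the absence of a continuity correction are assumptions for illustration; the slides state only that a chi-square test and a ratio of polymorphism-to-divergence ratios were used.

```python
import math


def mk_chi_square(poly_window, div_window, poly_neutral, div_neutral):
    """Chi-square test (1 degree of freedom, no continuity correction) on the
    2x2 table of polymorphism vs. divergence counts in a test window and in
    the neutral reference (e.g. ancestral repeats).

    Returns (chi2, p_value, ratio), where ratio is
    (poly_window / div_window) / (poly_neutral / div_neutral).
    Hypothetical helper, not the authors' code."""
    a, b, c, d = poly_window, div_window, poly_neutral, div_neutral
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column of the table needs a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square upper tail is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return chi2, p_value, ratio


if __name__ == "__main__":
    # Toy counts, not data from the study.
    chi2, p, ratio = mk_chi_square(10, 40, 50, 50)
    print(f"chi2={chi2:.2f} p={p:.2e} ratio={ratio:.2f}")
```

Under the usual McDonald-Kreitman reading, a ratio well below 1 (an excess of divergence relative to polymorphism) points toward positive selection, and a ratio well above 1 points toward purifying selection. For small counts an exact test may be a better choice than the chi-square approximation.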
Many thanks …
B: Yong Cheng, Ross, Yuepin Zhou, David King
F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU

Computing Regulatory Potential (RP)

Alignment columns and their collapsed-alphabet symbols:

seq1  seq2  seq3  Collapsed alphabet
G     G     A     1
T     T     T     2
A     G     G     1
C     T     T     3
C     C     C     4
T     G     A     5
A     ?     C     7
T     A     A     6
A     G     A     8
C     C     T     3
G     C     G     6
C     C     T     3
A     A     A     9

• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
– positive (alignments in known regulatory regions)
– negative (alignments in ancestral repeats, a model for neutral DNA)
– E.g. frequency that 3 4 is followed by 5: 0.001 in regulatory regions, 0.0001 in ancestral repeats
• RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.

$$\mathrm{RP} = \sum_{a \,\in\, \text{segment}} \log \frac{p_{\mathrm{REG}}(s_a \mid s_{a-1} \ldots s_{a-t})}{p_{\mathrm{AR}}(s_a \mid s_{a-1} \ldots s_{a-t})}$$

ESPERR: Evolutionary Sequence and Pattern Extraction using Reduced Representations
Stage 1: Reduced representations
Stage 2: Improve encoding
Train models for classification
[Figure labels: gap, G, T]
E.g. the string 6 6 2 may occur frequently in the positive training set and rarely in the negative training set, and thus contribute to discrimination. If the positive training set is known regulatory regions, this would contribute to a positive RP.
Note that many different columns are reduced to a single “encoding” (a number in the figure). E.g. four different columns are each called “3”.
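
The RP calculation described above (two order-t Markov models over the collapsed alphabet, then a sum of log likelihood ratios) can be sketched in a few lines of Python. This is an illustrative sketch and not the ESPERR implementation: the function names `train_markov` and `rp_score`, the add-one pseudocount smoothing, and the toy training strings are assumptions.

```python
import math
from collections import defaultdict


def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from encoded alignments.

    `sequences` is a list of lists of collapsed-alphabet symbols.
    Pseudocount smoothing is an assumption of this sketch; it keeps
    unseen transitions from producing log(0)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1.0

    def prob(symbol, context):
        seen = counts.get(tuple(context), {})
        total = sum(seen.values()) + pseudocount * len(alphabet)
        return (seen.get(symbol, 0.0) + pseudocount) / total

    return prob


def rp_score(seq, order, p_reg, p_ar):
    """Sum of log likelihood ratios over every position with a full context."""
    score = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        score += math.log(p_reg(seq[i], context) / p_ar(seq[i], context))
    return score


if __name__ == "__main__":
    alphabet = list(range(1, 10))                  # collapsed symbols 1-9
    positive = [[3, 4, 5, 3, 4, 5, 3, 4, 5]]       # toy "regulatory" training set
    negative = [[3, 4, 1, 3, 4, 2, 3, 4, 1]]       # toy "ancestral repeat" set
    p_reg = train_markov(positive, 2, alphabet)
    p_ar = train_markov(negative, 2, alphabet)
    print(rp_score([3, 4, 5], 2, p_reg, p_ar))     # positive: favors regulatory
    print(rp_score([3, 4, 1], 2, p_reg, p_ar))     # negative: favors neutral
```

A positive score means the string of symbols is more probable under the regulatory model than under the ancestral-repeat model, matching the sign convention of the RP formula.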
Categories of Tested DNA Segments

Example that suggests turnover of GATA-1 BSs

Additional methods find CACC box as distinctive for validation
Inputs: all validated preCRMs, all nonvalidated preCRMs

CLOVER (Zlab)
Background: Mouse chr 19 (42.8% C+G), NCBI Build 30
EKLF PWM (Dr. Perkins)

Output for validated preCRMs
Motif   P(mm_chr19.m)
EKLF    0.0008

Output for nonvalidated preCRMs
Motif   P(mm_chr19.m)
none    none

ELPH (UMaryland)

         validated    non-validated
6-mer    TTATYT       GGCAGR
7-mer    CCWCAGM      RGRCAGR
8-mer    CASCCWGC     CAGGGAWR
9-mer    CCWGGCWGM    CWGRGAWRA

Hexamer Counting

Counts:
          validated    nonvalidated
NCACCC    60           32
CACCCW    56           27

Expected:
          validated    nonvalidated
NCACCC    16.31        5.81
CACCCW    11.74        4.36

Using Galaxy to find predicted CRMs
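
Motifs written with IUPAC ambiguity codes, such as the hexamers NCACCC and CACCCW counted above or the WGATAR motif for GATA-1, can be counted in a sequence with a short Python sketch. The function names `revcomp` and `count_motif` and the choice to count overlapping matches on both strands are assumptions for illustration; this is not the counting program used for the slides.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_motif(motif, seq, both_strands=True):
    """Count (possibly overlapping) matches of an IUPAC motif in seq.

    With both_strands=True the reverse complement of seq is scanned too,
    so a palindromic motif would be counted once per strand."""
    body = "".join(IUPAC[base] for base in motif.upper())
    pattern = re.compile("(?=(" + body + "))")   # lookahead allows overlaps
    seq = seq.upper()
    total = sum(1 for _ in pattern.finditer(seq))
    if both_strands:
        total += sum(1 for _ in pattern.finditer(revcomp(seq)))
    return total


if __name__ == "__main__":
    # Toy sequences, not data from the study.
    print(count_motif("WGATAR", "CCAGATAAGG"))
    print(count_motif("CACCCW", "GGCACCCTGG", both_strands=False))
```

Comparing such observed counts with the counts expected from background base composition is one way to arrive at a table like the hexamer table above; how the expected values on the slide were computed is not stated.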