Detection and analysis of transcriptional control sequences
Wyeth Wasserman
October VanBUG Seminar
Centre for Molecular Medicine and Therapeutics (CMMT)
Children’s and Women’s Hospital
University of British Columbia

Transcription Simplified
[Figure: schematic of a promoter, labelled URF, Pol-II, URE, TATA]

Overview of Transcription in Gene Regulation
• At the most basic level, transcriptional regulation is defined by binding of TFs to DNA
• Complexity is increased by TF interactions, chromatin structure and protein modifications
• How can we advance our understanding of regulation by computational analysis?

A short history lesson…

Representing Binding Sites for a TF (HNF1)
• A single HNF1 site
  » AAGTTAATGATTAAC
• A set of sites represented as a consensus
  » VDRTWRWWSHDWVWH
• A matrix describing a set of sites

  A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
  C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
  G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
  T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

PFMs to PWMs
One would like to add the following features to the model:
1. Correct for the base frequencies in DNA
2. Weight for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

$$ w(b,i) = \log_2 \frac{\left( f(b,i) + \sqrt{N}/4 \right) / \left( N + \sqrt{N} \right)}{p(b)} $$

Here f(b,i) is the count of base b at position i, N is the number of sites and p(b) is the background frequency of base b. With N = 5 and p(b) = 0.25, the f matrix below gives the w matrix below.

f matrix
  A  5  0  1  0  0
  C  0  2  2  4  0
  G  0  3  1  0  4
  T  0  0  1  1  1

w matrix
  A   1.6 -1.7 -0.2 -1.7 -1.7
  C  -1.7  0.5  0.5  1.3 -1.7
  G  -1.7  1.0 -0.2 -1.7  1.3
  T  -1.7 -1.7 -0.2 -0.2 -0.2

Score for TGCTG = -1.7 + 1.0 + 0.5 - 0.2 + 1.3 = 0.9

Performance of Profiles
• 95% of predicted sites bound in vitro (Tronche 1997)
• MyoD binding sites predicted about once every 600 bp (Fickett 1995)
• The Futility Theorem
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

A 1 kbp promoter screened with collection of TF profiles
[Figure: predicted sites along a 1 kbp promoter]

Phylogenetic Footprinting
70,000,000 years of evolution reveals most regulatory regions.

Phylogenetic Footprinting to Identify Functional Segments
[Figure: % identity in a 200 bp window plotted against window start position (human sequence). Actin gene compared between human and mouse with DPB.]

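The conversion on the "PFMs to PWMs" slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the function names (`pfm_to_pwm`, `score_site`) are hypothetical, and the pseudocount form (sqrt(N)/4 added to each count, N + sqrt(N) in the denominator, log base 2, uniform background of 0.25) is an assumption chosen because it reproduces the w values shown on that slide for the toy f matrix.

```python
import math
from typing import Dict, List, Optional

BASES = "ACGT"


def pfm_to_pwm(pfm: Dict[str, List[int]],
               background: Optional[Dict[str, float]] = None) -> Dict[str, List[float]]:
    """Convert a count matrix (PFM) into log2 weights (PWM).

    Assumed formula: w = log2( ((f + sqrt(N)/4) / (N + sqrt(N))) / p(b) ).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform base frequencies
    width = len(pfm["A"])
    n = sum(pfm[b][0] for b in BASES)  # number of sites, taken from the first column
    pseudo = math.sqrt(n)              # confidence-dependent pseudocount
    pwm: Dict[str, List[float]] = {b: [] for b in BASES}
    for i in range(width):
        for b in BASES:
            corrected = (pfm[b][i] + pseudo / 4) / (n + pseudo)
            pwm[b].append(math.log2(corrected / background[b]))
    return pwm


def score_site(pwm: Dict[str, List[float]], site: str) -> float:
    """Score a candidate site by summing one weight per position."""
    return sum(pwm[base][i] for i, base in enumerate(site))


# Toy f matrix from the slide (5 sites, 5 positions).
toy_pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
toy_pwm = pfm_to_pwm(toy_pfm)
print(round(score_site(toy_pwm, "TGCTG"), 1))  # 0.9
```

Because the weights are logarithms, scoring a site is a plain sum over positions, which is the "easy arithmetic" the slide refers to.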
Regulatory sites are usually conserved between orthologous genes

  HUMAN  ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA
         -|-|||||||||-|---|--|||-------|-|---|
  MOUSE  GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA

  HUMAN  ACATCAGCATACACGCAACTACACAGACTACGACTA
         ---|||||-||||---|-|----||-||-||||--
  MOUSE  CGTTCAGCTTACAGCTAGCATAGCATACGACGATAC

The 1 kbp promoter screen with footprinting
[Figure: predicted sites along the 1 kbp promoter after the conservation filter]

Choosing the “right” species...
(BONUS: What’s the ultimate sin in bioinformatics?)
[Figure: three comparisons: chicken vs. human, mouse vs. human, cow vs. human]

ConSite (www.phylofoot.org)

Performance: Human vs. Mouse
• Testing set: 40 experimentally defined sites in 15 well-studied genes
• 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

de novo Discovery of TF Binding Sites

Unraveling Transcriptional Control Mechanisms
Given a set of “co-regulated” genes, define motifs over-represented in the regulatory regions

Pattern Detection Methods
• Exhaustive
  – e.g. “Moby Dick” (Bussemaker, Li & Siggia)
  – Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections
• Monte Carlo/Gibbs Sampling
  – e.g. AnnSpec (Workman & Stormo)
  – Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

Yeast Regulatory Sequence Analysis (YRSA) system

Yeast tests of YRSA System
• DNA-damage response, partially mediated by MCB
• Classic cell-cycle array data, re-clustered by Getz et al
• PDR3-regulated genes from array study

THE PROBLEM: Pattern Detection in Long Sequences
[Figure: MEF2 similarity score plotted against sequence length (0 to 600) for the MEF2 set and a random set]

Four Approaches to Extend Sensitivity
• Phylogenetic Footprinting
  – Human-Mouse eliminates ~75% of sequence
• Better background models
  – e.g. AnnSpec
• Better definition of co-regulation
  – Microarrays occasionally produce noise
• Use biochemical knowledge about TFs
  – TFBS patterns are NOT random

Some characteristics have been explored…
• Segmentation: informative positions separated by variable positions (proteins bind as dimers)
• Positional Variance: a subset of positions contains most of the information
• Palindromes are common in the patterns

Our Hypothesis
• Point 1: Structurally-related DNA binding domains interact with similar target sequences
  – Exceptions exist (e.g. Zn-fingers)
• Point 2: There is a finite number of binding domains used in human TFs
  – Approximately 20-25
• Idea: We could use the shared binding properties for each family to focus pattern detection methods
  – Constrain the range of patterns sought

Comparison of profiles requires alignment and a scoring function
• Scoring function based on sum of squared differences
• Align frequency matrices with modified Needleman-Wunsch algorithm
• Calculate empirical p-values based on simulated set of matrices
[Figure: histogram of frequency against score]

Prediction of TF Class
[Diagram: a profile is compared (COMPARE) against the TF database (JASPAR); example result: match to bHLH]
• Jackknife test: 87% correct
• Independent test set: 93% correct

FBPs enhance sensitivity of pattern detection

APPLICATION: Cancer Protection Response
• Detoxification-related enzymes are induced by compounds present in broccoli
• Arrays, SSH and hard work have defined a set of responsive genes
• A known element mediates the response (Antioxidant Responsive Element)
• Controversy over the type of mediating leucine zipper TF
  – NF-E2/Maf or Jun/Fos

Application (2)
Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.
[Diagram labels: Gibbs Sampling; Gibbs with FBP Prior; Classify New TF Motif; Maf (p<0.02); Jun (p<0.98)]

Regulatory Modules
TFs do NOT act in isolation

Layers of Complexity in Metazoan Transcription
Chromatin picture used with permission of Zymogenetics.

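As a side note on the profile-comparison slide ("Comparison of profiles requires alignment and a scoring function"), the sum-of-squared-differences score between two frequency matrices can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, and the alignment is a gap-free slide of one matrix along the other, which stands in for (and is not) the modified Needleman-Wunsch algorithm named on the slide. The empirical p-value step is not shown.

```python
from typing import List, Optional, Sequence, Tuple

Column = Sequence[float]  # frequencies for A, C, G, T at one position


def normalize(count_columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn count columns into frequency columns (each column sums to 1)."""
    return [[c / sum(col) for c in col] for col in count_columns]


def column_ssd(p: Column, q: Column) -> float:
    """Sum of squared differences between two columns (0 = identical, 2 = maximal)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def best_ungapped_ssd(m1: Sequence[Column], m2: Sequence[Column],
                      min_overlap: int = 3) -> Optional[Tuple[float, int]]:
    """Slide m2 along m1 without gaps.

    Returns (mean SSD over the overlap, offset of m2 relative to m1) for the
    best-scoring overlap, or None if no overlap of min_overlap columns exists.
    """
    best: Optional[Tuple[float, int]] = None
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        pairs = [(m1[i], m2[i - offset])
                 for i in range(len(m1)) if 0 <= i - offset < len(m2)]
        if len(pairs) < min_overlap:
            continue
        score = sum(column_ssd(p, q) for p, q in pairs) / len(pairs)
        if best is None or score < best[0]:
            best = (score, offset)
    return best


# Illustrative data only: a five-column toy profile and a three-column slice of it.
profile = normalize([[5, 0, 0, 0], [0, 2, 3, 0], [1, 2, 1, 1], [0, 4, 0, 1], [0, 0, 4, 1]])
fragment = profile[1:4]
print(best_ungapped_ssd(profile, fragment))  # (0.0, 1)
```

A lower mean SSD means more similar profiles; turning such scores into p-values would require comparing them with scores from a simulated set of matrices, as the slide describes.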
Liver Differentiation (data mostly from studies of hepatocytes)
[Figure: differentiation stages (Stem, Early Fetal, Mature) with the TFs CEBPa, HNF3, HNF1, HNF4]

Liver regulatory modules

Models for Liver TFs…
(Data that takes 2 months to produce and 10 seconds to present)
(Or, what to do with an astrophysicist new to bioinformatics)
[Figure: binding models for HNF1, C/EBP, HNF3, HNF4]

Training predictive models for modules
• Limited by small size of positive training set
• We elected to use logistic regression analysis for the first models
• Your favorite statistical approach would probably do equally well – data limited

Logistic Regression Analysis
[Diagram: four inputs are multiplied by the coefficients a1, a2, a3, a4 and summed (Σ) to give the “logit”]
Optimize a vector to maximize the distance between output values for positive and negative training data.
Output value is:

$$ p(x) = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}} $$

UDPGT1 (Gilbert’s Syndrome)
[Figure: liver module model score plotted against “window” position in sequence, wildtype and mutant]

PERFORMANCE
• Liver (Genome Research, 2001)
  – At 1 hit per 35 kbp, identifies 60% of modules
  – Limited to genes expressed late in liver development
• Skeletal Muscle (JMB, 1998)
  – Set to 1 prediction per 35 000 bp
  – Identifies 66% of test set correctly
LRA models do not account for multiple sites for the same TF*
* Side-track: Newer Methods

Combining Phylogenetic Footprinting with a Module Model

Genome Scan
• Screened the available mouse genomic sequences (~300 MB) for modules and discarded hits for which sequence was not conserved with human (BLAST)
• Removed regions for which corresponding human sequence did not score as module
• Of ~100 predicted modules
  – 20 annotated genes: 5 from training, 3 additional modules, 5 liver specific, 3 unknown and 4 not liver

de novo Discovery of Regulatory Modules

Focus on regulatory modules for pattern detection
[Diagram labels: Cluster Genes by Expression; Identify and Model Contributing TFs; Predictive Models]
Example matrix from the diagram:

  6  0  0  0  7  0  0
  2  8  4  7  1  0  2
  0  0  4  0  0  8  0
  0  0  0  1  0  0  6

Finding binding sites in sets of co-regulated human genes
• Sequence “space” is too large
  – Narrow with Phylogenetic Footprinting
• Identify patterns in conserved blocks via Gibbs sampling
• Assess quality of patterns based on biological knowledge

Phylogenetic Footprinting to Identify Conserved Regions

Skeletal Muscle Genes
• One of the most extensively studied tissues for transcriptional regulation
  – 45 genes partially analyzed
  – 26 genes with orthologous genomic sequence from human and rodent
• Five primary classes of transcription factors
  – Principal: Myf (myoD), Mef2, SRF
  – Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)

Regulatory regions directing muscle-specific transcription
[Figure: sites for MyoD/Myf, SRF, Mef2, Tef]

de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites
[Figure: discovered patterns labelled Mef2-Like, SRF-Like, Myf-Like]

We will soon be able to define modules for many contexts…

A gene-centric data integration project...

COMING SOON: The Integrated Module Sampler
[Diagram: Gene1, Gene2, Gene3, Gene4, Gene5; calls to GeneLynx; calls to ensEMBL; calls to BlastZ (switch to Lagan?); Module Sampler]

YOU SHOULD HAVE BEEN THERE…
THIS SLIDE EXCLUDED FROM THE POSTED FILE

Conclusions
• Evolution drives understanding in biology
  – Phylogenetic Footprinting
• Biochemistry inspires Bioinformatics
  – Regulatory Modules
  – Familial Binding Profiles
• Analysis of regulatory sequences is improving
  – Given sets of orthologous genes, one can predict regulatory regions
  – Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors
• Much more work is needed…

THANKS!
Wasserman Group – CMMT: Danielle Kemmer, Several Newcomers
Wasserman Group – Sweden: Albin Sandelin, Raf Podowski (CA), Wynand Alkema
Collaborating Students: Malin Andersson (Odeberg), Öjvind Johansson (Lagergren), Hui Gao (Dahlman-Wright), Emily Hodges (Höög)
Collaborators: Chip Lawrence (Wadsworth), Boris Lenhard (K.I.), Jens Lagergren (SBC), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ)
Group Alumni: Per Engström, Elena Herzog, Annette Höglund, William Krivan, Boris Lenhard, Luis Mendoza
Support: Merck-Frosst, C&W, Pharmacia, EU–Marie Curie, CGDN, KI-Funder

URLs...
• Group: www.cmmt.ubc.ca
• ConSite/DPB: www.phylofoot.org
• GeneLynx: www.genelynx.org