#28 - Promoter Prediction 10/29/07

BCB 444/544
Lecture 28
Gene Prediction - finish it
Promoter Prediction
#28_Oct29
BCB 444/544 F07 ISU Dobbs #28- Promoter Prediction 10/29/07

Required Reading (before lecture)
Mon Oct 29 - Lecture 28
Promoter & Regulatory Element Prediction
• Chp 9 - pp 113 - 126
Wed Oct 31 - Lecture 29
Phylogenetics Basics
• Chp 10 - pp 127 - 141
Thurs Nov 1 - Lab 9
Gene & Regulatory Element Prediction
Fri Nov 2 - Lecture 30
Phylogenetic Tree Construction Methods & Programs
• Chp 11 - pp 142 - 169

Assignments & Announcements
Mon Oct 29 - HW#5 - will be posted today
HW#5 = Hands-on exercises with phylogenetics and tree-building software
Due: Mon Nov 5

BCB 544 "Team" Projects
Last week of classes will be devoted to Projects
• Written reports due:
  • Mon Dec 3 (no class that day) (not Fri Nov 1 as previously posted)
• Oral presentations (20-30') will be:
  • Wed-Fri Dec 5,6,7
  • 1 or 2 teams will present during each class period
See Guidelines for Projects posted online

BCB 544 Only: New Homework Assignment
544 Extra#2
Due:
√ PART 1 - ASAP
PART 2 - meeting prior to 5 PM Fri Nov 2
Part 1 - Brief outline of Project; email to Drena & Michael. After response/approval, then:
Part 2 - More detailed outline of project
• Read a few papers and summarize status of problem
• Schedule meeting with Drena & Michael to discuss ideas

Seminars this Week
BCB List of URLs for Seminars related to Bioinformatics:
http://www.bcb.iastate.edu/seminars/index.html
• Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
  • Todd Yeates, UCLA: TBA - something cool about structure and evolution?
• Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
  • Bob Jernigan, BBMB, ISU: Control of Protein Motions by Structure

Chp 8 - Gene Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 8 Gene Prediction
• Categories of Gene Prediction Programs
• Gene Prediction in Prokaryotes
• Gene Prediction in Eukaryotes

Computational Gene Prediction: Approaches
• Ab initio methods
  • Search by signal: find DNA sequences involved in gene expression
  • Search by content: test statistical properties distinguishing coding from non-coding DNA
• Similarity-based methods
  • Database search: exploit similarity to proteins, ESTs, cDNAs
  • Comparative genomics: exploit aligned genomes
    • Do other organisms have similar sequence?
• Hybrid methods - best

Computational Gene Prediction: Algorithms
1. Neural Networks (NNs) (more on these later…)
   e.g., GRAIL
2. Linear discriminant analysis (LDA) (see text)
   e.g., FGENES, MZEF
3. Markov Models (MMs) & Hidden Markov Models (HMMs)
   e.g., GeneSeqer - uses MMs
   GENSCAN - uses 5th order HMMs (see text)
   HMMgene - uses conditional maximum likelihood (see text)

Signals Search
Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes.

Content Search
Observation: Encoding a protein affects statistical properties of DNA sequence:
• Nucleotide/amino acid distribution
• GC content (CpG islands, exon/intron)
• Uneven usage of synonymous codons (codon bias)
• Hexamer frequency - most discriminative of these for identifying coding potential
Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Human Codon Usage
[Figure: human codon usage]

Predicting Genes based on Codon Usage Differences
Algorithm: Process sliding window
• Use codon frequencies to compute probability of coding versus non-coding
• Plot log-likelihood ratio: $\log\left(\frac{P(S \mid \text{coding})}{P(S \mid \text{non-coding})}\right)$
[Figure: coding profile of ß-globin gene, with exons marked]

Similarity-Based Methods: Database Search
In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
ATTGCGTAGGGCGCT
TAACGCATCCCGCGA
Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
Problems:
• Will not find "new" or RNA genes (non-coding genes)
• Limits of similarity are hard to define
• Small exons might be overlooked

Similarity-Based Methods: Comparative Genomics
Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
human GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
mouse C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
Advantages:
• May find uncharacterized or RNA genes
Problems:
• Finding suitable evolutionary distance
• Finding limits of high similarity (functional regions)

Human-Mouse Homology
Comparison of 1196 orthologous genes
• Sequence identity between genes in human vs mouse
  Exons: 84.6%
  Protein: 85.4%
  Introns: 35%
  5' UTRs: 67%
  3' UTRs: 69%

GeneSeqer - Brendel et al. - ISU
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
Brendel et al (2004) Bioinformatics 20: 1157
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
Thanks to Volker Brendel, ISU for the following Figs & Slides
Slightly modified from: BBSI Genome Informatics Module
http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
V Brendel vbrendel@iastate.edu

Spliced Alignment Algorithm
• Perform pairwise alignment with large gaps in one sequence (due to introns)
  • Align genomic DNA with cDNA, ESTs, protein sequences
• Score semi-conserved sequences at splice junctions
  • Using Bayesian probability model & 1st order MM
• Score coding constraints in translated exons
  • Using Bayesian model
[Figure: intron flanked by splice sites, GT at the donor site and AG at the acceptor site] (Brendel 2005)
Brendel et al (2004) Bioinformatics 20: 1157

Splice Site Detection
Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
• Information content $I_i$: $I_i = 2 + \sum_{B \in \{U,C,A,G\}} f_{iB} \log_2 f_{iB}$
• Extent of splice signal window: $I_i \ge \bar{I} + 1.96\,\sigma_{\bar{I}}$
  i: ith position in sequence
  $\bar{I}$: avg information content over all positions >20 nt from splice site
  $\sigma_{\bar{I}}$: avg sample standard deviation of $\bar{I}$
(Brendel 2005)

Information Content vs Position
[Figure: two panels, Human T2_GT and Human T2_AG; information content (0.0 - 0.8) vs position (-50 to 50) relative to the splice site]
Which sequences are exons & which are introns? How can you tell?
(Brendel 2005)

Markov Model for Spliced Alignment
[Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels include PΔG, (1-PΔG)(1-PD(n+1)), (1-PΔG)PD(n+1), PA(n) and 1-PA(n)] (Brendel 2005)

Evaluation of Splice Site Prediction
TP = positive instance correctly predicted as positive
FP = negative instance incorrectly predicted as positive
TN = negative instance correctly predicted as negative
FN = positive instance incorrectly predicted as negative

                    Actual True     Actual False
Predicted True      TP              FP              PP = TP + FP
Predicted False     FN              TN              PN = FN + TN
                    AP = TP + FN    AN = FP + TN

(Fig 5.11, Baxevanis & Ouellette 2005)

Evaluation of Predictions
• Sensitivity: $S_n = TP / AP = 1 - \alpha$ (= Coverage = Recall)
• Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, where $r = AN / AP$
• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$
Do not memorize this!

Evaluation of Predictions - in English
• Sensitivity: $S_n = TP / AP$
  In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
  IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!
• Specificity: $S_p = TP / PP$
  In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
  IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")

Best Measures for Comparison?
• ROC curves (Receiver Operating Characteristic (?!!))
  http://en.wikipedia.org/wiki/Roc_curve
  In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. (In this plot, specificity has its medical meaning, TN / AN, not the $S_p$ defined above.) The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate)
• Correlation Coefficient
  Matthews correlation coefficient (MCC)
  MCC = 1 for a perfect prediction
        0 for a completely random assignment
       -1 for a "perfectly incorrect" prediction
Do not memorize this!

GeneSeqer: Input
http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
[Figure] (Brendel 2005)

GeneSeqer: Output
[Figure] (Brendel 2005)

GeneSeqer: Gene Evidence Summary
[Figure] (Brendel 2005)

Gene Prediction - Problems & Status?
Common errors?
• False positive intergenic regions:
  • 2 annotated genes actually correspond to a single gene
• False negative intergenic region:
  • One annotated gene structure actually contains 2 genes
• False negative gene prediction:
  • Missing gene (no annotation)
• Other:
  • Partially incorrect gene annotation
  • Missing annotation of alternative transcripts
Current status?
• For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
  • Limitation? Training data: predictions are organism specific
• Combined ab initio/homology based predictions: improved accuracy
  • Limitation? Availability of identifiable sequence homologs in databases

Recommended Gene Prediction Software
Ab initio
• GENSCAN: http://genes.mit.edu/GENSCAN.html
• GeneMark.hmm: http://exon.gatech.edu/GeneMark/
• others: GRAIL, FGENES, MZEF, HMMgene
Similarity-based
• BLAST, GenomeScan, EST2Genome, Twinscan
Combined:
• GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
• ROSETTA
Consensus: because results depend on organisms & specific task, always use more than one program!
• Two servers that report consensus predictions
  • GeneComber
  • DIGIT

Other Gene Prediction Resources: at ISU
http://www.bioinformatics.iastate.edu/bioinformatics2go/

Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
Lists of Gene Prediction Software
http://www.bioinformaticsonline.org/links/ch_09_t_1.html
http://cmgm.stanford.edu/classes/genefind/
Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
Chapter 4 Finding Genes
4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
4.2 Using MZEF To Find Internal Coding Exons
4.3 Using GENEID to Identify Genes
4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

Chp 9 - Promoter & Regulatory Element Prediction
SECTION III GENE AND PROMOTER PREDICTION
Xiong: Chp 9 Promoter & Regulatory Element Prediction
• Promoter & Regulatory Elements in Prokaryotes
• Promoter & Regulatory Elements in Eukaryotes
• Prediction Algorithms

Eukaryotes vs Prokaryotes: Genomes
Eukaryotic genomes
• Are packaged in chromatin & sequestered in a nucleus
• Are larger and have multiple linear chromosomes
• Contain mostly non-protein coding DNA (98-99% in humans)
Prokaryotic genomes
• DNA is associated with a nucleoid, but no nucleus
• Much smaller; usually a single, circular chromosome
• Contain mostly protein encoding DNA

Eukaryotes vs Prokaryotes: Gene Structure
[Figure]

Eukaryotes vs Prokaryotes: Genes
Eukaryotic genes
• Are larger and more complex than in prokaryotes
• Contain introns that are "spliced" out to generate mature mRNAs*
• Often undergo alternative splicing, giving rise to multiple RNAs*
• Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
* In biology, statements such as this include an implicit "usually" or "often"
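Before moving further into promoter prediction, the evaluation measures defined in the Chp 8 slides (TP, FP, TN, FN, sensitivity, specificity as TP / PP, misclassification rates, normalized specificity) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `evaluate` and the example counts are made up, and the MCC formula used is the standard textbook one, which the slides name but do not write out.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Compute the slide's evaluation measures from one confusion table.

    Assumes at least one actual positive and one actual negative.
    """
    ap = tp + fn              # actual positives
    an = fp + tn              # actual negatives
    pp = tp + fp              # predicted positives
    pn = fn + tn              # predicted negatives

    alpha = fn / ap           # misclassification rate on actual positives
    beta = fp / an            # misclassification rate on actual negatives

    sn = tp / ap                                  # sensitivity = 1 - alpha
    sp = tp / pp if pp else 0.0                   # specificity as defined on the slides (PPV)
    sigma = (1 - alpha) / (1 - alpha + beta)      # normalized specificity

    # Standard Matthews correlation coefficient (not spelled out on the slides).
    denom = math.sqrt(pp * pn * ap * an)
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"Sn": sn, "Sp": sp, "alpha": alpha, "beta": beta,
            "sigma": sigma, "MCC": mcc}


# Hypothetical counts for a reasonable predictor ...
good = evaluate(tp=80, fp=20, tn=880, fn=20)
# ... and for the trivial predictor that labels every test case positive.
all_positive = evaluate(tp=100, fp=900, tn=0, fn=0)
print(good)
print(all_positive)
```

The second call illustrates the slide's warning: labeling everything positive reaches 100% sensitivity, but specificity collapses to the fraction of positives in the test set, and MCC is set to 0 by convention here because its denominator is zero.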
Eukaryotes vs Prokaryotes: Levels of Gene Regulation
Primary level of control?
• Prokaryotes: Transcription initiation
• Eukaryotes: Transcription is also very important, but
  • Expression is regulated at multiple levels, many of which are post-transcriptional:
    • RNA processing, transport, stability
    • Translation initiation
    • Protein processing, transport, stability
    • Post-translational modification (PTM)
    • Subcellular localization
• Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

Eukaryotes vs Prokaryotes: Regulatory Elements
• Prokaryotes:
  • Promoters & operators (for operons) - cis-acting DNA signals
  • Activators & repressors - trans-acting proteins (we won't discuss these…)
• Eukaryotes:
  • Promoters & enhancers (for single genes) - cis-acting
  • Transcription factors - trans-acting
Important difference?
• What the RNA polymerase actually binds

Prokaryotic Promoters
• RNA polymerase complex recognizes promoter sequences located very close to and on 5' side ("upstream") of transcription initiation site
• Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for "transcription factors" binding first
• Prokaryotic promoter sequences are highly conserved:
  • -10 region
  • -35 region

Eukaryotic Promoters
• Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
• Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
• Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
  • -30 region "TATA" box
  • -100 region "CCAAT" box

Eukaryotic Promoters vs Enhancers
Both promoters & enhancers are binding sites for transcription factors (TFs)
Promoters
• essential for initiation of transcription
• located "relatively" close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)
Enhancers
• needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
• can be very far from start site (sometimes > 100 kb)

Eukaryotic genes are transcribed by 3 different RNA polymerases
(Location of promoter regions, TFBSs & TFs differ, too)
[Figure: three panels labeled rRNA; mRNA; tRNA, 5S RNA] (Brown Fig 9.18, BIOS Scientific Publishers Ltd, 1999)

Prokaryotic Genes & Operons
• Genes with related functions are often clustered within operons (e.g., lac operon)
• Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
• mRNAs produced from operons are "polycistronic" - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule

Promoter of lac operon in E. coli
(Transcribed by prokaryotic RNA polymerase)
[Figure] (Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999)

Eukaryotic genes
• Genes with related functions are occasionally, but not usually, clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
• Chromatin structure must also be "active" for transcription to occur
• Cis-acting regulatory elements include: promoters, enhancers, silencers
• Trans-acting regulatory factors include: transcription factors (TFs), chromatin remodeling complexes, small RNAs

Eukaryotic genes have large & complex regulatory regions
[Figure] (Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999)

Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site
Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
[Figure: promoter region of ~250 bp upstream of the pre-mRNA]

Eukaryotic promoters & enhancer regions often contain many different TFBS motifs
[Figure] (Fig 9.13, Mount 2004)

Simplified View of Promoters in Eukaryotes
[Figure: gene with RNAP bound at the promoter and enhancers 100 - 50,000 bp away; enhancer proteins interact with RNAP to drive transcription; a repressor prevents binding of activator] (Fig 5.12, Baxevanis & Ouellette 2005)

Eukaryotic Activators vs Repressors
Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
• Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
• Repressors block the action of activators

Eukaryotic Transcription Factors (TFs)
• Transcription factors = proteins that interact with the RNA polymerase complex to activate or repress transcription
• TFs often contain both:
  • a trans-activating domain
  • a DNA binding domain or motif (here motif = amino acid sequence in protein)
• TFs recognize and bind specific short DNA sequence motifs called "transcription factor binding sites" (TFBSs) (here motif = nucleotide sequence in DNA)
• Databases for TFs & TFBSs include:
  • TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
  • JASPAR

Zinc Finger Proteins - Transcription Factors
• Common in eukaryotic proteins
• ~ 1% of mammalian genes encode zinc-finger proteins (ZFPs)
• In C. elegans, there are > 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!
• Did you go to Dave Segal's seminar?
• Your TAs Pete & Jeff work on designing better ZFPs!
[Figure] (Brown Fig 9.12, BIOS Scientific Publishers Ltd, 1999)

Promoter Prediction Algorithms & Software
Xiong

Eukaryotes vs Prokaryotes: Promoter Prediction
Promoter prediction is much easier in prokaryotes
Why?
• Highly conserved
• Simpler gene structures
• More sequenced genomes! (for comparative approaches)
  But: "regulatory" regions are much less well conserved than coding regions
Methods?
• Previously: mostly HMM-based
• Now: similarity-based comparative methods, because so many genomes are available
Xiong textbook:
1) "Manual method" = rules of Wang et al (see text)
2) BPROM - uses linear discriminant function

Predicting Promoters in Eukaryotes
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
• Locate ORFs
• Identify Transcription Start Site (TSS) (if possible!)
• Use Promoter Prediction Programs
• Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)

Predicting promoters: Steps & Strategies
Identify TSS - if possible?
• One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes)
  Use FirstEF found within UCSC Genome Browser or submit to FirstEF web server

Automated Promoter Prediction Strategies
1) Pattern-driven algorithms (ab initio)
2) Sequence-driven algorithms (homology based)
3) Combined "evidence-based"
BEST RESULTS?
Combined, sequential
(Fig 5.10, Baxevanis & Ouellette 2005)

1) Pattern-driven Algorithms
• Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
• Tend to produce very large numbers of false positives (FPs)
• Why?
  • Binding sites for specific TFs are often variable
  • Binding sites are short (typically 6-10 bp)
  • Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
  • One binding site often recognized by multiple TFs
  • Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this

Ways to Reduce FPs in ab initio Prediction
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ (sigma) factors helps
• Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
  But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
• Do the wet lab experiments!
  But: Promoter-bashing can be tedious…

2) Sequence-driven Algorithms
• Assumption: Common functionality can be deduced from sequence conservation (homology)
• Alignments of co-regulated genes should highlight elements involved in regulation
Careful: How to determine co-regulation?
1. Orthologous genes from different species
2. Genes experimentally shown to be co-regulated (using microarrays??)

Phylogenetic Footprinting
• Based on increasing availability of whole genome DNA sequences from many different species
• Selection of organisms for comparison is important
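The idea behind phylogenetic footprinting, that conserved stretches in an alignment of orthologous sequences point to functional elements, can be illustrated with a crude sliding-window percent-identity scan. This is only a sketch of the principle, not how the footprinting programs named in these notes work: the function names, the window size and the threshold are illustrative assumptions, and the two aligned sequences are the human/mouse example shown on the comparative genomics slide earlier in these notes.

```python
def window_identity(aln_a, aln_b, window=10, step=1):
    """Return (start, fraction of identical columns) for each alignment window.

    Both inputs are rows of one pairwise alignment (same length, '-' = gap).
    A column counts as identical only if both rows carry the same base.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in columns if x == y and x != "-")
        profile.append((start, matches / window))
    return profile


def conserved_starts(profile, threshold=0.9):
    """Window start positions whose identity reaches the threshold."""
    return [start for start, identity in profile if identity >= threshold]


# Human/mouse alignment from the comparative genomics slide.
HUMAN = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
MOUSE = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"

profile = window_identity(HUMAN, MOUSE, window=10)
print(conserved_starts(profile, threshold=0.9))
```

If the whole region were highly conserved, every window would pass the threshold and the scan would say nothing, which is why the choice of species matters.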
Comparative promoter prediction:
1. Phylogenetic footprinting
   • Compare species that are not too close, not too far: good = human vs mouse
   • To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
     • use MSA algorithms (e.g., CLUSTAL)
     • more sensitive methods
       • Gibbs sampling
       • Expectation Maximization (EM) methods
   • Examples of programs:
     • Consite, rVISTA, PromH(W), Bayes aligner, Footprinter
2. Expression profiling

Expression Profiling
• Based on increasing availability of whole genome mRNA expression data, esp. microarray data
• High-throughput simultaneous monitoring of expression levels of thousands of genes
Assumptions: (sometimes valid, sometimes NOT)
1. Co-expression implies co-regulation
2. Co-regulated genes share common regulatory elements
Drawbacks:
1. Signals are short & weak! Requires Gibbs sampling or EM: e.g., MEME, AlignACE, Melina
2. Prediction depends on determining which genes are co-expressed, usually by clustering, which can be error prone
Examples of programs:
• INCLUSive - combined microarray analysis & motif detection
• PhyloCon - combined phylo footprinting & expression profiling

Problems with Sequence-driven Algorithms
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with translocations or inversions that change order of functional elements
  • If background conservation of entire region is high, comparison is useless
  • Not enough data (but Prokaryotes >>> Eukaryotes)
• Complexity: many regulatory elements are not conserved across species!
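The expression profiling approach rests on first deciding which genes are co-expressed. A minimal way to picture that step is a pairwise Pearson correlation between expression profiles with a cutoff. This is a simplified stand-in for the clustering mentioned above, not the method of any program named in these notes; the gene names, expression values and the 0.9 cutoff below are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sd_x * sd_y)


def coexpressed_pairs(profiles, min_r=0.9):
    """Gene pairs whose profiles correlate at or above min_r."""
    genes = sorted(profiles)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(profiles[g], profiles[h])
            if r >= min_r:
                pairs.append((g, h, r))
    return pairs


# Hypothetical expression levels across four conditions.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],   # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}

pairs = coexpressed_pairs(PROFILES)
print([(g, h) for g, h, _ in pairs])
```

The upstream regions of the genes grouped this way would then be handed to a motif finder; a noisy grouping at this step propagates directly into the motif search, which is the second drawback listed above.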
Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS)
[Figure] (Fig 5.13, Baxevanis & Ouellette 2005)

TRANSFAC Matrix Entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build
• Other info
(Fig 5.14, Baxevanis & Ouellette 2005)

Annotated Lists of Promoter Databases & Promoter Prediction Software
• URLs from Mount textbook: Table 9.12
  http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article
  http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs from Baxevanis & Ouellette textbook:
  http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/

Check out Optional Review & Try Associated Tutorial:
• Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
  http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/

Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!
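To make the "weight matrix" field of a TRANSFAC-style matrix entry more concrete, the sketch below turns a table of per-position base counts into log-odds scores and slides it along a sequence, which is the core of a pattern-driven search. Everything specific here is an assumption for illustration: the counts are invented (they are not taken from TRANSFAC), the pseudocount and uniform background are arbitrary choices, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Made-up counts for a 6-bp TATA-like motif; one dict per motif position.
TOY_COUNTS = [
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 1, "G": 1, "T": 17},
    {"A": 16, "C": 1, "G": 1, "T": 2},
    {"A": 15, "C": 1, "G": 1, "T": 3},
    {"A": 14, "C": 2, "G": 2, "T": 2},
]


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into log2(frequency / background) scores."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix):
    """Score every forward-strand window; returns a list of (start, score)."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(matrix[pos][base] for pos, base in enumerate(window))
        hits.append((start, score))
    return hits


matrix = log_odds_matrix(TOY_COUNTS)
hits = scan("GGCGTATAAAGGCG", matrix)
best = max(hits, key=lambda hit: hit[1])
print(best)
```

In a real search, a score threshold decides which windows are reported. Because the motif is short, many windows in a long genome exceed any permissive threshold by chance, which is the false-positive problem described for pattern-driven algorithms.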