COMPUTATIONAL ANALYSIS OF PROMOTERS

Gene regulation
• Genomes usually contain several thousand different genes.
• Some gene products are required by the cell under all growth conditions; the genes encoding them are called housekeeping genes.
  • e.g. genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …
• Many other gene products are required only under specific growth conditions.
  • e.g. enzymes responding to a specific environmental condition such as DNA damage
• Housekeeping genes must be expressed at some level all of the time.
• Frequently, as the cell grows faster, more of the housekeeping gene products are needed.
• The gene products required for specific growth conditions are not needed all of the time.
• These genes are frequently expressed at extremely low levels, or not at all, when they are not needed, and are switched on when they are needed.
• Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or external conditions.

Gene regulation occurs at three main levels:
1. transcriptional regulation
  • transcription of the gene is regulated
  • control of transcription initiation is the most important control mechanism
2. translational regulation
  • translation of the gene is regulated
  • How often the mRNA is translated influences the amount of gene product that is made.
3. post-transcriptional/post-translational regulation
  • regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)

Transcriptional regulation
• Transcription control has two key features:
1. protein-binding regulatory DNA sequences (control elements) are associated with genes
2. specific proteins that bind to the regulatory sequences determine where transcription will start, and either activate or repress it
• The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter.
• Transcription from a particular promoter is controlled by DNA-binding proteins termed transcription factors.
• DNA control elements that bind transcription factors may be located very far from the promoter they regulate.
• As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.

Three different polymerases
• RNA polymerase I synthesizes rRNA.
• RNA polymerase II synthesizes mRNA.
• RNA polymerase III synthesizes small RNAs and tRNA.
source: Molecular Biology of the Cell. 4th edition. Alberts B

Three parts of promoter
• core promoter
  • responsible for the actual binding of the transcription apparatus
  • very close upstream of the TSS (~35 bp), may also extend downstream (see later)
• proximal promoter
  • contains several regulatory elements
  • a few hundred bases upstream of the transcriptional start site (TSS)
• distal promoter
  • contains enhancers (upstream/downstream) and silencers
  • These elements are cis-acting: a cis-element regulates a gene on the same DNA molecule. Cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.
• However, the distinction between proximal elements and enhancers/silencers is not very clear.

Core promoter
• Eukaryotic RNAPII is not itself capable of transcriptional initiation in vitro.
• It needs to be supplemented by general (basal) transcription factors (GTFs).
• Factors are identified as TFIIX, where X is a letter, e.g. TFIIA, TFIIB, …
• RNAPII + TFs form the pre-initiation complex (PIC). Only then can transcription commence.
• minimal (core) promoter – the DNA sequence sufficient for assembly of the pre-initiation complex.
• Transcription initiated by the core promoter is called basal transcription.

Core promoter elements
• The core promoter is usually located proximal to or overlapping the TSS.
• It contains several sequence motifs with which TFs interact in a sequence-specific manner.
• The combination of TF-binding motifs varies from gene to gene.
• TATA box … ~30 bp upstream, consensus TATA(A/T)A(A/T)
• Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr) … surrounds the TSS, extremely degenerate consensus sequence YYAN(T/A)YY (A – TSS, N – any nucleotide, Y = C or T)
• Promoters with both TATA and Inr also exist.
• DPE (downstream promoter element)
  • present in some TATA-less, Inr-containing (TATA-, Inr+) promoters, ~30 bp downstream of the TSS
  • consensus: RGWCGTG (R = A or G, W = A or T)
Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002; 16(20):2583-92.

Promoter proximal elements
• Found within 100 to 200 bp of the TSS.
• CAAT (CCAAT, CAT) box … consensus GGCCAATCT
• GC box … consensus G/T G/A GGCG G/T G/A G/A C/T
  • It is a GC-rich segment.
  • A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.

[Figure: A hypothetical mammalian promoter region, showing upstream enhancers (-10 to -50 kb), the promoter proximal element (around -200), the TATA box (-30), the TSS (+1), exons and an intron, and a downstream enhancer (+10 to +50 kb).]

CpG island
• Transcription of genes with TATA/Inr promoters begins at a well-defined site.
• However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long.
• As a result, such genes give rise to mRNAs with multiple alternative 5’ ends.
• These are housekeeping genes; they contain neither a TATA box nor an Inr.
• Most genes of this type contain a CG-rich stretch of several hundred nucleotides, a CpG island, within ≈100 base pairs upstream of the TSS.
• CpG islands are typical for vertebrates (including human). They are not common in lower eukaryotes.
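
Whether a stretch of DNA qualifies as a CpG island can be checked with a few counts. The sketch below is a minimal illustration, assuming the usual observed/expected ratio p(CG) / (p(C) p(G)) with all frequencies taken relative to the sequence length, and the Gardiner-Garden and Frommer thresholds (length at least 200 bp, C+G content at least 50%, ratio above 0.60) as defaults. The function names `cpg_stats` and `is_cpg_island` are hypothetical and are not taken from any existing tool.

```python
def cpg_stats(seq):
    """Return (C+G content, CpG observed/expected ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n
    # p(CG) / (p(C) * p(G)) with p(x) = #x / n simplifies to #CG * n / (#C * #G)
    ratio = cg * n / (c * g) if c and g else 0.0
    return gc_content, ratio


def is_cpg_island(seq, min_len=200, min_gc=0.50, min_ratio=0.60):
    """Apply length, C+G content and observed/expected thresholds (assumed defaults)."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and ratio > min_ratio


print(cpg_stats("ACGT"))  # (0.5, 4.0)
```

Passing `min_len=500, min_gc=0.55, min_ratio=0.65` would apply the stricter Takai and Jones criteria instead. A real scanner would slide a window along the genome and merge overlapping hits, which this sketch leaves out.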

[Figure: CpG island of ~100 bp upstream of an mRNA with multiple 5’ start sites.]

CpG island
• Computational analysis is based on CG dinucleotide imbalance.
• minimum length 200 bp, C+G content min 50%, $\dfrac{\mathrm{CpG}_{\mathrm{observed}}}{\mathrm{CpG}_{\mathrm{expected}}} = \dfrac{p(\mathrm{CG})}{p(\mathrm{C})\,p(\mathrm{G})} > 0.60$
  M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.
• minimum length 500 bp, C+G content min 55%, $\dfrac{\mathrm{CpG}_{\mathrm{observed}}}{\mathrm{CpG}_{\mathrm{expected}}} > 0.65$
  D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, 3740-45.
• Worked example: len = 251, #C = 76, #G = 101, #CG = 30
  $p_{\mathrm{C}} = \frac{76}{251}$, $p_{\mathrm{G}} = \frac{101}{251}$, $p_{\mathrm{CG}} = \frac{30}{251}$
  C+G content $= p_{\mathrm{C}} + p_{\mathrm{G}} = 0.71$, $\mathrm{CpG}_{o/e} = \dfrac{p_{\mathrm{CG}}}{p_{\mathrm{C}}\,p_{\mathrm{G}}} = 0.98$
• Simple methods based on the frequency of CG perform remarkably well at correctly predicting regions containing TSSs.
• EMBOSS CpGPlot/CpGReport - http://www.ebi.ac.uk/Tools/emboss/cpgplot/
• CpG Island Searcher - http://cpgislands.usc.edu/ (IE only)

Promoter regions in human genes
Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res. 2001, 11(5):677-84.

Element           Promoters
TATA              32%
Inr               85%
GC box            97%
CAAT box          64%
located in CpG    48%

Combination       Promoters
TATA+ Inr+        28%
TATA+ Inr-        4%
TATA- Inr+        56%
TATA- Inr-        12%

Computational analysis of promoters

Introduction
• Regulatory regions typically contain several transcription factor binding sites strung out over a large region.
• Which particular factor is used depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.
• Any given gene will typically have its very own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is transcribed only in the proper cell type(s) and at the proper time during development.
• Transcription factors themselves are also subject to similar transcriptional regulation, thereby forming transcriptional cascades and feedback control loops.
• While all this is very nice and interesting from a biologist’s point of view, it spells big trouble for promoter prediction.

Computational difficulties
• There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.
• Any given sequence element might be recognized by different factors in different cell types.
• Core promoter regulatory elements are short and not completely conserved ⇒ similar elements will be found purely by chance all over the genome.

What do promoter prediction methods actually predict?
• The first nucleotide copied at the 5’ end of the corresponding mRNA: the transcription start site (TSS).
• The region around the TSS is often referred to as the core promoter.
• Owing to the strong link between TSS and core promoter, these terms are often used interchangeably.
• Promoter prediction relies on three distinct types of features:
1. signal features
2. context features
3. structure features

Evaluating predictions
• sensitivity (Se), recall, TPR
  • proportion of correct predictions of TSSs relative to all experimental TSSs
  $\mathrm{Se} = \dfrac{TP}{TP + FN}$
• positive predictive value (PPV), precision
  • proportion of correct predictions of TSSs out of all counted positive predictions
  $\mathrm{PPV} = \dfrac{TP}{TP + FP}$
• And how are FP, FN and TP obtained?
  • You have a gene sequence for which you know the TSS location, and you make your prediction.
  • If the prediction falls within the region [-2000, +2000] relative to the annotated TSS, you have a TP.
  • Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.
  • If you predict no promoter for this gene sequence, you have an FN.

Signal features
• Recognize “conserved” signals such as the TATA box, Inr, DPE, BRE etc.
• Such motifs are highly variable and degenerate. This leads to a high false positive rate.
• Methods based on core promoter elements and other specific TFBSs (e.g. CAAT box) are far from accurate.
• A much more reliable signal is the CpG island.
• However, only ≈50% of human genes contain CpG islands.
  ⇓
  CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.

Context features
• Extracted from the genomic context of promoters.
• Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.
• n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG).
• The n-mer representation encodes contextual information of promoters and has the following advantages:
  • contextual information is independent of any biological signals
  • the distribution of n-mers may have biological significance (TFBS, CpG)
  • n-mers may reveal details of yet unknown promoter regions
  • n-mers reduce FPR while maintaining relatively high TPR (i.e. Se)

Structure features
• They originate from the DNA 3D structures that characterize proximal promoters.
• DNA encodes in its sequence at least two independent levels of functional information:
  • the DNA sequence, which encodes proteins and their regulatory elements
  • the physical and structural properties of the DNA itself
• Examples:
  • dinucleotide properties – stacking energy, propeller twist
  • trinucleotide properties – bendability, nucleosome position preference
• They have long-range interactions (up to 10 kbp), so they can exhibit properties not visible in the sequence.

[Figure: Model for cooperative assembly of an activated transcription-initiation complex.]
The figure shows why structural features such as flexibility are important.
Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000.
Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003; 17(10):1228-37.
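
The n-mer representation described under context features can be sketched as a sliding-window count. This is a minimal illustration, assuming overlapping windows and assuming that windows containing characters other than A, C, G, T are skipped. The function names `nmer_counts` and `nmer_frequencies` are hypothetical, and no particular tool's feature set is implied.

```python
from collections import Counter


def nmer_counts(seq, n):
    """Count overlapping n-mers in a DNA string, skipping ambiguous windows."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if set(word) <= set("ACGT"):  # ignore windows with N or other symbols
            counts[word] += 1
    return counts


def nmer_frequencies(seq, n):
    """Relative n-mer frequencies, usable as a feature vector for a classifier."""
    counts = nmer_counts(seq, n)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


print(nmer_counts("TATAAATA", 3).most_common(1))  # [('ATA', 2)]
```

In a trained predictor such frequencies would be estimated separately on promoter and non-promoter training sequences and then compared. The choice of n is a trade-off, since the number of possible n-mers grows as 4^n.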

Software
Signal features (two leading CpG predictors)
• FirstEF – different quadratic discriminant functions for CpG and non-CpG promoters; slightly improves performance by concentrating on regions around the first exon
• Eponine – TATA box and G+C-rich domain, Relevance Vector Machine
Context features
• PromoterInspector – IUPAC word groups with wildcards
Structure features
• McPromoter – DNA sequence, bending, DNA twist, ANN
• EP3 – features from [1]; prediction is based just on a threshold imposed on the structural profile
[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255

Integrated approaches
• combine sequence, context and structural features
• ARTS – SVM with sophisticated kernels, combines n-mers with structural features (e.g. twist angle, stacking energies)
  • does not distinguish CpG-related promoters from unrelated ones; it is not clear how it performs on non-CpG promoters
• SCS – sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models, and their outcomes are combined by a decision tree
• CoreBoost – boosting technique with stumps, integrates core promoter signals, DNA flexibility, n-mer frequency, …
• CoreBoost_HM … adds experimental histone modification data

Boosting, stumps
• Boosting
  • Belongs to the ensemble methods, which produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate rules (weak learners, WL), i.e. rules just a bit better than random guessing.
  • Weak classifiers are learned iteratively and added to a final strong classifier.
  • When a WL is added, it is weighted according to its accuracy.
  • After a WL is added, the data are reweighted: misclassified examples gain weight and correctly classified examples lose weight.
  • Thus, future WLs focus more on the examples that previous WLs misclassified.
• Stump
  • One-level decision tree (i.e.
it has one root and two terminal nodes)
source: wikipedia

Databases
• EPD – Eukaryotic Promoter Database
  • http://epd.vital-it.ch
  • manually annotated non-redundant collection of eukaryotic Pol II promoters
• DBTSS
  • http://dbtss.hgc.jp/
  • putative core promoter: e.g. -100 bp … +50 bp, -250 bp … +50 bp, -200 … +200 bp

Current state of promoter prediction
• CpG island promoters are easier to predict than non-CpG promoters.
• CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So a biologist who wants to up- or down-regulate expression is usually not happy to learn that the gene has a CpG island promoter.
• Non-CpG promoters correspond to tissue-specific expression and are the bottleneck in accurate promoter prediction.
• The best approach is to use transcription data. Alignment of the 5’ ends of ESTs or full-length cDNAs can indicate the promoter sequence. However, conventional cDNA clones are often truncated at the 5’ end and lack the complete 5’ UTR. This is overcome by new mRNA cap cloning techniques (DBTSS).

Future directions
• False positives are still the main problem.
• This is because information about chromatin structure is missing from prediction models.
• Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume the whole genome is accessible for binding, which is obviously wrong and will lead to more FPs (and FNs because of the extra noise).
• Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.
source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html
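
The counting rules from the "Evaluating predictions" slides can be sketched as follows. This is an illustration under assumptions the slides do not state: each gene is scored independently, any number of predictions inside the [-2000, +2000] window counts as a single TP, a gene with no prediction inside that window counts as an FN, every prediction in [+2001, end of gene] counts as one FP, and predictions elsewhere are ignored. Positions are integer offsets relative to the annotated TSS, and the names `score_gene` and `sensitivity_ppv` are hypothetical.

```python
def score_gene(predicted_offsets, gene_end, window=2000):
    """Return (TP, FP, FN) for one gene; offsets are relative to the annotated TSS."""
    hits = [p for p in predicted_offsets if -window <= p <= window]
    false_hits = [p for p in predicted_offsets if window < p <= gene_end]
    tp = 1 if hits else 0          # assumption: one TP per gene at most
    fn = 0 if hits else 1          # assumption: no hit in the window means FN
    fp = len(false_hits)           # each prediction inside the gene body is an FP
    return tp, fp, fn


def sensitivity_ppv(genes):
    """genes: iterable of (predicted_offsets, gene_end). Returns (Se, PPV)."""
    tp = fp = fn = 0
    for offsets, gene_end in genes:
        t, f, m = score_gene(offsets, gene_end)
        tp, fp, fn = tp + t, fp + f, fn + m
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return se, ppv


# Four toy genes: two hits, one gene with only gene-body predictions, one with none
genes = [([-150], 30000), ([5000, 12000], 20000), ([], 15000), ([300, 8000], 40000)]
print(sensitivity_ppv(genes))  # (0.5, 0.4)
```

In the toy data TP = 2, FP = 3 and FN = 2, so Se = 2/4 and PPV = 2/5. Published benchmarks differ in window size and in how clustered predictions are merged, so numbers obtained this way are comparable only under identical rules.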