Functional Non-Coding DNA Part I
Non-coding genes and non-coding elements of coding genes
BNFO 602/691 Biological Sequence Analysis
Mark Reimers, VIPBG

What Does ‘Functional Non-Coding DNA’ Mean?
• DNA whose sequence affects transcripts made from DNA in some way
• Could affect transcription levels, splicing or sequestering of RNA
• Three main ways to identify functional noncoding elements
  – Sequence characteristics – favored bases
  – Genomic conservation
  – Epigenetic marks and open chromatin
    • especially outside of genes

Types of Non-Coding Elements
• Non-coding RNAs
  – miRNAs, lncRNAs, etc.
• Non-coding gene elements
  – UTRs, splice sites and splice regulatory elements, poly-adenylation sites, RNA-binding sites
• DNA elements outside genes – our main focus
  – Promoters
  – Enhancers/Silencers
  – Insulators

Types of Non-Coding RNA
• microRNAs
• Silencing RNAs
• Small nuclear/nucleolar RNAs
• Piwi-Interacting RNAs
• Long Non-Coding RNAs
• Circular RNAs
• Still other RNAs???
• Comprehensive database at www.ncrna.org

Micro-RNAs
• Micro-RNAs are small non-coding RNA molecules, about 21–25 nucleotides in length
• They are processed from much longer genes, or from introns within pre-mRNA, by several molecular pathways
• Micro-RNAs base-pair with complementary sequences within mRNA molecules, often in 3’ or 5’ UTR.
• miRNA binding usually results in gene repression, either via translational stalling or by triggering mRNA degradation
Image by Charles Mallery, U of Miami

Micro-RNAs
• The human genome encodes over 1500 miRNAs, which are believed to affect more than half of human genes
• miRNAs are abundant in many cell types
  – Thousands of copies per cell of some miRNAs
  – Those within gene introns share regulation with their host genes
• miRNAs are well-conserved across vertebrates
  – No orthologs between plant and animal miRNAs
  – miRBase is the comprehensive repository of microRNAs

Other Short RNAs: siRNA
• Small interfering RNAs are double-stranded with an overhang
• They are processed by some of the same machinery as miRNAs and have some of the same effects

Other Short RNAs: piRNA
• Piwi-Interacting RNAs are longer, single-stranded RNAs of 26–31 bases
  – PIWI (P-element Induced Wimpy Testis) protein
• Over 50,000 sequences known in mouse
  – They are the largest class of ncRNA
• They seem to play an ancient role in defense against retroviruses and transposons

Other Short RNAs: snRNAs & snoRNAs
• Small nuclear RNAs (snRNAs) are typically ~150 bases long, and associate with protein
  – Many conserved copies of each snRNA gene
  – U1, U2, U4, U5 and U6 snRNAs are key parts of the splicing machinery
• Small nucleolar RNAs (snoRNAs)
  – Guide chemical modifications of other RNAs
  – Prader-Willi syndrome results from deletion of a region containing 29 copies of SNORD116 on chr 15q11
Figure: U6 snRNA

Long Non-Coding RNAs
• Many long (>200 bp) stretches of genome are transcribed and have epigenetic marks like those of protein-coding genes
• Most of these are spliced RNAs with two (or more) exons
• GENCODE v15 has 13.5K lncRNAs
• See also
  – Derrien et al, Genome Research 2012
  – Lee, Science 2012
From Derrien et al Genome Res 2012

Many lncRNAs Induce Silencing
• Coat nearby gene(s) and silence them
• Xist binds to gene clusters first
• Xist binds disparate parts of chromosome
• Many lncRNAs are antisense to genes
• Some lncRNAs maintain pluripotency of stem cells
From Jeannie Lee lab (Harvard) website

Long Non-Coding RNAs - 2
• Most lncRNAs are expressed in only a few tissues
• Most human lncRNAs are specific to the primate lineage
From Derrien et al Genome Res 2012

Circular RNAs
• Several thousand non-coding RNAs apparently form circular structures
• Many form complexes with AGO and seem to absorb attached miRNAs, blocking processing
• CDR1as has 70 conserved binding sites for miR-7

Functional Pseudo-Genes
• Pseudo-genes are copies of genes that are decaying and rarely (if ever) make proteins
• Some pseudo-genes act to absorb negative regulators of the original gene
  – e.g. SRGAP2B

How to Identify Non-Coding RNAs?
• Short (and long) RNA transcriptomes
• Promoter chromatin marks for independent (non-embedded) miRNAs and lncRNAs

DEMO: Display HOTAIR & XIST Tracks in UCSC Browser

Non-Coding Elements of Genes
• TSS
• 5’ UTRs
• Introns
• Splicing regulation sites
• 3’ UTRs
• Termination/Poly-adenylation sites

Transcription Start Sites
• Transcription of most genes may initiate at several distinct clusters of locations, with a distinct promoter for each TSS cluster
• Two major types of metazoan TSS: CG-rich broad TSS, and narrow (often tissue-specific) TSS

Transcription Start Sites
Transcription often starts at CG within promoter

5’ Untranslated Regions
• First exon often contains dozens to thousands of bases before the start codon (median 150)
• Sometimes contains regulatory sequences, e.g. binding sites for RNA binding proteins, and translation initiators

Splice Regulatory Sites
• Splicing is achieved through binding of the spliceosome to recognition sequences on the nascent RNA molecule

Splice Regulatory Sites
• Tissue-specific splice regulatory sites are highly conserved
From Merkin et al Science 2012

Splicing Patterns Evolve in All Tissues Except Brain
From Merkin et al Science 2012

Non-Coding Elements in Coding Exons
• Many regulatory sites occur within coding exons, esp. toward 5’ end
• These constrain some codons as much as protein sequence does
• Many human SNPs break TFBS but have little effect on protein (as far as we know)
From Stergachis et al Science 2013

3’ Untranslated Regions
• Longest exon is usually 3’ UTR (>1000 nt)
• Typically 1/3 – 1/2 of a gene is in 5’ & 3’ UTRs
• 3’ UTR has binding sites for miRNAs and RNA binding proteins
• AU-rich elements (AREs) modulate mRNA stability, most often promoting decay
• Proteins recognize complex secondary structure
Figure: GRIK4 3’ UTR secondary structure is conserved

RNA Binding-Protein Sites
• mRNAs are usually further processed (e.g. transported or sequestered)
• RNA binding proteins recognize specific motifs within secondary structure of 3’ or 5’ UTR
• These sites are often highly conserved
From Ray et al Nature 2013

Poly-adenylation/Termination Sites
• Transcripts can be terminated and poly-adenylated at sites with specific sequences
• Most genes have alternate poly-adenylation sites
• Median lengths of 3’ UTR are 250 & 1773 bp (mouse)

Poly-adenylation/Termination Sites
• Rapidly proliferating cells express gene isoforms with short 3’ UTRs
• Neurons typically have longer 3’ UTRs
Figure: Types of alternate poly-adenylation (Elkon et al, NRG 2013)

DEMO: GAPDH and GABRA1 in UCSC Browser
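
Sketch: scanning a 3’ UTR for poly-adenylation signals
The poly-adenylation slides say that transcripts are terminated at sites with specific sequences, and that most genes have alternate sites. As a rough illustration, the Python sketch below looks for the canonical poly(A) signal hexamer AAUAAA and its common variant AUUAAA (written as DNA). The slides do not name these hexamers; they are assumed here, along with the hypothetical function name find_polya_signals and a made-up toy sequence. Real poly(A) site calling also depends on downstream elements and expression data, so this is only a first-pass motif scan.

```python
import re

# Canonical poly(A) signal and its most common variant, written as DNA.
# Assumption: these hexamers are not named in the slides.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def find_polya_signals(utr_seq, signals=POLYA_SIGNALS):
    """Return sorted (position, hexamer) pairs for each signal occurrence.

    Positions are 0-based offsets into utr_seq.
    RNA input (with U) is converted to DNA (with T) before scanning.
    """
    seq = utr_seq.upper().replace("U", "T")
    hits = []
    for signal in signals:
        # A lookahead keeps overlapping occurrences from being skipped.
        for match in re.finditer("(?=" + signal + ")", seq):
            hits.append((match.start(), signal))
    return sorted(hits)


# Toy 3' UTR (made-up sequence, not a real gene) with two candidate signals.
toy_utr = "GCTGACCATTAAAGTCTGCCAGGTCCTAGGACCGTAATAAAGCTTGCA"
for pos, hexamer in find_polya_signals(toy_utr):
    print(pos, hexamer)
```

Two or more hits in one UTR are candidates for alternate poly-adenylation: using the upstream signal would give a shorter 3’ UTR, and using the downstream one a longer 3’ UTR.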