CS273A Lecture 4: NON-protein coding genes (ncRNA) MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos Achlioptas http://cs273a.stanford.edu [BejeranoFall13/14] 1 Announcements • HW1 will be out by Friday. • We’ll announce it over the mailing list/Piazza. http://cs273a.stanford.edu [BejeranoFall13/14] 2 ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA ATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAA TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGA ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTT ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTT TGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGT TCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATAC ATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT GCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAATAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTA CGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGA ATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACA TCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAAC GGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAA CTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTG GCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTC TTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAAT TGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCT GCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT AATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCT TCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTT AATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGA TTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTA CTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTT TACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTT ACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAA http://cs173.stanford.edu [BejeranoWinter12/13] 3 AATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGT “non coding” RNAs (ncRNA) http://cs273a.stanford.edu [BejeranoFall13/14] 4 Central Dogma of Biology: 5 Active forms of “non coding” RNA long non-coding RNA reverse transcription (of eg retrogenes) microRNA rRNA, snRNA, snoRNA 6 What is ncRNA? • Non-coding RNA (ncRNA) is an RNA that functions without being translated to a protein. • Known roles for ncRNAs: – RNA catalyzes excision/ligation in introns. – RNA catalyzes the maturation of tRNA. – RNA catalyzes peptide bond formation. – RNA is a required subunit in telomerase. – RNA plays roles in immunity and development (RNAi). – RNA plays a role in dosage compensation. – RNA plays a role in carbon storage. – RNA is a major subunit in the SRP, which is important in protein trafficking. – RNA guides RNA modification. – RNA can do so many different functions, it is thought in the beginning there was an RNA World, where RNA was both the information carrier and active molecule. 7 “non coding” RNAs (ncRNA) Small structural RNAs (ssRNA) http://cs273a.stanford.edu [BejeranoFall13/14] 8 ssRNA Folds into Secondary and 3D Structures AAUUGCGGGAAAGGGGUCAA CAGCCGUUCAGUACCAAGUC UCAGGGGAAACUUUGAGAUG GCCUUGCAAAGGGUAUGGUA AUAAGCUGACGGACAUGGUC CUAACCACGCAGCCAAGUCC UAAGUCAACAGAUCUUCUGU UGAUAUGGAUGCAGUUCA Computational Challenge: predict 2D & 3D from sequence. A C A GA CA G C U A G C 200 G C G U 120 U P5a CA G U G C P5 UC G G U A C G C G G G U U AA A U A A A AU A A A C 180 G C G C P5c G U A G C G U A A 260 A C G A A GG G C C A C P4 C G U GUUCC G A U A 140 G U G A U U U 160 G C P6 G C G AA A U C A C P5b G A U A C U G A G U G 220 G U C G U A A G C G P6a AA C G U U A A A G U U A C G A U A U P6b C G A U G C 240 A U UCU Waring & Davies. (1984) Gene 28: 277. Cate, et al. (Cech & Doudna). (1996) Science 273:1678. 9 For example, tRNA tRNA Activity ssRNA structure rules • • • • Canonical basepairs: – Watson-Crick basepairs: • G≡C • A=U – Wobble basepair: • G−U Stacks: continuous nested basepairs. (energetically favorable) Non-basepaired loops: – Hairpin loop – Bulge – Internal loop – Multiloop Pseudo-knots Ab initio RNA structure prediction: lots of Dynamic Programming • Objective: Maximizing the number of base pairs (Nussinov et al, 1978) simple model: (i, j) = 1 iff GC|AU|GU fancier model: GC > AU > GU Dynamic programming Compute S(i,j) recursively (dynamic programming) – Compares a sequence against itself in a dynamic programming matrix Three steps Initialization G G G A A A U C C G 0 Example: G 0 0 G 0 A A A U C C GGGAAAUCC 0 0 0 0 0 0 0 0 0 0 0 0 0 the main diagonal the diagonal below L: the length of input sequence Recursion j G G G A A A U C C Fill up the table (DP matrix) -- diagonal by diagonal G 0 0 0 0 G 0 0 0 0 0 G 0 0 0 0 0 0 0 0 0 ? 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 A i A A U C C Traceback G G G A A A U C C G 0 0 0 0 0 0 1 2 3 G 0 0 0 0 0 0 1 2 3 0 0 0 0 0 1 2 2 0 0 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 G A A A U C C The structure is: What are the other “optimal” structures? Pseudoknots drastically increase computational complexity Objective: Minimize Secondary Structure Free Energy at 37 OC: Instead of (i, j), measure and sum energies: -2.1 -0.9 -1.6 C G G U U U U G Ghelix = G G C + G C A + 2G A A + G A C = C G U U U G G G -2.0 kcal/mol - 2.1 kcal/mol + 2x(-0.9) kcal/mol - 1.8 kcal/mol = -7.7 kcal/mol U U G C A A A CA C G G Ghairpin loop = Ginitiation (6 nucleotides) + Gmismatch C A = -2.0 -0.9 -1.8 +5.0 5.0 kcal/mol - 1.6 kcal/mol = 3.4 kcal/mol Gtotal = Ghairpin + Ghelix = 3.4 kcal/mol - 7.7 kcal/mol = -4.3 kcal/mol Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287. http://cs273a.stanford.edu [BejeranoFall13/14] 19 Zuker’s algorithm MFOLD: computing loop dependent energies RNA structure • • Base-pairing defines a secondary structure. The base-pairing is usually non-crossing. In this example the pseudo-knot makes for an exception. Bafna 1 Stochastic context-free grammar S aSu S aSu S cSg acSgu S gSc accSggu S uSa accuSaggu Sa accuSSaggu Sc accugScSaggu Sg accuggSccSaggu Su accuggaccSaggu S SS accuggacccSgaggu accuggacccuSagaggu accuggacccuuagaggu 1. A CFG 2. A derivation of “accuggacccuuagaggu” 3. Corresponding structure Cool algorithmics. Unfortunately… – – Random DNA (with high GC content) often folds into low-energy structures. We will mention powerful newer methods later on. ssRNA transcription • ssRNAs like tRNAs are usually encoded by short “non coding” genes, that transcribe independently. • Found in both the UCSC “known genes” track, and as a subtrack of the RepeatMasker track http://cs273a.stanford.edu [BejeranoFall13/14] 24 “non coding” RNAs (ncRNA) microRNAs (miRNA/miR) http://cs273a.stanford.edu [BejeranoFall13/14] 25 MicroRNA (miR) ~70nt ~22nt miR match to target mRNA is quite loose. mRNA a single miR can regulate the expression of hundreds of genes. http://cs273a.stanford.edu [BejeranoFall13/14] 26 MicroRNA Transcription mRNA http://cs273a.stanford.edu [BejeranoFall13/14] 27 MicroRNA Transcription mRNA http://cs273a.stanford.edu [BejeranoFall13/14] 28 MicroRNA (miR) Computational challenges: Predict miRs. Predict miR targets. ~70nt ~22nt miR match to target mRNA is quite loose. mRNA a single miR can regulate the expression of hundreds of genes. http://cs273a.stanford.edu [BejeranoFall13/14] 29 MicroRNA Therapeutics Idea: bolster/inhibit miR production to broadly modulate protein production Hope: “right” the good guys and/or “wrong” the bad guys Challenge: and not vice versa. ~70nt ~22nt miR match to target mRNA is quite loose. mRNA a single miR can regulate the expression of hundreds of genes. http://cs273a.stanford.edu [BejeranoFall13/14] 30 Other Non Coding Transcripts http://cs273a.stanford.edu [BejeranoFall13/14] 31 lncRNAs (long non coding RNAs) Don’t seem to fold into clear structures (or only a sub-region does). Diverse roles only now starting to be understood. Example: bind splice site, cause intron retention. Hard to detect or predict function computationally (currently) http://cs273a.stanford.edu [BejeranoFall13/14] 32 lncRNAs come in many flavors http://cs273a.stanford.edu [BejeranoFall13/14] 33 X chromosome inactivation in mammals X X X Dosage compensation X Y Xist – X inactive-specific transcript Avner and Heard, Nat. Rev. Genetics 2001 2(1):59-67 Transcripts, transcripts everywhere Human Genome Transcribed* (Tx) Tx from both strands* * True size of set unknown http://cs273a.stanford.edu [BejeranoFall13/14] 36 Or are they? http://cs273a.stanford.edu [BejeranoFall13/14] 37 The million dollar question Human Genome Transcribed* (Tx) Leaky tx? Tx from both strands* Functional? * True size of set unknown http://cs273a.stanford.edu [BejeranoFall13/14] 38 System output measurements We can measure non/coding gene expression! (just remember – it is context dependent) 1. First generation mRNA (cDNA) and EST sequencing: In UCSC Browser: http://cs273a.stanford.edu [BejeranoFall13/14] 39 2. Gene Expression Microarrays (“chips”) http://cs273a.stanford.edu [BejeranoFall13/14] 40 3. RNA-seq “Next” (2nd) generation sequencing. http://cs273a.stanford.edu [BejeranoFall13/14] 41 Gene Finding II: technology dependence Challenge: “Find the genes, the whole genes, and nothing but the genes” We started out trying to predict genes directly from the genome. When you measure gene expression, the challenge changes: Now you want to build gene models from your observations. These are both technology dependent challenges. The hybrid: what we measure is a tiny fraction of the space-time state space for cells in our body. We want to generalize from measured states and improve our predictions for the full compendium of states. http://cs273a.stanford.edu [BejeranoFall13/14] 42 4. Spatial-temporal maps generation http://cs273a.stanford.edu [BejeranoFall13/14] 43 Cluster all genes for differential expression Experiment Control (replicates) (replicates) genes Most significantly up-regulated genes Unchanged genes Most significantly down-regulated genes http://cs273a.stanford.edu [BejeranoFall13/14] 44 Determine cut-offs, examine individual genes Experiment Control (replicates) (replicates) genes Most significantly up-regulated genes Unchanged genes Most significantly down-regulated genes http://cs273a.stanford.edu [BejeranoFall13/14] 45 Genes usually work in groups Biochemical pathways, signaling pathways, etc. Asking about the expression perturbation of groups of genes is both more appealing biologically, and more powerful statistically (you sum perturbations). http://cs273a.stanford.edu [BejeranoFall13/14] 46 Ask about whole gene sets + Exper. Control Gene Set 1 Gene Set 2 Gene Set 3 Gene set 3 up regulated ES/NES statistic Gene set 2 down regulated http://cs273a.stanford.edu [BejeranoFall13/14] The Gene Sets to test come from the Ontologies TJL-2004 48 One approach: GSEA Dataset distribution Number of genes Gene set 3 distribution Gene set 1 distribution Gene Expression Level http://cs273a.stanford.edu [BejeranoFall13/14] 49 Another popular approach: DAVID Input: list of genes of interest (without expression values). http://cs273a.stanford.edu [BejeranoFall13/14] 50 Multiple Testing Correction run tool Note that statistically you cannot just run individual tests on 1,000 different gene sets. You have to apply further statistical corrections, to account for the fact that even in 1,000 random experiments a handful may come out good by chance. (eg experiment = Throw a coin 10 times. Ask if it is biased. If you repeat it 1,000 times, you will eventually get an all heads series, from a fair coin. Mustn’t deduce that the coin is biased) http://cs273a.stanford.edu [BejeranoFall13/14] 51 What will you test? run tool Also note that this is a very general approach to test gene lists. Instead of a microarray experiment you can do RNA-seq. Instead of up/down-regulated genes you can test all the genes in a personal genome where you see “bad” mutations. Any gene list can be tested. http://cs273a.stanford.edu [BejeranoFall13/14] 52 Coding and non-coding gene production To change its behavior a cell can change the repertoire of genes and ncRNAs it makes. The cell is constantly making new proteins and ncRNAs. These perform their function for a while, And are then degraded. Newly made coding and non coding gene products take their place. The picture within a cell is constantly “refreshing”. http://cs273a.stanford.edu [BejeranoFall13/14] 53 Cell differentiation To change its behavior a cell can change the repertoire of genes and ncRNAs it makes. That is exactly what happens when cells differentiate during development from stem cells to their different final fates. http://cs273a.stanford.edu [BejeranoFall13/14] 54 Human manipulation of cell fate To change its behavior a cell can change the repertoire of genes and ncRNAs it makes. We have learned (in a dish) to: 1 control differentiation 2 reverse differentiation 3 hop between different states http://cs273a.stanford.edu [BejeranoFall13/14] 55 Cell replacement therapies We want to use this knowledge to provide a patient with healthy self cells of a needed type. We have learned (in a dish) to: 1 control differentiation 2 reverse differentiation 3 hop between different states (iPS = induced pluripotent stem cells) http://cs273a.stanford.edu [BejeranoFall13/14] 56 How does this happen? Different cells in our body hold copies of (essentially) the same genome. Yet they express very different repertoires of proteins and non-coding RNAs. How do cells do it? A: like they do everything else: using their proteins & ncRNAs… http://cs273a.stanford.edu [BejeranoFall13/14] 57 Gene Regulation Some proteins and non coding RNAs go “back” to bind DNA near genes, turning these genes on and off. Gene DNA Proteins To be continued… http://cs273a.stanford.edu [BejeranoFall13/14] 58 Inferring Gene Expression Causality Measuring gene expression over time provides sets of genes that change their expression in synchrony. • But who regulates whom? • Some of the necessary regulators may not change their expression level when measured, and yet be essential. “Reading” enhancers can provide gene regulatory logic: • If present(TF A, TF B, TF C) then turn on nearby gene X http://cs273a.stanford.edu [BejeranoFall13/14] 59 Review Lecture 6 • Central dogma recap – Genes, proteins and non coding RNAs • RNA world hypothesis • Small structural RNAs – Sequence, structure, function – Structure prediction – Transcription mode • MicroRNAs – Functions – Modes of transcription • lncRNAs – Xist • Genome wide (and context wide) transcription – How much? – To what goals? • Gene transcription and cell identity – Cell differentiation – Human manipulation of cell fates • Gene regulation control http://cs273a.stanford.edu [BejeranoFall13/14] 60