Predicting RNA Secondary Structures: A Lattice Walk Approach to Modeling Sequences Within the HIV-1 RNA Structure Facing the Challenge of Infectious Diseases in Africa: The Role of Mathematical Modeling University of Witswatersrand Johannesburg, South Africa September 25-27, 2006 Asamoah Nkwanta, Ph.D. Morgan State University Nkwanta@jewel.morgan.edu TOPICS RNA Prediction & Molecular Biology RNA Combinatorics Certain Class of Random Walks Matrix Theory Connection Between Walks & RNA Modeling HIV-1 RNA Sequences RNA Secondary Structure Prediction “The Human Genome Project and related efforts have generated enormous amounts of raw biological sequence data. However, understanding how biological sequences encode structural information remains a fundamental scientific challenge. For instance, understanding the base pairing, or secondary structure, of single-stranded RNA sequences is crucial to advancing knowledge of their novel biochemical functions.” C. E. Heithsch, Combinatorics on Plane Trees, Motivated by RNA Secondary Structure Configuration (preprint, 2005) What is RNA Secondary Sequence Prediction ? RNA Secondary Structure Prediction Given a primary sequence, we want to find the biological function of the related secondary structure. To achieve this goal we predict (model) its’ secondary structure. Most methods predict secondary structure rather than tertiary structure. The three dimensional shape is important for biological function, and it is harder to predict. Molecular Biology (Cont.) 3-D structure of Haloarcula marismortui 5S ribosomal RNA in large ribosomal subunit Molecular Biology Central Dogma DNA RNA Protein Transcription / Translation Molecular Biology (Cont.) Molecular Biology (Cont.) However, the "Central Dogma" has had to be revised a bit. It turns out that you CAN go back from RNA to DNA, and that RNA can also make copies of itself. It is still NOT possible to go from Proteins back to RNA or DNA, and no known mechanism has yet been demonstrated for proteins making copies of themselves. Molecular Biology (cont.) HIV is one of a group of atypical viruses called retroviruses that maintain their genetic information in the form of RNA. Retroviruses are capable of producing DNA from RNA. Molecular Biology (Cont.) Molecular Biology (cont.) Ribonucleic acid (RNA) molecule: Three main categories mRNA (messenger) – carries genetic information from genes to other cells tRNA (transfer) – carries amino acids to a ribosome (cells for making proteins) rRNA (ribosomal) – part of the structure of a ribosome Molecular Biology (cont.) Other types (RNA) molecules: snRNA (small nuclear RNA) – carries genetic information from genes to other cells miRNA (micro RNA) – carries amino acids to a ribosome (cells for making proteins) iRNA (immune RNA) – part of the structure of a ribosome (Important for HIV studies) RNA Secondary Structure “RNA secondary structures are important in many biological processes and efficient structure prediction can give vital directions for experimental investigation. “ B. Knudsen and J. Hein, Pfold: RNA secondary structure prediction using stochastic context-free grammars (Nucleic Acids Research, 2003) There are published examples involving tRNA, rRNA, and other types of RNA RNA Secondary Structure (cont.) A ribonucleic acid (RNA) molecule consists of a sequence of ribonucleotides (typically single stranded) Each ribonucleotide contains one of four bases: adenine (A), cytosine (C), guanine (G), and uracil (U) Secondary Structure (cont.) Note U is replaced by thymine (T) in DNA As the molecule forms, chemical bonds join A-U and C-G pairs, (Unstable G-U). These are called the Watson-Crick pairs. Secondary Structure (cont.) Primary Structure – The linear sequence of bases in an RNA molecule Secondary Structure – The folding or coiling of the sequence due to bonded nucleotide pairs: A-U, G-C Tertiary Structure – The three dimensional configuration of an RNA molecule Primary RNA Sequence CAGCAUCACAUCCGCGGGGUAAACGCU Nucleotide Length, 27 bases Geometric Representation Secondary structure is a graph defined on a set of n labeled points (M.S. Waterman, 1978) Biological Combinatorial/Graph Theoretic Random Walk RNA COMBINATORICS RNA Numbers 1,1,1,2,4,8,17,37,82,185,423,978,… These numbers count various combinatorial objects including RNA secondary structures of length n. RNA COMBINATORICS (cont.) The number of RNA secondary structures for the sequence [1,n] is counted by the coefficients of s(z): 3 z2 2 z1 z1 1 ) z ( s 6 z 71 5 z8 4 Coefficients of the formal power series: (1,1,1,2,4,8,17,37,82,185,423,978,…) z4 RNA COMBINATORICS (cont.) The number of lattice paths with unit steps R (right), U (up) & D (down) that go from (0,0), remain in the first quadrant of the coordinate plane, and return to the x-axis under the restriction that there are never consecutive UD steps is the nth RNA number: (1,1,1,2,4,8,17,37,82,185,423,978,…) RNA COMBINATORICS (cont.) n The number of RNA sequences of length n that can be formed over the words [A,U,G,C] such that the letters A & U are not adjacent is equal to: n 71 3 5 71 71 3 5 71 2 2 7 1 2 71 2 What a remarkable formula for an integer, when n = 1 we get 4, and n = 2 we get 14. Counting Sequence Database The On-line Encyclopedia of Integer Sequences: http:/www.research.att.com/njas/sequence s/index.html N.J.A. Sloane & S. Plouffe, The Encyclopedia of Integer Sequences, Academic Press, 1995. RNA EQUATIONS Recurrence Relations: s0 s1 s2 1 n 1 sn 1 sn s j 1sn j , n 2 j 1 RNA EQUATIONS (cont.) 4 Generating Function: z z2 z z2 1 z z 1 z s 2 z2 3 2 2 1,1,1,2,4,8,17,37,82,185,423,978,… RNA EQUATIONS (cont.) Exact Formula: 1 n k n k sn k 1 n k k 1 k RNA EQUATIONS (cont.) s(n,k) is the number of structures of length n with exactly k base pairs: For n,k > 0, 1 n k n k 1 sn, k k k 1 k 1 RNA EQUATIONS (cont.) Asymptotic Estimate: As n grows without bound 15 7 5 3 5 n sn ~ 8 2 3 2 n Random Walk A random walk is a lattice path from one point to another such that steps are allowed in a discrete number of directions and are of a certain length RNA Walk – Type I NSE* Walks – Unit step walks starting at the origin (0,0) with steps up, down, and right No walks pass below the x-axis and there are no consecutive NS steps RNA Walk – Type I (cont.) N = (0,1) up S = (0,-1) down E = (1,0) right Type I Walk Array (n x k) 1 1 1 2 4 8 17 0 0 0 0 0 0 1 2 0 1 0 0 0 0 0 0 0 0 3 6 3 6 1 4 0 1 0 0 0 0 13 28 13 30 10 24 5 15 1 6 0 1 RNA Walk – Type II NSE** Walks – Unit-step walks starting at the origin (0,0) with steps up, down, and right such that no walks pass below the x-axis and there are no consecutive SN steps Type II Walk Array (n x k) 1 1 2 4 8 17 37 0 0 0 0 0 0 1 2 0 1 0 0 0 0 0 0 0 0 4 9 3 7 1 4 0 1 0 0 0 0 20 45 17 41 11 29 5 16 1 6 0 1 Examples Type I: ENNESNESSE Type II: NEEENSEEES RNA Walk Bijection Theorem: There is a bijection between the set of NSE* walks of length n+1 ending at height k = 0 and the set of NSE** walks of length n ending at height k = 0. Source: Lattice paths, generating functions, and the Riordan group, Ph.D. Thesis, Howard University, Washington, DC, 1997 Matrices Count Lattice Walks Type I Walks 1 0 1 1 1 2 2 3 4 6 8 13 17 28 - - 0 0 1 3 6 13 30 - 0 0 0 1 4 10 24 - 0 0 0 0 1 5 15 - 0 0 0 0 0 1 6 0 0 0 0 0 0 1 - - - Type II Walks 1 1 2 4 8 17 37 - 0 1 2 4 9 20 41 - 0 0 1 3 7 17 41 - 0 0 0 1 4 11 29 - 0 0 0 0 1 5 16 - 0 0 0 0 0 1 6 - 0 0 0 0 0 0 1 - - The ith-jth entry corresponds to the number of random walks of length i and ending height j. Type I Formation Rule (Recurrence) m0,0 1 mn 1,0 mn j , j j 0 mn 1, k mn, k 1 mn j , k j mn 1, k 0, k n 1 j 0 The Connection Between RNA and the Walks Theorem: There is a bijection between the set of RNA secondary structures of length n and the set of NSE* walks ending at height k=0. Source: Lattice paths and RNA secondary structures, DIMAC Series in Discrete Math. & Theoretical Computer Science 34 (1997) 137-147. (CAARMS2 Proceedings) Primary Molecule Molecules Derived From Lattice Walks Secondary Molecules RNA Sequence GGU GUUAA UUG Known Conserved Region AGG GUG GAG Regulatory Region Walk 1 Lattice Walk Application To Predict RNA Sequence Walk 2 HIV-1 RNA Sequence Prediction We want to construct a lattice walk method to predict secondary RNA sequences that code for regions of the SL2 and SL3 domains within the HIV-1 5’ UTR RNA molecule. These domains are important for HIV genomic packaging HIV-1 RNA Structural Components Components of Secondary Structure Base pairs Bulges Interior Loops End loops Hairpin Multibranch loops – junctions where more than one hairpin or more complex secondary structures are appended. HIV-1 Sequence (SL2 & SL3) The following sequence was obtained from the NCBI website. The first 363 nucleotides were extracted from the entire HIV-1 RNA genomic sequence: GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCU GGCUAACUAGGGAACCCACUGCUUAAGCCUCAAUAAAGCUU GCCUUGAGUGCUUCAAGUAGUGUGUGCCCGUCUGUUGUGU GACUCUGGUAACUAGAGAUCCCUCAGACCCUUUUAGUCAGU GUGGAAAAUCUCUAGCAGUGGCGCCCGAACAGGGACCUGA AAGCGAAAGGGAAACCAGAGGAGCUCUCUCGACGCAGGAC UCGGCUUGCUGAAGCGCGCACGGCAAGAGGCGAGGGGCGG CGACUGGUGAGUACGCCAAAAAUUUUGACUAGCGGAGGCUA GAAGGAGAGAGAUGGGUGCGAGAGCGUCAGUAUUAAGCG Color key: SL2 – yellow SL3 - red Known Sequence of the SL2 Domain GGCGACUGGUGAGUACGCC mfe -7.1 Original Structure Type I Walk Type II Walk Lattice Walk Model Start with an RNA primary sequence Perform RNA combinatorial analysis on the given sequence Connect lattice walks to the given sequence using Type I and II walks Calculate identified sequences to find the minimum free energy Predict secondary sequence Conduct laboratory experiments for biological functionality Acknowledgments National Science Foundation, DIMACS, AIMS, Burroughs Wellcome, SACEMA, WITS MATH. Modeling 561, Graduate Students Collaborators: Dwayne Hill, Biology Dept., MSU, and Alvin Kennedy, Chemistry Dept., MSU