Pattern Discovery in RNA Secondary Structure Using Affix Trees
(when computer scientists meet real molecules)

Giulio Pavesi & Giancarlo Mauri
Dept. of Computer Science, Systems and Communication
University of Milano-Bicocca
Milan, Italy

Why Is RNA So Interesting?
• After the completion of various genome projects, the attention of many researchers has shifted from the coding to the non-coding parts of the genome
• More than 95% of our genome is non-coding: what is it for?
• Non-coding RNA: RNA that is transcribed from DNA, but does not directly encode a protein (tRNA, microRNA, etc.)

CPM 2003 - Morelia, Giulio Pavesi

A Motivating Example
Post-transcriptional regulation of gene expression

The Problem
• Functionally related RNA sequences present structural similarity, at least in some parts
• Given two or more RNA molecules, find similar (supposedly functional) structural elements in them
• Sequence similarity usually implies structure similarity, but for RNA the two do not always go together
• Given two or more RNA sequences of unknown structure, find similar structural elements in them (motifs)
• Low sequence similarity can still correspond to high structure similarity

“Know Thine Enemy”
• RNA secondary structure: the list of base pairs among the nucleotides of the sequence, such that:
– No nucleotide takes part in more than a single base pair (usually Watson-Crick pairs and G-U wobble pairs, written G-T in the DNA alphabet used on these slides, i.e. canonical base pairs)
– Base pairs never cross: if nucleotide i is bound to nucleotide j and nucleotide k to nucleotide l (with i < j, k < l and i < k), then either i < j < k < l or i < k < l < j

RNA Secondary Structure
.((..(((.((....)))))...(((.(((...)))...)))))

Motifs in RNA Secondary Structure
• Many functional motifs can be described by secondary structure alone
• Two types of similarity:
– sequence similarity (mainly in unpaired nucleotides)
– structure similarity

Data Structures?
• When dealing with DNA or protein sequences, significant advantages have been obtained by using suitable text-indexing structures (e.g. suffix trees)
• RNA secondary structure can be described by a string
• Is there a “good” structure that will do for RNA sequences, allowing us to consider sequence and structure at the same time?

Affix Trees
Figure: affix tree for string S = ATATC, showing suffix and prefix edges
• Suffix edges spell the substrings of string S
• Prefix edges (dotted) spell the substrings of S^-1 (the reverse of S)
• Built in linear time
• Takes linear space

Affix Trees
• The affix tree of a string S indexes all the substrings of both S and S^-1
• Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges)
• Good if we search the sequences for patterns with some kind of symmetry

The Hairpin
• The basic element of RNA secondary structure is the hairpin (or stem-loop) structure
• The hairpin is symmetric!

(((((  ......  )))))
AGGTC  CAGTCA  GATCT

First Try
• Predict the secondary structure of each input sequence
• Build the affix tree for the folded sequences (in bracket notation)
• Search exhaustively for patterns describing hairpin structures (possibly with differences)
• Report those occurring in at least q sequences

Searching for Hairpins in Affix Trees
• For each loop size l:
1. Find l dots in the tree, on suffix edges (hairpin loop)
2. Add a base pair:
   a) Find a ) on suffix edges
   b) Find a ( on prefix edges
3. If the resulting pattern appears in at least q sequences, jump to 2; otherwise return from the recursive call (backtrack)
4. Add internal loops:
   a) Find a dot on prefix edges: jump to 2
   b) Find a dot on suffix edges: jump to 2

Recursive Algorithm

Step  Pattern        Result
1     ....           ok
2     (....)         ok
2     ((....))       ok
2     (((....)))     no
4a    .((....))      ok
2     (.((....)))    ok

• On each path, we keep one pointer for the prefix edge and another for the suffix edge
• Speed-up: represent each unpaired element with a single symbol describing its type and size, so that we compare two symbols instead of two regions

Approximate Search
• We can allow some approximation:
– Hairpin loops of different size (range value at step 1)
– Internal loops of different size at the same position along the stem
– Internal loops or bulges at different positions along the stem
– Stems of different size (number of base pairs)
– Any combination of the previous

Complexity
• Given a set of k folded sequences of overall length N:
– Construction of the tree: O(N)
– Annotation of the tree: O(kN)
– Search: O(V(m)kN), where m is the length of the longest pattern found
– V(m) depends on the degree of approximation
– In practice, the most time-consuming part is predicting the structure of the sequences

Does It Work?
• Test: the Iron Responsive Element (IRE), located in the UTRs of mRNAs coding for proteins involved in iron metabolism (e.g. ferritin, transferrin receptor)
• Does it appear in all the predicted structures?
• Alas, it does not!

Why?
The “real” structure often does not correspond to the optimal one!
The motif “disappears” from the (supposedly) optimal structure

One Possible Solution
• Idea: for each sequence, also consider a number of alternative sub-optimal structures
• All the possible structures can be enumerated
• Check whether a motif appears in at least one alternative structure per sequence
• The affix tree can efficiently handle even hundreds of alternative structures per input sequence
• Downside: the number of potential secondary structures for a sequence of length n is O(2^n)
• If the similarity criterion is not stringent, we have too many candidates

But...
• If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in the sequences

(((((  ......  )))))
AGGTC  CAGTCA  GATCT
GCGAG  CAGTCT  CTTGC
CCCAG  CAGTCA  CTGGG

Idea!
• Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly
• The search can be implemented with the same parameters as in the folded case

Building Hairpins on the Fly
• By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure
• In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences
• We need to “validate” the candidate structures, e.g. according to their energy

Post-Processing
• So far we have considered structure alone
• More than a single motif occurrence per sequence is often reported, especially if the structural constraints are loose
• Post-processing: compare the candidate occurrences by evaluating sequence similarity in unpaired elements
• Find the group of instances that are most similar at the sequence level

Results and Work in Progress
• The second approach gave better results, in terms of both reliability and efficiency
• Candidate hairpins can be validated according to their energy value (more reliable, in this case!)
• Good results on “harder” tests
• Still too many input parameters
• Extend to more complex structures
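
Sketch: hairpins on the fly
The second approach (unfolded sequences, complementary base pairs found on the fly) can be illustrated with a minimal sketch. It is not the affix-tree algorithm from the talk: it scans each sequence by brute force, fixes a loop, and grows the stem outwards, which only mimics the simultaneous left and right extension that prefix and suffix edges provide. The names hairpins, common_hairpins, CANONICAL, loop_sizes and min_stem are hypothetical, the support threshold q is applied after the scan rather than during the search, and internal loops, bulges and energy validation are left out.

```python
from collections import defaultdict

# Canonical pairs in the DNA alphabet used on the slides (G-T is the wobble pair).
CANONICAL = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def hairpins(seq, loop_sizes, min_stem):
    """Yield (start, stem_len, loop_len) for every perfect hairpin in seq.

    Hypothetical helper: a loop of loop_len unpaired bases is fixed first,
    then the stem grows outwards one base pair at a time, extending the
    candidate to the left and to the right simultaneously.
    """
    n = len(seq)
    for loop_len in loop_sizes:
        for loop_start in range(1, n - loop_len):
            left, right = loop_start - 1, loop_start + loop_len
            stem_len = 0
            while left >= 0 and right < n and (seq[left], seq[right]) in CANONICAL:
                stem_len += 1
                if stem_len >= min_stem:
                    yield left, stem_len, loop_len
                left -= 1
                right += 1


def common_hairpins(seqs, q, loop_sizes=range(3, 9), min_stem=3):
    """Return {bracket_pattern: set of sequence indices} with support >= q."""
    support = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for _start, stem_len, loop_len in hairpins(seq, loop_sizes, min_stem):
            pattern = "(" * stem_len + "." * loop_len + ")" * stem_len
            support[pattern].add(idx)
    # Keep only the patterns that occur in at least q distinct sequences.
    return {p: s for p, s in support.items() if len(s) >= q}


if __name__ == "__main__":
    # The three example sequences from the "But..." slide, written without gaps.
    examples = ["AGGTCCAGTCAGATCT", "GCGAGCAGTCTCTTGC", "CCCAGCAGTCACTGGG"]
    for pattern, where in sorted(common_hairpins(examples, q=3).items()):
        print(pattern, sorted(where))
```

Because every stem length from min_stem upwards is reported, loose settings return many overlapping candidates per sequence, which is why the post-processing and validation steps described above matter.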