The University of Hong Kong
CSIS0801 Final Year Project

Alignment Algorithms for RNA Molecules
Project Plan V1.0

Supervisor: Dr. S.M. Yiu
Member: Kwong Lap Man, Levin
Email: fyp11016@cs.hku.hk
Website: http://i.cs.hku.hk/fyp/2011/fyp11016/public_html/index.html
Date: September 23, 2011

Final Year Project 2011-2012
Alignment Algorithms for RNA Molecules
Kwong Lap Man, Levin

Introduction:

Ribonucleic acid, or RNA, is a major macromolecule that is essential for all known forms of life. RNA is responsible for regulating genetic and metabolic activities in cells. An RNA molecule is a single long chain of components called nucleotides, of four basic types: Adenine (A), Cytosine (C), Guanine (G) and Uracil (U). The sequence of nucleotides determines the structure of an RNA, which in turn affects the function of the molecule.

Many functional genes have now been identified and classified into families by function. Given a query RNA with known structure and a genome, it is useful to identify all genomic substrings that match the query in both sequence and structure, because this helps biologists determine the function of the RNA. This problem has been studied for many years, and computer scientists have developed a number of algorithms for it. This project focuses on using different programming techniques to improve those algorithms.

Possible directions:

(a) Improve the fundamental DP algorithm: One direction is to improve the fundamental dynamic programming (DP) algorithm used by existing methods. The first step is to study the O(mn^4) algorithm (Buhm Han, Banu Dost, Vineet Bafna and Shaojie Zhang, 2008); the second step is to study techniques that have already been used to speed up DP algorithms, such as the Four-Russians speed-up; the final step is to apply such a technique to the algorithm and test its performance.

(b) Index the RNA sequence: Another direction is to index the RNA sequence, because the sequence is long and contains redundancies (mainly in the non-functional substrings). The idea is to use a suffix tree to eliminate these redundancies. Step one is to build a suffix tree of the RNA sequence; step two is to adapt the DP algorithm to work on the suffix tree; the final step is to test the performance of the suffix-tree DP algorithm.

(c) Index the genome families: Another direction is to index the genome families, because each RNA substring needs to be compared with all genes from each genome family, and genes in the same family have similar structures. The idea is to index each genome family. Step one is to find a way to index a family; step two is to adapt the DP algorithm to work on indexed families; the final step is to test the performance of the improved DP algorithm.

Project Schedule:

Items                                                        Time required
Study existing algorithms                                    1.5 months
Design improved algorithm focusing on the core DP            1.5 months
Implement and test the algorithm                             0.5 months
Design improved algorithm on indexing the RNA sequence       1.5 months
Implement and test the algorithm                             0.5 months
Design improved algorithm on indexing the genomic families   1.5 months
Implement and test the algorithm                             0.5 months

References:

(1) Samuel Ieong, Ming-Yang Kao, Tak-Wah Lam, Wing-Kin Sung and Siu-Ming Yiu, Predicting RNA Secondary Structures with Arbitrary Pseudoknots by Maximizing the Number of Stacking Pairs, Journal of Computational Biology, Volume 10, Number 6, 2003
(2) Thomas K.F. Wong and S.M. Yiu, Structural Alignment of RNAs with Pseudoknots
(3) Structural Alignment of Pseudoknotted RNA, Journal of Computational Biology
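
Illustration for direction (b):

The indexing idea in direction (b) can be illustrated with a small Python sketch. It is not the project's implementation: it builds an uncompressed suffix trie, a simpler relative of the suffix tree named in the plan, and the example sequence and all function names are hypothetical. It only shows how repeated substrings of a sequence collapse onto shared paths, which is the redundancy a suffix-tree based DP would try to avoid recomputing.

```python
def build_suffix_trie(seq):
    """Build an uncompressed suffix trie of seq as nested dicts.

    Assumption: this is a plain trie, not a compressed suffix tree, so it
    needs O(n^2) space in the worst case. It is for illustration only.
    """
    root = {}
    for start in range(len(seq)):
        node = root
        for ch in seq[start:]:
            # Reuse an existing child if this prefix was seen before.
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count the nodes below node (one node per distinct substring)."""
    return sum(1 + count_nodes(child) for child in node.values())


def contains(root, pattern):
    """Return True if pattern occurs somewhere in the indexed sequence."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


# Hypothetical toy sequence with an obvious repeat.
seq = "GCAUGCAUGC"
root = build_suffix_trie(seq)

# Characters visited if every suffix were processed separately.
total_suffix_chars = len(seq) * (len(seq) + 1) // 2
# Trie nodes, i.e. work units left once shared prefixes are merged.
shared_nodes = count_nodes(root)

print("characters over all suffixes:", total_suffix_chars)
print("trie nodes after sharing:", shared_nodes)
print("contains 'AUGC':", contains(root, "AUGC"))
```

The gap between the two counts grows with the amount of repetition in the sequence. A real suffix tree would additionally compress single-child paths to keep the index linear in size.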