CITY UNIVERSITY OF HONG KONG CERG Applications 2004-2005 Checklist of documents to be submitted with Application Please specify the subject area under which you wish your application to be considered (please choose only one): E1 E2 E3 E4 Civil Engineering, Surveying, Building and Construction Computing Science & Information Technology Electrical & Electronic Engineering Mechanical, Production & Industrial Engineering P1 P2 P3 Chemical Engineering ) Physical Sciences ) Mathematics ) M1 M2 Biological Sciences Medicine, Dentistry & Health H1 H2 H3 H4 Administrative, Business & Social Studies Arts & Languages Education Law, Architecture, Town Planning & Other Professional & Vocational Subjects Application intended for the Physical Sciences panel MUST be submitted through the RGC electronic system Please check the appropriate boxes on the right by a (X) indicating that the necessary information and/or supporting documents have been included with your application. Principal Investigators are responsible for ensuring that the application is complete. Included (X) 1. Application signed by all Investigators 2. The completed Research Grant Data Sheet (CERG 04/05) 3. Quotations for equipment purchase costing over $200,000 4. A one-page brief C.V. where required for Investigators 5. (a) Proof of human research ethics approval (b) Proof of animal research ethics approval (c) Proof/certificate of biological safety (d) Proof/certificate of ionizing radiation safety (e) Proof/certificate of non-ionizing radiation safety 6. 
Primary field area (secondary field area is optional) and corresponding codes have been indicated Name of PI Signature ******************************************************************************* (For RO use only) Initial Checking CS HT KC HT KC Update list Update Proposal Required/Not required Final Checking CS ERG1_03.doc ERG 1 (Revised 5/03) CITY UNIVERSITY OF HONG KONG RESEARCH GRANT DATA SHEET This form must accompany all proposals for research for input onto the RGC database. It is designed to simplify the procedure for obtaining approval signatures, to ensure compliance with University safety/ethics policies, and to improve record-keeping. SECTION A - PROJECT DETAILS Principal Investigator: Title Associate professor Department:Dept. of Computer Science Surname: Wang Tel: 2788 9820 First Name Lusheng Wang Email: lwang@cs.cityu.edu.hk Co-Investigator(s): Title Name Dept/Institution/Country Project period (YY/MM): July 1, 2004 to June 30, 2006 Duration (in months):24 Please select one of the following: New submission Re-submission (previous application ref: ___________________________ Application for individual Research Application for Longer-term grants Project title: ) Random Projection, Motif Based Sequence Alignment and RNA Secondary Structures Budget summary: Less: Research support staff salaries HK$ Equipment HK$ General expenses/Travel HK$ Conference HK$ Others (including funds already secured) HK$ Net Total HK$ Please complete each of the following [Note to PI: are you aware of any safety hazards introduced by this research proposal that are not covered or addressed in the existing safety codes or guidelines. 
If yes, please bring them to the attention of the Head of Department]:

                          Approval not required    Approval obtained & attached    Approval pending
Human research ethics
Animal research ethics
Biological safety
Chemical safety
Radiation safety

SECTION B - UNDERTAKING BY THE PRINCIPAL INVESTIGATOR
This application is submitted in compliance with the RGC's terms and conditions and University policies and procedures, whereby research, if funded, will be undertaken accordingly by all staff and students engaged on the project. The information given is, to the best of my knowledge, complete and accurate.
Signature: Date:

SECTION C - APPROVAL SIGNATURES
I confirm: [Please mark 'X' in the appropriate box(es)]
a) that I have read this application, understand its resource implications for the Department, and will make available the necessary infrastructural support and space for the project if it is funded; Yes No N/A
b) that I am satisfied that all health and safety risks associated with the project have been considered, and that the Department can provide adequate measures to control these risks [Heads should contact the Safety Officer for further advice if they are in any doubt about the safety aspects of a proposal]; Yes No N/A
c) that the proposal is worthy of the support of the University. Yes No N/A
Department Head: Signature: Date:

RGC Ref No. (to be completed by the institution)
RESEARCH GRANTS COUNCIL
Application for Allocation from the Earmarked Research Grant for 2004-2005
Application Form (ERG1)
[Please read the Explanatory Notes ERG2 carefully before completing this form]

PART I SUMMARY OF THE RESEARCH PROPOSAL [To be completed by the applicant(s)]

1. Title of Project: Random Projection, Motif Based Sequence Alignment and RNA Secondary Structures
Primary Field: Algorithms
Secondary Field: Computational Biology

2. Name(s) and Academic Affiliation(s) of Applicant(s):
Principal Investigator [PI] (name with title): Dr.
Lusheng Wang. Post: Assoc. Prof. Unit/Department/Institution: CS/City Univ.
Co-Investigator(s) [Co-I(s)] (with title):

3. Allocation Requested from the Earmarked Research Grant:
Total cost of the project:
(a) Staff HK$
(b) Relief Teacher (required exceptionally) HK$
(c) Equipment HK$
(d) General Expenses HK$
(e) Conference (standard rate: $12,000 per year) HK$
Less: Other research funds secured from other sources HK$
Net amount requested *: HK$
* The amount may be reduced further if additional funds from other sources have been secured after submission of this application.

4. Nature of application *
New [i.e. PI and/or Co-I(s) applying for RGC funds on this research topic for the first time]. Please give further details in Part II item 2.
Re-submission [i.e. PI and/or Co-I(s) have previously applied for RGC funds on this research topic but application not supported]. Please give further details in Part II item 4.
On-going [i.e. PI and/or Co-I(s) extending work previously funded by the RGC]. Please give further details in Part II items 2, and 7-9.
Application for individual research.
Application for longer-term research grant.
* Please tick '✓' as appropriate

5. Abstract of research (limited to ½ A-4 page or 200 words, and comprehensible to a non-specialist):
In this project, we intend to design and analyze efficient and effective algorithms and heuristics for several problems arising in computational molecular biology. The problems we intend to attack include random projection algorithms for motif detection, motif-based multiple sequence alignment, and some computational problems on RNA secondary structures. Motif detection is a computationally challenging problem. It has many applications in computational biology, e.g., locating binding sites and finding conserved regions in unaligned sequences.
We will study the random projection approach from a theoretical point of view. Multiple sequence alignment is one of the core problems in computational biology. Algorithms for multiple sequence alignment are routinely used to find conserved regions in biomolecular sequences, to construct family and superfamily representations of sequences, and to reveal evolutionary histories of species (or genes). We intend to adopt the motif-based approach, proposed long ago, to attack the problem. Great progress has been made recently in identifying conserved motifs, and these new motif detection techniques should greatly improve the prospects of motif-based approaches to multiple sequence alignment. For RNA secondary structures, we will study (1) algorithmic issues of the space-time tradeoff in the comparison of two RNA secondary structures and (2) RNA secondary structure search in RNA sequences. The project will emphasize algorithmic issues as well as computational complexity for all the proposed problems.

PART II DETAILS OF THE RESEARCH PROPOSAL [To be completed by the applicant(s)]

RESEARCH DETAILS

1. The project objectives and long-term impact (maximum 1 A-4 page):

Purpose: We will design algorithms that are much needed in practice for the following important problems.

Motif-based multiple sequence alignment: Let $S = \{s_1, s_2, \ldots, s_n\}$ be a set of $n$ sequences. An alignment $A$ of the $n$ sequences is obtained by inserting spaces into the sequences such that all resulting sequences have the same length, and then overlaying them. For motif-based multiple sequence alignment, we first identify conserved motifs, then use those motifs to decompose the sequences into smaller segments, and finally solve the problem by fixing the alignment for those smaller segments. How to decompose the sequences into smaller segments after motifs are identified is one of the important issues here.
We will study this issue and hope to obtain results of both theoretical and practical interest. Another issue is how to identify the conserved motifs. There have been new developments in motif detection recently. We will study randomized algorithms, e.g., the random projection approach, for motif detection.

Motif detection problem: Given an integer $L$ and a set of $n$ sequences $s_1, s_2, \ldots, s_n$, each of length at most $m$, find the conserved regions of length $L$, where conservation can be measured by the consensus score or the bottleneck score (discussed later).

RNA secondary structure:
(1) Space-time tradeoff for RNA secondary structure comparison: Many measures have been proposed for RNA secondary structure comparison, e.g., tree edit distance, constrained edit distance and alignment of trees. For all those measures, the algorithms need super-quadratic time and space. In many cases, the space limit is more serious than the time limit. For pairwise sequence comparison, a linear-space algorithm that runs in $O(n^2)$ time (as fast as the $O(n^2)$-space algorithm) has been designed. This is treated as a classic technique in the area. Here we will study the algorithmic issues of the space-time tradeoff for all those different measures.
(2) RNA secondary structure search:
(i) Exact search: given an RNA secondary structure R (the pattern) and an RNA sequence S (a sequence over {A, C, G, U}, treated as the text), a base-pair (i, j) in R matches the two letters S[k] and S[k+j-i] in S if S[k] and S[k+j-i] form either a C-G pair or an A-U pair. The secondary structure R matches the segment S[k]S[k+1]…S[k+n] if for every base-pair (i, j) in R, S[k+i] and S[k+j] form either a C-G pair or an A-U pair. The problem is to find all the segments in S that match R.
(ii) Approximate search: given an RNA secondary structure R (the pattern), the number of mismatches between R and the segment S[k]S[k+1]…S[k+n] is the number of pairs (i, j) such that (i, j) is a base-pair in R and S[k+i] and S[k+j] form neither a C-G pair nor an A-U pair. In addition to R and S, we are also given an integer m; the problem is to find all segments in S that have at most m mismatches with respect to R.

Long Term Significance: The problems proposed in this project come from computational molecular biology and bioinformatics. They have applications in SNP haplotype map construction, comparison of genomes, PCR primer design, binding site finding, creating diagnostic probes, and potential drug target identification. Our research focuses on the theoretical study of those computational problems. The results will form a solid foundation for designing the much needed software for these problems.

2. Background of research (maximum 2½ A-4 pages, including references):

Section A: Work done by others

Multiple sequence alignment is one of the core problems in computational biology. It is the most critical cutting-edge tool for biological sequence analysis, helping to extract and represent biologically important, yet faint or widely dispersed, commonalities from a set of sequences. These commonalities may reveal conserved motifs, conserved characters in DNA or protein, common secondary or tertiary structures, or clues about common biological functions. Commonalities might be blurred and may not be apparent when comparing two sequences, yet they may become clear when comparing a set of related sequences. From a computational point of view, multiple sequence alignment is one of the hardest problems in computational molecular biology [AL89, LAK89, Sankoffb, G93, GBOOK, W89, Wbook]. Many measures have been proposed, such as SP-score, consensus-score and tree-score.
For all three scores, the problem has been proved NP-hard [JLW94, WJ94, BV00]. Hundreds of research papers have been published on multiple sequence alignment; here we can only briefly mention a few.

Exact algorithms: The problem (for all three scores) can be solved by dynamic programming in time exponential in the input size. Extensive discussions of such algorithms can be found in (Gusfield 1997; Sankoff and Kruskal 1983). Carrillo and Lipman proposed a method to reduce the computation required by the dynamic programming algorithm for SP alignment (Carrillo and Lipman 1988). The basic idea is to compute upper bounds on the alignment cost for each pair of sequences and to eliminate those cells of the k-dimensional matrix that violate the upper bounds. Altschul and Lipman proposed a similar method for tree alignment (Altschul and Lipman 1989).

Approximation algorithms: Many approximation algorithms have been developed for the three scores. The first approximation algorithm for SP alignment was given by Gusfield (Gusfield 1993). Its performance ratio is 2-2/k, where k is the number of sequences. With great effort, the best known ratio has been improved to 2-l/k for any constant l (Bafna et al 1997). For consensus-score, Gusfield gave a ratio-2 algorithm (Gusfield 1997). Other approximation algorithms will be discussed in Section B.

Heuristic approaches: Many heuristic approaches have been developed. The popular ones include progressive alignment (Feng and Doolittle 1987; Thompson et al 1994), the iterative method for tree-score (Sankoff et al 1976; Sankoff and Kruskal 1983), the sequence graph approach (Hein 1989), and motif-based approaches. There are too many others to mention. The motif-based approach relies on first finding motifs (subsequences that are common to most of the given sequences to be aligned) and then using those motifs to decompose the sequences into smaller segments.
Other heuristics are then used for each of the smaller segments. The history of this approach goes back to 1984 (Waterman et al 1984). Other references can be found in (374, 455, 456, 457). The most popular motif-based software is MACAW (398), which is of commercial quality. The methods for finding motifs vary between groups, but the ideas are similar: a fixed-size window is used to collect substrings appearing in most of the sequences to be aligned.

Difficulty of the motif-based approach: The window size is restricted for several reasons. If insertions and deletions are not incorporated, no motif can be found when the window size is too big. On the other hand, allowing mismatches, insertions and deletions dramatically increases the running time of motif finding. Because the window size is limited, the task of decomposing sequences into smaller regions becomes difficult. For example, detected motifs may overlap, and some motifs appear in several places in a sequence. (446) developed a nontrivial decomposition method for three sequences. For more sequences, the method might be too slow. Newer approaches use Gibbs sampling to identify motifs (295).

New developments in motif identification: There have been several new developments in motif identification recently. Many models and formulations have been given. The simplest combinatorial definitions include the consensus pattern and the closest substring problems (Li et al 1999; Li et al 2002). More sophisticated definitions include Hertz and Stormo's binding sites representation and Pevzner and Sze's (l,d)-signal [PS00, HS99, S00]. More realistic approaches have been developed recently in [PS00, KP02a, KP02b]. In particular, [JCB] has developed a random projection approach. That approach uses random projection to find some conserved segments from some, say k, of the given sequences, forms a preliminary profile of the motif based on the conserved segments, and applies the EM approach to finalize the motif.
The performance of the approach in [JCB] is excellent in practice. Here we will study some other randomized algorithms and their combination with the EM approach. We will compare our approach with that of [JCB]. We will also use both the approach of [JCB] and our approach as tools for finding motifs for multiple sequence alignment. Since the new approaches allow a relatively big window size (15-20, or more), we can expect the problem of decomposing sequences into smaller segments to become easier. For example, the chance that different motifs overlap is lower, and the chance that a long motif appears in multiple places in a sequence is low.

New formulation for segment decomposition: After motifs are identified, if we treat each motif as a new character, the problem of decomposing sequences into smaller segments becomes that of finding the longest common subsequence (LCS) of the new sequences (containing new characters representing motifs). LCS is known to be NP-hard in general. With the new advanced approaches for motif identification, the decomposition of sequences into smaller segments may have new features. For example, we can assume that each motif appears once in a sequence. This assumption may make the LCS problem tractable. We will also develop algorithms for the special case of LCS in which only a few characters (motifs) appear more than once in each of the given sequences.

Comparison of RNA structures is an important problem in computational biology. The results of RNA comparison can be used to determine similarity between RNAs, to align (match) RNA structures, and to determine approximately common structures from a given set of RNA sequences. Many measures have been proposed, and they can be divided into two classes. For the first class, an RNA secondary structure can be decomposed into components of five types: stem (S), hairpin (H), bulge (B), interior loop (I), and multi-branch loop (M).
The secondary structure can be represented as an ordered tree in which each node is labeled by a letter S, H, B, I or M and the left-to-right order among siblings is significant (Jiang et al. 1995). Comparison of RNA secondary structure trees has applications in identifying conserved structural motifs in an RNA folding process (Le et al. 1989a; Le et al. 1989b) and constructing taxonomy trees (Shapiro and Zhang 1990). For the second class, an RNA secondary structure is treated as a sequence plus some base-pairs. [Zhang] proposed a method that directly compares the two structures. In this project, we will study the space-time tradeoff of various versions of RNA secondary structure comparison.

RNA secondary structure search: RNA secondary structures play an important role in regulating gene expression. Many of these RNA structures are assembled from a collection of RNA motifs. These basic patterns appear repeatedly and in various combinations to form different RNA types and define their unique structural and functional properties. Identification of RNA structural motifs will therefore enhance our understanding of RNA structures and their association with functional and regulatory elements. An important technique for extracting and identifying secondary structure motifs is to search for patterns in sequence databases. A number of algorithms and software tools have been developed for this purpose. Early attempts at structural motif searching were designed for specific families, e.g., FAStRNA [EL96] for tRNAs and CITRON [LDM94] for group I introns. Tools for general secondary structures appear in [BKV96, Mac01, PLD00]. In this project, we intend to study the exact and approximate search problems as defined in Part II, item 1.

Section B: Work done by us

Multiple sequence alignment: We studied the complexity of multiple sequence alignment [WJ94, JLW94].
We have designed polynomial-time approximation schemes (PTAS) for c-diagonal multiple sequence alignment, for both SP-score and consensus score [LMW00a]. For tree alignment, we have developed polynomial-time approximation schemes [WJL96, WG98, WJG00]. At present, our results remain the best known ratios for each case.

Motif identification: Our previous work focused mainly on theoretical aspects of the following two versions. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$ and an integer $L$, the consensus pattern problem asks for a string $s$ of length $L$ and, for each string $s_i \in S$, a length-$L$ substring $t_i$ of $s_i$ such that $\sum_{i=1}^{n} d(t_i, s)$ is minimized, whereas the closest substring problem asks for a string $s$ of length $L$ such that for each string $s_i \in S$, $d(s, t_i) \le d$ for some substring $t_i$ (of length $L$) of $s_i$, and $d$ is minimized. [LMW99] designed a PTAS for the consensus pattern problem. [LMW02] designed a PTAS for the closest substring problem. [DLLMW02] designed a PTAS that can approximate two objectives for two groups of given strings (the so-called substring selection problem in [LLMWZ99]). [DLW02] gave some theoretical results for the case where the alphabet size is unbounded.

Reasons to do the project: Owing to the new developments in motif detection, we can expect the problem of decomposing sequences into smaller segments to become easier. We can make new assumptions, e.g., that each motif occurs once in a sequence and that only very few motifs occur more than once. Therefore, we believe that we have a good chance of making good progress on motif-based multiple sequence alignment.

RNA secondary structure: We studied algorithms and complexity for the comparison of RNA secondary structures [JWZ95, MWZ99]. [JWZ95] uses the tree approach, whereas [MWZ99] studies the edit distance between two RNA structures. [WZ03] provides software for parametric alignment of two trees, which raises and partially solves the space-time tradeoff problem.
Recently, we wrote a paper on RNA secondary structure search [XW03]. In addition, we will study the parametric version of the measure proposed in [Zhang] and extend the work in [WZ03] to that measure.

References:
[AL89] S. Altschul and D. Lipman, "Trees, stars, and multiple sequence alignment", SIAM Journal on Applied Math., vol. 49, pp. 197-209, 1989.
[BLP96] V. Bafna, E. L. Lawler and P. Pevzner, "Approximation algorithms for multiple sequence alignment", Theoretical Computer Science, vol. 182, pp. 233-244, 1997.
[BV00] P. Bonizzoni and G. D. Vedova, "The complexity of multiple sequence alignment with SP-score that is metric", Theoretical Computer Science, to appear.
[GBOOK] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, 1997.
[G93] D. Gusfield, "Efficient methods for multiple sequence alignment with guaranteed error bounds", Bulletin of Mathematical Biology, vol. 55, pp. 141-154, 1993.
[JLW94] T. Jiang, E. L. Lawler and L. Wang, "Aligning sequences via an evolutionary tree: complexity and approximation", the 26th ACM Symp. on Theory of Computing, pp. 760-769, 1994.
[JUST] W. Just, "On the computational complexity of gap-0 multiple alignment", manuscript, 1998.
[LLMWZ99] K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, "Distinguishing string selection problems", Proc. 10th ACM-SIAM Symp. on Discrete Algorithms, pp. 633-642, 1999.
[LMW99] M. Li, B. Ma, and L. Wang, "Finding similar regions in many strings", the 31st ACM Symp. on Theory of Computing, pp. 473-482, 1999.
[LMW00a] M. Li, B. Ma and L. Wang, "Near optimal multiple alignment within a band in polynomial time", the 32nd ACM Symp. on Theory of Computing, to appear.
[LMW00b] M. Li, B. Ma and L. Wang, "On the closest string and substring problems", submitted for publication.
[LAK89] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, "A tool for multiple sequence alignment", Proc. Natl. Acad. Sci. U.S.A., vol.
86, pp. 4412-4415, 1989.
[P92] P. Pevzner, "Multiple alignment, communication cost, and graph matching", SIAM Journal on Applied Math., vol. 52, pp. 1763-1779, 1992.
[Sankoffb] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison Wesley, 1983.
[WJ94] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment", Journal of Computational Biology, vol. 1, pp. 337-348, 1994.
[W89] M. S. Waterman, "Sequence alignments", in Mathematical Methods for DNA Sequences, M. S. Waterman (ed.), CRC, Boca Raton, FL, pp. 53-92, 1989.
[Wbook] M. S. Waterman, Introduction to Computational Biology: Maps, Sequences, and Genomes, Chapman and Hall, 1995.
M. Waterman, R. Arratia, and D. Galas, "Pattern recognition in several sequences: consensus and alignment", Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 515-527, 1984.
M. Waterman and M. Perlwitz, "Line geometries for sequence comparisons", Bulletin of Mathematical Biology, vol. 46, pp. 567-577, 1984.
M. Waterman, "Multiple sequence alignment by consensus", Nucleic Acids Research, vol. 14, pp. 9095-9102, 1986.
G. D. Schuler, S. F. Altschul, and D. J. Lipman, "A workbench for multiple alignment construction and analysis", Proteins: Structure, Function, and Genetics, vol. 9, pp. 180-190, 1991.
[LMW02] M. Li, B. Ma, and L. Wang, "On the closest string and substring problem", Journal of the ACM, vol. 49, no. 2, pp. 157-171, March 2002.
T. Jiang and L. Wang, "Algorithmic methods for multiple sequence alignment".
[JCB] J. Buhler and M. Tompa, "Finding motifs using random projections", Journal of Computational Biology, vol. 9, pp. 225-242, 2002.
C. E. Lawrence and A. A. Reilly, "An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences", Proteins: Structure, Function, and Genetics, vol. 7, pp. 41-45, 1990.
T. L. Bailey and C. Elkan, "Unsupervised learning of multiple motifs in biopolymers using expectation maximization",
Machine Learning, vol. 21, pp. 51-80, 1995.
Books:
D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge Univ. Press, 1997.
P. A. Pevzner, Computational Molecular Biology: An Algorithmic Approach, The MIT Press, Cambridge, Mass., 2000.
M. S. Waterman, Mathematical Methods for DNA Sequences, CRC Press, 1989.
[XW03] Y. Xu and L. Wang, "Exact matching of RNA secondary structure patterns", submitted to TCS (June 30, 2003).
[WZ03] L. Wang and J. Zhao, "Parametric alignment of ordered trees", Bioinformatics, to appear.
[MWZ99] K. Zhang, L. Wang and B. Ma, "Computing similarity between RNA structures", the 10th Combinatorial Pattern Matching, LNCS 1645, pp. 281-293, July 1999.

3. Research plan and methodology (maximum 3 A-4 pages, including key references):

New developments in motif detection: There have been many new developments on this topic recently (see Background of research). The following two formulations, the consensus pattern and the closest substring problems, have been extensively studied [LMW99; LMW02]. Given a set of strings $S = \{s_1, s_2, \ldots, s_n\}$ and an integer $L$, the consensus pattern problem asks for a string $s$ of length $L$ and, for each string $s_i \in S$, a length-$L$ substring $t_i$ of $s_i$ such that $\sum_{i=1}^{n} d(t_i, s)$ is minimized, whereas the closest substring problem asks for a string $s$ of length $L$ such that for each string $s_i \in S$, $d(s, t_i) \le d$ for some substring $t_i$ (of length $L$) of $s_i$, and $d$ is minimized. Polynomial time approximation schemes (PTAS) have been developed for both problems [LMW99; LMW02]. However, the time complexity of both PTASs is too high for them to be practical.
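To make the two objectives concrete, the following minimal Python sketch evaluates a given candidate string s under each formulation by brute force. It assumes that d is the Hamming distance (the text above does not fix the distance function) and that every input string has length at least L; the function names are hypothetical and chosen only for illustration. It evaluates a candidate, it does not search for the optimal s, and it is not the PTAS of [LMW99; LMW02].

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings (assumed d)."""
    return sum(x != y for x, y in zip(a, b))


def best_window_distance(s: str, text: str) -> int:
    """Smallest d(t, s) over all substrings t of text with the same length as s."""
    L = len(s)
    return min(hamming(s, text[i:i + L]) for i in range(len(text) - L + 1))


def consensus_pattern_cost(s: str, strings: list[str]) -> int:
    """Consensus pattern objective: sum over i of the best d(t_i, s)."""
    return sum(best_window_distance(s, t) for t in strings)


def closest_substring_radius(s: str, strings: list[str]) -> int:
    """Closest substring objective: the smallest d with d(s, t_i) <= d for every i."""
    return max(best_window_distance(s, t) for t in strings)


strings = ["AACGT", "ACCTT", "TAGGT"]
print(consensus_pattern_cost("ACG", strings))    # 2
print(closest_substring_radius("ACG", strings))  # 1
```

The two objectives differ only in how the per-string distances are combined (sum versus maximum), which is why the same candidate can score differently under each.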
Here we borrow the ideas in [LMW02] for the closest substring problem to attack the consensus pattern problem and obtain a practical randomized algorithm [W03]. We can show that with high probability, $1 - 1/\mathrm{poly}(n)$, the algorithm gives a good approximation, i.e., the cost of the solution is at most $(1+\epsilon)$ times the optimum, where $n$ is the input size of the problem. Beyond the theoretical results, preliminary experiments have shown that the algorithm gives very accurate results for inputs of practical size in reasonable time. (For example, when the input contains 20 sequences of 600 characters each and the planted patterns are 15 characters long with at most 4 mismatches, we can always find the planted patterns in about 1-2 minutes.) This is one of the few approximation algorithms in computational biology that are sound from both the theoretical and the practical points of view [W03].

Another method is the random projection approach developed in [JCB]. The algorithm can be roughly described as follows:
(1) randomly choose $k$ (a parameter) positions among the $L$ positions;
(2) form $4^k$ buckets;
(3) put every segment of length $L$ of $s_1, s_2, \ldots, s_n$ into the corresponding bucket;
(4) choose those buckets that have more than $m$ (a parameter) items;
(5) for each chosen bucket, construct a profile of the pattern;
(6) use the EM approach to iteratively improve the result.

At present, experiments show that our randomized algorithm is slightly slower than the random projection algorithm in [JCB]. Note that the algorithm in [JCB] uses EM for iterative improvement, whereas our algorithm does not. We are going to incorporate the EM approach into our algorithm and hope to improve its speed. It is interesting to note that our algorithm has both a theoretical proof and practical running speed, whereas the random projection approach has not yet received any theoretical analysis. We will also try to analyze the random projection approach.
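Steps (1) to (5) of the random projection outline above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the implementation of [JCB]: the function names are hypothetical, the alphabet is assumed to be {A, C, G, T}, buckets are created lazily in a dictionary instead of being preallocated, and the EM refinement of step (6) is omitted.

```python
import random
from collections import defaultdict


def random_projection_buckets(seqs, L, k, rng):
    """Steps (1)-(3): hash every length-L segment by k randomly chosen positions."""
    positions = sorted(rng.sample(range(L), k))        # step (1)
    buckets = defaultdict(list)                        # step (2): at most 4**k keys for DNA
    for seq_index, s in enumerate(seqs):
        for start in range(len(s) - L + 1):            # step (3)
            segment = s[start:start + L]
            key = "".join(segment[p] for p in positions)
            buckets[key].append((seq_index, start))
    return positions, buckets


def heavy_buckets(buckets, m):
    """Step (4): keep only the buckets that hold more than m segments."""
    return {key: items for key, items in buckets.items() if len(items) > m}


def profile(seqs, items, L, alphabet="ACGT"):
    """Step (5): column-wise letter counts over the segments of one bucket."""
    counts = [dict.fromkeys(alphabet, 0) for _ in range(L)]
    for seq_index, start in items:
        for col, ch in enumerate(seqs[seq_index][start:start + L]):
            counts[col][ch] += 1
    return counts


# Toy input: the motif ACGTACGTAC is planted without mismatches in each sequence.
seqs = ["TTGACGTACGTACTTA", "ACGTACGTACGGGCCA", "CCCCACGTACGTACTT"]
positions, buckets = random_projection_buckets(seqs, L=10, k=4, rng=random.Random(0))
for key, items in heavy_buckets(buckets, m=2).items():
    print(key, items, profile(seqs, items, L=10)[0])
```

Because identical segments always project to the same key, exact copies of a planted motif land in one bucket regardless of which positions are drawn; copies with mismatches share a bucket only when none of the k chosen positions is a mismatch, which is why the procedure is repeated with fresh random positions in practice.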
Motif-based multiple sequence alignment: With the new developments in motif identification, the formulation of segment decomposition changes. In general, if we treat each motif as a new character, the problem of aligning sequences over a small alphabet $\Sigma$, e.g., $\Sigma = \{A, C, G, T\}$ for DNA sequences, becomes that of aligning sequences over a larger alphabet representing all motifs. How can we benefit from this? If each motif appears only a few times, we can expect to have efficient algorithms to solve the problem. We can formulate the problem of decomposing sequences into smaller segments as the longest common subsequence (LCS) problem for the new sequences, where each sequence contains new characters representing motifs. The LCS problem is known to be NP-hard even if each character appears at most 2 times and the length of the sequence is at most 3??? [mj94]. However, if each character can appear only once in each sequence, we observe that the problem can be solved in polynomial time (very quickly). The assumption that each character appears only once in each sequence may not be too strict, since the new approaches can detect motifs of length about 15. The case where a few (a constant number of) motifs can appear multiple times is also easy to handle. For example, if $k$ motifs can appear twice in a sequence, we can solve the problem in $O(k^2\,\mathrm{poly}(n))$ time, where $\mathrm{poly}(n)$ is the time for the case where each motif appears once in a sequence. With the above observations, we strongly believe that we have a good chance to develop a new method that is quite different from the existing ones.

Comparison of RNA structures: Many measures have been proposed for comparing RNA secondary structures; alignment of trees [] is one of them. The measures can be divided into two classes. For the first class, an RNA secondary structure can be decomposed into components of five types: stem (S), hairpin (H), bulge (B), interior loop (I), and multi-branch loop (M).
The secondary structure can be represented as an ordered tree in which each node is labeled by a letter S, H, B, I or M and the left-to-right order among siblings is significant (Jiang et al., 1995). Comparison of RNA secondary structure trees has applications in identifying conserved structural motifs in an RNA folding process (Le et al., 1989a; Le et al., 1989b) and in constructing taxonomy trees (Shapiro and Zhang, 1990). For the second class, an RNA secondary structure is treated as a sequence plus some base pairs. [Zhang] proposed a method that directly compares the two structures. In this project, we
Reference
[W03] Lusheng Wang, Randomized algorithms for subtle motif identification, manuscript.
4(a). Has similar submission(s) been made to seek funding? Yes No
If yes, please state the funding agency and the funding programme:
Reference No. [for RGC-funded projects only]:
Title of Project [if different from Item 1 of Part I above]:
Date (month/year) of application:
Outcome:
4(b). If this application is the same as or similar to one(s) submitted previously, what were the main concerns/suggestions of the reviewers then?
4(c). Please give a brief response to the points mentioned at 4(b) above, highlighting the major changes that have been incorporated in this application.
5. Is there similar or related research being carried out at your institution(s)? Yes No
If yes, please give brief details [names of investigators, departmental and institutional affiliations, project title(s) and nature of the project(s)]
6. Plan(s) for collaboration in this application: [Indicate the role and the specific task(s) the PI and each Co-I, if any, is responsible for.]
GRANT RECORD OF INVESTIGATORS
7.
Details of on-going and completed research projects funded from all (RGC and non-RGC) sources undertaken by the PI (in a PI or Co-I capacity) in the past five years. [Please attach a copy of the original abstract of each listed project]
Seq. No.    Project Title    PI/Co-I    Funding Source(s) and Amount (HK$)    Start Date    (Expected) Completion Date
8. Details of on-going and completed research projects funded from all (RGC and non-RGC) sources undertaken by each Co-I (in a PI capacity) in the past three years. [Please attach a copy of the original abstract of each listed project]
Seq. No.    Name of Co-I(s)    Project Title    Funding Source(s) and Amount (HK$)    Start Date    (Expected) Completion Date
9. Research output of previously funded projects (RGC and non-RGC sources) undertaken by the PI and each Co-I relevant to this application. [Attach one A-4 page summary on the progress/publications/conferences/student-training, etc. of the projects, with the relevant project reference no.]
10. Curriculum vitae (CV) of applicant(s). [For the PI and each Co-I, attach one A-4 page CV with personal particulars, academic qualifications, positions held and publication records. Please present publications in two sections: most representative publications (ten at maximum), and research-related prizes and awards.]
PROJECT FUNDING
11. Expected duration of this project (in months):
Proposed start date:
Estimated completion date:
12. Estimated cost and resource implications:
Year 1 (HK$)    Year 2 (HK$)    Year 3 (HK$)    Year 4# (HK$)    Year 5# (HK$)    Total (HK$)
[# applicable to “longer-term research grant” only]
(a) Staff    Rank    No.    Salary per month
(b) Relief Teacher [see Explanatory Notes]    Rank    Months    Salary per month
(c) Equipment [please itemize and provide quotations for each item costing over HK$200,000]
(d) General expenses [please itemize]
(e) Conference expenses [see Explanatory Notes]
Total
13(a). Justifications for each category/item of the budget in Item 12 above: [Detailed justifications should be given in order to support the request]
13(b). Existing facilities and major equipment already available for this research project:
14. Other research funds already secured:
Source    Amount (HK$)
15. Allocation from Earmarked Research Grant requested: [The amount shown here should be the same as shown in Item 3 of Part I above]
16. Other research funds to be or are being sought [If funds under this item are secured, the amount of the Earmarked Research Grant to be awarded may be reduced]:
Source    Amount (HK$)
ANCILLARY INFORMATION
17. Research ethics/safety approval: [The primary responsibility of seeking the relevant approval rests with the PI. The PI’s institution is required to complete and sign Part III of this application form to certify whether the relevant approval has been given]
(a) Please tick ‘✓’ the appropriate boxes to confirm if approval for the respective ethics and/or safety issues is required and has been obtained from the PI’s institution.
    Approval not required    Approval obtained    Approval being sought
(i) Human research ethics
(ii) Animal research ethics
(iii) Biological safety
(iv) Ionizing radiation safety
(v) Non-ionizing radiation safety
(vi) Chemical safety
(b) Approval required, if any, by other authorities and the prospects of such approval. Please put down “N.A.” if not applicable.
18(a). List of proposed reviewers:
Points to note before completion:
This is NOT a compulsory section. This list serves as a reference for the RGC Panel. The named reviewer(s) may or may not be chosen to review the application.
Applicant(s) can nominate none or a maximum of five reviewers. They should preferably be experts with whom the applicant(s) has no relationship. If, however, the applicant(s), i.e., the PI as well as the Co-I(s), decide to nominate reviewers with a past or present relationship, a declaration on the association must be made. It is the responsibility of the PI and the Co-I(s) to ensure that all relationships are fully declared. Failure to disclose fully or accurately the relationship may result in disqualification of the application.
Please DO NOT put down here the name(s) of any reviewer(s) whom the applicant(s) may wish to exclude from being invited for assessment.
(i) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(ii) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(iii) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(iv) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(v) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
18(b). Declaration of any past and present relationship between the investigator(s), i.e., PI and Co-Is, and the nominated reviewers [minimum one tick (✓) per reviewer]:
Nature of relationship (please elaborate in 18(c))    Reviewer (i)    (ii)    (iii)    (iv)    (v)
Advisor or Advisee
Colleague in the same organization (when and where)
Research Collaborator
Co-authors of papers and patents
Significant financial interest
Others (please specify)
None
18(c). Elaboration on the nature of the relationship, if any:
19. DATA ARCHIVE POSSIBILITIES
Is the proposed project likely to generate data set(s) of retention value? Yes No
If yes, please describe the nature, quantity and potential use of the data set(s) in future.
Are you willing to make the data set(s) available to others for reference twelve months after the publication of research results or the completion of this proposed project? Yes No
I/We understand that the RGC only considers data archiving requests after the completion of the RGC-funded project, and the Council has full discretion in funding the archiving requests. Data sets archived with RGC funds will require users to acknowledge the originator and the RGC. The originator will also be provided with copies of all publications derived from the use of the data.
20. I/We certify that I/we have completed this application form in accordance with the Explanatory Notes ERG2. The information given is complete and accurate to the best of my/our knowledge.
Name of Principal Investigator: _______________    Signature: _________________    Date: ___________
Name of Co-investigator: ______________________    Signature: __________________    Date: ___________
Name of Co-investigator: ___________________    Signature: _________________    Date: ___________
(Add more names if necessary)