CERG Applications 2004-2005 CITY UNIVERSITY OF HONG KONG

CITY UNIVERSITY OF HONG KONG
CERG Applications 2004-2005
Checklist of documents to be submitted with Application
Please specify the subject area under which you wish your application to be considered (please choose only one):
E1  Civil Engineering, Surveying, Building and Construction
E2  Computing Science & Information Technology
E3  Electrical & Electronic Engineering
E4  Mechanical, Production & Industrial Engineering
P1  Chemical Engineering
P2  Physical Sciences
P3  Mathematics
    (Applications intended for the Physical Sciences panel MUST be submitted through the RGC electronic system)
M1  Biological Sciences
M2  Medicine, Dentistry & Health
H1  Administrative, Business & Social Studies
H2  Arts & Languages
H3  Education
H4  Law, Architecture, Town Planning & Other Professional & Vocational Subjects
Please check the appropriate boxes on the right by a (X) indicating that the necessary information and/or supporting
documents have been included with your application. Principal Investigators are responsible for ensuring that the
application is complete.
Included (X)
1. Application signed by all Investigators
2. The completed Research Grant Data Sheet (CERG 04/05)
3. Quotations for equipment purchase costing over $200,000
4. A one-page brief C.V. where required for Investigators
5. (a) Proof of human research ethics approval
   (b) Proof of animal research ethics approval
   (c) Proof/certificate of biological safety
   (d) Proof/certificate of ionizing radiation safety
   (e) Proof/certificate of non-ionizing radiation safety
6. Primary field area (secondary field area is optional) and corresponding codes have been indicated
Name of PI
Signature
*******************************************************************************
(For RO use only)
Initial Checking
CS
HT
KC
HT
KC
Update list
Update Proposal
Required/Not required
Final Checking
CS
ERG1_03.doc
ERG 1 (Revised 5/03)
CITY UNIVERSITY OF HONG KONG
RESEARCH GRANT DATA SHEET
This form must accompany all proposals for research for input onto the RGC database. It is designed to simplify the procedure for obtaining approval
signatures, to ensure compliance with University safety/ethics policies, and to improve record-keeping.
SECTION A - PROJECT DETAILS
Principal Investigator:
Title: Associate Professor
Department: Dept. of Computer Science
Surname: Wang
First Name: Lusheng
Tel: 2788 9820
Email: lwang@cs.cityu.edu.hk
Co-Investigator(s):
Title
Name
Dept/Institution/Country
Project period (YY/MM): July 1, 2004 to June 30, 2006
Duration (in months): 24
Please select one of the following:
New submission
Re-submission (previous application ref: ___________________________)
Application for individual Research
Application for Longer-term grants
Project title: Random Projection, Motif Based Sequence Alignment and RNA Secondary Structures
Budget summary:
Research support staff salaries: HK$
Equipment: HK$
General expenses/Travel: HK$
Conference: HK$
Less: Others (including funds already secured): HK$
Net Total: HK$
Please complete each of the following [Note to PI: are you aware of any safety hazards introduced by this research proposal that are not covered or
addressed in the existing safety codes or guidelines. If yes, please bring them to the attention of the Head of Department]:
Approval not required
Approval obtained & attached
Approval Pending
Human research ethics
Animal research ethics
Biological safety
Chemical safety
Radiation safety
SECTION B - UNDERTAKING BY THE PRINCIPAL INVESTIGATOR
This application is submitted in compliance with the RGC's terms and conditions and University policies and procedures, whereby research, if funded,
will be undertaken accordingly by all staff and students engaged on the project. The information given is to the best of my knowledge complete and
accurate.
Signature:
Date:
SECTION C - APPROVAL SIGNATURES
I confirm:
[Please mark 'X' in the appropriate box(es)]
a) that I have read this application, understand its resource implications for the
Department, and will make available the necessary infrastructural support and space for
the project if it is funded; Yes / No / N/A
b) that I am satisfied that all health and safety risks associated with the project have been
considered, and that the Department can provide adequate measures to control these
risks [Heads should contact the Safety Officer for further advice if they are in any doubt
about the safety aspects of a proposal]; Yes / No / N/A
c) that the proposal is worthy of the support of the University. Yes / No / N/A
Department Head:
Signature
Date:
RGC Ref No.
(to be completed by the institution)
RESEARCH GRANTS COUNCIL
Application for Allocation from
the Earmarked Research Grant for 2004-2005
Application Form (ERG1)
[Please read the Explanatory Notes ERG2 carefully before completing this form]
PART I
SUMMARY OF THE RESEARCH PROPOSAL
[To be completed by the applicant(s)]
1. Title of Project: Random Projection, Motif Based Sequence Alignment and RNA Secondary Structures
Primary Field: Algorithms
Secondary Field: Computational Biology
2. Name(s) and Academic Affiliation(s) of Applicant(s):
Principal Investigator [PI] (with title): Dr. Lusheng Wang
Post: Asso. Prof.
Unit/Department/Institution: CS/City Univ.
Co-Investigator(s) [Co-I(s)] (with title):
3. Allocation Requested from the Earmarked Research Grant:
Total cost of the project:
(a) Staff: HK$
(b) Relief Teacher (required exceptionally): HK$
(c) Equipment: HK$
(d) General Expenses: HK$
(e) Conference (standard rate: $12,000 per year): HK$
Less: Other research funds secured from other sources: HK$
Net amount requested *: HK$
* The amount may be reduced further if additional funds from other sources have been secured
after submission of this application.
4. Nature of application *
New [i.e. PI and/or Co-I(s) applying for RGC funds on this research topic for the first time].
Please give further details in Part II item 2.
Re-submission [i.e. PI and/or Co-I(s) have previously applied for RGC funds on this research
topic but application not supported]. Please give further details in Part II item 4.
On-going [i.e. PI and/or Co-I(s) extending work previously funded by the RGC].
Please give further details in Part II items 2, and 7-9.
Application for individual research.
Application for longer-term research grant.
* Please tick as appropriate
5. Abstract of research (limited to ½ A-4 page or 200 words, and comprehensible to a
non-specialist):
In this project, we intend to design and analyze efficient and effective algorithms and
heuristics for some computational problems arising in computational molecular biology.
The problems we intend to attack include random projection algorithms for motif detection,
motif-based multiple sequence alignment and some computational problems for RNA
secondary structures.
The motif detection problem is computationally challenging. It has many
applications in computational biology, e.g., locating binding sites and finding conserved
regions in unaligned sequences. We will study the random projection approach from a
theoretical point of view.
Multiple sequence alignment is one of the core problems in computational biology. Algorithms
for multiple sequence alignment are routinely used to find conserved regions in biomolecular
sequences, to construct family and superfamily representations of sequences, and to reveal
evolutionary histories of species (or genes). We intend to adopt the motif-based approach,
proposed long ago, to attack the problem. Great progress has been made
recently in identifying conserved motifs, and these new motif detection techniques should
greatly improve the prospects of motif-based approaches to multiple sequence alignment.
RNA secondary structures. We will study (1) algorithmic issues of the space-time tradeoff
for the comparison of two RNA secondary structures and (2) RNA secondary structure
search in RNA sequences.
The project will emphasize algorithmic issues as well as computational complexity for
all the proposed problems.
PART II
DETAILS OF THE RESEARCH PROPOSAL
[To be completed by the applicant(s)]
RESEARCH DETAILS
1. The project objectives and long-term impact (maximum 1 A-4 page):
Purpose: We will design algorithms that are urgently needed in practice for the following important
problems.
Motif-based multiple sequence alignment: Let S = {s1, s2, …, sn} be a set of n sequences. An alignment A
of the n sequences is obtained by inserting spaces into the sequences such that all resulting sequences
have the same length, and then overlaying them. For motif-based multiple sequence alignment, we first
identify conserved motifs, then use those motifs to decompose the sequences into smaller segments,
and finally solve the problem by aligning those smaller segments.
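As an illustration of the definition of an alignment, the following minimal Python sketch checks that a set of rows is a valid alignment of the given sequences and computes its SP-score, one of the measures discussed in the Background section. The function names, the use of "-" for an inserted space, and the unit mismatch and gap costs are assumptions made for this example only.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for an inserted space


def is_alignment(rows, sequences):
    """True if all rows have equal length and each row is its sequence plus spaces."""
    same_length = len({len(r) for r in rows}) == 1
    restores = all(r.replace(GAP, "") == s for r, s in zip(rows, sequences))
    return len(rows) == len(sequences) and same_length and restores


def sp_score(rows, mismatch=1, gap=1):
    """Sum-of-pairs cost: add up the induced pairwise cost over all pairs of rows."""
    total = 0
    for x, y in combinations(rows, 2):
        for p, q in zip(x, y):
            if p == GAP and q == GAP:
                continue  # a column of two spaces costs nothing for this pair
            if p == GAP or q == GAP:
                total += gap
            elif p != q:
                total += mismatch
    return total


rows = ["AC-GT", "ACAGT", "A--GT"]
print(is_alignment(rows, ["ACGT", "ACAGT", "AGT"]), sp_score(rows))
```

Other cost schemes (for example, affine gap penalties) would change only the body of `sp_score`.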
How to decompose the sequences into smaller segments once motifs are identified is one of
the important issues. We will study this issue and hope to obtain results of both theoretical
and practical interest. Another issue is how to identify the conserved motifs. There have been
new developments in motif detection recently. We will study randomized algorithms, e.g., the random
projection approach, for motif detection.
Motif detection problem: Given an integer L and a set of n sequences s1, s2, …, sn, each of length at most
m, find the conserved regions of length L, where conservation is measured by the
consensus score or the bottleneck score (discussed later).
RNA secondary structure
(1) Space-time tradeoff for RNA secondary structure comparison:
Many measures have been proposed for RNA secondary structure comparison, e.g., tree edit distance,
constrained edit distance and alignment of trees. For all those measures, the algorithms need
super-quadratic time and space. In many cases, the space limit is more serious than the time limit. For pairwise
sequence comparison, a linear-space algorithm that runs in O(n^2) time (as good as the O(n^2)-space
algorithm) was designed, and it is regarded as a classic technique in this area. We will
study the algorithmic issues of the space-time tradeoff for all those measures.
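The classic linear-space technique for pairwise sequence comparison can be sketched as follows. This is a minimal illustration of the divide-and-conquer idea (commonly attributed to Hirschberg), assuming unit costs for mismatches and spaces; the function names are hypothetical and the sketch does not cover the tree-based measures studied in this project.

```python
def last_row(a, b):
    """Edit distance between a and every prefix of b, keeping only one DP row."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                              # space in b
                         cur[j - 1] + 1,                           # space in a
                         prev[j - 1] + (a[i - 1] != b[j - 1]))     # (mis)match
        prev = cur
    return prev


def hirschberg(a, b):
    """Return an optimal alignment (two rows) using space linear in the input."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        j = max(b.find(a), 0)  # align the single letter with a match if one exists
        return "-" * j + a + "-" * (len(b) - j - 1), b
    mid = len(a) // 2
    left = last_row(a[:mid], b)                  # costs of a[:mid] vs prefixes of b
    right = last_row(a[mid:][::-1], b[::-1])     # costs of a[mid:] vs suffixes of b
    split = min(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    a1, b1 = hirschberg(a[:mid], b[:split])
    a2, b2 = hirschberg(a[mid:], b[split:])
    return a1 + a2, b1 + b2


def alignment_cost(x, y):
    """Unit cost per column whose two symbols differ (including spaces)."""
    return sum(p != q for p, q in zip(x, y))


x, y = hirschberg("kitten", "sitting")
print(x, y, alignment_cost(x, y))
```

The running time remains quadratic because each level of the recursion halves the total work, while only two rows of the dynamic programming table are held at any time.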
(2) RNA secondary structure search:
(i) Exact search: given an RNA secondary structure R (the pattern) and an RNA sequence S (a
sequence over {A,C,G,U}, treated as the text), a base-pair (i, j) in R matches the
two letters S[k] and S[k+j-i] in S if S[k] and S[k+j-i] form either a C-G pair or an A-U
pair. The secondary structure R matches the segment S[k]S[k+1]…S[k+n] if for every
base-pair (i,j) in R, S[k+i] and S[k+j] form either a C-G pair or an A-U pair. The problem
is to find all the segments in S that match R.
(ii) Approximate search: given an RNA secondary structure R (the pattern), the number of
mismatches between R and the segment S[k]S[k+1]…S[k+n] is the number of pairs (i,j)
such that (i,j) is a base-pair in R and S[k+i] and S[k+j] form neither a C-G pair nor an A-U
pair. In addition to R and S, we are given an integer m; the problem is to find all
segments in S that have at most m mismatches with respect to R.
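The two search problems can be made concrete with a naive baseline, sketched below in Python. It is only a direct transcription of the definitions, not one of the algorithms proposed here. The names `WATSON_CRICK`, `count_mismatches` and `search`, the 0-based indexing, the `span` parameter (the number of letters covered by the pattern, n+1 in the notation above) and the assumption that a pair counts in either orientation are all choices made for this example.

```python
# Assumption: a C-G or A-U pair is accepted in either orientation.
WATSON_CRICK = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


def count_mismatches(S, k, pairs):
    """Number of base-pairs (i, j) of the pattern that S[k+i], S[k+j] fail to form."""
    return sum((S[k + i], S[k + j]) not in WATSON_CRICK for i, j in pairs)


def search(S, pairs, span, max_mismatches=0):
    """Start positions k whose segment of `span` letters has at most max_mismatches.

    max_mismatches=0 is the exact search; a positive value is the approximate search.
    """
    return [k for k in range(len(S) - span + 1)
            if count_mismatches(S, k, pairs) <= max_mismatches]


# Hypothetical hairpin pattern: two stacked base-pairs enclosing a loop, 7 letters wide.
pattern = [(0, 6), (1, 5)]
text = "GCAAAGCUU"
print(search(text, pattern, span=7))                     # exact search
print(search(text, pattern, span=7, max_mismatches=2))   # approximate search
```

This baseline takes time proportional to the text length times the number of base-pairs in the pattern, which gives a reference point for any faster method.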
Long Term Significance: The problems proposed in this project are from computational molecular biology
and bioinformatics. They have applications in SNP haplotype map construction, comparison of genomes,
PCR primer design, binding site finding, creating diagnostic probes, and potential drug target identification.
Our research focuses on the theoretical study of those computational problems. The results produced will form a
solid foundation for designing the much-needed software for these problems.
2. Background of research (maximum 2½ A-4 pages, including references):
Section A: Work done by others
Multiple sequence alignment is one of the core problems in computational biology. It is the most critical
cutting-edge tool for biological sequence analysis, helping to extract and represent biologically important,
yet faint or widely dispersed, commonalities from a set of sequences. These commonalities may reveal
conserved motifs, conserved characters in DNA or protein, common secondary or tertiary structures, or clues
about the common biological functions. Commonalities may be blurred and not apparent when
comparing two sequences, yet become clear when comparing a set of related sequences.
From a computational point of view, multiple sequence alignment is one of the hardest problems in
computational molecular biology [AL89,LAK89,Sankoffb,G93,GBOOK,W89,Wbook]. Many measures
have been proposed, such as SP-score, consensus-score and tree-score. For all three scores, the problem
has been proved to be NP-hard [JLW94,WJ94,BV00]. Hundreds of research papers have been published on
multiple sequence alignment. Here we can only briefly mention a few.
Exact Algorithms: The problem (for all three scores) can be solved by a dynamic programming
approach that runs in time exponential in the input size. Extensive discussions of such algorithms
can be found in [Gusfield, 1997; Sankoff and Kruskal, 1983]. Carrillo and Lipman proposed a method to cut
down the computational volume of the dynamic programming algorithm for SP alignment (Carrillo and
Lipman, 1988). The basic idea is to compute upper bounds on alignment costs for each pair of sequences in
the computation of the k-dimensional matrix and eliminate those cells of the matrix that violate the upper
bounds. Altschul and Lipman proposed a similar method for tree alignment (Altschul and Lipman,
1989).
Approximation Algorithms: Many approximation algorithms have been developed for the three different
scores. The first approximation algorithm for SP alignment was given by D. Gusfield (Gusfield, 1993). The
performance ratio is 2-2/k, where k is the number of sequences. With great effort, the best known ratio has been
improved to 2-l/k for constant l (Bafna et al 1997). For consensus-score, Gusfield gave a ratio-2 algorithm
(Gusfield 1997). Other approximation algorithms will be discussed in Section B.
Heuristic approaches: Many heuristic approaches have been developed. The popular ones include
progressive alignment (Feng and Doolittle 1987; Thompson et al 1994), the iterative method for tree-score
(Sankoff et al 1976, Sankoff and Kruskal 1983), the sequence graph approach (Hein 1989), motif-based
approaches, etc. There are too many others to mention.
The motif-based approach first finds motifs (subsequences that are common to most of the given
sequences to be aligned) and then uses those motifs to decompose the sequences into smaller segments. Other
heuristics are then applied to each of the smaller segments. The history of this approach goes back to 1984
(Waterman et al 1984). Other references can be found in (374, 455, 456, 457). The most popular motif-based
software is MACAW (398), which is of commercial quality. The methods for finding motifs vary among
different groups, but the ideas are similar: a fixed-size window is used to collect substrings appearing
in most of the sequences to be aligned.
Difficulty of motif-based approach: The window size is restricted for several reasons. If insertions and
deletions are not incorporated, no motif can be found if the window size is too big. On the other hand,
allowing mismatches, insertions and deletions dramatically increases the running time for finding motifs. Because
of the limit on window size, the task of decomposing sequences into smaller regions becomes difficult. For
example, detected motifs may overlap, and some motifs may appear in several places in a
sequence. (446) developed a nontrivial decomposition method for three sequences. For more sequences,
the method might be too slow. Newer approaches use Gibbs sampling to identify motifs (295).
New development of motif identification: Recently, there have been some new developments in motif
identification. Many models and formulations have been given. The simplest combinatorial definitions
include the consensus pattern and the closest substring problems (Li et al 1999; Li et al 2002). More
sophisticated definitions include Hertz and Stormo's binding site representation and Pevzner and Sze's
(l,d)-signal [PS00,HS99,S00]. More realistic approaches have been developed recently in [PS00, KP02a,
KP02b]. In particular, (JCB) has developed a random projection approach. That approach uses random
projection to find some conserved segments from some, say, k, of the given sequences, forms a preliminary
profile of the motif based on the conserved segments, and applies the EM approach to finalize the motif. The
performance of (JCB)'s approach is excellent in practice. Here we will study some other randomized
algorithms and their combination with the EM approach. We will compare our approach with (JCB)'s approach.
We will also use both (JCB)'s approach and our approach as tools for finding motifs for multiple sequence
alignment. Since the new approaches allow relatively big window sizes (15-20, or more), we can expect the
problem of decomposing sequences into smaller segments to become easier. For example, the chance of
overlaps between different motifs is lower, and the chance that a long motif appears in multiple places in
a sequence is low.
New formulation for segment decomposition: After motifs are identified, if we treat each motif as a
new character, the problem of decomposing sequences into smaller segments becomes that of finding the longest
common subsequence (LCS) of the new sequences (containing new characters representing motifs). LCS is
known to be NP-hard in general. With the new advanced approaches to motif identification, the
decomposition of sequences into smaller segments may have new features. For example, we can assume that
each motif appears once in a sequence. This assumption may make the LCS problem tractable. We will also
develop algorithms for the special case of LCS where only a few characters (motifs) appear more than
once in each of the given sequences.
Comparison of RNA structures is an important problem in computational biology. The results of the RNA
comparison can be used to determine similarity between RNAs, to align (match) RNA structures, and to
determine approximately common structures from a given set of RNA sequences. Many measures have
been proposed and they can be divided into two classes. For the first class, an RNA secondary structure can
be decomposed into components of five types: stem (S), hairpin (H), bulge (B), interior loop (I), and
multi-branch loop (M). The secondary structure can be represented as an ordered tree in which each node is
labeled by a letter S, H, B, I or M and the left-to-right order among siblings is significant (Jiang et al. 1995).
Comparison of RNA secondary structure trees has applications in identifying conserved structural motifs in
an RNA folding process (Le et al., 1989a; Le et al., 1989b) and constructing taxonomy trees (Shapiro and
Zhang, 1990). For the second class, an RNA secondary structure is treated as a sequence plus some
base-pairs. [Zhang] proposed a method that directly compares the two structures. In this project, we will
study the space-time tradeoff of various versions for RNA secondary structure comparison.
RNA secondary structure search: RNA secondary structures play an important role in regulating gene
expression. Many of these RNA structures are assembled from a collection of RNA motifs. These basic
patterns appear repeatedly and in various combinations to form different RNA types and define their unique
structural and functional properties. Identification of RNA structural motifs will therefore enhance our
understanding of RNA structures and their association with functional and regulatory elements. An important
technique for extracting and identifying secondary
motifs is to search for patterns in sequence databases. A number of algorithms and software tools have been
developed for this purpose. Early attempts at structural motif searching were designed for
specific families, e.g., FAStRNA [EL96] for tRNAs and CITRON [LDM94] for group I introns.
Tools for general secondary structures appear in [BKV96, Mac01, PLD00]. In this project, we
intend to study the exact and approximate search problems defined in item 1.
Section B: Work done by us
Multiple sequence alignment: We studied the complexity of multiple sequence alignment [WJ94, JLW94].
We have designed polynomial-time approximation schemes (PTAS) for c-diagonal multiple sequence
alignment, for both SP-score and consensus score [LMW00a]. For tree alignment, we have developed
polynomial-time approximation schemes [WJL96, WG98, WJG00]. At present, our results remain the
best known ratios for each case.
Motif identification: Our previous work focused mainly on theoretical aspects of the following two
versions. Given a set of strings S = {s1, s2, …, sn} and an integer L, the consensus pattern problem asks for a string s
of length L and, for each string si ∈ S, a substring ti of si of length L such that the sum of d(ti, s) over
i = 1, 2, …, n is minimized, whereas the closest substring problem asks for a string s of length L such that for
each string si ∈ S, d(s, ti) ≤ d for some substring ti (of length L) of si, and d is minimized. Here d(·,·) denotes
the Hamming distance. [LMW99] designed a PTAS for the
consensus pattern problem. [LMW02] designed a PTAS for the closest substring selection problem.
[DLLMW02] designed a PTAS that can approximate two objectives for two groups of given strings (the
so-called substring selection problem in [LLMWZ99]). [DLW02] gave some theoretical results for the case
where the alphabet size is unbounded.
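To make the two objectives concrete, the following Python sketch evaluates a candidate string s under each of them and finds the optimum by exhaustive enumeration. It assumes d is the Hamming distance, as in [LMW99], and a DNA alphabet; the function names are hypothetical. The enumeration is exponential in L and is meant only as a reference for tiny inputs, not as a substitute for the PTASs cited above.

```python
from itertools import product


def hamming(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def nearest_window(center, s):
    """Distance from center to the closest length-L substring t_i of s."""
    L = len(center)
    return min(hamming(center, s[i:i + L]) for i in range(len(s) - L + 1))


def consensus_cost(center, strings):
    """Consensus pattern objective: sum of d(t_i, s) over all strings."""
    return sum(nearest_window(center, s) for s in strings)


def closest_radius(center, strings):
    """Closest substring objective: the smallest d with d(s, t_i) <= d for all i."""
    return max(nearest_window(center, s) for s in strings)


def brute_force(strings, L, objective, alphabet="ACGT"):
    """Try every string of length L; returns (best value, best center)."""
    centers = ("".join(c) for c in product(alphabet, repeat=L))
    return min((objective(c, strings), c) for c in centers)


strings = ["ACGTAC", "TTACGA", "GACGTT"]
print(brute_force(strings, 3, consensus_cost))
print(brute_force(strings, 3, closest_radius))
```

The two objectives differ only in replacing the sum by a maximum, which is why the consensus pattern is sometimes described as the average-case counterpart of the closest substring.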
Reasons to do the project: Owing to the new developments in motif detection, we can expect the problem
of decomposing sequences into smaller segments to become easier. We can make new assumptions, e.g., that the
number of occurrences of each motif is one, except for very few motifs whose number of occurrences is greater than
one. Therefore, we believe that we have a good chance of making good progress on motif-based multiple
sequence alignment.
RNA secondary structure: We studied algorithms and complexity for comparison of RNA secondary
structures [JWZ95, MWZ99]. [JWZ95] uses a tree approach, whereas [MWZ99] studies the edit distance
between two RNA structures. [WZ03] provides software for parametric alignment of two trees, which raises
and partially solves the space-time tradeoff problem. Recently, we wrote a paper on RNA secondary
structure search [XW03]. In addition, we will study the parametric version of the measure proposed in [Zhang]
and extend the work in [WZ03] to that measure.
References:
[AL89] S. Altschul and D. Lipman, ``Trees, stars, and multiple sequence alignment'', SIAM Journal on Applied
Math., vol. 49, pp. 197-209, 1989.
[BLP96] V. Bafna, E. Lawler and P. Pevzner, ``Approximation algorithms for multiple sequence alignment'',
Theoretical Computer Science, vol. 182, pp. 233-244, 1997.
[BV00] P. Bonizzoni and G. D. Vedova, ``The complexity of multiple sequence alignment with SP-score that is a metric'',
Theoretical Computer Science, to appear.
[GBOOK] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology,
Cambridge University Press, 1997.
[G93] D. Gusfield, Efficient methods for multiple sequence alignment with guaranteed error bounds,
Bulletin of Mathematical Biology, vol. 55, pp. 141-154, 1993.
[JLW94]
T. Jiang, E. L. Lawler and L. Wang, Aligning sequences via an evolutionary tree: complexity and
approximation, the 26th ACM Symp. on Theory of Computing, pp. 760-769, 1994.
[JUST] W. Just, On the computational complexity of gap-0 multiple alignment, manuscript, 1998.
[LLMWZ99] K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang, Distinguishing string selection problems, Proc. 10th ACM-SIAM
Symp. on Discrete Algorithms, pp. 633-642, 1999.
[LMW99] M. Li, B. Ma, and L. Wang, Finding similar regions in many strings, the 31st ACM Symp. on
Theory of Computing, pp. 473-482, 1999.
[LMW00a] M. Li, B. Ma and L. Wang, Near optimal multiple alignment within a band in polynomial time,
the 32nd ACM Symp. on Theory of Computing, to appear.
[LMW00b] M. Li, B. Ma and L. Wang, On the closest string and substring problems, submitted for publication.
[LAK89] D. Lipman, S.F. Altschul, and J.D. Kececioglu, A tool for multiple sequence alignment,
Proc. Nat. Acad. Sci. U.S.A., vol. 86, pp. 4412-4415, 1989.
[P92] P. Pevzner, Multiple alignment, communication cost, and graph matching, SIAM Journal on Applied Math.,
vol. 56, pp. 1763-1779, 1992.
[Sankoffb] D. Sankoff and J. Kruskal, Time warps,string edits, and macromolecules: the theory and practice of sequence
comparison, Addison Wesley, 1983
[WJ94] L. Wang and T. Jiang, On the complexity of multiple sequence alignment, Journal of Computational
Biology, vol. 1, pp. 337-348, 1994.
[W89] M.S. Waterman, Sequence alignments, in Mathematical Methods for DNA Sequences, M.S. Waterman (ed.),
CRC, Boca Raton, FL, pp. 53-92, 1989.
[Wbook] M.S. Waterman, Introduction to Computational Biology: Maps, sequences, and genomes, Chapman and Hall,
1995.
Waterman,M., Arratia,R., and Galas,D.,1984, Pattern recognition in several sequences: consensus and alignment,
Bulletin of Mathematical Biology,vol 46 pt 4 pp 515-527.
Waterman M.,and Perlwitz,M.,1984, Line geometries for sequence comparisons, Bulletin of Mathematical Biology,vol
46 pp567-577.
Waterman,M.1986, Multiple sequence alignment by consensus, Nucleic Acids Research, vol 14 pp 9095-9102.
Schuler,G.D.,Altschul,S.F.,and Lipman D.J.,1991, A workbench for multiple alignment construction and analysis,
Proteins: Structure, Function, and Genetics,9,pp 180-190.
[LMW02] Ming Li, Bin Ma, and Lusheng Wang, 2002, On the closest string and substring problem, Journal of the ACM,
vol. 49, no. 2, March 2002, pp. 157-171.
Tao Jiang and Lusheng Wang, Algorithmic methods for multiple sequence alignment.
[JCB] Jeremy Buhler and Martin Tompa, 2002, Finding motifs using random projections, Journal of Computational Biology,
vol. 9, pp. 225-242.
Lawrence,C.E. and Reilly,A.A. 1990, An expectation maximization(EM) algorithm for the identification and
characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, vol.7,
pp.41-45.
Bailey,T.L., and Elkan,C. 1995. Unsupervised learning of multiple motifs in biopolymers using expectation
maximization. Machine Learning,vol.21,pp.51-80.
Books:
Gusfield,D. 1997. Algorithms on strings,trees,and sequences. Cambridge Univ. Press.
Pevzner,P.A. 2000. Computational molecular biology – An algorithmic approach. The MIT Press, Cambridge, Mass.
Waterman,M.S. 1989. Mathematical methods for DNA sequences. CRC Press.
[XW03] Ying Xu and Lusheng Wang, Exact Matching of RNA Secondary Structure Patterns, submitted to TCS
(June 30, 2003)
[WZ03] Lusheng Wang and Jianyun Zhao, Parametric Alignment of Ordered Trees, Bioinformatics, to appear.
[MWZ99] Kaizhong Zhang, Lusheng Wang and Bin Ma,``Computing similarity between RNA structures'',
the 10th Combinatorial Pattern Matching, LNCS 1645, pp. 281-293, July 1999.
3. Research plan and methodology (maximum 3 A-4 pages, including key references):
New development of motif detection: Recently, there have been rich new developments on this topic (see
Background of Research). The following two formulations, the consensus pattern and the closest
substring problems, have been extensively studied [LMW99; LMW02]. Given a set of strings S = {s1, s2, …, sn}
and an integer L, the consensus pattern problem asks for a string s of length L and, for each string
si ∈ S, a substring ti of si of length L such that the sum of d(ti, s) over i = 1, 2, …, n is minimized, whereas the closest
substring problem asks for a string s of length L such that for each string si ∈ S, d(s, ti) ≤ d for
some substring ti (of length L) of si, and d is minimized. Polynomial time approximation schemes
(PTAS) have been developed for both problems [LMW99; LMW02]. However, the time complexity
of both PTASs is too high for them to be practical. Here we borrow the ideas in
[LMW02] for the closest substring problem to attack the consensus pattern problem and obtain a practical
randomized algorithm [W03]. We can show that with high probability, 1-1/poly(n), the algorithm
gives a good approximation, i.e., the cost of the solution is at most (1+ε) times the optimum,
where n is the input size of the problem. Beyond the theoretical results, preliminary experiments have
shown that the algorithm gives very accurate results for inputs of practical size in reasonable time.
(For example, when the input contains 20 sequences of 600 characters each and the planted patterns are
15 characters long with at most 4 mismatches, we can always find the planted patterns in about
1-2 minutes.) This is one of the few approximation algorithms in computational biology that are sound from both
theoretical and practical points of view [W03].
Another method is the random projection approach developed in [JCB]. The algorithm can be
roughly described as follows: (1) randomly choose k (a parameter) positions among the L positions; (2) form 4^k
buckets, one for each combination of letters at the chosen positions; (3) put every segment of length L of
s1, s2, …, sn into the corresponding bucket; (4)
choose those buckets that have more than m (a parameter) items; (5) for each chosen bucket,
construct a profile of the pattern; (6) use the EM approach to iteratively improve the result.
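Steps (1) to (5) of this outline can be sketched in Python as follows. This is a simplified illustration written from the description above, not the implementation of [JCB]: the function names are hypothetical, the profile is a plain matrix of letter counts, buckets are created lazily rather than all 4^k at once, and the EM refinement of step (6) is omitted.

```python
import random
from collections import defaultdict


def random_projection_buckets(seqs, L, k, m, rng):
    """Steps (1)-(4): hash every length-L segment by its letters at k random positions."""
    positions = sorted(rng.sample(range(L), k))        # (1) choose k of the L positions
    buckets = defaultdict(list)                         # (2) at most 4^k keys can occur
    for idx, s in enumerate(seqs):
        for start in range(len(s) - L + 1):             # (3) place every segment
            segment = s[start:start + L]
            key = "".join(segment[p] for p in positions)
            buckets[key].append((idx, start))
    heavy = {key: occ for key, occ in buckets.items() if len(occ) > m}   # (4)
    return positions, heavy


def profile(seqs, occurrences, L, alphabet="ACGT"):
    """Step (5): per-position letter counts over the segments in one bucket."""
    counts = [{c: 0 for c in alphabet} for _ in range(L)]
    for idx, start in occurrences:
        for p, c in enumerate(seqs[idx][start:start + L]):
            counts[p][c] += 1
    return counts


def consensus(counts):
    """Most frequent letter in each column of a profile."""
    return "".join(max(col, key=col.get) for col in counts)


motif = "ACGTACGT"
seqs = ["TTTT" + motif + "GGGG", "CC" + motif + "AAAAAA", motif + "TTTTTTTT"]
positions, heavy = random_projection_buckets(seqs, L=8, k=4, m=2, rng=random.Random(0))
for key, occ in heavy.items():
    print(key, occ, consensus(profile(seqs, occ, 8)))
```

In practice the projection would be repeated with several independent choices of positions, since a motif with mismatches lands in a single bucket only when the chosen positions avoid the mismatched ones.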
At present, experiments show that our randomized algorithm is slightly slower than the random
projection algorithm in [JCB]. Note that the algorithm in [JCB] uses EM for iterative
improvement, whereas our algorithm does not. We are going to incorporate the EM approach into our
algorithm and hope to improve its speed. It is interesting to note that our algorithm has both
a theoretical proof and practical running speed, whereas the random projection approach has not
yet received any theoretical analysis. We will also try to analyze the random projection approach.
Motif-based multiple sequence alignment: With the new developments in motif identification, the
formulation of segment decomposition becomes different. In general, if we treat each motif as a
new character, the problem of aligning sequences over a small alphabet Σ, e.g., Σ = {A, C, G, T} for DNA
sequences, becomes that of aligning sequences over a bigger alphabet whose characters represent the motifs. How can we
benefit from this? If each motif appears only a few times, we can expect to have an efficient
algorithm to solve the problem. We can formulate the problem of decomposing sequences into smaller
segments as the longest common subsequence (LCS) problem for the new sequences, where each sequence
consists of new characters representing motifs. The LCS problem is known to be NP-hard even if each
character appears at most 2 times and the length of the sequence is at most 3???[mj94]. However, if each
character can appear only once in each sequence, we observe that the problem can be solved in polynomial
time (very quickly). The assumption that each character appears only once in each sequence may not be
too strict, since the new approaches can detect motifs of length about 15. The case where a few (a constant
number of) motifs can appear multiple times is also easy to handle. For example, if k motifs can appear twice
in a sequence, we can solve it in O(k^2 poly(n)) time, where poly(n) is the time for the case where each motif
appears once in a sequence.
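One way to see why the single-occurrence case is polynomial is sketched below in Python. When every motif label occurs at most once in each sequence, label a may precede label c in a common subsequence exactly when a precedes c in every sequence, so the problem reduces to a longest chain in a partial order. The function name `lcs_unique` and the representation of sequences as lists of motif labels are assumptions of this example, and the sketch is not necessarily the method that will be developed in the project.

```python
def lcs_unique(seqs):
    """LCS of sequences of motif labels, assuming each label occurs at most once per sequence."""
    common = set(seqs[0]).intersection(*map(set, seqs[1:]))
    pos = [{c: i for i, c in enumerate(s)} for s in seqs]   # position of each label
    order = [c for c in seqs[0] if c in common]             # a valid topological order
    best = {}                                               # label -> (chain length, predecessor)
    for c in order:
        best[c] = (1, None)
        for a in order:
            if a == c:
                break
            # a can come before c only if it does so in every sequence
            if all(p[a] < p[c] for p in pos) and best[a][0] + 1 > best[c][0]:
                best[c] = (best[a][0] + 1, a)
    if not best:
        return []
    end = max(best, key=lambda c: best[c][0])
    chain = []
    while end is not None:                                  # walk back along predecessors
        chain.append(end)
        end = best[end][1]
    return chain[::-1]


# Hypothetical input: each letter stands for one detected motif.
print(lcs_unique([list("ABCDE"), list("ACBDE"), list("BACED")]))
```

With c common motifs and n sequences the sketch runs in O(c^2 n) time; the motifs on the returned chain are the anchors that split every sequence into smaller segments.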
With the above observations, we strongly believe that we have a good chance to develop a new
method that is quite different from the existing ones.
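The polynomial-time case mentioned above (each motif character appears at most once in each sequence) can be illustrated with a minimal sketch. It is one standard way to solve this restricted LCS problem and is not necessarily the method intended in this proposal: a motif a may precede a motif b in a common subsequence only if a occurs before b in every sequence, so the answer is a longest chain under that order. The function name lcs_unique and the motif labels used in the example are hypothetical.

```python
from typing import Hashable, List, Sequence


def lcs_unique(seqs: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """LCS of several sequences, assuming no symbol repeats within a sequence.

    Runs in O(m * n^2) time for m sequences of length at most n.
    """
    if not seqs:
        return []
    # Position of every symbol in every sequence (well defined: no repeats).
    pos = [{sym: i for i, sym in enumerate(s)} for s in seqs]
    for s, p in zip(seqs, pos):
        if len(p) != len(s):
            raise ValueError("a symbol repeats within a sequence")
    # Only symbols present in all sequences can occur in a common subsequence.
    # They are kept in the order of the first sequence.
    common = [sym for sym in seqs[0] if all(sym in p for p in pos)]
    if not common:
        return []
    best = [1] * len(common)   # best[i]: longest chain ending at common[i]
    prev = [-1] * len(common)  # prev[i]: predecessor of common[i] in that chain
    for i, b in enumerate(common):
        for j in range(i):
            a = common[j]
            # a may precede b only if it does so in every sequence.
            if best[j] + 1 > best[i] and all(p[a] < p[b] for p in pos):
                best[i] = best[j] + 1
                prev[i] = j
    # Walk back from the end of a longest chain.
    i = max(range(len(common)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(common[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical example: three sequences already rewritten as motif labels.
motif_seqs = [
    ["m1", "m2", "m3", "m4", "m5"],
    ["m2", "m1", "m3", "m5", "m4"],
    ["m1", "m3", "m2", "m4", "m5"],
]
anchors = lcs_unique(motif_seqs)  # ["m1", "m3", "m4"]
```

The motifs returned would serve as anchor points that cut the original sequences into smaller segments; how those segments are then aligned is outside the scope of this sketch.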
Comparison of RNA structures: Many measures have been proposed for comparing RNA secondary
structures; alignment of trees [] is one of them. These measures can be divided into two classes. For the
first class, an RNA secondary structure is decomposed into components of five types: stem (S), hairpin (H),
bulge (B), interior loop (I), and multi-branch loop (M). The secondary structure can then be represented as
an ordered tree in which each node is labeled by a letter S, H, B, I or M and the left-to-right order among
siblings is significant (Jiang et al., 1995). Comparison of RNA secondary structure trees has applications in
identifying conserved structural motifs in an RNA folding process (Le et al., 1989a; Le et al., 1989b) and in
constructing taxonomy trees (Shapiro and Zhang, 1990). For the second class, an RNA secondary structure is
treated as a sequence plus some base pairs. [Zhang] proposed a method that directly compares the two structures. In
this project, we
Reference
[W03] Lusheng Wang, Randomized algorithms for subtle motif identification, manuscript.
4(a).
Has similar submission(s) been made to seek funding?
Yes
No
If yes, please state the funding agency and the funding programme:
Reference No. :
[for RGC-funded projects only]
Title of Project [if different from Item 1 of Part I above]
Date (month/year) of application:
Outcome:
4(b).
If this application is the same as or similar to one(s) submitted previously, what were the
main concerns/suggestions of the reviewers then?
4(c). Please give a brief response to the points mentioned at 4(b) above, highlighting the major
changes that have been incorporated in this application.
5.
Is there similar or related research being carried out at your institution(s)?
Yes
No
If yes, please give brief details [names of investigators, departmental and institutional
affiliations, project title(s) and nature of the project(s)]
6.
Plan(s) for collaboration in this application:
[Indicate the role and the specific task(s) the PI and each Co-I, if any, is responsible for.]
GRANT RECORD OF INVESTIGATORS
7.
Details of on-going and completed research projects funded from all (RGC and non-RGC)
sources undertaken by the PI (in a PI or Co-I capacity) in the past five years.
[Please attach a copy of the original abstract of each listed project]
Seq. No. | Project Title | PI/Co-I | Funding Source(s) and Amount (HK$) | Start Date | (Expected) Completion Date
8.
Details of on-going and completed research projects funded from all (RGC and non-RGC)
sources undertaken by each Co-I (in a PI capacity) in the past three years.
[Please attach a copy of the original abstract of each listed project]
Seq. No. | Name of Co-I(s) | Project Title | Funding Source(s) and Amount (HK$) | Start Date | (Expected) Completion Date
9.
Research output of previously funded projects (RGC and non-RGC sources) undertaken by
the PI and each Co-I relevant to this application.
[Attach one A-4 page summary on the progress/publications/conferences/student-training, etc.
of the projects, with the relevant project reference no.]
10.
Curriculum vitae (CV) of applicant(s).
[For the PI and each Co-I, attach one A-4 page CV with personal particulars, academic
qualifications, positions held and publication records. Please present publications in two
sections: most representative publications (ten at maximum), and research-related prizes and
awards.]
PROJECT FUNDING
11.
Expected duration of this project (in months)
Proposed start date:
Estimated completion date:
12.
Estimated cost and resource implications:
Item | Year 1 (HK$) | Year 2 (HK$) | Year 3 (HK$) | Year 4# (HK$) | Year 5# (HK$) | Total (HK$)
[# applicable to “longer-term research grant” only]
(a) Staff [Rank, No., Salary per month]
(b) Relief Teacher [see Explanatory Notes] [Rank, Months, Salary per month]
(c) Equipment [please itemize and provide quotations for each item costing over HK$200,000]
(d) General expenses [please itemize]
(e) Conference expenses [see Explanatory Notes]
Total
13(a). Justifications for each category/item of the budget in Item 12 above:
[Detailed justifications should be given in order to support the request]
13(b). Existing facilities and major equipment already available for this research project:
14.
Other research funds already secured:
Source
Amount (HK$)
15.
Allocation from Earmarked Research Grant requested:
[The amount shown here should be the same as shown
in Item 3 of Part I above]
16.
Other research funds to be or are being sought [If funds under this item are secured, the
amount of the Earmarked Research Grant to be awarded may be reduced]:
Source
HK$
Amount (HK$)
ANCILLARY INFORMATION
17.
Research ethics/safety approval:
[The primary responsibility of seeking the relevant approval rests with the PI. The PI’s
institution is required to complete and sign Part III of this application form to certify whether
the relevant approval has been given]
(a) Please tick ‘✓’ the appropriate boxes to confirm if approval for the respective ethics and/or
safety issues is required and has been obtained from the PI’s institution.
Issue | Approval not required | Approval obtained | Approval being sought
(i) Human research ethics
(ii) Animal research ethics
(iii) Biological safety
(iv) Ionizing radiation safety
(v) Non-ionizing radiation safety
(vi) Chemical safety
(b) Approval required, if any, by other authorities and the prospects of such approval.
Please put down “N.A.” if not applicable.
18(a). List of proposed reviewers:
Points to note before completion: This is NOT a compulsory section. This list serves as a reference for the RGC Panel.
The named reviewer(s) may or may not be chosen to review the application.
• Applicant(s) can nominate none or a maximum of five reviewers. They should preferably
be experts with whom the applicant(s) has no relationship. If, however, the applicant(s), i.e.,
the PI as well as the Co-I(s), decide to nominate reviewers with a past or present
relationship, a declaration on the association must be made. It is the responsibility of the
PI and the Co-I(s) to ensure that all relationships are fully declared. Failure to disclose
fully or accurately the relationship may result in disqualification of the application.
• Please DO NOT put down here the name(s) of any reviewer(s) whom the applicant(s) may
wish to exclude from being invited for assessment.
(i) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(ii) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(iii) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(iv) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
(v) Title/Name/Post/Institution:
Address/Tel./Fax/E-mail:
Area of Expertise:
18(b). Declaration of any past and present relationship between the investigator(s) i.e., PI and Co-Is,
and the nominated reviewers [minimum one tick (✓) per reviewer]:
Nature of relationship (please elaborate in 18(c)) | Reviewer (i) | (ii) | (iii) | (iv) | (v)
Advisor or Advisee
Colleague in the same organization (when and where)
Research Collaborator
Co-authors of papers and patents
Significant financial interest
Others (please specify)
None
18(c). Elaboration on the nature of the relationship, if any:
19.
DATA ARCHIVE POSSIBILITIES
Is the proposed project likely to generate data set(s) of retention value?
Yes
No
If yes, please describe the nature, quantity and potential use of the data set(s) in future.
Are you willing to make the data set(s) available to others for reference twelve months after
the publication of research results or the completion of this proposed project?
Yes
No
I/We understand that the RGC only considers data archiving requests after the completion of
the RGC-funded project, and the Council has full discretion in funding the archiving requests.
Data sets archived with RGC funds will require users to acknowledge the originator and the
RGC. The originator will also be provided with copies of all publications derived from the
use of the data.
20.
I/We certify that I/we have completed this application form in accordance with the
Explanatory Notes ERG2. The information given is complete and accurate to the best of
my/our knowledge.
Name of Principal Investigator: _______________ Signature: _________________ Date: ___________
Name of Co-investigator: ______________________ Signature: __________________ Date: ___________
Name of Co-investigator: ___________________ Signature: _________________ Date: ___________
(Add more names if necessary)