Pattern Discovery in RNA Secondary
Structure Using Affix Trees
(when computer scientists meet real molecules)
Giulio Pavesi & Giancarlo Mauri
Dept. of Computer Science, Systems and Communication
University of Milano – Bicocca
Milan, Italy
Why Is RNA So Interesting?
• After the completion of various genome
projects, the attention of many researchers
has shifted from coding to non-coding regions
• More than 95% of our genome is non-coding:
what about the rest?
• Non-coding RNA: RNA that is transcribed
from DNA but does not directly encode a
protein (tRNA, microRNA, etc.)
CPM 2003 - Morelia
Giulio Pavesi
A Motivating Example
Post-transcriptional regulation of gene expression
The Problem
• Functionally related RNA sequences present
structural similarity, at least in some parts
• Given two or more RNA molecules, find similar
(supposedly functional) structural elements in them
• Sequence similarity implies structure similarity, but
this is not always true for RNA
• Given two or more RNA sequences of unknown
structure, find similar structural elements in them
(motifs)
• Low sequence similarity can still correspond to
high structure similarity
“Know Thine Enemy”
• RNA secondary structure: the list of base
pairs among the nucleotides of a sequence,
such that:
– No nucleotide takes part in more than one
base pair (usually Watson-Crick pairs and
G-U wobble pairs, i.e. canonical base pairs;
U is written as T in the examples on these slides)
– Base pairs never cross: if nucleotide i is paired with
nucleotide j and k with l (with i < j, k < l and i < k),
then either i < j < k < l or i < k < l < j
RNA Secondary Structure
.((..(((.((....)))))...(((.(((...)))...)))))
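A minimal sketch of how a bracket string like the one above maps to the list of base pairs, and of the non-crossing condition. The function names `parse_dot_bracket` and `is_non_crossing` are illustrative and do not come from the talk; positions are 0-based.

```python
from itertools import combinations


def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)                      # opening partner, closed later
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))       # pair with the most recent '('
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_non_crossing(pairs):
    """Check that any two pairs (i, j), (k, l) with i < k are disjoint or nested."""
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        if not (j < k or l < j):                 # neither i<j<k<l nor i<k<l<j
            return False
    return True


print(parse_dot_bracket("((..))"))
```

Any well-formed bracket string yields non-crossing pairs by construction, so `is_non_crossing` is only needed for pair lists that come from another source.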
Motifs in RNA Secondary Structure
• Many functional motifs can be described by secondary
structure alone
• Two types of similarity:
– sequence similarity (in unpaired nucleotides, mainly)
– structure similarity
Data Structures?
• When dealing with DNA or protein
sequences, significant advantages have
been obtained by using suitable text-indexing
structures (e.g. suffix trees)
• RNA secondary structure can be described
by a string
• Is there a “good” structure that will do for
RNA sequences, allowing us to consider
sequence and structure at the same time?
Affix Trees
Affix tree for string S = ATATC
• Suffix and prefix edges
• Suffix edges spell the substrings of string S
• Prefix edges (dotted) spell the substrings of S^-1
(the reverse of S)
• Built in linear time
• Takes linear space
Affix Trees
• The affix tree of a string S indexes all the
substrings of both S and S^-1 (the reverse of S)
• Once a substring of S has been located in the
tree, we can extend it to the right (by
following suffix edges) and to the left (by
following prefix edges)
• Useful when searching the sequences for
patterns with some kind of symmetry
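The bidirectional extension described above can be illustrated with a deliberately naive stand-in. This is not an affix tree: it stores every substring explicitly (quadratic space, not linear) and only mimics the two operations, extend to the right and extend to the left. The class name `NaiveBidirectionalIndex` and its methods are hypothetical.

```python
class NaiveBidirectionalIndex:
    """Toy substitute for an affix tree: a set of all substrings of s."""

    def __init__(self, s):
        self.alphabet = sorted(set(s))
        self.subs = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

    def extend_right(self, pattern):
        """Substrings obtained by appending one symbol (suffix edges in an affix tree)."""
        return [pattern + c for c in self.alphabet if pattern + c in self.subs]

    def extend_left(self, pattern):
        """Substrings obtained by prepending one symbol (prefix edges in an affix tree)."""
        return [c + pattern for c in self.alphabet if c + pattern in self.subs]


index = NaiveBidirectionalIndex("ATATC")
print(index.extend_right("AT"))  # ['ATA', 'ATC']
print(index.extend_left("TA"))   # ['ATA']
```

A real affix tree offers the same two moves from any located substring without materialising the substring set.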
The Hairpin
• The basic element of
RNA secondary structure
is the hairpin (or stem-loop)
structure
• The hairpin is
symmetric!
((((( ...... )))))
AGGTC CAGTCA GATCT
First Try
• Predict the secondary structure of each input
sequence
• Build the affix tree for the folded sequences
(in bracket notation)
• Search exhaustively for patterns describing
hairpin structures (possibly with differences)
• Report those occurring in at least q
sequences
Searching for Hairpins in Affix Trees
• For each loop size l:
1. Find l dots in the tree, on suffix edges (hairpin loop)
2. Add a base pair:
   a) Find a ) on suffix edges
   b) Find a ( on prefix edges
3. If the result appears in at least q sequences, jump to 2,
   else return from jump
4. Add internal loops:
   a) Find a dot on prefix edges: jump to 2;
   b) Find a dot on suffix edges: jump to 2;
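The steps above can be sketched as a short recursive procedure. As an assumption made for brevity, the sketch replaces the affix tree with plain substring tests on the folded strings, so it reproduces the order of the search but not its efficiency. The names `support`, `grow` and `find_hairpins` are illustrative, and only one unpaired position is added per side before the next base pair, as in the steps listed.

```python
def support(pattern, structures):
    """Number of folded sequences (dot-bracket strings) containing the pattern."""
    return sum(pattern in s for s in structures)


def grow(pattern, structures, q, found):
    """Add a base pair around the pattern, keep it if frequent, then recurse."""
    paired = "(" + pattern + ")"                  # ')' on the right, '(' on the left
    if support(paired, structures) < q:           # too rare: backtrack
        return
    found.add(paired)
    grow(paired, structures, q, found)            # extend the stem
    for extended in ("." + paired, paired + "."): # unpaired position on either side
        if support(extended, structures) >= q:
            grow(extended, structures, q, found)  # must be closed by a new pair


def find_hairpins(structures, loop_sizes, q):
    """Return hairpin patterns occurring in at least q of the folded sequences."""
    if q < 1:
        raise ValueError("q must be at least 1")
    found = set()
    for size in loop_sizes:
        loop = "." * size                         # the hairpin loop itself
        if support(loop, structures) >= q:
            grow(loop, structures, q, found)
    return found


folded = ["((....))", ".(((....))).", "(.((....)))"]
print(sorted(find_hairpins(folded, loop_sizes=[4], q=3)))
```

The recursion stops on its own because a pattern longer than every input string has support 0.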
Recursive Algorithm
Step  Pattern        Result
1     ....           (ok)
2     (....)         (ok)
2     ((....))       (ok)
2     (((....)))     (no)
3a    .((....))      (ok)
2     (.((....)))    (ok)
• On each path, we keep one pointer for the prefix
edge and another for the suffix edge
• Speed-up: represent each unpaired element with
a single symbol describing its type and size, so as to
compare two symbols instead of two regions
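One way to picture the speed-up is a run-length encoding of the bracket string, where each run of dots becomes a single token. This is only a sketch under a simplifying assumption: the token records the size of the unpaired region (`U4` for four dots) but not its type (hairpin loop, internal loop or bulge), which the talk also mentions. The function name `compress` and the token format are hypothetical.

```python
from itertools import groupby


def compress(structure):
    """Collapse each run of dots in a dot-bracket string into one 'U<size>' token."""
    tokens = []
    for symbol, run in groupby(structure):
        length = len(list(run))
        if symbol == ".":
            tokens.append(f"U{length}")          # one token per unpaired region
        else:
            tokens.extend(symbol * length)       # brackets stay one token each
    return tokens


print(compress("(.((....)))"))
```

Two unpaired regions can then be compared by comparing two tokens, whatever their length.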
Approximate Search
• We can allow some approximation:
– Hairpin loops of different size (range value at
step 1)
– Internal loops of different size at the same
position along the stem
– Internal loops or bulges at different positions
along the stem
– Stems of different size (base pairs)
– Any combination of the above
Complexity
• Given a set of k folded sequences of overall
length N :
– Construction of the tree: O(N)
– Annotation of the tree: O(kN)
– Search: O(V(m)kN), where m is the length of the
longest pattern found
– V(m) depends on the degree of approximation
– In practice, the most time-consuming part is
predicting the structure of the sequences
Does It Work?
• Test: Iron Responsive Element, located in the UTRs
of mRNA coding for proteins involved in iron
metabolism (e.g. ferritin, transferrin)
• Does it appear in all the predicted structures?
• Alas, it does not!
Why?
The “real structure” often does not
correspond to the optimal one!
The motif “disappears” from the
(supposedly) optimal structure
One Possible Solution
• Idea: for each sequence, consider also a number of
alternative sub-optimal structures
• All the possible structures can be enumerated
• Check whether a motif appears in at least one
alternative structure per sequence
• The affix tree can handle efficiently even hundreds
of alternative structures per input sequence
• Downside: the number of potential secondary
structures for a sequence of length n is O(2^n)
• If similarity is not stringent, we have too many
candidates
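To get a feel for how fast the number of candidate structures grows, the sketch below counts non-crossing pairings on n positions with a standard dynamic-programming recurrence: the last position is either unpaired, or paired with some earlier position i, which splits the rest into two independent segments. It ignores which bases can actually pair, so it overestimates the count for any concrete sequence. The function name `count_structures` and the `min_loop` parameter (minimum number of unpaired positions enclosed by a pair) are assumptions of this sketch, not values from the talk.

```python
def count_structures(n, min_loop=1):
    """Count secondary structures on n positions, ignoring base identities."""
    counts = [1]                                  # the empty sequence has one structure
    for length in range(1, n + 1):
        total = counts[length - 1]                # last position left unpaired
        # last position paired with position i (1-based), enclosing
        # length - i - 1 >= min_loop positions between them
        for i in range(1, length - min_loop):
            total += counts[i - 1] * counts[length - i - 1]
        counts.append(total)
    return counts[n]


for n in (10, 20, 30, 40):
    print(n, count_structures(n))
```

Even for short sequences, full enumeration quickly becomes impractical, which is the downside stated above.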
But.....
• If the same structure has to appear in a set of
sequences, then the same pattern of
complementary base pairs has to appear in
the sequences
(((((  ......  )))))
AGGTC  CAGTCA  GATCT
GCGAG  CAGTCT  CTTGC
CCCAG  CAGTCA  CTGGG
Idea!
• Instead of working on
folded sequences, build
the affix tree for the
sequences alone, and
find complementary
base pairs on the fly
• The search can be
implemented with the
same parameters as in the
folded case
Building Hairpins on the Fly
• By working on unfolded sequences, the
theoretical time complexity is higher, since
different paths correspond to the same
structure
• In practice it is much faster, since we do not
have to run the prediction algorithm on the
input sequences
• We need to “validate” the candidate
structures, e.g. according to their energy
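A minimal sketch of finding hairpins directly on unfolded sequences, assuming exact stems with no bulges or internal loops and a fixed loop size. It scans each sequence directly and does not use an affix tree, and it omits the energy-based validation mentioned above. The pairing table uses T in place of U, as the example sequences on these slides do; the names `PAIRS` and `hairpins` are illustrative.

```python
# Watson-Crick pairs plus the G-T (G-U) wobble pair, in both orientations.
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C"), ("G", "T"), ("T", "G")}


def hairpins(seq, loop_size, min_stem):
    """Return (loop_start, stem_length) for every hairpin with the given loop size."""
    hits = []
    for loop_start in range(1, len(seq) - loop_size):
        left, right = loop_start - 1, loop_start + loop_size
        stem = 0
        # extend the stem outwards while the flanking bases can pair
        while left >= 0 and right < len(seq) and (seq[left], seq[right]) in PAIRS:
            stem += 1
            left -= 1
            right += 1
        if stem >= min_stem:
            hits.append((loop_start, stem))
    return hits


for seq in ("AGGTCCAGTCAGATCT", "GCGAGCAGTCTCTTGC", "CCCAGCAGTCACTGGG"):
    print(seq, hairpins(seq, loop_size=6, min_stem=5))
```

With a loose `min_stem`, many more candidates appear per sequence, which is why a validation step is needed.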
Post-Processing
• So far we have considered structure alone
• More than a single motif occurrence per
sequence is often reported, especially if
structural constraints are loose
• Post-processing: compare the candidate
occurrences by evaluating sequence
similarity in unpaired elements
• Find the group of instances that are most
similar at the sequence level
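As an illustration of this post-processing step, the sketch below picks one candidate per input sequence so that the chosen unpaired regions are as similar as possible. It makes two assumptions that are not taken from the talk: similarity is measured by Hamming distance (plus any length difference), and the best group is found by exhaustive search, which is only feasible for a handful of candidates. The names `hamming` and `best_group` are hypothetical.

```python
from itertools import combinations, product


def hamming(a, b):
    """Mismatching positions between two strings, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def best_group(candidates):
    """Pick one unpaired region per sequence, minimising total pairwise distance.

    candidates: one list of candidate loop strings per input sequence.
    """
    def cost(group):
        return sum(hamming(a, b) for a, b in combinations(group, 2))

    return min(product(*candidates), key=cost)


loops = [["CAGTCA", "TTTTTT"], ["CAGTCT", "GGGGGG"], ["CAGTCA"]]
print(best_group(loops))
```

The number of groups tried is the product of the candidate counts, so a heuristic would be needed for larger inputs.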
Results and Work in Progress
• The second approach gave better results, in
terms of reliability and efficiency
• Candidate hairpins can be validated
according to their energy value (more
reliable, in this case!)
• Good results on “harder” tests
• Still too many input parameters
• Extend to more complex structures