THE RNA DETECTIVE GAME: FINDING RNA CHAINS FROM FRAGMENTS Fred Roberts, Rutgers University

DNA and RNA
Deoxyribonucleic acid, DNA, is the basic building block
of inheritance.
DNA can be thought of as a chain consisting of bases.
Each base is one of four possible chemicals:
Thymine (T), Cytosine (C), Adenine (A), Guanine (G)
Some DNA chains:
GGATCCTGG, TTCGCAAAAAGAATC
Real DNA chains are long:
Algae (P. salina): 6.6x10^5 bases long
Slime mold (D. discoideum): 5.4x10^7 bases long
Insect (D. melanogaster – fruit fly): 1.4x10^8 bases long
Bird (G. domesticus): 1.2x10^9 bases long
Human (H. sapiens): 3.3x10^9 bases long
The sequence of bases in DNA encodes certain genetic
information.
In particular, it determines long chains of amino acids
known as proteins.
How many possible DNA chains are there in humans?
Aside: Counting
Fundamental methods of combinatorics are important in
mathematical biology.
The Product Rule
How many sequences of 0’s and 1’s are there of length 2?
There are 2 ways to choose the first digit and no matter
how we choose the first digit, there are two ways to
choose the second digit.
Thus, there are 2x2 = 2^2 = 4 ways to choose the sequence:
00, 01, 10, 11
How many sequences are there of length 3?
By similar reasoning: 2x2x2 = 2^3 = 8.
Is this interesting?
Boring!
Really boring!
Counting may be boring at times, but we will see that it
can be really powerful.
Product Rule: If something can happen in n1 ways and no
matter how the first thing happens, a second thing can
happen in n2 ways, then the two things together can
happen in n1 x n2 ways.
More generally, if something can happen in n1 ways and
no matter how the first thing happens, a second thing can
happen in n2 ways, and no matter how the first two things
happen a third thing can happen in n3 ways, … then all the
things together can happen in n1 x n2 x n3 x … ways.
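The Product Rule can be checked by brute force on small cases. The sketch below (Python, standard library only) enumerates every sequence over an alphabet and counts them; the function name `count_sequences` is an illustrative choice, not something from the slides.

```python
from itertools import product

def count_sequences(alphabet, length):
    """Count all sequences of the given length by enumerating them."""
    return sum(1 for _ in product(alphabet, repeat=length))

# Sequences of 0's and 1's: the Product Rule predicts 2^2 and 2^3.
print(count_sequences("01", 2), 2 ** 2)
print(count_sequences("01", 3), 2 ** 3)

# The four sequences of length 2, listed explicitly.
print(["".join(s) for s in product("01", repeat=2)])
```

Enumeration only works for short sequences; the Product Rule gives the count without listing anything.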
How many possible DNA chains are there in humans?
How many DNA chains are there with two bases?
Answer (Product Rule): 4x4 = 4^2 = 16.
There are 4 choices for the first base and, for each such
choice, 4 choices for the second base.
How many with 3 bases? 4^3 = 64
How many with n bases? 4^n
How many human DNA chains are possible?
4^(3.3x10^9)
This is greater than 10^(1.98x10^9)
(1 followed by 1.98 billion zeroes!)
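The size of this number can be estimated with logarithms, since 4^n = 10^(n x log10(4)). The following is a minimal sketch of that arithmetic; the variable names are illustrative.

```python
import math

# Small cases from the slides: chains with 2 and 3 bases.
print(4 ** 2, 4 ** 3)  # 16 64

# 4^(3.3x10^9) is far too large to print, so convert it to a power of ten:
# 4^n = 10^(n * log10(4)).
human_bases = 3.3e9
power_of_ten = human_bases * math.log10(4)
print(f"4^(3.3x10^9) is about 10^({power_of_ten:.3e})")
```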
RNA is a “messenger molecule” whose links are defined
from DNA.
An RNA chain has at each link one of four bases.
The possible bases are the same as those in DNA except
that the base Uracil (U) replaces the base Thymine (T).
The RNA Detective Game
Sample RNA chains:
GGCAUUGGA, UAUAUGCGGCUUC
RNA chains are very long.
Can we discover what they look like without actually observing them?
Trick: Use enzymes.
Some enzymes break up an RNA chain into fragments
after each G link.
Some enzymes break up the chain after each C or U link.
Consider the chain
CCGGUCCGAAAG
Applying the G enzyme breaks the chain into the
following fragments:
G fragments: CCG, G, UCCG, AAAG
We know that these are the fragments, but we do not know
the order in which they appear.
How many possible chains have these four fragments?
Product rule again: 4 choices for the first fragment, for
each such choice 3 choices for second fragment, …
There are 4x3x2x1 = 4! = 24 possible chains.
One chain corresponding to each permutation of these four
fragments.
One such chain different from the original:
UCCGGCCGAAAG
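The G digest and the count of 24 can be reproduced in a few lines. This is a sketch only: `digest` is a hypothetical helper that models an enzyme as "cut after every listed base", which is how the slides describe it.

```python
from itertools import permutations

def digest(chain, cut_after):
    """Split a chain after every base in cut_after (e.g. "G" or "UC")."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:  # trailing piece that does not end in a cut base
        fragments.append(current)
    return fragments

chain = "CCGGUCCGAAAG"
g_fragments = digest(chain, "G")
print(g_fragments)  # ['CCG', 'G', 'UCCG', 'AAAG']

# Every ordering of the four distinct fragments gives a different chain.
candidates = {"".join(p) for p in permutations(g_fragments)}
print(len(candidates))  # 24
```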
Chain:
CCGGUCCGAAAG
Suppose we instead apply the U,C enzyme.
We get the following fragments:
U,C fragments: C, C, GGU, C, C, GAAAG
How many chains are there with these fragments?
Is 6! = 720 the correct answer?
Consider two of the permutations: the one that takes the
fragments in the order given, and the one that swaps the first
two fragments and leaves all the others in place.
They give rise to the same chain.
So 6! is wrong.
What is the answer?
What if the fragments were
C, C, C, C, C?
There are 5! permutations of these fragments, but only
one RNA chain with these fragments:
CCCCC
Aside: More Counting
Multinomial Coefficients
Putting n distinguishable balls into k distinguishable
boxes:
The number of ways to put n1 balls into the first box,
n2 balls into the second box, …, nk balls into the kth
box is denoted by C(n;n1,n2,…,nk), where
n = n1 + n2 + … + nk.
Theorem: C(n;n1,n2,…,nk) = n!/(n1! n2! … nk!)
Example: How many RNA chains of length 6 have 3 C’s
and 3 A’s?
Think of 2 boxes, a C box and an A box. How many ways
are there to put 3 positions (balls) into the C box and 3
into the A box?
Answer: C(6;3,3) = 6!/(3!3!) = 20.
Some of these are: CACACA, ACACAC, AAACCC.
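The multinomial formula can be compared against a direct enumeration for this small example. The sketch below assumes nothing beyond the theorem above; `multinomial` is an illustrative name.

```python
from itertools import product
from math import factorial

def multinomial(*counts):
    """C(n; n1, ..., nk) = n! / (n1! n2! ... nk!), where n = n1 + ... + nk."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

# Brute force: all length-6 chains over {C, A} with exactly 3 C's.
chains = ["".join(p) for p in product("CA", repeat=6) if p.count("C") == 3]
print(multinomial(3, 3), len(chains))  # 20 20
```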
If a 6-link RNA chain is chosen at random, what is the
probability of obtaining one with 3 C’s and 3 A’s?
Answer: There are 4^6 = 4096 possible RNA chains of length 6.
The probability is therefore
C(6;3,3)/4^6 = 20/4096 ≈ 0.005.
The number of 10-link RNA chains consisting of 3 A’s, 2
C’s, 2 U’s, and 3 G’s is
C(10;3,2,2,3) = 25,200
What if we know they end in AAG?
Then, only the first 7 positions need to be filled, and 2 A’s
and one G are already used up. Hence, the answer is
C(7;1,2,2,2) = 630
Notice how knowing the end of a chain can dramatically
reduce the number of possible chains.
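The effect of a known ending can be computed by subtracting the bases it uses up before applying the multinomial formula. This is a minimal sketch; `count_chains` and its `known_suffix` parameter are hypothetical names introduced for illustration.

```python
from collections import Counter
from math import factorial

def count_chains(base_counts, known_suffix=""):
    """Count chains with the given base counts that end in known_suffix."""
    remaining = Counter(base_counts)
    remaining.subtract(Counter(known_suffix))  # bases already placed at the end
    if any(v < 0 for v in remaining.values()):
        return 0  # the suffix needs more of some base than is available
    total = factorial(sum(remaining.values()))
    for v in remaining.values():
        total //= factorial(v)
    return total

counts = {"A": 3, "C": 2, "U": 2, "G": 3}
print(count_chains(counts))         # 25200
print(count_chains(counts, "AAG"))  # 630
```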
Returning to the RNA Detective Game
Recall that we have the following U,C fragments:
C, C, GGU, C, C, GAAAG
The number of RNA chains with these fragments is not 6! = 720.
Think of having 6 positions (there are 6 fragments) and
assigning 4 positions to the C box, 1 to the GGU box,
and 1 to the GAAAG box.
Then the number of ways of doing this is
C(6;4,1,1) = 6!/(4!1!1!) = 30
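The drop from 720 to 30 can be seen directly by generating all orderings and discarding repeats. A short sketch, using only the fragment list given above:

```python
from itertools import permutations

uc_fragments = ["C", "C", "GGU", "C", "C", "GAAAG"]

# permutations() treats the four C fragments as different objects,
# so it produces 6! orderings, many of which are identical.
orderings = list(permutations(uc_fragments))
distinct = set(orderings)
print(len(orderings), len(distinct))  # 720 30
```

Each distinct ordering is repeated 4! = 24 times, once for every way of shuffling the four C fragments among themselves.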
Actually, this computation is still a bit off, though not
because the combinatorial argument is wrong.
Notice that the fragment GAAAG does not end in U or C.
Thus, we know it comes last.
There are 5 remaining U,C fragments.
The number of chains beginning with these 5 fragments is
given by
C(5;4,1) = 5
Beginning of the chains: CCCCGGU, CCCGGUC,
CCGGUCC, CGGUCCC, GGUCCCC
We get all chains with the given U,C fragments by adding
GAAAG to the end of each of these:
CCCCGGUGAAAG
CCCGGUCGAAAG
CCGGUCCGAAAG
CGGUCCCGAAAG
GGUCCCCGAAAG
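The same list can be generated by fixing the abnormal fragment at the end and permuting the rest. This is a sketch under the assumption that a digest has at most one fragment not ending in a cut base; `chains_from_fragments` is an illustrative name.

```python
from itertools import permutations

def chains_from_fragments(fragments, cut_bases):
    """All distinct chains whose digest (cutting after cut_bases) is `fragments`."""
    abnormal = [f for f in fragments if f[-1] not in cut_bases]
    if len(abnormal) > 1:
        raise ValueError("at most one fragment can fail to end in a cut base")
    normal = list(fragments)
    tail = ""
    if abnormal:
        tail = abnormal[0]  # this fragment must come last
        normal.remove(tail)
    return sorted({"".join(p) + tail for p in permutations(normal)})

uc_fragments = ["C", "C", "GGU", "C", "C", "GAAAG"]
for chain in chains_from_fragments(uc_fragments, "UC"):
    print(chain)
```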
Thus, there are 24 possible chains with the given G
fragments and 5 with the given U,C fragments.
But: We have not yet combined our knowledge of both G
and U,C fragments.
G fragments: CCG, G, UCCG, AAAG
U,C fragments: C, C, GGU, C, C, GAAAG
Which of the 5 chains with these U,C fragments has the
right G fragments?
CCCCGGUGAAAG
CCCGGUCGAAAG
CCGGUCCGAAAG
CGGUCCCGAAAG
GGUCCCCGAAAG
CCCCGGUGAAAG does not: It has CCCCG as a G
fragment.
What about the others?
Checking the remaining 4 possible RNA chains with the
given U,C fragments shows that only the third one,
CCGGUCCGAAAG
has the given G fragments.
Hence, we have recovered the initial chain.
This is an example of recovery of an RNA chain given a
complete digest by enzymes.
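Combining the two digests can be sketched as a filter: build every chain from the U,C fragments, then keep those whose digests match both fragment lists. The helper `digest` is a hypothetical model of the enzymes (cut after each listed base), and the brute-force search is only practical for short chains.

```python
from itertools import permutations

def digest(chain, cut_after):
    """Split a chain after every base in cut_after."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

g_fragments = ["CCG", "G", "UCCG", "AAAG"]
uc_fragments = ["C", "C", "GGU", "C", "C", "GAAAG"]

# Fragments are unordered, so compare the digests as sorted lists.
candidates = {"".join(p) for p in permutations(uc_fragments)}
matches = sorted(
    c for c in candidates
    if sorted(digest(c, "G")) == sorted(g_fragments)
    and sorted(digest(c, "UC")) == sorted(uc_fragments)
)
print(matches)  # ['CCGGUCCGAAAG']
```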
How remarkable is it that we could recover the initial
RNA chain this way?
CCGGUCCGAAAG
How many RNA chains are there with the same bases as
this chain?
There are 12 bases: 4 C’s, 4 G’s, 3 A’s, and 1 U.
The number of chains with these bases is given by
C(12;4,4,3,1) = 138,600
Thus, knowing the number of bases is not nearly as useful
as knowing the fragments.
Another example.
G fragments: UG, ACG, AC
U,C fragments: U, GAC, GAC
Step 1: Does any fragment have to come last?
None of the U,C fragments has to come last.
However, the G fragment AC has to come last.
Thus, the other two G fragments come first in some order
and there are only two possible RNA chains with these
G fragments: UGACGAC, ACGUGAC
The latter has AC as a U,C fragment, which is not among the
given U,C fragments. So, the former is the correct chain.
Is it always possible to completely recover the original
RNA chain given its G fragments and U,C fragments?
No: sometimes the solution is ambiguous.
Exercise: Find two RNA chains with the same G and U,C
fragments.
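One way to check an answer to this exercise is a brute-force search over short chains, grouping them by their two digests. This is a sketch only: `digest` and `ambiguous_groups` are hypothetical helpers, and the search grows as 4^length, so it is meant for small lengths.

```python
from collections import defaultdict
from itertools import product

def digest(chain, cut_after):
    """Split a chain after every base in cut_after."""
    fragments, current = [], ""
    for base in chain:
        current += base
        if base in cut_after:
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

def ambiguous_groups(length):
    """Groups of distinct chains of this length sharing both digests."""
    groups = defaultdict(list)
    for bases in product("ACGU", repeat=length):
        chain = "".join(bases)
        key = (tuple(sorted(digest(chain, "G"))),
               tuple(sorted(digest(chain, "UC"))))
        groups[key].append(chain)
    return [chains for chains in groups.values() if len(chains) > 1]

for length in range(1, 7):
    print(length, len(ambiguous_groups(length)))
```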
Eulerian Paths
Surprisingly, eulerian paths in multidigraphs can be used
to help with the RNA detective game.
When a digraph is allowed to have more than one arc from
vertex x to vertex y, we call it a multidigraph.
A path in a multidigraph is called eulerian if it uses every
arc once and only once. (Recall the Königsberg Bridge
Problem.)
A closed path (one that ends where it starts) is eulerian if
it is eulerian as a path.
[Figure: multidigraph on vertices a, b, c, d, e]
eulerian closed path: a, b, c, d, b, e, a
[Figure: multidigraph on vertices a, b, c, d, e]
eulerian path: a, b, c, d, b, e
When does a multidigraph have an eulerian path or closed
path?
Theorem (I.J. Good, 1946): A connected multidigraph
has an eulerian closed path iff for every vertex, the
indegree (number of incoming arcs) equals the
outdegree (number of outgoing arcs).
Theorem (I.J. Good, 1946): A connected multidigraph
has an eulerian path iff for all vertices with the
possible exception of two, indegree equals outdegree,
and for at most two vertices, indegree and outdegree
differ by one.
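The degree conditions in these theorems are easy to test mechanically. The sketch below checks only the indegree/outdegree conditions and assumes the multidigraph is connected (connectivity is not checked). The function names are illustrative, and the example arcs are read off the eulerian paths a, b, c, d, b, e, a and a, b, c, d, b, e given earlier.

```python
from collections import Counter

def degree_differences(arcs):
    """Map each vertex to outdegree minus indegree; arcs are (tail, head) pairs."""
    outdeg = Counter(tail for tail, _ in arcs)
    indeg = Counter(head for _, head in arcs)
    return {v: outdeg[v] - indeg[v] for v in set(outdeg) | set(indeg)}

def closed_path_condition(arcs):
    """True if indegree equals outdegree at every vertex."""
    return all(d == 0 for d in degree_differences(arcs).values())

def path_condition(arcs):
    """True if degrees balance everywhere, or differ by one at exactly two vertices."""
    unbalanced = sorted(d for d in degree_differences(arcs).values() if d != 0)
    return unbalanced in ([], [-1, 1])

closed = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("b", "e"), ("e", "a")]
open_path = closed[:-1]  # drop the arc from e back to a

print(closed_path_condition(closed), path_condition(closed))        # True True
print(closed_path_condition(open_path), path_condition(open_path))  # False True
```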
Note that these theorems hold if there are loops from a
vertex to itself.
A loop adds 1 to indegree and 1 to outdegree.
Thus, loops do not affect the existence of eulerian paths or
closed paths.
Eulerian Paths and the RNA Detective Game
Assume that there are at least two G fragments and at least
two U,C fragments. Otherwise, we can recover the
original chain.
Example:
G fragments: CCG, G, UCACG, AAAG, AA
U,C fragments: C, C, GGU, C, AC, GAAAGAA
Step 1: Break down each fragment after each G, U, or C.
E.g.: GAAAGAA becomes G·AAAG·AA
GGU becomes G·G·U
UCACG becomes U·C·AC·G
Each piece is called an extended base.
All extended bases in a fragment except the first and last are
called interior extended bases.
Step 2: Use the extended base breakup of fragments to
find the beginning and end of the RNA chain.
Start by making two lists
All interior extended bases of all fragments:
C, C, AC, G, AAAG
Fragments with one extended base:
G, AAAG, AA, C, C, C, AC
Theorem: Every entry on the first list is on the second list.
There are always exactly two entries on the second list
not on the first. One of these is the first extended base
of the entire RNA chain and the other is the last.
Thus: chain begins in AA or C and ends in AA or C.
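Steps 1 and 2 can be sketched as follows: split each fragment into extended bases, build the two lists as multisets, and subtract. The helper name `extended_bases` is illustrative; `Counter` subtraction is used because the lists can contain repeated entries.

```python
from collections import Counter

def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

g_fragments = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc_fragments = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]

interior = Counter()  # first list: interior extended bases
singles = Counter()   # second list: fragments with one extended base
for fragment in g_fragments + uc_fragments:
    pieces = extended_bases(fragment)
    if len(pieces) == 1:
        singles[pieces[0]] += 1
    else:
        interior.update(pieces[1:-1])

# Entries on the second list that are not on the first.
endpoints = singles - interior
print(sorted(endpoints.elements()))  # ['AA', 'C']
```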
How do you tell how it ends?
One of these must be from an abnormal fragment: a G
fragment that doesn’t end in G or a U,C fragment that
doesn’t end in U or C.
G fragments: CCG, G, UCACG, AAAG, AA
U,C fragments: C, C, GGU, C, AC, GAAAGAA
AA is such an abnormal fragment.
An abnormal fragment marks the end of the chain.
So: chain ends in AA and begins in C.
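The choice between the two candidates can be sketched in code: find the abnormal fragments and see which candidate is the final extended base of one of them. The function names are hypothetical, and the sketch assumes the two candidates are different.

```python
def last_extended_base(fragment):
    """The piece of the fragment after its last interior G, U, or C."""
    for i in range(len(fragment) - 2, -1, -1):
        if fragment[i] in "GUC":
            return fragment[i + 1:]
    return fragment

def chain_ends(candidates, g_fragments, uc_fragments):
    """Return (first, last) extended base of the chain from the two candidates."""
    abnormal = [f for f in g_fragments if f[-1] != "G"]
    abnormal += [f for f in uc_fragments if f[-1] not in "UC"]
    tails = {last_extended_base(f) for f in abnormal}
    last = next(c for c in candidates if c in tails)
    first = next(c for c in candidates if c != last)
    return first, last

g_fragments = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc_fragments = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]
print(chain_ends(["AA", "C"], g_fragments, uc_fragments))  # ('C', 'AA')
```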
Step 3: Build a multidigraph.
First, identify all normal fragments with more than one
extended base. From each such fragment, use the first
and last extended bases as vertices and draw an arc
from the first to the last.
Label the arc with the corresponding fragment.
G fragments: CCG, G, UCACG, AAAG, AA
U,C fragments: C, C, GGU, C, AC, GAAAGAA
Fragment UCACG gives rise to vertices U and G and we
include an arc from U to G labeled UCACG.
[Figure: vertices U and G with an arc from U to G labeled UCACG]
Fragment CCG means that we include an arc from C to G
labeled CCG.
Fragment GGU means that we include an arc from G to U
labeled GGU.
[Figure: multidigraph on vertices C, G, U with arcs C to G (CCG), G to U (GGU), and U to G (UCACG)]
There might be several arcs from a given extended base to
another if there are several normal fragments from the
first to the second. That is why we get a multidigraph.
Step 4: We add one additional arc.
Identify the longest abnormal fragment.
Include an arc from the first (and perhaps only) extended
base in this fragment to the first extended base in the
chain.
Label this as X*Y where X is the longest abnormal
fragment in the chain and Y is the first extended base in
the chain.
GAAAGAA is the longest abnormal fragment.
Put in an arc from G (first extended base in this fragment)
to C (first extended base in the chain).
Label the arc as GAAAGAA*C
[Figure: multidigraph on vertices C, G, U with arcs C to G (CCG), G to U (GGU), U to G (UCACG), and G to C (GAAAGAA*C)]
Theorem: This multidigraph has an eulerian closed path.
The RNA chains with the given G and U,C fragments
correspond to eulerian closed paths that end with the
special arc X*Y.
In our example, it is easy to check it has an eulerian closed
path. (Use I.J. Good’s Theorem.)
The only eulerian closed path that ends in
GAAAGAA*C goes from C to G to U to G to C.
Step 5: Use the corresponding labeling of arcs to obtain
the chain:
CCGGUCACGAAAGAA
It is easy to check this has the right G and U,C fragments.
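Steps 3 to 5 can be sketched end to end. The code below builds the arcs from the normal fragments, adds the special arc X*Y, and searches by simple backtracking for eulerian closed paths that end with the special arc. It assumes the first extended base is already known from Step 2; the names `build_arcs` and `reconstruct` are illustrative, and the backtracking search is a plain brute-force choice, not an efficient algorithm.

```python
def extended_bases(fragment):
    """Break a fragment after each G, U, or C."""
    pieces, current = [], ""
    for base in fragment:
        current += base
        if base in "GUC":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def build_arcs(g_fragments, uc_fragments, first_base):
    """Return (arcs, special); each arc is a (tail, head, fragment) triple."""
    tagged = [(f, "G") for f in g_fragments] + [(f, "UC") for f in uc_fragments]
    arcs, abnormal = [], []
    for fragment, cut in tagged:
        pieces = extended_bases(fragment)
        if fragment[-1] not in cut:
            abnormal.append(fragment)
        elif len(pieces) > 1:  # normal fragment with more than one extended base
            arcs.append((pieces[0], pieces[-1], fragment))
    longest = max(abnormal, key=len)
    special = (extended_bases(longest)[0], first_base, longest)
    return arcs, special

def reconstruct(g_fragments, uc_fragments, first_base):
    """All chains read off eulerian closed paths that end with the special arc."""
    arcs, special = build_arcs(g_fragments, uc_fragments, first_base)
    chains = set()

    def walk(vertex, unused, chain):
        if not unused:
            if vertex == special[0]:  # finish with the special arc X*Y
                chains.add(chain + special[2][len(vertex):])
            return
        for i, (tail, head, label) in enumerate(unused):
            if tail == vertex:
                # Consecutive fragments overlap in the shared extended base.
                walk(head, unused[:i] + unused[i + 1:], chain + label[len(tail):])

    walk(first_base, arcs, first_base)
    return sorted(chains)

g_fragments = ["CCG", "G", "UCACG", "AAAG", "AA"]
uc_fragments = ["C", "C", "GGU", "C", "AC", "GAAAGAA"]
print(reconstruct(g_fragments, uc_fragments, "C"))  # ['CCGGUCACGAAAGAA']
```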
The RNA Detective Game: Concluding Comments
The “fragmentation stratagem” we have described was
used by R.W. Holley and his colleagues at Cornell in
1965 to determine the first nucleic acid sequence.
The method is not used anymore and was only used for a
short time before other, more efficient methods were
adopted.
However, it has great historical significance and illustrates
an important role for mathematical methods in biology.
Nowadays, by use of radioactive marking and high-speed
computer analysis, it is possible to sequence long RNA
and DNA chains rather quickly.
The mathematical power of the fragmentation stratagem,
nevertheless, is a good illustration of the use of
methods of discrete mathematics in modern molecular
biology.
And of the power of counting!
The RNA Detective Game: Enjoy it with Your Students