Why4DNABases11IIT.ppt

Bo Deng
Department of Mathematics
UNL
IIT, 14 Feb. 2011
http://www.math.unl.edu/~bdeng1
Working Hypothesis
Evolution is driven to maximize biodiversity against constraints
in time and energy across all biological scales
Applied to all informational systems:
o DNA Replication
o Protein Synthesis
o Sexual Reproduction
o Speciation to Phylogenetic Tree
o Ecological Community
o Animal Brain
o Consciousness
o Language
o Social, Economical, Political Structures
Channel
C. E. Shannon, ``A mathematical theory of communication,''
Bell System Technical Journal, vol. 27, pp. 379-423 and
623-656, July and October, 1948.
Claude E. Shannon
(1916-2001)
What is Information? and What Matters the Most?
All about choices
Transmission Speed Comparison
Dial-up: 2400 bps ~ 56 Kbps
DSL: 128 Kbps ~ 8 Mbps
Cable: 512 Kbps ~ 20 Mbps
Satellite: ~ 6 Mbps
Optic Fiber: 45 Mbps ~ 150 Mbps
Mathematical Measure of Information: What is in a bit?
One Bit = One Binary Digit
Dead Channel --- Transmits only one kind of symbol all the time, e.g. 0000…
o Each symbol contains 0 bits of information
Live Channel --- Transmits one of many possible symbols each time, e.g. 011101… in a binary channel
o Each transmitted symbol is either 0 or 1
o Each symbol contains 1 bit of information
Pop Quiz: How many bits are in a quaternary symbol (1, 2, 3, 4), or in a symbol from an alphabet of n letters (1, 2, 3, …, n)?
Answer: $H_4 = 2$ bits and $H_n = \log_2 n$ bits, respectively, because $4 = 2^{\log_2 4}$ and $n = 2^{\log_2 n}$
Bit Unit: 0 or 1
n = # of binary sequences of length $\log_2 n$
Ex: { a, b, c, d } = { 00, 01, 10, 11 }
Key Assumption: Each transmitted symbol is just one of n equally probable choices
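The quiz answer can be checked numerically. This is a minimal sketch; the function name `bits_per_symbol` is an illustrative assumption, and it relies on the key assumption that all n choices are equally probable.

```python
import math


def bits_per_symbol(n: int) -> float:
    """Bits carried by one symbol drawn from n equally probable choices."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log2(n)


# A dead channel (n = 1) carries 0 bits; binary carries 1; quaternary carries 2.
for n in (1, 2, 4, 8):
    print(f"n = {n}: {bits_per_symbol(n)} bits")
```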
What is in the transmission rate?
Let tk be the time needed to transmit symbol k
Then the average transmission time per symbol is
Tn = (t1 + t2 + t3 +…+ tn ) / n
And the mean rate is
Rn = Hn / Tn = n log2 n / (t1 + t2 + t3 +…+ tn )
The definition implicitly assumes that all symbols are
equally probable.
Why? Is that reasonable?
Recall: Rn = Hn / Tn = n log2n / (t1 + t2 + t3 +…+ tn )
All-purpose Channel
o Internet message types: spam, video, audio, pictures, … etc.
o Each has a different frequency distribution in the encoding symbols
Example of Possible Non-equiprobability:
o If we know all video files that have ever been transmitted over the internet, then we can make an accurate frequency table: say p1 for Symbol 1, p2 for Symbol 2, etc., and pn for Symbol n
o Each transmitted Symbol 1 is just one choice out of 1/p1 many possible choices; therefore Symbol 1 contains $\log_2 (1/p_1)$ bits of information, since $1/p_1 = 2^{\log_2 (1/p_1)}$
Bit Unit: 0 or 1
1/p1 = # of binary sequences of length $\log_2 (1/p_1)$
o Similarly, Symbol k contains $\log_2 (1/p_k)$ bits of information
Example: Pick a marble from a bag of 2 blue and 5 red marbles
o Probability of picking a blue marble: $p_{blue} = 2/7$
o Number of choices for each blue marble picked: $1/p_{blue} = 7/2 = 3.5$
o The average bits per symbol for our video-only source is
$H(p) = p_1 \log_2 (1/p_1) + \dots + p_n \log_2 (1/p_n)$
Important fact (Equiprobability):
$H(p) = p_1 \log_2 (1/p_1) + \dots + p_n \log_2 (1/p_n) \le \frac{1}{n} \log_2 n + \dots + \frac{1}{n} \log_2 n = \log_2 n = H_n$
Conclusion: For an all-purpose channel, the mean rate is calculated not for any particular source entropy, but for the maximal entropy Hn, which is reached with the equiprobability distribution of the transmitted symbols.
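The marble example and the equiprobability bound H(p) <= log2 n can be checked with a short sketch. The function name `entropy` and the variable names are illustrative assumptions; the formula is the average of log2 1/pk weighted by pk.

```python
import math


def entropy(probs):
    """Average bits per symbol: sum of p_k * log2(1/p_k) over all symbols."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


# Bag of 2 blue and 5 red marbles.
p_blue, p_red = 2 / 7, 5 / 7
choices_per_blue = 1 / p_blue            # 7/2 = 3.5 equally likely "slots"
bits_per_blue = math.log2(choices_per_blue)
h_marbles = entropy([p_blue, p_red])

print(f"choices per blue pick: {choices_per_blue:.2f}")
print(f"bits in one blue pick: {bits_per_blue:.4f}")
print(f"H(p) for the bag:      {h_marbles:.4f}  (max is log2 2 = 1)")
```

The bag's entropy stays below 1 bit because the two colors are not equally likely; only the 50/50 case reaches log2 2.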
Design Criterion
To choose n so that Rn = Hn / Tn
is the largest!
Example
Encoding states:
Symbols:       1    2    3    …    n
Trans. Times:  t1   t2   t3   …    tn
Assume:
t1 = 1 sec, t2 = 2 sec, t3 = 3 sec, … , tn = n sec
Then
Rn = Hn / Tn = n log2n / (t1 + t2 + t3 +…+ tn ) = 2 log2 n / (n+1)
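A small sketch of the design criterion under the example's assumption tk = k seconds: it evaluates Rn for a range of n and picks the largest. The function names `mean_rate` and `linear_times`, and the search range 2 to 10, are illustrative assumptions.

```python
import math


def mean_rate(times):
    """R_n = H_n / T_n = n * log2(n) / (t_1 + ... + t_n) for equiprobable symbols."""
    n = len(times)
    return n * math.log2(n) / sum(times)


def linear_times(n, t1=1.0):
    """Transmission times t_k = k * t1, as in the example (t1 = 1 sec)."""
    return [k * t1 for k in range(1, n + 1)]


rates = {n: mean_rate(linear_times(n)) for n in range(2, 11)}
best_n = max(rates, key=rates.get)

for n, r in rates.items():
    print(f"n = {n:2d}  R_n = {r:.4f}  closed form = {2 * math.log2(n) / (n + 1):.4f}")
print("n with the largest mean rate:", best_n)
```

With these times the rate peaks at n = 4 (R4 = 0.8 bits per second), just ahead of n = 3.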
DNA Replication
James D. Watson (1928 -), Francis Crick (1916 - 2004), Molecular structure of nucleic acids,
Nature, 171(1953), pp.737--738.
http://www.mun.ca/biology/scarr/An11_01_DNA_replication.mov
Communication Model for DNA Replication
Fact:
o DNA replication is the same for all genomes
o Replication is a sequential process, one base at a time
Observation:
o Each species' genome is an information source
o A genome upon replication is a transmitted message
Conceptual Model:
DNA replication is an all-purpose channel
Questions:
Why 4 bases: A, T, C, G?
Replication Mean Rate: Rn = Hn / Tn (per-base diversity rate)
Assumption:
o Weaker chemical bonds take longer to replicate (Heisenberg's Uncertainty Principle: Δt ΔE ~ constant). Time scale of a single hydrogen bond pairing: 4×10^-15 sec.
o Pairing times of high-energy bonds are ignored (as a first-order approximation for the pairing time)
o tA = tT = pairing time of one H…O bond = t0
  tG = tC = pairing time of two H…O bonds = 2 t0
  t5 = t6 = pairing time of three H…O bonds = 3 t0, etc.
  (by Watson and Crick's base pairing principle)
The Result
Let k = # of base pairs, and
n = # of bases
Then
n=2k
Since $t_{2m-1} = t_{2m} = m\,t_0$ for $m = 1, 2, \dots, k$,
$R_n = H_n / T_n = \log_2 n / [\,2 (t_1 + t_3 + \dots + t_{2k-1}) / n\,]$
$= \log_2 n / [\,(n/2 + 1)\, t_0 / 2\,]$
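The result can be evaluated directly. The sketch below builds the pairing times t(2m-1) = t(2m) = m t0, computes Rn for even n, and compares it with the closed form; the function names and the choice t0 = 1 are illustrative assumptions.

```python
import math


def pairing_times(n, t0=1.0):
    """Times for n = 2k bases: bases 2m-1 and 2m both take m * t0."""
    if n < 2 or n % 2:
        raise ValueError("n must be an even number of bases (n = 2k)")
    return [((j + 1) // 2) * t0 for j in range(1, n + 1)]


def replication_rate(n, t0=1.0):
    """R_n = log2(n) / T_n, with T_n the average pairing time per base."""
    times = pairing_times(n, t0)
    return math.log2(n) / (sum(times) / n)


def closed_form_rate(n, t0=1.0):
    """R_n = log2(n) / [(n/2 + 1) * t0 / 2]."""
    return math.log2(n) / ((n / 2 + 1) * t0 / 2)


rates = {n: replication_rate(n) for n in range(2, 13, 2)}
best_n = max(rates, key=rates.get)

for n, r in rates.items():
    print(f"n = {n:2d}  R_n = {r:.4f}  R_n / R_4 = {r / rates[4]:.4f}")
print("n with the largest replication rate:", best_n)
```

In this first-order model the rate is largest at n = 4, with R2 / R4 = 0.75 and R6 / R4 just under 0.97.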
A further refined model predicts
1.65 < tC,G / tA,T < 3 ⇒ R4 = the optimal rate
1.8267
2 Sexes Problem
Sexual reproduction is a process of information exchange
Reproduction Mean Ratio: Sn = Hn / En
Assumption:
o Information payoff per crossover base for n sexes:
Hn = log2 n
o 1:1 sex ratio, with M members of each sex
o The cost of sexual reproduction in energy and time is inversely proportional to the probability that a group of n members contains exactly one member of each sex
o A reproductive group is formed by random encounter
Reproductive Probability:
Reproductive Group in k Tries:
Expected Tries for One Reproductive Group :
Expected Tries for One Reproductive Group for Large Population :
The Result: Entropy-to-Cost Ratio: Sn = Hn / En ,
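One way to read the stated assumptions, offered as a sketch rather than as the talk's own formulas: draw n distinct members at random from a population of n sexes with M members each, count the chance of getting exactly one of each sex, and take the expected number of tries (1 / probability) as the cost En. All function names, and the large-population limit n!/n^n, are assumptions of this sketch.

```python
import math


def reproductive_probability(n, M):
    """Chance that n random distinct members (n sexes, M each) cover every sex once."""
    return M ** n / math.comb(n * M, n)


def large_population_probability(n):
    """Limit of the probability above as M grows: n! / n^n."""
    return math.factorial(n) / n ** n


def entropy_to_cost(n):
    """S_n = H_n / E_n with H_n = log2(n) and E_n = expected tries = 1 / probability."""
    expected_tries = 1.0 / large_population_probability(n)
    return math.log2(n) / expected_tries


for n in range(2, 7):
    print(f"n = {n}  p(M=1000) = {reproductive_probability(n, 1000):.4f}  "
          f"p(limit) = {large_population_probability(n):.4f}  "
          f"S_n = {entropy_to_cost(n):.4f}")
print("n with the largest S_n:", max(range(2, 7), key=entropy_to_cost))
```

Under this reading the ratio is largest for two sexes, because the cost of assembling a complete group grows much faster than log2 n.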
M = 10m
Genetic Entropy Exchange without Sexual but Existential Cost :
Multiparous Strategy
Multiparous Entropy:
Multiparous Cost :
Multiparous Entropy to Cost Ratio :
With Mixed (Random & Wedlock) Cost :
Discussions
Rates relative to n=4 (a=2):
        Rn / R4    Slower by    Evolutionary Set-back by
n=2     < 0.75     > 25%        > 1 billion yrs
n=6     < 0.98     > 2%         > 80 million yrs
Evolutionary Clock
Set-back with 3 Sexes:
o Life on Earth could not have evolved faster and had a richer diversity at the same time
o Consistent with the Darwinian theory of survival of the fittest, but at the molecular level
Question: Was the origin of life driven by
informational selection?
The Role of Mathematics
o Why is the per-base diversity measured by
$H_n = \log_2 n$ or $H(p) = \sum_k p_k \log_2 (1/p_k)$?
$\log_2 1/(p_1 p_2) = \log_2 (1/p_1) + \log_2 (1/p_2)$ ⇒ Information is additive
o Mathematics is driven by open problems
o Science is driven by existing solutions
o Mathematical modeling is to discover the mathematics
to which Nature fits as a solution
o Exception to the rule is the rule in biology
Acknowledgements
o Dr. Reg Garrett, Department of Biology, University of Virginia,
regarding the GC transcription elongation problem
o Dr. David Ussery, Center for Biological Sequence Analysis,
Technical University of Denmark, on most base frequency data
o Dr. Daniel Smith, Department of Biology, Oregon State University,
regarding the base frequencies of P. ubique
o Dr. Tony Joern, Department of Biology, UNL, Kansas State University
o Dr. Etsuko Moriyama, the Beadle Center for Genetics Research,
University of Nebraska-Lincoln
o Dr. Hideaki Moriyama, Dr. Xiao-Cheng Zhen,
Department of Chemistry, University of Nebraska-Lincoln
o Irakli Loladze, David Logan, Department of Mathematics, UNL
The show of life is on your DNA channel
We are consumers of
reproductive entropy
                  Base Frequency (%)
Genome            A      T      G      C      d      Δ
S. coelicolor     13.9   14.0   36.1   36.0   0.1%   -44.2%
E. coli K-12      24.6   24.6   25.4   25.4   0.0%   -1.6%
E. coli O15:H7    24.8   24.7   25.2   25.2   0.1%   -1.0%
Human*            29.4   29.7   20.5   20.4   0.3%   18.2%
P. ubique         35.3   35.0   14.9   14.8   0.3%   40.6%
W. glossinidia    38.8   38.7   11.2   11.3   0.1%   55.0%
* Base frequency for chromosome 14, which has the largest d.
$d = \max\{\, |p_A - p_T|,\ |p_G - p_C| \,\}$
$\Delta = (p_A + p_T) - (p_G + p_C)$
Viruses are taking advantage of the replication system
by having the near maximal per-base diversity entropy
and having their hosts do the replication for them.
                  Base Frequency (%)
Genome            A      T      G      C      d      Δ       H(p)
phage P1          26.1   26.6   23.5   23.8   0.5%   5.4%    1.9978
phage T4          31.8   32.9   16.5   18.8   2.3%   29.5%   1.9355
phage VT2-Sa      25.6   24.5   26.9   23.0   3.9%   0.2%    1.9976
phage 933W        27.6   22.8   27.4   22.2   5.2%   0.8%    1.9927
phage phiX174     24.0   31.3   23.3   21.5   7.3%   10.6%   1.9846
max.                                                         2.0000
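The table columns can be recomputed from the four base frequencies. In this sketch d is taken as max{|pA - pT|, |pG - pC|}, the second column as (pA + pT) - (pG + pC), and H(p) as the sum of pk log2 1/pk; the function name `base_stats` and the renormalization of the rounded percentages are assumptions of the sketch.

```python
import math


def base_stats(pA, pT, pG, pC):
    """Return (d, delta, H) for base frequencies given in percent."""
    d = max(abs(pA - pT), abs(pG - pC))        # strand mismatch, in percent
    delta = (pA + pT) - (pG + pC)              # AT excess over GC, in percent
    total = pA + pT + pG + pC                  # rounded data may not sum to 100
    probs = [x / total for x in (pA, pT, pG, pC)]
    h = sum(p * math.log2(1.0 / p) for p in probs if p > 0)
    return d, delta, h


# Frequencies for phage P1 as listed in the table.
d, delta, h = base_stats(26.1, 26.6, 23.5, 23.8)
print(f"phage P1: d = {d:.1f}%  delta = {delta:.1f}%  H(p) = {h:.4f}")

# Equal frequencies give the maximal entropy log2 4 = 2 bits.
print(base_stats(25, 25, 25, 25))
```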
To Maximize Stationary Entropy:
H(p) = p1 log2 1/p1 +…+ pn log2 1/pn
                  Base Frequency (%)
Genome            A      T      G      C      d      Δ        H(p)     tA,T R(p) **
S. coelicolor     13.9   14.0   36.1   36.0   0.1%   -44.2%   1.8538   1.1623
E. coli K-12      24.6   24.6   25.4   25.4   0.0%   -1.6%    1.9998   1.4093
E. coli O15:H7    24.8   24.7   25.2   25.2   0.1%   -1.0%    1.9999   1.4122
Human*            29.4   29.7   20.5   20.4   0.3%   18.2%    1.9834   1.4005
P. ubique         35.3   35.0   14.9   14.8   0.3%   40.6%    1.8774   1.5081
W. glossinidia    38.8   38.7   11.2   11.3   0.1%   55.0%    1.7688   1.4921
* Base frequency for chromosome 14, which has the largest d.
** a ≈ 1.8267
Others have to scramble with individual
and absolute Channel Capacities, i.e.,
Objective: Max. $R(p) = H(p) / T(p)$
Subject to: $p_1 + p_2 + \dots + p_n = 1$, $p_k > 0$
Optimization Result:
o $p_A = p_T$, $p_G = p_C$
o $p_G = p_A^{a}$, $a = t_{G,C} / t_{A,T}$
o $K = \max R(p) = \log_2 (1/p_A) / t_{A,T}$
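The optimization result can be explored numerically. The sketch assumes T(p) is the mean pairing time, T(p) = tA,T [(pA + pT) + a (pG + pC)], which the slides do not spell out; it solves 2 pA + 2 pA^a = 1 by bisection and compares the resulting capacity K with the rate of one listed genome. Function names, the bisection tolerance, and tA,T = 1 are illustrative assumptions.

```python
import math

A_RATIO = 1.8267  # a = tG,C / tA,T, the value quoted in the table footnote


def entropy(p):
    """H(p) = sum of p_k * log2(1/p_k)."""
    return sum(x * math.log2(1.0 / x) for x in p if x > 0)


def rate(p, a, t_at=1.0):
    """R(p) = H(p) / T(p), with T(p) assumed to be the mean pairing time."""
    pA, pT, pG, pC = p
    mean_time = t_at * ((pA + pT) + a * (pG + pC))
    return entropy(p) / mean_time


def optimal_distribution(a, tol=1e-12):
    """Solve pA = pT, pG = pC = pA**a, 2*pA + 2*pA**a = 1 by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 2 * mid + 2 * mid ** a > 1:
            hi = mid
        else:
            lo = mid
    pA = (lo + hi) / 2
    return (pA, pA, pA ** a, pA ** a)


p_opt = optimal_distribution(A_RATIO)
K = math.log2(1.0 / p_opt[0])          # capacity in units of 1 / tA,T
ecoli = (0.246, 0.246, 0.254, 0.254)   # E. coli K-12 frequencies from the table

print("optimal (pA, pT, pG, pC):", tuple(round(x, 4) for x in p_opt))
print(f"K = {K:.4f}   R(p_opt) = {rate(p_opt, A_RATIO):.4f}")
print(f"R(E. coli K-12) = {rate(ecoli, A_RATIO):.4f}")
```

At the optimum R(p) coincides with log2(1/pA), and the optimal composition is AT-rich, which is in line with P. ubique having the highest rate in the table.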