The Birth of Smooth
Biological Codes in a Rough
Evolutionary World
Shalev Itzkovitz, Guy Shinar, Uri Alon
TT
o Biological codes are information channels or
maps with natural ‘fitness’ measure.
o Codes are evolved and selected according
to their fitness or ‘smoothness’.
o The emergence of a code is a phase
transition in an information channel.
o Topology of errors (noise) governs the
emergent code.
Biological codes are (often) maps
• Biological code is a mapping between two sets of
molecules:
– Transcription net: Proteins → DNA binding sites
– Protein-protein recognition: immune system…
– Protein synthesis: DNA → Proteins
The genetic code
DNA
Proteins
Information flows from DNA to RNA
to proteins through the genetic code
DNA: ACGGAGGTACCC (4 letters)
RNA: ACGGAGGUACCC (4 letters)
Protein: Thr Glu Val Pro (20 letters)
• The 20 letters are the amino acids.
• Proteins are amino acid polymers.
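The flow on this slide can be traced in a few lines of Python. This is a minimal sketch: the function names are illustrative, the lookup table holds only the four codons shown here, and transcription is reduced to the letter swap T → U used on the slide.

```python
# Only the four codons that appear on the slide; not the full genetic code.
CODON_TABLE = {"ACG": "Thr", "GAG": "Glu", "GUA": "Val", "CCC": "Pro"}


def transcribe(dna: str) -> str:
    """Rewrite a DNA string in the RNA alphabet (T becomes U)."""
    return dna.replace("T", "U")


def translate(rna: str) -> list[str]:
    """Read the RNA three letters at a time and look up each triplet."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing triplet
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3)]


rna = transcribe("ACGGAGGTACCC")
protein = translate(rna)
print(rna, protein)
```

The 12-letter message becomes four amino acids, which is the 4-letter to 20-letter change of alphabet described above.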
Each of the 20 amino acids
has specific chemistry
• Amino acid = backbone +
specific side group.
• Some amino acids are
hydrophilic, hydrophobic,
basic, acidic…
• The diversity of amino
acids allows proteins to
perform a wide variety
of functions efficiently.
Each of the 20 amino acids is
encoded by a triplet of RNA letters
ACG → Thr
GAG → Glu
GUA → Val
CCC → Pro
• Genetic Code = mapping triplets to amino acids.
• 64 = 4^3 triplet codons encode only 20 amino acids
(degeneracy).
• Only 48 discernible codons due to U-C “wobble” at 3rd base.
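The two counts can be checked by enumeration. The sketch below assumes that "wobble" simply means U and C are indistinguishable in the 3rd position; the merged symbol "Y" and the function name are illustrative choices, not part of the slides.

```python
from itertools import product

BASES = "UCAG"
codons = ["".join(letters) for letters in product(BASES, repeat=3)]


def wobble_class(codon: str) -> str:
    """Merge U and C at the 3rd base into one symbol, Y (illustrative label)."""
    third = "Y" if codon[2] in "UC" else codon[2]
    return codon[:2] + third


discernible = {wobble_class(codon) for codon in codons}
print(len(codons), len(discernible))
```

The 3rd position then has three discernible letters instead of four, which gives 4 × 4 × 3 = 48.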
The genetic code is smooth,
degenerate and compact
• Redundancy – only 20 of 48.
• Degeneracy – mostly in the 3rd base
• Close codons separated by a single
letter (Hamming Distance = 1)
• Smoothness – Close codons encode
chemically similar amino acids.
( Hydrophobic xUx, hydrophilic
xAx).
• Compactness – single contiguous
domain per each amino-acid.
• The code is highly nonrandom
(“one in a million” [Haig & Hurst]).
Shades: lighter (darker) – low (high) polarity.
Letters: black (white) – hydrophobic (hydrophilic)
yellow – medium. [Knight, Freeland, Landweber]
Biological codes evolve(d)
to cope with inherent noise
• Messages are written in molecular words that are
read and interpreted by other molecules, which
calculate the response etc…
• Typical energy scale ~ a few kBT.
• Thermal noise → errors.
• Information channels adapt to errors through
evolution by selection and mutation.
• Some errors (mutations) are essential to evolution
…
The code is an information channel
with an average distortion
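[Diagram: encoding U maps an amino acid α to a codon i; misreading W turns codon i into codon j; decoding V maps codon j to an amino acid β; the distortion D_αβ compares α with β.]
$H_{UV} = \sum_{\text{paths}} P_{\alpha i j \beta} D_{\alpha\beta} = \sum_{\alpha, i, j, \beta} P_\alpha U_{\alpha i} W_{ij} V_{j\beta} D_{\alpha\beta}$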
• U and V are binary matrices that determine the code
• W is the misreading (noise) stochastic matrix
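The sum over paths can be spelled out directly. The sketch below is a toy with two amino acids and four codons arranged on a ring; the misreading rate EPS, the prior P, the 0/1 distance D and both example codes are assumptions made for illustration and do not come from the slides.

```python
def error_load(P, U, W, V, D):
    """H = sum over a, i, j, b of P[a] * U[a][i] * W[i][j] * V[j][b] * D[a][b]."""
    total = 0.0
    for a, p_a in enumerate(P):
        for i, u_ai in enumerate(U[a]):
            for j, w_ij in enumerate(W[i]):
                for b, v_jb in enumerate(V[j]):
                    total += p_a * u_ai * w_ij * v_jb * D[a][b]
    return total


EPS = 0.05  # assumed chance of misreading a codon as each of its two ring neighbours
N_CODONS = 4
W = [[0.0] * N_CODONS for _ in range(N_CODONS)]
for i in range(N_CODONS):
    W[i][i] = 1 - 2 * EPS
    W[i][(i + 1) % N_CODONS] = EPS
    W[i][(i - 1) % N_CODONS] = EPS

P = [0.5, 0.5]        # assumed equal demand for the two amino acids
D = [[0, 1], [1, 0]]  # assumed distance: 0 if reproduced correctly, 1 otherwise

# "Smooth" code: neighbouring codons 0,1 decode to amino acid 0; codons 2,3 to amino acid 1.
U_smooth = [[1, 0, 0, 0], [0, 0, 1, 0]]
V_smooth = [[1, 0], [1, 0], [0, 1], [0, 1]]

# "Rough" code: amino acids alternate around the ring.
U_rough = [[1, 0, 0, 0], [0, 1, 0, 0]]
V_rough = [[1, 0], [0, 1], [1, 0], [0, 1]]

h_smooth = error_load(P, U_smooth, W, V_smooth, D)
h_rough = error_load(P, U_rough, W, V_rough, D)
print(h_smooth, h_rough)
```

In this toy the smooth code has half the error-load of the rough one, because one of the two possible misreadings lands on a codon that decodes to the same amino acid.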
Fitter code is one with less distortion
• The ‘error-load’ H measures the difference
between desired and the reproduced amino-acids.
• H is a natural measure for the fitness of the code.
• For better codes the encoding U and the decoding
V are optimized with respect to the reading W.
• The decoded amino-acids must be diverse enough
to map diverse chemical properties.
• However, to minimize the impact of errors it is
preferable to decode fewer amino-acids.
Theories on the origin of the code:
Frozen accident or optimization?
Frozen accident hypothesis:
Any change in the code affects all the proteins in the cell and therefore will be too harmful. Life began with very few amino acids. New amino acids were added until eventually the code became frozen in its present form. [Crick 1968]
Load minimization hypothesis:
Darwinian dynamics optimize the code to minimize errors in information flow (due to mutations, misreading). [Sonneborn, Zuckerkandl & Pauling… 1965]
Variant codes - evidence for ongoing
optimization of the code
Second base:   U          C          A          G
               UUU Phe    UCU Ser    UAU Tyr    UGU Cys
               UUC Phe    UCC Ser    UAC Tyr    UGC Cys
               UUA Leu    UCA Ser    UAA TER    UGA TER
               UUG Leu    UCG Ser    UAG TER    UGG Trp
               CUU Leu    CCU Pro    CAU His    CGU Arg
               CUC Leu    CCC Pro    CAC His    CGC Arg
               CUA Leu    CCA Pro    CAA Gln    CGA Arg
               CUG Leu    CCG Pro    CAG Gln    CGG Arg
               AUU Ile    ACU Thr    AAU Asn    AGU Ser
               AUC Ile    ACC Thr    AAC Asn    AGC Ser
               AUA Ile    ACA Thr    AAA Lys    AGA Arg
               AUG Met    ACG Thr    AAG Lys    AGG Arg
               GUU Val    GCU Ala    GAU Asp    GGU Gly
               GUC Val    GCC Ala    GAC Asp    GGC Gly
               GUA Val    GCA Ala    GAA Glu    GGA Gly
               GUG Val    GCG Ala    GAG Glu    GGG Gly
• Variants of the “universal”
genetic code in many organisms
[Osawa, Jukes 1992].
• All variants use the same
twenty amino-acids
(universal invariant?)
• Continuity - Most changes are
to a neighboring amino-acid.
(‘hydrodynamic’ flow ?)
Codes compete by their error-load
• One letter change in DNA can change one amino
acid in one protein. If the new amino acid is
similar to the original the upset is minimal.
• The organism with the smallest error-load takes
over the population.
• Relatively small population and high noise levels in
protein synthesis: selection forces are weak compared
with random drift (selection ≪ drift).
Code’s evolution reaches steady-state
• Small effective population and strong drift.
• Population is in detailed balance and therefore
P(fitness) ~ exp(fitness/T) [Lassig, Sella & Hirsh]
• Smaller population is hotter:
T ~ 1/N_eff.
• The Boltzmannian probability $P_{UV} \sim \exp(-H_{UV}/T)$
minimizes a ‘free energy’
$F = \langle H \rangle - TS = \sum H_{UV} P_{UV} + T \sum P_{UV} \log P_{UV}$
• F is used to optimize information channels …
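The steady-state picture can be tried on a handful of candidate codes. In the sketch below the four error-loads are invented numbers and the function names are illustrative; only the Boltzmann weights and the free energy follow the slide, with entropy S = -Σ P log P.

```python
from math import exp, log


def boltzmann(loads, temperature):
    """Probabilities P ~ exp(-H/T) over a list of error-loads, plus the normalisation Z."""
    weights = [exp(-h / temperature) for h in loads]
    z = sum(weights)
    return [w / z for w in weights], z


def free_energy(loads, probs, temperature):
    """F = <H> - T*S, written as sum(H*P) + T*sum(P*log P)."""
    mean_load = sum(h * p for h, p in zip(loads, probs))
    minus_entropy = sum(p * log(p) for p in probs if p > 0)
    return mean_load + temperature * minus_entropy


loads = [0.05, 0.10, 0.10, 0.20]  # hypothetical error-loads of four candidate codes

for temperature in (0.01, 0.1, 10.0):
    probs, z = boltzmann(loads, temperature)
    print(temperature, [round(p, 3) for p in probs], free_energy(loads, probs, temperature))
```

At low T (large population) nearly all the weight sits on the code with the smallest load, while at high T the four codes become almost equally probable. For the Boltzmann probabilities the free energy equals -T log Z, and any other distribution over the same codes gives a larger F.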
At high T no code is chosen
• At high T (small populations) Boltzmann implies
that all codes are equally probable: <Uαi> = 1/NC
• The natural order parameter is uαi= <Uαi>-1/NC
• At high T the state is random ‘non-coding’ uαi=0
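• Stability of F is determined by
$\delta F \sim \sum_{i,j,\alpha,\beta} \left( T\,\delta_{ij}\,\delta_{\alpha\beta} - (w^2)_{ij}\, d_{\alpha\beta} \right) u_{\alpha i}\, u_{\beta j} = u^{t} \left( T\, I_\delta \otimes I_w - w^2 \otimes d \right) u$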
• w – the preference of the reading w = W − 1/NC
d – normalized chemical distance matrix
Code emerges at a phase transition
• When T is decreased below Tc an inhomogeneous
coding state appears
$\delta F \sim u^{t} \left( T\, I_\delta \otimes I_w - w^2 \otimes d \right) u$
• Critical temperature
$T_c = \lambda_{w^2} \times \lambda_d$
• The code is the mode uαi of F that corresponds to
these maximal eigenvalues.
• Tc increases with the accuracy of reading w.
• The phase transition is continuous (2nd order).
• Analogous phase transition in information channels
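The dependence of Tc on reading accuracy can be illustrated numerically. The sketch below assumes a toy reading matrix (each codon is read correctly with probability 1 - eps, otherwise as any other codon with equal probability), treats the maximal eigenvalue of d as a given number `lambda_d`, and finds the maximal eigenvalue of w² by plain power iteration. None of these choices come from the slides.

```python
from math import sqrt


def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def max_eigenvalue(matrix, iterations=100):
    """Largest eigenvalue of a symmetric positive semi-definite matrix (power iteration)."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in mv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in mv]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * mv[i] for i in range(n))  # Rayleigh quotient of the unit vector v


def critical_temperature(eps, n_codons=4, lambda_d=1.0):
    """T_c = (max eigenvalue of w^2) * lambda_d for the assumed toy reading matrix."""
    W = [[1 - eps if i == j else eps / (n_codons - 1) for j in range(n_codons)]
         for i in range(n_codons)]
    w = [[W[i][j] - 1 / n_codons for j in range(n_codons)] for i in range(n_codons)]
    return max_eigenvalue(matmul(w, w)) * lambda_d


for eps in (0.01, 0.1, 0.3):
    print(eps, critical_temperature(eps))
```

For this toy matrix the result agrees with the closed form (1 - eps·n/(n-1))² · lambda_d, so Tc falls as the misreading rate grows and vanishes when reading becomes completely random.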
Why twenty amino-acids?
• Code is the mode uαi that minimizes the free energy.
• This mode corresponds to the maximal w - eigenvalue.
• Knowledge of w at the phase transition yields code.
• What can we say without such knowledge? (Why 20?)
• More amino-acids: more sensitivity to errors.
• Fewer amino-acids reduce functionality of proteins.
• Historical mechanisms: freezing, biosynthetic, etc.
• Twenty as a topological feature of generic
evolutionary phase transition?
The probable errors define the graph
and the topology of the genetic code
• Graph = codon vertices +
one-letter difference edges ( Hamming = 1 )
K4 × K4 × K3
[Figure: the codon graph; vertices are codons (e.g. AAA, ACA, AUA, GAA) joined by one-letter-difference edges]
Topology and genus of a simpler code
[Figure: the nine doublet codons over the bases U, A, C (UU, UA, UC, AU, AA, AC, CU, CA, CC) drawn on a torus]
Doublet code with 3 bases
is embedded on a torus
Each codon has d = 4 neighbors
V = vertices, E = edges, F = faces
Euler’s characteristic:
χ = V – E + F
Euler genus (# holes):
γ = 1 – (1/2) χ
Faces are quadrilateral mutation cycles:
F = V (d/4) = 9 ; E = V (d/2) = 18
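The counting on this slide is easy to reproduce. The sketch below assumes, as the slide does, that every face is a quadrilateral, so that F = V·d/4; the function name and the dictionary layout are illustrative.

```python
def euler_counts(alphabet_sizes):
    """V, E, F, chi and genus for codons with the given number of letters per position."""
    vertices = 1
    for size in alphabet_sizes:
        vertices *= size
    degree = sum(size - 1 for size in alphabet_sizes)  # one-letter changes per codon
    edges = vertices * degree / 2   # every edge joins two codons
    faces = vertices * degree / 4   # assumed: every face is a 4-cycle of mutations
    chi = vertices - edges + faces
    genus = 1 - chi / 2
    return {"V": vertices, "d": degree, "E": edges, "F": faces, "chi": chi, "genus": genus}


doublet = euler_counts((3, 3))
print(doublet)
```

For the doublet code this gives V = 9, E = 18, F = 9, hence χ = 0 and γ = 1, the torus. The same function accepts any list of alphabet sizes, one entry per codon position.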
The genetic code graph is holey
• The 48-codon graph K4 × K4 × K3:
– Each codon has degree d = 3 + 3 + 2 = 8, therefore
• E = 48 (d/2) = 192 edges
• F = 48 (d/4) = 96 faces
– The Euler characteristic is χ = V – E + F = –48
– Euler’s genus is
γ = 1 – (1/2) χ = 25 (24 holes + Klein)
– Embedding by group automorphism analysis
• Can one hear the shape of the code?
The genetic code has a spectrum
• uαi is average preference of codon i to encode α.
• Every mode corresponds to an amino-acid
-> number of modes = number of amino-acids.
• Misreading w is actually the graph Laplacian
$w = -(\Delta - \Delta_{\text{random}})$,
where $\Delta_{ij} = -W_{ij}$ for $i \neq j$ and $\Delta_{ii} = \sum_{j \neq i} W_{ij}$
• Δ measures the difference between codons and
their neighbors, a natural measure for error load.
• Maximal mode of w is the 2nd eigenmode of Δ
• Courant’s theorem: uαi have a single maximum
-> single contiguous domain for each amino-acid.
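The statement that Δ measures the difference between codons and their neighbors can be checked on the 64-codon graph. The sketch below assumes unit misreading weights (W_ij = 1 for one-letter neighbors, 0 otherwise) and compares two illustrative 16-codon assignments: the contiguous block xUx and a set chosen so that no two members are neighbors.

```python
from itertools import product

BASES = "UCAG"
codons = ["".join(letters) for letters in product(BASES, repeat=3)]
n = len(codons)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


# Assumed toy misreading weights: 1 between one-letter neighbours, 0 otherwise.
W = [[1.0 if hamming(a, b) == 1 else 0.0 for b in codons] for a in codons]

# Graph Laplacian: off-diagonal -W_ij, diagonal sum of W_ij over j != i (W_ii is 0 here).
laplacian = [[sum(W[i]) if i == j else -W[i][j] for j in range(n)] for i in range(n)]


def load(u):
    """Quadratic form u^t Δ u, which equals the sum over edges of (u_i - u_j)^2."""
    return sum(u[i] * laplacian[i][j] * u[j] for i in range(n) for j in range(n))


# Compact domain: all codons with U in the middle position.
compact = [1.0 if codon[1] == "U" else 0.0 for codon in codons]
# Scattered set of the same size: letter indices sum to a multiple of 4.
scattered = [1.0 if sum(BASES.index(x) for x in codon) % 4 == 0 else 0.0 for codon in codons]

print(sum(compact), sum(scattered), load(compact), load(scattered))
```

Both assignments cover 16 codons, but the compact block crosses 48 edges while the scattered set crosses 144, so the contiguous domain carries one third of the load under these assumptions.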
Topology optimizes amino-acid
assignment in compact domains
• uαi have single compact domains with one maximum
and one minimum (Courant’s theorem).
• Compact organization reduces impact of errors.
• Single domain in any direction (linearity): $\sum_\alpha n_\alpha u_{\alpha i}$
• Embedding in $R^{N-1}$ is tight
→ The code graph contains the complete graph $K_N$
[Banchoff 1965, Colin de Verdière 1987]
• Number of amino-acids = N = chr(γ)
Coloring number of graph code is an upper
limit for the number of amino-acids
• What is the minimal number of colors required in a map
so that no two adjacent regions have the same color?
• The coloring number is a topological invariant and
therefore a function of the genus solely: $\mathrm{chr}(\gamma) = \max(K_N)$
• Heawood’s conjecture [Ringel & Youngs, Appel & Haken]
$\mathrm{chr}(\gamma) = \left\lfloor \frac{1}{2} \left( 7 + \sqrt{1 + 48\gamma} \right) \right\rfloor$
$N \le \mathrm{chr}(\gamma_{\text{code}})$
Values of chr(γ) for γ = 0, 1, …, 49:
4, 7, 8, 9, 10, 11, 12, 12, 13, 13,
14, 15, 15, 16, 16, 16, 17, 17, 18, 18,
19, 19, 19, 20, 20, 20, 21, 21, 21, 22,
22, 22, 23, 23, 23, 24, 24, 24, 24, 25,
25, 25, 25, 26, 26, 26, 27, 27, 27, 27
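Heawood's formula is short enough to evaluate directly. The sketch below uses integer square roots to avoid rounding problems; the function name is illustrative, and the formula is applied as written on this slide, without the known exception for the Klein bottle.

```python
from math import isqrt


def coloring_number(genus: int) -> int:
    """Heawood bound floor((7 + sqrt(1 + 48*genus)) / 2), computed with integers only."""
    return (7 + isqrt(1 + 48 * genus)) // 2


print([coloring_number(g) for g in range(10)])
print(coloring_number(25))
```

Genus 0 gives 4 (the four-color theorem), the torus gives 7, and the genus γ = 25 of the 48-codon graph gives 20.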
The genetic code coevolves with increasing
accuracy of translation
• A path for evolution of codes:
from early codes with higher codon
degeneracy and fewer amino acids
to lower degeneracy codes with
more amino acids.
• Preliminary simulations (K4 × K4)
• Twenty amino acids is invariant
even in variant codes. 21st and 22nd
amino acids are context dependent.

1st   2nd   3rd   γ     chr #
1     4     1     0     4
2     4     1     1     6
4     4     1     5     11
4     4     2     13    16
4     4     3     25    20
4     4     4     41    25
Summary
• The 64-codon triplet code is patterned and
degenerate; it maps to only 20 amino acids.
• The governing evolutionary dynamics is an interplay between
protein diversity and error penalty, described by a
stochastic diffusion equation.
• The 1st excited state of this diffusive mapping dynamics
on the high-genus surface of the code yields a pattern of
20 ordered amino acids (20 = the coloring number of the
graph).
• Topology + dynamics → Coloring (?)
Transcription network is a code that
relates DNA sites and binding proteins
• Reading DNA to synthesize proteins is controlled by a
system of protein-DNA interactions (transcription net).
• Presence/absence of transcription factor may
repress/enhance synthesis of protein from nearby gene.
• The transcription network is actually a code that relates
proteins with their DNA targets.
• Like the genetic code, transcription is subject
to evolutionary forces and
adapts to minimize errors.
[Figure labels: TF, Pol, DNA]
Probable recognition errors define
the binding sequence space
• Sphere packing (Shannon)
• Overlap and continuity
• Analogy: TF ↔ AA, codon ↔ binding site
• Typical binding site: 6 base pairs = 12 bits
• Hamming = 1 graph: K4^6 -> 4096 ‘codons’
Probable recognition errors define
the binding sequence space
• Coloring number estimate (L = 6):
v = 4^L
e ~ 4^L (3/2) L
f ~ 4^L (3/4) L
-> γ ~ 4^L (3/8) L
• The coloring #
chr(γ) ~ 300
[Plot: vertical axis from 10^0 to 10^4; horizontal axis “number of genes” from 10^3 to 10^4; series labeled “winged helix” and “n-domain C2H2”]
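The estimate for binding sites can be worked through with the same counting used for the genetic code. The sketch below assumes binding sites of length L over 4 bases with quadrilateral faces only; the function name is illustrative, and the genus is computed from the exact Euler characteristic rather than from the leading-order expression.

```python
from math import isqrt


def binding_site_estimate(length: int):
    """Vertex, edge and face counts, genus and Heawood number for sites of a given length."""
    v = 4 ** length
    degree = 3 * length            # each of the L positions can change to 3 other bases
    e = v * degree // 2
    f = v * degree // 4            # assumed: all faces are quadrilaterals
    chi = v - e + f
    genus = (2 - chi) / 2          # gamma = 1 - chi/2
    # 1 + 48*gamma = 1 + 24*(2 - chi), kept as an integer for the square root
    chr_number = (7 + isqrt(1 + 24 * (2 - chi))) // 2
    return v, e, f, genus, chr_number


v, e, f, genus, chr_number = binding_site_estimate(6)
leading_order_genus = 4 ** 6 * 3 * 6 / 8   # 4^L (3/8) L
print(v, e, f, genus, chr_number, leading_order_genus)
```

For L = 6 this gives v = 4096, e = 36864, f = 18432 and an exact genus of 7169, compared with 9216 from the leading-order expression. The resulting coloring number is 296, in line with the value of about 300 quoted above.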
????
• Why does the code exhaust the coloring limit?
• Other population dynamics models (‘quasi-species’)
• Glassy 'almost-frozen' dynamics?
• The necessity of the wobble (64/48)? 25 acids?
• Generic phase transition scenario that does not depend finely
on missing details of the evolutionary pathway.
• Although not much is known about the primordial environment,
minimal assumptions about the topology of probable errors can
yield characteristics of biological codes.
• Esp. the number of twenty amino-acids in the present picture
is reminiscent of a 'shell magic number'.
Shalev Itzkovitz
Guy Shinar
Uri Alon
Guy Sella
J. –P. Eckmann
Elisha Moses