Genomes and the structure of the protein universe

Power laws, scale-free networks, the structure of the
Protein Universe and genome evolution
Nothing in (computational) biology makes
sense except in the light of evolution
after Theodosius Dobzhansky (1970)
The Protein Universe
Total number of potential protein sequences: ~$20^{200}$
Total number of existing protein sequences: $10^{10}$-$10^{11}$
GenBank 2002: ~$10^{6}$
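To put these numbers side by side, here is a minimal arithmetic sketch. It assumes the exponent in $20^{200}$ stands for a protein length of 200 residues over a 20-letter amino acid alphabet, and it uses the upper estimate of $10^{11}$ existing sequences; the variable names are illustrative.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
LENGTH = 200         # assumed protein length (the exponent in 20^200)

# log10 of the number of possible sequences of this length
log10_possible = LENGTH * math.log10(ALPHABET_SIZE)

# upper estimate of existing sequences quoted on the slide: 10^11
log10_existing = 11

# orders of magnitude separating possible from existing sequences
gap = log10_possible - log10_existing

print(f"possible sequences: about 10^{log10_possible:.0f}")
print(f"existing sequences occupy about 10^-{gap:.0f} of that space")
```

Working in logarithms keeps the comparison readable; the point is only that existing sequences sample a vanishingly small part of sequence space.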
What is the distribution of these sequences in the
sequence and structure spaces?
Hierarchical classification of proteins
Category | Example | Definition, criteria, main features
Structural class | α/β | Overall composition of structural elements. No evolutionary relationship.
Fold | P-loop | Topology of the folded protein backbone. Monophyletic origin?
Superfamily | P-loop containing nucleotide triphosphate hydrolases | Recognizable sequence similarity (at least a conserved motif); conservation of basic biochemical properties. Monophyletic origin.
Family | Nucleotide and nucleoside kinases | Significant sequence similarity; conservation of biochemical function.
Group of orthologs (COG) | Adenylate kinase | Orthologous relationships within the given set of species; conservation of the biological function.
Lineage-specific expansion | DR0202, DR0494 and DR2273 in D. radiodurans | Paralogs originating from a lineage-specific duplication.
[Figure: histogram of the number of folds (y-axis, 0-200) against the number of families per fold (x-axis, 1-10)]
The distribution of folds by the number of families
in the protein structural database (PDB).
There are many folds with 1-3 families
but only a few folds with numerous families
Altogether, there might be as many as 5,000-10,000 folds
but >90% of the families belong to <1,000 common folds
Mapping the Protein Universe is feasible!
Size distributions of domain families in two genomes (double-logarithmic plot)
[Figure: two panels, Thermotoga maritima and C. elegans]
The size distributions of folds and families are
approximated by a power law:
$f(i) \sim i^{-k}$ ($k \approx 1$-$3$)
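A power law is a straight line with slope $-k$ on a double-logarithmic plot, so a rough estimate of $k$ can be read off a least-squares line through the log-transformed data. The sketch below illustrates this on synthetic, noise-free counts; the function name and the data are assumptions for illustration, not values from any genome.

```python
import math

def fit_power_law_exponent(sizes, counts):
    """Least-squares slope of log(count) vs log(size); returns k in f(i) ~ i^-k."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -(sxy / sxx)   # the fitted slope is -k

# Synthetic data generated with k = 2 (illustrative only)
sizes = list(range(1, 51))
counts = [1000.0 * i ** -2 for i in sizes]

k_hat = fit_power_law_exponent(sizes, counts)
print(round(k_hat, 3))
```

On real, noisy counts a log-log regression is only a crude estimator, and maximum-likelihood fitting is generally preferred.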
Power laws describe distributions of a number of
quantities in biological and other contexts, e.g.,
the node degrees (number of connections) in
metabolic and protein interaction networks,
the Internet and social networks, citations of scientific
papers, population of cities, personal wealth…
Networks described by power laws are known as
scale-free - they look the same at different scales.
The existence of a small number of highly connected
nodes (hubs) in scale-free networks determines their
small-world properties and error tolerance
Scale-free networks evolve through
preferential attachment:
the rich get richer
or
the fit get fitter
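The "rich get richer" rule can be sketched as a toy growth process: each new node attaches to one existing node chosen with probability proportional to that node's current degree. This is a simplified illustration in the spirit of preferential attachment, not the specific model behind the slides; the function name and parameters are assumptions.

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    degree = [1, 1]        # start from two nodes joined by one edge
    endpoints = [0, 1]     # each node appears here once per edge it touches
    for new_node in range(2, n_nodes):
        # a uniform draw from the endpoint list is a degree-proportional draw
        target = rng.choice(endpoints)
        degree[target] += 1
        degree.append(1)
        endpoints.extend([target, new_node])
    return degree

degree = preferential_attachment(5000)
print("max degree:", max(degree))
print("fraction of nodes with degree 1:", degree.count(1) / len(degree))
```

Keeping a flat list of edge endpoints is a simple way to sample proportionally to degree without recomputing weights at every step.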
Domain accretion in the evolution of orthologous sets of
eukaryotic genes
[Figure: domain architectures of orthologous genes in yeasts, C. elegans, A. thaliana and D. melanogaster; domain labels in the figure: C1, C2, C3, Zk, Ub, Br]
The distribution of proteins
by the number of domains
is exponential
(if repeats are not counted)!
However, we get a power
law if repeats are
included
Domain connectivity network
The degree distribution of the domain connectivity graph is roughly
approximated by a power law
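As a rough illustration of how such a graph and its node degrees might be tabulated, the sketch below links two domains whenever they occur in the same protein. That linking rule is an assumption (the slides do not define it), and the domain labels and architectures are hypothetical placeholders.

```python
from collections import Counter
from itertools import combinations

# Hypothetical domain architectures (placeholder labels, not real data)
proteins = [
    ["A", "B"],
    ["A", "C"],
    ["A", "D", "E"],
    ["B", "C"],
    ["A"],
]

# Undirected edges: two domains are linked if they co-occur in a protein
edges = set()
for domains in proteins:
    for u, v in combinations(sorted(set(domains)), 2):
        edges.add((u, v))

# Node degree = number of distinct partner domains
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution: how many domains have each degree
degree_distribution = Counter(degree.values())
print(dict(degree))
print(dict(degree_distribution))
```

In this toy set the promiscuous domain "A" acts as the hub; the degree distribution is the quantity one would plot on double-logarithmic axes.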
Evolution of protein domain families in genomes can
be described by simple models which involve
domain birth, death and innovation (“invention”)
as elementary events
BDIM: elementary events
Birth
Death
Innovation
BDIM – Birth, Death and Innovation Model
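BDIM: the layout of the model
[Diagram: a chain of size classes 1, 2, 3, …, i, …, N. Innovation feeds class 1 at rate $\nu$; a birth moves a family from class i to class i+1 at rate $\lambda_i$; a death moves a family from class i to class i-1 at rate $\delta_i$.]
$f_i$: number of families in size class $i$
$\lambda_i$: per-family birth rate
$\delta_i$: per-family death rate
$\nu$: domain innovation rate
$N$: maximum number of domains in a family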
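BDIM: the basic equations
Rate of change of $f_i$, the number of families in size class $i$ ($\lambda_i$: per-family birth rate, $\delta_i$: per-family death rate, $\nu$: innovation rate):
$$\frac{df_1(t)}{dt} = \nu - \delta_1 f_1 - \lambda_1 f_1 + \delta_2 f_2$$
…
$$\frac{df_i(t)}{dt} = \lambda_{i-1} f_{i-1} - \delta_i f_i - \lambda_i f_i + \delta_{i+1} f_{i+1}$$
…
$$\frac{df_N(t)}{dt} = \lambda_{N-1} f_{N-1} - \delta_N f_N$$
Terms:
$\nu$: innovation (instead of "class 0" birth)
$\lambda_{i-1} f_{i-1}$: gain, birth in class i-1
$\delta_i f_i$: loss, death in class i
$\lambda_i f_i$: loss, birth in class i
$\delta_{i+1} f_{i+1}$: gain, death in class i+1
Last equation: no birth into and death from class N+1
$$F(t) = \sum_{i=1}^{N} f_i(t)$$
is the total number of families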
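Setting all time derivatives in the BDIM equations to zero and working down from class N gives the balance conditions $\nu = \delta_1 f_1$ and $\lambda_{i-1} f_{i-1} = \delta_i f_i$, from which the equilibrium family counts follow by recursion. The sketch below computes that equilibrium and checks it against the right-hand sides of the equations. The function names and the rate values are assumptions chosen for illustration, not fitted to any genome.

```python
def bdim_rhs(f, birth, death, nu):
    """Right-hand sides df_i/dt for size classes 1..N.
    f, birth, death are lists indexed 0..N-1 for classes 1..N."""
    n = len(f)
    out = []
    for i in range(n):
        gain_birth = nu if i == 0 else birth[i - 1] * f[i - 1]
        gain_death = death[i + 1] * f[i + 1] if i < n - 1 else 0.0
        loss_birth = birth[i] * f[i] if i < n - 1 else 0.0  # no birth out of class N
        loss_death = death[i] * f[i]
        out.append(gain_birth - loss_death - loss_birth + gain_death)
    return out

def bdim_equilibrium(birth, death, nu):
    """Stationary f_i from nu = death_1 f_1 and birth_{i-1} f_{i-1} = death_i f_i."""
    f = [nu / death[0]]
    for i in range(1, len(death)):
        f.append(f[i - 1] * birth[i - 1] / death[i])
    return f

# Illustrative rates only (assumed values)
N = 50
birth = [0.9 * i for i in range(1, N + 1)]  # per-family birth rate in class i
death = [1.0 * i for i in range(1, N + 1)]  # per-family death rate in class i
nu = 10.0                                   # innovation rate

f_eq = bdim_equilibrium(birth, death, nu)
residual = max(abs(x) for x in bdim_rhs(f_eq, birth, death, nu))
total_families = sum(f_eq)                  # F at equilibrium
print(residual, total_families)
```

The residual should be zero up to floating-point rounding, which confirms that the recursion satisfies every equation in the system, including the two boundary classes.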
Power Approximation vs Power Asymptote under the linear BDIM
[Figure: double-logarithmic plot of the linear BDIM together with its power approximation and its power asymptote (k = a-b-1)]
Linear BDIM: Size Does Matter?
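$\lambda_i / i = \lambda (1 + a_1 / i)$: per-domain birth rate
$\delta_i / i = \delta (1 + b_1 / i)$: per-domain death rate
[Figure: per-domain rates $\lambda_i / i$ and $\delta_i / i$ plotted against family size $i$]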
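Multiplying the per-domain rates by $i$ gives per-family rates $\lambda (i + a_1)$ and $\delta (i + b_1)$. Combined with the equilibrium balance $\lambda_{i-1} f_{i-1} = \delta_i f_i$, and under the extra assumption $\lambda = \delta$, the equilibrium is proportional to $\Gamma(i + a_1) / \Gamma(i + b_1 + 1)$, which behaves like $i^{a_1 - b_1 - 1}$ for large $i$. The sketch below checks this numerically; the parameter values are illustrative, and reading the slide's a and b as $a_1$ and $b_1$ is an assumption.

```python
import math

def log_equilibrium_shape(i, a1, b1):
    """log f_i up to an additive constant for the linear BDIM with
    lambda = delta (assumed): f_i ~ Gamma(i + a1) / Gamma(i + b1 + 1)."""
    return math.lgamma(i + a1) - math.lgamma(i + b1 + 1)

def local_slope(i_low, i_high, a1, b1):
    """Slope of log f_i versus log i between two family sizes."""
    dy = log_equilibrium_shape(i_high, a1, b1) - log_equilibrium_shape(i_low, a1, b1)
    dx = math.log(i_high) - math.log(i_low)
    return dy / dx

a1, b1 = 0.5, 1.5   # assumed illustrative values

slope_small = local_slope(1, 2, a1, b1)        # small families
slope_large = local_slope(1000, 2000, a1, b1)  # large families
print(slope_small, slope_large, a1 - b1 - 1)
```

The slope for small families differs noticeably from the asymptotic value, which is one way to see why family size matters: the distribution only approaches a pure power law in the tail.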
Conclusions
I.
The world, including biology, is full of power law
distributions and scale-free networks
II.
The emergence of these seems to be explained
by relatively simple evolutionary models
Tomorrow??
Genomics today
“There are two kinds of science: physics and stamp collection”
Attributed to Ernest Rutherford