Copying nodes vs. Editing links: the source of the
difference between genetic networks and the WWW
Yoram Louzoun*, Lev Muchnick& and Sorin Solomon#
* Department of Mathematics, Bar Ilan University, Ramat Gan, Israel, 52900
& Department of Physics, Bar Ilan University, Ramat Gan, Israel, 52900
# Hebrew University, Jerusalem, Israel and ISI Torino
Corresponding author: Yoram Louzoun, Department of Mathematics, Bar Ilan
University, Ramat Gan, Israel, 52900, email: louzouy@math.biu.ac.il, Tel ++972-352317610
Abstract
Experimental observations on networks are currently accumulating in a wide
range of disciplines and at all scales. We show that networks in different
disciplines, with different basic node and link dynamics, differ in
macroscopic features that are strongly system dependent, such as the
clustering structure, the frequency of small scale motifs, the time correlations and
the connectivity of outgoing links. Other features are generic: the large scale
connectivity distribution of incoming links (scale free) and the network diameter
(small world). The common properties are just the general hallmark of
autocatalysis, while the specific properties hinge on the specific elementary
mechanisms.
We exemplify this by studying two types of networks: genetic networks and
the World Wide Web. In the first case we formulate a model including only
random unbiased gene duplications and mutations. In the second case, the basic
moves are website generation and rapid surf-induced link creation and destruction.
In both cases we reproduce the experimental observations at all scales.
In the genetic case the emerging picture is of a centre-dominated structure
with arrows preferentially directed from weakly connected branches toward an
old, highly connected core. In the WWW case, the hierarchy instead takes the form
of strongly linked clusters within clusters. The inter-cluster connections
generically point in both directions and are age independent.
Introduction
Recent attempts to characterize the properties of networks (social, internet,
citations, WWW etc.) (1-6) have been directed to generic properties that are shared by
most evolving systems such as a small world character, scaling and auto-catalytic
growth (2, 3, 7, 8). The stress was on the generality of the phenomena over many
disciplines rather than the specific elementary interactions responsible for particular
systems. In particular the focus was on the scaling properties of the node connectivity,
which are universal and only depend on the autocatalytic nature of the processes, as
shown by Simon (9). Models reproducing only these generic properties cannot claim to
reflect the elementary mechanisms underlying the observed networks. There are
however empirically observed properties that are system specific. Those can be used in
combination with the known microscopic interactions to reverse engineer the basic laws
of these systems.
After the first years were invested in a broad-brush description of network properties
and in ad hoc microscopic mechanisms that could generate them, the time has come to
address the detailed properties of networks appearing in specific fields and to trace their
macroscopic properties to the known microscopic interactions. In other words, we place
the network analysis within the usual scientific paradigm: advancing hypotheses,
deducing their consequences and confronting them with the data. This allows for
the classical Popper falsifiability criterion (10). Specifically, we define different
microscopic laws corresponding to the known evolution mechanisms governing
individual genes and, respectively, the dynamics of WWW node formation. We then
deduce the corresponding divergent predictions for the network structure,
its main sub-network components, the incoming and outgoing node degree distributions
and its time correlations. The predictions of each model are confirmed for the
corresponding network and fail for the other one.
Results
Genetic networks
We present a simple gene evolution model, and show that it reproduces in detail
the observed properties of genetic networks. The nodes in a genetic network are
specific genes of a given organism. A link pointing from gene A to gene B implies that
gene A regulates (through the protein that it codes for) gene B.
The model addresses the following aspects:
A) The basic mechanism used in the model is the now accepted gene duplication
mechanism (11-18). Node duplication has been clearly shown to govern genetic
network generation. This is in contrast with the usual “preferential attachment”
algorithm that amounts to the random creation of new nodes (followed by selection).
B) The size of genetic networks saturates on the time scale of the network generation
(19-21). Accordingly, our model produces a scale-free connectivity distribution with an
exponent of ~2, even for non-monotonic size variation. This is again in contrast with
most currently used models, which assume a continuous growth of the network.
C) The model distinguishes between incoming and outgoing degree distributions.
This difference is obvious and has been widely observed experimentally. In most
dynamical systems, originating and receiving a link are tied to very different properties.
The production of an outgoing link requires some specific action of the origin node,
while receiving an incoming link, implies the existence of a triggering or referral
mechanism at the level of the target node. In the specific case of genetic networks, the
trigger of the links incoming to a node depends on its transcription factor binding sites
sequences, and the action of links outgoing from a node is a function of the 3D structure
of the protein transcribed by it. Moreover, the observed distributions in most genetic
networks (for example the E. coli network) are scale free for incoming links, but
normal for outgoing links.
All of those elements are incorporated into the following model: At each time
step, nodes can be copied (fig1a) or removed (fig1b), and links can be mutated (fig1c).
More precisely, at each time step one node is copied (fig1a), while each existing node
can be removed with a fixed probability per time step (fig1b). When a node is removed,
all its incoming and outgoing links are removed with it. When a node is copied, only its
outgoing links are copied; the added node has no incoming links. Links are mutated
(fig1c) randomly at a fixed rate: at each time step a constant number of links are
mutated, i.e. “duplicated” by keeping the link origin (22) and choosing a new random
target.
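The three operations above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the names `step`, `p_remove` and `n_mutations` are hypothetical, the network is stored as a map from each node to the set of its outgoing targets, self-links are skipped, and the last remaining node is never removed.

```python
import random


def step(out_links, next_id, p_remove, n_mutations, rng):
    """One time step of the copy / remove / mutate dynamics (illustrative sketch)."""
    # 1) Copy one randomly chosen node. The copy inherits only the outgoing
    #    links of its source, so it starts with no incoming links.
    source = rng.choice(sorted(out_links))
    out_links[next_id] = set(out_links[source])
    next_id += 1

    # 2) Remove each existing node with probability p_remove, together with
    #    all of its incoming and outgoing links.
    removed = {n for n in out_links if rng.random() < p_remove}
    if len(removed) == len(out_links):
        removed.pop()  # assumption: keep at least one node alive
    for n in removed:
        del out_links[n]
    for targets in out_links.values():
        targets -= removed

    # 3) Mutate a constant number of links: pick an existing link, keep its
    #    origin and add a link from that origin to a new random target.
    nodes = sorted(out_links)
    links = [(o, t) for o in nodes for t in sorted(out_links[o])]
    for _ in range(n_mutations):
        if not links:
            break
        origin, _old_target = rng.choice(links)
        target = rng.choice(nodes)
        if target != origin:  # assumption: no self-links
            out_links[origin].add(target)
    return next_id


# Usage: start from a 3-node ring and iterate the dynamics.
rng = random.Random(0)
network = {0: {1}, 1: {2}, 2: {0}}
next_id = 3
for _ in range(200):
    next_id = step(network, next_id, p_remove=0.02, n_mutations=2, rng=rng)
```

Because one node is added per step while removals scale with the current size, the network size in such a sketch fluctuates around roughly 1/`p_remove` instead of growing without bound.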
The interpretation of these abstract operations is as follows. Node copying corresponds
to gene duplication. Note that we model only the duplication of the exons (for which
there is experimental evidence (11-13)); we do not assume duplication of the intronic and
control regions, as there is no clear experimental evidence for it. Node removal
represents random deleterious mutations that result in gene destruction. Link mutation
represents mutations through which an existing protein binds a promoter of an unrelated
gene.
Our model provides a direct relation between the observed microbiological
mechanisms (duplication and mutations) in their raw form and the experimentally
observed properties of the network. The main results of our model are: A) A scale free
incoming node degree distribution with an exponent (Figure 2) compatible with the
observed 2-2.5 (i.e. a cumulative exponent of 1-1.5) (2). B) A normal outgoing node
degree distribution, again compatible with the observations (Supplementary material
figure 1). C) A hierarchical structure characterised by a high clustering coefficient that
scales with the experimentally observed exponent of -1 (Figure 3) (23). D) A small world
geometry: the node distance distribution is Gaussian and its average is proportional to
the log of the network size (inset in Figure 3), as is indeed experimentally observed (22,
24). These emerging properties are independent of the network's seed or history. E) A
positive correlation between the age of nodes and their incoming link degree (Figure 5)
(25). F) Our model also reproduces the observed sub-graph distribution. Using the
algorithm in (26) we measured the relative occurrence frequency of the n-node “motifs”
with n < 5. An n-motif is a directed sub-network with n nodes. We compared the
occurrence frequency of each motif in our networks to its frequency in randomly
generated networks with identical link degree distribution (27).
In the networks generated by our algorithm, the most statistically significant
deviation from the random network was obtained for the “bi-fan” motif (fig4A). For
instance, in a network of 894 nodes and 5074 edges it occurred 36546 times, compared
to 15597±447 in its random counterpart (Z score=47). This is the only 4-motif with an
exceptionally high Z score in all experimental genetic networks (e.g. the bacterium E.
coli and the yeast S. cerevisiae (27)). The large Z score of the bi-fan motif is a direct
result of the copying procedure performed in our algorithm. If X originally had links to
Z and W, then once X is copied to a new node Y, Y will have the same outgoing links.
Thus the appearance of this motif in the above mentioned experimental networks is
consistent with the microscopic mechanisms assumed in our model. The next most
frequent motifs in our analysis were the “Converging fan” (fig4B) and 2 variations of the
Bi-fan (fig4C and D). The excess occurrence of these motifs in our networks is again
consistent with the existing genetic observations (motif C is for example the next most
frequent motif after the bi-fan in the E. coli genetic network) and can be assigned to the
specific elementary operations in our model. The only motif experimentally in excess in
genetic networks and not in our analysis is the feed-forward loop 3-motif (X->Y->Z and
X->Z). This may actually indicate that this motif is selected through evolution and is
not a direct result of the random generation process (as suggested by Mangan et al (28)).
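The Z score quoted for the bi-fan can be checked directly from the counts given in the text, and the effect of copying on bi-fan counts is easy to see on a toy graph. The sketch below is illustrative only: the function names are hypothetical, and `count_bifans` counts every occurrence of the pattern (two sources sharing two targets) without excluding extra links among the four nodes, which is a simplification relative to the motif detection algorithm of (26).

```python
from itertools import combinations


def z_score(observed, random_mean, random_std):
    """Deviation of the observed count from the randomized ensemble, in std units."""
    return (observed - random_mean) / random_std


def count_bifans(out_links):
    """Count bi-fan patterns: pairs of sources that share a pair of targets."""
    count = 0
    for x, y in combinations(sorted(out_links), 2):
        shared = (out_links[x] & out_links[y]) - {x, y}
        m = len(shared)
        count += m * (m - 1) // 2  # number of target pairs shared by x and y
    return count


# Counts quoted in the text: 36546 observed vs. 15597 +/- 447 in random networks.
z = z_score(36546, 15597, 447)  # about 46.9, consistent with the quoted Z of 47

# Copying X (with links to Z and W) into a new node Y creates one bi-fan.
before = {"X": {"Z", "W"}, "Z": set(), "W": set()}
after = dict(before, Y=set(before["X"]))
```

Here `count_bifans(before)` is 0 and `count_bifans(after)` is 1, which is the mechanism the paragraph above describes.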
An analytic treatment of the incoming degree distribution in the model follows
from the autocatalytic character of the “copying” process. A process is called
autocatalytic if the variation in time of a quantity (e.g. the current incoming link degree
of each node) is proportional to the quantity itself via a random factor (drawn from a
probability distribution that is node independent). Solomon et al (29) have proved that
such autocatalytic dynamics lead to a scale free distribution even if the number of
agents fluctuates.
In the present case the autocatalytic dynamics results from the combination of link
removal and addition implied by the node removal and copying processes. Node
removal and addition lead to the removal/addition of a number of links proportional to
the number of existing incoming links. Mutations add (on average) a constant number of
incoming links. Such systems have been analytically proven to generically lead to scale
free distributions (30).
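As a hedged sketch of the argument above, the incoming degree can be written as a random multiplicative process with an additive term. The notation is an assumption made for illustration and is not taken from the paper: k_i(t) is the incoming degree of node i, lambda_i(t) is the node-independent random factor produced by copying and removal, and a(t) >= 0 is the contribution of link mutations. For processes of this form with <ln lambda> < 0, the standard result is a power-law tail whose exponent is fixed by the distribution of lambda.

```latex
% Illustrative notation (assumed, not the paper's):
%   k_i(t)       incoming degree of node i at time t
%   \lambda_i(t) random factor from node copying/removal, node independent
%   a(t) \ge 0   additive term from link mutations
\begin{align}
  k_i(t+1) &= \lambda_i(t)\, k_i(t) + a(t), \\
  P(k) &\sim k^{-1-\alpha} \quad (k \to \infty), \qquad
  \text{with } \alpha \text{ fixed by } \langle \lambda^{\alpha} \rangle = 1 .
\end{align}
```

In this notation the cumulative distribution decays as k^{-alpha}, so an observed density exponent of 2-2.5 corresponds to alpha between 1 and 1.5, matching the cumulative exponents quoted in the Results.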
WWW network
The mechanisms governing the WWW evolution are very different from the ones
generating genetic networks: Genetic networks evolve through a slow random copying
process (accompanied by some selection), while the WWW evolves mainly through the
very rapid and continuous editing of the existing websites. The most frequent network
operation in the WWW is link editing rather than node addition or removal.
Consequently, one of the dramatic differences between the WWW and genetic
networks is in the correlation between node degree and node age. While our genetic
network model predicts strong correlations between a node's degree and its age (since
long-lived nodes continuously receive new incoming links), our WWW model has
practically no age-degree correlations (because of the extensive reshuffling of links
between node additions or removals). These predictions are validated by the actual data
both in genetic networks (25) and in WWW measurements (31) (see Figure 5 for a direct
comparison between our model and the experimental data).
Our WWW evolution model consists of the elementary known operations taking place
in the WWW: node creation, node destruction and node editing sessions, each occurring
at a fixed rate per node, with the editing rate much larger than the creation and
destruction rates.

Node creation is typically performed through a "copy and paste" action
involving multiple websites. In our case we create new nodes using half of the
outgoing links of a randomly chosen node and half of the outgoing links of
another randomly chosen node. Newly created sites obviously have no incoming
links. We have verified that using other forms of "copy and paste" mechanisms
has no significant effect on our results.

Node destruction is straightforward: when a node is removed, all the links to
and from the node are removed.

Node editing sessions consist of independent removals and additions of links
originating from a randomly chosen node. Link targets are not chosen randomly,
but rather through surfing and the detection of interesting websites or common
interest among websites. Link addition is thus performed either through web
browsing excursions (with a given probability) or through detection of common
interest (with the complementary probability). An excursion consists of following
an outgoing link, continuing on one of the target's outgoing links, and so on. The
excursion stops with probability 0.1 per node by adding a link from the origin to
the stopping point. Common interest links are added towards nodes in the close
proximity of an existing node (we interpret the non-directional proximity on the
network as a measure of common interest among nodes).
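The node creation and browsing excursion steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, "half" is rounded down, the two parent nodes are taken to be distinct, and an excursion that reaches a node without outgoing links simply stops there. Only the stopping probability of 0.1 per node comes from the text.

```python
import random


def create_node(out_links, new_id, rng):
    """Copy-and-paste creation: half of the outgoing links of two random nodes."""
    first, second = rng.sample(sorted(out_links), 2)
    links = set()
    for parent in (first, second):
        targets = sorted(out_links[parent])
        links.update(rng.sample(targets, len(targets) // 2))  # assumption: round down
    out_links[new_id] = links  # no node points to new_id yet


def excursion(out_links, origin, rng, p_stop=0.1, max_steps=1000):
    """Surf along outgoing links from origin and link origin to the stopping point."""
    current = origin
    for _ in range(max_steps):
        targets = sorted(out_links[current])
        if not targets:
            break  # assumption: a dead end ends the excursion
        current = rng.choice(targets)
        if rng.random() < p_stop:
            break
    if current != origin:  # assumption: no self-links
        out_links[origin].add(current)
    return current


# Usage on a small hypothetical network.
rng = random.Random(1)
web = {0: {1, 2}, 1: {2, 3}, 2: {3, 0}, 3: {0, 1}}
create_node(web, 4, rng)
stop = excursion(web, 0, rng)
```

With a stopping probability of 0.1 per node, an excursion visits about ten nodes on average, so new links tend to reach beyond the immediate neighbors of the editing node.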
In the specific application discussed here we added 0.02 nodes per existing node per
time step and removed 0.016 nodes per node per time step. At each time step we
performed on average one editing session per node. Varying these parameters by a
factor of two has no significant effect on the results. During an editing session, the link
removal and addition probabilities are 5% per link; we also have a 5% probability of
adding a link independent of the number of existing links.
This algorithm reproduces the observed collective properties of the WWW, namely: A)
Negligible correlations between node age and link degree (Figure 5a-b). B) The
observed average number of links per node (Lexperimental=7-8 vs. Lmodel=7.5) (32).
C) Appropriate exponents for the power laws of the incoming (exponent ~2.1, close to
1+L/(L-1)) (33) and outgoing (exponent ~3) link degree distributions (Figure 6) (31, 34).
D) A small world network topology (inset in Figure 6 and Supplementary material
figure 1) (35). E) The appropriate small scale motifs. The motifs experimentally
observed in excess in the WWW are only 3-motifs: triangles containing the feed forward
loop, and 3-motifs based on the feed forward loop with bidirectional links instead of the
unidirectional ones. In our WWW model, the feed forward loop is indeed the most
frequent over-expressed motif, followed by its extensions to bidirectional
links (Supplementary material figure 2). The excellent fit between our model and the
observed properties of the WWW allows us to conclude that our model does contain the
crucial microscopic dynamic elements underlying the WWW evolution.
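As a quick arithmetic check of item C, the relation between the incoming exponent and the average number of links per node L can be evaluated for the quoted values of L. The reading of the relation as 1 + L/(L - 1) is an assumption (it is the only grouping of the printed expression that yields a value near 2.1), and the function name is hypothetical.

```python
def incoming_exponent(mean_links):
    """Assumed relation: exponent = 1 + L / (L - 1), with L the mean links per node."""
    return 1 + mean_links / (mean_links - 1)


# L = 7.5 in the model; L = 7 to 8 in the experimental data.
model_value = incoming_exponent(7.5)
experimental_range = (incoming_exponent(8.0), incoming_exponent(7.0))
```

For L = 7.5 this gives about 2.15, and for L between 7 and 8 it stays between about 2.14 and 2.17, within the 2.1 +/- 0.1 reported for the model in the Figure 6 caption.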
Discussion
Network collective properties are currently extensively studied for a wide class of
networks (1-5, 8, 22-24, 26-28, 34, 36-40). This vast amount of information allows us to
distinguish between network classes, but was never properly correlated to the
elementary mechanisms generating the networks. We show here that the use of
realistic microscopic elements for network generation indeed reproduces the observed
specific collective network properties. We exemplify the relation between network
generation models and their properties for two types of networks: genetic networks and
the WWW. In the case of genetic networks the main generating mechanism used was
node copying. We reproduced all the known properties of genetic networks, such as their
link degree distributions, their topology, age correlations and the small scale motifs
appearing in the network. The resulting network is highly hierarchical and directional.
Information flows from a low degree periphery toward a central core. In the case of the
WWW, the main mechanism is the extensive editing of nodes via browsing and the
sharing of common interest between neighboring nodes. This in turn leads to a
non-directional, hierarchically clustered network, with most connections within the core.
Although the WWW and genetic networks seem to have similar incoming link degree
distribution scaling properties (and as such were classified under the generic rubric of
"scale free networks") one can see that their structure is dramatically different. These
differences can only be understood when connecting their global properties to their
elementary mechanisms.
A large number of other networks, such as neural, social, citations, linguistic and
ecological networks (See for example (4, 22, 34, 40, 41)), were also clustered under the
same rubric of scale free networks. These networks are obviously generated by very
different mechanisms. The scale free character of their incoming link distribution only
reflects their autocatalytic nature, and cannot reveal their specific properties.
Many of the elements introduced in our models were present in previous models.
For example, gene duplication models of monotonically growing genetic networks have
been considered in combination with preferential attachment (36-39). Bi-directional
networks were studied too, e.g. in the context of a small world model (36). The present
research incorporates only the obvious elementary observed mechanisms to produce the
experimentally observed statistical collective features of specific networks.
Bibliography
1. Albert, R. & Barabasi, A. L. (2002) Reviews of Modern Physics 74, 47-97.
2. Barabasi, A. L. & Albert, R. (1999) Science 286, 509-512.
3. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabasi, A. L. (2000) Nature 407, 651-4.
4. Amaral, L. A., Scala, A., Barthelemy, M. & Stanley, H. E. (2000) Proc Natl Acad Sci U S A. 97, 11149-52.
5. Strogatz, S. (2001) Nature 410, 268-276.
6. Faloutsos, M., Faloutsos, P. & Faloutsos, C. (1999) Comp. Comm. R. 29, 251.
7. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. & Rothberg, J. M. (2000) Nature 403, 623-7.
8. Dorogovtsev, S. N. & Mendes, J. F. F. (2001) Phys. Rev. E. 63, 1-18.
9. Simon, H. A. (1955) Biometrika 42.
10. Popper, K. R. (1935) Logik der Forschung (The Logic of Research) (Springer, Vienna).
11. Venter, J. C. (2001) Science 291, 1304-1351.
12. Dehal, P., Predki, P., Olsen, A. S., Kobayashi, A., Folta, P., Lucas, S., Land, M. & Terry, A. (2001) Science 293, 104-111.
13. Stubbs, L. (2002) in Genomic Technologies: Present and Future, ed. McCormack, S. (Caister Academic Press).
14. Friedman, R. & Hughes, A. (2002) Genome Res 11, 373-381.
15. Seoighe, C. & Wolfe, K. H. (1999) Gene 238, 253-261.
16. Wolfe, K. H. & Shields, D. C. (1997) Nature 387, 708-713.
17. Sidow, A. (1996) Curr. Opin. Genet. Dev 6, 715-722.
18. Gu, Z., Cavalcanti, A., Chen, F. C., Bouman, P. & Li, W. H. (2002) Mol. Biol. Evol 19, 256-262.
19. Betran, E. & Long, M. (2002) Genetica 115, 65-80.
20. Forterre, P. & Philippe, H. (1999) Bioessays 21, 871-9.
21. Holland, P. W. (2003) J Struct Funct Genomics 3, 75-84.
22. Watts, D. J. & Strogatz, S. H. (1998) Nature 393, 440-2.
23. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabasi, A. L. (2002) Science 297, 1551-5.
24. Agrawal, H. (2002) Phys Rev Lett. 89, 268702.
25. Eisenberg, E. & Levanon, E. Y. (2003) Phys. Rev. Lett. 91, 138701.
26. Itzkovitz, S., Milo, R., Kashtan, N., Ziv, G. & Alon, U. (2003) Phys Rev E Stat Nonlin Soft Matter Phys 68, 026127.
27. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. & Alon, U. (2002) Science 298, 824-7.
28. Mangan, S. & Alon, U. (2003) Proc Natl Acad Sci U S A. 100, 11980-5.
29. Blank, A. & Solomon, S. (2000) Physica A 287, 279-288.
30. Solomon, S. & Levy, M. (1996) International Journal of Modern Physics C 7, 745.
31. Adamic, L. A., Huberman, B. A., Barabasi, A.-L., Albert, R., Jeong, H. & Bianconi, G. (2000) Science 287, 2115.
32. Kumar, S. R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (1999) in 8th WWW Conference, pp. 403-416.
33. Solomon, S. (1998) in Decision Technologies for Computational Finance, eds. Refenes, A.-P., Burgess, A. N. & Moody, J. E. (Kluwer Academic Publishers).
34. Adamic, L. A., Lukose, R. M., Puniyani, A. R. & Huberman, B. A. (2001) Physical Review E 6404, art. no.-046135.
35. Albert, R., Jeong, H. & Barabasi, A.-L. (1999) Nature 401, 130-131.
36. Bhan, A., Galas, D. J. & Dewey, T. G. (2002) Bioinformatics 18, 1486-93.
37. Chung, F., Lu, L., Dewey, T. G. & Galas, D. J. (2003) J Comput Biol 10, 677-87.
38. Rzhetsky, A. & Gomez, S. M. (2001) Bioinformatics 17, 988-996.
39. Wagner, A. (2001) Mol. Biol. Evol. 18, 1283-1292.
40. Wu, F., Huberman, B. A., Adamic, L. A. & Tyler, J. R. (2004) Physica A-Statistical Mechanics and Its Applications 337, 327-335.
41. Newman, M. E. J. (2001) Proc Natl Acad Sci U S A. 98, 404-409.
Figure Captions
Figure 1 - Mechanisms of individual node evolution: the 3 elementary processes
defining our genetic model. Their effect is demonstrated on the configuration in
the upper left corner (only links and nodes relevant for the explanation are
explicitly shown). The effect of a Node copying elementary event is shown in A).
The blue node is “duplicated” by introducing the new brown node that has the
same targets for its outgoing links (and no incoming links at all). Node
removal is illustrated in B). The green node and all its links are deleted.
Drawing C) illustrates the elementary operation of Link Mutation: the pink link is
“copied”, i.e. a new link with the same origin but a different target is created. These
3 elementary operations turn out to be sufficient for the formation of a steady state
directional hierarchical scale free network, with the experimentally observed
sub-motif distribution.
Figure 2 - Node degree distribution in the genetic model: incoming (empty circles)
and outgoing (full squares) link distributions. The outgoing link distribution is
normal (as experimentally observed). The incoming link distribution is scale free
over more than three orders of magnitude (10-50,000). The straight line
corresponds on this double logarithmic graph to a power law with exponent -2.2
(the value actually observed in most genetic networks). Increasing the network
size has no effect on the exponent of the power distribution; it only increases its
validity range.
[Figure 2 plot: axes log10(Rank) and log10(Number of Nodes); series: Incoming, Outgoing.]
Figure 3. Measures of the Small World character of the genetic network. The
main graph represents the clustering coefficient (C) as a function of the degree k.
The clustering coefficient of node i, with degree ki, is defined as Ci =
2ni/(ki(ki - 1)), where ni denotes the number of direct links connecting its ki nearest
neighbors among themselves. Ci is equal to 1 if the neighbors of i are all connected
to one another. A random (Erdos-Renyi) graph would produce a flat, very small
clustering coefficient. One sees from the graph that the actual distribution, far
from being a small constant is fitted by a straight line that represents on the double
logarithmic scale a power law with exponent -1 as observed in genetic networks
(and in contrast with the preferential attachment dynamics). The fit is excellent
with no free parameters. The inset plots the distribution of distances between
arbitrary pairs of nodes in the network. A regular one dimensional lattice would
produce a distance proportional to the network size (50,000). The combination of
a distance of the order of the log of the network size (inset) and the existence of
high clustering coefficient are the hallmark of a "small world network".
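The clustering coefficient defined in the caption, Ci = 2ni/(ki(ki - 1)), can be computed with a short function. This is a minimal sketch under stated assumptions: the function name is hypothetical, links are treated as undirected (the neighbor map is symmetric), node labels are comparable, and nodes with fewer than two neighbors are assigned 0.

```python
def clustering_coefficient(neighbors, i):
    """C_i = 2 n_i / (k_i (k_i - 1)) for a symmetric (undirected) neighbor map."""
    nbrs = neighbors[i] - {i}
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases are reported as 0
    # n_i: number of links among the neighbors of i, each counted once (u < v).
    n_i = sum(1 for u in nbrs for v in neighbors[u] if v in nbrs and u < v)
    return 2.0 * n_i / (k * (k - 1))


# Usage: in a triangle every neighbor pair is linked; in a star none is.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
c_triangle = clustering_coefficient(triangle, 0)
c_star = clustering_coefficient(star, 0)
```

Plotting the average of this quantity against the degree k on a double logarithmic scale gives the kind of curve the caption describes.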
Figure 4 - The 4 motifs most frequently over-expressed in the networks produced
by the genetic model. They turn out to be the 4 motifs most frequently
over-expressed in experimental genetic networks. The first motif (A) is denoted as a
BiFan and is consistently the most over-expressed in all genetic networks and in
our model. Motifs C and D represent various combinations of the BiFan and
added links. The excess of motif B is a result of the combination of node copying and
removal.
[Figure 4 panels: A) N=36546, Z=46; B) N=253270, Z=23; C) N=826, Z=2.8; D) N=442, Z=2.2.]
Figure 5: Correlations between node degree and node age. The upper left
drawing represents the experimental scatter plot of the websites' incoming degree
as a function of the year they were created (extracted from Adamic, L.A. &
Huberman, B.A., Science 287, 2115a (2000)). No correlation is present. This is
accurately reproduced by our WWW model (upper right drawing). These patterns
are in stark contrast with the pattern predicted by our genetic network model
(lower left drawing), where an age-degree correlation is seen. The age-degree
correlation in the genetic model (empty diamonds in lower right drawing) fits the
experimental results (big grey circles taken from “Preferential Attachment in the
Protein Network Evolution”, Eli Eisenberg and Erez Y. Levanon, Phys. Rev.
Lett. 91(13), 2003).
[Figure 5 plots: Degree (logarithmic scale, 10^0 to 10^5) versus Age (years 1984 to 1998).]
Figure 6: Double logarithmic histogram of the nodes rank distribution in the
WWW model. Results show a power law for both incoming (empty circles;
exponent 2.1+/-0.1) and outgoing links (full squares; exponent 3.1 +/- 0.2). These
exponents are in perfect agreement with the experimental data.
The inset shows the average node distance as predicted by the WWW model. The
distance grows as the log of the network size, and thus the network is indeed a
small world network in agreement with the experiments.
[Figure 6 plot: Nodes rank distribution, axes log10(Rank) and log10(Number of nodes); series: Incoming, Outgoing, each with a power law fit. Inset: log10(<d>) versus log10(Network Size).]