Copying nodes vs. Editing links: the source of the difference between genetic networks and the WWW

Yoram Louzoun*, Lev Muchnick& and Sorin Solomon#

* Department of mathematics, Bar Ilan University, Ramat Gan, Israel, 52900
& Department of Physics, Bar Ilan University, Ramat Gan, Israel, 52900
# Hebrew University, Jerusalem, Israel and ISI Torino

Corresponding author: Yoram Louzoun, Department of mathematics, Bar Ilan University, Ramat Gan, Israel, 52900, email: louzouy@math.biu.ac.il, Tel ++972-352317610

Abstract

Experimental observations on networks are currently accumulating in a wide range of disciplines and at all scales. We show that networks in different disciplines, with different basic node and link dynamics, have macroscopic experimental features that are strongly system dependent, such as the clustering structure, the frequency of small-scale motifs, the time correlations and the connectivity of outgoing links. Other features are generic: the large-scale connectivity distribution of incoming links (scale free) and the network diameter (small world). The common properties are just the general hallmark of autocatalysis, while the specific properties hinge on the specific elementary mechanisms. We exemplify this by studying two types of networks: genetic networks and the World Wide Web. In the first case we formulate a model including only random unbiased gene duplications and mutations. In the second case, the basic moves are website generation and rapid surf-induced link creation (and destruction). In both cases we reproduce the experimental observations at all scales. In the genetic case the emerging picture is of a centre-dominated structure with arrows preferentially directed from weakly connected branches toward an old, highly connected core. In the WWW case, the hierarchy is rather in the form of strongly linked clusters within clusters. The inter-cluster connections generically point in both directions and are age independent.
Introduction

Recent attempts to characterize the properties of networks (social, internet, citations, WWW etc.) (1-6) have been directed at generic properties that are shared by most evolving systems, such as a small world character, scaling and autocatalytic growth (2, 3, 7, 8). The stress was on the generality of the phenomena over many disciplines rather than on the specific elementary interactions responsible for particular systems. In particular, the focus was on the scaling properties of the node connectivity, which are universal and depend only on the autocatalytic nature of the processes, as shown by Simon (9). Models reproducing only these generic properties cannot claim to reflect the elementary mechanisms underlying the observed networks. There are, however, empirically observed properties that are system specific. Those can be used in combination with the known microscopic interactions to reverse engineer the basic laws of these systems.

After the first years were invested in a broad-brush description of network properties and of ad hoc microscopic mechanisms that could generate them, the time has come to address the detailed properties of networks appearing in specific fields and to trace their macroscopic properties to the known microscopic interactions. In other words, we place network analysis within the usual scientific paradigm: advancing hypotheses, deducing their consequences and confronting them with the data. This allows for the classical Popper falsifiability criterion (10).

Specifically, we define different microscopic laws corresponding to the known evolution mechanisms governing individual genes and, respectively, the dynamics of WWW node formation. We then deduce the corresponding divergent predictions for the network structure, its main sub-network components, the incoming and outgoing node degree distributions and its time correlations. These predictions are confirmed for the corresponding network and fail for the other one.
Results

Genetic networks

We present a simple gene evolution model and show that it reproduces in detail the observed properties of genetic networks. The nodes in a genetic network are specific genes of a given organism. A link pointing from gene A to gene B implies that gene A regulates gene B (through the protein that it codes for). The model addresses the following aspects:

A) The basic mechanism used in the model is the now accepted gene duplication mechanism (11-18). Node duplication has been clearly shown to govern the generation of genetic networks. This is in contrast with the usual “preferential attachment” algorithm, which amounts to the random creation of new nodes (followed by selection).

B) The size of genetic networks saturates on the time scale of network generation (19-21). Accordingly, our model produces a scale-free connectivity distribution with an exponent of ~2, even for non-monotonic size variation. This is again in contrast with most currently used models, which assume a continuous growth of the network.

C) The model distinguishes between incoming and outgoing degree distributions. This difference is obvious and has been widely observed experimentally. In most dynamical systems, originating and receiving a link are tied to very different properties. The production of an outgoing link requires some specific action of the origin node, while receiving an incoming link implies the existence of a triggering or referral mechanism at the level of the target node. In the specific case of genetic networks, the trigger of the links incoming to a node depends on the sequences of its transcription factor binding sites, while the action of the links outgoing from a node is a function of the 3D structure of the protein transcribed by it. Moreover, the observed distributions in most genetic networks (for example the E. coli network) are scale free for incoming links, but normal for outgoing links.
All of those elements are incorporated into the following model. At each time step, nodes can be copied (fig1a) or removed (fig1b), and links can be mutated (fig1c). More precisely, at each time step one node is copied (fig1a), while each existing node can be removed with a probability per time step (fig1b). When a node is removed, all its incoming and outgoing links are removed as well. When a node is copied, only its outgoing links are copied; the added node has no incoming links. Links are mutated (fig1c) randomly at a fixed rate: at each time step a constant number of links are mutated, i.e. “duplicated” by keeping the link origin (22) and choosing a new random target.

The interpretation of these abstract operations is as follows. Node copying corresponds to gene duplication. Note that we model only the duplication of the exons (for which there is experimental evidence (11-13)); we do not assume duplication of the intronic and control regions, as there is no clear experimental evidence for it. Node removal represents random deleterious mutations that result in gene destruction. Link mutation represents mutations through which an existing protein binds a promoter of an unrelated gene.

Our model provides a direct relation between the observed microbiological mechanisms (duplication and mutations) in their raw form and the experimentally observed properties of the network. The main results of our model are:

A) A scale free incoming node degree distribution with an exponent (figure 2) compatible with the observed 2-2.5 (i.e. a cumulative exponent of 1-1.5) (2).

B) A normal outgoing node degree distribution, again compatible with the observations (Supplementary material figure 1).
C) A hierarchical structure characterised by a high clustering coefficient that scales with the experimentally observed exponent of -1 (Figure 3) (23).

D) A small world geometry: the node distance distribution is Gaussian and its average is proportional to the log of the network size (inset in figure 3), as is indeed experimentally observed (22, 24). These emerging properties are independent of the network's seed or history.

E) A positive correlation between the age of the nodes and their incoming link degree (Figure 5) (25).

F) Our model also reproduces the observed sub-graph distribution. Using the algorithm in (26) we measured the relative occurrence frequency of the n-node “motifs” with n < 5. An n-motif is a directed sub-network with n nodes. We compared the occurrence frequency of each motif in our networks to its frequency in randomly generated networks with an identical link degree distribution (27). In the networks generated by our algorithm, the most statistically significant deviation from the random network was obtained for the “bi-fan” motif (fig4A). For instance, in a network of 894 nodes and 5074 edges it occurred 36546 times, compared to 15597±447 times in its random counterpart (Z score=47). This is the only 4-motif with an exceptionally high Z score in all experimental genetic networks (e.g. the bacterium E. coli and the yeast S. cerevisiae (27)). The large Z score of the bi-fan motif is a direct result of the copying procedure performed in our algorithm. If X originally had links to Z and W, then once X is copied to a new node Y, Y will have the same outgoing links. Thus the appearance of this motif in the above mentioned experimental networks supports the microscopic mechanisms assumed in our model. The next most frequent motifs in our analysis were the “Converging fan” (fig4B) and 2 variations of the bi-fan (fig4C and D).
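As an illustration only (this is not the code used for the results reported here), the three elementary operations of the genetic model and the way node copying creates a bi-fan can be sketched in Python. The function names, the dictionary-of-sets representation, the exclusion of self-links and the rule that keeps at least two nodes alive are assumptions of this sketch. The last lines redo the Z score arithmetic quoted above, assuming that ±447 denotes the standard deviation over the randomized networks.

```python
import random

def copy_node(out_links, source, new_id):
    """Node copying: the new node inherits only the outgoing links of source."""
    out_links[new_id] = set(out_links[source])

def remove_node(out_links, node):
    """Node removal: delete the node and all links to and from it."""
    del out_links[node]
    for targets in out_links.values():
        targets.discard(node)

def mutate_link(out_links, rng):
    """Link mutation: pick a random existing link, keep its origin and add a
    link from that origin to a random target (self-links skipped: assumption)."""
    links = [(o, t) for o in sorted(out_links) for t in sorted(out_links[o])]
    if not links:
        return
    origin, _ = rng.choice(links)
    target = rng.choice(sorted(out_links))
    if target != origin:
        out_links[origin].add(target)

def step(out_links, next_id, p_remove, n_mutations, rng):
    """One time step: copy one node, remove each node with probability
    p_remove, then mutate n_mutations links. Returns the next free node id."""
    copy_node(out_links, rng.choice(sorted(out_links)), next_id)
    for node in sorted(out_links):
        # Assumption of the sketch: never shrink below two nodes.
        if len(out_links) > 2 and rng.random() < p_remove:
            remove_node(out_links, node)
    for _ in range(n_mutations):
        mutate_link(out_links, rng)
    return next_id + 1

def in_degrees(out_links):
    """Incoming degree of every node."""
    degree = {node: 0 for node in out_links}
    for targets in out_links.values():
        for target in targets:
            degree[target] += 1
    return degree

def z_score(observed, random_mean, random_std):
    """Excess of a motif count over its randomized counterpart."""
    return (observed - random_mean) / random_std

# Bi-fan from copying: X -> Z, W; its copy Y points to the same Z and W.
net = {"X": {"Z", "W"}, "Z": set(), "W": set()}
copy_node(net, "X", "Y")
print(sorted(net["Y"]))                         # ['W', 'Z']
print(round(z_score(36546, 15597, 447), 1))     # 46.9
```

Iterating `step` from a small seed network and histogramming `in_degrees` is one way to explore the incoming degree distribution; the parameter values to use (removal probability, number of mutations per step) are left to the reader, since the sketch does not fix them.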
The excess occurrence of these motifs in our networks is again consistent with the existing genetic observations (motif C is, for example, the next most frequent motif after the bi-fan in the E. coli genetic network) and can be attributed to the specific elementary operations in our model. The only motif experimentally in excess in genetic networks and not in our analysis is the feed-forward loop 3-motif (X->Y->Z and X->Z). This may actually indicate that this motif is selected through evolution and is not a direct result of the random generation process (as suggested by Mangan et al (28)).

An analytic treatment of the incoming degree distribution in the model follows from the autocatalytic character of the “copying” process. A process is called autocatalytic if the variation in time of a quantity (e.g. the current incoming link degree of each node) is proportional to the quantity itself via a random factor (drawn from a probability distribution that is node independent). Solomon et al (29) have proved that such autocatalytic dynamics lead to a scale free distribution even if the number of agents fluctuates. In the present case the autocatalytic dynamics result from the combination of link removal and addition implied by the node removal and copying processes. Node removal and addition lead to the removal/addition of a number of links proportional to the number of existing incoming links. Mutations add (on average) a constant number of incoming links. Such systems have been analytically proven to generically lead to scale free distributions (30).

WWW network

The mechanisms governing the evolution of the WWW are very different from the ones generating genetic networks: genetic networks evolve through a slow random copying process (accompanied by some selection), while the WWW evolves mainly through the very rapid and continuous editing of the existing websites. The most frequent network operation in the WWW is link editing rather than node addition or removal.
Consequently, one of the dramatic differences between the WWW and the genetic networks is in the correlation between the degree of the nodes and their age. While our genetic network model predicts strong correlations between the node degree and its age (since long-lived nodes continuously receive new incoming links), our WWW model has practically no age-degree correlations (because of the extensive reshuffling of links between node additions or removals). These predictions are validated by the actual data both in genetic networks (25) and in WWW measurements (31) (see figure 5 for a direct comparison between our model and the experimental data).

Our WWW evolution model consists of the elementary known operations taking place in the WWW: node creation, node destruction and node editing sessions, each occurring at a given rate per node, with the editing rate much larger than the creation and destruction rates. Node creation is typically performed through a "copy and paste" action involving multiple websites. In our case we create new nodes using half of the outgoing links of a randomly chosen node and half of the outgoing links of another randomly chosen node. Newly created sites obviously have no incoming links. We have verified that using other forms of "copy and paste" mechanisms has no significant effect on our results. Node destruction is straightforward: when a node is removed, all the links to and from the node are removed. Node editing sessions consist of independent removals and additions of links originating from a randomly chosen node. Link targets are not chosen randomly, but rather through surfing and the detection of interesting websites or of common interest among websites. Link addition is thus performed either through web browsing excursions (with a given probability) or through detection of common interest (with the complementary probability). An excursion consists of following an outgoing link, continuing on one of the target's outgoing links, and so on.
The excursion stops with probability 0.1 per node, and a link is then added from the origin to the stopping point. Common interest links are added towards nodes in the close proximity of an existing node (we interpret the non-directional proximity on the network as a measure of common interest among nodes). In the specific application discussed here we added 0.02 nodes per existing node per time step and removed 0.016 nodes per node per time step. At each time step we performed on average one editing session per node. Varying these parameters by a factor of two has no significant effect on the results. During an editing session, the link removal and addition probabilities are 5% per link; we also have a 5% probability of adding a link independently of the number of existing links.

This algorithm reproduces the observed collective properties of the WWW, namely:

A) Negligible correlations between the age of the nodes and their link degree distributions (figure 5a-b).

B) The observed average number of links per node (L_experimental = 7-8 vs. L_model = 7.5) (32).

C) An appropriate exponent for the power law of the incoming (~2.1, close to 1 + L/(L-1)) (33) and outgoing (~3) link degree distributions (Figure 6) (31, 34).

D) A small world network topology (inset in figure 6 and supplementary material figure 1) (35).

E) The appropriate small-scale motifs. The only motifs experimentally observed in excess in the WWW are 3-motif triangles containing the feed-forward loop and 3-motifs based on the feed-forward loop with bidirectional links instead of unidirectional ones. In our WWW model, the feed-forward loop is indeed the most frequent over-expressed motif, followed by its extensions to bidirectional links (Supplementary material figure 2).

The excellent fit between our model and the observed properties of the WWW allows us to conclude that our model does contain the crucial microscopic dynamic elements underlying the evolution of the WWW.
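The elementary moves of the WWW model described above (site creation, site destruction, and editing sessions that add links through excursions or common interest) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used for the reported results. All function names are hypothetical. The parameter `p_excursion` stands for the probability of choosing an excursion, whose value is not given in the text. Reading "close proximity" as a neighbour of a neighbour, ending an excursion at a site without outgoing links, and making one addition trial per existing link plus one link-independent trial are choices of this sketch.

```python
import random

def undirected_neighbours(out_links, node):
    """Sites linked to or from node, ignoring link direction."""
    incoming = {n for n, targets in out_links.items() if node in targets}
    return (set(out_links[node]) | incoming) - {node}

def excursion(out_links, origin, p_stop, rng):
    """Surf along outgoing links, stopping with probability p_stop (> 0) at
    each visited site. Ending at a site with no outgoing links is an assumption."""
    current = origin
    while out_links[current]:
        current = rng.choice(sorted(out_links[current]))
        if rng.random() < p_stop:
            break
    return current

def common_interest_target(out_links, node, rng):
    """A neighbour of a neighbour, ignoring direction (one possible reading of
    'close proximity'). Returns None if no such site exists."""
    near = sorted(undirected_neighbours(out_links, node))
    if not near:
        return None
    via = rng.choice(near)
    candidates = sorted(undirected_neighbours(out_links, via) - {node})
    return rng.choice(candidates) if candidates else None

def create_node(out_links, new_id, rng):
    """New site: half of the outgoing links of each of two random sites."""
    new_targets = set()
    for parent in rng.sample(sorted(out_links), 2):
        targets = sorted(out_links[parent])
        new_targets.update(rng.sample(targets, len(targets) // 2))
    out_links[new_id] = new_targets

def destroy_node(out_links, node):
    """Remove a site together with all links to and from it."""
    del out_links[node]
    for targets in out_links.values():
        targets.discard(node)

def edit_session(out_links, node, p_edit, p_excursion, p_stop, rng):
    """Independent link removals and additions originating from node."""
    existing = sorted(out_links[node])
    # One addition trial per existing link plus one link-independent trial.
    n_add = sum(rng.random() < p_edit for _ in range(len(existing) + 1))
    for target in existing:
        if rng.random() < p_edit:
            out_links[node].discard(target)
    for _ in range(n_add):
        if rng.random() < p_excursion:
            target = excursion(out_links, node, p_stop, rng)
        else:
            target = common_interest_target(out_links, node, rng)
        if target is not None and target != node:
            out_links[node].add(target)

# Usage: one round of editing sessions on a toy four-site web.
rng = random.Random(0)
web = {0: {1, 2}, 1: {2, 3}, 2: {3, 0}, 3: {0, 1}}
for site in sorted(web):
    edit_session(web, site, p_edit=0.05, p_excursion=0.5, p_stop=0.1, rng=rng)
```

In a full simulation these moves would be applied at the rates quoted above (0.02 created and 0.016 removed sites per existing site per time step, and on average one editing session per site); the value 0.5 used for `p_excursion` in the usage lines is a placeholder.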
Discussion

Network collective properties are currently extensively studied for a wide class of networks (1-5, 8, 22-24, 26-28, 34, 36-40). This vast amount of information allows us to distinguish between network classes, but was never properly correlated to the elementary mechanisms generating the networks. We show here that using realistic microscopic elements for network generation indeed reproduces the observed specific collective network properties. We exemplify the relation between network generation models and their properties for two types of networks: genetic networks and the WWW.

In the case of genetic networks the main generating mechanism used was node copying. We reproduced all the known properties of genetic networks, such as their link degree distributions, their topology, their age correlations and the small-scale motifs appearing in the network. The resulting network is highly hierarchical and directional. Information flows from a low degree periphery toward a central core. In the case of the WWW, the main mechanism is the extensive editing of nodes via browsing and the sharing of common interest between neighboring nodes. This in turn leads to a non-directional, hierarchically clustered network, with most connections within the core.

Although the WWW and genetic networks seem to have similar scaling properties of the incoming link degree distribution (and as such were classified under the generic rubric of "scale free networks"), one can see that their structure is dramatically different. These differences can only be understood when connecting their global properties to their elementary mechanisms. A large number of other networks, such as neural, social, citation, linguistic and ecological networks (see for example (4, 22, 34, 40, 41)), were also clustered under the same rubric of scale free networks. These networks are obviously generated by very different mechanisms.
The scale free character of their incoming link distribution only reflects their autocatalytic nature, and cannot reveal their specific properties.

Many of the elements introduced in our models were present in previous models. For example, gene duplication models of monotonically growing genetic networks have been considered in combination with preferential attachment (36-39). Bi-directional networks were studied too, e.g. in the context of a small world model (36). The present research incorporates only the obvious elementary observed mechanisms to produce the experimentally observed statistical collective features of specific networks.

Bibliography

1. Albert, R. & Barabasi, A. L. (2002) Reviews of Modern Physics 74, 47-97.
2. Barabasi, A. L. & Albert, R. (1999) Science 286, 509-512.
3. Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabasi, A. L. (2000) Nature 407, 651-4.
4. Amaral, L. A., Scala, A., Barthelemy, M. & Stanley, H. E. (2000) Proc Natl Acad Sci U S A. 97, 11149-52.
5. Strogatz, S. (2001) Nature 410, 268-276.
6. Faloutsos, M., Faloutsos, P. & Faloutsos, C. (1999) Comp. Comm. R. 29, 251.
7. Uetz, P., Giot, L., Cagney, G., Mansfield, T. A., Judson, R. S., Knight, J. R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. & Rothberg, J. M. (2000) Nature 403, 623-7.
8. Dorogovtsev, S. N. & Mendes, J. F. F. (2001) Phys. Rev. E. 63, 1-18.
9. Simon, H. A. (1955) Biometrika 42.
10. Popper, K. R. (1935) Logik der Forschung (The Logic of Research) (Springer, Vienna).
11. Venter, J. C. (2001) Science 291, 1304-1351.
12. Dehal, P., Predki, P., Olsen, A. S., Kobayashi, A., Folta, P., Lucas, S., Land, M. & Terry, A. (2001) Science 293, 104-111.
13. Stubbs, L. (2002) in Genomic Technologies: Present and Future, ed. McCormack, S. (Caister Academic Press).
14. Friedman, R. & Hughes, A. (2002) Genome Res 11, 373-381.
15. Seoighe, C. & Wolfe, K. H. (1999) Gene 238, 253-261.
16. Wolfe, K. H. & Shields, D. C. (1997) Nature 387, 708-713.
17. Sidow, A. (1996) Curr. Opin. Genet. Dev 6, 715-722.
18. Gu, Z., Cavalcanti, A., Chen, F. C., Bouman, P. & Li, W. H. (2002) Mol. Biol. Evol 19, 256-262.
19. Betran, E. & Long, M. (2002) Genetica 115, 65-80.
20. Forterre, P. & Philippe, H. (1999) Bioessays 21, 871-9.
21. Holland, P. W. (2003) J Struct Funct Genomics 3, 75-84.
22. Watts, D. J. & Strogatz, S. H. (1998) Nature 393, 440-2.
23. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabasi, A. L. (2002) Science 297, 1551-5.
24. Agrawal, H. (2002) Phys Rev Lett. 89, 268702.
25. Eisenberg, E. & Levanon, E. Y. (2003) Phys. Rev. Lett. 91, 138701.
26. Itzkovitz, S., Milo, R., Kashtan, N., Ziv, G. & Alon, U. (2003) Phys Rev E Stat Nonlin Soft Matter Phys 68, 026127.
27. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. & Alon, U. (2002) Science 298, 824-7.
28. Mangan, S. & Alon, U. (2003) Proc Natl Acad Sci U S A. 100, 11980-5.
29. Blank, A. & Solomon, S. (2000) Physica A 287, 279-288.
30. Solomon, S. & Levy, M. (1996) International Journal of Modern Physics C 7, 745.
31. Adamic, L. A., Huberman, B. A., Barabasi, A.-L., Albert, R., Jeong, H. & Bianconi, G. (2000) Science 287, 2115.
32. Kumar, S. R., Raghavan, P., Rajagopalan, S. & Tomkins, A. (1999) in 8th WWW Conference, pp. 403-416.
33. Solomon, S. (1998) in Decision Technologies for Computational Finance, eds. Refenes, A.-P., Burgess, A. N. & Moody, J. E. (Kluwer Academic Publishers).
34. Adamic, L. A., Lukose, R. M., Puniyani, A. R. & Huberman, B. A. (2001) Physical Review E 64, 046135.
35. Albert, R., Jeong, H. & Barabasi, A.-L. (1999) Nature 401, 130-131.
36. Bhan, A., Galas, D. J. & Dewey, T. G. (2002) Bioinformatics 18, 1486-93.
37. Chung, F., Lu, L., Dewey, T. G. & Galas, D. J. (2003) J Comput Biol 10, 677-87.
38. Rzhetsky, A. & Gomez, S. M. (2001) Bioinformatics 17, 988-996.
39. Wagner, A. (2001) Mol. Biol. Evol. 18, 1283-1292.
40. Wu, F., Huberman, B. A., Adamic, L. A. & Tyler, J. R. (2004) Physica A: Statistical Mechanics and Its Applications 337, 327-335.
41. Newman, M. E. J. (2001) Proc Natl Acad Sci U S A. 98, 404-409.

Figure Captions

Figure 1 - Mechanisms of individual node evolution: the 3 elementary processes defining our genetic model. Their effect is demonstrated on the configuration in the upper left corner (only the links and nodes relevant for the explanation are explicitly shown). The effect of an elementary Node copying event is shown in A). The blue node is “duplicated” by introducing the new brown node, which has the same targets for its outgoing links (and no incoming links at all). Node removal is illustrated in B). The green node and all its links are deleted. Drawing C) illustrates the elementary operation of Link Mutation: the pink link is “copied”, i.e. a new link with the same origin but a different target is created. These 3 elementary operations turn out to be sufficient for the formation of a steady state directional hierarchical scale free network, with the experimentally observed sub-motif distribution.

Figure 2 - Node degree distribution in the genetic model: incoming (empty circles) and outgoing (full squares) link distributions. The outgoing link distribution is normal (as experimentally observed). The incoming link distribution is scale free over more than three orders of magnitude (10-50,000). The straight line corresponds, on this double logarithmic graph, to a power law with exponent -2.2 (the value actually observed in most genetic networks). Increasing the network size has no effect on the exponent of the power law distribution; it only increases its validity range.

[Figure 2 plot. Axis labels: log10(Rank), log10(Number of Nodes). Legend: Incoming, Outgoing.]

Figure 3 - Measures of the Small World character of the genetic network. The main graph represents the clustering coefficient (C) as a function of the degree k.
The clustering coefficient of a node i with degree $k_i$ is defined as $C_i = 2n_i / (k_i(k_i - 1))$, where $n_i$ denotes the number of direct links connecting its $k_i$ nearest neighbors among themselves. $C_i$ is equal to 1 if the neighbors of i are all connected to one another. A random (Erdos-Renyi) graph would produce a flat, very small clustering coefficient. One sees from the graph that the actual distribution, far from being a small constant, is fitted by a straight line that represents, on the double logarithmic scale, a power law with exponent -1, as observed in genetic networks (and in contrast with the preferential attachment dynamics). The fit is excellent, with no free parameters. The inset plots the distribution of distances between arbitrary pairs of nodes in the network. A regular one dimensional lattice would produce a distance proportional to the network size (50,000). The combination of a distance of the order of the log of the network size (inset) and the existence of a high clustering coefficient is the hallmark of a "small world network".

Figure 4 - The 4 motifs most frequently over-expressed in the networks produced by the genetic model. They turn out to be the 4 motifs most frequently over-expressed in experimental genetic networks. The first motif (A) is denoted as a Bi-Fan and is consistently the most over-expressed in ALL genetic networks and in our model. Motifs C and D represent various combinations of the Bi-Fan and added links. The excess of motif B is a result of the combination of node copying and removal.

[Figure 4 panels. A: N=36546, Z=46. B: N=253270, Z=23. C: N=826, Z=2.8. D: N=442, Z=2.2.]

Figure 5 - Correlations between node degree and node age. The upper left drawing represents the experimental scatter plot of the incoming degree of websites as a function of the year they were created (extracted from Adamic, L.A. & Huberman, B.A., Science 287, 2115a (2000)). No correlation is present. This is accurately reproduced by our WWW model (upper right drawing).
These patterns are in stark contrast with the pattern predicted by our genetic network model (lower left drawing), where an age-degree correlation is seen. The age-degree correlation in the genetic model (empty diamonds in the lower right drawing) fits the experimental results (big grey circles, taken from “Preferential Attachment in the Protein Network Evolution”, Eli Eisenberg and Erez Y. Levanon, Phys. Rev. Letters 91(13), 2003).

[Figure 5 plots. Axis labels: Degree (logarithmic scale), Age (1984-1998).]

Figure 6 - Double logarithmic histogram of the node rank distribution in the WWW model. Results show a power law for both incoming (empty circles; exponent 2.1+/-0.1) and outgoing links (full squares; exponent 3.1+/-0.2). These exponents are in agreement with the experimental data. The inset shows the average node distance as predicted by the WWW model. The distance grows as the log of the network size, and thus the network is indeed a small world network, in agreement with the experiments.

[Figure 6 plot. Title: Nodes rank distribution. Axis labels: log10(Rank), log10(Number of nodes). Legend: Incoming, Outgoing, Incoming Power law fit, Outgoing Power law fit. Inset axis labels: log10(<d>), log10(Network Size).]