Chapter EVOLUTIONARY CHANGES IN THE BLUEPRINT FOR TRANSCRIPTIONAL REGULATION IN PROKARYOTES 4.1 INTRODUCTION ................................................................................................. 4-1 4.2 MATERIALS AND METHODS ........................................................................... 4-2 4.2.1 ALGORITHM TO RECONSTRUCT TRANSCRIPTIONAL NETWORKS. .......................... 4-2 4.2.2 PROCEDURE TO IDENTIFY ORTHOLOGOUS PROTEINS ......................................... 4-3 4.2.3 PROCEDURE TO EVALUATE SIGNIFICANCE OF THE BIAS IN GENE CONSERVATION. 4-4 4.2.4 ALGORITHM TO RECONSTRUCT ANCESTRAL NETWORKS. .................................... 4-5 4.2.5 PROCEDURE TO GROUP GENOMES BY INTERACTIONS AND GENES CONSERVED. .. 4-6 4.2.6 PROCEDURE TO ANALYSE SCALE FREE BEHAVIOUR OF CONSERVED NETWORKS. . 4-7 4.2.7 ALGORITHM TO ANALYSE CONSERVATION OF ‘NETWORK MOTIFS’........................ 4-8 4.2.8 PROCEDURE TO EVALUATE SIGNIFICANCE OF MOTIF INTERACTION CONSERVATION4-9 4.3 RESULTS AND DISCUSSION ......................................................................... 4-10 4.3.1 RECONSTRUCTION OF TRANSCRIPTIONAL REGULATORY NETWORKS ................. 4-10 4.3.2 CONSERVATION OF TRANSCRIPTION FACTORS AND TARGET GENES .................. 4-13 4.3.3 CONSERVATION OF TRANSCRIPTIONAL REGULATORY NETWORKS ..................... 4-17 4.3.4 EVOLUTION OF GLOBAL NETWORK STRUCTURE ............................................... 4-20 4.3.5 EVOLUTION OF LOCAL NETWORK STRUCTURE ................................................. 4-23 4.4 CONCLUSIONS ................................................................................................ 4-29 4.5 REFERENCES .................................................................................................. 4-30 4 4.1 Introduction EVOLUTIONARY CHANGES IN THE BLUEPRINT FOR TRANSCRIPTIONAL REGULATION IN PROKARYOTES 4 Parts of this chapter will appear in: 1. Madan Babu, M.*, Teichmann, S. A. and Aravind, L.* (2004). Evolutionary changes in the transcriptional regulatory network in prokaryotes, manuscript in preparation 4.1 Introduction Transcriptional control is a fundamental regulatory mechanism common to all organisms. This form of regulation is mediated by a DNA-binding protein that binds to target sites in the genome, and either individually or in concert with other transcription factors regulates the expression of one or more target genes. The sum total of all such transcriptional interactions in an organism can be conceptualised as a network (Lee et al., 2002; Madan Babu et al., 2004). Theoretical characterization of these transcriptional regulatory networks showed that they are scale-free structures with striking structural and topological similarity to other networks from biological and non-biological systems. It was also shown that they are characterized by the recurrence of small patterns of interconnections called network motifs (Albert et al., 2000; Barabasi and Albert, 1999; Lee et al., 2002; Milo et al., 2004; Oltvai and Barabasi, 2002; Ravasz et al., 2002; Shen-Orr et al., 2002). Even though the general properties of transcriptional networks are well understood, there are several unanswered questions regarding their origin and evolution: How did connections between transcription factors and target genes evolve? In particular, how did interactions in 4-1 4.2 Materials and methods topologically equivalent motifs evolve? What is the common ancestor network? Are network structural properties conserved through evolution? In this chapter, we address these questions using the experimentally determined gene regulatory network of E. coli as a reference, and performing an exhaustive comparative genomic analysis of 175 prokaryotes belonging to different lineages of both bacterial and archaeal superkingdoms. 4.2 Materials and methods 4.2.1 Algorithm to reconstruct transcriptional networks The transcriptional regulatory network for E. coli was used as the basis to reconstruct networks for other genomes. Information about regulatory interactions was obtained from RegulonDB (Salgado et al., 2004) and the dataset used in Shen-Orr et. al. (Shen-Orr et al., 2002) This provided us with 1295 transcriptional interactions involving 755 proteins (112 transcription factors). Orthologous proteins were identified in the genome of interest. If orthologs were identified for an interacting transcription factor and target gene in E. coli, then an interaction was considered to be present in the genome of interest (Figure 4.1). Figure 4.1: Procedure to reconstruct transcriptional regulatory networks Reconstruction of transcriptional regulatory networks Step 1 Step 2 Step 3 Define the TFs and TGs in the E. coli network Identify orthologous proteins in the genome of interest Reconstruct interactions (hence the network) if orthologous TFs and TGs exist in genome of interest and are known to interact in the E. coli network Figure 4.1: This figure illustrates the steps involved in reconstructing regulatory networks for the genomes. Blue circles represent target genes and red circle represents transcription factor. Black arrow represents a transcriptional interaction. Genes and interactions that are absent are shown in grey. Reconstructed transcriptional regulatory networks for the 175 genomes using the methods discussed above is available from the supplementary material website (Appendix A). 4-2 4.2 Materials and methods 4.2.2 Procedure to identify orthologous proteins The procedure, which was used to identify orthologous proteins in a genome, is described as a flow chart below (Figure 4.2): Figure 4.2: Flowchart describing the procedure to detect orthologs For each protein P seen in the E. coli transcriptional regulatory network, and for each of the 175 genomes considered, carry out a bi-directional best hit ortholog detection procedure if bi-directional best hit does not exists Store the best hit and carry out a BLASTclust procedure if bi-directional best hit exists Report ortholog Figure 4.2: First a bi-directional best hit is performed, if an ortholog is not detected then BLASTclust procedure is adopted. Supp methods figure-1 Bi-directional best-hit procedure: For each protein P in the E. coli network, a BLAST search was performed against the genome of interest (x). The best hit, sequence Px from genome x, was then used as a query and a BLAST search was carried out against the E. coli genome. If the best hit using Px as the query happens to be P in E. coli, then P & Px were considered as orthologous proteins. If however Px does not pick up P from E. coli as its best hit, then a BLASTclust (Lespinet et al., 2002) procedure was adopted. BLASTclust algorithm: For each of the proteins P in E. coli for which the above-mentioned procedure did not pick up orthologous proteins, the best-hit sequence Px (using P as the query against genome x) for each genome were obtained. Thus, for every protein P, this procedure gives us a set of proteins which are the best hits from genomes where bi-directional best hit procedure fails. The set of sequences thus obtained along with the E. coli protein is taken through a BLASTclust procedure using length conservation (L) of 60% and a score density (S) of 60% (Lespinet et al., 2002). Score density (S) is defined as the ratio of number of identical residues in the alignment to the length of the alignment. Documentation is available at: http://bio.ifom-firc.it/docs/blast/readme.bcl.txt and in (Lespinet et al., 2002). This procedure produces clusters of sequences using the single linkage-clustering algorithm. All the sequences belonging to the cluster that also contains the E. coli protein P are considered as orthologs (see schematic of the ortholog detection procedure). Analysis of the clusters reveal that the parameters S=0.6 and L=0.6 performs best with an optimum coverage and lowest false positive rate. See Figure 4.3 for a schematic. 4-3 4.2 Materials and methods Figure 4.3: Schematic of the procedure to detect orthologous proteins Protein P3 best hit genome 3 Protein P4 best hit genome 4 Cluster 2 Cluster 1 BLASTclust E. coli protein P ‘BEST HIT’ PROTEINS Protein P2 bi-directional best hit genome 2 All proteins in the cluster that also contains the E. coli protein are considered as orthologs of the E. coli protein P ORTHOLOGS Protein P1 bi-directional best hit genome 1 Get best hits Protein P5 best hit genome 5 Figure 4.3: This figure illustrates the steps involved in ortholog detection. Information about orthologs for every genome is stored as an ‘orthomap’ file and is available from the as supplementary material. 4.2.3 Procedure to evaluate significance of the bias in gene conservation Analysis of the conservation of transcription factors and target genes for the various genomes indicate that target genes are more conserved than transcription factors. To evaluate the significance of this observation, the following procedure was carried out. For each of the 175 genomes, orthologous proteins in the E. coli network were first identified. Among the identified orthologs, the fraction of TFs and TGs conserved were calculated. A graph of %TFs conserved against % TGs conserved was plotted. The slope and intercept of the best fit to the observed trend was obtained. Then for each of the genome, random networks of similar sizes were created (Figure 4.4) 10,000 times. For each run, the slope and intercept for the plot of % TF conserved against % TG conserved was obtained. The P-value was estimated as the number of runs where the slope was less than or equal to what was observed. 4-4 4.2 Materials and methods Figure 4.4: Procedure to create random networks to evaluate TF and TG conservation Generating ‘n’ random networks of similar size to evaluate TF & TG conservation Select 8 genes randomly n-times from the reference network Reference genome Genome of interest Ortholog detection #orthologs = 8 (of 14); %genes = 57% #TFs = 1 (of 6); TF% = 17% #TGs = 7 (of 8); TG% = 88% Reference network #TFs = 6 #TGs = 8 #Total = 14 #orthologs = 8 (of 14); %genes = 57% #TFs = 4 (of 6); TF% = 66% #TGs = 4 (of 8); TG% = 50% #orthologs = 8 (of 14); %genes = 57% #TFs = 3 (of 6); TF% = 50% #TGs = 5 (of 8); TG% = 63% #orthologs = 8 (of 14); %genes = 57% #TFs = 3 (of 8); TF% = 50% #TGs = 5 (of 8); TG% = 63% Figure 4.4: This figure illustrates the procedure to generate random networks of similar size to evaluate significance of TF and TG conservation in the different genomes. 4.2.4 Algorithm to reconstruct ancestral networks Standard species tree (Figure 4.5) was adapted from Margulis and Schwartz (1998), as shown in the figure below. Only organisms whose genome size was greater than half the size of the E. coli genome were considered to avoid any bias arising from parasitic genomes that have lost genes due to a specialised life style. This criterion resulted in a total of 97 genomes. Proteobacteria (1) (1) (21) Gracilicutes (9) (1) (3) (4) Euryarcheota Crenarcheota (2) Deinococci (24) Actinobacteria (8) Endospora (11) Spirochaetea Cyanobacteria (genomes) (8) Chlorobia Figure 4.5: Species tree Firmicutes Archaea Eubacteria Root Figure 4.5: Species tree adopted for calculating ancestral networks. 4-5 4.2 Materials and methods Figure 4.6: The Dollo approach taken to calculate ancestral networks AC= u [9] DN= u [1] CR= u [3] EU= u [4] (24) (2) (1) (1) (21) (9) (1) (3) (4) Proteobacteria PROT= GRAC + AL + BE + DE GRAC= EUB + CY + CH + SP FIRM= u [EN, AC, DN] Gracilicutes Euryarcheota EN= u [21] Crenarcheota (8) SP= u [1] Deinococci (11) CH= u [1] Actinobacteria (8) DE= u [2] Endospora BE= u [8] Spirochaetea AL= u [11] Cyanobacteria (genomes) CY= u [8] Chlorobia GA = PROT + ALL - Ec ARCH= u [CR, EU] Firmicutes EUB= ARCH + FIRM Archaea Eubacteria ROOT= ARCH Figure 4.6: The figure illustrates the principle to calculate the ancestral networks using the Dollo parsimony approach. A gene is considered to be present in the node if at least one of its child nodes contains it. The function U represents union operation. For example, the node CY = u [8] means that the ancestor for all cyanobacterial genomes (8 genomes in this case) will contain all the genes seen in each of the 8 cyanobacterial genomes. The Dollo principle was used to reconstruct the ancestral states for all the ancestral nodes. This principle states that a gene (ortholog) can be gained only once. The emergence of a gene was thus assigned to the last common ancestor of all lineages that contained the given gene (Figure 4.6). Once the gene composition has been determined for each node in the tree, the transcriptional regulatory network was reconstructed as described in the procedure mentioned before. 4.2.5 Procedure to group genomes by interactions and genes conserved A novel method to analyse conservation of interaction was developed (Figure 4.7). For every genome, an interaction conservation profile was calculated based on the reconstructed networks. A distance matrix based on interactions conserved was calculated. Relationships between genomes, and within interactions/genes were represented as a tree and a matrix using the procedure shown in the next page. This procedure was carried out for genes instead of transcriptional interactions. Using this procedure, clusters of transcription factors that have similar conservation profiles were obtained. 4-6 4.2 Materials and methods Figure 4.7: Procedure to cluster genomes based on interactions conserved organism A interaction 0001: yes interaction 0002: yes interaction 0003: yes interaction 0004: no interaction 0005: yes interaction 0006: no . . interaction 1295: yes organism B interaction 0001: yes interaction 0002: no interaction 0003: yes interaction 0004: yes interaction 0005: yes interaction 0006: no . . interaction 1295: yes ..... organism Z interaction 0001: no interaction 0002: no interaction 0003: no interaction 0004: yes interaction 0005: yes interaction 0006: no . . interaction 1295: no step1: K-means clustering cluster of interactions with similar conservation profile Interaction conservation profile interaction 1 2 3 4 organism A 1 1 1 0 organism B 1 0 1 1 . organism Z 0 0 0 1 5 6 . . 1295 1 0 . . 1 1 0 . . 1 1 0 . . 2-way clustering 0 interaction 6 4 2 1 organism A 0 0 1 1 organism B 0 1 0 1 . organism Z 0 0 0 0 3 5 . . 1295 1 1 . . 1 1 1 . . 1 0 0 . . 0 step2: hierarchical clustering interactio ns conservedin A and B 8 1 0 .2 interactio ns conservedin A or B 10 D 1 Distance matrix Org A B . Z A 0 0.2 . . B 0.2 . . . . . . Z . . . . cluster # I II III G H . A B clustering using neighbour-joining algorithm F Represent relationships between genomes in terms of interactions conserved E D C Figure 4.7: Representing the interactions as a vector (interaction profile) makes the network amenable to standard clustering procedure. 4.2.6 Procedure to analyse scale free behaviour of conserved networks The distribution of outgoing connectivity provides an indication about the large-scale structure of networks. It is well established that the outgoing connectivity for the E. coli network follows a scale-free behaviour, i.e. the distribution is best approximated by a power-law function T = aK-b where T is the number of transcription factors with K connections. To evaluate the distribution for the reconstructed networks, the following procedure was carried out: For each of the 175 genomes, the distribution is approximated by a linear function (T = a + bK), exponential function (T = ae-K; log T = log a – K log e) and a power-law function (T = aK-; log T = log a – log K). The function that best approximates the observed distribution is identified as the one that has the lowest error. To identify the trend in random networks the following procedure was carried out: for each of 10,000 times create 175 random networks of similar sizes to what is observed in the genomes (Figure 4.8). The procedure explained above is carried out to get the function that best approximates the distribution for the random networks. For each of the 17g genomes, the mean and standard deviation for the power-law exponent over the 10,000 runs is calculated. 4-7 4.2 Materials and methods P-value was calculated as the fraction of the runs where the value for the exponent was greater than or equal to the observed value. Z-score was calculated as Z = (obs – mean)/ Figure 4.8: Procedure to create random networks as seen in the genome of interest Generating ‘n’ random networks of similar size (as seen in genome of interest) Reference genome Reference network #TFs = 8 #TGs = 9 #interactions = 21 Genome of interest Ortholog detection #orthologous TFs = 4 #orthologous TGs = 7 Genome of interest Reconstructed network # orthologous TFs = 4 #orthologous TGs = 7 #reconstructed interactions = 9 Select 4 TFs and 7 TGs randomly n-times from the reference network Reconstruct network based on reference network Reconstruct network based on reference network Reconstruct network based on reference network Random network 1 (4 reconstructed interactions) Random network2 (8 reconstructed interactions) Random network 3 (9 reconstructed interactions) Figure 4.8: Random networks are generated by randomly choosing the same number of genes as seen in the genome of interest and interactions are reconstructed as discussed before. 4.2.7 Algorithm to analyse conservation of ‘network motifs’ A novel algorithm to analyse conservation of network motifs developed by us is shown in Figure 4.9 below: Feed forward motif (FFM), single input module (SIM) and multi-input module (MIM) motifs were identified using the methods described by Shen-Orr et al (Shen-Orr et al., 2002). A motif was considered to be absolutely conserved in a genome, if all the genes constituting the motif in E. coli were conserved in the other genome. If some genes were missing, the fraction of interactions in the motifs conserved was noted. Thus for each genome, an ordered ndimensional vector (motif conservation profile) was created, where n is the number of motifs considered. The values represent the fraction of the interactions forming the motifs that are conserved. This matrix was then subjected to the procedure explained in section 4.2.5 to get motifs that have similar conservation profile. 4-8 4.2 Materials and methods Figure 4.9: Procedure to cluster genomes and motifs according to the extent conserved single input modules feed-forward motifs E. coli F1 ... Fn S1 ... Sn M1 ... Mn F1 ... Fn S1 ... Sn M1 ... Mn organism A organism B F1 motif conservation profile multi input modules ... Fn S1 ... Sn M1 ... Mn Org F1 F2 . Fn Org S1 S2 . Sn Org M1 M2 . Mn A (3/3) . . (1/3) A (2/3) . . (3/3) A (2/4) . . (4/4) B (3/3) . . (1/3) B (2/3) . . (3/3) B (1/4) . . (4/4) . . . . . . . . . . . . . . . Z . . . . Z . . . . Z . . . . clustering of motifs (e.g. K-means) clustering of motifs (e.g. K-means) Figure 4.9: The two way clustering provides information about which motifs have been completely conserved in set of genomes. Specific examples are discussed in the main text. 4.2.8 Procedure to evaluate significance of motif interaction conservation To evaluate whether interactions in a motif are more conserved than any interaction in the network, we introduce a term called conservation index (C.I.), which is defined as follows: R C.I . genomeX log 2 motif R all R motif I motif genome X I motif E. coli and 4-9 R all I all genome X I all E. coli 4.3 Results and discussion In this definition, I motif genome X is the number of interactions that forms a motif in E. coli, which has motif been conserved in genome X. I E. coli is the number of interactions in a motif in E. coli. I all genome X all is the total number of interactions that have been conserved in genome X and I E. coli is the total number of interactions in E. coli. The log2 of the ratio ensures that selection for and against are represented symmetrically in the graph. For example, if R motif = 0.9 and Rall = 0.6, C.I. can be calculated as log2 (.9/.6) = 0.58. Thus if interactions in motifs are selected for, then the C.I value will be greater than 0, if not, the value will be less than 0. Please refer to appendix A for a detailed procedure. 4.3 Results and discussion 4.3.1 Reconstruction of transcriptional regulatory networks While there has been considerable advance in unravelling the regulatory networks of various model organisms such as E. coli, the extrapolation of this information to the range of poorly studied organisms, whose complete genome sequences are now available, remains a major challenge. One can infer transcriptional targets of a regulator in a genome of interest by identifying the orthologous transcription factor and its target genes in a genome with experimentally characterized transcriptional regulatory interactions. This procedure of transferring information about transcriptional regulation from a genome with known regulatory interactions to another genome by identifying orthologous proteins was recently assessed by Yu et al (Yu et al., 2004b). It was found to be a robust method for predicting such interactions in eukaryotes, and we have benchmarked the predicted transcriptional interactions using expression data available for E. coli and V. cholerae. We found that genes which are regulated by the same set of TFs in E. coli, for which the transcriptional network is known, and V. cholerae, for which predictions were based on the reconstructed network, tend to have very similar expression profiles (Figure 4.10). This result supports the validity of reconstructing transcriptional networks using our procedure. Using the transcriptional interaction network available for E. coli as our template, we used an ortholog detection procedure and reconstructed partial transcriptional interaction networks, for the first time, for 175 prokaryotic genomes. The transcriptional interaction network for E. coli is shown in Figure 4.11, along with reconstructed networks for a pathogen, Bacillus anthracis and a free-living organism, Streptomyces coelicolor. It should be emphasized that we can only reconstruct transcriptional interactions that are present in E. coli and that there are many interactions that we cannot see because of limiting our template to the E. coli network. Nevertheless, the methods presented in this work should be applicable to any network, hence would benefit as more data becomes available for other genomes. 4-10 Figure 4.10: Benchmarking predicted transcriptional regulatory interactions using expression data 0.6 Pearson Correlation Coefficient 1 0.8 0.6 0.4 0.2 0 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 0 -0.6 0 -0.8 0.2 -1 0.2 -0.2 0.4 -0.4 0.4 0.8 -0.6 0.6 V. cholerae Co-regulated pairs of TGs TF – TG pairs Random pair of genes 1 -0.8 0.8 b 1.2 Co-regulated pairs of TGs TF – TG pairs Random pair of genes 1 Relative distribution E. coli -1 a Relative distribution 1.2 Pearson Correlation Coefficient Figure 4.10: Gene expression data for the E. coli and V. cholerae genomes were downloaded from the IECA consortium website (http://www.unigiessen.de/~gx1052/ IECA/ieca.html) and from the Stanford Microarray Database (Gollub et al., 2003). For both genomes, Pearson Correlation Coefficient (PCC) were calculated for (i) Transcription factor – Target gene pair, where the TF regulates TG (blue line) (ii) Pairs of TGs, which have the same set of TFs regulating them (i.e. a pair of genes in this category will have the same set of TFs regulating both the genes and each gene will NOT have any other additional TF regulating them; red line) and (iii) Random pair of genes (black line). (a) Normalised distribution of PCC for 1295 pairs of experimentally verified TF-TG interactions, 2570 pairs of co-regulated TGs from the known transcriptional regulatory network and 10,000 random pairs of genes for E. coli. The plot shows that a large fraction of pairs of co-regulated genes have very high PCC. The peak at PCC value of 0.9 is much higher than what is seen for the random pairs of genes indicating that genes that have the same set of TFs are regulated the same way in the E. coli network (b) Normalised distribution of PCC for 507 pairs of predicted TF-TG interactions, 1142 pairs of predicted co-regulated TGs and 10,000 random pairs of genes for V. cholerae. The plot behaves the same way as seen for E. coli. The same fraction of the predicted co-regulated genes [calculated from the predicted transcriptional regulatory network using orthology information; see methods] in V. cholerae has a PCC value of 0.9, strongly lending support to the validity of the predictions. Figure 4.11: Reconstructed transcriptional networks Escherichia coli K12 (4311 genes) Bacillus anthracis A2012 (5544 genes) Streptomyces coelicolor (7769 genes) 112 38 41 711 250 251 1295 314 326 12 49 171 156 78 100 496 308 760 Winged HTH 96 Winged HTH 111 Winged HTH 226 Classical prokaryotic HTH 42 Classical prokaryotic HTH 39 Classical prokaryotic HTH 200 C-terminal effector domain 39 C-terminal effector domain 45 C-terminal effector domain 142 Cro/C1 type HTH 25 Cro/C1 type HTH 47 Cro/C1 type HTH 87 FIS like 13 FIS like 6 FIS like 1 Figure 4.11: Known transcriptional regulatory network for E. coli and reconstructed transcriptional regulatory networks for a pathogen, Bacillus anthracis and a free-living organism, Streptomyces coelicolor. Transcription factors and target genes are represented as red and blue circles and transcriptional interaction is represented as a black line. Information about the number of transcription factors, target genes, regulatory interactions and regulatory network motifs (Lee et al., 2002; Milo et al., 2002) is provided on the right. This figure illustrates how we can partially reconstruct transcriptional networks for genomes, including organisms, which are poorly characterised but are important. The table below the networks shows that number of transcription factors that have a DNA binding domain belonging to a particular family. It is clear by comparing the numbers that each genome has evolved their own set of transcriptional regulators by using the same DNA binding domains to different extents. 4.3 Results and discussion 4.3.2 Conservation of transcription factors and target genes To assess the basic trends in the evolutionary conservation of the TRN, we asked if transcriptional regulators and their targets are differentially conserved in evolution. Figure 4.12 is a graph of the fraction of target genes conserved against the fraction of transcription factors conserved for each of the 175 genomes. The graph shows that transcription factors are less conserved across genomes during evolution when compared to their target genes. We assessed the significance of the bias for conservation of target genes versus transcription factors by comparing to random networks, and found that the bias is statistically significant (pvalue < 10-4). Figure 4.12: Gene conservation profile Gene conservation profile 100 Transcription factor conservation (%) 90 y = 1.1173x - 16.609 R2 = 0.9223 80 70 60 50 40 30 20 10 0 0 20 40 60 80 100 Target gene conservation (%) Figure 4.12: For each of the 175 genomes, the fraction of target genes conserved (x-axis) is plotted against the fraction of transcription factors conserved (y-axis). The diagonal, shown in green, represents equal conservation of transcription factors and target genes, and every point represents a genome. The graph shows that more target genes are conserved than transcription factors in the genomes considered (y = 1.1173x - 16.609, R2 = 0.9223, p-value < 10-4). The significance of this trend was assessed by simulating network evolution, which involved neutral removal of different sets of genes 10,000 times (see methods 4.2.3). One might expect there to be a link between conservation of transcription factors and their target genes; research on evolution of protein-protein interactions across has shown that interacting pairs of proteins tend to be conserved across genomes in a concerted manner (Pagel et al., 2004). In the TRNs studied here, pairs of interacting TF- TGs are not preferentially conserved compared to random pairs of genes in the network (Figure 4.13). This shows that conservation of transcription factors and their target genes are entirely independent of each other. 4-13 4.3 Results and discussion Figure 4.13: Interaction conservation profile Interactions conserved (%) Regulatory interaction conservation 100 90 80 70 60 Observed trend in genomes 50 Random pair of genes 40 30 20 10 0 0 20 40 60 80 100 Genes lost (%) Figure 4.13: Each point in this graph represents a genome. We find that there is no preferential conservation of a transcription factor target gene pair when compared to conservation of any random pairs of genes. Several organisms in our study had few transcription factors with orthologues in the E.coli reference TRN, such as B. anthracis and S. coelicolor in Figure 4.11. However, these genomes are relatively large and must contain a complex transcriptional regulatory network. Hence, we searched to identify proteins with DNA-binding domains typically represented in known transcription factors in all the genomes using sensitive profile methods (Madan Babu and Teichmann, 2003). In most genomes we could identify other transcription factors belonging to the same repertoire of DNA-binding domain families as in E. coli (Figure 4.11), but not orthologous to the E. coli proteins. For small genomes, typically those of obligate parasites with low fractions of transcription factors, no additional transcription factors were detectable. As previously observed by van Nimwegen (van Nimwegen, 2003), we also see that there is a nonlinear increase in the number of regulatory proteins when compared against the genome size (Figure 4.14). These observations, together with the higher degree of conservation of target genes, suggest that: 1) In small parasitic genomes, transcription factors have been lost due to absence of selective pressures for regulation. 2) In larger genomes, target genes are often controlled by additional regulators or regulators that are non-orthologous, so that the genes are responsive to different signals. Figure 4.15 is a result of the clustering procedure discussed in the methods, which shows the profile of transcription factor conservation across the 175 genomes. Such representation provides us information about the different signals to which organisms will respond to. For example, organisms that can respond to changes in levels of fucose and cAMP are marked in Figure 4.15. 4-14 4.3 Results and discussion Figure 4.14: Genome size v/s number of predicted transcription factors Transcription factors v/s genome size Transcription factors 700 S. coelicolor y = 9e-05x1.75 R2 = 0.83 600 500 B. japonicum B. pertussis B. bronchiseptica 400 B. parapertussis D. hafniense 300 M. magnetotacticum 200 N. punctiforme Nostoc Sp 100 Pirellula_sp L. interrogans 0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 Genome size Figure 4.14: The number of predicted transcription factors (y-axis) increase non-linearly with the increase in genome size (x-axis) indicating that new transcription factors evolve as genome size increases. A power-law equation y = 9e-05x1.75 fits the data best (R2 = 0.83). 4-15 Figure 4.15: Transcription factor conservation profile Absent Present GENOMES Genomes Crp That can respond to cAMP levels T Genomes F That can respond to S FucR fucose levels Figure 4.15: The 112 TFs in E. coli are shown as rows and the 175 genomes as columns. If an ortholog of an E. coli TF is present in a genome, the corresponding cell is coloured red, or blue if absent. The procedure to build and cluster transcription factors based on conservation profile is described in the methods section. This figure illustrates that different Transcription factors have been conserved to varying extent in the different genomes. Representation like this can be extremely helpful in finding sets of factors that have similar conservation profiles, hence can be used to identify signals under which the genomes will respond. Visit: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/reconstruct_net/ to get a larger version of this image along with the raw data. Matrices were prepared using Matrix2PNG (Pavlidis and Noble, 2003). 4.3 Results and discussion 4.3.3 Conservation of transcriptional regulatory networks Having reconstructed the TRNs for the genomes, we then reconstructed ancestral networks for the different ancestors of the phylogenetic lineages (see methods 4.2.4). Figure 4.16 shows the ancestral networks for the genomes considered. The predicted ancestral networks of eubacteria and archaea that is common with E. coli show a scale free behaviour, with Crp, Fnr and Lrp as major regulatory hubs. This suggests that the ancestral genome could sense changes in cAMP and oxygen levels. The predicted ancestral network consists of at least 62 TFs that regulate genes involved for basic living which include regulators for genes involved in purine biosynthesis, fructose utilization, xylulose utilization, iron uptake, idonate utilization, hydroxyphenyl compound utilization, fatty acid biosynthesis, anaerobic respiration, leucine biosynthesis, etc and multiple antibiotic resistance. From the figure it is evident that there has been loss of specific genes down the archaeal lineage. However there has been gain of specific regulatory systems down the eubacterial lineages. These include transcriptional regulators that can sense a variety of different sugar molecules and their target genes that can utilize these sugars to generate energy (e.g. mellibiose, mannitol, glucitol, galactose, etc). Further down from this point, the firmicute lineage has lost regulatory systems that can utilize L-idonate, and the sulphur utilization system, whereas the gracilicute lineage has conserved them and has additionally gained the Fis protein, which can control gene expression by regulating the superhelicity of the DNA. Further down the gracilicute lineage, there have been multiple instances where various regulatory systems have been lost in the different lineages. These results suggest that there are some ancient regulatory pathways that were present in the ancestral network which have been lost in the different lineages. The dynamics and rationale for loss or conservation of specific systems will be discussed in the following sections. We then sought to address the extent to which genomes have conserved same set of interactions. To do this, we developed an algorithm to deduce the relationship between different genomes based on conservation of specific sets of transcriptional interactions (see method 4.2.5). We first represented the presence or absence of nodes and edges of the reference transcriptional network graph from E.coli, in each of the 175 prokaryotic genomes as a binary ‘interaction conservation profile’. The organisms were then hierarchically clustered based on their interaction conservation profiles (Figure 4.17). This method is generic and can be used to compare any type of network. The clustering of organisms depicted in Figure 4.17 reveals that transcriptional interactions appear to be shaped by a set of disparate forces. At low to moderate evolutionary distance from the reference organism, namely proteobacteria, the clustering generally mirrors organismal relationships, suggesting that the TRN retains a noticeable phylogenetic signal. Beyond the proteobacteria, a number of other effects appear to dominate, upon the background of a weaker phylogenetic signal. These include the similarities in genome size and ecological adaptations: 4-17 4.3 Results and discussion for example various soil bacteria with comparable genome sizes, such as Mycobacterium tuberculosis and Bacillus subtilis, form a cluster (Figure 4.17). Likewise, the obligate or intracellular parasites from diverse bacterial clades, namely Mycoplasma, Rickettsiae and Chlamydiae, group together in this analysis. Based on the information about conservation of specific transcription factors, and their interactions, we could reconstruct the presence of transcriptional response pathways similar to those in the reference network in various other genomes. These results suggest the conservation of certain transcriptional regulatory pathways and could have predictive value in probing poorly characterized or experimentally intractable organisms, including key human pathogens. For example, important pathogens like Yersinia pestis, Pseudomonas syringae and a key symbiont of soybean plant, Bradyrhizobium japonicum have conserved the GntR, KdgR, ExuR and UxuR transcriptional regulators which can sense different hexuronates and hexuronides suggesting the conservation of the pathway that can catabolize these compounds. Examination of target gene conservation shows that they are indeed conserved, thus allowing us to predict the presence of these response pathways and their transcriptional regulators. Figure 4.16: Ancestral transcriptional regulatory networks a Ancestral networks (24) (2) Proteobacteria 110 697 1262 108 686 1239 (1) (1) (21) 94 638 1116 Gracilicutes 101 660 1158 84 590 1016 71 481 741 24 142 151 (9) 41 209 233 (1) (3) 44 236 286 (4) Euryarcheota (8) 29 135 174 Crenarcheota (11) 22 116 141 Deinococci 41 289 481 Actinobacteria 110 697 1262 Endospora 91 585 997 Spirochaetea Cyanobacteria (genomes) (8) 86 569 1008 Chlorobia 66 414 647 Firmicutes 62 376 492 Eubacteria Archaea 62 376 492 Root Figure 4.16: Species tree was assumed to reconstruct ancestral networks. Red and blue circles near each reconstructed networks represent transcription factors and target genes. Black lines represent transcriptional interactions. Dollo principle was used to arrive at the composition of the ancestral networks. Only genomes whose genome size was more than 2,150 genes (half the size of E. coli genome) were considered, thus leaving with 97 genomes. This figure illustrates that some regulatory pathways present in the ancestral genome have been conserved or lost in the different lineages after their divergence from a common ancestor. 4-18 4.3 Results and discussion Figure 4.17: Genome tree based on the transcriptional interaction conservation profile r eu r eu en cr ba M ae ur Py c e a M eur Ph e u r Pa spiro ur Tp ae Mk ur m e gam Mj a pS Ba ga mm Bsp eu r T ac eu r Tvo Y. pestis Dd e delta Sst cren Dr dr Bt h V bach Pmar C cy a Lint sp Pmar iro Scsp M cya Pm cyan ar M Cth E cy Oo f irmi e fi rm Lp i l Efa firmi Ss V firm oc Ap re F a cre n n Sm c eu C a u fi r Oih c fi rmi rm Af fi e u rm i i r D. desulfuricans M a m A. fulgidus Lpn ga mma Bfl gamma ii H sp eur Wb e gamm a Wbr gam ma R p a lp ha Rco a lp ha Spn T firm Spn firmi T wh TW act Twh T Ct ch acti Cca lapir ch lap Cm i uc hla Cp p n Cp T ch i nJ la Cp ch n Cp A c la Hp n C hla c H p J e h la p Bb 2 e sil M sp psil Ba pn fi iro M p g rmi ge a f ir mm m a i Distance tree based on interaction conservation b U u im m fir rmi fi i ga firm an pe M qn r e M eu Mth eur Mta e ur i Pfu u firm M ma Mp gam n Kp elta a ed Gm gamm e Md amma g Plu a mm Kg m Yp e C g am Ype ma gam Sen ma St gam ma m Sty ga amm Sty T g ma Ec O gam Ec OE gamm Ec C gamma Aae aqt e Chu bachl o Psp chlapi ib m f ir ctin Bs l a tin Cg f ac an C e a cy n a An cy n p ya Ss uc n Np i cya i G v firm i a Lg irm ef i Lm rm y fi Sp 8 fir y M fir Sp M3 Sp y m S fir Sp y irmi Lla f 2 firm Sag N firm Sag chlo Pgi ba chl Ctep ba o Hhe epsil Wsu epsi lo Cj epsilon Ter cyan Thel cyan Bfu beta Cvi bet a Rme beta Avi gam ma Pp u g amma Psy ga mma Pae gamma Pf l ga m ma Son gam ma Vch g Vpa amma g Vvu amma Vv C g a m uY m Dh ga mm a fi rm Sp i a Rs r alp ha o Bp l be Bp beta ta Bb p be Rh r be t a M sp ta Bm lo a alph el lpha a al ph a Fnu fuso Ca u chlo ro a ph a al 1 alph Bs e alph Sm c alph Atu w pha u At a al ha Rp alp a h Rru alp pha a l Brj a g a a Mm alph r ma Cc gam a a Xc gamm x ma Xa gam Xci amm 3g Hdu ma gam Hso a amm Hi g a m gam Pmu eta Neu b b eta Nm Z a Nm M bet ma Cbu gam Xfa gamma Xfa T gamm iii P. aeruginosa Ec K gamma Sfl 2 gamm Sfl g amma Sep A firm Sa MW firm Sa firmi N Sa Mu fir m Bc A firm i Ban A m fir Ban A 2 fir Ml e actin Cdi actin Sa v actin Sco ea cti n Tm a Tf u qte s Lm act in o fi Lin rm i Tte firmi Ct firmi e Cp E fi Blo e fir rm mi ac M tu tin M H t M u C acti ac Bh b o ti a ac f ir tin m i S. pyogenes i M. tuberculosis B. pertussis R. rubrum Figure 4.17: Genomes were clustered according to the interactions they conserve. The figure shows that genomes in the same phylogenetic group generally cluster together (i). Parasitic genomes (ii), and genomes with similar life style but belonging to different phylogenetic groups (iii) also cluster together. This suggests that interactions are gained or lost depending on the organism’s life style and the environmental conditions in which they live. Reconstructed transcriptional network for a representative organism from each group is given by the side. 4-19 Fig 2 4.3 Results and discussion 4.3.4 Evolution of global network structure Above we considered conservation of genes and interactions with respect to their functions as regulators or regulated genes, and the phylogenetic relationships of the genomes. Here we focus on the conservation of genes and interactions in the context of network topology. TRNs are known to have an overall structure in which the outgoing connectivity of transcription factors follows a power law function y=ax, where y is the number of transcription factors and x is the number of target genes. In other words, there are few transcription factors that regulate many target genes, so these influential transcription factors are hubs in the network. Albert and Barabasi among many others show that this so-called scale-free behaviour should hold good for randomly selected parts of networks of this form (Barabasi and Albert, 1999). We observed that all the conserved transcriptional networks follow a scale-free distribution (Figure 4.18), as do the ancestral networks at internal nodes of the phylogenetic tree shown in Figure 4.16. Therefore, even though particular regulatory hubs may be lost or replaced, as shown for Haemophilus influenzae and Bordetella pertussis in Figure 4.19, the scale-free topology is still present. From Figure 4.18, it is evident that the exponent ( in the fit y=ax) increases in genomes where the fraction of conserved nodes is less than 60%, and this is statistically significant. This implies that a stronger hub-like structure is retained in poorly conserved networks (Figure 4.19). Figure 4.18: Scale-free behaviour in conserved networks Scale-free exponent ( Scale-free behaviour in conserved networks -0.2 -0.4 Observed trend in genomes -0.6 Observed trend in random networks of similar size -0.8 -1 30 40 50 60 70 80 90 100 Genes conserved (%) Figure 4.18: The fraction of genes conserved (x-axis) is plotted against the scale free exponent value of the reconstructed networks (y-axis) for the different genomes, in black. The average value obtained after simulating neutral condition of similar sized networks 10,000 times is shown in green. This plot shows that for genomes, which conserve about 30 – 60 % of the genes, the exponent value () is higher than in random networks suggesting a more pronounced hub-like behaviour. This supports the observation that there is a weak selection for hubs for the same genomes (30-60% of genes conserved) 4-20 4.3 Results and discussion Taken together the fact that the conserved networks still show scale-free behaviour but the values for the exponent are quite different from what would be expected in random networks of similar size suggests that the complete network of these organisms are scale free but are quite different from what is seen for E. coli. This would mean that for genomes distantly related to E. coli, the scale free behaviour is not acquired (inherited) from the common ancestor, instead each lineage has emerged the scale free behaviour independently during evolution. Figure 4.19: Scale-free behaviour and loss of regulatory hubs E. coli Crp Crp NarL NarL H. influenzae Crp NarL B. pertussis Figure 4.19: Reconstructed transcriptional regulatory networks for H. influenzae and B. pertussis is shown to illustrate that the scale free behaviour is still conserved even though regulatory hubs can be lost or replaced. For example, the distribution for the outgoing connection for the conserved network in H. influenzae is y =214.8x-0.49 even though a regulatory hub NarL is lost. Is this because transcription factors that are hubs are preferentially conserved? Earlier studies on protein-protein interaction networks (PINs) have suggested that the proteins with most interactions tend to be more conserved across organisms (Jeong et al., 2001; Yu et al., 2004a). Interestingly, there is no correlation between the connectivity of a particular transcription factor and its conservation across genomes, as shown in Figure 4.20. We establish this by comparing the observed conservation of regulatory interactions to models where transcription factors are lost in a selectively neutral manner, or hubs are preferentially conserved or lost. The observed pattern is closest to neutral removal of transcription factors independent of their number of target genes in the E. coli TRN. Table 4.1 provides specific examples of transcription factors with many target genes that are conserved in few genomes, and transcription factors with low connectivity that are conserved in many genomes. Why is there no bias for conservation of influential transcription factors? It was recently reported that most regulatory hubs in yeast are condition-specific, i.e. they regulate many genes only in particular conditions, but remain silent in other conditions (Luscombe et al., 2004). Most prokaryotic transcription factors in prokaryotes are also condition-specific, in that they respond to specific environmental signal (Martinez-Antonio et al., 2003). If an organism has a lifestyle that does not involve a particular condition, then that particular regulatory protein can be lost 4-21 4.3 Results and discussion (Madan Babu, 2003). For example, E. coli and P. aeruginosa have conserved the MhpR, HcaR and FeaR transcriptional regulators because they can sense specific hydroxy-phenyl compounds and activate genes that catabolize these compounds to derive energy. However, the regulators and their target genes have been lost in pathogens like Staphylococcus aureus and Campylobacter jejuni because the environments in which they predominantly live do not contain these aromatic compounds. Figure 4.20: Interaction conservation profile for observed and random networks Interactions conserved (%) Regulatory interaction conservation 100 90 80 Selection for regulatory hubs 70 60 Observed trend in genomes 50 40 30 Neutral removal of genes 20 Selection against regulatory hubs 10 0 0 20 40 60 80 100 Genes lost (%) Figure 4.20: The fraction of transcriptional regulatory interactions conserved in the network (xaxis) is plotted against the fraction of the genes conserved (y-axis) for each of the 175 genomes (y = 0.007x2 - 1.7843x + 100, R2 = 0.95). To understand what the observed trend means, we simulated the process of network evolution by incorporating different rules: (i) remove genes with low connectivity first, implying selection to conserve highly connected transcription factors, shown in blue; (ii) remove genes neutrally with no special emphasis on connectivity, implying neutral evolution, shown in green and (iii) remove nodes with high connectivity first, implying selection against highly connected regulatory proteins, shown in red. We find that the observed trend is more similar to the trend that we obtain while simulating the neutral condition. Table 4.1: Conservation and connectivity profile for transcription factors A. Proteins with few target genes (k ≤ 15) but conserved in more than 50% of the genomes. Transcription factor GI number Low connectivity (k ≤ 15) High conservation (≥ 50%) Ada 16130150 5 74.86 BirA 16131807 5 89.71 MarR 33347561 7 54.29 DnaA 16131570 10 87.43 CpxR 16131752 14 55.43 OmpR 16131282 14 57.14 NarP 16130130 15 60.57 4-22 Function Transcriptional regulator of DNA repair Biotin operon repressor Repressor of multiple antibiotic resistance (Mar) operon Transcriptional regulator of housekeeping genes Regulator of a 2CST* Regulator of adaptive response (2CST*) Regulator of aerobic respiration (2CST*) 4.3 Results and discussion B. Proteins with many target genes (k > 15) but conserved in less than 50% of the genomes. 16130638 High connectivity (k > 15) 16 Low conservation (< 50%) 34.29 Rob 16132213 17 24.00 CysB FruR Hns PurR Fis Lrp NarL ArcA Ihf 16129236 16128073 16129198 16129616 16131149 16128856 16129184 16132218 16129668 18 25 26 29 34 53 66 70 96 25.14 21.14 14.29 33.14 28.00 44.57 36.00 40.00 37.71 Fnr 16129295 110 49.14 Transcription factor GI number FhlA Function Formate hydrogen lyase activator Transcriptional activator for antibiotic resistance Regulator of cysteine biosynthesis Fructose operon repressor General regulator Purine biosynthesis repressor Regulator of rRNA and tRNA operons Leucine responsive regulatory protein Regulator of anaerobic respiration Regulator of aerobic respiration General regulator Tanscriptional regulator of aerobic, anaerobic respiration and osmotic balance *2CST – Two-component signal transduction system What may allow loss and gain of transcriptional regulatory interactions relatively quickly in evolution is the fact that this can be achieved by a small number of mutations in the DNAbinding interface of the transcription factor or the DNA-recognition site near the target gene. Protein-protein interactions on the other hand involve larger interfaces and hence more mutations are needed to alter them. Indeed Maslov et al. (Maslov et al., 2004) have shown that transcriptional regulatory interactions of paralogous proteins within the yeast genome evolve more quickly than protein-protein interactions. 4.3.5 Evolution of local network structure The structure of transcriptional regulatory networks is scale-free at the global level, while at the local level there are recurrent topological patterns of interconnections called network motifs (Lee et al., 2002; Milo et al., 2002; Shen-Orr et al., 2002), which can be viewed as building blocks of the network. Individual motifs have specific information processing ability and dictate specific patterns of gene expression (Kalir and Alon, 2004; Mangan and Alon, 2003). To evaluate the motif conservation in TRNs, we devised a new algorithm (method 4.2.7) that clusters motifs according to their conservation profile. In principle this algorithm can be applied to any set of networks. We create a motif conservation profile for each genome, and then carry out a twoway clustering procedure: a k-means clustering of all motifs with similar occurrence profile across genomes, and then a hierarchical clustering of genomes based on their motif conservation profile (Figure 4.21 and Figure 4.22). This procedure clusters motifs according to the conservation of their interactions, and allows us to study evolutionary changes in local network structure. 4-23 Figure 4.21: Feed forward motif conservation profile 0% FEED FORWARD MOTIFS 100% Conservation (%) Specific E. coli FFMs G conserved in selected sets genomes E N of O M E S Figure 4.21: The 175 genomes are shown as rows and the 277 different FFMs are shown as columns. If all constituents of a motif is present in a genome, of interest, then the particular cell is coloured red, or in shades of red to blue reflecting the amount of conservation (blue for completely absent). The procedure to build and cluster motifs based on conservation profile of the network motifs is described in the methods section. This figure illustrates that different feedforward motifs have been conserved to varying extent in the different genomes. Representation like this can be extremely helpful in finding sets of motifs that have similar conservation profiles. Two-way clustering, as done here, also allows us to identify different genomes that have conserved similar sets of feedforward motifs. Representation like this can be very helpful in understanding gene expression patterns in genomes for which information is unavailable. Figure 4.22: SIM and MIM conservation profile SIMs Conservation (%) 0% 100% MIMs Conservation (%) 0% 100% G E N O M E S Figure 4.22: The 175 genomes are shown as rows. The 43 SIMs & 70 MIMs seen in E. coli are shown as columns in the two matrices. Conservation of motif constituents for genomes are coloured in shades of red (absolutely conserved) to blue (completely absent). The figure illustrates that the SIMs and MIMs have been conserved to varying extent in the different genomes. G E N O M E S 4.3 Results and discussion This analysis showed that there is no significant conservation of whole motifs compared to random networks of similar size. Interestingly, within coherent monophyletic lineages of prokaryotes particular network motif may not be conserved, whereas genomes belonging to unrelated phylogenetic groups conserve some motifs. For example, Fnr (a global regulator, activated during low oxygen levels), NarL (transcriptional regulator of a two-component signal transduction system) and NuoN (subunit of the NADH dehydrogenase complex I) form a feedforward motif, which is not conserved in other proteobacteria, whereas it is found in several distantly related genomes (Figure 4.23ab). This observation is consistent with the fact that related genomes might not conserve specific regulatory hubs. To determine whether the individual interactions that form motifs were preferentially conserved compared to other interactions in the network, we computed a motif conservation index (C.I) for each organism (see method 4.2.8). C.I is defined as the ratio of the fraction of the conserved interactions that forms a motif in E. coli to the fraction of all interactions conserved. As shown in Figure 4.23c, the observed trend in the genomes is closest to a model of neutral removal of interactions, rather than preferential conservation or loss of interactions in motifs. Thus we find that there is no preferential conservation of whole motifs or parts thereof (Figure 4.24). Figure 4.23: Evolutionary changes in the local network structure a Fnr NuoN NarL Salmonella typhi ( proteobacteria) Vibrio cholerae ( proteobacteria) b c Haemophilus somnus ( proteobacteria) Xylella fastidiosa ( proteobacteria) Blochmannia floridanus ( proteobacteria) 1 0.5 Fnr Selection for motifs NarL NuoN R. palustris ( proteobacteria) B. pertussis ( proteobacteria) N. punctiforme (Cyanobacteria) S. avermitilis (Actinobacteria) D. hafniense (Firmicute) C.I Motifs and hubs conserved among unrelated genomes Motifs and hubs not conserved among related genomes 0 Observed trend in genomes Neutral conservation of motifs -0.5 Selection against motifs -1 30 40 50 60 70 80 90 100 % genes conserved Figure 4.23: (a) A feed-forward motif formed by Fnr, NarL and NuoN genes in E. coli is completely conserved in a closely related genome, Salmonella typhi, but not in other proteobacterial genomes. Fnr is a regulatory hub and is not always conserved in genomes within the same phylogenetic group. This suggests that condition specific regulatory hubs can 4-26 4.3 Results and discussion either be lost or displaced by a non-orthologous protein in closely related genomes. (b) Distantly related organisms that have conserved all interactions in the regulatory motif and that have conserved the regulatory hub, Fnr. (c) This is a plot of fraction of genes conserved (x-axis) against conservation index (y-axis), which is a measure of the extent to which interactions in a motif are conserved. A value > 1 means that interactions in motifs are selected for; close to 0 suggests neutral selection and < 1 suggests selection against such interactions. The plot shows that interactions that form the network motif are not selected for or against in evolution and are conserved to the same extent as any other interaction in the network. This implies that evolution acts by tinkering to form different motifs using the same interaction in different genomes. Figure 4.24: Gene conservation v/s number of motifs 300 250 # FFMs 200 15 0 10 0 50 0 30 40 50 60 70 80 90 10 0 % Genes conserved 50 # SIMs 40 30 20 10 0 30 40 50 60 70 80 90 10 0 90 10 0 % Genes conserved 80 70 # MIMs 60 50 40 30 20 10 0 30 40 50 60 70 80 % Genes conserved Figure 4.24: Graphs of the number of whole motifs against percentage of genes that are seen in the conserved networks (black triangle) and in random networks of similar size (green dots). The graphs show that the number of whole motifs that exist in the conserved networks is comparable to what is seen in random networks of similar size. However, for genomes where the fraction of genes conserved is less than 60%, there seems to be a slight preference for SIMs to be over represented. For most genomes, the P-values and Z-scores suggest that the observed trend is close to random. 4-27 4.3 Results and discussion Careful examination of partially conserved motifs reveals that the same gene can become a part of different motifs in the TRN of different organisms (Figure 4.25). An important implication of this observation is that the same gene in a different organism may require a different pattern of gene expression. Thus, different organisms have arrived at different solutions by tinkering specific regulatory interactions rather than porting whole blocks of pre-existing transcriptional interactions. For example, a gene, which may need to be tightly regulated (as a feed-forward motif), may need to be regulated directly and quickly via a direct signal (as in a single input module) in a different genome due to changes in life-style, development or environment. Specific cases are shown in Figure 4.25. For example, E. coli, which lives in a more stable aerobic environment, will not express the fumarate reductase genes (FrdB and FrdC, which converts fumarate to succinate under anaerobic conditions to derive energy) unless there is persistent signal for lack of oxygen (through both Fnr and NarL). However Haemophilus influenzae, which may encounter very different oxygen milieus (in arterial and venous blood) during the course of infection, will need to regulate the fumarate reductase genes immediately (by Fnr) whenever oxygen concentration fluctuates to ensure survival. Figure 4.25: Changes seen in motif structure a FFM SIM Fnr Fnr 1 C(t) C(t) 1 0 0 0 2 4 6 8 10 12 FrdB 14 time, t FrdC NarL FrdB E. coli 0 FrdC NarL 2 4 6 8 10 12 14 10 12 14 10 12 14 time, t H. influenzae b MIM SIM TrpR TyrR TrpR TyrR AroL AroM AroL AroM 1 C(t) C(t) 1 0 0 2 4 6 8 10 12 0 14 time, t E. coli c FFM+MIM 2 4 6 8 time, t MIM Crp AcnA TdcA Fnr 1 Crp C(t) Fnr C(t) 1 0 X. fastidiosa 0 0 0 2 4 6 8 time, t 10 12 14 ArcA ArcA E. coli AcnA K. pneumoniae TdcA 0 2 4 6 8 time, t Figure 4.25: Analysis of partially conserved motifs revealed that by losing (or gaining) specific transcription factors, orthologous genes in different genomes could be expressed in different ways according to specific requirements. Figures by the side of each motif represent the concentration profile of the component genes with time. Genes in a feed-forward motif (FFM) in E. coli can be regulated as a part of a single input module (SIM) by losing a transcription factor (shown in grey). Feed forward motif regulation ensures that target gene expression is not 4-28 4.4 Conclusions sensitive to fluctuations in input signals. Whereas a SIM regulation ensures expression of target genes as long as there is some input signal, like small molecules. Genes in a multiple input motif (MIM) in E. coli can be regulated as a single input module (SIM) by losing a transcription factor. Genes regulated as a part of a MIM ensures that target genes are transcribed only when two input signals independently cross a particular thresholds. Thus by losing a transcription factor, genes that are tightly regulated can be regulated in a simple manner. Genes regulated as a part of a FFM can be regulated as a part of a MIM by the loss of a transcription factor. Thus evolution tinkers with specific regulatory interactions to create different motifs in genomes when orthologous genes need to be expressed differently. Generically, the figure illustrates that by bringing about temporal expression of specific transcription factors, the same target gene can follow different patterns of gene expression, be it in a different cell type or in different organisms. In this context, we note that studies of duplicated genes within the regulatory network of E. coli and yeast (Conant and Wagner, 2003; Teichmann and Madan Babu, 2004) have shown that network motifs have not evolved by duplication of whole ancestral modules. These findings hint that the same interaction could have existed in different regulatory contexts in ancestral genomes, which is in agreement with our findings. Our conclusions are consistent with a recent study showing the lack of conservation of higher order network modules, which are semiindependent units larger than network motifs (Snel and Huynen, 2004). This type of flexibility in the regulatory context of a gene can be viewed as analogous to the changes in gene regulation across different cell types within a multi-cellular organism. 4.4 Conclusions We analysed the conservation of genes and transcriptional regulatory networks across genomes as well as the evolution of networks topology at both global and local level by comparing the extend of conservation of an experimentally established reference network, that of E. coli, across 175 microbial genomes. We showed that, at the level of individual genes, target genes are more conserved than transcription factors, and that conservation of a target gene and its transcription factor are uncoupled. We also showed that at the local level there is no preferential conservation of interactions within network motifs: By losing or conserving specific transcriptional regulators, orthologous genes in different genomes are incorporated within different regulatory contexts and thereby encode different patterns of gene expression. At the level of global network topology, conservation of transcription factors is independent of the number of target genes they regulate, and depends on similar or different lifestyles of the organisms rather than their phylogenetic distance from E. coli. This, coupled with the fact that 4-29 4.5 References both the reference network and the inferred ones are characterized by a power law distribution but different exponents, imply that these networks evolved independently. Our findings suggest that evolution tinkers with individual network connections to arrive at a satisfactory working solution and does so by maintaining the scale-free nature of the ancestor core network and at the same time rewiring and adding/deleting components to it to suit new endogenous or environmental requirements. Finally, our method can be readily applied to any set of networks and genomes that advancements in large-scale experimental identification of transcriptional regulatory networks will make available. The reconstructed networks and predicted transcription factors for 175 organisms can serve as a scaffold for experimental studies on transcriptional control in poorly characterized genomes, and can be relevant to engineer regulatory interactions in these organisms. 4.5 References Albert, R., Jeong, H. and Barabasi, A. L. (2000). Error and attack tolerance of complex networks. Nature 406, 378-82. Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286, 509-12. Conant, G. C. and Wagner, A. (2003). Convergent evolution of gene circuits. Nat Genet 34, 264-6. Gollub, J., Ball, C. A., Binkley, G., Demeter, J., Finkelstein, D. B., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J. C. et al. (2003). The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res 31, 94-6. Jeong, H., Mason, S. P., Barabasi, A. L. and Oltvai, Z. N. (2001). Lethality and centrality in protein networks. Nature 411, 41-2. Kalir, S. and Alon, U. (2004). Using a quantitative blueprint to reprogram the dynamics of the flagella gene network. Cell 117, 713-20. Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804. Lespinet, O., Wolf, Y. I., Koonin, E. V. and Aravind, L. (2002). The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res 12, 1048-59. Luscombe, N. M., Madan Babu, M., Yu, H., M., S., Teichmann, S. A. and Gerstein, M. (2004). Genomic analysis of regulatory network dynamics. Nature in press. Madan Babu, M. (2003). Did the loss of sigma factors initiate pseudogene accumulation in M. leprae? Trends Microbiol 11, 59-61. 4-30 4.5 References Madan Babu, M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004). Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology 14, 283-291. Madan Babu, M. and Teichmann, S. A. (2003). Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234-44. Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci U S A 100, 11980-5. Epub 2003 Oct 6. Martinez-Antonio, A., Salgado, H., Gama-Castro, S., Gutierrez-Rios, R. M., JimenezJacinto, V. and Collado-Vides, J. (2003). Environmental conditions and transcriptional regulation in Escherichia coli: a physiological integrative approach. Biotechnol Bioeng 84, 7439. Maslov, S., Sneppen, K., Eriksen, K. A. and Yan, K. K. (2004). Upstream plasticity and downstream robustness in evolution of molecular networks. BMC Evol Biol 4, 9. Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M. and Alon, U. (2004). Superfamilies of evolved and designed networks. Science 303, 1538-42. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). Network motifs: simple building blocks of complex networks. Science 298, 824-7. Oltvai, Z. N. and Barabasi, A. L. (2002). Systems biology. Life's complexity pyramid. Science 298, 763-4. Pagel, P., Mewes, H. W. and Frishman, D. (2004). Conservation of protein-protein interactions - lessons from ascomycota. Trends Genet 20, 72-6. Pavlidis, P. and Noble, W. S. (2003). Matrix2png: a utility for visualizing matrix data. Bioinformatics 19, 295-6. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. and Barabasi, A. L. (2002). Hierarchical organization of modularity in metabolic networks. Science 297, 1551-5. Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F., Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A., BonavidesMartinez, C. et al. (2004). RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32, D303-6. Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8. Snel, B. and Huynen, M. A. (2004). Quantifying modularity in the evolution of biomolecular systems. Genome Res 14, 391-7. Teichmann, S. A. and Madan Babu, M. (2004). Gene regulatory network growth by duplication. Nat Genet 36, 492-6. Epub 2004 Apr 11. van Nimwegen, E. (2003). Scaling laws in the functional content of genomes. Trends Genet 19, 479-84. Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X. and Gerstein, M. (2004a). Genomic analysis of essentiality within protein networks. Trends Genet 20, 227-31. 4-31 4.5 References Yu, H., Luscombe, N. M., Lu, H. X., Zhu, X., Xia, Y., Han, J. D., Bertin, N., Chung, S., Vidal, M. and Gerstein, M. (2004b). Annotation Transfer Between Genomes: Protein-Protein Interologs and Protein-DNA Regulogs. Genome Res 14, 1107-18. 4-32