4.2 Materials and methods - MRC Laboratory of Molecular Biology

Chapter 4
EVOLUTIONARY CHANGES IN THE
BLUEPRINT FOR TRANSCRIPTIONAL
REGULATION IN PROKARYOTES
4.1 INTRODUCTION
4.2 MATERIALS AND METHODS
4.2.1 ALGORITHM TO RECONSTRUCT TRANSCRIPTIONAL NETWORKS
4.2.2 PROCEDURE TO IDENTIFY ORTHOLOGOUS PROTEINS
4.2.3 PROCEDURE TO EVALUATE SIGNIFICANCE OF THE BIAS IN GENE CONSERVATION
4.2.4 ALGORITHM TO RECONSTRUCT ANCESTRAL NETWORKS
4.2.5 PROCEDURE TO GROUP GENOMES BY INTERACTIONS AND GENES CONSERVED
4.2.6 PROCEDURE TO ANALYSE SCALE FREE BEHAVIOUR OF CONSERVED NETWORKS
4.2.7 ALGORITHM TO ANALYSE CONSERVATION OF ‘NETWORK MOTIFS’
4.2.8 PROCEDURE TO EVALUATE SIGNIFICANCE OF MOTIF INTERACTION CONSERVATION
4.3 RESULTS AND DISCUSSION
4.3.1 RECONSTRUCTION OF TRANSCRIPTIONAL REGULATORY NETWORKS
4.3.2 CONSERVATION OF TRANSCRIPTION FACTORS AND TARGET GENES
4.3.3 CONSERVATION OF TRANSCRIPTIONAL REGULATORY NETWORKS
4.3.4 EVOLUTION OF GLOBAL NETWORK STRUCTURE
4.3.5 EVOLUTION OF LOCAL NETWORK STRUCTURE
4.4 CONCLUSIONS
4.5 REFERENCES
Parts of this chapter will appear in:
1. Madan Babu, M.*, Teichmann, S. A. and Aravind, L.* (2004). Evolutionary changes in the
transcriptional regulatory network in prokaryotes, manuscript in preparation
4.1 Introduction
Transcriptional control is a fundamental regulatory mechanism common to all organisms. This
form of regulation is mediated by a DNA-binding protein that binds to target sites in the genome,
and either individually or in concert with other transcription factors regulates the expression of
one or more target genes.
The sum total of all such transcriptional interactions in an organism can be conceptualised as a
network (Lee et al., 2002; Madan Babu et al., 2004). Theoretical characterization of these
transcriptional regulatory networks showed that they are scale-free structures with striking
structural and topological similarity to other networks from biological and non-biological
systems. It was also shown that they are characterized by the recurrence of small patterns of
interconnections called network motifs (Albert et al., 2000; Barabasi and Albert, 1999; Lee et al.,
2002; Milo et al., 2004; Oltvai and Barabasi, 2002; Ravasz et al., 2002; Shen-Orr et al., 2002).
Even though the general properties of transcriptional networks are well understood, there are
several unanswered questions regarding their origin and evolution: How did connections
between transcription factors and target genes evolve? In particular, how did interactions in
topologically equivalent motifs evolve? What is the common ancestor network? Are network
structural properties conserved through evolution?
In this chapter, we address these questions using the experimentally determined gene
regulatory network of E. coli as a reference, and performing an exhaustive comparative
genomic analysis of 175 prokaryotes belonging to different lineages of both bacterial and
archaeal superkingdoms.
4.2 Materials and methods
4.2.1 Algorithm to reconstruct transcriptional networks
The transcriptional regulatory network for E. coli was used as the basis to reconstruct networks
for other genomes. Information about regulatory interactions was obtained from RegulonDB
(Salgado et al., 2004) and the dataset used in Shen-Orr et al. (2002). This
provided us with 1295 transcriptional interactions involving 755 proteins (112 transcription
factors). Orthologous proteins were identified in the genome of interest. If orthologs were
identified for an interacting transcription factor and target gene in E. coli, then an interaction was
considered to be present in the genome of interest (Figure 4.1).
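The transfer rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the function name `reconstruct_network`, the dictionary layout and the gene names in the toy data are assumptions made for the example.

```python
def reconstruct_network(reference_edges, orthologs):
    """Transfer reference interactions to a genome of interest.

    reference_edges: iterable of (tf, tg) pairs from the reference network.
    orthologs: dict mapping a reference gene to its ortholog in the genome
               of interest; genes without an ortholog are simply absent.
    An interaction is kept only if both the TF and the TG have an ortholog.
    """
    return {
        (orthologs[tf], orthologs[tg])
        for tf, tg in reference_edges
        if tf in orthologs and tg in orthologs
    }


# Toy data (hypothetical names, not taken from RegulonDB).
ref_edges = [("crp", "lacZ"), ("crp", "araB"), ("fnr", "narG")]
ortho_map = {"crp": "x_crp", "lacZ": "x_lacZ", "narG": "x_narG"}
network_x = reconstruct_network(ref_edges, ortho_map)
```

In the toy data only the first interaction survives, because `araB` and `fnr` have no ortholog in the hypothetical genome.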
Figure 4.1: Procedure to reconstruct transcriptional regulatory networks
Step 1: Define the TFs and TGs in the E. coli network.
Step 2: Identify orthologous proteins in the genome of interest.
Step 3: Reconstruct interactions (hence the network) if orthologous TFs and TGs exist in the genome of interest and are known to interact in the E. coli network.
Figure 4.1: This figure illustrates the steps involved in reconstructing regulatory networks for
the genomes. Blue circles represent target genes and the red circle represents a transcription factor.
A black arrow represents a transcriptional interaction. Genes and interactions that are absent are
shown in grey. Reconstructed transcriptional regulatory networks for the 175 genomes, obtained using the
methods discussed above, are available from the supplementary material website (Appendix A).
4.2.2 Procedure to identify orthologous proteins
The procedure used to identify orthologous proteins in a genome is summarised as a
flow chart (Figure 4.2):
Figure 4.2: Flowchart describing the procedure to detect orthologs
1. For each protein P seen in the E. coli transcriptional regulatory network, and for each of the 175 genomes considered, carry out a bi-directional best hit ortholog detection procedure.
2. If a bi-directional best hit exists, report the ortholog.
3. If a bi-directional best hit does not exist, store the best hit and carry out a BLASTclust procedure.
Figure 4.2: First, a bi-directional best hit search is performed; if an ortholog is not detected, the
BLASTclust procedure is adopted.
Bi-directional best-hit procedure: For each protein P in the E. coli network, a BLAST search was
performed against the genome of interest (x). The best hit, sequence Px from genome x, was
then used as a query and a BLAST search was carried out against the E. coli genome. If the
best hit using Px as the query happens to be P in E. coli, then P & Px were considered as
orthologous proteins. If however Px does not pick up P from E. coli as its best hit, then a
BLASTclust (Lespinet et al., 2002) procedure was adopted.
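The bi-directional best-hit test can be sketched as follows, assuming the BLAST searches have already been run and reduced to two lookup tables of best hits. The function name and the table layout are assumptions for illustration; no BLAST call is made here.

```python
def bidirectional_best_hits(best_hit_fwd, best_hit_rev):
    """Split reference proteins into confirmed orthologs and unresolved hits.

    best_hit_fwd: reference protein P -> its best hit Px in genome x.
    best_hit_rev: genome-x protein Px -> its best hit in the reference genome.
    P and Px are called orthologs only if each is the other's best hit.
    """
    orthologs, unresolved = {}, {}
    for p, px in best_hit_fwd.items():
        if best_hit_rev.get(px) == p:
            orthologs[p] = px
        else:
            unresolved[p] = px  # best hit is stored for the BLASTclust step
    return orthologs, unresolved


# Toy best-hit tables (hypothetical identifiers).
fwd = {"P1": "X1", "P2": "X2"}
rev = {"X1": "P1", "X2": "P9"}
confirmed, pending = bidirectional_best_hits(fwd, rev)
```

Here `P1` and `X1` are reciprocal best hits, while `X2` points back to a different protein, so `P2` is passed on to the clustering step.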
BLASTclust algorithm: For each protein P in E. coli for which the above-mentioned
procedure did not pick up an ortholog, the best-hit sequence Px (using P as the query
against genome x) was obtained for each genome. Thus, for every protein P, this procedure
gives a set of proteins that are the best hits from the genomes where the bi-directional best hit
procedure fails. This set of sequences, along with the E. coli protein, is taken
through a BLASTclust procedure using a length conservation (L) of 60% and a score density (S)
of 60% (Lespinet et al., 2002). Score density (S) is defined as the ratio of the number of identical
residues in the alignment to the length of the alignment. Documentation is available at:
http://bio.ifom-firc.it/docs/blast/readme.bcl.txt and in (Lespinet et al., 2002). This procedure
produces clusters of sequences using the single-linkage clustering algorithm. All the sequences
belonging to the cluster that also contains the E. coli protein P are considered orthologs.
Analysis of the clusters reveals that the parameters S = 0.6 and L = 0.6 perform best, giving
the best coverage at the lowest false positive rate. See Figure 4.3 for a schematic.
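The clustering step can be illustrated with a small single-linkage sketch. It does not call BLASTclust itself: the pairwise length-conservation and score-density values are assumed to be precomputed, and the function names and toy values are hypothetical.

```python
def single_linkage_clusters(sequences, pair_stats, min_l=0.6, min_s=0.6):
    """Single-linkage clustering: two sequences are linked if their pairwise
    length conservation (L) and score density (S) both reach the thresholds;
    clusters are the connected components of the resulting graph."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent pointers to the cluster representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), (length_cons, score_density) in pair_stats.items():
        if length_cons >= min_l and score_density >= min_s:
            parent[find(a)] = find(b)

    clusters = {}
    for s in sequences:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())


def orthologs_of(ref_protein, clusters):
    """All members of the cluster that also contains the reference protein."""
    for cluster in clusters:
        if ref_protein in cluster:
            return cluster - {ref_protein}
    return set()


# Toy input: P is the E. coli protein, P3..P5 are one-way best hits.
seqs = ["P", "P3", "P4", "P5"]
stats = {("P", "P3"): (0.9, 0.7), ("P3", "P4"): (0.8, 0.65), ("P", "P5"): (0.5, 0.9)}
found = orthologs_of("P", single_linkage_clusters(seqs, stats))
```

Because linkage is transitive, `P4` joins the cluster of `P` through `P3` even though `P` and `P4` were never compared directly; `P5` fails the length threshold and stays in its own cluster.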
Figure 4.3: Schematic of the procedure to detect orthologous proteins
[Schematic: for the E. coli protein P, proteins P1 (genome 1) and P2 (genome 2) are bi-directional best hits, and proteins P3 (genome 3), P4 (genome 4) and P5 (genome 5) are ‘best hit’ proteins. The best hits are collected and passed through BLASTclust, which yields Cluster 1 and Cluster 2. All proteins in the cluster that also contains the E. coli protein are considered as orthologs of the E. coli protein P.]
Figure 4.3: This figure illustrates the steps involved in ortholog detection. Information about
orthologs for every genome is stored as an ‘orthomap’ file and is available as
supplementary material.
4.2.3 Procedure to evaluate significance of the bias in gene conservation
Analysis of the conservation of transcription factors and target genes for the various genomes
indicates that target genes are more conserved than transcription factors. To evaluate the
significance of this observation, the following procedure was carried out. For each of the 175
genomes, orthologs of the proteins in the E. coli network were first identified. Among the identified
orthologs, the fractions of TFs and TGs conserved were calculated. A graph of % TFs conserved
against % TGs conserved was plotted, and the slope and intercept of the best fit to the observed
trend were obtained. Then, for each of the genomes, random networks of similar size were
created (Figure 4.4) 10,000 times. For each run, the slope and intercept for the plot of % TFs
conserved against % TGs conserved were obtained. The P-value was estimated as the fraction of
runs where the slope was less than or equal to what was observed.
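The randomisation test can be sketched as below. This is an illustrative reading of the procedure, not the original implementation: genes are drawn uniformly from the reference network, the fit is an ordinary least-squares line, and all names and toy numbers are assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def percent_conserved(conserved, tfs, tgs):
    """Return (% TFs conserved, % TGs conserved) for a set of conserved genes."""
    conserved = set(conserved)
    return (100.0 * len(conserved & tfs) / len(tfs),
            100.0 * len(conserved & tgs) / len(tgs))


def random_slope(ortholog_counts, tfs, tgs, rng):
    """One run: for every genome draw as many random genes as it has orthologs,
    then fit % TFs conserved (y) against % TGs conserved (x)."""
    genes = sorted(tfs | tgs)
    points = [percent_conserved(rng.sample(genes, k), tfs, tgs)
              for k in ortholog_counts]
    return fit_line([p[1] for p in points], [p[0] for p in points])[0]


def empirical_p_value(observed_slope, ortholog_counts, tfs, tgs, runs=10000, seed=0):
    """Fraction of runs whose slope is less than or equal to the observed one."""
    rng = random.Random(seed)
    hits = sum(random_slope(ortholog_counts, tfs, tgs, rng) <= observed_slope
               for _ in range(runs))
    return hits / runs


# Toy reference network: 6 TFs and 8 TGs; three genomes with 4, 8 and 12 orthologs.
toy_tfs = {f"tf{i}" for i in range(6)}
toy_tgs = {f"tg{i}" for i in range(8)}
p_value = empirical_p_value(1.1, [4, 8, 12], toy_tfs, toy_tgs, runs=200)
```

The toy call uses 200 runs only to keep the example fast; the text above uses 10,000.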
Figure 4.4: Procedure to create random networks to evaluate TF and TG conservation
[Schematic: generating ‘n’ random networks of similar size to evaluate TF and TG conservation. Reference network: 6 TFs, 8 TGs, 14 genes in total. Ortholog detection in the genome of interest gives 8 orthologs (of 14; 57% of genes), 1 TF (of 6; 17%) and 7 TGs (of 8; 88%). Eight genes are then selected randomly n times from the reference network, giving for example 4 TFs (of 6; 66%) and 4 TGs (of 8; 50%), or 3 TFs (of 6; 50%) and 5 TGs (of 8; 63%).]
Figure 4.4: This figure illustrates the procedure to generate random networks of similar size to
evaluate significance of TF and TG conservation in the different genomes.
4.2.4 Algorithm to reconstruct ancestral networks
A standard species tree (Figure 4.5) was adapted from Margulis and Schwartz (1998). Only
organisms whose genome size was greater than half the size of the E. coli genome were
considered, to avoid any bias arising from parasitic genomes that have lost
genes due to a specialised lifestyle. This criterion resulted in a total of 97 genomes.
Figure 4.5: Species tree
[Tree diagram with the node labels Root, Archaea, Eubacteria, Firmicutes, Gracilicutes, Proteobacteria, Euryarcheota, Crenarcheota, Deinococci, Actinobacteria, Endospora, Spirochaetea, Cyanobacteria and Chlorobia; the number of genomes in each lineage is given in parentheses.]
Figure 4.5: Species tree adopted for calculating ancestral networks.
Figure 4.6: The Dollo approach taken to calculate ancestral networks
[Tree diagram, as in Figure 4.5, annotated with the operation used at each node; u denotes the union over the genomes or child nodes listed.
Lineage nodes: AC = u [9], DN = u [1], CR = u [3], EU = u [4], EN = u [21], SP = u [1], CH = u [1], DE = u [2], BE = u [8], AL = u [11], CY = u [8]
Internal nodes: FIRM = u [EN, AC, DN]; ARCH = u [CR, EU]; EUB = ARCH + FIRM; GRAC = EUB + CY + CH + SP; PROT = GRAC + AL + BE + DE; GA = PROT + ALL - Ec; ROOT = ARCH]
Figure 4.6: The figure illustrates the principle to calculate the ancestral networks using the
Dollo parsimony approach.
A gene is considered to be present in a node if at least one of its child nodes contains it. The
function u represents the union operation. For example, the node CY = u [8] means that the
ancestor of all cyanobacterial genomes (8 genomes in this case) contains all the genes seen
in any of the 8 cyanobacterial genomes. The Dollo principle was used to reconstruct the
ancestral states for all the ancestral nodes. This principle states that a gene (ortholog) can be
gained only once. The emergence of a gene was thus assigned to the last common ancestor of
all lineages that contained the given gene (Figure 4.6). Once the gene composition had been
determined for each node in the tree, the transcriptional regulatory network was reconstructed
as described in section 4.2.1.
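The two ideas in this section, union of child gene sets and a single point of gain at the last common ancestor, can be sketched on a toy tree. The tree, node names and gene sets below are hypothetical and much smaller than the species tree used here.

```python
def leaves_under(tree, node):
    """Set of leaf names below (or equal to) a node; tree maps node -> children."""
    children = tree.get(node, [])
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= leaves_under(tree, child)
    return leaves


def ancestral_content(tree, node, leaf_genes):
    """Union rule: a gene is present in a node if any child node contains it."""
    children = tree.get(node, [])
    if not children:
        return set(leaf_genes[node])
    return set().union(*(ancestral_content(tree, c, leaf_genes) for c in children))


def dollo_origin(tree, root, leaves_with_gene):
    """Single gain: the deepest node whose subtree holds every leaf with the gene."""
    node = root
    while True:
        deeper = [c for c in tree.get(node, [])
                  if leaves_with_gene <= leaves_under(tree, c)]
        if not deeper:
            return node
        node = deeper[0]


# Hypothetical toy tree with two clades of two genomes each.
toy_tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
toy_genes = {"a1": {"g1", "g2"}, "a2": {"g1"}, "b1": {"g2", "g3"}, "b2": {"g3"}}

content_A = ancestral_content(toy_tree, "A", toy_genes)
origin_g1 = dollo_origin(toy_tree, "root", {"a1", "a2"})
origin_g2 = dollo_origin(toy_tree, "root", {"a1", "b1"})
```

Gene `g1` is seen only in clade A, so its gain is placed at node `A`; gene `g2` is seen in both clades, so it is placed at the root.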
4.2.5 Procedure to group genomes by interactions and genes conserved
A novel method to analyse the conservation of interactions was developed (Figure 4.7). For every
genome, an interaction conservation profile was calculated based on the reconstructed
networks. A distance matrix based on the interactions conserved was calculated. Relationships
between genomes, and between interactions or genes, were represented as a tree and a matrix using
the procedure shown in Figure 4.7.
The same procedure was also carried out for genes instead of transcriptional interactions. Using this
procedure, clusters of transcription factors that have similar conservation profiles were obtained.
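The distance used for the genome tree is, as drawn in Figure 4.7, one minus the ratio of interactions conserved in both genomes to interactions conserved in either. A minimal sketch of that calculation follows; the function names and the toy profiles are assumptions, and the clustering itself (K-means, neighbour-joining) is not reproduced.

```python
def interaction_distance(profile_a, profile_b):
    """D = 1 - (conserved in A and B) / (conserved in A or B) for 0/1 profiles."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    # Two genomes with no conserved interactions at all are given distance 0
    # here; this convention is an assumption of the sketch.
    return 1.0 - both / either if either else 0.0


def distance_matrix(profiles):
    """profiles: dict genome -> 0/1 list; returns dict of dicts of distances."""
    return {g: {h: interaction_distance(p, q) for h, q in profiles.items()}
            for g, p in profiles.items()}


# Toy profiles over ten interactions: 8 conserved in both, 10 conserved in either.
profiles = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "B": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
}
matrix = distance_matrix(profiles)
```

With these profiles the distance between A and B is 1 - 8/10 = 0.2, matching the worked example in the figure.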
Figure 4.7: Procedure to cluster genomes based on interactions conserved
[Schematic. For each organism (A, B, ..., Z), each of the 1295 E. coli interactions is recorded as conserved (yes = 1) or not (no = 0), giving an interaction conservation profile, e.g. organism A = 1 1 1 0 1 0 ... 1 and organism B = 1 0 1 1 1 0 ... 1 for interactions 1 to 6 and 1295.
Step 1: K-means clustering groups interactions with similar conservation profiles; 2-way clustering reorders both the interactions and the organisms.
Step 2: hierarchical clustering of the genomes from a distance matrix, where the distance between organisms A and B is
$D_{AB} = 1 - \frac{\text{interactions conserved in A and B}}{\text{interactions conserved in A or B}}$, e.g. $D_{AB} = 1 - \frac{8}{10} = 0.2$.
The distance matrix is clustered using the neighbour-joining algorithm to represent relationships between genomes in terms of interactions conserved.]
Figure 4.7: Representing the interactions as a vector (interaction profile) makes the network
amenable to standard clustering procedure.
4.2.6 Procedure to analyse scale free behaviour of conserved networks
The distribution of outgoing connectivity provides an indication of the large-scale structure of
networks. It is well established that the outgoing connectivity for the E. coli network follows a
scale-free behaviour, i.e. the distribution is best approximated by a power-law function $T = aK^{-b}$,
where T is the number of transcription factors with K connections. To evaluate the distribution
for the reconstructed networks, the following procedure was carried out. For each of the 175
genomes, the distribution was approximated by a linear function ($T = a + bK$), an exponential function
($T = ae^{-bK}$; $\log T = \log a - bK \log e$) and a power-law function ($T = aK^{-b}$; $\log T = \log a - b \log K$),
with a and b fitted separately for each function.
The function that best approximates the observed distribution was identified as the one with
the lowest error. To identify the trend in random networks, the following procedure was carried
out: 175 random networks of sizes similar to those observed in the genomes were created
10,000 times (Figure 4.8), and the procedure explained above was used to obtain the function
that best approximates the distribution for each random network. For each of the 175 genomes,
the mean and standard deviation of the power-law exponent over the 10,000 runs were calculated.
The P-value was calculated as the fraction of runs where the value of the exponent was greater
than or equal to the observed value. The Z-score was calculated as $Z = (\text{obs} - \text{mean})/\sigma$,
where $\sigma$ is the standard deviation.
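The model comparison can be sketched as follows. Several details are assumptions, since the text does not fix them: the fits are ordinary least squares (on log-transformed values for the exponential and power-law forms), natural logarithms are used, the error is the sum of squared residuals in the original T scale, and the sample standard deviation is used for the Z-score.

```python
import math
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sum_sq_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def best_fit(ks, ts):
    """Compare linear, exponential and power-law fits of T against K.
    Requires K > 0 and T > 0. Returns (best name, errors, power-law exponent b)."""
    log_k = [math.log(k) for k in ks]
    log_t = [math.log(t) for t in ts]

    slope, intercept = fit_line(ks, ts)          # T = a + bK
    linear = [intercept + slope * k for k in ks]

    slope, intercept = fit_line(ks, log_t)       # log T = log a - bK
    exponential = [math.exp(intercept + slope * k) for k in ks]

    slope, intercept = fit_line(log_k, log_t)    # log T = log a - b log K
    power_law = [math.exp(intercept + slope * lk) for lk in log_k]
    exponent = -slope

    errors = {
        "linear": sum_sq_error(linear, ts),
        "exponential": sum_sq_error(exponential, ts),
        "power_law": sum_sq_error(power_law, ts),
    }
    return min(errors, key=errors.get), errors, exponent


def z_score(observed, random_values):
    """Z = (obs - mean) / sigma over the values from the random runs."""
    return (observed - statistics.mean(random_values)) / statistics.stdev(random_values)


def empirical_p(observed, random_values):
    """Fraction of random runs with a value greater than or equal to the observed one."""
    return sum(v >= observed for v in random_values) / len(random_values)


# Toy distribution that follows T = 100 * K^-1.5 exactly.
toy_k = [1, 2, 4, 8, 16]
toy_t = [100.0 * k ** -1.5 for k in toy_k]
best_name, fit_errors, fitted_b = best_fit(toy_k, toy_t)
```

On the toy data the power-law form wins and the fitted exponent is recovered as b = 1.5 up to floating-point error.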
Figure 4.8: Procedure to create random networks as seen in the genome of interest
[Schematic: generating ‘n’ random networks of similar size (as seen in the genome of interest). Reference network: 8 TFs, 9 TGs, 21 interactions. Ortholog detection in the genome of interest finds 4 orthologous TFs and 7 orthologous TGs, giving a reconstructed network with 9 interactions. For the random networks, 4 TFs and 7 TGs are selected randomly n times from the reference network and a network is reconstructed each time based on the reference network, e.g. random network 1 (4 reconstructed interactions), random network 2 (8 reconstructed interactions) and random network 3 (9 reconstructed interactions).]
Figure 4.8: Random networks are generated by randomly choosing the same number of genes
as seen in the genome of interest and interactions are reconstructed as discussed before.
4.2.7 Algorithm to analyse conservation of ‘network motifs’
A novel algorithm that we developed to analyse the conservation of network motifs is shown in Figure
4.9. Feed-forward motifs (FFM), single input modules (SIM) and multi-input modules (MIM)
were identified using the methods described by Shen-Orr et al. (2002). A
motif was considered to be absolutely conserved in a genome if all the genes constituting the
motif in E. coli were conserved in the other genome. If some genes were missing, the fraction of
interactions in the motif that were conserved was noted. Thus, for each genome, an ordered
n-dimensional vector (motif conservation profile) was created, where n is the number of motifs
considered. The values represent the fraction of the interactions forming each motif that are
conserved. The resulting matrix was then subjected to the procedure explained in section 4.2.5 to
identify motifs that have similar conservation profiles.
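A motif conservation profile of this kind can be computed as sketched below. The representation of a motif as a list of (TF, TG) interactions, the function name and the toy motifs are assumptions for illustration; motif detection itself is not shown.

```python
def motif_conservation_profile(motifs, conserved_genes):
    """For each motif, the fraction of its interactions whose TF and TG are
    both conserved in the genome of interest.

    motifs: list of motifs, each a list of (tf, tg) interactions in the
            reference network.
    conserved_genes: set of reference genes with an ortholog in the genome.
    """
    profile = []
    for interactions in motifs:
        kept = sum(1 for tf, tg in interactions
                   if tf in conserved_genes and tg in conserved_genes)
        profile.append(kept / len(interactions))
    return profile


# Toy motifs (hypothetical gene names): one feed-forward motif and one
# single input module, each with three interactions.
feed_forward = [("X", "Y"), ("X", "Z"), ("Y", "Z")]
single_input = [("X", "A"), ("X", "B"), ("X", "C")]
profile = motif_conservation_profile([feed_forward, single_input],
                                     {"X", "Z", "A", "B", "C"})
```

Losing gene `Y` removes two of the three feed-forward interactions (profile value 1/3), while the single input module is absolutely conserved (3/3).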
Figure 4.9: Procedure to cluster genomes and motifs according to the extent conserved
[Schematic: feed-forward motifs (F1 ... Fn), single input modules (S1 ... Sn) and multi input modules (M1 ... Mn) identified in E. coli are examined in each organism (A, B, ..., Z). For each motif class, a motif conservation profile records the fraction of interactions conserved per motif, e.g. organism A: F1 = 3/3, Fn = 1/3, S1 = 2/3, Sn = 3/3, M1 = 2/4, Mn = 4/4; organism B: F1 = 3/3, Fn = 1/3, S1 = 2/3, Sn = 3/3, M1 = 1/4, Mn = 4/4. The profiles are then used for clustering of motifs (e.g. K-means).]
Figure 4.9: The two-way clustering provides information about which motifs have been
completely conserved in a set of genomes. Specific examples are discussed in the main text.
4.2.8 Procedure to evaluate significance of motif interaction conservation
To evaluate whether interactions in a motif are more conserved than any interaction in the
network, we introduce a term called the conservation index (C.I.), which is defined as follows:

$$\mathrm{C.I.}_{\text{genome } X} = \log_2\left(\frac{R_{\text{motif}}}{R_{\text{all}}}\right), \qquad R_{\text{motif}} = \frac{I^{\text{motif}}_{\text{genome } X}}{I^{\text{motif}}_{E.\ coli}} \quad \text{and} \quad R_{\text{all}} = \frac{I^{\text{all}}_{\text{genome } X}}{I^{\text{all}}_{E.\ coli}}$$

In this definition, $I^{\text{motif}}_{\text{genome } X}$ is the number of interactions forming a motif in E. coli that
have been conserved in genome X, and $I^{\text{motif}}_{E.\ coli}$ is the number of interactions in a motif in E. coli.
$I^{\text{all}}_{\text{genome } X}$ is the total number of interactions that have been conserved in genome X and
$I^{\text{all}}_{E.\ coli}$ is the total number of interactions in E. coli. The log2 of the ratio ensures that selection
for and against are represented symmetrically in the graph. For example, if $R_{\text{motif}} = 0.9$ and
$R_{\text{all}} = 0.6$, the C.I. is $\log_2(0.9/0.6) \approx 0.58$. Thus, if interactions in motifs are selected for, the
C.I. value will be greater than 0; if not, the value will be less than 0. Please refer to Appendix A for a
detailed procedure.
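The conservation index is simple enough to check numerically. The sketch below follows the definition above; the function and argument names are chosen for the example, and the interaction counts are toy values picked to reproduce the ratios 0.9 and 0.6.

```python
import math


def conservation_index(i_motif_x, i_motif_ref, i_all_x, i_all_ref):
    """C.I. = log2(R_motif / R_all), with
    R_motif = conserved motif interactions / motif interactions in the reference,
    R_all   = conserved interactions / all interactions in the reference.
    Undefined (raises an error) if either ratio is zero."""
    r_motif = i_motif_x / i_motif_ref
    r_all = i_all_x / i_all_ref
    return math.log2(r_motif / r_all)


# Toy counts giving R_motif = 0.9 and R_all = 0.6.
ci_for = conservation_index(90, 100, 60, 100)
# Swapping the two ratios gives the same magnitude with the opposite sign.
ci_against = conservation_index(60, 100, 90, 100)
```

The two toy values are equal in magnitude and opposite in sign, which is the symmetry that motivates taking the log2 of the ratio.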
4.3 Results and discussion
4.3.1 Reconstruction of transcriptional regulatory networks
While there has been considerable advance in unravelling the regulatory networks of various
model organisms such as E. coli, the extrapolation of this information to the range of poorly
studied organisms, whose complete genome sequences are now available, remains a major
challenge. One can infer transcriptional targets of a regulator in a genome of interest by
identifying the orthologous transcription factor and its target genes in a genome with
experimentally characterized transcriptional regulatory interactions. This procedure of
transferring information about transcriptional regulation from a genome with known regulatory
interactions to another genome by identifying orthologous proteins was recently assessed by Yu
et al. (2004b). It was found to be a robust method for predicting such interactions in
eukaryotes, and we have benchmarked the predicted transcriptional interactions using
expression data available for E. coli and V. cholerae. We found that genes which are regulated
by the same set of TFs in E. coli, for which the transcriptional network is known, and V.
cholerae, for which predictions were based on the reconstructed network, tend to have very
similar expression profiles (Figure 4.10). This result supports the validity of reconstructing
transcriptional networks using our procedure.
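The benchmark compares expression profiles of gene pairs that share exactly the same set of regulators. A minimal sketch of how such pairs can be listed and scored with the Pearson correlation coefficient is given below; the function names, gene names and expression values are hypothetical, and the microarray processing is not reproduced.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coregulated_pairs(edges):
    """Pairs of target genes regulated by exactly the same set of TFs."""
    regulators = {}
    for tf, tg in edges:
        regulators.setdefault(tg, set()).add(tf)
    return [(a, b) for a, b in combinations(sorted(regulators), 2)
            if regulators[a] == regulators[b]]


# Toy network and expression data (hypothetical).
toy_edges = [("T1", "g1"), ("T1", "g2"), ("T1", "g3"), ("T2", "g3")]
toy_expr = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0], "g3": [3.0, 2.0, 1.0]}
pairs = coregulated_pairs(toy_edges)
pair_pcc = {pair: pearson(toy_expr[pair[0]], toy_expr[pair[1]]) for pair in pairs}
```

Gene `g3` has an additional regulator `T2`, so only `g1` and `g2` count as a co-regulated pair.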
Using the transcriptional interaction network available for E. coli as our template, we used an
ortholog detection procedure and reconstructed partial transcriptional interaction networks, for
the first time, for 175 prokaryotic genomes. The transcriptional interaction network for E. coli is
shown in Figure 4.11, along with reconstructed networks for a pathogen, Bacillus anthracis and
a free-living organism, Streptomyces coelicolor. It should be emphasized that we can only
reconstruct transcriptional interactions that are present in E. coli, and that many
interactions remain invisible because our template is limited to the E. coli network.
Nevertheless, the methods presented in this work should be applicable to any network, and hence
will benefit as more data become available for other genomes.
Figure 4.10: Benchmarking predicted transcriptional regulatory interactions using expression data
[Two panels: (a) E. coli and (b) V. cholerae. x-axis: Pearson Correlation Coefficient (-1 to 1); y-axis: relative distribution (0 to 1.2). Series in each panel: co-regulated pairs of TGs, TF-TG pairs and random pairs of genes.]
Figure 4.10: Gene expression data for the E. coli and V. cholerae genomes were downloaded from the IECA consortium website (http://www.uni-giessen.de/~gx1052/IECA/ieca.html) and from the Stanford Microarray Database (Gollub et al., 2003). For both genomes, the Pearson Correlation Coefficient
(PCC) was calculated for (i) Transcription factor – Target gene pair, where the TF regulates TG (blue line) (ii) Pairs of TGs, which have the same set of TFs
regulating them (i.e. a pair of genes in this category will have the same set of TFs regulating both the genes and each gene will NOT have any other
additional TF regulating them; red line) and (iii) Random pair of genes (black line). (a) Normalised distribution of PCC for 1295 pairs of experimentally verified
TF-TG interactions, 2570 pairs of co-regulated TGs from the known transcriptional regulatory network and 10,000 random pairs of genes for E. coli. The plot
shows that a large fraction of pairs of co-regulated genes have very high PCC. The peak at PCC value of 0.9 is much higher than what is seen for the random
pairs of genes indicating that genes that have the same set of TFs are regulated the same way in the E. coli network (b) Normalised distribution of PCC for
507 pairs of predicted TF-TG interactions, 1142 pairs of predicted co-regulated TGs and 10,000 random pairs of genes for V. cholerae. The plot behaves the
same way as seen for E. coli. The same fraction of the predicted co-regulated genes [calculated from the predicted transcriptional regulatory network using
orthology information; see methods] in V. cholerae has a PCC value of 0.9, strongly lending support to the validity of the predictions.
Figure 4.11: Reconstructed transcriptional networks
[Network diagrams for the three genomes, with summary counts on the right and a table of DNA-binding domain families below. Columns are listed in the order in which the genomes appear in the figure.]

                                E. coli K12   B. anthracis A2012   S. coelicolor
Genes in genome                 4311          5544                 7769
Transcription factors           112           38                   41
Target genes                    711           250                  251
Regulatory interactions         1295          314                  326

Regulatory network motif counts (values as extracted; their assignment to motif type and genome is unclear): 12, 49, 171, 156, 78, 100, 496, 308, 760

DNA-binding domain family       E. coli K12   B. anthracis A2012   S. coelicolor
Winged HTH                      96            111                  226
Classical prokaryotic HTH       42            39                   200
C-terminal effector domain      39            45                   142
Cro/C1 type HTH                 25            47                   87
FIS like                        13            6                    1
Figure 4.11: Known transcriptional regulatory network for E. coli and reconstructed transcriptional regulatory networks for a pathogen, Bacillus anthracis and
a free-living organism, Streptomyces coelicolor. Transcription factors and target genes are represented as red and blue circles and transcriptional interaction
is represented as a black line. Information about the number of transcription factors, target genes, regulatory interactions and regulatory network motifs (Lee
et al., 2002; Milo et al., 2002) is provided on the right. This figure illustrates how we can partially reconstruct transcriptional networks for genomes, including
organisms that are poorly characterised but important. The table below the networks shows the number of transcription factors that have a DNA-binding
domain belonging to a particular family. Comparing the numbers makes it clear that each genome has evolved its own set of transcriptional regulators
by using the same DNA-binding domains to different extents.
4.3.2 Conservation of transcription factors and target genes
To assess the basic trends in the evolutionary conservation of the TRN, we asked if
transcriptional regulators and their targets are differentially conserved in evolution. Figure 4.12
is a graph of the fraction of target genes conserved against the fraction of transcription factors
conserved for each of the 175 genomes. The graph shows that transcription factors are less
conserved across genomes during evolution when compared to their target genes. We
assessed the significance of the bias for conservation of target genes versus transcription
factors by comparing to random networks, and found that the bias is statistically significant (p-value < $10^{-4}$).
Figure 4.12: Gene conservation profile
[Scatter plot. x-axis: target gene conservation (%), 0 to 100; y-axis: transcription factor conservation (%), 0 to 100. Best fit: y = 1.1173x - 16.609, $R^2$ = 0.9223.]
Figure 4.12: For each of the 175 genomes, the fraction of target genes conserved (x-axis) is
plotted against the fraction of transcription factors conserved (y-axis). The diagonal, shown in
green, represents equal conservation of transcription factors and target genes, and every point
represents a genome. The graph shows that more target genes are conserved than
transcription factors in the genomes considered (y = 1.1173x - 16.609, $R^2$ = 0.9223, p-value <
$10^{-4}$). The significance of this trend was assessed by simulating network evolution, which
involved neutral removal of different sets of genes 10,000 times (see methods 4.2.3).
One might expect there to be a link between the conservation of transcription factors and that of their
target genes; research on the evolution of protein-protein interactions has shown that
interacting pairs of proteins tend to be conserved across genomes in a concerted manner
(Pagel et al., 2004). In the TRNs studied here, however, interacting TF-TG pairs are not preferentially
conserved compared to random pairs of genes in the network (Figure 4.13). This suggests that
transcription factors and their target genes are conserved independently of each
other.
Figure 4.13: Interaction conservation profile
[Plot titled ‘Regulatory interaction conservation’. x-axis: genes lost (%), 0 to 100; y-axis: interactions conserved (%), 0 to 100. Series: observed trend in genomes and random pair of genes.]
Figure 4.13: Each point in this graph represents a genome. We find that there is no preferential
conservation of a transcription factor target gene pair when compared to conservation of any
random pairs of genes.
Several organisms in our study had few transcription factors with orthologues in the E. coli
reference TRN, such as B. anthracis and S. coelicolor in Figure 4.11. However, these genomes
are relatively large and must contain a complex transcriptional regulatory network. Hence, we
searched to identify proteins with DNA-binding domains typically represented in known
transcription factors in all the genomes using sensitive profile methods (Madan Babu and
Teichmann, 2003). In most genomes we could identify other transcription factors belonging to
the same repertoire of DNA-binding domain families as in E. coli (Figure 4.11), but not
orthologous to the E. coli proteins. For small genomes, typically those of obligate parasites with
low fractions of transcription factors, no additional transcription factors were detectable. As
previously observed by van Nimwegen (van Nimwegen, 2003), we also see that there is a nonlinear increase in the number of regulatory proteins when compared against the genome size
(Figure 4.14). These observations, together with the higher degree of conservation of target
genes, suggest that: 1) In small parasitic genomes, transcription factors have been lost due to
absence of selective pressures for regulation. 2) In larger genomes, target genes are often
controlled by additional regulators or regulators that are non-orthologous, so that the genes are
responsive to different signals. Figure 4.15 is a result of the clustering procedure discussed in
the methods, and shows the profile of transcription factor conservation across the 175
genomes. Such a representation provides information about the different signals to which
organisms can respond. For example, organisms that can respond to changes in the levels of
fucose and cAMP are marked in Figure 4.15.
Figure 4.14: Genome size v/s number of predicted transcription factors
[Scatter plot titled ‘Transcription factors v/s genome size’. x-axis: genome size (0 to 9000); y-axis: transcription factors (0 to 700). Power-law fit: $y = 9 \times 10^{-5} x^{1.75}$, $R^2$ = 0.83. Labelled genomes: S. coelicolor, B. japonicum, B. pertussis, B. bronchiseptica, B. parapertussis, D. hafniense, M. magnetotacticum, N. punctiforme, Nostoc sp., Pirellula sp. and L. interrogans.]
Figure 4.14: The number of predicted transcription factors (y-axis) increases non-linearly with
the increase in genome size (x-axis), indicating that new transcription factors evolve as genome
size increases. A power-law equation $y = 9 \times 10^{-5} x^{1.75}$ fits the data best ($R^2$ = 0.83).
Figure 4.15: Transcription factor conservation profile
[Matrix of TFs against genomes, with each cell marked as present or absent. Two sets of genomes are highlighted: genomes that can respond to cAMP levels (Crp) and genomes that can respond to fucose levels (FucR).]
Figure 4.15: The 112 TFs in E. coli are shown as rows and the 175 genomes as columns. If an ortholog of an E. coli TF is present in a genome, the
corresponding cell is coloured red; if it is absent, the cell is coloured blue. The procedure to build and cluster transcription factors based on conservation profile is described in the
methods section. This figure illustrates that different transcription factors have been conserved to varying extents in the different genomes. A representation like this
can be extremely helpful in finding sets of factors that have similar conservation profiles, and hence can be used to identify the signals to which the genomes will
respond. Visit: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/reconstruct_net/ to get a larger version of this image along with the raw data. Matrices were
prepared using Matrix2PNG (Pavlidis and Noble, 2003).
4.3.3 Conservation of transcriptional regulatory networks
Having reconstructed the TRNs for the genomes, we then reconstructed ancestral networks for
the different ancestors of the phylogenetic lineages (see methods 4.2.4). Figure 4.16 shows the
ancestral networks for the genomes considered. The predicted ancestral network of eubacteria
and archaea that is shared with E. coli shows scale-free behaviour, with Crp, Fnr and Lrp as
major regulatory hubs. This suggests that the ancestral genome could sense changes in cAMP
and oxygen levels. The predicted ancestral network consists of at least 62 TFs that regulate
genes needed for basic living. These include regulators of genes involved in purine
biosynthesis, fructose utilization, xylulose utilization, iron uptake, idonate utilization,
hydroxy-phenyl compound utilization, fatty acid biosynthesis, anaerobic respiration, leucine
biosynthesis and multiple antibiotic resistance, among others. From the figure it is evident that
specific genes have been lost down the archaeal lineage, whereas specific regulatory systems
have been gained down the eubacterial lineages. The gained systems include transcriptional
regulators that can sense a variety of sugar molecules, together with target genes that can
utilize these sugars to generate energy (e.g. melibiose, mannitol, glucitol, galactose, etc).
Further down from this
point, the firmicute lineage has lost regulatory systems that can utilize L-idonate, and the
sulphur utilization system, whereas the gracilicute lineage has conserved them and has
additionally gained the Fis protein, which can control gene expression by regulating the
superhelicity of the DNA. Further down the gracilicute lineage, there have been multiple
instances where various regulatory systems have been lost in the different lineages. These
results suggest that there are some ancient regulatory pathways that were present in the
ancestral network which have been lost in the different lineages. The dynamics and rationale for
loss or conservation of specific systems will be discussed in the following sections.
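The ancestral reconstruction step can be illustrated with a small sketch. It assumes strict Dollo parsimony (a gene is gained once and can only be lost afterwards), which is one reading of the Dollo principle named for Figure 4.16; the procedure of method 4.2.4 may differ in detail. The tree, the leaf names and the function names below are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy species tree: each internal node maps to its children.
TREE: Dict[str, List[str]] = {
    "root": ["archaea", "eubacteria"],
    "archaea": ["A1", "A2"],
    "eubacteria": ["firmicutes", "gracilicutes"],
    "firmicutes": ["F1", "F2"],
    "gracilicutes": ["G1", "G2"],
}


def leaves_with_gene(node: str, tree: Dict[str, List[str]],
                     present: Set[str], cache: Dict[str, Set[str]]) -> Set[str]:
    """Collect, for every node, the leaves below it that carry the gene."""
    if node not in tree:  # leaf
        found = {node} if node in present else set()
    else:
        found = set()
        for child in tree[node]:
            found |= leaves_with_gene(child, tree, present, cache)
    cache[node] = found
    return found


def dollo_presence(tree: Dict[str, List[str]], root: str,
                   present: Set[str]) -> Set[str]:
    """Internal nodes inferred to carry the gene under strict Dollo parsimony."""
    cache: Dict[str, Set[str]] = {}
    all_carriers = leaves_with_gene(root, tree, present, cache)
    inferred: Set[str] = set()
    if not all_carriers:
        return inferred

    def walk(node: str, below_origin: bool) -> None:
        if node not in tree:
            return
        # The origin is the deepest node whose subtree holds every carrier.
        is_origin = (not below_origin
                     and cache[node] == all_carriers
                     and all(cache[c] != all_carriers for c in tree[node]))
        inside = below_origin or is_origin
        # Below the origin, a node keeps the gene if any descendant has it.
        if inside and cache[node]:
            inferred.add(node)
        for child in tree[node]:
            walk(child, inside)

    walk(root, False)
    return inferred


# A gene seen only in one firmicute and one gracilicute leaf is placed in the
# eubacterial ancestor, not at the root.
print(sorted(dollo_presence(TREE, "root", {"F1", "G1"})))
```

Applying the same rule to every transcription factor and target gene, and keeping an interaction wherever both partners are inferred present, would give an ancestral network at each internal node.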
We then sought to address the extent to which genomes have conserved the same set of
interactions. To do this, we developed an algorithm to deduce the relationship between different
genomes based on conservation of specific sets of transcriptional interactions (see method
4.2.5). We first represented the presence or absence of nodes and edges of the reference
transcriptional network graph from E. coli in each of the 175 prokaryotic genomes as a binary
‘interaction conservation profile’. The organisms were then hierarchically clustered based on
their interaction conservation profiles (Figure 4.17). This method is generic and can be used to
compare any type of network.
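The profile-building and clustering steps can be sketched as follows. This is a minimal illustration, not the procedure of method 4.2.5: it assumes an interaction counts as conserved when both the transcription factor and the target gene have orthologs in a genome, and it uses Hamming distance with average linkage, neither of which is stated in the text. All gene sets and genome names are hypothetical.

```python
from itertools import combinations

# Hypothetical reference edges (TF, target) and ortholog sets per genome.
REFERENCE_EDGES = [("crp", "araB"), ("crp", "malT"),
                   ("fnr", "narG"), ("fnr", "frdB")]
ORTHOLOGS = {
    "genomeA": {"crp", "araB", "malT", "fnr", "narG", "frdB"},
    "genomeB": {"crp", "araB", "malT", "fnr", "narG"},
    "genomeC": {"fnr", "narG"},
    "genomeD": {"fnr", "frdB"},
}


def interaction_profile(edges, genes):
    """Binary profile: 1 if both partners of an edge have orthologs."""
    return [1 if tf in genes and target in genes else 0 for tf, target in edges]


def hamming(p, q):
    """Fraction of positions at which two profiles differ."""
    return sum(a != b for a, b in zip(p, q)) / len(p)


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def dist(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(hamming(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


PROFILES = {name: interaction_profile(REFERENCE_EDGES, genes)
            for name, genes in ORTHOLOGS.items()}
for left, right, distance in average_linkage(PROFILES):
    print(left, right, round(distance, 3))
```

Because the profile is just a binary vector over the edges of a reference graph, the same code applies to any network for which presence or absence of edges can be scored.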
The clustering of organisms depicted in Figure 4.17 reveals that transcriptional interactions
appear to be shaped by a set of disparate forces. At low to moderate evolutionary distance from
the reference organism, namely proteobacteria, the clustering generally mirrors organismal
relationships, suggesting that the TRN retains a noticeable phylogenetic signal. Beyond the
proteobacteria, a number of other effects appear to dominate, upon the background of a weaker
phylogenetic signal. These include the similarities in genome size and ecological adaptations:
for example various soil bacteria with comparable genome sizes, such as Mycobacterium
tuberculosis and Bacillus subtilis, form a cluster (Figure 4.17). Likewise, the obligate or
intracellular parasites from diverse bacterial clades, namely Mycoplasma, Rickettsiae and
Chlamydiae, group together in this analysis. Based on the information about conservation of
specific transcription factors, and their interactions, we could reconstruct the presence of
transcriptional response pathways similar to those in the reference network in various other
genomes. These results suggest the conservation of certain transcriptional regulatory pathways
and could have predictive value in probing poorly characterized or experimentally intractable
organisms, including key human pathogens. For example, important pathogens like Yersinia
pestis and Pseudomonas syringae, and a key symbiont of the soybean plant, Bradyrhizobium
japonicum, have conserved the GntR, KdgR, ExuR and UxuR transcriptional regulators, which can
sense different hexuronates and hexuronides, suggesting the conservation of the pathway that
catabolizes these compounds. Examination of target gene conservation shows that the target
genes are indeed conserved, allowing us to predict the presence of these response pathways and
their transcriptional regulators.
Figure 4.16: Ancestral transcriptional regulatory networks
[Panel a, "Ancestral networks": reconstructed networks drawn at the nodes of the species tree. Labelled nodes: Root, Eubacteria, Archaea, Firmicutes, Gracilicutes, Proteobacteria, Cyanobacteria, Chlorobia, Spirochaetea, Endospora, Actinobacteria, Deinococci, Euryarcheota, Crenarcheota. The number of genomes in each group is given in parentheses and each network is annotated with three counts; the assignment of these values to nodes is not recoverable from the extracted text.]
Figure 4.16: A species tree was assumed in order to reconstruct the ancestral networks. Red and
blue circles in each reconstructed network represent transcription factors and target genes.
Black lines represent transcriptional interactions. The Dollo principle was used to arrive at the
composition of the ancestral networks. Only genomes with more than 2,150 genes (half the size
of the E. coli genome) were considered, leaving 97 genomes. This figure illustrates that some
regulatory pathways present in the ancestral genome have been conserved or lost in the
different lineages after their divergence from a common ancestor.
Figure 4.17: Genome tree based on the transcriptional interaction conservation profile
[Panel "Distance tree based on interaction conservation": unrooted tree whose leaves are labelled with genome abbreviations and phylogenetic groups. Clusters are marked i, ii and iii. Genomes highlighted with reconstructed networks: Y. pestis, D. desulfuricans, A. fulgidus, P. aeruginosa, S. pyogenes, M. tuberculosis, B. pertussis, R. rubrum.]
Figure 4.17: Genomes were clustered according to the interactions they conserve. The figure
shows that genomes in the same phylogenetic group generally cluster together (i). Parasitic
genomes (ii), and genomes with a similar lifestyle but belonging to different phylogenetic groups
(iii), also cluster together. This suggests that interactions are gained or lost depending on an
organism’s lifestyle and the environmental conditions in which it lives. The reconstructed
transcriptional network for a representative organism from each group is shown alongside.
4.3.4 Evolution of global network structure
Above we considered conservation of genes and interactions with respect to their functions as
regulators or regulated genes, and the phylogenetic relationships of the genomes. Here we
focus on the conservation of genes and interactions in the context of network topology. TRNs
are known to have an overall structure in which the outgoing connectivity of transcription factors
follows a power-law function $y = a x^{\gamma}$, where y is the number of transcription factors, x is
the number of target genes and $\gamma$ is the scale-free exponent. In other words, there are few
transcription factors that regulate many target genes, so these influential transcription factors
are hubs in the network.
Albert and Barabasi, among many others, have shown that this so-called scale-free behaviour
should hold for randomly selected parts of networks of this form (Barabasi and Albert, 1999). We
observed that all the conserved transcriptional networks follow a scale-free distribution (Figure
4.18), as do the ancestral networks at internal nodes of the phylogenetic tree shown in Figure
4.16. Therefore, even though particular regulatory hubs may be lost or replaced, as shown for
Haemophilus influenzae and Bordetella pertussis in Figure 4.19, the scale-free topology is still
present. From Figure 4.18, it is evident that the exponent ($\gamma$ in the fit $y = a x^{\gamma}$)
increases in genomes where the fraction of conserved nodes is less than 60%, and this is
statistically significant. This implies that a stronger hub-like structure is retained in poorly
conserved networks (Figure 4.19).
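As an illustration of how such an exponent can be estimated, the sketch below fits a power law to an out-degree distribution by ordinary least squares on log-transformed counts. This is one simple fitting choice and is an assumption here; the procedure of method 4.2.6 may differ. The function name and the toy degree list are hypothetical.

```python
import math
from collections import Counter


def fit_power_law(out_degrees):
    """Fit y = a * x**exponent by least squares on log y versus log x.

    x is the number of target genes of a transcription factor and y is the
    number of transcription factors that have exactly x target genes.
    """
    counts = Counter(k for k in out_degrees if k > 0)
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    a = math.exp(mean_y - exponent * mean_x)
    return a, exponent


# Toy network: 64 TFs with 1 target, 32 with 2, 16 with 4 and 8 with 8,
# which follows y = 64 * x**-1 exactly.
toy_degrees = [1] * 64 + [2] * 32 + [4] * 16 + [8] * 8
a, exponent = fit_power_law(toy_degrees)
print(round(a, 3), round(exponent, 3))
```

With this sign convention the fitted exponent is negative for a hub-dominated network, in line with the fit reported for H. influenzae in Figure 4.19.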
Figure 4.18: Scale-free behaviour in conserved networks
[Plot: scale-free exponent (y-axis, -1 to -0.2) against genes conserved (%) (x-axis, 30 to 100). Series: observed trend in genomes; observed trend in random networks of similar size.]
Figure 4.18: The fraction of genes conserved (x-axis) is plotted against the scale-free exponent
of the reconstructed networks (y-axis) for the different genomes, in black. The average value
obtained after simulating the neutral condition for networks of similar size 10,000 times is
shown in green. This plot shows that for genomes which conserve about 30-60% of the genes, the
exponent ($\gamma$ in the fit $y = a x^{\gamma}$) is higher than in random networks, suggesting a
more pronounced hub-like behaviour. This supports the observation that there is weak selection
for hubs in the same genomes (30-60% of genes conserved).
Taken together, the fact that the conserved networks still show scale-free behaviour, but with
exponent values quite different from those expected in random networks of similar size,
suggests that the complete networks of these organisms are scale-free but quite different from
that of E. coli. This would mean that for genomes distantly related to E. coli, the scale-free
behaviour is not inherited from the common ancestor; instead, it has emerged independently in
each lineage during evolution.
Figure 4.19: Scale-free behaviour and loss of regulatory hubs
[Network diagrams for E. coli, H. influenzae and B. pertussis, with the positions of the regulatory hubs Crp and NarL marked in each network.]
Figure 4.19: Reconstructed transcriptional regulatory networks for H. influenzae and B.
pertussis are shown to illustrate that scale-free behaviour is still conserved even though
regulatory hubs can be lost or replaced. For example, the distribution of outgoing connections
for the conserved network in H. influenzae is $y = 214.8\,x^{-0.49}$ even though the regulatory
hub NarL is lost.
Is this because transcription factors that are hubs are preferentially conserved? Earlier studies
on protein-protein interaction networks (PINs) have suggested that the proteins with most
interactions tend to be more conserved across organisms (Jeong et al., 2001; Yu et al., 2004a).
Interestingly, there is no correlation between the connectivity of a particular transcription factor
and its conservation across genomes, as shown in Figure 4.20. We establish this by comparing
the observed conservation of regulatory interactions to models where transcription factors are
lost in a selectively neutral manner, or hubs are preferentially conserved or lost. The observed
pattern is closest to neutral removal of transcription factors independent of their number of
target genes in the E. coli TRN. Table 4.1 provides specific examples of transcription factors
with many target genes that are conserved in few genomes, and transcription factors with low
connectivity that are conserved in many genomes.
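The three removal models can be sketched on a toy network. This is a simplified illustration under stated assumptions: only transcription factors are removed (the simulations described for Figure 4.20 remove genes), an interaction is lost when its transcription factor is lost, and the network, the mode names and the function names are hypothetical.

```python
import random


def interactions_conserved(edges, kept_tfs):
    """Percentage of reference interactions whose TF is still present."""
    return 100.0 * sum(tf in kept_tfs for tf, _ in edges) / len(edges)


def removal_curve(edges, mode, seed=0):
    """Percentage of interactions left after each successive TF removal."""
    degree = {}
    for tf, _ in edges:
        degree[tf] = degree.get(tf, 0) + 1
    order = sorted(degree)
    if mode == "neutral":          # connectivity plays no role
        random.Random(seed).shuffle(order)
    elif mode == "hubs_kept":      # weakly connected TFs are lost first
        order.sort(key=lambda tf: degree[tf])
    elif mode == "hubs_lost":      # highly connected TFs are lost first
        order.sort(key=lambda tf: -degree[tf])
    else:
        raise ValueError(f"unknown mode: {mode}")
    kept = set(order)
    curve = [100.0]
    for tf in order:
        kept.discard(tf)
        curve.append(interactions_conserved(edges, kept))
    return curve


# Toy network: one hub with six targets, one TF with three, one with one.
TOY_EDGES = ([("hub", f"g{i}") for i in range(6)]
             + [("tf2", f"h{i}") for i in range(3)]
             + [("tf3", "k0")])

for mode in ("hubs_kept", "neutral", "hubs_lost"):
    print(mode, removal_curve(TOY_EDGES, mode))
```

Comparing an observed curve with the three simulated ones, averaged over many random orders for the neutral case, indicates which model the data resemble most.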
Why is there no bias for conservation of influential transcription factors? It was recently reported
that most regulatory hubs in yeast are condition-specific, i.e. they regulate many genes only in
particular conditions, but remain silent in other conditions (Luscombe et al., 2004). Most
prokaryotic transcription factors are also condition-specific, in that they respond
to specific environmental signals (Martinez-Antonio et al., 2003). If an organism has a lifestyle
that does not involve a particular condition, then that particular regulatory protein can be lost
(Madan Babu, 2003). For example, E. coli and P. aeruginosa have conserved the MhpR, HcaR
and FeaR transcriptional regulators because they can sense specific hydroxy-phenyl
compounds and activate genes that catabolize these compounds to derive energy. However,
the regulators and their target genes have been lost in pathogens like Staphylococcus aureus
and Campylobacter jejuni because the environments in which they predominantly live do not
contain these aromatic compounds.
Figure 4.20: Interaction conservation profile for observed and random networks
[Plot "Regulatory interaction conservation": interactions conserved (%) (y-axis, 0 to 100) against genes lost (%) (x-axis, 0 to 100). Curves: selection for regulatory hubs; observed trend in genomes; neutral removal of genes; selection against regulatory hubs.]
Figure 4.20: The fraction of transcriptional regulatory interactions conserved in the network
(y-axis) is plotted against the fraction of genes lost (x-axis) for each of the 175 genomes
($y = 0.007x^2 - 1.7843x + 100$, $R^2 = 0.95$). To understand what the observed trend means, we
simulated the process of network evolution by incorporating different rules: (i) remove genes
with low connectivity first, implying selection to conserve highly connected transcription factors,
shown in blue; (ii) remove genes neutrally with no special emphasis on connectivity, implying
neutral evolution, shown in green; and (iii) remove nodes with high connectivity first, implying
selection against highly connected regulatory proteins, shown in red. We find that the observed
trend is most similar to the trend obtained when simulating the neutral condition.
Table 4.1: Conservation and connectivity profile for transcription factors
A. Proteins with few target genes (k ≤ 15) but conserved in more than 50% of the genomes.
Transcription factor | GI number | Low connectivity (k ≤ 15) | High conservation (≥ 50%) | Function
Ada | 16130150 | 5 | 74.86 | Transcriptional regulator of DNA repair
BirA | 16131807 | 5 | 89.71 | Biotin operon repressor
MarR | 33347561 | 7 | 54.29 | Repressor of multiple antibiotic resistance (Mar) operon
DnaA | 16131570 | 10 | 87.43 | Transcriptional regulator of housekeeping genes
CpxR | 16131752 | 14 | 55.43 | Regulator of a 2CST*
OmpR | 16131282 | 14 | 57.14 | Regulator of adaptive response (2CST*)
NarP | 16130130 | 15 | 60.57 | Regulator of aerobic respiration (2CST*)
B. Proteins with many target genes (k > 15) but conserved in less than 50% of the genomes.
Transcription factor | GI number | High connectivity (k > 15) | Low conservation (< 50%) | Function
FhlA | 16130638 | 16 | 34.29 | Formate hydrogen lyase activator
Rob | 16132213 | 17 | 24.00 | Transcriptional activator for antibiotic resistance
CysB | 16129236 | 18 | 25.14 | Regulator of cysteine biosynthesis
FruR | 16128073 | 25 | 21.14 | Fructose operon repressor
Hns | 16129198 | 26 | 14.29 | General regulator
PurR | 16129616 | 29 | 33.14 | Purine biosynthesis repressor
Fis | 16131149 | 34 | 28.00 | Regulator of rRNA and tRNA operons
Lrp | 16128856 | 53 | 44.57 | Leucine responsive regulatory protein
NarL | 16129184 | 66 | 36.00 | Regulator of anaerobic respiration
ArcA | 16132218 | 70 | 40.00 | Regulator of aerobic respiration
Ihf | 16129668 | 96 | 37.71 | General regulator
Fnr | 16129295 | 110 | 49.14 | Transcriptional regulator of aerobic, anaerobic respiration and osmotic balance
*2CST: Two-component signal transduction system
What may allow relatively quick loss and gain of transcriptional regulatory interactions in
evolution is that this can be achieved by a small number of mutations in the DNA-binding
interface of the transcription factor or in the DNA-recognition site near the target gene.
Protein-protein interactions, on the other hand, involve larger interfaces, and hence more
mutations are needed to alter them. Indeed, Maslov et al. (2004) have shown that
transcriptional regulatory interactions of paralogous proteins within the yeast genome evolve
more quickly than protein-protein interactions.
4.3.5 Evolution of local network structure
The structure of transcriptional regulatory networks is scale-free at the global level, while at the
local level there are recurrent topological patterns of interconnections called network motifs (Lee
et al., 2002; Milo et al., 2002; Shen-Orr et al., 2002), which can be viewed as building blocks of
the network. Individual motifs have specific information processing ability and dictate specific
patterns of gene expression (Kalir and Alon, 2004; Mangan and Alon, 2003). To evaluate the
motif conservation in TRNs, we devised a new algorithm (method 4.2.7) that clusters motifs
according to their conservation profile. In principle this algorithm can be applied to any set of
networks. We create a motif conservation profile for each genome, and then carry out a two-way
clustering procedure: a k-means clustering of all motifs with similar occurrence profiles
across genomes, followed by a hierarchical clustering of genomes based on their motif
conservation profiles (Figure 4.21 and Figure 4.22). This procedure clusters motifs according to
the conservation of their interactions, and allows us to study evolutionary changes in local
network structure.
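The profile-building step that precedes the two-way clustering can be sketched as below; the clustering itself is not shown. The sketch enumerates feed-forward motifs in a reference network and scores each one per genome as the fraction of its three interactions whose partners both have orthologs. That scoring rule is an assumption made for illustration and may not match method 4.2.7; the edge list and genome contents are hypothetical.

```python
def feed_forward_motifs(edges):
    """Return all (X, Y, Z) with X->Y, X->Z and Y->Z in the edge list."""
    targets = {}
    for tf, target in edges:
        targets.setdefault(tf, set()).add(target)
    motifs = []
    for x in sorted(targets):
        for y in sorted(targets[x]):
            if y == x or y not in targets:
                continue  # Y must itself be a regulator
            for z in sorted(targets[x] & targets[y]):
                if z not in (x, y):
                    motifs.append((x, y, z))
    return motifs


def motif_conservation(motif, genes):
    """Fraction of the motif's three interactions conserved in a genome."""
    x, y, z = motif
    motif_edges = [(x, y), (x, z), (y, z)]
    kept = sum(a in genes and b in genes for a, b in motif_edges)
    return kept / len(motif_edges)


# Hypothetical reference edges and ortholog sets.
EDGES = [("fnr", "narL"), ("fnr", "nuoN"), ("narL", "nuoN"),
         ("fnr", "frdB"), ("narL", "frdB"), ("crp", "araB")]
GENOMES = {
    "genomeA": {"fnr", "narL", "nuoN", "frdB", "crp", "araB"},
    "genomeB": {"fnr", "nuoN", "frdB"},          # NarL lost
    "genomeC": {"crp", "araB"},                  # whole system lost
}

MOTIFS = feed_forward_motifs(EDGES)
# Rows are genomes, columns are motifs, as in Figure 4.21.
PROFILE = {name: [motif_conservation(m, genes) for m in MOTIFS]
           for name, genes in GENOMES.items()}
for name, row in PROFILE.items():
    print(name, [round(v, 2) for v in row])
```

The resulting matrix is the input that a k-means step over columns and a hierarchical step over rows would operate on.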
Figure 4.21: Feed forward motif conservation profile
[Heat map "Feed forward motifs": genomes as rows, motifs as columns, colour scale for conservation from 0% to 100%. Annotation: specific sets of E. coli FFMs conserved in selected genomes.]
Figure 4.21: The 175 genomes are shown as rows and the 277 different FFMs are shown as columns. If all constituents of a motif are present in a given genome,
the corresponding cell is coloured red; otherwise it is coloured in shades from red to blue reflecting the amount of conservation (blue for completely absent).
The procedure to build and cluster network motifs based on their conservation profile is described in the methods section. This figure illustrates that different
feed-forward motifs have been conserved to varying extents in the different genomes. A representation like this can be extremely helpful in finding sets of motifs
that have similar conservation profiles. Two-way clustering, as done here, also allows us to identify different genomes that have conserved similar sets of
feed-forward motifs, which can be very helpful in understanding gene expression patterns in genomes for which such information is unavailable.
Figure 4.22: SIM and MIM conservation profile
[Two heat maps, one for SIMs and one for MIMs: genomes as rows, motifs as columns, colour scale for conservation from 0% to 100%.]
Figure 4.22: The 175 genomes are shown as rows. The 43 SIMs and 70 MIMs seen in E. coli are shown
as columns in the two matrices. Conservation of motif constituents in each genome is coloured in shades
from red (absolutely conserved) to blue (completely absent). The figure illustrates that the SIMs and MIMs
have been conserved to varying extents in the different genomes.
This analysis showed that there is no significant conservation of whole motifs compared to
random networks of similar size. Interestingly, within coherent monophyletic lineages of
prokaryotes a particular network motif may not be conserved, whereas genomes belonging to
unrelated phylogenetic groups conserve some motifs. For example, Fnr (a global regulator,
activated during low oxygen levels), NarL (transcriptional regulator of a two-component signal
transduction system) and NuoN (subunit of the NADH dehydrogenase complex I) form a
feed-forward motif, which is not conserved in other γ-proteobacteria, whereas it is found in
several distantly related genomes (Figure 4.23ab). This observation is consistent with the fact
that related genomes might not conserve specific regulatory hubs.
To determine whether the individual interactions that form motifs were preferentially conserved
compared to other interactions in the network, we computed a motif conservation index (C.I) for
each organism (see method 4.2.8). C.I is defined as the ratio of the fraction of the conserved
interactions that forms a motif in E. coli to the fraction of all interactions conserved. As shown in
Figure 4.23c, the observed trend in the genomes is closest to a model of neutral removal of
interactions, rather than preferential conservation or loss of interactions in motifs. Thus we find
that there is no preferential conservation of whole motifs or parts thereof (Figure 4.24).
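Taking the definition in the text literally, the ratio can be computed as in the sketch below. Under this reading a value of 1 means that motif interactions are conserved exactly as often as interactions in general. Because the index plotted in Figure 4.23c ranges from -1 to 1, method 4.2.8 presumably applies a further rescaling that is not reproduced here. The function name, the edge lists and the genome content are hypothetical.

```python
def conservation_ratio(all_edges, motif_edges, genes):
    """Ratio of the conserved fraction of motif interactions to the
    conserved fraction of all interactions (None if nothing is conserved)."""

    def fraction_conserved(edges):
        edges = list(edges)
        return sum(a in genes and b in genes for a, b in edges) / len(edges)

    overall = fraction_conserved(all_edges)
    if overall == 0:
        return None  # ratio undefined when no interaction is conserved
    return fraction_conserved(motif_edges) / overall


# Hypothetical reference network: the first five edges lie in motifs.
ALL_EDGES = [("fnr", "narL"), ("fnr", "nuoN"), ("narL", "nuoN"),
             ("fnr", "frdB"), ("narL", "frdB"), ("crp", "araB")]
MOTIF_EDGES = ALL_EDGES[:5]

# A genome that has lost NarL keeps 2 of 5 motif edges and 3 of 6 edges overall.
genome = {"fnr", "nuoN", "frdB", "crp", "araB"}
print(conservation_ratio(ALL_EDGES, MOTIF_EDGES, genome))
```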
Figure 4.23: Evolutionary changes in the local network structure
[Panel a, "Motifs and hubs not conserved among related genomes": the Fnr, NarL, NuoN feed-forward motif in Salmonella typhi, Vibrio cholerae, Haemophilus somnus, Xylella fastidiosa and Blochmannia floridanus (all γ-proteobacteria). Panel b, "Motifs and hubs conserved among unrelated genomes": the same motif in R. palustris (α-proteobacteria), B. pertussis (β-proteobacteria), N. punctiforme (Cyanobacteria), S. avermitilis (Actinobacteria) and D. hafniense (Firmicute). Panel c: C.I (y-axis, -1 to 1) against % genes conserved (x-axis, 30 to 100). Curves: selection for motifs; observed trend in genomes; neutral conservation of motifs; selection against motifs.]
Figure 4.23: (a) A feed-forward motif formed by the Fnr, NarL and NuoN genes in E. coli is
completely conserved in a closely related genome, Salmonella typhi, but not in other
proteobacterial genomes. Fnr is a regulatory hub and is not always conserved in genomes
within the same phylogenetic group. This suggests that condition-specific regulatory hubs can
either be lost or displaced by a non-orthologous protein in closely related genomes. (b) Distantly
related organisms that have conserved all interactions in the regulatory motif and that have
conserved the regulatory hub, Fnr. (c) Plot of the fraction of genes conserved (x-axis)
against the conservation index (y-axis), which is a measure of the extent to which interactions
in a motif are conserved. A value > 1 means that interactions in motifs are selected for; close to 0
suggests neutral selection and < 1 suggests selection against such interactions. The plot shows
that interactions that form the network motif are not selected for or against in evolution and are
conserved to the same extent as any other interaction in the network. This implies that evolution
acts by tinkering to form different motifs using the same interaction in different genomes.
Figure 4.24: Gene conservation v/s number of motifs
[Three plots against % genes conserved (x-axis, 30 to 100): number of FFMs (y-axis, 0 to 300), number of SIMs (y-axis, 0 to 50) and number of MIMs (y-axis, 0 to 80).]
Figure 4.24: Graphs of the number of whole motifs against the percentage of genes conserved,
for the conserved networks (black triangles) and for random networks of similar size (green
dots). The graphs show that the number of whole motifs that exist in the conserved networks is
comparable to what is seen in random networks of similar size. However, for genomes where
the fraction of genes conserved is less than 60%, there seems to be a slight preference for
SIMs to be over-represented. For most genomes, the P-values and Z-scores suggest that the
observed trend is close to random.
Careful examination of partially conserved motifs reveals that the same gene can become a part
of different motifs in the TRN of different organisms (Figure 4.25). An important implication of
this observation is that the same gene in a different organism may require a different pattern of
gene expression. Thus, different organisms have arrived at different solutions by tinkering
with specific regulatory interactions rather than porting whole blocks of pre-existing
transcriptional interactions. For example, a gene that needs to be tightly regulated (as in a
feed-forward motif) in one genome may need to be regulated directly and quickly via a direct
signal (as in a single input module) in a different genome, due to changes in lifestyle,
development or environment. Specific cases are shown in Figure 4.25. For example, E. coli, which
lives in a more stable aerobic environment, will not express the fumarate reductase genes (FrdB
and FrdC, which convert fumarate to succinate under anaerobic conditions to derive energy)
unless there is a persistent signal for lack of oxygen (through both Fnr and NarL). However,
Haemophilus influenzae, which may encounter very different oxygen milieus (in arterial and
venous blood) during the course of infection, will need to regulate the fumarate reductase genes
immediately (by Fnr) whenever the oxygen concentration fluctuates, to ensure survival.
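The difference between the two regulatory designs can be illustrated with a toy dynamic model. The sketch assumes a feed-forward motif with AND logic (the target Z needs the input X active and the intermediate regulator Y above a threshold), first-order production and decay, and arbitrary parameter values; none of these details come from the text, and the function name is hypothetical. In the fumarate reductase example, X, Y and Z would correspond to Fnr, NarL and FrdB/FrdC.

```python
def peak_target_level(pulse_length, use_ffm, dt=0.01, t_end=14.0, threshold=0.5):
    """Euler integration of a toy model; returns the peak level of target Z.

    The input signal for X is on for `pulse_length` time units and then off.
    """
    y = z = peak = 0.0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        signal = 1.0 if t < pulse_length else 0.0
        # Y is produced while the signal is on and decays at unit rate.
        y += dt * (signal - y)
        # FFM (AND logic): Z needs the signal AND Y above the threshold.
        # SIM: Z needs only the signal.
        drive = signal if (not use_ffm or y > threshold) else 0.0
        z += dt * (drive - z)
        peak = max(peak, z)
    return peak


for length in (0.5, 5.0):
    print("pulse", length,
          "FFM peak", round(peak_target_level(length, use_ffm=True), 3),
          "SIM peak", round(peak_target_level(length, use_ffm=False), 3))
```

In this toy model a brief pulse never lets Y reach its threshold, so the feed-forward design ignores it while the single input design responds at once; a persistent signal activates the target in both designs.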
Figure 4.25: Changes seen in motif structure
[Panel a: FFM and SIM involving Fnr, NarL, FrdB and FrdC. Panel b: MIM and SIM involving TrpR, TyrR, AroL and AroM. Panel c: FFM+MIM and MIM involving Crp, Fnr, ArcA, AcnA and TdcA. Each motif is accompanied by a plot of concentration C(t) (0 to 1) against time t (0 to 14). Genomes labelled: E. coli, H. influenzae, X. fastidiosa, K. pneumoniae.]
Figure 4.25: Analysis of partially conserved motifs revealed that by losing (or gaining) specific
transcription factors, orthologous genes in different genomes could be expressed in different
ways according to specific requirements. The plots beside each motif represent the
concentration profile of the component genes with time. Genes in a feed-forward motif (FFM) in
E. coli can be regulated as part of a single input module (SIM) by losing a transcription factor
(shown in grey). Feed-forward motif regulation ensures that target gene expression is not
sensitive to fluctuations in input signals, whereas SIM regulation ensures expression of target
genes as long as there is some input signal, such as a small molecule. Genes in a multiple input
motif (MIM) in E. coli can be regulated as a single input module (SIM) by losing a transcription
factor. Regulation as part of a MIM ensures that target genes are transcribed only when
two input signals independently cross particular thresholds. Thus by losing a transcription
factor, genes that are tightly regulated can be regulated in a simple manner. Genes regulated
as part of a FFM can be regulated as part of a MIM by the loss of a transcription factor.
Thus evolution tinkers with specific regulatory interactions to create different motifs in genomes
when orthologous genes need to be expressed differently. More generally, the figure illustrates
that by bringing about temporal expression of specific transcription factors, the same target gene
can follow different patterns of gene expression, be it in a different cell type or in different
organisms.
In this context, we note that studies of duplicated genes within the regulatory networks of E. coli
and yeast (Conant and Wagner, 2003; Teichmann and Madan Babu, 2004) have shown that
network motifs have not evolved by duplication of whole ancestral modules. These findings hint
that the same interaction could have existed in different regulatory contexts in ancestral
genomes, which is in agreement with our findings. Our conclusions are consistent with a recent
study showing the lack of conservation of higher order network modules, which are
semi-independent units larger than network motifs (Snel and Huynen, 2004). This type of
flexibility in the regulatory context of a gene can be viewed as analogous to the changes in gene
regulation across different cell types within a multi-cellular organism.
4.4 Conclusions
We analysed the conservation of genes and transcriptional regulatory networks across
genomes, as well as the evolution of network topology at both the global and local level, by
comparing the extent of conservation of an experimentally established reference network, that
of E. coli, across 175 microbial genomes.
We showed that, at the level of individual genes, target genes are more conserved than
transcription factors, and that conservation of a target gene and its transcription factor are
uncoupled.
We also showed that at the local level there is no preferential conservation of interactions within
network motifs: by losing or conserving specific transcriptional regulators, orthologous genes in
different genomes are incorporated within different regulatory contexts and thereby encode
different patterns of gene expression.
At the level of global network topology, conservation of transcription factors is independent of
the number of target genes they regulate, and depends on similar or different lifestyles of the
organisms rather than their phylogenetic distance from E. coli. This, coupled with the fact that
both the reference network and the inferred ones are characterized by a power-law distribution
but with different exponents, implies that these networks evolved independently. Our findings
suggest that evolution tinkers with individual network connections to arrive at a satisfactory
working solution, and does so by maintaining the scale-free nature of the ancestral core network
while rewiring and adding or deleting components to suit new endogenous or environmental
requirements.
Finally, our method can be readily applied to any set of networks and genomes that
advancements in large-scale experimental identification of transcriptional regulatory networks
will make available. The reconstructed networks and predicted transcription factors for 175
organisms can serve as a scaffold for experimental studies on transcriptional control in poorly
characterized genomes, and can be relevant to engineer regulatory interactions in these
organisms.
4.5 References
Albert, R., Jeong, H. and Barabasi, A. L. (2000). Error and attack tolerance of complex
networks. Nature 406, 378-82.
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286,
509-12.
Conant, G. C. and Wagner, A. (2003). Convergent evolution of gene circuits. Nat Genet 34,
264-6.
Gollub, J., Ball, C. A., Binkley, G., Demeter, J., Finkelstein, D. B., Hebert, J. M.,
Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J. C. et al. (2003). The Stanford
Microarray Database: data access and quality assessment tools. Nucleic Acids Res 31, 94-6.
Jeong, H., Mason, S. P., Barabasi, A. L. and Oltvai, Z. N. (2001). Lethality and centrality in
protein networks. Nature 411, 41-2.
Kalir, S. and Alon, U. (2004). Using a quantitative blueprint to reprogram the dynamics of the
flagella gene network. Cell 117, 713-20.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N.
M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory
networks in Saccharomyces cerevisiae. Science 298, 799-804.
Lespinet, O., Wolf, Y. I., Koonin, E. V. and Aravind, L. (2002). The role of lineage-specific
gene family expansion in the evolution of eukaryotes. Genome Res 12, 1048-59.
Luscombe, N. M., Madan Babu, M., Yu, H., Snyder, M., Teichmann, S. A. and Gerstein, M.
(2004). Genomic analysis of regulatory network dynamics. Nature in press.
Madan Babu, M. (2003). Did the loss of sigma factors initiate pseudogene accumulation in M.
leprae? Trends Microbiol 11, 59-61.
Madan Babu, M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004).
Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14,
283-91.
Madan Babu, M. and Teichmann, S. A. (2003). Evolution of transcription factors and the gene
regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234-44.
Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif.
Proc Natl Acad Sci U S A 100, 11980-5. Epub 2003 Oct 6.
Martinez-Antonio, A., Salgado, H., Gama-Castro, S., Gutierrez-Rios, R. M.,
Jimenez-Jacinto, V. and Collado-Vides, J. (2003). Environmental conditions and transcriptional
regulation in Escherichia coli: a physiological integrative approach. Biotechnol Bioeng 84,
743-9.
Maslov, S., Sneppen, K., Eriksen, K. A. and Yan, K. K. (2004). Upstream plasticity and
downstream robustness in evolution of molecular networks. BMC Evol Biol 4, 9.
Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M. and
Alon, U. (2004). Superfamilies of evolved and designed networks. Science 303, 1538-42.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002).
Network motifs: simple building blocks of complex networks. Science 298, 824-7.
Oltvai, Z. N. and Barabasi, A. L. (2002). Systems biology. Life's complexity pyramid. Science
298, 763-4.
Pagel, P., Mewes, H. W. and Frishman, D. (2004). Conservation of protein-protein interactions
- lessons from ascomycota. Trends Genet 20, 72-6.
Pavlidis, P. and Noble, W. S. (2003). Matrix2png: a utility for visualizing matrix data.
Bioinformatics 19, 295-6.
Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. and Barabasi, A. L. (2002).
Hierarchical organization of modularity in metabolic networks. Science 297, 1551-5.
Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F.,
Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A.,
Bonavides-Martinez, C. et al. (2004). RegulonDB (version 4.0): transcriptional regulation, operon
organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32, D303-6.
Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the
transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8.
Snel, B. and Huynen, M. A. (2004). Quantifying modularity in the evolution of biomolecular
systems. Genome Res 14, 391-7.
Teichmann, S. A. and Madan Babu, M. (2004). Gene regulatory network growth by
duplication. Nat Genet 36, 492-6. Epub 2004 Apr 11.
van Nimwegen, E. (2003). Scaling laws in the functional content of genomes. Trends Genet 19,
479-84.
Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X. and Gerstein, M. (2004a). Genomic analysis of
essentiality within protein networks. Trends Genet 20, 227-31.
Yu, H., Luscombe, N. M., Lu, H. X., Zhu, X., Xia, Y., Han, J. D., Bertin, N., Chung, S., Vidal,
M. and Gerstein, M. (2004b). Annotation transfer between genomes: protein-protein
interologs and protein-DNA regulogs. Genome Res 14, 1107-18.