S.2: Supplementary Material – Description of Algorithms
The gene regulatory network is represented as a directed graph G = (V, E) where each
protein in the network is a vertex (v ∈ V). The protein can be a transcription factor (tf)
or a regulated gene (rg) where tf, rg ∈ V. The transcriptional interaction (from a
transcription factor to the regulated gene) is a directed edge (e ∈ E). The network is
implemented using two hash tables, PR (Parent of a Regulated gene) and CT (Child of a
Transcription factor). PR consists of |V| keys, one for each vertex in V. The key is a
regulated gene and the values correspond to all the transcription factors regulating it
(i.e. for each key rg, the value corresponds to all vertices tf such that there is an edge
(tf → rg) ∈ E). CT consists of |V| keys, one for each vertex in V. The key is a
transcription factor and the values correspond to all genes that it regulates (i.e. for
each key tf, the value corresponds to all vertices rg such that there is an edge (tf → rg)
∈ E).
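As an illustration of the two hash tables, the following minimal Python sketch builds PR and CT from a list of (tf, rg) edges. The function name `build_tables` and the toy gene names are hypothetical and are not part of the original implementation.

```python
def build_tables(edges):
    """Build PR (gene -> set of its regulators) and CT (gene -> set of its targets).

    Both tables get one key per vertex, so a gene with no regulators
    (or no targets) maps to an empty set.
    """
    vertices = {v for edge in edges for v in edge}
    PR = {v: set() for v in vertices}
    CT = {v: set() for v in vertices}
    for tf, rg in edges:
        PR[rg].add(tf)  # tf is a parent of rg
        CT[tf].add(rg)  # rg is a child of tf
    return PR, CT


# Hypothetical toy network: tfA regulates g1 and g2, tfB regulates g2.
edges = [("tfA", "g1"), ("tfA", "g2"), ("tfB", "g2")]
PR, CT = build_tables(edges)
```

Storing the values as sets means a repeated edge in the input is counted only once.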
Each protein (v ∈ V) is characterised by a domain architecture, i.e. the ordered set of
domains from the N-terminus to the C-terminus. Proteins with the same domain
architecture, ignoring gaps and internal duplications, are considered homologues in
this analysis. This information is stored in a hash table DA of |D| elements, one for
each domain architecture d. A key corresponds to a domain architecture d and the values
correspond to all vertices h (homologous proteins) in the graph that have the domain
architecture d. In other words, this hash table represents genes (h) that have arisen by
duplication from a common ancestor. Another hash table, DAG, consists of |V| keys, one
for each vertex in V, such that the key is a gene and the value is its domain
architecture, d.
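The relationship between DA and DAG can be sketched as follows. This is only an illustration: it assumes that each domain architecture is already available as a tuple of domain names with gaps and internal duplications removed, and the function name and toy data are hypothetical.

```python
def build_architecture_tables(architecture_of):
    """Return DA (architecture -> list of genes) and DAG (gene -> architecture)."""
    DAG = dict(architecture_of)
    DA = {}
    for gene, arch in DAG.items():
        # Genes sharing an architecture are treated as homologues.
        DA.setdefault(arch, []).append(gene)
    return DA, DAG


# Hypothetical toy data: g1 and g2 share an architecture, g3 is unique.
architecture_of = {"g1": ("a", "b"), "g2": ("a", "b"), "g3": ("c",)}
DA, DAG = build_architecture_tables(architecture_of)
```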
An explanation for each line in the algorithm for section S.2.1 is given as a guide.
S.2.1: Characterization of the vertices in the network
A. Algorithm to identify duplicated genes (regulated genes) that are controlled by
the same transcription factors
for each d ∈ DA                    # for each domain architecture (d) in DA
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each tf ∈ PR[h]        # for each tf that regulates h
            X[tf] ← 0              # set the count X[tf] to 0
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each tf ∈ PR[h]        # for each tf that regulates h
            X[tf] ← X[tf] + 1      # increment the count X[tf]
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each tf ∈ PR[h]        # for each tf that regulates h
            if X[tf] > 1           # if the tf regulates more than one homologous gene
                print (h)          # print the rg
B. Algorithm to identify duplicated genes (transcription factors) that control the
same target genes
for each d ∈ DA                    # for each domain architecture (d) in DA
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each rg ∈ CT[h]        # for each rg that is regulated by h
            X[rg] ← 0              # set the count X[rg] to 0
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each rg ∈ CT[h]        # for each rg that is regulated by h
            X[rg] ← X[rg] + 1      # increment the count X[rg]
    for each h ∈ DA[d]             # for each homologous gene (h) with domain architecture (d)
        for each rg ∈ CT[h]        # for each rg that is regulated by h
            if X[rg] > 1           # if the rg is regulated by more than one homologous tf
                print (h)          # print the tf
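Algorithms A and B in S.2.1 differ only in the hash table that is traversed (PR or CT), so both can be sketched with one Python function. This is a minimal sketch rather than the original implementation. The name `duplicates_sharing_partner` and the toy data are hypothetical, and the values of PR and CT are assumed to be sets. Unlike the pseudocode, each gene is reported only once.

```python
from collections import Counter


def duplicates_sharing_partner(DA, neighbours):
    """Genes that share at least one neighbour with a homologue.

    Pass PR as `neighbours` for Algorithm A (regulated genes sharing a TF)
    and CT for Algorithm B (TFs sharing a target gene).
    """
    found = set()
    for family in DA.values():
        # Number of family members linked to each partner (the count X).
        count = Counter(p for h in family for p in neighbours.get(h, ()))
        for h in family:
            if any(count[p] > 1 for p in neighbours.get(h, ())):
                found.add(h)
    return found


# Hypothetical toy data: g1 and g2 are homologues, both regulated by tfA.
DA = {"d1": ["g1", "g2"], "d2": ["g3"], "d3": ["tfA", "tfB"]}
PR = {"g1": {"tfA"}, "g2": {"tfA"}, "g3": {"tfB"}}
CT = {"tfA": {"g1", "g2"}, "tfB": {"g3"}}

shared_rg = duplicates_sharing_partner(DA, PR)  # Algorithm A
shared_tf = duplicates_sharing_partner(DA, CT)  # Algorithm B
```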
S.2.2: Characterization of the edges in the network
A. Algorithm to identify edges that have evolved by the duplication of RGs:
for each d ∈ DA
    for each h ∈ DA[d]
        for each tf ∈ PR[h]
            X[tf] ← 0
    for each h ∈ DA[d]
        for each tf ∈ PR[h]
            X[tf] ← X[tf] + 1
    for each h ∈ DA[d]
        for each tf ∈ PR[h]
            if X[tf] > 1
                print E (tf → h)
B. Algorithm to identify edges that have evolved by the duplication of TFs:
for each d ∈ DA
    for each h ∈ DA[d]
        for each rg ∈ CT[h]
            X[rg] ← 0
    for each h ∈ DA[d]
        for each rg ∈ CT[h]
            X[rg] ← X[rg] + 1
    for each h ∈ DA[d]
        for each rg ∈ CT[h]
            if X[rg] > 1
                print E (h → rg)
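The edge-level versions of Algorithms A and B can be sketched in Python as below. The function name `inherited_edges` and the toy network are hypothetical. The values of PR and CT are assumed to be sets, and edges are returned as (tf, rg) tuples rather than printed.

```python
from collections import Counter


def inherited_edges(DA, PR, CT):
    """Return (edges explained by RG duplication, edges explained by TF duplication)."""
    by_rg_dup, by_tf_dup = set(), set()
    for family in DA.values():
        # How many family members each TF regulates / each RG is regulated by.
        tf_count = Counter(tf for h in family for tf in PR.get(h, ()))
        rg_count = Counter(rg for h in family for rg in CT.get(h, ()))
        for h in family:
            by_rg_dup.update((tf, h) for tf in PR.get(h, ()) if tf_count[tf] > 1)
            by_tf_dup.update((h, rg) for rg in CT.get(h, ()) if rg_count[rg] > 1)
    return by_rg_dup, by_tf_dup


# Hypothetical toy network: g1 ~ g2 and tfA ~ tfB are homologous pairs.
DA = {"d_rg": ["g1", "g2"], "d_tf": ["tfA", "tfB"]}
PR = {"g1": {"tfA"}, "g2": {"tfA", "tfB"}}
CT = {"tfA": {"g1", "g2"}, "tfB": {"g2"}}
by_rg_dup, by_tf_dup = inherited_edges(DA, PR, CT)
```

In this toy network the edge tfA → g2 is found by both algorithms, since tfA regulates two homologous genes and g2 is regulated by two homologous TFs.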
As mentioned in the manuscript, Algorithms A and B have precedence over C,
meaning cases picked up by A and B are excluded from consideration by C.
C. Algorithm to identify edges that have evolved by duplication of TFs and RGs:
for each d ∈ DA
    for each h ∈ DA[d]
        X[d] ← X[d] + 1
        for each rg ∈ CT[h]
            Y[DAG[rg]] ← Y[DAG[rg]] + 1
            Z[rg] ← Z[rg] + 1
        if X[d] > 1
            for each rg ∈ CT[h]
                if Z[rg] = 1 and Y[DAG[rg]] > 1
                    print E (h → rg)
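One possible reading of Algorithm C is sketched below in Python. It rests on two assumptions that the pseudocode does not state explicitly: the counts Y and Z are completed for a whole TF family before any edge is tested, and they are reset for each domain architecture. The function name, the `already_classified` parameter (used to express the precedence of Algorithms A and B) and the toy data are hypothetical.

```python
from collections import Counter


def edges_by_tf_and_rg_duplication(DA, DAG, CT, already_classified=frozenset()):
    """Edges h -> rg where both the TF and the RG belong to duplicated families."""
    found = set()
    for family in DA.values():
        if len(family) < 2:  # X[d] > 1: the TF must have a homologue
            continue
        rg_hits = Counter()    # Z: family members regulating each rg
        arch_hits = Counter()  # Y: family targets per RG domain architecture
        for h in family:
            for rg in CT.get(h, ()):
                rg_hits[rg] += 1
                arch_hits[DAG[rg]] += 1
        for h in family:
            for rg in CT.get(h, ()):
                edge = (h, rg)
                if (rg_hits[rg] == 1 and arch_hits[DAG[rg]] > 1
                        and edge not in already_classified):
                    found.add(edge)
    return found


# Hypothetical toy network: tfA ~ tfB and g1 ~ g2, with tfA -> g1 and tfB -> g2.
DA = {"d_tf": ["tfA", "tfB"], "d_rg": ["g1", "g2"], "d_x": ["g3"]}
DAG = {g: d for d, genes in DA.items() for g in genes}
CT = {"tfA": {"g1", "g3"}, "tfB": {"g2"}}
both_dup = edges_by_tf_and_rg_duplication(DA, DAG, CT)
```

Passing the edges already reported by Algorithms A and B as `already_classified` keeps an edge from being assigned to more than one class.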
As described in the manuscript, Algorithms A, B and C have precedence over D and E,
meaning cases picked up by A, B and C are excluded from consideration by D
and E.
D. Algorithm to identify edges that have evolved by duplication of TFs/RGs but lost
their old interactions and gained new interactions:
D.1 Edges where the RG duplicates and gets regulated by a new TF and loses old
interactions:
for each d ∈ DA
    for each h ∈ DA[d]
        X[d] ← X[d] + 1
        for each tf ∈ PR[h]
            Y[tf] ← 0
    for each h ∈ DA[d]
        for each tf ∈ PR[h]
            Y[tf] ← Y[tf] + 1
    for each h ∈ DA[d]
        for each tf ∈ PR[h]
            if Y[tf] = 1 and X[DAG[h]] > 1
                print E (tf → h)
D.2 Edges where the TF duplicates and regulates new RGs and loses old ones:
for each d ∈ DA
    for each h ∈ DA[d]
        X[d] ← X[d] + 1
        for each rg ∈ CT[h]
            Y[rg] ← 0
    for each h ∈ DA[d]
        for each rg ∈ CT[h]
            Y[rg] ← Y[rg] + 1
    for each h ∈ DA[d]
        for each rg ∈ CT[h]
            if Y[rg] = 1 and X[DAG[h]] > 1
                print E (h → rg)
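Algorithms D.1 and D.2 can be sketched together in Python as shown below. The sketch does not apply the precedence of Algorithms A, B and C, so edges already assigned by those algorithms would still need to be removed from the result. The function name `rewired_edges` and the toy data are hypothetical.

```python
from collections import Counter


def rewired_edges(DA, PR, CT):
    """Return (D.1 edges tf -> h, D.2 edges h -> rg) for duplicated genes h."""
    rg_side, tf_side = set(), set()
    for family in DA.values():
        if len(family) < 2:  # X[DAG[h]] > 1: h must have a homologue
            continue
        tf_count = Counter(tf for h in family for tf in PR.get(h, ()))
        rg_count = Counter(rg for h in family for rg in CT.get(h, ()))
        for h in family:
            # Y = 1: the partner is linked to exactly one member of the family.
            rg_side.update((tf, h) for tf in PR.get(h, ()) if tf_count[tf] == 1)
            tf_side.update((h, rg) for rg in CT.get(h, ()) if rg_count[rg] == 1)
    return rg_side, tf_side


# Hypothetical toy network: g1 ~ g2 are regulated by unrelated TFs.
DA = {"d_rg": ["g1", "g2"], "d_a": ["tfA"], "d_b": ["tfB"]}
PR = {"g1": {"tfA"}, "g2": {"tfB"}}
CT = {"tfA": {"g1"}, "tfB": {"g2"}}
d1_edges, d2_edges = rewired_edges(DA, PR, CT)
```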
E. Algorithm to identify edges that have evolved by innovation of new regulatory
interactions:
for each d ∈ DA
    for each h ∈ DA[d]
        X[d] ← X[d] + 1
for each d ∈ DA
    for each h ∈ DA[d]
        if X[d] = 1
            for each rg ∈ CT[h]
                if X[DAG[rg]] = 1
                    print E (h → rg)
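Algorithm E reduces to a test on family sizes, as the following minimal Python sketch illustrates. The function name `innovated_edges` and the toy data are hypothetical.

```python
def innovated_edges(DA, DAG, CT):
    """Edges tf -> rg where neither the TF nor the RG has a homologue."""
    family_size = {d: len(members) for d, members in DA.items()}  # the count X
    return {
        (tf, rg)
        for tf, targets in CT.items()
        if family_size[DAG[tf]] == 1
        for rg in targets
        if family_size[DAG[rg]] == 1
    }


# Hypothetical toy network: tfA and g1 are unique, g2 and g3 are homologues.
DA = {"d1": ["tfA"], "d2": ["g1"], "d3": ["g2", "g3"]}
DAG = {g: d for d, genes in DA.items() for g in genes}
CT = {"tfA": {"g1", "g2"}}
new_edges = innovated_edges(DA, DAG, CT)
```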
S.2.3: Identification of network motifs
The algorithms to identify Feed-forward motifs and Single Input Modules are
described below.
In these algorithms, P is an array that contains the transcription factors regulating
genes that are themselves regulated by only one transcription factor, and u, v, w ∈ V.
A. Feed Forward motif
for each u ∈ CT
    X[u] ← 0
    for each v ∈ CT[u]
        X[v] ← 1
        for each w ∈ CT[v]
            X[w] ← X[w] + 1
            if X[w] = 2 and (u ≠ v, u ≠ w, v ≠ w)
                print E (u → v → w)
                X[w] ← 1
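A feed-forward motif u → v → w with the additional edge u → w can also be found with set membership tests instead of the counter X. The Python sketch below takes that simpler route and is not a transcription of the pseudocode. The function name and the toy network are hypothetical.

```python
def feed_forward_loops(CT):
    """Return all (u, v, w) with edges u -> v, v -> w and u -> w, all distinct."""
    loops = set()
    for u, direct in CT.items():
        for v in direct:
            for w in CT.get(v, ()):
                # w must also be a direct target of u.
                if w in direct and len({u, v, w}) == 3:
                    loops.add((u, v, w))
    return loops


# Hypothetical toy network with one feed-forward motif and an autoregulated u.
CT = {"u": {"u", "v", "w"}, "v": {"w"}}
ffl = feed_forward_loops(CT)
```

The distinctness test excludes autoregulatory edges, which would otherwise be reported as degenerate motifs.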
B. Single Input Module motif
for each rg ∈ PR
    for each tf ∈ PR[rg]
        X[rg] ← X[rg] + 1
for each rg ∈ PR
    if X[rg] = 1
        Push (P, PR[rg])
for each tf ∈ P
    for each rg ∈ CT[tf]
        if X[rg] = 1
            print E (tf → rg)
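The Single Input Module search can be sketched in Python as follows. The function name `single_input_edges` and the toy network are hypothetical. P is held as a set here, so each transcription factor is visited once, whereas the array in the pseudocode may contain repeats.

```python
def single_input_edges(PR, CT):
    """Edges tf -> rg where rg is regulated by tf and by no other TF."""
    # Genes with exactly one regulator (X[rg] = 1).
    single = {rg for rg, tfs in PR.items() if len(tfs) == 1}
    # P: the sole regulators of those genes.
    P = {next(iter(PR[rg])) for rg in single}
    return {(tf, rg) for tf in P for rg in CT.get(tf, ()) if rg in single}


# Hypothetical toy network: g1 and g2 have one regulator, g3 has two.
PR = {"g1": {"tfA"}, "g2": {"tfA"}, "g3": {"tfA", "tfB"}, "tfA": set(), "tfB": set()}
CT = {"tfA": {"g1", "g2", "g3"}, "tfB": {"g3"}}
sim_edges = single_input_edges(PR, CT)
```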
S.2.4: Identification of directly and indirectly regulated genes
The breadth-first search1 method, which uses a recursive algorithm, was implemented
to identify all nodes that can be reached from a given node. This was used to identify
all other genes that can be affected by a transcription factor. The hash table CT
provides a list of all directly regulated genes. The difference between the two sets
gives the indirectly regulated genes by a transcription factor.
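The set difference described above can be illustrated with a short Python sketch. It uses an explicit queue rather than recursion, which is a choice made for this illustration and may differ from the original implementation. The function names and the toy network are hypothetical.

```python
from collections import deque


def reachable(CT, source):
    """All genes reachable from `source` by following edges in CT."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in CT.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def indirect_targets(CT, tf):
    """Reachable genes that are not direct targets of `tf`."""
    return reachable(CT, tf) - set(CT.get(tf, ()))


# Hypothetical toy network: tfA -> tfB -> g2 and tfA -> g1.
CT = {"tfA": {"tfB", "g1"}, "tfB": {"g2"}}
indirect = indirect_targets(CT, "tfA")
```

The `seen` set keeps the search from looping when the network contains cycles.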
References:
1. Cormen, T.H., Leiserson, C.E., Rivest, R.L. and Stein, C. (2001) Introduction to
Algorithms, 2nd Edition. The MIT Press.
S.2.5: Simulation procedure to estimate the significance of duplication
in the network
The domain architecture for each transcription factor is stored in a hash table DATF of
|D| elements, one for each domain architecture d. A key corresponds to a domain
architecture d and values correspond to all vertices w that are transcription factors and
have the domain architecture d. In other words, this hash table represents groups of
homologous transcription factors. Similarly a hash table DARG is created where a key
corresponds to a domain architecture and the values correspond to all vertices w that
are target genes and have the domain architecture d.
Randomise is a function that assigns a random domain architecture to a gene.
However, it is ensured that the number of genes with each domain architecture is the
same as seen in the real network (as implemented in the hash tables).
A. Calculation of P-value for the duplication of target genes (RGs) that are
controlled by the same transcription factor (TF)
n ← 0
for i = 1 to 10,000
    Randomise (DARG)
    Perform S.2.1.A and count the number of RGs that satisfy the criteria
    if the number of RGs in simulation ≥ number seen in the real network
        n ← n + 1
P-value = n / 10,000
B. Calculation of P-value for the duplication of transcription factors that control
the same target genes
n ← 0
for i = 1 to 10,000
    Randomise (DATF)
    Perform S.2.1.B and count the number of TFs that satisfy the criteria
    if the number of TFs in simulation ≥ number seen in the real network
        n ← n + 1
P-value = n / 10,000
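The simulation in S.2.5 can be sketched in Python as below. The sketch assumes that Randomise shuffles gene labels across domain architectures while keeping the size of every architecture group fixed, which is one way of meeting the constraint stated above. All function names, the `seed` parameter and the toy data are hypothetical.

```python
import random
from collections import Counter


def randomise(DA, rng):
    """Reassign genes to architectures at random, preserving group sizes."""
    genes = [g for members in DA.values() for g in members]
    rng.shuffle(genes)
    shuffled, start = {}, 0
    for d, members in DA.items():
        shuffled[d] = genes[start:start + len(members)]
        start += len(members)
    return shuffled


def count_shared(DA, neighbours):
    """Number of genes sharing a neighbour with a homologue (as in S.2.1)."""
    n = 0
    for family in DA.values():
        count = Counter(p for h in family for p in neighbours.get(h, ()))
        n += sum(any(count[p] > 1 for p in neighbours.get(h, ())) for h in family)
    return n


def p_value(DA, neighbours, trials=10_000, seed=0):
    """Fraction of randomised networks scoring at least the observed count."""
    rng = random.Random(seed)
    observed = count_shared(DA, neighbours)
    hits = sum(
        count_shared(randomise(DA, rng), neighbours) >= observed
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical toy data: g1 ~ g2 share tfA, g3 ~ g4 have different regulators.
DARG = {"d1": ["g1", "g2"], "d2": ["g3", "g4"]}
PR = {"g1": {"tfA"}, "g2": {"tfA"}, "g3": {"tfB"}, "g4": {"tfC"}}
p = p_value(DARG, PR, trials=1000)
```

Passing DATF and CT instead of DARG and PR gives the corresponding estimate for transcription factors.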
S.2.6: Simulation procedure to estimate the significance of duplication
in determining network topology
for i = 1 to 10,000
    Randomise (DARG)
    for each TF in CT
        if number of distinct domain architectures of its RGs ≤ number seen in the real network
            n ← n + 1
P-value = n / 10,000