Report on the predicted TFs in the Dicty genome

M. Madan Babu and Sarah A. Teichmann
Aim:
To predict the transcription factors (TFs) in the Dicty genome, analyse them in
terms of DNA binding domain family and partner domains, and compare their
occurrence with that in other genomes.
Materials:
Martin M and Sarah K provided the Pfam and Superfamily assignments for the Dicty
predicted protein sequences (and for the other genomes). I identified 159 families (of
the 5600 Pfam families) that can bind DNA and are seen in TFs. I also identified 47
Superfamilies (out of 1200 SCOP superfamilies) that are seen in DNA binding
proteins. I then removed models of enzymes belonging to DNA binding domain
superfamilies from subsequent analysis. Independently, Sarah Teichmann confirmed
this list of models.
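The selection step described above can be sketched as follows. This is only a minimal illustration, not the script used for the report: the function name, the sequence and model identifiers, and the short DBD list are all hypothetical placeholders.

```python
# Hypothetical placeholders: the real analysis used 159 Pfam families and
# 47 superfamilies, and a curated list of enzyme models to exclude.
DBD_FAMILIES = {"GATA", "bZIP", "myb_DNA-binding"}
EXCLUDED_MODELS = {"enzyme_model_1"}  # enzyme models within DBD superfamilies


def predict_tfs(assignments, dbd_families, excluded_models=frozenset()):
    """Return ids of sequences with at least one DBD hit from a kept model.

    assignments maps a sequence id to a list of (family, model_id) hits.
    """
    predicted = set()
    for seq_id, hits in assignments.items():
        for family, model_id in hits:
            if family in dbd_families and model_id not in excluded_models:
                predicted.add(seq_id)
                break
    return predicted


# Made-up assignments for three sequences.
example = {
    "seq1": [("GATA", "m1"), ("ank", "m2")],
    "seq2": [("rrm", "m3")],
    "seq3": [("bZIP", "enzyme_model_1")],
}
print(sorted(predict_tfs(example, DBD_FAMILIES, EXCLUDED_MODELS)))
```

Here seq3 is dropped because its only DBD hit comes from an excluded enzyme model, which mirrors the filtering applied to the superfamily assignments.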
Results:
1. From the Pfam assignments, 111 sequences were predicted to be TFs based on the
occurrence of particular DNA binding domains.
2. From the Sfam assignments, 76 sequences were predicted to be TFs based on the
occurrence of particular DNA binding domains. The assignments were filtered so
that models of enzymes belonging to DNA binding domain superfamilies were
removed. For example, the superfamily "Winged helix" DNA-binding domain contains
a family of enzymatic domains that do not bind DNA, such as the methionine
aminopeptidase insert domain.
3. Overlap between the two predictions: 55 sequences were predicted as TFs by
both Pfam and Sfam, while 56 and 21 sequences were predicted as TFs only by
Pfam and only by Sfam respectively.
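The overlap between the two predictions is a plain set comparison. The sketch below uses toy identifiers (not real Dicty ids), chosen only so that the set sizes match the totals in items 1 and 2 and the shared count of 55; the function name is an assumption.

```python
def overlap_summary(pfam_ids, sfam_ids):
    """Count sequences predicted by both methods, by each alone, and in total."""
    return {
        "both": len(pfam_ids & sfam_ids),
        "pfam_only": len(pfam_ids - sfam_ids),
        "sfam_only": len(sfam_ids - pfam_ids),
        "union": len(pfam_ids | sfam_ids),
    }


# Toy identifiers: 111 Pfam predictions, 76 Sfam predictions, 55 shared.
pfam_ids = {f"seq{i}" for i in range(0, 111)}
sfam_ids = {f"seq{i}" for i in range(56, 132)}
print(overlap_summary(pfam_ids, sfam_ids))
```

With these sizes the union comes to 132 sequences, the number listed for Dicty in the table of item 10.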
4. The 111 sequences predicted as TFs using Pfam assignments belong to 15 different
protein families. Their occurrence is as follows:
#     DNA binding domain
1     ARID
1     DDT
1     E2F_TDP
1     HTH_3
1     NOT
1     NmrA
1     Not1
1     PAH
1     Plus-3
1     SART-1
1     Ssrecog
1     TBP
1     WRKY
1     YL1
1     zf-NF-X1
2     Rcd1
2     YEATS
2     Zn_clus
2     zf-MIZ
3     HMG_box
3     SRF-TF
4     zf-C2H2
5     SAP
5     SIR2
6     CBFD_NFYB_HMF
6     homeobox
9     PHD
12    GATA
13    bZIP
23    myb_DNA-binding
We clearly find that the myb_DNA-binding domain has expanded in Dicty. These
domains belong to the SANT domain family, which specifically recognizes the
sequence YAAC(G/T)G. The next most common DNA binding domains are the bZIP
domain and the GATA domain. The GATA domain is a zinc finger domain that uses
four Cys residues to coordinate the Zn ion and specifically recognises the
sequence (A/T)GATA(A/G).
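As an illustration of what the two consensus sequences mean, they can be written as regular expressions and used to scan a DNA string. This is a minimal sketch: the helper name and the test sequence are made up, and only the given strand is searched (no reverse complement).

```python
import re

# IUPAC code Y means C or T; alternatives such as (G/T) become character classes.
MYB_SITE = re.compile(r"[CT]AAC[GT]G")   # YAAC(G/T)G
GATA_SITE = re.compile(r"[AT]GATA[AG]")  # (A/T)GATA(A/G)


def find_sites(pattern, dna):
    """Return 0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile(f"(?={pattern.pattern})")
    return [m.start() for m in lookahead.finditer(dna.upper())]


seq = "ttCAACGGaaAGATAAcc"  # made-up sequence with one site of each kind
print(find_sites(MYB_SITE, seq))
print(find_sites(GATA_SITE, seq))
```

Such short motifs occur frequently by chance, so matches of this kind indicate candidate sites only.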
5. When we analyse the sequences that Sfam predicts to be TFs, we get the following
statistics (The DBDs come from 13 different families):
#     DNA binding domain
1     46774 - ARID-like
1     47113 - Histone fold
1     47413 - lambda repressor-like DNA-binding domains
1     47954 - Cyclin-like
2     47454 - A DNA-binding domain in eukaryotic transcription factors
2     57701 - Zn2/Cys6 DNA-binding domain
3     47095 - HMG-box
3     55455 - SRF-like
3     57903 - FYVE/PHD zinc finger
6     46785 - "Winged helix" DNA-binding domain
6     57667 - C2H2 and C2HC zinc fingers
16    57716 - Glucocorticoid receptor-like (DNA-binding domain)
31    46689 - Homeodomain-like
Here, we clearly find that the Homeodomain-like DBD has expanded in the Dicty
lineage. The next most abundant DBD is the Glucocorticoid receptor-like DNA
binding domain, which is greatly expanded in C. elegans. Tables 1a and 1b
compare the occurrences of DBDs with those in the other genomes.
6. If we look at the partner domains for the 111 TFs, they come from 28 different
families (+15 DNA binding domain families). These are:
ARID, BRCT, CBFD_NFYB_HMF, DDT, DnaJ, E2F_TDP, GATA, HMG_box,
HTH_3, NOT, NmrA, Not1, PAH, PHD, Plus-3, Rcd1, SAP, SART-1, SIR2,
SRF-TF, Ssrecog, SWIRM, TBP, WRKY, YEATS, YL1, ZZ, Zn_clus, ank, bZIP,
bromodomain, histone, Homeobox, jmjC, jmjN, myb_DNA-binding, polyprenyl_synt,
rrm, zf-C2H2, zf-MIZ, zf-NF-X1, zf-UBP, zf-UBR1
Domain content for the individual predicted TFs from Pfam assignments:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/pf.tf.domain.content
7. Similarly, the partner domains of the Sfam predicted TFs come from 12 families
(+13 DNA binding domain families). These are:
"Chaperone J-domain", "Homeodomain-like", "ARID-like", '"Winged helix"
DNA-binding domain', "HMG-box", "Histone-fold", "lambda repressor-like DNA-binding
domains", "A DNA-binding domain in eukaryotic transcription factors", "EF-hand",
"Cyclin-like", "Ankyrin repeat", "Concanavalin A-like lectins/glucanases", "PH
domain-like", "Rap30/74 interaction domains", "RNI-like", "BRCT domain", "P-loop
containing nucleotide triphosphate hydrolases", "RNA-binding domain, RBD",
"SRF-like", "Protein kinase-like (PK-like)", "C2H2 and C2HC zinc fingers", "Zn2/Cys6
DNA-binding domain", "Glucocorticoid receptor-like (DNA-binding domain)",
"FYVE/PHD zinc finger", "Glycerol-3-phosphate (1)-acyltransferase"
Domain content for the TFs predicted from Sfam assignments:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/sf.tf.domain.content
8. Comparison of DBD occurrence in other genomes
Occurrences of Sfam – DBDs in comparison to other genomes:
Sfamily   ec     sc     ds     ce     mm     hs     at
46774     0      2      1      8      23     19     10
47113     0      5      1      2      6      5      15
47413     29     1      1      9      23     24     4
47954     0      0      1      2      6      3      1
47454     0      0      2      17     31     30     1
57701     0      53     2      0      0      0      0
47095     0      7      3      27     151    77     22
55455     0      4      3      3      6      7      113
57903     1      13     3      42     88     80     61
46785     115    19     6      93     151    160    44
57667     0      30     6      125    1068   1039   58
57716     0      10     16     361    20     19     32
46689     43     21     31     222    312    319    283
9. Occurrences of Pfam – DBDs in comparison to other genomes:
Pfamily           ec     sc     ds     ce     mm     hs     at
ARID              0      2      1      9      23     23     9
DDT               0      1      1      8      8      7      6
E2F_TDP           0      0      1      6      16     9      12
HTH_3             12     1      1      1      1      1      3
NOT               0      0      1      0      0      0      0
NmrA              0      0      1      0      0      0      0
Not1              0      1      1      1      1      2      1
PAH               0      1      1      2      6      4      22
Plus-3            0      1      1      1      1      1      6
SART-1            0      0      1      2      1      1      2
Ssrecog           0      1      1      2      1      1      1
TBP               0      1      1      3      7      4      3
WRKY              0      0      1      0      0      0      81
YL1               0      1      1      3      1      1      1
zf-NF-X1          0      1      1      2      3      5      2
Rcd1              0      1      2      2      1      2      3
YEATS             0      3      2      2      5      4      2
Zn_clus           0      51     2      0      0      0      0
zf-MIZ            0      2      2      8      10     9      2
HMG_box           0      7      3      32     157    84     14
SRF-TF            0      4      3      3      6      7      111
zf-C2H2           0      37     4      160    1104   973    49
SAP               0      5      5      11     31     26     8
SIR2              1      4      5      6      10     13     5
CBFD_NFYB_HMF     0      4      6      7      8      9      31
homeobox          0      0      6      0      0      0      0
PHD               0      14     9      47     100    97     60
GATA              0      10     12     29     16     15     31
bZIP              0      8      13     24     50     40     75
myb_DNA-binding   0      0      23     0      0      0      0
From these results we clearly see that different DBDs have expanded in a
lineage-specific manner in the different genomes. In Dicty, proteins containing
Homeodomain-like (myb_DNA-binding), GATA and bZIP domains have expanded in a
lineage-specific manner.
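One simple way to read lineage-specific expansion off the Pfam comparison is to ask, for each family, which genome holds the most copies. The sketch below does this for a few rows copied from the table in item 9; the function name is an assumption, and raw counts are used without normalising for genome size, so it is only a rough indicator.

```python
GENOMES = ["ec", "sc", "ds", "ce", "mm", "hs", "at"]

# A few rows copied from the Pfam comparison table (item 9).
PFAM_COUNTS = {
    "HTH_3": [12, 1, 1, 1, 1, 1, 3],
    "Zn_clus": [0, 51, 2, 0, 0, 0, 0],
    "WRKY": [0, 0, 1, 0, 0, 0, 81],
    "zf-C2H2": [0, 37, 4, 160, 1104, 973, 49],
    "myb_DNA-binding": [0, 0, 23, 0, 0, 0, 0],
}


def most_expanded(counts):
    """Return the genome code with the highest raw count for one family."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return GENOMES[best]


for family, counts in PFAM_COUNTS.items():
    print(family, most_expanded(counts))
```

Dividing each count by the number of transcripts in the genome would give a fairer comparison between small and large genomes.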
10. Predicted TFs from other genomes
Numbers of DNA-binding transcription factors in model organisms
Organism         Number of      Number of proteins    Percentage of transcripts
                 transcripts    with DNA-binding      containing DNA-binding
                                domains               domains
E. coli          4,280          267                   6.2%
S. cerevisiae    6,357          245                   3.9%
Dicty            12,730         132                   1.0%
C. elegans       31,677         1463                  4.6%
M. musculus      32,911         2460                  7.4%
H. sapiens       32,036+        2604                  8.1%
A. thaliana      28,787         1667                  5.7%
DNA-binding domain assignments from Pfam and SUPERFAMILY were used to
establish the repertoire of DNA-binding transcription factors in model organisms.
An expectation value threshold of 0.002 was used in making the assignments.
Co-regulators that do not bind DNA directly are excluded.
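The percentage column is the number of proteins with DNA-binding domains divided by the number of transcripts. A minimal sketch of that arithmetic follows; the dictionary and function names are assumptions, and the H. sapiens transcript count, given as "32,036+", is taken here as 32,036. Recomputed values can differ from the listed ones in the last digit depending on how they are rounded.

```python
# (number of transcripts, number of proteins with DNA-binding domains)
TF_COUNTS = {
    "E. coli": (4280, 267),
    "S. cerevisiae": (6357, 245),
    "Dicty": (12730, 132),
    "C. elegans": (31677, 1463),
    "M. musculus": (32911, 2460),
    "H. sapiens": (32036, 2604),  # listed as "32,036+"; the plus sign is dropped
    "A. thaliana": (28787, 1667),
}


def percent_with_dbd(transcripts, proteins):
    """Percentage of transcripts that contain a DNA-binding domain."""
    return 100.0 * proteins / transcripts


for organism, (transcripts, proteins) in TF_COUNTS.items():
    print(f"{organism}: {percent_with_dbd(transcripts, proteins):.2f}%")
```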
We find that the fraction of Dicty transcripts coding for TFs (1.0%) is
remarkably small compared with the other genomes (3.9% to 8.1%). This suggests
either that much of the transcriptional regulation happens through chromatin
remodelling, as seen in some parasites such as P. falciparum, or that
uncharacterised Dicty-specific DBDs exist.
Supplementary information:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/
Information on pfam and sfam assignments for the predicted TFs as tab delimited file:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/tfs.domain.content
Format:
Identifier \t Pfam_assignment \t Sfam_assignment
Pfam DBD occurrence profile:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/pf.profile.ec.sc.ds.ce.mm.hs.at
Format:
Family \t ec \t sc \t ds \t ce \t mm \t hs \t at
Sfam DBD occurrence profile:
http://www.mrc-lmb.cam.ac.uk/genomes/dtf/sf.profile.ec.sc.ds.ce.mm.hs.at
Format:
Family \t ec \t sc \t ds \t ce \t mm \t hs \t at
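A profile file in the format above could be read along these lines. This is a sketch under stated assumptions: that the file has no header line and one family per line, and that fields may carry stray spaces around the tabs. The function name is hypothetical, and the sample rows are copied from the Pfam comparison table rather than downloaded.

```python
GENOMES = ["ec", "sc", "ds", "ce", "mm", "hs", "at"]


def parse_profile(text):
    """Parse 'Family<TAB>ec<TAB>sc...' lines into {family: {genome: count}}."""
    profile = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        family, *counts = [field.strip() for field in line.split("\t")]
        if len(counts) != len(GENOMES):
            raise ValueError(f"expected {len(GENOMES)} counts: {line!r}")
        profile[family] = dict(zip(GENOMES, (int(c) for c in counts)))
    return profile


# Two rows copied from the Pfam comparison table (item 9).
sample = "GATA\t0\t10\t12\t29\t16\t15\t31\nbZIP\t0\t8\t13\t24\t50\t40\t75\n"
profile = parse_profile(sample)
print(profile["GATA"]["ds"], profile["bZIP"]["at"])
```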