Ref: NBT-L28388A (acceptance letter – 19 Jan 2013)
Nadav Rappoport 1, Nathan Linial 1 and Michal Linial 2
1 School of Computer Science and Engineering,
2 Department of Biological Chemistry, Institute of Life Sciences, The Sudarsky Center for Computational Biology,
The Hebrew University of Jerusalem, Israel
* Corresponding author
Corresponding author details:
Michal Linial
E-Mail: michall@cc.huji.ac.il
Department of Biological Chemistry, Institute of Life Sciences
The Hebrew University
Givat Ram Campus
Jerusalem, 91904
Israel
Phone: 972-2-6585425
FAX: 972-2-6586448
Appendix
Supplementary text:
Methodology outline and quality assessment of ProtoNet
Text Sections 1-9
Tables S1-S9
Table S1: Statistics of proteins classifications and data growth
Table S2: Resources for annotation
Table S3: Number of annotations for the mouse proteome
Table S4: Annotation types for all proteins
Table S5: Best-clusters for annotation inference by source
Table S6: List of 2069 annotations and clusters’ statistics
Table S7: Protein accessions for cluster A4768953
Table S8: Expansion of cluster A4768953 by UniProtKB 2012
Table S9: Root cluster A4741039, DUF148.
Figures S1-S3
Fig. S1: The number of stable clusters along the ProtoNet depth
Fig. S2: Quality assessment of ProtoNet clusters by InterPro keywords
Fig. S3: General statistical information for clusters A4654740 and A4768953
Methodology – outline
1. Databases and sources
The tools and methods described in this paper were applied to the ProtoNet protein classification system. ProtoNet provides an agglomerative hierarchical clustering using several merging strategies. For the sake of simplicity, we discuss only one of the merging strategies offered by ProtoNet, called the ‘Arithmetic’ merging strategy 1.
The data source for ProtoNet is UniProtKB 2. ProtoNet 6.0 is based on release 15.4 of UniProtKB. An early version of ProtoNet, version 2.1, provided a classification hierarchy that covered all 114,033 proteins in the SwissProt database (SWP, release 40.28). In the resulting tree there were 227,436 clusters and a total of 630 roots (mostly singletons). The growth of ProtoNet is shown in Table S1. It reflects a dramatic growth over the years, with over 9 million proteins included in ProtoNet 6.0 3.
2. Database update - ProtoNet 6.1.
The growth in genomics has led to a rapid increase in the protein sequence space. Coping with the most recent version of UniProtKB is extremely challenging. To meet this challenge, ProtoNet 6.1 was updated with 2,478,328 representatives. These UniRef50 representatives cover 18,887,498 proteins that are present in the current version of UniProtKB (24th Sep. 2012).
Due to the dynamic nature of the UniProtKB resource, 241,795 ProtoNet leaves are no longer supported by UniProtKB (marked ‘obsolete’). There are 2,778,067 UniRef50 representatives that are not represented in ProtoNet 6.1, and 1,972,512 of them are of size 1. Altogether, the expanded number of proteins in UniRef50 clusters that are missing from ProtoNet 6.1 reaches 6,707,256.
Table S1. Statistics of proteins classifications and data growth
UniProt version
Type of classification version a
UniProt6.1 A-UP 2012
UniProt6.0 A-UP 15.4
Total proteins
Total clusters: w/o cond.
Total proteins:
Expanded b
Total clusters: w cond.
Total root clusters: w/o cond.
2236533 4929553 18887498 1383857 27103
2478328 4929553 9064751 1383857 27103
UniProt5.0 A-UP 8.1
UniProt4.0 A-SW41.21
1864667 3729154 3188835 635175
1072911 295615 50768
180
8947
UniProt2.0 A-SW40.28 114033 227436 37391 630
UniProt2.0 G- SW40.28 114033 227436 30065 630
UniProt1.0 A- SW39.15 94152 186795 32975 1509
UniProt1.0 G- SW39.15 94152 186795 26312 1509
Total root clusters: w cond.
132951
132951
52735
11716
3153
2217
5878
5111
a Classification types are A, Arithmetic; G, Geometric; SW, SwissProt/UniProtKB; UP, UniProtKB.
b Total number of proteins following expansion of the seed proteins (UniRef50 for UP15.4 and UniProt90 for UP8.1). Clusters are counted following filtration for minimal Life Time (LT≥1.0).
Fig. S1. The number of stable clusters along the ProtoNet depth (ProtoLevel, PL). UP 15.4 (red); UP 8.1 (green); SW41.21 (blue). The X-axis (ProtoLevel, PL) ranges from 0 (proteins as singletons) to 100 (the root of the ProtoNet tree). The graph focuses on values PL 40-100.
3. ProtoNet clustering measurements
The agglomerative hierarchical clustering scheme defines a set of terms that are intrinsically associated with the process. In such a scheme, each cluster is created from smaller clusters, which are captured as its descendants in the clustering tree.
ProtoNet is model-free (i.e., no HMMs or alignment-based PSSMs are considered). Hence, only intrinsic properties of the data and the merging process determine the features of the final ProtoNet tree.
ProtoLevel (PL) ranges from 0 to 100 and is used as a standard quantitative measure of the relative height of a cluster in the merging tree. Indirectly, the PL of a cluster reflects the global average of the BLAST E-values (sequence similarity) between proteins in the cluster. Specifically, pre-calculated all-against-all BLAST E-values are used for clustering. The similarity values are collected at a very relaxed BLAST E-value threshold of 100.
The PL of the leaves of the tree is defined as 0, whereas the PL of a root equals 100.
The larger the PL, the ‘later’ the merging that created the cluster took place.
Therefore, the PL scale is considered as an internal timer during the clustering process
(Fig. S1).
Lifetime (LT) of a cluster is the difference between PL at its creation (i.e. the time when two clusters were merged to form the present cluster) and its termination (i.e. the time where the cluster was merged with another cluster). The LT of a cluster reflects its remoteness from the clusters in its “vicinity”. Explanations for additional terms that describe the clustering process such as depth, connectivity and compactness are available at the ProtoNet Web site.
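The LT definition can be illustrated with a minimal sketch. The `Cluster` structure, the toy tree and its PL values are hypothetical and serve only to show the arithmetic; they are not taken from ProtoNet. Treating the LT of a root as undefined is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cluster:
    """One node of a merge tree (hypothetical structure, for illustration only)."""
    name: str
    birth_pl: float                 # PL at which the cluster was created; 0 for leaves
    parent: Optional[str] = None    # name of the cluster it was later merged into


def lifetime(tree: Dict[str, Cluster], name: str) -> Optional[float]:
    """Return LT = PL at termination - PL at creation, or None for a root."""
    node = tree[name]
    if node.parent is None:
        return None  # assumption: a root is never merged, so LT is left undefined
    return tree[node.parent].birth_pl - node.birth_pl


# Toy tree with invented PL values: leaves a, b merge into X; X and c merge into Y.
tree = {
    "a": Cluster("a", 0.0, parent="X"),
    "b": Cluster("b", 0.0, parent="X"),
    "c": Cluster("c", 0.0, parent="Y"),
    "X": Cluster("X", 12.5, parent="Y"),
    "Y": Cluster("Y", 40.0),
}

# Keep clusters whose lifetime meets a minimal threshold (LT >= 1.0).
stable = [n for n in tree if (lifetime(tree, n) or 0.0) >= 1.0]
print(lifetime(tree, "X"), stable)
```

In this toy tree, cluster X is created at PL 12.5 and merged at PL 40.0, so its LT is 27.5.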
Major annotation sources   Type      Coverage (%) SWP   Coverage (%) ProtoDB   Amount of annotations
ENZYME (6/09)              Func      44                 11                     5,190
SMART (6.0)                DOM+Fam   19                 14                     720
GENE3D (6.1.0)             DOM+Fam   37                 27                     1,024
CATH (3.2.0)               Str       37                 27                     3,338
SCOP (1.75)                Str       47                 35                     7,821
GO (1.7)                   Func      93                 52                     27,050
UniProtKB (15.4)           Func      99                 63                     949
PFAM (23.0)                DOM+Fam   92                 73                     10,640
InterPro (21.0)            DOM+Fam   95                 77                     18,638
NCBI Taxonomy (6/09)       Tax       100                100                    442,867
Table S2: The annotation resources in ProtoDB cover functional annotations (Func), domains (Dom) and families (Fam), structure (Str), taxonomy (Tax) and more.
4. Integration of annotations
The UniProt Keywords are a list of general functional terms. These keywords are based on information contained in SwissProt, TrEMBL, and PIR. InterPro 4 is a meta-annotation resource that combines 15 of the most widely used domain and family databases. We kept the original collection of databases that make up InterPro.
The major resources that are combined in the InterPro annotation scheme are: PROSITE, a database of protein families and domains. It is manually curated and used as a benchmark for false positive and false negative assignments. Pfam 5 is a large collection of multiple sequence alignments and hidden Markov models (HMMs) covering most protein domains and families. Pfam was used as a source of high-quality HMMs for repeats, domains and families. The PRINTS database 6 is a collection of protein fingerprints that characterize a protein family. The ProDom protein domain database 7 consists of an automatic compilation of homologous domains from a recursive PSI-BLAST search protocol. SMART 8 provides annotation for domains and their architectures. TIGRFAM 9 is a collection of protein families based on curated multiple sequence alignments and HMMs. SUPERFAMILY 10 is a library of profile HMMs that represent all proteins of known structure. The library is based on the SCOP classification 11. The Gene3D database 12 describes protein families and domain architectures in complete genomes. Gene3D unifies the HMM libraries of CATH 13 and Pfam domains. PANTHER 14 is a large collection of protein families that have been manually divided into subfamilies. HMMs are built for each family and subfamily.
Domain based annotations   # of annotated proteins   # of mouse proteins
PRODOM domain              199,830                   689
TIGRFAMs domain            1,806,403                 3,098
PIRSF domain               494,833                   3,483
PRINTS domain              1,516,755                 15,198
SSF domain                 3,804,487                 22,605
SMART domain               1,665,306                 24,656
PROSITE domain             4,836,740                 33,444
PFAM domain                8,945,725                 46,467
Taxonomy species           9,063,896                 60,605
Total w/o Taxonomy         194,312,350               602,836
Table S3. Summary of major domain based annotations associated with the mouse proteome.
5. Keyword correspondence scores
In order to measure the correspondence between a given cluster and a specific annotation, we define the notion of a correspondence score (CS). The CS for a certain cluster C and a given keyword K measures the correlation between the cluster and the keyword, using the well-known intersect-union ratio.
\[ CS(C, K) = \frac{|c \cap k|}{|c \cup k|} = \frac{TP}{TP + FP + FN} \]
where: c is the set of annotated proteins in cluster C, k is the set of proteins annotated with K, and TP, FP, FN stand for true positives, false positives, and false negatives, respectively. TP = the number of proteins in cluster C that have keyword annotation K, FP = the number of annotated proteins in cluster C that do not have keyword annotation K, FN = the number of proteins not in cluster C that have keyword annotation K. The cluster receiving the maximal score for keyword K is considered the cluster that best represents K within the ProtoNet tree. The score for a given cluster on keyword K ranges from 0 (no correspondence) to 1 (the cluster contains exactly all of the proteins with keyword K, i.e. maximally corresponds to the keyword). For a formal definition see 15.
For annotation keywords from several external sources, we define the cluster with the best CS for each keyword as the best cluster for this keyword. The sources used for defining the best clusters as well as their CS are: InterPro (families, domains and others), Pfam, SCOP (fold, superfamily, family and domain levels), GO (in 3 categories: molecular function, biological process and cellular component) and ENZYME (4 levels of EC hierarchy).
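The correspondence score and the choice of a best cluster can be sketched in a few lines. This is an illustrative sketch only: the function names, the cluster identifiers and the protein labels are hypothetical, and each cluster is represented simply as the set of its annotated proteins.

```python
from typing import Dict, Set, Tuple


def correspondence_score(cluster_annotated: Set[str], keyword_proteins: Set[str]) -> float:
    """Intersect-union ratio between a cluster and a keyword: TP / (TP + FP + FN)."""
    tp = len(cluster_annotated & keyword_proteins)   # in the cluster and carrying K
    fp = len(cluster_annotated - keyword_proteins)   # in the cluster, annotated, without K
    fn = len(keyword_proteins - cluster_annotated)   # carrying K but outside the cluster
    total = tp + fp + fn
    return tp / total if total else 0.0


def best_cluster(clusters: Dict[str, Set[str]], keyword_proteins: Set[str]) -> Tuple[str, float]:
    """Return the (cluster id, CS) pair with the maximal CS for the keyword."""
    scored = ((cid, correspondence_score(members, keyword_proteins))
              for cid, members in clusters.items())
    return max(scored, key=lambda pair: pair[1])


# Toy data, invented for illustration.
clusters = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3", "p4", "p5"},
}
keyword_k = {"p1", "p2", "p3", "p4"}

print(best_cluster(clusters, keyword_k))
```

Here cluster A scores 3/(3+0+1) = 0.75 and cluster B scores 4/(4+1+0) = 0.8, so B is the best cluster for the toy keyword.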
6. Clusters stability, pruning protocol and expanded clusters
The term assigned to our measure of the inherent stability of a cluster is the Life Time (LT). The LT is the difference between the time (i.e. merging step) at which a cluster was created and the time at which it is merged into a larger cluster. This value reflects the relative height of a cluster in the merging tree (Figure 1B). The level of the tree (ProtoLevel, PL) is an additional internal monotonic timer for merging along the clustering process. The LT and PL are combined for the purpose of tree pruning. Pruning the ProtoNet 6.0 tree at LT and PL thresholds of 1.0 and 90, respectively, resulted in 162,088 high-quality stable clusters. The pruning protocol yielded a 30-fold compression from the original 5 million clusters (including leaves) generated prior to pruning.
The ProtoNet tree is constructed using the representative proteins from UniRef50. UniRef50 comprises about 2.5 million representative proteins with the property that every protein has a >50% overlap with at least one representative protein. For each cluster the expanded list of proteins of the complete UniProtKB is provided. On average, there is a 4.5-fold expansion from UniRef50 to the full UniProtKB list. Thus the 10^7 proteins represent the expanded view of ProtoNet 6.0 3.
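As a quick arithmetic check of the compression figure quoted for the pruning protocol, using only the rounded count of 5 million clusters and the 162,088 stable clusters given in this section:

```latex
\[
\frac{5 \times 10^{6}}{162{,}088} \approx 30.8
\]
```

This agrees with the stated 30-fold compression. If the 4,929,553 clusters listed for ProtoNet 6.0 in Table S1 are used instead of the rounded value, the ratio is about 30.4.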
7. Performance by external experts
ProtoNet 6.0 provides a nested tree with about 150,000 stable clusters. Because ProtoNet is an unsupervised platform, we assess the quality of the clustering process against external annotations. At a ProtoLevel of >40, a drop in the number of clusters (containing at least two proteins) along the progression of the ProtoNet tree reflects the merging of pre-existing clusters and the establishment of larger clusters. Testing the quality of the mergers throughout the clustering protocol is based on the keyword Correspondence Score (CS).
The quality of the stable clusters (at a PL>1.0) is illustrated by testing the CS for all the families and domains from Pfam (12,000) and InterPro (18,000). The integrated version of InterPro covers about 80% of the proteins. About 2/3 of these keywords are included in the analysis using a minimal cutoff of ≥20 proteins in each cluster.
Fig. S2 shows that the quality as measured by the CS is stable throughout the entire range of the ProtoLevel / Birth time of the ProtoNet tree. This trend is valid for Pfam and InterPro. The average quality of Pfam clusters as measured by CS is 0.91 (Fig. 1D) and the CS values for InterPro average 0.8.
Fig. S2. Quality of ProtoNet clusters (≥20 proteins) according to InterPro families and domains. The X-axis is the Birth Time (time of merging). The dashed white line shows CS=0.5. Note that throughout the Birth time range (0-1.0), the CS remains at 0.8. Hence, the high quality of the ProtoNet tree in view of InterPro is valid for all the levels of the tree.
8. Genomic perspective on the protein clusters
The number of organisms covered by UniProtKB is huge (Table S2). Nevertheless, a third of the protein sequences originate from a relatively small number of organisms that were completely sequenced (annotated as a complete proteome). Multi-cellular organisms that serve as genetic model organisms are included in the genome view.
ProtoNet 6.0 supports over 30 organisms from all superkingdoms. Specifically, D. melanogaster, C. elegans, human, honeybee, mouse and more are organized in a taxonomy tree. Selecting any node from the organisms’ tree returns the clusters that include proteins from the selected taxonomical level.

Annotation type               # unique proteins in DB   # of proteins in mouse
EC - x                        1,025,565                 4,170
EC - x.x                      1,016,337                 4,082
EC - x.x.x                    1,001,790                 3,985
EC - x.x.x.x                  893,537                   3,025
Tax species                   9,063,869                 60,605
Tax genus                     8,799,481                 60,605
Tax family                    8,408,876                 60,605
Tax order                     8,047,843                 60,605
Tax class                     7,414,051                 60,605
Tax phylum                    7,981,148                 60,605
Tax kingdom                   2,380,779                 60,605
Tax superkingdom              8,272,660                 60,605
SMART domain                  1,306,672                 24,656
PRINTS domain                 1,321,388                 15,198
PFAM domain                   6,582,986                 46,467
PROSITE domain                3,165,669                 33,444
PRODOM domain                 190,503                   689
TIGRFAMs domain               1,657,561                 3,098
PIRSF domain                  494,833                   3,483
SSF domain                    3,137,662                 22,605
PANTHER domain                1,445,866                 16,112
GENE3D domain                 2,434,880                 19,293
PFAM CLANS domain             3,438,771                 27,588
InterPro Domain               4,886,673                 35,843
InterPro Family               3,905,328                 23,414
InterPro Conserved_site       828,822                   9,314
InterPro Active_site          257,406                   2,496
InterPro Repeat               220,350                   4,452
InterPro Region               204,357                   4,993
InterPro Binding_site         170,864                   2,480
InterPro PTM                  48,201                    623
UniProt keyword               5,667,975                 33,365
GO molecular function         4,135,974                 40,941
GO biological process         3,516,315                 35,457
GO cellular component         2,333,260                 36,693
CATH class                    2,432,181                 19,280
CATH architecture             2,432,181                 19,280
CATH topology                 2,430,414                 19,278
CATH homologous superfamily   2,424,415                 19,238
SCOP class                    3,135,943                 22,598
SCOP fold                     3,135,943                 22,598
SCOP superfamily              3,135,943                 22,598
Total                         143,849,828               1,148,281
Total w/o Taxonomy            74,416,565                602,836
Avg. / protein                8.2                       9.2
Table S4. Annotation types associated with all proteins (an expanded set, >9 million proteins).
9. Navigation in the ProtoNet family tree
The overall summary of the ProtoNet clusters is beyond the scope of this correspondence letter. A list of the proteins that are analyzed in the example clusters is shown.
Table S5. ProtoNet clusters with maximal Correspondence Score (CS) values for thousands of annotations are termed ‘Best clusters’. Several filters were applied to present clusters with high confidence for annotation inference. The filters used include (i) average length ≤300; (ii) cluster size ≥10; (iii) the number of proteins that are subject to inference >0; (iv) the fraction of annotation-inferred proteins ≤50%. There are 2,069 annotations that match the combination of the selected filters. These annotations are from a variety of sources that are supported by ProtoNet (Fig. 2A).
Keyword Type                  Number of known proteins   Number of predicted proteins
CATH homologous superfamily   4,818                      717
CATH topology                 3,155                      453
EC - all levels               2,224                      713
GO biological process         6,375                      2,000
GO cellular component         1,708                      504
GO molecular function         5,639                      1,815
InterPro Domain               16,682                     2,827
InterPro Family               36,287                     4,947
PFAM domain                   58,479                     6,941
SCOP superfamily              5,647                      395
UniProt keyword               3,586                      1,266
Total                         144,600                    22,578
Total Number of clusters: 2,069
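The four filters listed in the Table S5 caption can be expressed as a simple predicate. The sketch below is illustrative only: the record fields and the toy clusters are hypothetical, and the fraction in filter (iv) is assumed here to be the number of inferred proteins divided by the cluster size, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BestCluster:
    """Summary of one 'Best cluster' (field names are hypothetical)."""
    cluster_id: str
    avg_length: float   # average protein length in the cluster
    size: int           # number of proteins in the cluster
    n_inferred: int     # proteins that would receive the annotation by inference


def passes_filters(c: BestCluster) -> bool:
    """Apply filters (i)-(iv) from the Table S5 caption."""
    if c.size == 0:
        return False
    inferred_fraction = c.n_inferred / c.size   # assumed denominator: cluster size
    return (
        c.avg_length <= 300             # (i) average length
        and c.size >= 10                # (ii) cluster size
        and c.n_inferred > 0            # (iii) at least one protein to infer
        and inferred_fraction <= 0.5    # (iv) at most 50% inferred
    )


# Toy records, invented for illustration only.
candidates: List[BestCluster] = [
    BestCluster("toy-1", avg_length=105.0, size=65, n_inferred=18),
    BestCluster("toy-2", avg_length=420.0, size=40, n_inferred=5),    # fails (i)
    BestCluster("toy-3", avg_length=98.0, size=8, n_inferred=2),      # fails (ii)
    BestCluster("toy-4", avg_length=150.0, size=20, n_inferred=15),   # fails (iv)
]

selected = [c.cluster_id for c in candidates if passes_filters(c)]
print(selected)
```

Only the first toy record satisfies all four conditions.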
Table S6. The 2,069 annotations of the ‘Best clusters’ are associated with 1,082 unique clusters. A detailed list of the clusters and their related attributes (number of proteins, false positives, false negatives, CS values, fraction of inference, ProtoLevel and more) is shown. The numbers of proteins listed in the table are according to UniRef50 representatives. On average, the expanded protein list is 7.8-fold larger. The average CS for all the Best clusters is very high (0.89).
Table S7. UniProtKB accession numbers of proteins from cluster A4768953 (98.5% from bacteria). The cluster is created at PL 96.9 (and LT≥1.0). There are 121 UniRef50 representatives that account for 371 sequences in ProtoNet 6.0.
A0AY61_BURCH, A0AYP1_BURCH, A0B523_BURCH, A0KDK2_BURCH, A0YUS0_9CYAN,
A0YXM6_9CYAN, A0ZIK3_NODSP, A1AQR7_PELPD, A1ASA5_PELPD, A1ATP6_PELPD,
A1IQ44_NEIMA, A1KS43_NEIMF, A1U426_MARAV, A1URJ7_BARBK, A1VDE5_DESVV,
A1VDJ0_DESVV, A1VID2_POLNA, A1VUI2_POLNA, A1VUU8_POLNA, A1VX41_POLNA,
A1WA18_ACISJ, A1WDJ6_ACISJ, A1WN69_VEREI, A1WN70_VEREI, A2FQI0_TRIVA,
A2S771_BURM9, A2W067_9BURK, A2W0N7_9BURK, A2WHL8_9BURK, A3ETU0_9BACT,
A3EUI8_9BACT, A3EUT8_9BACT, A3EUW0_9BACT, A3IMU5_9CHRO, A3ITS1_9CHRO,
A3IVA6_9CHRO, A3JL43_9ALTE, A3MPS7_BURM7, A3N343_ACTP2, A3NQ10_BURP0,
A3RSU1_RALSO, A3RVF0_RALSO, A3T2F0_9RHOB, A3T2P0_9RHOB, A3U3I6_9RHOB,
A3VJ15_9RHOB, A3XE30_9RHOB, A3YBX4_9GAMM, A3YXB9_9SYNE, A3YYN4_9SYNE,
A4BMI2_9GAMM, A4BMU0_9GAMM, A4G757_HERAR, A4JLJ4_BURVG, A4JWV1_9CAUD,
A4MHG7_BURPS, A4NIH0_HAEIN, A4P1H1_HAEIN, A4SDW1_PROVI, A4SDW9_PROVI,
A4SMF9_AERS4, A4TNW4_YERPP, A4W5D4_ENT38, A4YYF9_BRASO, A5EB55_BRASB,
A5EWA6_DICNV, A5G8E7_GEOUR, A5THN4_BURMA, A5VYQ8_PSEP1, A5VZX2_PSEP1,
A5ZQ34_9FIRM, A6BTF5_YERPE, A6GBU6_9DELT, A6GLV3_9BURK, A6GSB6_9BURK,
A6VNM7_ACTSZ, A6VRK4_MARMS, A6VWS7_MARMS, A6W1R6_MARMS, A7BLG0_9GAMM,
A7BNL4_9GAMM, A7C8Q0_BURPI, A7DVA4_VIBVU, A7HV43_PARL1, A7J7Y1_PBCVF,
A7JT83_PASHA, A7JWP4_PASHA, A7JWT6_PASHA, A7K8S7_9PHYC, A7MAH2_PSEAE,
A7ZDD6_CAMC1, A8F2Y5_RICM5, A8GU01_RICRS, A8GV95_RICB8, A8YEU1_MICAE,
A8YFI8_MICAE, A8ZL86_ACAM1, A8ZNL1_ACAM1, A8ZRU4_DESOH, A8ZUV6_DESOH,
A8ZZ23_DESOH, A9AP44_BURM1, A9BR82_DELAS, A9BSW9_DELAS, A9BUI5_DELAS,
A9BZS8_DELAS, A9EAX3_9RHOB, A9INC9_BART1, A9IPK2_BART1, A9ITS5_BART1,
A9IY58_BART1, A9K0S7_BURMA, A9KH51_COXBN, A9L145_SHEB9, A9M1H7_NEIM0,
A9MEC2_BRUC2, A9MYW2_SALPB, A9QS52_PSESX, A9R2C7_YERPG, A9VY58_METEP,
A9WY66_BRUSI, A9ZA21_YERPE, A9ZV04_YERPE, B0BH72_9BACT, B0BSH4_ACTPJ,
B0BVJ3_RICRO, B0C1R5_ACAM1, B0C9N6_ACAM1, B0CDJ2_ACAM1, B0CF50_ACAM1,
B0GF09_YERPE, B0GPY3_YERPE, B0H3V1_YERPE, B0HEH9_YERPE, B0HZ49_YERPE,
B0J3J2_RHILT, B0JHD9_MICAN, B0JHE9_MICAN, B0JNS5_MICAN, B0JUA6_MICAN,
B0JUV0_MICAN, B0KS04_PSEPG, B0QVD2_HAEPR, B0QX65_HAEPR, B0SXW2_CAUSK,
B0USW2_HAES2, B0UW55_HAES2, B1FAH8_9BURK, B1FFX5_9BURK, B1GB48_9BURK,
B1K4X1_BURCC, B1K610_BURCC, B1MA45_METRJ, B1SYC7_9BURK, B1T955_9BURK,
B1VJ81_PROMH, B1YZV0_BURA4, B1Z0V9_BURA4, B1ZAK5_METPB, B1ZHK8_METPB,
B1ZJU5_METPB, B1ZLX6_METPB, B2C6T4_ACIBA, B2FDC5_RHIME, B2FJ08_STRMK,
B2GZV1_BURPS, B2I5N4_XYLF2, B2IAQ2_XYLF2, B2TE06_BURPP, B2TSF9_SHIB3,
B2U639_ECOLX, B2UHE0_RALPJ, B2VAY8_ERWT9, B3DQN4_BIFLD, B3E764_GEOLS,
B3ECU4_CHLL2, B3EMQ3_CHLPB, B3EMQ6_CHLPB, B3GYY1_ACTP7, B3HFC0_ECOLX,
B3ISX8_ECOLX, B3JYH8_9DELT, B3PFX8_CELJU, B3PFX9_CELJU, B3R4X4_CUPTR,
B3R659_CUPTR, B3X774_SHIDY, B4BWK0_9CHRO, B4EL47_BURCJ, B4EML7_BURCJ,
B4F1A0_PROMH, B4RNY5_NEIG2, B4SE49_PELPB, B4SF38_PELPB, B4ST99_STRM5,
B4VPB9_9CYAN, B4WDE7_9CAUL, B5C0W3_SALET, B5EJ13_GEOBB, B5ERP5_ACIF5,
B5FD56_VIBFM, B5FHK8_SALDC, B5N208_SALET, B5PE14_SALET, B5PPY4_SALHA,
B5RZ11_RALSO, B5SLX0_RALSO, B5WP46_9BURK, B5XNN8_KLEP3, B6AKX5_9BACT,
B6ANJ3_9BACT, B6C3Q3_9GAMM, B6C438_9GAMM, B6IYD2_RHOCS, B6X043_9ENTR,
B7GQ85_BIFLI, B7JAX0_ACIF2, B7KPZ4_METC4, B7KSA8_METC4, B7LJE8_ECOLU,
B7YG92_VARPD, B7YTG6_VARPD, B8F4N1_HAEPS, B8F7F6_HAEPS, B8GSA3_THISH,
B8GTV7_THISH, B8GUV3_THISH, B8H360_CAUCN, B8ITM9_METNO, B8KUY3_9GAMM,
B8LA05_9GAMM, B8LA19_9GAMM, B9B269_9BURK, B9BKG6_9BURK, B9C4H2_9BURK,
B9DAL1_9GAMM, B9JFK9_AGRRK, B9JPN3_AGRRK, B9K3H8_AGRVS, B9NWT0_9RHOB,
C0BSI9_9BIFI, C0G9V0_9RHIZ, C0GTU2_9DELT, C0GUR9_9DELT, C0Q4S6_SALPC,
C0QI01_DESAH, C0VCW4_9MICO, C0YUD1_9FLAO, C1ADF9_9BACT, C1DE86_AZOVD,
C1DR99_AZOVD, C1HVF5_NEIGO, C1SPK8_9BACT, C1T3N5_DESBA, C1XQ07_9DEIN,
C2A215_SULDE, C2BV60_9ACTO, C2CUD4_GARVA, C2CUJ3_GARVA, C2KQK6_9ACTO,
C2KR20_9ACTO, C2LIN4_PROMI, C3K003_PSEFL, C3KFQ9_PSEFL, C3WDL4_FUSMR,
Q01YS7_SOLUE, Q02RQ2_PSEAB, Q05HP1_XANOR, Q07IW4_RHOP5, Q0AF37_NITEC,
Q0B469_BURCM, Q0B4P3_BURCM, Q0CZA0_ASPTN, Q0KBP5_RALEH, Q11N66_MESSB,
Q131N1_RHOPS, Q138H7_RHOPS, Q13FU3_BURXL, Q13LE6_BURXL, Q1BLP1_BURCA,
Q1BW04_BURCA, Q1CAI7_YERPA, Q1CFJ4_YERPN, Q1I6T7_PSEE4, Q1IB13_PSEE4,
Q1IU34_ACIBL, Q1NJK5_9DELT, Q1NP18_9DELT, Q1NR36_9DELT, Q1QF67_NITHX,
Q1QFT4_NITHX, Q1RHK4_RICBR, Q1XGP3_PSEPU, Q214U9_RHOPB, Q21ZJ2_RHOFD,
Q2ISG9_RHOP2, Q2IT41_RHOP2, Q2IUV2_RHOP2, Q2RPR8_RHORT, Q2RUT9_RHORT,
Q2RX84_RHORT, Q2SIZ8_HAHCH, Q2TR62_ACIBA, Q2W3Q7_MAGMM, Q315U5_DESDG,
Q392I8_BURS3, Q393E1_BURS3, Q39SL9_GEOMG, Q3AR07_CHLCH, Q3B1U1_PELLD,
Q3BTL9_XANC5, Q3JDA3_NITOC, Q3JDK7_NITOC, Q3JDX3_NITOC, Q3KEQ6_PSEPF,
Q3KIR2_PSEPF, Q3KJ92_PSEPF, Q3QZ61_XYLFA, Q3R7W4_XYLFA, Q3RFI8_XYLFA,
Q48B36_PSE14, Q4BUP6_CROWT, Q4BUQ5_CROWT, Q4BVR2_CROWT, Q4KIY9_PSEF5,
Q4UJX2_RICFE, Q5NZP9_AZOSE, Q5QWX9_IDILO, Q5X039_LEGPL, Q63LE9_BURPS,
Q63YL3_BURPS, Q6ALJ0_DESPS, Q6G4P2_BARHE, Q6J5I5_HAEIN, Q6MCI0_PARUW,
Q6N367_RHOPA, Q6W4R6_VIBAN, Q6ZEG0_SYNY3, Q70W78_YEREN, Q74CX9_GEOSL,
Q7CH33_YERPE, Q7MBL1_VIBVY, Q7N237_PHOLL, Q7N3K9_PHOLL, Q7N4L9_PHOLL,
Q7N6M0_PHOLL, Q7N7P7_PHOLL, Q7N7R2_PHOLL, Q7N8S2_PHOLL, Q7X1K1_9BACT,
Q820L9_NITEU, Q840E6_9GAMM, Q847G4_PSEPU, Q879U0_XYLFT, Q87CA7_XYLFT,
Q87UD0_PSESM, Q88NG8_PSEPK, Q88PM1_PSEPK, Q89KB2_BRAJA, Q8FLW1_COREF,
Q8G4Q3_BIFLO, Q8VMN8_PSEPU, Q8XTN0_RALSO, Q8YUE9_ANASP, Q926N9_LISIN,
Q9A407_CAUCR, Q9P9V8_XYLFA, Q9PCS6_XYLFA, Q9PHG2_XYLFA, Q9XAX6_PSEAC,
Y1420_HAEIN
Fig. S3. A summary page for ProtoNet clusters A4654740 (98 proteins) and A4768953 (112 proteins, Fig. 2B). Statistical details for each cluster are shown in a table format. PL=100 reflects a root cluster. Cluster A4654740 was assigned as the ‘Best cluster’ for the InterPro keyword ‘Addiction module antidote protein, HI1420’ (CS=1.0), and this keyword is therefore assigned as its ProtoName. In the ProtoNet database, there are 53 UniRef50 representative proteins that are annotated by this keyword; all are included in this cluster.
Table S8.
ProtoNet 6.1 view of expanded cluster A4768953. The 121 representative
UniRef50 sequences are expanded to 1022 sequences according to ProtoNet 6.1.
Cluster ID       Size  Length  Cluster name
UniRef50_A0YUS0     1     150  Putative uncharacterized protein
UniRef50_A0YXM6     2     104  Putative uncharacterized protein
UniRef50_A1AQR7     1     118  Putative transcriptional regulator
UniRef50_A1VX41     3     106  Putative transcriptional regulator
UniRef50_A2FQI0     1     102  Putative uncharacterized protein
UniRef50_A3IMU5     5      96  Putative uncharacterized protein
UniRef50_A3T2P0    25     106  Putative uncharacterized protein
UniRef50_A4BMI2     1     108  Putative uncharacterized protein
UniRef50_A4G757     1      62  Putative uncharacterized protein
UniRef50_A4YAZ9     5      88  Fimbrial protein pilin
UniRef50_A5EB55     1      79  Putative uncharacterized protein
UniRef50_A5EWA6     1     252  Putative uncharacterized protein
UniRef50_A5ZQ34     1      97  Uncharacterized protein
UniRef50_A6GBU6     1      74  Putative uncharacterized protein
UniRef50_A6VWS7     1     101  Putative transcriptional regulator
UniRef50_A6W1R6     1     275  Putative transcriptional regulator
UniRef50_A7BLG0     5      89  Transcriptional regulator
UniRef50_A7BNL4     1      81  Putative uncharacterized protein
UniRef50_A7DVA4     1      70  Transcriptional regulator, Cro/CI family
UniRef50_A7HV43     5      97  Putative transcriptional regulator
UniRef50_A7J7Y1    14     107  Putative uncharacterized protein n627R
UniRef50_A7JT83     3     100  Possible transcriptional regulator
UniRef50_A7K8S7     1      40  Putative uncharacterized protein z317L
UniRef50_A8F2Y5     1      78  Transcriptional regulator
UniRef50_A8ZNL1     7      97  Putative uncharacterized protein
UniRef50_A8ZRU4     5      97  Putative transcriptional regulator
UniRef50_A8ZUV6    19      86  Putative transcriptional regulator
UniRef50_A9BR82     2      67  Putative uncharacterized protein
UniRef50_A9KH51     3      70  Putative uncharacterized protein
UniRef50_A9L145     1     107  Putative uncharacterized protein
UniRef50_B0C1R5     1      90  Addiction module antitoxin, putative
UniRef50_B0CDJ2     1      76  Putative uncharacterized protein
UniRef50_B0JHD9     1     106  Putative uncharacterized protein
UniRef50_B0SXW2    14      92  Putative transcriptional regulator
UniRef50_B1ZAK5     1      99  Putative transcriptional regulator
UniRef50_B2FDC5     2      63  Putative partial transcriptional regulator protein
UniRef50_B3X774     1     125  DNA-binding prophage protein
UniRef50_B4VPB9     1      94  Putative uncharacterized protein
UniRef50_B4WDE7     4      94  Probable addiction module antidote protein
UniRef50_B5PPY4     4      92  Putative uncharacterized protein
UniRef50_B6IYD2     2      44  Putative uncharacterized protein
UniRef50_B8H360     1      98  Putative transcriptional regulator, antitoxin protein higA
UniRef50_B8ITM9     4     107  Putative transcriptional regulator
UniRef50_B9JFK9     7     111  Transcriptional regulator
UniRef50_B9JPN3     2      69  Putative uncharacterized protein
UniRef50_C0BSI9     8     148  Putative uncharacterized protein
UniRef50_C1ADF9     2     133  Uncharacterized protein
UniRef50_C2KR20     1      43  Possible transcriptional regulator
UniRef50_C3WDL4     1     121  Putative uncharacterized protein
UniRef50_C5D1V6     1     100  Addiction module antidote protein
UniRef50_C6AZF6     1      68  Addiction module antidote protein
UniRef50_C6DYU8     1      67  Addiction module antidote protein
UniRef50_C9T0K8     1     106  Predicted protein
UniRef50_D0SF18     7      98  Putative uncharacterized protein
UniRef50_D1BRK1     6      99  Addiction module antidote protein
UniRef50_D2UFF5     2      52  Putative uncharacterized protein
UniRef50_D6SJR1    48     116  Addiction module antidote protein
UniRef50_D8F2I0    11      97  Toxin-antitoxin system, antitoxin component, ribbon-helix-helix domain protein
UniRef50_E0NB55    16      98  XRE family transcriptional regulator
UniRef50_E1VPC2     5     390  Predicted transcriptional regulator
UniRef50_E2STQ7   116     174  Toxin-antitoxin system, antitoxin component, Xre family
UniRef50_E2XVJ2    34     103  Transcriptional regulator
UniRef50_E3HHX7     6     106  Transcriptional regulator 3
UniRef50_E6PN15    20     118  Uncharacterized protein
UniRef50_E6QCN5    13     106  Uncharacterized protein
UniRef50_E6QCQ8     2     100  Uncharacterized protein
UniRef50_E6X9K7    13     100  Addiction module antidote protein
UniRef50_E8X7N4    12     108  Addiction module antidote protein
UniRef50_F3INW7     5     101  Uncharacterized protein
UniRef50_F3L0G0     4     125  Addiction module antidote protein
UniRef50_F6AX15     4     109  Putative uncharacterized protein
UniRef50_F6BME8    11     164  Addiction module antidote protein
UniRef50_F7ND47    13      99  Transcriptional regulator
UniRef50_F8GGJ2     4     105  Addiction module antidote protein
UniRef50_F8LFL8     2      94  Uncharacterized protein HI_1420
UniRef50_G4E1Q1    35     106  Addiction module antidote protein
UniRef50_G6XQZ4    32     121  Addiction module antidote protein
UniRef50_G7CR59    13     100  Putative uncharacterized protein
UniRef50_G7HEF7     3      84  Addiction module antidote protein
UniRef50_G8Q9N4    36     196  Transciptional regulator
UniRef50_H0HXW3    11     108  Addiction module antidote protein
UniRef50_H1KHC6     2     105  Addiction module antidote protein
UniRef50_H1LAV3     2     182  Addiction module antidote protein
UniRef50_H1NMJ8     7     105  DNA replication and repair protein RecF
UniRef50_H5WLK2    27     131  Thiol-disulfide isomerase-like thioredoxin
UniRef50_H6SIS2     8     129  Predicted transcriptional regulator
UniRef50_I0INU3    30     106  Uncharacterized protein
UniRef50_I1WF00     5     102  Addiction module antidote protein
UniRef50_I2Y9I5     3     101  Toxin-antitoxin system, antitoxin component, Xre family
UniRef50_I4FCC2     2      97  Uncharacterized protein
UniRef50_I4FRG9     2     104  Uncharacterized protein
UniRef50_I4XN50     1      95  Uncharacterized protein
UniRef50_I6HTI1     2     187  Uncharacterized protein
UniRef50_J0QIC2    13     177  Uncharacterized protein
UniRef50_J1A1Y8     1     305  Uncharacterized protein
UniRef50_J1P6H9     1     221  Uncharacterized protein
UniRef50_J2P562     1     107  Uncharacterized protein
UniRef50_P44191     1     168  Uncharacterized protein HI_1420
UniRef50_Q01YS7     1     183  Putative uncharacterized protein
UniRef50_Q0AF37    16     108  Putative transcriptional regulator
UniRef50_Q0CZA0    10     120  Predicted protein
UniRef50_Q11N66    37     107  Putative uncharacterized protein
UniRef50_Q138H7     4     104  Transcriptional regulator-like
UniRef50_Q1IB13    18     106  Putative uncharacterized protein
UniRef50_Q1IU34     9      94  Putative uncharacterized protein
UniRef50_Q1QFT4    10     126  Putative uncharacterized protein
UniRef50_Q21ZJ2    84     103  Putative uncharacterized protein
UniRef50_Q2ISG9     3      99  Putative transcriptional regulator
UniRef50_Q2IUV2    18     105  Putative uncharacterized protein
UniRef50_Q2W3Q7     1      86  Putative uncharacterized protein
UniRef50_Q315U5     1      91  Addiction module antidote protein
UniRef50_Q3BTL9     1     188  Putative uncharacterized protein
UniRef50_Q3JDK7     1      69  Transcriptional regulator, Cro/CI family
UniRef50_Q4BUP6     5      81  Similar to transcriptional regulator
UniRef50_Q4UJX2     2     185  Predicted transcriptional regulator
UniRef50_Q5X039    18      78  Putative uncharacterized protein
UniRef50_Q6ALJ0     1     100  Putative uncharacterized protein
UniRef50_Q70W78     1     112  Putative uncharacterized protein ORF7
UniRef50_Q7N237     1     117  Similar to hypothetical protein of Haemophilus influenzae
UniRef50_Q820L7     3     229  Possible phoP; Response regulators consisting of a CheY-like receiver domain and a HTH DNA-binding domain
UniRef50_Q89KB2     2     100  Blr4995 protein
UniRef50_Q8YUE9     1      98  Asl2401 protein
UniRef50_Q926N9     1     173  Pli0013 protein

Table S9. UniProtKB accession numbers of proteins from the root cluster A4741039 (65 proteins, Fig. 2D). Cluster name: Protein of unknown function DUF148. The cluster includes all 47 proteins of DUF148 (PF02520, for UniRef50) and is thus associated with CS=1.0. An additional 18 proteins in this cluster lack any InterPro / Pfam annotations. Recall that the DUF148 family has no known function, nor do any of the proteins that possess it. Still, all the proteins in the cluster belong to a diverse collection of nematodes (some cause severe human diseases). Inspection of this cluster suggests that these proteins may act as surface antigens and may activate the host immune response. The expanded cluster (UniRef100) comprises 128 sequences.
A0DWN3_PARTE, A1IKL2_ANISI, A1Z6D0_CAEEL, A7M6T4_ANISI,
A8NVF8_BRUMA, A8Q0T0_BRUMA, A8Q3S4_BRUMA, A8Q4D7_BRUMA,
A8R0N5_BRUPA, A8WY92_CAEBR, A8WYA2_CAEBR, A8X0P8_CAEBR,
A8X193_CAEBR, A8X194_CAEBR, A8X6I6_CAEBR, A8XIZ6_CAEBR,
A8XLL8_CAEBR, A8XTF6_CAEBR, A8XZS0_CAEBR, A8Y4D4_CAEBR,
B0ZBB0_9PELO, B2XCP1_ANISI, B6IH84_CAEBR, B6VBW3_9PELO,
O01674_ACAVI, O17021_CAEEL, O17974_CAEEL, O45098_CAEEL,
O45347_CAEEL, O76573_CAEEL, OV17_ONCVO, OV39_ONCVO, P91548_CAEEL,
Q18280_CAEEL, Q18577_CAEEL, Q19406_CAEEL, Q19414_CAEEL,
Q1W204_ANCCA, Q20202_CAEEL, Q20998_CAEEL, Q21588_CAEEL,
Q22237_CAEEL, Q22537_CAEEL, Q23545_CAEEL, Q23614_CAEEL,
Q23615_CAEEL, Q23683_CAEEL, Q4W5P1_CAEEL, Q5FC55_CAEEL,
Q6LA91_CAEEL, Q6RUQ0_MELIC, Q6UY31_HUMAN, Q7YTI8_CAEEL,
Q7YTP4_CAEEL, Q7YWN3_CAEEL, Q8MXD0_9BILA, Q8WQ00_OSTOS,
Q93122_ASCSU, Q95YM3_NIPBR, Q9GU97_LOALO, Q9NFS0_GLORO,
Q9NFU9_GLORO, Q9TZL7_WUCBA, Q9XTN7_CAEEL, YY23_CAEEL
References
1. O. Sasson, A. Vaaknin, H. Fleischer et al., Nucleic Acids Res 31 (1), 348 (2003).
2. C. H. Wu, R. Apweiler, A. Bairoch et al., Nucleic Acids Res 34 (Database issue), D187 (2006).
3. N. Rappoport, S. Karsenty, A. Stern et al., Nucleic Acids Res 40 (Database issue), D313 (2012).
4. S. Hunter, R. Apweiler, T. K. Attwood et al., Nucleic Acids Res 37 (Database issue), D211 (2009).
5. R. D. Finn, J. Tate, J. Mistry et al., Nucleic Acids Res 36 (Database issue), D281 (2008).
6. T. K. Attwood, P. Bradley, D. R. Flower et al., Nucleic Acids Res 31 (1), 400 (2003).
7. F. Corpet, F. Servant, J. Gouzy et al., Nucleic Acids Res 28 (1), 267 (2000).
8. I. Letunic, R. R. Copley, S. Schmidt et al., Nucleic Acids Res 32 (Database issue), D142 (2004).
9. D. H. Haft, J. D. Selengut, and O. White, Nucleic Acids Res 31 (1), 371 (2003).
10. M. Madera, C. Vogel, S. K. Kummerfeld et al., Nucleic Acids Res 32 (Database issue), D235 (2004).
11. A. G. Murzin, S. E. Brenner, T. Hubbard et al., J Mol Biol 247 (4), 536 (1995).
12. F. Pearl, A. Todd, I. Sillitoe et al., Nucleic Acids Res 33 (Database issue), D247 (2005).
13. C. A. Orengo, A. D. Michie, S. Jones et al., Structure 5 (8), 1093 (1997).
14. H. Mi, B. Lazareva-Ulitsky, R. Loo et al., Nucleic Acids Res 33 (Database issue), D284 (2005).
15. N. Kaplan, O. Sasson, U. Inbar et al., Nucleic Acids Res 33 (Database issue), D216 (2005).