Supplementary Text and Figures


Nature Biotechnology

Ref: NBT-L28388A (acceptance letter – 19 Jan 2013)

ProtoNet: Charting the expanding universe of protein sequences

Nadav Rappoport 1, Nathan Linial 1 and Michal Linial 2

1 School of Computer Science and Engineering, 2 Department of Biological Chemistry, Institute of Life Sciences, The Sudarsky Center for Computational Biology, The Hebrew University of Jerusalem, Israel

* Corresponding author

Corresponding author details:

Michal Linial

E-Mail: michall@cc.huji.ac.il

Department of Biological Chemistry, Institute of Life Sciences

The Hebrew University

Givat Ram Campus

Jerusalem, 91904

Israel

Phone: 972-2-6585425

FAX: 972-2-6586448


Appendix

Supplementary text:

Methodology outline and quality assessment of ProtoNet

Text Sections 1-9

Tables S1-S9

Table S1: Statistics of proteins classifications and data growth

Table S2: Resources for annotation

Table S3: Number of annotations for the mouse proteome

Table S4: Annotation types for all proteins

Table S5: Best-clusters for annotation inference by source

Table S6: List of 2069 annotations and clusters’ statistics

Table S7: Protein accessions for cluster A4768953

Table S8: Expansion of cluster A4768953 by UniProtKB 2012

Table S9: Root cluster A4741039, DUF148.

Figures S1-S3

Fig. S1: The number of stable clusters along the ProtoNet depth

Fig. S2: Quality assessment of ProtoNet clusters by InterPro keywords

Fig. S3: General statistical information for clusters A4654740 and A4768953

Methodology – outline

1. Databases and sources

The tools and methods described in this paper were applied to the ProtoNet protein classification system. ProtoNet provides an agglomerative hierarchical clustering using several merging strategies. For the sake of simplicity, we discuss only one of the merging strategies offered by ProtoNet, called the ‘Arithmetic’ merging strategy (ref. 1).

The data source for ProtoNet is UniProtKB (ref. 2). ProtoNet 6.0 is based on release 15.4 of UniProtKB. An early version of ProtoNet, version 2.1, provided a classification hierarchy that covered all 114,033 proteins in the SwissProt database (SWP, release 40.28). The resulting tree contained 227,436 clusters and a total of 630 roots (mostly singletons). The growth of ProtoNet is shown in Table S1. It reflects a dramatic growth over the years, with over 9 million proteins included in ProtoNet 6.0 (ref. 3).

2. Database update - ProtoNet 6.1.

The growth of genomics has led to a rapid increase in the protein sequence space. Coping with the most recent version of UniProtKB is extremely challenging. To meet this challenge, ProtoNet 6.1 was updated with 2,478,328 representatives. These UniRef50 representatives cover 18,887,498 proteins that are present in the current version of UniProtKB (24 Sep. 2012).

Owing to the dynamic nature of the UniProtKB resource, 241,795 ProtoNet leaves are no longer supported by UniProtKB (marked ‘obsolete’). There are 2,778,067 UniRef50 representatives that are not represented in ProtoNet 6.1, of which 1,972,512 are clusters of size 1. Altogether, the expanded number of proteins in UniRef50 clusters that are missing from ProtoNet 6.1 reaches 6,707,256.
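The bookkeeping in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative, and reading the first difference as the number of still-supported leaves is an assumption (it happens to equal the value 2236533 listed for the 2012 update in Table S1).

```python
# Counts quoted in Section 2 (ProtoNet 6.1 update).
representatives = 2_478_328      # UniRef50 representatives in ProtoNet 6.1
obsolete_leaves = 241_795        # leaves no longer supported by UniProtKB
missing_reps = 2_778_067         # UniRef50 representatives absent from ProtoNet 6.1
missing_singletons = 1_972_512   # of those, UniRef50 clusters of size 1

# Assumption: supported leaves = all representatives minus the obsolete ones.
supported_leaves = representatives - obsolete_leaves

# Missing UniRef50 clusters that contain more than one protein.
missing_multi = missing_reps - missing_singletons

print(supported_leaves, missing_multi)
```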

Table S1. Statistics of protein classifications and data growth

| Version / type of classification (a) | Total proteins | Total clusters: w/o cond. | Total proteins: expanded (b) | Total clusters: w cond. | Total root clusters: w/o cond. | Total root clusters: w cond. |
|---|---|---|---|---|---|---|
| UniProt6.1 A-UP 2012 | 2236533 | 4929553 | 18887498 | 1383857 | 27103 | 132951 |
| UniProt6.0 A-UP 15.4 | 2478328 | 4929553 | 9064751 | 1383857 | 27103 | 132951 |
| UniProt5.0 A-UP 8.1 | 1864667 | 3729154 | 3188835 | 635175 | 180 | 52735 |
| UniProt4.0 A-SW41.21 | 1072911 | 295615 | | 50768 | 8947 | 11716 |
| UniProt2.0 A-SW40.28 | 114033 | 227436 | | 37391 | 630 | 3153 |
| UniProt2.0 G-SW40.28 | 114033 | 227436 | | 30065 | 630 | 2217 |
| UniProt1.0 A-SW39.15 | 94152 | 186795 | | 32975 | 1509 | 5878 |
| UniProt1.0 G-SW39.15 | 94152 | 186795 | | 26312 | 1509 | 5111 |

(a) Classification types: A, Arithmetic; G, Geometric; SW, SwissProt/UniProtKB; UP, UniProtKB.

(b) Total number of proteins following expansion of the seed proteins (UniRef50 for UP15.4 and UniRef90 for UP8.1). Clusters are counted following filtration for minimal Life Time (LT≥1.0).

Fig. S1. The number of stable clusters along the ProtoNet depth (ProtoLevel, PL). UP 15.4 (red); UP 8.1 (green); SW41.21 (blue). The X-axis (ProtoLevel) ranges from 0 (proteins as singletons) to 100 (the root of the ProtoNet tree). The graph focuses on PL values of 40-100.

3. ProtoNet clustering measurements

The agglomerative hierarchical clustering scheme defines a set of terms that are intrinsically associated with the process. In such a scheme, each cluster is created from smaller clusters, which are captured as its descendants in the clustering tree. ProtoNet is model-free (i.e., no HMMs or alignment-based PSSMs are considered). Hence, only intrinsic properties of the data and the merging process determine the features of the final ProtoNet tree.
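To make the merging process concrete, the sketch below runs a naive agglomerative clustering on a toy dissimilarity table. It assumes that the ‘Arithmetic’ strategy corresponds to merging the pair of clusters with the lowest arithmetic mean of pairwise scores (average linkage). The function name, the toy scores and the brute-force search are illustrative only and are not the ProtoNet implementation.

```python
from itertools import combinations

def average_linkage(dist, n):
    """Naive agglomerative clustering by arithmetic mean of pairwise scores.

    dist -- dict mapping frozenset({i, j}) to a dissimilarity (lower = closer)
    n    -- number of items, labelled 0 .. n-1
    Returns the list of merges as (members_a, members_b, mean_score).
    """
    def mean_dist(a, b):
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of current clusters with the lowest mean score.
        a, b = min(combinations(clusters, 2), key=lambda p: mean_dist(*p))
        merges.append((sorted(a), sorted(b), mean_dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Toy data (hypothetical): items 0-1 and 2-3 are close, everything else is far.
dist = {frozenset(p): 10.0 for p in combinations(range(4), 2)}
dist[frozenset((0, 1))] = 1.0
dist[frozenset((2, 3))] = 2.0

merges = average_linkage(dist, 4)
for step in merges:
    print(step)
```

Each merge creates one new internal node, so the order of the merges defines the tree from the leaves up to the root.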

ProtoLevel (PL) ranges from 0 to 100 and is used as a standard quantitative measure of the relative height of a cluster in the merging tree. Indirectly, the PL of a cluster reflects the global average of the sequence similarity (BLAST E-value) between proteins in the cluster. Specifically, pre-calculated E-values from an all-against-all BLAST search are used for clustering. The similarity values are collected at a very relaxed BLAST E-value threshold of 100.

The PL of the leaves of the tree is defined as 0, whereas the PL of a root equals 100. The larger the PL, the ‘later’ the merging that created the cluster took place. Therefore, the PL scale is considered an internal timer of the clustering process (Fig. S1).

The lifetime (LT) of a cluster is the difference between the PL at its termination (i.e. the time when the cluster was merged with another cluster) and the PL at its creation (i.e. the time when two clusters were merged to form the present cluster). The LT of a cluster reflects its remoteness from the clusters in its “vicinity”. Explanations of additional terms that describe the clustering process, such as depth, connectivity and compactness, are available at the ProtoNet Web site.
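As an illustration of how PL and LT relate, the sketch below computes the lifetime of every cluster in a toy merge tree and keeps those with LT≥1.0. The `Cluster` record, the cluster names and the PL values are hypothetical; leaving the root without an LT is a choice made for this example only.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Cluster:
    birth_pl: float          # ProtoLevel at which the cluster was created
    parent: Optional[str]    # cluster it is later merged into (None for a root)

def lifetimes(clusters: Dict[str, Cluster]) -> Dict[str, Optional[float]]:
    """LT = PL at termination (creation of the parent) minus PL at creation."""
    result = {}
    for name, c in clusters.items():
        if c.parent is None:
            result[name] = None   # a root is never merged; no LT assigned here
        else:
            result[name] = clusters[c.parent].birth_pl - c.birth_pl
    return result

# Toy tree: p1 + p2 -> A (PL 35.0); A + p3 -> B (PL 35.5); B + p4 -> R (PL 100).
toy = {
    "p1": Cluster(0.0, "A"),
    "p2": Cluster(0.0, "A"),
    "p3": Cluster(0.0, "B"),
    "p4": Cluster(0.0, "R"),
    "A": Cluster(35.0, "B"),
    "B": Cluster(35.5, "R"),
    "R": Cluster(100.0, None),
}

lt = lifetimes(toy)
stable = [name for name, value in lt.items() if value is not None and value >= 1.0]
print(lt["A"], lt["B"], stable)
```

In this toy tree, cluster A is absorbed almost immediately after it forms (LT 0.5) and is filtered out, whereas B persists for most of the PL scale.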

| Major annotation sources | Type | Coverage (%) SWP | Coverage (%) ProtoDB | Amount of annotations |
|---|---|---|---|---|
| ENZYME (6/09) | Func | 44 | 11 | 5,190 |
| SMART (6.0) | DOM+Fam | 19 | 14 | 720 |
| GENE3D (6.1.0) | DOM+Fam | 37 | 27 | 1,024 |
| CATH (3.2.0) | Str | 37 | 27 | 3,338 |
| SCOP (1.75) | Str | 47 | 35 | 7,821 |
| GO (1.7) | Func | 93 | 52 | 27,050 |
| UniProtKB (15.4) | Func | 99 | 63 | 949 |
| PFAM (23.0) | DOM+Fam | 92 | 73 | 10,640 |
| InterPro (21.0) | DOM+Fam | 95 | 77 | 18,638 |
| NCBI Taxonomy (6/09) | Tax | 100 | 100 | 442,867 |

Table S2: The annotation resources in ProtoDB cover functional annotations (Func), domains (DOM) and families (Fam), structure (Str), taxonomy (Tax) and more.

4. Integration of annotations

The UniProt keywords are a list of general functional terms. These keywords are based on information contained in SwissProt, TrEMBL, and PIR. InterPro (ref. 4) is a meta-annotation resource that combines 15 of the most widely used domain and family databases. We kept the original collection of databases that make up InterPro.

The major resources that are combined in the InterPro annotation scheme are the following. PROSITE is a database of protein families and domains. It is manually curated and used as a benchmark for false positive and false negative assignments. Pfam (ref. 5) is a large collection of multiple sequence alignments and hidden Markov models (HMMs) covering most protein domains and families. Pfam was used as a source of high-quality HMMs for repeats, domains and families. The PRINTS database (ref. 6) is a collection of protein fingerprints that characterize a protein family. The ProDom protein domain database (ref. 7) consists of an automatic compilation of homologous domains from a recursive PSI-BLAST search protocol. SMART (ref. 8) provides annotation for domains and their architectures. TIGRFAM (ref. 9) is a collection of protein families based on curated multiple sequence alignments and HMMs. SUPERFAMILY (ref. 10) is a library of profile HMMs that represent all proteins of known structure. The library is based on the SCOP classification (ref. 11). The Gene3D database (ref. 12) describes protein families and domain architectures in complete genomes. Gene3D unifies the HMM libraries of CATH (ref. 13) and Pfam domains. PANTHER (ref. 14) is a large collection of protein families that have been manually divided into subfamilies. HMMs are built for each family and subfamily.

| Domain based annotations | # of annotated proteins | # of mouse proteins |
|---|---|---|
| PRODOM domain | 199,830 | 689 |
| TIGRFAMs domain | 1,806,403 | 3,098 |
| PIRSF domain | 494,833 | 3,483 |
| PRINTS domain | 1,516,755 | 15,198 |
| SSF domain | 3,804,487 | 22,605 |
| SMART domain | 1,665,306 | 24,656 |
| PROSITE domain | 4,836,740 | 33,444 |
| PFAM domain | 8,945,725 | 46,467 |
| Taxonomy species | 9,063,896 | 60,605 |
| Total w/o Taxonomy | 194,312,350 | 602,836 |

Table S3. Summary of major domain based annotations associated with the mouse proteome.

5. Keyword correspondence scores

In order to measure the correspondence between a given cluster and a specific annotation, we define the notion of a correspondence score (CS). The CS for a certain cluster C and a given keyword K measures the correlation between the cluster and the keyword, using the well-known intersection-over-union ratio (Jaccard index):

$$\mathrm{CS}(C, K) = \frac{|c \cap k|}{|c \cup k|} = \frac{TP}{TP + FP + FN}$$

where c is the set of annotated proteins in cluster C, k is the set of proteins annotated with K, and TP, FP and FN stand for true positives, false positives and false negatives, respectively. TP is the number of proteins in cluster C that have keyword annotation K; FP is the number of annotated proteins in cluster C that do not have keyword annotation K; FN is the number of proteins not in cluster C that have keyword annotation K. The cluster receiving the maximal score for keyword K is considered the cluster that best represents K within the ProtoNet tree. The score for a given cluster on keyword K ranges from 0 (no correspondence) to 1 (the cluster contains exactly all of the proteins with keyword K, i.e. maximally corresponds to the keyword). For a formal definition see ref. 15.
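The definition of the CS translates directly into set operations. The following is a minimal sketch under the definitions given in this section; the function name, the argument names and the protein identifiers are hypothetical.

```python
def correspondence_score(cluster, annotated, keyword_proteins):
    """Intersection-over-union between a cluster and a keyword.

    cluster          -- protein IDs in cluster C
    annotated        -- protein IDs that carry any annotation from the source
    keyword_proteins -- protein IDs annotated with keyword K
    """
    c = set(cluster) & set(annotated)   # c: annotated proteins in C
    k = set(keyword_proteins)           # k: proteins annotated with K
    tp = len(c & k)                     # in the cluster and carry K
    fp = len(c - k)                     # in the cluster, annotated, lack K
    fn = len(k - c)                     # carry K but lie outside the cluster
    total = tp + fp + fn
    return tp / total if total else 0.0

# Toy example: P4 is in the cluster but has no annotation, so it is ignored.
cluster = {"P1", "P2", "P3", "P4"}
annotated = {"P1", "P2", "P3", "P5", "P6"}
keyword = {"P1", "P2", "P5"}

score = correspondence_score(cluster, annotated, keyword)
print(score)  # TP=2, FP=1, FN=1 -> 0.5
```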

For annotation keywords from several external sources, we define the cluster with the best CS for each keyword as the best cluster for this keyword. The sources used for defining the best clusters and their CS values are: InterPro (families, domains and others), Pfam, SCOP (fold, superfamily, family and domain levels), GO (in 3 categories: molecular function, biological process and cellular component) and ENZYME (4 levels of the EC hierarchy).
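Selecting the best cluster for each keyword amounts to an argmax over clusters. The sketch below is a brute-force illustration on toy data in which every protein is assumed to be annotated; the cluster IDs, keywords and helper names are hypothetical, and a real tree with millions of clusters would need a more efficient search than this exhaustive loop.

```python
def jaccard(a, b):
    """Intersection-over-union of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_clusters(clusters, keywords):
    """Map each keyword to (best cluster ID, its score)."""
    best = {}
    for kw, proteins in keywords.items():
        cid = max(clusters, key=lambda c: jaccard(clusters[c], proteins))
        best[kw] = (cid, jaccard(clusters[cid], proteins))
    return best

clusters = {
    "A1": {"P1", "P2"},
    "A2": {"P1", "P2", "P3"},   # parent of A1 in the toy tree
    "A3": {"P4", "P5"},
}
keywords = {
    "kinase": {"P1", "P2", "P3"},
    "transporter": {"P4"},
}

best = best_clusters(clusters, keywords)
print(best)
```

Because clusters are nested, a keyword is scored against every level of the tree, and the level that balances false positives against false negatives wins.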

6. Clusters stability, pruning protocol and expanded clusters

Our measure of the inherent stability of a cluster is its Life Time (LT). The LT is the difference between the time (i.e. merging step) at which a cluster was created and the time at which it was merged into a larger cluster. This value reflects the relative height of a cluster in the merging tree (Figure 1B). The level of the tree (ProtoLevel, PL) is an additional internal monotonic timer for merging along the clustering process. The LT and PL are combined for the purpose of tree pruning. Pruning the ProtoNet 6.0 tree at LT and PL thresholds of 1.0 and 90, respectively, resulted in 162,088 high-quality stable clusters. The pruning protocol yielded a 30-fold compression relative to the roughly 5 million clusters (including leaves) generated prior to pruning.

The ProtoNet tree is constructed using the representative proteins from UniRef50. UniRef50 comprises about 2.5 million representative proteins with the property that every protein shares at least 50% sequence identity with at least one representative protein. For each cluster, the expanded list of proteins from the complete UniProtKB is provided. On average, there is a 4.5-fold expansion from UniRef50 to the full UniProtKB list. Thus, about 10^7 proteins make up the expanded view of ProtoNet 6.0 (ref. 3).
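The expansion step can be pictured as replacing every representative in a cluster by the members of its UniRef50 group. The sketch below assumes a simple in-memory mapping from representative to members; the mapping, the identifiers and the function name are hypothetical and do not describe the ProtoNet data model.

```python
def expand_cluster(representatives, members_of):
    """Replace each representative by all members of its UniRef50 group."""
    expanded = set()
    for rep in representatives:
        # A representative without a recorded group expands to itself.
        expanded |= members_of.get(rep, {rep})
    return expanded

# Hypothetical UniRef50 groups: representative -> all member proteins.
members_of = {
    "R1": {"R1", "X1", "X2"},
    "R2": {"R2"},
    "R3": {"R3", "X3", "X4", "X5", "X6"},
}

cluster = ["R1", "R2", "R3"]
expanded = expand_cluster(cluster, members_of)
fold = len(expanded) / len(cluster)
print(len(expanded), fold)
```

The fold expansion of a single cluster can differ widely from the database-wide average, since UniRef50 group sizes are very uneven.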

7. Performance by external experts

ProtoNet 6.0 provides a nested tree with about 150,000 stable clusters. Because ProtoNet is an unsupervised platform, we assessed the quality of the clustering process against external annotations. At a ProtoLevel of >40, a drop in the number of clusters (containing at least two proteins) along the progression of the ProtoNet tree reflects the merging of pre-existing clusters and the establishment of larger clusters. The quality of the mergers throughout the clustering protocol is tested using the keyword Correspondence Score (CS).

The quality of the stable clusters (at a PL>1.0) is illustrated by testing the CS for all the families and domains from Pfam (12,000) and InterPro (18,000). The integrated version of InterPro covers about 80% of the proteins. About 2/3 of these keywords are included in the analysis using a minimal cutoff of ≥20 proteins in each cluster.

Fig. S2 shows that the quality, as measured by the CS, is stable throughout the entire range of the ProtoLevel / Birth time of the ProtoNet tree. This trend is valid for Pfam and InterPro. The average quality of Pfam clusters as measured by CS is 0.91 (Fig. 1D), and the average CS for InterPro is 0.8.

Fig. S2. Quality of ProtoNet clusters (≥20 proteins) according to InterPro families and domains. The plot axis is labelled Birth Time (time of merging). The dashed white line shows CS=0.5. Note that throughout the Birth Time range (0-1.0), the CS remains at 0.8. Hence, the high quality of the ProtoNet tree in view of InterPro is valid for all the levels of the tree.

8. Genomic perspective on the protein clusters

The number of organisms covered by UniProtKB is huge (Table S2). Nevertheless, a third of the protein sequences originate from a relatively small number of organisms that were completely sequenced (annotated as a complete proteome). Multi-cellular organisms that serve as genetic model organisms are included in the genome view.

ProtoNet 6.0 supports over 30 organisms from all superkingdoms. Specifically, D. melanogaster, C. elegans, human, honeybee, mouse and more are organized in a taxonomy tree. Selecting any node from the organisms’ tree returns the clusters that include proteins from the selected taxonomical level.

| Annotation type | # unique proteins in DB | # of proteins in mouse |
|---|---|---|
| EC - x | 1,025,565 | 4,170 |
| EC - x.x | 1,016,337 | 4,082 |
| EC - x.x.x | 1,001,790 | 3,985 |
| EC - x.x.x.x | 893,537 | 3,025 |
| Tax species | 9,063,869 | 60,605 |
| Tax genus | 8,799,481 | 60,605 |
| Tax family | 8,408,876 | 60,605 |
| Tax order | 8,047,843 | 60,605 |
| Tax class | 7,414,051 | 60,605 |
| Tax phylum | 7,981,148 | 60,605 |
| Tax kingdom | 2,380,779 | 60,605 |
| Tax superkingdom | 8,272,660 | 60,605 |
| SMART domain | 1,306,672 | 24,656 |
| PRINTS domain | 1,321,388 | 15,198 |
| PFAM domain | 6,582,986 | 46,467 |
| PROSITE domain | 3,165,669 | 33,444 |
| PRODOM domain | 190,503 | 689 |
| TIGRFAMs domain | 1,657,561 | 3,098 |
| PIRSF domain | 494,833 | 3,483 |
| SSF domain | 3,137,662 | 22,605 |
| PANTHER domain | 1,445,866 | 16,112 |
| GENE3D domain | 2,434,880 | 19,293 |
| PFAM CLANS domain | 3,438,771 | 27,588 |
| InterPro Domain | 4,886,673 | 35,843 |
| InterPro Family | 3,905,328 | 23,414 |
| InterPro Conserved_site | 828,822 | 9,314 |
| InterPro Active_site | 257,406 | 2,496 |
| InterPro Repeat | 220,350 | 4,452 |
| InterPro Region | 204,357 | 4,993 |
| InterPro Binding_site | 170,864 | 2,480 |
| InterPro PTM | 48,201 | 623 |
| UniProt keyword | 5,667,975 | 33,365 |
| GO molecular function | 4,135,974 | 40,941 |
| GO biological process | 3,516,315 | 35,457 |
| GO cellular component | 2,333,260 | 36,693 |
| CATH class | 2,432,181 | 19,280 |
| CATH architecture | 2,432,181 | 19,280 |
| CATH topology | 2,430,414 | 19,278 |
| CATH homologous superfamily | 2,424,415 | 19,238 |
| SCOP class | 3,135,943 | 22,598 |
| SCOP fold | 3,135,943 | 22,598 |
| SCOP superfamily | 3,135,943 | 22,598 |
| Total | 143,849,828 | 1,148,281 |
| Total w/o Taxonomy | 74,416,565 | 602,836 |
| Avg. / protein | 82 | 9.2 |

Table S4. Annotation types associated with all proteins (an expanded set, >9 million proteins).

9. Navigation in the ProtoNet family tree

A full summary of the ProtoNet clusters is beyond the scope of this correspondence. A list of the proteins analyzed in the example clusters is shown below.

Table S5. ProtoNet clusters with maximal Correspondence Score (CS) values for thousands of annotations are coined ‘Best clusters’. Several filters were applied to present clusters with high confidence for annotation inference. The filters used include (i) average length ≤300; (ii) cluster size ≥10; (iii) the number of proteins that are subject to inference >0; (iv) the fraction of annotation-inferred proteins ≤50%. There are 2,069 annotations that match the combination of the selected filters. These annotations come from a variety of sources that are supported by ProtoNet (Fig. 2A).

| Keyword Type | Number of known proteins | Number of predicted proteins |
|---|---|---|
| CATH homologous superfamily | 4,818 | 717 |
| CATH topology | 3,155 | 453 |
| EC - all levels | 2,224 | 713 |
| GO biological process | 6,375 | 2,000 |
| GO cellular component | 1,708 | 504 |
| GO molecular function | 5,639 | 1,815 |
| InterPro Domain | 16,682 | 2,827 |
| InterPro Family | 36,287 | 4,947 |
| PFAM domain | 58,479 | 6,941 |
| SCOP superfamily | 5,647 | 395 |
| UniProt keyword | 3,586 | 1,266 |
| Total | 144,600 | 22,578 |

Total number of clusters: 2,069
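The four filters listed in the Table S5 caption can be written as a single predicate. The sketch below is illustrative: the record fields and cluster IDs are hypothetical, and the fraction in filter (iv) is assumed to be the number of proteins subject to inference divided by the cluster size, which the text does not spell out.

```python
from dataclasses import dataclass

@dataclass
class BestCluster:
    cluster_id: str
    avg_length: float   # average protein length in the cluster
    size: int           # number of proteins in the cluster
    n_inferred: int     # proteins lacking the annotation (inference candidates)

def passes_filters(c: BestCluster) -> bool:
    return (
        c.avg_length <= 300                 # (i) average length
        and c.size >= 10                    # (ii) cluster size
        and c.n_inferred > 0                # (iii) something to infer
        and c.n_inferred / c.size <= 0.5    # (iv) assumed definition of the fraction
    )

candidates = [
    BestCluster("X1", 120.0, 40, 5),    # passes all four filters
    BestCluster("X2", 450.0, 40, 5),    # fails (i): proteins too long
    BestCluster("X3", 120.0, 8, 2),     # fails (ii): cluster too small
    BestCluster("X4", 120.0, 40, 0),    # fails (iii): nothing to infer
    BestCluster("X5", 120.0, 40, 30),   # fails (iv): 75% would be inferred
]

selected = [c.cluster_id for c in candidates if passes_filters(c)]
print(selected)
```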

Table S6. The 2,069 annotations of the ‘Best clusters’ are associated with 1,082 unique clusters. A detailed list of the clusters and their related attributes (number of proteins, false positives, false negatives, CS values, fraction of inference, ProtoLevel and more) is shown. The numbers of proteins listed in the table refer to UniRef50 representatives. On average, the expanded list of proteins is 7.8-fold larger. The average CS for all the Best clusters is very high (0.89).

Table S7. UniProtKB accession numbers of proteins from cluster A4768953 (98.5% from bacteria). The cluster is created at PL 96.9 (and LT≥1.0). There are 121 UniRef50 representatives that account for 371 sequences in ProtoNet 6.0.

A0AY61_BURCH, A0AYP1_BURCH, A0B523_BURCH, A0KDK2_BURCH, A0YUS0_9CYAN,

A0YXM6_9CYAN, A0ZIK3_NODSP, A1AQR7_PELPD, A1ASA5_PELPD, A1ATP6_PELPD,

A1IQ44_NEIMA, A1KS43_NEIMF, A1U426_MARAV, A1URJ7_BARBK, A1VDE5_DESVV,

A1VDJ0_DESVV, A1VID2_POLNA, A1VUI2_POLNA, A1VUU8_POLNA, A1VX41_POLNA,

A1WA18_ACISJ, A1WDJ6_ACISJ, A1WN69_VEREI, A1WN70_VEREI, A2FQI0_TRIVA,

A2S771_BURM9, A2W067_9BURK, A2W0N7_9BURK, A2WHL8_9BURK, A3ETU0_9BACT,

A3EUI8_9BACT, A3EUT8_9BACT, A3EUW0_9BACT, A3IMU5_9CHRO, A3ITS1_9CHRO,

A3IVA6_9CHRO, A3JL43_9ALTE, A3MPS7_BURM7, A3N343_ACTP2, A3NQ10_BURP0,

A3RSU1_RALSO, A3RVF0_RALSO, A3T2F0_9RHOB, A3T2P0_9RHOB, A3U3I6_9RHOB,

A3VJ15_9RHOB, A3XE30_9RHOB, A3YBX4_9GAMM, A3YXB9_9SYNE, A3YYN4_9SYNE,

A4BMI2_9GAMM, A4BMU0_9GAMM, A4G757_HERAR, A4JLJ4_BURVG, A4JWV1_9CAUD,

A4MHG7_BURPS, A4NIH0_HAEIN, A4P1H1_HAEIN, A4SDW1_PROVI, A4SDW9_PROVI,

A4SMF9_AERS4, A4TNW4_YERPP, A4W5D4_ENT38, A4YYF9_BRASO, A5EB55_BRASB,


A5EWA6_DICNV, A5G8E7_GEOUR, A5THN4_BURMA, A5VYQ8_PSEP1, A5VZX2_PSEP1,

A5ZQ34_9FIRM, A6BTF5_YERPE, A6GBU6_9DELT, A6GLV3_9BURK, A6GSB6_9BURK,

A6VNM7_ACTSZ, A6VRK4_MARMS, A6VWS7_MARMS, A6W1R6_MARMS, A7BLG0_9GAMM,

A7BNL4_9GAMM, A7C8Q0_BURPI, A7DVA4_VIBVU, A7HV43_PARL1, A7J7Y1_PBCVF,

A7JT83_PASHA, A7JWP4_PASHA, A7JWT6_PASHA, A7K8S7_9PHYC, A7MAH2_PSEAE,

A7ZDD6_CAMC1, A8F2Y5_RICM5, A8GU01_RICRS, A8GV95_RICB8, A8YEU1_MICAE,

A8YFI8_MICAE, A8ZL86_ACAM1, A8ZNL1_ACAM1, A8ZRU4_DESOH, A8ZUV6_DESOH,

A8ZZ23_DESOH, A9AP44_BURM1, A9BR82_DELAS, A9BSW9_DELAS, A9BUI5_DELAS,

A9BZS8_DELAS, A9EAX3_9RHOB, A9INC9_BART1, A9IPK2_BART1, A9ITS5_BART1,

A9IY58_BART1, A9K0S7_BURMA, A9KH51_COXBN, A9L145_SHEB9, A9M1H7_NEIM0,

A9MEC2_BRUC2, A9MYW2_SALPB, A9QS52_PSESX, A9R2C7_YERPG, A9VY58_METEP,

A9WY66_BRUSI, A9ZA21_YERPE, A9ZV04_YERPE, B0BH72_9BACT, B0BSH4_ACTPJ,

B0BVJ3_RICRO, B0C1R5_ACAM1, B0C9N6_ACAM1, B0CDJ2_ACAM1, B0CF50_ACAM1,

B0GF09_YERPE, B0GPY3_YERPE, B0H3V1_YERPE, B0HEH9_YERPE, B0HZ49_YERPE,

B0J3J2_RHILT, B0JHD9_MICAN, B0JHE9_MICAN, B0JNS5_MICAN, B0JUA6_MICAN,

B0JUV0_MICAN, B0KS04_PSEPG, B0QVD2_HAEPR, B0QX65_HAEPR, B0SXW2_CAUSK,

B0USW2_HAES2, B0UW55_HAES2, B1FAH8_9BURK, B1FFX5_9BURK, B1GB48_9BURK,

B1K4X1_BURCC, B1K610_BURCC, B1MA45_METRJ, B1SYC7_9BURK, B1T955_9BURK,

B1VJ81_PROMH, B1YZV0_BURA4, B1Z0V9_BURA4, B1ZAK5_METPB, B1ZHK8_METPB,

B1ZJU5_METPB, B1ZLX6_METPB, B2C6T4_ACIBA, B2FDC5_RHIME, B2FJ08_STRMK,

B2GZV1_BURPS, B2I5N4_XYLF2, B2IAQ2_XYLF2, B2TE06_BURPP, B2TSF9_SHIB3,

B2U639_ECOLX, B2UHE0_RALPJ, B2VAY8_ERWT9, B3DQN4_BIFLD, B3E764_GEOLS,

B3ECU4_CHLL2, B3EMQ3_CHLPB, B3EMQ6_CHLPB, B3GYY1_ACTP7, B3HFC0_ECOLX,

B3ISX8_ECOLX, B3JYH8_9DELT, B3PFX8_CELJU, B3PFX9_CELJU, B3R4X4_CUPTR,

B3R659_CUPTR, B3X774_SHIDY, B4BWK0_9CHRO, B4EL47_BURCJ, B4EML7_BURCJ,

B4F1A0_PROMH, B4RNY5_NEIG2, B4SE49_PELPB, B4SF38_PELPB, B4ST99_STRM5,

B4VPB9_9CYAN, B4WDE7_9CAUL, B5C0W3_SALET, B5EJ13_GEOBB, B5ERP5_ACIF5,

B5FD56_VIBFM, B5FHK8_SALDC, B5N208_SALET, B5PE14_SALET, B5PPY4_SALHA,

B5RZ11_RALSO, B5SLX0_RALSO, B5WP46_9BURK, B5XNN8_KLEP3, B6AKX5_9BACT,

B6ANJ3_9BACT, B6C3Q3_9GAMM, B6C438_9GAMM, B6IYD2_RHOCS, B6X043_9ENTR,

B7GQ85_BIFLI, B7JAX0_ACIF2, B7KPZ4_METC4, B7KSA8_METC4, B7LJE8_ECOLU,

B7YG92_VARPD, B7YTG6_VARPD, B8F4N1_HAEPS, B8F7F6_HAEPS, B8GSA3_THISH,

B8GTV7_THISH, B8GUV3_THISH, B8H360_CAUCN, B8ITM9_METNO, B8KUY3_9GAMM,

B8LA05_9GAMM, B8LA19_9GAMM, B9B269_9BURK, B9BKG6_9BURK, B9C4H2_9BURK,

B9DAL1_9GAMM, B9JFK9_AGRRK, B9JPN3_AGRRK, B9K3H8_AGRVS, B9NWT0_9RHOB,

C0BSI9_9BIFI, C0G9V0_9RHIZ, C0GTU2_9DELT, C0GUR9_9DELT, C0Q4S6_SALPC,

C0QI01_DESAH, C0VCW4_9MICO, C0YUD1_9FLAO, C1ADF9_9BACT, C1DE86_AZOVD,

C1DR99_AZOVD, C1HVF5_NEIGO, C1SPK8_9BACT, C1T3N5_DESBA, C1XQ07_9DEIN,

C2A215_SULDE, C2BV60_9ACTO, C2CUD4_GARVA, C2CUJ3_GARVA, C2KQK6_9ACTO,

C2KR20_9ACTO, C2LIN4_PROMI, C3K003_PSEFL, C3KFQ9_PSEFL, C3WDL4_FUSMR,

Q01YS7_SOLUE, Q02RQ2_PSEAB, Q05HP1_XANOR, Q07IW4_RHOP5, Q0AF37_NITEC,

Q0B469_BURCM, Q0B4P3_BURCM, Q0CZA0_ASPTN, Q0KBP5_RALEH, Q11N66_MESSB,

Q131N1_RHOPS, Q138H7_RHOPS, Q13FU3_BURXL, Q13LE6_BURXL, Q1BLP1_BURCA,

Q1BW04_BURCA, Q1CAI7_YERPA, Q1CFJ4_YERPN, Q1I6T7_PSEE4, Q1IB13_PSEE4,

Q1IU34_ACIBL, Q1NJK5_9DELT, Q1NP18_9DELT, Q1NR36_9DELT, Q1QF67_NITHX,

Q1QFT4_NITHX, Q1RHK4_RICBR, Q1XGP3_PSEPU, Q214U9_RHOPB, Q21ZJ2_RHOFD,

Q2ISG9_RHOP2, Q2IT41_RHOP2, Q2IUV2_RHOP2, Q2RPR8_RHORT, Q2RUT9_RHORT,

Q2RX84_RHORT, Q2SIZ8_HAHCH, Q2TR62_ACIBA, Q2W3Q7_MAGMM, Q315U5_DESDG,

Q392I8_BURS3, Q393E1_BURS3, Q39SL9_GEOMG, Q3AR07_CHLCH, Q3B1U1_PELLD,

Q3BTL9_XANC5, Q3JDA3_NITOC, Q3JDK7_NITOC, Q3JDX3_NITOC, Q3KEQ6_PSEPF,

Q3KIR2_PSEPF, Q3KJ92_PSEPF, Q3QZ61_XYLFA, Q3R7W4_XYLFA, Q3RFI8_XYLFA,

Q48B36_PSE14, Q4BUP6_CROWT, Q4BUQ5_CROWT, Q4BVR2_CROWT, Q4KIY9_PSEF5,

Q4UJX2_RICFE, Q5NZP9_AZOSE, Q5QWX9_IDILO, Q5X039_LEGPL, Q63LE9_BURPS,

Q63YL3_BURPS, Q6ALJ0_DESPS, Q6G4P2_BARHE, Q6J5I5_HAEIN, Q6MCI0_PARUW,

Q6N367_RHOPA, Q6W4R6_VIBAN, Q6ZEG0_SYNY3, Q70W78_YEREN, Q74CX9_GEOSL,

Q7CH33_YERPE, Q7MBL1_VIBVY, Q7N237_PHOLL, Q7N3K9_PHOLL, Q7N4L9_PHOLL,

Q7N6M0_PHOLL, Q7N7P7_PHOLL, Q7N7R2_PHOLL, Q7N8S2_PHOLL, Q7X1K1_9BACT,

Q820L9_NITEU, Q840E6_9GAMM, Q847G4_PSEPU, Q879U0_XYLFT, Q87CA7_XYLFT,

Q87UD0_PSESM, Q88NG8_PSEPK, Q88PM1_PSEPK, Q89KB2_BRAJA, Q8FLW1_COREF,

Q8G4Q3_BIFLO, Q8VMN8_PSEPU, Q8XTN0_RALSO, Q8YUE9_ANASP, Q926N9_LISIN,

Q9A407_CAUCR, Q9P9V8_XYLFA, Q9PCS6_XYLFA, Q9PHG2_XYLFA, Q9XAX6_PSEAC,

Y1420_HAEIN


Fig. S3. A summary page for ProtoNet clusters A4654740 (98 proteins) and A4768953 (112 proteins, Fig. 2B). Statistical details for each cluster are shown in a table format. PL=100 reflects a root cluster. Cluster A4654740 was assigned as the ‘Best cluster’ for the InterPro keyword ‘Addiction module antidote protein, HI1420’ (CS=1.0), and thus this keyword is assigned as its ProtoName. In the ProtoNet database, there are 53 UniRef50 representative proteins that are annotated by this keyword, all of which are included in this cluster.

Table S8. ProtoNet 6.1 view of the expanded cluster A4768953. The 121 representative UniRef50 sequences are expanded to 1,022 sequences according to ProtoNet 6.1.

| Cluster ID | Cluster name | Size | Length |
|---|---|---|---|
| UniRef50_A0YUS0 | Putative uncharacterized protein | 1 | 150 |
| UniRef50_A0YXM6 | Putative uncharacterized protein | 2 | 104 |
| UniRef50_A1AQR7 | Putative transcriptional regulator | 1 | 118 |
| UniRef50_A1VX41 | Putative transcriptional regulator | 3 | 106 |
| UniRef50_A2FQI0 | Putative uncharacterized protein | 1 | 102 |
| UniRef50_A3IMU5 | Putative uncharacterized protein | 5 | 96 |
| UniRef50_A3T2P0 | Putative uncharacterized protein | 25 | 106 |
| UniRef50_A4BMI2 | Putative uncharacterized protein | 1 | 108 |
| UniRef50_A4G757 | Putative uncharacterized protein | 1 | 62 |
| UniRef50_A4YAZ9 | Fimbrial protein pilin | 5 | 88 |
| UniRef50_A5EB55 | Putative uncharacterized protein | 1 | 79 |
| UniRef50_A5EWA6 | Putative uncharacterized protein | 1 | 252 |
| UniRef50_A5ZQ34 | Uncharacterized protein | 1 | 97 |
| UniRef50_A6GBU6 | Putative uncharacterized protein | 1 | 74 |
| UniRef50_A6VWS7 | Putative transcriptional regulator | 1 | 101 |
| UniRef50_A6W1R6 | Putative transcriptional regulator | 1 | 275 |
| UniRef50_A7BLG0 | Transcriptional regulator | 5 | 89 |
| UniRef50_A7BNL4 | Putative uncharacterized protein | 1 | 81 |
| UniRef50_A7DVA4 | Transcriptional regulator, Cro/CI family | 1 | 70 |
| UniRef50_A7HV43 | Putative transcriptional regulator | 5 | 97 |
| UniRef50_A7J7Y1 | Putative uncharacterized protein n627R | 14 | 107 |
| UniRef50_A7JT83 | Possible transcriptional regulator | 3 | 100 |
| UniRef50_A7K8S7 | Putative uncharacterized protein z317L | 1 | 40 |
| UniRef50_A8F2Y5 | Transcriptional regulator | 1 | 78 |
| UniRef50_A8ZNL1 | Putative uncharacterized protein | 7 | 97 |
| UniRef50_A8ZRU4 | Putative transcriptional regulator | 5 | 97 |
| UniRef50_A8ZUV6 | Putative transcriptional regulator | 19 | 86 |
| UniRef50_A9BR82 | Putative uncharacterized protein | 2 | 67 |
| UniRef50_A9KH51 | Putative uncharacterized protein | 3 | 70 |
| UniRef50_A9L145 | Putative uncharacterized protein | 1 | 107 |
| UniRef50_B0C1R5 | Addiction module antitoxin, putative | 1 | 90 |
| UniRef50_B0CDJ2 | Putative uncharacterized protein | 1 | 76 |
| UniRef50_B0JHD9 | Putative uncharacterized protein | 1 | 106 |
| UniRef50_B0SXW2 | Putative transcriptional regulator | 14 | 92 |
| UniRef50_B1ZAK5 | Putative transcriptional regulator | 1 | 99 |
| UniRef50_B2FDC5 | Putative partial transcriptional regulator protein | 2 | 63 |
| UniRef50_B3X774 | DNA-binding prophage protein | 1 | 125 |
| UniRef50_B4VPB9 | Putative uncharacterized protein | 1 | 94 |
| UniRef50_B4WDE7 | Probable addiction module antidote protein | 4 | 94 |
| UniRef50_B5PPY4 | Putative uncharacterized protein | 4 | 92 |
| UniRef50_B6IYD2 | Putative uncharacterized protein | 2 | 44 |
| UniRef50_B8H360 | Putative transcriptional regulator, antitoxin protein higA | 1 | 98 |
| UniRef50_B8ITM9 | Putative transcriptional regulator | 4 | 107 |
| UniRef50_B9JFK9 | Transcriptional regulator | 7 | 111 |
| UniRef50_B9JPN3 | Putative uncharacterized protein | 2 | 69 |
| UniRef50_C0BSI9 | Putative uncharacterized protein | 8 | 148 |
| UniRef50_C1ADF9 | Uncharacterized protein | 2 | 133 |
| UniRef50_C2KR20 | Possible transcriptional regulator | 1 | 43 |
| UniRef50_C3WDL4 | Putative uncharacterized protein | 1 | 121 |
| UniRef50_C5D1V6 | Addiction module antidote protein | 1 | 100 |
| UniRef50_C6AZF6 | Addiction module antidote protein | 1 | 68 |
| UniRef50_C6DYU8 | Addiction module antidote protein | 1 | 67 |
| UniRef50_C9T0K8 | Predicted protein | 1 | 106 |
| UniRef50_D0SF18 | Putative uncharacterized protein | 7 | 98 |
| UniRef50_D1BRK1 | Addiction module antidote protein | 6 | 99 |
| UniRef50_D2UFF5 | Putative uncharacterized protein | 2 | 52 |
| UniRef50_D6SJR1 | Addiction module antidote protein | 48 | 116 |
| UniRef50_D8F2I0 | Toxin-antitoxin system, antitoxin component, ribbon-helix-helix domain protein | 11 | 97 |
| UniRef50_E0NB55 | XRE family transcriptional regulator | 16 | 98 |
| UniRef50_E1VPC2 | Predicted transcriptional regulator | 5 | 390 |
| UniRef50_E2STQ7 | Toxin-antitoxin system, antitoxin component, Xre family | 116 | 174 |
| UniRef50_E2XVJ2 | Transcriptional regulator | 34 | 103 |
| UniRef50_E3HHX7 | Transcriptional regulator 3 | 6 | 106 |
| UniRef50_E6PN15 | Uncharacterized protein | 20 | 118 |
| UniRef50_E6QCN5 | Uncharacterized protein | 13 | 106 |
| UniRef50_E6QCQ8 | Uncharacterized protein | 2 | 100 |
| UniRef50_E6X9K7 | Addiction module antidote protein | 13 | 100 |
| UniRef50_E8X7N4 | Addiction module antidote protein | 12 | 108 |
| UniRef50_F3INW7 | Uncharacterized protein | 5 | 101 |
| UniRef50_F3L0G0 | Addiction module antidote protein | 4 | 125 |
| UniRef50_F6AX15 | Putative uncharacterized protein | 4 | 109 |
| UniRef50_F6BME8 | Addiction module antidote protein | 11 | 164 |
| UniRef50_F7ND47 | Transcriptional regulator | 13 | 99 |
| UniRef50_F8GGJ2 | Addiction module antidote protein | 4 | 105 |
| UniRef50_F8LFL8 | Uncharacterized protein HI_1420 | 2 | 94 |
| UniRef50_G4E1Q1 | Addiction module antidote protein | 35 | 106 |
| UniRef50_G6XQZ4 | Addiction module antidote protein | 32 | 121 |
| UniRef50_G7CR59 | Putative uncharacterized protein | 13 | 100 |
| UniRef50_G7HEF7 | Addiction module antidote protein | 3 | 84 |
| UniRef50_G8Q9N4 | Transciptional regulator | 36 | 196 |
| UniRef50_H0HXW3 | Addiction module antidote protein | 11 | 108 |
| UniRef50_H1KHC6 | Addiction module antidote protein | 2 | 105 |
| UniRef50_H1LAV3 | Addiction module antidote protein | 2 | 182 |
| UniRef50_H1NMJ8 | DNA replication and repair protein RecF | 7 | 105 |
| UniRef50_H5WLK2 | Thiol-disulfide isomerase-like thioredoxin | 27 | 131 |
| UniRef50_H6SIS2 | Predicted transcriptional regulator | 8 | 129 |
| UniRef50_I0INU3 | Uncharacterized protein | 30 | 106 |
| UniRef50_I1WF00 | Addiction module antidote protein | 5 | 102 |
| UniRef50_I2Y9I5 | Toxin-antitoxin system, antitoxin component, Xre family | 3 | 101 |
| UniRef50_I4FCC2 | Uncharacterized protein | 2 | 97 |
| UniRef50_I4FRG9 | Uncharacterized protein | 2 | 104 |
| UniRef50_I4XN50 | Uncharacterized protein | 1 | 95 |
| UniRef50_I6HTI1 | Uncharacterized protein | 2 | 187 |
| UniRef50_J0QIC2 | Uncharacterized protein | 13 | 177 |
| UniRef50_J1A1Y8 | Uncharacterized protein | 1 | 305 |
| UniRef50_J1P6H9 | Uncharacterized protein | 1 | 221 |
| UniRef50_J2P562 | Uncharacterized protein | 1 | 107 |
| UniRef50_P44191 | Uncharacterized protein HI_1420 | 1 | 168 |
| UniRef50_Q01YS7 | Putative uncharacterized protein | 1 | 183 |
| UniRef50_Q0AF37 | Putative transcriptional regulator | 16 | 108 |
| UniRef50_Q0CZA0 | Predicted protein | 10 | 120 |
| UniRef50_Q11N66 | Putative uncharacterized protein | 37 | 107 |
| UniRef50_Q138H7 | Transcriptional regulator-like | 4 | 104 |
| UniRef50_Q1IB13 | Putative uncharacterized protein | 18 | 106 |
| UniRef50_Q1IU34 | Putative uncharacterized protein | 9 | 94 |
| UniRef50_Q1QFT4 | Putative uncharacterized protein | 10 | 126 |
| UniRef50_Q21ZJ2 | Putative uncharacterized protein | 84 | 103 |
| UniRef50_Q2ISG9 | Putative transcriptional regulator | 3 | 99 |
| UniRef50_Q2IUV2 | Putative uncharacterized protein | 18 | 105 |
| UniRef50_Q2W3Q7 | Putative uncharacterized protein | 1 | 86 |
| UniRef50_Q315U5 | Addiction module antidote protein | 1 | 91 |
| UniRef50_Q3BTL9 | Putative uncharacterized protein | 1 | 188 |
| UniRef50_Q3JDK7 | Transcriptional regulator, Cro/CI family | 1 | 69 |
| UniRef50_Q4BUP6 | Similar to transcriptional regulator | 5 | 81 |
| UniRef50_Q4UJX2 | Predicted transcriptional regulator | 2 | 185 |
| UniRef50_Q5X039 | Putative uncharacterized protein | 18 | 78 |
| UniRef50_Q6ALJ0 | Putative uncharacterized protein | 1 | 100 |
| UniRef50_Q70W78 | Putative uncharacterized protein ORF7 | 1 | 112 |
| UniRef50_Q7N237 | Similar to hypothetical protein of Haemophilus influenzae | 1 | 117 |
| UniRef50_Q820L7 | Possible phoP; Response regulators consisting of a CheY-like receiver domain and a HTH DNA-binding domain | 3 | 229 |
| UniRef50_Q89KB2 | Blr4995 protein | 2 | 100 |
| UniRef50_Q8YUE9 | Asl2401 protein | 1 | 98 |
| UniRef50_Q926N9 | Pli0013 protein | 1 | 173 |

Table S9. UniProtKB accession numbers of proteins from the root cluster A4741039 (65 proteins, Fig. 2D). Cluster name: Protein of unknown function DUF148. The cluster includes all 47 proteins of DUF148 (PF02520, for UniRef50) and is thus associated with CS=1.0. An additional 18 proteins in this cluster lack any InterPro / Pfam annotations. Recall that the DUF148 family has no known function, nor do any of the proteins that possess it. Still, all the proteins in the cluster belong to a diverse collection of nematodes (some of which cause severe human diseases). Inspection of this cluster suggests that these proteins may act as surface antigens and may activate the host immune response. The expanded cluster (UniRef100) comprises 128 sequences.

A0DWN3_PARTE, A1IKL2_ANISI, A1Z6D0_CAEEL, A7M6T4_ANISI,
A8NVF8_BRUMA, A8Q0T0_BRUMA, A8Q3S4_BRUMA, A8Q4D7_BRUMA,
A8R0N5_BRUPA, A8WY92_CAEBR, A8WYA2_CAEBR, A8X0P8_CAEBR,
A8X193_CAEBR, A8X194_CAEBR, A8X6I6_CAEBR, A8XIZ6_CAEBR,
A8XLL8_CAEBR, A8XTF6_CAEBR, A8XZS0_CAEBR, A8Y4D4_CAEBR,
B0ZBB0_9PELO, B2XCP1_ANISI, B6IH84_CAEBR, B6VBW3_9PELO,
O01674_ACAVI, O17021_CAEEL, O17974_CAEEL, O45098_CAEEL,
O45347_CAEEL, O76573_CAEEL, OV17_ONCVO, OV39_ONCVO, P91548_CAEEL,
Q18280_CAEEL, Q18577_CAEEL, Q19406_CAEEL, Q19414_CAEEL,
Q1W204_ANCCA, Q20202_CAEEL, Q20998_CAEEL, Q21588_CAEEL,
Q22237_CAEEL, Q22537_CAEEL, Q23545_CAEEL, Q23614_CAEEL,
Q23615_CAEEL, Q23683_CAEEL, Q4W5P1_CAEEL, Q5FC55_CAEEL,
Q6LA91_CAEEL, Q6RUQ0_MELIC, Q6UY31_HUMAN, Q7YTI8_CAEEL,
Q7YTP4_CAEEL, Q7YWN3_CAEEL, Q8MXD0_9BILA, Q8WQ00_OSTOS,
Q93122_ASCSU, Q95YM3_NIPBR, Q9GU97_LOALO, Q9NFS0_GLORO,
Q9NFU9_GLORO, Q9TZL7_WUCBA, Q9XTN7_CAEEL, YY23_CAEEL

References

1. O. Sasson, A. Vaaknin, H. Fleischer et al., Nucleic Acids Res 31 (1), 348 (2003).

2. C. H. Wu, R. Apweiler, A. Bairoch et al., Nucleic Acids Res 34 (Database issue), D187 (2006).

3. N. Rappoport, S. Karsenty, A. Stern et al., Nucleic Acids Res 40 (Database issue), D313 (2012).

4. S. Hunter, R. Apweiler, T. K. Attwood et al., Nucleic Acids Res 37 (Database issue), D211 (2009).

5. R. D. Finn, J. Tate, J. Mistry et al., Nucleic Acids Res 36 (Database issue), D281 (2008).

6. T. K. Attwood, P. Bradley, D. R. Flower et al., Nucleic Acids Res 31 (1), 400 (2003).

7. F. Corpet, F. Servant, J. Gouzy et al., Nucleic Acids Res 28 (1), 267 (2000).

8. I. Letunic, R. R. Copley, S. Schmidt et al., Nucleic Acids Res 32 (Database issue), D142 (2004).

9. D. H. Haft, J. D. Selengut, and O. White, Nucleic Acids Res 31 (1), 371 (2003).

10. M. Madera, C. Vogel, S. K. Kummerfeld et al., Nucleic Acids Res 32 (Database issue), D235 (2004).

11. A. G. Murzin, S. E. Brenner, T. Hubbard et al., J Mol Biol 247 (4), 536 (1995).

12. F. Pearl, A. Todd, I. Sillitoe et al., Nucleic Acids Res 33 (Database issue), D247 (2005).

13. C. A. Orengo, A. D. Michie, S. Jones et al., Structure 5 (8), 1093 (1997).

14. H. Mi, B. Lazareva-Ulitsky, R. Loo et al., Nucleic Acids Res 33 (Database issue), D284 (2005).

15. N. Kaplan, O. Sasson, U. Inbar et al., Nucleic Acids Res 33 (Database issue), D216 (2005).
