Supplementary data for

Aggregation and evolution of multi-domain proteins: the importance of sequence identity

Caroline F. Wright, Sarah A. Teichmann, Jane Clarke and Christopher M. Dobson

The supplementary data contains the following supplementary information:

Figure S1: Structural features of titin.
Figure S2: Aggregation kinetics for monomeric TI I27 at various concentrations. Supplementary to Figure 1 in the manuscript.
Figure S3: Co-aggregation of different domains with TI I27. Supplementary to Figure 2 in the manuscript.
Figure S4: Details of the analysis of sequence identity of Ig and fnIII domains in the human genome. Supplementary to Figure 3 in the manuscript. The figure legend gives full details of the analysis of homologous Ig and fnIII domains in the human genome.
Table S1: Sequence identity and aggregation kinetics of monomeric proteins. Supplementary to Figure 2 in the manuscript. The footnotes include a detailed description of the kinetic analysis.
Table S2: Domains in proteins of genomes from the three kingdoms of life. The footnotes give details of the analysis of whole genomes.
Table S3: Length distribution of repeats of tandem homologous domains in human.
Table S4: The largest repeat superfamilies in the human genome.
Table S5: Sequence identities in the I-band of titin.
References: To support the supplementary material.

Supplementary Figure S1.

Figure S1. Structural features of titin. (a) The 27th Ig domain from titin [1]. The strands are labelled A, A', B, C, D, E, F and G, from the N to the C terminus. The two sheets pack onto each other to form a β-sandwich structure with Greek key topology. (b) Schematic representation of part of a multi-modular protein, such as titin; the domains are labelled from the N-terminus.

Supplementary Figure S2.

Figure S2. Aggregation kinetics for monomeric TI I27 at various concentrations. Supplementary to Figure 1 in the manuscript.
As the concentration of monomer increases, the lag time decreases and the elongation rate increases.

Supplementary Figure S3.

Figure S3. Co-aggregation of different domains with TI I27. Each monomeric protein was added to a solution of TI I27 to reach a final concentration of 1 mg/ml for each domain (i.e. a total protein concentration of 2 mg/ml). The extent of light scattering was then monitored, using a Synergy Bio-Tek plate reader, as a function of the time of incubation in a buffered 28 % TFE solution at 25 °C. The aggregation kinetics (black circles) were found to fall between two limiting rates: I27 aggregating alone at either 1 mg/ml (lower dashed red line) or 2 mg/ml (upper dashed red line). The aggregation time course of each protein alone at 2 mg/ml is also shown (solid black line). The protein domains used, and their sequence identities (ID) to wild-type TI I27, are (a) PI3SH3, 0 % ID, (b) TNfn3, 8 % ID, (c) TI I28, 28 % ID, (d) TI I32, 42 % ID, (e) TI I31, 56 % ID, (f) TI I27 (F14L), 99 % ID. (The data shown are averages of at least 5 measurements, except for TI I31, where only 2 measurements were possible due to very low yields of protein in the expression system.) (Panels (a), (c) and (d) are reproduced in the manuscript.)

Supplementary Figure S4.

Figure S4. Details of the analysis of sequence identity of Ig and fnIII domains in the human genome (see associated Figure 3 in the manuscript).

Ig and fnIII domains in the human genome (ENSEMBL v. 15.33) were assigned by hidden Markov models in the SUPERFAMILY database [4], version 1.61. These two superfamilies are amongst the largest in the human genome, with 2545 Ig domains and 1098 fnIII domains. The fnIII superfamily is comparable to the Ig superfamily in both size and function (extracellular domains predominantly involved in cell adhesion). However, fnIII domains do not contain disulphide bridges. The domains in the SUPERFAMILY database are as defined in the SCOP database [5].
All the sequence regions assigned as Ig or fnIII domains in the human genome were compared to each other by pairwise sequence comparisons using FASTA [6]. Three types of homologous domain pairs amongst the proteins in an organism were identified (as illustrated above).

1) Paralogous domains: domains in different proteins within the same organism.
2) Adjacent domains: domains within 30 residues of each other within the same protein. (The 30-residue cut-off is used since very few independent domains are less than 30 residues long.)
3) Non-adjacent domains: domains within the same protein that are >30 residues apart.

Since the Ig superfamily encompasses very distantly related proteins, many domain pairs do not produce significant alignments based on pairwise amino acid sequence comparisons. Statistical scores were not used since the data sets of sequences compared were small. To ensure that only meaningful sequence identities were obtained, any alignments shorter than thirty residues and/or with less than thirty percent sequence identity were considered 'unmatched' [7]. In Figure 3a in the manuscript the identities are binned in a category of more than thirty percent identity.

Of all the paralogous domain pairs in the human genome, only 2% of the ~3 × 10^6 paralogous Ig pairs and 3% of the ~6 × 10^5 fnIII domain pairs have ≥30% sequence identity. Adjacent domains are on average more similar than domains on different proteins, probably because they are likely to be exposed to the same sorts of functional constraints on their sequences and to have evolved by internal gene duplication of each other. Despite these similar evolutionary constraints, only 27% of the 1165 Ig and 29% of the 673 fnIII adjacent pairs have >30% sequence identity.

Domains within the same polypeptide chain that are not adjacent are plotted in the third section of Figure 3a. The vast majority of these domains are themselves adjacent to other Ig or fnIII domains.
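The three pair categories and the 'unmatched' filter described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used for the analysis: the `Domain` record, the function names and the definition of the separation between two domains (the number of residues lying between them) are assumptions introduced here. The 30-residue cut-off and the 30-residue / 30 % identity thresholds are the values quoted in the text.

```python
from dataclasses import dataclass

ADJACENCY_CUTOFF = 30    # residues separating "adjacent" from "non-adjacent" domains
MIN_ALIGNED_LENGTH = 30  # alignments shorter than this are treated as 'unmatched'
MIN_IDENTITY = 30.0      # percent identity below this is treated as 'unmatched'


@dataclass(frozen=True)
class Domain:
    """Hypothetical record for one assigned domain (1-based, inclusive coordinates)."""
    protein: str
    start: int
    end: int


def classify_pair(a: Domain, b: Domain) -> str:
    """Return 'paralogous', 'adjacent' or 'non-adjacent' for a pair of homologous domains."""
    if a.protein != b.protein:
        return "paralogous"
    first, second = sorted((a, b), key=lambda d: d.start)
    # Assumption: separation = number of residues between the two domains.
    gap = second.start - first.end - 1
    return "adjacent" if gap <= ADJACENCY_CUTOFF else "non-adjacent"


def is_matched(aligned_length: int, percent_identity: float) -> bool:
    """Keep an alignment only if it passes both the length and the identity threshold."""
    return aligned_length >= MIN_ALIGNED_LENGTH and percent_identity >= MIN_IDENTITY


def n_pairs(n_domains: int) -> int:
    """Number of distinct unordered pairs among n_domains domains."""
    return n_domains * (n_domains - 1) // 2
```

As a consistency check on the pair counts, `n_pairs(2545)` gives 3,237,240 and `n_pairs(1098)` gives 602,253, in line with the ~3 × 10^6 Ig pairs and ~6 × 10^5 fnIII pairs quoted above.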
Any selective pressure for sequence divergence from their neighbours will therefore be reflected in their sequence similarity to domains further away in the proteins. Nevertheless, the percentages of pairs with greater than 30% identity are actually somewhat greater amongst these domain pairs than amongst the adjacent pairs: 35% of the 4712 Ig and 53% of the 4047 fnIII pairs. The difference is greater for the fnIII superfamily than for the Ig superfamily and could be due to a greater pressure for sequence divergence amongst adjacent fnIII domains in the absence of disulphide bonding; the presence of such bonds and the restrictions they place on conformational transitions are likely to inhibit aggregation in the Ig superfamily.

Although 27% of the Ig and 29% of the fnIII adjacent domain pairs have 30% or greater sequence identity, it is important to appreciate that the sequence identity distributions of the adjacent domain pairs (and indeed all groups of domain pairs) fall off roughly as a power law (Figure 3b). This means that, for instance, only about 8% of the adjacent Ig domains and 12% of the fnIII domains have more than 40% sequence identity.

Table S1. Sequence identity and aggregation kinetics of monomeric proteins

Protein          % ID   k (s-1)    Lag (s)    k+I27 (s-1)   Norm. k+I27    Lag+I27 (s)
PI3-SH3          0      < 0.0001   > 10 000   0.0004        -0.1 (± 0.2)   498
TNfn3            8      < 0.0001   > 10 000   0.0002        -0.2 (± 0.2)   588
TI I28           28     0.0001     3000       0.0005        0.0 (± 0.2)    510
TI I32           42     0.0005     1500       0.0011        0.5 (± 0.2)    527
TI I31           56     0.0006     300        0.0014        0.7 (± 0.2)    243
TI I27 (F14L)    99     0.0004     960        0.0017        1.0 (± 0.2)    342
TI I27 (wt)      100    0.0018     334        0.0018        1.0            334

Co-aggregation was assessed by measuring the aggregation rate of a 1 mg/ml solution of TI I27 in the presence of 1 mg/ml of a second protein whose aggregation rate is otherwise slow under the conditions used here.
If the aggregation rate is increased in the mixed protein solution, it indicates that the effective concentration of TI I27 is increased as a result of the ability of the second protein to co-aggregate with TI I27. The kinetics were analysed with Prism (GraphPad) using a simple single exponential function, with terms to account for the lag phase as well as linear and quadratic drift over long timescales (see below).

k, single exponential aggregation rate constant; k+I27, single exponential aggregation rate constant when co-incubated with TI I27; Lag, length of the lag phase; Lag+I27, mean length of the lag phase when co-incubated with TI I27; Norm. k+I27, normalised co-aggregation rate constant, from zero to one (aggregation at the same rate as TI I27 alone at 1 mg/ml or 2 mg/ml, respectively).

Errors are calculated from the standard deviation of the aggregation kinetics fitted to repeated measurements. The error on all aggregation rates is approximately 15 %; the error on all lag times is approximately 20 %. All data are given for a total protein concentration of 2 mg/ml, which is either all one protein or, for the co-aggregation experiments, 1 mg/ml of the second protein with 1 mg/ml of TI I27. Aggregation of wild-type TI I27 at 1 mg/ml has an elongation rate of 0.0004 s-1 and a lag phase of 540 s.

Details of the fitting procedure:

Because of the short lag time for the aggregation of TI I27, it is not possible to fit the kinetics with a sigmoidal curve of the type used, for example, in fitting IAPP aggregation kinetics [8]. Instead, data were analysed using a phenomenological equation:

$$ OD_{400} = \mathrm{IF}\left(t < t_0,\; P,\; P + (T - P)\left(1 - \exp[-k_A (t - t_0)]\right) + at + bt^2\right) $$

where t is the time (s), t_0 is the lag time, P is the initial light scattering intensity plateau during the lag time, T is the final (top) light scattering plateau, k_A is the elongation rate, and a and b are terms to account for non-linearity in the data after the initial plateau due to fragmentation and association.
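The phenomenological equation above can be written as a small Python function, which may help when re-plotting or re-fitting the curves outside Prism. This is a minimal sketch under stated assumptions rather than the fitting code used for the paper: the function and argument names are introduced here, and the drift terms are applied only after the lag phase, following the description of a and b. The second helper illustrates one plausible reading of the normalised rate constant (a linear scaling between the rates of TI I27 alone at 1 mg/ml and 2 mg/ml); the linear form is an assumption, while the default limits are the values quoted in the footnotes.

```python
import math


def od400_model(t, t0, P, T, kA, a=0.0, b=0.0):
    """Light scattering at time t (s): flat at P during the lag, then an exponential rise.

    t0 is the lag time, P the initial plateau, T the final plateau, kA the
    elongation rate, and a, b the linear and quadratic drift terms.
    """
    if t < t0:
        return P
    return P + (T - P) * (1.0 - math.exp(-kA * (t - t0))) + a * t + b * t ** 2


def normalised_rate(k_mixed, k_low=0.0004, k_high=0.0018):
    """Assumed linear scaling: 0 at the 1 mg/ml TI I27 rate, 1 at the 2 mg/ml rate."""
    return (k_mixed - k_low) / (k_high - k_low)
```

For example, `normalised_rate(0.0011)` evaluates to approximately 0.5, the value tabulated for TI I32. Because the rate constants in Table S1 are rounded, values recomputed this way need not reproduce every tabulated entry exactly.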
The IF function forces the data to be fitted to a flat, straight line until the end of the lag phase, at which point the data are fitted to a single exponential function. The program Prism (GraphPad Software) was used to analyse all aggregation data. Since the kinetic data for the aggregation of AcP with 25 % TFE fit well to an equation containing just a single exponential function [9], the fit was validated by manually fitting a smaller portion of the data (after the visible lag time) to a simple single exponential function. The data fit was not improved by using double, triple or quadruple exponential functions. (Note that where the lag time was < 90 s it was not possible to assign a value to the lag time because of the 60 s experimental dead-time.) In order to determine the value of the final aggregation plateau (i.e. the light scattering at t = ∞), the OD400 was plotted versus 1/t for a series of times after the end of the initial exponential phase; linear extrapolation to the y-intercept then gives the final plateau value of OD400. Analysis of the data using the simpler procedure of comparing the half-times for the aggregation reactions, rather than by explicit analysis of the lag and kinetic phases as described here, gives very similar results and the conclusions are unchanged.

Table S2. Domains in proteins of genomes from the three kingdoms of life.

Columns:
(A) % proteins with structural domain assignments
(B) % proteins with assignments that are multi-domain
(C) % multi-domain proteins with tandem homologous domains
(D) Longest array of tandem homologous domains

Genome                            (A)   (B)   (C)   (D)
Vertebrates
  Human                           59    77    27    45
  Mouse                           63    72    26    32
  Fugu (fish)                     44    56    24    45
Invertebrates
  Worm (C. elegans)               51    73    16    44
  Fly (D. melanogaster)           54    80    19    32
Unicellular eukaryotes
  Budding yeast (S. cerevisiae)   47    77    13    10
  Fission yeast (S. pombe)        56    74    14    8
  Encephalitozoon cuniculi        50    65    15    6
Plant
  A. thaliana                     55    76    13    8
Bacteria
  E. coli                         57    55    13    16
  Y. pestis                       54    54    14    25
  B. subtilis                     54    52    14    6
  M. pneumoniae                   52    59    12    3
Archaea
  A. fulgidus                     57    50    16    5
  M. jannaschii                   57    48    15    5
  T. acidophilum                  60    51    14    5

The genomes and structural domain assignments are taken from the SUPERFAMILY database, version 1.61 [4]. The parameters for adjacency between domains, and for unassigned regions equivalent to a domain, are also taken directly from this database. The fraction of assigned proteins that have two or more domains is between two thirds and three quarters for most eukaryotes, and closer to one half for prokaryotes. Multi-cellular animals are enriched in proteins with repeats of two or more tandem homologous domains. (Homologous domains are defined as domains that belong to the same superfamily, as described in the SCOP database [5].) They include muscle proteins such as titin, as well as proteins involved in multi-cellularity such as extracellular matrix proteins, cell adhesion proteins and cell signalling molecules. Thus the fraction of multi-domain proteins with tandem homologous domains is approximately one quarter in vertebrates, and slightly lower in invertebrates. The proteins with the longest arrays of tandem homologous domains are also found in vertebrates and invertebrates. Proteins with >30 tandem repeats include muscle proteins such as titin and extracellular matrix proteins, which all contain long arrays of Ig and fnIII domains. Interestingly, the proteins in bacteria, such as E. coli and Y. pestis, with tandem repeats of >20 domains are cell adhesion proteins used for host invasion in pathogenic bacteria.

Table S3. Length distribution of repeats of tandem homologous domains in human.
Number of tandem      % of proteins with tandem    Number of tandem      % of proteins with tandem
homologous domains    homologous domains           homologous domains    homologous domains
2                     41                           14                    0.12
3                     32                           15                    0.12
4                     10                           16                    0.12
5                     5                            18                    0.07
6                     3                            19                    0.05
7                     4                            20                    0.02
8                     1                            22                    0.02
9                     1                            24                    0.02
10                    1                            25                    0.02
11                    0.5                          28                    0.02
12                    0.5                          29                    0.07
13                    0.32                         31                    0.02
                                                   45                    0.02

About one quarter of the multi-domain proteins in the human genome contain tandem repeats of two or more homologous domains. About three quarters of these proteins have either two or three homologous domains in a consecutive array, as shown in this table. Only about three percent contain ten or more domains, with up to forty-five homologous domains as a maximum in this data set of human proteins.

Table S4. The largest repeat superfamilies in the human genome

Superfamily name in SCOP                                Number of        Number of non-
                                                        adjacent pairs   adjacent pairs
TPR-like                                                40               40
WW domain                                               42               37
Kringle-like                                            45               35
Concanavalin A-like lectins/glucanases                  49               23
Spermadhesin, CUB domain                                50               377
Homeodomain-like                                        43               46
Eukaryotic type KH-domain (eKH-domain)                  59               141
Ankyrin repeat                                          61               59
Integrin A (or I) domain                                62               234
SH3-domain                                              65               80
C2 domain (Calcium/lipid-binding domain, CaLB)          68               95
C-type lectin-like                                      73               136
PDZ domain-like                                         98               370
Actin-like ATPase domain                                120              23
Scavenger receptor cysteine-rich (SRCR) domain          153              525
Glucocorticoid receptor-like (DNA-binding domain)       154              125
RNA-binding domain, RBD                                 218              126
P-loop containing nucleotide triphosphate hydrolases    239              180
Complement control module/SCR domain                    267              79
LDL receptor-like module                                276              940
Spectrin repeat                                         363              3677
Cadherin                                                671              2901
Fibronectin type III                                    573              4047
EGF/Laminin                                             1082             8374
Immunoglobulin                                          1165             4712
C2H2 and C2HC zinc fingers                              5934             23296

There are 25 superfamilies in the human genome that have forty or more adjacent pairs of homologous domains and thirty or more non-adjacent domains within the same protein sequence, using the domain assignments in SUPERFAMILY v. 1.61 [4].
If a superfamily has small numbers of non-adjacent pairs, it means that most of the homologous domains in the same sequence consist of a pair of adjacent domains. Larger numbers of non-adjacent domain pairs are associated with long arrays of tandem homologous domains.

Table S5: Sequence identities in the I-band of human cardiac titin

domain I- 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 100 17 17 23 19 16 22 23 19 24 25 24 20 24 29 25 27 22 27 23 14 23 15 19 24 18 17 18 19 15 16 27 24 16 17 21 22 21 21 18
2 17 100 29 27 26 24 29 22 34 28 32 37 25 29 27 22 19 24 27 31 22 20 20 29 27 34 26 29 26 30 23 30 32 35 27 24 29 25 33 26
3 18 30 100 30 29 28 26 25 28 30 29 30 19 18 25 21 24 25 19 25 17 26 25 26 25 28 22 24 21 28 21 28 37 30 30 28 24 28 29 22
4 24 27 30 100 26 30 25 22 26 26 29 32 26 25 24 24 20 25 22 22 25 19 16 25 23 29 22 31 16 29 16 26 33 25 27 19 20 30 23 20
5 20 27 29 27 100 26 28 29 31 26 28 29 30 21 33 20 19 24 20 22 21 20 21 37 23 27 21 34 23 26 21 29 33 32 28 26 28 28 30 23
6 17 25 28 30 26 100 28 36 37 31 28 29 26 21 31 22 22 27 26 25 20 21 26 33 28 28 22 26 30 34 29 31 31 35 26 26 27 19 30 26
7 24 30 26 26 28 28 100 28 26 29 24 34 28 24 30 29 22 25 26 29 26 24 30 31 28 33 31 28 21 35 27 31 35 35 27 25 30 26 35 26
8 24 22 24 22 29 36 28 100 27 37 29 24 27 23 21 24 19 24 29 21 26 22 19 31 28 23 22 24 29 20 26 23 29 30 23 27 29 22 33 26
9 20 36 28 27 31 37 26 27 100 39 35 35 31 20 30 24 19 20 26 33 18 27 25 31 28 31 24 21 28 31 22 34 31 26 22 26 24 24 26 34
10 26 29 30 27 26 31 29 37 39 100 37 34 37 24 26 29 24 22 35 28 21 30 24 35 26 31 29 30 25 28 29 27 37 30 29 24 27 31 28 24
11 27 34 29 29 28 28 24 29 35 37 100 37 33 24 30 24 20 26 28 27 17 24 19 27 24 31 21 30 27 25 25 28 29 27 26 18 33 34 28 26
12 26 38 30 33 29 29 34 25 35 34 37 100 43 26 28 24 27 19 31 31 21 26 29 30 27 28 28 36 33 28 29 30 29 29 30 27 24 31 31 24
13 21 25 19 26 30 25 27 26 31 36 32 42 100 20 25 21 15 18 26 25 23 16 19 21 15 23 20 24 19 21 25 21 26 24 21 13 21 29 23 24
14 24 29 17 24 20 20 22 22 19 22 22 24 19 100 34 32 33 31 28 23 17 24 23 26 19 20 23 26 20 23 20 28 29 27 19 22 21 15 26 16
15 30 27 24 24 32 30 29 20 29 25 29 27 25 34 100 27 24 31 27 25 23 20 18 26 27 26 19 27 22 27 19 28 29 22 19 23 30 24 24 23
16 26 21 20 23 19 21 28 23 22 28 22 22 20 32 27 100 30 29 30 22 19 19 21 19 19 26 24 23 15 23 16 21 28 19 21 23 16 20 27 19
17 27 19 22 19 18 21 21 18 18 22 19 25 15 33 23 29 100 22 26 23 19 17 16 24 20 19 22 17 20 19 14 21 24 22 19 21 21 20 17 16
18 23 24 24 25 24 26 24 24 19 22 25 18 17 31 31 29 23 100 23 18 18 23 18 26 15 27 14 24 18 26 17 23 26 23 22 23 24 23 20 23
19 28 27 18 21 19 24 24 28 24 33 27 30 26 28 27 30 27 22 100 29 17 24 22 24 22 28 23 19 19 22 21 24 28 28 22 23 21 26 23 23
20 24 31 24 22 22 24 28 20 31 27 26 30 25 24 25 23 24 18 29 100 20 24 27 29 25 28 29 22 17 26 17 27 31 28 25 27 24 22 30 24
21 14 22 16 25 20 19 25 25 17 20 16 20 23 17 23 19 19 18 17 20 100 17 16 23 20 25 17 15 14 24 13 23 28 25 27 22 20 23 20 17
22 24 21 25 19 20 21 23 22 26 30 23 25 16 25 21 20 18 23 25 24 18 100 22 30 24 30 19 30 24 29 20 23 30 30 26 25 26 24 27 22
23 16 21 25 17 21 26 30 19 25 24 19 29 19 25 19 22 17 19 24 28 17 22 100 25 30 31 47 19 29 28 48 46 28 31 22 29 19 20 22 19
24 20 31 26 26 38 33 32 32 32 35 27 31 22 27 27 20 26 27 26 31 24 31 25 100 32 32 28 41 28 32 27 33 38 43 28 32 30 28 40 22
25 26 28 25 24 24 28 28 28 28 26 24 27 16 20 28 20 21 16 24 26 21 25 30 31 100 25 31 22 52 25 30 40 36 35 25 30 28 24 35 22
26 19 36 28 30 27 28 33 24 32 32 32 28 24 22 27 27 20 28 30 30 26 31 32 32 25 100 24 27 24 61 23 30 42 40 39 32 33 32 34 19
27 18 27 22 22 21 22 31 22 24 29 21 28 20 25 20 26 24 15 25 30 18 19 47 28 31 24 100 25 25 20 57 42 29 29 25 28 21 22 26 20
28 19 30 24 31 35 26 28 25 21 30 30 36 25 27 28 25 18 25 20 22 16 30 19 40 22 27 25 100 28 31 22 24 36 36 31 22 26 31 34 21
29 20 27 21 17 24 30 21 29 28 25 27 33 19 21 22 16 21 19 20 18 15 25 29 28 52 24 25 28 100 22 30 34 26 31 18 24 26 30 28 24
30 16 31 28 29 26 34 35 20 31 28 25 28 21 25 28 25 20 27 24 27 25 29 28 31 25 61 20 31 22 100 21 27 40 38 34 29 24 28 31 16
31 17 24 21 17 21 29 27 26 22 29 25 29 26 21 20 17 15 18 22 18 13 20 48 27 30 22 57 22 30 21 100 39 28 27 25 22 25 21 25 19
32 29 31 28 27 29 31 31 24 34 27 28 30 21 29 29 22 22 24 26 28 24 24 46 33 40 29 42 24 34 27 39 100 37 33 22 29 28 27 34 29
33 26 34 37 34 34 31 35 29 31 37 29 29 27 30 30 29 26 27 29 33 29 30 28 37 36 42 29 36 26 40 28 37 100 51 37 28 35 27 45 26
34 17 37 30 26 33 35 35 30 26 30 27 29 25 28 22 20 24 24 29 29 26 30 31 43 35 39 29 36 31 38 27 33 51 100 34 36 35 26 45 21
35 18 28 30 28 28 26 27 24 22 29 26 30 21 20 20 22 20 22 24 26 28 27 22 28 25 38 25 31 18 34 25 22 37 34 100 29 29 20 27 16
36 22 25 28 19 26 26 25 27 26 24 18 27 13 24 24 25 22 24 25 28 22 26 29 31 30 31 28 22 24 29 22 29 28 36 29 100 22 21 30 26
37 23 29 23 19 27 26 29 28 23 26 31 23 20 22 30 16 22 24 22 24 20 26 18 28 27 31 20 25 25 23 24 27 33 33 28 22 100 31 34 20
38 22 26 28 30 28 19 26 22 24 31 34 31 29 16 25 21 21 24 27 22 24 25 20 28 24 31 22 31 30 28 21 27 27 26 20 21 33 100 33 25
39 22 35 29 24 30 30 35 34 26 28 28 31 24 27 25 28 18 21 25 31 21 28 22 39 35 34 26 34 28 31 25 34 45 45 27 30 36 33 100 31
40 19 27 22 20 24 26 26 26 34 24 26 24 25 17 24 20 17 24 25 25 18 22 19 21 22 19 20 21 24 16 19 29 26 21 16 26 21 25 31 100

Pairwise sequence identities of Ig domains in the I-band of human cardiac titin. The alignment is taken from reference 1. Only two adjacent domain pairs have > 40% sequence identity (coloured in yellow). Non-adjacent domain pairs with > 40% sequence identity are shaded in grey.

References.

1. Improta, S., Politou, A. S. & Pastore, A. Immunoglobulin-like modules from titin I-band: extensible components of muscle elasticity. Structure 4, 323-337 (1996).
2. Oosawa, F. & Asakura, S. Thermodynamics of the polymerization of protein (Academic Press, London, 1975).
3. Fernandez, C. O. et al. NMR of alpha-synuclein-polyamine complexes elucidates the mechanism and kinetics of induced aggregation. EMBO J. 23, 2039-2046 (2004).
4. Gough, J. & Chothia, C. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res. 30, 268-272 (2002).
5. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. SCOP - a Structural Classification of Proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540 (1995).
6. Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448 (1988).
7. Sander, C. & Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins Struct. Funct. Genet. 9, 56-68 (1991).
8. Padrick, S. B. & Miranker, A. D. Islet amyloid: phase partitioning and secondary nucleation are central to the mechanism of fibrillogenesis. Biochemistry 41, 4694-4703 (2002).
9. Chiti, F. et al. Mutational analysis of the propensity for amyloid formation by a globular protein. EMBO J. 19, 1441-1449 (2000).