Rorick-Wagner Supplementary Materials

Rorick-Wagner Supplementary Material

V. Appendix

1. Derivations of Model Equations

Defining the Markov Process:

We consider a discrete stochastic process where the random variable, b , is the length of a domain at a time-step t . The transition matrix M consists of the probabilities O h , g

of a functional domain transitioning from any length b = g to any other length b = h in a single time-step. These O h , g

probabilities that constitute M can be calculated from the assumptions of our model. The minimum length for b is m and the maximum length is R -

- the length of the peptide. Transitions to lengths above and below these maximum and minimum values, respectively, are set to zero.

Each of the elements of M that is immediately off the diagonal (i.e., that represent the probabilities of plus-one or minus-one length transitions) is a product of two independent probabilities. The first is the probability U h,g

of a mutation occurring that causes a specific length transition from length g to length h . The second is the probability f h,g of fixation for this particular mutation with its associated length transition. These probabilities are specified by our model.

Let us first consider the probability of transitioning to a length h that is one residue smaller or larger than the starting length g: if h = g -1 or g +1, O h , g

= U h,g f h,g

. (S1)

For transitions of step sizes larger than one, we make simplifications that approximate the actual transition probabilities. First we assume that mutations that increase domain size by more than one amino acid have effectively zero probability. Mutations that increase

1

the domain size by more than one require either that two or more beneficial mutations simultaneously occur, or that a single mutation connects previously existing contiguous set of one or more alphabet amino acids to the functional domain. In real biological systems the first scenario is rare (occurring at a rate equal to or less than the square of the mutation rate), so in this model and in simulations we so in this model we only allow a single mutation per time-step. In this model we also ignore the relatively small probability of the second scenario occurring (even though it may occur in our simulations). In this analysis, we effectively we treat mutations that make connections between multiple contiguous series of alphabet amino acids as if they only increase the length by one. Thus, the probability of a mutation occurring that increases the length by more than one is simplified to zero, such that if h > g +1, O h , g

= 0*f h,g

, (S2) which means the transition probability is zero.

Another simplification is that we treat all mutations which decrease the length by more than one as if they have a zero probability f of fixation, such that if h < g -1, O h , g

= U h,g

0 . (S3)

This is because these mutations are generally very deleterious, and therefore, infrequently fixed. In any given time-step we thus assume a zero probability for transitions that increase or decrease the domain length by more than one amino acid.

The probability that no mutation is fixed during a given time-step can be determined by subtracting, from a total probability of one, the probability of increasing the length by one and the probability of decreasing the length by one:

2

O g , g

=1- U g -1 ,g f g -1 ,g

- U g +1 ,g f g +1 ,g

.

(S4)

Thus, we only need to calculate U h,g

and f h,g

for h = g +1 and h = g -1, since O h , g

for h > g +1 and h < g -1 has been simplified to zero, and we can infer O g,g

(equation S4).

Mutation Probability Calculations:

In order to calculate the probability U h,g

of a mutation occurring that causes a specific length transition, we first have to calculate the number Y g

of viable mutations that are possible from a starting length g, and then calculate the number Z h,g

of mutations that cause that specific length transition. The probability U h,g is then the ratio Z h , g

/ Y g

, times the per-site mutation rate,



.

The only mutations that are unviable are those that decrease the size of the domain to less than the minimum length m . So, when the functional domain is equal to or larger than 2 m , all possible mutations are viable—to any of the z -1 other amino acid types in the whole alphabet, at any of the R sites, such that if g

≥2 m , Y g

= R ( z -1). (S5)

When the functional domain is smaller than 2 m , some mutations at sites near the center of the domain are not viable because they will break the domain into two pieces that are both smaller than m . Thus, there are 2( g m ) sites within the functional domain of length g which can undergo non-neutral (i.e., domain length changing), viable mutations. There are z a non-alphabet amino acid types, so there are 2( g m )( z a ) non-neutral, viable mutations and ( a -1) g neutral mutations within this region. Similarly, because any mutation can occur outside the functional domain (and there are z -1 non-wildtype amino acid types), there are ( R g )( z -1) viable mutations outside the functional domain. In summary,

3

if 2 m > g

≥ m,

Y g

= ( R g )( z -1) + ( a -1) g + (2 g -2 m )( z a ). (S6)

We then calculate the number Z h , g

of equally likely viable mutations that cause a change in length from g to h (where h = g +1 or g -1).

Transitions that reduce the length by one amino acid occur by a mutation to a non-alphabet amino acid (there are z a of these) on one end of the domain or the other

(i.e., at either of two different sites), such that if h=g-1, Z h , g

=2( z a ). (S7)

The simulations are set up such that functional domains do not come in contact with the edges of the sequence. Thus, for the purposes of calculating transition probabilities, we imagine a functional domain that is perfectly centered such that it always has two flanking sites at either end of the domain until it is R -1 residues long. These flanking residues can mutate to an alphabet amino acid, causing an increase in length. Thus, if g < R -1, Z g +1, g

= 2a . (S8)

If the domain is only one less than the maximum length R , there is only one flanking site, thus,

Z

R , R -1

= a . (S9)

Lastly, we do not need to know the probability of a mutation that maintains the same domain length ( U g,g

) because we can simply infer O g,g

from equation S4 (though see

Appendix 4 for a description of how to calculate this directly).

Because in our model all viable mutations are equally likely, to finally calculate the probability U h,g

of a mutation that causes the length transition from g to h , we simply divide the number Z h , g

of mutations that cause the specific length transition by the total number Y g

of viable mutations that are possible at that time-step, and then multiply by the

4

mutation rate,



, such that

(S10)

Fixation Probability Calculations:

The probability f h,g

of fixation, as a function the starting and ending domain lengths ( g and h , respectively), is determined by the forces of drift and selection in accordance with equations 4-6, and the definitions of function performance ( p ), protein fitness ( w ), and the selection coefficient ( s ) (equations 1-3). By substituting the expression for performance p

(equation 1) into the expression for fitness w (equation 2), we obtain an expression for w as a function of b. Substituting this expression for w into equation 3 provides us with an expression for the selection coefficient s as a function of b . We can then use this expression for s to determine N e s , and through this, the fixation probability (equations 4-

6). Ultimately, this process provides the fixation probability for a given mutation as a function of the domain lengths before and after the mutational event (i.e., g and h , respectively) (equations S11-S13):

(S11)

(S12)

(S13)

5

Equilibrium Length Calculations:

Because the fixation probabilities are dependent in a complex way on the fitness function, we cannot easily derive the equilibrium length as a function of the parameters N e

, c , m , R , z and a . However, with the above equations we can numerically solve for the equilibrium length for any given set of parameters. The stationary distribution n associated with the transition matrix M is the eigenvector associated with the eigenvalue of 1 for M , that is, n = M

 n . The mean of the stationary distribution represents the expected equilibrium length , that is,

(S14) where n b

are the elements in the stationary distribution n , such that n ={ n

1

, n

2

, n

3,

… n

R

}.

Rate of random walk calculations:

Our expectation for a domain’s movement is zero because there is no bias for movement in one direction or the other, and expectations for movement in the positive and negative directions cancel out. Nevertheless, we do expect movement to occur over time in the form of a symmetric random walk. We can solve for the expected absolute value of the distance traveled to achieve a non-zero and non-constant function of time. Because we are ignoring the relatively low probabilities of increasing or decreasing domain length by any step size greater than one, we can solve for the expected absolute value of the distance x traveled from a starting point as a function of t time-steps (Rice 2004):

, (S15)

6

where and are the probabilities of, respectively, increasing or decreasing domain length by one amino acid once the equilibrium length has been reached.

Probability of touch-event calculations:

We can also calculate the probability v of two functional domains at equilibrium length q , that start at a certain distance y between their closest ends, experiencing a touch-event by t time-steps. Because two domains undergoing a random walk are equally likely to diverge as opposed to converge over time, the probability of a touch-event is never more than 50%. The probability of a touch-event can be calculated based on an understanding of the probability distribution of a functional domain’s position x over time (which we index as the position of the central amino acid of the domain). The probability distribution of x is assumed to be normal because it is the sum of many independent events, with an expectation E[ x ( t )]=0 for any time t because steps in the positive and negative directions are equally likely. To find the variance of this distribution we first find the variance of x after a single time-step using the definition of variance for discrete random variables:

(S16) where N is the number of different possible outcomes x w

that are each associated with a probability P w

, and E[ x ] is the expected outcome. There are only three possible outcomes for a single step: a move in the positive direction ( x ( t =1) = 1), a move in the negative direction ( x ( t =1) = -1), or no move at all ( x ( t =1) = 0). We know the probabilities of each of these positions after one time-step is , and , respectively. Our expectation for the domain’s position after one time-step is zero, so the variance of x after

7

one time-step is

(S17)

We can now calculate the variance of the position x after multiple time-steps because the position at time t is just the sum of many single steps, and the variance of a sum of elements is the sum of the variances of each of the elements. Thus,

. (S18)

With the probability distribution of the position of a given domain, we can now calculate the probability distribution that describes the distance between the closest ends of two functional domains.

Let K ( t ) be the distance between the central amino acids of two functional domains 1 and 2, at equilibrium length , starting at a distance y between their ends, as the functional domains change position over time, such that

, (S19) where x 1( t ) and x 2( t ) are the changes in position (or the absolute position, if both were relative to starting positions of zero) of functional domains 1 and 2, respectively. Our expectation for K ( t ) is simply the sum of the expectations for each of its parts:

. (S20)

Similarly, because the variance of a sum or difference is the sum of the variances of each of the parts:

. (S21)

Lastly, knowing the expectation and the variance for K ( t ), and the fact that it must be normally distributed (because it is a sum of many independent changes in position), we

8

can use the cumulative distribution function to find the probability that two functional domains, at equilibrium length and with a starting distance y between their ends, have touched (i.e., that K ( t

)≤

) at any given time t :

. (S22)

2. Simulation details

For simulation results shown in Figures 4 and 7 and Table I, and in all other simulations unless otherwise noted, the parameters and initial conditions are as follows. The protein sequence consists of R =100 residues and the effective population size is N e

=100. At all time-steps a sequence performs two primary functions and, if any, a single secondary function (i.e., there are F =2 or 3 functions). Environmental change events occur at intervals of r =100 time-steps.

All functional domains have a minimum length of m =5 amino acids. There are two primary functions (Functions 1 and 2), each of which is defined by a four-amino acid alphabet (consisting of four numbers, since each amino acid type is identified by a unique number). These two alphabets intersect at a single amino acid type (i.e., amino acid type

4), and there is a single secondary function with an eight-amino acid alphabet.

Function 1 alphabet: {1,2,3,4}

Function 2 alphabet: {4,5,6,7}

Function 3 alphabet (initial): {8,9,10,11,12,13,14,15}

The length of the primary functional domains at time-step 0 is equal to the equilibrium length determined from previous simulations. When a secondary functional domain is

9

present, the observed primary domain equilibrium length is 19 amino acids. Thus, in simulations with secondary functions, the primary functional domains are initiated at a length of 19 amino acids. In the absence of secondary functional domains, the observed equilibrium length of the primary functional domains is 23, so simulations without secondary functions are initiated with primary functional domains of length 23. In simulations with and without secondary functions, the two primary functional domains start with 5 amino acids between their two closest ends. The simulations begin at timestep 0, and are ended at time-step 400000, 1000000, or 2000000.

The particular parameters and initial conditions for the above simulations were chosen for the sake of comparability between simulations with and without secondary functional domains. The same qualitative pattern (encroachment on primary domains by secondary domains, and the evolution of overlap between primary domains) is also observed in simulations with the following alternative initial conditions and parameter choices: (1) primary functional domains initially located much farther apart, (2) a secondary domain initially located between the primary domains, (3) primary and/or secondary domains that are not initially at equilibrium length, (4) several primary and secondary functions present, (5) no enhanced competition between primary and secondary functions (i.e., primary and secondary function alphabets overlap with each other), and (6) equation 1 replaced with a rational function of form

, where k and l are parameters that allow a good fit to equation 1.

(S23)

Simulations with results in Figures 2 & 3 each have only a single primary functional domain and no secondary functional domains. The primary functional domains

10

were initiated at the minimum functional domain length m =5, and were run for 1000000 time-steps. Mean domain lengths and random walk rate measures were taken using the last 800000 time-steps of each simulation to allow plenty of time for domains to reach equilibrium length.

The simulations with results reported in Figure 9 have several primary functions

( F =4) with partially intersecting alphabets evolved under crowded conditions within a sequence (without any secondary functions). Function 1 has alphabet {1,2,3,4}, function

2 has alphabet {1,5,6,7}, function 3 has alphabet {1,8,9,10}, and function 4 has alphabet

{1,11,12,13}. N e

=10000, m =10. Simulations were initiated with domains of length m .

The simulation in Figure 10 was performed with only a single secondary function

( F =1) with alphabet {1,2,3,4}. In this simulation, loss of function mutations were allowed

(so there is occasionally no viable functional domain within this protein). N e

=10000, m =4, r =1000. Simulations were initiated with a domain of length m .

The simulations reported in Figure 5 have four functional domains at any given time: two primary and two secondary. The alphabets for the primary functional domains are {1,2,3,4} and {5,6,7,8}, and the secondary functional domains are {9,10,11,12} and

{13,14,15,16} at the beginning of the simulation. The coexisting secondary functional alphabets were never allowed to overlap (i.e., share any amino acid types). Parameters were set as follows: tmax =100000, c =.2, N e

=100, r =100, m =4, and the simulations were initiated with functional domains of length m (or longer, if by random chance the inserted domains were adjacent to amino acids within the same alphabet).

Simulations with results shown in Figures 4, 5, 6A, 6B, 6C, 7, 9, 10 and Table I were performed with a slight technical variation from the analytical model described in

11

the text: “mutations” that do not alter amino acid composition (e.g., a “4” randomly reassigned to be a “4”) still count as mutational time-steps. Compared to the analytical model or the simulations that do not count such transitions as mutations (Figures 2 and

3), this has the effect of slightly lowering the mutation rate from 1 to 19/20.

3. Sticking frequency

The sticking frequency is defined as the probability of evolving overlap given a touchevent between two primary functional domains. If two functional domains touch, under the assumption that they do not affect one another, we would expect there to be a 50% chance of them continuing to converge versus moving apart again. If, on the other hand, the two functional domains do affect one another by drawing each other in, we would expect the sticking frequency to be above 50%. In order to determine whether two functions that have randomly touched are indeed “stuck” (converging), we have to establish how fast we expect a domain to move after an initial touch-event. We do not want to count a touch-event as a sticking-event until there has been enough time for the two functional domains to move apart again. Likewise, we do not want to miss a real stick-event by assessing the stick event after too long a time interval, since any two functions will eventually diverge given enough time—even if they are non-randomly stuck to one another for a long time. To establish the appropriate time interval after a touch-event at which to assess sticking frequency, we first determine the expected amount of time required for the central amino acid of a functional domain (at equilibrium length and undergoing a random walk) to have a .95 probability of moving by at least one amino acid position. This is a little below 50,000 time-steps, so we round up and use

12

50,000 time-steps after the touch-event as the threshold to determine whether two functional domains are stuck or not.

4. Calculating U g,g

directly

Although we can infer U g,g

, we can also calculate it directly in order to verify that the probabilities for all possible mutations, from a given starting length g , add up to one. The number Z g , g

of equally likely viable mutations that cause no change in length depends on how long the domain is. When the functional domain is length g = R , all residues are part of the domain, so only alphabet amino acid mutations maintain the domain’s length.

There are a -1 alphabet amino acids, thus,

Z

R , R

= R ( a -1) . (S24)

When the functional domain has length g = R -1, there is exactly one flanking residue that is not within the domain. Mutations to alphabet amino acids within the domain keep the length the same, but the domain lengthens if the flanking residue mutates to an alphabet amino acid. This flanking residue, therefore, must undergo a non-alphabet mutation.

There are z a non-alphabet amino acid types and ( z a )-1 non-alphabet and non-wildtype amino acid types that maintain the same domain length. Thus,

Z

R -1, R -1

= (( z a )-1) + ( R -1)( a -1), (S25)

When g < R -1, there are sites that are not within or flanking the functional domain. These sites can mutate to any of the other z -1 amino acids without changing the length of the domain. There are also always two flanking sites to the functional domain, which must mutate to one of the other ( z a )-1 non-alphabet amino acids to avoid lengthening the domain. Lastly, any mutation to one of the other a -1 alphabet amino acids at a site within

13

the functional domain also maintains the domain length. Thus, if g < R -1, Z h , g

=( R -( g +2))( z -1) + 2(( z a )-1) + g ( a -1) . (S26)

We find that U g , g

=1- U g -1,g

- U g +1 ,g for all g where R

≥ g

≥ m , which confirms that all possible mutational events from any given starting length g have been counted exactly once.

5. Minimal conditions of the model

There is a small set of minimal conditions necessary to allow for the evolution of highly conserved domains by the mechanism described above. First, protein function performance must at least partly be determined by how many residues are dedicated to that function. Second, a protein must perform multiple distinct functions with alphabets of amino acids that differ to some extent. If these first conditions are not met, then there can be no competition between functions for residue control. Third, as more and more space within the sequence is designated for a particular function, that function’s performance must increase less and less quickly (decreasing returns assumption). If this is not the case, then lower performing functions would not generally out-compete higher performing functions for control over residues, and new, poorly adapted secondary functional domains would not encroach on old, well adapted primary functional domains anymore than the reverse occurrence. Forth, the alphabets of different functions must intersect to some extent so that there is the possibility for the evolution of overlap between functional domains. Exactly how we specify that space within a sequence contributes to performing a function is not important.

14

6. Comparing analytical predictions to simulation results

Equilibrium Length:

In simulations with the default parameter combination (as described in Appendix 2), the mean equilibrium length at which primary functions stabilize after many time-steps is

23.20 (Figure 2). At this parameter combination, the analytical prediction (22.25) is slightly lower, falling outside the 95% C.I. of the simulation mean, and this is generally the case at other N e

values also (Figure 2). The discrepancy is most likely due to the assumption within the Markov model that all mutations increase or decrease the functional domain by at most one amino acid, and that they are only selected as such. In the simulations, for every mutation that increases the length by one, there is an a / z probability that the residue flanking the mutated residue on the outside of the domain is an alphabet amino acid—which would cause an actual increase in length of more than one. Such mutations that cause greater increases in length would experience stronger selection, and thus a higher likelihood of fixation. In making analytical predictions, we also ignore steps that decrease the functional domain length by more than one, but the net effect of both simplifications seems to shift the predicted equilibrium length down relative to simulations.

Rate of random walk:

Equation S15 provides our expectation for the absolute value of the distance x traveled from a starting point as a function of t time-steps. Setting the parameters in this expression to match those of the simulations (as described in Appendix 2), the analytically predicted rate of random walk is about . Because random walk rates are slow, they are difficult to measure accurately from the length of simulations analyzed

15

here-- especially for large population sizes (for which random walk rates are slower).

Nevertheless, we observe that the random walk rate depends on the effective population size in basically the way we would predict from analytical predictions: it increases as the population size becomes smaller, presumably because quasi-neutral mutations become fixed more easily (Figure 3). On the other hand, the random walk rates observed in simulations appear to be slightly lower than the analytically predicted rates, especially as population size decreases (Figure 3). This may be because the analytical predictions do not take into account the effect of the sequence edges preventing the functional domain from moving any further in one direction, and thus they overestimate the random walk rate. We would expect this to cause a greater discrepancy between simulation and analytical results at small populations sizes that have faster random walk rates, because the edges of the sequence play a more significant role when the functional domains come into contact with them that much more frequently.

Probability of a touch-event:

Given the set of parameters in our simulation, and a t =400000 timeframe, the probability of a touch-events is about .24. This prediction is not significantly different from their observed frequency over 24 simulations, each evaluated after 400000 time-steps. (Given a binomial distribution with 24 trials and the analytically predicted touch-event probability (using either the analytically predicted or empirically determined ), there is about a 10% chance of the observed number of touch-events (9 out of 24) occurring, or more.)

In summary, results from simulations with only primary functions are generally consistent with analytical predictions for equilibrium length, random walk rate, and

16

probability of touch-events. This indicates that the simplified Markov model sufficiently captures the overall dynamics of the primary functional domain through evolutionary time—both its change in length and its movement along the sequence.

7. Predictions based on the alternative mechanisms for amino acid repeat evolution

To distinguish which of the two hypothetical mechanisms is the underlying driver of amino acid repeat evolution in real proteins, we can make predictions about the location of non-homogenous repeats in conserved proteins. Under the first mechanism, which is driven by the evolution of overlap, we would expect repeats to be within the most conserved regions of the proteins, and part of primary functional domains that perform multiple functions. We would not expect to see them ever disappear once they emerge.

Under the second mechanism, where selection for “pre-adapted” sequences causes the evolution of repeats, we would expect to find repeats inside secondary functional domains, possibly with some variation, and at loci that have a single function at any given time. Of course, given this second mechanism, it is still entirely possible that a repeat would be perfectly conserved within the timeframe of evolutionary history that is examined.

17

Rorick-Wagner Supplementary Materials

Rorick-Wagner Supplementary Material

V. Appendix

Related documents

Products

Support

Rorick-Wagner Supplementary Materials