Quasispecies Theory

USC3002 Picturing the World through Mathematics
AY08/09, Semester 2
Prof Lawton, Wayne Michael
Matric No.: U062281A
Ye Dan
Quasispecies theory was developed by two chemists, Manfred Eigen and Peter Schuster, in
the 1970s, in the course of their attempt to develop a chemical theory for the origin of life1,2. The
“species” here follows the chemists’ definition: an ensemble of equal molecules1,2,4. A “quasispecies”
can thus be understood, directly but only preliminarily, as a cluster of closely related
molecular species. Quasispecies theory is especially relevant when nucleic acid molecules are
considered, as a quasispecies is always produced by errors in their inaccurate self-replication2,4,5.
Furthermore, it is argued that since every biological reproduction is unavoidably error-prone, the
quasispecies concept can be readily applied to genetic processes other than RNA self-replication1.
More realistic entities such as viruses, bacteria, or even plants and animals are also quasispecies. Their genetic
reproduction is of course more complicated and includes more sophisticated mutational events
such as recombination and sexual reproduction, but the underlying principle remains the same1.
This article focuses on the nature of the quasispecies concept and its mathematical formulation, as
well as its implications in evolutionary biology and virology.
The Origin of the Quasispecies Concept
In considering the earliest life forms on earth, Eigen and Schuster assumed RNA to be the first
biological replicator1,4. Indeed, the non-living yet self-replicating RNA molecules are a very possible
starting point of the “reproduction” process in the primordial soup. This process of replication based
on nucleotide base-pairing (Adenine to Uracil, Cytosine to Guanine) is the basis of all life and is
nowadays carried out by highly sophisticated and precise enzymes in living cells. However, at the
very beginning, this process may have occurred very slowly and, very importantly, was subject to
a high error rate1,4,5. The errors can be mutations or mismatches in base-pairing. Whatever the cause,
the result is not an absolutely homogeneous population of wildtype RNA molecules, but a mixture of
RNA molecules with slightly different nucleotide sequences, which constitutes a quasispecies1,4,5.
Therefore, in order to study the evolutionary dynamics of the spontaneous replication of RNA
molecules, the first macromolecules on earth, it is necessary to view the problem in a quasispecies
context.
The Chemical Kinetics and the Mathematical Framework: the Quasispecies Equation
The spontaneous replication of RNA molecules is a primitive genetic replication process that can be
described by chemical kinetics1, that is, by equations specifying how the concentrations of certain
molecules change over time. Consider sequences of length $l$. There are $n$ different RNA
variants $I_1, I_2, \dots, I_n$ with the same length $l$. It is reasonable to assume that each
RNA variant has a different rate of replication, depending on its sequence. Some sequences may
produce “offspring” faster than others. We can regard the rate of replication as fitness: the faster
the rate, the fitter the sequence. Selection thus comes into the picture. We denote the
characteristic replication rate, i.e. the fitness, of each variant by $f_1, f_2, \dots, f_n$. If there is no error in
replication, the variant with the highest replication rate (the fittest) will grow fastest and reach fixation
as a result of selection alone.
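As a brief worked illustration of the error-free case, the following sketch may help. It assumes symbols that are introduced here only for illustration: $X_i$ for the absolute abundance of variant $i$, $f_i$ for its replication rate, and $k$ for the index of the fastest replicator (with $f_k > f_j$ for all $j \neq k$ and $X_k(0) > 0$). Each variant then grows exponentially at its own rate, so the share of the fastest replicator tends to one:

```latex
\frac{dX_i}{dt} = f_i X_i
\quad\Longrightarrow\quad
X_i(t) = X_i(0)\, e^{f_i t},
\qquad
\frac{X_k(t)}{\sum_{j} X_j(t)}
  = \frac{X_k(0)}{\sum_{j} X_j(0)\, e^{-(f_k - f_j)\, t}}
  \;\longrightarrow\; 1
  \quad (t \to \infty).
```

Every term with $j \neq k$ in the denominator decays to zero, which is what "reaching fixation" means in this setting.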
If we take errors in replication into consideration, we bring mutation into the picture. We
then need the probability $Q_{ij}$ that replication of a template sequence (parent) $I_j$ results in an offspring
$I_i$, assuming all errors are caused by point mutations1,3. $Q = [Q_{ij}]$ can be
regarded as a mutation matrix.
We can now write down the following two chemical reactions for quasispecies replication. The
symbol A denotes the four nucleotides, A, U, C and G, required for RNA synthesis. The available
amount of A in the environment is assumed to be constant, so that it does not enter as a variable.
Error-free replication:
$$(A) + I_i \xrightarrow{\; f_i Q_{ii} \;} 2 I_i$$
Mutation:
$$(A) + I_j \xrightarrow{\; f_j Q_{ij} \;} I_j + I_i$$
Error-free replication and mutation are parallel reactions of the same mechanism. The rate of
replication, $f_j$, depends only on the parent sequence $I_j$; the mutation probability, $Q_{ij}$, depends on
both the parent sequence and the offspring sequence.
Letting $x_1, x_2, \dots, x_n$ represent the abundances of variants $I_1, I_2, \dots, I_n$, we can write down an ordinary
differential equation to describe the time evolution of the population of each variant. For example,
the growth rate of variant $I_1$ can be written as
$$\frac{dx_1}{dt} = f_1 Q_{11} x_1 + f_2 Q_{12} x_2 + \dots + f_n Q_{1n} x_n$$
The first term on the right-hand side represents the rate at which new $I_1$ sequences are formed by error-free
replication of $I_1$. The remaining terms represent the contribution from erroneous replication of
other sequences giving rise to new $I_1$ sequences. In general, we can write the whole system of
differential equations for the growth rates of all the variants:
$$\frac{dx_i}{dt} = \sum_{j=1}^{n} f_j Q_{ij} x_j, \qquad i = 1, \dots, n$$
If we wish to keep the total population size constant and normalized, so as to regard the $x_i$
as relative abundances (frequencies), we may write
$$\frac{dx_i}{dt} = \sum_{j=1}^{n} f_j Q_{ij} x_j - \phi x_i, \qquad i = 1, \dots, n \qquad (3)$$
Each sequence is removed at rate $\phi$ to ensure that the total population size remains constant, i.e. $\sum_i x_i = 1$.
For this, $\phi$ has to be the average fitness of the population,
$$\phi = \sum_{i=1}^{n} f_i x_i$$
Equation (3) is known as the quasispecies equation3,5.
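To make the quasispecies equation concrete, here is a minimal numerical sketch in Python. It assumes the convention that `Q[i][j]` is the probability that parent `j` yields offspring `i`, with `f` the fitness values, `x` the frequencies and `phi` the average fitness. The three-variant fitness values and mutation matrix are made up for illustration, and the forward-Euler integrator and the function names are choices of this sketch rather than anything prescribed by the theory.

```python
def average_fitness(f, x):
    """phi = sum_i f_i x_i, the population's mean replication rate."""
    return sum(fi * xi for fi, xi in zip(f, x))


def euler_step(x, f, Q, dt):
    """One forward-Euler step of dx_i/dt = sum_j f_j Q[i][j] x_j - phi x_i."""
    n = len(x)
    phi = average_fitness(f, x)
    return [
        x[i] + dt * (sum(f[j] * Q[i][j] * x[j] for j in range(n)) - phi * x[i])
        for i in range(n)
    ]


# Illustrative values only (not from the text): variant 0 is the fittest.
f = [2.0, 1.0, 1.0]
# Q[i][j] = probability that parent j yields offspring i; each column sums to 1.
Q = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]

x = [1.0 / 3.0] * 3           # start from equal frequencies
for _ in range(5000):         # integrate up to t = 50
    x = euler_step(x, f, Q, dt=0.01)

phi = average_fitness(f, x)
print([round(xi, 4) for xi in x], round(phi, 4))
```

In this toy run the fittest variant ends up dominant but does not reach fixation: the other two variants persist at low frequency because they are continually regenerated by erroneous copying, and the frequencies keep summing to one throughout.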
We can see that with mutation, the growth rate of any variant does not only depend on itself, but
also on all other variants1. The resultant population will no longer consist only of the fastest growing
sequence, but a whole ensemble of mutants with different replication rates. In the long run, under
constant selection, the frequency of every variant will reach equilibrium. This leads to a more precise
definition of quasispecies as the equilibrium distribution of sequences that is formed by the
interaction of mutation and selection2,3: “mutation” because the reproduction process is subject
to errors, “selection” because sequences have different fitness. From here onwards, the term
“quasispecies” in this article refers to this definition.
We can write the quasispecies equation in vector form2,3,
$$\frac{d\vec{x}}{dt} = W\vec{x} - \phi\,\vec{x}$$
The vector $\vec{x}$ has as its elements $x_1, \dots, x_n$, the frequencies of the individual sequences, and so
gives the exact genomic structure of the population. The matrix $W$ is a combination of the
fitness landscape and the mutation matrix, a mutation-selection matrix:
$$W = [W_{ij}] = [f_j Q_{ij}]$$
$\vec{f} = (f_1, \dots, f_n)$ is the fitness vector giving the fitness landscape, and $Q = [Q_{ij}]$ is the mutation
matrix. We note that $Q$ is a stochastic matrix: it has as many rows as columns; each entry is a
probability; each column sums to one, $\sum_i Q_{ij} = 1$. $\phi$, the average fitness given by equation (__), is
the inner product of the vectors $\vec{f}$ and $\vec{x}$,
$$\phi = \vec{f} \cdot \vec{x}$$
The equilibrium of the quasispecies equation can therefore be found by solving for the eigenvector
and eigenvalue of $W$,
$$W\vec{x} = \phi\,\vec{x}$$
At equilibrium, the average fitness, $\phi$, is the largest eigenvalue of the matrix $W$. The eigenvector associated with
this eigenvalue, with the proper normalization $\sum_i x_i = 1$, gives the exact equilibrium population structure
of the quasispecies. In the long run, if selection remains constant, the rate of growth of every variant
converges to the average fitness $\phi$. This eigenvector is the precise mathematical definition of
quasispecies, and the eigenvalue is the mathematical definition of the fitness of the quasispecies2.
Generically, there is a unique and globally stable equilibrium of the quasispecies equation3.
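The eigenvector characterization can be checked numerically. The sketch below uses power iteration, which is a choice made here for simplicity and is not a method named in the sources. It assumes `W[i][j] = f[j] * Q[i][j]`, with `Q[i][j]` the probability that parent `j` yields offspring `i`, and reuses made-up three-variant values. Normalizing the iterate so that its entries sum to one makes the eigenvalue estimate coincide with the inner product of the fitness and frequency vectors.

```python
def dominant_eigenpair(W, iterations=500):
    """Power iteration with the normalization sum(x) == 1.

    Returns (phi, x): an estimate of the largest eigenvalue of W and the
    matching eigenvector, scaled so that its entries sum to one.
    """
    n = len(W)
    x = [1.0 / n] * n
    phi = 0.0
    for _ in range(iterations):
        y = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        phi = sum(y)                # equals f . x when the columns of Q sum to 1
        x = [yi / phi for yi in y]
    return phi, x


# Illustrative values only (not from the text).
f = [2.0, 1.0, 1.0]
Q = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
n = len(f)
W = [[f[j] * Q[i][j] for j in range(n)] for i in range(n)]

phi, x = dominant_eigenpair(W)
# Largest deviation from W x = phi x, as a convergence check.
residual = max(
    abs(sum(W[i][j] * x[j] for j in range(n)) - phi * x[i]) for i in range(n)
)
print(round(phi, 4), [round(xi, 4) for xi in x], residual)
```

Because every entry of this `W` is positive, the iteration converges to a single positive eigenvector, which matches the statement that the equilibrium is generically unique.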
With the help of the mathematical work done above, we have defined a quasispecies as the well-defined equilibrium distribution of sequences that is generated by a specific mutation-selection
process describing the erroneous replication of macromolecules, RNA in this case. It is clear that the
frequency of any variant within the quasispecies depends not only on its own fitness, but also
on the likelihood of its being produced by erroneous replication of other variants and on their
frequencies in the quasispecies distribution. The consequence of this effect, which has important
implications for understanding the evolutionary process, is that selection no longer targets the individual
sequence with the highest fitness1,2,3,4. Mutations are so ubiquitous that the fitness of an individual
sequence becomes somewhat meaningless. Instead, the whole quasispecies itself is the target of
selection in a mutation-selection process. This idea is best illustrated using the concepts of
sequence space and fitness landscape.
Sequence Space and Fitness Landscape
If we consider an RNA sequence with a length of l bases, each of the l positions can be one of the 4
bases, A, U, C or G. There are thus $4^l$ different possible sequences of length l. Each of the possible
sequences, or variants, is a point in a space called “sequence space”. In the sequence space, all the
possible variants are arranged such that neighbors differ by only one base substitution. Generally, the
distance between any two variants is their Hamming distance, which is the
number of positions at which the two variants differ. The number of dimensions of the sequence
space is the length of the sequence, l. There are 4 possibilities in each dimension: A, U, C, G. A
sequence space thus includes all possible variants2,3. It can be seen as the mathematical “habitat” of
a quasispecies of sequences with the given length, as mutation at any position to any of the 4 bases
always has a certain probability. Assigning an abundance to every point in the sequence space, we
obtain a quasispecies. In practice, many points in the sequence space will have zero abundance,
because the number of all possible sequences, $4^l$, usually exceeds the population size. A
quasispecies is usually a set of closely located points, and is thus a small “cloud” in sequence space.
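The geometry of sequence space is easy to explore in code. The following sketch, with helper names `hamming` and `neighbors` chosen here for illustration, computes the Hamming distance between two sequences, lists the one-substitution neighbors of a sequence, and counts the points in the space for a short example length.

```python
BASES = "AUCG"


def hamming(a, b):
    """Hamming distance: number of positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def neighbors(seq):
    """All sequences exactly one base substitution away from seq."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i in range(len(seq))
        for base in BASES
        if base != seq[i]
    ]


seq = "AUGC"
space_size = len(BASES) ** len(seq)   # number of points: 4 to the power l
one_step = neighbors(seq)             # three alternatives at each of l positions
print(space_size, len(one_step), hamming("AUGC", "UUGG"))
```

For this length-4 example the script prints `256 12 2`: the space holds 256 sequences, each with 12 nearest neighbors, and the two sample sequences differ at two positions. Even at modest lengths the count grows far beyond any realistic population size, which is why most points carry zero abundance.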
If we assign every variant in the sequence space a fitness value, we build a “mountain range” on the
foundation of the l-dimensional sequence space, as illustrated in Figure 1. There are “peaks” in the
mountain range, representing regions of high fitness2,3. A peak can be “sharp” and “narrow”,
meaning that it consists of a point representing a sequence with high fitness surrounded by points
representing sequences with very low fitness. There are also “flatter” and “broader” peaks, meaning
that the fitness of the sequences surrounding the highest-fitness sequence is only a little lower
than the highest fitness. This mountain range, termed the “fitness landscape”, has l+1 dimensions,
the additional dimension being the fitness. Due to the high dimensionality of the sequence space, a
few point mutations can lead from one region in the sequence space to a completely different
region. There are many directions in which the sequences can go (3l choices just from one point to a
neighboring point), but natural selection provides a “guiding gradient” that guides the mutations in
one, “correct” direction by defining a fitness landscape1.
In the full sequence space, each mutant would occupy two positions because replication produces
complementary sequences. This gives a mirror image of the fitness landscape.
Figure 1. The fitness landscape is a high-dimensional mountain range. Here the sequence space is reduced
to 2 dimensions and the height represents fitness. Each sequence, i.e. each point in the sequence space,
is assigned a fitness value.
A quasispecies lives in the sequence space and adapts to the fitness landscape1. The quasispecies
equation describes the movement of a population through the sequence space. We can visualize the
quasispecies as a small cloud in sequence space wandering over the fitness landscape searching for
peaks, the regions of high fitness in the mountain range. It usually attempts to “climb” uphill in the
high-dimensional fitness landscape3 under the guidance of natural selection and
reach local or global peaks. This illustration allows us to envision how natural selection does not
simply choose the fittest sequence, but the quasispecies as a whole. Consider two peaks in a
fitness landscape: the first is high but narrow (surrounded by sequences with low fitness), and the
second is lower but broader (surrounded by sequences with relatively high fitness). If
there is very little mutation, the quasispecies at equilibrium will be centered at the higher peak, because
essentially only the sequence with the maximum fitness matters. However, with a higher rate of
mutation, the fitness of the neighboring sequences becomes important, and a sharp transition of the
quasispecies from the higher, narrower peak to the lower, broader peak happens when the
mutation rate increases beyond a certain threshold. We can thus see that for a given mutation rate,
selection always chooses the equilibrium quasispecies with the maximum average fitness, which is
represented as $\phi$ in our mathematical formulation and is characteristic of the quasispecies as a
whole. Indeed, “survival of the fittest” must be replaced by “survival of the quasispecies”3.
Adaptation and Error Threshold
For the attempts of the quasispecies to climb uphill and reach local or global peaks to be successful,
one condition to be considered is the error threshold3.
If all replications were error-free, no mutants would arise and evolution would stop.
Adaptation and evolution would, however, also be impossible if the error rate of replication were too high1.
The population would then not be able to maintain any genetic information. In the long run, the
composition of the quasispecies would be determined only by randomness. The abundance of an
individual sequence would no longer depend on its fitness. Therefore, for adaptation and
evolution to take place, the error rate must be kept below a critical threshold level2,3.
The error rate can be expressed as $u$, the probability that a mutation occurs at one specific position, or
the per-base probability of making a mistake1. If we consider only point mutations, we can write the
mutation probability for sequences of length $l$ as
$$Q_{ij} = u^{d_{ij}} (1-u)^{\,l - d_{ij}} \qquad (6)$$
where $d_{ij}$ denotes the Hamming distance between sequences $I_i$ and $I_j$. A few assumptions were
made in writing this expression: $u$, the per-base probability that a mutation occurs, is assumed to be
the same for all positions; all mutations are assumed to be independent of one another, so that we
can take the product of their probabilities3. Also, when we use this expression in the treatment of
quasispecies, we consider only point mutations and ignore deletions, insertions and other
types of mutation that change the sequence length $l$.
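The product form of the mutation probability can be turned into an explicit matrix for short sequences. One caveat is worth stating: the expression built from the per-base error rate raised to the Hamming distance gives columns that sum to exactly one only for a two-letter alphabet. For four bases, the sketch below divides the per-base error rate equally among the three wrong bases, which is an assumption of this example and not something stated in the text. The function names are likewise illustrative.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def mutation_matrix(seqs, u, alphabet_size):
    """Q[i][j] = (u/(k-1))**d * (1-u)**(l-d) for parent seqs[j], offspring seqs[i].

    d is the Hamming distance, l the sequence length and k the alphabet size.
    For k = 2 this reduces to u**d * (1-u)**(l-d).
    """
    l = len(seqs[0])
    per_wrong_base = u / (alphabet_size - 1)   # assumed: errors split equally
    return [
        [
            per_wrong_base ** hamming(si, sj) * (1.0 - u) ** (l - hamming(si, sj))
            for sj in seqs
        ]
        for si in seqs
    ]


def column_sums(Q):
    """Sum of each column; every value should be 1 for a stochastic matrix."""
    n = len(Q)
    return [sum(Q[i][j] for i in range(n)) for j in range(n)]


binary = ["".join(s) for s in product("01", repeat=3)]     # 8 sequences
rna = ["".join(s) for s in product("AUCG", repeat=2)]      # 16 sequences
Q_binary = mutation_matrix(binary, u=0.1, alphabet_size=2)
Q_rna = mutation_matrix(rna, u=0.1, alphabet_size=4)
print(column_sums(Q_binary)[0], column_sums(Q_rna)[0])
```

Both matrices have columns summing to one up to rounding, and the diagonal entry is the probability of an exact copy, the per-base accuracy raised to the sequence length.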
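Consider a wildtype sequence $I_0$ with length $l$. We denote the frequency of the wildtype by $x_0$. From
equation (6), the probability that the wildtype produces exact copies of itself is given by
$$Q_{00} = (1-u)^l = q^l$$
where $q = 1 - u$ is the single-base replication accuracy.
The probability that the wildtype produces other sequences, the mutants, in the sequence space is
$1 - Q_{00}$. If we neglect the very unlikely back mutation from a mutant sequence to the wildtype
sequence $I_0$, we obtain from equation (3)
$$\frac{dx_0}{dt} = f_0 Q_{00} x_0 - \phi x_0$$
$$\frac{dx_1}{dt} = f_0 (1 - Q_{00}) x_0 + f_1 x_1 - \phi x_1$$
where $f_0$ is the fitness of the wildtype, $x_1 = 1 - x_0$ is the total frequency of all mutant sequences,
$f_1$ is the average fitness of all the mutant sequences, and $\phi$ is the average fitness, $\phi = f_0 x_0 + f_1 x_1$. We
consider the usual case where the wildtype is fitter than the mutants, $f_0 > f_1$. Expressing $x_1$ and $\phi$ in
terms of $x_0$, we obtain one single equation for the system,
$$\frac{dx_0}{dt} = x_0 \left[ f_0 Q_{00} - f_1 - (f_0 - f_1)\, x_0 \right]$$
If $f_0 Q_{00} < f_1$, the term in the square bracket is less than zero and $x_0$ will converge to zero: the
fittest wildtype sequence cannot be maintained in the population. For $f_0 Q_{00} > f_1$, consider the
equilibrium $dx_0/dt = 0$, which gives
$$x_0^{*} = \frac{f_0 Q_{00} - f_1}{f_0 - f_1} > 0$$
Therefore, the wildtype can only be maintained in the population if
$$Q_{00} = q^l > \frac{f_1}{f_0}$$
From this, a critical single-base replication accuracy $q_{\min}$, which must be a
lower bound of any single-base accuracy $q$, can be obtained as
$$q > q_{\min} = \left( \frac{f_1}{f_0} \right)^{1/l}$$
Taking logarithms on both sides and approximating $\ln q = \ln(1-u)$ by the first term of its Taylor series, $-u$, we
obtain a relationship between the replication accuracy and the sequence length,
$$l < \frac{\ln (f_0 / f_1)}{u} = \frac{\ln (f_0 / f_1)}{1 - q}$$
When $\ln(f_0/f_1)$ is of order one, we simplify the relationship to
$$l < \frac{1}{u} \qquad (14)$$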
This inequality gives an approximate upper bound on the sequence length that can be
maintained at the specified single-base error rate without losing adaptation1,2,3,5. It can be extended
to more complicated organisms1,3. Table 1 shows genomes of some other organisms that are
consistent with the relationship in equation (14).
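The error-threshold bound lends itself to a quick numerical check. The sketch below assumes the two-class picture used in the derivation: a wildtype of fitness `f0`, mutants of average fitness `f1`, per-base error rate `u`, sequence length `l`, and no back mutation. The name `superiority` for the fitness ratio `f0 / f1`, the function names and the sample numbers are all illustrative choices. The code compares the bound before and after the logarithm approximation and evaluates the equilibrium wildtype frequency on either side of the threshold.

```python
import math


def max_length_exact(u, superiority):
    """Largest l with (1-u)**l > 1/superiority, before any approximation."""
    return math.log(superiority) / -math.log(1.0 - u)


def max_length_approx(u, superiority):
    """Same bound with ln(1-u) replaced by -u."""
    return math.log(superiority) / u


def wildtype_frequency(u, l, f0, f1):
    """Equilibrium wildtype frequency in the two-class model (0 above threshold)."""
    exact_copy = (1.0 - u) ** l            # probability of an error-free copy
    return max(0.0, (f0 * exact_copy - f1) / (f0 - f1))


u = 1e-3                                   # illustrative per-base error rate
print(max_length_approx(u, math.e))        # ln(f0/f1) = 1 gives the bound 1/u
print(max_length_exact(u, math.e))
for l in (100, 500, 1000):
    print(l, round(wildtype_frequency(u, l, f0=math.e, f1=1.0), 4))
```

With these sample values the approximate bound is a length of about 1000 bases and the exact bound lies just below it, so the wildtype is well represented at lengths 100 and 500 but has frequency zero at length 1000.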
Figure 2. Error threshold. Blue colour shows the schematic distribution of sequences in the
quasispecies when the error rate is below or above the error threshold. If the mutation rate (denoted by
u here) is less than a critical value 1/L, where L is the sequence length, then the peak is selected; if u is greater than 1/L, the peak is
not selected and the quasispecies cannot “feel” the peak on the fitness landscape. 1/L is the error
threshold.
Adaptation indicates the ability of the quasispecies to find peaks in the fitness landscape and stay
there3. Consider a fitness landscape with only one peak. If the mutation rate is lower than the error
threshold, the peak is selected. The equilibrium quasispecies is centered around this peak, with most
sequences in the quasispecies either matching the sequence with the maximum fitness or lying only a few
point mutations away from it. The sequences which are far away from the peak will have very low
frequency. The quasispecies is thus said to be adapted to this peak, or localized at this peak3. A
smaller mutation rate corresponds to a narrower quasispecies distribution. As the mutation rate
increases, the quasispecies distribution widens. If the mutation rate increases beyond the error
threshold, the equilibrium quasispecies can no longer “detect” or “feel” the peak2,3. The fitness
landscape no longer matters to the quasispecies. A transformation from a localized to a delocalized
state occurs and adaptation is lost.
With the knowledge of error threshold, we can review the statement “selection of quasispecies”
with more confidence (Figure 3).
Figure 3. Selection of quasispecies. The single line represents a high but narrow peak in the fitness
landscape, while the group of lines represents a lower but broader peak. If the mutation rate
(denoted by u here) is less than a critical value u1, then the higher peak is selected; if u is greater than
u1 but lower than another critical value u2, the lower peak is selected. If u is greater than u2, neither
peak is selected and the quasispecies cannot “feel” the peaks on the fitness landscape any more. u2 is
the error threshold.
Evolution in a Quasispecies Context
The understanding that natural selection selects the quasispecies rather than any individual
sequence has important implications for evolution. Evolution is conventionally thought of as the
interaction between mutation and selection – selection is a factor that favors advantageous mutants,
and the mutants are generated by pure chance. It turns out, however, that mutation can be
“guided”, if we look at it in a quasispecies context3. This does not mean that there is any correlation
between the intrinsically stochastic act of mutation and the selective advantage of the mutant. Since
selection operates on the whole quasispecies, below the error threshold the quasispecies is adapted
to the fitness landscape decided by the selection pressure. If the selection pressure remains
unchanged, the quasispecies stays at the peak as an equilibrium. Evolution happens as
destabilization of the existing quasispecies upon arrival of a new advantageous mutant sequence3
(which already has a place in the sequence space and a “height” in the fitness landscape) or, in the
case of change of selection pressure, a change of fitness landscape. The quasispecies then moves
towards the peaks with newly assigned abundance, or the new peaks. Each individual mutation
still occurs by chance, but quasispecies theory shows that on the whole, only the mutations which
produce sequences with relatively high fitness, i.e. sequences which are at a peak in the fitness
landscape (preferably a broad peak), are able to maintain themselves. In this way, mutation can be
“guided” towards the peaks of the fitness landscape in the process of evolution. Evolutionary
optimization can be viewed as a hill-climbing process of the quasispecies that occurs along certain
pathways in sequence space1,3. In this sense, quasispecies theory has changed the classical view
of evolution from the picture of a single wildtype moving through sequence space randomly6 into
the picture of a quasispecies with its mutant distribution migrating through sequence space in an
internally controlled manner and guiding itself to the peaks of the fitness landscape.
Figure 4. In adaptation, the quasispecies climbs the mountains in the high-dimensional sequence space.
The higher the quasispecies gets, the fitter it is.
Viral Quasispecies
Although originally put forward to model the evolution of the first macromolecules on earth, the
quasispecies concept has been applied to populations of RNA viruses within their hosts because of their high
mutation rates (on the order of one per round of replication7) and the resultant extensive genetic
heterogeneity2. Also, viral populations, though not infinite, are extremely large.
Quasispecies theory is significant in virology. Since selection acts on the quasispecies (clouds of
mutants) rather than individual sequences, the evolutionary trajectory of the viral infection cannot
be predicted solely from the characteristic of the fittest sequence. Error rates have been determined,
for example, for influenza A virus, vesicular stomatitis virus, foot-and-mouth disease virus, spleen
necrosis virus and HIV-I2. All results show a correlation between error rate and the sequence length
given by equation (14)2. If the mutation rate of a quasispecies is higher than its error
threshold, the viral population is not able to maintain the sequence with the highest fitness,
and therefore the ability of the population to adapt to its environment is compromised. This dynamic
can be exploited in developing antiviral drugs that increase the mutation rate. For
example, increased doses of the mutagen ribavirin reduce the infectivity of poliovirus8.
Quasispecies theory was originally developed to model the evolution of the first macromolecules on
earth, RNA, and helps to explain the early stages of the origin of life. A
quasispecies is the equilibrium distribution of sequences that is generated by a specific mutation-
selection process describing the erroneous replication of macromolecules, RNA in this case. It is
based on chemical kinetics, and we can build a mathematical framework to understand its evolution.
The target of selection is no longer an individual mutant sequence, but the whole quasispecies.
Fitness, as a property of the whole quasispecies, is mathematically defined as the largest eigenvalue
of the mutation-selection matrix. The corresponding eigenvector gives the exact structure of the
quasispecies.
A quasispecies lives in the sequence space and adapts to the fitness landscape. An elegant expression
for the error threshold, beyond which adaptation is lost and evolution becomes impossible, was
derived: the per-base error rate must stay below about one over the sequence length. The derivation is based on a few
assumptions and mathematical approximations: the fitness of the wildtype is greater than that of all
the mutants, the per-base probability that a mutation occurs is the same for all
positions, and all mutations are independent of one another.
Selection chooses the equilibrium quasispecies with the maximum average fitness and stabilizes a
quasispecies distribution in the sequence space. Evolution destabilizes the existing quasispecies
upon arrival of a new advantageous mutant or change of fitness landscape. The quasispecies theory
has changed the classical view of evolution from the picture of a single wildtype moving through
sequence space randomly into the picture of a quasispecies with its mutant distribution migrating
through sequence space in an internally controlled manner and guiding itself to the peaks of the
fitness landscape.
The importance and significance of quasispecies theory have been the subject of some debate9, as it
has been shown that there is no necessary conflict between a quasispecies model and traditional
population genetics. Also, quantitative predictions based on this model are difficult because some
input parameters, such as the fitness of individual sequences and the mutation probabilities, are hard to obtain
from actual biological systems5. However, there is experimental evidence of quasispecies based
on determination of the input parameters with sophisticated techniques4. Meanwhile,
many ongoing experiments pursue further study and further evidence of quasispecies
theory2,4, as it is still deemed important when the population is large and the mutation rate is high.
Table 1. Genome length (in bases), mutation rate per base, and mutation rate per genome for
organisms ranging from RNA viruses to humans
