
Supplementary material
S1 US public data
GenBank source sequence accession numbers for the samples used to infer Figure 1:
GQ117097, GQ377093, GQ232002, FJ984360, FJ984397, GQ117100, GQ168642, GQ168661, GQ232019,
GQ232049, GQ117032, GQ160527, GQ168652, CY041122, GQ231996, CY041541, GQ377103,
GQ117112, GQ221794, CY041170, CY041605, GQ338409, CY039999, GQ160526, GQ200237, GQ221809,
GQ160531, CY040701, CY041202, CY041498, GQ338355, CY047326, GQ200270, CY041162, CY046307,
CY041621, GQ377045, CY046387, CY047350, GQ232067, CY046187, CY041806, CY049867, CY046275,
CY041645, GQ323464, GQ323470, CY041782, CY043171, CY046323, CY046403, CY046483, CY046523,
CY046859, CY046875, GQ323520, GQ457501, CY049820, CY041814, CY046555, CY041766, CY043123,
CY046571, CY046587, CY046563, CY046611, CY041758, CY043163, CY043219, GQ323574, CY050328,
CY046675, CY050372, CY051903, CY052959, CY053055, CY041790, CY043131, CY044973, CY046627,
CY050380, CY041774, CY043235, CY044981, CY046635, CY046643, CY041798, CY045085, GQ377069,
CY044064, CY044917, CY045143, CY047358, CY050412, CY044048, CY044056, CY046707, CY046731,
CY053174, CY044877, CY044965, CY053039, CY053190, CY046747, CY053206, CY050420, CY053222,
GQ323579, CY044957, CY046699
The figure below shows the frequency distribution of US H1N1 HA sequences sampled for the
production of the Bayesian skyline plot (BSP) in Figure 1. Twenty-two sequences were randomly
sampled each week for five weeks in order to reduce uneven temporal sampling.
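The weekly subsampling described above (22 sequences per week over five weeks, i.e. 110 sequences in total) can be sketched as follows. This is a minimal illustration, not the authors' script: the `(accession, collection_date)` record layout, the function name `subsample_by_week`, the use of ISO weeks as bins, and the toy identifiers are all assumptions made for the example.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def subsample_by_week(records, per_week=22, seed=0):
    """Draw up to `per_week` accessions at random from each ISO week.

    `records` is a list of (accession, collection_date) pairs. This layout
    is an assumption for the sketch, not the authors' data format.
    """
    rng = random.Random(seed)
    by_week = defaultdict(list)
    for accession, collected in records:
        iso = collected.isocalendar()
        # Group by (ISO year, ISO week number).
        by_week[(iso[0], iso[1])].append(accession)

    sampled = {}
    for week in sorted(by_week):
        pool = by_week[week]
        # Weeks holding fewer than `per_week` sequences are kept whole.
        sampled[week] = rng.sample(pool, min(per_week, len(pool)))
    return sampled


if __name__ == "__main__":
    # Toy data with hypothetical identifiers: 30 sequences in each of 5 weeks.
    start = date(2009, 4, 6)
    toy = [
        (f"SEQ{w}_{j}", start + timedelta(days=7 * w + j % 7))
        for w in range(5)
        for j in range(30)
    ]
    picked = subsample_by_week(toy)
    print({week: len(ids) for week, ids in picked.items()})
```

Drawing a fixed number per week caps the contribution of densely sampled weeks, which is the sense in which the subsampling reduces uneven temporal sampling.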
S2 Lineages-through-time (LTT) plots corresponding to the plots in Figure 2
S3 Lineages-through-time plot for Figure 4a
S4 Classical skyline plot
We considered the possibility that the averaging and smoothing involved in producing the
generalized skyline plot [1], of which the BSP is an example, may in some way be responsible for the
slow-down in the growth of effective population size observed at later times. To rule this out, we generated
multiple classical skyline plots (Pybus, Rambaut & Harvey 2000) from the dataset used in Figure 3 of the
main text (R=1.5, k=1, over 27 generations, one random sample per generation).
As detailed in [2], if N_e(x) is the effective population size at some time x in the past, then for n randomly
sampled sequences (all sampled at the present) the corresponding genealogy has n-1 internode
intervals I_2, I_3, ..., I_n with interval sizes g_2, g_3, ..., g_n, where each subscript equals the number of
lineages in that interval. Under the coalescent approximation, the interval sizes are distributed according to
the following probability density function, where t_i is the time at which interval I_i begins:

\[
p(g_i \mid t_i) = \frac{\binom{i}{2}}{N_e(g_i + t_i)}
\exp\left[ -\int_{x = t_i}^{g_i + t_i} \frac{\binom{i}{2}}{N_e(x)} \, dx \right]
\]
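As a quick check on the density above, consider the special case of a constant population, N_e(x) = N (an assumption made only for this illustration). The integral then evaluates to \binom{i}{2} g_i / N, and the density reduces to an exponential distribution:

\[
p(g_i \mid t_i) = \frac{\binom{i}{2}}{N} \exp\left[ -\frac{\binom{i}{2}}{N} \, g_i \right],
\qquad
\mathrm{E}[g_i] = \frac{N}{\binom{i}{2}}
\]

In this case each interval size is exponentially distributed with rate \binom{i}{2} / N and no longer depends on t_i, so intervals with more lineages are expected to be shorter.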


The exponential portion of this equation allows the simulation of coalescent trees for most demographic
models, and the reconstructed genealogy gives g_i and t_i, from which N_e can be inferred (strictly speaking,
the estimate is a piecewise function over time, equal to the harmonic mean of N_e over internode interval I_i,
that is, over the time interval [t_i, g_i + t_i]). The effective population size estimated this way is known to be
an underestimate (for demographic histories other than constant) as the harmonic mean is less than the
arithmetic mean.
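To make the inference step concrete, the classical skyline estimate for interval I_i is M_i = g_i * C(i, 2), the constant population size under which the expected interval size equals the observed g_i [2]. The sketch below computes these estimates from a set of interval sizes. It is an illustration only, not the GENIE implementation: the function name `classical_skyline`, the dictionary input layout, and the toy interval sizes are assumptions.

```python
from math import comb


def classical_skyline(interval_sizes):
    """Classical skyline estimates from internode interval sizes.

    `interval_sizes` maps the number of lineages i (2..n) to the interval
    size g_i (an assumed input layout). Returns a list of
    (t_i, t_i + g_i, M_i) steps ordered from the present (i = n) back to
    the root (i = 2), where M_i = g_i * C(i, 2).
    """
    steps = []
    t = 0.0  # time before present at which the current interval begins
    for i in sorted(interval_sizes, reverse=True):
        g = interval_sizes[i]
        steps.append((t, t + g, g * comb(i, 2)))
        t += g
    return steps


if __name__ == "__main__":
    # Toy interval sizes for a four-tip genealogy (made-up values).
    for start, end, estimate in classical_skyline({4: 0.5, 3: 1.0, 2: 2.0}):
        print(f"[{start}, {end}]: M = {estimate}")
```

Each interval contributes one estimate from a single observation, and the estimates are in the same time units as the interval sizes.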
The classical skyline plot is inherently noisier, both because of the stochastic variability of the
coalescent and because the number of free parameters equals the number of independently distributed
observations; it also lacks the smoothing associated with the generalized skyline plot and, more extensively,
with the BSP. The “tree file” output from BEAST represents a random sample from the posterior distribution
of trees given the data. For the sampled simulated data used in Figure 3 of the main text, the
MCMC algorithm explored 30 million tree states, which were sampled every 3000 states. This resulted in
10,000 trees, of which we drew one in every hundred, giving one hundred trees. The figure below
shows classical skyline plots of the first 50 of these, produced with the software GENIE [3], which infers
demographic history from reconstructed molecular phylogenies. The substitution rate estimated
from the Bayesian coalescent analysis is used to rescale the skyline plots into days.
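The thinning arithmetic and the rescaling into days can be checked with a short sketch. The helper names `thinned_indices` and `to_days` are hypothetical, keeping every hundredth tree starting from the first is an assumption (the text only says one in every hundred was drawn), and the rate in the usage example is a made-up value, not the estimate from the analysis.

```python
def thinned_indices(n_states, sample_every, keep_every):
    """Indices into the logged trees that are retained after thinning.

    Assumes one tree is logged every `sample_every` states and that
    thinning starts at the first logged tree.
    """
    n_logged = n_states // sample_every
    return list(range(0, n_logged, keep_every))


def to_days(time_in_substitutions, rate_per_site_per_day):
    """Convert a time in substitutions per site into days."""
    return time_in_substitutions / rate_per_site_per_day


if __name__ == "__main__":
    kept = thinned_indices(30_000_000, 3000, 100)
    print(len(kept))   # number of trees retained after thinning
    print(kept[:50])   # indices of the first 50 retained trees
    # Hypothetical rate, for illustration only.
    print(to_days(0.004, 1e-4))
```

With the figures quoted above, 30,000,000 / 3000 gives 10,000 logged trees, and keeping one in every hundred leaves 100.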
Comparison with the BSP in Figure 3 of the main text shows that the effective population size no longer
increases after 40 days into the epidemic in both the BSP and the classical skyline plots, indicating that the
averaging and smoothing used to generate generalized skyline plots is not responsible for the slow-down seen in the BSPs.
S5 Varying mutation rates of simulated data
There is evidence that for higher mutation rates the slow-down occurs at a later time.
The same BSP as in the previous figure, but with confidence intervals (and extended for clarity):
1 Strimmer, K., and Pybus, O. G. 2001 Exploring the demographic history of DNA sequences
using the generalized skyline plot. Mol Biol Evol 18, 2298-2305.
2 Pybus, O. G., Rambaut, A., and Harvey, P. H. 2000 An integrated framework for the inference
of viral population history from reconstructed genealogies. Genetics 155, 1429-1437.
3 Pybus, O. G., and Rambaut, A. 2002 GENIE: estimating demographic history from molecular
phylogenies. Bioinformatics 18, 1404-1405. (doi:10.1093/bioinformatics/18.10.1404)