Supplementary material for:
A footprint of past climate change on the diversity and population structure of Miscanthus sinensis
Lindsay V. Clark, Joe E. Brummer, Katarzyna Głowacka, Megan Hall, Kweon Heo, Junhua Peng, Toshihiko
Yamada, Ji Hye Yoo, Chang Yeon Yu, Hua Zhao, Stephen P. Long, and Erik J. Sacks
Table S1. Miscanthus collections used in the present study.

Species                                   Origin                         # of entries
M. sinensis*                              China                          296
                                          Japan**                        136
                                          S. Korea                       184
                                          Taiwan                         1
                                          Ornamental cultivars***        43
                                          U.S. naturalized               43
M. floridulus                             China                          1
                                          Japan****                      1
                                          New Caledonia****              1
                                          Papua New Guinea****           1
M. sinensis x M. sacchariflorus hybrids   China*****                     11
                                          Ornamental cultivars******     34
M. sacchariflorus                         China                          8
                                          S. Korea                       3
                                          Ornamental cultivars           1
M. oligostachyus                          Ornamental cultivars           1
M. sp.                                    Taiwan****                     1
                                          unknown****                    1
Total                                                                    767

*includes varieties condensatus (3), purpurascens (1) and transmorrisonensis (2; presumed to
have originated in Taiwan).
**includes 4 biomass cultivars.
***From U.S. nurseries.
****From USDA NPGS.
*****F1s and BC1 found in the wild.
******BC1s and BC2s with M. sinensis as the recurrent parent, as well as two F1s.
Table S2. Analysis of Molecular Variance (AMOVA) of 620 M. sinensis and M. floridulus individuals
sampled in the native range, using 21,207 RAD-seq SNPs. The genetic clusters are the six identified by
DAPC analysis on RAD-seq data. The two Japanese clusters were considered to be a separate region
from the other four clusters.

                                        Variance   Proportion of variance   P
Among genetic clusters                  796        0.31                     < 0.001
  Among regions (Japan vs. mainland)    318        0.12                     < 0.001
  Within regions                        478        0.18                     < 0.001
Within genetic clusters                 1802       0.69
Total                                   2598

Table S3. Jost’s D statistic showing pairwise differentiation between M. sinensis genetic groups based
on chloroplast haplotype frequency.
Yangtze Qinling
Sichuan basin
Korea, N
China
N Japan
S Japan
US nurseries,
hybrid
US nurseries,
non-hybrid
US
naturalized
SE
China
plus
tropical
Yangtze
Qinling
0.74
0.14
0.40
0.85
0.89
0.88
US
US
nurseries,
nurseries, nonhybrid
hybrid
Sichuan
basin
Korea,
N China
N Japan
0.07
0.99
0.78
0.56
0.99
0.87
0.66
0.60
0.45
0.99
0.93
0.97
0.94
1.00
0.81
0.97
1.00
1.00
0.94
0.83
0.61
0.31
0.99
1.00
1.00
1.00
1.00
0.78
0.10
S Japan
0.16
[Figure S1: two principal component analysis panels, labeled "50% minimum call rate" and "90% minimum call rate". Axes are labeled with the percentage of variation explained (4.48%, 5.18%, 6.05%, and 4.70%).]
Fig. S1. Choice of minimum call rate for RAD-seq markers generated from the UNEAK pipeline. Principal
component analysis of the data is shown with a 90% minimum call rate and a 50% minimum call rate.
Individuals are colored using the scheme from Fig. 2 of the main manuscript, with the addition of
magenta for M. oligostachyus, bright green for Saccharum officinarum, and black for doubled haploid
lines and others excluded from later analysis.
[Figure S2: panels A and B, "Value of BIC versus number of clusters" (x-axis: number of clusters (K); y-axis: BIC); panel C, "Delta K versus number of clusters" (x-axis: number of clusters (K); y-axis: Delta K).]
Fig. S2. Selection of number of clusters for Structure and DAPC analysis. A-B) Bayesian Information
Criterion (BIC) versus number of clusters from DAPC analysis on A) 620 M. sinensis individuals from the
native range and B) 765 Miscanthus individuals, including M. sacchariflorus and accessions from the U.S.
C) Delta K values. Three Structure runs were performed on each of three random sets of 2000 RAD-seq
markers at each value of K=1 through 10.
[Figure S3: PCA scatter plot. Axes explain 44.0% and 5.4% of variation. Legend: Msa, Mol, hybrid US nurs., non-hybrid US nurs., natural hybrids.]
Fig. S3. Principal component analysis demonstrating hybrid ancestry of Msi accessions from US
nurseries. 170 RAD-seq loci that had fixed polymorphism between M. sacchariflorus (Msa) and M.
oligostachyus (Mol) were used. Msi accessions were classified as hybrid or non-hybrid based on
Structure results (Fig. 1 of main manuscript). Natural Msa × Msi hybrids from the set of Chinese
accessions are included as a control.
[Figure S4: "Optimal core germplasm sets". x-axis: number of individuals; y-axis: proportion of polymorphic loci captured. Labeled points: US nurseries, US nat.]
Fig. S4. Proportion of polymorphic loci captured in core sets vs. number of individuals in the set.
Individuals were chosen for inclusion in core sets, out of 620 non-hybrid M. sinensis individuals from the
native range, by a simulated annealing algorithm to maximize the average number of alleles per locus.
Since SNP markers can have a maximum of two alleles per locus, the proportion of polymorphic loci is
directly related to the average number of alleles per locus. Proportions of polymorphic loci captured by
76 individuals from US nurseries and 43 US naturalized individuals are indicated in red.
Dataset S1. Dataset with information on individuals and markers in the study. Geographic coordinates,
species, cluster assignments, chloroplast haplotypes, and assignment to core germplasm sets are listed
for all individuals. All RAD-seq and GoldenGate markers are listed, including sequences, and allele
frequencies are included for markers that were retained for analysis. Provided as Microsoft Excel file.
Supplementary Materials and Methods
DNA extraction
Leaf samples were harvested, frozen at -80°C, then lyophilized. Dried leaf material was pulverized in a
Geno/Grinder 2000 ball mill (SPEX SamplePrep, LLC; Metuchen, NJ). Genomic DNA was isolated from
30-40 mg of ground, freeze-dried leaves in 1.6 ml microfuge tubes using a CTAB method modified from
Kabelka et al. (2002). DNA was quantified using a Quant-iT™ dsDNA Picogreen® Kit (Life
Technologies) and diluted to 100 ng/μl for GoldenGate™ and RAD-seq analysis and further to 10 ng/μl
for the amplification of plastid microsatellite markers.
Sequencing library preparation
Digestion and ligation were performed on 96-well plates, with each well corresponding to a sequence
barcode, in order to multiplex 95 individuals into one library with one unused barcode as a
contamination control and library identifier. Most individuals were only included in one library,
although some that had low read counts in their first run were duplicated in later libraries. 250 ng of
DNA from each individual was digested with 5 U each of PstI-HF and MspI in a 15 μl total volume of 1X
NEBuffer 4 (New England Biolabs) at 37°C for three hours, followed by a 20-minute inactivation step at
80°C. 1.5 pmol of barcoded PstI adapter, 5 pmol MspI Y-adapter, and 200 U T4 DNA ligase (New England
Biolabs) were then added to each well for a total volume of 25 μl of 0.4X T4 ligase buffer, 0.6X
NEBuffer 4, and 1 mM ATP. Ligation reactions were incubated at 25°C for two hours followed by a 20-minute
inactivation step at 65°C. All wells were then pooled into one tube and mixed. 40 μl of the
mixture was run on a 2% agarose gel. The smear from 200-500 bp was cut out of the gel with a razor
blade and purified using a Qiagen Gel Extraction Kit. 3 μl of the purified DNA was amplified in a 50 μl
PCR reaction using Phusion Master Mix (New England Biolabs) and universal Illumina primers. The
thermal cycling program was 98°C for 30 seconds; followed by 15 cycles of 98°C for 10 seconds, 65°C for
30 seconds, and 72°C for 30 seconds; followed by 72°C for 5 minutes. The PCR product was extracted
from a 2% agarose gel as above to eliminate primer-dimers. Library concentration was determined
using a Quant-iT Picogreen kit (Life Technologies) and average fragment size was estimated using a
Bioanalyzer (Agilent Technologies) in order to dilute the library to 10 nM. Quantitative PCR and
sequencing on an Illumina HiSeq 2000 with 100 bp single-end reads were performed at the University of
Illinois Roy J. Carver Biotechnology Center DNA Sequencing Unit.
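
The dilution to 10 nM depends on converting the Picogreen mass concentration and the Bioanalyzer fragment size into molarity. A minimal sketch of that conversion follows, assuming the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names and the example values are hypothetical and are not taken from the study.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration (ng/ul) to nanomolar.

    Assumes ~660 g/mol per base pair (an approximation, not a value
    stated in the source text).
    """
    # ng/ul equals 1e-3 g/L; dividing by g/mol gives mol/L; 1e9 converts to nM.
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_fragment_bp)


def dilution_factor(conc_ng_per_ul, mean_fragment_bp, target_nm=10.0):
    """Fold dilution needed to bring the library to the target molarity."""
    return library_molarity_nm(conc_ng_per_ul, mean_fragment_bp) / target_nm


if __name__ == "__main__":
    # Hypothetical library: 13.2 ng/ul with a 1000 bp mean fragment size.
    print(library_molarity_nm(13.2, 1000))
    print(dilution_factor(13.2, 1000))
```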
UNEAK pipeline
PstI-MspI was selected as the enzyme set for creating tag count files from FASTQ files. Tag counts were
merged across 773 taxa using the UMergeTaxaTagCountPlugin with a minimum tag count of one,
yielding 77,213,063 unique tags. 1,297,117 reciprocal tag pairs (SNPs) were found using the
UTagCountToTagPairPlugin with error tolerance rate set at the default 0.03.
UMapInfoToHapMapPlugin was used to further filter the data, with a minimum minor allele frequency
of 0.001. After preliminary exploration of the minimum call rate (mnC) for UMapInfoToHapMapPlugin,
mnC of 50% and 90% were tested with the final data set, yielding 22,929 and 5135 SNPs, respectively.
Any SNPs that appeared heterozygous in at least one of the three confirmed doubled haploid lines,
indicating that they represented paralogous loci, were removed from the data set. 1722 SNPs were
removed this way from the 50% mnC set, and 702 from the 90% mnC set. In the 50% mnC set, the mean
observed heterozygosity of the 1722 removed SNPs was 48%, whereas the mean observed
heterozygosity of the remaining 21,207 SNPs was 11%.
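
The two filters described above (minimum call rate, then removal of SNPs that appear heterozygous in a doubled haploid line) can be sketched as follows. This is an illustration only, not the UNEAK implementation: it assumes genotypes are coded as counts of one allele (0, 1, 2) with -1 for missing data, and the names `filter_snps`, `dh_rows`, and `MISSING` are hypothetical.

```python
MISSING = -1  # assumed code for a missing genotype call


def filter_snps(genotypes, dh_rows, min_call_rate):
    """Return indices of SNPs that pass both filters.

    genotypes: list of individuals, each a list of 0/1/2/MISSING per SNP.
    dh_rows: row indices of the doubled haploid lines.
    min_call_rate: minimum fraction of individuals with a non-missing call.
    """
    n_ind = len(genotypes)
    n_snp = len(genotypes[0])
    kept = []
    for j in range(n_snp):
        called = sum(1 for row in genotypes if row[j] != MISSING)
        if called / n_ind < min_call_rate:
            continue  # too much missing data at this SNP
        if any(genotypes[i][j] == 1 for i in dh_rows):
            continue  # heterozygous in a doubled haploid: likely paralogous
        kept.append(j)
    return kept


if __name__ == "__main__":
    toy = [
        [0, 1, 2],        # doubled haploid line
        [1, 0, MISSING],
        [2, 2, MISSING],
        [0, 1, MISSING],
    ]
    print(filter_snps(toy, dh_rows=[0], min_call_rate=0.5))
```

A doubled haploid should be homozygous at every true single-copy locus, which is why an apparent heterozygote there is treated as evidence of paralogy.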
SNP analysis
Out of the set of 21,207 RAD-seq SNPs, 6000 were chosen at random and divided into three sets of 2000
for determination of the appropriate number of clusters (K) in STRUCTURE 2.3.4 (Falush et al. 2003).
765 Miscanthus samples were analyzed with this software. Each set of 2000 markers was subjected to
three runs at each value of K = 1 through 10, with a burn-in of 10,000 MCMC repetitions followed by an additional
50,000 MCMC repetitions under default conditions. Delta K was calculated with Structure Harvester
(Earl and VonHoldt 2011) and used to determine the optimum K. At the selected value of K, all 21,207
markers were subjected to six runs at the same conditions. Q values were examined to confirm
consistency between runs, then averaged. Inter-species hybrids between M. sinensis and M.
sacchariflorus were identified as individuals with a Q value of at least 0.05 for the M. sacchariflorus cluster.
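
For readers unfamiliar with Delta K, the statistic can be sketched as below. This is not Structure Harvester's code; it assumes the usual Evanno-style definition, in which the absolute second-order difference of the mean log likelihood across K is divided by the standard deviation of the log likelihood among replicate runs at K. How runs from the three marker sets were pooled is not stated here, so the sketch simply treats every run at a given K as a replicate. The log likelihood values are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Delta K for each K that has both neighbouring K values.

    lnp_by_k: dict mapping K -> list of Ln P(D) values from replicate runs.
    """
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in lnp_by_k or k + 1 not in lnp_by_k:
            continue  # Delta K is undefined at the ends of the K range
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(
            mean(lnp_by_k[k + 1]) - 2 * mean(lnp_by_k[k]) + mean(lnp_by_k[k - 1])
        )
        result[k] = second_diff / sd
    return result


if __name__ == "__main__":
    # Invented log likelihoods, three replicate runs per K.
    runs = {
        1: [-1000, -1002, -998],
        2: [-800, -801, -799],
        3: [-790, -792, -788],
        4: [-785, -790, -780],
    }
    dk = delta_k(runs)
    print(dk, max(dk, key=dk.get))
```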
Clustering of individuals was performed in the R package adegenet (Jombart et al. 2010) using the glPca,
find.clusters, and dapc functions. The n.start argument for find.clusters was set to 200 or 500 to make
the function converge on a single answer for six or seven clusters, respectively. The first 200 principal
components were retained for DAPC analysis based on the recommendation in the adegenet
documentation that not more than one third of the total number of principal components be retained.
A cladogram of all non-hybrid individuals was generated using the Neighbor-Joining method in TASSEL
3.0 (Bradbury et al. 2007) and plotted using the R package ape (Paradis et al. 2004).
Jost’s D (Jost 2008) was calculated with the R package mmod (Winter 2012). Diversity (expected
heterozygosity) was estimated from allele frequencies using the glMean function in adegenet (Jombart
and Ahmed 2011). FIS was estimated from observed and expected heterozygosities calculated in
adegenet. RAD-seq markers were used for diversity estimates given that ascertainment bias would be
expected in the GoldenGate markers. GoldenGate markers were used for FIS estimates because many
heterozygotes are miscalled as homozygotes in RAD-seq analysis, leading to an upward bias of FIS when
estimated from RAD-seq data. FST values differentiating each Asia Msi cluster from the rest of the Asia
Msi dataset were estimated in adegenet using allele frequencies.
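
The reasoning about FIS bias can be made concrete with a small sketch. It assumes FIS is computed as 1 - Ho/He from heterozygosities averaged over loci, with He = 2p(1 - p) per biallelic locus; the exact estimator used with adegenet may differ, and the function name and toy genotypes are hypothetical.

```python
def fis(genotypes):
    """Estimate FIS as 1 - (mean Ho) / (mean He) over biallelic loci.

    genotypes: list of individuals, each a list of allele counts
    (0, 1, or 2) per locus, with None for missing calls.
    """
    n_loci = len(genotypes[0])
    ho_sum = 0.0
    he_sum = 0.0
    for j in range(n_loci):
        calls = [row[j] for row in genotypes if row[j] is not None]
        if not calls:
            continue
        p = sum(calls) / (2 * len(calls))       # allele frequency
        he_sum += 2 * p * (1 - p)               # expected heterozygosity
        ho_sum += sum(1 for g in calls if g == 1) / len(calls)  # observed
    if he_sum == 0:
        raise ValueError("no polymorphic loci; FIS is undefined")
    return 1.0 - ho_sum / he_sum


if __name__ == "__main__":
    true_calls = [[1], [1], [0], [2]]
    miscalled = [[0], [1], [0], [2]]  # one heterozygote called as homozygote
    print(fis(true_calls), fis(miscalled))
```

In the toy data, miscalling a single heterozygote as a homozygote lowers Ho more than He, so the estimate rises, which is the direction of bias described for RAD-seq data.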
An inter-individual Euclidean distance matrix was calculated from the RAD-seq data in R. AMOVA was
performed in pegas (Paradis 2010). GenAlEx 6.41 (Peakall and Smouse 2006) was used to confirm the
results and calculate p-values. The six genetic clusters identified by DAPC analysis on RAD-seq data were
considered to be populations. The two Japanese populations were considered to be a separate region
from the other four populations.
TreeMix (Pickrell and Pritchard 2012) was used to model divergence and migration between the groups
of native M. sinensis and M. floridulus individuals identified by DAPC on RAD-seq data. RAD-seq and
GoldenGate markers were combined into one dataset for analysis. SE China was set as the outgroup.
Three migration edges were assumed because the results were most reproducible and made the most
geographic sense. One hundred bootstrapped sets of 500 SNPs were run, and bootstrap values for the
graph were calculated with ape (Paradis et al. 2004).
The simulated annealing algorithm from PowerMarker (Liu and Muse 2005) was implemented in R in
order to select core germplasm sets using RAD-seq SNPs. The algorithm searches for sets of individuals
that maximize the average number of alleles per locus. First, a given number of individuals is randomly
selected from the full set. A randomly chosen individual from the core is then selected to be swapped
out for another randomly chosen individual from the full set. If the swap would increase the number of
alleles found in the core set, it is always made; if it would decrease the number of alleles, the probability
of the swap being made is $e^{-D/T}$, where $D$ is the amount of the decrease and $T$ is a “temperature” set
by the user. The swapping algorithm is performed 1000 times, and then the temperature is reduced by a
factor of 0.95. The process is repeated and the temperature is lowered until no swaps are successfully
made at the current temperature (convergence). With the average number of alleles per locus being
between 1 and 2, convergence was observed at temperatures on the order of $10^{-4}$ to $10^{-5}$. Starting
temperatures were chosen empirically; these were 0.025 for smaller sets of individuals (<58) and 0.002
for larger sets of individuals.
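
The annealing procedure described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the authors' R code or PowerMarker's: genotypes are assumed to be coded 0/1/2 per biallelic SNP, "reduced by a factor of 0.95" is read as multiplying the temperature by 0.95, and swaps that leave the score unchanged are rejected so that the stopping rule can be reached. All function and parameter names are hypothetical.

```python
import math
import random


def mean_alleles(core, genotypes):
    """Average number of alleles per locus present in the core set."""
    n_loci = len(genotypes[0])
    total = 0
    for j in range(n_loci):
        calls = {genotypes[i][j] for i in core}
        has_first = 0 in calls or 1 in calls    # first allele observed
        has_second = 2 in calls or 1 in calls   # second allele observed
        total += int(has_first) + int(has_second)
    return total / n_loci


def anneal_core(genotypes, core_size, start_temp,
                swaps_per_temp=1000, cooling=0.95, rng=None):
    """Select a core set by simulated annealing; returns (core, score)."""
    rng = rng or random.Random()
    n = len(genotypes)
    core = rng.sample(range(n), core_size)
    outside = [i for i in range(n) if i not in core]
    score = mean_alleles(core, genotypes)
    temp = start_temp
    while True:
        accepted = 0
        for _ in range(swaps_per_temp):
            ci = rng.randrange(len(core))
            oi = rng.randrange(len(outside))
            core[ci], outside[oi] = outside[oi], core[ci]  # tentative swap
            new_score = mean_alleles(core, genotypes)
            gain = new_score - score
            # Improvements are always kept; a decrease of size D = -gain
            # is kept with probability exp(-D / T).
            if gain > 0 or (gain < 0 and rng.random() < math.exp(gain / temp)):
                score = new_score
                accepted += 1
            else:
                core[ci], outside[oi] = outside[oi], core[ci]  # undo
        if accepted == 0:  # convergence: no swap made at this temperature
            return sorted(core), score
        temp *= cooling


if __name__ == "__main__":
    # Toy data: 12 individuals, 6 loci; three individuals carry rare alleles.
    toy = [[0] * 6 for _ in range(12)]
    toy[2] = [2, 2, 0, 0, 0, 0]
    toy[5] = [0, 0, 2, 2, 0, 0]
    toy[9] = [0, 0, 0, 0, 2, 2]
    print(anneal_core(toy, 3, 0.025, rng=random.Random(1)))
```

The sketch recomputes the score from scratch after every swap for clarity; with tens of thousands of SNPs an incremental update of per-locus allele counts would be much faster.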
Plastid markers
Plastid microsatellites included Sac-2, Sac-3, Sac-10, Sac-13, Sac-17, Sac-26 (Cesare et al. 2010), Mcp-2,
Mcp-5, Mcp-10, and Mcp-16 (Jiang et al. 2012). All forward primers included an 18 nucleotide universal
sequence (M13) at the 5’ end of the published sequence. A third primer, consisting only of the M13
sequence and labeled with the fluorophore 6-FAM™, VIC®, PET™, or NED™ (Applied Biosystems) was
included in each reaction. Reactions were performed using GoTaq 2X colorless master mix (Promega).
Sac-2, Sac-3, Sac-10, and Sac-13 were amplified in PCR using the published annealing temperature and
program (Cesare et al. 2010). All other plastid microsatellite markers were amplified using a touchdown
PCR protocol with the annealing temperature beginning at 65°C and ending at 55°C. PCR reactions were
pooled and diluted 10X in water, then run on an ABI 3730 Genetic Analyzer (Applied Biosystems) for
fragment analysis. Fragment sizes were called using the software STRand (Toonen and Hughes 2001).
Cytoplasmic markers were not mined from the RAD-seq data because missing data would interfere with
haplotype network analysis.
Analysis was performed on the 763 Miscanthus individuals for which there was no missing plastid data.
Plastid genotypes for all ten microsatellite markers were imported into the R package polysat (Clark and
Jasieniuk 2011), which was used to calculate an inter-individual distance matrix, using the proportion of
loci at which two individuals differed in genotype as the distance metric. (polysat was designed to
handle microsatellite data of any ploidy, including haploid.) Source code from the R package pegas
(Paradis 2010) was then used to generate a haplotype network. The R package polysat was used to
estimate the Simpson index of diversity.
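
The distance metric described above (the proportion of loci at which two haploid genotypes differ) is simple enough to sketch directly. The code below is not polysat's implementation. The diversity function uses one common form of Simpson's index of diversity, 1 minus the sum of squared haplotype frequencies, which is an assumption; the estimator in polysat may differ. The fragment sizes are invented.

```python
from collections import Counter


def locus_distance(hap_a, hap_b):
    """Proportion of loci at which two haplotypes carry different alleles."""
    return sum(a != b for a, b in zip(hap_a, hap_b)) / len(hap_a)


def distance_matrix(haplotypes):
    """Symmetric matrix of pairwise distances between individuals."""
    n = len(haplotypes)
    return [[locus_distance(haplotypes[i], haplotypes[j]) for j in range(n)]
            for i in range(n)]


def simpson_diversity(haplotypes):
    """1 - sum(p_i^2) over haplotype frequencies (assumed form)."""
    counts = Counter(tuple(h) for h in haplotypes)
    n = len(haplotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Invented fragment sizes (bp) at two plastid microsatellite loci.
    haps = [(100, 120), (100, 122), (100, 120), (104, 122)]
    for row in distance_matrix(haps):
        print(row)
    print(simpson_diversity(haps))
```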
References for Supplementary Materials and Methods
Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for
association mapping of complex traits in diverse samples. Bioinformatics 23: 2633–2635.
Cesare M, Hodkinson TR, Barth S. 2010. Chloroplast DNA markers (cpSSRs, SNPs) for Miscanthus,
Saccharum and related grasses (Panicoideae, Poaceae). Molecular Breeding 26: 539–544.
Clark LV, Jasieniuk M. 2011. POLYSAT: an R package for polyploid microsatellite analysis. Molecular
Ecology Resources 11: 562–566.
Earl DA, VonHoldt BM. 2011. STRUCTURE HARVESTER: a website and program for visualizing
STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources 4: 359–361.
Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using multilocus genotype
data: Linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
Jiang J-X, Wang Z-H, Tang B-R, Xiao L, Ai X, Yi Z-L. 2012. Development of novel chloroplast microsatellite
markers for Miscanthus species (Poaceae). American Journal of Botany 99: e230–e233.
Jombart T, Ahmed I. 2011. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data.
Bioinformatics 27: 3070–3071.
Jombart T, Devillard S, Balloux F. 2010. Discriminant analysis of principal components: a new method
for the analysis of genetically structured populations. BMC Genetics 11: 94.
Jost L. 2008. GST and its relatives do not measure differentiation. Molecular Ecology 17: 4015–4026.
Kabelka E, Franchino B, Francis DM. 2002. Two loci from Lycopersicon hirsutum LA407 confer resistance
to strains of Clavibacter michiganensis subsp. michiganensis. Phytopathology 92: 504–510.
Liu K, Muse SV. 2005. PowerMarker: an integrated analysis environment for genetic marker analysis.
Bioinformatics 21: 2128–2129.
Paradis E. 2010. pegas: an R package for population genetics with an integrated-modular approach.
Bioinformatics 26: 419–420.
Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language.
Bioinformatics 20: 289–290.
Peakall R, Smouse PE. 2006. genalex 6: genetic analysis in Excel. Population genetic software for
teaching and research. Molecular Ecology Notes 6: 288–295.
Pickrell JK, Pritchard JK. 2012. Inference of population splits and mixtures from genome-wide allele
frequency data. PLoS Genetics 8: e1002967.
Toonen R, Hughes S. 2001. Increased throughput for fragment analysis on an ABI PRISM (R) automated
sequencer using a membrane comb and STRand software. Biotechniques 31: 1320–1324.
Winter DJ. 2012. MMOD: an R library for the calculation of population differentiation statistics.
Molecular Ecology Resources 12: 1158–1160.