Supplementary material for: A footprint of past climate change on the diversity and population structure of Miscanthus sinensis

Lindsay V. Clark, Joe E. Brummer, Katarzyna Głowacka, Megan Hall, Kweon Heo, Junhua Peng, Toshihiko Yamada, Ji Hye Yoo, Chang Yeon Yu, Hua Zhao, Stephen P. Long, and Erik J. Sacks

Table S1. Miscanthus collections used in the present study.

Species                                  Origin                       # of entries
M. sinensis*                             China                        296
                                         Japan**                      136
                                         S. Korea                     184
                                         Taiwan                       1
                                         Ornamental cultivars***      43
                                         U.S. naturalized             43
M. floridulus                            China                        1
                                         Japan****                    1
                                         New Caledonia****            1
                                         Papua New Guinea****         1
M. sinensis × M. sacchariflorus hybrids  China*****                   11
                                         Ornamental cultivars******   34
M. sacchariflorus                        China                        8
                                         S. Korea                     3
                                         Ornamental cultivars         1
M. oligostachyus                         Ornamental cultivars         1
M. sp.                                   Taiwan****                   1
                                         unknown****                  1
Total                                                                 767

*Includes varieties condensatus (3), purpurascens (1) and transmorrisonensis (2; presumed to have originated in Taiwan).
**Includes 4 biomass cultivars.
***From U.S. nurseries.
****From USDA NPGS.
*****F1s and BC1 found in the wild.
******BC1s and BC2s with M. sinensis as the recurrent parent, as well as two F1s.

Table S2. Analysis of Molecular Variance (AMOVA) of 620 M. sinensis and M. floridulus individuals sampled in the native range, using 21,207 RAD-seq SNPs. The genetic clusters are the six identified by DAPC analysis on RAD-seq data. The two Japanese clusters were considered to be a separate region from the other four clusters.

Source of variation                     Variance   Proportion of variance   P
Among genetic clusters                  796        0.31                     < 0.001
  Among regions (Japan vs. mainland)    318        0.12                     < 0.001
  Within regions                        478        0.18                     < 0.001
Within genetic clusters                 1802       0.69
Total                                   2598

Table S3. Jost’s D statistic showing pairwise differentiation between M. sinensis genetic groups based on chloroplast haplotype frequency.
Yangtze Qinling Sichuan basin Korea, N China N Japan S Japan US nurseries, hybrid US nurseries, non-hybrid US naturalized SE China plus tropical Yangtze Qinling 0.74 0.14 0.40 0.85 0.89 0.88 US US nurseries, nurseries, nonhybrid hybrid Sichuan basin Korea, N China N Japan 0.07 0.99 0.78 0.56 0.99 0.87 0.66 0.60 0.45 0.99 0.93 0.97 0.94 1.00 0.81 0.97 1.00 1.00 0.94 0.83 0.61 0.31 0.99 1.00 1.00 1.00 1.00 0.78 0.10 S Japan 0.16

[Figure S1: two principal component analysis scatter plots, titled "50% minimum call rate" and "90% minimum call rate"; axis labels 6.05% variation, 4.70% variation, 4.48% variation, and 5.18% variation.]

Fig. S1. Choice of minimum call rate for RAD-seq markers generated from the UNEAK pipeline. Principal component analysis of the data is shown with a 90% minimum call rate and a 50% minimum call rate. Individuals are colored using the scheme from Fig. 2 of the main manuscript, with the addition of magenta for M. oligostachyus, bright green for Saccharum officinarum, and black for doubled haploid lines and others excluded from later analysis.

[Figure S2: panels A and B, "Value of BIC versus number of clusters" (x-axis: Number of clusters (K); y-axis: BIC); panel C, "Delta K versus number of clusters" (x-axis: Number of clusters (K); y-axis: Delta K).]

Fig. S2. Selection of number of clusters for Structure and DAPC analysis. A-B) Bayesian Information Criterion (BIC) versus number of clusters from DAPC analysis on A) 620 M. sinensis individuals from the native range and B) 765 Miscanthus individuals, including M. sacchariflorus and accessions from the U.S. C) Delta K values. Three Structure runs were performed on each of three random sets of 2000 RAD-seq markers at each value of K = 1 through 10.

[Figure S3: scatter plot titled "PCA"; axes labeled 44.0% of variation and 5.4% of variation; legend: Msa, Mol, hybrid US nurs., non-hybrid US nurs., natural hybrids.]

Fig. S3.
Principal component analysis demonstrating hybrid ancestry of M. sinensis (Msi) accessions from US nurseries. 170 RAD-seq loci that had fixed polymorphism between M. sacchariflorus (Msa) and M. oligostachyus (Mol) were used. Msi accessions were classified as hybrid or non-hybrid based on Structure results (Fig. 1 of main manuscript). Natural Msa × Msi hybrids from the set of Chinese accessions are included as a control.

[Figure S4: plot titled "Optimal core germplasm sets"; x-axis: Number of individuals; y-axis: Proportion of polymorphic loci captured; red points labeled "US nurseries" and "US nat."]

Fig. S4. Proportion of polymorphic loci captured in core sets vs. number of individuals in the set. Individuals were chosen for inclusion in core sets, out of 620 non-hybrid M. sinensis individuals from the native range, by a simulated annealing algorithm to maximize the average number of alleles per locus. Since SNP markers can have a maximum of two alleles per locus, the proportion of polymorphic loci is directly related to the average number of alleles per locus. Proportions of polymorphic loci captured by 76 individuals from US nurseries and 43 US naturalized individuals are indicated in red.

Dataset S1. Dataset with information on individuals and markers in the study. Geographic coordinates, species, cluster assignments, chloroplast haplotypes, and assignment to core germplasm sets are listed for all individuals. All RAD-seq and GoldenGate markers are listed, including sequences, and allele frequencies are included for markers that were retained for analysis. Provided as Microsoft Excel file.

Supplementary Materials and Methods

DNA extraction

Leaf samples were harvested, frozen at -80°C, then lyophilized. Dried leaf material was pulverized in a Geno/Grinder 2000 ball mill (SPEX SamplePrep, LLC; Metuchen, NJ). Genomic DNA was isolated from 30-40 mg of ground, freeze-dried leaves in 1.6 ml microfuge tubes using a CTAB method modified from Kabelka et al. (2002).
DNA was quantified using a Quant-iT™ dsDNA Picogreen® Kit (Life Technologies) and diluted to 100 ng/μl for GoldenGate™ and RAD-seq analysis and further to 10 ng/μl for the amplification of plastid microsatellite markers.

Sequencing library preparation

Digestion and ligation were performed on 96-well plates, with each well corresponding to a sequence barcode, in order to multiplex 95 individuals into one library with one unused barcode as a contamination control and library identifier. Most individuals were included in only one library, although some that had low read counts in their first run were duplicated in later libraries. 250 ng of DNA from each individual was digested with 5 U each of PstI-HF and MspI in a 15 μl total volume of 1X NEBuffer 4 (New England Biolabs) at 37°C for three hours, followed by a 20-minute inactivation step at 80°C. 1.5 pmol of barcoded PstI adapter, 5 pmol MspI Y-adapter, and 200 U T4 DNA ligase (New England Biolabs) were then added to each well for a total volume of 25 μl of 0.4X T4 ligase buffer, 0.6X NEBuffer 4, and 1 mM ATP. Ligation reactions were incubated at 25°C for two hours followed by a 20-minute inactivation step at 65°C. All wells were then pooled into one tube and mixed. 40 μl of the mixture was run on a 2% agarose gel. The smear from 200-500 bp was cut out of the gel with a razor blade and purified using a Qiagen Gel Extraction Kit. 3 μl of the purified DNA was amplified in a 50 μl PCR reaction using Phusion Master Mix (New England Biolabs) and universal Illumina primers. The thermal cycling program was 98°C for 30 seconds; followed by 15 cycles of 98°C for 10 seconds, 65°C for 30 seconds, and 72°C for 30 seconds; followed by 72°C for 5 minutes. The PCR product was extracted from a 2% agarose gel as above to eliminate primer-dimers.
Library concentration was determined using a Quant-iT Picogreen kit (Life Technologies) and average fragment size was estimated using a Bioanalyzer (Agilent Technologies) in order to dilute the library to 10 nM. Quantitative PCR and sequencing on an Illumina HiSeq 2000 with 100 bp single-end reads were performed at the University of Illinois Roy J. Carver Biotechnology Center DNA Sequencing Unit.

UNEAK pipeline

PstI-MspI was selected as the enzyme set for creating tag count files from FASTQ files. Tag counts were merged across 773 taxa using the UMergeTaxaTagCountPlugin with a minimum tag count of one, yielding 77,213,063 unique tags. 1,297,117 reciprocal tag pairs (SNPs) were found using the UTagCountToTagPairPlugin with the error tolerance rate set at the default of 0.03. UMapInfoToHapMapPlugin was used to further filter the data, with a minimum minor allele frequency of 0.001. After preliminary exploration of the minimum call rate (mnC) for UMapInfoToHapMapPlugin, mnC values of 50% and 90% were tested with the final data set, yielding 22,929 and 5135 SNPs, respectively. Any SNPs that appeared heterozygous in at least one of the three confirmed doubled haploid lines, indicating that they represented paralogous loci, were removed from the data set. 1722 SNPs were removed this way from the 50% mnC set, and 702 from the 90% mnC set. In the 50% mnC set, the mean observed heterozygosity of the 1722 removed SNPs was 48%, whereas the mean observed heterozygosity of the remaining 21,207 SNPs was 11%.

SNP analysis

Out of the set of 21,207 RAD-seq SNPs, 6000 were chosen at random and divided into three sets of 2000 for determination of the appropriate number of clusters (K) in STRUCTURE 2.3.4 (Falush et al. 2003). 765 Miscanthus samples were analyzed with this software. Each set of 2000 markers was subjected to three runs each at K = 1 through 10 with a burn-in of 10,000 MCMC repetitions followed by an additional 50,000 MCMC repetitions under default conditions.
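The doubled haploid filter and the random marker subsets described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the genotype encoding (0 or 2 for homozygotes, 1 for heterozygotes, None for missing calls) and the function names drop_paralogous_snps and random_marker_sets are hypothetical and were chosen only for this example.

```python
import random


def drop_paralogous_snps(genotypes, dh_lines):
    """Keep SNPs that are never called heterozygous in a doubled haploid line.

    genotypes: dict of SNP id -> dict of sample name -> call, where a call is
    0 or 2 (homozygous), 1 (heterozygous) or None (missing). This encoding is
    an assumption for illustration, not the UNEAK output format.
    """
    kept = []
    for snp, calls in genotypes.items():
        # A doubled haploid should be fully homozygous, so a heterozygous
        # call there suggests that two paralogous loci were merged.
        if any(calls.get(line) == 1 for line in dh_lines):
            continue
        kept.append(snp)
    return kept


def random_marker_sets(markers, n_sets=3, set_size=2000, seed=0):
    """Draw n_sets non-overlapping random subsets of set_size markers each."""
    rng = random.Random(seed)
    # Sampling without replacement guarantees that the subsets are disjoint.
    chosen = rng.sample(list(markers), n_sets * set_size)
    return [chosen[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


if __name__ == "__main__":
    toy = {
        "snp1": {"DH1": 0, "DH2": 2, "ind1": 1},
        "snp2": {"DH1": 1, "DH2": 0, "ind1": 1},  # heterozygous in DH1
        "snp3": {"DH1": None, "DH2": 2, "ind1": 0},
    }
    print(drop_paralogous_snps(toy, ["DH1", "DH2"]))
    print([len(s) for s in random_marker_sets(range(21207))])
```

Missing calls in a doubled haploid line are treated here as uninformative rather than as evidence of paralogy, which is also an assumption of the sketch.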
Delta K was calculated with Structure Harvester (Earl and VonHoldt 2011) and used to determine the optimum K. At the selected value of K, all 21,207 markers were subjected to six runs under the same conditions. Q values were examined to confirm consistency between runs, then averaged. Inter-species hybrids between M. sinensis and M. sacchariflorus were identified by having a Q value of at least 0.05 for the M. sacchariflorus cluster.

Clustering of individuals was performed in the R package adegenet (Jombart et al. 2010) using the glPca, find.clusters, and dapc functions. The n.start argument for find.clusters was set to 200 or 500 to make the function converge on a single answer for six or seven clusters, respectively. The first 200 principal components were retained for DAPC analysis based on the recommendation in the adegenet documentation that not more than one third of the total number of principal components be retained.

A cladogram of all non-hybrid individuals was generated using the Neighbor-Joining method in TASSEL 3.0 (Bradbury et al. 2007) and plotted using the R package ape (Paradis et al. 2004).

Jost’s D (Jost 2008) was calculated with the R package mmod (Winter 2012). Diversity (expected heterozygosity) was estimated from allele frequencies using the glMean function in adegenet (Jombart and Ahmed 2011). FIS was estimated from observed and expected heterozygosities calculated in adegenet. RAD-seq markers were used for diversity estimates given that ascertainment bias would be expected in the GoldenGate markers. GoldenGate markers were used for FIS estimates because many heterozygotes are miscalled as homozygotes in RAD-seq analysis, leading to an upward bias of FIS when estimated from RAD-seq data. FST values differentiating each Asia Msi cluster from the rest of the Asia Msi dataset were estimated in adegenet using allele frequencies.

An inter-individual Euclidean distance matrix was calculated from the RAD-seq data in R.
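As a minimal sketch of what an inter-individual Euclidean distance matrix on SNP genotypes looks like, the Python example below computes one from allele counts. It is not the R code used in the study. The encoding (0, 1 or 2 copies of one allele, None for a missing call), the function names, and in particular the treatment of missing calls (skip them and rescale to the full number of loci) are assumptions, since the text does not state how missing data were handled.

```python
import math


def euclidean_distance(geno_a, geno_b):
    """Euclidean distance between two individuals' genotype vectors.

    Genotypes are allele counts (0, 1 or 2) with None for missing calls.
    Loci missing in either individual are skipped and the sum of squares is
    rescaled to the full number of loci (an assumption of this sketch).
    """
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must have the same length")
    shared = [(a, b) for a, b in zip(geno_a, geno_b)
              if a is not None and b is not None]
    if not shared:
        return float("nan")  # no locus called in both individuals
    sum_sq = sum((a - b) ** 2 for a, b in shared)
    return math.sqrt(sum_sq * len(geno_a) / len(shared))


def euclidean_distance_matrix(genotypes):
    """Return sorted individual names and the square distance matrix."""
    names = sorted(genotypes)
    matrix = [[euclidean_distance(genotypes[r], genotypes[c]) for c in names]
              for r in names]
    return names, matrix


if __name__ == "__main__":
    toy = {
        "ind1": [0, 1, 2, 0],
        "ind2": [2, 1, 0, None],
        "ind3": [0, 1, 2, 1],
    }
    names, matrix = euclidean_distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

A matrix of this form (symmetric, with zeros on the diagonal) is the kind of input that a distance-based AMOVA takes.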
AMOVA was performed in pegas (Paradis 2010). GenAlEx 6.41 (Peakall and Smouse 2006) was used to confirm the results and calculate p-values. The six genetic clusters identified by DAPC analysis on RAD-seq data were considered to be populations. The two Japanese populations were considered to be a separate region from the other four populations.

TreeMix (Pickrell and Pritchard 2012) was used to model divergence and migration between the groups of native M. sinensis and M. floridulus individuals identified by DAPC on RAD-seq data. RAD-seq and GoldenGate markers were combined into one dataset for analysis. SE China was set as the outgroup. Three migration edges were assumed because the results were most reproducible and made the most geographic sense. One hundred bootstrapped sets of 500 SNPs were run, and bootstrap values for the graph were calculated with ape (Paradis et al. 2004).

The simulated annealing algorithm from PowerMarker (Liu and Muse 2005) was implemented in R in order to select core germplasm sets using RAD-seq SNPs. The algorithm searches for sets of individuals that maximize the average number of alleles per locus. First, a given number of individuals is randomly selected from the full set. A randomly chosen individual from the core is then selected to be swapped out for another randomly chosen individual from the full set. If the swap would increase the number of alleles found in the core set, it is always made; if it would decrease the number of alleles, the probability of the swap being made is $e^{-D/T}$, where $D$ is the amount of the decrease and $T$ is a “temperature” set by the user. The swap step is performed 1000 times, and then the temperature is reduced by a factor of 0.95. The process is repeated and the temperature is lowered until no swaps are successfully made at the current temperature (convergence).
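The steps above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the authors' R code or the PowerMarker source. The genotype encoding (0 or 2 for the two homozygotes, 1 for heterozygotes, None for missing), the function names, the reading of "reduced by a factor of 0.95" as multiplying the temperature by 0.95, and the rejection of swaps that leave the score unchanged (a case the description does not cover) are all assumptions of this sketch.

```python
import math
import random


def mean_alleles_per_locus(core, genotypes):
    """Average number of distinct alleles per locus seen in the core set.

    genotypes[name] is a list of calls per SNP: 0 or 2 (homozygous for one
    allele or the other), 1 (heterozygous), or None (missing).
    """
    n_loci = len(genotypes[core[0]])
    total = 0
    for locus in range(n_loci):
        calls = [genotypes[name][locus] for name in core]
        has_first = any(c is not None and c < 2 for c in calls)
        has_second = any(c is not None and c > 0 for c in calls)
        total += has_first + has_second
    return total / n_loci


def select_core_set(genotypes, core_size, start_temp,
                    swaps_per_temp=1000, cooling=0.95, seed=0):
    """Simulated annealing search for a core set of core_size individuals."""
    rng = random.Random(seed)
    everyone = sorted(genotypes)
    core = rng.sample(everyone, core_size)
    score = mean_alleles_per_locus(core, genotypes)
    temp = start_temp
    while True:
        swaps_made = 0
        for _ in range(swaps_per_temp):
            candidate = rng.choice(everyone)
            if candidate in core:
                continue  # swapping in a current member changes nothing
            trial = list(core)
            trial[rng.randrange(core_size)] = candidate
            trial_score = mean_alleles_per_locus(trial, genotypes)
            decrease = score - trial_score
            if decrease < 0:
                accept = True  # improvement: always swap
            elif decrease > 0:
                # Worse set: accept with probability exp(-D / T).
                accept = rng.random() < math.exp(-decrease / temp)
            else:
                accept = False  # neutral swap: rejected (assumption)
            if accept:
                core, score = trial, trial_score
                swaps_made += 1
        if swaps_made == 0:
            return core, score  # convergence at the current temperature
        temp *= cooling  # assumed meaning of "reduced by a factor of 0.95"


if __name__ == "__main__":
    toy = {name: [0, 0, 0, 0] for name in "ACDEF"}
    toy["B"] = [2, 2, 2, 2]  # the only carrier of the second allele
    print(select_core_set(toy, core_size=2, start_temp=0.025))
```

Recomputing the score from scratch for every trial set keeps the sketch short; with thousands of SNPs and hundreds of individuals, per-locus allele counts would normally be updated incrementally instead.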
With the average number of alleles per locus being between 1 and 2, convergence was observed at temperatures on the order of $10^{-4}$ to $10^{-5}$. Starting temperatures were chosen empirically; these were 0.025 for smaller sets of individuals (<58) and 0.002 for larger sets of individuals.

Plastid markers

Plastid microsatellites included Sac-2, Sac-3, Sac-10, Sac-13, Sac-17, Sac-26 (Cesare et al. 2010), Mcp-2, Mcp-5, Mcp-10, and Mcp-16 (Jiang et al. 2012). All forward primers included an 18 nucleotide universal sequence (M13) at the 5’ end of the published sequence. A third primer, consisting only of the M13 sequence and labeled with the fluorophore 6-FAM™, VIC®, PET™, or NED™ (Applied Biosystems), was included in each reaction. Reactions were performed using GoTaq 2X colorless master mix (Promega). Sac-2, Sac-3, Sac-10, and Sac-13 were amplified in PCR using the published annealing temperature and program (Cesare et al. 2010). All other plastid microsatellite markers were amplified using a touchdown PCR protocol with the annealing temperature beginning at 65°C and ending at 55°C. PCR reactions were pooled and diluted 10X in water, then run on an ABI 3730 Genetic Analyzer (Applied Biosystems) for fragment analysis. Fragment sizes were called using the software STRand (Toonen and Hughes 2001). Cytoplasmic markers were not mined from the RAD-seq data because missing data would interfere with haplotype network analysis.

Analysis was performed on the 763 Miscanthus individuals for which there was no missing plastid data. Plastid genotypes for all ten microsatellite markers were imported into the R package polysat (Clark and Jasieniuk 2011), which was used to calculate an inter-individual distance matrix, using the proportion of loci at which two individuals differed in genotype as the distance metric. (polysat was designed to handle microsatellite data of any ploidy, including haploid.)
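The distance metric described above (the proportion of loci at which two individuals differ) can be illustrated with a short Python sketch. It does not use or reproduce the polysat API; the function names and the fragment sizes in the toy data are hypothetical, and complete haploid genotypes are assumed, matching the restriction to individuals with no missing plastid data.

```python
def proportion_differing(hap_a, hap_b):
    """Proportion of loci at which two haploid genotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same loci")
    differing = sum(1 for a, b in zip(hap_a, hap_b) if a != b)
    return differing / len(hap_a)


def haplotype_distance_matrix(haplotypes):
    """Return sorted individual names and the square distance matrix."""
    names = sorted(haplotypes)
    matrix = [[proportion_differing(haplotypes[r], haplotypes[c])
               for c in names] for r in names]
    return names, matrix


if __name__ == "__main__":
    # Hypothetical fragment sizes (bp) at ten plastid microsatellite loci.
    toy = {
        "ind1": [201, 154, 180, 222, 143, 167, 190, 210, 175, 160],
        "ind2": [201, 155, 180, 222, 143, 168, 190, 211, 175, 160],
        "ind3": [201, 154, 180, 222, 143, 167, 190, 210, 175, 160],
    }
    names, matrix = haplotype_distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, row)
```

Individuals at distance zero share a haplotype, so collapsing them gives the set of distinct haplotypes that a network would connect.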
Source code from the R package pegas (Paradis 2010) was then used to generate a haplotype network. The R package polysat was used to estimate the Simpson index of diversity.

References for Supplementary Materials and Methods

Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buckler ES. 2007. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23: 2633–2635.
Cesare M, Hodkinson TR, Barth S. 2010. Chloroplast DNA markers (cpSSRs, SNPs) for Miscanthus, Saccharum and related grasses (Panicoideae, Poaceae). Molecular Breeding 26: 539–544.
Clark LV, Jasieniuk M. 2011. POLYSAT: an R package for polyploid microsatellite analysis. Molecular Ecology Resources 11: 562–566.
Earl DA, VonHoldt BM. 2011. STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources 4: 359–361.
Falush D, Stephens M, Pritchard JK. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164: 1567–1587.
Jiang J-X, Wang Z-H, Tang B-R, Xiao L, Ai X, Yi Z-L. 2012. Development of novel chloroplast microsatellite markers for Miscanthus species (Poaceae). American Journal of Botany 99: e230–e233.
Jombart T, Ahmed I. 2011. adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27: 3070–3071.
Jombart T, Devillard S, Balloux F. 2010. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11: 94.
Jost L. 2008. GST and its relatives do not measure differentiation. Molecular Ecology 17: 4015–4026.
Kabelka E, Franchino B, Francis DM. 2002. Two loci from Lycopersicon hirsutum LA407 confer resistance to strains of Clavibacter michiganensis subsp. michiganensis. Phytopathology 92: 504–510.
Liu K, Muse SV. 2005. PowerMarker: an integrated analysis environment for genetic marker analysis. Bioinformatics 21: 2128–2129.
Paradis E. 2010. pegas: an R package for population genetics with an integrated-modular approach. Bioinformatics 26: 419–420.
Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20: 289–290.
Peakall R, Smouse PE. 2006. genalex 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes 6: 288–295.
Pickrell JK, Pritchard JK. 2012. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genetics 8: e1002967.
Toonen R, Hughes S. 2001. Increased throughput for fragment analysis on an ABI PRISM® automated sequencer using a membrane comb and STRand software. Biotechniques 31: 1320–1324.
Winter DJ. 2012. MMOD: an R library for the calculation of population differentiation statistics. Molecular Ecology Resources 12: 1158–1160.