Bombus RAD tag Diversity Supplemental Material J. Lozier

Supplementary Material

Table S1 Sample locations for individuals used for RAD tag sequencing

Species           County       State           Latitude  Longitude  Sequence Reads (x10^6)
B. impatiens      Bibb         Alabama         33.05     -87.01     2.57
                  Washington   Arkansas        35.82     -94.16     5.11
                  Hartford     Connecticut     41.77     -72.90     1.74
                  Crawford     Iowa            41.99     -95.39     4.00
                  Ogle         Illinois        41.98     -89.36     4.10
                  Peoria       Illinois        40.83     -89.80     7.42
                  Montgomery   Indiana         39.96     -87.07     3.00
                  Ripley       Indiana         39.07     -85.44     6.63
                  Osage        Kansas          38.64     -95.60     3.31
                  McCracken    Kentucky        37.03     -88.76     3.08
                  Franklin     Kentucky        38.16     -84.94     3.24
                  Winona       Minnesota       43.98     -91.43     6.13
                  Franklin     Missouri        38.48     -90.82     16.6
                  Stokes       North Carolina  36.47     -80.39     1.58
                  Seneca       New York        42.68     -76.85     4.31
                  Belmont      Ohio            40.00     -81.14     4.73
                  Cameron      Pennsylvania    41.40     -78.03     3.80
                  Kershaw      South Carolina  34.16     -80.57     1.33
                  Cocke        Tennessee       35.92     -82.98     5.09
                  Appomattox   Virginia        37.26     -78.68     1.13
                  Windsor      Vermont         43.41     -72.71     3.62
                  Dane         Wisconsin       43.04     -89.43     3.17
B. pensylvanicus  Bibb         Alabama         33.05     -87.01     2.40
                  Prowers      Colorado        38.11     -102.31    7.65
                  Columbia     Florida         29.89     -82.67     7.02
                  Crawford     Iowa            41.99     -95.39     2.95
                  Union        Illinois        37.45     -89.12     8.27
                  Piatt        Illinois        40.01     -88.64     7.88
                  Cherokee     Kansas          37.34     -94.83     5.23
                  Saline       Kansas          38.70     -97.43     6.80
                  Bossier      Louisiana       32.53     -93.68     3.32
                  Lawrence     Missouri        36.96     -93.68     7.24
                  Boone        Missouri        38.99     -92.52     4.17
                  Franklin     Missouri        38.48     -90.82     7.27
                  Clay         Mississippi     33.55     -88.64     8.02
                  Lenoir       North Carolina  35.29     -77.79     6.90
                  Howard       Nebraska        41.13     -98.55     1.72
                  Wayne        Ohio            40.91     -81.98     5.61
                  Cleveland    Oklahoma        35.25     -97.27     6.72
                  Marion       South Carolina  34.23     -79.15     4.47
                  Jasper       South Carolina  32.59     -81.21     5.88
                  Stanley      South Dakota    44.29     -100.33    2.41
                  Bastrop      Texas           30.35     -97.38     7.69
                  Eastland     Texas           32.10     -98.96     5.08
                  Greensville  Virginia        36.64     -77.56     3.83

Impacts of read depth variation

B. pensylvanicus RAD tags received, on average, higher sequencing coverage than B. impatiens RAD tags (in part due to the greater number of RAD tags in B. impatiens). Although sequencing depth in both species is very high (average >100x reads per site per bee; Fig. S1), which should facilitate accurate calling of heterozygous sites, I aimed to examine how this difference might bias diversity estimates. I filtered the SNP set to consider only those SNPs that received 125-200X read depth, a region of substantial coverage overlap for the two species (Fig. S1). I focused on comparing data sets for Filter Set 4 (20 individuals per species with the highest sequencing coverage; maximum of 5% missing data per SNP) to avoid sample size differences. Overall diversity estimates were slightly reduced in both species by this standardization, but π_SNP (heterozygosity per SNP) was similar for the two species, and π_RAD (average nucleotide diversity per entire RAD tag) was slightly higher in B. pensylvanicus, as in the full data set (Table S2). The site frequency spectra are also highly similar for the two data sets, with a slight increase in the proportions of low-frequency alleles in both species (Fig. S2), explaining the slightly lower nucleotide diversity estimates. Thus, the relative comparison is not altered by limiting the analysis to SNPs with similar coverage.

Fig. S1 Average sequencing coverage per single nucleotide polymorphism per individual (a) and number of single nucleotide polymorphisms per RAD tag locus (b) following initial filtering (filter set 1). Invariant RAD tag loci not shown.
Table S2 Parameters estimated by filtering Filter Set 4 to mean SNP coverage of 125-200X per SNP per individual

Species  No. SNPs  Mean read depth per SNP per Ind  π_SNP  π_SNP SE  No. RADtags  π_RAD   π_RAD SE
B imp    3029      158.508                          0.133  0.002     1932         0.0023  0.000055
B pen    2106      165.093                          0.136  0.003     1211         0.0026  0.000078

Figure S2 SNP minor allele frequency histograms (x-axis: minor allele frequency; y-axis: proportion of SNPs) with the "125-200X per site per individual read depth" filter. Shown are the original 'filter set 4' data (solid colors; series impfil4 and penfil4) and filter set 4 using the depth cutoff of 125-200X (solid + hatch; series impfil4_125-200x and penfil4_125-200x).

Simulations of Diversity Under Population Declines

Coalescent simulations were conducted in SIMCOAL 2.1.2. A historical population size of 50,000 haplodiploid individuals (= 75,000 chromosomes) was assumed, which provides a per-site θ = 0.0015 with μ = 1 x 10^-8 per site per generation in a stable population, similar to that seen in the present sequence data (see Results). A microsatellite μ = 5 x 10^-5 per generation was applied, selected to produce a mean gene diversity of ~0.75 for a stable population. A series of instantaneous bottleneck sizes was considered, each starting 75 generations before present. For each scenario, 100 replicates of either ten microsatellite loci or 10,000 90 bp DNA sequence loci (no intralocus recombination, 3:1 Ts/Tv ratio) were simulated, and mean microsatellite heterozygosity or per-site nucleotide diversity for each replicate was analyzed in ARLEQUIN (references in main text).

Contemporary eastern Bombus populations are essentially unstructured, with near-zero FST and D, no clustering in Bayesian approaches like STRUCTURE, and essentially no isolation by distance. For an unstructured population with starting diversity similar to B. impatiens (stable), results show that massive bottlenecks to 50-500 individuals would be needed to reduce microsatellite heterozygosity by ~20%, the proportional difference currently observed between B. pensylvanicus and B. impatiens, whether individual populations or a pool of single individuals from separate populations are examined (see main text). Per-site nucleotide diversities are smaller than multilocus heterozygosity estimates, and thus the absolute difference in diversity is smaller, but the proportional loss in diversity is similar for RAD tag-like data (Fig. S3). Also note that the RAD tag-like data had less variability among replicates, with no overlap in confidence regions among scenarios, as expected given the far greater number of loci (Fig. S3). Thus RAD tags should be capable of resolving differences in diversity due to demographic effects of sufficient impact to alter microsatellite gene diversity.

Table S3 Coalescent simulation parameter summary

Locus Type                Demography       Ancestral Ne (haplodiploids)  Current Ne  Sample Size (diploids)  Independent Loci  Mutation Rate
Msat                      Stable           50,000                        50,000      20                      10                5.00E-05
Msat                      Bneck-75 gen bp  50,000                        5,000       20                      10                5.00E-05
Msat                      Bneck-75 gen bp  50,000                        500         20                      10                5.00E-05
Msat                      Bneck-75 gen bp  50,000                        50          20                      10                5.00E-05
90bp sequence (RAD-like)  Stable           50,000                        50,000      20                      10,000            1.00E-08
90bp sequence (RAD-like)  Bneck-75 gen bp  50,000                        500         20                      10,000            1.00E-08
90bp sequence (RAD-like)  Bneck-75 gen bp  50,000                        50          20                      10,000            1.00E-08

Figure S3 Box plots of mean diversity from 100 simulation replicates per scenario for microsatellite (left) and RADtag-like (right) data under a range of bottlenecks from an ancestral Ne = 50,000 haplodiploids. See Table S3 for simulation parameters.
Note that, based on microsatellite simulation results, the 5,000 Ne bottleneck was not performed for the more computationally intensive RADtag-like data, as differences from the stable population simulation should be small.

Quality checks to ensure integrity of sample data

One concern with RAD tag and other barcoded NGS library approaches is the potential for laboratory or computational errors to result in erroneous data sets. Such errors can be difficult to detect because the massive amounts of data produced by these methods prohibit visual screening of samples for inconsistencies. Of particular interest in the present study is the potential for barcode swaps during library preparation or demultiplexing to result in mixed data sets, which would corrupt any between-species comparisons. I took advantage of the evolutionary distance between B. impatiens and B. pensylvanicus to check the integrity of sequence data within each species. For this analysis, I took the two data sets reported in the study (B. impatiens aligned against the B. impatiens RAD reference and B. pensylvanicus aligned against the B. pensylvanicus RAD reference) and also made use of alignments for both B. impatiens and B. pensylvanicus against a common B. impatiens reference (not otherwise discussed in the main paper), produced using Bowtie and filtered as for Filter Set 1 (see Methods). The latter analysis produces data of a characteristic form: a large and nearly identical fraction (~45%) of the called SNPs are fixed for the alternate allele state across all B. pensylvanicus individuals, as one might expect for such deeply divergent lineages (Figure S4), while no B. impatiens individuals show this pattern.

Figure S4 Proportion of sites fixed for the non-B. impatiens-reference allele (alt allele; y-axis) in B. impatiens and B. pensylvanicus individuals when reads from individuals in both species are simultaneously aligned against a B. impatiens reference. Each point = 1 individual.

A PERL script by C. Bergey (https://code.google.com/p/vcf-tab-to-fasta/) was then used to convert the vcf files to fasta format (missing data = -, heterozygotes = IUPAC ambiguities) as a second approach to detect outlier individuals. MEGA 5.2.2 (Tamura et al., 2011) was used to construct neighbor-joining trees, deleting missing data and calculating branch lengths as the number of differences. Figure S5 shows the resulting trees. As expected for very-low-FST species, the data sets used in the main paper show unstructured star-like trees with no geographical patterns apparent and small numbers of differences separating individuals. The lower tree shows an analysis with both species aligned to the B. impatiens reference; it shows a large divergence of several thousand fixed base differences between the two species, and no individuals sorting incorrectly among species. Such patterns suggest no major errors in read identity across samples consistent with barcode swaps. Subsequent errors in post-alignment processing of BAM files could be detected by comparing VCF SNP call files with those produced by independent analysis of the original alignment files with ANGSD. Overall, similar sites were called as variable in both sets, and comparison of allele frequencies in the two analyses shows comparable nucleotide diversities (main text).
Differences appear mostly due to the larger number of regions included in the ANGSD analysis, to minor differences for sites that were filtered out of the VCF-formatted genotyping data (e.g., due to low coverage, which ANGSD allows and takes into account), or to the fact that ANGSD estimates nucleotide diversities by taking into account multiple factors (sequence depth, quality, etc.) rather than calculating them solely from allele counts. These checks suggest that nothing erroneous occurred during post-alignment processing of BAM files. The vcf files for the two species, the per-site ANGSD diversity output, and the original alignments are located on DRYAD. Based on these data checks, incorrect pooling of data among samples and species would also presumably result in larger numbers of SNPs per RAD tag, as well as elevated intermediate allele frequencies in the site frequency spectra, rather than the typical "L-shaped" distributions observed here. Together, these analyses suggest that the present RAD tag data contain no major contamination errors that would affect inferences.

Figure S5 NJ trees for RAD tag reads aligned to within species (top) and for both species together aligned against the B. impatiens reference (bottom). Scale provided as number of differences.

Tamura K, Peterson D, Peterson N et al. (2011) MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution, 28, 2731–9.
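
The parameter choices in "Simulations of Diversity Under Population Declines" can be checked with simple arithmetic. The sketch below is illustrative only and is not part of the SIMCOAL or ARLEQUIN analyses. It assumes an equal sex ratio (so that N haplodiploid individuals carry 1.5N chromosomes), a per-site diversity parameter of 2 x (number of chromosomes) x (mutation rate), and the classical drift expectation that heterozygosity decays by a factor of (1 - 1/chromosomes) per generation with mutation ignored. The function names are hypothetical.

```python
def haplodiploid_chromosomes(n_individuals):
    # Assumption: equal sex ratio, so half the individuals are diploid
    # females (2 copies) and half are haploid males (1 copy): 1.5 * N.
    return 1.5 * n_individuals


def theta_per_site(n_chromosomes, mutation_rate):
    # Assumption: expected per-site diversity = 2 * (gene copies) * mu.
    return 2.0 * n_chromosomes * mutation_rate


def retained_heterozygosity(n_chromosomes, generations):
    # Classical drift-only expectation (mutation ignored): the fraction of
    # heterozygosity remaining after a number of generations at fixed size.
    return (1.0 - 1.0 / n_chromosomes) ** generations


ancestral = haplodiploid_chromosomes(50_000)   # 75,000 chromosomes
theta = theta_per_site(ancestral, 1e-8)        # about 0.0015 per site

# Rough expectation after 75 generations at each bottleneck size
for n in (5_000, 500, 50):
    kept = retained_heterozygosity(haplodiploid_chromosomes(n), 75)
    print(f"Ne = {n}: about {100 * (1 - kept):.0f}% of heterozygosity lost")
```

Under these assumptions, a bottleneck to 500 individuals loses roughly a tenth of the starting heterozygosity over 75 generations and a bottleneck to 50 loses well over half, which brackets the ~20% difference discussed above. The coalescent simulations remain the reference result; this is only a back-of-the-envelope cross-check.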