Environmental adaptation in Chinook salmon (Oncorhynchus tshawytscha) throughout their North American range

Benjamin C. Hecht1,2, Andrew P. Matala1, Jon E. Hess1, and Shawn R. Narum1

1 Columbia River Inter-Tribal Fish Commission and 2 University of Idaho, Hagerman Fish Culture Experiment Station, Hagerman, ID 83332, USA

Supplementary File S1

RAD library preparation and Sequencing

Restriction-site associated DNA (RAD) (Miller et al. 2007; Baird et al. 2008) libraries were prepared for Illumina HiSeq 1500 sequencing using a protocol similar to those previously published (Baird et al. 2008; Miller et al. 2012) but modified as described in Hecht et al. (2013) to allow for lower starting DNA concentrations when tissue was limited for some individuals or populations. Here libraries were prepared with a starting DNA concentration per sample of 150, 250, or 500 ng depending on sample quality and quantity. Samples were digested individually with the restriction enzyme SbfI-HF (NEB, Ipswich, MA, USA) and individually barcoded using a 6 nt barcode adapter sequence. Digested and barcoded samples of the same starting concentration were pooled into libraries of between 36 and 96 samples, where no two samples within a library were assigned the same barcode sequence, and each barcode sequence within a library differed by at least two bases from every other barcode sequence. Libraries were mechanically sheared to generate DNA fragment lengths of 200-700 bp using a Bioruptor 300 sonicator (Diagenode, Denville, NJ, USA), and fragments were size selected and isolated using an Agencourt AMPure XP bead purification system (Beckman Coulter, Brea, CA, USA). The remainder of the RAD library preparation follows the previously defined protocol of Miller et al. (2012).
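The barcode constraint described above (unique barcodes within a library, each pair differing by at least two bases) amounts to a pairwise Hamming distance check. The following is a minimal sketch of such a check, not the procedure used in the study; the function names and the example barcodes are hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def check_barcodes(barcodes, min_distance=2):
    """Return every barcode pair closer than min_distance.

    An empty list means the library satisfies the constraint. Duplicate
    barcodes are reported too, because their distance is 0.
    """
    return [
        (a, b, hamming(a, b))
        for a, b in combinations(barcodes, 2)
        if hamming(a, b) < min_distance
    ]


# Hypothetical 6 nt barcodes for illustration only (not the study's barcodes).
library = ["AAGGTC", "CCTTGA", "GGAACT", "AAGGTA"]
conflicts = check_barcodes(library)
print(conflicts)
```

With a minimum distance of two, a single sequencing error in a barcode cannot turn it into another valid barcode from the same library, which is presumably why such a constraint is useful for de-multiplexing.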
Prior to sequencing, RAD libraries were quantified using real time PCR and a Kapa Illumina Library Quantification Kit following recommended protocols (Kapa Biosystems Inc., Woburn, MA, USA) on an ABI 7900HT Sequence Detection System (Life Technologies, Grand Island, NY, USA). Libraries were sequenced on an Illumina HiSeq 1500 sequencer (Illumina Inc., San Diego, CA, USA) at a single read length of 100 bp. Depending on the quality and quantity of the sequence generated, some libraries were sequenced in more than one lane to reach target read depths per individual of approximately 2 million reads. In total 2,775 samples were sequenced in 44 libraries across 63 Illumina flow cell lanes.

De novo SNP discovery and genotyping

While a RADtag based SNP catalog had previously been constructed for Chinook salmon (Brieuc et al. 2014), here we included samples from a broader range of populations from Alaska to California in order to identify informative loci throughout the entire species range. We therefore identified and genotyped SNP loci de novo, by constructing a SNP catalog from individuals spanning the geographic range of populations in our collection. This was performed using the software pipeline Stacks v.1.03 (Catchen et al. 2011, 2013). Raw Illumina reads were first scrutinized for quality using the software program FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). It was determined that the 3’ 15-20 bases of Illumina sequence reads had in general reduced quality scores relative to the 5’ 80-85 base positions across our sequence data. We therefore truncated our sequence reads to 80 bases by removing the 3’ sequence most prone to error.
In addition to truncating, our reads were quality filtered and de-multiplexed using the ‘process_radtags’ program of the Stacks pipeline and included options for cleaning the data by discarding any read with an uncalled base (-c), discarding reads with low quality scores (-q), and rescuing barcodes and partial restriction enzyme recognition sites (-r). All other parameters and options were executed with the default values as outlined in the manual for the program (http://creskolab.uoregon.edu/stacks).

After individual sample reads were quality filtered, trimmed, and de-multiplexed, sequences for each sample were submitted to the ‘ustacks’ module of Stacks to identify loci. In ‘ustacks’, the deleveraging (-d) and removal (-r) algorithms were applied to filter out those sequences that were likely to be paralogous and highly repetitive. We required the minimum depth of coverage at a stack (-m) to be 5 and allowed a maximum distance (-M) of 2 between stacks, and 4 between secondary reads and primary stacks (-N). SNP discovery was carried out using the default SNP model with a chi-square significance level of 0.05. We created a de novo catalog of RAD tag loci using the ‘cstacks’ module by selecting two individuals from each of 51 populations (n=102) with at least 2.5 million reads (but no greater than 4 million reads) to represent genetic variation throughout the native range of Chinook salmon (note that two populations had poor sequence quality and were excluded from further analysis). Individual samples were then aligned to the catalog using the module ‘sstacks’ and genotypes were exported using the ‘populations’ module.
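The selection of catalog samples described above (two individuals per population, each with between 2.5 and 4 million reads) can be sketched as follows. This is an illustration only: the text does not say how the two individuals were chosen among eligible samples, so taking the first two in input order is an assumption, and the function name, sample identifiers, and read counts are hypothetical.

```python
from collections import defaultdict


def select_catalog_samples(samples, per_population=2,
                           min_reads=2_500_000, max_reads=4_000_000):
    """Pick up to per_population samples per population for the catalog.

    samples: iterable of (sample_id, population, n_reads) tuples.
    Only samples with min_reads <= n_reads <= max_reads are eligible.
    Assumption: the first eligible samples in input order are taken.
    """
    chosen = defaultdict(list)
    for sample_id, population, n_reads in samples:
        eligible = min_reads <= n_reads <= max_reads
        if eligible and len(chosen[population]) < per_population:
            chosen[population].append(sample_id)
    return dict(chosen)


# Hypothetical read counts for illustration only.
samples = [
    ("s1", "PopA", 3_100_000),
    ("s2", "PopA", 1_900_000),  # too few reads
    ("s3", "PopA", 2_700_000),
    ("s4", "PopA", 3_900_000),  # eligible, but PopA already has two samples
    ("s5", "PopB", 4_500_000),  # too many reads
    ("s6", "PopB", 2_600_000),
]
catalog = select_catalog_samples(samples)
print(catalog)
```

A population that returns fewer than two samples here would need attention before catalog construction.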
Genotypes were filtered to exclude 1) any RAD tag locus with more than 4 SNP sites to remove putative paralogous sequence variant (PSV), hyper-variable, or poorly sequenced tags, 2) any RAD tag locus where one of the ten doubled haploid samples was observed to be heterozygous at any of the SNP positions in order to remove putative PSVs, 3) any SNP marker with more than 2 alleles to remove SNPs with sequencing errors, putative PSVs, or loci that do not fit a bi-allelic statistical model, 4) any SNP marker missing more than 20% of the genotypes across all of the populations to limit the amount of missing data, 5) any SNP marker failing tests of Hardy-Weinberg Equilibrium (HWE; Bonferroni corrected critical value at α = 0.05, 0.05/19,703 = 0.00000254) in more than 10% of the populations in order to exclude technical artifacts such as null alleles (heterozygote deficit loci) and putative PSVs (heterozygote excess loci), and 6) any SNP marker with an average minor allele frequency (MAF) across the populations falling below 0.01 in order to exclude spurious rare SNPs or sequencing errors. Since linked SNPs would bias population genetics statistics, we retained only one SNP marker per RAD tag, keeping the SNP with the greatest global MAF since this was likely to be the most informative SNP across all the populations. Individual samples were also filtered from the dataset if they were missing more than 20% of genotypes across all filtered loci, and whole populations were removed if fewer than 10 individuals remained to represent the population after applying filtering criteria. After applying the first four filters outlined above, we identified 29,668 loci, though upon applying the remaining filters we retained 19,703 SNP loci as our final dataset. Allowing only a single SNP per RAD tag resulted in the most substantial reduction in markers, as 6,556 SNPs were removed from the data set to reduce SNPs that were physically linked within 100 bp.
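Two of the steps above are simple enough to illustrate directly: the Bonferroni corrected HWE critical value and the rule of keeping the SNP with the greatest global MAF on each RAD tag. The sketch below is not the authors' code; the function names and the example SNP records are hypothetical, and breaking ties by keeping the first SNP encountered is an assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Bonferroni corrected critical value: alpha divided by the number of tests."""
    return alpha / n_tests


def best_snp_per_tag(snps):
    """Keep one SNP per RAD tag, the one with the greatest global MAF.

    snps: iterable of (tag_id, snp_id, global_maf) tuples.
    Assumption: on a tie, the first SNP encountered is kept.
    """
    best = {}
    for tag_id, snp_id, maf in snps:
        if tag_id not in best or maf > best[tag_id][1]:
            best[tag_id] = (snp_id, maf)
    return {tag_id: snp_id for tag_id, (snp_id, _) in best.items()}


# Critical value quoted in the text: 0.05 / 19,703 tests.
threshold = bonferroni_threshold(0.05, 19703)
print(f"{threshold:.8f}")

# Hypothetical SNP records for illustration only.
snps = [
    ("tag1", "tag1_pos12", 0.08),
    ("tag1", "tag1_pos57", 0.31),
    ("tag2", "tag2_pos40", 0.02),
]
retained = best_snp_per_tag(snps)
print(retained)
```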
The other filters primarily removed rare SNPs and polymorphisms with unresolved technical difficulties such as described by Davey et al. (2013).

Alignment to Chinook Salmon Linkage Map and Rainbow Trout Genome Assembly

RADtag sequences from this study were aligned to those from a high density RADtag based linkage map in Chinook salmon (Brieuc et al. 2014) in an effort to determine the relative genetic position and linkage group assignment of loci. Alignment to the RADtag database of Brieuc et al. (2014) was conducted using the short sequence alignment software program Bowtie v.1.0.1 (Langmead et al. 2009) with no more than two mismatches between the query sequence and the database sequence. RADtag sequences were also aligned to the rainbow trout genome (O. mykiss; Berthelot et al. 2014), which is the most closely related species to Chinook salmon with a published genome assembly. Alignment to the rainbow trout genome was carried out using Bowtie 2 v.2.2.3 (Langmead & Salzberg 2012) to assign sequences to rainbow trout chromosomes and obtain genomic positions, where a RADtag alignment was accepted if no more than four mismatches occurred between the RADtag site and the rainbow trout genome, and no more than one alignment site was identified for the RADtag sequence within the genome. We then queried the rainbow trout genome for coding sequence within a distance of 5 kb of the RADtag locus alignment sites in an effort to identify putatively linked genes. While linkage disequilibrium on average was found to decay after approximately 2 cM in a domesticated strain of rainbow trout (Rexroad III & Vallejo 2009), we conservatively opted to only identify coding regions within 5 kb of the RADtag alignment site, given our limited understanding of the intra-chromosomal micro-rearrangements between rainbow trout and Chinook salmon genomes.
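The 5 kb query described above can be pictured as a distance test between an alignment position and annotated coding intervals on the same chromosome. The following sketch makes several assumptions that are not stated in the text: the RADtag alignment is reduced to a single coordinate, gene intervals are given as inclusive start and end positions, and the function name and example annotations are hypothetical.

```python
def genes_within(locus_chrom, locus_pos, genes, window=5_000):
    """Return the names of genes lying within `window` bp of a locus.

    genes: iterable of (name, chrom, start, end) tuples with start <= end.
    A locus that falls inside a gene interval has distance 0.
    """
    hits = []
    for name, chrom, start, end in genes:
        if chrom != locus_chrom:
            continue
        if start <= locus_pos <= end:
            distance = 0
        else:
            distance = min(abs(locus_pos - start), abs(locus_pos - end))
        if distance <= window:
            hits.append(name)
    return hits


# Hypothetical annotations for illustration only.
genes = [
    ("geneA", "chr1", 10_000, 12_000),  # 4,500 bp from the locus
    ("geneB", "chr1", 30_000, 31_000),  # 13,500 bp from the locus
    ("geneC", "chr2", 14_000, 15_000),  # different chromosome
]
linked = genes_within("chr1", 16_500, genes)
print(linked)
```

A narrow physical window such as this trades sensitivity for fewer spurious gene assignments, which matches the conservative reasoning given above.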

To identify gene functions and annotations, coding sequences were then queried against the NCBI nucleotide sequence database using the software program Blast2GO (Conesa et al. 2005). Gene functions and annotations of genes linked to adaptive loci (outlier loci from RDA analysis) were compared to the functions and annotations of all RADtag linked genes to determine if there was an enrichment of gene ontologies in adaptive loci relative to all other genes, using a Fisher’s exact test corrected for multiple comparisons as implemented in the program Blast2GO (Conesa et al. 2005).

Detecting Neutral and Outlier Loci

For all FST outlier tests the “Wenatchee River” population was a pooled population of genetically similar samples from Nason Creek, Chiwawa River, and White River (n=51), and samples from the John Day River (JDR, n=12) were excluded from analyses because this sample size was too small to be representative relative to other populations in the study and within the lineage. In total 44 populations including 1,945 individually genotyped samples were investigated at 19,703 markers to identify neutral and outlier loci. Tests were conducted in four ways, and loci were only considered neutral if they met the expectations of neutrality in all four tests. The first test was a range-wide global analysis in which all 44 populations (note that the Nason/Chiwawa population was merged with the White River and the John Day River population was removed from the analysis) were analyzed together, the second test was conducted on a subset of populations previously identified as a putative “North Coastal Lineage”, the third test on a subset of populations identified as a putative “South Coastal Lineage”, and the fourth test on a subset of populations identified as an “Interior Columbia River Stream-Type Lineage”.
Outlier loci were identified in each test as those loci which showed excessively higher or lower FST than would be expected under the assumptions of neutrality. In this case P-values were corrected for the four tests using a Benjamini-Yekutieli correction (Benjamini & Yekutieli 2001) as recommended by Narum (2006), and outliers thus included loci with a P-value greater than 0.988 or less than 0.012.

Bibliography

Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One, 3, 1–7.

Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165–1188.

Berthelot C, Brunet F, Chalopin D et al. (2014) The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nature Communications, 5, 3657.

Brieuc MSO, Waters CD, Seeb JE, Naish KA (2014) A dense linkage map for Chinook salmon (Oncorhynchus tshawytscha) reveals variable chromosomal divergence after an ancestral whole genome duplication event. G3: Genes, Genomes, Genetics, 4, 447–460.

Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH (2011) Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1, 171–182.

Catchen J, Hohenlohe P, Bassham S, Amores A, Cresko W (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124–3140.

Conesa A, Götz S, García-Gómez JM et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics, 21, 3674–3676.

Hecht BC, Campbell NR, Holecek DE, Narum SR (2013) Genome-wide association reveals genetic basis for the propensity to migrate in wild populations of rainbow and steelhead trout. Molecular Ecology, 22, 3061–3076.

Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.

Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10, R25.1–R25.10.

Miller MR, Brunelli JP, Wheeler PA et al. (2012) A conserved haplotype controls parallel adaptation in geographically distant salmonid populations. Molecular Ecology, 21, 237–249.

Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research, 17, 240–248.

Narum SR (2006) Beyond Bonferroni: Less conservative analyses for conservation genetics. Conservation Genetics, 7, 783–787.

Narum SR, Hess JE, Matala AP (2010) Examining genetic lineages of Chinook salmon in the Columbia River Basin. Transactions of the American Fisheries Society, 139, 1465–1477.

Rexroad III CE, Vallejo RL (2009) Estimates of linkage disequilibrium and effective population size in rainbow trout. BMC Genetics, 10, 83.