mec13409-sup-0001-FileS1

Environmental adaptation in Chinook salmon (Oncorhynchus tshawytscha) throughout their North
American range
Benjamin C. Hecht1,2, Andrew P. Matala1, Jon E. Hess1, and Shawn R. Narum1
1Columbia River Inter-Tribal Fish Commission and 2University of Idaho, Hagerman Fish Culture
Experiment Station, Hagerman, ID 83332, USA

Supplementary File S1

RAD library preparation and sequencing
Restriction-site associated DNA (RAD) libraries (Miller et al. 2007; Baird et al. 2008) were prepared
for Illumina HiSeq 1500 sequencing using a protocol similar to those previously published (Baird et al.
2008; Miller et al. 2012), modified as described in Hecht et al. (2013) to allow for lower starting DNA
amounts when tissue was limited for some individuals or populations. Libraries were prepared with 150,
250, or 500 ng of starting DNA per sample, depending on sample quality and quantity. Samples were
digested individually with the restriction enzyme SbfI-HF (NEB, Ipswich, MA, USA) and individually
barcoded using a 6-nt barcode adapter sequence. Digested and barcoded samples with the same starting
DNA amount were pooled into libraries of 36 to 96 samples, such that no two samples within a library
shared a barcode sequence and every pair of barcode sequences within a library differed by at least
two bases. Libraries were mechanically sheared to fragment lengths of 200 to 700 bp using a Bioruptor
300 sonicator (Diagenode, Denville, NJ, USA), and fragments were size selected and isolated using an
Agencourt AMPure XP bead purification system (Beckman Coulter, Brea, CA, USA). The remainder of the
RAD library preparation followed the protocol of Miller et al. (2012). Prior to sequencing, RAD
libraries were quantified by real-time PCR using a Kapa Illumina Library Quantification Kit following
the recommended protocols (Kapa Biosystems Inc., Woburn, MA, USA) on an ABI 7900HT Sequence Detection
System (Life Technologies, Grand Island, NY, USA). Libraries were sequenced on an Illumina HiSeq 1500
sequencer (Illumina Inc., San Diego, CA, USA) as single-end reads of 100 bp. Depending on the quality
and quantity of sequence generated, some libraries were sequenced in more than one lane to reach a
target depth of approximately 2 million reads per individual. In total, 2,775 samples were sequenced in
44 libraries across 63 Illumina flow cell lanes.
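
The barcode design rule described above (no shared barcodes within a library, and at least two
differing bases between any pair) can be checked with a pairwise Hamming distance. The following is a
minimal sketch, not the procedure used in the study; the function names and the example barcodes are
hypothetical.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def check_library_barcodes(barcodes, min_distance=2):
    """Return every pair of barcodes closer than min_distance.

    Identical barcodes have distance 0, so duplicates are reported too.
    """
    return [
        (a, b)
        for a, b in combinations(barcodes, 2)
        if hamming(a, b) < min_distance
    ]


# Hypothetical 6-nt barcodes (NOT the barcodes used in the study).
good = ["AAACGG", "AACGTT", "ACGTAC", "CATGAA"]
bad = good + ["AAACGT"]  # differs from AAACGG at a single base

print(check_library_barcodes(good))
print(check_library_barcodes(bad))
```

A minimum distance of two means that a single sequencing error in a barcode cannot turn it into another
valid barcode from the same library.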

De novo SNP discovery and genotyping
Although a RADtag-based SNP catalog had previously been constructed for Chinook salmon (Brieuc et al.
2014), here we included samples from a broader range of populations, from Alaska to California, in
order to identify informative loci throughout the entire species range. We therefore identified and
genotyped SNP loci de novo by constructing a SNP catalog from individuals spanning the geographic range
of populations in our collection, using the software pipeline Stacks v.1.03 (Catchen et al. 2011,
2013). Raw Illumina reads were first assessed for quality using the software program FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The 15-20 bases at the 3’ end of the reads
generally had lower quality scores than the 80-85 bases at the 5’ end. We therefore truncated reads to
80 bases by removing the 3’ sequence most prone to error. Reads were also quality filtered and
de-multiplexed using the ‘process_radtags’ program of the Stacks pipeline, with options to discard any
read with an uncalled base (-c), discard reads with low quality scores (-q), and rescue barcodes and
partial restriction enzyme recognition sites (-r). All other parameters and options were left at the
default values outlined in the program manual (http://creskolab.uoregon.edu/stacks).
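
The truncation step can be illustrated with a short sketch that cuts each read and its quality string
to the first 80 bases. This is only an illustration under stated assumptions: it expects
four-line FASTQ records, counts positions from the start of the read as supplied, and uses a
hypothetical function name; in the study the trimming was performed within the Stacks pipeline.

```python
import io


def truncate_fastq(handle, length=80):
    """Yield (header, sequence, separator, quality) truncated to `length` bases.

    Assumes strict four-line FASTQ records (an assumption of this sketch).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        # Keep the 5' bases; drop the lower-quality 3' end.
        yield header, seq[:length], plus, qual[:length]


# Hypothetical 100-base read with a uniform quality string.
example = io.StringIO("@read1\n" + "ACGT" * 25 + "\n+\n" + "I" * 100 + "\n")
records = list(truncate_fastq(example))
print(len(records[0][1]), len(records[0][3]))
```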

After reads for each sample were quality filtered, trimmed, and de-multiplexed, they were submitted to
the ‘ustacks’ module of Stacks to identify loci. In ‘ustacks’, the deleveraging (-d) and removal (-r)
algorithms were applied to filter out sequences likely to be paralogous or highly repetitive. We
required a minimum depth of coverage of 5 to form a stack (-m), and allowed a maximum distance of 2
between stacks (-M) and of 4 between secondary reads and primary stacks (-N). SNP discovery was carried
out using the default SNP model with a chi-square significance level of 0.05. We created a de novo
catalog of RAD tag loci with the ‘cstacks’ module from two individuals from each of 51 populations
(n=102), each with at least 2.5 million but no more than 4 million reads, to represent genetic
variation throughout the native range of Chinook salmon (note that two populations had poor sequence
quality and were excluded from further analysis). Individual samples were then matched to the catalog
using the ‘sstacks’ module, and genotypes were exported using the ‘populations’ module. Genotypes were
filtered to exclude 1) any RAD tag locus with more than 4 SNP sites, to remove putative paralogous
sequence variants (PSVs) and hyper-variable or poorly sequenced tags; 2) any RAD tag locus at which one
of the ten doubled haploid samples was heterozygous at any SNP position, to remove putative PSVs; 3)
any SNP marker with more than 2 alleles, to remove sequencing errors, putative PSVs, and loci that do
not fit a bi-allelic statistical model; 4) any SNP marker missing more than 20% of genotypes across all
populations, to limit the amount of missing data; 5) any SNP marker failing tests of Hardy-Weinberg
equilibrium (HWE; Bonferroni-corrected critical value at α=0.05, 0.05/19,703 = 0.00000254) in more than
10% of the populations, to exclude technical artifacts such as null alleles (heterozygote-deficit loci)
and putative PSVs (heterozygote-excess loci); and 6) any SNP marker with an average minor allele
frequency (MAF) across populations below 0.01, to exclude spurious rare SNPs and sequencing errors.
Because physically linked SNPs would bias population genetic statistics, we retained only one SNP per
RAD tag, keeping the SNP with the greatest global MAF because it was likely to be the most informative
across all populations. Individual samples were removed from the dataset if they were missing more than
20% of genotypes across all filtered loci, and whole populations were removed if fewer than 10
individuals remained after filtering. After applying the first four filters outlined above, we
identified 29,668 loci; after applying the remaining filters, we retained 19,703 SNP loci as our final
dataset. Allowing only a single SNP per RAD tag produced the most substantial reduction in markers:
6,556 SNPs were removed to eliminate SNPs physically linked within 100 bp. The other filters primarily
removed rare SNPs and polymorphisms with unresolved technical difficulties such as those described by
Davey et al. (2013).
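
Three of the filters above (missing data, minimum MAF, and one SNP per RAD tag) can be sketched as
follows. This is a simplified illustration rather than the code used in the study: genotypes are
assumed to be coded as counts of one allele (0, 1, 2, or None when missing), MAF is computed globally
rather than averaged across populations, and all names and data are hypothetical.

```python
# Bonferroni-corrected HWE critical value reported in the text (about 2.54e-6).
HWE_ALPHA = 0.05 / 19703


def missing_fraction(genotypes):
    """Fraction of samples with no genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Global MAF from allele counts (0/1/2), ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def filter_snps(snps, max_missing=0.20, min_maf=0.01):
    """Keep, per RAD tag, the passing SNP position with the highest MAF.

    `snps` maps (tag_id, position) to a list of genotypes.
    """
    best = {}
    for (tag, pos), genos in snps.items():
        if missing_fraction(genos) > max_missing:
            continue  # too much missing data
        maf = minor_allele_frequency(genos)
        if maf < min_maf:
            continue  # too rare (or monomorphic)
        if tag not in best or maf > best[tag][1]:
            best[tag] = (pos, maf)
    return {tag: pos for tag, (pos, _) in best.items()}


# Hypothetical genotypes for ten samples.
snps = {
    ("tag1", 12): [0, 1, 1, 2, 0, 0, 1, 0, 0, 0],           # MAF 0.25
    ("tag1", 57): [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],           # MAF 0.05
    ("tag2", 30): [0, None, None, None, 1, 0, 0, 0, 0, 0],  # 30% missing
    ("tag3", 8): [0] * 10,                                  # monomorphic
}
kept = filter_snps(snps)
print(kept)
```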

Alignment to Chinook Salmon Linkage Map and Rainbow Trout Genome Assembly
RADtag sequences from this study were aligned to those from a high-density RADtag-based linkage map for
Chinook salmon (Brieuc et al. 2014) to determine the relative genetic position and linkage group
assignment of loci. Alignment to the RADtag database of Brieuc et al. (2014) was conducted using the
short-read alignment program Bowtie v.1.0.1 (Langmead et al. 2009), allowing no more than two
mismatches between the query sequence and the database sequence. RADtag sequences were also aligned to
the genome of rainbow trout (O. mykiss; Berthelot et al. 2014), the species most closely related to
Chinook salmon with a published genome assembly. Alignment to the rainbow trout genome was carried out
using Bowtie 2 v.2.2.3 (Langmead & Salzberg 2012) to assign sequences to rainbow trout chromosomes and
obtain genomic positions. A RADtag alignment was accepted if it contained no more than four mismatches
between the RADtag sequence and the rainbow trout genome and if the RADtag sequence aligned to no more
than one site in the genome. We then queried the rainbow trout genome for coding sequence within 5 kb
of the RADtag alignment sites to identify putatively linked genes. Although linkage disequilibrium was
found to decay on average after approximately 2 cM in a domesticated strain of rainbow trout (Rexroad
III & Vallejo 2009), we conservatively restricted the search to coding regions within 5 kb of the
RADtag alignment site, given our limited understanding of intra-chromosomal micro-rearrangements
between the rainbow trout and Chinook salmon genomes.
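
The acceptance rule and the 5 kb search window can be sketched as below. The record layout
(dictionaries with `chrom`, `start`, `end`, `mismatches`, and `gene` keys), the function names, and the
example coordinates are all assumptions made for illustration; they do not reflect Bowtie 2 output
parsing or the scripts used in the study.

```python
def accept_alignment(hits, max_mismatches=4):
    """Return the hit if the tag aligns to exactly one site within the limit.

    `hits` is the list of alignment sites reported for one RADtag.
    """
    if len(hits) != 1:
        return None  # unaligned or multiply aligned
    hit = hits[0]
    return hit if hit["mismatches"] <= max_mismatches else None


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance in bp between two closed intervals (0 if they overlap)."""
    return max(0, b_start - a_end, a_start - b_end)


def genes_near(hit, cds_features, window=5000):
    """Names of genes with coding sequence within `window` bp of the hit."""
    return sorted(
        {
            f["gene"]
            for f in cds_features
            if f["chrom"] == hit["chrom"]
            and interval_gap(hit["start"], hit["end"], f["start"], f["end"]) <= window
        }
    )


# Hypothetical alignment and annotation records.
hits = [{"chrom": "chr1", "start": 10000, "end": 10079, "mismatches": 2}]
cds = [
    {"gene": "geneA", "chrom": "chr1", "start": 14000, "end": 14500},  # 3,921 bp away
    {"gene": "geneB", "chrom": "chr1", "start": 20000, "end": 21000},  # 9,921 bp away
    {"gene": "geneC", "chrom": "chr2", "start": 10000, "end": 10500},  # other chromosome
]
hit = accept_alignment(hits)
linked = genes_near(hit, cds)
print(linked)
```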

To identify gene functions and annotations, coding sequences were then queried against the NCBI
nucleotide sequence database using the software program Blast2GO (Conesa et al. 2005). Functions and
annotations of genes linked to adaptive loci (outlier loci from the RDA) were compared with those of
all RADtag-linked genes to test for enrichment of gene ontologies in adaptive loci relative to all
other genes, using a Fisher’s exact test corrected for multiple comparisons as implemented in the
program Blast2GO (Conesa et al. 2005).
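
For readers unfamiliar with the enrichment test, the sketch below computes a one-sided Fisher’s exact
test for a single gene ontology term from a 2x2 table. It is not Blast2GO’s implementation: the choice
of a one-sided test, the function name, and the example counts are assumptions, and the correction for
multiple comparisons is omitted.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided P-value for over-representation in a 2x2 table.

    a: adaptive-locus genes with the term     b: adaptive-locus genes without
    c: other genes with the term              d: other genes without
    Sums hypergeometric probabilities for tables with at least `a`
    term-annotated genes in the adaptive set.
    """
    row1 = a + b          # genes linked to adaptive loci
    col1 = a + c          # genes annotated with the term
    total = a + b + c + d
    denom = comb(total, row1)
    k_max = min(row1, col1)
    numer = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, k_max + 1)
    )
    return numer / denom


# Hypothetical counts: 3 of 4 adaptive-locus genes carry the term,
# against 1 of 4 remaining genes.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(p_value)
```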

Detecting Neutral and Outlier Loci
For all FST outlier tests, the “Wenatchee River” population was a pooled population of genetically
similar samples from Nason Creek, Chiwawa River, and White River (n=51), and samples from the John Day
River (JDR, n=12) were excluded because the sample size was too small to be representative relative to
other populations in the study and within the lineage. In total, 44 populations comprising 1,945
individually genotyped samples were investigated at 19,703 markers to identify neutral and outlier
loci. Four tests were conducted, and loci were considered neutral only if they met the expectations of
neutrality in all four. The first test was a range-wide global analysis in which all 44 populations
were analyzed together (note that the Nason/Chiwawa population was merged with the White River
population and the John Day River population was removed from the analysis). The second test was
conducted on a subset of populations previously identified as a putative “North Coastal Lineage”, the
third on a subset identified as a putative “South Coastal Lineage”, and the fourth on a subset
identified as an “Interior Columbia River Stream-Type Lineage”. Outlier loci were identified in each
test as those loci with significantly higher or lower FST than expected under neutrality. P-values were
corrected for the four tests using a Benjamini-Yekutieli correction (Benjamini & Yekutieli 2001) as
recommended by Narum (2006), so outlier loci were those with a P-value greater than 0.988 or less than
0.012.
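
The reported thresholds are consistent with the following arithmetic, shown here as a sketch. The
Benjamini-Yekutieli critical value for k tests is α divided by the sum of 1/i for i = 1 to k; splitting
that value equally between the two tails is an assumption of this sketch that reproduces the 0.012 and
0.988 cut-offs, and the function names are hypothetical.

```python
def benjamini_yekutieli_critical_value(alpha, k):
    """Critical value alpha / (1 + 1/2 + ... + 1/k) for k tests."""
    return alpha / sum(1.0 / i for i in range(1, k + 1))


# Four tests at alpha = 0.05: 0.05 / 2.0833 = 0.024
critical = benjamini_yekutieli_critical_value(0.05, 4)

# Assumed two-tailed split: half of the critical value in each tail.
lower = critical / 2        # about 0.012
upper = 1 - critical / 2    # about 0.988


def is_outlier(p):
    """True if a locus P-value falls in either tail."""
    return p < lower or p > upper


print(round(critical, 3), round(lower, 3), round(upper, 3))
```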

Bibliography

Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP Discovery and Genetic Mapping Using Sequenced
RAD Markers. PLoS One, 3, 1–7.

Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under
dependency. Annals of Statistics, 29, 1165–1188.

Berthelot C, Brunet F, Chalopin D et al. (2014) The rainbow trout genome provides novel insights into
evolution after whole-genome duplication in vertebrates. Nature Communications, 5, 3657.

Brieuc MSO, Waters CD, Seeb JE, Naish KA (2014) A dense linkage map for Chinook salmon
(Oncorhynchus tshawytscha) reveals variable chromosomal divergence after an ancestral whole
genome duplication event. G3: Genes, Genomes, Genetics, 4, 447–460.

Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH (2011) Stacks: building and genotyping
loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1, 171–182.

Catchen J, Hohenlohe P, Bassham S, Amores A, Cresko W (2013) Stacks: an analysis tool set for
population genomics. Molecular Ecology, 22, 3124–3140.

Conesa A, Götz S, García-Gómez JM et al. (2005) Blast2GO: a universal tool for annotation, visualization
and analysis in functional genomics research. Bioinformatics, 21, 3674–3676.

Hecht BC, Campbell NR, Holecek DE, Narum SR (2013) Genome-wide association reveals genetic basis
for the propensity to migrate in wild populations of rainbow and steelhead trout. Molecular
Ecology, 22, 3061–3076.

Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.

Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short
DNA sequences to the human genome. Genome Biology, 10, R25.1–R25.10.

Miller MR, Brunelli JP, Wheeler PA et al. (2012) A conserved haplotype controls parallel adaptation in
geographically distant salmonid populations. Molecular Ecology, 21, 237–249.

Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism
identification and genotyping using restriction site associated DNA (RAD) markers. Genome
Research, 17, 240–248.

Narum SR (2006) Beyond Bonferroni: Less conservative analyses for conservation genetics. Conservation
Genetics, 7, 783–787.

Narum SR, Hess JE, Matala AP (2010) Examining Genetic Lineages of Chinook Salmon in the Columbia
River Basin. Transactions of the American Fisheries Society, 139, 1465–1477.

Rexroad III CE, Vallejo RL (2009) Estimates of linkage disequilibrium and effective population size in
rainbow trout. BMC Genetics, 10, 83.