mec12083-sup-0001-FigS1-S4-TableS1-S3

Population genomic signatures of divergent adaptation, gene flow, and hybrid speciation in the rapid radiation of Lake Victoria cichlid fishes

I. Keller, C. E. Wagner, L. Greuter, S. Mwaiko, O. Selz, A. Sivasundar, S. Wittwer & O. Seehausen

Supplementary material

Comparison of different assembly criteria

The results reported in the main document are based on a de novo assembly allowing a maximum of two mismatches between reads within a locus (ustacks parameter M=2), and at most two mismatches when mapping the reads back to the “reference”, i.e. the consensus sequences from the assembly (Table S1, column M2). We assessed the effect of different assembly and mapping parameters by rerunning our analysis pipeline twice with different parameter settings:

M1 (more stringent): A maximum of one mismatch was tolerated within loci in the de novo assembly and no more than two mismatches between read and “reference” in the mapping.

M4 (less stringent): A maximum of four mismatches was tolerated within loci in the de novo assembly. In the mapping to the “reference”, no more than two mismatches were allowed within the first 20 bp. Additional mismatches were possible in the remainder of the read as long as the sum of phred scores of all mismatching positions was ≤164 (Table S1). This last criterion allows a maximum of four mismatches for bases with the highest quality score possible in the Illumina 1.8 format (i.e. 41).

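The M4 mapping criterion can be written out as a small check. The following is a minimal Python sketch, assuming ungapped alignment of a read against the consensus sequence and integer phred scores per base; the function name `passes_m4_mapping` and the constants are hypothetical and do not reproduce the mapping software used in the study.

```python
from typing import Sequence

SEED_LENGTH = 20          # first 20 bp of the read
MAX_SEED_MISMATCHES = 2   # at most two mismatches within the first 20 bp
MAX_PHRED_SUM = 164       # 4 x 41, with 41 the top Illumina 1.8 quality score


def passes_m4_mapping(read: str, reference: str, quals: Sequence[int]) -> bool:
    """Return True if an ungapped read/reference pair meets the M4 criterion.

    Assumption: read, reference and quals are aligned position by position.
    """
    if not (len(read) == len(reference) == len(quals)):
        raise ValueError("read, reference and quals must have equal length")
    # Positions where the read base differs from the reference base.
    mismatches = [i for i, (r, c) in enumerate(zip(read, reference)) if r != c]
    seed_mismatches = sum(1 for i in mismatches if i < SEED_LENGTH)
    # Sum of phred scores over all mismatching positions in the read.
    phred_sum = sum(quals[i] for i in mismatches)
    return seed_mismatches <= MAX_SEED_MISMATCHES and phred_sum <= MAX_PHRED_SUM
```

With all mismatching bases at quality 41, four mismatches give a phred sum of exactly 164 and pass, whereas a fifth mismatch gives 205 and fails. More than four mismatches can pass only if some of the mismatching bases have lower quality scores.
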
The total number of reads utilized in the assembly increased by ca. 1 million from the most (M1) to the least stringent (M4) condition, ranging from 92 to 93% of the total number of reads. As more mismatches were tolerated within a RAD locus, more reads were used in the assembly and they were merged into fewer loci (Table S1). This increased the number of polymorphic sites overall, as well as the proportion of RAD loci containing more than one SNP. The total number of loci was substantially higher in all cases than the ca. 60K RAD tags expected in the cichlid genome, but many loci were observed only in a few individuals. When considering only RAD tags recovered in at least half of the individuals of each species (i.e. ≥5 individuals/species), the observed number of loci was reduced to ca. 40K, more consistent with expectations.

The choice of the optimal assembly parameters is not trivial. If very few pairwise differences are tolerated between the haplotypes within a locus, true variation will be missed as reads from more highly polymorphic loci will not be considered orthologous. At the other extreme, erroneous merging of non-orthologous haplotypes will produce RAD loci full of artifactual SNPs, so-called paralogous sequence variants (Renaut et al. 2010). The optimal parameter values will need to be assessed on a case-by-case basis and will depend on the level of divergence between the study taxa and their genomic organisation. A recently duplicated genome (e.g. in salmonid fishes; Sanchez et al. 2009; Seeb et al. 2011), for example, is expected to contain a very large number of paralogs and may require more stringent assembly and quality filtering criteria.

The five cichlid species investigated here, like most of the endemic species of Lake Victoria, have most likely diverged from common ancestors within less than 15’000 years (Seehausen 2006; Stager & Johnson 2008). Although recent work at phylogenomic scales has shown sympatric species in this radiation to be reciprocally monophyletic (Wagner et al. in press), previous work revealed only very limited genetic differentiation between species (Seehausen et al. 2008; Mzighani et al. 2010; Bezault et al. 2011). In contrast, the long-wavelength sensitive opsin gene is known for exceptionally high levels of sequence divergence between sister species, and the split between two alleles, H (nearly fixed in P. nyererei) and P (nearly fixed in P. pundamilia), is actually thought to predate the split between the two species (Seehausen et al. 2008). Still, the two alleles differ at only five out of 872 bases, corresponding to a sequence divergence of ca. 0.6%. Here, we observe at most three polymorphic sites within any given stretch of 84 bp (= length of our RAD loci). This suggests that our assembly criteria, which allowed a maximum of one, two or four pairwise differences between haplotypes within a locus, are well within the range of divergence expected between true alleles in these species, unless a locus is exceptionally polymorphic (such as the MHC gene with ca. 10% pairwise divergence among seven P. nyererei sequences; Figueroa et al. 2000).

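The divergence argument above comes down to simple arithmetic, worked through here as a short Python sketch. The helper name `expected_differences` is hypothetical, and the calculation assumes that differences are spread evenly along the sequence, which is a simplification.

```python
def expected_differences(divergence: float, locus_length: int) -> float:
    """Expected number of pairwise differences in a locus of the given length.

    Assumption: differences are uniformly distributed along the sequence.
    """
    return divergence * locus_length


RAD_LOCUS_LENGTH = 84  # bp, length of the RAD loci in this study

# Long-wavelength sensitive opsin: 5 differences over 872 bases.
lws_divergence = 5 / 872
print(f"LWS opsin divergence: {lws_divergence:.2%}")  # 0.57%
print(f"Expected differences per RAD locus: "
      f"{expected_differences(lws_divergence, RAD_LOCUS_LENGTH):.2f}")  # 0.48

# MHC example from the text: ca. 10% pairwise divergence.
print(f"Expected differences at 10% divergence: "
      f"{expected_differences(0.10, RAD_LOCUS_LENGTH):.1f}")  # 8.4
```

Even at the divergence of the opsin alleles, fewer than one difference is expected per 84 bp locus, so thresholds of one to four mismatches leave room for true allelic variation. A locus as divergent as the MHC example would exceed all three thresholds.
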
It is encouraging to find that analyses based on the three different assemblies produce highly consistent results. For example, we detected very similar outlier proportions (Fig. S1) and the identity of the outlier RAD loci was also very consistent between assemblies: of the RAD loci found to contain SNPs with unusually high FST between M. mbipi and P. sp. “pink anal fin” based on the M1 assembly, for example, 80% were confirmed as outliers in the M4 analysis. FST estimates between all species pairs were also highly correlated across all three assemblies (correlation coefficients ≥ 0.98 in all cases). A reference genome of P. nyererei will become available in the near future (Cichlid Genome Consortium at http://cichlid.umd.edu/CGCindex.html), which will allow a more thorough validation and comparison of the different assemblies produced here.

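The two consistency measures used in this comparison, the share of outlier loci confirmed in a second assembly and the correlation of pairwise FST estimates, can be sketched as follows. This is an illustrative Python sketch only: the function names are hypothetical, and it assumes that loci from different assemblies have already been matched by a shared identifier, a step not described here.

```python
from math import sqrt
from typing import Sequence, Set


def fraction_confirmed(outliers_a: Set[str], outliers_b: Set[str]) -> float:
    """Fraction of outlier loci from assembly A that are also outliers in B."""
    if not outliers_a:
        raise ValueError("outliers_a must not be empty")
    return len(outliers_a & outliers_b) / len(outliers_a)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

For example, `fraction_confirmed(m1_outliers, m4_outliers)` would return 0.8 if four out of five M1 outlier loci were also detected with M4, and `pearson_r` could be applied to the FST values of all species pairs under two assemblies.
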
References
Bezault E, Mwaiko S, Seehausen O (2011) Population genomic tests of models of adaptive radiation in Lake
Victoria region cichlid fish. Evolution 65, 3318-3397.
Figueroa F, Mayer WE, Sültmann H, et al. (2000) MHC class II B gene evolution in East African cichlid fishes.
Immunogenetics 51, 556-575.
Mzighani SI, Nikaido M, Takeda M, et al. (2010) Genetic variation and demographic history of the
Haplochromis laparogramma group of Lake Victoria—An analysis based on SINEs and mitochondrial
DNA. Gene 450, 39-47.
Renaut S, Nolte A, Bernatchez L (2010) Mining transcriptome sequences towards identifying adaptive single
nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae). Molecular
Ecology 19, 115-131.
Sanchez C, Smith T, Wiedmann R, et al. (2009) Single nucleotide polymorphism discovery in rainbow trout by
deep sequencing of a reduced representation library. BMC Genomics 10, 559.
Seeb JE, Pascal CE, Grau ED, et al. (2011) Transcriptome sequencing and high-resolution melt analysis advance
single nucleotide polymorphism discovery in duplicated salmonids. Molecular Ecology Resources 11,
335-348.
Seehausen O (2006) African cichlid fish: a model system in adaptive radiation research. Proceedings of the
Royal Society of London. Series B: Biological Sciences 273, 1987-1998.
Seehausen O, Terai Y, Magalhaes IS, et al. (2008) Speciation through sensory drive in cichlid fish. Nature 455,
620-626.
Stager J, Johnson T (2008) The late Pleistocene desiccation of Lake Victoria and the origin of its endemic biota.
Hydrobiologia 596, 5-16.
Wagner CE, Keller I, Wittwer S, et al. (in press) Genome-wide RAD sequence data provides unprecedented
resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation.
Molecular Ecology.
Table S1: Number of RAD loci and polymorphic sites obtained with different assembly and mapping criteria in 50 individuals from five cichlid species

                                                          M1           M2           M4
De novo assembly
  Max. # mismatches between reads within locus
    (M of ustacks)                                        1            2            4
  # reads assembled                                       107.0 mio    107.5 mio    108.0 mio
  # of putative loci 1)                                   144’310      136’386      126’613
Mapping
  Max. mismatches between read and “reference”            2            2            2 within first 20 bp, sum of phred scores of all mismatching bases ≤164
  # of loci covered in ≥5 ind/species 2)                  40’741       43’566       41’410
SNPs
  Total # of SNPs 3)                                      122’137      126’238      249’183
  Total # of SNPs retained after quality filtering 4)     8’194        10’663       14’881
  Total # of polymorphic RAD loci after quality
    filtering 5)                                          6’080        7’111        8’145
  % loci with exactly
    ...1 SNP                                              76.5%        70.1%        61.5%
    ...2 SNPs                                             18.2%        21.5%        21.3%
    ...3 or more SNPs                                     5.3%         8.4%         17.2%

1) Total number of RAD loci produced by de novo assembly before subsequent filtering steps. This number includes, for example, monomorphic loci or loci present in a single individual.
2) Number of RAD loci recovered in at least 5 individuals per species at a read depth sufficient for genotype calling.
3) All SNP sites with quality score of at least 10.
4) Based on the full dataset of 50 individuals. A SNP site is retained if at least 5 individuals/species have a genotype assigned, the minor allele is observed at least 3 times, and the observed heterozygosity is ≤0.5 in all five species.
5) Total number of loci containing one or more high-quality SNP. The following three rows indicate the percentage of loci containing exactly 1 SNP, 2 SNPs or ≥3 SNPs.

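The SNP quality filter described in footnote 4 of Table S1 can be illustrated with a short Python sketch. It is not the authors' pipeline: the function name `retain_snp`, the genotype encoding (number of copies of one allele, or `None` for a missing call) and the reading of "observed at least 3 times" as three allele copies across the full dataset are all assumptions.

```python
from typing import Dict, List, Optional

# Assumed encoding for a biallelic SNP in a diploid individual:
# 0, 1 or 2 copies of the alternate allele, or None if no genotype was called.
Genotype = Optional[int]


def retain_snp(genotypes_by_species: Dict[str, List[Genotype]],
               min_called: int = 5,
               min_minor_count: int = 3,
               max_het: float = 0.5) -> bool:
    """Apply the three filters of footnote 4 to one SNP site."""
    alt_count = 0
    allele_total = 0
    for genotypes in genotypes_by_species.values():
        called = [g for g in genotypes if g is not None]
        # Filter 1: at least min_called genotyped individuals per species.
        if len(called) < min_called:
            return False
        # Filter 2: observed heterozygosity must not exceed max_het
        # in any species.
        if sum(1 for g in called if g == 1) / len(called) > max_het:
            return False
        alt_count += sum(called)
        allele_total += 2 * len(called)
    # Filter 3: the rarer allele is seen at least min_minor_count times overall.
    minor_count = min(alt_count, allele_total - alt_count)
    return minor_count >= min_minor_count
```

The heterozygosity cap is a common guard against merged paralogs, which tend to appear heterozygous in most or all individuals.
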
Figure S1: Proportion of outliers (FDR=20%; prior odds 10) among all polymorphic SNPs between all species pairs for each of the three assemblies (see Table S1 for details on assemblies). The outlier proportions are highly correlated among the three assemblies (R² ≥ 0.78).
[Bar chart. y-axis: proportion of outliers (0.0% to 1.0%); x-axis: species pair; series: M1, M2, M4.]

Figure S2: Results of outlier scans assuming even prior odds for all pairwise comparisons among the five cichlid species. The barplot indicates the proportion of SNPs detected as significant outliers in each comparison. In the bottom panel, each column represents a pairwise comparison and each row a SNP site showing outlier behaviour in ≥2 comparisons. We were specifically interested in identifying SNPs detected as outliers in ≥2 independent comparisons between the two genera and/or the two colour types. If this criterion was satisfied for genus and/or colour, we coloured all significant comparisons at that locus. Green = between-genus outlier in ≥2 independent comparisons; blue = between-colour type outlier in ≥2 independent comparisons; turquoise = outlier in ≥2 of both between-genus and between-colour comparisons. All other significant comparisons are indicated in grey. SNPs are ordered from top to bottom by the number of comparisons in which they were detected as outliers among the nine pairwise comparisons. lut=Mbipia lutea; mbi=M. mbipi; nyer=Pundamilia nyererei; pink=P. sp. “pink anal fin”; pund=P. pundamilia.

Figure S3: a) Results of a Structure analysis of the full dataset of 10’663 SNPs. The leftmost column shows the probability of the data (ln P(D)) for different numbers of genetic groups (K). The middle column shows the Structure barplots for K=2, where several alternative solutions were observed. Here, we present only the two dominant solutions. Under grouping 1, the species are grouped according to male nuptial colouration as indicated by the letters above the Structure barplots (Y=yellow; B=blue). Under grouping 2, the species group by genera with the exception of P. sp. “pink anal fin”, which clusters with the two Mbipia species. To the right of the barplots, we indicate the number of times a given solution was observed among a total of ten runs and provide the average ln P(D) across these runs. All plots are averaged across all runs supporting a given grouping. The rightmost column shows the Structure barplot for K=5 averaged across ten replicate runs.

b) Maximum likelihood tree based on the full dataset of 10’663 SNPs. Tip colours represent the species. The colours are consistent with male nuptial colouration. Triangles indicate Mbipia spp., circles Pundamilia spp. Values on branches are bootstrap support from 100 rounds of bootstrapping using RAxML’s rapid bootstrapping algorithm, and are shown only if ≥50.

Figure S4: Results of Structure analyses of three data subsets. The leftmost column shows the probability of the data (ln P(D)) for different numbers of genetic groups (K). The right column shows the Structure barplots for K=2, where multiple solutions were observed. Here, we present only the two dominant solutions. In the first, the species are grouped according to male nuptial colouration as indicated by the letters above the Structure barplots (grouping 1, by colour). In the second, the grouping is based on genera with the exception of P. sp. “pink anal fin”, which clusters with the two Mbipia species (grouping 2). To the right of the barplots, we indicate the number of times a given solution was observed among a total of ten runs and provide the average ln P(D) across these runs. All plots are averaged across all runs supporting a given grouping. Intermediate=5’331 SNPs with FST values between the 25th and 75th percentiles of all locus-specific FST values ordered from lowest to highest; high=75th-99th percentile, 2’559 SNPs; top=above 99th percentile, 107 SNPs. Y=yellow; B=blue.

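The FST classes used in Figure S4 can be illustrated with a rank-based split of the SNPs. The sketch below is a minimal Python illustration and not the authors' code: the function name `fst_classes`, the extra "low" class and the rounding of the percentile cut-offs are assumptions. With this rounding convention, 10’663 SNPs are split into 5’331 intermediate, 2’559 high and 107 top SNPs, which agrees with the counts given in the caption.

```python
from typing import Dict, List, Sequence


def fst_classes(fst: Sequence[float]) -> Dict[str, List[int]]:
    """Assign SNP indices to FST classes by rank.

    Assumption: cut-offs are the rounded 25th, 75th and 99th percentile ranks
    of the SNPs ordered from lowest to highest FST; ties are broken by the
    original SNP order.
    """
    n = len(fst)
    order = sorted(range(n), key=lambda i: fst[i])  # SNP indices, lowest FST first
    c25, c75, c99 = (round(n * q) for q in (0.25, 0.75, 0.99))
    return {
        "low": order[:c25],              # below the 25th percentile (not in Fig. S4)
        "intermediate": order[c25:c75],  # 25th to 75th percentile
        "high": order[c75:c99],          # 75th to 99th percentile
        "top": order[c99:],              # above the 99th percentile
    }
```

Each class can then be written out as a separate SNP subset and analysed on its own, as was done for the three subsets shown in the figure.
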
Table S2: Genetic diversity of five haplochromine species (Mbipia lutea, Mbipia mbipi, Pundamilia nyererei, Pundamilia sp. “pink anal fin” and Pundamilia pundamilia) at Makobe Island, Southern Lake Victoria, based on the M2 assembly. The given average gene diversity (He) within a species is for polymorphic loci (sites) only.

Species name               Number of polymorphic   Average gene        Standard deviation
                           loci (sites)            diversity (=He)     for He
M. lutea                   6803                    0.105               0.052
M. mbipi                   7442                    0.118               0.058
P. nyererei                6633                    0.111               0.055
P. sp. “pink anal fin”     6867                    0.108               0.053
P. pundamilia              6747                    0.111               0.055

Table S3: Neutral FST between all species pairs estimated based on 5’331 intermediate SNPs
(i.e. between lower and upper quartiles of a list of all SNPs arranged in order of increasing
global FST). lut=Mbipia lutea; mbi=M. mbipi; nyer=Pundamilia nyererei; pink=P. sp. “pink
anal fin”; pund=P. pundamilia.
mbi
nyer
pink
pund
lut
mbi
nyer
pink
0.038
0.054
0.052
0.061
0.046
0.025
0.043
0.051
0.055
0.052