Population genomic signatures of divergent adaptation, gene flow, and hybrid speciation in the rapid radiation of Lake Victoria cichlid fishes

I. Keller, C. E. Wagner, L. Greuter, S. Mwaiko, O. Selz, A. Sivasundar, S. Wittwer & O. Seehausen

Supplementary material

Comparison of different assembly criteria

The results reported in the main document are based on a de novo assembly allowing a maximum of two mismatches between reads within a locus (ustacks parameter M=2), and at most two mismatches when mapping the reads back to the “reference”, i.e. the consensus sequences from the assembly (Table S1, column M2). We assessed the effect of different assembly and mapping parameters by rerunning our analysis pipeline twice with different parameter settings:

M1 (more stringent): A maximum of one mismatch was tolerated within loci in the de novo assembly and no more than two mismatches between read and “reference” in the mapping.

M4 (less stringent): A maximum of four mismatches was tolerated within loci in the de novo assembly. In the mapping to the “reference”, no more than two mismatches were allowed within the first 20 bp. Additional mismatches were possible in the remainder of the read as long as the sum of phred scores of all mismatching positions was ≤164 (Table S1). This last criterion allows a maximum of four mismatches for bases with the highest quality score possible in the Illumina 1.8 format (i.e. 41), since 4 × 41 = 164.

The total number of reads utilized in the assembly increased by ca. 1 million from the most (M1) to the least stringent (M4) condition, ranging from 92 to 93% of the total number of reads. As more mismatches were tolerated within a RAD locus, more reads were used in the assembly and they were merged into fewer loci (Table S1). This increased the number of polymorphic sites overall, as well as the proportion of RAD loci containing more than one SNP.
The total number of loci was substantially higher in all cases than the ca. 60K RAD tags expected in the cichlid genome, but many loci were observed only in a few individuals. When considering only RAD tags recovered in at least half of the individuals (i.e. ≥5 individuals/species), the observed number of loci was reduced to ca. 40K, more consistent with expectations.

The choice of the optimal assembly parameters is not trivial. If very few pairwise differences are tolerated between the haplotypes within a locus, true variation will be missed as reads from more highly polymorphic loci will not be considered orthologous. At the other extreme, erroneous merging of non-orthologous haplotypes will produce RAD loci full of artifactual SNPs, so-called paralogous sequence variants (Renaut et al. 2010). The optimal parameter values will need to be assessed on a case-by-case basis and will depend on the level of divergence between the study taxa and their genomic organisation. A recently duplicated genome (e.g. in salmonid fishes; Sanchez et al. 2009; Seeb et al. 2011), for example, is expected to contain a very large number of paralogs and may require more stringent assembly and quality filtering criteria.

The five cichlid species investigated here, like most of the endemic species of Lake Victoria, have most likely diverged from common ancestors within less than 15’000 years (Seehausen 2006; Stager & Johnson 2008). Although recent work at phylogenomic scales has shown sympatric species in this radiation to be reciprocally monophyletic (Wagner et al. in press), previous work revealed only very limited genetic differentiation between species (Seehausen et al. 2008; Mzighani et al. 2010; Bezault et al. 2011). In contrast, the long-wavelength sensitive opsin gene is known for exceptionally high levels of sequence divergence between sister species, and the split between two alleles, H (nearly fixed in P.
nyererei) and P (nearly fixed in P. pundamilia) is actually thought to predate the split between the two species (Seehausen et al. 2008). Still, the two alleles differ at only five out of 872 bases, corresponding to a sequence divergence of ca. 0.6%. Here, we observe at most three polymorphic sites within any given stretch of 84 bp (= length of our RAD loci). This suggests that our assembly criteria, which allowed a maximum of one, two or four pairwise differences between haplotypes within a locus, are well within the range of divergence expected between true alleles in these species, unless a locus is exceptionally polymorphic (such as the MHC gene with ca. 10% pairwise divergence among seven P. nyererei sequences; Figueroa et al. 2000).

It is encouraging to find that analyses based on the three different assemblies produce highly consistent results. For example, we detected very similar outlier proportions (Fig. S1) and the identity of the outlier RAD loci was also very consistent between assemblies: of the RAD loci found to contain SNPs with unusually high FST between M. mbipi and P. sp. “pink anal fin” based on the M1 assembly, for example, 80% were confirmed as outliers in the M4 analysis. FST estimates between all species pairs were also highly correlated across all three assemblies (correlation coefficients ≥ 0.98 in all cases). A reference genome of P. nyererei will become available in the near future (Cichlid Genome Consortium at http://cichlid.umd.edu/CGCindex.html), which will allow a more thorough validation and comparison of the different assemblies produced here.

References

Bezault E, Mwaiko S, Seehausen O (2011) Population genomic tests of models of adaptive radiation in Lake Victoria region cichlid fish. Evolution 65, 3318-3397.

Figueroa F, Mayer WE, Sültmann H, et al. (2000) MHC class II B gene evolution in East African cichlid fishes. Immunogenetics 51, 556-575.

Mzighani SI, Nikaido M, Takeda M, et al. (2010) Genetic variation and demographic history of the Haplochromis laparogramma group of Lake Victoria - An analysis based on SINEs and mitochondrial DNA. Gene 450, 39-47.

Renaut S, Nolte A, Bernatchez L (2010) Mining transcriptome sequences towards identifying adaptive single nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae). Molecular Ecology 19, 115-131.

Sanchez C, Smith T, Wiedmann R, et al. (2009) Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics 10, 559.

Seeb JE, Pascal CE, Grau ED, et al. (2011) Transcriptome sequencing and high-resolution melt analysis advance single nucleotide polymorphism discovery in duplicated salmonids. Molecular Ecology Resources 11, 335-348.

Seehausen O (2006) African cichlid fish: a model system in adaptive radiation research. Proceedings of the Royal Society of London. Series B: Biological Sciences 273, 1987-1998.

Seehausen O, Terai Y, Magalhaes IS, et al. (2008) Speciation through sensory drive in cichlid fish. Nature 455, 620-626.

Stager J, Johnson T (2008) The late Pleistocene desiccation of Lake Victoria and the origin of its endemic biota. Hydrobiologia 596, 5-16.

Wagner CE, Keller I, Wittwer S, et al. (in press) Genome-wide RAD sequence data provides unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation. Molecular Ecology.

Table S1: Number of RAD loci and polymorphic sites obtained with different assembly and mapping criteria in 50 individuals from five cichlid species.

                                                               M1              M2              M4
De novo assembly
  Max. # mismatches between reads within locus (M of ustacks)  1               2               4
  # reads assembled                                            107.0 million   107.5 million   108.0 million
  # of putative loci 1)                                        144’310         136’386         126’613
Mapping
  Max. mismatches between read and “reference”                 2               2               2 within first 20 bp, sum of phred scores of all mismatching bases ≤164
  # of loci covered in ≥5 ind/species 2)                       40’741          43’566          41’410
SNPs
  Total # of SNPs 3)                                           122’137         126’238         249’183
  Total # of SNPs retained after quality filtering 4)          8’194           10’663          14’881
  Total # of polymorphic RAD loci after quality filtering 5)   6’080           7’111           8’145
  % loci with exactly 1 SNP                                    76.5%           70.1%           61.5%
  % loci with exactly 2 SNPs                                   18.2%           21.5%           21.3%
  % loci with 3 or more SNPs                                   5.3%            8.4%            17.2%

1) Total number of RAD loci produced by de novo assembly before subsequent filtering steps. This number includes, for example, monomorphic loci or loci present in a single individual.
2) Number of RAD loci recovered in at least 5 individuals per species at a read depth sufficient for genotype calling.
3) All SNP sites with quality score of at least 10. Based on the full dataset of 50 individuals.
4) A SNP site is retained if at least 5 individuals/species have a genotype assigned, the minor allele is observed at least 3 times, and the observed heterozygosity is ≤0.5 in all five species.
5) Total number of loci containing one or more high-quality SNPs. The following three rows indicate the percentage of loci containing exactly 1 SNP, 2 SNPs or ≥3 SNPs.

Figure S1: Proportion of outliers (FDR=20%; prior odds 10) among all polymorphic SNPs between all species pairs for each of the three assemblies (see Table S1 for details on assemblies). The outlier proportions are highly correlated among the three assemblies (R2 ≥ 0.78).

[Bar plot. y-axis: proportion of outliers (0.0% to 1.0%); x-axis: species pair; series: M1, M2, M4.]

Figure S2: Results of outlier scans assuming even prior odds for all pairwise comparisons among the five cichlid species. The barplot indicates the proportion of SNPs detected as significant outliers in each comparison.
In the bottom panel, each column represents a pairwise comparison and each row a SNP site showing outlier behaviour in ≥2 comparisons. We were specifically interested in identifying SNPs detected as outliers in ≥2 independent comparisons between the two genera and/or the two colour types. If this criterion was satisfied for genus and/or colour, we coloured all significant comparisons at that locus. Green = between-genus outlier in ≥2 independent comparisons; blue = between-colour type outlier in ≥2 independent comparisons; turquoise = outlier in ≥2 of both between-genus and between-colour comparisons. All other significant comparisons are indicated in grey. SNPs are ordered from top to bottom by the number of comparisons in which they were detected as outliers among the nine pairwise comparisons. lut=Mbipia lutea; mbi=M. mbipi; nyer=Pundamilia nyererei; pink=P. sp. “pink anal fin”; pund=P. pundamilia.

Figure S3: a) Results of a Structure analysis of the full dataset of 10’663 SNPs. The leftmost column shows the probability of the data (ln P(D)) for different numbers of genetic groups (K). The middle column shows the Structure barplots for K=2 where several alternative solutions were observed. Here, we present only the two dominant solutions. Under grouping 1, the species are grouped according to male nuptial colouration as indicated by the letters above the Structure barplots (Y=yellow; B=blue). Under grouping 2, the species group by genera with the exception of P. sp. “pink anal fin” which clusters with the two Mbipia species. To the right of the barplots, we indicate the number of times a given solution was observed among a total of ten runs and provide the average ln P(D) across these runs. All plots are averaged across all runs supporting a given grouping. The rightmost column shows the Structure barplot for K=5 averaged across 10 replicate runs.
b) Maximum likelihood tree based on the full dataset of 10’663 SNPs. Tip colours represent the species. The colours are consistent with male nuptial colouration. Triangles indicate Mbipia spp., circles Pundamilia spp. Values on branches are bootstrap support from 100 rounds of bootstrapping using RAxML’s rapid bootstrapping algorithm, and are shown only if ≥50.

Figure S4: Results of Structure analyses of three data subsets. The leftmost column shows the probability of the data (ln P(D)) for different numbers of genetic groups (K). The right column shows the Structure barplots for K=2 where multiple solutions were observed. Here, we present only the two dominant solutions. In the first, the species are grouped according to male nuptial colouration as indicated by the letters above the Structure barplots (grouping 1, by colour). In the second, the grouping is based on genera with the exception of P. sp. “pink anal fin” which clusters with the two Mbipia species (grouping 2). To the right of the barplots, we indicate the number of times a given solution was observed among a total of ten runs and provide the average ln P(D) across these runs. All plots are averaged across all runs supporting a given grouping. Intermediate=5’331 SNPs with FST values between the 25th and 75th percentiles of all locus-specific FST values ordered from lowest to highest; high=75th-99th percentile, 2’559 SNPs; top=above 99th percentile, 107 SNPs. Y=yellow; B=blue.

Table S2: Genetic diversity of five haplochromine species (Mbipia lutea, Mbipia mbipi, Pundamilia nyererei, Pundamilia sp. “pink anal fin” and Pundamilia pundamilia) at Makobe Island, Southern Lake Victoria based on the M2 assembly. The given average gene diversity (He) within a species is for polymorphic loci (sites) only.
Species name               Number of polymorphic loci (sites)   Average gene diversity (=He)   Standard deviation for He
M. lutea                   6803                                 0.105                          0.052
M. mbipi                   7442                                 0.118                          0.058
P. nyererei                6633                                 0.111                          0.055
P. sp. “pink anal fin”     6867                                 0.108                          0.053
P. pundamilia              6747                                 0.111                          0.055

Table S3: Neutral FST between all species pairs estimated based on 5’331 intermediate SNPs (i.e. between lower and upper quartiles of a list of all SNPs arranged in order of increasing global FST). lut=Mbipia lutea; mbi=M. mbipi; nyer=Pundamilia nyererei; pink=P. sp. “pink anal fin”; pund=P. pundamilia.

mbi nyer pink pund
lut mbi nyer pink
0.038 0.054 0.052 0.061 0.046 0.025 0.043 0.051 0.055 0.052
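
The M4 mapping criterion and the SNP retention rule given in footnote 4 of Table S1 can be written down compactly. The following Python sketch is for illustration only and is not the pipeline used for the analyses: the function names, argument layout, and genotype encoding (a pair of allele symbols, or None for a missing call) are assumptions made here, and sites are assumed to be biallelic. The heterozygosity cutoff presumably serves to remove paralogous sequence variants, which tend to appear as excess heterozygotes.

```python
from collections import Counter


def mapping_accepts_m4(mismatch_positions, phred_scores,
                       seed_len=20, max_seed_mismatches=2, max_phred_sum=164):
    """Hypothetical helper mirroring the M4 mapping rule.

    mismatch_positions: 0-based read positions that differ from the reference.
    phred_scores: per-base phred quality scores of the read.
    """
    # At most two mismatches are allowed within the first 20 bp.
    seed_mismatches = sum(1 for p in mismatch_positions if p < seed_len)
    # The summed quality over all mismatching positions must not exceed 164.
    phred_sum = sum(phred_scores[p] for p in mismatch_positions)
    return seed_mismatches <= max_seed_mismatches and phred_sum <= max_phred_sum


def retain_snp(genotypes_by_species, min_called=5, min_minor_count=3, max_het=0.5):
    """Hypothetical helper mirroring footnote 4 of Table S1.

    genotypes_by_species: dict mapping species name to a list of genotypes,
    each genotype a 2-tuple of alleles, or None if no genotype was assigned.
    """
    allele_counts = Counter()
    for genotypes in genotypes_by_species.values():
        called = [g for g in genotypes if g is not None]
        # At least 5 individuals per species need an assigned genotype.
        if len(called) < min_called:
            return False
        # Observed heterozygosity must be <= 0.5 in every species.
        het = sum(1 for a, b in called if a != b) / len(called)
        if het > max_het:
            return False
        for g in called:
            allele_counts.update(g)
    # A monomorphic site has no minor allele and is not a SNP.
    if len(allele_counts) < 2:
        return False
    # The minor allele must be observed at least 3 times overall.
    return min(allele_counts.values()) >= min_minor_count
```

With uniform base qualities of 41, `mapping_accepts_m4` accepts a read with four mismatches outside the first 20 bp (4 × 41 = 164) and rejects one with five, which matches the limit stated for the M4 setting.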