Effective population size shapes the patterns of molecular evolution in Drosophila

Background: A corollary of the nearly neutral theory of molecular evolution is that effective population size determines the efficiency of natural selection. Population size also seems to play a key role in the evolution of genome complexity [7]. Multispecies data on nucleotide diversity can be used to test the expected relationship between population size and selection efficiency at the DNA level. Using nucleotide data from Drosophila species, here we study the levels of synonymous polymorphism among species and the possible causes of their differences.

Results: Differences in the levels of synonymous polymorphism among Drosophila species were accurately assessed, showing that around 22% of the variation in synonymous polymorphism is explained by true differences among species. The differences in synonymous polymorphism are not greatly affected by differences in gene functions, phylogenetic relationships among species or demography. Selection efficiency seems to be positively correlated with the levels of synonymous polymorphism: 1) codon bias is positively correlated with the levels of synonymous polymorphism; 2) synonymous polymorphism was found to be negatively correlated with the proportion of repetitive sequences in the genomes; and 3) there are significant differences in the proportion of adaptive substitutions in orthologous genes of pairs of species with significant differences in the levels of synonymous polymorphism.

Conclusions: Differences in the levels of synonymous polymorphism are likely due to differences in long-term effective population size. These differences seem to have implications for the molecular evolution of the Drosophila species, as suggested by the positive relationships found between the levels of synonymous polymorphism and (1) evidence of positive selection on coding DNA and (2) the amount of bias for preferred codons.
Moreover, species with low synonymous polymorphism levels have a higher proportion of repetitive sequences in their genomes, suggesting a lower efficiency of natural selection in species with smaller population size. Effective population size seems to affect the patterns of molecular evolution even within a group of closely related species such as those belonging to the Drosophila genus.

Background

The neutral theory of molecular evolution claims that most evolutionary changes at the molecular level are not caused by Darwinian natural selection acting on advantageous mutants, but by random fixations of selectively neutral variants through the cumulative effect of sampling drift on the continued input of new mutations. According to the neutral theory, the rate of mutant substitutions in evolution is equal to the neutral mutation rate, and this rate is independent of the population size and environmental conditions [8]. The neutral theory assumes that mutations can be classified as neutral or deleterious. However, Ohta [9] proposed that most “neutral mutations” are in fact slightly deleterious, rather than strictly neutral. In this nearly neutral theory, the fate of a mutation depends on the relative forces of selection and drift [10]. This implies that the evolutionary rate is higher in smaller populations than in larger populations, because a slightly deleterious mutation behaves as selectively neutral when the product of the effective population size and the selection coefficient (Nes) is smaller than unity in absolute value, while it may be efficiently selected if Nes is larger than unity [11]. The effective population size (Ne) determines the degree to which gene frequencies are faithfully transmitted across generations [7].
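The threshold behaviour around Nes ≈ 1 described above can be illustrated numerically. The sketch below is not part of the original analysis: it uses Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population, and it assumes that census and effective sizes are equal (N = Ne), purely to keep the example short. The function name and the default population size are hypothetical.

```python
import math


def relative_fixation_probability(ne_s: float, n: int = 10_000) -> float:
    """Fixation probability of a new mutation relative to a neutral one.

    Assumes a diploid population with N = Ne = n (an illustrative
    simplification) and Kimura's diffusion approximation.
    """
    if ne_s == 0:
        return 1.0
    s = ne_s / n  # selection coefficient implied by the chosen Ne*s
    # math.expm1 keeps precision when the exponent is close to zero.
    p_fix = math.expm1(-2 * s) / math.expm1(-4 * ne_s)
    p_neutral = 1 / (2 * n)  # a new neutral allele fixes with probability 1/(2N)
    return p_fix / p_neutral


for ne_s in (-5.0, -1.0, -0.1, 0.0, 0.1, 2.0):
    print(f"Ne*s = {ne_s:+.1f}: relative fixation probability = "
          f"{relative_fixation_probability(ne_s):.3g}")
```

Under these assumptions, a mutation with Nes = -0.1 fixes almost as often as a neutral one, whereas one with Nes = -5 is almost never fixed, which is the sense in which the same selection coefficient can be effectively neutral in a small population and efficiently selected in a large one.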
Estimates of the effective population size can be obtained from the neutral nucleotide variation, since under neutral theory θ ≈ 4Neµ (where θ is the average heterozygosity per site for neutral mutations, Ne is the effective population size and µ is the neutral mutation rate per nucleotide site per generation [11]).

In recent years, slightly advantageous as well as slightly deleterious amino acid substitutions have been shown to occur in protein evolution [10, 12, 13]. Bustamante et al. [12] showed that the mean selection coefficients are very weak for Arabidopsis (Nes around -2 to 0.5) and Drosophila (Nes around 0.5 to 3), giving support to the nearly neutral theory [10]. Moreover, Fay et al. [14] and Eyre-Walker et al. [15] showed that a high fraction of amino acid mutations are slightly deleterious in mammals and Drosophila. However, the proportion of substitutions that have been driven by positive selection seems to be high in humans and Drosophila (around 30–50 %), showing that the fraction of positively selected mutations is not as small as would be expected under neutral theory [13, 16-19]. Overall, both genetic drift and positive selection are driving forces of molecular evolution, but the relative impact of the two forces in shaping the observed patterns of nucleotide variability is still under discussion.

Effective population size is a key factor in the nearly neutral theory of molecular evolution, because the fate of a mutation is determined by the product Nes. Akashi [3] showed that, for weakly selected silent mutations, the differences in codon bias between D. melanogaster and D. simulans could be explained by differences in effective population sizes. He estimated a coefficient of selection acting on preferred codons in D. simulans of Nes ≈ 2. Based on the observation that D. simulans seems to have a larger population size than D. melanogaster [20], Akashi [3] concluded that selection for preferred codons is inefficient in D. melanogaster.
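The relation θ ≈ 4Neµ given at the start of this passage can be inverted to obtain a rough Ne from a diversity estimate. The following is a minimal sketch only: the function names are hypothetical, and the diversity and mutation-rate values are placeholders chosen for illustration, not values from this study.

```python
def effective_population_size(theta: float, mu: float) -> float:
    """Invert theta ≈ 4 * Ne * mu to get Ne (diploid, neutral sites)."""
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return theta / (4 * mu)


def ne_ratio(theta_a: float, theta_b: float) -> float:
    """Ratio Ne(a) / Ne(b), assuming both species share the same mu."""
    if theta_b <= 0:
        raise ValueError("theta_b must be positive")
    return theta_a / theta_b


# Placeholder inputs for illustration only.
example_theta = 0.02   # hypothetical neutral diversity per site
example_mu = 5e-9      # hypothetical mutation rate per site per generation
print(effective_population_size(example_theta, example_mu))
print(ne_ratio(0.03, 0.01))
```

The second function reflects the reasoning used when comparing species: if the neutral mutation rate is assumed to be similar across species, ratios of neutral diversity estimate ratios of Ne without requiring a value for µ.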
Furthermore, Vicario et al. [21] found differences in the levels of genomic codon bias in Drosophila species, suggesting differences in effective population size as a possible explanation. A recent study to detect positively selected genes in the genomes of humans and chimpanzees showed that, although the rate of nonsynonymous substitution in humans is higher than in chimpanzees, the number of genes with evidence of positive selection is substantially smaller in humans [22]. These results are explained by the reduced efficacy of natural selection in humans because of their smaller long-term effective population size.

Recently, a more far-reaching role has been given to effective population size in relation to the evolution of genome complexity. Lynch and Conery [7] have proposed that the changes in genome complexity from prokaryotes to multicellular eukaryotes, including gene number, intron abundance and mobile genetic elements, emerged passively in response to long-term population-size reductions that accompanied increases in organism size. According to this hypothesis, much of the restructuring of eukaryotic genomes was initiated by non-adaptive processes, which in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection [7].

Multispecies data on nucleotide diversity are a very useful source of information that can be used to test the expected relationship between population size and selection at the DNA level. Drosophila species have been the focus of genetic and evolutionary studies for decades, and the availability of the sequences of 12 genomes of these species now makes them the focus of evolutionary genomics [21]. The availability of nucleotide polymorphism data in these species (DPDB, “Drosophila Polymorphism Data Base” [23, 24]) allows us to test the hypothesis that effective population size and selection efficiency are positively correlated.
Since synonymous polymorphism is thought to be mainly neutral (πs ≈ θ), the differences detected in the levels of synonymous polymorphism among species could be due to differences in effective population size. Therefore, in this paper we first assess the differences in the levels of synonymous polymorphism among Drosophila species. Then, we evaluate different possible causes that could explain the detected differences. The results indicate that effective population size seems to be a major factor in explaining the differences detected in the levels of synonymous polymorphism among species. Furthermore, we tested the hypothesis that selection efficiency is positively correlated with effective population size by relating the levels of synonymous polymorphism to i) the bias for preferred codons, ii) genome content, and iii) the amount of adaptive substitutions. The results show that both the levels of codon bias and the proportion of adaptive substitutions seem to take higher values in species with higher levels of synonymous polymorphism. Moreover, species with low synonymous polymorphism have a higher proportion of repetitive sequences in their genomes, suggesting a diminished efficiency of selection in species with smaller population size. Our results provide strong evidence in favour of the importance of effective population size as a factor determining the patterns of molecular evolution even within a group of closely related species such as those belonging to the Drosophila genus.

Results

Differences in the levels of synonymous polymorphism among species

DPDB (Drosophila Polymorphism DataBase [23, 24]) provides polymorphism data by gene and species in the Drosophila genus, giving several evaluations of the quality of the estimates and additional information on the genes, such as the coordinates in the D. melanogaster genome and the Gene Ontology classification of the genes.
A link to the source information of the sequences in GenBank and EMBL, and the possibility of reanalyzing any gene-species dataset, are also provided on the web page. We filtered the data from DPDB (see Methods), obtaining a dataset with 751 estimates of synonymous polymorphism (πs) [25], belonging to 482 different genes of 15 Drosophila species. The DPDB accession numbers and the values of πs and Tajima’s D (Tajima 1989) by species and gene are listed in Supplementary Table 1. Figure 1 shows the mean πs values per species.

To assess the differences in synonymous polymorphism levels among Drosophila species, we performed an unbalanced analysis of variance. Since different types of genes are included in the 15 analyzed species, the datasets by species could contain a biased sample of genes with high codon bias or with a high rate of selective sweeps, which would decrease the level of synonymous polymorphism in a particular species. Therefore, a two-factor analysis (gene and species) was performed. Significant differences were found for the species factor, but neither the gene factor nor the interaction between species and genes was significant (Table 1). Since the data of D. melanogaster alone account for 40% of the dataset (Figure 1), the same analysis was performed excluding the data of this species, with a similar result (Table 1).

To evaluate which species differed in their πs values, we performed a post hoc pairwise Tukey test estimating the differences among pairs of species. Because the effect of a single gene among the 482 different genes analyzed can go undetected by the variance analysis, we also performed a paired T test among orthologous genes of pairs of species (Table 2). The results of the Tukey test showed that the pairs of species dmel-dsim, dsim-dmir, dpse-dmir, dyak-dsim, dmir-dmel, dsim-dsan, dmir-dari, dmir-dame, dmir-dkik and dkik-dsan are significantly different in their levels of πs.
The results of the paired T test were concordant, showing that the pairs of species dmel-dsim (N=114), dsim-dyak (N=10) and dsim-dmir (N=10) are significantly different in their πs values (Table 2). Differences detected as significant by the Tukey test were borderline significant for the pairs of species dpse-dmir (N=9) and dmir-dmel (N=10) and not significant for dsim-dsan (N=5) by the paired T test (Table 2). Moreover, the paired T test detected significant differences that were undetected by the Tukey test between the pairs of species dsim-dsec, dsec-dmau, dmel-dsec and dmel-dpse (Table 2). Finally, there are significant differences between pairs of species detected by the Tukey test (dmir-dari, dmir-dame, dmir-dkik, dmir-dmoj and dkik-dsan) that could not be tested by paired T tests because no data for orthologous genes are available for these pairs of species. These results are consistent with the previous ANOVA, indicating that the differences in πs among species are not due to the different genes sampled within each species, because the paired T test compares the same set of genes for each pair of species.

The differences in levels of πs between dmel and dsim were highly significant both by the Tukey test and the paired T test. African and derived populations of D. melanogaster and D. simulans seem to have different levels of synonymous polymorphism [26-29]. We therefore grouped the gene sequences by their origin and recalculated the πs values by gene. We obtained a dataset of 91 orthologous genes from non-African populations and 23 orthologous genes from African populations of D. melanogaster and D. simulans. We detected significant differences among πs values in non-African populations of these two species, but the differences were not significant in African populations (Table 2). All comparisons in Table 2, including dsim and dmel with other species, are based on non-African populations of these species.
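The logic of the paired comparison described above (the same orthologous genes measured in two species) can be sketched as follows. This is an illustrative outline using only the standard library, not the authors' analysis pipeline; the function name and the πs values are hypothetical, and the p-value lookup against a t distribution is left out.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """Return (t, degrees of freedom) for paired observations x[i], y[i].

    Each pair is the same orthologous gene measured in two species, so
    gene-specific effects cancel in the per-gene differences.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    se = sd / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical per-gene synonymous diversity for two species.
species_a = [0.030, 0.025, 0.040, 0.035]
species_b = [0.010, 0.015, 0.020, 0.005]
t, df = paired_t_statistic(species_a, species_b)
print(f"t = {t:.3f}, df = {df}")
```

Because the test works on within-gene differences, a difference between species cannot be produced merely by one species having been sampled for a different set of genes.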
Several causal factors could lead to the differences detected in the levels of πs among species in the above analysis: 1) different selective constraints in diverse functional categories of genes of Drosophila species could affect the πs levels by selection acting on linked non-synonymous sites [19, 21, 30]; 2) species belonging to different groups of species have different average recombination rates [24, 33], which could affect the levels of synonymous polymorphism [31]; and 3) deviations from the neutral equilibrium caused by demographic events, such as population bottlenecks, population structure or population expansion, could influence the level of synonymous polymorphism [34, 35].

Differences in selective constraint among genes. We performed ten different ANOVAs testing the model πs = species + GO, where GO is a component that sorts genes as belonging or not to one of ten different categories of biological process at level 3 of the Gene Ontology (see Methods). The results are summarized in Table 4. While the species component has a significant effect on the levels of πs, differences in the levels of πs due to functional category were detected only for genes belonging to functional categories related to metabolic processes (cellular metabolism, macromolecule metabolism and primary metabolism) and sexual reproduction. These genes have significantly lower πs values than genes from other functional categories (Supplementary Results, Table 2). The functional category effect suggests different selective constraints for these genes. To test this assumption, the levels of constraint (πn / πs) were investigated for the four categories, showing that the genes involved in metabolic processes are significantly more constrained than the rest of the genes (average πn / πs = 0.13 vs. 0.28), while the genes involved in sexual reproduction are less constrained than the other genes (average πn / πs = 0.30 vs.
0.22), although in this last case the differences were not significant. Nevertheless, the percentage of the variance explained by the differences between species was 14.38 % and the mean percentage of the variance explained by the functional differences was 2.6 %, showing once more that the variation in the levels of πs is mainly due to true differences among species (Table 3).

Taxonomic relationships among species. We performed a fully nested ANOVA to test whether the differences in the levels of synonymous polymorphism could be explained by differences among subgenera or taxonomic groups of species. The differences in the levels of synonymous polymorphism explained by differences in subgenus were null (Table 5), differences among species groups explained a very low percentage of the total variance (1.42%), and the differences among species within groups explained a significant percentage of the total variation in the levels of synonymous polymorphism (14.2%; Table 5). This indicates that the differences in the levels of synonymous polymorphism can be attributed mainly to differences among species within the groups of species. The results in the previous section show that some percentage of the variance of πs is explained by differences in the selective constraints of the genes. Therefore, to obtain a more accurate percentage of the variance explained by the differences among species, data were grouped by the functional category of the genes, and the fully nested ANOVA was performed for each functional category. In all analyses the species component explained a higher percentage of the variance (Supplementary Results, Table 3). The average percentage of the variance in the levels of πs explained by the differences among species, for the 10 datasets in which genes were grouped by function, was 22% (Supplementary Results, Table 3).

Deviations from the neutral equilibrium.
To evaluate whether the differences in the levels of synonymous polymorphism could be due to deviations from the neutral equilibrium of the populations, we tested differences in the D statistic of the Tajima test [34]. Tajima’s D statistic measures the deviation from the expected mutation-drift equilibrium in a population of constant size. Deviation from the neutral equilibrium could be produced by selective events, such as selective sweeps, or by demographic events, such as bottlenecks or population expansion [34]. However, demographic events will generate an overall deviation across the whole genome [13]. Therefore, if the levels of synonymous polymorphism are affected by demographic events in a species, the mean value of Tajima’s D would be different for this species. The mean values of D are shown in Table 1. As with πs, we performed an unbalanced ANOVA to detect differences among mean values of D by species, and post hoc Tukey and paired T tests for orthologous genes to detect differences among pairs of species. The ANOVA showed significant differences in the mean values of D among species (F(14,736) = 3.01, p<0.001). The Tukey test showed that the differences could be mainly attributed to D. miranda, which has a mean value of D significantly different from the rest of the species. Moreover, differences were found between the pair of species dyak-dsim. The paired T test confirmed this last difference. However, the differences in Tajima’s D between D. miranda and the rest of the species were not detected by the paired T test. An examination of the data of this species showed that the differences in the values of D in D. miranda are due to 16 genes from the NeoX chromosome of this species. Bachtrog and Andolfatto [2] found that genes from the NeoX chromosome seem to deviate from equilibrium, possibly due to a higher rate of adaptive substitutions in this chromosome combined with demographic events [2].
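For readers unfamiliar with the statistic used in this section, the sketch below computes Tajima's D for a single locus from its standard definition: the difference between the mean number of pairwise differences and the estimate of θ based on the number of segregating sites, scaled by its expected standard deviation. The function name and the example inputs are illustrative and are not taken from the dataset analyzed here.

```python
import math


def tajimas_d(n: int, segregating_sites: int, pi: float) -> float:
    """Tajima's D for one locus.

    n: number of sampled sequences (at least 4, so the variance term is positive)
    segregating_sites: number of segregating sites S in the sample
    pi: mean number of pairwise differences per locus (not per site)
    """
    if n < 4 or segregating_sites < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = segregating_sites
    # Numerator: pi minus Watterson's estimate of theta (S / a1).
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Illustrative inputs: 10 sequences, 16 segregating sites.
print(round(tajimas_d(10, 16, 3.888889), 3))  # about -1.446
```

Negative values indicate an excess of low-frequency variants relative to the neutral expectation, as produced by a population expansion or a recent selective sweep; positive values indicate an excess of intermediate-frequency variants.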
The Tukey analysis without these genes does not show significant differences in Tajima’s D between D. miranda and the rest of the species. A paired T test for Tajima’s D statistic was performed for all pairs of species with orthologous genes and evidence of significant differences in the levels of synonymous polymorphism (Table 2). Significant differences in Tajima’s D values by paired T test for orthologous genes were only found between NA_dmel and NA_dsim (p = 0.010, N = 91) and between dyak and NA_dsim (p = 0.005, N = 10). Results of coalescent simulations indicate that the mean values of Tajima’s D by species are within the values expected under a neutral model (Supplementary Results).

Testing the hypothesis that selection efficiency is positively correlated with effective population size

The analyses of the previous section indicate that the differences in πs observed in the dataset are mainly explained by differences among species, which could be differences in effective population size. In the next sections we test the hypothesis that selection efficiency and effective population size are positively correlated, assuming that differences in synonymous polymorphism are representative of differences in effective population size (Ne). Differences in selection efficiency were assessed through differences in: 1) selection for preferred codons, 2) genome contents, and 3) proportion of adaptive substitutions.

1. Selection of preferred codons in Drosophila species

Codon bias has been widely detected in Drosophila species [3, 36-38]. Recently, an analysis of the whole genome data of 12 species of Drosophila showed differences in the levels of codon bias among species, which could be explained by differences in effective population size [1]. In our dataset, there are data for 8 out of the 12 species studied by Vicario et al. [1].
Therefore, to test the hypothesis that the differences in the levels of codon bias are due to differences in effective population size, we performed correlation and regression analyses between πs and average genomic ENC. The results show that πs is negatively correlated with ENC (rSpearman = -0.149, p = 0.0003, N = 594; Figure 2a). The regression was significant (r2 = 0.037, F(1,538) = 20.7; p<0.0001; N = 540) as measured by the F statistic of the ANOVA of the regression, following Sokal and Rohlf [39]. This result could be biased by the different number of πs values from different species (Figure 2a). We investigated this bias by analyzing a dataset (Dataset 2) with 30 genes of D. melanogaster and D. simulans selected at random, and obtaining confidence intervals for the correlation by bootstrapping. The correlation in this new dataset (N = 175) was rSpearman = -0.151, p = 0.046, with a 95% confidence interval of -0.273 to -0.024. This negative association between ENC and πs is indicative of a positive association between codon bias and πs, which is expected if πs levels are indicators of effective population size, and agrees with Vicario et al.’s [1] hypothesis.

2. Differences in genome contents among Drosophila species.

Lynch and Conery [7] have pointed out the importance of genetic drift in genome evolution. Basically, they have shown that effective population size is negatively correlated with genome size, the number of different types of transposable elements and the number and size of introns in the genomes. The implication of these results is that much of the restructuring of eukaryotic genomes was initiated by non-adaptive processes, and this in turn provided novel substrates for the secondary evolution of phenotypic complexity by natural selection [7]. With recently published genome data from the twelve species of Drosophila, we can test Lynch and Conery’s [7] hypothesis among related species belonging to a single genus.
The relationship between πs for 8 Drosophila species and different genome variables (percentage of coding, intron and repeat sequences in the genomes) was studied by correlation and regression analysis. As in the section above, these correlations could be biased by the different number of πs values from different species. We investigated this bias by performing the correlations with Dataset 2 and bootstrapping the correlation. The results are shown in Table 6. Correlations between the percentage of repetitive sequences in the genome and πs were negative and significant (Table 6, Figure 3 a-d). The correlation between the percentage of coding sequences in the genome and πs was positive and significant when the dataset with all data was analyzed, but this result seems to be affected by the higher number of data points for D. melanogaster and D. simulans in that dataset (Table 6). The results of correlation analyses between the percentage of intronic sequences and πs were contradictory, depending on the measure of genome size (see Methods) and the dataset (Table 6). The regression analyses considering πs as the independent variable showed similar results, suggesting that between 2 and 6% of the variation in the percentage of repetitive sequences in the genomes could be explained by the effective population size (Table 6).

3. Differences in the proportion of adaptive substitution among Drosophila species.

In a complete genome analysis, Bakewell et al. (2007) found that the chimpanzee lineage has more positively selected genes than the human lineage. Their interpretation of this result is that selection efficiency was weaker in the human lineage due to a lower effective population size.
We performed an exploratory analysis to find differences in the proportion of adaptive substitutions for orthologous genes between pairs of species with evidence of differences in the levels of πs detected by the Tukey and/or paired T test (Table 2), except for the comparison dsim-dsan, since there are no nonsynonymous polymorphisms in the dataset. We calculated an average alpha (ᾱ) and an average number of adaptive substitutions (ā) following Smith and Eyre-Walker [16] (see Methods). The calculated ᾱ values are presented in Table 7. All estimated ᾱ values are within the confidence intervals estimated by bootstrapping for the values of α by gene (equation 2 of Smith and Eyre-Walker [16]; see Methods). The confidence intervals were very wide in many of the cases where the number of genes analyzed is low (Supplementary Results, Table 4 and Table 5). Thus, the estimated ᾱ values are representative of the genes analyzed, but not of the species. The effect of each gene on the estimates of ᾱ based on 10 or fewer genes was evaluated by leave-one-out resampling, recalculating ᾱ in N datasets (where N is the number of genes), each omitting a different gene. The results show that the differences in ᾱ between pairs of species are not greatly affected by single genes (Supplementary Results, Table 6), except in the comparison of dmel-dmir. In this dataset the ᾱ of dmel varied widely (from 0.03 to 0.98) depending on the genes included (Supplementary Results, Table 6). The differences in the average number of adaptive substitutions (ā = ᾱ × Dn) are significant in all cases, except in the comparisons dmel-dpse and dmel-dmir (Table 7). In all comparisons, the results show that species with significantly higher levels of synonymous polymorphism have a higher proportion of adaptive substitutions for orthologous genes, except in the comparison dmir-dmel (Table 2 and Table 7).
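To make the quantities in this paragraph concrete, the sketch below computes α and the number of adaptive substitutions from divergence and polymorphism counts. It assumes the basic McDonald-Kreitman-based estimator α = 1 − (Ds·Pn)/(Dn·Ps); the exact averaging across genes used for the published values is described in the Methods and is not reproduced here. The function names and the counts are hypothetical.

```python
def alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Proportion of nonsynonymous substitutions driven by positive selection.

    dn, ds: nonsynonymous and synonymous substitutions (divergence)
    pn, ps: nonsynonymous and synonymous polymorphisms
    Assumes the basic estimator 1 - (Ds * Pn) / (Dn * Ps).
    """
    if dn == 0 or ps == 0:
        raise ValueError("estimator is undefined when Dn or Ps is zero")
    return 1 - (ds * pn) / (dn * ps)


def adaptive_substitutions(dn: int, ds: int, pn: int, ps: int) -> float:
    """Number of adaptive substitutions, a = alpha * Dn."""
    return alpha(dn, ds, pn, ps) * dn


# Hypothetical counts for illustration only.
print(alpha(dn=20, ds=40, pn=5, ps=20))                   # 0.5
print(adaptive_substitutions(dn=20, ds=40, pn=5, ps=20))  # 10.0
```

The estimator becomes negative when the ratio of nonsynonymous to synonymous polymorphism exceeds the corresponding divergence ratio, which is one reason estimates from few genes can vary widely.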
Furthermore, the ratio Dn/Ds showed the same trend in most of the comparisons (Table 7), suggesting a higher number of nonsynonymous substitutions than expected from the number of synonymous substitutions in the species with higher levels of synonymous polymorphism.

Discussion

Differences at synonymous polymorphism levels among Drosophila species

We found significant differences in the levels of synonymous polymorphism among Drosophila species (Table 2). For those species whose polymorphism had already been studied, our estimates of synonymous polymorphism were similar to those previously reported [2, 6, 17, 19, 40]. These differences among species could be explained neither by differences in the genes sampled for the different species nor by differences in selective constraints among genes (Tables 3 and 4). The small effect detected for selective constraints among genes does not explain the differences found among species. Genes related to metabolic processes were shown to be more constrained than other genes, whereas genes related to sexual reproduction were less constrained. These results are in agreement with the evolutionary rates calculated for the genes belonging to these functional categories in genome-wide analyses [21]. The phylogenetic relationships among species do not seem to affect the differences in the levels of synonymous polymorphism found among Drosophila species (Table 5, and Supplementary Results, Table 3).

Population bottlenecks, structure or expansion could affect the levels of polymorphism in the species [34, 35]. Our analysis of Tajima’s D to detect deviations from neutrality due to the demography of the species showed differences for data from the Neo X chromosome of D. miranda, and for the comparisons of D. simulans with D. melanogaster and D. yakuba. Deviations from neutral equilibrium of the genes of the Neo X chromosome in D.
miranda had already been detected by Bachtrog and Andolfatto [2] and are possibly due to a combination of adaptive selection and demographic events [2]. We think that the differences found in the values of Tajima’s D in the comparisons dsim-dyak and dsim-dmel (Table 2) are due to D. simulans, since the mean values of Tajima’s D are not significantly different in other comparisons including D. melanogaster or D. yakuba. Moreover, Bierne and Eyre-Walker [41] suggest that, although the effective population size of D. simulans has remained fairly stable, populations of D. simulans have not reached a stochastic steady state. This is because D. simulans seems to have a large long-term effective population size and the neutral alleles are not yet fixed [41], which is congruent with the higher mean value of Tajima’s D found for this species when compared with D. melanogaster or D. yakuba. We did not perform an accurate evaluation of the deviation from neutral equilibrium due to demographic events for each species. However, our analysis of Tajima’s D suggests that the bulk of the differences found in the levels of synonymous polymorphism are not caused by demographic events. A more in-depth analysis using genome-wide polymorphism data would be necessary to fully rule out a demographic effect on the differences in synonymous polymorphism among Drosophila species.

Differences in effective population size among Drosophila species have been proposed in several studies. Aquadro [20] and Akashi [4] proposed differences in effective population size between D. melanogaster and D. simulans. However, whether these differences are true or a side effect of the demographic histories of these two species has been widely discussed [27, 42, 43]. Drosophila sechellia is a species endemic to an island, and a small effective population size has been proposed for this species [6].
Moreover, differences in effective population size have been proposed for the sister species D. miranda and D. pseudoobscura [5]. Finally, a recent study of the genome-wide levels of codon bias among Drosophila species suggests that the differences found among the levels of codon bias could be attributed to differences in effective population size [1]. Given all this evidence indicating differences in effective population size among Drosophila species, in addition to the results presented in the first section of this study, we consider that the differences in the levels of synonymous polymorphism found here could be indicative of differences in effective population size.

Selection for preferred codons

We found a slight but significant correlation between synonymous polymorphism and the whole-genome ENC estimates [1] in eight of the twelve species with sequenced genomes (rSpearman = -0.14956, p = 0.0003, N = 594; Figure 2). The regression between codon bias (dependent variable) and πs (independent variable) was significant (r2 = 0.0371, F(1,538) = 20.7; p<0.0001; N = 540), showing that nearly 3.7% of the variation in the levels of codon bias is due to differences in πs. This result can be interpreted in terms of the nearly neutral theory. In addition to the differences detected in the patterns of mutation and selection on synonymous sites in Drosophila species [38], the coefficients of selection for preferred codons are very weak (Nes ≈ 2 for D. simulans and D. sechellia [3, 38]). Therefore, if the differences observed in the πs levels among species are due to differences in effective population size, the expected correlation between mean πs and ENC will be negative, since ENC is an inverse measure of codon bias. Thus, weak positive selection seems to be more effective in species with larger effective population size. Vicario et al. [1] have shown that the levels of codon bias among the twelve genome-sequenced species of Drosophila are significantly different.
They interpreted their results in terms of differences in effective population size among the Drosophila species. Therefore, the results presented here provide quantitative evidence in favour of Vicario et al.'s [1] hypothesis that the differences in observed levels of codon bias are due to differences in effective population size.

Genome content

Lynch and Conery [7] showed that many of the features of complex genomes could have been initiated by nonadaptive processes due to reductions in effective population size. Following their analysis strategy, we tested their hypothesis using the data from the sequenced genomes of Drosophila and our estimates of 4Neμ (πs). We found that πs is negatively correlated with the percentage of repetitive sequences in the genomes, which is an approximate measure of the content of transposable elements in the genomes (Table 6). This correlation does not seem to be biased either by the dataset or by the method used to estimate the percentage of repetitive sequences in the genomes. According to Lynch and Conery's [7] interpretation, in species with low effective population size, selection is ineffective against mildly deleterious insertions into introns or insertions of transposable elements, which increase genome size. Bosco et al. [44] showed that the differences in genome size among species are mainly due to differences in the repetitive satellite DNA contained in the heterochromatin. The two estimates of the percentage of repetitive sequences analysed in this study are based on the euchromatic portion of the genomes and may be underestimates. Moreover, the two measures of the percentage of repeats considered here differ from each other. This variation is noticeable mainly in dmel, dsec and dvir (Figures 3 a-d). According to the cytometric measures of genome size and the results of Bosco et al.
[44], the percentage of repeats calculated from the number of base pairs identified as repeats by RepeatMasker (%repeat 1; see Methods) seems to be underestimated for dvir and overestimated for dmel and dsec. In addition, although there is evidence of a small effective population size for D. sechellia (Figure 3) [6], the very low level of synonymous polymorphism estimated for this species could be biased by the genes analysed in this study. Thus, our results agree with those of Lynch and Conery [7], showing that genetic drift could play a crucial role in genome evolution. Genome size seems to increase through insertions of slightly deleterious sequences, which are either effectively eliminated by natural selection or fixed by genetic drift, depending on the effective population size. The correlations between synonymous polymorphism and the percentages of coding and intron sequences in the genomes were more difficult to reconcile between the two analysed datasets. Although there seems to be a trend towards a positive association between the proportion of coding sequences in the genomes and πs, it is weak. Estimates of genome content are currently rough and unfinished for the bulk of the species analysed. Improved estimates, together with synonymous polymorphism data based on a larger number of genes and including the 4 species lacking in this study, would help to sort out these relationships in more detail.

Adaptive selection

Selection efficiency, defined as the effectiveness of natural selection in determining the fate of a mutation, depends on the product of effective population size and the selection coefficient (Nes, [8]). Therefore, for two species differing in population size and with similar selection coefficients, we expected to find more evidence of positive selection in the species with larger Ne. Bakewell et al. [22] found evidence supporting this expectation when analysing the genomes of chimpanzees and humans.
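The dependence of selection efficiency on Nes can be illustrated numerically. The sketch below is not part of the analyses of this study: it uses Kimura's diffusion approximation for the fixation probability of a new additive mutation and assumes, for simplicity, that the census size equals Ne. The function names and the selection coefficients (s = ±1e-6) are hypothetical values chosen only for illustration.

```python
import math


def fixation_probability(s, ne):
    """Diffusion approximation for the fixation probability of a new mutation.

    s  : additive selection coefficient (negative for deleterious mutations)
    ne : effective population size; the census size is ASSUMED equal to ne,
         so the initial frequency of the mutation is 1 / (2 * ne).
    """
    p0 = 1.0 / (2.0 * ne)
    if s == 0:
        return p0  # strictly neutral: fixation probability equals initial frequency
    # u = (1 - exp(-4*Ne*s*p0)) / (1 - exp(-4*Ne*s)); expm1 keeps precision for tiny s
    return math.expm1(-4.0 * ne * s * p0) / math.expm1(-4.0 * ne * s)


def relative_to_neutral(s, ne):
    """Fixation probability expressed relative to a strictly neutral mutation."""
    return fixation_probability(s, ne) * 2.0 * ne


# Hypothetical selection coefficients, evaluated over a range of Ne.
for ne in (1e4, 1e5, 1e6, 1e7):
    adv = relative_to_neutral(1e-6, ne)
    dele = relative_to_neutral(-1e-6, ne)
    print(f"Ne={ne:.0e}  Nes={ne * 1e-6:g}  advantageous={adv:.3g}  deleterious={dele:.3g}")
```

Under these assumptions, mutations with |Nes| well below 1 fix at close to the neutral rate regardless of the sign of s, whereas at |Nes| = 10 the advantageous mutation fixes far more often than a neutral one and the deleterious mutation almost never does.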
The comparisons of the proportion of adaptive substitutions presented in Table 7 suggest a similar interpretation for Drosophila species; that is, species with larger effective population size have a higher proportion of adaptive substitutions. However, the results presented in Table 7 are based on comparisons of a small number of genes and are thus suggestive rather than conclusive. They show the same trend as the other two analyses of selection efficiency: in species with higher levels of synonymous polymorphism (Table 2), adaptive selection seems to be more frequent. An exception is the comparison between D. melanogaster and D. miranda (Table 7), but this seems to be affected by the genes analysed (Supplementary Results, Table 6). Because the comparisons involving D. melanogaster, D. simulans and other species were made with genes sampled in non-African populations, the differences found in α might not be due to differences in long-term effective population size. Changes in population size might affect the proportion of mutations that are effectively neutral [45]. Therefore, demographic processes may have a major influence on selective constraint, violating the assumption of the McDonald and Kreitman test [45, 46] of constant constraint across lineages [41, 45, 47]. The value of α could be overestimated when the current population size is larger than in the past [41, 45]. This is because slightly deleterious mutations that were fixed in the past are no longer segregating as polymorphisms [41]. However, the population sizes of D. melanogaster and D. simulans seem to have declined since their expansion out of Africa [4, 19, 40]. In fact, the levels of synonymous polymorphism found in this study are higher for African than for non-African populations of these two species (Table 2).
Therefore, α for these two species could be underestimated rather than overestimated when compared with other species [41]. However, the differences found in α between D. melanogaster and D. simulans might be affected by the demographic histories of these two species. The proportion of adaptive substitutions in D. melanogaster and D. simulans has been estimated in several studies [13, 16, 17, 19, 47]. Moreover, the proportions of adaptive substitutions in D. melanogaster and D. simulans were extensively evaluated, compared and discussed by Bierne and Eyre-Walker [41], taking into account differences in linkage disequilibrium among the genes analysed and demographic histories. Our estimates of α for these two species in the comparison of non-African populations are similar to those obtained by them and by other authors [17, 41, 47]. Fay et al. [13] found that adaptive substitutions in D. melanogaster could be underestimated because of slightly deleterious mutations segregating as polymorphisms. We did not exclude singletons from our analysis; however, the α calculated for D. melanogaster in the D. melanogaster-D. yakuba lineage was similar to that calculated by Bierne and Eyre-Walker excluding singletons [41] and by Shapiro et al. [17] in a larger dataset. It is possible that many of the singletons in our D. melanogaster dataset were lost when the number of sequences was equalized with that of D. simulans before counting polymorphic sites (see Methods). Although the difference in synonymous polymorphism for 23 genes of African populations of D. melanogaster and D. simulans is not significant (Table 2), the level of synonymous polymorphism in D. simulans is still higher than in D. melanogaster. The α calculated for 21 of these genes was similar to that obtained for non-African populations (A_dmel = 0.33; A_dsim = 0.53).
This result seems to contradict our hypothesis, taking into account a recent result indicating no differences in effective population size between African populations of D. melanogaster and D. simulans [27]. However, these estimates of α are based on a small number of samples. More accurate estimates of synonymous polymorphism and α from African populations of these two species would be necessary to determine the exact differences between their long-term effective population sizes.

Conclusions

This study presents a multispecies analysis of the levels of synonymous polymorphism among Drosophila species. Differences detected in the levels of synonymous polymorphism among species are not greatly affected by differences in gene functions, phylogenetic relationships among species or demography. Although we cannot demonstrate that the differences in the levels of synonymous polymorphism reflect true differences in long-term effective population size, this seems a likely possibility. These differences in effective population size could have implications for the molecular evolution of Drosophila species, as suggested by the analyses testing differences in selection efficiency.
1) The differences in whole-genome estimates of codon bias among Drosophila species are correlated with the differences in the levels of synonymous polymorphism, indicating that weak selection is possibly more effective at larger estimated effective population sizes.
2) We found evidence that the possible nonadaptive fixation of sequences in the genomes is negatively correlated with the levels of synonymous polymorphism of the species, indicating a lower efficiency of natural selection in species with lower effective population size.
3) Species with lower synonymous polymorphism seem to have smaller proportions of adaptive substitutions in orthologous genes.
Overall, the analyses show that Drosophila species may differ in their effective population sizes, which seems to affect the patterns of molecular evolution of these species.

Methods

Data

Data on synonymous polymorphism (πs [25]) and Tajima's D [34] by gene were obtained from DPDB (Drosophila Polymorphism DataBase, http://dpdb.uab.es, [23]). To obtain reliable estimates of πs, the data available in the database were filtered according to the following criteria:
1. Only species with more than five analysed genes and πs estimates based on 4 or more sequences were considered.
2. Alignments with total gap length or length differences larger than 30% were discarded.
3. Genes from transposable elements (gag, pol, etc.) and pseudogenes were discarded (N=9).

The filtered dataset consisted of 754 estimates of πs, belonging to 482 different genes of 15 Drosophila species (D. americana, D. arizonae, D. kikkawai, D. mauritiana, D. melanogaster, D. miranda, D. mojavensis, D. persimilis, D. pseudoobscura, D. santomea, D. sechellia, D. simulans, D. subobscura, D. virilis and D. yakuba) and four different species groups (melanogaster, obscura, virilis and repleta). The estimates of πs in the dataset were based on alignments with an average of 16.22 sequences and 1,181.62 analysed sites, with a mean and standard deviation of 0.0165 and 0.0063, respectively. Although we do not have information on the chromosomal location of all the genes analysed in the different species, the distribution of the orthologous genes in the D. melanogaster genome seems to be random (almost the same number of genes from each chromosome arm). The genes in the dataset were classified according to their Gene Ontology category (Biological Process at level 3) using the FatiGO website (http://fatigo.bioinfo.cipf.es, [48]).
The ten best-represented categories in the dataset were used to group the genes by function: anatomical structural development (15.5%); cell communication (16.5%); cellular metabolic process (53.1%); cellular organization and biogenesis (14.1%); cellular developmental process (15.9%); macromolecule metabolic process (46.8%); multicellular organismal development (26.3%); primary metabolic process (50.4%); regulation of biological process (27.3%) and sexual reproduction (10.9%). Genomic coding sequences of D. yakuba, D. melanogaster, D. mojavensis and D. pseudoobscura were retrieved from precomputed genome alignments in the Vista Genome Browser (http://pipeline.lbl.gov/cgi-bin/gateway2, [49]). Genome data (genome size, total coding and intron sequences, and repeat coverage) were obtained from [21]. The percentages of coding and intron sequences in the genomes were calculated as the total coding or intron sequence divided by the total genome size. Two measures of total genome size were considered, one calculated as the sum of the base pairs in genomic scaffolds >200,000 bp [21] and the other based on flow cytometry [44]. The total percentage of repeats in the genomes was obtained from the UCSC web page (http://genome.ucsc.edu/, [50]) as calculated with RepeatMasker. The total genomic codon bias was obtained from Vicario et al. [1] as measured by ENC (Effective Number of Codons; [51]).

Analysis

The variable πs is not normally distributed; therefore, the values of πs were transformed to natural logarithms, thereby obtaining a normally distributed dataset which allowed us to perform parametric analyses (ANOVAs and regression analysis). Since different genes are included in the 15 analysed species, any difference found among species could be an artefact of this gene heterogeneity. To take this potential source of error into account, several different approaches were used:
1.
A two-factor analysis of variance (unbalanced) was carried out to test whether the observed differences in πs values (dependent variable) were due to species or genes. Since the data of D. melanogaster alone account for 40% of the dataset (Figure 1), the same analysis was performed excluding the data of this species.
2. Both post hoc Tukey tests on log-transformed data and two-tailed paired t tests (for orthologous genes in pairs of species) were carried out to test all pairwise differences among species.

Differences in selective constraint. Differences in selective constraint among genes belonging to different functional categories were tested by ten unbalanced analyses of variance with the underlying model πs = species + GO, where GO is a component that classifies the genes as belonging or not to one of the ten categories of biological process at level 3.

Phylogenetic relationships among species. The potential effect of the phylogenetic relationships among the species was tested by a fully nested ANOVA grouping the species into subgenera and species groups (subgenera: Sophophora and Drosophila; groups: melanogaster, virilis, repleta or obscura). Nested ANOVAs were also performed for ten datasets grouping the data by function (categories of biological process at level 3).

Deviations from neutral equilibrium. Differences in the values of Tajima's D statistic were used to test for possible differences in deviations from neutral equilibrium. The differences in Tajima's D were tested by an unbalanced ANOVA, Tukey tests and paired t tests for orthologous genes of pairs of species. Coalescent simulations were performed with intermediate values of recombination using DnaSP [52], given the mean values of segregating sites, samples and number of sites.

Testing the hypothesis that selection efficiency is positively correlated with effective population size.
Correlation and regression analyses were performed between each genome variable (genome size, percentage of genic sequences, percentage of intron sequences, percentage of repeat sequences and total ENC) and πs. Bootstrapping of the correlations was performed using the Resampling software from D. C. Howell. The average proportion (ᾱ) and average number (ā) of adaptive substitutions were estimated following equations 1 to 3 of Smith and Eyre-Walker [16]. The numbers of synonymous polymorphic sites (Ps), nonsynonymous polymorphic sites (Pn), synonymous fixed changes (Ds) and nonsynonymous fixed changes (Dn) were calculated using DnaSP [52] and the website http://mkt.uab.es. To enable comparisons of ᾱ between pairs of species, the gene alignments of the species were aligned with genomic coding sequences of a species external to the tested lineage, using MUSCLE [53]. Thus, the comparisons dsec-dmau, dsec-dmel, dsec-dsim and dmel-dsim were made by calculating the number of divergent sites (Dn and Ds) from D. yakuba. In the comparison between the species pair dsim-dyak, the divergence was calculated from D. pseudoobscura; in the comparison dmir-dpse, the divergence was calculated from D. melanogaster; and in the comparisons between the obscura and melanogaster species groups, the divergence was calculated from D. mojavensis. A bias can arise in the estimation of the number of polymorphic sites (Pn and Ps) when the genes being compared between pairs of species are represented by different numbers of sequences. Therefore, to avoid this bias, the number of sequences was equalized in each gene-to-gene comparison. All the alignments were checked by eye and are available from the authors on request. The list of genes compared for each species pair is shown in Table 5 of the Supplementary Results.
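The estimation of the proportion of adaptive substitutions, and a bootstrap of that estimate by gene, can be sketched as follows. This is a minimal illustration and not the code used in this study: it applies the expression given in the footnote of Table 7, 1 - Avg(Pn/(Ps+1)) / (Avg Dn / Avg Ds). The per-gene counts are invented toy values, the function names are hypothetical, and the percentile interval is an assumption, since the text does not state how the bootstrap interval was constructed.

```python
import random


def alpha_estimate(genes):
    """Proportion of adaptive substitutions across genes, computed as
    1 - mean(Pn / (Ps + 1)) / (mean(Dn) / mean(Ds))."""
    n = len(genes)
    mean_dn = sum(g["Dn"] for g in genes) / n
    mean_ds = sum(g["Ds"] for g in genes) / n
    mean_pn_ps = sum(g["Pn"] / (g["Ps"] + 1) for g in genes) / n
    return 1.0 - mean_pn_ps / (mean_dn / mean_ds)


def bootstrap_ci(genes, n_boot=1000, level=0.95, seed=1):
    """Percentile confidence interval obtained by resampling genes with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = sorted(
        alpha_estimate([rng.choice(genes) for _ in genes]) for _ in range(n_boot)
    )
    lower = estimates[int((1.0 - level) / 2.0 * n_boot)]
    upper = estimates[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


# Invented toy counts (NOT data from this study). Every gene has Dn > 0 and
# Ds > 0, so no bootstrap replicate can produce a division by zero.
TOY_GENES = [
    {"Pn": 2, "Ps": 9, "Dn": 12, "Ds": 20},
    {"Pn": 1, "Ps": 4, "Dn": 6, "Ds": 15},
    {"Pn": 0, "Ps": 7, "Dn": 3, "Ds": 10},
    {"Pn": 3, "Ps": 5, "Dn": 9, "Ds": 12},
    {"Pn": 1, "Ps": 9, "Dn": 10, "Ds": 18},
]

alpha = alpha_estimate(TOY_GENES)
ci_low, ci_high = bootstrap_ci(TOY_GENES)
print(f"alpha = {alpha:.3f}, 95% bootstrap interval = ({ci_low:.3f}, {ci_high:.3f})")
```

For the toy counts, the mean of Pn/(Ps+1) is 0.2 and Avg Dn / Avg Ds is 8/15, so the point estimate is 0.625. A real analysis would also need to handle genes or replicates with no fixed differences.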
Differences in the number of genes compared in Table 2 and Table 7 are due to poor-quality alignments or to a lack of orthologous sequences from which to calculate divergence. Confidence intervals of ᾱ were obtained by bootstrapping the values of α by gene. All statistical analyses were performed using the SAS 9.1 statistical program (SAS Institute Inc.).

Authors' contributions

All authors conceived this project and drafted the manuscript. NP carried out the data searching, filtering and analysis. All authors have interpreted and discussed the results, and read and approved the final manuscript.

Acknowledgements

The authors would like to thank Josefa Gonzalez and three anonymous reviewers for their useful comments and Raquel Egea for corrections on the first version of the manuscript. This work was funded by the Ministerio de Educación y Ciencia (Grant BFU-2006-08640). N.P. was supported by a grant from the Departament de Genètica i Microbiologia of the Universitat Autònoma de Barcelona.

References

1. Vicario S, Moriyama EN, Powell JR: Codon usage in twelve species of Drosophila. BMC Evol Biol 2007, 7(1):226.
2. Bachtrog D, Andolfatto P: Selection, recombination and demographic history in Drosophila miranda. Genetics 2006, 174(4):2045-2059.
3. Akashi H: Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 1995, 139(2):1067-1076.
4. Akashi H: Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics 1996, 144(3):1297-1307.
5. Yi S, Bachtrog D, Charlesworth B: A survey of chromosomal and nucleotide sequence variation in Drosophila miranda. Genetics 2003, 164(4):1369-1381.
6. Kliman RM, Andolfatto P, Coyne JA, Depaulis F, Kreitman M, Berry AJ, McCarter J, Wakeley J, Hey J: The population genetics of the origin and divergence of the Drosophila simulans complex species. Genetics 2000, 156(4):1913-1931.
7. Lynch M, Conery JS: The origins of genome complexity. Science 2003, 302(5649):1401-1404.
8. Kimura M: The neutral theory of molecular evolution. Cambridge: Cambridge University Press; 1983.
9. Ohta T: Role of very slightly deleterious mutations in molecular evolution and polymorphism. Theor Popul Biol 1976, 10(3):254-275.
10. Ohta T: Near-neutrality in evolution of genes and gene regulation. Proc Natl Acad Sci U S A 2002, 99(25):16134-16137.
11. Kimura M: The neutral theory of molecular evolution: a review of recent evidence. Jpn J Genet 1991, 66(4):367-386.
12. Bustamante CD, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL: The cost of inbreeding in Arabidopsis. Nature 2002, 416(6880):531-534.
13. Fay JC, Wyckoff GJ, Wu CI: Testing the neutral theory of molecular evolution with genomic data from Drosophila. Nature 2002, 415(6875):1024-1026.
14. Fay JC, Wyckoff GJ, Wu CI: Positive and negative selection on the human genome. Genetics 2001, 158(3):1227-1234.
15. Eyre-Walker A, Keightley PD, Smith NG, Gaffney D: Quantifying the slightly deleterious mutation model of molecular evolution. Mol Biol Evol 2002, 19(12):2142-2149.
16. Smith NG, Eyre-Walker A: Adaptive protein evolution in Drosophila. Nature 2002, 415(6875):1022-1024.
17. Shapiro JA, Huang W, Zhang C, Hubisz MJ, Lu J, Turissini DA, Fang S, Wang HY, Hudson RR, Nielsen R et al: Adaptive genic evolution in the Drosophila genomes. Proc Natl Acad Sci U S A 2007, 104(7):2271-2276.
18. Macpherson JM, Sella G, Davis JC, Petrov DA: Genomewide spatial correspondence between nonsynonymous divergence and neutral polymorphism reveals extensive adaptation in Drosophila. Genetics 2007, 177(4):2083-2099.
19. Begun DJ, Holloway AK, Stevens K, Hillier LW, Poh YP, Hahn MW, Nista PM, Jones CD, Kern AD, Dewey CN et al: Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 2007, 5(11):e310.
20. Aquadro CF: Why is the genome variable? Insights from Drosophila. Trends Genet 1992, 8(10):355-362.
21. Clark AG, Eisen MB, Smith DR, Bergman CM, Oliver B, Markow TA, Kaufman TC, Kellis M, Gelbart W, Iyer VN et al: Evolution of genes and genomes on the Drosophila phylogeny. Nature 2007, 450(7167):203-218.
22. Bakewell MA, Shi P, Zhang J: More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc Natl Acad Sci U S A 2007, 104(18):7489-7494.
23. Casillas S, Petit N, Barbadilla A: DPDB: a database for the storage, representation and analysis of polymorphism in the Drosophila genus. Bioinformatics 2005, 21 Suppl 2:ii26-30.
24. Casillas S, Egea R, Petit N, Bergman CM, Barbadilla A: Drosophila Polymorphism DataBase (DPDB): a portal for nucleotide polymorphism in Drosophila. Fly 2007, 1:205-211.
25. Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986, 3(5):418-426.
26. Andolfatto P: Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Mol Biol Evol 2001, 18(3):279-290.
27. Nolte V, Schlotterer C: African Drosophila melanogaster and D. simulans populations have similar levels of sequence variability, suggesting comparable effective population sizes. Genetics 2008, 178(1):405-412.
28. Haddrill PR, Charlesworth B, Halligan DL, Andolfatto P: Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content. Genome Biol 2005, 6(8):R67.
29. Eyre-Walker A: Changing effective population size and the McDonald-Kreitman test. Genetics 2002, 162(4):2017-2024.
30. Larracuente AM, Sackton TB, Greenberg AJ, Wong A, Singh ND, Sturgill D, Zhang Y, Oliver B, Clark AG: Evolution of protein-coding genes in Drosophila. Trends Genet 2008, 24(3):114-123.
31. Begun DJ, Aquadro CF: Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 1992, 356(6369):519-520.
32. Charlesworth B: Background selection and patterns of genetic diversity in Drosophila melanogaster. Genet Res 1996, 68(2):131-149.
33. Caceres M, Barbadilla A, Ruiz A: Recombination rate predicts inversion size in Diptera. Genetics 1999, 153(1):251-259.
34. Tajima F: Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 1989, 123(3):585-595.
35. Tajima F: The effect of change in population size on DNA polymorphism. Genetics 1989, 123(3):597-601.
36. Akashi H: Codon bias evolution in Drosophila. Population genetics of mutation-selection drift. Gene 1997, 205(1-2):269-278.
37. McVean GA, Vieira J: Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 2001, 157(1):245-257.
38. Singh ND, Bauer Dumont VL, Hubisz MJ, Nielsen R, Aquadro CF: Patterns of mutation and selection at synonymous sites in Drosophila. Mol Biol Evol 2007, 24(12):2687-2697.
39. Sokal RR, Rohlf FJ: Biometry: the principles and practice of statistics in biological research, 3rd edn. New York: W. H. Freeman and Co.; 1995.
40. Moriyama EN, Powell JR: Intraspecific nuclear DNA variation in Drosophila. Mol Biol Evol 1996, 13(1):261-277.
41. Bierne N, Eyre-Walker A: The genomic rate of adaptive amino acid substitution in Drosophila. Mol Biol Evol 2004, 21(7):1350-1360.
42. Mousset S, Derome N: Molecular polymorphism in Drosophila melanogaster and D. simulans: what have we learned from recent studies? Genetica 2004, 120(1-3):79-86.
43. Capy P, Gibert P: Drosophila melanogaster, Drosophila simulans: so similar yet so different. Genetica 2004, 120(1-3):5-16.
44. Bosco G, Campbell P, Leiva-Neto JT, Markow TA: Analysis of Drosophila species genome size and satellite DNA content reveals significant differences among strains as well as between species. Genetics 2007, 177(3):1277-1290.
45. McDonald JH, Kreitman M: Adaptive protein evolution at the Adh locus in Drosophila. Nature 1991, 351(6328):652-654.
46. Andolfatto P: Adaptive evolution of non-coding DNA in Drosophila. Nature 2005, 437(7062):1149-1152.
47. Welch JJ: Estimating the genomewide rate of adaptive protein evolution in Drosophila. Genetics 2006, 173(2):821-837.
48. Al-Shahrour F, Minguez P, Vaquerizas JM, Conde L, Dopazo J: BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Res 2005, 33(Web Server issue):W460-464.
49. Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res 2003, 13(1):73-80.
50. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ et al: The UCSC Genome Browser Database. Nucleic Acids Res 2003, 31(1):51-54.
51. Wright F: The 'effective number of codons' used in a gene. Gene 1990, 87(1):23-29.
52. Rozas J, Sanchez-DelBarrio JC, Messeguer X, Rozas R: DnaSP, DNA polymorphism analyses by the coalescent and other methods. Bioinformatics 2003, 19(18):2496-2497.
53. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792-1797.

Table 1: Mean and standard deviation of the synonymous polymorphism and Tajima's D for the 15 species of Drosophila analyzed.
| Species | N genes | Mean ± Std dev of πs | Mean ± Std dev of Tajima's D |
| dkik | 8 | 0.030 ± 0.021 | 0.087 ± 0.633 |
| dmau | 17 | 0.019 ± 0.013 | 0.150 ± 0.882 |
| dmel | 305 | 0.016 ± 0.132 | -0.331 ± 0.829 |
| dsan | 16 | 0.008 ± 0.015 | -0.497 ± 0.611 |
| dsec | 11 | 0.005 ± 0.009 | -0.314 ± 0.879 |
| dsim | 174 | 0.022 ± 0.018 | -0.134 ± 0.814 |
| dyak | 26 | 0.014 ± 0.015 | -0.870 ± 0.645 |
| dmir | 42 | 0.008 ± 0.007 | -0.378* ± 0.962 |
| dper | 6 | 0.017 ± 0.011 | -0.631 ± 0.438 |
| dpse | 20 | 0.022 ± 0.016 | -0.620 ± 0.830 |
| dsub | 10 | 0.021 ± 0.016 | -0.771 ± 0.712 |
| dari | 42 | 0.017 ± 0.016 | -0.330 ± 0.823 |
| dmoj | 45 | 0.016 ± 0.012 | -0.244 ± 0.994 |
| dame | 25 | 0.019 ± 0.010 | -0.571 ± 0.593 |
| dvir | 7 | 0.012 ± 0.009 | -0.482 ± 1.021 |

* without data from genes of the Neo-X chromosome (see text)

Table 2: Results of the two-factor analysis of variance testing the species and gene factors and their interaction, with and without data from D. melanogaster. The dependent variable was the natural logarithm of πs.

| Source | All data: DF | F | Pr > F | Without D. melanogaster: DF | F | Pr > F |
| species | 14 | 6.13 | 0.0049 | 13 | 16.84 | 0.0073 |
| gene | 452 | 1.18 | 0.4265 | 283 | 3.47 | 0.1147 |
| species*gene | 205 | 0.69 | 0.8343 | 94 | 1.69 | 0.3299 |

Table 3. Differences in the values of πs between pairs of species evaluated by paired t tests for orthologous genes. Significant differences are bold faced.
| sp1 | sp2 | N genes | Mean πs sp1 | Mean πs sp2 | p (t test) |
| dame | dsim | 6 | 0.018 | 0.025 | 0.474 |
| dame | dvir | 6 | 0.015 | 0.009 | 0.123 |
| dari | dmoj | 42 | 0.017 | 0.016 | 0.796 |
| NA_dmel | NA_dsim | 91 a | 0.013 | 0.020 | 0.00003 |
| A_dmel | A_dsim | 23 b | 0.015 | 0.032 | 0.063 |
| dmel | dsec | 11 | 0.017 | 0.006 | 0.002 |
| dmel | dpse | 9 | 0.012 | 0.020 | 0.021 |
| dmel | dmir | 10 | 0.011 | 0.006 | 0.083* |
| dmel | dkik | 10 | 0.015 | 0.025 | 0.105 |
| dmel | dame | 6 | 0.014 | 0.019 | 0.061 |
| dmel | dmau | 16 | 0.019 | 0.019 | 0.790 |
| dmel | dyak | 10 | 0.012 | 0.010 | 0.399 |
| dmir | dper | 5 | 0.008 | 0.017 | 0.128 |
| dmir | dpse | 9 | 0.009 | 0.015 | 0.066* |
| dper | dpse | 5 | 0.013 | 0.018 | 0.190 |
| dpse | dsim | 10 | 0.021 | 0.027 | 0.522 |
| dsec | dmau | 11 | 0.006 | 0.021 | 0.032 |
| dsim | dsec | 11 | 0.026 | 0.006 | 0.007 |
| dsim | dmir | 10 | 0.023 | 0.006 | 0.006 |
| dsim | dsan | 5 | 0.010 | 0.025 | 0.109* |
| dsim | dyak | 9 | 0.021 | 0.008 | 0.045 |

a Data from non-African populations of these two species
b Data from African populations of these two species
* Species with mean levels of πs significantly different by Tukey test.

Table 4. Results of ten analyses of variance of the model πs = species + GO. GO is a component sorting genes as belonging or not to one of ten different categories of biological process at level 3 of the Gene Ontology. The dependent variable is the natural logarithm of πs.

| GO category | Species component: F(14, 619) | p | GO component: F(1, 619) | p |
| anatomical structural development | 6.09 | <0.0001 | 2.19 | 0.1398 |
| cell communication | 6.09 | <0.0001 | 2.16 | 0.1421 |
| cellular metabolic process | 6.20 | <0.0001 | 12.90 | 0.0004 |
| cellular organization and biogenesis | 6.07 | <0.0001 | 0.08 | 0.7750 |
| cellular developmental process | 6.08 | <0.0001 | 1.28 | 0.2582 |
| macromolecule metabolic process | 6.21 | <0.0001 | 14.14 | 0.0002 |
| multicellular organismal development | 6.11 | <0.0001 | 3.83 | 0.0509 |
| primary metabolic process | 6.24 | <0.0001 | 17.10 | <0.0001 |
| regulation of biological process | 6.10 | <0.0001 | 2.85 | 0.0921 |
| sexual reproduction | 6.11 | <0.0001 | 4.36 | 0.0366 |

Table 5. Results of the fully nested analysis of variance, grouping the species by their phylogenetic relationship (subgenus and groups). The dependent variable was the natural logarithm of πs.

| Source | DF | F | p | Var Comp. | % of Total |
| Subgenus | 1 | 0.64 | 0.507042 | -0.040* | 0 |
| Group | 2 | 1.05 | 0.383304 | 0.02 | 1.48 |
| Species | 11 | 6.30 | 6.70E-10 | 0.14 | 14.2 |
| Error | 604 | | | 0.83 | 84.31 |
| Total | 618 | | | 0.98 | |

* The value of the variance component is negative, and its percentage of the total is estimated as zero.

Table 6. Correlation and regression analyses between πs and genome variables. In the regression analysis, the independent variable was the natural logarithm of πs, and significance was calculated based on the F statistic of the ANOVA of the regression. Significant values are bold faced. The 95% confidence intervals were calculated by bootstrapping the correlations.

| Variable | All data: R Spearman (p), N=594 | R2 | F(1,538) (p) | Dataset 2: R Spearman (CI 95%) (p), N=175 | R2 | F(1,172) (p) |
| perc_repeat1‡ | -0.175 (<0.001) | 0.059 | 33.85 (<0.001) | -0.239 (-0.338 - -0.061) (0.001) | 0.048 | 8.83 (0.003) |
| perc_repeat2† | -0.154 (<0.001) | 0.025 | 13.91 (0.001) | -0.148 (-0.29 - -0.031) (0.049) | 0.016 | 2.49 (0.061*) |
| perc_coding1 a | 0.113 (0.0059) | 0.043 | 24.35 (<0.001) | 0.020 (-0.121 - 0.215) (0.585) | 0.006 | 0.99 (0.322) |
| perc_intron1 a | -0.123 (0.0031) | 0.005 | 2.53 (0.112) | -0.091 (-0.103 - 0.159) (0.243) | 0.001 | 0.51 (0.477) |
| perc_coding2 b | 0.1273 (0.002) | 0.021 | 11.36 (<0.001) | 0.069 (-0.127 - 0.131) (0.267) | 0.007 | 1.78 (0.183) |
| perc_intron2 b | 0.106 (0.001) | 0.010 | 5.27 (0.022) | 0.050 (-0.113 - 0.126) (0.533) | 0.009 | 1.39 (0.237) |

‡ percentage of the base pairs in the total genome that are repeats, as detected by RepeatMasker (UCSC, [50])
† repeat coverage in percentages, calculated as the fraction of scaffolds >200 kb covered by repeats [21]
a percentage of coding and intron sequences, calculated as total coding and total intron sequence from [21] divided by genome size estimated as the sum of the base pairs in genomic scaffolds >200,000 bp [21]
b percentage of coding and intron sequences, calculated as total coding and total intron sequence from [21] divided by genome size estimates based on flow cytometry [44]
* the regression is significant without data of dsec (F(1,147) = 4.58; p = 0.033; see Figure 3 d)
Table 7. Differences in the proportion of adaptive substitutions among species of Drosophila with evidence of differences in the levels of πs from Table 2.

| Species group | Species 1 | Species 2 | N genes | Avg(Pn/(Ps+1)) a, sp1 - sp2 | Avg Dn / Avg Ds a, sp1 - sp2 | α a, sp1 - sp2 | Dn b, sp1 - sp2 | χ2 | p | Species Div b |
| melanogaster | dmel | dsec | 10 | 0.66 - 0.82 | 1.50 - 1.09 | 0.56 - 0.25 | 277 - 345 | 42.27 | <0.00001 | dyak |
| melanogaster | dsim | dsec | 10 | 0.61 - 0.82 | 1.36 - 1.09 | 0.55 - 0.25 | 431 - 345 | 76.87 | <0.00001 | dyak |
| melanogaster | dmau | dsec | 10 | 0.75 - 0.82 | 1.42 - 1.09 | 0.47 - 0.25 | 342 - 345 | 25.98 | <0.00001 | dyak |
| melanogaster | dmel | dsim | 88 | 0.58 - 0.59 | 0.82 - 1.02 | 0.30 - 0.43 | 2858 - 3202 | 98.44 | <0.00001 | dyak |
| melanogaster | dyak | dsim | 8 | 0.30 - 0.23 | 0.38 - 0.35 | 0.20 - 0.35 | 346 - 283 | 27.98 | <0.00001 | dpse |
| obscura | dpse | dmir | 8 (6*) | 0.16 - 0.38; (0.21*) - (0.22*) | 0.35 - 0.54; (0.35*) - (0.23*) | 0.54 - 0.30; (0.40*) - (0.04*) | 95 - 204; (84*) - (44*) | 16.02 | 0.00006 | dmel |
| obscura vs melanogaster | dmel | dpse | 5 | 0.05 - 0.05 | 0.22 - 0.33 | 0.76 - 0.85 | 99 - 103 | 2.92 | 0.087 | dmoj |
| obscura vs melanogaster | dsim | dmir | 8 | 0.04 - 0.19 | 0.30 - 0.30 | 0.87 - 0.37 | 284 - 237 | 144.11 | <0.00001 | dmoj |
| obscura vs melanogaster | dmel | dmir | 6 | 0.17 - 0.11 | 0.42 - 0.27 | 0.58 - 0.62 | 85 - 100 | 0.351 | 0.553 | dmoj |

a Average value of Pn divided by Ps+1; average value of Dn divided by average value of Ds; and α calculated following Smith and Eyre-Walker [16] as 1 - (Avg(Pn/(Ps+1)) / (Avg Dn / Avg Ds)) (see Methods)
b The divergence values were calculated from a species external to the tested lineage.
* without data of the genes from the Neo-X chromosome (see text)

Figure legends

Figure 1. Mean and standard error of πs for the 15 studied species of Drosophila.

Figure 2. Scatterplot of the relationship between codon bias (whole-genome ENC) and πs.

Figure 3. Scatterplots of the relationships between πs (a and c) or mean πs (b and d) and the percentage of repetitive sequences estimated from the genomes of Drosophila species. a) and b) percentage of the base pairs in the total genome that are repeats, as detected by RepeatMasker on the UCSC website [50]. c) and d) repeat coverage in percentages, calculated as the fraction of scaffolds >200 kb covered by repeats [21].