Rabbit genome analysis reveals a polygenic basis for phenotypic change during domestication The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Carneiro, M., C.-J. Rubin, F. Di Palma, F. W. Albert, J. Alfoldi, A. M. Barrio, G. Pielberg, et al. “Rabbit Genome Analysis Reveals a Polygenic Basis for Phenotypic Change During Domestication.” Science 345, no. 6200 (August 28, 2014): 1074–1079. As Published http://dx.doi.org/10.1126/science.1253714 Publisher American Association for the Advancement of Science (AAAS) Version Author's final manuscript Accessed Wed May 25 15:28:17 EDT 2016 Citable Link http://hdl.handle.net/1721.1/98345 Terms of Use Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use. Detailed Terms Rabbit genome analysis reveals a polygenic basis for phenotypic change during domestication Miguel Carneiro1,*, Carl-Johan Rubin2,*, Federica Di Palma3,4,*, Frank W. Albert5,24, Jessica Alföldi3, Alvaro Martinez Barrio2, Gerli Pielberg2, Nima Rafati2, Shumaila Sayyab6, Jason Turner-Maier3, Shady Younis2,7, Sandra Afonso1, Bronwen Aken8,18, Joel M. Alves1,9, Daniel Barrell8,18, Gerard Bolet10, Samuel Boucher11, Hernán A. Burbano5,25, Rita Campos1, Jean L. Chang3, Veronique Duranthon12, Luca Fontanesi13, Hervé Garreau10, David Heiman3, Jeremy Johnson3, Rose G. Mage14, Ze Peng15, Guillaume Queney16, Claire Rogel-Gaillard17, Magali Ruffier8,18, Steve Searle8, Rafael Villafuerte19, Anqi Xiong20, Sarah Young3, Karin Forsberg-Nilsson20, Jeffrey M. Good5,21, Eric S. Lander3, Nuno Ferrand1,22,*, Kerstin Lindblad-Toh2,3,*, Leif Andersson2,6,23,* 1 CIBIO/InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Campus Agrário de Vairão, Universidade do Porto, 4485-661, Vairão, Portugal. 2 Science of Life Laboratory Uppsala, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden. 3 Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, MA 02142, USA. 4 Vertebrate and Health Genomics, The Genome Analysis Center, Norwich, UK. 5 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany. 6 Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden. 7 Department of Animal Production, Ain Shams University, Shoubra El-Kheima, Cairo, Egypt. 8 Wellcome Trust Sanger Institute, Hinxton, UK. 9 Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK. 1 10 INRA, UMR1388 Génétique, Physiologie et Systèmes d’Elevage, F-31326 Castanet-Tolosan, France. 11 Labovet Conseil, Les Herbiers Cedex, France. 12 INRA, UMR1198 Biologie du Développement et Reproduction, F-78350 Jouy-en-Josas, France. 13 Department of Agricultural and Food Sciences, Division of Animal Sciences, University of Bologna, 40127 Bologna Italy. 14 Laboratory of Immunology, NIAID, NIH, Bethesda, MD, 20892, USA. 15 DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, 2800 Mitchell Drive, Walnut Creek, CA 94598. 16 ANTAGENE, Animal Genomics Laboratory, Lyon, France. 17 INRA, UMR1313 Génétique Animale et Biologie Intégrative, F- 78350, Jouy-en-Josas, France. 18 European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. 19 Instituto de Estudios Sociales Avanzados, (IESA-CSIC) Campo Santo de los Mártires 7, Córdoba Spain. 20 Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden. 21 Division of Biological Sciences, The University of Montana, Missoula, MT 59812, USA. 22 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre s⁄n. 4169-007 Porto, Portugal 23 Department of Veterinary Integrative Biosciences, College of Veterinary Medicine and Biomedical Sciences, Texas A&M University, College Station, USA. *contributed equally within group Correspondence should be addressed to: kersli@broadinstitute.org and leif.andersson@imbim.uu.se. Present addresses: 24 Department of Human Genetics, University of California, Los Angeles, Gonda Center, 695 Charles E. Young Drive South, Los Angeles, CA 90095, USA. 2 25 Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany. 3 Abstract The genetic changes underlying the initial steps of animal domestication are still poorly understood. We generated a high-quality reference genome for rabbit and compared it to resequencing data from populations of wild and domestic rabbits. We identified over 100 selective sweeps specific to domestic rabbits, but only a relatively small number of fixed (or nearly fixed) SNPs for derived alleles. SNPs with marked allele frequency differences between wild and domestic rabbits were enriched for conserved non-coding sites. Enrichment analyses suggest that genes affecting brain and neuronal development have often been targeted during domestication. We propose that due to a truly complex genetic background, tame behavior in rabbits and other domestic animals evolved by shifts in allele frequencies at many loci, rather than by critical changes at only a few ‘domestication loci’. Introduction Domestication of animals - the evolution of wild species into tame forms – has resulted in striking changes in behavior, morphology, physiology and reproduction (1). However, the genetic underpinnings of the initial steps of animal domestication are poorly understood but likely involved changes in behavior that allowed the animals to survive and reproduce under conditions that might be too stressful for wild animals. Given the differences in behavior between wild and domesticated animals, we investigated to what extent this process involved fixation of new mutations with large phenotypic effects as opposed to selection on standing variation. Such studies are hampered in most domestic animals due to ancient domestication events, extinct wild ancestors, or geographically widespread wild ancestors. Rabbit domestication was initiated in monasteries in Southern France as recent as ~1,400 years ago (2). At this time, wild rabbits were mostly restricted to the Iberian Peninsula, where two subspecies occurred (Oryctolagus cuniculus cuniculus and O. c. algirus), and to France, colonized by O. c. cuniculus (Fig. 1). Additionally, the area of origin of domestic rabbit is still populated with wild rabbits related to 4 the ancestors of the domestic rabbit (3). This recent and well-defined origin provides a major advantage for inferring genetic changes underlying domestication. A female rabbit genome was Sanger sequenced and assembled (4). The draft OryCun2.0 assembly size is 2.66 Gb with a contig N50 size of 64.7 kb and a scaffold N50 size of 35.9 Mb (Tables S1-S2). The genome assembly was annotated using the Ensembl gene annotation pipeline (Ensembl release 73, Sept. 2013) and with both rabbit RNA-seq data and the annotation of human orthologs (4) (Table S3). Our analysis of rabbit domestication used Ensembl annotations as well as a custom pipeline for annotation of UTRs (168,286 unique features), non-coding RNA (n=9,666) and non-coding conserved elements (2,518,476 unique features). To identify genomic regions under selection during domestication we performed whole genome resequencing (10X coverage) of pooled samples (Table S4) of six different breeds of domestic rabbits (Fig. 1A), three pools of wild rabbits from Southern France, and 11 pools of wild rabbits from the Iberian Peninsula, representing both subspecies (Fig. 1B). We also sequenced a close relative, snowshoe hare (Lepus americanus), to deduce the ancestral state at polymorphic sites. Short sequence reads were aligned to our assembly; SNP calling resulted in the identification of 50 million high quality SNPs and 5.6 million insertion/deletion polymorphisms after filtering (Table S5). The numbers of SNPs at non-coding conserved sites and in coding sequences were 719,911and 154,489, respectively. The per site nucleotide diversity (π) within populations of wild rabbits was in the range of 0.6-0.9% (Fig. 1C). Thus, rabbit is one of the most polymorphic mammals sequenced so far, presumably due to a larger long-term effective population size relative to other sequenced mammals (5). Identity scores confirm that the domestic rabbit is most closely related to wild rabbits from Southern France (Fig. S1A), and we inferred a strong correlation (r = 0.94) in allele frequencies at most loci between these groups (Fig. S1B). The average nucleotide diversity of each sequenced population is consistent with a bottleneck and reduction in genetic diversity when rabbits from the Iberian Peninsula colonized Southern France and a second bottleneck during domestication (3)(Figs. 1B,C). 5 Selective sweeps occur when beneficial genetic variants increase in frequency due to positive selection together with linked neutral sequence variants (6). This results in genomic islands of reduced heterozygosity, and increased differentiation between populations around the selected site. We compared genetic diversity between domestic rabbits as one group to wild rabbits representing 14 different locations in France and the Iberian Peninsula. We calculated FST values between wild and domestic rabbits, and average pooled heterozygosity (H) in domestic rabbits, in 50 kb sliding windows across the genome (hereafter referred to as FST-H outlier approach). We identified 78 outliers with FST>0.35 and H<0.05 (Figs. 2A, S2, Database S1). We also used SweepFinder (7), which calculates maximum composite likelihoods for the presence of a selective sweep, taking into account the background pattern of genetic variation in the data, and with a significance threshold set by coalescent simulations incorporating the recent demographic history of domestic and wild rabbits (Figs. S3, S4, Databases S1, S2) (4). This analysis resulted in the identification of 78 significant sweeps (false discovery rate 5%, Fig. 2A, Database S1). Thirty-one (40%) of these were also detected with the FST-H approach (Fig. 2A). This incomplete overlap is likely explained by the fact that SweepFinder primarily assesses the distribution of genetic diversity within the selected population, while the FST-H analysis identifies the most differentiated regions of the genome between wild and domestic rabbits. We carried out an additional screen using targeted sequence capture on an independent sample of individual French wild and domestic rabbits. We targeted over 6 Mb of DNA sequence split into 5,000 1.2 kb intronic fragments that were distributed across the genome and selected independently of the genome resequencing results above. Coalescent simulations, using the targeted resequencing dataset and incorporating the recent demographic history of domestic rabbit as a null model (Figs. S3, S4, Databases S1, S2) (4), revealed that the majority of the sweep regions detected by whole genome resequencing showed levels and patterns of genetic variation that were observed less than 5% of the times in the simulated dataset (76.0% with SweepFinder and 73.7% with FST-H outlier regions, excluding regions without targeted fragments), a highly significant overlap (Fisher’s exact test, P<1x10-5 for both tests). Furthermore, 26 of the 31 sweep regions detected with both SweepFinder and the FST-H approach were targeted in the capture experiment and an even greater 6 proportion (88.5%) showed levels and patterns of genetic variation unlikely to be generated under the specified demographic model. An example of a selective sweep overlapping the 3’-part of GRIK2 (glutamate receptor, ionotropic, kainate 2) is shown in Fig. 2B. Parts of this region have low heterozygosity in domestic rabbits, and at position chr12:90,153,383 bp domestic rabbits carry a nearly fixed derived allele at a site with 100% sequence conservation among 29 mammals except for the allele present in domestic rabbits (8) suggesting functional significance. GRIK2 encodes a subunit of a glutamate receptor highly expressed in brain and has been associated with recessive mental retardation in humans (9). Both SweepFinder and the FST–H outlier analysis identified two sweeps near SOX2 (SRY-BOX 2) separated by a region of high heterozygosity (Fig. 2C). SOX2 encodes a transcription factor that is required for stem-cell maintenance (10). Given the comprehensive sampling in our study and the correlation in allele frequencies between domestic and French wild rabbits (Fig. S1B), highly differentiated individual SNPs are likely to have been either directly targeted by selection or occur in the vicinity of loci under selection. For each SNP, we calculated the absolute allele frequency difference between wild and domestic rabbits (ΔAF) and sorted these into 5% bins (ΔAF=0-0.05, etc.). The majority of SNPs showed low ΔAF between wild and domestic rabbits (Fig. 2D). We examined exons, introns, UTRs and evolutionarily conserved sites for enrichment of SNPs with high ΔAF, as would be expected under directional selection on many independent mutations (Fig. 2D, Table S6). We observed no consistent enrichment for high ΔAF SNPs in introns, but significant enrichments in exons, UTRs and conserved non-coding sites (χ2 test, P<0.05). Of note, we detected a significant excess of SNPs at conserved non-coding sites for each bin ΔAF>0.45 (χ2 test, P=1.8x10-3 - 7.3x10-17), whereas in coding sequence, a significant excess was only found at ΔAF>0.80 (χ2 test, P=3.0x10-2 - 1.0x10-3). Compared to the relative proportions in the entire dataset, there was an excess of 3,000 SNPs at conserved non-coding sites with ΔAF>0.45, whereas for exonic SNPs with ΔAF>0.80 the excess was only 83 SNPs (Table S6). Thus, changes at regulatory sites have 7 played a much more prominent role in rabbit domestication, at least numerically, than changes in coding sequences. We selected the 1,635 SNPs at conserved non-coding sites with ΔAF>0.80, which represent 681 non-overlapping 1 Mb blocks of the rabbit genome. In order not to inflate significances due to inclusion of SNPs in strong linkage disequilibrium we selected only one SNP per 50 kb, leaving 1,071 SNPs. More than 60% of the SNPs were located 50 kb or more from the closest transcriptional start site (TSS; Fig. 2E), suggesting that many differentiated SNPs are located in long-range regulatory elements. A gene ontology (GO) overrepresentation analysis (11) examining all genes located within 1 Mb from high ΔAF SNPs showed that the most enriched categories of biological processes involved ‘cell fate commitment’ (Bonferroni P=3.1x10-3-5.4x10-5; Table 1, Database S3), while the statistically most significant categories involved brain and nervous system cell development (Bonferroni P=2.9x10-3-3.7x10-10). Many of the mouse orthologs of genes associated with non-coding high ΔAF SNPs were expressed in brain or sensory organs, and this enrichment was highly significant (Table 1). We also examined phenotypes observed in mouse mutants (http://www.informatics.jax.org) for these genes, revealing a significant (Bonferroni P=3.7x10-2-7.5x10-17) enrichment of genes associated with defects in brain and neuronal development, development of sensory organs (hearing and vision), ectoderm development and respiratory system phenotypes (Fig. S5). These highly significant overrepresentations were obtained because there were many genes in the overrepresented categories (Table 1). For example, we observed high ΔAF SNPs associated with 191 genes (113 expected by chance) from the nervous system development GO category (Bonferroni P=3.7x10-10). Thus, rabbit domestication must have a highly polygenic basis with many loci responding to selection, and where genes affecting brain and neuronal development have been particularly targeted. None of the coding SNPs that differed between wild and domestic rabbits was a nonsense or frame-shift mutation - consistent with data from chicken (12) and pigs (13), suggesting that gene loss has not played a major role during animal domestication. This is an important finding as it has been suggested that gene inactivation could be an important mechanism for rapid evolutionary change, for instance 8 during domestication (14). Of 69,985 autosomal missense mutations, there were no fixed differences and only 14 showed a ΔAF above 90%. On the basis of poor sequence conservation, similar chemical properties of the substituted amino acids, and/or the derived state of the domestic allele we assume that most of these result from hitchhiking, rather than being functionally important (Database S4). However, two missense mutations stand out; these may be direct targets of selection because at these two positions the domestic rabbit differs from all other sequenced mammals (>40 species). The first is a Q813R substitution in TTC21B (tetratricopeptide repeat domain 21B protein), which is part of the ciliome and modulates sonic hedgehog signaling during embryonic development (15). The other is a R1627W substitution in KDM6B (lysine-specific demethylase 6B) that encodes a histone H3K27 demethylase involved in HOX gene regulation during development (16). Deletions unique to domestic rabbits were difficult to identify, because the genome assembly is based on a domestic rabbit, but some convincing duplications were detected with striking frequency differences between wild and domestic rabbits (Database S5). We observed a one bp insertion/deletion polymorphism located within an intron of IMMP2L (inner mitochondrial membrane peptidase-2 like protein), where domestic and wild rabbits were fixed for different alleles. The polymorphism occurs in a sweep region and it is the sequence polymorphism with highest ΔAF in the region (Fig. S6). Mutations in IMMP2L have been associated with Tourette syndrome and autism in humans (17). Cell fate determination was a strongly enriched GO category (enrichment factor=4.9; Database S3) for genes near variants with high ΔAF. We examined the functional significance of twelve SOX2, four KLF4 and one PAX2 high ΔAF SNPs associated with this GO category and where all 17 SNPs were unique to domestic rabbits compared with other sequenced mammals. Electrophoretic mobility shift assay (EMSA) with nuclear extracts from mouse ES-cell derived neural stem cells revealed specific DNAprotein interactions (Figs. 3, S7, Table S7). Four probes, all from the SOX2 region, showed a gel shift difference between wild and domestic alleles. Nuclear extracts from a mouse P19 embryonic carcinoma cell line before and after neuronal differentiation recapitulated these four gel shifts and revealed three additional probes, one in PAX2 and two more in SOX2 that showed gel shift differences between wild 9 type and mutant probes only after neuronal differentiation. Thus, altered DNA-protein interactions for 7 of the 17 high ΔAF SNPs tested were identified, qualifying them as candidate causal SNPs that may have contributed to rabbit domestication. Our results show that very few loci have gone to complete fixation in domestic rabbits, none at coding sites nor any at non-coding conserved sites. However, allele frequency shifts were detected at many loci spread across the genome and the great majority of ‘domestic’ alleles were also found in wild rabbits, implying that directional selection events associated with rabbit domestication are consistent with polygenic and soft sweep modes of selection (18) that primarily acted on standing genetic variation in regulatory regions of the genome. This stands in contrast with breed-specific traits in many domesticated animals that often show a simple genetic basis with complete fixation of causative alleles (19). Our finding that many genes affecting brain and neuronal development have been targeted during rabbit domestication is fully consistent with the view that the most critical phenotypic changes during the initial steps of animal domestication likely involved behavioral traits that allowed animals to tolerate humans and the environment humans offered. On the basis of these observations, we propose that the reason for the paucity of specific fixed domestication genes in animals is that no single genetic change is either necessary or sufficient for domestication and because of the complex genetic background for tame behavior we propose that domestic animals evolved by means of many mutations of small effect, rather than by critical changes at only a few domestication loci. Author contributions KLT, FDP, and ESL oversaw genome sequencing, assembly and annotation performed by JA, JTM, JJ, DH, JLC, and SaY. BA, DB, MR, and SS did Ensembl annotations. CRG, VD, LF, RGM, and ZP contributed to the genome project. LA, MC, CJR, NF, and KLT led the domestication study and AMB, NR, and SS contributed with bioinformatic analyses. ShY and GP performed EMSA, AX and KFN developed neural stem cells for EMSA. MC, FWA, JMG, SA, JMA, GB, SB, HAB, RC, HG, GQ, and 10 RV designed, performed and analyzed the sequence capture experiment. LA, MC, CJR, KLT, NF and FDP wrote the paper with input from other authors. Acknowledgements The work was supported by grants from NHGRI (U54 HG003067 to ESL), ERC project BATESON to LA, Wellcome Trust (grant numbers WT095908 and WT098051), the intramural research program of the NIH, NIAID (RGM), the European Molecular Biology Laboratory, POPH-QREN funds from the European Social Fund and Portuguese MCTES [postdoc grants to M.C (SFRH/BPD/72343/2010) and R.C. (SFRH/BPD/64365/2009), and PhD grant to J. Alves (SFRH/BD/72381/2010)], a NSF international postdoctoral fellowship to J.M.G. (OISE- 0754461), by FEDER funds through the COMPETE program and Portuguese national funds through the FCT – Fundação para a Ciência e a Tecnologia – (PTDC/CVT/122943/2010; PTDC/BIA-EVF/115069/2009; PTDC/BIA-BDE/72304/2006; PTDC/BIABDE/72277/2006), by the projects “Genomics and Evolutionary Biology” and “Genomics Applied to Genetic Resources” co-financed by North Portugal Regional Operational Programme 2007/2013 (ON.2 – O Novo Norte) under the National Strategic Reference Framework (NSRF) and European Regional Development Fund (ERDF), by travel grants to M.C. (COST Action TD1101) and S.S. was supported by Higher Education Commission in Pakistan. We are grateful to L. Gaffney for assistance with figure preparation, Paulo C. Alves and Scott Mills for providing the snowshoe hare sample and S. Pääbo for hosting M.C., S.A. and R.C. Sequencing was performed by the Broad Institute Genomics Platform. Computer resources were supplied by BITS and UPPNEX at Science for Life Laboratory. The O. cuniculus genome assembly has been deposited in Genbank under the accession number AAGW02000000. The RNA-seq data have been deposited there under the bioproject PRJNA78323, the rabbit genome resequencing data under the bioproject PRJNA242290, and the sequence capture data under the bioproject PRJNA221358. 11 Table 1 Summary of results from enrichment analysis of ΔAF >0.8 SNPs located in conserved non-coding elements. One significantly enriched term was chosen from each group of significantly enriched inter-correlated terms. Full lists of enriched terms and inter-correlations are presented in Database S3 and the most enriched inter-correlated terms are presented in Figure S5 Enrich Unique Number Enriched term P1 ment loci of genes O/R)2 Gene Ontology Biological Process GO:0007399 nervous system development 191 3.7x10-10 1.7 154/155 GO:0045595 regulation of cell differentiation 107 4.5x10-6 1.8 94/91 GO:0045935 positive regulation of nucleobase-containing compound 122 2.0x10-5 1.7 101/100 metabolic process GO:0045165 cell fate commitment 36 5.5x10-5 2.9 31/32 GO:0007389 pattern specification process 57 1.4x10-4 2.2 43/44 GO:0009887 organ morphogenesis 85 2.0x10-3 1.8 72/73 GO:0048646 anatomical structure formation involved in morphogenesis 75 2.8x10-3 1.8 65/64 -2 GO:0045892 negative regulation of transcription, DNA-dependent 82 1.4x10 1.7 62/62 GO:0034332 adherens junction organization 13 1.5x10-2 4.7 11/11 Mouse Genome Informatics gene expression3 -25 11853 TS23 diencephalon; lateral wall; mantle layer 109 3.9x10 3.3 86/85 12449 TS23 medulla oblongata; lateral wall; basal plate; mantle layer 115 2.6x10-13 2.3 90/89 2257 TS17 sensory organ 113 3.4 x10-13 2.3 98/99 1324 TS15 future brain 72 8.5x10-9 2.4 61/61 Mouse Genome Informatics phenotype MP:0010832 lethality during fetal growth through weaning 240 7.5x10-17 1.8 197/189 MP:0003632 abnormal nervous system morphology 237 1.2x10-13 1.7 191/193 -7 MP:0005388 respiratory system phenotype 127 1.7x10 1.8 101/102 MP:0000428 abnormal craniofacial morphology 109 1.4x10-6 1.9 93/92 MP:0002925 abnormal cardiovascular development 88 3.3x10-5 1.9 73/73 -4 MP:0005377 hearing/vestibular/ear phenotype 73 1.8x10 2.0 61/62 1 Bonferroni corrected P-value. 2Number of unique non-overlapping 1Mb windows observed (O) and the average number of 1 Mb windows observed in 1000 random (R) samplings of the same number of genes (rounded to the nearest integer). 3TS=Thieler stage. 12 Figures Fig. 1. Experimental design and population data (A) Images of the six rabbit breeds, sized to reflect differences in body weight, included in the study and of a wild rabbit. (B) Map of the Iberian Peninsula and Southern France with sample locations marked (orange dots). Demographic history of this species is indicated and a logarithmic time scale is indicated to the left. The hybrid zone between the two subspecies is marked with dashes. (C) Nucleotide diversities in domestic and wild populations. The French (FRW13) and Iberian (IW1-11) wild rabbit populations are ordered according to a northeast to southwest transection. Sample locations are provided in Table S4. Fig. 2. Selective sweep and delta allele frequency analyses. (A) Plot of FST values between wild and domestic rabbits. Sweeps detected with the FST-H outlier approach, SweepFinder and their overlaps are marked on top. Unassigned scaffolds were not included in the analysis. (B) and (C) Selective sweeps at GRIK2 (B) and SOX2 (C). Heterozygosity plots for wild (red) and domestic (black) rabbits together with plots of FST values and SNPs with ΔAF>0.75 (HΔAF). The bottom panel shows putative sweep regions, detected with the FST-H outlier approach and SweepFinder, marked with horizontal bars. Gene annotations in sweep regions are indicated; *represents ENSOCUT000000. **SOX2-OT represents the manually annotated SOX2 overlapping transcript (4). (D) Number of SNPs in non-overlapping ΔAF bins (black lines, left y-axis). M-values (log2-fold changes) of the relative frequencies of SNPs at non-coding evolutionary conserved sites, in untranslated regions (UTR), exons and introns according to ΔAF bins (colored lines, right y-axis); M-values were calculated by comparing the frequency of SNPs in a given annotation category in a specific bin with the corresponding frequency across all bins. (E) Location of SNPs at conserved non-coding sites with ΔAF≥0.8 SNPs (n=1,635) and with ΔAF<0.8 SNPs (n=502,343) in relation to the transcription start site (TSS) of the most closely linked gene; **, P<0.01). 13 Fig. 3. Bioinformatic and functional analysis of candidate causal mutations. Three examples of SNPs near SOX2 and PAX2 where the domestic allele differs from other mammals. Fourteen such SNPs assessed with electrophoretic mobility shift assays (EMSA) are indicated by red crosses on top. EMSA with nuclear extracts from ES-cell derived neural stem cells or from mouse P19 embryonic carcinoma cells before (un-diff) or after neuronal differentiation (diff) are shown for three SNPs; exact nucleotide positions of polymorphic sites are indicated. Allele-specific gel shifts are indicated by arrows. WT=wildtype allele; Dom=domestic, the most common allele in domestic rabbits. Cold probes at 100-fold excess were used to verify specific DNA-protein interactions. References 1. C. Darwin, On the origins of species by means of natural selection or the preservation of favoured races in the struggle for life. (John Murray, London, 1859). 2. J. A. Clutton-Brock, Natural History of Domesticated Mammals. (Cambridge University Press, Cambridge, 1999). 3. M. Carneiro, S. Afonso, A. Geraldes, H. Garreau, G. Bolet et al., The genetic structure of domestic rabbits. Mol. Biol. Evol. 28, 1801-1816 (2011). 4. Materials and methods are available as supplementary material on Science Online. 5. M. Carneiro, F. W. Albert, J. Melo-Ferreira, N. Galtier, P. Gayral et al., Evidence for Widespread Positive and Purifying Selection Across the European Rabbit (Oryctolagus cuniculus) Genome. Mol. Biol. Evol. 29, 1837-1849 (2012). 6. J. Maynard-Smith, J. Haigh, The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23-35 (1974). 7. R. Nielsen, S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark et al., Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566-1575 (2005). 14 8. K. Lindblad-Toh, M. Garber, O. Zuk, M. F. Lin, B. J. Parker et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476-482 (2011). 9. M. M. Motazacker, B. R. Rost, T. Hucho, M. Garshasbi, K. Kahrizi et al., A defect in the ionotropic glutamate receptor 6 gene (GRIK2) is associated with autosomal recessive mental retardation. Am. J. Hum. Genet. 81, 792-798 (2007). 10. K. Takahashi, S. Yamanaka, Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663-676 (2006). 11. C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar et al., GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotech. 28, 495-501 (2010). 12. C.-J. Rubin, M. C. Zody, J. Eriksson, J. R. S. Meadows, E. Sherwood et al., Whole genome resequencing reveals loci under selection during chicken domestication. Nature 464, 587-591 (2010). 13. C.-J. Rubin, H. Megens, A. Martinez Barrio, K. Maqbool, S. Sayyab et al., Strong signatures of selection in the domestic pig genome. Proc. Natl. Acad. Sci. U.S.A. 109, 19529-19536 (2012). 14. M.V. Olson, When less is more: Gene loss as an engine of evolutionary change. Am. J. Hum. Genet. 64, 18–23 (1999). 15. P. V. Tran, C. J. Haycraft, T. Y. Besschetnova, A. Turbe-Doan, R. W. Stottmann et al., THM1 negatively modulates mouse sonic hedgehog signal transduction and affects retrograde intraflagellar transport in cilia. Nat. Genet. 40, 403-410 (2008). 16. K. Agger, P. A. C. Cloos, J. Christensen, D. Pasini, S. Rose et al., UTX and JMJD3 are histone H3K27 demethylases involved in HOX gene regulation and development. Nature 449, 731-734 (2007). 17. H. Deng, K. Gao, J. Jankovic, The genetics of Tourette syndrome. Nat. Rev. Neurol. 8, 203- 213 (2012). 15 18. J. K. Pritchard, J. K. Pickrell, G. Coop, The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. 20, 208-215 (2010). 19. L. Andersson, Molecular consequences of animal breeding. Curr. Opin. Genet. Dev. 23, 295-301 (2013). Supplementary Materials www.sciencemag.org Methods Tables S1 to S7 Figs. S1 to S7 Databases S1 to S5 References (20-61) 16 Rabbit genome analysis reveals a polygenic basis for phenotypic change during domestication Carneiro et al. Supplementary Materials www.sciencemag.org Methods Tables S1 to S7 Figs. S1 to S7 Databases S1 to S5 Full reference list List of Supplementary material: Table S1. Summary statistics of the rabbit OryCun2.0 assembly. Table S2. Comparison between rabbit genome assembly versions OryCun1.0 and OryCun2.0. Table S3. Summary of RNA-sequencing data used for annotation of the rabbit genome. Table S4. Sample details of wild and domestic rabbits used in this study both for whole-genome resequencing and DNA sequence capture using hybridization on microarrays. Average sequence coverage per sample is provided. 1 Table S5. Summary of SNPs and Insertions/Deletions detected in the rabbit genome. Table S6. Distributions of SNP counts in the different delta allele frequency bins for conserved non-coding elements, UTRs, coding sequences and introns. Deviations from expected values were tested with a standard Χ2–analysis (d.f.=1). M-values were calculated using the average frequency of the corresponding annotation category as reference. Table S7. Summary of electrophoretic mobility shift assays using nuclear extracts from ES-cell derived neural stem cells (ES) or from mouse P19 embryonic carcinoma cells before (un-diff) or after neuronal differentiation (diff). "-" = no shift, "+" = shifted, "++" = lower band intensity, "+++" = whole band/complex disappeared and "?" = unclear. Fig. S1. Genetic relatedness between populations sampled in this study (A) Heat map (color code to the right) of identity scores based on comparing resequencing data with the assembly. The xaxis represents genome coordinates with chromosome 1 to the left and chromosome X to the right. The first row (R) represents the reference rabbit against itself, rows 8-21 correspond to locations marked with red dots in Fig. 1A, ordered according to a northeast to southwest transection. (B) Strong correlation in allele frequencies between domestic and French wild rabbits; RAF French and RAF Domestic represent the frequency of the reference allele in wild French and domestic rabbits, respectively. 2 Fig. S2. Distributions and scatterplot of heterozygosity and FST in 50 kb windows across the rabbit genome. (A) Distribution of pooled heterozygosity in pools of domestic rabbits. The arrow indicates the upper heterozygosity threshold for opening a sweep. (B) Distribution of FST values observed in the contrast domestics vs. wild French. The arrow indicates the lower FST threshold for opening a sweep. (C) Scatter plot of FST and pool heterozygosity. Red dots represent windows fulfilling the sweep opening requirement (FST ≥ 0.35 and heterozygosity ≤0.05). Fig. S3. Demographic model and magnitude of the domestication bottleneck estimated using the targeted capture dataset. (A) Schematic representation of the coalescent model used to represent the main events of the demographic history of wild and domestic rabbits. See Supplementary Methods for a detailed description of all parameters and assumptions. (B) Diagram representing the multilocus maximum-likelihood (ML) estimate for the magnitude of the domestication bottleneck (kdom). Multilocus ML (y axis) of the parameter kdom (x axis) was estimated as the product of the likelihood for each locus. Fig. S4. Comparison between observed and simulated data under the estimated demographic model. Observed and simulated data are represented in red and black, respectively. The first two panels (A and B) illustrate the relationship between values of both π and θw in wild rabbits from France (x axis) and domestic rabbits (y axis). The two histograms on the bottom row (C and D) illustrate the distribution of FST and percentage of shared mutations between the two populations. These figures include the full observed dataset, including outliers, explaining the discrepancy between the observed and simulated data. 3 Fig. S5. Gene overrepresentation analysis using the GREAT software based on genes associated with high delta allele frequency SNPs (ΔAF≥0.8) at non-coding conserved sites. (A) Gene Ontology categories within Biological Processes, (B) MGI mouse expression data, and (C) MGI mouse phenotypes. Each row in the heat maps represents one specific category and colors on that row indicate the proportion of shared genes in relation to the other categories (ordered in the same way on the x-axis as on y-axis). To the left of each cluster, the type of terms in that cluster is summarized. Bars immediately left of heat-maps visualize the significance-, enrichment- and number of genes of each significant term (P= -log10 Bonferroni-corrected P-value; E=Enrichment: N=number of genes). For P, E, and N, the ranges are indicated below the plot. The full results are presented in Database S3. Fig. S6. Selective sweep at IMMP2L on chromosome 7. Heterozygosity plots for wild (red) and domestic (black) rabbits together with plots of FST values and high ΔAF (HΔAF) SNPs with ΔAF>0.75. Putative sweep regions detected with the FST -Heterozygosity outlier approach and SweepFinder are marked with horizontal bars. Gene annotations in sweep regions are indicated. IMMP2L was not predicted by Ensembl but by our in-house predictions based on RNAsequencing data. The human consensus model features all human IMMP2L transcripts in a collapsed fashion. Asterisks (*) represent ENSOCUT000000. Red dots indicate the location of the high ΔAF insertion/deletion in a conserved element for IMMP2L. Fig. S7. Results for all 17 SNPs functionally examined by electrophoretic mobility shift assays (EMSA). Results of EMSA using nuclear extracts from ES-cell derived neural stem cells or from mouse P19 embryonic carcinoma cells before (un-diff) or after neuronal differentiation (diff) are 4 presented. WT=wild-type allele; D=domestic, the most common allele in domestic rabbits. Cold probes at 100-fold excess were used to verify specific DNA-protein interactions. The results are summarized in Table S7. Database S1. Regions inferred to have been targeted by directional selection using: 1) a FST-H outlier approach contrasting genetic diversity between wild and domestic rabbits, 2) allele frequency spectra (SweepFinder), and 3) an explicit demographic model contrasting genetic diversity between wild and domestic rabbits (capture arrays data). Database S2. P-values obtained using coalescent simulations of the demographic scenario inferred in this study for the SweepFinder and targeted capture analyses. The obtained value for the magnitude of the domestication bottleneck in this study (kdom = 1.3) was lower than a previous estimate (kdom = 2.8). Under the model used to infer directional selection in the domesticated lineage using the targeted capture dataset, the estimation from this study describes a stronger bottleneck and thus renders our selection tests conservative. However, this is not the case for the SweepFinder analysis and P-values using both bottlenecks estimates are provided. Database S3. Full results of overrepresentation analysis using SNPs at conserved non-coding sited with ΔAF>0.80 performed using the GREAT tool. Database S4. Missense SNPs showing a delta allele frequency difference of 0.90 or higher between wild and domestic rabbits. 5 Database S5. Duplications and deletions showing allele frequency differences between wild and domestic rabbits according to an ANOVA analysis. 6 Supplementary Methods Genome assembly and annotation Genome sample collection. Eight adult female rabbits of the Thorbecke New Zealand white partially inbred line were obtained from Covance Research Products. This entire line was destroyed in a fire in 2005, and thus is no longer available. Sequence heterozygosity of these samples was assessed at up to 267 random nuclear loci through PCR and Sanger sequencing at the Broad Institute. Heterozygosity was calculated based on high quality heterozygous single nucleotide polymorphisms (SNPs) inferred from the chromatograms. The sample with the lowest heterozygosity was selected as reference individual to whole genome sequencing. Genome sequencing and assembly. An initial 2X assembly of the rabbit genome (OryCun1) was constructed using paired-end reads from 4 kb plasmids and 40 kb fosmids from a single female rabbit (Table S2). This assembly was described in Lindblad-Toh et al. (8). The assembly described in this paper (OryCun2) used the sequencing reads from OryCun1 as well as additional paired-end reads from 4 kb plasmids, 10 kb plasmids, 40 kb fosmids, and bacterial artificial chromosomes (BACs). All sequenced reads derive from the same individual rabbit, except for those from BACs (see below). Genome assembly was performed using the software package Arachne 2.0 (20). The 6.55X OryCun2 assembly is comprised of 5.04X 4 kb reads, 1.14X 10 kb reads, 0.33X 40 kb reads, and 0.04X BAC-derived reads. This assembly has an N50 contig size of 64.65 kb (i.e. half of all bases reside in a contiguous sequence of 64.65 kb or more), an N50 scaffold size of 35.92 Mb and a total assembly size of 2.66 Gb (Table S1). It shows a heterozygosity of 1/3,506 bp, a GC content of 43.8% and is 16.7% repetitive as defined by 48- 7 mer counts. OryCun2 used 92.3% of all reads produced, comprising 94.3% of all bases produced. It is of similar quality to other Sanger-based draft assemblies. OryCun2 was anchored to chromosomes using a cytogenetically anchored microsatellite map (21). 364 BACs were end-sequenced, resulting in the anchoring of 99 scaffolds, comprising 82% of the genome assembly, or 2.178 Gb. Of the 2.178 Gb anchored, 238 Mb were only ordered, while 1.940 Gb were ordered and oriented. BAC library. The white blood cells of a New Zealand white rabbit of unknown sex were embedded in agarose plugs at the concentration of 107 per ml, treated with ESP buffer (50mg/ml proteinase K, 1% Sarkosyl, and 0.5M EDTA, pH 8.0), and rinsed with TE until suitable for enzymatic digests. The DNA was partially digested with EcoRI and EcoRI methylase, size selected, and cloned into the pBACe3.6 vector as described (22). The ligated DNA was then transformed into DH10B electro-competent cells (Invitrogen). The library was arrayed into 322 384-well plates. The average insert size of 269 randomly selected clones is 175 kb, and about 92% of these clones are greater than 140 kb. This library represents about 7-fold coverage of the rabbit genome. RNA sequencing, transcriptome assembly, and annotation. A panel of 19 RNA samples derived from New Zealand white rabbits was used. Ten RNA samples (nine different tissues from a single female and testis from a male) were purchased from the company Zyagen while the other nine samples were isolated from INRA 1077 New Zealand white rabbits (Table S3). Nineteen strand-specific dUTP libraries (23) were produced from Oligo dT polyA-isolated RNA. The libraries were sequenced using Illumina Hi-Seq instruments, producing 76 bp single end reads (34 Gb of sequence/tissue). All nineteen RNA-seq datasets were assembled via the genomeindependent RNA-seq assembler Trinity (24). 8 The genome assembly was annotated both by the Ensembl gene annotation pipeline (Ensembl release 73, Sept. 2013) and by a novel methodology using both RNA-seq and orthologous annotation in human. The Ensembl gene annotation pipeline created gene models using UniProt protein alignments and RNA-seq data. This pipeline produced 24,964 transcripts arising from 19,293 protein coding genes and 3,375 short non-coding transcripts. The custom pipeline created gene models using the same RNA-seq panel, and older Ensembl rabbit (Genebuild 71.3) and human annotations (Genebuild 71.37). It produced 19,118 high-confidence protein-coding genes, 881 low-confidence protein-coding genes, 1,318 spliced antisense transcripts, 2,243 unspliced antisense loci, 2,746 high confidence lncRNA genes and 48,794 lowconfidence non-coding transcripts. Our analysis of rabbit domestication used Ensembl annotations as well as the custom pipeline for annotation of UTRs, non-coding RNA and noncoding conserved elements. Genome resequencing and data analyses Sampling. To identify regions of the genome likely to have been targeted by selective breeding at domestication and shared across breeds, we obtained whole genome sequence data for both domestic and wild animals using a pool-sequencing approach (Table S4). Our sampling of domestic rabbits followed that of Carneiro et al. (3), and includes individuals from breeds representing a wide range of phenotypes and for which historical records indicate a mostly unrelated origin. This sampling design focusing on divergent breeds should assure an approximation to the ancestral genetic diversity prior to breed formation and captured in the early years of rabbit domestication. In total, we sampled six domestic breeds (Table S4). From the wild, we sampled individuals from three localities in southern France (i.e. the ancestral population that gave origin to domestic rabbits) and from 11 localities within the ancestral range 9 of rabbits in the Iberian Peninsula, including localities within the range of both rabbit subspecies (Table S4). The number of individuals sampled for each breed and locality varied from 10-20. A single snowshoe hare (Lepus americanus) individual was sampled as outgroup. Resequencing, alignment and SNP calling. Paired-end sequencing libraries were generated from domestic and wild rabbit DNA pools using standard procedures. The resulting libraries were sequenced as 2X 76bp paired-end reads using Genome Analyzer II (Illumina), to a coverage of ~10X per pool (Table S4). Sequence reads were then mapped to the reference genome assembly (OryCun2) with the Burrows-Wheeler alignment tool (25) using default parameters, except for the q-parameter, indicating the base quality cut-off for soft-clipping reads, which was set to 5. We then marked duplicate reads using Picard (26). The mapping distance distribution of pairedend reads revealed that a large proportion of paired-end reads had been generated from fragments smaller than the length of the two reads (76bp x 2). In order not to bias allele frequency estimations by counting overlapping parts of the same molecule twice, we merged overlapping paired-end reads into a single read by using a custom python script. For mismatching bases in overlapping parts, the highest quality base and its quality value were retained and for overlapping bases that agreed the base qualities were rescaled (PhredOL=PhredR1+PhredR2) to reflect the increased confidence of base calling. SNP calling was performed using Samtools (27) (version 0.1.19-44428cd) and the output was further filtered using the mpileup2snp option of VarScan v2.3.3 (28). This approach enabled the detection of more than 61 x 106 raw SNPs, a set further processed by using a custom script requiring that: i) the sum of read depths (RD) across all populations 100 ≥ RD ≤350 for autosomes and 100 ≥ RD ≤270 for the X chromosome; ii) the least abundant SNP allele was observed in at least three independent reads at a frequency ≥0.01; iii) the most commonly 10 observed variant allele constituted ≥ 85% of the total count of all three possible variant alleles; iv) the reference allele was represented by at least one read in the dataset; v) that the variant allele was not observed in the reference individual; and vi) the average Phred scaled base quality of the variant allele was ≥15. Application of these filters retained 50,165,386 SNPs for downstream analyses and the allele counts for each of the sequenced rabbit pools were extracted using mpileup from the Samtools package. Read alignments from the Snowshoe hare (Lepus americanus) individual were interrogated at rabbit SNP positions to infer the ancestral and derived states of alleles. For a position to be assessed it was required that either the rabbit reference or variant allele was the only observed allele in the hare and that this allele was supported by ≥3 reads with an average mapping quality of ≥20. Genome-wide identity scores from pool-sequencing data. To visualize genetic relatedness between populations we calculated identity scores (IS) of individual SNPs in relation to the reference assembly using the reference allele frequencies (AFREF) observed for each sequenced pool as well as for the resequenced reference individual (IS= AFREF). Identity scores for 50 kb windows presented along the genome in Fig. S1A were calculated as the average IS of all SNPs in a window. Genome-wide estimates of nucleotide diversity from pool-sequencing data. To estimate levels of nucleotide diversity (π) across all sampled populations we used the PoPoolation package (29), which implements corrections for biases introduced by pooling and sequencing errors. We obtained genome-wide values of π by averaging estimates computed along each chromosome in 50 kb non-overlapping windows. We required per position a minimum coverage of 4, a maximum coverage of 30, at least two reads supporting the minor allele in polymorphic 11 positions, and a Phred quality score of 20 or higher. Windows with less than 60% of positions passing these quality filters were discarded. Identification of regions consistent with directional selection in the domesticated lineage. The merits of model-based methods using simulations versus outlier methods based on genomewide empirical distributions of summary statistics for detecting genomic regions targeted by directional selection have frequently been discussed. Here, we use both these approaches. We started by using an outlier-based approach that is free from any assumptions regarding domestic rabbits’ demographic history and we searched for regions showing unusual levels and patterns of genetic diversity using statistics summarizing heterozygosity and differentiation. We then estimated a neutral demographic scenario for rabbit domestication taking advantage of individual genotypes resulting from an additional dataset where we resequenced to high coverage 5,000 fragments enriched by DNA hybridization on microarrays. This null demographic model was then used to search for signatures of selection in the domesticated lineage using methods relying on different aspects of the data, including the allele frequency spectra, heterozygosity, and genetic differentiation. We described these approaches in detail below. Inference of selection using heterozygosity and FST. In order to define candidate regions having undergone directional selection during rabbit domestication we started by using an approach combining heterozygosity estimates for the six sequenced domestic breeds with estimates of FST between domestic and French wild rabbits. We calculated pooled heterozygosity (HP) for 50 kb windows iterated every 25 kb along each chromosome. HP was calculated independently for the three main populations considered in this study: 1) six sequenced pools of domestic rabbits (HPDOM); 2) three pools of wild rabbits from France (WF); and 3) eleven pools of wild rabbits sampled across the Iberian Peninsula (WI). For each SNP in each pool combination, we 12 determined the major allele from the sum of observed reference and variant alleles and then proceeded to calculate the average frequency of the major allele (MajFreq) and minor allele (MinFreq) in the individual pools included in that particular pool combination: HP was then calculated for individual SNPs in each sequenced pool by HP = 2(MajFreq * MinFreq). HPDOM of individual SNPs were calculated as the mean HP of the individual pools in which read depth ≥ 3 and HPDOM of 50 kb windows was calculated as the average HP of all SNPs in that window. FST was calculated for individual SNPs between domestic rabbits and wild rabbits from France using the formula by Weir & Cockerham (30) and FST of 50 kb windows were then calculated as the average values of all SNPs in a window. For selective sweep calling based on extreme HPDOM and FST values of 50 kb windows we consulted the distributions of HPDOM and FST (Fig. S2) and selected the joint criterion of HPDOM ≤ 0.05 + FST ≥ 0.35 to open a selective sweep and then extended the sweep to each side for as long as windows fulfilled either HPDOM ≤ 0.05 or FST ≥ 0.35. In order not to excessively fragment predicted selective sweeps, regions separated by two or fewer windows not meeting the above extension criteria were collapsed into a single putative sweep region. Inference of selection using the site frequency spectrum and a demographic model. To detect evidence of directional selection we used an additional approach (SweepFinder) that uses a likelihood framework to search for local deviations in the site frequency spectrum when compared to the remainder of the genome, evaluating for each location a selective sweep with a neutral model (7). The genomic background frequency spectrum was obtained for each chromosome independently and the likelihood ratio between neutral and selective models was calculated for a different grid size for each chromosome in order to obtain estimates approximately every 10 kb. Following Williamson et al. (31), we included SNPs that were 13 monomorphic within the targeted subpopulation (i.e. domestic rabbits), but variable in the combined sample (i.e. domestic and wild rabbits from France). This should increase the power to detect recent population-specific sweeps in the domesticated lineage that have eliminated most genetic variation, while accounting for mutation rate heterogeneity across the genome. The number of chromosomes sampled for each position was equal to the number of reference and alternative allele counts. The ancestral state of mutations was polarized using the French wild rabbits (i.e. the most common allele was picked as the ancestral allele). A null distribution is required to infer the statistical significance of the selective sweep hypothesis. To create a null distribution of the likelihood ratio statistic we used neutral coalescent simulations according to a non-equilibrium demographic model based on historical records and incorporating the estimated magnitude of the domestication bottleneck obtained from genetic data (Fig. S3A) (see below for details). We performed 1,000 simulations of 4Mb segments for a total of 200 chromosomes (the total number of chromosomes of domestic rabbits for the six breeds) and a SNP density equal to the observed data. Sequencing pooled DNA results in an additional source of error associated with sampling with replacement of the sequenced alleles. The average coverage (~60X) for domestic rabbits in our dataset was much lower than the total number of chromosomes in the pools (200 chromosomes of domestic origin), and thus allele counts for each position are likely to mostly represent different chromosomes. Nevertheless, in order to more closely mimic the pool sequencing data structure we resampled alleles in the simulated data for each position by randomly drawing values from the empirical distribution of coverage in the observed data. The background frequency spectrum was obtained for each simulation independently and, similarly to the observed data, the likelihood ratio was calculated every 10 kb. Sweep regions were considered significant at a P<0.001 (i.e. likelihood ratio values higher than 14 the top value observed in our simulated data) and the borders of these regions were extended by aggregating genomic positions while P<0.01. Sweep regions separated by less than 10 genomic positions with P-values not meeting the above criteria were collapsed into a single region. Using a P-value < 0.001 approximately 250 windows are expected to be significant just by chance alone (225,368 * 0.001), but we identified 4758 windows resulting in a false discovery rate of ~5%. The magnitude of the domestication bottleneck (kdom) estimated in this study (Fig. S3B) describes a more stringent bottleneck (kdom = 1.3) when compared to a previous estimate (kdom = 2.8) (3). The power to uncover regions consistent with directional selection using the allele frequency spectra has been shown to vary with the magnitude of the bottleneck, and not always in the more intuitive direction (31). For example, CLR values and their variance are lower in stronger bottlenecks than that in bottlenecks of intermediate strength or even in equilibrium models. Although the previous estimate in rabbits was based on a much smaller dataset (16 fragments and biased towards the X-chromosome) when compared to the 5,000 fragments used here, we created an additional null distribution using the same non-equilibrium demographic model but incorporating this less stringent estimate. The distribution is slightly inflated but 63% of regions still displayed CLR values that were not observed in the null distribution and the remaining regions were highly significant (P≤0.003). Given that our current estimate is based on a much larger dataset, and thus likely to be more robust and more representative of genome-wide patterns of genetic diversity, we list in the main manuscript all regions identified using this estimate but both P-values are available in Database S2. Genes within regions robust to these different demographic models are the most promising candidates to follow up with functional studies. For instance, regions containing GRIK2 and SOX2 (Fig. 2) are amongst these regions. 15 A potential confounding factor in both our sweep analysis is artificial selection associated with breed formation. These later events in the domestication process could potentially result in selection being inferred for genomic regions not associated with the initial steps of domestication. We examined this by testing whether two genes (ASIP and TYR) that control breed-characteristic coat colour phenotypes overlap regions detected in our genomic scan. Three of the six breeds carried mutations at these loci, New Zealand (TYR), Champagne d’Argent (ASIP) and Dutch (ASIP) (32, 33). Although these phenotypes follow Mendelian inheritance patterns and certainly represent some of the strongest signatures of selection imposed by the process of breed formation, none of the genes were found within the regions inferred to be under selection. This finding, although anecdotal, suggests that our catalog of regions targeted by directional selection should be highly enriched for genes selected before breed formation, validating the utility of our approach. Confirmation of selective sweeps using a targeted capture dataset. To confirm our selective sweep regions obtained using the pool sequencing approach we generated an additional dataset by sequencing genomic regions enriched by DNA hybridization on microarrays (34) and focused on a slightly different set of breeds (six in total and three in common with the pool sequencing data) and wild individuals from France (five localities in total and none in common with the pool sequencing data; Table S4). Published sequence data for the same genomic regions from six wild rabbits from the Iberian Peninsula (35) were used in data analysis. The full methodological details are given elsewhere (35) so here we provide a brief description. We designed a custom Agilent array for the selective enrichment of 6 Mb of intronic sequence throughout the rabbit genome (5000 fragments of 1.2 kb). This array was designed initially for the study described above and thus the sequenced fragments are located randomly 16 with regard to gene function or to the location of the sweeps inferred from the pool sequencing approach. All sequencing runs of the barcoded Illumina sequencing libraries were performed on an Illumina GAII platform using a combination of single-end and paired-end 76 bp reads. As before, read mapping to the rabbit reference genome was performed using the Burrows-Wheeler alignment tool using default parameters and PCR duplicates were removed from further analysis. SNP and genotype/consensus calling were also carried out using Samtools. SNPs with a minimum quality of 20, minimum mapping quality of 20 (root mean square), and distancing 10 bp from indel polymorphisms were kept for individual genotype calling. Homozygote and heterozygote genotypes were accepted according to the algorithm implemented in Samtools if the total effective sequence coverage was equal or higher than 8X and genotype quality equal or higher than 20, otherwise that specific genotype was coded as missing data. Consensus calling for positions where no SNPs were identified was performed using these same criteria, otherwise that specific position was coded as missing data. The average effective coverage per individual (i.e. after quality filtering and duplicates removal) was 30X±4.3, and >91% of targeted positions were covered by ≥8 reads in all individuals (Table S4). This dataset consisting of DNA sequence variation in individual (rather than pooled) wild and domesticated rabbits allows a formal investigation of the main demographic events in the recent history of domestic rabbits, and its impact upon levels and patterns of genetic diversity across the genome. We carried out computer simulations of the coalescent process, according to the model depicted in Fig. S3A, using a modified version of the computer program ms (36). Our main goal was to estimate the magnitude of the domestication bottleneck which is described as kdom = Nbdom/ddom, where Nbdom is the size of the bottlenecked population and ddom the duration of the bottleneck in generations. Similar models have been applied previously to other domestic 17 species (37-42) and also to rabbits (3). Our model consisted of three populations and describes two bottleneck events occurring from an ancestral source population. First, we incorporated the recent colonization of France from the Iberian Peninsula after the last glacial maximum and consequent bottleneck (3, 43), in which an ancestral population (rabbits from the Iberian Peninsula) of constant size (Na) gives rise at time t2fr to a small founder population (Nbfr). More recently, the bottlenecked population at time t1fr (dfr = t2fr – t1fr, where dfr is the duration of the bottleneck in generations) expands into its current size (Npfr). Second, we incorporated the domestication bottleneck from French wild populations. The overall configuration of this second event is similar to the colonization of France and consists of similar parameters (Nbdom, Npdom, ddom, and kdom). Several summary statistics were calculated both for observed and simulated data. We summarized levels and patterns of genetic diversity using two standard estimators of the neutral mutation parameter: Watterson's θw (44), the proportion of segregating sites in a sample, and π (45), the average number of pairwise differences per sequence in a sample. Genetic differentiation between populations was described by means of the fixation index (FST) (46). All these statistics were calculated for each 1.2 kb fragment independently. We simulated data varying the magnitude of the domestication bottleneck (kdom). To find the best fitting model we compared summary statistics computed on the observed data to those computed on simulated data for each fragment independently. The observed missing data for each fragment was incorporated in the simulations and calculations of the summary statistics (36). Owing to the complex and mostly unknown demography of the natural populations of rabbits, our model is certainly a simplified scenario. Therefore, we conditioned the simulations using rejection sampling (47). Briefly, we simulated the population of rabbits from the Iberian Peninsula as the ancestral population, and forced the simulations to fit the data observed in 18 French wild rabbits. Simulations were kept if the values obtained were within 20% of the observed values of π, θw and FST, and run until 1,000 genealogies were recorded. This should, in principle, help remove uncertainty associated with the demographic scenario prior to domestication that is not historically documented, which might otherwise have negatively impacted our estimation of kdom (39). The best-fitting value of kdom for each fragment was then estimated as the proportion of 1,000 simulations whose summary statistics (θw and FST) were contained within 20% of the observed values in domestic rabbits. The overall maximumlikelihood across loci was estimated as the product of likelihoods for each locus. Loci incompatible with the multilocus estimate of kdom (see below) were removed and the kdom estimate updated until no deviations were found. We used several key assumptions. First, previously published effective population size estimates for wild rabbits from Iberian Peninsula and France were used as current population sizes (Na = 1,000,000 and Npfr = 500,000 (3, 48)), and the current population size of domestic rabbits was assumed to be 50,000. Second, variation in mutation rate among fragments was incorporated in the simulations from the variation in the population mutation parameter (θw = 4Neµ for autosomal and θw = 3Neµ for X-linked loci) estimated from the wild rabbits from the Iberian Peninsula (i.e. the ancestral population). Third, using available historical evidence (43, 49) and assuming a generation time of 1 year (50), we assumed a split time between wild rabbits from Iberian Peninsula and wild rabbits from France of 10,000 generations (i.e. after the last glacial maximum), and the initial domestication event 1,400 generations ago. We have previously shown that because the short time scale since the initial rabbit domestication provides little opportunity for new mutations to have a substantial impact, several fold changes in Ne or split times result in little or no qualitative change in the estimation of kdom 25). Fourth, the parameter k 19 is the ratio of Nb and d, which are positively correlated (37, 38). Therefore, we fixed ddom at 500 generations and used 35 different values of Nbdom so that kdom ranged between 0.1 and 15. Because our previous estimate of kdom based on a smaller dataset was 2.8 (3), we used a finer grid of values between 0.5 and 3.6. Fifth, we included the population recombination parameter (θ = 4Ner) in our simulations assuming the genome-wide average recombination rate in rabbits (~1cM/Mb) (21) and estimates of Ne described above. Finally, for computational efficiency we used a previous estimate of kfr = 2 for the magnitude of the bottleneck associated with the colonization of France (3). Our rejection sampling approach should, in principle, attenuate the potential uncertainty associated with this parameter. Although several features of the demography of both wild and domestic rabbits may not have been accurately captured by our model, the estimated model generated shifts in genetic diversity between wild rabbits from France and domestic rabbits (measured using both π and θw) as well as levels of differentiation between populations (FST and percentage of shared mutations) that were qualitatively similar to those estimated from the observed data (Fig. S4). Average values for all statistics were also similar between observed [πdom/(πdom + πfr) = 0.355; θdom/(θdom + θfr) = 0.337; FST = 0.141; % shared = 0.470] and simulated data [πdom/(πdom + πfr) = 0.376; θdom/(θdom + θfr) = 0.354; FST = 0.090; % shared = 0.535], and even more similar when loci incompatible with genome-wide background levels and patterns of genetic diversity were removed as described below from the observed data [πdom/(πdom + πfr) = 0.393; θdom/(θdom + θfr) = 0.367; FST = 0.091; % shared = 0.519]. Overall, our model is generally consistent with observed patterns of sequence diversity. If directional selection resulted in a sweep of a favorable allele associated with a domestication trait, we expect to find one, or both of the following signatures: 1) a loss of 20 heterozygosity in the genomic region under selection significantly greater than the loss in genetic diversity produced by the domestication bottleneck; and 2) higher differentiation between wild and domestic populations due to the potential for directional selection to generate elevated levels of population differentiation in structured populations. Both these signatures under the described model make no assumptions as to whether selection happened on new or on previously existing genetic variation. In order to identify regions under selection, we used the multilocus value of kdom and interrogated for each locus individually whether levels of genetic diversity (θw) and differentiation (FST) were significantly lower and higher (P≤0.05), respectively, than expected from the null demographic model alone. Given that these two signatures, although partially interrelated because both depend on a within-population component of variation, may respond differently to distinct forms of selection, we performed a significance test for each summary statistic independently (Database S2). Unlike the previous implementation based on SweepFinder, this method using a more stringent kdom and focused on reductions of heterozygosity and increased differentiation renders our test conservative. Due to the small number of individuals used and the relatively short size of the sequenced fragments, we note that P-values associated with individual fragments are not robust to multiple testing. However, our main intent with this additional screen for selection was not to attach great significance to individual loci but to corroborate the findings of the pool-sequencing approach (see main manuscript). In fact, under the specified significance threshold (P<0.05) we identified a strong excess of outlier fragments (FST, n = 703; θw, n= 403) compared to the chance expectations alone (5000*0.05 = 250) This suggests that our combined list of outlier regions based on FST and θw (n = 936) contains false positives but should be highly enriched for regions displaying levels and patterns of genetic diversity that are incompatible with the estimated demographic model. 21 Analyses of allele frequency differences between domestic and wild rabbits at individual SNPs. For estimations of allele frequencies of single SNPs we required minimum read depth sums of 20, 10, and 40 reads across the six domestic population pools, the three French wild pools and the 11 wild Iberian pools, respectively. This filter retained 34,293,238 SNPs where reliable allele frequencies could be estimated. The per-SNP absolute allele frequency difference (ΔAF) between domestic and wild rabbits was then calculated using the formula: ΔAF= abs(RefAFdom-mean(RefAFfrench+ RefAFiberian)). We next binned SNPs by ΔAF in steps of 0.05 (i.e. ΔAF=0–0.05, 0.05–0.10, etc. until 0.95-1.00) and intersected these binned SNPs with coding exons, UTRs, introns and non-coding elements identified as under evolutionary constraint in mammals using a rate-based score (omega) (8). Coding exons and introns were based on rabbit Ensembl version 73. UTRs of Ensembl gene models were frequently either missing or were truncated in relation to those annotated in human. For the intersections of ΔAF SNPs with UTRs we therefore utilized a custom gene prediction pipeline less reliant on cross-species homology and more reliant on our rabbit RNA-sequencing data. This pipeline defined UTRs as elements of predicted protein coding exons where no open reading frame was identified. In order not to include retained introns due to incomplete splicing in our UTR set we removed sequences extending an Ensembl intron by more than 90% of intron length. For intersections with ΔAF binned SNPs we also used elements identified as under evolutionary constraint in analysis of 29 mammalian genomes (8) (Omega set) that were lifted over from hg18 to OryCun2 using the liftOver tool (51) from the University of Santa Cruz (UCSC) and the resulting rabbit elements were then stripped of those overlapping Ensembl version 73 exons. For each ΔAF bin, the proportions of SNPs falling into coding exons, UTRs, introns and the 29 mammals conserved elements were then determined by intersections using the software bedops (52). 22 Coding variants and assessment of functional significance. To investigate the location of SNPs within genes and their potential coding properties we started by using ANNOVAR (53) and relied on Ensembl 73 annotations. We specifically focused on missense SNPs showing a ΔAF>0.90 between wild and domestic rabbits. To infer whether these missense mutations are likely to have functional significance we used several approaches. First, we analyzed sequence conservation for each SNP among 100 vertebrates and placental mammals using the UCSC (University of California Santa Cruz) genome browser with rabbit coordinates obtained with the liftover tool. To identify potential artefacts due to errors in gene predictions, we used the Basic Local Alignment Search Tool (BLAST) (54) for alignment of proteins to the non-redundant database in UniProtKB (UniProt Knowledgebase), aligned the orthologs, and visually inspected the alignments. For example, this information allowed us to identify Ensembl gene models that we believe represent nonfunctional retrogenes due to errors in gene prediction, which was further supported by lack of expression data for those genes using the RNA-seq data generated to annotate the reference genome (see above). Second, the nature of each mutation was assessed using standalone versions of SIFT v4.0.5 (55) and PolyPhen2 v2.2.2 (56) using default parameters. Finally, we inferred the ancestral or derived state of each allele using the outgroup Lepus americanus as mentioned above. Detection of insertions/deletions. We called small insertions/deletions (indels) with GATK (57) using the INDEL model of UnifiedGenotyper with default parameters. In total, 9,331,686 indels in relation to the reference genome were discovered. Because indels are often a product of read misalignment, sequence reads overlapping called indel positions were realigned with the GATK pipeline. We could validate that every initially called indel was recalled again and although restricted by previously discovered indel positions, no novel indels were reported for the 23 realigned dataset. However, the number of alleles counted had been readjusted giving a better estimation of the allelic frequencies at indel positions. In order to calculate the frequency spectra of the distribution, we extracted counts of reads supporting the observed alleles from the GATK vcf files and calculated the reference allele frequency for each sequenced pool. In order to limit the effect of misalignment on called indels we discarded those where the resequenced reference individual had any reads supporting a non-reference allele (n=1,223,493). Then, we established restrictive depth filters to distinguish real indels from putative artefacts. Only indel positions that were supported by a read depth of reads within a 50% range from the average for each population, inclusive the reference individual, were analysed and counted and 1,112,286 positions were removed by this filter. In this analysis, different median depth thresholds were used for autosomes, X-chromosome and the mitochondrion. All the unplaced (Un) scaffolds were treated as autosomal. We ran GATK with default parameters to report a maximum of six allelic variants. From those resulting indels (n= 6,995,907), we found 6,380,621 bi-allelic, 502,852 tri-allelic, 103,723 tetra-allelic, 8,550 penta-allelic, and 161 hexa-allelic indels. There were 1,413,933 indels that supported different major alleles in the two wild subspecies included in this study (O. c. cuniculus and O. c. algirus). These positions were discarded from further analyses. We then grouped the remaining indels (n= 5,581,974) into 5% bins of absolute allele frequency difference (ΔAF) between wild and domestic rabbits to create the background genome-wide distribution. We intersected indels in these ΔAF bins with Ensembl v73 UTRs and coding sequences as well as elements under evolutionary constraint in mammals and identified 2,331, 251 and 9,213 indels falling in these three categories, respectively. Sweeps detected by SweepFinder and the FST-Het combination were merged in one unique dataset and we 24 enlarged these loci by taking 20 kb surrounding them and intersected with indels. Thus, we discovered 15,923 indels overlapping sweeps. Detection of structural changes. To identify duplications/deletions that could differ between wild and domestic rabbits we performed an analysis of variance (ANOVA) on depth of coverage using the package implemented in R (58). We scanned the genome in 1 kb non-overlapping windows and all depths were normalized against the Flemish giant breed, which had higher average depth compared to the other breeds. The contrast was made between all domestic breeds and wild populations. For each window we also calculated a M-value to measure fold-coverage difference between domestic and wild populations as follows: π − π£πππ’π = πππ! ( π·ππππ π‘π€π ππππ‘β ) ππ€ππ ππππ‘β In total, 2,165,484 windows were analysed from which 710 windows were filtered out due to low coverage. 29,792 windows were significant with a P ≤ 0.001. Using the P-value of the ANOVA analysis together with M-values, we merged significant windows to identify regions indicating the same signature of duplication/deletion using the following criteria. First, we opened a locus when a given window fulfilled two criteria: 1) P ≤ 0.001; and 2) absolute (Mvalue) ≥ 0.6. Second, we continued scanning by setting a dynamic cut-off for |M-value|. It was redefined when a given window had higher |M-value| than the starting threshold (|M-value| ≥ 0.6). To merge adjacent windows, we employed a more lenient cut-off for the M-value (≥ 0.5 * redefined |M-value|) and P-value (≤ 0.05). Third, we allowed two tandem outliers and closed the locus when three or more consecutive outliers were detected. Finally, we rescanned windows upstream of the opening site using the criteria mentioned in previous steps. 25 Gene overrepresentation analyses. In order to assess enrichment of gene ontology terms and other ontology terms such as those associated with murine phenotypes and gene expression among ΔAF≥0.8 SNP loci we utilized the software Genomic Regions Enrichment of Annotations Tool (GREAT) (11). In order not to inflate significances due to inclusion of SNPs in strong linkage disequilibrium we selected only one SNP per 50 kb from the set of 1635 SNPs in conserved elements with ΔAF ≥0.80, leaving 1,071 SNPs. We next lifted this selected set of rabbit SNPs over to the human assembly hg18 and used the resulting human coordinates as input for GREAT, all genes with a Transcription Start Site (TSS) within 1 Mb from a high ΔAF SNP were included in the analysis. Overrepresentation was tested using hypergeometric tests, and Bonferroni corrected P-values were used to evaluate statistically significant overrepresentation of terms. For extraction of data in order to condense the GREAT output for visualization in Fig. S5, we first required ≥6 genes associated with a term and then further filtered the GREAT output requiring a Bonferroni P <0.05 and an enrichment >1.7 for GO Biological Process as well as for mouse phenotype gene association terms (MGI Phenotypes) in Fig. S5 and a Bonferroni P<1x108 and an enrichment >2.2 for MGI expression. Next we determined the fractions of genes shared in each pairwise contrast of associated terms (in relation to the term containing fewest genes) and this matrix of gene sharing fractions was then subjected to hierarchical clustering (Pearson correlation) to construct heat maps. Annotations summarizing clustered terms to the left of heatmaps were manually inferred from included terms to represent a description of the individual terms in each cluster. Electrophoretic mobility shift assays (EMSA) Cells and preparation of nuclear extracts. Mouse ES-cell derived neural stem cells were obtained by in vitro differentiation of R1 ES cells using the protocol by Conti et al. (59). The 26 resulting neural stem cells were propagated in N2B27 medium supplemented with EGF and FGF2 (60). After dissociation into single cells by trypsinization, cell pellets were stored at -70°C for further analysis. The P19 embryonic carcinoma cells were maintained in Alpha Modified Eagle’s Medium (αMEM, Invitrogen) supplemented with 10% heat-inactivated fetal bovine serum and penicillin (0.2 U/mL)/streptomycin (0.2 µg/ml)/L-glutamine (0.2 µg/ml) (Gibco) at 37ºC in a 5% CO2 humidified atmosphere. In order to induce the neuronal differentiation of P19, the cells were cultured in αMEM growth medium containing 1 µM all-trans-Retinoic Acid (RA, Sigma Aldrich R-2625) in bacterial garde Petri dishes to promote the aggregation of embryonic bodies (EB). After 48h the EBs were plated in adherent culture plates in RA-free growth medium and after 48h the differentiated P19 cells were harvested for preparation of nuclear extracts. The nuclear extracts were prepared from embryonic stem cell-derived neural stem cells (ES-NSC) and P19 cells using NucBuster Protein Extraction Kit (Novagen). EMSA. Seventeen selected SNPs were functionally assessed using EMSA to reveal their potential to affect DNA-protein interaction. 5’Biotin-labelled probes were synthesized (Integrated DNA Technologies) and can be obtained upon request. Double-stranded probes were generated by annealing single-stranded complementary oligonucleotides in 1X NEB2 buffer (New England Biolabs). Two µg of the nuclear extracts were added to the binding reaction (10X binding buffer, 30.1 mM KCl, 2 mM MgCl2, 7.5% Glycerol, 0.1 mM EDTA, 0.063% NP-40 and 1µg/ml Poly(dI•dC)) and then preincubated for 20 min on ice. For competition reactions, 20 pmol of unlabeled double-stranded oligos were added to the reaction. Thereafter, 20 fmol of biotinylated oligos were added to the reactions and incubated for 20 min at RT. The DNA-protein complexes were separated by electrophoresis on 5% non-denaturing polyacrylamide gel at 100 V for 1.5 h at RT in 0.5×TBE running buffer. The separated complexes were transferred to nylon membrane 27 (Perkin Elmer) at 45 V for 1 h in cold 0.5×TBE. The DNA-protein complexes were crosslinked using a transilluminator with 312 nm UV bulbs. The biotinylated probes were detected using Lightshift Electrophoretic Mobility-Shift Assay kit (Thermo Scientific). References 1. C. Darwin, On the origins of species by means of natural selection or the preservation of favoured races in the struggle for life. (John Murray, London, 1859). 2. J. A. Clutton-Brock, Natural History of Domesticated Mammals. (Cambridge University Press, Cambridge, 1999). 3. M. Carneiro, S. Afonso, A. Geraldes, H. Garreau, G. Bolet et al., The genetic structure of domestic rabbits. Mol. Biol. Evol. 28, 1801-1816 (2011). 4. Materials and methods are available as supplementary material on Science Online. 5. M. Carneiro, F. W. Albert, J. Melo-Ferreira, N. Galtier, P. Gayral et al., Evidence for Widespread Positive and Purifying Selection Across the European Rabbit (Oryctolagus cuniculus) Genome. Mol. Biol. Evol. 29, 1837-1849 (2012). 6. J. Maynard-Smith, J. Haigh, The hitch-hiking effect of a favourable gene. Genet. Res. 23, 2335 (1974). 7. R. Nielsen, S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark et al., Genomic scans for selective sweeps using SNP data. Genome Res. 15, 1566-1575 (2005). 8. K. Lindblad-Toh, M. Garber, O. Zuk, M. F. Lin, B. J. Parker et al., A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476-482 (2011). 28 9. M. M. Motazacker, B. R. Rost, T. Hucho, M. Garshasbi, K. Kahrizi et al., A defect in the ionotropic glutamate receptor 6 gene (GRIK2) is associated with autosomal recessive mental retardation. Am. J. Hum. Genet. 81, 792-798 ( 2007). 10. K. Takahashi, S. Yamanaka, Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126, 663-676 (2006). 11. C. Y. McLean, D. Bristor, M. Hiller, S. L. Clarke, B. T. Schaar et al., GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotech. 28, 495-501 (2010). 12. C.-J. Rubin, M. C. Zody, J. Eriksson, J. R. S. Meadows, E. Sherwood et al., Whole genome resequencing reveals loci under selection during chicken domestication. Nature 464, 587-591 (2010). 13. C.-J. Rubin, H. Megens, A. Martinez Barrio, K. Maqbool, S. Sayyab et al., Strong signatures of selection in the domestic pig genome. Proc. Natl. Acad. Sci. U.S.A. 109, 19529-19536 (2012). 14. M.V. Olson, When less is more: Gene loss as an engine of evolutionary change. Am. J. Hum. Genet. 64, 18–23 (1999). 15. P. V. Tran, C. J. Haycraft, T. Y. Besschetnova, A. Turbe-Doan, R. W. Stottmann et al., THM1 negatively modulates mouse sonic hedgehog signal transduction and affects retrograde intraflagellar transport in cilia. Nat. Genet. 40, 403-410 (2008). 16. K. Agger, P. A. C. Cloos, J. Christensen, D. Pasini, S. Rose et al., UTX and JMJD3 are histone H3K27 demethylases involved in HOX gene regulation and development. Nature 449, 731-734 (2007). 17. H. Deng, K. Gao, J. Jankovic, The genetics of Tourette syndrome. Nat. Rev. Neurol. 8, 203213 (2012). 29 18. J. K. Pritchard, J. K. Pickrell, G. Coop, The genetics of human adaptation: hard sweeps, soft sweeps, and polygenic adaptation. Curr. Biol. 20, 208-215 (2010). 19. L. Andersson, Molecular consequences of animal breeding. Curr. Opin. Genet. Dev. 23, 295301 (2013). 20. D. B. Jaffe, J. Butler, S. Gnerre, E. Mauceli, K. Lindblad-Toh et al., Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2. Genome Res. 13, 91-96 (2003). 21. C. Chantry-Darmon, C. Urien, H. De Rochambeau, D. Allain, B. Pena et al., A firstgeneration microsatellite-based integrated genetic and cytogenetic map for the European rabbit (Oryctolagus cuniculus) and localization of angora and albino. Anim. Genet. 37, 335-341 (2006). 22. K. Osoegawa, P. Y. Woon, B. Zhao, E. Frengen, M. Tateno et al., An Improved Approach for Construction of Bacterial Artificial Chromosome Libraries. Genomics 52, 1-8 (1998). 23. J. Z. Levin, M. Yassour, X. Adiconis, C. Nusbaum, D. A. Thompson et al., Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat. Meth. 7, 709-715 (2010). 24. M. G. Grabherr, B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson et al., Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644-652 (2011). 25. H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589-595 (2010). 26. http://picard.sourceforge.net 27. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009). 30 28. D. C. Koboldt, Q. Zhang, D. E. Larson, D. Shen, M. D. McLellan et al., VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568-576 (2012). 29. R. Kofler, P. Orozco-terWengel, N. De Maio, R. V. Pandey, V. Nolte et al., PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One 6, e15925 (2011). doi:10.1371/journal.pone.0015925. 30. B. S. Weir, C. C. Cockerham, Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358-1370 (1984). 31. S. H. Williamson, M. J. Hubisz, A. G. Clark, B. A. Payseur, C. D. Bustamante et al., Localizing recent adaptive evolution in the human genome. PLoS Genet. 3, e90 (2007). doi:10.1371/journal.pgen.0030090. 32. B. Aigner, U. Besenfelder, M. Muller, G. Brem, Tyrosinase gene variants in different rabbit strains. Mamm. Genome 11, 700-702 (2000). 33. L. Fontanesi, L. Forestier, D. Allain, E. Scotti, F. Beretti et al., Characterization of the rabbit agouti signaling protein (ASIP) gene: Transcripts and phylogenetic analyses and identification of the causative mutation of the nonagouti black coat colour. Genomics 95, 166-175 (2010). 34. E. Hodges, M. Rooks, Z. Xuan, A. Bhattacharjee, D. Benjamin Gordon et al., Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat. Protoc. 4, 960-974 (2009). 35. M. Carneiro, F. W. Albert, S. Afonso, R. Pereira, H. Burbano et al., The genomic architecture of population divergence between subspecies of the European rabbit. PLoS Genet. (in press). 36. P. Pavlidis, S. Laurent, W. Stephan, msABC: a modification of Hudson's ms to facilitate multi-locus ABC analysis. Mol. Ecol. Resour. 10, 723-727 (2010). 31 37. A. Eyre-Walker, R. L. Gaut, H. Hilton, D. L. Feldman, B. S. Gaut, Investigation of the bottleneck leading to the domestication of maize. Proc. Natl. Acad. Sci. U.S.A. 95, 4441-4446 (1998). 38. M. I. Tenaillon, J. U'Ren, O. Tenaillon, B. S. Gaut, Selection versus demography: a multilocus investigation of the domestication process in maize. Mol. Biol. Evol. 21, 1214-1225 (2004). 39. S. I. Wright, I. V. Bi, S. G. Schroeder, M. Yamasaki, J. F. Doebley et al., The effects of artificial selection on the maize genome. Science 308, 1310-1314 (2005). 40. Q. Zhu, X. Zheng, J. Luo, B. S. Gaut, S. Ge, Multilocus analysis of nucleotide variation of Oryza sativa and its wild relatives: severe bottleneck during domestication of rice. Mol. Biol. Evol. 24, 875-888 (2007). 41. A. Haudry, A. Cenci, C. Ravel, T. Bataillon, D. Brunel et al., Grinding up wheat: a massive loss of nucleotide diversity since domestication. Mol. Biol. Evol. 24, 1506-1517 (2007). 42. H. S. Yu, Y. H. Shen, G. X. Yuan, Y. G. Hu, H. E. Xu et al., Evidence of selection at melanin synthesis pathway loci during silkworm domestication. Mol. Biol. Evol. 28, 1785-1799 (2011). 43. G. Queney, N. Ferrand, S. Weiss, F. Mougel, M. Monnerot, Stationary distributions of microsatellite loci between divergent population groups of the European rabbit (Oryctolagus cuniculus). Mol. Biol. Evol. 18, 2169-2178 (2001). 44. G. A. Watterson, On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7, 256-276 (1975). 45. M. Nei, Molecular Evolutionary Genetics. (Columbia University Press, New York, 1987). 46. R. R. Hudson, M. Slatkin, W. P. Maddison, Estimation of Levels of Gene Flow from DNASequence Data. Genetics 132, 583-589 (1992). 32 47. G. Weiss, A. von Haeseler, Inference of population history using a likelihood approach. Genetics 149, 1539-1546 (1998). 48. M. Carneiro, N. Ferrand, M. W. Nachman, Recombination and speciation: loci near centromeres are more differentiated than loci near telomeres between subspecies of the European rabbit (Oryctolagus cuniculus). Genetics 181, 593-606 (2009). 49. C. Callou, De la garenne au clapier: etude archeozoologique du lapin en Europe Occidentale, (Memoir Mus Natl Hist, 2003), pp. 352. 50. R. C. Soriguer, El conejo: papel ecológico y estrategia de vida en los ecosistemas mediterráneos (In Proc.: XV Congreso Internacional de Fauna Cinegética y Silvestre, Trujillo, Spain, 517-542, 1983). 51. D. Karolchik, R. Baertsch, M. Diekhans, T. S. Furey, A. Hinrichs et al., The UCSC Genome Browser Database. Nucleic Acids Res. 31, 51-54 (2003). 52. S. Neph, M. S. Kuehn, A. P. Reynolds, E. Haugen, R. E. Thurman et al., BEDOPS: highperformance genomic feature operations. Bioinformatics 28, 1919-1920 (2012). 53. K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010). doi:10.1093/nar/gkq603. 54. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990). 55. N. L. Sim, P. Kumar, J. Hu, S. Henikoff, G. Schneider et al., SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452-457 (2012). doi:10.1093/nar/gks539. 56. I. A. Adzhubei, S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova et al., A method and server for predicting damaging missense mutations. Nat. Methods 7, 248-249 (2010). 33 57. M. A. DePristo, E. Banks, R. Poplin, K. V. Garimella, J. R. Maguire et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491-498 (2011). 58. R Core Team, A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2013 59. L. Conti, S. M. Pollard, T. Gorba, E. Reitano, M. Toselli et al., Niche-independent symmetrical self-renewal of a mammalian tissue stem cell. Plos Biol. 3, e283 (2005). doi:10.1371/journal.pbio.0030283 60. Q. L. Ying, M. Stavridis, D. Griffiths, M. Li, A. Smith, Conversion of embryonic stem cells into neuroectodermal precursors in adherent monoculture. Nat. Biotechnol. 21, 183-186 (2003). 61. E. M. Gertz, A. A. Schaffer, R. Agarwala, A. Bonnet-Garnier, C. Rogel-Gaillard et al., Accuracy and coverage assessment of Oryctolagus cuniculus (rabbit) genes encoding immunoglobulins in the whole genome sequence assembly (OryCun2.0) and localization of the IGH locus to chromosome 20. Immunogenetics 65, 749-762 (2013). 34 R French Domestic A Identity score Iberian 1.0 0.8 0.6 0.4 0.2 0 Windows along rabbit genome B 1.0 6.5 0.9 RAF Wild French 0.7 5.5 0.6 5.0 0.5 4.5 0.4 4.0 0.3 3.5 0.2 log10(nSNPs) 6.0 0.8 3.0 0.1 2.5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 RAF Domestic Fig. S1: Genetic relatedness between populations sampled in this study. (A) Heat map (color code to the right) of identity scores based on comparing resequencing data with the assembly. The x-axis represents genome coordinates with chromosome 1 to the left and chromosome X to the right. The first row (R) represents the reference rabbit against itself, rows 8-21 correspond to locations marked with red dots in Fig. 1A, ordered according to a northeast to southwest transection. (B) Strong correlation in allele frequencies between domestic and French wild rabbits; RAF French and RAF Domestic represent the frequency of the reference allele in wild French and domestic rabbits, respectively. A Heterozygosity domestic pools number of windows 3000 2500 2000 1500 1000 500 0 0 0.05 B 0.15 0.2 0.25 Heterozygosity 0.3 0.35 0.4 0.45 0.5 FST domestics vs. wild French 6000 number of windows 0.1 5000 4000 3000 2000 1000 0 0 0.1 0.2 0.3 FST domestics vs. wild French C FST 0.4 0.5 0.6 0.7 Scatter plot FST and heterozygosity 0.8 0.6 0.4 0.2 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Heterozygosity domestic pools 0.4 0.45 0.5 Fig. S2: Distributions and scatterplot of heterozygosity and FST of 50 kb windows. (A) Distribution of pool heterozygosity in the domestic pools sequenced. The arrow indicates the upper heterozygosity treshold for opening a sweep. (B) Distribution of FST values observed in the contrast domestics vs. wild French. The arrow indicates thelower FST treshold for opening a sweep. (C) Scatter plot of FST and pool heterozygosity. Red dots represent windows fulfilling the sweep opening requirement (FST >= 0.35 and heterozygosity <= 0.05) Nbfr t2fr dfr t1fr Na Npfr d Nbdom dom t2dom t1dom Npdom Present Wild Iberia Wild France Domestics 12000 Past 0 B Relative Relative ln ln likelihood likelihood 4000 8000 A 0 3 6 9 12 Kdom Kdom Fig. S3: Demographic model and magnitude of the domestication bottleneck estimated using the targeted capture dataset.(A) Schematic representation of the coalescent model used to represent the main events of the demographic history of wild and domestic rabbits. See Supplementary Methods for a detailed description of all parameters and assumptions. (B) Diagram representing the multilocus maximum-likelihood (ML) estimate for the magnitude of the domestication bottleneck (kdom). Multilocus ML (y axis) of the parameter kdom (x axis) was estimated as the product of the likelihood for each locus. 15 0.25 0.020 0.015 0.010 0.005 0 0 0.005 0.010 π Domestics 0.015 0.020 0.25 A 0 0.005 0.010 0.015 π France 0.25 0 0.005 0.010 0.015 0.020 0.25 D 0 0 100 200 1000 300 400 Frequency Frequency 2000 500 600 3000 700 C 0.020 0.1 0.2 0.3 0.4 0.5 0.6 FST 0.7 0.8 0.9 1 10 20 30 40 50 60 70 % Shared mutations 80 90 100 Fig. S4: Comparison between observed and simulated data under the estimated demographic model. Observed and simulated data are represented in red and black, respectively. The first two panels (A and B) illustrate the relationship between values of both π and θw in wild rabbits from France (x axis) and domestic rabbits (y axis). The two histograms on the bottom row (C and D) illustrate the distribution of FST and percentage of shared mutations between the two populations.These figures include the full observed dataset, including outliers, explaining the discrepancy between the observed and simulated data. 80 160 140 120 100 0 3 4 2 60 40 20 3.5 2.5 80 1 0 0.5 1.5 5 0 0 0 0 0 0 1.0 0.9 5 5 20 20 30 15 25 30 15 20 25 30 0 Brain parts and sensory organ Thieler stage 17 Positive regulation of gene expression 20 Negative regulation of gene expression 0.7 0.6 20 20 20 0.5 0.4 30 30 30 0.3 0.2 Hearing phenotype 0.1 160 proportion of genes shared 10 10 0.8 Abnormal sensory capabilities 40 40 50 50 60 60 70 70 40 Abnormal cranial nerve morphology 1.0 0.9 10 Abnormal ectoderm development 16 0 Respiratory system phenotype 14 12 6 5 4 8 6 4 2 0 10 3 2 1 250 200 150 50 100 MGI: Mouse Phenotypes Lethality associated terms 4 0 0 P E N P E N C 25 0 200 5 0 0 Future brain + CNS Thieler stage 15 10 0 Adherens junction organization 0.1 Cerebellum, forebrain and sensory organs Thieler stage 20 0 0.3 0.2 15 20 25 Cell fate 20 Eye, sensory organ and organ morphogenesis 0.1 0.4 10 Morphogenesis 0.5 10 0.2 0.6 Different brain regions Thieler stage 23 10 15 15 0.3 0.7 5 0.4 0.8 5 10 10 0.5 0.9 5 0.6 15 Pattern and axis specification 0.8 0.7 10 Brain and nervous system development 5 Regulation of cell development and neurogenesis proportion of genes shared 1.0 proportion of genes shared 0 25 20 15 10 MGI: Gene expression detected 80 80 80 60 B 200 180 160 140 120 100 0 40 20 5 5 5 5 5 4 5 3 1 2 0 9 8 7 6 5 4 3 2 1 0 10 Gene Ontology: Biological Process A 0 0 70 250 60 6 0 50 Abnormal brain and nervous system morphology terms 18 0 Abnormal eye development P E N Fig. S5: Gene overrepresentation analysis using the GREAT software based on genes associated with high delta allele frequency SNPs (ΔAF≥0.8) at non-coding conserved sites. (A) Gene Ontology categories within Biological Processes, (B) MGI mouse expression data and (C) MGI mouse phenotypes. Each row in the heat-maps represents one specific category and colors on that row indicate the proportion of shared genes in relation to the other categories (ordered in the same way on the x-axis as on y-axis). To the left of each cluster of enriched terms the types of terms in that group is indicated in text. Bars immediately left of heat-maps visualize the significance-, enrichment- and number of genes of each significant term (P= -log10 Bonferroni-corrected P-value; E= fold enrichment: N=number of genes). The ranges for P, E, and N are indicated below the plot. The full results are presented in Database S3. Heterozygosity 0.4 0 FST 0.5 0 Delta > 0.75 1.00 0.75 Fst-Het Sweepfinder IMMP2L^ LRRN3 IMMP2L DOCK4 ZNF277 *15263 LSMEM1 IMMP2L H. sapiens consensus Chr7 47 47.5 48 48.5 49 49.5 Mb Fig. S6: Selective sweep at IMMP2L on chromosome 7. Heterozygosity plots for wild (red) and domestic (black) rabbits together with plots of FST values and high ΔAF (HΔAF) SNPs with ΔAF>0.75. Putative sweep regions detected with the FST-Heterozygosity outlier approach and SweepFinder are marked with horizontal bars. Gene annotations in sweep regions are indicated. IMMP2L was not predicted by Ensembl but by our in-house predictions based on RNA-sequencing data. The human consensus model shows all human IMMP2L transcripts in a collapsed fashion. Asterisks (*) represent ENSOCUT000000. Red dots indicate the location of the high ΔAF insertion/deletion in a conserved element for IMMP2L. Mouse ES-cell derived neural stem cells SNP-1 SNP-2 Cold-probe D WT WT SNP-6 SNP-3 SNP-4 SNP-5 D WT D WT D D D WT WT SNP-7 D SNP-8 WT WT D SNP-13 SNP-14 SNP-15 SNP-16 SNP-17 + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - + - SNP-10 D WT + - + - + - + - + - + - + - + - + - + - + - + -+ - + -+ - + - + - + - SNP-12 SNP-9 WT D SNP-11 D WT WT D WT D WT D D WT WT D WT D Un-differentiated mouse P19 embryonic carcinoma cells SNP-1 WT D SNP-2 WT D SNP-3 WT D SNP-4 WT D SNP-5 D WT SNP-6 D WT SNP-7 D WT SNP-8 WT D SNP-9 WT D SNP-10 SNP-11 D WT WT D SNP-12 WT D SNP-13 WT D SNP-14 WT D SNP-15 D WT SNP-16 SNP-17 WT D WT D Differentiated mouse P19 embryonic carcinoma cells SNP-1 WT D SNP-2 WT D SNP-3 D WT SNP-4 WT D SNP-5 D WT SNP-6 D WT SNP-7 D WT SNP-8 WT D SNP-9 WT D SNP-10 D WT SNP-11 WT SNP-12 SNP-13 SNP-14 SNP-15 WT D WT D WT D D WT D SNP-16 SNP-17 WT D WT D Un-differentiated and differentiated mouse P19 embryonic carcinoma cells repeated for SNPs 1,2,3,5 and 13 SNP-1 Cold-probe un-diff WT D + - + - diff WT D + - + - SNP-2 un-diff WT D + - + - diff WT D + - + - SNP-3 un-diff WT D + - + - diff WT D + - + - SNP-13 SNP-5 un-diff D WT + - + - D diff + - WT + - un-diff WT D + - + - diff WT D + - + - Fig. S7: Results for all 17 SNPs functionally examined by electrophoretic mobility shift assays (EMSA). Results of EMSA using nuclear extracts from ES-cell derived neural stem cells or from mouse P19 embryonic carcinoma cells before (un-diff) or after neuronal differentiation (diff) are presented. WT=wild-type allele; D=domestic, the most common allele in domestic rabbits. Cold probes at 100-fold excess were used to verify specific DNA-protein interactions. The results are summarized in Table S7. Table S1. Summary statistics of the rabbit OryCun2.0 assembly Coverage (Q20) Contig N50 (kb) Scaffold N50 (Mb) Assembly size with gaps (Gb) Assembly size without gaps (Gb) Number of anchored scaffolds % of assembly in anchored scaffolds Sequence in anchored scaffolds (Gb) OryCun2.0 6.55X 64.7 35.9 2.66 2.60 99 82.0 2.18 Table S2. Comparison between rabbit genome assembly versions OryCun1.0 and OryCun2.0. Assembly Covera ge (Q20) Contig N50 (kb) Scaffold N50 (Mb) Assembly size (with gaps) Assembly size (without gaps) OryCun1.0 1.95X 2.84 0.05 3.69 2.08 N/A N/A N/A OryCun2.0 6.55X 64.65 35.92 2.66 2.60 99 82 2.18 Number anchored scaffolds % assembly Sequence in anchored anchored scaffolds bases (Gb) Table S3. Summary of RNA-sequencing data for annotation of the rabbit genome Adult adrenal gland 85 775 334 Assembled Transcripts 29 672 Blood 103 242 794 25 451 1226 Zyagen Brain 118 014 338 59 029 1362 Zyagen Fetal adrenal gland – gestation day 28 92 146 626 36 404 1465 INRA Heart 100 862 716 35 557 1433 Zyagen Kidney 108 298 292 39 556 1403 Zyagen Liver 102 233 272 30 711 1435 Zyagen Lung 93 531 078 41 731 1479 Zyagen Mammary gland – gestation day 3 89 869 204 57 763 1336 INRA Mammary gland - gestation day 14 84 998 914 58 136 1377 INRA Mammary gland – lactation day 16 89 377 970 27 691 1403 INRA Ovary 110 264 790 42 350 1381 Zyagen Peripheral blood mononuclear cells 83 915 910 28 031 1493 INRA Placenta – female fetus – gestation day 28 86 728 588 30 992 1388 INRA Placenta – male fetus – gestation day 28 76 666 886 28 420 1389 INRA Skeletal muscle 96 519 628 28 805 1489 Zyagen Skin 87 995 594 43 294 1462 Zyagen Spleen 87 432 290 48 398 1426 INRA Testis 117 649 896 63 676 1191 Zyagen Tissue Site Pass Filter Reads Median Transcript Length (bp) 1380 Source INRA Table S4. Sample details of wild and domestic rabbits used in this study both for whole genome resequencing and DNA sequence capture using hybridization on microarrays. Average sequence coverage per sample is provided. Population Breed/Location ID Fig.1c Sample size Sequencing method Coverage Domestic Belgian hare 17 (pool) WGS 10.4 Domestic Champagne d'argent 16 (pool) WGS 11.2 Domestic Dutch 13 (pool) WGS 10.1 Domestic Flemish giant 18 (pool) WGS 12.1 Domestic French lop 20 (pool) WGS 11.9 Domestic New Zealand white Thorbecke (reference) 16 (pool) WGS 11.1 1 (individual) WGS 10.9 1 (individual) WGS 10.2 Domestic Hare (outgroup) Wild France Missoula, Montana Caumont FRW1 10 (pool) WGS 10.7 Wild France La Roque FRW2 10 (pool) WGS 11.5 Wild France Villemolaque FRW3 10 (pool) WGS 12.0 Wild Iberian Huelva IW11 16 (pool) WGS 11.0 Wild Iberian IW10 11 (pool) WGS 10.7 IW9 20 (pool) WGS 11.0 Wild Iberian Milmarcos S. Agustin de Guadalix Mora IW8 13 (pool) WGS 11.9 Wild Iberian Toledo IW7 14 (pool) WGS 10.7 Wild Iberian Mazarambroz IW6 16 (pool) WGS 11.3 Wild Iberian Urda IW5 16 (pool) WGS 11.4 Wild Iberian IW4 17 (pool) WGS 11.7 IW3 16 (pool) WGS 12.1 Wild Iberian Carrión de Calatrava Calzada de Calatrava Pedroche IW2 11 (pool) WGS 11.2 Wild Iberian Zaragoza IW1 12 (pool) WGS 6.1 Wild France Aveyron 1 (individual) Seq. capture 27.3 Wild France Fos-su-Mer 2 (individual) Seq. capture 31.7; 31.1 Wild France Herauld 1 (individual) Seq. capture 30.5 Wild France Lancon 2 (individual) Seq. capture 33.3; 34.3 Wild France Vaucluse 1 (individual) Seq. capture 34.0 Domestic Belgian hare 1 (individual) Seq. capture 35.5 Domestic Champagne d'argent 1 (individual) Seq. capture 31.3 Domestic Angora 2 (individual) Seq. capture 27.9; 30.7 Domestic French lop 1 (individual) Seq. capture 36.5 Domestic Flemish giant 2 (individual) Seq. capture 24.3; 20.8 Domestic Rex 1 (individual) Seq. capture 26.7 Wild Iberian Wild Iberian * WGS= whole genome resequencing, Seq. Capture = Targeted resequencing after sequence capture Table S5. Summary of SNPs and Insertions/Deletions detected in the rabbit genome Type Total count In coding Missense2 Conserved 1 sequence non-coding3 SNPs Indels 1 50,165,386 5,581,974 154,489 2,812 48,453 - 719,911 82,289 In coding exons based on Ensembl v73 gene models Amino acid alteration predicted by the software Annovar (28) 3 Located within elements under evolutionary constraint but outside of Ensembl v73 coding exons 2 Table S6. Distributions of SNP counts in the different delta allele frequency bins for conserved non-coding elements, UTRs, coding sequences and introns. Conserved non-coding sites All UTRs Coding ΔAF bin a 0.95-1.0 0.90-0.95 0.85-0.90 0.80-0.85 0.75-0.80 0.70-0.75 0.65-0.70 0.60-0.65 0.55-0.60 0.50-0.55 0.45-0.50 0.40-0.45 0.35-0.40 0.30-0.35 0.25-0.30 0.20-0.25 0.15-0.20 0.10-0.15 0.05-0.10 0-0.05 a Observed o 1,030 6,987 22,968 55,438 108,850 184,708 287,540 417,269 578,178 771,964 1,003,765 1,294,445 1,650,687 2,095,342 2,640,617 3,317,366 4,268,059 5,451,393 6,017,447 4,119,252 o Observed Expected 27 140 448 1,021 1,918 3,009 4,703 6,560 8,953 11,789 15,031 18,986 23,783 30,227 37,522 47,090 60,926 79,151 89,894 58,999 15 102 335 809 1,588 2,694 4,194 6,086 8,433 11,259 14,640 18,880 24,076 30,561 38,514 48,385 62,251 79,510 87,766 60,080 b c b P M 1.80E-03 1.50E-04 4.99E-10 5.98E-14 7.29E-17 9.74E-10 2.42E-15 9.32E-10 1.17E-08 4.86E-07 1.13E-03 4.37E-01 5.71E-02 5.43E-02 3.54E-07 3.02E-09 8.81E-08 2.00E-01 4.62E-13 8.88E-06 0.8 0.5 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Observed 22 113 378 875 1,720 2,782 4,397 6,304 8,375 10,936 14,266 18,459 23,277 29,643 36,784 45,569 59,111 75,301 83,900 57,428 o Expected 14 98 321 775 1,522 2,583 4,022 5,836 8,087 10,797 14,039 18,105 23,087 29,306 36,933 46,398 59,695 76,245 84,162 57,614 b P c 3.13E-02 1.27E-01 1.36E-03 2.97E-04 3.20E-07 8.04E-05 2.60E-09 6.85E-10 1.26E-03 1.78E-01 5.37E-02 8.06E-03 2.08E-01 4.74E-02 4.35E-01 1.06E-04 1.61E-02 5.76E-04 3.63E-01 4.35E-01 M b 0.7 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Observed d 10 56 147 391 646 1,161 1,812 2,592 3,521 4,762 6,046 7,810 9,629 12,263 15,408 19,111 24,976 32,760 37,931 27,334 o Expected 6 42 140 337 661 1,122 1,747 2,535 3,513 4,691 6,099 7,865 10,030 12,732 16,045 20,157 25,933 33,123 36,563 25,029 b P c M d 1.01E-01 3.03E-02 5.53E-01 3.17E-03 5.58E-01 2.43E-01 1.19E-01 2.56E-01 8.92E-01 2.98E-01 4.96E-01 5.34E-01 5.91E-05 3.06E-05 4.55E-07 1.47E-13 2.51E-09 4.54E-02 7.17E-13 2.28E-48 Introns b Observed d 325 2,089 6,951 16,671 32,177 54,993 85,412 124,361 172,726 229,011 299,028 387,925 495,609 632,561 793,553 984,798 1,255,095 1,597,012 1,757,751 1,186,573 0.7 0.4 0.1 0.2 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 -0.1 -0.1 -0.1 -0.1 -0.1 0.0 0.1 0.1 o Expected b 304 2,061 6,774 16,351 32,105 54,479 84,808 123,071 170,530 227,687 296,055 381,789 486,861 618,010 778,835 978,439 1,258,840 1,607,858 1,774,813 1,214,951 c M 1.51E-01 4.63E-01 1.04E-02 2.88E-03 6.32E-01 8.73E-03 1.35E-02 1.19E-05 2.40E-10 9.51E-04 7.65E-11 2.84E-32 2.07E-50 1.10E-107 8.75E-88 1.92E-14 7.03E-05 2.27E-24 1.58E-52 1.87E-206 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Binned delta allele frequencies for the contrast domestic vs. wild rabbits. Domestic and Wild reference allele frequencies (dRAF and wRAF) and ΔAF were calculated as: dRAF= mean(RAF Domestics), wRAF= (mean(RAF Wild French)+mean(RAF Wild Iberian))/2, ΔAF=abs(dRAF-wRAF) b The expected number of SNPs of each category in each bin was calculated as p(category) X n(binX), where p(category) is the proportion of a specific SNP category (Conserved non-coding, UTR, coding or intronic) in the entire genome and n(binX) is the total number of SNPs in a given bin. M-­βvalues were then calculated as the log2 fold change of the observed SNP count vs. the expected SNP count. c d b P Statistical significance of deviations from expected values were tested with a standard Χ2–analysis (d.f.=1) For coding SNPs in the ΔAF bin 0.95-­β1.0 4 SNPs annotated as nonsynonymous in CTDSP2 were shown to reside in a pseudogene (see Database S4), reducing the number of coding SNPs from 14 to 10, the P-­βvalue from 0.001 to 0.1 and the M-­βvalue from 1.2 to 0.7. P-­βvalues in bold font indicate those ≤ 0.05. M-­βvalues in bold font indicate those with M ≥ 0.1 and with P ≤ 0.05 Table S7. Summary of electrophoretic mobility shift assays using nuclear extracts from ES-cell derived neural stem cells (ES or from mouse P19 embryonic carcinoma cells before (un-diff) or after neuronal differentiation (diff). Shifted bands Difference domestic vs. wild SNP Chr Pos Gene* ES P19 P19 ES P19 P19 un-diff. diff. un-diff. diff. 77728853 1 chr14 SOX2 + + + ++ 77734115 2 chr14 SOX2 + + + ++ 78187181 3 chr14 SOX2 + + + ++ 78228121 4 chr14 SOX2 48839466 5 chr18 PAX2 + + + +++ 5538210 6 chr1 KLF4 + + + 5609176 7 chr1 KLF4 77386300 8 chr14 SOX2 + + ? ? 77458138 9 chr14 SOX2 + + + ++ ++ ++ 77700191 10 chr14 SOX2 + + + 77885074 11 chr14 SOX2 + + + ++ ++ ++ 78046337 12 chr14 SOX2 + + + 78227621 13 chr14 SOX2 + + + +++ +++ +++ 78223293 14 chr14 SOX2 + + + +++ +++ +++ 5609251 15 chr1 KLF4 78256018 16 chr14 SOX2 + 78290160 17 chr14 SOX2 + + ? "-" = no shift, "+" = shifted, "++" = lower band intensity, "+++" = whole band/complex disappeared and "?" = unclear