Molecular Ecology (2013) 22, 3098–3111
doi: 10.1111/mec.12131

Genome sequence of dwarf birch (Betula nana) and cross-species RAD markers

NIAN WANG,* MARIAN THOMSON,† WILLIAM J. A. BODLES,‡ ROBERT M. M. CRAWFORD,§ HARRIET V. HUNT,¶ ALAN WATSON FEATHERSTONE,** JAUME PELLICER†† and RICHARD J. A. BUGGS*

*School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London, E1 4NS, UK, †The GenePool, Ashworth Laboratories, King’s Buildings, University of Edinburgh, Edinburgh, EH9 3JT, UK, ‡Highland Birchwoods, Littleburn Road, Munlochy, Ross-shire, IV8 8NN, UK, §School of Biology, University of St Andrews, St Andrews, Fife, KY16 9TH, UK, ¶McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge, CB2 3ER, UK, **Trees for Life, The Park, Findhorn Bay, Forres, IV36 3TZ, UK, ††Jodrell Laboratory, Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AB, UK

Correspondence: Richard Buggs, Fax: +44 (0) 207 882 7732; E-mail: r.buggs@qmul.ac.uk

© 2012 John Wiley & Sons Ltd

Abstract

New sequencing technologies allow development of genome-wide markers for any genus of ecological interest, including plant genera such as Betula (birch) that have previously proved difficult to study due to widespread polyploidy and hybridization. We present a de novo reference genome sequence assembly, from 66× short-read coverage, of Betula nana (dwarf birch) – a diploid that is the keystone woody species of subarctic scrub communities but of conservation concern in Britain. We also present 100 bp PstI RAD markers for B. nana and closely related Betula tree species. Assembly of RAD markers in 15 individuals by alignment to the reference B. nana genome yielded 44–86k RAD loci per individual, whereas de novo RAD assembly yielded 64–121k loci per individual. Of the loci assembled by the de novo method, 3k homologous loci were found in all 15 individuals studied, and 35k in 10 or more individuals. Matching of RAD loci to RAD locus catalogues from the B. nana individual used for the reference genome showed similar numbers of matches from both methods of RAD locus assembly but indicated that the de novo RAD assembly method may overassemble some paralogous loci. In 12 individuals hetero-specific to B. nana, 37–47k RAD loci matched a catalogue of RAD loci from the B. nana individual used for the reference genome, whereas 44–60k RAD loci aligned to the B. nana reference genome itself. We present a preliminary study of allele sharing among species, demonstrating the utility of the data for introgression studies and for the identification of species-specific alleles.

Keywords: Betula, genome sequence, hybridization, polyploidy, restriction-site associated DNA tags

Received 30 June 2012; revision received 12 October 2012; accepted 17 October 2012

Introduction

Plant species complexes where polyploidy and hybridization are widespread pose significant challenges to ecological and conservation genetic study (Lowe et al. 2004), but next-generation sequencing technologies have begun to make these systems tractable (Buggs et al. 2012b; Ilut et al. 2012; Lai et al. 2012). One challenge in such systems is genotyping genetic markers at homologous loci among individuals and species, in a manner that is both high-throughput and inexpensive (Cronn et al. 2012). Many genotyping systems rely upon the synthesis of unique oligonucleotides to bind each locus being assayed (Kim & Misra 2007; Mamanova et al. 2010; Cronn et al. 2012); this makes genome-wide studies both expensive and vulnerable to unexpected polymorphisms within oligonucleotide-binding sites. In many ecological genetic studies, sample sizes are high, the nature of polymorphism among and within species is unknown, and funding is low.
This makes genotyping by sequencing, for example using restriction-site-associated DNA (RAD) tag sequencing (Baird et al. 2008; Davey et al. 2011), a promising methodology, as it may be implemented without prior knowledge of genomic diversity and without costly synthesis of thousands of different oligonucleotides.

In this study, we describe the development of RAD sequencing for Betula (birch), a widespread, outcrossing, highly polymorphic genus where hybridization and polyploidization are rampant (Woodworth 1929; Järvinen et al. 2004). The ultimate intention is to use RAD markers to trace patterns of genome-wide introgression among hybridizing Betula species (B. nana, B. pubescens and B. pendula) in Britain through sequencing of homologous loci in many individuals.

To be able to more fully characterize and validate the RAD markers developed, and to greatly increase genomic data available for Betula, we also present a draft reference genome sequence for an individual of Betula nana (dwarf birch). This shrub species is common in exposed subarctic and alpine tundra and swampy montane habitats, but in the United Kingdom it is nationally scarce, mainly restricted to small, relictual, montane populations in the Scottish highlands (Aston 1984; de Groot et al. 1997). It is under active conservation by organizations such as Trees for Life, Scottish Natural Heritage and the Forestry Commission. Betula nana is a diploid species and the genome sizes of five individuals from west Iceland were recently measured: two plants analysed using flow cytometry had 1C (i.e. unreplicated haploid genome size) values of 453 and 462 Mbp, and three plants analysed using Feulgen DNA Image Densitometry had 1C values of 441, 432 and 451 Mbp (Anamthawat-Jónsson et al. 2010).

Betula nana is reported to hybridize with the diploid tree species silver birch (B. pendula) and the tetraploid tree species downy birch (B. pubescens).
Hybridization may threaten the reproduction of dwarf birch populations in areas where the species is scarce (e.g. Buggs & Pannell 2006) and may also generate recombinant genotypes of stress-tolerant trees in the subarctic (Vaarama & Valanne 1973; Wilsey & Saloniemi 1999; Karlsson et al. 2000). In Iceland, first-generation hybrids between B. nana and B. pubescens produce viable pollen, which can back-cross with parental species (Anamthawat-Jónsson & Tomasson 1990; Karlsdóttir et al. 2008). Introgression has been shown to occur between B. nana and B. pubescens using cytogenetics (Anamthawat-Jónsson & Thórsson 2003), morphology (Elkington 1968; Thórsson et al. 2007) and genetic markers (Thórsson et al. 2001; Palmé et al. 2004). In Iceland, plants at hybrid zones between B. nana and B. pubescens are strictly diploid, triploid or tetraploid at the cytological level, but morphological and genetic intermediates are found in all three of these ploidal levels (Anamthawat-Jónsson & Tomasson 1990; Thórsson et al. 2001; Anamthawat-Jónsson & Thórsson 2003; Thórsson et al. 2007). Putative hybrids between B. nana and B. pubescens have been reported in several locations in Scotland, where they are known as B. × intermedia (Kenworthy et al. 1972; Crawford 2008), and intermediates between strict ploidy levels appear to occur (Kenworthy et al. 1972). We are especially interested in understanding how patterns of introgression may have arisen among Betula species due to hybrid zone movement (Buggs 2007; Currat et al. 2008) as the ranges of Betula species have shifted due to climate change (Huntley & Birks 1983; Caseldine 2001; Karlsdóttir et al. 2009).

Population genetic variation in Betula has previously been studied using microsatellite markers (Truong et al. 2005), RAPDs (Howland et al. 1995; Dabrowska et al. 2006), RFLPs (Howland et al. 1995), chloroplast genes (Palmé et al. 2004) and in situ hybridization (Anamthawat-Jónsson 2004).
These showed high levels of polymorphism and gene flow among species (see also Eriksson & Jónsson 1986). The taxonomy of the genus is not straightforward (De Jong 1993), and morphological intermediacy between taxa is common (Regel 1861; Woodworth 1929). Of particular interest for the current study, B. pendula and B. pubescens in the UK are regarded as two separate species, partly on the basis of ploidy difference (Brown & Al-Dawoody 1979; Gill & Davy 1983; Brown & Williams 1984), yet there is a continuum of morphological variation between them (Brown & Tuley 1971; Pelham et al. 1984; Atkinson & Codling 1986). The Atkinson discriminant function has been devised to distinguish between the two species (Atkinson & Codling 1986), based on leaf-shape quantitative traits. The continuum of morphological intermediacy is thought to be partly due to hybridization (Brown et al. 1982; Atkinson 1992), but morphological variation has also been associated with environmental factors (Pelham et al. 1988); for example, leaf size and apex angle for B. pubescens and B. pendula differ on bogs vs. heaths (Davy & Gill 1984). In East Anglian populations, diploid and tetraploid Betula were found not to be distinct on the basis of either morphological or molecular characters (Howland et al. 1995).

There is clearly much basic genomic characterization needed if we are to understand the genomic basis of species differences in the genus Betula, and the dynamics of hybridization and introgression among them. The work presented in this study is designed to lay foundations for a genome-wide survey of genomic variation among Betula populations and species in the United Kingdom, through (i) the generation of a whole-genome reference sequence for the diploid species B. nana and (ii) the development of a RAD sequencing protocol to assay genome-wide homologous loci among populations and species.
Using a subset of the data generated, we also present a preliminary study of allele sharing among species, as a proof-of-concept of the utility of these data for introgression studies.

Materials and methods

Plant materials

Materials were collected from natural populations of Betula growing in the United Kingdom in the summers of 2010 and 2011. The B. nana individual used for whole-genome sequencing (097-10, Fig. S1 photo, Supporting information) was grown from a seed collected in 2010 from a B. nana shrub on the Dundreggan Estate, Scotland, which belongs to Trees for Life. The seed was germinated in water and grown in a pot of compost in a rooftop garden at the Mile End Campus of Queen Mary University of London. Fresh cambial material was collected from this plant in 2011 for DNA extraction. The leaves shown in Fig. 1 were collected in 2011, and the photo in Fig. S1 (Supporting information) was taken in summer 2012.

Plant tissues used for RAD-sequencing (apart from those used from 097-10, where DNA samples extracted for whole-genome sequencing were also used for RAD) were collected in the field from plants identified by morphology as B. nana (one individual in addition to 097-10), B. pubescens (seven individuals), B. pendula (one individual) and B. × intermedia (five individuals). Plant tissues were stored as dried herbarium specimens or extracted twig cambial tissue on silica gel. A pair of leaves from each sample is shown in Fig. 1, and Table 1 shows the identification of the plants based on leaf morphology according to the standard guide for UK birch identification (Rich & Jermy 1998). We also applied the Atkinson discriminant function, taking the mean for two leaves (Atkinson & Codling 1986; Stace 2010); this function is designed to distinguish between B. pendula and B. pubescens on the basis of the width and tooth structure of leaves. As all samples apart from B. nana showed a range of morphologies from B. pendula, through B. pubescens, to putative B. × intermedia, the Atkinson discriminant function is given for all of them, including those provisionally identified as B. × intermedia. Details of source locations of each individual are shown in Table 1. Five of the individuals were from a Ben Loyal site where Bernard Kenworthy found morphological, biochemical and cytological evidence for hybrid occurrence (Kenworthy et al. 1972).

Fig. 1 Pairs of leaves from each sample used in the present study, showing upper and lower sides (shown at 1:1.77 scale).

DNA extraction

Isolating high-quality genomic DNA from Betula, and especially Betula nana, is difficult. We developed a protocol that utilized a combination of steps selected and modified from three published protocols (Doyle & Doyle 1987; Cullings 1992; Zeng et al. 2002). Due to the high concentration of secondary compounds in leaves of B. nana, we mainly extracted DNA from cambial tissue, as has been used for flow cytometry (Anamthawat-Jónsson et al. 2010). This also had the benefit of reducing the concentration of chloroplast DNA in the final DNA solution.
Table 1 Plant materials used

Plant   Plant  Latitude   Longitude  Approximate location       Approximate          Atkinson        Size†  Species morphology  Number of 100 bp
number  ID                                                      distance to nearest  discriminant           according to        RAD Illumina
                                                                known B. nana (km)   function score         Plant Crib          reads
097-10  97     57.230171  4.754931   Dundreggan                 0                    n/a             S      B. nana             5655935
325     325    57.687443  4.630766   Ben Wyvis                  0                    21              S      B. × intermedia     4278688
425f    425    56.932211  3.179569   Loch Muick                 1                    11              L      B. pubescens        5101887
574     574    57.010529  3.54908    Glen Lui, nr Braemar       5                    16              L      B. × intermedia     6904599
582     582    58.419353  4.416346   Ben Loyal                  0                    n/a             S      B. nana             14344814
583     583    58.419334  4.416296   Ben Loyal                  0                    20              S      B. × intermedia     3035186
605p    605    58.419328  4.419328   Ben Loyal                  0                    16              L      B. pubescens        14467520
1045    1045   58.415103  4.417008   Ben Loyal                  0                    22              S      B. × intermedia     10803087
1123    1123   58.475289  4.435964   nr Ben Loyal               6                    17              M      B. × intermedia     5376089
1124    1124   58.475177  4.435932   nr Ben Loyal               6                    25.5            M      B. pubescens        7829717
1153    1153   57.118613  3.901351   South of Aviemore          7                    16              L      B. pubescens        3613943
1158    1158   55.220690  3.416942   Johnstonebridge, Dumfries  60                   18              L      B. pubescens        5043068
1183a   1183   58.891900  3.383755   Berriedale Wood, Orkney    87                   5               L      B. pubescens        9847086
1184c   1184   52.454126  0.996216   Eccles Car, Norfolk        350                  +13             L      B. pendula          6094568
1578    1578   58.423109  4.421281   Ben Loyal                  0                    23.5            L      B. pubescens        5494190

†Size: height of plant, S = <50 cm, M = 50–200 cm, L = >200 cm.

Leaves or cambial tissue frozen in liquid nitrogen were ground into powder using 3 mm steel beads in a Qiagen TissueLyser (Qiagen, Santa Clarita, CA). To remove polysaccharides from the tissues, up to 1.6 mL of ice-cold TNE buffer (200 mM Tris-HCl, 250 mM NaCl, 50 mM EDTA) was added to each tube and kept on ice for 10 min. These tubes were centrifuged at full speed for 5 min in a refrigerated microfuge, the supernatant discarded and the TNE step repeated.
About 800 µL of 2× CTAB buffer (AppliChem, Darmstadt, Germany), 100 µL 10% Sarkosyl (N-Lauroylsarcosine sodium salt) solution (Sigma-Aldrich, UK) and 20 µL Proteinase K (Qiagen) were added to each tube, and these were vortexed vigorously. Tubes were kept in a 65 °C water bath for about 3 h and were vortexed at intervals of 30 min. Approximately 700 µL 24:1 chloroform/isoamyl alcohol (AppliChem) was added to each tube and mixed by inversion. These tubes were centrifuged at maximum speed for 10 min, and following centrifugation the top (aqueous) layer was carefully transferred to new tubes. Approximately 10 µL RNase (Qiagen) was added to each tube and kept at 37 °C for 30 min. Then the chloroform/isoamyl alcohol step was repeated, and the aqueous layer again removed. A half volume of 5 M NaCl (Sigma-Aldrich) and 0.7 volume of ice-cold isopropanol (VWR International S.A.S, France) were added to this aqueous isolate and mixed by inversion. These tubes were placed at 4 °C for about 45 min for precipitation. After precipitation, the tubes were centrifuged for 10 min at full speed in a refrigerated microfuge, and the supernatant was poured off carefully. Roughly 700 µL cold 70% ethanol (VWR International Ltd., UK) was added to each tube, mixed by inversion and centrifuged for 5 min at maximum speed. The pellet was dried completely, then dissolved in 200 µL high-salt TE buffer (10 mM Tris with pH 8.0, 1 mM EDTA, 1 M NaCl). The DNA was reprecipitated by adding two volumes (400 µL) of ice-cold 95% ethanol and mixing by inversion. These tubes were centrifuged at full speed for 10 min, and the ethanol was poured off. The pellet was dried completely, and then washed twice using 75% ethanol and 95% ethanol, respectively.
The redried pellet was suspended in TE buffer (10 mM Tris with pH 8.0, 1 mM EDTA), and its quality analysed with a Nanovue machine (GE Healthcare, UK), a Qubit analyser (Life Technologies Corporation, Carlsbad, CA) and a 0.8% agarose gel.

Genome sequencing

Genome sequencing was conducted at the Beijing Genomics Institute, China. Five DNA libraries were constructed: three paired-end libraries with insert sizes of 200 bp, 500 bp and 800 bp, and two mate-paired libraries with 2000 bp and 5000 bp insert sizes. All libraries were sequenced through the Illumina HiSeq 2000 pipeline, the paired-end libraries with read length of 95 bp and the mate-paired libraries with read lengths of 49 bp. Reads were filtered to exclude reads with: more than 2% N calls, polyA structures, adapter contamination, quality scores of less than 7 for 40% of the reads in paired-end libraries and 60% of the reads in mate-paired libraries, overlapping paired reads in paired-end libraries, and identical sequences at each end of a paired end.

Genome assembly

The genomic sequences of Betula nana were assembled using SOAPdenovo-63mer version 2.04.3, obtained from the Beijing Genomics Institute. The assembly started with the smallest insert size libraries, moving up through the insert size libraries to join contigs together. The resulting contigs were further processed by the GapCloser program from the Beijing Genomics Institute. Various options, and kmer lengths between 28 and 38, were tried in the assembly, and the quality of the resulting assemblies was assessed using the assemblathon statistics 2 PERL script (Earl et al. 2011), and the CEGMA pipeline (Parra et al. 2007). The best assembly was chosen on the basis of high N50 contig length, low number of Ns and high complete coverage of conserved eukaryote genes.

RAD library preparation and sequencing

Sixteen samples were prepared for RAD sequencing following the protocol in Baird et al. (2008). For each sample, 0.2 µg of genomic DNA was digested with PstI (New England Biolabs, UK). This enzyme has a 6 bp recognition site and leaves a 4 bp overhang. Assuming a 39% GC content and 450 Mb genome size, we expected there would be 60 527 PstI cut sites, using the ‘radcounter_v3.xls’ spreadsheet available from the RAD-sequencing wiki (www.wiki.ed.ac.uk/display/RADSequencing/). Digestion was followed by ligation of barcoded P1 adapters (8 nM). Ligated DNA was sheared using a Covaris S2 (KBiosciences, UK) instrument in 1.5 mL tubes (duty cycle 20%; intensity 5; cycles/burst 200; duration 30 s followed by a 20 s pause step repeated 13 times), and the size range 300 to 600 bp was isolated using gel excision and purification. After this stage, all subsequent reactions were cleaned using Ampure beads. After end-repair and A-tailing, the size-selected DNA was ligated to P2 adapters (400 nM) and PCR amplified. PCR amplification was carried out in 8 independent 25 µL reactions consisting of 20 ng ligated DNA, 0.5 vol 2× Phusion Master Mix (New England Biolabs), 0.05 vol DMSO, 0.04 vol P1 and P2 amplification primers (10 nM stock), using the following cycling parameters: 98 °C for 30 s followed by 14 cycles of 98 °C for 10 s and 72 °C for 60 s. The 8 reactions were pooled into one library for sequencing. The final library consisting of an equimolar pooling of the 8 PCR reactions was quantified by qPCR on an Illumina ECO instrument using the Kapa Library Quantification Kit. The quantified library was checked on an Illumina MiSeq prior to sequencing in a single lane of an Illumina HiSeq 2000 instrument using 100 base reads (v3 chemistry). The library of one individual failed.

RAD data processing

The RAD data were processed by pipelines outlined in Fig. S2 (Supporting information). The read data were checked for adapter contamination, and demultiplexed, using the process_radtags module of Stacks (Catchen et al. 2011). The data were then processed into catalogues for each individual sample through two different pipelines as follows: (i) Alignment of RAD reads to the B. nana genome assembly using Bowtie (Langmead et al. 2009), followed by extraction of read stacks using the pstacks module of Stacks and cataloguing with the cstacks module. (ii) De novo assembly of reads using the ustacks module of Stacks followed by cataloguing with cstacks. In both pipelines, a minimum depth of coverage of 10 reads was required for a stack of reads to be included as a locus. In the first pipeline, Bowtie was set to allow a maximum of two mismatches in the 28 bp ‘seed’ and to require the sum of the Phred quality values at all mismatched positions to be less than 70. In the second pipeline, the maximum distance (in nucleotides) allowed between stacks was set to four.

Catalogues produced by both pipelines were used to calculate the number of loci and alleles covered in each individual. The sstacks module of Stacks (Catchen et al. 2011) was used to search the 097-10 RAD catalogue from each pipeline against the stacks generated for each individual sample in their respective pipelines. In addition, the second pipeline was also used to make a catalogue from all the data from all individuals. This universal catalogue of all RAD loci was then used for searches with sstacks for loci from each individual. Custom PERL scripts and Excel (Microsoft, Washington State, USA) were then used to count the number of RAD loci shared among different individuals.

Preliminary study of introgression

To establish the usefulness of the RAD data generated for the study of introgression among Betula species, we carried out a preliminary survey of the distribution of alleles of a subset of RAD loci. Starting with all matches found using sstacks to the universal catalogue (see above), we extracted those loci that were sequenced in all 15 individuals.
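The extraction of loci sequenced in every individual can be illustrated with a minimal sketch. The input structure (a dictionary mapping each sample name to the set of catalogue locus IDs it matched), the function name and the toy locus IDs are all assumptions made for illustration; they do not represent the sstacks output format or the custom PERL scripts used in the study.

```python
def loci_in_all_samples(matches):
    """Return the sorted catalogue locus IDs that were matched in every sample.

    matches: dict mapping a sample name to the set of catalogue locus IDs
    matched by that sample (hypothetical structure, not sstacks output).
    """
    if not matches:
        return []
    # A locus is kept only if it appears in the match set of every sample.
    shared = set.intersection(*(set(ids) for ids in matches.values()))
    return sorted(shared)


# Toy example with three samples and made-up locus IDs.
toy_matches = {
    "097-10": {1, 2, 3, 5},
    "582": {1, 2, 5, 8},
    "574": {2, 5, 9},
}
print(loci_in_all_samples(toy_matches))
```

With real data, the same intersection would be taken over all 15 individuals before any filtering on polymorphism.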
We retrieved all alleles for these loci from all individuals, and cut this data set down to only those loci that were polymorphic among the 15 individuals but homozygous for the same allele in both B. nana individuals. For this subset of loci, we scored presence and absence of the allele found in B. nana in all other individuals. For each non-B. nana individual, we counted the number of loci where a ‘B. nana’ allele was present at least once, and the number of loci where a ‘B. nana’ allele was not present. Loci with alleles specific to B. nana were annotated by BLASTN searches to GenBank: searches were first carried out using the 100 bp RAD sequences, and then using the 1000 bp of flanking region each side from the B. nana reference genome.

Flow cytometry

The genome sizes of plants 097-10 and 574 were estimated by flow cytometry with propidium iodide (PI) stained nuclei using either fresh young female catkins or twig internode cambial tissue. Target samples were co-chopped with an internal standard (Oryza sativa ‘IR36’ for 097-10, and Solanum lycopersicum ‘Stupické polní rané’ for 574) using new razor blades. The two-step protocol described by Doležel et al. (2007) using Otto’s buffers (Otto 1990) was followed with minor modifications. After co-chopping the samples in 1 mL of ice-cold Otto I buffer, 50 µL of ribonuclease A (1 mg/mL RNase A; Sigma-Aldrich) was added to each sample, which was then incubated for 30 min at 37 °C prior to pelleting the nuclei. The relative fluorescence of up to 10 000 particles per replicate was recorded on a Partec Cyflow SL3 (Partec GmbH, Germany) flow cytometer fitted with a 100-mW green solid state laser (Cobolt Samba; Cobolt, Sweden). The resulting histograms were analysed with the FlowMax software (v. 2.4, Partec GmbH).

Results

Genome sequence

A total of 42.05 Gb of raw data was produced, which yielded 29.84 Gb of clean data. This represents 66× coverage of the genome, assuming a genome size of 450 Mb (Anamthawat-Jónsson et al. 2010). The data were partitioned among the different libraries as follows: 9.20 Gb clean data from the 200 bp insert size library, 7.64 Gb clean data from the 500 bp insert size library, 6.21 Gb clean data from the 800 bp insert size library, 4.83 Gb clean data from the 2000 bp insert size library and 1.96 Gb clean data from the 5000 bp insert size library.

Genome assembly

The best assembly was constructed using a Kmer of 35 in SOAPdenovo, with option –M 3 at the contig stage. The option –M 3 helped to deal with pathway bubbles arising at the contig stage due to heterozygosity in the diploid individual being sequenced. Average contig coverage was 22, when contigs with coverage less than 2.2 and greater than 44.0 were masked. The total size of the scaffolded assembly was 564 Mbp, with 7.78% of the bases in the scaffolds being ‘N’s. The N50 scaffold size of this assembly was 18 689 bp (Fig. 2). Complete ‘Assemblathon2’ statistics for the assembly are shown in Table S1 (Supporting information). The CEGMA pipeline (Parra et al. 2007) showed that all of the 248 core eukaryotic genes searched for were present in the genome assembly, 96.77% of them having ‘complete’ hits (see Table 2). The genome assembly and raw read data are available at the EMBL Nucleotide Sequence Database (study accession ERP001867).

Fig. 2 Statistics for the B. nana reference genome assembly: N(X) scaffold length is calculated by sorting lengths of scaffolds from the longest to the shortest and determining at what point X% of the total assembly size is reached. The length of the scaffold at that point is the N(X) length. Thus, over 60% of the genome is in scaffolds of over 10 kb.

Table 2 Statistics of the completeness of the B. nana reference genome based on BLAST searches for 248 Conserved Eukaryotic genes (CEGs) using the CEGMA pipeline (see main text for details)

                                                             Complete hits  Partial hits
Number of 248 ultra-conserved CEGs present in genome         240            248
Percentage of 248 ultra-conserved CEGs present               96.77          100.00
Total number of CEGs present including putative orthologs    671            903
Average number of orthologs per CEG                          2.80           3.64
Percentage of detected CEGs that have more than 1 ortholog   88.33          95.16

RAD-sequencing

From 15 individual plants, a total of 114.0 million 100 bp reads were generated (11.4 Gb of sequence). Of these, 5.9 million were rejected due to ambiguity in barcodes or RAD-tagging. Thus, 108.1 million reads were retained. The number of retained reads per individual is shown in Table 1. Between 0.3 Gb and 1.4 Gb of sequence data were generated per individual for the 15 individuals. The RAD experiment was designed before the reference genome was assembled, and we had predicted there would be 60 527 PstI cut sites (see Materials and methods). The cut site of PstI (CTGCAG and its complement) was in fact found 70 954 times in the reference genome assembly. Thus, we would expect to find up to 141 908 RAD-tags, and the 108.1 million reads of DNA sequence we obtained for RAD sequencing should provide a mean of 762 reads per tag in total. Read data from the RAD sequencing are available at the EMBL Nucleotide Sequence Database (study accession ERP001869).

RAD locus assembly to reference genome

A summary of the RAD analyses is shown in Fig. S2 (Supporting information), and data from intermediate and final analysis steps are available on DRYAD (Wang et al. 2012). RAD-tag locus assembly was first carried out using Bowtie alignment with the 097-10 B. nana genome assembly as reference, followed by implementation of pstacks and cstacks (Catchen et al. 2011).
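Several of the derived figures quoted in the Results can be checked with simple arithmetic, sketched below. The cut-site prediction assumes independent bases with a fixed GC content; this assumption happens to reproduce the 60 527 figure, but the internal formula of the ‘radcounter_v3.xls’ spreadsheet is not confirmed here. The function and variable names are hypothetical.

```python
def expected_site_count(site, genome_size_bp, gc_content):
    """Expected occurrences of a recognition site in a random sequence,
    assuming independent bases with the given GC content (an assumption)."""
    p_gc = gc_content / 2.0          # probability of G (and of C)
    p_at = (1.0 - gc_content) / 2.0  # probability of A (and of T)
    p_site = 1.0
    for base in site.upper():
        p_site *= p_gc if base in "GC" else p_at
    return genome_size_bp * p_site


# Predicted PstI sites for a 450 Mb genome with 39% GC content.
predicted_sites = expected_site_count("CTGCAG", 450e6, 0.39)

# PstI sites counted in the reference assembly; each site can yield two RAD tags.
observed_sites = 70_954
max_rad_tags = 2 * observed_sites

# Mean reads per tag from the 108.1 million retained reads.
reads_per_tag = 108.1e6 / max_rad_tags

# Genome coverage from 29.84 Gb of clean data and a 450 Mb genome.
coverage = 29.84e9 / 450e6

print(round(predicted_sites), max_rad_tags, round(reads_per_tag), round(coverage))
```

Because CTGCAG is palindromic, the site reads the same on both strands, so no separate reverse-complement term is needed in this sketch.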
This reference-based assembly yielded a catalogue of 86 399 loci from the B. nana 097-10 RAD library; this is 61% of the number of RAD loci predicted from genome cut sites. The other 14 individuals yielded catalogues with a mean of 55 816 loci (minimum 44 932; maximum 60 085, see Fig. 3A). Within each individual catalogue, a mean of 16 508 loci had more than one allele (minimum 8335, maximum 21 005; for mean number of alleles per locus see Fig. 3B). We examined the percentage of loci in each individual’s catalogue that had three or four alleles (Fig. 3C,D) as these should be zero in a diploid. In the two B. nana individuals, 1.4% and 1.3% of loci had three or four alleles; this is slightly higher than expected, probably due to paralogy. In contrast, in all the other individuals the percentage of loci with three or four alleles fell in a narrow range between 4.7% and 6.8%, except for individual 574 (which had initially been identified as B. × intermedia) with 1.3%.

We compared all 15 individual pstacks outputs against the B. nana 097-10 catalogue using sstacks. Although all loci in all catalogues had been initially identified using alignment to the 097-10 genome sequence, not all loci matched loci in the 097-10 RAD catalogue: while 88.3% of the loci in the 582 B. nana catalogue matched loci in the 097-10 catalogue, 78.2–82.3% of the loci in the other catalogues matched loci in the 097-10 catalogue, with the exception of 574 where only 71.6% of loci matched the 097-10 catalogue. Those RAD loci that align to the 097-10 genome but are not found in the 097-10 RAD library are likely to be mainly due to differences among individuals and species at PstI cut sites. Some may be due to low coverage in the 097-10 RAD library, but as this library had the third largest number of reads of any of the libraries, this is unlikely to be a major factor.

Of those RAD stacks in all individuals that matched loci in the 097-10 RAD catalogue, between 12.6% and 16.0% per individual matched more than one locus in the 097-10 RAD catalogue (Fig. 4) – this was true even of the matching of the 097-10 RAD stacks to the 097-10 RAD catalogue, where 13.1% of matching loci matched more than one catalogue locus. This is likely to be because identifying RAD loci using alignment to a genome sequence allows paralogs to be separated, whereas these cannot be distinguished when matching RAD loci against one another.

Fig. 3 Results of RAD Sequencing, comparing results from de novo (dark grey bars) and reference assembled (light grey bars) pipelines. (A) Number of RAD loci with minimum 10X coverage, (B) Mean number of alleles per locus, (C) Percentage of loci with three alleles, (D) Percentage of loci with four alleles.

Fig. 4 RAD stacks from each sample that match more than one locus in the 097-10 catalogue, or have different SNPs, shown as a percentage of the total number of RAD stacks from each sample that match the 097-10 catalogue. Stacks with more than one catalogue locus match are shown in dark grey and stacks with SNPs not in the 097-10 catalogue are shown in light grey. Results from the de novo RAD loci are shown on the left of each pair of columns and results using referenced assembly on the right (with dashed lines).

RAD locus assembly de novo

Sequence reads for RAD loci were also de novo assembled for each individual using ustacks followed by cataloguing with cstacks. By this method, the number of RAD loci found in the RAD data for 097-10 was 89 972; this is 3573 more (1.04 times as many) than the number (86 399) assembled by the reference genome method above. To some extent, the fact that the Bowtie method assembled fewer RAD loci than the de novo method may be due to the failure of some RAD reads to align to the genome assembly in Bowtie.
As a brief excursus to check this difference in loci assembled by the two methods for individual 097-10, we took all RAD reads that had earlier failed to align to the genome assembly with Bowtie, and ran them through the de novo pipeline. This yielded 23 106 extra loci, a much higher number than the 3573 difference between the two RAD assembly methods. The fact that 86 399 loci are found that align with Bowtie to the reference sequence, despite the 23 106 putative loci that are apparently being excluded by the Bowtie method due to non-alignment to the reference genome, may be due to stacks of reads being divided among paralogs in the reference genome by Bowtie (see also below).

For the individuals other than 097-10, the number of RAD loci assembled by the de novo method was on average 1.6 times higher than the number assembled by the Bowtie genome-referenced method (Fig. 3A). This larger difference between the two methods than seen with individual 097-10 is likely to be due to a greater number of reads from these other individuals that do not align with Bowtie to the 097-10 reference genome. Within each individual catalogue, a mean of 29 955 loci had more than one allele (minimum 16 159, maximum 39 998). The percentage of loci with three or four alleles was higher in the de novo assembled data (Fig. 3C,D), perhaps due to paralogs being assembled into the same stacks. However, the overall pattern among individuals was the same as before, with low numbers of tri- and tetra-allelic loci in 097-10, 574 and 582.

We then compared all the individual de novo RAD catalogues against the B. nana 097-10 de novo RAD catalogue using sstacks. As we would expect, lower percentages of loci in each de novo catalogue matched the 097-10 catalogue, as these catalogues had not been made by alignment to the 097-10 genome.
However, the absolute number of matching loci was very similar by the two methods: a mean of 43 827 by the de novo method compared with 42 940 by the reference method. Among these matching loci, an important difference emerged between the two methods: the mean percentage of matching loci that matched more than one 097-10 catalogue locus was 0.1% in the de novo catalogues but 14.2% in the reference-built catalogues. In contrast, the mean percentage of matching loci that had SNPs not in the 097-10 catalogue locus was 39.4% in the de novo catalogues but 24.1% in the reference-built catalogues. Figure 4 shows this difference for each individual in the study separately. This suggests that some SNPs in the de novo assembled RAD loci are due to paralogs being stacked together, and these may be detected as such by the reference RAD assembly method. All read data were also concatenated into a single file and run through the de novo RAD pipeline to produce a universal catalogue of all RAD loci in the 15 individuals.
Fig. 5 Sharing of sequenced RAD loci in the universal catalogue among the 15 individual plant samples. The horizontal axis shows the number of samples in which the 100 bp loci were found with coverage of at least 10 reads and nucleotide difference of up to 4%, and the vertical axis shows the number of loci.
This universal catalogue contained 281 748 loci (which is twice as many as we expected there to be in a B. nana genome). Stacks of de novo RAD loci from each individual were then searched against the universal catalogue, and the degree to which homologous loci were being hit in each individual was assessed. Figure 5 shows the number of loci that were found in different numbers of individuals: 3000 loci were sequenced in every individual, and 35 000 in 10 or more individuals.
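The locus-sharing tally summarized in Fig. 5, and the pairwise percentages of the kind reported in Table S2, can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the input structure (a mapping from sample name to the set of universal-catalogue locus IDs matched by that sample), the function names and the toy sample names and locus IDs are all hypothetical, and the coverage and mismatch filters are assumed to have been applied upstream.

```python
from collections import Counter


def sharing_histogram(matches):
    """Map 'number of samples' -> 'number of catalogue loci found in that many samples'.

    matches: dict of sample name -> iterable of catalogue locus IDs matched
    by that sample (hypothetical input structure, already filtered).
    """
    samples_per_locus = Counter()
    for locus_ids in matches.values():
        samples_per_locus.update(set(locus_ids))  # count each locus once per sample
    return dict(Counter(samples_per_locus.values()))


def loci_in_at_least(histogram, k):
    """Number of loci found in k or more samples."""
    return sum(n_loci for n_samples, n_loci in histogram.items() if n_samples >= k)


def percent_shared(matches, a, b):
    """Percentage of the loci found in sample a that are also found in sample b."""
    loci_a, loci_b = set(matches[a]), set(matches[b])
    return 100.0 * len(loci_a & loci_b) / len(loci_a)


# Toy data only: invented sample names and locus IDs, not results from the study.
toy_matches = {
    "sample_A": {1, 2, 3, 4},
    "sample_B": {1, 2, 5},
    "sample_C": {1, 3, 5, 6},
}
histogram = sharing_histogram(toy_matches)
print(histogram)
print(loci_in_at_least(histogram, 2))
print(percent_shared(toy_matches, "sample_A", "sample_B"))
```

Note that the pairwise percentage is asymmetric: it is expressed relative to the number of loci in the first sample, as in Table S2.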
Table S2 (Supporting information) shows pairwise comparisons between individuals: a mean of 49.3% (standard deviation = 6.7) of loci found in any given individual matched with loci in any other given individual. When the B. nana individuals were compared pairwise with all the non-B. nana individuals, the mean number of loci shared was only slightly lower at 45.3% (standard deviation = 5.1).
Preliminary study of introgression
A subset of the RAD data was used for a proof-of-concept study of introgression. In all matches found using sstacks to the universal catalogue, 3156 loci were sequenced in all 15 individuals. Of these, 1089 loci were polymorphic, and of these, 719 loci were homozygous for the same allele in both B. nana individuals; we designated these alleles as putative ‘B. nana’ alleles. For four of these loci, the ‘B. nana’ alleles were not found in any other individuals, but 715 loci showed the ‘B. nana’ allele to be present in at least one non-B. nana individual. For this subset of 715 loci, presence and absence of the ‘B. nana’ alleles in the 13 other individuals is shown in Fig. 6 as a percentage of the 715 loci examined.
Fig. 6 Preliminary assessment of allele sharing, showing percentage of 715 loci in each non-B. nana individual that show presence of one or more ‘B. nana’ alleles (see main text for details). Abbreviations on the horizontal axis are as follows: L = large (>200 cm height), M = medium (50–200 cm height), S = small (<50 cm height), int = B. × intermedia, pub = B. pubescens.
Individual 574 had ‘B. nana’ alleles at only 44% of loci examined, whereas all other individuals had ‘B. nana’ alleles present at between 69% and 79% of the loci. Apart from 574, the individual with the lowest percentage of ‘B. nana’ alleles was 1184 from Norfolk in the southeast of Britain. The individual with the highest percentage of ‘B. nana’ alleles was 1123 from the northern coast of Britain – the morphology of this plant had previously been scored as B.
× intermedia and was found growing close to 1124 in a large, shrubby population on a west-facing slope above the Kyle of Tongue beach. Excluding individual 574, there is a general tendency for plants with more ‘B. nana’ alleles to be smaller and have a B. × intermedia morphology (Fig. 6). An exception to this is individual 1045, which was a very small plant growing within a heavily grazed population of B. nana on Ben Loyal and seems to have fewer ‘B. nana’ alleles than most other plants. As mentioned in the previous paragraph, four loci were found which had alleles unique to B. nana and homozygous and identical in both B. nana individuals. These may be candidate genes for species differences between B. nana and the other individuals in the study. These were locus numbers 139431, 160800, 222373 and 280086 in the Stacks catalogue made from all RAD data, assembled de novo. For 280086, all the non-B. nana individuals were homozygous and identical for a different allele. These four loci were used for BLASTN searches of the GenBank nucleotide collection, both as 100 bp loci and with 1000 bp flanking sequence each side from the B. nana reference genome where available. Locus 139431 hit a predicted Glycine max cation/proton antiporter 18 (E = 7e−10 with the 100 bp sequence, E < e−120 with flanking region). Locus 160800 had no significant hit as a 100 bp sequence, but with flanking regions hit a putative Ricinus communis beta-glucosidase (E = 2e−48). Locus 222373 had no significant hit as a 100 bp sequence, but with flanking regions hit a predicted Vitis vinifera indole-3-acetate O-methyltransferase 1 (E = 1e−107). Locus 280086 hit a predicted Vitis vinifera interactor of constitutive active Rho-related GTPase 4 (E = 2e−22 with the 100 bp sequence, E = 3e−120 with flanking region), which may be involved in controlling cell polarity (Li et al. 2008).
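The filtering steps of the proof-of-concept introgression analysis (loci sequenced in all individuals, polymorphic, homozygous for the same allele in both B. nana individuals, then scored for presence of that allele elsewhere) can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the scripts used in the study: the genotype structure (locus -> sample -> set of alleles), the function name and the toy sample and locus names are hypothetical.

```python
def nana_allele_sharing(genotypes, nana_samples, all_samples):
    """Score presence of putative 'B. nana' alleles in the other samples.

    genotypes: dict of locus -> dict of sample -> set of alleles observed
               (hypothetical structure; one allele in the set = homozygous).
    Returns (unique_loci, shared_loci, percent_by_sample).
    """
    others = [s for s in all_samples if s not in nana_samples]
    marker = {}  # locus -> putative 'B. nana' allele
    for locus, calls in genotypes.items():
        if any(s not in calls for s in all_samples):
            continue  # locus not sequenced in every individual
        alleles = set().union(*(calls[s] for s in all_samples))
        if len(alleles) < 2:
            continue  # monomorphic locus
        nana_calls = [calls[s] for s in nana_samples]
        same_homozygote = (all(len(c) == 1 for c in nana_calls)
                           and len(set().union(*nana_calls)) == 1)
        if same_homozygote:
            marker[locus] = next(iter(nana_calls[0]))
    # Loci whose 'B. nana' allele is absent from every other individual.
    unique = sorted(l for l, a in marker.items()
                    if not any(a in genotypes[l][s] for s in others))
    shared = [l for l in marker if l not in unique]
    percent = {}
    if shared:
        for s in others:
            present = sum(marker[l] in genotypes[l][s] for l in shared)
            percent[s] = 100.0 * present / len(shared)
    return unique, shared, percent


# Toy genotypes only: invented samples (N1, N2 = B. nana; P1, P2 = others).
toy = {
    "a": {"N1": {"A"}, "N2": {"A"}, "P1": {"A", "G"}, "P2": {"G"}},
    "b": {"N1": {"C"}, "N2": {"C"}, "P1": {"T"}, "P2": {"T"}},
    "c": {"N1": {"A"}, "N2": {"A", "T"}, "P1": {"A"}, "P2": {"T"}},
    "d": {"N1": {"G"}, "N2": {"G"}, "P1": {"G"}, "P2": {"G"}},
    "e": {"N1": {"T"}, "N2": {"T"}, "P1": {"T"}},
    "f": {"N1": {"G"}, "N2": {"G"}, "P1": {"G"}, "P2": {"G", "C"}},
}
unique, shared, percent = nana_allele_sharing(
    toy, nana_samples=["N1", "N2"], all_samples=["N1", "N2", "P1", "P2"])
print(unique, shared, percent)
```

In the toy data, locus "b" plays the role of the four loci with alleles unique to B. nana, while loci "a" and "f" correspond to the subset of loci summarized in Fig. 6.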
Flow cytometry
For plant number 097-10, four replicated flow cytometry runs with CVs below 5% were obtained using fresh decorticated cambium, which gave a mean 2C value of 0.92 pg (SD = 0.02); therefore, the 1C value of this plant is 450 Mb. For plant 574, nine replicated runs from catkins gave a mean 2C value of 2.06 pg (SD = 0.02), thus 1C = 1007 Mb.
Discussion
Genome resources for Betula (birch)
The work presented in this study is a major advance in our knowledge of the genome of Betula, and in the availability of tools for ecological and conservation genomics. This has been made possible by Illumina sequencing technology, and the RAD-tag method of library preparation. The genome sequence we present for B. nana appears to have covered the whole genome well, with scaffolds of sufficient length and accuracy for complete core eukaryote genes to be found by BLAST searches. This allows us to annotate RAD markers that show interesting patterns of diversity among individuals, and the RAD markers themselves may in future allow us to produce a genetic map of the genome (e.g. Baxter et al. 2011). We chose B. nana as the best species for constructing a reference genome for Betula for several reasons. First, it is a diploid species and therefore much simpler to sequence than a tetraploid such as B. pubescens. Second, it exists in the UK in small isolated populations, which are likely to have relatively low genetic diversity compared with the other UK diploid species, B. pendula, which occurs in much larger populations and is likely, as a tall wind-pollinated tree, to have much gene flow among populations. Ideally, we would have used an inbred line but we did not have access to such materials, so an individual from a small isolated population was the next best option. Third, we are interested in introgression from B. nana to other Betula species, and so need a thorough knowledge of the B. nana genome. We used an individual grown from seed so that we could have ready access to fresh materials in the laboratory. We are confident that the plant used was not an F1 hybrid because its morphology is characteristic of B. nana, and we compared this to the very different morphology and growth of plants grown from other seed from the same maternal parent that did show clear F1 hybrid phenotypes. In addition, its genome size as measured by flow cytometry is very similar to those of Icelandic B. nana individuals. It is of course possible that there may be some genes present in the B. nana genome that we have sequenced that have introgressed from other Betula species due to past hybridization followed by backcrossing. However, it should be noted that introgression from tetraploid B. pubescens into diploid B. nana is much less likely than introgression in the reverse direction, due to the direction of ploidy difference (Stebbins 1971). It would be interesting in future to make genomic comparisons with B. nana individuals from large populations in subarctic tundra, though these are likely to harbour more heterozygosity and therefore have genomes that are harder to assemble.
Utility of methods
The RAD marker protocol presented here is shown to be a reliable method of enriching genomic DNA samples with homologous markers among individuals within and among Betula species, for accurate genotyping. The two methods of RAD assembly used have contrasting benefits: assembling with Bowtie to the B. nana reference genome seems to separate out paralogs in the RAD data (assuming that alleles in the B. nana genome have not been under-assembled), but excludes many loci that have good coverage in other species and are assembled by the de novo method.
It has been suggested that RAD markers are best applied at the intraspecific level, whereas target enrichment or sequence capture by oligonucleotides may be more effective at higher levels (McCormack et al. 2011), but among hybridizing Betula species it appears that RAD is an effective method, perhaps because genetic differentiation is low (see below). The Betula system presented here can therefore join a growing list of ‘nonmodel’ plants that have successful RAD protocols, including the globe artichoke (Scaglione et al. 2012), Brassica napus (Bus et al. 2012) and the eggplant (Barchi et al. 2011). The presence of polyploidy in the Betula complex presents challenges for SNP calling (see also Ogden et al., this issue). The Stacks package (Catchen et al. 2011) assumes that samples are diploid and does not distinguish between the different copy numbers of alleles that can be present at heterozygous loci in polyploids. However, where loci have sufficient read-depth, allele copy number could in principle be assessed in a similar manner to measuring allele-specific gene expression in cDNA libraries; other SNP genotyping methods have previously been analysed in this way (Buggs et al. 2012a). A further difficulty in polyploids is that of distinguishing paralogs from homeologs (genes duplicated by whole-genome duplication), though the availability of a genome sequence assembly will help us here, as assembly of RAD tags against the reference genome seems to assist in the identification of paralogs.
Biological implications
Though the RAD analyses presented here are intended as a proof-of-concept for the ability of RAD sequencing to cover homologous loci among individuals and species of Betula, even preliminary analysis of the data provides new biological insights into Betula genetics and species differences. These insights are provisional, given the small sample sizes, but provide valuable hints for future research effort.
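As an aside on the point made under Utility of methods, that allele copy number in polyploids could in principle be assessed from read depth, the following Python sketch shows one way such an estimate might be made at a heterozygous biallelic locus. It is an assumption-laden illustration, not a method used in this study or implemented in Stacks: it picks the allele dosage whose expected read fraction maximizes a simple binomial likelihood, it ignores sequencing error and allelic bias, and the function name and the `min_depth` threshold are hypothetical.

```python
import math


def estimate_dosage(ref_reads, alt_reads, ploidy=4, min_depth=20):
    """Estimate copies of the 'ref' allele at a heterozygous biallelic locus.

    Candidate dosages are 1 .. ploidy-1, with expected read fraction k/ploidy.
    Returns the dosage with the highest binomial log-likelihood, or None when
    the locus has too few reads (min_depth is an arbitrary illustrative cut-off).
    """
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None
    best_k, best_ll = None, -math.inf
    for k in range(1, ploidy):
        p = k / ploidy
        # The binomial coefficient is identical for every k, so it is omitted.
        ll = ref_reads * math.log(p) + alt_reads * math.log(1 - p)
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k


# Toy read counts only, not data from the study.
print(estimate_dosage(30, 10))   # read ratio near 3:1
print(estimate_dosage(20, 20))   # read ratio near 2:2
print(estimate_dosage(12, 38))   # read ratio near 1:3
print(estimate_dosage(5, 5))     # too few reads to call
```

With real data, read-depth thresholds would need to be chosen with care, since adjacent dosage classes (for example 1:3 versus 2:2) are only separable when coverage is high.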
The data have yielded unexpected insights on taxonomy. Individual 574 is a large, mature tree growing in Glen Lui, with unusually small leaves apparently of B. × intermedia phenotype. However, it has unusually low allele sharing with B. nana, so is unlikely to be a hybrid of B. nana. It has allele numbers suggestive of diploidy, but investigation with flow cytometry showed it to have a C-value (1C = 1007 Mb) consistent with tetraploidy. One might speculate that it is an autotetraploid, perhaps related to B. pendula, but further investigations are needed to identify it reliably. In contrast, individual 1184c was initially identified as B. pendula on the basis of leaf morphology but according to our RAD data does not appear to be diploid, as one would expect for B. pendula. This fits with the observation of Howland et al. (1995) that in East Anglia diploid and tetraploid Betula are not distinct on the basis of morphological characters. The data presented here suggest that genetic sequence differentiation between B. nana and B. pubescens in the UK is low. This is indicated by the fact that the percentage of loci shared between sample pairs is similar for pairs between species and pairs within species (Table S2, Supporting information), and by the high degree of allele sharing found between B. nana and B. pubescens individuals (Fig. 6). Among the 1089 loci that were polymorphic and found in all 15 individuals, only 4 showed no allele sharing between B. nana and the other individuals. This low genetic differentiation may reflect past hybridization between B. nana and B. pubescens, or possibly a hybrid origin of B. pubescens with B. nana as a parental species. The study of introgression presented here is of necessity very preliminary, due to the small number of plants sampled. We have therefore only conducted simple presence/absence analyses in those loci that were sequenced in all individuals and therefore easiest to analyse. Thorough sampling of B.
nana populations is needed for the accurate identification of alleles that are truly unique to B. nana, or found in B. nana at a higher frequency than in other species. Despite their preliminary nature, the data show that there are differences in the degree to which alleles found in B. nana are shared with the other individuals, and this to some extent seems to correspond to plant morphology. The fact that the individual from the south of England shows lowest allele sharing with B. nana (if we exclude individual 574) suggests that it is worthwhile to continue investigating whether there is a north-south cline in allele sharing in Britain. Our hypothesis is that such a cline exists and is the result of introgression between B. nana and other Betula species as a hybrid zone between them moved northwards through Britain due to global warming after the last glacial maximum (Huntley & Birks 1983; Caseldine 2001; Buggs 2007; Karlsdóttir et al. 2009). RAD sequencing has proved to be informative in assessing hybridization among Heliconius species (Dasmahapatra et al. 2012) and between rainbow and westslope cutthroat trout (Hohenlohe et al. 2011). It has also shed light on past range shifts in the pitcher plant mosquito (Emerson et al. 2010). In this study, we have also shown how, in principle, candidate genes for the uniqueness of B. nana can be identified. In our simple analysis of allele sharing, we have identified four genes that show no allele sharing between B. nana and the other individuals in the study. Annotation of these genes, facilitated by the availability of the B. nana reference genome, suggests that they may be involved in the growth habits and water-tolerance of B. nana, though at this stage, we cannot exclude the possibility that these alleles were unique to B. nana due to chance alone. More plants need to be sampled to ascertain whether or not these genes are unique to B. nana among more populations.
With a larger data set, we may also be able to discover genes that are unique to B. nana; that is, taxonomically restricted or ‘orphan’ genes (Khalturin et al. 2009). The work presented in this study lays essential foundations for ecological and conservation genetic study in Betula.
Acknowledgements
We thank Michael Drury of Trees for Life for logistical help at the Dundreggan site, Simon Renny-Byfield for assistance with PERL scripts, and Andrew Leitch, Richard Nichols, Douglas Soltis and Pamela Soltis for helpful discussions. RAD sequencing was carried out in the NBAF GenePool genomics facility in the University of Edinburgh. Genome sequencing was conducted at the Beijing Genomics Institute, China. This project was funded by Natural Environment Research Council (UK) Fellowship NE/G01504X/1 to RJAB.
References
Anamthawat-Jónsson K (2004) Preparation of chromosomes from plant leaf meristems for karyotype analysis and in situ hybridization. Methods in Cell Science, 25, 91–95.
Anamthawat-Jónsson K, Thórsson ÆT (2003) Natural hybridisation in birch: triploid hybrids between Betula nana and B. pubescens. Plant Cell, Tissue and Organ Culture, 75, 99–107.
Anamthawat-Jónsson K, Tomasson T (1990) Cytogenetics of hybrid introgression in Icelandic Birch. Hereditas, 112, 65–70.
Anamthawat-Jónsson K, Thórsson ÆT, Temsch EM, Greilhuber J (2010) Icelandic birch polyploids – the case of perfect fit in genome size. Journal of Botany, 2010, 347254.
Aston D (1984) Betula nana L., a note on its status in the United Kingdom. Proceedings of the Royal Society of Edinburgh Section B-Biological Sciences, 85, 43–47.
Atkinson MD (1992) Betula pendula Roth (B. verrucosa Ehrh.) and B. pubescens Ehrh. Journal of Ecology, 80, 837–870.
Atkinson MD, Codling AN (1986) A reliable method for distinguishing between Betula pendula and B. pubescens. Watsonia, 7, 5–76.
Baird NA, Etter PD, Atwood TS et al.
(2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 3, e3376.
Barchi L, Lanteri S, Portis E et al. (2011) Identification of SNP and SSR markers in eggplant using RAD tag sequencing. BMC Genomics, 12, 304.
Baxter SW, Davey JW, Johnston JS et al. (2011) Linkage mapping and comparative genomics using next-generation RAD sequencing of a non-model organism. PLoS ONE, 6, e19315.
Brown IR, Al-Dawoody D (1979) Observations on meiosis in three cytotypes of Betula alba L. New Phytologist, 83, 801–811.
Brown IR, Tuley G (1971) A study of a population of birches in Glen Gairn. Botanical Journal of Scotland, 41, 231–245.
Brown IR, Williams DA (1984) Cytology of Betula alba L complex. Proceedings of the Royal Society of Edinburgh Section B-Biological Sciences, 85, 49–64.
Brown I, Kennedy D, Williams D (1982) The occurrence of natural hybrids between Betula pendula Roth and B. pubescens Ehrh. Watsonia, 14, 133–145.
Buggs RJA (2007) Empirical study of hybrid zone movement. Heredity, 99, 301–312.
Buggs RJA, Pannell JR (2006) Rapid displacement of a monoecious plant lineage is due to pollen swamping by a dioecious relative. Current Biology, 16, 996–1000.
Buggs RJA, Chamala S, Wu W et al. (2012a) Rapid, repeated and clustered loss of duplicate genes in allopolyploid plant populations of independent origin. Current Biology, 22, 248–252.
Buggs RJA, Renny-Byfield S, Chester M et al. (2012b) Next-generation sequencing and genome evolution in allopolyploids. American Journal of Botany, 99, 372–382.
Bus A, Hecht J, Huettel B, Reinhardt R, Stich B (2012) High-throughput polymorphism detection and genotyping in Brassica napus using next-generation RAD sequencing. BMC Genomics, 13, 281.
Caseldine C (2001) Changes in Betula in the Holocene record from Iceland – a palaeoclimatic record or evidence for early Holocene hybridisation? Review of Palaeobotany and Palynology, 117, 139–152.
Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait JH (2011) Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1, 171–182.
Crawford RMM (2008) Plants at the Margin: Ecological Limits and Climate Change. Cambridge University Press, Cambridge, UK.
Cronn R, Knaus BJ, Liston A et al. (2012) Targeted enrichment strategies for next-generation plant biology. American Journal of Botany, 99, 291–311.
Cullings KW (1992) Design and testing of a plant-specific PCR primer for ecological and evolutionary studies. Molecular Ecology, 1, 233–240.
Currat M, Ruedi M, Petit RJ, Excoffier L (2008) The hidden side of invasions: massive introgression by local genes. Evolution, 62, 1908–1920.
Dabrowska G, Dzialuk A, Burnicka O, Ejankowski W, Gugnacka-Fiedor W, Goc A (2006) Genetic diversity of postglacial relict shrub Betula nana revealed by RAPD analysis. Dendrobiology, 55, 19–23.
Dasmahapatra KK, Walters JR, Briscoe AD et al. (2012) Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature, 487, 94–98.
Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics, 12, 499–510.
Davy AJ, Gill JA (1984) Variation due to environment and heredity in birch transplanted between heath and bog. New Phytologist, 97, 489–505.
De Jong P (1993) An introduction to Betula: its morphology, evolution, classification and distribution, with a survey of recent work. In: Proceedings of the IDS Betula Symposium (ed. Hunt D). International Dendrology Society, Richmond, UK.
Doležel J, Greilhuber J, Suda J (2007) Estimation of nuclear DNA content in plants using flow cytometry. Nature Protocols, 2, 2233–2244.
Doyle J, Doyle JL (1987) Genomic plant DNA preparation from fresh tissue-CTAB method. Phytochemical Bulletin, 19, 11–15.
Earl D, Bradnam K, St. John J et al.
(2011) Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research, 21, 2224–2241.
Elkington TT (1968) Introgressive hybridization between Betula nana L. and B. pubescens Ehrh. in north-west Iceland. New Phytologist, 67, 109–118.
Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using high-throughput sequencing. Proceedings of the National Academy of Sciences USA, 107, 16196–16200.
Eriksson G, Jónsson A (1986) A review of the genetics of Betula. Scandinavian Journal of Forest Research, 1, 421–434.
Gill JA, Davy AJ (1983) Variation and polyploidy within lowland populations of the Betula pendula/B. pubescens complex. New Phytologist, 94, 433–451.
de Groot WJ, Thomas PA, Wein RW (1997) Betula nana L and Betula glandulosa Michx. Journal of Ecology, 85, 241–264.
Hohenlohe PA, Amish SJ, Catchen JM, Allendorf FW, Luikart G (2011) Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout. Molecular Ecology Resources, 11, 117–122.
Howland DE, Oliver RP, Davy AJ (1995) Morphological and Molecular Variation in Natural Populations of Betula. New Phytologist, 130, 117–124.
Huntley B, Birks H (1983) An Atlas of Past and Present Pollen Maps of Europe: 0-13,000 Years ago. Cambridge University Press, Cambridge.
Ilut DC, Coate JE, Luciano AK et al. (2012) A comparative transcriptomic study of an allotetraploid and its diploid progenitors illustrates the unique advantages and challenges of RNA-seq in plant species. American Journal of Botany, 99, 383–396.
Jarvinen P, Palmé A, Morales LO et al. (2004) Phylogenetic relationships of Betula species (Betulaceae) based on nuclear ADH and chloroplast matK sequences. American Journal of Botany, 91, 1834–1845.
Karlsdóttir L, Hallsdóttir M, Thórsson ÆT, Anamthawat-Jónsson K (2008) Characteristics of pollen from natural triploid Betula hybrids. Grana, 47, 52–59.
Karlsdóttir L, Hallsdóttir M, Thórsson ÆT, Anamthawat-Jónsson K (2009) Evidence of hybridisation between Betula pubescens and B. nana in Iceland during the early Holocene. Review of Palaeobotany and Palynology, 156, 350–357.
Karlsson PS, Schleicher LF, Weih M (2000) Seedling growth characteristics in three birches originating from different environments. Ecoscience, 7, 80–85.
Kenworthy JB, Aston D, Bucknall SA (1972) A study of hybrids between Betula pubescens Ehrh. and Betula nana L. from Sutherland – an integrated approach. Botanical Journal of Scotland, 41, 517–539.
Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch T (2009) More than just orphans: are taxonomically-restricted genes important in evolution? Trends in Genetics, 25, 404–413.
Kim S, Misra A (2007) SNP genotyping: technologies and biomedical applications. Annual Review of Biomedical Engineering, 9, 289–320.
Lai Z, Kane NC, Kozik A et al. (2012) Genomics of Compositae weeds: EST libraries, microarrays, and evidence of introgression. American Journal of Botany, 99, 209–218.
Langmead B, Trapnell C, Pop M, Salzberg S (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10, R25.
Li S, Gu Y, Yan A, Lord E, Yang Z-B (2008) RIP1 (ROP Interactive Partner 1)/ICR1 marks pollen germination sites and may act in the ROP1 pathway in the control of polarized pollen growth. Molecular Plant, 1, 1021–1035.
Lowe AJ, Harris SA, Ashton P (2004) Ecological Genetics: Design, Analysis and Application. Blackwell, Oxford.
Mamanova L, Coffey AJ, Scott CE et al. (2010) Target-enrichment strategies for next-generation sequencing. Nature Methods, 7, 111–118.
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield RT (2011) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and Evolution, doi: 10.1016/j.ympev.2011.12.007.
Otto F (1990) DAPI staining of fixed cells for high-resolution flow cytometry of nuclear DNA.
In: Methods in Cell Biology (eds Crissman H, Darzynkiewicz Z), pp. 105–110. Academic Press, New York.
Palme AE, Su Q, Palsson S, Lascoux M (2004) Extensive sharing of chloroplast haplotypes among European birches indicates hybridization among Betula pendula, B. pubescens and B. nana. Molecular Ecology, 13, 167–178.
Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23, 1061–1067.
Pelham J, Kinnaird JW, Gardiner AS, Last FT (1984) Variation in, and reproductive capacity of, Betula pendula and Betula pubescens. Proceedings of the Royal Society of Edinburgh Section B-Biological Sciences, 85, 27–41.
Pelham J, Gardiner AS, Smith RI, Last FT (1988) Variation in Betula pubescens Ehrh (Betulaceae) in Scotland - its nature and association with environmental factors. Botanical Journal of the Linnean Society, 96, 217–234.
Regel E (1861) Monographische Bearbeitung der Betulaceen. Nouveaux Memoires de la Societe Imperiale des Naturalistes de Moscou, 13, 59–187.
Rich TCG, Jermy AC (1998) Plant Crib. Botanical Society of the British Isles in association with National Museums & Galleries of Wales, London.
Scaglione D, Acquadro A, Portis E, Tirone M, Knapp S, Lanteri S (2012) RAD tag sequencing as a source of SNP markers in Cynara cardunculus L. BMC Genomics, 13, 3.
Stace CA (2010) New Flora of the British Isles, 3rd edn. Cambridge University Press, Cambridge.
Stebbins GL (1971) Chromosomal Evolution in Higher Plants. Edward Arnold, London.
Thórsson ÆT, Palsson SP, Sigurgeirsson A, Anamthawat-Jónsson K (2007) Morphological variation among Betula nana (diploid), B. pubescens (tetraploid) and their triploid hybrids in Iceland. Annals of Botany, 99, 1183–1193.
Thórsson ÆT, Salmela E, Anamthawat-Jónsson K (2001) Morphological, cytogenetic, and molecular evidence for introgressive hybridization in birch. Journal of Heredity, 92, 404–408.
Truong C, Palmé AE, Felber F, Naciri-Graven Y (2005) Isolation and characterization of microsatellite markers in the tetraploid birch, Betula pubescens ssp. tortuosa. Molecular Ecology Notes, 5, 96–98.
Vaarama A, Valanne T (1973) On the taxonomy, biology and origin of Betula tortuosa Ledeb. Reports from the Kevo Subarctic Research Station, 10, 70–84.
Wang N, Thomson M, Bodles WJA et al. (2012) Data from: Genome sequence of dwarf birch (Betula nana) and cross-species RAD markers. Dryad Digital Repository http://dx.doi.org/10.5061/dryad.v5gd2.
Wilsey BJ, Saloniemi I (1999) Leaf fluctuating asymmetry in tree-line mountain birches, Betula pubescens ssp tortuosa: genetic or environmentally influenced? Oikos, 87, 341–345.
Woodworth RH (1929) Cytological studies in the Betulaceae. I. Betula. Botanical Gazette, 87, 331–363.
Zeng J, Zou YP, Bai JY, Zheng HS (2002) Preparation of total DNA from “recalcitrant plant taxa”. Acta Botanica Sinica, 44, 694–697.
N.W. is a PhD student supervised by R.J.A.B. working on birch hybridization and phylogenetics. M.T. specializes in Illumina and RAD sequencing at the GenePool facility, University of Edinburgh. W.J.A.B. is a projects manager at Highland Birchwoods, leading a Heritage Lottery Fund mountain woodlands project. R.M.M.C. studies plant responses to the environment and is Emeritus Professor at St Andrews University. H.V.H. is a post-doctoral researcher in archaeogenetics at Cambridge University, working on plant genetic diversity and population history. A.W.F. is a conservationist and director of Trees for Life. J.P. is a post-doctoral researcher specializing in flow cytometry and genome size evolution at RBG Kew. R.J.A.B. is a NERC Fellow and Senior Lecturer at Queen Mary University of London, interested in evolution, ecology and genomics.
Data accessibility
DNA read sequences for Betula nana reference genome, and genome assembly: Sequence Read Archive study accession ERP001867 http://www.ebi.ac.uk/ena/data/view/ERP001867.
DNA read sequences for RAD loci: Sequence Read Archive study accession ERP001869 http://www.ebi.ac.uk/ena/data/view/ERP001869.
RAD catalogs and matches: DRYAD http://dx.doi.org/10.5061/dryad.v5gd2.
Herbarium vouchers: Natural History Museum, London, accession numbers BM001074532-BM001074546.
Supporting information
Additional supporting information may be found in the online version of this article.
Table S1 Assemblathon2 statistics for genome assembly.
Table S2 Sequenced RAD loci shared between each pair of samples, expressed as a percentage of the total number of loci in the RAD locus catalog from the sample listed in the left-hand edge column.
Fig. S1 Photograph of Betula nana individual 097-10 used for genome sequencing.
Fig. S2 Flow chart outlining analysis pipelines used in this study.