Molecular Ecology (2013) 22, 3098–3111
doi: 10.1111/mec.12131
Genome sequence of dwarf birch (Betula nana) and
cross-species RAD markers
NIAN WANG,* MARIAN THOMSON,† WILLIAM J. A. BODLES,‡ ROBERT M. M. CRAWFORD,§
HARRIET V. HUNT,¶ ALAN WATSON FEATHERSTONE,** JAUME PELLICER†† and
RICHARD J. A. BUGGS*
*School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London, E1 4NS, UK, †The
GenePool, Ashworth Laboratories, King’s Buildings, University of Edinburgh, Edinburgh, EH9 3JT, UK, ‡Highland Birchwoods,
Littleburn Road, Munlochy, Ross-shire, IV8 8NN, UK, §School of Biology, University of St Andrews, St Andrews, Fife, KY16
9TH, UK, ¶McDonald Institute for Archaeological Research, University of Cambridge, Downing Street, Cambridge, CB2 3ER,
UK, **Trees for Life, The Park, Findhorn Bay, Forres, IV36 3TZ, UK, ††Jodrell Laboratory, Royal Botanic Gardens, Kew,
Richmond, Surrey, TW9 3AB, UK
Abstract
New sequencing technologies allow development of genome-wide markers for any
genus of ecological interest, including plant genera such as Betula (birch) that have
previously proved difficult to study due to widespread polyploidy and hybridization.
We present a de novo reference genome sequence assembly, from 66× short read coverage, of Betula nana (dwarf birch) – a diploid that is the keystone woody species of
subarctic scrub communities but of conservation concern in Britain. We also present
100 bp PstI RAD markers for B. nana and closely related Betula tree species. Assembly
of RAD markers in 15 individuals by alignment to the reference B. nana genome
yielded 44–86k RAD loci per individual, whereas de novo RAD assembly yielded
64–121k loci per individual. Of the loci assembled by the de novo method, 3k homologous loci were found in all 15 individuals studied, and 35k in 10 or more individuals.
Matching of RAD loci to RAD locus catalogues from the B. nana individual used for
the reference genome showed similar numbers of matches from both methods of RAD
locus assembly but indicated that the de novo RAD assembly method may overassemble some paralogous loci. In 12 individuals hetero-specific to B. nana 37–47k RAD loci
matched a catalogue of RAD loci from the B. nana individual used for the reference
genome, whereas 44–60k RAD loci aligned to the B. nana reference genome itself. We
present a preliminary study of allele sharing among species, demonstrating the utility
of the data for introgression studies and for the identification of species-specific
alleles.
Keywords: Betula, genome sequence, hybridization, polyploidy, restriction-site associated DNA
tags
Received 30 June 2012; revision received 12 October 2012; accepted 17 October 2012
Correspondence: Richard Buggs, Fax: +44 (0) 207 882 7732;
E-mail: r.buggs@qmul.ac.uk
Introduction
Plant species complexes where polyploidy and hybridization are widespread pose significant challenges to
ecological and conservation genetic study (Lowe et al.
2004), but next-generation sequencing technologies have
begun to make these systems tractable (Buggs et al.
2012b; Ilut et al. 2012; Lai et al. 2012). One challenge in
such systems is genotyping genetic markers at homologous loci among individuals and species, in a manner
that is both high-throughput and inexpensive (Cronn
et al. 2012). Many genotyping systems rely upon the
synthesis of unique oligonucleotides to bind each locus
being assayed (Kim & Misra 2007; Mamanova et al.
2010; Cronn et al. 2012); this makes genome-wide studies
© 2012 John Wiley & Sons Ltd
both expensive and vulnerable to unexpected polymorphisms within oligonucleotide-binding sites. In many
ecological genetic studies, sample sizes are high, the
nature of polymorphism among and within species is
unknown, and funding is low. This makes genotyping
by sequencing, for example using restriction-site-associated
DNA (RAD) tag sequencing (Baird et al. 2008; Davey
et al. 2011), a promising methodology, as it may be implemented without prior knowledge of genomic diversity,
and without costly synthesis of thousands of different
oligonucleotides. In this study, we describe the development of RAD sequencing for Betula (birch), a widespread,
outcrossing, highly polymorphic genus where hybridization and polyploidization are rampant (Woodworth
1929; Järvinen et al. 2004). The ultimate intention is to
use RAD markers to trace patterns of genome-wide
introgression among hybridizing Betula species (B. nana,
B. pubescens and B. pendula) in Britain through sequencing of homologous loci in many individuals.
To be able to more fully characterize and validate the
RAD markers developed, and to greatly increase genomic data available for Betula, we also present a draft
reference genome sequence for an individual of Betula
nana (dwarf birch). This shrub species is common in
exposed subarctic and alpine tundra and swampy montane habitats, but in the United Kingdom it is nationally
scarce, mainly restricted to small, relictual, montane
populations in the Scottish highlands (Aston 1984; de
Groot et al. 1997). It is under active conservation by
organizations such as Trees for Life, Scottish Natural
Heritage and the Forestry Commission. Betula nana is a
diploid species and the genome sizes of five individuals
from west Iceland were recently measured: two plants
analysed using flow cytometry had 1C
(i.e. unreplicated haploid genome size) values of 453
and 462 Mbp, and three plants analysed using Feulgen
DNA Image Densitometry had 1C values of 441, 432
and 451 Mbp (Anamthawat-Jónsson et al. 2010).
Betula nana is reported to hybridize with the diploid
tree species silver birch (B. pendula) and the tetraploid
tree species downy birch (B. pubescens). Hybridization
may threaten the reproduction of dwarf birch populations in areas where the species is scarce (e.g. Buggs &
Pannell 2006) and may also generate recombinant genotypes of stress-tolerant tree in the subarctic (Vaarama &
Valanne 1973; Wilsey & Saloniemi 1999; Karlsson et al.
2000). In Iceland, first-generation hybrids between B.
nana and B. pubescens produce viable pollen, which can
back-cross with parental species (Anamthawat-Jónsson
& Tomasson 1990; Karlsdóttir et al. 2008). Introgression
has been shown to occur between B. nana and B. pubescens using cytogenetics (Anamthawat-Jónsson & Thórsson 2003), morphology (Elkington 1968; Thórsson et al.
2007) and genetic markers (Thórsson et al. 2001; Palmé
et al. 2004). In Iceland, plants at hybrid zones between
B. nana and B. pubescens are strictly diploid, triploid or
tetraploid at the cytological level, but morphological
and genetic intermediates are found in all three of these
ploidal levels (Anamthawat-Jónsson & Tomasson 1990;
Thórsson et al. 2001; Anamthawat-Jónsson & Thórsson
2003; Thórsson et al. 2007). Putative hybrids between B.
nana and B. pubescens have been reported in several
locations in Scotland, where they are known as B. × intermedia (Kenworthy et al. 1972; Crawford 2008), and
intermediates between strict ploidy levels appear to
occur (Kenworthy et al. 1972). We are especially interested in understanding how patterns of introgression
may have arisen among Betula species due to hybrid
zone movement (Buggs 2007; Currat et al. 2008) as the
ranges of Betula species have shifted due to climate
change (Huntley & Birks 1983; Caseldine 2001; Karlsdóttir et al. 2009).
Population genetic variation in Betula has previously
been studied using microsatellite markers (Truong et al.
2005), RAPDs (Howland et al. 1995; Dabrowska et al.
2006), RFLPs (Howland et al. 1995), chloroplast genes
(Palmé et al. 2004) and in situ hybridization (Anamthawat-Jónsson 2004). These showed high levels of polymorphism and gene flow among species (see also
Eriksson & Jónsson 1986). The taxonomy of the genus is
not straightforward (De Jong 1993), and morphological
intermediacy between taxa is common (Regel 1861;
Woodworth 1929). Of particular interest for the current
study, B. pendula and B. pubescens in the UK are
regarded as two separate species, partly on the basis of
ploidy difference (Brown & Al-Dawoody 1979; Gill &
Davy 1983; Brown & Williams 1984), yet there is a continuum of morphological variation between them
(Brown & Tuley 1971; Pelham et al. 1984; Atkinson &
Codling 1986). The Atkinson discriminant function has
been devised to distinguish between the two species
(Atkinson & Codling 1986), based on leaf-shape quantitative traits. The continuum of morphological intermediacy is thought to be partly due to hybridization
(Brown et al. 1982; Atkinson 1992), but morphological
variation has also been associated with environmental factors (Pelham et al. 1988); for example, leaf size and apex
angle for B. pubescens and B. pendula differ on bogs vs.
heaths (Davy & Gill 1984). In East Anglian populations,
diploid and tetraploid Betula were found to be distinct
neither on the basis of morphological nor molecular
characters (Howland et al. 1995).
There is clearly much basic genomic characterization
needed if we are to understand the genomic basis of
species differences in the genus Betula, and the dynamics of hybridization and introgression among them. The
work presented in this study is designed to lay foundations
for a genome-wide survey of genomic variation among
Betula populations and species in the United Kingdom,
through (i) the generation of a whole-genome reference
sequence for the diploid species B. nana and (ii) the
development of a RAD sequencing protocol to assay
genome-wide homologous loci among populations and
species. Using a subset of the data generated, we also
present a preliminary study of allele sharing among
species, as a proof-of-concept of the utility of these data
for introgression studies.
Materials and methods
Plant materials
Materials were collected from natural populations of
Betula growing in the United Kingdom in the summers of
2010 and 2011. The B. nana individual used for whole-genome sequencing (097-10, Fig. S1 photo, Supporting
information) was grown from a seed collected in 2010
from a B. nana shrub on the Dundreggan Estate, Scotland,
which belongs to Trees for Life. The seed was germinated
in water and grown in a pot of compost in a rooftop garden at the Mile End Campus of Queen Mary University
of London. Fresh cambial material was collected from
this plant in 2011 for DNA extraction. The leaves shown
in Fig. 1 were collected in 2011, and the photo in Fig. S1
(Supporting information) was taken in summer 2012.
Plant tissues used for RAD-sequencing (apart from
those used from 097-10, where DNA samples extracted
for whole-genome sequencing were also used for RAD)
were collected in the field from plants identified by
morphology as B. nana (one individual in addition to
097-10), B. pubescens (seven individuals), B. pendula (one
individual) and B. × intermedia (five individuals). Plant
tissues were stored as dried herbarium specimens or
extracted twig cambial tissue on silica gel. A pair of
leaves from each sample is shown in Fig. 1, and in
Table 1 the identification of the plants based on leaf
morphology according to the standard guide for UK
birch identification (Rich & Jermy 1998) is shown. We
also applied the Atkinson discriminant function as a
mean for two leaves (Atkinson & Codling 1986; Stace
2010); this function is designed to distinguish between
B. pendula and B. pubescens on the basis of the width and
tooth structure of leaves. As all samples apart from B.
nana showed a range of morphologies from B. pendula,
through B. pubescens, to putative B. × intermedia, the
Atkinson discriminant function is given for all of them
including those provisionally identified as B. × intermedia. Details of source locations of each individual are
shown in Table 1. Five of the individuals were from
a Ben Loyal site where Bernard Kenworthy found
morphological, biochemical and cytological evidence for
hybrid occurrence (Kenworthy et al. 1972).
Fig. 1 Pairs of leaves from each sample used in the present
study, showing upper and lower sides (shown at 1:1.77 scale).
DNA extraction
Isolating high-quality genomic DNA from Betula, and
especially Betula nana, is difficult. We developed a protocol that utilized a combination of steps selected and
modified from three published protocols (Doyle &
Doyle 1987; Cullings 1992; Zeng et al. 2002). Due to the
high concentration of secondary compounds in leaves
of B. nana, we mainly extracted DNA from cambial tissue, as has been used for flow cytometry (Anamthawat-Jónsson et al. 2010). This also had the benefit of reducing the concentration of chloroplast DNA in the final
DNA solution.
Table 1 Plant materials used
Plant number | Plant ID | Latitude | Longitude | Approximate location | Approximate distance to nearest known B. nana (km) | Atkinson discriminant function score | Size† | Species morphology according to Plant Crib | Number of 100 bp RAD Illumina reads
097-10 | 97 | 57.230171 | 4.754931 | Dundreggan | 0 | n/a | S | B. nana | 5655935
325 | 325 | 57.687443 | 4.630766 | Ben Wyvis | 0 | 21 | S | B. × intermedia | 4278688
425f | 425 | 56.932211 | 3.179569 | Loch Muick | 1 | 11 | L | B. pubescens | 5101887
574 | 574 | 57.010529 | 3.54908 | Glen Lui, nr Braemar | 5 | 16 | L | B. × intermedia | 6904599
582 | 582 | 58.419353 | 4.416346 | Ben Loyal | 0 | n/a | S | B. nana | 14344814
583 | 583 | 58.419334 | 4.416296 | Ben Loyal | 0 | 20 | S | B. × intermedia | 3035186
605p | 605 | 58.419328 | 4.419328 | Ben Loyal | 0 | 16 | L | B. pubescens | 14467520
1045 | 1045 | 58.415103 | 4.417008 | Ben Loyal | 0 | 22 | S | B. × intermedia | 10803087
1123 | 1123 | 58.475289 | 4.435964 | nr Ben Loyal | 6 | 17 | M | B. × intermedia | 5376089
1124 | 1124 | 58.475177 | 4.435932 | nr Ben Loyal | 6 | 25.5 | M | B. pubescens | 7829717
1153 | 1153 | 57.118613 | 3.901351 | South of Aviemore | 7 | 16 | L | B. pubescens | 3613943
1158 | 1158 | 55.220690 | 3.416942 | Johnstonebridge, Dumfries | 60 | 18 | L | B. pubescens | 5043068
1183a | 1183 | 58.891900 | 3.383755 | Berriedale Wood, Orkney | 87 | 5 | L | B. pubescens | 9847086
1184c | 1184 | 52.454126 | 0.996216 | Eccles Car, Norfolk | 350 | +13 | L | B. pendula | 6094568
1578 | 1578 | 58.423109 | 4.421281 | Ben Loyal | 0 | 23.5 | L | B. pubescens | 5494190
†Size: height of plant, S = <50 cm, M = 50–200 cm, L = >200 cm.
Leaves or cambial tissue frozen in liquid nitrogen
were ground into powder using 3 mm steel beads in a
Qiagen TissueLyser (Qiagen, Santa Clarita, CA). To
remove polysaccharides from the tissues, up to 1.6 mL
of ice-cold TNE buffer (200 mM Tris-HCl, 250 mM NaCl,
50 mM EDTA) was added to each tube and kept on ice
for 10 min. These tubes were centrifuged at full speed
for 5 min in a refrigerated microfuge, the supernatant
discarded and the TNE step repeated. About 800 µL of
2× CTAB buffer (AppliChem, Darmstadt, Germany),
100 µL 10% Sarkosyl (N-Lauroylsarcosine sodium salt)
solution (Sigma-Aldrich, UK) and 20 µL Proteinase K
(Qiagen) were added to each tube, and these were vortexed vigorously. Tubes were kept in a 65 °C water
bath for about 3 h and were vortexed at intervals of
30 min. Approximately 700 µL 24:1 chloroform/isoamyl
alcohol (AppliChem) was added to each tube and
mixed by inversion. These tubes were centrifuged at
maximum speed for 10 min, and following centrifugation the top (aqueous) layer was carefully transferred to
new tubes. Approximately 10 µL RNase (Qiagen) was
added to each tube and kept at 37 °C for 30 min. Then
the chloroform/isoamyl alcohol step was repeated, and
the aqueous layer again removed. A half volume of 5 M
NaCl (Sigma-Aldrich) and 0.7 volume of ice-cold isopropanol (VWR International S.A.S, France) were added
to this aqueous isolate and mixed by inversion. These
tubes were placed at 4 °C for about 45 min for precipitation. After precipitation, the tubes were centrifuged
for 10 min at full speed in a refrigerated microfuge, and
the supernatant was poured off carefully. Roughly
700 µL cold 70% ethanol (VWR International Ltd., UK)
was added to each tube, mixed by inversion and centrifuged for 5 min at maximum speed. The pellet was
dried completely, then dissolved in 200 µL high-salt TE
buffer (10 mM Tris with pH 8.0, 1 mM EDTA, 1 M NaCl).
The DNA was reprecipitated by adding two volumes
(400 µL) of ice-cold 95% ethanol and mixing by inversion. These tubes were centrifuged at full speed for
10 min, and the ethanol was poured off. The pellet was
dried completely, and then washed twice using 75%
ethanol and 95% ethanol, respectively. The redried pellet was suspended in TE buffer (10 mM Tris with pH
8.0, 1 mM EDTA), and its quality analysed with a Nanovue machine (GE Healthcare, UK), a Qubit analyser
(Life Technologies Corporation, Carlsbad, CA) and a
0.8% agarose gel.
Genome sequencing
Genome sequencing was conducted at the Beijing
Genomics Institute, China. Five DNA libraries were
constructed: three paired-end libraries with insert sizes
of 200 bp, 500 bp and 800 bp, and two mate-paired
libraries with 2000 bp and 5000 bp insert sizes. All
libraries were sequenced through the Illumina HiSeq
2000 pipeline, the paired-end libraries with read length
of 95 bp and the mate-paired libraries with read
lengths of 49 bp. Reads were filtered to exclude reads
with: more than 2% N calls, polyA structures, adapter
contamination, quality scores of less than 7 for 40% of
the reads in paired-end libraries and 60% of the reads
in mate-paired libraries, overlapping paired reads in
paired-end libraries, and identical sequences at each
end of a paired end.
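As a rough illustration of the per-read filters listed above, the following Python sketch applies only the ambiguous-base and low-quality criteria. It is not the filtering software used in this study: the function name and arguments are hypothetical, and it assumes that the 40% (paired-end) and 60% (mate-paired) thresholds refer to the fraction of bases within a read that have a quality score below 7. The polyA, adapter, overlapping-pair and identical-end checks are omitted.

```python
from typing import Sequence


def passes_basic_filters(
    seq: str,
    quals: Sequence[int],
    max_n_fraction: float = 0.02,            # "more than 2% N calls" is rejected
    min_quality: int = 7,                    # scores below this count as low quality
    max_low_quality_fraction: float = 0.40,  # assumed per-read fraction; 0.60 for mate-paired
) -> bool:
    """Return True if a read survives the N-content and quality filters (sketch)."""
    if not seq or len(seq) != len(quals):
        raise ValueError("seq and quals must be non-empty and of equal length")

    # Reject reads in which more than max_n_fraction of the calls are N.
    n_fraction = seq.upper().count("N") / len(seq)
    if n_fraction > max_n_fraction:
        return False

    # Reject reads in which the share of low-quality bases reaches the threshold.
    low_quality = sum(1 for q in quals if q < min_quality)
    if low_quality / len(quals) >= max_low_quality_fraction:
        return False

    return True


# Hypothetical 100 bp read with uniformly high quality scores.
example_read = "ACGT" * 25
print(passes_basic_filters(example_read, [30] * 100))
```

For a mate-paired library the same function could be called with max_low_quality_fraction=0.60, again under the interpretation stated above.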
Genome assembly
The genomic sequences of Betula nana were assembled
using SOAPdenovo-63mer version 2.04.3, obtained from
the Beijing Genomics Institute. The assembly started with
the smallest insert size libraries, moving up through the
insert size libraries to join contigs together. The resulting contigs were further processed by the GapCloser
program from the Beijing Genomics Institute. Various
options, and kmer lengths between 28 and 38, were tried
in the assembly, and the quality of the resulting assemblies was assessed using the assemblathon statistics 2
PERL script (Earl et al. 2011), and the CEGMA pipeline
(Parra et al. 2007). The best assembly was chosen on the
basis of high N50 contig length, low number of Ns and
high complete coverage of conserved eukaryote genes.
RAD library preparation and sequencing
Sixteen samples were prepared for RAD sequencing following the protocol in Baird et al. (2008). For each sample, 0.2 µg of genomic DNA was digested with PstI
(New England Biolabs, UK). This enzyme has a 6 bp
recognition site and leaves a 4 bp overhang. Assuming
a 39% GC content and 450 Mb genome size, we
expected there would be 60 527 PstI cut sites, using the
‘radcounter_v3.xls’ spreadsheet available from the
RAD-sequencing wiki (www.wiki.ed.ac.uk/display/
RADSequencing/). Digestion was followed by ligation
of barcoded P1 adapters (8 nM). Ligated DNA was
sheared using a Covaris S2 (KBiosciences, UK) instrument in 1.5 mL tubes (duty cycle 20%; intensity 5;
cycles/burst 200; duration 30 s followed by a 20 s
pause step repeated 13 times), and the size range 300 to
600 bp was isolated using gel excision and purification.
After this stage, all subsequent reactions were cleaned
using Ampure beads. After end-repair and A-tailing,
the size-selected DNA was ligated to P2 adapters
(400 nM) and PCR amplified. PCR amplification was
carried out in 8 independent 25 µL reactions consisting
of 20 ng ligated DNA, 0.5 vol 2× Phusion Master Mix
(New England Biolabs), 0.05 vol DMSO, 0.04 vol P1
and P2 amplification primers (10 nM stock), using the
following cycling parameters: 98 °C for 30 s followed
by 14 cycles of 98 °C for 10 s and 72 °C for 60 s. The 8
reactions were pooled into one library for sequencing.
The final library consisting of an equimolar pooling of
the 8 PCR reactions was quantified by qPCR on an Illumina ECO instrument using the Kapa Library Quantification Kit. The quantified library was checked on an
Illumina MiSeq prior to sequencing in a single lane of
an Illumina HiSeq 2000 instrument using 100 base reads
(v3 chemistry). The library of one individual failed.
RAD data processing
The RAD data were processed by pipelines outlined in
Fig. S2 (Supporting information). The read data were
checked for adapter contamination, and demultiplexed,
using the process_radtags module of Stacks (Catchen
et al. 2011). The data were then processed into catalogues for each individual sample through two different
pipelines as follows: (i) Alignment of RAD reads to the
B. nana genome assembly using Bowtie (Langmead et al.
2009), followed by extraction of read stacks using the
pstacks module of Stacks and cataloguing with
the cstacks module. (ii) De novo assembly of reads using
the ustacks module of Stacks followed by cataloguing
with cstacks. In both pipelines, a minimum depth of
coverage of 10 reads was required for a stack of
reads to be included as a locus. In the first pipeline, Bowtie was set to allow a maximum of two mismatches in the 28 bp ‘seed’ and to require the
sum of the Phred quality values at all mismatched positions to be less than 70. In the second pipeline, the maximum distance (in nucleotides) allowed between stacks
was set to four. Catalogues produced by both pipelines
were used to calculate the number of loci and alleles
covered in each individual.
The sstacks module of Stacks (Catchen et al. 2011)
was used to search the 097-10 RAD catalogue from each
pipeline against the stacks generated for each individual sample in their respective pipelines. In addition, the
second pipeline was also used to make a catalogue from
all the data from all individuals. This universal catalogue of all RAD loci was then used for searches with
sstacks for loci from each individual. Custom PERL
scripts and Excel (Microsoft, Washington State, USA)
were then used to count the number of RAD loci shared
among different individuals.
Preliminary study of introgression
To establish the usefulness of the RAD data generated
for the study of introgression among Betula species, we
carried out a preliminary survey of the distribution of
alleles of a subset of RAD loci. Starting with all matches
found using sstacks to the universal catalogue (see
above), we extracted those loci that were sequenced in
all 15 individuals. We retrieved all alleles for these loci
from all individuals, and cut this data set down to only
those loci that were polymorphic among the 15 individuals
but homozygous for the same allele in both B. nana
individuals. For this subset of loci, we scored presence
and absence of the allele found in B. nana in all other
individuals. For each non-B. nana individual, we
counted the number of loci where a ‘B. nana’ allele was
present at least once, and the number of loci where a
‘B. nana’ allele was not present. Loci with alleles specific
to B. nana were annotated by BLASTN searches against GenBank: searches were first carried out using the 100 bp
RAD sequences, and then using the 1000 bp of flanking
region each side from the B. nana reference genome.
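The locus filter and allele scoring described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the custom PERL scripts used in this study: the data structure (locus, then individual, then set of alleles), the function name and the toy genotypes are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# locus -> individual -> set of allele sequences observed at that locus
Genotypes = Dict[str, Dict[str, Set[str]]]


def count_nana_allele_sharing(
    genotypes: Genotypes, nana_ids: List[str], all_ids: List[str]
) -> Dict[str, Tuple[int, int]]:
    """For each non-B. nana individual, return (loci with the B. nana allele
    present, loci with it absent), using only loci that pass the filters."""
    others = [i for i in all_ids if i not in nana_ids]
    counts = {i: [0, 0] for i in others}

    for calls in genotypes.values():
        # Keep only loci sequenced in every individual.
        if any(not calls.get(i) for i in all_ids):
            continue
        # Keep only loci that are polymorphic across the individuals.
        if len(set().union(*(calls[i] for i in all_ids))) < 2:
            continue
        # Keep only loci where all B. nana plants are homozygous for one shared allele.
        nana_sets = [calls[i] for i in nana_ids]
        if any(len(s) != 1 for s in nana_sets) or len(set().union(*nana_sets)) != 1:
            continue
        nana_allele = next(iter(nana_sets[0]))
        # Score presence or absence of the B. nana allele in every other individual.
        for i in others:
            counts[i][0 if nana_allele in calls[i] else 1] += 1

    return {i: (present, absent) for i, (present, absent) in counts.items()}


# Invented toy data for four individuals (two B. nana, two others).
toy = {
    "locus1": {"097-10": {"A"}, "582": {"A"}, "1184c": {"B"}, "1578": {"A", "B"}},
    "locus2": {"097-10": {"A"}, "582": {"A", "C"}, "1184c": {"C"}, "1578": {"A"}},
    "locus3": {"097-10": {"G"}, "582": {"G"}, "1184c": {"G"}, "1578": {"G"}},
    "locus4": {"097-10": {"T"}, "582": {"T"}, "1184c": {"T", "C"}},
}
print(count_nana_allele_sharing(toy, ["097-10", "582"], ["097-10", "582", "1184c", "1578"]))
```

In the toy data only locus1 passes all three filters: locus2 is heterozygous in one B. nana plant, locus3 is monomorphic and locus4 is missing from one individual.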
Flow cytometry
The genome sizes of plants 097-10 and 574 were estimated by flow cytometry with propidium iodide (PI)
stained nuclei using either fresh young female catkins or
twig internode cambial tissue. Target samples were
co-chopped with an internal standard (Oryza sativa ‘IR36’
for 097-10, and Solanum lycopersicum ‘Stupické polní rané’
for 574) using new razor blades. The two-step protocol
described by Doležel et al. (2007) using Otto’s buffers
(Otto 1990) was followed with minor modifications. After
co-chopping the samples in 1 mL of ice-cold Otto I buffer,
50 µL of ribonuclease A (1 mg/mL RNase A; Sigma-Aldrich) was added to the samples, which were then incubated for 30 min at 37 °C prior to pelleting the nuclei.
The relative fluorescence of up to 10 000 particles per
replicate was recorded on a Partec Cyflow SL3 (Partec
GmbH, Germany) flow cytometer fitted with a 100-mW
green solid state laser (Cobolt Samba; Cobolt, Sweden).
The resulting histograms were analysed with the FlowMax software (v. 2.4, Partec GmbH).
Fig. 2 Statistics for the B. nana reference genome assembly: N
(X) scaffold length is calculated by sorting lengths of scaffolds
from the longest to the shortest and determining at what point
X% of the total assembly size is reached. The length of the
scaffold at that point is the N(X) length. Thus, over 60% of the
genome is in scaffolds of over 10 kb.
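The N(X) definition in the caption of Fig. 2 can be written as a short Python function. This is a minimal sketch of that definition only, not the Assemblathon script used for the reported statistics; the function name and the example scaffold lengths are hypothetical.

```python
from typing import Iterable


def nx_length(lengths: Iterable[int], x: int = 50) -> int:
    """Return the N(X) length: sort lengths from longest to shortest and report
    the length at which X% of the total assembly size is first reached."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the range 1..100")
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")

    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    return ordered[-1]  # not reached for valid input; kept for completeness


# Invented scaffold lengths (bp) for illustration.
scaffolds = [80, 70, 50, 40, 30, 20]
print(nx_length(scaffolds, 50), nx_length(scaffolds, 90))
```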
Results
Genome sequence
A total of 42.05 Gb of raw data was produced, which
yielded 29.84 Gb of clean data. This represents 66× coverage of the genome, assuming a genome size of 450 Mb
(Anamthawat-Jónsson et al. 2010). The data were partitioned among the different libraries as follows: 9.20 Gb
clean data from the 200 bp insert size library, 7.64 Gb
clean data from the 500 bp insert size library, 6.21 Gb
clean data from the 800 bp insert size library, 4.83 Gb
clean data from the 2000 bp insert size library and
1.96 Gb clean data from the 5000 bp insert size library.
Genome assembly
The best assembly was constructed using a Kmer of 35
in SOAPdenovo, with option -M 3 at the contig stage.
The option -M 3 helped to deal with pathway bubbles
arising at the contig stage due to heterozygosity in the
diploid individual being sequenced. Average contig
coverage was 22, when contigs with coverage less than
2.2 and greater than 44.0 were masked. The total size of
the scaffolded assembly was 564 Mbp, with 7.78% of
the bases in the scaffolds being ‘N’s. The N50 scaffold
size of this assembly was 18 689 bp (Fig. 2). Complete
‘Assemblathon2’ statistics for the assembly are shown
in Table S1 (Supporting information). The CEGMA pipeline (Parra et al. 2007) showed that all of the 248 core
eukaryotic genes searched for were present in the genome assembly, 96.77% of them having ‘complete’ hits
(see Table 2). The genome assembly and raw read data
are available at the EMBL Nucleotide Sequence Database (study accession ERP001867).
RAD-sequencing
From 15 individual plants, a total of 114.0 million
100 bp reads were generated (11.4 Gb of sequence). Of
these, 5.9 million were rejected due to ambiguity in barcodes or RAD-tagging. Thus, 108.1 million reads were
retained. The number of retained reads per individual
is shown in Table 1. Between 0.3 Gb and 1.4 Gb of
sequence data were generated per individual for the 15
individuals. The RAD experiment was designed before
the reference genome was assembled, and we had predicted there would be 60 527 PstI cut sites (see Materials and methods). The cut site of PstI (CTGCAG and its
complement) was in fact found 70 954 times in the reference genome assembly. Thus, we would expect to
find up to 141 908 RAD-tags, and the 108.1 million
reads of DNA sequence we obtained for RAD sequencing should provide a mean of 762 reads per tag in total.
Read data from the RAD sequencing are available at
the EMBL Nucleotide Sequence Database (study accession ERP001869).
Table 2 Statistics of the completeness of the B. nana reference genome based on BLAST searches for 248 Conserved Eukaryotic genes
(CEGs) using the CEGMA pipeline (see main text for details)
Statistic | Complete hits | Partial hits
Number of 248 ultra-conserved CEGs present in genome | 240 | 248
Percentage of 248 ultra-conserved CEGs present | 96.77 | 100.00
Total number of CEGs present including putative orthologs | 671 | 903
Average number of orthologs per CEG | 2.80 | 3.64
Percentage of detected CEGs that have more than 1 ortholog | 88.33 | 95.16
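The cut-site and read-depth figures quoted in the RAD-sequencing results can be reproduced with the short Python calculation below. It assumes a simple base-composition model in which G and C each occur with probability GC/2 and A and T each with probability (1 − GC)/2; this model happens to reproduce the predicted 60 527 sites, but whether the radcounter spreadsheet uses exactly this calculation is an assumption. The function and variable names are hypothetical.

```python
def expected_cut_sites(site: str, gc_content: float, genome_size_bp: float) -> float:
    """Expected occurrences of a recognition site in a random sequence with the
    given GC content (independent bases; assumed model, see lead-in)."""
    p_gc = gc_content / 2        # probability of G, and of C
    p_at = (1 - gc_content) / 2  # probability of A, and of T
    p_site = 1.0
    for base in site.upper():
        p_site *= p_gc if base in "GC" else p_at
    return p_site * genome_size_bp


# Prediction made before the genome was assembled (39% GC, 450 Mb).
predicted_sites = expected_cut_sites("CTGCAG", 0.39, 450e6)

# Values reported in the text for the assembled genome and the RAD run.
observed_sites = 70_954                  # PstI sites found in the assembly
expected_tags = 2 * observed_sites       # one RAD tag on each side of a cut site
retained_reads = 108.1e6
reads_per_tag = retained_reads / expected_tags

print(round(predicted_sites), expected_tags, round(reads_per_tag))
```

The same arithmetic gives the proportion of predicted tags recovered for a single library, for example 86 399 / 141 908, or about 61%, for the reference-aligned loci of individual 097-10.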
RAD locus assembly to reference genome
A summary of the RAD analyses is shown in Fig. S2
(Supporting information), and data from intermediate
and final analysis steps are available on DRYAD (Wang
et al. 2012).
RAD-tag locus assembly was first carried out using
Bowtie alignment with the 097-10 B. nana genome assembly as reference, followed by implementation of pstacks
and cstacks (Catchen et al. 2011). This yielded a catalogue
of 86 399 loci from the B. nana 097-10 RAD library; this is
61% of the number of RAD loci predicted from genome
cut sites. The other 14 individuals yielded catalogues
with a mean of 55 816 loci (minimum 44 932; maximum
60 085, see Fig. 3A). Within each individual catalogue, a
mean of 16 508 loci had more than one allele (minimum
8335, maximum 21 005, for mean number of alleles per
locus see Fig. 3B).
We examined the percentage of loci in each individual’s catalogue that had three or four alleles (Fig. 3C,D)
as these should be zero in a diploid. In the two B. nana
individuals, 1.4% and 1.3% of loci had three or four
alleles; this is slightly higher than expected, probably
due to paralogy. In contrast, in all the other individuals
the number of loci with three or four alleles fell in a
narrow range between 4.7% and 6.8%, except for individual 574 (which had initially been identified as B. × intermedia) with 1.3%.
We compared all 15 individual pstacks outputs
against the B. nana 097-10 catalogue using sstacks.
Although all loci in all catalogues had been initially
identified using alignment to the 097-10 genome
sequence, not all loci matched loci in the 097-10 RAD
catalogue: while 88.3% of the loci in the 582 B. nana catalogue matched loci in the 097-10 catalogue, 78.2–82.3%
of the loci in the other catalogues matched loci in the
097-10 catalogue, with the exception of 574 where only
71.6% of loci matched the 097-10 catalogue. Those RAD
loci that align to the 097-10 genome but are not found
in the 097-10 RAD library are likely to be mainly due to
differences among individuals and species at PstI cut
sites. Some may be due to low coverage in the 097-10
RAD library, but as this library had the third largest
number of reads of any of the libraries, this is unlikely
to be a major factor.
Fig. 3 Results of RAD Sequencing, comparing results from de novo (dark grey
bars) and reference assembled (light grey
bars) pipelines. (A) Number of RAD loci
with minimum 10X coverage, (B) Mean
number of alleles per locus, (C) Percentage
of loci with three alleles, (D) Percentage
of loci with four alleles.
Fig. 4 RAD stacks from each sample that match more than one
locus in the 097-10 catalogue, or have different SNPs, shown as
a percentage of the total number of RAD stacks from each
sample that match the 097-10 catalogue. Stacks with more than
one catalogue locus match are shown in dark grey and stacks
with SNPs not in the 097-10 catalogue are shown in light grey.
Results from the de novo RAD loci are shown on the left of
each pair of columns and results using referenced assembly on
the right (with dashed lines).
Of those RAD stacks in all individuals that matched
loci in the 097-10 RAD catalogue, between 12.6% and
16.0% per individual matched more than one locus in
the 097-10 RAD catalogue (Fig. 4) – this was true even
of the matching of the 097-10 RAD stacks to the 097-10
RAD catalogue, where 13.1% of matching loci matched
more than one catalogue locus. This is likely to be
because identifying RAD loci using alignment to a genome sequence allows paralogs to be separated, whereas
these cannot be distinguished when matching RAD loci
against one another.
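The diploid sanity check used above (loci with three or four alleles should be absent in a true diploid) can be sketched in Python as follows. The input structure, function name and toy data are hypothetical and are not taken from the Stacks output of this study.

```python
from collections import Counter
from typing import Dict, List


def allele_count_summary(alleles_per_locus: Dict[str, List[str]]) -> Dict[str, float]:
    """Summarise how many loci carry more than one, exactly three and exactly
    four distinct alleles within a single individual's catalogue."""
    per_locus = Counter(len(set(alleles)) for alleles in alleles_per_locus.values())
    n_loci = sum(per_locus.values())
    if n_loci == 0:
        raise ValueError("no loci supplied")
    return {
        "n_loci": n_loci,
        "n_multi_allelic": sum(n for k, n in per_locus.items() if k > 1),
        "pct_three_alleles": 100 * per_locus[3] / n_loci,
        "pct_four_alleles": 100 * per_locus[4] / n_loci,
    }


# Invented catalogue of eight loci for one individual.
toy_catalogue = {
    "l1": ["A"], "l2": ["A"], "l3": ["C"], "l4": ["G"],
    "l5": ["A", "C"], "l6": ["G", "T"],
    "l7": ["A", "C", "G"],
    "l8": ["A", "C", "G", "T"],
}
print(allele_count_summary(toy_catalogue))
```

Elevated percentages of tri- and tetra-allelic loci, as seen here in most non-B. nana individuals, would be consistent with higher ploidy or with paralogous loci collapsed into one stack; this summary alone cannot distinguish the two.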
RAD locus assembly de novo
Sequence reads for RAD loci were also de novo assembled for each individual using ustacks followed by cataloguing with cstacks. By this method, the number of
RAD loci found in the RAD data for 097-10 was 89 972;
this is 3573 more than (1.04 times) the number (86 399)
assembled by the reference genome method above. To
some extent, the fact that the Bowtie method assembled
fewer RAD loci than the de novo method may be due to
the failure of some RAD reads to align to the genome
assembly in Bowtie.
As a brief excursus to check this difference in loci
assembled by the two methods for individual 097-10,
we took all RAD reads that had earlier failed to align to
the genome assembly with Bowtie, and ran them
through the de novo pipeline. This yielded 23 106 extra
loci, a much higher number than the 3573 difference
between the two RAD assembly methods. The fact that
86 399 loci are found that align with Bowtie to the reference sequence, despite the 23 106 putative loci that
are apparently being excluded by the Bowtie method
due to non-alignment to the reference genome, may be
due to stacks of reads being divided among paralogs in
the reference genome by Bowtie (see also below).
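The arithmetic behind this excursus can be laid out explicitly. The short sketch below uses only the locus counts reported above for individual 097-10; the variable names are ours, and the "combined" figure is simply the sum that would be expected if every reference-aligned locus and every locus built from unaligned reads were a distinct de novo locus, which is an assumption made for illustration.

```python
# Locus counts for individual 097-10, as reported in the text.
de_novo_loci = 89_972        # ustacks/cstacks (de novo) pipeline
reference_loci = 86_399      # Bowtie genome-referenced pipeline
unaligned_de_novo = 23_106   # loci built from reads that failed to align

difference = de_novo_loci - reference_loci
ratio = de_novo_loci / reference_loci

# Illustrative assumption: if the two sets of loci did not overlap and each
# corresponded to one de novo locus, the de novo count would be their sum.
combined = reference_loci + unaligned_de_novo
surplus = combined - de_novo_loci

print(difference)           # 3573
print(round(ratio, 2))      # 1.04
print(combined, surplus)    # 109505 19533
```

Under that assumption, the reference pipeline yields roughly 19 500 more loci than the de novo count can account for, which is the size of the discrepancy that splitting of stacks among paralogs would need to explain.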
For the individuals other than 097-10, the number of
RAD loci assembled by the de novo method was on
average 1.6 times higher than the number assembled by
the Bowtie genome-referenced method (Fig. 3A). This
larger difference between the two methods than seen
with individual 097-10 is likely to be due to a greater
number of reads from these other individuals that do
not align with Bowtie to the 097-10 reference genome.
Within each individual catalogue, a mean of 29 955 loci
had more than one allele (minimum 16 159, maximum
39 998). The percentage of loci with three or four alleles
was higher in the de novo assembled data (Fig. 3C,D)
perhaps due to paralogs being assembled into the same
stacks. However, the overall pattern among individuals
was the same as before, with low numbers of tri- and
tetra-allelic loci in 097-10, 574 and 582.
We then compared all the individual de novo RAD
catalogues against the B. nana 097-10 de novo RAD catalogue using sstacks. As we would expect, lower percentages of loci in each de novo catalogue matched the
097-10 catalogue, as these catalogues had not been
made by alignment to the 097-10 genome. However, the
absolute number of matching loci was very similar by
the two methods: a mean of 43 827 by the de novo
method compared with 42 940 by the reference method.
Among these matching loci, an important difference
emerged between the two methods: the mean percentage of matching loci that matched more than one 097-10
catalogue locus was 0.1% in the de novo catalogues but
14.2% in the reference-built catalogues. In contrast, the
mean percentage of matching loci that had SNPs not in
the 097-10 catalogue locus was 39.4% in the de novo catalogues but 24.1% in the reference-built catalogues. Figure 4 shows this difference for each individual in the
study separately. This suggests that some SNPs in the
de novo assembled RAD loci are due to paralogs being
stacked together, and these may be detected as such by
the reference RAD assembly method.
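A toy example may help to show why paralogs can inflate apparent SNP counts in de novo stacks. The sketch below is not the algorithm used by Stacks or Bowtie; it is a deliberately simplified greedy clustering and best-match assignment, written for illustration only, with hypothetical 12 bp sequences standing in for 100 bp RAD loci.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_de_novo(reads, max_mismatch):
    """Greedy clustering: a read joins the first stack whose seed read is
    within max_mismatch differences, otherwise it seeds a new stack."""
    stacks = []
    for read in reads:
        for stack in stacks:
            if mismatches(read, stack[0]) <= max_mismatch:
                stack.append(read)
                break
        else:
            stacks.append([read])
    return stacks


def assign_to_reference(reads, reference):
    """Place each read at the reference locus with the fewest mismatches."""
    placed = {name: [] for name in reference}
    for read in reads:
        best = min(reference, key=lambda name: mismatches(read, reference[name]))
        placed[best].append(read)
    return placed


def variable_sites(stack):
    """Count alignment columns that contain more than one base."""
    return sum(len(set(column)) > 1 for column in zip(*stack))


# Hypothetical paralogs that differ at two sites (positions 5 and 12).
reference = {"paralog_A": "ACGTACGTACGT", "paralog_B": "ACGTTCGTACGA"}
reads = ["ACGTACGTACGT", "ACGTTCGTACGA", "ACGTACGTACGT", "ACGTTCGTACGA"]

de_novo = cluster_de_novo(reads, max_mismatch=2)
by_reference = assign_to_reference(reads, reference)

print(len(de_novo), variable_sites(de_novo[0]))                     # 1 2
print({name: variable_sites(s) for name, s in by_reference.items()})
```

In this toy case the de novo clustering merges both paralogs into a single stack with two apparent SNPs, whereas assignment to the reference keeps them apart with no variable sites in either.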
All read data were also concatenated into a single file
and run through the de novo RAD pipeline to produce a
universal catalogue of all RAD loci in the 15 individuals.
Fig. 5 Sharing of sequenced RAD loci in the universal catalogue among the 15 individual plant samples. The horizontal
axis shows the number of samples in which the 100 bp loci
were found with coverage of at least 10 reads and nucleotide
difference of up to 4%, and the vertical axis shows the number
of loci.
This universal catalogue contained 281 748 loci (which
is twice as many as we expected there to be in a B. nana
genome). Stacks of de novo RAD loci from each individual were then searched against the universal catalogue,
and the degree to which homologous loci were being hit
in each individual was assessed. Figure 5 shows the
number of loci that were found in different numbers of
individuals: 3000 loci were sequenced in every individual, and 35 000 in 10 or more individuals. Table S2
(Supporting information) shows pairwise comparisons
between individuals: a mean of 49.3% (standard deviation = 6.7) of loci found in any given individual
matched with loci in any other given individual. When
the B. nana individuals were compared pairwise with
all the non-B. nana individuals, the mean percentage of
loci shared was only slightly lower at 45.3% (standard
deviation = 5.1).
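The two summaries described above (the number of loci found in a given number of samples, and the pairwise percentage of loci shared) can be sketched as follows. This is a minimal illustration, assuming that the matches to the universal catalogue have already been reduced to a set of catalogue locus IDs per sample; the function names and the three-sample data set are hypothetical, and the pairwise denominator follows the description of Table S2 (loci in the first sample of each pair).

```python
from collections import Counter
from itertools import permutations


def sharing_histogram(loci_by_sample):
    """Map k -> number of catalogue loci found in exactly k samples."""
    samples_per_locus = Counter()
    for loci in loci_by_sample.values():
        samples_per_locus.update(loci)
    return Counter(samples_per_locus.values())


def loci_in_at_least(histogram, k):
    """Number of loci found in k or more samples."""
    return sum(n for samples, n in histogram.items() if samples >= k)


def pairwise_percent_shared(loci_by_sample):
    """Percentage of the loci in sample a that are also found in sample b."""
    return {
        (a, b): 100.0 * len(loci_by_sample[a] & loci_by_sample[b]) / len(loci_by_sample[a])
        for a, b in permutations(loci_by_sample, 2)
    }


# Hypothetical data: catalogue locus IDs recovered in each of three samples.
loci_by_sample = {
    "s1": {1, 2, 3, 4},
    "s2": {1, 2, 5},
    "s3": {1, 4, 5, 6},
}

hist = sharing_histogram(loci_by_sample)
pairs = pairwise_percent_shared(loci_by_sample)

print(sorted(hist.items()))        # [(1, 2), (2, 3), (3, 1)]
print(loci_in_at_least(hist, 2))   # 4
print(pairs[("s1", "s2")])         # 50.0
```

Note that the pairwise percentage is asymmetric: the value for (s1, s2) uses the locus count of s1 as the denominator, and the value for (s2, s1) uses that of s2.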
Preliminary study of introgression
A subset of the RAD data was used for a proof-of-concept study of introgression. In all matches found using
sstacks to the universal catalogue, 3156 loci were
sequenced in all 15 individuals. Of these, 1089 loci were
polymorphic, and of these, 719 loci were homozygous
for the same allele in both B. nana individuals; we designated these alleles as putative ‘B. nana’ alleles. For
four of these loci, the ‘B. nana’ alleles were not found in
any other individuals, but 715 loci showed the ‘B. nana’
allele to be present in at least one non-B. nana individual. For this subset of 715 loci, presence and absence of
the ‘B. nana’ alleles in the 13 other individuals is shown
in Fig. 6 as a percentage of the 715 loci examined.
Fig. 6 Preliminary assessment of allele sharing, showing percentage of 715 loci in each non-B. nana individual that show
presence of one or more ‘B. nana’ alleles (see main text for
details). Abbreviations on the horizontal axis are as follows:
L = large (>200 cm height), M = Medium (50–200 cm height),
S = small (<50 cm height), int = B. × intermedia, pub = B. pubescens.
Individual 574 had ‘B. nana’ alleles at only 44% of loci
examined, whereas all other individuals had ‘B. nana’
alleles present at between 69% and 79% of the loci.
Apart from 574, the individual with the lowest percentage of ‘B. nana’ alleles was 1184 from Norfolk in the
southeast of Britain. The individual with the highest
percentage of ‘B. nana’ alleles was 1123 from the northern coast of Britain – the morphology of this plant had
previously been scored as B. × intermedia and was
found growing close to 1124 in a large, shrubby population on a west-facing slope above the Kyle of Tongue
beach. Excluding individual 574, there is a general tendency for plants with more ‘B. nana’ alleles to be smaller and have a B. × intermedia morphology (Fig. 6). An
exception to this is individual 1045, which was a very
small plant growing within a heavily grazed population
of B. nana on Ben Loyal and seems to have fewer ‘B.
nana’ alleles than most other plants.
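The filtering steps of this presence/absence analysis can be sketched in a few lines. The example below is an illustration under stated assumptions, not the scripts used in the study: genotypes are represented as a hypothetical dictionary of locus → sample → pair of alleles (None where a locus was not sequenced), and the sample names and data are invented.

```python
def putative_nana_alleles(genotypes, nana_samples):
    """Return {locus: allele} for loci that are sequenced in every sample,
    polymorphic overall, and homozygous for one shared allele in all
    B. nana samples."""
    alleles = {}
    for locus, calls in genotypes.items():
        if any(call is None for call in calls.values()):
            continue  # not sequenced in every individual
        if len({a for call in calls.values() for a in call}) < 2:
            continue  # monomorphic across all individuals
        in_nana = {a for s in nana_samples for a in calls[s]}
        if len(in_nana) == 1:
            alleles[locus] = in_nana.pop()
    return alleles


def allele_sharing(genotypes, nana_alleles, other_samples):
    """Percentage of shared loci at which each other sample carries the
    'B. nana' allele; loci where no other sample carries it are returned
    separately."""
    shared = {locus: allele for locus, allele in nana_alleles.items()
              if any(allele in genotypes[locus][s] for s in other_samples)}
    unique = sorted(set(nana_alleles) - set(shared))
    percent = {}
    for s in other_samples:
        hits = sum(allele in genotypes[locus][s] for locus, allele in shared.items())
        percent[s] = 100.0 * hits / len(shared) if shared else 0.0
    return percent, unique


# Hypothetical genotypes for two B. nana samples and two other samples.
genotypes = {
    1: {"nana1": ("A", "A"), "nana2": ("A", "A"), "x": ("A", "G"), "y": ("G", "G")},
    2: {"nana1": ("C", "C"), "nana2": ("C", "C"), "x": ("T", "T"), "y": ("T", "T")},
    3: {"nana1": ("A", "A"), "nana2": ("A", "G"), "x": ("G", "G"), "y": ("A", "G")},
    4: {"nana1": ("T", "T"), "nana2": ("T", "T"), "x": ("T", "T"), "y": ("T", "T")},
    5: {"nana1": ("G", "G"), "nana2": ("G", "G"), "x": None, "y": ("G", "C")},
    6: {"nana1": ("G", "G"), "nana2": ("G", "G"), "x": ("G", "G"), "y": ("G", "C")},
}

nana_alleles = putative_nana_alleles(genotypes, ["nana1", "nana2"])
percent, unique = allele_sharing(genotypes, nana_alleles, ["x", "y"])

print(nana_alleles)   # {1: 'A', 2: 'C', 6: 'G'}
print(percent)        # {'x': 100.0, 'y': 50.0}
print(unique)         # [2]
```

In the toy data, locus 2 plays the role of the four loci whose ‘B. nana’ alleles were absent from all other individuals, and the percentages correspond to the per-individual values plotted in Fig. 6.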
As mentioned in the previous paragraph, four loci
were found which had alleles unique to B. nana and
homozygous and identical in both B. nana individuals.
These may be candidate genes for species differences
between B. nana and the other individuals in the study.
These were loci numbers 139431, 160800, 222373 and
280086 in the Stacks catalogue made from all RAD data,
assembled de novo. For 280086, all the non-B. nana individuals were homozygous and identical for a different
allele. These four loci were used for BLASTN searches
of the Genbank nucleotide collection; both as 100 bp
loci and with 1000 bp flanking sequence each side from
the B. nana reference genome where available. Locus
139431 hit a predicted Glycine max cation/proton antiporter 18 (E = 7e-10 with the 100 bp sequence, E < e-120
with flanking region). Locus 160800 had no significant
hit as a 100 bp sequence, but with flanking regions hit a
putative Ricinus communis beta-glucosidase (E = 2e-48).
Locus 222373 had no significant hit as a 100 bp
sequence, but with flanking regions hit a predicted Vitis
vinifera indole-3-acetate O-methyltransferase 1 (E = 1e-107).
Locus 280086 hit a predicted Vitis vinifera interactor of
constitutive active Rho-related GTPase 4 (E = 2e-22 with
the 100 bp sequence, E = 3e-120 with flanking region), which may
be involved in controlling cell polarity (Li et al. 2008).
Flow cytometry
For plant number 097-10, four replicated flow cytometry
runs with CVs below 5% were gained using fresh decorticated cambium, which gave a mean 2C value of
0.92 pg (SD 0.02); therefore, the 1C value of this plant is
450 Mb. For plant 574, nine replicated runs from catkins
gave a mean 2C value of 2.06 pg (SD = 0.02), thus
1C = 1007 Mb.
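The conversion from 2C values in picograms to 1C genome sizes in megabases can be reproduced with a short calculation. The sketch below assumes the widely used factor of 1 pg DNA ≈ 978 Mb; the text does not state which factor was applied, although this one reproduces both reported values.

```python
# Assumed conversion factor: 1 pg of DNA is approximately 978 Mb.
MB_PER_PG = 978.0


def one_c_mb(two_c_pg: float) -> float:
    """Convert a 2C value in picograms to a 1C genome size in megabases."""
    return two_c_pg / 2.0 * MB_PER_PG


# Mean 2C values reported in the text for the two plants.
for plant, two_c in (("097-10", 0.92), ("574", 2.06)):
    print(plant, round(one_c_mb(two_c)))
# 097-10 450
# 574 1007
```

On these figures, the 1C value of plant 574 is about 2.2 times that of the diploid B. nana individual 097-10.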
Discussion
Genome resources for Betula (birch)
The work presented in this study is a major progression
in our knowledge of the genome of Betula, and in the
availability of tools for ecological and conservation
genomics. This has been made possible by Illumina
sequencing technology, and the RAD-tag method of
library preparation. The genome sequence we present
for B. nana appears to have covered the whole genome
well, with scaffolds of sufficient length and accuracy for
complete core eukaryote genes to be found by BLAST
searches. This allows us to annotate RAD markers that
show interesting patterns of diversity among individuals, and the RAD markers themselves may in future
allow us to produce a genetic map of the genome (e.g.
Baxter et al. 2011).
We chose B. nana as the best species for constructing
a reference genome for Betula for several reasons. First,
it is a diploid species and therefore much simpler to
sequence than a tetraploid such as B. pubescens. Second,
it exists in the UK in small isolated populations, which
are likely to have relatively low genetic diversity compared with the other UK diploid species, B. pendula,
which occurs in much larger populations and is likely,
as a tall wind-pollinated tree, to have much gene flow
among populations. Ideally, we would have used an
inbred line but we did not have access to such materials, so an individual from a small isolated population
was the next best option. Third, we are interested in
introgression from B. nana to other Betula species, and
so need a thorough knowledge of the B. nana genome.
We used an individual grown from seed so that we
could have ready access to fresh materials in the laboratory. We are confident that the plant used was not an
F1 hybrid because its morphology is characteristic of
B. nana, and we compared this to the very different
morphology and growth of plants grown from other
seed from the same maternal parent that did show clear
F1 hybrid phenotypes. In addition, its genome size as
measured by flow cytometry is very similar to those of
Icelandic B. nana individuals. It is of course possible
that there may be some genes present in the B. nana
genome that we have sequenced that have introgressed
from other Betula species due to past hybridization
followed by backcrossing. However, it should be noted
that introgression from tetraploid B. pubescens into diploid
B. nana is much less likely than introgression in the
reverse direction, due to the direction of ploidy difference (Stebbins 1971). It would be interesting in future to
make genomic comparisons with B. nana individuals
from large populations in subarctic tundra, though
these are likely to harbour more heterozygosity and
therefore have genomes that are harder to assemble.
Utility of methods
The RAD marker protocol presented here is shown to
be a reliable method of enriching genomic DNA samples with homologous markers among individuals
within and among Betula species, for accurate genotyping. The two methods of RAD assembly used have
contrasting benefits: assembling with Bowtie to the B.
nana reference genome seems to separate out paralogs
in the RAD data (assuming that alleles in the B. nana
genome have not been under-assembled), but excludes
many loci that have good coverage in other species and
are assembled by the de novo method.
It has been suggested that RAD markers are best
applied at the intraspecific level, whereas target enrichment or sequence capture by oligonucleotides may be
more effective at higher levels (McCormack et al. 2011),
but among hybridizing Betula species it appears that
RAD is an effective method, perhaps because genetic
differentiation is low (see below). The Betula system
presented here can therefore join a growing list of ‘non-model’ plants that have successful RAD protocols,
including the globe artichoke (Scaglione et al. 2012),
Brassica napus (Bus et al. 2012) and the eggplant (Barchi
et al. 2011).
The presence of polyploidy in the Betula complex presents challenges for SNP calling (see also Ogden et al.,
this issue). The Stacks package (Catchen et al. 2011)
assumes that samples are diploid and does not distinguish between the different copy numbers of alleles
that can be present at heterozygous loci in polyploids.
However, where loci have sufficient read-depth, allele
copy number could in principle be assessed in a similar
manner to measuring allele-specific gene expression in
cDNA libraries; other SNP genotyping methods have
previously been analysed in this way (Buggs et al.
2012a). A further difficulty in polyploids is that of distinguishing paralogs from homeologs (genes duplicated
by whole-genome duplication), though the availability
of a genome sequence assembly will help us here, as
assembly of RAD tags against the reference genome
seems to assist in the identification of paralogs.
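As an illustration of how allele copy number might in principle be inferred from read depth, the sketch below assigns the dosage whose expected read fraction lies closest to the observed fraction. This is a naive nearest-ratio rule written for illustration; it is not part of Stacks, it ignores sequencing error and sampling variance, and the function name and read counts are hypothetical.

```python
def estimate_dosage(ref_reads: int, alt_reads: int, ploidy: int = 4) -> int:
    """Return the copy number (0..ploidy) of the alternative allele whose
    expected read fraction k/ploidy is closest to the observed fraction."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this locus")
    observed = alt_reads / total
    return min(range(ploidy + 1), key=lambda k: abs(observed - k / ploidy))


# Hypothetical read counts at heterozygous loci in a tetraploid.
print(estimate_dosage(30, 10))            # 1  (observed fraction 0.25)
print(estimate_dosage(20, 20))            # 2  (observed fraction 0.50)
print(estimate_dosage(9, 31))             # 3  (observed fraction 0.775)
print(estimate_dosage(20, 20, ploidy=2))  # 1  (diploid heterozygote)
```

A practical implementation would need sufficient read depth per locus and a statistical model of read sampling, since at low coverage the expected fractions of 1/4, 1/2 and 3/4 cannot be distinguished reliably.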
Biological implications
Though the RAD analyses presented here are intended
as a proof-of-concept for the ability of RAD sequencing
to cover homologous loci among individuals and species of Betula, even preliminary analysis of the data provides new biological insights into Betula genetics and
species differences. These insights are provisional, given
the small sample sizes, but provide valuable hints for
future research effort.
The data have yielded unexpected insights on taxonomy. Individual 574 is a large, mature tree growing in
Glen Lui, with unusually small leaves apparently of
B. × intermedia phenotype. However, it has unusually
low allele sharing with B. nana, so is unlikely to be a
hybrid of B. nana. It has allele numbers suggestive of
diploidy, but investigation with flow cytometry showed
it to have a C-value (1C = 1007 Mb) consistent with tetraploidy. One might speculate that it is an autotetraploid, perhaps related to B. pendula, but further
investigations are needed to identify it reliably. In contrast, individual 1184c was initially identified as B. pendula on the basis of leaf morphology but according to
our RAD data does not appear to be diploid, as one
would expect for B. pendula. This fits with the observation of Howland et al. that in East Anglia diploid and
tetraploid Betula are not distinct on the basis of morphological characters (Howland et al. 1995).
The data presented here suggest that genetic sequence
differentiation between B. nana and B. pubescens in the
UK is low. This is indicated by the fact that the percentage of loci shared between sample pairs is similar for
pairs between species and pairs within species (Table S2,
Supporting information), and by the high degree of allele
sharing found between B. nana and B. pubescens individuals (Fig. 6). Among the 1089 loci that were polymorphic
and found in all 15 individuals, only 4 showed no allele
sharing between B. nana and the other individuals. This
low genetic differentiation may reflect past hybridization
between B. nana and B. pubescens, or possibly a hybrid
origin of B. pubescens with B. nana as a parental species.
The study of introgression presented here is
necessarily very preliminary, owing to the small number
of plants sampled. We have therefore only conducted simple presence/absence analyses in those loci that were
sequenced in all individuals and therefore easiest to
analyse. Thorough sampling of B. nana populations is
needed for the accurate identification of alleles that are
truly unique to B. nana, or found in B. nana at a higher
frequency than in other species. Despite their preliminary
nature, the data show that there are differences in the
degree to which alleles found in B. nana are shared with
the other individuals, and this to some extent seems to
correspond to plant morphology. The fact that the individual from the south of England shows lowest allele
sharing with B. nana (if we exclude individual 574)
demonstrates that it is worthwhile to continue investigating whether there is a north-south cline in allele
sharing in Britain. Our hypothesis is that such a cline
exists and is the result of introgression between B. nana
and other Betula species as a hybrid zone between them
moved northwards through Britain due to global warming after the last glacial maximum (Huntley & Birks
1983; Caseldine 2001; Buggs 2007; Karlsdóttir et al.
2009). RAD sequencing has proved to be informative in
assessing hybridization among Heliconius species
(Dasmahapatra et al. 2012) and between rainbow and
westslope cutthroat trout (Hohenlohe et al. 2011). It has
also shed light on past range shifts in the pitcher plant
mosquito (Emerson et al. 2010).
In this study, we have also shown how, in principle,
candidate genes for the uniqueness of B. nana can be
identified. In our simple analysis of allele sharing, we
have identified four genes that show no allele sharing
between B. nana and the other individuals in the study.
Annotation of these genes, facilitated by the availability
of the B. nana reference genome, suggests that they may
be involved in the growth habits and water-tolerance of
B. nana, though at this stage, we cannot exclude the
possibility that these alleles were unique to B. nana due
to chance alone. More plants need to be sampled to
ascertain whether or not these genes are unique to
B. nana among more populations. With a larger data
set, we may also be able to discover genes that are
unique to B. nana; that is, taxonomically restricted or
‘orphan’ genes (Khalturin et al. 2009). The work presented
in this study lays essential foundations for ecological
and conservation genetic study in Betula.
Acknowledgements
We thank Michael Drury of Trees for Life for logistical help at
the Dundreggan site, Simon Renny-Byfield for assistance with
PERL scripts, and Andrew Leitch, Richard Nichols, Douglas
Soltis and Pamela Soltis for helpful discussions. RAD sequencing was carried out in the NBAF GenePool genomics facility in
the University of Edinburgh. Genome sequencing was conducted at the Beijing Genomics Institute, China. This project
was funded by Natural Environment Research Council (UK)
Fellowship NE/G01504X/1 to RJAB.
References
Anamthawat-Jónsson K (2004) Preparation of chromosomes
from plant leaf meristems for karyotype analysis and in situ
hybridization. Methods in Cell Science, 25, 91–95.
Anamthawat-Jónsson K, Thórsson ÆT (2003) Natural hybridisation in birch: triploid hybrids between Betula nana and
B. pubescens. Plant Cell, Tissue and Organ Culture, 75, 99–107.
Anamthawat-Jónsson K, Tomasson T (1990) Cytogenetics of
hybrid introgression in Icelandic Birch. Hereditas, 112, 65–70.
Anamthawat-Jónsson K, Thórsson ÆT, Temsch EM, Greilhuber
J (2010) Icelandic birch polyploids – the case of perfect fit in
genome size. Journal of Botany, 2010, 347254.
Aston D (1984) Betula nana L., a note on its status in the United
Kingdom. Proceedings of the Royal Society of Edinburgh Section
B-Biological Sciences, 85, 43–47.
Atkinson MD (1992) Betula pendula Roth (B. verrucosa Ehrh.)
and B. pubescens Ehrh. Journal of Ecology, 80, 837–870.
Atkinson MD, Codling AN (1986) A reliable method for distinguishing between Betula pendula and B. pubescens. Watsonia,
7, 5–76.
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers.
PLoS ONE, 3, e3376.
Barchi L, Lanteri S, Portis E et al. (2011) Identification of SNP
and SSR markers in eggplant using RAD tag sequencing.
BMC Genomics, 12, 304.
Baxter SW, Davey JW, Johnston JS et al. (2011) Linkage mapping and comparative genomics using next-generation RAD
sequencing of a non-model organism. PLoS ONE, 6, e19315.
Brown IR, Al-Dawoody D (1979) Observations on meiosis in
three cytotypes of Betula alba L. New Phytologist, 83, 801–811.
Brown IR, Tuley G (1971) A study of a population of birches in
Glen Gairn. Botanical Journal of Scotland, 41, 231–245.
Brown IR, Williams DA (1984) Cytology of Betula alba L complex. Proceedings of the Royal Society of Edinburgh Section
B-Biological Sciences, 85, 49–64.
Brown I, Kennedy D, Williams D (1982) The occurrence of natural hybrids between Betula pendula Roth and B. pubescens
Ehrh. Watsonia, 14, 133–145.
Buggs RJA (2007) Empirical study of hybrid zone movement.
Heredity, 99, 301–312.
Buggs RJA, Pannell JR (2006) Rapid displacement of a monoecious plant lineage is due to pollen swamping by a dioecious
relative. Current Biology, 16, 996–1000.
Buggs RJA, Chamala S, Wu W et al. (2012a) Rapid, repeated
and clustered loss of duplicate genes in allopolyploid plant
populations of independent origin. Current Biology, 22,
248–252.
Buggs RJA, Renny-Byfield S, Chester M et al. (2012b) Next-generation sequencing and genome evolution in allopolyploids.
American Journal of Botany, 99, 372–382.
Bus A, Hecht J, Huettel B, Reinhardt R, Stich B (2012) High-throughput polymorphism detection and genotyping in
Brassica napus using next-generation RAD sequencing. BMC
Genomics, 13, 281.
Caseldine C (2001) Changes in Betula in the Holocene record
from Iceland—a palaeoclimatic record or evidence for early
Holocene hybridisation? Review of Palaeobotany and Palynology, 117, 139–152.
Catchen JM, Amores A, Hohenlohe P, Cresko W, Postlethwait
JH (2011) Stacks: building and genotyping loci de novo from
short-read sequences. G3: Genes, Genomes, Genetics, 1,
171–182.
Crawford RMM (2008) Plants at the Margin: Ecological Limits
and Climate Change. Cambridge University Press, Cambridge,
UK.
Cronn R, Knaus BJ, Liston A et al. (2012) Targeted enrichment
strategies for next-generation plant biology. American Journal
of Botany, 99, 291–311.
Cullings KW (1992) Design and testing of a plant-specific PCR
primer for ecological and evolutionary studies. Molecular
Ecology, 1, 233–240.
Currat M, Ruedi M, Petit RJ, Excoffier L (2008) The hidden side
of invasions: massive introgression by local genes. Evolution,
62, 1908–1920.
Dabrowska G, Dzialuk A, Burnicka O, Ejankowski W,
Gugnacka-Fiedor W, Goc A (2006) Genetic diversity of postglacial relict shrub Betula nana revealed by RAPD analysis.
Dendrobiology, 55, 19–23.
Dasmahapatra KK, Walters JR, Briscoe AD et al. (2012) Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature, 487, 94–98.
Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM,
Blaxter ML (2011) Genome-wide genetic marker discovery
and genotyping using next-generation sequencing. Nature
Reviews Genetics, 12, 499–510.
Davy AJ, Gill JA (1984) Variation due to environment and
heredity in birch transplanted between heath and bog. New
Phytologist, 97, 489–505.
De Jong P (1993) An introduction to Betula: its morphology,
evolution, classification and distribution, with a survey
of recent work. In: Proceedings of the IDS Betula Symposium
(ed. Hunt D). International Dendrology Society, Richmond,
UK.
Doležel J, Greilhuber J, Suda J (2007) Estimation of nuclear
DNA content in plants using flow cytometry. Nature Protocols, 2, 2233–2244.
Doyle J, Doyle JL (1987) Genomic plant DNA preparation from
fresh tissue-CTAB method. Phytochemical Bulletin, 19, 11–15.
Earl D, Bradnam K, St. John J et al. (2011) Assemblathon 1:
A competitive assessment of de novo short read assembly
methods. Genome Research, 21, 2224–2241.
Elkington TT (1968) Introgressive hybridization between Betula
nana L. and B. pubescens Ehrh. in north-west Iceland. New
Phytologist, 67, 109–118.
Emerson KJ, Merz CR, Catchen JM et al. (2010) Resolving postglacial phylogeography using high-throughput sequencing.
Proceedings of the National Academy of Sciences USA, 107,
16196–16200.
Eriksson G, Jónsson A (1986) A review of the genetics of Betula.
Scandinavian Journal of Forest Research, 1, 421–434.
Gill JA, Davy AJ (1983) Variation and polyploidy within lowland populations of the Betula pendula/B. pubescens complex.
New Phytologist, 94, 433–451.
de Groot WJ, Thomas PA, Wein RW (1997) Betula nana L and
Betula glandulosa Michx. Journal of Ecology, 85, 241–264.
Hohenlohe PA, Amish SJ, Catchen JM, Allendorf FW, Luikart
G (2011) Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow
and westslope cutthroat trout. Molecular Ecology Resources,
11, 117–122.
Howland DE, Oliver RP, Davy AJ (1995) Morphological and
Molecular Variation in Natural Populations of Betula. New
Phytologist, 130, 117–124.
Huntley B, Birks H (1983) An Atlas of Past and Present Pollen
Maps of Europe: 0-13,000 Years ago. Cambridge University
Press, Cambridge.
Ilut DC, Coate JE, Luciano AK et al. (2012) A comparative transcriptomic study of an allotetraploid and its diploid progenitors illustrates the unique advantages and challenges of
RNA-seq in plant species. American Journal of Botany, 99,
383–396.
Jarvinen P, Palmé A, Morales LO et al. (2004) Phylogenetic
relationships of Betula species (Betulaceae) based on nuclear
ADH and chloroplast matK sequences. American Journal of
Botany, 91, 1834–1845.
Karlsdóttir L, Hallsdóttir M, Thórsson ÆT, Anamthawat-Jónsson K (2008) Characteristics of pollen from natural triploid Betula hybrids. Grana, 47, 52–59.
Karlsdóttir L, Hallsdóttir M, Thórsson ÆT, Anamthawat-Jónsson K (2009) Evidence of hybridisation between Betula
pubescens and B. nana in Iceland during the early Holocene.
Review of Palaeobotany and Palynology, 156, 350–357.
Karlsson PS, Schleicher LF, Weih M (2000) Seedling growth
characteristics in three birches originating from different
environments. Ecoscience, 7, 80–85.
Kenworthy JB, Aston D, Bucknall SA (1972) A study of hybrids
between Betula pubescens Ehrh. and Betula nana L. from Sutherland – an integrated approach. Botanical Journal of Scotland,
41, 517–539.
Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch T
(2009) More than just orphans: are taxonomically-restricted genes important in evolution? Trends in Genetics,
25, 404–413.
Kim S, Misra A (2007) SNP genotyping: technologies and biomedical applications. Annual Review of Biomedical Engineering,
9, 289–320.
Lai Z, Kane NC, Kozik A et al. (2012) Genomics of Compositae
weeds: EST libraries, microarrays, and evidence of introgression. American Journal of Botany, 99, 209–218.
Langmead B, Trapnell C, Pop M, Salzberg S (2009) Ultrafast
and memory-efficient alignment of short DNA sequences to
the human genome. Genome Biology, 10, R25.
Li S, Gu Y, Yan A, Lord E, Yang Z-B (2008) RIP1 (ROP Interactive Partner 1)/ICR1 marks pollen germination sites and
may act in the ROP1 pathway in the control of polarized
pollen growth. Molecular Plant, 1, 1021–1035.
Lowe AJ, Harris SA, Ashton P (2004) Ecological Genetics: Design,
Analysis and Application. Blackwell, Oxford.
Mamanova L, Coffey AJ, Scott CE et al. (2010) Target-enrichment strategies for next-generation sequencing. Nature Methods, 7, 111–118.
McCormack JE, Hird SM, Zellmer AJ, Carstens BC, Brumfield
RT (2011) Applications of next-generation sequencing to phylogeography and phylogenetics. Molecular Phylogenetics and
Evolution, doi: 10.1016/j.ympev.2011.12.007.
Otto F (1990) DAPI staining of fixed cells for high-resolution
flow cytometry of nuclear DNA. In: Methods in Cell Biology
(eds Crissman H, Darzynkiewicz Z), pp. 105–110. Academic
Press, New York.
Palme AE, Su Q, Palsson S, Lascoux M (2004) Extensive
sharing of chloroplast haplotypes among European birches
indicates hybridization among Betula pendula, B. pubescens
and B. nana. Molecular Ecology, 13, 167–178.
Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 23, 1061–1067.
Pelham J, Kinnaird JW, Gardiner AS, Last FT (1984) Variation
in, and reproductive capacity of, Betula pendula and Betula
pubescens. Proceedings of the Royal Society of Edinburgh Section
B-Biological Sciences, 85, 27–41.
Pelham J, Gardiner AS, Smith RI, Last FT (1988) Variation in
Betula pubescens Ehrh (Betulaceae) in Scotland - its nature
and association with environmental factors. Botanical Journal
of the Linnean Society, 96, 217–234.
Regel E (1861) Monographische Bearbeitung der Betulaceen.
Nouveaux Memoires de la Societe Imperiale des Naturalistes de
Moscou, 13, 59–187.
Rich TCG, Jermy AC (1998) Plant Crib. Botanical Society of the
British Isles in association with National Museums & Galleries of Wales, London.
Scaglione D, Acquadro A, Portis E, Tirone M, Knapp S, Lanteri
S (2012) RAD tag sequencing as a source of SNP markers in
Cynara cardunculus L. BMC Genomics, 13, 3.
Stace CA (2010) New Flora of the British Isles, 3rd edn. Cambridge University Press, Cambridge.
Stebbins GL (1971) Chromosomal Evolution in Higher Plants.
Edward Arnold, London.
Thórsson ÆT, Palsson SP, Sigurgeirsson A, Anamthawat-Jónsson K (2007) Morphological variation among Betula nana
(diploid), B. pubescens (tetraploid) and their triploid hybrids
in Iceland. Annals of Botany, 99, 1183–1193.
Thórsson ÆT, Salmela E, Anamthawat-Jónsson K (2001)
Morphological, cytogenetic, and molecular evidence for introgressive hybridization in birch. Journal of Heredity, 92, 404–408.
Truong C, Palmé AE, Felber F, Naciri-Graven Y (2005) Isolation
and characterization of microsatellite markers in the tetraploid birch, Betula pubescens ssp. tortuosa. Molecular Ecology
Notes, 5, 96–98.
Vaarama A, Valanne T (1973) On the taxonomy, biology and
origin of Betula tortuosa Ledeb. Reports from the Kevo Subarctic
Research Station, 10, 70–84.
Wang N, Thomson M, Bodles WJA et al. (2012) Data from:
Genome sequence of dwarf birch (Betula nana) and cross-species RAD markers. Dryad Digital Repository http://dx.doi.org/10.5061/dryad.v5gd2.
Wilsey BJ, Saloniemi I (1999) Leaf fluctuating asymmetry in
tree-line mountain birches, Betula pubescens ssp tortuosa:
genetic or environmentally influenced? Oikos, 87, 341–345.
Woodworth RH (1929) Cytological studies in the Betulaceae. I.
Betula. Botanical Gazette, 87, 331–363.
Zeng J, Zou YP, Bai JY, Zheng HS (2002) Preparation of total
DNA from “recalcitrant plant taxa”. Acta Botanica Sinica, 44,
694–697.
N.W. is a PhD student supervised by R.J.A.B. working on birch
hybridization and phylogenetics. M.T. specializes in Illumina
and RAD sequencing at the GenePool facility, University of
Edinburgh. W.J.A.B. is a projects manager at Highland Birchwoods, leading a Heritage Lottery Fund mountain woodlands
project. R.M.M.C. studies plant responses to the environment
and is Emeritus Professor at St Andrews University. H.V.H. is
a post-doctoral researcher in archaeogenetics at Cambridge
University, working on plant genetic diversity and population
history. A.W.F. is a conservationist and director of Trees for
Life. J.P. is a post-doctoral researcher specializing in flow
cytometry and genome size evolution at RBG Kew. R.J.A.B. is
a NERC Fellow and Senior Lecturer at Queen Mary University
of London, interested in evolution, ecology and genomics.
Data accessibility
DNA read sequences for Betula nana reference genome,
and genome assembly: Sequence Read Archive study
accession ERP001867 http://www.ebi.ac.uk/ena/data/view/ERP001867.
DNA read sequences for RAD loci: Sequence Read
Archive study accession ERP001869 http://www.ebi.ac.uk/ena/data/view/ERP001869.
RAD catalogs and matches: DRYAD http://dx.doi.org/10.5061/dryad.v5gd2.
Herbarium vouchers: Natural History Museum,
London, accession numbers BM001074532-BM001074546.
Supporting information
Additional supporting information may be found in the online version of this article.
Table S1 Assemblathon2 statistics for genome assembly.
Table S2 Sequenced RAD loci shared between each pair of samples,
expressed as a percentage of the total number of loci in the RAD
locus catalog from the sample listed in the left-hand edge column.
Fig. S1 Photograph of Betula nana individual 097-10 used for genome
sequencing.
Fig. S2 Flow chart outlining analysis pipelines used in this study.