(Clupea harengus) Sequencing in Atlantic Herring

advertisement
OPEN 0 ACCESS Freely available online
SNP Discovery Using Next Generation Transcriptomic
Sequencing in Atlantic Herring (Clupea harengus)
Sarah J. Helyar1'2*9, M orten T. Limborg39, Dorte B ekkevold3, M assim iliano Babbucci4, Jeroen van
H oudt5, G regory E. M aes5, Luca Bargelloni4, Rasmus O. N ielsen 6, Martin I. Taylor1, Rob O gd en 7,
A lessia Cariani8, Gary R. Carvalho1, FishPopTrace C onsortium 1, Frank Panitz6
1 Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, College of Natural Sciences, Bangor University, Bangor, Gwynedd, United Kingdom,
2 Food Safety, Environment & Genetics, Matis, Reykjavik, Iceland, 3 National Institute of Aquatic Resources, Technical University of Denmark, Silkeborg, Denmark,
4 D epartm ent o f Comparative Biomedicine and Food Science, University of Padova, Legnaro, Italy, 5 Laboratory of Biodiversity and Evolutionary Genomics, Katholieke
Universiteit Leuven, Leuven, Belgium, 6 D epartm ent o f Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, Tjele, Denmark, 7 TRACE
Wildlife Forensics Network, Royal Zoological Society o f Scotland, Edinburgh, United Kingdom, 8 D epartm ent of Experimental and Evolutionary Biology, University of
Bologna, Bologna, Italy
Abstract
The i n tr o d u c tio n of Next G e n era tio n S e q u e n c in g (NGS) has revolution ised p o p u la tio n g e netics, providing s tu d ie s o f n o n ­
m o d e l sp e c ie s with u n p r e c e d e n t e d g e n o m i c c o v e r a g e , allowing ev o lu tio n ary biologists t o a d d r e s s q u e s t io n s previously far
b e y o n d t h e reach o f available resources. F urth e rm o re , t h e sim p le m u ta t io n m o d e l o f Single N ucleotid e P o lym orph ism s
(SNPs) pe rm its cost-effective h i g h - t h r o u g h p u t g e n o t y p i n g in t h o u s a n d s o f individuals sim ultaneou sly. G e n o m ic resources
are sc arce for t h e Atlantic herring (Clupea harengus), a small pela gic sp e c ie s t h a t sustains high r e v e n u e fisheries. This p a p e r
details t h e d e v e l o p m e n t o f 578 SNPs using a c o m b i n e d NGS a n d h i g h - t h r o u g h p u t g e n o t y p i n g a p p r o a c h . Eight individuals
c ov ering t h e sp e c ie s d istributio n in t h e e a s te r n Atlantic w e re b a r -c o d e d a n d m ultip lex ed into a single cDNA library a n d
s e q u e n c e d using t h e 4 5 4 GS FLX platform . SNP discovery w a s p e r fo r m e d by de n o v o s e q u e n c e c lu ste ring a n d c o ntig
a ssem bly, follo w ed by t h e m a p p i n g o f reads a g a in s t c o n s e n s u s co n tig s e q u e n c e s . Selection o f c a n d i d a t e SNPs for
g e n o t y p i n g w as c o n d u c t e d using a n in silico a p p r o a c h . SNP validation a n d g e n o t y p i n g w e r e p e r fo rm e d sim u lta n eo u s ly
using a n Illumina 1,536 G o ld e n G a te assay. A l th o u g h t h e c o nversion rate o f c a n d i d a t e SNPs in t h e g e n o t y p i n g a ssa y c a n n o t
b e p r e d ic te d in a d v an c e, this a p p r o a c h has t h e poten tial to m axim ise c o st a n d t im e efficiencies by avo idin g e x p e n s iv e a n d
t im e - c o n s u m i n g labora tory s t a g e s of SNP validation. Additionally, t h e in silico a p p r o a c h leads t o low er a s c e r t a i n m e n t bias in
t h e resulting SNP pa n el as m ark e r selectio n is b a s e d only o n t h e ability t o d e s ig n prim ers a n d t h e p re d ic te d p r e s e n c e of
intron-e x on b o u n d a rie s. C o n s e q u e n tl y SNPs with a w id e r s p e c t r u m o f m in o r allele fre q u e n c ie s (MAFs) will b e g e n o t y p e d in
t h e final panel. T he g e n o m i c re sou rce s p r e s e n t e d he re r e p r e s e n t a va lu a ble m u lti- p u rp o se re source for d e v e lo p in g
informativ e m arke r p a n els for p o p u la tio n discrim ination, m icroarray d e v e l o p m e n t a n d for p o p u la tio n g e n o m i c s tu d ie s in
t h e wild.
C itation : Helyar SJ, Limborg MT, Bekkevold D, Babbucci M, van Houdt J, e t al. (2012) SNP Discovery Using Next Generation Transcriptomic Sequencing in Atlantic
Herring (Clupea harengus). PLoS ONE 7(8): e42089. doi:10.1371/journal.pone.0042089
Editor: Arnar Palsson, University o f Iceland, Iceland
Received March 7, 2012; A ccepted July 2, 2012; Published A ugust 7, 2012
C opyrigh t: © 2012 Helyar e t al. This is an open-access article distributed under th e term s of th e Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided th e original author and source are credited.
Funding: The research leading to th ese results received funding from th e European Community's Seventh Framework Programme (FP7/2007-2013) under grant
agreem ent no KBBE-212399 (FishPopTrace). In addition, MTL received financial support from th e European Commission through th e FP6 projects UNCOVER
(Contract No. 022717) and RECLAIM (Contract No. 044133). GM is a post-doctoral researcher funded by the Scientific Research Fund Flanders (FWO-Flanders). The
funders had no role in study design, data collection and analysis, decision to publish, or preparation of th e manuscript.
C om peting Interests: The authors have declared th a t no com peting interests exist.
* E-mail: sarah.helyar@matis.is
9 These authors contributed equally to this work.
a n d th ere is a n u rgent need for genom ic tools to identify
po p u latio n structure a n d b oundaries to allow effective m an ag e­
m en t [4], A dditionally the forensic identification o f fish a n d fish
products th ro u g h o u t the food processing chain from n et to plate
w ould assist in the fight against Illegal, U n re p o rte d a n d
U n re g u lated (IUU) fishing, currently a prio rity for the E u ro p ea n
U n io n [5] a n d globally [6], SN Ps are the optim al m ark er for this
type o f application, b u t large S N P panels are cu rren tly available
for few m arin e fish species (e.g. A tlantic cod (Gadus morhua) [7];
E u ro p ea n hake (Merluccius merluccius) [8]). T h u s, the developm ent
o f genom ic resources for m arin e fish is urgently req u ire d for
evolutionary, conservation a n d m an a g em e n t perspectives.
Introduction
P opulation genom ic ap p ro ach es have b e e n revolutionised by
N ext G e n era tio n S equencing (NGS) technologies such as 454
(Roche) a n d Illum ina sequencing. T hese developm ents facilitate
genom e-w ide analyses o f genetic variatio n across populations o f
non-m odel organism s [1,2], allow ing a range o f evolutionary
questions to b e investigated effectively for the first tim e. M arin e
fishes a re excellent m odel systems for studying a d ap tatio n due to
th eir large geographic ranges th a t frequently encom pass strong
e nvironm ental gradients a n d th eir large pop u latio n sizes th at
increase the relative strength o f selection over drift [3]. M oreover,
m an y m arin e fishes are u n d e r extrem e a nthropogenic pressure
PLOS ONE I w w w .plosone.org
1
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
M aterials and M ethods
T h e strategy used for SN P developm ent in non-m odel
organism s is d e p en d e n t o n the availability o f genom ic inform a­
tion from closely related species. If such resources are available,
P C R am plicons (hom ologous to regions in the reference
genom e) can be sequenced a n d SN Ps identified (however, these
are intrinsically lim ited in the n u m b e r o f SN Ps th a t c an be
identified). W ith o u t a reference genom e, th ree principal
strategies for genom e-w ide SN P discovery can b e applied;
w hole genom e sequencing a n d assem bly, genom e com plexity
red u ctio n a n d sequencing m ethods (e.g. R R L a n d RA D -seq)
a n d cD N A sequencing (RNA-seq). W hile w hole genom e
sequencing has now b e en com pleted for species w ith large
com plex genom es (for exam ple: p a n d a (Ailuropoda melanoleurd) [9];
cacao (Theobroma cacao) [10]), this rem ains outside the scope o f
m ost studies, as in general the de novo assem bly o f larger, repeatrich o r polyploid genom es requires ad ditional inform ation (e.g.
physical BAC m aps o r p a ire d -en d libraries) a n d extensive
bioinform atic capacity in o rd e r to build the large, c o m p u ta ­
tionally intensive, structured sequence scaffolds [11]. G enom ic
libraries w hich sequence a small fraction o f the genom e
(typically 3 -5 % ) req u ire a high level o f coverage for contig
assem bly a n d detection o f S N P variants (see [12-14] for
applications). D eep sequencing o f cD N A libraries provides an
attractive a p p ro ac h to achieve the high sequence coverage
need ed for de novo contig assem bly a n d SN P prediction, as only
a small percentage o f the genom e is acco u n ted for by the
transcriptom e. A n o th e r ad vantage o f transcriptom e sequencing
is the in form ation p ro d u c ed co n cern in g functional genetic
v a riation in specific genes w hich m ay b e u n d e r selection; these
can th en b e targ eted to evaluate gene expression profiles. T h e
ability to exam ine b o th neu tral variatio n a n d genom ic regions
u n d e r selection provides researchers w ith u n p re ce d en te d tools
for u n d erstan d in g local a d ap tatio n o f w ild populations a t the
m olecular level.
A tlantic h e rrin g (Clupea harengus) is a n a b u n d a n t a n d
ecologically highly diverse species, o ccurring w ith a m ore or
less continuous distribution in the N o rth -A d an tic benthopelagic
zone. H ab itats a re distributed across highly diverse environ­
m ents, from tem p erate (33°N) to arctic (80°N) a n d a t salinities
from oceanic (~ 3 5 ppt) to brackish (down to 3 ppt). In spite o f
its large ecological range, studies using “ n e u tra l” m icrosatellites
have unanim ously re p o rte d w eak p o p u latio n differentiation th a t
is statistically significant only o n regional scales [15-17].
H ow ever, despite relatively high levels o f gene flow am o n g
populations, evidence o f local a d ap tatio n has b e en identified in
the A tlantic h e rrin g in the Baltic Sea using m icrosatellite loci
[18,19]. T h erefo re it is expected th a t analyses w ith transcriptom e-w ide coverage applying h u n d red s o f m arkers associated
w ith adaptive a n d neu tral v a riation will provide novel insights
into the role o f selective a n d dem o g rap h ic processes in shaping
pop u latio n structure.
W e describe transcriptom e-based S N P d evelopm ent in A tlantic
h e rrin g using a R o ch e 454 G S F L X (hereafter 454) sequencing
a p p ro ac h . O u r aim was three-fold; 1) to develop a S N P assay
exhibiting m inim al ascertain m en t bias across east A tlantic
populations, 2) to test the applicability o f in silico SN P detection
utilizing a com b in ed S N P screening a n d validation a p p ro ac h as a
cost efficient w ay o f o btain in g po p u latio n genom ic resources, a n d
3) to establish a transcriptom e resource for tissue-specific gene
expression profiling a n d m ic ro arra y developm ent. W e present, to
o u r know ledge, one o f the first studies describing SN P discovery in
a non-m odel m arin e fish based o n transcriptom e sequencing using
N G S.
PLOS ONE I w w w .plosone.org
cDNA Library C o n s t r u c t i o n a n d 4 5 4 S e q u e n c i n g
S N P developm ent was based o n m uscle sam ples from eight fish
collected from four locations from across the eastern A tlantic
(Figure 1). T hese locations w ere chosen to m axim ise geographic
coverage a n d e nvironm ental differences, thereby m inim ising
poten tial ascertain m en t bias. A pproxim ately 5g o f m uscle tissue
was taken from each o f two individuals (male a n d female) from
each location a n d im m ediately p laced in R N A later (Invitrogen)
a n d after 12 hours a t 4°C , w ere stored a t —80°C . T o ta l R N A was
extracted using the R N easy L ipid T issue M ini K it (Qiagen). T h e
O ligotex m R N A M ini K it (Q iagen) was used to isolate m R N A ,
a n d non-norm alised cD N A was synthesized using the S u p e rsc rip t
D oub le-stran d ed cD N A Synthesis K it (Invitrogen). A m ultiplex
sequencing lib rary was p re p a re d by pooling equal am ounts o f
cD N A from all eight individuals, w here two specific 10-m er
b a rco d in g oligonucleotides w ere ligated to each individual sam ple
to allow post-sequencing identification o f sequences (m odified
from [20]). H ig h -th ro u g h p u t sequencing was p erfo rm ed o n a 454
sequencer according to the m an u fa ctu re rs’ protocol.
S e q u e n c e P r o c e s s in g a n d A s s e m b ly
Sequences w ere first de-m ultiplexed using the b a rco d in g tags
(sfffile tool, R o ch e 454 analysis software) a n d sorted by sam ple.
M itoch o n d rial sequences w ere rem oved from the d a ta set by
m ap p in g the reads against the A tlantic h e rrin g m itochondrial
genom e (G enbank accession N C _009577 [21]) using the R oche
454 gsM ap p er software. R ep eatM ask er [22] was used to identify
a n d m ask repetitive a n d low com plexity regions w ithin the reads
by using the zebrafish [Danio rerio) re p ea t library. R eads w ere
cleaned for short sequences (< 5 0 bp) a n d low quality regions using
S eqC lean (h ttp ://c o m p b io .d fc i.h a rv a rd .e d u /tg i/s o ftw a re /). Se­
quence clustering was p erfo rm ed in tw o steps; initial clustering
was p erfo rm ed using C L C G enom ics W o rk b en ch (C L C bio,
D enm ark), the resulting ace file sequences w ere th en assem bled
‘p e r contig’ in C A P3 [23]. T h e consensus sequences for the contigs
p ro d u c ed by this assem bly w ere th en used as a reference for
m ap p in g reads in the subsequent in silico S N P detection.
SNP D e te c t io n
T o identify can d id ate SNPs, all contig specific reads from the
C A P 3.ace files w ere re -m a p p ed onto the consensus sequence a n d
can d id ate SN Ps w ere identified using GigaBayes [24], T his
p ro g ram scans each position o f the assem bly for the presence o f
a t least two S N P alleles a n d calculates the p robability o f a given
site bein g polym orphic using a B ayesian a p p ro ach . N o insertion or
deletion variants (InDels) w ere considered a n d the polym orphism
ra te was set to 0.003. A m in im u m contig d e p th o f four reads
covering the polym orphic site a n d a m in im u m o f two reads for the
ra re allele w ere re q u ire d for a site to be considered as a putative
SN P. All contigs c o ntaining SN Ps w ere filtered to rem ove
instances in w hich the alternative allele o f the SN P was only
identified in a single individual, as these m ay either represent false
positives or m ay lead to strong a scertain m en t bias.
M icro sa tellite S e q u e n c e S c re e n in g
M icrosatellites are a n im p o rta n t resource for sm aller scale
studies in p o p u latio n genetics, m icrosatellites w ithin expressed
genom ic regions have b e en show n to p ro d u c e clearer genotyping
results as th ere are fewer null alleles a n d stutter b an d s [25,26];
therefore the contig library developed here was screened to detect
re p ea t regions. A ssem bled contigs w ere screened for m icrosatellite
repeats using M satC o m m a n d e r [27] a P ython p ro g ra m w hich
2
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
North Atlantic
Figure 1. Location of the 18 samples used in this study. T h e e ig h t s e q u e n c e d a s c e rta in m e n t in d iv id u a ls (2 p e r lo c a tio n ) c a m e fro m th e fo u r
sa m p lin g site s d e n o te d in red .
d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 1
locates m icrosatellite repeats (di-, tri-, tetra-, penta-, a n d hexanucleotide repeats) w ithin fasta-form atted sequences o r consensus
files. M satC o m m a n d e r th en uses P rim erS [28] to screen sequences
con tain in g m icrosatellite loci for high-quality P C R p rim e r sites
w ithin the flanking regions for ‘potentially am plified loci’ (PALs
[29]).
co ntaining SN Ps w ere first blasted against six p eptide sequence
databases (Ensem bl genom e assem bly for G. aculeatus, T. nigroviridis,
0 . latipes, T. rubripes, D . rerio a n d the Sw issprot database) using the
Blastx function (E-value cut-off < 1 .0 E - 3 ). F o r each SN P
c o ntaining contig the best m atch was selected a n d the aligned
sections o f the qu ery w ere saved. Subsequently, two 121 b p
sequences p e r SN P (i.e. 60 b p u p /d o w n -stre a m o f the SN P
position, one sequence for each allele) w ere pro d u ced , these w ere
used in a Blastx analysis against the file retrieved from the peptide
sequences (E-value cut-off < 1 .0 E -10), a n d w ere th en c o m p a red to
d eterm in e if the SN P rep resen ted a synonym ous o r nonsynonym ous m utation.
C o n tig A n n o t a t i o n
C ontigs w ere a n n o ta te d using the Basic L ocal A lignm ent Search
T ool (BLAST) against m ultiple sequence databases. Blastn
searches (E-value cut-off < 1 .0 E 5) w ere cond u cted against all
a n n o ta te d transcripts o f Gasterosteus aculeatus, Tetraodon nigroviridis,
Oryzias latipes, Takifiigu rubripes, Danio rerio a n d Homo sapiens available
th ro u g h the E nsem bl G enom e Browser, a n d against all unique
transcripts for D . rerio, H . sapiens, 0 . latipes, T. rubripes, Salmo salar,
a n d Oncorhynchus mykiss in the N C B I LTniGene database. Blastx
searches w ere c o n d u cted (E-value cut off < 1 .0 E 3) against the
LTniProtK B/Sw issProt a n d L h iiP ro tK B /T rE M B L databases. L ast­
ly Blastx searches w ere p erfo rm ed against all a n n o ta te d proteins
from the transcriptom es o f G. aculeatus, T. nigroviridis, 0 . latipes, T.
rubripes, D . rerio a n d H . sapiens available th ro u g h the E nsem bl
G enom e Browser.
S e le c tio n o f C a n d i d a t e SNPs fo r G e n o t y p i n g A ssay
SN Ps w ere validated follow ing a n in silico protocol, aim ed at
m inim ising validation costs, whilst also m inim ising subsequent
locus dro p o u t. S N P selection was based on the results from the
Illum ina Assay D esign T ool, detection o f p utative intron-exon
b oundaries w ithin the flanking regions o f c andidate SN Ps, a n d a
visual evaluation o f the quality o f contig sequence alignm ents. T h e
S N PS core from the Illum ina Assay D esign T ool (referred to as the
Assay D esign S core/A D S ) utilises factors including tem plate G C
content, m elting tem p e ra tu re, sequence uniqueness, a n d selfco m plem entarity to filter the can d id ate SN Ps p rio r to further
inspection. T h e Assay D esign Score (assigned betw een 0 a n d 1) is
T o pred ict the effect o f the m u ta tio n underlying each SN P at
the am ino acid level, a pipeline was developed to pred ict the
re ad in g fram e for each S N P -containing contig. All contigs
PLOS ONE I w w w .plosone.org
3
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
indicative o f the ability to design suitable oligos w ithin the 60 b p
u p /d o w n -stre a m flanking region, a n d the expected success o f the
assay w hen genotyped w ith the Illum ina G o ld en G ate chem istry.
Follow ing the Illum ina guidelines, all SN Ps w ith a score below 0.4
w ere discarded; SN Ps w ith a score above 0.4 w ere accepted, w ith
SN Ps scoring above 0.7 bein g used preferentially.
T h e pred ictio n o f intro n -ex o n bo u n d aries w ithin the SN P
flanking regions (60 b p u p /d o w n -stre a m o f S N P position) was
p erfo rm ed using two approaches. T h e first directly c o m p a red
SN P -containing contigs against five high quality reference
genom es for m odel fish species (Ensem bl genom e assem bly for
C. aculeatus, 71 nigroviridis, O. latipes, 71 rubripes a n d D. rerio; tee
Figure S I, left pipeline), using the B lastn o ption (E-value cut-off
10 5). Blast results w ere th en p arsed via a custom Perl script
considering alignm ent length, start a n d e n d p o in t o f the alignm ent
to determ in e the best positive m atc h (further details o f the Perl
script a n d workflow are available from the a u thors on request). If
the 60 b p o n b o th sides o f the SN P w ere p re sen t in the alignm ent,
the can d id ate SN P was considered to b e co n tain ed w ithin a single
exon; otherw ise a n intron-exon b o u n d a ry was assum ed to be
p re sen t w ithin the 121 b p assay design region. SN Ps w ere th en
assigned to one o f th ree categories either having, o r n o t h aving an
intro n -ex o n b o u n d a ry p red icted w ithin the flanking region, o r as
n o t re tu rn in g a significant m atc h against any o f the five blasted fish
genom es. In the o th er a p p ro ach , the likelihood o f a positive m atch
a n d the reliability o f intro n -ex o n b o u n d a ry predictions w ere
increased, w ith S N P -containing contigs used as a qu ery in a Blast
search (blastn, E -value cut-off 10 5) against the corresponding
transcriptom e o f the sam e five reference databases (see above). If
the blast search p ro d u c ed a positive result, the m atch in g transcript
was dow nloaded from the E nsem bl database, a n d blasted against
its ow n genom e sequence (see Figure S I, rig h t pipeline). W ithin
the dow nloaded sequence, the nucleotide position corresponding
to the can d id ate S N P in the A tlantic h e rrin g sequence was
identified based on the start a n d e n d positions o f the alignm ent
betw een the original contig a n d the E nsem bl transcript. U sing the
p rojected S N P position, the flanking regions w ere again classified
as bein g located o n a single exon, d isrupted by a n intron, or not
having a significant m atch. R esults from the tw o a pproaches w ere
c o m p a red to o b tain a consensus estim ate for the likelihood o f an
intro n -ex o n b o u n d a ry o ccurring w ithin the 121 b p assay for each
o f the can d id ate SNPs.
Finally, the rem ain in g can d id ate S N P contigs w ere visually
evaluated using clview (clview; h ttp ://c o m p b io .d fc i.h a rv a rd .e d u /
tg i/so ftw a re /) in o rd e r to ra n k putative SN Ps w ithin a n d am o n g
contigs. T h is was assessed by considering the overall quality o f the
assem bly, the d e p th a n d length o f alignm ents, a n d the n u m b e r o f
m ism atch sites flanking the SN P. T his step was included to
increase the likelihood o f excluding incorrectly identified SN Ps (for
exam ple; regions w ith alternative splicing or erroneous clustering
o f paralogous sequences). W ith in each contig, one o r tw o SNPs
receiving the highest quality score w ere considered for further
validation (see below).
Illu m in a’s G enom eS tudio d a ta analysis software (1.0.2.20706,
Illum ina Inc.). O nly S N P assays show ing clear genotype clustering,
a n d individual sam ples w ith a call ra te above 0.8 w ere considered
for fu rth er analysis.
C ro s s -s p e c ie s A m p lificatio n
T o assess the utility o f developed m arkers in related species, two
species identified from a consensus phylogeny [31], the sister
species; Pacific h e rrin g (C. pallasii) a n d a m ore distantly related
species; anchovy (Engraulis encrasicolus) w ere genotyped for the full
1,536 S N P panel.
Statistical A n aly ses
T o assess the predictive value a n d utility o f the different
p a ram ete rs used in the in silico S N P validation pipeline, a binom ial
logistic regression analysis was conducted. T w o categorical
variables (Conversion a n d Polymorphism) w ere evaluated w hich
describe the outcom e o f the SN P assay validation; these are
expected to d e p en d o n a range o f c andidate p re d ic to r variables
(see below). Conversion was scored b y assigning all 1,536 genotyped
S N P assays as eith er failed (score = 0) o r successfully am plified a n d
clustered (score = 1). Polymorphism assigned all the successfully
am plified S N P assays into m o n o m o rp h ic (0) o r polym orphic (1)
categories. N ine variables w ere th en assessed for th eir predictive
value in determ in in g S N P assay conversion a n d polym orphism : i)
n u m b e r o f ascertain m en t p an el individuals supplying sequence
reads a t the S N P position, ii) n u m b e r o f sequences aligned u n d e r
S N P position, iii) n u m b e r o f sequences w ith the m in o r allele, iv)
frequency o f sequences w ith m in o r allele, v) n u m b e r o f ascertain­
m en t individuals w ith the m in o r allele, vi) Illum ina Assay D esign
Score (ADS), vii) outcom e o f the intron-exon b o u n d a ry pipeline
(scored as S N P assay bein g w ithin a single exon, in te rru p ted b y an
in tro n or as h aving no B L A ST m atch), viii) n u m b e r o f reference
species supporting findings from the intro n -ex o n pipeline, a n d ix)
n e ig h b o u rh o o d sequence quality (determ ined by the n u m b e r o f
m ism atches in the flanking region alignm ent). T o statistically test
the predictive effect o f the above variables for b o th Conversion a n d
Polymorphism a tw o-step binom ial logistic regression analysis was
used as im p lem en ted in SPSS v l2 .0 . All variables w ere included in
the initial m odel, a n d a backw ard stepwise deletion a p p ro ac h was
used for optim isation, in w hich the least inform ative variable is
rem oved sequentially until only significantly con trib u tin g variables
rem ain. A W ald x 2 statistic was used to estim ate the relative
con trib u tio n from each rem ain in g p a ram ete r.
F o r the successful polym orphic assays global values o f observed
(H o) a n d expected (H E) heterozygosity w ere estim ated for 20
individuals from each o f the four ascertain m en t populations
(Figure 1) using G enA lE x 6.4 [32]. F o r these sam e populations
deviations from H a rd y -W e in b erg equilibrium (HW E) a n d evi­
dence o f linkage disequilibrium (LD) w ere explored using
G en ep o p 4.0 [33]. Significance levels for H W E a n d LD tests
w ere estim ated using a n M C M C ch ain o f 10,000 iterations a n d 20
batches, / ’-values w ere adjusted for m ultiple tests by false discovery
ra te (FDR) correction following B enjam ini & Y ekutieli [34],
Lastly, a scertain m en t bias, resulting from the n o n -ran d o m
exclusion o f SN Ps w ith a low M in o r Allele F requency (MAF) from
the m ark e r panel, m ay occur due to the small size (n = 8) o f the
ascertain m en t p an el (com pared to the w hole population), a n d the
lim ited geographical coverage (com pared to the w hole species
range). W h en m arkers are th en genotyped on a m u ch larger
sam ple o f individuals the resulting a scertain m en t bias [35,36] m ay
affect the estim ation o f m an y evolutionary a n d po p u latio n genetic
p a ram ete rs [2]. T o assess the m agnitude o f a poten tial bias, the
distribution o f M A F in the m ark er p an el was assessed across a
SNP V alidatio n
Follow ing the pipeline described above, 1,536 high scoring
can d id ate m arkers w ere chosen for validation by high th ro u g h p u t
genotyping assay. D N A was extracted from fin clips for 626 fish
sam pled from eighteen sites across the species range in the eastern
A tlantic, including tw enty fish from each o f the four S N P discovery
populations (Figure 1). T h e quality a n d q u a n tity o f D N A was
checked using a N a n o d ro p spectrophotom eter, a n d all sam ples
w ere standardised to 70 n g /p L . G enotyping was p erfo rm ed using
the Illum ina G olden G ate platform [30], a n d was visualised using
PLOS ONE I w w w .plosone.org
4
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
large d a ta set covering 18 locations across the E astern A tlantic to
check for a n elevated n o n -ran d o m exclusion o f SN Ps w ith a low
M A F. A n un-biased S N P pan el should exhibit a n “ L -shape”
distribution o f M A F categories in dicating a d eq u a te representation
o f low M A F SN Ps [37],
S e q u e n c e P r o c e s s in g a n d A s s e m b ly
S equence cleaning a n d processing identified 5.8% o f the
assigned reads as having a m atch o f a t least 94% identity over
60 base pairs to A tlantic h e rrin g m ito ch o n d rial sequences a n d
these w ere rem oved from the d a ta set. R ep eatM ask er m asked
1.9% o f the d ataset using the zebrafish re p ea t library. T h e
S eqC lean p ro g ra m rem oved a furth er 3.5% o f the assigned reads
due to low -com plexity (n = 7,885), low quality (n = 169) o r being
below the m in im u m re a d length o f 50 bases (n = 13,010). Lastly,
som e reads w ere trim m ed, yielding a total o f 571,731 reads for
sequence clustering a n d assem bly. Initially reads w ere clustered
w ith C L C G enom ics W o rk b en ch (C L C bio, D enm ark), resulting in
16,456 clusters ra nging from 2 0 0 -4 0 0 bp. T hese w ere th en
individually re-assem bled w ith C A P3 resulting in 19,246 contigs
(some clusters p ro d u c ed b y C L C w ere split into two or m ore
contigs) a n d 30,344 singletons o f w hich m ore th a n 50% could be
a n n o ta te d (T able 1). T h e m ajority o f contigs consisted o f less th an
30 reads a n d ra n g ed betw een 100-500 b p (Figure 3C-D).
Results
454 S eq u en cin g
R esults for the sequencing a n d S N P discovery pipeline are
illustrated in Figure 2. A total o f 683,503 cD N A sequences w ere
gen erated from the m ultiplexed A tlantic h e rrin g m uscle library.
T h e reads w ere de-m ultiplexed to assign reads to one o f the eight
sequenced individuals according to their b a rco d in g tag. F or 8% o f
the raw reads no b a rco d in g tag w as identified, while the rem ain in g
629,541 raw reads (average re a d length: 205 bp, Figure 3B)
co n tain ed the 5 ' tag sequence a n d could be allocated to pools p e r
sam ple p e r geographical region (Figure 3A). G eographic pools
ra n g ed from 86,731 (English C hannel) to 187,554 (Barents Sea)
sequences. All 454 sequence d a ta has b e en subm itted to the
Sequence R e a d A rchive (SRA) u n d e r the study accession n u m b er
E R P 0 0 1233
(h t t p : / /w w w .e b i.a c .u k /e n a /d a ta /v ie w /
E R P001233).
SNP D e te c t io n a n d A n n o t a t i o n R esults
S N P discovery w ith GigaBayes d etected 6,331 putative SN Ps in
1,991 separate contigs. T h e p rim a ry a n n o ta tio n o f contig
sequences is sum m arized in T ab le 1 a n d in m ore detail in T able
S I.
cDNA library
4 5 4 GS FLX S e q u e n c in g
683,503 EST read s
D e-m u ltip lex ed by ID
M icrosatelfitE
629,541 EST reads
-►
SNP Identification
R em oval o f mtDNA
6501
P rim e r d e s ig n
1757
592,795 EST reads
R e p e a t m a sk in g a n d
c le a n up
571,731 EST reads
6331
R em o v al o f te rm in a l
SNPs
5338
R em o v al if ADT <0.4
C lu s te r in g ^ s s e m b ly
19,246 contigs and
30,344 singletons
5253
In tro n ^ x o n
s c re e n in g
B last/A n n o tatio n
4018
11,970 contigs and
14,943 singletons
T
I
Final s e le c tio n fo r
g e n o ty p in g
1536
M arker d e v e lo p m e n t
Figure 2. Schematic of transcript assem bly and SNP d etection p ip eline. S c h e m a tic o v e rv ie w w ith n u m b e rs o f re a d s, c o n tig s a n d SNPs
th r o u g h th e tra n s c r ip t a ss e m b ly (c e n tre ) SNP d e te c tio n (rig h t h a n d side) a n d m ic ro sa te llite d e te c tio n (left h a n d side) p ip e lin e s (see te x t fo r m o re
details).
d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 2
PLOS ONE I w w w .plosone.org
5
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
80000
60000
40000
50000 -
1nnnnnn
20000
S
8 K
8
(T>
7
LT)
■
£
§
•7
7
Lfl
N
a
0\
S
?
§
Í
lii
¡5
s
IN
in
fM
S
8
i'"
O'
Í
Ul
IN
m
s
y
7 !
n
n
S e q u e n c e l en g th
14000 -
8000
13500 -
7500
Ul 13000
Ol
IO
7000
6500
O 12500 u
o
2000 ■i
(Ü
1500 ¥
1000 Z
6000
5500
1500
1000
500 -
500
□bn
0 -
r r ~ i
t
O
r 'i' tiH
•h »c-Jm m
Í
i
l
i— t— i— t
I I I I I n -r i
8 8
1/1
IÛ
N
fl)
/s
n
i
í
i i
in
i
in
S R S gî g
Num ber of reads
S S
C on tig le ng th
Figure 3. Sum m ary o f sequence data. A) num ber of sequences successfully barcoded for each of the eight ascertainm ent individuals; and for the
combined data, B) sequence length, C) num ber of reads per contig and D) contig length.
doi:10.1371/journal.pone.0042089.g003
(18.4%) h a d B L A ST hits w hich suggested th a t th ere was no
i n tr o n /exon b o u n d a ry presen t (sum m arised in Figure 2).
S e le c tio n o f C a n d i d a t e SNPs fo r G e n o t y p i n g A ssay
F ro m the 6,331 p re d ic te d SN Ps, 993 (15.6%) w ere located in
the term in al region o f the contigs a n d did n o t have the req u ired
m in im u m o f a 60 b p flanking region to design oligos for the
G o ld en G ate a rra y (Figure 2). O f those rem aining, 85 SN Ps (1.3%)
scored below the m in im u m value (< 0.4) reco m m e n d e d for p rim e r
design a n d w ere n o t considered. 4,104 SN Ps (76.8%) h a d high
Assay D esign Scores (betw een 0.7-1.0) a n d 1,149 SN Ps (21.5%)
h a d acceptable Assay D esign Scores (betw een 0.4-0.7), all 5,253 o f
these w ere taken forw ard to the next stage. O f the putative SNPs
screened for p otential i n tr o n /exon splicing sites w ithin the flanking
regions, 1,235 (23.5%) h a d p u tative i n tr o n /exon boundaries
w ithin the flanking regions, a n d so w ere rejected. T h e m ajority
(3,052, 58.1% ) h a d no m atc h in g B L A ST hits, w hile ju st 966
SNP V alid ation
F ro m the full 1,536 p an el o f SN Ps th a t w ere genotyped, 290
(19%) assays failed to amplify. O f the rem ain in g 1,246 assays, 201
w ere m o n o m o rp h ic (false positives: 13%) 467 p ro d u c ed am bigu­
ous clustering (30%) a n d 578 w ere polym orphic, equivalent to a
conversion ra te o f 38% . F ro m these 578 SN Ps a n open reading
fram e was o b tain e d for 270 o f the respective 121 b p sequences
(SN P a n d 60 b p u p /d o w n stream ), o f w hich 66 w ere suggested be
non-synonym ous, a n d 204 to b e synonym ous, equivalent to a ratio
(non-synonym ous/synonym ous) o f 0.32 (T able S2).
Results on the predictive value o f the SN P selection p aram eters
for assay conversion (i.e. for successful am plification) show th at
inclusion o f all o f the p re d ic to r variables (see m ethods) m arginally
im proves m odel-fitting (x2 = 18.520, d.f. = 9, p < 0 .0 3 0 ). W h en
using backw ard stepwise deletion o f p red icto r variables, the Assay
D esign Score a n d n u m b e r o f ascertainm ent individuals w ith the
m in o r allele w ere identified as the only significant predictors o f
assay conversion, b u t only the Assay D esign Score show ed the
expected positive correlation w ith conversion rate (Table 2). T h e
binom ial logistic regression analysis o n the polym orphic status o f
all successfully am plifying assays show ed th a t w hen all p redictor
variables w ere included, the overall m odel fit was not significant
(X~ = 11.554, d.f. = 9, p = 0.240). Flow ever, n eig h b o u rh o o d se­
T a b le 1. N u m b e r of c o n tig s a n d sin g leto n s o b t a i n e d a n d
successfully a n n o t a t e d .
To tal
A n n o tate d
% An no tated
Contigs
19,246
11,970
62.1
Singletons
30,344
14,943
49.2
To tal
49,590
26,913
54.3
dol:10.1371 /journal.pone.0042089.t001
PLOS ONE I w w w .plosone.org
6
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
quence quality h a d a significant negative correlation w ith
polym orphism . As before a backw ard stepwise deletion ap p ro ac h
was used a n d this red u ced the significantly con trib u tin g predictors
to the n u m b e r individuals in the ascertainm ent p an el w ith the
m in o r allele a n d the n e ig h b o u rh o o d sequence quality w hich, as
expected, respectively show ed positive a n d negative correlation
w ith S N P polym orphism (T able 3).
T a b le 3. Results for SNP d e te c t i o n variables for predicting
SNP a ssay p o ly m o r p h ism following a b a ckw a rd ste p w ise
elim in atio n p ro c e d u re .
E stim ates o f H o a n d H e across the four ascertain m en t sam ples
ra n g ed from 0 .0 0 -0 .6 3 (m ean 0.18) a n d 0 .0 0 -0 .5 0 (m ean 0.18),
respectively (T able S2). O bserved heterozygosity w ithin the four
ascertainm ent populations revealed sim ilar levels o f diversity to the
18 sam pled locations used for the S N P validation [38], T ests for
deviation from H W E for each locus a n d po p u latio n revealed 43
out o f 1,249 p erfo rm ed tests (3.4%) w ith significant deviations
from H W E before co rrection for m ultiple tests. T hese tests w ere
distributed a m o n g all four populations a n d across 35 loci. E ight
tests distributed across three populations a n d seven loci retain ed
significance follow ing co rrection for m ultiple tests ( a = 0.05). D ue
to the presence o f m o n o m o rp h ic loci in the four ascertainm ent
sam ples, 229,094 tests for LD w ere p erfo rm ed o f w hich 352
re m a in e d significant after co rrection for F D R (a = 0.05). O f these,
14 pairs w ere significant in m ore th a n one o f the four populations
b u t in all cases SN Ps o riginated from different contigs suggesting
lack o f close physical linkage. S N P frequency distributions o f M A F
categories in the full pan el o f 18 sam ples indicated little bias due to
n o n -ran d o m selection o f high frequency SN Ps (Figure 4).
T a b le 2. Results for SNP d e te c t i o n variables for predicting
SNP a ssay c o n version following a b ack w ard ste p w ise
e lim in atio n proc edure .
Asc_ind*
-0 .1 6 5
4.67
1
0.031
AD?
0.763
4.785
1
0.0 2 9
C onstant
-0 .3 7 8
1.464
1
0.226
d eg ressio n coefficient for individual variable.
hWald X2 statistic.
A ssociated probability.
Number o f ascertainm ent individuals with th e minor allele.
eAssay Design Score. Significant p-values are shown in bold.
doi:10.1371 /journal.pone.0042089.t002
PLOS ONE I w w w .plosone.org
Asc_incf*
0.249
2.965
1
0.085
NSCt
-0.111
7.321
1
0 .0 0 7
C onstant
0.935
21.137
1
0.000
F o r next g eneration sequencing to be successfully applied to the
developm ent o f genetic resources in non-m odel organism s,
m ethodological issues m ust be addressed to optim ise the p ro c e ­
dures for each project. SN Ps c an b e genom e- o r transcriptom e
derived and, in the latter case, selected from m ore a b u n d a n t or
ra re r expressed transcripts; in addition, m ark er developm ent is
influenced by sequence d e p th a n d contig length due to the
sequencing platfo rm chosen a n d the com plexity o f the hypothesis
to be investigated (i.e. sm aller n u m b e r o f SN Ps re q u ire d for species
identification analysis as c o m p a red to p o p u latio n genetic studies).
T h e choice o f sequencing p latform should reflect the objective o f a
given study. W hile longer reads (e.g. 454 sequencing) are expected
to im prove contig assem bly, m ore, b u t shorter, reads (e.g. Illum ina
sequencing) m ay be p referable in o rd e r to reduce detection o f false
positive SN Ps from h igher alignm ent depth, especially w hen an
existing reference sequence is available. T his study took advantage
o f the longer re a d lengths o b tain e d w ith 454 sequencing in a de
novo assem bly o f a reference scaffold for S N P discovery in herring.
T h e clustering a n d assem bly step is critical for S N P m in in g as it
generates the reference for v a ria n t detection by m ap p in g reads to
the contig. T h erefo re, the absence o f a reference genom e or
transcriptom e poses a challenge for assessing the ‘correctness’ o f a
c ontig assem bly, as p otential m is-assem blies o f sequence due to
hom ologous or paralogous genes can n o t b e directly verified by
back -m ap p in g to the species-specific genom e. G enerally, cluster
assem bly w ith overly stringent p a ram ete rs will lead to splitting
sequences belonging to g eth er into m ore contigs, resulting in a
h igher n u m b e r o f shorter contigs w ith low er coverage depth.
W hilst applying criteria th a t are overly relaxed will assem ble reads
from related genes o r gene families into single contigs, resulting in
a low er n u m b ers o f contigs th a t have a h igher sequence depth,
how ever this increases the likelihood o f m isidentifying polym or­
phism s b etw een paralogous sequence variants (PSVs) as SNPs.
A dditionally, as no genom e reference is available for A tlantic
herring, the o ccurrence o f PSVs can n o t b e assessed, this was
p ro b ab ly the cause for the m ajority o f am biguous clustering th at
was subsequently seen in the SNPs.
F o r the S N P detection, the low sequence d e p th o f the m ajority
o f contigs (Figure 3C) re q u ire d relatively low criteria to be set (i.e.
depth: four reads, redundancy: two observations o f the m in o r
T his study dem onstrates the de novo discovery o f 6,331 putative
SN Ps based on 454 transcriptom e sequencing o f eight individuals
covering the N o rth east A tlantic distribution o f the A tlantic
herring. O f p a rticu la r interest in the a p p ro ac h is the single
validation a n d genotyping step, disposing w ith the traditional step
Pc
Pc
S e q u e n c e A s s e m b ly a n d SNP D e te c t io n
D iscussion
df
df
o f testing each S N P for am plification p rio r to large scale
genotyping (e.g. [39,40]). T h e d a ta gen erated in this study
constitutes a new resource for genetic analysis in A tlantic h e rrin g
significantly increasing the n u m b e r o f know n transcripts as well as
novel SN P a n d m icrosatellite m arkers.
T h e m ajority (99%) o f the 578 m arkers identified as p olym or­
phic in A tlantic h e rrin g also am plified in Pacific herring, b u t only
12% exhibited m ore th a n one allele. O n ly ab o u t 10% o f the 578
SN Ps am plified in anchovy, a n d o f these, only ten loci exhibited
polym orphism .
M satC o m m a n d e r detected 6,501 m icrosatellites w ith a repeat
length o f b etw een two a n d seven bases w ith four or m o re repeat
units in 3,741 contigs (T able 4). 27% o f the m icrosatellites h a d
sufficient suitable flanking sequence to enable the design o f
prim ers. D etails o f the m icrosatellites (num ber a n d type o f repeat,
prim ers, T m a n d % G C ) are listed in T ab le S3.
W a ld b
W a ld b
deg ressio n coefficient for individual variable.
AVald y2 statistic.
Associated probability.
dNumber of ascertainm ent individuals with the minor allele.
N eighbourhood Sequence Quality. Significant p-values are shown in bold.
doi:10.1371 /journal.pone.0042089.t003
C ro s s -s p e c ie s A m plification a n d M icrosatellite D e te c t io n
Ba
Ba
7
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP D iscovery in A tla n tic H erring
0.5
0 .4 -
in
0J
CL
fü
in
tH
0.3 -
Ci
i/i
CL
z
m
'<5
>
u
c
dJ
D
O'
B
0. 2
-
0.1
-
0.0
LD
O
o
»H
LD
t—
1
o
rsi
io
<M
o
m
LO
ro
2
LO
o
LO
?
O
o
d
?
LO
o
o
?
o
t—
1
o
?
LO
T
-1
o
?
o
r\j
o
?
LO
?
o
rg
o
?
LO
rg
o
?
Q
t
o
?
LD
in
o
o
MAF
Figure 4. M in o r allele frequency (MAF) d istrib ution . T h e d is trib u tio n o f th e MAF in 5 7 8 SNPs ty p e d in 18 p o p u la tio n s a c ro ss th e e a s te rn
h e rrin g d is trib u tio n .
d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 4
allele). H ow ever, these low thresholds together w ith the sequencing
o f eight ascertainm ent individuals spanning the entire northeast
A tlantic distribution o f h e rrin g resulted in m inim al ascertainm ent
bias due to exclusion o f low M A F SN Ps (Figure 4). O n e expected
result o f the low d e p th a n d re d u n d an c y p a ram ete rs is, how ever,
the low conversion ra te from the inflated n u m b e r o f candidate
T a b le 4. Type a n d n u m b e r of r e p e a ts of t h e microsatellites
d e t e c t e d in t h e herring c o n tig s using M s a t c o m m a n d e r .
T y p e o f rep ea t
N u m b e r o f repeats
4 -9
1 0 -1 4
1 4 -1 9
To tal
>19
M axim u m
Dinucleotide
4418
505
193
175
75
5291
Trinucleotide
829
35
9
2
36
875
Tetranucleotide
202
13
3
12
31
230
Pentanucleotide
43
1
1
0
17
45
Hexa nucleotide
57
2
0
1
21
60
Total
5549
556
206
190
-
6501
doi:10.1371 /journal.pone.0042089.t004
PLOS ONE I w w w .plosone.org
SN Ps (identified due to sequencing errors). T h e 454 platform specific challenge o f resolving hom opolym eric regions m ay further
have com prom ised SN P detection b y reducing assem bly quality or
calling false SN Ps w ithin these regions [41], b u t such an effect
could n o t be assessed here due to the lack o f a know n reference
sequence.
T h e use o f transcriptom e sequencing in this study has resulted
in only a few p e r cent o f the total genom e bein g covered, b u t a t a
relatively high sequencing d epth, thus lim iting sequencing costs
while achieving the n u m b e r o f SN Ps re q u ire d for custom -designed
S N P assays. A dditionally, transcriptom e sequencing provides
in form ation ab o u t tissue-specific genes a n d their expression
profile, w hich can be used to develop furth er tools for gene
expression studies such as oligonucleotide m icro array o r R N A -seq
approaches.
SNP V alid ation
T h e genotyping o f 1,536 selected S N P assays p erfo rm ed w ith
genom ic D N A for a large p an el o f A tlantic h e rrin g sam ples from
across the n o rth east A tlantic indicated th a t nearly 600 o f the SNPs
are polym orphic (37.6% ). H ow ever, alm ost 49.3% o f the
can d id ate SN Ps failed to work; due to eith er non-am plification
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
(18.9%), false positives (m onom orphic loci) (13.1%) o r am biguous
clustering (17.3%). D espite o u r a tte m p t to screen for p otential
in tr o n /exon splicing sites w ithin flanking regions o f all candidate
SN Ps using available reference genom es, only 41.9% o f all queries
m atc h ed equivalent sequences in a t least one o f the reference
species. T h u s, the presence o f u n d e te cte d introns m ay have
constituted a m ajor cause for genotyping failure [42]. M oreover,
can d id ate SN Ps th a t a p p ea red m o n o m o rp h ic in the large-scale
screening m ight either b e the result o f false-positive predictions or
could indicate real, ra re SN Ps n o t p re sen t in the sam ples tested
[7]. T h e p urely in silico SN P detection m eth o d presen ted in this
study m ay have a relatively low conversion rate to validated SNPs
w hen c o m p a red to o th er m ethods. H ow ever, this m eth o d is still
extrem ely com petitive given a lim ited resource for m ark er
developm ent, once the tim e a n d cost associated w ith designing
a n d o rd e rin g h u n d red s o f prim ers, ru n n in g validation P C R s, a n d
ad ditional S anger sequencing for validation are considered (e.g.
[39,40]). All o f w hich w ould be in add itio n to the cost o f
genotyping the resulting 578 validated SNPs.
In o rd e r to reduce the n u m b e r o f erroneous S N P predictions,
i.e. to increase the p robability o f a n in silico d etected SN P being a
truly polym orphic site, furth er sequencing w ould lead to greater
sequence d e p th o f the contigs, allow ing m ore stringent selection o f
S N P candidates. It has b e en show n for m ultiplexed re-sequencing
th a t m o re th a n 90 % o f the variants can b e d etected correctly using
next g eneration sequencing technologies w h en a n average d e p th o f
a t least 20 reads p e r base is achieved [43,44], Increasing the
average sequence d e p th will also be advantageous for identifying
SN Ps from rarely expressed genes. A n o th e r interesting ap p ro ach ,
recently described by R a ta n et al. [45], suggests a m eth o d to call
SN Ps w ith o u t a reference genom e sequence. S N P calling is
p erfo rm ed w henever new sequences are added; thus, sequencing
continues only as long as need ed to identify an a d eq u a te n u m b er
o f can d id ate SN Ps. T h e m eth o d is re p o rte d to w ork even w hen the
sequence coverage is n o t sufficient for de novo assem bly. A ddition­
ally, the use o f next g eneration sequencing for analysing a
restriction enzym e-generated D N A library (R R L a n d in p a rticu la r
R A D sequencing, for reviews see [46,47]) based o n m ultiple
tagged individuals now enables the fast discovery o f thousands o f
SN Ps in non-m odel organism s w ith no p rio r genom e inform ation
[48,49]. H ow ever, one dow nstream pro b lem identified w ith R A D seq is th a t transferring the SN Ps o nto a h ig h -th ro u g h p u t
genotyping platform is difficult w ith o u t a reference genom e, as
the m ajority o f SN Ps identified do n o t have the 60 b p flanking
sequenced req u ire d for assay design. T his has to som e extent been
solved using P aired E n d R A D (RA D -PE)[50], how ever the
bioinform atic a pproaches for SN P discovery in R A D -P E contigs
are still lim ited. A dditionally, w hile R R L /R A D -se q approaches
elim inate the problem s e n co u n tere d w ith in tr o n /exon b oundaries
th a t a re associated w ith tran scrip to m e sequencing, these m ethods
only consider ra n d o m fragm ents o f the entire genom e, w hereas
o u r transcriptom e based pipeline specifically targets expressed
genes w ith a n increased likelihood for detecting SN Ps (e.g. nonsynonym ous substitutions) associated w ith genom ic regions u n d e r
selection. Such n o n -n e u tra l SN Ps are expected to provide high
discrim inatory pow er a t the p o p u latio n level a n d will constitute a
valuable forensic tool in future applications [47,51]. T h e
com bination o f the coverage a n d S N P discovery rates obtain ed
by R A D -seq, w ith the targ eted red u ctio n o b tain e d by sequencing
the transcriptom e w ould potentially be a very pow erful tool.
H ow ever, it m ust be n o ted th a t due to the ra p id ra te o f technical
developm ents in the field, such as the increased re ad length a n d
decreasing costs o f existing platform s, a n d the poten tial o f n a n o ­
sequencing technology, the best solution reg ard in g platform s a n d
PLOS ONE I w w w .plosone.org
m ethods to optim ise the cost effectiveness for a specific application
needs careful consideration.
W h en determ in in g the predictive value o f the S N P selection
p a ram ete rs for successful am plification o f the in silico detected
SN Ps (Conversion), as expected, a positive correlation was found
w ith the Assay D esign Score, i.e. the likelihood for designing
successful prim ers a ro u n d the S N P position. U nexpectedly, a
negative correlation was found w ith n u m b e r o f ascertainm ent
individuals for w hich the ra re allele was observed, alth o u g h the
reasons b e h in d this correlation are unclear. O verall, only very
w eak predictive variables for Polymorphism w ere identified, w ith
only the n e ig h b o u rh o o d sequence quality significantly explaining
the negative correlation; as the n u m b e r o f m ism atches in flanking
regions increases, a p re d ic te d SN P is m ore likely to be a false
positive. T his increase in m ism atches o f a n aligned region could be
indicative o f erroneous clustering, for exam ple, PSV s o r o th er
sequences w ith differing genom ic origin (this has for exam ple also
b e en seen for hake in a sim ilar study [8]). T h e n u m b e r o f
individuals w ith the m in o r allele in the ascertain m en t p a n el also
show ed a positive correlation w ith Polymorphism. W hile this
p a ra m e te r is less conclusive th a n for pred ictin g Conversion rate,
th ere is potentially a predictive role o f this p a ra m e te r for detecting
tru e SNPs. F u tu re SN P developm ent efforts m ay reduce the false
positive ra te by applying relatively stringent thresholds for this
variable (e.g. having a t least 2 individuals w ith the m in o r allele
rep resen ted in the SN P co n tain in g contig, although this will, o f
course, d e p en d on the size o f the a scertain m en t panel).
T h e two binom ial logistic regression analyses w ere re p ea te d
w ith a red u ced set o f variables representing the strongest a priori
candidates (the n u m b e r o f sequences aligned u n d e r the SN P
position, the frequency o f sequences w ith m in o r allele, the
n e ig h b o u rh o o d sequence quality, the Assay D esign Score, a n d
the outcom e o f the intro n -ex o n b o u n d a ry pipeline). T his also
allow ed controlling for a p otential bias from n o n -in d ep e n d en t
variables such as the tw o intro n -ex o n a n d three m in o r allele
related p aram eters. R esults w ere largely c o n g ru en t confirm ing
Assay D esign Score a n d n eig h b o u rh o o d sequence quality to be the
m ost significant predictors o f Conversion a n d Polymorphism, respec­
tively.
T h e range o f allele frequencies w ithin the SN P p an el suggests
th a t the strategy o f carefully selecting individuals to m axim ise the
geographical, phenotypic a n d genetic diversity covered by the
S N P d evelopm ent sam ples has b e en successful in m inim ising
ascertain m en t bias.
C ross S p e c ie s A m plification a n d M ic rosatellite D e te c t io n
A high p ro p o rtio n o f d etected SN Ps also am plified single P C R
products in Pacific h e rrin g albeit w ith a low polym orphism rate,
w hich is as expected due to th eir developm ent from conserved
genom ic regions. H ow ever, due to the small sam ple size (n = 4),
this n u m b e r is likely to be dow nw ardly biased a n d a m u ch higher
p ro p o rtio n o f SN Ps m ay in fact be polym orphic a n d therefore
prove useful in this species. As expected from the phylogenetics o f
these species, the pro p o rtio n s o f SN P am plification a n d polym or­
phism w ere low er in the anchovy. A dditionally, o u r sequencing
effort has led to the discovery o f a large resource o f m icrosatellite
m arkers, 36% o f w hich have prim ers successfully designed (Table
S3). T hese include b o th n e u tra l loci a n d loci th a t are physically
linked to SN Ps representing genom ic regions th a t have b een
show n to b e u n d e r directional selection [38]. A n o th er a ttrib u te o f
m ulti-allelic m icrosatellite m arkers w h en studying adaptive genetic
variatio n is the increased statistical pow er for detecting balan cin g
selection c o m p a red to bi-allelic m arkers (such as SN Ps, e.g. [52]),
a n d also for applications such as p a re n ta l assignm ent.
9
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
alleles in brackets. Also global estim ates o f observed (Ho) a n d
expected heterozygosity (He) in the four a scertain m en t populations
for each SN P. T h e S /N S colum n denotes w h eth er a S N P was
eith er synonym ous (S) o r non-synonym ous (NS) w ith N A
designating SN Ps w ith no contig m atc h in the B L A ST search
(see text for m ore details).
(XLSX)
C o n c lu s io n
O u r a p p ro ac h o f applying b arco d in g a n d m ultiplexing individ­
uals for large-scale in silico m ining o f transcriptom e sequences
seems to be a very ap p ro p ria te strategy to develop new SN P
m arkers in non-m odel species as it does n o t req u ire costíy a n d
tim e-intensive re-sequencing o f targ e t am plicons necessitating
p rio r know ledge a n d availability o f genom e sequence inform ation.
H ow ever, the p urely in silico based S N P detection com es w ith a
trad e off in the form o f a n expectedly low er conversion ra te in the
final genotyping assay [53]. T h e resu ltan t resources will b e o f
value in on-going analyses o f pop u latio n structuring a n d stock
dynam ics, assays o f adaptive variation, a n d for e n h an cin g the
scope o f m icrosatellite-based studies.
T a b le S3 List o f the m icrosatellites for w hich prim ers w ere
successfully designed, along w ith u p to 200 bases flanking
sequence.
(XLSX)
A ck n ow led gm en ts
S upporting Information
W e would like to thank all the m em bers o f the FishPopTrace Consortium
for their input.
Sam pling was m ade possible by the generous collaboration o f Eero Aro,
Philip Coupland, G eir Dahle, Audrey Geffen, T hom as Gröhsler, Birgitta
Krischansson, C iaran O ’Donnell, H enn Ojaver, G uöm undur Oskarsson,
Iain Penny, Jukka Pönni, Fausto T inti, V éronique Verrez-Bagnis Phil
W atts, and Miroslaw Wyszynski. W e thank Pernille K. A ndersen (Aarhus
University, Denmark) for sequencing an d library m anagem ent.
F ig u re S I Analysis pipeline. T h e p a th o n the left o f the figure
illustrates the pipeline for the genom ic app ro ach , w here h e rrin g
transcripts are direcüy c o m p a red w ith five reference genom es. T h e
p a th on the rig h t o f the figure shows the pipeline for the
transcriptom ic a p p ro ac h , w here h e rrin g transcripts are first
c o m p a red to the transcriptom e o f the five reference species. H its
w ere th en subsequendy m atc h ed to the corresponding genom es o f
the sam e species (see text for m ore de tads).
(TIF)
Author C ontributions
Conceived and designed the experiments: G R C FP SJH DB M IT LB R O
G E M J v H FPT Consortium . Perform ed the experiments: FP R O N R O
SJH M T L . C ontributed reagents/m aterials/analysis tools: FPT C onsor­
tium. W rote the paper: SJH M T L M IT DB G R C FP. C arried out in silico
analyses: FP R O N SJH M T L M IT MB Jv H G E M AC. Analyzed genotype
data: SJH M T L DB M IT . C arried out statistical analysis: SJH M T L DB
LB MB.
T a b le S I N u m b e r o f contigs a n d singletons a n n o ta te d using a
range o f fish a n d h u m a n reference resources a n d databases.
(XLSX)
T a b le S2 List o f the 578 validated polym orphic SN Ps found in
this study, including the 120 b p flanking region, w ith the two SN P
References
1. H e ly a r SJ, H e m m e r -H a n s e n J , B e k k ev o ld D , T a y lo r M I, O g d e n R , e t al. {2011)
A p p lic a tio n o f S N P s fo r p o p u la tio n g e n e tic s o f n o n - m o d e l o rg an ism s: n e w
u n s e q u e n c e d g e n o m e s u s in g s e c o n d g e n e ra tio n h ig h th r o u g h p u t se q u e n c in g
tec h n o lo g y : a p p lie d to tu rk e y . B M C G e n o m ic s: 10: 47 9 .
o p p o rtu n itie s a n d ch a lle n g es. M o le c u la r E c o lo g y R e s o u rc e s 1 1(S1): 123—136.
S ta p le y J , R e g e r J , F e u ln e r P G D , S m a d ja G , G a lin d o J , e t al. (2010) A d a p ta tio n
g e nom ics: th e n e x t g e n e ra tio n . T r e n d s in E c o lo g y & E v o lu tio n 25: 70 5 —712.
3. N ie lse n E E , H e m m e r -H a n s e n J , L a rse n P F , B ek k ev o ld D (2009) P o p u la tio n
g e n o m ic s o f m a r in e fishes: I d e n tify in g a d a p tiv e v a ria tio n in sp a ce a n d tim e.
14. v a n B ers N E , v a n O e r s K , K e rs te n s H H , D ib b its B W , C ro o ijm a n s R P , e t al.
{2010) G e n o m e -w id e S N P d e te c tio n in th e g r e a t tit Parus major u sin g h ig h
th ro u g h p u t se q u en c in g . M o le c u la r E co lo g y : 19{S1): 8 9 —99.
2.
4.
5.
6.
15.
M c P h e rs o n A A , S te p h e n s o n R L , O ’R eilly P T , J o n e s M W , T a g g a r t G T {2001)
G e n e tic d iv ersity o f c o a sta l N o r th w e s t A tla n tic h e rr in g p o p u la tio n s: im p lic a tio n s
fo r m a n a g e m e n t. J o u r n a l o f F ish B io lo g y 59: 35 6 —370.
16. B e k k ev o ld D , A n d r e C , D a h lg re n T G , C la u s e n L A W , T o rs te n s e n E , e t al.
{2005). E n v iro n m e n ta l c o rre la te s o f p o p u la tio n d iffe re n tia tio n in A d a n tic
h e rrin g . E v o lu d o n 59: 2 6 5 6 -2 6 6 8 .
M o le c u la r E c o lo g y 18: 3 1 2 8 -5 0 .
W a p le s R S , P u n t A E , C o p e J M (2 0 0 8 ) I n te g r a t i n g g e n e tic d a ta in to
m a n a g e m e n t o f m a r in e reso u rc e s: h o w c a n w e d o it b e tte r? F ish a n d Fish eries
9: 4 2 3 -4 4 9 .
17. J o rg e n s e n H B H , H a n s e n M M , B ek k ev o ld D , R u z z a n te D E , L o e sc h ck e V {2005)
M a rin e lan d s c a p e s a n d p o p u la tio n g e n e tic s tru c tu re o f h e rr in g {Clupea harengus L.)
in th e B altic Sea. M o le c u la r E c o lo g y 14: 3 2 1 9 -3 2 3 4 .
C o r r ig e n d u m to C o u n c il R e g u la d o n (2008) (EG ) N o 1 0 0 5 /2 0 0 8 o f 29
S e p te m b e r 2 0 0 8 e sta b lish in g a C o m m u n ity sy stem to p re v e n t, d e te r a n d
e lim in a te illegal, u n r e p o r te d a n d u n re g u la te d fish in g , a m e n d in g R e g u la tio n s
(E E C ) N o 2 8 4 7 /9 3 , (EG) N o 1 9 3 6 /2 0 0 1 a n d (EG) N o 6 0 1 /2 0 0 4 a n d re p e a lin g
R e g u la tio n s (EG) N o 1 0 9 3 /9 4 a n d (EG) N o 1 4 4 7 /1 9 9 9 . O ffic ial J o u r n a l o f th e
E u r o p e a n U n io n , L 286.
18.
L a rss o n L G , L a ik re L , P a lm S, A n d r e G , C a r v a lh o G R , e t al. {2007)
C o n c o r d a n c e o f a llo z y m e a n d m ic ro sa te llite d iffe re n tia tio n in a m a r in e fish,
b u t e v id e n c e o f se le c tio n a t a m ic ro sa te llite lo cu s. M o le c u la r E c o lo g y 16: 1135—
1147.
19. G a g g io tti O E , B e k k ev o ld D , J o rg e n s e n H B H , F o il M , C a rv a lh o G R , e t al. {2009)
D is e n ta n g lin g th e effects o f e v o lu tio n a ry , d e m o g ra p h ic , a n d e n v iro n m e n ta l
fac to rs in flu e n cin g g e n e tic s tru c tu re o f n a tu r a l p o p u la tio n s: A d a n tic h e rr in g as a
ca se stu d y . E v o lu tio n 63: 2 9 3 9 -2 9 5 1 .
20. B in la d e n J , G ilb e rt M T , B o llb a c k J P , P a n itz F , B e n d ix e n G , e t al. {2007) T h e use
o f c o d e d P C R p rim e rs e n a b le s h ig h -th ro u g h p u t s e q u e n c in g o f m u ld p le h o m o lo g
a m p lific a d o n p r o d u c ts b y 4 5 4 p a ra lle l s e q u en c in g . P L o S O N E 2: e l9 7 .
F A O F ish eries a n d A q u a c u ltu re R e p o r t N o . 9 7 3 ( F IP I/R 9 7 3 ) R o m e , 31
J a n u a r y - G F e b ru a ry 2 0 1 1 . R e p o r t o f th e tw e n ty -n in th session o f th e c o m m itte e
o n F isheries. A vailab le: h t t p : // w w w .f a o .o r g / d o c r e p / 0 1 4 /i 2 2 8 1 e /i 2 2 8 1 e 0 0 .p d f .
A c c e ssed 2011 F e b 7.
7.
H u b e r t S, H ig g in s B, B o rz a T , B o w m a n S (2010) D e v e lo p m e n t o f a S N P
re s o u rc e a n d a g e n e tic lin k a g e m a p fo r A d a n tic c o d {G adus m o rh u a ). B M C
G e n o m ic s 11: 191.
8. M ila n o I, B a b b u c c i M , P a n itz F, O g d e n R , N ie lse n R O , e t al. {2011) N o v e l tools
fo r c o n s e rv a tio n g e n o m ics: C o m p a rin g tw o h ig h -th ro u g h p u t a p p ro a c h e s fo r
S N P d isc o v e ry in th e tra n s c rip to m e o f th e E u r o p e a n h a k e . P L o S O N E 6:
e2 8 0 0 8 .
9. L i R , F a n W , T ia n G , Z h u H , H e L , e t al. {2010) T h e s e q u e n c e a n d d e n o v o
a sse m b ly o f th e g ia n t p a n d a g e n o m e . N a tu r e 4 6 3 : 3 1 1 —317.
21.
22.
10. A r g o u t X , S a ls e J , A u r y J M , G u iltin a n M J , D r o c G , e t al. {2011) T h e g e n o m e o f
11.
12.
13.
Theobroma cacao. N a tu r e G e n e tic s 43: 101—108.
S ta r B, N e d e r b ra g t A J, J e n to f t S, G r im h o lt U , M a lm s tro m M , e t al. {2011) T h e
g e n o m e s e q u e n c e o f A d a n tic c o d rev e a ls a u n iq u e im m u n e system . N a tu r e 477:
2 0 7 -2 1 0 .
S á n c h e z C G , S m ith T P , W ie d m a n n R T , V a lle jo R L , S a le m M , e t al. {2009)
Single n u c le o d d e p o ly m o rp h is m d isc o v e ry in r a in b o w tr o u t b y d e e p s e q u e n c in g
o f a r e d u c e d re p r e s e n ta tio n lib ra ry . B M C G e n o m ic s: 10: 55 9 .
K e rs te n s H H , G ro o ijm a n s R P , V e e n e n d a a l A , D ib b its B W , G h in -A -W o e n g T F ,
e t al. {2009) L a r g e sc a le sin g le n u c le o tid e p o ly m o r p h is m d isc o v e ry in
PLOS ONE I w w w .plosone.org
23.
L a v o u é S, M iy a M , S a ito h K , Ish ig u ro N B , N is h id a M {2007) P h y lo g e n e d c
r e la d o n s h ip s a m o n g a n c h o v ie s, sa rd in e s, h e rrin g s a n d th e ir rela tiv e s {C lupei­
fo rm es), in fe rre d f ro m w h o le m ito g e n o m e s e q u en c e s. M o le c u la r P h y lo g e n e d c s
a n d E v o lu tio n 43: 1 0 9 6 -1 1 0 5 .
S m it A F A , H u b le y R , G re e n P {1996) RepeatMasker Open-3.0. A v a ila b le h t t p : / /
w w w .r e p e a tm a s k e r .o r g . v e rs io n o p e n - 3 .2 .7 w ith R M d a ta b a s e v e rs io n
20090120.
H u a n g X Q , M a d a n A {1999) C A P 3 : A D N A s e q u e n c e a sse m b ly p r o g ra m .
G e n o m e R e s e a rc h 9: 8 6 8 - 8 7 7 .
24.
M a r th G T , K o r f I, Y a n d e ll M D , Y e h R T , G u Z J, e t al. {1999) A g e n e ra l
a p p ro a c h to sin g le -n u c le o tid e p o ly m o rp h is m d isco v ery . N a tu re G e n e tic s 23:
452M 56.
25. B a i J , L i Q , G o n g R H , S u n W J, L iu J , e t al. {2011) D e v e lo p m e n t a n d
c h a ra c te r iz a tio n o f 6 8 E S T -S S R m a rk e rs in th e P acific o y ster, Crassostrea gigas.
J o u r n a l o f th e W o rld A q u a c u ltu r e S o ciety 42{3): 4 4 4 M 5 5 .
10
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
SNP Discovery in A tla n tic Herring
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.
P a sh le y C H . Ellis J R . M c C a u le y D E , B u rk e J M (2006). E S T d a ta b a s e s as a
s o u rc e fo r m o le c u la r m ark e rs: lesso n s f ro m Helianthus. J o u r n a l o f H e r e d ity 97:
3 8 1 -3 8 8 .
F a irc lo th B C (2008) M S A T C O M M A N D E R : d e te c tio n o f m ic ro sa te llite re p e a t
a rr a y s a n d a u to m a te d , lo c u s -s p e c ific p r im e r d e sig n . M o le c u la r E c o lo g y
R e s o u rc e s 8: 9 2 - 9 4 .
R o z e n S, S kaletsky H J (2000) P rim e r3 o n th e W W W fo r g e n e ra l u sers a n d fo r
b io lo g ist p r o g ra m m e rs . In : B io in fo rm a tic s M e th o d s a n d P ro to co ls: M e th o d s in
M o le c u la r B iology {eds K r a w e tz S, M is e n e r S). H u m a n a P ress, T o to w a , N J.
C a s to e T A , P o o le A W , G u W , J a s o n d e K o n in g A P , D a z a J M , e t al. (2010)
R a p id id e n tific a tio n o f th o u sa n d s o f c o p p e rh e a d sn a k e (.Agkistrodon contortrix)
m ic ro sa te llite loci fro m m o d e s t a m o u n ts o f 4 5 4 s h o tg u n g e n o m e s e q u en c e .
M o le c u la r E c o lo g y R e so u rce slO : 3 4 1 -3 4 7 .
F a n J B , O l i p h a n t A , S h e n R , K e r m a n i B G , G a rc ia F , e t al. (2003) H ig h ly
p a ra lle l S N P g e n o ty p in g . C o ld S p rin g H a r b o r S y m p o sia o n Q u a n tita tiv e
B iology 68: 6 9 - 7 8 .
L i C , O r t i G (2007) M o le c u la r p h y lo g e n y o f C lu p e ifo rm e s (A ctin o p tery g ii)
in fe rre d fro m n u c le a r a n d m ito c h o n d ria l D N A se q u en c e s. M o le c u la r P h y lo g e ­
n etics a n d E v o lu tio n 44: 3 8 6 -3 9 8 .
P e a k a ll R , S m o u se P E (2006) G E N A L E X 6: g e n e d c an aly sis in E xcel.
P o p u la tio n g e n e tic so ftw a re fo r te a c h in g a n d re s e a rc h . M o le c u la r E c o lo g y
N o te s 6: 2 8 8 -2 9 5 .
R o u s s e t F (2008) G E N E P O P 5 0 0 7 : a c o m p le te re -im p le m e n ta tio n o f th e
G E N E P O P so ftw a re fo r W in d o w s a n d L in u x . M o le c u la r E c o lo g y R e so u rc e s
8(1): 1 0 3 -1 0 6 .
B e n ja m in i Y , Y ek u tie li D (2001) T h e c o n tr o l o f th e false d isco v ery r a te in
m u ltip le tes tin g u n d e r d e p e n d e n c y . A n n a ls o f S tatistics 29: 1165—1188.
A lb re c h ts e n A , N ie lse n F C , N ie lse n R (2010) A s c e rta in m e n t B iases in S N P C h ip s
A ffect M e a s u re s o f P o p u la tio n D iv e rg e n c e . M o le c u la r B io lo g y a n d E v o lu tio n 27:
2534—2547.
R o s e n b lu m E B , N o v e m b re J (2007) A s c e r ta in m e n t b ia s in sp a tia lly s tru c tu re d
p o p u la tio n s: A ca se stu d y in th e e a ste rn fen c e liz a rd . J o u r n a l o f H e r e d ity 98:
40.
41.
42.
S eeb J E , P a s c a l C E , G r a u E D , S e e b L W , T e m p lin W D , e t al. (2011)
T ra n s c r ip to m e s e q u e n c in g a n d h ig h -re so lu tio n m e lt analysis a d v a n c e single
n u c le o tid e p o ly m o rp h is m d isc o v e ry in d u p lic a te d sa lm o n id s. M o le c u la r E c o lo g y
R e s o u rc e s 11(S1): 33 5 —348.
M a rg u lie s M , E g h o lm M , A ltm a n W E , A ttiy a S, B a d e r J S , e t al. (2005) G e n o m e
s e q u e n c in g in m ic ro fa b ric a te d h ig h -d e n sity p ic o litre rea c to rs. N a tu re 437: 376—
380.
W a n g S L , S h a Z X , S o n s te g a rd T S , L iu H , X u P , e t al. (2008) Q u a lity
a sse ssm e n t p a ra m e te r s fo r E S T -d e riv e d S N P s f ro m catfish . B M C G e n o m ic s 9:
43.
450.
C r a ig D W , P e a rs o n J V , S z e lin g e r S, S e k a r A , R e d m a n M , e t al. (2008)
I d e n tific a tio n o f g e n e tic v a ria n ts u sin g b a r- c o d e d m u ltip le x e d se q u en c in g .
44.
N a tu r e M e th o d s 5: 8 8 7 -8 9 3 .
H a r is m e n d y O , N g P C , S tra u s b e rg R L , W a n g X , S to ck w ell T B , e t al. (2009)
E v a lu a tio n o f n e x t g e n e ra tio n s e q u e n c in g p la tfo rm s fo r p o p u la tio n ta rg e te d
45.
s e q u e n c in g stu d ies. G e n o m e B iology 10: R 3 2 .
R a t a n A , Y u Z , H a y e s V M , S c h u s te r S C , M ille r W (2010) C a llin g S N P s w ith o u t
a re fe re n c e s e q u en c e . B M C B io in fo rm a tic s 11: 130.
46.
47.
48.
49.
50.
3 3 1 -3 3 6 .
M a r th G T , C z a b a rk a E , M u rv a i J , S h e rry S T (2004) T h e a llele fre q u e n c y
s p e c tru m in g e n o m e -w id e h u m a n v a ria tio n d a ta rev e a ls sig n als o f d iffe re n tia l
d e m o g r a p h ic h isto ry in th re e la rg e w o rld p o p u la tio n s. G e n e tic s 166: 3 5 1 —372.
L im b o rg M T , H e ly a r SJ, d e B ru y n M , T a y lo r M I, N ie lse n E E , e t al. (2012)
E n v iro n m e n ta l se le c tio n o n tra n s c rip to m e -d e riv e d S N P s in a h ig h g e n e flow
m a r in e fish, th e A tla n tic h e rr in g {Clupea harengus). M o le c u la r E c o lo g y doi:
10.1111 / j . 13 6 5 -2 9 4 X .2 0 1 2 .0 5 6 3 9 .x .
G e ra ld e s A , P a n g J , T h ie ss e n N , C e z a rd T , M o o re R , e t al. (2011) S N P
d isc o v e ry in b la c k c o tto n w o o d (.Populous trichocarpa) b y p o p u la tio n tra n s c rip to m e
51.
52.
53.
O g d e n R (2011) U n lo c k in g th e p o te n tia l fo r g e n o m ic tec h n o lo g ie s fo r w ildlife
fo ren sics. M o le c u la r E c o lo g y R e s o u rc e s 11{S1): 109—116.
D a v e y J W , H o h e n lo h e P A , E tte r P D , B o o n e J Q C a tc h e n J M , e t al. (2011)
G e n o m e -w id e g e n e tic m a r k e r d isc o v e ry a n d g e n o ty p in g u s in g n e x t-g e n e ra tio n
s e q u en c in g . N a tu re R e v ie w G e n e tic s 12: 4 9 9 —510.
H o h e n lo h e P A , A m ish SJ, C a tc h e n J M , A lle n d o rf F W , L u ik a rt G (2011) N e x tg e n e ra tio n R A D s e q u e n c in g id e n tifie s th o u s a n d s o f S N P s fo r a sse ssin g
h y b rid iz a tio n b e tw e e n r a in b o w a n d w e stslo p e c u tth r o a t tro u t. M o le c u la r
E c o lo g y R e so u rc e s 11{S1): 1 1 7 -1 2 2 .
V a n T a ssell C P , S m ith T L , M a tu k u m a lli L K , T a y lo r J F , S c h n a b e l R D , e t al.
(2008) S N P d isc o v e ry a n d allele fre q u e n c y e s tim a tio n b y d e e p s e q u e n c in g o f
r e d u c e d re p r e s e n ta tio n lib ra rie s. N a tu re M e th o d s 5: 2 4 7 —252.
E tte r P D , P re sto n J L , B a ssh a m S, C re sk o W A , J o h n s o n E A (2011) L o c a l De Mom
A ssem b ly o f R A D P a ire d - E n d C o n tig s U s in g S h o r t S e q u e n c in g R e a d s . P L oS
O N E 6(4): e l8 5 6 1 .
N ie lse n E , C a r ia n i A , M a c A o id h E , M a e s G , M ila n o I, e t al. (2012) G e n e a sso c ia te d m a rk e rs p ro v id e to o ls fo r ta c k lin g illegal fish in g a n d false ecoc e rtifica tio n . N a tu r e C o m m u n ic a tio n s 3: 851 doi: 1 0 .1 0 3 8 /n c o m m s l8 4 5 .
N a r u m S R , H e ss J E (2011) C o m p a ris o n o f F - S T o u tlie r tests fo r S N P lo ci u n d e r
selectio n . M o le c u la r E c o lo g y R e s o u rc e s 11 (SI): 184—194.
L e p o itte v in C , F r ig e r io J M , G a r n ie r - G é ré P , S a lin F , C e r v e ra M T , e t al. (2010)
In vitro vs in silico d e te c te d S N P s fo r th e d e v e lo p m e n t o f a g e n o ty p in g a rra y : w h a t
c a n w e le a r n f ro m a n o n -m o d e l species? P L o S O n e 5: e l 1034.
res e q u e n c in g . M o le c u la r E c o lo g y R e s o u rc e s 11(S1): 8 1 —92.
PLOS ONE I w w w .plosone.org
11
A ugu st 2012 | V olum e 7 | Issue 8 | e42089
Download