SNP Discovery Using Next Generation Transcriptomic Sequencing in Atlantic Herring (Clupea harengus)

Sarah J. Helyar1,2,*,9, Morten T. Limborg3,9, Dorte Bekkevold3, Massimiliano Babbucci4, Jeroen van Houdt5, Gregory E. Maes5, Luca Bargelloni4, Rasmus O. Nielsen6, Martin I. Taylor1, Rob Ogden7, Alessia Cariani8, Gary R. Carvalho1, FishPopTrace Consortium1, Frank Panitz6

1 Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, College of Natural Sciences, Bangor University, Bangor, Gwynedd, United Kingdom
2 Food Safety, Environment & Genetics, Matis, Reykjavik, Iceland
3 National Institute of Aquatic Resources, Technical University of Denmark, Silkeborg, Denmark
4 Department of Comparative Biomedicine and Food Science, University of Padova, Legnaro, Italy
5 Laboratory of Biodiversity and Evolutionary Genomics, Katholieke Universiteit Leuven, Leuven, Belgium
6 Department of Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, Tjele, Denmark
7 TRACE Wildlife Forensics Network, Royal Zoological Society of Scotland, Edinburgh, United Kingdom
8 Department of Experimental and Evolutionary Biology, University of Bologna, Bologna, Italy

Abstract

The introduction of Next Generation Sequencing (NGS) has revolutionised population genetics, providing studies of non-model species with unprecedented genomic coverage, allowing evolutionary biologists to address questions previously far beyond the reach of available resources. Furthermore, the simple mutation model of Single Nucleotide Polymorphisms (SNPs) permits cost-effective high-throughput genotyping in thousands of individuals simultaneously.
Genomic resources are scarce for the Atlantic herring (Clupea harengus), a small pelagic species that sustains high revenue fisheries. This paper details the development of 578 SNPs using a combined NGS and high-throughput genotyping approach. Eight individuals covering the species distribution in the eastern Atlantic were bar-coded and multiplexed into a single cDNA library and sequenced using the 454 GS FLX platform. SNP discovery was performed by de novo sequence clustering and contig assembly, followed by the mapping of reads against consensus contig sequences. Selection of candidate SNPs for genotyping was conducted using an in silico approach. SNP validation and genotyping were performed simultaneously using an Illumina 1,536 GoldenGate assay. Although the conversion rate of candidate SNPs in the genotyping assay cannot be predicted in advance, this approach has the potential to maximise cost and time efficiencies by avoiding expensive and time-consuming laboratory stages of SNP validation. Additionally, the in silico approach leads to lower ascertainment bias in the resulting SNP panel as marker selection is based only on the ability to design primers and the predicted presence of intron-exon boundaries. Consequently SNPs with a wider spectrum of minor allele frequencies (MAFs) will be genotyped in the final panel.
The genomic resources presented here represent a valuable multi-purpose resource for developing informative marker panels for population discrimination, microarray development and for population genomic studies in the wild.

Citation: Helyar SJ, Limborg MT, Bekkevold D, Babbucci M, van Houdt J, et al. (2012) SNP Discovery Using Next Generation Transcriptomic Sequencing in Atlantic Herring (Clupea harengus). PLoS ONE 7(8): e42089. doi:10.1371/journal.pone.0042089

Editor: Arnar Palsson, University of Iceland, Iceland

Received March 7, 2012; Accepted July 2, 2012; Published August 7, 2012

Copyright: © 2012 Helyar et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The research leading to these results received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no KBBE-212399 (FishPopTrace). In addition, MTL received financial support from the European Commission through the FP6 projects UNCOVER (Contract No. 022717) and RECLAIM (Contract No. 044133). GM is a post-doctoral researcher funded by the Scientific Research Fund Flanders (FWO-Flanders). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: sarah.helyar@matis.is

9 These authors contributed equally to this work.
Introduction

Population genomic approaches have been revolutionised by Next Generation Sequencing (NGS) technologies such as 454 (Roche) and Illumina sequencing. These developments facilitate genome-wide analyses of genetic variation across populations of non-model organisms [1,2], allowing a range of evolutionary questions to be investigated effectively for the first time. Marine fishes are excellent model systems for studying adaptation due to their large geographic ranges that frequently encompass strong environmental gradients and their large population sizes that increase the relative strength of selection over drift [3]. Moreover, many marine fishes are under extreme anthropogenic pressure and there is an urgent need for genomic tools to identify population structure and boundaries to allow effective management [4]. Additionally, the forensic identification of fish and fish products throughout the food processing chain from net to plate would assist in the fight against Illegal, Unreported and Unregulated (IUU) fishing, currently a priority for the European Union [5] and globally [6]. SNPs are the optimal marker for this type of application, but large SNP panels are currently available for few marine fish species (e.g. Atlantic cod (Gadus morhua) [7]; European hake (Merluccius merluccius) [8]). Thus, the development of genomic resources for marine fish is urgently required for evolutionary, conservation and management perspectives.

PLoS ONE | www.plosone.org | August 2012 | Volume 7 | Issue 8 | e42089

The strategy used for SNP development in non-model organisms is dependent on the availability of genomic information from closely related species. If such resources are available, PCR amplicons (homologous to regions in the reference genome) can be sequenced and SNPs identified (however, these are intrinsically limited in the number of SNPs that can be identified). Without a reference genome, three principal strategies for genome-wide SNP discovery can be applied: whole genome sequencing and assembly, genome complexity reduction and sequencing methods (e.g. RRL and RAD-seq) and cDNA sequencing (RNA-seq). While whole genome sequencing has now been completed for species with large complex genomes (for example: panda (Ailuropoda melanoleuca) [9]; cacao (Theobroma cacao) [10]), this remains outside the scope of most studies, as in general the de novo assembly of larger, repeat-rich or polyploid genomes requires additional information (e.g. physical BAC maps or paired-end libraries) and extensive bioinformatic capacity in order to build the large, computationally intensive, structured sequence scaffolds [11]. Genomic libraries which sequence a small fraction of the genome (typically 3-5%) require a high level of coverage for contig assembly and detection of SNP variants (see [12-14] for applications). Deep sequencing of cDNA libraries provides an attractive approach to achieve the high sequence coverage needed for de novo contig assembly and SNP prediction, as only a small percentage of the genome is accounted for by the transcriptome. Another advantage of transcriptome sequencing is the information produced concerning functional genetic variation in specific genes which may be under selection; these can then be targeted to evaluate gene expression profiles. The ability to examine both neutral variation and genomic regions under selection provides researchers with unprecedented tools for understanding local adaptation of wild populations at the molecular level.

Atlantic herring (Clupea harengus) is an abundant and ecologically highly diverse species, occurring with a more or less continuous distribution in the North-Atlantic benthopelagic zone. Habitats are distributed across highly diverse environments, from temperate (33°N) to arctic (80°N) and at salinities from oceanic (~35 ppt) to brackish (down to 3 ppt). In spite of its large ecological range, studies using "neutral" microsatellites have unanimously reported weak population differentiation that is statistically significant only on regional scales [15-17]. However, despite relatively high levels of gene flow among populations, evidence of local adaptation has been identified in the Atlantic herring in the Baltic Sea using microsatellite loci [18,19]. Therefore it is expected that analyses with transcriptome-wide coverage applying hundreds of markers associated with adaptive and neutral variation will provide novel insights into the role of selective and demographic processes in shaping population structure.

We describe transcriptome-based SNP development in Atlantic herring using a Roche 454 GS FLX (hereafter 454) sequencing approach. Our aim was three-fold: 1) to develop a SNP assay exhibiting minimal ascertainment bias across east Atlantic populations, 2) to test the applicability of in silico SNP detection utilizing a combined SNP screening and validation approach as a cost efficient way of obtaining population genomic resources, and 3) to establish a transcriptome resource for tissue-specific gene expression profiling and microarray development. We present, to our knowledge, one of the first studies describing SNP discovery in a non-model marine fish based on transcriptome sequencing using NGS.

Materials and Methods

cDNA Library Construction and 454 Sequencing

SNP development was based on muscle samples from eight fish collected from four locations from across the eastern Atlantic (Figure 1). These locations were chosen to maximise geographic coverage and environmental differences, thereby minimising potential ascertainment bias. Approximately 5 g of muscle tissue was taken from each of two individuals (male and female) from each location, immediately placed in RNAlater (Invitrogen) and, after 12 hours at 4°C, stored at -80°C. Total RNA was extracted using the RNeasy Lipid Tissue Mini Kit (Qiagen). The Oligotex mRNA Mini Kit (Qiagen) was used to isolate mRNA, and non-normalised cDNA was synthesized using the SuperScript Double-Stranded cDNA Synthesis Kit (Invitrogen). A multiplex sequencing library was prepared by pooling equal amounts of cDNA from all eight individuals, where two specific 10-mer barcoding oligonucleotides were ligated to each individual sample to allow post-sequencing identification of sequences (modified from [20]). High-throughput sequencing was performed on a 454 sequencer according to the manufacturer's protocol.

Sequence Processing and Assembly

Sequences were first de-multiplexed using the barcoding tags (sfffile tool, Roche 454 analysis software) and sorted by sample. Mitochondrial sequences were removed from the data set by mapping the reads against the Atlantic herring mitochondrial genome (GenBank accession NC_009577 [21]) using the Roche 454 gsMapper software.
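As an illustration of the de-multiplexing step described above, the sketch below assigns reads to individuals by matching a 10-mer tag at the 5' end of each read. It is not the sfffile implementation: the barcode sequences, the sample names and the function name `demultiplex` are hypothetical, and exact matching with no tolerance for sequencing errors is an assumption made to keep the example short.

```python
from collections import defaultdict

TAG_LENGTH = 10  # 10-mer barcoding oligonucleotides, as stated in the text

# Hypothetical tag table (two tags per individual); the real tag
# sequences are not given in the paper.
BARCODES = {
    "ACGAGTGCGT": "individual_1",
    "ACGCTCGACA": "individual_1",
    "AGACGCACTC": "individual_2",
    "AGCACTGTAG": "individual_2",
}


def demultiplex(reads, barcodes=BARCODES, tag_length=TAG_LENGTH):
    """Group reads by their 5' tag and strip the tag from assigned reads.

    Reads whose first `tag_length` bases match no known tag are kept,
    unchanged, under the key "unassigned".
    """
    pools = defaultdict(list)
    for read in reads:
        tag = read[:tag_length].upper()
        sample = barcodes.get(tag)
        if sample is None:
            pools["unassigned"].append(read)
        else:
            pools[sample].append(read[tag_length:])
    return dict(pools)


if __name__ == "__main__":
    example_reads = [
        "ACGAGTGCGTTTGACCATGG",  # first hypothetical tag of individual_1
        "AGCACTGTAGGGCTTAAC",    # second hypothetical tag of individual_2
        "TTTTTTTTTTGGCTTAAC",    # no recognised tag
    ]
    for sample, sequences in demultiplex(example_reads).items():
        print(sample, len(sequences))
```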
RepeatMasker [22] was used to identify and mask repetitive and low complexity regions within the reads by using the zebrafish (Danio rerio) repeat library. Reads were cleaned for short sequences (<50 bp) and low quality regions using SeqClean (http://compbio.dfci.harvard.edu/tgi/software/). Sequence clustering was performed in two steps: initial clustering was performed using CLC Genomics Workbench (CLC bio, Denmark), and the resulting ace file sequences were then assembled 'per contig' in CAP3 [23]. The consensus sequences for the contigs produced by this assembly were then used as a reference for mapping reads in the subsequent in silico SNP detection.

SNP Detection

To identify candidate SNPs, all contig specific reads from the CAP3 .ace files were re-mapped onto the consensus sequence and candidate SNPs were identified using GigaBayes [24]. This program scans each position of the assembly for the presence of at least two SNP alleles and calculates the probability of a given site being polymorphic using a Bayesian approach. No insertion or deletion variants (InDels) were considered and the polymorphism rate was set to 0.003. A minimum contig depth of four reads covering the polymorphic site and a minimum of two reads for the rare allele were required for a site to be considered as a putative SNP. All contigs containing SNPs were filtered to remove instances in which the alternative allele of the SNP was only identified in a single individual, as these may either represent false positives or may lead to strong ascertainment bias.
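The depth and support criteria above can be written as a simple filter. The sketch below is illustrative only and is not part of GigaBayes: the input format (one `(individual, base)` pair per read covering a site) and the function name are assumptions, the Bayesian polymorphism probability is not modelled, and the example accepts only biallelic sites, which is a simplification.

```python
from collections import Counter, defaultdict

MIN_DEPTH = 4             # reads covering the polymorphic site
MIN_RARE_READS = 2        # reads supporting the rare allele
MIN_RARE_INDIVIDUALS = 2  # rare allele must not be confined to one individual


def passes_snp_filters(observations):
    """Return True if a site meets the depth and support thresholds.

    `observations` is a list of (individual_id, base) pairs, one per read
    covering the site (hypothetical input format).
    """
    if len(observations) < MIN_DEPTH:
        return False

    base_counts = Counter(base for _, base in observations)
    if len(base_counts) != 2:
        return False  # simplification: keep biallelic sites only

    individuals_by_base = defaultdict(set)
    for individual, base in observations:
        individuals_by_base[base].add(individual)

    # The rare allele is the one with fewer reads (ties broken alphabetically).
    rare_base, rare_reads = min(base_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return (
        rare_reads >= MIN_RARE_READS
        and len(individuals_by_base[rare_base]) >= MIN_RARE_INDIVIDUALS
    )
```

Under these assumptions, a site with three reads of one allele from three fish and two reads of the other allele from two different fish is retained, whereas the same site is dropped if both rare-allele reads come from one fish.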
Microsatellite Sequence Screening

Microsatellites are an important resource for smaller scale studies in population genetics, and microsatellites within expressed genomic regions have been shown to produce clearer genotyping results as there are fewer null alleles and stutter bands [25,26]; therefore the contig library developed here was screened to detect repeat regions. Assembled contigs were screened for microsatellite repeats using MsatCommander [27], a Python program which locates microsatellite repeats (di-, tri-, tetra-, penta-, and hexanucleotide repeats) within fasta-formatted sequences or consensus files. MsatCommander then uses Primer3 [28] to screen sequences containing microsatellite loci for high-quality PCR primer sites within the flanking regions for 'potentially amplified loci' (PALs [29]).

Figure 1. Location of the 18 samples used in this study. The eight sequenced ascertainment individuals (2 per location) came from the four sampling sites denoted in red. doi:10.1371/journal.pone.0042089.g001

Contig Annotation

Contigs were annotated using the Basic Local Alignment Search Tool (BLAST) against multiple sequence databases. Blastn searches (E-value cut-off <1.0E-5) were conducted against all annotated transcripts of Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes, Takifugu rubripes, Danio rerio and Homo sapiens available through the Ensembl Genome Browser, and against all unique transcripts for D. rerio, H. sapiens, O. latipes, T. rubripes, Salmo salar, and Oncorhynchus mykiss in the NCBI UniGene database. Blastx searches were conducted (E-value cut-off <1.0E-3) against the UniProtKB/SwissProt and UniProtKB/TrEMBL databases. Lastly, Blastx searches were performed against all annotated proteins from the transcriptomes of G. aculeatus, T. nigroviridis, O. latipes, T. rubripes, D. rerio and H. sapiens available through the Ensembl Genome Browser.

To predict the effect of the mutation underlying each SNP at the amino acid level, a pipeline was developed to predict the reading frame for each SNP-containing contig. All contigs containing SNPs were first blasted against six peptide sequence databases (Ensembl genome assembly for G. aculeatus, T. nigroviridis, O. latipes, T. rubripes, D. rerio and the Swissprot database) using the Blastx function (E-value cut-off <1.0E-3). For each SNP-containing contig the best match was selected and the aligned sections of the query were saved. Subsequently, two 121 bp sequences per SNP (i.e. 60 bp up/down-stream of the SNP position, one sequence for each allele) were produced. These were used in a Blastx analysis against the file retrieved from the peptide sequences (E-value cut-off <1.0E-10), and were then compared to determine if the SNP represented a synonymous or nonsynonymous mutation.

Selection of Candidate SNPs for Genotyping Assay

SNPs were validated following an in silico protocol, aimed at minimising validation costs, whilst also minimising subsequent locus dropout. SNP selection was based on the results from the Illumina Assay Design Tool, detection of putative intron-exon boundaries within the flanking regions of candidate SNPs, and a visual evaluation of the quality of contig sequence alignments. The SNPScore from the Illumina Assay Design Tool (referred to as the Assay Design Score/ADS) utilises factors including template GC content, melting temperature, sequence uniqueness, and self-complementarity to filter the candidate SNPs prior to further inspection. The Assay Design Score (assigned between 0 and 1) is indicative of the ability to design suitable oligos within the 60 bp up/down-stream flanking region, and the expected success of the assay when genotyped with the Illumina GoldenGate chemistry. Following the Illumina guidelines, all SNPs with a score below 0.4 were discarded; SNPs with a score above 0.4 were accepted, with SNPs scoring above 0.7 being used preferentially.

The prediction of intron-exon boundaries within the SNP flanking regions (60 bp up/down-stream of SNP position) was performed using two approaches. The first directly compared SNP-containing contigs against five high quality reference genomes for model fish species (Ensembl genome assembly for G. aculeatus, T. nigroviridis, O. latipes, T. rubripes and D. rerio; see Figure S1, left pipeline), using the Blastn option (E-value cut-off 1.0E-5). Blast results were then parsed via a custom Perl script considering alignment length, start and end point of the alignment to determine the best positive match (further details of the Perl script and workflow are available from the authors on request).
If the 60 bp on both sides of the SNP were present in the alignment, the candidate SNP was considered to be contained within a single exon; otherwise an intron-exon boundary was assumed to be present within the 121 bp assay design region. SNPs were then assigned to one of three categories: having an intron-exon boundary predicted within the flanking region, not having one, or not returning a significant match against any of the five blasted fish genomes. In the second approach, the likelihood of a positive match and the reliability of intron-exon boundary predictions were increased, with SNP-containing contigs used as a query in a Blast search (blastn, E-value cut-off 1.0E-5) against the corresponding transcriptome of the same five reference databases (see above). If the blast search produced a positive result, the matching transcript was downloaded from the Ensembl database, and blasted against its own genome sequence (see Figure S1, right pipeline). Within the downloaded sequence, the nucleotide position corresponding to the candidate SNP in the Atlantic herring sequence was identified based on the start and end positions of the alignment between the original contig and the Ensembl transcript. Using the projected SNP position, the flanking regions were again classified as being located on a single exon, disrupted by an intron, or not having a significant match. Results from the two approaches were compared to obtain a consensus estimate for the likelihood of an intron-exon boundary occurring within the 121 bp assay for each of the candidate SNPs.

Finally, the remaining candidate SNP contigs were visually evaluated using clview (http://compbio.dfci.harvard.edu/tgi/software/) in order to rank putative SNPs within and among contigs.
This was assessed by considering the overall quality of the assembly, the depth and length of alignments, and the number of mismatch sites flanking the SNP. This step was included to increase the likelihood of excluding incorrectly identified SNPs (for example, regions with alternative splicing or erroneous clustering of paralogous sequences). Within each contig, one or two SNPs receiving the highest quality score were considered for further validation (see below).

SNP Validation

Following the pipeline described above, 1,536 high scoring candidate markers were chosen for validation by high throughput genotyping assay. DNA was extracted from fin clips for 626 fish sampled from eighteen sites across the species range in the eastern Atlantic, including twenty fish from each of the four SNP discovery populations (Figure 1). The quality and quantity of DNA was checked using a Nanodrop spectrophotometer, and all samples were standardised to 70 ng/µL. Genotyping was performed using the Illumina GoldenGate platform [30], and was visualised using Illumina's GenomeStudio data analysis software (1.0.2.20706, Illumina Inc.). Only SNP assays showing clear genotype clustering, and individual samples with a call rate above 0.8, were considered for further analysis.

Cross-species Amplification

To assess the utility of developed markers in related species, two species identified from a consensus phylogeny [31], the sister species Pacific herring (C. pallasii) and a more distantly related species, anchovy (Engraulis encrasicolus), were genotyped for the full 1,536 SNP panel.

Statistical Analyses

To assess the predictive value and utility of the different parameters used in the in silico SNP validation pipeline, a binomial logistic regression analysis was conducted. Two categorical variables (Conversion and Polymorphism) were evaluated which describe the outcome of the SNP assay validation; these are expected to depend on a range of candidate predictor variables (see below). Conversion was scored by assigning all 1,536 genotyped SNP assays as either failed (score = 0) or successfully amplified and clustered (score = 1). Polymorphism assigned all the successfully amplified SNP assays into monomorphic (0) or polymorphic (1) categories. Nine variables were then assessed for their predictive value in determining SNP assay conversion and polymorphism: i) number of ascertainment panel individuals supplying sequence reads at the SNP position, ii) number of sequences aligned under SNP position, iii) number of sequences with the minor allele, iv) frequency of sequences with minor allele, v) number of ascertainment individuals with the minor allele, vi) Illumina Assay Design Score (ADS), vii) outcome of the intron-exon boundary pipeline (scored as SNP assay being within a single exon, interrupted by an intron or as having no BLAST match), viii) number of reference species supporting findings from the intron-exon pipeline, and ix) neighbourhood sequence quality (determined by the number of mismatches in the flanking region alignment).

To statistically test the predictive effect of the above variables for both Conversion and Polymorphism, a two-step binomial logistic regression analysis was used as implemented in SPSS v12.0. All variables were included in the initial model, and a backward stepwise deletion approach was used for optimisation, in which the least informative variable is removed sequentially until only significantly contributing variables remain. A Wald χ² statistic was used to estimate the relative contribution from each remaining parameter.

For the successful polymorphic assays, global values of observed (HO) and expected (HE) heterozygosity were estimated for 20 individuals from each of the four ascertainment populations (Figure 1) using GenAlEx 6.4 [32]. For these same populations, deviations from Hardy-Weinberg equilibrium (HWE) and evidence of linkage disequilibrium (LD) were explored using Genepop 4.0 [33]. Significance levels for HWE and LD tests were estimated using an MCMC chain of 10,000 iterations and 20 batches; P-values were adjusted for multiple tests by false discovery rate (FDR) correction following Benjamini & Yekutieli [34].

Lastly, ascertainment bias, resulting from the non-random exclusion of SNPs with a low Minor Allele Frequency (MAF) from the marker panel, may occur due to the small size (n = 8) of the ascertainment panel (compared to the whole population), and the limited geographical coverage (compared to the whole species range). When markers are then genotyped on a much larger sample of individuals, the resulting ascertainment bias [35,36] may affect the estimation of many evolutionary and population genetic parameters [2]. To assess the magnitude of a potential bias, the distribution of MAF in the marker panel was assessed across a large data set covering 18 locations across the Eastern Atlantic to check for an elevated non-random exclusion of SNPs with a low MAF.
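One way to inspect the MAF distribution described above is to compute the minor allele frequency of each SNP from its genotype counts and tally the SNPs into frequency classes. The following is a minimal sketch rather than the authors' procedure: the genotype-count input format, the choice of five equal-width classes and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF of a biallelic SNP from counts of AA, AB and BB genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotyped individuals")
    freq_a = Fraction(2 * n_aa + n_ab, total_alleles)
    return min(freq_a, 1 - freq_a)


def maf_histogram(genotype_counts, n_bins=5):
    """Count SNPs per MAF class between 0 and 0.5.

    With the default of five classes the bins are [0, 0.1), [0.1, 0.2),
    [0.2, 0.3), [0.3, 0.4) and [0.4, 0.5]. Exact fractions are used so
    that values on a class boundary are binned consistently.
    """
    width = Fraction(1, 2) / n_bins
    counts = [0] * n_bins
    for n_aa, n_ab, n_bb in genotype_counts:
        maf = minor_allele_frequency(n_aa, n_ab, n_bb)
        index = min(int(maf / width), n_bins - 1)  # MAF = 0.5 goes in the last bin
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AB, BB) for four SNPs in 20 fish.
    example = [(18, 2, 0), (10, 8, 2), (5, 10, 5), (20, 0, 0)]
    print(maf_histogram(example))
```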
An unbiased SNP panel should exhibit an "L-shape" distribution of MAF categories, indicating adequate representation of low MAF SNPs [37].

Results

454 Sequencing

Results for the sequencing and SNP discovery pipeline are illustrated in Figure 2. A total of 683,503 cDNA sequences were generated from the multiplexed Atlantic herring muscle library. The reads were de-multiplexed to assign reads to one of the eight sequenced individuals according to their barcoding tag. For 8% of the raw reads no barcoding tag was identified, while the remaining 629,541 raw reads (average read length: 205 bp, Figure 3B) contained the 5' tag sequence and could be allocated to pools per sample per geographical region (Figure 3A). Geographic pools ranged from 86,731 (English Channel) to 187,554 (Barents Sea) sequences. All 454 sequence data has been submitted to the Sequence Read Archive (SRA) under the study accession number ERP001233 (http://www.ebi.ac.uk/ena/data/view/ERP001233).

Sequence Processing and Assembly

Sequence cleaning and processing identified 5.8% of the assigned reads as having a match of at least 94% identity over 60 base pairs to Atlantic herring mitochondrial sequences and these were removed from the data set. RepeatMasker masked 1.9% of the dataset using the zebrafish repeat library. The SeqClean program removed a further 3.5% of the assigned reads due to low-complexity (n = 7,885), low quality (n = 169) or being below the minimum read length of 50 bases (n = 13,010). Lastly, some reads were trimmed, yielding a total of 571,731 reads for sequence clustering and assembly. Initially reads were clustered with CLC Genomics Workbench (CLC bio, Denmark), resulting in 16,456 clusters ranging from 200-400 bp. These were then individually re-assembled with CAP3 resulting in 19,246 contigs (some clusters produced by CLC were split into two or more contigs) and 30,344 singletons of which more than 50% could be annotated (Table 1). The majority of contigs consisted of less than 30 reads and ranged between 100-500 bp (Figure 3C-D).

SNP Detection and Annotation Results

SNP discovery with GigaBayes detected 6,331 putative SNPs in 1,991 separate contigs. The primary annotation of contig sequences is summarized in Table 1 and in more detail in Table S1.

[Figure 2 flowchart. Transcript assembly (centre): cDNA library; 454 GS FLX sequencing, 683,503 EST reads; de-multiplexed by ID, 629,541 EST reads; removal of mtDNA, 592,795 EST reads; repeat masking and clean up, 571,731 EST reads; clustering/assembly, 19,246 contigs and 30,344 singletons; Blast/annotation, 11,970 contigs and 14,943 singletons. SNP detection (right): SNP identification, 6,331; removal of terminal SNPs, 5,338; removal if ADT <0.4, 5,253; intron/exon screening, 4,018; final selection for genotyping, 1,536; marker development. Microsatellite detection (left): microsatellites, 6,501; primer design, 1,757.]

Figure 2. Schematic of transcript assembly and SNP detection pipeline. Schematic overview with numbers of reads, contigs and SNPs through the transcript assembly (centre), SNP detection (right hand side) and microsatellite detection (left hand side) pipelines (see text for more details).
doi:10.1371/journal.pone.0042089.g002

PLOS ONE | www.plosone.org | August 2012 | Volume 7 | Issue 8 | e42089

Figure 3. Summary of sequence data. A) number of sequences successfully barcoded for each of the eight ascertainment individuals; and for the combined data, B) sequence length, C) number of reads per contig and D) contig length. doi:10.1371/journal.pone.0042089.g003

Table 1. Number of contigs and singletons obtained and successfully annotated.

             Total     Annotated   % Annotated
Contigs      19,246    11,970      62.1
Singletons   30,344    14,943      49.2
Total        49,590    26,913      54.3

doi:10.1371/journal.pone.0042089.t001

Selection of Candidate SNPs for Genotyping Assay

From the 6,331 predicted SNPs, 993 (15.6%) were located in the terminal region of the contigs and did not have the required minimum of a 60 bp flanking region to design oligos for the GoldenGate array (Figure 2). Of those remaining, 85 SNPs (1.3%) scored below the minimum value (<0.4) recommended for primer design and were not considered. 4,104 SNPs (76.8%) had high Assay Design Scores (between 0.7-1.0) and 1,149 SNPs (21.5%) had acceptable Assay Design Scores (between 0.4-0.7); all 5,253 of these were taken forward to the next stage. Of the putative SNPs screened for potential intron/exon splicing sites within the flanking regions, 1,235 (23.5%) had putative intron/exon boundaries within the flanking regions, and so were rejected. The majority (3,052, 58.1%) had no matching BLAST hits, while just 966 (18.4%) had BLAST hits which suggested that there was no intron/exon boundary present (summarised in Figure 2).

SNP Validation

From the full 1,536 panel of SNPs that were genotyped, 290 (19%) assays failed to amplify. Of the remaining 1,246 assays, 201 were monomorphic (false positives: 13%), 467 produced ambiguous clustering (30%) and 578 were polymorphic, equivalent to a conversion rate of 38%. From these 578 SNPs an open reading frame was obtained for 270 of the respective 121 bp sequences (SNP and 60 bp up/downstream), of which 66 were suggested to be non-synonymous, and 204 to be synonymous, equivalent to a ratio (non-synonymous/synonymous) of 0.32 (Table S2).

Results on the predictive value of the SNP selection parameters for assay conversion (i.e. for successful amplification) show that inclusion of all of the predictor variables (see methods) marginally improves model-fitting (χ2 = 18.520, d.f. = 9, p < 0.030). When using backward stepwise deletion of predictor variables, the Assay Design Score and number of ascertainment individuals with the minor allele were identified as the only significant predictors of assay conversion, but only the Assay Design Score showed the expected positive correlation with conversion rate (Table 2). The binomial logistic regression analysis on the polymorphic status of all successfully amplifying assays showed that when all predictor variables were included, the overall model fit was not significant (χ2 = 11.554, d.f. = 9, p = 0.240). However, neighbourhood sequence quality had a significant negative correlation with polymorphism. As before, a backward stepwise deletion approach was used and this reduced the significantly contributing predictors to the number of individuals in the ascertainment panel with the minor allele and the neighbourhood sequence quality which, as expected, respectively showed positive and negative correlation with SNP polymorphism (Table 3).

Table 2. Results for SNP detection variables for predicting SNP assay conversion following a backward stepwise elimination procedure.

             B(a)      Wald(b)   df   P(c)
Asc_ind(d)   -0.165    4.67      1    0.031
ADS(e)        0.763    4.785     1    0.029
Constant     -0.378    1.464     1    0.226

(a) Regression coefficient for individual variable. (b) Wald χ2 statistic. (c) Associated probability. (d) Number of ascertainment individuals with the minor allele. (e) Assay Design Score. Significant p-values are shown in bold in the original table.
doi:10.1371/journal.pone.0042089.t002

Table 3. Results for SNP detection variables for predicting SNP assay polymorphism following a backward stepwise elimination procedure.

             B(a)      Wald(b)   df   P(c)
Asc_ind(d)    0.249    2.965     1    0.085
NSQ(e)       -0.111    7.321     1    0.007
Constant      0.935    21.137    1    0.000

(a) Regression coefficient for individual variable. (b) Wald χ2 statistic. (c) Associated probability. (d) Number of ascertainment individuals with the minor allele. (e) Neighbourhood Sequence Quality. Significant p-values are shown in bold in the original table.
doi:10.1371/journal.pone.0042089.t003

Estimates of Ho and He across the four ascertainment samples ranged from 0.00-0.63 (mean 0.18) and 0.00-0.50 (mean 0.18), respectively (Table S2). Observed heterozygosity within the four ascertainment populations revealed similar levels of diversity to the 18 sampled locations used for the SNP validation [38]. Tests for deviation from HWE for each locus and population revealed 43 out of 1,249 performed tests (3.4%) with significant deviations from HWE before correction for multiple tests. These tests were distributed among all four populations and across 35 loci. Eight tests distributed across three populations and seven loci retained significance following correction for multiple tests (α = 0.05). Due to the presence of monomorphic loci in the four ascertainment samples, 229,094 tests for LD were performed, of which 352 remained significant after correction for FDR (α = 0.05). Of these, 14 pairs were significant in more than one of the four populations but in all cases SNPs originated from different contigs, suggesting lack of close physical linkage. SNP frequency distributions of MAF categories in the full panel of 18 samples indicated little bias due to non-random selection of high frequency SNPs (Figure 4).

Figure 4. Minor allele frequency (MAF) distribution. The distribution of the MAF in 578 SNPs typed in 18 populations across the eastern herring distribution. doi:10.1371/journal.pone.0042089.g004

Cross-species Amplification and Microsatellite Detection

The majority (99%) of the 578 markers identified as polymorphic in Atlantic herring also amplified in Pacific herring, but only 12% exhibited more than one allele. Only about 10% of the 578 SNPs amplified in anchovy, and of these, only ten loci exhibited polymorphism.

MsatCommander detected 6,501 microsatellites with a repeat length of between two and seven bases with four or more repeat units in 3,741 contigs (Table 4). 27% of the microsatellites had sufficient suitable flanking sequence to enable the design of primers. Details of the microsatellites (number and type of repeat, primers, Tm and %GC) are listed in Table S3.

Table 4. Type and number of repeats of the microsatellites detected in the herring contigs using Msatcommander.

                    Number of repeats
Type of repeat      4-9     10-14   14-19   >19    Maximum   Total
Dinucleotide        4418    505     193     175    75        5291
Trinucleotide       829     35      9       2      36        875
Tetranucleotide     202     13      3       12     31        230
Pentanucleotide     43      1       1       0      17        45
Hexanucleotide      57      2       0       1      21        60
Total               5549    556     206     190    -         6501

doi:10.1371/journal.pone.0042089.t004

Discussion

This study demonstrates the de novo discovery of 6,331 putative SNPs based on 454 transcriptome sequencing of eight individuals covering the Northeast Atlantic distribution of the Atlantic herring. Of particular interest in the approach is the single validation and genotyping step, dispensing with the traditional step of testing each SNP for amplification prior to large scale genotyping (e.g. [39,40]). The data generated in this study constitute a new resource for genetic analysis in Atlantic herring, significantly increasing the number of known transcripts as well as novel SNP and microsatellite markers.

Sequence Assembly and SNP Detection

For next generation sequencing to be successfully applied to the development of genetic resources in non-model organisms, methodological issues must be addressed to optimise the procedures for each project. SNPs can be genome- or transcriptome-derived and, in the latter case, selected from more abundant or rarer expressed transcripts; in addition, marker development is influenced by sequence depth and contig length due to the sequencing platform chosen and the complexity of the hypothesis to be investigated (i.e. smaller number of SNPs required for species identification analysis as compared to population genetic studies). The choice of sequencing platform should reflect the objective of a given study. While longer reads (e.g. 454 sequencing) are expected to improve contig assembly, more, but shorter, reads (e.g. Illumina sequencing) may be preferable in order to reduce detection of false positive SNPs from higher alignment depth, especially when an existing reference sequence is available. This study took advantage of the longer read lengths obtained with 454 sequencing in a de novo assembly of a reference scaffold for SNP discovery in herring.

The clustering and assembly step is critical for SNP mining as it generates the reference for variant detection by mapping reads to the contig. Therefore, the absence of a reference genome or transcriptome poses a challenge for assessing the 'correctness' of a contig assembly, as potential mis-assemblies of sequence due to homologous or paralogous genes cannot be directly verified by back-mapping to the species-specific genome. Generally, cluster assembly with overly stringent parameters will lead to splitting sequences belonging together into more contigs, resulting in a higher number of shorter contigs with lower coverage depth. Conversely, applying criteria that are overly relaxed will assemble reads from related genes or gene families into single contigs, resulting in a lower number of contigs that have a higher sequence depth; however, this increases the likelihood of misidentifying polymorphisms between paralogous sequence variants (PSVs) as SNPs. Additionally, as no genome reference is available for Atlantic herring, the occurrence of PSVs cannot be assessed; this was probably the cause of the majority of ambiguous clustering that was subsequently seen in the SNPs.

For the SNP detection, the low sequence depth of the majority of contigs (Figure 3C) required relatively low criteria to be set (i.e. depth: four reads, redundancy: two observations of the minor allele). However, these low thresholds together with the sequencing of eight ascertainment individuals spanning the entire northeast Atlantic distribution of herring resulted in minimal ascertainment bias due to exclusion of low MAF SNPs (Figure 4). One expected result of the low depth and redundancy parameters is, however, the low conversion rate from the inflated number of candidate SNPs (identified due to sequencing errors).
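The depth and redundancy thresholds described for SNP detection (at least four reads, at least two observations of the minor allele) can be illustrated with a small sketch. This is not GigaBayes and makes no claim about how that program applies its criteria; it is a hypothetical per-column check, where `is_candidate_snp` and its parameter names are invented for illustration and the input is assumed to be the string of base calls aligned at one contig position.

```python
from collections import Counter

MIN_DEPTH = 4        # minimum number of reads covering the position
MIN_MINOR_OBS = 2    # minimum observations of the second most common base


def is_candidate_snp(column_bases, min_depth=MIN_DEPTH,
                     min_minor_obs=MIN_MINOR_OBS):
    """Apply depth and redundancy thresholds to one alignment column.

    Only A, C, G and T calls are counted; gaps and ambiguous calls are ignored.
    """
    counts = Counter(b for b in column_bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth or len(counts) < 2:
        return False
    # Count of the second most frequent base, i.e. the putative minor allele.
    minor_obs = counts.most_common(2)[1][1]
    return minor_obs >= min_minor_obs


if __name__ == "__main__":
    for column in ("AAGG", "AAAG", "AGG", "AAAAAA"):
        print(column, is_candidate_snp(column))
```

With thresholds this low, a column such as `AAGG` passes even though two erroneous base calls could produce it, which is consistent with the inflated candidate list discussed above.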
The 454 platform-specific challenge of resolving homopolymeric regions may further have compromised SNP detection by reducing assembly quality or calling false SNPs within these regions [41], but such an effect could not be assessed here due to the lack of a known reference sequence.

The use of transcriptome sequencing in this study has resulted in only a few per cent of the total genome being covered, but at a relatively high sequencing depth, thus limiting sequencing costs while achieving the number of SNPs required for custom-designed SNP assays. Additionally, transcriptome sequencing provides information about tissue-specific genes and their expression profile, which can be used to develop further tools for gene expression studies such as oligonucleotide microarray or RNA-seq approaches.

SNP Validation

The genotyping of 1,536 selected SNP assays performed with genomic DNA for a large panel of Atlantic herring samples from across the northeast Atlantic indicated that nearly 600 of the SNPs are polymorphic (37.6%). However, almost 49.3% of the candidate SNPs failed to work, due to either non-amplification (18.9%), false positives (monomorphic loci) (13.1%) or ambiguous clustering (17.3%). Despite our attempt to screen for potential intron/exon splicing sites within flanking regions of all candidate SNPs using available reference genomes, only 41.9% of all queries matched equivalent sequences in at least one of the reference species. Thus, the presence of undetected introns may have constituted a major cause for genotyping failure [42].
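As a worked check of the genotyping outcomes, the sketch below recomputes each outcome's share of the panel from the assay counts given in the Results (290 failed to amplify, 201 monomorphic, 467 ambiguous clustering and 578 polymorphic, out of 1,536). The names `OUTCOMES` and `outcome_shares` are illustrative only, and expressing every share relative to the full panel is an assumption of this sketch.

```python
PANEL_SIZE = 1536

# Assay counts per genotyping outcome, taken from the Results section.
OUTCOMES = {
    "failed to amplify": 290,
    "monomorphic (false positive)": 201,
    "ambiguous clustering": 467,
    "polymorphic": 578,
}


def outcome_shares(outcomes, panel_size):
    """Return each outcome as a percentage of the genotyped panel."""
    if sum(outcomes.values()) != panel_size:
        raise ValueError("outcome counts do not add up to the panel size")
    return {name: 100 * count / panel_size for name, count in outcomes.items()}


if __name__ == "__main__":
    for name, pct in outcome_shares(OUTCOMES, PANEL_SIZE).items():
        print(f"{name}: {pct:.1f}%")
```

Computed this way, the polymorphic share of about 37.6% is the conversion rate, and ambiguous clustering accounts for roughly 30% of the panel, as stated in the Results.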
Moreover, candidate SNPs that appeared monomorphic in the large-scale screening might either be the result of false-positive predictions or could indicate real, rare SNPs not present in the samples tested [7].

The purely in silico SNP detection method presented in this study may have a relatively low conversion rate to validated SNPs when compared to other methods. However, this method is still extremely competitive given a limited resource for marker development, once the time and cost associated with designing and ordering hundreds of primers, running validation PCRs, and additional Sanger sequencing for validation are considered (e.g. [39,40]), all of which would be in addition to the cost of genotyping the resulting 578 validated SNPs. In order to reduce the number of erroneous SNP predictions, i.e. to increase the probability of an in silico detected SNP being a truly polymorphic site, further sequencing would lead to greater sequence depth of the contigs, allowing more stringent selection of SNP candidates. It has been shown for multiplexed re-sequencing that more than 90% of the variants can be detected correctly using next generation sequencing technologies when an average depth of at least 20 reads per base is achieved [43,44]. Increasing the average sequence depth will also be advantageous for identifying SNPs from rarely expressed genes.

Another interesting approach, recently described by Ratan et al. [45], suggests a method to call SNPs without a reference genome sequence. SNP calling is performed whenever new sequences are added; thus, sequencing continues only as long as needed to identify an adequate number of candidate SNPs. The method is reported to work even when the sequence coverage is not sufficient for de novo assembly.
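The link between sequence depth and the stringency that can be afforded when selecting SNP candidates can be made concrete with a simple binomial sketch. It assumes, purely for illustration, that each read covering a site carries the minor allele independently with a fixed probability (`minor_read_fraction`); real transcriptome data violate this through unequal expression and unequal representation of individuals. The function name and parameters are hypothetical.

```python
from math import comb


def prob_minor_allele_seen(depth, minor_read_fraction, min_obs=2):
    """Probability that at least `min_obs` of `depth` reads show the minor allele,
    under a binomial model with per-read probability `minor_read_fraction`."""
    p = minor_read_fraction
    # Probability of observing fewer than min_obs minor-allele reads.
    p_fewer = sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k) for k in range(min_obs)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    for depth in (4, 20):
        prob = prob_minor_allele_seen(depth, 0.5)
        print(f"depth {depth}: P(at least 2 minor-allele reads) = {prob:.4f}")
```

Under these assumptions, an allele present in half of the reads meets the two-observation criterion with probability 0.6875 at a depth of four reads, and almost always at a depth of 20. This is why deeper coverage leaves room for stricter redundancy thresholds without discarding true variants.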
Additionally, the use of next generation sequencing for analysing a restriction enzyme-generated DNA library (RRL and in particular RAD sequencing, for reviews see [46,47]) based on multiple tagged individuals now enables the fast discovery of thousands of SNPs in non-model organisms with no prior genome information [48,49]. However, one downstream problem identified with RAD-seq is that transferring the SNPs onto a high-throughput genotyping platform is difficult without a reference genome, as the majority of SNPs identified do not have the 60 bp flanking sequence required for assay design. This has to some extent been solved using Paired End RAD (RAD-PE) [50]; however, the bioinformatic approaches for SNP discovery in RAD-PE contigs are still limited. Additionally, while RRL/RAD-seq approaches eliminate the problems encountered with intron/exon boundaries that are associated with transcriptome sequencing, these methods only consider random fragments of the entire genome, whereas our transcriptome-based pipeline specifically targets expressed genes with an increased likelihood for detecting SNPs (e.g. nonsynonymous substitutions) associated with genomic regions under selection. Such non-neutral SNPs are expected to provide high discriminatory power at the population level and will constitute a valuable forensic tool in future applications [47,51]. The combination of the coverage and SNP discovery rates obtained by RAD-seq with the targeted reduction obtained by sequencing the transcriptome would potentially be a very powerful tool.
However, it must be noted that due to the rapid rate of technical developments in the field, such as the increased read length and decreasing costs of existing platforms, and the potential of nano-sequencing technology, the best solution regarding platforms and methods to optimise the cost effectiveness for a specific application needs careful consideration.

When determining the predictive value of the SNP selection parameters for successful amplification of the in silico detected SNPs (Conversion), as expected, a positive correlation was found with the Assay Design Score, i.e. the likelihood for designing successful primers around the SNP position. Unexpectedly, a negative correlation was found with number of ascertainment individuals for which the rare allele was observed, although the reasons behind this correlation are unclear. Overall, only very weak predictive variables for Polymorphism were identified, with only the neighbourhood sequence quality significantly explaining the negative correlation; as the number of mismatches in flanking regions increases, a predicted SNP is more likely to be a false positive. This increase in mismatches of an aligned region could be indicative of erroneous clustering, for example, PSVs or other sequences with differing genomic origin (this has for example also been seen for hake in a similar study [8]). The number of individuals with the minor allele in the ascertainment panel also showed a positive correlation with Polymorphism. While this parameter is less conclusive than for predicting Conversion rate, there is potentially a predictive role of this parameter for detecting true SNPs. Future SNP development efforts may reduce the false positive rate by applying relatively stringent thresholds for this variable (e.g. having at least 2 individuals with the minor allele represented in the SNP containing contig, although this will, of course, depend on the size of the ascertainment panel).

The two binomial logistic regression analyses were repeated with a reduced set of variables representing the strongest a priori candidates (the number of sequences aligned under the SNP position, the frequency of sequences with minor allele, the neighbourhood sequence quality, the Assay Design Score, and the outcome of the intron-exon boundary pipeline). This also allowed controlling for a potential bias from non-independent variables such as the two intron-exon and three minor allele related parameters. Results were largely congruent, confirming Assay Design Score and neighbourhood sequence quality to be the most significant predictors of Conversion and Polymorphism, respectively. The range of allele frequencies within the SNP panel suggests that the strategy of carefully selecting individuals to maximise the geographical, phenotypic and genetic diversity covered by the SNP development samples has been successful in minimising ascertainment bias.

Cross Species Amplification and Microsatellite Detection

A high proportion of detected SNPs also amplified single PCR products in Pacific herring albeit with a low polymorphism rate, which is as expected due to their development from conserved genomic regions. However, due to the small sample size (n = 4), this number is likely to be downwardly biased and a much higher proportion of SNPs may in fact be polymorphic and therefore prove useful in this species. As expected from the phylogenetics of these species, the proportions of SNP amplification and polymorphism were lower in the anchovy.
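The downward bias expected from genotyping only four Pacific herring can be illustrated with a short calculation. The sketch assumes diploid individuals whose 2n allele copies are drawn independently from a population with a given minor allele frequency; a locus is scored as polymorphic only if both alleles appear in the sample. The function name is hypothetical and the model is a simplification introduced here, not an analysis from the study.

```python
def prob_both_alleles_sampled(maf, n_individuals):
    """Probability that a biallelic SNP appears polymorphic in a sample of
    diploid individuals, assuming independent draws of 2n allele copies."""
    copies = 2 * n_individuals
    # Subtract the two ways the sample can be fixed for a single allele.
    return 1.0 - (1.0 - maf) ** copies - maf**copies


if __name__ == "__main__":
    for maf in (0.05, 0.20, 0.50):
        prob = prob_both_alleles_sampled(maf, n_individuals=4)
        print(f"MAF {maf:.2f}: P(scored polymorphic with n = 4) = {prob:.3f}")
```

Under these assumptions, a SNP with a true minor allele frequency of 0.05 would be scored as polymorphic in only about a third of samples of four individuals, so many genuinely variable loci would appear monomorphic.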
Additionally, our sequencing effort has led to the discovery of a large resource of microsatellite markers, 36% of which have primers successfully designed (Table S3). These include both neutral loci and loci that are physically linked to SNPs representing genomic regions that have been shown to be under directional selection [38]. Another attribute of multi-allelic microsatellite markers when studying adaptive genetic variation is the increased statistical power for detecting balancing selection compared to bi-allelic markers (such as SNPs, e.g. [52]), and also for applications such as parental assignment.

Conclusion

Our approach of applying barcoding and multiplexing individuals for large-scale in silico mining of transcriptome sequences seems to be a very appropriate strategy to develop new SNP markers in non-model species as it does not require costly and time-intensive re-sequencing of target amplicons necessitating prior knowledge and availability of genome sequence information. However, the purely in silico based SNP detection comes with a trade-off in the form of an expectedly lower conversion rate in the final genotyping assay [53]. The resultant resources will be of value in on-going analyses of population structuring and stock dynamics, assays of adaptive variation, and for enhancing the scope of microsatellite-based studies.

Supporting Information

Figure S1 Analysis pipeline. The path on the left of the figure illustrates the pipeline for the genomic approach, where herring transcripts are directly compared with five reference genomes. The path on the right of the figure shows the pipeline for the transcriptomic approach, where herring transcripts are first compared to the transcriptome of the five reference species. Hits were then subsequently matched to the corresponding genomes of the same species (see text for more details). (TIF)

Table S1 Number of contigs and singletons annotated using a range of fish and human reference resources and databases. (XLSX)

Table S2 List of the 578 validated polymorphic SNPs found in this study, including the 120 bp flanking region, with the two SNP alleles in brackets. Also global estimates of observed (Ho) and expected heterozygosity (He) in the four ascertainment populations for each SNP. The S/NS column denotes whether a SNP was either synonymous (S) or non-synonymous (NS) with NA designating SNPs with no contig match in the BLAST search (see text for more details). (XLSX)

Table S3 List of the microsatellites for which primers were successfully designed, along with up to 200 bases flanking sequence. (XLSX)

Acknowledgments

We would like to thank all the members of the FishPopTrace Consortium for their input. Sampling was made possible by the generous collaboration of Eero Aro, Philip Coupland, Geir Dahle, Audrey Geffen, Thomas Gröhsler, Birgitta Krischansson, Ciaran O'Donnell, Henn Ojaver, Guðmundur Oskarsson, Iain Penny, Jukka Pönni, Fausto Tinti, Véronique Verrez-Bagnis, Phil Watts, and Miroslaw Wyszynski. We thank Pernille K. Andersen (Aarhus University, Denmark) for sequencing and library management.

Author Contributions

Conceived and designed the experiments: GRC FP SJH DB MIT LB RO GEM JvH FPT Consortium. Performed the experiments: FP RON RO SJH MTL. Contributed reagents/materials/analysis tools: FPT Consortium. Wrote the paper: SJH MTL MIT DB GRC FP. Carried out in silico analyses: FP RON SJH MTL MIT MB JvH GEM AC. Analyzed genotype data: SJH MTL DB MIT. Carried out statistical analysis: SJH MTL DB LB MB.

References

1. Helyar SJ, Hemmer-Hansen J, Bekkevold D, Taylor MI, Ogden R, et al. (2011) Application of SNPs for population genetics of non-model organisms: new opportunities and challenges. Molecular Ecology Resources 11(S1): 123-136.
2. Stapley J, Reger J, Feulner PGD, Smadja C, Galindo J, et al. (2010) Adaptation genomics: the next generation. Trends in Ecology & Evolution 25: 705-712.
3. Nielsen EE, Hemmer-Hansen J, Larsen PF, Bekkevold D (2009) Population genomics of marine fishes: Identifying adaptive variation in space and time. Molecular Ecology 18: 3128-50.
4. Waples RS, Punt AE, Cope JM (2008) Integrating genetic data into management of marine resources: how can we do it better? Fish and Fisheries 9: 423-449.
5. Corrigendum to Council Regulation (2008) (EC) No 1005/2008 of 29 September 2008 establishing a Community system to prevent, deter and eliminate illegal, unreported and unregulated fishing, amending Regulations (EEC) No 2847/93, (EC) No 1936/2001 and (EC) No 601/2004 and repealing Regulations (EC) No 1093/94 and (EC) No 1447/1999. Official Journal of the European Union, L 286.
6. FAO Fisheries and Aquaculture Report No. 973 (FIPI/R973) Rome, 31 January - 4 February 2011. Report of the twenty-ninth session of the committee on Fisheries. Available: http://www.fao.org/docrep/014/i2281e/i2281e00.pdf. Accessed 2011 Feb 7.
7. Hubert S, Higgins B, Borza T, Bowman S (2010) Development of a SNP resource and a genetic linkage map for Atlantic cod (Gadus morhua). BMC Genomics 11: 191.
8. Milano I, Babbucci M, Panitz F, Ogden R, Nielsen RO, et al. (2011) Novel tools for conservation genomics: Comparing two high-throughput approaches for SNP discovery in the transcriptome of the European hake. PLoS ONE 6: e28008.
9. Li R, Fan W, Tian G, Zhu H, He L, et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463: 311-317.
10. Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, et al. (2011) The genome of Theobroma cacao. Nature Genetics 43: 101-108.
11. Star B, Nederbragt AJ, Jentoft S, Grimholt U, Malmstrom M, et al. (2011) The genome sequence of Atlantic cod reveals a unique immune system. Nature 477: 207-210.
12. Sánchez CC, Smith TP, Wiedmann RT, Vallejo RL, Salem M, et al. (2009) Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics 10: 559.
13. Kerstens HH, Crooijmans RP, Veenendaal A, Dibbits BW, Chin-A-Woeng TF, et al. (2009) Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: applied to turkey. BMC Genomics 10: 479.
14. van Bers NE, van Oers K, Kerstens HH, Dibbits BW, Crooijmans RP, et al. (2010) Genome-wide SNP detection in the great tit Parus major using high throughput sequencing. Molecular Ecology 19(S1): 89-99.
15. McPherson AA, Stephenson RL, O'Reilly PT, Jones MW, Taggart CT (2001) Genetic diversity of coastal Northwest Atlantic herring populations: implications for management. Journal of Fish Biology 59: 356-370.
16. Bekkevold D, Andre C, Dahlgren TG, Clausen LAW, Torstensen E, et al. (2005) Environmental correlates of population differentiation in Atlantic herring. Evolution 59: 2656-2668.
17. Jorgensen HBH, Hansen MM, Bekkevold D, Ruzzante DE, Loeschcke V (2005) Marine landscapes and population genetic structure of herring (Clupea harengus L.) in the Baltic Sea. Molecular Ecology 14: 3219-3234.
18. Larsson LC, Laikre L, Palm S, Andre C, Carvalho GR, et al. (2007) Concordance of allozyme and microsatellite differentiation in a marine fish, but evidence of selection at a microsatellite locus. Molecular Ecology 16: 1135-1147.
19. Gaggiotti OE, Bekkevold D, Jorgensen HBH, Foll M, Carvalho GR, et al. (2009) Disentangling the effects of evolutionary, demographic, and environmental factors influencing genetic structure of natural populations: Atlantic herring as a case study. Evolution 63: 2939-2951.
20. Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, et al. (2007) The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE 2: e197.
21. Lavoué S, Miya M, Saitoh K, Ishiguro NB, Nishida M (2007) Phylogenetic relationships among anchovies, sardines, herrings and their relatives (Clupeiformes), inferred from whole mitogenome sequences. Molecular Phylogenetics and Evolution 43: 1096-1105.
22. Smit AFA, Hubley R, Green P (1996) RepeatMasker Open-3.0. Available: http://www.repeatmasker.org. Version open-3.2.7 with RM database version 20090120.
23. Huang XQ, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Research 9: 868-877.
24. Marth GT, Korf I, Yandell MD, Yeh RT, Gu ZJ, et al. (1999) A general approach to single-nucleotide polymorphism discovery. Nature Genetics 23: 452-456.
25. Bai J, Li Q, Cong RH, Sun WJ, Liu J, et al. (2011) Development and characterization of 68 EST-SSR markers in the Pacific oyster, Crassostrea gigas. Journal of the World Aquaculture Society 42(3): 444-455.
26. Pashley CH, Ellis JR, McCauley DE, Burke JM (2006) EST databases as a source for molecular markers: lessons from Helianthus. Journal of Heredity 97: 381-388.
27. Faircloth BC (2008) MSATCOMMANDER: detection of microsatellite repeat arrays and automated, locus-specific primer design. Molecular Ecology Resources 8: 92-94.
28. Rozen S, Skaletsky HJ (2000) Primer3 on the WWW for general users and for biologist programmers. In: Bioinformatics Methods and Protocols: Methods in Molecular Biology (eds Krawetz S, Misener S). Humana Press, Totowa, NJ.
29. Castoe TA, Poole AW, Gu W, Jason de Koning AP, Daza JM, et al. (2010) Rapid identification of thousands of copperhead snake (Agkistrodon contortrix) microsatellite loci from modest amounts of 454 shotgun genome sequence. Molecular Ecology Resources 10: 341-347.
30. Fan JB, Oliphant A, Shen R, Kermani BG, Garcia F, et al. (2003) Highly parallel SNP genotyping. Cold Spring Harbor Symposia on Quantitative Biology 68: 69-78.
31. Li C, Orti G (2007) Molecular phylogeny of Clupeiformes (Actinopterygii) inferred from nuclear and mitochondrial DNA sequences. Molecular Phylogenetics and Evolution 44: 386-398.
32. Peakall R, Smouse PE (2006) GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes 6: 288-295.
33. Rousset F (2008) GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources 8(1): 103-106.
34. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29: 1165-1188.
35. Albrechtsen A, Nielsen FC, Nielsen R (2010) Ascertainment biases in SNP chips affect measures of population divergence. Molecular Biology and Evolution 27: 2534-2547.
36. Rosenblum EB, Novembre J (2007) Ascertainment bias in spatially structured populations: A case study in the eastern fence lizard. Journal of Heredity 98: 331-336.
37. Marth GT, Czabarka E, Murvai J, Sherry ST (2004) The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351-372.
38. Limborg MT, Helyar SJ, de Bruyn M, Taylor MI, Nielsen EE, et al. (2012) Environmental selection on transcriptome-derived SNPs in a high gene flow marine fish, the Atlantic herring (Clupea harengus). Molecular Ecology doi: 10.1111/j.1365-294X.2012.05639.x.
39. Geraldes A, Pang J, Thiessen N, Cezard T, Moore R, et al. (2011) SNP discovery in black cottonwood (Populus trichocarpa) by population transcriptome
40. Seeb JE, Pascal CE, Grau ED, Seeb LW, Templin WD, et al. (2011) Transcriptome sequencing and high-resolution melt analysis advance single nucleotide polymorphism discovery in duplicated salmonids. Molecular Ecology Resources 11(S1): 335-348.
41. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.
42. Wang SL, Sha ZX, Sonstegard TS, Liu H, Xu P, et al. (2008) Quality assessment parameters for EST-derived SNPs from catfish. BMC Genomics 9: 450.
43. Craig DW, Pearson JV, Szelinger S, Sekar A, Redman M, et al. (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nature Methods 5: 887-893.
44. Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, et al. (2009) Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology 10: R32.
45. Ratan A, Yu Z, Hayes VM, Schuster SC, Miller W (2010) Calling SNPs without a reference sequence. BMC Bioinformatics 11: 130.
46. Ogden R (2011) Unlocking the potential for genomic technologies for wildlife forensics. Molecular Ecology Resources 11(S1): 109-116.
47. Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, et al. (2011) Genome-wide genetic marker discovery and genotyping using next-generation sequencing. Nature Reviews Genetics 12: 499-510.
48. Hohenlohe PA, Amish SJ, Catchen JM, Allendorf FW, Luikart G (2011) Next-generation RAD sequencing identifies thousands of SNPs for assessing hybridization between rainbow and westslope cutthroat trout.
M o le c u la r E c o lo g y R e so u rc e s 11{S1): 1 1 7 -1 2 2 . V a n T a ssell C P , S m ith T L , M a tu k u m a lli L K , T a y lo r J F , S c h n a b e l R D , e t al. (2008) S N P d isc o v e ry a n d allele fre q u e n c y e s tim a tio n b y d e e p s e q u e n c in g o f r e d u c e d re p r e s e n ta tio n lib ra rie s. N a tu re M e th o d s 5: 2 4 7 —252. E tte r P D , P re sto n J L , B a ssh a m S, C re sk o W A , J o h n s o n E A (2011) L o c a l De Mom A ssem b ly o f R A D P a ire d - E n d C o n tig s U s in g S h o r t S e q u e n c in g R e a d s . P L oS O N E 6(4): e l8 5 6 1 . N ie lse n E , C a r ia n i A , M a c A o id h E , M a e s G , M ila n o I, e t al. (2012) G e n e a sso c ia te d m a rk e rs p ro v id e to o ls fo r ta c k lin g illegal fish in g a n d false ecoc e rtifica tio n . N a tu r e C o m m u n ic a tio n s 3: 851 doi: 1 0 .1 0 3 8 /n c o m m s l8 4 5 . N a r u m S R , H e ss J E (2011) C o m p a ris o n o f F - S T o u tlie r tests fo r S N P lo ci u n d e r selectio n . M o le c u la r E c o lo g y R e s o u rc e s 11 (SI): 184—194. L e p o itte v in C , F r ig e r io J M , G a r n ie r - G é ré P , S a lin F , C e r v e ra M T , e t al. (2010) In vitro vs in silico d e te c te d S N P s fo r th e d e v e lo p m e n t o f a g e n o ty p in g a rra y : w h a t c a n w e le a r n f ro m a n o n -m o d e l species? P L o S O n e 5: e l 1034. res e q u e n c in g . M o le c u la r E c o lo g y R e s o u rc e s 11(S1): 8 1 —92. PLOS ONE I w w w .plosone.org 11 A ugu st 2012 | V olum e 7 | Issue 8 | e42089