(Clupea harengus) Sequencing in Atlantic Herring

OPEN 0 ACCESS Freely available online SNP Discovery Using Next Generation Transcriptomic Sequencing in Atlantic Herring (Clupea harengus) Sarah J. Helyar1'2*9, M orten T. Limborg39, Dorte B ekkevold3, M assim iliano Babbucci4, Jeroen van H oudt5, G regory E. M aes5, Luca Bargelloni4, Rasmus O. N ielsen 6, Martin I. Taylor1, Rob O gd en 7, A lessia Cariani8, Gary R. Carvalho1, FishPopTrace C onsortium 1, Frank Panitz6 1 Molecular Ecology and Fisheries Genetics Laboratory, School of Biological Sciences, College of Natural Sciences, Bangor University, Bangor, Gwynedd, United Kingdom, 2 Food Safety, Environment & Genetics, Matis, Reykjavik, Iceland, 3 National Institute of Aquatic Resources, Technical University of Denmark, Silkeborg, Denmark, 4 D epartm ent o f Comparative Biomedicine and Food Science, University of Padova, Legnaro, Italy, 5 Laboratory of Biodiversity and Evolutionary Genomics, Katholieke Universiteit Leuven, Leuven, Belgium, 6 D epartm ent o f Molecular Biology and Genetics, Faculty of Science and Technology, Aarhus University, Tjele, Denmark, 7 TRACE Wildlife Forensics Network, Royal Zoological Society o f Scotland, Edinburgh, United Kingdom, 8 D epartm ent of Experimental and Evolutionary Biology, University of Bologna, Bologna, Italy Abstract The i n tr o d u c tio n of Next G e n era tio n S e q u e n c in g (NGS) has revolution ised p o p u la tio n g e netics, providing s tu d ie s o f n o n m o d e l sp e c ie s with u n p r e c e d e n t e d g e n o m i c c o v e r a g e , allowing ev o lu tio n ary biologists t o a d d r e s s q u e s t io n s previously far b e y o n d t h e reach o f available resources. F urth e rm o re , t h e sim p le m u ta t io n m o d e l o f Single N ucleotid e P o lym orph ism s (SNPs) pe rm its cost-effective h i g h - t h r o u g h p u t g e n o t y p i n g in t h o u s a n d s o f individuals sim ultaneou sly. G e n o m ic resources are sc arce for t h e Atlantic herring (Clupea harengus), a small pela gic sp e c ie s t h a t sustains high r e v e n u e fisheries. This p a p e r details t h e d e v e l o p m e n t o f 578 SNPs using a c o m b i n e d NGS a n d h i g h - t h r o u g h p u t g e n o t y p i n g a p p r o a c h . Eight individuals c ov ering t h e sp e c ie s d istributio n in t h e e a s te r n Atlantic w e re b a r -c o d e d a n d m ultip lex ed into a single cDNA library a n d s e q u e n c e d using t h e 4 5 4 GS FLX platform . SNP discovery w a s p e r fo r m e d by de n o v o s e q u e n c e c lu ste ring a n d c o ntig a ssem bly, follo w ed by t h e m a p p i n g o f reads a g a in s t c o n s e n s u s co n tig s e q u e n c e s . Selection o f c a n d i d a t e SNPs for g e n o t y p i n g w as c o n d u c t e d using a n in silico a p p r o a c h . SNP validation a n d g e n o t y p i n g w e r e p e r fo rm e d sim u lta n eo u s ly using a n Illumina 1,536 G o ld e n G a te assay. A l th o u g h t h e c o nversion rate o f c a n d i d a t e SNPs in t h e g e n o t y p i n g a ssa y c a n n o t b e p r e d ic te d in a d v an c e, this a p p r o a c h has t h e poten tial to m axim ise c o st a n d t im e efficiencies by avo idin g e x p e n s iv e a n d t im e - c o n s u m i n g labora tory s t a g e s of SNP validation. Additionally, t h e in silico a p p r o a c h leads t o low er a s c e r t a i n m e n t bias in t h e resulting SNP pa n el as m ark e r selectio n is b a s e d only o n t h e ability t o d e s ig n prim ers a n d t h e p re d ic te d p r e s e n c e of intron-e x on b o u n d a rie s. C o n s e q u e n tl y SNPs with a w id e r s p e c t r u m o f m in o r allele fre q u e n c ie s (MAFs) will b e g e n o t y p e d in t h e final panel. T he g e n o m i c re sou rce s p r e s e n t e d he re r e p r e s e n t a va lu a ble m u lti- p u rp o se re source for d e v e lo p in g informativ e m arke r p a n els for p o p u la tio n discrim ination, m icroarray d e v e l o p m e n t a n d for p o p u la tio n g e n o m i c s tu d ie s in t h e wild. C itation : Helyar SJ, Limborg MT, Bekkevold D, Babbucci M, van Houdt J, e t al. (2012) SNP Discovery Using Next Generation Transcriptomic Sequencing in Atlantic Herring (Clupea harengus). PLoS ONE 7(8): e42089. doi:10.1371/journal.pone.0042089 Editor: Arnar Palsson, University o f Iceland, Iceland Received March 7, 2012; A ccepted July 2, 2012; Published A ugust 7, 2012 C opyrigh t: © 2012 Helyar e t al. This is an open-access article distributed under th e term s of th e Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided th e original author and source are credited. Funding: The research leading to th ese results received funding from th e European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreem ent no KBBE-212399 (FishPopTrace). In addition, MTL received financial support from th e European Commission through th e FP6 projects UNCOVER (Contract No. 022717) and RECLAIM (Contract No. 044133). GM is a post-doctoral researcher funded by the Scientific Research Fund Flanders (FWO-Flanders). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of th e manuscript. C om peting Interests: The authors have declared th a t no com peting interests exist. * E-mail: sarah.helyar@matis.is 9 These authors contributed equally to this work. a n d th ere is a n u rgent need for genom ic tools to identify po p u latio n structure a n d b oundaries to allow effective m an ag e m en t [4], A dditionally the forensic identification o f fish a n d fish products th ro u g h o u t the food processing chain from n et to plate w ould assist in the fight against Illegal, U n re p o rte d a n d U n re g u lated (IUU) fishing, currently a prio rity for the E u ro p ea n U n io n [5] a n d globally [6], SN Ps are the optim al m ark er for this type o f application, b u t large S N P panels are cu rren tly available for few m arin e fish species (e.g. A tlantic cod (Gadus morhua) [7]; E u ro p ea n hake (Merluccius merluccius) [8]). T h u s, the developm ent o f genom ic resources for m arin e fish is urgently req u ire d for evolutionary, conservation a n d m an a g em e n t perspectives. Introduction P opulation genom ic ap p ro ach es have b e e n revolutionised by N ext G e n era tio n S equencing (NGS) technologies such as 454 (Roche) a n d Illum ina sequencing. T hese developm ents facilitate genom e-w ide analyses o f genetic variatio n across populations o f non-m odel organism s [1,2], allow ing a range o f evolutionary questions to b e investigated effectively for the first tim e. M arin e fishes a re excellent m odel systems for studying a d ap tatio n due to th eir large geographic ranges th a t frequently encom pass strong e nvironm ental gradients a n d th eir large pop u latio n sizes th at increase the relative strength o f selection over drift [3]. M oreover, m an y m arin e fishes are u n d e r extrem e a nthropogenic pressure PLOS ONE I w w w .plosone.org 1 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring M aterials and M ethods T h e strategy used for SN P developm ent in non-m odel organism s is d e p en d e n t o n the availability o f genom ic inform a tion from closely related species. If such resources are available, P C R am plicons (hom ologous to regions in the reference genom e) can be sequenced a n d SN Ps identified (however, these are intrinsically lim ited in the n u m b e r o f SN Ps th a t c an be identified). W ith o u t a reference genom e, th ree principal strategies for genom e-w ide SN P discovery can b e applied; w hole genom e sequencing a n d assem bly, genom e com plexity red u ctio n a n d sequencing m ethods (e.g. R R L a n d RA D -seq) a n d cD N A sequencing (RNA-seq). W hile w hole genom e sequencing has now b e en com pleted for species w ith large com plex genom es (for exam ple: p a n d a (Ailuropoda melanoleurd) [9]; cacao (Theobroma cacao) [10]), this rem ains outside the scope o f m ost studies, as in general the de novo assem bly o f larger, repeatrich o r polyploid genom es requires ad ditional inform ation (e.g. physical BAC m aps o r p a ire d -en d libraries) a n d extensive bioinform atic capacity in o rd e r to build the large, c o m p u ta tionally intensive, structured sequence scaffolds [11]. G enom ic libraries w hich sequence a small fraction o f the genom e (typically 3 -5 % ) req u ire a high level o f coverage for contig assem bly a n d detection o f S N P variants (see [12-14] for applications). D eep sequencing o f cD N A libraries provides an attractive a p p ro ac h to achieve the high sequence coverage need ed for de novo contig assem bly a n d SN P prediction, as only a small percentage o f the genom e is acco u n ted for by the transcriptom e. A n o th e r ad vantage o f transcriptom e sequencing is the in form ation p ro d u c ed co n cern in g functional genetic v a riation in specific genes w hich m ay b e u n d e r selection; these can th en b e targ eted to evaluate gene expression profiles. T h e ability to exam ine b o th neu tral variatio n a n d genom ic regions u n d e r selection provides researchers w ith u n p re ce d en te d tools for u n d erstan d in g local a d ap tatio n o f w ild populations a t the m olecular level. A tlantic h e rrin g (Clupea harengus) is a n a b u n d a n t a n d ecologically highly diverse species, o ccurring w ith a m ore or less continuous distribution in the N o rth -A d an tic benthopelagic zone. H ab itats a re distributed across highly diverse environ m ents, from tem p erate (33°N) to arctic (80°N) a n d a t salinities from oceanic (~ 3 5 ppt) to brackish (down to 3 ppt). In spite o f its large ecological range, studies using “ n e u tra l” m icrosatellites have unanim ously re p o rte d w eak p o p u latio n differentiation th a t is statistically significant only o n regional scales [15-17]. H ow ever, despite relatively high levels o f gene flow am o n g populations, evidence o f local a d ap tatio n has b e en identified in the A tlantic h e rrin g in the Baltic Sea using m icrosatellite loci [18,19]. T h erefo re it is expected th a t analyses w ith transcriptom e-w ide coverage applying h u n d red s o f m arkers associated w ith adaptive a n d neu tral v a riation will provide novel insights into the role o f selective a n d dem o g rap h ic processes in shaping pop u latio n structure. W e describe transcriptom e-based S N P d evelopm ent in A tlantic h e rrin g using a R o ch e 454 G S F L X (hereafter 454) sequencing a p p ro ac h . O u r aim was three-fold; 1) to develop a S N P assay exhibiting m inim al ascertain m en t bias across east A tlantic populations, 2) to test the applicability o f in silico SN P detection utilizing a com b in ed S N P screening a n d validation a p p ro ac h as a cost efficient w ay o f o btain in g po p u latio n genom ic resources, a n d 3) to establish a transcriptom e resource for tissue-specific gene expression profiling a n d m ic ro arra y developm ent. W e present, to o u r know ledge, one o f the first studies describing SN P discovery in a non-m odel m arin e fish based o n transcriptom e sequencing using N G S. PLOS ONE I w w w .plosone.org cDNA Library C o n s t r u c t i o n a n d 4 5 4 S e q u e n c i n g S N P developm ent was based o n m uscle sam ples from eight fish collected from four locations from across the eastern A tlantic (Figure 1). T hese locations w ere chosen to m axim ise geographic coverage a n d e nvironm ental differences, thereby m inim ising poten tial ascertain m en t bias. A pproxim ately 5g o f m uscle tissue was taken from each o f two individuals (male a n d female) from each location a n d im m ediately p laced in R N A later (Invitrogen) a n d after 12 hours a t 4°C , w ere stored a t —80°C . T o ta l R N A was extracted using the R N easy L ipid T issue M ini K it (Qiagen). T h e O ligotex m R N A M ini K it (Q iagen) was used to isolate m R N A , a n d non-norm alised cD N A was synthesized using the S u p e rsc rip t D oub le-stran d ed cD N A Synthesis K it (Invitrogen). A m ultiplex sequencing lib rary was p re p a re d by pooling equal am ounts o f cD N A from all eight individuals, w here two specific 10-m er b a rco d in g oligonucleotides w ere ligated to each individual sam ple to allow post-sequencing identification o f sequences (m odified from [20]). H ig h -th ro u g h p u t sequencing was p erfo rm ed o n a 454 sequencer according to the m an u fa ctu re rs’ protocol. S e q u e n c e P r o c e s s in g a n d A s s e m b ly Sequences w ere first de-m ultiplexed using the b a rco d in g tags (sfffile tool, R o ch e 454 analysis software) a n d sorted by sam ple. M itoch o n d rial sequences w ere rem oved from the d a ta set by m ap p in g the reads against the A tlantic h e rrin g m itochondrial genom e (G enbank accession N C _009577 [21]) using the R oche 454 gsM ap p er software. R ep eatM ask er [22] was used to identify a n d m ask repetitive a n d low com plexity regions w ithin the reads by using the zebrafish [Danio rerio) re p ea t library. R eads w ere cleaned for short sequences (< 5 0 bp) a n d low quality regions using S eqC lean (h ttp ://c o m p b io .d fc i.h a rv a rd .e d u /tg i/s o ftw a re /). Se quence clustering was p erfo rm ed in tw o steps; initial clustering was p erfo rm ed using C L C G enom ics W o rk b en ch (C L C bio, D enm ark), the resulting ace file sequences w ere th en assem bled ‘p e r contig’ in C A P3 [23]. T h e consensus sequences for the contigs p ro d u c ed by this assem bly w ere th en used as a reference for m ap p in g reads in the subsequent in silico S N P detection. SNP D e te c t io n T o identify can d id ate SNPs, all contig specific reads from the C A P 3.ace files w ere re -m a p p ed onto the consensus sequence a n d can d id ate SN Ps w ere identified using GigaBayes [24], T his p ro g ram scans each position o f the assem bly for the presence o f a t least two S N P alleles a n d calculates the p robability o f a given site bein g polym orphic using a B ayesian a p p ro ach . N o insertion or deletion variants (InDels) w ere considered a n d the polym orphism ra te was set to 0.003. A m in im u m contig d e p th o f four reads covering the polym orphic site a n d a m in im u m o f two reads for the ra re allele w ere re q u ire d for a site to be considered as a putative SN P. All contigs c o ntaining SN Ps w ere filtered to rem ove instances in w hich the alternative allele o f the SN P was only identified in a single individual, as these m ay either represent false positives or m ay lead to strong a scertain m en t bias. M icro sa tellite S e q u e n c e S c re e n in g M icrosatellites are a n im p o rta n t resource for sm aller scale studies in p o p u latio n genetics, m icrosatellites w ithin expressed genom ic regions have b e en show n to p ro d u c e clearer genotyping results as th ere are fewer null alleles a n d stutter b an d s [25,26]; therefore the contig library developed here was screened to detect re p ea t regions. A ssem bled contigs w ere screened for m icrosatellite repeats using M satC o m m a n d e r [27] a P ython p ro g ra m w hich 2 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring North Atlantic Figure 1. Location of the 18 samples used in this study. T h e e ig h t s e q u e n c e d a s c e rta in m e n t in d iv id u a ls (2 p e r lo c a tio n ) c a m e fro m th e fo u r sa m p lin g site s d e n o te d in red . d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 1 locates m icrosatellite repeats (di-, tri-, tetra-, penta-, a n d hexanucleotide repeats) w ithin fasta-form atted sequences o r consensus files. M satC o m m a n d e r th en uses P rim erS [28] to screen sequences con tain in g m icrosatellite loci for high-quality P C R p rim e r sites w ithin the flanking regions for ‘potentially am plified loci’ (PALs [29]). co ntaining SN Ps w ere first blasted against six p eptide sequence databases (Ensem bl genom e assem bly for G. aculeatus, T. nigroviridis, 0 . latipes, T. rubripes, D . rerio a n d the Sw issprot database) using the Blastx function (E-value cut-off < 1 .0 E - 3 ). F o r each SN P c o ntaining contig the best m atch was selected a n d the aligned sections o f the qu ery w ere saved. Subsequently, two 121 b p sequences p e r SN P (i.e. 60 b p u p /d o w n -stre a m o f the SN P position, one sequence for each allele) w ere pro d u ced , these w ere used in a Blastx analysis against the file retrieved from the peptide sequences (E-value cut-off < 1 .0 E -10), a n d w ere th en c o m p a red to d eterm in e if the SN P rep resen ted a synonym ous o r nonsynonym ous m utation. C o n tig A n n o t a t i o n C ontigs w ere a n n o ta te d using the Basic L ocal A lignm ent Search T ool (BLAST) against m ultiple sequence databases. Blastn searches (E-value cut-off < 1 .0 E 5) w ere cond u cted against all a n n o ta te d transcripts o f Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes, Takifiigu rubripes, Danio rerio a n d Homo sapiens available th ro u g h the E nsem bl G enom e Browser, a n d against all unique transcripts for D . rerio, H . sapiens, 0 . latipes, T. rubripes, Salmo salar, a n d Oncorhynchus mykiss in the N C B I LTniGene database. Blastx searches w ere c o n d u cted (E-value cut off < 1 .0 E 3) against the LTniProtK B/Sw issProt a n d L h iiP ro tK B /T rE M B L databases. L ast ly Blastx searches w ere p erfo rm ed against all a n n o ta te d proteins from the transcriptom es o f G. aculeatus, T. nigroviridis, 0 . latipes, T. rubripes, D . rerio a n d H . sapiens available th ro u g h the E nsem bl G enom e Browser. S e le c tio n o f C a n d i d a t e SNPs fo r G e n o t y p i n g A ssay SN Ps w ere validated follow ing a n in silico protocol, aim ed at m inim ising validation costs, whilst also m inim ising subsequent locus dro p o u t. S N P selection was based on the results from the Illum ina Assay D esign T ool, detection o f p utative intron-exon b oundaries w ithin the flanking regions o f c andidate SN Ps, a n d a visual evaluation o f the quality o f contig sequence alignm ents. T h e S N PS core from the Illum ina Assay D esign T ool (referred to as the Assay D esign S core/A D S ) utilises factors including tem plate G C content, m elting tem p e ra tu re, sequence uniqueness, a n d selfco m plem entarity to filter the can d id ate SN Ps p rio r to further inspection. T h e Assay D esign Score (assigned betw een 0 a n d 1) is T o pred ict the effect o f the m u ta tio n underlying each SN P at the am ino acid level, a pipeline was developed to pred ict the re ad in g fram e for each S N P -containing contig. All contigs PLOS ONE I w w w .plosone.org 3 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring indicative o f the ability to design suitable oligos w ithin the 60 b p u p /d o w n -stre a m flanking region, a n d the expected success o f the assay w hen genotyped w ith the Illum ina G o ld en G ate chem istry. Follow ing the Illum ina guidelines, all SN Ps w ith a score below 0.4 w ere discarded; SN Ps w ith a score above 0.4 w ere accepted, w ith SN Ps scoring above 0.7 bein g used preferentially. T h e pred ictio n o f intro n -ex o n bo u n d aries w ithin the SN P flanking regions (60 b p u p /d o w n -stre a m o f S N P position) was p erfo rm ed using two approaches. T h e first directly c o m p a red SN P -containing contigs against five high quality reference genom es for m odel fish species (Ensem bl genom e assem bly for C. aculeatus, 71 nigroviridis, O. latipes, 71 rubripes a n d D. rerio; tee Figure S I, left pipeline), using the B lastn o ption (E-value cut-off 10 5). Blast results w ere th en p arsed via a custom Perl script considering alignm ent length, start a n d e n d p o in t o f the alignm ent to determ in e the best positive m atc h (further details o f the Perl script a n d workflow are available from the a u thors on request). If the 60 b p o n b o th sides o f the SN P w ere p re sen t in the alignm ent, the can d id ate SN P was considered to b e co n tain ed w ithin a single exon; otherw ise a n intron-exon b o u n d a ry was assum ed to be p re sen t w ithin the 121 b p assay design region. SN Ps w ere th en assigned to one o f th ree categories either having, o r n o t h aving an intro n -ex o n b o u n d a ry p red icted w ithin the flanking region, o r as n o t re tu rn in g a significant m atc h against any o f the five blasted fish genom es. In the o th er a p p ro ach , the likelihood o f a positive m atch a n d the reliability o f intro n -ex o n b o u n d a ry predictions w ere increased, w ith S N P -containing contigs used as a qu ery in a Blast search (blastn, E -value cut-off 10 5) against the corresponding transcriptom e o f the sam e five reference databases (see above). If the blast search p ro d u c ed a positive result, the m atch in g transcript was dow nloaded from the E nsem bl database, a n d blasted against its ow n genom e sequence (see Figure S I, rig h t pipeline). W ithin the dow nloaded sequence, the nucleotide position corresponding to the can d id ate S N P in the A tlantic h e rrin g sequence was identified based on the start a n d e n d positions o f the alignm ent betw een the original contig a n d the E nsem bl transcript. U sing the p rojected S N P position, the flanking regions w ere again classified as bein g located o n a single exon, d isrupted by a n intron, or not having a significant m atch. R esults from the tw o a pproaches w ere c o m p a red to o b tain a consensus estim ate for the likelihood o f an intro n -ex o n b o u n d a ry o ccurring w ithin the 121 b p assay for each o f the can d id ate SNPs. Finally, the rem ain in g can d id ate S N P contigs w ere visually evaluated using clview (clview; h ttp ://c o m p b io .d fc i.h a rv a rd .e d u / tg i/so ftw a re /) in o rd e r to ra n k putative SN Ps w ithin a n d am o n g contigs. T h is was assessed by considering the overall quality o f the assem bly, the d e p th a n d length o f alignm ents, a n d the n u m b e r o f m ism atch sites flanking the SN P. T his step was included to increase the likelihood o f excluding incorrectly identified SN Ps (for exam ple; regions w ith alternative splicing or erroneous clustering o f paralogous sequences). W ith in each contig, one o r tw o SNPs receiving the highest quality score w ere considered for further validation (see below). Illu m in a’s G enom eS tudio d a ta analysis software (1.0.2.20706, Illum ina Inc.). O nly S N P assays show ing clear genotype clustering, a n d individual sam ples w ith a call ra te above 0.8 w ere considered for fu rth er analysis. C ro s s -s p e c ie s A m p lificatio n T o assess the utility o f developed m arkers in related species, two species identified from a consensus phylogeny [31], the sister species; Pacific h e rrin g (C. pallasii) a n d a m ore distantly related species; anchovy (Engraulis encrasicolus) w ere genotyped for the full 1,536 S N P panel. Statistical A n aly ses T o assess the predictive value a n d utility o f the different p a ram ete rs used in the in silico S N P validation pipeline, a binom ial logistic regression analysis was conducted. T w o categorical variables (Conversion a n d Polymorphism) w ere evaluated w hich describe the outcom e o f the SN P assay validation; these are expected to d e p en d o n a range o f c andidate p re d ic to r variables (see below). Conversion was scored b y assigning all 1,536 genotyped S N P assays as eith er failed (score = 0) o r successfully am plified a n d clustered (score = 1). Polymorphism assigned all the successfully am plified S N P assays into m o n o m o rp h ic (0) o r polym orphic (1) categories. N ine variables w ere th en assessed for th eir predictive value in determ in in g S N P assay conversion a n d polym orphism : i) n u m b e r o f ascertain m en t p an el individuals supplying sequence reads a t the S N P position, ii) n u m b e r o f sequences aligned u n d e r S N P position, iii) n u m b e r o f sequences w ith the m in o r allele, iv) frequency o f sequences w ith m in o r allele, v) n u m b e r o f ascertain m en t individuals w ith the m in o r allele, vi) Illum ina Assay D esign Score (ADS), vii) outcom e o f the intron-exon b o u n d a ry pipeline (scored as S N P assay bein g w ithin a single exon, in te rru p ted b y an in tro n or as h aving no B L A ST m atch), viii) n u m b e r o f reference species supporting findings from the intro n -ex o n pipeline, a n d ix) n e ig h b o u rh o o d sequence quality (determ ined by the n u m b e r o f m ism atches in the flanking region alignm ent). T o statistically test the predictive effect o f the above variables for b o th Conversion a n d Polymorphism a tw o-step binom ial logistic regression analysis was used as im p lem en ted in SPSS v l2 .0 . All variables w ere included in the initial m odel, a n d a backw ard stepwise deletion a p p ro ac h was used for optim isation, in w hich the least inform ative variable is rem oved sequentially until only significantly con trib u tin g variables rem ain. A W ald x 2 statistic was used to estim ate the relative con trib u tio n from each rem ain in g p a ram ete r. F o r the successful polym orphic assays global values o f observed (H o) a n d expected (H E) heterozygosity w ere estim ated for 20 individuals from each o f the four ascertain m en t populations (Figure 1) using G enA lE x 6.4 [32]. F o r these sam e populations deviations from H a rd y -W e in b erg equilibrium (HW E) a n d evi dence o f linkage disequilibrium (LD) w ere explored using G en ep o p 4.0 [33]. Significance levels for H W E a n d LD tests w ere estim ated using a n M C M C ch ain o f 10,000 iterations a n d 20 batches, / ’-values w ere adjusted for m ultiple tests by false discovery ra te (FDR) correction following B enjam ini & Y ekutieli [34], Lastly, a scertain m en t bias, resulting from the n o n -ran d o m exclusion o f SN Ps w ith a low M in o r Allele F requency (MAF) from the m ark e r panel, m ay occur due to the small size (n = 8) o f the ascertain m en t p an el (com pared to the w hole population), a n d the lim ited geographical coverage (com pared to the w hole species range). W h en m arkers are th en genotyped on a m u ch larger sam ple o f individuals the resulting a scertain m en t bias [35,36] m ay affect the estim ation o f m an y evolutionary a n d po p u latio n genetic p a ram ete rs [2]. T o assess the m agnitude o f a poten tial bias, the distribution o f M A F in the m ark er p an el was assessed across a SNP V alidatio n Follow ing the pipeline described above, 1,536 high scoring can d id ate m arkers w ere chosen for validation by high th ro u g h p u t genotyping assay. D N A was extracted from fin clips for 626 fish sam pled from eighteen sites across the species range in the eastern A tlantic, including tw enty fish from each o f the four S N P discovery populations (Figure 1). T h e quality a n d q u a n tity o f D N A was checked using a N a n o d ro p spectrophotom eter, a n d all sam ples w ere standardised to 70 n g /p L . G enotyping was p erfo rm ed using the Illum ina G olden G ate platform [30], a n d was visualised using PLOS ONE I w w w .plosone.org 4 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring large d a ta set covering 18 locations across the E astern A tlantic to check for a n elevated n o n -ran d o m exclusion o f SN Ps w ith a low M A F. A n un-biased S N P pan el should exhibit a n “ L -shape” distribution o f M A F categories in dicating a d eq u a te representation o f low M A F SN Ps [37], S e q u e n c e P r o c e s s in g a n d A s s e m b ly S equence cleaning a n d processing identified 5.8% o f the assigned reads as having a m atch o f a t least 94% identity over 60 base pairs to A tlantic h e rrin g m ito ch o n d rial sequences a n d these w ere rem oved from the d a ta set. R ep eatM ask er m asked 1.9% o f the d ataset using the zebrafish re p ea t library. T h e S eqC lean p ro g ra m rem oved a furth er 3.5% o f the assigned reads due to low -com plexity (n = 7,885), low quality (n = 169) o r being below the m in im u m re a d length o f 50 bases (n = 13,010). Lastly, som e reads w ere trim m ed, yielding a total o f 571,731 reads for sequence clustering a n d assem bly. Initially reads w ere clustered w ith C L C G enom ics W o rk b en ch (C L C bio, D enm ark), resulting in 16,456 clusters ra nging from 2 0 0 -4 0 0 bp. T hese w ere th en individually re-assem bled w ith C A P3 resulting in 19,246 contigs (some clusters p ro d u c ed b y C L C w ere split into two or m ore contigs) a n d 30,344 singletons o f w hich m ore th a n 50% could be a n n o ta te d (T able 1). T h e m ajority o f contigs consisted o f less th an 30 reads a n d ra n g ed betw een 100-500 b p (Figure 3C-D). Results 454 S eq u en cin g R esults for the sequencing a n d S N P discovery pipeline are illustrated in Figure 2. A total o f 683,503 cD N A sequences w ere gen erated from the m ultiplexed A tlantic h e rrin g m uscle library. T h e reads w ere de-m ultiplexed to assign reads to one o f the eight sequenced individuals according to their b a rco d in g tag. F or 8% o f the raw reads no b a rco d in g tag w as identified, while the rem ain in g 629,541 raw reads (average re a d length: 205 bp, Figure 3B) co n tain ed the 5 ' tag sequence a n d could be allocated to pools p e r sam ple p e r geographical region (Figure 3A). G eographic pools ra n g ed from 86,731 (English C hannel) to 187,554 (Barents Sea) sequences. All 454 sequence d a ta has b e en subm itted to the Sequence R e a d A rchive (SRA) u n d e r the study accession n u m b er E R P 0 0 1233 (h t t p : / /w w w .e b i.a c .u k /e n a /d a ta /v ie w / E R P001233). SNP D e te c t io n a n d A n n o t a t i o n R esults S N P discovery w ith GigaBayes d etected 6,331 putative SN Ps in 1,991 separate contigs. T h e p rim a ry a n n o ta tio n o f contig sequences is sum m arized in T ab le 1 a n d in m ore detail in T able S I. cDNA library 4 5 4 GS FLX S e q u e n c in g 683,503 EST read s D e-m u ltip lex ed by ID M icrosatelfitE 629,541 EST reads -► SNP Identification R em oval o f mtDNA 6501 P rim e r d e s ig n 1757 592,795 EST reads R e p e a t m a sk in g a n d c le a n up 571,731 EST reads 6331 R em o v al o f te rm in a l SNPs 5338 R em o v al if ADT <0.4 C lu s te r in g ^ s s e m b ly 19,246 contigs and 30,344 singletons 5253 In tro n ^ x o n s c re e n in g B last/A n n o tatio n 4018 11,970 contigs and 14,943 singletons T I Final s e le c tio n fo r g e n o ty p in g 1536 M arker d e v e lo p m e n t Figure 2. Schematic of transcript assem bly and SNP d etection p ip eline. S c h e m a tic o v e rv ie w w ith n u m b e rs o f re a d s, c o n tig s a n d SNPs th r o u g h th e tra n s c r ip t a ss e m b ly (c e n tre ) SNP d e te c tio n (rig h t h a n d side) a n d m ic ro sa te llite d e te c tio n (left h a n d side) p ip e lin e s (see te x t fo r m o re details). d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 2 PLOS ONE I w w w .plosone.org 5 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring 80000 60000 40000 50000 - 1nnnnnn 20000 S 8 K 8 (T> 7 LT) ■ £ § •7 7 Lfl N a 0\ S ? § Í lii ¡5 s IN in fM S 8 i'" O' Í Ul IN m s y 7 ! n n S e q u e n c e l en g th 14000 - 8000 13500 - 7500 Ul 13000 Ol IO 7000 6500 O 12500 u o 2000 ■i (Ü 1500 ¥ 1000 Z 6000 5500 1500 1000 500 - 500 □bn 0 - r r ~ i t O r 'i' tiH •h »c-Jm m Í i l i— t— i— t I I I I I n -r i 8 8 1/1 IÛ N fl) /s n i í i i in i in S R S gî g Num ber of reads S S C on tig le ng th Figure 3. Sum m ary o f sequence data. A) num ber of sequences successfully barcoded for each of the eight ascertainm ent individuals; and for the combined data, B) sequence length, C) num ber of reads per contig and D) contig length. doi:10.1371/journal.pone.0042089.g003 (18.4%) h a d B L A ST hits w hich suggested th a t th ere was no i n tr o n /exon b o u n d a ry presen t (sum m arised in Figure 2). S e le c tio n o f C a n d i d a t e SNPs fo r G e n o t y p i n g A ssay F ro m the 6,331 p re d ic te d SN Ps, 993 (15.6%) w ere located in the term in al region o f the contigs a n d did n o t have the req u ired m in im u m o f a 60 b p flanking region to design oligos for the G o ld en G ate a rra y (Figure 2). O f those rem aining, 85 SN Ps (1.3%) scored below the m in im u m value (< 0.4) reco m m e n d e d for p rim e r design a n d w ere n o t considered. 4,104 SN Ps (76.8%) h a d high Assay D esign Scores (betw een 0.7-1.0) a n d 1,149 SN Ps (21.5%) h a d acceptable Assay D esign Scores (betw een 0.4-0.7), all 5,253 o f these w ere taken forw ard to the next stage. O f the putative SNPs screened for p otential i n tr o n /exon splicing sites w ithin the flanking regions, 1,235 (23.5%) h a d p u tative i n tr o n /exon boundaries w ithin the flanking regions, a n d so w ere rejected. T h e m ajority (3,052, 58.1% ) h a d no m atc h in g B L A ST hits, w hile ju st 966 SNP V alid ation F ro m the full 1,536 p an el o f SN Ps th a t w ere genotyped, 290 (19%) assays failed to amplify. O f the rem ain in g 1,246 assays, 201 w ere m o n o m o rp h ic (false positives: 13%) 467 p ro d u c ed am bigu ous clustering (30%) a n d 578 w ere polym orphic, equivalent to a conversion ra te o f 38% . F ro m these 578 SN Ps a n open reading fram e was o b tain e d for 270 o f the respective 121 b p sequences (SN P a n d 60 b p u p /d o w n stream ), o f w hich 66 w ere suggested be non-synonym ous, a n d 204 to b e synonym ous, equivalent to a ratio (non-synonym ous/synonym ous) o f 0.32 (T able S2). Results on the predictive value o f the SN P selection p aram eters for assay conversion (i.e. for successful am plification) show th at inclusion o f all o f the p re d ic to r variables (see m ethods) m arginally im proves m odel-fitting (x2 = 18.520, d.f. = 9, p < 0 .0 3 0 ). W h en using backw ard stepwise deletion o f p red icto r variables, the Assay D esign Score a n d n u m b e r o f ascertainm ent individuals w ith the m in o r allele w ere identified as the only significant predictors o f assay conversion, b u t only the Assay D esign Score show ed the expected positive correlation w ith conversion rate (Table 2). T h e binom ial logistic regression analysis o n the polym orphic status o f all successfully am plifying assays show ed th a t w hen all p redictor variables w ere included, the overall m odel fit was not significant (X~ = 11.554, d.f. = 9, p = 0.240). Flow ever, n eig h b o u rh o o d se T a b le 1. N u m b e r of c o n tig s a n d sin g leto n s o b t a i n e d a n d successfully a n n o t a t e d . To tal A n n o tate d % An no tated Contigs 19,246 11,970 62.1 Singletons 30,344 14,943 49.2 To tal 49,590 26,913 54.3 dol:10.1371 /journal.pone.0042089.t001 PLOS ONE I w w w .plosone.org 6 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring quence quality h a d a significant negative correlation w ith polym orphism . As before a backw ard stepwise deletion ap p ro ac h was used a n d this red u ced the significantly con trib u tin g predictors to the n u m b e r individuals in the ascertainm ent p an el w ith the m in o r allele a n d the n e ig h b o u rh o o d sequence quality w hich, as expected, respectively show ed positive a n d negative correlation w ith S N P polym orphism (T able 3). T a b le 3. Results for SNP d e te c t i o n variables for predicting SNP a ssay p o ly m o r p h ism following a b a ckw a rd ste p w ise elim in atio n p ro c e d u re . E stim ates o f H o a n d H e across the four ascertain m en t sam ples ra n g ed from 0 .0 0 -0 .6 3 (m ean 0.18) a n d 0 .0 0 -0 .5 0 (m ean 0.18), respectively (T able S2). O bserved heterozygosity w ithin the four ascertainm ent populations revealed sim ilar levels o f diversity to the 18 sam pled locations used for the S N P validation [38], T ests for deviation from H W E for each locus a n d po p u latio n revealed 43 out o f 1,249 p erfo rm ed tests (3.4%) w ith significant deviations from H W E before co rrection for m ultiple tests. T hese tests w ere distributed a m o n g all four populations a n d across 35 loci. E ight tests distributed across three populations a n d seven loci retain ed significance follow ing co rrection for m ultiple tests ( a = 0.05). D ue to the presence o f m o n o m o rp h ic loci in the four ascertainm ent sam ples, 229,094 tests for LD w ere p erfo rm ed o f w hich 352 re m a in e d significant after co rrection for F D R (a = 0.05). O f these, 14 pairs w ere significant in m ore th a n one o f the four populations b u t in all cases SN Ps o riginated from different contigs suggesting lack o f close physical linkage. S N P frequency distributions o f M A F categories in the full pan el o f 18 sam ples indicated little bias due to n o n -ran d o m selection o f high frequency SN Ps (Figure 4). T a b le 2. Results for SNP d e te c t i o n variables for predicting SNP a ssay c o n version following a b ack w ard ste p w ise e lim in atio n proc edure . Asc_ind* -0 .1 6 5 4.67 1 0.031 AD? 0.763 4.785 1 0.0 2 9 C onstant -0 .3 7 8 1.464 1 0.226 d eg ressio n coefficient for individual variable. hWald X2 statistic. A ssociated probability. Number o f ascertainm ent individuals with th e minor allele. eAssay Design Score. Significant p-values are shown in bold. doi:10.1371 /journal.pone.0042089.t002 PLOS ONE I w w w .plosone.org Asc_incf* 0.249 2.965 1 0.085 NSCt -0.111 7.321 1 0 .0 0 7 C onstant 0.935 21.137 1 0.000 F o r next g eneration sequencing to be successfully applied to the developm ent o f genetic resources in non-m odel organism s, m ethodological issues m ust be addressed to optim ise the p ro c e dures for each project. SN Ps c an b e genom e- o r transcriptom e derived and, in the latter case, selected from m ore a b u n d a n t or ra re r expressed transcripts; in addition, m ark er developm ent is influenced by sequence d e p th a n d contig length due to the sequencing platfo rm chosen a n d the com plexity o f the hypothesis to be investigated (i.e. sm aller n u m b e r o f SN Ps re q u ire d for species identification analysis as c o m p a red to p o p u latio n genetic studies). T h e choice o f sequencing p latform should reflect the objective o f a given study. W hile longer reads (e.g. 454 sequencing) are expected to im prove contig assem bly, m ore, b u t shorter, reads (e.g. Illum ina sequencing) m ay be p referable in o rd e r to reduce detection o f false positive SN Ps from h igher alignm ent depth, especially w hen an existing reference sequence is available. T his study took advantage o f the longer re a d lengths o b tain e d w ith 454 sequencing in a de novo assem bly o f a reference scaffold for S N P discovery in herring. T h e clustering a n d assem bly step is critical for S N P m in in g as it generates the reference for v a ria n t detection by m ap p in g reads to the contig. T h erefo re, the absence o f a reference genom e or transcriptom e poses a challenge for assessing the ‘correctness’ o f a c ontig assem bly, as p otential m is-assem blies o f sequence due to hom ologous or paralogous genes can n o t b e directly verified by back -m ap p in g to the species-specific genom e. G enerally, cluster assem bly w ith overly stringent p a ram ete rs will lead to splitting sequences belonging to g eth er into m ore contigs, resulting in a h igher n u m b e r o f shorter contigs w ith low er coverage depth. W hilst applying criteria th a t are overly relaxed will assem ble reads from related genes o r gene families into single contigs, resulting in a low er n u m b ers o f contigs th a t have a h igher sequence depth, how ever this increases the likelihood o f m isidentifying polym or phism s b etw een paralogous sequence variants (PSVs) as SNPs. A dditionally, as no genom e reference is available for A tlantic herring, the o ccurrence o f PSVs can n o t b e assessed, this was p ro b ab ly the cause for the m ajority o f am biguous clustering th at was subsequently seen in the SNPs. F o r the S N P detection, the low sequence d e p th o f the m ajority o f contigs (Figure 3C) re q u ire d relatively low criteria to be set (i.e. depth: four reads, redundancy: two observations o f the m in o r T his study dem onstrates the de novo discovery o f 6,331 putative SN Ps based on 454 transcriptom e sequencing o f eight individuals covering the N o rth east A tlantic distribution o f the A tlantic herring. O f p a rticu la r interest in the a p p ro ac h is the single validation a n d genotyping step, disposing w ith the traditional step Pc Pc S e q u e n c e A s s e m b ly a n d SNP D e te c t io n D iscussion df df o f testing each S N P for am plification p rio r to large scale genotyping (e.g. [39,40]). T h e d a ta gen erated in this study constitutes a new resource for genetic analysis in A tlantic h e rrin g significantly increasing the n u m b e r o f know n transcripts as well as novel SN P a n d m icrosatellite m arkers. T h e m ajority (99%) o f the 578 m arkers identified as p olym or phic in A tlantic h e rrin g also am plified in Pacific herring, b u t only 12% exhibited m ore th a n one allele. O n ly ab o u t 10% o f the 578 SN Ps am plified in anchovy, a n d o f these, only ten loci exhibited polym orphism . M satC o m m a n d e r detected 6,501 m icrosatellites w ith a repeat length o f b etw een two a n d seven bases w ith four or m o re repeat units in 3,741 contigs (T able 4). 27% o f the m icrosatellites h a d sufficient suitable flanking sequence to enable the design o f prim ers. D etails o f the m icrosatellites (num ber a n d type o f repeat, prim ers, T m a n d % G C ) are listed in T ab le S3. W a ld b W a ld b deg ressio n coefficient for individual variable. AVald y2 statistic. Associated probability. dNumber of ascertainm ent individuals with the minor allele. N eighbourhood Sequence Quality. Significant p-values are shown in bold. doi:10.1371 /journal.pone.0042089.t003 C ro s s -s p e c ie s A m plification a n d M icrosatellite D e te c t io n Ba Ba 7 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP D iscovery in A tla n tic H erring 0.5 0 .4 - in 0J CL fü in tH 0.3 - Ci i/i CL z m '<5 > u c dJ D O' B 0. 2 - 0.1 - 0.0 LD O o »H LD t— 1 o rsi io <M o m LO ro 2 LO o LO ? O o d ? LO o o ? o t— 1 o ? LO T -1 o ? o r\j o ? LO ? o rg o ? LO rg o ? Q t o ? LD in o o MAF Figure 4. M in o r allele frequency (MAF) d istrib ution . T h e d is trib u tio n o f th e MAF in 5 7 8 SNPs ty p e d in 18 p o p u la tio n s a c ro ss th e e a s te rn h e rrin g d is trib u tio n . d o i:1 0 .1 3 7 1 /jo u rn a l.p o n e .0 0 4 2 0 8 9 .g 0 0 4 allele). H ow ever, these low thresholds together w ith the sequencing o f eight ascertainm ent individuals spanning the entire northeast A tlantic distribution o f h e rrin g resulted in m inim al ascertainm ent bias due to exclusion o f low M A F SN Ps (Figure 4). O n e expected result o f the low d e p th a n d re d u n d an c y p a ram ete rs is, how ever, the low conversion ra te from the inflated n u m b e r o f candidate T a b le 4. Type a n d n u m b e r of r e p e a ts of t h e microsatellites d e t e c t e d in t h e herring c o n tig s using M s a t c o m m a n d e r . T y p e o f rep ea t N u m b e r o f repeats 4 -9 1 0 -1 4 1 4 -1 9 To tal >19 M axim u m Dinucleotide 4418 505 193 175 75 5291 Trinucleotide 829 35 9 2 36 875 Tetranucleotide 202 13 3 12 31 230 Pentanucleotide 43 1 1 0 17 45 Hexa nucleotide 57 2 0 1 21 60 Total 5549 556 206 190 - 6501 doi:10.1371 /journal.pone.0042089.t004 PLOS ONE I w w w .plosone.org SN Ps (identified due to sequencing errors). T h e 454 platform specific challenge o f resolving hom opolym eric regions m ay further have com prom ised SN P detection b y reducing assem bly quality or calling false SN Ps w ithin these regions [41], b u t such an effect could n o t be assessed here due to the lack o f a know n reference sequence. T h e use o f transcriptom e sequencing in this study has resulted in only a few p e r cent o f the total genom e bein g covered, b u t a t a relatively high sequencing d epth, thus lim iting sequencing costs while achieving the n u m b e r o f SN Ps re q u ire d for custom -designed S N P assays. A dditionally, transcriptom e sequencing provides in form ation ab o u t tissue-specific genes a n d their expression profile, w hich can be used to develop furth er tools for gene expression studies such as oligonucleotide m icro array o r R N A -seq approaches. SNP V alid ation T h e genotyping o f 1,536 selected S N P assays p erfo rm ed w ith genom ic D N A for a large p an el o f A tlantic h e rrin g sam ples from across the n o rth east A tlantic indicated th a t nearly 600 o f the SNPs are polym orphic (37.6% ). H ow ever, alm ost 49.3% o f the can d id ate SN Ps failed to work; due to eith er non-am plification A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring (18.9%), false positives (m onom orphic loci) (13.1%) o r am biguous clustering (17.3%). D espite o u r a tte m p t to screen for p otential in tr o n /exon splicing sites w ithin flanking regions o f all candidate SN Ps using available reference genom es, only 41.9% o f all queries m atc h ed equivalent sequences in a t least one o f the reference species. T h u s, the presence o f u n d e te cte d introns m ay have constituted a m ajor cause for genotyping failure [42]. M oreover, can d id ate SN Ps th a t a p p ea red m o n o m o rp h ic in the large-scale screening m ight either b e the result o f false-positive predictions or could indicate real, ra re SN Ps n o t p re sen t in the sam ples tested [7]. T h e p urely in silico SN P detection m eth o d presen ted in this study m ay have a relatively low conversion rate to validated SNPs w hen c o m p a red to o th er m ethods. H ow ever, this m eth o d is still extrem ely com petitive given a lim ited resource for m ark er developm ent, once the tim e a n d cost associated w ith designing a n d o rd e rin g h u n d red s o f prim ers, ru n n in g validation P C R s, a n d ad ditional S anger sequencing for validation are considered (e.g. [39,40]). All o f w hich w ould be in add itio n to the cost o f genotyping the resulting 578 validated SNPs. In o rd e r to reduce the n u m b e r o f erroneous S N P predictions, i.e. to increase the p robability o f a n in silico d etected SN P being a truly polym orphic site, furth er sequencing w ould lead to greater sequence d e p th o f the contigs, allow ing m ore stringent selection o f S N P candidates. It has b e en show n for m ultiplexed re-sequencing th a t m o re th a n 90 % o f the variants can b e d etected correctly using next g eneration sequencing technologies w h en a n average d e p th o f a t least 20 reads p e r base is achieved [43,44], Increasing the average sequence d e p th will also be advantageous for identifying SN Ps from rarely expressed genes. A n o th e r interesting ap p ro ach , recently described by R a ta n et al. [45], suggests a m eth o d to call SN Ps w ith o u t a reference genom e sequence. S N P calling is p erfo rm ed w henever new sequences are added; thus, sequencing continues only as long as need ed to identify an a d eq u a te n u m b er o f can d id ate SN Ps. T h e m eth o d is re p o rte d to w ork even w hen the sequence coverage is n o t sufficient for de novo assem bly. A ddition ally, the use o f next g eneration sequencing for analysing a restriction enzym e-generated D N A library (R R L a n d in p a rticu la r R A D sequencing, for reviews see [46,47]) based o n m ultiple tagged individuals now enables the fast discovery o f thousands o f SN Ps in non-m odel organism s w ith no p rio r genom e inform ation [48,49]. H ow ever, one dow nstream pro b lem identified w ith R A D seq is th a t transferring the SN Ps o nto a h ig h -th ro u g h p u t genotyping platform is difficult w ith o u t a reference genom e, as the m ajority o f SN Ps identified do n o t have the 60 b p flanking sequenced req u ire d for assay design. T his has to som e extent been solved using P aired E n d R A D (RA D -PE)[50], how ever the bioinform atic a pproaches for SN P discovery in R A D -P E contigs are still lim ited. A dditionally, w hile R R L /R A D -se q approaches elim inate the problem s e n co u n tere d w ith in tr o n /exon b oundaries th a t a re associated w ith tran scrip to m e sequencing, these m ethods only consider ra n d o m fragm ents o f the entire genom e, w hereas o u r transcriptom e based pipeline specifically targets expressed genes w ith a n increased likelihood for detecting SN Ps (e.g. nonsynonym ous substitutions) associated w ith genom ic regions u n d e r selection. Such n o n -n e u tra l SN Ps are expected to provide high discrim inatory pow er a t the p o p u latio n level a n d will constitute a valuable forensic tool in future applications [47,51]. T h e com bination o f the coverage a n d S N P discovery rates obtain ed by R A D -seq, w ith the targ eted red u ctio n o b tain e d by sequencing the transcriptom e w ould potentially be a very pow erful tool. H ow ever, it m ust be n o ted th a t due to the ra p id ra te o f technical developm ents in the field, such as the increased re ad length a n d decreasing costs o f existing platform s, a n d the poten tial o f n a n o sequencing technology, the best solution reg ard in g platform s a n d PLOS ONE I w w w .plosone.org m ethods to optim ise the cost effectiveness for a specific application needs careful consideration. W h en determ in in g the predictive value o f the S N P selection p a ram ete rs for successful am plification o f the in silico detected SN Ps (Conversion), as expected, a positive correlation was found w ith the Assay D esign Score, i.e. the likelihood for designing successful prim ers a ro u n d the S N P position. U nexpectedly, a negative correlation was found w ith n u m b e r o f ascertainm ent individuals for w hich the ra re allele was observed, alth o u g h the reasons b e h in d this correlation are unclear. O verall, only very w eak predictive variables for Polymorphism w ere identified, w ith only the n e ig h b o u rh o o d sequence quality significantly explaining the negative correlation; as the n u m b e r o f m ism atches in flanking regions increases, a p re d ic te d SN P is m ore likely to be a false positive. T his increase in m ism atches o f a n aligned region could be indicative o f erroneous clustering, for exam ple, PSV s o r o th er sequences w ith differing genom ic origin (this has for exam ple also b e en seen for hake in a sim ilar study [8]). T h e n u m b e r o f individuals w ith the m in o r allele in the ascertain m en t p a n el also show ed a positive correlation w ith Polymorphism. W hile this p a ra m e te r is less conclusive th a n for pred ictin g Conversion rate, th ere is potentially a predictive role o f this p a ra m e te r for detecting tru e SNPs. F u tu re SN P developm ent efforts m ay reduce the false positive ra te by applying relatively stringent thresholds for this variable (e.g. having a t least 2 individuals w ith the m in o r allele rep resen ted in the SN P co n tain in g contig, although this will, o f course, d e p en d on the size o f the a scertain m en t panel). T h e two binom ial logistic regression analyses w ere re p ea te d w ith a red u ced set o f variables representing the strongest a priori candidates (the n u m b e r o f sequences aligned u n d e r the SN P position, the frequency o f sequences w ith m in o r allele, the n e ig h b o u rh o o d sequence quality, the Assay D esign Score, a n d the outcom e o f the intro n -ex o n b o u n d a ry pipeline). T his also allow ed controlling for a p otential bias from n o n -in d ep e n d en t variables such as the tw o intro n -ex o n a n d three m in o r allele related p aram eters. R esults w ere largely c o n g ru en t confirm ing Assay D esign Score a n d n eig h b o u rh o o d sequence quality to be the m ost significant predictors o f Conversion a n d Polymorphism, respec tively. T h e range o f allele frequencies w ithin the SN P p an el suggests th a t the strategy o f carefully selecting individuals to m axim ise the geographical, phenotypic a n d genetic diversity covered by the S N P d evelopm ent sam ples has b e en successful in m inim ising ascertain m en t bias. C ross S p e c ie s A m plification a n d M ic rosatellite D e te c t io n A high p ro p o rtio n o f d etected SN Ps also am plified single P C R products in Pacific h e rrin g albeit w ith a low polym orphism rate, w hich is as expected due to th eir developm ent from conserved genom ic regions. H ow ever, due to the small sam ple size (n = 4), this n u m b e r is likely to be dow nw ardly biased a n d a m u ch higher p ro p o rtio n o f SN Ps m ay in fact be polym orphic a n d therefore prove useful in this species. As expected from the phylogenetics o f these species, the pro p o rtio n s o f SN P am plification a n d polym or phism w ere low er in the anchovy. A dditionally, o u r sequencing effort has led to the discovery o f a large resource o f m icrosatellite m arkers, 36% o f w hich have prim ers successfully designed (Table S3). T hese include b o th n e u tra l loci a n d loci th a t are physically linked to SN Ps representing genom ic regions th a t have b een show n to b e u n d e r directional selection [38]. A n o th er a ttrib u te o f m ulti-allelic m icrosatellite m arkers w h en studying adaptive genetic variatio n is the increased statistical pow er for detecting balan cin g selection c o m p a red to bi-allelic m arkers (such as SN Ps, e.g. [52]), a n d also for applications such as p a re n ta l assignm ent. 9 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring alleles in brackets. Also global estim ates o f observed (Ho) a n d expected heterozygosity (He) in the four a scertain m en t populations for each SN P. T h e S /N S colum n denotes w h eth er a S N P was eith er synonym ous (S) o r non-synonym ous (NS) w ith N A designating SN Ps w ith no contig m atc h in the B L A ST search (see text for m ore details). (XLSX) C o n c lu s io n O u r a p p ro ac h o f applying b arco d in g a n d m ultiplexing individ uals for large-scale in silico m ining o f transcriptom e sequences seems to be a very ap p ro p ria te strategy to develop new SN P m arkers in non-m odel species as it does n o t req u ire costíy a n d tim e-intensive re-sequencing o f targ e t am plicons necessitating p rio r know ledge a n d availability o f genom e sequence inform ation. H ow ever, the p urely in silico based S N P detection com es w ith a trad e off in the form o f a n expectedly low er conversion ra te in the final genotyping assay [53]. T h e resu ltan t resources will b e o f value in on-going analyses o f pop u latio n structuring a n d stock dynam ics, assays o f adaptive variation, a n d for e n h an cin g the scope o f m icrosatellite-based studies. T a b le S3 List o f the m icrosatellites for w hich prim ers w ere successfully designed, along w ith u p to 200 bases flanking sequence. (XLSX) A ck n ow led gm en ts S upporting Information W e would like to thank all the m em bers o f the FishPopTrace Consortium for their input. Sam pling was m ade possible by the generous collaboration o f Eero Aro, Philip Coupland, G eir Dahle, Audrey Geffen, T hom as Gröhsler, Birgitta Krischansson, C iaran O ’Donnell, H enn Ojaver, G uöm undur Oskarsson, Iain Penny, Jukka Pönni, Fausto T inti, V éronique Verrez-Bagnis Phil W atts, and Miroslaw Wyszynski. W e thank Pernille K. A ndersen (Aarhus University, Denmark) for sequencing an d library m anagem ent. F ig u re S I Analysis pipeline. T h e p a th o n the left o f the figure illustrates the pipeline for the genom ic app ro ach , w here h e rrin g transcripts are direcüy c o m p a red w ith five reference genom es. T h e p a th on the rig h t o f the figure shows the pipeline for the transcriptom ic a p p ro ac h , w here h e rrin g transcripts are first c o m p a red to the transcriptom e o f the five reference species. H its w ere th en subsequendy m atc h ed to the corresponding genom es o f the sam e species (see text for m ore de tads). (TIF) Author C ontributions Conceived and designed the experiments: G R C FP SJH DB M IT LB R O G E M J v H FPT Consortium . Perform ed the experiments: FP R O N R O SJH M T L . C ontributed reagents/m aterials/analysis tools: FPT C onsor tium. W rote the paper: SJH M T L M IT DB G R C FP. C arried out in silico analyses: FP R O N SJH M T L M IT MB Jv H G E M AC. Analyzed genotype data: SJH M T L DB M IT . C arried out statistical analysis: SJH M T L DB LB MB. T a b le S I N u m b e r o f contigs a n d singletons a n n o ta te d using a range o f fish a n d h u m a n reference resources a n d databases. (XLSX) T a b le S2 List o f the 578 validated polym orphic SN Ps found in this study, including the 120 b p flanking region, w ith the two SN P References 1. H e ly a r SJ, H e m m e r -H a n s e n J , B e k k ev o ld D , T a y lo r M I, O g d e n R , e t al. {2011) A p p lic a tio n o f S N P s fo r p o p u la tio n g e n e tic s o f n o n - m o d e l o rg an ism s: n e w u n s e q u e n c e d g e n o m e s u s in g s e c o n d g e n e ra tio n h ig h th r o u g h p u t se q u e n c in g tec h n o lo g y : a p p lie d to tu rk e y . B M C G e n o m ic s: 10: 47 9 . o p p o rtu n itie s a n d ch a lle n g es. M o le c u la r E c o lo g y R e s o u rc e s 1 1(S1): 123—136. S ta p le y J , R e g e r J , F e u ln e r P G D , S m a d ja G , G a lin d o J , e t al. (2010) A d a p ta tio n g e nom ics: th e n e x t g e n e ra tio n . T r e n d s in E c o lo g y & E v o lu tio n 25: 70 5 —712. 3. N ie lse n E E , H e m m e r -H a n s e n J , L a rse n P F , B ek k ev o ld D (2009) P o p u la tio n g e n o m ic s o f m a r in e fishes: I d e n tify in g a d a p tiv e v a ria tio n in sp a ce a n d tim e. 14. v a n B ers N E , v a n O e r s K , K e rs te n s H H , D ib b its B W , C ro o ijm a n s R P , e t al. {2010) G e n o m e -w id e S N P d e te c tio n in th e g r e a t tit Parus major u sin g h ig h th ro u g h p u t se q u en c in g . M o le c u la r E co lo g y : 19{S1): 8 9 —99. 2. 4. 5. 6. 15. M c P h e rs o n A A , S te p h e n s o n R L , O ’R eilly P T , J o n e s M W , T a g g a r t G T {2001) G e n e tic d iv ersity o f c o a sta l N o r th w e s t A tla n tic h e rr in g p o p u la tio n s: im p lic a tio n s fo r m a n a g e m e n t. J o u r n a l o f F ish B io lo g y 59: 35 6 —370. 16. B e k k ev o ld D , A n d r e C , D a h lg re n T G , C la u s e n L A W , T o rs te n s e n E , e t al. {2005). E n v iro n m e n ta l c o rre la te s o f p o p u la tio n d iffe re n tia tio n in A d a n tic h e rrin g . E v o lu d o n 59: 2 6 5 6 -2 6 6 8 . M o le c u la r E c o lo g y 18: 3 1 2 8 -5 0 . W a p le s R S , P u n t A E , C o p e J M (2 0 0 8 ) I n te g r a t i n g g e n e tic d a ta in to m a n a g e m e n t o f m a r in e reso u rc e s: h o w c a n w e d o it b e tte r? F ish a n d Fish eries 9: 4 2 3 -4 4 9 . 17. J o rg e n s e n H B H , H a n s e n M M , B ek k ev o ld D , R u z z a n te D E , L o e sc h ck e V {2005) M a rin e lan d s c a p e s a n d p o p u la tio n g e n e tic s tru c tu re o f h e rr in g {Clupea harengus L.) in th e B altic Sea. M o le c u la r E c o lo g y 14: 3 2 1 9 -3 2 3 4 . C o r r ig e n d u m to C o u n c il R e g u la d o n (2008) (EG ) N o 1 0 0 5 /2 0 0 8 o f 29 S e p te m b e r 2 0 0 8 e sta b lish in g a C o m m u n ity sy stem to p re v e n t, d e te r a n d e lim in a te illegal, u n r e p o r te d a n d u n re g u la te d fish in g , a m e n d in g R e g u la tio n s (E E C ) N o 2 8 4 7 /9 3 , (EG) N o 1 9 3 6 /2 0 0 1 a n d (EG) N o 6 0 1 /2 0 0 4 a n d re p e a lin g R e g u la tio n s (EG) N o 1 0 9 3 /9 4 a n d (EG) N o 1 4 4 7 /1 9 9 9 . O ffic ial J o u r n a l o f th e E u r o p e a n U n io n , L 286. 18. L a rss o n L G , L a ik re L , P a lm S, A n d r e G , C a r v a lh o G R , e t al. {2007) C o n c o r d a n c e o f a llo z y m e a n d m ic ro sa te llite d iffe re n tia tio n in a m a r in e fish, b u t e v id e n c e o f se le c tio n a t a m ic ro sa te llite lo cu s. M o le c u la r E c o lo g y 16: 1135— 1147. 19. G a g g io tti O E , B e k k ev o ld D , J o rg e n s e n H B H , F o il M , C a rv a lh o G R , e t al. {2009) D is e n ta n g lin g th e effects o f e v o lu tio n a ry , d e m o g ra p h ic , a n d e n v iro n m e n ta l fac to rs in flu e n cin g g e n e tic s tru c tu re o f n a tu r a l p o p u la tio n s: A d a n tic h e rr in g as a ca se stu d y . E v o lu tio n 63: 2 9 3 9 -2 9 5 1 . 20. B in la d e n J , G ilb e rt M T , B o llb a c k J P , P a n itz F , B e n d ix e n G , e t al. {2007) T h e use o f c o d e d P C R p rim e rs e n a b le s h ig h -th ro u g h p u t s e q u e n c in g o f m u ld p le h o m o lo g a m p lific a d o n p r o d u c ts b y 4 5 4 p a ra lle l s e q u en c in g . P L o S O N E 2: e l9 7 . F A O F ish eries a n d A q u a c u ltu re R e p o r t N o . 9 7 3 ( F IP I/R 9 7 3 ) R o m e , 31 J a n u a r y - G F e b ru a ry 2 0 1 1 . R e p o r t o f th e tw e n ty -n in th session o f th e c o m m itte e o n F isheries. A vailab le: h t t p : // w w w .f a o .o r g / d o c r e p / 0 1 4 /i 2 2 8 1 e /i 2 2 8 1 e 0 0 .p d f . A c c e ssed 2011 F e b 7. 7. H u b e r t S, H ig g in s B, B o rz a T , B o w m a n S (2010) D e v e lo p m e n t o f a S N P re s o u rc e a n d a g e n e tic lin k a g e m a p fo r A d a n tic c o d {G adus m o rh u a ). B M C G e n o m ic s 11: 191. 8. M ila n o I, B a b b u c c i M , P a n itz F, O g d e n R , N ie lse n R O , e t al. {2011) N o v e l tools fo r c o n s e rv a tio n g e n o m ics: C o m p a rin g tw o h ig h -th ro u g h p u t a p p ro a c h e s fo r S N P d isc o v e ry in th e tra n s c rip to m e o f th e E u r o p e a n h a k e . P L o S O N E 6: e2 8 0 0 8 . 9. L i R , F a n W , T ia n G , Z h u H , H e L , e t al. {2010) T h e s e q u e n c e a n d d e n o v o a sse m b ly o f th e g ia n t p a n d a g e n o m e . N a tu r e 4 6 3 : 3 1 1 —317. 21. 22. 10. A r g o u t X , S a ls e J , A u r y J M , G u iltin a n M J , D r o c G , e t al. {2011) T h e g e n o m e o f 11. 12. 13. Theobroma cacao. N a tu r e G e n e tic s 43: 101—108. S ta r B, N e d e r b ra g t A J, J e n to f t S, G r im h o lt U , M a lm s tro m M , e t al. {2011) T h e g e n o m e s e q u e n c e o f A d a n tic c o d rev e a ls a u n iq u e im m u n e system . N a tu r e 477: 2 0 7 -2 1 0 . S á n c h e z C G , S m ith T P , W ie d m a n n R T , V a lle jo R L , S a le m M , e t al. {2009) Single n u c le o d d e p o ly m o rp h is m d isc o v e ry in r a in b o w tr o u t b y d e e p s e q u e n c in g o f a r e d u c e d re p r e s e n ta tio n lib ra ry . B M C G e n o m ic s: 10: 55 9 . K e rs te n s H H , G ro o ijm a n s R P , V e e n e n d a a l A , D ib b its B W , G h in -A -W o e n g T F , e t al. {2009) L a r g e sc a le sin g le n u c le o tid e p o ly m o r p h is m d isc o v e ry in PLOS ONE I w w w .plosone.org 23. L a v o u é S, M iy a M , S a ito h K , Ish ig u ro N B , N is h id a M {2007) P h y lo g e n e d c r e la d o n s h ip s a m o n g a n c h o v ie s, sa rd in e s, h e rrin g s a n d th e ir rela tiv e s {C lupei fo rm es), in fe rre d f ro m w h o le m ito g e n o m e s e q u en c e s. M o le c u la r P h y lo g e n e d c s a n d E v o lu tio n 43: 1 0 9 6 -1 1 0 5 . S m it A F A , H u b le y R , G re e n P {1996) RepeatMasker Open-3.0. A v a ila b le h t t p : / / w w w .r e p e a tm a s k e r .o r g . v e rs io n o p e n - 3 .2 .7 w ith R M d a ta b a s e v e rs io n 20090120. H u a n g X Q , M a d a n A {1999) C A P 3 : A D N A s e q u e n c e a sse m b ly p r o g ra m . G e n o m e R e s e a rc h 9: 8 6 8 - 8 7 7 . 24. M a r th G T , K o r f I, Y a n d e ll M D , Y e h R T , G u Z J, e t al. {1999) A g e n e ra l a p p ro a c h to sin g le -n u c le o tid e p o ly m o rp h is m d isco v ery . N a tu re G e n e tic s 23: 452M 56. 25. B a i J , L i Q , G o n g R H , S u n W J, L iu J , e t al. {2011) D e v e lo p m e n t a n d c h a ra c te r iz a tio n o f 6 8 E S T -S S R m a rk e rs in th e P acific o y ster, Crassostrea gigas. J o u r n a l o f th e W o rld A q u a c u ltu r e S o ciety 42{3): 4 4 4 M 5 5 . 10 A ugu st 2012 | V olum e 7 | Issue 8 | e42089 SNP Discovery in A tla n tic Herring 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. P a sh le y C H . Ellis J R . M c C a u le y D E , B u rk e J M (2006). E S T d a ta b a s e s as a s o u rc e fo r m o le c u la r m ark e rs: lesso n s f ro m Helianthus. J o u r n a l o f H e r e d ity 97: 3 8 1 -3 8 8 . F a irc lo th B C (2008) M S A T C O M M A N D E R : d e te c tio n o f m ic ro sa te llite re p e a t a rr a y s a n d a u to m a te d , lo c u s -s p e c ific p r im e r d e sig n . M o le c u la r E c o lo g y R e s o u rc e s 8: 9 2 - 9 4 . R o z e n S, S kaletsky H J (2000) P rim e r3 o n th e W W W fo r g e n e ra l u sers a n d fo r b io lo g ist p r o g ra m m e rs . In : B io in fo rm a tic s M e th o d s a n d P ro to co ls: M e th o d s in M o le c u la r B iology {eds K r a w e tz S, M is e n e r S). H u m a n a P ress, T o to w a , N J. C a s to e T A , P o o le A W , G u W , J a s o n d e K o n in g A P , D a z a J M , e t al. (2010) R a p id id e n tific a tio n o f th o u sa n d s o f c o p p e rh e a d sn a k e (.Agkistrodon contortrix) m ic ro sa te llite loci fro m m o d e s t a m o u n ts o f 4 5 4 s h o tg u n g e n o m e s e q u en c e . M o le c u la r E c o lo g y R e so u rce slO : 3 4 1 -3 4 7 . F a n J B , O l i p h a n t A , S h e n R , K e r m a n i B G , G a rc ia F , e t al. (2003) H ig h ly p a ra lle l S N P g e n o ty p in g . C o ld S p rin g H a r b o r S y m p o sia o n Q u a n tita tiv e B iology 68: 6 9 - 7 8 . L i C , O r t i G (2007) M o le c u la r p h y lo g e n y o f C lu p e ifo rm e s (A ctin o p tery g ii) in fe rre d fro m n u c le a r a n d m ito c h o n d ria l D N A se q u en c e s. M o le c u la r P h y lo g e n etics a n d E v o lu tio n 44: 3 8 6 -3 9 8 . P e a k a ll R , S m o u se P E (2006) G E N A L E X 6: g e n e d c an aly sis in E xcel. P o p u la tio n g e n e tic so ftw a re fo r te a c h in g a n d re s e a rc h . M o le c u la r E c o lo g y N o te s 6: 2 8 8 -2 9 5 . R o u s s e t F (2008) G E N E P O P 5 0 0 7 : a c o m p le te re -im p le m e n ta tio n o f th e G E N E P O P so ftw a re fo r W in d o w s a n d L in u x . M o le c u la r E c o lo g y R e so u rc e s 8(1): 1 0 3 -1 0 6 . B e n ja m in i Y , Y ek u tie li D (2001) T h e c o n tr o l o f th e false d isco v ery r a te in m u ltip le tes tin g u n d e r d e p e n d e n c y . A n n a ls o f S tatistics 29: 1165—1188. A lb re c h ts e n A , N ie lse n F C , N ie lse n R (2010) A s c e rta in m e n t B iases in S N P C h ip s A ffect M e a s u re s o f P o p u la tio n D iv e rg e n c e . M o le c u la r B io lo g y a n d E v o lu tio n 27: 2534—2547. R o s e n b lu m E B , N o v e m b re J (2007) A s c e r ta in m e n t b ia s in sp a tia lly s tru c tu re d p o p u la tio n s: A ca se stu d y in th e e a ste rn fen c e liz a rd . J o u r n a l o f H e r e d ity 98: 40. 41. 42. S eeb J E , P a s c a l C E , G r a u E D , S e e b L W , T e m p lin W D , e t al. (2011) T ra n s c r ip to m e s e q u e n c in g a n d h ig h -re so lu tio n m e lt analysis a d v a n c e single n u c le o tid e p o ly m o rp h is m d isc o v e ry in d u p lic a te d sa lm o n id s. M o le c u la r E c o lo g y R e s o u rc e s 11(S1): 33 5 —348. M a rg u lie s M , E g h o lm M , A ltm a n W E , A ttiy a S, B a d e r J S , e t al. (2005) G e n o m e s e q u e n c in g in m ic ro fa b ric a te d h ig h -d e n sity p ic o litre rea c to rs. N a tu re 437: 376— 380. W a n g S L , S h a Z X , S o n s te g a rd T S , L iu H , X u P , e t al. (2008) Q u a lity a sse ssm e n t p a ra m e te r s fo r E S T -d e riv e d S N P s f ro m catfish . B M C G e n o m ic s 9: 43. 450. C r a ig D W , P e a rs o n J V , S z e lin g e r S, S e k a r A , R e d m a n M , e t al. (2008) I d e n tific a tio n o f g e n e tic v a ria n ts u sin g b a r- c o d e d m u ltip le x e d se q u en c in g . 44. N a tu r e M e th o d s 5: 8 8 7 -8 9 3 . H a r is m e n d y O , N g P C , S tra u s b e rg R L , W a n g X , S to ck w ell T B , e t al. (2009) E v a lu a tio n o f n e x t g e n e ra tio n s e q u e n c in g p la tfo rm s fo r p o p u la tio n ta rg e te d 45. s e q u e n c in g stu d ies. G e n o m e B iology 10: R 3 2 . R a t a n A , Y u Z , H a y e s V M , S c h u s te r S C , M ille r W (2010) C a llin g S N P s w ith o u t a re fe re n c e s e q u en c e . B M C B io in fo rm a tic s 11: 130. 46. 47. 48. 49. 50. 3 3 1 -3 3 6 . M a r th G T , C z a b a rk a E , M u rv a i J , S h e rry S T (2004) T h e a llele fre q u e n c y s p e c tru m in g e n o m e -w id e h u m a n v a ria tio n d a ta rev e a ls sig n als o f d iffe re n tia l d e m o g r a p h ic h isto ry in th re e la rg e w o rld p o p u la tio n s. G e n e tic s 166: 3 5 1 —372. L im b o rg M T , H e ly a r SJ, d e B ru y n M , T a y lo r M I, N ie lse n E E , e t al. (2012) E n v iro n m e n ta l se le c tio n o n tra n s c rip to m e -d e riv e d S N P s in a h ig h g e n e flow m a r in e fish, th e A tla n tic h e rr in g {Clupea harengus). M o le c u la r E c o lo g y doi: 10.1111 / j . 13 6 5 -2 9 4 X .2 0 1 2 .0 5 6 3 9 .x . G e ra ld e s A , P a n g J , T h ie ss e n N , C e z a rd T , M o o re R , e t al. (2011) S N P d isc o v e ry in b la c k c o tto n w o o d (.Populous trichocarpa) b y p o p u la tio n tra n s c rip to m e 51. 52. 53. O g d e n R (2011) U n lo c k in g th e p o te n tia l fo r g e n o m ic tec h n o lo g ie s fo r w ildlife fo ren sics. M o le c u la r E c o lo g y R e s o u rc e s 11{S1): 109—116. D a v e y J W , H o h e n lo h e P A , E tte r P D , B o o n e J Q C a tc h e n J M , e t al. (2011) G e n o m e -w id e g e n e tic m a r k e r d isc o v e ry a n d g e n o ty p in g u s in g n e x t-g e n e ra tio n s e q u en c in g . N a tu re R e v ie w G e n e tic s 12: 4 9 9 —510. H o h e n lo h e P A , A m ish SJ, C a tc h e n J M , A lle n d o rf F W , L u ik a rt G (2011) N e x tg e n e ra tio n R A D s e q u e n c in g id e n tifie s th o u s a n d s o f S N P s fo r a sse ssin g h y b rid iz a tio n b e tw e e n r a in b o w a n d w e stslo p e c u tth r o a t tro u t. M o le c u la r E c o lo g y R e so u rc e s 11{S1): 1 1 7 -1 2 2 . V a n T a ssell C P , S m ith T L , M a tu k u m a lli L K , T a y lo r J F , S c h n a b e l R D , e t al. (2008) S N P d isc o v e ry a n d allele fre q u e n c y e s tim a tio n b y d e e p s e q u e n c in g o f r e d u c e d re p r e s e n ta tio n lib ra rie s. N a tu re M e th o d s 5: 2 4 7 —252. E tte r P D , P re sto n J L , B a ssh a m S, C re sk o W A , J o h n s o n E A (2011) L o c a l De Mom A ssem b ly o f R A D P a ire d - E n d C o n tig s U s in g S h o r t S e q u e n c in g R e a d s . P L oS O N E 6(4): e l8 5 6 1 . N ie lse n E , C a r ia n i A , M a c A o id h E , M a e s G , M ila n o I, e t al. (2012) G e n e a sso c ia te d m a rk e rs p ro v id e to o ls fo r ta c k lin g illegal fish in g a n d false ecoc e rtifica tio n . N a tu r e C o m m u n ic a tio n s 3: 851 doi: 1 0 .1 0 3 8 /n c o m m s l8 4 5 . N a r u m S R , H e ss J E (2011) C o m p a ris o n o f F - S T o u tlie r tests fo r S N P lo ci u n d e r selectio n . M o le c u la r E c o lo g y R e s o u rc e s 11 (SI): 184—194. L e p o itte v in C , F r ig e r io J M , G a r n ie r - G é ré P , S a lin F , C e r v e ra M T , e t al. (2010) In vitro vs in silico d e te c te d S N P s fo r th e d e v e lo p m e n t o f a g e n o ty p in g a rra y : w h a t c a n w e le a r n f ro m a n o n -m o d e l species? P L o S O n e 5: e l 1034. res e q u e n c in g . M o le c u la r E c o lo g y R e s o u rc e s 11(S1): 8 1 —92. PLOS ONE I w w w .plosone.org 11 A ugu st 2012 | V olum e 7 | Issue 8 | e42089

(Clupea harengus) Sequencing in Atlantic Herring

Related documents

Products

Support

(Clupea harengus) Sequencing in Atlantic Herring

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib