SNP Assisted Breeding for Sunflower Improvement Pegadaraju Venkatramana1, Brent Hulke2, Lili Qi2 Quentin Schultz1 1 Biodiagnostics Inc, 507 Highland Drive River Falls, WI-54022, venki@biodiagnostics.net; Quentin.Schultz@biodiagnostics.net 2 USDA-ARS-NCSL,1605 Albrecht Blvd N Fargo, ND 58102-2765 USA, Brent.Hulke@ARS.USDA.GOV,lili.qi@ars.usda.gov ABSTRACT Application of Single Nucleotide Polymorphism (SNP) marker technology as a tool in sunflower breeding programs offers enormous potential to improve sunflower genetics, and facilitate faster release of sunflower hybrids to market place. Through a National Sunflower Association (NSA) funded initiative, we report on the process of SNP discovery through reductive sequencing and local assembly of the sunflower genome using a combination of Restriction site Associated DNA (RAD) protocols and Illumina/Solexa paired-end sequencing chemistry. DNA from six selected sunflower lines was processed into RAD libraries, sequenced on a Genome Analyzer IIx and then analyzed to identify putative SNP markers. Raw data from a single cultivar was first coalesced into 18.9 Mb of assembled sunflower genomic sequence distributed across 50,760 unique sequence contigs. This assembly served as a framework for pair wise sequence alignment and marker discovery using data obtained from the other sunflower lines. Approximately 16,000 putative sequence variations (SNPs and InDels) with characteristics favorable for the Illumina Assay Design Tool were mined. A final chip consisting of 8700 SNP markers was synthesized and used in genotyping a diverse panel of 1200 sunflower accessions to validate SNP performance and understand the pattern of diversity in sunflower. A significant proportion of SNP markers that were identified using the RAD approach successfully translated into functional infinium assays. This tool can widely be employed in a range of sunflower molecular breeding applications such as, association mapping, genome-wide selection, high-resolution QTL mapping, marker-assisted trait introgression and seed production QA/QC applications. Key Words: Single Nucleotide Polymorphism (SNP) - Restriction –site Associated DNA (RAD) - InsertionsDeletions (InDels) INTRODUCTION Domestic sunflower (Helianthus annus) is native to North America and widely cultivated as oil seed and confection crop type. Besides being economically important, sunflower also serves as model in ecological & evolutionary studies. Primary focus in both public and private sunflower breeding programs is oriented towards enhancing intrinsic yield, improving oil content and disease resistance traits. Application of modern breeding tools such as molecular markers improves efficiency of plant selection, saves time and provides accuracy in a breeding program. (Collard et al., 2005). A wide range of molecular markers such as RFLP, AFLP, SSR & TRAP developed in sunflower have successfully enabled construction of high density genetic maps and has led to identification of QTLs for several agronomically important traits (Tang et al., 2003). However, practical usage of these makers for routine breeding purposes has been limited due to high assay cost, non-reproducibility and lack of QTL validations studies. In recent years, SNP markers have gained popularity in crop breeding programs, due to their low cost, high throughput efficiency and abundance. SNPs are the most preferred marker types in studies such as association mapping and genome selection that involve scanning whole genomes with extremely high marker densities to identify the causal polymorphisms within genes. As opposed to traditional bi-parental QTL mapping, association mapping exploits historic recombination occurring within natural populations to map traits with increased resolution. Despite these promises, association studies have not been reported in sunflower. It is estimated that due to low linkage disequilibrium and high haplotype diversity, SNPs in the order of several thousand would be need to successfully conduct association analysis in sunflower. Until recently only a limited number of SNP markers have been reported in sunflower and these were identified primarily using candidate gene re-sequencing and by in-silco mining of sunflower EST (Expressed Sequence Tags) sequences available in data base repositories (Fusari et al., 2008, Kolkman et al., 2007, Lai et al., 2005) However, none of the published SNPs in sunflowers have been validated on array based platforms to evaluate assay performance. Large scale discovery of genome-wide distributed SNPs can be effectively conducted with the aid of massively–parallel, next-generation sequencing (NGS) technologies on species with & without reference genome. In species lacking reference genome complexity reduction strategies have been explored that reduce the representation of low-information-content repetitive sequencing in the sequencing fraction. Usage of restriction enzymes serves as a simpler, quicker and highly efficient method for enriching low copy regions in sequencing. Restriction Site-Associated DNA sequencing (RAD-Seq) is novel method for SNP detection in genomes and is based on identifying polymorphic variants adjacent to the given restriction enzyme recognition sites (Miller et al., 2007, Baird et al., 2008). Here we demonstrate the application of paired-end RAD-Seq for simultaneous genomic assembly and SNP discovery in Helianthus annuus (sunflower). We report our findings with paired-end RAD-Seq in de novo assembly of sunflower contigs spanning hundreds of basepairs and provide statistics on the use of this sequence resource for design of SNP marker panels though Illumina Infinium Genotyping Technology. MATERIALS & METHODS Plant material and DNA extraction Leaf material from germinated seedlings was lyophilized and used for DNA isolation. Lyophilized leaf tissue was manually disrupted and the resulting samples were extracted following the DNeasy Plant Mini Kit protocol using filter plates (Qiagen, part #69181). The same elution buffer was passed through the filters a second time to maximize DNA concentration. DNA concentrations were estimated using the PicoGreen using protocols as per manufactures instructions (Invitrogen, part#P7581). Concentrations above 50ng/ul were obtained for all the samples. RAD library preparation protocols Genomic DNA from six selected sunflower accessions (TX16R, CR29, SEEDS2000 B-Line, HA467, RHA468, RHA464) was digested with the restriction endonuclease PstI and processed into RAD libraries similarly to the method of Baird, et al, 2008. Briefly, ~300 ng of genomic DNA was digested for 60 min at 37°C in a 50 µL reaction with 20 units (U) of PstI (New England Biolabs [NEB]). After digestion, samples were heat-inactivated for 20 min at 65°C followed by addition of 2.0 µL of 100 nM P1 Adapter(s), a modified Solexa© adapter (2006 Illumina, Inc., all rights reserved). PstI P1 adapters each contained a unique multiplex sequence index (barcode) which is read during the first four nucleotides of the Illumina sequence read. 100 P1 nM adaptor were added to each sample along with 1 µL of 10 mM rATP (Promega), 1 µL 10× NEB Buffer 4, 1.0 µL (1000 U) T4 DNA Ligase (high concentration, Enzymatics, Inc), 5 µL H2O and incubated at room temperature (RT) for 20 min. After subsequent purification, 1 µL of 10 µM P2 adapter, a divergent modified Solexa© adapter (2006 Illumina, Inc., all rights reserved), was ligated to the obtained DNA fragments at 18°C. Samples were again purified and eluted in 50 µL. Six RAD libraries corresponding to TX16R, CR29, SEEDS2000 B-Line, HA467, RHA468, and RHA464 were run on an Illumina Genome Analyzer IIx at the University of Oregon High Throughput Sequencing Facility in Eugene, Oregon. Illumina / Solexa protocols were followed for paired end (2x60bp) sequencing chemistry. Bioinformatics - Paired-end RAD-Seq Assembly Paired-end RAD-Seq uses mate-paired Illumina / Solexa data to assemble DNA sequence adjacent to restriction enzyme cleavage sites in a target genome. Unlike randomized short-insert paired end (SIPE) Illumina libraries, paired-end RAD-Seq sequence data is characterized by a common or shared single end (SE) read that is anchored by the restriction enzyme digestion site and a variable paired end read generated during a shearing step during library construction. To faciltate variant discovery in Sunflower, a species without a reference genome, paired-end RAD-Seq contigs, data from RHA464 was used to construct a reference assembly for SNP detection. First, sequences with >5 poor Illumina quality scores (converted phred score of Q10 or lower) or were discarded (typically <5% of all data). Remaining reads were then collapsed into RAD sequence clusters which share 100% sequence identity across the single end Illumina read. The variable paired end sequences for each common single end locus were extracted from these filtered sequences and passed to the Velvet sequence assembler for contig assembly (Zerbino and Birney, 2008). RESULTS Paired-end RAD-Seq de novo assembly & SNP discovery The paired-end RAD-Seq strategy was employed in sequencing all the six sunflower lines Fig. 1 describes the process of RAD library preparation, sequencing and the assembly of long reads. Figure 1: Paired-end RAD-Seq denovo assembly & SNP discovery A sunflower reference genome was created by assembling raw illumina reads from RHA464, a total of 9,016,941 paired end illumina reads (2x60bp) were coalesced using custom perl scripts and Velvet program to form long read contigs. After initial assembly, contigs were aligned against a custom database to remove sequences with significant plastid homology, 50,726 contigs covering 18 Mbp of the sunflower genome remained. The mean length for RHA464 assemblies was 400bp (range 220-600bp; N50:379bp) refere Figure 2. These served as a reference scaffold for sequence alignment of Illumina data from the other cultivars. Sequence alignment and variant calling was accomplished though use of internal bioinformatics tools. Contigs containing putative polymorphisms were evaluated for the presence of repetitive elements using the Arabidopsis, Panicoid, Triticale & Rice repeat masker database and approximately 2% of the nucleotides were masked within the RAD-Seq sunflower assemblies (data not shown). Figure 2: Distribution of reference cultivar contigs Overall we identified a total of 233,335 SNPs and 5280 InDels across six sunflower accessions (Table1). Table1: Variant Detection Summary Based on the number of variants cataloged and the amount of sunflower sequence generated a frequency of 1 SNP per 81bp and 1InDel every 3574bp was calculated. After additional bioinformatics filtering a total of 16394 SNPs were selected that had a clear 50bp sequence flanking the polymorphic site for designing Illumina Infinium genotyping assay. DISCUSSION Paired-end RAD-Seq Assembly Previously RAD-Seq has been used for both SNP discovery and genotyping applications in a number of major plant systems. Here we demonstrate the use of paired-end RAD sequencing to enable efficient, cost-effective, high throughput marker development in Helianthus annuus, a major crop without an assembled genome sequence. Comparison of RAD-Seq assemblies from RHA464 to preexisting sunflower unigenes generated from Sanger dideoxy sequencing indicates substantial sequence concordance is observed between the two datasets. The high sequence coverage inherent in the RAD-Seq minimizes sequencing and assembly errors as each nucleotide in the contig is derived from consensus of many overlapping Illumina sequence reads. Alignment of sequence data from six sunflower breeding lines to the RHA reference revealed the presence of over 233,000 putative SNPs scattered across the 18.8 Mb of assembled genome sequence. This level of sequence diversity translates to a rate of approximately 1 SNP observed per 81 bp of genomic sequence which falls in general agreement with levels of nucleotide diversity observed in other sunflower studies 1 SNP per 69 bp, (Fusari, BMC Genomics, 2008) and 1 SNP per 45 bp (Kolkman, Genetics, 2007). From less than a flowcell of Illumina 2×60 bp sequence data we.have assembled over 50,000 high-quality sequence contigs with an N50 individual contig length of ~400 nucleotides and identified more than 230,000 candidate SNPs from analysis of six sunflower breeding lines. Thus, many thousands of SNPs can be rapidly identified at a low cost, in a format suitable for high-throughput genotyping. References: Tang, S., V.K. Kishore, S.J.Knapp. 2003. PCR-multiplexes for a genome-wide framework of simple sequence repeat marker loci in cultivated sunflower. Theor Appl Genet. Jun; 107(1):6-19 Baird, N.A., Etter, P.D., Atwood, T.S., Currey, M.C., Shiver, A.L., Lewis, Z.A., Selker, E.U., Cresko, W.A., Johnson, E.A. 2008. Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers. Plos One. 3(10) Miller, M.R., Dunham, J.P., Amores, A., Cresko, W.A., Johnson, E.A., 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research. 17(2):240-248. Lai , Z. K., Livingstone, Y. Zou, S. A. Church, S. J. Knapp, J. Andrews, and L. H. Rieseberg. 2005. Identification and mapping of SNPs from ESTs in sunflower Theor Appl Genet. November; 111(8): 1532–1544. Chengsong Zhu, Michael Gore, Edward, S. Buckler and Jianming Yu. 2008. Status and Prospects of Association Mapping in Plants. Vol. 1 No. 1, p. 5-20 Corina, M. Fusari, Verónica, V. Lia, H. Esteban Hopp, Ruth, A. Heinz and Norma, B. Paniego. 2008. Identification of Single Nucleotide Polymorphisms and analysis of Linkage Disequilibrium in sunflower elite inbred lines using the candidate gene approach. BMC Plant Biology.8:7 Judith, M. Kolkman, Simon, T. Berry, Alberto, J. Leon, Mary, B. Slabaugh, Shunxue Tang Wenxiang Gao, David, K. Shintani, John, M. Burke and Steven, J. Knapp. 2007. Single Nucleotide Polymorphisms and Linkage Disequilibrium in Sunflower. Genetics. vol. 177 no. 1 457-468