SNP Assisted Breeding for Sunflower Improvement

Pegadaraju Venkatramana1, Brent Hulke2, Lili Qi2, Quentin Schultz1
1 Biodiagnostics Inc, 507 Highland Drive, River Falls, WI 54022, venki@biodiagnostics.net;
Quentin.Schultz@biodiagnostics.net
2 USDA-ARS-NCSL, 1605 Albrecht Blvd N, Fargo, ND 58102-2765 USA,
Brent.Hulke@ARS.USDA.GOV, lili.qi@ars.usda.gov
ABSTRACT

Application of Single Nucleotide Polymorphism (SNP) marker technology as a tool in sunflower
breeding programs offers enormous potential to improve sunflower genetics and to facilitate faster release of
sunflower hybrids to the marketplace.

Through a National Sunflower Association (NSA) funded initiative, we report on the process of SNP
discovery through reductive sequencing and local assembly of the sunflower genome using a combination of
Restriction site Associated DNA (RAD) protocols and Illumina/Solexa paired-end sequencing chemistry. DNA
from six selected sunflower lines was processed into RAD libraries, sequenced on a Genome Analyzer IIx and
then analyzed to identify putative SNP markers.

Raw data from a single cultivar was first coalesced into 18.9 Mb of assembled sunflower genomic
sequence distributed across 50,760 unique sequence contigs. This assembly served as a framework for pairwise
sequence alignment and marker discovery using data obtained from the other sunflower lines. Approximately
16,000 putative sequence variations (SNPs and InDels) with characteristics favorable for the Illumina Assay
Design Tool were mined. A final chip consisting of 8700 SNP markers was synthesized and used in genotyping
a diverse panel of 1200 sunflower accessions to validate SNP performance and understand the pattern of
diversity in sunflower.

A significant proportion of the SNP markers identified using the RAD approach successfully
translated into functional Infinium assays.

This tool can be widely employed in a range of sunflower molecular breeding applications such as
association mapping, genome-wide selection, high-resolution QTL mapping, marker-assisted trait introgression
and seed production QA/QC.
Key Words: Single Nucleotide Polymorphism (SNP) - Restriction-site Associated DNA (RAD) - Insertions/Deletions (InDels)
INTRODUCTION
Domestic sunflower (Helianthus annuus) is native to North America and widely cultivated as an oilseed and
confection crop. Besides being economically important, sunflower also serves as a model in ecological and
evolutionary studies. The primary focus in both public and private sunflower breeding programs is on
enhancing intrinsic yield and improving oil content and disease resistance traits. Application of modern breeding
tools such as molecular markers improves the efficiency of plant selection, saves time and provides accuracy in a
breeding program (Collard et al., 2005). A wide range of molecular markers such as RFLP, AFLP, SSR and
TRAP developed in sunflower have successfully enabled construction of high-density genetic maps and have led
to identification of QTLs for several agronomically important traits (Tang et al., 2003). However, practical
usage of these markers for routine breeding purposes has been limited due to high assay cost, non-reproducibility
and a lack of QTL validation studies.
In recent years, SNP markers have gained popularity in crop breeding programs due to their low cost, high
throughput efficiency and abundance. SNPs are the preferred marker type in studies such as association
mapping and genome-wide selection that involve scanning whole genomes with extremely high marker densities to
identify the causal polymorphisms within genes. As opposed to traditional bi-parental QTL mapping,
association mapping exploits historic recombination occurring within natural populations to map traits with
increased resolution. Despite these promises, association studies have not been reported in sunflower. It is
estimated that, due to low linkage disequilibrium and high haplotype diversity, SNPs in the order of several
thousand would be needed to successfully conduct association analysis in sunflower. Until recently, only a limited
number of SNP markers had been reported in sunflower, and these were identified primarily using candidate
gene re-sequencing and by in silico mining of sunflower EST (Expressed Sequence Tag) sequences available in
database repositories (Fusari et al., 2008; Kolkman et al., 2007; Lai et al., 2005). However, none of the
published SNPs in sunflower have been validated on array-based platforms to evaluate assay performance.
Large-scale discovery of genome-wide distributed SNPs can be effectively conducted with the aid of
massively parallel, next-generation sequencing (NGS) technologies in species with and without a reference
genome. In species lacking a reference genome, complexity reduction strategies have been explored that reduce
the representation of low-information-content repetitive sequences in the sequenced fraction. Usage of
restriction enzymes serves as a simple, quick and highly efficient method for enriching low-copy regions in
sequencing. Restriction Site-Associated DNA sequencing (RAD-Seq) is a novel method for SNP detection in
genomes and is based on identifying polymorphic variants adjacent to a given restriction enzyme recognition
site (Miller et al., 2007; Baird et al., 2008).
Here we demonstrate the application of paired-end RAD-Seq for simultaneous genomic assembly and SNP
discovery in Helianthus annuus (sunflower). We report our findings with paired-end RAD-Seq in de novo
assembly of sunflower contigs spanning hundreds of base pairs and provide statistics on the use of this sequence
resource for design of SNP marker panels through Illumina Infinium Genotyping Technology.
MATERIALS & METHODS
Plant material and DNA extraction
Leaf material from germinated seedlings was lyophilized and used for DNA isolation. Lyophilized leaf tissue
was manually disrupted and the resulting samples were extracted following the DNeasy Plant Mini Kit protocol
using filter plates (Qiagen, part #69181). The same elution buffer was passed through the filters a second time
to maximize DNA concentration. DNA concentrations were estimated using the PicoGreen assay according to the
manufacturer's instructions (Invitrogen, part #P7581). Concentrations above 50 ng/µL were obtained for all the
samples.
RAD library preparation protocols
Genomic DNA from six selected sunflower accessions (TX16R, CR29, SEEDS2000 B-Line, HA467, RHA468,
RHA464) was digested with the restriction endonuclease PstI and processed into RAD libraries similarly to the
method of Baird et al. (2008). Briefly, ~300 ng of genomic DNA was digested for 60 min at 37°C in a 50 µL
reaction with 20 units (U) of PstI (New England Biolabs [NEB]). After digestion, samples were heat-inactivated
for 20 min at 65°C, followed by addition of 2.0 µL of 100 nM P1 adapter, a modified Solexa© adapter (2006
Illumina, Inc., all rights reserved). Each PstI P1 adapter contained a unique multiplex sequence index (barcode)
that is read as the first four nucleotides of the Illumina sequence read. The P1 adapter was added to
each sample along with 1 µL of 10 mM rATP (Promega), 1 µL of 10× NEB Buffer 4, 1.0 µL (1000 U) of T4 DNA
Ligase (high concentration, Enzymatics, Inc) and 5 µL of H2O, and the reaction was incubated at room
temperature (RT) for 20 min.
After subsequent purification, 1 µL of 10 µM P2 adapter, a divergent modified Solexa© adapter (2006 Illumina,
Inc., all rights reserved), was ligated to the obtained DNA fragments at 18°C. Samples were again purified and
eluted in 50 µL. Six RAD libraries corresponding to TX16R, CR29, SEEDS2000 B-Line, HA467, RHA468
and RHA464 were run on an Illumina Genome Analyzer IIx at the University of Oregon High Throughput
Sequencing Facility in Eugene, Oregon. Illumina/Solexa protocols were followed for paired-end (2x60bp)
sequencing chemistry.
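Because each P1 adapter carries a four-nucleotide barcode at the start of the read, pooled reads can be assigned back to their source line before any assembly. The following is a minimal sketch of that step, not the pipeline actually used; the barcode sequences, the barcode-to-line mapping and the example reads are hypothetical, since the real barcodes are not given here.

```python
from collections import defaultdict

BARCODE_LEN = 4  # the barcode occupies the first four nucleotides of the read

# Hypothetical barcode-to-line assignment (the real barcodes are not reported).
BARCODES = {
    "ACGT": "TX16R",
    "CATG": "CR29",
    "GTCA": "RHA464",
}


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by their leading barcode and trim the barcode off."""
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:BARCODE_LEN], read[BARCODE_LEN:]
        sample = barcodes.get(barcode)
        if sample is None:
            # Unknown or unreadable barcode: keep the untrimmed read aside.
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample)


# Made-up example reads, for illustration only.
reads = ["ACGTTGCAGGATTACA", "GTCATGCAGCCCTTTA", "NNNNTGCAGAAAA"]
result = demultiplex(reads)
for sample, sample_reads in result.items():
    print(sample, sample_reads)
```

This sketch requires an exact barcode match; a production demultiplexer would typically also decide how to handle barcodes containing sequencing errors.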
Bioinformatics - Paired-end RAD-Seq Assembly
Paired-end RAD-Seq uses mate-paired Illumina/Solexa data to assemble DNA sequence adjacent to restriction
enzyme cleavage sites in a target genome. Unlike randomized short-insert paired-end (SIPE) Illumina libraries,
paired-end RAD-Seq sequence data is characterized by a common or shared single end (SE) read that is
anchored by the restriction enzyme digestion site and a variable paired-end read generated by a shearing step
during library construction. To facilitate variant discovery in sunflower, a species without a reference genome,
paired-end RAD-Seq data from RHA464 was used to construct a reference assembly of contigs for SNP detection.
First, sequences with >5 poor Illumina quality scores (converted phred score of Q10 or lower) were discarded
(typically <5% of all data). Remaining reads were then collapsed into RAD sequence clusters that share 100%
sequence identity across the single end Illumina read. The variable paired-end sequences for each common
single end locus were extracted from these filtered sequences and passed to the Velvet sequence assembler for
contig assembly (Zerbino and Birney, 2008).
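The filtering and clustering steps described above can be sketched as follows. This is an illustrative sketch only and not the custom scripts used in this work: it assumes quality scores are already converted to integer phred values, and it assumes the quality filter is applied to both reads of a pair, which is not stated in the text. The function names and example reads are hypothetical.

```python
from collections import defaultdict

MAX_LOW_QUALITY_BASES = 5   # reads with more than 5 poor scores are discarded
LOW_QUALITY_THRESHOLD = 10  # "poor" means a phred score of Q10 or lower


def passes_quality(quals):
    """Return True if the read has at most 5 bases at or below Q10."""
    low = sum(1 for q in quals if q <= LOW_QUALITY_THRESHOLD)
    return low <= MAX_LOW_QUALITY_BASES


def cluster_by_single_end(read_pairs):
    """Collapse read pairs into clusters keyed by the exact single-end read.

    read_pairs: iterable of (se_seq, se_quals, pe_seq, pe_quals).
    Returns {se_seq: [pe_seq, ...]}; each list would then be handed to an
    assembler such as Velvet (the assembly itself is not shown here).
    """
    clusters = defaultdict(list)
    for se_seq, se_quals, pe_seq, pe_quals in read_pairs:
        # Assumption: both mates must pass the filter for the pair to be kept.
        if passes_quality(se_quals) and passes_quality(pe_quals):
            clusters[se_seq].append(pe_seq)
    return dict(clusters)


# Made-up toy reads, for illustration only.
pairs = [
    ("TGCAGAAC", [30] * 8, "GGATCCAA", [30] * 8),
    ("TGCAGAAC", [30] * 8, "CCAAGGTT", [30] * 8),
    ("TGCAGTTT", [30] * 8, "ACACACAC", [30] * 8),
    ("TGCAGTTT", [5] * 8, "TTTTGGGG", [30] * 8),  # 8 poor bases: discarded
]
clusters = cluster_by_single_end(pairs)
print(clusters)
```

Keying the clusters on the exact single-end sequence mirrors the 100% identity requirement; any mismatch in the single-end read starts a new cluster.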
RESULTS
Paired-end RAD-Seq de novo assembly & SNP discovery
The paired-end RAD-Seq strategy was employed in sequencing all six sunflower lines. Fig. 1 describes the
process of RAD library preparation, sequencing and the assembly of long reads.
Figure 1: Paired-end RAD-Seq de novo assembly & SNP discovery
A sunflower reference assembly was created from raw Illumina reads of RHA464. A total of
9,016,941 paired-end Illumina reads (2x60bp) were coalesced using custom Perl scripts and the Velvet program to
form long-read contigs. After initial assembly, contigs were aligned against a custom database to remove
sequences with significant plastid homology; 50,726 contigs covering 18 Mbp of the sunflower genome
remained. The mean length of the RHA464 assemblies was 400 bp (range 220-600 bp; N50: 379 bp; see Figure 2).
These served as a reference scaffold for sequence alignment of Illumina data from the other cultivars. Sequence
alignment and variant calling were accomplished through use of internal bioinformatics tools. Contigs containing
putative polymorphisms were evaluated for the presence of repetitive elements using the Arabidopsis, Panicoid,
Triticale and Rice repeat masker databases, and approximately 2% of the nucleotides were masked within the
RAD-Seq sunflower assemblies (data not shown).
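For readers less familiar with the N50 statistic quoted for the assembly, the sketch below shows the conventional way it is computed from a list of contig lengths: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases. The contig lengths in the example are invented for illustration and are not the RHA464 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical contig lengths in bp (not the actual assembly).
example_lengths = [600, 500, 400, 300, 220]
print("N50:", n50(example_lengths))
print("mean:", sum(example_lengths) / len(example_lengths))
```

N50 and the mean can differ noticeably when the length distribution is skewed, so the two figures are best reported together.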
Figure 2: Distribution of reference cultivar contigs
Overall, we identified a total of 233,335 SNPs and 5,280 InDels across the six sunflower accessions (Table 1).
Table 1: Variant Detection Summary
Based on the number of variants cataloged and the amount of sunflower sequence generated, a frequency of 1
SNP per 81 bp and 1 InDel per 3,574 bp was calculated. After additional bioinformatics filtering, a total of
16,394 SNPs were selected that had a clear 50 bp sequence flanking the polymorphic site for designing Illumina
Infinium genotyping assays.
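The variant frequencies can be approximately reproduced by dividing the assembled sequence length by the variant counts. The sketch below assumes the 18.9 Mb assembly size given in the abstract as the denominator; the exact amount of sequence used for the reported figures is not stated, so small differences from the reported values are expected.

```python
def bases_per_variant(assembled_bp, n_variants):
    """Average number of assembled bases per observed variant."""
    return assembled_bp / n_variants


ASSEMBLED_BP = 18_900_000  # assumption: the 18.9 Mb quoted in the abstract
N_SNPS = 233_335           # SNPs reported across the six accessions
N_INDELS = 5_280           # InDels reported across the six accessions

snp_spacing = bases_per_variant(ASSEMBLED_BP, N_SNPS)
indel_spacing = bases_per_variant(ASSEMBLED_BP, N_INDELS)

print(f"about 1 SNP per {snp_spacing:.0f} bp")
print(f"about 1 InDel per {indel_spacing:.0f} bp")
```

With this assumed denominator the SNP rate rounds to 1 per 81 bp, and the InDel rate comes out within a few base pairs of the reported 1 per 3,574 bp.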
DISCUSSION
Paired-end RAD-Seq Assembly
Previously, RAD-Seq has been used for both SNP discovery and genotyping applications in a number of major
plant systems. Here we demonstrate the use of paired-end RAD sequencing to enable efficient, cost-effective,
high-throughput marker development in Helianthus annuus, a major crop without an assembled genome
sequence. Comparison of RAD-Seq assemblies from RHA464 to preexisting sunflower unigenes generated from
Sanger dideoxy sequencing indicates substantial sequence concordance between the two datasets.
The high sequence coverage inherent in RAD-Seq minimizes sequencing and assembly errors, as each
nucleotide in the contig is derived from the consensus of many overlapping Illumina sequence reads. Alignment of
sequence data from six sunflower breeding lines to the RHA464 reference revealed the presence of over 233,000
putative SNPs scattered across the 18.8 Mb of assembled genome sequence. This level of sequence diversity
translates to a rate of approximately 1 SNP per 81 bp of genomic sequence, which is in general
agreement with levels of nucleotide diversity observed in other sunflower studies: 1 SNP per 69 bp (Fusari
et al., 2008) and 1 SNP per 45 bp (Kolkman et al., 2007). From less than a flowcell of Illumina
2×60 bp sequence data we have assembled over 50,000 high-quality sequence contigs with an N50 individual
contig length of ~400 nucleotides and identified more than 230,000 candidate SNPs from analysis of six
sunflower breeding lines. Thus, many thousands of SNPs can be rapidly identified at a low cost, in a format
suitable for high-throughput genotyping.
References:
Tang, S., Kishore, V.K., Knapp, S.J. 2003. PCR-multiplexes for a genome-wide framework of simple sequence
repeat marker loci in cultivated sunflower. Theor Appl Genet. 107(1):6-19.
Baird, N.A., Etter, P.D., Atwood, T.S., Currey, M.C., Shiver, A.L., Lewis, Z.A., Selker, E.U., Cresko, W.A.,
Johnson, E.A. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE.
3(10).
Miller, M.R., Dunham, J.P., Amores, A., Cresko, W.A., Johnson, E.A. 2007. Rapid and cost-effective
polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome
Research. 17(2):240-248.
Lai, Z., Livingstone, K., Zou, Y., Church, S.A., Knapp, S.J., Andrews, J., Rieseberg, L.H. 2005.
Identification and mapping of SNPs from ESTs in sunflower. Theor Appl Genet. 111(8):1532-1544.
Zhu, C., Gore, M., Buckler, E.S., Yu, J. 2008. Status and prospects of association
mapping in plants. Vol. 1 No. 1, p. 5-20.
Fusari, C.M., Lia, V.V., Hopp, H.E., Heinz, R.A., Paniego, N.B. 2008.
Identification of single nucleotide polymorphisms and analysis of linkage disequilibrium in sunflower elite
inbred lines using the candidate gene approach. BMC Plant Biology. 8:7.
Kolkman, J.M., Berry, S.T., Leon, A.J., Slabaugh, M.B., Tang, S., Gao, W.,
Shintani, D.K., Burke, J.M., Knapp, S.J. 2007. Single nucleotide polymorphisms and
linkage disequilibrium in sunflower. Genetics. 177(1):457-468.