High Throughput SNP Discovery & Genotyping
Perry Cregan
Soybean Genomics and Improvement Lab
USDA, ARS, BARC-West, Beltsville, Maryland
Agricultural Research Service

Single Nucleotide Polymorphism: A Working Definition
- Single base changes between homologous DNA fragments
- Plus small insertions and deletions (indels)

1  ..GAATCTTATTATCTATACTATACATAATTATATACTAAT-GGGTATTGTTCTTAT..
2  ..GAATCTTATTATCTATGCTATACATAATTATATACTAATAGGGTATTGTTCTTAT..

Complementary strands of fragments 1 and 2:
1  ..CTTAGAATAATAGATATGATATGTATTAATATATGATTA-CCCATAACAAGAATA..
2  ..CTTAGAATAATAGATACGATATGTATTAATATATGATTATCCCATAACAAGAATA..

Differences between fragments 1 and 2: SNP (A/G) at aligned position 17; SNP (indel, -/A) at aligned position 40.

Initial SNP Discovery and Mapping
SNP discovery using Sanger re-sequencing
- Mostly genic
- BAC-end and BAC subclones
SNP genotyping and mapping
- Sequenom mass spectrometer
- Luminex flow cytometer
- Illumina Inc. GoldenGate™ assay

SNP Discovery in Soybean Unigenes
Re-sequencing
- Design PCR primers to existing 3'-unigene sequence
- Identify sequence tagged sites (STSs) visually (agarose gel) and by sequence analysis
- Determine sequence quality (PHRED) and align sequence traces (PHRAP) from six diverse soybean genotypes
- Analyze assemblies with SNP discovery software for SNP discovery in redundant sequence
- Analyze haplotype variation and database the results
In silico
- From existing expressed sequence tag (EST) data

Initial Assessment of PCR Primers Designed to Soybean Unigenes

Discovery of SNPs in Aligned DNA Sequence Data Using PolyBayes in the Consed Viewer
[Figure: DNA sequence alignment for single nucleotide polymorphism (SNP) discovery in soybean, with labels for the SNP discovery software and a SNP]

SNP Discovery in Soybean Unigenes
Primer sets designed and tested ................................ 9459
Primer sets producing a single PCR product ..................... 6290 (66.5%)
High quality sequence data for all 6 SNP discovery genotypes ... 4240 (44.8%)
Genes with at least one SNP .................................... 2032 (21.5%)
Data from: Choi et al. (2007) Genetics 176: 685-696

SNP Analysis Using the Illumina, Inc. GoldenGate™ Assay: A Three Step Process
1. Allele Specific Extension and Ligation
2. PCR Amplification
3.
Hybridization to the Universal Sentrix® Array Matrix

Allele Specific Extension and Ligation
[Diagram labels: genomic DNA; SNP sites [T/C] and [T/A]; allele bases A and G; polymerase; ligase; Universal PCR Sequence 1; Universal PCR Sequence 2; illumiCode' address; Universal PCR Sequence 3']
- Custom Oligo Pool All (OPA)
- 96-1,536 SNPs multiplexed
- Total oligos in reaction: 288-4,608 (three oligos per SNP)

PCR Amplification
[Diagram labels: amplification template; PCR with common primers; Cy3 Universal Primer 1; Cy5 Universal Primer 2; Universal Primer P3; illumiCode #561]

Hybridization to Sentrix® Array Matrix
[Diagram labels: illumiCode #561, illumiCode #217, illumiCode #1024; Sentrix® Array Matrix with scale labels 1.5 mm, 400 mm, 10 mm]
- SNP #561: G/G
- SNP #217: A/A
- SNP #1024: C/T

The Illumina BeadStation 500G permits high throughput analysis of thousands of SNP DNA markers in hundreds of genotypes in less than one week.

Genetic Mapping
- Three mapping populations of 89 individuals each
- Total markers = 6521
  - 1008 SSR
  - 3959 SNP
  - 637 RFLP
  - 14 Classical
  - 3 Isozyme
- Total map length: 2393.7 centiMorgans

A Set of 1536 SNPs with Maximal Genome Coverage and High Minor Allele Frequency
- 3110 working GoldenGate assays
- All SNPs have been genetically mapped
- All SNPs analyzed on diverse exotic and elite soybean germplasm lines
  - 96 diverse Asian introductions from China, Korea, and Japan, collected from 22-50 degrees N and 104-140 degrees E
  - 96 N. American released cultivars, selected by cluster analysis of pedigree data to maximize diversity

The Costs: Reagents for Whole Genome Scans of 96 Genotypes Using an Optimized Set of 1536 SNPs
[Chart: reagent cost in $ per set of 96 genotypes (y-axis, $0 to $12,000) versus number of sets of 96 genotypes (x-axis, 0 to 70)]

Accelerated SNP Discovery
- Creation of a Reduced Representation Genome Library
  - Digest genomic DNA with a combination of five blunt-end restriction endonucleases
  - Select a combination of restriction enzymes such that approx.
5% of the genome is present in the 110-140 bp fraction
- Solexa sequence analysis of the Reduced Representation Library
- SNP discovery via alignment of the Solexa reads with the Williams 82 whole genome sequence from the DOE, JGI

Creation of a Reduced Representation Genome Library
[Gel image labels: New England BioLabs 100 bp and 50 bp ladders (both sides); PI 468916 genomic DNA (4 µl = 50 ng); mix of 5 genomic DNAs (12 µl = 50 ng); size markers at 200 bp, 150 bp, 100 bp, 50 bp]

Solexa Resequencing Results: PI 468916 Reduced Representation Library

No. of occurrences    No. of unique  No. of 33-base    Total bases
of a given 33mer             33mers           reads
500 plus                      1,293       2,142,203     70,692,699
300-500                       1,414         536,701     17,711,133
100-299                       6,561       1,097,771     36,226,443
35-99                        15,119         871,270     28,751,910
20-34                        14,510         374,818     12,368,994
15-20                        12,040         206,542      6,815,886
11-14                        15,645         192,046      6,337,518
9-10                         15,234         143,581      4,738,173
7-8                          29,215         216,367      7,140,111
6                            26,648         159,888      5,276,304
5                            43,105         215,525      7,112,325
4                            72,955         291,820      9,630,060
3                           130,555         391,665     12,924,945
2                           259,225         518,450     17,108,850
1                         1,312,518       1,312,518     43,313,094
TOTAL                     1,956,037       8,671,165    286,148,445

[Figure: aligned reads with the position of the SNP marked; green arrows indicate reads that are unique to one genome position]

Conclusions
- Approx. 20,000 SNPs discovered in 6000+ sequence tagged sites via Sanger re-sequencing
- Linkage analysis: in the near future we will have an optimized set of 1536 GoldenGate assays for high throughput QTL analysis
- The Solexa analysis will greatly accelerate SNP discovery
- The Illumina Infinium assay will provide an order of magnitude greater genotyping capacity
- Association analysis
  - The Illumina Infinium assay will allow estimates of linkage disequilibrium across the soybean genome
  - Creating an association panel of 2400 cultivated soybean genotypes
  - Phenotyping will become the limiting factor

Collaborators
- David Hyten & Lakshmi Matukumalli, USDA-ARS, Beltsville, MD
- Qijian Song & Eun-Young Hwang, Univ.
of Maryland
- Ik-Young Choi, Seoul National University, South Korea
- James Specht, Univ. of Nebraska
- Randy Shoemaker, Steve Cannon & Michelle Graham, USDA-ARS, Ames, IA
- Greg May & Andrew Farmer, NCGR, Santa Fe, NM
- Randall Nelson, USDA-ARS, Urbana, IL
- Tommy Carter, Jr., USDA-ARS, Raleigh, NC
- Kevin Chase & K. Gordon Lark, Univ. of Utah

Funding Support
- USDA-ARS
- United Soybean Board
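
Supplementary sketch: the working definition of a SNP

The working definition at the start of the talk (single base changes plus small indels between homologous fragments) can be illustrated by comparing the two aligned example fragments column by column. This is only a minimal sketch, not the PolyBayes method used in the talk: the function name `find_variants` is hypothetical, and it assumes the sequences are already aligned and padded with `-` for gaps.

```python
def find_variants(seq1, seq2, gap="-"):
    """Compare two pre-aligned sequences of equal length, column by column.

    Returns a list of (position, base1, base2, kind) tuples. Positions are
    1-based; kind is "indel" when either sequence has a gap in that column
    and "SNP" for a plain base substitution.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    variants = []
    for pos, (a, b) in enumerate(zip(seq1, seq2), start=1):
        if a == b:
            continue  # identical column, nothing to report
        kind = "indel" if gap in (a, b) else "SNP"
        variants.append((pos, a, b, kind))
    return variants


# The two aligned fragments from the "working definition" slide.
fragment_1 = "GAATCTTATTATCTATACTATACATAATTATATACTAAT-GGGTATTGTTCTTAT"
fragment_2 = "GAATCTTATTATCTATGCTATACATAATTATATACTAATAGGGTATTGTTCTTAT"

for variant in find_variants(fragment_1, fragment_2):
    print(variant)
# prints (17, 'A', 'G', 'SNP') and then (40, '-', 'A', 'indel')
```

A real discovery pipeline would also weigh base quality (PHRED scores) and read depth before calling a SNP; this sketch deliberately ignores both.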
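
Supplementary sketch: tabulating 33mer occurrences

The Solexa results table groups distinct 33mers by how often each one was read, then reports the number of reads and total bases per group. A summary of that shape could be produced roughly as follows. This is an illustrative sketch under stated assumptions, not the analysis actually run for the talk: the function `occurrence_summary`, the bin list, and the toy reads are all hypothetical.

```python
from collections import Counter

READ_LENGTH = 33  # read length reported in the results table


def occurrence_summary(reads, bins):
    """Summarize how often each distinct read sequence occurs.

    reads: iterable of read sequences (strings).
    bins:  list of (low, high) inclusive occurrence ranges; high=None
           means the bin is open-ended.
    Returns one (label, n_unique, n_reads, n_bases) tuple per bin.
    """
    counts = Counter(reads)  # distinct sequence -> number of occurrences
    rows = []
    for low, high in bins:
        in_bin = [c for c in counts.values()
                  if c >= low and (high is None or c <= high)]
        if high is None:
            label = f"{low}+"
        elif low == high:
            label = str(low)
        else:
            label = f"{low}-{high}"
        n_reads = sum(in_bin)
        rows.append((label, len(in_bin), n_reads, n_reads * READ_LENGTH))
    return rows


# Toy data (hypothetical): four distinct 33mers seen 3, 2, 1 and 1 times.
toy_reads = ["A" * 33] * 3 + ["T" * 33] * 2 + ["C" * 33] + ["G" * 33]
toy_bins = [(3, None), (2, 2), (1, 1)]

for row in occurrence_summary(toy_reads, toy_bins):
    print(row)
```

The total-bases column is simply reads multiplied by the read length, which matches the table's TOTAL row: 8,671,165 reads x 33 bases = 286,148,445 bases.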