National Genetic Trait Index
Update on grape pilot project
• Next-Generation sequencing to sample diversity
• Genotyping the germplasm collection
Doreen Ware
USDA ARS
NGWI April 27, 2009

Outline
• Background on the National Genetic Trait Index (NGTI)
• Grape Project objectives
• Step 1: Next-Generation Sequencing to sample diversity
  – DNA preparation, sequencing method and analysis of sequencing reads for variation
  – Characterization of SNPs: position, allele support, and coverage
  – 10k SNP array development
• Step 2: Genotyping the germplasm collection
  – SNP identification
  – Preliminary results of the array
• Phenotyping

What are germplasm collections?
• The culmination of thousands of years of selection and improvement of plants
• Our richest genetic heritage
• The central resource for feeding and fueling the world
• A resource from the past that we must pass on to the future in an improved state

What do we currently know?
• Multiple functional variants per gene = alleles
• 20,000 to 50,000 genes
• Most traits are the product of 100s of genes
• Many possible genetic combinations
• Over the last 10,000 years, we have tested only a limited set of genetic combinations
• Need a rational plan to organize and use this diversity
  – Genetics and Breeding

What do we want to do?
We want to make more useful plants by conserving, finding and combining better alleles.
The National Germplasm conserves 464,000 accessions and may contain 100,000,000 distinct alleles, but there is no index.

Stakeholder View
[Figure: allele scores by trait (Yield (CA), Flavor, Disease Resistance) for the current variety and the available germplasm. Legend: Good Allele, Neutral Allele, Bad Allele]
Current Variety: DGL 2343 (+3, +0, -2)
Available Germplasm: PI 265443 (-1, +2, +3), PI 532443, PI 783472, PI 572811
• Although poor yielding, it has complementary yield alleles and good disease resistance and flavor
• Although good yielding, the current line already captures these alleles

Absolute View
Traits: Yield (CA), Flavor, Disease Resistance
Current Variety
  DGL 2343    +3  +0  -2
Available Germplasm
  PI 265443   +9  -1  +2  +3
  PI 532443   +4  +0  +1  +0
  PI 783472   +5  -1  +2  +1
  PI 572811   +3  +0  +0  -1
Legend: Good Allele, Neutral Allele, Bad Allele

Contrast View
                    Yield (CA)  Flavor  Disease Resistance
Current Variety
  DGL 2343              +3        +0          -2
Available Germplasm
  PI 265443             -1        +2          +3
  PI 532443             +0        +1          +2
Optimal Result          +7        +7          +6
Legend: Good Allele, Neutral Allele, Bad Allele

Impact
• Identify our most important and representative germplasm
  – Focus curators, security, and breeding efforts on the most important germplasm
• We would know what is genetically feasible with natural variation
• Biosecurity
  – Rapidly respond to pathogen introductions
• Identify novel alleles and facilitate marker-assisted breeding
  – Accelerate breeding results
• Make US Agriculture Competitive and Open New Markets

NGTI Grape Germplasm
• Pilot project to demonstrate the feasibility of genotyping a diverse NPGS germplasm collection for a species with more limited genomic resources
• Provide markers for improved curation of the grape collection and help breeders and geneticists unleash the genetic diversity of grapes

Grape
• Contains over 60 species, mostly found in temperate regions of the northern hemisphere
• Vitis vinifera is the most important domesticated species, cultivated for table grapes and wine making
• The wild grape Vitis sylvestris is considered the progenitor of the domesticated grape
• High nucleotide diversity (π=0.004), highly heterozygous and low LD (~200bp)

Genetic Diversity in the Domesticated Grape
Cluster Density
Cluster Size
Genetic Diversity
Berry Size ?
Berry Shape

Grape Diversity Project
• Identify SNPs using a high-throughput sequencing approach
• Select 10,000 informative SNPs and establish a genotyping chip for:
  – Genotyping the USDA grape germplasm repository
    • 1200 Vitis vinifera + 1000 wild Vitis samples
  – Study the population structure of Vitis
    • Patterns of shared polymorphism between Vitis vinifera and wild species
  – Create a preliminary SNP panel for association studies
• Pilot project for developing informatics resources for SNP discovery in other high-diversity crop species

Team
• Edward Buckler and Sean Myles – Genomics and statistical analysis
• Doreen Ware, Jer-Ming Chia, Bonnie Hurwitz – Bioinformatics
• Charles Simon, Gan-Yuan Zhong, Mallikarjuna Aradhya, Bernard Prins – Germplasm
• Leon Kochian – Oversight

National Genetic Trait Index Project: Grapevine
Step 1: Discovery of genetic variants (SNPs)
• Diverse Samples
  – 10 cultivated Vitis varieties (Vitis vinifera)
  – 6 wild Vitis species
• Genome complexity reduction
  – Digestion with HpaII restriction enzyme
• Illumina/Solexa sequencing
  – Sequencing by synthesis
• 60 million sequences
  – Total: 2 billion base pairs of sequence
• Discovery of >1 million SNPs
• Make data available
  – Integrate SNP data into public grape genome browser

SNP Discovery Panel
• Goal: Capture recent variation in domesticated grape as well as more ancient alleles in wild species
• Solexa libraries constructed from 10 domesticated cultivars and 6 wild species
1. Ehrenfelser
2. French Colombard
3. Gewurztraminer
4. Kadarka
5. Malvasia
6. Muscat of Alexandria
7. Pinot Noir
8. Plavac Mali
9. Thompson Seedless
10. White Riesling
11. Vitis amurensis
12. Vitis cinerea
13. Vitis labrusca
14. Vitis palmata
15. Vitis rotundifolia
16. Vitis sylvestris
17.
Inbred Pinot Noir (Reference Genome)

Library Construction Protocol
Reducing the complexity of the Genome
• DNA Extraction
• Whole Genome Amplification*
• Genome Complexity Reduction: Restriction enzyme digest
• Size Selection from Gel: 100-600bp
• Addition of 'A' Base to 3' ends
• Ligation of Solexa Adaptors
• Solexa Genome Analyzer

Reduced Representation Libraries
[Figure: genomic sequence with three HpaII sites marked, and the fragments sent to Solexa sequencing]
ACTATCTATCCGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCGGTCATCGATTAGCCTAGCTCGATCGCTTACCGGTAGGACTGCTTCGA
Fragments:
  ACTATCTATCCG
  CGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCG
  CGGTCATCGATTAGCCTAGCTCGATCGCTTACCG
  CGGTAGGACTGCTTCGA

Next-Generation Sequence Analysis Workflow
[Flowchart: Image files from Solexa GA -> Base Calling (Firecrest, Bustard) -> Sequence and Base Quality -> Read Mapping (Ungapped Alignment; if not mapped to genome, Gapped Alignment) -> Alignments -> Aln Consensus & Quality -> Variation Discovery -> Variation Discovery Filters -> Called SNPs. Side boxes: Data Storage, Data Accessibility]

Building a Pipeline
• Modular components allow for different mapping strategies
  – Mapping only non-redundant, non-singleton reads
  – Gapped vs. un-gapped alignment
• Customizable SNP Filtering
  – Quality and probability filters
  – Read coverage assessed by technical or biological replicates
• Tighter Data Control
  – Interim data, procedural and analysis results are stored
  – Allows for easy rollbacks and efficient re-analysis of the data using different parameters
  – Incremental data can be added and analyzed without having to re-run the entire pipeline

Deciphering Genetic Diversity From High-Throughput Sequencing
Variation Discovery
• Retrieved reads that carried a variant allele as compared to the reference genome
• Initial pass for variation has very loose thresholds
  – Minor allele frequency >= 0.05
    • Allele frequency is approximated by read counts
  – Bi-allelic
    • Alleles observed at > 2% frequency are considered informative.
      A SNP with 3 or more informative alleles is considered non-bi-allelic
  – Total read count for the SNP >= 10 but <= 1000
• 469,470 potential SNPs

Selecting 10K SNPs for Array
• Selected 10,000 SNPs for constructing an Illumina Infinium assay
• On top of SNP quality, also considered:
  – Segregation patterns: Select SNPs that are supported by homozygous and heterozygous samples
    • Homozygous criteria:
      – More than 5 reads supporting the allele
    • Heterozygosity test:
      – Simple binomial test applied to the reference and alternate read counts in a single sample
  – Probe design
    • Specificity -> minimize cross-hybridization
      – Took 50bp on both sides of each SNP and matched against the genome (blast)
      – Disregarded the flanking region if it matches another location with < 2 mismatches within the first 10bp and < 5 mismatches in total
    • Sensitivity
      – SNPs within the probe sequence might cause the assay to fail, so disregarded the flanking region if another SNP is found within 10bp

10K SNPs: Consequence within Genomic Sequence
• SNP consequence data obtained by integrating SNP calls with the genome annotation through Ensembl
• Selected 10K SNPs are enriched for genic SNPs.
• In contrast, the genome is 46% genic space and 41% repetitive/transposable elements

10K SNPs: Segregation Patterns

Step 2: Genotyping the grape germplasm repository
• SNP selection
  – Choose 10,000 high quality SNPs from the 500,000 Solexa SNPs
• 10K SNP chip
  – Production of custom 10,000 (8898) SNP genotyping array
• Genotype the germplasm repository
  – 1200 cultivated species (Vitis vinifera)
  – 1000 wild species
• 21 million genotypes
• Analyses
  – Establish core germplasm collection
  – Identify synonyms and homonyms
  – Association mapping
  – Estimate population genetic parameters

Genotyping the Collection
• 10K array: 8898 SNP genotypes represented
• Mean concordance among replicates is 98.8%
• 5500 of the 8898 SNPs are showing results (62%)
• 515 accessions
• ~192 samples a week; should be complete by the end of July

PCA of array-scored SNPs shows clustering of the different germplasm
PCA is able to discriminate between the wild varieties

Outcomes
• Genotyped Germplasm Collection
  – GRIN will have a real dataset to work with
• Facilitate better curation
• Allow breeders to estimate breeding values for the entire germplasm collection
• Background to initiate detailed phenotypic evaluation of germplasm and to understand the genes underlying key traits

Phenotypes
Pilot Phenotyping: Key Secondary Metabolites
Geneva, NY and Davis, CA: Gan-Yuan Zhong

Phenotyping Key Secondary Metabolites of Grapes
• Phenotyping the USDA-ARS Vitis collections will be the next critical step for maximizing the value of the current genotyping effort
• A pilot project has been initiated for phenotyping key secondary metabolites of the Vitis collections from both Davis, CA and Geneva, NY
• About 400 V.
vinifera and 200 North American collections will be phenotyped for 50 various phenolics including anthocyanins

[Figure: HPLC-DAD chromatograms at 525 nm, 365 nm and 280 nm; absorbance (mAU) vs. retention time (min)]
Profiling anthocyanins (525 nm) and other phenolics in grapes (HPLC-DAD chromatograms)

Past and Current Work
Sample collection: Grape germplasm repository, Davis, CA
  Species                           Size
  Vitis vinifera (table grapes)     578
  Vitis vinifera (wine grapes)      632
  Wild Vitis species                973
  Hybrids (vinifera x wild)         856
Laboratory and Analyses (tasks)
• Solexa sequencing of 16 diverse samples
• Solexa sequencing of positive control (Inbred Pinot Noir used for genome sequence obtained from CNRS, France)
• DNA extraction and purification of samples from Davis
• Bioinformatics pipeline - SNP calling

Past and Current Work
Task (anticipated completion date)
• Sample collection from germplasm repository, Geneva, NY
• Ordering 10K SNP chips from Illumina
• Begin genotyping with 10K chips
• Finish genotyping with 10K chips: In progress
• Analyses and submission of publication: In progress
• Visualization of variations: In progress

Mapping statistics of reads from each of the germplasm to the reference Vitis genome

SNP calling protocol
[Figure: reference sequence with a repetitive region of grape, mapped Solexa reads, and SNPs called with variation, frequency and depth (e.g. G33, C23, T11)]

Overview of the Solexa SNP pipeline
1. 56 Million reads (1.8 billion bp) are aligned to the reference genome
   – The divergence within V.
vinifera and with other Vitis is so great we need to develop other algorithms to map the reads
2. 1.1 Million regions of the genome have potential SNPs, which are statistically evaluated for genotypic basis.
3. 50,000 high probability SNPs are identified
4. Empirically validating a small subset of the data.
5. With improved algorithms and increased knowledge of grape diversity, we may be able to extract 100,000s of SNPs.

10K SNP Chip
• Segregates within Vitis vinifera (2108): >=1 vinifera sample is homozygous for reference allele AND >=1 vinifera sample homozygous for alternate allele AND >=1 vinifera sample is heterozygous AND Fisher's test <= 0.01 and average quality score >= 20
• Segregates within wild Vitis (225): As above but for wild Vitis
• Segregates within Vitis (1159): >=1 sample is homozygous for reference allele AND >=1 sample homozygous for alternate allele AND >=1 sample is heterozygous AND Fisher's test <= 0.01 and average quality score >= 20
• Fixed within a single wild Vitis (1275): One Vitis sample is fixed (homozygous) for one allele while all other samples are fixed for the other allele
• Segregates within Vitis vinifera, lenient version (3754): >=1 vinifera sample is homozygous for reference allele OR >=1 vinifera sample homozygous for alternate allele AND >=1 vinifera sample is heterozygous AND Fisher's test <= 0.01 and average quality score >= 20
• SNPs within candidate genes (787): SNPs within a gene from a candidate gene list AND reads obtained from >= 10 samples and Fisher's test <= 0.01 and average quality score >= 20
• Berry Color (2): From literature
• Test SNPs (320): 20 SNPs selected from each quartile of the following distributions: Read count, number of samples >=1 read, number of reads, Fisher's test
TOTAL: 9630

SNP Filtering Using Base Quality
• Uneven distribution of SNPs based on position in sequencing read
• Tail ends of Solexa reads have higher error rates, contributing to false SNPs
• Filtering out SNPs where the average base quality < 20
[Figure: number of reads vs. position on Solexa read at which the variant allele is found, for three settings: No quality filter, Ave. Base Qual >= 10, Ave. Base Qual >= 20]

Characterization of SNPs
• Position in the read
• Contingency test for allele
• Frequency observed in different accessions
• Depth of coverage by number of reads

Using Fisher's Test as a filter
• Read counts of a particular SNP represented as a contingency table
• Fisher's exact test used to test the independence of rows and columns in a contingency table; used as a metric to evaluate segregation of alleles

Sample Read Counts    A   B   C   D   E   F
Reference Allele      9   0   8  15   7   0
Alternate Allele      0  11   0   0   6  12
Fisher's exact p-value < 1e-16

Sample Read Counts    A   B   C   D   E   F
Reference Allele      2   3   5   0   2   0
Alternate Allele      0   1   0   0   0   1
Fisher's exact p-value = 0.05

SNP Filtering using Fisher's Test
• Accept only SNPs with a p-value <= 0.01 as high confidence
[Figure: number of reads vs. position on Solexa read at which the variant allele is found, for four settings: No filter, p-value <= 0.1, p-value <= 0.05, p-value <= 0.01]

No. of samples vs SNP quality
• SNPs backed by reads from more cultivars + wild species have better quality

Number of reads supporting a SNP vs SNP quality
• SNP quality plateaus beyond 150 supporting reads
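
Illustration: Fisher's test and base quality filters
The two filters described in these slides (average base quality >= 20 and Fisher's exact p-value <= 0.01) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the project: the function names fisher_exact_2xk and passes_filters are hypothetical, the p-value is taken as the total probability of all tables with the same margins that are no more likely than the observed one, and the brute-force enumeration is only practical for small read counts like those in the example tables.

```python
from math import comb


def fisher_exact_2xk(ref_counts, alt_counts):
    """Exact p-value for a 2 x k table of per-sample read counts.

    Hypothetical helper: sums the probability of every table with the
    same row and column totals that is no more likely than the observed
    table. Integer weights are compared to avoid rounding issues.
    """
    if not ref_counts or len(ref_counts) != len(alt_counts):
        raise ValueError("need two equally long, non-empty count lists")

    col_totals = [r + a for r, a in zip(ref_counts, alt_counts)]
    n_ref = sum(ref_counts)
    k = len(col_totals)

    # suffix[i] = number of reads available in samples i .. k-1
    suffix = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        suffix[i] = suffix[i + 1] + col_totals[i]

    # Unnormalised probability (product of binomial coefficients) of the
    # observed table under fixed margins.
    observed = 1
    for total, ref in zip(col_totals, ref_counts):
        observed *= comb(total, ref)

    def walk(i, left, weight):
        """Distribute `left` reference reads over samples i .. k-1."""
        if i == k - 1:
            w = weight * comb(col_totals[i], left)
            return w if w <= observed else 0
        lo = max(0, left - suffix[i + 1])
        hi = min(col_totals[i], left)
        return sum(
            walk(i + 1, left - x, weight * comb(col_totals[i], x))
            for x in range(lo, hi + 1)
        )

    return walk(0, n_ref, 1) / comb(suffix[0], n_ref)


def passes_filters(ref_counts, alt_counts, mean_base_quality,
                   max_p=0.01, min_quality=20):
    """Apply the two thresholds quoted on the slides (names are assumed)."""
    if mean_base_quality < min_quality:
        return False
    return fisher_exact_2xk(ref_counts, alt_counts) <= max_p
```

For example, passes_filters([9, 0, 8, 15, 7, 0], [0, 11, 0, 0, 6, 12], 25) applies both thresholds to the first read-count table above. A production pipeline would more likely call an established statistics package than enumerate tables by hand.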