Vitis vinifera

National Genetic Trait Index
Update on grape pilot project
• Next-Generation sequencing to sample diversity
• Genotyping the germplasm collection
Doreen Ware
USDA ARS
NGWI April 27, 2009
Outline
• Background on the National Genetic Trait Index (NGTI)
• Grape Project objectives
• Step 1: Next-Generation Sequencing to sample diversity
– DNA preparation, sequencing method and analysis of
sequencing reads for variation
– Characterization of SNPs: position, allele support, and
coverage
– 10k SNP array development
• Step 2: Genotyping the germplasm collection
– SNP identification
– Preliminary results of the array
• Phenotyping
What are germplasm collections?
• The culmination of thousands of years
of selection and improvement of plants
• Our richest genetic heritage
• The central resource for feeding and
fueling the world
• A resource from the past that we must
pass on to the future in an improved
state
What do we currently know?
• Multiple functional variants per gene = alleles
• 20,000 to 50,000 genes
• Most traits product of 100s of genes
• Many possible genetic combinations
• Over the last 10,000 years, we have tested only a limited set of genetic combinations
• Need a rational plan to organize and use this diversity
– Genetics and Breeding
What do we want to do?
We want to make more useful
plants by conserving, finding and
combining better alleles.
The National Plant Germplasm System conserves
464,000 accessions and may contain
100,000,000 distinct alleles,
but there is no index.
Stakeholder View
Allele scores by trait: Yield (CA), Flavor, Disease Resistance
Current Variety
• DGL 2343: +3, +0, -2
Available Germplasm
• PI 265443: -1, +2, +3
– Although poor yielding, it has complementary yield alleles and good disease resistance and flavor
• PI 532443, PI 783472, PI 572811
– Although good yielding, the current line already captures these alleles
[Remaining slide values, row and column assignment unclear: +0, +1, +3, +0, +3, +0, +2, +2, -1]
Legend: Good Allele, Neutral Allele, Bad Allele
Absolute View
                       Yield (CA)   Flavor   Disease Resistance
Current Variety
DGL 2343                   +3         +0            -2
Available Germplasm
PI 265443 +9               -1         +2            +3
PI 532443 +4               +0         +1            +0
PI 783472 +5               -1         +2            +1
PI 572811 +3               +0         +0            -1
Legend: Good Allele, Neutral Allele, Bad Allele
Contrast View
                       Yield (CA)   Flavor   Disease Resistance
Current Variety
DGL 2343                   +3         +0            -2
Available Germplasm
PI 265443                  -1         +2            +3
PI 532443                  +0         +1            +2
Optimal Result             +7         +7            +6
Legend: Good Allele, Neutral Allele, Bad Allele
Impact
• Identify our most important and representative
germplasm
– Focus curators, security, and breeding efforts on the most
important germplasm
• We would know what is genetically feasible with
natural variation
• Biosecurity
– Rapidly respond to pathogen introductions
• Identify novel alleles and facilitate marker-assisted
breeding
– Accelerate breeding results
• Make US Agriculture Competitive and Open New
Markets
NGTI Grape Germplasm
• Pilot project to demonstrate the feasibility
of genotyping a diverse NPGS germplasm
collection for a species with more limited
genomic resources
• Provide markers for improved curation of
the grape collection and help breeders and
geneticists unleash the genetic diversity of
grapes
Grape
• The grape genus (Vitis) contains over 60 species, mostly found in temperate
regions of the northern hemisphere
• Vitis vinifera is the most important domesticated species,
cultivated for table grapes and wine making
• The wild grape Vitis sylvestris is considered the progenitor
of the domesticated grape
• High nucleotide diversity (π = 0.004), high heterozygosity,
and low linkage disequilibrium (LD, ~200 bp)
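As a rough illustration of the π statistic quoted above, nucleotide diversity can be computed as the average number of pairwise differences per site across aligned sequences. The sketch below assumes equal-length, pre-aligned sequences without gaps; the function name and the toy sequences are hypothetical and are not project data.

```python
from itertools import combinations

def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (pi) for aligned, equal-length sequences."""
    if len(aligned_seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    pairs = list(combinations(aligned_seqs, 2))
    # Count mismatching positions for every pair of sequences
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)

toy_sequences = ["ACGT", "ACGA", "ACGT"]  # made-up haplotypes for illustration
pi = nucleotide_diversity(toy_sequences)
print(round(pi, 4))
```

At π = 0.004, two randomly chosen grape sequences would be expected to differ at roughly 4 sites per 1,000 bp.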
Genetic Diversity in the Domesticated Grape
[Figure: genetic diversity and the traits it may underlie: cluster density, cluster size, berry size, berry shape]
Grape Diversity Project
• Identify SNPs using a high-throughput sequencing approach
• Select 10,000 informative SNPs and establish a genotyping chip for:
– Genotyping the USDA grape germplasm repository
• 1200 Vitis vinifera + 1000 wild Vitis samples
– Studying the population structure of Vitis
• Patterns of shared polymorphism between Vitis vinifera and wild
species
– Creating a preliminary SNP panel for association studies
• Pilot project for developing informatics resources for SNP discovery in other
high-diversity crop species
Team
• Edward Buckler and Sean Myles –
Genomics and statistical analysis
• Doreen Ware, Jer-Ming Chia,
Bonnie Hurwitz – Bioinformatics
• Charles Simon, Gan-Yuan Zhong,
Mallikarjuna Aradhya, Bernard Prins –
Germplasm
• Leon Kochian – Oversight
National Genetic Trait Index Project: Grapevine
Step 1: Discovery of genetic variants (SNPs)
• Diverse samples: 10 cultivated Vitis varieties (Vitis vinifera) and 6 wild Vitis species
• Genome complexity reduction: digestion with HpaII restriction enzyme
• Illumina/Solexa sequencing: sequencing by synthesis
• 60 million sequences; total: 2 billion base pairs of sequence
• Discovery of >1 million SNPs
• Make data available: integrate SNP data into public grape genome browser
SNP Discovery Panel
• Goal: Capture recent variation in domesticated grape as well as more
ancient alleles in wild species
• Solexa libraries constructed from 10 domesticated cultivars and 6 wild
species
1. Ehrenfelser
2. French Colombard
3. Gewurztraminer
4. Kadarka
5. Malvasia
6. Muscat of Alexandria
7. Pinot Noir
8. Plavac Mali
9. Thompson Seedless
10. White Riesling
11. Vitis amurensis
12. Vitis cinerea
13. Vitis labrusca
14. Vitis palmata
15. Vitis rotundifolia
16. Vitis sylvestris
17. Inbred Pinot Noir (Reference Genome)
Library Construction Protocol
Reducing the complexity of the Genome
[Flow diagram; boxes shown:]
• DNA Extraction
• Whole Genome Amplification*
• Genome Complexity Reduction: Restriction enzyme digest
• Size Selection from Gel: 100-600bp
• Addition of ‘A’ Base to 3' ends
• Ligation of Solexa Adaptors
• Solexa Genome Analyzer
Reduced Representation Libraries
Genomic sequence with three HpaII sites (CCGG):
ACTATCTATCCGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCGGTCATCGATTAGCCTAGCTCGATCGCTTACCGGTAGGACTGCTTCGA
Fragments after digestion, as drawn on the slide:
ACTATCTATCCG
CGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCG
CGGTCATCGATTAGCCTAGCTCGATCGCTTACCG
CGGTAGGACTGCTTCGA
-> Solexa sequencing
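The reduced-representation idea above can be sketched as an in-silico digest. This is a minimal illustration and not the project's code: it assumes HpaII cuts after the first C of each CCGG site (C^CGG), models a single strand only, and uses hypothetical function names. The example sequence is the one shown on the slide; the size bounds are toy values, since the gel selection on the slide was 100-600bp.

```python
def hpaii_digest(sequence, site="CCGG", cut_offset=1):
    """Cut after the first C of every CCGG site (C^CGG) and return the fragments."""
    fragments, start = [], 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # trailing fragment after the last site
    return fragments

def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]

slide_sequence = (
    "ACTATCTATCCGGTCGCTAGCCGTATATCGGTATAGCTTCGGTCCGGTCATCGATTAGCC"
    "TAGCTCGATCGCTTACCGGTAGGACTGCTTCGA"
)
fragments = hpaii_digest(slide_sequence)
selected = size_select(fragments, 15, 40)  # toy window for this short example
for fragment in fragments:
    print(len(fragment), fragment)
```

Because only fragments in the selected size window are sequenced, the same subset of the genome tends to be sampled in every accession, which concentrates read depth where SNPs are called.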
Next-Generation Sequence Analysis Workflow
[Flow diagram:]
• Image files from Solexa GA
• Base Calling (Firecrest, Bustard) -> Sequence and Base Quality
• Read Mapping: Ungapped Alignment
– Mapped to genome? NO -> Gapped Alignment
– Mapped to genome? YES -> Alignments
• Data Storage: Sequence and Base Quality, Alignments
• Variation: Aln Consensus & Quality -> Variation Discovery -> Variation Discovery Filters -> Called SNPs
• Data Accessibility
Building a Pipeline
• Modular components allow for different mapping strategies
– Mapping only non-redundant, non-singleton reads
– Gapped vs. un-gapped alignment
• Customizable SNP Filtering
– Quality and probability filters
– Read coverage assessed by technical or biological replicates
• Tighter Data Control
– Interim data, procedural and analysis results are stored
– Allows for easy rollbacks and efficient re-analysis of the data using
different parameters
– Incremental data can be added and analyzed without having to re-run the
entire pipeline
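One way to picture the modular, re-runnable design described above is a pipeline that caches every stage's interim result, so that changing a downstream parameter does not repeat upstream work. The class, stage names, and toy reads below are hypothetical; the sketch also assumes one input dataset per pipeline instance, because the cache key ignores the input data.

```python
import json

class CachedPipeline:
    """Runs named stages in order and stores each interim result (illustrative design)."""

    def __init__(self, stages):
        self.stages = stages   # list of (name, function) pairs
        self.cache = {}        # interim results keyed by stage history and parameters
        self.executed = []     # log of stages that actually ran

    def run(self, data, params):
        result, history = data, []
        for name, func in self.stages:
            stage_params = params.get(name, {})
            history.append([name, stage_params])
            key = json.dumps(history, sort_keys=True)
            if key not in self.cache:  # only recompute when this stage or an earlier one changed
                self.cache[key] = func(result, **stage_params)
                self.executed.append(name)
            result = self.cache[key]
        return result

def map_reads(reads, gapped=False):
    """Toy mapping stage: keep mapped reads (or all reads if gapped rescue is on)."""
    return [r for r in reads if r["mapped"] or gapped]

def filter_by_quality(reads, min_quality=20):
    """Toy filter stage: keep IDs of reads at or above a quality threshold."""
    return [r["id"] for r in reads if r["quality"] >= min_quality]

toy_reads = [
    {"id": "r1", "mapped": True, "quality": 30},
    {"id": "r2", "mapped": True, "quality": 12},
    {"id": "r3", "mapped": False, "quality": 25},
]
pipeline = CachedPipeline([("map", map_reads), ("filter", filter_by_quality)])
strict = pipeline.run(toy_reads, {"filter": {"min_quality": 20}})
lenient = pipeline.run(toy_reads, {"filter": {"min_quality": 10}})
print(strict, lenient, pipeline.executed)
```

In the second run only the filter stage executes; the mapping result is reused from the cache.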
Deciphering Genetic Diversity From
High-Throughput Sequencing
Variation Discovery
• Retrieved reads that carried a variant allele as compared to the
reference genome
• Initial pass for variation has very loose thresholds
– Minor allele frequency >= 0.05
• Allele frequency is approximated by read counts
– Bi-allelic
• An allele showing up with > 2% frequency is considered
informative. A SNP with 3 or more informative alleles
is considered non bi-allelic
– Total read count for the SNP >=10 but <=1000
• 469,470 potential SNPs
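The loose first-pass thresholds above can be sketched as a single predicate over per-allele read counts. This is an assumed reading of the slide, not the project's implementation: the function name is hypothetical, and minor allele frequency is taken to be the second most frequent allele's share of all reads at the site.

```python
def passes_initial_filter(allele_counts, min_maf=0.05, informative_freq=0.02,
                          min_reads=10, max_reads=1000):
    """Apply the first-pass SNP thresholds to a dict of allele -> read count."""
    total = sum(allele_counts.values())
    if not (min_reads <= total <= max_reads):
        return False  # too little support, or suspiciously deep (e.g. repeats)
    freqs = sorted((count / total for count in allele_counts.values()), reverse=True)
    informative = [f for f in freqs if f > informative_freq]
    if len(informative) != 2:
        return False  # monomorphic, or 3+ informative alleles (non bi-allelic)
    return informative[1] >= min_maf  # minor allele frequency approximated by reads

# Made-up read counts for illustration
print(passes_initial_filter({"A": 40, "G": 10}))          # bi-allelic, well supported
print(passes_initial_filter({"A": 40, "G": 10, "T": 5}))  # third informative allele
print(passes_initial_filter({"A": 5, "G": 3}))            # fewer than 10 reads
```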
Selecting 10K SNPs for Array
•Selected 10,000 SNPs for constructing an Illumina Infinium assay
•On top of SNP quality, also considered:
– Segregation patterns: Select SNPs that are supported by homozygous
and heterozygous samples
• Homozygous criteria:
– More than 5 reads supporting the allele
• Heterozygosity test:
– Simple binomial test applied to the reference and alternate read counts in a
single sample.
– Probe design
• Specificity -> minimize cross-hybridization
– Took 50bp on both sides of each SNP and matched against genome (blast)
– Disregard the flanking region if it matches to another location with < 2
mismatches within the first 10bp and < 5 mismatches in total
• Sensitivity
– SNPs within the probe sequence might cause assay to fail, so disregarded
flanking region if another SNP is found within 10bp
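The homozygous and heterozygous criteria above can be sketched as a per-sample genotype call from read counts. Several details are assumptions not stated on the slide: the binomial test is two-sided against a 50:50 expectation, the significance level is 0.05, a homozygous call requires zero reads for the other allele, and all names are hypothetical.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads):
    """Two-sided binomial test of ref/alt read counts against a 50:50 expectation."""
    n = ref_reads + alt_reads
    k = min(ref_reads, alt_reads)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)  # symmetric null, so double the smaller tail

def call_genotype(ref_reads, alt_reads, min_hom_reads=6, alpha=0.05):
    """Call hom_ref, hom_alt, het, or uncalled for one sample at one SNP."""
    if alt_reads == 0 and ref_reads >= min_hom_reads:   # "more than 5 reads"
        return "hom_ref"
    if ref_reads == 0 and alt_reads >= min_hom_reads:
        return "hom_alt"
    if ref_reads > 0 and alt_reads > 0 and binomial_two_sided_p(ref_reads, alt_reads) >= alpha:
        return "het"  # counts are consistent with two alleles at equal dosage
    return "uncalled"

# Made-up read counts for illustration
for ref, alt in [(9, 0), (0, 11), (7, 6), (15, 1), (2, 0)]:
    print(ref, alt, call_genotype(ref, alt))
```

At very low depth the binomial test has little power, so a minimum depth for heterozygous calls would be a sensible extra guard in practice.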
10K SNPs Consequence within
Genomic Sequence
• SNP consequence data
facilitated via the
integration of SNP calls
with the genome
annotation through
Ensembl
• Selected 10K SNPs
enriched for genic SNPs.
• In contrast, genome is
46% in genic space, 41%
repetitive/transposable
elements
10K SNPs: Segregation Patterns
Step 2: Genotyping the grape germplasm repository
• SNP selection: choose 10,000 high quality SNPs from the 500,000 Solexa SNPs
• Production of custom 10,000 (8898) SNP genotyping array (10K SNP chip)
• Genotype the germplasm repository
– 1200 cultivated samples (Vitis vinifera)
– 1000 wild samples
• 21 million genotypes
• Analyses
– Establish core germplasm collection
– Identify synonyms and homonyms
– Association mapping
– Estimate population genetic parameters
Genotyping the Collection
• 10K array: 8898 SNPs represented
• Of which 5500 SNPs are showing results (62%)
• Mean concordance among replicates is 98.8%
• 515 accessions
• ~192 samples a week; should be complete by the end of July
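Replicate concordance of the kind reported above is commonly computed as the fraction of SNPs with identical calls in two runs of the same sample, ignoring SNPs with a missing call in either run. The sketch below uses that definition as an assumption (the slide does not say how concordance was computed), with a hypothetical function name and made-up genotype calls.

```python
def concordance(calls_a, calls_b):
    """Fraction of SNPs with identical genotype calls in two replicates.

    Positions where either replicate has no call (None) are skipped.
    """
    compared = [(a, b) for a, b in zip(calls_a, calls_b)
                if a is not None and b is not None]
    if not compared:
        return float("nan")
    return sum(a == b for a, b in compared) / len(compared)

# Made-up calls for one sample genotyped twice
replicate_1 = ["AA", "AG", "GG", None, "AG"]
replicate_2 = ["AA", "AG", "AG", "GG", "AG"]
toy_concordance = concordance(replicate_1, replicate_2)
print(toy_concordance)

# Share of array SNPs showing results, from the slide's counts
working_share = round(100 * 5500 / 8898)
print(working_share)
```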
PCA of array-scored SNPs shows
clustering of the different germplasm
PCA is able to discriminate between
the wild varieties
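A minimal sketch of how PCA separates groups of samples from SNP genotypes is shown below. It is not the analysis used for the slide: it assumes genotypes coded as 0/1/2 alternate-allele counts with no missing data, finds only the first principal component by power iteration on the sample-by-sample Gram matrix, and uses a hypothetical function name and made-up genotypes.

```python
from math import sqrt

def first_principal_component(genotypes, iterations=200):
    """Return PC1 scores for samples (rows) given 0/1/2 genotype codes per SNP (columns)."""
    n_samples, n_snps = len(genotypes), len(genotypes[0])
    # Center each SNP column so the component reflects differences between samples
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)] for row in genotypes]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    vec = [i - (n_samples - 1) / 2 for i in range(n_samples)]  # deterministic start vector
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sqrt(sum(x * x for x in new))
        if norm == 0:
            return [0.0] * n_samples  # no variation to explain
        vec = [x / norm for x in new]
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec)) for row, v in zip(gram, vec))
    return [v * sqrt(max(eigenvalue, 0.0)) for v in vec]

# Made-up genotypes: three samples from one group, three from another
toy_genotypes = [
    [0, 0, 2, 2, 0],
    [0, 0, 2, 2, 1],
    [0, 1, 2, 2, 0],
    [2, 2, 0, 0, 1],
    [2, 2, 0, 0, 0],
    [2, 1, 0, 0, 0],
]
scores = first_principal_component(toy_genotypes)
print([round(s, 2) for s in scores])
```

Samples from the same toy group receive PC1 scores of the same sign, which is the kind of clustering a PCA plot makes visible.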
Outcomes
• Genotyped Germplasm Collection
– GRIN will have a real dataset to work with
• Facilitate better curation
• Allow breeders to estimate breeding values
for entire germplasm collection
• Background to initiate detailed phenotypic
evaluation of germplasm and understanding
genes underlying key traits
Phenotypes
Pilot Phenotyping Key Secondary Metabolites
Geneva, NY and Davis, CA: Gan-Yuan Zhong
Phenotyping Key Secondary Metabolites of Grapes
• Phenotyping the USDA-ARS Vitis collections will be the next critical
step for maximizing the value of the current genotyping effort
• A pilot project has been initiated for phenotyping key secondary
metabolites of the Vitis collections from both Davis, CA and Geneva,
NY
• About 400 V. vinifera and 200 North American collections will be
phenotyped for 50 various phenolics including anthocyanins
[Figure: HPLC-DAD chromatograms at 280 nm, 365 nm, and 525 nm; absorbance (mAU) vs. retention time (min)]
Profiling anthocyanins (525 nm) and other phenolics in grapes
(HPLC-DAD chromatograms)
Past and Current Work
Sample collection: Grape germplasm repository, Davis, CA
Sample species (size)
• Vitis vinifera (table grapes): 578
• Vitis vinifera (wine grapes): 632
• Wild Vitis species: 973
• Hybrids (vinifera x wild): 856
Laboratory and Analyses (tasks)
• Solexa sequencing of 16 diverse samples
• Solexa sequencing of positive control
(Inbred Pinot Noir used for genome sequence, obtained from CNRS, France)
• DNA extraction and purification of samples from Davis
• Bioinformatics pipeline - SNP calling
Past and Current Work
Task (anticipated completion date)
• Sample collection from germplasm repository, Geneva, NY
• Ordering 10K SNP chips from Illumina
• Begin genotyping with 10K chips
• Finish genotyping with 10K chips: In progress
• Analyses and submission of publication: In progress
• Visualization of variations: In progress
Mapping statistics of reads from each germplasm
sample to the reference Vitis genome
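SNP calling protocol
[Figure: genome browser view with tracks for the reference sequence, repetitive regions of grape, mapped Solexa reads, and SNPs called, annotated with variation, frequency, and depth; example labels: A, T, A and G33, C23, T11]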
Overview of the Solexa SNP pipeline
1. 56 Million reads (1.8 billion bp) are aligned to the
reference genome
– The divergence within V. vinifera and with other Vitis is
so great we need to develop other algorithms to map
the reads
2. 1.1 Million regions of the genome have potential
SNPs, which are statistically evaluated for
genotypic basis.
3. 50,000 high probability SNPs are identified
4. Empirically validating a small subset of the data.
5. With improved algorithms and increased
knowledge of grape diversity, we may be able to
extract 100,000s of SNPs.
10K SNP Chip
SNP categories (number of SNPs, selection criteria)
• Segregates within Vitis vinifera: 2108
– >=1 vinifera sample is homozygous for reference allele
AND >=1 vinifera sample homozygous for alternate allele
AND >=1 vinifera sample is heterozygous AND Fisher's
test <= 0.01 and average quality score >=20
• Segregates within wild Vitis: 225
– As above but for wild Vitis
• Segregates within Vitis: 1159
– >=1 sample is homozygous for reference allele AND >=1
sample homozygous for alternate allele AND >=1
sample is heterozygous AND Fisher's test <= 0.01 and
average quality score >=20
• Fixed within a single wild Vitis: 1275
– One Vitis sample is fixed (homozygous) for one allele
while all other samples are fixed for the other allele
• Segregates within Vitis vinifera (lenient version): 3754
– >=1 vinifera sample is homozygous for reference allele
OR >=1 vinifera sample homozygous for alternate allele
AND >=1 vinifera sample is heterozygous AND Fisher's
test <= 0.01 and average quality score >=20
• SNPs within candidate genes: 787
– SNPs within a gene from a candidate gene list AND reads
obtained from >= 10 samples and Fisher's test <=0.01
and average quality score >=20
• Berry Color: 2
– From literature
• Test SNPs: 320
– 20 SNPs selected from each quartile of the following
distributions: Read count, number of samples >=1 read,
number of reads, Fisher's test
• TOTAL: 9630
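The segregation categories above can be sketched as simple predicates over per-sample genotype calls. This sketch covers only the genotype-pattern part of each rule and assumes the Fisher's test and quality-score filters were applied beforehand. The lenient rule is read here as "(homozygous reference OR homozygous alternate) AND heterozygous", which is an interpretation; function names and genotype calls are hypothetical, while the category counts are taken from the slide.

```python
def segregates(calls):
    """Strict rule: at least one hom_ref, one hom_alt, and one het call."""
    return {"hom_ref", "hom_alt", "het"} <= set(calls.values())

def segregates_lenient(calls):
    """Lenient rule (assumed reading): a het call plus either homozygous class."""
    observed = set(calls.values())
    return "het" in observed and ("hom_ref" in observed or "hom_alt" in observed)

def fixed_in_single_sample(calls):
    """One sample homozygous for one allele, all others homozygous for the other."""
    values = list(calls.values())
    if len(values) < 2 or any(v not in ("hom_ref", "hom_alt") for v in values):
        return False
    return values.count("hom_ref") == 1 or values.count("hom_alt") == 1

def subset(calls, samples):
    """Restrict the calls to one group of samples, e.g. the vinifera cultivars."""
    return {name: call for name, call in calls.items() if name in samples}

# Made-up genotype calls for one SNP
toy_calls = {"Pinot Noir": "hom_ref", "Kadarka": "het", "Malvasia": "hom_alt",
             "V. palmata": "hom_alt", "V. cinerea": "hom_alt"}
vinifera = {"Pinot Noir", "Kadarka", "Malvasia"}
wild = {"V. palmata", "V. cinerea"}
print(segregates(subset(toy_calls, vinifera)), segregates(subset(toy_calls, wild)))

# Category counts from the slide add up to the stated total
slide_counts = [2108, 225, 1159, 1275, 3754, 787, 2, 320]
print(sum(slide_counts))
```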
SNP Filtering Using Base Quality
• Uneven distribution of SNPs based on position in sequencing read
• Tail ends of Solexa reads have higher error rates, contributing to false SNPs
• Filtering out SNPs where the average base quality < 20
[Figure: number of reads vs. position on Solexa read at which variant allele is found; panels: No quality filter, Ave. Base Qual >=10, Ave. Base Qual >=20]
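The base-quality filter above can be sketched as follows, assuming Phred-scaled quality scores for the bases that support the variant allele (the slide does not state the quality scale). Function names and the example scores are hypothetical.

```python
def phred_to_error_probability(quality):
    """Convert a Phred-scaled quality score to the probability of a wrong base call."""
    return 10 ** (-quality / 10)

def mean_base_quality(qualities):
    """Average quality of the bases supporting the variant allele."""
    return sum(qualities) / len(qualities)

def passes_quality_filter(variant_base_qualities, min_mean_quality=20):
    """Keep a SNP only if its variant-supporting bases average at least the threshold."""
    return mean_base_quality(variant_base_qualities) >= min_mean_quality

# Made-up quality scores for the variant bases of two candidate SNPs
print(passes_quality_filter([30, 25, 18, 22]))  # mean 23.75
print(passes_quality_filter([10, 15, 22]))      # mean about 15.7
print(phred_to_error_probability(20))           # quality 20 is a 1-in-100 error rate
```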
Characterization of SNPs
• Position in the read
• Contingency test for allele
• Frequency observed in different accessions
• Depth of coverage by number of reads
Using Fisher’s Test as a filter
• Read counts of a particular SNP represented as a contingency table
• Fisher’s exact test used to test the independence of rows and columns in a
contingency table - used as a metric to evaluate segregation of alleles
Sample Read Counts      A    B    C    D    E    F
Reference Allele        9    0    8   15    7    0
Alternate Allele        0   11    0    0    6   12
Fisher’s exact p-value < 1e-16

Sample Read Counts      A    B    C    D    E    F
Reference Allele        2    3    5    0    2    0
Alternate Allele        0    1    0    0    0    1
Fisher’s exact p-value = 0.05
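For a 2 x k table of read counts like those above, an exact test of independence can be sketched by enumerating every table with the same row and column totals and summing the probabilities of tables no more likely than the observed one (the Freeman-Halton extension of Fisher's exact test). The slides do not say which implementation was used, so this is an assumed approach with a hypothetical function name, and the values it returns need not match the slide's figures.

```python
from math import comb

def fisher_exact_2xk(ref_counts, alt_counts, rel_tol=1e-7):
    """Exact p-value for independence in a 2 x k table of read counts."""
    col_totals = [r + a for r, a in zip(ref_counts, alt_counts)]
    ref_total = sum(ref_counts)
    denominator = comb(sum(col_totals), ref_total)

    # Unnormalised hypergeometric weight of the observed table
    observed = 1
    for total, ref in zip(col_totals, ref_counts):
        observed *= comb(total, ref)
    cutoff = observed * (1 + rel_tol)  # tolerance for ties

    p_weight = 0
    last = len(col_totals) - 1

    def walk(col, remaining, weight):
        """Enumerate reference-row counts column by column with fixed margins."""
        nonlocal p_weight
        if col == last:
            if 0 <= remaining <= col_totals[col]:
                full = weight * comb(col_totals[col], remaining)
                if full <= cutoff:
                    p_weight += full
            return
        for x in range(min(col_totals[col], remaining) + 1):
            walk(col + 1, remaining - x, weight * comb(col_totals[col], x))

    walk(0, ref_total, 1)
    return p_weight / denominator

# Read counts from the two example tables (samples A-F)
p_segregating = fisher_exact_2xk([9, 0, 8, 15, 7, 0], [0, 11, 0, 0, 6, 12])
p_weak = fisher_exact_2xk([2, 3, 5, 0, 2, 0], [0, 1, 0, 0, 0, 1])
print(p_segregating, p_weak)
```

Full enumeration is practical only for small tables like these; for deeper coverage or more samples, a Monte Carlo approximation would be the usual substitute.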
SNP Filtering using Fisher’s Test
• Accept only SNPs with a p-value <= 0.01 as high confidence
[Figure: number of reads vs. position on Solexa read at which variant allele is found; panels: No filter, p-value <= 0.1, p-value <= 0.05, p-value <= 0.01]
No. of samples vs SNP quality
• SNPs backed by reads from more samples (cultivars + wild species) have better quality
Number of reads supporting a SNP
vs SNP quality
• SNP quality plateaus once a SNP is supported by about 150 reads