Affymetrix_CaseyGates

GeneChip® CustomSeq®
Resequencing Arrays
GeneChip® CustomSeq® Arrays
 Most efficient and cost-effective method for large-scale sequence variation analysis
– up to 300kb on a single array for >.01 cent per base
 Flexible array formats for a variety of applications;
50kb, 100kb, and 300kb
 Long “read length” minimizes curation and
assembly time
 High-quality data
– Accuracy >99.9%
– Reproducibility >99.9%
Resequencing Assay Overview
[Figure: assay workflow. Genomic DNA → select areas of interest by PCR → PCR products → pool and fragment → DNA fragments → label → end-labeled fragments (each tagged "B" in the figure) → hybridization, wash, and stain → scan]
GeneChip® Resequencing System
GeneChip Resequencing Assay Kit
GeneChip CustomSeq® Array
GeneChip Scanner 3000
GeneChip Operating Software (GCOS) 1.4
 Data Collection
 Data Management
 Client-Server Option
GeneChip Sequence Analysis Software (GSEQ) 4.0
Resequencing Tiling Strategy
ATCGGTAGCCATACATGAGTTACTA
ATCGGTAGCCATTCATGAGTTACTA
ATCGGTAGCCATCCATGAGTTACTA
ATCGGTAGCCATGCATGAGTTACTA
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
TAGCCATCGGTAGGTACTCAATGAT
TAGCCATCGGTACGTACTCAATGAT
TAGCCATCGGTATGTACTCAATGAT
TAGCCATCGGTAAGTACTCAATGAT
Resequencing Tiling Strategy
TCGGTAGCCATGAATGAGTTACTAC
TCGGTAGCCATGCATGAGTTACTAC
TCGGTAGCCATGGATGAGTTACTAC
TCGGTAGCCATGTATGAGTTACTAC
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
AGCCATCGGTACATACTCAATGATG
AGCCATCGGTACCTACTCAATGATG
AGCCATCGGTACGTACTCAATGATG
AGCCATCGGTACTTACTCAATGATG
Resequencing Tiling Strategy
CGGTAGCCATGCATGAGTTACTACA
CGGTAGCCATGCCTGAGTTACTACA
CGGTAGCCATGCGTGAGTTACTACA
CGGTAGCCATGCTTGAGTTACTACA
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
GCCATCGGTACGAACTCAATGATGT
GCCATCGGTACGCACTCAATGATGT
GCCATCGGTACGGACTCAATGATGT
GCCATCGGTACGTACTCAATGATGT
Resequencing Tiling Strategy
GGTAGCCATGCAAGAGTTACTACAG
GGTAGCCATGCACGAGTTACTACAG
GGTAGCCATGCAGGAGTTACTACAG
GGTAGCCATGCATGAGTTACTACAG
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
CCATCGGTACGTACTCAATGATGTC
CCATCGGTACGTCCTCAATGATGTC
CCATCGGTACGTGCTCAATGATGTC
CCATCGGTACGTTCTCAATGATGTC
Resequencing Tiling Strategy
GTAGCCATGCATAAGTTACTACAGC
GTAGCCATGCATCAGTTACTACAGC
GTAGCCATGCATGAGTTACTACAGC
GTAGCCATGCATTAGTTACTACAGC
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
CATCGGTACGTAATCAATGATGTCG
CATCGGTACGTACTCAATGATGTCG
CATCGGTACGTAGTCAATGATGTCG
CATCGGTACGTATTCAATGATGTCG
Resequencing Tiling Strategy
TAGCCATGCATGAGTTACTACAGCT
TAGCCATGCATGCGTTACTACAGCT
TAGCCATGCATGGGTTACTACAGCT
TAGCCATGCATGTGTTACTACAGCT
ATCGGTAGCCATGCATGAGTTACTA CAGCT
TAGCCATCGGTACGTACTCAATGAT GTCGA
ATCGGTACGTACACAATGATGTCGA
ATCGGTACGTACCCAATGATGTCGA
ATCGGTACGTACGCAATGATGTCGA
ATCGGTACGTACTCAATGATGTCGA
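The tiling slides above can be reproduced with a short script. This is a minimal sketch, not the vendor's design software: it assumes 25-mer probes whose central (13th) base is substituted with each of A, C, G, and T, and it writes the antisense probes the way the slides do, as the base-by-base complement read left to right. The function names `probe_set` and `tile` are hypothetical.

```python
# Complement table. The antisense probes on the slides are shown as the
# base-by-base complement in the same left-to-right order, so the strand
# is complemented but not reversed here.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def probe_set(sequence: str, center: int, probe_len: int = 25) -> list[str]:
    """Return four probes centered on 0-based index `center`, one per base."""
    half = probe_len // 2
    if center - half < 0 or center + half >= len(sequence):
        raise ValueError("probe window runs off the sequence")
    window = sequence[center - half : center + half + 1]
    return [window[:half] + base + window[half + 1 :] for base in "ACGT"]


def tile(reference: str, probe_len: int = 25):
    """Yield (center, sense probes, antisense probes) for each tileable base."""
    antisense = reference.translate(COMPLEMENT)
    half = probe_len // 2
    for center in range(half, len(reference) - half):
        yield (
            center,
            probe_set(reference, center, probe_len),
            probe_set(antisense, center, probe_len),
        )


# The 30-base reference shown on every tiling slide.
reference = "ATCGGTAGCCATGCATGAGTTACTACAGCT"
tiles = list(tile(reference))

for center, sense, anti in tiles[:2]:
    print(f"position {center + 1}, reference base {reference[center]}")
    print("\n".join(sense))
    print("\n".join(anti))
```

With this reference, six positions (13 through 18) can be tiled, one per slide. In each set of four, exactly one sense probe and one antisense probe match the reference perfectly; the other three carry a single central mismatch.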
Resequencing Array Performance
Call rates*: > 90%
Overall Accuracy: > 99.9%
Reproducibility: > 99.9%
* Average call rate reported for arrays where all
target has been amplified to sufficient quantity
 Performance may vary depending on genomic content on individual custom
designs. Specific factors that impact performance include:
– GC content
– INDELs
– Divergence from reference sequence
– Multiple SNPs within 10bp
 Performance characterized across several data sets
– Data Set 1: Mouse – 30kb of sequence in 9 inbred DBA (diploid genome with homozygous SNPs)
– Data Set 2: ENCODE – 300kb across 16 CEPH (diploid genome with heterozygous SNPs)
– Data Set 3: Mitochondrial – 16kb across 3 replicates of 1 reference (haploid genome with heteroplasmy)
Quality Threshold Score
Impact on call rate versus accuracy
[Figure: two panels, Homozygous Model and Heterozygous Model, each plotting Call Rate (axis 65% to 100%) and Overall Accuracy (axis 99.75% to 100.00%) against QTS (0 to 12)]
Data Set 1:
Mouse Array – Homozygous Model
Average Call Rate: 95.92%
The number of bases called divided by the total number of bases possible.
Overall Accuracy: 99.99%
For all bases where a call is made, the percentage that agrees with capillary sequencing.
Overall Reproducibility: >99.99%
For a pair of technical replicate chips (pairs of mouse samples in this case), concordance is computed for all sites where the two arrays make a call. N’s excluded.
Homozygous SNP Call Rate: 95.95%
Percentage of calls made for all known SNP positions.
Homozygous SNP Accuracy: 100%
Percentage of calls made as a SNP when capillary sequencing called the same base a SNP.
 Does not include N’s or SNPs within 9 bp of another SNP
Homozygous SNP False Positive: 0.01%
Percentage of calls made as a SNP when capillary sequencing called the base a reference.
Homozygous SNP False Negative: 4.05%
Percentage of N calls made when capillary sequencing called the base as SNP.
 Does not include SNPs within 9 bp of another SNP
 Calculated for individual genotypes, not SNP sites
Homozygous SNP Reproducibility: 100%
Same as overall reproducibility, but for SNP sites only.
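The metric definitions above translate fairly directly into code. The sketch below is illustrative only: it assumes a haploid or homozygous model in which array calls, capillary calls, and the reference are equal-length strings with "N" marking a no-call, and it assumes the false-positive denominator is the set of called positions where capillary sequencing agrees with the reference. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
import math

NO_CALL = "N"


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def resequencing_metrics(array_calls: str, capillary_calls: str, reference: str) -> dict:
    """Compute call rate, accuracy, and SNP-level rates for one sample."""
    if not (len(array_calls) == len(capillary_calls) == len(reference)):
        raise ValueError("all three sequences must have the same length")
    sites = list(zip(array_calls, capillary_calls, reference))
    # Positions where the array made a call (anything other than N).
    called = [(a, c, r) for a, c, r in sites if a != NO_CALL]
    # Known SNP positions: capillary sequencing differs from the reference.
    snp_sites = [(a, c, r) for a, c, r in sites if c != r]
    # Called positions where capillary sequencing says "reference".
    called_ref_sites = [(a, c, r) for a, c, r in called if c == r]
    return {
        "call_rate": ratio(len(called), len(sites)),
        "accuracy": ratio(sum(a == c for a, c, _ in called), len(called)),
        "snp_call_rate": ratio(sum(a != NO_CALL for a, _, _ in snp_sites), len(snp_sites)),
        "snp_false_negative": ratio(sum(a == NO_CALL for a, _, _ in snp_sites), len(snp_sites)),
        "snp_false_positive": ratio(sum(a != r for a, _, r in called_ref_sites), len(called_ref_sites)),
    }


# Toy data: two SNPs (indices 4 and 9); the array misses the second one.
reference = "ACGTACGTAC"
capillary = "ACGTTCGTAA"
array     = "ACNTTCGTAN"
metrics = resequencing_metrics(array, capillary, reference)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these definitions the SNP call rate and the SNP false-negative rate sum to 100%, which is consistent with the 95.95% and 4.05% reported for the mouse data set.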
Data Set 2:
Diploid Analysis of ENCODE Interval on Chr 4
• Collected array data and dideoxy sequence data from 16
diploid CEPH individuals across 115kbp of non-repetitive Chr4
sequence = 1.84Mbp in total
• Dideoxy sequencing data
• Total of 1.44Mbp covered by dideoxy sequencing
• Each variant was confirmed by genotyping in all 16 DNAs
• Array data
• 27 LR-PCRs amplified ~250kbp of genomic sequence
• LR-PCRs pooled, fragmented, labeled, hybridized per SOP
• Hybridized one array per individual to query 115kb of non-repetitive
sequence
• Intensity data analysed using GSEQ v3.0
• in diploid mode
• at various quality threshold values
Data Set 2 Diploid ENCODE Region
Call Rate: 96.56%
The number of non-N calls divided by the total number of calls.
Overall Accuracy: 99.95%
Percentage of all calls (excluding Ns) that are concordant with ENCODE data.
Call rate at variant sites: 89.70%
Percentage of calls made for all known SNP loci including heterozygous and homozygous calls.
SNP False Negative Rate: 17.34%
Percentage of variant positions in the ENCODE data that are called N or reference in the array data.
SNP False Positive Rate: 0.04%
Percentage of reference positions in the ENCODE data that are called variant in the array data.
Homozygous Accuracy: 96.91%
Percentage of homozygous variant positions in the ENCODE data with concordant array data (excluding array Ns).
Heterozygous Accuracy: 86.25%
Percentage of heterozygous positions in the ENCODE data with concordant array data (excluding array Ns).
Homozygous SNP False Negative: 9.12%
Percentage of mis-calls (No Calls and Ref calls) made for all known homozygous SNP positions in the ENCODE data.
Heterozygous SNP False Negative: 22.15%
Percentage of mis-calls (No Calls and Ref calls) made for all known heterozygous SNP positions in the ENCODE data.
Post GSEQ Filters to Reduce False Positives
Summary of exclusions position*sample specific cell counts
Filter | Calls removed | # FPs removed | % FPs removed | # TPs removed | % TPs removed
PCR Failure | 19519 | 168 | 31.28% | 5 | 0.33%
Nearby SNPs – Footprint | 252 | 167 | 31.10% | 14 | 0.33%
Cross Hybridization sites | 64 | 33 | 6.15% | 0 | 0.00%
Low Complexity Probes | 128 | 1 | 0.19% | 0 | 0.00%
Non-biallelic Calls | 32 | 32 | 5.96% | 0 | 0.00%
Performance Post GSEQ Filters
Metric | Before Filters | After Filters
Call Rate | 96.56% | 95.98%
Overall Accuracy | 99.95% | 99.98%
False Positive Calls | 537 | 219
False Positive Rate | 0.040% | 0.016%
True Positive Calls | 1498 | 1479
SNP call False Negative Rate | 17.34% | 18.52%
SNP site False Negative rate | 8.18% | 9.39%
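As a worked check of the filter figures, the "% FPs removed" values can be reproduced from the per-filter counts. This sketch assumes the denominator is the 537 false positive calls reported before filtering; that reading is an inference from the numbers, not something the slides state. The assignment of the last two rows to "Low Complexity Probes" and "Non-biallelic Calls" follows the order in which the labels appear and is also an assumption.

```python
# False positive calls before any post-GSEQ filter (from the "Before Filters" column).
FALSE_POSITIVES_BEFORE = 537
FALSE_POSITIVES_AFTER = 219

# Number of false positives removed by each filter, as listed on the slide.
# The labels of the last two entries are assumed from label order.
fps_removed = {
    "PCR Failure": 168,
    "Nearby SNPs - Footprint": 167,
    "Cross Hybridization sites": 33,
    "Low Complexity Probes": 1,
    "Non-biallelic Calls": 32,
}


def percent(part: int, whole: int) -> float:
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


pct_removed = {name: percent(n, FALSE_POSITIVES_BEFORE) for name, n in fps_removed.items()}
for name, pct in pct_removed.items():
    print(f"{name}: {pct:.2f}%")

# Per-filter counts versus the net reduction in false positives.
sum_of_filters = sum(fps_removed.values())
net_reduction = FALSE_POSITIVES_BEFORE - FALSE_POSITIVES_AFTER
print(sum_of_filters, net_reduction)
```

The per-filter counts add up to more than the net reduction in false positives, which suggests that some calls are caught by more than one filter, although the slides do not say so explicitly.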
Impact of GC Content on Call Rates
[Figure: Call Rate vs. Probe GC content. Call Rate (axis 75% to 100%) plotted by % probe GC bin: <10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, >70]
Batch Analysis Improves Performance
[Figure: Performance as a function of sample size. Call Rate and FN rate (left axis, 0% to 100%) and FP rate (right axis, 0.000% to 0.025%) plotted against Number of Samples (1, 2, 4, 8, 16)]
CustomSeq® Applications
 Haploid
– Pathogen identification and typing
– Mitochondria
 Diploid
– Candidate genes
– Regions of linkage/association
– Pharmacogenomics
Microarray-based Resequencing of
Multiple Bacillus anthracis Isolates
Zwick, ME. et al., Genome Biology, 6:R10 (2004)
Bacillus anthracis Research
 Rapid, accurate, and inexpensive resequencing required
for a variety of applications and studies.
– Definitively identify B. anthracis in environmental and clinical samples
– Determine forensic attribution and detect genetic manipulation
– Determine phylogenetic relationships of strains
– Uncover the genetic basis of phenotypic variation in traits such as mammalian virulence.
 Neither the AFLP nor the MLST studies discover and
genotype sufficient genetic variation to distinguish
between B. anthracis strains
 Sequencing efforts are increasing but limited by cost.
Bacillus anthracis Custom Array
 Experimental Design
 Array
– 30kb CustomSeq® array containing 29,212bp of unique sequence
 Samples
– 56 isolates from Biological Defense Research Directorate's strain collection
– Samples hybridized in replicate on 2 arrays
 Assay
– Long range PCR
 Analysis – ABACUS software
Bacillus anthracis Custom Array Results
Replication experiment
Total number of bases called in replicate 1 | 1,383,229
Total number of bases called in replicate 2 | 1,373,905
Total number of bases called in both replicates | 1,349,177
Total number of bases called differently | 1
Replication experiment discrepancy rate | 7.4E-07
 Results – Call Rate and Reproducibility
– 115/118 array hybridizations successful
– Average call rate = 92.6%
– High reproducibility: only one discrepancy found between replicates across 1.35Mb of sequence
Bacillus anthracis Custom Array Results
Accuracy estimation experiment
Total number of bases called identically | 398,452
Total number of bases called differently | 15
Accuracy experiment discrepancy rate | 3.8E-05
 Results – Accuracy
– 30 arrays hybridized to anthrax strains previously sequenced by capillary sequencing
– 15 discrepancies / 6 SNP sites
 10 discrepancies / 5 sites resolved as arrays agreed with most recent shotgun assembly
 1 site accounting for 5 discrepancies could not be confirmed based on a single read with phred score of 7
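The discrepancy rates quoted for the two B. anthracis experiments can be recomputed from the base counts. This is a small sketch under one stated assumption: for the accuracy experiment, the number of bases compared is taken to be the identically called bases plus the differently called bases, which the slides do not spell out. The function name `discrepancy_rate` is hypothetical.

```python
def discrepancy_rate(differing: int, compared: int) -> float:
    """Fraction of compared bases on which two call sets disagree."""
    if compared <= 0:
        raise ValueError("compared must be positive")
    return differing / compared


# Replication experiment: 1 difference among bases called in both replicates.
replication = discrepancy_rate(1, 1_349_177)

# Accuracy experiment. Assumption: bases compared = identical + different calls.
accuracy = discrepancy_rate(15, 398_452 + 15)

# Fraction of array hybridizations that succeeded.
hybridization_success = 115 / 118

print(f"replication discrepancy rate: {replication:.1E}")
print(f"accuracy discrepancy rate:    {accuracy:.1E}")
print(f"successful hybridizations:    {hybridization_success:.1%}")
```

Formatted to two significant figures, these reproduce the 7.4E-07 and 3.8E-05 values on the slides, and the hybridization success rate rounds to 97.5%.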
Bacillus anthracis Custom Array Conclusions
 Study demonstrated that microarray-based
resequencing is technologically robust and generates
highly replicable and accurate data when compared to
alternative sequence technologies
 In this experiment, 115 arrays, or 97.5% of the total
attempted, were processed successfully, obtaining an
average high-quality base-calling rate of 92.6%
 Called bases are shown to be highly replicable
(discrepancy rate of 7.4 × 10^-7) and accurate when
compared to conventional shotgun sequence
(discrepancy rate of < 2.5 × 10^-6)
Sequencing Arrays for Screening Multiple
Genes Associated with Early-Onset Human
Retinal Degenerations on a High-Throughput Platform
Mandal MN, et al. Invest Ophthalmol Vis Sci. 46(9):3355-62 (2005)
* Study conducted at the Dept of Ophthalmology and Visual Science at the University of
Michigan, Ann Arbor in collaboration with McGill University Health Science Center and the
NEI/NIH
Retinitis pigmentosa
 Progressive retinal degeneration leading to
irreversible blindness or severe visual impairment
 Affects 1:3500 individuals worldwide
 Broad genetic heterogeneity with at least 32 genes
known to be associated with various forms (AD, AR,
X linked) of RP.
 Several treatments are in development but
response to individual treatments is likely to be
linked to genotype.
 Screening all known genes (~60kb) is inefficient by
traditional methods.
Retinitis pigmentosa Custom Array
 Array
– 11 RP genes (coding and flanking regions) representing 25.8kb unique sequence were tiled on a 30kb CustomSeq Array
 Samples
– 35 cases with known genotypes
– 35 novel cases
– 26 unaffected family members
 Assay
– Standard CustomSeq protocol
– Traditional PCR – 159 amplicons
Retinitis pigmentosa Array- Results
 Base calling Performance
– Average Call Rates = 97.60% (individual arrays ranged 96.0%-98.5%)
– Accuracy = >99%
– Reproducibility = >99%
 SNP Detection
– 506 sequence changes identified
 Accurately detected 382 previously reported SNPs and identified 113 novel SNPs
 Accurately detected 5 previously reported mutations and identified 7 novel rare mutations
Retinitis pigmentosa Array- Results
Summary of Novel Potentially Pathogenic Nucleotide Changes and Previously
Reported Mutations Detected in Patient DNA, with the arRP-I Arrays
Patient | Gene | Nucleotide Change | Amino Acid Change | Genotype | Reference
KE727 | RHO | C959A | Thr320Asn | Het | Novel
KE1246 | CRB1 | G2473A | Glu825Lys | Het | Novel
R165 | TULP1 | IVS2 _ 3 A _ G | | Homo | Novel
R206 | ABCA4 | G1699A | Val567Met | Het | Novel
KE869 | RGR | C734T | Ser245Phe | Het | Novel
R353 | MERTK | G500A | Arg167His | Het | Novel
R376 | ABCA4 | IVS23-2 A _ T | | Het | Novel
KE385 | RPE65 | T963G | Asn321Lys | Het | Known
KE1246 | ABCA4 | T3602G | Leu1201Arg | Het | Known
KE1246 | ABCA4 | G5077A | Val1693Ile | Het | Known
R376 | ABCA4 | C5327T | Pro1776Leu | Het | Known
Retinitis pigmentosa Array- Conclusion
 Resequencing arrays provide an efficient and
reliable method of high-throughput screening
for mutations in genetically heterogeneous
diseases.
 Enables one to screen multiple genes and
enables the analysis of both Mendelian and
complex forms of retinal degeneration
 Comparison of material costs revealed that
arrays were 23% cheaper. Time and labor
savings further increased the cost
effectiveness of this method
A Transforming MET Mutation Discovered in
Non-small Cell Lung Cancer Using
Microarray-based Resequencing
Tengs T., et al. Cancer Lett. (2005)
* Study conducted at the Dana Farber Cancer Institute in collaboration with MIT/Broad Institute
and Merck Pharmaceuticals
Custom Cancer Array
 Objective
– Evaluate the performance of resequencing arrays to detect mutations in oncogenes and tumor suppressor genes
– Identify novel mutations which may have an impact on therapeutic response
 Experimental design
– 164 exons (23,966 bp) from genes associated with cancer
– Sequenced 20 lung tumor samples with matched normal controls
– Dideoxy sequencing was performed on a subset of exons in order to evaluate the performance of the arrays
Custom Cancer Array Performance
 Coverage and accuracy of resequencing arrays when compared to dideoxy sequencing
– Call rate: 97.53%
– Overall accuracy: 99.99%
– Only 4 SNP call errors reported
 3 hom SNP called het SNP / 1 het SNP called ref
Exons also covered by dideoxy sequencing
Total number of bases interrogated | 335,420
Total number of ’no calls’ made by GDAS | 8,283 (2.47%)
Coverage | 327,137 (97.53%)
Number of homozygous mutations found by dideoxy sequencing in loci where GDAS made calls | 37
Number of heterozygous mutations found by dideoxy sequencing in loci where GDAS made calls | 71
Total number of ’no calls’ made by GDAS in mutated loci | 11 (9.24%)
Homozygous mutations called correctly by GDAS | 34 (91.89%)
Heterozygous mutations called correctly by GDAS | 70 (98.59%)
Total number of correct calls by GDAS in loci covered | 327,132 (99.99%)
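The percentages in the performance table follow from the raw counts, as the short calculation below illustrates. It is a sketch rather than the authors' analysis: the variable names are hypothetical, and the denominator for the no-call rate in mutated loci is assumed to be the 37 homozygous plus 71 heterozygous mutations plus the 11 no-calls, which reproduces the quoted 9.24% but is not stated on the slide.

```python
def pct(part: int, whole: int) -> float:
    """Return part / whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Raw counts from the performance table.
interrogated = 335_420
no_calls = 8_283
correct_calls = 327_132
hom_found, hom_correct = 37, 34
het_found, het_correct = 71, 70
no_calls_in_mutated_loci = 11

# Bases where a call was made, and how many of those calls were wrong.
called = interrogated - no_calls
miscalls = called - correct_calls

# Assumption: mutated loci = called homozygous + called heterozygous + no-calls.
mutated_loci = hom_found + het_found + no_calls_in_mutated_loci

print("no-call rate:", pct(no_calls, interrogated))
print("coverage:", pct(called, interrogated))
print("homozygous correct:", pct(hom_correct, hom_found))
print("heterozygous correct:", pct(het_correct, het_found))
print("no-calls in mutated loci:", pct(no_calls_in_mutated_loci, mutated_loci))
print("miscalled bases:", miscalls)
print("concordance:", correct_calls / called)
```

The concordance works out to slightly above 99.99%, in line with the ">99.99% concordance" wording used in the study's conclusions.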
Amino acid changing mutations detected in the
20 NSCLC samples
Gene | Refseq | Nucleotide change | Amino acid change | Origin | Heterozygous/homozygous
CDKN2A | NM_000077 | G654A | A148T | Germline | 2/0b
CDKN2A | NM_000077 | T556A | V115E | Germline | 1/0
KRAS2 | NM_004985 | G216A | G12D | Somatic | 2/1
KRAS2 | NM_004985 | G216T | G12V | Somatic | 1/0
MET | NM_000245 | A1311G | N375S | Germline | 3/0
MET | NM_000245 | C2646T | P814S | Germline | 1/0
MET | NM_000245 | C3162T | T1010I | Germline | 1/0
NRAS | NM_002524 | A435T | Q61L | Germline | 1/0
PTEN | NM_000314 | G1266A | A79T | Germline | 1/0
RET | NM_020630 | G1645A | D489N | Germline | 1/0
RET | NM_020630 | G2251A | G691S | Germline | 8/0
RET | NM_020630 | C3124T | R982C | Germline | 1/0
TP53 | NM_000546 | G1075T | C275F | Somatic | 1/0
TP53 | NM_000546 | G984T | G245C | Somatic | 1/0
TP53 | NM_000546 | G714T | R158L | Somatic | 2/0
TP53 | NM_000546 | G775A | R175H | Somatic | 1/0
TP53 | NM_000546 | G997T | R249M | Somatic | 1/0
TP53 | NM_000546 | C1167T | R306Stop | Somatic | 1/0
TP53 | NM_000546 | G466C | R72P | Germline | 4/1
TP53 | NM_000546 | GOT | splice-site | Somatic | 1/0
TP53 | NM_000546 | G1065T | V272L | Somatic | 1/0
TP53 | NM_000546 | A739G | Y163C | Somatic | 1/0
Custom Cancer Array- Conclusions
 Results show that resequencing microarrays can be
used as a tool for cancer mutation detection and
discovery
 The overall performance of the platform is comparable
to traditional Sanger-based sequencing with a very high
concordance rate (327,132 out of 327,137 bases called
consistently; >99.99% concordance)
 Furthermore, we have found the transforming MET
mutation T1010I in NSCLC to be present in a small
fraction of lung tumors. Since MET inhibitors are
currently being evaluated in lung cancer, it is tempting to
speculate that they might prove beneficial in a subset of
lung tumors with activated MET tyrosine kinase
Summary
 CustomSeq® Resequencing arrays have proven to be a
valuable tool for a variety of applications including
microbial research, mitochondrial analysis, and
resequencing of genes involved in heterogeneous
diseases
 CustomSeq arrays provide an efficient and cost-effective method for large-scale sequence variation
analysis
 Resequencing arrays provide high-quality sequence
information
– Call Rates >90%
– Accuracy >99.9%
– Reproducibility >99.9%
 Resequencing Arrays facilitate large-scale comparative
sequencing projects by providing significant benefits in
terms of ease of use and data analysis