GeneChip® CustomSeq® Resequencing Arrays

GeneChip® CustomSeq® Arrays
Most efficient and cost-effective method for large-scale sequence variation analysis – up to 300kb on a single array for >.01 cent per base
Flexible array formats for a variety of applications: 50kb, 100kb, and 300kb
Long “read length” minimizes curation and assembly time
High-quality data
– Accuracy >99.9%
– Reproducibility >99.9%

Resequencing Assay Overview
Genomic DNA → Select areas of interest by PCR → PCR products → Pool and fragment → DNA fragments → Label → End-labeled fragments → Hybridization, Wash, and Stain → Scan

GeneChip® Resequencing System
– GeneChip Resequencing Assay Kit
– GeneChip CustomSeq® Array
– GeneChip Scanner 3000
– Data Collection
– Data Management
– Client-Server Option
– GeneChip Sequence Analysis Software (GSEQ) 4.0
– GeneChip Operating Software (GCOS) 1.4

Resequencing Tiling Strategy
ATCGGTAGCCATACATGAGTTACTA
ATCGGTAGCCATTCATGAGTTACTA
ATCGGTAGCCATCCATGAGTTACTA
ATCGGTAGCCATGCATGAGTTACTA

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

TAGCCATCGGTAGGTACTCAATGAT
TAGCCATCGGTACGTACTCAATGAT
TAGCCATCGGTATGTACTCAATGAT
TAGCCATCGGTAAGTACTCAATGAT

Resequencing Tiling Strategy
 TCGGTAGCCATGAATGAGTTACTAC
 TCGGTAGCCATGCATGAGTTACTAC
 TCGGTAGCCATGGATGAGTTACTAC
 TCGGTAGCCATGTATGAGTTACTAC

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

 AGCCATCGGTACATACTCAATGATG
 AGCCATCGGTACCTACTCAATGATG
 AGCCATCGGTACGTACTCAATGATG
 AGCCATCGGTACTTACTCAATGATG

Resequencing Tiling Strategy
  CGGTAGCCATGCATGAGTTACTACA
  CGGTAGCCATGCCTGAGTTACTACA
  CGGTAGCCATGCGTGAGTTACTACA
  CGGTAGCCATGCTTGAGTTACTACA

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

  GCCATCGGTACGAACTCAATGATGT
  GCCATCGGTACGCACTCAATGATGT
  GCCATCGGTACGGACTCAATGATGT
  GCCATCGGTACGTACTCAATGATGT

Resequencing Tiling Strategy
   GGTAGCCATGCAAGAGTTACTACAG
   GGTAGCCATGCACGAGTTACTACAG
   GGTAGCCATGCAGGAGTTACTACAG
   GGTAGCCATGCATGAGTTACTACAG

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

   CCATCGGTACGTACTCAATGATGTC
   CCATCGGTACGTCCTCAATGATGTC
   CCATCGGTACGTGCTCAATGATGTC
   CCATCGGTACGTTCTCAATGATGTC

Resequencing Tiling Strategy
    GTAGCCATGCATAAGTTACTACAGC
    GTAGCCATGCATCAGTTACTACAGC
    GTAGCCATGCATGAGTTACTACAGC
    GTAGCCATGCATTAGTTACTACAGC

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

    CATCGGTACGTAATCAATGATGTCG
    CATCGGTACGTACTCAATGATGTCG
    CATCGGTACGTAGTCAATGATGTCG
    CATCGGTACGTATTCAATGATGTCG

Resequencing Tiling Strategy
     TAGCCATGCATGAGTTACTACAGCT
     TAGCCATGCATGCGTTACTACAGCT
     TAGCCATGCATGGGTTACTACAGCT
     TAGCCATGCATGTGTTACTACAGCT

ATCGGTAGCCATGCATGAGTTACTACAGCT
TAGCCATCGGTACGTACTCAATGATGTCGA

     ATCGGTACGTACACAATGATGTCGA
     ATCGGTACGTACCCAATGATGTCGA
     ATCGGTACGTACGCAATGATGTCGA
     ATCGGTACGTACTCAATGATGTCGA

Resequencing Array Performance
Call rates* > 90%
Overall Accuracy > 99.9%
Reproducibility > 99.9%
* Average call rate reported for arrays where all target has been amplified to sufficient quantity
Performance may vary depending on genomic content on individual custom designs. Specific factors that impact performance include:
– GC content
– INDELs
– Divergence from reference sequence
– Multiple SNPs within 10bp

Performance characterized across several data sets
– Data Set 1: Mouse – 30kb of sequence in 9 inbred DBA (diploid genome with homozygous SNPs)
– Data Set 2: ENCODE – 300kb across 16 CEPH (diploid genome with heterozygous SNPs)
– Data Set 3: Mitochondrial – 16kb across 3 replicates of 1 reference (haploid genome with heteroplasmy)

Quality Threshold Score
Impact on call rate versus accuracy
[Figure: two panels, Homozygous Model and Heterozygous Model, each plotting Call Rate (65% to 100%) and Overall Accuracy (99.75% to 100.00%) against QTS (0 to 12)]

Data Set 1: Mouse Array – Homozygous Model
Average Call Rate: 95.92%
– The number of bases called divided by the total number of bases possible.
Overall Accuracy: 99.99%
– For all bases where a call is made, the percentage that agrees with capillary sequencing
Overall Reproducibility: >99.99%
– For a pair of technical replicate chips (pairs of mouse samples in this case), concordance is computed for all sites where the two arrays make a call. N’s excluded.
Homozygous SNP Call Rate: 95.95%
– Percentage of calls made for all known SNP positions.
Homozygous SNP Accuracy: 100%
– Percentage of calls made as a SNP when capillary sequencing called the same base a SNP
– Does not include N’s or SNPs within 9 bp of another SNP
Homozygous SNP False Positive: 0.01%
– Percentage of calls made as a SNP when capillary sequencing called the base a reference.
Homozygous SNP False Negative: 4.05%
– Percentage of N calls made when capillary sequencing called the base as SNP
– Does not include SNPs within 9 bp of another SNP
– Calculated for individual genotypes, not SNP sites
Homozygous SNP Reproducibility: 100%
– Same as overall reproducibility, but for SNP sites only.

Data Set 2: Diploid Analysis of ENCODE Interval on Chr 4
• Collected array data and dideoxy sequence data from 16 diploid CEPH individuals across 115kbp of non-repetitive Chr4 sequence = 1.84Mbp in total
• Dideoxy sequencing data
  • Total of 1.44Mbp covered by dideoxy sequencing
  • Each variant was confirmed by genotyping in all 16 DNAs
• Array data
  • 27 LR-PCRs amplified ~250kbp of genomic sequence
  • LR-PCRs pooled, fragmented, labeled, and hybridized per SOP
  • Hybridized one array per individual to query 115kb of non-repetitive sequence
  • Intensity data analysed using GSEQ v3.0
    • in diploid mode
    • at various quality threshold values

Data Set 2: Diploid ENCODE Region
Call Rate: 96.56%
– The number of non-N calls divided by the total number of calls
Overall Accuracy: 99.95%
– Percentage of all calls (excluding Ns) that are concordant with ENCODE data
Call rate at variant sites: 89.70%
– Percentage of calls made for all known SNP loci, including heterozygous and homozygous calls
SNP False Negative Rate: 17.34%
– Percentage of variant positions in the ENCODE data that are called N or reference in the array data
SNP False Positive Rate: 0.04%
– Percentage of reference positions in the ENCODE data that are called variant in the array data
Homozygous Accuracy: 96.91%
– Percentage of homozygous variant positions in the ENCODE data with concordant array data (excluding array Ns)
Heterozygous Accuracy: 86.25%
– Percentage of heterozygous positions in the ENCODE data with concordant array data (excluding array Ns)
Homozygous SNP False Negative: 9.12%
– Percentage of mis-calls (No Calls and Ref calls) made for all known homozygous SNP positions in the ENCODE data
Heterozygous SNP False Negative: 22.15%
– Percentage of mis-calls (No Calls and Ref calls) made for all known heterozygous SNP positions in the ENCODE data

Post GSEQ Filters to Reduce False Positives
Summary of exclusions (position*sample specific cell counts)
Filter | Calls removed | # FPs removed | % FPs removed | # TPs removed | % TPs removed
PCR Failure | 19519 | 168 | 31.28% | 5 | 0.33%
Nearby SNPs – Footprint | 252 | 167 | 31.10% | 14 | 0.33%
Cross Hybridization sites | 64 | 33 | 6.15% | 0 | 0.00%
Low Complexity Probes | 128 | 1 | 0.19% | 0 | 0.00%
Non-biallelic Calls | 32 | 32 | 5.96% | 0 | 0.00%

Performance Post GSEQ Filters
Metric | Before Filters | After Filters
Call Rate | 96.56% | 95.98%
Overall Accuracy | 99.95% | 99.98%
False Positive Calls | 537 | 219
False Positive Rate | 0.040% | 0.016%
True Positive Calls | 1498 | 1479
SNP call False Negative Rate | 17.34% | 18.52%
SNP site False Negative Rate | 8.18% | 9.39%

Impact of GC Content on Call Rates
Call Rate vs.
Probe GC content
[Figure: Call Rate (75.00% to 100.00%) by % probe GC bin: <10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, >70]

Batch Analysis Improves Performance
Performance as a function of sample size
[Figure: Call Rate and FN rate (left axis, 0.0% to 100.0%) and FP rate (right axis, 0.000% to 0.025%) against Number of Samples (1, 2, 4, 8, 16)]

CustomSeq® Applications
Haploid
– Pathogen identification and typing
– Mitochondria
Diploid
– Candidate genes
– Regions of linkage/association
– Pharmacogenomics

Microarray-based Resequencing of Multiple Bacillus anthracis Isolates
Zwick, ME. et al., Genome Biology, 6:R10 (2004)

Bacillus anthracis Research
Rapid, accurate, and inexpensive resequencing is required for a variety of applications and studies.
– Definitively identify B. anthracis in environmental and clinical samples
– Determine forensic attribution and detect genetic manipulation
– Determine phylogenetic relationships of strains
– Uncover the genetic basis of phenotypic variation in traits such as mammalian virulence
Neither the AFLP nor the MLST studies discover and genotype sufficient genetic variation to distinguish between B. anthracis strains.
Sequencing efforts are increasing but limited by cost.

Bacillus anthracis Custom Array Experimental Design
Array
– 30kb CustomSeq® array containing 29,212bp of unique sequence
Samples
– 56 isolates from the Biological Defense Research Directorate's strain collection
– Samples hybridized in replicate on 2 arrays
Assay
– Long range PCR
Analysis
– ABACUS software

Bacillus anthracis Custom Array Results
Replication experiment
Total number of bases called in replicate 1: 1,383,229
Total number of bases called in replicate 2: 1,373,905
Total number of bases called in both replicates: 1,349,177
Total number of bases called differently: 1
Replication experiment discrepancy rate: 7.4E-07
Results – Call Rate and Reproducibility
– 115/118 array hybridizations successful
– Average call rate = 92.6%
– High reproducibility: only one discrepancy found between replicates across 1.35Mb of sequence

Bacillus anthracis Custom Array Results
Accuracy estimation experiment
Total number of bases called identically: 398,452
Total number of bases called differently: 15
Accuracy experiment discrepancy rate: 3.8E-05
Results – Accuracy
– 30 arrays hybridized to anthrax strains previously sequenced by capillary sequencing
– 15 discrepancies at 6 SNP sites
  – 10 discrepancies at 5 sites resolved, as the arrays agreed with the most recent shotgun assembly
  – 1 site, accounting for 5 discrepancies, could not be confirmed; it was based on a single read with a phred score of 7

Bacillus anthracis Custom Array Conclusions
The study demonstrated that microarray-based resequencing is technologically robust and generates highly replicable and accurate data when compared to alternative sequence technologies.
In this experiment, 115 arrays, or 97.5% of the total attempted, were processed successfully, obtaining an average high-quality base-calling rate of 92.6%.
Called bases are shown to be highly replicable (discrepancy rate of 7.4 × 10⁻⁷) and accurate when compared to conventional shotgun sequence (discrepancy rate of < 2.5 × 10⁻⁶).

Sequencing Arrays for Screening Multiple Genes Associated with Early-Onset Human Retinal Degenerations on a High-Throughput Platform
Mandal MN, et al. Invest Ophthalmol Vis Sci.
46(9):3355-62 (2005)
* Study conducted at the Dept of Ophthalmology and Visual Science at the University of Michigan, Ann Arbor in collaboration with McGill University Health Science Center and the NEI/NIH

Retinitis pigmentosa
Progressive retinal degeneration leading to irreversible blindness or severe visual impairment
Affects 1:3500 individuals worldwide
Broad genetic heterogeneity, with at least 32 genes known to be associated with various forms (AD, AR, X-linked) of RP
Several treatments are in development, but response to individual treatments is likely to be linked to genotype.
Screening all known genes (~60kb) is inefficient by traditional methods.

Retinitis pigmentosa Custom Array
Array
– 11 RP genes (coding and flanking regions) representing 25.8kb of unique sequence were tiled on a 30kb CustomSeq Array
Samples
– 35 cases with known genotypes
– 35 novel cases
– 26 unaffected family members
Assay
– Standard CustomSeq protocol
– Traditional PCR, 159 amplicons

Retinitis pigmentosa Array – Results
Base calling Performance
– Average Call Rates = 97.60% (individual arrays ranged 96.0%-98.5%)
– Accuracy = >99%
– Reproducibility = >99%
SNP Detection
– 506 sequence changes identified
– Accurately detected 382 previously reported SNPs and identified 113 novel SNPs
– Accurately detected 5 previously reported mutations and identified 7 novel rare mutations

Retinitis pigmentosa Array – Results
Summary of Novel Potentially Pathogenic Nucleotide Changes and Previously Reported Mutations Detected in Patient DNA, with the arRP-I Arrays
Patient | Gene | Nucleotide Change | Amino Acid Change | Genotype | Reference
KE727 | RHO | C959A | Thr320Asn | Het | Novel
KE1246 | CRB1 | G2473A | Glu825Lys | Het | Novel
R165 | TULP1 | IVS2 _ 3 A _ G | | Homo | Novel
R206 | ABCA4 | G1699A | Val567Met | Het | Novel
KE869 | RGR | C734T | Ser245Phe | Het | Novel
R353 | MERTK | G500A | Arg167His | Het | Novel
R376 | ABCA4 | IVS23-2 A _ T | | Het | Novel
KE385 | RPE65 | T963G | Asn321Lys | Het | Known
KE1246 | ABCA4 | T3602G | Leu1201Arg | Het | Known
KE1246 | ABCA4 | G5077A | Val1693Ile | Het | Known
R376 | ABCA4 | C5327T | Pro1776Leu | Het | Known

Retinitis pigmentosa Array – Conclusion
Resequencing arrays provide an efficient and reliable method of high-throughput screening for mutations in genetically heterogeneous diseases.
The approach makes it possible to screen multiple genes and to analyze both Mendelian and complex forms of retinal degeneration.
Comparison of material costs revealed that arrays were 23% cheaper. Time and labor savings further increased the cost effectiveness of this method.

A Transforming MET Mutation Discovered in Non-small Cell Lung Cancer Using Microarray-based Resequencing
Tengs T., et al. Cancer Lett.
(2005)
* Study conducted at the Dana-Farber Cancer Institute in collaboration with MIT/Broad Institute and Merck Pharmaceuticals

Custom Cancer Array
Objective
– Evaluate the performance of resequencing arrays to detect mutations in oncogenes and tumor suppressor genes
– Identify novel mutations which may have an impact on therapeutic response
Experimental design
– 164 exons (23,966 bp) from genes associated with cancer
– Sequenced 20 lung tumor samples with matched normal controls
– Dideoxy sequencing was performed on a subset of exons in order to evaluate the performance of the arrays

Custom Cancer Array Performance
Coverage and accuracy of resequencing arrays when compared to dideoxy sequencing
– Call rate: 97.53%
– Overall accuracy: 99.99%
– Only 4 SNP call errors reported: 3 homozygous SNPs called heterozygous, 1 heterozygous SNP called reference
Exons also covered by dideoxy sequencing
Total number of bases interrogated: 335,420
Total number of ’no calls’ made by GDAS: 8,283 (2.47%)
Coverage: 327,137 (97.53%)
Number of homozygous mutations found by dideoxy sequencing in loci where GDAS made calls: 37
Number of heterozygous mutations found by dideoxy sequencing in loci where GDAS made calls: 71
Total number of ’no calls’ made by GDAS in mutated loci: 11 (9.24%)
Homozygous mutations called correctly by GDAS: 34 (91.89%)
Heterozygous mutations called correctly by GDAS: 70 (98.59%)
Total number of correct calls by GDAS in loci covered: 327,132 (99.99%)

Amino acid changing mutations detected in the 20 NSCLC samples
Gene | Refseq | Nucleotide change | Amino acid change | Origin | Heterozygous/homozygous
CDKN2A | NM_000077 | G654A | A148T | Germline | 2/0b
CDKN2A | NM_000077 | T556A | V115E | Germline | 1/0
KRAS2 | NM_004985 | G216A | G12D | Somatic | 2/1
KRAS2 | NM_004985 | G216T | G12V | Somatic | 1/0
MET | NM_000245 | A1311G | N375S | Germline | 3/0
MET | NM_000245 | C2646T | P814S | Germline | 1/0
MET | NM_000245 | C3162T | T1010I | Germline | 1/0
NRAS | NM_002524 | A435T | Q61L | Germline | 1/0
PTEN | NM_000314 | G1266A | A79T | Germline | 1/0
RET | NM_020630 | G1645A | D489N | Germline | 1/0
RET | NM_020630 | G2251A | G691S | Germline | 8/0
RET | NM_020630 | C3124T | R982C | Germline | 1/0
TP53 | NM_000546 | G1075T | C275F | Somatic | 1/0
TP53 | NM_000546 | G984T | G245C | Somatic | 1/0
TP53 | NM_000546 | G714T | R158L | Somatic | 2/0
TP53 | NM_000546 | G775A | R175H | Somatic | 1/0
TP53 | NM_000546 | G997T | R249M | Somatic | 1/0
TP53 | NM_000546 | C1167T | R306Stop | Somatic | 1/0
TP53 | NM_000546 | G466C | R72P | Germline | 4/1
TP53 | NM_000546 | GOT | splice-site | Somatic | 1/0
TP53 | NM_000546 | G1065T | V272L | Somatic | 1/0
TP53 | NM_000546 | A739G | Y163C | Somatic | 1/0

Custom Cancer Array – Conclusions
Results show that resequencing microarrays can be used as a tool for cancer mutation detection and discovery.
The overall performance of the platform is comparable to traditional Sanger-based sequencing with a very high concordance rate (327,132 out of 327,137 bases called consistently; >99.99% concordance).
Furthermore, we have found the transforming MET mutation T1010I in NSCLC to be present in a small fraction of lung tumors. Since MET inhibitors are currently being evaluated in lung cancer, it is tempting to speculate that they might prove beneficial in a subset of lung tumors with activated MET tyrosine kinase.

Summary
CustomSeq® Resequencing arrays have proven to be a valuable tool for a variety of applications including microbial research, mitochondrial analysis, and resequencing of genes involved in heterogeneous diseases.
CustomSeq arrays provide an efficient and cost-effective method for large-scale sequence variation analysis.
Resequencing arrays provide high-quality sequence information
– Call Rates >90%
– Accuracy >99.9%
– Reproducibility >99.9%
Resequencing Arrays facilitate large-scale comparative sequencing projects by providing significant benefits in terms of ease of use and data analysis.
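
The call rate, coverage, and discrepancy figures quoted in the case studies are simple ratios of the counts reported on the results slides. The sketch below recomputes a few of them in Python as a sanity check. The helper name `fraction` is hypothetical, and the choice of denominator for the accuracy discrepancy rate (identical plus different calls) is an assumption, since the slides do not state it.

```python
def fraction(part: int, whole: int) -> float:
    """Return part / whole, rejecting an empty denominator."""
    if whole <= 0:
        raise ValueError("denominator must be positive")
    return part / whole


# Bacillus anthracis study (counts from the results slides).
arrays_successful = fraction(115, 118)
replicate_discrepancy = fraction(1, 1_349_177)
# Assumption: denominator is all bases compared (identical + different).
accuracy_discrepancy = fraction(15, 398_452 + 15)

# Custom cancer array, exons also covered by dideoxy sequencing.
bases_interrogated = 335_420
no_calls = 8_283
bases_called = bases_interrogated - no_calls
no_call_rate = fraction(no_calls, bases_interrogated)
coverage = fraction(bases_called, bases_interrogated)
concordance = fraction(327_132, bases_called)

print(f"arrays successful:     {arrays_successful:.1%}")
print(f"replicate discrepancy: {replicate_discrepancy:.1e}")
print(f"accuracy discrepancy:  {accuracy_discrepancy:.1e}")
print(f"bases called:          {bases_called:,}")
print(f"no-call rate:          {no_call_rate:.2%}")
print(f"coverage:              {coverage:.2%}")
print(f"concordance:           {concordance:.6f}")
```

Keeping the counts next to the derived rates makes it easy to see which denominator each percentage uses, which matters when comparing per-call and per-site rates across data sets.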
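
As a rough illustration of the pattern on the Resequencing Tiling Strategy slides, the following Python sketch generates, for each interrogated base, four 25-base probes per strand that differ only at the centre position. It is a minimal sketch under stated assumptions, not the actual array design procedure: the function name `tile_probes` is hypothetical, both strands are written left to right as on the slides, and strand orientation and probe-to-target complementarity are ignored.

```python
def tile_probes(strand: str, probe_len: int = 25) -> list[list[str]]:
    """Return one set of four probes per interrogated base of `strand`.

    Each probe is a `probe_len`-base window of the strand with its centre
    base replaced by A, C, G or T, so exactly one probe per set matches
    the strand perfectly.
    """
    centre = probe_len // 2
    probe_sets = []
    for start in range(len(strand) - probe_len + 1):
        window = strand[start:start + probe_len]
        probe_sets.append(
            [window[:centre] + base + window[centre + 1:] for base in "ACGT"]
        )
    return probe_sets


# 30-base reference duplex from the slides, both strands written left to right.
TOP = "ATCGGTAGCCATGCATGAGTTACTACAGCT"
BOTTOM = TOP.translate(str.maketrans("ACGT", "TGCA"))

sense_sets = tile_probes(TOP)
antisense_sets = tile_probes(BOTTOM)

# Print each probe set indented by its offset along the reference.
for offset, (sense, antisense) in enumerate(zip(sense_sets, antisense_sets)):
    print(f"offset {offset}: queried base {TOP[offset + 12]}/{BOTTOM[offset + 12]}")
    for probe in sense + antisense:
        print(" " * offset + probe)
```

With a 30-base reference and 25-base probes this yields six window positions and eight probes per position (four per strand), one window per step of the tiling.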