Low-Level Copy Number Analysis Part 1 - Background Henrik Bengtsson Post doc, Department of Statistics, University of California, Berkeley, USA CEIT Workshop on SNP arrays, Dec 15-17, 2008, San Sebastian Acknowledgments UC Berkeley: • James Bullard • Kasper Hansen • Elizabeth Purdom • Terry Speed Lawrence Berkeley National Labs: • Amrita Ray • Paul Spellman John Hopkins, Baltimore: • Benilton Carvalho • Rafael Irizarry WEHI, Melbourne, Australia: • Mark Robinson • Ken Simpson ISREC, Lausanne, Switzerland: • Pratyaksha “Asa” Wirapati Affymetrix, California: • Ben Bolstad • Simon Cawley • Steve Chervitz • Harley Gorrell • Earl Hubbell • Luis Jevons • Chuck Sugnet • Jim Veitch • Alan Williams Detect more and smaller aberrations with less errors CRMA detection rate other #1 other #2 other #3 Copy number analysis is about finding "aberrations" in one or several individuals The HapMap project - Large project to identify SNPs in Humans (2003-) The HapMap is a catalog of common genetic variants (SNPs) that occur in human beings. It describes what These variants are, where they occur in our DNA, and How they are distributed among people within populations and among populations in different parts of the world. URL: http://www.hapmap.org/ The HapMap project - 270 normal individuals genotyped by different labs using various technologies • • • • 90 CEU individuals (Utah/Europe, 30 trio families) 90 YRI individuals (Nigeria; 30 trio families) 45 CHB (China; unrelated) 45 JPT (Japan; unrelated) Publicly available: • High quality data. • Raw data, e.g. Affymetrix CEL files. • Genotypes. • Studied by many groups. Copy number polymorphism - People share common CN aberrations (2005-) The Cancer Genome Atlas (TCGA) project - Large project for genetic mapping of tumors (2007-) The TCGA project - A large number of tissues are studies with many DNA & RNA technologies • Tumor types: – brain cancer (glioblastoma multiforme, or GBM), – lung cancer (squamous cell carcinoma of the lung), and – ovarian cancer (serous cystadenocarcinoma of the ovary). • 234 tumors (of 500) characterized. • Multiple labs in the US – Broad, Harvard, Stanford, LBNL, … • High quality data. • Platforms: Affymetrix, Illumina, Agilent, … • Gene-, exon-, microRNA- expression, methylation, SNP & CN, sequencing… • Raw and summarized data immediately available (publicly), e.g. Affymetrix CEL files. Combining copy numbers across platforms & labs Henrik Bengtsson (UC Berkeley), Amrita Ray (LBNL), Paul Spellman (LBNL), Terry Speed (UC Berkeley) BACKGROUND: Whole-genome copy-number (CN) studies are rapidly expanding, and with this expansion comes a demand for increased precision and resolution of CN estimates. Several recent studies have obtained CN estimates from more than one platform on the same samples, and it is natural to want to combine the different estimates in order to meet this demand. PROBLEM: CN estimates from different platforms show different degrees of attenuation of the true CN changes. Differences can also be observed in CN estimates from the same platform run in different labs, or in the same lab, with different analytical methods. This is the reason why it is not straightforward matter to combine CN estimates from different sources (platforms, labs, analysis methods, etc). (A) Broad, Affymetrix GWS6, n=1800K, 1.59kb/locus, 25-mers (B) Stanford, Illumina 550K, n=550K, 5.53kb/locus, 50-mers (C) MSKCC, Agilent 244K, n=236K, 12.7kb/locus, 60-mers (D) Harvard, Agilent 244K, n=236K, 12.7kb/locus, 60-mers The smoothed raw CNs from the four sources have similar CN profiles but different mean levels. Tumor/normal CNs by four TCGA centers (“sources”) in a 60Mb region on Chr3 in sample TCGA-02-104. (The combined set would consist of 2,822K loci with 0.95kb/locus.) METHOD: We have developed a single-sample multi-source normalization that brings full-resolution CN estimates to the same scale across sources. Kernel estimators and principal component curves are used to estimate the non-linear relationships between the sources. Full-resolution data is then normalized such that these relationships become linear. The normalized estimates are such that for any underlying CN level, the mean level of the CN estimates is the same regardless of source. CNs with consistent mean levels are better suited for being combined across sources, e.g. existing segmentation methods may be used to identify aberrant regions. Before normalization: Non-linearity between pairs The smoothed normalized CNs from the four sources have similar CN profiles and same mean levels. After normalization: Linearity between pairs Normalized full-resolution CNs for the four sources. RESULTS: We use microarray-based CN estimates from The Cancer Genome Atlas (TCGA) project to illustrate the method. We show that after normalization the mean levels of randomly selected CN aberrations are the same across platforms, and that the normalized and combined data better separate two CN states at a given resolution. We conclude that it is possible to combine CNs from multiple sources such that the resolution becomes effectively larger, and when multiple platforms are combined, they also enhance the genome coverage by complementing each other in different regions. (A) (B) (C) A 400kb region in TCGA-02-104 on Chr 3: CNs from different sources give different segmenting results at different precisions. (D) (comb+raw) (comb+norm) At any given resolution (amount of smoothing), with combined normalized CNs (solid red) one can separate two CN states better than with combined raw CNs (dot-dashed red), and with each of the individuals sources (gray dotted). With combined normalized CNs, there is more power to detect change points (CPs) and their locations are more precise. Examples of genomic profiles The Affymetrix platform The Affymetrix GeneChip is a synthesized high-density (single-array) microarray * * * * * 5 µm 1.28 cm 5 µm 1.28 cm 1 million identical 25-mer sequences 6.5 million probes/chip Copy-number probes are used to quantify the amount of DNA at known loci ...CGTAGCCATCGGTAAGTACTCAATGATAG... ATCGGTAGCCATTCATGAGTTACTA CN locus: PM: CN=1 CN=2 ** * PM = c CN=3 ** * PM = 2c ** * PM = 3c Single Nucleotide Polymorphism (SNP) Definition: A sequence variation such that two chromosomes may differ by a single nucleotide (A, T, C, or G). Allele A: Allele B: A ...CGTAGCCATCGGTA/GTACTCAATGATAG... G A person is either AA, AB, or BB at this SNP. Probes for SNPs PMA: Allele A: ATCGGTAGCCATTCATGAGTTACTA ...CGTAGCCATCGGTAAGTACTCAATGATAG... Allele B: PMB: ...CGTAGCCATCGGTAGGTACTCAATGATAG... ATCGGTAGCCATCCATGAGTTACTA (Also MMs, but not in the newer chips, so we will not use these!) AA BB AB ** * ** * PMA >> PMB * ** * ** * PMA ¼ PMB ** * ** * PMA << PMB SNP probes can also be used to estimate total copy numbers BB AA ** * ** * ** * PM = PMA + PMB = 2c PM = PMA + PMB = 2c AB ** * AAB ** * ** * PM = PMA + PMB = 2c ** * ** * PM = PMA + PMB = 3c * The Affymetrix assay - takes 4-5 working days to complete 1. Start with target gDNA (genomic DNA) or mRNA. 2. Obtain labeled single-stranded target DNA fragments for hybridization to the probes on the chip. 3. After hybridization, washing, and scanning we get a digital image. 4. Image summarized across pixels to probe-level intensities before we begin. This is our "raw data". Restriction enzymes digest the DNA, which is then amplified and hybridized Target DNA find their way to complementary probes by massive parallel hybridization Scanning Image Analysis Example array: Dimensions: 1600x1600 cells Each cell: 3x3 pixels Dynamic range: 65536 (16-bits) intensity levels Cell summaries: (mean pixel, stddev pixel, #pixels) Preparation + Hybridization + Scanning DAT File(s) [Image, pixel intensities] Image analysis workable raw data CEL File(s) [Probe Cell Intensity] + Low-level analysis Segmentation CDF [Chip Description File] A brief history of Affymetrix SNP & CN arrays How did we get here? Data from 2003 on Chr22 (on of the smaller chromosomes) zoom in 2003: 10,000 loci x1 2004: 100,000 loci x10 2005: 500,000 loci x50 2006: 900,000 loci x90 2007: 1,800,000 loci x180 Genome-Wide Human SNP Array 6.0 - state-of-the-art array • > 906,600 SNPs: – Unbiased selection of 482,000 SNPs: historical SNPs from the SNP Array 5.0 (== 500K) – Selection of additional 424,000 SNPs: • • • • • Tag SNPs SNPs from chromosomes X and Y Mitochondrial SNPs Recent SNPs added to the dbSNP database SNPs in recombination hotspots • > 946,000 copy-number probes: – 202,000 probes targeting 5,677 CNV regions from the Toronto Database of Genomic Variants. Regions resolve into 3,182 distinct, non-overlapping segments; on average 61 probe sets per region – 744,000 probes, evenly spaced along the genome Rapid increase in density Distance between loci: 4x further out… 10K 100K 500K 5.0 6.0 294kb 26kb next? 6.0kb 3.6kb # loci 1.6kb year 2003 2004 2005 2006 2007 Affymetrix & Illumina are competing - we get more bang for the buck (cup) Released 10K 100K 500K 5.0 6.0 July 2003 April 2004 Sept 2005 Feb 2007 May 2007 # SNPs 10,204 116,204 500,568 500,568 934,946 # CNPs - - - 340,742 946,371 10,204 116,204 500,568 841,310 1,878,317 294kb 25.8kb 6.0kb 3.6kb 1.6kb Price / chip set 65 USD 400 USD 300 USD 175 USD 300 USD # loci / cup of espresso ($1.35) 116 loci 215 loci 1236 loci 3561 loci 4638 loci # loci Distance Price source: Affymetrix Pricing Information [http://store.affymetrix.com/] and Berkeley Coffee Shops, Dec 2008. Affymetrix are moving away from MM probes - therefore we don’t utilize them Target DNA: Perfect match (PM): Mis-match (MM): ...CGTAGCCATCGGTAAGTACTCAATGATAG... ||||||||||||||||||||||||| ATCGGTAGCCATTCATGAGTTACTA ATCGGTAGCCATACATGAGTTACTA 25 nucleotides * Target seq. ** PM * X ** MM * * ** Other DNA Other DNA Other seq. * other PMs Low-Level Copy Number Analysis Part 2 – Simple preprocessing Henrik Bengtsson Post doc, Department of Statistics, University of California, Berkeley, USA CEIT Workshop on SNP arrays, Dec 15-17, 2008, San Sebastian Recap: Copy-number probes ...CGTAGCCATCGGTAAGTACTCAATGATAG... ATCGGTAGCCATTCATGAGTTACTA CN locus: PM: CN=1 CN=2 ** * PM = c CN=3 ** * PM = 2c ** * PM = 3c Recap: Adding SNP probes gives total CN signal BB AA ** * ** * ** * PM = PMA + PMB = 2c PM = PMA + PMB = 2c AB ** * AAB ** * ** * PM = PMA + PMB = 2c ** * ** * PM = PMA + PMB = 3c * Notation - here and in our papers Indices: Arrays/samples: i = 1, 2, …, I Loci/SNPs/CN units: j = 1, 2, …, J Replicated probes for SNP: k = 1, 2, …, K Probe signals: CN locus: yij = PMij (single-probe units) SNP allele pair k: (yijkA,yijkB) = (PMijkA,PMijkB) Summarized signals (“chip effects”): CN locus: ij SNP: (ijA,ijB) A simple way to obtain CN estimates • Calculate non-polymorphic SNP summaries: – For each array i=1,…,I and SNP j=1,…,J: • Probe allele pairs: (PMijkA,PMijkB); k=1,…,K • For both alleles, average across probes: ijA = mediank {PMijkA}, ijB = mediank {PMijkB} • Sum both alleles: ij = ijA + ijB • Calculate reference Rj across all arrays: – For each SNP j=1,…,J: • Rj = mediani {ij} • Calculate CN log-ratios: – For each array i=1,…,I and SNP j=1,…,J: • Mij = log2 (ij / Rj) The software tools make this easy for you - using aroma.affymetrix package cs <- AffymetrixCelSet$byName(“GSE8605”, chipType=“Mapping10K_Xba142”); plm <- AvgCnPlm(cs, combineAlleles=TRUE); fit(plm); ces <- getChipEffectSet(plm); theta <- extractTheta(ces); thetaR <- rowMedians(theta); M <- log2(theta / thetaR); Copy number regions are found by lining up estimates along the chromosome Example: Log-ratios for one sample on Chromosome 22. Even without a segmentation algorithm, we can easily spot a deletion here. If we don’t add up the alleles, we get allele-specific estimates from which we can get genotypes Example: (ijA,ijB) for one SNP across all samples BB AB AA There are a lot of artifacts in microarray data - can we do better? Systematic variation can be added due to: • Spatial artifacts • Intensity dependent effects • Probe-sequence dependent effects • GC-content effects • PCR effects • Lab & people effects • Non-calibrated scanners • …? Spatial artifacts (“extreme”) http://plmimagegallery.bmbolstad.com/ Intensity dependent artifacts/variation Lab and people effects/variation PCR fragment length effects/variation 32.5Mb deletion on chr 11 Before = 0.246 After = 0.225 “Wave” patterns along genome Nucleotide-position effect () Probe-sequence effects/variation - probes respond differently Position (t) Low-Level Copy Number Analysis Part 3 – aroma.affymetrix Henrik Bengtsson Post doc, Department of Statistics, University of California, Berkeley, USA CEIT Workshop on SNP arrays, Dec 15-17, 2008, San Sebastian aroma.affymetrix processes unlimited number of arrays • Processes unlimited number of arrays: – Bounded memory algorithms. – Works toward file system. – Persistent memory: robust & picks up where last stopped. • Memory requirements: 1.0-2.0GB RAM. – Example: RMA on 4500 HG-U133A arrays uses ~500MB of RAM. – Example: CRMA on 300 SNP6.0 arrays uses ~1.5GB of RAM. – Example: FIRMA on 200 HuEx-1.0 arrays uses ~1.5GB of RAM. • Cross platform: Linux/Unix, Windows, OSX. • Supports most Affymetrix chip types: – All chip types with a CDF (and some more). – Custom CDFs. aroma.affymetrix "implements" several existing methods • Calibration and normalization: – – – Background correction methods: RMA, gcRMA, ... Allelic cross-talk calibration, quantile normalization, spatial normalization, probe-sequence normalization, … PCR fragment-length normalization, GC-content normalization. • Probe-level summarization: – multiplicative (dChip), affine, and log-additive (RMA) models. Easy to add new. • Quality assessment: – – RLE (Relative Log Expressions), NUSE (Normalized Unscaled Standard Error) Spatial plots: probe signals, PLM residuals, chip effects, CDF annotations, ... • Paired & non-paired copy-number analysis: – – – – – All SNP & CN platforms. Multiple chip types. CRMA (our methods for estimating raw CNs). Allele-specific and/or total CN estimates Genotyping via CRLMM Segmentation method: CBS & GLAD. Easy to add more. • Miscellaneous: – – – – Alternative splicing (exon arrays): Finding Isoforms using RMA (FIRMA) Tiling-array analysis: MAT processing Resequencing arrays Gene expression arrays (of course) aroma.affymetrix is an open-source solution • Community: – 250+ users worldwide (approx 10 installation per day) – Active mailing list (Google Groups; 150+ message per month) – Collaborative documentation and vignettes (Google Groups). • Development: – – – – – – ~3 years since start: Jan 2006-Oct 2006: Phase I: Identifying API (1-3 person project). Oct 2006-Feb 2007: Phase II: Maturing API and testing (10-15 users & developers). Feb 2007-Aug 2007: Phase III: Extended real-world testing (30-50 users & developers). Aug 2007-Fall 2008: Phase IV: Public release and heaps of CPU mileage (more "wild" use cases). Fall 2008-… Phase V: Third party extensions are coming in. More chip types supported. – 3-5 active developers. One main maintainer / code coordinator. – Some external code snippet contributors. – Lots of validation code - catches existing and future bugs. – 1,000+ pages code / Rd pages. • Standards: – Standard file formats, e.g. reads/writes CEL (via affxparser/Fusion SDK). – Imports and exports to: APT, GTC, CNAG, CNAT, dChip, Bioconductor etc. – Strict directory structures and relative pathnames => portable scripts, robustness, more automated validations, easier to troubleshoot, simplified support. – Utilizes existing packages, e.g. preprocessCore, gcRMA, DNAcopy... Walk-through example Complete aroma.affymetrix script for copy-number analysis of 270 SNP6.0 samples cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6") csR <- AffymetrixCelSet$byName("HapMap270", cdf=cdf) acc <- AllelicCrosstalkCalibration(csR) csC <- process(acc) bpn <- BasePositionNormalization(csC) csN <- process(bpn) plm <- AvgCnPlm(csN, combineAlleles=TRUE) fit(plm) ces <- getChipEffectSet(plm) fln <- FragmentLengthNormalization(ces) cesN <- process(fln) seg <- CbsModel(cesN) ce <- ChromosomeExplorer(seg) process(ce) Offline & online dynamic HTML reports Example: ChromosomeExplorer Setup is as simple as placing the files in a strict & standardized directory structure annotationData/ chipTypes/ GenomeWideSNP_6/ GenomeWideSNP_6.CDF GenomeWideSNP_6.UGP GenomeWideSNP_6.UFL rawData/ HapMap270,CEU/ GenomeWideSNP_6/ *.CEL No (absolute) pathnames are used - maximizes portability annotationData/ chipTypes/ GenomeWideSNP_6/ GenomeWideSNP_6.CDF GenomeWideSNP_6.UGP ... cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6") print(cdf) AffymetrixCdfFile: Path: annotationData/chipTypes/GenomeWideSNP_6 Filename: GenomeWideSNP_6.cdf Filesize: 470.44MB File format: v4 (binary; XDA) Chip type: GenomeWideSNP_6 Dimension: 2572x2680 Number of cells: 6892960 Number of units: 1881415 ... The file system is the memory - data is loaded only when needed cdf <- AffymetrixCdfFile$byChipType("GenomeWideSNP_6") csR <- AffymetrixCelSet$byName("HapMap270", cdf=cdf) AffymetrixCelSet: Name: HapMap270 Tags: CEU Path: rawData/HapMap270,CEU/GenomeWideSNP_6 Chip type: GenomeWideSNP_6 Number of arrays: 270 Names: NA06985, NA06991, ..., NA07019 Total file size: 17.7GB RAM: 0.01MB Normalized data is stored as CEL files - import to any software acc <- AllelicCrosstalkCalibration(csR) csC <- process(acc) print(csC) AffymetrixCelSet: Name: HapMap270 Tags: CEU,ACC,ra,-XY Path: probeData/HapMap270,CEU,ACC,ra,-XY/GenomeWideSNP_6 Chip type: GenomeWideSNP_6 Number of arrays: 270 Names: NA06985, NA06991, ..., NA07019 Total file size: 17.7GB RAM: 0.01MB files <- getPathnames(csC) print(files[1]) [1] "probeData/HapMap270,CEU,ACC,ra,-XY/ GenomeWideSNP_6/NA06985.CEL" Data sets (directories) are marked with unique tags qn <- QuantileNormalization(csC) csN <- process(qn) print(csN) AffymetrixCelSet: Name: HapMap270 Tags: CEU,ACC,ra,-XY,ACC,QN Path: probeData/HapMap270,CEU,ACC,ra,-XY,QN/GenomeWideSNP_6 Chip type: GenomeWideSNP_6 Number of arrays: 270 Names: NA06985, NA06991, ..., NA07019 Total file size: 17.7GB RAM: 0.01MB