Microarray bias and data integration for breast cancer meta-analysis
Andy Sims
ABC (Applied Bioinformatics of Cancer) Research Group

Gene Expression Data Integration
• Ability to measure the levels of thousands of (‘all’) transcripts simultaneously
• Holistic approaches to complex problems – ‘data driven’
• Huge amount of publicly available data – a great resource, underused…
  Thousands of transcripts on each array, many datasets, hundreds+ of tumours – huge cost!
• Meta-analysis could identify findings that are not study-specific, leading to consensus and providing guidelines
• Problems of bias and challenges for data integration

…Even with same platform and same tissue
Batch sample processing (G2 assay)
• Batch A – 8 GeneChips
• Batch B – 9 GeneChips
• Batch C – 10 GeneChips
[Figure: GeneChips A1-A8, B1-B9 and C1-C10 shown uncorrected and corrected; group labels: Batch A, Batch B, Batch C; annotation: Arrghh..!]
\[ \text{Corrected} = \text{Uncorrected} \times \frac{\text{overall mean}}{\text{batch mean}} \]
Roberts et al. (1999) Am. J. Hum. Genet. 65: 784-94
Sims et al. (2007) International Journal of Radiation Biology 13: 617-628

Amplification protocols introduce systematic but reproducible errors into gene expression studies
Wilson et al. (2004). Biotechniques 36:498-506
[Figure: comparisons labelled Between (amplified MCF7 vs. amplified MCF10A; unamplified MCF7 vs. unamplified MCF10A) and Across (amplified MCF7 vs. unamplified MCF10A; unamplified MCF7 vs. amplified MCF10A)]
• Fold changes generated using different protocols are comparable(ish!)
• However, raw intensity levels are NOT comparable
• Different protocols introduce a systematic, reproducible component, and comparisons between different protocols may be dangerous

Mean correction & Amplification
[Figure: raw expression for amplified MCF10A replicates and unamplified MCF10A replicates; Across comparisons (amplified MCF7 vs. unamplified MCF10A; unamplified MCF7 vs. amplified MCF10A); panel titles: Effect of correction, CORRECTED]
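The mean correction for unlogged data (Corrected = Uncorrected x overall mean / batch mean) can be sketched for a single probeset as follows. This is a minimal illustration, not the group's actual pipeline: the intensities and batch labels are made up, the function name `mean_correct` is hypothetical, and the overall mean is assumed to be the mean over all chips.

```python
from statistics import mean


def mean_correct(values, batches):
    """Scale unlogged intensities for one probeset by overall mean / batch mean."""
    # Assumption: "overall mean" is the mean over every chip in every batch.
    overall = mean(values)
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v * overall / batch_means[b] for v, b in zip(values, batches)]


# Hypothetical unlogged intensities for one probeset on six chips, two per batch.
intensities = [100.0, 120.0, 200.0, 240.0, 50.0, 60.0]
batch_labels = ["A", "A", "B", "B", "C", "C"]

corrected = mean_correct(intensities, batch_labels)

for batch in sorted(set(batch_labels)):
    batch_values = [c for c, b in zip(corrected, batch_labels) if b == batch]
    print(batch, round(mean(batch_values), 3))
```

After the correction every batch has the same mean for this probeset, while the ratios between chips within a batch are unchanged.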
[Figure: fold changes, Between: unamplified MCF7 vs. unamplified MCF10A]
\[ \text{Corrected} = \text{Uncorrected} \times \frac{\text{overall mean}}{\text{batch mean}} \]

Bias is introduced at many stages of microarray processing…
• Generation of GeneChip (Exon v 3’)
• Amplification method (NuGen v MA)
• Scanners (GeneArray 2500 / 3000)
• Amount of RNA (Amp v unamp)
• Protocol (original v new)
…but it can be removed

Why it works… the maths bit!
For logged data:
\[ \text{Measured Intensity} = \text{‘Real’ Intensity} + \text{Systematic bias} \]
If the systematic bias is specific to a group of chips:
\[ \text{Mean Measured Intensity} = \text{Mean ‘Real’ Intensity} + \text{Systematic bias} \]
Combining the equations and rearranging for ‘real’ intensity cancels out the systematic bias:
\[ \text{RELATIVE ‘Real’ Intensity} = (\text{Measured Intensity} - \text{Systematic bias}) - (\text{Mean Measured Intensity} - \text{Systematic bias}) = \text{Measured Intensity} - \text{Mean Measured Intensity} \]
To get absolute ‘real’ intensity values that compare with the measured values, add the overall mean:
\[ \text{‘Real’ Intensity} = \text{Measured Intensity} - \text{Mean Measured Intensity} + \text{Overall Mean Measured Intensity} \]
For unlogged data:
\[ \text{Corrected} = \text{Uncorrected} \times \frac{\text{overall mean}}{\text{batch mean}} \]

Integrating published datasets
Richardson et al. 2006, Cancer Cell 9:121-132
• U133 plus2, standard labelling
• 18 ‘basal-like carcinomas’
• 20 ‘non-basal-like carcinomas’
• 2 BRCA carcinomas
Farmer et al. 2005, Oncogene 24:4660-4671
• U133A, amplified labelling
• 16 basal tumours
• 27 luminal tumours
• 6 ‘molecular apocrine tumours’
Identified 640 Affymetrix probesets representing the 534-gene ‘intrinsic geneset’ (Luminal A, Luminal B, ERBB2+, Basal, Normal-like)
Sorlie et al. PNAS (2003) 100:8418-23

Integrating published datasets
[Figure: heatmap panels labelled UC and MC, with columns for Richardson et al. and Farmer et al.; probeset labels: ERBB2 (216836_s_at), GRB7 (210761_s_at), ERBB2 (210930_s_at), GATA3 (209604_s_at), GATA3 (209602_s_at), GATA3 (209603_s_at), FBP1 (209696_at), ESR1 (205225_at), NAT1 (214440_at), KRT17 (212236_at), KRT17 (205157_at), KRT5 (201820_at)]
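The logged-data correction (measured intensity minus dataset mean, plus overall mean) can be sketched for two datasets that share only some probesets. This is an illustration under stated assumptions, not the method used for the published analysis: the log2 values are invented, `plus2_only_probe` is a hypothetical placeholder for a probeset missing from the smaller array, the helper names are made up, and the correction is assumed to be applied separately to each probeset.

```python
from statistics import mean


def mean_center(dataset, overall_means):
    """Per probeset: measured - dataset mean + overall mean (log scale)."""
    centered = {}
    for probe, values in dataset.items():
        if probe not in overall_means:
            continue  # probeset not present on every platform, so drop it
        dataset_mean = mean(values)
        centered[probe] = [v - dataset_mean + overall_means[probe] for v in values]
    return centered


def combine(datasets):
    """Mean-center each dataset on the probesets common to all of them."""
    shared = set.intersection(*(set(d) for d in datasets.values()))
    overall = {
        p: mean(v for d in datasets.values() for v in d[p]) for p in shared
    }
    return {name: mean_center(d, overall) for name, d in datasets.items()}


# Hypothetical log2 intensities; dataset_b carries a constant bias of +2.
dataset_a = {
    "205225_at": [8.0, 10.0, 12.0],
    "201820_at": [11.0, 7.0, 6.0],
    "plus2_only_probe": [5.0, 5.0, 5.0],
}
dataset_b = {
    "205225_at": [10.0, 12.0, 14.0],
    "201820_at": [13.0, 9.0, 8.0],
}

combined = combine({"a": dataset_a, "b": dataset_b})
print(combined["a"]["205225_at"])
print(combined["b"]["205225_at"])
```

Because the bias in this toy example is constant within a dataset, both datasets end up with identical corrected values, which is the cancellation described in the maths. Real datasets also differ in composition, so the dataset mean removes some biology along with the bias.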
[Figure: heatmap panel with columns for Richardson et al. and Farmer et al.; probeset labels: ERBB2 (210930_s_at), ERBB2 (216836_s_at), GRB7 (210761_s_at), XBP1 (200670_at), GATA3 (209603_s_at), GATA3 (209604_s_at), GATA3 (209602_s_at), FBP1 (209696_at), ESR1 (205225_at), KRT5 (201820_at), KRT17 (212236_at), KRT17 (205157_at)]

Combining 6 published datasets
[Figure: mean-centered and uncorrected panels; subtypes: Basal, ERBB2, Luminal A, Luminal B, Normal-like; datasets: Chin et al., Desmedt et al., Ivshina et al., Pawitan et al., Sotiriou et al., Wang et al.]

Does combining datasets improve prognostic prediction?
• Use single datasets or combinations of datasets as ‘training sets’ to generate profiles based upon patient follow-up data
• Evaluate the predictive performance of each profile using other datasets as ‘test sets’
• Supervised principal components analysis with time to recurrence or distant metastasis
  Bair and Tibshirani PLoS Biol (2004) 2:E108
[Figure: panels for 1 training dataset (159 tumours) and 5 training datasets (821 tumours), evaluated on 1 test dataset (286 tumours); p-values shown: p = 0.045, p = 0.0063]

More data = ‘improved’ prediction
[Figure: R2 statistic (Cox proportional hazards model) plotted against number of datasets and against number of tumours, for mean-centered and uncorrected data; p-values shown: p=0.02, p=0.0006, p=0.02, p=0.002, p=0.002]

Uneven dataset composition
[Table: SAM common /1000; labels: Between, Across, UC, MC, wMC, Basal, Luminal, Farmer, Richardson, ER+]
1:1 ratio (12,12v12,12): 394, 368, 708, -
2:1 ratio (14,15v19,7): 380, 347, 725, 608
5:1 ratio (3,15v19,3): 283, 290, 480*, 684*

Summary
• Systematic bias is introduced at many stages of microarray experiments
• It must be removed in order to combine datasets at the raw expression level
• Combining datasets CAN improve prognosis prediction BUT is better
when systematic bias is reduced
• Prognostic performance is best when test sets reflect training sets
• Weighting the correction based upon dataset composition may improve findings
  Cannot compensate for all differences (many unknown)
• A resource to look at co-expression and subtle interactions (cheaply!)
• More representative/accurate of the population – identify/compare rare subtypes
• Consistent patient meta-data required
[Figure labels: Chin, Desmedt, Ivshina, Pawitan, Sotiriou, Wang]

Acknowledgements
Rob Clarke, Tony Howell
CR-UK Bioinformatics Group: Graeme Smethurst, Michal Okoniewski, Claire Wilson, Crispin Miller
CR-UK GeneChip Microarray Service: Stuart Pepper, Yvonne Hey, Gill Newton