R and Eddie for Breast Cancer Bioinformatics
Duncan Sproul

Microarrays
• Assay Samples on a Genome Wide Scale
• Developed in 1990s
• Widely Applied in Biology

Microarrays Widely Applied in Breast Cancer
Sorlie et al Nature 2000

Microarray Approaches in Breast Cancer
• Unsupervised
 – Molecular subtyping: ‘Intrinsic set’, Molecular subtypes
• Supervised
 – Patient Based: Recurrence/Metastasis/Survival, Poor vs Good
 – Model Based: ER/Grade/Hypoxia/Proliferation, Low vs High
• Prognostic Implications?
Sims (2009). J. Clin Path. (available online)

What Can They Measure?
• Genomic DNA (Gene)
 – VARIATION: SNP arrays
 – COPY NUMBER: CGH arrays
 – METHYLATION: Methylation arrays
• TRANSCRIPTION
 – mRNA: Gene expression (mRNA) arrays
 – miRNA: miRNA arrays
• TRANSLATION
 – Protein sequence: Peptide arrays
• POST-TRANSLATIONAL MODIFICATIONS
 – 3D Protein structure

How We Use Eddie…

Integration of Previous Studies
[Figure: Richardson et al. and Farmer et al. datasets shown side by side for the same probe sets, first in one probe order and then reordered. Probe sets: ERBB2 (216836_s_at, 210930_s_at), GRB7 (210761_s_at), GATA3 (209602_s_at, 209603_s_at, 209604_s_at), FBP1 (209696_at), ESR1 (205225_at), NAT1 (214440_at), SEMA3C (203789_at, first panel only), XBP1 (200670_at), KRT17 (212236_at, 205157_at), KRT5 (201820_at)]
1 Richardson et al. (2006) Cancer Cell 9: 121-32. 40 tumours, U133 plus2, standard labelling, 18 ‘basal-like’, 20 ‘non-basal-like’, 2 BRCA
2 Farmer et al. (2005) Oncogene 24(29): 4660-71. 49 tumours, U133A, amplified RNA, 27 luminal, 16 basal, 6 ‘molecular apocrine’
Sims et al. (2008) BMC Medical Genomics 1:42

Data Repositories
Microarray Hybridisations by Year in Array Express
[Figure: Number of Hybridisations (0 to 300000) by Year (2004 to 2009)]
Statistics taken from Array Express for January of Each Year

Semi-Automated Pre-Processing of Studies from Repositories

Running Analyses with Varying Parameters Using Array Jobs
[Figure: Number of Loci by Distance Parameter. y-axis: Number of Loci (Relative to Distance = 0), 0% to 100%; x-axis: Distance Parameter, 0 to 1000; series: Mouse, Human Interphase, Human Mitotic]

Analysis of WNT Signalling in Breast Cancer
[Workflow diagram; boxes as extracted, order not recoverable]
• WNT Gene Sets
• 2 Datasets
• Process Multiple Breast Cancer Datasets
• Groups of Functionally Related Genes
• Remaining Datasets
• Determine Association with Clinical Variables

Detecting Regions of Co-Regulation in Breast Cancer
• Process Breast Cancer Datasets
• Decide on Parameters
 – [Figure: Number of Loci by Distance Parameter, same axes and series as in the array jobs slide]
• Find Regions of Interest
• Significance Testing by Permutation
• Regions for Further Analysis

Mapping of Short Sequence Tags to Transcriptional Start Sites
• Short Sequence Tags Mapped to Genome
• Gene Locations
• Enrichment by Gene Group
 – [Figure: Number of Tags (0 to 900) by Distance from TSS (-2000 to 2000); series: Low, Medium, High]
• Parallelization of Mapping Reduces Estimated Time from ~11hrs to ~2hrs

The Future
• More Parallel Jobs
• Bigger Jobs and Parallel Processing
 – Eg affyPara Bioconductor Package
• More Mass Sequencing
 – More Data!

Thanks!
Andy Sims; Arran Turnbull; Liang Liang; Colette Meyer; Robert Kitchen; Breakthrough Unit; Elad Katz; Sylvie Dubois-Marshall; Charlene Kay; Bartlett Lab; Nick Gilbert; Bernie Ramsahoye; Catherine Naughton; Jayne Culley; Ben Skerry; Jacqueline Dickson; Melanie Spears; Karen Taylor; Carrie Cunningham; Meehan Lab; Colm Nestor; Donncha Dunican; Bickmore Lab; Sehrish Rafique; Lee Murphy; Angie Fawkes; Louise Evenden; WTCRF; Bauke Ylstra, VU Medisch Centrum, Amsterdam; Dimitra Dafou; Kate Lawrenson; UCL EGA Institute for Women’s Health
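
Supplementary sketch: Significance Testing by Permutation

The co-regulation slide lists "Significance Testing by Permutation" without showing how it was done. The following is a minimal sketch of one generic way such a test could work, not the method used in the talk. It is written in Python purely for illustration (the talk's analyses used R), and the function name, the per-gene `scores` input, and the choice of the region mean as test statistic are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(scores, region, n_permutations=1000, seed=0):
    """Empirical p-value for the mean score of a contiguous region.

    scores: hypothetical per-gene scores in chromosomal order.
    region: (start, end) index pair, end exclusive.
    """
    rng = random.Random(seed)
    start, end = region
    observed = mean(scores[start:end])

    shuffled = list(scores)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling breaks any link between a score and its position.
        rng.shuffle(shuffled)
        if mean(shuffled[start:end]) >= observed:
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy data: five high-scoring genes clustered at the end.
    toy_scores = [0.0] * 95 + [1.0] * 5
    print(permutation_p_value(toy_scores, (95, 100)))
```

Each permutation is independent of the others, so batches of permutations could in principle be split across separate cluster jobs and their hit counts summed afterwards.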