“alexes of all nations unite!”

Epicenter Analysis in Cancer
Alex Krasnitz, CSHL
Search and knowledge building for biological datasets, UCLA, 11.26-30, 2007

• Input: segmented data from (ROMA) CGH.
• A predictive signal: a whole-genome biomarker for survival.
• Pinning algorithm.
• Pins find cancer genes.
• Pins predict tissue of origin.
• Pins and progression.

(ROMA) CGH in vitro and in silico
A method for measuring relative copy numbers of short fragments in a genome. A multistep process consisting of
• Digestion with a restriction enzyme (BglII)
• PCR → short (0.2-1.2 kb) fragments are selected
• Hybridization to an oligonucleotide (50-mer probes) microarray (85K probe format used in the present study; higher-resolution work in progress)
• Gridding
• Normalization
• Segmentation
• Thresholding
• CNP masking
• Horizontal slicing

[Figure: raw and segmented ROMA profile; FISH validation of copy number variations detected by ROMA.]
Segmentation algorithm (B. Lakshmi, M. Wigler): replace a raw profile by a piecewise-constant function minimizing variance.

CNPs and SNPs are genetic markers
[Figure: cancer-free female vs. cancer-free reference male; heterozygous and homozygous copy number polymorphisms (CNPs); ‘ROMA’ SNPs.]
Typical tumor genomes are NOT normal. Still, they may contain CNPs that must be filtered out.

[Figure: genomic rearrangements in cancer (Bayani et al, Seminars in Cancer Biology 17, 5, 2007).]

CNP masking
Determine positions of frequent CNPs from a set of cancer-free genomes (~500 cases); excise these from cancer profiles in a minimally intrusive fashion.

Event identification: horizontal slicing
• Allow multiple events at a locus in a profile.
• Select vertically non-overlapping segments of maximal total length. These define tiers.
• Assign each remaining segment to the closest tier.

Breast cancer study
• 257 frozen tissue samples of Scandinavian (140 Swedish, 117 Norwegian) origin.
• Accompanied by clinical documentation.

Group                               Total  Node       Median age  Grade     Size (mm)  PR*    ER*    ERBB2†
                                           (pos/neg)  at diag.    I/II/III  <20/>20    (+/-)  (+/-)  amp/norm
Karolinska Inst., Sweden
  Diploid (survival >7 yr)          60     28/31      52          8/11/33   19/41      41/9   43/7   3/57
  Diploid (survival <7 yr)          39     14/25      57          3/12/16   11/25      20/13  24/8   9/30
  Aneuploid                         41     28/13      49          0/2/22    21/20      14/19  25/10  15/26
Oslo Micrometastasis Study (OMS)    123    52/46      63          10/50/41  44/55      43/57  58/44  27/76

* Progesterone (PR) and estrogen (ER) receptors measured by ligand binding; pos: >0.5 fg/mg protein.
† ERBB2 amplification scored by ROMA as a segmented ratio greater than 0.1 above baseline.

A heuristic classification of breast cancer profiles: simplex, sawtooth and firestorm
• Simplex: small number of events overall and per chromosome.
• Sawtooth: multiple events, no clustering.
• Firestorm: multiple clustered events.

Initial observation: firestorms lead to poor survival.
Quantify the presence of firestorms by F, the sum over inverse average lengths of adjacent segments:

F = \sum_i \frac{2}{l_i^L + l_i^R}

where l_i^L and l_i^R are the lengths of the segments to the left and right of position i.

Is F a predictor of survival, and if so, is it independent of clinical parameters?
Fisher’s exact test: strong association with survival, no association with any clinical parameter except age at diagnosis.

Fd value  Clinical parameter   Discriminating principle           p-value (Fisher's test)  Odds ratio
0.08      Survival             Above or below 7 yr                2.8×10^-7                0.073
0.09      Survival             Above or below 7 yr                5.9×10^-7                0.070
0.1       Survival             Above or below 7 yr                8.2×10^-6                0.073
0.09      Grade                2 vs 3                             0.39                     0.58
0.09      Node condition       Negative or positive               1.0                      0.96
0.09      Size                 Smaller or larger than 20 mm       0.40 (0.38 for 29)       0.62 (0.62 for 29)
0.09      ER status            Above or below 0.05 fg/mg prot.    0.73                     0.77
0.09      PR status            Above or below 0.05 fg/mg prot.    0.75                     0.70
0.09      HER2 amplification   Above or below segment threshold   1.0                      0.86
0.09      Age at diagnosis     Above or below 57 years            0.0066                   0.26
0.09      Adjuvant therapy     -/+                                0.44                     0.64
0.09      Radiation therapy    -/+                                1.0                      1.1

[Figure: Kaplan-Meier (KM) plots for the Swedish diploid subset (no significant change when adjusted for age at diagnosis).]

Search for epicenters
• Key assumption: observed amplifications and deletions are more likely than not to confer a selective advantage upon a neoplastic cell.
• If so, expect frequently amplified regions of the genome to be enriched in oncogenes.
• Methods for detecting such regions are required.
• A frequency plot is inadequate.

Potential Benefits
• Massive data reduction (O(10^5) probes to ~100 epicenters); a manageable set of predictors
• Disentanglement
• Target selection for functional studies (cancer gene finding)

Pinning
• Work within the smallest unit of the genome that contains all of its events (a chromosome).
• For a given N, find N positions within that unit that best explain the observed set of (amplification or deletion) events, i.e., N positions that are shared by the highest number k(N) of events.
• Multiple solutions occur, either due to a “fuzzy pin” or due to N being too low.
• Increase N until the increment I(N) = k(N) - k(N-1) falls to a pre-set minimal value. Note that I(N) is a nonincreasing function of N.
• Pinning is convergent: it is guaranteed to recover the epicenters given enough data.

Greedy pinning is not optimal
[Figure: greedy, N=2 (5 out of 6 events pinned); non-greedy, N=2 (6 out of 6 events pinned).]
• Required: exhaustive enumeration of all possible N-pin configurations.
• Pin positions: a fixed grid, or determined by break points in the data.
• In the present data set: up to 5 pins per chromosome, O(100) pin positions.

Test of significance
1. For the optimal N-pin solutions, determine the event score k(N) and the gain I_N = k(N) - k(N-1).
2. Perform multiple whole-genome shuffles of the events, including those of the opposite sign. For each shuffle, find its I_N.
   Estimate a p-value by comparing the shuffled values with the true I_N.

Interpretation of results
Consider only the top-scoring pin configurations. Then, for pin #i in a top-scoring configuration, compute, at coordinate x,

p_i(x) = \sum_{j(x,i)} L_{j(x,i)}^{-1}

(the sum is over the inverse lengths of the events pinned by #i and containing x).

[Figure: example, 17q, 5 pins.]

[Figure: lung cancer deletions: known tumor suppressors and novel elements (213 cases, courtesy S. Powers).]

Estimates of utility
• Goal: select the most promising 10% of the genome on which to focus functional studies.
• Is pinning useful in this sense?
• A test: how enriched is the top-scoring 10% quantile in known genetic elements implicated in breast cancer?
• We hit major known oncogenes, so good results can be expected. More formally, perform a database search (top 10%, 17q).

Estimates of utility

Database                                     Hits in region                                    Hits in top 10%
Atlas of Genetics and Cytogenetics           8 (annotated as amplified and/or overexpressed)   8 (p = 10^-8)
  in Oncology and Haematology
NCBI map viewer                              184 (hits on “breast cancer”)                     64 (p = 2×10^-16), likely overly conservative
NCBI map viewer                              47 (genes implicated in breast cancer)            10 (p = 0.016), likely overly conservative

Gene Enrichment
Epicenters are enriched in (CCDS) genes compared to the genome and to the copy number events because (a) epicenters bracket genes and (b) genes are clustered.

Organ, polarity  # of epis  Gene count  Enr. vs genome  Enr. vs events  Enr. vs gene brackets  p genome  p events
Breast amp       37         167         1.92            1.65            0.87                   0.02      0.054
Breast del       24         251         2.85            1.99            1.03                   0.002     0.008
Lung amp         32         232         3.05            2.54            1.22                   <0.002    0.002
Lung del         37         425         2.30            1.63            1.00                   0.002     0.028
Colon amp        23         231         2.33            1.82            0.98                   0.006     0.038
Colon del        16         292         3.21            2.45            1.17                   0.006     0.01

Application: predicting tissue of origin
Random forest classifier using joint sets of epicenters as predictors.

Organ 1  Organ 2  N1 (training)  N2 (training)  N1 (test)  N2 (test)  Training error 1  Training error 2  Test error 1  Test error 2
breast   lung     129            107            128        106        0.18              0.24              0.18          0.21
breast   colon    129            69             128        68         0.02              0.29              0.02          0.19
lung     colon    107            69             106        68         0.09              0.36              0.08          0.21

Application: early events in breast cancer
Compute frequency weighted by inverse number of events for contiguous groups of epicenters.
Outliers: FISH-validated early 16p-1q translocation.

Summary
• Pinning is a method for finding copy number variation epicenters in (cancer) genomes.
• Applied to: a set of 257 FISH-validated breast cancer genome profiles; lung and colon cancer sets.
• The epicenters found by pinning are significantly enriched in genes.
• Epicenters find tissue of origin.
• Epicenters detect early lesions.

ROMA-based Cancer Biology at CSHL
Mike Wigler, Jim Hicks, Rob Lucito, Scott Powers, David Mu
ROMA: Michael Riggs, Diane Esposito, Joan Alexander, Jen Troge, Evan Leibu
FACS/Database: Linda Rodgers
Bioinformatics: Lakshmi Muthuswamy, Boris Yamrom, AK, Vlad Grubor, Yoon-Ha Lee, Tony Leotta, Jude Kendall, Deepa Pai, Andy Reiner, John Healy
FISH Primer Selection Program & Probes: Nicholas Navin
FISH (Karolinska): Susanne Maner, Par Lundin
Statistics: Xiaoyue Zhao, Chris Yoon
Collaborators:
• Anders Zetterberg, Karolinska Inst.
• Anne-Lise Borressen-Dale, Norway Radium Hosp.
• Kenny Ye, Albert Einstein Sch. Med.
• Thea Tlsty, UCSF
• Larry Norton, MSKCC
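
Supplementary sketch: pinning by exhaustive enumeration
The pinning slides describe choosing N positions shared by the highest number k(N) of events, and state that a greedy search is not optimal. The Python sketch below illustrates that idea on a toy chromosome; it is not the implementation used in the talk. Assumptions made here for illustration: events are closed intervals (start, end); an event counts as pinned if it contains at least one pin; candidate pin positions are the event break points; and the function names count_pinned, best_pins, greedy_pins and choose_n are hypothetical, as is the toy data set.

```python
from itertools import combinations


def count_pinned(events, pins):
    """Number of events (start, end) that contain at least one pin."""
    return sum(any(s <= p <= e for p in pins) for s, e in events)


def best_pins(events, candidates, n):
    """Score every n-pin configuration exhaustively; return (k(n), pins)."""
    best_k, best_cfg = 0, ()
    for cfg in combinations(candidates, n):
        k = count_pinned(events, cfg)
        if k > best_k:
            best_k, best_cfg = k, cfg
    return best_k, best_cfg


def greedy_pins(events, candidates, n):
    """Place pins one at a time, each covering the most not-yet-pinned events."""
    pins, remaining = [], list(events)
    for _ in range(n):
        p = max(candidates, key=lambda c: count_pinned(remaining, [c]))
        pins.append(p)
        remaining = [(s, e) for s, e in remaining if not s <= p <= e]
    return count_pinned(events, pins), tuple(pins)


def choose_n(events, candidates, min_gain, max_n=5):
    """Increase N while the gain I(N) = k(N) - k(N-1) is at least min_gain."""
    k_prev, chosen = 0, (0, ())
    for n in range(1, max_n + 1):
        k, cfg = best_pins(events, candidates, n)
        if k - k_prev < min_gain:
            break
        k_prev, chosen = k, (k, cfg)
    return chosen


# Toy chromosome: six events; position 5 is shared by four of them,
# but no two-pin solution that includes 5 covers all six.
events = [(0, 5), (1, 6), (0, 2), (4, 10), (5, 9), (8, 10)]
candidates = sorted({x for ev in events for x in ev})  # break points

print(greedy_pins(events, candidates, 2)[0])    # 5 of 6 events pinned
print(best_pins(events, candidates, 2)[0])      # 6 of 6 events pinned
print(choose_n(events, candidates, min_gain=2))
```

On this toy input the greedy search commits to the most popular position first and then cannot reach both outlying events, while the exhaustive search finds two pins that together cover all six. Exhaustive enumeration stays tractable at the scale quoted in the slides (up to 5 pins over O(100) candidate positions per chromosome).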