Epicenter Analysis in Cancer

“alexes of all nations unite!”
Alex Krasnitz, CSHL
Search and knowledge building for biological datasets, UCLA,
November 26-30, 2007
•Input: segmented data from (ROMA) CGH.
•A predictive signal: a whole-genome biomarker for survival.
•Pinning algorithm.
•Pins find cancer genes.
•Pins predict tissue of origin.
•Pins and progression.
(ROMA) CGH in vitro and in silico:
A method for measuring relative copy numbers of
short fragments in a genome.
A multistep process consisting of
• Digestion with a restriction enzyme (BglII)
• PCR → short (0.2-1.2kb) fragments are selected
• Hybridization to an oligonucleotide (50mer probes) microarray (85K
probe format used in present study, higher resolution work in
progress)
• Gridding
• Normalization
• Segmentation
• Thresholding
• CNP masking
• Horizontal slicing
Raw and segmented ROMA profile;
FISH validation of copy number variations detected by ROMA.
Segmentation algorithm (B. Lakshmi, M. Wigler): replace a raw profile by a
piecewise-constant function minimizing variance.
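The idea of a piecewise-constant fit that minimizes variance can be illustrated with a small dynamic program. This is only a sketch and not the Lakshmi-Wigler algorithm, whose details are not given here; it assumes the number of segments is fixed in advance, and the names `segment` and `n_segments` are hypothetical.

```python
def segment(values, n_segments):
    """Split `values` into `n_segments` contiguous pieces so that the total
    within-piece sum of squared deviations is minimal (assumed criterion).
    Returns a list of (start, end, mean) with `end` exclusive."""
    n = len(values)
    # Prefix sums make the cost of any piece available in O(1).
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i

    # Walk the back-pointers from the last piece to the first.
    pieces, j = [], n
    for k in range(n_segments, 0, -1):
        i = back[k][j]
        pieces.append((i, j, (s1[j] - s1[i]) / (j - i)))
        j = i
    return pieces[::-1]
```

In practice the number of segments would itself have to be chosen, for example by a penalty on each additional break point; that choice is left out of this sketch.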
Cancer-free Female x Cancer-free Reference Male
CNPs and SNPs are genetic markers
Heterozygous CNPs (Copy Number Polymorphisms)
Homozygous ‘ROMA’ SNPs
Typical tumor genomes are NOT normal. Still, they
may contain CNPs that must be filtered out.
Genomic rearrangements in cancer
(Bayani et al, Seminars in Cancer Biology 17, 5, 2007)
CNP masking: determine positions of frequent CNPs from a set of
cancer-free genomes (~500 cases); excise these from cancer profiles
in a minimally intrusive fashion.
Event identification: horizontal slicing
• Allow multiple events at a locus in a profile.
• Select vertically non-overlapping segments of maximal total length. These
define tiers.
• Assign remaining segments each to the closest tier.
Breast cancer study
• 257 frozen tissue samples of Scandinavian (140 Swedish, 117
Norwegian) origin.
• Accompanied by clinical documentation.
Cohort / subset                    Total  Node       Median age  Grade     Size (mm)  PR*    ER*    ERBB2+
                                          (pos/neg)  at diag.    I/II/III  <20/>20    (+/-)  (+/-)  amp/norm
Karolinska Inst., Sweden
  Diploid (survival >7 yr)         60     28/31      52          8/11/33   19/41      41/9   43/7   3/57
  Diploid (survival <7 yr)         39     14/25      57          3/12/16   11/25      20/13  24/8   9/30
  Aneuploid                        41     28/13      49          0/2/22    21/20      14/19  25/10  15/26
Oslo Micrometastasis Study (OMS)   123    52/46      63          10/50/41  44/55      43/57  58/44  27/76
*progesterone (PR) and estrogen (ER) receptors measured by ligand binding; pos=>0.5fg/mg protein
+ ERBB2 amplification scored by ROMA as segmented ratio greater than 0.1 above baseline.
A heuristic classification of breast cancer profiles: simplex,
sawtooth and firestorm
• Simplex: small number of events overall and per chromosome
• Sawtooth: multiple events, no clustering
• Firestorm: multiple clustered events
Initial observation: firestorms lead to poor survival.
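Quantify the presence of firestorms by F, the sum over inverse average
lengths of adjacent segments:
$F = \sum_i \frac{2}{l_i^L + l_i^R}$
where $l_i^L$ and $l_i^R$ are the lengths of the left and right segments in the i-th pair of adjacent segments.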
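As a minimal sketch of how such an index behaves, the function below sums the inverse average length over each pair of neighbouring segments. The function name, the choice to pass one ordered list of segment lengths, and the length units are assumptions made for illustration.

```python
def firestorm_index(segment_lengths):
    """Sum of 2 / (left + right) over every pair of adjacent segments.

    `segment_lengths` is an ordered list of segment lengths along one
    stretch of genome (units are an assumption, e.g. Mb).
    """
    pairs = zip(segment_lengths, segment_lengths[1:])
    return sum(2.0 / (left + right) for left, right in pairs)


# Evenly spread break points versus a tight cluster of short segments.
spread = firestorm_index([20, 20, 20, 20, 20])
clustered = firestorm_index([50, 1, 1, 1, 47])
```

Short neighbouring segments dominate the sum, so a cluster of narrow events raises the index far more than the same number of break points spread evenly.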
Is F a predictor of survival, and if so, is it independent of clinical
parameters?
Fisher’s exact test: strong association with survival, no
association with any clinical parameter except age at diagnosis.
Fd value  Clinical parameter   Discriminating principle          p-value from Fisher's test  Odds ratio
0.08      Survival             Above or below 7 yr               2.8×10^-7                   0.073
0.09      Survival             Above or below 7 yr               5.9×10^-7                   0.070
0.1       Survival             Above or below 7 yr               8.2×10^-6                   0.073
0.09      Grade                2 vs 3                            0.39                        0.58
0.09      Node condition       Negative or positive              1.0                         0.96
0.09      Size                 Smaller or larger than 20mm       0.40 (0.38 for 29)          0.62 (0.62 for 29)
0.09      ER status            Above or below 0.05 fg/mg prot.   0.73                        0.77
0.09      PR status            Above or below 0.05 fg/mg prot.   0.75                        0.70
0.09      HER2 amplification   Above or below segment threshold  1.0                         0.86
0.09      Age at diagnosis     Above or below 57 years           0.0066                      0.26
0.09      Adjuvant therapy     -/+                               0.44                        0.64
0.09      Radiation therapy    -/+                               1.0                         1.1
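For readers who want to see what each row involves, here is a sketch of a two-sided Fisher's exact test on a 2×2 table, for example F above or below the threshold against survival above or below 7 years. It is written in pure Python; the counts in the example are hypothetical and are not taken from the study.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, p_value


# Hypothetical counts: rows = high F / low F, columns = short / long survival.
odds, p = fisher_exact_2x2(8, 2, 1, 5)
```

An odds ratio far from 1 together with a small p-value, as in the survival rows, indicates association; values near 1, as in most clinical-parameter rows, indicate none.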
KM plots for the Swedish diploid subset
(no significant change when adjusted for age at diagnosis)
Search for epicenters
• Key assumption: observed amplifications and
deletions are more likely than not to confer a
selective advantage upon a neoplastic cell.
• If so, expect frequently amplified regions of the
genome to be enriched in oncogenes.
• Require methods for detecting such regions.
• Frequency plot inadequate.
Potential Benefits
• Massive data reduction (O(10^5) probes to
~100 epicenters); a manageable set of
predictors
• Disentanglement
• Target selection for functional studies
(cancer gene finding)
Pinning
• Consider a smallest unit of the genome containing all its
events (a chromosome).
• For a given N, find N positions within that unit that best
explain the observed set of (amplification or deletion)
events, i.e., N positions that are shared by the highest
number k(N) of events.
• Multiple solutions occur, either due to a “fuzzy pin” or
due to N being too low.
• Increment N until the increment I(N)=k(N)-k(N-1) reaches
a pre-set minimal value. Note that I(N) is a nonincreasing function of N.
• Pinning is convergent: it is guaranteed to recover the
epicenters given enough data.
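The procedure in the bullets above can be sketched as follows. The sketch assumes that events are closed intervals (start, end) on one chromosome, that an event counts as explained when it contains at least one pin, and that the search stops as soon as the gain I(N) drops below a chosen `min_gain`; all function and parameter names are hypothetical.

```python
from itertools import combinations


def pinned_count(events, pins):
    """Number of events that contain at least one of the pins."""
    return sum(any(s <= p <= e for p in pins) for s, e in events)


def best_pins(events, candidates, n):
    """Exhaustive search for the n pin positions explaining most events."""
    best = max(combinations(candidates, n),
               key=lambda pins: pinned_count(events, pins))
    return best, pinned_count(events, best)


def choose_pins(events, candidates, min_gain):
    """Increase N while the gain I(N) = k(N) - k(N-1) is at least min_gain."""
    chosen, k_prev = (), 0
    for n in range(1, len(candidates) + 1):
        pins, k = best_pins(events, candidates, n)
        if k - k_prev < min_gain:
            break
        chosen, k_prev = pins, k
    return chosen, k_prev


events = [(0, 4), (2, 6), (3, 5), (10, 14), (12, 16), (20, 21)]
candidates = sorted({x for ev in events for x in ev})  # break points
pins, k = choose_pins(events, candidates, min_gain=2)
```

Here the third pin would explain only one extra event, so with `min_gain=2` the search stops at two pins.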
Greedy pinning is not optimal
Greedy, N=2 (5 out of 6)
Non-greedy, N=2 (6 out of 6)
•Required: exhaustive enumeration of all possible N-pin configurations.
•Pin positions: a fixed grid or determined by break points in the data.
•In present data set: up to 5 pins per chromosome, O(100) pin positions.
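A toy case reproducing the "5 out of 6" versus "6 out of 6" contrast may help. The six intervals below are invented for illustration, and the helper names are hypothetical; greedy selection places the first pin where most events overlap and then cannot recover.

```python
from itertools import combinations


def covered(events, pins):
    """Number of events containing at least one pin."""
    return sum(any(s <= p <= e for p in pins) for s, e in events)


def greedy_pins(events, candidates, n):
    """Place pins one at a time, each explaining most remaining events."""
    pins, remaining = [], list(events)
    for _ in range(n):
        best = max(candidates,
                   key=lambda p: sum(s <= p <= e for s, e in remaining))
        pins.append(best)
        remaining = [(s, e) for s, e in remaining if not s <= best <= e]
    return tuple(pins)


def exhaustive_pins(events, candidates, n):
    """Try every n-pin configuration and keep the best one."""
    return max(combinations(candidates, n),
               key=lambda pins: covered(events, pins))


# Two groups of three events; four of the six overlap at position 5.
events = [(1, 5), (1, 5), (1, 2), (5, 9), (5, 9), (8, 9)]
candidates = sorted({x for ev in events for x in ev})

greedy = greedy_pins(events, candidates, 2)
optimal = exhaustive_pins(events, candidates, 2)
```

With up to 5 pins and O(100) candidate positions per chromosome, exhaustive enumeration stays at roughly 10^8 configurations or fewer, which is why it remains feasible here.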
Test of significance
1. For the optimal N-pin solutions determine the event
score k(N) and the gain I(N)=k(N)-k(N-1).
2. Perform multiple whole-genome shuffles of the events,
including those of the opposite sign. For each shuffle find
its I(N). Estimate a p-value by comparing the shuffled gains to the true I(N).
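A shuffle test of this kind might look like the sketch below. Several details are assumptions, not taken from the talk: each event is moved to a uniformly random position while keeping its length, the genome is treated as one linear coordinate range, events of the opposite sign are left out, and the p-value uses an add-one correction. The statistic is passed in as a function; the example uses k(1), the largest number of events sharing one position, as a stand-in for the gain.

```python
import random


def max_depth(events):
    """k(1): the largest number of events sharing a single position."""
    return max(sum(s <= start <= e for s, e in events) for start, _ in events)


def shuffle_events(events, genome_length, rng):
    """Relocate every event at random, preserving its length."""
    shuffled = []
    for s, e in events:
        length = e - s
        new_start = rng.randint(0, genome_length - length)
        shuffled.append((new_start, new_start + length))
    return shuffled


def permutation_p_value(events, statistic, genome_length,
                        n_shuffles=200, seed=0):
    """Fraction of shuffles whose statistic reaches the observed value."""
    rng = random.Random(seed)
    observed = statistic(events)
    hits = sum(
        statistic(shuffle_events(events, genome_length, rng)) >= observed
        for _ in range(n_shuffles)
    )
    return (hits + 1) / (n_shuffles + 1)


# Ten short events piled up around one locus in a 100 kb toy genome.
events = [(495 + i, 505 + i) for i in range(10)]
p = permutation_p_value(events, max_depth, genome_length=100_000)
```

With the add-one correction the smallest attainable p-value is 1/(n_shuffles + 1), so the number of shuffles sets the resolution of the test.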
Interpretation of results: consider only the top-scoring pin configurations. Then,
for pin #i in a top-scoring configuration, compute, at coordinate x,
$p_i(x) = \sum_{j(x,i)} \frac{1}{L_{j(x,i)}}$
(the sum is over the inverse lengths of the events pinned by #i and containing x).
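The weight above is simple to evaluate once the events pinned by a given pin are known. A minimal sketch, assuming those events are supplied as (start, end) intervals with positive length; the function name is hypothetical.

```python
def pin_profile(pinned_events, x):
    """Sum of inverse lengths of the pinned events that contain x."""
    return sum(1.0 / (end - start)
               for start, end in pinned_events
               if start <= x <= end)


# Two events attributed to the same pin: a short one and a longer one.
pinned = [(0, 10), (5, 25)]
inside_both = pin_profile(pinned, 7)
inside_short_only = pin_profile(pinned, 2)
outside = pin_profile(pinned, 30)
```

Because short events carry more weight, the profile peaks where narrow events overlap, which sharpens the localization of the epicenter.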
Example: 17q, 5 pins
Lung cancer deletions: known tumor suppressors and
novel elements (213 cases, courtesy S. Powers)
Estimates of utility
• Goal: select the most promising 10% of the genome to
focus functional studies on.
• Is pinning useful in this sense?
• A test: how enriched is the top-scoring 10% quantile in
known genetic elements implicated in breast cancer?
• We hit major known oncogenes, so can expect good
results. More formally, perform a database search (top
10%, 17q).
Estimates of utility
Database                                 Hits in region                      Hits in top 10%
Atlas of Genetics and Cytogenetics       8 (annotated as amplified           8 (p=10^-8)
in Oncology and Haematology              and/or overexpressed)
NCBI map viewer                          184 (hits on “breast cancer”)       64 (p=2×10^-16), likely overly conservative
NCBI map viewer                          47 (genes implicated in             10 (p=0.016), likely overly conservative
                                         breast cancer)
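The talk does not state how these p-values were computed. One reading that appears consistent with the first and third rows is a binomial tail: if hits were scattered at random, each would land in the top 10% with probability 0.1. The sketch below works under that assumption only; the function name is hypothetical.

```python
from math import comb


def binomial_tail(hits_in_top, hits_total, fraction=0.10):
    """P(at least hits_in_top of hits_total fall in the top fraction)."""
    return sum(
        comb(hits_total, k) * fraction**k * (1 - fraction)**(hits_total - k)
        for k in range(hits_in_top, hits_total + 1)
    )


p_atlas = binomial_tail(8, 8)      # all 8 annotated genes in the top 10%
p_genes = binomial_tail(10, 47)    # 10 of 47 implicated genes in the top 10%
```

Under this model all 8 of 8 hits landing in the top decile has probability 0.1^8 = 10^-8, and 10 of 47 gives a tail of roughly 0.016.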
Gene Enrichment
Epicenters are enriched in (CCDS) genes compared to the genome
and to the copy number events because (a) epicenters bracket genes
and (b) genes are clustered.
Organ, polarity  # of epis  Gene count  Enr. vs genome  Enr. vs events  Enr. vs gene brackets  p (genome)  p (events)
Breast amp       37         167         1.92            1.65            0.87                   0.02        0.054
Breast del       24         251         2.85            1.99            1.03                   0.002       0.008
Lung amp         32         232         3.05            2.54            1.22                   <0.002      0.002
Lung del         37         425         2.30            1.63            1.00                   0.002       0.028
Colon amp        23         231         2.33            1.82            0.98                   0.006       0.038
Colon del        16         292         3.21            2.45            1.17                   0.006       0.01
Application: predicting tissue of origin
Random forest classifier using joint sets of epicenters as predictors
Organ 1  Organ 2  N1 (training)  N2 (training)  N1 (test)  N2 (test)  Training error 1  Training error 2  Test error 1  Test error 2
breast   lung     129            107            128        106        0.18              0.24              0.18          0.21
breast   colon    129            69             128        68         0.02              0.29              0.02          0.19
lung     colon    107            69             106        68         0.09              0.36              0.08          0.21
Application: early events in breast cancer
Compute frequency weighted by inverse number of events
for contiguous groups of epicenters. Outliers: FISH-validated early 16p-1q translocation.
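One possible reading of this weighting is sketched below: a profile with few events contributes more to each group it hits, so lesions that already appear in sparse profiles stand out. The data layout (a mapping from profile to the set of epicenter groups it hits), the use of that set's size as the event count, and the normalization by the number of profiles are all assumptions.

```python
from collections import defaultdict


def weighted_frequency(profiles):
    """Per-group frequency, each profile weighted by 1 / (its event count)."""
    totals = defaultdict(float)
    for groups in profiles.values():
        if not groups:
            continue  # a profile without events contributes nothing
        weight = 1.0 / len(groups)
        for group in groups:
            totals[group] += weight
    n_profiles = len(profiles)
    return {group: total / n_profiles for group, total in totals.items()}


# Hypothetical profiles: "a" has a single event, "c" has many.
profiles = {
    "a": {"16p"},
    "b": {"16p", "1q"},
    "c": {"8q", "1q", "17q", "20q"},
}
freq = weighted_frequency(profiles)
```

In this toy input the 16p group scores highest because it occurs in the profiles with the fewest events.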
Summary
• Pinning is a method for finding copy
number variation epicenters in (cancer)
genomes.
• Applied to: a set of 257 FISH-validated
breast cancer genome profiles; lung and
colon cancer sets.
• The epicenters found by pinning are
significantly enriched in genes.
• Epicenters find tissue of origin.
• Epicenters detect early lesions.
ROMA-based Cancer Biology at CSHL
Mike Wigler, Jim Hicks, Rob Lucito, Scott Powers, David Mu
ROMA
Michael Riggs
Diane Esposito
Joan Alexander
Jen Troge
Evan Leibu
FACS/Database
Linda Rodgers
Bioinformatics
Lakshmi Muthuswamy
Boris Yamrom
AK
Vlad Grubor
Yoon-Ha Lee
Tony Leotta
Jude Kendall
Deepa Pai
Andy Reiner
John Healy
FISH Primer Selection
Program & Probes
Nicholas Navin
FISH (Karolinska)
Susanne Maner
Par Lundin
Statistics
Xiaoyue Zhao
Chris Yoon
Collaborators:
Anders Zetterberg –Karolinska Inst.
Anne-Lise Borressen-Dale – Norway Radium Hosp.
Kenny Ye – Albert Einstein Sch. Med.
Thea Tlsty – UCSF
Larry Norton - MSKCC