Tumor subtype identification and patients stratification using aCGH data Kurt Zhang, PhD

Tumor subtype identification
and patients stratification
using aCGH data
Kurt Zhang, PhD
Bioinformatics Core
Department of Pathology
University of North Dakota School of Medicine
The 3rd PaSiPhIC, June 27, 2012
1
Outline
• What is aCGH data?
• Can aCGH data be used for tumor subtype
identification and patient stratification?
– Tumors are genetically heterogeneous
– Link tumor subtypes with clinical outcomes
• Can we develop a better clustering method
for aCGH data?
– “All models are approximations; some perform
better than others” – Peter Bonate
2
Array Comparative Genomic
Hybridization (aCGH)
3
Array Comparative Genomic Hybridization
• An array technology that allows quantitative determination of
amplifications and deletions on a genome-wide scale
Normal Human Genome
MDA-MB-435 Breast Cancer Cell Genome
4
Array CGH Platform:
Affymetrix 100k SNP Array
• SNP: Single Nucleotide Polymorphisms
• 120k rows of data for each sample
• GTYPE software for generating copy number data
5
Chromosome 7q
6
Can aCGH data be used for
tumor subtype identification and
patient stratification?
7
Intratumor heterogeneity
Gerlinger et al. NEJM 2012
8
3*30 data:
Hierarchical Clustering (Pearson’s dissimilarity/ Ward’s linkage)
[Figure: dendrogram with three groups labeled CRC tumors, NB primary w/ pairs, and SCLC lines.]
9
PCA Clustering – 3*30 data
10
Nonnegative Matrix Factorization (NMF)
• NMF (Nonnegative Matrix Factorization):
Lee & Seung 1999 Nature 401:788
Theory: Perception of the whole is based on perception of its
parts.
To compare with PCA (Principal Components Analysis)
11
NMF Algorithm
A = WH + E
[Figure: matrix diagram of A, W, and H, with axes labeled Samples and SNPs.]
Green are the “features”.
Red are the “weights”.
Start with random elements in red and green.
Optimize so that $\sum_{i,j} (a_{ij} - (WH)_{ij})^2$ is minimized.
12
NMF Algorithm Cont’d
W (weights of sample k for clusters 1, 2, 3):
Cluster     1      2      3
Sample k    0.38   0.52   0.17
Sample k is placed in cluster 2
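The factorization and the "largest weight wins" assignment can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes the multiplicative update rules of Lee & Seung for the squared-error objective (the slides do not say which solver was used), and it assumes A is oriented markers × samples, so the per-sample weights sit in the columns of H. The names `nmf` and `assign_clusters` are hypothetical.

```python
import numpy as np

def nmf(A, k, n_iter=1000, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix A (markers x samples) as W @ H.

    Uses multiplicative updates for the squared error ||A - WH||^2
    (an assumed solver choice; not stated in the slides).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Start with random positive elements in both factors.
    W = rng.random((n_rows, k)) + eps
    H = rng.random((k, n_cols)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign_clusters(W, H):
    """Place each sample in the cluster with its largest weight."""
    # Rescale so that weights are comparable across components.
    scale = np.linalg.norm(W, axis=0)
    return np.argmax(H * scale[:, None], axis=0)
```

With the weights 0.38, 0.52, and 0.17 for clusters 1, 2, and 3, the argmax picks the second component, which matches the placement of sample k in cluster 2.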
13
NMF clustering – 3*30 data
group 1
sample.01_exp.5AAAA1
sample.02_exp.5AAAA1
sample.03_exp.5AAAA2
sample.04_exp.5AAAA2
sample.05_exp.5AAAA3
sample.06_exp.5AAAA3
sample.08_exp19BBBB
sample.9_exp.2BBBB
sample.09_exp.5AAAA5
sample.10_exp.5AAAA5
sample.11_exp.5AAAA
sample.12_exp.5AAAA
sample.13_exp.5AAAA4
sample.14_exp.5AAAA4
sample.15_exp.5AAAA
sample.16_exp.5AAAA
sample.17_exp.5AAAA
sample.19_exp.2BBBB
sample.19_exp5AAAA
sample.20_exp.5AAAA
sample.22_exp.2BBBB
sample.23_exp.2BBBB
sample18_exp.5AAAA
sample_01_exp21AAAA
sample_01_exp24
sample_01_exp31
sample_01_exp42
sample_02_exp21AAAA
sample_02_exp24
sample_02_exp28
sample_02_exp36
sample_03_exp21AAAA
sample_03_exp24
sample_03_exp25
sample_03_exp26
sample_03_exp28
sample_04_exp20BBBB
sample_04_exp21AAAA
sample_05_exp21AAAA
sample_06_exp21AAAA
sample_08_exp21AAAA
sample_08_exp24
sample_08_exp25
sample_08_exp31
sample_09_exp10
sample_11_exp21AAAA
sample_12_exp21AAAA
group 2
sample.20_exp.2BBBB
sample_01_exp25
sample_01_exp26
sample_01_exp28
sample_01_exp36
sample_02_exp10
sample_02_exp25
sample_02_exp26
sample_02_exp31
sample_02_exp42
sample_03_exp10
sample_03_exp31
sample_08_exp26
sample_08_exp28
sample_08_exp36
sample_08_exp42
sample_09_exp21AAAA
sample_09_exp24
group 3
sample.1_exp.2BBBB
sample.2_exp.2BBBB
sample.3_exp.2BBBB
sample.4_exp.4BBBB
sample.5_exp.2BBBB
sample.6_exp.2BBBB
sample.7_exp.2BBBB
sample.07_exp19BBBB
sample.8_exp.2BBBB
sample.09_exp19BBBB
sample.10_exp.2BBBB
sample.10_exp19BBBB
sample.11_exp.2BBBB
sample.12_exp.2BBBB
sample.13_exp.2BBBB
sample.14_exp.2BBBB
sample.15_exp.2BBBB
sample.16_exp.2BBBB
sample.17_exp.2BBBB
sample.18_exp.2BBBB
sample.21_exp.2BBBB
sample_02_exp20BBBB
sample_03_exp20BBBB
sample_07_exp21AAAA
sample_10_exp21AAAA
14
NMF with Segmentation for 3*30 Data
(5262 segments)
group 1
sample_01_exp24
sample_01_exp25
sample_01_exp26
sample_01_exp28
sample_01_exp31
sample_01_exp36
sample_01_exp42
sample_02_exp10
sample_02_exp24
sample_02_exp25
sample_02_exp26
sample_02_exp28
sample_02_exp31
sample_02_exp36
sample_02_exp42
sample_03_exp10
sample_03_exp24
sample_03_exp25
sample_03_exp26
sample_03_exp28
sample_03_exp31
sample_08_exp24
sample_08_exp25
sample_08_exp26
sample_08_exp28
sample_08_exp31
sample_08_exp36
sample_08_exp42
sample_09_exp10
sample_09_exp21AAAAA
sample_09_exp24
group 2
sample.01_exp.5AAAAA1
sample.02_exp.5AAAAA1
sample.03_exp.5AAAAA2
sample.04_exp.5AAAAA2
sample.05_exp.5AAAAA3
sample.06_exp.5AAAAA3
sample.08_exp19BBBBB
sample.9_exp.2BBBBB
sample.09_exp.5AAAAA5
sample.10_exp.5AAAAA5
sample.11_exp.5AAAAA
sample.12_exp.5AAAAA
sample.13_exp.5AAAAA4
sample.14_exp.5AAAAA4
sample.15_exp.5AAAAA
sample.16_exp.5AAAAA
sample.17_exp.5AAAAA
sample.19_exp.2BBBBB
sample.19_exp5AAAAA
sample.20_exp.5AAAAA
sample.21_exp.2BBBBB
sample.22_exp.2BBBBB
sample18_exp.5AAAAA
sample_01_exp21AAAAA
sample_02_exp21AAAAA
sample_03_exp21AAAAA
sample_04_exp20BBBBB
sample_04_exp21AAAAA
sample_05_exp21AAAAA
sample_06_exp21AAAAA
sample_08_exp21AAAAA
sample_11_exp21AAAAA
sample_12_exp21AAAAA
group 3
sample.1_exp.2BBBBB
sample.2_exp.2BBBBB
sample.3_exp.2BBBBB
sample.4_exp.4BBBBB
sample.5_exp.2BBBBB
sample.6_exp.2BBBBB
sample.7_exp.2BBBBB
sample.07_exp19BBBBB
sample.8_exp.2BBBBB
sample.09_exp19BBBBB
sample.10_exp.2BBBBB
sample.10_exp19BBBBB
sample.11_exp.2BBBBB
sample.12_exp.2BBBBB
sample.13_exp.2BBBBB
sample.14_exp.2BBBBB
sample.15_exp.2BBBBB
sample.16_exp.2BBBBB
sample.17_exp.2BBBBB
sample.18_exp.2BBBBB
sample.20_exp.2BBBBB
sample.23_exp.2BBBBB
sample_02_exp20BBBBB
sample_03_exp20BBBBB
sample_07_exp21AAAAA
sample_10_exp21AAAAA
15
NMF and HC clustering - Astrocytoma
16
Kaplan-Meier Curve - Astrocytoma
[Figure: Kaplan-Meier survival plots for two clusters (1, 2) from NMF and from HC; y-axis: Surviving (0.0 to 1.0); x-axis: Month (10 to 60).]
Log-rank test P values: NMF, P = 0.008; HC, P = 0.459
17
Kaplan-Meier Curve - Astrocytoma
[Figure: Kaplan-Meier survival plots for three clusters (1, 2, 3) from NMF and from HC; y-axis: Surviving (0.0 to 1.0); x-axis: Month (10 to 60).]
Pairwise log-rank test P values:
clust1   clust2   NMF     HC
1        2        0.008   0.625
1        3        0.194   0.481
2        3        0.213   0.954
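The P values on these slides come from log-rank tests that compare the survival of two clusters at a time. As a rough sketch of how such a value is computed, the function below implements the standard two-group log-rank statistic and refers it to a chi-square distribution with one degree of freedom. The function name `logrank_test` and the input layout (follow-up times plus an event flag, 1 = death, 0 = censored) are assumptions for illustration; the slides do not describe the software used.

```python
import math
import numpy as np

def logrank_test(time_a, event_a, time_b, event_b):
    """Two-group log-rank test. Returns (chi-square statistic, P value)."""
    time = np.concatenate([time_a, time_b]).astype(float)
    event = np.concatenate([event_a, event_b]).astype(bool)
    in_a = np.concatenate([np.ones(len(time_a), dtype=bool),
                           np.zeros(len(time_b), dtype=bool)])
    observed_a = expected_a = variance = 0.0
    for t in np.unique(time[event]):          # each distinct event time
        at_risk = time >= t
        n = at_risk.sum()                     # subjects at risk, both groups
        n_a = (at_risk & in_a).sum()          # subjects at risk in group A
        deaths = event & (time == t)
        d = deaths.sum()                      # deaths at time t, both groups
        observed_a += (deaths & in_a).sum()
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return float(chi2), float(p_value)
```

For three clusters, the pairwise comparison applies the same function to each pair of clusters (1 vs 2, 1 vs 3, 2 vs 3).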
18
Leave-5-out Stability Test
Randomly leave 5 out
Misplacement rate:
# Clusters   NMF     HC
2            1.79%   13.03%
3            9.48%   8.76%
19
Can we develop a better
clustering method for aCGH
data?
20
DNA copy numbers are
highly correlated
21
A Hidden Markov Model (HMM)
[Figure: HMM chain in which consecutive hidden states are linked by transition probabilities (Ptran) and each hidden state produces an observation with an emission probability (Pemi).]
• A hidden Markov model is used to calculate the true copy number of each marker from 5’ to 3’ on each chromosome.
• In the graphic, t is the index of the current SNP, y is the observed intensity value of the SNP, and x is the true copy number. The copy number of the next SNP depends only on the copy number of the present SNP.
• If a DNA segment is normal, the copy number of the SNPs in that region should be 2. A copy number greater than 2 is a gain (amplification); a copy number less than 2 is a loss (deletion).
• The true copy number of the next SNP is determined by the transition probability from the current SNP and the emission probability of the observation. The transition probability is modeled by a multinomial distribution, and the emission probability is modeled by a lognormal distribution.
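The decoding step described above can be illustrated with a small Viterbi sketch for a single sample. Everything numeric here is an assumption made for the example: the three copy-number states (1, 2, 3), the probability of staying in the same state (0.98), the lognormal spread `sigma`, and the choice to centre each state's lognormal emission on its copy number. The slides do not give the fitted parameters or say which decoding algorithm was used.

```python
import numpy as np

def lognormal_logpdf(y, mu, sigma):
    """Log density of a lognormal with log-scale mean mu and spread sigma."""
    return (-np.log(y * sigma * np.sqrt(2 * np.pi))
            - (np.log(y) - mu) ** 2 / (2 * sigma ** 2))

def viterbi_copy_number(y, states=(1, 2, 3), stay=0.98, sigma=0.15):
    """Most likely copy-number path for observed intensities y (hypothetical parameters)."""
    y = np.asarray(y, dtype=float)
    states = np.asarray(states)
    k, T = len(states), len(y)
    # Transition matrix: stay with probability `stay`, otherwise move uniformly.
    log_trans = np.full((k, k), np.log((1 - stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    log_init = np.full(k, -np.log(k))
    # Emission log-probabilities, shape (T, k).
    log_emit = lognormal_logpdf(y[:, None], np.log(states)[None, :], sigma)
    score = log_init + log_emit[0]
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: state i -> state j
        back[t] = np.argmax(cand, axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace back the best path from the last marker to the first.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(score)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return states[path]
```

Because the next state depends only on the current one and staying is much more likely than switching, an isolated noisy marker is smoothed over, while a run of consistently elevated markers is called as a gain.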
An HMM-based Clustering Method
[Figure: HMM with observations yij(t-1), yij(t), and yij(t+1) at consecutive markers.]
• Suppose we have G groups.
• Each group of samples is fitted with a single HMM.
• At each state, the true copy number is calculated from the
transition probability from the previous state and the
emission probabilities of all observations within the group.
• Thus the full set of samples is modeled with a mixture HMM.
23
HMMC Algorithm
Simulation: clustering error rate
Simulation: convergence rate
Model Assessment: #clusters = ?
• Akaike/Bayesian Information Criteria (AIC/BIC)
AIC = -2log(L) + 2K
BIC = -2log(L) + log(n)*K
where L is the maximized likelihood of the given
model, K is the number of parameters in the
model, and n is the total number of samples
(sample size).
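As a small worked illustration of the two criteria, the sketch below evaluates AIC and BIC from a log-likelihood and picks the number of clusters with the lowest BIC. The helper names and the candidate values in the usage note are hypothetical; they are not results from the study.

```python
import math

def aic(log_likelihood, n_params):
    """AIC = -2 log(L) + 2K."""
    return -2.0 * log_likelihood + 2.0 * n_params

def bic(log_likelihood, n_params, n_samples):
    """BIC = -2 log(L) + log(n) * K."""
    return -2.0 * log_likelihood + math.log(n_samples) * n_params

def best_by_bic(candidates, n_samples):
    """candidates maps #clusters -> (log-likelihood, number of parameters)."""
    return min(candidates, key=lambda g: bic(*candidates[g], n_samples))
```

For example, with made-up candidates `{2: (-1000.0, 10), 3: (-950.0, 15), 4: (-948.0, 20)}` and 100 samples, three clusters would be chosen: going from 3 to 4 clusters barely raises the likelihood, so the extra parameters are not worth their penalty.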
27
HMM Likelihood and BIC
• Log-likelihood of cluster g
$$\ell_g = \sum_{t=1}^{T} \left[ \ln P_{tran}(t) + \sum_{j=1}^{n_g} \ln P_{emi}(y_{jt}) \right]$$
• BIC
where G is the number of clusters, T is the number of markers
and n is the number of samples.
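To make the cluster log-likelihood concrete, the sketch below evaluates it for one group of samples along a given hidden copy-number path: one transition term per marker plus one emission term per sample per marker. Several details are assumptions for illustration only: the transition term at the first marker is taken to be the initial state probability, emissions are lognormal and centred on the copy number with a fixed spread `sigma`, and the function name `group_log_likelihood` is hypothetical.

```python
import numpy as np

def group_log_likelihood(Y, path, states, log_trans, log_init, sigma):
    """Sum over markers of ln Ptran(t) plus, over samples, ln Pemi(y_jt).

    Y         : intensities, shape (n_g samples, T markers)
    path      : hidden state index at each marker, length T
    states    : copy number represented by each state index
    log_trans : log transition matrix, shape (k, k)
    log_init  : log initial state probabilities, length k (assumed term for t = 1)
    """
    Y = np.asarray(Y, dtype=float)
    path = np.asarray(path)
    mu = np.log(np.asarray(states, dtype=float))[path]   # log-scale mean per marker
    # Lognormal emission log density for every sample and marker.
    log_emit = (-np.log(Y * sigma * np.sqrt(2 * np.pi))
                - (np.log(Y) - mu) ** 2 / (2 * sigma ** 2))
    # Transition terms along the path.
    log_tran = np.concatenate([[log_init[path[0]]],
                               log_trans[path[:-1], path[1:]]])
    return float(log_tran.sum() + log_emit.sum())
```

Because every sample in the group contributes emission terms for the same hidden path, a path that matches the shared gains and losses of the group scores higher than one that ignores them.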
28
HMMC - Astrocytoma
[Figure: Kaplan-Meier survival plots for NMF clusters and HMMC clusters; y-axis: Surviving (0.0 to 1.0); x-axis: month (10 to 60).]
Log-rank test: NMF, P=0.0078; HMMC, P=0.0046
29
HMMC - Astrocytoma
Summary of HMMC
• We used an HMM to take into account the
spatial correlation between genomic markers.
• The number of groups is determined by an HMM-based BIC.
• A data mining algorithm and parallel computing are
used to improve computational efficiency.
• The HMM clusters are associated with clinical
outcome.
• Genomic CNV biomarkers can be derived from
the clustering results.
31
Future Studies
• To implement new distance metrics, such as
distance correlation, the maximal information
coefficient, etc.
• To use a genetic algorithm for optimal clustering.
• To extend our clustering algorithm to next-generation genome sequencing data.
• To develop a nonparametric statistical method to
segment DNA copy number and to discover
genomic biomarkers for each tumor subtype.
32
Acknowledgements
• UND bioinformatics core
– Yi Yang
– Maksym Tkach
– Mohammed Mohmoud
– Brent Weichel
– Kaitlin Clarke
– Paul Rodemeyer
– Thomas Kokal
• Tumor genomics group in Abbott Laboratories
– Dimitri Semizarov, PhD
– Xin Lu, PhD
– Charles van Sant, PhD
– Viswanath Devanarayan, PhD
This study is supported by NIH INBRE.
33