Robust diagnosis of DLBCL from gene expression data from different laboratories

DIMACS - RUTCOR Workshop on
Boolean and Pseudo-Boolean Functions
in Memory of Peter L. Hammer
January 19-22, 2009
Authors: Peter L Hammer, Sorin Alexe, David E Axelrod, Gustavo Stolovitzky, Gyan Bhanot, Arnold J Levine, David Weissmann
Affiliations: IBM TJ Watson Research; Rutgers Univ; Institute for Advanced Study Princeton; Cancer Institute of New Jersey
Overview

Motivation

Pattern-based ensemble classifiers

Case study – compare data from two labs for DLBCL vs FL diagnosis
Shipp et al. (2002) Nature Med. 8(1), 68-74. (Whitehead Lab)
Stolovitzky G. (2005) In Deisboeck et al., Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf). (Dalla-Favera Lab)
Alexe, Alexe, Axelrod, Hammer, Weissmann (2005) Artificial Intelligence in Medicine
Bhanot, Alexe, Stolovitzky, Levine (2005) Genome Informatics
Non-Hodgkin lymphomas
FL
low grade non-Hodgkin lymphoma / no cure if advanced stage
second most frequent subtype of nodal lymphoid malignancies
Incidence has risen from 2–3 to more than 5–7 per 100,000 per year (1950–2000)
t(14;18) translocation: over-expression of anti-apoptotic bcl2
25-60% of FL cases evolve to DLBCL
DLBCL
high grade non-Hodgkin lymphoma / highly variable response to treatment
most frequent subtype of NHL
< 2 years survival if untreated
Biomarkers: FL transformation to DLBCL
• p53/MDM2 (Moller et al., 1999)
• p16 (Pinyol et al., 1998)
• p38MAPK (Elenitoba-Johnson et al., 2003)
• c-myc (Lossos et al., 2002)
Gene arrays

Gene arrays are a way to study the variation of
mRNA levels between different types of cells.

This allows diagnosis and inference of pathways
that cause disease / early stage diagnosis

Identify molecular profiles of disease –
personalized medicine
Lymphoma datasets
Data:
WI (Shipp et al., 2002) Affy HuGeneFL
CU (Dalla-Favera Lab, Stolovitzky, 2005) Affy Hu95Av2
Samples:
WI: 58 DLBCL & 19 FL
CU: 14 DLBCL & 7 FL
Genes:
WI: 6817
CU: 12581
Diagnosis problem
Input
Training (biomedical) data:
2 classes: FL and DLBCL
m samples described by N >> m features
Output
Collection of robust biomarkers, models
Robust, accurate classifier /
tested on out-of-sample data
[Figure: flowchart of the classification pipeline; recoverable stages listed below]
Input data
Data preprocessing: creating training and test data, normalization, noise estimation
Robust feature selection: filtering, support set selection
Biology-based feature selection: filtering, support set selection
Raw data (training) and pattern data (training)
INDIVIDUAL CLASSIFIERS: Artificial Neural Networks, Support Vector Machines, Weighted Voting System (LAD), k-Nearest Neighbors, Decision Trees (C4.5), Logistic Regression
Calibration, Principal Components
INTERMEDIATE CLASSIFIERS
META-CLASSIFIER: Classifier (Weighted Voting)
Validation (test data)
Patterns (Logical Analysis of Data, Hammer 1988)
Positive Patterns
Negative Patterns
Model
- Exhaustive collections of patterns
- Pattern space
- Classification / attribute analysis / new class identification
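A minimal sketch of what a pattern looks like in code, assuming the usual LAD reading of a pattern as a conjunction of cut-point conditions on a few genes. The class name `Pattern`, the helper `prevalence`, the gene names `g1` and `g2`, and all values are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A condition is (gene, operator, cut point); operator is ">" or "<=".
Condition = Tuple[str, str, float]
Sample = Dict[str, float]


@dataclass(frozen=True)
class Pattern:
    """Conjunction of threshold conditions on a few genes."""
    name: str
    conditions: Tuple[Condition, ...]

    def covers(self, sample: Sample) -> bool:
        # A sample is covered only if every condition holds.
        for gene, op, cut in self.conditions:
            value = sample[gene]
            if op == ">" and not value > cut:
                return False
            if op == "<=" and not value <= cut:
                return False
        return True


def prevalence(pattern: Pattern, samples: List[Sample]) -> float:
    """Percentage of the given samples covered by the pattern."""
    if not samples:
        return 0.0
    hits = sum(pattern.covers(s) for s in samples)
    return 100.0 * hits / len(samples)


# Hypothetical toy data: gene names and values are invented.
positives = [{"g1": 0.9, "g2": -0.2}, {"g1": 0.4, "g2": 0.1}, {"g1": -0.5, "g2": 0.3}]
negatives = [{"g1": -0.8, "g2": 0.5}, {"g1": -0.6, "g2": -0.9}]

p = Pattern("P", (("g1", ">", 0.0),))
n = Pattern("N", (("g1", "<=", -0.55),))

pos_prev_p = prevalence(p, positives)  # covers 2 of 3 positive samples
neg_prev_p = prevalence(p, negatives)  # covers none of the negative samples
neg_prev_n = prevalence(n, negatives)  # covers both negative samples
```

A positive pattern is useful when its prevalence is high on the positive class and zero (or near zero) on the negative class, and vice versa for a negative pattern.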
Data Preprocessing





Filter on 50% P (present) calls; threshold expression values at UL = 16000 (upper limit) and LL = 20 (lower limit)
Stratified 2/1 split of WI data into training and test sets; CU data used as test data only
Normalize data to median 1000 per array
Generate 500 data sets using noise + k-fold stratified sampling + jackknife
Find genes with high correlation to phenotype using t-test or SNR. Keep genes that are selected in > 90% of the datasets
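The preprocessing steps above can be sketched as follows, with NumPy and a genes-by-arrays matrix. This is a sketch under stated assumptions, not the authors' procedure: the SNR form (difference of class means over the sum of class standard deviations), the perturbation scheme (drop one random sample per class, then add multiplicative noise), and the parameters `top_k` and `noise_sd` are all illustrative choices; the slide does not specify them.

```python
import numpy as np


def preprocess(x, lower=20.0, upper=16000.0, target_median=1000.0):
    """Clip expression values, then scale each array (column) to a common median."""
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    return x * (target_median / np.median(x, axis=0))


def snr(x, y):
    """Assumed signal-to-noise ratio per gene: (mean1 - mean0) / (sd1 + sd0)."""
    a, b = x[:, y == 1], x[:, y == 0]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))


def stable_genes(x, y, n_sets=500, top_k=5, keep_frac=0.9, noise_sd=0.05, seed=0):
    """Keep genes ranked in the top_k by |SNR| in more than keep_frac of perturbed datasets."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = x.shape
    counts = np.zeros(n_genes)
    for _ in range(n_sets):
        # Jackknife-style perturbation (assumption): drop one random sample
        # per class, then add multiplicative Gaussian noise.
        drop = [rng.choice(np.flatnonzero(y == c)) for c in (0, 1)]
        keep = np.setdiff1d(np.arange(n_samples), drop)
        noisy = x[:, keep] * (1.0 + noise_sd * rng.standard_normal((n_genes, keep.size)))
        scores = np.abs(snr(noisy, y[keep]))
        counts[np.argsort(scores)[-top_k:]] += 1
    return np.flatnonzero(counts / n_sets > keep_frac)


# Synthetic example: 40 hypothetical genes, 16 arrays, first three genes raised in class 1.
rng = np.random.default_rng(1)
y = np.array([1] * 8 + [0] * 8)
raw = rng.uniform(900.0, 1100.0, size=(40, 16))
raw[:3, y == 1] *= 4.0
data = preprocess(raw)
selected = stable_genes(data, y, n_sets=100, top_k=3)
```

Requiring a gene to be selected in most perturbed datasets filters out genes whose ranking depends on a few samples or on measurement noise.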
Choosing support sets

Create quality patterns using small subsets of
genes, validate using weighted voting with 10 fold
cross validation

Sort genes by their appearance in good patterns

Select top genes to cover each sample by at least
10 patterns
Alexe, Alexe, Hammer, Vizvari (2005)
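One way to read the last two steps is as a greedy procedure: rank genes by how often they appear in good patterns, then add genes in that order until every sample is covered often enough. The sketch below assumes that reading; the `Pattern` tuple, the function `select_support_set`, and the toy genes `A`, `B`, `C` are hypothetical, and the cited paper may use a different procedure.

```python
from collections import Counter
from typing import FrozenSet, List, NamedTuple


class Pattern(NamedTuple):
    genes: FrozenSet[str]    # genes used by the pattern
    covered: FrozenSet[int]  # indices of the samples the pattern covers


def select_support_set(patterns: List[Pattern], n_samples: int, min_cover: int) -> List[str]:
    """Add genes in order of pattern frequency until every sample is covered
    by at least min_cover patterns built only from the selected genes."""
    freq = Counter(g for p in patterns for g in p.genes)
    # Most frequent genes first; ties broken alphabetically for determinism.
    ranked = [g for g, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    selected: List[str] = []
    for gene in ranked:
        selected.append(gene)
        chosen = set(selected)
        cover = [0] * n_samples
        for p in patterns:
            if p.genes <= chosen:
                for s in p.covered:
                    cover[s] += 1
        if min(cover) >= min_cover:
            return selected
    raise ValueError("patterns cannot cover every sample min_cover times")


# Toy example with three samples and four hypothetical patterns.
toy_patterns = [
    Pattern(frozenset({"A"}), frozenset({0, 1})),
    Pattern(frozenset({"A", "B"}), frozenset({1, 2})),
    Pattern(frozenset({"B"}), frozenset({0, 2})),
    Pattern(frozenset({"C"}), frozenset({2})),
]
support = select_support_set(toy_patterns, n_samples=3, min_cover=2)
```

In the talk the coverage requirement is 10 patterns per sample; the toy example uses 2 to stay small.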
The 30 genes that best distinguish FL from DLBCL
[Table: Gene symbol, Biological function, and asterisk (*) marks for selection by Genes@Work, t-test, and Shipp et al.; row alignment of functions and marks not recoverable]
Gene symbols: TXNIP, DNASE1L3, CDH11, LUCA15, GPR18, CLU, LY9, RHOH, ELF2, CCNG2, CR2, CDKN2D, G18, LY86, ARPC1B, PPP2R5C, SEPP1, MCM7, BCL2A1, IMPDH2, RRP45, STAT1, DLG7, SLC1A5, TUBB2, PSMA6, PSMC1, LGALS3, CLTA, PAGA
Biological functions listed: metastases suppressor, apoptosis, oxidative stress, cell adhesion, signaling pathway, T-cell differentiation, transcription, cell cycle, signal transduction, cell growth, complement activation, p53 regulated, cell motility, GMP biosynthesis, immune response, NF-kappaB cascade, cell-cell signaling, transport, microtubule movement, protein catabolism, spinocerebellar ataxia, sugar binding, cell proliferation
Genes identified by LAD (AIIM 2005) to distinguish
DLBCL from FL
# | Gene index | Gene description | Accession # | Pearson correlation of genes in support set with DLBCL vs FL outcome | Frequency of participation in the definition of combinatorial biomarkers
1 | 506 | DNA REPLICATION LICENSING FACTOR CDC47 HOMOLOG | D55716_at | 0.45 | 42.08
2 | 1612 | (clone GPCR W) G protein-linked receptor gene (GPCR) gene, 5' end of cds | L42324_at | -0.49 | 30.00
3 | 972 | Rad2 | HG4074-HT4344_at | 0.45 | 23.33
4 | 2137 | HIGH AFFINITY IMMUNOGLOBULIN GAMMA FC RECEPTOR I "A FORM" PRECURSOR | M63835_at | 0.43 | 23.33
5 | 605 | 5-aminoimidazole-4-carboxamide-1-beta-D-ribonucleotide transformylase/inosinicase | D82348_at | 0.53 | 22.50
6 | 6815 | Tubulin, Beta 2 | HG1980-HT2023_at | 0.50 | 8.33
7 | 7102 | HLA-A MHC class I protein HLA-A (HLA-A28,-B40, -Cw3) | M94880_f_at | -0.43 | 8.33
8 | 2988 | RCH1 RAG (recombination activating gene) cohort 1 | U28386_at | 0.48 | 7.08
9 | 4028 | LDHA Lactate dehydrogenase A | X02152_at | 0.62 | 6.25
10 | 4292 | PKM2 Pyruvate kinase, muscle | X56494_at | 0.55 | 5.00
11 | 4485 | IDH2 Isocitrate dehydrogenase 2 (NADP+), mitochondrial | X69433_at | 0.47 | 5.00
12 | 1430 | Protein tyrosine phosphatase (CIP2) mRNA | L25876_at | 0.44 | 4.17
13 | 1988 | INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 PRECURSOR | M35878_at | -0.28 | 4.17
14 | 582 | KIAA0175 gene | D79997_at | 0.45 | 2.08
15 | 1092 | GAMMA-INTERFERON-INDUCIBLE PROTEIN IP-30 PRECURSOR | J03909_at | 0.53 | 2.08
16 | 2929 | Mitochondrial serine hydroxymethyltransferase gene, nuclear encoded mitochondrion protein | U23143_at | 0.42 | 2.08
17 | 3005 | Bcl-2 related (Bfl-1) mRNA | U29680_at | 0.44 | 2.08
18 | 4010 | PGK1 Phosphoglycerate kinase 1 | V00572_at | 0.36 | 2.08
19 | 2789 | CENPA Centromere protein A (17kD) | U14518_at | 0.51 | 0.00
20 | 6703 | Dents Disease candidate gene | X81836_s_at | 0.37 | 0.00
Functional gene group # (*) column, values as extracted (17 of 20 present, row alignment not recoverable): 1, 2, 1, 2, 4, 2, 1, 6, 6, 6, 5, 2, 3, 5, 6, 5, -
Table 1. Selected non-minimal support set of 20 genes for distinguishing DLBCL from FL cases.
* 1: DNA replication, recombination and repair, 2: cell surface proteins and receptors, 3: protein synthesis and degradation, 4: structural proteins, 5: cell cycle and apoptosis, 6: metabolism, -: other.
Examples of FL and DLBCL patterns
Gene symbols: GPR18, CLU, DLG7, MCM7

Pattern   Training set prevalence (%)   Test set prevalence (%)
          Pos       Neg                 Pos       Neg
P1        97        0                   91        23
P2        95        0                   79        31
N1        0         100                 3         54

MCM7 conditions: P1 > -0.62, P2 > -0.77, N1 ≤ -0.55
Other conditions (assignment to GPR18, CLU, DLG7 not recoverable): > -1.13, ≤ 0.91, > -0.26
WI training data:
Each DLBCL case satisfies at least one of the patterns P1 and P2
Each FL case satisfies the pattern N1 (and none of the patterns P1 and P2)
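A hedged sketch of how patterns like these can classify a new case: score each sample by the share of positive (DLBCL) patterns it satisfies minus the share of negative (FL) patterns. Equal pattern weights are an assumption here; the genes `gene_a` and `gene_b` and every cut point are invented for illustration and are not the patterns from the slide.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]
Pattern = Callable[[Sample], bool]


def discriminant(sample: Sample, positive: List[Pattern], negative: List[Pattern]) -> float:
    """Share of positive patterns covering the sample minus the share of negative ones."""
    pos = sum(p(sample) for p in positive) / len(positive)
    neg = sum(n(sample) for n in negative) / len(negative)
    return pos - neg


def classify(sample: Sample, positive: List[Pattern], negative: List[Pattern]) -> str:
    score = discriminant(sample, positive, negative)
    if score > 0:
        return "DLBCL"
    if score < 0:
        return "FL"
    return "unclassified"  # no pattern fires, or the two sides cancel out


# Hypothetical patterns on two invented genes (cut points are illustrative only).
P1 = lambda s: s["gene_a"] > -0.6 and s["gene_b"] <= 0.9
P2 = lambda s: s["gene_a"] > -0.8 and s["gene_b"] <= 0.5
N1 = lambda s: s["gene_a"] <= -0.5 and s["gene_b"] > -0.2

positive_patterns = [P1, P2]
negative_patterns = [N1]

case_1 = classify({"gene_a": 0.4, "gene_b": -0.3}, positive_patterns, negative_patterns)
case_2 = classify({"gene_a": -0.9, "gene_b": 0.7}, positive_patterns, negative_patterns)
case_3 = classify({"gene_a": -0.9, "gene_b": -0.5}, positive_patterns, negative_patterns)
```

A case covered by no pattern at all gets a score of zero, which is why large, overlapping pattern collections help on out-of-sample data.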
Pattern data
[Figure: pattern data showing positive and negative patterns for DLBCL and FL samples; panels: WI training data, WI test data, CU test data]
Meta-classifier performance
                                   Training                                           Test
Classifier               Weight    Sensitivity (%)  Specificity (%)  Error rate (%)   Sensitivity (%)  Specificity (%)  Error rate (%)
Trained on raw data
  ANN                    0.08      94.74            92.31            5.88             82.35            84.62            17.02
  SVM                    0.08      97.37            92.31            3.92             97.06            76.92            8.51
  kNN                    0.09      97.37            100.00           1.96             91.18            84.62            10.64
  WV                     0.07      92.11            92.31            7.84             94.12            76.92            10.64
  C4.5                   0.06      94.74            84.62            7.84             94.12            69.23            12.77
  LR                     0.07      97.37            84.62            5.88             94.12            69.23            12.77
Trained on pattern data
  ANN                    0.10      100.00           100.00           0.00             97.06            76.92            8.51
  SVM                    0.10      100.00           100.00           0.00             97.06            76.92            8.51
  kNN                    0.10      100.00           100.00           0.00             100.00           69.23            8.51
  WV                     0.10      100.00           100.00           0.00             97.06            76.92            8.51
  C4.5                   0.10      100.00           100.00           0.00             91.18            76.92            12.77
  LR                     0.05      100.00           76.92            5.88             100.00           61.54            10.64
Meta-classifier                    100.00           100.00           0.00             100.00           76.92            6.38
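As a worked check of the figures above, the sketch below recomputes sensitivity, specificity, and error rate from case counts and shows one possible weighted-vote rule using the listed weights. Two things are assumptions: the case counts (34 DLBCL and 13 FL test cases, inferred from the reported percentages and not stated on the slide) and the voting rule itself (a signed weighted sum), since the slide gives the weights but not how they are combined.

```python
from typing import Dict, Tuple


def rates(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity and error rate in percent (DLBCL = positive class)."""
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    error_rate = 100.0 * (fn + fp) / (tp + fn + tn + fp)
    return round(sensitivity, 2), round(specificity, 2), round(error_rate, 2)


# Weights as listed in the table, keyed by (training data, classifier).
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("raw", "ANN"): 0.08, ("raw", "SVM"): 0.08, ("raw", "kNN"): 0.09,
    ("raw", "WV"): 0.07, ("raw", "C4.5"): 0.06, ("raw", "LR"): 0.07,
    ("pattern", "ANN"): 0.10, ("pattern", "SVM"): 0.10, ("pattern", "kNN"): 0.10,
    ("pattern", "WV"): 0.10, ("pattern", "C4.5"): 0.10, ("pattern", "LR"): 0.05,
}


def meta_vote(votes: Dict[Tuple[str, str], int]) -> str:
    """Assumed rule: each classifier votes +1 (DLBCL) or -1 (FL); sign of the weighted sum decides."""
    score = sum(WEIGHTS[key] * vote for key, vote in votes.items())
    return "DLBCL" if score > 0 else "FL"


# Inferred counts (assumption): 34 DLBCL + 13 FL test cases.
# Meta-classifier: no DLBCL case missed, 3 FL cases misclassified.
meta_test = rates(tp=34, fn=0, tn=10, fp=3)
# ANN trained on raw data: 6 DLBCL and 2 FL cases misclassified.
ann_raw_test = rates(tp=28, fn=6, tn=11, fp=2)

# Example vote: raw-data classifiers say FL, pattern-data classifiers say DLBCL.
split_votes = {key: (1 if key[0] == "pattern" else -1) for key in WEIGHTS}
split_decision = meta_vote(split_votes)
```

The weights sum to 1.00, with 0.55 on the pattern-trained classifiers, so under this assumed rule the pattern-trained group wins a complete split.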
Error distribution: raw and pattern data
[Figure: error distribution on WI test data and CU test data for the meta-classifier, classifiers trained on pattern data, and classifiers trained on raw data; axis from 0 to 50]
Biology based method
FL → DLBCL progression
p53 related genes identified by filtering procedure
Gene symbols: CCNB1, MCM7, BRCA1, BCL2A1, PPP2R4, EIF2S2, COMT, IARS, MPI, ALAS1, MRPL3, NCF2, AARS, KIF11, CDK4, ATP1B1, CDC20, PRIM1, CDC2, TOP2A, CDK2, MYC, CCNE1, EPRS, PMAIP1, GSK3B, ACAA2, COL6A1, E2F5*, HRAS, POLA, SERPING1, HMGB2, CCNA2, PSMB5, CCT6A, ACTA2, PRKDC, INSR, CAD, SNRPA, TNFRSF1B, G1P2, ZNF184*, IMPDH1, ALDOA, MAP2K2, KARS, TOP2A, MAD2L1, CXCL1, GOT1, BAG1, CDC25B, TOP1, PSMA1, MAP4, KIAA0101, FDFT1, PCNA, MTA1, TCF3, CDKN1A, CYC1, HLAE*, UPP1, PLK1, TOPBP1, CDK7, E2F3, MDM4, AMPD2, RBBP4, CCNG2*, HARS, CASP6, RPS6KA1, GRP58, TP53, SMAD2, ATP5C1, TIMP3, THBS2, MYCBP, DTR, TIMP3, CBS, CDKN2D*, RELA
p53 pattern data
[Figure: p53 pattern data showing positive and negative patterns for FL and DLBCL samples; panels: WI data, CU data]
Examples of p53 responsive genes patterns
Gene symbols: CBS, E2F5, CDC2, KIAA0101, CCNE1, BCL2A1, CCNB1, MCM7

Pattern   Training set prevalence (%)   Test set prevalence (%)
          Pos       Neg                 Pos       Neg
P1        93        11                  86        29
P2        90        11                  71        29
P3        69        11                  64        14
N1        3         74                  14        71
N2        3         68                  21        57
N3        3         68                  7         71

Cut-point conditions as extracted (assignment to genes not recoverable):
P1: > -0.66, > -0.89
P2: > -0.66, > -0.78
P3: > -0.8, > -0.33
N1: ≤ -0.66
N2: ≤ -0.56, ≤ -0.18
N3: ≤ -0.11
Unassigned: > 0.11
WI data:
Each DLBCL case satisfies one of the patterns P1, P2, P3
Each FL case satisfies one of the patterns N1, N2, N3
p53 combinatorial biomarker
[Figure: bar chart of % cases (DLBCL, FL) by number of over-expressed genes in DLBCL vs. FL among p53, PLK1, CDK2 (<= 1 vs >= 2); axis from 0 to 90]
At least two genes over-expressed: 79% DLBCL & 23% FL cases (3.4 fold)
At most one gene over-expressed: 77% FL & 21% DLBCL cases (3.7 fold)
Each individual gene: over-expressed in about 40-70% DLBCL & 20-40% FL (specificity 50-60%, sensitivity 60-70%)
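The combinatorial rule can be sketched as a simple count: call a case DLBCL when at least two of the three genes are over-expressed. The per-gene cutoffs of 0.0, the gene key `TP53` for p53, and the sample values are hypothetical; the slide does not state how over-expression is thresholded. The fold values are recomputed from the percentages on the slide.

```python
from typing import Dict

GENES = ("TP53", "PLK1", "CDK2")


def n_overexpressed(sample: Dict[str, float], cutoffs: Dict[str, float]) -> int:
    """Number of the three genes whose expression exceeds its cutoff."""
    return sum(sample[g] > cutoffs[g] for g in GENES)


def predict(sample: Dict[str, float], cutoffs: Dict[str, float]) -> str:
    """At least two over-expressed genes -> DLBCL, otherwise FL."""
    return "DLBCL" if n_overexpressed(sample, cutoffs) >= 2 else "FL"


def fold(pct_a: float, pct_b: float) -> float:
    """Enrichment ratio between two percentages, rounded to one decimal."""
    return round(pct_a / pct_b, 1)


# Hypothetical cutoffs on normalized expression values.
cutoffs = {"TP53": 0.0, "PLK1": 0.0, "CDK2": 0.0}

call_a = predict({"TP53": 0.8, "PLK1": 0.3, "CDK2": -0.4}, cutoffs)   # two genes above cutoff
call_b = predict({"TP53": -0.2, "PLK1": 0.5, "CDK2": -0.1}, cutoffs)  # one gene above cutoff

fold_dlbcl = fold(79, 23)  # ">= 2 genes" group: DLBCL vs FL
fold_fl = fold(77, 21)     # "<= 1 gene" group: FL vs DLBCL
```

Combining three individually weak markers this way gives a sharper separation than any single gene, which is the point of the slide.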
What are these genes?





Plk1 (stpk13): polo-like kinase, serine/threonine protein kinase 13, M-phase specific
cell transformation, neoplastic; drives quiescent cells into mitosis
over-expressed in various human tumors
Takai et al., Oncogene, 2005: plk1 potential target for cancer therapy, new prognostic marker for cancer
Mito et al., Leuk Lymph, 2005: plk1 biomarker for DLBCL

Cdk2 (p33): cyclin-dependent kinase: G2/M transition of mitotic cell cycle, interacts with cyclins A, B3, D, E

P53 tumor suppressor gene (Levine 1982)
Conclusions

Pattern-based meta-classifier is robust against noise

Good prediction of FL → DLBCL

Biology based analysis also possible

Yields useful biomarker

Should study biologically motivated sets of genes → build pathways
Thank you for your attention !