Logical Analysis of Data and Biomedical Applications

Logical Analysis of Diffuse Large B Cell Lymphoma
Gabriela Alexe (1), Sorin Alexe (1), David Axelrod (2), Peter Hammer (1), and David Weissmann (3)
RUTCOR (1) and Department of Genetics (2), Rutgers University; Robert Wood Johnson Medical School (3)
This Talk
• Lymphoma
• Gene Expression Level Analysis
• cDNA Microarray
• Applied to Diffuse Large B-Cell Lymphoma
• Logical Analysis of Data
• Discretization/Binarization
• Support Sets
• Pattern Generation
• Theories and Models
• Prediction
Lymphoma
• Cancer of lymphoid cells
• Clonal
• Uncontrolled growth
• Metastasis
• Lymphoma
• Diagnosis
• Grade
Diffuse Large B Cell Lymphoma (DLBCL)
• 31% of non-Hodgkin lymphoma cases
• 50% long-term, disease-free survival
• Clinical variability
• Prognosis & therapy
• IPI (International Prognostic Index)
• Morphology
• Gene expression
Diffuse Large B Cell Lymphoma
Spleen with Diffuse Large B Cell Lymphoma
Gene Expression Level Analysis
DNA-RNA Hybridization
Gene Expression Profiling
Figure labels: tumor; standard; cDNA microarray analysis.
DLBCL & cDNA Microarray Analysis
• Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Alizadeh et al., Nature, Vol 403, pp 503-511 (2000)
• cDNA microarray data -> unsupervised hierarchical agglomerative clustering
• Germinal center signature: 76% survival at 5 years
• Activated B cell signature: 16% survival at 5 years
DLBCL Clustering
Each case (patient) is a point in N-dimensional space, where N = # of genes.
Figure labels: germinal center genes; activated B cell genes.
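The clustering idea can be sketched in a few lines: each patient is a point whose coordinates are gene expression levels, and the two closest groups are merged repeatedly. This is only an illustration. The Euclidean distance, the average linkage, the function names, and the five made-up patients are all assumptions, not details taken from the talk or from Alizadeh et al.

```python
from itertools import combinations
from math import dist

def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)

# Hypothetical expression levels: 5 patients x 3 genes (not data from the talk).
patients = [
    (0.9, 0.8, 0.1),
    (1.0, 0.7, 0.2),
    (0.1, 0.2, 0.9),
    (0.2, 0.1, 1.0),
    (0.8, 0.9, 0.0),
]
groups = agglomerate(patients, n_clusters=2)
print(groups)  # patient indices grouped into two clusters
```

With real microarray data N is in the thousands, so a library implementation would be the practical choice; the loop above is only meant to show what "agglomerative" means.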
DLBCL Survival by Type
Supervised Learning Classification of DLBCL
• Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning, Shipp et al., Nature Medicine, vol 8, pp 68-74 (2002)
• Prognosis of DLBCL
• Highly correlated genes -> weighted voting algorithm
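A weighted voting classifier can be sketched as follows. The slide does not give the formula, so this assumes the commonly described scheme: each gene gets a signal-to-noise weight (difference of class means over the sum of class standard deviations) and votes according to how far the new sample lies from the midpoint of the two class means. The gene-selection step ("highly correlated genes") is omitted, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev

def train_weighted_voting(class_a, class_b):
    """Per-gene (weight, boundary): signal-to-noise weight and midpoint of class means."""
    model = []
    for gene in range(len(class_a[0])):
        a = [sample[gene] for sample in class_a]
        b = [sample[gene] for sample in class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        boundary = (mean(a) + mean(b)) / 2
        model.append((weight, boundary))
    return model

def predict(model, sample):
    """Sum the per-gene votes; a positive total favors class A."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return "A" if total > 0 else "B"

# Hypothetical training data: 3 samples per class, 2 genes each.
class_a = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5)]
class_b = [(1.0, 4.0), (2.0, 5.0), (1.5, 6.0)]
model = train_weighted_voting(class_a, class_b)
print(predict(model, (6.5, 1.0)), predict(model, (1.0, 5.5)))
```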
Logical Analysis of Data
Logical Analysis of Data (LAD)
• Non-statistical method based on:
• Combinatorics
• Optimization
• Logic
• Based on dataset of cases/patients
• LAD learns patterns characteristic of classes
• Subsets of patients who are +/- for a condition
• Collections of patterns are extensible
• Predictions
The Problem: Approximation of Hidden Function
Figure labels: dataset; hidden function; LAD approximation.
Main Components of LAD
• Discretization/Binarization
• Support Sets
• Pattern Generation
• Theories and Models
• Prediction
Discretization
Figure panels: separating cutpoints; minimum set of separating cutpoints.
Cutpoints and Support Set
• Minimizing the number of cutpoints is NP-hard
• Numerous powerful methods exist
• Support set:
• Cutpoints define a grid in which ideally no cell contains both + and – cases
• Cutpoints simplify data and decrease noise
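The grid idea above can be made concrete with a small sketch. It proposes candidate cutpoints as midpoints between neighboring expression values whose cases differ in class, turns each case into 0/1 indicators, and checks that no grid cell mixes + and – cases. The function names and the four toy cases are hypothetical, and no attempt is made to find a minimum set of cutpoints (the NP-hard part).

```python
def candidate_cutpoints(values, labels):
    """Midpoints between consecutive distinct values whose cases differ in class."""
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, set()).add(y)
    ordered = sorted(by_value)
    cuts = []
    for lo, hi in zip(ordered, ordered[1:]):
        if by_value[lo] != by_value[hi] or len(by_value[lo]) > 1:
            cuts.append((lo + hi) / 2)
    return cuts

def binarize(case, cutpoints):
    """One 0/1 indicator per (gene index, cutpoint): 1 if the level exceeds the cutpoint."""
    return tuple(int(case[g] > c) for g, c in cutpoints)

def is_separating(cases, labels, cutpoints):
    """True if no grid cell contains both a positive and a negative case."""
    seen = {}
    for case, y in zip(cases, labels):
        cell = binarize(case, cutpoints)
        if seen.setdefault(cell, y) != y:
            return False
    return True

# Hypothetical data: 4 cases x 2 genes; True marks a positive case.
cases = [(1.0, 5.0), (2.0, 6.0), (3.0, 1.0), (4.0, 2.0)]
labels = [True, True, False, False]
cuts_gene0 = candidate_cutpoints([c[0] for c in cases], labels)
print(cuts_gene0, is_separating(cases, labels, [(0, cuts_gene0[0])]))
```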
Patterns
• Examples:
• Gene A > 34 & gene B < 24 & gene C < 2
• Positive and negative patterns
• Pattern parameters:
• Degree (# of conditions)
• Prevalence (# of +/- cases that satisfy it)
• Homogeneity (proportion of +/- cases among those it covers)
• Best: low degree, large prevalence, high homogeneity
• Patterns are extensible!
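The three pattern parameters can be computed directly once a pattern is written as a list of conditions. The sketch below encodes the slide's example (gene A > 34 & gene B < 24 & gene C < 2) and evaluates it as a positive pattern on six invented cases. Prevalence is expressed here as a fraction of the class rather than a raw count, and the data structure and function names are assumptions made for illustration.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt}

# The slide's example pattern: gene A > 34 & gene B < 24 & gene C < 2.
pattern = [("A", ">", 34), ("B", "<", 24), ("C", "<", 2)]

def covers(pattern, case):
    """A case satisfies the pattern if it meets every condition."""
    return all(OPS[op](case[gene], cut) for gene, op, cut in pattern)

def degree(pattern):
    """Number of conditions in the pattern."""
    return len(pattern)

def prevalence(pattern, cases, labels, target=True):
    """Share of the target class's cases that satisfy the pattern."""
    in_class = [c for c, y in zip(cases, labels) if y == target]
    return sum(covers(pattern, c) for c in in_class) / len(in_class)

def homogeneity(pattern, cases, labels, target=True):
    """Share of covered cases that belong to the target class (needs >= 1 covered case)."""
    covered = [y for c, y in zip(cases, labels) if covers(pattern, c)]
    return sum(y == target for y in covered) / len(covered)

# Hypothetical cases (not data from the talk); True marks a positive case.
cases = [
    {"A": 40, "B": 10, "C": 1},
    {"A": 50, "B": 20, "C": 0},
    {"A": 30, "B": 10, "C": 1},
    {"A": 45, "B": 30, "C": 1},
    {"A": 36, "B": 5, "C": 1},
    {"A": 10, "B": 50, "C": 5},
]
labels = [True, True, True, True, False, False]
print(degree(pattern), prevalence(pattern, cases, labels), homogeneity(pattern, cases, labels))
```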
Pattern Generation
• Generate patterns based on learning set
• Stipulate control parameters. For example:
• Degree 4
• + & - prevalences >= 70%
• + & - homogeneities = 100%
• All 75 patterns generated in 1.2 seconds on a 1 GHz Pentium IV PC
• Evaluate set:
• Average # of patterns covering each observation
• Accuracy applied to evaluation set
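To show how such control parameters drive generation, here is a brute-force sketch over already binarized data: it enumerates every conjunction of up to `max_degree` conditions and keeps those with 100% homogeneity and at least `min_prevalence` coverage of the target class. This exhaustive enumeration is an assumption chosen for clarity, not the talk's actual generation procedure, and the names and toy data are hypothetical.

```python
from itertools import combinations, product

def generate_patterns(cases, labels, target, max_degree, min_prevalence):
    """Enumerate conjunctions of (attribute, 0/1 value) conditions that cover no case
    of the opposite class and at least min_prevalence of the target class."""
    own = [c for c, y in zip(cases, labels) if y == target]
    other = [c for c, y in zip(cases, labels) if y != target]
    found = []
    for d in range(1, max_degree + 1):
        for attrs in combinations(range(len(cases[0])), d):
            for values in product((0, 1), repeat=d):
                def covers(case):
                    return all(case[a] == v for a, v in zip(attrs, values))
                if any(covers(c) for c in other):
                    continue  # homogeneity below 100%
                if sum(covers(c) for c in own) / len(own) >= min_prevalence:
                    found.append(tuple(zip(attrs, values)))
    return found

# Hypothetical binarized learning set: 3 positive and 3 negative cases, 3 attributes.
cases = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
labels = [True, True, True, False, False, False]
positive_patterns = generate_patterns(cases, labels, True, max_degree=2, min_prevalence=0.7)
print(positive_patterns)
```

The number of candidate conjunctions grows quickly with the degree bound and the number of attributes, which is why a bound such as degree 4 matters in practice.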
Patterns: Illustration
Figure labels: positive pattern; negative pattern.
Theories: Approximations of the 2 Regions
A theory is a set of positive (or negative) patterns such
that every positive (or negative) case is covered.
Figure labels: positive theory; negative theory.
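Given a pool of positive patterns, one way to assemble a positive theory is a greedy cover: keep picking the pattern that covers the most still-uncovered positive cases until none remain. The greedy choice is an assumption made for this sketch (the talk only states the covering requirement), and the patterns and cases below are invented.

```python
def covers(pattern, case):
    """A pattern is a tuple of (attribute index, required 0/1 value) pairs."""
    return all(case[a] == v for a, v in pattern)

def greedy_theory(patterns, cases):
    """Select patterns until every case is covered by at least one of them."""
    uncovered = set(range(len(cases)))
    theory = []
    while uncovered:
        best = max(patterns,
                   key=lambda p: sum(covers(p, cases[i]) for i in uncovered))
        newly = {i for i in uncovered if covers(best, cases[i])}
        if not newly:
            raise ValueError("some case is covered by no pattern")
        theory.append(best)
        uncovered -= newly
    return theory

# Hypothetical binarized positive cases and candidate positive patterns.
positive_cases = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1)]
positive_patterns = [((0, 1),), ((2, 1),), ((1, 1),)]
theory = greedy_theory(positive_patterns, positive_cases)
print(theory)
```

A negative theory would be built the same way from negative patterns and negative cases.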
Models
• A model is a pair: one positive theory and one negative theory
• A good model:
• Small number of features (genes)
• Patterns are high quality
• Low degrees
• High prevalences
• High homogeneities
• Number of patterns is small
• Maximize their biologic interpretability
Theories and Models
Figure labels: positive theory; negative theory; model; positive area; negative area; discordant area; unexplained area.
LAD Prediction
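• A new case: a set of gene expression levels
• Satisfies some positive patterns & no negative patterns? Classify as positive
• Satisfies some negative patterns & no positive patterns? Classify as negative
• Satisfies some of both? Compare which kind it satisfies more of
• Satisfies no pattern (rare): left unclassified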
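These prediction rules can be sketched as a simple vote. The scoring used here, the fraction of positive patterns satisfied minus the fraction of negative patterns satisfied, is one plausible reading of "which more" and is an assumption, as are the function names and the toy patterns.

```python
def covers(pattern, case):
    """A pattern is a tuple of (attribute index, required 0/1 value) pairs."""
    return all(case[a] == v for a, v in pattern)

def classify(case, positive_patterns, negative_patterns):
    """Compare the share of positive and negative patterns the case satisfies."""
    pos = sum(covers(p, case) for p in positive_patterns)
    neg = sum(covers(p, case) for p in negative_patterns)
    if pos == 0 and neg == 0:
        return "unclassified"  # satisfies no pattern
    score = pos / len(positive_patterns) - neg / len(negative_patterns)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unclassified"  # exact tie between the two sides

# Hypothetical model on three binarized attributes.
positive_patterns = [((0, 1),), ((1, 1), (2, 1))]
negative_patterns = [((0, 0), (1, 0)), ((2, 0),)]
for case in [(1, 1, 1), (0, 0, 0), (1, 0, 0)]:
    print(case, classify(case, positive_patterns, negative_patterns))
```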
8 Gene Classification Model
Genes in the model (gene index, accession #: description)
• 6642, U90543_at: Butyrophilin (BTF1)
• 6992, U46744_at: Dystrobrevin-alpha mRNA
• 3890, U87269_at: P120E4F transcription factor mRNA
• 5383, U12767_at: Mitogen induced nuclear orphan receptor (MINOR) mRNA
• 3674, U73167_cds5_at: description garbled in source; includes "interferon-related protein SM15 (U09585)"
• 2004, M37763_at: Neurotrophin-3 (NT-3) gene
• 1692, M12625_at: Lecithin-cholesterol acyltransferase
• 2280, M83651_at: Beta-1,4 N-acetylgalactosaminyltransferase
The fragments "final exon", "5' and 3' flanking sequence", and "partial DNA sequences of human" also appear in the description row but cannot be assigned to a gene.

Pattern prevalence (%)
Pattern   Training set            Test set
          Positive   Negative     Positive   Negative
P1        72.22      0.00         62.50      30.00
P2        72.22      0.00         50.00      20.00
P3        72.22      0.00         62.50      10.00
P4        72.22      0.00         50.00      20.00
P5        61.11      0.00         62.50      20.00
P6        61.11      0.00         50.00      10.00
P7        55.56      0.00         25.00      0.00
P8        55.56      0.00         50.00      20.00
P9        55.56      0.00         50.00      30.00
N1        0.00       72.73        12.50      70.00
N2        0.00       68.18        12.50      50.00
N3        0.00       63.64        12.50      40.00
N4        0.00       63.64        50.00      70.00
N5        0.00       63.64        0.00       50.00
N6        0.00       59.09        0.00       40.00

Each pattern is a conjunction of threshold conditions on these genes (for example, a level > 0.49). The assignment of thresholds to genes is not recoverable from the source.
Accuracy of Prognosis
Accuracy of prognosis (%)
                              Training set                 Test set
Method                        Sensitivity   Specificity    Sensitivity   Specificity
Logistic Regression           88.9          90.9           50            60
Artificial Neural Networks    94.4          90.9           87.5          80
CART                          55.6          90.9           75            60
Fisher Discriminant           94.4          81.8           50            80
LAD                           100           100            87.5          90
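For reference, sensitivity and specificity are computed from confusion counts as sketched below. The counts are assumptions: the talk does not state class sizes, but the reported percentages are consistent with 18 positive and 22 negative training cases and with 8 positive and 10 negative test cases, so those inferred sizes are used here.

```python
def sensitivity(true_pos, false_neg):
    """Percentage of positive cases that are correctly identified."""
    return 100 * true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Percentage of negative cases that are correctly identified."""
    return 100 * true_neg / (true_neg + false_pos)

# Inferred counts (assumed, see lead-in): logistic regression on the training set.
lr_train_sens = round(sensitivity(true_pos=16, false_neg=2), 1)
lr_train_spec = round(specificity(true_neg=20, false_pos=2), 1)

# Inferred counts (assumed): LAD on the test set.
lad_test_sens = round(sensitivity(true_pos=7, false_neg=1), 1)
lad_test_spec = round(specificity(true_neg=9, false_pos=1), 1)

print(lr_train_sens, lr_train_spec, lad_test_sens, lad_test_spec)
```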
Conclusion
• Logical Analysis of Data (LAD): a versatile new classification method, here applied to diagnosis and prognosis of lymphoma.
• LAD genes differ almost entirely from those specified by other studies.
• Genes are not individually correlated with diagnosis or prognosis, but are highly correlated in combinations of as few as two genes.
• Patterns suggest biologic pathways.
• LAD provides highly accurate prognosis of DLBCL.
Contacts
• Gabriela Alexe: galexe@us.ibm.com
• Sorin Alexe: salexe@rutcor.rutgers.edu
• David Axelrod: axelrod@biology.rutgers.edu
• Peter Hammer: hammer@rutcor.rutgers.edu
• David Weissmann: weissmdj@umdnj.edu